COMPREHENSIVE SAS ANALYSIS OF WEBSITE DATASET | DATA STEP CREATION | PROC CONTENTS FOR METADATA | PROC MEANS AND PROC FREQ FOR STATISTICS | PROC SQL FOR DATA MANIPULATION | PROC SORT FOR RANKING | PROC TRANSPOSE FOR RESHAPING | PROC EXPORT FOR OUTPUT | MACROS FOR AUTOMATION
/*Creating a dataset of different types of websites*/
Step 1: Creating the Raw Dataset
data websites_info;
length Name $20 Type $15 Region $15;
input Website_ID Name $ Type $ Monthly_Visitors Launch_Year Region $;
datalines;
1 Amazon E-commerce 2500 1994 Global
2 Facebook Social_Media 2900 2004 Global
3 CNN News 120 1995 USA
4 Alibaba E-commerce 1100 1999 Asia
5 Twitter Social_Media 550 2006 Global
6 Flipkart E-commerce 350 2007 India
7 BBC News 160 1922 UK
8 LinkedIn Social_Media 800 2003 Global
9 TimesofIndia News 180 1838 India
10 Etsy E-commerce 400 2005 Global
;
run;
proc print;run;
Output:
| Obs | Name | Type | Region | Website_ID | Monthly_Visitors | Launch_Year |
|---|---|---|---|---|---|---|
| 1 | Amazon | E-commerce | Global | 1 | 2500 | 1994 |
| 2 | Social_Media | Global | 2 | 2900 | 2004 | |
| 3 | CNN | News | USA | 3 | 120 | 1995 |
| 4 | Alibaba | E-commerce | Asia | 4 | 1100 | 1999 |
| 5 | Social_Media | Global | 5 | 550 | 2006 | |
| 6 | Flipkart | E-commerce | India | 6 | 350 | 2007 |
| 7 | BBC | News | UK | 7 | 160 | 1922 |
| 8 | Social_Media | Global | 8 | 800 | 2003 | |
| 9 | TimesofIndia | News | India | 9 | 180 | 1838 |
| 10 | Etsy | E-commerce | Global | 10 | 400 | 2005 |
Step 2: Exploring the Dataset
proc contents data=websites_info;
run;
Output:
| Data Set Name | WORK.WEBSITES_INFO | Observations | 10 |
|---|---|---|---|
| Member Type | DATA | Variables | 6 |
| Engine | V9 | Indexes | 0 |
| Created | 14/09/2015 01:39:11 | Observation Length | 80 |
| Last Modified | 14/09/2015 01:39:11 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | NO | |
| Label | |||
| Data Representation | WINDOWS_64 | ||
| Encoding | wlatin1 Western (Windows) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 65536 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 817 |
| Obs in First Data Page | 10 |
| Number of Data Set Repairs | 0 |
| ExtendObsCounter | YES |
| Filename | C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD10316_DESKTOP-QFAA4KV_\websites_info.sas7bdat |
| Release Created | 9.0401M2 |
| Host Created | X64_8HOME |
| Alphabetic List of Variables and Attributes | |||
|---|---|---|---|
| # | Variable | Type | Len |
| 6 | Launch_Year | Num | 8 |
| 5 | Monthly_Visitors | Num | 8 |
| 1 | Name | Char | 20 |
| 3 | Region | Char | 15 |
| 2 | Type | Char | 15 |
| 4 | Website_ID | Num | 8 |
Step 3: Summary Statistics
PROC MEANS
Summarizing continuous variables:
proc means data=websites_info mean median max min std;
var Monthly_Visitors Launch_Year;
run;
Output:
| Variable | Mean | Median | Maximum | Minimum | Std Dev | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
PROC FREQ
To get the frequency of each website type:
proc freq data=websites_info;
tables Type Region;
run;
Output:
| Type | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
|---|---|---|---|---|
| E-commerce | 4 | 40.00 | 4 | 40.00 |
| News | 3 | 30.00 | 7 | 70.00 |
| Social_Media | 3 | 30.00 | 10 | 100.00 |
| Region | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
|---|---|---|---|---|
| Asia | 1 | 10.00 | 1 | 10.00 |
| Global | 5 | 50.00 | 6 | 60.00 |
| India | 2 | 20.00 | 8 | 80.00 |
| UK | 1 | 10.00 | 9 | 90.00 |
| USA | 1 | 10.00 | 10 | 100.00 |
Step 4: Aggregation Using PROC SQL
Total Visitors by Website Type
proc sql;
select Type,
count(*) as Total_Websites,
sum(Monthly_Visitors) as Total_Visitors
from websites_info
group by Type;
quit;
Output:
| Type | Total_Websites | Total_Visitors |
|---|---|---|
| E-commerce | 4 | 4350 |
| News | 3 | 460 |
| Social_Media | 3 | 4250 |
Step 5: Creating New Variables
data websites_enriched;
set websites_info;
Website_Age = year("23may2025"d) - Launch_Year;
run;
proc print;run;
Output:
| Obs | Name | Type | Region | Website_ID | Monthly_Visitors | Launch_Year | Website_Age |
|---|---|---|---|---|---|---|---|
| 1 | Amazon | E-commerce | Global | 1 | 2500 | 1994 | 31 |
| 2 | Social_Media | Global | 2 | 2900 | 2004 | 21 | |
| 3 | CNN | News | USA | 3 | 120 | 1995 | 30 |
| 4 | Alibaba | E-commerce | Asia | 4 | 1100 | 1999 | 26 |
| 5 | Social_Media | Global | 5 | 550 | 2006 | 19 | |
| 6 | Flipkart | E-commerce | India | 6 | 350 | 2007 | 18 |
| 7 | BBC | News | UK | 7 | 160 | 1922 | 103 |
| 8 | Social_Media | Global | 8 | 800 | 2003 | 22 | |
| 9 | TimesofIndia | News | India | 9 | 180 | 1838 | 187 |
| 10 | Etsy | E-commerce | Global | 10 | 400 | 2005 | 20 |
PROC UNIVARIATE
Understanding the distribution of website age:
proc univariate data=websites_enriched;
var Website_Age;
run;
Output:
| Moments | |||
|---|---|---|---|
| N | 10 | Sum Weights | 10 |
| Mean | 47.7 | Sum Observations | 477 |
| Std Deviation | 55.1483857 | Variance | 3041.34444 |
| Skewness | 2.27276154 | Kurtosis | 4.84585541 |
| Uncorrected SS | 50125 | Corrected SS | 27372.1 |
| Coeff Variation | 115.615064 | Std Error Mean | 17.4394508 |
| Basic Statistical Measures | |||
|---|---|---|---|
| Location | Variability | ||
| Mean | 47.70000 | Std Deviation | 55.14839 |
| Median | 24.00000 | Variance | 3041 |
| Mode | . | Range | 169.00000 |
| Interquartile Range | 11.00000 | ||
| Tests for Location: Mu0=0 | ||||
|---|---|---|---|---|
| Test | Statistic | p Value | ||
| Student's t | t | 2.735178 | Pr > |t| | 0.0230 |
| Sign | M | 5 | Pr >= |M| | 0.0020 |
| Signed Rank | S | 27.5 | Pr >= |S| | 0.0020 |
| Quantiles (Definition 5) | |
|---|---|
| Level | Quantile |
| 100% Max | 187.0 |
| 99% | 187.0 |
| 95% | 187.0 |
| 90% | 145.0 |
| 75% Q3 | 31.0 |
| 50% Median | 24.0 |
| 25% Q1 | 20.0 |
| 10% | 18.5 |
| 5% | 18.0 |
| 1% | 18.0 |
| 0% Min | 18.0 |
| Extreme Observations | |||
|---|---|---|---|
| Lowest | Highest | ||
| Value | Obs | Value | Obs |
| 18 | 6 | 26 | 4 |
| 19 | 5 | 30 | 3 |
| 20 | 10 | 31 | 1 |
| 21 | 2 | 103 | 7 |
| 22 | 8 | 187 | 9 |
Step 6: Using MACROS
%macro website_report(webtype);
title "Report for &webtype Websites";
proc sql;
select * from websites_info
where Type="&webtype";
quit;
proc means data=websites_info;
where Type="&webtype";
var Monthly_Visitors;
run;
proc freq data=websites_info;
where Type="&webtype";
tables Region;
run;
title;
%mend website_report;
/* Generate reports */
%website_report(E-commerce);
Output:
| Report for E-commerce Websites |
| Name | Type | Region | Website_ID | Monthly_Visitors | Launch_Year |
|---|---|---|---|---|---|
| Amazon | E-commerce | Global | 1 | 2500 | 1994 |
| Alibaba | E-commerce | Asia | 4 | 1100 | 1999 |
| Flipkart | E-commerce | India | 6 | 350 | 2007 |
| Etsy | E-commerce | Global | 10 | 400 | 2005 |
| Report for E-commerce Websites |
| Analysis Variable : Monthly_Visitors | ||||
|---|---|---|---|---|
| N | Mean | Std Dev | Minimum | Maximum |
| 4 | 1087.50 | 1001.98 | 350.0000000 | 2500.00 |
| Report for E-commerce Websites |
| Region | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
|---|---|---|---|---|
| Asia | 1 | 25.00 | 1 | 25.00 |
| Global | 2 | 50.00 | 3 | 75.00 |
| India | 1 | 25.00 | 4 | 100.00 |
%website_report(Social_Media);
Output:
| Report for Social_Media Websites |
| Name | Type | Region | Website_ID | Monthly_Visitors | Launch_Year |
|---|---|---|---|---|---|
| Social_Media | Global | 2 | 2900 | 2004 | |
| Social_Media | Global | 5 | 550 | 2006 | |
| Social_Media | Global | 8 | 800 | 2003 |
| Report for Social_Media Websites |
| Analysis Variable : Monthly_Visitors | ||||
|---|---|---|---|---|
| N | Mean | Std Dev | Minimum | Maximum |
| 3 | 1416.67 | 1290.67 | 550.0000000 | 2900.00 |
| Report for Social_Media Websites |
| Region | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
|---|---|---|---|---|
| Global | 3 | 100.00 | 3 | 100.00 |
Step 7: Sorting and Filtering
proc sort data=websites_info out=sorted_websites;
by descending Monthly_Visitors;
run;
Filter: Websites Launched After 2000
data post2000;
set websites_info;
if Launch_Year > 2000;
run;
proc print data=post2000;
title "Websites Launched After 2000";
run;
Output:
| Websites Launched After
2000 |
| Obs | Name | Type | Region | Website_ID | Monthly_Visitors | Launch_Year |
|---|---|---|---|---|---|---|
| 1 | Social_Media | Global | 2 | 2900 | 2004 | |
| 2 | Social_Media | Global | 5 | 550 | 2006 | |
| 3 | Flipkart | E-commerce | India | 6 | 350 | 2007 |
| 4 | Social_Media | Global | 8 | 800 | 2003 | |
| 5 | Etsy | E-commerce | Global | 10 | 400 | 2005 |
Step 8: Classifying Based on Visitors
data websites_classified;
set websites_info;
length Visitor_Category $15.;
if Monthly_Visitors >= 1000 then Visitor_Category = "Very High";
else if Monthly_Visitors >= 500 then Visitor_Category = "High";
else if Monthly_Visitors >= 200 then Visitor_Category = "Medium";
else Visitor_Category = "Low";
run;
proc print;run;
Output:
| Obs | Name | Type | Region | Website_ID | Monthly_Visitors | Launch_Year | Visitor_Category |
|---|---|---|---|---|---|---|---|
| 1 | Amazon | E-commerce | Global | 1 | 2500 | 1994 | Very High |
| 2 | Social_Media | Global | 2 | 2900 | 2004 | Very High | |
| 3 | CNN | News | USA | 3 | 120 | 1995 | Low |
| 4 | Alibaba | E-commerce | Asia | 4 | 1100 | 1999 | Very High |
| 5 | Social_Media | Global | 5 | 550 | 2006 | High | |
| 6 | Flipkart | E-commerce | India | 6 | 350 | 2007 | Medium |
| 7 | BBC | News | UK | 7 | 160 | 1922 | Low |
| 8 | Social_Media | Global | 8 | 800 | 2003 | High | |
| 9 | TimesofIndia | News | India | 9 | 180 | 1838 | Low |
| 10 | Etsy | E-commerce | Global | 10 | 400 | 2005 | Medium |
proc freq data=websites_classified;
tables Visitor_Category;
run;
Output:
| Visitor_Category | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
|---|---|---|---|---|
| High | 2 | 20.00 | 2 | 20.00 |
| Low | 3 | 30.00 | 5 | 50.00 |
| Medium | 2 | 20.00 | 7 | 70.00 |
| Very High | 3 | 30.00 | 10 | 100.00 |
Step 9: Using PROC TRANSPOSE
proc transpose data=websites_info out=transposed_visitors;
by Region notsorted;
id Name;
var Monthly_Visitors;
run;
proc print;run;
Output:
| Obs | Region | _NAME_ | Amazon | CNN | Alibaba | Flipkart | BBC | TimesofIndia | Etsy | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Global | Monthly_Visitors | 2500 | 2900 | . | . | . | . | . | . | . | . |
| 2 | USA | Monthly_Visitors | . | . | 120 | . | . | . | . | . | . | . |
| 3 | Asia | Monthly_Visitors | . | . | . | 1100 | . | . | . | . | . | . |
| 4 | Global | Monthly_Visitors | . | . | . | . | 550 | . | . | . | . | . |
| 5 | India | Monthly_Visitors | . | . | . | . | . | 350 | . | . | . | . |
| 6 | UK | Monthly_Visitors | . | . | . | . | . | . | 160 | . | . | . |
| 7 | Global | Monthly_Visitors | . | . | . | . | . | . | . | 800 | . | . |
| 8 | India | Monthly_Visitors | . | . | . | . | . | . | . | . | 180 | . |
| 9 | Global | Monthly_Visitors | . | . | . | . | . | . | . | . | . | 400 |
Step 10: Export the Final Dataset
proc export data=websites_classified
outfile="/folders/myfolders/websites.csv"
dbms=csv
replace;
run;
Log:
NOTE: "/folders/myfolders/websites.csv" file was successfully created.
NOTE: PROCEDURE EXPORT used (Total process time):
real time 1.51 seconds
cpu time 0.23 seconds
Step 11: Creating a Macro for Age Bracket Summary
%macro age_summary(minage, maxage);
proc sql;
select Name, Website_Age
from websites_enriched
where Website_Age between &minage and &maxage;
quit;
%mend age_summary;
%age_summary(10, 20);
Output:
| Name | Website_Age |
|---|---|
| 19 | |
| Flipkart | 18 |
| Etsy | 20 |
%age_summary(20, 30);
Output:
| Name | Website_Age |
|---|---|
| 21 | |
| CNN | 30 |
| Alibaba | 26 |
| 22 | |
| Etsy | 20 |
Step 12: Joining Two Tables Using SQL
data website_ratings;
input Website_ID Rating Score;
datalines;
1 5 4.8
2 4 4.3
3 3 3.9
4 4 4.2
5 3 4.0
6 5 4.7
7 4 4.1
8 4 4.4
9 3 3.8
10 4 4.6
;
run;
proc print;run;
Output:
| Obs | Website_ID | Rating | Score |
|---|---|---|---|
| 1 | 1 | 5 | 4.8 |
| 2 | 2 | 4 | 4.3 |
| 3 | 3 | 3 | 3.9 |
| 4 | 4 | 4 | 4.2 |
| 5 | 5 | 3 | 4.0 |
| 6 | 6 | 5 | 4.7 |
| 7 | 7 | 4 | 4.1 |
| 8 | 8 | 4 | 4.4 |
| 9 | 9 | 3 | 3.8 |
| 10 | 10 | 4 | 4.6 |
proc sql;
create table website_full as
select a.*, b.Rating, b.Score
from websites_info a
inner join website_ratings b
on a.Website_ID = b.Website_ID;
quit;
proc sort data=website_full out=ranked;
by descending Score;
run;
proc print data=ranked;
title "Websites Ranked by Score";
run;
| Websites Ranked by Score |
| Obs | Name | Type | Region | Website_ID | Monthly_Visitors | Launch_Year | Rating | Score |
|---|---|---|---|---|---|---|---|---|
| 1 | Amazon | E-commerce | Global | 1 | 2500 | 1994 | 5 | 4.8 |
| 2 | Flipkart | E-commerce | India | 6 | 350 | 2007 | 5 | 4.7 |
| 3 | Etsy | E-commerce | Global | 10 | 400 | 2005 | 4 | 4.6 |
| 4 | Social_Media | Global | 8 | 800 | 2003 | 4 | 4.4 | |
| 5 | Social_Media | Global | 2 | 2900 | 2004 | 4 | 4.3 | |
| 6 | Alibaba | E-commerce | Asia | 4 | 1100 | 1999 | 4 | 4.2 |
| 7 | BBC | News | UK | 7 | 160 | 1922 | 4 | 4.1 |
| 8 | Social_Media | Global | 5 | 550 | 2006 | 3 | 4.0 | |
| 9 | CNN | News | USA | 3 | 120 | 1995 | 3 | 3.9 |
| 10 | TimesofIndia | News | India | 9 | 180 | 1838 | 3 | 3.8 |
No comments:
Post a Comment