149.GENERATING THE DATASETS OF CUSTOMERS ,TRANSACTIONS ,PRODUCTS AND REVIEWS AND APPLYING PROCEDURES LIKE PROC MEANS | PROC FREQ ETC..
- Get link
- X
- Other Apps
GENERATING THE DATASETS OF CUSTOMERS ,TRANSACTIONS ,PRODUCTS AND REVIEWS AND APPLYING PROCEDURES LIKE PROC MEANS | PROC FREQ ETC..
/*To explore various data manipulation and analysis techniques using SAS*/
1. Generating the Simulated Dataset
Our dataset will consist of four primary components:
Customers: Details about individuals purchasing products.
Products: Information on items available in the store.
Transactions: Records of purchases made by customers.
Reviews: Feedback provided by customers on products.
1.1 Creating the Customers Dataset
/*Generate a dataset containing 15 customers with attributes such as Customer ID, Name, Age, Gender, and Location*/
data customers;
call streaminit(12345); /* Set seed for reproducibility */
length CustomerID $8 Name $20 Gender $1 Location $15;
array names[10] $20 _temporary_ ('Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Hannah', 'Ian', 'Julia');
array locations[5] $15 _temporary_ ('New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix');
do i = 1 to 15;
CustomerID = cats('C', put(i, z7.));
Name = names[ceil(rand('Uniform') * 10)];
Age = 18 + floor((70 - 18 + 1) * rand('Uniform')); /* Generates a random integer between 18 and 70 */
Gender = ifc(rand('Uniform') < 0.5, 'M', 'F');
Location = locations[ceil(rand('Uniform') * 5)];
output;
end;
drop i;
run;
proc print;run;
Output:
Obs | CustomerID | Name | Gender | Location | Age |
---|---|---|---|---|---|
1 | C0000001 | Frank | F | Phoenix | 70 |
2 | C0000002 | Ian | F | Los Angeles | 32 |
3 | C0000003 | Eva | F | Phoenix | 64 |
4 | C0000004 | Charlie | M | Los Angeles | 67 |
5 | C0000005 | Bob | M | Phoenix | 57 |
6 | C0000006 | Hannah | M | Houston | 70 |
7 | C0000007 | Grace | F | Chicago | 55 |
8 | C0000008 | Julia | M | Houston | 18 |
9 | C0000009 | Hannah | M | Phoenix | 48 |
10 | C0000010 | Julia | F | Los Angeles | 62 |
11 | C0000011 | Ian | M | Los Angeles | 45 |
12 | C0000012 | Hannah | F | Los Angeles | 61 |
13 | C0000013 | Ian | M | Los Angeles | 36 |
14 | C0000014 | Frank | M | Chicago | 37 |
15 | C0000015 | Hannah | M | Houston | 21 |
In this code:
call streaminit(12345); initializes the random number generator for reproducibility.
Arrays names and locations store sample names and locations.
A loop generates 15 customer records with random attributes.
(70 - 18 + 1) * rand('Uniform') scales this number to the range of 0 to 53 (since 70 - 18 + 1 = 53).
1.2 Creating the Products Dataset
/*Next, we'll create a dataset of 20 products with attributes like Product ID, Name, Category, and Price.*/
data products;
call streaminit(54321);
length ProductID $8 ProductName $20 Category $15;
array productNames[10] $20 _temporary_ ('Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Smartwatch', 'Camera', 'Printer', 'Monitor', 'Keyboard', 'Mouse');
array categories[5] $15 _temporary_ ('Electronics', 'Accessories', 'Wearables', 'Office Supplies', 'Peripherals');
do i = 1 to 20;
ProductID = cats('P', put(i, z7.));
ProductName = productNames[ceil(rand('Uniform') * 10)];
Category = categories[ceil(rand('Uniform') * 5)];
Price = round(rand('Uniform') * 900 + 100, 0.01); /* Price between $100 and $1000 */
output;
end;
drop i;
run;
proc print;run;
Output:
Obs | ProductID | ProductName | Category | Price |
---|---|---|---|---|
1 | P0000001 | Smartwatch | Wearables | 800.74 |
2 | P0000002 | Smartphone | Accessories | 719.81 |
3 | P0000003 | Keyboard | Electronics | 934.84 |
4 | P0000004 | Laptop | Electronics | 323.77 |
5 | P0000005 | Smartphone | Wearables | 510.60 |
6 | P0000006 | Printer | Peripherals | 842.60 |
7 | P0000007 | Headphones | Office Supplies | 474.75 |
8 | P0000008 | Headphones | Electronics | 196.35 |
9 | P0000009 | Keyboard | Peripherals | 385.95 |
10 | P0000010 | Keyboard | Wearables | 828.15 |
11 | P0000011 | Smartphone | Office Supplies | 951.82 |
12 | P0000012 | Monitor | Wearables | 252.59 |
13 | P0000013 | Laptop | Peripherals | 193.77 |
14 | P0000014 | Monitor | Electronics | 149.87 |
15 | P0000015 | Keyboard | Electronics | 758.59 |
16 | P0000016 | Smartphone | Accessories | 783.57 |
17 | P0000017 | Mouse | Wearables | 289.14 |
18 | P0000018 | Headphones | Peripherals | 251.49 |
19 | P0000019 | Headphones | Electronics | 369.08 |
20 | P0000020 | Headphones | Electronics | 168.56 |
Here:
Arrays productNames and categories provide sample product names and categories.
Prices are randomly assigned between $100 and $1000.
1.3 Creating the Transactions Dataset
/*We'll simulate 20 transaction records linking customers to products.*/
data transactions;
call streaminit(67890);
length TransactionID $10 CustomerID $8 ProductID $8 TransactionDate Quantity 8 ;
do i = 1 to 20;
TransactionID = cats('T', put(i, z9.));
CustomerID = cats('C', put(ceil(rand('Uniform') * 1000), z7.));
ProductID = cats('P', put(ceil(rand('Uniform') * 20), z7.));
TransactionDate = '01JAN2025'd + floor(rand('Uniform') * 365); /* Random date in 2025 */
Format TransactionDate Date9.;
Quantity = ceil(rand('Uniform') * 5); /* Quantity between 1 and 5 */
output;
end;
drop i;
run;
proc print;run;
Output:
Obs | TransactionID | CustomerID | ProductID | TransactionDate | Quantity |
---|---|---|---|---|---|
1 | T000000001 | C0000066 | P0000010 | 31JUL2025 | 1 |
2 | T000000002 | C0000654 | P0000007 | 17SEP2025 | 3 |
3 | T000000003 | C0000970 | P0000010 | 22OCT2025 | 1 |
4 | T000000004 | C0000267 | P0000005 | 28FEB2025 | 4 |
5 | T000000005 | C0000593 | P0000015 | 18MAR2025 | 1 |
6 | T000000006 | C0000506 | P0000003 | 08OCT2025 | 3 |
7 | T000000007 | C0000791 | P0000011 | 18FEB2025 | 2 |
8 | T000000008 | C0000024 | P0000001 | 15MAR2025 | 5 |
9 | T000000009 | C0000940 | P0000006 | 28JUN2025 | 5 |
10 | T000000010 | C0000677 | P0000005 | 23MAR2025 | 1 |
11 | T000000011 | C0000561 | P0000015 | 23JUN2025 | 5 |
12 | T000000012 | C0000509 | P0000011 | 17FEB2025 | 2 |
13 | T000000013 | C0000112 | P0000003 | 22FEB2025 | 5 |
14 | T000000014 | C0000598 | P0000019 | 31MAY2025 | 3 |
15 | T000000015 | C0000820 | P0000004 | 03SEP2025 | 5 |
16 | T000000016 | C0000024 | P0000020 | 02SEP2025 | 5 |
17 | T000000017 | C0000392 | P0000010 | 09SEP2025 | 4 |
18 | T000000018 | C0000596 | P0000012 | 09AUG2025 | 3 |
19 | T000000019 | C0000256 | P0000020 | 04JUL2025 | 3 |
20 | T000000020 | C0000365 | P0000018 | 19AUG2025 | 4 |
In this dataset:
TransactionDate is assigned a random date in the year 2025.
Quantity represents the number of units purchased.
/*To calculate TotalAmount, we'll merge the transactions dataset with the products dataset to access the Price information.*/
proc sql;
create table transactions as
select t.*, p.Price, (t.Quantity * p.Price) as TotalAmount
from transactions t
left join products p
on t.ProductID = p.ProductID;
quit;
proc print;run;
Output:
Obs | TransactionID | CustomerID | ProductID | TransactionDate | Quantity | Price | TotalAmount |
---|---|---|---|---|---|---|---|
1 | T000000008 | C0000024 | P0000001 | 15MAR2025 | 5 | 800.74 | 4003.70 |
2 | T000000006 | C0000506 | P0000003 | 08OCT2025 | 3 | 934.84 | 2804.52 |
3 | T000000013 | C0000112 | P0000003 | 22FEB2025 | 5 | 934.84 | 4674.20 |
4 | T000000015 | C0000820 | P0000004 | 03SEP2025 | 5 | 323.77 | 1618.85 |
5 | T000000004 | C0000267 | P0000005 | 28FEB2025 | 4 | 510.60 | 2042.40 |
6 | T000000010 | C0000677 | P0000005 | 23MAR2025 | 1 | 510.60 | 510.60 |
7 | T000000009 | C0000940 | P0000006 | 28JUN2025 | 5 | 842.60 | 4213.00 |
8 | T000000002 | C0000654 | P0000007 | 17SEP2025 | 3 | 474.75 | 1424.25 |
9 | T000000003 | C0000970 | P0000010 | 22OCT2025 | 1 | 828.15 | 828.15 |
10 | T000000001 | C0000066 | P0000010 | 31JUL2025 | 1 | 828.15 | 828.15 |
11 | T000000017 | C0000392 | P0000010 | 09SEP2025 | 4 | 828.15 | 3312.60 |
12 | T000000007 | C0000791 | P0000011 | 18FEB2025 | 2 | 951.82 | 1903.64 |
13 | T000000012 | C0000509 | P0000011 | 17FEB2025 | 2 | 951.82 | 1903.64 |
14 | T000000018 | C0000596 | P0000012 | 09AUG2025 | 3 | 252.59 | 757.77 |
15 | T000000011 | C0000561 | P0000015 | 23JUN2025 | 5 | 758.59 | 3792.95 |
16 | T000000005 | C0000593 | P0000015 | 18MAR2025 | 1 | 758.59 | 758.59 |
17 | T000000020 | C0000365 | P0000018 | 19AUG2025 | 4 | 251.49 | 1005.96 |
18 | T000000014 | C0000598 | P0000019 | 31MAY2025 | 3 | 369.08 | 1107.24 |
19 | T000000019 | C0000256 | P0000020 | 04JUL2025 | 3 | 168.56 | 505.68 |
20 | T000000016 | C0000024 | P0000020 | 02SEP2025 | 5 | 168.56 | 842.80 |
1.4 Creating the Reviews Dataset
/*Finally, we'll simulate 20 product reviews from customers.*/
data reviews;
call streaminit(98765);
length ReviewID $10 CustomerID $8 ProductID $8 ReviewDate Rating 8 Comment $50;
array comments[5] $50 _temporary_ ('Excellent product!', 'Very satisfied.', 'Average quality.', 'Not as expected.', 'Would not recommend.');
do i = 1 to 20;
ReviewID = cats('R', put(i, z9.));
CustomerID = cats('C', put(ceil(rand('Uniform') * 1000), z7.));
ProductID = cats('P', put(ceil(rand('Uniform') * 20), z7.));
ReviewDate = '01JAN2025'd + floor(rand('Uniform') * 365);
Format ReviewDate date9.;
Rating = ceil(rand('Uniform') * 5); /* Rating between 1 and 5 */
Comment = comments[ceil(rand('Uniform') * 5)];
output;
end;
drop i;
run;
proc print;run;
Output:
Obs | ReviewID | CustomerID | ProductID | ReviewDate | Rating | Comment |
---|---|---|---|---|---|---|
1 | R000000001 | C0000506 | P0000477 | 23JUL2025 | 1 | Average quality. |
2 | R000000002 | C0000797 | P0000110 | 19MAY2025 | 4 | Would not recommend. |
3 | R000000003 | C0000684 | P0000273 | 27MAR2025 | 2 | Excellent product! |
4 | R000000004 | C0000299 | P0000364 | 31JAN2025 | 3 | Very satisfied. |
5 | R000000005 | C0000382 | P0000414 | 23JUL2025 | 5 | Would not recommend. |
6 | R000000006 | C0000305 | P0000248 | 20APR2025 | 5 | Would not recommend. |
7 | R000000007 | C0000143 | P0000412 | 06JUN2025 | 5 | Average quality. |
8 | R000000008 | C0000250 | P0000022 | 03JUL2025 | 5 | Not as expected. |
9 | R000000009 | C0000674 | P0000290 | 25SEP2025 | 1 | Average quality. |
10 | R000000010 | C0000195 | P0000465 | 31DEC2025 | 3 | Average quality. |
11 | R000000011 | C0000145 | P0000466 | 18MAR2025 | 2 | Not as expected. |
12 | R000000012 | C0000513 | P0000276 | 06FEB2025 | 5 | Average quality. |
13 | R000000013 | C0000835 | P0000428 | 11MAY2025 | 1 | Would not recommend. |
14 | R000000014 | C0000723 | P0000026 | 17NOV2025 | 3 | Not as expected. |
15 | R000000015 | C0000539 | P0000148 | 29MAR2025 | 2 | Excellent product! |
16 | R000000016 | C0000478 | P0000254 | 15JUN2025 | 5 | Would not recommend. |
17 | R000000017 | C0000441 | P0000199 | 26MAY2025 | 5 | Not as expected. |
18 | R000000018 | C0000845 | P0000442 | 20JAN2025 | 1 | Would not recommend. |
19 | R000000019 | C0000886 | P0000047 | 26JAN2025 | 3 | Excellent product! |
20 | R000000020 | C0000558 | P0000010 | 19FEB2025 | 3 | Would not recommend. |
Here:
Rating is a score between 1 and 5.
Comment is selected from a predefined list of feedback statements.
2. Exploring the Dataset with SAS Procedures
/*We can now apply various SAS procedures to analyze and visualize the data.*/
2.1 Viewing Data Structure with PROC CONTENTS
/*To understand the structure of our datasets:*/*/
proc contents data=customers;
run;
Output:
Data Set Name | WORK.CUSTOMERS | Observations | 15 |
---|---|---|---|
Member Type | DATA | Variables | 5 |
Engine | V9 | Indexes | 0 |
Created | 14/09/2015 00:39:39 | Observation Length | 56 |
Last Modified | 14/09/2015 00:39:39 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | WINDOWS_64 | ||
Encoding | wlatin1 Western (Windows) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 1167 |
Obs in First Data Page | 15 |
Number of Data Set Repairs | 0 |
ExtendObsCounter | YES |
Filename | C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD8792_DESKTOP-QFAA4KV_\customers.sas7bdat |
Release Created | 9.0401M2 |
Host Created | X64_8HOME |
Alphabetic List of Variables and Attributes | |||
---|---|---|---|
# | Variable | Type | Len |
5 | Age | Num | 8 |
1 | CustomerID | Char | 8 |
3 | Gender | Char | 1 |
4 | Location | Char | 15 |
2 | Name | Char | 20 |
proc contents data=products;
run;
Output:
Data Set Name | WORK.PRODUCTS | Observations | 20 |
---|---|---|---|
Member Type | DATA | Variables | 4 |
Engine | V9 | Indexes | 0 |
Created | 14/09/2015 00:59:01 | Observation Length | 56 |
Last Modified | 14/09/2015 00:59:01 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | WINDOWS_64 | ||
Encoding | wlatin1 Western (Windows) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 1167 |
Obs in First Data Page | 20 |
Number of Data Set Repairs | 0 |
ExtendObsCounter | YES |
Filename | C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD8792_DESKTOP-QFAA4KV_\products.sas7bdat |
Release Created | 9.0401M2 |
Host Created | X64_8HOME |
Alphabetic List of Variables and Attributes | |||
---|---|---|---|
# | Variable | Type | Len |
3 | Category | Char | 15 |
4 | Price | Num | 8 |
1 | ProductID | Char | 8 |
2 | ProductName | Char | 20 |
proc contents data=transactions;
run;
Output:
Data Set Name | WORK.TRANSACTIONS | Observations | 20 |
---|---|---|---|
Member Type | DATA | Variables | 7 |
Engine | V9 | Indexes | 0 |
Created | 14/09/2015 01:02:52 | Observation Length | 64 |
Last Modified | 14/09/2015 01:02:52 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | YES | |
Label | |||
Data Representation | WINDOWS_64 | ||
Encoding | wlatin1 Western (Windows) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 1021 |
Obs in First Data Page | 20 |
Number of Data Set Repairs | 0 |
ExtendObsCounter | YES |
Filename | C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD8792_DESKTOP-QFAA4KV_\transactions.sas7bdat |
Release Created | 9.0401M2 |
Host Created | X64_8HOME |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Format |
2 | CustomerID | Char | 8 | |
6 | Price | Num | 8 | |
3 | ProductID | Char | 8 | |
5 | Quantity | Num | 8 | |
7 | TotalAmount | Num | 8 | |
4 | TransactionDate | Num | 8 | DATE9. |
1 | TransactionID | Char | 10 |
Sort Information | |
---|---|
Sortedby | ProductID |
Validated | YES |
Character Set | ANSI |
proc contents data=reviews;
run;
Output:
Data Set Name | WORK.REVIEWS | Observations | 20 |
---|---|---|---|
Member Type | DATA | Variables | 6 |
Engine | V9 | Indexes | 0 |
Created | 14/09/2015 01:05:00 | Observation Length | 96 |
Last Modified | 14/09/2015 01:05:00 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | WINDOWS_64 | ||
Encoding | wlatin1 Western (Windows) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 681 |
Obs in First Data Page | 20 |
Number of Data Set Repairs | 0 |
ExtendObsCounter | YES |
Filename | C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD8792_DESKTOP-QFAA4KV_\reviews.sas7bdat |
Release Created | 9.0401M2 |
Host Created | X64_8HOME |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Format |
6 | Comment | Char | 50 | |
2 | CustomerID | Char | 8 | |
3 | ProductID | Char | 8 | |
5 | Rating | Num | 8 | |
4 | ReviewDate | Num | 8 | DATE9. |
1 | ReviewID | Char | 10 |
*/PROC CONTENTS provides metadata about each dataset, including variable names, types, and lengths.*/
2.2 Summarizing Data with PROC MEANS
/*To obtain descriptive statistics:*/
proc means data=customers;
var Age;
run;
Output:
Analysis Variable : Age | ||||
---|---|---|---|---|
N | Mean | Std Dev | Minimum | Maximum |
15 | 49.5333333 | 17.3282044 | 18.0000000 | 70.0000000 |
proc means data=products;
var Price;
run;
Output:
Analysis Variable : Price | ||||
---|---|---|---|---|
N | Mean | Std Dev | Minimum | Maximum |
20 | 509.3020000 | 285.6575398 | 149.8700000 | 951.8200000 |
proc means data=transactions;
var Quantity TotalAmount;
run;
Output:
Variable | N | Mean | Std Dev | Minimum | Maximum | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
/*PROC MEANS calculates statistics like mean, standard deviation, and range for numeric variables.*/
2.3 Frequency Analysis with PROC FREQ
/*To analyze categorical data distributions:*/
proc freq data=customers;
tables Gender Location;
run;
Output:
Gender | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
F | 6 | 40.00 | 6 | 40.00 |
M | 9 | 60.00 | 15 | 100.00 |
Location | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Chicago | 2 | 13.33 | 2 | 13.33 |
Houston | 3 | 20.00 | 5 | 33.33 |
Los Angeles | 6 | 40.00 | 11 | 73.33 |
Phoenix | 4 | 26.67 | 15 | 100.00 |
proc freq data=products;
tables category;
run;
Output:
Category | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Accessories | 2 | 10.00 | 2 | 10.00 |
Electronics | 7 | 35.00 | 9 | 45.00 |
Office Supplies | 2 | 10.00 | 11 | 55.00 |
Peripherals | 4 | 20.00 | 15 | 75.00 |
Wearables | 5 | 25.00 | 20 | 100.00 |
proc freq data=transactions;
tables quantity;
run;
Quantity | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
1 | 4 | 20.00 | 4 | 20.00 |
2 | 2 | 10.00 | 6 | 30.00 |
3 | 5 | 25.00 | 11 | 55.00 |
4 | 3 | 15.00 | 14 | 70.00 |
5 | 6 | 30.00 | 20 | 100.00 |
- Get link
- X
- Other Apps
Comments
Post a Comment