COMPREHENSIVE GUIDE TO CREATING A UNIQUE SIMULATED DATASET IN SAS AND APPLYING MULTIPLE DATA ANALYSIS AND VISUALIZATION PROCEDURES FOR IN-DEPTH EXPLORATION
/*1. Generating a Simulated Dataset*/
/*We'll create a dataset named simulated_data with 15 observations, including the variables ID, Age, Gender, Income, and Purchase.*/
data simulated_data;
call streaminit(12345); /* Set seed for reproducibility */
do ID = 1 to 15;
Age = rand('Normal', 35, 10); /* Mean 35, Std Dev 10 */
Gender = ifc(rand('Bernoulli', 0.5), 'Male', 'Female');
Income = exp(rand('Normal', log(50000), 0.5)); /* Log-normal distribution */
Purchase = rand('Bernoulli', 0.3); /* 30% purchase rate */
output;
end;
run;
proc print data=simulated_data;
run;
Output:
| Obs | ID | Age | Gender | Income | Purchase |
|---|---|---|---|---|---|
| 1 | 1 | 37.6423 | Female | 85574.42 | 1 |
| 2 | 2 | 50.4014 | Male | 26980.44 | 0 |
| 3 | 3 | 35.6573 | Male | 92263.86 | 0 |
| 4 | 4 | 23.5061 | Male | 39281.31 | 0 |
| 5 | 5 | 32.7587 | Female | 55773.54 | 0 |
| 6 | 6 | 30.7998 | Female | 56554.29 | 0 |
| 7 | 7 | 51.6322 | Female | 21561.87 | 1 |
| 8 | 8 | 27.3884 | Female | 68483.53 | 0 |
| 9 | 9 | 34.7741 | Female | 63162.43 | 0 |
| 10 | 10 | 42.4491 | Female | 37845.31 | 0 |
| 11 | 11 | 21.8643 | Female | 19034.69 | 0 |
| 12 | 12 | 45.4506 | Female | 30360.20 | 0 |
| 13 | 13 | 23.3895 | Female | 39673.56 | 0 |
| 14 | 14 | 30.6864 | Female | 51182.57 | 1 |
| 15 | 15 | 24.6003 | Female | 74293.88 | 0 |
Explanation:
call streaminit(12345);: Initializes the random number generator with a seed for reproducibility.
Age: Generated from a normal distribution with a mean of 35 and standard deviation of 10.
Gender: Assigned 'Male' or 'Female' based on a Bernoulli distribution with a 50% probability.
Income: Follows a log-normal distribution to simulate income data, which is often right-skewed.
Purchase: A binary variable indicating purchase behavior, with a 30% probability of being 1 (purchase made).
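Because Income is drawn as exp(Normal(log(50000), 0.5)), the value 50,000 is the median of the distribution rather than its mean. A minimal sketch of that arithmetic, using the standard log-normal formulas; the variable names mu, sigma, median_income, and mean_income are illustrative and do not appear in the original code:

```sas
data _null_;
  mu    = log(50000);  /* location parameter passed to rand('Normal') */
  sigma = 0.5;         /* scale parameter passed to rand('Normal') */
  median_income = exp(mu);                 /* log-normal median = exp(mu) */
  mean_income   = exp(mu + sigma**2 / 2);  /* log-normal mean = exp(mu + sigma^2/2) */
  put median_income= 10.2 mean_income= 10.2;
run;
```

The theoretical mean works out to roughly 56,700, about 13% above the median, which is the right skew mentioned above.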
/*2. Exploring the Dataset with PROC CONTENTS*/
/*To understand the structure of the dataset, use PROC CONTENTS:*/
proc contents data=simulated_data;
run;
Output:
| Data Set Name | WORK.SIMULATED_DATA | Observations | 15 |
|---|---|---|---|
| Member Type | DATA | Variables | 5 |
| Engine | V9 | Indexes | 0 |
| Created | 14/09/2015 00:09:42 | Observation Length | 232 |
| Last Modified | 14/09/2015 00:09:42 | Deleted Observations | 0 |
| Protection | | Compressed | NO |
| Data Set Type | | Sorted | NO |
| Label | | | |
| Data Representation | WINDOWS_64 | | |
| Encoding | wlatin1 Western (Windows) | | |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 65536 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 282 |
| Obs in First Data Page | 15 |
| Number of Data Set Repairs | 0 |
| ExtendObsCounter | YES |
| Filename | C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_\simulated_data.sas7bdat |
| Release Created | 9.0401M2 |
| Host Created | X64_8HOME |
| Alphabetic List of Variables and Attributes | |||
|---|---|---|---|
| # | Variable | Type | Len |
| 2 | Age | Num | 8 |
| 3 | Gender | Char | 200 |
| 1 | ID | Num | 8 |
| 4 | Income | Num | 8 |
| 5 | Purchase | Num | 8 |
/*This procedure provides details about the dataset, including variable names, types, and attributes.*/
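The Len of 200 reported for Gender comes from IFC, which returns a 200-byte character value when the receiving variable has no previously declared length. A hedged sketch of one way to avoid that is to declare the length before the assignment. The dataset name simulated_data_compact is hypothetical, and the LENGTH statement also moves Gender to the first position in the variable order:

```sas
data simulated_data_compact;   /* hypothetical name, not from the original post */
  length Gender $6;            /* long enough for 'Female' */
  call streaminit(12345);      /* same seed as section 1 */
  do ID = 1 to 15;
    Age = rand('Normal', 35, 10);
    Gender = ifc(rand('Bernoulli', 0.5), 'Male', 'Female');
    Income = exp(rand('Normal', log(50000), 0.5));
    Purchase = rand('Bernoulli', 0.3);
    output;
  end;
run;

proc contents data=simulated_data_compact;
run;
```

With this change PROC CONTENTS should list Gender as Char with Len 6 instead of 200.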
/*3. Summarizing Data with PROC MEANS*/
/*To obtain descriptive statistics for numerical variables*/
proc means data=simulated_data mean std min max;
var Age Income;
run;
Output:
| Variable | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|
| Age | 34.20004 | 9.72195 | 21.86426 | 51.63219 |
| Income | 50802 | 22799 | 19035 | 92264 |
/*This output includes the mean, standard deviation, minimum, and maximum for Age and Income.*/
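As a small extension, PROC MEANS can also report statistics by group. The sketch below assumes simulated_data from section 1 is in WORK with its original variable names. CLASS and MAXDEC= are standard PROC MEANS features, but the particular statistics requested here are only an illustration:

```sas
proc means data=simulated_data n mean median std min max maxdec=2;
  class Gender;      /* one row of statistics per Gender level */
  var Age Income;
run;
```

The median is worth requesting for Income because the log-normal distribution is right-skewed, so the mean alone can overstate a typical value.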
/*4. Frequency Analysis with PROC FREQ*/
/*For categorical variables like Gender and Purchase, use PROC FREQ:*/
proc freq data=simulated_data;
tables Gender Purchase;
run;
Output:
| Gender | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Female | 12 | 80.00 | 12 | 80.00 |
| Male | 3 | 20.00 | 15 | 100.00 |
| Purchase | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| 0 | 12 | 80.00 | 12 | 80.00 |
| 1 | 3 | 20.00 | 15 | 100.00 |
/*This provides frequency counts and percentages for each category.*/
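The one-way tables do not show how Gender and Purchase combine. A minimal sketch of a two-way table, assuming simulated_data from section 1 is in WORK with its original variable names; the NOROW, NOCOL, and NOPERCENT options are used here only to keep the cell counts easy to read:

```sas
proc freq data=simulated_data;
  /* Gender in rows, Purchase in columns; show counts only */
  tables Gender*Purchase / norow nocol nopercent;
run;
```

In the 15 rows listed in section 1, all three purchases belong to Female rows and no Male row has Purchase = 1. That empty cell becomes relevant for the logistic regression in section 10.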
/*5. Data Visualization with PROC SGPLOT*/
/*Visualizing data helps in understanding distributions and relationships:*/
/*Histogram of Age:*/
proc sgplot data=simulated_data;
histogram Age;
density Age / type=normal;
title 'Age Distribution';
run;
/*Scatter Plot of Age vs. Income:*/
proc sgplot data=simulated_data;
scatter x=Age y=Income / group=Gender;
title 'Age vs. Income by Gender';
run;
/*These plots provide insights into the distribution of Age and the relationship
between Age and Income, differentiated by Gender.*/
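Since Income was simulated from a log-normal distribution, a logarithmic axis can make the scatter plot easier to read. The following is a sketch under the assumption that simulated_data from section 1 is in WORK; the axis label and title text are illustrative choices:

```sas
proc sgplot data=simulated_data;
  scatter x=Age y=Income / group=Gender;
  yaxis type=log logbase=10 label='Income (log scale)';
  title 'Age vs. Income by Gender (log scale)';
run;
title;  /* clear the title so it does not carry over to later output */
```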
/*6. Data Management with PROC DATASETS*/
/*PROC DATASETS is a powerful procedure for managing SAS datasets:*/
/*Renaming a Variable:*/
proc datasets library=work;
modify simulated_data;
rename Purchase=MadePurchase;
run;
quit;
Log:
Directory
Libref WORK
Engine V9
Physical Name C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_
Filename C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_
Member File
# Name Type Size Last Modified
1 SIMULATED_DATA DATA 131072 08/04/2025 08:46:02
34 modify simulated_data;
35 rename Purchase=MadePurchase;
NOTE: Renaming variable Purchase to MadePurchase.
36 run;
NOTE: MODIFY was successful for WORK.SIMULATED_DATA.DATA.
37 quit;
NOTE: PROCEDURE DATASETS used (Total process time):
real time 0.25 seconds
cpu time 0.06 seconds
/*Appending Data:*/
/*Assuming there's another dataset new_data with the same structure:*/
proc datasets library=work;
append base=simulated_data data=new_data;
run;
quit;
Log:
ERROR: File WORK.NEW_DATA.DATA does not exist.
/*Deleting a Dataset:*/
proc datasets library=work;
delete simulated_data;
run;
quit;
Log:
Directory
Libref WORK
Engine V9
Physical Name C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_
Filename C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_
Member File
# Name Type Size Last Modified
1 SIMULATED_DATA DATA 196608 08/04/2025 09:00:19
43 delete simulated_data;
44 run;
NOTE: Deleting WORK.SIMULATED_DATA (memtype=DATA).
45 quit;
NOTE: PROCEDURE DATASETS used (Total process time):
real time 0.09 seconds
cpu time 0.01 seconds
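/*These statements manage datasets without extra DATA steps. Note that the RENAME and DELETE examples change and then remove simulated_data, so rerun the DATA step from section 1 before continuing; the later sections expect the original dataset with the Purchase variable.*/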
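The APPEND log above reports an error because new_data was never created. The sketch below shows one way the append could succeed. It assumes simulated_data from section 1 exists in WORK with its original variable names; simulated_copy, the contents of new_data, and the seed 54321 are all hypothetical and not part of the original post:

```sas
/* Work on a copy so the original simulated_data is left unchanged */
data simulated_copy;
  set simulated_data;
run;

/* Hypothetical new_data: five more rows built the same way as in section 1 */
data new_data;
  call streaminit(54321);   /* arbitrary seed chosen for illustration */
  do ID = 16 to 20;
    Age = rand('Normal', 35, 10);
    Gender = ifc(rand('Bernoulli', 0.5), 'Male', 'Female');
    Income = exp(rand('Normal', log(50000), 0.5));
    Purchase = rand('Bernoulli', 0.3);
    output;
  end;
run;

proc datasets library=work nolist;
  append base=simulated_copy data=new_data;
run;
quit;
```

After the append, simulated_copy should hold 20 observations. If the base dataset has already had Purchase renamed to MadePurchase, the variable names no longer match and the append will not behave this way.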
/*7. Creating Random Samples with PROC SURVEYSELECT*/
/*To create a random sample of the data:*/
proc surveyselect data=simulated_data
method=srs /* Simple Random Sampling */
sampsize=10 /* Sample size */
out=sample_data;
run;
Output:
| Selection Method | Simple Random Sampling |
|---|---|
| Input Data Set | SIMULATED_DATA |
| Random Number Seed | 697625001 |
| Sample Size | 10 |
| Selection Probability | 0.666667 |
| Sampling Weight | 1.5 |
| Output Data Set | SAMPLE_DATA |
/*This procedure selects a simple random sample of 10 of the 15 observations in simulated_data.*/
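The Random Number Seed in the output was generated automatically, so a rerun would normally pick different rows. A minimal sketch of a repeatable draw, assuming simulated_data from section 1 is in WORK; the seed value and the output name sample_data_seeded are arbitrary illustrations:

```sas
proc surveyselect data=simulated_data
    method=srs                 /* simple random sampling */
    sampsize=10                /* 10 of the 15 observations */
    seed=12345                 /* fixed seed so the same rows are selected each run */
    out=sample_data_seeded;    /* hypothetical output dataset name */
run;
```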
/*8. Correlation Analysis with PROC CORR*/
/*To examine relationships between numerical variables:*/
proc corr data=simulated_data;
var Age Income;
run;
Output:
| 2 Variables: | Age Income |
|---|---|
| Simple Statistics | ||||||
|---|---|---|---|---|---|---|
| Variable | N | Mean | Std Dev | Sum | Minimum | Maximum |
| Age | 15 | 34.20004 | 9.72195 | 513.00056 | 21.86426 | 51.63219 |
| Income | 15 | 50802 | 22799 | 762026 | 19035 | 92264 |
| Pearson Correlation Coefficients, N = 15; Prob > \|r\| under H0: Rho=0 | | |
|---|---|---|
| | Age | Income |
| Age | | |
| Income | | |
/*This provides correlation coefficients, indicating the strength and
direction of relationships.*/
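Pearson correlation measures linear association and can be sensitive to the long right tail of Income. As a hedged sketch, assuming simulated_data from section 1 is in WORK, the rank-based Spearman coefficient can be requested alongside it:

```sas
proc corr data=simulated_data pearson spearman;
  var Age Income;
run;
```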
/*9. Regression Analysis with PROC REG*/
/*To model the relationship between Income (dependent variable) and Age (independent variable):*/
proc reg data=simulated_data;
model Income = Age;
run;
quit;
Output:
| Number of Observations Read | 15 |
|---|---|
| Number of Observations Used | 15 |
| Analysis of Variance | |||||
|---|---|---|---|---|---|
| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| Model | 1 | 464447741 | 464447741 | 0.89 | 0.3637 |
| Error | 13 | 6812919577 | 524070737 | ||
| Corrected Total | 14 | 7277367317 | |||
| Root MSE | 22893 | R-Square | 0.0638 |
|---|---|---|---|
| Dependent Mean | 50802 | Adj R-Sq | -0.0082 |
| Coeff Var | 45.06262 |
| Parameter Estimates | |||||
|---|---|---|---|---|---|
| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
| Intercept | 1 | 71064 | 22320 | 3.18 | 0.0072 |
| Age | 1 | -592.44933 | 629.32899 | -0.94 | 0.3637 |
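/*This analysis estimates how Income changes with Age. Here the slope is not statistically significant (Pr > |t| = 0.3637) and R-Square is only 0.0638, which is expected because Age and Income were simulated independently.*/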
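The numbers in the regression output are tied together by simple arithmetic. The sketch below recomputes R-Square from the sums of squares and squares the t statistic for Age, using values copied from the tables above; the variable names are illustrative:

```sas
data _null_;
  ss_model = 464447741;     /* Model sum of squares from the ANOVA table */
  ss_total = 7277367317;    /* Corrected Total sum of squares */
  r_square = ss_model / ss_total;
  /* The slope estimate is negative, so the correlation takes a negative sign */
  r = -sqrt(r_square);
  /* With one predictor, the squared t statistic equals the F value */
  t_squared = (-592.44933 / 629.32899)**2;
  put r_square= 8.4 r= 8.4 t_squared= 8.2;
run;
```

This should reproduce the reported R-Square of about 0.0638 and an F value of about 0.89, and it implies a Pearson correlation between Age and Income of roughly -0.25.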
/*10. Logistic Regression with PROC LOGISTIC*/
/*To model the probability of making a purchase based on Age, Gender, and Income:*/
proc logistic data=simulated_data;
class Gender / param=ref;
model Purchase(event='1') = Age Gender Income;
run;
Output:
| Model Information | |
|---|---|
| Data Set | WORK.SIMULATED_DATA |
| Response Variable | Purchase |
| Number of Response Levels | 2 |
| Model | binary logit |
| Optimization Technique | Fisher's scoring |
| Number of Observations Read | 15 |
|---|---|
| Number of Observations Used | 15 |
| Response Profile | ||
|---|---|---|
| Ordered Value | Purchase | Total Frequency |
| 1 | 0 | 12 |
| 2 | 1 | 3 |
| Probability modeled is Purchase=1. |
| Class Level Information | ||
|---|---|---|
| Class | Value | Design Variables |
| Gender | Female | 1 |
| | Male | 0 |
| Model Convergence Status |
|---|
| Quasi-complete separation of data points detected. |
| Warning: | The maximum likelihood estimate may not exist. |
| Warning: | The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable. |
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 17.012 | 18.164 |
| SC | 17.720 | 20.996 |
| -2 Log L | 15.012 | 10.164 |
| Testing Global Null Hypothesis: BETA=0 | |||
|---|---|---|---|
| Test | Chi-Square | DF | Pr > ChiSq |
| Likelihood Ratio | 4.8482 | 3 | 0.1833 |
| Score | 3.0280 | 3 | 0.3873 |
| Wald | 1.9953 | 3 | 0.5734 |
| Type 3 Analysis of Effects | |||
|---|---|---|---|
| Effect | DF | Wald Chi-Square | Pr > ChiSq |
| Age | 1 | 1.9758 | 0.1598 |
| Gender | 1 | 0.0021 | 0.9633 |
| Income | 1 | 0.9516 | 0.3293 |
| Analysis of Maximum Likelihood Estimates | ||||||
|---|---|---|---|---|---|---|
| Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
| Intercept | | 1 | -24.1933 | 276.7 | 0.0076 | 0.9303 |
| Age | | 1 | 0.2054 | 0.1462 | 1.9758 | 0.1598 |
| Gender | Female | 1 | 12.7126 | 276.5 | 0.0021 | 0.9633 |
| Income | | 1 | 0.000056 | 0.000058 | 0.9516 | 0.3293 |
| Odds Ratio Estimates | |||
|---|---|---|---|
| Effect | Point Estimate | 95% Wald Confidence Limits | |
| Age | 1.228 | 0.922 | 1.635 |
| Gender Female vs Male | >999.999 | <0.001 | >999.999 |
| Income | 1.000 | 1.000 | 1.000 |
| Association of Predicted Probabilities and Observed Responses | | | |
|---|---|---|---|
| Percent Concordant | 80.6 | Somers' D | 0.611 |
| Percent Discordant | 19.4 | Gamma | 0.611 |
| Percent Tied | 0.0 | Tau-a | 0.210 |
| Pairs | 36 | c | 0.806 |
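/*This logistic regression estimates the effect of Age, Gender, and Income
on the probability of making a purchase. Because of the quasi-complete separation warning above, the Gender estimate (standard error 276.5, odds ratio >999.999) should not be interpreted.*/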
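The quasi-complete separation warning arises because, in the 15 rows listed in section 1, no Male observation has Purchase = 1, so the Gender coefficient has no finite maximum likelihood estimate. One commonly used remedy is Firth's penalized likelihood, available through the FIRTH option of the MODEL statement. The sketch below assumes simulated_data from section 1 is in WORK with the original Purchase variable; it is one possible approach, not the method used in the original post:

```sas
proc logistic data=simulated_data;
  class Gender / param=ref;
  /* FIRTH requests penalized likelihood estimation, which keeps estimates finite */
  model Purchase(event='1') = Age Gender Income / firth;
run;
```

With only 15 observations and 3 events, any fitted model remains highly uncertain. Dropping Gender from the model or simulating a larger sample are equally reasonable alternatives.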