Monday, 7 April 2025

147. COMPREHENSIVE GUIDE TO CREATING A UNIQUE SIMULATED DATASET IN SAS AND APPLYING MULTIPLE DATA ANALYSIS AND VISUALIZATION PROCEDURES FOR IN-DEPTH EXPLORATION



/*1. Generating a Simulated Dataset*/

/*We'll create a dataset named simulated_data with 15 observations, including the variables ID, Age, Gender, Income, and Purchase.*/

data simulated_data;

    call streaminit(12345); /* Set seed for reproducibility */

    do ID = 1 to 15;

        Age = rand('Normal', 35, 10); /* Mean 35, Std Dev 10 */

        Gender = ifc(rand('Bernoulli', 0.5), 'Male', 'Female');

        Income = exp(rand('Normal', log(50000), 0.5)); /* Log-normal distribution */

        Purchase = rand('Bernoulli', 0.3); /* 30% purchase rate */

        output;

    end;

run;

proc print data=simulated_data;

run;


Output:

Obs ID Age Gender Income Purchase
1 1 37.6423 Female 85574.42 1
2 2 50.4014 Male 26980.44 0
3 3 35.6573 Male 92263.86 0
4 4 23.5061 Male 39281.31 0
5 5 32.7587 Female 55773.54 0
6 6 30.7998 Female 56554.29 0
7 7 51.6322 Female 21561.87 1
8 8 27.3884 Female 68483.53 0
9 9 34.7741 Female 63162.43 0
10 10 42.4491 Female 37845.31 0
11 11 21.8643 Female 19034.69 0
12 12 45.4506 Female 30360.20 0
13 13 23.3895 Female 39673.56 0
14 14 30.6864 Female 51182.57 1
15 15 24.6003 Female 74293.88 0

Explanation:

call streaminit(12345);: Initializes the random number generator with a seed for reproducibility.

Age: Generated from a normal distribution with a mean of 35 and standard deviation of 10.

Gender: Assigned 'Male' or 'Female' based on a Bernoulli distribution with a 50% probability.

Income: Follows a log-normal distribution to simulate income data, which is often right-skewed.

Purchase: A binary variable indicating purchase behavior, with a 30% probability of being 1 (purchase made).
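
/*A possible variant of the DATA step above, offered as a sketch rather than the post's own code. The macro variable n_obs and the dataset name simulated_big are assumptions introduced here. It scales the same simulation to a larger sample and declares the length of Gender up front, so the variable is not stored with the 200-byte default that IFC produces.*/

```sas
%let n_obs = 1000;                      /* hypothetical sample size */

data simulated_big;                     /* hypothetical dataset name */
    length Gender $6;                   /* 'Female' is the longest value */
    call streaminit(12345);             /* same seed as the step above */
    do ID = 1 to &n_obs;
        Age      = rand('Normal', 35, 10);
        Gender   = ifc(rand('Bernoulli', 0.5), 'Male', 'Female');
        Income   = exp(rand('Normal', log(50000), 0.5));
        Purchase = rand('Bernoulli', 0.3);
        output;
    end;
run;
```

/*With a larger sample, the observed purchase rate and the mean of Age should settle closer to the 30% and 35 specified in the RAND calls.*/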


/*2. Exploring the Dataset with PROC CONTENTS*/

/*To understand the structure of the dataset, use PROC CONTENTS:*/

proc contents data=simulated_data;

run;

Output:

                                                             The CONTENTS Procedure

Data Set Name WORK.SIMULATED_DATA Observations 15
Member Type DATA Variables 5
Engine V9 Indexes 0
Created 14/09/2015 00:09:42 Observation Length 232
Last Modified 14/09/2015 00:09:42 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation WINDOWS_64    
Encoding wlatin1 Western (Windows)    


Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 1
First Data Page 1
Max Obs per Page 282
Obs in First Data Page 15
Number of Data Set Repairs 0
ExtendObsCounter YES
Filename C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_\simulated_data.sas7bdat
Release Created 9.0401M2
Host Created X64_8HOME


Alphabetic List of Variables and Attributes
# Variable Type Len
2 Age Num 8
3 Gender Char 200
1 ID Num 8
4 Income Num 8
5 Purchase Num 8


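/*This procedure provides details about the dataset, including variable names, types, and lengths. Gender has a length of 200 because IFC returns a 200-character value by default when the receiving variable has no declared length.*/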
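
/*If the variable attributes are needed as data rather than as a printed report, one option is the OUT= option of PROC CONTENTS. The sketch below assumes simulated_data exists in WORK; var_info is a hypothetical dataset name.*/

```sas
proc contents data=simulated_data
              out=var_info(keep=name type length varnum)  /* hypothetical output dataset */
              noprint;                                    /* suppress the printed report */
run;

proc sort data=var_info;
    by varnum;                     /* creation order instead of alphabetical order */
run;

proc print data=var_info noobs;
run;
```

/*In the OUT= dataset, TYPE is coded 1 for numeric and 2 for character variables.*/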


/*3. Summarizing Data with PROC MEANS*/

/*To obtain descriptive statistics for numerical variables, use PROC MEANS:*/

proc means data=simulated_data mean std min max;

    var Age Income;

run;

Output:

                                                              The MEANS Procedure

Variable  Mean        Std Dev     Minimum     Maximum
Age       34.2000373  9.7219459   21.8642650  51.6321918
Income    50801.73    22799.38    19034.69    92263.86

/*This output includes the mean, standard deviation, minimum, and maximum for Age and Income.*/
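
/*The same statistics can also be broken down by group. The following is a minimal sketch, assuming simulated_data from step 1; the CLASS statement, the extra statistics, and MAXDEC= are additions that are not part of the original run.*/

```sas
proc means data=simulated_data n mean median std min max maxdec=2;
    class Gender;          /* one row of statistics per Gender level */
    var Age Income;
run;
```

/*The median is worth requesting for Income because a log-normal variable is right-skewed, so its mean sits above its median.*/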


/*4. Frequency Analysis with PROC FREQ*/

/*For categorical variables like Gender and Purchase, use PROC FREQ:*/

proc freq data=simulated_data;

    tables Gender Purchase;

run;

Output:

                                                            The FREQ Procedure

Gender  Frequency  Percent  Cumulative Frequency  Cumulative Percent
Female  12         80.00    12                    80.00
Male    3          20.00    15                    100.00


Purchase  Frequency  Percent  Cumulative Frequency  Cumulative Percent
0         12         80.00    12                    80.00
1         3          20.00    15                    100.00


/*This provides frequency counts and percentages for each category.*/
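
/*To see how the two categorical variables relate to each other, a two-way table can be requested. This sketch assumes Purchase still has its original name (that is, it runs before the rename in step 6).*/

```sas
proc freq data=simulated_data;
    /* Gender in rows, Purchase in columns; CHISQ adds tests of association */
    tables Gender*Purchase / norow nocol chisq;
run;
```

/*For a 2x2 table, the CHISQ option also reports Fisher's exact test, which is the safer choice when cell counts are as small as they are here.*/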


/*5. Data Visualization with PROC SGPLOT*/

/*Visualizing data helps in understanding distributions and relationships:*/

/*Histogram of Age:*/

proc sgplot data=simulated_data;

      histogram Age;

      density Age / type=normal;

      title 'Age Distribution';

run;


/*Scatter Plot of Age vs. Income:*/

proc sgplot data=simulated_data;

      scatter x=Age y=Income / group=Gender;

      title 'Age vs. Income by Gender';

run;

/*These plots provide insights into the distribution of Age and the relationship between Age and Income, differentiated by Gender.*/
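
/*A box plot is another way to compare a numeric variable across groups. The sketch below is an addition to the post and assumes the same simulated_data.*/

```sas
proc sgplot data=simulated_data;
    vbox Income / category=Gender;   /* one box per Gender level */
    title 'Income by Gender';
run;

title;  /* clear the title so it does not carry over to later output */
```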


/*6. Data Management with PROC DATASETS*/

/*PROC DATASETS is a powerful procedure for managing SAS datasets:*/

/*Renaming a Variable:*/

proc datasets library=work;

      modify simulated_data;

      rename Purchase=MadePurchase;

run;

quit;

Log:

                                                               Directory

    Libref         WORK

    Engine         V9

    Physical Name  C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_

    Filename       C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_

                                                            Member     File

                     #                   Name            Type       Size       Last Modified


                   1  SIMULATED_DATA  DATA     131072  08/04/2025 08:46:02

34         modify simulated_data;

35         rename Purchase=MadePurchase;

NOTE: Renaming variable Purchase to MadePurchase.

36   run;


NOTE: MODIFY was successful for WORK.SIMULATED_DATA.DATA.

37   quit;


NOTE: PROCEDURE DATASETS used (Total process time):

      real time           0.25 seconds

      cpu time            0.06 seconds

/*Appending Data:*/

/*Assuming there's another dataset new_data with the same structure (the log below shows the error raised when new_data has not been created):*/

proc datasets library=work;

      append base=simulated_data data=new_data;

run;

quit;

Log:

ERROR: File WORK.NEW_DATA.DATA does not exist.
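
/*One way to make the APPEND step runnable is to create new_data first. The sketch below builds it by copying rows from simulated_data, purely as stand-in data, so the variable names and types are guaranteed to match the base dataset.*/

```sas
data new_data;
    set simulated_data(obs=5);   /* copy the first five rows as stand-in data */
    ID = ID + 15;                /* shift IDs past the existing 1-15 range */
run;

proc datasets library=work nolist;   /* NOLIST suppresses the directory listing */
    append base=simulated_data data=new_data;
run;
quit;
```

/*After this runs, simulated_data holds 20 observations, so rerun the DATA step from step 1 if the 15-observation results shown in this post need to be reproduced.*/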

                                                          

/*Deleting a Dataset:*/

proc datasets library=work;

      delete simulated_data;

run;

quit;

Log:

                                                                 Directory

    Libref         WORK

    Engine         V9

    Physical Name  C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_

    Filename       C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_


                                                      Member            File

                   #     Name                       Type            Size           Last Modified


                   1  SIMULATED_DATA  DATA     196608  08/04/2025 09:00:19

43         delete simulated_data;

44   run;


NOTE: Deleting WORK.SIMULATED_DATA (memtype=DATA).

45   quit;


NOTE: PROCEDURE DATASETS used (Total process time):

      real time           0.09 seconds

      cpu time            0.01 seconds


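/*These statements manage datasets efficiently without the need for data steps. Because the last example deletes simulated_data, rerun the DATA step from step 1 before continuing; the steps below also use the original variable name Purchase.*/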
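
/*To experiment with these statements without touching the analysis dataset, the changes can be applied to a copy. This is a sketch under assumptions: simulated_data exists, and scratch_copy, AgeYears, and the label text are hypothetical names introduced here.*/

```sas
data scratch_copy;                 /* hypothetical name */
    set simulated_data;
run;

proc datasets library=work nolist;
    modify scratch_copy;
        rename Age=AgeYears;                      /* hypothetical new name */
        label Income='Simulated annual income';   /* attach a descriptive label */
        format Income dollar12.2;                 /* display Income as currency */
    run;
    delete scratch_copy;           /* remove the copy; simulated_data is untouched */
run;
quit;
```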


/*7. Creating Random Samples with PROC SURVEYSELECT*/

/*To create a random sample of the data:*/

proc surveyselect data=simulated_data

    method=srs /* Simple Random Sampling */

    sampsize=10 /* Sample size */

    out=sample_data;

run;

Output:

                                                 The SURVEYSELECT Procedure

Selection Method Simple Random Sampling


Input Data Set SIMULATED_DATA
Random Number Seed 697625001
Sample Size 10
Selection Probability 0.666667
Sampling Weight 1.5
Output Data Set SAMPLE_DATA


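/*This procedure selects a simple random sample of 10 of the 15 observations in simulated_data, so each observation has a selection probability of 10/15 = 0.667 and a sampling weight of 1.5.*/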
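
/*The output above shows a random number seed chosen by SAS, so a rerun would draw a different sample. A minimal sketch of a repeatable draw follows; the seed value and the output name sample_seeded are assumptions.*/

```sas
proc surveyselect data=simulated_data
    method=srs                /* simple random sampling */
    sampsize=10
    seed=12345                /* assumed seed value; any positive integer works */
    out=sample_seeded;        /* hypothetical output name */
run;
```

/*Running this step twice with the same seed and input data selects the same 10 observations.*/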


/*8. Correlation Analysis with PROC CORR*/

/*To examine relationships between numerical variables:*/

proc corr data=simulated_data;

    var Age Income;

run;

Output:

                                                                The CORR Procedure

2 Variables: Age Income


Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
Age 15 34.20004 9.72195 513.00056 21.86426 51.63219
Income 15 50802 22799 762026 19035 92264


Pearson Correlation Coefficients, N = 15
Prob > |r| under H0: Rho=0
        Age       Income
Age     1.00000   -0.25263
                  0.3637
Income  -0.25263  1.00000
        0.3637
 

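/*This provides the Pearson correlation coefficient, which indicates the strength and direction of the linear relationship. Here r = -0.25 with p = 0.36, so there is no significant correlation, as expected because Age and Income were generated independently.*/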
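
/*Because Income is right-skewed, a rank-based measure can be a useful companion to Pearson's r. The sketch below, which is an addition to the post, requests both in one call.*/

```sas
proc corr data=simulated_data pearson spearman;
    var Age Income;
run;
```

/*Spearman's coefficient depends only on the ranks of the values, so it is less affected by a few very large incomes.*/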


/*9. Regression Analysis with PROC REG*/

/*To model the relationship between Income (dependent variable) and Age (independent variable):*/

proc reg data=simulated_data;

    model Income = Age;

run;

quit;

Output:

                                                           The REG Procedure
                                                            Model: MODEL1
                                                     Dependent Variable: Income

Number of Observations Read 15
Number of Observations Used 15


Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model            1   464447741       464447741    0.89     0.3637
Error            13  6812919577      524070737
Corrected Total  14  7277367317


Root MSE 22893 R-Square 0.0638
Dependent Mean 50802 Adj R-Sq -0.0082
Coeff Var 45.06262    


Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  1   71064               22320           3.18     0.0072
Age        1   -592.44933          629.32899       -0.94    0.3637

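/*This analysis estimates how Income changes with Age. Here the slope (-592.45 per year of Age) is not significantly different from zero (p = 0.3637) and R-Square is only 0.0638, which is expected because Age and Income were simulated independently.*/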
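
/*Since Income was simulated as log-normal, one common alternative is to regress the log of Income instead. This is a sketch, not the post's own analysis; reg_input and LogIncome are hypothetical names.*/

```sas
data reg_input;                    /* hypothetical name */
    set simulated_data;
    LogIncome = log(Income);       /* natural log; normally distributed by construction */
run;

proc reg data=reg_input;
    model LogIncome = Age;
run;
quit;
```

/*On the log scale the residuals should look closer to normal, which suits the assumptions behind the t and F tests in PROC REG.*/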


/*10. Logistic Regression with PROC LOGISTIC*/

/*To model the probability of making a purchase based on Age, Gender, and Income:*/

proc logistic data=simulated_data;

    class Gender / param=ref;

    model Purchase(event='1') = Age Gender Income;

run;

Output:

                                                                The LOGISTIC Procedure

Model Information
Data Set WORK.SIMULATED_DATA
Response Variable Purchase
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring


Number of Observations Read 15
Number of Observations Used 15


Response Profile
Ordered Value  Purchase  Total Frequency
1              0         12
2              1         3

Probability modeled is Purchase=1.


Class Level Information
Class   Value   Design Variables
Gender  Female  1
        Male    0


Model Convergence Status
Quasi-complete separation of data points detected.



Warning: The maximum likelihood estimate may not exist.


Warning: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

Model Fit Statistics
Criterion  Intercept Only  Intercept and Covariates
AIC        17.012          18.164
SC         17.720          20.996
-2 Log L   15.012          10.164


Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 4.8482 3 0.1833
Score 3.0280 3 0.3873
Wald 1.9953 3 0.5734


Type 3 Analysis of Effects
Effect  DF  Wald Chi-Square  Pr > ChiSq
Age     1   1.9758           0.1598
Gender  1   0.0021           0.9633
Income  1   0.9516           0.3293


Analysis of Maximum Likelihood Estimates
Parameter          DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept          1   -24.1933  276.7           0.0076           0.9303
Age                1   0.2054    0.1462          1.9758           0.1598
Gender     Female  1   12.7126   276.5           0.0021           0.9633
Income             1   0.000056  0.000058        0.9516           0.3293


Odds Ratio Estimates
Effect                 Point Estimate  95% Wald Confidence Limits
Age                    1.228           0.922   1.635
Gender Female vs Male  >999.999        <0.001  >999.999
Income                 1.000           1.000   1.000


Association of Predicted Probabilities and Observed Responses
Percent Concordant  80.6  Somers' D  0.611
Percent Discordant  19.4  Gamma      0.611
Percent Tied        0.0   Tau-a      0.210
Pairs               36    c          0.806


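/*This logistic regression assesses the impact of Age, Gender, and Income on the likelihood of making a purchase. With only 15 observations, all three purchasers are Female, which triggers the quasi-complete separation warning; the Gender estimate (12.7126, standard error 276.5) and its odds ratio are therefore not reliable.*/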
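
/*When separation is reported, one commonly used remedy is Firth's penalized likelihood, available through the FIRTH option of the MODEL statement. The sketch below is an addition to the post and assumes Purchase still has its original name.*/

```sas
proc logistic data=simulated_data;
    class Gender / param=ref;
    /* FIRTH applies penalized likelihood; CLPARM=PL requests
       profile-likelihood confidence limits for the parameters */
    model Purchase(event='1') = Age Gender Income / firth clparm=pl;
run;
```

/*The penalty keeps the Gender estimate finite, although with three events in 15 observations any estimate remains very imprecise.*/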

