147.COMPREHENSIVE GUIDE TO CREATING A UNIQUE SIMULATED DATASET IN SAS AND APPLYING MULTIPLE DATA ANALYSIS AND VISUALIZATION PROCEDURES FOR IN-DEPTH EXPLORATION

/*1. Generating a Simulated Dataset*/

/*We'll create a dataset named simulated_data with 15 observations, including the variables ID, Age, Gender, Income, and Purchase.*/

data simulated_data;

call streaminit(12345); /* Set seed for reproducibility */

do ID = 1 to 15;

Age = rand('Normal', 35, 10); /* Mean 35, Std Dev 10 */

Gender = ifc(rand('Bernoulli', 0.5), 'Male', 'Female');

Income = exp(rand('Normal', log(50000), 0.5)); /* Log-normal distribution */

Purchase = rand('Bernoulli', 0.3); /* 30% purchase rate */

output;

end;

run;

proc print data=simulated_data;

run;

Output:

Obs	ID	Age	Gender	Income	Purchase
1	1	37.6423	Female	85574.42	1
2	2	50.4014	Male	26980.44	0
3	3	35.6573	Male	92263.86	0
4	4	23.5061	Male	39281.31	0
5	5	32.7587	Female	55773.54	0
6	6	30.7998	Female	56554.29	0
7	7	51.6322	Female	21561.87	1
8	8	27.3884	Female	68483.53	0
9	9	34.7741	Female	63162.43	0
10	10	42.4491	Female	37845.31	0
11	11	21.8643	Female	19034.69	0
12	12	45.4506	Female	30360.20	0
13	13	23.3895	Female	39673.56	0
14	14	30.6864	Female	51182.57	1
15	15	24.6003	Female	74293.88	0

Explanation:

call streaminit(12345): Sets the seed for the RAND function so that the same values are generated on every run.

Age: Generated from a normal distribution with a mean of 35 and standard deviation of 10.

Gender: Set to 'Male' when a Bernoulli draw with probability 0.5 returns 1, and to 'Female' otherwise.

Income: Follows a log-normal distribution to simulate income data, which is often right-skewed.

Purchase: A binary variable indicating purchase behavior, with a 30% probability of being 1 (purchase made).
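
/*The explanation above can be sketched as a slightly tidier variant of the same step. This is an optional illustration, not the step that produced the output shown in this post: the dataset name simulated_data_v2, the LENGTH statement, and the rounding are assumptions added here. Because the same seed and the same order of RAND calls are used, the underlying random draws are unchanged; only their presentation differs.*/

```sas
/* Hypothetical variant: written to a separate dataset so the
   outputs shown elsewhere in this post are unaffected. */
data simulated_data_v2;
    length Gender $6;                 /* 'Female' is the longest value */
    call streaminit(12345);           /* same seed as the original step */
    do ID = 1 to 15;
        Age      = round(rand('Normal', 35, 10));               /* whole years */
        Gender   = ifc(rand('Bernoulli', 0.5), 'Male', 'Female');
        Income   = round(exp(rand('Normal', log(50000), 0.5)), 0.01); /* cents */
        Purchase = rand('Bernoulli', 0.3);
        output;
    end;
run;
```

/*Declaring the length first also makes Gender the first variable in the dataset, so the column order differs from simulated_data.*/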

/*2. Exploring the Dataset with PROC CONTENTS*/

/*To understand the structure of the dataset, use PROC CONTENTS:*/

proc contents data=simulated_data;

run;

Output:

The CONTENTS Procedure

Data Set Name	WORK.SIMULATED_DATA	Observations	15
Member Type	DATA	Variables	5
Engine	V9	Indexes	0
Created	14/09/2015 00:09:42	Observation Length	232
Last Modified	14/09/2015 00:09:42	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	WINDOWS_64
Encoding	wlatin1 Western (Windows)

Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	282
Obs in First Data Page	15
Number of Data Set Repairs	0
ExtendObsCounter	YES
Filename	C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_\simulated_data.sas7bdat
Release Created	9.0401M2
Host Created	X64_8HOME

Alphabetic List of Variables and Attributes
#	Variable	Type	Len
2	Age	Num	8
3	Gender	Char	200
1	ID	Num	8
4	Income	Num	8
5	Purchase	Num	8

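/*This procedure reports the dataset's structure: the number of observations and variables, engine and file details, and each variable's name, type, and length. Gender has a length of 200 because the IFC function returns a 200-character value when the receiving variable has no previously declared length.*/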
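
/*By default the variable list is alphabetical, as in the output above. A minimal sketch, assuming you would rather see the variables in the order they were created, adds the VARNUM option:*/

```sas
/* List variables by their position in the dataset (ID, Age, Gender, ...) */
proc contents data=simulated_data varnum;
run;
```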

/*3. Summarizing Data with PROC MEANS*/

/*To obtain descriptive statistics for numerical variables, use PROC MEANS:*/

proc means data=simulated_data mean std min max;

var Age Income;

run;

Output:

The MEANS Procedure

Variable	Mean	Std Dev	Minimum	Maximum
Age	34.2000373	9.7219459	21.8642650	51.6321918
Income	50801.73	22799.38	19034.69	92263.86

/*This output includes the mean, standard deviation, minimum, and maximum for Age and Income.*/
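
/*The same summary can be broken down by group. The sketch below is an optional extension, not part of the original run: it adds a CLASS statement for Gender, requests the count and median as well, and uses MAXDEC=2 to limit the displayed decimals.*/

```sas
/* Descriptive statistics for Age and Income, reported separately by Gender */
proc means data=simulated_data n mean median std min max maxdec=2;
    class Gender;
    var Age Income;
run;
```

/*With only 3 males in this sample, the per-group statistics for males rest on very few observations.*/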

/*4. Frequency Analysis with PROC FREQ*/

/*For categorical variables like Gender and Purchase, use PROC FREQ:*/

proc freq data=simulated_data;

tables Gender Purchase;

run;

Output:

The FREQ Procedure

Gender	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Female	12	80.00	12	80.00
Male	3	20.00	15	100.00

Purchase	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	12	80.00	12	80.00
1	3	20.00	15	100.00

/*This provides frequency counts and percentages for each category.*/
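
/*The one-way tables above can be combined into a two-way table to see how purchases split across genders. This is an illustrative addition rather than part of the original run. Because the cell counts are small, the FISHER option is requested here to obtain Fisher's exact test alongside the table.*/

```sas
/* Cross-tabulate Gender by Purchase; show counts only and add Fisher's exact test */
proc freq data=simulated_data;
    tables Gender*Purchase / norow nocol nopercent fisher;
run;
```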

/*5. Data Visualization with PROC SGPLOT*/

/*Visualizing data helps in understanding distributions and relationships:*/

/*Histogram of Age:*/

proc sgplot data=simulated_data;

histogram Age;

density Age / type=normal;

title 'Age Distribution';

run;

/*Scatter Plot of Age vs. Income:*/

proc sgplot data=simulated_data;

scatter x=Age y=Income / group=Gender;

title 'Age vs. Income by Gender';

run;

/*These plots provide insights into the distribution of Age and the relationship between Age and Income, differentiated by Gender.*/
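
/*A box plot is another way to compare a numeric variable across groups. The following is a minimal sketch added for illustration: it compares Income by Gender, then issues an empty TITLE statement so the plot title does not carry over to later procedure output.*/

```sas
/* Box plot of Income for each Gender */
proc sgplot data=simulated_data;
    vbox Income / category=Gender;
    title 'Income by Gender';
run;

title;   /* clear the title so it is not reused by later steps */
```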

/*6. Data Management with PROC DATASETS*/

/*PROC DATASETS is a powerful procedure for managing SAS datasets:*/

/*Renaming a Variable:*/

proc datasets library=work;

modify simulated_data;

rename Purchase=MadePurchase;

run;

quit;

Log:

Directory

Libref WORK

Engine V9

Physical Name C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_

Filename C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_

Member File

# Name Type Size Last Modified

1 SIMULATED_DATA DATA 131072 08/04/2025 08:46:02

34 modify simulated_data;

35 rename Purchase=MadePurchase;

NOTE: Renaming variable Purchase to MadePurchase.

36 run;

NOTE: MODIFY was successful for WORK.SIMULATED_DATA.DATA.

37 quit;

NOTE: PROCEDURE DATASETS used (Total process time):

real time 0.25 seconds

cpu time 0.06 seconds

/*Appending Data:*/

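/*This requires a second dataset, new_data, with the same structure as simulated_data. If new_data does not exist, as in the run logged below, the step stops with an error:*/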

proc datasets library=work;

append base=simulated_data data=new_data;

run;

quit;

Log:

ERROR: File WORK.NEW_DATA.DATA does not exist.
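
/*To see the append succeed, new_data has to exist first. The sketch below is an assumption-laden illustration: it builds a placeholder new_data by copying the first five rows of simulated_data and shifting ID, so the structure matches whatever variable names simulated_data has at this point. These rows are not new simulated values. Appending them also grows simulated_data to 20 rows, whereas the outputs in the later sections come from the original 15 rows.*/

```sas
/* Placeholder rows purely to exercise APPEND (copied, not newly simulated) */
data new_data;
    set simulated_data(obs=5);
    ID = ID + 15;            /* IDs 16-20 keep ID unique after appending */
run;

proc datasets library=work nolist;
    append base=simulated_data data=new_data;
run;
quit;
```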

/*Deleting a Dataset:*/

proc datasets library=work;

delete simulated_data;

run;

quit;

Log:

Directory

Libref WORK

Engine V9

Physical Name C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_

Filename C:\Users\Lenovo\AppData\Local\Temp\SAS Temporary Files\_TD6844_DESKTOP-QFAA4KV_

Member File

# Name Type Size Last Modified

1 SIMULATED_DATA DATA 196608 08/04/2025 09:00:19

43 delete simulated_data;

44 run;

NOTE: Deleting WORK.SIMULATED_DATA (memtype=DATA).

45 quit;

NOTE: PROCEDURE DATASETS used (Total process time):

real time 0.09 seconds

cpu time 0.01 seconds

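/*These statements manage datasets efficiently without the need for DATA steps. Note that RENAME and DELETE change or remove simulated_data, so if you run this section as written, re-run step 1 to recreate the dataset before continuing: the sections below refer to simulated_data and to the variable Purchase.*/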

/*7. Creating Random Samples with PROC SURVEYSELECT*/

/*To create a random sample of the data:*/

proc surveyselect data=simulated_data

method=srs /* Simple Random Sampling */

sampsize=10 /* Sample size: cannot exceed the 15 observations available */

out=sample_data;

run;

Output:

The SURVEYSELECT Procedure

Selection Method	Simple Random Sampling

Input Data Set	SIMULATED_DATA
Random Number Seed	697625001
Sample Size	10
Selection Probability	0.666667
Sampling Weight	1.5
Output Data Set	SAMPLE_DATA

/*This procedure selects a simple random sample of 10 of the 15 observations in simulated_data, so each observation has a selection probability of 10/15 = 0.667 and a sampling weight of 1.5.*/
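
/*The Random Number Seed in the output above was chosen by SAS, so a rerun would generally select different rows. A minimal sketch, assuming you want the same sample each time, fixes the seed with the SEED= option. The value 12345 is arbitrary, and the sample size of 10 is used because the dataset has only 15 observations.*/

```sas
/* Reproducible simple random sample of 10 observations */
proc surveyselect data=simulated_data
    method=srs
    sampsize=10
    seed=12345          /* arbitrary fixed seed, chosen for illustration */
    out=sample_data;
run;
```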

/*8. Correlation Analysis with PROC CORR*/

/*To examine relationships between numerical variables:*/

proc corr data=simulated_data;

var Age Income;

run;

Output:

The CORR Procedure

2 Variables:	Age Income

Simple Statistics
Variable	N	Mean	Std Dev	Sum	Minimum	Maximum
Age	15	34.20004	9.72195	513.00056	21.86426	51.63219
Income	15	50802	22799	762026	19035	92264

Pearson Correlation Coefficients, N = 15
Prob > |r| under H0: Rho=0
	Age	Income
Age	1.00000	-0.25263
		0.3637
Income	-0.25263	1.00000
	0.3637

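/*This provides the correlation coefficient, which indicates the strength and direction of the linear relationship. Here r = -0.25263 with p = 0.3637, so there is no significant linear association between Age and Income, as expected given that the two variables were simulated independently.*/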
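
/*Income was generated from a right-skewed log-normal distribution, and Pearson correlation is sensitive to skew and outliers. As an optional check that was not part of the original run, the sketch below also requests the rank-based Spearman correlation.*/

```sas
/* Report both Pearson and Spearman correlations for Age and Income */
proc corr data=simulated_data pearson spearman;
    var Age Income;
run;
```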

/*9. Regression Analysis with PROC REG*/

/*To model the relationship between Income (dependent variable) and Age (independent variable):*/

proc reg data=simulated_data;

model Income = Age;

run;

quit;

Output:

The REG Procedure

Model: MODEL1

Dependent Variable: Income

Number of Observations Read	15
Number of Observations Used	15

Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	464447741	464447741	0.89	0.3637
Error	13	6812919577	524070737
Corrected Total	14	7277367317

Root MSE	22893	R-Square	0.0638
Dependent Mean	50802	Adj R-Sq	-0.0082
Coeff Var	45.06262

Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > |t|
Intercept	1	71064	22320	3.18	0.0072
Age	1	-592.44933	629.32899	-0.94	0.3637

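/*This analysis estimates how Income changes with Age. Here the Age slope (about -592 per year of age) is not statistically significant (p = 0.3637) and R-Square is only 0.0638, which is consistent with Age and Income having been generated independently.*/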
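
/*Since Income was simulated as log-normal, its logarithm is normally distributed, which sits more comfortably with the assumptions of linear regression. The following is a sketch of that alternative, not the model fitted above. The dataset name reg_input and the variable name LogIncome are hypothetical names introduced for this example.*/

```sas
/* Add the natural log of Income as a new variable */
data reg_input;
    set simulated_data;
    LogIncome = log(Income);
run;

/* Regress log income on Age */
proc reg data=reg_input;
    model LogIncome = Age;
run;
quit;
```

/*In this model the Age coefficient is interpreted on the log scale, as an approximate proportional change in Income per year of age.*/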

/*10. Logistic Regression with PROC LOGISTIC*/

/*To model the probability of making a purchase based on Age, Gender, and Income:*/

proc logistic data=simulated_data;

class Gender / param=ref;

model Purchase(event='1') = Age Gender Income;

run;

Output:

The LOGISTIC Procedure

Model Information
Data Set	WORK.SIMULATED_DATA
Response Variable	Purchase
Number of Response Levels	2
Model	binary logit
Optimization Technique	Fisher's scoring

Number of Observations Read	15
Number of Observations Used	15

Response Profile
Ordered Value	Purchase	Total Frequency
1	0	12
2	1	3

Probability modeled is Purchase=1.

Class Level Information
Class	Value	Design Variables
Gender	Female	1
	Male	0

Model Convergence Status
Quasi-complete separation of data points detected.

Warning:

The maximum likelihood estimate may not exist.

Warning:

The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

Model Fit Statistics
Criterion	Intercept Only	Intercept and Covariates
AIC	17.012	18.164
SC	17.720	20.996
-2 Log L	15.012	10.164

Testing Global Null Hypothesis: BETA=0
Test	Chi-Square	DF	Pr > ChiSq
Likelihood Ratio	4.8482	3	0.1833
Score	3.0280	3	0.3873
Wald	1.9953	3	0.5734

Type 3 Analysis of Effects
Effect	DF	Wald Chi-Square	Pr > ChiSq
Age	1	1.9758	0.1598
Gender	1	0.0021	0.9633
Income	1	0.9516	0.3293

Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	-24.1933	276.7	0.0076	0.9303
Age		1	0.2054	0.1462	1.9758	0.1598
Gender	Female	1	12.7126	276.5	0.0021	0.9633
Income		1	0.000056	0.000058	0.9516	0.3293

Odds Ratio Estimates
Effect	Point Estimate	95% Wald Confidence Limits
Age	1.228	0.922	1.635
Gender Female vs Male	>999.999	<0.001	>999.999
Income	1.000	1.000	1.000

Association of Predicted Probabilities and Observed Responses
Percent Concordant	80.6	Somers' D	0.611
Percent Discordant	19.4	Gamma	0.611
Percent Tied	0.0	Tau-a	0.210
Pairs	36	c	0.806
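
/*The quasi-complete separation warning above arises because all three males in the sample (IDs 2, 3, and 4) have Purchase = 0, so Gender alone perfectly predicts the outcome for that group. This is why the Gender estimate and its standard error are so large. One common remedy, offered here as a hedged sketch and not as part of the original analysis, is Firth's penalized likelihood, requested with the FIRTH option on the MODEL statement. With only 15 observations and 3 purchases, the estimates should still be read with caution.*/

```sas
/* Logistic regression with Firth's penalized likelihood to cope with separation.
   Assumes the response is still named Purchase (i.e., the rename in section 6
   was not applied, or the dataset was recreated from step 1). */
proc logistic data=simulated_data;
    class Gender / param=ref;
    model Purchase(event='1') = Age Gender Income / firth;
run;
```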
