SAS Learning Hub: 131.CREATING | ASSESSING | CLEANING | MANIPULATING

CREATING | ASSESSING | CLEANING | MANIPULATING | APPLYING SAS PROCEDURES IN THE DATASET

Introduction

Data processing and analysis are crucial aspects of working with structured datasets. In this document, we will go through an end-to-end SAS example using a single dataset. We will perform data manipulation, assess data quality, clean the data, and apply various SAS procedures to generate meaningful insights. The dataset used here contains patient information, including demographics, vital signs, and lab test results. By the end of this example, we will have a well-structured dataset ready for analysis and interpretation.

Step 1: Creating the Dataset

To begin, we create a sample dataset named PATIENTS, which includes patient demographic details, blood pressure, cholesterol levels, glucose levels, weight, and height.

/* Creating a sample dataset named PATIENTS */
DATA PATIENTS;
    INPUT ID $ Name $ Age Gender $ BP SysBP DiasBP Cholesterol Glucose Weight Height;
    DATALINES;
001 John 45 M 120 80 200 90 80 175
002 Mary 38 F 110 70 180 85 65 160
003 Steve 55 M 130 85 220 100 90 180
004 Anna 29 F 115 75 190 92 55 155
005 Tom 42 M 140 90 250 110 100 185
006 Lisa 50 F 125 85 210 95 70 165
007 Alex 33 M 118 78 175 80 75 170
;
RUN;

PROC PRINT DATA=PATIENTS;
RUN;

OUTPUT:




Obs
ID
Name
Age
Gender
BP
SysBP
DiasBP
Cholesterol
Glucose
Weight
Height


1
001
John
45
M
120
80
200
90
80
175
.

2
002
Mary
38
F
110
70
180
85
65
160
.

3
003
Steve
55
M
130
85
220
100
90
180
.

4
004
Anna
29
F
115
75
190
92
55
155
.

5
005
Tom
42
M
140
90
250
110
100
185
.

6
006
Lisa
50
F
125
85
210
95
70
165
.

7
007
Alex
33
M
118
78
175
80
75
170
.

Obs	ID	Name	Age	Gender	BP	SysBP	DiasBP	Cholesterol	Glucose	Weight	Height
1	001	John	45	M	120	80	200	90	80	175	.
2	002	Mary	38	F	110	70	180	85	65	160	.
3	003	Steve	55	M	130	85	220	100	90	180	.
4	004	Anna	29	F	115	75	190	92	55	155	.
5	005	Tom	42	M	140	90	250	110	100	185	.
6	006	Lisa	50	F	125	85	210	95	70	165	.
7	007	Alex	33	M	118	78	175	80	75	170	.

Explanation:

The DATA step defines the dataset.
INPUT specifies the variables and their data types.
DATALINES includes sample records for seven patients.
PROC PRINT displays the dataset.

Step 2: Data Assessment

Before proceeding with analysis, we assess the dataset structure and identify any missing or incorrect values.

PROC CONTENTS DATA=PATIENTS;
RUN;

OUTPUT:

                             The CONTENTS Procedure










Data Set Name
WORK.PATIENTS
Observations
7

Member Type
DATA
Variables
11

Engine
V9
Indexes
0

Created
14/09/2015 00:11:13
Observation Length
88

Last Modified
14/09/2015 00:11:13
Deleted Observations
0

Protection
 
Compressed
NO

Data Set Type
 
Sorted
NO

Label
 
 
 

Data Representation
WINDOWS_64
 
 

Encoding
wlatin1 Western (Windows)
 
 

PROC MEANS DATA=PATIENTS N NMISS MIN MAX MEAN;
RUN;

Data Set Name	WORK.PATIENTS	Observations	7
Member Type	DATA	Variables	11
Engine	V9	Indexes	0
Created	14/09/2015 00:11:13	Observation Length	88
Last Modified	14/09/2015 00:11:13	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	WINDOWS_64
Encoding	wlatin1 Western (Windows)

OUTPUT:














Variable
N
N Miss
Minimum
Maximum
Mean






Age

BP

SysBP

DiasBP

Cholesterol

Glucose

Weight

Height




7

7

7

7

7

7

7

0




0

0

0

0

0

0

0

7




29.0000000

110.0000000

70.0000000

175.0000000

80.0000000

55.0000000

155.0000000

.




55.0000000

140.0000000

90.0000000

250.0000000

110.0000000

100.0000000

185.0000000

.




41.7142857

122.5714286

80.4285714

203.5714286

93.1428571

76.4285714

170.0000000

.


PROC FREQ DATA=PATIENTS;
    TABLES Gender;
RUN;

OUTPUT:






Gender
Frequency
Percent
Cumulative
Frequency
Cumulative
Percent


F
3
42.86
3
42.86

M
4
57.14
7
100.00

Gender	Frequency	Percent	Cumulative Frequency	Cumulative Percent
F	3	42.86	3	42.86
M	4	57.14	7	100.00

Output Insights:

Dataset contains 7 observations and 10 variables.
No missing values detected.
Gender distribution: 4 males, 3 females.

Step 3: Data Cleaning

Data cleaning ensures the dataset is free from inconsistencies and ready for further processing.

DATA CLEAN_PATIENTS;
    SET PATIENTS;
    /* Standardizing variable names */
    Gender = UPCASE(Gender);
    /* Handling missing values (if any were present) */
    IF Age = . THEN Age = 40; 
    IF Cholesterol = . THEN Cholesterol = MEAN(OF Cholesterol);
    IF Height = . THEN Height = 170; /* Assigning default height if missing */
RUN;

OUTPUT:




Obs
ID
Name
Age
Gender
BP
SysBP
DiasBP
Cholesterol
Glucose
Weight
Height


1
001
John
45
M
120
80
200
90
80
175
170

2
002
Mary
38
F
110
70
180
85
65
160
170

3
003
Steve
55
M
130
85
220
100
90
180
170

4
004
Anna
29
F
115
75
190
92
55
155
170

5
005
Tom
42
M
140
90
250
110
100
185
170

6
006
Lisa
50
F
125
85
210
95
70
165
170

7
007
Alex
33
M
118
78
175
80
75
170
170

PROC UNIVARIATE DATA=CLEAN_PATIENTS;
    VAR Cholesterol Glucose Weight Height;
RUN;

Obs	ID	Name	Age	Gender	BP	SysBP	DiasBP	Cholesterol	Glucose	Weight	Height
1	001	John	45	M	120	80	200	90	80	175	170
2	002	Mary	38	F	110	70	180	85	65	160	170
3	003	Steve	55	M	130	85	220	100	90	180	170
4	004	Anna	29	F	115	75	190	92	55	155	170
5	005	Tom	42	M	140	90	250	110	100	185	170
6	006	Lisa	50	F	125	85	210	95	70	165	170
7	007	Alex	33	M	118	78	175	80	75	170	170

OUTPUT:

                                    The UNIVARIATE Procedure

                                    Variable: Cholesterol









Moments


N
7
Sum Weights
7

Mean
93.1428571
Sum Observations
652

Std Deviation
9.87300311
Variance
97.4761905

Skewness
0.55190398
Kurtosis
0.36135739

Uncorrected SS
61314
Corrected SS
584.857143

Coeff Variation
10.59985
Std Error Mean
3.73164442










Basic Statistical 
Measures

Location
Variability


Mean
93.14286
Std Deviation
9.87300

Median
92.00000
Variance
97.47619

Mode
.
Range
30.00000

 
 
Interquartile Range
15.00000











Tests for Location: Mu0=0

Test
Statistic
p Value


Student's t
t
24.96027
Pr > |t|
<.0001

Sign
M
3.5
Pr >= |M|
0.0156

Signed Rank
S
14
Pr >= |S|
0.0156









Quantiles (Definition 5)

Level
Quantile


100% Max
110

99%
110

95%
110

90%
110

75% Q3
100

50% Median
92

25% Q1
85

10%
80

5%
80

1%
80

0% Min
80










Extreme Observations

Lowest
Highest

Value
Obs
Value
Obs


80
7
90
1

85
2
92
4

90
1
95
6

92
4
100
3

95
6
110
5

Moments
N	7	Sum Weights	7
Mean	93.1428571	Sum Observations	652
Std Deviation	9.87300311	Variance	97.4761905
Skewness	0.55190398	Kurtosis	0.36135739
Uncorrected SS	61314	Corrected SS	584.857143
Coeff Variation	10.59985	Std Error Mean	3.73164442

Basic Statistical Measures
Mean	93.14286	Std Deviation	9.87300
Median	92.00000	Variance	97.47619
Mode	.	Range	30.00000
		Interquartile Range	15.00000

Tests for Location: Mu0=0
Student's t	t	24.96027	Pr > \|t\|	<.0001
Sign	M	3.5	Pr >= \|M\|	0.0156
Signed Rank	S	14	Pr >= \|S\|	0.0156

Quantiles (Definition 5)
100% Max	110
99%	110
95%	110
90%	110
75% Q3	100
50% Median	92
25% Q1	85
10%	80
5%	80
1%	80
0% Min	80

Extreme Observations
80	7	90	1
85	2	92	4
90	1	95	6
92	4	100	3
95	6	110	5

Here I have kept only one variable output of proc univariate...

Explanation:

The UPCASE function ensures uniform formatting for the Gender variable.
Missing values (if any) are replaced with a default or mean value.
PROC UNIVARIATE identifies outliers and data distribution.

Output Insights:

No missing values, but default values set if they were missing.
Cholesterol and glucose levels are within expected range.

Step 4: Data Manipulation

Data manipulation helps in deriving new insights by transforming existing variables.

DATA FINAL_PATIENTS;
    SET CLEAN_PATIENTS;
    /* Calculating BMI */
    BMI = Weight / ((Height/100) ** 2);
    /* Categorizing Cholesterol Levels */
    IF Cholesterol < 200 THEN Chol_Cat = 'Normal';
    ELSE IF Cholesterol >= 200 AND Cholesterol < 240 THEN Chol_Cat = 'Borderline';
    ELSE Chol_Cat = 'High';
RUN;

PROC PRINT DATA=FINAL_PATIENTS;
RUN;

OUTPUT:




Obs
ID
Name
Age
Gender
BP
SysBP
DiasBP
Cholesterol
Glucose
Weight
Height
BMI
Chol_Cat


1
001
John
45
M
120
80
200
90
80
175
170
60.5536
Normal

2
002
Mary
38
F
110
70
180
85
65
160
170
55.3633
Normal

3
003
Steve
55
M
130
85
220
100
90
180
170
62.2837
Normal

4
004
Anna
29
F
115
75
190
92
55
155
170
53.6332
Normal

5
005
Tom
42
M
140
90
250
110
100
185
170
64.0138
Normal

6
006
Lisa
50
F
125
85
210
95
70
165
170
57.0934
Normal

7
007
Alex
33
M
118
78
175
80
75
170
170
58.8235
Normal

Obs	ID	Name	Age	Gender	BP	SysBP	DiasBP	Cholesterol	Glucose	Weight	Height	BMI	Chol_Cat
1	001	John	45	M	120	80	200	90	80	175	170	60.5536	Normal
2	002	Mary	38	F	110	70	180	85	65	160	170	55.3633	Normal
3	003	Steve	55	M	130	85	220	100	90	180	170	62.2837	Normal
4	004	Anna	29	F	115	75	190	92	55	155	170	53.6332	Normal
5	005	Tom	42	M	140	90	250	110	100	185	170	64.0138	Normal
6	006	Lisa	50	F	125	85	210	95	70	165	170	57.0934	Normal
7	007	Alex	33	M	118	78	175	80	75	170	170	58.8235	Normal

Explanation:

BMI (Body Mass Index) is calculated using the formula: BMI = Weight / (Height in meters)^2.
Patients are categorized based on cholesterol levels: Normal, Borderline, and High.

Output Insights:

A new variable BMI is created.
Cholesterol levels are categorized for better analysis.

Step 5: Applying SAS Procedures

Summary Statistics

PROC MEANS DATA=FINAL_PATIENTS;
    VAR Age BP SysBP DiasBP Cholesterol Glucose BMI;
RUN;

OUTPUT:




Variable
N
Mean
Std Dev
Minimum
Maximum






Age

BP

SysBP

DiasBP

Cholesterol

Glucose

BMI




7

7

7

7

7

7

7




41.7142857

122.5714286

80.4285714

203.5714286

93.1428571

76.4285714

58.8235294




9.1962725

10.0640804

6.8033605

25.9349002

9.8730031

15.1971175

3.7374514




29.0000000

110.0000000

70.0000000

175.0000000

80.0000000

55.0000000

53.6332180




55.0000000

140.0000000

90.0000000

250.0000000

110.0000000

100.0000000

64.0138408

Explanation:

PROC MEANS provides statistical summaries like mean, min, max, and standard deviation.

Grouping by Gender

PROC MEANS DATA=FINAL_PATIENTS;
    CLASS Gender;
    VAR Age Cholesterol BMI;
RUN;

OUTPUT:




Gender
N Obs
Variable
N
Mean
Std Dev
Minimum
Maximum


F
3




Age

Cholesterol

BMI




3

3

3




39.0000000

90.6666667

55.3633218




10.5356538

5.1316014

1.7301038




29.0000000

85.0000000

53.6332180




50.0000000

95.0000000

57.0934256

M
4




Age

Cholesterol

BMI




4

4

4




43.7500000

95.0000000

61.4186851




9.0691786

12.9099445

2.2335544




33.0000000

80.0000000

58.8235294




55.0000000

110.0000000

64.0138408

Explanation:

CLASS Gender allows summary statistics to be grouped by gender.

Cholesterol Category Distribution

PROC FREQ DATA=FINAL_PATIENTS;
    TABLES Chol_Cat;
RUN;

OUTPUT:




Chol_Cat
Frequency
Percent
Cumulative
Frequency
Cumulative
Percent


Normal
7
100.00
7
100.00

Chol_Cat	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Normal	7	100.00	7	100.00

Explanation:

PROC FREQ calculates the frequency distribution of cholesterol categories.

Step 6: Advanced Analysis

To gain deeper insights, we apply correlation analysis and visualization.

PROC CORR DATA=FINAL_PATIENTS;
    VAR Age Cholesterol Glucose BMI;
RUN;

OUTPUT:

                                   The CORR Procedure








4 Variables:
Age Cholesterol Glucose 
BMI














Simple Statistics

Variable
N
Mean
Std Dev
Sum
Minimum
Maximum


Age
7
41.71429
9.19627
292.00000
29.00000
55.00000

Cholesterol
7
93.14286
9.87300
652.00000
80.00000
110.00000

Glucose
7
76.42857
15.19712
535.00000
55.00000
100.00000

BMI
7
58.82353
3.73745
411.76471
53.63322
64.01384












Pearson Correlation 
Coefficients, N = 7 
Prob > |r| under H0: Rho=0

 
Age
Cholesterol
Glucose
BMI


Age




1.00000

 




0.47779

0.2782




0.56390

0.1873




0.57048

0.1811

Cholesterol




0.47779

0.2782




1.00000

 




0.67045

0.0993




0.61734

0.1397

Glucose




0.56390

0.1873




0.67045

0.0993




1.00000

 




0.98996

<.0001

BMI




0.57048

0.1811




0.61734

0.1397




0.98996

<.0001




1.00000

4 Variables:	Age Cholesterol Glucose BMI

Simple Statistics
Age	7	41.71429	9.19627	292.00000	29.00000	55.00000
Cholesterol	7	93.14286	9.87300	652.00000	80.00000	110.00000
Glucose	7	76.42857	15.19712	535.00000	55.00000	100.00000
BMI	7	58.82353	3.73745	411.76471	53.63322	64.01384

Explanation:

PROC CORR computes correlation coefficients to measure relationships between variables.

PROC SGPLOT DATA=FINAL_PATIENTS;
    SCATTER X=Age Y=Cholesterol / GROUP=Gender;
RUN;

Explanation:

PROC SGPLOT generates scatter plots to visualize trends between Age and Cholesterol levels.

Step 7: Exporting the Cleaned Data

The final dataset can be exported for further use.

PROC EXPORT DATA=FINAL_PATIENTS
    OUTFILE="C:\Users\YourName\Documents\Final_Patients.csv"
    DBMS=CSV REPLACE;
RUN;

Explanation:

The PROC EXPORT step saves the dataset as a CSV file.

Conclusion

This example demonstrates a complete SAS workflow from dataset creation to advanced analysis. We covered:

Data Creation: Defining and inputting dataset records.
Data Assessment: Understanding structure and identifying issues.
Data Cleaning: Standardizing, handling missing values, and removing inconsistencies.
Data Manipulation: Creating new variables (BMI, cholesterol categories).
Applying SAS Procedures: Using summary statistics, frequency analysis, and gender-based grouping.
Advanced Analysis: Correlation and visualization techniques.
Exporting Data: Saving the processed dataset for further use.

PRACTICE AND COMMENT YOUR CODE:

-->PLEASE FOLLOW OUR BLOG FOR MORE UPDATES.

TO FOLLOW OUR TELEGRAM CHANNEL CLICK HERE

SAS Learning Hub

Tuesday, 18 March 2025

131.CREATING | ASSESSING | CLEANING | MANIPULATING | APPLYING SAS PROCEDURES IN THE DATASET

CREATING | ASSESSING | CLEANING | MANIPULATING | APPLYING SAS PROCEDURES IN THE DATASET

Introduction

Step 1: Creating the Dataset

Step 2: Data Assessment

Step 3: Data Cleaning

Step 4: Data Manipulation

Step 5: Applying SAS Procedures

Step 6: Advanced Analysis

Step 7: Exporting the Cleaned Data

Conclusion

No comments:

Post a Comment

Popular Posts

Basic Statistical Measures
Location		Variability
Mean	93.14286	Std Deviation	9.87300
Median	92.00000	Variance	97.47619
Mode	.	Range	30.00000
		Interquartile Range	15.00000

Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	24.96027	Pr > \|t\|	<.0001
Sign	M	3.5	Pr >= \|M\|	0.0156
Signed Rank	S	14	Pr >= \|S\|	0.0156

Quantiles (Definition 5)
Level	Quantile
100% Max	110
99%	110
95%	110
90%	110
75% Q3	100
50% Median	92
25% Q1	85
10%	80
5%	80
1%	80
0% Min	80

Extreme Observations
Lowest		Highest
Value	Obs	Value	Obs
80	7	90	1
85	2	92	4
90	1	95	6
92	4	100	3
95	6	110	5

Simple Statistics
Variable	N	Mean	Std Dev	Sum	Minimum	Maximum
Age	7	41.71429	9.19627	292.00000	29.00000	55.00000
Cholesterol	7	93.14286	9.87300	652.00000	80.00000	110.00000
Glucose	7	76.42857	15.19712	535.00000	55.00000	100.00000
BMI	7	58.82353	3.73745	411.76471	53.63322	64.01384