CREATING | ASSESSING | CLEANING | MANIPULATING | APPLYING SAS PROCEDURES IN THE DATASET
Introduction
Data processing and analysis are crucial aspects of working with structured datasets. In this document, we will go through an end-to-end SAS example using a single dataset. We will perform data manipulation, assess data quality, clean the data, and apply various SAS procedures to generate meaningful insights. The dataset used here contains patient information, including demographics, vital signs, and lab test results. By the end of this example, we will have a well-structured dataset ready for analysis and interpretation.
Step 1: Creating the Dataset
To begin, we create a sample dataset named PATIENTS
, which includes patient demographic details, blood pressure, cholesterol levels, glucose levels, weight, and height.
/* Creating a sample dataset named PATIENTS */
DATA PATIENTS;
INPUT ID $ Name $ Age Gender $ BP SysBP DiasBP Cholesterol Glucose Weight Height;
DATALINES;
001 John 45 M 120 80 200 90 80 175
002 Mary 38 F 110 70 180 85 65 160
003 Steve 55 M 130 85 220 100 90 180
004 Anna 29 F 115 75 190 92 55 155
005 Tom 42 M 140 90 250 110 100 185
006 Lisa 50 F 125 85 210 95 70 165
007 Alex 33 M 118 78 175 80 75 170
;
RUN;
PROC PRINT DATA=PATIENTS;
RUN;
OUTPUT:
001 |
John |
45 |
M |
120 |
80 |
200 |
90 |
80 |
175 |
. |
002 |
Mary |
38 |
F |
110 |
70 |
180 |
85 |
65 |
160 |
. |
003 |
Steve |
55 |
M |
130 |
85 |
220 |
100 |
90 |
180 |
. |
004 |
Anna |
29 |
F |
115 |
75 |
190 |
92 |
55 |
155 |
. |
005 |
Tom |
42 |
M |
140 |
90 |
250 |
110 |
100 |
185 |
. |
006 |
Lisa |
50 |
F |
125 |
85 |
210 |
95 |
70 |
165 |
. |
007 |
Alex |
33 |
M |
118 |
78 |
175 |
80 |
75 |
170 |
. |
Explanation:
The DATA
step defines the dataset.
INPUT
specifies the variables and their data types.
DATALINES
includes sample records for seven patients.
PROC PRINT
displays the dataset.
Step 2: Data Assessment
Before proceeding with analysis, we assess the dataset structure and identify any missing or incorrect values.
PROC CONTENTS DATA=PATIENTS;
RUN;
OUTPUT:
The CONTENTS Procedure
WORK.PATIENTS |
7 |
DATA |
11 |
V9 |
0 |
14/09/2015 00:11:13 |
88 |
14/09/2015 00:11:13 |
0 |
|
NO |
|
NO |
|
|
WINDOWS_64 |
|
wlatin1 Western (Windows) |
|
PROC MEANS DATA=PATIENTS N NMISS MIN MAX MEAN;
RUN;
OUTPUT:
Age |
BP |
SysBP |
DiasBP |
Cholesterol |
Glucose |
Weight |
Height | |
|
|
29.0000000 |
110.0000000 |
70.0000000 |
175.0000000 |
80.0000000 |
55.0000000 |
155.0000000 |
. | |
55.0000000 |
140.0000000 |
90.0000000 |
250.0000000 |
110.0000000 |
100.0000000 |
185.0000000 |
. | |
41.7142857 |
122.5714286 |
80.4285714 |
203.5714286 |
93.1428571 |
76.4285714 |
170.0000000 |
. | |
PROC FREQ DATA=PATIENTS;
TABLES Gender;
RUN;
OUTPUT:
3 |
42.86 |
3 |
42.86 |
4 |
57.14 |
7 |
100.00 |
Output Insights:
Dataset contains 7 observations and 10 variables.
No missing values detected.
Gender distribution: 4 males, 3 females.
Step 3: Data Cleaning
Data cleaning ensures the dataset is free from inconsistencies and ready for further processing.
DATA CLEAN_PATIENTS;
SET PATIENTS;
/* Standardizing variable names */
Gender = UPCASE(Gender);
/* Handling missing values (if any were present) */
IF Age = . THEN Age = 40;
IF Cholesterol = . THEN Cholesterol = MEAN(OF Cholesterol);
IF Height = . THEN Height = 170; /* Assigning default height if missing */
RUN;
OUTPUT:
001 |
John |
45 |
M |
120 |
80 |
200 |
90 |
80 |
175 |
170 |
002 |
Mary |
38 |
F |
110 |
70 |
180 |
85 |
65 |
160 |
170 |
003 |
Steve |
55 |
M |
130 |
85 |
220 |
100 |
90 |
180 |
170 |
004 |
Anna |
29 |
F |
115 |
75 |
190 |
92 |
55 |
155 |
170 |
005 |
Tom |
42 |
M |
140 |
90 |
250 |
110 |
100 |
185 |
170 |
006 |
Lisa |
50 |
F |
125 |
85 |
210 |
95 |
70 |
165 |
170 |
007 |
Alex |
33 |
M |
118 |
78 |
175 |
80 |
75 |
170 |
170 |
PROC UNIVARIATE DATA=CLEAN_PATIENTS;
VAR Cholesterol Glucose Weight Height;
RUN;
OUTPUT:
The UNIVARIATE Procedure
Variable: Cholesterol
7 |
7 |
93.1428571 |
652 |
9.87300311 |
97.4761905 |
0.55190398 |
0.36135739 |
61314 |
584.857143 |
10.59985 |
3.73164442 |
93.14286 |
9.87300 |
92.00000 |
97.47619 |
. |
30.00000 |
|
15.00000 |
24.96027 |
<.0001 |
3.5 |
0.0156 |
14 |
0.0156 |
110 |
110 |
110 |
110 |
100 |
92 |
85 |
80 |
80 |
80 |
80 |
80 |
7 |
90 |
1 |
85 |
2 |
92 |
4 |
90 |
1 |
95 |
6 |
92 |
4 |
100 |
3 |
95 |
6 |
110 |
5 |
Here I have kept only one variable output of proc univariate...
Explanation:
The UPCASE
function ensures uniform formatting for the Gender
variable.
Missing values (if any) are replaced with a default or mean value.
PROC UNIVARIATE
identifies outliers and data distribution.
Output Insights:
No missing values, but default values set if they were missing.
Cholesterol and glucose levels are within expected range.
Step 4: Data Manipulation
Data manipulation helps in deriving new insights by transforming existing variables.
DATA FINAL_PATIENTS;
SET CLEAN_PATIENTS;
/* Calculating BMI */
BMI = Weight / ((Height/100) ** 2);
/* Categorizing Cholesterol Levels */
IF Cholesterol < 200 THEN Chol_Cat = 'Normal';
ELSE IF Cholesterol >= 200 AND Cholesterol < 240 THEN Chol_Cat = 'Borderline';
ELSE Chol_Cat = 'High';
RUN;
PROC PRINT DATA=FINAL_PATIENTS;
RUN;
OUTPUT:
001 |
John |
45 |
M |
120 |
80 |
200 |
90 |
80 |
175 |
170 |
60.5536 |
Normal |
002 |
Mary |
38 |
F |
110 |
70 |
180 |
85 |
65 |
160 |
170 |
55.3633 |
Normal |
003 |
Steve |
55 |
M |
130 |
85 |
220 |
100 |
90 |
180 |
170 |
62.2837 |
Normal |
004 |
Anna |
29 |
F |
115 |
75 |
190 |
92 |
55 |
155 |
170 |
53.6332 |
Normal |
005 |
Tom |
42 |
M |
140 |
90 |
250 |
110 |
100 |
185 |
170 |
64.0138 |
Normal |
006 |
Lisa |
50 |
F |
125 |
85 |
210 |
95 |
70 |
165 |
170 |
57.0934 |
Normal |
007 |
Alex |
33 |
M |
118 |
78 |
175 |
80 |
75 |
170 |
170 |
58.8235 |
Normal |
Explanation:
BMI (Body Mass Index) is calculated using the formula: BMI = Weight / (Height in meters)^2
.
Patients are categorized based on cholesterol levels: Normal, Borderline, and High.
Output Insights:
Step 5: Applying SAS Procedures
Summary Statistics
PROC MEANS DATA=FINAL_PATIENTS;
VAR Age BP SysBP DiasBP Cholesterol Glucose BMI;
RUN;
OUTPUT:
Age |
BP |
SysBP |
DiasBP |
Cholesterol |
Glucose |
BMI | |
|
41.7142857 |
122.5714286 |
80.4285714 |
203.5714286 |
93.1428571 |
76.4285714 |
58.8235294 | |
9.1962725 |
10.0640804 |
6.8033605 |
25.9349002 |
9.8730031 |
15.1971175 |
3.7374514 | |
29.0000000 |
110.0000000 |
70.0000000 |
175.0000000 |
80.0000000 |
55.0000000 |
53.6332180 | |
55.0000000 |
140.0000000 |
90.0000000 |
250.0000000 |
110.0000000 |
100.0000000 |
64.0138408 | |
Explanation:
Grouping by Gender
PROC MEANS DATA=FINAL_PATIENTS;
CLASS Gender;
VAR Age Cholesterol BMI;
RUN;
OUTPUT:
F |
3 |
|
|
39.0000000 |
90.6666667 |
55.3633218 | |
10.5356538 |
5.1316014 |
1.7301038 | |
29.0000000 |
85.0000000 |
53.6332180 | |
50.0000000 |
95.0000000 |
57.0934256 | |
M |
4 |
|
|
43.7500000 |
95.0000000 |
61.4186851 | |
9.0691786 |
12.9099445 |
2.2335544 | |
33.0000000 |
80.0000000 |
58.8235294 | |
55.0000000 |
110.0000000 |
64.0138408 | |
Explanation:
Cholesterol Category Distribution
PROC FREQ DATA=FINAL_PATIENTS;
TABLES Chol_Cat;
RUN;
OUTPUT:
Explanation:
Step 6: Advanced Analysis
To gain deeper insights, we apply correlation analysis and visualization.
PROC CORR DATA=FINAL_PATIENTS;
VAR Age Cholesterol Glucose BMI;
RUN;
OUTPUT:
The CORR Procedure
Age Cholesterol Glucose
BMI |
7 |
41.71429 |
9.19627 |
292.00000 |
29.00000 |
55.00000 |
7 |
93.14286 |
9.87300 |
652.00000 |
80.00000 |
110.00000 |
7 |
76.42857 |
15.19712 |
535.00000 |
55.00000 |
100.00000 |
7 |
58.82353 |
3.73745 |
411.76471 |
53.63322 |
64.01384 |
Explanation:
PROC SGPLOT DATA=FINAL_PATIENTS;
SCATTER X=Age Y=Cholesterol / GROUP=Gender;
RUN;
Explanation:
Step 7: Exporting the Cleaned Data
The final dataset can be exported for further use.
PROC EXPORT DATA=FINAL_PATIENTS
OUTFILE="C:\Users\YourName\Documents\Final_Patients.csv"
DBMS=CSV REPLACE;
RUN;
Explanation:
Conclusion
This example demonstrates a complete SAS workflow from dataset creation to advanced analysis. We covered:
Data Creation: Defining and inputting dataset records.
Data Assessment: Understanding structure and identifying issues.
Data Cleaning: Standardizing, handling missing values, and removing inconsistencies.
Data Manipulation: Creating new variables (BMI, cholesterol categories).
Applying SAS Procedures: Using summary statistics, frequency analysis, and gender-based grouping.
Advanced Analysis: Correlation and visualization techniques.
Exporting Data: Saving the processed dataset for further use.
PRACTICE AND COMMENT YOUR CODE:
-->PLEASE FOLLOW OUR BLOG FOR MORE UPDATES.
Comments
Post a Comment