131.CREATING | ASSESSING | CLEANING | MANIPULATING | APPLYING SAS PROCEDURES IN THE DATASET

CREATING | ASSESSING | CLEANING | MANIPULATING | APPLYING SAS PROCEDURES IN THE DATASET

Introduction

Data processing and analysis are crucial aspects of working with structured datasets. In this document, we will go through an end-to-end SAS example using a single dataset. We will perform data manipulation, assess data quality, clean the data, and apply various SAS procedures to generate meaningful insights. The dataset used here contains patient information, including demographics, vital signs, and lab test results. By the end of this example, we will have a well-structured dataset ready for analysis and interpretation.

Step 1: Creating the Dataset

To begin, we create a sample dataset named PATIENTS, which includes patient demographic details, blood pressure, cholesterol levels, glucose levels, weight, and height.

/* Creating a sample dataset named PATIENTS */
DATA PATIENTS;
    INPUT ID $ Name $ Age Gender $ BP SysBP DiasBP Cholesterol Glucose Weight Height;
    DATALINES;
001 John 45 M 120 80 200 90 80 175
002 Mary 38 F 110 70 180 85 65 160
003 Steve 55 M 130 85 220 100 90 180
004 Anna 29 F 115 75 190 92 55 155
005 Tom 42 M 140 90 250 110 100 185
006 Lisa 50 F 125 85 210 95 70 165
007 Alex 33 M 118 78 175 80 75 170
;
RUN;

PROC PRINT DATA=PATIENTS;
RUN;
OUTPUT:
Obs ID Name Age Gender BP SysBP DiasBP Cholesterol Glucose Weight Height
1 001 John 45 M 120 80 200 90 80 175 .
2 002 Mary 38 F 110 70 180 85 65 160 .
3 003 Steve 55 M 130 85 220 100 90 180 .
4 004 Anna 29 F 115 75 190 92 55 155 .
5 005 Tom 42 M 140 90 250 110 100 185 .
6 006 Lisa 50 F 125 85 210 95 70 165 .
7 007 Alex 33 M 118 78 175 80 75 170 .

Explanation:

  • The DATA step defines the dataset.

  • INPUT specifies the variables and their data types.

  • DATALINES includes sample records for seven patients.

  • PROC PRINT displays the dataset.

Step 2: Data Assessment

Before proceeding with analysis, we assess the dataset structure and identify any missing or incorrect values.

PROC CONTENTS DATA=PATIENTS;
RUN;
OUTPUT:
The CONTENTS Procedure
Data Set Name WORK.PATIENTS Observations 7
Member Type DATA Variables 11
Engine V9 Indexes 0
Created 14/09/2015 00:11:13 Observation Length 88
Last Modified 14/09/2015 00:11:13 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation WINDOWS_64    
Encoding wlatin1 Western (Windows)    
PROC MEANS DATA=PATIENTS N NMISS MIN MAX MEAN; RUN;
OUTPUT:
Variable N N Miss Minimum Maximum Mean
Age
BP
SysBP
DiasBP
Cholesterol
Glucose
Weight
Height
7
7
7
7
7
7
7
0
0
0
0
0
0
0
0
7
29.0000000
110.0000000
70.0000000
175.0000000
80.0000000
55.0000000
155.0000000
.
55.0000000
140.0000000
90.0000000
250.0000000
110.0000000
100.0000000
185.0000000
.
41.7142857
122.5714286
80.4285714
203.5714286
93.1428571
76.4285714
170.0000000
.

PROC FREQ DATA=PATIENTS; TABLES Gender; RUN;
OUTPUT:

Gender Frequency Percent Cumulative
Frequency
Cumulative
Percent
F 3 42.86 3 42.86
M 4 57.14 7 100.00

Output Insights:

  • Dataset contains 7 observations and 10 variables.

  • No missing values detected.

  • Gender distribution: 4 males, 3 females.

Step 3: Data Cleaning

Data cleaning ensures the dataset is free from inconsistencies and ready for further processing.

DATA CLEAN_PATIENTS;
    SET PATIENTS;
    /* Standardizing variable names */
    Gender = UPCASE(Gender);
    /* Handling missing values (if any were present) */
    IF Age = . THEN Age = 40; 
    IF Cholesterol = . THEN Cholesterol = MEAN(OF Cholesterol);
    IF Height = . THEN Height = 170; /* Assigning default height if missing */
RUN;
OUTPUT:

Obs ID Name Age Gender BP SysBP DiasBP Cholesterol Glucose Weight Height
1 001 John 45 M 120 80 200 90 80 175 170
2 002 Mary 38 F 110 70 180 85 65 160 170
3 003 Steve 55 M 130 85 220 100 90 180 170
4 004 Anna 29 F 115 75 190 92 55 155 170
5 005 Tom 42 M 140 90 250 110 100 185 170
6 006 Lisa 50 F 125 85 210 95 70 165 170
7 007 Alex 33 M 118 78 175 80 75 170 170
PROC UNIVARIATE DATA=CLEAN_PATIENTS; VAR Cholesterol Glucose Weight Height; RUN;
OUTPUT:
The UNIVARIATE Procedure
Variable: Cholesterol
Moments
N 7 Sum Weights 7
Mean 93.1428571 Sum Observations 652
Std Deviation 9.87300311 Variance 97.4761905
Skewness 0.55190398 Kurtosis 0.36135739
Uncorrected SS 61314 Corrected SS 584.857143
Coeff Variation 10.59985 Std Error Mean 3.73164442

Basic Statistical Measures
Location Variability
Mean 93.14286 Std Deviation 9.87300
Median 92.00000 Variance 97.47619
Mode . Range 30.00000
    Interquartile Range 15.00000

Tests for Location: Mu0=0
Test Statistic p Value
Student's t t 24.96027 Pr > |t| <.0001
Sign M 3.5 Pr >= |M| 0.0156
Signed Rank S 14 Pr >= |S| 0.0156
Quantiles (Definition 5)
Level Quantile
100% Max 110
99% 110
95% 110
90% 110
75% Q3 100
50% Median 92
25% Q1 85
10% 80
5% 80
1% 80
0% Min 80
Extreme Observations
Lowest Highest
Value Obs Value Obs
80 7 90 1
85 2 92 4
90 1 95 6
92 4 100 3
95 6 110 5

Here I have kept only one variable output of proc univariate...

Explanation:

  • The UPCASE function ensures uniform formatting for the Gender variable.

  • Missing values (if any) are replaced with a default or mean value.

  • PROC UNIVARIATE identifies outliers and data distribution.

Output Insights:

  • No missing values, but default values set if they were missing.

  • Cholesterol and glucose levels are within expected range.

Step 4: Data Manipulation

Data manipulation helps in deriving new insights by transforming existing variables.

DATA FINAL_PATIENTS;
    SET CLEAN_PATIENTS;
    /* Calculating BMI */
    BMI = Weight / ((Height/100) ** 2);
    /* Categorizing Cholesterol Levels */
    IF Cholesterol < 200 THEN Chol_Cat = 'Normal';
    ELSE IF Cholesterol >= 200 AND Cholesterol < 240 THEN Chol_Cat = 'Borderline';
    ELSE Chol_Cat = 'High';
RUN;

PROC PRINT DATA=FINAL_PATIENTS;
RUN;
OUTPUT:
Obs ID Name Age Gender BP SysBP DiasBP Cholesterol Glucose Weight Height BMI Chol_Cat
1 001 John 45 M 120 80 200 90 80 175 170 60.5536 Normal
2 002 Mary 38 F 110 70 180 85 65 160 170 55.3633 Normal
3 003 Steve 55 M 130 85 220 100 90 180 170 62.2837 Normal
4 004 Anna 29 F 115 75 190 92 55 155 170 53.6332 Normal
5 005 Tom 42 M 140 90 250 110 100 185 170 64.0138 Normal
6 006 Lisa 50 F 125 85 210 95 70 165 170 57.0934 Normal
7 007 Alex 33 M 118 78 175 80 75 170 170 58.8235 Normal

Explanation:

  • BMI (Body Mass Index) is calculated using the formula: BMI = Weight / (Height in meters)^2.

  • Patients are categorized based on cholesterol levels: Normal, Borderline, and High.

Output Insights:

  • A new variable BMI is created.

  • Cholesterol levels are categorized for better analysis.

Step 5: Applying SAS Procedures

Summary Statistics

PROC MEANS DATA=FINAL_PATIENTS;
    VAR Age BP SysBP DiasBP Cholesterol Glucose BMI;
RUN;
OUTPUT:

Variable N Mean Std Dev Minimum Maximum
Age
BP
SysBP
DiasBP
Cholesterol
Glucose
BMI
7
7
7
7
7
7
7
41.7142857
122.5714286
80.4285714
203.5714286
93.1428571
76.4285714
58.8235294
9.1962725
10.0640804
6.8033605
25.9349002
9.8730031
15.1971175
3.7374514
29.0000000
110.0000000
70.0000000
175.0000000
80.0000000
55.0000000
53.6332180
55.0000000
140.0000000
90.0000000
250.0000000
110.0000000
100.0000000
64.0138408

Explanation:

  • PROC MEANS provides statistical summaries like mean, min, max, and standard deviation.

Grouping by Gender

PROC MEANS DATA=FINAL_PATIENTS;
    CLASS Gender;
    VAR Age Cholesterol BMI;
RUN;
OUTPUT:
Gender N Obs Variable N Mean Std Dev Minimum Maximum
F 3
Age
Cholesterol
BMI
3
3
3
39.0000000
90.6666667
55.3633218
10.5356538
5.1316014
1.7301038
29.0000000
85.0000000
53.6332180
50.0000000
95.0000000
57.0934256
M 4
Age
Cholesterol
BMI
4
4
4
43.7500000
95.0000000
61.4186851
9.0691786
12.9099445
2.2335544
33.0000000
80.0000000
58.8235294
55.0000000
110.0000000
64.0138408

Explanation:

  • CLASS Gender allows summary statistics to be grouped by gender.

Cholesterol Category Distribution

PROC FREQ DATA=FINAL_PATIENTS;
    TABLES Chol_Cat;
RUN;
OUTPUT:

Chol_Cat Frequency Percent Cumulative
Frequency
Cumulative
Percent
Normal 7 100.00 7 100.00

Explanation:

  • PROC FREQ calculates the frequency distribution of cholesterol categories.

Step 6: Advanced Analysis

To gain deeper insights, we apply correlation analysis and visualization.

PROC CORR DATA=FINAL_PATIENTS;
    VAR Age Cholesterol Glucose BMI;
RUN;
OUTPUT:
The CORR Procedure
4 Variables: Age Cholesterol Glucose BMI
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
Age 7 41.71429 9.19627 292.00000 29.00000 55.00000
Cholesterol 7 93.14286 9.87300 652.00000 80.00000 110.00000
Glucose 7 76.42857 15.19712 535.00000 55.00000 100.00000
BMI 7 58.82353 3.73745 411.76471 53.63322 64.01384

Pearson Correlation Coefficients, N = 7
Prob > |r| under H0: Rho=0
  Age Cholesterol Glucose BMI
Age
1.00000
 
0.47779
0.2782
0.56390
0.1873
0.57048
0.1811
Cholesterol
0.47779
0.2782
1.00000
 
0.67045
0.0993
0.61734
0.1397
Glucose
0.56390
0.1873
0.67045
0.0993
1.00000
 
0.98996
<.0001
BMI
0.57048
0.1811
0.61734
0.1397
0.98996
<.0001
1.00000
 

Explanation:

  • PROC CORR computes correlation coefficients to measure relationships between variables.

PROC SGPLOT DATA=FINAL_PATIENTS;
    SCATTER X=Age Y=Cholesterol / GROUP=Gender;
RUN;

Explanation:

  • PROC SGPLOT generates scatter plots to visualize trends between Age and Cholesterol levels.

Step 7: Exporting the Cleaned Data

The final dataset can be exported for further use.

PROC EXPORT DATA=FINAL_PATIENTS
    OUTFILE="C:\Users\YourName\Documents\Final_Patients.csv"
    DBMS=CSV REPLACE;
RUN;

Explanation:

  • The PROC EXPORT step saves the dataset as a CSV file.

Conclusion

This example demonstrates a complete SAS workflow from dataset creation to advanced analysis. We covered:

  1. Data Creation: Defining and inputting dataset records.

  2. Data Assessment: Understanding structure and identifying issues.

  3. Data Cleaning: Standardizing, handling missing values, and removing inconsistencies.

  4. Data Manipulation: Creating new variables (BMI, cholesterol categories).

  5. Applying SAS Procedures: Using summary statistics, frequency analysis, and gender-based grouping.

  6. Advanced Analysis: Correlation and visualization techniques.

  7. Exporting Data: Saving the processed dataset for further use.


PRACTICE AND COMMENT YOUR CODE: 

-->PLEASE FOLLOW OUR BLOG FOR MORE UPDATES.

TO FOLLOW OUR TELEGRAM CHANNEL CLICK HERE

Comments