SAS Learning Hub: 132.A EXAMPLE OF EMPLOYEES DATASET | PROC PRINT | REMOVING DUPLICATES | PROC SORT NODUPKEY | ASSESSING MISSING VALUES | PROC MEANS | HANDLING MISSING VALUES | PROC STDIZE | DETECTING AND HANDLING OUTLIERS

A EXAMPLE OF EMPLOYEES DATASET | PROC PRINT | REMOVING DUPLICATES | PROC SORT NODUPKEY | ASSESSING MISSING VALUES | PROC MEANS | HANDLING MISSING VALUES | PROC STDIZE | DETECTING AND HANDLING OUTLIERS | PROC UNIVARIATE

In this example, we'll demonstrate how to manipulate, assess, clean, and apply procedures to a single dataset using SAS. We'll use a hypothetical dataset named employees, which contains information about employees in a company. The dataset includes variables such as EmployeeID, Name, Department, Salary, and HireDate.

Step 1: Creating the Dataset

First, we'll create the employees dataset using the DATA step and DATALINES statement.

Example Data:

data employees;

input EmployeeID Name $ Department $ Salary HireDate :date9.;

format HireDate date9.;

datalines;

1 John Sales 55000 15JAN2015

2 Jane Marketing 62000 22MAR2016

3 Jim Sales 58000 10JUL2017

4 Jack IT 60000 05MAY2018

5 Jill HR 59000 12AUG2019

6 John Sales 55000 15JAN2015

7 Jake IT 61000 20SEP2020

8 Jane Marketing 63000 22MAR2016

9 Julia HR 60000 18DEC2021

10 Jim Sales 58000 10JUL2017

;

run;

In this dataset, there are intentional duplicate records for demonstration purposes.

Step 2: Viewing the Dataset

To inspect the dataset, we can use the PROC PRINT procedure.

proc print data=employees;

run;

Output:

Obs	EmployeeID	Name	Department	Salary	HireDate
1	1	John	Sales	55000	15JAN2015
2	2	Jane	Marketing	62000	22MAR2016
3	3	Jim	Sales	58000	10JUL2017
4	4	Jack	IT	60000	05MAY2018
5	5	Jill	HR	59000	12AUG2019
6	6	John	Sales	55000	15JAN2015
7	7	Jake	IT	61000	20SEP2020
8	8	Jane	Marketing	63000	22MAR2016
9	9	Julia	HR	60000	18DEC2021
10	10	Jim	Sales	58000	10JUL2017

Step 3: Identifying and Removing Duplicate Records

Duplicate records can skew analysis results. We'll use the PROC SORT procedure with the NODUPKEY option to remove duplicates based on the EmployeeID.

proc sort data=employees nodupkey out=employees_clean;

by Name;

run;

proc print data=employees_clean;

run;

Output:

Obs	EmployeeID	Name	Department	Salary	HireDate
1	4	Jack	IT	60000	05MAY2018
2	7	Jake	IT	61000	20SEP2020
3	2	Jane	Marketing	62000	22MAR2016
4	5	Jill	HR	59000	12AUG2019
5	3	Jim	Sales	58000	10JUL2017
6	1	John	Sales	55000	15JAN2015
7	9	Julia	HR	60000	18DEC2021

The duplicate records have been removed.

Step 4: Assessing Missing Values

Missing values can affect data analysis. We'll introduce some missing values and then assess them using PROC MEANS.

data employees_clean;

set employees_clean;

if EmployeeID = 5 then Salary = .;

if EmployeeID = 9 then Department = '';

run;

proc print data=employees_clean;

run;

Output:

Obs	EmployeeID	Name	Department	Salary	HireDate
1	4	Jack	IT	60000	05MAY2018
2	7	Jake	IT	61000	20SEP2020
3	2	Jane	Marketing	62000	22MAR2016
4	5	Jill	HR	.	12AUG2019
5	3	Jim	Sales	58000	10JUL2017
6	1	John	Sales	55000	15JAN2015
7	9	Julia		60000	18DEC2021

proc means data=employees_clean n nmiss;

var Salary;

run;

Output:

Analysis Variable : Salary
N	N Miss
6	1

This output indicates that there is one missing value in the SALARY variable.

Step 5: Handling Missing Values

To handle missing values, we can replace them with the mean salary.


proc stdize data=employees_clean reponly method=mean out=employees_imputed;
    var Salary;
run;
proc print;run;

Output:


Obs
EmployeeID
Name
Department
Salary
HireDate


1
4
Jack
IT
60000.00
05MAY2018

2
7
Jake
IT
61000.00
20SEP2020

3
2
Jane
Marketin
62000.00
22MAR2016

4
5
Jill
HR
59333.33
12AUG2019

5
3
Jim
Sales
58000.00
10JUL2017

6
1
John
Sales
55000.00
15JAN2015

7
9
Julia
 
60000.00
18DEC2021

The missing salary values have been replaced with the mean salary of 58,500.

Step 6: Detecting and Handling Outliers
Outliers can distort statistical analyses. We'll detect outliers in the salary variable using PROC UNIVARIATE..

Obs	EmployeeID	Name	Department	Salary	HireDate
1	4	Jack	IT	60000.00	05MAY2018
2	7	Jake	IT	61000.00	20SEP2020
3	2	Jane	Marketin	62000.00	22MAR2016
4	5	Jill	HR	59333.33	12AUG2019
5	3	Jim	Sales	58000.00	10JUL2017
6	1	John	Sales	55000.00	15JAN2015
7	9	Julia		60000.00	18DEC2021

proc univariate data=employees_imputed;

var Salary;

histogram Salary / normal;

inset mean median std / format=6.2;

run;

Output:

The UNIVARIATE Procedure

Variable: Salary

Moments
N	7	Sum Weights	7
Mean	59333.3333	Sum Observations	415333.333
Std Deviation	2285.2182	Variance	5222222.22
Skewness	-1.1644371	Kurtosis	1.69189679
Uncorrected SS	2.46744E10	Corrected SS	31333333.3
Coeff Variation	3.85149135	Std Error Mean	863.731293

Basic Statistical Measures
Location		Variability
Mean	59333.33	Std Deviation	2285
Median	60000.00	Variance	5222222
Mode	60000.00	Range	7000
		Interquartile Range	3000

Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	68.6942	Pr > \|t\|	<.0001
Sign	M	3.5	Pr >= \|M\|	0.0156
Signed Rank	S	14	Pr >= \|S\|	0.0156

Quantiles (Definition 5)
Level	Quantile
100% Max	62000
99%	62000
95%	62000
90%	62000
75% Q3	61000
50% Median	60000
25% Q1	58000
10%	55000
5%	55000
1%	55000
0% Min	55000

Extreme Observations
Lowest		Highest
Value	Obs	Value	Obs
55000.0	6	59333.3	4
58000.0	5	60000.0	1
59333.3	4	60000.0	7
60000.0	7	61000.0	2
60000.0	1	62000.0	3

The histogram and statistical measures indicate that the salary variable is approximately normally.

PRACTICE AND COMMENT YOUR CODE:

-->PLEASE FOLLOW OUR BLOG FOR MORE UPDATES.

TO FOLLOW OUR TELEGRAM CHANNEL CLICK HERE

SAS Learning Hub

Wednesday, 19 March 2025

132.A EXAMPLE OF EMPLOYEES DATASET | PROC PRINT | REMOVING DUPLICATES | PROC SORT NODUPKEY | ASSESSING MISSING VALUES | PROC MEANS | HANDLING MISSING VALUES | PROC STDIZE | DETECTING AND HANDLING OUTLIERS | PROC UNIVARIATE

A EXAMPLE OF EMPLOYEES DATASET | PROC PRINT | REMOVING DUPLICATES | PROC SORT NODUPKEY | ASSESSING MISSING VALUES | PROC MEANS | HANDLING MISSING VALUES | PROC STDIZE | DETECTING AND HANDLING OUTLIERS | PROC UNIVARIATE

No comments:

Post a Comment

Popular Posts