132.A EXAMPLE OF EMPLOYEES DATASET | PROC PRINT | REMOVING DUPLICATES | PROC SORT NODUPKEY | ASSESSING MISSING VALUES | PROC MEANS | HANDLING MISSING VALUES | PROC STDIZE | DETECTING AND HANDLING OUTLIERS | PROC UNIVARIATE

A EXAMPLE OF EMPLOYEES DATASET | PROC PRINT | REMOVING DUPLICATES | PROC SORT NODUPKEY | ASSESSING MISSING VALUES | PROC MEANS | HANDLING MISSING VALUES | PROC STDIZE | DETECTING AND HANDLING OUTLIERS | PROC UNIVARIATE


In this example, we'll demonstrate how to manipulate, assess, clean, and apply procedures to a single dataset using SAS. We'll use a hypothetical dataset named employees, which contains information about employees in a company. The dataset includes variables such as EmployeeID, Name, Department, Salary, and HireDate.

Step 1: Creating the Dataset

First, we'll create the employees dataset using the DATA step and DATALINES statement.

Example Data:

data employees;

    input EmployeeID Name $ Department $ Salary HireDate :date9.;

    format HireDate date9.;

    datalines;

1 John Sales 55000 15JAN2015

2 Jane Marketing 62000 22MAR2016

3 Jim Sales 58000 10JUL2017

4 Jack IT 60000 05MAY2018

5 Jill HR 59000 12AUG2019

6 John Sales 55000 15JAN2015

7 Jake IT 61000 20SEP2020

8 Jane Marketing 63000 22MAR2016

9 Julia HR 60000 18DEC2021

10 Jim Sales 58000 10JUL2017

;

run;

In this dataset, there are intentional duplicate records for demonstration purposes.


Step 2: Viewing the Dataset

To inspect the dataset, we can use the PROC PRINT procedure.


proc print data=employees;

run;


Output:
Obs EmployeeID Name Department Salary HireDate
1 1 John Sales 55000 15JAN2015
2 2 Jane Marketing 62000 22MAR2016
3 3 Jim Sales 58000 10JUL2017
4 4 Jack IT 60000 05MAY2018
5 5 Jill HR 59000 12AUG2019
6 6 John Sales 55000 15JAN2015
7 7 Jake IT 61000 20SEP2020
8 8 Jane Marketing 63000 22MAR2016
9 9 Julia HR 60000 18DEC2021
10 10 Jim Sales 58000 10JUL2017

Step 3: Identifying and Removing Duplicate Records

Duplicate records can skew analysis results. We'll use the PROC SORT procedure with the NODUPKEY option to remove duplicates based on the EmployeeID.


proc sort data=employees nodupkey out=employees_clean;

    by Name;

run;

proc print data=employees_clean;

run;

Output:

Obs EmployeeID Name Department Salary HireDate
1 4 Jack IT 60000 05MAY2018
2 7 Jake IT 61000 20SEP2020
3 2 Jane Marketing 62000 22MAR2016
4 5 Jill HR 59000 12AUG2019
5 3 Jim Sales 58000 10JUL2017
6 1 John Sales 55000 15JAN2015
7 9 Julia HR 60000 18DEC2021

The duplicate records have been removed.

Step 4: Assessing Missing Values

Missing values can affect data analysis. We'll introduce some missing values and then assess them using PROC MEANS.

data employees_clean;

    set employees_clean;

    if EmployeeID = 5 then Salary = .;

    if EmployeeID = 9 then Department = '';

run;

proc print data=employees_clean;

run;

Output:
Obs EmployeeID Name Department Salary HireDate
1 4 Jack IT 60000 05MAY2018
2 7 Jake IT 61000 20SEP2020
3 2 Jane Marketing 62000 22MAR2016
4 5 Jill HR . 12AUG2019
5 3 Jim Sales 58000 10JUL2017
6 1 John Sales 55000 15JAN2015
7 9 Julia   60000 18DEC2021


proc means data=employees_clean n nmiss;
    var Salary;
run;

Output:
Analysis
Variable
: Salary
N N Miss
6 1

This output indicates that there is one missing value in the SALARY variable.

Step 5: Handling Missing Values

To handle missing values, we can replace them with the mean salary.

proc stdize data=employees_clean reponly method=mean out=employees_imputed; var Salary; run; proc print;run;
Output:
Obs EmployeeID Name Department Salary HireDate
1 4 Jack IT 60000.00 05MAY2018
2 7 Jake IT 61000.00 20SEP2020
3 2 Jane Marketin 62000.00 22MAR2016
4 5 Jill HR 59333.33 12AUG2019
5 3 Jim Sales 58000.00 10JUL2017
6 1 John Sales 55000.00 15JAN2015
7 9 Julia   60000.00 18DEC2021

The missing salary values have been replaced with the mean salary of 58,500.

Step 6: Detecting and Handling Outliers

Outliers can distort statistical analyses. We'll detect outliers in the salary variable using PROC UNIVARIATE..


proc univariate data=employees_imputed;
    var Salary;
    histogram Salary / normal;
    inset mean median std / format=6.2;
run;

Output:
                                                      The UNIVARIATE Procedure
                                                               Variable: Salary

Moments
N 7 Sum Weights 7
Mean 59333.3333 Sum Observations 415333.333
Std Deviation 2285.2182 Variance 5222222.22
Skewness -1.1644371 Kurtosis 1.69189679
Uncorrected SS 2.46744E10 Corrected SS 31333333.3
Coeff Variation 3.85149135 Std Error Mean 863.731293

Basic Statistical Measures
Location Variability
Mean 59333.33 Std Deviation 2285
Median 60000.00 Variance 5222222
Mode 60000.00 Range 7000
    Interquartile Range 3000

Tests for Location: Mu0=0
Test Statistic p Value
Student's t t 68.6942 Pr > |t| <.0001
Sign M 3.5 Pr >= |M| 0.0156
Signed Rank S 14 Pr >= |S| 0.0156

Quantiles (Definition 5)
Level Quantile
100% Max 62000
99% 62000
95% 62000
90% 62000
75% Q3 61000
50% Median 60000
25% Q1 58000
10% 55000
5% 55000
1% 55000
0% Min 55000

Extreme Observations
Lowest Highest
Value Obs Value Obs
55000.0 6 59333.3 4
58000.0 5 60000.0 1
59333.3 4 60000.0 7
60000.0 7 61000.0 2
60000.0 1 62000.0 3

The histogram and statistical measures indicate that the salary variable is approximately normally.

PRACTICE AND COMMENT YOUR CODE: 

-->PLEASE FOLLOW OUR BLOG FOR MORE UPDATES.

TO FOLLOW OUR TELEGRAM CHANNEL CLICK HERE

Comments