132.A EXAMPLE OF EMPLOYEES DATASET | PROC PRINT | REMOVING DUPLICATES | PROC SORT NODUPKEY | ASSESSING MISSING VALUES | PROC MEANS | HANDLING MISSING VALUES | PROC STDIZE | DETECTING AND HANDLING OUTLIERS | PROC UNIVARIATE
- Get link
- X
- Other Apps
A EXAMPLE OF EMPLOYEES DATASET | PROC PRINT | REMOVING DUPLICATES | PROC SORT NODUPKEY | ASSESSING MISSING VALUES | PROC MEANS | HANDLING MISSING VALUES | PROC STDIZE | DETECTING AND HANDLING OUTLIERS | PROC UNIVARIATE
In this example, we'll demonstrate how to manipulate, assess, clean, and apply procedures to a single dataset using SAS. We'll use a hypothetical dataset named employees, which contains information about employees in a company. The dataset includes variables such as EmployeeID, Name, Department
, Salary
, and HireDate
.
Step 1: Creating the Dataset
First, we'll create the employees dataset using the DATA step and DATALINES statement.
Example Data:
data employees;
input EmployeeID Name $ Department $ Salary HireDate :date9.;
format HireDate date9.;
datalines;
1 John Sales 55000 15JAN2015
2 Jane Marketing 62000 22MAR2016
3 Jim Sales 58000 10JUL2017
4 Jack IT 60000 05MAY2018
5 Jill HR 59000 12AUG2019
6 John Sales 55000 15JAN2015
7 Jake IT 61000 20SEP2020
8 Jane Marketing 63000 22MAR2016
9 Julia HR 60000 18DEC2021
10 Jim Sales 58000 10JUL2017
;
run;
In this dataset, there are intentional duplicate records for demonstration purposes.
Step 2: Viewing the Dataset
To inspect the dataset, we can use the PROC PRINT procedure.
proc print data=employees;
run;
Obs | EmployeeID | Name | Department | Salary | HireDate |
---|---|---|---|---|---|
1 | 1 | John | Sales | 55000 | 15JAN2015 |
2 | 2 | Jane | Marketing | 62000 | 22MAR2016 |
3 | 3 | Jim | Sales | 58000 | 10JUL2017 |
4 | 4 | Jack | IT | 60000 | 05MAY2018 |
5 | 5 | Jill | HR | 59000 | 12AUG2019 |
6 | 6 | John | Sales | 55000 | 15JAN2015 |
7 | 7 | Jake | IT | 61000 | 20SEP2020 |
8 | 8 | Jane | Marketing | 63000 | 22MAR2016 |
9 | 9 | Julia | HR | 60000 | 18DEC2021 |
10 | 10 | Jim | Sales | 58000 | 10JUL2017 |
Step 3: Identifying and Removing Duplicate Records
Duplicate records can skew analysis results. We'll use the PROC SORT procedure with the NODUPKEY option to remove duplicates based on the EmployeeID.
proc sort data=employees nodupkey out=employees_clean;
by Name;
run;
proc print data=employees_clean;
run;
Output:
Obs | EmployeeID | Name | Department | Salary | HireDate |
---|---|---|---|---|---|
1 | 4 | Jack | IT | 60000 | 05MAY2018 |
2 | 7 | Jake | IT | 61000 | 20SEP2020 |
3 | 2 | Jane | Marketing | 62000 | 22MAR2016 |
4 | 5 | Jill | HR | 59000 | 12AUG2019 |
5 | 3 | Jim | Sales | 58000 | 10JUL2017 |
6 | 1 | John | Sales | 55000 | 15JAN2015 |
7 | 9 | Julia | HR | 60000 | 18DEC2021 |
The duplicate records have been removed.
Step 4: Assessing Missing Values
Missing values can affect data analysis. We'll introduce some missing values and then assess them using PROC MEANS.
data employees_clean;
set employees_clean;
if EmployeeID = 5 then Salary = .;
if EmployeeID = 9 then Department = '';
run;
proc print data=employees_clean;
run;
Obs | EmployeeID | Name | Department | Salary | HireDate |
---|---|---|---|---|---|
1 | 4 | Jack | IT | 60000 | 05MAY2018 |
2 | 7 | Jake | IT | 61000 | 20SEP2020 |
3 | 2 | Jane | Marketing | 62000 | 22MAR2016 |
4 | 5 | Jill | HR | . | 12AUG2019 |
5 | 3 | Jim | Sales | 58000 | 10JUL2017 |
6 | 1 | John | Sales | 55000 | 15JAN2015 |
7 | 9 | Julia | 60000 | 18DEC2021 |
Analysis Variable : Salary | |
---|---|
N | N Miss |
6 | 1 |
This output indicates that there is one missing value in the SALARY variable.
Step 5: Handling Missing Values
To handle missing values, we can replace them with the mean salary.
proc stdize data=employees_clean reponly method=mean out=employees_imputed; var Salary; run; proc print;run;Output:
Obs EmployeeID Name Department Salary HireDate 1 4 Jack IT 60000.00 05MAY2018 2 7 Jake IT 61000.00 20SEP2020 3 2 Jane Marketin 62000.00 22MAR2016 4 5 Jill HR 59333.33 12AUG2019 5 3 Jim Sales 58000.00 10JUL2017 6 1 John Sales 55000.00 15JAN2015 7 9 Julia 60000.00 18DEC2021 The missing salary values have been replaced with the mean salary of 58,500.
Step 6: Detecting and Handling Outliers
Outliers can distort statistical analyses. We'll detect outliers in the salary variable using PROC UNIVARIATE..
Moments | |||
---|---|---|---|
N | 7 | Sum Weights | 7 |
Mean | 59333.3333 | Sum Observations | 415333.333 |
Std Deviation | 2285.2182 | Variance | 5222222.22 |
Skewness | -1.1644371 | Kurtosis | 1.69189679 |
Uncorrected SS | 2.46744E10 | Corrected SS | 31333333.3 |
Coeff Variation | 3.85149135 | Std Error Mean | 863.731293 |
Basic Statistical Measures | |||
---|---|---|---|
Location | Variability | ||
Mean | 59333.33 | Std Deviation | 2285 |
Median | 60000.00 | Variance | 5222222 |
Mode | 60000.00 | Range | 7000 |
Interquartile Range | 3000 |
Tests for Location: Mu0=0 | ||||
---|---|---|---|---|
Test | Statistic | p Value | ||
Student's t | t | 68.6942 | Pr > |t| | <.0001 |
Sign | M | 3.5 | Pr >= |M| | 0.0156 |
Signed Rank | S | 14 | Pr >= |S| | 0.0156 |
Quantiles (Definition 5) | |
---|---|
Level | Quantile |
100% Max | 62000 |
99% | 62000 |
95% | 62000 |
90% | 62000 |
75% Q3 | 61000 |
50% Median | 60000 |
25% Q1 | 58000 |
10% | 55000 |
5% | 55000 |
1% | 55000 |
0% Min | 55000 |
Extreme Observations | |||
---|---|---|---|
Lowest | Highest | ||
Value | Obs | Value | Obs |
55000.0 | 6 | 59333.3 | 4 |
58000.0 | 5 | 60000.0 | 1 |
59333.3 | 4 | 60000.0 | 7 |
60000.0 | 7 | 61000.0 | 2 |
60000.0 | 1 | 62000.0 | 3 |
- Get link
- X
- Other Apps
Comments
Post a Comment