Global Job Placements Decoded: Transforming Corrupted Recruitment Data into Reliable Workforce Intelligence
A Global Job Placements Dataset into Executive-Ready Analytics Using SAS DATA Step, PROC SQL, and R
Introduction
Imagine a
multinational recruitment organization managing placement records across 50
countries. Executive leadership notices that placement rates reported in dashboards
differ significantly from regional reports.
A
detailed audit uncovers serious data quality problems:
- Duplicate Placement IDs
causing inflated hiring numbers.
- Missing joining dates
affecting placement timelines.
- Negative salary values due
to system-entry mistakes.
- Candidate ages recorded as 5
and 180.
- Emails missing "@"
symbols.
- Region values stored as
APAC, Asia Pacific, asia_pacific, and AP.
- Mixed upper/lowercase
employer names.
- Invalid timestamps causing
reporting failures.
- NULL values stored as text.
- Trailing spaces creating
duplicate candidate profiles.
The
result?
AI hiring
prediction models become unreliable. Executive dashboards display incorrect
KPIs. Workforce planning becomes inaccurate. Regulatory workforce reports
contain inconsistencies.
This is
exactly why enterprise-grade data cleaning is one of the most critical
responsibilities of a Clinical SAS Programmer, Statistical Programmer, or Data
Scientist.
Raw Global Job Placements Dataset (Intentional Errors Included)
1.SAS Raw Dataset
data job_placements_raw;
length Placement_ID $12 Candidate_Name $30 Email $50
Region $20 Employer $30 Job_Category $20;
infile datalines dlm='|' dsd truncover;
input Placement_ID $ Candidate_Name $ Age
Placement_Date :$20. Salary Region $ Employer $ Email $
Job_Category $;
datalines;
JP1001|john smith|26|2025-01-15|65000|APAC|TechCorp|john@gmail.com|IT
JP1001|john smith|26|2025-01-15|65000|APAC|TechCorp|john@gmail.com|IT
JP1002|MARY JONES|180|2025-02-10|70000|US|DataWorks|marygmail.com|DATA
JP1003|robert lee|5|NULL|-45000|EU|AnalyticsHub|robert@yahoo.com|IT
JP1004| alice brown |34|2025-03-15|85000|Asia Pacific|CloudNet|alice@gmail.com|Tech
JP1005|David King|29|INVALID_DATE|90000|AP|NULL|david@gmail|IT
JP1006|Emma Clark|.|2025-05-12|55000|EUROPE|DataPro|emma@gmail.com|Analytics
JP1007|Chris Evans|41|2025-06-11|125000|USA|VisionAI|chris@gmail.com|AI
JP1008|NULL|38|2025-04-01|76000|APAC|InsightLab|NULL|Data
JP1009|Nina Roy|29|2025-07-18|-1000|US|TechEdge|nina@gmail.com|IT
JP1010|Kevin Moss|33|2025-07-20|95000|asia_pacific|DataWorks|kevin@gmail.com|UNKNOWN
;
run;
proc print data=job_placements_raw;
run;
OUTPUT:
| Obs | Placement_ID | Candidate_Name | Region | Employer | Job_Category | Age | Placement_Date | Salary | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | JP1001 | john smith | john@gmail.com | APAC | TechCorp | IT | 26 | 2025-01-15 | 65000 |
| 2 | JP1001 | john smith | john@gmail.com | APAC | TechCorp | IT | 26 | 2025-01-15 | 65000 |
| 3 | JP1002 | MARY JONES | marygmail.com | US | DataWorks | DATA | 180 | 2025-02-10 | 70000 |
| 4 | JP1003 | robert lee | robert@yahoo.com | EU | AnalyticsHub | IT | 5 | NULL | -45000 |
| 5 | JP1004 | alice brown | alice@gmail.com | Asia Pacific | CloudNet | Tech | 34 | 2025-03-15 | 85000 |
| 6 | JP1005 | David King | david@gmail | AP | NULL | IT | 29 | INVALID_DATE | 90000 |
| 7 | JP1006 | Emma Clark | emma@gmail.com | EUROPE | DataPro | Analytics | . | 2025-05-12 | 55000 |
| 8 | JP1007 | Chris Evans | chris@gmail.com | USA | VisionAI | AI | 41 | 2025-06-11 | 125000 |
| 9 | JP1008 | NULL | NULL | APAC | InsightLab | Data | 38 | 2025-04-01 | 76000 |
| 10 | JP1009 | Nina Roy | nina@gmail.com | US | TechEdge | IT | 29 | 2025-07-18 | -1000 |
| 11 | JP1010 | Kevin Moss | kevin@gmail.com | asia_pacific | DataWorks | UNKNOWN | 33 | 2025-07-20 | 95000 |
Explanation
This
dataset intentionally contains duplicate Placement IDs, invalid ages, malformed
emails, missing dates, salary anomalies, inconsistent region labels, whitespace
corruption, and category inconsistencies. Such issues commonly appear when
recruitment data is collected from multiple job portals, ATS systems, and
regional HR databases.
Key Point
Always
create realistic error scenarios because production systems rarely contain
perfect data.
Why LENGTH Must Appear First
data example;
length Candidate_Name $50;
Candidate_Name='Christopher Alexander Robertson';
if Candidate_Name='Test' then Flag='Y';
run;
proc print data=example;
run;
OUTPUT:
| Obs | Candidate_Name | Flag |
|---|---|---|
| 1 | Christopher Alexander Robertson |
Explanation
SAS
determines variable attributes during compilation. If LENGTH is assigned after
a variable is created, truncation may already have occurred.
Example:
Candidate_Name='Christopher
Alexander Robertson';
length Candidate_Name $50;
The
variable length may already be fixed to the first assigned size.
Character Truncation Risk
In
production SDTM, ADaM, banking, and placement datasets, truncation can destroy
audit traceability.
R Comparison
R
dynamically expands character vectors, reducing truncation risk. SAS requires
explicit control through LENGTH statements.
2.SAS DATA Step Cleaning Workflow
data placements_clean;
set job_placements_raw;
Candidate_Name=propcase(strip(Candidate_Name));
Employer=propcase(strip(Employer));
Region=upcase(strip(Region));
if Region='AP' then Region='APAC';
if Region='ASIA PACIFIC' then Region='APAC';
if Region='ASIA_PACIFIC' then Region='APAC';
if Region='USA' then Region='US';
if Region='EUROPE' then Region='EU';
if Age<18 or Age>70 then Age=.;
Salary=abs(Salary);
if find(Email,'@')=0 then Email='';
Job_Category=upcase(Job_Category);
if Job_Category='UNKNOWN' then Job_Category='OTHER';
run;
proc print data=placements_clean;
run;
OUTPUT:
| Obs | Placement_ID | Candidate_Name | Region | Employer | Job_Category | Age | Placement_Date | Salary | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | JP1001 | John Smith | john@gmail.com | APAC | Techcorp | IT | 26 | 2025-01-15 | 65000 |
| 2 | JP1001 | John Smith | john@gmail.com | APAC | Techcorp | IT | 26 | 2025-01-15 | 65000 |
| 3 | JP1002 | Mary Jones | US | Dataworks | DATA | . | 2025-02-10 | 70000 | |
| 4 | JP1003 | Robert Lee | robert@yahoo.com | EU | Analyticshub | IT | . | NULL | 45000 |
| 5 | JP1004 | Alice Brown | alice@gmail.com | APAC | Cloudnet | TECH | 34 | 2025-03-15 | 85000 |
| 6 | JP1005 | David King | david@gmail | APAC | Null | IT | 29 | INVALID_DATE | 90000 |
| 7 | JP1006 | Emma Clark | emma@gmail.com | EU | Datapro | ANALYTICS | . | 2025-05-12 | 55000 |
| 8 | JP1007 | Chris Evans | chris@gmail.com | US | Visionai | AI | 41 | 2025-06-11 | 125000 |
| 9 | JP1008 | Null | APAC | Insightlab | DATA | 38 | 2025-04-01 | 76000 | |
| 10 | JP1009 | Nina Roy | nina@gmail.com | US | Techedge | IT | 29 | 2025-07-18 | 1000 |
| 11 | JP1010 | Kevin Moss | kevin@gmail.com | APAC | Dataworks | OTHER | 33 | 2025-07-20 | 95000 |
Explanation
This DATA
step uses STRIP, PROPCASE, FIND, ABS, and IF-THEN logic to standardize values
and correct anomalies.
Key Points
- Standardizes text.
- Removes impossible ages.
- Fixes negative salaries.
- Harmonizes region labels.
- Identifies malformed emails.
3.Removing Duplicates Using PROC SORT
proc sort data=placements_clean
out=placements_nodup nodupkey;
by Placement_ID;
run;
proc print data=placements_nodup;
run;
OUTPUT:
OUTPUT:
| Obs | Placement_ID | Candidate_Name | Region | Employer | Job_Category | Age | Placement_Date | Salary | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | JP1001 | John Smith | john@gmail.com | APAC | Techcorp | IT | 26 | 2025-01-15 | 65000 |
| 2 | JP1002 | Mary Jones | US | Dataworks | DATA | . | 2025-02-10 | 70000 | |
| 3 | JP1003 | Robert Lee | robert@yahoo.com | EU | Analyticshub | IT | . | NULL | 45000 |
| 4 | JP1004 | Alice Brown | alice@gmail.com | APAC | Cloudnet | TECH | 34 | 2025-03-15 | 85000 |
| 5 | JP1005 | David King | david@gmail | APAC | Null | IT | 29 | INVALID_DATE | 90000 |
| 6 | JP1006 | Emma Clark | emma@gmail.com | EU | Datapro | ANALYTICS | . | 2025-05-12 | 55000 |
| 7 | JP1007 | Chris Evans | chris@gmail.com | US | Visionai | AI | 41 | 2025-06-11 | 125000 |
| 8 | JP1008 | Null | APAC | Insightlab | DATA | 38 | 2025-04-01 | 76000 | |
| 9 | JP1009 | Nina Roy | nina@gmail.com | US | Techedge | IT | 29 | 2025-07-18 | 1000 |
| 10 | JP1010 | Kevin Moss | kevin@gmail.com | APAC | Dataworks | OTHER | 33 | 2025-07-20 | 95000 |
Explanation
NODUPKEY
retains the first occurrence and removes duplicate Placement IDs.
Key Points
- Prevents inflated hiring
metrics.
- Protects executive reports.
- Improves KPI accuracy.
4.PROC FORMAT for Business Categories
proc format;
value salarygrp low-69999='Entry'
70000-99999='Mid'
100000-high='Senior';
run;
LOG:
Explanation
Formats
create business-friendly reporting categories without modifying source data.
Key Points
- Supports reporting
flexibility.
- Improves dashboard
readability.
- Preserves raw values.
5.PROC SQL Validation and Analytics
proc sql;
create table placement_summary as
select Region,
count(*) as Total_Placements,
avg(Salary) as Avg_Salary format=salarygrp.
from placements_nodup
group by Region;
quit;
proc print data=placement_summary;
run;
OUTPUT:
| Obs | Region | Total_Placements | Avg_Salary |
|---|---|---|---|
| 1 | APAC | 5 | Mid |
| 2 | EU | 2 | Entry |
| 3 | US | 3 | Entry |
Explanation
PROC SQL
provides database-style aggregation and reporting.
Key Points
- Easy summarization.
- Powerful joins.
- Enterprise scalability.
DATA Step Versus PROC SQL Deduplication
6.DATA Step
data dedup_ds;
set placements_clean;
by Placement_ID;
if first.Placement_ID;
run;
proc print data=dedup_ds;
run;
OUTPUT:
| Obs | Placement_ID | Candidate_Name | Region | Employer | Job_Category | Age | Placement_Date | Salary | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | JP1001 | John Smith | john@gmail.com | APAC | Techcorp | IT | 26 | 2025-01-15 | 65000 |
| 2 | JP1002 | Mary Jones | US | Dataworks | DATA | . | 2025-02-10 | 70000 | |
| 3 | JP1003 | Robert Lee | robert@yahoo.com | EU | Analyticshub | IT | . | NULL | 45000 |
| 4 | JP1004 | Alice Brown | alice@gmail.com | APAC | Cloudnet | TECH | 34 | 2025-03-15 | 85000 |
| 5 | JP1005 | David King | david@gmail | APAC | Null | IT | 29 | INVALID_DATE | 90000 |
| 6 | JP1006 | Emma Clark | emma@gmail.com | EU | Datapro | ANALYTICS | . | 2025-05-12 | 55000 |
| 7 | JP1007 | Chris Evans | chris@gmail.com | US | Visionai | AI | 41 | 2025-06-11 | 125000 |
| 8 | JP1008 | Null | APAC | Insightlab | DATA | 38 | 2025-04-01 | 76000 | |
| 9 | JP1009 | Nina Roy | nina@gmail.com | US | Techedge | IT | 29 | 2025-07-18 | 1000 |
| 10 | JP1010 | Kevin Moss | kevin@gmail.com | APAC | Dataworks | OTHER | 33 | 2025-07-20 | 95000 |
7.PROC SQL
proc sql;
create table dedup_sql as
select distinct *
from placements_clean;
quit;
proc print data=dedup_sql;
run;
OUTPUT:
| Obs | Placement_ID | Candidate_Name | Region | Employer | Job_Category | Age | Placement_Date | Salary | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | JP1001 | John Smith | john@gmail.com | APAC | Techcorp | IT | 26 | 2025-01-15 | 65000 |
| 2 | JP1002 | Mary Jones | US | Dataworks | DATA | . | 2025-02-10 | 70000 | |
| 3 | JP1003 | Robert Lee | robert@yahoo.com | EU | Analyticshub | IT | . | NULL | 45000 |
| 4 | JP1004 | Alice Brown | alice@gmail.com | APAC | Cloudnet | TECH | 34 | 2025-03-15 | 85000 |
| 5 | JP1005 | David King | david@gmail | APAC | Null | IT | 29 | INVALID_DATE | 90000 |
| 6 | JP1006 | Emma Clark | emma@gmail.com | EU | Datapro | ANALYTICS | . | 2025-05-12 | 55000 |
| 7 | JP1007 | Chris Evans | chris@gmail.com | US | Visionai | AI | 41 | 2025-06-11 | 125000 |
| 8 | JP1008 | Null | APAC | Insightlab | DATA | 38 | 2025-04-01 | 76000 | |
| 9 | JP1009 | Nina Roy | nina@gmail.com | US | Techedge | IT | 29 | 2025-07-18 | 1000 |
| 10 | JP1010 | Kevin Moss | kevin@gmail.com | APAC | Dataworks | OTHER | 33 | 2025-07-20 | 95000 |
Explanation
DATA Step
uses FIRST./LAST. processing while PROC SQL uses DISTINCT logic.
Key Points
- DATA Step offers row-level
control.
- PROC SQL offers concise
syntax.
Advanced Enterprise SAS Features
8.ARRAY Processing
data array;
set placements_clean;
array txtvars {*} Candidate_Name Employer Region;
do i=1 to dim(txtvars);
txtvars{i}=strip(txtvars{i});
end;
run;
proc print data=array;
run;
OUTPUT:
| Obs | Placement_ID | Candidate_Name | Region | Employer | Job_Category | Age | Placement_Date | Salary | i | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | JP1001 | John Smith | john@gmail.com | APAC | Techcorp | IT | 26 | 2025-01-15 | 65000 | 4 |
| 2 | JP1001 | John Smith | john@gmail.com | APAC | Techcorp | IT | 26 | 2025-01-15 | 65000 | 4 |
| 3 | JP1002 | Mary Jones | US | Dataworks | DATA | . | 2025-02-10 | 70000 | 4 | |
| 4 | JP1003 | Robert Lee | robert@yahoo.com | EU | Analyticshub | IT | . | NULL | 45000 | 4 |
| 5 | JP1004 | Alice Brown | alice@gmail.com | APAC | Cloudnet | TECH | 34 | 2025-03-15 | 85000 | 4 |
| 6 | JP1005 | David King | david@gmail | APAC | Null | IT | 29 | INVALID_DATE | 90000 | 4 |
| 7 | JP1006 | Emma Clark | emma@gmail.com | EU | Datapro | ANALYTICS | . | 2025-05-12 | 55000 | 4 |
| 8 | JP1007 | Chris Evans | chris@gmail.com | US | Visionai | AI | 41 | 2025-06-11 | 125000 | 4 |
| 9 | JP1008 | Null | APAC | Insightlab | DATA | 38 | 2025-04-01 | 76000 | 4 | |
| 10 | JP1009 | Nina Roy | nina@gmail.com | US | Techedge | IT | 29 | 2025-07-18 | 1000 | 4 |
| 11 | JP1010 | Kevin Moss | kevin@gmail.com | APAC | Dataworks | OTHER | 33 | 2025-07-20 | 95000 | 4 |
Explanation
Arrays
allow bulk processing of multiple variables efficiently.
Key Points
- Reduces repetitive code.
- Improves maintainability.
9.RETAIN Example
data retain01;
set placements_clean;
retain Total_Placements 0;
Total_Placements+1;
run;
proc print data=retain01;
run;
OUTPUT:
| Obs | Placement_ID | Candidate_Name | Region | Employer | Job_Category | Age | Placement_Date | Salary | Total_Placements | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | JP1001 | John Smith | john@gmail.com | APAC | Techcorp | IT | 26 | 2025-01-15 | 65000 | 1 |
| 2 | JP1001 | John Smith | john@gmail.com | APAC | Techcorp | IT | 26 | 2025-01-15 | 65000 | 2 |
| 3 | JP1002 | Mary Jones | US | Dataworks | DATA | . | 2025-02-10 | 70000 | 3 | |
| 4 | JP1003 | Robert Lee | robert@yahoo.com | EU | Analyticshub | IT | . | NULL | 45000 | 4 |
| 5 | JP1004 | Alice Brown | alice@gmail.com | APAC | Cloudnet | TECH | 34 | 2025-03-15 | 85000 | 5 |
| 6 | JP1005 | David King | david@gmail | APAC | Null | IT | 29 | INVALID_DATE | 90000 | 6 |
| 7 | JP1006 | Emma Clark | emma@gmail.com | EU | Datapro | ANALYTICS | . | 2025-05-12 | 55000 | 7 |
| 8 | JP1007 | Chris Evans | chris@gmail.com | US | Visionai | AI | 41 | 2025-06-11 | 125000 | 8 |
| 9 | JP1008 | Null | APAC | Insightlab | DATA | 38 | 2025-04-01 | 76000 | 9 | |
| 10 | JP1009 | Nina Roy | nina@gmail.com | US | Techedge | IT | 29 | 2025-07-18 | 1000 | 10 |
| 11 | JP1010 | Kevin Moss | kevin@gmail.com | APAC | Dataworks | OTHER | 33 | 2025-07-20 | 95000 | 11 |
Explanation
RETAIN
preserves values across observations.
Key Points
Useful
for cumulative metrics and audit counters.
10.PROC REPORT
proc report data=placements_nodup nowd;
column Region Salary;
define Region / group;
define Salary / analysis mean;
run;
| Region | Salary |
|---|---|
| APAC | 82200 |
| EU | 50000 |
| US | 65333.333 |
Explanation
PROC
REPORT creates production-quality outputs suitable for executive reporting.
Key Points
- Highly customizable.
- Regulatory-report friendly.
11.R Raw Data
library(tibble)
jobs_raw <- tibble(
placement_id = c("JP1001","JP1001","JP1002","JP1003","JP1004",
"JP1005","JP1006","JP1007","JP1008","JP1009","JP1010","JP1011",
"JP1012","JP1013","JP1014","JP1015","JP1016","JP1017","JP1018",
"JP1019","JP1020"),
candidate_name = c("john smith","john smith","MARY JONES","robert lee",
" alice brown ","David King","Emma Clark","Chris Evans","NULL",
"Nina Roy","Kevin Moss","sarah parker","MICHAEL GREEN"," anna taylor ",
"James Hall","NULL","victor hughes","Olivia White","daniel young",
"EMILY SCOTT","Ryan Adams"),
age = c(26,26,180,5,34,29,NA,41,38,29,33,150,22,-3,45,NA,28,31,200,40,27),
placement_date = c("2025-01-15","2025-01-15","2025-02-10","NULL","2025-03-15",
"INVALID_DATE","2025-05-12","2025-06-11","2025-04-01","2025-07-18",
"2025-07-20","2025-08-01","2025-08-15","","2025-09-10","NULL","2025-10-01",
"2025-10-05","BAD_DATE","2025-11-11","2025-12-01"),
salary = c(65000,65000,70000,-45000,85000,90000,55000,125000,76000,-1000,
95000,110000,45000,50000,130000,-500,78000,82000,99999,120000,67000),
region = c("APAC","APAC","US","EU","Asia Pacific","AP","EUROPE","USA",
"APAC","US","asia_pacific","APAC","EU","AP","USA","NULL","EUROPE",
"Asia Pacific","US","APAC","AP"),
employer = c("TechCorp","TechCorp","DataWorks","AnalyticsHub","CloudNet",
"NULL","DataPro","VisionAI","InsightLab","TechEdge","DataWorks","FutureTech",
"SmartData","NULL","CloudNet","VisionAI","NextGen","TalentHub","DataPro",
"TechEdge","InsightLab"),
email = c("john@gmail.com","john@gmail.com","marygmail.com","robert@yahoo.com",
"alice@gmail.com","david@gmail","emma@gmail.com","chris@gmail.com","NULL",
"nina@gmail.com","kevin@gmail.com","sarahgmail.com","michael@yahoo",
"anna@gmail.com","james@gmail.com","NULL","victor@@gmail.com","olivia@gmail.com",
"danielgmail.com","emily@gmail.com","ryan@gmail"),
job_category = c("IT","IT","DATA","IT","Tech","IT","Analytics","AI","Data",
"IT","UNKNOWN","AI","DATA","Tech","IT","Analytics","INVALID","Data","AI",
"IT","UNKNOWN")
)
OUTPUT:
|
placement_id |
candidate_name |
age |
placement_date |
salary |
region |
employer |
email |
job_category |
|
JP1001 |
john smith |
26 |
2025-01-15 |
65000 |
APAC |
TechCorp |
john@gmail.com |
IT |
|
JP1001 |
john smith |
26 |
2025-01-15 |
65000 |
APAC |
TechCorp |
john@gmail.com |
IT |
|
JP1002 |
MARY JONES |
180 |
2025-02-10 |
70000 |
US |
DataWorks |
marygmail.com |
DATA |
|
JP1003 |
robert lee |
5 |
NULL |
-45000 |
EU |
AnalyticsHub |
robert@yahoo.com |
IT |
|
JP1004 |
alice brown |
34 |
2025-03-15 |
85000 |
Asia Pacific |
CloudNet |
alice@gmail.com |
Tech |
|
JP1005 |
David King |
29 |
INVALID_DATE |
90000 |
AP |
NULL |
david@gmail |
IT |
|
JP1006 |
Emma Clark |
2025-05-12 |
55000 |
EUROPE |
DataPro |
emma@gmail.com |
Analytics |
|
|
JP1007 |
Chris Evans |
41 |
2025-06-11 |
125000 |
USA |
VisionAI |
chris@gmail.com |
AI |
|
JP1008 |
NULL |
38 |
2025-04-01 |
76000 |
APAC |
InsightLab |
NULL |
Data |
|
JP1009 |
Nina Roy |
29 |
2025-07-18 |
-1000 |
US |
TechEdge |
nina@gmail.com |
IT |
|
JP1010 |
Kevin Moss |
33 |
2025-07-20 |
95000 |
asia_pacific |
DataWorks |
kevin@gmail.com |
UNKNOWN |
|
JP1011 |
sarah parker |
150 |
2025-08-01 |
110000 |
APAC |
FutureTech |
sarahgmail.com |
AI |
|
JP1012 |
MICHAEL GREEN |
22 |
2025-08-15 |
45000 |
EU |
SmartData |
michael@yahoo |
DATA |
|
JP1013 |
anna taylor |
-3 |
50000 |
AP |
NULL |
anna@gmail.com |
Tech |
|
|
JP1014 |
James Hall |
45 |
2025-09-10 |
130000 |
USA |
CloudNet |
james@gmail.com |
IT |
|
JP1015 |
NULL |
NULL |
-500 |
NULL |
VisionAI |
NULL |
Analytics |
|
|
JP1016 |
victor hughes |
28 |
2025-10-01 |
78000 |
EUROPE |
NextGen |
victor@@gmail.com |
INVALID |
|
JP1017 |
Olivia White |
31 |
2025-10-05 |
82000 |
Asia Pacific |
TalentHub |
olivia@gmail.com |
Data |
|
JP1018 |
daniel young |
200 |
BAD_DATE |
99999 |
US |
DataPro |
danielgmail.com |
AI |
|
JP1019 |
EMILY SCOTT |
40 |
2025-11-11 |
120000 |
APAC |
TechEdge |
emily@gmail.com |
IT |
|
JP1020 |
Ryan Adams |
27 |
2025-12-01 |
67000 |
AP |
InsightLab |
ryan@gmail |
UNKNOWN |
12.Modern R Cleaning Workflow
library(tidyverse)
library(janitor)
library(lubridate)
jobs_clean <- jobs_raw %>%
clean_names() %>%
mutate(candidate_name =str_to_title(
str_trim(candidate_name)),
region =case_when(region %in% c("AP","Asia Pacific",
"asia_pacific") ~ "APAC",
region=="USA" ~ "US",
region=="EUROPE" ~ "EU",
TRUE ~ region),
salary = abs(salary),
age =if_else(age<18 | age>70,
NA_real_,age),
email =if_else(grepl("@",email),
email,NA_character_),
job_category =str_to_upper(job_category)
) %>%
distinct(placement_id,.keep_all=TRUE)
OUTPUT:
|
placement_id |
candidate_name |
age |
placement_date |
salary |
region |
employer |
email |
job_category |
|
JP1001 |
John Smith |
26 |
2025-01-15 |
65000 |
APAC |
TechCorp |
john@gmail.com |
IT |
|
JP1002 |
Mary Jones |
2025-02-10 |
70000 |
US |
DataWorks |
DATA |
||
|
JP1003 |
Robert Lee |
NULL |
45000 |
EU |
AnalyticsHub |
robert@yahoo.com |
IT |
|
|
JP1004 |
Alice Brown |
34 |
2025-03-15 |
85000 |
APAC |
CloudNet |
alice@gmail.com |
TECH |
|
JP1005 |
David King |
29 |
INVALID_DATE |
90000 |
APAC |
NULL |
david@gmail |
IT |
|
JP1006 |
Emma Clark |
2025-05-12 |
55000 |
EU |
DataPro |
emma@gmail.com |
ANALYTICS |
|
|
JP1007 |
Chris Evans |
41 |
2025-06-11 |
125000 |
US |
VisionAI |
chris@gmail.com |
AI |
|
JP1008 |
Null |
38 |
2025-04-01 |
76000 |
APAC |
InsightLab |
DATA |
|
|
JP1009 |
Nina Roy |
29 |
2025-07-18 |
1000 |
US |
TechEdge |
nina@gmail.com |
IT |
|
JP1010 |
Kevin Moss |
33 |
2025-07-20 |
95000 |
APAC |
DataWorks |
kevin@gmail.com |
UNKNOWN |
|
JP1011 |
Sarah Parker |
2025-08-01 |
110000 |
APAC |
FutureTech |
AI |
||
|
JP1012 |
Michael Green |
22 |
2025-08-15 |
45000 |
EU |
SmartData |
michael@yahoo |
DATA |
|
JP1013 |
Anna Taylor |
50000 |
APAC |
NULL |
anna@gmail.com |
TECH |
||
|
JP1014 |
James Hall |
45 |
2025-09-10 |
130000 |
US |
CloudNet |
james@gmail.com |
IT |
|
JP1015 |
Null |
NULL |
500 |
NULL |
VisionAI |
ANALYTICS |
||
|
JP1016 |
Victor Hughes |
28 |
2025-10-01 |
78000 |
EU |
NextGen |
victor@@gmail.com |
INVALID |
|
JP1017 |
Olivia White |
31 |
2025-10-05 |
82000 |
APAC |
TalentHub |
olivia@gmail.com |
DATA |
|
JP1018 |
Daniel Young |
BAD_DATE |
99999 |
US |
DataPro |
AI |
||
|
JP1019 |
Emily Scott |
40 |
2025-11-11 |
120000 |
APAC |
TechEdge |
emily@gmail.com |
IT |
|
JP1020 |
Ryan Adams |
27 |
2025-12-01 |
67000 |
APAC |
InsightLab |
ryan@gmail |
UNKNOWN |
Explanation
The
tidyverse pipeline performs the same transformations implemented in SAS DATA
Step.
SAS Equivalents
|
R
Function |
SAS
Equivalent |
|
mutate() |
Assignment
Statements |
|
case_when() |
SELECT-WHEN |
|
distinct() |
PROC
SORT NODUPKEY |
|
if_else() |
IF-THEN/ELSE |
|
str_trim() |
STRIP |
|
coalesce() |
COALESCEC |
|
summarise() |
PROC
SUMMARY |
Key Points
R offers
concise syntax while SAS offers stronger enterprise traceability.
Enterprise Validation &
Compliance
In
regulated environments such as clinical trials:
- SDTM datasets must preserve
source traceability.
- ADaM datasets require
reproducible derivations.
- Audit trails must document
every transformation.
- QC programmers should
independently validate outputs.
- Metadata standards must
remain consistent.
A major
SAS risk involves missing numeric values.
Example:
if Salary < 50000 then
Flag='Y';
Missing
values are treated as lower than valid numbers.
Therefore:
Salary=.
may
incorrectly satisfy the condition.
This can
create catastrophic analytical errors affecting efficacy analyses, enrollment
metrics, and executive reporting.
Business Logic Behind
Cleaning
Data
cleaning is not simply a technical activity it is business logic
implementation. Consider a candidate age recorded as 180 years. Statistical
models interpreting this value may produce distorted age distributions and
incorrect workforce planning forecasts. Similarly, a salary entered as -50000
can dramatically impact compensation benchmarking and regional hiring analyses.
Missing placement dates affect time-to-hire calculations, recruiter
productivity metrics, and executive dashboards measuring recruitment
efficiency.
Text
normalization is equally important. Candidate names such as "john
smith", "John Smith", and "JOHN SMITH" may represent
the same individual but appear as different records during analysis. Region
codes like APAC, AP, and Asia Pacific can fragment reporting categories and
create misleading trends.
Missing
values are often imputed when business rules support reasonable assumptions.
For example, a missing placement date may be replaced with an approved
source-system timestamp if documented in metadata. Salary normalization ensures
meaningful compensation analytics. Email validation improves communication
reliability and customer engagement reporting.
Every
correction must be documented because downstream analytics, AI models,
executive reporting, and regulatory submissions depend on trusted data
foundations.
20 Data-Cleaning Best
Practices
- Define metadata before
coding.
- Validate source systems.
- Standardize naming
conventions.
- Use reusable macros.
- Separate raw and cleaned
layers.
- Never overwrite source data.
- Audit every transformation.
- Validate date ranges.
- Standardize missing values.
- Create QC datasets.
- Use independent validation.
- Document assumptions.
- Version control programs.
- Automate anomaly detection.
- Monitor duplicate records.
- Use defensive programming.
- Validate category mappings.
- Maintain data lineage.
- Build reproducible
workflows.
- Perform production
deployment reviews.
20 One-Line Insights
- Dirty data creates expensive
business mistakes.
- Standardized variables
improve reproducibility.
- Validation logic beats
visual inspection.
- Duplicate records distort
reality.
- Missing dates damage
timelines.
- Metadata drives consistency.
- QC independence improves
trust.
- Traceability protects
compliance.
- Data lineage supports
audits.
- Clean inputs create reliable
outputs.
- Automation reduces human
error.
- Defensive programming
prevents failures.
- Standard formats improve
integration.
- Audit trails are
non-negotiable.
- Every variable needs
business meaning.
- Good dashboards require good
data.
- Missing values deserve
special attention.
- Consistent coding improves
scalability.
- Governance improves
reliability.
- Trustworthy analytics start
with clean data.
SAS vs R Comparison
|
Feature |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Excellent |
Growing |
|
Scalability |
Excellent |
Excellent |
|
Traceability |
Strong |
Depends
on workflow |
|
Flexibility |
Moderate |
Very
High |
|
Reporting |
Excellent |
Excellent |
|
Statistical
Depth |
Excellent |
Excellent |
|
Open
Source |
No |
Yes |
Validation Checklist
1.Duplicate
Check
2.Missing
Value Review
3.Range
Validation
4.Date
Validation
5.Email
Validation
6.Category
Standardization
7.Metadata
Verification
8.QC
Review
9.Audit
Trail Review
10.Reporting
Validation
Summary
SAS and R
represent complementary technologies rather than competing solutions. SAS
excels in enterprise-scale governance, auditability, regulatory compliance, and
controlled production environments. Clinical research organizations, banks,
insurers, and government agencies continue to rely heavily on SAS because of
its traceability and validation framework. Features such as DATA Step
processing, PROC SQL, PROC REPORT, PROC FORMAT, and metadata-driven
architectures make SAS highly reliable for large-scale operational systems.
R excels
in flexibility, innovation, and advanced analytics. The tidyverse ecosystem
provides elegant data manipulation capabilities while packages such as janitor,
lubridate, stringr, and purrr simplify complex transformations. Data scientists
often prefer R for exploratory analytics, machine learning, and rapid
development.
A mature
enterprise strategy combines both technologies. SAS can manage regulated
production pipelines while R provides exploratory analytics and advanced
modeling capabilities. Together they deliver scalable, reproducible, and
trustworthy analytical intelligence.
Conclusion
Modern
analytics initiatives fail not because organizations lack dashboards, machine
learning models, or visualization platforms. They fail because the underlying
data cannot be trusted. Whether managing clinical trial participants, insurance
claims, retail transactions, financial portfolios, or global job placement
records, poor-quality data creates operational risk, regulatory exposure, and
flawed business decisions.
The
journey from raw data to analytical intelligence requires a structured
framework that combines governance, validation, transformation, and reporting.
SAS provides the foundation through DATA Step processing, PROC SQL, validation
procedures, metadata control, auditability, and enterprise reporting
capabilities. R extends this framework by enabling highly flexible data
wrangling, advanced statistical analysis, and rapid innovation using modern
packages such as tidyverse, lubridate, janitor, and purrr.
The most
successful organizations treat data cleaning as an engineering discipline
rather than a preprocessing step. Every transformation should be documented,
every assumption validated, every anomaly investigated, and every output
independently reviewed. Duplicate records, missing values, malformed text,
invalid dates, inconsistent categories, and corrupted numeric fields should never
be ignored because even a small defect can propagate through dashboards,
machine learning models, and executive reports.
Analysis-ready
datasets are the foundation of reliable business intelligence. By combining SAS
governance strengths with R flexibility, organizations can build scalable,
compliant, and trustworthy analytical ecosystems capable of supporting
executive decision-making, regulatory submissions, operational excellence, and
future AI initiatives. Clean data is not merely a technical requirement it is a
strategic business asset that drives confidence, accuracy, and sustainable
competitive advantage.
Interview Questions and
Answers
1. A dashboard shows placement counts higher than
expected. How would you investigate?
Answer: I would first check duplicate
Placement IDs using PROC SORT NODUPKEY or PROC SQL COUNT(*) versus
COUNT(DISTINCT Placement_ID). In R, I would use distinct() and compare record
counts.
2. How do you handle invalid ages such as 5 or 180?
Answer: Apply business validation rules.
Ages outside the acceptable range are set to missing and documented. SAS
IF-THEN logic or R if_else() can implement this consistently.
3. Why is missing-value handling critical in SAS?
Answer: SAS treats missing numeric
values as smaller than valid numbers. Conditions such as Salary < 50000 may
incorrectly include missing values unless explicitly checked.
4. When would you choose PROC SQL over DATA Step?
Answer: PROC SQL is ideal for joins,
aggregations, and summarization. DATA Step is preferred for row-level
transformations, FIRST./LAST. processing, and complex business rules.
5. How do you validate a production-ready dataset?
Answer: Perform duplicate checks, range
validation, missing-value analysis, metadata verification, reconciliation
against source systems, QC programming, audit-trail review, and reporting
validation before deployment.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about JOB PLACEMENTS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment