Playgrounds, Predictions and Precision: Enterprise Data Cleaning Strategies with SAS and R
From Swing Sets to Statistical Gold: Engineering the World's Best Playgrounds Dataset into Enterprise Intelligence with SAS and R
Introduction
Imagine a
multinational smart-city organization planning investments into children's
recreational infrastructure across continents. Executives use AI models to
identify the world's best playground designs for future urban projects.
Unfortunately,
the underlying dataset contains:
- duplicate playground
identifiers,
- invalid installation dates,
- negative annual maintenance
costs,
- impossible visitor counts,
- malformed contact emails,
- inconsistent country names,
- corrupted safety ratings,
- missing inspection dates,
- and mixed text formatting.
The
result?
A
playground in Japan appears twice and receives double funding.
A
playground with failed safety inspections is classified as
"Excellent".
An AI
prediction engine identifies an abandoned playground as a premium investment
candidate.
In
healthcare this resembles duplicate patient enrollment in clinical trials. In
banking it resembles incorrect fraud classification. In insurance it resembles
claim rejection due to corrupted identifiers.
Dirty
data silently destroys trust.
Dashboards
lie.
Machine
learning models drift.
Regulatory
submissions fail validation.
Executives
make million-dollar mistakes.
Global Best Playgrounds
Dataset
|
Variable |
Description |
|
Playground_ID |
Unique
playground identifier |
|
Playground_Name |
Name of
playground |
|
Country |
Country
location |
|
Safety_Rating |
Safety
category |
|
Annual_Visitors |
Annual
visitors |
|
Maintenance_Cost |
Annual
maintenance cost |
|
Inspection_Date |
Latest
inspection date |
|
Manager_Email |
Contact
email |
|
Region_Code |
Operational
region |
1.SAS Raw Dataset with Intentional Errors
data playground_raw;
length Playground_ID $12 Playground_Name $40 Country $25
Safety_Rating $20 Manager_Email $60 Region_Code $12
Inspection_Date $20;
informat Inspection_Date $20.;
format Maintenance_Cost dollar12.2;
infile datalines dlm='|' dsd truncover;
input Playground_ID $ Playground_Name $ Country $ Safety_Rating $
Annual_Visitors Maintenance_Cost Inspection_Date $ Manager_Email $
Region_Code $;
datalines;
PG001|Wonder Park|usa|Excellent|500000|250000|2025-12-15|admin@wonder.com|NA
PG001|Wonder Park|USA|Excellent|500000|250000|2025-12-15|admin@wonder.com|NA
PG002|Sky Adventure|Japan|excellent|400000|-350000|2025-15-10|sky.jp|APAC
PG003|Kids Planet| INDIA |Good|.|175000|NULL|kidsplanet@gmail|AP
PG004|Dream Forest|Canada|BAD|450000|210000|2024-11-11|forest@dream.ca|northamerica
PG005|Ocean Play|Australia|Excellent|-1000|195000|2024-08-20|ocean@play.com|APAC
PG006|Fun City|UK|GOOD|390000|210000|2024-02-31|fun.city.com|EU
PG007|Rainbow Yard|Germany|NULL|410000|220000|2025-04-15|rainbow@yard.de|EU
PG008|Magic Kingdom|France|Excellent|370000|200000|2025-01-14|magic@kingdom.fr|EU
PG009|Adventure Hill|Brazil|Good|380000|-150000|2025-05-10|hill@adv.com|LATAM
PG010|Jungle Zone|India|Excellent|420000|225000|2024-12-12|jungle@@gmail.com|APAC
;
run;
proc print data=playground_raw;
run;
OUTPUT:
| Obs | Playground_ID | Playground_Name | Country | Safety_Rating | Manager_Email | Region_Code | Inspection_Date | Maintenance_Cost | Annual_Visitors |
|---|---|---|---|---|---|---|---|---|---|
| 1 | PG001 | Wonder Park | usa | Excellent | admin@wonder.com | NA | 2025-12-15 | $250,000.00 | 500000 |
| 2 | PG001 | Wonder Park | USA | Excellent | admin@wonder.com | NA | 2025-12-15 | $250,000.00 | 500000 |
| 3 | PG002 | Sky Adventure | Japan | excellent | sky.jp | APAC | 2025-15-10 | $-350,000.00 | 400000 |
| 4 | PG003 | Kids Planet | INDIA | Good | kidsplanet@gmail | AP | NULL | $175,000.00 | . |
| 5 | PG004 | Dream Forest | Canada | BAD | forest@dream.ca | northamerica | 2024-11-11 | $210,000.00 | 450000 |
| 6 | PG005 | Ocean Play | Australia | Excellent | ocean@play.com | APAC | 2024-08-20 | $195,000.00 | -1000 |
| 7 | PG006 | Fun City | UK | GOOD | fun.city.com | EU | 2024-02-31 | $210,000.00 | 390000 |
| 8 | PG007 | Rainbow Yard | Germany | NULL | rainbow@yard.de | EU | 2025-04-15 | $220,000.00 | 410000 |
| 9 | PG008 | Magic Kingdom | France | Excellent | magic@kingdom.fr | EU | 2025-01-14 | $200,000.00 | 370000 |
| 10 | PG009 | Adventure Hill | Brazil | Good | hill@adv.com | LATAM | 2025-05-10 | $-150,000.00 | 380000 |
| 11 | PG010 | Jungle Zone | India | Excellent | jungle@@gmail.com | APAC | 2024-12-12 | $225,000.00 | 420000 |
Explanation
This
dataset intentionally contains almost every enterprise-quality issue
encountered during production deployments:
- duplicate IDs,
- negative numeric values,
- impossible dates,
- malformed emails,
- whitespace corruption,
- inconsistent casing,
- NULL strings,
- invalid categories.
These
errors mirror validation failures commonly observed during SDTM and ADaM preparation.
Character Truncation Risk
One of
the most misunderstood SAS behaviors involves LENGTH statements.
Incorrect:
data test;
name='International Adventure Playground';
length name $10;
run;
proc print data=test;
run;
LOG:
OUTPUT:
| Obs | name |
|---|---|
| 1 | Internatio |
Correct:
data test;
length name $50;
name='International Adventure Playground';
run;
proc print data=test;
run;
OUTPUT:
| Obs | name |
|---|---|
| 1 | International Adventure Playground |
SAS
allocates memory during variable creation.
Once
created, increasing LENGTH does not recover lost characters.
R behaves
differently.
R strings
are dynamically allocated and do not suffer from post-assignment truncation.
This
difference explains many migration issues when converting R pipelines into SAS
production code.
2.SAS Enterprise Cleaning Workflow
proc format;
value $ratefmt 'EXCELLENT'='Excellent'
'GOOD'='Good'
otherwise='Needs Review';
run;
LOG:
data playground_clean;
retain Source_System "GLOBAL_PLAYGROUND_PORTAL";
set playground_raw;
Playground_ID=compress(upcase(strip(Playground_ID)));
Country=propcase(strip(country));
Safety_Rating=upcase(strip(Safety_Rating));
Safety_Rating=put(Safety_Rating,$ratefmt.);
Maintenance_Cost=abs(Maintenance_Cost);
Annual_Visitors=abs(Annual_Visitors);
Region_Code=upcase(strip(Region_Code));
Region_Code=tranwrd(Region_Code,'NORTHAMERICA','NA');
Manager_Email=lowcase(strip(Manager_Email));
if find(Manager_Email,'@')=0 then
Manager_Email='missing@email.com';
Inspection_DT=input(Inspection_Date,?? yymmdd10.);
if missing(Inspection_DT) then
Inspection_DT=intnx('month',today(),-1,'same');
Inspection_Age=intck('day',Inspection_DT,today());
if Annual_Visitors<1000 then Visitor_Flag='REVIEW';
array txt(*) Country Safety_Rating Region_Code;
do i=1 to dim(txt);
txt(i)=strip(txt(i));
end;
drop i Inspection_Date;
rename Inspection_DT=Inspection_Date;
format Inspection_DT yymmdd10.;
run;
proc print data=playground_clean;
run;
OUTPUT:
| Obs | Source_System | Playground_ID | Playground_Name | Country | Safety_Rating | Manager_Email | Region_Code | Maintenance_Cost | Annual_Visitors | Inspection_Date | Inspection_Age | Visitor_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_PLAYGROUND_PORTAL | PG001 | Wonder Park | Usa | Excellent | admin@wonder.com | NA | $250,000.00 | 500000 | 2025-12-15 | 193 | |
| 2 | GLOBAL_PLAYGROUND_PORTAL | PG001 | Wonder Park | Usa | Excellent | admin@wonder.com | NA | $250,000.00 | 500000 | 2025-12-15 | 193 | |
| 3 | GLOBAL_PLAYGROUND_PORTAL | PG002 | Sky Adventure | Japan | Excellent | missing@email.com | APAC | $350,000.00 | 400000 | 2026-05-26 | 31 | |
| 4 | GLOBAL_PLAYGROUND_PORTAL | PG003 | Kids Planet | India | Good | kidsplanet@gmail | AP | $175,000.00 | . | 2026-05-26 | 31 | REVIEW |
| 5 | GLOBAL_PLAYGROUND_PORTAL | PG004 | Dream Forest | Canada | BAD | forest@dream.ca | NA | $210,000.00 | 450000 | 2024-11-11 | 592 | |
| 6 | GLOBAL_PLAYGROUND_PORTAL | PG005 | Ocean Play | Australia | Excellent | ocean@play.com | APAC | $195,000.00 | 1000 | 2024-08-20 | 675 | |
| 7 | GLOBAL_PLAYGROUND_PORTAL | PG006 | Fun City | Uk | Good | missing@email.com | EU | $210,000.00 | 390000 | 2026-05-26 | 31 | |
| 8 | GLOBAL_PLAYGROUND_PORTAL | PG007 | Rainbow Yard | Germany | NULL | rainbow@yard.de | EU | $220,000.00 | 410000 | 2025-04-15 | 437 | |
| 9 | GLOBAL_PLAYGROUND_PORTAL | PG008 | Magic Kingdom | France | Excellent | magic@kingdom.fr | EU | $200,000.00 | 370000 | 2025-01-14 | 528 | |
| 10 | GLOBAL_PLAYGROUND_PORTAL | PG009 | Adventure Hill | Brazil | Good | hill@adv.com | LATAM | $150,000.00 | 380000 | 2025-05-10 | 412 | |
| 11 | GLOBAL_PLAYGROUND_PORTAL | PG010 | Jungle Zone | India | Excellent | jungle@@gmail.com | APAC | $225,000.00 | 420000 | 2024-12-12 | 561 |
Explanation
This DATA
step demonstrates enterprise cleaning logic:
- COMPRESS() removes hidden
spaces.
- ABS() corrects negative
values.
- INTNX() imputes missing
dates.
- ARRAYS standardize multiple
variables simultaneously.
- RETAIN preserves metadata
lineage.
- FIND() validates email
structures.
The ?? modifier:
- suppresses the NOTE,
- suppresses _ERROR_=1,
- simply returns missing.
This
mirrors production SDTM transformation pipelines.
3.Removing Duplicate Records
proc sort data=playground_clean nodupkey
out=playground_nodup;
by Playground_ID;
run;
proc print data=playground_nodup;
run;
LOG:
OUTPUT:
| Obs | Source_System | Playground_ID | Playground_Name | Country | Safety_Rating | Manager_Email | Region_Code | Maintenance_Cost | Annual_Visitors | Inspection_Date | Inspection_Age | Visitor_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_PLAYGROUND_PORTAL | PG001 | Wonder Park | Usa | Excellent | admin@wonder.com | NA | $250,000.00 | 500000 | 2025-12-15 | 193 | |
| 2 | GLOBAL_PLAYGROUND_PORTAL | PG002 | Sky Adventure | Japan | Excellent | missing@email.com | APAC | $350,000.00 | 400000 | 2026-05-26 | 31 | |
| 3 | GLOBAL_PLAYGROUND_PORTAL | PG003 | Kids Planet | India | Good | kidsplanet@gmail | AP | $175,000.00 | . | 2026-05-26 | 31 | REVIEW |
| 4 | GLOBAL_PLAYGROUND_PORTAL | PG004 | Dream Forest | Canada | BAD | forest@dream.ca | NA | $210,000.00 | 450000 | 2024-11-11 | 592 | |
| 5 | GLOBAL_PLAYGROUND_PORTAL | PG005 | Ocean Play | Australia | Excellent | ocean@play.com | APAC | $195,000.00 | 1000 | 2024-08-20 | 675 | |
| 6 | GLOBAL_PLAYGROUND_PORTAL | PG006 | Fun City | Uk | Good | missing@email.com | EU | $210,000.00 | 390000 | 2026-05-26 | 31 | |
| 7 | GLOBAL_PLAYGROUND_PORTAL | PG007 | Rainbow Yard | Germany | NULL | rainbow@yard.de | EU | $220,000.00 | 410000 | 2025-04-15 | 437 | |
| 8 | GLOBAL_PLAYGROUND_PORTAL | PG008 | Magic Kingdom | France | Excellent | magic@kingdom.fr | EU | $200,000.00 | 370000 | 2025-01-14 | 528 | |
| 9 | GLOBAL_PLAYGROUND_PORTAL | PG009 | Adventure Hill | Brazil | Good | hill@adv.com | LATAM | $150,000.00 | 380000 | 2025-05-10 | 412 | |
| 10 | GLOBAL_PLAYGROUND_PORTAL | PG010 | Jungle Zone | India | Excellent | jungle@@gmail.com | APAC | $225,000.00 | 420000 | 2024-12-12 | 561 |
Explanation
NODUPKEY
removes duplicate business keys while preserving the first occurrence.
Clinical
programming frequently uses this approach to eliminate duplicate subject
records.
4.PROC SQL versus DATA Step
PROC SQL
proc sql;
create table playground_summary as
select Country,
count(*) as Total_Parks,
mean(Maintenance_Cost) as Avg_Cost
from playground_nodup
group by Country;
quit;
proc print data=playground_summary;
run;
OUTPUT:
| Obs | Country | Total_Parks | Avg_Cost |
|---|---|---|---|
| 1 | Australia | 1 | 195000 |
| 2 | Brazil | 1 | 150000 |
| 3 | Canada | 1 | 210000 |
| 4 | France | 1 | 200000 |
| 5 | Germany | 1 | 220000 |
| 6 | India | 2 | 200000 |
| 7 | Japan | 1 | 350000 |
| 8 | Uk | 1 | 210000 |
| 9 | Usa | 1 | 250000 |
DATA Step Alternative
proc summary data=playground_nodup nway;
class Country;
var Maintenance_Cost;
output out=summary_ds mean=Avg_Cost
n=Total_Parks;
run;
proc print data=summary_ds;
run;
OUTPUT:
| Obs | Country | _TYPE_ | _FREQ_ | Avg_Cost | Total_Parks |
|---|---|---|---|---|---|
| 1 | Australia | 1 | 1 | $195,000.00 | 1 |
| 2 | Brazil | 1 | 1 | $150,000.00 | 1 |
| 3 | Canada | 1 | 1 | $210,000.00 | 1 |
| 4 | France | 1 | 1 | $200,000.00 | 1 |
| 5 | Germany | 1 | 1 | $220,000.00 | 1 |
| 6 | India | 1 | 2 | $200,000.00 | 2 |
| 7 | Japan | 1 | 1 | $350,000.00 | 1 |
| 8 | Uk | 1 | 1 | $210,000.00 | 1 |
| 9 | Usa | 1 | 1 | $250,000.00 | 1 |
Explanation
PROC SQL
is declarative and intuitive.
DATA Step
processing offers greater row-level control and scalability.
Clinical
programmers often combine both approaches.
5.FIRST./LAST. Processing
proc sort data=playground_nodup;
by Country;
run;
proc print data=playground_nodup;
run;
OUTPUT:
| Obs | Source_System | Playground_ID | Playground_Name | Country | Safety_Rating | Manager_Email | Region_Code | Maintenance_Cost | Annual_Visitors | Inspection_Date | Inspection_Age | Visitor_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_PLAYGROUND_PORTAL | PG005 | Ocean Play | Australia | Excellent | ocean@play.com | APAC | $195,000.00 | 1000 | 2024-08-20 | 675 | |
| 2 | GLOBAL_PLAYGROUND_PORTAL | PG009 | Adventure Hill | Brazil | Good | hill@adv.com | LATAM | $150,000.00 | 380000 | 2025-05-10 | 412 | |
| 3 | GLOBAL_PLAYGROUND_PORTAL | PG004 | Dream Forest | Canada | BAD | forest@dream.ca | NA | $210,000.00 | 450000 | 2024-11-11 | 592 | |
| 4 | GLOBAL_PLAYGROUND_PORTAL | PG008 | Magic Kingdom | France | Excellent | magic@kingdom.fr | EU | $200,000.00 | 370000 | 2025-01-14 | 528 | |
| 5 | GLOBAL_PLAYGROUND_PORTAL | PG007 | Rainbow Yard | Germany | NULL | rainbow@yard.de | EU | $220,000.00 | 410000 | 2025-04-15 | 437 | |
| 6 | GLOBAL_PLAYGROUND_PORTAL | PG003 | Kids Planet | India | Good | kidsplanet@gmail | AP | $175,000.00 | . | 2026-05-26 | 31 | REVIEW |
| 7 | GLOBAL_PLAYGROUND_PORTAL | PG010 | Jungle Zone | India | Excellent | jungle@@gmail.com | APAC | $225,000.00 | 420000 | 2024-12-12 | 561 | |
| 8 | GLOBAL_PLAYGROUND_PORTAL | PG002 | Sky Adventure | Japan | Excellent | missing@email.com | APAC | $350,000.00 | 400000 | 2026-05-26 | 31 | |
| 9 | GLOBAL_PLAYGROUND_PORTAL | PG006 | Fun City | Uk | Good | missing@email.com | EU | $210,000.00 | 390000 | 2026-05-26 | 31 | |
| 10 | GLOBAL_PLAYGROUND_PORTAL | PG001 | Wonder Park | Usa | Excellent | admin@wonder.com | NA | $250,000.00 | 500000 | 2025-12-15 | 193 |
data country_stats;
set playground_nodup;
by Country;
retain Counter;
if first.Country then Counter=0;
Counter+1;
if last.Country then output;
run;
proc print data=country_stats;
run;
OUTPUT:
| Obs | Source_System | Playground_ID | Playground_Name | Country | Safety_Rating | Manager_Email | Region_Code | Maintenance_Cost | Annual_Visitors | Inspection_Date | Inspection_Age | Visitor_Flag | Counter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_PLAYGROUND_PORTAL | PG005 | Ocean Play | Australia | Excellent | ocean@play.com | APAC | $195,000.00 | 1000 | 2024-08-20 | 675 | 1 | |
| 2 | GLOBAL_PLAYGROUND_PORTAL | PG009 | Adventure Hill | Brazil | Good | hill@adv.com | LATAM | $150,000.00 | 380000 | 2025-05-10 | 412 | 1 | |
| 3 | GLOBAL_PLAYGROUND_PORTAL | PG004 | Dream Forest | Canada | BAD | forest@dream.ca | NA | $210,000.00 | 450000 | 2024-11-11 | 592 | 1 | |
| 4 | GLOBAL_PLAYGROUND_PORTAL | PG008 | Magic Kingdom | France | Excellent | magic@kingdom.fr | EU | $200,000.00 | 370000 | 2025-01-14 | 528 | 1 | |
| 5 | GLOBAL_PLAYGROUND_PORTAL | PG007 | Rainbow Yard | Germany | NULL | rainbow@yard.de | EU | $220,000.00 | 410000 | 2025-04-15 | 437 | 1 | |
| 6 | GLOBAL_PLAYGROUND_PORTAL | PG010 | Jungle Zone | India | Excellent | jungle@@gmail.com | APAC | $225,000.00 | 420000 | 2024-12-12 | 561 | 2 | |
| 7 | GLOBAL_PLAYGROUND_PORTAL | PG002 | Sky Adventure | Japan | Excellent | missing@email.com | APAC | $350,000.00 | 400000 | 2026-05-26 | 31 | 1 | |
| 8 | GLOBAL_PLAYGROUND_PORTAL | PG006 | Fun City | Uk | Good | missing@email.com | EU | $210,000.00 | 390000 | 2026-05-26 | 31 | 1 | |
| 9 | GLOBAL_PLAYGROUND_PORTAL | PG001 | Wonder Park | Usa | Excellent | admin@wonder.com | NA | $250,000.00 | 500000 | 2025-12-15 | 193 | 1 |
Explanation
FIRST./LAST.
logic forms the backbone of visit derivations, exposure calculations, and
patient disposition summaries.
6.PROC TRANSPOSE
proc transpose data=playground_nodup out=transpose_report;
by Country;
var Annual_Visitors Maintenance_Cost;
run;
proc print data=transpose_report;
run;
OUTPUT:
| Obs | Country | _NAME_ | COL1 | COL2 |
|---|---|---|---|---|
| 1 | Australia | Annual_Visitors | 1000 | . |
| 2 | Australia | Maintenance_Cost | 195000 | . |
| 3 | Brazil | Annual_Visitors | 380000 | . |
| 4 | Brazil | Maintenance_Cost | 150000 | . |
| 5 | Canada | Annual_Visitors | 450000 | . |
| 6 | Canada | Maintenance_Cost | 210000 | . |
| 7 | France | Annual_Visitors | 370000 | . |
| 8 | France | Maintenance_Cost | 200000 | . |
| 9 | Germany | Annual_Visitors | 410000 | . |
| 10 | Germany | Maintenance_Cost | 220000 | . |
| 11 | India | Annual_Visitors | . | 420000 |
| 12 | India | Maintenance_Cost | 175000 | 225000 |
| 13 | Japan | Annual_Visitors | 400000 | . |
| 14 | Japan | Maintenance_Cost | 350000 | . |
| 15 | Uk | Annual_Visitors | 390000 | . |
| 16 | Uk | Maintenance_Cost | 210000 | . |
| 17 | Usa | Annual_Visitors | 500000 | . |
| 18 | Usa | Maintenance_Cost | 250000 | . |
Explanation
Transpose
operations reshape datasets for executive reporting and dashboard generation.
7.Reusable SAS Macro
%macro check_missing(ds);
data ds;
set &ds end=last;
retain Missing_Count 0;
if cmiss(of _all_)>0 then Missing_Count+1;
if last then put "Total observations with missing values = "
Missing_Count;
run;
proc print data=ds;
run;
%mend;
%check_missing(playground_nodup);
| Obs | Source_System | Playground_ID | Playground_Name | Country | Safety_Rating | Manager_Email | Region_Code | Maintenance_Cost | Annual_Visitors | Inspection_Date | Inspection_Age | Visitor_Flag | Missing_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_PLAYGROUND_PORTAL | PG005 | Ocean Play | Australia | Excellent | ocean@play.com | APAC | $195,000.00 | 1000 | 2024-08-20 | 675 | 1 | |
| 2 | GLOBAL_PLAYGROUND_PORTAL | PG009 | Adventure Hill | Brazil | Good | hill@adv.com | LATAM | $150,000.00 | 380000 | 2025-05-10 | 412 | 2 | |
| 3 | GLOBAL_PLAYGROUND_PORTAL | PG004 | Dream Forest | Canada | BAD | forest@dream.ca | NA | $210,000.00 | 450000 | 2024-11-11 | 592 | 3 | |
| 4 | GLOBAL_PLAYGROUND_PORTAL | PG008 | Magic Kingdom | France | Excellent | magic@kingdom.fr | EU | $200,000.00 | 370000 | 2025-01-14 | 528 | 4 | |
| 5 | GLOBAL_PLAYGROUND_PORTAL | PG007 | Rainbow Yard | Germany | NULL | rainbow@yard.de | EU | $220,000.00 | 410000 | 2025-04-15 | 437 | 5 | |
| 6 | GLOBAL_PLAYGROUND_PORTAL | PG003 | Kids Planet | India | Good | kidsplanet@gmail | AP | $175,000.00 | . | 2026-05-26 | 31 | REVIEW | 6 |
| 7 | GLOBAL_PLAYGROUND_PORTAL | PG010 | Jungle Zone | India | Excellent | jungle@@gmail.com | APAC | $225,000.00 | 420000 | 2024-12-12 | 561 | 7 | |
| 8 | GLOBAL_PLAYGROUND_PORTAL | PG002 | Sky Adventure | Japan | Excellent | missing@email.com | APAC | $350,000.00 | 400000 | 2026-05-26 | 31 | 8 | |
| 9 | GLOBAL_PLAYGROUND_PORTAL | PG006 | Fun City | Uk | Good | missing@email.com | EU | $210,000.00 | 390000 | 2026-05-26 | 31 | 9 | |
| 10 | GLOBAL_PLAYGROUND_PORTAL | PG001 | Wonder Park | Usa | Excellent | admin@wonder.com | NA | $250,000.00 | 500000 | 2025-12-15 | 193 | 10 |
Explanation
Macros
standardize validation logic and improve reproducibility across hundreds of
studies.
8.RawData In R
playground_raw <- data.frame(
Playground_ID = c(
"PG001","PG001","PG002","PG003","PG004",
"PG005","PG006","PG007","PG008","PG009","PG010"),
Playground_Name = c(
"Wonder Park","Wonder Park","Sky Adventure","Kids Planet",
"Dream Forest","Ocean Play","Fun City","Rainbow Yard",
"Magic Kingdom","Adventure Hill","Jungle Zone"),
Country = c(
"usa","USA","Japan"," INDIA ","Canada",
"Australia","UK","Germany","France","Brazil","India"),
Safety_Rating = c(
"Excellent","Excellent","excellent","Good","BAD",
"Excellent","GOOD","NULL","Excellent","Good","Excellent"),
Annual_Visitors = c(
500000,500000,400000,NA,450000,
-1000,390000,410000,370000,380000,420000),
Maintenance_Cost = c(
250000,250000,-350000,175000,210000,
195000,210000,220000,200000,-150000,225000 ),
Inspection_Date = c(
"2025-12-15","2025-12-15","2025-15-10","NULL",
"2024-11-11","2024-08-20","2024-02-31","2025-04-15",
"2025-01-14","2025-05-10","2024-12-12"),
Manager_Email = c(
"admin@wonder.com","admin@wonder.com","sky.jp",
"kidsplanet@gmail","forest@dream.ca","ocean@play.com",
"fun.city.com","rainbow@yard.de","magic@kingdom.fr",
"hill@adv.com","jungle@@gmail.com"),
Region_Code = c(
"NA","NA","APAC","AP","northamerica",
"APAC","EU","EU","EU","LATAM","APAC"),
stringsAsFactors = FALSE)
OUTPUT:
|
Playground_ID |
Playground_Name |
Country |
Safety_Rating |
Annual_Visitors |
Maintenance_Cost |
Inspection_Date |
Manager_Email |
Region_Code |
|
PG001 |
Wonder
Park |
usa |
Excellent |
500000 |
250000 |
2025-12-15 |
admin@wonder.com |
NA |
|
PG001 |
Wonder
Park |
USA |
Excellent |
500000 |
250000 |
2025-12-15 |
admin@wonder.com |
NA |
|
PG002 |
Sky
Adventure |
Japan |
excellent |
400000 |
-350000 |
2025-15-10 |
sky.jp |
APAC |
|
PG003 |
Kids
Planet |
INDIA |
Good |
175000 |
NULL |
kidsplanet@gmail |
AP |
|
|
PG004 |
Dream
Forest |
Canada |
BAD |
450000 |
210000 |
2024-11-11 |
forest@dream.ca |
northamerica |
|
PG005 |
Ocean
Play |
Australia |
Excellent |
-1000 |
195000 |
2024-08-20 |
ocean@play.com |
APAC |
|
PG006 |
Fun City |
UK |
GOOD |
390000 |
210000 |
2024-02-31 |
fun.city.com |
EU |
|
PG007 |
Rainbow
Yard |
Germany |
NULL |
410000 |
220000 |
2025-04-15 |
rainbow@yard.de |
EU |
|
PG008 |
Magic
Kingdom |
France |
Excellent |
370000 |
200000 |
2025-01-14 |
magic@kingdom.fr |
EU |
|
PG009 |
Adventure
Hill |
Brazil |
Good |
380000 |
-150000 |
2025-05-10 |
hill@adv.com |
LATAM |
|
PG010 |
Jungle
Zone |
India |
Excellent |
420000 |
225000 |
2024-12-12 |
jungle@@gmail.com |
APAC |
9.R Data Cleaning Layer
library(tidyverse)
library(lubridate)
library(janitor)
playground_clean <-
playground_raw %>%
clean_names() %>%
mutate(
country=str_to_title(str_trim(country)),
manager_email=str_to_lower(manager_email),
maintenance_cost=abs(maintenance_cost),
annual_visitors=abs(annual_visitors),
safety_rating=case_when(
safety_rating=="excellent"~"Excellent",
safety_rating=="GOOD"~"Good",TRUE~"Needs Review"),
inspection_date =suppressWarnings(
parse_date_time(inspection_date,
orders = c("ymd","dmy"))),
inspection_date=coalesce(inspection_date,
today()),
region_code =str_replace_all(
str_to_upper(region_code),"NORTHAMERICA","NA")
) %>%
distinct(playground_id,.keep_all=TRUE)
|
playground_id |
playground_name |
country |
safety_rating |
annual_visitors |
maintenance_cost |
inspection_date |
manager_email |
region_code |
|
PG001 |
Wonder
Park |
Usa |
Needs
Review |
500000 |
250000 |
2025-12-15 00:00:00 UTC |
admin@wonder.com |
NA |
|
PG002 |
Sky
Adventure |
Japan |
Excellent |
400000 |
350000 |
2026-06-26 00:00:00 UTC |
sky.jp |
APAC |
|
PG003 |
Kids
Planet |
India |
Needs
Review |
175000 |
2026-06-26 00:00:00 UTC |
kidsplanet@gmail |
AP |
|
|
PG004 |
Dream
Forest |
Canada |
Needs
Review |
450000 |
210000 |
2024-11-11 00:00:00 UTC |
forest@dream.ca |
NA |
|
PG005 |
Ocean
Play |
Australia |
Needs
Review |
1000 |
195000 |
2024-08-20 00:00:00 UTC |
ocean@play.com |
APAC |
|
PG006 |
Fun City |
Uk |
Good |
390000 |
210000 |
2026-06-26 00:00:00 UTC |
fun.city.com |
EU |
|
PG007 |
Rainbow
Yard |
Germany |
Needs
Review |
410000 |
220000 |
2025-04-15 00:00:00 UTC |
rainbow@yard.de |
EU |
|
PG008 |
Magic
Kingdom |
France |
Needs
Review |
370000 |
200000 |
2025-01-14 00:00:00 UTC |
magic@kingdom.fr |
EU |
|
PG009 |
Adventure
Hill |
Brazil |
Needs
Review |
380000 |
150000 |
2025-05-10 00:00:00 UTC |
hill@adv.com |
LATAM |
|
PG010 |
Jungle
Zone |
India |
Needs
Review |
420000 |
225000 |
2024-12-12 00:00:00 UTC |
jungle@@gmail.com |
APAC |
Explanation
R
emphasizes pipeline readability.
Equivalent
SAS functionality relies heavily on DATA step processing.
mutate()
resembles assignment statements.
case_when()
behaves similarly to SELECT-WHEN.
distinct()
mirrors PROC SORT NODUPKEY.
Validation and Compliance
Clinical
environments demand:
- SDTM traceability.
- ADaM reproducibility.
- Independent QC.
- Audit trail preservation.
- Metadata governance.
- Regulatory transparency.
One
dangerous SAS behavior:
if Age < 18;
Missing
numeric values satisfy this condition because SAS treats missing values as
smaller than all valid numbers.
This can
incorrectly classify subjects and invalidate regulatory submissions.
Always
use:
if not missing(Age) and
Age<18;
20 Data Cleaning Best
Practices
- Standardize metadata first.
- Validate business keys.
- Remove duplicates early.
- Never overwrite raw data.
- Maintain audit trails.
- Use reusable macros.
- Validate date formats.
- Document assumptions.
- Version control programs.
- Use independent QC.
- Standardize labels.
- Protect lineage.
- Centralize formats.
- Monitor truncation risk.
- Validate ranges.
- Check category integrity.
- Automate reports.
- Validate joins.
- Review missingness patterns.
- Build defensive code.
Business Logic Behind
Cleaning
Missing
values are rarely random.
A missing
patient visit date may indicate delayed treatment.
A
negative salary may indicate ingestion failures.
An age of
450 years reflects mapping corruption.
Text
normalization converts:
india
INDIA
India
into:
India
This
dramatically improves joins and aggregation accuracy.
Imputation
strategies should always be documented.
20 One-Line Insights
- Dirty data creates expensive
business mistakes.
- Validation logic beats
visual inspection.
- Metadata drives
reproducibility.
- Missing values can mislead
models.
- Duplicate records inflate
metrics.
- Formats improve consistency.
- Audit trails build trust.
- Defensive programming
prevents disasters.
- Standardization improves
scalability.
- Automation reduces risk.
- Macros improve efficiency.
- Arrays simplify cleaning.
- SQL improves readability.
- DATA steps improve control.
- QC independence matters.
- Traceability matters.
- Governance matters.
- Reproducibility matters.
- Documentation matters.
- Clean data wins.
SAS versus R
|
Capability |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Excellent |
Growing |
|
Flexibility |
Moderate |
Excellent |
|
Visualization |
Good |
Excellent |
|
Scalability |
Excellent |
Good |
|
Metadata
Control |
Excellent |
Moderate |
|
Open
Source Ecosystem |
Limited |
Excellent |
Conclusion
Modern
analytics depends less on sophisticated algorithms and more on trustworthy data
foundations.
Whether
analyzing playground safety performance, clinical outcomes, banking
transactions, or insurance claims, poor-quality data propagates errors into
every downstream layer of analytics.
SAS
provides exceptional strengths in governance, auditability, reproducibility,
and regulatory compliance. Its DATA Step engine remains one of the most
powerful data engineering tools ever created for enterprise environments.
Features such as FIRST./LAST. processing, RETAIN statements, ARRAY handling,
PROC FORMAT, and reusable MACROS make SAS uniquely suited for large-scale
production pipelines.
R
complements SAS by providing exceptional flexibility, rapid development,
advanced visualization, and open-source innovation. Tidyverse pipelines allow
analysts to express complex transformations elegantly while maintaining
readability and maintainability.
Together,
SAS and R form a powerful hybrid ecosystem.
SAS
provides governance.
R
provides agility.
SAS
provides compliance.
R
provides experimentation.
SAS
provides production reliability.
R
provides analytical creativity.
Organizations
that combine both platforms create resilient, scalable, and trustworthy
analytical ecosystems capable of supporting executive decisions, regulatory
submissions, machine learning pipelines, and enterprise intelligence
initiatives for decades to come.
Interview Questions
1. Why use NODUPKEY instead of NODUP?
Answer:
NODUPKEY removes duplicate BY-group keys while NODUP removes completely
identical rows.
2. Why should LENGTH appear before assignments?
Answer:
SAS determines variable length during creation. Late LENGTH statements cannot
recover truncated values.
3. Why is missing numeric handling dangerous in
SAS?
Answer:
Missing numeric values sort below valid values and can satisfy logical
conditions unintentionally.
4. When should PROC SQL be preferred?
Answer:
Complex joins and aggregations are usually simpler in PROC SQL.
5. When is DATA Step superior?
Answer:
Row-by-row transformations, retained variables, arrays, and FIRST./LAST. logic
are easier and faster in DATA Step workflows.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about PLAYGROUNDS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment