Corrupted Mobile Records to Clinical-Grade Analytics: Real-Time Data Cleaning and Enterprise Reporting with SAS and R
The World’s Most Famous
Mobiles Dataset into Trusted Business Analytics Using Advanced SAS (PROC SQL vs
DATA Step) and R Data Engineering Frameworks
Introduction — When Dirty Data Becomes a
Million-Dollar Problem
Imagine a
real-time healthcare monitoring company tracking patient health using mobile
applications installed on devices like Apple iPhone 15 Pro, Samsung Galaxy S24
Ultra, and Google Pixel 8 Pro.
The
company uses wearable integrations, hospital APIs, and mobile-generated
telemetry data to predict cardiac abnormalities, medication adherence, and
patient risk scores.
Now
imagine the following problems entering the system:
- Duplicate patient device IDs
- Invalid mobile launch dates
- Negative battery capacities
- Corrupted manufacturer names
- Mixed-case operating systems
- Blank healthcare monitoring
statuses
- Truncated text fields
- Invalid pricing values
- Missing region mappings
Suddenly,
dashboards become unreliable. AI models begin predicting false risks.
Regulatory submissions fail validation checks. Auditors raise compliance
concerns. Clinical teams lose trust in analytics.
This is
the hidden battlefield of enterprise analytics.
In modern
organizations, data cleaning is not cosmetic work it is survival
infrastructure.
Healthcare,
banking, insurance, telecom, and clinical research organizations depend heavily
on trustworthy datasets. A single malformed value can create downstream chaos
across SDTM domains, ADaM derivations, TLF generation, and regulatory
reporting.
In this
enterprise-level project, we will build a realistic “Most Famous Mobiles in the
World” dataset integrated with healthcare-style monitoring logic and
intentionally injected data-quality issues. Then we will solve these problems
using advanced SAS and R programming techniques.
Real-Time Healthcare Mobile Dataset with Intentional Errors
A
healthcare technology organization collects patient monitoring activity from
globally famous mobile devices. Data enters from:
- Mobile apps
- Hospital integrations
- IoT wearables
- API streams
- Insurance systems
The raw dataset contains inconsistent records requiring enterprise-grade cleaning.
Raw SAS Dataset Creation with Intentional Errors
/*=========================================================
REAL-TIME MOBILE HEALTHCARE RAW DATASET
Intentional enterprise-level data quality problems added
=========================================================*/
data mobiles_raw;
/* LENGTH must come before logic to avoid truncation */
length Mobile_Name $40 Brand $25 Country $20
Operating_System $20 Monitoring_Status $15;
/* Proper informats for incoming raw data */
informat Launch_Date ddmmyy10.;
/* Output display formats */
format Launch_Date date9. Price dollar10.2;
infile datalines dlm='|' truncover;
input Mobile_ID Mobile_Name $ Brand $ Country $ Launch_Date
Price Battery_mAh Operating_System $ Monitoring_Status $;
datalines;
101|iphone15pro|Apple|USA|12/09/2023|1599|4500|ios|ACTIVE
102|GalaxyS24Ultra|Samsung|SouthKorea|25/01/2024|1400|5000|ANDROID|ACTIVE
103|Pixel8Pro|Google|USA|15/10/2023|-999|5050|android|ACTIVE
104|OnePlus12|OnePlus|India|11/02/2024|899|-5400|Android|NULL
105|Xiaomi14Ultra|Xiaomi|China|31/15/2024|1200|5300|ANDROID|ACTIVE
106|NULL|Apple|USA|01/03/2024|1300|4500|IOS|ACTIVE
107|ROGPhone8|Asus|Taiwan|15/01/2024|1700|5500|ANDROID|ACTIVE
108|GalaxyS24Ultra|Samsung|SouthKorea|25/01/2024|1400|5000|ANDROID|ACTIVE
109|MotoEdge50|Motorola|USA|05/05/2024|799|4700|android|inactive
110|NokiaXR21|Nokia|Finland|10/08/2023|650|4800|ANDROID|ACTIVE
111|iphone15pro|Apple|USA|12/09/2023|1599|4500|ios|ACTIVE
112|VivoX100Pro|Vivo|China|20/01/2024|999|5400|ANDROID|ACTIVE
113|OppoFindX7|Oppo|China|18/01/2024|1100|5000|android|ACTIVE
114|Pixel8Pro|Google|USA|15/10/2023|1299|5050|ANDROID|ACTIVE
115|RealmeGT5|Realme|India|11/11/2023|699|5240|ANDROID|NULL
116|SonyXperia1VI|Sony|Japan|05/03/2024|1800|5000|ANDROID|ACTIVE
117|HuaweiP70|Huawei|China|20/04/2024|1450|5200|ANDROID|ACTIVE
118|NothingPhone2|Nothing|UK|15/07/2023|799|4700|ANDROID|ACTIVE
119|iphone15pro|max|USA|12/09/2023|1599|4500|ios|ACTIVE
120|GalaxyFold6|Samsung|SouthKorea|10/07/2024|2100|4400|ANDROID|ACTIVE
;
run;
proc print data = mobiles_raw;
run;
OUTPUT:
| Obs | Mobile_Name | Brand | Country | Operating_System | Monitoring_Status | Launch_Date | Price | Mobile_ID | Battery_mAh |
|---|---|---|---|---|---|---|---|---|---|
| 1 | iphone15pro | Apple | USA | ios | ACTIVE | 12SEP2023 | $1,599.00 | 101 | 4500 |
| 2 | GalaxyS24Ultra | Samsung | SouthKorea | ANDROID | ACTIVE | 25JAN2024 | $1,400.00 | 102 | 5000 |
| 3 | Pixel8Pro | USA | android | ACTIVE | 15OCT2023 | $-999.00 | 103 | 5050 | |
| 4 | OnePlus12 | OnePlus | India | Android | NULL | 11FEB2024 | $899.00 | 104 | -5400 |
| 5 | Xiaomi14Ultra | Xiaomi | China | ANDROID | ACTIVE | . | $1,200.00 | 105 | 5300 |
| 6 | NULL | Apple | USA | IOS | ACTIVE | 01MAR2024 | $1,300.00 | 106 | 4500 |
| 7 | ROGPhone8 | Asus | Taiwan | ANDROID | ACTIVE | 15JAN2024 | $1,700.00 | 107 | 5500 |
| 8 | GalaxyS24Ultra | Samsung | SouthKorea | ANDROID | ACTIVE | 25JAN2024 | $1,400.00 | 108 | 5000 |
| 9 | MotoEdge50 | Motorola | USA | android | inactive | 05MAY2024 | $799.00 | 109 | 4700 |
| 10 | NokiaXR21 | Nokia | Finland | ANDROID | ACTIVE | 10AUG2023 | $650.00 | 110 | 4800 |
| 11 | iphone15pro | Apple | USA | ios | ACTIVE | 12SEP2023 | $1,599.00 | 111 | 4500 |
| 12 | VivoX100Pro | Vivo | China | ANDROID | ACTIVE | 20JAN2024 | $999.00 | 112 | 5400 |
| 13 | OppoFindX7 | Oppo | China | android | ACTIVE | 18JAN2024 | $1,100.00 | 113 | 5000 |
| 14 | Pixel8Pro | USA | ANDROID | ACTIVE | 15OCT2023 | $1,299.00 | 114 | 5050 | |
| 15 | RealmeGT5 | Realme | India | ANDROID | NULL | 11NOV2023 | $699.00 | 115 | 5240 |
| 16 | SonyXperia1VI | Sony | Japan | ANDROID | ACTIVE | 05MAR2024 | $1,800.00 | 116 | 5000 |
| 17 | HuaweiP70 | Huawei | China | ANDROID | ACTIVE | 20APR2024 | $1,450.00 | 117 | 5200 |
| 18 | NothingPhone2 | Nothing | UK | ANDROID | ACTIVE | 15JUL2023 | $799.00 | 118 | 4700 |
| 19 | iphone15pro | max | USA | ios | ACTIVE | 12SEP2023 | $1,599.00 | 119 | 4500 |
| 20 | GalaxyFold6 | Samsung | SouthKorea | ANDROID | ACTIVE | 10JUL2024 | $2,100.00 | 120 | 4400 |
Explanation of the Dataset
This
dataset intentionally contains enterprise-level issues:
|
Problem |
Example |
|
Negative
price |
-999 |
|
Negative
battery |
-5400 |
|
Invalid
date |
31/15/2024 |
|
Duplicates |
GalaxyS24Ultra |
|
Missing
mobile name |
NULL |
|
Mixed
OS casing |
ios /
IOS / android |
|
Invalid
monitoring status |
NULL |
|
Truncation
risk |
iphone15pro |
Equivalent R Version of mobiles_raw Dataset
library(tidyverse)
library(lubridate)
mobiles_raw <- tibble(
Mobile_ID = c(101,102,103,104,105,106,107,108,109,110,
111,112,113,114,115,116,117,118,119,120),
Mobile_Name = c("iphone15pro","GalaxyS24Ultra","Pixel8Pro","OnePlus12",
"Xiaomi14Ultra","NULL","ROGPhone8","GalaxyS24Ultra","MotoEdge50",
"NokiaXR21","iphone15pro","VivoX100Pro","OppoFindX7","Pixel8Pro",
"RealmeGT5","SonyXperia1VI","HuaweiP70","NothingPhone2",
"iphone15pro","GalaxyFold6"),
Brand = c("Apple","Samsung","Google","OnePlus","Xiaomi","Apple","Asus",
"Samsung","Motorola","Nokia","Apple","Vivo","Oppo","Google",
"Realme","Sony","Huawei","Nothing","max","Samsung"),
Country = c("USA","SouthKorea","USA","India","China","USA","Taiwan",
"SouthKorea","USA","Finland","USA","China","China","USA",
"India","Japan","China","UK","USA","SouthKorea"),
Launch_Date = c("12/09/2023","25/01/2024","15/10/2023","11/02/2024",
"31/15/2024","01/03/2024","15/01/2024","25/01/2024","05/05/2024",
"10/08/2023","12/09/2023","20/01/2024","18/01/2024","15/10/2023",
"11/11/2023","05/03/2024","20/04/2024","15/07/2023","12/09/2023",
"10/07/2024"),
Price = c(1599,1400,-999,899,1200,1300,1700,1400,799,650,599,999,1100,
11299,699,1800,1450,799,1599,2100),
Battery_mAh = c(4500,5000,5050,-5400,5300,4500,5500,5000,4700,4800,
4500,5400,5000,5050,5240,5000,5200,4700,4500,4400),
Operating_System = c("ios","ANDROID","android","Android","ANDROID",
"IOS","ANDROID","ANDROID","android","ANDROID","ios","ANDROID",
"android","ANDROID","ANDROID","ANDROID","ANDROID","ANDROID",
"ios","ANDROID"),
Monitoring_Status = c("ACTIVE","ACTIVE","ACTIVE","NULL","ACTIVE","ACTIVE",
"ACTIVE","ACTIVE","inactive","ACTIVE","ACTIVE","ACTIVE","ACTIVE",
"ACTIVE","NULL","ACTIVE","ACTIVE","ACTIVE","ACTIVE","ACTIVE")
)
OUTPUT:
|
|
MoMobile_ID |
Mobile_Name |
Brand |
Country |
Launch_Date |
Price |
Battery_mAh |
Operating_System |
Monitoring_Status |
|
1 |
101 |
iphone15pro |
Apple |
USA |
12-09-2023 |
1599 |
4500 |
ios |
ACTIVE |
|
2 |
102 |
GalaxyS24Ultra |
Samsung |
SouthKorea |
25-01-2024 |
1400 |
5000 |
ANDROID |
ACTIVE |
|
3 |
103 |
Pixel8Pro |
Google |
USA |
15-10-2023 |
-999 |
5050 |
android |
ACTIVE |
|
4 |
104 |
OnePlus12 |
OnePlus |
India |
11-02-2024 |
899 |
-5400 |
Android |
NULL |
|
5 |
105 |
Xiaomi14Ultra |
Xiaomi |
China |
31/15/2024 |
1200 |
5300 |
ANDROID |
ACTIVE |
|
6 |
106 |
NULL |
Apple |
USA |
01-03-2024 |
1300 |
4500 |
IOS |
ACTIVE |
|
7 |
107 |
ROGPhone8 |
Asus |
Taiwan |
15-01-2024 |
1700 |
5500 |
ANDROID |
ACTIVE |
|
8 |
108 |
GalaxyS24Ultra |
Samsung |
SouthKorea |
25-01-2024 |
1400 |
5000 |
ANDROID |
ACTIVE |
|
9 |
109 |
MotoEdge50 |
Motorola |
USA |
05-05-2024 |
799 |
4700 |
android |
inactive |
|
10 |
110 |
NokiaXR21 |
Nokia |
Finland |
10-08-2023 |
650 |
4800 |
ANDROID |
ACTIVE |
|
11 |
111 |
iphone15pro |
Apple |
USA |
12-09-2023 |
599 |
4500 |
ios |
ACTIVE |
|
12 |
112 |
VivoX100Pro |
Vivo |
China |
20-01-2024 |
999 |
5400 |
ANDROID |
ACTIVE |
|
13 |
113 |
OppoFindX7 |
Oppo |
China |
18-01-2024 |
1100 |
5000 |
android |
ACTIVE |
|
14 |
114 |
Pixel8Pro |
Google |
USA |
15-10-2023 |
11299 |
5050 |
ANDROID |
ACTIVE |
|
15 |
115 |
RealmeGT5 |
Realme |
India |
11-11-2023 |
699 |
5240 |
ANDROID |
NULL |
|
16 |
116 |
SonyXperia1VI |
Sony |
Japan |
05-03-2024 |
1800 |
5000 |
ANDROID |
ACTIVE |
|
17 |
117 |
HuaweiP70 |
Huawei |
China |
20-04-2024 |
1450 |
5200 |
ANDROID |
ACTIVE |
|
18 |
118 |
NothingPhone2 |
Nothing |
UK |
15-07-2023 |
799 |
4700 |
ANDROID |
ACTIVE |
|
19 |
119 |
iphone15pro |
max |
USA |
12-09-2023 |
1599 |
4500 |
ios |
ACTIVE |
|
20 |
120 |
GalaxyFold6 |
Samsung |
SouthKorea |
10-07-2024 |
2100 |
4400 |
ANDROID |
ACTIVE |
Explanation of This R Code
This R workflow recreates your SAS DATALINES
dataset using tibble() from
the tidyverse ecosystem.
Key Enterprise Concepts Used
|
SAS Concept |
R Equivalent |
|
DATALINES |
tibble() |
|
INPUT statement |
Column vectors |
|
FORMAT / INFORMAT |
lubridate::dmy() |
The Truncation Trap in SAS
One of
the most dangerous SAS issues is the “Truncation Trap.”
If LENGTH
is not declared before INPUT or assignment statements, SAS automatically guesses
variable lengths based on the first observed value.
Example:
length Operating_System $20;
Without
this statement:
- “ANDROID” may truncate to
“ANDR”
- “SouthKorea” may truncate
unexpectedly
- Business rules fail silently
This is
extremely dangerous in SDTM mapping and regulatory reporting.
R handles
strings differently because memory allocation is dynamic, whereas SAS uses
fixed-width storage architecture.
Enterprise SAS Data Cleaning Workflow
Step 1 — Standardization Layer
data mobiles_clean;
set mobiles_raw;
/* Standardize text formatting */
Mobile_Name = propcase(strip(Mobile_Name));
Brand = propcase(strip(Brand));
Country = upcase(strip(Country));
Operating_System = upcase(strip(Operating_System));
Monitoring_Status = upcase(strip(Monitoring_Status));
/* Replace NULL text values */
if Mobile_Name='Null' then Mobile_Name='Unknown Device';
if Monitoring_Status='NULL' then Monitoring_Status='PENDING';
/* Correct negative values */
Price = abs(Price);
Battery_mAh = abs(Battery_mAh);
/* Date validation */
if missing(Launch_Date) then Launch_Date = today();
run;
proc print data = mobiles_clean;
run;
OUTPUT:
| Obs | Mobile_Name | Brand | Country | Operating_System | Monitoring_Status | Launch_Date | Price | Mobile_ID | Battery_mAh |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Iphone15pro | Apple | USA | IOS | ACTIVE | 12SEP2023 | $1,599.00 | 101 | 4500 |
| 2 | Galaxys24ultra | Samsung | SOUTHKOREA | ANDROID | ACTIVE | 25JAN2024 | $1,400.00 | 102 | 5000 |
| 3 | Pixel8pro | USA | ANDROID | ACTIVE | 15OCT2023 | $999.00 | 103 | 5050 | |
| 4 | Oneplus12 | Oneplus | INDIA | ANDROID | PENDING | 11FEB2024 | $899.00 | 104 | 5400 |
| 5 | Xiaomi14ultra | Xiaomi | CHINA | ANDROID | ACTIVE | 13MAY2026 | $1,200.00 | 105 | 5300 |
| 6 | Unknown Device | Apple | USA | IOS | ACTIVE | 01MAR2024 | $1,300.00 | 106 | 4500 |
| 7 | Rogphone8 | Asus | TAIWAN | ANDROID | ACTIVE | 15JAN2024 | $1,700.00 | 107 | 5500 |
| 8 | Galaxys24ultra | Samsung | SOUTHKOREA | ANDROID | ACTIVE | 25JAN2024 | $1,400.00 | 108 | 5000 |
| 9 | Motoedge50 | Motorola | USA | ANDROID | INACTIVE | 05MAY2024 | $799.00 | 109 | 4700 |
| 10 | Nokiaxr21 | Nokia | FINLAND | ANDROID | ACTIVE | 10AUG2023 | $650.00 | 110 | 4800 |
| 11 | Iphone15pro | Apple | USA | IOS | ACTIVE | 12SEP2023 | $1,599.00 | 111 | 4500 |
| 12 | Vivox100pro | Vivo | CHINA | ANDROID | ACTIVE | 20JAN2024 | $999.00 | 112 | 5400 |
| 13 | Oppofindx7 | Oppo | CHINA | ANDROID | ACTIVE | 18JAN2024 | $1,100.00 | 113 | 5000 |
| 14 | Pixel8pro | USA | ANDROID | ACTIVE | 15OCT2023 | $1,299.00 | 114 | 5050 | |
| 15 | Realmegt5 | Realme | INDIA | ANDROID | PENDING | 11NOV2023 | $699.00 | 115 | 5240 |
| 16 | Sonyxperia1vi | Sony | JAPAN | ANDROID | ACTIVE | 05MAR2024 | $1,800.00 | 116 | 5000 |
| 17 | Huaweip70 | Huawei | CHINA | ANDROID | ACTIVE | 20APR2024 | $1,450.00 | 117 | 5200 |
| 18 | Nothingphone2 | Nothing | UK | ANDROID | ACTIVE | 15JUL2023 | $799.00 | 118 | 4700 |
| 19 | Iphone15pro | Max | USA | IOS | ACTIVE | 12SEP2023 | $1,599.00 | 119 | 4500 |
| 20 | Galaxyfold6 | Samsung | SOUTHKOREA | ANDROID | ACTIVE | 10JUL2024 | $2,100.00 | 120 | 4400 |
Explanation of Standardization Logic
This DATA
step acts as the first enterprise cleaning layer.
Key SAS Functions Used
|
Function |
Purpose |
|
PROPCASE |
Standardizes
text |
|
STRIP |
Removes
leading/trailing spaces |
|
ABS |
Corrects
negative values |
|
MISSING |
Detects
null values |
|
TODAY() |
Imputes
invalid dates |
Business Importance
Without
standardization:
- Dashboards split “ANDROID”
and “android”
- AI models misclassify
operating systems
- Duplicate detection fails
- Reporting becomes
inconsistent
This
mirrors real SDTM compliance problems in clinical trials.
PROC SQL vs DATA Step Deduplication
/*DATA Step Method*/
proc sort data=mobiles_clean nodupkey;
by Mobile_Name Brand Launch_Date;
run;
proc print data = mobiles_clean;
run;
OUTPUT:
| Obs | Mobile_Name | Brand | Country | Operating_System | Monitoring_Status | Launch_Date | Price | Mobile_ID | Battery_mAh |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Galaxyfold6 | Samsung | SOUTHKOREA | ANDROID | ACTIVE | 10JUL2024 | $2,100.00 | 120 | 4400 |
| 2 | Galaxys24ultra | Samsung | SOUTHKOREA | ANDROID | ACTIVE | 25JAN2024 | $1,400.00 | 102 | 5000 |
| 3 | Huaweip70 | Huawei | CHINA | ANDROID | ACTIVE | 20APR2024 | $1,450.00 | 117 | 5200 |
| 4 | Iphone15pro | Apple | USA | IOS | ACTIVE | 12SEP2023 | $1,599.00 | 101 | 4500 |
| 5 | Iphone15pro | Max | USA | IOS | ACTIVE | 12SEP2023 | $1,599.00 | 119 | 4500 |
| 6 | Motoedge50 | Motorola | USA | ANDROID | INACTIVE | 05MAY2024 | $799.00 | 109 | 4700 |
| 7 | Nokiaxr21 | Nokia | FINLAND | ANDROID | ACTIVE | 10AUG2023 | $650.00 | 110 | 4800 |
| 8 | Nothingphone2 | Nothing | UK | ANDROID | ACTIVE | 15JUL2023 | $799.00 | 118 | 4700 |
| 9 | Oneplus12 | Oneplus | INDIA | ANDROID | PENDING | 11FEB2024 | $899.00 | 104 | 5400 |
| 10 | Oppofindx7 | Oppo | CHINA | ANDROID | ACTIVE | 18JAN2024 | $1,100.00 | 113 | 5000 |
| 11 | Pixel8pro | USA | ANDROID | ACTIVE | 15OCT2023 | $999.00 | 103 | 5050 | |
| 12 | Realmegt5 | Realme | INDIA | ANDROID | PENDING | 11NOV2023 | $699.00 | 115 | 5240 |
| 13 | Rogphone8 | Asus | TAIWAN | ANDROID | ACTIVE | 15JAN2024 | $1,700.00 | 107 | 5500 |
| 14 | Sonyxperia1vi | Sony | JAPAN | ANDROID | ACTIVE | 05MAR2024 | $1,800.00 | 116 | 5000 |
| 15 | Unknown Device | Apple | USA | IOS | ACTIVE | 01MAR2024 | $1,300.00 | 106 | 4500 |
| 16 | Vivox100pro | Vivo | CHINA | ANDROID | ACTIVE | 20JAN2024 | $999.00 | 112 | 5400 |
| 17 | Xiaomi14ultra | Xiaomi | CHINA | ANDROID | ACTIVE | 13MAY2026 | $1,200.00 | 105 | 5300 |
Why PROC SORT NODUPKEY Matters
This
removes duplicate business records while preserving:
- Reporting integrity
- Accurate device counts
- Reliable healthcare
enrollment statistics
DATA Step vs PROC SQL
|
Feature |
DATA
Step |
PROC
SQL |
|
Speed |
Faster
sequential logic |
Better
relational joins |
|
Row-by-row
processing |
Excellent |
Moderate |
|
Complex
joins |
Moderate |
Excellent |
|
Readability |
High |
High |
|
Enterprise
ETL |
Preferred |
Preferred
for integrations |
Advanced SAS Business Logic
/*SELECT-WHEN Classification*/
data mobiles_segmented;
set mobiles_clean;
length Price_Category $20;
select;
when (Price > 1500)
Price_Category='ULTRA PREMIUM';
when (Price >= 1000 and Price <= 1500)
Price_Category='PREMIUM';
otherwise
Price_Category='MID RANGE';
end;
run;
proc print data = mobiles_segmented;
run;
OUTPUT:
| Obs | Mobile_Name | Brand | Country | Operating_System | Monitoring_Status | Launch_Date | Price | Mobile_ID | Battery_mAh | Price_Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Galaxyfold6 | Samsung | SOUTHKOREA | ANDROID | ACTIVE | 10JUL2024 | $2,100.00 | 120 | 4400 | ULTRA PREMIUM |
| 2 | Galaxys24ultra | Samsung | SOUTHKOREA | ANDROID | ACTIVE | 25JAN2024 | $1,400.00 | 102 | 5000 | PREMIUM |
| 3 | Huaweip70 | Huawei | CHINA | ANDROID | ACTIVE | 20APR2024 | $1,450.00 | 117 | 5200 | PREMIUM |
| 4 | Iphone15pro | Apple | USA | IOS | ACTIVE | 12SEP2023 | $1,599.00 | 101 | 4500 | ULTRA PREMIUM |
| 5 | Iphone15pro | Max | USA | IOS | ACTIVE | 12SEP2023 | $1,599.00 | 119 | 4500 | ULTRA PREMIUM |
| 6 | Motoedge50 | Motorola | USA | ANDROID | INACTIVE | 05MAY2024 | $799.00 | 109 | 4700 | MID RANGE |
| 7 | Nokiaxr21 | Nokia | FINLAND | ANDROID | ACTIVE | 10AUG2023 | $650.00 | 110 | 4800 | MID RANGE |
| 8 | Nothingphone2 | Nothing | UK | ANDROID | ACTIVE | 15JUL2023 | $799.00 | 118 | 4700 | MID RANGE |
| 9 | Oneplus12 | Oneplus | INDIA | ANDROID | PENDING | 11FEB2024 | $899.00 | 104 | 5400 | MID RANGE |
| 10 | Oppofindx7 | Oppo | CHINA | ANDROID | ACTIVE | 18JAN2024 | $1,100.00 | 113 | 5000 | PREMIUM |
| 11 | Pixel8pro | USA | ANDROID | ACTIVE | 15OCT2023 | $999.00 | 103 | 5050 | MID RANGE | |
| 12 | Realmegt5 | Realme | INDIA | ANDROID | PENDING | 11NOV2023 | $699.00 | 115 | 5240 | MID RANGE |
| 13 | Rogphone8 | Asus | TAIWAN | ANDROID | ACTIVE | 15JAN2024 | $1,700.00 | 107 | 5500 | ULTRA PREMIUM |
| 14 | Sonyxperia1vi | Sony | JAPAN | ANDROID | ACTIVE | 05MAR2024 | $1,800.00 | 116 | 5000 | ULTRA PREMIUM |
| 15 | Unknown Device | Apple | USA | IOS | ACTIVE | 01MAR2024 | $1,300.00 | 106 | 4500 | PREMIUM |
| 16 | Vivox100pro | Vivo | CHINA | ANDROID | ACTIVE | 20JAN2024 | $999.00 | 112 | 5400 | MID RANGE |
| 17 | Xiaomi14ultra | Xiaomi | CHINA | ANDROID | ACTIVE | 13MAY2026 | $1,200.00 | 105 | 5300 | PREMIUM |
Explanation
SELECT-WHEN
is cleaner than multiple IF-THEN statements.
It improves:
- Readability
- Processing logic
- Enterprise maintainability
Very
useful in ADaM derivations and TLF categorization.
Using ARRAYS for Bulk Cleaning
data mobiles_array;
set mobiles_clean;
array nums(*) Price Battery_mAh;
do i=1 to dim(nums);
if nums(i)<0 then nums(i)=abs(nums(i));
end;
drop i;
run;
proc print data = mobiles_array;
run;
OUTPUT:
| Obs | Mobile_Name | Brand | Country | Operating_System | Monitoring_Status | Launch_Date | Price | Mobile_ID | Battery_mAh |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Galaxyfold6 | Samsung | SOUTHKOREA | ANDROID | ACTIVE | 10JUL2024 | $2,100.00 | 120 | 4400 |
| 2 | Galaxys24ultra | Samsung | SOUTHKOREA | ANDROID | ACTIVE | 25JAN2024 | $1,400.00 | 102 | 5000 |
| 3 | Huaweip70 | Huawei | CHINA | ANDROID | ACTIVE | 20APR2024 | $1,450.00 | 117 | 5200 |
| 4 | Iphone15pro | Apple | USA | IOS | ACTIVE | 12SEP2023 | $1,599.00 | 101 | 4500 |
| 5 | Iphone15pro | Max | USA | IOS | ACTIVE | 12SEP2023 | $1,599.00 | 119 | 4500 |
| 6 | Motoedge50 | Motorola | USA | ANDROID | INACTIVE | 05MAY2024 | $799.00 | 109 | 4700 |
| 7 | Nokiaxr21 | Nokia | FINLAND | ANDROID | ACTIVE | 10AUG2023 | $650.00 | 110 | 4800 |
| 8 | Nothingphone2 | Nothing | UK | ANDROID | ACTIVE | 15JUL2023 | $799.00 | 118 | 4700 |
| 9 | Oneplus12 | Oneplus | INDIA | ANDROID | PENDING | 11FEB2024 | $899.00 | 104 | 5400 |
| 10 | Oppofindx7 | Oppo | CHINA | ANDROID | ACTIVE | 18JAN2024 | $1,100.00 | 113 | 5000 |
| 11 | Pixel8pro | USA | ANDROID | ACTIVE | 15OCT2023 | $999.00 | 103 | 5050 | |
| 12 | Realmegt5 | Realme | INDIA | ANDROID | PENDING | 11NOV2023 | $699.00 | 115 | 5240 |
| 13 | Rogphone8 | Asus | TAIWAN | ANDROID | ACTIVE | 15JAN2024 | $1,700.00 | 107 | 5500 |
| 14 | Sonyxperia1vi | Sony | JAPAN | ANDROID | ACTIVE | 05MAR2024 | $1,800.00 | 116 | 5000 |
| 15 | Unknown Device | Apple | USA | IOS | ACTIVE | 01MAR2024 | $1,300.00 | 106 | 4500 |
| 16 | Vivox100pro | Vivo | CHINA | ANDROID | ACTIVE | 20JAN2024 | $999.00 | 112 | 5400 |
| 17 | Xiaomi14ultra | Xiaomi | CHINA | ANDROID | ACTIVE | 13MAY2026 | $1,200.00 | 105 | 5300 |
Why ARRAYS Matter
ARRAYS
are essential when:
- Cleaning 100+ variables
- Processing lab domains
- Validating clinical
measurements
Instead
of writing repetitive logic, arrays automate bulk transformations efficiently.
PROC FORMAT for Professional Outputs
proc format;
value pricefmt low-999='Budget'
1000-1499='Premium'
1500-high='Luxury';
run;
LOG:
PROC REPORT Dashboard
proc report data=mobiles_clean nowd;
columns Brand Country Price Battery_mAh;
define Brand / group;
define Country / group;
define Price / analysis mean format=pricefmt.;
define Battery_mAh / analysis mean;
run;
OUTPUT:
| Brand | Country | Price | Battery_mAh |
|---|---|---|---|
| Apple | USA | Premium | 4500 |
| Asus | TAIWAN | Luxury | 5500 |
| USA | Budget | 5050 | |
| Huawei | CHINA | Premium | 5200 |
| Max | USA | Luxury | 4500 |
| Motorola | USA | Budget | 4700 |
| Nokia | FINLAND | Budget | 4800 |
| Nothing | UK | Budget | 4700 |
| Oneplus | INDIA | Budget | 5400 |
| Oppo | CHINA | Premium | 5000 |
| Realme | INDIA | Budget | 5240 |
| Samsung | SOUTHKOREA | Luxury | 4700 |
| Sony | JAPAN | Luxury | 5000 |
| Vivo | CHINA | Budget | 5400 |
| Xiaomi | CHINA | Premium | 5300 |
Reporting Importance
PROC
REPORT is heavily used in:
- Clinical reporting
- Regulatory submissions
- Audit-ready dashboards
- Executive summaries
It
produces presentation-quality outputs directly from SAS.
PROC SUMMARY for Aggregation
proc summary data=mobiles_clean nway;
class Brand;
var Price Battery_mAh;
output out=summary_stats mean=Avg_Price Avg_Battery;
run;
proc print data = summary_stats;
run;
OUTPUT:
| Obs | Brand | _TYPE_ | _FREQ_ | Avg_Price | Avg_Battery |
|---|---|---|---|---|---|
| 1 | Apple | 1 | 2 | $1,449.50 | 4500 |
| 2 | Asus | 1 | 1 | $1,700.00 | 5500 |
| 3 | 1 | 1 | $999.00 | 5050 | |
| 4 | Huawei | 1 | 1 | $1,450.00 | 5200 |
| 5 | Max | 1 | 1 | $1,599.00 | 4500 |
| 6 | Motorola | 1 | 1 | $799.00 | 4700 |
| 7 | Nokia | 1 | 1 | $650.00 | 4800 |
| 8 | Nothing | 1 | 1 | $799.00 | 4700 |
| 9 | Oneplus | 1 | 1 | $899.00 | 5400 |
| 10 | Oppo | 1 | 1 | $1,100.00 | 5000 |
| 11 | Realme | 1 | 1 | $699.00 | 5240 |
| 12 | Samsung | 1 | 2 | $1,750.00 | 4700 |
| 13 | Sony | 1 | 1 | $1,800.00 | 5000 |
| 14 | Vivo | 1 | 1 | $999.00 | 5400 |
| 15 | Xiaomi | 1 | 1 | $1,200.00 | 5300 |
Why PROC SUMMARY is Powerful
PROC
SUMMARY is optimized for:
- Enterprise aggregation
- Statistical reporting
- Real-time dashboards
- Performance-heavy
environments
It is
faster than many manual aggregation methods.
R Refinement Layer
R Data Cleaning Workflow
library(tidyverse)
library(lubridate)
library(janitor)
mobiles_clean <- mobiles_raw %>%
clean_names() %>%
mutate(mobile_name = str_trim(str_to_title(mobile_name)),
brand = str_to_title(brand),
operating_system = str_to_upper(operating_system),
monitoring_status = case_when(
monitoring_status=="NULL" ~ "PENDING",TRUE ~ monitoring_status),
price = abs(price),
battery_m_ah = abs(battery_m_ah),
launch_date = suppressWarnings(dmy(launch_date)) %>%
distinct()
View(mobiles_clean)
OUTPUT:
|
|
mobile_id |
mobile_name |
brand |
country |
launch_date |
price |
battery_m_ah |
operating_system |
monitoring_status |
|
1 |
101 |
Iphone15pro |
Apple |
USA |
12-09-2023 |
1599 |
4500 |
IOS |
ACTIVE |
|
2 |
102 |
Galaxys24ultra |
Samsung |
SouthKorea |
25-01-2024 |
1400 |
5000 |
ANDROID |
ACTIVE |
|
3 |
103 |
Pixel8pro |
Google |
USA |
15-10-2023 |
999 |
5050 |
ANDROID |
ACTIVE |
|
4 |
104 |
Oneplus12 |
Oneplus |
India |
11-02-2024 |
899 |
5400 |
ANDROID |
PENDING |
|
5 |
105 |
Xiaomi14ultra |
Xiaomi |
China |
NA |
1200 |
5300 |
ANDROID |
ACTIVE |
|
6 |
106 |
Null |
Apple |
USA |
01-03-2024 |
1300 |
4500 |
IOS |
ACTIVE |
|
7 |
107 |
Rogphone8 |
Asus |
Taiwan |
15-01-2024 |
1700 |
5500 |
ANDROID |
ACTIVE |
|
8 |
108 |
Galaxys24ultra |
Samsung |
SouthKorea |
25-01-2024 |
1400 |
5000 |
ANDROID |
ACTIVE |
|
9 |
109 |
Motoedge50 |
Motorola |
USA |
05-05-2024 |
799 |
4700 |
ANDROID |
inactive |
|
10 |
110 |
Nokiaxr21 |
Nokia |
Finland |
10-08-2023 |
650 |
4800 |
ANDROID |
ACTIVE |
|
11 |
111 |
Iphone15pro |
Apple |
USA |
12-09-2023 |
599 |
4500 |
IOS |
ACTIVE |
|
12 |
112 |
Vivox100pro |
Vivo |
China |
20-01-2024 |
999 |
5400 |
ANDROID |
ACTIVE |
|
13 |
113 |
Oppofindx7 |
Oppo |
China |
18-01-2024 |
1100 |
5000 |
ANDROID |
ACTIVE |
|
14 |
114 |
Pixel8pro |
Google |
USA |
15-10-2023 |
11299 |
5050 |
ANDROID |
ACTIVE |
|
15 |
115 |
Realmegt5 |
Realme |
India |
11-11-2023 |
699 |
5240 |
ANDROID |
PENDING |
|
16 |
116 |
Sonyxperia1vi |
Sony |
Japan |
05-03-2024 |
1800 |
5000 |
ANDROID |
ACTIVE |
|
17 |
117 |
Huaweip70 |
Huawei |
China |
20-04-2024 |
1450 |
5200 |
ANDROID |
ACTIVE |
|
18 |
118 |
Nothingphone2 |
Nothing |
UK |
15-07-2023 |
799 |
4700 |
ANDROID |
ACTIVE |
|
19 |
119 |
Iphone15pro |
Max |
USA |
12-09-2023 |
1599 |
4500 |
IOS |
ACTIVE |
|
20 |
120 |
Galaxyfold6 |
Samsung |
SouthKorea |
10-07-2024 |
2100 |
4400 |
ANDROID |
ACTIVE |
SAS vs R Cleaning
Comparison
|
SAS |
R
Equivalent |
|
PROPCASE |
str_to_title() |
|
STRIP |
str_trim() |
|
IF-THEN |
if_else() |
|
PROC
SORT NODUPKEY |
distinct() |
|
PROC
SQL |
dplyr
joins |
|
MISSING() |
is.na() |
|
CATX |
unite() |
|
SCAN |
separate() |
Missing Value Trap in SAS
SAS
treats missing numeric values as smaller than any number.
Example:
if Price < 1000 then
Flag='YES';
Missing
prices will incorrectly qualify.
Correct
approach:
if not missing(Price) and Price
< 1000 then Flag='YES';
This is
one of the most dangerous enterprise SAS mistakes.
20 Golden Rules for
Enterprise SAS/R Projects
- Always define LENGTH before
logic.
- Never trust raw source
systems.
- Standardize casing
immediately.
- Remove duplicates early.
- Validate date formats.
- Audit negative numeric values.
- Use reusable macros.
- Separate raw and clean
layers.
- Preserve audit trails.
- Use PROC CONTENTS
frequently.
- Validate joins carefully.
- Avoid hardcoding values.
- Use formats for reporting
consistency.
- Create QC validation steps.
- Track derivation logic.
- Maintain SDTM compliance.
- Use ARRAYS for scalability.
- Document every
transformation.
- Test missing-value
scenarios.
- Build reproducible
workflows.
20 Additional Data Cleaning
Best Practices
- Validate SDTM variable
lengths.
- Maintain Define.xml
consistency.
- Preserve source
traceability.
- Never overwrite raw
datasets.
- Validate controlled
terminology.
- Ensure ISO8601 date
compliance.
- Audit imputation logic.
- Review duplicate patient
IDs.
- Validate treatment dates.
- Monitor protocol deviations.
- Standardize adverse event
coding.
- Validate laboratory units.
- Track derivation lineage.
- Create validation datasets.
- Implement QC double
programming.
- Document reviewer comments.
- Protect PHI-sensitive data.
- Use metadata-driven
programming.
- Validate ADaM derivations.
- Maintain audit-ready outputs.
Business Logic Behind Data
Cleaning
Data
cleaning is fundamentally about protecting decision quality. In healthcare
analytics, a single missing value can incorrectly classify a patient’s risk
profile. In banking, malformed salary values can reject valid loan applicants.
In clinical trials, invalid dates can destroy treatment-emergent adverse event
derivations.
Consider
patient age correction. If age appears as “250,” the system must validate and
correct or quarantine the record because unrealistic ages distort demographic
statistics and machine-learning predictions.
Similarly,
salary normalization ensures financial systems compare standardized income
values instead of mixed currencies or malformed text strings.
Date
imputation is equally critical. If hospital admission dates are partially
missing, statistical teams may use controlled imputation logic to preserve
analytical continuity while maintaining regulatory traceability.
Replacing
missing values improves model stability, reporting completeness, and
operational trust. However, imputation must always follow documented business
rules and regulatory guidance.
Data
cleaning therefore is not merely technical preprocessing it is enterprise risk
management. Every corrected value, standardized string, duplicate removal, and
validated date directly impacts business intelligence accuracy, compliance
readiness, AI reliability, and executive decision-making quality.
20 Sharp Enterprise
Insights
- Dirty data leads to wrong
conclusions.
- Standardization ensures reproducibility.
- Missing dates can destroy
timelines.
- Duplicate records inflate
metrics.
- Truncation silently corrupts
analytics.
- Validation protects business
credibility.
- Audit trails ensure
regulatory trust.
- Clean data improves AI
accuracy.
- PROC SQL simplifies
relational logic.
- DATA step excels in row
processing.
- ARRAYS reduce repetitive
code.
- Formats improve reporting
consistency.
- Macros increase scalability.
- Documentation prevents
confusion.
- QC programming catches
hidden issues.
- Controlled terminology improves
compliance.
- Consistent casing prevents
duplicate categories.
- Data governance improves
enterprise confidence.
- Real-time pipelines require
strict validation.
- Clean datasets create
trustworthy intelligence.
Summary
SAS and R
both dominate enterprise analytics, but each platform offers distinct
advantages in professional data-cleaning ecosystems.
SAS
excels in regulated industries such as clinical trials, pharmaceuticals,
banking, and insurance because of its structured architecture, reproducibility,
audit readiness, and scalable batch-processing capabilities. DATA step
programming provides exceptional row-level control, while PROC SQL, PROC
REPORT, PROC SUMMARY, and PROC FORMAT create enterprise-grade reporting systems
trusted by regulatory agencies worldwide.
SAS also
handles large-volume production pipelines extremely efficiently. Features like
FIRST./LAST. processing, arrays, retained variables, macros, and
metadata-driven programming make SAS ideal for SDTM, ADaM, and TLF workflows.
R, on the
other hand, offers flexibility, rapid experimentation, and advanced open-source
integrations. Packages such as tidyverse, dplyr, stringr, lubridate, janitor,
and purrr simplify complex transformations into readable pipelines. R is highly
favored for exploratory analytics, machine learning, visualization, and agile
statistical workflows.
The
strongest enterprise teams often combine both technologies strategically.
SAS
becomes the controlled production engine:
- Regulatory reporting
- Clinical submissions
- Enterprise ETL
- Audit-ready outputs
R becomes
the innovation layer:
- Exploratory analysis
- Predictive modeling
- Visualization
- Flexible transformations
Together,
SAS and R create a powerful analytics ecosystem capable of transforming chaotic
raw datasets into trusted business intelligence platforms supporting
healthcare, finance, telecom, and AI-driven decision systems.
Conclusion
Modern
organizations are no longer limited by data availability they are limited by
data quality.
The world
now generates enormous streams of information from smartphones, healthcare
systems, wearable devices, banking platforms, IoT infrastructure, and AI
applications. However, raw data arriving from these systems is rarely clean,
consistent, or analysis-ready. Missing values, malformed text, duplicate
records, invalid dates, inconsistent formatting, and logical conflicts
continuously threaten analytical reliability.
This is
why structured data-cleaning frameworks using SAS and R have become
mission-critical enterprise capabilities.
In this
project, we transformed a globally famous mobiles dataset into a
professional-grade analytics pipeline using advanced SAS DATA step programming,
PROC SQL, reporting procedures, macros, and modern R refinement workflows.
Along the way, we explored enterprise challenges such as truncation risks,
duplicate management, missing-value traps, date validation, business-rule
standardization, and audit-ready reporting.
The
deeper lesson is this:
Data
cleaning is not a support task.
It is the foundation of trustworthy intelligence.
Whether
building SDTM datasets for clinical trials, fraud-detection systems for
banking, AI-ready telecom pipelines, or executive dashboards for healthcare
monitoring, clean data determines whether organizations make accurate decisions
or catastrophic mistakes.
SAS
provides industrial-grade governance, reproducibility, and regulatory
confidence.
R provides flexibility, innovation, and analytical acceleration.
Together,
they create an ecosystem capable of converting broken datasets into strategic
assets.
The
future belongs not merely to organizations with the most data but to
organizations with the cleanest, most reliable, and most intelligently
engineered data pipelines.
Interview Questions and
Answers
1. Why is LENGTH placement important in SAS?
Answer:
If LENGTH
is declared after variable assignment, SAS fixes variable size using the first
encountered value, causing truncation. This silently corrupts downstream
analytics and regulatory outputs.
2. Difference between PROC SQL and DATA Step?
Answer:
DATA step
is optimized for sequential row-level transformations, while PROC SQL is
stronger for joins, aggregations, and relational logic.
3. How do you handle duplicate healthcare records?
Answer:
Use PROC
SORT NODUPKEY, business-rule validation, and audit comparison reports before
deleting duplicates.
4. How would you debug missing-value issues in SAS?
Answer:
Use
MISSING(), PROC FREQ, PROC MEANS NMISS, and validation flags to identify hidden
null propagation issues.
5. A clinical dataset shows
treatment end dates before treatment start dates. How would you fix it?
Answer:
First
validate source-system integrity, identify malformed dates, use INTCK and
logical validation rules, quarantine invalid records, document corrections, and
rerun QC validation before downstream ADaM derivations.
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about MOBILE DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment