The World’s Most Famous Mobiles Dataset into Trusted Business Analytics Using Advanced SAS (PROC SQL vs DATA Step) and R Data Engineering Frameworks

Introduction — When Dirty Data Becomes a Million-Dollar Problem

Imagine a real-time healthcare monitoring company tracking patient health using mobile applications installed on devices like Apple iPhone 15 Pro, Samsung Galaxy S24 Ultra, and Google Pixel 8 Pro.

The company uses wearable integrations, hospital APIs, and mobile-generated telemetry data to predict cardiac abnormalities, medication adherence, and patient risk scores.

Now imagine the following problems entering the system:

Duplicate patient device IDs
Invalid mobile launch dates
Negative battery capacities
Corrupted manufacturer names
Mixed-case operating systems
Blank healthcare monitoring statuses
Truncated text fields
Invalid pricing values
Missing region mappings

Suddenly, dashboards become unreliable. AI models begin predicting false risks. Regulatory submissions fail validation checks. Auditors raise compliance concerns. Clinical teams lose trust in analytics.

This is the hidden battlefield of enterprise analytics.

In modern organizations, data cleaning is not cosmetic work it is survival infrastructure.

Healthcare, banking, insurance, telecom, and clinical research organizations depend heavily on trustworthy datasets. A single malformed value can create downstream chaos across SDTM domains, ADaM derivations, TLF generation, and regulatory reporting.

In this enterprise-level project, we will build a realistic “Most Famous Mobiles in the World” dataset integrated with healthcare-style monitoring logic and intentionally injected data-quality issues. Then we will solve these problems using advanced SAS and R programming techniques.

Real-Time Healthcare Mobile Dataset with Intentional Errors

A healthcare technology organization collects patient monitoring activity from globally famous mobile devices. Data enters from:

Mobile apps
Hospital integrations
IoT wearables
API streams
Insurance systems

The raw dataset contains inconsistent records requiring enterprise-grade cleaning.

Raw SAS Dataset Creation with Intentional Errors

/*=========================================================

REAL-TIME MOBILE HEALTHCARE RAW DATASET

Intentional enterprise-level data quality problems added

=========================================================*/

data mobiles_raw;

/* LENGTH must come before logic to avoid truncation */

length Mobile_Name $40 Brand $25 Country $20

Operating_System $20 Monitoring_Status $15;

/* Proper informats for incoming raw data */

informat Launch_Date ddmmyy10.;

/* Output display formats */

format Launch_Date date9. Price dollar10.2;

infile datalines dlm='|' truncover;

input Mobile_ID Mobile_Name $ Brand $ Country $ Launch_Date

Price Battery_mAh Operating_System $ Monitoring_Status $;

datalines;

101|iphone15pro|Apple|USA|12/09/2023|1599|4500|ios|ACTIVE

102|GalaxyS24Ultra|Samsung|SouthKorea|25/01/2024|1400|5000|ANDROID|ACTIVE

103|Pixel8Pro|Google|USA|15/10/2023|-999|5050|android|ACTIVE

104|OnePlus12|OnePlus|India|11/02/2024|899|-5400|Android|NULL

105|Xiaomi14Ultra|Xiaomi|China|31/15/2024|1200|5300|ANDROID|ACTIVE

106|NULL|Apple|USA|01/03/2024|1300|4500|IOS|ACTIVE

107|ROGPhone8|Asus|Taiwan|15/01/2024|1700|5500|ANDROID|ACTIVE

108|GalaxyS24Ultra|Samsung|SouthKorea|25/01/2024|1400|5000|ANDROID|ACTIVE

109|MotoEdge50|Motorola|USA|05/05/2024|799|4700|android|inactive

110|NokiaXR21|Nokia|Finland|10/08/2023|650|4800|ANDROID|ACTIVE

111|iphone15pro|Apple|USA|12/09/2023|1599|4500|ios|ACTIVE

112|VivoX100Pro|Vivo|China|20/01/2024|999|5400|ANDROID|ACTIVE

113|OppoFindX7|Oppo|China|18/01/2024|1100|5000|android|ACTIVE

114|Pixel8Pro|Google|USA|15/10/2023|1299|5050|ANDROID|ACTIVE

115|RealmeGT5|Realme|India|11/11/2023|699|5240|ANDROID|NULL

116|SonyXperia1VI|Sony|Japan|05/03/2024|1800|5000|ANDROID|ACTIVE

117|HuaweiP70|Huawei|China|20/04/2024|1450|5200|ANDROID|ACTIVE

118|NothingPhone2|Nothing|UK|15/07/2023|799|4700|ANDROID|ACTIVE

119|iphone15pro|max|USA|12/09/2023|1599|4500|ios|ACTIVE

120|GalaxyFold6|Samsung|SouthKorea|10/07/2024|2100|4400|ANDROID|ACTIVE

;

run;

proc print data = mobiles_raw;

run;

OUTPUT:


Obs	Mobile_Name	Brand	Country	Operating_System	Monitoring_Status	Launch_Date	Price	Mobile_ID	Battery_mAh
1	iphone15pro	Apple	USA	ios	ACTIVE	12SEP2023	$1,599.00	101	4500
2	GalaxyS24Ultra	Samsung	SouthKorea	ANDROID	ACTIVE	25JAN2024	$1,400.00	102	5000
3	Pixel8Pro	Google	USA	android	ACTIVE	15OCT2023	$-999.00	103	5050
4	OnePlus12	OnePlus	India	Android	NULL	11FEB2024	$899.00	104	-5400
5	Xiaomi14Ultra	Xiaomi	China	ANDROID	ACTIVE	.	$1,200.00	105	5300
6	NULL	Apple	USA	IOS	ACTIVE	01MAR2024	$1,300.00	106	4500
7	ROGPhone8	Asus	Taiwan	ANDROID	ACTIVE	15JAN2024	$1,700.00	107	5500
8	GalaxyS24Ultra	Samsung	SouthKorea	ANDROID	ACTIVE	25JAN2024	$1,400.00	108	5000
9	MotoEdge50	Motorola	USA	android	inactive	05MAY2024	$799.00	109	4700
10	NokiaXR21	Nokia	Finland	ANDROID	ACTIVE	10AUG2023	$650.00	110	4800
11	iphone15pro	Apple	USA	ios	ACTIVE	12SEP2023	$1,599.00	111	4500
12	VivoX100Pro	Vivo	China	ANDROID	ACTIVE	20JAN2024	$999.00	112	5400
13	OppoFindX7	Oppo	China	android	ACTIVE	18JAN2024	$1,100.00	113	5000
14	Pixel8Pro	Google	USA	ANDROID	ACTIVE	15OCT2023	$1,299.00	114	5050
15	RealmeGT5	Realme	India	ANDROID	NULL	11NOV2023	$699.00	115	5240
16	SonyXperia1VI	Sony	Japan	ANDROID	ACTIVE	05MAR2024	$1,800.00	116	5000
17	HuaweiP70	Huawei	China	ANDROID	ACTIVE	20APR2024	$1,450.00	117	5200
18	NothingPhone2	Nothing	UK	ANDROID	ACTIVE	15JUL2023	$799.00	118	4700
19	iphone15pro	max	USA	ios	ACTIVE	12SEP2023	$1,599.00	119	4500
20	GalaxyFold6	Samsung	SouthKorea	ANDROID	ACTIVE	10JUL2024	$2,100.00	120	4400

Explanation of the Dataset

This dataset intentionally contains enterprise-level issues:

Problem	Example
Negative price	-999
Negative battery	-5400
Invalid date	31/15/2024
Duplicates	GalaxyS24Ultra
Missing mobile name	NULL
Mixed OS casing	ios / IOS / android
Invalid monitoring status	NULL
Truncation risk	iphone15pro

Equivalent R Version of mobiles_raw Dataset

library(tidyverse)

library(lubridate)

mobiles_raw <- tibble(

Mobile_ID = c(101,102,103,104,105,106,107,108,109,110,

111,112,113,114,115,116,117,118,119,120),

Mobile_Name = c("iphone15pro","GalaxyS24Ultra","Pixel8Pro","OnePlus12",

"Xiaomi14Ultra","NULL","ROGPhone8","GalaxyS24Ultra","MotoEdge50",

"NokiaXR21","iphone15pro","VivoX100Pro","OppoFindX7","Pixel8Pro",

"RealmeGT5","SonyXperia1VI","HuaweiP70","NothingPhone2",

"iphone15pro","GalaxyFold6"),

Brand = c("Apple","Samsung","Google","OnePlus","Xiaomi","Apple","Asus",

"Samsung","Motorola","Nokia","Apple","Vivo","Oppo","Google",

"Realme","Sony","Huawei","Nothing","max","Samsung"),

Country = c("USA","SouthKorea","USA","India","China","USA","Taiwan",

"SouthKorea","USA","Finland","USA","China","China","USA",

"India","Japan","China","UK","USA","SouthKorea"),

Launch_Date = c("12/09/2023","25/01/2024","15/10/2023","11/02/2024",

"31/15/2024","01/03/2024","15/01/2024","25/01/2024","05/05/2024",

"10/08/2023","12/09/2023","20/01/2024","18/01/2024","15/10/2023",

"11/11/2023","05/03/2024","20/04/2024","15/07/2023","12/09/2023",

"10/07/2024"),

Price = c(1599,1400,-999,899,1200,1300,1700,1400,799,650,599,999,1100,

11299,699,1800,1450,799,1599,2100),

Battery_mAh = c(4500,5000,5050,-5400,5300,4500,5500,5000,4700,4800,

4500,5400,5000,5050,5240,5000,5200,4700,4500,4400),

Operating_System = c("ios","ANDROID","android","Android","ANDROID",

"IOS","ANDROID","ANDROID","android","ANDROID","ios","ANDROID",

"android","ANDROID","ANDROID","ANDROID","ANDROID","ANDROID",

"ios","ANDROID"),

Monitoring_Status = c("ACTIVE","ACTIVE","ACTIVE","NULL","ACTIVE","ACTIVE",

"ACTIVE","ACTIVE","inactive","ACTIVE","ACTIVE","ACTIVE","ACTIVE",

"ACTIVE","NULL","ACTIVE","ACTIVE","ACTIVE","ACTIVE","ACTIVE")

)

OUTPUT:

	MoMobile_ID	Mobile_Name	Brand	Country	Launch_Date	Price	Battery_mAh	Operating_System	Monitoring_Status
1	101	iphone15pro	Apple	USA	12-09-2023	1599	4500	ios	ACTIVE
2	102	GalaxyS24Ultra	Samsung	SouthKorea	25-01-2024	1400	5000	ANDROID	ACTIVE
3	103	Pixel8Pro	Google	USA	15-10-2023	-999	5050	android	ACTIVE
4	104	OnePlus12	OnePlus	India	11-02-2024	899	-5400	Android	NULL
5	105	Xiaomi14Ultra	Xiaomi	China	31/15/2024	1200	5300	ANDROID	ACTIVE
6	106	NULL	Apple	USA	01-03-2024	1300	4500	IOS	ACTIVE
7	107	ROGPhone8	Asus	Taiwan	15-01-2024	1700	5500	ANDROID	ACTIVE
8	108	GalaxyS24Ultra	Samsung	SouthKorea	25-01-2024	1400	5000	ANDROID	ACTIVE
9	109	MotoEdge50	Motorola	USA	05-05-2024	799	4700	android	inactive
10	110	NokiaXR21	Nokia	Finland	10-08-2023	650	4800	ANDROID	ACTIVE
11	111	iphone15pro	Apple	USA	12-09-2023	599	4500	ios	ACTIVE
12	112	VivoX100Pro	Vivo	China	20-01-2024	999	5400	ANDROID	ACTIVE
13	113	OppoFindX7	Oppo	China	18-01-2024	1100	5000	android	ACTIVE
14	114	Pixel8Pro	Google	USA	15-10-2023	11299	5050	ANDROID	ACTIVE
15	115	RealmeGT5	Realme	India	11-11-2023	699	5240	ANDROID	NULL
16	116	SonyXperia1VI	Sony	Japan	05-03-2024	1800	5000	ANDROID	ACTIVE
17	117	HuaweiP70	Huawei	China	20-04-2024	1450	5200	ANDROID	ACTIVE
18	118	NothingPhone2	Nothing	UK	15-07-2023	799	4700	ANDROID	ACTIVE
19	119	iphone15pro	max	USA	12-09-2023	1599	4500	ios	ACTIVE
20	120	GalaxyFold6	Samsung	SouthKorea	10-07-2024	2100	4400	ANDROID	ACTIVE

Explanation of This R Code

This R workflow recreates your SAS DATALINES dataset using tibble() from the tidyverse ecosystem.

Key Enterprise Concepts Used

SAS Concept	R Equivalent
DATALINES	tibble()
INPUT statement	Column vectors
FORMAT / INFORMAT	lubridate::dmy()

The Truncation Trap in SAS

One of the most dangerous SAS issues is the “Truncation Trap.”

If LENGTH is not declared before INPUT or assignment statements, SAS automatically guesses variable lengths based on the first observed value.

Example:

length Operating_System $20;

Without this statement:

“ANDROID” may truncate to “ANDR”
“SouthKorea” may truncate unexpectedly
Business rules fail silently

This is extremely dangerous in SDTM mapping and regulatory reporting.

R handles strings differently because memory allocation is dynamic, whereas SAS uses fixed-width storage architecture.

Enterprise SAS Data Cleaning Workflow

Step 1 — Standardization Layer

data mobiles_clean;

set mobiles_raw;

/* Standardize text formatting */

Mobile_Name = propcase(strip(Mobile_Name));

Brand = propcase(strip(Brand));

Country = upcase(strip(Country));

Operating_System = upcase(strip(Operating_System));

Monitoring_Status = upcase(strip(Monitoring_Status));

/* Replace NULL text values */

if Mobile_Name='Null' then Mobile_Name='Unknown Device';

if Monitoring_Status='NULL' then Monitoring_Status='PENDING';

/* Correct negative values */

Price = abs(Price);

Battery_mAh = abs(Battery_mAh);

/* Date validation */

if missing(Launch_Date) then Launch_Date = today();

run;

proc print data = mobiles_clean;

run;

OUTPUT:


Obs	Mobile_Name	Brand	Country	Operating_System	Monitoring_Status	Launch_Date	Price	Mobile_ID	Battery_mAh
1	Iphone15pro	Apple	USA	IOS	ACTIVE	12SEP2023	$1,599.00	101	4500
2	Galaxys24ultra	Samsung	SOUTHKOREA	ANDROID	ACTIVE	25JAN2024	$1,400.00	102	5000
3	Pixel8pro	Google	USA	ANDROID	ACTIVE	15OCT2023	$999.00	103	5050
4	Oneplus12	Oneplus	INDIA	ANDROID	PENDING	11FEB2024	$899.00	104	5400
5	Xiaomi14ultra	Xiaomi	CHINA	ANDROID	ACTIVE	13MAY2026	$1,200.00	105	5300
6	Unknown Device	Apple	USA	IOS	ACTIVE	01MAR2024	$1,300.00	106	4500
7	Rogphone8	Asus	TAIWAN	ANDROID	ACTIVE	15JAN2024	$1,700.00	107	5500
8	Galaxys24ultra	Samsung	SOUTHKOREA	ANDROID	ACTIVE	25JAN2024	$1,400.00	108	5000
9	Motoedge50	Motorola	USA	ANDROID	INACTIVE	05MAY2024	$799.00	109	4700
10	Nokiaxr21	Nokia	FINLAND	ANDROID	ACTIVE	10AUG2023	$650.00	110	4800
11	Iphone15pro	Apple	USA	IOS	ACTIVE	12SEP2023	$1,599.00	111	4500
12	Vivox100pro	Vivo	CHINA	ANDROID	ACTIVE	20JAN2024	$999.00	112	5400
13	Oppofindx7	Oppo	CHINA	ANDROID	ACTIVE	18JAN2024	$1,100.00	113	5000
14	Pixel8pro	Google	USA	ANDROID	ACTIVE	15OCT2023	$1,299.00	114	5050
15	Realmegt5	Realme	INDIA	ANDROID	PENDING	11NOV2023	$699.00	115	5240
16	Sonyxperia1vi	Sony	JAPAN	ANDROID	ACTIVE	05MAR2024	$1,800.00	116	5000
17	Huaweip70	Huawei	CHINA	ANDROID	ACTIVE	20APR2024	$1,450.00	117	5200
18	Nothingphone2	Nothing	UK	ANDROID	ACTIVE	15JUL2023	$799.00	118	4700
19	Iphone15pro	Max	USA	IOS	ACTIVE	12SEP2023	$1,599.00	119	4500
20	Galaxyfold6	Samsung	SOUTHKOREA	ANDROID	ACTIVE	10JUL2024	$2,100.00	120	4400

Explanation of Standardization Logic

This DATA step acts as the first enterprise cleaning layer.

Key SAS Functions Used

Function	Purpose
PROPCASE	Standardizes text
STRIP	Removes leading/trailing spaces
ABS	Corrects negative values
MISSING	Detects null values
TODAY()	Imputes invalid dates

Business Importance

Without standardization:

Dashboards split “ANDROID” and “android”
AI models misclassify operating systems
Duplicate detection fails
Reporting becomes inconsistent

This mirrors real SDTM compliance problems in clinical trials.

PROC SQL vs DATA Step Deduplication

/*DATA Step Method*/

proc sort data=mobiles_clean nodupkey;

by Mobile_Name Brand Launch_Date;

run;

proc print data = mobiles_clean;

run;

OUTPUT:


Obs	Mobile_Name	Brand	Country	Operating_System	Monitoring_Status	Launch_Date	Price	Mobile_ID	Battery_mAh
1	Galaxyfold6	Samsung	SOUTHKOREA	ANDROID	ACTIVE	10JUL2024	$2,100.00	120	4400
2	Galaxys24ultra	Samsung	SOUTHKOREA	ANDROID	ACTIVE	25JAN2024	$1,400.00	102	5000
3	Huaweip70	Huawei	CHINA	ANDROID	ACTIVE	20APR2024	$1,450.00	117	5200
4	Iphone15pro	Apple	USA	IOS	ACTIVE	12SEP2023	$1,599.00	101	4500
5	Iphone15pro	Max	USA	IOS	ACTIVE	12SEP2023	$1,599.00	119	4500
6	Motoedge50	Motorola	USA	ANDROID	INACTIVE	05MAY2024	$799.00	109	4700
7	Nokiaxr21	Nokia	FINLAND	ANDROID	ACTIVE	10AUG2023	$650.00	110	4800
8	Nothingphone2	Nothing	UK	ANDROID	ACTIVE	15JUL2023	$799.00	118	4700
9	Oneplus12	Oneplus	INDIA	ANDROID	PENDING	11FEB2024	$899.00	104	5400
10	Oppofindx7	Oppo	CHINA	ANDROID	ACTIVE	18JAN2024	$1,100.00	113	5000
11	Pixel8pro	Google	USA	ANDROID	ACTIVE	15OCT2023	$999.00	103	5050
12	Realmegt5	Realme	INDIA	ANDROID	PENDING	11NOV2023	$699.00	115	5240
13	Rogphone8	Asus	TAIWAN	ANDROID	ACTIVE	15JAN2024	$1,700.00	107	5500
14	Sonyxperia1vi	Sony	JAPAN	ANDROID	ACTIVE	05MAR2024	$1,800.00	116	5000
15	Unknown Device	Apple	USA	IOS	ACTIVE	01MAR2024	$1,300.00	106	4500
16	Vivox100pro	Vivo	CHINA	ANDROID	ACTIVE	20JAN2024	$999.00	112	5400
17	Xiaomi14ultra	Xiaomi	CHINA	ANDROID	ACTIVE	13MAY2026	$1,200.00	105	5300

Why PROC SORT NODUPKEY Matters

This removes duplicate business records while preserving:

Reporting integrity
Accurate device counts
Reliable healthcare enrollment statistics

DATA Step vs PROC SQL

Feature	DATA Step	PROC SQL
Speed	Faster sequential logic	Better relational joins
Row-by-row processing	Excellent	Moderate
Complex joins	Moderate	Excellent
Readability	High	High
Enterprise ETL	Preferred	Preferred for integrations

Advanced SAS Business Logic

/*SELECT-WHEN Classification*/

data mobiles_segmented;

set mobiles_clean;

length Price_Category $20;

select;

when (Price > 1500)

Price_Category='ULTRA PREMIUM';

when (Price >= 1000 and Price <= 1500)

Price_Category='PREMIUM';

otherwise

Price_Category='MID RANGE';

end;

run;

proc print data = mobiles_segmented;

run;

OUTPUT:


Obs	Mobile_Name	Brand	Country	Operating_System	Monitoring_Status	Launch_Date	Price	Mobile_ID	Battery_mAh	Price_Category
1	Galaxyfold6	Samsung	SOUTHKOREA	ANDROID	ACTIVE	10JUL2024	$2,100.00	120	4400	ULTRA PREMIUM
2	Galaxys24ultra	Samsung	SOUTHKOREA	ANDROID	ACTIVE	25JAN2024	$1,400.00	102	5000	PREMIUM
3	Huaweip70	Huawei	CHINA	ANDROID	ACTIVE	20APR2024	$1,450.00	117	5200	PREMIUM
4	Iphone15pro	Apple	USA	IOS	ACTIVE	12SEP2023	$1,599.00	101	4500	ULTRA PREMIUM
5	Iphone15pro	Max	USA	IOS	ACTIVE	12SEP2023	$1,599.00	119	4500	ULTRA PREMIUM
6	Motoedge50	Motorola	USA	ANDROID	INACTIVE	05MAY2024	$799.00	109	4700	MID RANGE
7	Nokiaxr21	Nokia	FINLAND	ANDROID	ACTIVE	10AUG2023	$650.00	110	4800	MID RANGE
8	Nothingphone2	Nothing	UK	ANDROID	ACTIVE	15JUL2023	$799.00	118	4700	MID RANGE
9	Oneplus12	Oneplus	INDIA	ANDROID	PENDING	11FEB2024	$899.00	104	5400	MID RANGE
10	Oppofindx7	Oppo	CHINA	ANDROID	ACTIVE	18JAN2024	$1,100.00	113	5000	PREMIUM
11	Pixel8pro	Google	USA	ANDROID	ACTIVE	15OCT2023	$999.00	103	5050	MID RANGE
12	Realmegt5	Realme	INDIA	ANDROID	PENDING	11NOV2023	$699.00	115	5240	MID RANGE
13	Rogphone8	Asus	TAIWAN	ANDROID	ACTIVE	15JAN2024	$1,700.00	107	5500	ULTRA PREMIUM
14	Sonyxperia1vi	Sony	JAPAN	ANDROID	ACTIVE	05MAR2024	$1,800.00	116	5000	ULTRA PREMIUM
15	Unknown Device	Apple	USA	IOS	ACTIVE	01MAR2024	$1,300.00	106	4500	PREMIUM
16	Vivox100pro	Vivo	CHINA	ANDROID	ACTIVE	20JAN2024	$999.00	112	5400	MID RANGE
17	Xiaomi14ultra	Xiaomi	CHINA	ANDROID	ACTIVE	13MAY2026	$1,200.00	105	5300	PREMIUM

Explanation

SELECT-WHEN is cleaner than multiple IF-THEN statements.

It improves:

Readability
Processing logic
Enterprise maintainability

Very useful in ADaM derivations and TLF categorization.

Using ARRAYS for Bulk Cleaning

data mobiles_array;

set mobiles_clean;

array nums(*) Price Battery_mAh;

do i=1 to dim(nums);

if nums(i)<0 then nums(i)=abs(nums(i));

end;

drop i;

run;

proc print data = mobiles_array;

run;

OUTPUT:


Obs	Mobile_Name	Brand	Country	Operating_System	Monitoring_Status	Launch_Date	Price	Mobile_ID	Battery_mAh
1	Galaxyfold6	Samsung	SOUTHKOREA	ANDROID	ACTIVE	10JUL2024	$2,100.00	120	4400
2	Galaxys24ultra	Samsung	SOUTHKOREA	ANDROID	ACTIVE	25JAN2024	$1,400.00	102	5000
3	Huaweip70	Huawei	CHINA	ANDROID	ACTIVE	20APR2024	$1,450.00	117	5200
4	Iphone15pro	Apple	USA	IOS	ACTIVE	12SEP2023	$1,599.00	101	4500
5	Iphone15pro	Max	USA	IOS	ACTIVE	12SEP2023	$1,599.00	119	4500
6	Motoedge50	Motorola	USA	ANDROID	INACTIVE	05MAY2024	$799.00	109	4700
7	Nokiaxr21	Nokia	FINLAND	ANDROID	ACTIVE	10AUG2023	$650.00	110	4800
8	Nothingphone2	Nothing	UK	ANDROID	ACTIVE	15JUL2023	$799.00	118	4700
9	Oneplus12	Oneplus	INDIA	ANDROID	PENDING	11FEB2024	$899.00	104	5400
10	Oppofindx7	Oppo	CHINA	ANDROID	ACTIVE	18JAN2024	$1,100.00	113	5000
11	Pixel8pro	Google	USA	ANDROID	ACTIVE	15OCT2023	$999.00	103	5050
12	Realmegt5	Realme	INDIA	ANDROID	PENDING	11NOV2023	$699.00	115	5240
13	Rogphone8	Asus	TAIWAN	ANDROID	ACTIVE	15JAN2024	$1,700.00	107	5500
14	Sonyxperia1vi	Sony	JAPAN	ANDROID	ACTIVE	05MAR2024	$1,800.00	116	5000
15	Unknown Device	Apple	USA	IOS	ACTIVE	01MAR2024	$1,300.00	106	4500
16	Vivox100pro	Vivo	CHINA	ANDROID	ACTIVE	20JAN2024	$999.00	112	5400
17	Xiaomi14ultra	Xiaomi	CHINA	ANDROID	ACTIVE	13MAY2026	$1,200.00	105	5300

Why ARRAYS Matter

ARRAYS are essential when:

Cleaning 100+ variables
Processing lab domains
Validating clinical measurements

Instead of writing repetitive logic, arrays automate bulk transformations efficiently.

PROC FORMAT for Professional Outputs

proc format;

value pricefmt low-999='Budget'

1000-1499='Premium'

1500-high='Luxury';

run;

LOG:

NOTE: Format PRICEFMT has been output.

PROC REPORT Dashboard

proc report data=mobiles_clean nowd;

columns Brand Country Price Battery_mAh;

define Brand / group;

define Country / group;

define Price / analysis mean format=pricefmt.;

define Battery_mAh / analysis mean;

run;

OUTPUT:


Brand	Country	Price	Battery_mAh
Apple	USA	Premium	4500
Asus	TAIWAN	Luxury	5500
Google	USA	Budget	5050
Huawei	CHINA	Premium	5200
Max	USA	Luxury	4500
Motorola	USA	Budget	4700
Nokia	FINLAND	Budget	4800
Nothing	UK	Budget	4700
Oneplus	INDIA	Budget	5400
Oppo	CHINA	Premium	5000
Realme	INDIA	Budget	5240
Samsung	SOUTHKOREA	Luxury	4700
Sony	JAPAN	Luxury	5000
Vivo	CHINA	Budget	5400
Xiaomi	CHINA	Premium	5300

Reporting Importance

PROC REPORT is heavily used in:

Clinical reporting
Regulatory submissions
Audit-ready dashboards
Executive summaries

It produces presentation-quality outputs directly from SAS.

PROC SUMMARY for Aggregation

proc summary data=mobiles_clean nway;

class Brand;

var Price Battery_mAh;

output out=summary_stats mean=Avg_Price Avg_Battery;

run;

proc print data = summary_stats;

run;

OUTPUT:


Obs	Brand	_TYPE_	_FREQ_	Avg_Price	Avg_Battery
1	Apple	1	2	$1,449.50	4500
2	Asus	1	1	$1,700.00	5500
3	Google	1	1	$999.00	5050
4	Huawei	1	1	$1,450.00	5200
5	Max	1	1	$1,599.00	4500
6	Motorola	1	1	$799.00	4700
7	Nokia	1	1	$650.00	4800
8	Nothing	1	1	$799.00	4700
9	Oneplus	1	1	$899.00	5400
10	Oppo	1	1	$1,100.00	5000
11	Realme	1	1	$699.00	5240
12	Samsung	1	2	$1,750.00	4700
13	Sony	1	1	$1,800.00	5000
14	Vivo	1	1	$999.00	5400
15	Xiaomi	1	1	$1,200.00	5300

Why PROC SUMMARY is Powerful

PROC SUMMARY is optimized for:

Enterprise aggregation
Statistical reporting
Real-time dashboards
Performance-heavy environments

It is faster than many manual aggregation methods.

R Refinement Layer

R Data Cleaning Workflow

library(tidyverse)

library(lubridate)

library(janitor)

mobiles_clean <- mobiles_raw %>%

clean_names() %>%

mutate(mobile_name = str_trim(str_to_title(mobile_name)),

brand = str_to_title(brand),

operating_system = str_to_upper(operating_system),

monitoring_status = case_when(

monitoring_status=="NULL" ~ "PENDING",TRUE ~ monitoring_status),

price = abs(price),

battery_m_ah = abs(battery_m_ah),

launch_date = suppressWarnings(dmy(launch_date)) %>%

distinct()

View(mobiles_clean)

OUTPUT:

	mobile_id	mobile_name	brand	country	launch_date	price	battery_m_ah	operating_system	monitoring_status
1	101	Iphone15pro	Apple	USA	12-09-2023	1599	4500	IOS	ACTIVE
2	102	Galaxys24ultra	Samsung	SouthKorea	25-01-2024	1400	5000	ANDROID	ACTIVE
3	103	Pixel8pro	Google	USA	15-10-2023	999	5050	ANDROID	ACTIVE
4	104	Oneplus12	Oneplus	India	11-02-2024	899	5400	ANDROID	PENDING
5	105	Xiaomi14ultra	Xiaomi	China	NA	1200	5300	ANDROID	ACTIVE
6	106	Null	Apple	USA	01-03-2024	1300	4500	IOS	ACTIVE
7	107	Rogphone8	Asus	Taiwan	15-01-2024	1700	5500	ANDROID	ACTIVE
8	108	Galaxys24ultra	Samsung	SouthKorea	25-01-2024	1400	5000	ANDROID	ACTIVE
9	109	Motoedge50	Motorola	USA	05-05-2024	799	4700	ANDROID	inactive
10	110	Nokiaxr21	Nokia	Finland	10-08-2023	650	4800	ANDROID	ACTIVE
11	111	Iphone15pro	Apple	USA	12-09-2023	599	4500	IOS	ACTIVE
12	112	Vivox100pro	Vivo	China	20-01-2024	999	5400	ANDROID	ACTIVE
13	113	Oppofindx7	Oppo	China	18-01-2024	1100	5000	ANDROID	ACTIVE
14	114	Pixel8pro	Google	USA	15-10-2023	11299	5050	ANDROID	ACTIVE
15	115	Realmegt5	Realme	India	11-11-2023	699	5240	ANDROID	PENDING
16	116	Sonyxperia1vi	Sony	Japan	05-03-2024	1800	5000	ANDROID	ACTIVE
17	117	Huaweip70	Huawei	China	20-04-2024	1450	5200	ANDROID	ACTIVE
18	118	Nothingphone2	Nothing	UK	15-07-2023	799	4700	ANDROID	ACTIVE
19	119	Iphone15pro	Max	USA	12-09-2023	1599	4500	IOS	ACTIVE
20	120	Galaxyfold6	Samsung	SouthKorea	10-07-2024	2100	4400	ANDROID	ACTIVE

SAS vs R Cleaning Comparison

SAS	R Equivalent
PROPCASE	str_to_title()
STRIP	str_trim()
IF-THEN	if_else()
PROC SORT NODUPKEY	distinct()
PROC SQL	dplyr joins
MISSING()	is.na()
CATX	unite()
SCAN	separate()

Missing Value Trap in SAS

SAS treats missing numeric values as smaller than any number.

Example:

if Price < 1000 then Flag='YES';

Missing prices will incorrectly qualify.

Correct approach:

if not missing(Price) and Price < 1000 then Flag='YES';

This is one of the most dangerous enterprise SAS mistakes.

20 Golden Rules for Enterprise SAS/R Projects

Always define LENGTH before logic.
Never trust raw source systems.
Standardize casing immediately.
Remove duplicates early.
Validate date formats.
Audit negative numeric values.
Use reusable macros.
Separate raw and clean layers.
Preserve audit trails.
Use PROC CONTENTS frequently.
Validate joins carefully.
Avoid hardcoding values.
Use formats for reporting consistency.
Create QC validation steps.
Track derivation logic.
Maintain SDTM compliance.
Use ARRAYS for scalability.
Document every transformation.
Test missing-value scenarios.
Build reproducible workflows.

20 Additional Data Cleaning Best Practices

Validate SDTM variable lengths.
Maintain Define.xml consistency.
Preserve source traceability.
Never overwrite raw datasets.
Validate controlled terminology.
Ensure ISO8601 date compliance.
Audit imputation logic.
Review duplicate patient IDs.
Validate treatment dates.
Monitor protocol deviations.
Standardize adverse event coding.
Validate laboratory units.
Track derivation lineage.
Create validation datasets.
Implement QC double programming.
Document reviewer comments.
Protect PHI-sensitive data.
Use metadata-driven programming.
Validate ADaM derivations.
Maintain audit-ready outputs.

Business Logic Behind Data Cleaning

Data cleaning is fundamentally about protecting decision quality. In healthcare analytics, a single missing value can incorrectly classify a patient’s risk profile. In banking, malformed salary values can reject valid loan applicants. In clinical trials, invalid dates can destroy treatment-emergent adverse event derivations.

Consider patient age correction. If age appears as “250,” the system must validate and correct or quarantine the record because unrealistic ages distort demographic statistics and machine-learning predictions.

Similarly, salary normalization ensures financial systems compare standardized income values instead of mixed currencies or malformed text strings.

Date imputation is equally critical. If hospital admission dates are partially missing, statistical teams may use controlled imputation logic to preserve analytical continuity while maintaining regulatory traceability.

Replacing missing values improves model stability, reporting completeness, and operational trust. However, imputation must always follow documented business rules and regulatory guidance.

Data cleaning therefore is not merely technical preprocessing it is enterprise risk management. Every corrected value, standardized string, duplicate removal, and validated date directly impacts business intelligence accuracy, compliance readiness, AI reliability, and executive decision-making quality.

20 Sharp Enterprise Insights

Dirty data leads to wrong conclusions.
Standardization ensures reproducibility.
Missing dates can destroy timelines.
Duplicate records inflate metrics.
Truncation silently corrupts analytics.
Validation protects business credibility.
Audit trails ensure regulatory trust.
Clean data improves AI accuracy.
PROC SQL simplifies relational logic.
DATA step excels in row processing.
ARRAYS reduce repetitive code.
Formats improve reporting consistency.
Macros increase scalability.
Documentation prevents confusion.
QC programming catches hidden issues.
Controlled terminology improves compliance.
Consistent casing prevents duplicate categories.
Data governance improves enterprise confidence.
Real-time pipelines require strict validation.
Clean datasets create trustworthy intelligence.

Summary

SAS and R both dominate enterprise analytics, but each platform offers distinct advantages in professional data-cleaning ecosystems.

SAS excels in regulated industries such as clinical trials, pharmaceuticals, banking, and insurance because of its structured architecture, reproducibility, audit readiness, and scalable batch-processing capabilities. DATA step programming provides exceptional row-level control, while PROC SQL, PROC REPORT, PROC SUMMARY, and PROC FORMAT create enterprise-grade reporting systems trusted by regulatory agencies worldwide.

SAS also handles large-volume production pipelines extremely efficiently. Features like FIRST./LAST. processing, arrays, retained variables, macros, and metadata-driven programming make SAS ideal for SDTM, ADaM, and TLF workflows.

R, on the other hand, offers flexibility, rapid experimentation, and advanced open-source integrations. Packages such as tidyverse, dplyr, stringr, lubridate, janitor, and purrr simplify complex transformations into readable pipelines. R is highly favored for exploratory analytics, machine learning, visualization, and agile statistical workflows.

The strongest enterprise teams often combine both technologies strategically.

SAS becomes the controlled production engine:

Regulatory reporting
Clinical submissions
Enterprise ETL
Audit-ready outputs

R becomes the innovation layer:

Exploratory analysis
Predictive modeling
Visualization
Flexible transformations

Together, SAS and R create a powerful analytics ecosystem capable of transforming chaotic raw datasets into trusted business intelligence platforms supporting healthcare, finance, telecom, and AI-driven decision systems.

Conclusion

Modern organizations are no longer limited by data availability they are limited by data quality.

The world now generates enormous streams of information from smartphones, healthcare systems, wearable devices, banking platforms, IoT infrastructure, and AI applications. However, raw data arriving from these systems is rarely clean, consistent, or analysis-ready. Missing values, malformed text, duplicate records, invalid dates, inconsistent formatting, and logical conflicts continuously threaten analytical reliability.

This is why structured data-cleaning frameworks using SAS and R have become mission-critical enterprise capabilities.

In this project, we transformed a globally famous mobiles dataset into a professional-grade analytics pipeline using advanced SAS DATA step programming, PROC SQL, reporting procedures, macros, and modern R refinement workflows. Along the way, we explored enterprise challenges such as truncation risks, duplicate management, missing-value traps, date validation, business-rule standardization, and audit-ready reporting.

The deeper lesson is this:

Data cleaning is not a support task.
It is the foundation of trustworthy intelligence.

Whether building SDTM datasets for clinical trials, fraud-detection systems for banking, AI-ready telecom pipelines, or executive dashboards for healthcare monitoring, clean data determines whether organizations make accurate decisions or catastrophic mistakes.

SAS provides industrial-grade governance, reproducibility, and regulatory confidence.
R provides flexibility, innovation, and analytical acceleration.

Together, they create an ecosystem capable of converting broken datasets into strategic assets.

The future belongs not merely to organizations with the most data but to organizations with the cleanest, most reliable, and most intelligently engineered data pipelines.

Interview Questions and Answers

1. Why is LENGTH placement important in SAS?

Answer:

If LENGTH is declared after variable assignment, SAS fixes variable size using the first encountered value, causing truncation. This silently corrupts downstream analytics and regulatory outputs.

2. Difference between PROC SQL and DATA Step?

Answer:

DATA step is optimized for sequential row-level transformations, while PROC SQL is stronger for joins, aggregations, and relational logic.

3. How do you handle duplicate healthcare records?

Answer:

Use PROC SORT NODUPKEY, business-rule validation, and audit comparison reports before deleting duplicates.

4. How would you debug missing-value issues in SAS?

Answer:

Use MISSING(), PROC FREQ, PROC MEANS NMISS, and validation flags to identify hidden null propagation issues.

5. A clinical dataset shows treatment end dates before treatment start dates. How would you fix it?

Answer:

First validate source-system integrity, identify malformed dates, use INTCK and logical validation rules, quarantine invalid records, document corrections, and rerun QC validation before downstream ADaM derivations.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about MOBILE DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.From Billion-Dollar Marriages to Clean Datasets: The Art of SAS Standardization

2.The Hidden Power of SAS Dates: Building Smart Time Logic with INTNX and INTCK

3.Data Disasters to Data Intelligence: Mastering TRANWRD in SAS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

Corrupted Mobile Records to Clinical-Grade Analytics: Real-Time Data Cleaning and Enterprise Reporting with SAS and R

Explanation of This R Code

Key Enterprise Concepts Used

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS