Corrupted Mobile Records to Clinical-Grade Analytics: Real-Time Data Cleaning and Enterprise Reporting with SAS and R

The World’s Most Famous Mobiles Dataset into Trusted Business Analytics Using Advanced SAS (PROC SQL vs DATA Step) and R Data Engineering Frameworks

Introduction — When Dirty Data Becomes a Million-Dollar Problem

Imagine a real-time healthcare monitoring company tracking patient health using mobile applications installed on devices like Apple iPhone 15 Pro, Samsung Galaxy S24 Ultra, and Google Pixel 8 Pro.

The company uses wearable integrations, hospital APIs, and mobile-generated telemetry data to predict cardiac abnormalities, medication adherence, and patient risk scores.

Now imagine the following problems entering the system:

  • Duplicate patient device IDs
  • Invalid mobile launch dates
  • Negative battery capacities
  • Corrupted manufacturer names
  • Mixed-case operating systems
  • Blank healthcare monitoring statuses
  • Truncated text fields
  • Invalid pricing values
  • Missing region mappings

Suddenly, dashboards become unreliable. AI models begin predicting false risks. Regulatory submissions fail validation checks. Auditors raise compliance concerns. Clinical teams lose trust in analytics.

This is the hidden battlefield of enterprise analytics.

In modern organizations, data cleaning is not cosmetic work it is survival infrastructure.

Healthcare, banking, insurance, telecom, and clinical research organizations depend heavily on trustworthy datasets. A single malformed value can create downstream chaos across SDTM domains, ADaM derivations, TLF generation, and regulatory reporting.

In this enterprise-level project, we will build a realistic “Most Famous Mobiles in the World” dataset integrated with healthcare-style monitoring logic and intentionally injected data-quality issues. Then we will solve these problems using advanced SAS and R programming techniques.

Real-Time Healthcare Mobile Dataset with Intentional Errors

A healthcare technology organization collects patient monitoring activity from globally famous mobile devices. Data enters from:

  • Mobile apps
  • Hospital integrations
  • IoT wearables
  • API streams
  • Insurance systems

The raw dataset contains inconsistent records requiring enterprise-grade cleaning.

Raw SAS Dataset Creation with Intentional Errors

/*=========================================================

 REAL-TIME MOBILE HEALTHCARE RAW DATASET

 Intentional enterprise-level data quality problems added

=========================================================*/

data mobiles_raw;

 /* LENGTH must come before logic to avoid truncation */

 length Mobile_Name $40 Brand $25 Country $20

        Operating_System $20 Monitoring_Status $15;

 /* Proper informats for incoming raw data */

 informat Launch_Date ddmmyy10.;

 /* Output display formats */

 format Launch_Date date9. Price dollar10.2;

 infile datalines dlm='|' truncover;

 input Mobile_ID Mobile_Name $ Brand $ Country $ Launch_Date

       Price Battery_mAh Operating_System $ Monitoring_Status $;

datalines;

101|iphone15pro|Apple|USA|12/09/2023|1599|4500|ios|ACTIVE

102|GalaxyS24Ultra|Samsung|SouthKorea|25/01/2024|1400|5000|ANDROID|ACTIVE

103|Pixel8Pro|Google|USA|15/10/2023|-999|5050|android|ACTIVE

104|OnePlus12|OnePlus|India|11/02/2024|899|-5400|Android|NULL

105|Xiaomi14Ultra|Xiaomi|China|31/15/2024|1200|5300|ANDROID|ACTIVE

106|NULL|Apple|USA|01/03/2024|1300|4500|IOS|ACTIVE

107|ROGPhone8|Asus|Taiwan|15/01/2024|1700|5500|ANDROID|ACTIVE

108|GalaxyS24Ultra|Samsung|SouthKorea|25/01/2024|1400|5000|ANDROID|ACTIVE

109|MotoEdge50|Motorola|USA|05/05/2024|799|4700|android|inactive

110|NokiaXR21|Nokia|Finland|10/08/2023|650|4800|ANDROID|ACTIVE

111|iphone15pro|Apple|USA|12/09/2023|1599|4500|ios|ACTIVE

112|VivoX100Pro|Vivo|China|20/01/2024|999|5400|ANDROID|ACTIVE

113|OppoFindX7|Oppo|China|18/01/2024|1100|5000|android|ACTIVE

114|Pixel8Pro|Google|USA|15/10/2023|1299|5050|ANDROID|ACTIVE

115|RealmeGT5|Realme|India|11/11/2023|699|5240|ANDROID|NULL

116|SonyXperia1VI|Sony|Japan|05/03/2024|1800|5000|ANDROID|ACTIVE

117|HuaweiP70|Huawei|China|20/04/2024|1450|5200|ANDROID|ACTIVE

118|NothingPhone2|Nothing|UK|15/07/2023|799|4700|ANDROID|ACTIVE

119|iphone15pro|max|USA|12/09/2023|1599|4500|ios|ACTIVE

120|GalaxyFold6|Samsung|SouthKorea|10/07/2024|2100|4400|ANDROID|ACTIVE

;

run;

proc print data = mobiles_raw;

run;

OUTPUT:

ObsMobile_NameBrandCountryOperating_SystemMonitoring_StatusLaunch_DatePriceMobile_IDBattery_mAh
1iphone15proAppleUSAiosACTIVE12SEP2023$1,599.001014500
2GalaxyS24UltraSamsungSouthKoreaANDROIDACTIVE25JAN2024$1,400.001025000
3Pixel8ProGoogleUSAandroidACTIVE15OCT2023$-999.001035050
4OnePlus12OnePlusIndiaAndroidNULL11FEB2024$899.00104-5400
5Xiaomi14UltraXiaomiChinaANDROIDACTIVE.$1,200.001055300
6NULLAppleUSAIOSACTIVE01MAR2024$1,300.001064500
7ROGPhone8AsusTaiwanANDROIDACTIVE15JAN2024$1,700.001075500
8GalaxyS24UltraSamsungSouthKoreaANDROIDACTIVE25JAN2024$1,400.001085000
9MotoEdge50MotorolaUSAandroidinactive05MAY2024$799.001094700
10NokiaXR21NokiaFinlandANDROIDACTIVE10AUG2023$650.001104800
11iphone15proAppleUSAiosACTIVE12SEP2023$1,599.001114500
12VivoX100ProVivoChinaANDROIDACTIVE20JAN2024$999.001125400
13OppoFindX7OppoChinaandroidACTIVE18JAN2024$1,100.001135000
14Pixel8ProGoogleUSAANDROIDACTIVE15OCT2023$1,299.001145050
15RealmeGT5RealmeIndiaANDROIDNULL11NOV2023$699.001155240
16SonyXperia1VISonyJapanANDROIDACTIVE05MAR2024$1,800.001165000
17HuaweiP70HuaweiChinaANDROIDACTIVE20APR2024$1,450.001175200
18NothingPhone2NothingUKANDROIDACTIVE15JUL2023$799.001184700
19iphone15promaxUSAiosACTIVE12SEP2023$1,599.001194500
20GalaxyFold6SamsungSouthKoreaANDROIDACTIVE10JUL2024$2,100.001204400

Explanation of the Dataset

This dataset intentionally contains enterprise-level issues:

Problem

Example

Negative price

-999

Negative battery

-5400

Invalid date

31/15/2024

Duplicates

GalaxyS24Ultra

Missing mobile name

NULL

Mixed OS casing

ios / IOS / android

Invalid monitoring status

NULL

Truncation risk

iphone15pro

Equivalent R Version of mobiles_raw Dataset

library(tidyverse)

library(lubridate)

mobiles_raw <- tibble(

  Mobile_ID = c(101,102,103,104,105,106,107,108,109,110,

                111,112,113,114,115,116,117,118,119,120),

  Mobile_Name = c("iphone15pro","GalaxyS24Ultra","Pixel8Pro","OnePlus12",

        "Xiaomi14Ultra","NULL","ROGPhone8","GalaxyS24Ultra","MotoEdge50",

        "NokiaXR21","iphone15pro","VivoX100Pro","OppoFindX7","Pixel8Pro",

        "RealmeGT5","SonyXperia1VI","HuaweiP70","NothingPhone2",

        "iphone15pro","GalaxyFold6"),

  Brand = c("Apple","Samsung","Google","OnePlus","Xiaomi","Apple","Asus",

        "Samsung","Motorola","Nokia","Apple","Vivo","Oppo","Google",

        "Realme","Sony","Huawei","Nothing","max","Samsung"),

  Country = c("USA","SouthKorea","USA","India","China","USA","Taiwan",

        "SouthKorea","USA","Finland","USA","China","China","USA",

        "India","Japan","China","UK","USA","SouthKorea"),

  Launch_Date = c("12/09/2023","25/01/2024","15/10/2023","11/02/2024",

       "31/15/2024","01/03/2024","15/01/2024","25/01/2024","05/05/2024",

       "10/08/2023","12/09/2023","20/01/2024","18/01/2024","15/10/2023",

       "11/11/2023","05/03/2024","20/04/2024","15/07/2023","12/09/2023",

       "10/07/2024"),

  Price = c(1599,1400,-999,899,1200,1300,1700,1400,799,650,599,999,1100,

       11299,699,1800,1450,799,1599,2100),

  Battery_mAh = c(4500,5000,5050,-5400,5300,4500,5500,5000,4700,4800,

       4500,5400,5000,5050,5240,5000,5200,4700,4500,4400),

  Operating_System = c("ios","ANDROID","android","Android","ANDROID",

       "IOS","ANDROID","ANDROID","android","ANDROID","ios","ANDROID",

       "android","ANDROID","ANDROID","ANDROID","ANDROID","ANDROID",

       "ios","ANDROID"),

  Monitoring_Status = c("ACTIVE","ACTIVE","ACTIVE","NULL","ACTIVE","ACTIVE",

       "ACTIVE","ACTIVE","inactive","ACTIVE","ACTIVE","ACTIVE","ACTIVE",

       "ACTIVE","NULL","ACTIVE","ACTIVE","ACTIVE","ACTIVE","ACTIVE")

)

OUTPUT:

 

MoMobile_ID

Mobile_Name

Brand

Country

Launch_Date

Price

Battery_mAh

Operating_System

Monitoring_Status

1

101

iphone15pro

Apple

USA

12-09-2023

1599

4500

ios

ACTIVE

2

102

GalaxyS24Ultra

Samsung

SouthKorea

25-01-2024

1400

5000

ANDROID

ACTIVE

3

103

Pixel8Pro

Google

USA

15-10-2023

-999

5050

android

ACTIVE

4

104

OnePlus12

OnePlus

India

11-02-2024

899

-5400

Android

NULL

5

105

Xiaomi14Ultra

Xiaomi

China

31/15/2024

1200

5300

ANDROID

ACTIVE

6

106

NULL

Apple

USA

01-03-2024

1300

4500

IOS

ACTIVE

7

107

ROGPhone8

Asus

Taiwan

15-01-2024

1700

5500

ANDROID

ACTIVE

8

108

GalaxyS24Ultra

Samsung

SouthKorea

25-01-2024

1400

5000

ANDROID

ACTIVE

9

109

MotoEdge50

Motorola

USA

05-05-2024

799

4700

android

inactive

10

110

NokiaXR21

Nokia

Finland

10-08-2023

650

4800

ANDROID

ACTIVE

11

111

iphone15pro

Apple

USA

12-09-2023

599

4500

ios

ACTIVE

12

112

VivoX100Pro

Vivo

China

20-01-2024

999

5400

ANDROID

ACTIVE

13

113

OppoFindX7

Oppo

China

18-01-2024

1100

5000

android

ACTIVE

14

114

Pixel8Pro

Google

USA

15-10-2023

11299

5050

ANDROID

ACTIVE

15

115

RealmeGT5

Realme

India

11-11-2023

699

5240

ANDROID

NULL

16

116

SonyXperia1VI

Sony

Japan

05-03-2024

1800

5000

ANDROID

ACTIVE

17

117

HuaweiP70

Huawei

China

20-04-2024

1450

5200

ANDROID

ACTIVE

18

118

NothingPhone2

Nothing

UK

15-07-2023

799

4700

ANDROID

ACTIVE

19

119

iphone15pro

max

USA

12-09-2023

1599

4500

ios

ACTIVE

20

120

GalaxyFold6

Samsung

SouthKorea

10-07-2024

2100

4400

ANDROID

ACTIVE

Explanation of This R Code

This R workflow recreates your SAS DATALINES dataset using tibble() from the tidyverse ecosystem.

Key Enterprise Concepts Used

SAS Concept

R Equivalent

DATALINES

tibble()

INPUT statement

Column vectors

FORMAT / INFORMAT

lubridate::dmy()

The Truncation Trap in SAS

One of the most dangerous SAS issues is the “Truncation Trap.”

If LENGTH is not declared before INPUT or assignment statements, SAS automatically guesses variable lengths based on the first observed value.

Example:

length Operating_System $20;

Without this statement:

  • “ANDROID” may truncate to “ANDR”
  • “SouthKorea” may truncate unexpectedly
  • Business rules fail silently

This is extremely dangerous in SDTM mapping and regulatory reporting.

R handles strings differently because memory allocation is dynamic, whereas SAS uses fixed-width storage architecture.

Enterprise SAS Data Cleaning Workflow

Step 1 — Standardization Layer

data mobiles_clean;

 set mobiles_raw;

 /* Standardize text formatting */

 Mobile_Name = propcase(strip(Mobile_Name));

 Brand = propcase(strip(Brand));

 Country = upcase(strip(Country));

 Operating_System = upcase(strip(Operating_System));

 Monitoring_Status = upcase(strip(Monitoring_Status));

 /* Replace NULL text values */

 if Mobile_Name='Null' then Mobile_Name='Unknown Device';

 if Monitoring_Status='NULL' then Monitoring_Status='PENDING';

 /* Correct negative values */

 Price = abs(Price);

 Battery_mAh = abs(Battery_mAh);

 /* Date validation */

 if missing(Launch_Date) then Launch_Date = today();

run;

proc print data = mobiles_clean;

run;

OUTPUT:

ObsMobile_NameBrandCountryOperating_SystemMonitoring_StatusLaunch_DatePriceMobile_IDBattery_mAh
1Iphone15proAppleUSAIOSACTIVE12SEP2023$1,599.001014500
2Galaxys24ultraSamsungSOUTHKOREAANDROIDACTIVE25JAN2024$1,400.001025000
3Pixel8proGoogleUSAANDROIDACTIVE15OCT2023$999.001035050
4Oneplus12OneplusINDIAANDROIDPENDING11FEB2024$899.001045400
5Xiaomi14ultraXiaomiCHINAANDROIDACTIVE13MAY2026$1,200.001055300
6Unknown DeviceAppleUSAIOSACTIVE01MAR2024$1,300.001064500
7Rogphone8AsusTAIWANANDROIDACTIVE15JAN2024$1,700.001075500
8Galaxys24ultraSamsungSOUTHKOREAANDROIDACTIVE25JAN2024$1,400.001085000
9Motoedge50MotorolaUSAANDROIDINACTIVE05MAY2024$799.001094700
10Nokiaxr21NokiaFINLANDANDROIDACTIVE10AUG2023$650.001104800
11Iphone15proAppleUSAIOSACTIVE12SEP2023$1,599.001114500
12Vivox100proVivoCHINAANDROIDACTIVE20JAN2024$999.001125400
13Oppofindx7OppoCHINAANDROIDACTIVE18JAN2024$1,100.001135000
14Pixel8proGoogleUSAANDROIDACTIVE15OCT2023$1,299.001145050
15Realmegt5RealmeINDIAANDROIDPENDING11NOV2023$699.001155240
16Sonyxperia1viSonyJAPANANDROIDACTIVE05MAR2024$1,800.001165000
17Huaweip70HuaweiCHINAANDROIDACTIVE20APR2024$1,450.001175200
18Nothingphone2NothingUKANDROIDACTIVE15JUL2023$799.001184700
19Iphone15proMaxUSAIOSACTIVE12SEP2023$1,599.001194500
20Galaxyfold6SamsungSOUTHKOREAANDROIDACTIVE10JUL2024$2,100.001204400

Explanation of Standardization Logic

This DATA step acts as the first enterprise cleaning layer.

Key SAS Functions Used

Function

Purpose

PROPCASE

Standardizes text

STRIP

Removes leading/trailing spaces

ABS

Corrects negative values

MISSING

Detects null values

TODAY()

Imputes invalid dates

Business Importance

Without standardization:

  • Dashboards split “ANDROID” and “android”
  • AI models misclassify operating systems
  • Duplicate detection fails
  • Reporting becomes inconsistent

This mirrors real SDTM compliance problems in clinical trials.

PROC SQL vs DATA Step Deduplication

/*DATA Step Method*/

proc sort data=mobiles_clean nodupkey;

    by Mobile_Name Brand Launch_Date;

run;

proc print data = mobiles_clean;

run;

OUTPUT:

ObsMobile_NameBrandCountryOperating_SystemMonitoring_StatusLaunch_DatePriceMobile_IDBattery_mAh
1Galaxyfold6SamsungSOUTHKOREAANDROIDACTIVE10JUL2024$2,100.001204400
2Galaxys24ultraSamsungSOUTHKOREAANDROIDACTIVE25JAN2024$1,400.001025000
3Huaweip70HuaweiCHINAANDROIDACTIVE20APR2024$1,450.001175200
4Iphone15proAppleUSAIOSACTIVE12SEP2023$1,599.001014500
5Iphone15proMaxUSAIOSACTIVE12SEP2023$1,599.001194500
6Motoedge50MotorolaUSAANDROIDINACTIVE05MAY2024$799.001094700
7Nokiaxr21NokiaFINLANDANDROIDACTIVE10AUG2023$650.001104800
8Nothingphone2NothingUKANDROIDACTIVE15JUL2023$799.001184700
9Oneplus12OneplusINDIAANDROIDPENDING11FEB2024$899.001045400
10Oppofindx7OppoCHINAANDROIDACTIVE18JAN2024$1,100.001135000
11Pixel8proGoogleUSAANDROIDACTIVE15OCT2023$999.001035050
12Realmegt5RealmeINDIAANDROIDPENDING11NOV2023$699.001155240
13Rogphone8AsusTAIWANANDROIDACTIVE15JAN2024$1,700.001075500
14Sonyxperia1viSonyJAPANANDROIDACTIVE05MAR2024$1,800.001165000
15Unknown DeviceAppleUSAIOSACTIVE01MAR2024$1,300.001064500
16Vivox100proVivoCHINAANDROIDACTIVE20JAN2024$999.001125400
17Xiaomi14ultraXiaomiCHINAANDROIDACTIVE13MAY2026$1,200.001055300

Why PROC SORT NODUPKEY Matters

This removes duplicate business records while preserving:

  • Reporting integrity
  • Accurate device counts
  • Reliable healthcare enrollment statistics

DATA Step vs PROC SQL

Feature

DATA Step

PROC SQL

Speed

Faster sequential logic

Better relational joins

Row-by-row processing

Excellent

Moderate

Complex joins

Moderate

Excellent

Readability

High

High

Enterprise ETL

Preferred

Preferred for integrations

Advanced SAS Business Logic

/*SELECT-WHEN Classification*/

data mobiles_segmented;

 set mobiles_clean;

 length Price_Category $20;

 select;

 when (Price > 1500)

  Price_Category='ULTRA PREMIUM';

 when (Price >= 1000 and Price <= 1500)

  Price_Category='PREMIUM';  

 otherwise

  Price_Category='MID RANGE';

 end;

run;

proc print data = mobiles_segmented;

run;

OUTPUT:

ObsMobile_NameBrandCountryOperating_SystemMonitoring_StatusLaunch_DatePriceMobile_IDBattery_mAhPrice_Category
1Galaxyfold6SamsungSOUTHKOREAANDROIDACTIVE10JUL2024$2,100.001204400ULTRA PREMIUM
2Galaxys24ultraSamsungSOUTHKOREAANDROIDACTIVE25JAN2024$1,400.001025000PREMIUM
3Huaweip70HuaweiCHINAANDROIDACTIVE20APR2024$1,450.001175200PREMIUM
4Iphone15proAppleUSAIOSACTIVE12SEP2023$1,599.001014500ULTRA PREMIUM
5Iphone15proMaxUSAIOSACTIVE12SEP2023$1,599.001194500ULTRA PREMIUM
6Motoedge50MotorolaUSAANDROIDINACTIVE05MAY2024$799.001094700MID RANGE
7Nokiaxr21NokiaFINLANDANDROIDACTIVE10AUG2023$650.001104800MID RANGE
8Nothingphone2NothingUKANDROIDACTIVE15JUL2023$799.001184700MID RANGE
9Oneplus12OneplusINDIAANDROIDPENDING11FEB2024$899.001045400MID RANGE
10Oppofindx7OppoCHINAANDROIDACTIVE18JAN2024$1,100.001135000PREMIUM
11Pixel8proGoogleUSAANDROIDACTIVE15OCT2023$999.001035050MID RANGE
12Realmegt5RealmeINDIAANDROIDPENDING11NOV2023$699.001155240MID RANGE
13Rogphone8AsusTAIWANANDROIDACTIVE15JAN2024$1,700.001075500ULTRA PREMIUM
14Sonyxperia1viSonyJAPANANDROIDACTIVE05MAR2024$1,800.001165000ULTRA PREMIUM
15Unknown DeviceAppleUSAIOSACTIVE01MAR2024$1,300.001064500PREMIUM
16Vivox100proVivoCHINAANDROIDACTIVE20JAN2024$999.001125400MID RANGE
17Xiaomi14ultraXiaomiCHINAANDROIDACTIVE13MAY2026$1,200.001055300PREMIUM

Explanation

SELECT-WHEN is cleaner than multiple IF-THEN statements.

It improves:

  • Readability
  • Processing logic
  • Enterprise maintainability

Very useful in ADaM derivations and TLF categorization.

Using ARRAYS for Bulk Cleaning

data mobiles_array;

 set mobiles_clean;

 array nums(*) Price Battery_mAh;

 do i=1 to dim(nums);

  if nums(i)<0 then nums(i)=abs(nums(i));

 end;

drop i;

run;

proc print data = mobiles_array;

run;

OUTPUT:

ObsMobile_NameBrandCountryOperating_SystemMonitoring_StatusLaunch_DatePriceMobile_IDBattery_mAh
1Galaxyfold6SamsungSOUTHKOREAANDROIDACTIVE10JUL2024$2,100.001204400
2Galaxys24ultraSamsungSOUTHKOREAANDROIDACTIVE25JAN2024$1,400.001025000
3Huaweip70HuaweiCHINAANDROIDACTIVE20APR2024$1,450.001175200
4Iphone15proAppleUSAIOSACTIVE12SEP2023$1,599.001014500
5Iphone15proMaxUSAIOSACTIVE12SEP2023$1,599.001194500
6Motoedge50MotorolaUSAANDROIDINACTIVE05MAY2024$799.001094700
7Nokiaxr21NokiaFINLANDANDROIDACTIVE10AUG2023$650.001104800
8Nothingphone2NothingUKANDROIDACTIVE15JUL2023$799.001184700
9Oneplus12OneplusINDIAANDROIDPENDING11FEB2024$899.001045400
10Oppofindx7OppoCHINAANDROIDACTIVE18JAN2024$1,100.001135000
11Pixel8proGoogleUSAANDROIDACTIVE15OCT2023$999.001035050
12Realmegt5RealmeINDIAANDROIDPENDING11NOV2023$699.001155240
13Rogphone8AsusTAIWANANDROIDACTIVE15JAN2024$1,700.001075500
14Sonyxperia1viSonyJAPANANDROIDACTIVE05MAR2024$1,800.001165000
15Unknown DeviceAppleUSAIOSACTIVE01MAR2024$1,300.001064500
16Vivox100proVivoCHINAANDROIDACTIVE20JAN2024$999.001125400
17Xiaomi14ultraXiaomiCHINAANDROIDACTIVE13MAY2026$1,200.001055300

Why ARRAYS Matter

ARRAYS are essential when:

  • Cleaning 100+ variables
  • Processing lab domains
  • Validating clinical measurements

Instead of writing repetitive logic, arrays automate bulk transformations efficiently.

PROC FORMAT for Professional Outputs

proc format;

value pricefmt low-999='Budget'

             1000-1499='Premium'

             1500-high='Luxury';

run;

LOG:

NOTE: Format PRICEFMT has been output.

PROC REPORT Dashboard

proc report data=mobiles_clean nowd;

columns Brand Country Price Battery_mAh;

define Brand / group;

define Country / group;

define Price / analysis mean format=pricefmt.;

define Battery_mAh / analysis mean;

run;

OUTPUT:

BrandCountryPriceBattery_mAh
AppleUSAPremium4500
AsusTAIWANLuxury5500
GoogleUSABudget5050
HuaweiCHINAPremium5200
MaxUSALuxury4500
MotorolaUSABudget4700
NokiaFINLANDBudget4800
NothingUKBudget4700
OneplusINDIABudget5400
OppoCHINAPremium5000
RealmeINDIABudget5240
SamsungSOUTHKOREALuxury4700
SonyJAPANLuxury5000
VivoCHINABudget5400
XiaomiCHINAPremium5300

Reporting Importance

PROC REPORT is heavily used in:

  • Clinical reporting
  • Regulatory submissions
  • Audit-ready dashboards
  • Executive summaries

It produces presentation-quality outputs directly from SAS.

PROC SUMMARY for Aggregation

proc summary data=mobiles_clean nway;

class Brand;

var Price Battery_mAh;

output out=summary_stats mean=Avg_Price Avg_Battery;

run;

proc print data = summary_stats;

run;

OUTPUT:

ObsBrand_TYPE__FREQ_Avg_PriceAvg_Battery
1Apple12$1,449.504500
2Asus11$1,700.005500
3Google11$999.005050
4Huawei11$1,450.005200
5Max11$1,599.004500
6Motorola11$799.004700
7Nokia11$650.004800
8Nothing11$799.004700
9Oneplus11$899.005400
10Oppo11$1,100.005000
11Realme11$699.005240
12Samsung12$1,750.004700
13Sony11$1,800.005000
14Vivo11$999.005400
15Xiaomi11$1,200.005300

Why PROC SUMMARY is Powerful

PROC SUMMARY is optimized for:

  • Enterprise aggregation
  • Statistical reporting
  • Real-time dashboards
  • Performance-heavy environments

It is faster than many manual aggregation methods.

R Refinement Layer

R Data Cleaning Workflow

library(tidyverse)

library(lubridate)

library(janitor)

mobiles_clean <- mobiles_raw %>%

  clean_names() %>%

  mutate(mobile_name = str_trim(str_to_title(mobile_name)),

         brand = str_to_title(brand),

         operating_system = str_to_upper(operating_system),

         monitoring_status = case_when(

         monitoring_status=="NULL" ~ "PENDING",TRUE ~ monitoring_status),

  price = abs(price),

  battery_m_ah = abs(battery_m_ah),

  launch_date = suppressWarnings(dmy(launch_date)) %>%

  distinct()

View(mobiles_clean)  

OUTPUT:

 

mobile_id

mobile_name

brand

country

launch_date

price

battery_m_ah

operating_system

monitoring_status

1

101

Iphone15pro

Apple

USA

12-09-2023

1599

4500

IOS

ACTIVE

2

102

Galaxys24ultra

Samsung

SouthKorea

25-01-2024

1400

5000

ANDROID

ACTIVE

3

103

Pixel8pro

Google

USA

15-10-2023

999

5050

ANDROID

ACTIVE

4

104

Oneplus12

Oneplus

India

11-02-2024

899

5400

ANDROID

PENDING

5

105

Xiaomi14ultra

Xiaomi

China

NA

1200

5300

ANDROID

ACTIVE

6

106

Null

Apple

USA

01-03-2024

1300

4500

IOS

ACTIVE

7

107

Rogphone8

Asus

Taiwan

15-01-2024

1700

5500

ANDROID

ACTIVE

8

108

Galaxys24ultra

Samsung

SouthKorea

25-01-2024

1400

5000

ANDROID

ACTIVE

9

109

Motoedge50

Motorola

USA

05-05-2024

799

4700

ANDROID

inactive

10

110

Nokiaxr21

Nokia

Finland

10-08-2023

650

4800

ANDROID

ACTIVE

11

111

Iphone15pro

Apple

USA

12-09-2023

599

4500

IOS

ACTIVE

12

112

Vivox100pro

Vivo

China

20-01-2024

999

5400

ANDROID

ACTIVE

13

113

Oppofindx7

Oppo

China

18-01-2024

1100

5000

ANDROID

ACTIVE

14

114

Pixel8pro

Google

USA

15-10-2023

11299

5050

ANDROID

ACTIVE

15

115

Realmegt5

Realme

India

11-11-2023

699

5240

ANDROID

PENDING

16

116

Sonyxperia1vi

Sony

Japan

05-03-2024

1800

5000

ANDROID

ACTIVE

17

117

Huaweip70

Huawei

China

20-04-2024

1450

5200

ANDROID

ACTIVE

18

118

Nothingphone2

Nothing

UK

15-07-2023

799

4700

ANDROID

ACTIVE

19

119

Iphone15pro

Max

USA

12-09-2023

1599

4500

IOS

ACTIVE

20

120

Galaxyfold6

Samsung

SouthKorea

10-07-2024

2100

4400

ANDROID

ACTIVE

SAS vs R Cleaning Comparison

SAS

R Equivalent

PROPCASE

str_to_title()

STRIP

str_trim()

IF-THEN

if_else()

PROC SORT NODUPKEY

distinct()

PROC SQL

dplyr joins

MISSING()

is.na()

CATX

unite()

SCAN

separate()

Missing Value Trap in SAS

SAS treats missing numeric values as smaller than any number.

Example:

if Price < 1000 then Flag='YES';

Missing prices will incorrectly qualify.

Correct approach:

if not missing(Price) and Price < 1000 then Flag='YES';

This is one of the most dangerous enterprise SAS mistakes.

20 Golden Rules for Enterprise SAS/R Projects

  1. Always define LENGTH before logic.
  2. Never trust raw source systems.
  3. Standardize casing immediately.
  4. Remove duplicates early.
  5. Validate date formats.
  6. Audit negative numeric values.
  7. Use reusable macros.
  8. Separate raw and clean layers.
  9. Preserve audit trails.
  10. Use PROC CONTENTS frequently.
  11. Validate joins carefully.
  12. Avoid hardcoding values.
  13. Use formats for reporting consistency.
  14. Create QC validation steps.
  15. Track derivation logic.
  16. Maintain SDTM compliance.
  17. Use ARRAYS for scalability.
  18. Document every transformation.
  19. Test missing-value scenarios.
  20. Build reproducible workflows.

20 Additional Data Cleaning Best Practices

  1. Validate SDTM variable lengths.
  2. Maintain Define.xml consistency.
  3. Preserve source traceability.
  4. Never overwrite raw datasets.
  5. Validate controlled terminology.
  6. Ensure ISO8601 date compliance.
  7. Audit imputation logic.
  8. Review duplicate patient IDs.
  9. Validate treatment dates.
  10. Monitor protocol deviations.
  11. Standardize adverse event coding.
  12. Validate laboratory units.
  13. Track derivation lineage.
  14. Create validation datasets.
  15. Implement QC double programming.
  16. Document reviewer comments.
  17. Protect PHI-sensitive data.
  18. Use metadata-driven programming.
  19. Validate ADaM derivations.
  20. Maintain audit-ready outputs.

Business Logic Behind Data Cleaning

Data cleaning is fundamentally about protecting decision quality. In healthcare analytics, a single missing value can incorrectly classify a patient’s risk profile. In banking, malformed salary values can reject valid loan applicants. In clinical trials, invalid dates can destroy treatment-emergent adverse event derivations.

Consider patient age correction. If age appears as “250,” the system must validate and correct or quarantine the record because unrealistic ages distort demographic statistics and machine-learning predictions.

Similarly, salary normalization ensures financial systems compare standardized income values instead of mixed currencies or malformed text strings.

Date imputation is equally critical. If hospital admission dates are partially missing, statistical teams may use controlled imputation logic to preserve analytical continuity while maintaining regulatory traceability.

Replacing missing values improves model stability, reporting completeness, and operational trust. However, imputation must always follow documented business rules and regulatory guidance.

Data cleaning therefore is not merely technical preprocessing it is enterprise risk management. Every corrected value, standardized string, duplicate removal, and validated date directly impacts business intelligence accuracy, compliance readiness, AI reliability, and executive decision-making quality.

20 Sharp Enterprise Insights

  1. Dirty data leads to wrong conclusions.
  2. Standardization ensures reproducibility.
  3. Missing dates can destroy timelines.
  4. Duplicate records inflate metrics.
  5. Truncation silently corrupts analytics.
  6. Validation protects business credibility.
  7. Audit trails ensure regulatory trust.
  8. Clean data improves AI accuracy.
  9. PROC SQL simplifies relational logic.
  10. DATA step excels in row processing.
  11. ARRAYS reduce repetitive code.
  12. Formats improve reporting consistency.
  13. Macros increase scalability.
  14. Documentation prevents confusion.
  15. QC programming catches hidden issues.
  16. Controlled terminology improves compliance.
  17. Consistent casing prevents duplicate categories.
  18. Data governance improves enterprise confidence.
  19. Real-time pipelines require strict validation.
  20. Clean datasets create trustworthy intelligence.

Summary

SAS and R both dominate enterprise analytics, but each platform offers distinct advantages in professional data-cleaning ecosystems.

SAS excels in regulated industries such as clinical trials, pharmaceuticals, banking, and insurance because of its structured architecture, reproducibility, audit readiness, and scalable batch-processing capabilities. DATA step programming provides exceptional row-level control, while PROC SQL, PROC REPORT, PROC SUMMARY, and PROC FORMAT create enterprise-grade reporting systems trusted by regulatory agencies worldwide.

SAS also handles large-volume production pipelines extremely efficiently. Features like FIRST./LAST. processing, arrays, retained variables, macros, and metadata-driven programming make SAS ideal for SDTM, ADaM, and TLF workflows.

R, on the other hand, offers flexibility, rapid experimentation, and advanced open-source integrations. Packages such as tidyverse, dplyr, stringr, lubridate, janitor, and purrr simplify complex transformations into readable pipelines. R is highly favored for exploratory analytics, machine learning, visualization, and agile statistical workflows.

The strongest enterprise teams often combine both technologies strategically.

SAS becomes the controlled production engine:

  • Regulatory reporting
  • Clinical submissions
  • Enterprise ETL
  • Audit-ready outputs

R becomes the innovation layer:

  • Exploratory analysis
  • Predictive modeling
  • Visualization
  • Flexible transformations

Together, SAS and R create a powerful analytics ecosystem capable of transforming chaotic raw datasets into trusted business intelligence platforms supporting healthcare, finance, telecom, and AI-driven decision systems.

Conclusion

Modern organizations are no longer limited by data availability they are limited by data quality.

The world now generates enormous streams of information from smartphones, healthcare systems, wearable devices, banking platforms, IoT infrastructure, and AI applications. However, raw data arriving from these systems is rarely clean, consistent, or analysis-ready. Missing values, malformed text, duplicate records, invalid dates, inconsistent formatting, and logical conflicts continuously threaten analytical reliability.

This is why structured data-cleaning frameworks using SAS and R have become mission-critical enterprise capabilities.

In this project, we transformed a globally famous mobiles dataset into a professional-grade analytics pipeline using advanced SAS DATA step programming, PROC SQL, reporting procedures, macros, and modern R refinement workflows. Along the way, we explored enterprise challenges such as truncation risks, duplicate management, missing-value traps, date validation, business-rule standardization, and audit-ready reporting.

The deeper lesson is this:

Data cleaning is not a support task.
It is the foundation of trustworthy intelligence.

Whether building SDTM datasets for clinical trials, fraud-detection systems for banking, AI-ready telecom pipelines, or executive dashboards for healthcare monitoring, clean data determines whether organizations make accurate decisions or catastrophic mistakes.

SAS provides industrial-grade governance, reproducibility, and regulatory confidence.
R provides flexibility, innovation, and analytical acceleration.

Together, they create an ecosystem capable of converting broken datasets into strategic assets.

The future belongs not merely to organizations with the most data but to organizations with the cleanest, most reliable, and most intelligently engineered data pipelines.

Interview Questions and Answers

1. Why is LENGTH placement important in SAS?

Answer:

If LENGTH is declared after variable assignment, SAS fixes variable size using the first encountered value, causing truncation. This silently corrupts downstream analytics and regulatory outputs.

2. Difference between PROC SQL and DATA Step?

Answer:

DATA step is optimized for sequential row-level transformations, while PROC SQL is stronger for joins, aggregations, and relational logic.

3. How do you handle duplicate healthcare records?

Answer:

Use PROC SORT NODUPKEY, business-rule validation, and audit comparison reports before deleting duplicates.

4. How would you debug missing-value issues in SAS?

Answer:

Use MISSING(), PROC FREQ, PROC MEANS NMISS, and validation flags to identify hidden null propagation issues.

5. A clinical dataset shows treatment end dates before treatment start dates. How would you fix it?

Answer:

First validate source-system integrity, identify malformed dates, use INTCK and logical validation rules, quarantine invalid records, document corrections, and rerun QC validation before downstream ADaM derivations.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about MOBILE DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



3.Data Disasters to Data Intelligence: Mastering TRANWRD in SAS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

453.Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS