Dangerous Chocolates, Broken Dashboards & Clinical Data Nightmares: Real-World SAS and R Cleaning Frameworks

Transforming Dangerous Chocolate Intelligence into Trusted Enterprise Reporting Using SAS (PROC SQL vs DATA Step) and Modern R 

Introduction

In modern analytics ecosystems, data corruption is not just a technical inconvenience it is a business disaster waiting to happen. I have personally seen clinical trial submissions delayed because of duplicate patient IDs, banking fraud models fail because of inconsistent region codes, and insurance dashboards collapse due to malformed date values. Even something as unusual as a “Dangerous Chocolates Worldwide” surveillance dataset can expose the same enterprise-grade data quality failures that appear in clinical trials, retail systems, and pharmacovigilance pipelines.

Imagine a multinational food-safety organization monitoring dangerous chocolates linked to contamination, toxic ingredients, counterfeit labeling, or allergic reactions. Analysts receive raw operational feeds from factories, hospitals, customs systems, and consumer complaint portals. Unfortunately, the incoming datasets contain duplicate product IDs, invalid manufacturing dates, corrupted region codes, missing toxicity scores, impossible age values, inconsistent category labels, malformed manufacturer emails, and mixed text formatting.

Now imagine using this corrupted data for:

  • Regulatory safety submissions
  • AI-based contamination prediction
  • Consumer risk classification
  • Executive reporting dashboards
  • Statistical safety summaries
  • Recall decision systems

One bad transformation can trigger false recalls, missed toxicity signals, or catastrophic compliance failures.

That is why enterprise-grade data cleaning using SAS and R remains one of the most valuable skills in clinical programming, healthcare analytics, banking intelligence, and regulatory reporting.

1.Raw Dangerous Chocolates Dataset with Intentional Errors (SAS)

data dangerous_chocolate_raw;

length Product_ID $12 Chocolate_Name $35 Country $20 Region_Code $12

       Manufacturer_Email $40 Toxicity_Level $15 Recall_Flag $10;

infile datalines dlm='|' dsd truncover;

input Product_ID $ Chocolate_Name $ Country $ Region_Code $

      Manufacturer_Email $ Toxicity_Score Consumer_Age

      Manufacturing_Date :$20. Toxicity_Level $ Recall_Flag $;

format Toxicity_Score 8.2;

datalines;

DC001|Dark venom bar|India|AP_SOUTH|toxicmail.com|95|34|2025-01-10|HIGH|YES

DC002|White Death|usa|north_01|factory@choco.com|-15|29|2025-02-12|MEDIUM|NO

DC003|NULL|UK|EU-West|danger@factory|88|150|2025-15-01|HIGH|YES

DC004| Bitter Hazard |India| AP-S |maker@safe.com|76|45|2025-03-18|LOW|NO

DC005|Black Poison|INDIA|south01|wrongemail@|102|28|2025-04-20|EXTREME|YES

DC005|Black Poison|INDIA|south01|wrongemail@|102|28|2025-04-20|EXTREME|YES

DC006|Milk Disaster|Canada|CA_99|factorymail.com|65|-8|2025-02-31|MEDIUM|NO

DC007|Cocoa Inferno|Germany|EU-CENTRAL|valid@plant.com|92|39||HIGH|YES

DC008|Toxic Crunch|India|NULL|plant@factory.com|45|22|2025-06-11|LOW|NO

DC009|Silent Killer|Brazil|SA-01|support#mail.com|85|44|2025-07-01|HIGH|YES

DC010|Sugar Venom|France|EU_02|factory@global.com|.|31|INVALIDDATE|MEDIUM|NO

DC011| Rotten Cocoa |India|south01|hello@choco.com|55|19|2025-08-08|MILD|NO

DC012|Hazard Bite|USA|NORTH_01|plant@hazard.com|77|200|2025-09-09|HIGH|YES

DC013|Nightmare Melt|Japan|APAC-9|mailhazard.com|61|40|2025-10-10|LOW|NO

DC014|Dark Acid|India|AP-SOUTH|contact@acid.com|89|37|2025-11-11|SEVERE|YES

DC015|Venom Cube|Mexico|MX_01|mxplant.com|48|26|2025-12-12|LOW|NO

DC016|Fatal Syrup|India|AP_SOUTH|fatal@plant.com|120|41|2025-01-01|HIGH|YES

DC017|Cocoa Corruption|UK|EU-West|ukfactory.com|72|33|2025-03-14|MEDIUM|NO

DC018|Deadly Delight|India|South01|india@safe.com|91|36|2025-04-15|HIGH|YES

DC019|Acid Choco|USA|North_01|mail@acid.com|67|17|2025-05-16|LOW|NO

DC020|Poison Wafer|India|AP-S|badmail.com|84|38|2025-06-17|HIGH|YES

;

run;

proc print data = dangerous_chocolate_raw;

run;

OUTPUT:

ObsProduct_IDChocolate_NameCountryRegion_CodeManufacturer_EmailToxicity_LevelRecall_FlagToxicity_ScoreConsumer_AgeManufacturing_Date
1DC001Dark venom barIndiaAP_SOUTHtoxicmail.comHIGHYES95.00342025-01-10
2DC002White Deathusanorth_01factory@choco.comMEDIUMNO-15.00292025-02-12
3DC003NULLUKEU-Westdanger@factoryHIGHYES88.001502025-15-01
4DC004Bitter HazardIndiaAP-Smaker@safe.comLOWNO76.00452025-03-18
5DC005Black PoisonINDIAsouth01wrongemail@EXTREMEYES102.00282025-04-20
6DC005Black PoisonINDIAsouth01wrongemail@EXTREMEYES102.00282025-04-20
7DC006Milk DisasterCanadaCA_99factorymail.comMEDIUMNO65.00-82025-02-31
8DC007Cocoa InfernoGermanyEU-CENTRALvalid@plant.comHIGHYES92.0039 
9DC008Toxic CrunchIndiaNULLplant@factory.comLOWNO45.00222025-06-11
10DC009Silent KillerBrazilSA-01support#mail.comHIGHYES85.00442025-07-01
11DC010Sugar VenomFranceEU_02factory@global.comMEDIUMNO.31INVALIDDATE
12DC011Rotten CocoaIndiasouth01hello@choco.comMILDNO55.00192025-08-08
13DC012Hazard BiteUSANORTH_01plant@hazard.comHIGHYES77.002002025-09-09
14DC013Nightmare MeltJapanAPAC-9mailhazard.comLOWNO61.00402025-10-10
15DC014Dark AcidIndiaAP-SOUTHcontact@acid.comSEVEREYES89.00372025-11-11
16DC015Venom CubeMexicoMX_01mxplant.comLOWNO48.00262025-12-12
17DC016Fatal SyrupIndiaAP_SOUTHfatal@plant.comHIGHYES120.00412025-01-01
18DC017Cocoa CorruptionUKEU-Westukfactory.comMEDIUMNO72.00332025-03-14
19DC018Deadly DelightIndiaSouth01india@safe.comHIGHYES91.00362025-04-15
20DC019Acid ChocoUSANorth_01mail@acid.comLOWNO67.00172025-05-16
21DC020Poison WaferIndiaAP-Sbadmail.comHIGHYES84.00382025-06-17

Explanation and Key Points

This dataset intentionally mimics real-world enterprise corruption. We included duplicate Product_ID values, invalid emails, negative toxicity scores, impossible ages, malformed dates, inconsistent case formatting, NULL strings, and whitespace corruption.

The LENGTH statement appears before assignments because SAS allocates memory during compilation. If LENGTH is declared after assignment logic, SAS may permanently truncate values. This is a critical enterprise issue called Character Truncation Risk.

Unlike SAS fixed-length character storage, R dynamically manages character vectors without predefined width allocation. SAS programmers must proactively control metadata structure before transformations begin.

2.PROC CONTENTS for Metadata Validation

proc contents data=dangerous_chocolate_raw;

run;

OUTPUT:

The CONTENTS Procedure

Data Set NameWORK.DANGEROUS_CHOCOLATE_RAWObservations21
Member TypeDATAVariables10
EngineV9Indexes0
Created05/19/2026 07:35:55Observation Length184
Last Modified05/19/2026 07:35:55Deleted Observations0
Protection CompressedNO
Data Set Type SortedNO
Label   
Data RepresentationSOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64  
Encodingutf-8 Unicode (UTF-8)  
Engine/Host Dependent Information
Data Set Page Size131072
Number of Data Set Pages1
First Data Page1
Max Obs per Page711
Obs in First Data Page21
Number of Data Set Repairs0
Filename/saswork/SAS_workFA7100002B46_odaws01-apse1-2.oda.sas.com/SAS_work2E6100002B46_odaws01-apse1-2.oda.sas.com/dangerous_chocolate_raw.sas7bdat
Release Created9.0401M8
Host CreatedLinux
Inode Number201335450
Access Permissionrw-r--r--
Owner Nameu63247146
File Size256KB
File Size (bytes)262144
Alphabetic List of Variables and Attributes
#VariableTypeLenFormat
2Chocolate_NameChar35 
9Consumer_AgeNum8 
3CountryChar20 
5Manufacturer_EmailChar40 
10Manufacturing_DateChar20 
1Product_IDChar12 
7Recall_FlagChar10 
4Region_CodeChar12 
6Toxicity_LevelChar15 
8Toxicity_ScoreNum88.2

Explanation and Key Points

PROC CONTENTS acts like a metadata audit report. In enterprise clinical environments, programmers use it to validate variable lengths, formats, informats, and storage types before downstream processing.

This is extremely important because regulatory submissions such as SDTM and ADaM require exact metadata consistency. A mismatch between numeric and character variables can break TLF generation pipelines or validation macros.

3.Data Cleaning Workflow Using DATA Step

data chocolate_cleaned;

retain Source_System "GLOBAL_FOOD_SURVEILLANCE";

set dangerous_chocolate_raw;

Chocolate_Name = propcase(strip(Chocolate_Name));

Country = upcase(strip(Country));

Region_Code = compress(upcase(Region_Code),'-_ ');

Manufacturer_Email = lowcase(strip(Manufacturer_Email));

if find(Manufacturer_Email,'@') = 0 then

    Manufacturer_Email = 'INVALID_EMAIL';

if Toxicity_Score < 0 then

    Toxicity_Score = abs(Toxicity_Score);

if Toxicity_Score > 100 then Toxicity_Score = 100;

if Consumer_Age < 0 or Consumer_Age > 120 then

    Consumer_Age = .;

if Manufacturing_Date='' then

    Manufacturing_Date='2025-01-01';

Manufacturing_DT = input(Manufacturing_Date,?? yymmdd10.);

format Manufacturing_DT date9.;

if missing(Manufacturing_DT) then

    Manufacturing_DT='01JAN2025'd;

Risk_Category = 

    ifc(Toxicity_Score >=80,'CRITICAL',

    ifc(Toxicity_Score >=60,'HIGH',

    ifc(Toxicity_Score >=40,'MEDIUM','LOW')));

Months_Since_Manufacture = 

    intck('month',Manufacturing_DT,today());

run;

proc print data = chocolate_cleaned;

run;

OUTPUT:

ObsSource_SystemProduct_IDChocolate_NameCountryRegion_CodeManufacturer_EmailToxicity_LevelRecall_FlagToxicity_ScoreConsumer_AgeManufacturing_DateManufacturing_DTRisk_CategoryMonths_Since_Manufacture
1GLOBAL_FOOD_SURVEILLANCEDC001Dark Venom BarINDIAAPSOUTHINVALID_EMAILHIGHYES95.00342025-01-1010JAN2025CRITICAL16
2GLOBAL_FOOD_SURVEILLANCEDC002White DeathUSANORTH01factory@choco.comMEDIUMNO15.00292025-02-1212FEB2025LOW15
3GLOBAL_FOOD_SURVEILLANCEDC003NullUKEUWESTdanger@factoryHIGHYES88.00.2025-15-0101JAN2025CRITICAL16
4GLOBAL_FOOD_SURVEILLANCEDC004Bitter HazardINDIAAPSmaker@safe.comLOWNO76.00452025-03-1818MAR2025HIGH14
5GLOBAL_FOOD_SURVEILLANCEDC005Black PoisonINDIASOUTH01wrongemail@EXTREMEYES100.00282025-04-2020APR2025CRITICAL13
6GLOBAL_FOOD_SURVEILLANCEDC005Black PoisonINDIASOUTH01wrongemail@EXTREMEYES100.00282025-04-2020APR2025CRITICAL13
7GLOBAL_FOOD_SURVEILLANCEDC006Milk DisasterCANADACA99INVALID_EMAILMEDIUMNO65.00.2025-02-3101JAN2025HIGH16
8GLOBAL_FOOD_SURVEILLANCEDC007Cocoa InfernoGERMANYEUCENTRALvalid@plant.comHIGHYES92.00392025-01-0101JAN2025CRITICAL16
9GLOBAL_FOOD_SURVEILLANCEDC008Toxic CrunchINDIANULLplant@factory.comLOWNO45.00222025-06-1111JUN2025MEDIUM11
10GLOBAL_FOOD_SURVEILLANCEDC009Silent KillerBRAZILSA01INVALID_EMAILHIGHYES85.00442025-07-0101JUL2025CRITICAL10
11GLOBAL_FOOD_SURVEILLANCEDC010Sugar VenomFRANCEEU02factory@global.comMEDIUMNO.31INVALIDDATE01JAN2025LOW16
12GLOBAL_FOOD_SURVEILLANCEDC011Rotten CocoaINDIASOUTH01hello@choco.comMILDNO55.00192025-08-0808AUG2025MEDIUM9
13GLOBAL_FOOD_SURVEILLANCEDC012Hazard BiteUSANORTH01plant@hazard.comHIGHYES77.00.2025-09-0909SEP2025HIGH8
14GLOBAL_FOOD_SURVEILLANCEDC013Nightmare MeltJAPANAPAC9INVALID_EMAILLOWNO61.00402025-10-1010OCT2025HIGH7
15GLOBAL_FOOD_SURVEILLANCEDC014Dark AcidINDIAAPSOUTHcontact@acid.comSEVEREYES89.00372025-11-1111NOV2025CRITICAL6
16GLOBAL_FOOD_SURVEILLANCEDC015Venom CubeMEXICOMX01INVALID_EMAILLOWNO48.00262025-12-1212DEC2025MEDIUM5
17GLOBAL_FOOD_SURVEILLANCEDC016Fatal SyrupINDIAAPSOUTHfatal@plant.comHIGHYES100.00412025-01-0101JAN2025CRITICAL16
18GLOBAL_FOOD_SURVEILLANCEDC017Cocoa CorruptionUKEUWESTINVALID_EMAILMEDIUMNO72.00332025-03-1414MAR2025HIGH14
19GLOBAL_FOOD_SURVEILLANCEDC018Deadly DelightINDIASOUTH01india@safe.comHIGHYES91.00362025-04-1515APR2025CRITICAL13
20GLOBAL_FOOD_SURVEILLANCEDC019Acid ChocoUSANORTH01mail@acid.comLOWNO67.00172025-05-1616MAY2025HIGH12
21GLOBAL_FOOD_SURVEILLANCEDC020Poison WaferINDIAAPSINVALID_EMAILHIGHYES84.00382025-06-1717JUN2025CRITICAL11

Explanation and Key Points

This DATA step demonstrates enterprise-grade defensive programming.

Functions like:

  • PROPCASE
  • COMPRESS
  • STRIP
  • UPCASE
  • LOWCASE
  • INPUT
  • INTCK
  • IFC

help normalize corrupted operational data.

Notice how missing or invalid dates are safely imputed. In real clinical trials, invalid visit dates can destroy patient chronology and treatment-emergent adverse event analysis.

The INTCK function calculates elapsed manufacturing months similar to patient exposure duration calculations in ADaM datasets.

4.Deduplication Using PROC SORT NODUPKEY

proc sort data=chocolate_cleaned 

           out=chocolate_nodup nodupkey;

by Product_ID;

run;

proc print data = chocolate_nodup;

run;

LOG:

NOTE: There were 21 observations read from the data set WORK.CHOCOLATE_CLEANED.
NOTE: 1 observations with duplicate key values were deleted.

OUTPUT:

ObsSource_SystemProduct_IDChocolate_NameCountryRegion_CodeManufacturer_EmailToxicity_LevelRecall_FlagToxicity_ScoreConsumer_AgeManufacturing_DateManufacturing_DTRisk_CategoryMonths_Since_Manufacture
1GLOBAL_FOOD_SURVEILLANCEDC001Dark Venom BarINDIAAPSOUTHINVALID_EMAILHIGHYES95.00342025-01-1010JAN2025CRITICAL16
2GLOBAL_FOOD_SURVEILLANCEDC002White DeathUSANORTH01factory@choco.comMEDIUMNO15.00292025-02-1212FEB2025LOW15
3GLOBAL_FOOD_SURVEILLANCEDC003NullUKEUWESTdanger@factoryHIGHYES88.00.2025-15-0101JAN2025CRITICAL16
4GLOBAL_FOOD_SURVEILLANCEDC004Bitter HazardINDIAAPSmaker@safe.comLOWNO76.00452025-03-1818MAR2025HIGH14
5GLOBAL_FOOD_SURVEILLANCEDC005Black PoisonINDIASOUTH01wrongemail@EXTREMEYES100.00282025-04-2020APR2025CRITICAL13
6GLOBAL_FOOD_SURVEILLANCEDC006Milk DisasterCANADACA99INVALID_EMAILMEDIUMNO65.00.2025-02-3101JAN2025HIGH16
7GLOBAL_FOOD_SURVEILLANCEDC007Cocoa InfernoGERMANYEUCENTRALvalid@plant.comHIGHYES92.00392025-01-0101JAN2025CRITICAL16
8GLOBAL_FOOD_SURVEILLANCEDC008Toxic CrunchINDIANULLplant@factory.comLOWNO45.00222025-06-1111JUN2025MEDIUM11
9GLOBAL_FOOD_SURVEILLANCEDC009Silent KillerBRAZILSA01INVALID_EMAILHIGHYES85.00442025-07-0101JUL2025CRITICAL10
10GLOBAL_FOOD_SURVEILLANCEDC010Sugar VenomFRANCEEU02factory@global.comMEDIUMNO.31INVALIDDATE01JAN2025LOW16
11GLOBAL_FOOD_SURVEILLANCEDC011Rotten CocoaINDIASOUTH01hello@choco.comMILDNO55.00192025-08-0808AUG2025MEDIUM9
12GLOBAL_FOOD_SURVEILLANCEDC012Hazard BiteUSANORTH01plant@hazard.comHIGHYES77.00.2025-09-0909SEP2025HIGH8
13GLOBAL_FOOD_SURVEILLANCEDC013Nightmare MeltJAPANAPAC9INVALID_EMAILLOWNO61.00402025-10-1010OCT2025HIGH7
14GLOBAL_FOOD_SURVEILLANCEDC014Dark AcidINDIAAPSOUTHcontact@acid.comSEVEREYES89.00372025-11-1111NOV2025CRITICAL6
15GLOBAL_FOOD_SURVEILLANCEDC015Venom CubeMEXICOMX01INVALID_EMAILLOWNO48.00262025-12-1212DEC2025MEDIUM5
16GLOBAL_FOOD_SURVEILLANCEDC016Fatal SyrupINDIAAPSOUTHfatal@plant.comHIGHYES100.00412025-01-0101JAN2025CRITICAL16
17GLOBAL_FOOD_SURVEILLANCEDC017Cocoa CorruptionUKEUWESTINVALID_EMAILMEDIUMNO72.00332025-03-1414MAR2025HIGH14
18GLOBAL_FOOD_SURVEILLANCEDC018Deadly DelightINDIASOUTH01india@safe.comHIGHYES91.00362025-04-1515APR2025CRITICAL13
19GLOBAL_FOOD_SURVEILLANCEDC019Acid ChocoUSANORTH01mail@acid.comLOWNO67.00172025-05-1616MAY2025HIGH12
20GLOBAL_FOOD_SURVEILLANCEDC020Poison WaferINDIAAPSINVALID_EMAILHIGHYES84.00382025-06-1717JUN2025CRITICAL11

Explanation and Key Points

Duplicate records are one of the biggest enterprise risks. In clinical studies, duplicate patient IDs can lead to double-counting adverse events. In banking systems, duplicate transaction IDs can inflate revenue or fraud metrics.

NODUPKEY ensures only unique Product_ID records survive.

5.PROC SQL Enterprise Validation

proc sql;

create table chocolate_summary as

select Country,Risk_Category,

       count(*) as Total_Products,

       avg(Toxicity_Score) as Avg_Toxicity

from chocolate_nodup

group by Country, Risk_Category;

quit;

proc print data = chocolate_summary;

run;

OUTPUT:

ObsCountryRisk_CategoryTotal_ProductsAvg_Toxicity
1BRAZILCRITICAL185.0000
2CANADAHIGH165.0000
3FRANCELOW1.
4GERMANYCRITICAL192.0000
5INDIACRITICAL693.1667
6INDIAHIGH176.0000
7INDIAMEDIUM250.0000
8JAPANHIGH161.0000
9MEXICOMEDIUM148.0000
10UKCRITICAL188.0000
11UKHIGH172.0000
12USAHIGH272.0000
13USALOW115.0000

Explanation and Key Points

PROC SQL is highly efficient for aggregation, joins, and reconciliation workflows.

Clinical programmers frequently use PROC SQL for:

  • SDTM joins
  • ADaM derivations
  • Exposure reconciliation
  • Laboratory summaries

Compared with DATA step BY-group processing, SQL offers more concise syntax for grouped calculations.

6.Advanced ARRAY and DO Loop Validation

data chocolate_flags;

set chocolate_nodup;

array checks(*) Toxicity_Score Consumer_Age;

do i=1 to dim(checks);

   if checks(i)=. then

      Validation_Flag='MISSING_VALUE';

end;

drop i;

run;

proc print data = chocolate_flags;

run;

OUTPUT:

ObsSource_SystemProduct_IDChocolate_NameCountryRegion_CodeManufacturer_EmailToxicity_LevelRecall_FlagToxicity_ScoreConsumer_AgeManufacturing_DateManufacturing_DTRisk_CategoryMonths_Since_ManufactureValidation_Flag
1GLOBAL_FOOD_SURVEILLANCEDC001Dark Venom BarINDIAAPSOUTHINVALID_EMAILHIGHYES95.00342025-01-1010JAN2025CRITICAL16 
2GLOBAL_FOOD_SURVEILLANCEDC002White DeathUSANORTH01factory@choco.comMEDIUMNO15.00292025-02-1212FEB2025LOW15 
3GLOBAL_FOOD_SURVEILLANCEDC003NullUKEUWESTdanger@factoryHIGHYES88.00.2025-15-0101JAN2025CRITICAL16MISSING_VALUE
4GLOBAL_FOOD_SURVEILLANCEDC004Bitter HazardINDIAAPSmaker@safe.comLOWNO76.00452025-03-1818MAR2025HIGH14 
5GLOBAL_FOOD_SURVEILLANCEDC005Black PoisonINDIASOUTH01wrongemail@EXTREMEYES100.00282025-04-2020APR2025CRITICAL13 
6GLOBAL_FOOD_SURVEILLANCEDC006Milk DisasterCANADACA99INVALID_EMAILMEDIUMNO65.00.2025-02-3101JAN2025HIGH16MISSING_VALUE
7GLOBAL_FOOD_SURVEILLANCEDC007Cocoa InfernoGERMANYEUCENTRALvalid@plant.comHIGHYES92.00392025-01-0101JAN2025CRITICAL16 
8GLOBAL_FOOD_SURVEILLANCEDC008Toxic CrunchINDIANULLplant@factory.comLOWNO45.00222025-06-1111JUN2025MEDIUM11 
9GLOBAL_FOOD_SURVEILLANCEDC009Silent KillerBRAZILSA01INVALID_EMAILHIGHYES85.00442025-07-0101JUL2025CRITICAL10 
10GLOBAL_FOOD_SURVEILLANCEDC010Sugar VenomFRANCEEU02factory@global.comMEDIUMNO.31INVALIDDATE01JAN2025LOW16MISSING_VALUE
11GLOBAL_FOOD_SURVEILLANCEDC011Rotten CocoaINDIASOUTH01hello@choco.comMILDNO55.00192025-08-0808AUG2025MEDIUM9 
12GLOBAL_FOOD_SURVEILLANCEDC012Hazard BiteUSANORTH01plant@hazard.comHIGHYES77.00.2025-09-0909SEP2025HIGH8MISSING_VALUE
13GLOBAL_FOOD_SURVEILLANCEDC013Nightmare MeltJAPANAPAC9INVALID_EMAILLOWNO61.00402025-10-1010OCT2025HIGH7 
14GLOBAL_FOOD_SURVEILLANCEDC014Dark AcidINDIAAPSOUTHcontact@acid.comSEVEREYES89.00372025-11-1111NOV2025CRITICAL6 
15GLOBAL_FOOD_SURVEILLANCEDC015Venom CubeMEXICOMX01INVALID_EMAILLOWNO48.00262025-12-1212DEC2025MEDIUM5 
16GLOBAL_FOOD_SURVEILLANCEDC016Fatal SyrupINDIAAPSOUTHfatal@plant.comHIGHYES100.00412025-01-0101JAN2025CRITICAL16 
17GLOBAL_FOOD_SURVEILLANCEDC017Cocoa CorruptionUKEUWESTINVALID_EMAILMEDIUMNO72.00332025-03-1414MAR2025HIGH14 
18GLOBAL_FOOD_SURVEILLANCEDC018Deadly DelightINDIASOUTH01india@safe.comHIGHYES91.00362025-04-1515APR2025CRITICAL13 
19GLOBAL_FOOD_SURVEILLANCEDC019Acid ChocoUSANORTH01mail@acid.comLOWNO67.00172025-05-1616MAY2025HIGH12 
20GLOBAL_FOOD_SURVEILLANCEDC020Poison WaferINDIAAPSINVALID_EMAILHIGHYES84.00382025-06-1717JUN2025CRITICAL11 

Explanation and Key Points

ARRAY processing is extremely powerful in SAS enterprise pipelines. Instead of writing repetitive logic for every variable, arrays enable scalable validation frameworks.

In production clinical systems, arrays often validate:

  • Lab ranges
  • Vital signs
  • ECG measurements
  • Biomarker thresholds

This approach dramatically improves maintainability.

proc export data = dangerous_chocolate_raw

            outfile = "C:\Users\Desktop\dangerous_chocolate_raw.csv"

           dbms=csv

           replace;

run;

LOG:
21 records created in "C:\Users\Desktop\dangerous_chocolate_raw.csv" from
DANGEROUS_CHOCOLATE_RAW.

7.Import CSV Using readr Package

library(readr)

dangerous_chocolate_raw <- read_csv("C:\\Users\\Desktop\\dangerous_chocolate_raw.csv")

OUTPUT:

 

Product_ID

Chocolate_Name

Country

Region_Code

Manufacturer_Email

Toxicity_Level

Recall_Flag

Toxicity_Score

Consumer_Age

Manufacturing_Date

1

DC001

Dark venom bar

India

AP_SOUTH

toxicmail.com

HIGH

YES

95

34

10-01-2025

2

DC002

White Death

usa

north_01

factory@choco.com

MEDIUM

NO

-15

29

12-02-2025

3

DC003

NULL

UK

EU-West

danger@factory

HIGH

YES

88

150

2025-15-01

4

DC004

Bitter Hazard

India

AP-S

maker@safe.com

LOW

NO

76

45

18-03-2025

5

DC005

Black Poison

INDIA

south01

wrongemail@

EXTREME

YES

102

28

20-04-2025

6

DC005

Black Poison

INDIA

south01

wrongemail@

EXTREME

YES

102

28

20-04-2025

7

DC006

Milk Disaster

Canada

CA_99

factorymail.com

MEDIUM

NO

65

-8

2025-02-31

8

DC007

Cocoa Inferno

Germany

EU-CENTRAL

valid@plant.com

HIGH

YES

92

39

NA

9

DC008

Toxic Crunch

India

NULL

plant@factory.com

LOW

NO

45

22

11-06-2025

10

DC009

Silent Killer

Brazil

SA-01

support#mail.com

HIGH

YES

85

44

01-07-2025

11

DC010

Sugar Venom

France

EU_02

factory@global.com

MEDIUM

NO

NA

31

INVALIDDATE

12

DC011

Rotten Cocoa

India

south01

hello@choco.com

MILD

NO

55

19

08-08-2025

13

DC012

Hazard Bite

USA

NORTH_01

plant@hazard.com

HIGH

YES

77

200

09-09-2025

14

DC013

Nightmare Melt

Japan

APAC-9

mailhazard.com

LOW

NO

61

40

10-10-2025

15

DC014

Dark Acid

India

AP-SOUTH

contact@acid.com

SEVERE

YES

89

37

11-11-2025

16

DC015

Venom Cube

Mexico

MX_01

mxplant.com

LOW

NO

48

26

12-12-2025

17

DC016

Fatal Syrup

India

AP_SOUTH

fatal@plant.com

HIGH

YES

120

41

01-01-2025

18

DC017

Cocoa Corruption

UK

EU-West

ukfactory.com

MEDIUM

NO

72

33

14-03-2025

19

DC018

Deadly Delight

India

South01

india@safe.com

HIGH

YES

91

36

15-04-2025

20

DC019

Acid Choco

USA

North_01

mail@acid.com

LOW

NO

67

17

16-05-2025

21

DC020

Poison Wafer

India

AP-S

badmail.com

HIGH

YES

84

38

17-06-2025

Key Point

Windows paths in R should use:

  • Forward slashes /
    OR
  • Double backslashes \\

Example:

data <- read.csv("C:\\Clinical_Project\\rawdata\\patients.csv")

R Equivalent Cleaning Workflow

library(tidyverse)

library(lubridate)

library(stringr)

library(janitor)

dangerous_chocolate <- read_csv("C:\\Users\\Desktop\\dangerous_chocolate_raw.csv")

OUTPUT:

 

Product_ID

Chocolate_Name

Country

Region_Code

Manufacturer_Email

Toxicity_Level

Recall_Flag

Toxicity_Score

Consumer_Age

Manufacturing_Date

1

DC001

Dark venom bar

India

AP_SOUTH

toxicmail.com

HIGH

YES

95

34

10-01-2025

2

DC002

White Death

usa

north_01

factory@choco.com

MEDIUM

NO

-15

29

12-02-2025

3

DC003

NULL

UK

EU-West

danger@factory

HIGH

YES

88

150

2025-15-01

4

DC004

Bitter Hazard

India

AP-S

maker@safe.com

LOW

NO

76

45

18-03-2025

5

DC005

Black Poison

INDIA

south01

wrongemail@

EXTREME

YES

102

28

20-04-2025

6

DC005

Black Poison

INDIA

south01

wrongemail@

EXTREME

YES

102

28

20-04-2025

7

DC006

Milk Disaster

Canada

CA_99

factorymail.com

MEDIUM

NO

65

-8

2025-02-31

8

DC007

Cocoa Inferno

Germany

EU-CENTRAL

valid@plant.com

HIGH

YES

92

39

NA

9

DC008

Toxic Crunch

India

NULL

plant@factory.com

LOW

NO

45

22

11-06-2025

10

DC009

Silent Killer

Brazil

SA-01

support#mail.com

HIGH

YES

85

44

01-07-2025

11

DC010

Sugar Venom

France

EU_02

factory@global.com

MEDIUM

NO

NA

31

INVALIDDATE

12

DC011

Rotten Cocoa

India

south01

hello@choco.com

MILD

NO

55

19

08-08-2025

13

DC012

Hazard Bite

USA

NORTH_01

plant@hazard.com

HIGH

YES

77

200

09-09-2025

14

DC013

Nightmare Melt

Japan

APAC-9

mailhazard.com

LOW

NO

61

40

10-10-2025

15

DC014

Dark Acid

India

AP-SOUTH

contact@acid.com

SEVERE

YES

89

37

11-11-2025

16

DC015

Venom Cube

Mexico

MX_01

mxplant.com

LOW

NO

48

26

12-12-2025

17

DC016

Fatal Syrup

India

AP_SOUTH

fatal@plant.com

HIGH

YES

120

41

01-01-2025

18

DC017

Cocoa Corruption

UK

EU-West

ukfactory.com

MEDIUM

NO

72

33

14-03-2025

19

DC018

Deadly Delight

India

South01

india@safe.com

HIGH

YES

91

36

15-04-2025

20

DC019

Acid Choco

USA

North_01

mail@acid.com

LOW

NO

67

17

16-05-2025

21

DC020

Poison Wafer

India

AP-S

badmail.com

HIGH

YES

84

38

17-06-2025


cleaned_data <- dangerous_chocolate %>%

  clean_names() %>%

  mutate(chocolate_name = str_to_title(str_trim(chocolate_name)),

         country = str_to_upper(country),

         region_code = str_replace_all(region_code,"[-_ ]",""),

         manufacturer_email = if_else(grepl("@",manufacturer_email),

              str_to_lower(manufacturer_email),"invalid_email"),

         toxicity_score = abs(toxicity_score),

         toxicity_score = if_else(toxicity_score > 100,100,toxicity_score),

         consumer_age = if_else(consumer_age < 0 | consumer_age > 120,

                                NA_real_, consumer_age),

         manufacturing_date = coalesce(manufacturing_date,"2025-01-01"),

         manufacturing_date =if_else(grepl("^\\d{4}-\\d{2}-\\d{2}$",

             manufacturing_date),manufacturing_date,"2025-01-01"),    

         manufacturing_date =ymd(manufacturing_date,quiet = TRUE),

         manufacturing_date =coalesce(manufacturing_date,

                             as.Date("2025-01-01")),

         risk_category = case_when(toxicity_score >=80 ~ "CRITICAL",

                                   toxicity_score >=60 ~ "HIGH",

                                   toxicity_score >=40 ~ "MEDIUM",

                                   TRUE ~ "LOW")

         ) %>%

          distinct(product_id,.keep_all=TRUE)

OUTPUT:

 

product_id

chocolate_name

country

region_code

manufacturer_email

toxicity_level

recall_flag

toxicity_score

consumer_age

manufacturing_date

risk_category

1

DC001

Dark Venom Bar

INDIA

APSOUTH

invalid_email

HIGH

YES

95

34

10-01-2025

CRITICAL

2

DC002

White Death

USA

north01

factory@choco.com

MEDIUM

NO

15

29

12-02-2025

LOW

3

DC003

Null

UK

EUWest

danger@factory

HIGH

YES

88

NA

01-01-2025

CRITICAL

4

DC004

Bitter Hazard

INDIA

APS

maker@safe.com

LOW

NO

76

45

18-03-2025

HIGH

5

DC005

Black Poison

INDIA

south01

wrongemail@

EXTREME

YES

100

28

20-04-2025

CRITICAL

6

DC006

Milk Disaster

CANADA

CA99

invalid_email

MEDIUM

NO

65

NA

01-01-2025

HIGH

7

DC007

Cocoa Inferno

GERMANY

EUCENTRAL

valid@plant.com

HIGH

YES

92

39

01-01-2025

CRITICAL

8

DC008

Toxic Crunch

INDIA

NULL

plant@factory.com

LOW

NO

45

22

11-06-2025

MEDIUM

9

DC009

Silent Killer

BRAZIL

SA01

invalid_email

HIGH

YES

85

44

01-07-2025

CRITICAL

10

DC010

Sugar Venom

FRANCE

EU02

factory@global.com

MEDIUM

NO

NA

31

01-01-2025

LOW

11

DC011

Rotten Cocoa

INDIA

south01

hello@choco.com

MILD

NO

55

19

08-08-2025

MEDIUM

12

DC012

Hazard Bite

USA

NORTH01

plant@hazard.com

HIGH

YES

77

NA

09-09-2025

HIGH

13

DC013

Nightmare Melt

JAPAN

APAC9

invalid_email

LOW

NO

61

40

10-10-2025

HIGH

14

DC014

Dark Acid

INDIA

APSOUTH

contact@acid.com

SEVERE

YES

89

37

11-11-2025

CRITICAL

15

DC015

Venom Cube

MEXICO

MX01

invalid_email

LOW

NO

48

26

12-12-2025

MEDIUM

16

DC016

Fatal Syrup

INDIA

APSOUTH

fatal@plant.com

HIGH

YES

100

41

01-01-2025

CRITICAL

17

DC017

Cocoa Corruption

UK

EUWest

invalid_email

MEDIUM

NO

72

33

14-03-2025

HIGH

18

DC018

Deadly Delight

INDIA

South01

india@safe.com

HIGH

YES

91

36

15-04-2025

CRITICAL

19

DC019

Acid Choco

USA

North01

mail@acid.com

LOW

NO

67

17

16-05-2025

HIGH

20

DC020

Poison Wafer

INDIA

APS

invalid_email

HIGH

YES

84

38

17-06-2025

CRITICAL

Explanation and Key Points

The R pipeline mirrors SAS cleaning logic using:

  • mutate()
  • case_when()
  • coalesce()
  • str_trim()
  • parse_date_time()
  • distinct()

R provides more flexible string handling than SAS because character vectors are dynamically allocated. However, SAS remains superior in metadata governance, audit traceability, and regulated environments.

SAS vs R Comparison Table

Feature

SAS

R

Auditability

Excellent

Moderate

Regulatory Acceptance

Industry Standard

Growing

Metadata Governance

Strong

Flexible

Visualization

Moderate

Excellent

Performance on Huge Clinical Data

Strong

Strong with optimization

Learning Curve

Moderate

Steep

Macro Automation

Powerful

Functional Programming

Enterprise Deployment

Mature

Expanding

Enterprise Validation & Compliance

In regulated industries, cleaning data is not cosmetic it is compliance-critical.

Clinical trial environments require:

  • SDTM traceability
  • ADaM derivation transparency
  • QC independence
  • Reproducibility
  • Audit trails
  • Validation documentation

One dangerous SAS behavior is that missing numeric values are treated as lower than valid numbers. This can silently corrupt statistical analysis.

Example:

if Lab_Value < 5 then Flag='LOW';

If Lab_Value is missing, SAS still evaluates it as smaller than 5 unless explicitly checked.

Correct approach:

if not missing(Lab_Value) and Lab_Value < 5 then Flag='LOW';

This small mistake can destroy patient safety analyses.

8.Business Logic Behind Data Cleaning

Enterprise data cleaning is fundamentally about protecting analytical integrity. Missing values are imputed because downstream models, statistical procedures, and dashboards require complete records for stable processing. For example, if manufacturing dates are missing, contamination trend analysis becomes unreliable. Standardizing dates ensures chronological consistency across regions and systems.

Unrealistic values must be corrected because operational systems frequently capture accidental entries. A patient age of 200 or toxicity score of 120 represents either human error or system corruption. If ignored, machine learning models may incorrectly classify safety risks.

Text normalization is equally important. “india”, “INDIA”, and “ India ” should represent one country, not three separate analytical categories. Otherwise, executive dashboards show fragmented metrics.

Email standardization improves communication workflows and regulatory notifications. Duplicate record removal prevents double counting in statistical outputs.

Ultimately, enterprise cleaning transforms operational chaos into analytical trustworthiness.

9.20 Enterprise Data-Cleaning Best Practices

  1. Always validate metadata before transformation.
  2. Never trust raw operational feeds.
  3. Standardize text before aggregation.
  4. Deduplicate before statistical analysis.
  5. Separate raw and cleaned datasets.
  6. Use reusable validation macros.
  7. Implement audit trails.
  8. Validate date chronology.
  9. Protect against truncation risks.
  10. Use controlled terminology.
  11. Validate missing-value logic carefully.
  12. Maintain independent QC review.
  13. Document all derivations.
  14. Standardize region codes.
  15. Use defensive programming.
  16. Validate numeric ranges.
  17. Preserve lineage tracking.
  18. Build modular cleaning pipelines.
  19. Reconcile counts after transformations.
  20. Automate enterprise validation checks.

10.20 Sharp One-Line Insights

  • Dirty data creates expensive business mistakes.
  • Standardized variables improve reproducibility.
  • Validation logic is stronger than visual inspection.
  • Duplicate IDs destroy analytical trust.
  • Metadata consistency drives regulatory success.
  • Character truncation silently corrupts reporting.
  • Missing dates break chronology analysis.
  • SAS excels in auditability.
  • R excels in flexible transformation.
  • Defensive programming prevents production failures.
  • PROC SQL simplifies aggregation workflows.
  • DATA step provides granular control.
  • Clean data improves AI predictions.
  • Traceability matters more than speed.
  • Validation is continuous, not optional.
  • Controlled terminology reduces ambiguity.
  • Arrays improve scalable validation.
  • Macros standardize enterprise workflows.
  • Poor cleaning creates false analytics.
  • Reliable intelligence begins with trusted data.

11.Summary

SAS remains the gold standard for regulated industries because of its auditability, metadata governance, reproducibility, and enterprise validation controls. Clinical programming teams rely heavily on SAS for SDTM, ADaM, and TLF production because regulatory agencies trust its structured processing model.

R, however, provides exceptional flexibility, modern data manipulation capabilities, and superior visualization ecosystems. Packages such as tidyverse and lubridate dramatically simplify complex transformations.

SAS DATA step processing offers precise row-wise control, while PROC SQL supports enterprise aggregation workflows efficiently. R pipelines are often more readable and concise for exploratory cleaning tasks.

In real-world organizations, SAS and R are increasingly used together:

  • SAS for compliant production pipelines
  • R for advanced analytics and visualization

The strongest enterprise teams understand both technologies deeply.

12.Conclusion

Modern analytics ecosystems depend entirely on data quality discipline. Whether working with dangerous chocolate surveillance systems, clinical trial submissions, banking fraud detection, or insurance claims processing, corrupted data creates operational chaos, compliance risk, and analytical failure.

Enterprise cleaning is not merely about fixing missing values it is about building trust. Every transformation must be reproducible, traceable, validated, and defensible. That is why experienced Clinical SAS Programmers and Data Scientists treat data cleaning as a foundational engineering discipline rather than a preprocessing task.

SAS provides unmatched strengths in:

  • Regulatory compliance
  • Metadata governance
  • Enterprise auditability
  • Structured validation workflows
  • Production-grade reporting

Meanwhile, R delivers:

  • Flexible transformations
  • Modern analytical ecosystems
  • Rapid exploratory workflows
  • Advanced statistical capabilities
  • Elegant data engineering pipelines

Together, SAS and R create an exceptionally powerful enterprise framework capable of transforming corrupted operational feeds into reliable analytical intelligence.

The most successful organizations are not the ones collecting the most data they are the ones validating, standardizing, and governing their data most effectively.

In the end, trustworthy analytics begin with disciplined cleaning frameworks, defensive programming, metadata governance, and enterprise-quality validation logic. Without clean data, even the world’s most advanced AI systems become unreliable. With properly engineered SAS and R workflows, organizations can build scalable, auditable, production-grade intelligence systems that support confident business decisions, regulatory compliance, and long-term analytical excellence.

13.Interview Questions and Answers

1. How do you handle duplicate records in SAS?

Use PROC SORT NODUPKEY or SQL grouping logic. Always reconcile counts before and after deduplication to ensure no unintended data loss occurs.

2. Why is LENGTH placement important in SAS?

SAS determines variable storage during compilation. If LENGTH is declared after assignments, character truncation may permanently occur.

3. How would you validate missing dates?

Use missing() checks, date informats, chronology validation, and reconciliation reports to identify invalid or impossible timelines.

4. What is the difference between PROC SQL and DATA step?

PROC SQL is excellent for joins and aggregations, while DATA step provides granular row-level transformation control and FIRST./LAST. processing.

5. Why is missing numeric handling dangerous in SAS?

Because SAS treats missing numeric values as smaller than valid numbers. Incorrect conditions can accidentally classify missing values as abnormal.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about DANGEROUS CHOCOLATE DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS