π Luxury, Speed, and Dirty Data: Turning Corrupted Supercar Records into Enterprise-Grade Insights with SAS and R
π TITANS OF SPEED AND DATA: TRANSFORMING THE WORLD'S MOST EXPENSIVE VEHICLES DATA INTO TRUSTED ANALYTICAL INTELLIGENCE USING SAS AND R
Introduction:Luxury Machines, Dirty Data, and
Million-Dollar Mistakes
The
world's most expensive vehicles are engineering masterpieces. Hypercars costing
tens of millions of dollars generate enormous amounts of operational data
involving manufacturers, ownership transfers, insurance coverage, servicing
history, luxury taxation, financing records, and regional compliance tracking.
Imagine
an international luxury vehicle insurer processing underwriting decisions for
ultra-premium vehicles such as the Rolls-Royce Boat Tail, Bugatti La Voiture
Noire, and Pagani Zonda HP Barchetta.
One
duplicate vehicle registration number incorrectly classifies a $20 million
vehicle as uninsured.
One
malformed owner email prevents policy renewal notifications.
One
negative insurance premium enters executive dashboards.
One
invalid manufacturing year causes actuarial models to classify a collector
vehicle as a future model.
One
missing ownership transfer date breaks regulatory reporting requirements.
Suddenly:
- Fraud detection models fail.
- AI risk scoring becomes
unreliable.
- Executive dashboards become
misleading.
- Regulatory submissions
become non-compliant.
- Financial exposure increases
dramatically.
Dirty
data is rarely a technical inconvenience.
It is
usually a business crisis.
The Analytical Dataset
We create
an enterprise luxury vehicle intelligence dataset containing 22 observations
and 9 variables.
Variables
|
Variable |
Description |
|
VEHICLE_ID |
Unique
vehicle identifier |
|
VEHICLE_NAME |
Hypercar
name |
|
OWNER_EMAIL |
Registered
owner email |
|
REGION_CODE |
Geographic
region |
|
INSURANCE_PREMIUM |
Annual
premium |
|
VEHICLE_PRICE_USD |
Vehicle
price |
|
MANUFACTURE_YEAR |
Production
year |
|
REGISTRATION_DATE |
Registration
date |
|
RISK_CATEGORY |
Insurance
classification |
1.SAS Raw Dataset with Intentional Corruption
data luxury_vehicle_raw;
length Vehicle_ID $10 Vehicle_Name $60 Owner_Email $70
Region_Code $12 Risk_Category $20 Registration_Date $25;
infile datalines dlm='|' dsd truncover;
input Vehicle_ID $ Vehicle_Name $ Owner_Email $ Region_Code $
Insurance_Premium Vehicle_Price_USD $ Manufacture_Year
Registration_Date $ Risk_Category $;
datalines;
VH001|bugatti la voiture noire|owner1@gmail.com| eu |250000|18700000|2019|2025-01-15|high
VH002|Rolls Royce Boat Tail|owner2gmail.com|APAC|-50000|28000000|2021|2025-02-10|premium
VH003|Pagani Zonda HP Barchetta|NULL|na|180000|17500000|2035|2025-03-11|HIGH
VH004|Bugatti Centodieci|owner4@yahoo.com|EU |200000|-9000000|2022|2025-02-30|Luxury
VH005|SP Automotive Chaos|owner5@outlook| APAC|.|14000000|2023||Elite
VH005|SP Automotive Chaos|owner5@outlook| APAC|.|14000000|2023||Elite
VH006|Mercedes Maybach Exelero|owner6@gmail.com|us|220000|8000000|1890|INVALIDDATE|medium
VH007|Koenigsegg CCXR Trevita|owner7@gmail.com| EMEA |210000|4800000|2010|2025-04-10|MED
VH008|Lamborghini Veneno|owner8@@gmail.com|LATAM|190000|4500000|2014|2025-03-09|?
VH009|McLaren Solus GT|owner9@gmail.com|EMEA|170000|NULL|2023|2025-05-18|HIGH
VH010|Aston Martin Valkyrie| owner10@gmail.com |APAC|160000|3200000|2022|2025-04-21|high
;
run;
proc print data=luxury_vehicle_raw;
run;
OUTPUT:
| Obs | Vehicle_ID | Vehicle_Name | Owner_Email | Region_Code | Risk_Category | Registration_Date | Insurance_Premium | Vehicle_Price_USD | Manufacture_Year |
|---|---|---|---|---|---|---|---|---|---|
| 1 | VH001 | bugatti la voiture noire | owner1@gmail.com | eu | high | 2025-01-15 | 250000 | 18700000 | 2019 |
| 2 | VH002 | Rolls Royce Boat Tail | owner2gmail.com | APAC | premium | 2025-02-10 | -50000 | 28000000 | 2021 |
| 3 | VH003 | Pagani Zonda HP Barchetta | NULL | na | HIGH | 2025-03-11 | 180000 | 17500000 | 2035 |
| 4 | VH004 | Bugatti Centodieci | owner4@yahoo.com | EU | Luxury | 2025-02-30 | 200000 | -9000000 | 2022 |
| 5 | VH005 | SP Automotive Chaos | owner5@outlook | APAC | Elite | . | 14000000 | 2023 | |
| 6 | VH005 | SP Automotive Chaos | owner5@outlook | APAC | Elite | . | 14000000 | 2023 | |
| 7 | VH006 | Mercedes Maybach Exelero | owner6@gmail.com | us | medium | INVALIDDATE | 220000 | 8000000 | 1890 |
| 8 | VH007 | Koenigsegg CCXR Trevita | owner7@gmail.com | EMEA | MED | 2025-04-10 | 210000 | 4800000 | 2010 |
| 9 | VH008 | Lamborghini Veneno | owner8@@gmail.com | LATAM | ? | 2025-03-09 | 190000 | 4500000 | 2014 |
| 10 | VH009 | McLaren Solus GT | owner9@gmail.com | EMEA | HIGH | 2025-05-18 | 170000 | NULL | 2023 |
| 11 | VH010 | Aston Martin Valkyrie | owner10@gmail.com | APAC | high | 2025-04-21 | 160000 | 3200000 | 2022 |
Why LENGTH Comes First
One of
the most overlooked SAS behaviors is character truncation.
If Vehicle_Name
receives a value before defining LENGTH, SAS allocates storage using the length
of the first encountered value.
For
example:
Vehicle_Name="Bugatti";
creates a
7-character variable.
Later
assignment:
Vehicle_Name="Mercedes
Maybach Exelero";
becomes:
Mercedes
This
issue has caused real production failures in SDTM domains and insurance
reporting pipelines.
R behaves
differently because character vectors use dynamic memory allocation rather than
fixed-width storage.
2.Cleaning Workflow Using DATA Step Engineering
data luxury_vehicle_clean;
set luxury_vehicle_raw;
Vehicle_Name = propcase(strip(Vehicle_Name));
Region_Code = upcase(compress(Region_Code));
Owner_Email = lowcase(strip(Owner_Email));
if Owner_Email='null' then Owner_Email='';
Insurance_Premium = abs(Insurance_Premium);
Vehicle_Price_Num = abs(input(Vehicle_Price_USD,?? best32.));
if Manufacture_Year > year(today()) then
Manufacture_Year=year(today());
if Manufacture_Year < 1950 then
Manufacture_Year=1950;
select(upcase(Risk_Category));
length Risk_Level $8;
when('HIGH','PREMIUM') Risk_Level='HIGH';
when('MED','MEDIUM') Risk_Level='MEDIUM';
otherwise Risk_Level='UNKNOWN';
end;
if index(Owner_Email,'@')=0 then
Email_Flag='INVALID';
Vehicle_Age=intck('year',mdy(1,1,Manufacture_Year),
today());
drop Vehicle_Price_USD;
rename Vehicle_Price_Num=Vehicle_Price_USD;
run;
proc print data=luxury_vehicle_clean;
run;
OUTPUT:
| Obs | Vehicle_ID | Vehicle_Name | Owner_Email | Region_Code | Risk_Category | Registration_Date | Insurance_Premium | Manufacture_Year | Vehicle_Price_USD | Risk_Level | Email_Flag | Vehicle_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | VH001 | Bugatti La Voiture Noire | owner1@gmail.com | EU | high | 2025-01-15 | 250000 | 2019 | 18700000 | HIGH | 7 | |
| 2 | VH002 | Rolls Royce Boat Tail | owner2gmail.com | APAC | premium | 2025-02-10 | 50000 | 2021 | 28000000 | HIGH | INVALID | 5 |
| 3 | VH003 | Pagani Zonda Hp Barchetta | NA | HIGH | 2025-03-11 | 180000 | 2026 | 17500000 | HIGH | INVALID | 0 | |
| 4 | VH004 | Bugatti Centodieci | owner4@yahoo.com | EU | Luxury | 2025-02-30 | 200000 | 2022 | 9000000 | UNKNOWN | 4 | |
| 5 | VH005 | Sp Automotive Chaos | owner5@outlook | APAC | Elite | . | 2023 | 14000000 | UNKNOWN | 3 | ||
| 6 | VH005 | Sp Automotive Chaos | owner5@outlook | APAC | Elite | . | 2023 | 14000000 | UNKNOWN | 3 | ||
| 7 | VH006 | Mercedes Maybach Exelero | owner6@gmail.com | US | medium | INVALIDDATE | 220000 | 1950 | 8000000 | MEDIUM | 76 | |
| 8 | VH007 | Koenigsegg Ccxr Trevita | owner7@gmail.com | EMEA | MED | 2025-04-10 | 210000 | 2010 | 4800000 | MEDIUM | 16 | |
| 9 | VH008 | Lamborghini Veneno | owner8@@gmail.com | LATAM | ? | 2025-03-09 | 190000 | 2014 | 4500000 | UNKNOWN | 12 | |
| 10 | VH009 | Mclaren Solus Gt | owner9@gmail.com | EMEA | HIGH | 2025-05-18 | 170000 | 2023 | . | HIGH | 3 | |
| 11 | VH010 | Aston Martin Valkyrie | owner10@gmail.com | APAC | high | 2025-04-21 | 160000 | 2022 | 3200000 | HIGH | 4 |
Explanation
This
workflow demonstrates enterprise defensive programming.
- PROPCASE() standardizes
names.
- COMPRESS() removes hidden
blanks.
- ABS() fixes negative
financial values.
- INPUT() converts mixed
types.
- SELECT-WHEN replaces inefficient
nested IF logic.
- INTCK() derives age
variables for actuarial modeling.
- INDEX() validates emails
quickly without regex overhead.
In
production clinical programming environments similar logic validates adverse
event dates, laboratory units, and treatment exposure records.
3.Arrays and DO Loop Validation
data validate_numeric;
set luxury_vehicle_clean;
array nums(*) Insurance_Premium Vehicle_Price_USD Manufacture_Year;
do i=1 to dim(nums);
if nums(i)=. then nums(i)=0;
end;
drop i;
run;
proc print data=validate_numeric;
run;
OUTPUT:
| Obs | Vehicle_ID | Vehicle_Name | Owner_Email | Region_Code | Risk_Category | Registration_Date | Insurance_Premium | Manufacture_Year | Vehicle_Price_USD | Risk_Level | Email_Flag | Vehicle_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | VH001 | Bugatti La Voiture Noire | owner1@gmail.com | EU | high | 2025-01-15 | 250000 | 2019 | 18700000 | HIGH | 7 | |
| 2 | VH002 | Rolls Royce Boat Tail | owner2gmail.com | APAC | premium | 2025-02-10 | 50000 | 2021 | 28000000 | HIGH | INVALID | 5 |
| 3 | VH003 | Pagani Zonda Hp Barchetta | NA | HIGH | 2025-03-11 | 180000 | 2026 | 17500000 | HIGH | INVALID | 0 | |
| 4 | VH004 | Bugatti Centodieci | owner4@yahoo.com | EU | Luxury | 2025-02-30 | 200000 | 2022 | 9000000 | UNKNOWN | 4 | |
| 5 | VH005 | Sp Automotive Chaos | owner5@outlook | APAC | Elite | 0 | 2023 | 14000000 | UNKNOWN | 3 | ||
| 6 | VH005 | Sp Automotive Chaos | owner5@outlook | APAC | Elite | 0 | 2023 | 14000000 | UNKNOWN | 3 | ||
| 7 | VH006 | Mercedes Maybach Exelero | owner6@gmail.com | US | medium | INVALIDDATE | 220000 | 1950 | 8000000 | MEDIUM | 76 | |
| 8 | VH007 | Koenigsegg Ccxr Trevita | owner7@gmail.com | EMEA | MED | 2025-04-10 | 210000 | 2010 | 4800000 | MEDIUM | 16 | |
| 9 | VH008 | Lamborghini Veneno | owner8@@gmail.com | LATAM | ? | 2025-03-09 | 190000 | 2014 | 4500000 | UNKNOWN | 12 | |
| 10 | VH009 | Mclaren Solus Gt | owner9@gmail.com | EMEA | HIGH | 2025-05-18 | 170000 | 2023 | 0 | HIGH | 3 | |
| 11 | VH010 | Aston Martin Valkyrie | owner10@gmail.com | APAC | high | 2025-04-21 | 160000 | 2022 | 3200000 | HIGH | 4 |
Key Learning
Arrays
dramatically reduce repetitive code and improve maintainability.
The same
approach is commonly used in SDTM vital signs domains where dozens of
measurements require identical validation rules.
4.Duplicate Detection with FIRST. LAST.
proc sort data=luxury_vehicle_clean;
by Vehicle_ID;
run;
proc print data=luxury_vehicle_clean;
run;
OUTPUT:
| Obs | Vehicle_ID | Vehicle_Name | Owner_Email | Region_Code | Risk_Category | Registration_Date | Insurance_Premium | Manufacture_Year | Vehicle_Price_USD | Risk_Level | Email_Flag | Vehicle_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | VH001 | Bugatti La Voiture Noire | owner1@gmail.com | EU | high | 2025-01-15 | 250000 | 2019 | 18700000 | HIGH | 7 | |
| 2 | VH002 | Rolls Royce Boat Tail | owner2gmail.com | APAC | premium | 2025-02-10 | 50000 | 2021 | 28000000 | HIGH | INVALID | 5 |
| 3 | VH003 | Pagani Zonda Hp Barchetta | NA | HIGH | 2025-03-11 | 180000 | 2026 | 17500000 | HIGH | INVALID | 0 | |
| 4 | VH004 | Bugatti Centodieci | owner4@yahoo.com | EU | Luxury | 2025-02-30 | 200000 | 2022 | 9000000 | UNKNOWN | 4 | |
| 5 | VH005 | Sp Automotive Chaos | owner5@outlook | APAC | Elite | . | 2023 | 14000000 | UNKNOWN | 3 | ||
| 6 | VH005 | Sp Automotive Chaos | owner5@outlook | APAC | Elite | . | 2023 | 14000000 | UNKNOWN | 3 | ||
| 7 | VH006 | Mercedes Maybach Exelero | owner6@gmail.com | US | medium | INVALIDDATE | 220000 | 1950 | 8000000 | MEDIUM | 76 | |
| 8 | VH007 | Koenigsegg Ccxr Trevita | owner7@gmail.com | EMEA | MED | 2025-04-10 | 210000 | 2010 | 4800000 | MEDIUM | 16 | |
| 9 | VH008 | Lamborghini Veneno | owner8@@gmail.com | LATAM | ? | 2025-03-09 | 190000 | 2014 | 4500000 | UNKNOWN | 12 | |
| 10 | VH009 | Mclaren Solus Gt | owner9@gmail.com | EMEA | HIGH | 2025-05-18 | 170000 | 2023 | . | HIGH | 3 | |
| 11 | VH010 | Aston Martin Valkyrie | owner10@gmail.com | APAC | high | 2025-04-21 | 160000 | 2022 | 3200000 | HIGH | 4 |
data unique
duplicates;
set luxury_vehicle_clean;
by Vehicle_ID;
if first.Vehicle_ID and last.Vehicle_ID then output unique;
else output duplicates;
run;
proc print data=unique;
run;
OUTPUT:
| Obs | Vehicle_ID | Vehicle_Name | Owner_Email | Region_Code | Risk_Category | Registration_Date | Insurance_Premium | Manufacture_Year | Vehicle_Price_USD | Risk_Level | Email_Flag | Vehicle_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | VH001 | Bugatti La Voiture Noire | owner1@gmail.com | EU | high | 2025-01-15 | 250000 | 2019 | 18700000 | HIGH | 7 | |
| 2 | VH002 | Rolls Royce Boat Tail | owner2gmail.com | APAC | premium | 2025-02-10 | 50000 | 2021 | 28000000 | HIGH | INVALID | 5 |
| 3 | VH003 | Pagani Zonda Hp Barchetta | NA | HIGH | 2025-03-11 | 180000 | 2026 | 17500000 | HIGH | INVALID | 0 | |
| 4 | VH004 | Bugatti Centodieci | owner4@yahoo.com | EU | Luxury | 2025-02-30 | 200000 | 2022 | 9000000 | UNKNOWN | 4 | |
| 5 | VH006 | Mercedes Maybach Exelero | owner6@gmail.com | US | medium | INVALIDDATE | 220000 | 1950 | 8000000 | MEDIUM | 76 | |
| 6 | VH007 | Koenigsegg Ccxr Trevita | owner7@gmail.com | EMEA | MED | 2025-04-10 | 210000 | 2010 | 4800000 | MEDIUM | 16 | |
| 7 | VH008 | Lamborghini Veneno | owner8@@gmail.com | LATAM | ? | 2025-03-09 | 190000 | 2014 | 4500000 | UNKNOWN | 12 | |
| 8 | VH009 | Mclaren Solus Gt | owner9@gmail.com | EMEA | HIGH | 2025-05-18 | 170000 | 2023 | . | HIGH | 3 | |
| 9 | VH010 | Aston Martin Valkyrie | owner10@gmail.com | APAC | high | 2025-04-21 | 160000 | 2022 | 3200000 | HIGH | 4 |
proc print data=duplicates;
run;
OUTPUT:
| Obs | Vehicle_ID | Vehicle_Name | Owner_Email | Region_Code | Risk_Category | Registration_Date | Insurance_Premium | Manufacture_Year | Vehicle_Price_USD | Risk_Level | Email_Flag | Vehicle_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | VH005 | Sp Automotive Chaos | owner5@outlook | APAC | Elite | . | 2023 | 14000000 | UNKNOWN | 3 | ||
| 2 | VH005 | Sp Automotive Chaos | owner5@outlook | APAC | Elite | . | 2023 | 14000000 | UNKNOWN | 3 |
Why It Matters
Duplicate
keys destroy referential integrity.
A
duplicate subject in ADSL can invalidate an entire clinical submission.
A
duplicate vehicle registration can produce millions in underwriting errors.
5.PROC SQL Approach
proc sql;
create table sql_summary as
select Region_Code,
count(*) as Vehicle_Count,
mean(Vehicle_Price_USD) as Avg_Price,
sum(Insurance_Premium) as Total_Premium
from luxury_vehicle_clean
group by Region_Code;
quit;
proc print data=sql_summary;
run;
OUTPUT:
| Obs | Region_Code | Vehicle_Count | Avg_Price | Total_Premium |
|---|---|---|---|---|
| 1 | APAC | 4 | 14800000 | 210000 |
| 2 | EMEA | 2 | 4800000 | 380000 |
| 3 | EU | 2 | 13850000 | 450000 |
| 4 | LATAM | 1 | 4500000 | 190000 |
| 5 | NA | 1 | 17500000 | 180000 |
| 6 | US | 1 | 8000000 | 220000 |
PROC SQL vs DATA Step
|
Feature |
PROC
SQL |
DATA
Step |
|
Aggregation |
Excellent |
Moderate |
|
Sequential
Logic |
Limited |
Excellent |
|
Readability |
High |
Moderate |
|
Large
Clinical Pipelines |
Moderate |
Excellent |
6.PROC FORMAT for Executive Reporting
proc format;
value premiumfmt 0-100000='Low'
100001-200000='Medium'
200001-high='Ultra Premium';
run;
LOG:
Formatting
separates business presentation from analytical storage and is heavily used in
TLF generation.
7.Professional Reporting Layer
proc report data=luxury_vehicle_clean nowd;
column Region_Code Vehicle_Name Vehicle_Price_USD Insurance_Premium;
define Region_Code / group;
define Vehicle_Name / display;
define Vehicle_Price_USD / analysis sum;
define Insurance_Premium / analysis mean format=premiumfmt.;
run;
OUTPUT:
| Region_Code | Vehicle_Name | Vehicle_Price_USD | Insurance_Premium |
|---|---|---|---|
| APAC | Rolls Royce Boat Tail | 28000000 | Low |
| Sp Automotive Chaos | 14000000 | . | |
| Sp Automotive Chaos | 14000000 | . | |
| Aston Martin Valkyrie | 3200000 | Medium | |
| EMEA | Koenigsegg Ccxr Trevita | 4800000 | Ultra Premium |
| Mclaren Solus Gt | . | Medium | |
| EU | Bugatti La Voiture Noire | 18700000 | Ultra Premium |
| Bugatti Centodieci | 9000000 | Medium | |
| LATAM | Lamborghini Veneno | 4500000 | Medium |
| NA | Pagani Zonda Hp Barchetta | 17500000 | Medium |
| US | Mercedes Maybach Exelero | 8000000 | Ultra Premium |
Reporting Importance
PROC
REPORT remains one of the most powerful enterprise reporting tools available in
regulated industries.
8.R Raw Dataset
library(tidyverse)
luxury_vehicle_raw <- tribble(
~Vehicle_ID, ~Vehicle_Name, ~Owner_Email, ~Region_Code,
~Insurance_Premium, ~Vehicle_Price_USD, ~Manufacture_Year,
~Registration_Date, ~Risk_Category,
"VH001", "bugatti la voiture noire", "owner1@gmail.com", " eu ",
250000, "18700000", 2019, "2025-01-15", "high",
"VH002", "Rolls Royce Boat Tail", "owner2gmail.com", "APAC",
-50000, "28000000", 2021, "2025-02-10", "premium",
"VH003", "Pagani Zonda HP Barchetta", "NULL", "na",
180000, "17500000", 2035, "2025-03-11", "HIGH",
"VH004", "Bugatti Centodieci", "owner4@yahoo.com", "EU ",
200000, "-9000000", 2022, "2025-02-30", "Luxury",
"VH005", "SP Automotive Chaos", "owner5@outlook", " APAC",
NA, "14000000", 2023, NA, "Elite",
"VH005", "SP Automotive Chaos", "owner5@outlook", " APAC",
NA, "14000000", 2023, NA, "Elite",
"VH006", "Mercedes Maybach Exelero", "owner6@gmail.com", "us",
220000, "8000000", 1890, "INVALIDDATE", "medium",
"VH007", "Koenigsegg CCXR Trevita", "owner7@gmail.com", " EMEA ",
210000, "4800000", 2010, "2025-04-10", "MED",
"VH008", "Lamborghini Veneno", "owner8@@gmail.com", "LATAM",
190000, "4500000", 2014, "2025-03-09", "?",
"VH009", "McLaren Solus GT", "owner9@gmail.com", "EMEA",
170000, "NULL", 2023, "2025-05-18", "HIGH",
"VH010", "Aston Martin Valkyrie", " owner10@gmail.com ", "APAC",
160000, "3200000", 2022, "2025-04-21", "high"
)
OUTPUT:
|
Vehicle_ID |
Vehicle_Name |
Owner_Email |
Region_Code |
Insurance_Premium |
Vehicle_Price_USD |
Manufacture_Year |
Registration_Date |
Risk_Category |
|
VH001 |
bugatti
la voiture noire |
owner1@gmail.com |
eu |
250000 |
18700000 |
2019 |
2025-01-15 |
high |
|
VH002 |
Rolls
Royce Boat Tail |
owner2gmail.com |
APAC |
-50000 |
28000000 |
2021 |
2025-02-10 |
premium |
|
VH003 |
Pagani
Zonda HP Barchetta |
NULL |
na |
180000 |
17500000 |
2035 |
2025-03-11 |
HIGH |
|
VH004 |
Bugatti
Centodieci |
owner4@yahoo.com |
EU |
200000 |
-9000000 |
2022 |
2025-02-30 |
Luxury |
|
VH005 |
SP
Automotive Chaos |
owner5@outlook |
APAC |
14000000 |
2023 |
Elite |
||
|
VH005 |
SP
Automotive Chaos |
owner5@outlook |
APAC |
14000000 |
2023 |
Elite |
||
|
VH006 |
Mercedes
Maybach Exelero |
owner6@gmail.com |
us |
220000 |
8000000 |
1890 |
INVALIDDATE |
medium |
|
VH007 |
Koenigsegg
CCXR Trevita |
owner7@gmail.com |
EMEA |
210000 |
4800000 |
2010 |
2025-04-10 |
MED |
|
VH008 |
Lamborghini
Veneno |
owner8@@gmail.com |
LATAM |
190000 |
4500000 |
2014 |
2025-03-09 |
? |
|
VH009 |
McLaren
Solus GT |
owner9@gmail.com |
EMEA |
170000 |
NULL |
2023 |
2025-05-18 |
HIGH |
|
VH010 |
Aston
Martin Valkyrie |
owner10@gmail.com |
APAC |
160000 |
3200000 |
2022 |
2025-04-21 |
high |
9.R Equivalent Cleaning Pipeline
library(janitor)
library(lubridate)
library(stringr)
vehicle_clean <- luxury_vehicle_raw %>%
clean_names() %>%
mutate(
vehicle_id = str_trim(vehicle_id),
vehicle_name = str_to_title(str_trim(vehicle_name)),
owner_email = str_to_lower(str_trim(owner_email)),
region_code = str_replace_all(region_code, " ", ""),
region_code = str_to_upper(region_code),
insurance_premium = abs(insurance_premium),
vehicle_price_usd =abs(suppressWarnings(
as.numeric(vehicle_price_usd))),
manufacture_year = if_else(
manufacture_year > year(Sys.Date()),
year(Sys.Date()),manufacture_year),
owner_email = replace_na(owner_email, ""),
risk_category = case_when(
risk_category %in% c("HIGH", "Premium") ~ "HIGH",
risk_category %in% c("MED", "MEDIUM") ~ "MEDIUM",
TRUE ~ "UNKNOWN")
) %>%
distinct(vehicle_id, .keep_all = TRUE)
|
vehicle_id |
vehicle_name |
owner_email |
region_code |
insurance_premium |
vehicle_price_usd |
manufacture_year |
registration_date |
risk_category |
|
VH001 |
Bugatti
La Voiture Noire |
owner1@gmail.com |
EU |
250000 |
18700000 |
2019 |
2025-01-15 |
UNKNOWN |
|
VH002 |
Rolls
Royce Boat Tail |
owner2gmail.com |
APAC |
50000 |
28000000 |
2021 |
2025-02-10 |
UNKNOWN |
|
VH003 |
Pagani
Zonda Hp Barchetta |
null |
NA |
180000 |
17500000 |
2026 |
2025-03-11 |
HIGH |
|
VH004 |
Bugatti
Centodieci |
owner4@yahoo.com |
EU |
200000 |
9000000 |
2022 |
2025-02-30 |
UNKNOWN |
|
VH005 |
Sp
Automotive Chaos |
owner5@outlook |
APAC |
14000000 |
2023 |
UNKNOWN |
||
|
VH006 |
Mercedes
Maybach Exelero |
owner6@gmail.com |
US |
220000 |
8000000 |
1890 |
INVALIDDATE |
UNKNOWN |
|
VH007 |
Koenigsegg
Ccxr Trevita |
owner7@gmail.com |
EMEA |
210000 |
4800000 |
2010 |
2025-04-10 |
MEDIUM |
|
VH008 |
Lamborghini
Veneno |
owner8@@gmail.com |
LATAM |
190000 |
4500000 |
2014 |
2025-03-09 |
UNKNOWN |
|
VH009 |
Mclaren
Solus Gt |
owner9@gmail.com |
EMEA |
170000 |
2023 |
2025-05-18 |
HIGH |
|
|
VH010 |
Aston
Martin Valkyrie |
owner10@gmail.com |
APAC |
160000 |
3200000 |
2022 |
2025-04-21 |
UNKNOWN |
SAS versus R Mapping
|
SAS |
R |
|
DATA
Step |
mutate() |
|
IF THEN |
if_else() |
|
SELECT
WHEN |
case_when() |
|
COMPRESS |
str_replace_all() |
|
PROPCASE |
str_to_title() |
|
NMISS |
is.na() |
|
COALESCEC |
coalesce() |
Enterprise Validation and
Compliance
Clinical
submissions rely heavily on SDTM and ADaM traceability.
Incorrect
missing-value handling in SAS can create catastrophic analytical consequences
because:
. < 0 < 1 < 2
Missing
values are treated as smaller than valid values.
A patient
with missing blood pressure could incorrectly appear as having the lowest value
in the study.
Regulators
expect:
- Independent QC.
- Full audit trail.
- Reproducibility.
- Traceability.
- Metadata governance.
- Validation independence.
These
principles apply equally to banking risk engines and insurance underwriting
systems.
Business Logic Behind
Cleaning Decisions
Missing
values are not always errors.
Sometimes
they represent unavailable information, delayed transactions, or pending
source-system updates.
A patient
age recorded as 250 years obviously requires correction or exclusion.
Negative
premiums often represent reversal transactions and should be reviewed rather
than blindly converted.
Date
standardization ensures proper temporal analysis.
Without
standardization, survival analysis, retention calculations, and trend
forecasting become unreliable.
Text
normalization improves join accuracy.
"APAC"
"apac"
"
APAC "
and
"ApAc"
must
become identical analytical entities.
Business
rules transform operational chaos into analytical truth.
Twenty Data Cleaning Best
Practices
- Standardize metadata.
- Maintain source lineage.
- Validate before derivation.
- Use reusable macros.
- Separate raw and clean
layers.
- Never overwrite source data.
- Document assumptions.
- Track imputations.
- Version control code.
- Automate validation.
- Use QC independence.
- Validate dates aggressively.
- Standardize categories.
- Remove duplicates early.
- Centralize formats.
- Review outliers manually.
- Protect audit trails.
- Maintain reproducibility.
- Deploy defensively.
- Monitor production
continuously.
Twenty One-Line Insights
- Dirty data creates expensive
business mistakes.
- Standardized variables
improve reproducibility.
- Validation logic beats
visual inspection.
- Duplicates are silent
killers.
- Missing values require
business context.
- Metadata is documentation.
- Arrays reduce technical
debt.
- PROC FORMAT improves
governance.
- Macros improve consistency.
- Traceability builds trust.
- Audit trails protect
organizations.
- Defensive programming saves
projects.
- Standardization improves AI
models.
- Validation is continuous.
- Reporting starts with
cleaning.
- Analytics trusts
preparation.
- Compliance starts with
structure.
- Good datasets scale.
- Reliable data improves
decisions.
- Quality is engineered.
SAS and R Strength
Comparison
SAS
provides exceptional auditability, reproducibility, metadata management, and
scalability for regulated environments.
R
provides unmatched flexibility, visualization capability, package ecosystems,
and machine learning integration.
Together
they create the ideal enterprise ecosystem:
- SAS for governance.
- R for innovation.
- SAS for compliance.
- R for exploration.
- SAS for submissions.
- R for experimentation.
The
strongest analytical organizations use both.
Final Thoughts
The
world's most expensive vehicles teach an important lesson.
The cost
of the asset is rarely the greatest risk.
The
quality of the data describing the asset often is.
Whether
building SDTM domains, underwriting insurance portfolios, retail forecasting
systems, or banking risk engines, organizations succeed only when analytical
foundations are trustworthy.
Data
cleaning is not a preprocessing task.
It is an
engineering discipline.
Production-grade
analytical systems require:
- repeatable transformations,
- standardized metadata,
- automated validation,
- independent quality control,
- documented assumptions,
- and auditable outputs.
SAS
provides industrial-grade governance and regulatory confidence.
R
provides analytical agility and innovation.
Together
they transform corrupted operational records into trusted business intelligence
capable of supporting regulators, executives, clinicians, actuaries, and data
scientists alike.
Reliable
analytics does not begin with machine learning.
It begins
with disciplined data engineering.
Interview Questions
1. Why use PROC SORT NODUPKEY?
It
removes duplicate business keys while preserving first occurrences.
2. Why is missing handling dangerous in SAS?
Missing
numeric values sort below all valid values.
3. When choose PROC SQL over DATA Step?
For
joins, aggregation, and relational transformations.
4. Why use arrays?
To
process many variables efficiently.
5. Why maintain audit trails?
To
satisfy regulatory expectations and ensure reproducibility.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent VEHICLES DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment