Classic Authors, Modern Data Science: Building Reliable Literary Intelligence Using PROC SQL, DATA STEP and R
Transforming the World’s Most Famous Writers Dataset into Professional Intelligence Using PROC IMPORT and EXPORT in SAS and R
1. Introduction — When Beautiful Analytics Collapse
Because of Ugly Data
Imagine a
global publishing company preparing an international literary analytics report
about the world’s most famous writers. The business team wants insights about
royalty earnings, author nationality trends, publication eras, and genre
performance. The management expects polished dashboards, investor-ready
reports, and executive summaries.
But the
raw dataset arrives like this:
- Missing author names
- Negative royalty amounts
- Invalid publication dates
- Duplicate writer IDs
- Mixed capitalization
(“india”, “INDIA”, “India”)
- Blank genre values
- Corrupted text imported from
CSV files
This is
where analytics projects silently fail.
A single
invalid date can break timelines.
A duplicate record can inflate financial projections.
A missing nationality can distort demographic analysis.
A negative royalty value can destroy forecasting models.
In
industries like clinical trials, banking, insurance, and publishing, dirty data
is not a small inconvenience — it becomes a business risk.
This is
why professionals rely heavily on SAS and R for enterprise-grade data
preparation, validation, reporting, and reproducibility.
This
project demonstrates how to build, clean, validate, transform, import, export,
and professionally present a dataset about the most famous writers in the world
using:
- PROC IMPORT
- PROC EXPORT
- DATA STEP
- PROC SQL
- PROC SORT
- PROC REPORT
- PROC SUMMARY
- Tidyverse in R
- dplyr
- stringr
The objective is not only coding it is understanding professional data engineering logic behind reliable analytics.
2. Raw Data Creation in SAS
and R
Raw Dataset with Intentional Errors
/*SAS Raw File Simulation Using INFILE*/
filename wrfile temp;
data _null_;
file wrfile;
put "Writer_ID|Writer_Name|Country|Genre|Birth_Year|Books_Sold_M
|Royalty_M|Award_Status|Join_Date";
put "101|William Shakespeare|UK|Drama|1564|400|120|YES|12-05-2020";
put "102|Leo Tolstoy|Russia|Novel|1828|300|-50|YES|15-13-2021";
put "103|NULL|India|Poetry|1861|250|75|NO|22-07-2019";
put "104|J.K. Rowling|uk|Fantasy|1965|500|200|YES|01-01-2022";
put "104|J.K. Rowling|uk|Fantasy|1965|500|200|YES|01-01-2022";
put "105|Mark Twain|USA|NULL|1835|.|90|YES|11-11-2018";
put "106| Fyodor Dostoevsky |Russia|Novel|-1821|150|60|YES|31-09-2020";
put "107|Jane Austen|UK|Romance|1775|180|85|yes|05-06-2021";
put "108|Ernest Hemingway|USA|Novel|1899|220|110|NO|14-02-2023";
put "109|Rabindranath Tagore|india|Poetry|1861|260|95|YES|07-07-2022";
put "110|George Orwell|UK|Dystopian|1903|210|88|YES|29-02-2021";
put "111|Homer|Greece|Epic||190|66|.|NO|12-08-2017";
run;
LOG:
Explanation
This raw
dataset intentionally contains enterprise-level data quality issues commonly
seen in production systems.
Key
problems inserted intentionally:
- Duplicate Writer_ID = 104
- Negative royalty values
- Invalid dates
- Missing birth year
- NULL strings
- Inconsistent capitalization
- Blank spaces
- Missing numeric values
These
issues simulate realistic imported flat files from external vendors or legacy
systems.
The professional advantage of using INFILE is that SAS developers gain granular control over delimiters, informats, missing value handling, and column parsing.
|
Error
Type |
Example |
|
Duplicate
IDs |
104
repeated |
|
NULL
values |
Writer_Name |
|
Negative
values |
-1821,
-50 |
|
Invalid
dates |
15-13-2021 |
|
Mixed
capitalization |
india
vs UK |
|
Missing
values |
NA |
|
Extra
spaces |
Fyodor
Dostoevsky |
|
Invalid
categorical text |
yes vs
YES |
The Truncation Trap — Why LENGTH Must Come First
data writers_raw;
length Writer_Name $40 Country $20 Genre $20 Award_Status $5;
infile wrfile dlm='|' firstobs=2;
input Writer_ID Writer_Name $ Country $ Genre $ Birth_Year Books_Sold_M
Royalty_M Award_Status $ Join_Date:$10.;
run;
proc print data = writers_raw;
run;
OUTPUT:
| Obs | Writer_Name | Country | Genre | Award_Status | Writer_ID | Birth_Year | Books_Sold_M | Royalty_M | Join_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | William Shakespeare | UK | Drama | YES | 101 | 1564 | 400 | 120 | 12-05-2020 |
| 2 | Leo Tolstoy | Russia | Novel | YES | 102 | 1828 | 300 | -50 | 15-13-2021 |
| 3 | NULL | India | Poetry | NO | 103 | 1861 | 250 | 75 | 22-07-2019 |
| 4 | J.K. Rowling | uk | Fantasy | YES | 104 | 1965 | 500 | 200 | 01-01-2022 |
| 5 | J.K. Rowling | uk | Fantasy | YES | 104 | 1965 | 500 | 200 | 01-01-2022 |
| 6 | Mark Twain | USA | NULL | YES | 105 | 1835 | . | 90 | 11-11-2018 |
| 7 | Fyodor Dostoevsky | Russia | Novel | YES | 106 | -1821 | 150 | 60 | 31-09-2020 |
| 8 | Jane Austen | UK | Romance | yes | 107 | 1775 | 180 | 85 | 05-06-2021 |
| 9 | Ernest Hemingway | USA | Novel | NO | 108 | 1899 | 220 | 110 | 14-02-2023 |
| 10 | Rabindranath Tagore | india | Poetry | YES | 109 | 1861 | 260 | 95 | 07-07-2022 |
| 11 | George Orwell | UK | Dystopian | YES | 110 | 1903 | 210 | 88 | 29-02-2021 |
| 12 | Homer | Greece | Epic | NO | 111 | 190 | 66 | . | 12-08-2017 |
Why LENGTH is Critical
One of
the biggest beginner mistakes in SAS is defining character lengths AFTER
assignment logic.
Example
mistake:
if Country='UK' then
Region='Europe';
length Region $20;
In SAS,
variable length is determined during first compilation reference.
If SAS
first sees "Europe" as "Eur", future values get truncated
permanently.
Professional
SAS developers always place LENGTH before:
- IF conditions
- SET statements
- MERGE statements
- Assignment logic
This
prevents silent data corruption.
R Code – Equivalent Raw Dataset
writers_raw <- data.frame(
Writer_ID = c(101,102,103,104,104,105,106,107,108,109,110,111),
Writer_Name = c("William Shakespeare","Leo Tolstoy","NULL",
"J.K. Rowling","J.K. Rowling","Mark Twain", "Fyodor Dostoevsky ",
"Jane Austen","Ernest Hemingway","Rabindranath Tagore",
"George Orwell","Homer"),
Country = c("UK","Russia","India","uk","uk","USA","Russia","UK",
"USA","india","UK","Greece"),
Genre = c("Drama","Novel","Poetry","Fantasy","Fantasy","NULL",
"Novel","Romance","Novel","Poetry","Dystopian","Epic"),
Birth_Year = c(1564,1828,1861,1965,1965,1835,-1821,1775,1899,
1861,1903,NA),
Books_Sold_M = c(400,300,250,500,500,NA,150,180,220,260,210,190),
Royalty_M = c(120,-50,75,200,200,90,60,85,110,95,88,NA),
Award_Status = c("YES","YES","NO","YES","YES","YES","YES","yes","NO",
"YES","YES","NO"),
Join_Date = c("12-05-2020","15-13-2021","22-07-2019","01-01-2022",
"01-01-2022","11-11-2018","31-09-2020","05-06-2021","14-02-2023",
"07-07-2022","29-02-2021","12-08-2017")
)
OUTPUT:
|
|
Writer_ID |
Writer_Name |
Country |
Genre |
Birth_Year |
Books_Sold_M |
Royalty_M |
Award_Status |
Join_Date |
|
1 |
101 |
William Shakespeare |
UK |
Drama |
1564 |
400 |
120 |
YES |
12-05-2020 |
|
2 |
102 |
Leo Tolstoy |
Russia |
Novel |
1828 |
300 |
-50 |
YES |
15-13-2021 |
|
3 |
103 |
NULL |
India |
Poetry |
1861 |
250 |
75 |
NO |
22-07-2019 |
|
4 |
104 |
J.K. Rowling |
uk |
Fantasy |
1965 |
500 |
200 |
YES |
01-01-2022 |
|
5 |
104 |
J.K. Rowling |
uk |
Fantasy |
1965 |
500 |
200 |
YES |
01-01-2022 |
|
6 |
105 |
Mark Twain |
USA |
NULL |
1835 |
NA |
90 |
YES |
11-11-2018 |
|
7 |
106 |
Fyodor Dostoevsky |
Russia |
Novel |
-1821 |
150 |
60 |
YES |
31-09-2020 |
|
8 |
107 |
Jane Austen |
UK |
Romance |
1775 |
180 |
85 |
yes |
05-06-2021 |
|
9 |
108 |
Ernest Hemingway |
USA |
Novel |
1899 |
220 |
110 |
NO |
14-02-2023 |
|
10 |
109 |
Rabindranath Tagore |
india |
Poetry |
1861 |
260 |
95 |
YES |
07-07-2022 |
|
11 |
110 |
George Orwell |
UK |
Dystopian |
1903 |
210 |
88 |
YES |
29-02-2021 |
|
12 |
111 |
Homer |
Greece |
Epic |
NA |
190 |
NA |
NO |
12-08-2017 |
Equivalent of SAS LENGTH in R
In R, there is no direct LENGTH statement like SAS.
R automatically expands character vectors dynamically.
But professionals still control data structure using:
str(writers_raw)
This is conceptually similar to:
proc contents;
run;
in SAS.
3. Step-by-Step Cleaning Using DATA STEP
data writers_clean;
length Writer_Name $40 Country $20 Genre $20 Award_Status $10;
set writers_raw;
Writer_Name = propcase(strip(Writer_Name));
if upcase(strip(Writer_Name='Null')) then Writer_Name='UNKNOWN';
Country = upcase(Country);
if Genre = "NULL" then Genre = "UNKNOWN";
if not missing(Royalty_M) then
Royalty_M = abs(Royalty_M);
if not missing(Birth_Year) then Birth_Year = abs(Birth_Year);
Award_Status = strip(upcase(Award_Status));
Join_Date_New = input(Join_Date,ddmmyy10.);
format Join_Date_New date9.;
drop Join_Date;
rename Join_Date_New = Join_Date;
run;
proc print data = writers_clean;
run;
OUTPUT:
| Obs | Writer_Name | Country | Genre | Award_Status | Writer_ID | Birth_Year | Books_Sold_M | Royalty_M | Join_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | William Shakespeare | UK | Drama | YES | 101 | 1564 | 400 | 120 | 12MAY2020 |
| 2 | Leo Tolstoy | RUSSIA | Novel | YES | 102 | 1828 | 300 | 50 | . |
| 3 | UNKNOWN | INDIA | Poetry | NO | 103 | 1861 | 250 | 75 | 22JUL2019 |
| 4 | J.K. Rowling | UK | Fantasy | YES | 104 | 1965 | 500 | 200 | 01JAN2022 |
| 5 | J.K. Rowling | UK | Fantasy | YES | 104 | 1965 | 500 | 200 | 01JAN2022 |
| 6 | Mark Twain | USA | UNKNOWN | YES | 105 | 1835 | . | 90 | 11NOV2018 |
| 7 | Fyodor Dostoevsky | RUSSIA | Novel | YES | 106 | 1821 | 150 | 60 | . |
| 8 | Jane Austen | UK | Romance | YES | 107 | 1775 | 180 | 85 | 05JUN2021 |
| 9 | Ernest Hemingway | USA | Novel | NO | 108 | 1899 | 220 | 110 | 14FEB2023 |
| 10 | Rabindranath Tagore | INDIA | Poetry | YES | 109 | 1861 | 260 | 95 | 07JUL2022 |
| 11 | George Orwell | UK | Dystopian | YES | 110 | 1903 | 210 | 88 | . |
| 12 | Homer | GREECE | Epic | NO | 111 | 190 | 66 | . | 12AUG2017 |
Technical Deep Dive
ABS Function
Royalty_M = abs(Royalty_M);
Negative
royalty values are unrealistic.
ABS()
converts negative values into positive standardized measures.
Widely
used in:
- Banking transactions
- Financial normalization
- Clinical lab corrections
INPUT and PUT
Join_Date_New =
input(Join_Date,ddmmyy10.);
INPUT
converts character dates into numeric SAS dates.
Char_Date =
put(Join_Date_New,date9.);
PUT
converts numeric dates back into formatted character values.
This
conversion framework is essential in SDTM clinical datasets.
PROC SORT with NODUPKEY
proc sort data=writers_clean nodupkey;
by Writer_ID;
run;
proc print data = writers_clean;
run;
LOG:
OUTPUT:
| Obs | Writer_Name | Country | Genre | Award_Status | Writer_ID | Birth_Year | Books_Sold_M | Royalty_M | Join_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | William Shakespeare | UK | Drama | YES | 101 | 1564 | 400 | 120 | 12MAY2020 |
| 2 | Leo Tolstoy | RUSSIA | Novel | YES | 102 | 1828 | 300 | 50 | . |
| 3 | UNKNOWN | INDIA | Poetry | NO | 103 | 1861 | 250 | 75 | 22JUL2019 |
| 4 | J.K. Rowling | UK | Fantasy | YES | 104 | 1965 | 500 | 200 | 01JAN2022 |
| 5 | Mark Twain | USA | UNKNOWN | YES | 105 | 1835 | . | 90 | 11NOV2018 |
| 6 | Fyodor Dostoevsky | RUSSIA | Novel | YES | 106 | 1821 | 150 | 60 | . |
| 7 | Jane Austen | UK | Romance | YES | 107 | 1775 | 180 | 85 | 05JUN2021 |
| 8 | Ernest Hemingway | USA | Novel | NO | 108 | 1899 | 220 | 110 | 14FEB2023 |
| 9 | Rabindranath Tagore | INDIA | Poetry | YES | 109 | 1861 | 260 | 95 | 07JUL2022 |
| 10 | George Orwell | UK | Dystopian | YES | 110 | 1903 | 210 | 88 | . |
| 11 | Homer | GREECE | Epic | NO | 111 | 190 | 66 | . | 12AUG2017 |
Explanation
NODUPKEY
removes duplicate records based on BY variables.
Professional
importance:
- Prevents double-counting
- Removes redundant
transactions
- Ensures regulatory
consistency
This is
heavily used in:
- SDTM DM datasets
- Banking ledgers
- Insurance claims
SAS vs R Logic Bridge
|
SAS |
R |
|
DATA
STEP |
mutate() |
|
ABS() |
abs() |
|
STRIP() |
str_trim() |
|
UPCASE() |
str_to_upper() |
|
PROPCASE() |
str_to_title() |
|
PROC
SORT NODUPKEY |
distinct() |
|
INPUT() |
as.Date() |
IF-THEN vs SELECT-WHEN
/*IF-THEN Example*/
data writers_clean;
retain Writer_ID Writer_Name Region Genre Country Birth_Year
Books_Sold_M Join_Date Award_Status Royalty_M;
length Region $20;
set writers_clean;
if Country='UK' then Region='EUROPE';
else if Country='USA' then Region='NORTH AMERICA';
else Region='OTHER';
run;
proc print data = writers_clean;
run;
OUTPUT:
| Obs | Writer_ID | Writer_Name | Region | Genre | Country | Birth_Year | Books_Sold_M | Join_Date | Award_Status | Royalty_M |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | William Shakespeare | EUROPE | Drama | UK | 1564 | 400 | 12MAY2020 | YES | 120 |
| 2 | 102 | Leo Tolstoy | OTHER | Novel | RUSSIA | 1828 | 300 | . | YES | 50 |
| 3 | 103 | UNKNOWN | OTHER | Poetry | INDIA | 1861 | 250 | 22JUL2019 | NO | 75 |
| 4 | 104 | J.K. Rowling | EUROPE | Fantasy | UK | 1965 | 500 | 01JAN2022 | YES | 200 |
| 5 | 105 | Mark Twain | NORTH AMERICA | UNKNOWN | USA | 1835 | . | 11NOV2018 | YES | 90 |
| 6 | 106 | Fyodor Dostoevsky | OTHER | Novel | RUSSIA | 1821 | 150 | . | YES | 60 |
| 7 | 107 | Jane Austen | EUROPE | Romance | UK | 1775 | 180 | 05JUN2021 | YES | 85 |
| 8 | 108 | Ernest Hemingway | NORTH AMERICA | Novel | USA | 1899 | 220 | 14FEB2023 | NO | 110 |
| 9 | 109 | Rabindranath Tagore | OTHER | Poetry | INDIA | 1861 | 260 | 07JUL2022 | YES | 95 |
| 10 | 110 | George Orwell | EUROPE | Dystopian | UK | 1903 | 210 | . | YES | 88 |
| 11 | 111 | Homer | OTHER | Epic | GREECE | 190 | 66 | 12AUG2017 | NO | . |
/*SELECT-WHEN Example*/
data writers_final;
retain Writer_ID Writer_Name Region Genre Country Birth_Year
Books_Sold_M Join_Date Award_Status Royalty_M;
length Region $20;
set writers_clean;
select(Country);
when ('UK') Region='EUROPE';
when ('USA') Region='NORTH AMERICA';
otherwise Region='OTHER';
end;
run;
proc print data = writers_final;
run;
OUTPUT:
| Obs | Writer_ID | Writer_Name | Region | Genre | Country | Birth_Year | Books_Sold_M | Join_Date | Award_Status | Royalty_M |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | William Shakespeare | EUROPE | Drama | UK | 1564 | 400 | 12MAY2020 | YES | 120 |
| 2 | 102 | Leo Tolstoy | OTHER | Novel | RUSSIA | 1828 | 300 | . | YES | 50 |
| 3 | 103 | UNKNOWN | OTHER | Poetry | INDIA | 1861 | 250 | 22JUL2019 | NO | 75 |
| 4 | 104 | J.K. Rowling | EUROPE | Fantasy | UK | 1965 | 500 | 01JAN2022 | YES | 200 |
| 5 | 105 | Mark Twain | NORTH AMERICA | UNKNOWN | USA | 1835 | . | 11NOV2018 | YES | 90 |
| 6 | 106 | Fyodor Dostoevsky | OTHER | Novel | RUSSIA | 1821 | 150 | . | YES | 60 |
| 7 | 107 | Jane Austen | EUROPE | Romance | UK | 1775 | 180 | 05JUN2021 | YES | 85 |
| 8 | 108 | Ernest Hemingway | NORTH AMERICA | Novel | USA | 1899 | 220 | 14FEB2023 | NO | 110 |
| 9 | 109 | Rabindranath Tagore | OTHER | Poetry | INDIA | 1861 | 260 | 07JUL2022 | YES | 95 |
| 10 | 110 | George Orwell | EUROPE | Dystopian | UK | 1903 | 210 | . | YES | 88 |
| 11 | 111 | Homer | OTHER | Epic | GREECE | 190 | 66 | 12AUG2017 | NO | . |
Why RETAIN Works
RETAIN does two things:
- Retains values across observations
- More importantly here → controls variable order
SAS displays variables in the exact order listed in RETAIN.
Professional SAS Insight
RETAIN is one of the most commonly used techniques in:
- SDTM dataset structuring
- ADaM dataset ordering
- Regulatory submission datasets
- Final reporting tables
Difference Between RETAIN vs KEEP
Feature | RETAIN | KEEP |
Reorders variables | YES | YES |
Removes variables | NO | YES |
Retains values | YES | NO |
Most used professionally | YES | Sometimes |
The R Refinement Layer (Tidyverse Approach)
Cleaning the Dataset in R
library(dplyr)
library(stringr)
library(tidyr)
writers_clean <- writers_raw %>%
mutate(Writer_Name = str_trim(str_to_title(Writer_Name)),
Writer_Name = ifelse(toupper(Writer_Name)=="NULL",
"UNKNOWN",Writer_Name),
Country = str_to_upper(str_trim(Country)),
Genre = ifelse(toupper(Genre)=="NULL" | is.na(Genre),
"UNKNOWN",Genre),
Royalty_M = abs(Royalty_M),
Birth_Year = abs(Birth_Year),
Award_Status = str_to_upper(str_trim(Award_Status)),
Join_Date = as.Date(Join_Date,format="%d-%m-%Y")
) %>%
distinct(Writer_ID,.keep_all=TRUE)
|
|
Writer_ID |
Writer_Name |
Country |
Genre |
Birth_Year |
Books_Sold_M |
Royalty_M |
Award_Status |
Join_Date |
|
1 |
101 |
William Shakespeare |
UK |
Drama |
1564 |
400 |
120 |
YES |
12-05-2020 |
|
2 |
102 |
Leo Tolstoy |
RUSSIA |
Novel |
1828 |
300 |
50 |
YES |
NA |
|
3 |
103 |
UNKNOWN |
INDIA |
Poetry |
1861 |
250 |
75 |
NO |
22-07-2019 |
|
4 |
104 |
J.k. Rowling |
UK |
Fantasy |
1965 |
500 |
200 |
YES |
01-01-2022 |
|
5 |
105 |
Mark Twain |
USA |
UNKNOWN |
1835 |
NA |
90 |
YES |
11-11-2018 |
|
6 |
106 |
Fyodor Dostoevsky |
RUSSIA |
Novel |
1821 |
150 |
60 |
YES |
NA |
|
7 |
107 |
Jane Austen |
UK |
Romance |
1775 |
180 |
85 |
YES |
05-06-2021 |
|
8 |
108 |
Ernest Hemingway |
USA |
Novel |
1899 |
220 |
110 |
NO |
14-02-2023 |
|
9 |
109 |
Rabindranath Tagore |
INDIA |
Poetry |
1861 |
260 |
95 |
YES |
07-07-2022 |
|
10 |
110 |
George Orwell |
UK |
Dystopian |
1903 |
210 |
88 |
YES |
NA |
|
11 |
111 |
Homer |
GREECE |
Epic |
NA |
190 |
NA |
NO |
12-08-2017 |
Explanation
This
pipeline demonstrates modern R wrangling using tidyverse.
Logic Bridge Between SAS and R
|
SAS |
R
Equivalent |
|
DATA
STEP |
mutate() |
|
IF-THEN |
case_when() |
|
PROC
SORT NODUPKEY |
distinct() |
|
COMPRESS/STRIP |
trimws()/str_trim() |
|
COALESCEC |
coalesce() |
Comparison
|
Feature |
IF-THEN |
SELECT-WHEN |
|
Best
for |
Complex
logic |
Categorical
mapping |
|
Performance |
Slightly
slower |
Faster |
|
Readability |
Medium |
High |
|
Nested
Conditions |
Excellent |
Moderate |
Professional
SAS programmers prefer SELECT-WHEN for category standardization.
4. PROC EXPORT and IMPORT
/*PROC EXPORT*/
proc export data=writers_final
outfile="writers_final.csv"
dbms=csv
replace;
run;
proc print data=writers_final;
run;
OUTPUT:
| Obs | Writer_ID | Writer_Name | Region | Genre | Country | Birth_Year | Books_Sold_M | Join_Date | Award_Status | Royalty_M |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | William Shakespeare | EUROPE | Drama | UK | 1564 | 400 | 12MAY2020 | YES | 120 |
| 2 | 102 | Leo Tolstoy | OTHER | Novel | RUSSIA | 1828 | 300 | . | YES | 50 |
| 3 | 103 | UNKNOWN | OTHER | Poetry | INDIA | 1861 | 250 | 22JUL2019 | NO | 75 |
| 4 | 104 | J.K. Rowling | EUROPE | Fantasy | UK | 1965 | 500 | 01JAN2022 | YES | 200 |
| 5 | 105 | Mark Twain | NORTH AMERICA | UNKNOWN | USA | 1835 | . | 11NOV2018 | YES | 90 |
| 6 | 106 | Fyodor Dostoevsky | OTHER | Novel | RUSSIA | 1821 | 150 | . | YES | 60 |
| 7 | 107 | Jane Austen | EUROPE | Romance | UK | 1775 | 180 | 05JUN2021 | YES | 85 |
| 8 | 108 | Ernest Hemingway | NORTH AMERICA | Novel | USA | 1899 | 220 | 14FEB2023 | NO | 110 |
| 9 | 109 | Rabindranath Tagore | OTHER | Poetry | INDIA | 1861 | 260 | 07JUL2022 | YES | 95 |
| 10 | 110 | George Orwell | EUROPE | Dystopian | UK | 1903 | 210 | . | YES | 88 |
| 11 | 111 | Homer | OTHER | Epic | GREECE | 190 | 66 | 12AUG2017 | NO | . |
Explanation
PROC
EXPORT generates business-ready deliverables.
Professional
outputs include:
- Excel reports
- CSV feeds
- Regulatory submissions
- Executive dashboards
#Exporting CSV in R
Equivalent to PROC EXPORT in SAS:
write.csv(writers_clean,
"writers_clean.csv",
row.names = FALSE)
OUTPUT:
|
|
Writer_ID |
Writer_Name |
Country |
Genre |
Birth_Year |
Books_Sold_M |
Royalty_M |
Award_Status |
Join_Date |
|
1 |
101 |
William Shakespeare |
UK |
Drama |
1564 |
400 |
120 |
YES |
12-05-2020 |
|
2 |
102 |
Leo Tolstoy |
RUSSIA |
Novel |
1828 |
300 |
50 |
YES |
NA |
|
3 |
103 |
UNKNOWN |
INDIA |
Poetry |
1861 |
250 |
75 |
NO |
22-07-2019 |
|
4 |
104 |
J.k. Rowling |
UK |
Fantasy |
1965 |
500 |
200 |
YES |
01-01-2022 |
|
5 |
105 |
Mark Twain |
USA |
UNKNOWN |
1835 |
NA |
90 |
YES |
11-11-2018 |
|
6 |
106 |
Fyodor Dostoevsky |
RUSSIA |
Novel |
1821 |
150 |
60 |
YES |
NA |
|
7 |
107 |
Jane Austen |
UK |
Romance |
1775 |
180 |
85 |
YES |
05-06-2021 |
|
8 |
108 |
Ernest Hemingway |
USA |
Novel |
1899 |
220 |
110 |
NO |
14-02-2023 |
|
9 |
109 |
Rabindranath Tagore |
INDIA |
Poetry |
1861 |
260 |
95 |
YES |
07-07-2022 |
|
10 |
110 |
George Orwell |
UK |
Dystopian |
1903 |
210 |
88 |
YES |
NA |
|
11 |
111 |
Homer |
GREECE |
Epic |
NA |
190 |
NA |
NO |
12-08-2017 |
/*PROC IMPORT*/
proc import datafile="writers_final.csv"
out=writers
dbms=csv
replace;
guessingrows=max;
run;
proc print data=writers;
run;
OUTPUT:
| Obs | Writer_ID | Writer_Name | Region | Genre | Country | Birth_Year | Books_Sold_M | Join_Date | Award_Status | Royalty_M |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | William Shakespeare | EUROPE | Drama | UK | 1564 | 400 | 12MAY2020 | YES | 120 |
| 2 | 102 | Leo Tolstoy | OTHER | Novel | RUSSIA | 1828 | 300 | . | YES | 50 |
| 3 | 103 | UNKNOWN | OTHER | Poetry | INDIA | 1861 | 250 | 22JUL2019 | NO | 75 |
| 4 | 104 | J.K. Rowling | EUROPE | Fantasy | UK | 1965 | 500 | 01JAN2022 | YES | 200 |
| 5 | 105 | Mark Twain | NORTH AMERICA | UNKNOWN | USA | 1835 | . | 11NOV2018 | YES | 90 |
| 6 | 106 | Fyodor Dostoevsky | OTHER | Novel | RUSSIA | 1821 | 150 | . | YES | 60 |
| 7 | 107 | Jane Austen | EUROPE | Romance | UK | 1775 | 180 | 05JUN2021 | YES | 85 |
| 8 | 108 | Ernest Hemingway | NORTH AMERICA | Novel | USA | 1899 | 220 | 14FEB2023 | NO | 110 |
| 9 | 109 | Rabindranath Tagore | OTHER | Poetry | INDIA | 1861 | 260 | 07JUL2022 | YES | 95 |
| 10 | 110 | George Orwell | EUROPE | Dystopian | UK | 1903 | 210 | . | YES | 88 |
| 11 | 111 | Homer | OTHER | Epic | GREECE | 190 | 66 | 12AUG2017 | NO | . |
Explanation
PROC
IMPORT automates ingestion of external data files.
Important
parameter:
guessingrows=max;
Without
it, SAS may incorrectly infer variable types.
#Importing CSV in R
Equivalent to PROC IMPORT in SAS:
writers_import <- read.csv(
"writers_clean.csv",
stringsAsFactors = FALSE)
OUTPUT:
|
|
Writer_ID |
Writer_Name |
Country |
Genre |
Birth_Year |
Books_Sold_M |
Royalty_M |
Award_Status |
Join_Date |
|
1 |
101 |
William Shakespeare |
UK |
Drama |
1564 |
400 |
120 |
YES |
12-05-2020 |
|
2 |
102 |
Leo Tolstoy |
RUSSIA |
Novel |
1828 |
300 |
50 |
YES |
NA |
|
3 |
103 |
UNKNOWN |
INDIA |
Poetry |
1861 |
250 |
75 |
NO |
22-07-2019 |
|
4 |
104 |
J.k. Rowling |
UK |
Fantasy |
1965 |
500 |
200 |
YES |
01-01-2022 |
|
5 |
105 |
Mark Twain |
USA |
UNKNOWN |
1835 |
NA |
90 |
YES |
11-11-2018 |
|
6 |
106 |
Fyodor Dostoevsky |
RUSSIA |
Novel |
1821 |
150 |
60 |
YES |
NA |
|
7 |
107 |
Jane Austen |
UK |
Romance |
1775 |
180 |
85 |
YES |
05-06-2021 |
|
8 |
108 |
Ernest Hemingway |
USA |
Novel |
1899 |
220 |
110 |
NO |
14-02-2023 |
|
9 |
109 |
Rabindranath Tagore |
INDIA |
Poetry |
1861 |
260 |
95 |
YES |
07-07-2022 |
|
10 |
110 |
George Orwell |
UK |
Dystopian |
1903 |
210 |
88 |
YES |
NA |
|
11 |
111 |
Homer |
GREECE |
Epic |
NA |
190 |
NA |
NO |
12-08-2017 |
Important Interview Point
|
PROC |
GUESSINGROWS Valid? |
|
PROC IMPORT |
YES |
|
PROC EXPORT |
NO |
Equivalent SAS vs R
Comparison
|
SAS |
R |
|
"C:\file.csv" |
❌ Error |
|
"C:\\file.csv" |
✅ Works |
|
"C:/file.csv" |
✅ Best
Practice |
5. Business Logic & The “Why”
Automated Loan Approval Scenario
Imagine a
bank using SAS for automated writer royalty-based loan approval.
Eligibility
rule:
- Royalty > 50 million
- Age > 18
- No missing records
Now
consider this dangerous SAS behavior:
Missing
numeric values are treated as smaller than all numbers.
data writers_final;
retain Writer_ID Writer_Name Region Genre Country Birth_Year
Books_Sold_M Join_Date Award_Status Royalty_M;
length Flag $8;
set writers_final;
if Royalty_M < 100 then Flag = "Low";
else Flag = "High";
run;
proc print data=writers_final;
run;
OUTPUT:
| Obs | Writer_ID | Writer_Name | Region | Genre | Country | Birth_Year | Books_Sold_M | Join_Date | Award_Status | Royalty_M | Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | William Shakespeare | EUROPE | Drama | UK | 1564 | 400 | 12MAY2020 | YES | 120 | High |
| 2 | 102 | Leo Tolstoy | OTHER | Novel | RUSSIA | 1828 | 300 | . | YES | 50 | Low |
| 3 | 103 | UNKNOWN | OTHER | Poetry | INDIA | 1861 | 250 | 22JUL2019 | NO | 75 | Low |
| 4 | 104 | J.K. Rowling | EUROPE | Fantasy | UK | 1965 | 500 | 01JAN2022 | YES | 200 | High |
| 5 | 105 | Mark Twain | NORTH AMERICA | UNKNOWN | USA | 1835 | . | 11NOV2018 | YES | 90 | Low |
| 6 | 106 | Fyodor Dostoevsky | OTHER | Novel | RUSSIA | 1821 | 150 | . | YES | 60 | Low |
| 7 | 107 | Jane Austen | EUROPE | Romance | UK | 1775 | 180 | 05JUN2021 | YES | 85 | Low |
| 8 | 108 | Ernest Hemingway | NORTH AMERICA | Novel | USA | 1899 | 220 | 14FEB2023 | NO | 110 | High |
| 9 | 109 | Rabindranath Tagore | OTHER | Poetry | INDIA | 1861 | 260 | 07JUL2022 | YES | 95 | Low |
| 10 | 110 | George Orwell | EUROPE | Dystopian | UK | 1903 | 210 | . | YES | 88 | Low |
| 11 | 111 | Homer | OTHER | Epic | GREECE | 190 | 66 | 12AUG2017 | NO | . | Low |
Missing
values also satisfy this condition.
A missing
royalty could accidentally classify an applicant incorrectly.
In
clinical trials, this could become catastrophic:
- Wrong patient inclusion
- Invalid safety conclusions
- Regulatory rejection
This is
called the Missing Value Trap.
Professional
developers always explicitly handle missing values.
Key Points of
Implementation
- Always validate imported
file structures.
- Define LENGTH before
assignments.
- Never trust raw external
data.
- Standardize capitalization
early.
- Remove duplicates before
aggregation.
- Convert dates immediately
after import.
- Use PROC CONTENTS for
metadata validation.
- Document every
transformation step.
- Use meaningful variable
names.
- Separate raw and cleaned
datasets.
- Preserve audit trails.
- Validate ranges using IF
conditions.
- Use formats for readability.
- Prefer SELECT-WHEN for
categorical mapping.
- Handle missing values explicitly.
- Never overwrite raw data.
- Use PROC FREQ for
categorical validation.
- Compare record counts after
merges.
- Use macros for reusable
logic.
- Maintain reproducible
workflows.
Extended Analysis & Reporting
Aggregation with PROC SUMMARY
proc summary data=writers_clean nway;
class Country;
var Royalty_M Books_Sold_M;
output out=country_summary sum=;
run;
proc print data = country_summary;
run;
OUTPUT:
| Obs | Country | _TYPE_ | _FREQ_ | Royalty_M | Books_Sold_M |
|---|---|---|---|---|---|
| 1 | GREECE | 1 | 1 | . | 66 |
| 2 | INDIA | 1 | 2 | 170 | 510 |
| 3 | RUSSIA | 1 | 2 | 110 | 450 |
| 4 | UK | 1 | 4 | 493 | 1290 |
| 5 | USA | 1 | 2 | 200 | 220 |
Explanation
PROC
SUMMARY creates aggregated business intelligence metrics.
This is
widely used for:
- Regional revenue reporting
- Clinical summaries
- Market segmentation
Professional Reporting Using PROC REPORT
proc report data=country_summary nowd;
columns Country Royalty_M Books_Sold_M;
define Country / group;
define Royalty_M / analysis sum;
define Books_Sold_M / analysis sum;
run;
OUTPUT:
| Country | Royalty_M | Books_Sold_M |
|---|---|---|
| GREECE | . | 66 |
| INDIA | 170 | 510 |
| RUSSIA | 110 | 450 |
| UK | 493 | 1290 |
| USA | 200 | 220 |
Explanation
PROC REPORT
creates polished enterprise outputs.
Professional
advantages:
- Custom formatting
- Executive presentation
- Controlled layouts
- Regulatory reporting
readiness
Advanced Text Cleaning in R
writers_clean$Country <- gsub("india","INDIA",
writers_clean$Country)
writers_clean$Writer_Name <- trimws(
writers_clean$Writer_Name)
OUTPUT:
| Writer_ID | Writer_Name | Country | Genre | Birth_Year | Books_Sold_M | Royalty_M | Award_Status | Join_Date |
1 | 101 | William Shakespeare | UK | Drama | 1564 | 400 | 120 | YES | 12-05-2020 |
2 | 102 | Leo Tolstoy | RUSSIA | Novel | 1828 | 300 | 50 | YES | NA |
3 | 103 | UNKNOWN | INDIA | Poetry | 1861 | 250 | 75 | NO | 22-07-2019 |
4 | 104 | J.k. Rowling | UK | Fantasy | 1965 | 500 | 200 | YES | 01-01-2022 |
5 | 105 | Mark Twain | USA | UNKNOWN | 1835 | NA | 90 | YES | 11-11-2018 |
6 | 106 | Fyodor Dostoevsky | RUSSIA | Novel | 1821 | 150 | 60 | YES | NA |
7 | 107 | Jane Austen | UK | Romance | 1775 | 180 | 85 | YES | 05-06-2021 |
8 | 108 | Ernest Hemingway | USA | Novel | 1899 | 220 | 110 | NO | 14-02-2023 |
9 | 109 | Rabindranath Tagore | INDIA | Poetry | 1861 | 260 | 95 | YES | 07-07-2022 |
10 | 110 | George Orwell | UK | Dystopian | 1903 | 210 | 88 | YES | NA |
11 | 111 | Homer | GREECE | Epic | NA | 190 | NA | NO | 12-08-2017 |
Explanation
gsub() performs regex-based replacement.
trimws() removes unwanted spaces.
These functions are critical when handling:
- Email cleanup
- Address normalization
- Free-text survey responses
6. 20 Additional Data
Cleaning Best Practices
- Validate SDTM variable
lengths.
- Maintain Define.xml consistency.
- Preserve source
traceability.
- Use controlled terminology.
- Validate adverse event
dates.
- Standardize units.
- Perform double-programming
validation.
- Create QC datasets.
- Track transformation
lineage.
- Use audit logs.
- Validate merge keys.
- Prevent orphan records.
- Normalize categorical
variables.
- Flag impossible ages.
- Validate visit windows.
- Review protocol deviations.
- Store raw snapshots
securely.
- Ensure reproducibility.
- Validate exports before
submission.
- Follow FDA compliance
guidelines.
7. Business Logic Behind
Data Cleaning
Data
cleaning exists because business decisions depend entirely on data quality. If
a patient’s age is incorrectly entered as -45 instead of 45, a clinical trial
may wrongly classify the subject into an invalid treatment group. If salary
values are missing in banking systems, loan approval engines may reject
qualified applicants or approve risky candidates.
Missing
values are often replaced because analytical models cannot interpret blanks
consistently. Standardization ensures systems behave predictably. Unrealistic
values are corrected because dashboards, machine learning models, and executive
reports rely on logical ranges.
Date
imputation is another critical practice. Suppose treatment dates are partially
missing in a clinical trial. Analysts may use protocol-approved imputation
methods to preserve timeline consistency while maintaining regulatory
transparency.
Salary
normalization ensures financial analytics remain stable across regions and
currencies. Similarly, publication royalty normalization allows publishing
companies to compare authors fairly across countries.
Ultimately,
data cleaning protects business credibility, analytical trust, and operational
accuracy.
8. 20 Key Points — Sharp
& Impactful
- Dirty data leads to wrong conclusions.
- Validation prevents
analytical disasters.
- Duplicate records inflate
metrics.
- Missing values silently
distort reports.
- Standardization ensures
reproducibility.
- Metadata matters as much as
data.
- Date formats drive timeline
accuracy.
- Audit trails protect
compliance.
- Structured imports reduce
failures.
- PROC SORT improves
reliability.
- DATA STEP offers
transformation flexibility.
- PROC SQL enhances relational
processing.
- Tidyverse simplifies modern
wrangling.
- Regex cleaning improves text
quality.
- Controlled terminology
boosts consistency.
- Documentation reduces
debugging time.
- QC programming strengthens
trust.
- Professional reporting
increases business value.
- Clean data powers machine
learning.
- Reliable analytics begin
with disciplined preprocessing.
9. Summary
SAS and R
are both extraordinary platforms for enterprise data cleaning and professional
analytics engineering. SAS excels in structured enterprise workflows,
auditability, regulatory reporting, and large-scale production systems. Its
DATA STEP architecture provides deterministic, highly reliable transformations
ideal for banking, pharmaceuticals, insurance, and government reporting.
R,
particularly through tidyverse, offers elegant and modern data wrangling
syntax. Functions like mutate(), filter(), replace_na(), and case_when()
dramatically simplify exploratory transformations and analytical prototyping.
This
project demonstrated how messy writer datasets containing duplicates, invalid
dates, missing values, inconsistent capitalization, and corrupted financial
values can be transformed into enterprise-grade analytical assets.
Using
PROC IMPORT and EXPORT, organizations can seamlessly integrate external systems
while producing polished professional outputs. DATA STEP and PROC SQL together
create flexible transformation ecosystems capable of handling everything from
SDTM clinical domains to publishing intelligence reports.
Ultimately,
successful analytics does not begin with dashboards or machine learning it begins with disciplined, validated,
reproducible data cleaning frameworks.
10. Conclusion
Professional
analytics is not about writing code alone. It is about engineering trust.
Every
business report, predictive model, executive dashboard, clinical submission, or
financial projection depends entirely on the integrity of the underlying data.
Even the most advanced artificial intelligence models become unreliable when
trained on corrupted datasets.
This
project demonstrated how intentionally flawed writer datasets can be
systematically transformed into clean, validated, analysis-ready structures
using SAS and R. Through PROC IMPORT, DATA STEP cleaning, PROC SQL
transformations, PROC SORT deduplication, PROC SUMMARY aggregation, PROC REPORT
presentation, and tidyverse refinement, we created a complete enterprise-style
data engineering workflow.
The true
value of SAS lies in its governance, reproducibility, metadata control, and
regulatory alignment. Meanwhile, R provides agile, elegant, and highly
expressive transformation capabilities ideal for exploratory analytics and
modern data science.
The most
important lesson is this:
Data
cleaning is not a preprocessing side task.
It is the foundation of analytical credibility.
A
well-designed cleaning framework prevents reporting failures, protects
regulatory compliance, improves machine learning accuracy, and enables
confident business decisions.
Whether
you work in clinical trials, banking, publishing, healthcare, or finance,
mastering structured data cleaning methodologies will permanently elevate your
analytical capabilities and professional value.
Clean
data creates trusted analytics.
Trusted analytics drive intelligent decisions.
Intelligent decisions build successful organizations.
11. Interview Questions and
Answers
1. Why is PROC SORT with NODUPKEY important?
Answer:
It
removes duplicate observations based on BY variables, preventing
double-counting and ensuring reporting accuracy.
2. Difference between PROC SQL and DATA STEP?
Answer:
DATA STEP
is row-wise procedural processing, while PROC SQL is relational and set-based
processing. PROC SQL is better for joins and aggregations.
3. Why should LENGTH appear before assignment
statements?
Answer:
SAS
determines character variable length during first compilation reference.
Incorrect placement causes truncation.
4. Scenario Question
A
clinical dataset contains missing patient ages. What risk exists?
Answer:
Missing
numeric values in SAS behave as the smallest values. Patients may accidentally
qualify for age-restricted studies.
5. R vs SAS Cleaning Strength?
Answer:
SAS excels in enterprise governance and reproducibility, while R provides flexible and concise modern wrangling using tidyverse.
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about WRITER DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment