Transforming the World’s Most Famous Writers Dataset into Professional Intelligence Using PROC IMPORT and EXPORT in SAS and R

1. Introduction — When Beautiful Analytics Collapse Because of Ugly Data

Imagine a global publishing company preparing an international literary analytics report about the world’s most famous writers. The business team wants insights about royalty earnings, author nationality trends, publication eras, and genre performance. The management expects polished dashboards, investor-ready reports, and executive summaries.

But the raw dataset arrives like this:

Missing author names
Negative royalty amounts
Invalid publication dates
Duplicate writer IDs
Mixed capitalization (“india”, “INDIA”, “India”)
Blank genre values
Corrupted text imported from CSV files

This is where analytics projects silently fail.

A single invalid date can break timelines.
A duplicate record can inflate financial projections.
A missing nationality can distort demographic analysis.
A negative royalty value can destroy forecasting models.

In industries like clinical trials, banking, insurance, and publishing, dirty data is not a small inconvenience — it becomes a business risk.

This is why professionals rely heavily on SAS and R for enterprise-grade data preparation, validation, reporting, and reproducibility.

This project demonstrates how to build, clean, validate, transform, import, export, and professionally present a dataset about the most famous writers in the world using:

PROC IMPORT
PROC EXPORT
DATA STEP
PROC SQL
PROC SORT
PROC REPORT
PROC SUMMARY
Tidyverse in R
dplyr
stringr

The objective is not only coding it is understanding professional data engineering logic behind reliable analytics.

2. Raw Data Creation in SAS and R

Raw Dataset with Intentional Errors

/*SAS Raw File Simulation Using INFILE*/

filename wrfile temp;

data _null_;

file wrfile;

|Royalty_M|Award_Status|Join_Date";

put "101|William Shakespeare|UK|Drama|1564|400|120|YES|12-05-2020";

put "102|Leo Tolstoy|Russia|Novel|1828|300|-50|YES|15-13-2021";

put "103|NULL|India|Poetry|1861|250|75|NO|22-07-2019";

put "104|J.K. Rowling|uk|Fantasy|1965|500|200|YES|01-01-2022";

put "105|Mark Twain|USA|NULL|1835|.|90|YES|11-11-2018";

put "106| Fyodor Dostoevsky |Russia|Novel|-1821|150|60|YES|31-09-2020";

put "107|Jane Austen|UK|Romance|1775|180|85|yes|05-06-2021";

put "108|Ernest Hemingway|USA|Novel|1899|220|110|NO|14-02-2023";

put "109|Rabindranath Tagore|india|Poetry|1861|260|95|YES|07-07-2022";

put "110|George Orwell|UK|Dystopian|1903|210|88|YES|29-02-2021";

put "111|Homer|Greece|Epic||190|66|.|NO|12-08-2017";

run;

LOG:

NOTE: The file WRFILE is:

       Filename=/saswork/SAS_work1DC100002BF7_odaws02-apse1-2.oda.sas.com/#LN00043,

       Owner Name=u63247146,Group Name=oda,

       Access Permission=-rw-r--r--,

       Last Modified=09May2026:15:36:18

 NOTE: 13 records were written to the file WRFILE.

       The minimum record length was 45.

       The maximum record length was 97.

Explanation

This raw dataset intentionally contains enterprise-level data quality issues commonly seen in production systems.

Key problems inserted intentionally:

Duplicate Writer_ID = 104
Negative royalty values
Invalid dates
Missing birth year
NULL strings
Inconsistent capitalization
Blank spaces
Missing numeric values

These issues simulate realistic imported flat files from external vendors or legacy systems.

The professional advantage of using INFILE is that SAS developers gain granular control over delimiters, informats, missing value handling, and column parsing.

Error Type	Example
Duplicate IDs	104 repeated
NULL values	Writer_Name
Negative values	-1821, -50
Invalid dates	15-13-2021
Mixed capitalization	india vs UK
Missing values	NA
Extra spaces	Fyodor Dostoevsky
Invalid categorical text	yes vs YES

The Truncation Trap — Why LENGTH Must Come First

data writers_raw;

length Writer_Name $40 Country $20 Genre $20 Award_Status $5;

infile wrfile dlm='|' firstobs=2;

input Writer_ID Writer_Name $ Country $ Genre $ Birth_Year Books_Sold_M

Royalty_M Award_Status $ Join_Date:$10.;

run;

proc print data = writers_raw;

run;

OUTPUT:


Obs	Writer_Name	Country	Genre	Award_Status	Writer_ID	Birth_Year	Books_Sold_M	Royalty_M	Join_Date
1	William Shakespeare	UK	Drama	YES	101	1564	400	120	12-05-2020
2	Leo Tolstoy	Russia	Novel	YES	102	1828	300	-50	15-13-2021
3	NULL	India	Poetry	NO	103	1861	250	75	22-07-2019
4	J.K. Rowling	uk	Fantasy	YES	104	1965	500	200	01-01-2022
5	J.K. Rowling	uk	Fantasy	YES	104	1965	500	200	01-01-2022
6	Mark Twain	USA	NULL	YES	105	1835	.	90	11-11-2018
7	Fyodor Dostoevsky	Russia	Novel	YES	106	-1821	150	60	31-09-2020
8	Jane Austen	UK	Romance	yes	107	1775	180	85	05-06-2021
9	Ernest Hemingway	USA	Novel	NO	108	1899	220	110	14-02-2023
10	Rabindranath Tagore	india	Poetry	YES	109	1861	260	95	07-07-2022
11	George Orwell	UK	Dystopian	YES	110	1903	210	88	29-02-2021
12	Homer	Greece	Epic	NO	111	190	66	.	12-08-2017

Why LENGTH is Critical

One of the biggest beginner mistakes in SAS is defining character lengths AFTER assignment logic.

Example mistake:

if Country='UK' then Region='Europe';

length Region $20;

In SAS, variable length is determined during first compilation reference.

If SAS first sees "Europe" as "Eur", future values get truncated permanently.

Professional SAS developers always place LENGTH before:

IF conditions
SET statements
MERGE statements
Assignment logic

This prevents silent data corruption.

R Code – Equivalent Raw Dataset

writers_raw <- data.frame(

Writer_ID = c(101,102,103,104,104,105,106,107,108,109,110,111),

Writer_Name = c("William Shakespeare","Leo Tolstoy","NULL",

"J.K. Rowling","J.K. Rowling","Mark Twain", "Fyodor Dostoevsky ",

"Jane Austen","Ernest Hemingway","Rabindranath Tagore",

"George Orwell","Homer"),

Country = c("UK","Russia","India","uk","uk","USA","Russia","UK",

"USA","india","UK","Greece"),

Genre = c("Drama","Novel","Poetry","Fantasy","Fantasy","NULL",

"Novel","Romance","Novel","Poetry","Dystopian","Epic"),

Birth_Year = c(1564,1828,1861,1965,1965,1835,-1821,1775,1899,

1861,1903,NA),

Books_Sold_M = c(400,300,250,500,500,NA,150,180,220,260,210,190),

Royalty_M = c(120,-50,75,200,200,90,60,85,110,95,88,NA),

Award_Status = c("YES","YES","NO","YES","YES","YES","YES","yes","NO",

"YES","YES","NO"),

Join_Date = c("12-05-2020","15-13-2021","22-07-2019","01-01-2022",

"01-01-2022","11-11-2018","31-09-2020","05-06-2021","14-02-2023",

"07-07-2022","29-02-2021","12-08-2017")

)

OUTPUT:

	Writer_ID	Writer_Name	Country	Genre	Birth_Year	Books_Sold_M	Royalty_M	Award_Status	Join_Date
1	101	William Shakespeare	UK	Drama	1564	400	120	YES	12-05-2020
2	102	Leo Tolstoy	Russia	Novel	1828	300	-50	YES	15-13-2021
3	103	NULL	India	Poetry	1861	250	75	NO	22-07-2019
4	104	J.K. Rowling	uk	Fantasy	1965	500	200	YES	01-01-2022
5	104	J.K. Rowling	uk	Fantasy	1965	500	200	YES	01-01-2022
6	105	Mark Twain	USA	NULL	1835	NA	90	YES	11-11-2018
7	106	Fyodor Dostoevsky	Russia	Novel	-1821	150	60	YES	31-09-2020
8	107	Jane Austen	UK	Romance	1775	180	85	yes	05-06-2021
9	108	Ernest Hemingway	USA	Novel	1899	220	110	NO	14-02-2023
10	109	Rabindranath Tagore	india	Poetry	1861	260	95	YES	07-07-2022
11	110	George Orwell	UK	Dystopian	1903	210	88	YES	29-02-2021
12	111	Homer	Greece	Epic	NA	190	NA	NO	12-08-2017

Equivalent of SAS LENGTH in R

In R, there is no direct LENGTH statement like SAS.

R automatically expands character vectors dynamically.

But professionals still control data structure using:

str(writers_raw)

This is conceptually similar to:

proc contents;

run;

in SAS.

3. Step-by-Step Cleaning Using DATA STEP

data writers_clean;

length Writer_Name $40 Country $20 Genre $20 Award_Status $10;

set writers_raw;

Writer_Name = propcase(strip(Writer_Name));

if upcase(strip(Writer_Name='Null')) then Writer_Name='UNKNOWN';

Country = upcase(Country);

if Genre = "NULL" then Genre = "UNKNOWN";

if not missing(Royalty_M) then

Royalty_M = abs(Royalty_M);

if not missing(Birth_Year) then Birth_Year = abs(Birth_Year);

Award_Status = strip(upcase(Award_Status));

Join_Date_New = input(Join_Date,ddmmyy10.);

format Join_Date_New date9.;

drop Join_Date;

rename Join_Date_New = Join_Date;

run;

proc print data = writers_clean;

run;

OUTPUT:


Obs	Writer_Name	Country	Genre	Award_Status	Writer_ID	Birth_Year	Books_Sold_M	Royalty_M	Join_Date
1	William Shakespeare	UK	Drama	YES	101	1564	400	120	12MAY2020
2	Leo Tolstoy	RUSSIA	Novel	YES	102	1828	300	50	.
3	UNKNOWN	INDIA	Poetry	NO	103	1861	250	75	22JUL2019
4	J.K. Rowling	UK	Fantasy	YES	104	1965	500	200	01JAN2022
5	J.K. Rowling	UK	Fantasy	YES	104	1965	500	200	01JAN2022
6	Mark Twain	USA	UNKNOWN	YES	105	1835	.	90	11NOV2018
7	Fyodor Dostoevsky	RUSSIA	Novel	YES	106	1821	150	60	.
8	Jane Austen	UK	Romance	YES	107	1775	180	85	05JUN2021
9	Ernest Hemingway	USA	Novel	NO	108	1899	220	110	14FEB2023
10	Rabindranath Tagore	INDIA	Poetry	YES	109	1861	260	95	07JUL2022
11	George Orwell	UK	Dystopian	YES	110	1903	210	88	.
12	Homer	GREECE	Epic	NO	111	190	66	.	12AUG2017

Technical Deep Dive

ABS Function

Royalty_M = abs(Royalty_M);

Negative royalty values are unrealistic.

ABS() converts negative values into positive standardized measures.

Widely used in:

Banking transactions
Financial normalization
Clinical lab corrections

INPUT and PUT

Join_Date_New = input(Join_Date,ddmmyy10.);

INPUT converts character dates into numeric SAS dates.

Char_Date = put(Join_Date_New,date9.);

PUT converts numeric dates back into formatted character values.

This conversion framework is essential in SDTM clinical datasets.

PROC SORT with NODUPKEY

proc sort data=writers_clean nodupkey;

by Writer_ID;

run;

proc print data = writers_clean;

run;

LOG:

 NOTE: There were 12 observations read from the data set WORK.WRITERS_CLEAN.

 NOTE: 1 observations with duplicate key values were deleted.

OUTPUT:


Obs	Writer_Name	Country	Genre	Award_Status	Writer_ID	Birth_Year	Books_Sold_M	Royalty_M	Join_Date
1	William Shakespeare	UK	Drama	YES	101	1564	400	120	12MAY2020
2	Leo Tolstoy	RUSSIA	Novel	YES	102	1828	300	50	.
3	UNKNOWN	INDIA	Poetry	NO	103	1861	250	75	22JUL2019
4	J.K. Rowling	UK	Fantasy	YES	104	1965	500	200	01JAN2022
5	Mark Twain	USA	UNKNOWN	YES	105	1835	.	90	11NOV2018
6	Fyodor Dostoevsky	RUSSIA	Novel	YES	106	1821	150	60	.
7	Jane Austen	UK	Romance	YES	107	1775	180	85	05JUN2021
8	Ernest Hemingway	USA	Novel	NO	108	1899	220	110	14FEB2023
9	Rabindranath Tagore	INDIA	Poetry	YES	109	1861	260	95	07JUL2022
10	George Orwell	UK	Dystopian	YES	110	1903	210	88	.
11	Homer	GREECE	Epic	NO	111	190	66	.	12AUG2017

Explanation

NODUPKEY removes duplicate records based on BY variables.

Professional importance:

Prevents double-counting
Removes redundant transactions
Ensures regulatory consistency

This is heavily used in:

SDTM DM datasets
Banking ledgers
Insurance claims

SAS vs R Logic Bridge

SAS	R
DATA STEP	mutate()
ABS()	abs()
STRIP()	str_trim()
UPCASE()	str_to_upper()
PROPCASE()	str_to_title()
PROC SORT NODUPKEY	distinct()
INPUT()	as.Date()

IF-THEN vs SELECT-WHEN

/*IF-THEN Example*/

data writers_clean;

retain Writer_ID Writer_Name Region Genre Country Birth_Year

Books_Sold_M Join_Date Award_Status Royalty_M;

length Region $20;

set writers_clean;

if Country='UK' then Region='EUROPE';

else if Country='USA' then Region='NORTH AMERICA';

else Region='OTHER';

run;

proc print data = writers_clean;

run;

OUTPUT:


Obs	Writer_ID	Writer_Name	Region	Genre	Country	Birth_Year	Books_Sold_M	Join_Date	Award_Status	Royalty_M
1	101	William Shakespeare	EUROPE	Drama	UK	1564	400	12MAY2020	YES	120
2	102	Leo Tolstoy	OTHER	Novel	RUSSIA	1828	300	.	YES	50
3	103	UNKNOWN	OTHER	Poetry	INDIA	1861	250	22JUL2019	NO	75
4	104	J.K. Rowling	EUROPE	Fantasy	UK	1965	500	01JAN2022	YES	200
5	105	Mark Twain	NORTH AMERICA	UNKNOWN	USA	1835	.	11NOV2018	YES	90
6	106	Fyodor Dostoevsky	OTHER	Novel	RUSSIA	1821	150	.	YES	60
7	107	Jane Austen	EUROPE	Romance	UK	1775	180	05JUN2021	YES	85
8	108	Ernest Hemingway	NORTH AMERICA	Novel	USA	1899	220	14FEB2023	NO	110
9	109	Rabindranath Tagore	OTHER	Poetry	INDIA	1861	260	07JUL2022	YES	95
10	110	George Orwell	EUROPE	Dystopian	UK	1903	210	.	YES	88
11	111	Homer	OTHER	Epic	GREECE	190	66	12AUG2017	NO	.

/*SELECT-WHEN Example*/

data writers_final;

retain Writer_ID Writer_Name Region Genre Country Birth_Year

Books_Sold_M Join_Date Award_Status Royalty_M;

length Region $20;

set writers_clean;

select(Country);

when ('UK') Region='EUROPE';

when ('USA') Region='NORTH AMERICA';

otherwise Region='OTHER';

end;

run;

proc print data = writers_final;

run;

OUTPUT:


Obs	Writer_ID	Writer_Name	Region	Genre	Country	Birth_Year	Books_Sold_M	Join_Date	Award_Status	Royalty_M
1	101	William Shakespeare	EUROPE	Drama	UK	1564	400	12MAY2020	YES	120
2	102	Leo Tolstoy	OTHER	Novel	RUSSIA	1828	300	.	YES	50
3	103	UNKNOWN	OTHER	Poetry	INDIA	1861	250	22JUL2019	NO	75
4	104	J.K. Rowling	EUROPE	Fantasy	UK	1965	500	01JAN2022	YES	200
5	105	Mark Twain	NORTH AMERICA	UNKNOWN	USA	1835	.	11NOV2018	YES	90
6	106	Fyodor Dostoevsky	OTHER	Novel	RUSSIA	1821	150	.	YES	60
7	107	Jane Austen	EUROPE	Romance	UK	1775	180	05JUN2021	YES	85
8	108	Ernest Hemingway	NORTH AMERICA	Novel	USA	1899	220	14FEB2023	NO	110
9	109	Rabindranath Tagore	OTHER	Poetry	INDIA	1861	260	07JUL2022	YES	95
10	110	George Orwell	EUROPE	Dystopian	UK	1903	210	.	YES	88
11	111	Homer	OTHER	Epic	GREECE	190	66	12AUG2017	NO	.

Why RETAIN Works

RETAIN does two things:

Retains values across observations
More importantly here → controls variable order

SAS displays variables in the exact order listed in RETAIN.

Professional SAS Insight

RETAIN is one of the most commonly used techniques in:

SDTM dataset structuring
ADaM dataset ordering
Regulatory submission datasets
Final reporting tables

Difference Between RETAIN vs KEEP

Feature	RETAIN	KEEP
Reorders variables	YES	YES
Removes variables	NO	YES
Retains values	YES	NO
Most used professionally	YES	Sometimes

The R Refinement Layer (Tidyverse Approach)

Cleaning the Dataset in R

library(dplyr)

library(stringr)

library(tidyr)

writers_clean <- writers_raw %>%

mutate(Writer_Name = str_trim(str_to_title(Writer_Name)),

Writer_Name = ifelse(toupper(Writer_Name)=="NULL",

"UNKNOWN",Writer_Name),

Country = str_to_upper(str_trim(Country)),

Genre = ifelse(toupper(Genre)=="NULL" | is.na(Genre),

"UNKNOWN",Genre),

Royalty_M = abs(Royalty_M),

Birth_Year = abs(Birth_Year),

Award_Status = str_to_upper(str_trim(Award_Status)),

Join_Date = as.Date(Join_Date,format="%d-%m-%Y")

) %>%

distinct(Writer_ID,.keep_all=TRUE)

OUTPUT:

	Writer_ID	Writer_Name	Country	Genre	Birth_Year	Books_Sold_M	Royalty_M	Award_Status	Join_Date
1	101	William Shakespeare	UK	Drama	1564	400	120	YES	12-05-2020
2	102	Leo Tolstoy	RUSSIA	Novel	1828	300	50	YES	NA
3	103	UNKNOWN	INDIA	Poetry	1861	250	75	NO	22-07-2019
4	104	J.k. Rowling	UK	Fantasy	1965	500	200	YES	01-01-2022
5	105	Mark Twain	USA	UNKNOWN	1835	NA	90	YES	11-11-2018
6	106	Fyodor Dostoevsky	RUSSIA	Novel	1821	150	60	YES	NA
7	107	Jane Austen	UK	Romance	1775	180	85	YES	05-06-2021
8	108	Ernest Hemingway	USA	Novel	1899	220	110	NO	14-02-2023
9	109	Rabindranath Tagore	INDIA	Poetry	1861	260	95	YES	07-07-2022
10	110	George Orwell	UK	Dystopian	1903	210	88	YES	NA
11	111	Homer	GREECE	Epic	NA	190	NA	NO	12-08-2017

Explanation

This pipeline demonstrates modern R wrangling using tidyverse.

Logic Bridge Between SAS and R

SAS	R Equivalent
DATA STEP	mutate()
IF-THEN	case_when()
PROC SORT NODUPKEY	distinct()
COMPRESS/STRIP	trimws()/str_trim()
COALESCEC	coalesce()

Comparison

Feature	IF-THEN	SELECT-WHEN
Best for	Complex logic	Categorical mapping
Performance	Slightly slower	Faster
Readability	Medium	High
Nested Conditions	Excellent	Moderate

Professional SAS programmers prefer SELECT-WHEN for category standardization.

4. PROC EXPORT and IMPORT

/*PROC EXPORT*/

proc export data=writers_final

outfile="writers_final.csv"

dbms=csv

replace;

run;

proc print data=writers_final;

run;

OUTPUT:


Obs	Writer_ID	Writer_Name	Region	Genre	Country	Birth_Year	Books_Sold_M	Join_Date	Award_Status	Royalty_M
1	101	William Shakespeare	EUROPE	Drama	UK	1564	400	12MAY2020	YES	120
2	102	Leo Tolstoy	OTHER	Novel	RUSSIA	1828	300	.	YES	50
3	103	UNKNOWN	OTHER	Poetry	INDIA	1861	250	22JUL2019	NO	75
4	104	J.K. Rowling	EUROPE	Fantasy	UK	1965	500	01JAN2022	YES	200
5	105	Mark Twain	NORTH AMERICA	UNKNOWN	USA	1835	.	11NOV2018	YES	90
6	106	Fyodor Dostoevsky	OTHER	Novel	RUSSIA	1821	150	.	YES	60
7	107	Jane Austen	EUROPE	Romance	UK	1775	180	05JUN2021	YES	85
8	108	Ernest Hemingway	NORTH AMERICA	Novel	USA	1899	220	14FEB2023	NO	110
9	109	Rabindranath Tagore	OTHER	Poetry	INDIA	1861	260	07JUL2022	YES	95
10	110	George Orwell	EUROPE	Dystopian	UK	1903	210	.	YES	88
11	111	Homer	OTHER	Epic	GREECE	190	66	12AUG2017	NO	.

Explanation

PROC EXPORT generates business-ready deliverables.

Professional outputs include:

Excel reports
CSV feeds
Regulatory submissions
Executive dashboards

#Exporting CSV in R

Equivalent to PROC EXPORT in SAS:

write.csv(writers_clean,

"writers_clean.csv",

row.names = FALSE)

OUTPUT:

	Writer_ID	Writer_Name	Country	Genre	Birth_Year	Books_Sold_M	Royalty_M	Award_Status	Join_Date
1	101	William Shakespeare	UK	Drama	1564	400	120	YES	12-05-2020
2	102	Leo Tolstoy	RUSSIA	Novel	1828	300	50	YES	NA
3	103	UNKNOWN	INDIA	Poetry	1861	250	75	NO	22-07-2019
4	104	J.k. Rowling	UK	Fantasy	1965	500	200	YES	01-01-2022
5	105	Mark Twain	USA	UNKNOWN	1835	NA	90	YES	11-11-2018
6	106	Fyodor Dostoevsky	RUSSIA	Novel	1821	150	60	YES	NA
7	107	Jane Austen	UK	Romance	1775	180	85	YES	05-06-2021
8	108	Ernest Hemingway	USA	Novel	1899	220	110	NO	14-02-2023
9	109	Rabindranath Tagore	INDIA	Poetry	1861	260	95	YES	07-07-2022
10	110	George Orwell	UK	Dystopian	1903	210	88	YES	NA
11	111	Homer	GREECE	Epic	NA	190	NA	NO	12-08-2017

/*PROC IMPORT*/

proc import datafile="writers_final.csv"

out=writers

dbms=csv

replace;

guessingrows=max;

run;

proc print data=writers;

run;

OUTPUT:


Obs	Writer_ID	Writer_Name	Region	Genre	Country	Birth_Year	Books_Sold_M	Join_Date	Award_Status	Royalty_M
1	101	William Shakespeare	EUROPE	Drama	UK	1564	400	12MAY2020	YES	120
2	102	Leo Tolstoy	OTHER	Novel	RUSSIA	1828	300	.	YES	50
3	103	UNKNOWN	OTHER	Poetry	INDIA	1861	250	22JUL2019	NO	75
4	104	J.K. Rowling	EUROPE	Fantasy	UK	1965	500	01JAN2022	YES	200
5	105	Mark Twain	NORTH AMERICA	UNKNOWN	USA	1835	.	11NOV2018	YES	90
6	106	Fyodor Dostoevsky	OTHER	Novel	RUSSIA	1821	150	.	YES	60
7	107	Jane Austen	EUROPE	Romance	UK	1775	180	05JUN2021	YES	85
8	108	Ernest Hemingway	NORTH AMERICA	Novel	USA	1899	220	14FEB2023	NO	110
9	109	Rabindranath Tagore	OTHER	Poetry	INDIA	1861	260	07JUL2022	YES	95
10	110	George Orwell	EUROPE	Dystopian	UK	1903	210	.	YES	88
11	111	Homer	OTHER	Epic	GREECE	190	66	12AUG2017	NO	.

Explanation

PROC IMPORT automates ingestion of external data files.

Important parameter:

guessingrows=max;

Without it, SAS may incorrectly infer variable types.

#Importing CSV in R

Equivalent to PROC IMPORT in SAS:

writers_import <- read.csv(

"writers_clean.csv",

stringsAsFactors = FALSE)

OUTPUT:

	Writer_ID	Writer_Name	Country	Genre	Birth_Year	Books_Sold_M	Royalty_M	Award_Status	Join_Date
1	101	William Shakespeare	UK	Drama	1564	400	120	YES	12-05-2020
2	102	Leo Tolstoy	RUSSIA	Novel	1828	300	50	YES	NA
3	103	UNKNOWN	INDIA	Poetry	1861	250	75	NO	22-07-2019
4	104	J.k. Rowling	UK	Fantasy	1965	500	200	YES	01-01-2022
5	105	Mark Twain	USA	UNKNOWN	1835	NA	90	YES	11-11-2018
6	106	Fyodor Dostoevsky	RUSSIA	Novel	1821	150	60	YES	NA
7	107	Jane Austen	UK	Romance	1775	180	85	YES	05-06-2021
8	108	Ernest Hemingway	USA	Novel	1899	220	110	NO	14-02-2023
9	109	Rabindranath Tagore	INDIA	Poetry	1861	260	95	YES	07-07-2022
10	110	George Orwell	UK	Dystopian	1903	210	88	YES	NA
11	111	Homer	GREECE	Epic	NA	190	NA	NO	12-08-2017

Important Interview Point

PROC	GUESSINGROWS Valid?
PROC IMPORT	YES
PROC EXPORT	NO

Equivalent SAS vs R Comparison

SAS	R
"C:\file.csv"	❌ Error
"C:\\file.csv"	✅ Works
"C:/file.csv"	✅ Best Practice

5. Business Logic & The “Why”

Automated Loan Approval Scenario

Imagine a bank using SAS for automated writer royalty-based loan approval.

Eligibility rule:

Royalty > 50 million
Age > 18
No missing records

Now consider this dangerous SAS behavior:

Missing numeric values are treated as smaller than all numbers.

data writers_final;

retain Writer_ID Writer_Name Region Genre Country Birth_Year

Books_Sold_M Join_Date Award_Status Royalty_M;

length Flag $8;

set writers_final;

if Royalty_M < 100 then Flag = "Low";

else Flag = "High";

run;

proc print data=writers_final;

run;

OUTPUT:


Obs	Writer_ID	Writer_Name	Region	Genre	Country	Birth_Year	Books_Sold_M	Join_Date	Award_Status	Royalty_M	Flag
1	101	William Shakespeare	EUROPE	Drama	UK	1564	400	12MAY2020	YES	120	High
2	102	Leo Tolstoy	OTHER	Novel	RUSSIA	1828	300	.	YES	50	Low
3	103	UNKNOWN	OTHER	Poetry	INDIA	1861	250	22JUL2019	NO	75	Low
4	104	J.K. Rowling	EUROPE	Fantasy	UK	1965	500	01JAN2022	YES	200	High
5	105	Mark Twain	NORTH AMERICA	UNKNOWN	USA	1835	.	11NOV2018	YES	90	Low
6	106	Fyodor Dostoevsky	OTHER	Novel	RUSSIA	1821	150	.	YES	60	Low
7	107	Jane Austen	EUROPE	Romance	UK	1775	180	05JUN2021	YES	85	Low
8	108	Ernest Hemingway	NORTH AMERICA	Novel	USA	1899	220	14FEB2023	NO	110	High
9	109	Rabindranath Tagore	OTHER	Poetry	INDIA	1861	260	07JUL2022	YES	95	Low
10	110	George Orwell	EUROPE	Dystopian	UK	1903	210	.	YES	88	Low
11	111	Homer	OTHER	Epic	GREECE	190	66	12AUG2017	NO	.	Low

Missing values also satisfy this condition.

A missing royalty could accidentally classify an applicant incorrectly.

In clinical trials, this could become catastrophic:

Wrong patient inclusion
Invalid safety conclusions
Regulatory rejection

This is called the Missing Value Trap.

Professional developers always explicitly handle missing values.

Key Points of Implementation

Always validate imported file structures.
Define LENGTH before assignments.
Never trust raw external data.
Standardize capitalization early.
Remove duplicates before aggregation.
Convert dates immediately after import.
Use PROC CONTENTS for metadata validation.
Document every transformation step.
Use meaningful variable names.
Separate raw and cleaned datasets.
Preserve audit trails.
Validate ranges using IF conditions.
Use formats for readability.
Prefer SELECT-WHEN for categorical mapping.
Handle missing values explicitly.
Never overwrite raw data.
Use PROC FREQ for categorical validation.
Compare record counts after merges.
Use macros for reusable logic.
Maintain reproducible workflows.

Extended Analysis & Reporting

Aggregation with PROC SUMMARY

proc summary data=writers_clean nway;

class Country;

var Royalty_M Books_Sold_M;

output out=country_summary sum=;

run;

proc print data = country_summary;

run;

OUTPUT:


Obs	Country	_TYPE_	_FREQ_	Royalty_M	Books_Sold_M
1	GREECE	1	1	.	66
2	INDIA	1	2	170	510
3	RUSSIA	1	2	110	450
4	UK	1	4	493	1290
5	USA	1	2	200	220

Explanation

PROC SUMMARY creates aggregated business intelligence metrics.

This is widely used for:

Regional revenue reporting
Clinical summaries
Market segmentation

Professional Reporting Using PROC REPORT

proc report data=country_summary nowd;

columns Country Royalty_M Books_Sold_M;

define Country / group;

define Royalty_M / analysis sum;

define Books_Sold_M / analysis sum;

run;

OUTPUT:


Country	Royalty_M	Books_Sold_M
GREECE	.	66
INDIA	170	510
RUSSIA	110	450
UK	493	1290
USA	200	220

Explanation

PROC REPORT creates polished enterprise outputs.

Professional advantages:

Custom formatting
Executive presentation
Controlled layouts
Regulatory reporting readiness

Advanced Text Cleaning in R

writers_clean$Country <- gsub("india","INDIA",

writers_clean$Country)

writers_clean$Writer_Name <- trimws(

writers_clean$Writer_Name)

OUTPUT:

	Writer_ID	Writer_Name	Country	Genre	Birth_Year	Books_Sold_M	Royalty_M	Award_Status	Join_Date
1	101	William Shakespeare	UK	Drama	1564	400	120	YES	12-05-2020
2	102	Leo Tolstoy	RUSSIA	Novel	1828	300	50	YES	NA
3	103	UNKNOWN	INDIA	Poetry	1861	250	75	NO	22-07-2019
4	104	J.k. Rowling	UK	Fantasy	1965	500	200	YES	01-01-2022
5	105	Mark Twain	USA	UNKNOWN	1835	NA	90	YES	11-11-2018
6	106	Fyodor Dostoevsky	RUSSIA	Novel	1821	150	60	YES	NA
7	107	Jane Austen	UK	Romance	1775	180	85	YES	05-06-2021
8	108	Ernest Hemingway	USA	Novel	1899	220	110	NO	14-02-2023
9	109	Rabindranath Tagore	INDIA	Poetry	1861	260	95	YES	07-07-2022
10	110	George Orwell	UK	Dystopian	1903	210	88	YES	NA
11	111	Homer	GREECE	Epic	NA	190	NA	NO	12-08-2017

Explanation

gsub() performs regex-based replacement.

trimws() removes unwanted spaces.

These functions are critical when handling:

Email cleanup
Address normalization
Free-text survey responses

6. 20 Additional Data Cleaning Best Practices

Validate SDTM variable lengths.
Maintain Define.xml consistency.
Preserve source traceability.
Use controlled terminology.
Validate adverse event dates.
Standardize units.
Perform double-programming validation.
Create QC datasets.
Track transformation lineage.
Use audit logs.
Validate merge keys.
Prevent orphan records.
Normalize categorical variables.
Flag impossible ages.
Validate visit windows.
Review protocol deviations.
Store raw snapshots securely.
Ensure reproducibility.
Validate exports before submission.
Follow FDA compliance guidelines.

7. Business Logic Behind Data Cleaning

Data cleaning exists because business decisions depend entirely on data quality. If a patient’s age is incorrectly entered as -45 instead of 45, a clinical trial may wrongly classify the subject into an invalid treatment group. If salary values are missing in banking systems, loan approval engines may reject qualified applicants or approve risky candidates.

Missing values are often replaced because analytical models cannot interpret blanks consistently. Standardization ensures systems behave predictably. Unrealistic values are corrected because dashboards, machine learning models, and executive reports rely on logical ranges.

Date imputation is another critical practice. Suppose treatment dates are partially missing in a clinical trial. Analysts may use protocol-approved imputation methods to preserve timeline consistency while maintaining regulatory transparency.

Salary normalization ensures financial analytics remain stable across regions and currencies. Similarly, publication royalty normalization allows publishing companies to compare authors fairly across countries.

Ultimately, data cleaning protects business credibility, analytical trust, and operational accuracy.

8. 20 Key Points — Sharp & Impactful

Dirty data leads to wrong conclusions.
Validation prevents analytical disasters.
Duplicate records inflate metrics.
Missing values silently distort reports.
Standardization ensures reproducibility.
Metadata matters as much as data.
Date formats drive timeline accuracy.
Audit trails protect compliance.
Structured imports reduce failures.
PROC SORT improves reliability.
DATA STEP offers transformation flexibility.
PROC SQL enhances relational processing.
Tidyverse simplifies modern wrangling.
Regex cleaning improves text quality.
Controlled terminology boosts consistency.
Documentation reduces debugging time.
QC programming strengthens trust.
Professional reporting increases business value.
Clean data powers machine learning.
Reliable analytics begin with disciplined preprocessing.

9. Summary

SAS and R are both extraordinary platforms for enterprise data cleaning and professional analytics engineering. SAS excels in structured enterprise workflows, auditability, regulatory reporting, and large-scale production systems. Its DATA STEP architecture provides deterministic, highly reliable transformations ideal for banking, pharmaceuticals, insurance, and government reporting.

R, particularly through tidyverse, offers elegant and modern data wrangling syntax. Functions like mutate(), filter(), replace_na(), and case_when() dramatically simplify exploratory transformations and analytical prototyping.

This project demonstrated how messy writer datasets containing duplicates, invalid dates, missing values, inconsistent capitalization, and corrupted financial values can be transformed into enterprise-grade analytical assets.

Using PROC IMPORT and EXPORT, organizations can seamlessly integrate external systems while producing polished professional outputs. DATA STEP and PROC SQL together create flexible transformation ecosystems capable of handling everything from SDTM clinical domains to publishing intelligence reports.

Ultimately, successful analytics does not begin with dashboards or machine learning it begins with disciplined, validated, reproducible data cleaning frameworks.

10. Conclusion

Professional analytics is not about writing code alone. It is about engineering trust.

Every business report, predictive model, executive dashboard, clinical submission, or financial projection depends entirely on the integrity of the underlying data. Even the most advanced artificial intelligence models become unreliable when trained on corrupted datasets.

This project demonstrated how intentionally flawed writer datasets can be systematically transformed into clean, validated, analysis-ready structures using SAS and R. Through PROC IMPORT, DATA STEP cleaning, PROC SQL transformations, PROC SORT deduplication, PROC SUMMARY aggregation, PROC REPORT presentation, and tidyverse refinement, we created a complete enterprise-style data engineering workflow.

The true value of SAS lies in its governance, reproducibility, metadata control, and regulatory alignment. Meanwhile, R provides agile, elegant, and highly expressive transformation capabilities ideal for exploratory analytics and modern data science.

The most important lesson is this:

Data cleaning is not a preprocessing side task.
It is the foundation of analytical credibility.

A well-designed cleaning framework prevents reporting failures, protects regulatory compliance, improves machine learning accuracy, and enables confident business decisions.

Whether you work in clinical trials, banking, publishing, healthcare, or finance, mastering structured data cleaning methodologies will permanently elevate your analytical capabilities and professional value.

Clean data creates trusted analytics.
Trusted analytics drive intelligent decisions.
Intelligent decisions build successful organizations.

11. Interview Questions and Answers

1. Why is PROC SORT with NODUPKEY important?

Answer:

It removes duplicate observations based on BY variables, preventing double-counting and ensuring reporting accuracy.

2. Difference between PROC SQL and DATA STEP?

Answer:

DATA STEP is row-wise procedural processing, while PROC SQL is relational and set-based processing. PROC SQL is better for joins and aggregations.

3. Why should LENGTH appear before assignment statements?

Answer:

SAS determines character variable length during first compilation reference. Incorrect placement causes truncation.

4. Scenario Question

A clinical dataset contains missing patient ages. What risk exists?

Answer:

Missing numeric values in SAS behave as the smallest values. Patients may accidentally qualify for age-restricted studies.

5. R vs SAS Cleaning Strength?

Answer:

SAS excels in enterprise governance and reproducibility, while R provides flexible and concise modern wrangling using tidyverse.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about WRITER DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Cleaning Surprising Incident Records the Right Way:Raw Data to Real Intelligence

2.From Billion-Dollar Marriages to Clean Datasets: The Art of SAS Standardization

3.The Hidden Power of SAS Dates: Building Smart Time Logic with INTNX and INTCK
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

Classic Authors, Modern Data Science: Building Reliable Literary Intelligence Using PROC SQL, DATA STEP and R

Equivalent of SAS LENGTH in R

Difference Between RETAIN vs KEEP

Important Interview Point

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS