Classic Authors, Modern Data Science: Building Reliable Literary Intelligence Using PROC SQL, DATA STEP and R

Transforming the World’s Most Famous Writers Dataset into Professional Intelligence Using PROC IMPORT and EXPORT in SAS and R

1. Introduction — When Beautiful Analytics Collapse Because of Ugly Data

Imagine a global publishing company preparing an international literary analytics report about the world’s most famous writers. The business team wants insights about royalty earnings, author nationality trends, publication eras, and genre performance. The management expects polished dashboards, investor-ready reports, and executive summaries.

But the raw dataset arrives like this:

  • Missing author names
  • Negative royalty amounts
  • Invalid publication dates
  • Duplicate writer IDs
  • Mixed capitalization (“india”, “INDIA”, “India”)
  • Blank genre values
  • Corrupted text imported from CSV files

This is where analytics projects silently fail.

A single invalid date can break timelines.
A duplicate record can inflate financial projections.
A missing nationality can distort demographic analysis.
A negative royalty value can destroy forecasting models.

In industries like clinical trials, banking, insurance, and publishing, dirty data is not a small inconvenience — it becomes a business risk.

This is why professionals rely heavily on SAS and R for enterprise-grade data preparation, validation, reporting, and reproducibility.

This project demonstrates how to build, clean, validate, transform, import, export, and professionally present a dataset about the most famous writers in the world using:

  • PROC IMPORT
  • PROC EXPORT
  • DATA STEP
  • PROC SQL
  • PROC SORT
  • PROC REPORT
  • PROC SUMMARY
  • Tidyverse in R
  • dplyr
  • stringr

The objective is not only coding  it is understanding professional data engineering logic behind reliable analytics.

2. Raw Data Creation in SAS and R

Raw Dataset with Intentional Errors

/*SAS Raw File Simulation Using INFILE*/

filename wrfile temp;

data _null_;

file wrfile;

put "Writer_ID|Writer_Name|Country|Genre|Birth_Year|Books_Sold_M

         |Royalty_M|Award_Status|Join_Date";


put "101|William Shakespeare|UK|Drama|1564|400|120|YES|12-05-2020";

put "102|Leo Tolstoy|Russia|Novel|1828|300|-50|YES|15-13-2021";

put "103|NULL|India|Poetry|1861|250|75|NO|22-07-2019";

put "104|J.K. Rowling|uk|Fantasy|1965|500|200|YES|01-01-2022";

put "104|J.K. Rowling|uk|Fantasy|1965|500|200|YES|01-01-2022";

put "105|Mark Twain|USA|NULL|1835|.|90|YES|11-11-2018";

put "106| Fyodor Dostoevsky |Russia|Novel|-1821|150|60|YES|31-09-2020";

put "107|Jane Austen|UK|Romance|1775|180|85|yes|05-06-2021";

put "108|Ernest Hemingway|USA|Novel|1899|220|110|NO|14-02-2023";

put "109|Rabindranath Tagore|india|Poetry|1861|260|95|YES|07-07-2022";

put "110|George Orwell|UK|Dystopian|1903|210|88|YES|29-02-2021";

put "111|Homer|Greece|Epic||190|66|.|NO|12-08-2017";

run;

LOG:

NOTE: The file WRFILE is:
Filename=/saswork/SAS_work1DC100002BF7_odaws02-apse1-2.oda.sas.com/#LN00043,
Owner Name=u63247146,Group Name=oda,
Access Permission=-rw-r--r--,
Last Modified=09May2026:15:36:18
NOTE: 13 records were written to the file WRFILE.
The minimum record length was 45.
The maximum record length was 97.

Explanation

This raw dataset intentionally contains enterprise-level data quality issues commonly seen in production systems.

Key problems inserted intentionally:

  • Duplicate Writer_ID = 104
  • Negative royalty values
  • Invalid dates
  • Missing birth year
  • NULL strings
  • Inconsistent capitalization
  • Blank spaces
  • Missing numeric values

These issues simulate realistic imported flat files from external vendors or legacy systems.

The professional advantage of using INFILE is that SAS developers gain granular control over delimiters, informats, missing value handling, and column parsing.

Error Type

Example

Duplicate IDs

104 repeated

NULL values

Writer_Name

Negative values

-1821, -50

Invalid dates

15-13-2021

Mixed capitalization

india vs UK

Missing values

NA

Extra spaces

Fyodor Dostoevsky

Invalid categorical text

yes vs YES

The Truncation Trap — Why LENGTH Must Come First

data writers_raw;

length Writer_Name $40 Country $20 Genre $20 Award_Status $5;

infile wrfile dlm='|' firstobs=2;

input Writer_ID Writer_Name $ Country $ Genre $ Birth_Year Books_Sold_M

      Royalty_M Award_Status $ Join_Date:$10.;

run;

proc print data = writers_raw;

run;

OUTPUT:

ObsWriter_NameCountryGenreAward_StatusWriter_IDBirth_YearBooks_Sold_MRoyalty_MJoin_Date
1William ShakespeareUKDramaYES101156440012012-05-2020
2Leo TolstoyRussiaNovelYES1021828300-5015-13-2021
3NULLIndiaPoetryNO10318612507522-07-2019
4J.K. RowlingukFantasyYES104196550020001-01-2022
5J.K. RowlingukFantasyYES104196550020001-01-2022
6Mark TwainUSANULLYES1051835.9011-11-2018
7Fyodor DostoevskyRussiaNovelYES106-18211506031-09-2020
8Jane AustenUKRomanceyes10717751808505-06-2021
9Ernest HemingwayUSANovelNO108189922011014-02-2023
10Rabindranath TagoreindiaPoetryYES10918612609507-07-2022
11George OrwellUKDystopianYES11019032108829-02-2021
12HomerGreeceEpicNO11119066.12-08-2017

Why LENGTH is Critical

One of the biggest beginner mistakes in SAS is defining character lengths AFTER assignment logic.

Example mistake:

if Country='UK' then Region='Europe';

length Region $20;

In SAS, variable length is determined during first compilation reference.

If SAS first sees "Europe" as "Eur", future values get truncated permanently.

Professional SAS developers always place LENGTH before:

  • IF conditions
  • SET statements
  • MERGE statements
  • Assignment logic

This prevents silent data corruption.

R Code – Equivalent Raw Dataset

writers_raw <- data.frame(

  Writer_ID = c(101,102,103,104,104,105,106,107,108,109,110,111),

  Writer_Name = c("William Shakespeare","Leo Tolstoy","NULL",

    "J.K. Rowling","J.K. Rowling","Mark Twain", "Fyodor Dostoevsky ",

    "Jane Austen","Ernest Hemingway","Rabindranath Tagore",

    "George Orwell","Homer"),

  Country = c("UK","Russia","India","uk","uk","USA","Russia","UK",

    "USA","india","UK","Greece"),

  Genre = c("Drama","Novel","Poetry","Fantasy","Fantasy","NULL",

    "Novel","Romance","Novel","Poetry","Dystopian","Epic"),

  Birth_Year = c(1564,1828,1861,1965,1965,1835,-1821,1775,1899,

    1861,1903,NA),

  Books_Sold_M = c(400,300,250,500,500,NA,150,180,220,260,210,190),

  Royalty_M = c(120,-50,75,200,200,90,60,85,110,95,88,NA),

  Award_Status = c("YES","YES","NO","YES","YES","YES","YES","yes","NO",

    "YES","YES","NO"),

  Join_Date = c("12-05-2020","15-13-2021","22-07-2019","01-01-2022",

    "01-01-2022","11-11-2018","31-09-2020","05-06-2021","14-02-2023",

    "07-07-2022","29-02-2021","12-08-2017")

)

OUTPUT:

 

Writer_ID

Writer_Name

Country

Genre

Birth_Year

Books_Sold_M

Royalty_M

Award_Status

Join_Date

1

101

William Shakespeare

UK

Drama

1564

400

120

YES

12-05-2020

2

102

Leo Tolstoy

Russia

Novel

1828

300

-50

YES

        15-13-2021

3

103

NULL

India

Poetry

1861

250

75

NO

22-07-2019

4

104

J.K. Rowling

uk

Fantasy

1965

500

200

YES

01-01-2022

5

104

J.K. Rowling

uk

Fantasy

1965

500

200

YES

01-01-2022

6

105

Mark Twain

USA

NULL

1835

NA

90

YES

11-11-2018

7

106

Fyodor Dostoevsky 

Russia

Novel

-1821

150

60

YES

        31-09-2020

8

107

Jane Austen

UK

Romance

1775

180

85

yes

05-06-2021

9

108

Ernest Hemingway

USA

Novel

1899

220

110

NO

14-02-2023

10

109

Rabindranath Tagore

india

Poetry

1861

260

95

YES

07-07-2022

11

110

George Orwell

UK

Dystopian

1903

210

88

YES

        29-02-2021

12

111

Homer

Greece

Epic

NA

190

NA

NO

12-08-2017

Equivalent of SAS LENGTH in R

In R, there is no direct LENGTH statement like SAS.

R automatically expands character vectors dynamically.

But professionals still control data structure using:

str(writers_raw)

This is conceptually similar to:

proc contents;
run;

in SAS.

3. Step-by-Step Cleaning Using DATA STEP

data writers_clean;

length Writer_Name $40 Country $20 Genre $20 Award_Status $10;

set writers_raw;

Writer_Name = propcase(strip(Writer_Name));

if upcase(strip(Writer_Name='Null')) then Writer_Name='UNKNOWN';

Country = upcase(Country);

if Genre = "NULL" then Genre = "UNKNOWN";

if not missing(Royalty_M) then 

Royalty_M = abs(Royalty_M);

if not missing(Birth_Year) then Birth_Year = abs(Birth_Year);

Award_Status = strip(upcase(Award_Status));

Join_Date_New = input(Join_Date,ddmmyy10.);

format Join_Date_New date9.;

drop Join_Date;

rename Join_Date_New = Join_Date;

run;

proc print data = writers_clean;

run;

OUTPUT:

ObsWriter_NameCountryGenreAward_StatusWriter_IDBirth_YearBooks_Sold_MRoyalty_MJoin_Date
1William ShakespeareUKDramaYES101156440012012MAY2020
2Leo TolstoyRUSSIANovelYES102182830050.
3UNKNOWNINDIAPoetryNO10318612507522JUL2019
4J.K. RowlingUKFantasyYES104196550020001JAN2022
5J.K. RowlingUKFantasyYES104196550020001JAN2022
6Mark TwainUSAUNKNOWNYES1051835.9011NOV2018
7Fyodor DostoevskyRUSSIANovelYES106182115060.
8Jane AustenUKRomanceYES10717751808505JUN2021
9Ernest HemingwayUSANovelNO108189922011014FEB2023
10Rabindranath TagoreINDIAPoetryYES10918612609507JUL2022
11George OrwellUKDystopianYES110190321088.
12HomerGREECEEpicNO11119066.12AUG2017

Technical Deep Dive

ABS Function

Royalty_M = abs(Royalty_M);

Negative royalty values are unrealistic.

ABS() converts negative values into positive standardized measures.

Widely used in:

  • Banking transactions
  • Financial normalization
  • Clinical lab corrections

INPUT and PUT

Join_Date_New = input(Join_Date,ddmmyy10.);

INPUT converts character dates into numeric SAS dates.

Char_Date = put(Join_Date_New,date9.);

PUT converts numeric dates back into formatted character values.

This conversion framework is essential in SDTM clinical datasets.

PROC SORT with NODUPKEY

proc sort data=writers_clean nodupkey;

by Writer_ID;

run;

proc print data = writers_clean;

run;

LOG:

NOTE: There were 12 observations read from the data set WORK.WRITERS_CLEAN.
NOTE: 1 observations with duplicate key values were deleted.

OUTPUT:

ObsWriter_NameCountryGenreAward_StatusWriter_IDBirth_YearBooks_Sold_MRoyalty_MJoin_Date
1William ShakespeareUKDramaYES101156440012012MAY2020
2Leo TolstoyRUSSIANovelYES102182830050.
3UNKNOWNINDIAPoetryNO10318612507522JUL2019
4J.K. RowlingUKFantasyYES104196550020001JAN2022
5Mark TwainUSAUNKNOWNYES1051835.9011NOV2018
6Fyodor DostoevskyRUSSIANovelYES106182115060.
7Jane AustenUKRomanceYES10717751808505JUN2021
8Ernest HemingwayUSANovelNO108189922011014FEB2023
9Rabindranath TagoreINDIAPoetryYES10918612609507JUL2022
10George OrwellUKDystopianYES110190321088.
11HomerGREECEEpicNO11119066.12AUG2017

Explanation

NODUPKEY removes duplicate records based on BY variables.

Professional importance:

  • Prevents double-counting
  • Removes redundant transactions
  • Ensures regulatory consistency

This is heavily used in:

  • SDTM DM datasets
  • Banking ledgers
  • Insurance claims

SAS vs R Logic Bridge

SAS

R

DATA STEP

mutate()

ABS()

abs()

STRIP()

str_trim()

UPCASE()

str_to_upper()

PROPCASE()

str_to_title()

PROC SORT NODUPKEY

distinct()

INPUT()

as.Date()

IF-THEN vs SELECT-WHEN

/*IF-THEN Example*/

data writers_clean;

 retain Writer_ID Writer_Name Region Genre Country Birth_Year 

        Books_Sold_M Join_Date Award_Status Royalty_M;

 length Region $20;

 set writers_clean;

 if Country='UK' then Region='EUROPE';

 else if Country='USA' then Region='NORTH AMERICA';

 else Region='OTHER';

run;

proc print data = writers_clean;

run;

OUTPUT:

ObsWriter_IDWriter_NameRegionGenreCountryBirth_YearBooks_Sold_MJoin_DateAward_StatusRoyalty_M
1101William ShakespeareEUROPEDramaUK156440012MAY2020YES120
2102Leo TolstoyOTHERNovelRUSSIA1828300.YES50
3103UNKNOWNOTHERPoetryINDIA186125022JUL2019NO75
4104J.K. RowlingEUROPEFantasyUK196550001JAN2022YES200
5105Mark TwainNORTH AMERICAUNKNOWNUSA1835.11NOV2018YES90
6106Fyodor DostoevskyOTHERNovelRUSSIA1821150.YES60
7107Jane AustenEUROPERomanceUK177518005JUN2021YES85
8108Ernest HemingwayNORTH AMERICANovelUSA189922014FEB2023NO110
9109Rabindranath TagoreOTHERPoetryINDIA186126007JUL2022YES95
10110George OrwellEUROPEDystopianUK1903210.YES88
11111HomerOTHEREpicGREECE1906612AUG2017NO.

/*SELECT-WHEN Example*/

data writers_final; 

 retain Writer_ID Writer_Name Region Genre Country Birth_Year 

        Books_Sold_M Join_Date Award_Status Royalty_M;

 length Region $20;

 set writers_clean;

 select(Country);

 when ('UK') Region='EUROPE';

 when ('USA') Region='NORTH AMERICA';

 otherwise Region='OTHER';

 end;

run;

proc print data = writers_final;

run;

OUTPUT:

ObsWriter_IDWriter_NameRegionGenreCountryBirth_YearBooks_Sold_MJoin_DateAward_StatusRoyalty_M
1101William ShakespeareEUROPEDramaUK156440012MAY2020YES120
2102Leo TolstoyOTHERNovelRUSSIA1828300.YES50
3103UNKNOWNOTHERPoetryINDIA186125022JUL2019NO75
4104J.K. RowlingEUROPEFantasyUK196550001JAN2022YES200
5105Mark TwainNORTH AMERICAUNKNOWNUSA1835.11NOV2018YES90
6106Fyodor DostoevskyOTHERNovelRUSSIA1821150.YES60
7107Jane AustenEUROPERomanceUK177518005JUN2021YES85
8108Ernest HemingwayNORTH AMERICANovelUSA189922014FEB2023NO110
9109Rabindranath TagoreOTHERPoetryINDIA186126007JUL2022YES95
10110George OrwellEUROPEDystopianUK1903210.YES88
11111HomerOTHEREpicGREECE1906612AUG2017NO.

Why RETAIN Works

RETAIN does two things:

  1. Retains values across observations
  2. More importantly here → controls variable order

SAS displays variables in the exact order listed in RETAIN.

Professional SAS Insight

RETAIN is one of the most commonly used techniques in:

  • SDTM dataset structuring
  • ADaM dataset ordering
  • Regulatory submission datasets
  • Final reporting tables

Difference Between RETAIN vs KEEP

Feature

RETAIN

KEEP

Reorders variables

YES

YES

Removes variables

NO

YES

Retains values

YES

NO

Most used professionally

YES

Sometimes

The R Refinement Layer (Tidyverse Approach)

Cleaning the Dataset in R

library(dplyr)

library(stringr)

library(tidyr)

writers_clean <- writers_raw %>%

  mutate(Writer_Name = str_trim(str_to_title(Writer_Name)),

         Writer_Name = ifelse(toupper(Writer_Name)=="NULL",

                              "UNKNOWN",Writer_Name),

        Country = str_to_upper(str_trim(Country)),

        Genre = ifelse(toupper(Genre)=="NULL" | is.na(Genre),

                       "UNKNOWN",Genre),

        Royalty_M = abs(Royalty_M),

        Birth_Year = abs(Birth_Year),

        Award_Status = str_to_upper(str_trim(Award_Status)),

        Join_Date = as.Date(Join_Date,format="%d-%m-%Y")

) %>%

  distinct(Writer_ID,.keep_all=TRUE)

OUTPUT:

 

Writer_ID

Writer_Name

Country

Genre

Birth_Year

Books_Sold_M

Royalty_M

Award_Status

Join_Date

1

101

William Shakespeare

UK

Drama

1564

400

120

YES

12-05-2020

2

102

Leo Tolstoy

RUSSIA

Novel

1828

300

50

YES

                 NA

3

103

UNKNOWN

INDIA

Poetry

1861

250

75

NO

22-07-2019

4

104

J.k. Rowling

UK

Fantasy

1965

500

200

YES

01-01-2022

5

105

Mark Twain

USA

UNKNOWN

1835

NA

90

YES

11-11-2018

6

106

Fyodor Dostoevsky

RUSSIA

Novel

1821

150

60

YES

                  NA

7

107

Jane Austen

UK

Romance

1775

180

85

YES

05-06-2021

8

108

Ernest Hemingway

USA

Novel

1899

220

110

NO

14-02-2023

9

109

Rabindranath Tagore

INDIA

Poetry

1861

260

95

YES

07-07-2022

10

110

George Orwell

UK

Dystopian

1903

210

88

YES

                  NA

11

111

Homer

GREECE

Epic

NA

190

NA

NO

12-08-2017

Explanation

This pipeline demonstrates modern R wrangling using tidyverse.

Logic Bridge Between SAS and R

SAS

R Equivalent

DATA STEP

mutate()

IF-THEN

case_when()

PROC SORT NODUPKEY

distinct()

COMPRESS/STRIP

trimws()/str_trim()

COALESCEC

coalesce()

Comparison

Feature

IF-THEN

SELECT-WHEN

Best for

Complex logic

Categorical mapping

Performance

Slightly slower

Faster

Readability

Medium

High

Nested Conditions

Excellent

Moderate

Professional SAS programmers prefer SELECT-WHEN for category standardization.

4. PROC EXPORT and IMPORT

/*PROC EXPORT*/

proc export data=writers_final

         outfile="writers_final.csv"

         dbms=csv

         replace;

run;

proc print data=writers_final;

run;

OUTPUT:

ObsWriter_IDWriter_NameRegionGenreCountryBirth_YearBooks_Sold_MJoin_DateAward_StatusRoyalty_M
1101William ShakespeareEUROPEDramaUK156440012MAY2020YES120
2102Leo TolstoyOTHERNovelRUSSIA1828300.YES50
3103UNKNOWNOTHERPoetryINDIA186125022JUL2019NO75
4104J.K. RowlingEUROPEFantasyUK196550001JAN2022YES200
5105Mark TwainNORTH AMERICAUNKNOWNUSA1835.11NOV2018YES90
6106Fyodor DostoevskyOTHERNovelRUSSIA1821150.YES60
7107Jane AustenEUROPERomanceUK177518005JUN2021YES85
8108Ernest HemingwayNORTH AMERICANovelUSA189922014FEB2023NO110
9109Rabindranath TagoreOTHERPoetryINDIA186126007JUL2022YES95
10110George OrwellEUROPEDystopianUK1903210.YES88
11111HomerOTHEREpicGREECE1906612AUG2017NO.

Explanation

PROC EXPORT generates business-ready deliverables.

Professional outputs include:

  • Excel reports
  • CSV feeds
  • Regulatory submissions
  • Executive dashboards

#Exporting CSV in R

Equivalent to PROC EXPORT in SAS:

  write.csv(writers_clean,

  "writers_clean.csv",

  row.names = FALSE)

OUTPUT:

 

Writer_ID

Writer_Name

Country

Genre

Birth_Year

Books_Sold_M

Royalty_M

Award_Status

Join_Date

1

101

William Shakespeare

UK

Drama

1564

400

120

YES

12-05-2020

2

102

Leo Tolstoy

RUSSIA

Novel

1828

300

50

YES

               NA

3

103

UNKNOWN

INDIA

Poetry

1861

250

75

NO

22-07-2019

4

104

J.k. Rowling

UK

Fantasy

1965

500

200

YES

01-01-2022

5

105

Mark Twain

USA

UNKNOWN

1835

NA

90

YES

11-11-2018

6

106

Fyodor Dostoevsky

RUSSIA

Novel

1821

150

60

YES

              NA

7

107

Jane Austen

UK

Romance

1775

180

85

YES

05-06-2021

8

108

Ernest Hemingway

USA

Novel

1899

220

110

NO

14-02-2023

9

109

Rabindranath Tagore

INDIA

Poetry

1861

260

95

YES

07-07-2022

10

110

George Orwell

UK

Dystopian

1903

210

88

YES

              NA

11

111

Homer

GREECE

Epic

NA

190

NA

NO

12-08-2017


/*PROC IMPORT*/

proc import datafile="writers_final.csv"

                           out=writers

dbms=csv

replace;

guessingrows=max;

run;

proc print data=writers;

run;

OUTPUT:

ObsWriter_IDWriter_NameRegionGenreCountryBirth_YearBooks_Sold_MJoin_DateAward_StatusRoyalty_M
1101William ShakespeareEUROPEDramaUK156440012MAY2020YES120
2102Leo TolstoyOTHERNovelRUSSIA1828300.YES50
3103UNKNOWNOTHERPoetryINDIA186125022JUL2019NO75
4104J.K. RowlingEUROPEFantasyUK196550001JAN2022YES200
5105Mark TwainNORTH AMERICAUNKNOWNUSA1835.11NOV2018YES90
6106Fyodor DostoevskyOTHERNovelRUSSIA1821150.YES60
7107Jane AustenEUROPERomanceUK177518005JUN2021YES85
8108Ernest HemingwayNORTH AMERICANovelUSA189922014FEB2023NO110
9109Rabindranath TagoreOTHERPoetryINDIA186126007JUL2022YES95
10110George OrwellEUROPEDystopianUK1903210.YES88
11111HomerOTHEREpicGREECE1906612AUG2017NO.

Explanation

PROC IMPORT automates ingestion of external data files.

Important parameter:

guessingrows=max;

Without it, SAS may incorrectly infer variable types.

#Importing CSV in R

Equivalent to PROC IMPORT in SAS:

  writers_import <- read.csv(

    "writers_clean.csv",

    stringsAsFactors = FALSE)

OUTPUT:

 

Writer_ID

Writer_Name

Country

Genre

Birth_Year

Books_Sold_M

Royalty_M

Award_Status

Join_Date

1

101

William Shakespeare

UK

Drama

1564

400

120

YES

12-05-2020

2

102

Leo Tolstoy

RUSSIA

Novel

1828

300

50

YES

              NA

3

103

UNKNOWN

INDIA

Poetry

1861

250

75

NO

22-07-2019

4

104

J.k. Rowling

UK

Fantasy

1965

500

200

YES

01-01-2022

5

105

Mark Twain

USA

UNKNOWN

1835

NA

90

YES

11-11-2018

6

106

Fyodor Dostoevsky

RUSSIA

Novel

1821

150

60

YES

              NA

7

107

Jane Austen

UK

Romance

1775

180

85

YES

05-06-2021

8

108

Ernest Hemingway

USA

Novel

1899

220

110

NO

14-02-2023

9

109

Rabindranath Tagore

INDIA

Poetry

1861

260

95

YES

07-07-2022

10

110

George Orwell

UK

Dystopian

1903

210

88

YES

              NA

11

111

Homer

GREECE

Epic

NA

190

NA

NO

12-08-2017

Important Interview Point

PROC

GUESSINGROWS Valid?

PROC IMPORT

YES

PROC EXPORT

NO

Equivalent SAS vs R Comparison

SAS

R

"C:\file.csv"

❌ Error

"C:\\file.csv"

✅ Works

"C:/file.csv"

✅ Best Practice

5. Business Logic & The “Why”

Automated Loan Approval Scenario

Imagine a bank using SAS for automated writer royalty-based loan approval.

Eligibility rule:

  • Royalty > 50 million
  • Age > 18
  • No missing records

Now consider this dangerous SAS behavior:

Missing numeric values are treated as smaller than all numbers.

data writers_final; 

 retain Writer_ID Writer_Name Region Genre Country Birth_Year 

        Books_Sold_M Join_Date Award_Status Royalty_M;

 length Flag $8;

 set writers_final;

 if Royalty_M < 100 then Flag = "Low";

 else Flag = "High";

run;

proc print data=writers_final;

run;

OUTPUT:

ObsWriter_IDWriter_NameRegionGenreCountryBirth_YearBooks_Sold_MJoin_DateAward_StatusRoyalty_MFlag
1101William ShakespeareEUROPEDramaUK156440012MAY2020YES120High
2102Leo TolstoyOTHERNovelRUSSIA1828300.YES50Low
3103UNKNOWNOTHERPoetryINDIA186125022JUL2019NO75Low
4104J.K. RowlingEUROPEFantasyUK196550001JAN2022YES200High
5105Mark TwainNORTH AMERICAUNKNOWNUSA1835.11NOV2018YES90Low
6106Fyodor DostoevskyOTHERNovelRUSSIA1821150.YES60Low
7107Jane AustenEUROPERomanceUK177518005JUN2021YES85Low
8108Ernest HemingwayNORTH AMERICANovelUSA189922014FEB2023NO110High
9109Rabindranath TagoreOTHERPoetryINDIA186126007JUL2022YES95Low
10110George OrwellEUROPEDystopianUK1903210.YES88Low
11111HomerOTHEREpicGREECE1906612AUG2017NO.Low

Missing values also satisfy this condition.

A missing royalty could accidentally classify an applicant incorrectly.

In clinical trials, this could become catastrophic:

  • Wrong patient inclusion
  • Invalid safety conclusions
  • Regulatory rejection

This is called the Missing Value Trap.

Professional developers always explicitly handle missing values.

Key Points of Implementation

  1. Always validate imported file structures.
  2. Define LENGTH before assignments.
  3. Never trust raw external data.
  4. Standardize capitalization early.
  5. Remove duplicates before aggregation.
  6. Convert dates immediately after import.
  7. Use PROC CONTENTS for metadata validation.
  8. Document every transformation step.
  9. Use meaningful variable names.
  10. Separate raw and cleaned datasets.
  11. Preserve audit trails.
  12. Validate ranges using IF conditions.
  13. Use formats for readability.
  14. Prefer SELECT-WHEN for categorical mapping.
  15. Handle missing values explicitly.
  16. Never overwrite raw data.
  17. Use PROC FREQ for categorical validation.
  18. Compare record counts after merges.
  19. Use macros for reusable logic.
  20. Maintain reproducible workflows.

Extended Analysis & Reporting

Aggregation with PROC SUMMARY

proc summary data=writers_clean nway;

class Country;

var Royalty_M Books_Sold_M;

output out=country_summary sum=;

run;

proc print data = country_summary;

run;

OUTPUT:

ObsCountry_TYPE__FREQ_Royalty_MBooks_Sold_M
1GREECE11.66
2INDIA12170510
3RUSSIA12110450
4UK144931290
5USA12200220

Explanation

PROC SUMMARY creates aggregated business intelligence metrics.

This is widely used for:

  • Regional revenue reporting
  • Clinical summaries
  • Market segmentation

Professional Reporting Using PROC REPORT

proc report data=country_summary nowd;

columns Country Royalty_M Books_Sold_M;

define Country / group;

define Royalty_M / analysis sum;

define Books_Sold_M / analysis sum;

run;

OUTPUT:

CountryRoyalty_MBooks_Sold_M
GREECE.66
INDIA170510
RUSSIA110450
UK4931290
USA200220

Explanation

PROC REPORT creates polished enterprise outputs.

Professional advantages:

  • Custom formatting
  • Executive presentation
  • Controlled layouts
  • Regulatory reporting readiness

Advanced Text Cleaning in R

writers_clean$Country <- gsub("india","INDIA",

                              writers_clean$Country)


writers_clean$Writer_Name <- trimws(

  writers_clean$Writer_Name)

OUTPUT:

 

Writer_ID

Writer_Name

Country

Genre

Birth_Year

Books_Sold_M

Royalty_M

Award_Status

Join_Date

1

101

William Shakespeare

UK

Drama

1564

400

120

YES

12-05-2020

2

102

Leo Tolstoy

RUSSIA

Novel

1828

300

50

YES

              NA

3

103

UNKNOWN

INDIA

Poetry

1861

250

75

NO

22-07-2019

4

104

J.k. Rowling

UK

Fantasy

1965

500

200

YES

01-01-2022

5

105

Mark Twain

USA

UNKNOWN

1835

NA

90

YES

11-11-2018

6

106

Fyodor Dostoevsky

RUSSIA

Novel

1821

150

60

YES

              NA

7

107

Jane Austen

UK

Romance

1775

180

85

YES

05-06-2021

8

108

Ernest Hemingway

USA

Novel

1899

220

110

NO

14-02-2023

9

109

Rabindranath Tagore

INDIA

Poetry

1861

260

95

YES

07-07-2022

10

110

George Orwell

UK

Dystopian

1903

210

88

YES

               NA

11

111

Homer

GREECE

Epic

NA

190

NA

NO

12-08-2017

Explanation

gsub() performs regex-based replacement.

trimws() removes unwanted spaces.

These functions are critical when handling:

  • Email cleanup
  • Address normalization
  • Free-text survey responses

6. 20 Additional Data Cleaning Best Practices

  1. Validate SDTM variable lengths.
  2. Maintain Define.xml consistency.
  3. Preserve source traceability.
  4. Use controlled terminology.
  5. Validate adverse event dates.
  6. Standardize units.
  7. Perform double-programming validation.
  8. Create QC datasets.
  9. Track transformation lineage.
  10. Use audit logs.
  11. Validate merge keys.
  12. Prevent orphan records.
  13. Normalize categorical variables.
  14. Flag impossible ages.
  15. Validate visit windows.
  16. Review protocol deviations.
  17. Store raw snapshots securely.
  18. Ensure reproducibility.
  19. Validate exports before submission.
  20. Follow FDA compliance guidelines.

7. Business Logic Behind Data Cleaning

Data cleaning exists because business decisions depend entirely on data quality. If a patient’s age is incorrectly entered as -45 instead of 45, a clinical trial may wrongly classify the subject into an invalid treatment group. If salary values are missing in banking systems, loan approval engines may reject qualified applicants or approve risky candidates.

Missing values are often replaced because analytical models cannot interpret blanks consistently. Standardization ensures systems behave predictably. Unrealistic values are corrected because dashboards, machine learning models, and executive reports rely on logical ranges.

Date imputation is another critical practice. Suppose treatment dates are partially missing in a clinical trial. Analysts may use protocol-approved imputation methods to preserve timeline consistency while maintaining regulatory transparency.

Salary normalization ensures financial analytics remain stable across regions and currencies. Similarly, publication royalty normalization allows publishing companies to compare authors fairly across countries.

Ultimately, data cleaning protects business credibility, analytical trust, and operational accuracy.

8. 20 Key Points — Sharp & Impactful

  1. Dirty data leads to wrong conclusions.
  2. Validation prevents analytical disasters.
  3. Duplicate records inflate metrics.
  4. Missing values silently distort reports.
  5. Standardization ensures reproducibility.
  6. Metadata matters as much as data.
  7. Date formats drive timeline accuracy.
  8. Audit trails protect compliance.
  9. Structured imports reduce failures.
  10. PROC SORT improves reliability.
  11. DATA STEP offers transformation flexibility.
  12. PROC SQL enhances relational processing.
  13. Tidyverse simplifies modern wrangling.
  14. Regex cleaning improves text quality.
  15. Controlled terminology boosts consistency.
  16. Documentation reduces debugging time.
  17. QC programming strengthens trust.
  18. Professional reporting increases business value.
  19. Clean data powers machine learning.
  20. Reliable analytics begin with disciplined preprocessing.

9. Summary

SAS and R are both extraordinary platforms for enterprise data cleaning and professional analytics engineering. SAS excels in structured enterprise workflows, auditability, regulatory reporting, and large-scale production systems. Its DATA STEP architecture provides deterministic, highly reliable transformations ideal for banking, pharmaceuticals, insurance, and government reporting.

R, particularly through tidyverse, offers elegant and modern data wrangling syntax. Functions like mutate(), filter(), replace_na(), and case_when() dramatically simplify exploratory transformations and analytical prototyping.

This project demonstrated how messy writer datasets containing duplicates, invalid dates, missing values, inconsistent capitalization, and corrupted financial values can be transformed into enterprise-grade analytical assets.

Using PROC IMPORT and EXPORT, organizations can seamlessly integrate external systems while producing polished professional outputs. DATA STEP and PROC SQL together create flexible transformation ecosystems capable of handling everything from SDTM clinical domains to publishing intelligence reports.

Ultimately, successful analytics does not begin with dashboards or machine learning  it begins with disciplined, validated, reproducible data cleaning frameworks.

10. Conclusion

Professional analytics is not about writing code alone. It is about engineering trust.

Every business report, predictive model, executive dashboard, clinical submission, or financial projection depends entirely on the integrity of the underlying data. Even the most advanced artificial intelligence models become unreliable when trained on corrupted datasets.

This project demonstrated how intentionally flawed writer datasets can be systematically transformed into clean, validated, analysis-ready structures using SAS and R. Through PROC IMPORT, DATA STEP cleaning, PROC SQL transformations, PROC SORT deduplication, PROC SUMMARY aggregation, PROC REPORT presentation, and tidyverse refinement, we created a complete enterprise-style data engineering workflow.

The true value of SAS lies in its governance, reproducibility, metadata control, and regulatory alignment. Meanwhile, R provides agile, elegant, and highly expressive transformation capabilities ideal for exploratory analytics and modern data science.

The most important lesson is this:

Data cleaning is not a preprocessing side task.
It is the foundation of analytical credibility.

A well-designed cleaning framework prevents reporting failures, protects regulatory compliance, improves machine learning accuracy, and enables confident business decisions.

Whether you work in clinical trials, banking, publishing, healthcare, or finance, mastering structured data cleaning methodologies will permanently elevate your analytical capabilities and professional value.

Clean data creates trusted analytics.
Trusted analytics drive intelligent decisions.
Intelligent decisions build successful organizations.

11. Interview Questions and Answers

1. Why is PROC SORT with NODUPKEY important?

Answer:

It removes duplicate observations based on BY variables, preventing double-counting and ensuring reporting accuracy.

2. Difference between PROC SQL and DATA STEP?

Answer:

DATA STEP is row-wise procedural processing, while PROC SQL is relational and set-based processing. PROC SQL is better for joins and aggregations.

3. Why should LENGTH appear before assignment statements?

Answer:

SAS determines character variable length during first compilation reference. Incorrect placement causes truncation.

4. Scenario Question

A clinical dataset contains missing patient ages. What risk exists?

Answer:

Missing numeric values in SAS behave as the smallest values. Patients may accidentally qualify for age-restricted studies.

5. R vs SAS Cleaning Strength?

Answer:

SAS excels in enterprise governance and reproducibility, while R provides flexible and concise modern wrangling using tidyverse.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about WRITER DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



3.The Hidden Power of SAS Dates: Building Smart Time Logic with INTNX and INTCK
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

453.Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS