Silk, Culture & Structured Code: Transforming the World’s Most Beautiful Sarees Dataset into Analytical Intelligence with PROC TRANSPOSE in SAS and R
1. Introduction
Imagine a luxury global fashion company collecting saree sales and textile heritage data from different countries. One regional team enters “Kanchipuram” while another types “kanchipuram,” and another leaves it blank as “NULL.” Prices appear in multiple currencies, launch dates are inconsistent, duplicate inventory records exist, and one analyst accidentally enters a negative saree price.
Now imagine executives using this corrupted dataset to forecast international textile demand.

That is how bad data quietly destroys analytics.
In modern data science and business intelligence, messy datasets are more dangerous than missing datasets. Whether you work in clinical trials, banking, retail, or textile analytics, poor-quality data creates misleading insights, regulatory issues, incorrect forecasting, and disastrous strategic decisions.
This is where powerful analytical ecosystems like SAS and R become essential.
In this project, we will explore a fascinating dataset about the most beautiful sarees in the world while learning one of the most practical reshaping procedures in SAS: PROC TRANSPOSE.
We will intentionally create dirty data and then systematically clean, standardize, transpose, validate, and analyze it using:
- DATA STEP
- PROC SQL
- PROC TRANSPOSE
- Tidyverse in R
- Advanced conditional logic
- Deduplication frameworks
- Business-rule validations
This is not just a coding exercise. This is how enterprise-grade data engineering actually works.
2. Raw Data Creation in SAS and R
Business Scenario
A multinational saree export company maintains textile inventory data from India, Japan, France, UAE, and the USA. Unfortunately, the incoming raw file contains:
- Missing values
- Duplicate records
- Invalid prices
- Wrong launch dates
- Mixed text formats
- NULL placeholders
- Negative values
SAS Raw Dataset Creation
data sarees_raw;
length Saree_Name $25 Region $20 Fabric $20
       Launch_Date $10 Designer $20;
infile datalines dlm='|' truncover;
input Saree_ID Saree_Name $ Region $ Price Launch_Date $
      Fabric $ Designer $ Rating Units_Sold;
datalines;
101|Kanchipuram Silk|India|25000|12-01-2024|Silk|Ritu|4.8|120
102|Banarasi Royale|india|-15000|15-02-2024|SILK|NULL|4.7|90
103|Patola Heritage|India|32000|31-15-2024|Cotton|Meera|4.9|85
104|Chiffon Elegance|France|18000|22-03-2024|Chiffon|Aria|.|70
104|Chiffon Elegance|France|18000|22-03-2024|Chiffon|Aria|.|70
105|NULL|Japan|22000|10-04-2024|linen|Yuki|4.5|65
106|Bandhani Queen|UAE|.|05-05-2024|Bandhani|Sara|4.6|55
107|Mysore Silk|India|27000|18-06-2024|Silk|Kiran|4.8|120
108|Organza Dream|USA|15000|NULL|Organza|Emily|4.2|40
109|Cotton Classic|India|9000|25-07-2024|Cotton|NULL|4.0|130
110|Royal Tissue|India|45000|11-08-2024|Tissue|Dev|5.0|20
;
run;
proc print data = sarees_raw;
run;
OUTPUT:
| Obs | Saree_ID | Saree_Name | Region | Fabric | Launch_Date | Designer | Price | Rating | Units_Sold |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Kanchipuram Silk | India | Silk | 12-01-2024 | Ritu | 25000 | 4.8 | 120 |
| 2 | 102 | Banarasi Royale | india | SILK | 15-02-2024 | NULL | -15000 | 4.7 | 90 |
| 3 | 103 | Patola Heritage | India | Cotton | 31-15-2024 | Meera | 32000 | 4.9 | 85 |
| 4 | 104 | Chiffon Elegance | France | Chiffon | 22-03-2024 | Aria | 18000 | . | 70 |
| 5 | 104 | Chiffon Elegance | France | Chiffon | 22-03-2024 | Aria | 18000 | . | 70 |
| 6 | 105 | NULL | Japan | linen | 10-04-2024 | Yuki | 22000 | 4.5 | 65 |
| 7 | 106 | Bandhani Queen | UAE | Bandhani | 05-05-2024 | Sara | . | 4.6 | 55 |
| 8 | 107 | Mysore Silk | India | Silk | 18-06-2024 | Kiran | 27000 | 4.8 | 120 |
| 9 | 108 | Organza Dream | USA | Organza | NULL | Emily | 15000 | 4.2 | 40 |
| 10 | 109 | Cotton Classic | India | Cotton | 25-07-2024 | NULL | 9000 | 4.0 | 130 |
| 11 | 110 | Royal Tissue | India | Tissue | 11-08-2024 | Dev | 45000 | 5.0 | 20 |
Explanation of SAS Code
This DATA STEP creates a deliberately messy textile dataset. The LENGTH statement prevents truncation of character variables, which is critical in enterprise SAS programming. Without proper length allocation, values like “Kanchipuram Silk” could become partially stored.

The INFILE DATALINES statement simulates reading a flat file. We intentionally inserted invalid business conditions such as negative prices, missing ratings, duplicate IDs, invalid dates, and inconsistent capitalization.

This mirrors real-world enterprise ingestion pipelines where external vendors submit inconsistent data structures.
Key Points
- TRUNCOVER prevents row overflow issues
- DLM='|' improves readability
- Duplicate Saree_ID introduced intentionally
- Invalid date 31-15-2024
- Negative price used for validation testing
R Code – Equivalent Raw Dataset
sarees_raw <- data.frame(
Saree_ID = c(101,102,103,104,104,105,106,107,108,109,110),
Saree_Name = c("Kanchipuram Silk","Banarasi Royale","Patola Heritage",
"Chiffon Elegance","Chiffon Elegance","NULL","Bandhani Queen",
"Mysore Silk","Organza Dream","Cotton Classic","Royal Tissue"),
Region = c("India","india","India","France","France","Japan","UAE","India",
"USA","India","India"),
Price = c(25000,-15000,32000,18000,18000,22000,NA,27000,15000,9000,45000),
Launch_Date = c("12-01-2024","15-02-2024","31-15-2024","22-03-2024",
"22-03-2024","10-04-2024","05-05-2024","18-06-2024",
"NULL","25-07-2024","11-08-2024"),
Fabric = c("Silk","SILK","Cotton","Chiffon","Chiffon","linen","Bandhani",
"Silk","Organza","Cotton","Tissue"),
Designer = c("Ritu","NULL","Meera","Aria","Aria","Yuki","Sara","Kiran",
"Emily","NULL","Dev"),
Rating = c(4.8,4.7,4.9,NA,NA,4.5,4.6,4.8,4.2,4.0,5.0),
Units_Sold = c(120,90,85,70,70,65,55,120,40,130,20)
)
OUTPUT:
| | Saree_ID | Saree_Name | Region | Price | Launch_Date | Fabric | Designer | Rating | Units_Sold |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Kanchipuram Silk | India | 25000 | 12-01-2024 | Silk | Ritu | 4.8 | 120 |
| 2 | 102 | Banarasi Royale | india | -15000 | 15-02-2024 | SILK | NULL | 4.7 | 90 |
| 3 | 103 | Patola Heritage | India | 32000 | 31-15-2024 | Cotton | Meera | 4.9 | 85 |
| 4 | 104 | Chiffon Elegance | France | 18000 | 22-03-2024 | Chiffon | Aria | NA | 70 |
| 5 | 104 | Chiffon Elegance | France | 18000 | 22-03-2024 | Chiffon | Aria | NA | 70 |
| 6 | 105 | NULL | Japan | 22000 | 10-04-2024 | linen | Yuki | 4.5 | 65 |
| 7 | 106 | Bandhani Queen | UAE | NA | 05-05-2024 | Bandhani | Sara | 4.6 | 55 |
| 8 | 107 | Mysore Silk | India | 27000 | 18-06-2024 | Silk | Kiran | 4.8 | 120 |
| 9 | 108 | Organza Dream | USA | 15000 | NULL | Organza | Emily | 4.2 | 40 |
| 10 | 109 | Cotton Classic | India | 9000 | 25-07-2024 | Cotton | NULL | 4.0 | 130 |
| 11 | 110 | Royal Tissue | India | 45000 | 11-08-2024 | Tissue | Dev | 5.0 | 20 |
Explanation of R Code
The data.frame() function recreates the same raw business dataset in R. This provides cross-platform consistency between SAS and R environments.

The dataset intentionally includes:
- NA values
- “NULL” placeholders
- inconsistent case formatting
- duplicates
- invalid numerical entries

This structure simulates raw ERP or vendor-uploaded textile inventory feeds.
Logic Bridge (SAS vs R)
| SAS Concept | R Equivalent |
|---|---|
| DATA STEP | data.frame() |
| LENGTH | character allocation |
| Missing numeric “.” | NA |
| INFILE DATALINES | inline vectors |
3. The SAS Engineering Layer
SAS Data Cleaning Using DATA STEP
data sarees_clean;
set sarees_raw;
length Region_Clean $20 Fabric_Clean $20 Designer_Clean $20
Category $12;
Region_Clean = upcase(strip(Region));
Fabric_Clean = propcase(strip(Fabric));
Designer_Clean = coalescec(Designer,"UNKNOWN");
if Designer_Clean='NULL' then Designer_Clean='UNKNOWN';
if not missing(Price) then Price = abs(Price);
if missing(Rating) then Rating=4.0;
Launch_Date_New = input(Launch_Date,?? ddmmyy10.);
format Launch_Date_New date9.;
if Saree_Name='NULL' then Saree_Name='UNKNOWN';
if Price > 30000 then Category='PREMIUM';
else if Price > 15000 then Category='LUXURY';
else Category='STANDARD';
drop Designer Region Fabric Launch_Date;
rename Designer_Clean = Designer
Fabric_Clean = Fabric
Region_Clean = Region
Launch_Date_New = Launch_Date;
run;
proc print data = sarees_clean;
run;
OUTPUT:
| Obs | Saree_ID | Saree_Name | Price | Rating | Units_Sold | Region | Fabric | Designer | Category | Launch_Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Kanchipuram Silk | 25000 | 4.8 | 120 | INDIA | Silk | Ritu | LUXURY | 12JAN2024 |
| 2 | 102 | Banarasi Royale | 15000 | 4.7 | 90 | INDIA | Silk | UNKNOWN | STANDARD | 15FEB2024 |
| 3 | 103 | Patola Heritage | 32000 | 4.9 | 85 | INDIA | Cotton | Meera | PREMIUM | . |
| 4 | 104 | Chiffon Elegance | 18000 | 4.0 | 70 | FRANCE | Chiffon | Aria | LUXURY | 22MAR2024 |
| 5 | 104 | Chiffon Elegance | 18000 | 4.0 | 70 | FRANCE | Chiffon | Aria | LUXURY | 22MAR2024 |
| 6 | 105 | UNKNOWN | 22000 | 4.5 | 65 | JAPAN | Linen | Yuki | LUXURY | 10APR2024 |
| 7 | 106 | Bandhani Queen | . | 4.6 | 55 | UAE | Bandhani | Sara | STANDARD | 05MAY2024 |
| 8 | 107 | Mysore Silk | 27000 | 4.8 | 120 | INDIA | Silk | Kiran | LUXURY | 18JUN2024 |
| 9 | 108 | Organza Dream | 15000 | 4.2 | 40 | USA | Organza | Emily | STANDARD | . |
| 10 | 109 | Cotton Classic | 9000 | 4.0 | 130 | INDIA | Cotton | UNKNOWN | STANDARD | 25JUL2024 |
| 11 | 110 | Royal Tissue | 45000 | 5.0 | 20 | INDIA | Tissue | Dev | PREMIUM | 11AUG2024 |
Explanation
This DATA STEP performs enterprise-grade cleaning logic.
Core SAS Functions Used
- COALESCEC() → Handles missing character values
- ABS() → Fixes negative prices
- INPUT() → Converts character dates into SAS date values
- UPCASE() and PROPCASE() → Standardization
- STRIP() → Removes leading/trailing spaces
A single ? suppresses the invalid data message in the log.

A double ?? suppresses both:
- the invalid data message
- the automatic _ERROR_=1 flag
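A minimal sketch of the difference, reusing the invalid date from our raw file (DATA _NULL_ writes only to the log, creating no dataset):

```sas
/* Sketch: with ??, an unparseable date simply returns missing,
   with no log note and without setting _ERROR_=1 */
data _null_;
    bad = input('31-15-2024', ?? ddmmyy10.);  /* bad = . (missing) */
    put bad=;
run;
```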
IF-THEN vs SELECT-WHEN
Alternative categorization:
data sarees_clean;
length Category $12;
set sarees_raw;
select;
when (Price > 30000) Category='PREMIUM';
when (Price > 15000) Category='LUXURY';
otherwise Category='STANDARD';
end;
run;
proc print data = sarees_clean;
run;
OUTPUT:
| Obs | Category | Saree_ID | Saree_Name | Region | Fabric | Launch_Date | Designer | Price | Rating | Units_Sold |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LUXURY | 101 | Kanchipuram Silk | India | Silk | 12-01-2024 | Ritu | 25000 | 4.8 | 120 |
| 2 | STANDARD | 102 | Banarasi Royale | india | SILK | 15-02-2024 | NULL | -15000 | 4.7 | 90 |
| 3 | PREMIUM | 103 | Patola Heritage | India | Cotton | 31-15-2024 | Meera | 32000 | 4.9 | 85 |
| 4 | LUXURY | 104 | Chiffon Elegance | France | Chiffon | 22-03-2024 | Aria | 18000 | . | 70 |
| 5 | LUXURY | 104 | Chiffon Elegance | France | Chiffon | 22-03-2024 | Aria | 18000 | . | 70 |
| 6 | LUXURY | 105 | NULL | Japan | linen | 10-04-2024 | Yuki | 22000 | 4.5 | 65 |
| 7 | STANDARD | 106 | Bandhani Queen | UAE | Bandhani | 05-05-2024 | Sara | . | 4.6 | 55 |
| 8 | LUXURY | 107 | Mysore Silk | India | Silk | 18-06-2024 | Kiran | 27000 | 4.8 | 120 |
| 9 | STANDARD | 108 | Organza Dream | USA | Organza | NULL | Emily | 15000 | 4.2 | 40 |
| 10 | STANDARD | 109 | Cotton Classic | India | Cotton | 25-07-2024 | NULL | 9000 | 4.0 | 130 |
| 11 | PREMIUM | 110 | Royal Tissue | India | Tissue | 11-08-2024 | Dev | 45000 | 5.0 | 20 |
Why LENGTH Matters
Without defining sufficient variable lengths before assignment, SAS may truncate values. Example: “PREMIUM_COLLECTION” could become “PREMI”. This creates downstream reporting failures.
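A minimal sketch of the risk, using a hypothetical variable `Tier` (not part of the project datasets): without a LENGTH statement, the first assignment fixes the variable's width.

```sas
/* Sketch: Tier gets length 3 from the first assignment ('TOP'),
   so 'STANDARD' is silently truncated to 'STA' */
data demo;
    set sarees_raw;
    if Price > 30000 then Tier='TOP';
    else Tier='STANDARD';
run;
```

Declaring `length Tier $12;` before the IF-THEN logic avoids the problem, which is exactly why the cleaning step above declares lengths first.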
PROC SORT with NODUPKEY
proc sort data=sarees_clean nodupkey;
by Saree_ID;
run;
proc print data = sarees_clean;
run;
OUTPUT:
| Obs | Category | Saree_ID | Saree_Name | Region | Fabric | Launch_Date | Designer | Price | Rating | Units_Sold |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LUXURY | 101 | Kanchipuram Silk | India | Silk | 12-01-2024 | Ritu | 25000 | 4.8 | 120 |
| 2 | STANDARD | 102 | Banarasi Royale | india | SILK | 15-02-2024 | NULL | -15000 | 4.7 | 90 |
| 3 | PREMIUM | 103 | Patola Heritage | India | Cotton | 31-15-2024 | Meera | 32000 | 4.9 | 85 |
| 4 | LUXURY | 104 | Chiffon Elegance | France | Chiffon | 22-03-2024 | Aria | 18000 | . | 70 |
| 5 | LUXURY | 105 | NULL | Japan | linen | 10-04-2024 | Yuki | 22000 | 4.5 | 65 |
| 6 | STANDARD | 106 | Bandhani Queen | UAE | Bandhani | 05-05-2024 | Sara | . | 4.6 | 55 |
| 7 | LUXURY | 107 | Mysore Silk | India | Silk | 18-06-2024 | Kiran | 27000 | 4.8 | 120 |
| 8 | STANDARD | 108 | Organza Dream | USA | Organza | NULL | Emily | 15000 | 4.2 | 40 |
| 9 | STANDARD | 109 | Cotton Classic | India | Cotton | 25-07-2024 | NULL | 9000 | 4.0 | 130 |
| 10 | PREMIUM | 110 | Royal Tissue | India | Tissue | 11-08-2024 | Dev | 45000 | 5.0 | 20 |
Explanation
PROC SORT with the NODUPKEY option removes duplicate business keys efficiently. In enterprise textile systems, duplicate inventory records can inflate:
- revenue forecasts
- stock valuation
- demand planning
This procedure ensures record uniqueness.
Understanding PROC TRANSPOSE with Examples
Original Structure
| Saree_ID | Fabric | Units_Sold |
|---|---|---|
| 101 | Silk | 120 |
Transpose Logic
proc transpose data=sarees_clean out=sarees_transposed;
by Region notsorted;
var Units_Sold Price;
run;
proc print data = sarees_transposed;
run;
OUTPUT:
| Obs | Region | _NAME_ | COL1 | COL2 |
|---|---|---|---|---|
| 1 | India | Units_Sold | 120 | . |
| 2 | India | Price | 25000 | . |
| 3 | india | Units_Sold | 90 | . |
| 4 | india | Price | -15000 | . |
| 5 | India | Units_Sold | 85 | . |
| 6 | India | Price | 32000 | . |
| 7 | France | Units_Sold | 70 | . |
| 8 | France | Price | 18000 | . |
| 9 | Japan | Units_Sold | 65 | . |
| 10 | Japan | Price | 22000 | . |
| 11 | UAE | Units_Sold | 55 | . |
| 12 | UAE | Price | . | . |
| 13 | India | Units_Sold | 120 | . |
| 14 | India | Price | 27000 | . |
| 15 | USA | Units_Sold | 40 | . |
| 16 | USA | Price | 15000 | . |
| 17 | India | Units_Sold | 130 | 20 |
| 18 | India | Price | 9000 | 45000 |
Explanation of PROC TRANSPOSE
PROC TRANSPOSE reshapes rows into columns.

This is extremely useful when:
- preparing dashboard data
- converting longitudinal data
- creating matrix reports
- preparing ML-ready wide-format datasets
Practical Example
Retail executives may want:

| Region | Price_101 | Price_102 |
|---|---|---|

instead of multiple vertical rows.
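One way to produce that layout is to add an ID statement. This is a sketch, assuming duplicates have already been removed so each Saree_ID appears once per Region; the output dataset name price_wide is illustrative:

```sas
/* Sketch: ID builds the new column names from Saree_ID values,
   and PREFIX= turns numeric IDs into valid names (Price_101, Price_102, ...) */
proc transpose data=sarees_clean out=price_wide prefix=Price_;
    by Region notsorted;
    id Saree_ID;
    var Price;
run;
```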
Key PROC TRANSPOSE Concepts
| Statement | Purpose |
|---|---|
| BY | grouping |
| VAR | variables to transpose |
| ID | column identifier |
| OUT | output dataset |
PROC SQL Alternative
proc sql;
create table saree_summary as
select Region,
avg(Price) as Avg_Price,
sum(Units_Sold) as Total_Sales
from sarees_clean
group by Region;
quit;
proc print data = saree_summary;
run;
OUTPUT:
| Obs | Region | Avg_Price | Total_Sales |
|---|---|---|---|
| 1 | France | 18000 | 70 |
| 2 | India | 27600 | 475 |
| 3 | Japan | 22000 | 65 |
| 4 | UAE | . | 55 |
| 5 | USA | 15000 | 40 |
| 6 | india | -15000 | 90 |
Explanation
PROC SQL offers relational-style aggregation.

Compared with the DATA STEP:

| DATA STEP | PROC SQL |
|---|---|
| row-wise processing | set-based processing |
| faster for sequential logic | easier for joins |
| flexible derivations | concise aggregation |
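For comparison, the same regional summary can be sketched with PROC MEANS instead of PROC SQL (the output dataset name saree_summary_means is illustrative):

```sas
/* Sketch: CLASS groups by Region; OUTPUT writes the statistics to a dataset.
   Note the extra _TYPE_=0 row holding the overall (ungrouped) summary. */
proc means data=sarees_clean noprint;
    class Region;
    var Price Units_Sold;
    output out=saree_summary_means
           mean(Price)=Avg_Price sum(Units_Sold)=Total_Sales;
run;
```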
4. The R Refinement Layer
library(dplyr)
library(tidyr)
library(stringr)
options(scipen = 999)
sarees_clean <- sarees_raw %>%
mutate(
Region = toupper(trimws(Region)),
Fabric = str_to_title(Fabric),
Designer = ifelse(Designer == "NULL" | is.na(Designer),"UNKNOWN",
Designer),
Price = ifelse(is.na(Price),NA,abs(Price)),
Launch_Date = ifelse(Launch_Date=="NULL",NA,Launch_Date),
Launch_Date = as.Date(Launch_Date,format="%d-%m-%Y"),
Rating = replace_na(Rating,4.0),
Saree_Name = ifelse(Saree_Name=="NULL","UNKNOWN",Saree_Name)
) %>%
distinct(Saree_ID,.keep_all=TRUE)
OUTPUT:
| | Saree_ID | Saree_Name | Region | Price | Launch_Date | Fabric | Designer | Rating | Units_Sold |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Kanchipuram Silk | INDIA | 25000 | 12-01-2024 | Silk | Ritu | 4.8 | 120 |
| 2 | 102 | Banarasi Royale | INDIA | 15000 | 15-02-2024 | Silk | UNKNOWN | 4.7 | 90 |
| 3 | 103 | Patola Heritage | INDIA | 32000 | NA | Cotton | Meera | 4.9 | 85 |
| 4 | 104 | Chiffon Elegance | FRANCE | 18000 | 22-03-2024 | Chiffon | Aria | 4 | 70 |
| 5 | 105 | UNKNOWN | JAPAN | 22000 | 10-04-2024 | Linen | Yuki | 4.5 | 65 |
| 6 | 106 | Bandhani Queen | UAE | NA | 05-05-2024 | Bandhani | Sara | 4.6 | 55 |
| 7 | 107 | Mysore Silk | INDIA | 27000 | 18-06-2024 | Silk | Kiran | 4.8 | 120 |
| 8 | 108 | Organza Dream | USA | 15000 | NA | Organza | Emily | 4.2 | 40 |
| 9 | 109 | Cotton Classic | INDIA | 9000 | 25-07-2024 | Cotton | UNKNOWN | 4 | 130 |
| 10 | 110 | Royal Tissue | INDIA | 45000 | 11-08-2024 | Tissue | Dev | 5 | 20 |
Explanation
This tidyverse pipeline modernizes the dataset using functional transformations. Note that as.Date() cannot parse the literal string "NULL" into a valid Date object, so we first convert it into a proper missing value (NA) before the date conversion.
Operations Performed
- mutate() → column transformations
- replace_na() → missing value replacement
- distinct() → duplicate removal
- trimws() → whitespace cleanup
- str_to_title() → text standardization
Logic Bridge
| SAS | R |
|---|---|
| DATA STEP | mutate() |
| IF-THEN | case_when() |
| PROC SORT NODUPKEY | distinct() |
| COALESCEC | replace_na() |
| STRIP | trimws() |
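The IF-THEN row of the bridge can be made concrete. Here is a sketch of the SAS Category derivation expressed in R, assuming the cleaned data frame from above:

```r
library(dplyr)

# Sketch: case_when() mirrors the SAS IF-THEN / SELECT-WHEN ladder;
# conditions are evaluated top-down and the first match wins.
# A missing Price falls through to the TRUE branch, matching the
# SAS behavior where missing sorts below all numbers.
sarees_clean <- sarees_clean %>%
  mutate(
    Category = case_when(
      Price > 30000 ~ "PREMIUM",
      Price > 15000 ~ "LUXURY",
      TRUE          ~ "STANDARD"
    )
  )
```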
PROC TRANSPOSE Equivalent in R
library(tidyr)
transpose_data <- sarees_clean %>%
pivot_wider(
names_from = Region,
values_from = Price
)
OUTPUT:
| | Saree_ID | Saree_Name | Launch_Date | Fabric | Designer | Rating | Units_Sold | INDIA | FRANCE | JAPAN | UAE | USA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Kanchipuram Silk | 12-01-2024 | Silk | Ritu | 4.8 | 120 | 25000 | NA | NA | NA | NA |
| 2 | 102 | Banarasi Royale | 15-02-2024 | Silk | UNKNOWN | 4.7 | 90 | 15000 | NA | NA | NA | NA |
| 3 | 103 | Patola Heritage | NA | Cotton | Meera | 4.9 | 85 | 32000 | NA | NA | NA | NA |
| 4 | 104 | Chiffon Elegance | 22-03-2024 | Chiffon | Aria | 4 | 70 | NA | 18000 | NA | NA | NA |
| 5 | 105 | UNKNOWN | 10-04-2024 | Linen | Yuki | 4.5 | 65 | NA | NA | 22000 | NA | NA |
| 6 | 106 | Bandhani Queen | 05-05-2024 | Bandhani | Sara | 4.6 | 55 | NA | NA | NA | NA | NA |
| 7 | 107 | Mysore Silk | 18-06-2024 | Silk | Kiran | 4.8 | 120 | 27000 | NA | NA | NA | NA |
| 8 | 108 | Organza Dream | NA | Organza | Emily | 4.2 | 40 | NA | NA | NA | NA | 15000 |
| 9 | 109 | Cotton Classic | 25-07-2024 | Cotton | UNKNOWN | 4 | 130 | 9000 | NA | NA | NA | NA |
| 10 | 110 | Royal Tissue | 11-08-2024 | Tissue | Dev | 5 | 20 | 45000 | NA | NA | NA | NA |
Explanation
pivot_wider() in R performs the same conceptual operation as PROC TRANSPOSE in SAS.

This is frequently used in:
- dashboard preparation
- reporting structures
- machine learning matrices
- sales comparison tables
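The inverse reshape is pivot_longer(). A sketch, assuming the transpose_data object created above (the name long_again is illustrative):

```r
library(tidyr)
library(dplyr)

# Sketch: stack the five region columns back into Region/Price pairs.
# Caution: values_drop_na removes the structural NAs the wide layout
# introduced, but it also drops rows whose original Price was missing.
long_again <- transpose_data %>%
  pivot_longer(
    cols = c(INDIA, FRANCE, JAPAN, UAE, USA),
    names_to = "Region",
    values_to = "Price",
    values_drop_na = TRUE
  )
```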
5. Business Logic & The “Why”
Imagine an AI-driven luxury textile marketplace. A single negative saree price accidentally remains uncorrected. The forecasting engine interprets this as a refund trend and reduces future procurement budgets.
Result?
- Millions in inventory losses
- Wrong investor reporting
- Supply-chain disruption
In pharmaceutical trials, a similar error can be catastrophic. If a missing dosage value is interpreted as zero rather than missing, patient safety analysis becomes invalid.

That is why business logic matters more than syntax.

Data cleaning is not cosmetic. It is risk management.
6. 20 Key Points of Implementation
- Always validate raw source files.
- Define variable lengths early.
- Standardize categorical text.
- Remove duplicates immediately.
- Never trust external vendor data blindly.
- Convert dates properly.
- Use audit-friendly naming conventions.
- Avoid hardcoding values repeatedly.
- Validate negative values carefully.
- Missing values are business signals.
- Use PROC CONTENTS frequently.
- Document every transformation.
- Prefer modular code blocks.
- Use PROC SQL for aggregations.
- Use DATA STEP for row logic.
- Validate transposed structures carefully.
- Maintain reproducibility standards.
- Test edge cases aggressively.
- Build scalable cleaning pipelines.
- Data governance is non-negotiable.
7. Extended Analysis in SAS
data saree_sales;
infile datalines dlm='|' dsd truncover;
input Saree_ID Region $ Price Units_Sold;
datalines;
101|INDIA|25000|120
102|USA|18000|90
103|JAPAN|32000|75
104|FRANCE|15000|60
105|UAE|28000|110
;
run;
proc print data=saree_sales;
run;
OUTPUT:
| Obs | Saree_ID | Region | Price | Units_Sold |
|---|---|---|---|---|
| 1 | 101 | INDIA | 25000 | 120 |
| 2 | 102 | USA | 18000 | 90 |
| 3 | 103 | JAPAN | 32000 | 75 |
| 4 | 104 | FRANCE | 15000 | 60 |
| 5 | 105 | UAE | 28000 | 110 |
/* High-value sales flag */
data saree_flags;
set saree_sales;
if Price > 30000 then High_Value='YES';
else High_Value='NO';
run;
proc print data=saree_flags;
run;
OUTPUT:
| Obs | Saree_ID | Region | Price | Units_Sold | High_Value |
|---|---|---|---|---|---|
| 1 | 101 | INDIA | 25000 | 120 | NO |
| 2 | 102 | USA | 18000 | 90 | NO |
| 3 | 103 | JAPAN | 32000 | 75 | YES |
| 4 | 104 | FRANCE | 15000 | 60 | NO |
| 5 | 105 | UAE | 28000 | 110 | NO |
/* Regional aggregation */
proc means data=saree_flags sum mean maxdec=2;
class Region;
var Price Units_Sold;
run;
OUTPUT:
The MEANS Procedure
| Region | N Obs | Variable | Sum | Mean |
|---|---|---|---|---|
| FRANCE | 1 | Price | 15000.00 | 15000.00 |
| | | Units_Sold | 60.00 | 60.00 |
| INDIA | 1 | Price | 25000.00 | 25000.00 |
| | | Units_Sold | 120.00 | 120.00 |
| JAPAN | 1 | Price | 32000.00 | 32000.00 |
| | | Units_Sold | 75.00 | 75.00 |
| UAE | 1 | Price | 28000.00 | 28000.00 |
| | | Units_Sold | 110.00 | 110.00 |
| USA | 1 | Price | 18000.00 | 18000.00 |
| | | Units_Sold | 90.00 | 90.00 |
/* Deduplication */
proc sort data=saree_flags nodupkey;
by Saree_ID;
run;
proc print data=saree_flags;
run;
OUTPUT:
| Obs | Saree_ID | Region | Price | Units_Sold | High_Value |
|---|---|---|---|---|---|
| 1 | 101 | INDIA | 25000 | 120 | NO |
| 2 | 102 | USA | 18000 | 90 | NO |
| 3 | 103 | JAPAN | 32000 | 75 | YES |
| 4 | 104 | FRANCE | 15000 | 60 | NO |
| 5 | 105 | UAE | 28000 | 110 | NO |
/* Reporting */
proc report data=saree_flags nowd;
column Region Price Units_Sold High_Value;
run;
OUTPUT:
| Region | Price | Units_Sold | High_Value |
|---|---|---|---|
| INDIA | 25000 | 120 | NO |
| USA | 18000 | 90 | NO |
| JAPAN | 32000 | 75 | YES |
| FRANCE | 15000 | 60 | NO |
| UAE | 28000 | 110 | NO |
Explanation
This phase demonstrates production-style SAS analytics.
Techniques Covered
- Flat-file ingestion
- Conditional flags
- Aggregation
- Reporting
- Deduplication
This architecture is common in:
- retail analytics
- SDTM validation
- banking risk systems
- insurance pipelines
8. 20 Additional Data Cleaning Best Practices
- Validate SDTM domains before mapping.
- Preserve raw datasets permanently.
- Never overwrite source data.
- Maintain Define.xml consistency.
- Use controlled terminology.
- Track derivation lineage.
- Validate subject IDs carefully.
- Perform reconciliation checks.
- Log all data corrections.
- Use metadata-driven programming.
- Validate date chronology.
- Ensure treatment consistency.
- Standardize null handling.
- Separate staging and production layers.
- Build reusable macros.
- Use QC programmers independently.
- Automate validation reports.
- Create audit-ready outputs.
- Maintain regulatory traceability.
- Test scalability on large datasets.
9. Business Logic Behind Data Cleaning
Data cleaning exists because business systems are imperfect. Missing values, unrealistic numbers, and inconsistent formats distort analytics and decision-making.

Suppose a patient’s age is entered as “-45” during a clinical trial. Without correction using functions like ABS(), statistical analysis becomes invalid. Similarly, a luxury saree priced at “-15000” can mislead profit calculations and revenue dashboards.

Replacing missing values is equally important. If sales quantity is missing and interpreted as zero, management may wrongly assume a product failed commercially. In pharmaceutical analytics, missing dosage values can invalidate efficacy analysis and create regulatory risks.

Date standardization is another critical area. Incorrect dates disrupt trend analysis, inventory planning, and longitudinal clinical studies.

Cleaning is not just about technical perfection. It directly impacts:
- forecasting
- compliance
- patient safety
- investor confidence
- operational strategy
Good analytics starts with trustworthy data.
10. 20 Sharp & Impactful Insights
- Dirty data creates false intelligence.
- Standardization improves reproducibility.
- Duplicates distort business reality.
- PROC TRANSPOSE simplifies reporting.
- Missing values require business interpretation.
- SAS excels in enterprise stability.
- R excels in flexible transformations.
- Validation prevents regulatory failures.
- Audit trails protect organizations.
- Metadata drives scalable programming.
- Small errors create massive losses.
- Date integrity is mission-critical.
- PROC SQL simplifies aggregation logic.
- DATA STEP gives procedural control.
- Text normalization improves joins.
- Consistency improves analytics trust.
- Good code is readable code.
- Automation reduces human error.
- Structured pipelines improve scalability.
- Data quality defines analytical quality.
11. Summary
SAS and R both provide exceptional frameworks for enterprise-grade data cleaning and transformation, but their strengths differ strategically.

SAS is renowned for reliability, governance, auditability, and enterprise stability. Its DATA STEP architecture enables row-level procedural control, while procedures such as PROC SORT, PROC SQL, and PROC TRANSPOSE provide optimized large-scale processing. SAS is heavily preferred in regulated industries like pharmaceuticals, banking, and insurance because of its reproducibility and validation ecosystem.

R, especially with tidyverse libraries, offers flexibility, readability, and modern transformation pipelines. Functions like mutate(), pivot_wider(), and case_when() enable elegant analytical workflows and rapid experimentation.
In this saree analytics project, we explored:
- dirty data simulation
- enterprise cleaning frameworks
- duplicate handling
- date conversion
- PROC TRANSPOSE reshaping
- SQL aggregation
- tidyverse refinement
The key lesson is simple: analytics quality depends entirely on data quality.

Whether analyzing luxury textile markets or clinical trial outcomes, structured cleaning pipelines ensure scalability, reliability, and trustworthy decision-making.
12. Conclusion
The journey from messy raw data to analytical intelligence is one of the most critical processes in modern data science. Organizations often invest millions into dashboards, AI systems, predictive analytics, and machine learning models, yet overlook the foundational requirement: clean, validated, structured data.

Through this project on the world’s most beautiful sarees, we explored how even a visually elegant business domain can suffer from severe data quality issues. Missing values, inconsistent text, duplicate records, negative prices, and invalid dates are not merely technical inconveniences; they are operational risks.
Using SAS, we demonstrated the power of DATA STEP programming, PROC SORT, PROC SQL, and especially PROC TRANSPOSE for reshaping business-ready datasets. PROC TRANSPOSE proved highly valuable for converting vertical transactional structures into wide analytical reporting formats.

Using R, we showcased the flexibility of tidyverse-based transformations, modern data wrangling pipelines, and elegant reshaping using pivot_wider().

More importantly, this project emphasized the business reasoning behind every transformation:
- Why missing values matter
- Why duplicates must be removed
- Why date integrity is critical
- Why standardization improves governance
In enterprise environments such as clinical trials, banking, and retail analytics, these transformations directly impact compliance, financial forecasting, patient safety, and executive decision-making.

Data cleaning is not a side activity. It is the engineering backbone of trustworthy analytics.

Organizations that build structured, scalable, and validated cleaning frameworks consistently outperform those relying on uncontrolled spreadsheets and ad-hoc transformations.

Clean data is not just technical excellence. It is strategic power.
13. Interview Questions & Answers
1. Why would you use PROC TRANSPOSE in SAS?
Answer:
PROC TRANSPOSE reshapes datasets from rows to columns or columns to rows. It is commonly used for reporting, dashboard preparation, and creating machine-learning-ready wide datasets.
2. Scenario: A dataset contains duplicate patient IDs. How would you handle it?
Answer:
Use:
proc sort data=patients nodupkey;
by patient_id;
run;
In R:
patients %>% distinct(patient_id, .keep_all = TRUE)
This ensures record uniqueness.
3. What is the difference between PROC SQL and DATA STEP?
Answer:
DATA STEP is procedural and ideal for row-level logic. PROC SQL is declarative and best for joins, aggregations, and relational operations.
4. How do you handle missing character values in SAS?
Answer:
coalescec(variable,"UNKNOWN")
This replaces missing character values safely.
5. Scenario: An invalid negative transaction amount exists in a financial dataset. What would you do?
Answer:
First validate whether negative values are business-valid (refunds/credits). If invalid, use:
Amount = abs(Amount);
Then document the correction in the audit trail for compliance and traceability.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only, using an illustrative saree dataset.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------