450. From Dead Mobile Brands to Clean Data Mastery: Using INPUT and PUT in SAS

Mastering INPUT vs PUT Functions in SAS with Real-World Error Handling: From Vanished Mobile Giants to Clean Data Insights

1. Introduction

Imagine you are working on a legacy telecom analytics project. Your manager hands you a dataset of mobile companies that once dominated the market: brands that were household names but no longer exist. Sounds simple, right?

But the moment you open the dataset, reality hits:

  • Company names are inconsistent (e.g., nokia, NOKIA, Nokia Inc.)
  • Launch years are missing or incorrectly stored as text
  • Some shutdown years are earlier than launch years
  • Duplicate entries exist
  • Revenue values are stored as characters instead of numeric

This is not just messy data; it is dangerous data.

In industries like clinical trials or financial analytics, such inconsistencies can lead to:

  • Incorrect statistical conclusions
  • Regulatory rejection
  • Business losses

This is where tools like SAS and R become powerful. They don’t just analyze data; they transform chaos into structured intelligence.

In this blog, we’ll build a flawed dataset of defunct mobile companies, intentionally inject errors, and then clean it step by step. Most importantly, we will deeply understand the INPUT vs PUT functions in SAS, which are often misunderstood but critical in real-world data transformation.

2. Raw Data Creation in SAS and R

SAS Raw Dataset (With Intentional Errors)

DATA mobile_raw;
INFILE DATALINES MISSOVER DSD DLM=" ";
LENGTH ID 8 Company $20 Country $15 Revenue $10
       Launch_Year $8 Shutdown_Year $8;
INPUT ID Company:$20. Launch_Year:$8. Shutdown_Year:$8.
      Revenue:$10. Country:$15. Employees Rating;
DATALINES;

1 Nokia 2000 2014 50000 Finland 100000 4.5
2 blackberry 1999 2016 30000 Canada 85000 4.2
3 SIEMENS 2001 2005 20000 Germany 60000 3.8
4 Motorola 1998 2011 40000 USA 90000 4.0
5 nokia 2000 2014 50000 Finland 100000 4.5
6 NULL 2005 2003 15000 India 30000 3.5
7 HTC 2008 . 25000 Taiwan 70000 4.1
8 LG 2003 2021 abc Korea 80000 4.3
9 "sony ericsson" 2002 2012 35000 Japan 75000 4.0
10 PANASONIC 1995 1990 45000 Japan 95000 3.9
11 lenovo 2010 2020 28000 China 65000 .
12  2004 2018 22000 USA 50000 3.7
13 Huawei 2012 2023 60000 China 120000 4.6
14 Blackberry 1999 2016 30000 Canada 85000 4.2
15 Siemens 2001 2005 20000 Germany 60000 3.8
;
RUN;

PROC PRINT DATA=mobile_raw;
RUN;

OUTPUT:

Obs  ID  Company        Country  Revenue  Launch_Year  Shutdown_Year  Employees  Rating
  1   1  Nokia          Finland  50000    2000         2014           100000     4.5
  2   2  blackberry     Canada   30000    1999         2016           85000      4.2
  3   3  SIEMENS        Germany  20000    2001         2005           60000      3.8
  4   4  Motorola       USA      40000    1998         2011           90000      4.0
  5   5  nokia          Finland  50000    2000         2014           100000     4.5
  6   6  NULL           India    15000    2005         2003           30000      3.5
  7   7  HTC            Taiwan   25000    2008                        70000      4.1
  8   8  LG             Korea    abc      2003         2021           80000      4.3
  9   9  sony ericsson  Japan    35000    2002         2012           75000      4.0
 10  10  PANASONIC      Japan    45000    1995         1990           95000      3.9
 11  11  lenovo         China    28000    2010         2020           65000      .
 12  12                 USA      22000    2004         2018           50000      3.7
 13  13  Huawei         China    60000    2012         2023           120000     4.6
 14  14  Blackberry     Canada   30000    1999         2016           85000      4.2
 15  15  Siemens        Germany  20000    2001         2005           60000      3.8

Explanation

This dataset is intentionally flawed to simulate real-world messy data. Key issues include:

  • Duplicate rows (Nokia, Blackberry, Siemens)
  • Character variables used for numeric fields (Revenue, Years)
  • Missing values (Shutdown Year, Rating)
  • Invalid values (Shutdown year before launch year)
  • Inconsistent text (uppercase/lowercase, NULL values)

Such issues frequently occur in legacy systems or clinical trial raw data (SDTM domains) where data originates from multiple sources. This dataset will help demonstrate how SAS handles transformation using INPUT (char → numeric) and PUT (numeric → char) functions effectively.

The INFILE statement combines DSD with DLM=" " (a space delimiter). Normally DSD is associated with comma-separated files, but when paired with an explicit DLM it tells SAS to honor quoted strings even in space-delimited data. This ensures "sony ericsson" is read as a single value instead of being split into two tokens; without DSD, SAS treats the quotation marks as ordinary characters in the data. This is a critical concept in real-world data ingestion, especially with external vendor data or clinical datasets where text fields may contain spaces, because it preserves column alignment and prevents cascading parsing errors.
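The effect of DSD can be verified with a minimal, self-contained sketch (the dataset and variable names here are illustrative, not part of the main example):

DATA dsd_demo;
    INFILE DATALINES DSD DLM=" ";
    LENGTH Brand $20 Country $15;
    INPUT Brand :$20. Country :$15.;
    DATALINES;
"sony ericsson" Japan
Nokia Finland
;
RUN;
/* With DSD, the first row reads Brand="sony ericsson", Country="Japan".
   Remove DSD and the quoted string splits on the space, misaligning
   every field that follows it. */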

R Code (Equivalent Dataset)

mobile_raw <- data.frame(
  ID = 1:15,
  Company = c("Nokia","blackberry","SIEMENS","Motorola","nokia",
              "NULL","HTC","LG","sony ericsson","PANASONIC","lenovo",
              "","Huawei","Blackberry","Siemens"),
  Launch_Year = c("2000","1999","2001","1998","2000","2005","2008",
                  ".","2002","1995","2010","2004","2012","1999","2001"),
  Shutdown_Year = c("2014","2016","2005","2011","2014","2003",NA,"2021",
                    "2012","1990","2020","2018","2023","2016","2005"),
  Revenue = c("50000","30000","20000","40000","50000","15000","25000",
              "abc","35000","45000","28000","22000","60000","30000",
              "20000"),
  Country = c("Finland","Canada","Germany","USA","Finland","India",
              "Taiwan","Korea","Japan","Japan","China","USA","China",
              "Canada","Germany"),
  Employees = c(100000,85000,60000,90000,100000,30000,70000,80000,75000,
                95000,65000,50000,120000,85000,60000),
  Rating = c(4.5,4.2,3.8,4.0,4.5,3.5,4.1,4.3,4.0,3.9,NA,3.7,4.6,4.2,3.8)
)

OUTPUT:

     ID  Company        Launch_Year  Shutdown_Year  Revenue  Country  Employees  Rating
1     1  Nokia          2000         2014           50000    Finland  100000     4.5
2     2  blackberry     1999         2016           30000    Canada   85000      4.2
3     3  SIEMENS        2001         2005           20000    Germany  60000      3.8
4     4  Motorola       1998         2011           40000    USA      90000      4.0
5     5  nokia          2000         2014           50000    Finland  100000     4.5
6     6  NULL           2005         2003           15000    India    30000      3.5
7     7  HTC            2008         NA             25000    Taiwan   70000      4.1
8     8  LG             .            2021           abc      Korea    80000      4.3
9     9  sony ericsson  2002         2012           35000    Japan    75000      4.0
10   10  PANASONIC      1995         1990           45000    Japan    95000      3.9
11   11  lenovo         2010         2020           28000    China    65000      NA
12   12                 2004         2018           22000    USA      50000      3.7
13   13  Huawei         2012         2023           60000    China    120000     4.6
14   14  Blackberry     1999         2016           30000    Canada   85000      4.2
15   15  Siemens        2001         2005           20000    Germany  60000      3.8

Explanation

  • The R dataset mirrors the SAS dataset with the same intentional issues.
  • R represents missing values as NA, while SAS uses a period (.).
  • Character inconsistencies and invalid numeric values (like "abc") will cause problems during type conversion.
  • In real-world pipelines, R is often used for exploratory analysis, while SAS is used for regulatory-compliant transformations.
  • Understanding both helps in cross-platform validation.

3. Phase 1: Data Cleaning in SAS

DATA mobile_clean1;
SET mobile_raw;

/* Handle missing company */
IF Company = "NULL" OR Company = "" THEN Company = "Unknown";

/* Standardize text */
Company = PROPCASE(Company);

/* Convert character to numeric */
Launch_YR = INPUT(Launch_Year, 4.);
Shutdown_YR = INPUT(Shutdown_Year, 4.);
Revenue_NUM = INPUT(Revenue, 8.);

/* Fix invalid years */
IF Shutdown_YR < Launch_YR THEN Shutdown_YR = .;

/* Handle missing values */
IF Rating = . THEN Rating = 3.5;

FORMAT Launch_YR Shutdown_YR 4.;
RUN;

PROC PRINT DATA=mobile_clean1;
RUN;

OUTPUT:

Obs  ID  Company        Country  Revenue  Launch_Year  Shutdown_Year  Employees  Rating  Launch_YR  Shutdown_YR  Revenue_NUM
  1   1  Nokia          Finland  50000    2000         2014           100000     4.5     2000       2014         50000
  2   2  Blackberry     Canada   30000    1999         2016           85000      4.2     1999       2016         30000
  3   3  Siemens        Germany  20000    2001         2005           60000      3.8     2001       2005         20000
  4   4  Motorola       USA      40000    1998         2011           90000      4.0     1998       2011         40000
  5   5  Nokia          Finland  50000    2000         2014           100000     4.5     2000       2014         50000
  6   6  Unknown        India    15000    2005         2003           30000      3.5     2005       .            15000
  7   7  Htc            Taiwan   25000    2008                        70000      4.1     2008       .            25000
  8   8  Lg             Korea    abc      2003         2021           80000      4.3     2003       2021         .
  9   9  Sony Ericsson  Japan    35000    2002         2012           75000      4.0     2002       2012         35000
 10  10  Panasonic      Japan    45000    1995         1990           95000      3.9     1995       .            45000
 11  11  Lenovo         China    28000    2010         2020           65000      3.5     2010       2020         28000
 12  12  Unknown        USA      22000    2004         2018           50000      3.7     2004       2018         22000
 13  13  Huawei         China    60000    2012         2023           120000     4.6     2012       2023         60000
 14  14  Blackberry     Canada   30000    1999         2016           85000      4.2     1999       2016         30000
 15  15  Siemens        Germany  20000    2001         2005           60000      3.8     2001       2005         20000

/* Remove duplicates */
PROC SORT DATA=mobile_clean1 NODUPKEY;
BY ID Company;
RUN;

PROC PRINT DATA=mobile_clean1;
RUN;

OUTPUT:

Obs  ID  Company        Country  Revenue  Launch_Year  Shutdown_Year  Employees  Rating  Launch_YR  Shutdown_YR  Revenue_NUM
  1   1  Nokia          Finland  50000    2000         2014           100000     4.5     2000       2014         50000
  2   2  Blackberry     Canada   30000    1999         2016           85000      4.2     1999       2016         30000
  3   3  Siemens        Germany  20000    2001         2005           60000      3.8     2001       2005         20000
  4   4  Motorola       USA      40000    1998         2011           90000      4.0     1998       2011         40000
  5   5  Nokia          Finland  50000    2000         2014           100000     4.5     2000       2014         50000
  6   6  Unknown        India    15000    2005         2003           30000      3.5     2005       .            15000
  7   7  Htc            Taiwan   25000    2008                        70000      4.1     2008       .            25000
  8   8  Lg             Korea    abc      2003         2021           80000      4.3     2003       2021         .
  9   9  Sony Ericsson  Japan    35000    2002         2012           75000      4.0     2002       2012         35000
 10  10  Panasonic      Japan    45000    1995         1990           95000      3.9     1995       .            45000
 11  11  Lenovo         China    28000    2010         2020           65000      3.5     2010       2020         28000
 12  12  Unknown        USA      22000    2004         2018           50000      3.7     2004       2018         22000
 13  13  Huawei         China    60000    2012         2023           120000     4.6     2012       2023         60000
 14  14  Blackberry     Canada   30000    1999         2016           85000      4.2     1999       2016         30000
 15  15  Siemens        Germany  20000    2001         2005           60000      3.8     2001       2005         20000

Explanation

This step introduces the core concept: the INPUT function. Since Launch_Year and Revenue are stored as characters, they must be converted to numeric before analysis. INPUT reads character data through an informat and returns a numeric value; without this, mathematical operations would fail. Text standardization with PROPCASE ensures consistency, which is critical for grouping operations. Invalid business logic (shutdown before launch) is corrected, and missing ratings are imputed. Finally, PROC SORT NODUPKEY is used to deduplicate, a technique widely used in clinical datasets to maintain subject-level uniqueness. One caveat: because ID is unique on every row, NODUPKEY BY ID Company removes nothing in this particular dataset (all 15 rows remain in the output above); to drop the repeated brands (Nokia, Blackberry, Siemens), you would sort BY Company, or BY Company Country, instead.
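Because the raw Revenue field contains the string "abc", the plain INPUT call above writes an "Invalid data" style note to the log and sets _ERROR_. A quieter error-handling sketch uses the ?? informat modifier, which suppresses those notes while still returning a missing value (the values below mirror this example but the step itself is illustrative):

DATA _null_;
    /* ?? suppresses the invalid-data log note and leaves _ERROR_ at 0 */
    bad  = INPUT("abc",   ?? 8.);  /* cannot be read: returns missing */
    good = INPUT("50000", ?? 8.);  /* reads normally: returns 50000   */
    PUT bad= good=;
RUN;

In production pipelines this is usually combined with an explicit check (e.g., flagging rows where the converted value is missing but the source string is not) so that bad values are reported rather than silently dropped.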

4. Phase 2: Data Cleaning in R

library(dplyr)

mobile_clean <- mobile_raw %>%
  mutate(
    # Clean Company (combined)
    Company = tools::toTitleCase(trimws(tolower(
      ifelse(Company == "NULL" | Company == "", "Unknown", Company)
    ))),
    # Create NEW cleaned columns
    Launch_YR = as.numeric(ifelse(Launch_Year %in% c(".", "", "NA"), NA,
                                  Launch_Year)),
    Shutdown_YR = as.numeric(ifelse(Shutdown_Year %in% c(".", "", "NA"),
                                    NA, Shutdown_Year)),
    Revenue_NUM = as.numeric(ifelse(Revenue %in% c("abc", "", "NA"),
                                    NA, Revenue)),
    # Handle missing rating
    Rating = ifelse(is.na(Rating), 3.5, Rating)
  ) %>%
  filter(is.na(Shutdown_YR) | Shutdown_YR >= Launch_YR) %>%
  distinct(ID, Company, .keep_all = TRUE)

OUTPUT:

     ID  Company        Launch_Year  Shutdown_Year  Revenue  Country  Employees  Rating  Launch_YR  Shutdown_YR  Revenue_NUM
1     1  Nokia          2000         2014           50000    Finland  100000     4.5     2000       2014         50000
2     2  Blackberry     1999         2016           30000    Canada   85000      4.2     1999       2016         30000
3     3  Siemens        2001         2005           20000    Germany  60000      3.8     2001       2005         20000
4     4  Motorola       1998         2011           40000    USA      90000      4.0     1998       2011         40000
5     5  Nokia          2000         2014           50000    Finland  100000     4.5     2000       2014         50000
6     7  Htc            2008         NA             25000    Taiwan   70000      4.1     2008       NA           25000
7     9  Sony Ericsson  2002         2012           35000    Japan    75000      4.0     2002       2012         35000
8    11  Lenovo         2010         2020           28000    China    65000      3.5     2010       2020         28000
9    12  Unknown        2004         2018           22000    USA      50000      3.7     2004       2018         22000
10   13  Huawei         2012         2023           60000    China    120000     4.6     2012       2023         60000
11   14  Blackberry     1999         2016           30000    Canada   85000      4.2     1999       2016         30000
12   15  Siemens        2001         2005           20000    Germany  60000      3.8     2001       2005         20000









Explanation

R uses as.numeric() instead of INPUT. The pipeline approach using dplyr makes transformations readable and modular: mutate() and filter() replicate SAS DATA-step logic, and distinct() removes duplicates much like PROC SORT NODUPKEY. However, R does not enforce strict typing the way SAS does, so as.numeric() silently coerces invalid text to NA with only a warning. This makes SAS more reliable in regulated environments such as clinical trials (ADaM datasets).

It is better to create new derived variables (e.g., Launch_YR, Revenue_NUM) than to overwrite the originals. Chaining the transformations into a single pipeline improves readability and efficiency, ensures reproducibility, avoids data loss, and aligns with industry-standard data-cleaning practice.

5. Phase 3: Additional SAS Transformations (Using PUT)

DATA mobile_final;
SET mobile_clean1;

/* Convert numeric to character */
Launch_Char = PUT(Launch_YR, 4.);
Revenue_Char = PUT(Revenue_NUM, 8.);

/* Create derived variable */
Company_Label = CATX('-', Company, Country);

/* Conditional flag */
LENGTH Revenue_Flag $10;
IF Revenue_NUM > 40000 THEN Revenue_Flag = "High";
ELSE Revenue_Flag = "Normal";

KEEP ID Company Country Employees Rating
     Launch_YR Shutdown_YR Revenue_NUM Launch_Char
     Revenue_Char Company_Label Revenue_Flag;
RUN;

PROC PRINT DATA=mobile_final;
RUN;

OUTPUT:

Obs  ID  Company        Country  Employees  Rating  Launch_YR  Shutdown_YR  Revenue_NUM  Launch_Char  Revenue_Char  Company_Label        Revenue_Flag
  1   1  Nokia          Finland  100000     4.5     2000       2014         50000        2000         50000         Nokia-Finland        High
  2   2  Blackberry     Canada   85000      4.2     1999       2016         30000        1999         30000         Blackberry-Canada    Normal
  3   3  Siemens        Germany  60000      3.8     2001       2005         20000        2001         20000         Siemens-Germany      Normal
  4   4  Motorola       USA      90000      4.0     1998       2011         40000        1998         40000         Motorola-USA         Normal
  5   5  Nokia          Finland  100000     4.5     2000       2014         50000        2000         50000         Nokia-Finland        High
  6   6  Unknown        India    30000      3.5     2005       .            15000        2005         15000         Unknown-India        Normal
  7   7  Htc            Taiwan   70000      4.1     2008       .            25000        2008         25000         Htc-Taiwan           Normal
  8   8  Lg             Korea    80000      4.3     2003       2021         .            2003         .             Lg-Korea             Normal
  9   9  Sony Ericsson  Japan    75000      4.0     2002       2012         35000        2002         35000         Sony Ericsson-Japan  Normal
 10  10  Panasonic      Japan    95000      3.9     1995       .            45000        1995         45000         Panasonic-Japan      High
 11  11  Lenovo         China    65000      3.5     2010       2020         28000        2010         28000         Lenovo-China         Normal
 12  12  Unknown        USA      50000      3.7     2004       2018         22000        2004         22000         Unknown-USA          Normal
 13  13  Huawei         China    120000     4.6     2012       2023         60000        2012         60000         Huawei-China         High
 14  14  Blackberry     Canada   85000      4.2     1999       2016         30000        1999         30000         Blackberry-Canada    Normal
 15  15  Siemens        Germany  60000      3.8     2001       2005         20000        2001         20000         Siemens-Germany      Normal

Explanation

Here we use the PUT function, which is the opposite of INPUT. PUT converts numeric values into character format using formats. This is essential when creating reports, labels, or exporting data. For example, regulatory submission datasets often require character representations. The distinction is crucial:

  • INPUT → Character to Numeric
  • PUT → Numeric to Character

Misusing them can cause major bugs in production pipelines. The creation of derived variables like Company_Label and Revenue_Flag demonstrates real-world ADaM derivation logic.
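Beyond plain widths like 4. and 8., PUT accepts any SAS format, which is what makes it so useful for reporting. A small sketch (the values and variable names here are illustrative, not part of the main dataset):

DATA _null_;
    Revenue_NUM = 50000;
    /* Apply display formats while converting numeric to character */
    Rev_Comma  = PUT(Revenue_NUM, COMMA10.);   /* text containing 50,000  */
    Rev_Dollar = PUT(Revenue_NUM, DOLLAR12.);  /* text containing $50,000 */
    PUT Rev_Comma= Rev_Dollar=;
RUN;

Note that PUT right-aligns its result within the format width, so a LEFT() or STRIP() around the call is common when building labels.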

6. 20 Additional Data Cleaning Best Practices

  1. Always validate raw data against metadata
  2. Maintain audit trails for every transformation
  3. Never overwrite original datasets
  4. Use controlled terminology (CDISC standards)
  5. Validate date formats strictly
  6. Check for duplicate primary keys
  7. Use macros for reusable cleaning logic
  8. Document assumptions clearly
  9. Apply range checks on numeric variables
  10. Perform cross-variable validation
  11. Handle missing values consistently
  12. Use PROC COMPARE for validation
  13. Validate against SAP specifications
  14. Ensure traceability from SDTM to ADaM
  15. Avoid hardcoding values
  16. Use formats for standardization
  17. Log warnings and errors
  18. Perform peer review of code
  19. Automate QC checks
  20. Maintain version control
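Practices 12 (PROC COMPARE) and 19 (automated QC) often go together in double-programming workflows. A minimal sketch, assuming a second, independently derived dataset named qc_mobile_final exists (that name is hypothetical, for illustration only):

/* Double-programming QC check: compare production vs. QC dataset.
   qc_mobile_final is an assumed name for this sketch. */
PROC SORT DATA=mobile_final    OUT=prod; BY ID; RUN;
PROC SORT DATA=qc_mobile_final OUT=qc;   BY ID; RUN;

PROC COMPARE BASE=prod COMPARE=qc LISTALL;
    ID ID;
RUN;

A clean run reports no unequal values; any discrepancy is listed variable by variable, which is exactly the audit trail regulators expect.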

7. Business Logic Behind Data Cleaning

Data cleaning is not cosmetic; it is analytical survival.

  • Missing values are replaced to prevent bias in statistical models
  • Invalid values (like negative age) are corrected to maintain realism
  • Date corrections ensure chronological integrity

Example:
If a patient’s age is recorded as a negative value, it must be corrected or excluded; otherwise survival analysis becomes invalid.

Similarly, in our dataset:
If shutdown year < launch year → logical inconsistency → must be fixed.

8. 20 Key Insights (Sharp Points)

  • Dirty data leads to wrong conclusions
  • INPUT enables numeric computation
  • PUT enables reporting flexibility
  • Standardization improves joins
  • Missing values distort models
  • Duplicate records inflate results
  • Validation ensures trust
  • SAS enforces strict typing
  • R offers flexibility
  • Business rules drive cleaning
  • Audit trails ensure compliance
  • Clinical data must be traceable
  • Incorrect dates break timelines
  • Text inconsistency breaks grouping
  • Automation reduces human error
  • QC checks prevent failures
  • Data cleaning is iterative
  • Documentation is mandatory
  • Reproducibility is critical
  • Clean data drives decisions

9. Summary (SAS vs R)

Feature         SAS          R
Data Control    Strong       Flexible
Regulatory Use  High         Moderate
Typing          Strict       Loose
Functions       INPUT / PUT  as.numeric() / as.character()
Scalability     High         High

10. Conclusion

Data cleaning is not a preliminary step; it is the foundation of trustworthy analytics. Whether you are working with telecom datasets or clinical trials, mastering tools like INPUT and PUT in SAS ensures that your data is not just usable, but reliable. A structured approach transforms messy datasets into decision-ready assets.

11. Interview Questions

Q1: Difference between INPUT and PUT?
👉 INPUT converts character to numeric; PUT converts numeric to character.

Q2: What happens if INPUT is not used?
👉 SAS treats values as character → calculations fail.

Q3: How to remove duplicates in SAS?
👉 Use PROC SORT NODUPKEY.

Q4: How do you handle invalid dates?
👉 Use conditions and validate against business logic.

Q5: R equivalent of INPUT?
👉 as.numeric() or as.Date() depending on context.
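A two-line illustration of Q1, runnable in any SAS session:

DATA _null_;
    n = INPUT("2014", 4.);  /* character "2014" -> numeric 2014   */
    c = PUT(2014, 4.);      /* numeric 2014    -> character "2014" */
    PUT n= c=;
RUN;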

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only, using fictional data about vanished mobile brands.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

  • Students learning SAS
  • Data analysts building portfolios
  • Professionals preparing for SAS interviews
  • Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1. Can SAS’s Precision and R’s Flexibility Together Eliminate Hidden Errors in Orbital Debris Data Analytics?

2. Can We Build an Accurate Product Demand Forecasting & Fraud Detection System in SAS While Identifying and Fixing Intentional Errors?

3. Can SAS Identify the Most Efficient Waste Collection Routes in a City?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
