450. From Dead Mobile Brands to Clean Data Mastery: Using INPUT and PUT in SAS

Mastering INPUT vs PUT Functions in SAS with Real-World Error Handling: From Vanished Mobile Giants to Clean Data Insights

1. Introduction

Imagine you are working on a legacy telecom analytics project. Your manager hands you a dataset of mobile companies that once dominated the market: brands that were household names but no longer exist. Sounds simple, right?

But the moment you open the dataset, reality hits:

  • Company names are inconsistent (e.g., nokia, NOKIA, Nokia Inc.)
  • Launch years are missing or incorrectly stored as text
  • Some shutdown years are earlier than launch years
  • Duplicate entries exist
  • Revenue values are stored as characters instead of numeric

This is not just messy data; it is dangerous data.

In industries like clinical trials or financial analytics, such inconsistencies can lead to:

  • Incorrect statistical conclusions
  • Regulatory rejection
  • Business losses

This is where tools like SAS and R become powerful. They don’t just analyze data; they transform chaos into structured intelligence.

In this blog, we’ll build a flawed dataset of defunct mobile companies, intentionally inject errors, and then clean it step by step. Most importantly, we will deeply understand the INPUT vs PUT functions in SAS, which are often misunderstood but critical in real-world data transformation.

2. Raw Data Creation in SAS and R

SAS Raw Dataset (With Intentional Errors)

DATA mobile_raw;
INFILE DATALINES MISSOVER DSD DLM=" ";
LENGTH ID 8 Company $20 Country $15 Revenue $10
       Launch_Year $8 Shutdown_Year $8;
INPUT ID Company:$20. Launch_Year:$8. Shutdown_Year:$8.
      Revenue:$10. Country:$15. Employees Rating;
DATALINES;

1 Nokia 2000 2014 50000 Finland 100000 4.5
2 blackberry 1999 2016 30000 Canada 85000 4.2
3 SIEMENS 2001 2005 20000 Germany 60000 3.8
4 Motorola 1998 2011 40000 USA 90000 4.0
5 nokia 2000 2014 50000 Finland 100000 4.5
6 NULL 2005 2003 15000 India 30000 3.5
7 HTC 2008 . 25000 Taiwan 70000 4.1
8 LG 2003 2021 abc Korea 80000 4.3
9 "sony ericsson" 2002 2012 35000 Japan 75000 4.0
10 PANASONIC 1995 1990 45000 Japan 95000 3.9
11 lenovo 2010 2020 28000 China 65000 .
12  2004 2018 22000 USA 50000 3.7
13 Huawei 2012 2023 60000 China 120000 4.6
14 Blackberry 1999 2016 30000 Canada 85000 4.2
15 Siemens 2001 2005 20000 Germany 60000 3.8
;
RUN;

PROC PRINT DATA=mobile_raw;
RUN;

OUTPUT:

Obs  ID  Company        Country  Revenue  Launch_Year  Shutdown_Year  Employees  Rating
  1   1  Nokia          Finland  50000    2000         2014           100000     4.5
  2   2  blackberry     Canada   30000    1999         2016           85000      4.2
  3   3  SIEMENS        Germany  20000    2001         2005           60000      3.8
  4   4  Motorola       USA      40000    1998         2011           90000      4.0
  5   5  nokia          Finland  50000    2000         2014           100000     4.5
  6   6  NULL           India    15000    2005         2003           30000      3.5
  7   7  HTC            Taiwan   25000    2008                        70000      4.1
  8   8  LG             Korea    abc      2003         2021           80000      4.3
  9   9  sony ericsson  Japan    35000    2002         2012           75000      4.0
 10  10  PANASONIC      Japan    45000    1995         1990           95000      3.9
 11  11  lenovo         China    28000    2010         2020           65000      .
 12  12                 USA      22000    2004         2018           50000      3.7
 13  13  Huawei         China    60000    2012         2023           120000     4.6
 14  14  Blackberry     Canada   30000    1999         2016           85000      4.2
 15  15  Siemens        Germany  20000    2001         2005           60000      3.8

Explanation

This dataset is intentionally flawed to simulate real-world messy data. Key issues include:

  • Duplicate rows (Nokia, Blackberry, Siemens)
  • Character variables used for numeric fields (Revenue, Years)
  • Missing values (Shutdown Year, Rating)
  • Invalid values (Shutdown year before launch year)
  • Inconsistent text (uppercase/lowercase, NULL values)

Such issues frequently occur in legacy systems or clinical trial raw data (SDTM domains) where data originates from multiple sources. This dataset will help demonstrate how SAS handles transformation using INPUT (char → numeric) and PUT (numeric → char) functions effectively.

The INFILE statement combines DSD with DLM=" " (a space delimiter). Normally DSD is associated with comma-separated files, but when paired with an explicit DLM it tells SAS to honor quoted strings even in space-delimited data. This ensures "sony ericsson" is read as a single value instead of being split into two tokens; without DSD, SAS treats the quotation marks as ordinary characters in the data. This is a critical concept in real-world data ingestion, especially with external vendor data or clinical datasets where text fields may contain spaces, because it preserves column alignment and prevents cascading parsing errors.
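The effect of DSD can be verified with a minimal, self-contained sketch (the dataset and variable names here are illustrative, not part of the main example):

DATA dsd_demo;
    INFILE DATALINES DSD DLM=" ";
    LENGTH Brand $20 Country $15;
    INPUT Brand :$20. Country :$15.;
    DATALINES;
"sony ericsson" Japan
Nokia Finland
;
RUN;
/* With DSD, the first row reads Brand="sony ericsson", Country="Japan".
   Remove DSD and the quoted string splits on the space, misaligning
   every field that follows it. */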

R Code (Equivalent Dataset)

mobile_raw <- data.frame(
  ID = 1:15,
  Company = c("Nokia","blackberry","SIEMENS","Motorola","nokia",
              "NULL","HTC","LG","sony ericsson","PANASONIC","lenovo",
              "","Huawei","Blackberry","Siemens"),
  Launch_Year = c("2000","1999","2001","1998","2000","2005","2008",
                  ".","2002","1995","2010","2004","2012","1999","2001"),
  Shutdown_Year = c("2014","2016","2005","2011","2014","2003",NA,"2021",
                    "2012","1990","2020","2018","2023","2016","2005"),
  Revenue = c("50000","30000","20000","40000","50000","15000","25000",
              "abc","35000","45000","28000","22000","60000","30000",
              "20000"),
  Country = c("Finland","Canada","Germany","USA","Finland","India",
              "Taiwan","Korea","Japan","Japan","China","USA","China",
              "Canada","Germany"),
  Employees = c(100000,85000,60000,90000,100000,30000,70000,80000,75000,
                95000,65000,50000,120000,85000,60000),
  Rating = c(4.5,4.2,3.8,4.0,4.5,3.5,4.1,4.3,4.0,3.9,NA,3.7,4.6,4.2,3.8)
)

OUTPUT:

     ID  Company        Launch_Year  Shutdown_Year  Revenue  Country  Employees  Rating
1     1  Nokia          2000         2014           50000    Finland  100000     4.5
2     2  blackberry     1999         2016           30000    Canada   85000      4.2
3     3  SIEMENS        2001         2005           20000    Germany  60000      3.8
4     4  Motorola       1998         2011           40000    USA      90000      4.0
5     5  nokia          2000         2014           50000    Finland  100000     4.5
6     6  NULL           2005         2003           15000    India    30000      3.5
7     7  HTC            2008         NA             25000    Taiwan   70000      4.1
8     8  LG             .            2021           abc      Korea    80000      4.3
9     9  sony ericsson  2002         2012           35000    Japan    75000      4.0
10   10  PANASONIC      1995         1990           45000    Japan    95000      3.9
11   11  lenovo         2010         2020           28000    China    65000      NA
12   12                 2004         2018           22000    USA      50000      3.7
13   13  Huawei         2012         2023           60000    China    120000     4.6
14   14  Blackberry     1999         2016           30000    Canada   85000      4.2
15   15  Siemens        2001         2005           20000    Germany  60000      3.8

Explanation

  • The R dataset mirrors the SAS dataset with the same intentional issues.
  • R represents missing values as NA, while SAS uses a period (.).
  • Character inconsistencies and invalid numeric values (like "abc") will cause problems during type conversion.
  • In real-world pipelines, R is often used for exploratory analysis, while SAS is used for regulatory-compliant transformations.
  • Understanding both helps in cross-platform validation.

3. Phase 1: Data Cleaning in SAS

DATA mobile_clean1;
SET mobile_raw;

/* Handle missing company */
IF Company = "NULL" OR Company = "" THEN Company = "Unknown";

/* Standardize text */
Company = PROPCASE(Company);

/* Convert character to numeric */
Launch_YR = INPUT(Launch_Year, 4.);
Shutdown_YR = INPUT(Shutdown_Year, 4.);
Revenue_NUM = INPUT(Revenue, 8.);

/* Fix invalid years */
IF Shutdown_YR < Launch_YR THEN Shutdown_YR = .;

/* Handle missing values */
IF Rating = . THEN Rating = 3.5;

FORMAT Launch_YR Shutdown_YR 4.;
RUN;

PROC PRINT DATA=mobile_clean1;
RUN;

OUTPUT:

Obs  ID  Company        Country  Revenue  Launch_Year  Shutdown_Year  Employees  Rating  Launch_YR  Shutdown_YR  Revenue_NUM
  1   1  Nokia          Finland  50000    2000         2014           100000     4.5     2000       2014         50000
  2   2  Blackberry     Canada   30000    1999         2016           85000      4.2     1999       2016         30000
  3   3  Siemens        Germany  20000    2001         2005           60000      3.8     2001       2005         20000
  4   4  Motorola       USA      40000    1998         2011           90000      4.0     1998       2011         40000
  5   5  Nokia          Finland  50000    2000         2014           100000     4.5     2000       2014         50000
  6   6  Unknown        India    15000    2005         2003           30000      3.5     2005       .            15000
  7   7  Htc            Taiwan   25000    2008                        70000      4.1     2008       .            25000
  8   8  Lg             Korea    abc      2003         2021           80000      4.3     2003       2021         .
  9   9  Sony Ericsson  Japan    35000    2002         2012           75000      4.0     2002       2012         35000
 10  10  Panasonic      Japan    45000    1995         1990           95000      3.9     1995       .            45000
 11  11  Lenovo         China    28000    2010         2020           65000      3.5     2010       2020         28000
 12  12  Unknown        USA      22000    2004         2018           50000      3.7     2004       2018         22000
 13  13  Huawei         China    60000    2012         2023           120000     4.6     2012       2023         60000
 14  14  Blackberry     Canada   30000    1999         2016           85000      4.2     1999       2016         30000
 15  15  Siemens        Germany  20000    2001         2005           60000      3.8     2001       2005         20000

/* Remove duplicates */
PROC SORT DATA=mobile_clean1 NODUPKEY;
BY ID Company;
RUN;

PROC PRINT DATA=mobile_clean1;
RUN;

OUTPUT:

Obs  ID  Company        Country  Revenue  Launch_Year  Shutdown_Year  Employees  Rating  Launch_YR  Shutdown_YR  Revenue_NUM
  1   1  Nokia          Finland  50000    2000         2014           100000     4.5     2000       2014         50000
  2   2  Blackberry     Canada   30000    1999         2016           85000      4.2     1999       2016         30000
  3   3  Siemens        Germany  20000    2001         2005           60000      3.8     2001       2005         20000
  4   4  Motorola       USA      40000    1998         2011           90000      4.0     1998       2011         40000
  5   5  Nokia          Finland  50000    2000         2014           100000     4.5     2000       2014         50000
  6   6  Unknown        India    15000    2005         2003           30000      3.5     2005       .            15000
  7   7  Htc            Taiwan   25000    2008                        70000      4.1     2008       .            25000
  8   8  Lg             Korea    abc      2003         2021           80000      4.3     2003       2021         .
  9   9  Sony Ericsson  Japan    35000    2002         2012           75000      4.0     2002       2012         35000
 10  10  Panasonic      Japan    45000    1995         1990           95000      3.9     1995       .            45000
 11  11  Lenovo         China    28000    2010         2020           65000      3.5     2010       2020         28000
 12  12  Unknown        USA      22000    2004         2018           50000      3.7     2004       2018         22000
 13  13  Huawei         China    60000    2012         2023           120000     4.6     2012       2023         60000
 14  14  Blackberry     Canada   30000    1999         2016           85000      4.2     1999       2016         30000
 15  15  Siemens        Germany  20000    2001         2005           60000      3.8     2001       2005         20000

Explanation

This step introduces the core concept: the INPUT function. Since Launch_Year and Revenue are stored as characters, they must be converted to numeric before analysis. INPUT reads character data through an informat and returns a numeric value; without this, mathematical operations would fail. Text standardization with PROPCASE ensures consistency, which is critical for grouping operations. Invalid business logic (shutdown before launch) is corrected, and missing ratings are imputed. Finally, PROC SORT NODUPKEY is used to deduplicate, a technique widely used in clinical datasets to maintain subject-level uniqueness. One caveat: because ID is unique on every row, NODUPKEY BY ID Company removes nothing in this particular dataset (all 15 rows remain in the output above); to drop the repeated brands (Nokia, Blackberry, Siemens), you would sort BY Company, or BY Company Country, instead.
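Because the raw Revenue field contains the string "abc", the plain INPUT call above writes an "Invalid data" style note to the log and sets _ERROR_. A quieter error-handling sketch uses the ?? informat modifier, which suppresses those notes while still returning a missing value (the values below mirror this example but the step itself is illustrative):

DATA _null_;
    /* ?? suppresses the invalid-data log note and leaves _ERROR_ at 0 */
    bad  = INPUT("abc",   ?? 8.);  /* cannot be read: returns missing */
    good = INPUT("50000", ?? 8.);  /* reads normally: returns 50000   */
    PUT bad= good=;
RUN;

In production pipelines this is usually combined with an explicit check (e.g., flagging rows where the converted value is missing but the source string is not) so that bad values are reported rather than silently dropped.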

4. Phase 2: Data Cleaning in R

library(dplyr)

mobile_clean <- mobile_raw %>%
  mutate(
    # Clean Company (combined)
    Company = tools::toTitleCase(trimws(tolower(
      ifelse(Company == "NULL" | Company == "", "Unknown", Company)
    ))),
    # Create NEW cleaned columns
    Launch_YR = as.numeric(ifelse(Launch_Year %in% c(".", "", "NA"), NA,
                                  Launch_Year)),
    Shutdown_YR = as.numeric(ifelse(Shutdown_Year %in% c(".", "", "NA"),
                                    NA, Shutdown_Year)),
    Revenue_NUM = as.numeric(ifelse(Revenue %in% c("abc", "", "NA"),
                                    NA, Revenue)),
    # Handle missing rating
    Rating = ifelse(is.na(Rating), 3.5, Rating)
  ) %>%
  filter(is.na(Shutdown_YR) | Shutdown_YR >= Launch_YR) %>%
  distinct(ID, Company, .keep_all = TRUE)

OUTPUT:

     ID  Company        Launch_Year  Shutdown_Year  Revenue  Country  Employees  Rating  Launch_YR  Shutdown_YR  Revenue_NUM
1     1  Nokia          2000         2014           50000    Finland  100000     4.5     2000       2014         50000
2     2  Blackberry     1999         2016           30000    Canada   85000      4.2     1999       2016         30000
3     3  Siemens        2001         2005           20000    Germany  60000      3.8     2001       2005         20000
4     4  Motorola       1998         2011           40000    USA      90000      4.0     1998       2011         40000
5     5  Nokia          2000         2014           50000    Finland  100000     4.5     2000       2014         50000
6     7  Htc            2008         NA             25000    Taiwan   70000      4.1     2008       NA           25000
7     9  Sony Ericsson  2002         2012           35000    Japan    75000      4.0     2002       2012         35000
8    11  Lenovo         2010         2020           28000    China    65000      3.5     2010       2020         28000
9    12  Unknown        2004         2018           22000    USA      50000      3.7     2004       2018         22000
10   13  Huawei         2012         2023           60000    China    120000     4.6     2012       2023         60000
11   14  Blackberry     1999         2016           30000    Canada   85000      4.2     1999       2016         30000
12   15  Siemens        2001         2005           20000    Germany  60000      3.8     2001       2005         20000









Explanation

R uses as.numeric() instead of INPUT. The pipeline approach using dplyr makes transformations readable and modular: mutate() and filter() replicate SAS DATA-step logic, and distinct() removes duplicates much like PROC SORT NODUPKEY. However, R does not enforce strict typing the way SAS does, so as.numeric() silently coerces invalid text to NA with only a warning. This makes SAS more reliable in regulated environments such as clinical trials (ADaM datasets).

It is better to create new derived variables (e.g., Launch_YR, Revenue_NUM) than to overwrite the originals. Chaining the transformations into a single pipeline improves readability and efficiency, ensures reproducibility, avoids data loss, and aligns with industry-standard data-cleaning practice.

5. Phase 3: Additional SAS Transformations (Using PUT)

DATA mobile_final;
SET mobile_clean1;

/* Convert numeric to character */
Launch_Char = PUT(Launch_YR, 4.);
Revenue_Char = PUT(Revenue_NUM, 8.);

/* Create derived variable */
Company_Label = CATX('-', Company, Country);

/* Conditional flag */
LENGTH Revenue_Flag $10;
IF Revenue_NUM > 40000 THEN Revenue_Flag = "High";
ELSE Revenue_Flag = "Normal";

KEEP ID Company Country Employees Rating
     Launch_YR Shutdown_YR Revenue_NUM Launch_Char
     Revenue_Char Company_Label Revenue_Flag;
RUN;

PROC PRINT DATA=mobile_final;
RUN;

OUTPUT:

Obs  ID  Company        Country  Employees  Rating  Launch_YR  Shutdown_YR  Revenue_NUM  Launch_Char  Revenue_Char  Company_Label        Revenue_Flag
  1   1  Nokia          Finland  100000     4.5     2000       2014         50000        2000         50000         Nokia-Finland        High
  2   2  Blackberry     Canada   85000      4.2     1999       2016         30000        1999         30000         Blackberry-Canada    Normal
  3   3  Siemens        Germany  60000      3.8     2001       2005         20000        2001         20000         Siemens-Germany      Normal
  4   4  Motorola       USA      90000      4.0     1998       2011         40000        1998         40000         Motorola-USA         Normal
  5   5  Nokia          Finland  100000     4.5     2000       2014         50000        2000         50000         Nokia-Finland        High
  6   6  Unknown        India    30000      3.5     2005       .            15000        2005         15000         Unknown-India        Normal
  7   7  Htc            Taiwan   70000      4.1     2008       .            25000        2008         25000         Htc-Taiwan           Normal
  8   8  Lg             Korea    80000      4.3     2003       2021         .            2003         .             Lg-Korea             Normal
  9   9  Sony Ericsson  Japan    75000      4.0     2002       2012         35000        2002         35000         Sony Ericsson-Japan  Normal
 10  10  Panasonic      Japan    95000      3.9     1995       .            45000        1995         45000         Panasonic-Japan      High
 11  11  Lenovo         China    65000      3.5     2010       2020         28000        2010         28000         Lenovo-China         Normal
 12  12  Unknown        USA      50000      3.7     2004       2018         22000        2004         22000         Unknown-USA          Normal
 13  13  Huawei         China    120000     4.6     2012       2023         60000        2012         60000         Huawei-China         High
 14  14  Blackberry     Canada   85000      4.2     1999       2016         30000        1999         30000         Blackberry-Canada    Normal
 15  15  Siemens        Germany  60000      3.8     2001       2005         20000        2001         20000         Siemens-Germany      Normal

Explanation

Here we use the PUT function, which is the opposite of INPUT. PUT converts numeric values into character format using formats. This is essential when creating reports, labels, or exporting data. For example, regulatory submission datasets often require character representations. The distinction is crucial:

  • INPUT → Character to Numeric
  • PUT → Numeric to Character

Misusing them can cause major bugs in production pipelines. The creation of derived variables like Company_Label and Revenue_Flag demonstrates real-world ADaM derivation logic.
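Beyond plain widths like 4. and 8., PUT accepts any SAS format, which is what makes it so useful for reporting. A small sketch (the values and variable names here are illustrative, not part of the main dataset):

DATA _null_;
    Revenue_NUM = 50000;
    /* Apply display formats while converting numeric to character */
    Rev_Comma  = PUT(Revenue_NUM, COMMA10.);   /* text containing 50,000  */
    Rev_Dollar = PUT(Revenue_NUM, DOLLAR12.);  /* text containing $50,000 */
    PUT Rev_Comma= Rev_Dollar=;
RUN;

Note that PUT right-aligns its result within the format width, so a LEFT() or STRIP() around the call is common when building labels.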

6. 20 Additional Data Cleaning Best Practices

  1. Always validate raw data against metadata
  2. Maintain audit trails for every transformation
  3. Never overwrite original datasets
  4. Use controlled terminology (CDISC standards)
  5. Validate date formats strictly
  6. Check for duplicate primary keys
  7. Use macros for reusable cleaning logic
  8. Document assumptions clearly
  9. Apply range checks on numeric variables
  10. Perform cross-variable validation
  11. Handle missing values consistently
  12. Use PROC COMPARE for validation
  13. Validate against SAP specifications
  14. Ensure traceability from SDTM to ADaM
  15. Avoid hardcoding values
  16. Use formats for standardization
  17. Log warnings and errors
  18. Perform peer review of code
  19. Automate QC checks
  20. Maintain version control
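Practices 12 (PROC COMPARE) and 19 (automated QC) often go together in double-programming workflows. A minimal sketch, assuming a second, independently derived dataset named qc_mobile_final exists (that name is hypothetical, for illustration only):

/* Double-programming QC check: compare production vs. QC dataset.
   qc_mobile_final is an assumed name for this sketch. */
PROC SORT DATA=mobile_final    OUT=prod; BY ID; RUN;
PROC SORT DATA=qc_mobile_final OUT=qc;   BY ID; RUN;

PROC COMPARE BASE=prod COMPARE=qc LISTALL;
    ID ID;
RUN;

A clean run reports no unequal values; any discrepancy is listed variable by variable, which is exactly the audit trail regulators expect.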

7. Business Logic Behind Data Cleaning

Data cleaning is not cosmetic; it is analytical survival.

  • Missing values are replaced to prevent bias in statistical models
  • Invalid values (like negative age) are corrected to maintain realism
  • Date corrections ensure chronological integrity

Example:
If a patient’s age is recorded as a negative value, it must be corrected or excluded; otherwise survival analysis becomes invalid.

Similarly, in our dataset:
If shutdown year < launch year → logical inconsistency → must be fixed.

8. 20 Key Insights (Sharp Points)

  • Dirty data leads to wrong conclusions
  • INPUT enables numeric computation
  • PUT enables reporting flexibility
  • Standardization improves joins
  • Missing values distort models
  • Duplicate records inflate results
  • Validation ensures trust
  • SAS enforces strict typing
  • R offers flexibility
  • Business rules drive cleaning
  • Audit trails ensure compliance
  • Clinical data must be traceable
  • Incorrect dates break timelines
  • Text inconsistency breaks grouping
  • Automation reduces human error
  • QC checks prevent failures
  • Data cleaning is iterative
  • Documentation is mandatory
  • Reproducibility is critical
  • Clean data drives decisions

9. Summary (SAS vs R)

Feature         SAS          R
Data Control    Strong       Flexible
Regulatory Use  High         Moderate
Typing          Strict       Loose
Functions       INPUT / PUT  as.numeric() / as.character()
Scalability     High         High

10. Conclusion

Data cleaning is not a preliminary step; it is the foundation of trustworthy analytics. Whether you are working with telecom datasets or clinical trials, mastering tools like INPUT and PUT in SAS ensures that your data is not just usable, but reliable. A structured approach transforms messy datasets into decision-ready assets.

11. Interview Questions

Q1: Difference between INPUT and PUT?
👉 INPUT converts character to numeric; PUT converts numeric to character.

Q2: What happens if INPUT is not used?
👉 SAS treats values as character → calculations fail.

Q3: How to remove duplicates in SAS?
👉 Use PROC SORT NODUPKEY.

Q4: How do you handle invalid dates?
👉 Use conditions and validate against business logic.

Q5: R equivalent of INPUT?
👉 as.numeric() or as.Date() depending on context.
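A two-line illustration of Q1, runnable in any SAS session:

DATA _null_;
    n = INPUT("2014", 4.);  /* character "2014" -> numeric 2014   */
    c = PUT(2014, 4.);      /* numeric 2014    -> character "2014" */
    PUT n= c=;
RUN;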

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only, using fictional data about vanished mobile brands.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

  • Students learning SAS
  • Data analysts building portfolios
  • Professionals preparing for SAS interviews
  • Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1. Can SAS’s Precision and R’s Flexibility Together Eliminate Hidden Errors in Orbital Debris Data Analytics?

2. Can We Build an Accurate Product Demand Forecasting & Fraud Detection System in SAS While Identifying and Fixing Intentional Errors?

3. Can SAS Identify the Most Efficient Waste Collection Routes in a City?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
