Repairing a Corrupted Global Museum Dataset Using SAS and R

From Museum Chaos to Analytical Intelligence: Repairing a Corrupted Global Museum Dataset Using SAS and R for Enterprise-Grade Reporting

Introduction:When Dirty Data Turns Cultural Intelligence into Business Risk

Imagine a global tourism intelligence platform preparing an annual report ranking the world's most visited museums. Government tourism departments, investment firms, city planners, and cultural organizations rely on these numbers to allocate funding and forecast tourism growth.

Three days before publication, analysts discover serious problems:

  • Duplicate museum identifiers inflate visitor counts.
  • Missing opening dates distort historical trend analysis.
  • Negative visitor numbers appear after faulty ETL migrations.
  • Invalid email addresses prevent stakeholder communication.
  • Region codes use mixed standards such as EU, europe, Eur, and EUROPE.
  • Museum categories contain corrupted values like "Artt", "histor y", and "NULL".
  • Visitor satisfaction scores exceed logical ranges.
  • Text fields contain leading spaces and hidden characters.

The result?

Executive dashboards become misleading.

Machine learning models generate incorrect tourism forecasts.

Public reports become unreliable.

Financial planning decisions become dangerous.

The reality is simple:

Dirty data creates expensive business mistakes.

As Clinical SAS Programmers understand from regulatory submissions, poor-quality data is not merely an inconvenience it becomes a compliance, credibility, and business continuity risk.

Building a Corrupted Global Museum Dataset 

1.Raw SAS Dataset with Intentional Errors

data museum_raw;

length museum_id $8 museum_name $40 country $20 region $15

       category $20 contact_email $50 status $15;

informat opening_date anydtdte20.;

format opening_date date9.;

input museum_id $ museum_name $ country $ region $ opening_date :?? anydtdte20.

      annual_visitors avg_rating category $ contact_email $ status $;

datalines;

M001 Louvre France EU 08APR1793 8900000 4.9 Art contact@louvre.fr Active

M001 Louvre France eu 08APR1793 8900000 4.9 Art contact@louvre.fr Active

M002 British_Museum UK EUROPE . 6200000 4.8 History info@britishmuseum.org Active

M003 Vatican_Museum Italy Eur 15FEB1506 -2000 4.9 History invalidemail Active

M004 Prado Spain EU 19NOV1819 3400000 7.8 Art museum@prado.es Active

M005  Rijksmuseum Netherlands europe 20APR1800 2700000 4.7 Art NULL Active

M006 Metropolitan USA NA 13APR1870 6500000 4.8 Artt info@met.org Active

M007 Hermitage Russia EMEA 17FEB1764 4500000 -1 History museum@hermitage.ru Active

M008 Uffizi Italy EU 1581-01-01 2200000 4.9 Painting uffizi.it Active

M009 Acropolis Greece eu BADDATE 1800000 4.5 Archaeology acropolis@mail Active

M010 Moma USA na 07NOV1929 3100000 4.6 ModernArt info@moma.org Closed

M011 NULL Germany EU . . 4.7 Art museum@berlin.de Active

M012 National_Museum India APAC 15AUG1947 1200000 4.4 Histroy contactmuseum.in Active

M013 Tokyo_National Japan apac 10APR1872 2500000 4.8 History tokyo@museum.jp Active

M014 Egyptian_Museum Egypt MEA 1902-11-15 1500000 4.7 Archaeology egypt@museum.org Active

M015 Science_Museum UK EUROPE 18JUN1857 3200000 4.5 Science science@@museum.org Active

;

run;

proc print data=museum_raw;

run;

OUTPUT:

Obsmuseum_idmuseum_namecountryregioncategorycontact_emailstatusopening_dateannual_visitorsavg_rating
1M001LouvreFranceEUArtcontact@louvre.frActive08APR179389000004.9
2M001LouvreFranceeuArtcontact@louvre.frActive08APR179389000004.9
3M002British_MuseumUKEUROPEHistoryinfo@britishmuseum.orgActive.62000004.8
4M003Vatican_MuseumItalyEurHistoryinvalidemailActive.-20004.9
5M004PradoSpainEUArtmuseum@prado.esActive19NOV181934000007.8
6M005RijksmuseumNetherlandseuropeArtNULLActive20APR180027000004.7
7M006MetropolitanUSANAArttinfo@met.orgActive13APR187065000004.8
8M007HermitageRussiaEMEAHistorymuseum@hermitage.ruActive17FEB17644500000-1.0
9M008UffiziItalyEUPaintinguffizi.itActive.22000004.9
10M009AcropolisGreeceeuArchaeologyacropolis@mailActive.18000004.5
11M010MomaUSAnaModernArtinfo@moma.orgClosed07NOV192931000004.6
12M011NULLGermanyEUArtmuseum@berlin.deActive..4.7
13M012National_MuseumIndiaAPACHistroycontactmuseum.inActive15AUG194712000004.4
14M013Tokyo_NationalJapanapacHistorytokyo@museum.jpActive10APR187225000004.8
15M014Egyptian_MuseumEgyptMEAArchaeologyegypt@museum.orgActive15NOV190215000004.7
16M015Science_MuseumUKEUROPESciencescience@@museum.orgActive18JUN185732000004.5

This mirrors production clinical datasets before SDTM transformation.

Why LENGTH Must Appear First

data example;

length category $30;

if category='Art' then flag='Y';

category='History Museum';

run;

proc print data=example;

run;

OUTPUT:

Obscategoryflag
1History Museum 

SAS determines variable length during first creation.

If LENGTH is omitted:

category='Art';

category='History Museum';

SAS permanently allocates length 3 because "Art" appeared first.

Result:

His

This is called Character Truncation Risk.

In regulated environments this may convert:

COMPLETED

into:

COMP

causing catastrophic submission failures.

R behaves differently because character vectors dynamically allocate memory.

2.Cleaning Workflow Using SAS DATA Step

data museum_clean;

set museum_raw;

museum_name=propcase(strip(museum_name));

country=propcase(strip(country));

region=upcase(strip(region));

category=propcase(strip(category));

annual_visitors=abs(annual_visitors);

avg_rating=round(avg_rating,.1);

if avg_rating>5 then avg_rating=5;

if avg_rating<0 then avg_rating=.;

if find(contact_email,'@')=0 then contact_email='';

if museum_name='NULL' then museum_name='Unknown';

if category='Artt' then category='Art';

if category='Histroy' then category='History';

if annual_visitors=0 then annual_visitors=.;

if opening_date=. then opening_date=intnx('year',today(),-50,'same');

years_open=intck('year',opening_date,today());

full_location=catx('-',country,region);

missing_count=cmiss(of _character_)+nmiss(of _numeric_);

run;

proc print data=museum_clean;

run;

OUTPUT:

Obsmuseum_idmuseum_namecountryregioncategorycontact_emailstatusopening_dateannual_visitorsavg_ratingyears_openfull_locationmissing_count
1M001LouvreFranceEUArtcontact@louvre.frActive08APR179389000004.9233France-EU1
2M001LouvreFranceEUArtcontact@louvre.frActive08APR179389000004.9233France-EU1
3M002British_museumUkEUROPEHistoryinfo@britishmuseum.orgActive03JUL197662000004.850Uk-EUROPE1
4M003Vatican_museumItalyEURHistory Active03JUL197620004.950Italy-EUR2
5M004PradoSpainEUArtmuseum@prado.esActive19NOV181934000005.0207Spain-EU1
6M005RijksmuseumNetherlandsEUROPEArt Active20APR180027000004.7226Netherlands-EUROPE2
7M006MetropolitanUsaNAArtinfo@met.orgActive13APR187065000004.8156Usa-NA1
8M007HermitageRussiaEMEAHistorymuseum@hermitage.ruActive17FEB17644500000.262Russia-EMEA2
9M008UffiziItalyEUPainting Active03JUL197622000004.950Italy-EU2
10M009AcropolisGreeceEUArchaeologyacropolis@mailActive03JUL197618000004.550Greece-EU1
11M010MomaUsaNAModernartinfo@moma.orgClosed07NOV192931000004.697Usa-NA1
12M011NullGermanyEUArtmuseum@berlin.deActive03JUL1976.4.750Germany-EU2
13M012National_museumIndiaAPACHistory Active15AUG194712000004.479India-APAC2
14M013Tokyo_nationalJapanAPACHistorytokyo@museum.jpActive10APR187225000004.8154Japan-APAC1
15M014Egyptian_museumEgyptMEAArchaeologyegypt@museum.orgActive15NOV190215000004.7124Egypt-MEA1
16M015Science_museumUkEUROPESciencescience@@museum.orgActive18JUN185732000004.5169Uk-EUROPE1

Explanation

This workflow demonstrates enterprise-grade standardization.

Functions used:

  • ABS
  • ROUND
  • PROPCASE
  • STRIP
  • FIND
  • INTNX
  • INTCK
  • CATX
  • CMISS
  • NMISS

Clinical programmers perform similar derivations while generating SDTM and ADaM domains.

3.Using SELECT-WHEN Instead of Long IF Blocks

data museum_clean;

set museum_clean;

select(upcase(region));

when('EU','EUROPE','EUR') region='EUROPE';

when('NA') region='NORTH_AMERICA';

when('APAC') region='ASIA_PACIFIC';

otherwise region='OTHER';

end;

run;

proc print data=museum_clean;

run;

OUTPUT:

Obsmuseum_idmuseum_namecountryregioncategorycontact_emailstatusopening_dateannual_visitorsavg_ratingyears_openfull_locationmissing_count
1M001LouvreFranceEUROPEArtcontact@louvre.frActive08APR179389000004.9233France-EU1
2M001LouvreFranceEUROPEArtcontact@louvre.frActive08APR179389000004.9233France-EU1
3M002British_museumUkEUROPEHistoryinfo@britishmuseum.orgActive03JUL197662000004.850Uk-EUROPE1
4M003Vatican_museumItalyEUROPEHistory Active03JUL197620004.950Italy-EUR2
5M004PradoSpainEUROPEArtmuseum@prado.esActive19NOV181934000005.0207Spain-EU1
6M005RijksmuseumNetherlandsEUROPEArt Active20APR180027000004.7226Netherlands-EUROPE2
7M006MetropolitanUsaNORTH_AMERICAArtinfo@met.orgActive13APR187065000004.8156Usa-NA1
8M007HermitageRussiaOTHERHistorymuseum@hermitage.ruActive17FEB17644500000.262Russia-EMEA2
9M008UffiziItalyEUROPEPainting Active03JUL197622000004.950Italy-EU2
10M009AcropolisGreeceEUROPEArchaeologyacropolis@mailActive03JUL197618000004.550Greece-EU1
11M010MomaUsaNORTH_AMERICAModernartinfo@moma.orgClosed07NOV192931000004.697Usa-NA1
12M011NullGermanyEUROPEArtmuseum@berlin.deActive03JUL1976.4.750Germany-EU2
13M012National_museumIndiaASIA_PACIFICHistory Active15AUG194712000004.479India-APAC2
14M013Tokyo_nationalJapanASIA_PACIFICHistorytokyo@museum.jpActive10APR187225000004.8154Japan-APAC1
15M014Egyptian_museumEgyptOTHERArchaeologyegypt@museum.orgActive15NOV190215000004.7124Egypt-MEA1
16M015Science_museumUkEUROPESciencescience@@museum.orgActive18JUN185732000004.5169Uk-EUROPE1

Key Points

SELECT-WHEN improves readability and maintenance.

Production validation teams prefer this over large IF-THEN chains.

4.ARRAY-Based Cleaning

data museum_clean;

set museum_clean;

array txt(*) museum_name country region category;

do i=1 to dim(txt);

txt(i)=compress(strip(txt(i)));

end;

drop i;

run;

proc print data=museum_clean;

run;

OUTPUT:

Obsmuseum_idmuseum_namecountryregioncategorycontact_emailstatusopening_dateannual_visitorsavg_ratingyears_openfull_locationmissing_count
1M001LouvreFranceEUROPEArtcontact@louvre.frActive08APR179389000004.9233France-EU1
2M001LouvreFranceEUROPEArtcontact@louvre.frActive08APR179389000004.9233France-EU1
3M002British_museumUkEUROPEHistoryinfo@britishmuseum.orgActive03JUL197662000004.850Uk-EUROPE1
4M003Vatican_museumItalyEUROPEHistory Active03JUL197620004.950Italy-EUR2
5M004PradoSpainEUROPEArtmuseum@prado.esActive19NOV181934000005.0207Spain-EU1
6M005RijksmuseumNetherlandsEUROPEArt Active20APR180027000004.7226Netherlands-EUROPE2
7M006MetropolitanUsaNORTH_AMERICAArtinfo@met.orgActive13APR187065000004.8156Usa-NA1
8M007HermitageRussiaOTHERHistorymuseum@hermitage.ruActive17FEB17644500000.262Russia-EMEA2
9M008UffiziItalyEUROPEPainting Active03JUL197622000004.950Italy-EU2
10M009AcropolisGreeceEUROPEArchaeologyacropolis@mailActive03JUL197618000004.550Greece-EU1
11M010MomaUsaNORTH_AMERICAModernartinfo@moma.orgClosed07NOV192931000004.697Usa-NA1
12M011NullGermanyEUROPEArtmuseum@berlin.deActive03JUL1976.4.750Germany-EU2
13M012National_museumIndiaASIA_PACIFICHistory Active15AUG194712000004.479India-APAC2
14M013Tokyo_nationalJapanASIA_PACIFICHistorytokyo@museum.jpActive10APR187225000004.8154Japan-APAC1
15M014Egyptian_museumEgyptOTHERArchaeologyegypt@museum.orgActive15NOV190215000004.7124Egypt-MEA1
16M015Science_museumUkEUROPESciencescience@@museum.orgActive18JUN185732000004.5169Uk-EUROPE1

Why Arrays Matter

Arrays allow scalable cleaning across dozens of variables without repeated code.

This technique is common in SDTM standardization projects.

5.Deduplication with PROC SORT

proc sort data=museum_clean nodupkey;

by museum_id;

run;

proc print data=museum_clean;

run;

LOG:

NOTE: There were 16 observations read from the data set WORK.MUSEUM_CLEAN.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.MUSEUM_CLEAN has 15 observations and 13 variables.
OUTPUT:
Obsmuseum_idmuseum_namecountryregioncategorycontact_emailstatusopening_dateannual_visitorsavg_ratingyears_openfull_locationmissing_count
1M001LouvreFranceEUROPEArtcontact@louvre.frActive08APR179389000004.9233France-EU1
2M002British_museumUkEUROPEHistoryinfo@britishmuseum.orgActive03JUL197662000004.850Uk-EUROPE1
3M003Vatican_museumItalyEUROPEHistory Active03JUL197620004.950Italy-EUR2
4M004PradoSpainEUROPEArtmuseum@prado.esActive19NOV181934000005.0207Spain-EU1
5M005RijksmuseumNetherlandsEUROPEArt Active20APR180027000004.7226Netherlands-EUROPE2
6M006MetropolitanUsaNORTH_AMERICAArtinfo@met.orgActive13APR187065000004.8156Usa-NA1
7M007HermitageRussiaOTHERHistorymuseum@hermitage.ruActive17FEB17644500000.262Russia-EMEA2
8M008UffiziItalyEUROPEPainting Active03JUL197622000004.950Italy-EU2
9M009AcropolisGreeceEUROPEArchaeologyacropolis@mailActive03JUL197618000004.550Greece-EU1
10M010MomaUsaNORTH_AMERICAModernartinfo@moma.orgClosed07NOV192931000004.697Usa-NA1
11M011NullGermanyEUROPEArtmuseum@berlin.deActive03JUL1976.4.750Germany-EU2
12M012National_museumIndiaASIA_PACIFICHistory Active15AUG194712000004.479India-APAC2
13M013Tokyo_nationalJapanASIA_PACIFICHistorytokyo@museum.jpActive10APR187225000004.8154Japan-APAC1
14M014Egyptian_museumEgyptOTHERArchaeologyegypt@museum.orgActive15NOV190215000004.7124Egypt-MEA1
15M015Science_museumUkEUROPESciencescience@@museum.orgActive18JUN185732000004.5169Uk-EUROPE1

Key Points

NODUPKEY removes duplicate business keys.

Clinical trials often use USUBJID for similar deduplication logic.

6.PROC SQL vs DATA Step MERGE

6.1 PROC SQL

data region_lookup;

length region $20 region_description $50;

input region $ region_description & $50.;

datalines;

EU Europe

EUROPE Europe

EUR Europe

NA North_America

NORTH_AMERICA North_America

APAC Asia_Pacific

MEA Middle_East_Africa

EMEA Europe_Middle_East_Africa

OTHER Other_Regions

;

run;

proc print data=region_lookup;

run;

OUTPUT:

Obsregionregion_description
1EUEurope
2EUROPEEurope
3EUREurope
4NANorth_America
5NORTH_AMERICANorth_America
6APACAsia_Pacific
7MEAMiddle_East_Africa
8EMEAEurope_Middle_East_Africa
9OTHEROther_Regions

proc sql;

create table final_sql as

select a.*,

       b.region_description

from museum_clean a

left join region_lookup b

on a.region=b.region;

quit;

proc print data=final_sql;

run;

OUTPUT:

Obsmuseum_idmuseum_namecountryregioncategorycontact_emailstatusopening_dateannual_visitorsavg_ratingyears_openfull_locationmissing_countregion_description
1M012National_museumIndiaASIA_PACIFICHistory Active15AUG194712000004.479India-APAC2 
2M013Tokyo_nationalJapanASIA_PACIFICHistorytokyo@museum.jpActive10APR187225000004.8154Japan-APAC1 
3M015Science_museumUkEUROPESciencescience@@museum.orgActive18JUN185732000004.5169Uk-EUROPE1Europe
4M003Vatican_museumItalyEUROPEHistory Active03JUL197620004.950Italy-EUR2Europe
5M005RijksmuseumNetherlandsEUROPEArt Active20APR180027000004.7226Netherlands-EUROPE2Europe
6M002British_museumUkEUROPEHistoryinfo@britishmuseum.orgActive03JUL197662000004.850Uk-EUROPE1Europe
7M011NullGermanyEUROPEArtmuseum@berlin.deActive03JUL1976.4.750Germany-EU2Europe
8M009AcropolisGreeceEUROPEArchaeologyacropolis@mailActive03JUL197618000004.550Greece-EU1Europe
9M001LouvreFranceEUROPEArtcontact@louvre.frActive08APR179389000004.9233France-EU1Europe
10M004PradoSpainEUROPEArtmuseum@prado.esActive19NOV181934000005.0207Spain-EU1Europe
11M008UffiziItalyEUROPEPainting Active03JUL197622000004.950Italy-EU2Europe
12M006MetropolitanUsaNORTH_AMERICAArtinfo@met.orgActive13APR187065000004.8156Usa-NA1North_America
13M010MomaUsaNORTH_AMERICAModernartinfo@moma.orgClosed07NOV192931000004.697Usa-NA1North_America
14M007HermitageRussiaOTHERHistorymuseum@hermitage.ruActive17FEB17644500000.262Russia-EMEA2Other_Regions
15M014Egyptian_museumEgyptOTHERArchaeologyegypt@museum.orgActive15NOV190215000004.7124Egypt-MEA1Other_Regions

6.2 DATA Step MERGE

proc sort data=museum_clean;

by region;

run;

proc print data=museum_clean;

run;

OUTPUT:

Obsmuseum_idmuseum_namecountryregioncategorycontact_emailstatusopening_dateannual_visitorsavg_ratingyears_openfull_locationmissing_count
1M012National_museumIndiaASIA_PACIFICHistory Active15AUG194712000004.479India-APAC2
2M013Tokyo_nationalJapanASIA_PACIFICHistorytokyo@museum.jpActive10APR187225000004.8154Japan-APAC1
3M001LouvreFranceEUROPEArtcontact@louvre.frActive08APR179389000004.9233France-EU1
4M002British_museumUkEUROPEHistoryinfo@britishmuseum.orgActive03JUL197662000004.850Uk-EUROPE1
5M003Vatican_museumItalyEUROPEHistory Active03JUL197620004.950Italy-EUR2
6M004PradoSpainEUROPEArtmuseum@prado.esActive19NOV181934000005.0207Spain-EU1
7M005RijksmuseumNetherlandsEUROPEArt Active20APR180027000004.7226Netherlands-EUROPE2
8M008UffiziItalyEUROPEPainting Active03JUL197622000004.950Italy-EU2
9M009AcropolisGreeceEUROPEArchaeologyacropolis@mailActive03JUL197618000004.550Greece-EU1
10M011NullGermanyEUROPEArtmuseum@berlin.deActive03JUL1976.4.750Germany-EU2
11M015Science_museumUkEUROPESciencescience@@museum.orgActive18JUN185732000004.5169Uk-EUROPE1
12M006MetropolitanUsaNORTH_AMERICAArtinfo@met.orgActive13APR187065000004.8156Usa-NA1
13M010MomaUsaNORTH_AMERICAModernartinfo@moma.orgClosed07NOV192931000004.697Usa-NA1
14M007HermitageRussiaOTHERHistorymuseum@hermitage.ruActive17FEB17644500000.262Russia-EMEA2
15M014Egyptian_museumEgyptOTHERArchaeologyegypt@museum.orgActive15NOV190215000004.7124Egypt-MEA1

proc sort data=region_lookup;

by region;

run;

proc print data=region_lookup;

run;

OUTPUT:

Obsregionregion_description
1APACAsia_Pacific
2EMEAEurope_Middle_East_Africa
3EUEurope
4EUREurope
5EUROPEEurope
6MEAMiddle_East_Africa
7NANorth_America
8NORTH_AMERICANorth_America
9OTHEROther_Regions

data final_merge;

merge museum_clean(in=a)

      region_lookup(in=b);

by region;

if a;

run;

proc print data=final_merge;

run;

OUTPUT:

Obsmuseum_idmuseum_namecountryregioncategorycontact_emailstatusopening_dateannual_visitorsavg_ratingyears_openfull_locationmissing_countregion_description
1M012National_museumIndiaASIA_PACIFICHistory Active15AUG194712000004.479India-APAC2 
2M013Tokyo_nationalJapanASIA_PACIFICHistorytokyo@museum.jpActive10APR187225000004.8154Japan-APAC1 
3M001LouvreFranceEUROPEArtcontact@louvre.frActive08APR179389000004.9233France-EU1Europe
4M002British_museumUkEUROPEHistoryinfo@britishmuseum.orgActive03JUL197662000004.850Uk-EUROPE1Europe
5M003Vatican_museumItalyEUROPEHistory Active03JUL197620004.950Italy-EUR2Europe
6M004PradoSpainEUROPEArtmuseum@prado.esActive19NOV181934000005.0207Spain-EU1Europe
7M005RijksmuseumNetherlandsEUROPEArt Active20APR180027000004.7226Netherlands-EUROPE2Europe
8M008UffiziItalyEUROPEPainting Active03JUL197622000004.950Italy-EU2Europe
9M009AcropolisGreeceEUROPEArchaeologyacropolis@mailActive03JUL197618000004.550Greece-EU1Europe
10M011NullGermanyEUROPEArtmuseum@berlin.deActive03JUL1976.4.750Germany-EU2Europe
11M015Science_museumUkEUROPESciencescience@@museum.orgActive18JUN185732000004.5169Uk-EUROPE1Europe
12M006MetropolitanUsaNORTH_AMERICAArtinfo@met.orgActive13APR187065000004.8156Usa-NA1North_America
13M010MomaUsaNORTH_AMERICAModernartinfo@moma.orgClosed07NOV192931000004.697Usa-NA1North_America
14M007HermitageRussiaOTHERHistorymuseum@hermitage.ruActive17FEB17644500000.262Russia-EMEA2Other_Regions
15M014Egyptian_museumEgyptOTHERArchaeologyegypt@museum.orgActive15NOV190215000004.7124Egypt-MEA1Other_Regions

Comparison

Feature

PROC SQL

DATA Step

Readability

High

Medium

Speed Large Files

Medium

High

Complex Joins

Excellent

Limited

Traceability

Excellent

Good

7.Professional Reporting Layer

7.1 PROC FREQ

proc freq data=museum_clean;

tables category region status;

run;

OUTPUT:

The FREQ Procedure

categoryFrequencyPercentCumulative
Frequency
Cumulative
Percent
Archaeology213.33213.33
Art533.33746.67
History533.331280.00
Modernart16.671386.67
Painting16.671493.33
Science16.6715100.00
regionFrequencyPercentCumulative
Frequency
Cumulative
Percent
ASIA_PACIFIC213.33213.33
EUROPE960.001173.33
NORTH_AMERICA213.331386.67
OTHER213.3315100.00
statusFrequencyPercentCumulative
Frequency
Cumulative
Percent
Active1493.331493.33
Closed16.6715100.00

7.2 PROC MEANS

proc means data=museum_clean n mean median min max;

var annual_visitors avg_rating;

run;

OUTPUT:

The MEANS Procedure

VariableNMeanMedianMinimumMaximum
annual_visitors
avg_rating
14
14
3407285.71
4.7285714
2900000.00
4.7500000
2000.00
4.4000000
8900000.00
5.0000000

7.3 PROC SUMMARY

proc summary data=museum_clean;

class region;

var annual_visitors;

output out=summary sum=;

run;

proc print data=summary;

run;

OUTPUT:

Obsregion_TYPE__FREQ_annual_visitors
1 01547702000
2ASIA_PACIFIC123700000
3EUROPE1928402000
4NORTH_AMERICA129600000
5OTHER126000000

7.4 PROC TRANSPOSE

proc transpose data=summary out=transposed;

by region;

var annual_visitors;

run;

proc print data=transposed;

run;

OUTPUT:

Obsregion_NAME_COL1
1 annual_visitors47702000
2ASIA_PACIFICannual_visitors3700000
3EUROPEannual_visitors28402000
4NORTH_AMERICAannual_visitors9600000
5OTHERannual_visitors6000000

7.5 PROC REPORT

proc report data=museum_clean nowd;

columns museum_name country annual_visitors avg_rating;

run;

OUTPUT:

museum_namecountryannual_visitorsavg_rating
National_museumIndia12000004.4
Tokyo_nationalJapan25000004.8
LouvreFrance89000004.9
British_museumUk62000004.8
Vatican_museumItaly20004.9
PradoSpain34000005
RijksmuseumNetherlands27000004.7
UffiziItaly22000004.9
AcropolisGreece18000004.5
NullGermany.4.7
Science_museumUk32000004.5
MetropolitanUsa65000004.8
MomaUsa31000004.6
HermitageRussia4500000.
Egyptian_museumEgypt15000004.7

Explanation

These procedures create executive-ready outputs for management reporting and regulatory review.

8.Reusable Macro Framework

%macro clean_text(ds,var);

data museum_macro;

set &ds;

&var=propcase(strip(compbl(&var)));

run;

proc print data=&ds;

run;

%mend;


%clean_text(museum_clean,museum_name);

OUTPUT:

Obsmuseum_idmuseum_namecountryregioncategorycontact_emailstatusopening_dateannual_visitorsavg_ratingyears_openfull_locationmissing_count
1M012National_museumIndiaASIA_PACIFICHistory Active15AUG194712000004.479India-APAC2
2M013Tokyo_nationalJapanASIA_PACIFICHistorytokyo@museum.jpActive10APR187225000004.8154Japan-APAC1
3M001LouvreFranceEUROPEArtcontact@louvre.frActive08APR179389000004.9233France-EU1
4M002British_museumUkEUROPEHistoryinfo@britishmuseum.orgActive03JUL197662000004.850Uk-EUROPE1
5M003Vatican_museumItalyEUROPEHistory Active03JUL197620004.950Italy-EUR2
6M004PradoSpainEUROPEArtmuseum@prado.esActive19NOV181934000005.0207Spain-EU1
7M005RijksmuseumNetherlandsEUROPEArt Active20APR180027000004.7226Netherlands-EUROPE2
8M008UffiziItalyEUROPEPainting Active03JUL197622000004.950Italy-EU2
9M009AcropolisGreeceEUROPEArchaeologyacropolis@mailActive03JUL197618000004.550Greece-EU1
10M011NullGermanyEUROPEArtmuseum@berlin.deActive03JUL1976.4.750Germany-EU2
11M015Science_museumUkEUROPESciencescience@@museum.orgActive18JUN185732000004.5169Uk-EUROPE1
12M006MetropolitanUsaNORTH_AMERICAArtinfo@met.orgActive13APR187065000004.8156Usa-NA1
13M010MomaUsaNORTH_AMERICAModernartinfo@moma.orgClosed07NOV192931000004.697Usa-NA1
14M007HermitageRussiaOTHERHistorymuseum@hermitage.ruActive17FEB17644500000.262Russia-EMEA2
15M014Egyptian_museumEgyptOTHERArchaeologyegypt@museum.orgActive15NOV190215000004.7124Egypt-MEA1

%clean_text(museum_clean,country);

OUTPUT:

Obsmuseum_idmuseum_namecountryregioncategorycontact_emailstatusopening_dateannual_visitorsavg_ratingyears_openfull_locationmissing_count
1M012National_museumIndiaASIA_PACIFICHistory Active15AUG194712000004.479India-APAC2
2M013Tokyo_nationalJapanASIA_PACIFICHistorytokyo@museum.jpActive10APR187225000004.8154Japan-APAC1
3M001LouvreFranceEUROPEArtcontact@louvre.frActive08APR179389000004.9233France-EU1
4M002British_museumUkEUROPEHistoryinfo@britishmuseum.orgActive03JUL197662000004.850Uk-EUROPE1
5M003Vatican_museumItalyEUROPEHistory Active03JUL197620004.950Italy-EUR2
6M004PradoSpainEUROPEArtmuseum@prado.esActive19NOV181934000005.0207Spain-EU1
7M005RijksmuseumNetherlandsEUROPEArt Active20APR180027000004.7226Netherlands-EUROPE2
8M008UffiziItalyEUROPEPainting Active03JUL197622000004.950Italy-EU2
9M009AcropolisGreeceEUROPEArchaeologyacropolis@mailActive03JUL197618000004.550Greece-EU1
10M011NullGermanyEUROPEArtmuseum@berlin.deActive03JUL1976.4.750Germany-EU2
11M015Science_museumUkEUROPESciencescience@@museum.orgActive18JUN185732000004.5169Uk-EUROPE1
12M006MetropolitanUsaNORTH_AMERICAArtinfo@met.orgActive13APR187065000004.8156Usa-NA1
13M010MomaUsaNORTH_AMERICAModernartinfo@moma.orgClosed07NOV192931000004.697Usa-NA1
14M007HermitageRussiaOTHERHistorymuseum@hermitage.ruActive17FEB17644500000.262Russia-EMEA2
15M014Egyptian_museumEgyptOTHERArchaeologyegypt@museum.orgActive15NOV190215000004.7124Egypt-MEA1

Macros improve reproducibility and reduce maintenance costs.

9.R Raw Dataset

library(tibble)

museum_raw <- tibble(

  museum_id = c(

    "M001","M001","M002","M003","M004",

    "M005","M006","M007","M008","M009",

    "M010","M011","M012","M013","M014","M015"),

  museum_name = c("Louvre","Louvre","British_Museum","Vatican_Museum","Prado",

    "Rijksmuseum","Metropolitan","Hermitage","Uffizi","Acropolis",

    "Moma","NULL","National_Museum","Tokyo_National","Egyptian_Museum","Science_Museum"),

  country = c("France","France","UK","Italy","Spain","Netherlands","USA",

    "Russia","Italy","Greece","USA","Germany","India","Japan","Egypt","UK"),

  region = c("EU","eu","EUROPE","Eur","EU","europe","NA","EMEA","EU","eu",

    "na","EU","APAC","apac","MEA","EUROPE"),

  opening_date = c("08APR1793","08APR1793",NA,"15FEB1506","19NOV1819","20APR1800",

    "13APR1870","17FEB1764","1581-01-01","BADDATE","07NOV1929",NA,"15AUG1947",

    "10APR1872","1902-11-15","18JUN1857"),

  annual_visitors = c(8900000,8900000,6200000,-2000,3400000,2700000,6500000,

    4500000,2200000,1800000,3100000,NA,1200000,2500000,1500000,3200000),

  avg_rating = c(4.9,4.9,4.8,4.9,7.8,4.7,4.8,-1,4.9,4.5,4.6,4.7,4.4,4.8,4.7,4.5),

  category = c("Art","Art","History","History","Art","Art","Artt","History",

    "Painting","Archaeology","ModernArt","Art","Histroy","History","Archaeology",

    "Science"),

  contact_email = c("contact@louvre.fr","contact@louvre.fr","info@britishmuseum.org",

    "invalidemail","museum@prado.es","NULL","info@met.org","museum@hermitage.ru",

    "uffizi.it","acropolis@mail","info@moma.org","museum@berlin.de","contactmuseum.in",

    "tokyo@museum.jp","egypt@museum.org","science@@museum.org"),

  status = c("Active","Active","Active","Active","Active","Active","Active",

    "Active","Active","Active","Closed","Active","Active","Active","Active",

    "Active")

)

OUTPUT:

museum_id

museum_name

country

region

opening_date

annual_visitors

avg_rating

category

contact_email

status

M001

Louvre

France

EU

08APR1793

8900000

4.9

Art

contact@louvre.fr

Active

M001

Louvre

France

eu

08APR1793

8900000

4.9

Art

contact@louvre.fr

Active

M002

British_Museum

UK

EUROPE

6200000

4.8

History

info@britishmuseum.org

Active

M003

Vatican_Museum

Italy

Eur

15FEB1506

-2000

4.9

History

invalidemail

Active

M004

Prado

Spain

EU

19NOV1819

3400000

7.8

Art

museum@prado.es

Active

M005

Rijksmuseum

Netherlands

europe

20APR1800

2700000

4.7

Art

NULL

Active

M006

Metropolitan

USA

NA

13APR1870

6500000

4.8

Artt

info@met.org

Active

M007

Hermitage

Russia

EMEA

17FEB1764

4500000

-1

History

museum@hermitage.ru

Active

M008

Uffizi

Italy

EU

1581-01-01

2200000

4.9

Painting

uffizi.it

Active

M009

Acropolis

Greece

eu

BADDATE

1800000

4.5

Archaeology

acropolis@mail

Active

M010

Moma

USA

na

07NOV1929

3100000

4.6

ModernArt

info@moma.org

Closed

M011

NULL

Germany

EU

4.7

Art

museum@berlin.de

Active

M012

National_Museum

India

APAC

15AUG1947

1200000

4.4

Histroy

contactmuseum.in

Active

M013

Tokyo_National

Japan

apac

10APR1872

2500000

4.8

History

tokyo@museum.jp

Active

M014

Egyptian_Museum

Egypt

MEA

1902-11-15

1500000

4.7

Archaeology

egypt@museum.org

Active

M015

Science_Museum

UK

EUROPE

18JUN1857

3200000

4.5

Science

science@@museum.org

Active


10.Equivalent R Cleaning Workflow

library(tidyverse)

library(janitor)

library(lubridate)

museum_clean <- museum_raw %>%

  clean_names() %>%

  mutate(

    opening_date = parse_date_time(opening_date,

      orders = c("dby","Ymd"    ),quiet = TRUE),

    museum_id = str_to_title(str_trim(museum_id)),

    museum_name = str_to_title(str_trim(museum_name)),

    country = str_to_title(country),

    region = str_to_upper(str_trim(region)),

    annual_visitors = abs(annual_visitors),

    avg_rating = if_else(avg_rating > 5,5,avg_rating),

    category = str_replace_all(category,"Artt","Art"),

    category = str_replace_all(category,"Histroy","History"),

    contact_email = if_else(grepl("@",contact_email),

      contact_email,NA_character_),

    opening_date = coalesce(as.Date(opening_date),

      today() - years(50))

  ) %>%

  distinct(museum_id,.keep_all=TRUE)

OUTPUT:

museum_id

museum_name

country

region

opening_date

annual_visitors

avg_rating

category

contact_email

status

M001

Louvre

France

EU

    1793-04-08

8900000

4.9

Art

contact@louvre.fr

Active

M002

British_museum

Uk

EUROPE

1976-07-03

6200000

4.8

History

info@britishmuseum.org

Active

M003

Vatican_museum

Italy

EUR

    1506-02-15

2000

4.9

History

Active

M004

Prado

Spain

EU

    1819-11-19

3400000

5

Art

museum@prado.es

Active

M005

Rijksmuseum

Netherlands

EUROPE

     1800-04-20

2700000

4.7

Art

Active

M006

Metropolitan

Usa

NA

     1870-04-13

6500000

4.8

Art

info@met.org

Active

M007

Hermitage

Russia

EMEA

     1764-02-17

4500000

-1

History

museum@hermitage.ru

Active

M008

Uffizi

Italy

EU

     1581-01-01

2200000

4.9

Painting

Active

M009

Acropolis

Greece

EU

 1976-07-03

1800000

4.5

Archaeology

acropolis@mail

Active

M010

Moma

Usa

NA

 1929-11-07

3100000

4.6

ModernArt

info@moma.org

Closed

M011

Null

Germany

EU

 1976-07-03

4.7

Art

museum@berlin.de

Active

M012

National_museum

India

APAC

 1947-08-15

1200000

4.4

History

Active

M013

Tokyo_national

Japan

APAC

     1872-04-10

2500000

4.8

History

tokyo@museum.jp

Active

M014

Egyptian_museum

Egypt

MEA

 1902-11-15

1500000

4.7

Archaeology

egypt@museum.org

Active

M015

Science_museum

Uk

EUROPE

     1857-06-18

3200000

4.5

Science

science@@museum.org

Active


SAS vs R Mapping

SAS

R

STRIP

str_trim

PROPCASE

str_to_title

ABS

abs

FIND

grepl

CATX

unite

IF THEN

if_else

PROC SUMMARY

summarise

Enterprise Validation and Compliance

Clinical environments demand:

  • SDTM compliance
  • ADaM traceability
  • Independent QC
  • Audit trails
  • Metadata governance
  • Validation programming

One dangerous SAS behavior is:

if age <18 then delete;

Missing values are treated as smaller than any number.

Therefore:

age=.

also satisfies:

age<18

Potentially deleting valid patients.

This has caused real regulatory findings.

Business Logic Behind Cleaning Decisions

Data cleaning is not cosmetic work.

It represents business logic implementation.

Negative visitors become positive because visitor counts cannot be negative.

Missing opening dates are imputed because trend models require continuous time dimensions.

Age corrections prevent pediatric patients entering adult cohorts.

Text normalization ensures grouping accuracy.

For example:

EU

eu

Europe

EUR

should represent one business concept.

Without standardization dashboards produce four categories instead of one.

Similarly, malformed emails break notification systems.

Missing values affect denominators in statistical models.

Clinical trial analyses may incorrectly calculate safety populations.

Cleaning therefore protects downstream analytics rather than improving appearance.

Twenty Best Practices

  1. Define metadata first.
  2. Standardize naming conventions.
  3. Validate ranges early.
  4. Deduplicate before derivations.
  5. Separate raw and cleaned layers.
  6. Never overwrite source data.
  7. Use reusable macros.
  8. Maintain audit trails.
  9. Perform independent QC.
  10. Validate joins.
  11. Version control programs.
  12. Automate checks.
  13. Document assumptions.
  14. Standardize formats.
  15. Monitor truncation risks.
  16. Log all imputations.
  17. Protect business keys.
  18. Validate timestamps.
  19. Use defensive programming.
  20. Deploy production checklists.

Twenty Enterprise Insights

  • Dirty data creates expensive business mistakes.
  • Validation logic beats visual inspection.
  • Standardization improves reproducibility.
  • Metadata drives consistency.
  • Missing values carry meaning.
  • Audit trails protect submissions.
  • Macros reduce risk.
  • Duplicate keys destroy trust.
  • Formats improve reporting quality.
  • Defensive programming saves projects.
  • Traceability enables compliance.
  • Automation reduces human error.
  • Dates require strict governance.
  • Documentation is part of programming.
  • Business rules must be explicit.
  • QC independence matters.
  • Reusable code scales faster.
  • Standardization improves AI models.
  • Clean data creates reliable intelligence.
  • Analytics is only as good as its inputs.

SAS Versus R in Enterprise Cleaning

SAS dominates auditability, regulatory traceability, scalability, and controlled production environments.

R dominates flexibility, visualization, package ecosystems, and exploratory analytics.

SAS excels in:

  • SDTM
  • ADaM
  • TLF production
  • submission traceability

R excels in:

  • feature engineering
  • machine learning
  • text processing
  • rapid experimentation

Together they create the ideal enterprise ecosystem.

Final Thoughts

Modern analytics pipelines succeed or fail long before machine learning begins.

The foundation is clean, validated, traceable data.

Whether working with museum intelligence, banking transactions, insurance claims, retail operations, or clinical trials, identical principles apply:

  • Standardize.
  • Validate.
  • Document.
  • Automate.
  • Audit.

SAS provides industrial-strength governance and reproducibility.

R provides agility and analytical creativity.

Organizations that combine both technologies build trustworthy business intelligence systems capable of supporting executives, regulators, scientists, and AI platforms with confidence.

The future belongs not to organizations with the most data, but to organizations with the most reliable data.

Interview Questions

Q1.How would you remove duplicates?

Answer: PROC SORT NODUPKEY in SAS or distinct() in R.

Q2.How would you detect invalid emails?

Answer: FIND or INDEX in SAS and grepl() in R.

Q3.Why is missing numeric handling dangerous in SAS?

Answer: Missing values are smaller than all valid numbers.

Q4.When should PROC SQL replace DATA Step MERGE?

Answer: Complex joins and many-to-many relationships.

Q5.How do you ensure audit readiness?

Answer: Traceability, QC independence, metadata control, and reproducible programming.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about MUSEUM DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS