Repairing a Corrupted Global Museum Dataset Using SAS and R
From Museum Chaos to
Analytical Intelligence: Repairing a Corrupted Global Museum Dataset Using SAS
and R for Enterprise-Grade Reporting
Introduction:When Dirty Data Turns Cultural
Intelligence into Business Risk
Imagine a
global tourism intelligence platform preparing an annual report ranking the
world's most visited museums. Government tourism departments, investment firms,
city planners, and cultural organizations rely on these numbers to allocate
funding and forecast tourism growth.
Three
days before publication, analysts discover serious problems:
- Duplicate museum identifiers
inflate visitor counts.
- Missing opening dates
distort historical trend analysis.
- Negative visitor numbers
appear after faulty ETL migrations.
- Invalid email addresses
prevent stakeholder communication.
- Region codes use mixed
standards such as EU, europe, Eur, and EUROPE.
- Museum categories contain
corrupted values like "Artt", "histor y", and
"NULL".
- Visitor satisfaction scores
exceed logical ranges.
- Text fields contain leading
spaces and hidden characters.
The
result?
Executive
dashboards become misleading.
Machine
learning models generate incorrect tourism forecasts.
Public
reports become unreliable.
Financial
planning decisions become dangerous.
The
reality is simple:
Dirty
data creates expensive business mistakes.
As
Clinical SAS Programmers understand from regulatory submissions, poor-quality
data is not merely an inconvenience it becomes a compliance, credibility, and
business continuity risk.
Building a Corrupted Global Museum Dataset
1.Raw SAS Dataset with Intentional Errors
data museum_raw;
length museum_id $8 museum_name $40 country $20 region $15
category $20 contact_email $50 status $15;
informat opening_date anydtdte20.;
format opening_date date9.;
input museum_id $ museum_name $ country $ region $ opening_date :?? anydtdte20.
annual_visitors avg_rating category $ contact_email $ status $;
datalines;
M001 Louvre France EU 08APR1793 8900000 4.9 Art contact@louvre.fr Active
M001 Louvre France eu 08APR1793 8900000 4.9 Art contact@louvre.fr Active
M002 British_Museum UK EUROPE . 6200000 4.8 History info@britishmuseum.org Active
M003 Vatican_Museum Italy Eur 15FEB1506 -2000 4.9 History invalidemail Active
M004 Prado Spain EU 19NOV1819 3400000 7.8 Art museum@prado.es Active
M005 Rijksmuseum Netherlands europe 20APR1800 2700000 4.7 Art NULL Active
M006 Metropolitan USA NA 13APR1870 6500000 4.8 Artt info@met.org Active
M007 Hermitage Russia EMEA 17FEB1764 4500000 -1 History museum@hermitage.ru Active
M008 Uffizi Italy EU 1581-01-01 2200000 4.9 Painting uffizi.it Active
M009 Acropolis Greece eu BADDATE 1800000 4.5 Archaeology acropolis@mail Active
M010 Moma USA na 07NOV1929 3100000 4.6 ModernArt info@moma.org Closed
M011 NULL Germany EU . . 4.7 Art museum@berlin.de Active
M012 National_Museum India APAC 15AUG1947 1200000 4.4 Histroy contactmuseum.in Active
M013 Tokyo_National Japan apac 10APR1872 2500000 4.8 History tokyo@museum.jp Active
M014 Egyptian_Museum Egypt MEA 1902-11-15 1500000 4.7 Archaeology egypt@museum.org Active
M015 Science_Museum UK EUROPE 18JUN1857 3200000 4.5 Science science@@museum.org Active
;
run;
proc print data=museum_raw;
run;
OUTPUT:
| Obs | museum_id | museum_name | country | region | category | contact_email | status | opening_date | annual_visitors | avg_rating |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M001 | Louvre | France | EU | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 |
| 2 | M001 | Louvre | France | eu | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 |
| 3 | M002 | British_Museum | UK | EUROPE | History | info@britishmuseum.org | Active | . | 6200000 | 4.8 |
| 4 | M003 | Vatican_Museum | Italy | Eur | History | invalidemail | Active | . | -2000 | 4.9 |
| 5 | M004 | Prado | Spain | EU | Art | museum@prado.es | Active | 19NOV1819 | 3400000 | 7.8 |
| 6 | M005 | Rijksmuseum | Netherlands | europe | Art | NULL | Active | 20APR1800 | 2700000 | 4.7 |
| 7 | M006 | Metropolitan | USA | NA | Artt | info@met.org | Active | 13APR1870 | 6500000 | 4.8 |
| 8 | M007 | Hermitage | Russia | EMEA | History | museum@hermitage.ru | Active | 17FEB1764 | 4500000 | -1.0 |
| 9 | M008 | Uffizi | Italy | EU | Painting | uffizi.it | Active | . | 2200000 | 4.9 |
| 10 | M009 | Acropolis | Greece | eu | Archaeology | acropolis@mail | Active | . | 1800000 | 4.5 |
| 11 | M010 | Moma | USA | na | ModernArt | info@moma.org | Closed | 07NOV1929 | 3100000 | 4.6 |
| 12 | M011 | NULL | Germany | EU | Art | museum@berlin.de | Active | . | . | 4.7 |
| 13 | M012 | National_Museum | India | APAC | Histroy | contactmuseum.in | Active | 15AUG1947 | 1200000 | 4.4 |
| 14 | M013 | Tokyo_National | Japan | apac | History | tokyo@museum.jp | Active | 10APR1872 | 2500000 | 4.8 |
| 15 | M014 | Egyptian_Museum | Egypt | MEA | Archaeology | egypt@museum.org | Active | 15NOV1902 | 1500000 | 4.7 |
| 16 | M015 | Science_Museum | UK | EUROPE | Science | science@@museum.org | Active | 18JUN1857 | 3200000 | 4.5 |
This
mirrors production clinical datasets before SDTM transformation.
Why LENGTH Must Appear
First
data example;
length category $30;
if category='Art' then flag='Y';
category='History Museum';
run;
proc print data=example;
run;
OUTPUT:
| Obs | category | flag |
|---|---|---|
| 1 | History Museum |
SAS
determines variable length during first creation.
If LENGTH
is omitted:
category='Art';
category='History Museum';
SAS
permanently allocates length 3 because "Art" appeared first.
Result:
His
This is
called Character Truncation Risk.
In
regulated environments this may convert:
COMPLETED
into:
COMP
causing
catastrophic submission failures.
R behaves
differently because character vectors dynamically allocate memory.
2.Cleaning Workflow Using SAS DATA Step
data museum_clean;
set museum_raw;
museum_name=propcase(strip(museum_name));
country=propcase(strip(country));
region=upcase(strip(region));
category=propcase(strip(category));
annual_visitors=abs(annual_visitors);
avg_rating=round(avg_rating,.1);
if avg_rating>5 then avg_rating=5;
if avg_rating<0 then avg_rating=.;
if find(contact_email,'@')=0 then contact_email='';
if museum_name='NULL' then museum_name='Unknown';
if category='Artt' then category='Art';
if category='Histroy' then category='History';
if annual_visitors=0 then annual_visitors=.;
if opening_date=. then opening_date=intnx('year',today(),-50,'same');
years_open=intck('year',opening_date,today());
full_location=catx('-',country,region);
missing_count=cmiss(of _character_)+nmiss(of _numeric_);
run;
proc print data=museum_clean;
run;
OUTPUT:
| Obs | museum_id | museum_name | country | region | category | contact_email | status | opening_date | annual_visitors | avg_rating | years_open | full_location | missing_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M001 | Louvre | France | EU | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 | 233 | France-EU | 1 |
| 2 | M001 | Louvre | France | EU | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 | 233 | France-EU | 1 |
| 3 | M002 | British_museum | Uk | EUROPE | History | info@britishmuseum.org | Active | 03JUL1976 | 6200000 | 4.8 | 50 | Uk-EUROPE | 1 |
| 4 | M003 | Vatican_museum | Italy | EUR | History | Active | 03JUL1976 | 2000 | 4.9 | 50 | Italy-EUR | 2 | |
| 5 | M004 | Prado | Spain | EU | Art | museum@prado.es | Active | 19NOV1819 | 3400000 | 5.0 | 207 | Spain-EU | 1 |
| 6 | M005 | Rijksmuseum | Netherlands | EUROPE | Art | Active | 20APR1800 | 2700000 | 4.7 | 226 | Netherlands-EUROPE | 2 | |
| 7 | M006 | Metropolitan | Usa | NA | Art | info@met.org | Active | 13APR1870 | 6500000 | 4.8 | 156 | Usa-NA | 1 |
| 8 | M007 | Hermitage | Russia | EMEA | History | museum@hermitage.ru | Active | 17FEB1764 | 4500000 | . | 262 | Russia-EMEA | 2 |
| 9 | M008 | Uffizi | Italy | EU | Painting | Active | 03JUL1976 | 2200000 | 4.9 | 50 | Italy-EU | 2 | |
| 10 | M009 | Acropolis | Greece | EU | Archaeology | acropolis@mail | Active | 03JUL1976 | 1800000 | 4.5 | 50 | Greece-EU | 1 |
| 11 | M010 | Moma | Usa | NA | Modernart | info@moma.org | Closed | 07NOV1929 | 3100000 | 4.6 | 97 | Usa-NA | 1 |
| 12 | M011 | Null | Germany | EU | Art | museum@berlin.de | Active | 03JUL1976 | . | 4.7 | 50 | Germany-EU | 2 |
| 13 | M012 | National_museum | India | APAC | History | Active | 15AUG1947 | 1200000 | 4.4 | 79 | India-APAC | 2 | |
| 14 | M013 | Tokyo_national | Japan | APAC | History | tokyo@museum.jp | Active | 10APR1872 | 2500000 | 4.8 | 154 | Japan-APAC | 1 |
| 15 | M014 | Egyptian_museum | Egypt | MEA | Archaeology | egypt@museum.org | Active | 15NOV1902 | 1500000 | 4.7 | 124 | Egypt-MEA | 1 |
| 16 | M015 | Science_museum | Uk | EUROPE | Science | science@@museum.org | Active | 18JUN1857 | 3200000 | 4.5 | 169 | Uk-EUROPE | 1 |
Explanation
This
workflow demonstrates enterprise-grade standardization.
Functions
used:
- ABS
- ROUND
- PROPCASE
- STRIP
- FIND
- INTNX
- INTCK
- CATX
- CMISS
- NMISS
Clinical
programmers perform similar derivations while generating SDTM and ADaM domains.
3.Using SELECT-WHEN Instead of Long IF Blocks
data museum_clean;
set museum_clean;
select(upcase(region));
when('EU','EUROPE','EUR') region='EUROPE';
when('NA') region='NORTH_AMERICA';
when('APAC') region='ASIA_PACIFIC';
otherwise region='OTHER';
end;
run;
proc print data=museum_clean;
run;
OUTPUT:
| Obs | museum_id | museum_name | country | region | category | contact_email | status | opening_date | annual_visitors | avg_rating | years_open | full_location | missing_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M001 | Louvre | France | EUROPE | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 | 233 | France-EU | 1 |
| 2 | M001 | Louvre | France | EUROPE | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 | 233 | France-EU | 1 |
| 3 | M002 | British_museum | Uk | EUROPE | History | info@britishmuseum.org | Active | 03JUL1976 | 6200000 | 4.8 | 50 | Uk-EUROPE | 1 |
| 4 | M003 | Vatican_museum | Italy | EUROPE | History | Active | 03JUL1976 | 2000 | 4.9 | 50 | Italy-EUR | 2 | |
| 5 | M004 | Prado | Spain | EUROPE | Art | museum@prado.es | Active | 19NOV1819 | 3400000 | 5.0 | 207 | Spain-EU | 1 |
| 6 | M005 | Rijksmuseum | Netherlands | EUROPE | Art | Active | 20APR1800 | 2700000 | 4.7 | 226 | Netherlands-EUROPE | 2 | |
| 7 | M006 | Metropolitan | Usa | NORTH_AMERICA | Art | info@met.org | Active | 13APR1870 | 6500000 | 4.8 | 156 | Usa-NA | 1 |
| 8 | M007 | Hermitage | Russia | OTHER | History | museum@hermitage.ru | Active | 17FEB1764 | 4500000 | . | 262 | Russia-EMEA | 2 |
| 9 | M008 | Uffizi | Italy | EUROPE | Painting | Active | 03JUL1976 | 2200000 | 4.9 | 50 | Italy-EU | 2 | |
| 10 | M009 | Acropolis | Greece | EUROPE | Archaeology | acropolis@mail | Active | 03JUL1976 | 1800000 | 4.5 | 50 | Greece-EU | 1 |
| 11 | M010 | Moma | Usa | NORTH_AMERICA | Modernart | info@moma.org | Closed | 07NOV1929 | 3100000 | 4.6 | 97 | Usa-NA | 1 |
| 12 | M011 | Null | Germany | EUROPE | Art | museum@berlin.de | Active | 03JUL1976 | . | 4.7 | 50 | Germany-EU | 2 |
| 13 | M012 | National_museum | India | ASIA_PACIFIC | History | Active | 15AUG1947 | 1200000 | 4.4 | 79 | India-APAC | 2 | |
| 14 | M013 | Tokyo_national | Japan | ASIA_PACIFIC | History | tokyo@museum.jp | Active | 10APR1872 | 2500000 | 4.8 | 154 | Japan-APAC | 1 |
| 15 | M014 | Egyptian_museum | Egypt | OTHER | Archaeology | egypt@museum.org | Active | 15NOV1902 | 1500000 | 4.7 | 124 | Egypt-MEA | 1 |
| 16 | M015 | Science_museum | Uk | EUROPE | Science | science@@museum.org | Active | 18JUN1857 | 3200000 | 4.5 | 169 | Uk-EUROPE | 1 |
Key Points
SELECT-WHEN
improves readability and maintenance.
Production
validation teams prefer this over large IF-THEN chains.
4.ARRAY-Based Cleaning
data museum_clean;
set museum_clean;
array txt(*) museum_name country region category;
do i=1 to dim(txt);
txt(i)=compress(strip(txt(i)));
end;
drop i;
run;
proc print data=museum_clean;
run;
OUTPUT:
| Obs | museum_id | museum_name | country | region | category | contact_email | status | opening_date | annual_visitors | avg_rating | years_open | full_location | missing_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M001 | Louvre | France | EUROPE | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 | 233 | France-EU | 1 |
| 2 | M001 | Louvre | France | EUROPE | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 | 233 | France-EU | 1 |
| 3 | M002 | British_museum | Uk | EUROPE | History | info@britishmuseum.org | Active | 03JUL1976 | 6200000 | 4.8 | 50 | Uk-EUROPE | 1 |
| 4 | M003 | Vatican_museum | Italy | EUROPE | History | Active | 03JUL1976 | 2000 | 4.9 | 50 | Italy-EUR | 2 | |
| 5 | M004 | Prado | Spain | EUROPE | Art | museum@prado.es | Active | 19NOV1819 | 3400000 | 5.0 | 207 | Spain-EU | 1 |
| 6 | M005 | Rijksmuseum | Netherlands | EUROPE | Art | Active | 20APR1800 | 2700000 | 4.7 | 226 | Netherlands-EUROPE | 2 | |
| 7 | M006 | Metropolitan | Usa | NORTH_AMERICA | Art | info@met.org | Active | 13APR1870 | 6500000 | 4.8 | 156 | Usa-NA | 1 |
| 8 | M007 | Hermitage | Russia | OTHER | History | museum@hermitage.ru | Active | 17FEB1764 | 4500000 | . | 262 | Russia-EMEA | 2 |
| 9 | M008 | Uffizi | Italy | EUROPE | Painting | Active | 03JUL1976 | 2200000 | 4.9 | 50 | Italy-EU | 2 | |
| 10 | M009 | Acropolis | Greece | EUROPE | Archaeology | acropolis@mail | Active | 03JUL1976 | 1800000 | 4.5 | 50 | Greece-EU | 1 |
| 11 | M010 | Moma | Usa | NORTH_AMERICA | Modernart | info@moma.org | Closed | 07NOV1929 | 3100000 | 4.6 | 97 | Usa-NA | 1 |
| 12 | M011 | Null | Germany | EUROPE | Art | museum@berlin.de | Active | 03JUL1976 | . | 4.7 | 50 | Germany-EU | 2 |
| 13 | M012 | National_museum | India | ASIA_PACIFIC | History | Active | 15AUG1947 | 1200000 | 4.4 | 79 | India-APAC | 2 | |
| 14 | M013 | Tokyo_national | Japan | ASIA_PACIFIC | History | tokyo@museum.jp | Active | 10APR1872 | 2500000 | 4.8 | 154 | Japan-APAC | 1 |
| 15 | M014 | Egyptian_museum | Egypt | OTHER | Archaeology | egypt@museum.org | Active | 15NOV1902 | 1500000 | 4.7 | 124 | Egypt-MEA | 1 |
| 16 | M015 | Science_museum | Uk | EUROPE | Science | science@@museum.org | Active | 18JUN1857 | 3200000 | 4.5 | 169 | Uk-EUROPE | 1 |
Why Arrays Matter
Arrays
allow scalable cleaning across dozens of variables without repeated code.
This
technique is common in SDTM standardization projects.
5.Deduplication with PROC SORT
proc sort data=museum_clean nodupkey;
by museum_id;
run;
proc print data=museum_clean;
run;
LOG:
| Obs | museum_id | museum_name | country | region | category | contact_email | status | opening_date | annual_visitors | avg_rating | years_open | full_location | missing_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M001 | Louvre | France | EUROPE | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 | 233 | France-EU | 1 |
| 2 | M002 | British_museum | Uk | EUROPE | History | info@britishmuseum.org | Active | 03JUL1976 | 6200000 | 4.8 | 50 | Uk-EUROPE | 1 |
| 3 | M003 | Vatican_museum | Italy | EUROPE | History | Active | 03JUL1976 | 2000 | 4.9 | 50 | Italy-EUR | 2 | |
| 4 | M004 | Prado | Spain | EUROPE | Art | museum@prado.es | Active | 19NOV1819 | 3400000 | 5.0 | 207 | Spain-EU | 1 |
| 5 | M005 | Rijksmuseum | Netherlands | EUROPE | Art | Active | 20APR1800 | 2700000 | 4.7 | 226 | Netherlands-EUROPE | 2 | |
| 6 | M006 | Metropolitan | Usa | NORTH_AMERICA | Art | info@met.org | Active | 13APR1870 | 6500000 | 4.8 | 156 | Usa-NA | 1 |
| 7 | M007 | Hermitage | Russia | OTHER | History | museum@hermitage.ru | Active | 17FEB1764 | 4500000 | . | 262 | Russia-EMEA | 2 |
| 8 | M008 | Uffizi | Italy | EUROPE | Painting | Active | 03JUL1976 | 2200000 | 4.9 | 50 | Italy-EU | 2 | |
| 9 | M009 | Acropolis | Greece | EUROPE | Archaeology | acropolis@mail | Active | 03JUL1976 | 1800000 | 4.5 | 50 | Greece-EU | 1 |
| 10 | M010 | Moma | Usa | NORTH_AMERICA | Modernart | info@moma.org | Closed | 07NOV1929 | 3100000 | 4.6 | 97 | Usa-NA | 1 |
| 11 | M011 | Null | Germany | EUROPE | Art | museum@berlin.de | Active | 03JUL1976 | . | 4.7 | 50 | Germany-EU | 2 |
| 12 | M012 | National_museum | India | ASIA_PACIFIC | History | Active | 15AUG1947 | 1200000 | 4.4 | 79 | India-APAC | 2 | |
| 13 | M013 | Tokyo_national | Japan | ASIA_PACIFIC | History | tokyo@museum.jp | Active | 10APR1872 | 2500000 | 4.8 | 154 | Japan-APAC | 1 |
| 14 | M014 | Egyptian_museum | Egypt | OTHER | Archaeology | egypt@museum.org | Active | 15NOV1902 | 1500000 | 4.7 | 124 | Egypt-MEA | 1 |
| 15 | M015 | Science_museum | Uk | EUROPE | Science | science@@museum.org | Active | 18JUN1857 | 3200000 | 4.5 | 169 | Uk-EUROPE | 1 |
Key Points
NODUPKEY removes duplicate business keys.
Clinical trials often use USUBJID for similar deduplication logic.
6.PROC SQL vs DATA Step MERGE
6.1 PROC SQL
data region_lookup;
length region $20 region_description $50;
input region $ region_description & $50.;
datalines;
EU Europe
EUROPE Europe
EUR Europe
NA North_America
NORTH_AMERICA North_America
APAC Asia_Pacific
MEA Middle_East_Africa
EMEA Europe_Middle_East_Africa
OTHER Other_Regions
;
run;
proc print data=region_lookup;
run;
OUTPUT:
| Obs | region | region_description |
|---|---|---|
| 1 | EU | Europe |
| 2 | EUROPE | Europe |
| 3 | EUR | Europe |
| 4 | NA | North_America |
| 5 | NORTH_AMERICA | North_America |
| 6 | APAC | Asia_Pacific |
| 7 | MEA | Middle_East_Africa |
| 8 | EMEA | Europe_Middle_East_Africa |
| 9 | OTHER | Other_Regions |
proc sql;
create table final_sql as
select a.*,
b.region_description
from museum_clean a
left join region_lookup b
on a.region=b.region;
quit;
proc print data=final_sql;
run;
OUTPUT:
| Obs | museum_id | museum_name | country | region | category | contact_email | status | opening_date | annual_visitors | avg_rating | years_open | full_location | missing_count | region_description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M012 | National_museum | India | ASIA_PACIFIC | History | Active | 15AUG1947 | 1200000 | 4.4 | 79 | India-APAC | 2 | ||
| 2 | M013 | Tokyo_national | Japan | ASIA_PACIFIC | History | tokyo@museum.jp | Active | 10APR1872 | 2500000 | 4.8 | 154 | Japan-APAC | 1 | |
| 3 | M015 | Science_museum | Uk | EUROPE | Science | science@@museum.org | Active | 18JUN1857 | 3200000 | 4.5 | 169 | Uk-EUROPE | 1 | Europe |
| 4 | M003 | Vatican_museum | Italy | EUROPE | History | Active | 03JUL1976 | 2000 | 4.9 | 50 | Italy-EUR | 2 | Europe | |
| 5 | M005 | Rijksmuseum | Netherlands | EUROPE | Art | Active | 20APR1800 | 2700000 | 4.7 | 226 | Netherlands-EUROPE | 2 | Europe | |
| 6 | M002 | British_museum | Uk | EUROPE | History | info@britishmuseum.org | Active | 03JUL1976 | 6200000 | 4.8 | 50 | Uk-EUROPE | 1 | Europe |
| 7 | M011 | Null | Germany | EUROPE | Art | museum@berlin.de | Active | 03JUL1976 | . | 4.7 | 50 | Germany-EU | 2 | Europe |
| 8 | M009 | Acropolis | Greece | EUROPE | Archaeology | acropolis@mail | Active | 03JUL1976 | 1800000 | 4.5 | 50 | Greece-EU | 1 | Europe |
| 9 | M001 | Louvre | France | EUROPE | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 | 233 | France-EU | 1 | Europe |
| 10 | M004 | Prado | Spain | EUROPE | Art | museum@prado.es | Active | 19NOV1819 | 3400000 | 5.0 | 207 | Spain-EU | 1 | Europe |
| 11 | M008 | Uffizi | Italy | EUROPE | Painting | Active | 03JUL1976 | 2200000 | 4.9 | 50 | Italy-EU | 2 | Europe | |
| 12 | M006 | Metropolitan | Usa | NORTH_AMERICA | Art | info@met.org | Active | 13APR1870 | 6500000 | 4.8 | 156 | Usa-NA | 1 | North_America |
| 13 | M010 | Moma | Usa | NORTH_AMERICA | Modernart | info@moma.org | Closed | 07NOV1929 | 3100000 | 4.6 | 97 | Usa-NA | 1 | North_America |
| 14 | M007 | Hermitage | Russia | OTHER | History | museum@hermitage.ru | Active | 17FEB1764 | 4500000 | . | 262 | Russia-EMEA | 2 | Other_Regions |
| 15 | M014 | Egyptian_museum | Egypt | OTHER | Archaeology | egypt@museum.org | Active | 15NOV1902 | 1500000 | 4.7 | 124 | Egypt-MEA | 1 | Other_Regions |
6.2 DATA Step MERGE
proc sort data=museum_clean;
by region;
run;
proc print data=museum_clean;
run;
OUTPUT:
| Obs | museum_id | museum_name | country | region | category | contact_email | status | opening_date | annual_visitors | avg_rating | years_open | full_location | missing_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M012 | National_museum | India | ASIA_PACIFIC | History | Active | 15AUG1947 | 1200000 | 4.4 | 79 | India-APAC | 2 | |
| 2 | M013 | Tokyo_national | Japan | ASIA_PACIFIC | History | tokyo@museum.jp | Active | 10APR1872 | 2500000 | 4.8 | 154 | Japan-APAC | 1 |
| 3 | M001 | Louvre | France | EUROPE | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 | 233 | France-EU | 1 |
| 4 | M002 | British_museum | Uk | EUROPE | History | info@britishmuseum.org | Active | 03JUL1976 | 6200000 | 4.8 | 50 | Uk-EUROPE | 1 |
| 5 | M003 | Vatican_museum | Italy | EUROPE | History | Active | 03JUL1976 | 2000 | 4.9 | 50 | Italy-EUR | 2 | |
| 6 | M004 | Prado | Spain | EUROPE | Art | museum@prado.es | Active | 19NOV1819 | 3400000 | 5.0 | 207 | Spain-EU | 1 |
| 7 | M005 | Rijksmuseum | Netherlands | EUROPE | Art | Active | 20APR1800 | 2700000 | 4.7 | 226 | Netherlands-EUROPE | 2 | |
| 8 | M008 | Uffizi | Italy | EUROPE | Painting | Active | 03JUL1976 | 2200000 | 4.9 | 50 | Italy-EU | 2 | |
| 9 | M009 | Acropolis | Greece | EUROPE | Archaeology | acropolis@mail | Active | 03JUL1976 | 1800000 | 4.5 | 50 | Greece-EU | 1 |
| 10 | M011 | Null | Germany | EUROPE | Art | museum@berlin.de | Active | 03JUL1976 | . | 4.7 | 50 | Germany-EU | 2 |
| 11 | M015 | Science_museum | Uk | EUROPE | Science | science@@museum.org | Active | 18JUN1857 | 3200000 | 4.5 | 169 | Uk-EUROPE | 1 |
| 12 | M006 | Metropolitan | Usa | NORTH_AMERICA | Art | info@met.org | Active | 13APR1870 | 6500000 | 4.8 | 156 | Usa-NA | 1 |
| 13 | M010 | Moma | Usa | NORTH_AMERICA | Modernart | info@moma.org | Closed | 07NOV1929 | 3100000 | 4.6 | 97 | Usa-NA | 1 |
| 14 | M007 | Hermitage | Russia | OTHER | History | museum@hermitage.ru | Active | 17FEB1764 | 4500000 | . | 262 | Russia-EMEA | 2 |
| 15 | M014 | Egyptian_museum | Egypt | OTHER | Archaeology | egypt@museum.org | Active | 15NOV1902 | 1500000 | 4.7 | 124 | Egypt-MEA | 1 |
proc sort data=region_lookup;
by region;
run;
proc print data=region_lookup;
run;
OUTPUT:
| Obs | region | region_description |
|---|---|---|
| 1 | APAC | Asia_Pacific |
| 2 | EMEA | Europe_Middle_East_Africa |
| 3 | EU | Europe |
| 4 | EUR | Europe |
| 5 | EUROPE | Europe |
| 6 | MEA | Middle_East_Africa |
| 7 | NA | North_America |
| 8 | NORTH_AMERICA | North_America |
| 9 | OTHER | Other_Regions |
data final_merge;
merge museum_clean(in=a)
region_lookup(in=b);
by region;
if a;
run;
proc print data=final_merge;
run;
OUTPUT:
| Obs | museum_id | museum_name | country | region | category | contact_email | status | opening_date | annual_visitors | avg_rating | years_open | full_location | missing_count | region_description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M012 | National_museum | India | ASIA_PACIFIC | History | Active | 15AUG1947 | 1200000 | 4.4 | 79 | India-APAC | 2 | ||
| 2 | M013 | Tokyo_national | Japan | ASIA_PACIFIC | History | tokyo@museum.jp | Active | 10APR1872 | 2500000 | 4.8 | 154 | Japan-APAC | 1 | |
| 3 | M001 | Louvre | France | EUROPE | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 | 233 | France-EU | 1 | Europe |
| 4 | M002 | British_museum | Uk | EUROPE | History | info@britishmuseum.org | Active | 03JUL1976 | 6200000 | 4.8 | 50 | Uk-EUROPE | 1 | Europe |
| 5 | M003 | Vatican_museum | Italy | EUROPE | History | Active | 03JUL1976 | 2000 | 4.9 | 50 | Italy-EUR | 2 | Europe | |
| 6 | M004 | Prado | Spain | EUROPE | Art | museum@prado.es | Active | 19NOV1819 | 3400000 | 5.0 | 207 | Spain-EU | 1 | Europe |
| 7 | M005 | Rijksmuseum | Netherlands | EUROPE | Art | Active | 20APR1800 | 2700000 | 4.7 | 226 | Netherlands-EUROPE | 2 | Europe | |
| 8 | M008 | Uffizi | Italy | EUROPE | Painting | Active | 03JUL1976 | 2200000 | 4.9 | 50 | Italy-EU | 2 | Europe | |
| 9 | M009 | Acropolis | Greece | EUROPE | Archaeology | acropolis@mail | Active | 03JUL1976 | 1800000 | 4.5 | 50 | Greece-EU | 1 | Europe |
| 10 | M011 | Null | Germany | EUROPE | Art | museum@berlin.de | Active | 03JUL1976 | . | 4.7 | 50 | Germany-EU | 2 | Europe |
| 11 | M015 | Science_museum | Uk | EUROPE | Science | science@@museum.org | Active | 18JUN1857 | 3200000 | 4.5 | 169 | Uk-EUROPE | 1 | Europe |
| 12 | M006 | Metropolitan | Usa | NORTH_AMERICA | Art | info@met.org | Active | 13APR1870 | 6500000 | 4.8 | 156 | Usa-NA | 1 | North_America |
| 13 | M010 | Moma | Usa | NORTH_AMERICA | Modernart | info@moma.org | Closed | 07NOV1929 | 3100000 | 4.6 | 97 | Usa-NA | 1 | North_America |
| 14 | M007 | Hermitage | Russia | OTHER | History | museum@hermitage.ru | Active | 17FEB1764 | 4500000 | . | 262 | Russia-EMEA | 2 | Other_Regions |
| 15 | M014 | Egyptian_museum | Egypt | OTHER | Archaeology | egypt@museum.org | Active | 15NOV1902 | 1500000 | 4.7 | 124 | Egypt-MEA | 1 | Other_Regions |
Comparison
|
Feature |
PROC
SQL |
DATA
Step |
|
Readability |
High |
Medium |
|
Speed
Large Files |
Medium |
High |
|
Complex
Joins |
Excellent |
Limited |
|
Traceability |
Excellent |
Good |
7.Professional Reporting Layer
7.1 PROC FREQ
proc freq data=museum_clean;
tables category region status;
run;
OUTPUT:
The FREQ Procedure
| category | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Archaeology | 2 | 13.33 | 2 | 13.33 |
| Art | 5 | 33.33 | 7 | 46.67 |
| History | 5 | 33.33 | 12 | 80.00 |
| Modernart | 1 | 6.67 | 13 | 86.67 |
| Painting | 1 | 6.67 | 14 | 93.33 |
| Science | 1 | 6.67 | 15 | 100.00 |
| region | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| ASIA_PACIFIC | 2 | 13.33 | 2 | 13.33 |
| EUROPE | 9 | 60.00 | 11 | 73.33 |
| NORTH_AMERICA | 2 | 13.33 | 13 | 86.67 |
| OTHER | 2 | 13.33 | 15 | 100.00 |
| status | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Active | 14 | 93.33 | 14 | 93.33 |
| Closed | 1 | 6.67 | 15 | 100.00 |
7.2 PROC MEANS
proc means data=museum_clean n mean median min max;
var annual_visitors avg_rating;
run;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Median | Minimum | Maximum |
|---|---|---|---|---|---|
annual_visitors avg_rating | 14 14 | 3407285.71 4.7285714 | 2900000.00 4.7500000 | 2000.00 4.4000000 | 8900000.00 5.0000000 |
7.3 PROC SUMMARY
proc summary data=museum_clean;
class region;
var annual_visitors;
output out=summary sum=;
run;
proc print data=summary;
run;
OUTPUT:
| Obs | region | _TYPE_ | _FREQ_ | annual_visitors |
|---|---|---|---|---|
| 1 | 0 | 15 | 47702000 | |
| 2 | ASIA_PACIFIC | 1 | 2 | 3700000 |
| 3 | EUROPE | 1 | 9 | 28402000 |
| 4 | NORTH_AMERICA | 1 | 2 | 9600000 |
| 5 | OTHER | 1 | 2 | 6000000 |
7.4 PROC TRANSPOSE
proc transpose data=summary out=transposed;
by region;
var annual_visitors;
run;
proc print data=transposed;
run;
OUTPUT:
| Obs | region | _NAME_ | COL1 |
|---|---|---|---|
| 1 | annual_visitors | 47702000 | |
| 2 | ASIA_PACIFIC | annual_visitors | 3700000 |
| 3 | EUROPE | annual_visitors | 28402000 |
| 4 | NORTH_AMERICA | annual_visitors | 9600000 |
| 5 | OTHER | annual_visitors | 6000000 |
7.5 PROC REPORT
proc report data=museum_clean nowd;
columns museum_name country annual_visitors avg_rating;
run;
OUTPUT:
| museum_name | country | annual_visitors | avg_rating |
|---|---|---|---|
| National_museum | India | 1200000 | 4.4 |
| Tokyo_national | Japan | 2500000 | 4.8 |
| Louvre | France | 8900000 | 4.9 |
| British_museum | Uk | 6200000 | 4.8 |
| Vatican_museum | Italy | 2000 | 4.9 |
| Prado | Spain | 3400000 | 5 |
| Rijksmuseum | Netherlands | 2700000 | 4.7 |
| Uffizi | Italy | 2200000 | 4.9 |
| Acropolis | Greece | 1800000 | 4.5 |
| Null | Germany | . | 4.7 |
| Science_museum | Uk | 3200000 | 4.5 |
| Metropolitan | Usa | 6500000 | 4.8 |
| Moma | Usa | 3100000 | 4.6 |
| Hermitage | Russia | 4500000 | . |
| Egyptian_museum | Egypt | 1500000 | 4.7 |
Explanation
These
procedures create executive-ready outputs for management reporting and
regulatory review.
8.Reusable Macro Framework
%macro clean_text(ds,var);
data museum_macro;
set &ds;
&var=propcase(strip(compbl(&var)));
run;
proc print data=&ds;
run;
%mend;
%clean_text(museum_clean,museum_name);
OUTPUT:
| Obs | museum_id | museum_name | country | region | category | contact_email | status | opening_date | annual_visitors | avg_rating | years_open | full_location | missing_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M012 | National_museum | India | ASIA_PACIFIC | History | Active | 15AUG1947 | 1200000 | 4.4 | 79 | India-APAC | 2 | |
| 2 | M013 | Tokyo_national | Japan | ASIA_PACIFIC | History | tokyo@museum.jp | Active | 10APR1872 | 2500000 | 4.8 | 154 | Japan-APAC | 1 |
| 3 | M001 | Louvre | France | EUROPE | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 | 233 | France-EU | 1 |
| 4 | M002 | British_museum | Uk | EUROPE | History | info@britishmuseum.org | Active | 03JUL1976 | 6200000 | 4.8 | 50 | Uk-EUROPE | 1 |
| 5 | M003 | Vatican_museum | Italy | EUROPE | History | Active | 03JUL1976 | 2000 | 4.9 | 50 | Italy-EUR | 2 | |
| 6 | M004 | Prado | Spain | EUROPE | Art | museum@prado.es | Active | 19NOV1819 | 3400000 | 5.0 | 207 | Spain-EU | 1 |
| 7 | M005 | Rijksmuseum | Netherlands | EUROPE | Art | Active | 20APR1800 | 2700000 | 4.7 | 226 | Netherlands-EUROPE | 2 | |
| 8 | M008 | Uffizi | Italy | EUROPE | Painting | Active | 03JUL1976 | 2200000 | 4.9 | 50 | Italy-EU | 2 | |
| 9 | M009 | Acropolis | Greece | EUROPE | Archaeology | acropolis@mail | Active | 03JUL1976 | 1800000 | 4.5 | 50 | Greece-EU | 1 |
| 10 | M011 | Null | Germany | EUROPE | Art | museum@berlin.de | Active | 03JUL1976 | . | 4.7 | 50 | Germany-EU | 2 |
| 11 | M015 | Science_museum | Uk | EUROPE | Science | science@@museum.org | Active | 18JUN1857 | 3200000 | 4.5 | 169 | Uk-EUROPE | 1 |
| 12 | M006 | Metropolitan | Usa | NORTH_AMERICA | Art | info@met.org | Active | 13APR1870 | 6500000 | 4.8 | 156 | Usa-NA | 1 |
| 13 | M010 | Moma | Usa | NORTH_AMERICA | Modernart | info@moma.org | Closed | 07NOV1929 | 3100000 | 4.6 | 97 | Usa-NA | 1 |
| 14 | M007 | Hermitage | Russia | OTHER | History | museum@hermitage.ru | Active | 17FEB1764 | 4500000 | . | 262 | Russia-EMEA | 2 |
| 15 | M014 | Egyptian_museum | Egypt | OTHER | Archaeology | egypt@museum.org | Active | 15NOV1902 | 1500000 | 4.7 | 124 | Egypt-MEA | 1 |
%clean_text(museum_clean,country);
OUTPUT:
| Obs | museum_id | museum_name | country | region | category | contact_email | status | opening_date | annual_visitors | avg_rating | years_open | full_location | missing_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M012 | National_museum | India | ASIA_PACIFIC | History | Active | 15AUG1947 | 1200000 | 4.4 | 79 | India-APAC | 2 | |
| 2 | M013 | Tokyo_national | Japan | ASIA_PACIFIC | History | tokyo@museum.jp | Active | 10APR1872 | 2500000 | 4.8 | 154 | Japan-APAC | 1 |
| 3 | M001 | Louvre | France | EUROPE | Art | contact@louvre.fr | Active | 08APR1793 | 8900000 | 4.9 | 233 | France-EU | 1 |
| 4 | M002 | British_museum | Uk | EUROPE | History | info@britishmuseum.org | Active | 03JUL1976 | 6200000 | 4.8 | 50 | Uk-EUROPE | 1 |
| 5 | M003 | Vatican_museum | Italy | EUROPE | History | Active | 03JUL1976 | 2000 | 4.9 | 50 | Italy-EUR | 2 | |
| 6 | M004 | Prado | Spain | EUROPE | Art | museum@prado.es | Active | 19NOV1819 | 3400000 | 5.0 | 207 | Spain-EU | 1 |
| 7 | M005 | Rijksmuseum | Netherlands | EUROPE | Art | Active | 20APR1800 | 2700000 | 4.7 | 226 | Netherlands-EUROPE | 2 | |
| 8 | M008 | Uffizi | Italy | EUROPE | Painting | Active | 03JUL1976 | 2200000 | 4.9 | 50 | Italy-EU | 2 | |
| 9 | M009 | Acropolis | Greece | EUROPE | Archaeology | acropolis@mail | Active | 03JUL1976 | 1800000 | 4.5 | 50 | Greece-EU | 1 |
| 10 | M011 | Null | Germany | EUROPE | Art | museum@berlin.de | Active | 03JUL1976 | . | 4.7 | 50 | Germany-EU | 2 |
| 11 | M015 | Science_museum | Uk | EUROPE | Science | science@@museum.org | Active | 18JUN1857 | 3200000 | 4.5 | 169 | Uk-EUROPE | 1 |
| 12 | M006 | Metropolitan | Usa | NORTH_AMERICA | Art | info@met.org | Active | 13APR1870 | 6500000 | 4.8 | 156 | Usa-NA | 1 |
| 13 | M010 | Moma | Usa | NORTH_AMERICA | Modernart | info@moma.org | Closed | 07NOV1929 | 3100000 | 4.6 | 97 | Usa-NA | 1 |
| 14 | M007 | Hermitage | Russia | OTHER | History | museum@hermitage.ru | Active | 17FEB1764 | 4500000 | . | 262 | Russia-EMEA | 2 |
| 15 | M014 | Egyptian_museum | Egypt | OTHER | Archaeology | egypt@museum.org | Active | 15NOV1902 | 1500000 | 4.7 | 124 | Egypt-MEA | 1 |
Macros
improve reproducibility and reduce maintenance costs.
9.R Raw Dataset
library(tibble)
museum_raw <- tibble(
museum_id = c(
"M001","M001","M002","M003","M004",
"M005","M006","M007","M008","M009",
"M010","M011","M012","M013","M014","M015"),
museum_name = c("Louvre","Louvre","British_Museum","Vatican_Museum","Prado",
"Rijksmuseum","Metropolitan","Hermitage","Uffizi","Acropolis",
"Moma","NULL","National_Museum","Tokyo_National","Egyptian_Museum","Science_Museum"),
country = c("France","France","UK","Italy","Spain","Netherlands","USA",
"Russia","Italy","Greece","USA","Germany","India","Japan","Egypt","UK"),
region = c("EU","eu","EUROPE","Eur","EU","europe","NA","EMEA","EU","eu",
"na","EU","APAC","apac","MEA","EUROPE"),
opening_date = c("08APR1793","08APR1793",NA,"15FEB1506","19NOV1819","20APR1800",
"13APR1870","17FEB1764","1581-01-01","BADDATE","07NOV1929",NA,"15AUG1947",
"10APR1872","1902-11-15","18JUN1857"),
annual_visitors = c(8900000,8900000,6200000,-2000,3400000,2700000,6500000,
4500000,2200000,1800000,3100000,NA,1200000,2500000,1500000,3200000),
avg_rating = c(4.9,4.9,4.8,4.9,7.8,4.7,4.8,-1,4.9,4.5,4.6,4.7,4.4,4.8,4.7,4.5),
category = c("Art","Art","History","History","Art","Art","Artt","History",
"Painting","Archaeology","ModernArt","Art","Histroy","History","Archaeology",
"Science"),
contact_email = c("contact@louvre.fr","contact@louvre.fr","info@britishmuseum.org",
"invalidemail","museum@prado.es","NULL","info@met.org","museum@hermitage.ru",
"uffizi.it","acropolis@mail","info@moma.org","museum@berlin.de","contactmuseum.in",
"tokyo@museum.jp","egypt@museum.org","science@@museum.org"),
status = c("Active","Active","Active","Active","Active","Active","Active",
"Active","Active","Active","Closed","Active","Active","Active","Active",
"Active")
)
OUTPUT:
|
museum_id |
museum_name |
country |
region |
opening_date |
annual_visitors |
avg_rating |
category |
contact_email |
status |
|
M001 |
Louvre |
France |
EU |
08APR1793 |
8900000 |
4.9 |
Art |
contact@louvre.fr |
Active |
|
M001 |
Louvre |
France |
eu |
08APR1793 |
8900000 |
4.9 |
Art |
contact@louvre.fr |
Active |
|
M002 |
British_Museum |
UK |
EUROPE |
6200000 |
4.8 |
History |
info@britishmuseum.org |
Active |
|
|
M003 |
Vatican_Museum |
Italy |
Eur |
15FEB1506 |
-2000 |
4.9 |
History |
invalidemail |
Active |
|
M004 |
Prado |
Spain |
EU |
19NOV1819 |
3400000 |
7.8 |
Art |
museum@prado.es |
Active |
|
M005 |
Rijksmuseum |
Netherlands |
europe |
20APR1800 |
2700000 |
4.7 |
Art |
NULL |
Active |
|
M006 |
Metropolitan |
USA |
NA |
13APR1870 |
6500000 |
4.8 |
Artt |
info@met.org |
Active |
|
M007 |
Hermitage |
Russia |
EMEA |
17FEB1764 |
4500000 |
-1 |
History |
museum@hermitage.ru |
Active |
|
M008 |
Uffizi |
Italy |
EU |
1581-01-01 |
2200000 |
4.9 |
Painting |
uffizi.it |
Active |
|
M009 |
Acropolis |
Greece |
eu |
BADDATE |
1800000 |
4.5 |
Archaeology |
acropolis@mail |
Active |
|
M010 |
Moma |
USA |
na |
07NOV1929 |
3100000 |
4.6 |
ModernArt |
info@moma.org |
Closed |
|
M011 |
NULL |
Germany |
EU |
4.7 |
Art |
museum@berlin.de |
Active |
||
|
M012 |
National_Museum |
India |
APAC |
15AUG1947 |
1200000 |
4.4 |
Histroy |
contactmuseum.in |
Active |
|
M013 |
Tokyo_National |
Japan |
apac |
10APR1872 |
2500000 |
4.8 |
History |
tokyo@museum.jp |
Active |
|
M014 |
Egyptian_Museum |
Egypt |
MEA |
1902-11-15 |
1500000 |
4.7 |
Archaeology |
egypt@museum.org |
Active |
|
M015 |
Science_Museum |
UK |
EUROPE |
18JUN1857 |
3200000 |
4.5 |
Science |
science@@museum.org |
Active |
10.Equivalent R Cleaning Workflow
library(tidyverse)
library(janitor)
library(lubridate)
museum_clean <- museum_raw %>%
clean_names() %>%
mutate(
opening_date = parse_date_time(opening_date,
orders = c("dby","Ymd" ),quiet = TRUE),
museum_id = str_to_title(str_trim(museum_id)),
museum_name = str_to_title(str_trim(museum_name)),
country = str_to_title(country),
region = str_to_upper(str_trim(region)),
annual_visitors = abs(annual_visitors),
avg_rating = if_else(avg_rating > 5,5,avg_rating),
category = str_replace_all(category,"Artt","Art"),
category = str_replace_all(category,"Histroy","History"),
contact_email = if_else(grepl("@",contact_email),
contact_email,NA_character_),
opening_date = coalesce(as.Date(opening_date),
today() - years(50))
) %>%
distinct(museum_id,.keep_all=TRUE)
OUTPUT:
|
museum_id |
museum_name |
country |
region |
opening_date |
annual_visitors |
avg_rating |
category |
contact_email |
status |
|
M001 |
Louvre |
France |
EU |
1793-04-08 |
8900000 |
4.9 |
Art |
contact@louvre.fr |
Active |
|
M002 |
British_museum |
Uk |
EUROPE |
1976-07-03 |
6200000 |
4.8 |
History |
info@britishmuseum.org |
Active |
|
M003 |
Vatican_museum |
Italy |
EUR |
1506-02-15 |
2000 |
4.9 |
History |
Active |
|
|
M004 |
Prado |
Spain |
EU |
1819-11-19 |
3400000 |
5 |
Art |
museum@prado.es |
Active |
|
M005 |
Rijksmuseum |
Netherlands |
EUROPE |
1800-04-20 |
2700000 |
4.7 |
Art |
Active |
|
|
M006 |
Metropolitan |
Usa |
NA |
1870-04-13 |
6500000 |
4.8 |
Art |
info@met.org |
Active |
|
M007 |
Hermitage |
Russia |
EMEA |
1764-02-17 |
4500000 |
-1 |
History |
museum@hermitage.ru |
Active |
|
M008 |
Uffizi |
Italy |
EU |
1581-01-01 |
2200000 |
4.9 |
Painting |
Active |
|
|
M009 |
Acropolis |
Greece |
EU |
1976-07-03 |
1800000 |
4.5 |
Archaeology |
acropolis@mail |
Active |
|
M010 |
Moma |
Usa |
NA |
1929-11-07 |
3100000 |
4.6 |
ModernArt |
info@moma.org |
Closed |
|
M011 |
Null |
Germany |
EU |
1976-07-03 |
4.7 |
Art |
museum@berlin.de |
Active |
|
|
M012 |
National_museum |
India |
APAC |
1947-08-15 |
1200000 |
4.4 |
History |
Active |
|
|
M013 |
Tokyo_national |
Japan |
APAC |
1872-04-10 |
2500000 |
4.8 |
History |
tokyo@museum.jp |
Active |
|
M014 |
Egyptian_museum |
Egypt |
MEA |
1902-11-15 |
1500000 |
4.7 |
Archaeology |
egypt@museum.org |
Active |
|
M015 |
Science_museum |
Uk |
EUROPE |
1857-06-18 |
3200000 |
4.5 |
Science |
science@@museum.org |
Active |
SAS vs R Mapping
|
SAS |
R |
|
STRIP |
str_trim |
|
PROPCASE |
str_to_title |
|
ABS |
abs |
|
FIND |
grepl |
|
CATX |
unite |
|
IF THEN |
if_else |
|
PROC
SUMMARY |
summarise |
Enterprise Validation and
Compliance
Clinical
environments demand:
- SDTM compliance
- ADaM traceability
- Independent QC
- Audit trails
- Metadata governance
- Validation programming
One
dangerous SAS behavior is:
if age <18 then delete;
Missing
values are treated as smaller than any number.
Therefore:
age=.
also
satisfies:
age<18
Potentially
deleting valid patients.
This has
caused real regulatory findings.
Business Logic Behind
Cleaning Decisions
Data
cleaning is not cosmetic work.
It
represents business logic implementation.
Negative
visitors become positive because visitor counts cannot be negative.
Missing
opening dates are imputed because trend models require continuous time
dimensions.
Age
corrections prevent pediatric patients entering adult cohorts.
Text
normalization ensures grouping accuracy.
For
example:
EU
eu
Europe
EUR
should
represent one business concept.
Without
standardization dashboards produce four categories instead of one.
Similarly,
malformed emails break notification systems.
Missing
values affect denominators in statistical models.
Clinical
trial analyses may incorrectly calculate safety populations.
Cleaning
therefore protects downstream analytics rather than improving appearance.
Twenty Best Practices
- Define metadata first.
- Standardize naming
conventions.
- Validate ranges early.
- Deduplicate before
derivations.
- Separate raw and cleaned
layers.
- Never overwrite source data.
- Use reusable macros.
- Maintain audit trails.
- Perform independent QC.
- Validate joins.
- Version control programs.
- Automate checks.
- Document assumptions.
- Standardize formats.
- Monitor truncation risks.
- Log all imputations.
- Protect business keys.
- Validate timestamps.
- Use defensive programming.
- Deploy production
checklists.
Twenty Enterprise Insights
- Dirty data creates expensive
business mistakes.
- Validation logic beats visual
inspection.
- Standardization improves
reproducibility.
- Metadata drives consistency.
- Missing values carry
meaning.
- Audit trails protect
submissions.
- Macros reduce risk.
- Duplicate keys destroy
trust.
- Formats improve reporting
quality.
- Defensive programming saves
projects.
- Traceability enables
compliance.
- Automation reduces human
error.
- Dates require strict
governance.
- Documentation is part of
programming.
- Business rules must be
explicit.
- QC independence matters.
- Reusable code scales faster.
- Standardization improves AI
models.
- Clean data creates reliable
intelligence.
- Analytics is only as good as
its inputs.
SAS Versus R in Enterprise
Cleaning
SAS
dominates auditability, regulatory traceability, scalability, and controlled
production environments.
R
dominates flexibility, visualization, package ecosystems, and exploratory
analytics.
SAS
excels in:
- SDTM
- ADaM
- TLF production
- submission traceability
R excels
in:
- feature engineering
- machine learning
- text processing
- rapid experimentation
Together
they create the ideal enterprise ecosystem.
Final Thoughts
Modern
analytics pipelines succeed or fail long before machine learning begins.
The
foundation is clean, validated, traceable data.
Whether
working with museum intelligence, banking transactions, insurance claims,
retail operations, or clinical trials, identical principles apply:
- Standardize.
- Validate.
- Document.
- Automate.
- Audit.
SAS
provides industrial-strength governance and reproducibility.
R
provides agility and analytical creativity.
Organizations
that combine both technologies build trustworthy business intelligence systems
capable of supporting executives, regulators, scientists, and AI platforms with
confidence.
The
future belongs not to organizations with the most data, but to organizations
with the most reliable data.
Interview Questions
Q1.How would you remove duplicates?
Answer: PROC SORT NODUPKEY in SAS or
distinct() in R.
Q2.How would you detect invalid emails?
Answer: FIND or INDEX in SAS and grepl()
in R.
Q3.Why is missing numeric handling dangerous in SAS?
Answer: Missing values are smaller than
all valid numbers.
Q4.When should PROC SQL replace DATA Step MERGE?
Answer: Complex joins and many-to-many
relationships.
Q5.How do you ensure audit readiness?
Answer: Traceability, QC independence,
metadata control, and reproducible programming.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about MUSEUM DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment