From Raw Trial Errors to Clean SDTM DM Analytics
Clinical Trial Data Chaos
to SDTM Intelligence: Advanced SAS and R Data Cleaning, Validation, and DM
Domain Engineering Framework
INTRODUCTION
In
clinical trials, raw source data is rarely analysis-ready. Real-world clinical
datasets frequently contain:
- inconsistent formats
- duplicate records
- missing values
- mixed case text
- invalid treatment
information
- corrupted dates
- invalid dosing values
- inconsistent subject-level
identifiers
The
objective of this project is to simulate real-world pharmaceutical data
engineering problems and transform unstructured raw datasets into a
standardized SDTM-style DM domain using both SAS and R programming.
This
project demonstrates:
- Clinical data cleaning
- SDTM DM derivation logic
- Data standardization
- Missing data handling
- Duplicate management
- Clinical validation checks
- First dose derivation
- End of study derivation
- Death flag derivation
- Cross-platform SAS vs R
implementation
The
project uses three raw datasets:
- raw_demog
- raw_dosing
- raw_disposition
These
datasets intentionally contain multiple real-world data quality problems to
simulate production-level clinical programming challenges.
RAW DATA OVERVIEW
1. RAW DEMOGRAPHICS DATASET (raw_demog)
Purpose
Contains
subject-level demographic and enrollment information.
Important Variables
|
Variable |
Description |
|
SUBJID |
Subject
Identifier |
|
SITEID |
Site
Number |
|
SEX |
Gender |
|
DOB |
Date of
Birth |
|
RACE |
Race |
|
COUNTRY |
Country |
|
ARM |
Treatment
Arm |
|
CONSENT_DATE |
Informed
Consent Date |
Embedded Data Issues
|
Problem |
Example |
|
Mixed
SEX values |
Male,
FEMALE, m, f |
|
Different
DOB formats |
01JAN1980,
1985-02-15 |
|
Missing
race |
Blank
RACE |
|
Mixed
COUNTRY case |
india,
INDIA, UsA |
|
Extra
spaces |
"
FEMALE " |
|
Duplicate
subjects |
SUBJID
101 repeated |
Dataset 1: raw_demog
DATA raw_demog;
LENGTH SUBJID $10 SEX $10 DOB $15 RACE $20 COUNTRY $15
ARM $20 CONSENT_DATE $15;
INFILE DATALINES DLM='|' DSD TRUNCOVER;
INPUT SUBJID :$10. SITEID SEX :$10. DOB :$15. RACE :$20.
COUNTRY :$15. ARM :$20. CONSENT_DATE :$15.;
DATALINES;
101|1001|Male|01JAN1980|Asian|india|Placebo|01-01-2025
102|1001|FEMALE|1985-02-15|White|USA|Drug A|02/15/2025
103|1002|m|15/03/1978|Black|uk|Drug B|2025-03-01
104|1002|f|1970/04/22||India|Placebo|01APR2025
105|1003|MALE|05-05-1990|Asian|SPAIN|Drug A|2025/05/05
106|1003|Female|1992-06-10|White|usa|Drug B|06-10-2025
107|1004|M|11JUL1985|Black|India|Placebo|07/11/2025
108|1004|F|1988-08-20|Asian|PAKISTAN|Drug A|2025-08-20
109|1005|unknown|09/09/1975||india|Drug B|09SEP2025
110|1005|male|1977-10-01|White|Uk|Placebo|10/01/2025
111|1006| FEMALE |12NOV1982|Asian| INDIA |Drug A|11-12-2025
112|1006|m|1981/12/25|Black|UsA|Drug B|2025-12-25
113|1007|Female|03-01-1995|White|Canada|Placebo|01/03/2025
114|1007|M|1993-02-14||india|Drug A|14FEB2025
115|1008|f|15MAR1987|Asian|spain|Drug B|2025/03/15
101|1001|Male|01JAN1980|Asian|india|Placebo|01-01-2025
;
RUN;
PROC PRINT DATA=raw_demog;
RUN;
OUTPUT:
| Obs | SUBJID | SEX | DOB | RACE | COUNTRY | ARM | CONSENT_DATE | SITEID |
|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Male | 01JAN1980 | Asian | india | Placebo | 01-01-2025 | 1001 |
| 2 | 102 | FEMALE | 1985-02-15 | White | USA | Drug A | 02/15/2025 | 1001 |
| 3 | 103 | m | 15/03/1978 | Black | uk | Drug B | 2025-03-01 | 1002 |
| 4 | 104 | f | 1970/04/22 | India | Placebo | 01APR2025 | 1002 | |
| 5 | 105 | MALE | 05-05-1990 | Asian | SPAIN | Drug A | 2025/05/05 | 1003 |
| 6 | 106 | Female | 1992-06-10 | White | usa | Drug B | 06-10-2025 | 1003 |
| 7 | 107 | M | 11JUL1985 | Black | India | Placebo | 07/11/2025 | 1004 |
| 8 | 108 | F | 1988-08-20 | Asian | PAKISTAN | Drug A | 2025-08-20 | 1004 |
| 9 | 109 | unknown | 09/09/1975 | india | Drug B | 09SEP2025 | 1005 | |
| 10 | 110 | male | 1977-10-01 | White | Uk | Placebo | 10/01/2025 | 1005 |
| 11 | 111 | FEMALE | 12NOV1982 | Asian | INDIA | Drug A | 11-12-2025 | 1006 |
| 12 | 112 | m | 1981/12/25 | Black | UsA | Drug B | 2025-12-25 | 1006 |
| 13 | 113 | Female | 03-01-1995 | White | Canada | Placebo | 01/03/2025 | 1007 |
| 14 | 114 | M | 1993-02-14 | india | Drug A | 14FEB2025 | 1007 | |
| 15 | 115 | f | 15MAR1987 | Asian | spain | Drug B | 2025/03/15 | 1008 |
| 16 | 101 | Male | 01JAN1980 | Asian | india | Placebo | 01-01-2025 | 1001 |
# RAW DEMOGRAPHICS DATASET
# Purpose:Source dataset for SDTM DM derivation
library(dplyr)
library(stringr)
library(lubridate)
raw_demog <- data.frame(
SUBJID = c(
"101","102","103","104","105","106","107","108",
"109","110","111","112","113","114","115","101"
),
SITEID = c(1001,1001,1002,1002,1003,1003,1004,1004,
1005,1005,1006,1006,1007,1007,1008,1001
),
SEX = c("Male","FEMALE","m","f","MALE","Female","M","F",
"unknown","male","FEMALE ","m","Female","M","f","Male"
),
DOB = c(
"01JAN1980","1985-02-15","15/03/1978","1970/04/22",
"05-05-1990","1992-06-10","11JUL1985","1988-08-20",
"09/09/1975","1977-10-01","12NOV1982","1981/12/25",
"03-01-1995","1993-02-14","15MAR1987","01JAN1980"
),
RACE = c(
"Asian","White","Black","","Asian","White","Black",
"Asian","","White","Asian","Black","White","","Asian","Asian"
),
COUNTRY = c(
"india","USA","uk","India","SPAIN","usa","India",
"PAKISTAN","india","Uk"," INDIA ","UsA","Canada",
"india","spain","india"
),
ARM = c(
"Placebo","Drug A","Drug B","Placebo","Drug A","Drug B",
"Placebo","Drug A","Drug B","Placebo","Drug A","Drug B",
"Placebo","Drug A","Drug B","Placebo"
),
CONSENT_DATE = c(
"01-01-2025","02/15/2025","2025-03-01","01APR2025",
"2025/05/05","06-10-2025","07/11/2025","2025-08-20",
"09SEP2025","10/01/2025","11-12-2025","2025-12-25",
"01/03/2025","14FEB2025","2025/03/15","01-01-2025"
),
stringsAsFactors = FALSE
)
print(raw_demog)
|
|
SUBJID |
SITEID |
SEX |
DOB |
RACE |
COUNTRY |
ARM |
CONSENT_DATE |
|
1 |
101 |
1001 |
Male |
01-Jan-1980 |
Asian |
india |
Placebo |
01-01-2025 |
|
2 |
102 |
1001 |
FEMALE |
15-02-1985 |
White |
USA |
Drug A |
02/15/2025 |
|
3 |
103 |
1002 |
m |
15-03-1978 |
Black |
uk |
Drug B |
01-03-2025 |
|
4 |
104 |
1002 |
f |
22-04-1970 |
|
India |
Placebo |
01-Apr-25 |
|
5 |
105 |
1003 |
MALE |
05-05-1990 |
Asian |
SPAIN |
Drug A |
05-05-2025 |
|
6 |
106 |
1003 |
Female |
10-06-1992 |
White |
usa |
Drug B |
06-10-2025 |
|
7 |
107 |
1004 |
M |
11-Jul-1985 |
Black |
India |
Placebo |
07-11-2025 |
|
8 |
108 |
1004 |
F |
20-08-1988 |
Asian |
PAKISTAN |
Drug A |
20-08-2025 |
|
9 |
109 |
1005 |
unknown |
09-09-1975 |
|
india |
Drug B |
09-Sep-2025 |
|
10 |
110 |
1005 |
male |
01-10-1977 |
White |
Uk |
Placebo |
10-01-2025 |
|
11 |
111 |
1006 |
FEMALE |
12-Nov-1982 |
Asian |
INDIA |
Drug A |
11-12-2025 |
|
12 |
112 |
1006 |
m |
25-12-1981 |
Black |
UsA |
Drug B |
25-12-2025 |
|
13 |
113 |
1007 |
Female |
03-01-1995 |
White |
Canada |
Placebo |
01-03-2025 |
|
14 |
114 |
1007 |
M |
14-02-1993 |
|
india |
Drug A |
14-Feb-2025 |
|
15 |
115 |
1008 |
f |
15-Mar-1987 |
Asian |
spain |
Drug B |
15-03-2025 |
|
16 |
101 |
1001 |
Male |
01-Jan-1980 |
Asian |
india |
Placebo |
01-01-2025 |
2. RAW DOSING DATASET (raw_dosing)
Purpose
Contains
treatment exposure and dose administration records.
Important Variables
|
Variable |
Description |
|
SUBJID |
Subject
ID |
|
DOSE_DATE |
Dose
Date |
|
DOSE_AMT |
Dose
Amount |
|
VISIT |
Visit
Information |
Embedded Issues
|
Problem |
Example |
|
Missing
dose dates |
Blank
DOSE_DATE |
|
Duplicate
records |
Subject
101 repeated |
|
Negative
dose values |
-50 |
|
Different
date formats |
01JAN2025,
2025-01-15 |
|
Visit
inconsistency |
Visit
1, VISIT-2 |
Dataset 2: raw_dosing
/*-----------------------------------------------------------
RAW DOSING DATASET
Purpose:Source dataset for SDTM DM/EX derivations
-----------------------------------------------------------*/
DATA raw_dosing;
LENGTH SUBJID $10 DOSE_DATE $15 VISIT $20;
INFILE DATALINES DLM='|' DSD TRUNCOVER;
INPUT SUBJID :$10. DOSE_DATE :$15. DOSE_AMT VISIT :$20.;
DATALINES;
101|01JAN2025|100|Visit 1
101|2025-01-15|100|VISIT-2
101|03/01/2025|100|visit 3
102|02/15/2025|200|Screening
102|2025-02-20|200|VISIT-2
103|01MAR2025|150|Visit 1
103||150|Visit 2
104|2025/04/01|0|SCREENING
104|04-15-2025|100|Visit 2
105|05MAY2025|-50|visit 1
105|2025-05-20|100|VISIT-2
106|06/10/2025|120|Visit 1
106|06/10/2025|120|Visit 1
107|07-11-2025|80|SCREENING
108|2025-08-20|90|Visit 1
108|08/25/2025|90|VISIT-2
109||110|Visit 1
110|10/01/2025|100|visit-1
111|11DEC2025|200|Visit 1
112|2025/12/25|150|VISIT 2
113|01/03/2025|100|Screening
114|14FEB2025|0|Visit 1
115|2025-03-15|-25|VISIT-2
101|01JAN2025|100|Visit 1
;
RUN;
PROC PRINT DATA=raw_dosing;
RUN;
OUTPUT:
| Obs | SUBJID | DOSE_DATE | VISIT | DOSE_AMT |
|---|---|---|---|---|
| 1 | 101 | 01JAN2025 | Visit 1 | 100 |
| 2 | 101 | 2025-01-15 | VISIT-2 | 100 |
| 3 | 101 | 03/01/2025 | visit 3 | 100 |
| 4 | 102 | 02/15/2025 | Screening | 200 |
| 5 | 102 | 2025-02-20 | VISIT-2 | 200 |
| 6 | 103 | 01MAR2025 | Visit 1 | 150 |
| 7 | 103 | Visit 2 | 150 | |
| 8 | 104 | 2025/04/01 | SCREENING | 0 |
| 9 | 104 | 04-15-2025 | Visit 2 | 100 |
| 10 | 105 | 05MAY2025 | visit 1 | -50 |
| 11 | 105 | 2025-05-20 | VISIT-2 | 100 |
| 12 | 106 | 06/10/2025 | Visit 1 | 120 |
| 13 | 106 | 06/10/2025 | Visit 1 | 120 |
| 14 | 107 | 07-11-2025 | SCREENING | 80 |
| 15 | 108 | 2025-08-20 | Visit 1 | 90 |
| 16 | 108 | 08/25/2025 | VISIT-2 | 90 |
| 17 | 109 | Visit 1 | 110 | |
| 18 | 110 | 10/01/2025 | visit-1 | 100 |
| 19 | 111 | 11DEC2025 | Visit 1 | 200 |
| 20 | 112 | 2025/12/25 | VISIT 2 | 150 |
| 21 | 113 | 01/03/2025 | Screening | 100 |
| 22 | 114 | 14FEB2025 | Visit 1 | 0 |
| 23 | 115 | 2025-03-15 | VISIT-2 | -25 |
| 24 | 101 | 01JAN2025 | Visit 1 | 100 |
# RAW DOSING DATASET
# Purpose:Source dataset for SDTM DM/EX derivations
raw_dosing <- data.frame(
SUBJID = c(
"101","101","101","102","102","103","103","104",
"104","105","105","106","106","107","108","108",
"109","110","111","112","113","114","115","101"
),
DOSE_DATE = c(
"01JAN2025","2025-01-15","03/01/2025","02/15/2025",
"2025-02-20","01MAR2025","",
"2025/04/01","04-15-2025","05MAY2025","2025-05-20",
"06/10/2025","06/10/2025","07-11-2025","2025-08-20",
"08/25/2025","","10/01/2025","11DEC2025","2025/12/25",
"01/03/2025","14FEB2025","2025-03-15","01JAN2025"
),
DOSE_AMT = c(
100,100,100,200,200,150,150,0,100,-50,100,120,
120,80,90,90,110,100,200,150,100,0,-25,100
),
VISIT = c(
"Visit 1","VISIT-2","visit 3","Screening","VISIT-2",
"Visit 1","Visit 2","SCREENING","Visit 2","visit 1",
"VISIT-2","Visit 1","Visit 1","SCREENING","Visit 1",
"VISIT-2","Visit 1","visit-1","Visit 1","VISIT 2",
"Screening","Visit 1","VISIT-2","Visit 1"
),
stringsAsFactors = FALSE
)
print(raw_dosing)
OUTPUT:
|
|
SUBJID |
DOSE_DATE |
DOSE_AMT |
VISIT |
|
1 |
101 |
01-Jan-2025 |
100 |
Visit 1 |
|
2 |
101 |
15-01-2025 |
100 |
VISIT-2 |
|
3 |
101 |
03-01-2025 |
100 |
visit 3 |
|
4 |
102 |
02/15/2025 |
200 |
Screening |
|
5 |
102 |
20-02-2025 |
200 |
VISIT-2 |
|
6 |
103 |
01-Mar-2025 |
150 |
Visit 1 |
|
7 |
103 |
|
150 |
Visit 2 |
|
8 |
104 |
01-04-2025 |
0 |
SCREENING |
|
9 |
104 |
04-15-2025 |
100 |
Visit 2 |
|
10 |
105 |
05-May-2025 |
-50 |
visit 1 |
|
11 |
105 |
20-05-2025 |
100 |
VISIT-2 |
|
12 |
106 |
06-10-2025 |
120 |
Visit 1 |
|
13 |
106 |
06-10-2025 |
120 |
Visit 1 |
|
14 |
107 |
07-11-2025 |
80 |
SCREENING |
|
15 |
108 |
20-08-2025 |
90 |
Visit 1 |
|
16 |
108 |
08/25/2025 |
90 |
VISIT-2 |
|
17 |
109 |
|
110 |
Visit 1 |
|
18 |
110 |
10-01-2025 |
100 |
visit-1 |
|
19 |
111 |
11-Dec-2025 |
200 |
Visit 1 |
|
20 |
112 |
25-12-2025 |
150 |
VISIT 2 |
|
21 |
113 |
01-03-2025 |
100 |
Screening |
|
22 |
114 |
14-Feb-2025 |
0 |
Visit 1 |
|
23 |
115 |
15-03-2025 |
-25 |
VISIT-2 |
|
24 |
101 |
01-Jan-2025 |
100 |
Visit 1 |
3. RAW DISPOSITION DATASET (raw_disposition)
Purpose
Contains
subject completion and discontinuation details.
Important Variables
|
Variable |
Description |
|
SUBJID |
Subject
ID |
|
DSDECOD |
Disposition
Status |
|
DSSTDTC |
Disposition
Date |
Embedded Issues
|
Problem |
Example |
|
Missing
disposition dates |
Blank
DSSTDTC |
|
Duplicate
subjects |
Subject
101 |
|
Mixed
date formats |
2025/06/20,
20JUN2025 |
|
Mixed
disposition values |
COMPLETED,
WITHDRAWN |
Dataset 3: raw_disposition
DATA raw_disposition;
LENGTH SUBJID $10 DSDECOD $30 DSSTDTC $15;
INFILE DATALINES DLM='|' DSD TRUNCOVER;
INPUT SUBJID :$10. DSDECOD :$30. DSSTDTC :$15.;
DATALINES;
101|COMPLETED|2025-06-01
102|SCREEN FAILURE|01JUN2025
103|WITHDRAWN|06/15/2025
104|LOST TO FOLLOW-UP|2025/06/20
105|COMPLETED|20JUN2025
106|DEATH|2025-07-01
107|WITHDRAWN|
108|COMPLETED|07/15/2025
109|SCREEN FAILURE|2025-07-20
110|LOST TO FOLLOW-UP|20JUL2025
111|COMPLETED|2025/08/01
112|WITHDRAWN|08/10/2025
113|DEATH|2025-08-15
114|COMPLETED|15AUG2025
115|SCREEN FAILURE|
101|COMPLETED|2025-06-01
;
RUN;
PROC PRINT DATA = raw_disposition;
RUN;
OUTPUT:
| Obs | SUBJID | DSDECOD | DSSTDTC |
|---|---|---|---|
| 1 | 101 | COMPLETED | 2025-06-01 |
| 2 | 102 | SCREEN FAILURE | 01JUN2025 |
| 3 | 103 | WITHDRAWN | 06/15/2025 |
| 4 | 104 | LOST TO FOLLOW-UP | 2025/06/20 |
| 5 | 105 | COMPLETED | 20JUN2025 |
| 6 | 106 | DEATH | 2025-07-01 |
| 7 | 107 | WITHDRAWN | |
| 8 | 108 | COMPLETED | 07/15/2025 |
| 9 | 109 | SCREEN FAILURE | 2025-07-20 |
| 10 | 110 | LOST TO FOLLOW-UP | 20JUL2025 |
| 11 | 111 | COMPLETED | 2025/08/01 |
| 12 | 112 | WITHDRAWN | 08/10/2025 |
| 13 | 113 | DEATH | 2025-08-15 |
| 14 | 114 | COMPLETED | 15AUG2025 |
| 15 | 115 | SCREEN FAILURE | |
| 16 | 101 | COMPLETED | 2025-06-01 |
# RAW DISPOSITION DATASET
raw_disposition <- data.frame(
SUBJID = c(
"101","102","103","104","105","106","107","108",
"109","110","111","112","113","114","115","101"
),
DSDECOD = c(
"COMPLETED","SCREEN FAILURE","WITHDRAWN",
"LOST TO FOLLOW-UP","COMPLETED","DEATH",
"WITHDRAWN","COMPLETED","SCREEN FAILURE",
"LOST TO FOLLOW-UP","COMPLETED","WITHDRAWN",
"DEATH","COMPLETED","SCREEN FAILURE","COMPLETED"
),
DSSTDTC = c(
"2025-06-01","01JUN2025","06/15/2025","2025/06/20",
"20JUN2025","2025-07-01","","07/15/2025",
"2025-07-20","20JUL2025","2025/08/01",
"08/10/2025","2025-08-15","15AUG2025",
"","2025-06-01"
),
stringsAsFactors = FALSE
)
print(raw_disposition)
OUTPUT:
|
|
SUBJID |
DSDECOD |
DSSTDTC |
|
1 |
101 |
COMPLETED |
01-06-2025 |
|
2 |
102 |
SCREEN FAILURE |
01-Jun-2025 |
|
3 |
103 |
WITHDRAWN |
06/15/2025 |
|
4 |
104 |
LOST TO FOLLOW-UP |
20-06-2025 |
|
5 |
105 |
COMPLETED |
20-Jun-2025 |
|
6 |
106 |
DEATH |
01-07-2025 |
|
7 |
107 |
WITHDRAWN |
|
|
8 |
108 |
COMPLETED |
07/15/2025 |
|
9 |
109 |
SCREEN FAILURE |
20-07-2025 |
|
10 |
110 |
LOST TO FOLLOW-UP |
20-Jul-2025 |
|
11 |
111 |
COMPLETED |
01-08-2025 |
|
12 |
112 |
WITHDRAWN |
08-10-2025 |
|
13 |
113 |
DEATH |
15-08-2025 |
|
14 |
114 |
COMPLETED |
15-Aug-2025 |
|
15 |
115 |
SCREEN FAILURE |
|
|
16 |
101 |
COMPLETED |
01-06-2025 |
# FUNCTION FOR DATE CONVERSION
convert_date <- function(x){
parse_date_time(
x,
orders = c(
"dmy","ymd","mdy","d-b-y","y/m/d","d/b/y","dmy HMS"
)
)
}
Step 4: Clean Demographics
DATA demog_clean;
set raw_demog;
/* sex*/
sex = upcase(strip(sex));
if sex in ('M','MALE') then sex = 'M';
else if sex in ('F','FEMALE') then sex = 'F';
else sex = 'U';
/* race*/
race = upcase(strip(race));
if race = "" then race = "UNKNOWN";
/* country*/
country = upcase(strip(country));
/* arm*/
arm = upcase(strip(arm));
/* dob*/
DOB_NUM = input(DOB,anydtdte15.);
format DOB_NUM Date9.;
/* consent date*/
RFICDT = input(CONSENT_DATE,anydtdte15.);
format RFICDT yymmdd10.;
rename DOB_NUM=DOB RFICDT=RFICDTC;
drop dob consent_date;
run;
proc print;run;
OUTPUT:
| Obs | SUBJID | SEX | RACE | COUNTRY | ARM | SITEID | DOB | RFICDTC |
|---|---|---|---|---|---|---|---|---|
| 1 | 101 | M | ASIAN | INDIA | PLACEBO | 1001 | 01JAN1980 | 2025-01-01 |
| 2 | 102 | F | WHITE | USA | DRUG A | 1001 | 15FEB1985 | 2025-02-15 |
| 3 | 103 | M | BLACK | UK | DRUG B | 1002 | 15MAR1978 | 2025-03-01 |
| 4 | 104 | F | UNKNOWN | INDIA | PLACEBO | 1002 | 22APR1970 | 2025-04-01 |
| 5 | 105 | M | ASIAN | SPAIN | DRUG A | 1003 | 05MAY1990 | 2025-05-05 |
| 6 | 106 | F | WHITE | USA | DRUG B | 1003 | 10JUN1992 | 2025-06-10 |
| 7 | 107 | M | BLACK | INDIA | PLACEBO | 1004 | 11JUL1985 | 2025-07-11 |
| 8 | 108 | F | ASIAN | PAKISTAN | DRUG A | 1004 | 20AUG1988 | 2025-08-20 |
| 9 | 109 | U | UNKNOWN | INDIA | DRUG B | 1005 | 09SEP1975 | 2025-09-09 |
| 10 | 110 | M | WHITE | UK | PLACEBO | 1005 | 01OCT1977 | 2025-10-01 |
| 11 | 111 | F | ASIAN | INDIA | DRUG A | 1006 | 12NOV1982 | 2025-11-12 |
| 12 | 112 | M | BLACK | USA | DRUG B | 1006 | 25DEC1981 | 2025-12-25 |
| 13 | 113 | F | WHITE | CANADA | PLACEBO | 1007 | 01MAR1995 | 2025-01-03 |
| 14 | 114 | M | UNKNOWN | INDIA | DRUG A | 1007 | 14FEB1993 | 2025-02-14 |
| 15 | 115 | F | ASIAN | SPAIN | DRUG B | 1008 | 15MAR1987 | 2025-03-15 |
| 16 | 101 | M | ASIAN | INDIA | PLACEBO | 1001 | 01JAN1980 | 2025-01-01 |
# STEP 4: CLEAN DEMOGRAPHICS
demog_clean <- raw_demog %>%
mutate(
# SEX
SEX = str_trim(str_to_upper(SEX)),
SEX = case_when(SEX %in% c("MALE","M") ~ "M",
SEX %in% c("FEMALE","F") ~ "F",
TRUE ~ "U"),
# RACE
RACE = str_trim(str_to_upper(RACE)),
RACE = ifelse(RACE == "","UNKNOWN",RACE),
# COUNTRY
COUNTRY = str_trim(str_to_upper(COUNTRY)),
# ARM
ARM = str_trim(str_to_upper(ARM)),
# DOB
DOB_NUM = as.Date(convert_date(DOB)),
# CONSENT DATE
RFICDTC = as.Date(convert_date(CONSENT_DATE))
) %>%
select(
-DOB,-CONSENT_DATE
) %>%
rename(
DOB = DOB_NUM
)
print(demog_clean)
|
|
SUBJID |
SITEID |
SEX |
RACE |
COUNTRY |
ARM |
DOB |
RFICDTC |
|
1 |
101 |
1001 |
M |
ASIAN |
INDIA |
PLACEBO |
19-01-1980 |
01-01-2025 |
|
2 |
102 |
1001 |
F |
WHITE |
USA |
DRUG A |
15-02-1985 |
15-02-2025 |
|
3 |
103 |
1002 |
M |
BLACK |
UK |
DRUG B |
15-03-1978 |
01-03-2025 |
|
4 |
104 |
1002 |
F |
UNKNOWN |
INDIA |
PLACEBO |
22-04-1970 |
20-01-2025 |
|
5 |
105 |
1003 |
M |
ASIAN |
SPAIN |
DRUG A |
05-05-1990 |
05-05-2025 |
|
6 |
106 |
1003 |
F |
WHITE |
USA |
DRUG B |
10-06-1992 |
06-10-2025 |
|
7 |
107 |
1004 |
M |
BLACK |
INDIA |
PLACEBO |
19-11-1985 |
11-07-2025 |
|
8 |
108 |
1004 |
F |
ASIAN |
PAKISTAN |
DRUG A |
20-08-1988 |
20-08-2025 |
|
9 |
109 |
1005 |
U |
UNKNOWN |
INDIA |
DRUG B |
09-09-1975 |
20-09-2025 |
|
10 |
110 |
1005 |
M |
WHITE |
UK |
PLACEBO |
01-10-1977 |
01-10-2025 |
|
11 |
111 |
1006 |
F |
ASIAN |
INDIA |
DRUG A |
19-12-1982 |
11-12-2025 |
|
12 |
112 |
1006 |
M |
BLACK |
USA |
DRUG B |
25-12-1981 |
25-12-2025 |
|
13 |
113 |
1007 |
F |
WHITE |
CANADA |
PLACEBO |
03-01-1995 |
03-01-2025 |
|
14 |
114 |
1007 |
M |
UNKNOWN |
INDIA |
DRUG A |
14-02-1993 |
14-02-2025 |
|
15 |
115 |
1008 |
F |
ASIAN |
SPAIN |
DRUG B |
15-03-1987 |
15-03-2025 |
|
16 |
101 |
1001 |
M |
ASIAN |
INDIA |
PLACEBO |
19-01-1980 |
01-01-2025 |
Step 1: Demographic Cleaning
SAS Logic
sex = upcase(strip(sex));
R Logic
SEX = str_trim(str_to_upper(SEX))
Objective
- remove extra spaces
- standardize case
- improve downstream
consistency
Step 2: Gender Standardization
Logic
|
Raw Value |
Standardized |
|
Male |
M |
|
FEMALE |
F |
|
unknown |
U |
Business Need
Clinical
SDTM domains require controlled terminology.
Step 3: Race Cleaning
Missing
races converted to:
UNKNOWN
Why?
Blank
race values create downstream reporting problems.
Step 4: Country Standardization
All
country values converted to uppercase.
Example:
|
Raw |
Cleaned |
|
india |
INDIA |
|
UsA |
USA |
Step 5: Clean Dosing
DATA dosing_clean;
set raw_dosing;
DOSEDT = input(DOSE_DATE,anydtdte15.);
format DOSEDT yymmdd10.;
VISIT = Upcase(strip(VISIT));
IF DOSE_AMT < 0 THEN DOSE_AMT=.;
DROP DOSE_DATE;
RUN;
PROC PRINT;RUN;
OUTPUT:
| Obs | SUBJID | VISIT | DOSE_AMT | DOSEDT |
|---|---|---|---|---|
| 1 | 101 | VISIT 1 | 100 | 2025-01-01 |
| 2 | 101 | VISIT-2 | 100 | 2025-01-15 |
| 3 | 101 | VISIT 3 | 100 | 2025-03-01 |
| 4 | 102 | SCREENING | 200 | 2025-02-15 |
| 5 | 102 | VISIT-2 | 200 | 2025-02-20 |
| 6 | 103 | VISIT 1 | 150 | 2025-03-01 |
| 7 | 103 | VISIT 2 | 150 | . |
| 8 | 104 | SCREENING | 0 | 2025-04-01 |
| 9 | 104 | VISIT 2 | 100 | 2025-04-15 |
| 10 | 105 | VISIT 1 | . | 2025-05-05 |
| 11 | 105 | VISIT-2 | 100 | 2025-05-20 |
| 12 | 106 | VISIT 1 | 120 | 2025-06-10 |
| 13 | 106 | VISIT 1 | 120 | 2025-06-10 |
| 14 | 107 | SCREENING | 80 | 2025-07-11 |
| 15 | 108 | VISIT 1 | 90 | 2025-08-20 |
| 16 | 108 | VISIT-2 | 90 | 2025-08-25 |
| 17 | 109 | VISIT 1 | 110 | . |
| 18 | 110 | VISIT-1 | 100 | 2025-10-01 |
| 19 | 111 | VISIT 1 | 200 | 2025-12-11 |
| 20 | 112 | VISIT 2 | 150 | 2025-12-25 |
| 21 | 113 | SCREENING | 100 | 2025-01-03 |
| 22 | 114 | VISIT 1 | 0 | 2025-02-14 |
| 23 | 115 | VISIT-2 | . | 2025-03-15 |
| 24 | 101 | VISIT 1 | 100 | 2025-01-01 |
# STEP 5: CLEAN DOSING
dosing_clean <- raw_dosing %>%
mutate(
# DOSE DATE
DOSEDT = as.Date(convert_date(DOSE_DATE)),
# VISIT
VISIT = str_trim(str_to_upper(VISIT)),
# NEGATIVE DOSE
DOSE_AMT = ifelse(DOSE_AMT < 0,NA,DOSE_AMT),
) %>%
select(
-DOSE_DATE
)
print(dosing_clean)
|
|
SUBJID |
DOSE_AMT |
VISIT |
DOSEDT |
|
1 |
101 |
100 |
VISIT 1 |
20-01-2025 |
|
2 |
101 |
100 |
VISIT-2 |
15-01-2025 |
|
3 |
101 |
100 |
VISIT 3 |
01-03-2025 |
|
4 |
102 |
200 |
SCREENING |
15-02-2025 |
|
5 |
102 |
200 |
VISIT-2 |
20-02-2025 |
|
6 |
103 |
150 |
VISIT 1 |
20-01-2025 |
|
7 |
103 |
150 |
VISIT 2 |
NA |
|
8 |
104 |
0 |
SCREENING |
01-04-2025 |
|
9 |
104 |
100 |
VISIT 2 |
15-04-2025 |
|
10 |
105 |
NA |
VISIT 1 |
20-05-2025 |
|
11 |
105 |
100 |
VISIT-2 |
20-05-2025 |
|
12 |
106 |
120 |
VISIT 1 |
10-06-2025 |
|
13 |
106 |
120 |
VISIT 1 |
10-06-2025 |
|
14 |
107 |
80 |
SCREENING |
11-07-2025 |
|
15 |
108 |
90 |
VISIT 1 |
20-08-2025 |
|
16 |
108 |
90 |
VISIT-2 |
25-08-2025 |
|
17 |
109 |
110 |
VISIT 1 |
NA |
|
18 |
110 |
100 |
VISIT-1 |
01-10-2025 |
|
19 |
111 |
200 |
VISIT 1 |
20-11-2025 |
|
20 |
112 |
150 |
VISIT 2 |
25-12-2025 |
|
21 |
113 |
100 |
SCREENING |
03-01-2025 |
|
22 |
114 |
0 |
VISIT 1 |
14-02-2025 |
|
23 |
115 |
NA |
VISIT-2 |
15-03-2025 |
|
24 |
101 |
100 |
VISIT 1 |
20-01-2025 |
Step 5: Date Conversion
SAS
input(DOB,anydtdte15.)
R
parse_date_time()
Why Important?
Clinical data arrives from multiple EDC systems with inconsistent formats.
Step 6: Negative Dose Handling
Negative doses:
-50
-25
converted to missing.
Reason
Negative exposure values are clinically invalid.
Step 6: Derive First Dose Date (RFSTDTC)
PROC SORT DATA = DOSING_CLEAN;
BY SUBJID DOSEDT;
RUN;
PROC PRINT;RUN;
OUTPUT:
| Obs | SUBJID | VISIT | DOSE_AMT | DOSEDT |
|---|---|---|---|---|
| 1 | 101 | VISIT 1 | 100 | 2025-01-01 |
| 2 | 101 | VISIT 1 | 100 | 2025-01-01 |
| 3 | 101 | VISIT-2 | 100 | 2025-01-15 |
| 4 | 101 | VISIT 3 | 100 | 2025-03-01 |
| 5 | 102 | SCREENING | 200 | 2025-02-15 |
| 6 | 102 | VISIT-2 | 200 | 2025-02-20 |
| 7 | 103 | VISIT 2 | 150 | . |
| 8 | 103 | VISIT 1 | 150 | 2025-03-01 |
| 9 | 104 | SCREENING | 0 | 2025-04-01 |
| 10 | 104 | VISIT 2 | 100 | 2025-04-15 |
| 11 | 105 | VISIT 1 | . | 2025-05-05 |
| 12 | 105 | VISIT-2 | 100 | 2025-05-20 |
| 13 | 106 | VISIT 1 | 120 | 2025-06-10 |
| 14 | 106 | VISIT 1 | 120 | 2025-06-10 |
| 15 | 107 | SCREENING | 80 | 2025-07-11 |
| 16 | 108 | VISIT 1 | 90 | 2025-08-20 |
| 17 | 108 | VISIT-2 | 90 | 2025-08-25 |
| 18 | 109 | VISIT 1 | 110 | . |
| 19 | 110 | VISIT-1 | 100 | 2025-10-01 |
| 20 | 111 | VISIT 1 | 200 | 2025-12-11 |
| 21 | 112 | VISIT 2 | 150 | 2025-12-25 |
| 22 | 113 | SCREENING | 100 | 2025-01-03 |
| 23 | 114 | VISIT 1 | 0 | 2025-02-14 |
| 24 | 115 | VISIT-2 | . | 2025-03-15 |
PROC SQL;
CREATE TABLE RFSTDTC AS
SELECT SUBJID,PUT(MIN(DOSEDT),YYMMDD10.) AS RFSTDTC
FROM DOSING_CLEAN
GROUP BY SUBJID;
RUN;
PROC PRINT;RUN;
OUTPUT:
| Obs | SUBJID | RFSTDTC |
|---|---|---|
| 1 | 101 | 2025-01-01 |
| 2 | 102 | 2025-02-15 |
| 3 | 103 | 2025-03-01 |
| 4 | 104 | 2025-04-01 |
| 5 | 105 | 2025-05-05 |
| 6 | 106 | 2025-06-10 |
| 7 | 107 | 2025-07-11 |
| 8 | 108 | 2025-08-20 |
| 9 | 109 | . |
| 10 | 110 | 2025-10-01 |
| 11 | 111 | 2025-12-11 |
| 12 | 112 | 2025-12-25 |
| 13 | 113 | 2025-01-03 |
| 14 | 114 | 2025-02-14 |
| 15 | 115 | 2025-03-15 |
# STEP 6: DERIVE RFSTDTC
RFSTDTC <- dosing_clean %>%
group_by(SUBJID) %>%
summarise(
RFSTDTC = if(all(is.na(DOSEDT))){NA}
else{
min(DOSEDT, na.rm = TRUE)
}
) %>%
mutate(RFSTDTC = as.character(RFSTDTC))
print(RFSTDTC)
OUTPUT:
|
|
SUBJID |
RFSTDTC |
|
1 |
101 |
15-01-2025 |
|
2 |
102 |
15-02-2025 |
|
3 |
103 |
20-01-2025 |
|
4 |
104 |
01-04-2025 |
|
5 |
105 |
20-05-2025 |
|
6 |
106 |
10-06-2025 |
|
7 |
107 |
11-07-2025 |
|
8 |
108 |
20-08-2025 |
|
9 |
109 |
NA |
|
10 |
110 |
01-10-2025 |
|
11 |
111 |
20-11-2025 |
|
12 |
112 |
25-12-2025 |
|
13 |
113 |
03-01-2025 |
|
14 |
114 |
14-02-2025 |
|
15 |
115 |
15-03-2025 |
Step 7: First Dose Derivation (RFSTDTC)
Derived
using minimum dose date.
SAS
MIN(DOSEDT)
R
min(DOSEDT, na.rm = TRUE)
Step 7: Clean Disposition
DATA DS_CLEAN;
SET raw_disposition;
DSDECOD = UPCASE(STRIP(DSDECOD));
DS_DATE = INPUT(DSSTDTC,ANYDTDTE15.);
FORMAT DS_DATE YYMMDD10.;
RUN;
PROC PRINT;RUN;
OUTPUT:
| Obs | SUBJID | DSDECOD | DSSTDTC | DS_DATE |
|---|---|---|---|---|
| 1 | 101 | COMPLETED | 2025-06-01 | 2025-06-01 |
| 2 | 102 | SCREEN FAILURE | 01JUN2025 | 2025-06-01 |
| 3 | 103 | WITHDRAWN | 06/15/2025 | 2025-06-15 |
| 4 | 104 | LOST TO FOLLOW-UP | 2025/06/20 | 2025-06-20 |
| 5 | 105 | COMPLETED | 20JUN2025 | 2025-06-20 |
| 6 | 106 | DEATH | 2025-07-01 | 2025-07-01 |
| 7 | 107 | WITHDRAWN | . | |
| 8 | 108 | COMPLETED | 07/15/2025 | 2025-07-15 |
| 9 | 109 | SCREEN FAILURE | 2025-07-20 | 2025-07-20 |
| 10 | 110 | LOST TO FOLLOW-UP | 20JUL2025 | 2025-07-20 |
| 11 | 111 | COMPLETED | 2025/08/01 | 2025-08-01 |
| 12 | 112 | WITHDRAWN | 08/10/2025 | 2025-08-10 |
| 13 | 113 | DEATH | 2025-08-15 | 2025-08-15 |
| 14 | 114 | COMPLETED | 15AUG2025 | 2025-08-15 |
| 15 | 115 | SCREEN FAILURE | . | |
| 16 | 101 | COMPLETED | 2025-06-01 | 2025-06-01 |
# STEP 7: CLEAN DISPOSITION
DS_CLEAN <- raw_disposition %>%
mutate(
DSDECOD = str_trim(str_to_upper(DSDECOD)),
DS_DATE = as.Date(convert_date(DSSTDTC))
)
print(DS_CLEAN)
OUTPUT:
|
|
SUBJID |
DSDECOD |
DSSTDTC |
DS_DATE |
|
1 |
101 |
COMPLETED |
01-06-2025 |
01-06-2025 |
|
2 |
102 |
SCREEN FAILURE |
01-Jun-2025 |
20-01-2025 |
|
3 |
103 |
WITHDRAWN |
06/15/2025 |
15-06-2025 |
|
4 |
104 |
LOST TO FOLLOW-UP |
20-06-2025 |
20-06-2025 |
|
5 |
105 |
COMPLETED |
20-Jun-2025 |
20-06-2025 |
|
6 |
106 |
DEATH |
01-07-2025 |
01-07-2025 |
|
7 |
107 |
WITHDRAWN |
|
NA |
|
8 |
108 |
COMPLETED |
07/15/2025 |
15-07-2025 |
|
9 |
109 |
SCREEN FAILURE |
20-07-2025 |
20-07-2025 |
|
10 |
110 |
LOST TO FOLLOW-UP |
20-Jul-2025 |
20-07-2025 |
|
11 |
111 |
COMPLETED |
01-08-2025 |
01-08-2025 |
|
12 |
112 |
WITHDRAWN |
08-10-2025 |
10-08-2025 |
|
13 |
113 |
DEATH |
15-08-2025 |
15-08-2025 |
|
14 |
114 |
COMPLETED |
15-Aug-2025 |
15-08-2025 |
|
15 |
115 |
SCREEN FAILURE |
|
NA |
|
16 |
101 |
COMPLETED |
01-06-2025 |
01-06-2025 |
Step 8: Derive RFENDTC + DTHFL
PROC SQL;
CREATE TABLE DS_FINAL AS
SELECT SUBJID,PUT(MAX(DS_DATE),YYMMDD10.) AS RFENDTC,
MAX(CASE WHEN DSDECOD = "DEATH" THEN "Y"
ELSE "" END) AS DTHFL
FROM DS_CLEAN
GROUP BY SUBJID;
QUIT;
PROC PRINT DATA=DS_FINAL;RUN;
OUTPUT:
| Obs | SUBJID | RFENDTC | DTHFL |
|---|---|---|---|
| 1 | 101 | 2025-06-01 | |
| 2 | 102 | 2025-06-01 | |
| 3 | 103 | 2025-06-15 | |
| 4 | 104 | 2025-06-20 | |
| 5 | 105 | 2025-06-20 | |
| 6 | 106 | 2025-07-01 | Y |
| 7 | 107 | . | |
| 8 | 108 | 2025-07-15 | |
| 9 | 109 | 2025-07-20 | |
| 10 | 110 | 2025-07-20 | |
| 11 | 111 | 2025-08-01 | |
| 12 | 112 | 2025-08-10 | |
| 13 | 113 | 2025-08-15 | Y |
| 14 | 114 | 2025-08-15 | |
| 15 | 115 | . |
# STEP 8: DERIVE RFENDTC + DTHFL
DS_FINAL <- DS_CLEAN %>%
group_by(SUBJID) %>%
summarise(
RFENDTC = if(all(is.na(DS_DATE))){
as.Date(NA)
} else {
max(DS_DATE, na.rm = TRUE)
},
DTHFL = ifelse(any(DSDECOD == "DEATH"),"Y","")
) %>%
mutate(
RFENDTC = as.character(RFENDTC)
)
print(DS_FINAL)
|
|
SUBJID |
RFENDTC |
DTHFL |
|
1 |
101 |
01-06-2025 |
|
|
2 |
102 |
20-01-2025 |
|
|
3 |
103 |
15-06-2025 |
|
|
4 |
104 |
20-06-2025 |
|
|
5 |
105 |
20-06-2025 |
|
|
6 |
106 |
01-07-2025 |
Y |
|
7 |
107 |
NA |
|
|
8 |
108 |
15-07-2025 |
|
|
9 |
109 |
20-07-2025 |
|
|
10 |
110 |
20-07-2025 |
|
|
11 |
111 |
01-08-2025 |
|
|
12 |
112 |
10-08-2025 |
|
|
13 |
113 |
15-08-2025 |
Y |
|
14 |
114 |
15-08-2025 |
|
|
15 |
115 |
NA |
|
Step 8: Death Flag Derivation
If
subject disposition contains:
DEATH
then:
DTHFL = Y
Step 9: Remove Duplicates
PROC SORT DATA = DEMOG_CLEAN NODUPKEY;
BY SUBJID;
RUN;
PROC PRINT;RUN;
OUTPUT:
| Obs | SUBJID | SEX | RACE | COUNTRY | ARM | SITEID | DOB | RFICDTC |
|---|---|---|---|---|---|---|---|---|
| 1 | 101 | M | ASIAN | INDIA | PLACEBO | 1001 | 01JAN1980 | 2025-01-01 |
| 2 | 102 | F | WHITE | USA | DRUG A | 1001 | 15FEB1985 | 2025-02-15 |
| 3 | 103 | M | BLACK | UK | DRUG B | 1002 | 15MAR1978 | 2025-03-01 |
| 4 | 104 | F | UNKNOWN | INDIA | PLACEBO | 1002 | 22APR1970 | 2025-04-01 |
| 5 | 105 | M | ASIAN | SPAIN | DRUG A | 1003 | 05MAY1990 | 2025-05-05 |
| 6 | 106 | F | WHITE | USA | DRUG B | 1003 | 10JUN1992 | 2025-06-10 |
| 7 | 107 | M | BLACK | INDIA | PLACEBO | 1004 | 11JUL1985 | 2025-07-11 |
| 8 | 108 | F | ASIAN | PAKISTAN | DRUG A | 1004 | 20AUG1988 | 2025-08-20 |
| 9 | 109 | U | UNKNOWN | INDIA | DRUG B | 1005 | 09SEP1975 | 2025-09-09 |
| 10 | 110 | M | WHITE | UK | PLACEBO | 1005 | 01OCT1977 | 2025-10-01 |
| 11 | 111 | F | ASIAN | INDIA | DRUG A | 1006 | 12NOV1982 | 2025-11-12 |
| 12 | 112 | M | BLACK | USA | DRUG B | 1006 | 25DEC1981 | 2025-12-25 |
| 13 | 113 | F | WHITE | CANADA | PLACEBO | 1007 | 01MAR1995 | 2025-01-03 |
| 14 | 114 | M | UNKNOWN | INDIA | DRUG A | 1007 | 14FEB1993 | 2025-02-14 |
| 15 | 115 | F | ASIAN | SPAIN | DRUG B | 1008 | 15MAR1987 | 2025-03-15 |
# STEP 9: REMOVE DUPLICATES
DEMOG_CLEAN_FINAL <- DS_CLEAN %>%
distinct(SUBJID, .keep_all = TRUE)
print(DEMOG_CLEAN_FINAL)
OUTPUT:
|
|
SUBJID |
DSDECOD |
DSSTDTC |
DS_DATE |
|
1 |
101 |
COMPLETED |
01-06-2025 |
01-06-2025 |
|
2 |
102 |
SCREEN FAILURE |
01-Jun-2025 |
20-01-2025 |
|
3 |
103 |
WITHDRAWN |
06/15/2025 |
15-06-2025 |
|
4 |
104 |
LOST TO FOLLOW-UP |
20-06-2025 |
20-06-2025 |
|
5 |
105 |
COMPLETED |
20-Jun-2025 |
20-06-2025 |
|
6 |
106 |
DEATH |
01-07-2025 |
01-07-2025 |
|
7 |
107 |
WITHDRAWN |
|
NA |
|
8 |
108 |
COMPLETED |
07/15/2025 |
15-07-2025 |
|
9 |
109 |
SCREEN FAILURE |
20-07-2025 |
20-07-2025 |
|
10 |
110 |
LOST TO FOLLOW-UP |
20-Jul-2025 |
20-07-2025 |
|
11 |
111 |
COMPLETED |
01-08-2025 |
01-08-2025 |
|
12 |
112 |
WITHDRAWN |
08-10-2025 |
10-08-2025 |
|
13 |
113 |
DEATH |
15-08-2025 |
15-08-2025 |
|
14 |
114 |
COMPLETED |
15-Aug-2025 |
15-08-2025 |
|
15 |
115 |
SCREEN FAILURE |
|
NA |
Step 9: Duplicate Removal
SAS
PROC SORT NODUPKEY
R
distinct(SUBJID, .keep_all =
TRUE)
Step 10: Create Final SDTM DM
PROC SORT DATA=RFSTDTC;
BY SUBJID;
RUN;
PROC PRINT;RUN;
OUTPUT:
| Obs | SUBJID | RFSTDTC |
|---|---|---|
| 1 | 101 | 2025-01-01 |
| 2 | 102 | 2025-02-15 |
| 3 | 103 | 2025-03-01 |
| 4 | 104 | 2025-04-01 |
| 5 | 105 | 2025-05-05 |
| 6 | 106 | 2025-06-10 |
| 7 | 107 | 2025-07-11 |
| 8 | 108 | 2025-08-20 |
| 9 | 109 | . |
| 10 | 110 | 2025-10-01 |
| 11 | 111 | 2025-12-11 |
| 12 | 112 | 2025-12-25 |
| 13 | 113 | 2025-01-03 |
| 14 | 114 | 2025-02-14 |
| 15 | 115 | 2025-03-15 |
PROC SORT DATA=DS_FINAL;
BY SUBJID;
RUN;
PROC PRINT;RUN;
OUTPUT:
| Obs | SUBJID | RFENDTC | DTHFL |
|---|---|---|---|
| 1 | 101 | 2025-06-01 | |
| 2 | 102 | 2025-06-01 | |
| 3 | 103 | 2025-06-15 | |
| 4 | 104 | 2025-06-20 | |
| 5 | 105 | 2025-06-20 | |
| 6 | 106 | 2025-07-01 | Y |
| 7 | 107 | . | |
| 8 | 108 | 2025-07-15 | |
| 9 | 109 | 2025-07-20 | |
| 10 | 110 | 2025-07-20 | |
| 11 | 111 | 2025-08-01 | |
| 12 | 112 | 2025-08-10 | |
| 13 | 113 | 2025-08-15 | Y |
| 14 | 114 | 2025-08-15 | |
| 15 | 115 | . |
DATA DM;
MERGE DEMOG_CLEAN
RFSTDTC
DS_FINAL;
LENGTH STUDYID $10 DOMAIN $2 USUBJID $25 ETHNIC $20 COUNTRY_FLAG $1
SEX_FLAG $1 RFSTDTC_FLAG $1;
/* CONSTANTS*/
STUDYID = "ABC123";
DOMAIN = "DM";
/* USUBJID*/
USUBJID = CATX('-',STUDYID,SITEID,SUBJID);
/* AGE*/
IF RFSTDTC NE '' THEN AGE = INTCK('YEAR',DOB,INPUT(RFSTDTC,YYMMDD10.));
/* ETHNIC*/
IF COUNTRY IN ('USA','CANADA','UK') THEN ETHNIC='NON-HISPANIC';
ELSE IF COUNTRY IN ('INDIA','PAKISTAN') THEN ETHNIC='HISPANIC';
ELSE ETHNIC='UNKNOWN';
/* VALIDATION FLAGS*/
IF RFSTDTC = '' THEN RFSTDTC_FLAG = 'Y';
ELSE RFSTDTC_FLAG='N';
IF SEX = 'U' THEN SEX_FLAG= 'Y';
ELSE SEX_FLAG = 'N';
IF COUNTRY="" THEN COUNTRY_FLAG="Y";
ELSE COUNTRY_FLAG="N";
RUN;
PROC PRINT;RUN;
OUTPUT:
| Obs | SUBJID | SEX | RACE | COUNTRY | ARM | SITEID | DOB | RFICDTC | RFSTDTC | RFENDTC | DTHFL | STUDYID | DOMAIN | USUBJID | ETHNIC | COUNTRY_FLAG | SEX_FLAG | RFSTDTC_FLAG | AGE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | M | ASIAN | INDIA | PLACEBO | 1001 | 01JAN1980 | 2025-01-01 | 2025-01-01 | 2025-06-01 | ABC123 | DM | ABC123-1001-101 | HISPANIC | N | N | N | 45 | |
| 2 | 102 | F | WHITE | USA | DRUG A | 1001 | 15FEB1985 | 2025-02-15 | 2025-02-15 | 2025-06-01 | ABC123 | DM | ABC123-1001-102 | NON-HISPANIC | N | N | N | 40 | |
| 3 | 103 | M | BLACK | UK | DRUG B | 1002 | 15MAR1978 | 2025-03-01 | 2025-03-01 | 2025-06-15 | ABC123 | DM | ABC123-1002-103 | NON-HISPANIC | N | N | N | 47 | |
| 4 | 104 | F | UNKNOWN | INDIA | PLACEBO | 1002 | 22APR1970 | 2025-04-01 | 2025-04-01 | 2025-06-20 | ABC123 | DM | ABC123-1002-104 | HISPANIC | N | N | N | 55 | |
| 5 | 105 | M | ASIAN | SPAIN | DRUG A | 1003 | 05MAY1990 | 2025-05-05 | 2025-05-05 | 2025-06-20 | ABC123 | DM | ABC123-1003-105 | UNKNOWN | N | N | N | 35 | |
| 6 | 106 | F | WHITE | USA | DRUG B | 1003 | 10JUN1992 | 2025-06-10 | 2025-06-10 | 2025-07-01 | Y | ABC123 | DM | ABC123-1003-106 | NON-HISPANIC | N | N | N | 33 |
| 7 | 107 | M | BLACK | INDIA | PLACEBO | 1004 | 11JUL1985 | 2025-07-11 | 2025-07-11 | . | ABC123 | DM | ABC123-1004-107 | HISPANIC | N | N | N | 40 | |
| 8 | 108 | F | ASIAN | PAKISTAN | DRUG A | 1004 | 20AUG1988 | 2025-08-20 | 2025-08-20 | 2025-07-15 | ABC123 | DM | ABC123-1004-108 | HISPANIC | N | N | N | 37 | |
| 9 | 109 | U | UNKNOWN | INDIA | DRUG B | 1005 | 09SEP1975 | 2025-09-09 | . | 2025-07-20 | ABC123 | DM | ABC123-1005-109 | HISPANIC | N | Y | N | . | |
| 10 | 110 | M | WHITE | UK | PLACEBO | 1005 | 01OCT1977 | 2025-10-01 | 2025-10-01 | 2025-07-20 | ABC123 | DM | ABC123-1005-110 | NON-HISPANIC | N | N | N | 48 | |
| 11 | 111 | F | ASIAN | INDIA | DRUG A | 1006 | 12NOV1982 | 2025-11-12 | 2025-12-11 | 2025-08-01 | ABC123 | DM | ABC123-1006-111 | HISPANIC | N | N | N | 43 | |
| 12 | 112 | M | BLACK | USA | DRUG B | 1006 | 25DEC1981 | 2025-12-25 | 2025-12-25 | 2025-08-10 | ABC123 | DM | ABC123-1006-112 | NON-HISPANIC | N | N | N | 44 | |
| 13 | 113 | F | WHITE | CANADA | PLACEBO | 1007 | 01MAR1995 | 2025-01-03 | 2025-01-03 | 2025-08-15 | Y | ABC123 | DM | ABC123-1007-113 | NON-HISPANIC | N | N | N | 30 |
| 14 | 114 | M | UNKNOWN | INDIA | DRUG A | 1007 | 14FEB1993 | 2025-02-14 | 2025-02-14 | 2025-08-15 | ABC123 | DM | ABC123-1007-114 | HISPANIC | N | N | N | 32 | |
| 15 | 115 | F | ASIAN | SPAIN | DRUG B | 1008 | 15MAR1987 | 2025-03-15 | 2025-03-15 | . | ABC123 | DM | ABC123-1008-115 | UNKNOWN | N | N | N | 38 |
# STEP 10: CREATE FINAL SDTM DM
DM <- DEMOG_CLEAN_FINAL %>%
left_join(RFSTDTC, by = "SUBJID") %>%
left_join(DS_FINAL, by = "SUBJID") %>%
left_join(demog_clean, by = "SUBJID") %>%
mutate(
# CONSTANTS
STUDYID = "ABC123",
DOMAIN = "DM",
# USUBJID
USUBJID = paste(STUDYID,SITEID,SUBJID,sep = "-"),
# AGE
AGE = as.numeric(floor(interval(DOB,as.Date(RFSTDTC))/years(1) )),
# ETHNIC
ETHNIC = case_when(
COUNTRY %in% c("USA","CANADA","UK") ~ "NON-HISPANIC",
COUNTRY %in% c("INDIA","PAKISTAN") ~ "HISPANIC",
TRUE ~ "UNKNOWN"
),
# VALIDATION FLAGS
RFSTDTC_FLAG = ifelse(is.na(RFSTDTC) | RFSTDTC == "","Y","N"),
SEX_FLAG = ifelse(SEX == "U","Y","N"),
COUNTRY_FLAG = ifelse(COUNTRY == "","Y","N")
)
print(DM)
|
|
SUBJID |
DSDECOD |
DSSTDTC |
DS_DATE |
RFSTDTC |
RFENDTC |
DTHFL |
SITEID |
SEX |
RACE |
COUNTRY |
ARM |
DOB |
RFICDTC |
STUDYID |
DOMAIN |
USUBJID |
AGE |
ETHNIC |
RFSTDTC_FLAG |
SEX_FLAG |
COUNTRY_FLAG |
|
1 |
101 |
COMPLETED |
01-06-2025 |
01-06-2025 |
15-01-2025 |
01-06-2025 |
|
1001 |
M |
ASIAN |
INDIA |
PLACEBO |
19-01-1980 |
01-01-2025 |
ABC123 |
DM |
ABC123-1001-101 |
44 |
HISPANIC |
N |
N |
N |
|
2 |
101 |
COMPLETED |
01-06-2025 |
01-06-2025 |
15-01-2025 |
01-06-2025 |
|
1001 |
M |
ASIAN |
INDIA |
PLACEBO |
19-01-1980 |
01-01-2025 |
ABC123 |
DM |
ABC123-1001-101 |
44 |
HISPANIC |
N |
N |
N |
|
3 |
102 |
SCREEN FAILURE |
01-Jun-2025 |
20-01-2025 |
15-02-2025 |
20-01-2025 |
|
1001 |
F |
WHITE |
USA |
DRUG A |
15-02-1985 |
15-02-2025 |
ABC123 |
DM |
ABC123-1001-102 |
40 |
NON-HISPANIC |
N |
N |
N |
|
4 |
103 |
WITHDRAWN |
06/15/2025 |
15-06-2025 |
20-01-2025 |
15-06-2025 |
|
1002 |
M |
BLACK |
UK |
DRUG B |
15-03-1978 |
01-03-2025 |
ABC123 |
DM |
ABC123-1002-103 |
46 |
NON-HISPANIC |
N |
N |
N |
|
5 |
104 |
LOST TO FOLLOW-UP |
20-06-2025 |
20-06-2025 |
01-04-2025 |
20-06-2025 |
|
1002 |
F |
UNKNOWN |
INDIA |
PLACEBO |
22-04-1970 |
20-01-2025 |
ABC123 |
DM |
ABC123-1002-104 |
54 |
HISPANIC |
N |
N |
N |
|
6 |
105 |
COMPLETED |
20-Jun-2025 |
20-06-2025 |
20-05-2025 |
20-06-2025 |
|
1003 |
M |
ASIAN |
SPAIN |
DRUG A |
05-05-1990 |
05-05-2025 |
ABC123 |
DM |
ABC123-1003-105 |
35 |
UNKNOWN |
N |
N |
N |
|
7 |
106 |
DEATH |
01-07-2025 |
01-07-2025 |
10-06-2025 |
01-07-2025 |
Y |
1003 |
F |
WHITE |
USA |
DRUG B |
10-06-1992 |
06-10-2025 |
ABC123 |
DM |
ABC123-1003-106 |
33 |
NON-HISPANIC |
N |
N |
N |
|
8 |
107 |
WITHDRAWN |
|
NA |
11-07-2025 |
NA |
|
1004 |
M |
BLACK |
INDIA |
PLACEBO |
19-11-1985 |
11-07-2025 |
ABC123 |
DM |
ABC123-1004-107 |
39 |
HISPANIC |
N |
N |
N |
|
9 |
108 |
COMPLETED |
07/15/2025 |
15-07-2025 |
20-08-2025 |
15-07-2025 |
|
1004 |
F |
ASIAN |
PAKISTAN |
DRUG A |
20-08-1988 |
20-08-2025 |
ABC123 |
DM |
ABC123-1004-108 |
37 |
HISPANIC |
N |
N |
N |
|
10 |
109 |
SCREEN FAILURE |
20-07-2025 |
20-07-2025 |
NA |
20-07-2025 |
|
1005 |
U |
UNKNOWN |
INDIA |
DRUG B |
09-09-1975 |
20-09-2025 |
ABC123 |
DM |
ABC123-1005-109 |
NA |
HISPANIC |
Y |
Y |
N |
|
11 |
110 |
LOST TO FOLLOW-UP |
20-Jul-2025 |
20-07-2025 |
01-10-2025 |
20-07-2025 |
|
1005 |
M |
WHITE |
UK |
PLACEBO |
01-10-1977 |
01-10-2025 |
ABC123 |
DM |
ABC123-1005-110 |
48 |
NON-HISPANIC |
N |
N |
N |
|
12 |
111 |
COMPLETED |
01-08-2025 |
01-08-2025 |
20-11-2025 |
01-08-2025 |
|
1006 |
F |
ASIAN |
INDIA |
DRUG A |
19-12-1982 |
11-12-2025 |
ABC123 |
DM |
ABC123-1006-111 |
42 |
HISPANIC |
N |
N |
N |
|
13 |
112 |
WITHDRAWN |
08-10-2025 |
10-08-2025 |
25-12-2025 |
10-08-2025 |
|
1006 |
M |
BLACK |
USA |
DRUG B |
25-12-1981 |
25-12-2025 |
ABC123 |
DM |
ABC123-1006-112 |
44 |
NON-HISPANIC |
N |
N |
N |
|
14 |
113 |
DEATH |
15-08-2025 |
15-08-2025 |
03-01-2025 |
15-08-2025 |
Y |
1007 |
F |
WHITE |
CANADA |
PLACEBO |
03-01-1995 |
03-01-2025 |
ABC123 |
DM |
ABC123-1007-113 |
30 |
NON-HISPANIC |
N |
N |
N |
|
15 |
114 |
COMPLETED |
15-Aug-2025 |
15-08-2025 |
14-02-2025 |
15-08-2025 |
|
1007 |
M |
UNKNOWN |
INDIA |
DRUG A |
14-02-1993 |
14-02-2025 |
ABC123 |
DM |
ABC123-1007-114 |
32 |
HISPANIC |
N |
N |
N |
|
16 |
115 |
SCREEN FAILURE |
|
NA |
15-03-2025 |
NA |
|
1008 |
F |
ASIAN |
SPAIN |
DRUG B |
15-03-1987 |
15-03-2025 |
ABC123 |
DM |
ABC123-1008-115 |
38 |
UNKNOWN |
N |
N |
N |
Step 11: Final SDTM Variable Order
DATA DM_FINAL;
RETAIN
STUDYID
DOMAIN
USUBJID
SUBJID
SITEID
SEX
AGE
RACE
COUNTRY
ARM
RFICDTC
RFSTDTC
RFENDTC
DTHFL
ETHNIC
RFSTDTC_FLAG
SEX_FLAG
COUNTRY_FLAG;
SET DM;
RUN;
PROC PRINT DATA=DM_FINAL;
RUN;
OUTPUT:
| Obs | STUDYID | DOMAIN | USUBJID | SUBJID | SITEID | SEX | AGE | RACE | COUNTRY | ARM | RFICDTC | RFSTDTC | RFENDTC | DTHFL | ETHNIC | RFSTDTC_FLAG | SEX_FLAG | COUNTRY_FLAG | DOB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ABC123 | DM | ABC123-1001-101 | 101 | 1001 | M | 45 | ASIAN | INDIA | PLACEBO | 2025-01-01 | 2025-01-01 | 2025-06-01 | HISPANIC | N | N | N | 01JAN1980 | |
| 2 | ABC123 | DM | ABC123-1001-102 | 102 | 1001 | F | 40 | WHITE | USA | DRUG A | 2025-02-15 | 2025-02-15 | 2025-06-01 | NON-HISPANIC | N | N | N | 15FEB1985 | |
| 3 | ABC123 | DM | ABC123-1002-103 | 103 | 1002 | M | 47 | BLACK | UK | DRUG B | 2025-03-01 | 2025-03-01 | 2025-06-15 | NON-HISPANIC | N | N | N | 15MAR1978 | |
| 4 | ABC123 | DM | ABC123-1002-104 | 104 | 1002 | F | 55 | UNKNOWN | INDIA | PLACEBO | 2025-04-01 | 2025-04-01 | 2025-06-20 | HISPANIC | N | N | N | 22APR1970 | |
| 5 | ABC123 | DM | ABC123-1003-105 | 105 | 1003 | M | 35 | ASIAN | SPAIN | DRUG A | 2025-05-05 | 2025-05-05 | 2025-06-20 | UNKNOWN | N | N | N | 05MAY1990 | |
| 6 | ABC123 | DM | ABC123-1003-106 | 106 | 1003 | F | 33 | WHITE | USA | DRUG B | 2025-06-10 | 2025-06-10 | 2025-07-01 | Y | NON-HISPANIC | N | N | N | 10JUN1992 |
| 7 | ABC123 | DM | ABC123-1004-107 | 107 | 1004 | M | 40 | BLACK | INDIA | PLACEBO | 2025-07-11 | 2025-07-11 | . | HISPANIC | N | N | N | 11JUL1985 | |
| 8 | ABC123 | DM | ABC123-1004-108 | 108 | 1004 | F | 37 | ASIAN | PAKISTAN | DRUG A | 2025-08-20 | 2025-08-20 | 2025-07-15 | HISPANIC | N | N | N | 20AUG1988 | |
| 9 | ABC123 | DM | ABC123-1005-109 | 109 | 1005 | U | . | UNKNOWN | INDIA | DRUG B | 2025-09-09 | . | 2025-07-20 | HISPANIC | N | Y | N | 09SEP1975 | |
| 10 | ABC123 | DM | ABC123-1005-110 | 110 | 1005 | M | 48 | WHITE | UK | PLACEBO | 2025-10-01 | 2025-10-01 | 2025-07-20 | NON-HISPANIC | N | N | N | 01OCT1977 | |
| 11 | ABC123 | DM | ABC123-1006-111 | 111 | 1006 | F | 43 | ASIAN | INDIA | DRUG A | 2025-11-12 | 2025-12-11 | 2025-08-01 | HISPANIC | N | N | N | 12NOV1982 | |
| 12 | ABC123 | DM | ABC123-1006-112 | 112 | 1006 | M | 44 | BLACK | USA | DRUG B | 2025-12-25 | 2025-12-25 | 2025-08-10 | NON-HISPANIC | N | N | N | 25DEC1981 | |
| 13 | ABC123 | DM | ABC123-1007-113 | 113 | 1007 | F | 30 | WHITE | CANADA | PLACEBO | 2025-01-03 | 2025-01-03 | 2025-08-15 | Y | NON-HISPANIC | N | N | N | 01MAR1995 |
| 14 | ABC123 | DM | ABC123-1007-114 | 114 | 1007 | M | 32 | UNKNOWN | INDIA | DRUG A | 2025-02-14 | 2025-02-14 | 2025-08-15 | HISPANIC | N | N | N | 14FEB1993 | |
| 15 | ABC123 | DM | ABC123-1008-115 | 115 | 1008 | F | 38 | ASIAN | SPAIN | DRUG B | 2025-03-15 | 2025-03-15 | . | UNKNOWN | N | N | N | 15MAR1987 |
# STEP 11: FINAL SDTM VARIABLE ORDER
DM_FINAL <- DM %>%
select(STUDYID,DOMAIN,USUBJID,SUBJID,SITEID,SEX,AGE,RACE,COUNTRY,
ARM,RFICDTC,RFSTDTC,RFENDTC,DTHFL,ETHNIC,RFSTDTC_FLAG,SEX_FLAG,
COUNTRY_FLAG
)
print(DM_FINAL)
OUTPUT:
|
|
STUDYID |
DOMAIN |
USUBJID |
SUBJID |
SITEID |
SEX |
AGE |
RACE |
COUNTRY |
ARM |
RFICDTC |
RFSTDTC |
RFENDTC |
DTHFL |
ETHNIC |
RFSTDTC_FLAG |
SEX_FLAG |
COUNTRY_FLAG |
|
1 |
ABC123 |
DM |
ABC123-1001-101 |
101 |
1001 |
M |
44 |
ASIAN |
INDIA |
PLACEBO |
01-01-2025 |
15-01-2025 |
01-06-2025 |
|
HISPANIC |
N |
N |
N |
|
2 |
ABC123 |
DM |
ABC123-1001-101 |
101 |
1001 |
M |
44 |
ASIAN |
INDIA |
PLACEBO |
01-01-2025 |
15-01-2025 |
01-06-2025 |
|
HISPANIC |
N |
N |
N |
|
3 |
ABC123 |
DM |
ABC123-1001-102 |
102 |
1001 |
F |
40 |
WHITE |
USA |
DRUG A |
15-02-2025 |
15-02-2025 |
20-01-2025 |
|
NON-HISPANIC |
N |
N |
N |
|
4 |
ABC123 |
DM |
ABC123-1002-103 |
103 |
1002 |
M |
46 |
BLACK |
UK |
DRUG B |
01-03-2025 |
20-01-2025 |
15-06-2025 |
|
NON-HISPANIC |
N |
N |
N |
|
5 |
ABC123 |
DM |
ABC123-1002-104 |
104 |
1002 |
F |
54 |
UNKNOWN |
INDIA |
PLACEBO |
20-01-2025 |
01-04-2025 |
20-06-2025 |
|
HISPANIC |
N |
N |
N |
|
6 |
ABC123 |
DM |
ABC123-1003-105 |
105 |
1003 |
M |
35 |
ASIAN |
SPAIN |
DRUG A |
05-05-2025 |
20-05-2025 |
20-06-2025 |
|
UNKNOWN |
N |
N |
N |
|
7 |
ABC123 |
DM |
ABC123-1003-106 |
106 |
1003 |
F |
33 |
WHITE |
USA |
DRUG B |
06-10-2025 |
10-06-2025 |
01-07-2025 |
Y |
NON-HISPANIC |
N |
N |
N |
|
8 |
ABC123 |
DM |
ABC123-1004-107 |
107 |
1004 |
M |
39 |
BLACK |
INDIA |
PLACEBO |
11-07-2025 |
11-07-2025 |
NA |
|
HISPANIC |
N |
N |
N |
|
9 |
ABC123 |
DM |
ABC123-1004-108 |
108 |
1004 |
F |
37 |
ASIAN |
PAKISTAN |
DRUG A |
20-08-2025 |
20-08-2025 |
15-07-2025 |
|
HISPANIC |
N |
N |
N |
|
10 |
ABC123 |
DM |
ABC123-1005-109 |
109 |
1005 |
U |
NA |
UNKNOWN |
INDIA |
DRUG B |
20-09-2025 |
NA |
20-07-2025 |
|
HISPANIC |
Y |
Y |
N |
|
11 |
ABC123 |
DM |
ABC123-1005-110 |
110 |
1005 |
M |
48 |
WHITE |
UK |
PLACEBO |
01-10-2025 |
01-10-2025 |
20-07-2025 |
|
NON-HISPANIC |
N |
N |
N |
|
12 |
ABC123 |
DM |
ABC123-1006-111 |
111 |
1006 |
F |
42 |
ASIAN |
INDIA |
DRUG A |
11-12-2025 |
20-11-2025 |
01-08-2025 |
|
HISPANIC |
N |
N |
N |
|
13 |
ABC123 |
DM |
ABC123-1006-112 |
112 |
1006 |
M |
44 |
BLACK |
USA |
DRUG B |
25-12-2025 |
25-12-2025 |
10-08-2025 |
|
NON-HISPANIC |
N |
N |
N |
|
14 |
ABC123 |
DM |
ABC123-1007-113 |
113 |
1007 |
F |
30 |
WHITE |
CANADA |
PLACEBO |
03-01-2025 |
03-01-2025 |
15-08-2025 |
Y |
NON-HISPANIC |
N |
N |
N |
|
15 |
ABC123 |
DM |
ABC123-1007-114 |
114 |
1007 |
M |
32 |
UNKNOWN |
INDIA |
DRUG A |
14-02-2025 |
14-02-2025 |
15-08-2025 |
|
HISPANIC |
N |
N |
N |
|
16 |
ABC123 |
DM |
ABC123-1008-115 |
115 |
1008 |
F |
38 |
ASIAN |
SPAIN |
DRUG B |
15-03-2025 |
15-03-2025 |
NA |
|
UNKNOWN |
N |
N |
N |
Step 10: Final DM Domain Creation
Final
variables include:
- STUDYID
- DOMAIN
- USUBJID
- AGE
- RFSTDTC
- RFENDTC
- ETHNIC
- DTHFL
BUSINESS LOGIC BEHIND
CLEANING DECISIONS
1. Standardized SEX values improve statistical
consistency
Without
standardization:
- Male
- male
- M
would be
treated as different categories.
2. Duplicate subjects create inflated patient
counts
Duplicate
subjects can:
- distort efficacy results
- impact safety summaries
- corrupt randomization counts
3. Missing dates break longitudinal analysis
Missing
RFSTDTC impacts:
- exposure duration
- treatment emergent AE logic
- survival analysis
4. Negative doses violate protocol integrity
Negative
doses indicate:
- entry errors
- ETL corruption
- source extraction issues
5. Mixed country formats affect regional reporting
Regional
summaries rely on:
- exact country matching
- regulatory reporting
standards
6. Controlled terminology ensures SDTM compliance
CDISC
requires:
- controlled values
- standardized domains
- consistent terminology
7. Date standardization enables temporal
calculations
AGE
derivation depends on:
- valid DOB
- valid first dose date
8. Missing race values affect diversity analysis
Regulators
evaluate:
- racial representation
- subgroup efficacy
- demographic safety signals
9. Visit normalization improves exposure tracking
Different
visit labels create:
- duplicate visit problems
- timeline inconsistencies
10. Disposition consistency supports patient
accountability
Disposition
impacts:
- study completion rate
- withdrawal analysis
- survival endpoints
20 SHARP INSIGHTS
1.Real clinical data is never perfectly formatted.
2.ANYDTDTE informat in SAS is extremely powerful for
heterogeneous dates.
3.R date parsing requires careful order
specification.
4.Duplicate subjects are among the most dangerous
clinical data issues.
5.Controlled terminology is essential in SDTM.
6.Negative dose values often indicate upstream ETL
failures.
7.Missing RFSTDTC affects TEAE derivation.
8.Clinical datasets often contain hidden whitespace
issues.
9.Uppercase standardization improves join
performance.
10.Disposition domains are critical for patient
accountability.
11.Clinical validation flags simplify QC workflows.
12.AGE derivation depends heavily on clean date logic.
13.Merging datasets without sorting causes SAS
BY-group errors.
14.distinct() in R behaves differently from SAS
NODUPKEY.
15.NA handling in R differs significantly from SAS
missing logic.
16.Inf warnings in R indicate missing aggregation
safeguards.
17.Clinical programming requires defensive coding
practices.
18.Incorrect date parsing can silently corrupt
analyses.
19.SDTM DM is foundational for all downstream ADaM
datasets.
20.Cross-platform SAS and R programming improves
industry flexibility.
SUMMARY
This
project demonstrated end-to-end clinical data engineering using SAS and R.
Major
operations included:
- demographic cleaning
- date standardization
- duplicate removal
- controlled terminology
mapping
- derivation of RFSTDTC
- derivation of RFENDTC
- derivation of DTHFL
- SDTM DM domain construction
The
project successfully transformed corrupted raw datasets into analysis-ready
SDTM-compliant structures.
CONCLUSION
This
project mirrors real-world pharmaceutical clinical programming workflows used
in:
- CROs
- biotech companies
- pharmaceutical organizations
The
exercise demonstrates practical expertise in:
- Base SAS
- PROC SQL
- DATA Step logic
- R tidyverse pipelines
- SDTM standards
- clinical validation
techniques
- data quality management
The final
DM dataset becomes the foundation for:
- efficacy analysis
- safety analysis
- ADaM derivation
- TLF generation
- regulatory submission
INTERVIEW QUESTIONS AND
ANSWERS
Q1. Why do we standardize SEX values in SDTM?
Answer
Standardization
ensures controlled terminology compliance and prevents inconsistent category
grouping during statistical analysis.
Q2. Why is RFSTDTC important?
Answer
RFSTDTC
represents the first exposure date and is essential for:
- treatment emergent AE
derivation
- exposure duration
- efficacy timelines
- survival analysis
Q3. What is the difference between SAS ANYDTDTE and
R parse_date_time?
Answer
|
SAS |
R |
|
Automatic
flexible parsing |
Requires
explicit format orders |
|
More
forgiving |
More
customizable |
|
Simpler
syntax |
More
granular control |
Q4. Why remove duplicate subjects?
Answer
Duplicate
subjects can:
- inflate sample size
- distort treatment counts
- create incorrect efficacy
and safety conclusions
Q5. Why create validation flags?
Answer
Validation
flags help QC teams quickly identify:
- missing critical variables
- invalid subject records
- incomplete exposure data
- inconsistent demographics
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about SDTM DEMOGRAPHICS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment