From Raw Trial Errors to Clean SDTM DM Analytics

Clinical Trial Data Chaos to SDTM Intelligence: Advanced SAS and R Data Cleaning, Validation, and DM Domain Engineering Framework

INTRODUCTION

In clinical trials, raw source data is rarely analysis-ready. Real-world clinical datasets frequently contain:

  • inconsistent formats
  • duplicate records
  • missing values
  • mixed case text
  • invalid treatment information
  • corrupted dates
  • invalid dosing values
  • inconsistent subject-level identifiers

The objective of this project is to simulate real-world pharmaceutical data engineering problems and transform unstructured raw datasets into a standardized SDTM-style DM domain using both SAS and R programming.

This project demonstrates:

  • Clinical data cleaning
  • SDTM DM derivation logic
  • Data standardization
  • Missing data handling
  • Duplicate management
  • Clinical validation checks
  • First dose derivation
  • End of study derivation
  • Death flag derivation
  • Cross-platform SAS vs R implementation

The project uses three raw datasets:

  1. raw_demog
  2. raw_dosing
  3. raw_disposition

These datasets intentionally contain multiple real-world data quality problems to simulate production-level clinical programming challenges.

RAW DATA OVERVIEW

1. RAW DEMOGRAPHICS DATASET (raw_demog)

Purpose

Contains subject-level demographic and enrollment information.

Important Variables

Variable

Description

SUBJID

Subject Identifier

SITEID

Site Number

SEX

Gender

DOB

Date of Birth

RACE

Race

COUNTRY

Country

ARM

Treatment Arm

CONSENT_DATE

Informed Consent Date

Embedded Data Issues

Problem

Example

Mixed SEX values

Male, FEMALE, m, f

Different DOB formats

01JAN1980, 1985-02-15

Missing race

Blank RACE

Mixed COUNTRY case

india, INDIA, UsA

Extra spaces

" FEMALE "

Duplicate subjects

SUBJID 101 repeated

Dataset 1: raw_demog

DATA raw_demog;

    LENGTH SUBJID $10 SEX $10 DOB $15 RACE $20 COUNTRY $15

         ARM $20 CONSENT_DATE $15;

    INFILE DATALINES DLM='|' DSD TRUNCOVER;

    INPUT SUBJID :$10. SITEID SEX :$10. DOB :$15. RACE :$20.

        COUNTRY :$15. ARM :$20. CONSENT_DATE :$15.;

DATALINES;

101|1001|Male|01JAN1980|Asian|india|Placebo|01-01-2025

102|1001|FEMALE|1985-02-15|White|USA|Drug A|02/15/2025

103|1002|m|15/03/1978|Black|uk|Drug B|2025-03-01

104|1002|f|1970/04/22||India|Placebo|01APR2025

105|1003|MALE|05-05-1990|Asian|SPAIN|Drug A|2025/05/05

106|1003|Female|1992-06-10|White|usa|Drug B|06-10-2025

107|1004|M|11JUL1985|Black|India|Placebo|07/11/2025

108|1004|F|1988-08-20|Asian|PAKISTAN|Drug A|2025-08-20

109|1005|unknown|09/09/1975||india|Drug B|09SEP2025

110|1005|male|1977-10-01|White|Uk|Placebo|10/01/2025

111|1006| FEMALE |12NOV1982|Asian| INDIA |Drug A|11-12-2025

112|1006|m|1981/12/25|Black|UsA|Drug B|2025-12-25

113|1007|Female|03-01-1995|White|Canada|Placebo|01/03/2025

114|1007|M|1993-02-14||india|Drug A|14FEB2025

115|1008|f|15MAR1987|Asian|spain|Drug B|2025/03/15

101|1001|Male|01JAN1980|Asian|india|Placebo|01-01-2025

;

RUN;

PROC PRINT DATA=raw_demog;

RUN;

OUTPUT:

ObsSUBJIDSEXDOBRACECOUNTRYARMCONSENT_DATESITEID
1101Male01JAN1980AsianindiaPlacebo01-01-20251001
2102FEMALE1985-02-15WhiteUSADrug A02/15/20251001
3103m15/03/1978BlackukDrug B2025-03-011002
4104f1970/04/22 IndiaPlacebo01APR20251002
5105MALE05-05-1990AsianSPAINDrug A2025/05/051003
6106Female1992-06-10WhiteusaDrug B06-10-20251003
7107M11JUL1985BlackIndiaPlacebo07/11/20251004
8108F1988-08-20AsianPAKISTANDrug A2025-08-201004
9109unknown09/09/1975 indiaDrug B09SEP20251005
10110male1977-10-01WhiteUkPlacebo10/01/20251005
11111FEMALE12NOV1982AsianINDIADrug A11-12-20251006
12112m1981/12/25BlackUsADrug B2025-12-251006
13113Female03-01-1995WhiteCanadaPlacebo01/03/20251007
14114M1993-02-14 indiaDrug A14FEB20251007
15115f15MAR1987AsianspainDrug B2025/03/151008
16101Male01JAN1980AsianindiaPlacebo01-01-20251001

# RAW DEMOGRAPHICS DATASET

# Purpose:Source dataset for SDTM DM derivation

library(dplyr)

library(stringr)

library(lubridate)

raw_demog <- data.frame(

  SUBJID = c(

    "101","102","103","104","105","106","107","108",

    "109","110","111","112","113","114","115","101"

    ),

  SITEID = c(1001,1001,1002,1002,1003,1003,1004,1004,

             1005,1005,1006,1006,1007,1007,1008,1001

    ),

  SEX = c("Male","FEMALE","m","f","MALE","Female","M","F",

          "unknown","male","FEMALE ","m","Female","M","f","Male"

    ),

  DOB = c(

    "01JAN1980","1985-02-15","15/03/1978","1970/04/22",

    "05-05-1990","1992-06-10","11JUL1985","1988-08-20",

    "09/09/1975","1977-10-01","12NOV1982","1981/12/25",

    "03-01-1995","1993-02-14","15MAR1987","01JAN1980"

  ),

  RACE = c(

    "Asian","White","Black","","Asian","White","Black",

    "Asian","","White","Asian","Black","White","","Asian","Asian"

  ),

  COUNTRY = c(

    "india","USA","uk","India","SPAIN","usa","India",

    "PAKISTAN","india","Uk"," INDIA ","UsA","Canada",

    "india","spain","india"

  ),

  ARM = c(

    "Placebo","Drug A","Drug B","Placebo","Drug A","Drug B",

    "Placebo","Drug A","Drug B","Placebo","Drug A","Drug B",

    "Placebo","Drug A","Drug B","Placebo"

  ),

  CONSENT_DATE = c(

    "01-01-2025","02/15/2025","2025-03-01","01APR2025",

    "2025/05/05","06-10-2025","07/11/2025","2025-08-20",

    "09SEP2025","10/01/2025","11-12-2025","2025-12-25",

    "01/03/2025","14FEB2025","2025/03/15","01-01-2025"

  ),

  stringsAsFactors = FALSE

)

print(raw_demog)

OUTPUT:

 

SUBJID

SITEID

SEX

DOB

RACE

COUNTRY

ARM

CONSENT_DATE

1

101

1001

Male

     01-Jan-1980

Asian

india

Placebo

01-01-2025

2

102

1001

FEMALE

15-02-1985

White

USA

Drug A

                   02/15/2025

3

103

1002

m

15-03-1978

Black

uk

Drug B

01-03-2025

4

104

1002

f

22-04-1970

 

India

Placebo

01-Apr-25

5

105

1003

MALE

05-05-1990

Asian

SPAIN

Drug A

05-05-2025

6

106

1003

Female

10-06-1992

White

usa

Drug B

06-10-2025

7

107

1004

M

11-Jul-1985

Black

India

Placebo

07-11-2025

8

108

1004

F

20-08-1988

Asian

PAKISTAN

Drug A

20-08-2025

9

109

1005

unknown

09-09-1975

 

india

Drug B

09-Sep-2025

10

110

1005

male

01-10-1977

White

Uk

Placebo

10-01-2025

11

111

1006

FEMALE 

12-Nov-1982

Asian

 INDIA 

Drug A

11-12-2025

12

112

1006

m

25-12-1981

Black

UsA

Drug B

25-12-2025

13

113

1007

Female

03-01-1995

White

Canada

Placebo

01-03-2025

14

114

1007

M

14-02-1993

 

india

Drug A

14-Feb-2025

15

115

1008

f

   15-Mar-1987

Asian

spain

Drug B

15-03-2025

16

101

1001

Male

01-Jan-1980

Asian

india

Placebo

01-01-2025

2. RAW DOSING DATASET (raw_dosing)

Purpose

Contains treatment exposure and dose administration records.

Important Variables

Variable

Description

SUBJID

Subject ID

DOSE_DATE

Dose Date

DOSE_AMT

Dose Amount

VISIT

Visit Information

Embedded Issues

Problem

Example

Missing dose dates

Blank DOSE_DATE

Duplicate records

Subject 101 repeated

Negative dose values

-50

Different date formats

01JAN2025, 2025-01-15

Visit inconsistency

Visit 1, VISIT-2

Dataset 2: raw_dosing

/*-----------------------------------------------------------

RAW DOSING DATASET

Purpose:Source dataset for SDTM DM/EX derivations

-----------------------------------------------------------*/

DATA raw_dosing;

  LENGTH SUBJID $10 DOSE_DATE $15 VISIT $20;

  INFILE DATALINES DLM='|' DSD TRUNCOVER;

  INPUT SUBJID :$10. DOSE_DATE :$15. DOSE_AMT VISIT :$20.;

DATALINES;

101|01JAN2025|100|Visit 1

101|2025-01-15|100|VISIT-2

101|03/01/2025|100|visit 3

102|02/15/2025|200|Screening

102|2025-02-20|200|VISIT-2

103|01MAR2025|150|Visit 1

103||150|Visit 2

104|2025/04/01|0|SCREENING

104|04-15-2025|100|Visit 2

105|05MAY2025|-50|visit 1

105|2025-05-20|100|VISIT-2

106|06/10/2025|120|Visit 1

106|06/10/2025|120|Visit 1

107|07-11-2025|80|SCREENING

108|2025-08-20|90|Visit 1

108|08/25/2025|90|VISIT-2

109||110|Visit 1

110|10/01/2025|100|visit-1

111|11DEC2025|200|Visit 1

112|2025/12/25|150|VISIT 2

113|01/03/2025|100|Screening

114|14FEB2025|0|Visit 1

115|2025-03-15|-25|VISIT-2

101|01JAN2025|100|Visit 1

;

RUN;

PROC PRINT DATA=raw_dosing;

RUN;

OUTPUT:

ObsSUBJIDDOSE_DATEVISITDOSE_AMT
110101JAN2025Visit 1100
21012025-01-15VISIT-2100
310103/01/2025visit 3100
410202/15/2025Screening200
51022025-02-20VISIT-2200
610301MAR2025Visit 1150
7103 Visit 2150
81042025/04/01SCREENING0
910404-15-2025Visit 2100
1010505MAY2025visit 1-50
111052025-05-20VISIT-2100
1210606/10/2025Visit 1120
1310606/10/2025Visit 1120
1410707-11-2025SCREENING80
151082025-08-20Visit 190
1610808/25/2025VISIT-290
17109 Visit 1110
1811010/01/2025visit-1100
1911111DEC2025Visit 1200
201122025/12/25VISIT 2150
2111301/03/2025Screening100
2211414FEB2025Visit 10
231152025-03-15VISIT-2-25
2410101JAN2025Visit 1100

# RAW DOSING DATASET

# Purpose:Source dataset for SDTM DM/EX derivations

raw_dosing <- data.frame(

  SUBJID = c(

    "101","101","101","102","102","103","103","104",

    "104","105","105","106","106","107","108","108",

    "109","110","111","112","113","114","115","101"

  ),

  DOSE_DATE = c(

    "01JAN2025","2025-01-15","03/01/2025","02/15/2025",

    "2025-02-20","01MAR2025","",

    "2025/04/01","04-15-2025","05MAY2025","2025-05-20",

    "06/10/2025","06/10/2025","07-11-2025","2025-08-20",

    "08/25/2025","","10/01/2025","11DEC2025","2025/12/25",

    "01/03/2025","14FEB2025","2025-03-15","01JAN2025"

  ),

  DOSE_AMT = c(

    100,100,100,200,200,150,150,0,100,-50,100,120,

    120,80,90,90,110,100,200,150,100,0,-25,100

  ),

  VISIT = c(

    "Visit 1","VISIT-2","visit 3","Screening","VISIT-2",

    "Visit 1","Visit 2","SCREENING","Visit 2","visit 1",

    "VISIT-2","Visit 1","Visit 1","SCREENING","Visit 1",

    "VISIT-2","Visit 1","visit-1","Visit 1","VISIT 2",

    "Screening","Visit 1","VISIT-2","Visit 1"

  ),

  stringsAsFactors = FALSE

)

print(raw_dosing)

OUTPUT:

 

SUBJID

DOSE_DATE

DOSE_AMT

VISIT

1

101

01-Jan-2025

100

Visit 1

2

101

15-01-2025

100

VISIT-2

3

101

03-01-2025

100

visit 3

4

102

    02/15/2025

200

Screening

5

102

20-02-2025

200

VISIT-2

6

103

01-Mar-2025

150

Visit 1

7

103

 

150

Visit 2

8

104

01-04-2025

0

SCREENING

9

104

    04-15-2025

100

Visit 2

10

105

05-May-2025

-50

visit 1

11

105

20-05-2025

100

VISIT-2

12

106

06-10-2025

120

Visit 1

13

106

06-10-2025

120

Visit 1

14

107

07-11-2025

80

SCREENING

15

108

20-08-2025

90

Visit 1

16

108

     08/25/2025

90

VISIT-2

17

109

 

110

Visit 1

18

110

10-01-2025

100

visit-1

19

111

 11-Dec-2025

200

Visit 1

20

112

25-12-2025

150

VISIT 2

21

113

01-03-2025

100

Screening

22

114

14-Feb-2025

0

Visit 1

23

115

15-03-2025

-25

VISIT-2

24

101

01-Jan-2025

100

Visit 1

3. RAW DISPOSITION DATASET (raw_disposition)

Purpose

Contains subject completion and discontinuation details.

Important Variables

Variable

Description

SUBJID

Subject ID

DSDECOD

Disposition Status

DSSTDTC

Disposition Date

Embedded Issues

Problem

Example

Missing disposition dates

Blank DSSTDTC

Duplicate subjects

Subject 101

Mixed date formats

2025/06/20, 20JUN2025

Mixed disposition values

COMPLETED, WITHDRAWN

Dataset 3: raw_disposition

DATA raw_disposition;

 LENGTH SUBJID $10 DSDECOD $30 DSSTDTC $15;

 INFILE DATALINES DLM='|' DSD TRUNCOVER;

 INPUT SUBJID :$10. DSDECOD :$30. DSSTDTC :$15.;

DATALINES;

101|COMPLETED|2025-06-01

102|SCREEN FAILURE|01JUN2025

103|WITHDRAWN|06/15/2025

104|LOST TO FOLLOW-UP|2025/06/20

105|COMPLETED|20JUN2025

106|DEATH|2025-07-01

107|WITHDRAWN|

108|COMPLETED|07/15/2025

109|SCREEN FAILURE|2025-07-20

110|LOST TO FOLLOW-UP|20JUL2025

111|COMPLETED|2025/08/01

112|WITHDRAWN|08/10/2025

113|DEATH|2025-08-15

114|COMPLETED|15AUG2025

115|SCREEN FAILURE|

101|COMPLETED|2025-06-01

;

RUN;

PROC PRINT DATA = raw_disposition;

RUN;

OUTPUT:

ObsSUBJIDDSDECODDSSTDTC
1101COMPLETED2025-06-01
2102SCREEN FAILURE01JUN2025
3103WITHDRAWN06/15/2025
4104LOST TO FOLLOW-UP2025/06/20
5105COMPLETED20JUN2025
6106DEATH2025-07-01
7107WITHDRAWN 
8108COMPLETED07/15/2025
9109SCREEN FAILURE2025-07-20
10110LOST TO FOLLOW-UP20JUL2025
11111COMPLETED2025/08/01
12112WITHDRAWN08/10/2025
13113DEATH2025-08-15
14114COMPLETED15AUG2025
15115SCREEN FAILURE 
16101COMPLETED2025-06-01

# RAW DISPOSITION DATASET

raw_disposition <- data.frame(

  SUBJID = c(

    "101","102","103","104","105","106","107","108",

    "109","110","111","112","113","114","115","101"

  ), 

  DSDECOD = c(

    "COMPLETED","SCREEN FAILURE","WITHDRAWN",

    "LOST TO FOLLOW-UP","COMPLETED","DEATH",

    "WITHDRAWN","COMPLETED","SCREEN FAILURE",

    "LOST TO FOLLOW-UP","COMPLETED","WITHDRAWN",

    "DEATH","COMPLETED","SCREEN FAILURE","COMPLETED"

  ),

  DSSTDTC = c(

    "2025-06-01","01JUN2025","06/15/2025","2025/06/20",

    "20JUN2025","2025-07-01","","07/15/2025",

    "2025-07-20","20JUL2025","2025/08/01",

    "08/10/2025","2025-08-15","15AUG2025",

    "","2025-06-01"

  ),

  stringsAsFactors = FALSE

)

print(raw_disposition)

OUTPUT:

 

SUBJID

DSDECOD

DSSTDTC

1

101

COMPLETED

01-06-2025

2

102

SCREEN FAILURE

  01-Jun-2025

3

103

WITHDRAWN

            06/15/2025

4

104

LOST TO FOLLOW-UP

           20-06-2025

5

105

COMPLETED

 20-Jun-2025

6

106

DEATH

01-07-2025

7

107

WITHDRAWN

 

8

108

COMPLETED

           07/15/2025

9

109

SCREEN FAILURE

20-07-2025

10

110

LOST TO FOLLOW-UP

20-Jul-2025

11

111

COMPLETED

01-08-2025

12

112

WITHDRAWN

08-10-2025

13

113

DEATH

15-08-2025

14

114

COMPLETED

  15-Aug-2025

15

115

SCREEN FAILURE

 

16

101

COMPLETED

01-06-2025


# FUNCTION FOR DATE CONVERSION

convert_date <- function(x){

  parse_date_time(

    x,

    orders = c(

      "dmy","ymd","mdy","d-b-y","y/m/d","d/b/y","dmy HMS"

    )

  )

}

Step 4: Clean Demographics

DATA demog_clean;

 set raw_demog;

/* sex*/

 sex = upcase(strip(sex));

 if sex in ('M','MALE') then sex = 'M';

 else if sex in ('F','FEMALE') then sex = 'F';

 else sex = 'U';

/* race*/

 race = upcase(strip(race));

 if race = "" then race = "UNKNOWN";

/* country*/

 country = upcase(strip(country));

/*   arm*/

 arm = upcase(strip(arm));

/* dob*/

DOB_NUM = input(DOB,anydtdte15.);

format DOB_NUM Date9.; 

/* consent date*/

RFICDT = input(CONSENT_DATE,anydtdte15.);

format RFICDT yymmdd10.;

rename DOB_NUM=DOB RFICDT=RFICDTC;

drop dob consent_date;

run;

proc print;run;

OUTPUT:

ObsSUBJIDSEXRACECOUNTRYARMSITEIDDOBRFICDTC
1101MASIANINDIAPLACEBO100101JAN19802025-01-01
2102FWHITEUSADRUG A100115FEB19852025-02-15
3103MBLACKUKDRUG B100215MAR19782025-03-01
4104FUNKNOWNINDIAPLACEBO100222APR19702025-04-01
5105MASIANSPAINDRUG A100305MAY19902025-05-05
6106FWHITEUSADRUG B100310JUN19922025-06-10
7107MBLACKINDIAPLACEBO100411JUL19852025-07-11
8108FASIANPAKISTANDRUG A100420AUG19882025-08-20
9109UUNKNOWNINDIADRUG B100509SEP19752025-09-09
10110MWHITEUKPLACEBO100501OCT19772025-10-01
11111FASIANINDIADRUG A100612NOV19822025-11-12
12112MBLACKUSADRUG B100625DEC19812025-12-25
13113FWHITECANADAPLACEBO100701MAR19952025-01-03
14114MUNKNOWNINDIADRUG A100714FEB19932025-02-14
15115FASIANSPAINDRUG B100815MAR19872025-03-15
16101MASIANINDIAPLACEBO100101JAN19802025-01-01

# STEP 4: CLEAN DEMOGRAPHICS

demog_clean <- raw_demog %>%

    mutate(

      # SEX

      SEX = str_trim(str_to_upper(SEX)),     

      SEX = case_when(SEX %in% c("MALE","M") ~ "M",

                      SEX %in% c("FEMALE","F") ~ "F",

                      TRUE ~ "U"),

      # RACE

      RACE = str_trim(str_to_upper(RACE)),   

      RACE = ifelse(RACE == "","UNKNOWN",RACE),

      # COUNTRY

      COUNTRY = str_trim(str_to_upper(COUNTRY)),

      # ARM

      ARM = str_trim(str_to_upper(ARM)),

      # DOB

      DOB_NUM = as.Date(convert_date(DOB)),

      # CONSENT DATE

      RFICDTC = as.Date(convert_date(CONSENT_DATE))

    ) %>%  

   select(

     -DOB,-CONSENT_DATE

   ) %>%  

  rename(

    DOB = DOB_NUM

  )

print(demog_clean)

OUTPUT:

 

SUBJID

SITEID

SEX

RACE

COUNTRY

ARM

DOB

RFICDTC

1

101

1001

M

ASIAN

INDIA

PLACEBO

19-01-1980

01-01-2025

2

102

1001

F

WHITE

USA

DRUG A

15-02-1985

15-02-2025

3

103

1002

M

BLACK

UK

DRUG B

15-03-1978

01-03-2025

4

104

1002

F

UNKNOWN

INDIA

PLACEBO

22-04-1970

20-01-2025

5

105

1003

M

ASIAN

SPAIN

DRUG A

05-05-1990

05-05-2025

6

106

1003

F

WHITE

USA

DRUG B

10-06-1992

06-10-2025

7

107

1004

M

BLACK

INDIA

PLACEBO

19-11-1985

11-07-2025

8

108

1004

F

ASIAN

PAKISTAN

DRUG A

20-08-1988

20-08-2025

9

109

1005

U

UNKNOWN

INDIA

DRUG B

09-09-1975

20-09-2025

10

110

1005

M

WHITE

UK

PLACEBO

01-10-1977

01-10-2025

11

111

1006

F

ASIAN

INDIA

DRUG A

19-12-1982

11-12-2025

12

112

1006

M

BLACK

USA

DRUG B

25-12-1981

25-12-2025

13

113

1007

F

WHITE

CANADA

PLACEBO

03-01-1995

03-01-2025

14

114

1007

M

UNKNOWN

INDIA

DRUG A

14-02-1993

14-02-2025

15

115

1008

F

ASIAN

SPAIN

DRUG B

15-03-1987

15-03-2025

16

101

1001

M

ASIAN

INDIA

PLACEBO

19-01-1980

01-01-2025

Step 1: Demographic Cleaning

SAS Logic

sex = upcase(strip(sex));

R Logic

SEX = str_trim(str_to_upper(SEX))

Objective

  • remove extra spaces
  • standardize case
  • improve downstream consistency

Step 2: Gender Standardization

Logic

Raw Value

Standardized

Male

M

FEMALE

F

unknown

U

Business Need

Clinical SDTM domains require controlled terminology.

Step 3: Race Cleaning

Missing races converted to:

UNKNOWN

Why?

Blank race values create downstream reporting problems.

Step 4: Country Standardization

All country values converted to uppercase.

Example:

Raw

Cleaned

india

INDIA

UsA

USA

Step 5: Clean Dosing

DATA dosing_clean;

 set raw_dosing;

 DOSEDT = input(DOSE_DATE,anydtdte15.);

 format DOSEDT yymmdd10.;

 VISIT = Upcase(strip(VISIT));

 IF DOSE_AMT < 0 THEN DOSE_AMT=.;

 DROP DOSE_DATE;

RUN;

PROC PRINT;RUN;

OUTPUT:

ObsSUBJIDVISITDOSE_AMTDOSEDT
1101VISIT 11002025-01-01
2101VISIT-21002025-01-15
3101VISIT 31002025-03-01
4102SCREENING2002025-02-15
5102VISIT-22002025-02-20
6103VISIT 11502025-03-01
7103VISIT 2150.
8104SCREENING02025-04-01
9104VISIT 21002025-04-15
10105VISIT 1.2025-05-05
11105VISIT-21002025-05-20
12106VISIT 11202025-06-10
13106VISIT 11202025-06-10
14107SCREENING802025-07-11
15108VISIT 1902025-08-20
16108VISIT-2902025-08-25
17109VISIT 1110.
18110VISIT-11002025-10-01
19111VISIT 12002025-12-11
20112VISIT 21502025-12-25
21113SCREENING1002025-01-03
22114VISIT 102025-02-14
23115VISIT-2.2025-03-15
24101VISIT 11002025-01-01

# STEP 5: CLEAN DOSING

dosing_clean <- raw_dosing %>%

  mutate(

    # DOSE DATE

    DOSEDT = as.Date(convert_date(DOSE_DATE)),

    # VISIT

    VISIT = str_trim(str_to_upper(VISIT)),

    # NEGATIVE DOSE

    DOSE_AMT = ifelse(DOSE_AMT < 0,NA,DOSE_AMT),

  ) %>%

    select(

      -DOSE_DATE

      )

print(dosing_clean)

OUTPUT:

 

SUBJID

DOSE_AMT

VISIT

DOSEDT

1

101

100

VISIT 1

20-01-2025

2

101

100

VISIT-2

15-01-2025

3

101

100

VISIT 3

01-03-2025

4

102

200

SCREENING

15-02-2025

5

102

200

VISIT-2

20-02-2025

6

103

150

VISIT 1

20-01-2025

7

103

150

VISIT 2

            NA

8

104

0

SCREENING

01-04-2025

9

104

100

VISIT 2

15-04-2025

10

105

NA

VISIT 1

20-05-2025

11

105

100

VISIT-2

20-05-2025

12

106

120

VISIT 1

10-06-2025

13

106

120

VISIT 1

10-06-2025

14

107

80

SCREENING

11-07-2025

15

108

90

VISIT 1

20-08-2025

16

108

90

VISIT-2

25-08-2025

17

109

110

VISIT 1

            NA

18

110

100

VISIT-1

01-10-2025

19

111

200

VISIT 1

20-11-2025

20

112

150

VISIT 2

25-12-2025

21

113

100

SCREENING

03-01-2025

22

114

0

VISIT 1

14-02-2025

23

115

NA

VISIT-2

15-03-2025

24

101

100

VISIT 1

20-01-2025

Step 5: Date Conversion

SAS

input(DOB,anydtdte15.)

R

parse_date_time()

Why Important?

Clinical data arrives from multiple EDC systems with inconsistent formats.

Step 6: Negative Dose Handling

Negative doses:

-50

-25

converted to missing.

Reason

Negative exposure values are clinically invalid.

Step 6: Derive First Dose Date (RFSTDTC)

PROC SORT DATA = DOSING_CLEAN;

 BY SUBJID DOSEDT;

RUN;

PROC PRINT;RUN;

OUTPUT:

ObsSUBJIDVISITDOSE_AMTDOSEDT
1101VISIT 11002025-01-01
2101VISIT 11002025-01-01
3101VISIT-21002025-01-15
4101VISIT 31002025-03-01
5102SCREENING2002025-02-15
6102VISIT-22002025-02-20
7103VISIT 2150.
8103VISIT 11502025-03-01
9104SCREENING02025-04-01
10104VISIT 21002025-04-15
11105VISIT 1.2025-05-05
12105VISIT-21002025-05-20
13106VISIT 11202025-06-10
14106VISIT 11202025-06-10
15107SCREENING802025-07-11
16108VISIT 1902025-08-20
17108VISIT-2902025-08-25
18109VISIT 1110.
19110VISIT-11002025-10-01
20111VISIT 12002025-12-11
21112VISIT 21502025-12-25
22113SCREENING1002025-01-03
23114VISIT 102025-02-14
24115VISIT-2.2025-03-15

PROC SQL;

 CREATE TABLE RFSTDTC AS

 SELECT SUBJID,PUT(MIN(DOSEDT),YYMMDD10.) AS RFSTDTC

 FROM DOSING_CLEAN

 GROUP BY SUBJID;

RUN;

PROC PRINT;RUN;

OUTPUT:

ObsSUBJIDRFSTDTC
11012025-01-01
21022025-02-15
31032025-03-01
41042025-04-01
51052025-05-05
61062025-06-10
71072025-07-11
81082025-08-20
9109.
101102025-10-01
111112025-12-11
121122025-12-25
131132025-01-03
141142025-02-14
151152025-03-15

# STEP 6: DERIVE RFSTDTC

RFSTDTC <- dosing_clean %>%

  group_by(SUBJID) %>%

  summarise(

    RFSTDTC = if(all(is.na(DOSEDT))){NA}

    else{

      min(DOSEDT, na.rm = TRUE)

    } 

   ) %>%

  mutate(RFSTDTC = as.character(RFSTDTC))

print(RFSTDTC)

OUTPUT:

 

SUBJID

RFSTDTC

1

101

15-01-2025

2

102

15-02-2025

3

103

20-01-2025

4

104

01-04-2025

5

105

20-05-2025

6

106

10-06-2025

7

107

11-07-2025

8

108

20-08-2025

9

109

           NA

10

110

01-10-2025

11

111

20-11-2025

12

112

25-12-2025

13

113

03-01-2025

14

114

14-02-2025

15

115

15-03-2025

Step 7: First Dose Derivation (RFSTDTC)

Derived using minimum dose date.

SAS

MIN(DOSEDT)

R

min(DOSEDT, na.rm = TRUE)

Step 7: Clean Disposition

DATA DS_CLEAN;

 SET raw_disposition;

 DSDECOD = UPCASE(STRIP(DSDECOD));

 DS_DATE = INPUT(DSSTDTC,ANYDTDTE15.);

 FORMAT DS_DATE YYMMDD10.;

RUN;

PROC PRINT;RUN;

OUTPUT:

ObsSUBJIDDSDECODDSSTDTCDS_DATE
1101COMPLETED2025-06-012025-06-01
2102SCREEN FAILURE01JUN20252025-06-01
3103WITHDRAWN06/15/20252025-06-15
4104LOST TO FOLLOW-UP2025/06/202025-06-20
5105COMPLETED20JUN20252025-06-20
6106DEATH2025-07-012025-07-01
7107WITHDRAWN .
8108COMPLETED07/15/20252025-07-15
9109SCREEN FAILURE2025-07-202025-07-20
10110LOST TO FOLLOW-UP20JUL20252025-07-20
11111COMPLETED2025/08/012025-08-01
12112WITHDRAWN08/10/20252025-08-10
13113DEATH2025-08-152025-08-15
14114COMPLETED15AUG20252025-08-15
15115SCREEN FAILURE .
16101COMPLETED2025-06-012025-06-01


# STEP 7: CLEAN DISPOSITION

DS_CLEAN <- raw_disposition %>%

  mutate(

  DSDECOD = str_trim(str_to_upper(DSDECOD)),

  DS_DATE = as.Date(convert_date(DSSTDTC))

  )

print(DS_CLEAN)

OUTPUT:

 

SUBJID

DSDECOD

DSSTDTC

DS_DATE

1

101

COMPLETED

01-06-2025

01-06-2025

2

102

SCREEN FAILURE

01-Jun-2025

20-01-2025

3

103

WITHDRAWN

    06/15/2025

15-06-2025

4

104

LOST TO FOLLOW-UP

20-06-2025

20-06-2025

5

105

COMPLETED

20-Jun-2025

20-06-2025

6

106

DEATH

01-07-2025

01-07-2025

7

107

WITHDRAWN

 

             NA

8

108

COMPLETED

    07/15/2025

15-07-2025

9

109

SCREEN FAILURE

20-07-2025

20-07-2025

10

110

LOST TO FOLLOW-UP

20-Jul-2025

20-07-2025

11

111

COMPLETED

01-08-2025

01-08-2025

12

112

WITHDRAWN

08-10-2025

10-08-2025

13

113

DEATH

15-08-2025

15-08-2025

14

114

COMPLETED

15-Aug-2025

15-08-2025

15

115

SCREEN FAILURE

 

             NA

16

101

COMPLETED

01-06-2025

01-06-2025


Step 8: Derive RFENDTC + DTHFL 

PROC SQL;

 CREATE TABLE DS_FINAL AS

 SELECT SUBJID,PUT(MAX(DS_DATE),YYMMDD10.) AS RFENDTC,

 MAX(CASE WHEN DSDECOD = "DEATH" THEN "Y"

     ELSE "" END) AS DTHFL

 FROM DS_CLEAN

 GROUP BY SUBJID;

QUIT;

PROC PRINT DATA=DS_FINAL;RUN;

 OUTPUT:

ObsSUBJIDRFENDTCDTHFL
11012025-06-01 
21022025-06-01 
31032025-06-15 
41042025-06-20 
51052025-06-20 
61062025-07-01Y
7107. 
81082025-07-15 
91092025-07-20 
101102025-07-20 
111112025-08-01 
121122025-08-10 
131132025-08-15Y
141142025-08-15 
15115. 


# STEP 8: DERIVE RFENDTC + DTHFL

DS_FINAL <- DS_CLEAN %>%

  group_by(SUBJID) %>%

  summarise(

    RFENDTC = if(all(is.na(DS_DATE))){

      as.Date(NA)

    } else {

      max(DS_DATE, na.rm = TRUE)

    },

    DTHFL = ifelse(any(DSDECOD == "DEATH"),"Y","")

  ) %>%

  mutate(

    RFENDTC = as.character(RFENDTC)

  )

print(DS_FINAL)

OUTPUT:

 

SUBJID

RFENDTC

DTHFL

1

101

01-06-2025

 

2

102

20-01-2025

 

3

103

15-06-2025

 

4

104

20-06-2025

 

5

105

20-06-2025

 

6

106

01-07-2025

Y

7

107

         NA

 

8

108

15-07-2025

 

9

109

20-07-2025

 

10

110

20-07-2025

 

11

111

01-08-2025

 

12

112

10-08-2025

 

13

113

15-08-2025

Y

14

114

15-08-2025

 

15

115

        NA

 

Step 8: Death Flag Derivation

If subject disposition contains:

DEATH

then:

DTHFL = Y

Step 9: Remove Duplicates

PROC SORT DATA = DEMOG_CLEAN NODUPKEY;

 BY SUBJID;

RUN;

PROC PRINT;RUN;

OUTPUT:

ObsSUBJIDSEXRACECOUNTRYARMSITEIDDOBRFICDTC
1101MASIANINDIAPLACEBO100101JAN19802025-01-01
2102FWHITEUSADRUG A100115FEB19852025-02-15
3103MBLACKUKDRUG B100215MAR19782025-03-01
4104FUNKNOWNINDIAPLACEBO100222APR19702025-04-01
5105MASIANSPAINDRUG A100305MAY19902025-05-05
6106FWHITEUSADRUG B100310JUN19922025-06-10
7107MBLACKINDIAPLACEBO100411JUL19852025-07-11
8108FASIANPAKISTANDRUG A100420AUG19882025-08-20
9109UUNKNOWNINDIADRUG B100509SEP19752025-09-09
10110MWHITEUKPLACEBO100501OCT19772025-10-01
11111FASIANINDIADRUG A100612NOV19822025-11-12
12112MBLACKUSADRUG B100625DEC19812025-12-25
13113FWHITECANADAPLACEBO100701MAR19952025-01-03
14114MUNKNOWNINDIADRUG A100714FEB19932025-02-14
15115FASIANSPAINDRUG B100815MAR19872025-03-15

# STEP 9: REMOVE DUPLICATES

DEMOG_CLEAN_FINAL <- DS_CLEAN %>%

  distinct(SUBJID, .keep_all = TRUE)

print(DEMOG_CLEAN_FINAL)

OUTPUT:

 

SUBJID

DSDECOD

DSSTDTC

DS_DATE

1

101

COMPLETED

01-06-2025

01-06-2025

2

102

SCREEN FAILURE

01-Jun-2025

20-01-2025

3

103

WITHDRAWN

    06/15/2025

15-06-2025

4

104

LOST TO FOLLOW-UP

20-06-2025

20-06-2025

5

105

COMPLETED

20-Jun-2025

20-06-2025

6

106

DEATH

01-07-2025

01-07-2025

7

107

WITHDRAWN

 

             NA

8

108

COMPLETED

    07/15/2025

15-07-2025

9

109

SCREEN FAILURE

20-07-2025

20-07-2025

10

110

LOST TO FOLLOW-UP

20-Jul-2025

20-07-2025

11

111

COMPLETED

01-08-2025

01-08-2025

12

112

WITHDRAWN

08-10-2025

10-08-2025

13

113

DEATH

15-08-2025

15-08-2025

14

114

COMPLETED

15-Aug-2025

15-08-2025

15

115

SCREEN FAILURE

 

            NA

Step 9: Duplicate Removal

SAS

PROC SORT NODUPKEY

R

distinct(SUBJID, .keep_all = TRUE)

Step 10: Create Final SDTM DM

PROC SORT DATA=RFSTDTC;

 BY SUBJID;

RUN;

PROC PRINT;RUN;

OUTPUT:

ObsSUBJIDRFSTDTC
11012025-01-01
21022025-02-15
31032025-03-01
41042025-04-01
51052025-05-05
61062025-06-10
71072025-07-11
81082025-08-20
9109.
101102025-10-01
111112025-12-11
121122025-12-25
131132025-01-03
141142025-02-14
151152025-03-15

PROC SORT DATA=DS_FINAL;

 BY SUBJID;

RUN;

PROC PRINT;RUN;

OUTPUT:

ObsSUBJIDRFENDTCDTHFL
11012025-06-01 
21022025-06-01 
31032025-06-15 
41042025-06-20 
51052025-06-20 
61062025-07-01Y
7107. 
81082025-07-15 
91092025-07-20 
101102025-07-20 
111112025-08-01 
121122025-08-10 
131132025-08-15Y
141142025-08-15 
15115. 

DATA DM;

 MERGE DEMOG_CLEAN

                RFSTDTC

         DS_FINAL;

 LENGTH STUDYID $10 DOMAIN $2 USUBJID $25 ETHNIC $20 COUNTRY_FLAG $1

     SEX_FLAG $1 RFSTDTC_FLAG $1;

/* CONSTANTS*/

 STUDYID = "ABC123";

 DOMAIN = "DM";

/* USUBJID*/

 USUBJID = CATX('-',STUDYID,SITEID,SUBJID);

/* AGE*/

 IF RFSTDTC NE '' THEN AGE = INTCK('YEAR',DOB,INPUT(RFSTDTC,YYMMDD10.));

/* ETHNIC*/

 IF COUNTRY IN ('USA','CANADA','UK') THEN ETHNIC='NON-HISPANIC';

 ELSE IF COUNTRY IN ('INDIA','PAKISTAN') THEN ETHNIC='HISPANIC';

 ELSE ETHNIC='UNKNOWN';

/* VALIDATION FLAGS*/

 IF RFSTDTC = '' THEN RFSTDTC_FLAG = 'Y';

 ELSE RFSTDTC_FLAG='N';

 IF SEX = 'U' THEN SEX_FLAG= 'Y';

 ELSE SEX_FLAG = 'N';

 IF COUNTRY="" THEN COUNTRY_FLAG="Y";

 ELSE COUNTRY_FLAG="N";

RUN;

PROC PRINT;RUN;

OUTPUT:

ObsSUBJIDSEXRACECOUNTRYARMSITEIDDOBRFICDTCRFSTDTCRFENDTCDTHFLSTUDYIDDOMAINUSUBJIDETHNICCOUNTRY_FLAGSEX_FLAGRFSTDTC_FLAGAGE
1101MASIANINDIAPLACEBO100101JAN19802025-01-012025-01-012025-06-01 ABC123DMABC123-1001-101HISPANICNNN45
2102FWHITEUSADRUG A100115FEB19852025-02-152025-02-152025-06-01 ABC123DMABC123-1001-102NON-HISPANICNNN40
3103MBLACKUKDRUG B100215MAR19782025-03-012025-03-012025-06-15 ABC123DMABC123-1002-103NON-HISPANICNNN47
4104FUNKNOWNINDIAPLACEBO100222APR19702025-04-012025-04-012025-06-20 ABC123DMABC123-1002-104HISPANICNNN55
5105MASIANSPAINDRUG A100305MAY19902025-05-052025-05-052025-06-20 ABC123DMABC123-1003-105UNKNOWNNNN35
6106FWHITEUSADRUG B100310JUN19922025-06-102025-06-102025-07-01YABC123DMABC123-1003-106NON-HISPANICNNN33
7107MBLACKINDIAPLACEBO100411JUL19852025-07-112025-07-11. ABC123DMABC123-1004-107HISPANICNNN40
8108FASIANPAKISTANDRUG A100420AUG19882025-08-202025-08-202025-07-15 ABC123DMABC123-1004-108HISPANICNNN37
9109UUNKNOWNINDIADRUG B100509SEP19752025-09-09.2025-07-20 ABC123DMABC123-1005-109HISPANICNYN.
10110MWHITEUKPLACEBO100501OCT19772025-10-012025-10-012025-07-20 ABC123DMABC123-1005-110NON-HISPANICNNN48
11111FASIANINDIADRUG A100612NOV19822025-11-122025-12-112025-08-01 ABC123DMABC123-1006-111HISPANICNNN43
12112MBLACKUSADRUG B100625DEC19812025-12-252025-12-252025-08-10 ABC123DMABC123-1006-112NON-HISPANICNNN44
13113FWHITECANADAPLACEBO100701MAR19952025-01-032025-01-032025-08-15YABC123DMABC123-1007-113NON-HISPANICNNN30
14114MUNKNOWNINDIADRUG A100714FEB19932025-02-142025-02-142025-08-15 ABC123DMABC123-1007-114HISPANICNNN32
15115FASIANSPAINDRUG B100815MAR19872025-03-152025-03-15. ABC123DMABC123-1008-115UNKNOWNNNN38

# STEP 10: CREATE FINAL SDTM DM

DM <- DEMOG_CLEAN_FINAL %>%

  left_join(RFSTDTC, by = "SUBJID") %>%

  left_join(DS_FINAL, by = "SUBJID") %>%

  left_join(demog_clean, by = "SUBJID") %>%

  mutate(

    # CONSTANTS

    STUDYID = "ABC123",

    DOMAIN  = "DM",

    # USUBJID

    USUBJID = paste(STUDYID,SITEID,SUBJID,sep = "-"),  

    # AGE

    AGE = as.numeric(floor(interval(DOB,as.Date(RFSTDTC))/years(1) )),

    # ETHNIC

    ETHNIC = case_when(

      COUNTRY %in% c("USA","CANADA","UK") ~ "NON-HISPANIC",

      COUNTRY %in% c("INDIA","PAKISTAN") ~ "HISPANIC",

      TRUE ~ "UNKNOWN"

    ),

    # VALIDATION FLAGS

    RFSTDTC_FLAG = ifelse(is.na(RFSTDTC) | RFSTDTC == "","Y","N"),

    SEX_FLAG = ifelse(SEX == "U","Y","N"),

    COUNTRY_FLAG = ifelse(COUNTRY == "","Y","N")

  )

print(DM)

OUTPUT:

 

SUBJID

DSDECOD

DSSTDTC

DS_DATE

RFSTDTC

RFENDTC

DTHFL

SITEID

SEX

RACE

COUNTRY

ARM

DOB

RFICDTC

STUDYID

DOMAIN

USUBJID

AGE

ETHNIC

RFSTDTC_FLAG

SEX_FLAG

COUNTRY_FLAG

1

101

COMPLETED

01-06-2025

01-06-2025

15-01-2025

01-06-2025

 

1001

M

ASIAN

INDIA

PLACEBO

19-01-1980

01-01-2025

ABC123

DM

ABC123-1001-101

44

HISPANIC

N

N

N

2

101

COMPLETED

01-06-2025

01-06-2025

15-01-2025

01-06-2025

 

1001

M

ASIAN

INDIA

PLACEBO

19-01-1980

01-01-2025

ABC123

DM

ABC123-1001-101

44

HISPANIC

N

N

N

3

102

SCREEN FAILURE

01-Jun-2025

20-01-2025

15-02-2025

20-01-2025

 

1001

F

WHITE

USA

DRUG A

15-02-1985

15-02-2025

ABC123

DM

ABC123-1001-102

40

NON-HISPANIC

N

N

N

4

103

WITHDRAWN

               06/15/2025

15-06-2025

20-01-2025

15-06-2025

 

1002

M

BLACK

UK

DRUG B

15-03-1978

01-03-2025

ABC123

DM

ABC123-1002-103

46

NON-HISPANIC

N

N

N

5

104

LOST TO FOLLOW-UP

20-06-2025

20-06-2025

01-04-2025

20-06-2025

 

1002

F

UNKNOWN

INDIA

PLACEBO

22-04-1970

20-01-2025

ABC123

DM

ABC123-1002-104

54

HISPANIC

N

N

N

6

105

COMPLETED

20-Jun-2025

20-06-2025

20-05-2025

20-06-2025

 

1003

M

ASIAN

SPAIN

DRUG A

05-05-1990

05-05-2025

ABC123

DM

ABC123-1003-105

35

UNKNOWN

N

N

N

7

106

DEATH

01-07-2025

01-07-2025

10-06-2025

01-07-2025

Y

1003

F

WHITE

USA

DRUG B

10-06-1992

06-10-2025

ABC123

DM

ABC123-1003-106

33

NON-HISPANIC

N

N

N

8

107

WITHDRAWN

 

                     NA

11-07-2025

                        NA

 

1004

M

BLACK

INDIA

PLACEBO

19-11-1985

11-07-2025

ABC123

DM

ABC123-1004-107

39

HISPANIC

N

N

N

9

108

COMPLETED

              07/15/2025

15-07-2025

20-08-2025

15-07-2025

 

1004

F

ASIAN

PAKISTAN

DRUG A

20-08-1988

20-08-2025

ABC123

DM

ABC123-1004-108

37

HISPANIC

N

N

N

10

109

SCREEN FAILURE

20-07-2025

20-07-2025

NA

20-07-2025

 

1005

U

UNKNOWN

INDIA

DRUG B

09-09-1975

20-09-2025

ABC123

DM

ABC123-1005-109

NA

HISPANIC

Y

Y

N

11

110

LOST TO FOLLOW-UP

20-Jul-2025

20-07-2025

01-10-2025

20-07-2025

 

1005

M

WHITE

UK

PLACEBO

01-10-1977

01-10-2025

ABC123

DM

ABC123-1005-110

48

NON-HISPANIC

N

N

N

12

111

COMPLETED

01-08-2025

01-08-2025

20-11-2025

01-08-2025

 

1006

F

ASIAN

INDIA

DRUG A

19-12-1982

11-12-2025

ABC123

DM

ABC123-1006-111

42

HISPANIC

N

N

N

13

112

WITHDRAWN

08-10-2025

10-08-2025

25-12-2025

10-08-2025

 

1006

M

BLACK

USA

DRUG B

25-12-1981

25-12-2025

ABC123

DM

ABC123-1006-112

44

NON-HISPANIC

N

N

N

14

113

DEATH

15-08-2025

15-08-2025

03-01-2025

15-08-2025

Y

1007

F

WHITE

CANADA

PLACEBO

03-01-1995

03-01-2025

ABC123

DM

ABC123-1007-113

30

NON-HISPANIC

N

N

N

15

114

COMPLETED

15-Aug-2025

15-08-2025

14-02-2025

15-08-2025

 

1007

M

UNKNOWN

INDIA

DRUG A

14-02-1993

14-02-2025

ABC123

DM

ABC123-1007-114

32

HISPANIC

N

N

N

16

115

SCREEN FAILURE

 

                    NA

15-03-2025

                              NA

 

1008

F

ASIAN

SPAIN

DRUG B

15-03-1987

15-03-2025

ABC123

DM

ABC123-1008-115

38

UNKNOWN

N

N

N

Step 11: Final SDTM Variable Order

DATA DM_FINAL;

    RETAIN

        STUDYID

        DOMAIN

        USUBJID

        SUBJID

        SITEID

        SEX

        AGE

        RACE

        COUNTRY

        ARM

        RFICDTC

        RFSTDTC

        RFENDTC

        DTHFL

        ETHNIC

        RFSTDTC_FLAG

        SEX_FLAG

        COUNTRY_FLAG;

    SET DM;

RUN;

PROC PRINT DATA=DM_FINAL;

RUN;

OUTPUT:

ObsSTUDYIDDOMAINUSUBJIDSUBJIDSITEIDSEXAGERACECOUNTRYARMRFICDTCRFSTDTCRFENDTCDTHFLETHNICRFSTDTC_FLAGSEX_FLAGCOUNTRY_FLAGDOB
1ABC123DMABC123-1001-1011011001M45ASIANINDIAPLACEBO2025-01-012025-01-012025-06-01 HISPANICNNN01JAN1980
2ABC123DMABC123-1001-1021021001F40WHITEUSADRUG A2025-02-152025-02-152025-06-01 NON-HISPANICNNN15FEB1985
3ABC123DMABC123-1002-1031031002M47BLACKUKDRUG B2025-03-012025-03-012025-06-15 NON-HISPANICNNN15MAR1978
4ABC123DMABC123-1002-1041041002F55UNKNOWNINDIAPLACEBO2025-04-012025-04-012025-06-20 HISPANICNNN22APR1970
5ABC123DMABC123-1003-1051051003M35ASIANSPAINDRUG A2025-05-052025-05-052025-06-20 UNKNOWNNNN05MAY1990
6ABC123DMABC123-1003-1061061003F33WHITEUSADRUG B2025-06-102025-06-102025-07-01YNON-HISPANICNNN10JUN1992
7ABC123DMABC123-1004-1071071004M40BLACKINDIAPLACEBO2025-07-112025-07-11. HISPANICNNN11JUL1985
8ABC123DMABC123-1004-1081081004F37ASIANPAKISTANDRUG A2025-08-202025-08-202025-07-15 HISPANICNNN20AUG1988
9ABC123DMABC123-1005-1091091005U.UNKNOWNINDIADRUG B2025-09-09.2025-07-20 HISPANICNYN09SEP1975
10ABC123DMABC123-1005-1101101005M48WHITEUKPLACEBO2025-10-012025-10-012025-07-20 NON-HISPANICNNN01OCT1977
11ABC123DMABC123-1006-1111111006F43ASIANINDIADRUG A2025-11-122025-12-112025-08-01 HISPANICNNN12NOV1982
12ABC123DMABC123-1006-1121121006M44BLACKUSADRUG B2025-12-252025-12-252025-08-10 NON-HISPANICNNN25DEC1981
13ABC123DMABC123-1007-1131131007F30WHITECANADAPLACEBO2025-01-032025-01-032025-08-15YNON-HISPANICNNN01MAR1995
14ABC123DMABC123-1007-1141141007M32UNKNOWNINDIADRUG A2025-02-142025-02-142025-08-15 HISPANICNNN14FEB1993
15ABC123DMABC123-1008-1151151008F38ASIANSPAINDRUG B2025-03-152025-03-15. UNKNOWNNNN15MAR1987

# STEP 11: FINAL SDTM VARIABLE ORDER

DM_FINAL <- DM %>%

  select(STUDYID,DOMAIN,USUBJID,SUBJID,SITEID,SEX,AGE,RACE,COUNTRY,

  ARM,RFICDTC,RFSTDTC,RFENDTC,DTHFL,ETHNIC,RFSTDTC_FLAG,SEX_FLAG,

  COUNTRY_FLAG

  )

print(DM_FINAL)

OUTPUT:

 

STUDYID

DOMAIN

USUBJID

SUBJID

SITEID

SEX

AGE

RACE

COUNTRY

ARM

RFICDTC

RFSTDTC

RFENDTC

DTHFL

ETHNIC

RFSTDTC_FLAG

SEX_FLAG

COUNTRY_FLAG

1

ABC123

DM

ABC123-1001-101

101

1001

M

44

ASIAN

INDIA

PLACEBO

01-01-2025

15-01-2025

01-06-2025

 

HISPANIC

N

N

N

2

ABC123

DM

ABC123-1001-101

101

1001

M

44

ASIAN

INDIA

PLACEBO

01-01-2025

15-01-2025

01-06-2025

 

HISPANIC

N

N

N

3

ABC123

DM

ABC123-1001-102

102

1001

F

40

WHITE

USA

DRUG A

15-02-2025

15-02-2025

20-01-2025

 

NON-HISPANIC

N

N

N

4

ABC123

DM

ABC123-1002-103

103

1002

M

46

BLACK

UK

DRUG B

01-03-2025

20-01-2025

15-06-2025

 

NON-HISPANIC

N

N

N

5

ABC123

DM

ABC123-1002-104

104

1002

F

54

UNKNOWN

INDIA

PLACEBO

20-01-2025

01-04-2025

20-06-2025

 

HISPANIC

N

N

N

6

ABC123

DM

ABC123-1003-105

105

1003

M

35

ASIAN

SPAIN

DRUG A

05-05-2025

20-05-2025

20-06-2025

 

UNKNOWN

N

N

N

7

ABC123

DM

ABC123-1003-106

106

1003

F

33

WHITE

USA

DRUG B

06-10-2025

10-06-2025

01-07-2025

Y

NON-HISPANIC

N

N

N

8

ABC123

DM

ABC123-1004-107

107

1004

M

39

BLACK

INDIA

PLACEBO

11-07-2025

11-07-2025

              NA

 

HISPANIC

N

N

N

9

ABC123

DM

ABC123-1004-108

108

1004

F

37

ASIAN

PAKISTAN

DRUG A

20-08-2025

20-08-2025

15-07-2025

 

HISPANIC

N

N

N

10

ABC123

DM

ABC123-1005-109

109

1005

U

NA

UNKNOWN

INDIA

DRUG B

20-09-2025

NA

20-07-2025

 

HISPANIC

Y

Y

N

11

ABC123

DM

ABC123-1005-110

110

1005

M

48

WHITE

UK

PLACEBO

01-10-2025

01-10-2025

20-07-2025

 

NON-HISPANIC

N

N

N

12

ABC123

DM

ABC123-1006-111

111

1006

F

42

ASIAN

INDIA

DRUG A

11-12-2025

20-11-2025

01-08-2025

 

HISPANIC

N

N

N

13

ABC123

DM

ABC123-1006-112

112

1006

M

44

BLACK

USA

DRUG B

25-12-2025

25-12-2025

10-08-2025

 

NON-HISPANIC

N

N

N

14

ABC123

DM

ABC123-1007-113

113

1007

F

30

WHITE

CANADA

PLACEBO

03-01-2025

03-01-2025

15-08-2025

Y

NON-HISPANIC

N

N

N

15

ABC123

DM

ABC123-1007-114

114

1007

M

32

UNKNOWN

INDIA

DRUG A

14-02-2025

14-02-2025

15-08-2025

 

HISPANIC

N

N

N

16

ABC123

DM

ABC123-1008-115

115

1008

F

38

ASIAN

SPAIN

DRUG B

15-03-2025

15-03-2025

            NA

 

UNKNOWN

N

N

N

Step 10: Final DM Domain Creation

Final variables include:

  • STUDYID
  • DOMAIN
  • USUBJID
  • AGE
  • RFSTDTC
  • RFENDTC
  • ETHNIC
  • DTHFL

BUSINESS LOGIC BEHIND CLEANING DECISIONS

1. Standardized SEX values improve statistical consistency

Without standardization:

  • Male
  • male
  • M

would be treated as different categories.

2. Duplicate subjects create inflated patient counts

Duplicate subjects can:

  • distort efficacy results
  • impact safety summaries
  • corrupt randomization counts

3. Missing dates break longitudinal analysis

Missing RFSTDTC impacts:

  • exposure duration
  • treatment emergent AE logic
  • survival analysis

4. Negative doses violate protocol integrity

Negative doses indicate:

  • entry errors
  • ETL corruption
  • source extraction issues

5. Mixed country formats affect regional reporting

Regional summaries rely on:

  • exact country matching
  • regulatory reporting standards

6. Controlled terminology ensures SDTM compliance

CDISC requires:

  • controlled values
  • standardized domains
  • consistent terminology

7. Date standardization enables temporal calculations

AGE derivation depends on:

  • valid DOB
  • valid first dose date

8. Missing race values affect diversity analysis

Regulators evaluate:

  • racial representation
  • subgroup efficacy
  • demographic safety signals

9. Visit normalization improves exposure tracking

Different visit labels create:

  • duplicate visit problems
  • timeline inconsistencies

10. Disposition consistency supports patient accountability

Disposition impacts:

  • study completion rate
  • withdrawal analysis
  • survival endpoints

20 SHARP INSIGHTS

1.Real clinical data is never perfectly formatted.

2.ANYDTDTE informat in SAS is extremely powerful for heterogeneous dates.

3.R date parsing requires careful order specification.

4.Duplicate subjects are among the most dangerous clinical data issues.

5.Controlled terminology is essential in SDTM.

6.Negative dose values often indicate upstream ETL failures.

7.Missing RFSTDTC affects TEAE derivation.

8.Clinical datasets often contain hidden whitespace issues.

9.Uppercase standardization improves join performance.

10.Disposition domains are critical for patient accountability.

11.Clinical validation flags simplify QC workflows.

12.AGE derivation depends heavily on clean date logic.

13.Merging datasets without sorting causes SAS BY-group errors.

14.distinct() in R behaves differently from SAS NODUPKEY.

15.NA handling in R differs significantly from SAS missing logic.

16.Inf warnings in R indicate missing aggregation safeguards.

17.Clinical programming requires defensive coding practices.

18.Incorrect date parsing can silently corrupt analyses.

19.SDTM DM is foundational for all downstream ADaM datasets.

20.Cross-platform SAS and R programming improves industry flexibility.

SUMMARY

This project demonstrated end-to-end clinical data engineering using SAS and R.

Major operations included:

  • demographic cleaning
  • date standardization
  • duplicate removal
  • controlled terminology mapping
  • derivation of RFSTDTC
  • derivation of RFENDTC
  • derivation of DTHFL
  • SDTM DM domain construction

The project successfully transformed corrupted raw datasets into analysis-ready SDTM-compliant structures.

CONCLUSION

This project mirrors real-world pharmaceutical clinical programming workflows used in:

  • CROs
  • biotech companies
  • pharmaceutical organizations

The exercise demonstrates practical expertise in:

  • Base SAS
  • PROC SQL
  • DATA Step logic
  • R tidyverse pipelines
  • SDTM standards
  • clinical validation techniques
  • data quality management

The final DM dataset becomes the foundation for:

  • efficacy analysis
  • safety analysis
  • ADaM derivation
  • TLF generation
  • regulatory submission

INTERVIEW QUESTIONS AND ANSWERS

Q1. Why do we standardize SEX values in SDTM?

Answer

Standardization ensures controlled terminology compliance and prevents inconsistent category grouping during statistical analysis.

Q2. Why is RFSTDTC important?

Answer

RFSTDTC represents the first exposure date and is essential for:

  • treatment emergent AE derivation
  • exposure duration
  • efficacy timelines
  • survival analysis

Q3. What is the difference between SAS ANYDTDTE and R parse_date_time?

Answer

SAS

R

Automatic flexible parsing

Requires explicit format orders

More forgiving

More customizable

Simpler syntax

More granular control

Q4. Why remove duplicate subjects?

Answer

Duplicate subjects can:

  • inflate sample size
  • distort treatment counts
  • create incorrect efficacy and safety conclusions

Q5. Why create validation flags?

Answer

Validation flags help QC teams quickly identify:

  • missing critical variables
  • invalid subject records
  • incomplete exposure data
  • inconsistent demographics
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about SDTM DEMOGRAPHICS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



3.Data Disasters to Data Intelligence: Mastering TRANWRD in SAS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS