179.STEP-BY-STEP SDTM DATASET CREATION USING PROC SQL | PROC DATASETS | PROC PRINT | PROC FREQ IN SAS WITH MACROS AND RAW DATA SIMULATION FOR CLINICAL TRIAL SUBMISSION

STEP-BY-STEP SDTM DATASET CREATION USING PROC SQL | PROC DATASETS | PROC PRINT | PROC FREQ IN SAS WITH MACROS AND RAW DATA SIMULATION FOR CLINICAL TRIAL SUBMISSION


/*Creating a unique SDTM dataset (Demographics - DM domain) in SAS using SQL and Macros*/

Introduction to SDTM and the DM Domain:

SDTM (Study Data Tabulation Model) is a standard developed by CDISC (Clinical Data 

Interchange Standards Consortium) for organizing and  formatting data to be submitted to 

regulatory authorities such as the FDA.One of the fundamental domains in SDTM is the 

Demographics (DM) domain. 


It contains subject-level information such as:

Subject ID

Study ID

Age

Sex

Race

Country

Siteid

etc.


Step-by-Step Workflow:

We’ll follow these key steps:

1.Define the simulated raw demographic data (source data).

2.Create a macro to simulate raw data.

3.Use SQL to clean, transform, and standardize the data.

4.Create the final SDTM DM dataset with SDTM-compliant variables.

5.Apply SAS Macros for automation and modularity.

6.Show the final dataset with 15+ observations.


Step 1: Define Simulated Raw Data:

Let’s simulate raw demographic data that could resemble what's collected in a CRF 

(Case Report Form). We'll create a raw_demo dataset that includes variables such as:

SUBJID: Subject ID

SITEID: Site Number

AGE: Subject Age

SEX: Subject Sex (M/F)

RACE: Race

COUNTRY: Country Code

ENRLDT: Enrollment Date


Step 2: Creating A Simulate Raw Data:

data raw_demo;

    length SUBJID $10 SITEID $5 SEX $1 RACE $20 COUNTRY $3;

    format ENRLDT date9.;

    input SUBJID $ SITEID $ AGE SEX $ RACE $ COUNTRY $ ENRLDT :date9.;

    datalines;

101-001 S001 34 M Asian IND 01JAN2024

101-002 S001 29 F White USA 05JAN2024

101-003 S001 40 M Black USA 08JAN2024

102-001 S002 23 F Asian JPN 03JAN2024

102-002 S002 37 M Other FRA 06JAN2024

102-003 S002 30 F White IND 10JAN2024

102-004 S002 36 M Black USA 12JAN2024

103-001 S003 28 M Asian CHN 15JAN2024

103-002 S003 33 F Other FRA 17JAN2024

103-003 S003 35 M White JPN 20JAN2024

103-004 S003 41 F Black IND 22JAN2024

103-005 S003 38 M Asian IND 25JAN2024

103-006 S003 32 F White CHN 27JAN2024

103-007 S003 27 M Other USA 30JAN2024

103-008 S003 26 F Black JPN 01FEB2024

103-009 S003 39 M Asian FRA 03FEB2024

;

run;

proc print data=raw_demo;

run;

Output:

Obs SUBJID SITEID SEX RACE COUNTRY ENRLDT AGE
1 101-001 S001 M Asian IND 01JAN2024 34
2 101-002 S001 F White USA 05JAN2024 29
3 101-003 S001 M Black USA 08JAN2024 40
4 102-001 S002 F Asian JPN 03JAN2024 23
5 102-002 S002 M Other FRA 06JAN2024 37
6 102-003 S002 F White IND 10JAN2024 30
7 102-004 S002 M Black USA 12JAN2024 36
8 103-001 S003 M Asian CHN 15JAN2024 28
9 103-002 S003 F Other FRA 17JAN2024 33
10 103-003 S003 M White JPN 20JAN2024 35
11 103-004 S003 F Black IND 22JAN2024 41
12 103-005 S003 M Asian IND 25JAN2024 38
13 103-006 S003 F White CHN 27JAN2024 32
14 103-007 S003 M Other USA 30JAN2024 27
15 103-008 S003 F Black JPN 01FEB2024 26
16 103-009 S003 M Asian FRA 03FEB2024 39


What We Did:

A raw demographic dataset with 16 subjects across multiple countries and sites.

This will be the base for our SDTM DM domain.


Step 3: Prepare and Transform Using SQL:

Let us use PROC SQL to create SDTM variables such as:

STUDYID: Study Identifier

DOMAIN: Always "DM" for the Demographics domain

USUBJID: Unique Subject Identifier (StudyID-SUBJID)

AGEU: Age Unit (Years)

RFSTDTC: Reference Start Date (from ENRLDT)

SEX, RACE, COUNTRY, and SITEID will be directly taken.


%let studyid = CYBER001;


proc sql;

create table sdtm_dm as

select

    "&studyid" as STUDYID length=10,

    "DM" as DOMAIN length=2,

    SUBJID,

    cats("&studyid", "-", SUBJID) as USUBJID length=20,

    SITEID,

    AGE,

    "YEARS" as AGEU length=5,

    SEX,

    RACE,

    COUNTRY,

    put(ENRLDT, yymmdd10.) as RFSTDTC length=10

from raw_demo;

quit;

proc print data=sdtm_dm;

run;

Output:

Obs STUDYID DOMAIN SUBJID USUBJID SITEID AGE AGEU SEX RACE COUNTRY RFSTDTC
1 CYBER001 DM 101-001 CYBER001-101-001 S001 34 YEARS M Asian IND 2024-01-01
2 CYBER001 DM 101-002 CYBER001-101-002 S001 29 YEARS F White USA 2024-01-05
3 CYBER001 DM 101-003 CYBER001-101-003 S001 40 YEARS M Black USA 2024-01-08
4 CYBER001 DM 102-001 CYBER001-102-001 S002 23 YEARS F Asian JPN 2024-01-03
5 CYBER001 DM 102-002 CYBER001-102-002 S002 37 YEARS M Other FRA 2024-01-06
6 CYBER001 DM 102-003 CYBER001-102-003 S002 30 YEARS F White IND 2024-01-10
7 CYBER001 DM 102-004 CYBER001-102-004 S002 36 YEARS M Black USA 2024-01-12
8 CYBER001 DM 103-001 CYBER001-103-001 S003 28 YEARS M Asian CHN 2024-01-15
9 CYBER001 DM 103-002 CYBER001-103-002 S003 33 YEARS F Other FRA 2024-01-17
10 CYBER001 DM 103-003 CYBER001-103-003 S003 35 YEARS M White JPN 2024-01-20
11 CYBER001 DM 103-004 CYBER001-103-004 S003 41 YEARS F Black IND 2024-01-22
12 CYBER001 DM 103-005 CYBER001-103-005 S003 38 YEARS M Asian IND 2024-01-25
13 CYBER001 DM 103-006 CYBER001-103-006 S003 32 YEARS F White CHN 2024-01-27
14 CYBER001 DM 103-007 CYBER001-103-007 S003 27 YEARS M Other USA 2024-01-30
15 CYBER001 DM 103-008 CYBER001-103-008 S003 26 YEARS F Black JPN 2024-02-01
16 CYBER001 DM 103-009 CYBER001-103-009 S003 39 YEARS M Asian FRA 2024-02-03


What We Did:

1.Created STUDYID and DOMAIN.

2.Formatted USUBJID as STUDYID-SUBJID.

3.Standardized date format RFSTDTC using YYMMDD10..

4.Added AGEU = YEARS.


Step 4: Use Macro for Reusability:

This SQL step in a macro for reusability, so we can apply it to other studies or domains by changing macro parameters.


%macro create_sdtm_dm(rawds=, outds=, studyid=);

proc sql;

create table &outds. as

select

    "&studyid" as STUDYID length=10,

    "DM" as DOMAIN length=2,

    SUBJID,

    cats("&studyid", "-", SUBJID) as USUBJID length=20,

    SITEID,

    AGE,

    "YEARS" as AGEU length=5,

    SEX,

    RACE,

    COUNTRY,

    put(ENRLDT, yymmdd10.) as RFSTDTC length=10

from &rawds.;

quit;

proc print data=&outds.;

run;

%mend;


%create_sdtm_dm(rawds=raw_demo, outds=sdtm_dm_final, studyid=CYBER001);

Output:

Obs STUDYID DOMAIN SUBJID USUBJID SITEID AGE AGEU SEX RACE COUNTRY RFSTDTC
1 CYBER001 DM 101-001 CYBER001-101-001 S001 34 YEARS M Asian IND 2024-01-01
2 CYBER001 DM 101-002 CYBER001-101-002 S001 29 YEARS F White USA 2024-01-05
3 CYBER001 DM 101-003 CYBER001-101-003 S001 40 YEARS M Black USA 2024-01-08
4 CYBER001 DM 102-001 CYBER001-102-001 S002 23 YEARS F Asian JPN 2024-01-03
5 CYBER001 DM 102-002 CYBER001-102-002 S002 37 YEARS M Other FRA 2024-01-06
6 CYBER001 DM 102-003 CYBER001-102-003 S002 30 YEARS F White IND 2024-01-10
7 CYBER001 DM 102-004 CYBER001-102-004 S002 36 YEARS M Black USA 2024-01-12
8 CYBER001 DM 103-001 CYBER001-103-001 S003 28 YEARS M Asian CHN 2024-01-15
9 CYBER001 DM 103-002 CYBER001-103-002 S003 33 YEARS F Other FRA 2024-01-17
10 CYBER001 DM 103-003 CYBER001-103-003 S003 35 YEARS M White JPN 2024-01-20
11 CYBER001 DM 103-004 CYBER001-103-004 S003 41 YEARS F Black IND 2024-01-22
12 CYBER001 DM 103-005 CYBER001-103-005 S003 38 YEARS M Asian IND 2024-01-25
13 CYBER001 DM 103-006 CYBER001-103-006 S003 32 YEARS F White CHN 2024-01-27
14 CYBER001 DM 103-007 CYBER001-103-007 S003 27 YEARS M Other USA 2024-01-30
15 CYBER001 DM 103-008 CYBER001-103-008 S003 26 YEARS F Black JPN 2024-02-01
16 CYBER001 DM 103-009 CYBER001-103-009 S003 39 YEARS M Asian FRA 2024-02-03


Advantages of Using Macro:

1.Reusable for any study ID or input dataset.

2.Scalable for automation in production settings.


Step 5: Display Final SDTM Dataset:

proc print data=sdtm_dm_final noobs;

title "Final SDTM DM Dataset - CYBER001";

run;

Output:

                                                                             Final SDTM DM Dataset - CYBER001

STUDYID DOMAIN SUBJID USUBJID SITEID AGE AGEU SEX RACE COUNTRY RFSTDTC
CYBER001 DM 101-001 CYBER001-101-001 S001 34 YEARS M Asian IND 2024-01-01
CYBER001 DM 101-002 CYBER001-101-002 S001 29 YEARS F White USA 2024-01-05
CYBER001 DM 101-003 CYBER001-101-003 S001 40 YEARS M Black USA 2024-01-08
CYBER001 DM 102-001 CYBER001-102-001 S002 23 YEARS F Asian JPN 2024-01-03
CYBER001 DM 102-002 CYBER001-102-002 S002 37 YEARS M Other FRA 2024-01-06
CYBER001 DM 102-003 CYBER001-102-003 S002 30 YEARS F White IND 2024-01-10
CYBER001 DM 102-004 CYBER001-102-004 S002 36 YEARS M Black USA 2024-01-12
CYBER001 DM 103-001 CYBER001-103-001 S003 28 YEARS M Asian CHN 2024-01-15
CYBER001 DM 103-002 CYBER001-103-002 S003 33 YEARS F Other FRA 2024-01-17
CYBER001 DM 103-003 CYBER001-103-003 S003 35 YEARS M White JPN 2024-01-20
CYBER001 DM 103-004 CYBER001-103-004 S003 41 YEARS F Black IND 2024-01-22
CYBER001 DM 103-005 CYBER001-103-005 S003 38 YEARS M Asian IND 2024-01-25
CYBER001 DM 103-006 CYBER001-103-006 S003 32 YEARS F White CHN 2024-01-27
CYBER001 DM 103-007 CYBER001-103-007 S003 27 YEARS M Other USA 2024-01-30
CYBER001 DM 103-008 CYBER001-103-008 S003 26 YEARS F Black JPN 2024-02-01
CYBER001 DM 103-009 CYBER001-103-009 S003 39 YEARS M Asian FRA 2024-02-03


This prints the final dataset which is SDTM compliant for the DM domain.


Checks Using SQL:

We can do checks like:

1.Count of subjects by sex:

proc sql;

select SEX, count(*) as Subject_Count

from sdtm_dm_final

group by SEX;

quit;

Output:

SEX Subject_Count
F 7
M 9


2.Country-wise distribution:

proc sql;

select COUNTRY, count(*) as Subjects

from sdtm_dm_final

group by COUNTRY;

quit;

Output:
COUNTRY Subjects
CHN 2
FRA 3
IND 4
JPN 3
USA 4



PRACTICE AND COMMENT YOUR CODE: 

--->PLEASE FOLLOW OUR BLOG FOR MORE UPDATES.

PLEASE FOLLOW OUR TELEGRAM CHANNEL CLICK HERE

PLEASE FOLLOW OUR FACEBOOK PAGE  CLICK HERE

PLEASE FOLLOW OUR INSTAGRAM PAGE CLICK HERE

Comments