179.STEP-BY-STEP SDTM DATASET CREATION USING PROC SQL | PROC DATASETS | PROC PRINT | PROC FREQ IN SAS WITH MACROS AND RAW DATA SIMULATION FOR CLINICAL TRIAL SUBMISSION
- Get link
- X
- Other Apps
STEP-BY-STEP SDTM DATASET CREATION USING PROC SQL | PROC DATASETS | PROC PRINT | PROC FREQ IN SAS WITH MACROS AND RAW DATA SIMULATION FOR CLINICAL TRIAL SUBMISSION
/*Creating a unique SDTM dataset (Demographics - DM domain) in SAS using SQL and Macros*/
Introduction to SDTM and the DM Domain:
SDTM (Study Data Tabulation Model) is a standard developed by CDISC (Clinical Data
Interchange Standards Consortium) for organizing and formatting data to be submitted to
regulatory authorities such as the FDA.One of the fundamental domains in SDTM is the
Demographics (DM) domain.
It contains subject-level information such as:
Subject ID
Study ID
Age
Sex
Race
Country
Siteid
etc.
Step-by-Step Workflow:
We’ll follow these key steps:
1.Define the simulated raw demographic data (source data).
2.Create a macro to simulate raw data.
3.Use SQL to clean, transform, and standardize the data.
4.Create the final SDTM DM dataset with SDTM-compliant variables.
5.Apply SAS Macros for automation and modularity.
6.Show the final dataset with 15+ observations.
Step 1: Define Simulated Raw Data:
Let’s simulate raw demographic data that could resemble what's collected in a CRF
(Case Report Form). We'll create a raw_demo dataset that includes variables such as:
SUBJID: Subject ID
SITEID: Site Number
AGE: Subject Age
SEX: Subject Sex (M/F)
RACE: Race
COUNTRY: Country Code
ENRLDT: Enrollment Date
Step 2: Creating A Simulate Raw Data:
data raw_demo;
length SUBJID $10 SITEID $5 SEX $1 RACE $20 COUNTRY $3;
format ENRLDT date9.;
input SUBJID $ SITEID $ AGE SEX $ RACE $ COUNTRY $ ENRLDT :date9.;
datalines;
101-001 S001 34 M Asian IND 01JAN2024
101-002 S001 29 F White USA 05JAN2024
101-003 S001 40 M Black USA 08JAN2024
102-001 S002 23 F Asian JPN 03JAN2024
102-002 S002 37 M Other FRA 06JAN2024
102-003 S002 30 F White IND 10JAN2024
102-004 S002 36 M Black USA 12JAN2024
103-001 S003 28 M Asian CHN 15JAN2024
103-002 S003 33 F Other FRA 17JAN2024
103-003 S003 35 M White JPN 20JAN2024
103-004 S003 41 F Black IND 22JAN2024
103-005 S003 38 M Asian IND 25JAN2024
103-006 S003 32 F White CHN 27JAN2024
103-007 S003 27 M Other USA 30JAN2024
103-008 S003 26 F Black JPN 01FEB2024
103-009 S003 39 M Asian FRA 03FEB2024
;
run;
proc print data=raw_demo;
run;
Output:
Obs | SUBJID | SITEID | SEX | RACE | COUNTRY | ENRLDT | AGE |
---|---|---|---|---|---|---|---|
1 | 101-001 | S001 | M | Asian | IND | 01JAN2024 | 34 |
2 | 101-002 | S001 | F | White | USA | 05JAN2024 | 29 |
3 | 101-003 | S001 | M | Black | USA | 08JAN2024 | 40 |
4 | 102-001 | S002 | F | Asian | JPN | 03JAN2024 | 23 |
5 | 102-002 | S002 | M | Other | FRA | 06JAN2024 | 37 |
6 | 102-003 | S002 | F | White | IND | 10JAN2024 | 30 |
7 | 102-004 | S002 | M | Black | USA | 12JAN2024 | 36 |
8 | 103-001 | S003 | M | Asian | CHN | 15JAN2024 | 28 |
9 | 103-002 | S003 | F | Other | FRA | 17JAN2024 | 33 |
10 | 103-003 | S003 | M | White | JPN | 20JAN2024 | 35 |
11 | 103-004 | S003 | F | Black | IND | 22JAN2024 | 41 |
12 | 103-005 | S003 | M | Asian | IND | 25JAN2024 | 38 |
13 | 103-006 | S003 | F | White | CHN | 27JAN2024 | 32 |
14 | 103-007 | S003 | M | Other | USA | 30JAN2024 | 27 |
15 | 103-008 | S003 | F | Black | JPN | 01FEB2024 | 26 |
16 | 103-009 | S003 | M | Asian | FRA | 03FEB2024 | 39 |
What We Did:
A raw demographic dataset with 16 subjects across multiple countries and sites.
This will be the base for our SDTM DM domain.
Step 3: Prepare and Transform Using SQL:
Let us use PROC SQL to create SDTM variables such as:
STUDYID: Study Identifier
DOMAIN: Always "DM" for the Demographics domain
USUBJID: Unique Subject Identifier (StudyID-SUBJID)
AGEU: Age Unit (Years)
RFSTDTC: Reference Start Date (from ENRLDT)
SEX, RACE, COUNTRY, and SITEID will be directly taken.
%let studyid = CYBER001;
proc sql;
create table sdtm_dm as
select
"&studyid" as STUDYID length=10,
"DM" as DOMAIN length=2,
SUBJID,
cats("&studyid", "-", SUBJID) as USUBJID length=20,
SITEID,
AGE,
"YEARS" as AGEU length=5,
SEX,
RACE,
COUNTRY,
put(ENRLDT, yymmdd10.) as RFSTDTC length=10
from raw_demo;
quit;
proc print data=sdtm_dm;
run;
Output:
Obs | STUDYID | DOMAIN | SUBJID | USUBJID | SITEID | AGE | AGEU | SEX | RACE | COUNTRY | RFSTDTC |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | CYBER001 | DM | 101-001 | CYBER001-101-001 | S001 | 34 | YEARS | M | Asian | IND | 2024-01-01 |
2 | CYBER001 | DM | 101-002 | CYBER001-101-002 | S001 | 29 | YEARS | F | White | USA | 2024-01-05 |
3 | CYBER001 | DM | 101-003 | CYBER001-101-003 | S001 | 40 | YEARS | M | Black | USA | 2024-01-08 |
4 | CYBER001 | DM | 102-001 | CYBER001-102-001 | S002 | 23 | YEARS | F | Asian | JPN | 2024-01-03 |
5 | CYBER001 | DM | 102-002 | CYBER001-102-002 | S002 | 37 | YEARS | M | Other | FRA | 2024-01-06 |
6 | CYBER001 | DM | 102-003 | CYBER001-102-003 | S002 | 30 | YEARS | F | White | IND | 2024-01-10 |
7 | CYBER001 | DM | 102-004 | CYBER001-102-004 | S002 | 36 | YEARS | M | Black | USA | 2024-01-12 |
8 | CYBER001 | DM | 103-001 | CYBER001-103-001 | S003 | 28 | YEARS | M | Asian | CHN | 2024-01-15 |
9 | CYBER001 | DM | 103-002 | CYBER001-103-002 | S003 | 33 | YEARS | F | Other | FRA | 2024-01-17 |
10 | CYBER001 | DM | 103-003 | CYBER001-103-003 | S003 | 35 | YEARS | M | White | JPN | 2024-01-20 |
11 | CYBER001 | DM | 103-004 | CYBER001-103-004 | S003 | 41 | YEARS | F | Black | IND | 2024-01-22 |
12 | CYBER001 | DM | 103-005 | CYBER001-103-005 | S003 | 38 | YEARS | M | Asian | IND | 2024-01-25 |
13 | CYBER001 | DM | 103-006 | CYBER001-103-006 | S003 | 32 | YEARS | F | White | CHN | 2024-01-27 |
14 | CYBER001 | DM | 103-007 | CYBER001-103-007 | S003 | 27 | YEARS | M | Other | USA | 2024-01-30 |
15 | CYBER001 | DM | 103-008 | CYBER001-103-008 | S003 | 26 | YEARS | F | Black | JPN | 2024-02-01 |
16 | CYBER001 | DM | 103-009 | CYBER001-103-009 | S003 | 39 | YEARS | M | Asian | FRA | 2024-02-03 |
What We Did:
1.Created STUDYID and DOMAIN.
2.Formatted USUBJID as STUDYID-SUBJID.
3.Standardized date format RFSTDTC using YYMMDD10..
4.Added AGEU = YEARS.
Step 4: Use Macro for Reusability:
This SQL step in a macro for reusability, so we can apply it to other studies or domains by changing macro parameters.
%macro create_sdtm_dm(rawds=, outds=, studyid=);
proc sql;
create table &outds. as
select
"&studyid" as STUDYID length=10,
"DM" as DOMAIN length=2,
SUBJID,
cats("&studyid", "-", SUBJID) as USUBJID length=20,
SITEID,
AGE,
"YEARS" as AGEU length=5,
SEX,
RACE,
COUNTRY,
put(ENRLDT, yymmdd10.) as RFSTDTC length=10
from &rawds.;
quit;
proc print data=&outds.;
run;
%mend;
%create_sdtm_dm(rawds=raw_demo, outds=sdtm_dm_final, studyid=CYBER001);
Output:
Obs | STUDYID | DOMAIN | SUBJID | USUBJID | SITEID | AGE | AGEU | SEX | RACE | COUNTRY | RFSTDTC |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | CYBER001 | DM | 101-001 | CYBER001-101-001 | S001 | 34 | YEARS | M | Asian | IND | 2024-01-01 |
2 | CYBER001 | DM | 101-002 | CYBER001-101-002 | S001 | 29 | YEARS | F | White | USA | 2024-01-05 |
3 | CYBER001 | DM | 101-003 | CYBER001-101-003 | S001 | 40 | YEARS | M | Black | USA | 2024-01-08 |
4 | CYBER001 | DM | 102-001 | CYBER001-102-001 | S002 | 23 | YEARS | F | Asian | JPN | 2024-01-03 |
5 | CYBER001 | DM | 102-002 | CYBER001-102-002 | S002 | 37 | YEARS | M | Other | FRA | 2024-01-06 |
6 | CYBER001 | DM | 102-003 | CYBER001-102-003 | S002 | 30 | YEARS | F | White | IND | 2024-01-10 |
7 | CYBER001 | DM | 102-004 | CYBER001-102-004 | S002 | 36 | YEARS | M | Black | USA | 2024-01-12 |
8 | CYBER001 | DM | 103-001 | CYBER001-103-001 | S003 | 28 | YEARS | M | Asian | CHN | 2024-01-15 |
9 | CYBER001 | DM | 103-002 | CYBER001-103-002 | S003 | 33 | YEARS | F | Other | FRA | 2024-01-17 |
10 | CYBER001 | DM | 103-003 | CYBER001-103-003 | S003 | 35 | YEARS | M | White | JPN | 2024-01-20 |
11 | CYBER001 | DM | 103-004 | CYBER001-103-004 | S003 | 41 | YEARS | F | Black | IND | 2024-01-22 |
12 | CYBER001 | DM | 103-005 | CYBER001-103-005 | S003 | 38 | YEARS | M | Asian | IND | 2024-01-25 |
13 | CYBER001 | DM | 103-006 | CYBER001-103-006 | S003 | 32 | YEARS | F | White | CHN | 2024-01-27 |
14 | CYBER001 | DM | 103-007 | CYBER001-103-007 | S003 | 27 | YEARS | M | Other | USA | 2024-01-30 |
15 | CYBER001 | DM | 103-008 | CYBER001-103-008 | S003 | 26 | YEARS | F | Black | JPN | 2024-02-01 |
16 | CYBER001 | DM | 103-009 | CYBER001-103-009 | S003 | 39 | YEARS | M | Asian | FRA | 2024-02-03 |
Advantages of Using Macro:
1.Reusable for any study ID or input dataset.
2.Scalable for automation in production settings.
Step 5: Display Final SDTM Dataset:
proc print data=sdtm_dm_final noobs;
title "Final SDTM DM Dataset - CYBER001";
run;
Output:
Final SDTM DM Dataset - CYBER001 |
STUDYID | DOMAIN | SUBJID | USUBJID | SITEID | AGE | AGEU | SEX | RACE | COUNTRY | RFSTDTC |
---|---|---|---|---|---|---|---|---|---|---|
CYBER001 | DM | 101-001 | CYBER001-101-001 | S001 | 34 | YEARS | M | Asian | IND | 2024-01-01 |
CYBER001 | DM | 101-002 | CYBER001-101-002 | S001 | 29 | YEARS | F | White | USA | 2024-01-05 |
CYBER001 | DM | 101-003 | CYBER001-101-003 | S001 | 40 | YEARS | M | Black | USA | 2024-01-08 |
CYBER001 | DM | 102-001 | CYBER001-102-001 | S002 | 23 | YEARS | F | Asian | JPN | 2024-01-03 |
CYBER001 | DM | 102-002 | CYBER001-102-002 | S002 | 37 | YEARS | M | Other | FRA | 2024-01-06 |
CYBER001 | DM | 102-003 | CYBER001-102-003 | S002 | 30 | YEARS | F | White | IND | 2024-01-10 |
CYBER001 | DM | 102-004 | CYBER001-102-004 | S002 | 36 | YEARS | M | Black | USA | 2024-01-12 |
CYBER001 | DM | 103-001 | CYBER001-103-001 | S003 | 28 | YEARS | M | Asian | CHN | 2024-01-15 |
CYBER001 | DM | 103-002 | CYBER001-103-002 | S003 | 33 | YEARS | F | Other | FRA | 2024-01-17 |
CYBER001 | DM | 103-003 | CYBER001-103-003 | S003 | 35 | YEARS | M | White | JPN | 2024-01-20 |
CYBER001 | DM | 103-004 | CYBER001-103-004 | S003 | 41 | YEARS | F | Black | IND | 2024-01-22 |
CYBER001 | DM | 103-005 | CYBER001-103-005 | S003 | 38 | YEARS | M | Asian | IND | 2024-01-25 |
CYBER001 | DM | 103-006 | CYBER001-103-006 | S003 | 32 | YEARS | F | White | CHN | 2024-01-27 |
CYBER001 | DM | 103-007 | CYBER001-103-007 | S003 | 27 | YEARS | M | Other | USA | 2024-01-30 |
CYBER001 | DM | 103-008 | CYBER001-103-008 | S003 | 26 | YEARS | F | Black | JPN | 2024-02-01 |
CYBER001 | DM | 103-009 | CYBER001-103-009 | S003 | 39 | YEARS | M | Asian | FRA | 2024-02-03 |
This prints the final dataset which is SDTM compliant for the DM domain.
Checks Using SQL:
We can do checks like:
1.Count of subjects by sex:
proc sql;
select SEX, count(*) as Subject_Count
from sdtm_dm_final
group by SEX;
quit;
Output:
SEX | Subject_Count |
---|---|
F | 7 |
M | 9 |
2.Country-wise distribution:
proc sql;
select COUNTRY, count(*) as Subjects
from sdtm_dm_final
group by COUNTRY;
quit;
COUNTRY | Subjects |
---|---|
CHN | 2 |
FRA | 3 |
IND | 4 |
JPN | 3 |
USA | 4 |
- Get link
- X
- Other Apps
Comments
Post a Comment