152.A COMPREHENSIVE GUIDE TO SYNTHETIC ONLINE COURSE ENROLLMENT DATA ANALYSIS USING SAS: FROM DATA GENERATION TO ADVANCED ANALYTICS

A COMPREHENSIVE GUIDE TO SYNTHETIC ONLINE COURSE ENROLLMENT DATA ANALYSIS USING SAS: FROM DATA GENERATION TO ADVANCED ANALYTICS


 /*Create a unique synthetic dataset focused on online course enrollments and apply various SAS procedures to analyze it*/


Dataset Overview: Online Course Enrollments

1. Dataset Structure

We'll design a dataset named OnlineCourses with the following variables:

StudentID: Unique identifier for each student.

CourseID: Unique identifier for each course.

CourseCategory: Category of the course (e.g., Data Science, Business, Arts).

EnrollmentDate: Date when the student enrolled in the course.

CompletionDate: Date when the student completed the course (if completed).

CompletionStatus: Status indicating whether the course was completed ('Completed', 'In Progress', 'Dropped').

Score: Final score achieved in the course (if completed).

TimeSpent: Total time spent on the course platform in hours.


2. Generating the Dataset in SAS

/*We'll use SAS to generate this synthetic dataset with 20 observations.*/

/* Set the seed for reproducibility */

%let seed = 12345;

%let num_records = 20;


/* Define course categories */

data CourseCategories;

    input CourseCategory $20.;

    datalines;

Data_Science

Business

Arts

Technology

Health

Language

;

run;

proc print;run;

Output:

Obs CourseCategory
1 Data_Science
2 Business
3 Arts
4 Technology
5 Health
6 Language


/* Generate synthetic online course enrollment data */

data OnlineCourses;

    retain EnrollmentDate CompletionDate StudentID CourseID CourseCategory;

length CompletionStatus $15.;

    call streaminit(&seed);

    array categories[6] $20. _temporary_ ('Data_Science', 'Business', 'Arts', 'Technology', 'Health', 'Language');

    format EnrollmentDate CompletionDate date9.;

    do i = 1 to &num_records;

        StudentID = 1000 + i;

        CourseID = 200 + ceil(rand('uniform') * 50);

        CourseCategory = categories[ceil(rand('uniform') * dim(categories))];

        EnrollmentDate = '01JAN2020'd + ceil(rand('uniform') * 730); /* Dates in 2020 and 2021 */

        

        /* Determine completion status */

        p = rand('uniform');

        if p < 0.6 then CompletionStatus = 'Completed';

        else if p < 0.9 then CompletionStatus = 'In Progress';

        else CompletionStatus = 'Dropped';

        

        /* Assign CompletionDate and Score based on status */

        if CompletionStatus = 'Completed' then do;

            CompletionDate = EnrollmentDate + ceil(rand('uniform') * 100);

            Score = round(60 + rand('normal', 0, 10), 0.1);

            if Score > 100 then Score = 100;

            else if Score < 0 then Score = 0;

        end;

        else do;

            CompletionDate = .;

            Score = .;

        end;

        

        /* Time spent on the platform */

        TimeSpent = round(rand('uniform') * 50, 0.1);

        

        output;

    end;

    drop i p;

run;

proc print data=OnlineCourses;

    title "Sample of Online Course Enrollments";

run;

Output:

Obs EnrollmentDate CompletionDate StudentID CourseID CourseCategory CompletionStatus Score TimeSpent
1 06MAR2021 . 1001 230 Language In Progress . 41.2
2 06OCT2020 02JAN2021 1002 215 Technology Completed 60.7 13.9
3 27SEP2020 11DEC2020 1003 247 Business Completed 72.3 7.5
4 27DEC2021 11MAR2022 1004 243 Health Completed 69.1 29.3
5 12JAN2020 27MAR2020 1005 221 Language Completed 74.8 39.1
6 12DEC2021 . 1006 229 Arts Dropped . 42.1
7 17AUG2021 29SEP2021 1007 242 Business Completed 53.1 40.6
8 21AUG2021 09SEP2021 1008 236 Business Completed 75.0 18.7
9 20JAN2020 08APR2020 1009 229 Arts Completed 67.9 34.9
10 15MAY2021 23AUG2021 1010 244 Language Completed 51.8 11.5
11 26JUL2021 28SEP2021 1011 225 Health Completed 49.1 15.8
12 10MAR2021 . 1012 232 Data_Science Dropped . 16.3
13 06DEC2021 15MAR2022 1013 210 Health Completed 74.2 22.2
14 05MAY2020 . 1014 217 Technology In Progress . 37.0
15 07MAR2020 28APR2020 1015 243 Language Completed 56.8 36.0
16 17DEC2021 04JAN2022 1016 228 Language Completed 63.7 36.6
17 20JUL2021 01SEP2021 1017 210 Language Completed 35.9 24.1
18 13AUG2020 . 1018 212 Language In Progress . 47.7
19 09JAN2021 . 1019 236 Data_Science In Progress . 6.7
20 27APR2020 26MAY2020 1020 246 Data_Science Completed 64.0 27.6


Data Exploration and Analysis in SAS

1. Summary Statistics

proc means data=OnlineCourses n mean std min max;

    var Score TimeSpent;

    title "Summary Statistics for Score and Time Spent";

run;

Output:

                                         Summary Statistics for Score and Time Spent

                                                      The MEANS Procedure

Variable N Mean Std Dev Minimum Maximum
Score
TimeSpent
14
20
62.0285714
27.4400000
11.5674230
12.7141198
35.9000000
6.7000000
75.0000000
47.7000000


2. Frequency of Course Categories

proc freq data=OnlineCourses;

    tables CourseCategory / nocum nopercent;

    title "Frequency of Course Categories";

run;

Output:

                                                       Frequency of Course Categories

                                                                 The FREQ Procedure

CourseCategory Frequency
Arts 2
Business 3
Data_Science 3
Health 3
Language 7
Technology 2


3. Completion Status Distribution

proc freq data=OnlineCourses;

    tables CompletionStatus / nocum nopercent;

    title "Distribution of Completion Status";

run;

Output:                

                                                    Distribution of Completion Status 

                                                               The FREQ Procedure

CompletionStatus Frequency
Completed 14
Dropped 2
In Progress 4


Advanced Analysis

1. Average Score by Course Category

proc means data=OnlineCourses noprint;

    class CourseCategory;

    var Score;

    where CompletionStatus = 'Completed';

    output out=AvgScoreByCategory mean=AvgScore;

run;

proc print data=AvgScoreByCategory;

    title "Average Score by Course Category";

run;

Output:

                                                         Average Score by Course Category

Obs CourseCategory _TYPE_ _FREQ_ AvgScore
1   0 14 62.0286
2 Arts 1 1 67.9000
3 Business 1 3 66.8000
4 Data_Science 1 1 64.0000
5 Health 1 3 64.1333
6 Language 1 5 56.6000
7 Technology 1 1 60.7000


2. Correlation Between Time Spent and Score

proc corr data=OnlineCourses;

    var TimeSpent Score;

    where CompletionStatus = 'Completed';

    title "Correlation Between Time Spent and Score";

run;

Output:              

                                                  Correlation Between Time Spent and Score

                                                                  The CORR Procedure

2 Variables: TimeSpent Score


Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
TimeSpent 14 25.55714 10.95913 357.80000 7.50000 40.60000
Score 14 62.02857 11.56742 868.40000 35.90000 75.00000


Pearson Correlation Coefficients, N = 14
Prob > |r| under H0: Rho=0
  TimeSpent Score
TimeSpent
1.00000
 
0.05829
0.8431
Score
0.05829
0.8431
1.00000
 


3. Trend of Enrollments Over Time

proc sgplot data=EnrollmentsTrend;

    series x=EnrollmentDate y=StudentID;

    title "Monthly Enrollment Trend";

run;

Log:

NOTE: There were 0 observations read from the data set.

NOTE: PROCEDURE SGPLOT used (Total process time):

      real time           0.06 seconds

      cpu time            0.01 seconds


Creating Training and Test Datasets

/*To prepare for predictive modeling, we'll split the dataset into training and test sets.*/

proc surveyselect data=OnlineCourses out=TrainTestSplit

    samprate=0.7 outall seed=12345;

run;

Output:

                                                          The SURVEYSELECT Procedure

Selection Method Simple Random Sampling


Input Data Set ONLINECOURSES
Random Number Seed 12345
Sampling Rate 0.7
Sample Size 14
Selection Probability 0.7
Sampling Weight 0
Output Data Set TRAINTESTSPLIT


data TrainingSet TestSet;

    set TrainTestSplit;

    if Selected then output TrainingSet;

    else output TestSet;

run;

Output:

Obs Selected EnrollmentDate CompletionDate StudentID CourseID CourseCategory CompletionStatus Score TimeSpent
1 0 12DEC2021 . 1006 229 Arts Dropped . 42.1
2 0 20JAN2020 08APR2020 1009 229 Arts Completed 67.9 34.9
3 0 15MAY2021 23AUG2021 1010 244 Language Completed 51.8 11.5
4 0 17DEC2021 04JAN2022 1016 228 Language Completed 63.7 36.6
5 0 13AUG2020 . 1018 212 Language In Progress . 47.7
6 0 09JAN2021 . 1019 236 Data_Science In Progress . 6.7


Predictive Modeling: Logistic Regression

/*We'll build a logistic regression model to predict the likelihood of course completion based on time spent and course category.*/


/* Create a binary variable for completion */

data OnlineCoursesModel;

    set OnlineCourses;

    if CompletionStatus = 'Completed' then Completed = 1;

    else Completed = 0;

run;

proc print;run;

Output:

Obs EnrollmentDate CompletionDate StudentID CourseID CourseCategory CompletionStatus Score TimeSpent Completed
1 06MAR2021 . 1001 230 Language In Progress . 41.2 0
2 06OCT2020 02JAN2021 1002 215 Technology Completed 60.7 13.9 1
3 27SEP2020 11DEC2020 1003 247 Business Completed 72.3 7.5 1
4 27DEC2021 11MAR2022 1004 243 Health Completed 69.1 29.3 1
5 12JAN2020 27MAR2020 1005 221 Language Completed 74.8 39.1 1
6 12DEC2021 . 1006 229 Arts Dropped . 42.1 0
7 17AUG2021 29SEP2021 1007 242 Business Completed 53.1 40.6 1
8 21AUG2021 09SEP2021 1008 236 Business Completed 75.0 18.7 1
9 20JAN2020 08APR2020 1009 229 Arts Completed 67.9 34.9 1
10 15MAY2021 23AUG2021 1010 244 Language Completed 51.8 11.5 1
11 26JUL2021 28SEP2021 1011 225 Health Completed 49.1 15.8 1
12 10MAR2021 . 1012 232 Data_Science Dropped . 16.3 0
13 06DEC2021 15MAR2022 1013 210 Health Completed 74.2 22.2 1
14 05MAY2020 . 1014 217 Technology In Progress . 37.0 0
15 07MAR2020 28APR2020 1015 243 Language Completed 56.8 36.0 1
16 17DEC2021 04JAN2022 1016 228 Language Completed 63.7 36.6 1
17 20JUL2021 01SEP2021 1017 210 Language Completed 35.9 24.1 1
18 13AUG2020 . 1018 212 Language In Progress . 47.7 0
19 09JAN2021 . 1019 236 Data_Science In Progress . 6.7 0
20 27APR2020 26MAY2020 1020 246 Data_Science Completed 64.0 27.6 1


/* Logistic Regression */

proc logistic data=OnlineCoursesModel;

    class CourseCategory / param=ref;

    model Completed(event='1') = TimeSpent CourseCategory;

    title "Logistic Regression Model for Course Completion";

run;

Output:  
                                             Logistic Regression Model for Course Completion

                                                  The LOGISTIC Procedure

Model Information
Data Set WORK.ONLINECOURSESMODEL
Response Variable Completed
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring

Number of Observations Read 20
Number of Observations Used 20

Response Profile
Ordered
Value
Completed Total
Frequency
1 0 6
2 1 14

Probability modeled is Completed=1.


Class Level Information
Class Value Design Variables
CourseCategory Arts 1 0 0 0 0
  Business 0 1 0 0 0
  Data_Science 0 0 1 0 0
  Health 0 0 0 1 0
  Language 0 0 0 0 1
  Technology 0 0 0 0 0

Model Convergence Status
Quasi-complete separation of data points detected.


Warning: The maximum likelihood estimate may not exist.


Warning: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

Model Fit Statistics
Criterion Intercept Only Intercept and
Covariates
AIC 26.435 29.112
SC 27.430 36.082
-2 Log L 24.435 15.112

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 9.3230 6 0.1562
Score 6.8108 6 0.3387
Wald 2.4553 6 0.8734

Type 3 Analysis of Effects
Effect DF Wald
Chi-Square
Pr > ChiSq
TimeSpent 1 1.8743 0.1710
CourseCategory 5 2.3911 0.7928

Analysis of Maximum Likelihood Estimates
Parameter   DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept   1 2.9906 2.8006 1.1403 0.2856
TimeSpent   1 -0.1175 0.0858 1.8743 0.1710
CourseCategory Arts 1 1.5335 2.5332 0.3665 0.5449
CourseCategory Business 1 12.9178 250.3 0.0027 0.9588
CourseCategory Data_Science 1 -1.8711 2.4169 0.5993 0.4388
CourseCategory Health 1 12.2234 281.0 0.0019 0.9653
CourseCategory Language 1 2.2384 2.2882 0.9569 0.3280

Odds Ratio Estimates
Effect Point Estimate 95% Wald
Confidence Limits
TimeSpent 0.889 0.751 1.052
CourseCategory Arts vs Technology 4.634 0.032 664.078
CourseCategory Business vs Technology >999.999 <0.001 >999.999
CourseCategory Data_Science vs Technology 0.154 0.001 17.566
CourseCategory Health vs Technology >999.999 <0.001 >999.999
CourseCategory Language vs Technology 9.378 0.106 831.410

Association of Predicted Probabilities and
Observed Responses
Percent Concordant 92.9 Somers' D 0.857
Percent Discordant 7.1 Gamma 0.857
Percent Tied 0.0 Tau-a 0.379
Pairs 84 c 0.929


PRACTICE AND COMMENT YOUR CODE: 

-->PLEASE FOLLOW OUR BLOG FOR MORE UPDATES.

TO FOLLOW OUR TELEGRAM CHANNEL CLICK HERE

Comments