152.A COMPREHENSIVE GUIDE TO SYNTHETIC ONLINE COURSE ENROLLMENT DATA ANALYSIS USING SAS: FROM DATA GENERATION TO ADVANCED ANALYTICS
- Get link
- X
- Other Apps
A COMPREHENSIVE GUIDE TO SYNTHETIC ONLINE COURSE ENROLLMENT DATA ANALYSIS USING SAS: FROM DATA GENERATION TO ADVANCED ANALYTICS
/*Create a unique synthetic dataset focused on online course enrollments and apply various SAS procedures to analyze it*/
Dataset Overview: Online Course Enrollments
1. Dataset Structure
We'll design a dataset named OnlineCourses with the following variables:
StudentID: Unique identifier for each student.
CourseID: Unique identifier for each course.
CourseCategory: Category of the course (e.g., Data Science, Business, Arts).
EnrollmentDate: Date when the student enrolled in the course.
CompletionDate: Date when the student completed the course (if completed).
CompletionStatus: Status indicating whether the course was completed ('Completed', 'In Progress', 'Dropped').
Score: Final score achieved in the course (if completed).
TimeSpent: Total time spent on the course platform in hours.
2. Generating the Dataset in SAS
/*We'll use SAS to generate this synthetic dataset with 20 observations.*/
/* Set the seed for reproducibility */
%let seed = 12345;
%let num_records = 20;
/* Define course categories */
data CourseCategories;
input CourseCategory $20.;
datalines;
Data_Science
Business
Arts
Technology
Health
Language
;
run;
proc print;run;
Output:
Obs | CourseCategory |
---|---|
1 | Data_Science |
2 | Business |
3 | Arts |
4 | Technology |
5 | Health |
6 | Language |
/* Generate synthetic online course enrollment data */
data OnlineCourses;
retain EnrollmentDate CompletionDate StudentID CourseID CourseCategory;
length CompletionStatus $15.;
call streaminit(&seed);
array categories[6] $20. _temporary_ ('Data_Science', 'Business', 'Arts', 'Technology', 'Health', 'Language');
format EnrollmentDate CompletionDate date9.;
do i = 1 to &num_records;
StudentID = 1000 + i;
CourseID = 200 + ceil(rand('uniform') * 50);
CourseCategory = categories[ceil(rand('uniform') * dim(categories))];
EnrollmentDate = '01JAN2020'd + ceil(rand('uniform') * 730); /* Dates in 2020 and 2021 */
/* Determine completion status */
p = rand('uniform');
if p < 0.6 then CompletionStatus = 'Completed';
else if p < 0.9 then CompletionStatus = 'In Progress';
else CompletionStatus = 'Dropped';
/* Assign CompletionDate and Score based on status */
if CompletionStatus = 'Completed' then do;
CompletionDate = EnrollmentDate + ceil(rand('uniform') * 100);
Score = round(60 + rand('normal', 0, 10), 0.1);
if Score > 100 then Score = 100;
else if Score < 0 then Score = 0;
end;
else do;
CompletionDate = .;
Score = .;
end;
/* Time spent on the platform */
TimeSpent = round(rand('uniform') * 50, 0.1);
output;
end;
drop i p;
run;
proc print data=OnlineCourses;
title "Sample of Online Course Enrollments";
run;
Output:
Obs | EnrollmentDate | CompletionDate | StudentID | CourseID | CourseCategory | CompletionStatus | Score | TimeSpent |
---|---|---|---|---|---|---|---|---|
1 | 06MAR2021 | . | 1001 | 230 | Language | In Progress | . | 41.2 |
2 | 06OCT2020 | 02JAN2021 | 1002 | 215 | Technology | Completed | 60.7 | 13.9 |
3 | 27SEP2020 | 11DEC2020 | 1003 | 247 | Business | Completed | 72.3 | 7.5 |
4 | 27DEC2021 | 11MAR2022 | 1004 | 243 | Health | Completed | 69.1 | 29.3 |
5 | 12JAN2020 | 27MAR2020 | 1005 | 221 | Language | Completed | 74.8 | 39.1 |
6 | 12DEC2021 | . | 1006 | 229 | Arts | Dropped | . | 42.1 |
7 | 17AUG2021 | 29SEP2021 | 1007 | 242 | Business | Completed | 53.1 | 40.6 |
8 | 21AUG2021 | 09SEP2021 | 1008 | 236 | Business | Completed | 75.0 | 18.7 |
9 | 20JAN2020 | 08APR2020 | 1009 | 229 | Arts | Completed | 67.9 | 34.9 |
10 | 15MAY2021 | 23AUG2021 | 1010 | 244 | Language | Completed | 51.8 | 11.5 |
11 | 26JUL2021 | 28SEP2021 | 1011 | 225 | Health | Completed | 49.1 | 15.8 |
12 | 10MAR2021 | . | 1012 | 232 | Data_Science | Dropped | . | 16.3 |
13 | 06DEC2021 | 15MAR2022 | 1013 | 210 | Health | Completed | 74.2 | 22.2 |
14 | 05MAY2020 | . | 1014 | 217 | Technology | In Progress | . | 37.0 |
15 | 07MAR2020 | 28APR2020 | 1015 | 243 | Language | Completed | 56.8 | 36.0 |
16 | 17DEC2021 | 04JAN2022 | 1016 | 228 | Language | Completed | 63.7 | 36.6 |
17 | 20JUL2021 | 01SEP2021 | 1017 | 210 | Language | Completed | 35.9 | 24.1 |
18 | 13AUG2020 | . | 1018 | 212 | Language | In Progress | . | 47.7 |
19 | 09JAN2021 | . | 1019 | 236 | Data_Science | In Progress | . | 6.7 |
20 | 27APR2020 | 26MAY2020 | 1020 | 246 | Data_Science | Completed | 64.0 | 27.6 |
Data Exploration and Analysis in SAS
1. Summary Statistics
proc means data=OnlineCourses n mean std min max;
var Score TimeSpent;
title "Summary Statistics for Score and Time Spent";
run;
Output:
Summary Statistics for Score and Time
Spent |
Variable | N | Mean | Std Dev | Minimum | Maximum | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
2. Frequency of Course Categories
proc freq data=OnlineCourses;
tables CourseCategory / nocum nopercent;
title "Frequency of Course Categories";
run;
Output:
Frequency of Course
Categories |
CourseCategory | Frequency |
---|---|
Arts | 2 |
Business | 3 |
Data_Science | 3 |
Health | 3 |
Language | 7 |
Technology | 2 |
3. Completion Status Distribution
proc freq data=OnlineCourses;
tables CompletionStatus / nocum nopercent;
title "Distribution of Completion Status";
run;
Output:
Distribution of Completion
Status |
CompletionStatus | Frequency |
---|---|
Completed | 14 |
Dropped | 2 |
In Progress | 4 |
Advanced Analysis
1. Average Score by Course Category
proc means data=OnlineCourses noprint;
class CourseCategory;
var Score;
where CompletionStatus = 'Completed';
output out=AvgScoreByCategory mean=AvgScore;
run;
proc print data=AvgScoreByCategory;
title "Average Score by Course Category";
run;
Output:
Average Score by Course
Category |
Obs | CourseCategory | _TYPE_ | _FREQ_ | AvgScore |
---|---|---|---|---|
1 | 0 | 14 | 62.0286 | |
2 | Arts | 1 | 1 | 67.9000 |
3 | Business | 1 | 3 | 66.8000 |
4 | Data_Science | 1 | 1 | 64.0000 |
5 | Health | 1 | 3 | 64.1333 |
6 | Language | 1 | 5 | 56.6000 |
7 | Technology | 1 | 1 | 60.7000 |
2. Correlation Between Time Spent and Score
proc corr data=OnlineCourses;
var TimeSpent Score;
where CompletionStatus = 'Completed';
title "Correlation Between Time Spent and Score";
run;
Output:
Correlation Between Time Spent and
Score |
2 Variables: | TimeSpent Score |
---|
Simple Statistics | ||||||
---|---|---|---|---|---|---|
Variable | N | Mean | Std Dev | Sum | Minimum | Maximum |
TimeSpent | 14 | 25.55714 | 10.95913 | 357.80000 | 7.50000 | 40.60000 |
Score | 14 | 62.02857 | 11.56742 | 868.40000 | 35.90000 | 75.00000 |
Pearson Correlation
Coefficients, N = 14 Prob > |r| under H0: Rho=0 | ||||||
---|---|---|---|---|---|---|
TimeSpent | Score | |||||
TimeSpent |
|
| ||||
Score |
|
|
3. Trend of Enrollments Over Time
proc sgplot data=EnrollmentsTrend;
series x=EnrollmentDate y=StudentID;
title "Monthly Enrollment Trend";
run;
Log:
NOTE: There were 0 observations read from the data set.
NOTE: PROCEDURE SGPLOT used (Total process time):
real time 0.06 seconds
cpu time 0.01 seconds
Creating Training and Test Datasets
/*To prepare for predictive modeling, we'll split the dataset into training and test sets.*/
proc surveyselect data=OnlineCourses out=TrainTestSplit
samprate=0.7 outall seed=12345;
run;
Output:
Selection Method | Simple Random Sampling |
---|
Input Data Set | ONLINECOURSES |
---|---|
Random Number Seed | 12345 |
Sampling Rate | 0.7 |
Sample Size | 14 |
Selection Probability | 0.7 |
Sampling Weight | 0 |
Output Data Set | TRAINTESTSPLIT |
data TrainingSet TestSet;
set TrainTestSplit;
if Selected then output TrainingSet;
else output TestSet;
run;
Output:
Obs | Selected | EnrollmentDate | CompletionDate | StudentID | CourseID | CourseCategory | CompletionStatus | Score | TimeSpent |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | 12DEC2021 | . | 1006 | 229 | Arts | Dropped | . | 42.1 |
2 | 0 | 20JAN2020 | 08APR2020 | 1009 | 229 | Arts | Completed | 67.9 | 34.9 |
3 | 0 | 15MAY2021 | 23AUG2021 | 1010 | 244 | Language | Completed | 51.8 | 11.5 |
4 | 0 | 17DEC2021 | 04JAN2022 | 1016 | 228 | Language | Completed | 63.7 | 36.6 |
5 | 0 | 13AUG2020 | . | 1018 | 212 | Language | In Progress | . | 47.7 |
6 | 0 | 09JAN2021 | . | 1019 | 236 | Data_Science | In Progress | . | 6.7 |
Predictive Modeling: Logistic Regression
/*We'll build a logistic regression model to predict the likelihood of course completion based on time spent and course category.*/
/* Create a binary variable for completion */
data OnlineCoursesModel;
set OnlineCourses;
if CompletionStatus = 'Completed' then Completed = 1;
else Completed = 0;
run;
proc print;run;
Output:
Obs | EnrollmentDate | CompletionDate | StudentID | CourseID | CourseCategory | CompletionStatus | Score | TimeSpent | Completed |
---|---|---|---|---|---|---|---|---|---|
1 | 06MAR2021 | . | 1001 | 230 | Language | In Progress | . | 41.2 | 0 |
2 | 06OCT2020 | 02JAN2021 | 1002 | 215 | Technology | Completed | 60.7 | 13.9 | 1 |
3 | 27SEP2020 | 11DEC2020 | 1003 | 247 | Business | Completed | 72.3 | 7.5 | 1 |
4 | 27DEC2021 | 11MAR2022 | 1004 | 243 | Health | Completed | 69.1 | 29.3 | 1 |
5 | 12JAN2020 | 27MAR2020 | 1005 | 221 | Language | Completed | 74.8 | 39.1 | 1 |
6 | 12DEC2021 | . | 1006 | 229 | Arts | Dropped | . | 42.1 | 0 |
7 | 17AUG2021 | 29SEP2021 | 1007 | 242 | Business | Completed | 53.1 | 40.6 | 1 |
8 | 21AUG2021 | 09SEP2021 | 1008 | 236 | Business | Completed | 75.0 | 18.7 | 1 |
9 | 20JAN2020 | 08APR2020 | 1009 | 229 | Arts | Completed | 67.9 | 34.9 | 1 |
10 | 15MAY2021 | 23AUG2021 | 1010 | 244 | Language | Completed | 51.8 | 11.5 | 1 |
11 | 26JUL2021 | 28SEP2021 | 1011 | 225 | Health | Completed | 49.1 | 15.8 | 1 |
12 | 10MAR2021 | . | 1012 | 232 | Data_Science | Dropped | . | 16.3 | 0 |
13 | 06DEC2021 | 15MAR2022 | 1013 | 210 | Health | Completed | 74.2 | 22.2 | 1 |
14 | 05MAY2020 | . | 1014 | 217 | Technology | In Progress | . | 37.0 | 0 |
15 | 07MAR2020 | 28APR2020 | 1015 | 243 | Language | Completed | 56.8 | 36.0 | 1 |
16 | 17DEC2021 | 04JAN2022 | 1016 | 228 | Language | Completed | 63.7 | 36.6 | 1 |
17 | 20JUL2021 | 01SEP2021 | 1017 | 210 | Language | Completed | 35.9 | 24.1 | 1 |
18 | 13AUG2020 | . | 1018 | 212 | Language | In Progress | . | 47.7 | 0 |
19 | 09JAN2021 | . | 1019 | 236 | Data_Science | In Progress | . | 6.7 | 0 |
20 | 27APR2020 | 26MAY2020 | 1020 | 246 | Data_Science | Completed | 64.0 | 27.6 | 1 |
/* Logistic Regression */
proc logistic data=OnlineCoursesModel;
class CourseCategory / param=ref;
model Completed(event='1') = TimeSpent CourseCategory;
title "Logistic Regression Model for Course Completion";
run;
Logistic Regression Model for Course Completion |
Model Information | |
---|---|
Data Set | WORK.ONLINECOURSESMODEL |
Response Variable | Completed |
Number of Response Levels | 2 |
Model | binary logit |
Optimization Technique | Fisher's scoring |
Number of Observations Read | 20 |
---|---|
Number of Observations Used | 20 |
Response Profile | ||
---|---|---|
Ordered Value |
Completed | Total Frequency |
1 | 0 | 6 |
2 | 1 | 14 |
Probability modeled is Completed=1. |
Class Level Information | ||||||
---|---|---|---|---|---|---|
Class | Value | Design Variables | ||||
CourseCategory | Arts | 1 | 0 | 0 | 0 | 0 |
Business | 0 | 1 | 0 | 0 | 0 | |
Data_Science | 0 | 0 | 1 | 0 | 0 | |
Health | 0 | 0 | 0 | 1 | 0 | |
Language | 0 | 0 | 0 | 0 | 1 | |
Technology | 0 | 0 | 0 | 0 | 0 |
Model Convergence Status |
---|
Quasi-complete separation of data points detected. |
Warning: | The maximum likelihood estimate may not exist. |
Warning: | The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable. |
Model Fit Statistics | ||
---|---|---|
Criterion | Intercept Only | Intercept and Covariates |
AIC | 26.435 | 29.112 |
SC | 27.430 | 36.082 |
-2 Log L | 24.435 | 15.112 |
Testing Global Null Hypothesis: BETA=0 | |||
---|---|---|---|
Test | Chi-Square | DF | Pr > ChiSq |
Likelihood Ratio | 9.3230 | 6 | 0.1562 |
Score | 6.8108 | 6 | 0.3387 |
Wald | 2.4553 | 6 | 0.8734 |
Type 3 Analysis of Effects | |||
---|---|---|---|
Effect | DF | Wald Chi-Square |
Pr > ChiSq |
TimeSpent | 1 | 1.8743 | 0.1710 |
CourseCategory | 5 | 2.3911 | 0.7928 |
Analysis of Maximum Likelihood Estimates | ||||||
---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error |
Wald Chi-Square |
Pr > ChiSq | |
Intercept | 1 | 2.9906 | 2.8006 | 1.1403 | 0.2856 | |
TimeSpent | 1 | -0.1175 | 0.0858 | 1.8743 | 0.1710 | |
CourseCategory | Arts | 1 | 1.5335 | 2.5332 | 0.3665 | 0.5449 |
CourseCategory | Business | 1 | 12.9178 | 250.3 | 0.0027 | 0.9588 |
CourseCategory | Data_Science | 1 | -1.8711 | 2.4169 | 0.5993 | 0.4388 |
CourseCategory | Health | 1 | 12.2234 | 281.0 | 0.0019 | 0.9653 |
CourseCategory | Language | 1 | 2.2384 | 2.2882 | 0.9569 | 0.3280 |
Odds Ratio Estimates | |||
---|---|---|---|
Effect | Point Estimate | 95% Wald Confidence Limits | |
TimeSpent | 0.889 | 0.751 | 1.052 |
CourseCategory Arts vs Technology | 4.634 | 0.032 | 664.078 |
CourseCategory Business vs Technology | >999.999 | <0.001 | >999.999 |
CourseCategory Data_Science vs Technology | 0.154 | 0.001 | 17.566 |
CourseCategory Health vs Technology | >999.999 | <0.001 | >999.999 |
CourseCategory Language vs Technology | 9.378 | 0.106 | 831.410 |
Association of Predicted
Probabilities and Observed Responses | |||
---|---|---|---|
Percent Concordant | 92.9 | Somers' D | 0.857 |
Percent Discordant | 7.1 | Gamma | 0.857 |
Percent Tied | 0.0 | Tau-a | 0.379 |
Pairs | 84 | c | 0.929 |
- Get link
- X
- Other Apps
Comments
Post a Comment