Smart Fraud Detection and Data Error Correction in Online Exams with SAS

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

HERE IN THIS PROJECT WE USED THESE SAS STATEMENTS —DATA STEP | LENGTH | INPUT | DATALINES | FORMAT | SET | IF-THEN- KEEP | DROP | MERGE | PROC SORT | PROC PRINT | PROC APPEND | PROC TRANSPOSE | PROC DATASETS | MACRO

Introduction

In the modern digital era, online assessments have become an essential component of education, recruitment, and professional certification systems. Universities, corporate training platforms, government examination boards, and online learning providers rely heavily on digital examinations to evaluate knowledge, skills, and competencies. These systems allow thousands or even millions of candidates to take exams remotely, making assessment processes faster, scalable, and cost-effective.

However, the rapid adoption of online assessments has also introduced several challenges. Since the entire testing process occurs digitally, large volumes of data are generated every day. This data includes information such as assessment identifiers, subjects, average scores, completion times, difficulty levels, reliability indices, and flags indicating possible cheating behavior. Managing and analyzing this data efficiently requires robust data processing tools and strong data quality management practices.

One of the biggest concerns in online assessments is data accuracy and fraud detection. Candidates may attempt to cheat by using unauthorized resources, collaborating with others, or manipulating the system to obtain higher scores. Additionally, technical issues or data entry problems can introduce errors such as missing values, incorrect scores, invalid completion times, or inconsistent text formats. If these errors remain undetected, they can lead to incorrect conclusions, unfair evaluations, and unreliable reporting.

To address these challenges, organizations rely on advanced data analytics tools such as SAS (Statistical Analysis System). SAS is widely used in data management, statistical analysis, fraud detection, and large-scale data processing across industries such as healthcare, banking, pharmaceuticals, education, and government agencies.

Using SAS programming techniques, analysts can perform several important tasks, including:

· Creating structured datasets from raw data

· Detecting and correcting data quality issues

· Standardizing inconsistent text and numeric values

· Performing date calculations and time-based analysis

· Combining data from multiple sources

· Automating fraud detection logic using macros

· Transforming data structures for reporting and analysis

This project demonstrates a practical SAS-based workflow for managing online assessment data. A simulated dataset is created containing several variables such as assessment identifiers, subject names, average scores, completion times, cheating indicators, difficulty levels, reliability indices, percentages, fees, and examination dates. To replicate real-world scenarios, multiple intentional errors are introduced into the dataset, including invalid scores, missing values, incorrect date formats, and inconsistent subject names.

The project then walks through the process of identifying these errors and correcting them using various SAS programming techniques. Functions for handling character strings and numeric values are applied to clean the dataset, while date functions such as MDY, INTCK, and INTNX are used to perform time-based calculations. Dataset manipulation techniques such as SET, MERGE, APPEND, and PROC TRANSPOSE are used to combine and restructure the data.

Furthermore, fraud detection logic is implemented using SAS macros. This allows the system to automatically identify suspicious assessment patterns, such as unusually high scores combined with extremely short completion times. Such automated detection systems are commonly used in modern examination platforms to ensure fairness and maintain the integrity of the evaluation process.

By the end of this project, readers will gain a clear understanding of how SAS can be used to perform data cleaning, fraud detection, dataset management, and automated analytics in online assessment systems.The project also demonstrates how real-world data issues can be systematically identified and resolved using structured programming approaches.

1. Creating Dataset With Intentional Errors

2. Errors Present In The Dataset

3. Corrected Dataset Creation

4. Utilization Classification

5. Date Calculations Using INTCK and INTNX

6. Fraud Detection Macro

7. Combining Datasets Using MERGE

8. Appending New Data Using PROC APPEND

9. Data Restructuring Using PROC TRANSPOSE

10. Character Functions Implementation

11. KEEP and DROP Statements

12. Deleting Temporary Dataset Using PROC DATASETS

13. Creating Second Dataset (Assessment Monitoring)

14. SET Statement For Data Concatenation

15. MERGE Statement For Data Integration

16. Data Standardization Using Character Functions

17. Numeric Functions Implementation

18. Date Handling Using MDY, INTCK, and INTNX

19. Fraud Detection Macro Automation

20. Risk Classification

21. PROC TRANSPOSE For Data Transformation

22. PROC DATASETS DELETE For Workspace Cleanup

23. Final Analytical Dataset

24. Best Practices For SAS Data Management

Business Context

Consider an online examination platform used by universities and certification agencies. Thousands of students take assessments every day across different subjects such as Mathematics, Programming, Statistics, and Business Analytics.

Each assessment generates a record containing several attributes such as:

· Assessment ID

· Subject name

· Average score

· Completion time

· Difficulty level

· Reliability index

· Cheating flag

· Fees paid

· Assessment date

Online learning platforms and digital examinations have changed how organizations conduct assessments. Traditional pen-and-paper exams are gradually being replaced by online systems that allow candidates to take tests remotely using computers or mobile devices. Universities, certification bodies, recruitment agencies, and corporate training programs widely use these platforms to evaluate candidates efficiently.

Online assessments generate large amounts of data such as scores, completion time, subject difficulty, reliability index, and cheating indicators. This data helps organizations analyze exam quality and detect suspicious behavior. However, raw data often contains issues like missing values, incorrect formats, or inconsistent entries.

To ensure accurate analysis, tools like SAS are used to clean data, standardize variables, merge datasets, and implement fraud detection logic. In this project, a simulated online assessment dataset with intentional errors is used to demonstrate how SAS techniques can identify and correct data issues while preparing the dataset for reliable analysis.

The analytics team must analyze this dataset to answer several questions:

· Which assessments have suspicious completion times?

· Are there abnormal score patterns indicating cheating?

· Are the data values valid and correctly formatted?

· What percentage of students completed the exam successfully?

Dataset Structure

The dataset will contain the following variables.

Variable	Description
Assessment_ID	Unique ID for each assessment
Subject	Subject name
Avg_Score	Average score obtained
Completion_Time	Time taken to complete exam
Cheating_Flag	Indicates suspicious behavior
Difficulty_Level	Easy Medium Hard
Reliability_Index	Reliability measure of assessment
Percentage	Score percentage
Fees	Assessment fee
Utilization_Class	Performance classification
Exam_Date	Assessment date

1. Creating Dataset With Intentional Errors

data online_assessment_raw;

length Assessment_ID $10 Subject $20 Difficulty_Level $10 Cheating_Flag $5;

input Assessment_ID $ Subject $ Avg_Score Completion_Time Cheating_Flag $

Difficulty_Level $ Reliability_Index Percentage Fees Exam_Date:ddmmyy10.;

format Exam_Date date9.;

datalines;

A001 math 85 35 no easy 0.95 85 500 01-01-2025

A002 science 78 40 no medium 0.90 78 600 02-01-2025

A003 statistics 92 20 yes hard 0.88 92 700 03-01-2025

A004 programming 66 50 no medium 0.75 66 800 04-01-2025

A005 mathematics 110 10 yes easy 0.96 110 500 05-01-2025

A006 analytics . 45 no medium 0.91 70 650 06-01-2025

A007 data_science 89 -10 no hard 0.87 89 900 07-01-2025

A008 sql 77 30 no medium 0.93 77 750 08-01-2025

A009 python 84 25 no hard 0.92 84 800 09-01-2025

A010 machine_learning 91 15 yes hard 0.85 91 950 10-01-2025

A011 ai 88 28 no medium 0.90 88 900 11-01-2025

A012 deep_learning 95 22 yes hard 0.86 95 1000 12-01-2025

A013 statistics 82 35 no medium 0.91 82 700 13-01-2025

A014 business 74 40 no easy 0.89 74 600 14-01-2025

A015 finance 69 48 no medium 0.88 69 650 15-01-2025

;

run;

proc print data=online_assessment_raw;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date
1	A001	math	easy	no	85	35	0.95	85	500	01JAN2025
2	A002	science	medium	no	78	40	0.90	78	600	02JAN2025
3	A003	statistics	hard	yes	92	20	0.88	92	700	03JAN2025
4	A004	programming	medium	no	66	50	0.75	66	800	04JAN2025
5	A005	mathematics	easy	yes	110	10	0.96	110	500	05JAN2025
6	A006	analytics	medium	no	.	45	0.91	70	650	06JAN2025
7	A007	data_science	hard	no	89	-10	0.87	89	900	07JAN2025
8	A008	sql	medium	no	77	30	0.93	77	750	08JAN2025
9	A009	python	hard	no	84	25	0.92	84	800	09JAN2025
10	A010	machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025
11	A011	ai	medium	no	88	28	0.90	88	900	11JAN2025
12	A012	deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025
13	A013	statistics	medium	no	82	35	0.91	82	700	13JAN2025
14	A014	business	easy	no	74	40	0.89	74	600	14JAN2025
15	A015	finance	medium	no	69	48	0.88	69	650	15JAN2025

Errors Present In The Dataset

Intentional errors exist in this dataset.

Error 1 – Incorrect Date Format

The dataset uses 01-01-2025 while SAS expects dates using DATE9 or numeric SAS date values.

This will produce incorrect date interpretation.

Error 2 – Impossible Score

A005 mathematics 110

Score cannot exceed 100.

Error 3 – Missing Value

Avg_Score .

Missing score must be handled properly.

Error 4 – Negative Completion Time

Completion_Time -10

This is logically impossible.

Error 5 – Inconsistent Subject Names

Examples:

math

mathematics

statistics

data_science

These should be standardized.

2. Corrected Dataset Creation

data online_assessment_clean;

set online_assessment_raw;

Subject = propcase(Subject);

if Subject="Math" then Subject="Mathematics";

Completion_Time = abs(Completion_Time);

if Avg_Score > 100 then Avg_Score = 100;

Avg_Score = coalesce(Avg_Score,70);

run;

proc print data=online_assessment_clean;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025

DATA Step

data online_assessment_clean;

Creates a new cleaned dataset.

SET Statement

set online_assessment_raw;

Reads observations from raw dataset.

PROPCASE Function

Subject = propcase(Subject);

Converts text to proper case.

Example:

math → Math

statistics → Statistics

ABS Function

Completion_Time = abs(Completion_Time);

Converts negative numbers into positive numbers.

COALESCE Function

Avg_Score = coalesce(Avg_Score,70);

Replaces missing score with default value.

Corrected the dataset containing several intentional errors.
The corrections included:

· Converting subject names to proper case

· Handling missing scores

· Fixing negative completion time

· Correcting invalid scores above 100

· Converting date values properly

3. Utilization Classification

data assessment_utilization;

set online_assessment_clean;

length Utilization_Class $20;

if Percentage >=90 then Utilization_Class="Excellent";

else if Percentage >=75 then Utilization_Class="Good";

else if Percentage >=60 then Utilization_Class="Average";

else Utilization_Class="Poor";

run;

proc print data=assessment_utilization;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average

4. Date Calculations Using INTCK,INTNX

data assessment_duration;

set assessment_utilization;

Months_Since_Exam = intck('month',Exam_Date,today());

Next_Assessment = intnx('month',Exam_Date,6,'same');

run;

proc print data=assessment_duration;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Months_Since_Exam	Next_Assessment
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	14	23923
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	14	23924
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	14	23925
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	14	23926
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	14	23927
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	14	23928
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	14	23929
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	14	23930
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	14	23931
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	14	23932
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	14	23933
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	14	23934
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	14	23935
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	14	23936
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	14	23937

INTCK calculates the number of months between exam date and current date.

Predicts next exam after 6 months.

5. Fraud Detection Macro

%macro detect_fraud;

data fraud_cases;

set assessment_duration;

length Fraud_Flag $12;

if Completion_Time < 15 and Avg_Score > 90 then Fraud_Flag="High Risk";

else if Cheating_Flag="yes" then Fraud_Flag="Medium Risk";

else Fraud_Flag="Low Risk";

run;

proc print data=fraud_cases;

run;

%mend;

%detect_fraud;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Months_Since_Exam	Next_Assessment	Fraud_Flag
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	14	23923	Low Risk
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	14	23924	Low Risk
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	14	23925	Medium Risk
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	14	23926	Low Risk
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	14	23927	High Risk
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	14	23928	Low Risk
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	14	23929	Low Risk
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	14	23930	Low Risk
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	14	23931	Low Risk
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	14	23932	Medium Risk
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	14	23933	Low Risk
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	14	23934	Medium Risk
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	14	23935	Low Risk
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	14	23936	Low Risk
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	14	23937	Low Risk

Explanation Of Fraud Logic

Fraud is detected using rules:

Rule 1
Very high score + very short completion time.

Rule 2
Cheating flag already marked.

Rule 3
Otherwise considered normal.

6. Combining Datasets Using MERGE

proc sort data=fraud_cases;by Assessment_ID;run;

proc print data=fraud_cases;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Months_Since_Exam	Next_Assessment	Fraud_Flag
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	14	23923	Low Risk
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	14	23924	Low Risk
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	14	23925	Medium Risk
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	14	23926	Low Risk
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	14	23927	High Risk
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	14	23928	Low Risk
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	14	23929	Low Risk
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	14	23930	Low Risk
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	14	23931	Low Risk
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	14	23932	Medium Risk
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	14	23933	Low Risk
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	14	23934	Medium Risk
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	14	23935	Low Risk
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	14	23936	Low Risk
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	14	23937	Low Risk

proc sort data=assessment_duration;by Assessment_ID;run;

proc print data=assessment_duration;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Months_Since_Exam	Next_Assessment
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	14	23923
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	14	23924
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	14	23925
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	14	23926
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	14	23927
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	14	23928
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	14	23929
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	14	23930
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	14	23931
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	14	23932
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	14	23933
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	14	23934
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	14	23935
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	14	23936
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	14	23937

data combined_dataset;

merge fraud_cases

assessment_duration;

by Assessment_ID;

run;

proc print data=combined_dataset;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Months_Since_Exam	Next_Assessment	Fraud_Flag
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	14	23923	Low Risk
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	14	23924	Low Risk
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	14	23925	Medium Risk
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	14	23926	Low Risk
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	14	23927	High Risk
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	14	23928	Low Risk
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	14	23929	Low Risk
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	14	23930	Low Risk
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	14	23931	Low Risk
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	14	23932	Medium Risk
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	14	23933	Low Risk
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	14	23934	Medium Risk
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	14	23935	Low Risk
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	14	23936	Low Risk
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	14	23937	Low Risk

MERGE combines datasets based on Assessment_ID.

7. Appending New Data

proc append base=combined_dataset

data=assessment_duration;

run;

proc print data=combined_dataset;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Months_Since_Exam	Next_Assessment	Fraud_Flag
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	14	23923	Low Risk
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	14	23924	Low Risk
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	14	23925	Medium Risk
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	14	23926	Low Risk
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	14	23927	High Risk
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	14	23928	Low Risk
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	14	23929	Low Risk
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	14	23930	Low Risk
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	14	23931	Low Risk
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	14	23932	Medium Risk
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	14	23933	Low Risk
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	14	23934	Medium Risk
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	14	23935	Low Risk
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	14	23936	Low Risk
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	14	23937	Low Risk
16	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	14	23923
17	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	14	23924
18	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	14	23925
19	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	14	23926
20	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	14	23927
21	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	14	23928
22	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	14	23929
23	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	14	23930
24	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	14	23931
25	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	14	23932
26	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	14	23933
27	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	14	23934
28	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	14	23935
29	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	14	23936
30	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	14	23937

Adds additional records to dataset.

8. Using PROC TRANSPOSE

proc transpose data=combined_dataset out=transpose_scores;

var Avg_Score Percentage;

run;

proc print data=transpose_scores;

run;

OUTPUT:


Obs	_NAME_	COL1	COL2	COL3	COL4	COL5	COL6	COL7	COL8	COL9	COL10	COL11	COL12	COL13	COL14	COL15	COL16	COL17	COL18	COL19	COL20	COL21	COL22	COL23	COL24	COL25	COL26	COL27	COL28	COL29	COL30
1	Avg_Score	85	78	92	66	100	70	89	77	84	91	88	95	82	74	69	85	78	92	66	100	70	89	77	84	91	88	95	82	74	69
2	Percentage	85	78	92	66	110	70	89	77	84	91	88	95	82	74	69	85	78	92	66	110	70	89	77	84	91	88	95	82	74	69

Restructures rows into columns.

9. Character Functions

data assessment_char;

set assessment_utilization;

Subject_Clean = strip(trim(Subject));

Subject_Code = catx("-",Subject,Assessment_ID);

run;

proc print data=assessment_char;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Subject_Clean	Subject_Code
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	Mathematics	Mathematics-A001
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	Science	Science-A002
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	Statistics	Statistics-A003
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	Programming	Programming-A004
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	Mathematics	Mathematics-A005
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	Analytics	Analytics-A006
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	Data_science	Data_science-A007
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	Sql	Sql-A008
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	Python	Python-A009
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	Machine_learning	Machine_learning-A010
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	Ai	Ai-A011
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	Deep_learning	Deep_learning-A012
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	Statistics	Statistics-A013
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	Business	Business-A014
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	Finance	Finance-A015

Functions used:

Function	Purpose
STRIP	Remove leading/trailing spaces
TRIM	Remove trailing spaces
CAT	Concatenate strings
CATX	Concatenate with delimiter
UPCASE	Convert to uppercase
LOWCASE	Convert to lowercase

10. Keep And Drop

data assessment_K;

set assessment_char;

keep Assessment_ID Subject Percentage Fees Exam_Date Utilization_Class

Subject_Code;

run;

proc print data=assessment_K;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Percentage	Fees	Exam_Date	Utilization_Class	Subject_Code
1	A001	Mathematics	85	500	01JAN2025	Good	Mathematics-A001
2	A002	Science	78	600	02JAN2025	Good	Science-A002
3	A003	Statistics	92	700	03JAN2025	Excellent	Statistics-A003
4	A004	Programming	66	800	04JAN2025	Average	Programming-A004
5	A005	Mathematics	110	500	05JAN2025	Excellent	Mathematics-A005
6	A006	Analytics	70	650	06JAN2025	Average	Analytics-A006
7	A007	Data_science	89	900	07JAN2025	Good	Data_science-A007
8	A008	Sql	77	750	08JAN2025	Good	Sql-A008
9	A009	Python	84	800	09JAN2025	Good	Python-A009
10	A010	Machine_learning	91	950	10JAN2025	Excellent	Machine_learning-A010
11	A011	Ai	88	900	11JAN2025	Good	Ai-A011
12	A012	Deep_learning	95	1000	12JAN2025	Excellent	Deep_learning-A012
13	A013	Statistics	82	700	13JAN2025	Good	Statistics-A013
14	A014	Business	74	600	14JAN2025	Average	Business-A014
15	A015	Finance	69	650	15JAN2025	Average	Finance-A015

data assessment_D;

set assessment_char;

Drop Subject Difficulty_Level Cheating_Flag Avg_Score Completion_Time

Reliability_Index Percentage Fees ;

run;

proc print data=assessment_D;

run;

OUTPUT:


Obs	Assessment_ID	Exam_Date	Utilization_Class	Subject_Clean	Subject_Code
1	A001	01JAN2025	Good	Mathematics	Mathematics-A001
2	A002	02JAN2025	Good	Science	Science-A002
3	A003	03JAN2025	Excellent	Statistics	Statistics-A003
4	A004	04JAN2025	Average	Programming	Programming-A004
5	A005	05JAN2025	Excellent	Mathematics	Mathematics-A005
6	A006	06JAN2025	Average	Analytics	Analytics-A006
7	A007	07JAN2025	Good	Data_science	Data_science-A007
8	A008	08JAN2025	Good	Sql	Sql-A008
9	A009	09JAN2025	Good	Python	Python-A009
10	A010	10JAN2025	Excellent	Machine_learning	Machine_learning-A010
11	A011	11JAN2025	Good	Ai	Ai-A011
12	A012	12JAN2025	Excellent	Deep_learning	Deep_learning-A012
13	A013	13JAN2025	Good	Statistics	Statistics-A013
14	A014	14JAN2025	Average	Business	Business-A014
15	A015	15JAN2025	Average	Finance	Finance-A015

11. Deleting Temporary Dataset

proc datasets library=work;

delete assessment_D transpose_scores;

quit;

LOG:

 NOTE: Deleting WORK.ASSESSMENT_D (memtype=DATA).

 NOTE: Deleting WORK.TRANSPOSE_SCORES (memtype=DATA).

Deletes unwanted datasets.

12. Creating Second Dataset

data assessment_monitoring;

length Assessment_ID $10 Invigilator $20 Location $20;

input Assessment_ID $ Invigilator $ Location $;

datalines;

A001 Rahul Hyderabad

A002 Anita Delhi

A003 Kiran Bangalore

A004 Mohan Chennai

A005 Ritu Mumbai

A006 Arjun Pune

A007 Suresh Hyderabad

A008 Kavya Delhi

A009 Meena Bangalore

A010 Rajesh Chennai

A011 Sunil Hyderabad

A012 Priya Mumbai

A013 Pooja Delhi

A014 Ajay Bangalore

A015 Deepak Chennai

;

run;

proc print data=assessment_monitoring;

run;

OUTPUT:


Obs	Assessment_ID	Invigilator	Location
1	A001	Rahul	Hyderabad
2	A002	Anita	Delhi
3	A003	Kiran	Bangalore
4	A004	Mohan	Chennai
5	A005	Ritu	Mumbai
6	A006	Arjun	Pune
7	A007	Suresh	Hyderabad
8	A008	Kavya	Delhi
9	A009	Meena	Bangalore
10	A010	Rajesh	Chennai
11	A011	Sunil	Hyderabad
12	A012	Priya	Mumbai
13	A013	Pooja	Delhi
14	A014	Ajay	Bangalore
15	A015	Deepak	Chennai

Explanation Of Monitoring Dataset

This dataset represents invigilation or monitoring information collected during the assessment.

Variables included:

Variable	Meaning
Assessment_ID	Unique assessment identifier
Invigilator	Supervisor responsible
Location	Examination location

Such data is useful for identifying location-based fraud patterns.

13. SET Statement For Data Concatenation

data assessment_combined;

set online_assessment_clean

assessment_utilization;

run;

proc print data=assessment_combined;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025
16	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good
17	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good
18	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent
19	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average
20	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent
21	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average
22	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good
23	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good
24	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good
25	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent
26	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good
27	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent
28	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good
29	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average
30	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average

Explanation Of SET Statement

The SET statement reads observations sequentially from datasets and combines them.

Key characteristics:

· Observations from first dataset appear first

· Then observations from next dataset

· Variables are aligned automatically

Example workflow:

Dataset1

A001

A002

Dataset2

A003

A004

After SET:

A001

A002

A003

A004

14. MERGE Statement For Data Integration

proc sort data=assessment_utilization;

by Assessment_ID;

run;

proc print data=assessment_utilization;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average

proc sort data=assessment_monitoring;

by Assessment_ID;

run;

proc print data=assessment_monitoring;

run;

OUTPUT:


Obs	Assessment_ID	Invigilator	Location
1	A001	Rahul	Hyderabad
2	A002	Anita	Delhi
3	A003	Kiran	Bangalore
4	A004	Mohan	Chennai
5	A005	Ritu	Mumbai
6	A006	Arjun	Pune
7	A007	Suresh	Hyderabad
8	A008	Kavya	Delhi
9	A009	Meena	Bangalore
10	A010	Rajesh	Chennai
11	A011	Sunil	Hyderabad
12	A012	Priya	Mumbai
13	A013	Pooja	Delhi
14	A014	Ajay	Bangalore
15	A015	Deepak	Chennai

SAS requires datasets to be sorted before merging. Sorting ensures correct matching of observations.

data assessment_merged;

merge assessment_utilization

assessment_monitoring;

by Assessment_ID;

run;

proc print data=assessment_merged;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Invigilator	Location
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	Rahul	Hyderabad
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	Anita	Delhi
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	Kiran	Bangalore
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	Mohan	Chennai
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	Ritu	Mumbai
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	Arjun	Pune
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	Suresh	Hyderabad
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	Kavya	Delhi
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	Meena	Bangalore
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	Rajesh	Chennai
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	Sunil	Hyderabad
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	Priya	Mumbai
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	Pooja	Delhi
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	Ajay	Bangalore
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	Deepak	Chennai

Explanation Of MERGE

MERGE combines datasets horizontally.

Meaning:

Variables from multiple datasets are combined into a single observation.

Example:

Dataset1

ID	Score
A001	85

Dataset2

ID	Location
A001	Hyderabad

After MERGE

ID	Score	Location
A001	85	Hyderabad

15. Data Standardization Using Character Functions

data assessment_text_clean;

set assessment_merged;

Subject = strip(trim(Subject));

Subject = propcase(Subject);

Subject_Upper = upcase(Subject);

Subject_Lower = lowcase(Subject);

Subject_Code = catx("_",Subject,Assessment_ID);

run;

proc print data=assessment_text_clean;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Invigilator	Location	Subject_Upper	Subject_Lower	Subject_Code
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	Rahul	Hyderabad	MATHEMATICS	mathematics	Mathematics_A001
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	Anita	Delhi	SCIENCE	science	Science_A002
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	Kiran	Bangalore	STATISTICS	statistics	Statistics_A003
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	Mohan	Chennai	PROGRAMMING	programming	Programming_A004
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	Ritu	Mumbai	MATHEMATICS	mathematics	Mathematics_A005
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	Arjun	Pune	ANALYTICS	analytics	Analytics_A006
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	Suresh	Hyderabad	DATA_SCIENCE	data_science	Data_science_A007
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	Kavya	Delhi	SQL	sql	Sql_A008
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	Meena	Bangalore	PYTHON	python	Python_A009
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	Rajesh	Chennai	MACHINE_LEARNING	machine_learning	Machine_learning_A010
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	Sunil	Hyderabad	AI	ai	Ai_A011
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	Priya	Mumbai	DEEP_LEARNING	deep_learning	Deep_learning_A012
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	Pooja	Delhi	STATISTICS	statistics	Statistics_A013
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	Ajay	Bangalore	BUSINESS	business	Business_A014
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	Deepak	Chennai	FINANCE	finance	Finance_A015

Explanation Of Character Functions

STRIP

Removes leading and trailing spaces.

Example:

"  Mathematics  "

Becomes

"Mathematics"

TRIM

Removes trailing spaces.

PROPCASE

Converts text into proper case.

Example

statistics → Statistics

UPCASE

Converts all characters to uppercase.

Statistics → STATISTICS

LOWCASE

Converts text to lowercase.

STATISTICS → statistics

CATX

Concatenates multiple strings using a delimiter.

Example:

Subject_Code = catx("_",Subject,Assessment_ID);

Output:

Mathematics_A001

16. Numeric Functions

data assessment_numeric;

set assessment_text_clean;

Rounded_Score = round(Avg_Score,1);

Score_Difference = abs(Avg_Score - Percentage);

Adjusted_Fees = round(Fees * 1.05);

run;

proc print data=assessment_numeric;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Invigilator	Location	Subject_Upper	Subject_Lower	Subject_Code	Rounded_Score	Score_Difference	Adjusted_Fees
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	Rahul	Hyderabad	MATHEMATICS	mathematics	Mathematics_A001	85	0	525
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	Anita	Delhi	SCIENCE	science	Science_A002	78	0	630
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	Kiran	Bangalore	STATISTICS	statistics	Statistics_A003	92	0	735
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	Mohan	Chennai	PROGRAMMING	programming	Programming_A004	66	0	840
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	Ritu	Mumbai	MATHEMATICS	mathematics	Mathematics_A005	100	10	525
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	Arjun	Pune	ANALYTICS	analytics	Analytics_A006	70	0	683
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	Suresh	Hyderabad	DATA_SCIENCE	data_science	Data_science_A007	89	0	945
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	Kavya	Delhi	SQL	sql	Sql_A008	77	0	788
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	Meena	Bangalore	PYTHON	python	Python_A009	84	0	840
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	Rajesh	Chennai	MACHINE_LEARNING	machine_learning	Machine_learning_A010	91	0	998
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	Sunil	Hyderabad	AI	ai	Ai_A011	88	0	945
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	Priya	Mumbai	DEEP_LEARNING	deep_learning	Deep_learning_A012	95	0	1050
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	Pooja	Delhi	STATISTICS	statistics	Statistics_A013	82	0	735
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	Ajay	Bangalore	BUSINESS	business	Business_A014	74	0	630
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	Deepak	Chennai	FINANCE	finance	Finance_A015	69	0	683

Explanation

ROUND

Rounds numeric value.

88.456 → 88.5

ABS

Absolute value function.

-10 → 10

Used earlier for fixing negative completion time.

ROUND For Fees

Sometimes organizations increase fees by certain percentage.

Fees * 1.05

Represents 5% increase.

17. Date Handling

data assessment_dates;

set assessment_numeric;

Next_Assessment_Date = intnx('month',Exam_Date,6,'same');

Months_Since_Exam = intck('month',Exam_Date,today());

Exam_Start_Date = mdy(1,1,2025);

Days_Between = intck('day',Exam_Date,today());

run;

proc print data=assessment_dates;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Invigilator	Location	Subject_Upper	Subject_Lower	Subject_Code	Rounded_Score	Score_Difference	Adjusted_Fees	Next_Assessment_Date	Months_Since_Exam	Exam_Start_Date	Days_Between
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	Rahul	Hyderabad	MATHEMATICS	mathematics	Mathematics_A001	85	0	525	23923	14	23742	427
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	Anita	Delhi	SCIENCE	science	Science_A002	78	0	630	23924	14	23742	426
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	Kiran	Bangalore	STATISTICS	statistics	Statistics_A003	92	0	735	23925	14	23742	425
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	Mohan	Chennai	PROGRAMMING	programming	Programming_A004	66	0	840	23926	14	23742	424
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	Ritu	Mumbai	MATHEMATICS	mathematics	Mathematics_A005	100	10	525	23927	14	23742	423
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	Arjun	Pune	ANALYTICS	analytics	Analytics_A006	70	0	683	23928	14	23742	422
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	Suresh	Hyderabad	DATA_SCIENCE	data_science	Data_science_A007	89	0	945	23929	14	23742	421
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	Kavya	Delhi	SQL	sql	Sql_A008	77	0	788	23930	14	23742	420
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	Meena	Bangalore	PYTHON	python	Python_A009	84	0	840	23931	14	23742	419
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	Rajesh	Chennai	MACHINE_LEARNING	machine_learning	Machine_learning_A010	91	0	998	23932	14	23742	418
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	Sunil	Hyderabad	AI	ai	Ai_A011	88	0	945	23933	14	23742	417
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	Priya	Mumbai	DEEP_LEARNING	deep_learning	Deep_learning_A012	95	0	1050	23934	14	23742	416
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	Pooja	Delhi	STATISTICS	statistics	Statistics_A013	82	0	735	23935	14	23742	415
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	Ajay	Bangalore	BUSINESS	business	Business_A014	74	0	630	23936	14	23742	414
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	Deepak	Chennai	FINANCE	finance	Finance_A015	69	0	683	23937	14	23742	413

Explanation Of Date Functions

MDY

Creates a SAS date value.

MDY(month,day,year)

Example

MDY(1,1,2025)

Represents January 1 2025.

INTCK

Counts intervals between two dates.

Example:

intck('month',Exam_Date,today())

Returns number of months.

INTNX

Calculates future or past date.

Example:

intnx('month',Exam_Date,6)

Adds six months.

18. Fraud Detection Macro Automation

%macro fraud_detection;

data fraud_analysis;

set assessment_dates;

length Fraud_Level $20;

if Completion_Time < 15 and Avg_Score > 90 then Fraud_Level="High Risk";

else if Cheating_Flag="yes" then Fraud_Level="Medium Risk";

else if Completion_Time > 60 then Fraud_Level="Suspicious Delay";

else Fraud_Level="Normal";

run;

proc print data=fraud_analysis;

run;

%mend fraud_detection;

%fraud_detection;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Invigilator	Location	Subject_Upper	Subject_Lower	Subject_Code	Rounded_Score	Score_Difference	Adjusted_Fees	Next_Assessment_Date	Months_Since_Exam	Exam_Start_Date	Days_Between	Fraud_Level
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	Rahul	Hyderabad	MATHEMATICS	mathematics	Mathematics_A001	85	0	525	23923	14	23742	427	Normal
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	Anita	Delhi	SCIENCE	science	Science_A002	78	0	630	23924	14	23742	426	Normal
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	Kiran	Bangalore	STATISTICS	statistics	Statistics_A003	92	0	735	23925	14	23742	425	Medium Risk
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	Mohan	Chennai	PROGRAMMING	programming	Programming_A004	66	0	840	23926	14	23742	424	Normal
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	Ritu	Mumbai	MATHEMATICS	mathematics	Mathematics_A005	100	10	525	23927	14	23742	423	High Risk
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	Arjun	Pune	ANALYTICS	analytics	Analytics_A006	70	0	683	23928	14	23742	422	Normal
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	Suresh	Hyderabad	DATA_SCIENCE	data_science	Data_science_A007	89	0	945	23929	14	23742	421	Normal
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	Kavya	Delhi	SQL	sql	Sql_A008	77	0	788	23930	14	23742	420	Normal
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	Meena	Bangalore	PYTHON	python	Python_A009	84	0	840	23931	14	23742	419	Normal
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	Rajesh	Chennai	MACHINE_LEARNING	machine_learning	Machine_learning_A010	91	0	998	23932	14	23742	418	Medium Risk
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	Sunil	Hyderabad	AI	ai	Ai_A011	88	0	945	23933	14	23742	417	Normal
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	Priya	Mumbai	DEEP_LEARNING	deep_learning	Deep_learning_A012	95	0	1050	23934	14	23742	416	Medium Risk
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	Pooja	Delhi	STATISTICS	statistics	Statistics_A013	82	0	735	23935	14	23742	415	Normal
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	Ajay	Bangalore	BUSINESS	business	Business_A014	74	0	630	23936	14	23742	414	Normal
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	Deepak	Chennai	FINANCE	finance	Finance_A015	69	0	683	23937	14	23742	413	Normal

Macros automate repetitive tasks.

Advantages:

Macros automate repetitive tasks.

· Reduce code duplication

· Improve automation

· Increase efficiency

· Enable parameterized analysis

In large datasets containing millions of rows, macros help create automated fraud detection systems.

19. Risk Classification

data fraud_risk_classification;

set fraud_analysis;

length Risk_Category $20;

if Fraud_Level="High Risk" then Risk_Category="Critical";

else if Fraud_Level="Medium Risk" then Risk_Category="Warning";

else if Fraud_Level="Suspicious Delay" then Risk_Category="Review";

else Risk_Category="Safe";

run;

proc print data=fraud_risk_classification;

run;

OUTPUT:


Obs	Assessment_ID	Subject	Difficulty_Level	Cheating_Flag	Avg_Score	Completion_Time	Reliability_Index	Percentage	Fees	Exam_Date	Utilization_Class	Invigilator	Location	Subject_Upper	Subject_Lower	Subject_Code	Rounded_Score	Score_Difference	Adjusted_Fees	Next_Assessment_Date	Months_Since_Exam	Exam_Start_Date	Days_Between	Fraud_Level	Risk_Category
1	A001	Mathematics	easy	no	85	35	0.95	85	500	01JAN2025	Good	Rahul	Hyderabad	MATHEMATICS	mathematics	Mathematics_A001	85	0	525	23923	14	23742	427	Normal	Safe
2	A002	Science	medium	no	78	40	0.90	78	600	02JAN2025	Good	Anita	Delhi	SCIENCE	science	Science_A002	78	0	630	23924	14	23742	426	Normal	Safe
3	A003	Statistics	hard	yes	92	20	0.88	92	700	03JAN2025	Excellent	Kiran	Bangalore	STATISTICS	statistics	Statistics_A003	92	0	735	23925	14	23742	425	Medium Risk	Warning
4	A004	Programming	medium	no	66	50	0.75	66	800	04JAN2025	Average	Mohan	Chennai	PROGRAMMING	programming	Programming_A004	66	0	840	23926	14	23742	424	Normal	Safe
5	A005	Mathematics	easy	yes	100	10	0.96	110	500	05JAN2025	Excellent	Ritu	Mumbai	MATHEMATICS	mathematics	Mathematics_A005	100	10	525	23927	14	23742	423	High Risk	Critical
6	A006	Analytics	medium	no	70	45	0.91	70	650	06JAN2025	Average	Arjun	Pune	ANALYTICS	analytics	Analytics_A006	70	0	683	23928	14	23742	422	Normal	Safe
7	A007	Data_science	hard	no	89	10	0.87	89	900	07JAN2025	Good	Suresh	Hyderabad	DATA_SCIENCE	data_science	Data_science_A007	89	0	945	23929	14	23742	421	Normal	Safe
8	A008	Sql	medium	no	77	30	0.93	77	750	08JAN2025	Good	Kavya	Delhi	SQL	sql	Sql_A008	77	0	788	23930	14	23742	420	Normal	Safe
9	A009	Python	hard	no	84	25	0.92	84	800	09JAN2025	Good	Meena	Bangalore	PYTHON	python	Python_A009	84	0	840	23931	14	23742	419	Normal	Safe
10	A010	Machine_learning	hard	yes	91	15	0.85	91	950	10JAN2025	Excellent	Rajesh	Chennai	MACHINE_LEARNING	machine_learning	Machine_learning_A010	91	0	998	23932	14	23742	418	Medium Risk	Warning
11	A011	Ai	medium	no	88	28	0.90	88	900	11JAN2025	Good	Sunil	Hyderabad	AI	ai	Ai_A011	88	0	945	23933	14	23742	417	Normal	Safe
12	A012	Deep_learning	hard	yes	95	22	0.86	95	1000	12JAN2025	Excellent	Priya	Mumbai	DEEP_LEARNING	deep_learning	Deep_learning_A012	95	0	1050	23934	14	23742	416	Medium Risk	Warning
13	A013	Statistics	medium	no	82	35	0.91	82	700	13JAN2025	Good	Pooja	Delhi	STATISTICS	statistics	Statistics_A013	82	0	735	23935	14	23742	415	Normal	Safe
14	A014	Business	easy	no	74	40	0.89	74	600	14JAN2025	Average	Ajay	Bangalore	BUSINESS	business	Business_A014	74	0	630	23936	14	23742	414	Normal	Safe
15	A015	Finance	medium	no	69	48	0.88	69	650	15JAN2025	Average	Deepak	Chennai	FINANCE	finance	Finance_A015	69	0	683	23937	14	23742	413	Normal	Safe

20. PROC TRANSPOSE

proc transpose data=fraud_risk_classification out=assessment_transpose;

var Avg_Score Percentage Completion_Time;

run;

proc print data=assessment_transpose;

run;

OUTPUT:


Obs	_NAME_	COL1	COL2	COL3	COL4	COL5	COL6	COL7	COL8	COL9	COL10	COL11	COL12	COL13	COL14	COL15
1	Avg_Score	85	78	92	66	100	70	89	77	84	91	88	95	82	74	69
2	Percentage	85	78	92	66	110	70	89	77	84	91	88	95	82	74	69
3	Completion_Time	35	40	20	50	10	45	10	30	25	15	28	22	35	40	48

Explanation

PROC TRANSPOSE converts columns into rows.

Example:

Original

ID	Score	Percentage
A001	85	85

After transpose

Variable	Value
Score	85
Percentage	85

21. PROC DATASETS DELETE

proc datasets library=work;

delete assessment_combined

assessment_numeric

assessment_dates;

quit;

LOG:

 NOTE: Deleting WORK.ASSESSMENT_COMBINED (memtype=DATA).

 NOTE: Deleting WORK.ASSESSMENT_NUMERIC (memtype=DATA).

 NOTE: Deleting WORK.ASSESSMENT_DATES (memtype=DATA).

Explanation

PROC DATASETS manages SAS libraries.
DELETE removes unwanted datasets.
Benefits:
· Reduces memory usage
· Keeps workspace organized
· Improves performance
Final Analytical Dataset

The final dataset now includes:

· Clean subject names

· Corrected scores

· Fraud detection flags

· Risk classification

· Monitoring information

· Date analytics

· Fee adjustments

This dataset is ready for:

· Reporting

· Dashboard visualization

· Fraud investigation

Best Practices For SAS

· Always validate raw data

· Identify missing values

· Standardize text variables

· Validate numeric ranges

· Correct date formats

· Use macros for automation

· Document code clearly

· Remove temporary datasets

Conclusion

This project demonstrates a SAS workflow for managing and analyzing online assessment data with intentional errors such as invalid scores, missing values, negative completion times, and inconsistent subject names. Using SAS programming techniques, these issues were detected and corrected through character functions (PROPCASE, STRIP, CATX), numeric functions (ABS, COALESCE), and date functions (MDY, INTCK, INTNX).

Dataset operations such as SET, MERGE, PROC APPEND, and PROC TRANSPOSE were used to combine and restructure data. A SAS macro was implemented to automate fraud detection based on suspicious patterns like high scores with very short completion times. The final dataset is clean, validated, and ready for reporting, analysis, and fraud monitoring in online assessment systems.

SAS INTERVIEW QUESTIONS

1. What is the RETAIN statement used for?

The RETAIN statement in SAS is used to hold or preserve the value of a variable from one iteration of the DATA step to the next. Normally, SAS resets variables to missing at the beginning of each iteration, but RETAIN prevents that behavior.

It is commonly used for:

· Running totals

· Carrying previous values forward

· Creating cumulative calculations

Example:

data example;

set sales;

retain total_sales 0;

total_sales + amount;

run;

Explanation:

· retain total_sales 0; initializes and keeps the value across rows.

· total_sales + amount; keeps adding the amount for each observation.

· This creates a cumulative total.

2. What is the difference between DROP and KEEP?

Both DROP and KEEP control which variables appear in the output dataset.

Feature	DROP	KEEP
Purpose	Removes variables from output dataset	Keeps only specified variables
Logic	Excludes listed variables	Includes only listed variables
Usage	When most variables are needed	When only a few variables are needed

Example using DROP:

data newdata;

set olddata;

drop temp_var;

run;

temp_var will not appear in the output dataset.

Example using KEEP:

data newdata;

set olddata;

keep id name salary;

run;

Only id, name, and salary will appear in the dataset.

3. What are automatic variables in SAS?

Automatic variables are variables created automatically by SAS during DATA step execution. They are not written to the output dataset, but they are available for processing.

The two most common automatic variables are:

Variable	Purpose
`_N_`	Counts the number of times the DATA step executes
`_ERROR_`	Indicates if an error occurred (1 = error, 0 = no error)

Example:

data test;

set sales;

obs_number = _N_;

run;

Explanation:

· _N_ increases by 1 for every observation processed.

· It helps track row numbers or iterations.

Example output:

N	obs_number
1	1
2	2
3	3

Key Points

· Created automatically by SAS

· Not saved in the dataset

· Useful for debugging and tracking iterations

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent ONLINE EXAMS data.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics
·  Clinical SAS Programmer
·  Research Data Analyst
·  Regulatory Data Validator
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can We Analyze Water Sources Data and Evaluate Environmental Risks Using SAS Programming Techniques?

2.Can Advanced SAS Analysis Identify Fuel Consumption Patterns in Ocean Ship Voyages?
3. Can SAS Programming Reveal Business Influence Among Famous Delhi Celebrity Families?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

419.Can We Identify Suspicious Exam Behavior and Correct Data Issues Using Advanced SAS Techniques?

Smart Fraud Detection and Data Error Correction in Online Exams with SAS

HERE IN THIS PROJECT WE USED THESE SAS STATEMENTS —DATA STEP | LENGTH | INPUT | DATALINES | FORMAT | SET | IF-THEN- KEEP | DROP | MERGE | PROC SORT | PROC PRINT | PROC APPEND | PROC TRANSPOSE | PROC DATASETS | MACRO

Introduction

Table Of Contents

Explanation Of Monitoring Dataset

Explanation Of SET Statement

Explanation Of MERGE

Explanation Of Character Functions

STRIP

TRIM

PROPCASE

UPCASE

LOWCASE

CATX

Explanation

ROUND

ABS

ROUND For Fees

Explanation Of Date Functions

MDY

INTCK

INTNX

Explanation

Conclusion

1. What is the RETAIN statement used for?

2. What is the difference between DROP and KEEP?

3. What are automatic variables in SAS?

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can We Analyze Water Sources Data and Evaluate Environmental Risks Using SAS Programming Techniques?2.Can Advanced SAS Analysis Identify Fuel Consumption Patterns in Ocean Ship Voyages?3. Can SAS Programming Reveal Business Influence Among Famous Delhi Celebrity Families?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

412.Can We Build And Clean A University Course Analytics & Fraud Detection System In Sas Using Only Macros While Intentionally Creating And Fixing Errors?

420.Can We Detect Errors, Prevent Fraud, And Optimize Biometric Access System Security Using Advanced SAS Programming?

453.Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

1.Can We Analyze Water Sources Data and Evaluate Environmental Risks Using SAS Programming Techniques?

2.Can Advanced SAS Analysis Identify Fuel Consumption Patterns in Ocean Ship Voyages?
3. Can SAS Programming Reveal Business Influence Among Famous Delhi Celebrity Families?