438.Can Advanced SAS Programming Detect and Fix Errors in Astronomical Observation Data While Improving Accuracy and Reliability?
Detecting and Fixing Errors in Astronomical Observation Data Using Advanced SAS Programming for Better Accuracy and Reliability
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HERE IN THIS PROJECT WE USED
THESE SAS STATEMENTS —
DATA | SET | INPUT |
DATALINES | IF | OUTPUT | PROC SORT | PROC APPEND | PROC TRANSPOSE | PROC
DATASETS | RUN | %MACRO | %MEND | FUNCTIONS | R LANGUAGE
Table of Contents
- Introduction
- Business
Context
- Raw Dataset
Design (SAS & R Same)
- Creating
Dataset with Intentional Errors
- Code 1–15+
(Using PROG1 Statements)
- Error Identification &
Explanation
- Data
Cleaning & Correction
- Final Clean
Dataset Code
- Advanced Insights
- 20 Key Points About The
Project
- Summary
- Conclusion
1. Introduction
Astronomical observation data plays a
crucial role in scientific discovery. However, raw telescope data often
contains inconsistencies such as missing values, incorrect formats, and
unrealistic measurements. This project demonstrates how Advanced SAS Programming can identify,
correct, and optimize such errors.
Astronomical observation data is critical for understanding celestial objects and cosmic events. However, raw data collected from telescopes often contains inconsistencies such as missing values, incorrect measurements, and logical errors. These issues can significantly impact the accuracy of scientific analysis. Advanced SAS programming provides robust tools and techniques to detect, correct, and standardize such data efficiently. In this project, we simulate real-world astronomical data with intentional errors and apply SAS-based data cleaning methods. The objective is to transform unreliable raw data into a structured and accurate dataset, ensuring better reliability, consistency, and usability for researchers and analysts in space science.
2. Business Context
Space
research organizations depend on high-quality observational data. Errors in
signal readings or confidence scores can lead to incorrect conclusions about
celestial objects.
This
project simulates a space analytics system where:
- Telescopes collect raw data
- Data contains noise/errors
- SAS is used to clean and
validate
Space
research organizations and observatories rely heavily on accurate observational
data for decision-making and scientific discoveries. Errors in telescope data,
such as incorrect intensity readings or invalid confidence levels, can lead to
misleading conclusions about celestial objects. Therefore, maintaining
high-quality data is essential. In this context, Advanced SAS programming acts
as a powerful solution to identify, validate, and correct such errors
systematically. This project represents a data processing workflow where raw
astronomical data is cleaned and optimized before analysis. The goal is to
improve data integrity, reduce uncertainty, and support reliable insights,
ultimately enhancing the efficiency and credibility of astronomical research
operations.
Goal:
👉 Improve data accuracy
👉 Ensure scientific reliability
👉 Enable decision-making for astronomers
3. Raw Dataset (SAS & R SAME STRUCTURE)
DATA astro_raw;
INPUT Telescope_ID $ Observation_Time :DATETIME20.
Light_Intensity Signal_Noise_Ratio Object_Type $
Detection_Confidence Percentage Fees;
FORMAT Observation_Time DATETIME20.;
DATALINES;
T001 01JAN2025:10:00:00 500 20 STAR 95 85 1000
T002 01JAN2025:11:00:00 . 15 GALAXY 90 80 1200
T003 01JAN2025:12:00:00 700 . NEBULA 88 78 1100
T004 01JAN2025:13:00:00 -100 25 STAR 105 90 1300
T005 01JAN2025:14:00:00 600 18 UNKNOWN 85 . 900
T006 01JAN2025:15:00:00 550 20 STAR . 82 1000
T007 01JAN2025:16:00:00 800 30 GALAXY 92 88 .
T008 01JAN2025:17:00:00 650 22 NEBULA 87 85 1150
T009 01JAN2025:18:00:00 720 28 STAR 91 89 1250
T010 01JAN2025:19:00:00 680 26 GALAXY 89 87 1180
;
RUN;
Proc print data=astro_raw;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | 01JAN2025:10:00:00 | 500 | 20 | STAR | 95 | 85 | 1000 |
| 2 | T002 | 01JAN2025:11:00:00 | . | 15 | GALAXY | 90 | 80 | 1200 |
| 3 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 |
| 4 | T004 | 01JAN2025:13:00:00 | -100 | 25 | STAR | 105 | 90 | 1300 |
| 5 | T005 | 01JAN2025:14:00:00 | 600 | 18 | UNKNOWN | 85 | . | 900 |
| 6 | T006 | 01JAN2025:15:00:00 | 550 | 20 | STAR | . | 82 | 1000 |
| 7 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | . |
| 8 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 |
| 9 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 |
| 10 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 |
astro_raw <- data.frame(
Telescope_ID = c("T001","T002","T003","T004","T005","T006","T007",
"T008","T009","T010"),
Observation_Time = as.POSIXct(c(
"2025-01-01 10:00:00","2025-01-01 11:00:00","2025-01-01 12:00:00",
"2025-01-01 13:00:00","2025-01-01 14:00:00","2025-01-01 15:00:00",
"2025-01-01 16:00:00","2025-01-01 17:00:00","2025-01-01 18:00:00",
"2025-01-01 19:00:00")),
Light_Intensity = c(500, NA, 700, -100, 600, 550, 800, 650, 720, 680),
Signal_Noise_Ratio = c(20,15,NA,25,18,20,30,22,28,26),
Object_Type = c("STAR","GALAXY","NEBULA","STAR","UNKNOWN","STAR","GALAXY",
"NEBULA","STAR","GALAXY"),
Detection_Confidence = c(95,90,88,105,85,NA,92,87,91,89),
Percentage = c(85,80,78,90,NA,82,88,85,89,87),
Fees = c(1000,1200,1100,1300,900,1000,NA,1150,1250,1180)
OUTPUT:
|
|
Telescope_ID |
Observation_Time |
Light_Intensity |
Signal_Noise_Ratio |
Object_Type |
Detection_Confidence |
Percentage |
Fees |
|
1 |
T001 |
2025-01-01 10:00:00 |
500 |
20 |
STAR |
95 |
85 |
1000 |
|
2 |
T002 |
2025-01-01 11:00:00 |
NA |
15 |
GALAXY |
90 |
80 |
1200 |
|
3 |
T003 |
2025-01-01 12:00:00 |
700 |
NA |
NEBULA |
88 |
78 |
1100 |
|
4 |
T004 |
2025-01-01 13:00:00 |
-100 |
25 |
STAR |
105 |
90 |
1300 |
|
5 |
T005 |
2025-01-01 14:00:00 |
600 |
18 |
UNKNOWN |
85 |
NA |
900 |
|
6 |
T006 |
2025-01-01 15:00:00 |
550 |
20 |
STAR |
NA |
82 |
1000 |
|
7 |
T007 |
2025-01-01 16:00:00 |
800 |
30 |
GALAXY |
92 |
88 |
NA |
|
8 |
T008 |
2025-01-01 17:00:00 |
650 |
22 |
NEBULA |
87 |
85 |
1150 |
|
9 |
T009 |
2025-01-01 18:00:00 |
720 |
28 |
STAR |
91 |
89 |
1250 |
|
10 |
T010 |
2025-01-01 19:00:00 |
680 |
26 |
GALAXY |
89 |
87 |
1180 |
4. Intentional Errors
|
Error
Type |
Example |
|
Missing
values |
Light_Intensity
= . |
|
Negative
values |
-100 |
|
Invalid
range |
Detection_Confidence
= 105 |
|
Unknown
category |
UNKNOWN |
|
Missing
fees |
. |
5. SAS Codes (PROG1 Based)
Code 1: Sorting Data
PROC SORT DATA=astro_raw OUT=astro_sorted;
BY Telescope_ID;
RUN;
Proc print data=astro_sorted;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | 01JAN2025:10:00:00 | 500 | 20 | STAR | 95 | 85 | 1000 |
| 2 | T002 | 01JAN2025:11:00:00 | . | 15 | GALAXY | 90 | 80 | 1200 |
| 3 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 |
| 4 | T004 | 01JAN2025:13:00:00 | -100 | 25 | STAR | 105 | 90 | 1300 |
| 5 | T005 | 01JAN2025:14:00:00 | 600 | 18 | UNKNOWN | 85 | . | 900 |
| 6 | T006 | 01JAN2025:15:00:00 | 550 | 20 | STAR | . | 82 | 1000 |
| 7 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | . |
| 8 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 |
| 9 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 |
| 10 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 |
Explanation
Sorts dataset for structured processing.
Why used
· Required
before BY processing
· Improves
readability
Code 2: Handling Missing Light Intensity
DATA astro_fix1;
SET astro_sorted;
IF Light_Intensity=. THEN Light_Intensity=MEAN(500,600,700);
RUN;
Proc print data=astro_fix1;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | 01JAN2025:10:00:00 | 500 | 20 | STAR | 95 | 85 | 1000 |
| 2 | T002 | 01JAN2025:11:00:00 | 600 | 15 | GALAXY | 90 | 80 | 1200 |
| 3 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 |
| 4 | T004 | 01JAN2025:13:00:00 | -100 | 25 | STAR | 105 | 90 | 1300 |
| 5 | T005 | 01JAN2025:14:00:00 | 600 | 18 | UNKNOWN | 85 | . | 900 |
| 6 | T006 | 01JAN2025:15:00:00 | 550 | 20 | STAR | . | 82 | 1000 |
| 7 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | . |
| 8 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 |
| 9 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 |
| 10 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 |
Explanation
Replaces missing values with average.
Why
· Avoids bias
· Maintains continuity
Code 3: Fix Negative Values
DATA astro_fix2;
SET astro_fix1;
IF Light_Intensity < 0 THEN Light_Intensity=ABS(Light_Intensity);
RUN;
Proc print data=astro_fix2;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | 01JAN2025:10:00:00 | 500 | 20 | STAR | 95 | 85 | 1000 |
| 2 | T002 | 01JAN2025:11:00:00 | 600 | 15 | GALAXY | 90 | 80 | 1200 |
| 3 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 |
| 4 | T004 | 01JAN2025:13:00:00 | 100 | 25 | STAR | 105 | 90 | 1300 |
| 5 | T005 | 01JAN2025:14:00:00 | 600 | 18 | UNKNOWN | 85 | . | 900 |
| 6 | T006 | 01JAN2025:15:00:00 | 550 | 20 | STAR | . | 82 | 1000 |
| 7 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | . |
| 8 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 |
| 9 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 |
| 10 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 |
Explanation
Converts negative values to positive.
Why
· Physical
measurements cannot be negative
Code 4: Fix Detection Confidence
DATA astro_fix3;
SET astro_fix2;
IF Detection_Confidence > 100 THEN Detection_Confidence=100;
RUN;
Proc print data=astro_fix3;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | 01JAN2025:10:00:00 | 500 | 20 | STAR | 95 | 85 | 1000 |
| 2 | T002 | 01JAN2025:11:00:00 | 600 | 15 | GALAXY | 90 | 80 | 1200 |
| 3 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 |
| 4 | T004 | 01JAN2025:13:00:00 | 100 | 25 | STAR | 100 | 90 | 1300 |
| 5 | T005 | 01JAN2025:14:00:00 | 600 | 18 | UNKNOWN | 85 | . | 900 |
| 6 | T006 | 01JAN2025:15:00:00 | 550 | 20 | STAR | . | 82 | 1000 |
| 7 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | . |
| 8 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 |
| 9 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 |
| 10 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 |
Code 5: Replace Missing Confidence
DATA astro_fix4;
SET astro_fix3;
IF Detection_Confidence=. THEN Detection_Confidence=90;
RUN;
Proc print data=astro_fix4;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | 01JAN2025:10:00:00 | 500 | 20 | STAR | 95 | 85 | 1000 |
| 2 | T002 | 01JAN2025:11:00:00 | 600 | 15 | GALAXY | 90 | 80 | 1200 |
| 3 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 |
| 4 | T004 | 01JAN2025:13:00:00 | 100 | 25 | STAR | 100 | 90 | 1300 |
| 5 | T005 | 01JAN2025:14:00:00 | 600 | 18 | UNKNOWN | 85 | . | 900 |
| 6 | T006 | 01JAN2025:15:00:00 | 550 | 20 | STAR | 90 | 82 | 1000 |
| 7 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | . |
| 8 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 |
| 9 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 |
| 10 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 |
Code 6: Fix Percentage
DATA astro_fix5;
SET astro_fix4;
IF Percentage=. THEN Percentage=80;
RUN;
Proc print data=astro_fix5;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | 01JAN2025:10:00:00 | 500 | 20 | STAR | 95 | 85 | 1000 |
| 2 | T002 | 01JAN2025:11:00:00 | 600 | 15 | GALAXY | 90 | 80 | 1200 |
| 3 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 |
| 4 | T004 | 01JAN2025:13:00:00 | 100 | 25 | STAR | 100 | 90 | 1300 |
| 5 | T005 | 01JAN2025:14:00:00 | 600 | 18 | UNKNOWN | 85 | 80 | 900 |
| 6 | T006 | 01JAN2025:15:00:00 | 550 | 20 | STAR | 90 | 82 | 1000 |
| 7 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | . |
| 8 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 |
| 9 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 |
| 10 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 |
Code 7: Fix Fees
DATA astro_fix6;
SET astro_fix5;
IF Fees=. THEN Fees=1000;
RUN;
Proc print data=astro_fix6;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | 01JAN2025:10:00:00 | 500 | 20 | STAR | 95 | 85 | 1000 |
| 2 | T002 | 01JAN2025:11:00:00 | 600 | 15 | GALAXY | 90 | 80 | 1200 |
| 3 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 |
| 4 | T004 | 01JAN2025:13:00:00 | 100 | 25 | STAR | 100 | 90 | 1300 |
| 5 | T005 | 01JAN2025:14:00:00 | 600 | 18 | UNKNOWN | 85 | 80 | 900 |
| 6 | T006 | 01JAN2025:15:00:00 | 550 | 20 | STAR | 90 | 82 | 1000 |
| 7 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | 1000 |
| 8 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 |
| 9 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 |
| 10 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 |
Code 8: Clean Object Type
DATA astro_fix7;
SET astro_fix6;
IF Object_Type="UNKNOWN" THEN Object_Type="STAR";
RUN;
Proc print data=astro_fix7;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | 01JAN2025:10:00:00 | 500 | 20 | STAR | 95 | 85 | 1000 |
| 2 | T002 | 01JAN2025:11:00:00 | 600 | 15 | GALAXY | 90 | 80 | 1200 |
| 3 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 |
| 4 | T004 | 01JAN2025:13:00:00 | 100 | 25 | STAR | 100 | 90 | 1300 |
| 5 | T005 | 01JAN2025:14:00:00 | 600 | 18 | STAR | 85 | 80 | 900 |
| 6 | T006 | 01JAN2025:15:00:00 | 550 | 20 | STAR | 90 | 82 | 1000 |
| 7 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | 1000 |
| 8 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 |
| 9 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 |
| 10 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 |
Code 9: Create New Variable
DATA astro_fix8;
SET astro_fix7;
Quality_Score = (Light_Intensity * Detection_Confidence)/100;
RUN;
Proc print data=astro_fix8;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees | Quality_Score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | 01JAN2025:10:00:00 | 500 | 20 | STAR | 95 | 85 | 1000 | 475.0 |
| 2 | T002 | 01JAN2025:11:00:00 | 600 | 15 | GALAXY | 90 | 80 | 1200 | 540.0 |
| 3 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 | 616.0 |
| 4 | T004 | 01JAN2025:13:00:00 | 100 | 25 | STAR | 100 | 90 | 1300 | 100.0 |
| 5 | T005 | 01JAN2025:14:00:00 | 600 | 18 | STAR | 85 | 80 | 900 | 510.0 |
| 6 | T006 | 01JAN2025:15:00:00 | 550 | 20 | STAR | 90 | 82 | 1000 | 495.0 |
| 7 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | 1000 | 736.0 |
| 8 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 | 565.5 |
| 9 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 | 655.2 |
| 10 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 | 605.2 |
Code 10: Filtering High Quality
DATA astro_high;
SET astro_fix8;
IF Quality_Score > 500;
RUN;
Proc print data=astro_high;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees | Quality_Score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T002 | 01JAN2025:11:00:00 | 600 | 15 | GALAXY | 90 | 80 | 1200 | 540.0 |
| 2 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 | 616.0 |
| 3 | T005 | 01JAN2025:14:00:00 | 600 | 18 | STAR | 85 | 80 | 900 | 510.0 |
| 4 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | 1000 | 736.0 |
| 5 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 | 565.5 |
| 6 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 | 655.2 |
| 7 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 | 605.2 |
Code 11: PROC MEANS
PROC MEANS DATA=astro_fix8;
VAR Light_Intensity Signal_Noise_Ratio;
RUN;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|
Light_Intensity Signal_Noise_Ratio | 10 9 | 590.0000000 22.6666667 | 192.8730152 4.9244289 | 100.0000000 15.0000000 | 800.0000000 30.0000000 |
Code 12: PROC FREQ
PROC FREQ DATA=astro_fix8;
TABLES Object_Type;
RUN;
OUTPUT:
The FREQ Procedure
| Object_Type | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| GALAXY | 3 | 30.00 | 3 | 30.00 |
| NEBULA | 2 | 20.00 | 5 | 50.00 |
| STAR | 5 | 50.00 | 10 | 100.00 |
Code 13: PROC TRANSPOSE
PROC TRANSPOSE DATA=astro_fix8 OUT=astro_trans;
VAR Light_Intensity Signal_Noise_Ratio;
RUN;
Proc print data=astro_trans;
run;
OUTPUT:
| Obs | _NAME_ | COL1 | COL2 | COL3 | COL4 | COL5 | COL6 | COL7 | COL8 | COL9 | COL10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Light_Intensity | 500 | 600 | 700 | 100 | 600 | 550 | 800 | 650 | 720 | 680 |
| 2 | Signal_Noise_Ratio | 20 | 15 | . | 25 | 18 | 20 | 30 | 22 | 28 | 26 |
Code 14: Macro Creation
%MACRO summary;
PROC MEANS DATA=astro_fix8;
RUN;
%MEND;
%summary;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|
Observation_Time Light_Intensity Signal_Noise_Ratio Detection_Confidence Percentage Fees Quality_Score | 10 10 9 10 10 10 10 | 2051361000 590.0000000 22.6666667 90.7000000 84.4000000 1108.00 529.7900000 | 10899.54 192.8730152 4.9244289 4.2700507 4.1952354 129.4260836 170.6431914 | 2051344800 100.0000000 15.0000000 85.0000000 78.0000000 900.0000000 100.0000000 | 2051377200 800.0000000 30.0000000 100.0000000 90.0000000 1300.00 736.0000000 |
Code 15: Append Dataset
PROC APPEND BASE=astro_fix8
DATA=astro_high;
RUN;
Proc print data=astro_fix8;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees | Quality_Score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | 01JAN2025:10:00:00 | 500 | 20 | STAR | 95 | 85 | 1000 | 475.0 |
| 2 | T002 | 01JAN2025:11:00:00 | 600 | 15 | GALAXY | 90 | 80 | 1200 | 540.0 |
| 3 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 | 616.0 |
| 4 | T004 | 01JAN2025:13:00:00 | 100 | 25 | STAR | 100 | 90 | 1300 | 100.0 |
| 5 | T005 | 01JAN2025:14:00:00 | 600 | 18 | STAR | 85 | 80 | 900 | 510.0 |
| 6 | T006 | 01JAN2025:15:00:00 | 550 | 20 | STAR | 90 | 82 | 1000 | 495.0 |
| 7 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | 1000 | 736.0 |
| 8 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 | 565.5 |
| 9 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 | 655.2 |
| 10 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 | 605.2 |
| 11 | T002 | 01JAN2025:11:00:00 | 600 | 15 | GALAXY | 90 | 80 | 1200 | 540.0 |
| 12 | T003 | 01JAN2025:12:00:00 | 700 | . | NEBULA | 88 | 78 | 1100 | 616.0 |
| 13 | T005 | 01JAN2025:14:00:00 | 600 | 18 | STAR | 85 | 80 | 900 | 510.0 |
| 14 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | 1000 | 736.0 |
| 15 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 | 565.5 |
| 16 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 | 655.2 |
| 17 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 | 605.2 |
6. Error Explanation
Example Errors:
- Missing
Light_Intensity
- Causes incorrect
analysis
- Negative
values
- Physically
impossible
- Confidence
>100
- Logical
inconsistency
- Unknown
Object
- Classification
issue
7. Corrected Final Dataset (FULL CODE)
DATA astro_final;
SET astro_raw;
IF Light_Intensity=. THEN Light_Intensity=600;
IF Light_Intensity<0 THEN Light_Intensity=ABS(Light_Intensity);
IF Signal_Noise_Ratio=. THEN Signal_Noise_Ratio=20;
IF Detection_Confidence=. THEN Detection_Confidence=90;
IF Detection_Confidence>100 THEN Detection_Confidence=100;
IF Percentage=. THEN Percentage=80;
IF Fees=. THEN Fees=1000;
IF Object_Type="UNKNOWN" THEN Object_Type="STAR";
Quality_Score = (Light_Intensity * Detection_Confidence)/100;
RUN;
Proc print data=astro_final;
run;
OUTPUT:
| Obs | Telescope_ID | Observation_Time | Light_Intensity | Signal_Noise_Ratio | Object_Type | Detection_Confidence | Percentage | Fees | Quality_Score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | 01JAN2025:10:00:00 | 500 | 20 | STAR | 95 | 85 | 1000 | 475.0 |
| 2 | T002 | 01JAN2025:11:00:00 | 600 | 15 | GALAXY | 90 | 80 | 1200 | 540.0 |
| 3 | T003 | 01JAN2025:12:00:00 | 700 | 20 | NEBULA | 88 | 78 | 1100 | 616.0 |
| 4 | T004 | 01JAN2025:13:00:00 | 100 | 25 | STAR | 100 | 90 | 1300 | 100.0 |
| 5 | T005 | 01JAN2025:14:00:00 | 600 | 18 | STAR | 85 | 80 | 900 | 510.0 |
| 6 | T006 | 01JAN2025:15:00:00 | 550 | 20 | STAR | 90 | 82 | 1000 | 495.0 |
| 7 | T007 | 01JAN2025:16:00:00 | 800 | 30 | GALAXY | 92 | 88 | 1000 | 736.0 |
| 8 | T008 | 01JAN2025:17:00:00 | 650 | 22 | NEBULA | 87 | 85 | 1150 | 565.5 |
| 9 | T009 | 01JAN2025:18:00:00 | 720 | 28 | STAR | 91 | 89 | 1250 | 655.2 |
| 10 | T010 | 01JAN2025:19:00:00 | 680 | 26 | GALAXY | 89 | 87 | 1180 | 605.2 |
Explanation of Final Code
· Combines
all corrections in one step
· Improves
performance
· Ensures clean dataset
8. Advanced Insights
- High Light_Intensity →
Better detection
- GALAXY observations → Higher
noise
- Quality Score helps ranking
9. 20 Key Points About The Project
- Astronomical datasets
often contain missing and inconsistent values.
- Raw telescope data may
include noise and measurement errors.
- Advanced SAS helps in
identifying data quality issues efficiently.
- DATA step is used for
row-level data manipulation and corrections.
- SET statement allows
reading and processing existing datasets.
- IF conditions help
detect logical errors like invalid ranges.
- Missing values can be
imputed using MEAN or fixed values.
- Negative measurements
are corrected using ABS functions.
- PROC SORT organizes
data for structured processing.
- PROC MEANS provides
statistical summaries for validation.
- PROC FREQ helps analyze
categorical inconsistencies.
- Invalid values (e.g.,
confidence >100) are capped logically.
- Unknown categories can
be standardized for consistency.
- Derived variables like
Quality_Score improve analysis depth.
- PROC TRANSPOSE helps
reshape data for reporting needs.
- Macros automate
repetitive validation and reporting tasks.
- PROC APPEND combines
datasets for extended analysis.
- Data cleaning
improves accuracy and reduces bias.
- Clean datasets enable
reliable scientific conclusions.
- SAS ensures
scalability and efficiency in data processing workflows.
10. Summary
This
project demonstrates how Advanced SAS Programming can effectively detect and
correct errors in astronomical observation data. By creating a raw dataset with
intentional issues such as missing values, negative measurements, and invalid
ranges, we simulated real-world data challenges. Using PROG1 statements, we
applied structured techniques like data step transformations, conditional
logic, and procedures such as PROC SORT, PROC MEANS, and PROC FREQ to clean and
analyze the data. Each error was identified and corrected systematically,
ensuring the dataset became accurate and reliable. Additionally, derived
variables like Quality_Score provided deeper analytical insights. The final
dataset was optimized for performance and usability, making it suitable for
scientific analysis and reporting. This approach highlights the importance of
data validation, cleaning, and transformation in ensuring high-quality results
in astronomical research and other data-driven domains.
11. Conclusion
Advanced
SAS programming proves to be a powerful solution for detecting and correcting
errors in astronomical observation data. By systematically identifying issues
such as missing values, negative measurements, and invalid ranges, SAS enables
the transformation of unreliable raw data into a clean and structured format.
Techniques like DATA step processing, conditional logic, and procedures such as
PROC MEANS and PROC FREQ play a crucial role in validating and improving data
quality. Additionally, the creation of derived variables enhances analytical
capabilities and provides deeper insights into observations. The use of macros further
improves efficiency by automating repetitive tasks. Overall, this approach
ensures that the final dataset is accurate, consistent, and suitable for
scientific analysis. As a result, researchers and organizations can rely on
high-quality data to make informed decisions, reduce uncertainty, and improve
the credibility and effectiveness of astronomical research and discoveries.
This
project clearly shows:
✔ SAS can detect errors
✔ SAS can correct inconsistencies
✔ Data quality improves significantly
✔ Reliable insights are generated
INTERVIEW QUESTIONS FOR YOU
1. WHERE vs. IF Statements
Question:
What is the difference between a WHERE statement and
an IF
statement in a DATA step?
Short
Answer: A WHERE
statement filters data before
it enters the Program Data Vector (PDV), making it more efficient because SAS
doesn't have to process every row. An IF
statement filters data after
it is in the PDV. I use IF
when I need to filter based on a variable I just created in that same DATA
step, which WHERE
cannot see.
Question:
What is the difference between a WHERE statement and
an IF
statement in a DATA step?
Short
Answer: A WHERE
statement filters data before
it enters the Program Data Vector (PDV), making it more efficient because SAS
doesn't have to process every row. An IF
statement filters data after
it is in the PDV. I use IF
when I need to filter based on a variable I just created in that same DATA
step, which WHERE
cannot see.
2. Character to Numeric
Conversion (PUT vs. INPUT)
Question:
How do you remember when to use PUT versus INPUT
functions?
Short
Answer: I use INPUT()
for In-put (converting
Character → Numeric) using an informat.
I use PUT()
for Out-put (converting
Numeric → Character) using a format.
A simple trick is: Input
= Informat, and Put = Format.
Question:
How do you remember when to use PUT versus INPUT
functions?
Short
Answer: I use INPUT()
for In-put (converting
Character → Numeric) using an informat.
I use PUT()
for Out-put (converting
Numeric → Character) using a format.
A simple trick is: Input
= Informat, and Put = Format.
3. PROC APPEND vs. SET
Statement
Question:
Why would you use PROC
APPEND instead of a SET statement to
combine two datasets?
Short
Answer: I use PROC
APPEND when I want to add a small 'base' dataset to a very
large 'master' dataset. Unlike a SET statement,
which reads and rewrites the entire combined file, PROC APPEND only
processes the new data and adds it to the end. This saves a massive amount of time
and computer memory (I/O).
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
Question:
Why would you use PROC
APPEND instead of a SET statement to
combine two datasets?
Short
Answer: I use PROC
APPEND when I want to add a small 'base' dataset to a very
large 'master' dataset. Unlike a SET statement,
which reads and rewrites the entire combined file, PROC APPEND only
processes the new data and adds it to the end. This saves a massive amount of time
and computer memory (I/O).
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent ASTRONOMICAL data.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Follow Us On :
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--->Follow our blog for more SAS-based analytics projects and industry data models.
---> Support Us By Following Our Blog..
To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment