DATA VALIDATION AND DATA CLEANING:::
-->WE CAN CLEAN THE DATA OUTSIDE OF THE SAS ENVIRONMENT LIKE TEXT AND EXCEL FILE
-->EDIT THE RAW DATA BY USING RIGHT CLICK-----> "EDIT MODE"--->BROWSE MODE OPTION
-->BY PROGRAMATICALLY CAN WE CLEAN THE DATA?
YES WE CAN,
--->PROC PRINT:: BOTH CHARACTER AND NUMERIC
--->PROC FREQ:: BOTH CHARACTER AND NUMERIC
--->PROC MEANS:: NUMERIC
--->PROC UNIVARIATE:: NUMERIC
--->PROC SUMMARY:: NUMERIC
EG1:
PROC PRINT DATA=SASHELP.CLASS;
WHERE NAME IS MISSING;
RUN;
EG2:
PROC PRINT DATA=SASHELP.CLASS NOOBS ;
WHERE SEX IS MISSING;
RUN;
--->PROC MEANS::: IT WILL GIVE THE "SUMMARY STATISTICS FOR ALL NON MISSING
NUMERIC VARIABLES"
SYNTAX::
PROC MEANS DATA=XXXX.YYY;
RUN;
OPTIONS::::
---VAR ---N ---MEAN ---STANDARD DEVIATION(STD) --MIN --MAX
--NMISS(NO OF MISSING NUMERICALS VALUES) --OUTPUT OUT=(CREATING NEW DATAASET)
DEFAULT OUTPUT::::WILL BE
"N MEAN STD MIN MAX" FOR ALL NON MISSING NUMERICAL DATA
1.WITHOUT "VAR" STATEMENT IT WILL STATISTICS FOR ALL NON MISSING NUMERICAL VARIABLES
EG1:
PROC MEANS DATA=SASHELP.CLASS;
RUN;
2.WITH "VAR" STATEMENT IT WILL STATISTICS FOR PARTICULAR NUMERICAL VARIABLES
EG2:
PROC MEANS DATA=SASHELP.CLASS;
VAR AGE HEIGHT;
RUN;
EG3:
PROC MEANS DATA=SASHELP.CLASS;
VAR AGE;
RUN;
3.TO SUPPRESS/CONTROLLING THE DEFAULT OUTPUT OF MEANS????
---USE N MEAN STD MIN MAX AFTER THE DATASET
EG1:
PROC MEANS DATA=SASHELP.CLASS N STD;
RUN;
EG2:
PROC MEANS DATA=SASHELP.CLASS MIN MEAN;
VAR AGE;
RUN;
4.WHAT IF THE DATA HAVING THE MISSING VALUES IN NUMERICAL VARIABLES????
---USE THE "NMISS" AFTER THE DATASET
EG1:
PROC MEANS DATA=SASHELP.CLASS NMISS;
RUN;
EG2:
PROC MEANS DATA=SASHELP.CLASS NMISS;
VAR AGE HEIGHT;
RUN;
5.CREATING NEW DATASET IN PROC MEANS??
---USE OUTPUT OUT=
EG1:
PROC MEANS DATA=SASHELP.CLASS;
VAR AGE HEIGHT;
OUTPUT OUT=WORK.SAS1; * DEFAULT VAR WILL BE CREATED (_TPYE_ _FREQ_ _STATUS_);
RUN;
--->PROC SUMMARY:::NUMERIC
1. BY DEFAULT IT WILL GIVE "COUNT OF THE OBSERVATIONS"
2.IF WE USING THE "VAR" STATEMENT IN SUMMARY PROCEDURE "SAME OUTPUT LIKE PROC MEANS"
3."NO PRINT" IS DEFAULT SO HAVE TO GIVE "PRINT" AFTER THE DATASETNAME
SYNTAX:::
EG1:
PROC SUMMARY DATA=XXXX.YYY;
RUN;
--->WITHOUT "PRINT" OPTION NO OUTPUT
EG2:
PROC SUMMARY DATA=SASHELP.CLASS;
RUN;
--->WITH PRINT OPTION IT WILL GIVE A COUNT OF OBS
EG3:
PROC SUMMARY DATA=SASHELP.CLASS PRINT;
RUN;
EG4:
PROC SUMMARY DATA=SASHELP.HEART PRINT;
RUN;
2.IF WE USING THE "VAR" STATEMENT IN SUMMARY PROCEDURE "SAME OUTPUT LIKE MEANS"
EG1:
PROC SUMMARY DATA=SASHELP.CLASS PRINT N MEAN;
VAR AGE HEIGHT;
RUN;
---->PROC UNIVARIATE:::: NUMERIC
IT WILL GIVE EXTREME OBSERVATIONS,QUANTILES,BASIC STATS,TEST FOR LOCATIONS,MOMENTS
SYNTAX::
EG1:
PROC UNIVARIATE DATA=XXXX.YYYY;
VAR NUMERIC;
RUN;
WITHOUT VAR STATEMENT ALL NUMERIC GIVE ALL NUMERIC VARIABLES
EG2:
PROC UNIVARIATE DATA=SASHELP.CLASS;
RUN;
WITH VAR IT WILL EXTREME OBS GIVE FOR PARTICULAR VAR
EG3:
PROC UNIVARIATE DATA=SASHELP.CLASS;
VAR AGE;
RUN;
--->PROC FREQ
-->FOR CHARACTER
-->FOR NUMERIC
--->IT WILL GIVE 'FREQUENCY TABLES AND N-WAY FREQ TABLES' FOR THE VARIABLES
SYNTAX:::
EG1:
PROC FREQ DATA=XXXX.XXXX;
TABLE XXX XXXX XXX;
RUN;
DEFAULT OUTPUT::IT WILL GIVE
1.FREQUENCY 2.PERCENTAGE 3.CUMULATIVE FREQUENCY 4.CUMULATIVE PERCENTAGE
WITH "TABLE" STATEMENT IT WILL GIVE FREQ TABLES FOR PARTICULAR VARIABLES
EG1:
PROC FREQ DATA=SASHELP.CLASS;
TABLE NAME HEIGHT;
RUN;
WITHOUT "TABLE" STATEMENT IT WILL GIVE FREQ TABLES FOR ALL VARIABLES
EG2:
PROC FREQ DATA=SASHELP.CLASS;
RUN;
NLEVELS:::GIVE THE LEVELS/COUNT OF OBS FOR VARIABLES
EG3:
PROC FREQ DATA=SASHELP.CLASS NLEVELS;
TABLE NAME SEX AGE;
RUN;
EG4:
PROC FREQ DATA=SASHELP.CLASS NLEVELS;
TABLE NAME SEX;
RUN;
EG5:
PROC FREQ DATA=PROG1.SALES NLEVELS;
TABLE JOB_TITLE GENDER;
RUN;
EG6:
PROC FREQ DATA=PROG1.NONSALES NLEVELS;
TABLE JOB_TITLE GENDER;
RUN;
SUPPRESS/REMOVE THE DEFAULT OUTPUT:SO WE CAN ADD NOPERCENT NOCUM
EG7:
PROC FREQ DATA=SASHELP.CLASS;
TABLE NAME HEIGHT/ NOPERCENT;
RUN;
WITH SPACE IN B/W THE TWO VARIABLES CALLED ONE-WAY FREQUENCY
EG8:
PROC FREQ DATA=SASHELP.CLASS;
TABLE NAME HEIGHT;
RUN;
ONE WAY:: SPACE B/W THE TWO VARIABLES
EG9:
PROC FREQ DATA=SASHELP.HEART;
TABLE STATUS SEX;
RUN;
TWO WAY:: ASTRICK * B/W THE TWO VARIABLES
EG10:
PROC FREQ DATA=SASHELP.HEART;
TABLE STATUS*SEX; *DEFAULT OUTPUT:: FREQ PERCENT ROWPCT COLPCT;
RUN;
THREE WAY:: 2 ASTRICK IN B/W THREE VAR
EG11:
PROC FREQ DATA=SASHELP.HEART;
TABLE STATUS*SEX*AgeAtStart;
RUN;
TWO-WAY FREQ(ASTRICK(*) IN B/W TWO VARIABLES OUTPUT IS
--->OUTPUT:: 1.FREQ 2.PERCENT 3.ROW PER 4.COL PER
EG12:
PROC FREQ DATA=SASHELP.CLASS;
TABLE NAME*HEIGHT;
RUN;
THREE-WAY FREQ
EG13:
PROC FREQ DATA=SASHELP.CLASS;
TABLE NAME*HEIGHT*SEX;
RUN;
SUPPRESSING/REMOVING THE DEFAULT OUTPUT
EG14:
PROC FREQ DATA=SASHELP.CLASS;
TABLE NAME*HEIGHT/NOROW;
RUN;
EG15:
PROC FREQ DATA=SASHELP.CLASS;
TABLE NAME*HEIGHT/NOCOL;
RUN;
Comments
Post a Comment