7.VALIDATION AND CLEANING TECHNIQUES

           DATA VALIDATION AND DATA CLEANING:::


-->WE CAN CLEAN THE DATA OUTSIDE OF THE SAS ENVIRONMENT LIKE TEXT AND EXCEL FILE


-->EDIT THE RAW DATA BY USING RIGHT CLICK-----> "EDIT MODE"--->BROWSE MODE OPTION


-->BY PROGRAMATICALLY CAN WE CLEAN THE DATA?

YES WE CAN,

--->PROC PRINT:: BOTH CHARACTER  AND NUMERIC

--->PROC FREQ:: BOTH CHARACTER AND NUMERIC

--->PROC MEANS:: NUMERIC

--->PROC UNIVARIATE:: NUMERIC

--->PROC SUMMARY:: NUMERIC

EG1:

PROC PRINT DATA=SASHELP.CLASS;

WHERE NAME IS MISSING;

RUN;

EG2:

PROC PRINT DATA=SASHELP.CLASS NOOBS ;

WHERE SEX IS MISSING;

RUN;


--->PROC MEANS::: IT WILL GIVE THE "SUMMARY STATISTICS FOR ALL NON MISSING 

                        NUMERIC VARIABLES"


SYNTAX::

PROC MEANS DATA=XXXX.YYY;

RUN;

OPTIONS::::

---VAR    ---N  ---MEAN   ---STANDARD DEVIATION(STD)   --MIN   --MAX 

 --NMISS(NO OF MISSING NUMERICALS VALUES)  --OUTPUT OUT=(CREATING NEW DATAASET)


 DEFAULT OUTPUT::::WILL BE

"N MEAN STD MIN MAX" FOR ALL NON MISSING NUMERICAL DATA


1.WITHOUT "VAR" STATEMENT IT WILL STATISTICS FOR ALL NON MISSING NUMERICAL VARIABLES

EG1:

PROC MEANS DATA=SASHELP.CLASS;

RUN;

2.WITH "VAR" STATEMENT IT WILL STATISTICS FOR PARTICULAR NUMERICAL VARIABLES

EG2:

PROC MEANS DATA=SASHELP.CLASS;

VAR AGE HEIGHT;

RUN;

EG3:

PROC MEANS DATA=SASHELP.CLASS;

VAR AGE;

RUN; 

3.TO SUPPRESS/CONTROLLING THE DEFAULT OUTPUT OF MEANS????

     ---USE N MEAN STD MIN MAX AFTER THE DATASET

EG1:

PROC MEANS DATA=SASHELP.CLASS N STD;

RUN;

EG2:

PROC MEANS DATA=SASHELP.CLASS MIN MEAN;

VAR AGE;

RUN;

4.WHAT IF THE DATA HAVING THE MISSING VALUES IN NUMERICAL VARIABLES????

   ---USE THE "NMISS" AFTER THE DATASET

EG1:

PROC MEANS DATA=SASHELP.CLASS NMISS;

RUN;

EG2:

PROC MEANS DATA=SASHELP.CLASS NMISS;

VAR AGE HEIGHT;

RUN;

5.CREATING NEW DATASET IN PROC MEANS??

---USE OUTPUT OUT=

EG1:

PROC MEANS DATA=SASHELP.CLASS;

VAR AGE HEIGHT;

OUTPUT OUT=WORK.SAS1; * DEFAULT VAR WILL BE CREATED (_TPYE_ _FREQ_ _STATUS_);

RUN;


--->PROC SUMMARY:::NUMERIC

1. BY DEFAULT IT WILL GIVE "COUNT OF THE OBSERVATIONS"

2.IF WE USING THE "VAR" STATEMENT IN SUMMARY PROCEDURE "SAME OUTPUT LIKE PROC MEANS"

3."NO PRINT" IS DEFAULT SO HAVE TO GIVE "PRINT" AFTER THE DATASETNAME

SYNTAX:::

EG1:

PROC SUMMARY DATA=XXXX.YYY;

RUN;

--->WITHOUT "PRINT" OPTION NO OUTPUT

EG2:

PROC SUMMARY DATA=SASHELP.CLASS;

RUN;

--->WITH PRINT OPTION IT WILL GIVE A COUNT OF OBS

EG3:

PROC SUMMARY DATA=SASHELP.CLASS PRINT;

RUN;

EG4:

PROC SUMMARY DATA=SASHELP.HEART PRINT;

RUN;


2.IF WE USING THE "VAR" STATEMENT IN SUMMARY PROCEDURE "SAME OUTPUT LIKE MEANS"

EG1:

PROC SUMMARY DATA=SASHELP.CLASS PRINT N MEAN;

VAR AGE HEIGHT;

RUN;


---->PROC UNIVARIATE:::: NUMERIC

IT WILL GIVE EXTREME OBSERVATIONS,QUANTILES,BASIC STATS,TEST FOR LOCATIONS,MOMENTS


SYNTAX::

EG1:

PROC UNIVARIATE DATA=XXXX.YYYY;

VAR NUMERIC;

RUN;


WITHOUT VAR STATEMENT ALL NUMERIC GIVE ALL NUMERIC VARIABLES 

EG2:

PROC UNIVARIATE DATA=SASHELP.CLASS;

RUN;


WITH VAR IT WILL EXTREME OBS GIVE FOR PARTICULAR VAR

EG3:

PROC UNIVARIATE DATA=SASHELP.CLASS;

VAR AGE;

RUN;


--->PROC FREQ

-->FOR CHARACTER 

-->FOR NUMERIC


--->IT WILL GIVE 'FREQUENCY TABLES AND N-WAY FREQ TABLES' FOR THE VARIABLES


SYNTAX:::

EG1:

PROC FREQ DATA=XXXX.XXXX;

TABLE XXX XXXX XXX;

RUN;


DEFAULT OUTPUT::IT WILL GIVE 

1.FREQUENCY  2.PERCENTAGE  3.CUMULATIVE FREQUENCY  4.CUMULATIVE PERCENTAGE


WITH "TABLE" STATEMENT IT WILL GIVE FREQ TABLES FOR PARTICULAR VARIABLES

EG1:

PROC FREQ DATA=SASHELP.CLASS;

TABLE NAME HEIGHT;

RUN;


WITHOUT "TABLE" STATEMENT IT WILL GIVE FREQ TABLES FOR ALL VARIABLES

EG2:

PROC FREQ DATA=SASHELP.CLASS;

RUN;


NLEVELS:::GIVE THE LEVELS/COUNT OF OBS FOR VARIABLES

EG3:

PROC FREQ DATA=SASHELP.CLASS NLEVELS;

TABLE NAME SEX AGE;

RUN;

EG4:

PROC FREQ DATA=SASHELP.CLASS NLEVELS;

TABLE NAME SEX;

RUN;

EG5:

PROC FREQ DATA=PROG1.SALES NLEVELS;

TABLE JOB_TITLE GENDER;

RUN;

EG6:

PROC FREQ DATA=PROG1.NONSALES NLEVELS;

TABLE JOB_TITLE GENDER;

RUN;


SUPPRESS/REMOVE THE DEFAULT OUTPUT:SO WE CAN ADD NOPERCENT NOCUM 

EG7:

PROC FREQ DATA=SASHELP.CLASS;

TABLE NAME HEIGHT/ NOPERCENT;

RUN;


WITH SPACE IN B/W THE TWO VARIABLES CALLED ONE-WAY FREQUENCY

EG8:

PROC FREQ DATA=SASHELP.CLASS;

TABLE NAME HEIGHT;

RUN;


ONE WAY:: SPACE B/W THE TWO VARIABLES

EG9:

PROC FREQ DATA=SASHELP.HEART;

TABLE STATUS SEX;

RUN;


TWO WAY:: ASTRICK *  B/W THE TWO VARIABLES

EG10:

PROC FREQ DATA=SASHELP.HEART;

TABLE STATUS*SEX; *DEFAULT OUTPUT:: FREQ PERCENT ROWPCT COLPCT;

RUN;


THREE WAY:: 2 ASTRICK IN B/W THREE VAR

EG11:

PROC FREQ DATA=SASHELP.HEART;

TABLE STATUS*SEX*AgeAtStart;

RUN;


TWO-WAY FREQ(ASTRICK(*) IN B/W TWO VARIABLES  OUTPUT IS 

--->OUTPUT:: 1.FREQ 2.PERCENT 3.ROW PER 4.COL PER

EG12:

PROC FREQ DATA=SASHELP.CLASS;

TABLE NAME*HEIGHT;

RUN;


THREE-WAY FREQ

EG13:

PROC FREQ DATA=SASHELP.CLASS;

TABLE NAME*HEIGHT*SEX;

RUN;


SUPPRESSING/REMOVING THE  DEFAULT OUTPUT

EG14:

PROC FREQ DATA=SASHELP.CLASS;

TABLE NAME*HEIGHT/NOROW;

RUN;

EG15:

PROC FREQ DATA=SASHELP.CLASS;

TABLE NAME*HEIGHT/NOCOL;

RUN;


Comments