Saturday, April 28, 2012

WORKING WITH MISSING VALUES IN R


 SUBSTITUTE  ALL MISSING VALUES WITH MEANS IN FEW LINES

In this post I will illustrate what I did to deal with missing values in our survey dataset.
If you believe that using mean substitution  to deal with your missing values, then this post may be useful to you. However, if you are looking for imputation methods, then you can use built in functions in R to generate your own algorithms for missing data imputation or you can install some packages like 'imputation',' amelia','robCompositions'

I divided the dataset into two datasets: data1: has only columns without missing and data2: all columns with missing values (NA), I then worked with those columns with missing values. I excluded all categorical variables from data2, because i wanted to substitute mean for continuous variables. However, for NA of categorical variables i replaced NA with 99. After i have cleaned missing values in data2, i combined data1 and data2 in a single data frame which was clean data without any missing values. 

The code i wrote to do this is:-

is.na(survey[])
onlyMissingCol <- survey[,!complete.cases(t(survey))]
onlyMissingCol
NA_Col <- survey[sapply(survey, function(survey) any(is.na(survey)))]
NA_Col
No_NA_Col <- survey[sapply(survey, function(survey) !any(is.na(survey)))]
No_NA_Col
NA_Col$Ethnicity
NA_Col2 <- as.matrix(NA_Col)
NA_Col2[which(is.na(NA_Col), arr.ind = TRUE)] <-
apply(NA_Col2,2,mean, na.rm=T)[which(is.na(NA_Col), arr.ind = TRUE)[, "col"]]
survey_clean <- data.frame(No_NA_Col, NA_Col2)
survey_clean
complete.cases(survey_clean)
survey_clean1 <- round(survey_clean[,c(-3)],digits=2)
options(width=1000)
survey_clean1
survey_final <- data.frame(Std_ID=survey_clean$Std_ID, survey_clean1)
survey_final

If you want wonder what the above code does then click HERE to see the comments I wrote for each command line. You can also run it because i linked it to my dataset on dropbox.

I hope you find this useful.

ADIL

No comments:

Post a Comment