SUBSTITUTE ALL MISSING VALUES WITH MEANS IN A FEW LINES
In this post I will illustrate what I did to deal with missing values in our survey dataset.
If you believe that mean substitution is an acceptable way to deal with your missing values, then this post may be useful to you. However, if you are looking for imputation methods, you can use built-in functions in R to write your own algorithms for missing data imputation, or you can install packages such as 'imputation', 'Amelia' or 'robCompositions'.
I divided the dataset into two datasets: data1, which has only the columns without missing values, and data2, which has all the columns with missing values (NA). These correspond to No_NA_Col and NA_Col in the code. I then worked on the columns with missing values. I excluded all categorical variables from data2, because I wanted to substitute the mean only for continuous variables. For categorical variables, I replaced NA with 99 instead. After cleaning the missing values in data2, I combined data1 and data2 into a single data frame, which was clean data without any missing values.
The code I wrote for the mean substitution is:
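The recoding of categorical NA values to 99 is not part of the code listed in this post. A minimal sketch of how that step might look, assuming the categorical columns are stored as factors or character vectors (the toy data frame and its column names are hypothetical, not taken from the survey):

```r
# Hypothetical toy data; the column names are made up for illustration
toy <- data.frame(
  gender = c("F", NA, "M", "F"),
  region = factor(c("north", "south", NA, "south")),
  score  = c(3.5, 4.0, NA, 2.5),
  stringsAsFactors = FALSE
)

# Treat factor and character columns as categorical
is_cat <- sapply(toy, function(x) is.factor(x) || is.character(x))

# Replace NA with the code "99" in every categorical column
toy[is_cat] <- lapply(toy[is_cat], function(x) {
  x <- as.character(x)   # avoids "invalid factor level" warnings
  x[is.na(x)] <- "99"
  factor(x)              # note: character columns come back as factors
})
```

The numeric column score is left untouched here, so its NA is still available for mean substitution.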
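# Columns with at least one NA (the same selection as NA_Col; not used further)
onlyMissingCol <- survey[, !complete.cases(t(survey))]
# Split the columns into those with missing values and those without
NA_Col <- survey[sapply(survey, function(x) any(is.na(x)))]
No_NA_Col <- survey[sapply(survey, function(x) !any(is.na(x)))]
# Convert to a matrix so individual cells can be addressed by (row, col)
NA_Col2 <- as.matrix(NA_Col)
na_idx <- which(is.na(NA_Col2), arr.ind = TRUE)
# Replace each NA with the mean of its own column
NA_Col2[na_idx] <- colMeans(NA_Col2, na.rm = TRUE)[na_idx[, "col"]]
# Recombine the complete columns with the mean-filled ones
survey_clean <- data.frame(No_NA_Col, NA_Col2)
# Round everything except column 3 to two decimals, then put Std_ID back unrounded
survey_clean1 <- round(survey_clean[, -3], digits = 2)
survey_final <- data.frame(Std_ID = survey_clean$Std_ID, survey_clean1)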
If you wonder what the above code does, click HERE to see the comments I wrote for each line. You can also run it, because I linked it to my dataset on Dropbox.
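If you would rather check the idea without downloading anything, here is a small sketch of the same matrix-indexing technique on a toy data frame. The data frame and its column names are hypothetical, chosen so the column means are easy to verify by hand:

```r
# Hypothetical toy survey; names and values are made up for illustration
toy <- data.frame(
  Std_ID = 1:4,
  q1 = c(2, NA, 4, 6),   # mean of observed values: 4
  q2 = c(1, 3, NA, NA),  # mean of observed values: 2
  q3 = c(5, 5, 5, 5)     # no missing values
)

# Flag the columns that contain at least one NA
has_na <- sapply(toy, anyNA)

# Matrix copy of the incomplete columns, plus the (row, col) position of each NA
m <- as.matrix(toy[has_na])
idx <- which(is.na(m), arr.ind = TRUE)

# Look up the column mean for each NA cell and fill it in
m[idx] <- colMeans(m, na.rm = TRUE)[idx[, "col"]]

# Recombine; the complete columns come first, so the column order changes
toy_clean <- data.frame(toy[!has_na], m)
print(toy_clean)
```

After this runs, q1 is 2, 4, 4, 6 and q2 is 1, 3, 2, 2, and the columns appear in the order Std_ID, q3, q1, q2.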
I hope you find this useful.