SUBSTITUTE ALL MISSING VALUES WITH MEANS IN A FEW LINES
In this post I will illustrate what I did to deal with missing values
in our survey dataset.
If you believe that mean substitution is a reasonable way to deal with your missing values, then this
post may be useful to you. However, if you are looking for imputation methods,
you can use R's built-in functions to write your own algorithms for
missing data imputation, or you can install packages such as 'imputation', 'Amelia' or 'robCompositions'.
I divided the dataset into two datasets: data1, which has only the columns
without missing values, and data2, which has all the columns with missing values (NA). I then worked
only with the columns that had missing values. I excluded all categorical variables
from data2, because I wanted to substitute the mean only for continuous variables.
For the categorical variables, I replaced NA with 99 instead. After I had cleaned the missing values in data2, I combined data1 and data2 into a single data frame, which was clean data without any missing values.
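As a small illustration of that workflow, here is a sketch on a made-up data frame. Everything in it is an assumption for the example: the data frame toy, its columns, and the choice to name the categorical column by hand. It is not the survey data and not the exact code used on it.

```r
# Made-up data: 'age' and 'score' are continuous, 'ethnicity' is a coded category.
toy <- data.frame(
  id        = 1:5,
  age       = c(20, NA, 24, 22, NA),
  score     = c(3.5, 4.0, NA, 2.5, 3.0),
  ethnicity = c(1, 2, NA, 1, 3)
)

# Split the columns: those without any NA (data1) and those with at least one NA (data2).
has_na <- sapply(toy, function(x) any(is.na(x)))
data1  <- toy[!has_na]
data2  <- toy[has_na]

# Categorical columns get the code 99 instead of a mean.
categorical <- "ethnicity"   # assumed: picked by hand for this toy example
for (v in categorical) {
  data2[[v]][is.na(data2[[v]])] <- 99
}

# Continuous columns get their own column mean.
continuous <- setdiff(names(data2), categorical)
for (v in continuous) {
  data2[[v]][is.na(data2[[v]])] <- mean(data2[[v]], na.rm = TRUE)
}

# Recombine into one data frame with no missing values.
toy_clean <- data.frame(data1, data2)
toy_clean
```

In this toy data the two missing ages become 22 and the missing score becomes 3.25, while the missing ethnicity is coded as 99.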
The code I wrote to do this is:
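# Logical matrix showing where the missing values (NA) are
is.na(survey)

# Columns that contain at least one NA (same selection as NA_Col below)
onlyMissingCol <- survey[, !complete.cases(t(survey))]
onlyMissingCol

# Columns with missing values
NA_Col <- survey[sapply(survey, function(x) any(is.na(x)))]
NA_Col

# Columns without missing values
No_NA_Col <- survey[sapply(survey, function(x) !any(is.na(x)))]
No_NA_Col

# Inspect one of the columns that has missing values
NA_Col$Ethnicity

# Replace each NA with the mean of its own column
NA_Col2 <- as.matrix(NA_Col)
na_pos <- which(is.na(NA_Col), arr.ind = TRUE)
NA_Col2[na_pos] <- apply(NA_Col2, 2, mean, na.rm = TRUE)[na_pos[, "col"]]

# Put the two parts together in one data frame
survey_clean <- data.frame(No_NA_Col, NA_Col2)
survey_clean
complete.cases(survey_clean)

# Round to two decimals, dropping column 3
survey_clean1 <- round(survey_clean[, -3], digits = 2)
options(width = 1000)  # wide console so the rows are not wrapped
survey_clean1

# Put Std_ID in as the first column
survey_final <- data.frame(Std_ID = survey_clean$Std_ID, survey_clean1)
survey_final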
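The line that fills NA_Col2 is the densest part of the code above, so here is a sketch of how that kind of indexing behaves on a tiny made-up matrix. which(..., arr.ind = TRUE) returns one row per NA with its row and column position, and indexing the vector of column means by the "col" entries picks the right mean for each NA. The matrix m and the helper names na_pos and col_means are assumptions for this example only.

```r
# Small made-up matrix with three missing cells.
m <- cbind(a = c(1, NA, 3), b = c(NA, 10, NA))

# One row per NA, giving its (row, col) position.
na_pos <- which(is.na(m), arr.ind = TRUE)

# Column means, ignoring NA.
col_means <- apply(m, 2, mean, na.rm = TRUE)

# For every NA, look up the mean of the column it sits in and write it back.
m[na_pos] <- col_means[na_pos[, "col"]]
m
```

Note that this only works when every column is numeric: as.matrix() on a data frame that still contains a character or factor column gives a character matrix, and mean() cannot be computed on it. That is presumably why the categorical variables have to be handled before this step.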
If you wonder what the above code does, click HERE to see the
comments I wrote for each command line. You can also run it, because I
linked it to my dataset on Dropbox.
I hope you find this useful.
ADIL