Sunday, April 29, 2012

Multiple Plots on a Single Graph using ggplot2


I was trying to put multiple plots on a single graph using ggplot2. However,
it turned out that R's built-in par(mfrow=c(2,2)) doesn't work for ggplot2.
When I searched online, I found this POST that illustrates how to do it.

It is very simple. For ggplot2, instead of using par(mfrow=c(2,2)), you use grid.arrange(graph1, graph2, ..., ncol=2).
However, grid.arrange() is a function from a package called "gridExtra", so you first have to install
the gridExtra package.

Here are some examples based on our survey data (you can copy the following code):

install.packages("gridExtra")   # only needed once
library(ggplot2)                # provides qplot()
library(gridExtra)              # provides grid.arrange()

graph1 <- qplot(Ind7_Confidence, data=myIndices, geom="histogram", color=I("blue"),
                fill=I("orange"), main="Software Use")

graph2 <- qplot(Ind5_WeightedMathAbility, data=myIndices, geom="histogram", color=I("blue"),
                fill=I("skyblue"), main="Weighted Math Ability")

graph3 <- qplot(Ind2_UnderstandingDataAna, data=myIndices, geom="histogram", color=I("blue"),
                fill=I("yellow"), main="Understanding Data Analysis")

graph4 <- qplot(Ind4_PractQuanMethod, data=myIndices, geom="histogram", color=I("blue"),
                fill=I("red"), main="Practical Quant Experience")

# arrange the four plots in a grid with two columns
grid.arrange(graph1, graph2, graph3, graph4, ncol=2)
# save the current plot window (savePlot() only works on some devices, e.g. the Windows graphics window)
savePlot(filename="Gridhist.png", type="png")


Another Example using Density Plot

graph1 <- qplot(Ind7_Confidence, data=myIndices, geom="density",
                fill=TwoPlusComputers, alpha=I(0.2), xlab="Data Analysis Confidence")

graph2 <- qplot(Ind8_ComSocialUse, data=myIndices, geom="density",
                fill=TwoPlusComputers, alpha=I(0.2), xlab="Computer For Social Use")

# Ind3 is the qualitative-methods index, so the label says Qual rather than Quant
graph3 <- qplot(Ind3_PractQualMethod, data=myIndices, geom="density",
                fill=TwoPlusComputers, alpha=I(0.2), xlab="Practical Experience of Qual Methods")

graph4 <- qplot(Ind9_ComWorkUse, data=myIndices, geom="density",
                fill=TwoPlusComputers, alpha=I(0.2), xlab="Computer For Prof Use")

grid.arrange(graph1, graph2, graph3, graph4, ncol=2)
savePlot(filename="GridDensity.png", type="png")
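
The examples above need the survey data (myIndices), and savePlot() only works with certain on-screen graphics devices. As a rough sketch that does not depend on either, the code below uses R's built-in mtcars dataset (my stand-in, not part of the survey) and writes the grid to a file with ggsave(). The plot names p1 to p4 and the output file name are made up for the illustration.

```r
library(ggplot2)
library(gridExtra)

# Four histograms from the built-in mtcars data (a stand-in for myIndices)
p1 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, colour = "blue", fill = "orange") +
  ggtitle("Miles per gallon")
p2 <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(bins = 10, colour = "blue", fill = "skyblue") +
  ggtitle("Horsepower")
p3 <- ggplot(mtcars, aes(x = wt)) +
  geom_histogram(bins = 10, colour = "blue", fill = "yellow") +
  ggtitle("Weight")
p4 <- ggplot(mtcars, aes(x = qsec)) +
  geom_histogram(bins = 10, colour = "blue", fill = "red") +
  ggtitle("Quarter-mile time")

# arrangeGrob() builds the two-column layout without drawing it;
# grid.arrange() takes the same arguments and draws it on screen
combined <- arrangeGrob(p1, p2, p3, p4, ncol = 2)

# Save the combined layout to a PNG file (written to a temporary folder here)
out_file <- file.path(tempdir(), "GridHistSketch.png")
ggsave(out_file, combined, width = 8, height = 6)
```

Saving with ggsave() writes the file directly, so it does not matter which graphics window, if any, is open.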




Adil 

Saturday, April 28, 2012

WORKING WITH MISSING VALUES IN R


 SUBSTITUTE ALL MISSING VALUES WITH MEANS IN A FEW LINES

In this post I will illustrate what I did to deal with missing values in our survey dataset.
If you believe that mean substitution is an appropriate way to deal with your missing values, then this post may be useful to you. However, if you are looking for imputation methods, you can use built-in functions in R to write your own imputation algorithms, or you can install packages such as 'imputation', 'Amelia', or 'robCompositions'.

I divided the dataset into two datasets: data1, which has only the columns without missing values, and data2, which has all the columns with missing values (NA). I then worked with the columns in data2. I excluded all categorical variables from data2 because I wanted to substitute the mean for continuous variables only. For categorical variables, I replaced NA with 99. After cleaning the missing values in data2, I combined data1 and data2 into a single data frame, which was clean data without any missing values.

The code I wrote to do this is:

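# show which cells are missing
is.na(survey)
# columns that contain at least one NA (two equivalent ways)
onlyMissingCol <- survey[, !complete.cases(t(survey))]
onlyMissingCol
NA_Col <- survey[sapply(survey, function(col) any(is.na(col)))]
NA_Col
# columns with no missing values
No_NA_Col <- survey[sapply(survey, function(col) !any(is.na(col)))]
No_NA_Col
NA_Col$Ethnicity
# replace each NA with the mean of its own column
NA_Col2 <- as.matrix(NA_Col)
na_pos <- which(is.na(NA_Col2), arr.ind = TRUE)
NA_Col2[na_pos] <- apply(NA_Col2, 2, mean, na.rm = TRUE)[na_pos[, "col"]]
# recombine the complete columns with the filled-in columns
survey_clean <- data.frame(No_NA_Col, NA_Col2)
survey_clean
complete.cases(survey_clean)   # should now be TRUE for every row
# drop column 3 and round to two decimals
survey_clean1 <- round(survey_clean[, -3], digits = 2)
options(width = 1000)          # wide console output so rows do not wrap
survey_clean1
# add the student ID (Std_ID) as the first column
survey_final <- data.frame(Std_ID = survey_clean$Std_ID, survey_clean1)
survey_final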

If you wonder what the above code does, click HERE to see the comments I wrote for each command line. You can also run it, because I linked it to my dataset on Dropbox.
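
For readers without access to the survey file, here is a minimal sketch of the same idea on a made-up data frame. The data frame toy, its column names, and its values are invented for illustration: numeric columns get their column mean, and the categorical column gets the code 99, as described above.

```r
# Toy data standing in for the survey; all values are invented
toy <- data.frame(
  Std_ID    = 1:5,
  Score     = c(10, NA, 30, 40, NA),
  Hours     = c(2, 4, NA, 8, 6),
  Ethnicity = c(1, 2, NA, 1, 2)    # categorical code, not a measurement
)

categorical  <- "Ethnicity"
numeric_cols <- setdiff(names(toy), c("Std_ID", categorical))

toy_clean <- toy

# Continuous variables: replace NA with the mean of the observed values
for (col in numeric_cols) {
  missing <- is.na(toy_clean[[col]])
  toy_clean[[col]][missing] <- mean(toy_clean[[col]], na.rm = TRUE)
}

# Categorical variable: replace NA with the code 99
toy_clean[[categorical]][is.na(toy_clean[[categorical]])] <- 99

toy_clean
anyNA(toy_clean)   # FALSE when nothing is missing any more
```

Looping over the columns one at a time keeps the categorical variable out of the mean calculation without having to split the data frame first.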

I hope you find this useful.

ADIL

How to write, save, and call your own functions in R With an Example



·         How to write?
                The general form of a function is
NameOfFun <- function(argument1, argument2, ...)
            {
                        Your expressions go here
            }
                For example:
name <- function(x, y, ...)
            {
                        if(.....) do something to x and/or y
                        if(.....) do something to ............
            }
·         How to save?
                1- From R, write or paste your function code in R Editor or R script
                2- Save your function as "nameOfFile.R" in your working directory
                               
·         How to Call?
1- In the R Console, type source("nameOfFile.R")
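
As a small illustration of these three steps, the sketch below writes a short function to a file and then loads it with source(). The function name add_two and the temporary file path are made up for the example; in practice you would save the file in your working directory from the R editor.

```r
# 1. Write: the function definition as it would appear in the script file
fun_code <- c(
  "add_two <- function(x, y) {",
  "  x + y",
  "}"
)

# 2. Save: write it to a .R file (a temporary folder is used here)
fun_file <- file.path(tempdir(), "add_two.R")
writeLines(fun_code, fun_file)

# 3. Call: source() runs the file, which defines add_two in the workspace
source(fun_file)
add_two(2, 3)
```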

I have written a function that automatically assigns letter grades. You can copy the function from
Or you can follow these steps, which may help you see how to call a function:
1- Download the function from
2- Unzip the folder and copy the file "GradingFunction.R" into your working directory
3- Go to R, and type
source("GradingFunction.R") to call the function
4- If you are a TA and would like to try the function in your classes, follow the function's arguments as explained here
5- I have also randomly generated grades if you wish to try the function. There are two examples from the above link.

Note: This function may not be directly related to data analysis, but the process is the same (e.g., when you want to assign labels to your grouping variable if certain conditions are satisfied).
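
To make the note above concrete, here is a rough sketch of a condition-based labelling function. It is not the GradingFunction.R mentioned above, whose arguments are not shown here. The name assign_grade and the cutoffs 90/80/70/60 are assumptions chosen only for illustration.

```r
# Hypothetical example: label each score with a letter based on cutoffs
assign_grade <- function(scores, cutoffs = c(A = 90, B = 80, C = 70, D = 60)) {
  grades <- rep("F", length(scores))
  # Work from the lowest cutoff to the highest so that higher grades
  # overwrite lower ones
  for (letter in names(sort(cutoffs))) {
    grades[which(scores >= cutoffs[[letter]])] <- letter
  }
  # A missing score stays missing instead of becoming "F"
  grades[is.na(scores)] <- NA
  grades
}

assign_grade(c(95, 83, 71, 65, 40))
```

The same pattern works for any grouping variable: start with a default label and overwrite it wherever a condition holds.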

Adil

Friday, April 27, 2012


I'm just replicating Professor Welser's post of TUESDAY, APRIL 24, 2012,
"Plot different characters and colors according to factor on third variable".

The only difference is that I used an ifelse() statement instead of subsetting the data. I created a grouping variable (Geek/Non) that was automatically added to the existing dataset
by using ifelse().
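
As a minimal sketch of that step on made-up numbers (the vector computers below is invented, not the survey column):

```r
# Number of computers owned by six hypothetical students
computers <- c(0, 1, 2, 3, 1, 4)

# 1 or fewer = Non-Geek (coded 0); 2 or more = Geek (coded 1)
group <- factor(
  ifelse(computers <= 1, 0, 1),
  levels = c(0, 1),
  labels = c("Non-Geeks", "Geeks")
)

# Count how many students fall into each group
table(group)
```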

#myIndices$TwoPlusComputers <- .... this adds a new column to your existing data (e.g., myIndices)
#1 or less = Non-Geek; 2 or more = Geek
#factor() is like the levels of a categorical variable in SPSS
myIndices$TwoPlusComputers <- factor(
  ifelse(survey_final$computers <= 1, 0, 1),
  levels=c(0,1),
  labels=c("Non-Geeks","Geeks"))
#now you have a new column called TwoPlusComputers added to your indices/data

#you can now use coplot based on the illustration of Prof. Welser
#coplot ( y ~ X | Z)

# you can also use ggplot2
#if you don't have ggplot2, run the following
#install.packages("ggplot2")
library(ggplot2)

#qplot(x,y,data=yourdata,....)

#I. BY COLOR
qplot(Ind6_Software,Ind7_Confidence, data=myIndices,
color=factor(TwoPlusComputers), size=I(4),
xlab="# Of Software Used",
ylab="Confidence Analyzing Data",
main="Software predicts overall confidence (Geeks Vs Non)")



#II. BY SIZE
qplot(Ind6_Software,Ind7_Confidence, data=myIndices,
size=factor(TwoPlusComputers),
xlab="# Of Software Used",
ylab="Confidence Analyzing Data",
main="Software predicts overall confidence (Geeks Vs Non)")


#III. BY SHAPE
qplot(Ind6_Software,Ind7_Confidence, data=myIndices,
shape=factor(TwoPlusComputers), size=I(4),
xlab="# Of Software Used",
ylab="Confidence Analyzing Data",
main="Software predicts overall confidence (Geeks Vs Non)")

I have also created three more grouping variables. With one of them, I looked at whether students' career goals predict confidence in data analysis. This variable is called 'goals' in our original survey data.


#Even more complex - ggplot2

qplot(Ind6_Software, Ind7_Confidence, data=myIndices,
xlab="# Of Software Used",
ylab="Confidence Analyzing Data",
main="Software predicts overall confidence (Geeks Vs Non)",
facets= .~TwoPlusComputers) + geom_smooth()





Adil

Apply functions to subsets of your data



This video explains two ways to apply functions to subsets of the data, illustrated with histograms and our example dataset / syntax. The example syntax file has been updated to include this code, which is also copied below.




par(mfrow=c(2,3))


##   select cases to include by specifying value of factor variable

annie<-hist(AllVars$NewConf1,
col="purple",
breaks=4,
ylim=c(0,12))


ed<-hist(AllVars$NewConf1 [AllVars$TwoPlusComputers == 1],
col="light blue",
breaks=4,
ylim=c(0,12))


frankie<-hist(AllVars$NewConf1 [AllVars$TwoPlusComputers == 0],
col="pink",
breaks=4,
ylim=c(0,12))

##  select cases by referring to different datasets
## to do this you first need to build the data frame:
# AllVars<-data.frame(cbind(var, var2, varn))
# then use the subset command to make a new dataset
#  NewDataSetName<- subset(AllVars, Variable == 1)

frannie<-hist(AllVars$NewConf1,
col="purple",
breaks=4,
ylim=c(0,12))

fred<-hist(Geek$NewConf1,
col="blue",
breaks=4,
ylim=c(0,12))

frank<-hist(NonGeek$NewConf1,
col="red",
breaks=4,
ylim=c(0,12))
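
The two approaches should select the same cases. A quick way to convince yourself, sketched here on an invented data frame (toy and its values are not the class dataset), is to compare the result of logical indexing with the result of subset():

```r
# Invented stand-in for AllVars
toy <- data.frame(
  NewConf1         = c(3, 5, 2, 4, 1, 5),
  TwoPlusComputers = c(1, 0, 1, 1, 0, 0)
)

# Way 1: pick cases inside the call with a logical condition
by_index <- toy$NewConf1[toy$TwoPlusComputers == 1]

# Way 2: build a separate dataset first, then refer to it
Geek      <- subset(toy, TwoPlusComputers == 1)
by_subset <- Geek$NewConf1

# Both routes return the same values, so hist() would draw the same picture
identical(by_index, by_subset)
```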

Tuesday, April 24, 2012

Plot different characters and colors according to factor on third variable.








Using the example dataset, and starting from running this

  1. example syntax

I added the following syntax to the Example.Syntax.450.txt file  (linked above)


##   Two ways to plot based on category in third variable.

plot(NewConf2, NewConf1, pch=as.integer(TwoPlusComputers))

plot(jitter(NewConf2, factor=2), jitter(NewConf1, factor=2), pch=as.integer(TwoPlusComputers))

coplot (NewConf1 ~ NewConf2 | TwoPlusComputers)
#  coplot ( y ~ X | Z)
#  y= outcome variable
#  X= causal variable
#  Z= third variable with categories that cases fall into


#  A better way, perhaps


(I copied the general strategy demonstrated on the R blog, here)


AllVars<-data.frame(cbind(
NewConf1,
TwoPlusComputers,
NewConf2,
RecNumbers,
Software,
UseSPSS,
UseExcel,
UseMinitab,
UseStatistica,
UseSAS,
UseR,
UseMplus,
UseFortran,
UseMatLab,
UseStatEase,
UsePython,
UseOther,
computers,
Confidence,
DadaRecConf,
OrgDataExcConf,
CodebookConf,
RdataLoadConf,
ExcelDescConf,
RdescConf,
RexploreConf,
ConsIndexConf,
CorrMatrixConf,
OrgVarRegConf,
InterpRegConf,
ConstGraphConf,
MethoRegStuConf))

Geek <- subset(AllVars, TwoPlusComputers == 1)
NonGeek <- subset(AllVars, TwoPlusComputers == 0)

# set up an empty plot; NewConf2 goes on the x axis and NewConf1 on the y axis,
# matching the formula used in points() below
plot(NewConf2, NewConf1, type='n',
xlab = "Confidence interpreting research",
ylab = "Confidence in research techniques",
main = "Research interpretation predicts technical confidence (Geeks vs Non)",
col.main = "#444444")
points (jitter(NewConf1) ~ jitter(NewConf2), data = Geek, pch = "G", col = "blue")
points (jitter(NewConf1) ~ jitter(NewConf2), data = NonGeek, pch = "N", col = "red")

You are invited to announce your course contributions here



We have been making videos and finding helpful links for our class.  However, it is not always easy for people to see when a new contribution has been made.   In addition to posting a link in the "helpful links" document, please consider making a brief post (like this one, with an image, or an embedded video) and a link to the resource that you are sharing.  This will make it easier for people to be alerted about the new addition.



Tuesday, April 3, 2012

Excel hints for HW1



Here is a link to: Help for Excel in homework 1 (based on class on Tuesday)




Homework #1 description (initial outline)


cheers,


ted


ps, please add your help files to the link list page
https://docs.google.com/document/d/15SMOq0Xq8O0tHx-3kAQE4XxW6l3fwjt6wmP3xbSsges/edit