Introduction
An ANOVA (analysis of variance) is used to analyze the effect of predictor(s) on a
numerical response variable and can determine whether the mean response differs across
levels of the predictor variable(s). A one-way, or single-factor, ANOVA tests a single
explanatory variable, while a multi-factor ANOVA tests the effect of multiple
explanatory variables.
Dataset
This week we will examine a dataset from a survey which looked at factors contributing
to birth weight. The data contains information about the mother as well as the child's
weight, and is called lab11_birthwt.csv on Canvas.
Questions of Interest
In this lab, we will fit ANOVA models to understand the relationships between
potential underlying factors of birth weight in infants:
Does the mean birth weight of infants vary across different races?
Does it still vary when we also consider whether the mother smokes?
Does the effect of race interact with the effect of smoking?
All three of the distributions are fairly symmetric, so we pass the normality assumption.
If any of the groups showed an obvious skew, we could subset the data and look at
histograms or Q-Q plots to investigate further. Just like with t-tests, we can try to
transform our numeric variable if it violates the normality assumption.
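The Q-Q plot check mentioned above can be sketched as follows. This is a minimal example
under one assumption: the birthwt data shipped with the MASS package stands in for
lab11_birthwt.csv, with race recoded to the labels used in this lab. The data frame name
bw is hypothetical.

```r
# Assumption: MASS::birthwt stands in for lab11_birthwt.csv (race coded 1 = white,
# 2 = black, 3 = other), recoded here to the labels used in the lab.
data("birthwt", package = "MASS")
bw <- birthwt
bw$race <- factor(bw$race, levels = c(2, 3, 1), labels = c("black", "other", "white"))

# One Q-Q plot per race group; points that follow the reference line
# suggest the group is approximately normal.
op <- par(mfrow = c(1, 3))
for (grp in levels(bw$race)) {
  y <- bw$bwt[bw$race == grp]
  qqnorm(y, main = paste("Q-Q plot:", grp))
  qqline(y)
}
par(op)

# Group sizes, useful when judging how much to read into each plot
group_sizes <- table(bw$race)
print(group_sizes)
```

If one group bent clearly away from the line, the same loop could be rerun on a
transformed response such as log(bw$bwt).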
We can see in the boxplots above that the three groups have a similar spread of values,
but we can formally check the equal variance assumption with Levene's test, like we did
in the two-sample t-test lab. Make sure you have installed the car package:
> library(car)
> leveneTest(bwt~race,data=birthwt)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   2  0.4684 0.6267
      186
With a p-value > 0.05, we fail to reject the null hypothesis of equal variances, so we can
treat the equal variance assumption as met.
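To see what the median-centered Levene's test is doing, it can be reproduced in base R as
a one-way ANOVA on each observation's absolute deviation from its group median. This is a
sketch that assumes MASS::birthwt stands in for the lab file; the names bw, abs_dev and
levene_by_hand are hypothetical.

```r
# Assumption: MASS::birthwt stands in for lab11_birthwt.csv, recoded to the lab's labels.
data("birthwt", package = "MASS")
bw <- birthwt
bw$race <- factor(bw$race, levels = c(2, 3, 1), labels = c("black", "other", "white"))

# Absolute deviation of each birth weight from the median of its own race group
abs_dev <- abs(bw$bwt - ave(bw$bwt, bw$race, FUN = median))

# A one-way ANOVA on those deviations is the median-centered Levene's test
levene_by_hand <- anova(lm(abs_dev ~ bw$race))
print(levene_by_hand)
```

A large F here would mean the typical distance from the group median differs between
groups, which is what unequal spread looks like.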
2.3 State Null and Alternative Hypotheses
H0: The mean infant birth weights in all three race groups are equal.
HA: At least one mean infant birth weight is different from the others.
We have evidence that the mean birth weight of at least one race is different from another
(F = 4.91, df = (2,186), p < 0.05). To find out which races differ from each other, we'll
need to run a post-hoc analysis.
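A one-way ANOVA that yields an F statistic like the one reported above can be fit with
aov(). This is a minimal sketch, assuming MASS::birthwt stands in for the lab file. The
model name my_model matches the one used in the Tukey step; bw and anova_table are
hypothetical names, and the data= form of the formula is a stylistic choice here.

```r
# Assumption: MASS::birthwt stands in for lab11_birthwt.csv, recoded to the lab's labels.
data("birthwt", package = "MASS")
bw <- birthwt
bw$race <- factor(bw$race, levels = c(2, 3, 1), labels = c("black", "other", "white"))

# Fit the one-way ANOVA: birth weight explained by race
my_model <- aov(bwt ~ race, data = bw)

# The first element of the summary is the ANOVA table (Df, Sum Sq, Mean Sq, F value, Pr(>F))
anova_table <- summary(my_model)[[1]]
print(anova_table)
```

The first row of the table holds the F statistic and p-value for race, and the Df column
gives the two degrees of freedom quoted with the result.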
2.5 Conduct and Interpret Tukey Post-Hoc Pairwise Comparisons
We can use the TukeyHSD() function to automatically run all of the pairwise comparisons
from our ANOVA model, using an adjustment to protect against inflated Type I error:
> TukeyHSD(my_model)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = birthwt$bwt ~ birthwt$race)
$`birthwt$race`
                 diff         lwr      upr     p adj
other-black  85.59127 -304.452081 475.6346 0.8624372
white-black 383.02644    9.816581 756.2363 0.0428037
white-other 297.43517   28.705095 566.1652 0.0260124
This function returns the p-values using the Tukey adjustment for multiple comparisons.
By comparing the p adj values in the last column to 0.05, we see that there is a
significant difference in mean birth weight between babies with white mothers and black
mothers (p = 0.04), and also between babies with white mothers and mothers of other
races (p = 0.03).
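With only three groups the table is easy to read by eye, but the same comparison against
0.05 can be done in code. This sketch assumes MASS::birthwt stands in for the lab file;
the names bw, pairs_table and significant are hypothetical.

```r
# Assumption: MASS::birthwt stands in for lab11_birthwt.csv, recoded to the lab's labels.
data("birthwt", package = "MASS")
bw <- birthwt
bw$race <- factor(bw$race, levels = c(2, 3, 1), labels = c("black", "other", "white"))

my_model <- aov(bwt ~ race, data = bw)
tukey <- TukeyHSD(my_model)

# TukeyHSD returns one matrix per factor with columns diff, lwr, upr and "p adj"
pairs_table <- tukey$race

# Keep the names of the pairs whose adjusted p-value falls below 0.05
significant <- rownames(pairs_table)[pairs_table[, "p adj"] < 0.05]
print(significant)
```

Equivalently, a pair is significant at the 5% family-wise level when its interval from
lwr to upr does not contain zero.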
The box-plots show no major deviations from normality or equal variance, so both factors
for this ANOVA meet the necessary assumptions.
3.2 State Null and Alternative Hypotheses
In a multi-factor ANOVA, we must have a set of hypotheses for each explanatory
variable:
Hypothesis Set 1:
H0: While accounting for smoking during pregnancy, there is no difference in
mean birth weight across the three race groups.
HA: While accounting for smoking during pregnancy, there is a difference in
mean birth weight across the three race groups.
Hypothesis Set 2:
H0: While accounting for race, there is no difference in mean birth weight across
smoking groups.
HA: While accounting for race, there is a difference in mean birth weight across
smoking groups.
+ birthwt$smoke
                   RSS    AIC F value    Pr(>F)
<none>        90573666 2480.1
birthwt$race  97941819 2490.9  7.5249 0.0007213 ***
birthwt$smoke 94953931 2487.0  8.9468 0.0031582 **
While controlling for race, smoking during pregnancy does impact mean birth weight (F
= 8.95, p < 0.05). Also, while controlling for smoking status, mean birth weight does
differ based on the mother's race (F = 7.52, p < 0.05).
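Output with RSS and AIC columns like the table above is what drop1() prints, so one way to
obtain these tests is sketched here. It assumes MASS::birthwt stands in for the lab file,
and the use of drop1() itself is inferred from the output format rather than stated in the
lab. The names bw, additive_model and drop_table are hypothetical.

```r
# Assumption: MASS::birthwt stands in for lab11_birthwt.csv, recoded to the lab's labels.
data("birthwt", package = "MASS")
bw <- birthwt
bw$race  <- factor(bw$race, levels = c(2, 3, 1), labels = c("black", "other", "white"))
bw$smoke <- factor(bw$smoke, levels = c(0, 1), labels = c("no", "yes"))

# Two-factor ANOVA with main effects only
additive_model <- aov(bwt ~ race + smoke, data = bw)

# drop1() removes each term in turn and F-tests it against the full model,
# so each factor is tested while the other one stays in the model
drop_table <- drop1(additive_model, ~ ., test = "F")
print(drop_table)
```

Because each term is tested with the other still in the model, the result does not depend
on the order in which race and smoke appear in the formula.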
* birthwt$smoke
                                RSS    AIC F value Pr(>F)
<none>                     84790962 2471.6
birthwt$race               96426345 2491.9
birthwt$smoke              85546963 2471.3
birthwt$race:birthwt$smoke 90573666 2480.1
We see a significant interaction between race and smoking on birth weight (F = 6.24, p <
0.05). Note that the main effect of smoke is no longer significant. However, in the
presence of a significant interaction effect, you don't usually interpret the main effects
of the two variables.
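The interaction test can also be viewed as a comparison of two nested models, one with and
one without the interaction term. This sketch assumes MASS::birthwt stands in for the lab
file; the names bw, additive_model, interaction_model and comparison are hypothetical.

```r
# Assumption: MASS::birthwt stands in for lab11_birthwt.csv, recoded to the lab's labels.
data("birthwt", package = "MASS")
bw <- birthwt
bw$race  <- factor(bw$race, levels = c(2, 3, 1), labels = c("black", "other", "white"))
bw$smoke <- factor(bw$smoke, levels = c(0, 1), labels = c("no", "yes"))

# race * smoke expands to race + smoke + race:smoke
additive_model    <- aov(bwt ~ race + smoke, data = bw)
interaction_model <- aov(bwt ~ race * smoke, data = bw)

# F-test of the extra interaction term: does adding race:smoke
# reduce the residual sum of squares by more than chance would?
comparison <- anova(additive_model, interaction_model)
print(comparison)
```

The second row of the comparison gives the F statistic and p-value for the interaction,
based on the drop in RSS between the two models.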
4.1 Graph an Interaction Plot
We can display the interaction between these two variables with an interaction plot. The
gplots package contains the plotmeans() function, which will plot the means of our
explanatory factors, as well as error bars for the 95% confidence interval for each mean.
First, we'll subset the data by one of the explanatory variables (it's easiest to do this
for the one with the fewest values). Then, we'll plot the means of the other explanatory
variable and add a legend to show which color corresponds to which smoking group:
# Install gplots once, then load it in each session
> install.packages('gplots')
> library(gplots)
# Divide data into subgroups based on smoking status
> births_no <- birthwt[birthwt$smoke=='no',]
> births_yes <- birthwt[birthwt$smoke=='yes',]
# Interaction plot for two categorical explanatory variables:
# draw the non-smoking group first, then overlay the smoking group with add=TRUE
> plotmeans(bwt~race,data=births_no,n.label=FALSE,col='black',
    barcol='black',pch=20,ylim=c(2000,3700),main='Interaction Plot',
    ylab='Birth Weights (g)')
> plotmeans(bwt~race,data=births_yes,n.label=FALSE,add=TRUE,col='red',
    barcol='red',pch=20)
> legend('topright',inset=0.01,title='Smoke?',c('No','Yes'),
    fill=c('black','red'),cex=.8)
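For a quick look without installing gplots, base R's interaction.plot() draws the same
group means, though without the confidence-interval error bars. This is a sketch of that
alternative rather than part of the lab's method, and it assumes MASS::birthwt stands in
for the lab file; the names bw and cell_means are hypothetical.

```r
# Assumption: MASS::birthwt stands in for lab11_birthwt.csv, recoded to the lab's labels.
data("birthwt", package = "MASS")
bw <- birthwt
bw$race  <- factor(bw$race, levels = c(2, 3, 1), labels = c("black", "other", "white"))
bw$smoke <- factor(bw$smoke, levels = c(0, 1), labels = c("no", "yes"))

# One line per smoking group, one x-axis position per race, mean birth weight on the y-axis
interaction.plot(x.factor = bw$race, trace.factor = bw$smoke, response = bw$bwt,
                 fun = mean, type = "b", pch = 20, col = c("black", "red"),
                 xlab = "Race", ylab = "Mean birth weight (g)",
                 trace.label = "Smoke?", main = "Interaction Plot")

# The plotted values: mean birth weight for every race-by-smoking cell
cell_means <- tapply(bw$bwt, list(bw$race, bw$smoke), mean)
print(round(cell_means))
```

Lines that are far from parallel suggest that the effect of smoking differs across race
groups, which is what the interaction term tests.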