Assignment 1 | BABS 507
Nipun Goyal
Jinye Lu
BABS 507 - Assignment #1
Simple Linear Regression – Part 1
Background:
The Centers for Disease Control and Prevention (CDC) have collected data about health
conditions and risk behaviors. They are trying to develop a profile of the states where
people eat the recommended quantities of fruits and vegetables so that a national health
initiative can develop a plan to improve health.
Response variable: % of people who eat at least 5 servings of fruits and vegetables every
day
Information about this variable is difficult to get and can only be obtained from a survey
specifically asking about health-related habits.
Problem Statements
NOTE: To begin with, we created a folder on our desktop and saved the data file in it.
We then set that folder as the working directory in R so that we do not have to type the
full file path every time we do exploratory or predictive analysis on the data file.
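A minimal sketch of this setup (the real folder and data-file name are not stated in the assignment, so both are placeholders here; a temporary directory stands in for the desktop folder so the example runs anywhere):

```r
# Sketch only: the actual folder and file name are placeholders.
dir <- tempdir()
setwd(dir)  # set the folder as the working directory

# Stand-in data file with the two columns used in this assignment
write.csv(data.frame(smoking = c(20, 25, 18),
                     fruits.veg = c(24, 19, 26)),
          "cdc_states.csv", row.names = FALSE)

# With the working directory set, only the file name is needed
mydata <- read.csv("cdc_states.csv")
head(mydata)
```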
Questions (include all graphs mentioned in the questions in your final assignment):
Assessing assumptions
1. Make a scatterplot of % of people who eat enough fruits and vegetables vs. % of
people who smoke every day. Fit a smoothing curve to the plot. Describe the form,
direction, and strength of the association, and identify whether any outliers are
present. Is this the type of association you would expect? Why or why not? (3.5
marks)
A scatterplot of % of people who eat enough fruits and vegetables vs. % of people who
smoke every day was made with the explanatory variable ‘% of people who smoke every day’
on the x-axis and the response variable ‘% of people who eat at least 5 servings of
fruits and vegetables every day’ on the y-axis.
plot(fruits.veg ~ smoking, data = mydata, pch = 16,
     xlab = '% of people who smoke every day',
     ylab = '% of people who eat enough fruits and vegetables')  # scatterplot of the response vs. the explanatory variable
lines(lowess(mydata$smoking, mydata$fruits.veg, delta = 0.1),
      col = "red")  # add a smoothed (lowess) curve through the data
Output:
Form: The points form a cloud stretched out in a generally consistent straight band, and
the fitted smoothing curve decreases roughly linearly. Hence, it can be inferred from the
scatterplot that there is a linear relationship between the response variable ‘% of people
who eat enough fruits and vegetables every day’ and the explanatory variable ‘% of people
who smoke every day’.
Direction: The scatterplot clearly shows a negative association between the response
variable ‘% of people who eat enough fruits and vegetables every day’ and the explanatory
variable ‘% of people who smoke every day’: as the % of people who smoke every day
increases, the % of people who eat enough fruits and vegetables every day decreases, and
vice versa.
Syntax for calculating a Pearson correlation coefficient:
cor(mydata$smoking,mydata$fruits.veg,method="pearson")
Output: -0.4798289
Strength: The correlation coefficient of −0.48 tells us about both the direction and the
strength of the association between the response and explanatory variables. The negative
sign indicates a negative association, and the absolute value of 0.48 indicates a
moderately strong association (or correlation) between the two variables.
Outliers: A thorough check was performed to identify any outliers in the data. Looking at
the scatterplot, apart from a couple of data points (the states of Utah and California,
which have a noticeably lower % of daily smokers than the other states), all other points
are clustered fairly closely around the red smoothing line.
In general, one would expect states where a higher % of people eat enough fruits and
vegetables to have a lower % of people who smoke every day, since both reflect overall
health-conscious behavior. That is, there should be a negative association between smoking
and consuming fruits and vegetables. This negative association is exactly what the state
data exhibit: states with a higher % of people who eat enough fruits and vegetables also
have a lower % of people who smoke every day. Hence, we can say this is the type of
association one would expect from data on people eating fruits and vegetables vs. people
smoking.
2. Create a linear model to fit this data. Extract the residuals and predicted values.
Create the residual plot and examine it to see if the assumptions of linearity and
equal variance are met. Describe how well you think these assumptions are met;
is it sufficient to continue with the model? State any concerns you have and their
possible consequences. (2 marks)
Syntax for fitting a linear model and fetching residual and predicted values:
# Fit a linear model so that we can make the residual plot and look at it
z1 <- lm(fruits.veg ~ smoking, data = mydata)
resid1 <- resid(z1)       # residuals
resid1
predict1 <- predict(z1)   # predicted (fitted) values
predict1
Output: Please note that the values returned in the R console were further used in Excel
to generate a table of the residuals and predicted values (shown in the Excel file).
Create a residual plot and examine it to see if the assumptions of linearity and equal variance
are met:
Syntax for making residual plot with a line at residuals = 0
plot(resid1 ~ predict1, pch = 16)   # residuals vs. predicted values
abline(0, 0, lty = 2)               # dashed reference line at residuals = 0
Output:
As shown in the figure, we divided the residual plot into five vertical windows and
checked the assumption of linearity by looking at the distribution of the residuals
within each window. Within each window the residuals are distributed around the line at
residuals = 0, and the average value of the residuals is close to 0 in all five windows.
So we believe the assumption of linearity is met in general, although the linearity is
not very strong.
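The window check above can also be done numerically. A sketch, using simulated values since the raw data are not reproduced here; with the real model, resid1 and predict1 from question 2 would be used instead:

```r
set.seed(4)
predict1 <- runif(50, 15, 25)   # stand-in for predict(z1)
resid1   <- rnorm(50, sd = 3)   # stand-in for resid(z1)

# Split the fitted values into five equal-width windows and compute
# the mean residual within each; means close to 0 support linearity
windows <- cut(predict1, breaks = 5)
window.means <- tapply(resid1, windows, mean)
window.means
```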
3. Is there any reason to think that the assumption of independence of observations
is violated by this data? How would you determine if this assumption was violated?
State any concerns you have and their possible consequences. (2 marks)
The observations may not be independent: smoking rates could cluster geographically, for
example because healthcare legislation is weaker in some states, which makes cigarettes
cheaper there. To check this, we could plot the residuals against a spatial variable such
as state or region. If the residuals show a pattern across states, for example residuals
in western states being systematically larger than in eastern states, then the
independence assumption is violated. The main consequence would be that the standard
errors, and hence the t-tests and p-values from the model, could not be trusted.
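A sketch of such a check on made-up data (the data set has no region column, so the region grouping here is a hypothetical variable that would have to be added, e.g. by mapping each state to East/West):

```r
set.seed(2)
region <- factor(rep(c("East", "West"), each = 25))  # hypothetical grouping
resid1 <- rnorm(50, sd = 3)                          # stand-in for resid(z1)

# Side-by-side boxplots of residuals by region: a systematic shift
# between regions would suggest the independence assumption is violated
plot(resid1 ~ region, ylab = "Residuals")

# Simple numeric check: mean residual per region
tapply(resid1, region, mean)
```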
4. Use a histogram of the residuals, a normality test and the normal probability
plot (Q-Q plot) to determine if the assumption of normality of errors is met.
Describe how well you think this assumption was met; is it sufficient to continue
with the model? State any concerns you have and their possible consequences. (2.5
marks)
The histogram of the residuals is roughly symmetric, with some positive kurtosis, so the
symmetry of the histogram is moderately strong.
Also, we ran the Shapiro-Wilk normality test to check the normality-of-errors assumption.
The hypotheses tested were H0: the errors are normally distributed, versus HA: the errors
are not normally distributed. Since the p-value from the Shapiro-Wilk test is 0.7651,
which is greater than the α value of 0.05, we fail to reject the null hypothesis and
conclude that the errors can be treated as normally distributed.
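The three normality checks can be reproduced with the sketch below (simulated residuals stand in for resid1 from question 2, since the raw data are not reproduced here):

```r
set.seed(1)
resid1 <- rnorm(50, sd = 3)   # stand-in for resid(z1)

# Histogram of the residuals to inspect symmetry
hist(resid1, xlab = "Residuals", main = "Histogram of residuals")

# Shapiro-Wilk test; H0: the errors are normally distributed
sw <- shapiro.test(resid1)
sw$p.value   # fail to reject H0 when this exceeds 0.05

# Normal probability (Q-Q) plot with reference line
qqnorm(resid1, pch = 16)
qqline(resid1)
```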
5. Using Excel, where each column represents a different step in your calculations,
calculate the slope and intercept for your linear model. Label each column with a
clear description of its contents. (5 marks)
From the original data, we label the ‘fruits.veg’ variable as y and the ‘smoking’ variable
as x for convenience. We then calculate the averages ȳ and x̄ and use them to compute
yi − ȳ and xi − x̄. From these we obtain SPxy = Σ(xi − x̄)(yi − ȳ) and SSx = Σ(xi − x̄)².
The slope of the linear model is the ratio b1 = SPxy / SSx, and because the regression
line passes through the point (x̄, ȳ), the y-intercept is b0 = ȳ − b1x̄.
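The same column-by-column computation can be mirrored in R as a check (illustrative x and y vectors; with the assignment data, x would be mydata$smoking and y would be mydata$fruits.veg):

```r
# Illustrative vectors; substitute the assignment data
x <- c(18, 22, 25, 20, 30, 15)
y <- c(25, 22, 18, 24, 14, 28)

SPxy <- sum((x - mean(x)) * (y - mean(y)))  # sum of cross-products
SSx  <- sum((x - mean(x))^2)                # sum of squares of x

b1 <- SPxy / SSx              # slope
b0 <- mean(y) - b1 * mean(x)  # intercept: line passes through (x-bar, y-bar)

c(intercept = b0, slope = b1)
# These agree with the coefficients from lm()
coef(lm(y ~ x))
```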
The slope is −0.5671 and the intercept is 32.3070 (both numbers rounded to four decimals).
Syntax for producing a summary of the fitted linear model z1 (this gives the coefficients
and the t-test results for whether they differ from zero):
summary(z1)
Output: Please note that the summary command also returns the coefficient of
determination, the standard error of the estimate, and the significance of the regression.
The question only asks about the coefficients, so only the coefficient results are shown.
Coefficients: (Intercept) 32.3070; smoking (slope) −0.5671
As shown in the coefficient results above, the intercept of the linear regression model is
32.3070 and the slope is −0.5671, which match the coefficients computed manually in the
Excel file. Hence, the manual Excel calculation agrees with the values from R.
7. Is the regression significant? Remember to state all 4 steps of your hypothesis test!
(1 mark)
Syntax for producing the results for testing the significance of the regression:
anova(z1)
Output:
Response: fruits.veg
Df Sum Sq Mean Sq F value Pr(>F)
smoking 1 125.72 125.721 14.357 0.0004218 ***
Residuals 48 420.33 8.757
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Step 1: Hypothesis formulation. Significance testing involves testing the significance of
the overall regression equation. The null hypothesis for the overall test is that the
coefficient of determination in the population, R²pop, is zero, i.e. the regression is not
significant; this is equivalent to H0: β1 = 0. The alternative hypothesis is HA: R²pop > 0
(equivalently β1 ≠ 0), i.e. the regression is significant.
Step 2: Test statistic. From the analysis-of-variance table, Fcalculated = 14.357 on 1 and
48 degrees of freedom.
Step 3: Decision. The p-value of 0.0004218 is less than the α value of 0.05, so we reject
the null hypothesis.
Step 4: Conclusion. Since the p-value is less than α = 0.05, we reject the null hypothesis
at the 5% significance level and conclude that the regression is significant.
8. Using Excel, calculate the co-efficient of determination (r2) and the standard error
of the estimate (SEE). (2 marks)
The calculation of the coefficient of determination and the standard error of the estimate
is shown in the Excel file.
To calculate the coefficient of determination, we squared the correlation coefficient:
r² = (−0.4798)² ≈ 0.2302.
To calculate the standard error of the estimate (SEE), we populated the predicted values
in column I of the Excel file and calculated the residuals in column J and the squared
residuals in column K. In column L we summed the squared residuals, divided this sum by
(n − 2), where n is the total number of observations, and finally took the square root of
this value.
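These two Excel calculations can be cross-checked in R (illustrative vectors again; with the assignment data, x would be mydata$smoking and y would be mydata$fruits.veg):

```r
# Illustrative vectors; substitute the assignment data
x <- c(18, 22, 25, 20, 30, 15)
y <- c(25, 22, 18, 24, 14, 28)
fit <- lm(y ~ x)

r2  <- cor(x, y)^2  # coefficient of determination = squared correlation
SEE <- sqrt(sum(resid(fit)^2) / (length(y) - 2))  # sqrt(SSE / (n - 2))

# Both agree with the values reported by summary(fit)
c(r2, summary(fit)$r.squared)
c(SEE, summary(fit)$sigma)
```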
9. Check your results for these measures of goodness of fit using R. Do these measures
indicate a good model? (1 mark)
summary(z1)
Output:
• Coefficient of determination: the r-square value of 0.2302 means that only about 23% of
the variation in ‘% of people who eat fruits and vegetables’ is explained by the model,
which is low.
• Standard error of estimate: the SEE value of 2.959 tells us that the average distance of
the values of the response variable ‘% of people who eat fruits and vegetables’ from the
fitted line (i.e. from the predicted values) is about 2.959 percentage points. This is
consistent with the low r-square value: a higher SEE indicates a greater spread, so we can
say that this model is not a very good fit.
10. Overall, what do you think could be done to improve your model? (1 mark)
As discussed in question 1, the states of Utah and California have a lower % of people who
smoke every day than the other states. We could therefore try a transformation (such as
log, inverse, Z, or square root) of the data to pull the data points closer together so
that they cluster more tightly around the fitted line.
In addition, the states with a high % of people who smoke every day might have very weak
smoking legislation, which could make cigarette prices cheaper in some states than in
others. Therefore, we could also try to include a few more variables, such as the strength
of smoking legislation in these states or the % of fruits grown/produced by these states,
in order to improve our model.
Note: Adding explanatory variables can artificially inflate the value of r-square, so we
should also check the adjusted r-square value if we add more explanatory variables to our
model.
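The point about adjusted r-square can be illustrated on simulated data, where a pure-noise variable is added to a model:

```r
set.seed(3)
x1 <- rnorm(50)
x2 <- rnorm(50)                    # pure noise, unrelated to y
y  <- 30 - 0.5 * x1 + rnorm(50, sd = 3)

m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + x2)

# R-squared never decreases when a variable is added ...
summary(m2)$r.squared >= summary(m1)$r.squared   # TRUE
# ... but adjusted R-squared penalizes the useless predictor
c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
```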