
Linear Regression
Assignment 1 | BABS 507

Nipun Goyal
Jinye Lu
Simple Linear Regression – Part 1

Background:
The Centers for Disease Control and Prevention (CDC) have collected data about health
conditions and risk behaviors. They are trying to develop a profile of the states where
people eat the recommended quantities of fruits and vegetables so that a national health
initiative can develop a plan to improve health.

Explanatory variable: % of people who smoke every day


This explanatory variable is being used as a proxy for how concerned people are with their
health. Information about this variable is easy to obtain because people who have health
insurance have to answer this question, and these responses can be provided anonymously
to the CDC.

Response variable: % of people who eat at least 5 servings of fruits and vegetables every
day
Information about this variable is difficult to get and can only be obtained from a survey
specifically asking about health-related habits.

Problem Statements

NOTE: To begin with, we created a folder on our desktop and saved the data file in
that folder. We then set that folder as the working directory in R so that we do not
have to write the full path to the file every time we do exploratory or predictive
analysis on the data.

Syntax for setting up a working directory:


setwd("C:/Users/Nipun Goyal/Desktop/MBAN Coursework/Period 2/BABS-507")

Questions (include all graphs mentioned in the questions in your final assignment):
Assessing assumptions
1. Make a scatterplot of % of people who eat enough fruits and vegetables vs. % of
people who smoke every day. Fit a smoothing curve to the plot. Describe the form,
direction, and strength of the association, and identify whether any outliers are
present. Is this the type of association you would expect? Why or why not? (3.5
marks)

A scatterplot of % of people who eat enough fruits and vegetables vs. % of people who
smoke every day has been made by placing the explanatory variable '% of people who smoke
every day' on the x-axis and the response variable '% of people who eat at least 5 servings
of fruits and vegetables every day' on the y-axis.

Syntax for making a scatterplot:


mydata <- read.csv('Fruits_veg_and_smoking.csv', header=TRUE)

plot(fruits.veg ~ smoking, data=mydata, pch=16,
     xlab='% of people who smoke every day',
     ylab='% of people who eat enough fruits and vegetables') # scatterplot of the response vs. the explanatory variable
lines(lowess(mydata$smoking, mydata$fruits.veg, delta=0.1), col="red") # add a smoothed curve through the data

Output:

Form: The points form a cloud stretched out in a generally consistent straight band, and
the fitted smoothing curve decreases steadily. Hence, it can be inferred from the
scatterplot that there is a linear relationship between the response variable '% of people
who eat enough fruits and vegetables every day' and the explanatory variable '% of people
who smoke every day'.

Direction: The scatterplot shows a clear negative association between the response variable
'% of people who eat enough fruits and vegetables every day' and the explanatory variable
'% of people who smoke every day'. The negative association indicates that as the % of
people who smoke every day increases, the % of people who eat enough fruits and vegetables
every day decreases, and vice versa.

Strength of Association: There is a moderate negative association (or correlation) between
the response variable '% of people who eat enough fruits and vegetables every day' and the
explanatory variable '% of people who smoke every day': the data points are not clustered
tightly around the red smoothing curve. The strength of the association was further
verified by calculating the Pearson correlation coefficient between the response and
explanatory variables, using the syntax below:

Syntax for calculating a Pearson correlation coefficient:
cor(mydata$smoking,mydata$fruits.veg,method="pearson")

Output: -0.4798289

The correlation coefficient of -0.48 tells us both the direction and the strength of the
association between the response and explanatory variables. The negative sign indicates a
negative association, and the absolute value of 0.48 indicates a moderate association (or
correlation) between the two variables.
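As a cross-check of what cor() computes, the Pearson coefficient is r = SPxy / √(SSx · SSy). The sketch below (in Python rather than R, purely to illustrate the arithmetic, and on a small made-up dataset since the real data live in the CSV file) computes it from first principles:

```python
import math

def pearson_r(x, y):
    # r = SP_xy / sqrt(SS_x * SS_y), with deviations taken from the means
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sp_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_x = sum((xi - xbar) ** 2 for xi in x)
    ss_y = sum((yi - ybar) ** 2 for yi in y)
    return sp_xy / math.sqrt(ss_x * ss_y)

# hypothetical mini-dataset: % who smoke (x) vs. % who eat enough fruit/veg (y)
smoking = [10, 14, 18, 22, 26]
fruits_veg = [27, 21, 25, 18, 22]
print(round(pearson_r(smoking, fruits_veg), 4))  # prints -0.5861
```

A moderate negative value, qualitatively similar to the -0.48 found for the real data.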

Outliers: A thorough analysis has been performed to identify any outliers in the data.
Looking at the scatterplot, it can be seen that apart from a couple of data points, all
other data points are clustered fairly closely around the red smoothing curve.

Important information regarding these potential outliers can be gathered from the raw
data. In general, if a higher % of people eat fruits and vegetables, then a lower % of
people will smoke every day, and vice versa. This is exactly the behaviour shown by a
couple of states, such as Utah and California, which have a high % of people eating fruits
and vegetables and simultaneously a low % of people who smoke every day. Therefore, we may
conclude that these data points are not data-entry errors but the actual behaviour
exhibited by these states. Moreover, these findings are consistent with the negative
correlation value found previously.

In general, if a higher % of people eat fruits and vegetables, a lower % of people will
smoke every day, and vice versa; that is, one would expect a negative association between
smoking and consuming fruits and vegetables. The same negative association is exhibited by
our state data, where the states with a higher % of people who eat enough fruits and
vegetables also have a lower % of people who smoke every day. Hence, this is exactly the
type of association one would expect between these two variables.

2. Create a linear model to fit this data. Extract the residuals and predicted values.
Create the residual plot and examine it to see if the assumptions of linearity and
equal variance are met. Describe how well you think these assumptions are met;
is it sufficient to continue with the model? State any concerns you have and their
possible consequences. (2 marks)

Syntax for fitting a linear model and fetching residual and predicted values:
# Fit a linear model so that we can make the residual plot and look at it
z1 <- lm(fruits.veg ~ smoking, data=mydata)
resid1 <- resid(z1)     # extract the residuals
resid1
predict1 <- predict(z1) # extract the predicted (fitted) values
predict1

Output: Please note that the output values received in the R console were further used in
Excel to generate the table below:

Create a residual plot and examine it to see if the assumptions of linearity and equal
variance are met.

Syntax for making a residual plot with a line at residuals = 0:
plot(resid1 ~ predict1, pch=16)
abline(0, 0, lty=2) # dashed horizontal line at residuals = 0 (line type = 2)

Output:

As shown in the figure, we divided the residual plot into five vertical windows and
checked the assumption of linearity by looking at the distribution of the residuals
within each window. Within each window the residuals are scattered around the line at
residuals = 0, and the average value of the residuals is close to 0 in all five windows.
So we believe the assumption of linearity is met in general, although not strongly.

We can also examine the assumption of equal variance using the same five windows. The
band of points on the residual plot has a roughly constant range of about (-6, 6) from
left to right, except for the leftmost window. However, since that window contains only
one point, and that point also falls within the same range, we can conclude that the band
of points has a roughly constant range from the leftmost window to the rightmost. So the
assumption of equal variance is met in general, though not strongly.

As the assumptions of linearity and equal variance are met in general, it is sufficient
to continue with the linear regression model we have built and to check whether the
remaining assumptions are met.
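The by-eye windowed check described above can also be mechanized: sort the points by predicted value, split them into five windows, and average the residuals in each. A small Python sketch of this idea, using hypothetical fitted values and residuals (the real ones come from the R model):

```python
def window_means(predicted, residuals, k=5):
    # sort the (predicted, residual) pairs left to right, split into k
    # windows, and return the average residual within each window
    pairs = sorted(zip(predicted, residuals))
    n = len(pairs)
    means = []
    for i in range(k):
        chunk = [r for _, r in pairs[i * n // k:(i + 1) * n // k]]
        if chunk:
            means.append(sum(chunk) / len(chunk))
    return means

# hypothetical fitted values and residuals; averages near 0 in every
# window support the linearity assumption
fitted = list(range(20))
resid = [1.0, -1.0] * 10
print(window_means(fitted, resid))  # each window averages to 0.0
```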

3. Is there any reason to think that the assumption of independence of observations
is violated by this data? How would you determine if this assumption was violated?
State any concerns you have and their possible consequences. (2 marks)

Spatial differences may violate the assumption of independence of observations. As shown
in the table, the data are observed in different states.

More people may smoke in some states than in others because, for example, healthcare
legislation is weaker there and cigarettes are therefore cheaper. This could violate the
independence assumption. To determine whether it does, we would plot the residuals
against the states (for example, grouped by region).

If there is a pattern in the residuals as the state changes, for example if the residuals
in western states are consistently larger than in eastern states, then location violates
the independence assumption.

4. Use a histogram of the residuals, a normality test and the normality probability
plot (Q-Q plot) to determine if the assumption of normality of errors is met.
Describe how well you think this assumption was met; is it sufficient to continue
with the model? State any concerns you have and their possible consequences. (2.5
marks)

Syntax for Checking the Assumption of Normality of errors:


hist(resid1) #histogram of the residuals
shapiro.test(resid1) #Shapiro-Wilk test for normality of residuals

Syntax for making Normality probability plots:


qqnorm(resid1, ylab= "residuals", xlab = "Normal scores", pch=16)
qqline(resid1)

Output: Normality Test results:


Shapiro-Wilk normality test
data: resid1

W = 0.98485, p-value = 0.7651

Looking at the histogram of the residuals, the histogram appears to be roughly symmetric
and unimodal. The main departure from symmetry is that the frequency of residuals in
(-2, 0) is not close to the frequency in (0, 2); the mode of the histogram falls in
(-2, 0), and the distribution shows positive kurtosis. So the symmetry of the histogram
is only moderately strong.

Also, we ran the Shapiro-Wilk normality test to check the normality-of-errors assumption.
Below are the hypotheses we tested:

H0: Errors are normally distributed

H1: Errors are not normally distributed

The p-value from the Shapiro-Wilk test is 0.7651, which is greater than the α value of
0.05. Therefore, we fail to reject the null hypothesis and conclude that the errors are
normally distributed.
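The decision rule applied above, as a minimal sketch (the p-value is the one reported by shapiro.test):

```python
# compare the Shapiro-Wilk p-value against the significance level alpha
p_value, alpha = 0.7651, 0.05
decision = "fail to reject H0" if p_value > alpha else "reject H0"
print(decision)  # prints: fail to reject H0
```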

Normality Probability Plot (Q-Q plot)

Finally, based on the normal Q-Q plot, the distribution is slightly light-tailed, as the
two ends of the residuals do not fall exactly on the line; however, the plot is not
skewed to the left or right. So we can say that the distribution of the errors is
approximately normal.

All in all, the assumption of normality of errors is met based on the overall analysis
using three different methods (histogram, Shapiro-Wilk test, and normal Q-Q plot).

Creating and assessing the model

5. Using Excel, where each column represents a different step in your calculations,
calculate the slope and intercept for your linear model. Label each column with a
clear description of its contents. (5 marks)

Please refer to the Excel file Assignment_01_Nipun_goyal_Jinye_Lu_Excel_calculations.xlsx

From the original data, we label the 'fruits.veg' variable as y and the 'smoking' variable
as x for convenience. We then calculate the averages ȳ and x̄ and use them to compute the
deviations yi - ȳ and xi - x̄. From these we obtain SPxy and SSx, and the slope of the
linear model is their ratio, b1 = SPxy / SSx. Because the regression line passes through
the point (x̄, ȳ), the y-intercept is b0 = ȳ - b1·x̄.

The slope is -0.5671 and the intercept is 32.3070 (both numbers rounded to four decimals)
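The column-by-column Excel calculation described above reduces to b1 = SPxy / SSx and b0 = ȳ − b1·x̄. A sketch of the same arithmetic in Python (the dataset here is made up, since the real values live in the Excel file):

```python
def ols_coefficients(x, y):
    # slope b1 = SP_xy / SS_x and intercept b0 = ybar - b1 * xbar,
    # mirroring the column-by-column spreadsheet calculation
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sp_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_x = sum((xi - xbar) ** 2 for xi in x)
    b1 = sp_xy / ss_x
    b0 = ybar - b1 * xbar
    return b0, b1

# hypothetical mini-dataset (the real values are in the Excel file)
smoking = [10, 14, 18, 22, 26]
fruits_veg = [27, 21, 25, 18, 22]
b0, b1 = ols_coefficients(smoking, fruits_veg)
print(round(b1, 4), round(b0, 4))  # prints: -0.325 28.45
```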

6. Check your results for the coefficients using R.

Syntax for producing a summary of the fitted linear model z1. This gives the coefficients
and the t-test results for whether they differ from zero:

summary(z1)

Output: Please note that the summary command also returns the coefficient of
determination, the standard error value, and the significance of the regression results.
Since the question only asks about the coefficients, we show only the coefficient results:

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   32.3070      2.3532   13.729   < 2e-16 ***
smoking       -0.5671      0.1481   -3.828  0.000367 ***

As shown in the coefficient table above, the intercept of the linear regression model is
32.3070 and the slope is -0.5671, which match the coefficients calculated manually in the
Excel file.

7. Is the regression significant? Remember to state all 4 steps of your hypothesis test!
(1 mark)

Syntax for producing the results for testing the significance of the regression:

anova(z1)

Output:

Analysis of Variance Table

Response: fruits.veg
Df Sum Sq Mean Sq F value Pr(>F)
smoking 1 125.72 125.721 14.357 0.0004218 ***
Residuals 48 420.33 8.757
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Significance of the regression is usually tested in 4 steps:

Step 1: Hypothesis formulation

Significance testing involves testing the significance of the overall regression equation
as well as specific partial regression coefficients. The null hypothesis for the overall
test is that the coefficient of determination in the population, R²pop, is zero, i.e. the
regression is not significant.

Null hypothesis, H0: R²pop = 0, which is equivalent to the following null hypothesis:

Null hypothesis, H0: β1 = 0

Alternative hypothesis, H1: β1 ≠ 0, i.e. the regression is significant

Step 2: Calculate the test statistic

The overall test can be conducted using an F statistic:

F = MS(regression) / MS(residual) = 125.721 / 8.757 = 14.357

F(calculated) = 14.357

Step 3: Compare the p-value to the α value of 0.05

As we can see from the analysis of variance table, the p-value of 0.0004218 is less than
the α value of 0.05.

Step 4: Make a decision

Since the p-value is less than the α value of 0.05, we reject the null hypothesis at the
95% confidence level and conclude that the regression is significant.
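The arithmetic in Step 2 can be checked directly from the sums of squares and degrees of freedom in the ANOVA table above:

```python
# values taken from the analysis of variance table above
ss_reg, df_reg = 125.721, 1
ss_res, df_res = 420.33, 48

ms_reg = ss_reg / df_reg   # mean square for regression
ms_res = ss_res / df_res   # mean square for residuals
f_stat = ms_reg / ms_res
print(round(ms_res, 3), round(f_stat, 3))  # prints: 8.757 14.357
```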

8. Using Excel, calculate the co-efficient of determination (r2) and the standard error
of the estimate (SEE). (2 marks)

The calculation of the coefficient of determination and the standard error of the estimate
has been shown in the excel file.

To calculate the coefficient of determination, we squared the correlation coefficient:
r² = (-0.4798)² ≈ 0.2302.
To calculate the standard error of the estimate (SEE), we populated the predicted values
in column I of the Excel file, calculated the residuals in column J, squared the
residuals in column K, and summed the squared residuals in column L. We then divided this
sum by (n - 2), where n is the total number of observations, and finally took the square
root of the result.
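The same spreadsheet steps (predicted → residual → squared residual → sum → divide by n − 2 → square root) can be sketched in Python on a small hypothetical dataset (the real values are in the Excel file):

```python
import math

def r_squared_and_see(x, y):
    # fit the least-squares line, then mirror the spreadsheet columns:
    # predicted -> residual -> residual^2 -> sum, SEE = sqrt(SSE / (n - 2))
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    predicted = [b0 + b1 * xi for xi in x]                 # column I
    residuals = [yi - pi for yi, pi in zip(y, predicted)]  # column J
    sse = sum(r ** 2 for r in residuals)                   # columns K and L
    ss_total = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1 - sse / ss_total
    see = math.sqrt(sse / (n - 2))
    return r2, see

# hypothetical mini-dataset (the real values are in the Excel file)
r2, see = r_squared_and_see([10, 14, 18, 22, 26], [27, 21, 25, 18, 22])
print(round(r2, 4), round(see, 4))
```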

9. Check your results for these measures of goodness of fit using R. Do these measures
indicate a good model? (1 mark)

Syntax for producing the measures of goodness of fit:

summary(z1)

Output:

Residual standard error: 2.959 on 48 degrees of freedom
Multiple R-squared: 0.2302, Adjusted R-squared: 0.2142
F-statistic: 14.36 on 1 and 48 DF, p-value: 0.0004218

• Coefficient of determination: The coefficient of determination is small: only about
23% of the variance in the response variable is explained by the explanatory variable.
By this measure, the model is not very good.

• Standard error of the estimate: The SEE of 2.959 tells us that the average distance of
the values of the response variable '% of people who eat fruits and vegetables' from the
fitted line (the predicted values) is about 2.959. This result is consistent with the
r-squared value: a higher SEE indicates greater spread, so again the model is not very
good.

10. Overall, what do you think could be done to improve your model? (1 mark)

As discussed in question 1, the states of Utah and California have a lower % of people
who smoke every day compared with the other states. Hence, we may consider applying a
transformation (such as log, inverse, Z, or square root) to the data to bring the points
closer together so that they are more tightly clustered.

In addition, the states with a high % of people who smoke every day might have very weak
smoking legislation, making cigarettes cheaper there than in other states. Therefore, we
could also try to include a few more variables, such as the strength of smoking
legislation in each state or the % of fruits grown/produced by each state, in order to
improve our model.

Note: Adding explanatory variables can artificially increase the value of r-squared, so
we should also check the adjusted r-squared if we add more explanatory variables to our
model.

