
Eswar Sai Santosh Bandaru: Lab02

Lab 02: Statistics and Bivariate Regression

Handout date: Thursday, September 22, 2016
Due date: Thursday, October 6, 2016 at the beginning of the lecture as hardcopy
This lab counts 10% toward your total grade.
Objectives: In this lab you will
[a] review some applied aspects of hypothesis testing,
[b] review some theoretical aspects of probability theory,
[c] identify the proper parameter λ_opt of the Box-Cox transformation in the context of univariate and bivariate data analysis, and
[d] practice bivariate regression analysis.
Format of answer: Your answers (graphs and verbal descriptions) should be handed in as hard-copy in one document. Add a running title into the header of the document with the following information: your name, Lab02, and page numbers. Label each answer properly, starting with its task number. Maintain the sequence of questions. Format any code and computer output properly before inserting it into the document with your answer. Code and text output need to be in a monospaced font (i.e., a fixed-pitch font) such as Courier New so proper spacing and alignment are preserved. Excessive and irrelevant output will lead to a deduction of your accumulated points.

Task 1: Test for Differences in Traffic Volumes (2 points)


The workspace TRAFFIC.RDATA holds two data-frames with
identical measurements on the traffic volume (number of
vehicles per day in thousands) in all four directions at 11
intersections in downtown Akron, Ohio, for the first Saturday in
June in either 1970 or 1980.
It is generally assumed that the downtown traffic volume decreased from 1970 to 1980 due to the opening of new retail locations in the suburbs, which pulled traffic away from the central city. We need to ignore other factors by which the overall traffic volume may have changed within the decade from 1970 to 1980, because we lack that information.


Task 1.1: Test whether the variance of the traffic volume in 1970 is identical to the variance of the traffic volume in 1980. The R function is var.test( ). Use an error probability of α = 0.05. (0.5 points)

Answer:
Code:
# Task 1.1
var.test(x = TrafficByLocation$Traffic1980, y = TrafficByLocation$Traffic1970,
         conf.level = 0.95)
Output:
F test to compare two variances
data: TrafficByLocation$Traffic1980 and TrafficByLocation$Traffic1970
F = 0.95214, num df = 43, denom df = 43, p-value = 0.873
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.5195362 1.7449772
sample estimates:
ratio of variances
0.9521443

The p-value of the F-test for equal variances is 0.873; hence there is no significant evidence that the variances differ, and we fail to reject the null hypothesis.

Task 1.2: Perform a matched pairs difference-of-means test of the null hypothesis H0: μ80 − μ70 = 0 against the alternative hypothesis HA: μ80 − μ70 < 0 and interpret the results. Use an error probability of α = 0.05.
Why are you performing a one-sided test? (0.5 points)


Answer:
# Task 1.2
# Paired t-test
t.test(x = TrafficByLocation$Traffic1980, y = TrafficByLocation$Traffic1970,
       paired = TRUE, conf.level = 0.95, alternative = "less")
Output:


Paired t-test
data: TrafficByLocation$Traffic1980 and TrafficByLocation$Traffic1970
t = -4.0999, df = 43, p-value = 9.002e-05
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -0.9894095
sample estimates:
mean of the differences
-1.677045

The p-value of the paired one-tailed t-test (9.002e-05) is well below 5%; hence there is sufficient evidence to reject the null hypothesis: traffic volume decreased significantly at the intersections in all directions from 1970 to 1980. A one-sided test is used because the research hypothesis states a direction, namely that traffic volume decreased.
Task 1.3: Perform a two independent samples difference-of-means test for H0: μ80 = μ70 against HA: μ80 < μ70 and interpret the results of this test. Use an error probability of α = 0.05.
Decide, based on task [c], under which assumption with regard to variance equality you are performing the test. (0.5 points)
Answer:
# Task 1.3
# independent sample test
t.test(x = TrafficByLocation$Traffic1980, y = TrafficByLocation$Traffic1970,
       paired = FALSE, conf.level = 0.95, alternative = "less")
Output:
Welch Two Sample t-test
data: TrafficByLocation$Traffic1980 and TrafficByLocation$Traffic1970
t = -1.2023, df = 85.948, p-value = 0.1163
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 0.6423709
sample estimates:
mean of x mean of y
11.40023 13.07727

We fail to reject the null hypothesis due to lack of sufficient evidence (p-value = 0.1163 > 0.05).
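Since the F-test in Task 1.1 did not reject equal variances, a pooled-variance version of this test (var.equal = TRUE) would also be defensible; a minimal sketch, reusing the data frame from above:
# Pooled two-sample t-test under the equal-variance assumption from Task 1.1
t.test(x = TrafficByLocation$Traffic1980, y = TrafficByLocation$Traffic1970,
       paired = FALSE, var.equal = TRUE,
       conf.level = 0.95, alternative = "less")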


Task 1.4: Discuss the data structures of both data-frames TRAFFICBYLOCATION and
TRAFFICBYYEAR. Which data structure would lend itself to a matched pairs test
and which to an independent samples test? Is the matched pairs test scenario
or the two independent samples test scenario more appropriate for the given
data? Why do the test outcomes differ and which test is more powerful for the
particular scenario? Justify your choice. (0.5 points)
Answer:

Because t.test( ) offers the paired = TRUE/FALSE argument, the TrafficByLocation data frame can be used for both the matched pairs test and the independent samples test, and I prefer it over TrafficByYear for both. TrafficByLocation pairs the 1970 and 1980 volumes of the same intersection and direction in one row, which is the structure a matched pairs test needs; TrafficByYear stacks all observations by year, which matches the independent samples layout.
The test outcomes differ because the matched pairs test evaluates the change at each location rather than the difference in overall levels. The matched pairs scenario is more appropriate here: the same locations were measured in both years, so pairing removes the between-location variance from the error term. This makes the matched pairs test more powerful for this scenario, as its much smaller p-value shows.

Task 2: Bayes' Theorem (1 point)


A real estate agent has a client with a specific preference for two school districts. The client is in town for only a short period, so the real estate agent can show houses in just one school district. School district A has 4 available houses, whereas school district B has 6 available houses. Professionally typeset your answers.
[Figures: available houses in School District A and School District B]

Task 2.1: Give the prior probabilities for the school districts. To which school district shall the real estate agent take her client to maximize the likelihood of finding a new home for the client? Make use of standard probability notation and show your equation. (0.2 points)
Answer:
I would recommend that the agent take the client to School District B, as it has the larger number of available houses:


Pr(A) = 4 / (6 + 4) = 0.4
Pr(B) = 6 / (6 + 4) = 0.6
While talking to the client, the real estate agent learns that the client prefers a home with mature trees on the property. This information is not part of regular real estate listings. However, from previous visits to districts A and B, the agent has seen more trees in district A. She subjectively thinks that a home site in district A is treed with probability Pr(Treed|A) = 3/4, whereas for school district B it is only Pr(Treed|B) = 1/3.

[Figures: treed home sites in School District A and School District B]

Task 2.2: Calculate the total probability Pr(Treed). Make use of standard probability notation and show your equation. (0.4 points)

Pr(Treed) = Pr(A ∩ Treed) + Pr(B ∩ Treed)
          = Pr(Treed|A) · Pr(A) + Pr(Treed|B) · Pr(B)
          = (3/4 × 0.4) + (1/3 × 0.6)
          = 0.3 + 0.2
          = 0.5

Task 2.3: Calculate the posterior probabilities Pr(A|Treed) and Pr(B|Treed). Make use of standard probability notation and show your equation. Will the real estate agent revise her choice of the neighborhood in which she will show her client homes? (0.4 points)


Pr(A|Treed) = Pr(A ∩ Treed) / Pr(Treed)
            = Pr(Treed|A) · Pr(A) / Pr(Treed)
            = (3/4 × 0.4) / 0.5
            = 0.3 / 0.5
            = 0.6

Pr(B|Treed) = Pr(B ∩ Treed) / Pr(Treed)
            = Pr(Treed|B) · Pr(B) / Pr(Treed)
            = (1/3 × 0.6) / 0.5
            = 0.2 / 0.5
            = 0.4

Yes. Based on the posterior probabilities (Pr(A|Treed) = 0.6 > Pr(B|Treed) = 0.4), the real estate agent should revise her choice and show her client homes in School District A.
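These Bayes computations can be verified numerically in R; a minimal sketch (the object names are my own, not part of the assignment):
# Verify prior, total, and posterior probabilities for the treed-home example
prior      <- c(A = 4/10, B = 6/10)        # prior probabilities from Task 2.1
likelihood <- c(A = 3/4,  B = 1/3)         # Pr(Treed | district)
pTreed     <- sum(likelihood * prior)      # total probability: 0.5
posterior  <- likelihood * prior / pTreed  # Bayes' theorem: A = 0.6, B = 0.4
print(pTreed)
print(posterior)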

Task 3: Joint, marginal and conditional probabilities (1 point)


The radar captured the speed of 325 vehicles at a neighborhood location. The license plates were used to distinguish between cars owned by local residents, non-residents, and commercial vehicles.
The results of the speed measurements are shown in the table below.
Speed\Plates   Resident   Not Resident   Commercial   Sum
Tolerable           175             96           15   286
Excessive             5             24           10    39
Sum                 180            120           25   325

Task 3.1: Calculate the joint probabilities and the marginal probabilities for the
observed table. Report these in a properly formatted table rounded to three decimal
places.
Speed\Plates   Resident   Not Resident   Commercial     Sum
Tolerable         0.538          0.295        0.046   0.880
Excessive         0.015          0.073        0.031   0.120
Sum               0.554          0.369        0.077   1.000
Task 3.2: Calculate the joint probabilities of the cells assuming independence
between Plates and Speed. Report these in a properly formatted table rounded to
three decimal places.
Speed\Plates   Resident   Not Resident   Commercial     Sum
Tolerable         0.487          0.325        0.068   0.880
Excessive         0.066          0.044        0.009   0.120
Sum               0.554          0.369        0.076   1.000


Task 3.3: Calculate the observed conditional probabilities for the type of car ownership (i.e., local resident, non-resident, or commercial vehicle) conditional on the car's speed. Report these in a properly formatted table rounded to three decimal places.
Speed\Plates   Resident   Not Resident   Commercial
Tolerable         0.612          0.336        0.052
Excessive         0.128          0.615        0.256

Task 3.4: Interpret the conditional probabilities. What story do the data tell about
the driving habits of the local residents, drivers from outside the neighborhood and
commercial drivers?
Answer:
Comparing the observed joint probabilities (Task 3.1) with the joint probabilities expected under independence (Task 3.2) shows that Speed and Plates are not independent. Among drivers at tolerable speeds, 61.2% are residents, whereas residents make up only 12.8% of the excessive-speed drivers; non-residents and commercial vehicles account for 61.5% and 25.6% of the excessive speeders, respectively. The data suggest that local residents largely keep to tolerable speeds in their own neighborhood, while speeding is dominated by drivers from outside the neighborhood and, relative to their small share of traffic, by commercial drivers.
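All three probability tables can be reproduced in R from the count table; a minimal sketch (the object name counts is my own):
# Observed counts with Speed in rows and Plates in columns
counts <- matrix(c(175, 96, 15,
                     5, 24, 10),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Speed  = c("Tolerable", "Excessive"),
                                 Plates = c("Resident", "Not Resident", "Commercial")))
round(addmargins(prop.table(counts)), 3)   # joint and marginal probabilities (Task 3.1)
round(outer(rowSums(counts), colSums(counts)) / sum(counts)^2, 3)  # independence (Task 3.2)
round(prop.table(counts, margin = 1), 3)   # Pr(Plates | Speed) (Task 3.3)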

Task 4: The Box-Cox Transformations (1 point)


Import the dataset CONCORD1.SAV as a data-frame into R. Perform the following tasks for the variable water81.

Task 4.1: Find the optimal λ_opt-value for the Box-Cox transformation of the variable water81. After the transformation the variable should be approximately symmetrically distributed. (0.2 points)
See the R-function summary(car::powerTransform(varName))
Answer:
# Task 4.1
library(foreign)
Concord <- read.spss("Concord1.sav",to.data.frame = T)
Concord <- na.omit(Concord)
summary(car::powerTransform(Concord$water81))
Output:
bcPower Transformation to Normality
                Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
Concord$water81     0.307   0.0494           0.2102           0.4038


Likelihood ratio tests about transformation parameters
                            LRT df         pval
LR test, lambda = (0)  42.04234  1 8.931833e-11
LR test, lambda = (1) 168.26042  1 0.000000e+00

The optimal λ_opt-value is 0.307.

Histogram for water81 before transformation:

Histogram for water81 variable after transformation:
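A sketch of code that could have produced these two histograms, using the objects defined above:
# Histograms of water81 before and after the Box-Cox transformation
hist(Concord$water81, main = "water81 before transformation",
     xlab = "water81")
hist(car::bcPower(Concord$water81, 0.307),
     main = "water81 after Box-Cox transformation (lambda = 0.307)",
     xlab = "transformed water81")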


The Box-Cox transformation clearly made the distribution of water81 more symmetric and closer to normal.

Task 4.2: Calculate the skewness before (i.e., λ = 1.0) and after the transformation (i.e., λ = λ_opt) of the variable water81, as well as for the logarithmically transformed variable (i.e., λ = 0.0). Report the results and interpret them. (0.6 points)
See the R-functions car::bcPower( ) and e1071::skewness( )
Answer:
Skewness before:
## Task 4.2
## Skewness before
e1071::skewness(Concord$water81)
Output:
1.747981

The water81 variable has a high positive skewness of 1.748, which is typical for a variable bounded below by zero.
Skewness after:
## Skewness after
e1071::skewness(car::bcPower(Concord$water81, 0.307))
Output:

0.03897976

For the optimal λ = 0.307, the transformed water81 variable is nearly symmetric (approximately normal), with only a slight positive skewness of 0.039.


Skewness using the log-transformation:
## Skewness when log transformed
e1071::skewness(car::bcPower(Concord$water81, 0))
Output:
-0.888129

The log-transformation did not make water81 symmetric; it overshoots, leaving the variable negatively skewed with a skewness of -0.888.
Task 4.3: Test the optimal Box-Cox transformed variable water81 for normality using the Shapiro-Wilk test. Has the optimal Box-Cox transformation achieved normality? [0.2 points]
See the R-function shapiro.test( )
Answer:
# Shapiro test for water81 after transformation
shapiro.test(car::bcPower(Concord$water81, 0.307))
Output:
Shapiro-Wilk normality test
data: bcPower(Concord$water81, 0.307)
W = 0.99101, p-value = 0.007869
# Shapiro test for water81 before transformation
shapiro.test(Concord$water81)
Shapiro-Wilk normality test
data: Concord$water81
W = 0.86868, p-value < 2.2e-16

The W statistic increases from 0.869 to 0.991 after the transformation, so the transformation has moved the sample distribution much closer to normality. However, the p-value of 0.0079 is still below 0.05, so the Shapiro-Wilk test still rejects the null hypothesis of normality: strictly speaking, the optimal Box-Cox transformation has improved symmetry but has not achieved normality (in other words, we are not confident that the population of water81 is normally distributed even after the transformation).


Task 5: Confidence Intervals (1.5 points)


Use the CONCORD1.SAV file for this task. To simplify things, do not perform variable transformations.
Task 5.1: Run a bivariate regression of income on educat and interpret the model
estimates. (0.5 points)
Answer:
## Task 5
# Task 5.1
Concord <- read.spss("Concord1.sav",to.data.frame = T)
reg1 <- lm(income ~ educat, data= Concord)
summary(reg1)
Output:
Call:
lm(formula = income ~ educat, data = Concord)

Residuals:
    Min      1Q  Median      3Q     Max
-26.848  -8.182  -0.997   6.471  74.003

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.5893     2.5574   1.012    0.312
educat        1.4630     0.1783   8.203 2.04e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.26 on 494 degrees of freedom
Multiple R-squared: 0.1199,    Adjusted R-squared: 0.1181
F-statistic: 67.29 on 1 and 494 DF,  p-value: 2.04e-15

Interpretation:
The equation for the linear model can be written as:
Income = 2.5893 + 1.4630 × Education

Interpretation of model parameters:
Educat: For every 1-year increase in education, income increases by about $1,463 on average (income is measured in $1,000). The p-value of the educat coefficient is significant at the 95% confidence level.

Intercept: People with no formal education are predicted to earn about $2,589 on average; however, the p-value for the intercept is not significant at the 95% confidence level.

Task 5.2: Calculate the 95% confidence intervals around the estimated regression
parameters. Can you draw the same conclusion as you did using the t-test in task
5.1? (0.5 points)
Answers:
The 95% confidence interval around an estimated parameter b is
[b − t(1−α/2) · SE_b , b + t(1−α/2) · SE_b], with α = 0.05 and t(1−α/2) ≈ 1.965 for 494 degrees of freedom.

Intercept:
t(1−α/2) · SE_b0 = 1.965 × 2.5574 = 5.025
so the 95% confidence interval is [2.5893 − 5.025, 2.5893 + 5.025] = [−2.435, 7.614].
Because this interval contains 0, the intercept is not significantly different from 0 at the 95% confidence level.

Educat coefficient:
t(1−α/2) · SE_b1 = 1.965 × 0.1783 = 0.3503
so the 95% confidence interval is [1.4630 − 0.3503, 1.4630 + 0.3503] = [1.113, 1.813].

Interpretation:
The 95% confidence interval of the intercept contains 0, so we cannot conclude that the intercept differs significantly from 0 at the 95% confidence level, which is the same conclusion as in Task 5.1.
The 95% confidence interval of the education coefficient does not contain 0, so we can reject the null hypothesis H0: β1 = 0. The same conclusion follows from the p-value (2.04e-15) in Task 5.1.
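These intervals can be verified directly in R with the fitted model from Task 5.1:
# 95% confidence intervals for the regression parameters
confint(reg1, level = 0.95)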
Task 5.3: Scatterplot both variables and add the predicted regression line as well as
the lower and upper 95% confidence interval lines around the predicted regression


line. You may also want to enhance your plot by adding lines for the means of
education and income as well as by adding a title. (0.5 points)
Answer:
# ConcordPred is not defined in the original code; the construction below is an
# assumed reconstruction: fitted values plus 95% confidence limits, sorted by educat
ConcordPred <- cbind(Concord, predict(reg1, interval = "confidence", level = 0.95))
ConcordPred <- ConcordPred[order(ConcordPred$educat), ]
plot(income ~ educat, data = ConcordPred, main = "Income vs. Education",
     xlab = "Years of Education", ylab = "Income (in $1000)")
lines(ConcordPred$educat, ConcordPred$fit, col = "red")
lines(ConcordPred$educat, ConcordPred$lwr, col = "green")
lines(ConcordPred$educat, ConcordPred$upr, col = "green")
abline(h = mean(ConcordPred$income), v = mean(ConcordPred$educat), lty = 3)
plot:

Task 6: Calibration and Prediction of a Bivariate Regression Model with Skewed Variables [3.5 points]

The STATA file HPRICE2.DTA holds, among other variables, price and dist (weighted distance from 5 major employment centers). Use the R-function foreign::read.dta( ) to import the data.



Task 6.1: Scatterplot price in dependence of dist with box-plots at the margins. Are data transformations for both variables advisable? (0.5 points)
See the R-function car::scatterplot( ).

Answer: We can see from the marginal boxplots that both distributions are positively skewed; transformations of both variables are therefore advisable.
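A sketch of the code that could produce this scatterplot with marginal boxplots (assuming the data file sits in the working directory):
# Import the STATA file and plot price against dist with marginal boxplots
library(foreign)
hprice <- read.dta("HPRICE2.DTA")
car::scatterplot(price ~ dist, data = hprice, boxplots = "xy",
                 main = "Price vs. Distance")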

Task 6.2: Find the best transformation of both variables and test whether a log-transformation is sufficient to achieve symmetry. If the log-transformation is appropriate (i.e., you cannot reject H0: λ_opt = 0), then use it, because the regression model can then be interpreted in terms of elasticities. (1 point)
Answer:
Transforming the Independent Variable (dist):
> summary(powerTransform(hprice$dist))
bcPower Transformation to Normality
            Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
hprice$dist    -0.156   0.0868          -0.3261           0.0141

Likelihood ratio tests about transformation parameters
                             LRT df       pval
LR test, lambda = (0)   3.242834  1 0.07173644
LR test, lambda = (1) 179.060316  1 0.00000000

> findMaxLambda(lm(dist ~ 1, data = hprice))
[1] 0.117

As we can see from the p-value of the LR test for λ = 0 (0.072 > 0.05), we cannot reject H0: λ_opt = 0, so a log-transformation (λ = 0) of dist is sufficient to achieve approximate symmetry. The same can be inferred from the findMaxLambda graph, where the 95% confidence limits include 0.
So we log-transform dist:
> bcDist <- bcPower(hprice$dist, 0)

Transforming the Dependent Variable (price):

> summary(powerTransform(hprice$price ~ bcDist))
bcPower Transformation to Normality
   Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
Y1    0.0692   0.0773          -0.0822           0.2206

Likelihood ratio tests about transformation parameters
                             LRT df      pval
LR test, lambda = (0)   0.8093437  1 0.3683144
LR test, lambda = (1) 127.4887926  1 0.0000000

> findMaxLambda(lm(price ~ bcDist, data = hprice))
[1] 0.069


Again, as we can see from the p-value of the LR test for λ = 0 (0.368 > 0.05) and from the findMaxLambda graph, a log-transformation of price achieves approximate symmetry.
So we transform price:
> bcPrice <- bcPower(hprice$price, 0)

Task 6.3: Estimate the model in the transformed system and interpret the estimates. Also test whether the estimated slope parameter differs significantly from 1, i.e., H0: β1 = 1. (1 point)
Answer:
> lm.prDist <- lm(bcPrice ~ bcDist)
> summary(lm.prDist)

Call:
lm(formula = bcPrice ~ bcDist)

Residuals:
     Min       1Q   Median       3Q      Max
-1.18184 -0.21302 -0.02242  0.16747  1.20554

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.57679    0.04033 237.479   <2e-16 ***
bcDist       0.30657    0.03091   9.919   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3747 on 504 degrees of freedom
Multiple R-squared: 0.1633,    Adjusted R-squared: 0.1617
F-statistic: 98.38 on 1 and 504 DF,  p-value: < 2.2e-16


Interpretation:
The parameter estimate of bcDist is the elasticity between price and dist: a 1% increase in dist is associated with an increase of about 0.31% in price. Since the elasticity is less than 1, price increases with dist, but the percentage change in price is smaller than the percentage change in dist.
We cannot directly interpret the intercept because this coefficient lives in the transformed (log) system.

Testing whether β1 is significantly different from 1:

H0: β1 = 1
HA: β1 ≠ 1

The test statistic is

t = (b1 − 1) / SE(b1) = (0.30657 − 1) / 0.03091 = −22.434

with degrees of freedom ν = n − 2 = 506 − 2 = 504.

The two-sided p-value for the test statistic t = −22.434 with ν = 504 is

p = 7.524 × 10^-78 ≈ 0.

Hence we can reject the null hypothesis at virtually zero error probability.
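This hand calculation can be verified in R with the model estimated above:
# Test H0: beta1 = 1 against a two-sided alternative
b1    <- coef(summary(lm.prDist))["bcDist", "Estimate"]
se1   <- coef(summary(lm.prDist))["bcDist", "Std. Error"]
tstat <- (b1 - 1) / se1                  # about -22.434
pval  <- 2 * pt(-abs(tstat), df = 504)   # about 7.5e-78
c(t = tstat, p = pval)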


Task 6.4: Perform a prediction in the original data units and plot the median and
expectation curves. Why is the expected prediction in this case larger than the
median prediction? (1 point)


Answer:

The expectation curve lies above the median curve because the model was calibrated on log(price): back-transforming to the original units turns the approximately normal distribution of log-prices into a positively skewed (log-normal) distribution of prices, and for a log-normal variable the mean exp(μ + σ²/2) always exceeds the median exp(μ). The same effect can be observed in the boxplot of price with the mean added: the mean (colored in red) lies above the median, because the mean is more sensitive to the large values in this positively skewed distribution.
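A sketch of code that could produce the back-transformed predictions and both curves, assuming the objects from Task 6.3 and normally distributed errors (the exp(σ²/2) factor is the standard log-normal correction for the mean):
# Back-transform predictions from the log-log model into original units
ord      <- order(hprice$dist)
fitLog   <- predict(lm.prDist)            # predictions in log units
sigma    <- summary(lm.prDist)$sigma      # residual standard error
medPred  <- exp(fitLog)                   # median prediction
meanPred <- exp(fitLog + sigma^2 / 2)     # expectation prediction
plot(price ~ dist, data = hprice, main = "Price vs. Distance: Predictions",
     xlab = "dist", ylab = "price")
lines(hprice$dist[ord], medPred[ord],  col = "blue")   # median curve
lines(hprice$dist[ord], meanPred[ord], col = "red")    # expectation curve
legend("topright", legend = c("median", "expectation"),
       col = c("blue", "red"), lty = 1)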

