
Eswar Sai Santosh Bandaru: Lab02

Lab 02: Statistics and Bivariate Regression

Handout date: Thursday, September 22, 2016
Due date: Thursday, October 6, 2016 at the beginning of the lecture as hardcopy
This lab counts 10% toward your total grade.
Objectives: In this lab you will
[a] review some applied aspects of hypothesis testing,
[b] review some theoretical aspects of probability theory,
[c] identify the proper parameter λ_opt of the Box-Cox transformation in the context of univariate and bivariate data analysis, and
[d] practice bivariate regression analysis.
Format of answer: Your answers (graphs and verbal descriptions) should be handed in as hard-copy in one document. Add a running title into the header of the document with the following information: your name, Lab02, and page numbers. Label each answer properly, starting with its task number. Maintain the sequence of questions. Format any code and computer output properly before inserting it into the document with your answer. Code and text output need to be in a monospaced font (i.e., a fixed-pitch font) such as Courier New so proper spacing and alignment are preserved. Excessive and irrelevant output will lead to a deduction of your accumulated points.

Task 1: Test for Differences in Traffic Volumes (2 points)


The workspace TRAFFIC.RDATA holds two data-frames with
identical measurements on the traffic volume (number of
vehicles per day in thousands) in all four directions at 11
intersections in downtown Akron, Ohio, for the first Saturday in
June in either 1970 or 1980.
It is generally assumed that the downtown traffic volume decreased from 1970 to 1980 due to the opening of new retail locations in the suburbs, which pulled traffic away from the central city. We need to ignore other factors by which the overall traffic volume may have changed within the decade from 1970 to 1980, because we lack that information.


Task 1.1: Test whether the variance of the traffic volume in 1970 is identical to the variance of the traffic volume in 1980. The R function is var.test( ). Use an error probability of α = 0.05. (0.5 points)

Answer:
Code:
# Task 1.1
var.test(x = TrafficByLocation$Traffic1980, y = TrafficByLocation$Traffic1970,
         conf.level = 0.95)
Output:
F test to compare two variances
data: TrafficByLocation$Traffic1980 and TrafficByLocation$Traffic1970
F = 0.95214, num df = 43, denom df = 43, p-value = 0.873
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.5195362 1.7449772
sample estimates:
ratio of variances
0.9521443

The p-value of the F-test for equal variances is 0.873; hence there is no significant evidence that the variances differ, and we fail to reject the null hypothesis.

Task 1.2: Perform a matched pairs difference-of-means test of the null hypothesis H0: μ80 − μ70 = 0 against the alternative hypothesis HA: μ80 − μ70 < 0 and interpret the results. Use an error probability of α = 0.05.
Why are you performing a one-sided test? (0.5 points)


Answer:
# Task 1.2
# Paired t-test
t.test(x = TrafficByLocation$Traffic1980, y = TrafficByLocation$Traffic1970,
       paired = TRUE, conf.level = 0.95, alternative = "less")
Output:


Paired t-test
data: TrafficByLocation$Traffic1980 and TrafficByLocation$Traffic1970
t = -4.0999, df = 43, p-value = 9.002e-05
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -0.9894095
sample estimates:
mean of the differences
-1.677045

The p-value of the paired one-tailed t-test (9.002e-05) is well below 5%; hence there is sufficient evidence to reject the null hypothesis: traffic volume decreased significantly at the intersections in all directions from 1970 to 1980. A one-sided test is used because the research hypothesis states a direction, namely that traffic volume decreased.
Task 1.3: Perform a two independent samples difference-of-means test for H0: μ80 = μ70 against HA: μ80 < μ70 and interpret the results of this test. Use an error probability of α = 0.05.
Decide, based on task [c], under which assumption with regard to variance equality you are performing the test. (0.5 points)
Answer:
# Task 1.3
# independent sample test
t.test(x = TrafficByLocation$Traffic1980, y = TrafficByLocation$Traffic1970,
       paired = FALSE, conf.level = 0.95, alternative = "less")
Output:
Welch Two Sample t-test
data: TrafficByLocation$Traffic1980 and TrafficByLocation$Traffic1970
t = -1.2023, df = 85.948, p-value = 0.1163
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 0.6423709
sample estimates:
mean of x mean of y
11.40023 13.07727

We fail to reject the null hypothesis due to lack of sufficient evidence (p-value = 0.1163 > 0.05).
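Since the F-test in Task 1.1 did not reject equal variances, a pooled-variance version of this test (var.equal = TRUE) would also be defensible; a minimal sketch, reusing the data frame from above:
# Pooled two-sample t-test under the equal-variance assumption from Task 1.1
t.test(x = TrafficByLocation$Traffic1980, y = TrafficByLocation$Traffic1970,
       paired = FALSE, var.equal = TRUE,
       conf.level = 0.95, alternative = "less")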


Task 1.4: Discuss the data structures of both data-frames TRAFFICBYLOCATION and
TRAFFICBYYEAR. Which data structure would lend itself to a matched pairs test
and which to an independent samples test? Is the matched pairs test scenario
or the two independent samples test scenario more appropriate for the given
data? Why do the test outcomes differ and which test is more powerful for the
particular scenario? Justify your choice. (0.5 points)
Answer:

Because t.test( ) offers the paired = TRUE/FALSE argument, the TrafficByLocation data frame can be used for both the matched pairs test and the independent samples test, and I prefer it over TrafficByYear for both. TrafficByLocation pairs the 1970 and 1980 volumes of the same intersection and direction in one row, which is the structure a matched pairs test needs; TrafficByYear stacks all observations by year, which matches the independent samples layout.
The test outcomes differ because the matched pairs test evaluates the change at each location rather than the difference in overall levels. The matched pairs scenario is more appropriate here: the same locations were measured in both years, so pairing removes the between-location variance from the error term. This makes the matched pairs test more powerful for this scenario, as its much smaller p-value shows.

Task 2: Bayes' Theorem (1 point)


A real estate agent has a client with a specific preference for two school districts. The client is in town for only a short period, so the real estate agent can show houses in just one school district. School district A has 4 available houses, whereas school district B has 6 available houses. Professionally typeset your answers.
[Figures: available houses in School District A and School District B]

Task 2.1: Give the prior probabilities for the school districts. To which school district shall the real estate agent take her client to maximize the likelihood of finding a new home for the client? Make use of standard probability notation and show your equation. (0.2 points)
Answer:
I would recommend that the agent take the client to School District B, as it has the larger number of available houses:


Pr(A) = 4 / (6 + 4) = 0.4
Pr(B) = 6 / (6 + 4) = 0.6
While talking to the client, the real estate agent learns that the client prefers a home with mature trees on the property. This information is not part of regular real estate listings. However, from previous visits to districts A and B, the agent has seen more trees in district A. She subjectively thinks that a home site in district A is treed with probability Pr(Treed|A) = 3/4, whereas for school district B it is only Pr(Treed|B) = 1/3.

[Figures: treed home sites in School District A and School District B]

Task 2.2: Calculate the total probability Pr(Treed). Make use of standard probability notation and show your equation. (0.4 points)

Pr(Treed) = Pr(A ∩ Treed) + Pr(B ∩ Treed)
          = Pr(Treed|A) · Pr(A) + Pr(Treed|B) · Pr(B)
          = (3/4 × 0.4) + (1/3 × 0.6)
          = 0.3 + 0.2
          = 0.5

Task 2.3: Calculate the posterior probabilities Pr(A|Treed) and Pr(B|Treed). Make use of standard probability notation and show your equation. Will the real estate agent revise her choice of the neighborhood in which she will show her client homes? (0.4 points)


Pr(A|Treed) = Pr(A ∩ Treed) / Pr(Treed)
            = Pr(Treed|A) · Pr(A) / Pr(Treed)
            = (3/4 × 0.4) / 0.5
            = 0.3 / 0.5
            = 0.6

Pr(B|Treed) = Pr(B ∩ Treed) / Pr(Treed)
            = Pr(Treed|B) · Pr(B) / Pr(Treed)
            = (1/3 × 0.6) / 0.5
            = 0.2 / 0.5
            = 0.4

Yes. Based on the posterior probabilities (Pr(A|Treed) = 0.6 > Pr(B|Treed) = 0.4), the real estate agent should revise her choice and show her client homes in School District A.
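These Bayes computations can be verified numerically in R; a minimal sketch (the object names are my own, not part of the assignment):
# Verify prior, total, and posterior probabilities for the treed-home example
prior      <- c(A = 4/10, B = 6/10)        # prior probabilities from Task 2.1
likelihood <- c(A = 3/4,  B = 1/3)         # Pr(Treed | district)
pTreed     <- sum(likelihood * prior)      # total probability: 0.5
posterior  <- likelihood * prior / pTreed  # Bayes' theorem: A = 0.6, B = 0.4
print(pTreed)
print(posterior)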

Task 3: Joint, marginal and conditional probabilities (1 point)


The radar captured the speed of 325 vehicles at a neighborhood location. The license plates were used to distinguish between cars owned by local residents, non-residents, and commercial vehicles.
The results of the speed measurements are shown in the table below.
Speed\Plates   Resident   Not Resident   Commercial   Sum
Tolerable           175             96           15   286
Excessive             5             24           10    39
Sum                 180            120           25   325

Task 3.1: Calculate the joint probabilities and the marginal probabilities for the
observed table. Report these in a properly formatted table rounded to three decimal
places.
Speed\Plates   Resident   Not Resident   Commercial     Sum
Tolerable         0.538          0.295        0.046   0.880
Excessive         0.015          0.073        0.031   0.120
Sum               0.554          0.369        0.077   1.000
Task 3.2: Calculate the joint probabilities of the cells assuming independence
between Plates and Speed. Report these in a properly formatted table rounded to
three decimal places.
Speed\Plates   Resident   Not Resident   Commercial     Sum
Tolerable         0.487          0.325        0.068   0.880
Excessive         0.066          0.044        0.009   0.120
Sum               0.554          0.369        0.076   1.000


Task 3.3: Calculate the observed conditional probabilities for the type of car ownership (i.e., local resident, non-resident, or commercial vehicle) conditional on the car's speed. Report these in a properly formatted table rounded to three decimal places.
Speed\Plates   Resident   Not Resident   Commercial
Tolerable         0.612          0.336        0.052
Excessive         0.128          0.615        0.256

Task 3.4: Interpret the conditional probabilities. What story do the data tell about
the driving habits of the local residents, drivers from outside the neighborhood and
commercial drivers?
Answer:
Comparing the observed joint probabilities (Task 3.1) with the joint probabilities expected under independence (Task 3.2) shows that Speed and Plates are not independent. Among drivers at tolerable speeds, 61.2% are residents, whereas residents make up only 12.8% of the excessive-speed drivers; non-residents and commercial vehicles account for 61.5% and 25.6% of the excessive speeders, respectively. The data suggest that local residents largely keep to tolerable speeds in their own neighborhood, while speeding is dominated by drivers from outside the neighborhood and, relative to their small share of traffic, by commercial drivers.
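All three probability tables can be reproduced in R from the count table; a minimal sketch (the object name counts is my own):
# Observed counts with Speed in rows and Plates in columns
counts <- matrix(c(175, 96, 15,
                     5, 24, 10),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Speed  = c("Tolerable", "Excessive"),
                                 Plates = c("Resident", "Not Resident", "Commercial")))
round(addmargins(prop.table(counts)), 3)   # joint and marginal probabilities (Task 3.1)
round(outer(rowSums(counts), colSums(counts)) / sum(counts)^2, 3)  # independence (Task 3.2)
round(prop.table(counts, margin = 1), 3)   # Pr(Plates | Speed) (Task 3.3)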

Task 4: The Box-Cox Transformations (1 point)


Import the dataset CONCORD1.SAV as a data-frame into R. Perform the following tasks for the variable water81.

Task 4.1: Find the optimal λ_opt-value for the Box-Cox transformation of the variable water81. After the transformation the variable should be approximately symmetrically distributed. (0.2 points)
See the R-function summary(car::powerTransform(varName))
Answer:
# Task 4.1
library(foreign)
Concord <- read.spss("Concord1.sav",to.data.frame = T)
Concord <- na.omit(Concord)
summary(car::powerTransform(Concord$water81))
Output:
bcPower Transformation to Normality
                Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
Concord$water81     0.307   0.0494           0.2102           0.4038


Likelihood ratio tests about transformation parameters
                            LRT df         pval
LR test, lambda = (0)  42.04234  1 8.931833e-11
LR test, lambda = (1) 168.26042  1 0.000000e+00

The optimal λ_opt-value is 0.307.

Histogram for water81 before transformation:

Histogram for water81 variable after transformation:
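A sketch of code that could have produced these two histograms, using the objects defined above:
# Histograms of water81 before and after the Box-Cox transformation
hist(Concord$water81, main = "water81 before transformation",
     xlab = "water81")
hist(car::bcPower(Concord$water81, 0.307),
     main = "water81 after Box-Cox transformation (lambda = 0.307)",
     xlab = "transformed water81")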


The Box-Cox transformation clearly made the distribution of water81 more symmetric and closer to normal.

Task 4.2: Calculate the skewness before (i.e., λ = 1.0) and after the transformation (i.e., λ = λ_opt) of the variable water81, as well as for the logarithmically transformed variable (i.e., λ = 0.0). Report the results and interpret them. (0.6 points)
See the R-functions car::bcPower( ) and e1071::skewness( )
Answer:
Skewness before:
## Task 4.2
## Skewness before
e1071::skewness(Concord$water81)
Output:
1.747981

The water81 variable has a high positive skewness of 1.748, which is typical for a variable bounded below by zero.
Skewness after:
## Skewness after
e1071::skewness(car::bcPower(Concord$water81, 0.307))
Output:

0.03897976

For the optimal λ = 0.307, the transformed water81 variable is nearly symmetric (approximately normal), with only a slight positive skewness of 0.039.


Skewness using the log-transformation:
## Skewness when log transformed
e1071::skewness(car::bcPower(Concord$water81, 0))
Output:
-0.888129

The log-transformation did not make water81 symmetric; it overshoots, leaving the variable negatively skewed with a skewness of -0.888.
Task 4.3: Test the optimal Box-Cox transformed variable water81 for normality using the Shapiro-Wilk test. Has the optimal Box-Cox transformation achieved normality? [0.2 points]
See the R-function shapiro.test( )
Answer:
# Shapiro test for water81 after transformation
shapiro.test(car::bcPower(Concord$water81, 0.307))
Output:
Shapiro-Wilk normality test
data: bcPower(Concord$water81, 0.307)
W = 0.99101, p-value = 0.007869
# Shapiro test for water81 before transformation
shapiro.test(Concord$water81)
Shapiro-Wilk normality test
data: Concord$water81
W = 0.86868, p-value < 2.2e-16

The W statistic increases from 0.869 to 0.991 after the transformation, so the transformation has moved the sample distribution much closer to normality. However, the p-value of 0.0079 is still below 0.05, so the Shapiro-Wilk test still rejects the null hypothesis of normality: strictly speaking, the optimal Box-Cox transformation has improved symmetry but has not achieved normality (in other words, we are not confident that the population of water81 is normally distributed even after the transformation).


Task 5: Confidence Intervals (1.5 points)


Use the CONCORD1.SAV file for this task. To simplify things, do not perform variable transformations.
Task 5.1: Run a bivariate regression of income on educat and interpret the model
estimates. (0.5 points)
Answer:
## Task 5
# Task 5.1
Concord <- read.spss("Concord1.sav",to.data.frame = T)
reg1 <- lm(income ~ educat, data= Concord)
summary(reg1)
Output:
Call:
lm(formula = income ~ educat, data = Concord)

Residuals:
    Min      1Q  Median      3Q     Max
-26.848  -8.182  -0.997   6.471  74.003

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.5893     2.5574   1.012    0.312
educat        1.4630     0.1783   8.203 2.04e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.26 on 494 degrees of freedom
Multiple R-squared: 0.1199,    Adjusted R-squared: 0.1181
F-statistic: 67.29 on 1 and 494 DF,  p-value: 2.04e-15

Interpretation:
The equation for the linear model can be written as:
Income = 2.5893 + 1.4630 × Education

Interpretation of model parameters:
Educat: For every 1-year increase in education, income increases by about $1,463 on average (income is measured in $1,000). The p-value of the educat coefficient is significant at the 95% confidence level.

Intercept: People with no formal education are predicted to earn about $2,589 on average; however, the p-value for the intercept is not significant at the 95% confidence level.

Task 5.2: Calculate the 95% confidence intervals around the estimated regression
parameters. Can you draw the same conclusion as you did using the t-test in task
5.1? (0.5 points)
Answers:
The 95% confidence interval around an estimated parameter b is
[b − t(1−α/2) · SE_b , b + t(1−α/2) · SE_b], with α = 0.05 and t(1−α/2) ≈ 1.965 for 494 degrees of freedom.

Intercept:
t(1−α/2) · SE_b0 = 1.965 × 2.5574 = 5.025
so the 95% confidence interval is [2.5893 − 5.025, 2.5893 + 5.025] = [−2.435, 7.614].
Because this interval contains 0, the intercept is not significantly different from 0 at the 95% confidence level.

Educat coefficient:
t(1−α/2) · SE_b1 = 1.965 × 0.1783 = 0.3503
so the 95% confidence interval is [1.4630 − 0.3503, 1.4630 + 0.3503] = [1.113, 1.813].

Interpretation:
The 95% confidence interval of the intercept contains 0, so we cannot conclude that the intercept differs significantly from 0 at the 95% confidence level, which is the same conclusion as in Task 5.1.
The 95% confidence interval of the education coefficient does not contain 0, so we can reject the null hypothesis H0: β1 = 0. The same conclusion follows from the p-value (2.04e-15) in Task 5.1.
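These intervals can be verified directly in R with the fitted model from Task 5.1:
# 95% confidence intervals for the regression parameters
confint(reg1, level = 0.95)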
Task 5.3: Scatterplot both variables and add the predicted regression line as well as
the lower and upper 95% confidence interval lines around the predicted regression


line. You may also want to enhance your plot by adding lines for the means of
education and income as well as by adding a title. (0.5 points)
Answer:
# ConcordPred is not defined in the original code; the construction below is an
# assumed reconstruction: fitted values plus 95% confidence limits, sorted by educat
ConcordPred <- cbind(Concord, predict(reg1, interval = "confidence", level = 0.95))
ConcordPred <- ConcordPred[order(ConcordPred$educat), ]
plot(income ~ educat, data = ConcordPred, main = "Income vs. Education",
     xlab = "Years of Education", ylab = "Income (in $1000)")
lines(ConcordPred$educat, ConcordPred$fit, col = "red")
lines(ConcordPred$educat, ConcordPred$lwr, col = "green")
lines(ConcordPred$educat, ConcordPred$upr, col = "green")
abline(h = mean(ConcordPred$income), v = mean(ConcordPred$educat), lty = 3)
plot:

Task 6: Calibration and Prediction of a Bivariate Regression Model with Skewed Variables [3.5 points]

The STATA file HPRICE2.DTA holds, among other variables, price and dist (weighted distance from 5 major employment centers). Use the R-function foreign::read.dta( ) to import the data.



Task 6.1: Scatterplot price in dependence of dist with box-plots at the margins. Are data transformations for both variables advisable? (0.5 points)
See the R-function car::scatterplot( ).

Answer: We can see from the marginal boxplots that both distributions are positively skewed; transformations of both variables are therefore advisable.
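A sketch of the code that could produce this scatterplot with marginal boxplots (assuming the data file sits in the working directory):
# Import the STATA file and plot price against dist with marginal boxplots
library(foreign)
hprice <- read.dta("HPRICE2.DTA")
car::scatterplot(price ~ dist, data = hprice, boxplots = "xy",
                 main = "Price vs. Distance")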

Task 6.2: Find the best transformation of both variables and test whether a log-transformation is sufficient to achieve symmetry. If the log-transformation is appropriate (i.e., you cannot reject H0: λ_opt = 0), then use it, because the regression model can then be interpreted in terms of elasticities. (1 point)
Answer:
Transforming the Independent Variable (dist):
> summary(powerTransform(hprice$dist))
bcPower Transformation to Normality
            Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
hprice$dist    -0.156   0.0868          -0.3261           0.0141

Likelihood ratio tests about transformation parameters
                             LRT df       pval
LR test, lambda = (0)   3.242834  1 0.07173644
LR test, lambda = (1) 179.060316  1 0.00000000

> findMaxLambda(lm(dist ~ 1, data = hprice))
[1] 0.117

As we can see from the p-value of the LR test for λ = 0 (0.072 > 0.05), we cannot reject H0: λ_opt = 0, so a log-transformation (λ = 0) of dist is sufficient to achieve approximate symmetry. The same can be inferred from the findMaxLambda graph, where the 95% confidence limits include 0.
So we log-transform dist:
> bcDist <- bcPower(hprice$dist, 0)

Transforming the Dependent Variable (price):

> summary(powerTransform(hprice$price ~ bcDist))
bcPower Transformation to Normality
   Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
Y1    0.0692   0.0773          -0.0822           0.2206

Likelihood ratio tests about transformation parameters
                             LRT df      pval
LR test, lambda = (0)   0.8093437  1 0.3683144
LR test, lambda = (1) 127.4887926  1 0.0000000

> findMaxLambda(lm(price ~ bcDist, data = hprice))
[1] 0.069


Again, as we can see from the p-value of the LR test for λ = 0 (0.368 > 0.05) and from the findMaxLambda graph, a log-transformation of price achieves approximate symmetry.
So we transform price:
> bcPrice <- bcPower(hprice$price, 0)

Task 6.3: Estimate the model in the transformed system and interpret the estimates. Also test whether the estimated slope parameter differs significantly from 1, i.e., H0: β1 = 1. (1 point)
Answer:
> lm.prDist <- lm(bcPrice ~ bcDist)
> summary(lm.prDist)

Call:
lm(formula = bcPrice ~ bcDist)

Residuals:
     Min       1Q   Median       3Q      Max
-1.18184 -0.21302 -0.02242  0.16747  1.20554

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.57679    0.04033 237.479   <2e-16 ***
bcDist       0.30657    0.03091   9.919   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3747 on 504 degrees of freedom
Multiple R-squared: 0.1633,    Adjusted R-squared: 0.1617
F-statistic: 98.38 on 1 and 504 DF,  p-value: < 2.2e-16


Interpretation:
The parameter estimate of bcDist is the elasticity between price and dist: a 1% increase in dist is associated with an increase of about 0.31% in price. Since the elasticity is less than 1, price increases with dist, but the percentage change in price is smaller than the percentage change in dist.
We cannot directly interpret the intercept because this coefficient lives in the transformed (log) system.

Testing whether β1 is significantly different from 1:

H0: β1 = 1
HA: β1 ≠ 1

The test statistic is

t = (b1 − 1) / SE(b1) = (0.30657 − 1) / 0.03091 = −22.434

with degrees of freedom ν = n − 2 = 506 − 2 = 504.

The two-sided p-value for the test statistic t = −22.434 with ν = 504 is

p = 7.524 × 10^-78 ≈ 0.

Hence we can reject the null hypothesis at virtually zero error probability.
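This hand calculation can be verified in R with the model estimated above:
# Test H0: beta1 = 1 against a two-sided alternative
b1    <- coef(summary(lm.prDist))["bcDist", "Estimate"]
se1   <- coef(summary(lm.prDist))["bcDist", "Std. Error"]
tstat <- (b1 - 1) / se1                  # about -22.434
pval  <- 2 * pt(-abs(tstat), df = 504)   # about 7.5e-78
c(t = tstat, p = pval)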


Task 6.4: Perform a prediction in the original data units and plot the median and
expectation curves. Why is the expected prediction in this case larger than the
median prediction? (1 point)


Answer:

The expectation curve lies above the median curve because the model was calibrated on log(price): back-transforming to the original units turns the approximately normal distribution of log-prices into a positively skewed (log-normal) distribution of prices, and for a log-normal variable the mean exp(μ + σ²/2) always exceeds the median exp(μ). The same effect can be observed in the boxplot of price with the mean added: the mean (colored in red) lies above the median, because the mean is more sensitive to the large values in this positively skewed distribution.
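A sketch of code that could produce the back-transformed predictions and both curves, assuming the objects from Task 6.3 and normally distributed errors (the exp(σ²/2) factor is the standard log-normal correction for the mean):
# Back-transform predictions from the log-log model into original units
ord      <- order(hprice$dist)
fitLog   <- predict(lm.prDist)            # predictions in log units
sigma    <- summary(lm.prDist)$sigma      # residual standard error
medPred  <- exp(fitLog)                   # median prediction
meanPred <- exp(fitLog + sigma^2 / 2)     # expectation prediction
plot(price ~ dist, data = hprice, main = "Price vs. Distance: Predictions",
     xlab = "dist", ylab = "price")
lines(hprice$dist[ord], medPred[ord],  col = "blue")   # median curve
lines(hprice$dist[ord], meanPred[ord], col = "red")    # expectation curve
legend("topright", legend = c("median", "expectation"),
       col = c("blue", "red"), lty = 1)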

