Lab02
Task 1.1: Test whether the variance of the traffic volume in 1970 is identical to the
variance of the traffic volume in 1980. The R function is var.test( ). Use an error
probability of $\alpha = 0.05$.
Answer:
Code:
# Task 1.1
var.test(x = TrafficByLocation$Traffic1980,
         y = TrafficByLocation$Traffic1970, conf.level = 0.95)
Output:
F test to compare two variances
data: TrafficByLocation$Traffic1980 and TrafficByLocation$Traffic1970
F = 0.95214, num df = 43, denom df = 43, p-value = 0.873
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.5195362 1.7449772
sample estimates:
ratio of variances
0.9521443
The p-value of the F test for equal variances is 0.873, well above 0.05. There is
therefore no significant evidence that the variances differ, and we fail to reject
the null hypothesis of equal variances.
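As a cross-check, the reported p-value and confidence interval can be recomputed from the F statistic and the degrees of freedom alone. The sketch below assumes the standard two-sided F test and uses only the figures printed in the output above; the variable names are illustrative.

```r
# Figures copied from the var.test() output above
f_stat <- 0.9521443   # ratio of sample variances, 1980 / 1970
df_num <- 43          # numerator degrees of freedom
df_den <- 43          # denominator degrees of freedom
alpha  <- 0.05

# Two-sided p-value: twice the smaller tail probability of the F distribution
p_lower <- pf(f_stat, df_num, df_den)
p_value <- 2 * min(p_lower, 1 - p_lower)

# Confidence interval for the true ratio of variances
ci <- c(lower = f_stat / qf(1 - alpha / 2, df_num, df_den),
        upper = f_stat / qf(alpha / 2, df_num, df_den))

print(round(p_value, 3))
print(round(ci, 4))
```

Because the interval contains 1, it leads to the same decision as the p-value.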
Task 1.2: Perform a matched pairs difference-of-means test of the null hypothesis
$H_0: \mu_{80} - \mu_{70} \ge 0$
against
$H_A: \mu_{80} - \mu_{70} < 0$
and interpret the result. Use an error probability of $\alpha = 0.05$.
Paired t-test
data: TrafficByLocation$Traffic1980 and TrafficByLocation$Traffic1970
t = -4.0999, df = 43, p-value = 9.002e-05
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -0.9894095
sample estimates:
mean of the differences
-1.677045
The p-value of the one-tailed paired t-test (9.002e-05) is far below 0.05, hence there
is sufficient evidence to reject the null hypothesis: on average, the traffic volume at
the matched locations decreased significantly from 1970 to 1980 (mean difference of
-1.677).
Task 1.3: Perform a two independent samples difference-of-means test for
$H_0: \mu_{80} \ge \mu_{70}$
against
$H_A: \mu_{80} < \mu_{70}$.
Use an error probability of $\alpha = 0.05$.
Decide based on task [c] under which assumption, with regard to variance
equality, you are performing the test. (0.5 points)
Answer:
# Task 1.3
# independent samples test
t.test(x = TrafficByLocation$Traffic1980,
       y = TrafficByLocation$Traffic1970,
       paired = FALSE, conf.level = 0.95, alternative = "less")
Output:
Welch Two Sample t-test
data: TrafficByLocation$Traffic1980 and TrafficByLocation$Traffic1970
t = -1.2023, df = 85.948, p-value = 0.1163
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 0.6423709
sample estimates:
mean of x mean of y
11.40023 13.07727
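The paired and the independent-samples tests share the same mean difference but use very different standard errors, which is why their p-values diverge. The back-of-the-envelope sketch below recovers both standard errors from the printed outputs and, assuming 44 complete pairs and the variance ratio reported by var.test( ), the correlation between the two years that this gap implies. All inputs are rounded values copied from the outputs, so the result is approximate; the variable names are illustrative.

```r
# Rounded figures copied from the test outputs above
mean_diff <- -1.677045   # mean of the paired differences (1980 - 1970)
t_paired  <- -4.0999     # paired t statistic
t_welch   <- -1.2023     # Welch two-sample t statistic
var_ratio <- 0.9521443   # var(1980) / var(1970) from the F test
n         <- 44          # number of pairs (paired df = 43)

# Standard errors implied by t = mean difference / standard error
se_paired <- mean_diff / t_paired
se_welch  <- mean_diff / t_welch

# Paired: n * se^2 = var80 + var70 - 2 * cov
# Welch:  n * se^2 = var80 + var70
sum_var   <- n * se_welch^2
cov_80_70 <- (sum_var - n * se_paired^2) / 2

# Split the variance sum using the reported ratio, then get the correlation
var_70 <- sum_var / (1 + var_ratio)
var_80 <- sum_var - var_70
implied_cor <- cov_80_70 / sqrt(var_80 * var_70)

print(round(c(se_paired = se_paired, se_welch = se_welch,
              implied_cor = implied_cor), 3))
```

A strong positive correlation between the paired measurements removes most of the between-location variation from the differences, which would explain why the paired test detects the decrease while the independent-samples test does not.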
Task 1.4: Discuss the data structures of both data-frames TRAFFICBYLOCATION and
TRAFFICBYYEAR. Which data structure would lend itself to a matched pairs test
and which to an independent samples test? Is the matched pairs test scenario
or the two independent samples test scenario more appropriate for the given
data? Why do the test outcomes differ and which test is more powerful for the
particular scenario? Justify your choice. (0.5 points)
Answer:
Task 2.1: Give the prior probabilities for the school districts. Which school district
shall the real estate agent take her clients to maximize the likelihood of finding a
new home for the client? Make use of standard probability notation and show your
equation. (0.2 points)
Answer:
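The prior probabilities of a purchase in each school district are
$\Pr(A) = \frac{4}{6+4} = 0.4$
$\Pr(B) = \frac{6}{6+4} = 0.6$
I would recommend that the agent take the client to School District B, since it has
the higher prior probability, $\Pr(B) = 0.6 > \Pr(A) = 0.4$, reflecting its larger
number of houses.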
While talking to the client the real estate agent learns that the client prefers a
home with mature trees on the property. This information is not part of regular
real estate listings. However, from previous visits to districts A and B the
agent has seen more trees in district A: the probability of a treed site in
School District A is $\Pr(\text{Treed} \mid A) = 3/4$, whereas for School District B
it is only $\Pr(\text{Treed} \mid B) = 1/3$.
Calculate $\Pr(A \mid \text{Treed})$ and $\Pr(B \mid \text{Treed})$.
Make use of standard probability notation and show your equation. Will the real
estate agent revise her choice of neighborhood in which she will show her client homes?
(0.4 points)
Answer:
$\Pr(\text{Treed}) = \Pr(\text{Treed} \mid A)\Pr(A) + \Pr(\text{Treed} \mid B)\Pr(B) = (\tfrac{3}{4} \cdot 0.4) + (\tfrac{1}{3} \cdot 0.6) = 0.3 + 0.2 = 0.5$
$\Pr(A \mid \text{Treed}) = \frac{\Pr(A \cap \text{Treed})}{\Pr(\text{Treed})} = \frac{\Pr(\text{Treed} \mid A)\Pr(A)}{\Pr(\text{Treed})} = \frac{\tfrac{3}{4} \cdot 0.4}{0.5} = \frac{0.3}{0.5} = 0.6$
$\Pr(B \mid \text{Treed}) = \frac{\Pr(B \cap \text{Treed})}{\Pr(\text{Treed})} = \frac{\Pr(\text{Treed} \mid B)\Pr(B)}{\Pr(\text{Treed})} = \frac{\tfrac{1}{3} \cdot 0.6}{0.5} = \frac{0.2}{0.5} = 0.4$
Yes. Based on the posterior probabilities above, $\Pr(A \mid \text{Treed}) = 0.6$ now
exceeds $\Pr(B \mid \text{Treed}) = 0.4$, so the real estate agent should revise her
choice and show her client homes in School District A.
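The Bayes update above can be reproduced in a few lines of R. This is a minimal sketch using the priors and likelihoods stated in the task; the vector names are illustrative.

```r
# Priors from the listing counts and likelihoods of a treed site per district
prior      <- c(A = 4, B = 6) / (4 + 6)
likelihood <- c(A = 3 / 4, B = 1 / 3)     # Pr(Treed | district)

# Law of total probability, then Bayes' theorem
pr_treed  <- sum(likelihood * prior)
posterior <- likelihood * prior / pr_treed

print(pr_treed)
print(posterior)
```

The posterior vector sums to one by construction, which is a quick sanity check on the arithmetic.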
Speed\Plates   Resident   Not Resident   Commercial   Sum
Tolerable           175             96           15   286
Excessive             5             24           10    39
Sum                 180            120           25   325
Task 3.1: Calculate the joint probabilities and the marginal probabilities for the
observed table. Report these in a properly formatted table rounded to three decimal
places.
Speed\Plates   Resident   Not Resident   Commercial     Sum
Tolerable         0.538          0.295        0.046   0.880
Excessive         0.015          0.074        0.031   0.120
Sum               0.554          0.369        0.077   1.000
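The joint and marginal probabilities can be computed directly from the observed counts. The sketch below is one way to do it with base R; the object names are illustrative.

```r
# Observed counts (rows: speed, columns: plates)
counts <- as.table(matrix(
  c(175, 96, 15,
      5, 24, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(Speed  = c("Tolerable", "Excessive"),
                  Plates = c("Resident", "Not Resident", "Commercial"))
))

joint         <- prop.table(counts)   # each cell divided by the grand total (325)
joint_margins <- addmargins(joint)    # appends the row and column sums
print(round(joint_margins, 3))
```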
Task 3.2: Calculate the joint probabilities of the cells assuming independence
between Plates and Speed. Report these in a properly formatted table rounded to
three decimal places.
Plates\Speed   Tolerable   Excessive     Sum
Resident           0.487       0.066   0.554
Not Resident       0.325       0.044   0.369
Commercial         0.068       0.009   0.077
Sum                0.880       0.120   1.000
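Under independence each joint probability is the product of its two marginal probabilities. A minimal sketch with base R, again starting from the observed counts (object names are illustrative):

```r
# Observed counts (rows: speed, columns: plates)
counts <- matrix(
  c(175, 96, 15,
      5, 24, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(Speed  = c("Tolerable", "Excessive"),
                  Plates = c("Resident", "Not Resident", "Commercial"))
)

p_speed  <- rowSums(counts) / sum(counts)   # marginal Pr(Speed)
p_plates <- colSums(counts) / sum(counts)   # marginal Pr(Plates)

# Outer product: Pr(Speed) * Pr(Plates) for every cell
expected <- outer(p_speed, p_plates)
print(round(t(expected), 3))   # transposed so that plates are in the rows
```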
Task 3.3: Calculate the observed conditional probabilities for the car ownership
(i.e., a local resident, not a resident, or a commercial driver) subject to the
car's speed. Report these in a properly formatted table rounded to three decimal
places.
Plates\Speed   Tolerable   Excessive
Resident           0.612       0.128
Not Resident       0.336       0.615
Commercial         0.052       0.256
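Conditioning on speed means dividing each count by the total of its speed category, so that the probabilities within each speed category sum to one. A short sketch with base R (object names are illustrative):

```r
# Observed counts (rows: speed, columns: plates)
counts <- matrix(
  c(175, 96, 15,
      5, 24, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(Speed  = c("Tolerable", "Excessive"),
                  Plates = c("Resident", "Not Resident", "Commercial"))
)

# margin = 1 divides every row by its row total: Pr(Plates | Speed)
cond <- prop.table(counts, margin = 1)
print(round(t(cond), 3))   # transposed so that plates are in the rows
```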
Task 3.4: Interpret the conditional probabilities. What story do the data tell about
the driving habits of the local residents, drivers from outside the neighborhood and
commercial drivers?
Answer:
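By comparing the conditional probabilities (table 3) with the joint probabilities
(table 1), we can say that speed and plates are not independent. Residents make up
about 61% of the drivers at tolerable speed but only about 13% of the drivers at
excessive speed. Non-residents, by contrast, account for about 34% of the tolerable
but 62% of the excessive drivers, and commercial drivers for about 5% versus 26%.
Local residents therefore tend to keep to tolerable speeds, whereas drivers from
outside the neighborhood and commercial drivers are overrepresented among those
driving at excessive speed.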
The optimal Box-Cox parameter for water81 is $\lambda = 0.307$, the value passed to
bcPower( ) below.
Clearly, the Box-Cox transformation brought the distribution of the variable water81
closer to normality.
Task 4.2: Calculate the skewness before and after the transformation.
See the R functions car::bcPower( ) and e1071::skewness( )
Answer:
Skewness before:
## Task 4.2
## Skewness before
e1071::skewness(Concord$water81)
Output:
1.747981
The water81 variable has a high positive skewness (it cannot have negative
skewness as it has a lower bound)
Skewness after:
## Skewness after
e1071::skewness(bcPower(Concord$water81,0.307))
Output:
0.03897976
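With $\lambda = 0.307$ the skewness drops from 1.748 to about 0.039, so the transformed
variable is nearly symmetric. By contrast, the log transformation did not help in
making water81 normal: the variable became negatively skewed, with a skewness of -0.88.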
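To make the mechanics concrete, the sketch below implements the Box-Cox power family and a moment-based skewness in base R and applies them to simulated right-skewed data. The data, the helper names bc_power and skew, and the chosen powers are assumptions for illustration only; the helper follows the usual Box-Cox formula, and its skewness may differ slightly in scaling from e1071::skewness( ).

```r
# Box-Cox power transform: (x^lambda - 1) / lambda, with log(x) at lambda = 0
bc_power <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Moment-based sample skewness: third central moment over variance^1.5
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Simulated log-normal sample standing in for a right-skewed variable
set.seed(42)
x <- rlnorm(500, meanlog = 0, sdlog = 1)

skew_before <- skew(x)
skew_log    <- skew(bc_power(x, 0))     # log transform of log-normal data
skew_power  <- skew(bc_power(x, 0.3))   # an intermediate power

print(round(c(before = skew_before, log = skew_log, power = skew_power), 3))
```

For real data the optimal power has to be estimated, and a power that is too strong can overshoot and produce negative skewness, as the log transformation does for water81.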
Task 4.3: Test the optimal Box-Cox transformed variable water81 for normality using
the Shapiro-Wilk test. Has the optimal Box-Cox transformation achieved normality? [0.2
points]
See the R function shapiro.test( )
Answer:
# Shapiro test for water81 after transformation
shapiro.test(bcPower(Concord$water81,0.307))
output:
Shapiro-Wilk normality test
data: bcPower(Concord$water81, 0.307)
W = 0.99101, p-value = 0.007869
# Shapiro test for water81 before transformation
shapiro.test(Concord$water81)
Shapiro-Wilk normality test
data: Concord$water81
W = 0.86868, p-value < 2.2e-16
As we can see, the W statistic increases from 0.869 to 0.991 after the transformation,
so the transformation has brought the variable much closer to normality. However, the
p-value of 0.007869 is still below 0.05, so we reject the null hypothesis of normality
even for the transformed variable: the optimal Box-Cox transformation has improved the
distribution of water81 but has not fully achieved normality.
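The decision rule used here can be sketched on simulated data: the null hypothesis of the Shapiro-Wilk test is normality, so a small p-value counts against normality, while W moves toward 1 as the sample looks more normal. The data below are simulated and purely illustrative.

```r
set.seed(7)
x <- rlnorm(300)          # simulated right-skewed sample (assumption)
alpha <- 0.05

raw <- shapiro.test(x)        # strongly non-normal input
logged <- shapiro.test(log(x))  # log of log-normal data is normal

# Reject normality when the p-value falls below alpha
reject_raw <- raw$p.value < alpha

print(c(W_raw = unname(raw$statistic), W_log = unname(logged$statistic)))
print(reject_raw)
```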
   Median      3Q     Max
   -0.997   6.471  74.003
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.5893     2.5574   1.012    0.312
educat        1.4630     0.1783   8.203 2.04e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.26 on 494 degrees of freedom
Multiple R-squared:  0.1199,    Adjusted R-squared:  0.1181
F-statistic: 67.29 on 1 and 494 DF,  p-value: 2.04e-15
Interpretation:
The equation for the linear model can be written as:
$\text{Income} = 2.5893 + 1.4630 \cdot \text{Education}$
Interpretation of model parameters:
Educat: For every additional year of education, income increases by 1,463 dollars on
average (income is measured in $1000). The educat coefficient is significant at the
5% level (p-value = 2.04e-15).
Intercept: People with no formal education earn 2,589 dollars on average. The
intercept is not significant at the 5% level (p-value = 0.312).
Task 5.2: Calculate the 95% confidence intervals around the estimated regression
parameters. Can you draw the same conclusion as you did using the t-test in task
5.1? (0.5 points)
Answers:
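95% confidence interval for the intercept:
$[\, b_0 - t_{(1-\alpha/2)} \cdot SE_{b_0},\; b_0 + t_{(1-\alpha/2)} \cdot SE_{b_0} \,]$
where $\alpha = 5\%$ and
$t_{(1-\alpha/2)} \cdot SE_{b_0} = 1.965 \cdot 2.5574 = 5.025$
95% confidence interval for educat:
$[\, b_1 - t_{(1-\alpha/2)} \cdot SE_{b_1},\; b_1 + t_{(1-\alpha/2)} \cdot SE_{b_1} \,]$
where $\alpha = 5\%$ and
$t_{(1-\alpha/2)} \cdot SE_{b_1} = 1.965 \cdot 0.1783 = 0.35032$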
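The interval limits follow from the half-widths by simple arithmetic. The sketch below recomputes them from the rounded estimates and standard errors in the regression output, so the last digits may differ slightly from what confint( ) would report on the fitted model; the object names are illustrative.

```r
# Estimates and standard errors copied from the summary output
est <- c(intercept = 2.5893, educat = 1.4630)
se  <- c(intercept = 2.5574, educat = 0.1783)
df_resid <- 494
alpha <- 0.05

# Critical t value and the resulting interval limits
t_crit <- qt(1 - alpha / 2, df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

# A parameter is not significant at level alpha when its interval contains 0
contains_zero <- ci[, "lower"] < 0 & ci[, "upper"] > 0

print(round(ci, 3))
print(contains_zero)
```

On these figures the intercept interval includes zero while the educat interval does not, which agrees with the conclusions drawn from the t-tests.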
line. You may also want to enhance your plot by adding lines for the means of
education and income as well as by adding a title. (0.5 points)
Answer:
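# Sort by education so the interval limits are drawn from left to right
ConcordPred <- ConcordPred[order(ConcordPred$educat), ]
plot(income ~ educat, data = ConcordPred, main = "Income vs. Education",
     xlab = "Years of Education", ylab = "Income (in $1000)")
lines(ConcordPred$educat, ConcordPred$fit, col = "red")    # fitted regression line
lines(ConcordPred$educat, ConcordPred$lwr, col = "green")  # lower interval limit
lines(ConcordPred$educat, ConcordPred$upr, col = "green")  # upper interval limit
# Dotted reference lines at the means of income and education
abline(h = mean(ConcordPred$income), v = mean(ConcordPred$educat), lty = 3)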
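Plot: scatterplot of income against years of education with the fitted line (red),
the interval limits (green), and dotted lines at the means (figure not reproduced).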
Task 6.1: Scatterplot price in dependence of dist with box-plots at the margins.
Are data transformations for both variables advisable? (0.5 points)
See the R function car::scatterplot( ).
We can see that the distributions are positively skewed from the above boxplots.
Task 6.2: Find the best transformation of both variables and test whether a
log-transformation is sufficient to achieve symmetry. If the log-transformation is
appropriate (i.e., you cannot reject
As we can see from the p-value, we can transform the dist variable with $\lambda = 0$
to achieve normality. The same can be inferred from the graph of the findMaxLambda
function, where the 95% confidence limits include 0.
So we log transform dist:
> bcDist <- bcPower(hprice$dist, 0)
Again, as we can see from the p-value and the graph obtained from the
findMaxLambda function, a log transformation of price will achieve normality.
So we transform price:
> bcPrice <- bcPower(hprice$price, 0)
Task 6.3: Estimate the model in the transformed system and interpret the estimates.
Also test if the estimated slope parameter differs significantly from 1, i.e.,
$H_0: \beta_1 = 1$. (1 point)
Answer:
> lm.prDist <- lm(bcPrice ~ bcDist)
> summary(lm.prDist)
Call:
lm(formula = bcPrice ~ bcDist)
Residuals:
     Min       1Q   Median       3Q      Max
-1.18184 -0.21302 -0.02242  0.16747  1.20554
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.57679    0.04033 237.479   <2e-16 ***
bcDist       0.30657    0.03091   9.919   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3747 on 504 degrees of freedom
Multiple R-squared:  0.1633,    Adjusted R-squared:  0.1617
F-statistic: 98.38 on 1 and 504 DF,  p-value: < 2.2e-16
Interpretation:
The parameter estimate of bcDist denotes the elasticity between price and dist.
Since the elasticity (0.307) is positive but less than 1, price increases as dist
increases, but the percentage change in price is smaller than the percentage change
in dist: a 1% increase in dist is associated with an increase in price of roughly
0.31%.
We cannot interpret the intercept directly, as this coefficient is in the
transformed system.
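Testing whether the slope differs from 1:
$H_0: \beta_1 = 1$
$H_1: \beta_1 \ne 1$
The test statistic:
$t = \frac{b_1 - 1}{SE(b_1)} = \frac{0.30657 - 1}{0.03091} = -22.434$
with degrees of freedom
$df = n - 2 = 506 - 2 = 504$
We obtain the p-value for the two-sided t-test from the test statistic
$t = -22.434$ and $df = 504$:
$p = 7.524 \times 10^{-78} \approx 0$
Since the p-value is far below 0.05, we reject $H_0$: the slope differs significantly
from 1.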
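The same test can be carried out numerically instead of with a table. The sketch below uses the rounded slope estimate and standard error from the summary output; the variable names are illustrative.

```r
# Slope estimate and standard error copied from the summary output
b1       <- 0.30657
se_b1    <- 0.03091
df_resid <- 504
beta_h0  <- 1        # hypothesized slope under H0

# t statistic for H0: beta_1 = 1 and its two-sided p-value
t_stat  <- (b1 - beta_h0) / se_b1
p_value <- 2 * pt(-abs(t_stat), df_resid)

print(round(t_stat, 3))
print(p_value)
```

Note that the t value of 9.919 printed by summary( ) tests the different hypothesis that the slope equals 0, which is why the statistic has to be recomputed here.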
Answer:
The expectation curve is higher than the median curve in this case because the
price variable is positively skewed. This can be observed from the boxplot of price
together with the mean: the mean (colored in red) is slightly higher than the median,
mostly because the mean is more sensitive to the large values in this positively
skewed distribution.
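The gap between the two curves can be quantified under one common assumption, namely that the errors of the log-log model are normally distributed. Back-transforming the fitted line then gives the median of price, and multiplying by exp(s^2 / 2), with s the residual standard error, gives the expectation. The sketch below uses the coefficients from the Task 6.3 output; the function names and the dist values are illustrative, and the normality assumption is not verified here.

```r
# Coefficients and residual standard error from the log-log model output
b0    <- 9.57679
b1    <- 0.30657
sigma <- 0.3747

# Median curve: plain back-transformation of the fitted line
median_price <- function(dist) exp(b0 + b1 * log(dist))

# Expectation curve under log-normal errors: median times exp(sigma^2 / 2)
mean_price <- function(dist) median_price(dist) * exp(sigma^2 / 2)

dist_grid <- c(1, 2, 5, 10)   # illustrative distances
ratio <- mean_price(dist_grid) / median_price(dist_grid)

print(round(ratio, 4))
```

Under this assumption the expectation curve lies a constant percentage above the median curve at every distance.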