1.Download Data
download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
data<-mlb11
Description:
Data with 30 observations on 12 variables, from all 30 Major League Baseball teams from the 2011 season.
This data set is useful for examining the relationships between wins, runs scored in a season, and a number
of other player statistics. Source: https://www.openintro.org/stat/data/mlb11.php
2.Re-express Data
team<-as.data.frame(data[,1]) # Team name
x<-as.data.frame(data[,-c(1,2)])
y<-as.data.frame(data[,2])
standardize<-function(x){ #standardize
for (i in 1:dim(x)[2]){x[,i] = (x[,i] - mean(x[,i]))/sd(x[,i])}
return(x)
}
X<-as.data.frame(standardize(x)) # x variables
Y<-standardize(y) # runs
boxplot(x,main="Before Standardization")
[Figure: boxplot "Before Standardization" of at_bats, hits, homeruns, bat_avg, strikeouts, stolen_bases, wins, new_onbase, new_slug and new_obs; y-axis ticks from 1000 to 5000]
boxplot(X,main="After Standardization")
[Figure: boxplot "After Standardization" of the same ten variables]
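As a quick sanity check on the standardization step, the sketch below compares a column-wise z-score function of the same form as `standardize` with base R's `scale()`. The data frame `toy` and its values are made up for illustration and are not taken from mlb11.

```r
# Column-wise z-score, same form as the report's standardize()
standardize <- function(x) {
  for (i in seq_len(ncol(x))) {
    x[, i] <- (x[, i] - mean(x[, i])) / sd(x[, i])
  }
  x
}

# Hypothetical toy data (not from mlb11)
toy <- data.frame(at_bats = c(5400, 5500, 5600, 5450),
                  hits    = c(1380, 1420, 1500, 1400))

Z <- standardize(toy)

# After standardization every column has mean 0 and standard deviation 1
col_means <- sapply(Z, mean)
col_sds   <- sapply(Z, sd)

# scale() centers each column and divides by its sample sd by default,
# so the largest absolute difference should be numerically zero
max_diff <- max(abs(as.matrix(Z) - scale(toy)))
```

If the two agree, `scale(x)` could be used in place of the hand-written loop; the loop version has the minor advantage of returning a data frame directly.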
Call:
lm(formula = newdata[, 2] ~ newdata[, 3] + newdata[, 4] + newdata[,
    5] + newdata[, 6] + newdata[, 7] - 1)

Residuals:
     Min       1Q   Median       3Q      Max
-0.40369 -0.17572  0.03231  0.09471  0.53949

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
newdata[, 3] -0.37410    0.01885 -19.844  < 2e-16 ***
newdata[, 4] -0.16080    0.03964  -4.057 0.000428 ***
newdata[, 5]  0.16385    0.04707   3.481 0.001851 **
newdata[, 6] -0.10985    0.05523  -1.989 0.057748 .
newdata[, 7] -0.06216    0.08105  -0.767 0.450310
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2533 on 25 degrees of freedom
Multiple R-squared: 0.9447, Adjusted R-squared: 0.9336
F-statistic: 85.38 on 5 and 25 DF, p-value: 6.849e-15
Call:
lm(formula = newdata[, 2] ~ newdata[, 3] + newdata[, 4] + newdata[,
    5] - 1, data = newdata)

Residuals:
     Min       1Q   Median       3Q      Max
-0.48691 -0.16821  0.02454  0.18070  0.43056

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
newdata[, 3] -0.37410    0.01972 -18.970  < 2e-16 ***
newdata[, 4] -0.16080    0.04147  -3.878 0.000611 ***
newdata[, 5]  0.16385    0.04924   3.328 0.002535 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.265 on 27 degrees of freedom
Multiple R-squared: 0.9346, Adjusted R-squared: 0.9274
F-statistic: 128.7 on 3 and 27 DF, p-value: 4.209e-16
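Two features of the summaries above are worth a short illustration: both models are fitted without an intercept (`- 1`), and the estimates for the first three components are identical in the two fits. The sketch below shows why this can happen, assuming the predictors are principal component scores of standardized data. It uses R's built-in `mtcars` data rather than mlb11, and the names `pc_scores`, `fit5`, `fit3` and `fit3_int` are illustrative.

```r
# Illustration on built-in data (mtcars), not the report's mlb11 data
X <- scale(mtcars[, -1])             # standardized predictors
y <- as.numeric(scale(mtcars$mpg))   # standardized response

# Principal component scores; the columns are centered and mutually orthogonal
pc_scores <- prcomp(X)$x

# Regress on the first five components, then on the first three, no intercept
fit5 <- lm(y ~ pc_scores[, 1:5] - 1)
fit3 <- lm(y ~ pc_scores[, 1:3] - 1)

# If an intercept is included, it is numerically zero because both
# the response and the scores are centered
fit3_int  <- lm(y ~ pc_scores[, 1:3])
intercept <- unname(coef(fit3_int)[1])

# Dropping components 4 and 5 leaves the first three estimates unchanged,
# because each coefficient depends only on its own orthogonal score column
coef_diff <- max(abs(unname(coef(fit5))[1:3] - unname(coef(fit3))))
```

Only the standard errors change between the two fits, since the residual variance is re-estimated once components are dropped.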
                 PC[, 1]      PC[, 2]      PC[, 3]
at_bats       -0.7360929   0.47712574   0.01037326
hits          -0.9269643   0.34486831   0.01957345
homeruns      -0.7337717  -0.54966122  -0.08095831
bat_avg       -0.9310057   0.28908791   0.02254811
strikeouts     0.5890921  -0.40235558   0.01156439
stolen_bases   0.1239443  -0.07616078   0.98709994
wins          -0.5746456  -0.65690540  -0.05510413
new_onbase    -0.9300967  -0.03673270   0.10019283
new_slug      -0.9559299  -0.21654899   0.02920979
new_obs       -0.9739121  -0.16891867   0.05374679
PC1 = -0.74 * at_bats - 0.93 * hits - 0.73 * homeruns - 0.93 * bat_avg + 0.59
* strikeouts + 0.12 * stolen_bases - 0.57 * wins - 0.93 * new_onbase - 0.96 *
new_slug - 0.97 * new_obs
PC2 = 0.48 * at_bats + 0.34 * hits - 0.55 * homeruns + 0.29 * bat_avg - 0.40
* strikeouts - 0.08 * stolen_bases - 0.66 * wins - 0.04 * new_onbase - 0.22 *
new_slug - 0.17 * new_obs
PC3 = 0.01 * at_bats + 0.02 * hits - 0.08 * homeruns + 0.02 * bat_avg + 0.01
* strikeouts + 0.99 * stolen_bases - 0.06 * wins + 0.10 * new_onbase + 0.03 *
new_slug + 0.05 * new_obs
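The coefficients in the table and equations above are read in this report as correlations between each variable and a component. A sketch of how such a table could be computed, assuming standardized inputs and `prcomp`; this is not necessarily how the table above was produced. It uses the built-in `mtcars` data, and the names `pca`, `var_pc_cor` and `scaled_rotation` are illustrative.

```r
# Illustration on built-in data (mtcars), not the report's mlb11 data
X   <- scale(mtcars)   # standardize every column
pca <- prcomp(X)       # PCA on the standardized data

# Correlation of each original variable with the first three component scores
var_pc_cor <- cor(X, pca$x[, 1:3])
print(round(var_pc_cor, 2))

# For standardized variables, this correlation equals the eigenvector entry
# multiplied by the standard deviation of the component
scaled_rotation <- sweep(pca$rotation[, 1:3], 2, pca$sdev[1:3], "*")
max_gap <- max(abs(var_pc_cor - scaled_rotation))

# The squared correlations in each column sum to that component's variance
col_ss <- colSums(var_pc_cor^2)
```

The sign of a whole column is arbitrary in PCA, so a component that is negatively correlated with most variables, as PC1 is here, carries the same information as its mirror image.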
First Principal Component
The first principal component is strongly and negatively correlated with five of the original variables: hits,
bat_avg, new_onbase, new_slug and new_obs. These five variables therefore tend to vary together: when one
increases, the other four tend to increase as well. Because it is driven mainly by these five variables, the
first principal component can be read as a measure of the power of a team's hitters. As the power of hitters
increases, PC1 decreases, and since the regression coefficient for PC1 is negative, predicted runs increase.
Teams with high power of hitters therefore tend to score more runs.
8.Visualization
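library(scatterplot3d)
# 3D scatterplot of the three components used in the regression
p3<-scatterplot3d(newdata[,3], newdata[,4], newdata[,5], highlight.3d = TRUE,
col.axis = "blue",type="p",
col.grid = "lightblue", main = "MLB11 by PCs", pch = 20,
xlab="Power of Hitters",ylab="Numbers of Homeruns",zlab="Stolen Bases")
p3$plane3d(reg) # add the plane defined by reg
# project the 3D points to 2D and label each point with its row number
p3.coords <- p3$xyz.convert(newdata[,3], newdata[,4], newdata[,5])
text(p3.coords$x, p3.coords$y, labels = seq_len(nrow(newdata)), pos = 4, offset = 0.5)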
[Figure: 3D scatterplot "MLB11 by PCs"; axes: Power of Hitters, Numbers of Homeruns, Stolen Bases; points labeled by row number, 1 to 30]
library(ggplot2)
p1<- ggplot(data = newdata, aes(x = newdata[,3], y = newdata[,2], label = newdata[,1]))
p1+geom_text(colour = "black", alpha = 0.8, size = 6) + xlab("Power of Hitters") +
ylab("Runs") + ggtitle("Teams Performance by Power of Hitters")+xlim(-6,6)
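[Figure: "Teams Performance by Power of Hitters"; team names plotted with Runs on the vertical axis against Power of Hitters on the horizontal axis]
[Figure: team names plotted with Runs on the vertical axis against Homeruns on the horizontal axis]
[Figure: team names plotted with Runs on the vertical axis against Stolen Bases on the horizontal axis]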
9.Conclusion
Principal component analysis is a descriptive technique. From the three graphs above we can see how teams
perform on three major features: power of hitters, number of homeruns and number of stolen bases. Note that
some original variables are negatively correlated with the principal components, and some principal
components are in turn negatively correlated with the response variable, so the direction of movement has to
be traced twice to connect a predictor to the response. In every graph, a higher position on the vertical axis
means the team scored more runs than the teams below it. On the horizontal axis of the first graph, a team
further to the left has stronger power of hitters than the teams to its right. In the second graph, a team
further to the left has more homeruns than the teams to its right. Finally, in the third graph, a team further
to the right has more stolen bases than the teams to its left.
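The two-step sign reasoning can be checked numerically: multiply each component's regression coefficient by the correlation between that component and its dominant variable, and look at the sign of the product. The numbers below are copied from the three-component regression summary and the correlation table in this report; using hits to represent the first component, and treating the product as a direction indicator rather than an effect size, are assumptions of this sketch.

```r
# Coefficients of the three components, from the three-component regression summary
pc_coef <- c(PC1 = -0.37410, PC2 = -0.16080, PC3 = 0.16385)

# Correlation of each dominant variable with its component, from the PC table
# (hits stands in for "power of hitters" on PC1)
var_cor <- c(hits = -0.9269643, homeruns = -0.54966122, stolen_bases = 0.98709994)

# A positive product means the variable and runs move in the same direction
direction <- sign(pc_coef * var_cor)
names(direction) <- names(var_cor)
print(direction)
```

All three products are positive, which agrees with the reading that stronger hitting, more homeruns and more stolen bases each go with more runs, even though two of the three components point in the negative direction.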