
Principal Component Analysis, Regression,

Interpretation and Visualization


YIK LUN, KEI
allen29@ucla.edu

1. Download Data
download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
data<-mlb11

Description:
Data with 30 observations on 12 variables, from all 30 Major League Baseball teams from the 2011 season.
This data set is useful for examining the relationships between wins, runs scored in a season, and a number
of other player statistics. Source: https://www.openintro.org/stat/data/mlb11.php

Predictor Variables by Teams:


(1) at_bats: Number of at bats
(2) hits: Number of hits
(3) homeruns: Number of home runs
(4) bat_avg: Batting average
(5) strikeouts: Number of strikeouts
(6) stolen_bases: Number of stolen bases
(7) wins: Number of wins
(8) new_onbase: On-base percentage, a measure of how often a batter reaches base for any reason other
than a fielding error, fielder's choice, dropped/uncaught third strike, fielder's obstruction, or catcher's
interference
(9) new_slug: Slugging percentage, a popular measure of the power of a hitter, calculated as total bases
divided by at bats
(10) new_obs: On-base plus slugging, calculated as the sum of new_onbase and new_slug

Response Variable by Teams:


(1) runs: Number of runs

2. Re-express Data
team<-as.data.frame(data[,1]) # Team name
x<-as.data.frame(data[,-c(1,2)])
y<-as.data.frame(data[,2])
standardize<-function(x){ #standardize
for (i in 1:dim(x)[2]){x[,i] = (x[,i] - mean(x[,i]))/sd(x[,i])}
return(x)
}
X<-as.data.frame(standardize(x)) # x variables
Y<-standardize(y) # runs
boxplot(x,main="Before Standardization")

[Figure: "Before Standardization". Box plots of the ten predictors (at_bats, hits, homeruns, bat_avg, strikeouts, stolen_bases, wins, new_onbase, new_slug, new_obs) on their original scales; vertical axis ticks from 1000 to 5000.]

boxplot(X,main="After Standardization")

[Figure: "After Standardization". Box plots of the same ten predictors after standardization.]
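
The loop in standardize() computes a z-score for every column. As a cross-check, base R's scale() performs the same centering and scaling in one call. The sketch below compares the two on the built-in mtcars data, which is used here only as a stand-in for mlb11 (an assumption, so the example runs without the download).

```r
# Stand-in data (assumption): four numeric columns of the built-in mtcars set.
x <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Column-wise z-scores, written as in the report.
standardize <- function(x) {
  for (i in seq_len(ncol(x))) {
    x[, i] <- (x[, i] - mean(x[, i])) / sd(x[, i])
  }
  x
}

X_loop  <- standardize(x)
X_scale <- as.data.frame(scale(x))  # scale() centers and divides by sd by default

# Largest element-wise difference between the two results.
max_diff  <- max(abs(as.matrix(X_loop) - as.matrix(X_scale)))
col_means <- colMeans(X_scale)
col_sds   <- apply(X_scale, 2, sd)
print(max_diff)
```

Either form should work for mlb11; scale() returns a matrix, so wrap it in as.data.frame() if a data frame is needed downstream.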

3. Calculate Principal Components


Sx= var(X)
EP= eigen(Sx)
V= EP$vectors
PC= as.matrix(X) %*% as.matrix(V)
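
As a sanity check on the eigen-decomposition route, the same components can be obtained from base R's prcomp(). The sketch below is an illustration on the built-in mtcars data (a stand-in for mlb11, assumed only so the example is self-contained). Eigenvector signs are arbitrary, so scores are compared in absolute value.

```r
# Stand-in data (assumption): five numeric columns of mtcars.
x <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
X <- scale(x)                      # standardized predictors

# Route used in the report: eigen-decomposition of the covariance matrix.
EP <- eigen(var(X))
PC <- X %*% EP$vectors             # principal component scores

# Same analysis with prcomp(), which centers and scales internally.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Standard deviations of the components are the square roots of the eigenvalues.
sdev_diff  <- max(abs(sqrt(EP$values) - pca$sdev))
# Scores agree up to a sign flip per component.
score_diff <- max(abs(abs(PC) - abs(pca$x)))
print(c(sdev_diff, score_diff))
```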

4. Determine Number of Principal Components


Select the first five principal components, which together explain at least 95% of the variance in the standardized data (96.96%, versus 93.59% for the first four).
cumsum(EP$values)/sum(EP$values)
## [1] 0.6226284 0.7634558 0.8633428 0.9358952 0.9695781 0.9903987 0.9985261
## [8] 0.9999636 0.9999912 1.0000000
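
The choice of five components can also be read off programmatically. The sketch below copies the cumulative proportions printed above into a vector and finds the first index that reaches a 95% threshold; the names cum_var, threshold and k are illustrative and do not appear in the original code.

```r
# Cumulative proportion of variance, copied from the output above.
cum_var <- c(0.6226284, 0.7634558, 0.8633428, 0.9358952, 0.9695781,
             0.9903987, 0.9985261, 0.9999636, 0.9999912, 1.0000000)

threshold <- 0.95                      # target share of variance
k <- which(cum_var >= threshold)[1]    # first component count reaching it
print(k)                               # 5
```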

5. Principal Component Regression


newdata<-as.data.frame(cbind(team,Y,PC[,1],PC[,2],PC[,3],PC[,4],PC[,5])) # New data set
reg<-lm(newdata[,2]~newdata[,3]+newdata[,4]+newdata[,5]+newdata[,6]+newdata[,7]-1)
summary(reg)
##
## Call:
## lm(formula = newdata[, 2] ~ newdata[, 3] + newdata[, 4] + newdata[,
##     5] + newdata[, 6] + newdata[, 7] - 1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.40369 -0.17572  0.03231  0.09471  0.53949
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## newdata[, 3] -0.37410    0.01885 -19.844  < 2e-16 ***
## newdata[, 4] -0.16080    0.03964  -4.057 0.000428 ***
## newdata[, 5]  0.16385    0.04707   3.481 0.001851 **
## newdata[, 6] -0.10985    0.05523  -1.989 0.057748 .
## newdata[, 7] -0.06216    0.08105  -0.767 0.450310
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2533 on 25 degrees of freedom
## Multiple R-squared:  0.9447, Adjusted R-squared:  0.9336
## F-statistic: 85.38 on 5 and 25 DF,  p-value: 6.849e-15
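
The "- 1" in the formula drops the intercept. This is harmless here because both the standardized response and the principal component scores have mean zero, so a fitted intercept would be zero anyway. The sketch below illustrates this on the built-in mtcars data (a stand-in for mlb11, an assumption), using named columns (Y, PC1, PC2, PC3 are illustrative names) instead of positional indexing.

```r
# Stand-in data (assumption): mpg plays the role of runs.
x <- mtcars[, c("disp", "hp", "wt", "qsec", "drat")]
X <- scale(x)
Y <- as.numeric(scale(mtcars$mpg))

PC <- X %*% eigen(var(X))$vectors
pcr_data <- data.frame(Y = Y, PC1 = PC[, 1], PC2 = PC[, 2], PC3 = PC[, 3])

# Fit with and without an intercept.
fit_no_int <- lm(Y ~ PC1 + PC2 + PC3 - 1, data = pcr_data)
fit_int    <- lm(Y ~ PC1 + PC2 + PC3, data = pcr_data)

intercept  <- unname(coef(fit_int)[1])                       # numerically zero
slope_diff <- max(abs(coef(fit_int)[-1] - coef(fit_no_int))) # slopes coincide
print(c(intercept, slope_diff))
```

Named columns also make the summary table easier to read than the `newdata[, 3]` labels shown above.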

6. Reduce Number of Principal Components


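In the summary above, the fourth and fifth principal components are not significant at the 5% level (p = 0.058 and p = 0.450), so select only the first three principal components to predict runs.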
reg<-lm(newdata[,2]~newdata[,3]+newdata[,4]+newdata[,5]-1,data=newdata)
summary(reg)
##
## Call:
## lm(formula = newdata[, 2] ~ newdata[, 3] + newdata[, 4] + newdata[,
##     5] - 1, data = newdata)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.48691 -0.16821  0.02454  0.18070  0.43056
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## newdata[, 3] -0.37410    0.01972 -18.970  < 2e-16 ***
## newdata[, 4] -0.16080    0.04147  -3.878 0.000611 ***
## newdata[, 5]  0.16385    0.04924   3.328 0.002535 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.265 on 27 degrees of freedom
## Multiple R-squared:  0.9346, Adjusted R-squared:  0.9274
## F-statistic: 128.7 on 3 and 27 DF,  p-value: 4.209e-16
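
The estimates for the first three components (-0.37410, -0.16080, 0.16385) are the same in the five-component and three-component fits; only the standard errors change. This is expected, because principal component scores are mutually orthogonal, so dropping some of them does not alter the coefficients of the rest. A minimal sketch of this property on the built-in mtcars data (a stand-in for mlb11, an assumption):

```r
# Stand-in data (assumption): mpg plays the role of runs.
x <- mtcars[, c("disp", "hp", "wt", "qsec", "drat")]
X <- scale(x)
Y <- as.numeric(scale(mtcars$mpg))

PC <- X %*% eigen(var(X))$vectors
colnames(PC) <- paste0("PC", seq_len(ncol(PC)))
pcr_data <- data.frame(Y = Y, PC)

fit5 <- lm(Y ~ PC1 + PC2 + PC3 + PC4 + PC5 - 1, data = pcr_data)
fit3 <- lm(Y ~ PC1 + PC2 + PC3 - 1, data = pcr_data)

# Coefficients of PC1..PC3 are unchanged by removing PC4 and PC5.
coef_diff <- max(abs(coef(fit5)[1:3] - coef(fit3)))
print(coef_diff)
```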

7. Interpretation of First Three Principal Components


Model: RUNS = - 0.37410 * PC1 - 0.16080 * PC2 + 0.16385 * PC3
cor(X,newdata[,3:5])
##                 PC[, 1]     PC[, 2]     PC[, 3]
## at_bats      -0.7360929  0.47712574  0.01037326
## hits         -0.9269643  0.34486831  0.01957345
## homeruns     -0.7337717 -0.54966122 -0.08095831
## bat_avg      -0.9310057  0.28908791  0.02254811
## strikeouts    0.5890921 -0.40235558  0.01156439
## stolen_bases  0.1239443 -0.07616078  0.98709994
## wins         -0.5746456 -0.65690540 -0.05510413
## new_onbase   -0.9300967 -0.03673270  0.10019283
## new_slug     -0.9559299 -0.21654899  0.02920979
## new_obs      -0.9739121 -0.16891867  0.05374679

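Because the variables are standardized, each correlation above equals the corresponding eigenvector weight in V multiplied by the square root of that component's eigenvalue. Each component is therefore proportional to (not equal to) the following combination of the standardized variables, with the rounded correlations as coefficients:
PC1 ∝ -0.74 * at_bats - 0.93 * hits - 0.73 * homeruns - 0.93 * bat_avg + 0.59
* strikeouts + 0.12 * stolen_bases - 0.57 * wins - 0.93 * new_onbase - 0.96 *
new_slug - 0.97 * new_obs
PC2 ∝ 0.48 * at_bats + 0.34 * hits - 0.55 * homeruns + 0.29 * bat_avg - 0.40
* strikeouts - 0.08 * stolen_bases - 0.66 * wins - 0.04 * new_onbase - 0.22 *
new_slug - 0.17 * new_obs
PC3 ∝ 0.01 * at_bats + 0.02 * hits - 0.08 * homeruns + 0.02 * bat_avg + 0.01
* strikeouts + 0.99 * stolen_bases - 0.06 * wins + 0.10 * new_onbase + 0.03 *
new_slug + 0.05 * new_obs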
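
The coefficients in the three expressions above come from cor(X, newdata[,3:5]), so they are correlations rather than the eigenvector weights stored in V. For standardized variables the two are linked: the correlation between variable j and component k equals V[j, k] times the square root of the k-th eigenvalue. The sketch below checks this identity on the built-in mtcars data (a stand-in for mlb11, an assumption).

```r
# Stand-in data (assumption): five numeric columns of mtcars.
X  <- scale(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
EP <- eigen(var(X))
PC <- X %*% EP$vectors

# Correlations between each variable and each component.
cors <- cor(X, PC)

# Eigenvector weights, each column scaled by sqrt(eigenvalue).
scaled_loadings <- sweep(EP$vectors, 2, sqrt(EP$values), `*`)

identity_gap <- max(abs(cors - scaled_loadings))
print(identity_gap)
```

Dividing a column of correlations by the square root of its eigenvalue recovers the weights actually used to compute the scores.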
First Principal Component
The first principal component is strongly and negatively correlated with five of the original variables: hits,
bat_avg, new_onbase, new_slug and new_obs (all correlations below -0.92). These five variables therefore tend
to vary together: when one increases, the other four tend to increase as well. The first principal component
can thus be read as a measure of the hitting power of a team, with a lower PC1 meaning stronger hitting.
Because the regression coefficient for PC1 is negative (-0.37410), a decrease in PC1 corresponds to an
increase in predicted runs. Teams with powerful hitters therefore tend to have high values of runs.

Second Principal Component


The second principal component is negatively correlated with homeruns (-0.55). An increase in homeruns is
associated with a decrease in PC2, and because the coefficient for PC2 is negative (-0.16080), a decrease in
PC2 corresponds to an increase in predicted runs. As a result, teams with more home runs are more likely to
have high values of runs.

Third Principal Component


The third principal component is strongly and positively correlated with stolen_bases (0.99). An increase in
the number of stolen bases increases PC3, and because the coefficient for PC3 is positive (0.16385), this
corresponds to an increase in predicted runs. As a result, teams with more stolen bases tend to have higher
values of runs.
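
Each of the three interpretations chains two signs: the correlation of a variable with its component, and the regression coefficient of that component. The small sketch below multiplies the two, using numbers copied from the outputs above; a positive product means the variable moves in the same direction as predicted runs. The vector names are illustrative.

```r
# Regression coefficients of the three retained components (from the summary).
coef_pc <- c(PC1 = -0.37410, PC2 = -0.16080, PC3 = 0.16385)

# Correlation of the variable highlighted for each component (from cor()).
cor_var_pc <- c(hits = -0.9269643, homeruns = -0.54966122,
                stolen_bases = 0.98709994)

# Sign of the combined effect: +1 means "more of the variable, more runs".
direction <- sign(unname(coef_pc) * cor_var_pc)
print(direction)
```

All three products are positive, which matches the conclusions drawn for hits, homeruns and stolen_bases.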

8. Visualization
library(scatterplot3d)
p3<-scatterplot3d(newdata[,3], newdata[,4], newdata[,5], highlight.3d = TRUE,
col.axis = "blue",type="p",
col.grid = "lightblue", main = "MLB11 by PCs", pch = 20,
xlab="Power of Hitters",ylab="Numbers of Homeruns",zlab="Stolen Bases")
p3$plane3d(reg)
p3.coords <- p3$xyz.convert(newdata[,3], newdata[,4], newdata[,5])
text(p3.coords$x, p3.coords$y, labels = seq_len(nrow(newdata)), pos = 4, offset = 0.5) # label points by row number

[Figure: "MLB11 by PCs". 3D scatter plot of the 30 teams, each labeled with its row number; axes: Power of Hitters, Numbers of Homeruns, Stolen Bases.]

library(ggplot2)
p1<- ggplot(data = newdata, aes(x = newdata[,3], y = newdata[,2], label = newdata[,1]))
p1+geom_text(colour = "black", alpha = 0.8, size = 6) + xlab("Power of Hitters") +
ylab("Runs") + ggtitle("Teams Performance by Power of Hitters")+xlim(-6,6)

[Figure: "Teams Performance by Power of Hitters". Team names plotted as text labels; x-axis: Power of Hitters (limits -6 to 6), y-axis: Runs.]
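
The plotting calls index columns by position (newdata[,3], newdata[,2]) inside aes(). A sketch of an alternative is to build a small data frame with named columns and refer to those names, which keeps the aesthetics tied to the data argument. The example below uses the built-in mtcars data as a stand-in for mlb11 (an assumption); team, runs and PC1 are illustrative column names, and the plot is only built if ggplot2 is installed.

```r
# Stand-in data (assumption): car names play the role of team names,
# standardized mpg plays the role of standardized runs.
x  <- mtcars[, c("disp", "hp", "wt", "qsec", "drat")]
X  <- scale(x)
PC <- X %*% eigen(var(X))$vectors

plot_data <- data.frame(
  team = rownames(mtcars),
  runs = as.numeric(scale(mtcars$mpg)),
  PC1  = PC[, 1]
)

# Build the labeled scatter plot only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(plot_data,
                       ggplot2::aes(x = PC1, y = runs, label = team)) +
    ggplot2::geom_text(colour = "black", alpha = 0.8, size = 3) +
    ggplot2::xlab("PC1") +
    ggplot2::ylab("Standardized response")
  # print(p) draws the figure.
}
```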

p2<- ggplot(data = newdata, aes(x = newdata[,4], y = newdata[,2], label = newdata[,1]))


p2+geom_text(colour = "black", alpha = 0.8, size = 6) + xlab("Homeruns") +
ylab("Runs")+ggtitle("Teams Performance by Homeruns")+xlim(-5,5)

[Figure: "Teams Performance by Homeruns". Team names plotted as text labels; x-axis: Homeruns (limits -5 to 5), y-axis: Runs.]

p3<- ggplot(data = newdata, aes(x = newdata[,5], y = newdata[,2], label = newdata[,1]))


p3+geom_text(colour = "black", alpha = 0.8, size = 6) + xlab("Stolen Bases") +
ylab("Runs")+ ggtitle("Teams Performance by Stolen Bases")+xlim(-3,3)

[Figure: "Teams Performance by Stolen Bases". Team names plotted as text labels; x-axis: Stolen Bases (limits -3 to 3), y-axis: Runs.]

9. Conclusion
Principal component analysis is a descriptive tool. From the three graphs above we can see how the teams
perform on three major features: power of hitters, number of home runs and number of stolen bases. Note
that some original variables are negatively correlated with the principal components, and some principal
components are negatively related to the response variable, so we must follow the direction of movement
twice to find the connection between a predictor and the response. In all three graphs, a higher position on
the vertical axis means the team has a higher value of runs than the teams below it. On the horizontal axis
of the first graph, a team on the left-hand side has more powerful hitters than the teams on the right-hand
side. In the second graph, a team on the left-hand side has more home runs than the teams on the right-hand
side. Finally, in the third graph, a team on the right-hand side has more stolen bases than the teams on the
left-hand side.
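
Following the movements twice can be avoided by mapping the component coefficients back to the standardized predictors: multiplying the first k columns of the eigenvector matrix by the k fitted coefficients gives one coefficient per original variable. The sketch below shows this on the built-in mtcars data (a stand-in for mlb11, an assumption); beta and gamma are illustrative names.

```r
# Stand-in data (assumption): mpg plays the role of runs.
x <- mtcars[, c("disp", "hp", "wt", "qsec", "drat")]
X <- scale(x)
Y <- as.numeric(scale(mtcars$mpg))

V  <- eigen(var(X))$vectors
PC <- X %*% V
colnames(PC) <- paste0("PC", seq_len(ncol(PC)))

k <- 3                                           # number of retained components
pcr_data <- data.frame(Y = Y, PC[, 1:k])
fit   <- lm(Y ~ . - 1, data = pcr_data)
gamma <- coef(fit)                               # coefficients on PC1..PC3

# Implied coefficients on the standardized original variables.
beta <- V[, 1:k] %*% gamma
rownames(beta) <- colnames(X)

# Predictions from beta match the fitted values of the component regression.
fit_diff <- max(abs(as.numeric(X %*% beta) - fitted(fit)))
print(round(beta, 3))
```

The sign of each entry of beta then gives the direction of a predictor's association with the response directly.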
