Principal Component Analysis Regression Visualization and Interpretation


YIK LUN, KEI
allen29@ucla.edu
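1. Download Data
# Download the MLB 2011 team data from OpenIntro and load it into the session
download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData") # creates the data frame mlb11
data<-mlb11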
Description:
Data with 30 observations on 12 variables, from all 30 Major League Baseball teams from the 2011 season.
This data set is useful for examining the relationships between wins, runs scored in a season, and a number
of other player statistics. Source: https://www.openintro.org/stat/data/mlb11.php
Predictor Variables by Team:

(1) at_bats: Number of at bats
(2) hits: Number of hits
(3) homeruns: Number of home runs
(4) bat_avg: Batting average
(5) strikeouts: Number of strikeouts
(6) stolen_bases: Number of stolen bases
(7) wins: Number of wins
(8) new_onbase: On-base percentage, a measure of how often a batter reaches base for any reason other
than a fielding error, fielder's choice, dropped/uncaught third strike, fielder's obstruction, or catcher's
interference
(9) new_slug: Slugging percentage, a popular measure of a hitter's power, calculated as total bases
divided by at bats
(10) new_obs: On-base plus slugging, calculated as the sum of new_onbase and new_slug
Response Variable by Team:

(1) runs: Number of runs
2. Re-express Data
team<-as.data.frame(data[,1]) # Team name
x<-as.data.frame(data[,-c(1,2)]) # the ten predictors (drop team and runs)
y<-as.data.frame(data[,2]) # response: runs
standardize<-function(x){ # center each column to mean 0 and scale it to sd 1
for (i in seq_len(ncol(x))){x[,i] = (x[,i] - mean(x[,i]))/sd(x[,i])}
return(x)
}
X<-as.data.frame(standardize(x)) # standardized x variables
Y<-standardize(y) # standardized runs
boxplot(x,main="Before Standardization")
[Figure: boxplot "Before Standardization" of at_bats, hits, homeruns, bat_avg, strikeouts, stolen_bases, wins, new_onbase, new_slug and new_obs; vertical axis ticks from 1000 to 5000]
boxplot(X,main="After Standardization")
[Figure: boxplot "After Standardization" of the same ten variables]
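The hand-written standardize() function does the same job as base R's scale() with its default arguments. The following is a minimal sketch on a small made-up data frame; the object toy and its values are assumptions for illustration and are not taken from mlb11.

```r
# Toy data frame (hypothetical values, not the MLB data)
toy <- data.frame(a = c(1, 2, 3, 4, 5), b = c(10, 20, 15, 40, 25))

# Column-wise standardization as in the text: subtract the mean, divide by the sd
standardize <- function(x) {
  for (i in seq_len(ncol(x))) {
    x[, i] <- (x[, i] - mean(x[, i])) / sd(x[, i])
  }
  x
}

by_hand  <- standardize(toy)
by_scale <- as.data.frame(scale(toy))  # scale() centers and scales each column by default

# Compare the two results and check the resulting column means and sds
agree <- isTRUE(all.equal(as.matrix(by_hand), as.matrix(by_scale),
                          check.attributes = FALSE))
col_means <- colMeans(by_hand)
col_sds   <- sapply(by_hand, sd)
```

Here agree should be TRUE, with column means at zero and standard deviations at one up to floating-point error, so either approach could be used before the eigen decomposition.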
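3. Calculate Principal Components

Sx= var(X) # covariance matrix of the standardized predictors (equal to their correlation matrix)
EP= eigen(Sx) # eigenvalues (component variances) and eigenvectors (loadings)
V= EP$vectors
PC= as.matrix(X) %*% as.matrix(V) # principal component scores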
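As a cross-check on the eigen decomposition, the same components can be obtained from prcomp(). The sketch below uses synthetic data as a stand-in for the standardized predictors (the matrix raw and its dimensions are assumptions); the loadings from the two routes should agree up to the sign of each column, since the sign of an eigenvector is arbitrary.

```r
set.seed(1)
# Synthetic stand-in for the predictors (assumption: 30 rows, 4 columns)
raw <- matrix(rnorm(30 * 4), nrow = 30)
raw[, 2] <- raw[, 1] + 0.3 * raw[, 2]   # induce some correlation between columns
X <- scale(raw)

# Eigen decomposition of the covariance matrix, as in the text
EP <- eigen(var(X))
V  <- EP$vectors
PC <- X %*% V

# prcomp() on the same, already standardized, data
fit <- prcomp(X, center = FALSE, scale. = FALSE)

# The variance of each score column equals the corresponding eigenvalue
score_var <- apply(PC, 2, var)

# Loadings agree up to the sign of each column
same_up_to_sign <- all(abs(abs(V) - abs(fit$rotation)) < 1e-8)
```

Because of the sign ambiguity, the direction of a component (and hence the sign of its regression coefficient) carries no meaning on its own; only the combination of loading sign and coefficient sign does.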
4. Determine Number of Principal Components

Select the first five principal components: four explain only 93.6% of the total variance, while five explain 97.0%, which clears the 95% target.
cumsum(EP$values)/sum(EP$values)
## [1] 0.6226284 0.7634558 0.8633428 0.9358952 0.9695781 0.9903987 0.9985261
## [8] 0.9999636 0.9999912 1.0000000
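The choice of five components can also be read off programmatically. This short sketch copies the cumulative proportions printed above into a vector; the names cum_var, target and k are introduced here for illustration only.

```r
# Cumulative proportions of variance, copied from the output above
cum_var <- c(0.6226284, 0.7634558, 0.8633428, 0.9358952, 0.9695781,
             0.9903987, 0.9985261, 0.9999636, 0.9999912, 1.0000000)

target <- 0.95                     # threshold used in the text
k <- which(cum_var >= target)[1]   # smallest number of components reaching the target

# Proportion of variance explained by each individual component
prop_var <- diff(c(0, cum_var))

k            # 5
cum_var[k]   # 0.9695781
```

The first component alone accounts for about 62% of the variance, and each later component adds progressively less.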
5. Principal Component Regression

newdata<-as.data.frame(cbind(team,Y,PC[,1],PC[,2],PC[,3],PC[,4],PC[,5])) # New data set
reg<-lm(newdata[,2]~newdata[,3]+newdata[,4]+newdata[,5]+newdata[,6]+newdata[,7]-1)
summary(reg)
## Call:
## lm(formula = newdata[, 2] ~ newdata[, 3] + newdata[, 4] + newdata[,
##     5] + newdata[, 6] + newdata[, 7] - 1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.40369 -0.17572  0.03231  0.09471  0.53949
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## newdata[, 3] -0.37410    0.01885 -19.844  < 2e-16 ***
## newdata[, 4] -0.16080    0.03964  -4.057 0.000428 ***
## newdata[, 5]  0.16385    0.04707   3.481 0.001851 **
## newdata[, 6] -0.10985    0.05523  -1.989 0.057748 .
## newdata[, 7] -0.06216    0.08105  -0.767 0.450310
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2533 on 25 degrees of freedom
## Multiple R-squared:  0.9447, Adjusted R-squared:  0.9336
## F-statistic: 85.38 on 5 and 25 DF,  p-value: 6.849e-15
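6. Reduce Number of Principal Components

In the summary above, the fourth and fifth components are not significant at the 5% level (p = 0.058 and p = 0.450), so keep only the first three principal components to predict runs.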
reg<-lm(newdata[,2]~newdata[,3]+newdata[,4]+newdata[,5]-1,data=newdata)
summary(reg)
## Call:
## lm(formula = newdata[, 2] ~ newdata[, 3] + newdata[, 4] + newdata[,
##     5] - 1, data = newdata)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.48691 -0.16821  0.02454  0.18070  0.43056
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## newdata[, 3] -0.37410    0.01972 -18.970  < 2e-16 ***
## newdata[, 4] -0.16080    0.04147  -3.878 0.000611 ***
## newdata[, 5]  0.16385    0.04924   3.328 0.002535 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.265 on 27 degrees of freedom
## Multiple R-squared:  0.9346, Adjusted R-squared:  0.9274
## F-statistic: 128.7 on 3 and 27 DF,  p-value: 4.209e-16
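The estimates for the first three components are identical in the five-component and three-component fits (only the standard errors change). This is expected: the component scores are mutually uncorrelated, so dropping some of them does not alter the coefficients of the rest. The sketch below illustrates this on synthetic data; the data, the column names PC1 to PC6 and the response name runs are assumptions made for the example, and named columns with a formula are used in place of positional indexing.

```r
set.seed(2)
n <- 30
# Synthetic predictors and response (hypothetical, not the MLB data)
raw <- matrix(rnorm(n * 6), nrow = n)
raw[, 2] <- raw[, 1] + 0.5 * raw[, 2]
X <- scale(raw)
y <- as.numeric(scale(raw %*% c(1, 0.5, -0.5, 0.2, 0, 0) + rnorm(n)))

# Principal component scores, stored with readable column names
PC <- X %*% eigen(var(X))$vectors
scores <- as.data.frame(PC)
names(scores) <- paste0("PC", seq_len(ncol(scores)))
scores$runs <- y

# No-intercept regressions on five and on three components
fit5 <- lm(runs ~ PC1 + PC2 + PC3 + PC4 + PC5 - 1, data = scores)
fit3 <- lm(runs ~ PC1 + PC2 + PC3 - 1, data = scores)

# Largest difference between the shared coefficients of the two fits
coef_diff <- max(abs(coef(fit5)[1:3] - coef(fit3)))
```

The value of coef_diff should be zero up to floating-point error, which is the same pattern seen in the two summaries above.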
7. Interpretation of First Three Principal Components

Model (runs standardized, no intercept): RUNS = - 0.37410 * PC1 - 0.16080 * PC2 + 0.16385 * PC3
cor(X,newdata[,3:5])
##                 PC[, 1]     PC[, 2]     PC[, 3]
## at_bats      -0.7360929  0.47712574  0.01037326
## hits         -0.9269643  0.34486831  0.01957345
## homeruns     -0.7337717 -0.54966122 -0.08095831
## bat_avg      -0.9310057  0.28908791  0.02254811
## strikeouts    0.5890921 -0.40235558  0.01156439
## stolen_bases  0.1239443 -0.07616078  0.98709994
## wins         -0.5746456 -0.65690540 -0.05510413
## new_onbase   -0.9300967 -0.03673270  0.10019283
## new_slug     -0.9559299 -0.21654899  0.02920979
## new_obs      -0.9739121 -0.16891867  0.05374679
Each component can be summarized by its correlations with the standardized variables, rounded to two decimals. These weights are correlations, which are proportional to (not equal to) the eigenvector loadings in V:
PC1 ∝ -0.74 * at_bats - 0.93 * hits - 0.73 * homeruns - 0.93 * bat_avg + 0.59
* strikeouts + 0.12 * stolen_bases - 0.57 * wins - 0.93 * new_onbase - 0.96 *
new_slug - 0.97 * new_obs
PC2 ∝ 0.48 * at_bats + 0.34 * hits - 0.55 * homeruns + 0.29 * bat_avg - 0.40
* strikeouts - 0.08 * stolen_bases - 0.66 * wins - 0.04 * new_onbase - 0.22 *
new_slug - 0.17 * new_obs
PC3 ∝ 0.01 * at_bats + 0.02 * hits - 0.08 * homeruns + 0.02 * bat_avg + 0.01
* strikeouts + 0.99 * stolen_bases - 0.06 * wins + 0.10 * new_onbase + 0.03 *
new_slug + 0.05 * new_obs
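For standardized variables, the correlation between a variable and a component equals the corresponding eigenvector entry multiplied by the square root of that component's eigenvalue. The sketch below checks this relation on synthetic data (the matrix raw is an assumption for illustration), which is why the correlation table can stand in for the loadings when reading off signs and relative sizes.

```r
set.seed(3)
# Synthetic stand-in for the predictors (assumption: 30 rows, 5 columns)
raw <- matrix(rnorm(30 * 5), nrow = 30)
raw[, 3] <- raw[, 1] - raw[, 2] + 0.4 * raw[, 3]
X <- scale(raw)

EP <- eigen(var(X))
V  <- EP$vectors
PC <- X %*% V

# Correlations between the standardized variables and the component scores
corr <- cor(X, PC)

# Column j of V multiplied by sqrt(lambda_j)
scaled_loadings <- sweep(V, 2, sqrt(EP$values), `*`)

# Largest discrepancy between the two matrices
max_gap <- max(abs(corr - scaled_loadings))
```

Dividing each column of the correlation table by the square root of its eigenvalue would therefore recover the loadings in V.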
First Principal Component
The first principal component is strongly negatively correlated with five of the original variables: hits,
bat_avg, new_onbase, new_slug and new_obs (each correlation is below -0.9). These five variables therefore
tend to vary together: when one increases, the other four tend to increase as well. The first principal
component can thus be read as a measure of a team's hitting power. As hitting power increases, PC1
decreases, and because the coefficient for PC1 is negative, predicted runs increase. Teams with strong
hitters therefore tend to score more runs.
Second Principal Component

The second principal component is negatively correlated with homeruns (-0.55); wins, at -0.66, is the only
variable with a stronger correlation. An increase in homeruns lowers PC2, and because the coefficient for
PC2 is negative, a lower PC2 raises predicted runs. Teams with more homeruns are therefore more likely to
score more runs.
Third Principal Component

The third principal component is strongly positively correlated with stolen_bases (0.99). An increase in
the number of stolen bases raises PC3, and because the coefficient for PC3 is positive, this raises
predicted runs. Teams with more stolen bases therefore tend to score more runs.
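The two-step reading used in these interpretations (variable to component, then component to runs) can be collapsed into a single set of coefficients on the standardized variables by multiplying the retained loadings by the component coefficients. The following is a sketch under stated assumptions: the data are synthetic, and the names gamma, beta and k are introduced here and do not appear in the original analysis.

```r
set.seed(4)
n <- 30
# Synthetic predictors and response (hypothetical, not the MLB data)
raw <- matrix(rnorm(n * 5), nrow = n)
raw[, 2] <- raw[, 1] + 0.5 * raw[, 2]
X <- scale(raw)
colnames(X) <- paste0("x", 1:5)
y <- as.numeric(scale(raw %*% c(1, 1, -0.5, 0, 0.3) + rnorm(n)))

V  <- eigen(var(X))$vectors
k  <- 3                       # number of components kept
PC <- X %*% V[, 1:k]          # scores on the first k components

gamma <- coef(lm(y ~ PC - 1))       # coefficients on the components
beta  <- drop(V[, 1:k] %*% gamma)   # implied coefficients on the standardized variables
names(beta) <- colnames(X)

# Both parameterizations give the same fitted values
fit_gap <- max(abs(drop(PC %*% gamma) - drop(X %*% beta)))
```

The sign of each entry of beta gives the net direction in which that variable moves predicted runs, combining the contributions of all retained components at once.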
8. Visualization
library(scatterplot3d)
p3<-scatterplot3d(newdata[,3], newdata[,4], newdata[,5], highlight.3d = TRUE,
col.axis = "blue",type="p",
col.grid = "lightblue", main = "MLB11 by PCs", pch = 20,
xlab="Power of Hitters",ylab="Numbers of Homeruns",zlab="Stolen Bases")
p3$plane3d(reg)
p3.coords <- p3$xyz.convert(newdata[,3], newdata[,4], newdata[,5])
text(p3.coords$x, p3.coords$y, labels = seq_len(nrow(newdata)), pos = 4, offset = 0.5) # label each point with its row number in newdata
[Figure: 3D scatterplot "MLB11 by PCs"; axes: Power of Hitters, Numbers of Homeruns, Stolen Bases; each point is labeled with its team's row number (1 to 30)]
library(ggplot2)
p1<- ggplot(data = newdata, aes(x = newdata[,3], y = newdata[,2], label = newdata[,1]))
p1+geom_text(colour = "black", alpha = 0.8, size = 6) + xlab("Power of Hitters") +
ylab("Runs") + ggtitle("Teams Performance by Power of Hitters")+xlim(-6,6)
[Figure: "Teams Performance by Power of Hitters"; team names plotted with Power of Hitters on the horizontal axis and Runs on the vertical axis]

p2<- ggplot(data = newdata, aes(x = newdata[,4], y = newdata[,2], label = newdata[,1]))
p2+geom_text(colour = "black", alpha = 0.8, size = 6) + xlab("Homeruns") +
ylab("Runs")+ggtitle("Teams Performance by Homeruns")+xlim(-5,5)
[Figure: "Teams Performance by Homeruns"; team names plotted with Homeruns on the horizontal axis and Runs on the vertical axis]
p3<- ggplot(data = newdata, aes(x = newdata[,5], y = newdata[,2], label = newdata[,1]))
p3+geom_text(colour = "black", alpha = 0.8, size = 6) + xlab("Stolen Bases") +
ylab("Runs")+ ggtitle("Teams Performance by Stolen Bases")+xlim(-3,3)
[Figure: "Teams Performance by Stolen Bases"; team names plotted with Stolen Bases on the horizontal axis and Runs on the vertical axis]
9. Conclusion
Principal component analysis is a descriptive tool. From the three graphs above we can see how the teams
perform on three major features: power of hitters, number of homeruns and number of stolen bases. Note
that some original variables are negatively correlated with the principal components, and some principal
components are in turn negatively related to the response variable, so the direction of the effect has to
be traced through both steps to connect a predictor with the response. In all three graphs, a higher
position on the vertical axis means the team scored more runs than the teams below it. On the horizontal
axis of the first graph, a team on the left has stronger hitters than a team on the right. In the second
graph, a team on the left has more homeruns than a team on the right. Finally, in the third graph, a team
on the right has more stolen bases than a team on the left.

