
Principal Component Analysis

M344

The Problem
Value of dataset
Data aggregation and collection mechanism
Variation: measured by Std. Dev, Variance

How do you judge it when there are multiple explanatory variables?

Example: Prediction problem


Predict sales from price, advertising
Market No.   Sales   Price   Ads
vs.
Market No.   Sales   Price   Ads   Income

What to look for?


Information in income attribute - variation
Is this information new?

Examine the Covariance matrix


In which situation is income more informative?
[Figure: two covariance matrices, A) and B), each over Price, Ads and Income. The only entry that survives in each is 16, in the Income row.]

A) because the information is additional


Covariance with other variables is lower

Same can be presented with a correlation matrix

Correlation is scaled covariance

Correlation matrices will look like

          Price   Ads    Income
Price     1       0.2    0.5
Ads       0.2     1      0.25
Income    0.5     0.25   1

          Price   Ads    Income
Price     1       0.2    0.1
Ads       0.2     1      0.1
Income    0.1     0.1    1
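
A minimal sketch of "correlation is scaled covariance": each covariance is divided by the two standard deviations involved. The covariance matrix here is hypothetical. Only the Income variance of 16 echoes the slides; the other entries are assumed, chosen so that the result shows the 0.2, 0.5 and 0.25 correlations.

```python
import math

# Hypothetical covariance matrix for Price, Ads, Income (illustration only).
# The Income variance of 16 echoes the slides; every other entry is assumed.
names = ["Price", "Ads", "Income"]
cov = [
    [4.0, 2.0, 4.0],
    [2.0, 25.0, 5.0],
    [4.0, 5.0, 16.0],
]

def cov_to_corr(cov):
    """Scale each covariance by the two standard deviations involved."""
    sd = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [
        [cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
        for i in range(len(cov))
    ]

corr = cov_to_corr(cov)
for name, row in zip(names, corr):
    print(f"{name:<8}" + "".join(f"{value:6.2f}" for value in row))
```

Scaling puts every variable on the same footing, so Price, Ads and Income can be compared even though they are measured in different units.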

Conjoint Analysis: We maximized variation in choices while deciding attribute profiles to show the respondents.
Attribute levels were shuffled to reduce correlation among them

In other situations
Situations where there is much information per unit of observation
Surveys:
Characteristics of individuals
Brand perceptions based on numerous attributes

Census data: Many attributes for each market


Online browsing data: You can compile 100s of pieces of information
Websites browsed, products purchased etc.

Multiple attributes (columns) may contribute to the same underlying construct
Collectively they might have a lot of information

Need to be efficient to conduct useful inference

Reduce the dimensionality: identify the main components
What actually changes across observations

Factor analysis

The underlying problem


[Diagram: X1, X2, X3, X4, X5, X6, X7, ..., X100 reduced to F1, F2, F3]

Where F1, F2 and F3
Are a combination of the Xs
Variation in Fs is high
Capture most of the information in the Xs
Correlation among Fs is low
Interpretable

Agenda

An example of factor analysis


How does factor analysis work?
Running factor analysis in MINITAB
Application and interpretation

1/17/17

M344 Marketing Research

Example
Aaker (1997), "Dimensions of Brand Personality," Journal of Marketing Research
What are the bases of differences across brands?
It is all about perception
Award-winning dissertation paper

Based on data provided by 631 subjects on 37 different brands measured across 114 traits
Identified five dimensions of brand personality
Analogous to the Big Five dimensions of human personality
Sincerity, Excitement, Competence, Sophistication, Ruggedness

How was this done?


The correlation matrix was factor-analyzed using principal components analysis. A five-factor solution resulted on the basis of the following criteria:
All five factors had eigenvalues greater than one.
A significant dip in the scree plot followed the fifth factor.
The first five factors were the most interpretable.
The five-factor solution explained a high level of variance in brand personality (92%).

How does factor analysis work?


There are different mathematical approaches to factor analysis
We'll use principal components analysis
MINITAB also has a method called factor analysis, based on the common factor model
Results from the two methods are generally similar

Intuition underlying PCA


Find one dimension that captures the largest amount of variation in the data
This is the first principal component

Find a second dimension that captures the largest amount of variation not already explained by the first dimension
This is the second principal component
No correlation with the first

Continue until you reach diminishing returns
The cost of additional complexity outweighs the value of additional insight from adding the next dimension
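
The intuition above can be checked numerically. This is a sketch on synthetic two-column data (assumed, not from the course), using NumPy: the first component should have at least as much variance as any other direction, and the two components should be uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): two correlated columns.
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize each column

# Eigen-decomposition of the correlation matrix; eigh returns ascending order.
R = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the data onto the components.
scores = X @ eigenvectors
pc1, pc2 = scores[:, 0], scores[:, 1]

# Compare PC1 with many random unit-length directions.
angles = rng.uniform(0, np.pi, size=1000)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
random_variances = (X @ directions.T).var(axis=0, ddof=1)

print("variance along PC1:", pc1.var(ddof=1))
print("best random direction:", random_variances.max())
print("correlation PC1 vs PC2:", np.corrcoef(pc1, pc2)[0, 1])
```

No random direction should beat PC1, and the PC1/PC2 correlation should be zero up to floating-point error.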

What is factor analysis good for?


Clarifying patterns of association in the data
Does this by reducing the dimensionality of the data space
Makes it easier to visualize the data

Identifying underlying traits or characteristics

Problem with reduced dimensions?


We lose information!
Hopefully not interpretable information

We use factor analysis to capture the greatest amount of information in the smallest number of dimensions
Aaker discovered a solution in five dimensions that retained 92% of the information (variation) in the data

Factor analysis involves a trade-off between simplicity and completeness

Steps to Factor Analysis


Data
Obs   Income   Av. Age   Population   X4   X5   X6   X7   ...   X100
...
1000

Correlation Matrix (Information Content)
             Income   Av. Age   Population   ...   X100
Income
Av. Age
Population

Steps to Factor Analysis


Correlation Matrix
             Income   Av. Age   Population   ...   X100
Income
Av. Age
Population
...
X100

Eigenvalue Decomposition

Eigenvalues
Factor Loadings

Factor Scores
Obs    F1   F2   F3
1
2
3
...
1000
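
The steps above can be sketched end to end with NumPy. The data are synthetic (an assumed structure of two hidden factors driving six columns), and the variable names are illustrative only. This is not the MINITAB procedure itself, just the same sequence: data, correlation matrix, eigenvalue decomposition, loadings, scores.

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: data. Synthetic stand-in: 1000 observations, 6 columns driven by
# 2 hidden factors plus noise (assumed structure, for illustration only).
n_obs, n_vars, n_keep = 1000, 6, 2
hidden = rng.normal(size=(n_obs, 2))
weights = rng.normal(size=(2, n_vars))
X = hidden @ weights + 0.5 * rng.normal(size=(n_obs, n_vars))

# Step 2: correlation matrix (the information content).
R = np.corrcoef(X, rowvar=False)

# Step 3: eigenvalue decomposition, sorted from largest to smallest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: loadings expressed as correlations between variables and components.
loadings = eigenvectors[:, :n_keep] * np.sqrt(eigenvalues[:n_keep])

# Step 5: factor scores, one column per retained component.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scores = Z @ eigenvectors[:, :n_keep]

print("eigenvalues:", np.round(eigenvalues, 3))
print("sum of eigenvalues:", round(float(eigenvalues.sum()), 3))
print("scores shape:", scores.shape)
```

Because the correlation matrix is analyzed, the eigenvalues add up to the number of columns, and the scores table has one row per observation and one column per retained component.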

Efficiency gain: Eigenvalues


For 100 dimensions: 100 eigenvalues
e1, e2, ..., e100
The last ones are likely to be small

e1 represents the amount of variation explained by factor F1

PCA returns factors arranged by decreasing order of variation
F1 explains the largest variation, so e1 is the largest eigenvalue
% variation explained by F1 = e1 / (e1 + e2 + ... + e100)
In this example: e1 / 100
Sum of all eigenvalues = Total dimensions in the data (when the correlation matrix is analyzed)

Efficiency gain: Eigenvalues


How many dimensions in the new data?
Depends on number of useful factors (corresponding eigenvalues high)
Likely: only a few factors have high eigenvalues
E.g. e1 = 40, e2 = 30, ..., e100 = 0.1

In this case the first two components explain 70% of the variation (40 + 30 out of a total of 100)

Eigenvalue > 1?
Recall that the sum of the eigenvalues is equal to the number of original variables
If an eigenvalue > 1, then it is capturing a more than proportional share of the variance
If an eigenvalue < 1, it is doing less work than one of the original variables


Factor Loadings
What do the new factors mean?
Remember they don't have any interpretation yet
Look at the factor loadings
Correlation between new factors and old variables

Correlation Matrix
              F1    F2    F3   ...   F100
Income        .8
Av. Age       .1
Population    .7
...
X100

F1 column: loadings for the first factor
Later factors: redundant because their eigenvalues will be small

How to interpret F1?
What is it correlated with?
High F1 represents high income, high population

Factor analysis in MINITAB


Beer Data
Surveyed MBA students about their perceptions of and preferences for 10 different brands of beer
Asked respondents to focus on a particular social context: "a place away from home where you go to relax and enjoy the company of friends and contemporaries"
N = 69 respondents completed the survey

Screening questions:
What is your usual beverage of choice in this context?
How often do you consume beer in this context?

Our target was anyone who answered "beer" and "more than once a month"
N = 32 qualified as a target market consumer

Attribute data


Average attribute values


Average Attributes not informative


Correlation matrix

Given information on Dark color: Strong, Complex, Light body are not very informative


Principal components in MINITAB


Use the STAT menu
From STAT, click on MULTIVARIATE
From MULTIVARIATE, click on PRINCIPAL COMPONENTS


Principal components: Main menu


Select variables
Highlight variables
Click select button

Choose number of components


Usually 2 or 3

Decide on scaling
Generally, TYPE OF MATRIX = CORRELATION


Principal components: Graphs


Click the GRAPH button
To start, we will look at two graphs:
SCREE PLOT
LOADINGS PLOT

For other analyses, it is often helpful to look at the first four graphs


Principal components: Storage


Click the STORAGE button
Indicate columns to store factor scores
Should be one column for every component you asked to calculate

You might decide to name these columns PC1, PC2, etc.


Beer Data: Eigenvalues


Principal Component Analysis: Dark color, Trendy, Common, Light body, Formal, I
Eigenanalysis of the Correlation Matrix
Eigenvalue   3.8655  1.3523  0.8784  0.7639  0.6501  0.5987  0.5116  0.1981  0.1814
Proportion   0.430   0.150   0.098   0.085   0.072   0.067   0.057   0.022   0.020
Cumulative   0.430   0.580   0.677   0.762   0.834   0.901   0.958   0.980   1.000
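
The Proportion and Cumulative rows can be recomputed from the eigenvalues alone. A minimal sketch in plain Python, using the nine eigenvalues from the MINITAB output above; it also applies the "eigenvalue > 1" rule. Variable names are illustrative.

```python
# Eigenvalues copied from the MINITAB output above (nine beer attributes).
eigenvalues = [3.8655, 1.3523, 0.8784, 0.7639, 0.6501,
               0.5987, 0.5116, 0.1981, 0.1814]

# For a correlation matrix the total equals the number of variables.
total = sum(eigenvalues)
proportions = [value / total for value in eigenvalues]

# Running sum of the proportions.
cumulative = []
running = 0.0
for share in proportions:
    running += share
    cumulative.append(running)

# "Eigenvalue > 1" rule: keep components that do more work than one variable.
retained = [i + 1 for i, value in enumerate(eigenvalues) if value > 1]

print("total:", round(total, 4))
for i, (value, share, cum) in enumerate(zip(eigenvalues, proportions, cumulative), 1):
    print(f"PC{i}: eigenvalue={value:.4f} proportion={share:.3f} cumulative={cum:.3f}")
print("components with eigenvalue > 1:", retained)
```

The eigenvalues add up to 9, the number of attributes, and the recomputed shares agree with the MINITAB rows to within rounding. Only the first two components pass the eigenvalue > 1 rule; together they account for about 58% of the variation.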


Beer Data: Factor loadings

Variable       PC1      PC2
Dark color    -0.401   -0.013
Trendy         0.009   -0.702
Common         0.275    0.458
Light body     0.438   -0.069
Formal        -0.240    0.265
Inexpensive    0.331    0.250
Strong        -0.426    0.094
Complex       -0.440    0.016
Sweet          0.173   -0.389
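
A hedged check on the table above, in plain Python. The squared coefficients in each column sum to about 1, which suggests the reported values are unit-length eigenvector coefficients and not correlations. Under that assumption, multiplying a column by the square root of its eigenvalue gives the correlation between each attribute and the component. The rescaling is an illustration and is not part of the MINITAB output.

```python
import math

# PC1 and PC2 columns copied from the loadings table above.
variables = ["Dark color", "Trendy", "Common", "Light body", "Formal",
             "Inexpensive", "Strong", "Complex", "Sweet"]
pc1 = [-0.401, 0.009, 0.275, 0.438, -0.240, 0.331, -0.426, -0.440, 0.173]
pc2 = [-0.013, -0.702, 0.458, -0.069, 0.265, 0.250, 0.094, 0.016, -0.389]
eigenvalue1, eigenvalue2 = 3.8655, 1.3523  # from the eigenanalysis output

# Unit-length check: squared coefficients of an eigenvector sum to about 1.
length1 = sum(c * c for c in pc1)
length2 = sum(c * c for c in pc2)
print("sum of squares:", round(length1, 3), round(length2, 3))

# Assumption: the columns are unit eigenvectors of the correlation matrix,
# so scaling by sqrt(eigenvalue) turns coefficients into correlations.
corr1 = [c * math.sqrt(eigenvalue1) for c in pc1]
corr2 = [c * math.sqrt(eigenvalue2) for c in pc2]

# List variables by strength of association with PC1.
ranked = sorted(zip(variables, corr1, corr2), key=lambda t: -abs(t[1]))
for name, r1, r2 in ranked:
    print(f"{name:<12} PC1 {r1:6.2f}   PC2 {r2:6.2f}")
```

Ranked this way, Complex, Light body, Strong and Dark color are the attributes most strongly tied to PC1, while Trendy dominates PC2.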


Beer Data: Factor scores


This is the new data with new variables
New variables are the factors
Since we chose 2 factors, we have 2 factor scores for every observation


How do the brands fare across the new dimensions?


Average factor scores for each beer
Pool perceptions across respondents


Store Brand means for both new factors

Generate a Scatter plot

Use data labels
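
Outside MINITAB, pooling factor scores by brand might look like the sketch below, in plain Python. The brand names and scores are invented for illustration and are not the beer survey data. Each row stands for one respondent's rating of one brand.

```python
from collections import defaultdict

# Hypothetical rows: (brand, PC1 score, PC2 score) for one respondent's rating.
# Brand names and numbers are invented for illustration only.
rows = [
    ("Brand A", 1.2, -0.4),
    ("Brand A", 0.8, 0.0),
    ("Brand B", -1.5, 0.6),
    ("Brand B", -0.5, 1.0),
    ("Brand C", 0.25, -1.25),
]

def brand_means(rows):
    """Average the PC1 and PC2 scores over all rows of each brand."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for brand, pc1, pc2 in rows:
        entry = sums[brand]
        entry[0] += pc1
        entry[1] += pc2
        entry[2] += 1
    return {b: (s1 / n, s2 / n) for b, (s1, s2, n) in sums.items()}

means = brand_means(rows)
for brand, (m1, m2) in means.items():
    print(f"{brand}: PC1 mean = {m1:.2f}, PC2 mean = {m2:.2f}")
```

Each (PC1 mean, PC2 mean) pair is one point on the scatter plot, with the brand name as its data label.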


Beer Data: Interpretation?


[Q+A]


Census data on States


20 Attributes of US markets
What really varies across states?
Can we create a map based on these dimensions?

Total population
Population under 18 years, percent
White Population percent
Hispanic or Latino Origin, percent
Foreign born, percent
Percent high school graduate or higher
Percent bachelor's degree or higher
Veterans total
Average travel time to work for workers
Median value of specified owner-occupied housing units
Households
Average household size
Per capita income in the past 12 months
Median household income
People of all ages in poverty, percent
Private nonfarm establishments
Total number of firms
Manufacturing - value of shipments
Retail trade sales of establishments with payroll
Population per square mile

Takeaways
Data with numerous attributes exist

All variables in the data may not be informative individually
Together they may be informative

Principal Component Analysis: Uncover underlying structure in the data
Reduce dimensions without losing much information

Create new dimensions which are interpretable
