
Statistical Learning and Machine Learning
Example Problems
• Identify the risk factors for prostate cancer.
• Classify a recorded phoneme based on a log-periodogram.
• Predict whether someone will have a heart attack on the basis
of demographic, diet and clinical measurements.
• Customize an email spam detection system.
• Identify the numbers in a handwritten zip code.
• Establish the relationship between salary and demographic
variables in population survey data.

[Figure: scatterplot matrix of the prostate cancer data: pairwise plots of lpsa, lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45.]

Phoneme Examples

[Figure: log-periodograms of the phonemes "aa" and "ao" plotted against frequency.]

Phoneme Classification: Raw and Restricted Logistic Regression


[Figure: fitted logistic regression coefficients plotted against frequency, for the raw fit and the restricted fit.]

[Figure: scatterplot matrix of the heart disease data: pairwise plots of sbp, tobacco, ldl, famhist, obesity, alcohol and age.]

Spam Detection
• data from 4601 emails sent to an individual (named George,
at HP labs, before 2000). Each is labeled as spam or email.
• goal: build a customized spam filter.
• input features: relative frequencies of 57 of the most
commonly occurring words and punctuation marks in these
email messages.

        george   you    hp  free     !   edu  remove
spam      0.00  2.26  0.02  0.52  0.51  0.01    0.28
email     1.27  1.27  0.90  0.07  0.11  0.29    0.01
Average percentage of words or characters in an email message equal
to the indicated word or character. We have chosen the words and
characters showing the largest difference between spam and email.
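
As a sketch of how such relative-frequency features might be computed (an illustration, not the pipeline behind the actual dataset):

```python
import re
from collections import Counter

def word_frequencies(message, vocabulary):
    """Percentage of tokens in the message equal to each vocabulary entry."""
    tokens = re.findall(r"[a-z]+|!", message.lower())  # words, plus '!' as its own token
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return {w: 100.0 * counts[w] / total for w in vocabulary}

# Hypothetical message; the vocabulary mimics a few of the 57 features.
vocab = ["george", "you", "hp", "free", "!", "remove"]
print(word_frequencies("You can remove this offer for free!", vocab))
```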

Example Problems
• Identify the risk factors for prostate cancer.
• Classify a recorded phoneme based on a log-periodogram.
• Predict whether someone will have a heart attack on the basis
of demographic, diet and clinical measurements.
• Customize an email spam detection system.
• Identify the numbers in a handwritten zip code.
• Classify a tissue sample into one of several cancer classes,
based on a gene expression profile.
• Establish the relationship between salary and demographic
variables in population survey data.

[Figure: Wage plotted against Age, Year and Education Level.]

Income survey data for males from the central Atlantic region
of the USA in 2009.

The Supervised Learning Problem
Starting point:
• Outcome measurement Y (also called dependent variable,
response, target).
• Vector of p predictor measurements X (also called inputs,
regressors, covariates, features, independent variables).
• In the regression problem, Y is quantitative (e.g. price,
blood pressure).
• In the classification problem, Y takes values in a finite,
unordered set (survived/died, digit 0–9, cancer class of
tissue sample).
• We have training data (x1, y1), . . . , (xN, yN). These are
observations (examples, instances) of these measurements.

Objectives

On the basis of the training data we would like to:

• Accurately predict unseen test cases.


• Understand which inputs affect the outcome, and how.
• Assess the quality of our predictions and inferences.

Philosophy
• It is important to understand the ideas behind the various
techniques, in order to know how and when to use them.
• One has to understand the simpler methods first, in order
to grasp the more sophisticated ones.
• It is important to accurately assess the performance of a
method, to know how well or how badly it is working.

Unsupervised learning

• No outcome variable, just a set of predictors (features)
measured on a set of samples.
• The objective is more fuzzy:
• find groups of samples that behave similarly;
• find features that behave similarly;
• find linear combinations of features with the most variation.
• It is difficult to know how well you are doing.
• Different from supervised learning, but can be useful as a
pre-processing step for supervised learning.
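
For illustration, a minimal sketch on made-up data of two of these objectives: grouping similar samples, and finding the linear combination of features with the most variation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# Made-up samples: two loose groups in 5 features.
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])

# Find groups of samples that behave similarly.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Find the linear combination of features with the most variation.
pc1 = PCA(n_components=1).fit(X).components_[0]

print(labels[:10], pc1)
```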

The Netflix prize

• Netflix provided a training data set of 100,480,507
ratings that 480,189 users gave to 17,770 movies,
each rating between 1 and 5.
• The training data are very sparse — many values are missing.
• The objective is to predict the rating for a set of 1 million
customer–movie pairs that are missing in the training data.
• Is this a supervised or unsupervised problem?

BellKor’s Pragmatic Chaos wins, beating The Ensemble by a narrow margin.
Statistical Learning versus Machine Learning
• Machine learning arose as a subfield of AI.
• Statistical learning arose as a subfield of Statistics.
• There is much overlap — both fields focus on supervised
and unsupervised problems:
• Machine learning has a greater emphasis on large scale
applications and prediction accuracy.
• Statistical learning emphasizes models and their
interpretability, and precision and uncertainty.
• But the distinction has become more and more blurred,
and there is a great deal of “cross-fertilization”.
• Statistical learning is a fundamental ingredient in the
training of a modern data scientist.

What is Statistical Learning?

Shown are Sales vs TV, Radio and Newspaper, with a blue
linear-regression line fit separately to each.
Can we predict Sales using these three?
Perhaps we can do better using a model

Sales ≈ f(TV, Radio, Newspaper)

Notation
Here Sales is a response or target that we wish to predict. We
generically refer to the response as Y .
TV is a feature, or input, or predictor; we name it X1.
Likewise name Radio as X2, and so on.
We can refer to the input vector collectively as

X = (X1, X2, X3)

Now we write our model as

Y = f(X) + ε

where ε captures measurement errors and other discrepancies.
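
To make this concrete, a minimal sketch that simulates data from the model, with a made-up f and noise level:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A made-up "true" regression function, for illustration only.
    return 2.0 + 3.0 * np.sin(x)

n = 100
X = rng.uniform(0, 6, size=n)        # predictor
eps = rng.normal(0, 0.5, size=n)     # error term, with E[eps] = 0
Y = f(X) + eps                       # the model Y = f(X) + epsilon
```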

What is f (X) good for?

• With a good f we can make predictions of Y at new points
X = x.
• We can understand which components of X = (X1, X2, . . . , Xp)
are important in explaining Y, and which are irrelevant.
E.g. Seniority and Years of Education have a big impact on
Income, but Marital Status typically does not.
• Depending on the complexity of f, we may be able to
understand how each component Xj of X affects Y.

[Figure: scatterplot of simulated (X, Y) data, with many Y values at each X.]

Is there an ideal f(X)? In particular, what is a good value for
f(X) at any selected value of X, say X = 4? There can be
many Y values at X = 4. A good value is

f(4) = E(Y | X = 4)

E(Y | X = 4) means the expected value (average) of Y given X = 4.
This ideal f(x) = E(Y | X = x) is called the regression function.
The regression function f(x)
• Is also defined for a vector X; e.g.
f(x) = f(x1, x2, x3) = E(Y | X1 = x1, X2 = x2, X3 = x3)
• Is the ideal or optimal predictor of Y with regard to
mean-squared prediction error: f(x) = E(Y | X = x) is the
function that minimizes E[(Y − g(X))² | X = x] over all
functions g at all points X = x.
• ε = Y − f(x) is the irreducible error — i.e. even if we knew
f(x), we would still make errors in prediction, since at each
X = x there is typically a distribution of possible Y values.
• For any estimate fˆ(x) of f(x), we have

E[(Y − fˆ(X))² | X = x] = [f(x) − fˆ(x)]² + Var(ε),

where the first (reducible) term depends on our estimate and
Var(ε) is irreducible.
How to estimate f
• Typically we have few if any data points with X = 4 exactly.
• So we cannot compute E(Y | X = x)!
• Relax the definition and let
fˆ(x) = Ave(Y | X ∈ N(x))
where N(x) is some neighborhood of x.
[Figure: simulated (x, y) data illustrating the neighborhood N(x) used for local averaging.]
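
A minimal sketch of this estimator, using a fixed-width window as the neighborhood N(x) (the window width is an arbitrary choice here):

```python
import numpy as np

def nn_average(x0, X, Y, width=0.5):
    """Estimate f(x0) = E[Y | X = x0] by averaging Y over a neighborhood of x0."""
    in_neighborhood = np.abs(X - x0) <= width   # N(x0) = [x0 - width, x0 + width]
    if not in_neighborhood.any():
        return np.nan                           # no data near x0
    return Y[in_neighborhood].mean()

# Made-up data, in the spirit of the simulated example above:
rng = np.random.default_rng(1)
X = rng.uniform(1, 6, size=200)
Y = 2.0 + 3.0 * np.sin(X) + rng.normal(0, 0.5, size=200)
print(nn_average(4.0, X, Y))   # estimate of E(Y | X = 4)
```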
• Nearest-neighbor averaging can be pretty good for small p
— i.e. p ≤ 4 — and large-ish N.
• Nearest-neighbor methods can be lousy when p is large.
Reason: the curse of dimensionality. Nearest neighbors
tend to be far away in high dimensions.
• We need to average a reasonable fraction of the N values of yi
to bring the variance down — e.g. 10%.
• A 10% neighborhood in high dimensions need no longer be
local, so we lose the spirit of estimating E(Y | X = x) by
local averaging (see the sketch below).
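
To quantify this, a hypercube neighborhood that captures a fraction r of points uniformly distributed in the unit cube in p dimensions needs side length r^(1/p); a quick sketch:

```python
# Side length of a hypercube neighborhood capturing 10% of points
# uniformly distributed in the unit cube [0, 1]^p.
fraction = 0.10
for p in [1, 2, 3, 5, 10]:
    side = fraction ** (1 / p)
    print(f"p = {p:2d}: side length = {side:.2f}")
# p = 1: 0.10, p = 2: 0.32, p = 3: 0.46, p = 5: 0.63, p = 10: 0.79
# In 10 dimensions the "neighborhood" spans most of each axis.
```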

The curse of dimensionality

[Figure: left, a 10% neighborhood in two dimensions (x1, x2); right, the radius needed to capture a given fraction of volume, for p = 1, 2, 3, 5 and 10.]

Parametric and structured models
The linear model is an important example of a parametric
model:

f_L(X) = β0 + β1X1 + β2X2 + · · · + βpXp.

• A linear model is specified in terms of p + 1 parameters
β0, β1, . . . , βp.
• We estimate the parameters by fitting the model to
training data.
• Although it is almost never correct, a linear model often
serves as a good and interpretable approximation to the
unknown true function f(X).
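
A minimal sketch of estimating the p + 1 parameters by least squares on made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
Y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.3, size=n)

# Add an intercept column and solve the least-squares problem.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -1.0, 0.5]
```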

A linear model fˆL(X) = β̂0 + β̂1X gives a reasonable fit here.

A quadratic model fˆQ(X) = β̂0 + β̂1X + β̂2X² fits slightly
better.
Simulated example. Red points are simulated values for income
from the model

income = f(education, seniority) + ε

f is the blue surface.
Linear regression model fit to the simulated data:

fˆL(education, seniority) = β̂0 + β̂1 × education + β̂2 × seniority
More flexible regression model fˆS(education, seniority) fit to
the simulated data. Here we use a technique called a thin-plate
spline to fit a flexible surface.

Even more flexible spline regression model
fˆS (education, seniority) fit to the simulated data. Here the
fitted model makes no errors on the training data! Also known
as overfitting.

Some trade-offs
• Prediction accuracy versus interpretability.
— Linear models are easy to interpret; thin-plate splines are
not.
• Good fit versus over-fit or under-fit.
— How do we know when the fit is just right?
• Parsimony versus black-box.
— We often prefer a simpler model involving fewer variables
over a black-box predictor involving them all.

[Figure: interpretability (high to low) versus flexibility (low to high). Subset Selection and the Lasso sit at high interpretability and low flexibility; then Least Squares; then Generalized Additive Models and Trees; then Bagging and Boosting; Support Vector Machines sit at high flexibility and low interpretability.]

Assessing Model Accuracy

Suppose we fit a model fˆ(x) to some training data
Tr = {xi, yi}_1^N, and we wish to see how well it performs.
• We could compute the average squared prediction error
over Tr:
MSE_Tr = Ave_{i∈Tr}[yi − fˆ(xi)]²
This may be biased toward more overfit models.
• Instead we should, if possible, compute it using fresh test
data Te = {xi, yi}_1^M:
MSE_Te = Ave_{i∈Te}[yi − fˆ(xi)]²
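
A minimal sketch contrasting MSE_Tr and MSE_Te, using polynomial fits of increasing flexibility on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    X = rng.uniform(0, 6, size=n)
    Y = 2.0 + 3.0 * np.sin(X) + rng.normal(0, 0.5, size=n)  # made-up truth
    return X, Y

X_tr, Y_tr = make_data(50)    # training data Tr
X_te, Y_te = make_data(200)   # fresh test data Te

for degree in [1, 3, 10]:     # increasing flexibility
    coefs = np.polyfit(X_tr, Y_tr, degree)
    mse_tr = np.mean((Y_tr - np.polyval(coefs, X_tr)) ** 2)
    mse_te = np.mean((Y_te - np.polyval(coefs, X_te)) ** 2)
    print(f"degree {degree:2d}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
# MSE_Tr keeps improving with flexibility; MSE_Te eventually gets worse.
```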

[Figure: left, simulated data with fits of three flexibilities; right, Mean Squared Error versus Flexibility.]

Black curve is the truth. Red curve on right is MSE_Te, grey curve is
MSE_Tr. Orange, blue and green curves/squares correspond to fits of
different flexibility.

[Figure: a second example, with data and fits (left) and Mean Squared Error versus Flexibility (right).]

Here the truth is smoother, so the smoother fit and linear model do
really well.

[Figure: a third example, with data and fits (left) and Mean Squared Error versus Flexibility (right).]

Here the truth is wiggly and the noise is low, so the more flexible fits
do the best.

Bias-Variance Trade-off
Suppose we have fit a model fˆ(x) to some training data Tr, and
let (x0, y0) be a test observation drawn from the population. If
the true model is Y = f(X) + ε (with f(x) = E(Y | X = x)),
then

E[(y0 − fˆ(x0))²] = Var(fˆ(x0)) + [Bias(fˆ(x0))]² + Var(ε).

The expectation averages over the variability of y0 as well as
the variability in Tr. Note that Bias(fˆ(x0)) = E[fˆ(x0)] − f(x0).
Typically as the flexibility of fˆ increases, its variance increases
and its bias decreases. So choosing the flexibility based on
average test error amounts to a bias-variance trade-off.
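
A minimal simulation sketch (with a made-up f and noise level) that estimates the three terms at a single test point by refitting on many training sets:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: 2.0 + 3.0 * np.sin(x)   # made-up true regression function
sigma = 0.5                            # noise standard deviation
x0 = 4.0                               # test point

preds = []
for _ in range(2000):                  # many independent training sets Tr
    X = rng.uniform(0, 6, size=50)
    Y = f(X) + rng.normal(0, sigma, size=50)
    coefs = np.polyfit(X, Y, 3)        # cubic fit: one choice of flexibility
    preds.append(np.polyval(coefs, x0))

preds = np.array(preds)
variance = preds.var()                         # Var(fhat(x0))
bias_sq = (preds.mean() - f(x0)) ** 2          # Bias(fhat(x0))^2
print(variance + bias_sq + sigma**2)           # ≈ E[(y0 - fhat(x0))^2]
```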

Bias-variance trade-off for the three
examples
[Figure: squared bias, variance and test MSE plotted against flexibility for the three examples.]

Classification Problems

Here the response variable Y is qualitative — e.g. email is one
of C = (spam, ham) (ham = good email), digit class is one of
C = {0, 1, . . . , 9}. Our goals are to:
• Build a classifier C(X) that assigns a class label from C to
a future unlabeled observation X.
• Assess the uncertainty in each classification.
• Understand the roles of the different predictors among
X = (X1, X2, . . . , Xp).

[Figure: binary response data shown as rug plots at y = 0 and y = 1 against x.]

Suppose the K elements in C are numbered 1, 2, . . . , K. Let

pk(x) = Pr(Y = k | X = x), k = 1, 2, . . . , K.

These are the conditional class probabilities at x; e.g. see the little
barplot at x = 5. Then the Bayes optimal classifier at x is

C(x) = j if pj(x) = max{p1(x), p2(x), . . . , pK(x)}
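
A minimal sketch of this rule, assuming the true conditional probabilities pk(x) are known (the numbers below are invented):

```python
import numpy as np

def bayes_classifier(pk_x):
    """Return the class j maximizing the conditional probability p_j(x)."""
    return int(np.argmax(pk_x)) + 1   # classes numbered 1, ..., K

# Invented conditional class probabilities at some point x:
pk_x = [0.2, 0.5, 0.3]               # p_1(x), p_2(x), p_3(x)
print(bayes_classifier(pk_x))        # -> 2
```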
[Figure: a second binary response example, shown as rug plots against x.]

Nearest-neighbor averaging can be used as before, and it also
breaks down as the dimension grows. However, the impact on
Ĉ(x) is less than on p̂k(x), k = 1, . . . , K.

Classification: some details
• Typically we measure the performance of Ĉ(x) using the
misclassification error rate:

Err_Te = Ave_{i∈Te} I[yi ≠ Ĉ(xi)]

• The Bayes classifier (using the true pk(x)) has the smallest
error (in the population).
• Support-vector machines build structured models for C(x).
• We will also build structured models for representing the
pk(x), e.g. logistic regression and generalized additive models.

Example: K-nearest neighbors in two dimensions

[Figure: two-class training data in two dimensions, X1 versus X2.]

KNN: K = 10

[Figure: KNN decision boundary for K = 10 on the two-class data.]

KNN: K = 1 and KNN: K = 100

[Figure: KNN decision boundaries for K = 1 and K = 100.]

[Figure: training and test error rates plotted against 1/K.]
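
A minimal sketch of this example using scikit-learn on made-up two-dimensional data (the dataset here is invented, not the one in the figures):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
# Made-up two-class data in two dimensions.
n = 400
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.7, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for k in [1, 10, 100]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"K = {k:3d}: train error = {1 - knn.score(X_tr, y_tr):.3f}, "
          f"test error = {1 - knn.score(X_te, y_te):.3f}")
# K = 1 gives zero training error but higher test error (overfitting);
# very large K oversmooths; the test error is minimized in between.
```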
Exercise
For each of parts (a) through (d), indicate whether we would
generally expect the performance of a flexible statistical learning
method to be better or worse than an inflexible method. Justify your
answer.

(a) The sample size n is extremely large, and the number of
predictors p is small.

(b) The number of predictors p is extremely large, and the number
of observations n is small.

(c) The relationship between the predictors and response is highly
non-linear.

(d) The variance of the error terms, i.e. σ² = Var(ε), is extremely
high.

Exercise
Provide a sketch of typical (squared) bias, variance, training
error, test error, and Bayes (or irreducible) error curves, on a
single plot, as we go from less flexible statistical learning
methods towards more flexible approaches. The x-axis
should represent the amount of flexibility in the method, and
the y-axis should represent the values for each curve. There
should be five curves.

End