
Beyond Classification

Rob Schapire
Princeton University
[currently visiting Yahoo! Research]
Classification and Beyond

• earlier, studied classification learning


• goal: learn to classify examples into fixed set of categories
• want to predict correct class as often as possible
• many applications
• however, often faced with learning problems that don’t fit this
paradigm:
• predicting real-valued quantities:
• how many times will some web page be visited?
• how much will be bid on a particular advertisement?
• predicting probabilities:
• what is the probability user will click on some link?
• how likely is it that some user is a spammer?
This Lecture

• general techniques for:


• predicting real-valued quantities — “regression”
• predicting probabilities
• central, unifying idea: loss minimization
Regression
Example: Weather Prediction

• meteorologists A and B apply for job


• to test which is better:
• ask each to predict how much it will rain
• observe actual amount
• repeat

                predictions        actual
                A        B         outcome
Monday          1.2      0.5       0.9
Tuesday         0.1      0.3       0.0
Wednesday       2.0      1.0       2.1
• how to judge who gave better predictions?
Example (cont.)
• natural idea:
• measure discrepancy between predictions and outcomes
• e.g., measure using absolute difference
• choose forecaster with closest predictions overall

                predictions        actual        |difference|
                A        B         outcome       A        B
Monday          1.2      0.5       0.9           0.3      0.4
Tuesday         0.1      0.3       0.0           0.1      0.3
Wednesday       2.0      1.0       2.1           0.1      1.1
                                   total:        0.5      1.8

• could have measured discrepancy in other ways


• e.g., difference squared
• which measure to use?
Loss

• each forecast scored using loss function


x = weather conditions
f (x) = predicted amount
y = actual outcome
• loss function L(f (x), y ) measures discrepancy between
prediction f (x) and outcome y
• e.g.:
• absolute loss: L(f(x), y) = |f(x) − y|
• square loss: L(f(x), y) = (f(x) − y)²
• which L to use?
• need to understand properties of loss functions
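
To make the comparison concrete, here is a minimal Python sketch (not part of the original slides) that totals both losses for forecasters A and B using the numbers in the table above; the variable names are illustrative only.

```python
# Score the two forecasters from the rain-amount table under both losses.
predictions_A = [1.2, 0.1, 2.0]
predictions_B = [0.5, 0.3, 1.0]
outcomes      = [0.9, 0.0, 2.1]

def total_loss(preds, ys, loss):
    return sum(loss(f, y) for f, y in zip(preds, ys))

absolute = lambda f, y: abs(f - y)       # L(f(x), y) = |f(x) - y|
square   = lambda f, y: (f - y) ** 2     # L(f(x), y) = (f(x) - y)^2

for name, preds in [("A", predictions_A), ("B", predictions_B)]:
    print(name,
          "absolute:", round(total_loss(preds, outcomes, absolute), 2),
          "square:",   round(total_loss(preds, outcomes, square), 2))
# A: absolute 0.5, square 0.11   B: absolute 1.8, square 1.46
```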
Square Loss
• square loss often sensible because encourages predictions close
to true expectation
• fix x
• say y random with µ = E [y ]
• predict f = f (x)
• can show:
 
      E[L(f, y)] = E[(f − y)²] = (f − µ)² + Var(y)

  where Var(y) is the intrinsic randomness of y

• therefore:
• minimized when f = µ
• lower square loss ⇒ f closer to µ
• forecaster with lowest square loss has predictions closest
to E [y |x] on average
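
The "can show" step is a short expansion around µ = E[y], with f treated as a fixed number once x is fixed; a sketch of the algebra:

```latex
\begin{align*}
E\!\left[(f - y)^2\right]
  &= E\!\left[\bigl((f - \mu) + (\mu - y)\bigr)^2\right] \\
  &= (f - \mu)^2 + 2(f - \mu)\,\underbrace{E[\mu - y]}_{=\,0} + E\!\left[(\mu - y)^2\right] \\
  &= (f - \mu)^2 + \mathrm{Var}(y).
\end{align*}
```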
Learning for Regression
• say examples (x, y ) generated at random
• expected square loss

      E[L_f] ≡ E[(f(x) − y)²]

  minimized when f(x) = E[y|x] for all x


• how to minimize from training data (x1 , y1 ), . . . , (xm , ym )?
• attempt to find f with minimum empirical loss:
      Ê[L_f] ≡ (1/m) Σ_{i=1}^{m} (f(x_i) − y_i)²

• if ∀f : Ê[L_f] ≈ E[L_f], then the f that minimizes Ê[L_f] will
  approximately minimize E[L_f]
• to be possible, need to choose f of restricted form to avoid
overfitting
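
A minimal sketch of empirical loss minimization over a restricted class; here the "class" is just a tiny, hand-picked set of candidate predictors and the data is made up, purely for illustration.

```python
import numpy as np

def empirical_square_loss(f, xs, ys):
    """Average of (f(x_i) - y_i)^2 over the training set."""
    return np.mean([(f(x) - y) ** 2 for x, y in zip(xs, ys)])

# toy training data and a small, hand-picked class of candidate predictors
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.2, 2.9])
candidates = {"f(x)=x":    lambda x: x,
              "f(x)=0.5x": lambda x: 0.5 * x,
              "f(x)=1":    lambda x: 1.0}

best = min(candidates,
           key=lambda name: empirical_square_loss(candidates[name], xs, ys))
print(best)   # candidate with minimum empirical loss ("f(x)=x" here)
```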
Linear Regression

• e.g., if x ∈ R^n, could choose to use linear predictors of form

      f(x) = w · x
• then need to find w to minimize
      (1/m) Σ_{i=1}^{m} (w · x_i − y_i)²

• can solve in closed form


• can also minimize on-line (e.g. using gradient descent)
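
Both routes mentioned above can be sketched in a few lines of NumPy; the synthetic data, step size, and iteration count below are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 200
X = rng.normal(size=(m, n))                   # rows are examples x_i
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)     # noisy linear outcomes

# closed form: w minimizing (1/m) * sum_i (w . x_i - y_i)^2  (least squares)
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# on-line / iterative alternative: plain gradient descent on the same objective
w = np.zeros(n)
eta = 0.1                                     # step size (illustrative)
for _ in range(500):
    grad = (2.0 / m) * X.T @ (X @ w - y)      # gradient of the empirical loss
    w -= eta * grad

print(np.round(w_closed, 2), np.round(w, 2))  # both should be close to w_true
```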
Regularization

• to constrain predictor further, common to add regularization


term to encourage small weights:
      (1/m) Σ_{i=1}^{m} (w · x_i − y_i)² + λ‖w‖²

(in this case, called “ridge regression”)


• can significantly improve performance by limiting overfitting
• requires tuning of λ parameter
• different forms of regularization have different properties
• e.g., using kwk1 instead tends to encourage “sparse”
solutions
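
Ridge regression also has a closed form. A sketch under one common convention (texts differ on whether λ is scaled by m, so the normal equations below are one choice among several):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/m) * ||X w - y||^2 + lam * ||w||^2 in closed form."""
    m, n = X.shape
    # setting the gradient to zero gives (X^T X + m*lam*I) w = X^T y
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

for lam in [0.0, 0.1, 10.0]:        # lam must be tuned, e.g. on validation data
    print(lam, np.round(ridge_fit(X, y, lam), 2))   # larger lam -> smaller weights
```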
Absolute Loss

• what if instead use L(f(x), y) = |f(x) − y| ?


• can show E [|f (x) − y |] minimized when
f (x) = median of y ’s conditional distribution, given x
• potentially, quite different behavior from square loss
• not used so often
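
A quick numerical check of this contrast, with made-up numbers: over a fixed sample, the mean minimizes total squared deviation while the median minimizes total absolute deviation, so the two losses can prefer very different constant predictions when the data is skewed.

```python
import numpy as np

y = np.array([0.0, 0.1, 0.2, 0.3, 10.0])     # skewed sample with one large outlier
grid = np.linspace(-1, 11, 2401)             # candidate constant predictions f

sq = [np.sum((f - y) ** 2) for f in grid]    # total square loss of predicting f
ab = [np.sum(np.abs(f - y)) for f in grid]   # total absolute loss of predicting f

print(grid[np.argmin(sq)], y.mean())         # ~2.12: the mean, pulled by the outlier
print(grid[np.argmin(ab)], np.median(y))     # ~0.2:  the median, robust to the outlier
```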
Summary so far

• can handle prediction of real-valued outcomes by:


• choosing a loss function
• computing a prediction rule with minimum loss on
training data
• different loss functions have different properties:
• square loss estimates conditional mean
• absolute loss estimates conditional median
• what if goal is to estimate entire conditional distribution of y
given x?
Estimating Probabilities
Weather Example (revisited)

• say goal now is to predict probability of rain


• again, can compare A and B’s predictions:
                predictions        actual
                A        B         outcome
Monday          60%      80%       rain
Tuesday         20%      70%       no-rain
Wednesday       90%      50%       no-rain
• which is better?
Plausible Approaches

• similar to classification
• but goal now is to predict probability of class
• could reduce to regression:

      y = 1 if rain, 0 if no-rain

• minimize square loss to estimate

      E[y|x] = Pr[y = 1|x] = Pr[rain|x]

• reasonable, though somewhat awkward and unnatural


(especially when more than two possible outcomes)
Different Approach: Maximum Likelihood

• each forecaster predicting distribution over set of outcomes


y ∈ {rain, no-rain} for given x
• can compute probability of observed outcomes according to
each forecaster — “likelihood”
                predictions        actual        likelihood
                A        B         outcome       A        B
Monday          60%      80%       rain          0.6      0.8
Tuesday         20%      70%       no-rain       0.8      0.3
Wednesday       90%      50%       no-rain       0.1      0.5
likelihood(A) = .6 × .8 × .1 = 0.048
likelihood(B) = .8 × .3 × .5 = 0.120
• intuitively, higher likelihood ⇒ better fit of estimated
probabilities to observations
• so: choose maximum-likelihood forecaster
Log Loss

• given training data (x1 , y1 ), . . . , (xm , ym )


• f(y|x) = predicted probability of y given x

• likelihood of f = ∏_{i=1}^{m} f(y_i|x_i)

• maximizing likelihood ≡ minimizing negative log likelihood:

      Σ_{i=1}^{m} (− log f(y_i|x_i))

• L(f(·|x), y) = − log f(y|x) is called “log loss”
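
A minimal sketch computing the log loss of forecasters A and B on the three days above; since − log is monotone decreasing, lower log loss corresponds exactly to higher likelihood (B wins here under either criterion).

```python
import math

# predicted probability each forecaster assigned to the outcome that occurred
prob_of_outcome_A = [0.6, 0.8, 0.1]
prob_of_outcome_B = [0.8, 0.3, 0.5]

def log_loss(probs):
    """Sum of -log f(y_i | x_i); minimizing this maximizes the likelihood."""
    return sum(-math.log(p) for p in probs)

print(round(log_loss(prob_of_outcome_A), 3))   # ~3.04  (likelihood 0.048)
print(round(log_loss(prob_of_outcome_B), 3))   # ~2.12  (likelihood 0.120)
```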


Estimating Probabilities

• Pr[y |x] = true probability of y given x


• can prove: E[− log f(y|x)] is minimized when f(y|x) = Pr[y|x]
• more generally,

      E[− log f(y|x)] = (average distance between f(·|x) and Pr[·|x])
                        + (intrinsic uncertainty of Pr[·|x])

• so: minimizing log loss encourages choice of predictor close to


true conditional probabilities
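
In standard information-theoretic terms (this reading is not spelled out on the slide, but is the usual one), the "average distance" is a Kullback-Leibler divergence and the "intrinsic uncertainty" is a conditional entropy:

```latex
E\bigl[-\log f(y \mid x)\bigr]
  \;=\; E_x\!\left[\,\mathrm{KL}\bigl(\Pr[\cdot \mid x] \,\|\, f(\cdot \mid x)\bigr)\right]
  \;+\; E_x\!\left[\,H\bigl(\Pr[\cdot \mid x]\bigr)\right]
```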
Learning

• given training data (x_1, y_1), . . . , (x_m, y_m), choose f(y|x) to
  minimize

      (1/m) Σ_i (− log f(y_i|x_i))

• as before, need to restrict form of f


• e.g.: if x ∈ R^n, y ∈ {0, 1}, common to use f of form

      f(y = 1|x) = σ(w · x)

  where σ(z) = 1/(1 + e^(−z))


• can numerically find w to minimize log loss
• “logistic regression”
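
A sketch of logistic regression fit by plain gradient descent on the average log loss; the synthetic data, step size, and iteration count are illustrative only.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n = 500, 2
X = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.0])
y = (rng.random(m) < sigma(X @ w_true)).astype(float)  # labels drawn with Pr[y=1|x] = sigma(w_true . x)

w = np.zeros(n)
eta = 0.5                                              # step size (illustrative)
for _ in range(2000):
    p = sigma(X @ w)                                   # current f(y=1|x_i) for each example
    grad = X.T @ (p - y) / m                           # gradient of the average log loss
    w -= eta * grad

print(np.round(w, 2))                                  # should be roughly w_true
```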
Log Loss and Square Loss

• e.g.: if x ∈ R^n, y ∈ R, can take f(y|x) to be Gaussian with
  mean w · x and fixed variance
• then minimizing log loss ≡ linear regression
• general: square loss ≡ log loss with gaussian conditional
probability distributions (and fixed variance)
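
To see why, write out − log of the Gaussian density with mean w · x and fixed variance σ²; it is the square loss up to a scaling and an additive constant, neither of which depends on w:

```latex
-\log f(y \mid x)
  = -\log\!\left[\frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left(-\frac{(y - w \cdot x)^2}{2\sigma^2}\right)\right]
  = \frac{(y - w \cdot x)^2}{2\sigma^2} + \log\bigl(\sqrt{2\pi}\,\sigma\bigr)
```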
Classification and Loss Minimization

• in classification learning, try to minimize 0-1 loss



      L(f(x), y) = 1 if f(x) ≠ y, 0 otherwise

• expected 0-1 loss = generalization error


• empirical 0-1 loss = training error
• computationally and numerically difficult loss since
discontinuous and not convex
• to handle, both AdaBoost and SVM’s minimize alternative
surrogate losses
• AdaBoost: “exponential” loss
• SVM’s: “hinge” loss
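
A small numerical sketch of the three losses, written in the common margin form with y ∈ {−1, +1}, a real-valued score f(x), predictions taken as sign(f(x)), and margin z = y · f(x); the margin values below are made up.

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # margins y * f(x), for illustration

zero_one    = (z <= 0).astype(float)        # 0-1 loss: discontinuous, not convex
exponential = np.exp(-z)                    # AdaBoost's surrogate: smooth, convex
hinge       = np.maximum(0.0, 1.0 - z)      # SVM's surrogate: convex, piecewise linear

for name, vals in [("0-1", zero_one), ("exp", exponential), ("hinge", hinge)]:
    print(name, np.round(vals, 2))
# both surrogates upper-bound the 0-1 loss and are convex in the margin
```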
Summary

• much of learning can be viewed simply as loss minimization


• different losses have different properties and purposes
• regression (real-valued labels):
• use square loss to estimate conditional mean
• use absolute loss to estimate conditional median
• estimating conditional probabilities:
• use log loss (≡ maximum likelihood)
• classification:
• use 0/1-loss (or surrogate)
• provides unified and flexible means of algorithm design
