
Beyond Classification

Rob Schapire
Princeton University
[currently visiting Yahoo! Research]
Classification and Beyond

• earlier, studied classification learning


• goal: learn to classify examples into fixed set of categories
• want to predict correct class as often as possible
• many applications
• however, often faced with learning problems that don’t fit this
paradigm:
• predicting real-valued quantities:
• how many times will some web page be visited?
• how much will be bid on a particular advertisement?
• predicting probabilities:
• what is the probability user will click on some link?
• how likely is it that some user is a spammer?
This Lecture

• general techniques for:


• predicting real-valued quantities — “regression”
• predicting probabilities
• central, unifying idea: loss minimization
Regression
Example: Weather Prediction

• meteorologists A and B apply for job


• to test which is better:
• ask each to predict how much it will rain
• observe actual amount
• repeat

                predictions        actual
                A        B         outcome
Monday          1.2      0.5       0.9
Tuesday         0.1      0.3       0.0
Wednesday       2.0      1.0       2.1
• how to judge who gave better predictions?
Example (cont.)
• natural idea:
• measure discrepancy between predictions and outcomes
• e.g., measure using absolute difference
• choose forecaster with closest predictions overall

                predictions        actual        |difference|
                A        B         outcome       A        B
Monday          1.2      0.5       0.9           0.3      0.4
Tuesday         0.1      0.3       0.0           0.1      0.3
Wednesday       2.0      1.0       2.1           0.1      1.1
                                   total:        0.5      1.8

• could have measured discrepancy in other ways


• e.g., difference squared
• which measure to use?
Loss

• each forecast scored using loss function


x = weather conditions
f (x) = predicted amount
y = actual outcome
• loss function L(f (x), y ) measures discrepancy between
prediction f (x) and outcome y
• e.g.:
• absolute loss: L(f(x), y) = |f(x) − y|
• square loss: L(f(x), y) = (f(x) − y)²
• which L to use?
• need to understand properties of loss functions
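
To make the comparison concrete, here is a minimal Python sketch (not part of the original slides) that totals both losses for forecasters A and B using the numbers in the table above; the variable names are illustrative only.

```python
# Score the two forecasters from the rain-amount table under both losses.
predictions_A = [1.2, 0.1, 2.0]
predictions_B = [0.5, 0.3, 1.0]
outcomes      = [0.9, 0.0, 2.1]

def total_loss(preds, ys, loss):
    return sum(loss(f, y) for f, y in zip(preds, ys))

absolute = lambda f, y: abs(f - y)       # L(f(x), y) = |f(x) - y|
square   = lambda f, y: (f - y) ** 2     # L(f(x), y) = (f(x) - y)^2

for name, preds in [("A", predictions_A), ("B", predictions_B)]:
    print(name,
          "absolute:", round(total_loss(preds, outcomes, absolute), 2),
          "square:",   round(total_loss(preds, outcomes, square), 2))
# A: absolute 0.5, square 0.11   B: absolute 1.8, square 1.46
```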
Square Loss
• square loss often sensible because encourages predictions close
to true expectation
• fix x
• say y random with µ = E [y ]
• predict f = f (x)
• can show:
 
      E[L(f, y)] = E[(f − y)²] = (f − µ)² + Var(y)

  where Var(y) is the intrinsic randomness of y

• therefore:
• minimized when f = µ
• lower square loss ⇒ f closer to µ
• forecaster with lowest square loss has predictions closest
to E [y |x] on average
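
The "can show" step is a short expansion around µ = E[y], with f treated as a fixed number once x is fixed; a sketch of the algebra:

```latex
\begin{align*}
E\!\left[(f - y)^2\right]
  &= E\!\left[\bigl((f - \mu) + (\mu - y)\bigr)^2\right] \\
  &= (f - \mu)^2 + 2(f - \mu)\,\underbrace{E[\mu - y]}_{=\,0} + E\!\left[(\mu - y)^2\right] \\
  &= (f - \mu)^2 + \mathrm{Var}(y).
\end{align*}
```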
Learning for Regression
• say examples (x, y ) generated at random
• expected square loss

      E[L_f] ≡ E[(f(x) − y)²]

  minimized when f(x) = E[y|x] for all x


• how to minimize from training data (x1 , y1 ), . . . , (xm , ym )?
• attempt to find f with minimum empirical loss:
      Ê[L_f] ≡ (1/m) Σ_{i=1}^{m} (f(x_i) − y_i)²

• if ∀f : Ê[L_f] ≈ E[L_f], then the f that minimizes Ê[L_f] will
  approximately minimize E[L_f]
• to be possible, need to choose f of restricted form to avoid
overfitting
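
A minimal sketch of empirical loss minimization over a restricted class; here the "class" is just a tiny, hand-picked set of candidate predictors and the data is made up, purely for illustration.

```python
import numpy as np

def empirical_square_loss(f, xs, ys):
    """Average of (f(x_i) - y_i)^2 over the training set."""
    return np.mean([(f(x) - y) ** 2 for x, y in zip(xs, ys)])

# toy training data and a small, hand-picked class of candidate predictors
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.2, 2.9])
candidates = {"f(x)=x":    lambda x: x,
              "f(x)=0.5x": lambda x: 0.5 * x,
              "f(x)=1":    lambda x: 1.0}

best = min(candidates,
           key=lambda name: empirical_square_loss(candidates[name], xs, ys))
print(best)   # candidate with minimum empirical loss ("f(x)=x" here)
```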
Linear Regression

• e.g., if x ∈ R^n, could choose to use linear predictors of form

      f(x) = w · x
• then need to find w to minimize
      (1/m) Σ_{i=1}^{m} (w · x_i − y_i)²

• can solve in closed form


• can also minimize on-line (e.g. using gradient descent)
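
Both routes mentioned above can be sketched in a few lines of NumPy; the synthetic data, step size, and iteration count below are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 200
X = rng.normal(size=(m, n))                   # rows are examples x_i
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)     # noisy linear outcomes

# closed form: w minimizing (1/m) * sum_i (w . x_i - y_i)^2  (least squares)
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# on-line / iterative alternative: plain gradient descent on the same objective
w = np.zeros(n)
eta = 0.1                                     # step size (illustrative)
for _ in range(500):
    grad = (2.0 / m) * X.T @ (X @ w - y)      # gradient of the empirical loss
    w -= eta * grad

print(np.round(w_closed, 2), np.round(w, 2))  # both should be close to w_true
```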
Regularization

• to constrain predictor further, common to add regularization


term to encourage small weights:
      (1/m) Σ_{i=1}^{m} (w · x_i − y_i)² + λ‖w‖²

(in this case, called “ridge regression”)


• can significantly improve performance by limiting overfitting
• requires tuning of λ parameter
• different forms of regularization have different properties
• e.g., using kwk1 instead tends to encourage “sparse”
solutions
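
Ridge regression also has a closed form. A sketch under one common convention (texts differ on whether λ is scaled by m, so the normal equations below are one choice among several):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/m) * ||X w - y||^2 + lam * ||w||^2 in closed form."""
    m, n = X.shape
    # setting the gradient to zero gives (X^T X + m*lam*I) w = X^T y
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

for lam in [0.0, 0.1, 10.0]:        # lam must be tuned, e.g. on validation data
    print(lam, np.round(ridge_fit(X, y, lam), 2))   # larger lam -> smaller weights
```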
Absolute Loss

• what if instead use L(f(x), y) = |f(x) − y| ?


• can show E [|f (x) − y |] minimized when
f (x) = median of y ’s conditional distribution, given x
• potentially, quite different behavior from square loss
• not used so often
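
A quick numerical check of this contrast, with made-up numbers: over a fixed sample, the mean minimizes total squared deviation while the median minimizes total absolute deviation, so the two losses can prefer very different constant predictions when the data is skewed.

```python
import numpy as np

y = np.array([0.0, 0.1, 0.2, 0.3, 10.0])     # skewed sample with one large outlier
grid = np.linspace(-1, 11, 2401)             # candidate constant predictions f

sq = [np.sum((f - y) ** 2) for f in grid]    # total square loss of predicting f
ab = [np.sum(np.abs(f - y)) for f in grid]   # total absolute loss of predicting f

print(grid[np.argmin(sq)], y.mean())         # ~2.12: the mean, pulled by the outlier
print(grid[np.argmin(ab)], np.median(y))     # ~0.2:  the median, robust to the outlier
```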
Summary so far

• can handle prediction of real-valued outcomes by:


• choosing a loss function
• computing a prediction rule with minimum loss on
training data
• different loss functions have different properties:
• square loss estimates conditional mean
• absolute loss estimates conditional median
• what if goal is to estimate entire conditional distribution of y
given x?
Estimating Probabilities
Weather Example (revisited)

• say goal now is to predict probability of rain


• again, can compare A and B’s predictions:
                predictions        actual
                A        B         outcome
Monday          60%      80%       rain
Tuesday         20%      70%       no-rain
Wednesday       90%      50%       no-rain
• which is better?
Plausible Approaches

• similar to classification
• but goal now is to predict probability of class
• could reduce to regression:

      y = 1 if rain, 0 if no-rain

• minimize square loss to estimate

      E[y|x] = Pr[y = 1|x] = Pr[rain|x]

• reasonable, though somewhat awkward and unnatural


(especially when more than two possible outcomes)
Different Approach: Maximum Likelihood

• each forecaster predicting distribution over set of outcomes


y ∈ {rain, no-rain} for given x
• can compute probability of observed outcomes according to
each forecaster — “likelihood”
                predictions        actual        likelihood
                A        B         outcome       A        B
Monday          60%      80%       rain          0.6      0.8
Tuesday         20%      70%       no-rain       0.8      0.3
Wednesday       90%      50%       no-rain       0.1      0.5
likelihood(A) = .6 × .8 × .1 = 0.048
likelihood(B) = .8 × .3 × .5 = 0.120
• intuitively, higher likelihood ⇒ better fit of estimated
probabilities to observations
• so: choose maximum-likelihood forecaster
Log Loss

• given training data (x1 , y1 ), . . . , (xm , ym )


• f(y|x) = predicted probability of y given x

• likelihood of f = ∏_{i=1}^{m} f(y_i|x_i)

• maximizing likelihood ≡ minimizing negative log likelihood:

      Σ_{i=1}^{m} (− log f(y_i|x_i))

• L(f(·|x), y) = − log f(y|x) is called “log loss”
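
A minimal sketch computing the log loss of forecasters A and B on the three days above; since − log is monotone decreasing, lower log loss corresponds exactly to higher likelihood (B wins here under either criterion).

```python
import math

# predicted probability each forecaster assigned to the outcome that occurred
prob_of_outcome_A = [0.6, 0.8, 0.1]
prob_of_outcome_B = [0.8, 0.3, 0.5]

def log_loss(probs):
    """Sum of -log f(y_i | x_i); minimizing this maximizes the likelihood."""
    return sum(-math.log(p) for p in probs)

print(round(log_loss(prob_of_outcome_A), 3))   # ~3.04  (likelihood 0.048)
print(round(log_loss(prob_of_outcome_B), 3))   # ~2.12  (likelihood 0.120)
```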


Estimating Probabilities

• Pr[y |x] = true probability of y given x


• can prove: E[− log f(y|x)] is minimized when f(y|x) = Pr[y|x]
• more generally,

      E[− log f(y|x)] = (average distance between f(·|x) and Pr[·|x])
                        + (intrinsic uncertainty of Pr[·|x])

• so: minimizing log loss encourages choice of predictor close to


true conditional probabilities
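
In standard information-theoretic terms (this reading is not spelled out on the slide, but is the usual one), the "average distance" is a Kullback-Leibler divergence and the "intrinsic uncertainty" is a conditional entropy:

```latex
E\bigl[-\log f(y \mid x)\bigr]
  \;=\; E_x\!\left[\,\mathrm{KL}\bigl(\Pr[\cdot \mid x] \,\|\, f(\cdot \mid x)\bigr)\right]
  \;+\; E_x\!\left[\,H\bigl(\Pr[\cdot \mid x]\bigr)\right]
```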
Learning

• given training data (x_1, y_1), . . . , (x_m, y_m), choose f(y|x) to
  minimize

      (1/m) Σ_i (− log f(y_i|x_i))

• as before, need to restrict form of f


• e.g.: if x ∈ R^n, y ∈ {0, 1}, common to use f of form

      f(y = 1|x) = σ(w · x)

  where σ(z) = 1/(1 + e^(−z))


• can numerically find w to minimize log loss
• “logistic regression”
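
A sketch of logistic regression fit by plain gradient descent on the average log loss; the synthetic data, step size, and iteration count are illustrative only.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n = 500, 2
X = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.0])
y = (rng.random(m) < sigma(X @ w_true)).astype(float)  # labels drawn with Pr[y=1|x] = sigma(w_true . x)

w = np.zeros(n)
eta = 0.5                                              # step size (illustrative)
for _ in range(2000):
    p = sigma(X @ w)                                   # current f(y=1|x_i) for each example
    grad = X.T @ (p - y) / m                           # gradient of the average log loss
    w -= eta * grad

print(np.round(w, 2))                                  # should be roughly w_true
```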
Log Loss and Square Loss

• e.g.: if x ∈ R^n, y ∈ R, can take f(y|x) to be Gaussian with
  mean w · x and fixed variance
• then minimizing log loss ≡ linear regression
• general: square loss ≡ log loss with gaussian conditional
probability distributions (and fixed variance)
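
To see why, write out − log of the Gaussian density with mean w · x and fixed variance σ²; it is the square loss up to a scaling and an additive constant, neither of which depends on w:

```latex
-\log f(y \mid x)
  = -\log\!\left[\frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left(-\frac{(y - w \cdot x)^2}{2\sigma^2}\right)\right]
  = \frac{(y - w \cdot x)^2}{2\sigma^2} + \log\bigl(\sqrt{2\pi}\,\sigma\bigr)
```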
Classification and Loss Minimization

• in classification learning, try to minimize 0-1 loss



      L(f(x), y) = 1 if f(x) ≠ y, 0 otherwise

• expected 0-1 loss = generalization error


• empirical 0-1 loss = training error
• computationally and numerically difficult loss since
discontinuous and not convex
• to handle, both AdaBoost and SVM’s minimize alternative
surrogate losses
• AdaBoost: “exponential” loss
• SVM’s: “hinge” loss
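
A small numerical sketch of the three losses, written in the common margin form with y ∈ {−1, +1}, a real-valued score f(x), predictions taken as sign(f(x)), and margin z = y · f(x); the margin values below are made up.

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # margins y * f(x), for illustration

zero_one    = (z <= 0).astype(float)        # 0-1 loss: discontinuous, not convex
exponential = np.exp(-z)                    # AdaBoost's surrogate: smooth, convex
hinge       = np.maximum(0.0, 1.0 - z)      # SVM's surrogate: convex, piecewise linear

for name, vals in [("0-1", zero_one), ("exp", exponential), ("hinge", hinge)]:
    print(name, np.round(vals, 2))
# both surrogates upper-bound the 0-1 loss and are convex in the margin
```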
Summary

• much of learning can be viewed simply as loss minimization


• different losses have different properties and purposes
• regression (real-valued labels):
• use square loss to estimate conditional mean
• use absolute loss to estimate conditional median
• estimating conditional probabilities:
• use log loss (≡ maximum likelihood)
• classification:
• use 0/1-loss (or surrogate)
• provides unified and flexible means of algorithm design
