Machine Learning in Industry

Ralf Herbrich
Amazon
Overview
Theory
Inference in Factor Graphs
Approximate Message Passing
Applications @ Microsoft
TrueSkill: Gamer Rating and Matchmaking
TrueSkill Through Time: History of Chess
Click-Through Rate Prediction in Online Advertising
Matchbox: Recommendation Systems
Applications @ Amazon
Background Material
http://www.coursera.org http://www.cs.ubc.ca/~murphyk/MLbook/index.html
http://www.cs.ucl.ac.uk/staff/d.barber/brml/ http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm
Overview
Theory
Applications
Future Applications
Graphical Models
Definition: Graphical representation of joint probability distribution
Nodes: Variables
Edges: Relationship between variables
Variables:
Observed Variables: Data
Unobserved Variables: Causes + Temporary/Latent
Key Questions:
(Conditional) Dependency:
Inference/Marginalisation:
Factor Graphs
Definition: Graphical representation of product structure of a function (Wiberg, 1996)
Nodes: Factors and Variables
Edges: Dependencies of factors on variables.
Semantic: Local variable dependency of factors
[Figure: factor graph over variables a, b, c]
Factor Graphs and Bayes Law
Bayes law
Factorising prior
Factorising likelihood
Inference: Sum out latent variables
[Figure: factor graph with variables s, s1, s2, t1, t2, d, y]
Factor Trees: Separation
[Figure: factor tree with variables v, w, x, y, z and factors f1(v,w), f2(w,x), f3(x,y), f4(x,z)]
Observation: Sum of products becomes product of sums of all messages from neighbouring factors to variable!
Messages: From Factors To Variables
[Figure: factor tree with variables w, x, y, z and factors f2(w,x), f3(x,y), f4(x,z)]
Observation: Factors only need to sum out all their local variables!
Messages: From Variables To Factors
[Figure: factor tree with variables x, y, z and factors f2(w,x), f3(x,y), f4(x,z)]
Observation: Variables pass on the product of all incoming messages!
The Sum-Product Algorithm
Three update equations (Aji & McEliece, 1997)
Update equations can be directly derived from the distributive law.
Calculate all marginals at the same time!
Only need to pass messages twice along each edge!
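The message-passing observations above can be checked on the factor tree from the earlier slides. The following is a minimal sketch, assuming binary variables and made-up factor tables (the numbers in f1 to f4 are hypothetical, not from the talk). It computes the marginal of x once by passing messages towards x and once by brute-force summation, so the two results can be compared.

```python
import itertools

VALUES = (0, 1)  # assumption: every variable is binary

# Hypothetical non-negative factor tables for the tree
# f1(v,w) - f2(w,x) - f3(x,y), f4(x,z) used on the slides.
f1 = {(v, w): 1.0 + v + 2 * w for v in VALUES for w in VALUES}
f2 = {(w, x): 2.0 if w == x else 0.5 for w in VALUES for x in VALUES}
f3 = {(x, y): 1.0 + 3 * x * y for x in VALUES for y in VALUES}
f4 = {(x, z): 1.5 if x != z else 1.0 for x in VALUES for z in VALUES}


def marginal_x_by_messages():
    """Marginal of x from messages flowing towards x."""
    # Factor-to-variable message f1 -> w: sum out the local variable v.
    m_f1_w = {w: sum(f1[v, w] for v in VALUES) for w in VALUES}
    # Variable-to-factor message w -> f2: product of all other incoming
    # messages; w has only one other neighbour, f1.
    m_w_f2 = m_f1_w
    # f2 -> x: sum out w, weighted by the message arriving from w.
    m_f2_x = {x: sum(f2[w, x] * m_w_f2[w] for w in VALUES) for x in VALUES}
    # f3 -> x and f4 -> x: the leaf variables y and z send the constant 1.
    m_f3_x = {x: sum(f3[x, y] for y in VALUES) for x in VALUES}
    m_f4_x = {x: sum(f4[x, z] for z in VALUES) for x in VALUES}
    # The marginal is the normalised product of all incoming messages.
    unnormalised = {x: m_f2_x[x] * m_f3_x[x] * m_f4_x[x] for x in VALUES}
    total = sum(unnormalised.values())
    return {x: unnormalised[x] / total for x in VALUES}


def marginal_x_by_brute_force():
    """Marginal of x by summing the full product over all 2^5 states."""
    unnormalised = {x: 0.0 for x in VALUES}
    for v, w, x, y, z in itertools.product(VALUES, repeat=5):
        unnormalised[x] += f1[v, w] * f2[w, x] * f3[x, y] * f4[x, z]
    total = sum(unnormalised.values())
    return {x: unnormalised[x] / total for x in VALUES}


if __name__ == "__main__":
    print(marginal_x_by_messages())
    print(marginal_x_by_brute_force())
```

The message-passing version touches each factor table once, whereas the brute-force version enumerates every joint state; on larger trees that difference is what makes the algorithm practical.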
Practical Considerations II
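Redundant computations:
Caching: Only store the marginal of each variable and the factor-to-variable messages; then each variable-to-factor message is the marginal divided by the corresponding incoming message.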

A Bayesian Interpretation
Recall Bayes Law:
Prior and Data Messages:
Message passing is separating the likelihood and prior into outgoing and incoming messages!
Problem: The exact messages from factors to variables may not be closed under products.
Solution: Approximate each marginal as well as possible within a tractable family, using a divergence measure on beliefs.
General Idea: Leave-one-out approximation
[Figure: two rows of three density plots on the range -5 to 5, each row combined as a product (* =)]
Divergence Measures
Kullback-Leibler Divergence: Expected log-odds ratio between two distributions: KL(p || q) = ∫ p(x) log(p(x) / q(x)) dx
Minimizer for Exponential Families: Matching the moments of the distribution!
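The moment-matching statement can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical bimodal target p (an equal mixture of N(-2, 1) and N(2, 1)) and a Gaussian approximation q; the grid, the step size and all function names are illustrative choices. It evaluates KL(p || q) by a Riemann sum and compares the moment-matched Gaussian with nearby alternatives.

```python
import math

STEP = 0.01
GRID = [-12.0 + STEP * i for i in range(2401)]  # covers [-12, 12]


def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def target_pdf(x):
    """Hypothetical bimodal target: 0.5 N(-2, 1) + 0.5 N(2, 1)."""
    return 0.5 * math.exp(log_normal_pdf(x, -2.0, 1.0)) + \
        0.5 * math.exp(log_normal_pdf(x, 2.0, 1.0))


def kl_target_to_gaussian(mean, var):
    """Riemann-sum estimate of KL(p || q) with q = N(mean, var)."""
    total = 0.0
    for x in GRID:
        p = target_pdf(x)
        total += p * (math.log(p) - log_normal_pdf(x, mean, var)) * STEP
    return total


def matched_moments():
    """Mean and variance of the target, computed on the same grid."""
    mean = sum(x * target_pdf(x) for x in GRID) * STEP
    var = sum((x - mean) ** 2 * target_pdf(x) for x in GRID) * STEP
    return mean, var


if __name__ == "__main__":
    mean, var = matched_moments()
    print("matched moments:", mean, var)
    print("KL at matched moments:", kl_target_to_gaussian(mean, var))
    print("KL with smaller variance:", kl_target_to_gaussian(mean, 0.8 * var))
    print("KL with shifted mean:", kl_target_to_gaussian(mean + 0.5, var))
```

For this target the matched Gaussian is broad enough to cover both modes, which is the behaviour the slides associate with capturing all of the uncertainty.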
General α-Divergence:
Special Cases:
α-Divergence in Pictures
When to use which α-Divergence?
α=0 resolves multi-modality in the posterior at the expense of too much certainty!
[Figure: plot over x and y]
When to use which α-Divergence?
α=1 captures all uncertainty for uni-modal posterior distributions!
[Figure: plot over w1 and w2]
Sample (ctd)
Overview
Theory
TrueSkill
Joint work with Thore Graepel, Tom Minka & Phillip Trelford
Motivation
Competition is central to our lives

Innate biological trait
Driving principle of many sports
Chess Rating for fair competition
ELO: Developed in 1960 by Árpád Imre Élő
Matchmaking system for tournaments
Challenges of online gaming
Learn from few match outcomes efficiently
Support multiple teams and multiple players per team
The Skill Rating Problem
Given:
Match outcomes: Orderings among k teams
consisting of n1, n2 , ..., nk players, respectively
Questions:
Skill si for each player such that
Global ranking among all players

Fair matches between teams of players
Two Player Match Outcome Model
Latent Gaussian performance model for fixed skills

Possible outcomes: Player 1 wins over 2 (and vice versa)
[Figure: factor graph with skills s1, s2, performances p1, p2 and outcome y12]
Two Team Match Outcome Model
Skill of a team is the sum of the skills of its members
[Figure: factor graph with skills s1 to s4, team performances t1, t2 and outcome y12]
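The two-player and two-team outcome models can be written down as a win probability for fixed skills. The following is a minimal sketch, assuming each player's performance is the skill plus independent Gaussian noise with standard deviation `beta`, and that a team's performance is the sum of its members' performances; `beta` and the function names are assumptions for illustration. The two-player model is the special case of teams of size one.

```python
import math


def standard_normal_cdf(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def team_win_probability(team_1_skills, team_2_skills, beta=1.0):
    """P(team 1 outperforms team 2) for fixed skills.

    Assumption: performance_i ~ N(skill_i, beta^2) independently, and a
    team's performance is the sum over its members. The difference of the
    two team performances is then Gaussian with mean equal to the skill
    gap and variance (number of players) * beta^2.
    """
    skill_gap = sum(team_1_skills) - sum(team_2_skills)
    num_players = len(team_1_skills) + len(team_2_skills)
    return standard_normal_cdf(skill_gap / (beta * math.sqrt(num_players)))


if __name__ == "__main__":
    # Two-player match: teams of size one.
    print(team_win_probability([1.0], [0.0]))
    # Two-team match with two players each.
    print(team_win_probability([1.0, 0.5], [0.8, 0.2]))
```

Draws are ignored here; handling them would need an extra margin parameter that the slide text does not specify.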
Multiple Team Match Outcome Model
Possible outcomes: Permutations of the teams
[Figure: factor graph with skills s1 to s4, team performances t1, t2, t3 and outcome y]
Multiple Team Match Outcome Model
But we are interested in the (Gaussian) posterior!
[Figure: factor graph with skills s1 to s4, team performances t1, t2, t3 and pairwise outcomes y12, y23]
Efficient Approximate Inference
Gaussian Prior Factors
Ranking Likelihood Factors
Fast and efficient approximate message passing using Expectation Propagation
[Figure: factor graph with skills s1 to s4, team performances t1, t2, t3 and pairwise outcomes y12, y23]
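For the simplest case, one match between two players with no draws, the approximate inference step reduces to matching the moments of a truncated Gaussian. The following is a minimal sketch of that update, assuming Gaussian skill beliefs given as (mean, variance) pairs and a performance noise standard deviation `beta`; it omits draws, teams and skill dynamics, and the function names are illustrative rather than taken from the talk.

```python
import math


def normal_pdf(x):
    """Density of N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def update_after_win(winner, loser, beta=1.0):
    """Moment-matched Gaussian skill beliefs after `winner` beats `loser`.

    Each belief is a (mean, variance) pair. Assumptions: no draws, one
    player per side, performance noise variance beta^2 per player.
    """
    (mu_w, var_w), (mu_l, var_l) = winner, loser
    c_squared = 2.0 * beta ** 2 + var_w + var_l
    c = math.sqrt(c_squared)
    t = (mu_w - mu_l) / c
    # Correction terms for the mean and variance of a truncated Gaussian.
    # Note: normal_cdf(t) can underflow for extremely surprising outcomes.
    v = normal_pdf(t) / normal_cdf(t)
    w = v * (v + t)
    new_winner = (mu_w + var_w / c * v, var_w * (1.0 - var_w / c_squared * w))
    new_loser = (mu_l - var_l / c * v, var_l * (1.0 - var_l / c_squared * w))
    return new_winner, new_loser


if __name__ == "__main__":
    print(update_after_win((0.0, 1.0), (0.0, 1.0)))
```

The size of the mean shift scales with each player's own variance, so a player with an uncertain rating moves further after a single game than one whose rating is already well determined.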
Applications to Online Gaming
Leaderboard
Global ranking of all players
Matchmaking
For gamers: Most uncertain outcome
For inference: Most informative
Both are equivalent!
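The matchmaking criterion can be sketched as picking the opponent whose predicted outcome is closest to a coin flip, since a binary outcome carries the most information when its probability is 0.5. The following is a minimal sketch, assuming Gaussian skill beliefs given as (mean, variance) pairs, a performance noise standard deviation `beta`, and no draws; the player pool and all names are hypothetical.

```python
import math


def predictive_win_probability(player, opponent, beta=1.0):
    """P(player beats opponent), integrating over both skill beliefs."""
    (mu_1, var_1), (mu_2, var_2) = player, opponent
    z = (mu_1 - mu_2) / math.sqrt(2.0 * beta ** 2 + var_1 + var_2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def outcome_entropy(p):
    """Entropy in bits of a win/loss outcome with win probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def most_uncertain_opponent(player, pool, beta=1.0):
    """Name of the opponent in `pool` with the highest outcome entropy."""
    return max(
        pool,
        key=lambda name: outcome_entropy(
            predictive_win_probability(player, pool[name], beta)
        ),
    )


if __name__ == "__main__":
    me = (25.0, 4.0)
    pool = {"novice": (10.0, 4.0), "peer": (24.0, 4.0), "expert": (40.0, 4.0)}
    print(most_uncertain_opponent(me, pool))
```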
Experimental Setup
Data Set: Halo 2 Beta

3 game modes
Free-for-All
Two Teams
1 vs. 1
> 60,000 match outcomes
6,000 players
6 weeks of game play
Publicly available
Convergence Speed
[Figure: Level (0 to 40) against Number of Games (0 to 400) for players char and SQLWildman, each shown with TrueSkill and with Halo 2 rank]
Convergence Speed (ctd.)
[Figure: Winning probability (0% to 100%) against Number of games played (0 to 500), with curves for char wins, SQLWildman wins and Both players draw; annotation: 5/8 games won by char]

Xbox 360 & Halo 3
Xbox 360 Live

Launched in September 2005
Every game uses TrueSkill to match players
> 10 million players
> 2 million matches per day
> 2 billion hours of gameplay
Halo 3
Launched on 25th September 2007
Largest entertainment launch in history
> 200,000 players concurrently (peak: 1,000,000)
Halo 3 in Action
Halo 3 Public Beta Analysis
Skill Distributions of Online Games
Golf (18 holes): 60 levels
Car racing (3-4 laps): 40 levels
UNO (chance game): 10 levels

TrueSkill™ Through Time: Chess
Model time-series of skills by smoothing across time
History of Chess
3.5M game outcomes (ChessBase)
20 million variables (each of 200,000 players in each year of lifetime + latent variables)
40 million factors
[Figure: factor graph linking skills s(t,i), s(t,j) and performances p(t,i), p(t,j) in year t to skills s(t+1,i), s(t+1,j) and performances p(t+1,i), p(t+1,j) in year t+1]
ChessBase Analysis: 1850 - 2006
[Figure: Skill estimate (1400 to 3000) against Year (1850 to 2006), with curves labelled Adolf Anderssen, Paul Morphy, Wilhelm Steinitz, Emanuel Lasker, Jose Raul Capablanca, Mikhail Botvinnik, Boris V Spassky, Robert James Fischer, Anatoly Karpov and Garry Kasparov]
Online Advertising
Joint work with Thore Graepel, Joaquin Quiñonero Candela, Onno Zoeter, Tom Borchert, Phillip Trelford
Why Predict Probability-of-Click?
Display (according to expected revenue) | Charge (per click)
$1.00 * 10% = $0.10 | $0.80
$2.00 * 4% = $0.08 | $1.25
$0.10 * 50% = $0.05 | $0.05
Advantages of improved probability estimates:
Increase user satisfaction by better targeting
Fairer charges to advertisers
Increase revenue by showing ads with high click-thru rate
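The arithmetic on this slide can be reproduced in a few lines. The following is a minimal sketch that ranks ads by bid times click probability, using the three bids and probabilities from the slide. The charging rule is an assumption: a second-price style rule in which each ad pays the smallest bid that would keep its position happens to reproduce the $0.80 and $1.25 charges shown, but the talk does not state the rule, and the charge for the last ad (which would depend on something like a reserve price) is left out.

```python
def rank_by_expected_revenue(ads):
    """Sort (name, bid, p_click) tuples by bid * p_click, highest first."""
    return sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)


def charges_per_click(ranked_ads):
    """Per-click charges under an assumed second-price style rule.

    Each ad pays the expected revenue of the ad ranked directly below it,
    divided by its own click probability. The last ad has no ad below it,
    so no charge is computed for it.
    """
    charges = []
    for current, below in zip(ranked_ads, ranked_ads[1:]):
        charges.append((current[0], below[1] * below[2] / current[2]))
    return charges


if __name__ == "__main__":
    # Bids and click probabilities from the slide; the names are made up.
    ads = [("ad_b", 2.00, 0.04), ("ad_c", 0.10, 0.50), ("ad_a", 1.00, 0.10)]
    ranked = rank_by_expected_revenue(ads)
    for name, bid, p_click in ranked:
        print(name, round(bid * p_click, 4))
    print(charges_per_click(ranked))
```

Because the charge divides by the ad's own click probability, an overestimated probability undercharges the advertiser and an underestimated one overcharges, which is one reading of the "fairer charges" point.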
Uncertainty: Bayesian Probabilities
[Figure: feature values are combined (+) to give p(pClick). Features shown: Client IP (102.34.12.201, 15.70.165.9, 221.98.2.187, 92.154.3.86), Match Type (Exact Match, Broad Match), Position (ML-1, SB-1, SB-2)]
Training Algorithm in Action
[Figure: weights w1 and w2 are summed (+) into c; panels labelled Prediction and Training/Update, with outcomes Click and No Click]
Inference: An Optimization View
Accuracy
MatchBox
Joint work with Thore Graepel, Joaquin Quiñonero Candela, David Stern, Ulrich Paquet
Items:
1: Crime, Tarantino, ID=4243
2: Drama, Mendes, ID=534
3: Action, Campbell, ID=9834
4: Comedy, Mitchell, ID=6345
5: Action, Donner, ID=2452
6: Action, Wachowski, ID=9864
Users:
A: Programmer, Age<30, ID=33451
B: Student, Age<30, ID=33431
C: Shopkeeper, Age>45, ID=4321
D: Student, Age<30, ID=5641
Matchbox With Metadata
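User Metadata: ID=234, Male, British
Item Metadata: Camera, SLR
[Figure: factor graph in which user weights u01, u11, u21 and u02, u12, u22 are summed (+) into user traits s1 and s2, item weights v11, v21 and v12, v22 are summed (+) into item traits t1 and t2, and the traits feed the rating potential r]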
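The structure on the Matchbox With Metadata slide can be sketched with fixed weights: each active metadata feature contributes a weight vector, the contributions are summed into a trait vector, and the rating potential is the inner product of the user and item traits. The following is a minimal sketch under those assumptions; all weight values are made up, and it uses point estimates where the model in the talk keeps uncertainty about every weight.

```python
def trait_vector(active_features, weights, num_traits):
    """Sum the weight vectors of the active features into one trait vector."""
    traits = [0.0] * num_traits
    for feature in active_features:
        for k in range(num_traits):
            traits[k] += weights[feature][k]
    return traits


def rating_potential(user_features, item_features,
                     user_weights, item_weights, num_traits=2):
    """Inner product of the user trait vector s and the item trait vector t."""
    s = trait_vector(user_features, user_weights, num_traits)
    t = trait_vector(item_features, item_weights, num_traits)
    return sum(s_k * t_k for s_k, t_k in zip(s, t))


if __name__ == "__main__":
    # Hypothetical weights for the metadata values shown on the slide.
    user_weights = {
        "ID=234": [0.5, -0.2],
        "Male": [0.1, 0.3],
        "British": [-0.1, 0.4],
    }
    item_weights = {
        "Camera": [1.0, 0.2],
        "SLR": [0.4, -0.6],
    }
    print(rating_potential(["ID=234", "Male", "British"], ["Camera", "SLR"],
                           user_weights, item_weights))
```

Sharing weights across metadata values is what lets a model of this shape say something about a new user or item before any ratings are observed for its ID.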
Recommender System: MatchBox
[Figure: graph connecting users (mark, ralf, tao, sheryl), movies (Heat, The Rock, The Godfather), directors (R. Scott, C. Eastwood, Q. Tarantino, R. Howard), gender (Male, Female) and a social network, with likes and dislikes edges between users and movies]
Message Passing For Matchbox
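[Figure: Matchbox factor graph in which user weights u01 to u22 are summed (+) into user traits s1, s2, item weights v11 to v22 are summed (+) into item traits t1, t2, and the products s1 * t1 and s2 * t2 feed the rating r]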
User/Item Trait Space
[Figure: two-dimensional trait space (both axes -1.5 to 1.5) showing Users and Movies, with the Preference Cone for user 145035. Labelled movies: 24: Season 3, Adaptation, 24: Season 2, A Clockwork Orange, A Knights Tale, AI: Artificial Intelligence, A Cinderella Story]
Incremental Training with ADF
[Figure: grid of Items 1 to 6 against Users, with users B and D labelled]
ADF: Message Passing Iteration 1
[Figure: trait space plot, both axes -1.5 to 1.5]
Message Passing Iteration 2
[Figure: three trait space plots, both axes -1.5 to 1.5]
Feedback Models
[Figure: Matchbox factor graph with user weights u01 to u22, item weights v11 to v22, traits s1, s2, t1, t2 and rating r]
Feedback on r shown on successive slides:
= 3
> > < < against t0, t1, t2, t3
> 0
Message Passing: Compositionality
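[Figure: factor graph composed of a User Model (weights u11 to u22 summed into s1, s2), an Item Model (weights v11 to v22 summed into t1, t2), their product (*), a Context Model (x1 to x4, added with +) giving r, and a Feedback Model (>0)]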

Overview
Theory
ML Opportunities @ Amazon
Retail: Demand Forecasting; Vendor Lead Time Prediction; Pricing; Packaging
Customers: Product Recommendation; Product Search; Visual Search; Product Ads; Shopping Advice
Seller: Fraud Detection; Predictive Help; Seller Search & Crawling
Catalog: Browse-Node Classification; Meta-data validation; Review Analysis
Digital: Named-Entity Extraction; XRay; Plagiarism Detection
Customer Problem
Substitute Detection
Prediction
XRay
Machine Translation
Machine Translation: Deep Dive
p(English | Chinese) = p(English) p(Chinese | English) / p(Chinese)
Language Model p(English): What are good English sentences?
Translation Model p(Chinese | English): What English sentences account well for a given Chinese sentence?
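The decomposition into a language model and a translation model suggests a simple decoding rule: among candidate English sentences, pick the one that maximises p(English) times p(Chinese | English); p(Chinese) is the same for every candidate and can be dropped. The following is a minimal sketch with toy lookup tables; the sentences and every probability in them are invented for illustration, and real systems score candidates with learned models rather than dictionaries.

```python
import math


def best_translation(chinese, candidates, language_model, translation_model):
    """Candidate maximising log p(English) + log p(Chinese | English)."""
    def score(english):
        return (math.log(language_model[english])
                + math.log(translation_model[(chinese, english)]))
    return max(candidates, key=score)


if __name__ == "__main__":
    source = "你好"
    candidates = ["hello", "you good"]
    # Hypothetical p(English): fluent sentences get more mass.
    language_model = {"hello": 0.02, "you good": 0.0001}
    # Hypothetical p(Chinese | English): the literal rendering explains
    # the source well but is not fluent English.
    translation_model = {(source, "hello"): 0.3, (source, "you good"): 0.6}
    print(best_translation(source, candidates, language_model, translation_model))
```

In this toy case the translation model alone would prefer the literal candidate; the language model is what tips the choice towards the fluent one.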
Thanks!
