
RecSys 10-06-14

The Quest for the Optimal Experiment



Science & Algorithms at Netflix

Three areas of work, spanning a spectrum from causation to correlation:

Experimentation
Science, methodology, and statistical analysis of experiments

Algorithm R&D
Mathematical algorithms that get embedded into automated processes, such as our recommendation system

Predictive models
Standalone mathematical models to support decision making (e.g. title demand prediction)

Numbers shown in this presentation are not representative of Netflix's overall metric values

Netflix Experimentation: Common

Product is a set of controlled, randomized experiments, many running at once
Experiment in all areas
Plenty of rigor and attention around statistics, metrics, and analysis

Netflix Experimentation: Distinctive

Core to culture (not just process)
Curated approach: decisions not automated; scrutiny of each test (and by many people)

Paying customers who are always logged in

Monthly subscription: tests last several months; sampling (test allocation) of new members can take weeks or even months

Many devices

Retention is our core metric (OEC)


Continually improve member enjoyment

Streaming Hours is our main engagement metric


[Chart: cancel rate (0-20%) by hours streamed in the past 28 days]
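To make the relationship concrete, here is a minimal sketch of how a chart like the one above could be produced from a member-level table. The DataFrame and its columns (`hours_28d`, `canceled`) are hypothetical stand-ins, not Netflix's actual schema.

```python
# Sketch: cancel rate as a function of hours streamed in the trailing 28 days.
# The table and its column names are illustrative assumptions.
import pandas as pd

members = pd.DataFrame({
    "hours_28d": [0.5, 3, 12, 40, 90, 1, 25, 60],   # hours streamed in past 28 days
    "canceled":  [1,   1, 0,  0,  0,  1, 0,  0],    # 1 = canceled this month
})

buckets = pd.cut(members["hours_28d"], bins=[0, 5, 20, 50, 200])
cancel_rate = members.groupby(buckets, observed=True)["canceled"].mean()
print(cancel_rate)  # cancel rate per streaming-hours bucket
```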

Streaming measurement: Streaming score

Probability of retaining at each future billing cycle, based on streaming S hours at N days of tenure
[Chart: retention vs. total hours consumed during N days of membership]
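The talk does not spell out the model behind the streaming score, so the following is only a hedged sketch of the idea: estimate P(retain at the next billing cycle) from hours streamed and tenure with a simple logistic regression. The feature set, model family, and data are illustrative assumptions, not the score described on the slide.

```python
# Hedged sketch of a "streaming score": P(retain at next cycle | hours S, tenure N days).
# Features, model, and data are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training rows: (hours_streamed, tenure_days) -> retained_next_cycle
X = np.array([[2, 30], [40, 30], [5, 60], [80, 60], [1, 30], [120, 90]])
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

# Streaming score for a member with 25 hours streamed at 30 days of tenure
score = model.predict_proba([[25, 30]])[0, 1]
print(f"P(retain at next cycle) ~ {score:.2f}")
```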

Streaming measurement: KS visual & Mann-Whitney U test statistic

[Chart: cumulative distributions of streaming hours by cell, with the KS test statistic]
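A minimal sketch of the two tests named on this slide, comparing streaming-hours distributions between a control cell and a test cell. The gamma-distributed sample data is made up purely for illustration.

```python
# Compare streaming-hours distributions across cells with a KS test and a
# Mann-Whitney U test. Sample data is synthetic.
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu

rng = np.random.default_rng(0)
control = rng.gamma(shape=2.0, scale=10.0, size=5000)  # hours streamed, control cell
test    = rng.gamma(shape=2.0, scale=11.0, size=5000)  # hours streamed, test cell

ks_stat, ks_p = ks_2samp(control, test)
u_stat, u_p = mannwhitneyu(control, test, alternative="two-sided")
print(f"KS statistic = {ks_stat:.3f} (p = {ks_p:.3g}), Mann-Whitney U p = {u_p:.3g}")
```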

Streaming measurement: Thresholds with z-tests for proportions
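And a sketch of the threshold variant: compare the proportion of members who streamed more than some threshold T hours in each cell with a two-proportion z-test (here via statsmodels). The threshold and counts are invented.

```python
# Two-proportion z-test on the share of members above a streaming-hours threshold.
# Threshold and counts are illustrative, not real Netflix numbers.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

T = 10  # hours threshold (illustrative)
above_threshold = np.array([2300, 2450])  # members above T hours: [control, test]
cell_sizes = np.array([5000, 5000])

z_stat, p_value = proportions_ztest(above_threshold, cell_sizes)
print(f"z = {z_stat:.2f}, p = {p_value:.3g}")
```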

Much experimentation on the recommender system

Row selection
Video ranking
Video-video similarity
User-user similarity
Search recommendations
Popularity vs personalization
Diversity
Novelty/Freshness
Evidence

Sample and Subject Purity


Same test, different populations


Who should Netflix sample?

Classes of experience with Netflix:
Signups who are not rejoining members
Rejoining members
Existing members (any tenure)
Existing members who are beyond their free trial
Newly activating a device

Geography:
Global
US
International
Region-specific

Tenure:
1 month (free trial)
2-6 months
7+ months

Two considerations
1. For whom/what do you want to optimize?
2. Who will experience the winning test experience that gets launched?

New members by country region

[Chart: new member signups over time, by country region]

Membership by tenure

[Chart: membership over time, split into free trial, medium tenure, and longer tenure]

Hard to impact long-tenured members

[Chart: cancel rate for free trial, medium tenure, and long tenure members]

Current favored samples in algorithm testing

Global signups who are not rejoining within a year
Secondarily:
US existing members who are beyond their free trial
International (non-US) existing members who are beyond their free trial

Addressing Sampling Bias

Stratified sampling on attributes that are correlated with the core metric and independent of the test treatment (see the sketch after this list)
Regression tests for any systematic randomization process
Bias monitoring for each test's sample
Large sample sizes
Re-testing
Good judgment to recognize that the story makes sense
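A hedged sketch of what stratified test allocation can look like: within each stratum of an attribute that is correlated with the core metric but independent of the treatment, members are randomized across cells, so every cell receives the same stratum mix. The column names, cell names, and round-robin scheme are illustrative choices, not Netflix's allocation system.

```python
# Stratified allocation: randomize members into cells within each stratum
# so cell composition is balanced on the stratifying attribute.
import numpy as np
import pandas as pd

def stratified_allocation(members: pd.DataFrame, stratum_col: str,
                          cells: list, seed: int = 42) -> pd.Series:
    rng = np.random.default_rng(seed)
    allocation = pd.Series(index=members.index, dtype=object)
    for _, idx in members.groupby(stratum_col).groups.items():
        shuffled = rng.permutation(idx)
        # Deal members round-robin into cells so per-stratum cell sizes stay balanced
        for i, member in enumerate(shuffled):
            allocation[member] = cells[i % len(cells)]
    return allocation

# Hypothetical member table with a single stratifying attribute
members = pd.DataFrame({"country": ["US", "US", "BR", "BR", "MX", "MX", "US", "BR"]})
members["cell"] = stratified_allocation(members, "country", ["control", "cell_1"])
print(members.groupby(["country", "cell"]).size())
```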


In the words of Nate Silver

On predicting the 2008 recession in a world of noisy data and dependent variables:
"Not only was Hatzius's forecast correct, but it was also right for the right reasons, explaining the causes of the collapse and anticipating the effects. Hatzius refers to this chain of cause and effect as a story ... In contrast, if you just look at the economy as a series of variables and equations without any underlying structure, you are almost certain to mistake noise for a signal."
The Signal and the Noise: Why So Many Predictions Fail but Some Don't, by Nate Silver

Short- versus long-term engagement metrics


Short-term metrics we consider

Daily cancel requests
Daily streaming hours
Daily visits
Session length
Failed sessions (no play)
Take rates (CTR where the click is to play)
Page-level
Row-level
Title-level

Statistically significant differences in churn rarely stabilize until after Day 45

[Charts: churn differences by test duration]


How well do your short-term metrics correlate with your OEC, and how much improvement do you see in that correlation if you increase the time interval?
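One way to answer that question, sketched with synthetic data: compute the correlation between cumulative streaming hours over windows of increasing length and the eventual retention outcome. The data-generating process and variable names are invented purely to illustrate the calculation.

```python
# Correlation between a short-term metric (cumulative streaming hours) and the
# OEC (retention) as the measurement window grows. Data is synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
retained = rng.binomial(1, 0.7, size=n)  # OEC proxy: retained at 4 months
# Daily viewing hours; any single day is noisy, longer windows average the noise out
daily_hours = rng.gamma(1.0 + retained[:, None], 0.5, size=(n, 60))

for window in (7, 28, 56):
    hours = daily_hours[:, :window].sum(axis=1)
    corr = np.corrcoef(hours, retained)[0, 1]
    print(f"{window}-day hours vs 4-month retention: r = {corr:.3f}")
```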


Streaming signal that appears over time

[Charts: streaming-hours comparison at 1 week, 1 month, and 2 months]

Or disappears over time

[Charts: streaming-hours comparison at 1 week, 1 month, and 2 months]

Ability to predict 4-month retention using streaming hours improves with longer-term data

Key Takeaways

Exercise rigor in selecting the population to sample; it should be representative of:
The population you want to optimize for
The population that will receive the experience if launched

Remain open-minded about changing the target population as business shifts occur
Address bias on an ongoing basis
Know and apply the time duration necessary for your OEC to stabilize
Additional short-term metrics need sufficient duration to correlate well with your OEC
