
RecSys 10-06-14

The Quest for the Optimal Experiment



Science & Algorithms at Netflix

Three areas of work, spanning a spectrum from causation to correlation:

Experimentation
Science, methodology, and statistical analysis of experiments

Algorithm R&D
Mathematical algorithms that get embedded into automated processes, such as our recommendation system

Predictive models
Standalone mathematical models to support decision making (e.g. title demand prediction)

Numbers shown in this presentation are not representative of Netflix's overall metric values

Netflix Experimentation: Common

Product is a set of controlled, randomized experiments, many running at once
Experiment in all areas
Plenty of rigor and attention around statistics, metrics, and analysis

Netflix Experimentation: Distinctive

Core to culture (not just process)
Curated approach: decisions not automated; scrutiny of each test (and by many people)

Paying customers who are always logged in

Monthly subscription: tests last several months; sampling (test allocation) of new members can take weeks or even months

Many devices

Retention is our core metric (OEC)


Continually improve member enjoyment

Streaming Hours is our main engagement metric


[Chart: cancel rate (0-20%) by hours streamed in the past 28 days]
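To make the relationship concrete, here is a minimal sketch of how a chart like the one above could be produced from a member-level table. The DataFrame and its columns (`hours_28d`, `canceled`) are hypothetical stand-ins, not Netflix's actual schema.

```python
# Sketch: cancel rate as a function of hours streamed in the trailing 28 days.
# The table and its column names are illustrative assumptions.
import pandas as pd

members = pd.DataFrame({
    "hours_28d": [0.5, 3, 12, 40, 90, 1, 25, 60],   # hours streamed in past 28 days
    "canceled":  [1,   1, 0,  0,  0,  1, 0,  0],    # 1 = canceled this month
})

buckets = pd.cut(members["hours_28d"], bins=[0, 5, 20, 50, 200])
cancel_rate = members.groupby(buckets, observed=True)["canceled"].mean()
print(cancel_rate)  # cancel rate per streaming-hours bucket
```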

Streaming measurement: Streaming score

Probability of retaining at each future billing cycle, based on streaming S hours at N days of tenure
[Chart: retention vs. total hours consumed during N days of membership]
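The talk does not spell out the model behind the streaming score, so the following is only a hedged sketch of the idea: estimate P(retain at the next billing cycle) from hours streamed and tenure with a simple logistic regression. The feature set, model family, and data are illustrative assumptions, not the score described on the slide.

```python
# Hedged sketch of a "streaming score": P(retain at next cycle | hours S, tenure N days).
# Features, model, and data are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training rows: (hours_streamed, tenure_days) -> retained_next_cycle
X = np.array([[2, 30], [40, 30], [5, 60], [80, 60], [1, 30], [120, 90]])
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

# Streaming score for a member with 25 hours streamed at 30 days of tenure
score = model.predict_proba([[25, 30]])[0, 1]
print(f"P(retain at next cycle) ~ {score:.2f}")
```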

Streaming measurement: KS visual & Mann-Whitney U test statistic

[Chart: cumulative distributions of streaming hours by cell, with the KS test statistic]
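A minimal sketch of the two tests named on this slide, comparing streaming-hours distributions between a control cell and a test cell. The gamma-distributed sample data is made up purely for illustration.

```python
# Compare streaming-hours distributions across cells with a KS test and a
# Mann-Whitney U test. Sample data is synthetic.
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu

rng = np.random.default_rng(0)
control = rng.gamma(shape=2.0, scale=10.0, size=5000)  # hours streamed, control cell
test    = rng.gamma(shape=2.0, scale=11.0, size=5000)  # hours streamed, test cell

ks_stat, ks_p = ks_2samp(control, test)
u_stat, u_p = mannwhitneyu(control, test, alternative="two-sided")
print(f"KS statistic = {ks_stat:.3f} (p = {ks_p:.3g}), Mann-Whitney U p = {u_p:.3g}")
```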

Streaming measurement: Thresholds with z-tests for proportions
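And a sketch of the threshold variant: compare the proportion of members who streamed more than some threshold T hours in each cell with a two-proportion z-test (here via statsmodels). The threshold and counts are invented.

```python
# Two-proportion z-test on the share of members above a streaming-hours threshold.
# Threshold and counts are illustrative, not real Netflix numbers.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

T = 10  # hours threshold (illustrative)
above_threshold = np.array([2300, 2450])  # members above T hours: [control, test]
cell_sizes = np.array([5000, 5000])

z_stat, p_value = proportions_ztest(above_threshold, cell_sizes)
print(f"z = {z_stat:.2f}, p = {p_value:.3g}")
```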

Much experimentation on the recommender system

Row selection
Video ranking
Video-video similarity
User-user similarity
Search recommendations
Popularity vs personalization
Diversity
Novelty/Freshness
Evidence

Sample and Subject Purity


Same test, different populations


Who should Netflix sample?

Classes of experience with Netflix:
Signups who are not rejoining members
Rejoining members
Existing members (any tenure)
Existing members who are beyond their free trial
Newly activating a device

Geography:
Global
US
International
Region-specific

Tenure:
1 month (free trial)
2-6 months
7+ months

Two considerations
1. For whom/what do you want to optimize?
2. Who will experience the winning test experience that gets launched?

New members by country region

[Chart: new member signups over time, by country region]

Membership by tenure

[Chart: membership over time, split into free trial, medium tenure, and longer tenure]

Hard to impact long-tenured members

[Chart: cancel rate for free trial, medium tenure, and long tenure members]

Current favored samples in algorithm testing

Global signups who are not rejoining within a year
Secondarily:
US existing members who are beyond their free trial
International (non-US) existing members who are beyond their free trial

Addressing Sampling Bias

Stratified sampling on attributes that are correlated with the core metric and independent of the test treatment (see the sketch after this list)
Regression tests for any systematic randomization process
Bias monitoring for each test's sample
Large sample sizes
Re-testing
Good judgment to recognize that the story makes sense
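A hedged sketch of what stratified test allocation can look like: within each stratum of an attribute that is correlated with the core metric but independent of the treatment, members are randomized across cells, so every cell receives the same stratum mix. The column names, cell names, and round-robin scheme are illustrative choices, not Netflix's allocation system.

```python
# Stratified allocation: randomize members into cells within each stratum
# so cell composition is balanced on the stratifying attribute.
import numpy as np
import pandas as pd

def stratified_allocation(members: pd.DataFrame, stratum_col: str,
                          cells: list, seed: int = 42) -> pd.Series:
    rng = np.random.default_rng(seed)
    allocation = pd.Series(index=members.index, dtype=object)
    for _, idx in members.groupby(stratum_col).groups.items():
        shuffled = rng.permutation(idx)
        # Deal members round-robin into cells so per-stratum cell sizes stay balanced
        for i, member in enumerate(shuffled):
            allocation[member] = cells[i % len(cells)]
    return allocation

# Hypothetical member table with a single stratifying attribute
members = pd.DataFrame({"country": ["US", "US", "BR", "BR", "MX", "MX", "US", "BR"]})
members["cell"] = stratified_allocation(members, "country", ["control", "cell_1"])
print(members.groupby(["country", "cell"]).size())
```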


In the words of Nate Silver

On predicting the 2008 recession in a world of noisy data and dependent variables:
"Not only was Hatzius's forecast correct, but it was also right for the right reasons, explaining the causes of the collapse and anticipating the effects. Hatzius refers to this chain of cause and effect as a story ... In contrast, if you just look at the economy as a series of variables and equations without any underlying structure, you are almost certain to mistake noise for a signal."
The Signal and the Noise: Why So Many Predictions Fail but Some Don't, by Nate Silver

Short- versus long-term engagement metrics


Short-term metrics we consider

Daily cancel requests
Daily streaming hours
Daily visits
Session length
Failed sessions (no play)
Take rates (CTR where the click is to play)
Page-level
Row-level
Title-level

Statistically significant differences in churn rarely stabilize until after Day 45

[Charts: churn differences by test duration]


How well do your short-term metrics correlate with your OEC, and how much improvement do you see in that correlation if you increase the time interval?
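One way to answer that question, sketched with synthetic data: compute the correlation between cumulative streaming hours over windows of increasing length and the eventual retention outcome. The data-generating process and variable names are invented purely to illustrate the calculation.

```python
# Correlation between a short-term metric (cumulative streaming hours) and the
# OEC (retention) as the measurement window grows. Data is synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
retained = rng.binomial(1, 0.7, size=n)  # OEC proxy: retained at 4 months
# Daily viewing hours; any single day is noisy, longer windows average the noise out
daily_hours = rng.gamma(1.0 + retained[:, None], 0.5, size=(n, 60))

for window in (7, 28, 56):
    hours = daily_hours[:, :window].sum(axis=1)
    corr = np.corrcoef(hours, retained)[0, 1]
    print(f"{window}-day hours vs 4-month retention: r = {corr:.3f}")
```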


Streaming signal that appears over time

[Charts: streaming-hours comparison at 1 week, 1 month, and 2 months]

Or disappears over time

[Charts: streaming-hours comparison at 1 week, 1 month, and 2 months]

Ability to predict 4-month retention using streaming hours improves with longer-term data

Key Takeaways

Exercise rigor in selecting the population to sample; it should be representative of:
The population you want to optimize for
The population that will receive the experience if launched

Remain open-minded about changing the target population as business shifts occur
Address bias on an ongoing basis
Know and apply the time duration necessary for your OEC to stabilize
Additional short-term metrics need sufficient duration to correlate well with your OEC
