Demystifying (defining) big-data
The challenge of Big Data (by IBM)
Four main sources
1. Enterprise Systems
2. Social Media
Four main sources
1. Enterprise Systems
2. Social Media
3. Mobile
Mobile targeting – time and geography
Geo-Conquesting
https://vimeo.com/44351185
Four main sources
1. Enterprise Systems
2. Social Media
3. Mobile
4. The Internet of Things
IoT is bringing about an explosion in
connected devices and huge data sets
Internet-of-Things Projections
$7.1tn IoT Solutions Revenue | IDC
http://postscapes.com/internet-of-things-market-size
iBeacons – IoT used in the store
iBeacon indoor positioning systems can interact directly with smartphones,
e.g. using Bluetooth Low Energy (BLE)
Targeting via iBeacon
Applications managing infrastructure
Class exercise: smart chairs
IoT Leads to Better Measurement
Data as a disruptive force
Analytics as a “disruptive technology”
Data driven healthcare
Use of IT in agriculture is growing
Jojoba Israel: lots of data are constantly collected but analyzed separately
[Figure: weather-station, irrigation, yield, and soil data held in separate Excel files; no integration]
B2B in the data area?
What can business do?
(Analytics)
Data-Driven Optimization
[Figure: Status Quo vs. Data-Driven Optimization]
Analytics Ladder
Descriptive vs. predictive
Descriptive (unsupervised learning; output: patterns):
• Clustering
• Association rules
Predictive (supervised learning; output: models):
• Classification
• Prediction
4 Pillars of business analytics
Clustering
[Figure: scatter plot of points labeled m and f, with Age (20 to 50) on the horizontal axis and a vertical axis running from 20 to 100, grouped into Cluster 1, Cluster 2, and Cluster 3]
How about going beyond eyeballing the data in 2-3 dimensions?
Need general-purpose techniques to deal with any-dimensional data
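One such general-purpose technique is k-means clustering, which works the same way in any number of dimensions. The sketch below is a minimal pure-Python version, not the method used in the slides; the visitor records (age, monthly spend) are invented for illustration, and the farthest-point initialization is an assumption chosen to keep the run deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance; works for points of any dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centroids(points, k):
    """Deterministic farthest-point initialization (an illustrative choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(farthest)
    return centroids


def k_means(points, k, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical (age, monthly spend) records, invented for this example.
visitors = [
    (22, 25), (25, 20), (24, 30),     # younger, low spend
    (35, 95), (38, 100), (36, 90),    # mid-age, high spend
    (48, 22), (50, 28), (47, 18),     # older, low spend
]

labels, centroids = k_means(visitors, k=3)
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Because the distance function iterates over all coordinates, adding more attributes per visitor requires no change to the algorithm, although attributes on very different scales would normally be standardized first.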
Example: clustering mall visitors
QUESTION: What data sources can be used?
APPROACH: Analysis of Wi-Fi usage mapped to physical space.
RESULT: Location- and behavior-based insights for tenant and mall strategies.
Classification
Assigning each individual to one of several
pre-defined categories (or classes)
• Objective:
– to predict the class when it is unknown or will only be determined in
the future,
– based on rules derived from similar data where the
classification is known
[Figure: cardiac rhythm classification]
Prediction (Regression)
Stock Price
http://www.blueflag.com.au/blog/why-australians-wont-buy-1-million-cars-2011
http://mechonomic.blogspot.com/2010/07/ibm-share-price-on-decline.html
Example: prediction of mall visitors’ next step
QUESTION: What data sources can be used?
APPROACH: Analysis of Wi-Fi usage mapped to physical space.
RESULT: ?
The result in classical statistics:
the same sample is used to make the estimate AND
to determine how reliable the estimate is
What is Different from Classical Statistics?
The result:
Fit a model with one sample
Assess its performance with another sample
Use computationally intensive techniques
(examples: classification trees, neural networks)
The danger:
- Over-fitting: a model fit so closely to the available
sample that it describes not merely structural
characteristics of the data, but random
peculiarities as well
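The gap between "fit with one sample" and "assess with another" can be seen in a small simulation. The sketch below is illustrative only: the data are synthetic (a noisy threshold rule invented here), and the 1-nearest-neighbor classifier stands in for any model flexible enough to memorize its training sample.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def make_sample(n):
    """Synthetic data: label is 1 when x > 0.5, then flipped 30% of the time."""
    sample = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < 0.3:  # random peculiarity (noise)
            label = 1 - label
        sample.append((x, label))
    return sample


def one_nn_predict(train, x):
    """Predict the label of the closest training point (memorizes the sample)."""
    _, nearest_label = min(train, key=lambda row: abs(row[0] - x))
    return nearest_label


def accuracy(train, sample):
    """Share of points in `sample` that the 1-NN model built on `train` gets right."""
    hits = sum(one_nn_predict(train, x) == y for x, y in sample)
    return hits / len(sample)


train = make_sample(200)
holdout = make_sample(200)

train_acc = accuracy(train, train)      # each point is its own nearest neighbor
holdout_acc = accuracy(train, holdout)  # honest estimate on unseen data

print(f"training accuracy: {train_acc:.2f}")
print(f"holdout accuracy:  {holdout_acc:.2f}")
```

The training accuracy is perfect because the model has memorized the noise along with the structure; the holdout figure is the one that indicates how the model would behave on new data.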
Over-fitting
Let’s try (supervised learning)
Terminology we will need
Example: Buyer / non-buyer classification
Observation  Income ($000s)  Lot Size (000s sq. ft.)  Class (1 = buyer, 2 = non-buyer)
1 60 18.4 1
2 85.5 16.8 1
3 64.8 21.6 1
4 61.5 20.8 1
5 87 23.6 1
6 110.1 19.2 1
7 108 17.6 1
8 82.8 22.4 1
9 69 20 1
10 93 20.8 1
11 51 22 1
12 81 20 1
13 75 19.6 2
14 52.8 20.8 2
15 64.8 17.2 2
16 43.2 20.4 2
17 84 17.6 2
18 49.2 17.6 2
19 59.4 16 2
20 66 18.4 2
21 47.4 16.4 2
22 33 18.8 2
23 51 14 2
24 63 14.8 2
Graphical View
Decision Trees
X2 <= 21
X2 <= 19?
X2 <= 19
X1 < 84.75
Final “Pure” Split
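As a rough check of one of the splits named above, the sketch below measures how much a split at X2 <= 19 reduces class impurity on the 24 observations. It assumes that X2 denotes lot size and that Gini impurity is the criterion; the slides state neither, and the function names are illustrative.

```python
# (income in $000s, lot size in 000s sq. ft., class) from the table above;
# class 1 = buyer, class 2 = non-buyer.
rows = [
    (60, 18.4, 1), (85.5, 16.8, 1), (64.8, 21.6, 1), (61.5, 20.8, 1),
    (87, 23.6, 1), (110.1, 19.2, 1), (108, 17.6, 1), (82.8, 22.4, 1),
    (69, 20, 1), (93, 20.8, 1), (51, 22, 1), (81, 20, 1),
    (75, 19.6, 2), (52.8, 20.8, 2), (64.8, 17.2, 2), (43.2, 20.4, 2),
    (84, 17.6, 2), (49.2, 17.6, 2), (59.4, 16, 2), (66, 18.4, 2),
    (47.4, 16.4, 2), (33, 18.8, 2), (51, 14, 2), (63, 14.8, 2),
]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def split_impurity(data, feature, threshold):
    """Size-weighted Gini impurity of the two halves of a split."""
    left = [r[2] for r in data if r[feature] <= threshold]
    right = [r[2] for r in data if r[feature] > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(data)


LOT_SIZE = 1  # column index of X2 (assumed to be lot size)

impurity_before = gini([r[2] for r in rows])
impurity_after = split_impurity(rows, LOT_SIZE, 19)

print(impurity_before)  # 0.5   (12 buyers, 12 non-buyers)
print(impurity_after)   # 0.375 (each side is a 9-to-3 mix)
```

Recursive partitioning repeats this step inside each resulting rectangle until the leaves are pure or a stopping rule applies.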
Classification Tree
[Figure: classification tree with decision nodes and leaf nodes]
Why are Decision Trees Popular?
• 2 key ideas
1. Recursive partitioning of the space of the
independent variables
2. Pruning using validation data
Key performance metrics
Example 1: predict automotive prices
Example 2: income classification
Moving from correlation to causation
Leveraging Analytics for Competitive
Advantage
Example
[Figure: a BI analyst runs a report]
An (older) Example
Source: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Search Engine Ads with Site Links
[Figure: ad variants A and B]
Our intuition is poor
[Figure: outcomes of ideas tested; 1/3 were statistically significant, but negative]
Source: Ronny Kohavi, MSFT
Non-tech companies where A/B testing is standard operating procedure
• Walmart
• Hertz
• Singapore Airlines
• Capital One
• (Not to mention Google, Amazon – moving credit card
offers to checkout page was a $10 million effect! –
Booking.com, Facebook, Uber, Airbnb)
• Requires infrastructure (tools exist)
– instrumentation (to record such things as clicks, mouse
hovers, and event times)
– data pipelines, and
– data scientists
Two Methodological Paradigms
The Goal of Causal Analysis
Approaches to Study Causation
• Observational
The researcher looks for natural differences across cases
and tries to find a single input that might have caused
the variation in outcomes
• Experimental
The researcher conducts an experiment… If outcomes
vary across the treatment and control groups, the
difference must be due to the catalyst (Teele 2013)
Diet Soda Anyone?
• Is it:
– Consuming Diet Coke makes you fat?
or
– Overweight people are ordering Diet Coke because
they want to lose weight
Storks Deliver Babies (p=0.008)
R. Matthews (2000)
http://priceonomics.com/do-storks-deliver-babies/
Causality
[Figure: causal diagram with nodes A, B, and C]
Figure 1.1 - Morgan S. and Winship
Cause and Effect
Correlation does not imply causation
• The First Law of Data Science: “To
determine if a correlation is true in the real
world, it must be verified empirically”
• Association
– A statistically significant correlation or regression
coefficient - the likelihood of its occurrence by
chance alone is small.
• Time order of occurrence
– The causal variable must precede the outcome
variable in time
• Eliminating other potential causes
Advantages of Experiments
A..Z test?
Challenges
• What to measure?
– Define the OEC – Overall Evaluation Criterion
• Minimize difference between control/treatment
group
• How long should we run the experiment?
• How to measure the significance of the
results?
– Statistical significance
– Economic significance
• Heterogeneous treatment effect
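For the question of statistical significance, one common choice when the OEC is a conversion rate is a two-proportion z-test. The sketch below is one possible approach, not necessarily the one intended in the slides; the visitor and conversion counts are hypothetical, and the function name is illustrative.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    Returns (lift, z, p_value), where lift is the absolute difference
    in conversion rate between treatment (B) and control (A).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    lift = p_b - p_a
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = lift / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return lift, z, p_value


# Hypothetical counts: 10,000 visitors per arm.
lift, z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000,
                                         conv_b=250, n_b=10_000)
print(f"lift = {lift:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```

Statistical significance answers only whether the lift is distinguishable from chance. Economic significance is a separate judgment: the lift multiplied by traffic and value per conversion has to outweigh the cost of shipping the change.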
When Should We Use Experiments?
• Choice between known options
• Examples:
– 41 Shades of Blue (Google)
– Every 100ms counts (Amazon)
– Encryption notification (Kayak)
Terminology and Notations
• di – treatment variable:
– di = 1 the ith subject receives the treatment
– di = 0 the ith subject does not receive the
treatment
• Yi(d) – the potential outcome of the ith subject
– Yi(1) – potential outcome when treated
– Yi(0) – potential outcome when not treated
• The subject-level treatment effect τi = Yi(1) – Yi(0)
Yi(0) Yi(1) τi
Student 1 80 85 5
Student 2 85 85 0
Student 3 90 100 10
Student 4 65 60 -5
Student 5 60 70 10
Student 6 85 85 0
Student 7 85 100 15
• Observed outcomes:
– The connection between the observed outcome Yi
and the underlying potential outcomes is given by
the equation
$$Y_i = d_i \, Y_i(1) + (1 - d_i) \, Y_i(0)$$
– For any given subject, we observe either Yi(1) or
Yi(0), not both
• The fundamental problem of causal
inference
only one of Yi(1) and Yi(0) is observed, so we can never
find the true causal effect τi for an individual subject.
Yi(0) Yi(1) τi
Student 1 85 ?
Student 2 85 ?
Student 3 100 ?
Student 4 65 ?
Student 5 60 ?
Student 6 85 ?
Student 7 85 ?
$$\mathrm{ATE} = \mu_{Y(1)} - \mu_{Y(0)} = \frac{1}{N}\sum_{i=1}^{N} Y_i(1) - \frac{1}{N}\sum_{i=1}^{N} Y_i(0) = \frac{1}{N}\sum_{i=1}^{N} \bigl( Y_i(1) - Y_i(0) \bigr)$$
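Applied to the seven students in the potential-outcomes table earlier, the two forms of the average treatment effect (ATE) give the same number. The short sketch below only illustrates the arithmetic; it is possible solely because the table shows both potential outcomes for every student, which real data never do.

```python
# Potential outcomes for Students 1-7, copied from the table above.
y0 = [80, 85, 90, 65, 60, 85, 85]    # Yi(0): outcome when not treated
y1 = [85, 85, 100, 60, 70, 85, 100]  # Yi(1): outcome when treated

# Subject-level treatment effects: tau_i = Yi(1) - Yi(0)
tau = [treated - untreated for untreated, treated in zip(y0, y1)]
print(tau)  # [5, 0, 10, -5, 10, 0, 15]

# ATE as the average of the subject-level effects ...
ate = sum(tau) / len(tau)
# ... and as the difference between the two potential-outcome means.
ate_from_means = sum(y1) / len(y1) - sum(y0) / len(y0)

print(ate)  # 5.0
```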
Hypothesis Testing
• Null Hypothesis
– Yi(1)=Yi(0)
or
– ATE=0
For Completely Randomized Design
[Figure: Control vs. Treatment]
H0: ATE = 0
H1: ATE ≠ 0
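One way to test such a null hypothesis in a completely randomized design is a randomization (permutation) test, sketched below for the sharp null Yi(1) = Yi(0). This is a swapped-in illustration, not necessarily the test used in the slides. The assignment vector d is hypothetical: it is one assignment that reproduces the observed student outcomes shown earlier.

```python
from itertools import combinations

# Potential outcomes for Students 1-7 (from the earlier table).
y0 = [80, 85, 90, 65, 60, 85, 85]
y1 = [85, 85, 100, 60, 70, 85, 100]

# Hypothetical assignment: Students 1-3 treated, Students 4-7 control.
d = [1, 1, 1, 0, 0, 0, 0]

# Observed outcome: Yi = di * Yi(1) + (1 - di) * Yi(0)
observed = [di * a1 + (1 - di) * a0 for di, a0, a1 in zip(d, y0, y1)]
print(observed)  # [85, 85, 100, 65, 60, 85, 85]


def diff_in_means(outcomes, treated_idx):
    """Mean outcome of the treated group minus mean of the control group."""
    treated = [y for i, y in enumerate(outcomes) if i in treated_idx]
    control = [y for i, y in enumerate(outcomes) if i not in treated_idx]
    return sum(treated) / len(treated) - sum(control) / len(control)


treated_idx = {i for i, di in enumerate(d) if di == 1}
observed_diff = diff_in_means(observed, treated_idx)
print(observed_diff)  # 16.25

# Under the sharp null, outcomes do not depend on assignment, so every
# way of choosing 3 treated students out of 7 is equally likely.
assignments = list(combinations(range(len(observed)), len(treated_idx)))
extreme = sum(
    abs(diff_in_means(observed, set(a))) >= abs(observed_diff) - 1e-12
    for a in assignments
)
p_value = extreme / len(assignments)
print(f"{extreme} of {len(assignments)} assignments, p = {p_value:.3f}")
```

With only seven students, 10 of the 35 possible assignments produce a difference at least as large as the observed one, so the null hypothesis cannot be rejected at conventional levels.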
Error Types
Random Assignment
[Figure: individuals before and after random assignment to the Control and Treatment experimental groups. Colors symbolize any differentiating attribute among the individuals.]
What if people chose their condition?
[Figure: individuals before choosing and after sorting into self-selected Control and Treatment groups, which produces systematic error. Colors symbolize any differentiating attribute among the individuals.]
Selection Bias
• Simple Example
• Sample Selection Bias
– Average height of Americans?
• Self-Selection
– caused when the sample chooses itself
– certain characteristics are over-represented
because they correlate with willingness to be
included.
Blind Experiment
Images by lc.gcumedia.com
Completely Randomized Design (CRD)
• Random assignment of subjects to a set of
treatments
• Any variable that could influence the
response variable is balanced, on average, between the
groups
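Why random assignment helps can be checked by brute force on the seven-student example from earlier. The sketch below enumerates every way of treating 3 of the 7 students (the group size of 3 is an assumption made for illustration) and averages the difference-in-means estimate over all of them.

```python
from itertools import combinations

# Potential outcomes for Students 1-7 (from the earlier table).
y0 = [80, 85, 90, 65, 60, 85, 85]
y1 = [85, 85, 100, 60, 70, 85, 100]

n = len(y0)
m = 3  # number of treated subjects (hypothetical design choice)

true_ate = sum(b - a for a, b in zip(y0, y1)) / n

estimates = []
for treated in combinations(range(n), m):
    treated = set(treated)
    # Treated subjects reveal Yi(1); control subjects reveal Yi(0).
    treated_mean = sum(y1[i] for i in treated) / m
    control_mean = sum(y0[i] for i in range(n) if i not in treated) / (n - m)
    estimates.append(treated_mean - control_mean)

average_estimate = sum(estimates) / len(estimates)

print(true_ate)                    # 5.0
print(round(average_estimate, 6))  # 5.0
print(min(estimates), max(estimates))
```

Any single assignment can miss the true effect by a wide margin, but averaged over all equally likely assignments the estimate equals the ATE. Self-selection breaks this property because the assignments are no longer equally likely.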
Heterogeneous Treatment Effects
What to Test?
Design Choices in Online Experiments
Type of Experiment:
• Lab/Virtual Lab
• Field Experiment
• Natural Experiment
Lab Experiment
Field Experiment
Case Study: the effect of SEM
Search Engine Marketing
Paid Search Effectiveness
• Hypothesis:
– Users whose queries contain the word “eBay” already intend to visit
ebay.com, so paid search results substitute for natural ones. The ads are
navigational.
• Treatment:
– Stop advertising on brand-related terms (e.g. “ebay shoes”) on Bing
• Control:
– Google, Yahoo!
Paid Search Effectiveness
• No control:
lost.
Impact
Analytics – takeaways
• For (almost) every question you can use data.
– Just be creative about data sources and models
– Look for places where patterns can occur