Big Data Analytics

“…OVER TIME, WE BELIEVE BIG DATA MAY WELL
BECOME A NEW TYPE OF CORPORATE ASSET

THAT WILL CUT ACROSS BUSINESS UNITS AND
FUNCTION MUCH AS A POWERFUL BRAND
DOES, REPRESENTING A KEY BASIS FOR
COMPETITION…”
MCKINSEY QUARTERLY
Business analytics: Agenda
1. “Big data” – sources, challenges, and promise

(today)
2. Leveraging analytics for competitive advantage:
The 4 pillars of business analytics.
(later this week)
Demystifying (defining) big-data
The challenge of Big Data (by IBM)
Four main sources
1. Enterprise Systems
2. Social Media
Four main sources
1. Enterprise Systems
2. Social Media
3. Mobile
Mobile targeting – time and geography
Geo-Conquesting
https://vimeo.com/44351185
Four main sources
1. Enterprise Systems
2. Social Media
3. Mobile
4. The Internet of Things
IoT is bringing about an explosion in
connected devices and huge data sets
Internet-of-Things Projections
Some big numbers:
• 14bn Connected Devices | Bosch SI
• 50bn Connected Devices | Cisco
• 309bn IoT Supplier Revenue | Gartner
• 1.9tn IoT Economic Value Add | Gartner
• 7.1tn IoT Solutions Revenue | IDC
Some small numbers:
Peter Middleton, Gartner: “By 2020, component costs will have come down to the point that connectivity will become a standard feature, even for processors costing less than $1”
http://postscapes.com/internet-of-things-market-size
iBeacons – IoT used in the store
iBeacon indoor positioning systems can interact directly with smartphones, e.g. using Bluetooth Low Energy (BLE)
Targeting via iBeacon
16% higher unplanned spending (Ghose et al. 2015)
Applications beyond marketing
Monitor the driving habits of drivers
Applications managing infrastructure
– Sensors (or a drone) tell your parking app about vacant spots
– Sensor-rich garbage bins can schedule pickups!
– Smart LED streetlights only light up if a pedestrian approaches
Class exercise: smart chairs
IoT Leads to Better Measurement
Measurement has always been a foundation for progress:
– Medicine changed drastically after the microscope!
– ERP systems automate key business processes and serve as a foundation for modern management!
– Today, our key social processes are digitized:
• 46% of US singles found their romantic partner online
• Facebook is where ‘friends’ do what they like to do
– IoT has the potential to change our daily lives
Data as a disruptive force
Analytics as a “disruptive technology”
Data driven healthcare
Use of IT in agriculture is growing
Jojoba Israel: Lots of data are constantly collected but analyzed separately
[Figure: weather station, irrigation, yield, and soil data are each kept in separate Excel files, with no integration]
B2B in the data area?
What can business do?
(Analytics)
Data-Driven Optimization
Organizations that use data-driven decision-making are 5% more productive and 6% more profitable than their competitors. - MIT
Status Quo: What happened?
• Internal data only
• Standard reports
• Ad hoc queries
• Exception reporting
• Monthly report generation
• ‘Gut feel’ decisions
Data-Driven Optimization: What’s our best outcome?
• Blended internal & external data
• Predictive analytics
• Sophisticated modeling
• Machine learning
• Real-time analysis
• Fact-based decisions
Analytics Ladder
Courtesy: David Hardoon
4 Pillars of business analytics
Descriptive vs. predictive
• Descriptive data analytics
– Also sometimes called “exploratory data mining” or “unsupervised learning”
– Goal: Find patterns in data (such as association rules, meaningful segments / clusters, or anomalies)
– A much broader perspective of “exploratory data analytics” includes a variety of additional approaches: data visualization, descriptive statistics, correlation, data reduction, OLAP technologies, queries, and reporting
• Predictive data analytics
– Also sometimes called “supervised learning”
– Goal: Predict a target/outcome variable (such as purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy, etc.), typically by building predictive models
Four Key Ideas we will cover
Descriptive (unsupervised learning, finds patterns):
• Clustering
• Association rules
Predictive (supervised learning, builds models):
• Classification
• Prediction
Clustering
Finding elements of data (clusters) that have a high degree of similarity, and grouping them together
Example: identifying customer segments (for which we can make different offers).
Main idea: organizing data into the most natural groups
Example: understanding the consumer base on your website (based on age, gender, amount spent…)
[Figure: scatter plot of amount spent per visit (axis 20 to 100) against age (axis 20 to 50); each point is marked m or f by gender, and the points fall into three groups labeled Cluster 1, Cluster 2, and Cluster 3]
How about going beyond eyeballing the data in 2-3 dimensions?
→ Need general-purpose techniques to deal with any-dimensional data
Example: clustering mall visitors
QUESTION: What data sources can be used?
APPROACH: Analysis of Wi-Fi usage mapped to physical space.
RESULT: Location- and behavior-based insights for tenant and mall strategies.
Image (cc) flickr/ Will

Clustering: Basic Ideas
• Organizing data points/objects (e.g., customers) into
homogeneous (and, hopefully, meaningful) groups/clusters
• Desired properties of clustering result:

– High intra-similarity, i.e., any two data points / objects
that are assigned into the same cluster should exhibit
similarity to each other
– Low inter-similarity, i.e., any two data points / objects
that are assigned into different clusters should not be very
similar to each other (why?)
• Helps to gain insights into your data

– Instead of trying to look at the entire dataset (e.g., a huge
number of customers), you can inspect the representative data
groups/clusters (e.g., a small number of groups, into which your
data can be arranged most naturally)
– Usually a useful precursor for additional, deeper analyses
– Many applications!
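The grouping idea above can be sketched with k-means, one common general-purpose clustering technique (the slides do not name a specific algorithm, so this choice is an assumption). The customer records below are hypothetical (age, amount spent per visit) pairs, and the function names are illustrative only.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means; works for points with any number of dimensions."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [mean_point(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:  # centers stopped moving
            break
        centers = new_centers
    return centers, clusters

# Hypothetical customers: (age, amount spent per visit)
customers = [(22, 25), (25, 30), (24, 20), (27, 28),
             (35, 95), (38, 100), (33, 90), (36, 105),
             (48, 55), (50, 60), (46, 58), (52, 62)]

centers, clusters = kmeans(customers, k=3)
for center, members in zip(centers, clusters):
    print(center, members)
```

k-means can settle on different groupings depending on the starting centers, and it is sensitive to the scale of each attribute, so in practice attributes are usually standardized and the algorithm is run from several starting points.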
Clustering case study: Customer Segmentation for Regional Airline
• Goal: break down a large data set into small, similar groups based on customer attributes.
• Customer attributes considered in this situation included:
– Travel frequency
– Average days booked in advance
– Number of flights per trip
– Percentage of round trips
– Percentage of group trips
– Booking channels
Can you find a “title”?
Association rules
• (also known as co-occurrence grouping)
• Attempts to find associations between entities based on transactions involving them.
• Important: no examples are provided to the model; no “correct answer” exists
• Example: Amazon
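A minimal sketch of co-occurrence grouping on shopping baskets follows. The baskets are hypothetical, and support, confidence, and lift are the standard association-rule measures rather than anything defined on the slide.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each basket is the set of items bought together.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def pair_rules(baskets, min_support=0.4):
    """Return (lhs, rhs, support, confidence, lift) for frequent item pairs."""
    n = len(baskets)
    item_count = Counter(item for b in baskets for item in b)
    pair_count = Counter(pair for b in baskets
                         for pair in combinations(sorted(b), 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n                    # share of baskets with both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]      # P(rhs | lhs)
            lift = confidence / (item_count[rhs] / n)  # vs. buying rhs anyway
            rules.append((lhs, rhs, support, confidence, lift))
    return rules

for lhs, rhs, support, confidence, lift in sorted(pair_rules(baskets)):
    print(f"{lhs} -> {rhs}: support={support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

No labels are supplied here: the rules come only from which items appear together, which is what makes this an unsupervised task.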
Classification
Assigning each individual to one of several pre-defined categories (or classes)
• Objective:
– to predict the class when it is unknown or will only be known in the future,
– based on rules derived from similar data where the classification is known
[Figure: cardiac rhythm classification]
Prediction (Regression)
Estimate (or predict) a numerical value of a specific variable based on past and current data
Stock Price
http://www.blueflag.com.au/blog/why-australians-wont-buy-1-million-cars-2011
http://mechonomic.blogspot.com/2010/07/ibm-share-price-on-decline.html
Example: prediction of mall visitors’ next step
QUESTION: What data sources can be used?
APPROACH: Analysis of Wi-Fi usage mapped to physical space.
RESULT: ?
Image (cc) flickr/ Will

Case study: Large shopping malls in China
3 coupon types: Random, location and trajectory
[Figure: four bar charts comparing a control group (C) with the Random, Location, and Trajectory coupon types. Panel titles: Highest Redemption Rate; Highest Spending in Store; Least Time Spent in Store; Time Elapse Until Redemption]
What is Different from Classical Statistics?
Assumptions in classical statistics:
“data is scarce”
“computing is difficult”
The result:
The same sample is used to make estimates AND to determine how reliable the estimates are
Do you find “confidence intervals” and “hypothesis testing” easy to explain to your non-technical colleagues?
Assumptions in data mining:

“data and computing are abundant”
The result:
Fit a model with one sample
Assess its performance with another sample
Use computationally intensive techniques
(examples: classification trees, neural networks)
Advantage of data mining:
- Can be open ended
- No need for hypothesis testing
The danger:
- Over-fitting: a model fit so closely to the available sample of data that it describes not merely structural characteristics of the data, but random peculiarities as well
Over-fitting
Let’s try (supervised learning)
Terminology we will need
• Training data: portion of the data used to fit a model
• Validation data: portion of the data used to assess how well the model fits, and also to:
– adjust some models
– select the best model from among those that have been tried
• Test data: portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on additional data
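The three partitions can be produced with a shuffled split. This is a minimal sketch: the 60/20/20 proportions and the function name are assumptions chosen for illustration, not values from the slides.

```python
import random

def three_way_split(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle the rows, then cut them into training, validation, and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * validation)
    training = rows[:n_train]
    validating = rows[n_train:n_train + n_val]
    testing = rows[n_train + n_val:]   # whatever remains is held out for the end
    return training, validating, testing

training, validating, testing = three_way_split(range(100))
print(len(training), len(validating), len(testing))
```

Shuffling before cutting matters: if the rows are sorted (by date or by class, say), an unshuffled split would give the model a training set that does not resemble the data it is later judged on.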
Example: Buyer / non-buyer classification
• A riding-mower manufacturer classifies families into:
a. those likely to purchase a riding mower
b. those not likely to buy one
• The question: can we derive a method to help us identify future buyers?
• The data (or the “predictor variables”):
– Income ($000s)
– Lot size (000s sq ft)
(Lawn) Mowers data
Observation  Income ($000s)  Lot size (000s sq ft)  Class (1 = buyer, 2 = non-buyer)
1            60              18.4                   1
2            85.5            16.8                   1
3            64.8            21.6                   1
4            61.5            20.8                   1
5            87              23.6                   1
6            110.1           19.2                   1
7            108             17.6                   1
8            82.8            22.4                   1
9            69              20                     1
10           93              20.8                   1
11           51              22                     1
12           81              20                     1
13           75              19.6                   2
14           52.8            20.8                   2
15           64.8            17.2                   2
16           43.2            20.4                   2
17           84              17.6                   2
18           49.2            17.6                   2
19           59.4            16                     2
20           66              18.4                   2
21           47.4            16.4                   2
22           33              18.8                   2
23           51              14                     2
24           63              14.8                   2
Graphical View
Decision Trees
• Classification Tree – binary outcome
– Will the buyer purchase or not?
• Regression Trees – continuous outcome
– How much will the buyer spend?
• Very broadly applicable technique
• Easy to explain “rules”
Key task is to algorithmically find the splits in the data that help classify (separate) buyers and non-buyers
X2 <= 21?
X2 <= 19?
Which is a better split?
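One way to compare the two candidate splits is to score each by the weighted Gini impurity of the two groups it creates (lower is purer). Gini impurity is a common criterion for classification trees, but the slides do not say which criterion was used, so treat it as an assumption. The rows are the mowers data, with X2 taken to be lot size.

```python
# (income in $000s, lot size in 000s sq ft, class) with 1 = buyer, 2 = non-buyer
mowers = [
    (60, 18.4, 1), (85.5, 16.8, 1), (64.8, 21.6, 1), (61.5, 20.8, 1),
    (87, 23.6, 1), (110.1, 19.2, 1), (108, 17.6, 1), (82.8, 22.4, 1),
    (69, 20, 1), (93, 20.8, 1), (51, 22, 1), (81, 20, 1),
    (75, 19.6, 2), (52.8, 20.8, 2), (64.8, 17.2, 2), (43.2, 20.4, 2),
    (84, 17.6, 2), (49.2, 17.6, 2), (59.4, 16, 2), (66, 18.4, 2),
    (47.4, 16.4, 2), (33, 18.8, 2), (51, 14, 2), (63, 14.8, 2),
]

def gini(labels):
    """Gini impurity: 0 for a pure group, 0.5 for a 50/50 two-class group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(rows, feature, threshold):
    """Impurity after splitting rows on rows[feature] <= threshold."""
    left = [r[2] for r in rows if r[feature] <= threshold]
    right = [r[2] for r in rows if r[feature] > threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

LOT_SIZE = 1  # column index of X2
for threshold in (19, 21):
    score = weighted_gini(mowers, LOT_SIZE, threshold)
    print(f"lot size <= {threshold}: weighted Gini = {score:.3f}")
# lot size <= 19: weighted Gini = 0.375
# lot size <= 21: weighted Gini = 0.400
```

Under this criterion the split at 19 separates buyers from non-buyers slightly better than the split at 21. A tree-building algorithm scores every candidate threshold on every predictor in this way and keeps the best one.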

Recursive Partitioning
X2 <= 19
X1 < 84.75
Final “Pure” Split
Classification Tree
[Figure: tree diagram labeling decision nodes and leaf nodes]
Why are Decision Trees Popular?
• Tells you which predictors are important
– Variable subset selection is automatic (since it is part of the split selection)
– Wine.xls uses only 2 out of 13 variables
• No hassle with outliers
– The choice of a split depends on the ordering of observation values and not on the absolute magnitudes
• No hassle with missing data
• Easy interpretation and implementation
– If-then-else rules…
Classification and Regression Trees (CART)
• Very broadly applicable technique
• Easy to explain “rules”
• 2 key ideas
1. Recursive partitioning of the space of the independent variables
2. Pruning using validation data
Key performance metrics
• Accuracy: share of all cases (class 0 and class 1) that the model classified correctly
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
• Precision: out of the cases classified as positive, how many are actually positive?
$$\text{Precision} = \frac{TP}{TP + FP}$$
• Recall: out of the actual positive cases, how many were predicted correctly?
$$\text{Recall} = \frac{TP}{TP + FN}$$
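The three metrics can be computed directly from the counts of true/false positives and negatives, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The labels below are hypothetical, and the function name is illustrative.

```python
def classification_metrics(actual, predicted, positive=1):
    """Accuracy, precision, and recall from paired actual/predicted labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of everything flagged positive, how much really was positive?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of everything really positive, how much did the model catch?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels: 4 actual positives, 6 actual negatives
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(classification_metrics(actual, predicted))
```

Here TP = 2, FN = 2, FP = 1, and TN = 5, so accuracy is 0.7, precision is 2/3, and recall is 0.5: a model can look acceptable on accuracy while missing half of the positive cases.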
Let’s try (supervised learning)
using Azure
Welcome to Azure!
• Experiment: your “sandbox”

• Step 1: get the data
• Step 2: preprocess the data
(don’t forget to split it!)
• Step 3: choose and run a modeling algorithm
• Step 4: score and evaluate your model
Example 1: predict automotive prices
• Step 1: get the data – “automotive price data”

– Clean missing values
– Split (75-25)
– Define features (make, body-style, wheel-base,
horsepower, peak-rpm, highway-mpg, price)
– Linear regression
– Train model
Example 2: income classification
• Step 1: get the data – “adult income binary

classification dataset”
– Clean missing values
– Split (75-25)
– Decision trees
– Train model
– Confusion matrix
– Accuracy
– ROC curve
Moving from correlation to
causation
Leveraging Analytics for Competitive
Advantage
Example
Senior business leader wants to know,

“Did the website redesign increase sales? Can you run
a report?”
BI Analyst
runs report »
But that’s the wrong question
An (older) Example
• Amazon – shopping cart recommendations
– A marketing senior VP was against it:
• It might distract people away from checking out
– Results?
Source: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Search Engine Ads with Site Links
• Should a search engine add “site links” to ads, which allow advertisers to offer several destinations on ads?
• OEC: Revenue, with ads constrained to the same vertical pixels on average
[Figure: ad variants A and B]
Source: Ronny Kohavi, MSFT
Pro: richer ads, users better informed where they land
Cons: the constraint means on average 4 “A” ads vs. 3 “B” ads; variant B is 5 msec slower (compute + higher page weight)
Left hand / Right hand
Search Engine Ads with Site Links
• <answer>
• The above change was costly to implement.
• MSFT made two small changes to Bing that took days to develop; each increased ad revenue by about $100 million annually.
• (One was delayed by 6 months because it was not given high priority, a prioritization mistake that cost $50M)
Our intuition is poor
Ideas tested at
• 1/3 flat: no significant difference
• 1/3 prove to be statistically significant, positive changes
• 1/3 statistically significant, but negative
Non-tech Companies where A/B
testing is standard operating
procedure
• Walmart
• Hertz
• Singapore Airlines
• Capital One
• (Not to mention Google, Amazon – moving credit card
offers to checkout page was a $10 million effect! –
Booking.com, Facebook, Uber, Airbnb)
• Requires infrastructure (tools exist)
– instrumentation (to record such things as clicks, mouse
hovers, and event times)
– data pipelines, and
– data scientists
Two Methodological Paradigms
Causation & Prediction
• Causation: controlled experiments are a powerful tool to evaluate ideas, and to understand fundamental forces at play
• Prediction: predictive models tell you where to look for these forces
The Goal of Causal Analysis
• Use scientific methodology to support decision making
• Cause and effect questions
– Test theory of causal relationships
– Contribute knowledge on the nature of a causal relationship
– Transparent methodology
– Reproducible procedures
Approaches to Study Causation
• Observational
The researcher looks for natural differences across cases and tries to find a single input that might have caused the variation in outcomes
• Experimental
The researcher conducts an experiment… If outcomes vary across the treatment and control groups, the difference must be due to the catalyst (Teele 2013)
Diet Soda Anyone?
• Discovery: people drinking diet soda are overweight
• Is it:
– Consuming Diet Coke makes you fat?
or
– Overweight people are ordering Diet Coke because
they want to lose weight
Storks Deliver Babies (p=0.008)
R. Matthews (2000)
http://priceonomics.com/do-storks-deliver-babies/
Matthews, R. (2000). Storks deliver babies (p=0.008). Teaching Statistics, 22(2), 36-38.
The Negative Effect of Science?
Causality
• The act or process of causing something to happen or exist
• The relationship between an event or situation and a possible reason or cause
(merriam-webster)
Causality
Figure 1.1 - Morgan S. and Winship C.
Cause and Effect
• To establish a cause and effect relationship:
– The cause must precede the effect
– The cause must be related to the effect
– There must be no other plausible alternative explanation
Correlation does not imply
causation
• The First Law of Data Science: “To
determine if a correlation is true in the real
world, it must be verified empirically”
(Dr. Michael L. Brodie, KDD 2014)

Inferring Causality
• Association
– A statistically significant correlation or regression
coefficient - the likelihood of its occurrence by
chance alone is small.
• Time order of occurrence
– The causal variable must precede the outcome
variable in time
• Eliminating other potential causes
Advantages of Experiments
• Best scientific way to prove causality
– The effect in the dependent variable is caused by changes introduced in the treatment (Kohavi, 2015)
• An effective way to obtain unbiased estimates of causal effects (Aral and Walker, 2012)
Experiment - Basic Concept
Source: Kohavi R., KDD 2015
A/B Testing
A..Z test?
• The Multi-Armed Bandit Problem
Challenges
• Minimize the possibility that the results you get might be due to a hidden confounding factor
• What to measure?
– Define the OEC – Overall Evaluation Criterion
• Minimize differences between the control and treatment groups
• How long should we run the experiment?
• How to measure the significance of the results?
– Statistical significance
– Economic significance
• Heterogeneous treatment effects
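Statistical significance for a conversion-rate A/B test is often checked with a two-proportion z-test. The sketch below assumes that test and uses hypothetical visitor and conversion counts; the slides do not prescribe a particular test.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (lift, z, two-sided p-value) for conversion rates of A vs. B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate if A and B were identical
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return p_b - p_a, z, p_value

# Hypothetical counts: control A converts 200 of 10,000, treatment B 250 of 10,000
lift, z, p_value = two_proportion_z_test(200, 10_000, 250, 10_000)
print(f"lift = {lift:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value says the lift is unlikely to be chance. Economic significance is a separate question: whether a lift of that size, multiplied by traffic and value per conversion, justifies the cost of the change.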
When Should We Use Experiments?
• Choice between known options
• Examples:
– 41 Shades of Blue (Google)
– Every 100ms counts (Amazon)
– Encryption notification (Kayak)
• Less suitable for:
– New experiences
• Change aversion
• Novelty effect
– Fuzzy questions/opportunities?
• What is not offered?
• Which product to develop?
– Long term activity
• Which product should we sell?
• Add new premium service?
• Change logo
Terminology and Notations
• di – treatment variable:
– di = 1 → the ith subject receives the treatment
– di = 0 → the ith subject does not receive the treatment
• Yi(d) – the potential outcome of the ith subject
– Yi(1) – potential outcome when treated
– Yi(0) – potential outcome when not treated
• The subject-level treatment effect → τi = Yi(1) – Yi(0)
            Yi(0)   Yi(1)   τi
Student 1   80      85      5
Student 2   85      85      0
Student 3   90      100     10
Student 4   65      60      -5
Student 5   60      70      10
Student 6   85      85      0
Student 7   85      100     15
Average     78.57   83.57   5
• Observed outcomes:
– The connection between the observed outcome $Y_i$ and the underlying potential outcomes is given by the equation
$$Y_i = d_i Y_i(1) + (1 - d_i) Y_i(0)$$
– For any given subject, we observe either $Y_i(1)$ or $Y_i(0)$, not both
• The fundamental problem of causal inference: only one of $Y_i(1)$ and $Y_i(0)$ is observed, so we can never observe the true causal effect for an individual subject.
Yi(0) Yi(1) τi
Student 1 85 ?
Student 2 85 ?
Student 3 100 ?
Student 4 65 ?
Student 5 60 ?
Student 6 85 ?
Student 7 85 ?
Average 73.75 90 16.25

• Average Treatment Effect - ATE
$$\text{ATE} = \mu_{Y(1)} - \mu_{Y(0)} = \frac{1}{N}\sum_{i=1}^{N} Y_i(1) - \frac{1}{N}\sum_{i=1}^{N} Y_i(0) = \frac{1}{N}\sum_{i=1}^{N} \left( Y_i(1) - Y_i(0) \right)$$
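The gap between the true ATE and what one experiment shows can be checked with the seven-student table. The treatment assignment below (students 1 to 3 treated) is an assumption: it reproduces the group averages of 90 and 73.75 in the observed-outcomes table, but the slides do not state which students were treated.

```python
# Potential outcomes (Yi(0), Yi(1)) for the seven students in the table
potential = [(80, 85), (85, 85), (90, 100), (65, 60),
             (60, 70), (85, 85), (85, 100)]

def mean(values):
    return sum(values) / len(values)

# True ATE: average of the subject-level effects Yi(1) - Yi(0)
true_ate = mean([y1 - y0 for y0, y1 in potential])

# Assumed assignment di (1 = treated); only one potential outcome is observed
d = [1, 1, 1, 0, 0, 0, 0]
observed = [di * y1 + (1 - di) * y0 for di, (y0, y1) in zip(d, potential)]

treated = [y for y, di in zip(observed, d) if di == 1]
control = [y for y, di in zip(observed, d) if di == 0]
estimate = mean(treated) - mean(control)  # difference in group means

print(f"true ATE = {true_ate}, estimate from this assignment = {estimate}")
# true ATE = 5.0, estimate from this assignment = 16.25
```

With only seven subjects, a single assignment can land far from the true value of 5. Random assignment makes the difference in means correct on average across assignments, not in every individual experiment.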
Hypothesis Testing
• Null Hypothesis
– Yi(1)=Yi(0)
or
– ATE=0
For Completely Randomized Design
Hypothesis Testing
[Figure: control vs. treatment]
• H0: ATE = 0
• H1: ATE ≠ 0
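For a completely randomized design, one way to test the null hypothesis is randomization inference: if Yi(1) = Yi(0) for every subject, the observed outcomes would be the same under any assignment, so the observed difference in means can be compared with the differences from every possible assignment. The outcomes are the seven observed student scores; which three students were treated is an assumption.

```python
from itertools import combinations

# Observed outcomes for the seven students; indices 0-2 are assumed treated.
outcomes = [85, 85, 100, 65, 60, 85, 85]
treated_idx = {0, 1, 2}

def diff_in_means(outcomes, treated):
    """Mean outcome of the treated subjects minus mean of the rest."""
    t = [y for i, y in enumerate(outcomes) if i in treated]
    c = [y for i, y in enumerate(outcomes) if i not in treated]
    return sum(t) / len(t) - sum(c) / len(c)

observed = diff_in_means(outcomes, treated_idx)

# Under the null, every way of picking 3 of the 7 students was equally likely.
assignments = list(combinations(range(len(outcomes)), len(treated_idx)))
extreme = sum(
    1 for a in assignments
    if abs(diff_in_means(outcomes, set(a))) >= abs(observed) - 1e-9
)
p_value = extreme / len(assignments)

print(f"observed difference = {observed}, "
      f"{extreme} of {len(assignments)} assignments are at least as extreme, "
      f"two-sided p = {p_value:.3f}")
```

Here 10 of the 35 possible assignments give a difference at least as large in absolute value as the observed 16.25, so p is about 0.286 and H0 is not rejected at conventional levels.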
Error Types
Random Assignment
• Each participant has a known (usually equal) chance of being assigned to any of the groups.
• Successful randomization: group assignment cannot be predicted in advance.
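A minimal sketch of complete random assignment: shuffle the participants and cut the list, so every participant has the same chance of landing in either group. The function name and participant IDs are hypothetical.

```python
import random

def randomly_assign(subjects, n_treated, seed=None):
    """Split subjects into treatment and control groups by shuffling."""
    rng = random.Random(seed)  # pass a seed only to make a demo repeatable
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return {"treatment": shuffled[:n_treated], "control": shuffled[n_treated:]}

groups = randomly_assign(range(1, 11), n_treated=5, seed=1)
print(groups)
```

In a real experiment the seed would not be fixed or known in advance, since assignment must not be predictable.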
Random Assignment
[Figure: a group of individuals before random assignment and, after random assignment, split into control and treatment experimental groups. Colors symbolize any differentiating attribute among the individuals.]
What if people chose their condition?
[Figure: a group of individuals before choosing and the resulting self-selected control and treatment groups, which show systematic error. Colors symbolize any differentiating attribute among the individuals.]
Selection Bias
• Simple Example
• Sample Selection Bias
– Average height of Americans?
• Self-Selection
– caused when the sample chooses itself
– certain characteristics are over-represented
because they correlate with willingness to be
included.
Blind Experiment
Images by lc.gcumedia.com
Completely Randomized Design (CRD)
• Random assignment of subjects to a set of treatments
• Any variable that could influence the response variable is equalized, on average, between the groups
The effect is only due to the treatment imposed
Heterogeneous Treatment Effects
• Does the treatment have the same effect on everyone treated?
– Female/Male?
– Age group?
– Education?
• HTE – measure the effect on subpopulations
– Predefined (known) populations
– Advanced data methods
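For predefined subpopulations, the simplest approach is to compute the treatment-minus-control difference in means separately within each segment. The records below are hypothetical, and the function name is illustrative.

```python
from collections import defaultdict

# Hypothetical experiment records: (segment, treated flag, outcome)
records = [
    ("female", 1, 12.0), ("female", 1, 14.0),
    ("female", 0, 10.0), ("female", 0, 10.0),
    ("male", 1, 9.0), ("male", 1, 11.0),
    ("male", 0, 10.0), ("male", 0, 10.0),
]

def subgroup_effects(records):
    """Difference in mean outcome (treated minus control) per segment."""
    groups = defaultdict(lambda: {0: [], 1: []})
    for segment, treated, outcome in records:
        groups[segment][treated].append(outcome)
    return {
        segment: sum(g[1]) / len(g[1]) - sum(g[0]) / len(g[0])
        for segment, g in groups.items()
    }

print(subgroup_effects(records))
# {'female': 3.0, 'male': 0.0}
```

In this made-up data the overall effect is 1.5, which hides the fact that the whole effect comes from one segment. Slicing by many segments raises the chance of a spurious finding, so subgroups are best chosen before looking at the results.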
What to Test?
• Understanding consumer behavior:
– Cognitive bias
– Rational/irrational behavior
– Social effect
– Price sensitivity
• Website design
Design Choices in Online Experiments
Type of Experiment:
• Lab/Virtual Lab
• Field Experiment
• Natural Experiment
Lab Experiment
• Conducted in a well-controlled environment
– All variables can be controlled
• Participants are aware that they are taking part in an experiment.
• They may or may not know the true aims of the experiment
• Settings don’t always resemble “real world”
• Participants don’t resemble other populations
– Samples are generally non-random
– Small samples, at least by survey data standards
– Participants are often college undergraduates
– Participants are often WEIRD:
• Western, educated, industrialized, rich, democratic
Field Experiment
• Examine an intervention in the real environment
• The subjects are naturally undertaking certain tasks
• The subjects do not know that they are participating in an experiment
• The researchers manipulate the independent variable
Case Study: the effect of SEM
• 49% is SEM (search engine marketing) (non-mobile + mobile)
• Google is the leading SEM provider; advertising accounts for ≈95% of its revenues
• What is the ROI of SEM?
Search Engine Marketing
Paid Search Effectiveness
(Blake, Nosko, & Tadelis,
• The business question:
– What is the ROI of paid search for eBay?
• Hypothesis:
– queries with the word eBay → intent to visit ebay.com → paid search results substitute for natural ones. Ads are navigational.
• Treatment:
– Stop brand related terms (“ebay shoes”) @ Bing
• Control:
– Google, Yahoo!

• Simple pre-post analysis (w/o control): 5.6% decrease in total clicks
• With control: 0.59% of clicks lost, but not statistically significant
• 99.5% substitution between paid and natural
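The difference between a pre-post estimate and an estimate with a control group can be illustrated with a difference-in-differences calculation. The click counts below are hypothetical, and the slides do not say that the study used exactly this estimator, so this is only a sketch of the general idea.

```python
def pct_change(pre, post):
    """Percentage change from the pre-period to the post-period."""
    return (post - pre) / pre * 100

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group beyond the change seen in the control group."""
    return (pct_change(treated_pre, treated_post)
            - pct_change(control_pre, control_post))

# Hypothetical clicks: ads switched off on the treated channel only
treated_pre, treated_post = 1000, 944  # looks like a 5.6% drop on its own
control_pre, control_post = 1000, 950  # but the control channel fell 5.0% too

print(f"pre-post estimate: {pct_change(treated_pre, treated_post):.1f}%")
print(f"with control:      "
      f"{diff_in_diff(treated_pre, treated_post, control_pre, control_post):.1f} "
      f"percentage points")
```

Most of the apparent drop disappears once the control group's trend is subtracted, which is why a pre-post comparison without a control can badly overstate an effect.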

• Follow up test on Google
• No control: no other brand SEM campaigns
• Pre-post estimate shows 3% clicks lost.
Impact
Analytics – takeaways
• For (almost) every question you can use data.
– Just be creative about data sources and models
– Look for places where patterns can occur
• Be creative in looking for data
– Don’t forget to look outside the organization!
• Ask yourself if this is a supervised task
– For example – do I have examples to provide?
– If not – don’t despair! Patterns are still possible
• Can you use an experiment to get causal understanding?
• Know how to read your results.
• Remember it is not really complicated!
