
Big Data Analysis
Concepts and References: Use Cases in the Airline Industry
Author: Vikram Andem
ISRM & ITGRC Conference

The objective of this presentation is to provide awareness and familiarize a general business or management user with the terms and terminology of Big Data Analysis, along with references to use cases that can be (or currently are) applied in the airline industry.

The presentation is intended to help a business or management user think through how to formulate an analytical question for Big Data Analysis given a business situation or problem.

The presentation may also provide insight into the basic terms and concepts one needs to know, what to ask, how to evaluate and/or help solve a business problem for a potential Big Data Analysis use case, and what to expect from the work of a competent Data Scientist on such a use case.

NOTE: Just reviewing this presentation will most likely NOT make you competent enough to instantly perform Big Data Analysis. Big Data Analysis is a new (very recent) aspect of Data Science and requires some college- or university-level course work in fields such as (but not limited to) mathematics, statistics, computer science, management science, econometrics and engineering.

The presentation is divided into three parts, following a separate presentation on Big Data Security & Governance, Risk Management & Compliance.

Part 1. Big Data: Introduction

Part 2. A very quick introduction to understanding data and the analysis of data
(Beginner: if you are new to understanding data and the use of data, you should start here.)

Part 3. Big Data Analysis: Concepts and References to Use Cases in the Airline Industry
(Advanced: if you understand data and how to use data, you may jump to this part.)
Part 1. Big Data: Introduction

You may skip this section if you are familiar with Big Data and jump directly to Part 2.
Introduction

(Chart) Projected growth and use of unstructured vs. structured data, 2012-2020, measured in zettabytes* of data. Source: 2013 IEEE Big Data conference (projected growth of data combined for all Fortune 500 companies only).
Limitations of the existing Data Analytics architecture (diagram): instrumentation and collection feed a storage-only grid holding the original raw data (mostly append-only), an ETL compute grid, and an RDBMS of aggregated data that serves BI reports and interactive apps.
Limit #1: Moving data to compute doesn't scale.
Limit #2: Can't explore high-fidelity raw data.
Limit #3: Archiving = premature data death.
* 1 Zettabyte = 1,000 Exabytes = 1 Million Petabytes = 1 Billion Terabytes.
Big Data, a general term, refers to the large, voluminous amounts (at least terabytes) of poly-structured data that is gleaned from traditional and non-traditional sources and continuously flows through and around organizations, including but not limited to e-mail, text, event logs, audio, video, blogs, social media and transactional records.

What does this information hold? What is the challenge in extracting it?
It holds the promise of giving enterprises like United a deeper insight into their
customers, partners, and business. This data can provide answers to questions
they may not have even thought to ask. Companies like United can benefit
from a multidimensional view of their business when they add insight from big
data to the traditional types of information they collect and analyze.
(Chart: popularity rank vs. number of results / demand) A long-tail view: traditional EDW and classical statistics focus on the specific spikes of transactional data (e.g., reservations), while Big Data addresses the long tail of more generic, non-transactional and raw data (e.g., searches, event logs).
The challenge of extracting value from big data is similar in many ways to
the age-old problem of distilling business intelligence from transactional
data. At the heart of this challenge is the process used to extract data
from multiple sources, transform it to fit your analytical needs, and load it
into a data warehouse for subsequent analysis, a process known as
Extract, Transform & Load (ETL). The nature of big data requires that the
infrastructure for this process can scale cost-effectively.
Data Storage and Analysis
While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. A typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. This is a long time to read all the data on a single drive, and writing is even slower.
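The arithmetic behind these figures is easy to check; a minimal sketch using the capacities and transfer rates quoted above:

    # Rough check of the drive read-time arithmetic quoted above.
    def read_time_hours(capacity_mb, transfer_mb_per_s):
        return capacity_mb / transfer_mb_per_s / 3600

    print(read_time_hours(1370, 4.4) * 60, "minutes for a 1990-era drive")   # ~5.2 minutes
    print(read_time_hours(1_000_000, 100), "hours for a 1 TB drive")         # ~2.8 hours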
Hadoop

Apache Hadoop is a scalable, fault-tolerant distributed system for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop can be used to store exabytes of unstructured and semi-structured data reliably on tens of thousands of general-purpose servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster. Using Hadoop in this way, an organization like United gains an additional ability to store and access data that it might need, data that may never be loaded into the data warehouse.
The Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):
- Schema must be created before any data can be loaded.
- An explicit load operation has to take place which transforms the data to the database's internal structure.
- New columns must be added explicitly before new data for such columns can be loaded into the database.

Schema-on-Read (Hadoop):
- Data is simply copied to the file store; no transformation is needed.
- A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).
- New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.

Pros: Schema-on-Write: reads are fast; standards and governance. Schema-on-Read: loads are fast; flexibility and agility.
Use when: Schema-on-Write (RDBMS): interactive OLAP analytics (<1 sec), multistep ACID transactions, 100% SQL compliance. Schema-on-Read (Hadoop): structured or unstructured data (flexibility), scalability of storage/compute, complex data processing.
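A toy illustration of the contrast, assuming raw pipe-delimited event lines land in the file store untouched and the parsing logic below stands in for a SerDe (the field names are invented):

    import csv, io

    # Raw event lines are stored as-is (schema-on-read: no load-time transform).
    raw_lines = "2014-06-01|ORD|IAH|12\n2014-06-01|ORD|EWR|-3\n"

    # The "SerDe" is applied at read time: columns are extracted only when a query needs them.
    def read_with_schema(raw, columns=("flight_date", "origin", "dest", "delay_min")):
        reader = csv.reader(io.StringIO(raw), delimiter="|")
        return [dict(zip(columns, row)) for row in reader]

    print(read_with_schema(raw_lines))
    # Adding a new column later only requires updating the reader (late binding), whereas a
    # schema-on-write system would need an explicit schema change and reload before such data could be queried.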
Hadoop Architecture
Central to the scalability of Hadoop is the
distributed processing framework known as
MapReduce which splits the input
data-set into multiple chunks, each of
which is assigned a map task that can
process the data in parallel. Each map task
reads the input as a set of (key, value) pairs
and produces a transformed set of (key,
value) pairs as the output. The framework
shuffles and sorts outputs of the map tasks,
sending the intermediate (key, value) pairs
to the reduce tasks, which group them into
final results. MapReduce uses Job Tracker
and Task Tracker mechanisms to schedule
tasks, monitor them, & restart any that fail.
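A minimal pure-Python sketch of the map / shuffle / reduce flow described above, using the classic word-count example (this simulates the idea only; it is not Hadoop's actual API):

    from collections import defaultdict

    def map_task(document):
        # Emit (key, value) pairs: one (word, 1) per word.
        return [(word, 1) for word in document.split()]

    def shuffle(mapped_pairs):
        # Group intermediate pairs by key, as the framework's shuffle/sort step does.
        groups = defaultdict(list)
        for key, value in mapped_pairs:
            groups[key].append(value)
        return groups

    def reduce_task(key, values):
        # Combine all values for a key into a final result.
        return key, sum(values)

    documents = ["united flies to houston", "united flies to chicago"]
    mapped = [pair for doc in documents for pair in map_task(doc)]
    results = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
    print(results)   # {'united': 2, 'flies': 2, 'to': 2, 'houston': 1, 'chicago': 1}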
Hadoop Distributed File System (HDFS) is designed for scalability and fault tolerance. HDFS stores large files by dividing them
into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers. HDFS provides APIs for MapReduce applications to read
and write data in parallel. Capacity and performance can be scaled by adding Data Nodes, and a single NameNode mechanism manages data
placement and monitors server availability. HDFS clusters in production use today reliably hold petabytes of data on thousands of nodes.
Pictures source: Intel White Paper on Big Data Analytics.

Use Cases
Data Storage: Collect and store unstructured data in a fault-resilient, scalable data store that can be organized and sorted for indexing and analysis.
Analytics: Ability to query in real time, at the speed of thought, on petabyte-scale unstructured and semi-structured data using HBase and Hive.
Batch Processing of Unstructured Data: Ability to batch-process (index, analyze, etc.) tens to hundreds of petabytes of unstructured and semi-structured data.
Data Archive: Medium-term (12-36 months) archival of data from EDW/DBMS to meet data retention policies.
Integration with EDW: Extract, transfer and load data in and out of Hadoop into a separate DBMS for advanced analytics.
Search and Predictive Analytics: Crawl, extract, index and transform structured and unstructured data for search and predictive analytics.
Southwest Airlines uses a Hadoop-based solution for its "Rapid Rewards" loyalty program and for customer service.
Common Patterns of Hadoop Use
Pattern #1: Hadoop as a Data Refinery
Collect data and apply a known algorithm within a trusted operational process.
Data sources: traditional sources (RDBMS, OLTP, OLAP) and new sources (weblogs, email, sensor data, social media). Data systems: RDBMS, EDW and other traditional repositories. Applications: business analytics, custom applications, enterprise applications.
1. Capture: capture all the data.
2. Process: parse, cleanse, apply structure and transform.
3. Exchange: push to the existing data warehouse for use with existing analytic tools.
Pattern #2: Data Exploration with Hadoop
Collect data and perform iterative investigation for value, over the same data sources, systems and applications as above.
1. Capture: capture all the data.
2. Process: parse, cleanse, apply structure and transform.
3. Exchange: explore and visualize with analytics tools that support Hadoop.
Pattern #3: Application Enrichment with Hadoop
Collect data, analyze it and present the salient results to online applications, again over the same data sources, systems and applications.
1. Capture: capture all the data.
2. Process: parse, cleanse, apply structure and transform.
3. Exchange: incorporate data directly into applications.
Top 5 General Usages
1. Big Data Exploration
2. Obtaining a 360-degree view of customers
3. Operations analytics
4. Data warehouse augmentation
5. Social media
How is the Airline Industry using Hadoop?

Top 5 Airline Usages
1. Capturing sensor data to optimize maintenance
2. Forecasting the weather to optimize fuel loads
3. Identifying and capturing the demand signal (competitive offerings, travel partner feeds)
4. Loyalty and promotions
5. Webpage visits and log storage
Prospects: CTO questions
- When is the best time of day / day of week / time of year to fly to minimize delays?
- Do older planes suffer more delays?
- How does the number of people flying between different locations change over time?
- How well does weather predict plane delays?
- Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
American Airlines is utilizing a Hadoop-based solution for clickstream, customer, kiosk and data analytics.
British Airways uses a Hadoop-based solution for forecasting and revenue management.
Orbitz Worldwide utilizes a Hadoop-based solution for statistical analysis to identify the best possible promotions that combine air travel with a hotel stay.
Source: Internet/Google search.
Cost Benefit Analysis
(Chart: Economics of Data) Cost per terabyte vs. size of data, from gigabyte scale to petabyte scale: the cost of current structured data management technologies rises steeply as data grows, while the cost of Hadoop remains low; the gap between the two curves is the value captured by customers (like United) using Hadoop.
Why is Hadoop a value proposition?

TCOD (Total Cost of Data), the cost of owning (and using!) data over time for analytic purposes, is a better metric than TCO (Total Cost of Ownership) for the cost-benefit analysis of this case. TCOD estimates what a company like United will really spend to get to its business goal, and the focus is on total cost, not just the platform cost. In this case the TCOD comparison is made between an EDW platform/appliance and Hadoop for the same amount of (raw or unstructured) data.
Cost comparison and benefits are based on the underlying data management requirements.

Project A: Emphasis on Data Refining Requirements
1. Hundreds of TB of data per week; 500 TB of data.
2. Raw data life: a few hours to a few days.
3. Challenge: find the important events or trends.
4. Analyze the raw data once or a few times.
5. When analyzing, read entire files.
6. Keep only the significant data.

                          EDW Appliance   Hadoop
Total System Cost         $23M            $1.3M
System and Data Admin     $2M             $2M
Application Development   $5M             $6M
Total Cost of Data        $30M            $9.3M

The cost equation is favorable to Hadoop for data refining, data landing and archival requirements.
Project B: Emphasis on EDW Requirements
1. Data volume: 500 TB to start; all must be retained for at least five years.
2. Continual growth of data and workload.
3. Data sources: thousands.
4. Data sources change their feeds frequently.
5. Challenges: data must be correct and data must be integrated.
6. Typical enterprise data lifetime: decades.
7. Analytic application lifetime: years.
8. Millions of data users.
9. Hundreds of analytic applications.
10. Thousands of one-time analyses.
11. Tens of thousands of complex queries.

                          EDW Appliance   Hadoop
Total System Cost         $45M            $5M
System and Data Admin     $50M            $100M
Application Development   $40M            $300M
ETL                       $60M            $100M
Complex Queries           $40M            $80M
Analysis                  $30M            $70M
Total Cost of Data        $265M           $655M

The cost equation is favorable to the EDW for data warehouse appliance requirements.
Cost Comparison Conclusions
- Each technology has large advantages in its sweet spot(s).
- Neither platform is cost-effective in the other's sweet spot.
- The biggest differences for the data warehouse are the development costs for complex queries and analytics.
- Total cost is extremely sensitive to technology choice.
- Analytic architectures will require both Hadoop and data warehouse platforms.
- Focus on total cost, not platform cost, in making your choice for a particular application or use.
- Many analytic processes will use both Hadoop and EDW technology, so integration cost also counts!
Source for the TCOD comparison: the "Real Cost of Big Data" spreadsheet provided by Winter Corp (www.wintercorp.com/tcod).
Hadoop is ideal for storage of:
- data which is rarely needed;
- data which can grow rapidly;
- data which can grow very large;
- data for which it is uncertain how it will be needed in the future;
- data which may or may not have structure;
- data which may require ETL and analysis sometime in the future but just needs to be stored now, for some as-yet-unknown use.

TCOD is the cost of owning (and using!) data over time for analytic purposes. Its components are system cost, administration cost, and software development/maintenance cost (ETL*, applications, queries, analytics).
* ETL is extract, transform and load (preparing data for analytic use).
Part 2. A very quick introduction to understanding data and the analysis of data

Start here if you are new to understanding data or do not know how to analyze data.




Introduction to Data
Descriptive Analytics (What happened?): Formulates and analyzes historical data; pattern recognition from samples for reporting of trends. Benefit: medium.
Predictive Analytics (What could happen?): Forecasts future probabilities and trends; finds associations in data not readily apparent with customary analysis. Benefit: high.
Prescriptive Analytics (What is the best action/outcome?): Targets business constraints; assesses and determines new ways to operate. Benefit: very high.

Myth: I have large sets of data on the Hadoop File System. Running powerful analytical tools (e.g., R, SAS, Tableau, etc.) on the Hadoop infrastructure will perform all the data analysis work for me and provide/deliver useful information.

Fact: The data by itself may not contain the answer, big or small; you need the right data. The combination of data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. Performing data analysis with an understanding (and application) of data science principles, by correctly framing the analytical problem (with correct data sampling/collection methods), and with the use of appropriate analytical tools will most likely provide useful information for statistical/analytical inference.
Data Analysis Benefits

Data Facts
- Data is just data.
- Data does not live or die.
- Data does not offer truth, nor does it lie.
- Data is neither large nor small.
- Data has always existed as both big data and small data.

Data is the second most important thing
- The most important thing in data science is the question.
- The second most important is the data.
- Often the data will limit or enable the questions.
- But having data is useless if you don't have an appropriate question.

"Data are values of qualitative or quantitative variables, belonging to a set of items." Source: Wikipedia
Data Types: Basics

Types of Data Variables
Examples (from above table):
gender: categorical
sleep: numerical, continuous
bedtime: categorical, ordinal
countries: numerical, discrete
dread: categorical, ordinal - could also
be used as numerical
Data variables are either numerical (continuous or discrete) or categorical (regular categorical or ordinal).
Numerical (quantitative)
variables take on numerical
values. It is sensible to add,
subtract, take averages, etc. with
these values.
Categorical (qualitative) variables take on a limited number of distinct categories. These categories can be identified with numbers, but it wouldn't be sensible to do arithmetic operations with these values.
Continuous numerical
variables are measured,
and can take on any
numerical value.
Discrete numerical variables are
counted, and can take on only
whole non-negative numbers.
Categorical variables that have ordered levels are called ordinal. Think about a flight survey question where you are asked how satisfied you are with the customer service you received, and the options are very unsatisfied, unsatisfied, neutral, satisfied or very satisfied. These levels have an inherent ordering, and hence the variable is called ordinal.
If the levels of the categorical variable do not have an inherent ordering to them, then the variable is simply called (regular) categorical (e.g., do you prefer a morning flight or an evening flight?).

Observations, Variables and Data Matrices
Data are organized in what we call a data matrix, where each row represents
an observation (or a case), and each column represents a variable.
City            no_flights   %_ontime   region
Chicago         350          90         Midwest
Houston         330          96         South
Newark          306          92         Northeast
San Francisco   310          93         West
The first variable is City, an identifier variable giving the name of the city United serves for which the data are gathered.
Next is no_flights (number of flights) served by United daily, a discrete numerical variable.
Next is %_ontime (percentage on time), representing the United flights that operated on time (arrival or departure); this is a continuous numerical variable (as it can take on any value between zero and 100, even though the values shown here are rounded to whole numbers).
The last column is region, representing where the city is located in the USA as designated by the US census (Northeast, Midwest, South, and West); this is a categorical variable.
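As an illustration only, the same data matrix can be built in Python with pandas; the values are the ones from the example table and the column names are chosen for readability:

    import pandas as pd

    # Each row is an observation (a city United serves), each column a variable.
    flights = pd.DataFrame({
        "city":       ["Chicago", "Houston", "Newark", "San Francisco"],   # identifier
        "no_flights": [350, 330, 306, 310],                                # discrete numerical
        "pct_ontime": [90.0, 96.0, 92.0, 93.0],                            # continuous numerical
        "region":     ["Midwest", "South", "Northeast", "West"],           # categorical
    })
    flights["region"] = flights["region"].astype("category")
    print(flights.dtypes)
    print(flights.describe())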
Relationships between Variables
When two variables show some connection with one another, they are called associated, or dependent, variables. The association can be further described as positive or negative. If two variables are not associated, they are said to be independent.
Data Observational Studies and Experiments

Observational Study
In an observational study you collect data in a way that does not directly interfere with how the data arise, i.e. you merely observe. We can only establish an association (or correlation) between the explanatory and response variables. If an observational study uses data from the past, it's called a retrospective study, whereas if data are collected throughout the study, it's called prospective.
Experiment
In an experiment, you randomly
assign subjects to various
treatments and can therefore
establish a causal connection
between the explanatory and
response variables.
(Diagram) Observational study: sample people who work out and people who don't, then compare the average energy level of the two groups. Experiment: randomly assign the sampled people to the work-out and no-work-out groups, then compare average energy levels.
In the observational study we sample two types of people from the population, those who choose to work out regularly and those who don't, then we find the average energy level for the two groups of people and compare. On the other hand, in an experiment we sample a group of people from the population and then randomly assign these people into two groups: those who will regularly work out throughout the course of the study and those who will not. The difference is that the decision of whether to work out or not is not left to the subjects, as in the observational study, but is instead imposed by the researcher. At the end we compare the average energy levels of the two groups. Based on the observational study, even if we find a difference between the energy levels of these two groups of people, we really can't attribute this difference solely to working out, because there may be other variables that we didn't control for in this study that contribute to the observed difference. For example, people who are in better shape might be more likely to work out and also have high energy levels. However, in the experiment such variables that are likely to contribute to the outcome are equally represented in the two groups, due to random assignment. Therefore if we find a difference between the two averages, we can indeed make a causal statement attributing this difference to working out.
Example: Suppose you want to evaluate the relationship between regularly working out and energy level. We can design the study as an observational study or as an experiment.


What type of study is this, observational
study or an experiment?
Girls who regularly ate breakfast, particularly one that includes cereal,
were slimmer than those who skipped the morning meal, according to a
study that tracked nearly 2,400 girls for 10 years. [...] As part of the
survey, the girls were asked once a year what they had eaten during the
previous three days.
This is an observational study since the researchers merely observed the
behavior of the girls (subjects) as opposed to imposing treatments on
them. The study concludes there is an association between girls eating
breakfast and being slimmer.
Three possible explanations:
#1: Eating breakfast causes girls to be thinner.
#2: Being thin causes girls to eat breakfast.
#3: A third variable is responsible for both. What could it be? An extraneous variable that affects both the explanatory and the response variable, making it seem like there is a relationship between the two, is called a confounding variable.
Whether we can infer correlation or causation depends on the type of study our decision is based on. Observational studies for the most part only allow us to make correlation statements, while experiments allow us to infer causation. Correlation does not imply causation.

Data Sampling and Sources of Bias

Data Sampling
Think about sampling something that you are cooking: you taste (examine) a small part of what you're cooking to get an idea about the dish as a whole. When you taste a spoonful of soup and decide the spoonful you tasted isn't salty enough, that's exploratory analysis for the sample at hand. If you can generalize and conclude that your entire soup needs salt, that's making an inference. For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population). If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not representative of the whole pot. On the other hand, if you first stir the soup thoroughly before you taste it, your spoonful will more likely be representative of the whole pot.
Exploratory
Analysis
Representative
Sample
Inference

Sources of Sampling Bias
Convenience sample bias: Individuals who are easily accessible are more likely to be included in the sample. Example: say you want to find out how people in your city feel about a recent increase in public transportation costs. If you only poll people in your neighborhood, as opposed to a representative sample from the entire city, your study will suffer from convenience sample bias.
Voluntary response bias: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the entire population. Example: say you place polling machines at all bus stops and metro stations in your city, but only those hurt by the price increase choose to actually take the time to vote and express their opinion on the recent increase in public transportation fares. The people who respond to such a poll are not representative of the entire population.
Non-response sampling bias: If only a (non-random) fraction of the randomly sampled people choose to respond to a survey, the sample is no longer representative of the entire population. Example: say you take a random sample of individuals from your city and attempt to survey them, but certain segments of the population, say those of lower socio-economic status, are less likely to respond to the survey; then it is not representative of the entire population.

Sampling Bias a historical example: Landon vs. FDR In 1936, Landon sought the Republican presidential
nomination opposing the re-election of FDR.

A popular magazine of the time (1936), Literary Digest, polled about 10 million Americans and got responses from about 2.4 million. To put things in perspective, nowadays reliable polls in the USA survey about 1,500 to 3,000 people, so the 10 million poll was a very large sample.

The poll showed that Landon would likely be the
overwhelming winner and FDR would get only 43% of the
votes.

Election result: FDR won, with 62% of the votes.
What went wrong with
the Literary Digest Poll?
The magazine had surveyed
its own readers:
registered automobile owners,
and registered telephone users.
These groups had incomes well above the national average of the
day (remember, this is Great Depression era) which resulted in lists
of voters far more likely to support Republicans than a truly typical
voter of the time, i.e. the sample was not representative of the
American population at the time.
The Literary Digest election
poll was based on a sample
size of 2.4 million, which is huge,
but since the sample was
biased, the sample did not
yield an accurate prediction.

Back to the soup analogy: if the soup is not well stirred, it doesn't matter how large a spoon you have, it will still not taste right. If the soup is well stirred, a small spoon will suffice to test the soup.

Data Sampling Methods & Experimental Design

Obtaining Good Samples
Almost all statistical methods are based on the notion of implied randomness. If observational data are not collected in a random framework from a population, the estimates and the errors associated with those estimates are not reliable. The most commonly used random sampling techniques are simple, stratified, and cluster sampling.
Simple Random Sample
Randomly select cases
from the population,
where there is no
implied connection
between the points
that are selected.
Stratified Sample
Strata are made up of
similar observations. We
take a simple random
sample from each
stratum.
Cluster Sample
Clusters are usually not
made up of homogeneous
observations, and we take
a simple random sample
from a random sample of
clusters. Usually preferred
for economical reasons.
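A minimal sketch of the three random sampling techniques in Python with pandas and NumPy; the population frame, the region strata and the flight clusters are entirely hypothetical:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    # Hypothetical population frame: 10,000 passengers with a region (stratum) and a flight (cluster).
    population = pd.DataFrame({
        "passenger_id": range(10_000),
        "region": rng.choice(["Northeast", "Midwest", "South", "West"], size=10_000),
        "flight": rng.integers(0, 200, size=10_000),
    })

    # Simple random sample: every case has the same chance of selection.
    srs = population.sample(n=500, random_state=1)

    # Stratified sample: a simple random sample from each stratum (region).
    stratified = population.groupby("region", group_keys=False).apply(
        lambda g: g.sample(frac=0.05, random_state=1))

    # Cluster sample: randomly pick whole clusters (flights), then sample within the chosen clusters.
    chosen_flights = rng.choice(population["flight"].unique(), size=10, replace=False)
    cluster = population[population["flight"].isin(chosen_flights)].sample(frac=0.5, random_state=1)

    print(len(srs), len(stratified), len(cluster))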
(data) Experimental Design Concepts
Control: Compare treatment of interest to a control group.
Randomize: Randomly assign subjects to treatments, and randomly sample from the population
whenever possible.
Replicate: Within a study, replicate by collecting a sufficiently large sample. Or replicate the
entire study.
Block: If there are variables that are known or suspected to affect the response variable, first
group subjects into blocks based on these variables, and then randomize cases within each block
to treatment groups.
Blocking example: We would like to design an experiment to investigate whether energy gels make you run faster. Treatment: energy gel; control: no energy gel. It is suspected that energy gels might affect pro and amateur athletes differently, so we block for pro status: divide the sample into pro and amateur, randomly assign pro athletes to treatment and control groups, and randomly assign amateur athletes to treatment and control groups. Pro/amateur status is then equally represented in the resulting treatment and control groups (see the sketch below).
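A small sketch of the blocked random assignment just described; the athlete roster is made up for illustration:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    # Hypothetical sample: 20 athletes, half pro and half amateur (the blocking variable).
    athletes = pd.DataFrame({
        "athlete_id": range(20),
        "status": ["pro"] * 10 + ["amateur"] * 10,
    })

    # Block on pro/amateur status, then randomize to treatment (energy gel) or control within each block.
    def assign_within_block(block):
        arms = np.array(["energy_gel", "control"] * (len(block) // 2))
        block = block.copy()
        block["group"] = rng.permutation(arms)
        return block

    assigned = athletes.groupby("status", group_keys=False).apply(assign_within_block)
    print(assigned.groupby(["status", "group"]).size())   # each status equally represented in both groups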
Random Assignment vs. Random Sampling

Hypothesis Testing

Two competing claims
Claim 1: There is nothing going on. Promotion and gender are independent, there is no gender discrimination, and the observed difference in proportions is simply due to chance. => Null hypothesis
Claim 2: There is something going on. Promotion and gender are dependent, there is gender discrimination, and the observed difference in proportions is not due to chance. => Alternative hypothesis


A court trial as a hypothesis test
Hypothesis testing is very much like a court trial.
H0: Defendant is innocent
HA: Defendant is guilty
We then present the evidence: collect data.
Then we judge the evidence - Could these data plausibly have happened by
chance if the null hypothesis were true?
If they were very unlikely to have occurred, then the evidence raises
more than a reasonable doubt in our minds about the null
hypothesis.
Ultimately we must make a decision. How unlikely is unlikely?
If the evidence is not strong enough to reject the assumption of innocence, the
jury returns with a verdict of not guilty.
The jury does not say that the defendant is innocent, just that there
is not enough evidence to convict.
The defendant may, in fact, be innocent, but the jury has no way of
being sure.
Said statistically, we fail to reject the null hypothesis. We never declare the null hypothesis to be true, because we simply do not know whether it's true or not. Therefore we never accept the null hypothesis.
In a trial, the burden of proof is on the prosecution.
In a hypothesis test, the burden of proof is on the unusual claim.
The null hypothesis is the ordinary state of affairs (the status quo), so it's the alternative hypothesis that we consider unusual and for which we must gather evidence.
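As a hedged illustration of this logic, a simulation-based (permutation) test for the promotion-and-gender claims above; the counts are invented, only the mechanics matter:

    import numpy as np

    rng = np.random.default_rng(7)
    # Hypothetical data: 24 male and 24 female applications; 21 vs. 14 promotions.
    promotions = np.array([1] * 21 + [0] * 3 + [1] * 14 + [0] * 10)
    gender = np.array(["M"] * 24 + ["F"] * 24)
    observed_diff = promotions[gender == "M"].mean() - promotions[gender == "F"].mean()

    # Under H0 (promotion independent of gender) the labels are exchangeable, so shuffle the gender
    # labels many times and see how often chance alone produces a difference at least this large.
    sim_diffs = []
    for _ in range(10_000):
        shuffled = rng.permutation(gender)
        sim_diffs.append(promotions[shuffled == "M"].mean() - promotions[shuffled == "F"].mean())
    sim_diffs = np.array(sim_diffs)

    p_value = (sim_diffs >= observed_diff).mean()   # one-sided p-value
    print(observed_diff, p_value)                   # a small p-value is evidence against H0, in favor of HA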
Statistical Inference and Prediction

Statistical Inference
Statistical inference is the process
of drawing conclusions from data
that are subject to random variation,
for example, observational errors or
sampling variation. Initial
requirements of such a system of
procedures for inference and
induction are that the system should
produce reasonable answers when
applied to well-defined situations and
that it should be general enough to
be applied across a range of
situations. Inferential statistics are
used to test hypotheses and make
estimations using sample data.
The outcome of a statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy.

Confidence Interval
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. If independent samples are taken repeatedly from the same population, and a confidence interval is calculated for each sample, then a certain percentage (the confidence level) of the intervals will include the unknown population parameter. Confidence intervals are usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9% (or whatever) confidence intervals for the unknown parameter.
Confidence Level
Confidence level is the probability value (1-alpha) associated with a
confidence interval. It is often expressed as a percentage. For example,
say alpha = 0.05 = 5%, then the confidence level is equal to
(1-0.05) = 0.95, i.e. a 95% confidence level.
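A minimal sketch of a 95% confidence interval for a mean using the normal approximation; the sample of delay values is synthetic:

    import numpy as np

    rng = np.random.default_rng(3)
    # Synthetic sample: 200 observed departure delays in minutes (illustrative only).
    delays = rng.normal(loc=12.0, scale=30.0, size=200)

    alpha = 0.05                      # confidence level = 1 - alpha = 95%
    z = 1.96                          # normal critical value for 95%
    mean = delays.mean()
    se = delays.std(ddof=1) / np.sqrt(len(delays))
    ci = (mean - z * se, mean + z * se)
    print(mean, ci)   # if the sampling were repeated, ~95% of such intervals would cover the true mean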

Prediction
In statistics, prediction is the process of determining the magnitude of statistical variates at some future point in time. In a data analysis context the word may also occur in slightly different meanings; e.g. in a regression equation expressing a dependent variate y in terms of independent x's, the value given for y by specified values of the x's is called the predicted value even when no temporal element is involved.
Prediction vs. Inference
Using data to predict an event that has yet to occur is statistical
prediction. Inferring the value of a population quantity such as the
average income of a country or the proportion of eligible voters who say
they will vote yes is statistical inference.
Prediction and inference answer different types of data analysis
questions.

Examples of predictions (because the events have not occurred at the time
of writing this content):
The probability that Chicago Bulls will win the 2018 NBA playoffs is __.
The probability that Republican Party will win the 2020 Presidential
election is __.

Examples of inferences: (because the questions involve estimating a
population value.)
The proportion of NBA fans that currently believe Chicago Bulls will win
the 2018 playoffs is __.
The proportion of eligible voters that currently state they will vote for
Republican Party in the 2020 Presidential election is __.
Before you proceed to Part 3:

Please quickly review the Appendix section to familiarize yourself with the terms and terminology that will be used in the rest of the presentation.
Part 3. Big Data Analysis: Concepts and References to Use Cases in the Airline Industry

Jump here directly if you are an advanced user who understands data and knows how to analyze data.
Big Data Analysis: Concepts and Airline Industry Use Cases

Big Data application areas:

Data Analysis: Models derive useful analytical information so humans can better understand it. Examples: Does spending more money on marketing and sales in area X vs. area Y make the company more profitable? What does the customer want? (e.g., customer survey).

Machine Learning: Models allow machines (software programs and applications) to make real-time (automated) decisions. Examples: Google search and Amazon product recommendations, the Facebook news feed, geographic (GPS-based) advertisements, or event-based (holiday, weather, traffic) promotions.

Bayesian Approach and Bayes Rule

Bayesian Approach

Differences between Bayesians and Non-Bayesians
Bayes
Theorem
Example: Email Spam Filtering
The probability that an email message is spam, given the words in the email message, is expressed as:
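P(spam | words) = P(words | spam) x P(spam) / P(words)
where P(spam) is the prior probability that a message is spam and P(words | spam) is the likelihood of seeing those words in a spam message (this is the standard form of Bayes' theorem applied to the example).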
With Bayes:
A key benefit: The ability to incorporate prior knowledge
A key weakness: The need to incorporate prior knowledge
Bayesian Belief Network Representation of Airline Passenger Behavior
Source: Booz Allen Hamilton
The basis of this slide is the paper titled "Airline Analytics: Decision Analytics Center of Excellence" by Cenk Tunasar, Ph.D., and Alex Cosmas of Booz Allen Hamilton.

In the above paper the authors state that Booz Allen used the Big Data infrastructure of an airline client and was able to analyze large datasets containing more than 3 years' worth of passenger data (approximately 100+ GB). Booz Allen generated hypotheses to test from the Big Data set, including but not limited to:

Airline Market Performance
- What are the client's natural market types and their distinct attributes?
- What is the client's competitive market health?
- Where does the client capture fare premiums or fare discounts relative to other carriers?

Passenger Behavior
What is the variability of booking curves by market type?
What are the intrinsic attributes of markets with the
highest earn and highest burn rates?
Can predictive modeling be developed for reservation
changes and no-show rates for individual passengers on
individual itineraries?

Consumer Choice
What is the demand impact of increasing connection time?
What is the effect of direct versus connecting
itineraries on passenger preference?
A use
case in
Airline
industry
(URL: http://www.boozallen.com/media/file/airline-analytics-brochure.pdf)

Bayesian Ideas are very important for Big Data Analysis

Bayesian Themes


Prediction
Average over
unknowns,
don't maximize.

Uncertainty
Probability coherently
represents uncertainty.

Combine Information
Hierarchical models
combine information
from multiple sources.
Source: Steve Scott (Google Inc.)

Sparsity
Sparsity plays an important role in
modeling Big Data
Models are "big" because of a small
number of factors with many levels.
Big data problems are often big
collections of small data problems.

Multi-armed
Bandits Problem
Multi-armed bandit
problem is the problem a
gambler faces at a row of
slot machines, sometimes
known as "one-armed
bandits", when deciding
which slot machines to
play, how many times to
play each machine and in
which order to play them.
When played, each
machine provides a
random reward from a
distribution specific to
that machine. The
objective of the gambler
is to maximize the sum of
rewards earned through a
sequence of lever pulls.
Source: Wikipedia
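As a standard illustration of how Bayesian reasoning drives such decisions, a short Thompson sampling sketch for a three-armed bandit (the click rates are hypothetical; this is a common textbook strategy, not the method from the slide's sources):

    import numpy as np

    rng = np.random.default_rng(9)
    true_rates = [0.04, 0.05, 0.07]        # hypothetical click rates of three offers ("arms")
    successes = np.zeros(3)
    failures = np.zeros(3)

    # Thompson sampling: a Bayesian strategy for the bandit problem described above.
    for _ in range(10_000):
        # Sample a plausible rate for each arm from its Beta posterior, then play the best-looking arm.
        sampled = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(sampled))
        reward = rng.random() < true_rates[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward

    print(successes + failures)            # most pulls concentrate on the best arm over time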

Bayes Rule applied to Machine Learning


A use
case in
Airline
industry
Big Data Project at Southwest Airlines
The URL below provides a visual (interactive graphics) presentation of the Big Data project at Southwest Airlines and how they used a Bayesian approach and Naive Bayes classification with the WEKA ("Waikato Environment for Knowledge Analysis") tool to analyze the following questions:
1) What are the important factors that cause delays, and what are their weights?
2) What kind of weather (e.g. sunny, cloudy, snow, rain, etc.) causes weather delays?
3) Are some time periods during the day (e.g. early morning, morning, noon, etc.) more prone to delays than others?
(URL: http://prezi.com/f3bsv9m6yl2g/big-data-project_southwest-airlines/)
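As a hedged, minimal sketch of this kind of Naive Bayes classification, done here in Python with scikit-learn rather than WEKA; the weather/time-of-day records and labels are entirely made up:

    from sklearn.naive_bayes import CategoricalNB
    from sklearn.preprocessing import OrdinalEncoder

    # Hypothetical historical records: (weather, time of day) -> delayed or not.
    X_raw = [["sunny", "morning"], ["snow", "evening"], ["rain", "noon"],
             ["sunny", "noon"], ["snow", "early_morning"], ["cloudy", "evening"],
             ["rain", "evening"], ["sunny", "early_morning"]]
    y = [0, 1, 1, 0, 1, 0, 1, 0]     # 1 = delayed, 0 = on time

    encoder = OrdinalEncoder()
    X = encoder.fit_transform(X_raw)

    model = CategoricalNB()
    model.fit(X, y)

    # Estimated probability of delay for a new (weather, time) combination.
    new_flight = encoder.transform([["snow", "noon"]])
    print(model.predict_proba(new_flight))   # [P(on time), P(delayed)]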
Entirely driven by parameter uncertainty
Example: Bayesian based Search Optimization on Google File System (Source: Google Analytics)


Source: Steve Scott (Google Inc.)
Personalization as a "Big Logistic Regression"

Search words: Chicago to Houston today

Search words: Chicago to Houston flight tomorrow

Search words: Chicago to Houston cheapest
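A toy sketch of "personalization as a big logistic regression": a few features extracted from the query and the candidate result are used to predict the probability of a click; the features and data are invented for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical feature vectors per (query, result) pair:
    # [query mentions "today", query mentions "cheapest", result is nonstop, result price rank]
    X = np.array([
        [1, 0, 1, 1],
        [1, 0, 0, 3],
        [0, 1, 0, 1],
        [0, 1, 1, 4],
        [0, 0, 1, 2],
        [1, 1, 1, 1],
    ])
    y = np.array([1, 0, 1, 0, 1, 1])   # 1 = user clicked the result

    model = LogisticRegression()
    model.fit(X, y)

    # Score a new result for a "Chicago to Houston cheapest" style query.
    print(model.predict_proba([[0, 1, 0, 2]])[0, 1])   # estimated click probability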

Meta Analysis

Meta Analysis

Meta-analysis refers to methods that focus on contrasting and
combining results from different studies, in the hope of identifying
patterns among study results, sources of disagreement among those
results, or other interesting relationships that may come to light in
the context of multiple studies. In its simplest form, meta-analysis is
normally done by identification of a common measure of effect size. A
weighted average of that common measure is the output of a meta-
analysis. The weighting is related to sample sizes within the individual
studies. More generally there are other differences between the
studies that need to be allowed for, but the general aim of a meta-
analysis is to more powerfully estimate the true effect size as opposed
to a less precise effect size derived in a single study under a given
single set of assumptions and conditions. Source: Wikipedia
Advantages
- Results can be generalized to a larger population.
- The precision and accuracy of estimates can be improved as more data is used; this, in turn, may increase the statistical power to detect an effect.
- Inconsistency of results across studies can be quantified and analyzed. For instance, does inconsistency arise from sampling error, or are study results (partially) influenced by between-study heterogeneity?
- Hypothesis testing can be applied to summary estimates.
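A minimal sketch of the fixed-effect (inverse-variance weighted) average that a simple meta-analysis produces; the per-study effect sizes and standard errors are invented:

    import numpy as np

    # Hypothetical effect sizes (e.g., price elasticities) and standard errors from five studies.
    effects = np.array([-1.10, -0.85, -1.30, -0.95, -1.05])
    std_errors = np.array([0.20, 0.15, 0.30, 0.10, 0.25])

    weights = 1.0 / std_errors**2                  # inverse-variance weights (more precise studies weigh more)
    pooled = np.sum(weights * effects) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))

    print(pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se))  # pooled estimate and 95% CI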
A use
case in
Airline
industry
Price Elasticities of Demand for Passenger Air Travel
A good discussion of the topic is detailed in the paper listed below:

Price Elasticities of Demand for Passenger Air Travel: A Meta-Analysis

by Martijn Brons, Eric Pels, Peter Nijkamp, Piet Rietveld
of Tinbergen Institute
(URL: http://papers.tinbergen.nl/01047.pdf)

Meta Analysis and Big Data
A good discussion of the topic is detailed in the article listed below:
Meta-Analysis: The Original 'Big Data'
by Blair T. Johnson , Professor at University of Connecticut
(URL: http://meta-analysis.ning.com/profiles/blogs/meta-analysis-the-original-big-data)


Effect Size

Effect Size

Effect size is a measure of the strength of a phenomenon (for
example, the change in an outcome after experimental
intervention). An effect size calculated from data is a descriptive
statistic that conveys the estimated magnitude of a relationship
without making any statement about whether the apparent
relationship in the data reflects a true relationship in the
population. In that way, effect sizes complement inferential
statistics such as p-values. Among other uses, effect size measures
play an important role in meta-analysis studies that summarize
findings from a specific area of research, and in statistical power
analyses. Source: Wikipedia

Example: A weight loss program may boast that it leads to an
average weight loss of 30 pounds. In this case, 30 pounds is the
claimed effect size. if the weight loss program results in an average
loss of 30 pounds, it is possible that every participant loses exactly
30 pounds, or half the participants lose 60 pounds and half lose
no weight at all.
"Small", Medium", Large" Effect Sizes
Effect sizes apply terms such as "small", "medium" and "large" to the
size of the effect and are relative. Whether an effect size should be
interpreted small, medium, or large depends on its substantive context
and its operational definition. Cohen's conventional criteria small,
medium, or big are near ubiquitous across many fields. Power analysis or
sample size planning requires an assumed population parameter of
effect sizes.





For Cohen's an effect size of 0.2 to 0.3 might be a "small" effect, around
0.5 a "medium" effect and 0.8 to infinity, a "large" effect.

Monte Carlo Method

Monte Carlo Method

Monte Carlo methods (or experiments) are a broad
class of computational algorithms that rely on
repeated random sampling to obtain numerical results;
typically one runs simulations many times over in order
to obtain the distribution of an unknown probabilistic
entity. The name comes from the resemblance of the
technique to the act of playing and recording results in
a real gambling casino. They are often used in physical
and mathematical problems and are most useful when
it is difficult or impossible to obtain a closed-form
expression, or infeasible to apply a deterministic
algorithm. Monte Carlo methods are mainly used in
three distinct problem classes: optimization, numerical
integration and generation of draws from a probability
distribution.
Monte Carlo methods vary, but tend to follow a particular
pattern:
Define a domain of possible inputs.
Generate inputs randomly from a probability distribution
over the domain.
Perform a deterministic computation on the inputs.
Aggregate the results.
For example: Consider a circle inscribed in a unit square. Given that the circle and the square have a ratio of areas of pi/4, the value of pi can be approximated using a Monte Carlo method:
- Draw a square on the ground, then inscribe a circle within it.
- Uniformly scatter some objects of uniform size (grains of rice or sand) over the square.
- Count the number of objects inside the circle and the total number of objects.
- The ratio of the two counts is an estimate of the ratio of the two areas, which is pi/4. Multiply the result by 4 to estimate pi.
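The example above translates directly into a few lines of Python, with random points in the unit square standing in for the grains of rice:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Scatter random points uniformly over the unit square.
    x, y = rng.random(n), rng.random(n)

    # The fraction landing inside the inscribed circle approximates pi/4.
    inside = (x - 0.5)**2 + (y - 0.5)**2 <= 0.25
    print(4 * inside.mean())   # close to 3.14159...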
Monte Carlo Methods for Bayesian Analysis and Big Data
A good discussion of the topic is detailed in the paper listed below:
A Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets
by David Madigan, Professor and Dean at Columbia University
and Greg Ridgeway, Deputy Director at National Institute of Justice.
(URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2753529/ )
Source: Wikipedia

A use
case in
Airline
industry
Flight Delay-Cost (Initial delay type I and Propagated delay type II)
and Dynamic Simulation Analysis for Airline Schedule Optimization
Flight Delay-Cost Simulation Analysis and
Airline Schedule Optimization
by Duojia Yuan of RMIT University, Victoria, Australia
(URL: http://researchbank.rmit.edu.au/eserv/rmit:9807/Yuan.pdf

General use
case for
Customer
Satisfaction
and
Customer
Loyalty
Concurrent Reinforcement Learning from Customer Interactions

Concurrent Reinforcement Learning from Customer Interactions
by David Silver of University College London (published 2013)
and Leonard Newnham, Dave Barker, Suzanne Weller, Jason McFall of Causata Ltd .
(URL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/concurrent-rl.pdf )
A good discussion of the topic is detailed in the Ph.D. thesis listed below. The reliability
modeling approach developed in this project (to enhance the dispatch reliability of
Australian X airline fleet) is based on the probability distributions and Monte Carlo
Simulation (MCS) techniques. Initial (type I) delay and propagated (type II) delay are
adopted as the criterion for data classification and analysis.
In the below paper, authors present a framework for concurrent reinforcement learning, a new method of
a company interacting concurrently with many customers with an objective function to maximize revenue,
customer satisfaction, or customer loyalty, which depends primarily on the sequence of interactions
between company and customer (such as promotions, advertisements, or emails) and actions by the
customer (such as point-of-sale purchases, or clicks on a website).
The proposed concurrent reinforcement learning framework uses a variant of temporal-
difference learning to learn efficiently from partial interaction sequences. The goal is to
maximize the future rewards for each customer, given their history of interactions with
the company. The proposed framework differs from traditional reinforcement learning
paradigms, due to the concurrent nature of the customer interactions. This distinction
leads to new considerations for reinforcement learning algorithms.

Bayes and Big Data: Consensus Monte Carlo and Nonparametric Bayesian Data Analysis

A good discussion of the topic is detailed in the article listed below:

Bayes and Big Data: The Consensus Monte Carlo Algorithm
by
Robert E. McCulloch, of University of Chicago, Booth School of Business
Edward I. George, of University of Pennsylvania, The Wharton School
Steven L. Scott, of Google, Inc
Alexander W. Blocker, of Google, Inc
Fernando V. Bonassi, Google, Inc.
(URL: http://www.rob-mcculloch.org/some_papers_and_talks/papers/working/consensus-mc.pdf)

Consensus Monte Carlo


For Bayesian methods to work in a MapReduce / Hadoop environment,
we need algorithms that require very little communication.
Need: A useful definition of big data is data that is too big to fit on a
single machine, either because of processor, memory, or disk
bottlenecks. Graphics Processing Units (GPU) can alleviate the processor
bottleneck, but memory or disk bottlenecks can only be alleviated by
splitting big data across multiple machines. Communication between
large numbers of machines is expensive (regardless of the amount of
data being communicated), so there is a need for algorithms that
perform distributed approximate Bayesian analyses with minimal
communication.

Consensus Monte Carlo operates by running a separate Monte Carlo
algorithm on each machine, and then averaging the individual Monte
Carlo draws. Depending on the model, the resulting draws can be nearly
indistinguishable from the draws that would have been obtained by
running a single machine algorithm for a very long time.
Source: Steve Scott (Google Inc.)
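A highly simplified sketch of the consensus idea for a toy model (a normal mean with known variance and a flat prior): each "machine" samples from its own shard's sub-posterior, and the draws are then combined draw-by-draw with precision weights. This only shows the shape of the algorithm, not the paper's full method:

    import numpy as np

    rng = np.random.default_rng(5)
    # Toy data: a normal mean with known sigma = 1, split across 4 "machines" (shards).
    data = rng.normal(loc=2.0, scale=1.0, size=40_000)
    shards = np.array_split(data, 4)
    n_draws = 5_000

    def shard_draws(y):
        # Sub-posterior for the mean under a flat prior and known sigma = 1:
        # Normal(mean of shard, 1/sqrt(len(shard))). The flat prior makes the prior-fractioning trivial here.
        return rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(len(y)), size=n_draws)

    draws = np.array([shard_draws(y) for y in shards])            # one row of draws per machine

    # Consensus step: combine draw-by-draw with precision (inverse-variance) weights.
    weights = 1.0 / draws.var(axis=1, ddof=1)                     # one weight per machine
    consensus = (weights[:, None] * draws).sum(axis=0) / weights.sum()

    print(consensus.mean(), consensus.std())                      # close to the full-data posterior mean and spread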

Non-Parametric Bayesian Data Analysis
A use
case in
Airline
industry
Airline Delays in International Air Cargo Logistics
A good discussion of the topic is detailed in the paper below:
Nonparametric Bayesian Analysis in
International Air Cargo Logistics
by Yan Shang
of Fuqua School of Business, Duke University
(URL: https://bayesian.org/abstracts/5687 )
Non-parametric analysis refers to comparative properties (statistics) of the data, or population, which do not rely on the typical parameters such as the mean, variance, standard deviation, etc.
Need / Motivation: Models are never correct for real world data.
Non-Parametric Modelling of Large Data Sets
What is a nonparametric model?
A parametric model where the number of parameters increases with data.
A really large parametric model.
A model over infinite dimensional function or measure spaces.
A family of distributions that is dense in some large space.

Why nonparametric models in Bayesian theory of learning?
Broad class of priors that allows data to speak for itself.
Side-step model selection and averaging.
Bayes and Big Data
Homoscedasticity vs. Heteroskedasticity

Homoscedasticity

In regression analysis, homoscedasticity means a situation in which the variance of the dependent variable is the same for all the data. Homoscedasticity facilitates analysis because most methods are based on the assumption of equal variance.

A sequence or a vector of random variables is homoscedastic if all random variables in the sequence or vector have the same finite variance. This is also known as homogeneity of variance.


Heteroskedasticity

In regression analysis, heteroskedasticity means a situation in which the variance of the dependent variable varies across the data. Heteroskedasticity complicates analysis because many methods in regression analysis are based on an assumption of equal variance.

A collection of random variables is heteroscedastic if there are sub-populations that have different variabilities from others. Here "variability" could be quantified by the variance or any other measure of statistical dispersion. Thus heteroscedasticity is the absence of homoscedasticity.
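A brief simulated illustration of the difference, using made-up regression data whose residual spread is either constant or grows with the predictor:

    import numpy as np

    rng = np.random.default_rng(8)
    x = np.linspace(1, 10, 500)

    # Homoscedastic: residual spread is constant across x.
    y_homo = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

    # Heteroskedastic: residual spread grows with x (variance depends on the predictor).
    y_hetero = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x, size=x.size)

    # Compare residual spread in the lower and upper halves of x after a simple linear fit.
    for label, y in [("homoscedastic", y_homo), ("heteroskedastic", y_hetero)]:
        slope, intercept = np.polyfit(x, y, 1)
        resid = y - (intercept + slope * x)
        print(label, np.round(resid[x < 5.5].std(), 2), np.round(resid[x >= 5.5].std(), 2))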





Benford's Law
Benford's Law, also called the First-Digit Law, refers to the
frequency distribution of digits in many (but not all) real-life
sources of data. In this distribution, the number 1 occurs as the
leading digit about 30% of the time, while larger numbers occur in
that position less frequently: 9 as the first digit less than 5% of the
time. Benford's Law also concerns the expected distribution for
digits beyond the first, which approach a uniform distribution.

This result has been found to apply to a wide variety of data sets,
including electricity bills, street addresses, stock prices, population
numbers, death rates, lengths of rivers, physical and mathematical
constants, and processes described by power laws (which are very
common in nature). It tends to be most accurate when values are
distributed across multiple orders of magnitude. Source: Wikipedia

Numerically, the leading digits have the following distribution in Benford's Law, where d is the leading digit and P(d) the probability:
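P(d) = log10(1 + 1/d) for d = 1, ..., 9, giving approximately 30.1%, 17.6%, 12.5%, 9.7%, 7.9%, 6.7%, 5.8%, 5.1% and 4.6% for the digits 1 through 9.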
Benford's Law Big Data Application: Fraud Detection
The distribution above is Benford's Law for base 10. There is a generalization of the law to numbers expressed in other bases (for example, base 16), and also a generalization from the leading digit to the leading n digits. A set of numbers is said to satisfy Benford's Law if the leading digit d (d in {1, ..., 9}) occurs with the probability P(d) given above.
Benford's Law holds true for a data set that grows exponentially (e.g., doubles, then doubles again in the same time span). It is best applied to data sets that span multiple orders of magnitude. The theory does not hold true for data sets in which digits are predisposed to begin with a limited set of digits. The theory also does not hold true when a data set covers only one or two orders of magnitude.
Facts:
- Helps identify duplicates and other data pattern anomalies in large data sets.
- Enables auditors and data analysts to focus on possible anomalies in very large data sets.
- It does not "directly" prove that error or fraud exists, but identifies items that deserve further study on statistical grounds.
- Mainly used for setting future auditing plans; a low-cost entry point for continuous analysis of very large data sets.
- Not good for sampling: it results in very large selection sizes.
- As technology matures, findings of fraud will increase (not decrease).
- Not all data sets are suitable for analysis.
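A minimal sketch of how such a first-digit screen might look on a column of transaction amounts; the amounts here are synthetic (log-uniform), so they follow Benford's Law closely:

    import numpy as np

    rng = np.random.default_rng(2)
    # Synthetic "transaction amounts" spanning several orders of magnitude (log-uniform).
    amounts = 10 ** rng.uniform(1, 6, size=100_000)

    # Observed first-digit frequencies.
    first_digits = np.array([int(str(a)[0]) for a in amounts])
    observed = np.array([(first_digits == d).mean() for d in range(1, 10)])

    # Expected Benford frequencies: P(d) = log10(1 + 1/d).
    expected = np.log10(1 + 1 / np.arange(1, 10))

    for d, o, e in zip(range(1, 10), observed, expected):
        print(d, round(float(o), 3), round(float(e), 3))   # large gaps flag digits worth a closer look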
A use case in the Airline industry
A financial/accounting auditor can evaluate very large data sets (in a continuous-monitoring or continuous-audit environment) that represent a continuous stream of transactions, such as the sales made by a (third-party) online retailer or the internal airline reservation system.
Fraud Detection in Airline Ticket Purchases
Christopher J. Rosetti, CPA, CFE, DABFA of KPMG, in his presentation titled "SAS 99: Detecting Fraud Using Benford's Law", presented to the FAE/NYSSCPA Technology Assurance Committee on March 13, 2003, claims that United Airlines currently uses Benford's Law for fraud detection.
(URL: http://www.nysscpa.org/committees/emergingtech/law.ppt)
Multiple Hypothesis Testing


Multiple Testing Problem
Multiple testing problem occurs when one considers a set of
statistical inferences simultaneously or infers a subset of
parameters selected based on the observed values. Errors in
inference, including confidence intervals that fail to include their
corresponding population parameters or hypothesis tests that
incorrectly reject the null hypothesis are more likely to occur when
one considers the set as a whole. Source: Wikipedia

For example, one might declare that a coin was biased if in 10 flips it landed heads at least 9 times. Indeed, if one assumes as a null hypothesis that the coin is fair, then the probability that a fair coin would come up heads at least 9 out of 10 times is (10 + 1) x (1/2)^10 = 0.0107. This is relatively unlikely, and under statistical criteria such as p-value < 0.05, one would declare that the null hypothesis should be rejected, i.e., the coin is unfair.

A multiple-comparisons problem arises if one wanted to use this test (which is appropriate for testing the fairness of a single coin) to test the fairness of many coins. Imagine one were to test 100 fair coins by this method. Given that the probability of a fair coin coming up 9 or 10 heads in 10 flips is 0.0107, one would expect that in flipping 100 fair coins ten times each, seeing a particular (i.e., pre-selected) coin come up heads 9 or 10 times would still be very unlikely, but seeing any coin behave that way, without concern for which one, would be more likely than not. Precisely, the likelihood that all 100 fair coins are identified as fair by this criterion is (1 − 0.0107)^100 ≈ 0.34. Therefore the application of our single-test coin-fairness criterion to multiple comparisons would, more likely than not, falsely identify at least one fair coin as unfair.
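A minimal sketch in Python reproducing the arithmetic above, together with a Bonferroni-style adjustment, which is one common (though not the only) way to control the family-wise error rate:

```python
from math import comb

# Probability that a single fair coin shows at least 9 heads in 10 flips
p_single = sum(comb(10, k) for k in (9, 10)) / 2**10   # (10 + 1) / 1024 ≈ 0.0107

# Probability that at least one of 100 fair coins passes that threshold
p_any_of_100 = 1 - (1 - p_single) ** 100                # ≈ 0.66, i.e. more likely than not

# A Bonferroni-style correction keeps the family-wise error rate near 0.05
alpha = 0.05
alpha_per_test = alpha / 100

print(f"single-coin p-value: {p_single:.4f}")
print(f"chance of at least one false positive in 100 tests: {p_any_of_100:.2f}")
print(f"Bonferroni-adjusted per-test threshold: {alpha_per_test:.4f}")
```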
Multiple Hypothesis Testing

A use case in Airline industry

Predicting Flight Delays using Multiple Hypothesis Testing
A good discussion of the topic is detailed in the paper listed below:
Predicting Flight Delays
by Dieterich Lawson and William Castillo of Stanford University
(URL: http://cs229.stanford.edu/proj2012/CastilloLawson-PredictingFlightDelays.pdf )

Also detailed in the book
Big Data for Chimps: A Seriously Fun guide to Terabyte-scale data processing
by the same author (Dieterich Lawson) and Philip Kromer. Sample source code for modelling in Matlab is also provided by Dieterich Lawson and can be found at
URL: https://github.com/infochimps-labs/big_data_for_chimps
The German Tank Problem


The German Tank Problem
The problem of estimating the
maximum of a discrete uniform
distribution from sampling without
replacement is known in English as the
German tank problem, due to its
application in World War II to the
estimation of the number of German
tanks. The analyses illustrate the
difference between frequentist
inference and Bayesian inference.
Estimating the population maximum
based on a single sample yields
divergent results, while the estimation
based on multiple samples is an
instructive practical estimation
question whose answer is simple but
not obvious. Source: Wikipedia
During World War II, production of
German tanks such as the Panther
(below photo) was accurately
estimated by Allied intelligence using
statistical methods.
Example: Suppose an intelligence officer has spotted k = 4 tanks with serial numbers 2, 6, 7, and 14, with maximum observed serial number m = 14.
The unknown total number of tanks is called N.
The frequentist approach outlined suggests the (minimum-variance unbiased) estimate of the total number of tanks: m + m/k − 1 = 14 + 14/4 − 1 = 16.5.
Whereas the Bayesian analysis yields (primarily) a probability mass function for the number of tanks N, from which point estimates such as the posterior mean, (m − 1)(k − 1)/(k − 2) = 19.5 (valid for k > 2), can be derived.
This distribution has positive skewness, related to the fact that there are at least 14 tanks.
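A minimal sketch in Python of both estimators for this example; the Bayesian posterior assumes a uniform prior over N and is normalised numerically up to an arbitrary cut-off of 1,000 tanks:

```python
from math import comb

def frequentist_estimate(m, k):
    """Minimum-variance unbiased estimate: m + m/k - 1."""
    return m + m / k - 1

def bayesian_posterior(m, k, n_max=1000):
    """Posterior mass P(N = n | m, k) for n >= m, assuming a uniform prior over N.
    The likelihood of observing maximum m in k draws from 1..n is C(m-1, k-1) / C(n, k);
    the posterior is normalised numerically over n = m .. n_max."""
    weights = {n: comb(m - 1, k - 1) / comb(n, k) for n in range(m, n_max + 1)}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

m, k = 14, 4
post = bayesian_posterior(m, k)
posterior_mean = sum(n * p for n, p in post.items())
print(f"frequentist estimate: {frequentist_estimate(m, k):.1f}")            # 16.5
print(f"Bayesian posterior mean (truncated at 1000): {posterior_mean:.1f}")  # ~19.5
```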

During the course of the war the Western Allies
made sustained efforts to determine the extent of
German production, and approached this in two major ways: conventional intelligence
gathering and statistical estimation. To do this they used the serial numbers on
captured or destroyed tanks. The principal numbers used were gearbox numbers, as
these fell in two unbroken sequences. Chassis and engine numbers were also used,
though their use was more complicated. Various other components were used to cross-
check the analysis. Similar analyses were done on tires, which were observed to be
sequentially numbered (i.e., 1, 2, 3, ..., N). The analysis of tank wheels yielded an
estimate for the number of wheel molds that were in use.

Analysis of wheels from two tanks (48 wheels each, 96 wheels total) yielded an
estimate of 270 produced in February 1944, substantially more than had previously
been suspected. German records after the war showed production for the month of
February 1944 was 276. The statistical approach proved to be far more accurate than
conventional intelligence methods, and the phrase German tank problem
became accepted as a descriptor for this type of statistical analysis.

Application in Big Data Analysis
Similar to the German Tank Problem, we can estimate/analyze (large or small) data sets that we don't have (or assume that we don't have). There is leaky data all around us; all we have to do is think outside the box. Companies very often don't think about the data they publish publicly, and we can either extrapolate from that data (as in the German Tank Problem) or simply extract useful information from it.
A company's competitors' websites (publicly available data) can be a valuable hunting ground. Think about whether you can use them to estimate some missing data (as with the serial numbers) and/or combine that data with other, seemingly innocuous, sets to produce some vital information. If that information gives your company a commercial advantage and is legal, then you should use it as part of your analysis.
Source: Wikipedia
Nyquist–Shannon Sampling Theorem


Nyquist–Shannon Sampling Theorem
The Nyquist Theorem, also known as the sampling theorem, is a
principle that engineers follow in the digitization of analog signals.
For analog-to-digital conversion (ADC) to result in a faithful
reproduction of the signal, slices, called samples, of the analog
waveform must be taken frequently. The number of samples per
second is called the sampling rate or sampling frequency.

Any analog signal consists of components at various frequencies.
The simplest case is the sine wave, in which all the signal energy is
concentrated at one frequency. In practice, analog signals usually
have complex waveforms, with components at many frequencies.
The highest frequency component in an analog signal determines
the bandwidth of that signal. The higher the frequency, the
greater the bandwidth, if all other factors are held constant.

Suppose the highest frequency component, in hertz, for a given analog signal is f_max. According to the Nyquist Theorem, the sampling rate must be at least 2·f_max, or twice the highest analog frequency component. The sampling in an analog-to-digital converter is actuated by a pulse generator (clock). If the sampling rate is less than 2·f_max, some of the highest frequency components in the analog input signal will not be correctly represented in the digitized output. When such a digital signal is converted back to analog form by a digital-to-analog converter, false frequency components appear that were not in the original analog signal. This undesirable condition is a form of distortion called aliasing.



Application in Big Data Analysis
Even though the Nyquist–Shannon Sampling Theorem concerns the minimum sampling rate of a continuous wave, in Big Data practice it tells you how frequently you need to collect that Big Data from sensors such as smart meters.
The frequency of data collection is the Velocity, one of the three Vs that define Big Data: Volume, Velocity and Variety.
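A minimal sketch (Python with NumPy) showing why undersampling matters: a 60 Hz sine sampled at only 50 Hz is indistinguishable from a 10 Hz alias at the sample instants, whereas sampling at 2·f_max = 120 Hz or more would preserve it. The frequencies here are illustrative assumptions:

```python
import numpy as np

f_signal = 60.0            # highest frequency component in the signal (Hz)
f_nyquist = 2 * f_signal   # minimum sampling rate required: 120 Hz

def sample(freq_hz, rate_hz, n_samples=50):
    """Sample a unit-amplitude sine of the given frequency at the given rate."""
    t = np.arange(n_samples) / rate_hz
    return np.sin(2 * np.pi * freq_hz * t)

# Undersampled at 50 Hz, the 60 Hz tone aliases to |60 - 50| = 10 Hz:
undersampled = sample(60.0, rate_hz=50.0)
alias        = sample(10.0, rate_hz=50.0)
print("max difference between 60 Hz and 10 Hz samples:",
      np.max(np.abs(undersampled - alias)))   # ~0: the two are indistinguishable

# Sampling at or above 120 Hz keeps the 60 Hz component distinguishable.
```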
Left figure: X(f) (top blue) and X_A(f) (bottom blue) are continuous Fourier transforms of two different functions, x(t) and x_A(t) (not shown). When the functions are sampled at rate f_s, the images (green) are added to the original transforms (blue) when one examines the discrete-time Fourier transforms (DTFT) of the sequences. In this hypothetical example, the DTFTs are identical, which means the sampled sequences are identical, even though the original continuous pre-sampled functions are not. If these were audio signals, x(t) and x_A(t) might not sound the same. But their samples (taken at rate f_s) are identical and would lead to identical reproduced sounds; thus x_A(t) is an alias of x(t) at this sample rate. In this example (of a bandlimited function), such aliasing can be prevented by increasing f_s such that the green images in the top figure do not overlap the blue portion.
Right figure: Spectrum, X_s(f), of a properly sampled bandlimited signal (blue) and the adjacent DTFT images (green) that do not overlap. A brick-wall low-pass filter, H(f), removes the images, leaves the original spectrum, X(f), and recovers the original signal from its samples. Source: Wikipedia

Simpson's Paradox

Simpson's Paradox
Simpson's paradox is a paradox in which a trend that appears in
different groups of data disappears when these groups are
combined, and the reverse trend appears for the aggregate data.
This result is particularly confounding when frequency data are
unduly given causal interpretations. Simpson's Paradox disappears
when causal relations are brought into consideration.

Example:
It's a well-accepted rule of thumb that the larger the data set, the more reliable the conclusions drawn. Simpson's paradox, however, slams a hammer down on the rule, and the result is a good deal worse than a sore thumb. Simpson's paradox demonstrates that a great deal of care has to be taken when combining small data sets into a large one. Sometimes conclusions from the large data set may be the exact opposite of conclusions from the smaller sets. Unfortunately, the conclusions from the large set can (also) be wrong.
The lurking variables (or confounding variables) in Simpson's paradox are categorical. That is, they break the observations into groups, such as the city of origin for the airline flights. Simpson's paradox is an extreme form of the fact that observed associations can be misleading when there are lurking variables.
Status    Airline A    Airline B
On Time   718          5534
Delayed   74           532
Total     792          6066
From the left table: Airline A is
delayed 9.3% (74/792) of the time;
Airline B is delayed only 8.8%
(532/6066) of the time.

So Airline A would
NOT be preferable.
           Chicago                        Houston
Airline    On Time   Delayed   Total     On Time   Delayed   Total
A          497       62        559       221       12        233
B          694       117       811       4840      415       5255

From the above table:

From Chicago, Airline A is delayed 11.1% (62/559) of the time, but
Airline B is delayed 14.4% (117/811) of the time.
From Houston, Airline A is delayed 5.2% (12/233) of the time, but
Airline B is delayed 7.9% (415/5255).
Consequently, Airline A would be preferable.
This conclusion contradicts the previous conclusion.
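A minimal sketch in Python that recomputes the per-city and aggregated delay rates from the tables above and makes the reversal explicit:

```python
# Delay counts by airline and origin city, taken from the tables above
data = {
    "Airline A": {"Chicago": (497, 62), "Houston": (221, 12)},    # (on-time, delayed)
    "Airline B": {"Chicago": (694, 117), "Houston": (4840, 415)},
}

def delay_rate(on_time, delayed):
    return delayed / (on_time + delayed)

for airline, cities in data.items():
    # Per-city delay rates
    for city, (ot, dl) in cities.items():
        print(f"{airline} from {city}: {delay_rate(ot, dl):.1%} delayed")
    # Aggregated delay rate over both cities
    total_ot = sum(ot for ot, _ in cities.values())
    total_dl = sum(dl for _, dl in cities.values())
    print(f"{airline} overall: {delay_rate(total_ot, total_dl):.1%} delayed\n")

# Airline A wins in each city, yet loses on the aggregate -- Simpson's paradox.
```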
Simpson's Paradox shows how Big Data sets CAN go wrong

A use case in Airline industry

Airline On-Time Performance at Hub-and-Spoke Flight Networks
A good discussion of the topic is detailed in the paper listed below:
Simpsons Paradox, Aggregation,
and Airline On-time Performance
by Bruce Brown of Cal State Polytechnic University
(URL: http://www.csupomona.edu/~bbrown/Brown_SimpPar_WEAI06.pdf)
Big Data doesn't happen overnight and there's no magic to it.
Just deploying Big Data tools and analytical solutions (R, SAS, Tableau, etc.) doesn't guarantee anything, as Simpson's Paradox proves.


Machine Learning

Machine Learning and Data Mining
Machine learning concerns the construction and study
of systems that can learn from data. For example, a machine
learning system could be trained on email messages to learn to
distinguish between spam and non-spam messages. After learning,
it can then be used to classify new email messages into spam and
non-spam folders. The core of machine learning deals with
representation and generalization. Representation of data
instances and functions evaluated on these instances are part of
all machine learning systems. Generalization is the property that
the system will perform well on unseen data instances.
Source: Wikipedia
These two terms are commonly confused, as they often employ the same
methods and overlap significantly. They can be roughly defined as follows:

Machine learning focuses on prediction, based on known properties
learned from the training data.
Data mining focuses on the discovery of (previously) unknown properties
in the data. This is the analysis step of Knowledge Discovery in
Databases.

Terminology
Classification: The learned attribute is categorical (nominal)
Regression: The learned attribute is numeric
Supervised Learning (Training) : We are given examples of inputs and
associated outputs and we learn the relationship between them.
Unsupervised Learning (sometimes: Mining): We are given inputs, but
no outputs (such as unlabeled data) and we learn the Latent labels.
(example: Clustering).

Example:
Document
Classification
Highly accurate predictions on real-time and continuous data (based on rule sets learned from earlier training / historical data).
The goal is not to uncover underlying truth.
Emphasis is on methods that can handle very large data sets for better predictions.

A use
case in
Airline
industry
Southwest Airlines' use of Machine Learning for Airline Safety
The below URL details an article (published September 2013) on how Southwest Airlines uses Machine Learning algorithms for Big Data purposes to analyze vast amounts of very large data sets (which are publicly accessible from NASA's DASHlink site) to find anomalies and potential safety issues and to identify patterns to improve airline safety.

URL: http://www.bigdata-startups.com/BigData-startup/southwest-
airlines-uses-big-data-deliver-excellent-customer-service/
Primary Goal of Machine Learning
Why Machine Learning?
Use Case 1: Increase barrier to entry when product / service quality is dependent on data.
vs.
Use Case 2: Customize product / service to increase engagement and profits. Example: customize a sales page to increase conversion rates for online products.

Classification Rules and Rule Sets

Rule Set to Classify Data

Golf Example: To Play or Not to Play

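The golf example is usually illustrated with the classic "play golf" weather data set (outlook, humidity, windy → play or don't play). As a stand-in for the slide's figure, here is a minimal hand-written rule set of the kind a rule learner might produce; the attribute values and thresholds are illustrative assumptions, not the original example's exact rules:

```python
def play_golf(outlook, humidity, windy):
    """Toy rule set for the classic 'play golf' weather example.
    Attribute values and thresholds are illustrative assumptions."""
    if outlook == "sunny":
        return humidity <= 75          # play only if it is not too humid
    if outlook == "overcast":
        return True                    # always play when overcast
    if outlook == "rainy":
        return not windy               # play only if it is not windy
    raise ValueError(f"unknown outlook: {outlook!r}")

examples = [("sunny", 70, False), ("sunny", 90, False),
            ("overcast", 85, True), ("rainy", 80, False), ("rainy", 70, True)]
for outlook, humidity, windy in examples:
    decision = "Play" if play_golf(outlook, humidity, windy) else "Don't play"
    print(outlook, humidity, windy, "->", decision)
```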
A use case in Airline industry
Optimal Airline Ticket Purchasing (automated feature selection)
A good discussion of the topic is detailed in the paper listed below:
Optimal Airline Ticket Purchasing Using
Automated User-Guided Feature Selection
by William Groves and Maria Gini of University of Minnesota
(URL: http://ijcai.org/papers13/Papers/IJCAI13-032.pdf )

Classification Problems
Examples of Classification Problems:
Text categorization (e.g., spam filtering)
Market segmentation (e.g.: predict if
customer will respond to promotion).
Natural-language processing
(e.g., spoken language understanding).


Decision Tree Learning
Example:
Good vs. Evil


Decision tree learning uses
a decision tree as a
predictive model which
maps observations about
an item to conclusions
about the item's target
value. More descriptive
names for such tree models
are classification trees or
regression trees. In these
tree structures, leaves
represent class labels and
branches represent
conjunctions of features
that lead to those class
labels. In decision analysis,
a decision tree can be used
to visually and explicitly
represent decisions and
decision making. In data
mining, a decision tree
describes data but not
decisions; rather the
resulting classification tree
can be an input for decision
making. Source: Wikipedia





Tree Size vs. Accuracy
Accuracy, Confusion Matrix, Overfitting, Good/Bad Classifiers, and Controlling Tree Size

Building an Accurate Classifier
Good and Bad Classifiers


A use case in Airline industry
Predicting Airline Customers Future Values
A good discussion of the topic is detailed in the paper listed below:

Applying decision trees for value-based customer relations
management: Predicting airline customers future values

by Giuliano Tirenni, Christian Kaiser and Andreas Herrmann
of the Center for Business Metrics at University of St. Gallen, Switzerland.
(URL: http://ipgo.webs.upv.es/azahar/Pr%C3%A1cticas/articulo2.pdf )
Theory
Overfitting example
Accuracy and Confusion Matrix
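The slide's figures are not reproduced here; as a stand-in, a minimal sketch (assuming scikit-learn is installed) that trains decision trees of different depths on synthetic data and reports accuracy and the confusion matrix, illustrating how limiting tree size controls overfitting:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic two-class data stands in for a real data set
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 5, None):      # None = grow the tree until pure (prone to overfitting)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    test_acc = accuracy_score(y_test, tree.predict(X_test))
    print(f"max_depth={depth}: train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
    print(confusion_matrix(y_test, tree.predict(X_test)))

# A large gap between train and test accuracy signals overfitting.
```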


Entropy and Information Gain

Entropy

Question: How do you determine which
attribute best classifies data or a data set?
Answer: Entropy
Entropy is a measure of unpredictability
of information content.

Example : A poll on some political issue.
Usually, such polls happen because the
outcome of the poll isn't already known.
In other words, the outcome of the poll
is relatively unpredictable, and actually
performing the poll and learning the
results gives some new information;
these are just different ways of saying
that the entropy of the poll results is
large. Now, consider the case that the
same poll is performed a second time
shortly after the first poll. Since the
result of the first poll is already known,
the outcome of the second poll can be
predicted well and the results should not
contain much new information; in this case
the entropy of the second poll results is
small. Source: Wikipedia

Statistical quantity measuring how well an
attribute classifies the data.
Calculate the information gain for each attribute.
Choose attribute with greatest information gain.
If there are n equally probable possible messages,
then the probability p of each is 1/n
Information conveyed by a message is -log(p) = log(n) (logarithms base 2).
For example, if there are 16 messages, then log(16) = 4 and we need 4 bits to identify/send each message.
In general, if we are given a probability distribution
P = (p1, p2, .., pn)
The information conveyed by distribution
(aka Entropy of P) is:
I(P) = -(p1*log(p1) + p2*log(p2) + .. + pn*log(pn))
Information Theory: Background
Information Gain
Largest Entropy: Boolean functions with the same number of ones and zeros have the largest entropy.
In machine learning, this concept can be used to
define a preferred sequence of attributes to investigate to most rapidly
narrow down the state of X. Such a sequence (which depends on the
outcome of the investigation of previous attributes at each stage) is
called a decision tree. Usually an attribute with high mutual
information should be preferred to other attributes.
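A minimal sketch in Python computing entropy and the information gain of candidate attributes on a tiny hypothetical data set; the attribute with the largest gain would be chosen as the next split in a decision tree:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, label="play"):
    """Entropy reduction from splitting `examples` (list of dicts) on `attribute`."""
    base = entropy([e[label] for e in examples])
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[label] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

# Tiny illustrative data set (hypothetical values)
examples = [
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": True,  "play": "no"},
]
for attr in ("outlook", "windy"):
    print(attr, "information gain:", round(information_gain(examples, attr), 3))
```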

A use case in Airline industry
An Airline matching Airplanes to Routes
(using Machine Learning)
(URL: http://machinelearning.wustl.edu/mlpapers/paper_files/jmlr10_helmbold09a.pdf )
A good discussion of the topic is detailed in the paper listed below:

Learning Permutations with Exponential Weights

by David P. Helmbold and Manfred K.Warmuth
of University of California, Santa Cruz
The Bootstrap

The Bootstrap


A good discussion of the topic is detailed in the article listed below:

The Big Data Bootstrap
by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar
and Michael I. Jordan of University of California, Berkeley

(URL: http://www.cs.berkeley.edu/~jordan/papers/blb_icml2012.pdf )
Bootstrapping is a method for assigning measures of
accuracy (defined in terms of bias, variance, confidence
intervals, prediction error or some other such measure) to
sample estimates. This technique allows estimation of the
sampling distribution of almost any statistic using only very
simple methods. Generally, it falls in the broader class of
resampling methods. The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by resampling the sample data and performing inference about a sample from the resampled data (resample → sample). As the population is unknown, the true error in a sample statistic against its population value is unknowable. In bootstrap resamples, the 'population' is in fact the sample, and this is known; hence the quality of inference about the 'true' sample from resampled data is measurable. Source: Wikipedia




Concept



Big Data and the Bootstrap
Abstract from the paper listed on the lower left side: The bootstrap provides a simple and powerful means
of assessing the quality of estimators. However, in settings involving large datasets, the computation of
bootstrap-based quantities can be prohibitively demanding. As an alternative, the authors present the Bag
of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and
subsampling to obtain a robust,
computationally efficient means of
assessing estimator quality. BLB is
well suited to modern parallel and
distributed computing architectures
and retains the generic applicability,
statistical efficiency, and favorable
theoretical properties of the bootstrap.
The authors provide the results of an
extensive empirical and theoretical
investigation of BLB's behavior,
including a study of its statistical
correctness, its large-scale
implementation and performance, selection of hyper parameters, and performance on real data.
The authors claim their procedure for quantifying estimator quality is accurate, automatic and
scalable and have tested on data sets of size exceeding 1 Terabyte.
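A minimal sketch of the plain bootstrap (not the Bag of Little Bootstraps procedure from the paper), assuming Python with NumPy and using hypothetical delay data, to estimate the standard error and a percentile confidence interval for a statistic:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=45.0, size=500)   # hypothetical delay minutes

def bootstrap_ci(data, statistic, n_resamples=2000, alpha=0.05):
    """Percentile-bootstrap standard error and confidence interval for `statistic`."""
    stats = np.array([
        statistic(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_resamples)
    ])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return stats.std(), (lo, hi)

se, (lo, hi) = bootstrap_ci(sample, np.median)
print(f"bootstrap standard error of the median: {se:.2f}")
print(f"95% percentile interval for the median: ({lo:.2f}, {hi:.2f})")
```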


A use case in Airline industry
Modeling Demand and Supply for Domestic and International Air
Travel Economics for Cost Minimization and Profit Maximization
An in-depth scholarly treatment of the application of bootstrapping to modelling domestic and international air travel economics (demand / supply) for an airline company is given in the Ph.D. thesis listed below
(slightly old, published April 1999, but still very relevant):

Essays on Domestic and International Airline
Economics with Some Bootstrap Applications

by Anthony Kenneth Postert
of Rice University
(URL: http://scholarship.rice.edu/bitstream/handle/1911/19428/9928581.PDF?sequence=1 )
Bootstrap and Big Data
Ensemble Learning, Bagging and Boosting

Ensemble Learning


The basis of this slide is from the original presentation titled
Bayesian Ensemble Learning for Big Data
by Rob McCulloch
of University of Chicago, Booth School of Business
Published, November 17, 2013
(URL: http://www.rob-mcculloch.org/some_papers_and_talks/talks/dsi-bart.pdf )
Ensemble methods use multiple learning algorithms to
obtain better predictive performance than could be
obtained from any of the constituent learning
algorithms. Unlike a statistical ensemble in statistical
mechanics, which is usually infinite, a machine learning
ensemble refers only to a concrete finite set of
alternative models, but typically allows for much more
flexible structure to exist between those alternatives.
Source : Wikipedia

Bagging

A use case in Airline industry
Air Traffic Capacity impact during Adverse Weather conditions
A good discussion of the topic is detailed in the paper listed below:
Translation of Ensemble Weather Forecasts
into Probabilistic Air Traffic Capacity Impact
by Matthias Steiner, Richard Bateman, Daniel Megenhardt,
Yubao Liu, Mei Xu, Matthew Pocernich,
of the National Center for Atmospheric Research,
and by Jimmy Krozel of Metron Aviation
(URL: http://nldr.library.ucar.edu/repository/assets/osgc/OSGC-000-000-000-687.pdf )
Bootstrap aggregating, also called Bagging, is a
machine learning ensemble meta-algorithm
designed to improve the stability and accuracy
of machine learning algorithms used in
statistical classification and regression. It also
reduces variance and helps to avoid overfitting.
Although it is usually applied to decision tree
methods, it can be used with any type of
method. Bagging is a special case of the model
averaging approach. Source : Wikipedia


Boosting
Boosting is a machine learning meta-
algorithm for reducing bias in supervised
learning. Boosting is based on the
question: Can a set of weak learners
create a single strong learner? A weak
learner is defined to be a classifier which
is only slightly correlated with the true
classification (it can label examples
better than random guessing). In
contrast, a strong learner is a classifier
that is arbitrarily well-correlated with the
true classification. Source : Wikipedia
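A minimal sketch (assuming scikit-learn is installed) comparing a single decision tree against a bagged ensemble and a boosted ensemble on synthetic data; this is a generic illustration, not the weather-translation model from the paper cited above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

models = {
    "single tree": DecisionTreeClassifier(random_state=1),
    # Bagging averages trees grown on bootstrap samples (variance reduction)
    "bagging": BaggingClassifier(n_estimators=100, random_state=1),
    # Boosting fits weak learners sequentially, reweighting errors (bias reduction)
    "boosting": AdaBoostClassifier(n_estimators=100, random_state=1),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy {scores.mean():.3f}")
```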
Ensemble Learning and Big Data


Random Forests

Random Forests

Random forests are an
ensemble learning
method for classification
(and regression) that
operate by constructing
a multitude of decision
trees at training time
and outputting the
class that is the mode
of the classes output by
individual trees. Source: Wikipedia
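A minimal sketch (assuming scikit-learn is installed) of training a random forest on synthetic data; it is a single-machine illustration, not the Hadoop/R implementation referenced below:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=25, n_informative=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Each of the 200 trees is grown on a bootstrap sample using a random subset of features
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=2)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("top feature importances:", sorted(forest.feature_importances_, reverse=True)[:3])
```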

A use case in Airline industry
Network based model for Predicting Air Traffic Delays
The authors of the below paper propose a model using Random Forest (RF)
algorithms, considering both temporal and spatial (that is, network) delay states as
explanatory variables. In addition to local delay variables that describe the arrival or
departure delay states of the most influential airports and origin-destination (OD)
pairs in the network, the authors propose new network delay variables that depict
the global delay state of the entire NAS at the time of prediction.

A Network-Based Model for Predicting Air Traffic Delays
by Juan Jose Rebollo and Hamsa Balakrishnan
of Massachusetts Institute of Technology
(URL: http://www.mit.edu/~hamsa/pubs/RebolloBalakrishnanICRAT2012.pdf)

Random Forests in Big Data
Cloudera: In the below URL link, Cloudera (a major Big Data vendor) shows how to implement a Poisson approximation to enable training a random forest on an enormous data set (with R, a free open-source statistical package, on the Hadoop File System). The link also provides Map and Reduce source code.

URL: https://blog.cloudera.com/blog/2013/02/how-to-resample-from-a-
large-data-set-in-parallel-with-r-on-hadoop/
k-nearest Neighbours

k-nearest Neighbours

The k-nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

In k-NN classification, the output is a class membership. An object is classified by a
majority vote of its neighbors, with the object being assigned to the class most
common among its k nearest neighbors (k is a positive integer, typically small). If k
= 1, then the object is simply assigned to the class of that single nearest neighbor.
In k-NN regression, the output is the property value for the object. This value is
the average of the values of its k nearest neighbors.

k-NN is a type of instance-based learning, or lazy learning, where
the function is only approximated locally and all computation is
deferred until classification. The k-NN algorithm is among the
simplest of all machine learning algorithms. Both for classification
and regression, it can be useful to weight contributions of
neighbors, so that the nearer neighbors contribute more to the
average than the more distant ones.

For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data.

Example of k-NN classification. The test sample (green circle) should be
classified either to the first class of blue squares or to the second class of red triangles. If k = 3
(solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square
inside the inner circle. If k = 5 (dashed line circle) it is assigned to the first class (3 squares vs. 2
triangles inside the outer circle).

Data Reduction
Data reduction is one of the most important problems for work with huge data sets.
Usually, only some of the data points are needed for accurate classification. Those data
are called the prototypes and can be found as follows:
1. Select the class-outliers, that is, training data that are classified incorrectly by k-NN (for
a given k)
2. Separate the rest of the data into two sets: (i) the prototypes that are used for the
classification decisions and (ii) the absorbed points that can be correctly classified by k-
NN using prototypes which can be removed from the training set.
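A minimal sketch (assuming scikit-learn is installed) of k-NN classification with the 1/d distance weighting described above; the training points are illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D training points with two classes (illustrative values)
X_train = np.array([[1.0, 1.2], [1.1, 0.9], [0.8, 1.0],   # class 0
                    [3.0, 3.1], [2.9, 2.8], [3.2, 3.0]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])

# weights="distance" gives each neighbour a weight of 1/d, as described above
knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)
print(knn.predict([[1.5, 1.4], [2.7, 2.9]]))   # -> [0 1]
print(knn.predict_proba([[2.0, 2.0]]))         # class probabilities from weighted votes
```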

K-Nearest Neighbours and Big Data

A good discussion of how to execute kNN joins in a MapReduce cluster, with algorithms for performing efficient parallel kNN joins on large data, is presented in the paper below. The authors demonstrate their ideas using Hadoop, with extensive experiments on large real and synthetic datasets containing tens or hundreds of millions of records in up to 30 dimensions, showing efficiency, effectiveness, and scalability.

Efficient Parallel kNN Joins for Large Data in MapReduce
by Chi Zhang of Florida State University and Jeffrey Jestes of University of Utah
(URL: http://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf )
Source: Wikipedia

Background: Nearest Neighbor Graph
Nearest neighbor graph (NNG) for a set of n objects P
in a metric space (e.g., for a set of points in the plane
with Euclidean distance) is a directed graph with P
being its vertex set and with a directed edge from
p to q whenever q is a nearest neighbor of p.
Example: The right side image shows a nearest
neighbor graph of 100 points in the Euclidean plane.

k-nearest neighbor graph (k-NNG) is a graph in which two vertices p and q
are connected by an edge, if the distance between p and q is among the k-th
smallest distances from p to other objects from P

Source: Wikipedia
Source: Wikipedia
k-nearest Neighbours (continued)

A use case in Airline industry
Seating Arrangement and Inflight Purchase / Buying Behavior of Airline Customers
The below paper investigates and characterizes how social influence affects buying behavior of airline passengers who can purchase items through an
individual entertainment system located in front of them. The author used the seating configuration in the airplane as a basis for the analysis.
The author used large data sets with a sample size for analysis of purchase behavior of about 257,000 passengers in nearly 2,000 flights, where the passengers performed 65,525 transactions, with an average of 33.3 transactions per flight. The author claims that he finds strong evidence of social effects and states that the number of average transactions per passenger increases 30% upon observation of a neighbor's purchase. Analyzing within- and cross-category effects, the author found that passengers are likely to buy from the same category purchased by their neighbors. For example, a purchase of an alcoholic beverage increases the probability of a same-category purchase by a neighbor by 78%. The author claims peer effects also take place at a deeper level than product category: passengers 'imitate' their peers' decisions on the type of food, alcohol and even movie genre. The paper also investigates the determinants of social influence: the author claims no support is found for informational learning as a significant mechanism in driving social influence. The main determinant of social influence is found to be the number of neighbors observed purchasing an item. The results are consistent with informational learning where consumers only learn from others' actions, but not from their inaction.

Peer Effects in Buying Behavior: Evidence from In-Flight Purchases

By Pedro M. Gardete, Assistant Professor of Marketing at Stanford University (published September 2013)

(URL: http://faculty-gsb.stanford.edu/gardete/documents/SocialEffects_8_2013.pdf )

Stochastic Gradient Descent

Gradient
Descent

Gradient descent is a first-order optimization algorithm. To find a local minimum of a function
using gradient descent, one takes steps proportional to the negative of the gradient (or of the
approximate gradient) of the function at the current point. If instead one takes steps
proportional to the positive of the gradient, one approaches a local maximum of that function;
the procedure is then known as gradient ascent. Gradient descent is also known as steepest
descent, or the method of steepest descent. When known as the latter, gradient descent
should not be confused with the method of steepest descent for approximating integrals.
Source: Wikipedia
Stochastic Gradient Descent

Stochastic gradient descent is a gradient descent optimization
method for minimizing an objective function that is written as a sum
of differentiable functions.
Source: Wikipedia
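A minimal sketch (Python with NumPy) of plain stochastic gradient descent minimizing a least-squares objective written as a sum of per-example terms, one randomly chosen example per update; the data and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=n)

# Objective: (1/n) * sum_i (x_i . w - y_i)^2, a sum of differentiable terms
w = np.zeros(d)
learning_rate = 0.01
for epoch in range(5):
    for i in rng.permutation(n):                 # visit examples in random order
        gradient = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of the i-th term only
        w -= learning_rate * gradient            # step against the gradient

print("estimated weights:", np.round(w, 2))      # close to true_w
```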

Stochastic Gradient Descent (continued)
Gradient Descent Example


Stochastic Gradient Descent (continued)
Stochastic Gradient Descent: Example
Source: Wikipedia

Stochastic Gradient Descent vs.
Minibatch Gradient Descent

Stochastic Gradient Descent and Big Data

The below URL link provides a presentation on Stochastic Optimization for Big Data Analytics.

Stochastic Optimization for Big Data Analytics
by Tianbao Yang and Shenghuo Zhu of NEC Laboratories America
and Rong Jin of Michigan State University
(URL: http://www.cse.msu.edu/~yangtia1/sdm14-tutorial.pdf)

The authors of the below paper present stochastic gradient descent techniques for online learning and ensemble methods that scale out to large amounts of data at Twitter, with details on how to integrate machine learning tools into the Hadoop platform (using Pig, a high-level programming tool).

Large-Scale Machine Learning at Twitter
by Jimmy Lin and Alek Kolcz of Twitter, Inc.
(URL: http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf )
Stochastic Games and Markov Perfect Equilibrium

Importance of Markov Perfect Equilibrium in Airline industry Pricing
As an example of the use of this equilibrium concept we consider the competition between firms which had invested heavily into fixed costs and
are dominant producers in an industry, forming an oligopoly. The players are taken to be committed to levels of production capacity in the short
run, and the strategies describe their decisions in setting prices. Firms' objectives are modeled as maximizing present discounted value of profits.

Airfare Game / Airline Pricing Game: Often an airplane ticket for a certain route has the same price on either airline A or airline B. Presumably,
the two airlines do not have exactly the same costs, nor do they face the same demand function given their varying frequent-flyer programs, the
different connections their passengers will make, and so forth. Thus, a realistic general equilibrium model would be unlikely to result in nearly
identical prices. Both airlines have made sunk investments into the equipment, personnel, and legal framework. In the near term we may think of
them as committed to offering service. We therefore see that they are engaged, or trapped, in a strategic game with one another when setting
prices.

Equilibrium: Consider the following strategy of an airline for setting the ticket price for a certain route. At every price-setting opportunity:
If the other airline is charging $300 or more, or is not selling tickets on that flight, charge $300
If the other airline is charging between $200 and $300, charge the same price
If the other airline is charging $200 or less, choose randomly between the following three options with equal probability:
matching that price, charging $300, or exiting the game by ceasing indefinitely to offer service on this route.

This is a Markov strategy because it does not depend on a history of past observations. It also satisfies the Markov reaction function definition because it does not depend on other information that is irrelevant to revenues and profits. Assume now that both airlines follow this strategy exactly. Assume further that passengers always choose the cheapest flight, so that if the airlines charge different prices, the one charging the higher price gets zero passengers. Then if each airline assumes that the other airline will follow this strategy, there is no higher-payoff alternative strategy for itself, i.e. it is playing a best response to the other airline's strategy. If both airlines follow this strategy, it forms a Nash equilibrium in every proper subgame, thus a subgame-perfect Nash equilibrium. Source: Wikipedia
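A minimal sketch in Python of the price-setting rule above written as a Markov reaction function: the response depends only on the currently observed competitor price, not on any history, and exiting the route is represented by returning None. This illustrates the equilibrium strategy described, not a pricing recommendation:

```python
import random

def markov_price_response(competitor_price):
    """Markov reaction function for the airfare game described above.
    `competitor_price` is None if the other airline is not selling tickets.
    Returns the new price, or None to exit the route."""
    if competitor_price is None or competitor_price >= 300:
        return 300
    if competitor_price > 200:
        return competitor_price                    # match the competitor's price
    # Competitor is at $200 or less: randomize among match, $300, or exit
    return random.choice([competitor_price, 300, None])

for observed in (None, 350, 250, 180):
    print(f"competitor at {observed}: respond with {markov_price_response(observed)}")
```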

Stochastic (or Markov) Games
Stochastic (or Markov) game, is a dynamic game with
probabilistic transitions played by one or more players. The
game is played in a sequence of stages. At the beginning of
each stage the game is in some state. The players select
actions and each player receives a payoff that depends on
the current state and the chosen actions. The game then
moves to a new random state whose distribution depends on
previous state and actions chosen by players. The procedure
is repeated at the new state and play continues for a finite or
infinite number of stages. Total payoff to a player is often
taken to be discounted sum of stage payoffs or the limit
inferior of averages of stage payoffs. Source: Wikipedia

Markov Perfect Equilibrium
A Markov perfect equilibrium is an equilibrium concept in game theory. It is the refinement of the concept of sub
game perfect equilibrium to extensive form games for which a pay-off relevant state space can be readily
identified. In extensive form games, and specifically in stochastic games, a Markov perfect equilibrium is a set of
mixed strategies for each of the players which satisfy the following criteria:
The strategies have the Markov property of memorylessness, meaning that each player's mixed strategy can
be conditioned only on the state of the game. These strategies are called Markov reaction functions.
The state can only encode payoff-relevant information. This rules out strategies that depend on non-
substantive moves by the opponent. It excludes strategies that depend on signals, negotiation, or
cooperation between players (e.g. cheap talk or contracts).
The strategies form a subgame perfect equilibrium of the game. Source: Wikipedia

Subgame Perfect Equilibrium

Subgame Perfect Equilibrium is a refinement of a Nash equilibrium used in dynamic games. A strategy profile
is a subgame perfect equilibrium if it represents a Nash equilibrium of every subgame of the original game.
Informally, this means that if (1) the players played any smaller game that consisted of only one part of the
larger game and (2) their behavior represents a Nash equilibrium of that smaller game, then their behavior is a
subgame perfect equilibrium of the larger game. Source: Wikipedia

Dynamic Airfare Pricing
and Competition
The below paper details Airline industry price
competition for an oligopoly in a dynamic
setting, where each of the sellers has a fixed
number of units available for sale over a fixed
number of periods. Demand is stochastic, and
depending on how it evolves, sellers may
change their prices at any time.


Dynamic Price Competition
with Fixed Capacities
by Kalyan Talluri & Victor Martinez de Albeniz
A use case in Airline industry

Most of the work in this paper was done prior to the writing of the paper, as part of both authors' Ph.D. dissertations at Massachusetts Institute of Technology (published February 2010).
(URL: www.econ.upf.edu/docs/papers/downloads/1205.pdf)
Stochastic Games and Markov Perfect Equilibrium (continued)

A use case in Airline industry
Dynamic Revenue Management in Airline Alliances / Code Sharing

The below paper presents an excellent formalization of a Markov-game model of a two-partner airline alliance that can be used to analyze the effects of these mechanisms on each partner's behavior. The authors show that no Markovian transfer-pricing mechanism can coordinate an arbitrary alliance. Next, the authors derive the equilibrium acceptance policies under each scheme and use analytical techniques, as well as numerical analyses of sample alliances, to generate fundamental insights about partner behavior under each scheme. The analysis and numerical examples also illustrate how certain transfer-price schemes are likely to perform in networks with particular characteristics.

Dynamic Revenue Management in Airline Alliances

by Robert Shumsky of Dartmouth College and Christopher Wright, Harry Groenevelt of University of Rochester (published February 2009)

(URL: http://www.researchgate.net/publication/220413135_Dynamic_Revenue_Management_in_Airline_Alliances/file/72e7e5215a1f91ed5b.pdf )

Logistic Regression

Logistic Regression


Uses and examples of Logistic Regression
Examples: Logistic regression might be used to predict
Whether a patient has a given disease (e.g. diabetes), based on observed
characteristics of the patient (age, gender, BMI, results of various blood
tests, etc.).
Whether an American voter will vote Democratic or Republican, based on
age, income, gender, race, state of residence, previous elections, etc.
In engineering, for predicting the probability of failure of a given process,
system or product.
In marketing applications such as prediction of a customer's propensity to
purchase a product or cease a subscription, etc.
In economics it can be used to predict the likelihood of a person's choosing
to be in the labor force, and a business application would be to predict the
likelihood of a homeowner defaulting on a mortgage. Source: Wikipedia
Logistic Regression is a type of probabilistic statistical
classification model used to predict a binary response from a
binary predictor and can be used for predicting the outcome
of a categorical dependent variable (i.e., a class label) based
on one or more predictor variables (features). It is used in
estimating the parameters of a qualitative response model.
Logistic regression measures the relationship between
a categorical dependent variable and one or more
independent variables, which are usually (but not
necessarily) continuous, by using probability scores as
the predicted values of the dependent variable. Source: Wikipedia
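A minimal sketch (assuming scikit-learn is installed) of the logistic (sigmoid) link and a fitted binary classifier; the no-show-style features and labels are hypothetical, not drawn from any of the papers cited on this page:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic(z):
    """Logistic (sigmoid) function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical features: [days booked in advance, past no-show count] -> no-show (1) or show (0)
X = np.array([[30, 0], [2, 3], [45, 0], [1, 2], [60, 1], [3, 4], [20, 0], [5, 5]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

model = LogisticRegression(max_iter=1000).fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("P(no-show) for a booking 10 days out with 1 prior no-show:",
      model.predict_proba([[10, 1]])[0, 1])

# The model's probability is logistic(intercept + coef . x):
z = model.intercept_[0] + model.coef_[0] @ np.array([10, 1])
print("same value via the logistic link:", logistic(z))
```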
The three images below show a light-hearted explanation and presentation of Logistic Regression in Machine Learning by David Hu, from his internship at Khan Academy (a non-profit education site).
URL: http://david-hu.com/2012/01/05/khan-academy-internship-post-mortem.html
Source: Wikipedia

Logistic Regression in Big Data
An Internet search provides a lot of (widely used) use cases; a couple are listed below:
A Big Data Logistic Regression with R and ODBC
by Larry D'Agostino of "R news and tutorials at R bloggers "
URL: http://www.r-bloggers.com/big-data-logistic-regression-with-r-and-odbc/
Large Data Logistic Regression (with example Hadoop code)
by John Mount of "Win-Vector Blog : Applied Theorist's Point of View"
URL: http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-
example-hadoop-code/

Logistic Regression (continued)
Predicting Airline Passenger No-show Rates
A use case in Airline industry

Accurate forecasts of the expected number of no-shows for each flight can increase airline revenue by reducing the number of spoiled seats (empty seats that might otherwise have been sold) and the number of involuntary denied boardings at the departure gate. Conventional no-show forecasting methods typically average the no-show rates of historically similar flights, without using passenger-specific information. The authors of the below paper develop two classes of models to predict cabin-level no-show rates using specific information on the individual passengers booked on each flight.

The first of the proposed models computes the no-show probability for each passenger, using both the cabin-level historical
forecast and the extracted passenger features as explanatory variables. This passenger level model is implemented using
three different predictive methods: a C4.5 decision-tree, a segmented Naive Bayes algorithm, and a new aggregation method
for an ensemble of probabilistic models.
The second cabin-level model is formulated using the desired cabin-level no-show rate as the response variable. Inputs to this
model include the predicted cabin-level no-show rates derived from the various passenger-level models, as well as simple
statistics of the features of the cabin passenger population.

The cabin-level model is implemented using either linear regression, or as a direct probability model with explicit incorporation
of the cabin-level no-show rates derived from the passenger-level model outputs. The new passenger-based models are
compared to a conventional historical model, using train and evaluation data sets taken from over 1 million passenger name
records. Standard metrics such as lift curves and mean-square cabin-level errors establish the improved accuracy of the passenger based
models over the historical model. The authors evaluated all models using a simple revenue model, and show that the cabin-level
passenger-based predictive model can produce between 0.4% and 3.2% revenue gain over the conventional model, depending
on the revenue-model parameters.
Passenger-Based Predictive Modeling of Airline No-show Rates
by Richard Lawrence, Se June Hong of IBM T. J. Watson Research Center
and Jacques Cherrier of Air Canada
(URL: http://www.msci.memphis.edu/~linki/7118papers/Lawrence03Passenger.pdf )


Airline Customer Satisfaction / Loyalty
A use case in Airline industry
There are a lot of research papers and actual implementation articles on leveraging Logistic Regression for Airline Customer Satisfaction / Loyalty; a few are listed below:
A Logistic Regression Model of Customer Satisfaction of Airline
by Peter Josephat and Abbas Ismail of University of Dodoma
(URL: http://www.macrothink.org/journal/index.php/ijhrs/article/view/2868/2669)
Analytical CRM at the airlines:
Managing loyal passengers using knowledge discovery in database
by Jehn-Yih Wong and Lin-Hao Chiu of Ming Chuan University,
and Pi-Heng Chung, of De Lin Institute of Technology
and Te-Yi Chang of National Kaohsiung University
(URL: http://ir.nkuht.edu.tw/bitstream/987654321/1528/1/8%E5%8D%B74-001.pdf )

Modelling Customer Response Rate for Exchange / Purchase of Airline Frequent Flier Miles for 3rd Party (non-airline) Products and Services
A use case in Airline industry
The below reference presents the Logistic Regression model (predictive goal) used by Sprint/Nextel to identify Delta Airlines SkyMiles members who are most likely to respond to an offer to exchange frequent flier miles for the purchase of Sprint-Nextel wireless phones and service.
Delta Airlines Response Model Overview
by Geoff Gray, Armando Litonjua , Matt McNamara, Tim Roach and Jason Thayer
(URL: http://galitshmueli.com/system/files/FrequentFliers.pdf )
Support Vector Machine

Support Vector Machine
Support vector machines (SVM) are supervised learning models with
associated learning algorithms that analyze data and recognize patterns,
used for classification and regression analysis. Given a set of training
examples, each marked as belonging to one of two categories, an SVM
training algorithm builds a model that assigns new examples into one
category or the other, making it a non-probabilistic binary linear
classifier. An SVM model is a representation of the examples as points in
space, mapped so that the examples of the separate categories are
divided by a clear gap that is as wide as possible. New examples are
then mapped into that same space and predicted to belong to a
category based on which side of the gap they fall on. Source: Wikipedia
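A minimal sketch (assuming scikit-learn is installed) of a linear SVM trained on two separable classes; with a large C it approximates the hard-margin, maximum-margin hyperplane (H3 in the figure), and the support vectors are the samples on the margin. The points are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes (illustrative points)
X = np.array([[1, 2], [2, 3], [2, 1], [1, 1],     # class 0
              [6, 5], [7, 7], [6, 7], [7, 5]])    # class 1
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A linear kernel with a large C approximates the hard-margin, maximum-margin hyperplane
svm = SVC(kernel="linear", C=1e6).fit(X, y)
print("support vectors (the samples on the margin):")
print(svm.support_vectors_)
print("hyperplane: w =", svm.coef_[0], " b =", svm.intercept_[0])
print("predictions for [3, 2] and [6, 6]:", svm.predict([[3, 2], [6, 6]]))
```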


Airline Ticket Cancellation Forecasting & Revenue Management
Passenger Name Record Data Mining Based
Cancellation Forecasting for Revenue Management
by Dolores Romero Morales and Jingbo Wang
of Said Business School, University of Oxford
A use case in Airline industry
Using real-world datasets, the authors of the below paper examine
the performance of existing models and propose new promising
ones based on Logistic Regression and Support Vector Machines for
ticket cancellation forecasting & improving Airline revenue
(URL: http://www.optimization-online.org/DB_FILE/2008/04/1953.pdf )
Example: H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximum margin.
Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.


Support Vector Machine and Big Data

The below paper provides good information on Support Vector Machine classifiers
for solving the pattern recognition problem in machine learning on large data sets
Support Vector Machine Classifiers for Large Data Sets
by E. Michael Gertz and Joshua D. Griffin of Argonne National Laboratory
(URL: http://ftp.mcs.anl.gov/pub/tech_reports/reports/TM-289A.pdf )
k-means Clustering

k-means Clustering
k-means clustering is a method of vector quantization, originally
from signal processing, that is popular for cluster analysis in data mining.
k-means clustering aims to partition n observations into k clusters in
which each observation belongs to the cluster with the nearest mean,
serving as a prototype of the cluster. This results in a partitioning of the
data space into Voronoi cells (see voronoi diagram below).

Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n), S = {S1, S2, ..., Sk}, so as to minimize the within-cluster sum of squares (WCSS):
argmin over S of Σ_{i=1..k} Σ_{x ∈ S_i} ||x − μ_i||²
where μ_i is the mean of points in S_i.
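A minimal sketch (assuming scikit-learn is installed) of k-means minimizing the WCSS on hypothetical customer features; the cluster structure and feature names are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Hypothetical customer features: [annual miles flown (thousands), trips per year]
customers = np.vstack([
    rng.normal([5, 2],   [1.0, 0.5], size=(100, 2)),   # occasional leisure flyers
    rng.normal([40, 15], [5.0, 2.0], size=(100, 2)),   # frequent business flyers
    rng.normal([15, 6],  [2.0, 1.0], size=(100, 2)),   # regular flyers
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=4).fit(customers)
print("cluster centres (means):\n", kmeans.cluster_centers_)
print("within-cluster sum of squares (WCSS):", round(kmeans.inertia_, 1))
print("segment of a customer flying 25k miles / 8 trips:", kmeans.predict([[25, 8]])[0])
```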

Demonstration of k-means standard algorithm




Airline Data Model Mining for:
Customer Segmentation Analysis
Customer Loyalty Analysis
Customer Lifetime Value Analysis
Frequent Flyer Passenger Prediction
Oracle Airlines Data Model Data Mining Models
by Oracle Corporation
A use case in Airline industry
Oracle (a major database software and applications vendor), for its software product offering titled "Airlines Data Model" and its Data Mining Models, provides reference information about the data mining models:
(URL: http://docs.oracle.com/cd/E11882_01/doc.112/e26208/data_mining_adm.htm#DMARF1188 )


k-means clustering and Big Data
The below paper provides excellent theory and application pseudocode for NoSQL-based applications / programs using k-means clustering in Big Data Analysis.
Turning Big data into tiny data:
Constant-size coresets for k-means, PCA and projective clustering
by Dan Feldman, Melanie Schmidt, Christian Sohler
of Massachusetts Institute of Technology
(URL: http://people.csail.mit.edu/dannyf/subspace.pdf )
Background: Voronoi diagram
In mathematics, a Voronoi diagram is a way
of dividing space into a number of regions.
A set of points (called seeds, sites, or
generators) is specified beforehand and for
each seed there will be a corresponding
region consisting of all points closer to that
seed than to any other. Source: Wikipedia

Step 1:
k initial "means"
(in this case k=3)
are randomly
generated within
the data domain
(shown in color).
Step 2:
k clusters are created by
associating every
observation with the
nearest mean. The
partitions here represent
the Voronoi diagram
generated by the means.
Step 3:
The centroid of
each of the k
clusters becomes
the new mean.
Step 4:
Steps 2 and 3 are
repeated until
convergence has
been reached.
Source: Wikipedia

Modelling / Predicting Airline Passengers Ticket
Choice (Price Sensitivity) with Brand Loyalty
Modeling Passenger's Airline Ticket
Choice Using Segment Specific Cross
Nested Logit Model with Brand Loyalty
A use case in Airline industry
The below paper provides methods for modelling / predicting airline passengers' ticket choices (which are price sensitive) with brand loyalty. The authors used large data sets from a stated-preference choice experiment conducted among Australian citizens traveling to the USA.
(URL: http://www.agifors.org/award/submissions2012/TomDrabas.pdf )
( Published May 2012)
United Airlines is part of the analysis in this paper, along with several Australian airlines
by Tomasz Drabas and Cheng-Lung Wu
of University of New South Wales
Appendix
Sources / Credits / References used to make this presentation:

1. Google : www.google.com
2. Wikipedia : http://www.wikipedia.org/
3. Big Data Analytics: Harvard Extension School http://www.extension.harvard.edu/courses/big-data-analytics
4. Tackling the Challenges of Big Data: Massachusetts Institute of Technology http://web.mit.edu/professional/pdf/oxp-docs/BigDataCourseFlyer.pdf
5. Coursera:
a) The Data Scientist's Toolbox by Brian Caffo, Roger D. Peng, Jeff Leek of Johns Hopkins University https://www.coursera.org/specialization/jhudatascience/1
b) Data Analysis and Statistical Inference by Mine Çetinkaya-Rundel of Duke University: https://www.coursera.org/course/statistics
c) Introduction to Data Science by Bill Howe of University of Washington; https://www.coursera.org/specialization/jhudatascience/1
6. OpenIntro Statistics : http://www.openintro.org/stat/




Excellent Reading Material (publicly available for free )
OpenIntro Statistics, 2nd Edition, authored by
David M Diez, Quantitative Analyst at Google/YouTube
Christopher D Barr, Assistant Professor at Harvard University
Mine Cetinkaya-Rundel, Assistant Professor at Duke University
URL: https://dl.dropboxusercontent.com/u/8602054/os2.pdf
Practical Regression and Anova using R authored by Julian J. Faraway
Electronic version at the below URL is free; Physical / Print version for $79
URL: http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
ThinkStats: Probability and Statistics for Programmers authored by Allen B. Downey
URL: http://greenteapress.com/thinkstats/thinkstats.pdf
Source: Openintro Statistics
Mean
Median
Variance
Standard deviation
Shape of the distribution
Mean vs. Median
If the distribution is
symmetric, center is often
defined as the mean:
mean = median
If the distribution is
skewed or has extreme
outliers, center is often
defined as the median.

Right-skewed:
mean > median
Left-skewed:
mean < median
Probability
Independence & Conditional Probability
Source: Openintro Statistics
Bayes Theorem
Random Variables
Normal Distribution
Normal Distribution with different
parameters
Z Scores
Finding the exact probability - using the Z table
Source: Openintro Statistics
Six Sigma
Binomial Distribution
Normal Probability Plot and Skewness
Central Limit Theorem
Confidence Intervals
Changing the Confidence Levels
Source: Openintro Statistics
p - values
Decision Errors
Hypothesis Test as a trial
Hypothesis Testing for Population Mean
The t - distribution
Type 1 and Type 2 Errors
Source: Openintro Statistics
ANOVA (Analysis of Variance)
Conditions
z/t test vs. ANOVA
Purpose
Method
Source: Openintro Statistics
Parameter and Point Estimate
Comparing two Proportions
Standard Error Calculations
Anatomy of a Test Statistic
Chi-square statistic
Why Square?
The chi-square distribution
Conditions for chi-square test
Source: Openintro Statistics
Quantifying a Relationship
The least squares line
Slope of the
Regression
Intercept
Prediction and Extrapolation
Conditions for
the least
squares line
R²
Terminology
R² vs. Adjusted R²
Adjusted R²
Source: Openintro Statistics
Sensitivity and Specificity
Generalized Linear Models
Logistic Regression
Logit function
Properties of the Logit
The Logistic Regression Model