Data Analysis
Session Speaker
K.M. Sharath Kumar
M. S. Ramaiah University of Applied Sciences
Session Objectives
- To explain the relevance of data analysis for carrying out research
- To explore different types of data analysis techniques for effective interpretation
- To critique and recommend appropriate exploratory data analysis techniques for a problem
Session Outline
Sampling Design
Data Collection Methods
Quantitative and Qualitative Data Analysis
Stages in Data Analysis
Review of Techniques
Error Analysis
One Variant
6,200 Distinct Parts
Imported from 17 Countries
From 240 Suppliers
Assembled in 1 Plant
Within a few minutes
Exported to 34 Countries
Same day
Without becoming inventory!
Data Analysis
(1/2)
Listen to what the data is saying
Quantitative vs. Qualitative
Quantitative
- Explanation through numbers
- Objective
- Deductive reasoning
- Predefined variables and measurement
- Data collection before analysis
- Cause and effect relationships
Qualitative
- Explanation through words
- Subjective
- Inductive reasoning
- Creativity, extraneous variables
- Data collection and analysis intertwined
- Description, meaning
Data and Hard Evidence!!
Types of Data
Continuous Data
Discrete Data
Continuous Data
Data generated by
Physically measuring the characteristic
Generally using an instrument
Assigning a unique value to each item
Examples:
Time to receive a shipment, Time spent per page, Time to activate, CPU Speed, Total Minutes per Incident (TMPI), etc.
Hardness, Strength, Weight, Diameter, etc.
Discrete Data
Data generated by
Classifying the items into different groups based on some criteria
No physical measurement is involved
Examples:
Sex, Shade variation, Surface defects, etc.
% of visitors signing in for AOL messenger per day, Number of Recharges per Month, Number of Operating Systems, % Escalations, etc.
Data
SL No.   Data    SL No.   Data
1        0.98    11       1.02
2        1.03    12       0.98
3        1.00    13       1.01
4        1.00    14       1.01
5        0.99    15       0.99
6        1.01    16       1.00
7        0.97    17       1.01
8        1.02    18       0.99
9        1.00    19       1.00
10       0.99    20       1.02
Random Variables
Sample Space -> X (number of girls) -> Points on the Real Line
X = 0: BBBB
X = 1: BGBB, GBBB, BBBG, BBGB
X = 2: GGBB, GBBG, BGBG, BGGB, GBGB, BBGG
X = 3: BGGG, GBGG, GGGB, GGBG
X = 4: GGGG
x       P(x)
0       1/16
1       4/16
2       6/16
3       4/16
4       1/16
Total   16/16 = 1
The graphical display for this probability distribution is shown on the next slide.
[Figure: bar chart of Probability, P(X), against Number of Girls, X; bar heights 1/16, 4/16, 6/16, 4/16 and 1/16 for X = 0 to 4]
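The distribution of the number of girls can be checked by enumerating the sample space directly. The following is a minimal sketch, assuming the 16 sequences are equally likely (as the 1/16 entries on the slide imply); the names `sample_space`, `counts` and `girls_dist` are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 16 sequences of four B/G outcomes (assumed equally likely).
sample_space = ["".join(seq) for seq in product("BG", repeat=4)]

# The random variable X maps each sequence to its number of girls.
counts = Counter(seq.count("G") for seq in sample_space)
girls_dist = {x: Fraction(c, len(sample_space)) for x, c in sorted(counts.items())}

for x in girls_dist:
    print(f"P(X = {x}) = {counts[x]}/{len(sample_space)}")
```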
Example
Consider the experiment of tossing two six-sided dice. There are 36 possible
outcomes. Let the random variable X represent the sum of the numbers on
the two dice:
Sample space (first die, second die):
1,1  2,1  3,1  4,1  5,1  6,1
1,2  2,2  3,2  4,2  5,2  6,2
1,3  2,3  3,3  4,3  5,3  6,3
1,4  2,4  3,4  4,4  5,4  6,4
1,5  2,5  3,5  4,5  5,5  6,5
1,6  2,6  3,6  4,6  5,6  6,6

x     P(x)
2     1/36
3     2/36
4     3/36
5     4/36
6     5/36
7     6/36
8     5/36
9     4/36
10    3/36
11    2/36
12    1/36
[Figure: bar chart of p(x) against x for x = 2 to 12]
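The table for the sum of two dice can be reproduced by counting outcomes. This is a small sketch, assuming fair dice so that all 36 outcomes are equally likely; the names `outcomes`, `sum_counts` and `dice_dist` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of two six-sided dice (assumed fair).
outcomes = list(product(range(1, 7), repeat=2))

# X is the sum of the two faces; count how many outcomes give each sum.
sum_counts = Counter(a + b for a, b in outcomes)
dice_dist = {x: Fraction(c, len(outcomes)) for x, c in sorted(sum_counts.items())}

for x in dice_dist:
    print(f"x = {x:2d}  P(x) = {sum_counts[x]}/36  ({float(dice_dist[x]):.2f})")
```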
NORMAL DISTRIBUTION
PROCESS
People
[Figure: histogram with a smooth curve interconnecting the center of each bar; horizontal axis: Units of Measure]
Normal Distribution
If the frequency distribution of a set of values is normal, then approximately:
68.27% of the values lie within ±1σ of the mean
AND
95.45% of the values lie within ±2σ of the mean
AND
99.73% of the values lie within ±3σ of the mean
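These percentages follow from the normal distribution itself and can be checked numerically. A minimal sketch using the standard library error function; `coverage_within` is a hypothetical helper name, and the results agree with the slide's figures to within rounding.

```python
import math

def coverage_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {100 * coverage_within(k):.2f}%")
```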
[Slide on converting X into Z; remaining text not recoverable]
Sampling Design
Population (N)
Sample (n)
Sampling design steps:
- Define relevant population
- Identify existing sampling frame
- Evaluate sampling frame, then select it or, if you don't accept it, modify it
- Choose sample type (sampling technique): Probability or Non-Probability
- Draw sample
Types of Sampling
Non-Probability Sampling
- Convenience Sampling
- Expert Sampling
- Quota Sampling
Probability Sampling
- Simple Random Sampling
- Stratified Random Sampling
- Systematic Sampling
- Cluster Sampling
In stratified random sampling, we assume that the population of N units may be divided into m groups with $N_i$ units in each group, i = 1, 2, ..., m. The m strata are nonoverlapping and together they make up the total population: $N_1 + N_2 + \dots + N_m = N$, that is, $\sum_{i=1}^{m} N_i = N$.
Population: Stratum 1 ($N_1$), Stratum 2 ($N_2$), ..., Stratum m ($N_m$)
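Drawing a stratified random sample can be sketched as below. The slides do not say how the sample is split across strata; this example assumes proportional allocation (each stratum contributes in proportion to its size $N_i$), and the stratum names and sizes are made up for illustration.

```python
import random

def stratified_sample(strata, n, seed=0):
    """Draw a simple random sample from every stratum.

    Assumption: proportional allocation, n_i = n * N_i / N, rounded.
    Rounding means the pieces may not always add up to exactly n.
    """
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n_i = round(n * len(units) / total)
        sample[name] = rng.sample(units, n_i)
    return sample

# Hypothetical population of N = 100 units in m = 3 non-overlapping strata.
strata = {
    "stratum_1": list(range(0, 60)),
    "stratum_2": list(range(60, 90)),
    "stratum_3": list(range(90, 100)),
}
strat_sample = stratified_sample(strata, n=10)
for name, units in strat_sample.items():
    print(name, sorted(units))
```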
Example
Suppose in a market survey, you have to select 5
households out of 50 households in a block.
- Number of units in the population N = 50
- Number of units in the sample n = 5
- Sampling Interval K = (N/n) = 50/5 = 10
- Select a random number between 1 and 10
Suppose the selected random number is 5. Starting
with 5, select every 10th unit.
Example Contd.
Households 1 to 50, with the selected units in brackets:
 1   2   3   4  [5]  6   7   8   9  10
11  12  13  14 [15] 16  17  18  19  20
21  22  23  24 [25] 26  27  28  29  30
31  32  33  34 [35] 36  37  38  39  40
41  42  43  44 [45] 46  47  48  49  50
Selected sample: households 5, 15, 25, 35 and 45
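The systematic sampling example can be written as a short routine. A minimal sketch, assuming N is a multiple of n as in the example; `systematic_sample` and its parameters are illustrative names.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Select every k-th unit, k = N / n, from units numbered 1..N."""
    k = N // n  # sampling interval
    if start is None:
        # Random start between 1 and k, as in the example.
        start = random.Random(seed).randint(1, k)
    return [start + j * k for j in range(n)]

# The worked example: N = 50, n = 5, random start 5.
print(systematic_sample(50, 5, start=5))
```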
Cluster Sampling
[Figure: population distribution by group, with the sample distribution drawn from it]
In stratified sampling a random sample ($n_i$) is chosen from each segment of the population ($N_i$).
In cluster sampling observations are drawn from m out of M areas or clusters of the population.
Caution
Sampling Distribution
- A conceptual framework
$\mu_{\bar{X}} = \mu_X$
$\sigma_{\bar{X}} = \sigma_X / \sqrt{n}$
The sample size determines the bound of a statistic, since the standard error of a statistic shrinks as the sample size increases:
[Figure: sampling distributions of a statistic for sample size n and for sample size 2n, each marked with the standard error of the statistic]
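For the sample mean, the shrinking standard error can be made concrete with the usual formula σ/√n. A small sketch under that assumption; the values σ = 500 and n = 100 are chosen only for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean for population SD sigma and sample size n."""
    return sigma / math.sqrt(n)

se_n = standard_error(500, 100)    # sample size n
se_2n = standard_error(500, 200)   # sample size 2n
print(f"n = 100: {se_n:.2f}, n = 200: {se_2n:.2f}")
```

Doubling the sample size divides the standard error by √2 rather than by 2, so precision improves more slowly than the sample grows.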
Sample size for estimating a population mean:
$n = \left(\frac{Z\sigma}{E}\right)^2$
Example
A marketing manager of a fast food restaurant in a
city wishes to estimate the average yearly amount
that families spend on fast food restaurants. He wants
the estimate to be within ± Rs. 100 with a
confidence level of 99%. It is known from an earlier
pilot study that the standard deviation of the family
expenditure on fast food restaurant is Rs. 500. How
many families must be chosen for this problem?
Solution
Z = 2.58 (99% confidence), σ = 500, E = 100
Applying the formula
n = ((2.58 × 500)/100)^2 = 166.41
n = 167 (rounded up)
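The fast food example can be worked through in a few lines. A minimal sketch, assuming Z = 2.58 for 99% confidence (the value used in the proportion example later in the deck) and rounding up to the next whole family; `sample_size_for_mean` is a hypothetical helper name.

```python
import math

def sample_size_for_mean(z, sigma, error):
    """n = (z * sigma / error)^2, rounded up to a whole unit."""
    return math.ceil((z * sigma / error) ** 2)

# sigma = Rs. 500 from the pilot study, allowed error = Rs. 100.
n_families = sample_size_for_mean(2.58, 500, 100)
print(n_families)
```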
Sample size for estimating a population proportion:
$E = Z\sqrt{\frac{p(1-p)}{n}}$
$n = \frac{Z^2\, p(1-p)}{E^2}$
Example
A company manufacturing sports goods wants to
estimate the proportion of cricket players among high
school students in India. The company wants the
estimate to be within ± 0.03 with a confidence
level of 99%. A pilot study done earlier reveals that
out of 80 high school students, 36 students play
cricket. What should be the sample size?
Solution
p = 36/80 = 0.45
Applying the formula
n = ((2.58^2) (0.45(1-0.45)))/(0.03^2)
n = 1831
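The same calculation as a short sketch, using the numbers from the solution above; `sample_size_for_proportion` is an illustrative name, and the result is rounded up to the next whole student.

```python
import math

def sample_size_for_proportion(z, p, error):
    """n = z^2 * p * (1 - p) / error^2, rounded up to a whole unit."""
    return math.ceil(z ** 2 * p * (1 - p) / error ** 2)

p_pilot = 36 / 80  # pilot study: 36 of 80 students play cricket
n_students = sample_size_for_proportion(2.58, p_pilot, 0.03)
print(p_pilot, n_students)
```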
Scales of Measurement
Nominal Scale - groups or classes
Gender
Likert Scale
Ex:
I plan to purchase a laptop in the next twelve months
Yes
No
Strongly Agree
Agree
Neutral
Disagree
Strongly Disagree
3, 2, 1 (Extremely Unfavourable)
Data Analysis
http://www.wordinfo.info/words/index/info/view_unit/1/?
Stages in data analysis:
- Editing
- Coding
- Data entry (keyboarding)
- Error checking and verification
- Data analysis: descriptive, univariate, bivariate and multivariate analysis
- Interpretation
Count (frequencies)
Percentage
Mean
Mode
Median
Range
Standard deviation
Variance
Ranking
Frequency and Percentage Distributions
Women (N=30)
Response      Frequency   Percentage
A lot         14          46%
Some          9           30%
A little      5           17%
Not at all    2           7%
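A frequency and percentage distribution can be tabulated from raw responses as sketched below. Only the count of 14 for "A lot" survives on the slide; the counts 9, 5 and 2 are an assumption inferred from the percentages and N = 30. Note that 14/30 is 46.7%, which the slide shows as 46%.

```python
from collections import Counter

# Raw responses rebuilt from the (partly inferred) counts for N = 30 women.
responses = ["A lot"] * 14 + ["Some"] * 9 + ["A little"] * 5 + ["Not at all"] * 2

freq = Counter(responses)
total = sum(freq.values())
pct = {answer: 100 * count / total for answer, count in freq.items()}

for answer, count in freq.items():
    print(f"{answer:<12}{count:>4}{pct[answer]:>8.1f}%")
```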
Court Referral
Social Worker
Friend or Acquaintan
Librarian
Web Search Engine
Newspaper Story
Other
Grades by subject (Math, History, English, Biology, Music, Latin, Gym), in ascending order: 40, 92, 93, 94, 95, 96, 98
Mean = 87
Median = 94
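The gap between the mean and the median comes from the single low score. A short sketch with the standard library, assuming the seven scores are 40, 92, 93, 94, 95, 96 and 98 (the set that reproduces the slide's mean of 87 and median of 94):

```python
import statistics

scores = [40, 92, 93, 94, 95, 96, 98]

mean_all = statistics.mean(scores)
median_all = statistics.median(scores)

# Dropping the outlier (40) moves the mean a lot and the median only a little.
trimmed = [s for s in scores if s != 40]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"all scores: mean = {mean_all:.1f}, median = {median_all}")
print(f"without 40: mean = {mean_trimmed:.1f}, median = {median_trimmed}")
```

The median is the better summary here because it is not pulled down by one extreme value.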
Note
[Figure: two sets of scores compared. One set, including 40, 50, 55 and 94, has Mean = 81; the set 40, 92, 93, 94, 95, 96, 98 has Mean = 87]
Histograms
Cross Tabulations
Graphing comparisons
[Figure: bar chart "Satisfaction with Services"; vertical axis: Satisfaction Score (0 to 40); horizontal axis: Clinic Name]
[Figure: grouped bar chart of Satisfaction Score (0 to 14) by Clinic, with series Staff, Advice and Facility]
[Figure: grouped bar chart of Satisfaction Score (0 to 14) by Satisfaction Component (Staff, Advice, Facility), with series for clinics A to E]
Bi-variate Analysis
Y = f (X)
Correlation
Regression
Chi-square Test and Cramér's V
Hypothesis Test for two population means/proportions
Paired T-tests comparing two groups
Measure of Correlation: Coefficient of Correlation
Symbol: r
Range: -1 to 1
Sign: Type of correlation
Value: Degree of correlation
Examples:
r = 0.6: moderate positive correlation
r = -0.82: strong negative correlation
r = 0: no linear correlation
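The coefficient of correlation can be computed from paired data as sketched below. This is a plain implementation of the Pearson formula; the function name `pearson_r` and the small data sets are illustrative and not taken from the slides.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfectly increasing and a perfectly decreasing relation.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```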
Regression
Regression helps
To identify the exact form of the relationship
To model output in terms of input or process variables
y=a+bx
Examples:
Yield = 5 + 3 × Time
Y = 2 - 5x
Coefficient of Determination
Measure of degree of Relationship
Symbol: R²
Range of R²: 0 to 1
y: 69, 78, 8, 21, 24, 72

Regression Statistics
Multiple R           0.594159006
R Square             0.353024925
Adjusted R Square    0.191281156
Standard Error       27.80337004
Observations         6

             Coefficients
Intercept    83.00449781
x            -0.605970474
y     Predicted y   Error     Error Square
69    43.62         25.38     644.33
78    78.16         -0.16     0.02
8     29.07         -21.07    444.08
21    29.68         -8.68     75.33
24    52.71         -28.71    824.03
72    38.77         33.23     1104.32
Sum                           3092.11
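The regression output can be reproduced with a least-squares fit. The x values are missing from the extracted slide; the values used below (65, 8, 89, 88, 50, 73) are an assumption, back-calculated from the predicted y values and the reported coefficients. The function name `fit_line` is illustrative.

```python
def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, r_squared, sse)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Sum of squared errors between observed and predicted y.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, 1 - sse / syy, sse

x_obs = [65, 8, 89, 88, 50, 73]   # assumed: inferred, not shown on the slide
y_obs = [69, 78, 8, 21, 24, 72]   # from the slide

a, b, r_squared, sse = fit_line(x_obs, y_obs)
print(f"intercept = {a:.4f}, slope = {b:.4f}, R^2 = {r_squared:.4f}, SSE = {sse:.2f}")
```

With these inputs the fit agrees with the slide's intercept, slope, R Square and error sum of squares to within rounding.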
Components of the mean squared error of prediction (MSEP):
$U_{bias} = \left(\overline{f(X)} - \bar{Y}\right)^2 / MSEP$
$U_{slope} = S_X^2 (1 - b)^2 / MSEP$
$U_{random} = (1 - r^2)\, S_Y^2 / MSEP$
Logistic Regression
Objective
To develop a mathematical model for an attribute or response metric
(Y) in terms of other available attributes (Xs).
When to Use
Xs : Continuous
Y : Discrete binary
Objective
To test hypotheses that compare the population means of interest for two separate populations (independent samples)
Test Statistic (Large Sample)
$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
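The large-sample test statistic for two independent means can be computed as below. A minimal sketch; for large samples the sample standard deviations are commonly used in place of the population values, which is the assumption made here, and the input numbers are hypothetical.

```python
import math

def two_sample_z(mean1, mean2, sd1, sd2, n1, n2):
    """Z statistic for the difference between two independent sample means."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

# Hypothetical summary data for two independent samples of 50 each.
z_stat = two_sample_z(mean1=10.0, mean2=8.0, sd1=2.0, sd2=2.0, n1=50, n2=50)
print(f"Z = {z_stat:.2f}")
```

The statistic is then compared with the critical value for the chosen significance level, for example 1.96 for a two-sided test at 5%.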
Chi-Square Test
Objective:
To test whether two variables which have frequency data are related or not
Usage:
When both the variables (X & Y) are categorical (grouped)
Cramér's V: To quantify the relationship between X & Y
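Both quantities can be computed from a table of observed frequencies. The sketch below uses the standard formulas (expected count = row total × column total / N, and V = sqrt(χ² / (N × min(rows − 1, columns − 1)))); the 2 × 2 table is made up for illustration.

```python
import math

def chi_square_and_cramers_v(table):
    """Chi-square statistic and Cramér's V for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return chi2, math.sqrt(chi2 / (n * k))

# Hypothetical cross tabulation of two categorical variables X and Y.
observed = [[10, 20],
            [20, 10]]
chi2, cramers_v = chi_square_and_cramers_v(observed)
print(f"chi-square = {chi2:.3f}, Cramér's V = {cramers_v:.3f}")
```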
Multivariate Analysis
The analysis of the simultaneous relationships among several variables
Analyse the data covariance structure to understand it or to reduce the data dimension
Assign observations to groups
Explore relationships among categorical variables
Multiple Regression
To model output variable y in terms of two or more variables
General Form:
Y = a + b1X1 + b2X2 + ... + bkXk
Two variable case:
Y = a + b1X1 + b2X2
Adjusted R²
If Adj R² > 0.6, then the model is reasonably good
P value from coefficient table
If p value < 0.05, the corresponding term has a statistically significant relationship with the output
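Adjusted R² penalises R² for the number of predictors. A small sketch of the usual formula, 1 − (1 − R²)(n − 1)/(n − k − 1), checked against the simple regression output shown earlier in the deck (R Square 0.353024925, 6 observations, 1 predictor); `adjusted_r2` is an illustrative name.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj = adjusted_r2(0.353024925, n=6, k=1)
print(f"Adjusted R^2 = {adj:.6f}")
```

The result matches the reported Adjusted R Square of 0.191281156, well below the 0.6 rule of thumb.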
[Figure: Impurity (0.018 to 0.038) plotted against Time, grouped by Shift]
Evidence of a strong shift-to-shift effect
$MSE = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}$
$MB = \frac{1}{n}\sum_{i=1}^{n} \left(Y_i - f(X_i)\right)$
Factor Analysis
[Figure: loading plot of Pop, ..., Home; horizontal axis: First Factor; vertical axis: Second Factor; variables plotted: Home, School, Pop, Employ, Health]
Explain the presence of each variable with the sign (+ or -). This way we can reduce the number of variables
Predictors Selection
P = 0.001
Classification Methods
Example:
Attribute 1: x1
Attribute 2: x2
Label: y, with classes y1 (Red) and y2 (Blue)
[Figure: scatter plot of x2 (20 to 40) against x1 (10.00 to 20.00)]
[Figure: decision tree. x2 > 35: y1; x2 < 28: y2; otherwise x1 < 15.5: y2 and x1 > 15.5: y1]
CLASSIFICATION METHODS
Example: Rules
Attribute 1: x1
Attribute 2: x2
Label: y, with classes y1 (Red) and y2 (Blue)
If x2 > 35, then y = y1
If x2 < 28, then y = y2
If 28 <= x2 <= 35 & x1 > 15.5, then y = y1
If 28 <= x2 <= 35 & x1 < 15.5, then y = y2
[Figure: decision tree. x2 > 35: y1; x2 < 28: y2; otherwise x1 < 15.5: y2 and x1 > 15.5: y1]
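The rules translate directly into a small classifier. A minimal sketch; the slides do not say what happens exactly at x1 = 15.5, so assigning that boundary to y2 is an assumption, and `classify` is an illustrative name.

```python
def classify(x1, x2):
    """Apply the decision rules: first split on x2, then on x1."""
    if x2 > 35:
        return "y1"  # Red
    if x2 < 28:
        return "y2"  # Blue
    # 28 <= x2 <= 35: decide on x1 (boundary x1 == 15.5 assumed to be y2).
    return "y1" if x1 > 15.5 else "y2"

# Hypothetical points, one for each leaf of the tree.
for point in [(12.0, 36.0), (18.0, 25.0), (17.0, 30.0), (12.0, 30.0)]:
    print(point, classify(*point))
```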
Cluster Analysis
Objective
To classify the records or items into a smaller number of groups based
on the values of available attributes.
When to Use
When there is no Y attribute
All attributes are considered as Xs only
[Figure: scatter plots of Weight in kg against Acceleration in m/s2]
[Slide on splitting total variation into components for Population 1, Population 2 and Population 3; remaining text not recoverable]
Optimisation Methods
Objective
To identify the best values of a set of variables
(Xs) which will optimize an objective function
satisfying a given set of constraints
For n variables in m constraints
Max / Min $Z = c_1 x_1 + c_2 x_2 + \dots + c_n x_n$
Subject to
$a_{11} x_1 + a_{12} x_2 + \dots + a_{1n} x_n \;(\le, \ge \text{ or } =)\; b_1$
$a_{21} x_1 + a_{22} x_2 + \dots + a_{2n} x_n \;(\le, \ge \text{ or } =)\; b_2$