Analysis of Data
Session – I
Prof. Arnab K Laha - Analysis of Data
(PGP-X)
Summarizing Data
• Raw data is often voluminous and difficult
to handle
• Decision makers want a few numbers to
summarize the entire data
• Summarization leads to loss of information
but can help focus on key aspects of the
dataset
Box Plot
[Figure: Boxplot of C1. The vertical axis (C1) runs from 0 to 160; the plot is annotated with Min, Q1, Median, Q3 and Max.]
Identifying Outliers
• The interquartile range (IQR) is defined as IQR =
Q3 – Q1
• An observation (x) is a ‘soft’ or ‘possible’ outlier if
x > Q3 + 1.5 IQR or x < Q1 – 1.5 IQR
• An observation (x) is a ‘hard’ or ‘confirmed’
outlier if x > Q3 + 3 IQR or x < Q1 – 3 IQR
• Note: All ‘hard’ outliers are also ‘soft’ outliers but
not vice versa.
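The two rules above can be sketched in Python as follows. This is only an illustration: the function name `classify_outliers` and the sample values are made up, and the quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is one of several common quartile conventions (the slides do not say which one is intended).

```python
import statistics


def classify_outliers(data):
    """Label each observation 'hard', 'soft' or None using the IQR rules.

    Assumption: quartiles follow the 'inclusive' convention of the
    statistics module; other conventions can move the fences slightly.
    """
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for x in data:
        if x > q3 + 3 * iqr or x < q1 - 3 * iqr:
            # Every 'hard' outlier is also 'soft'; report the stronger label.
            labels.append("hard")
        elif x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr:
            labels.append("soft")
        else:
            labels.append(None)
    return labels


# Hypothetical sample, not taken from the slides.
sample = [10, 12, 13, 14, 15, 16, 18, 30, 60]
for value, label in zip(sample, classify_outliers(sample)):
    print(value, label)
```

Here Q1 = 13 and Q3 = 18, so IQR = 5; the value 30 lies beyond Q3 + 1.5 IQR = 25.5 but not beyond Q3 + 3 IQR = 33, while 60 lies beyond both.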
• Arithmetic Mean : $\bar{x} = \dfrac{x_1 + \dots + x_n}{n}$
• Standard Deviation : $s = \sqrt{\dfrac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n}} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}$
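As a small check of the two expressions for the standard deviation (both with divisor n), the sketch below computes the mean and both forms on a made-up sample. The function name `mean_and_sd` and the data are illustrative assumptions only.

```python
import math


def mean_and_sd(xs):
    """Return (mean, SD from the definition, SD from the shortcut form).

    Both SDs use divisor n, as in the formulas above.
    """
    n = len(xs)
    mean = sum(xs) / n
    # Definition: root of the average squared deviation from the mean.
    sd_definition = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    # Shortcut: root of (average of squares minus squared mean).
    # max(0, ...) guards against a tiny negative value from rounding.
    sd_shortcut = math.sqrt(max(0.0, sum(x * x for x in xs) / n - mean ** 2))
    return mean, sd_definition, sd_shortcut


# Hypothetical sample, not taken from the slides.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean_and_sd(sample))
```

The shortcut form is convenient for hand calculation, but the definition form is usually the safer choice in floating-point arithmetic when the mean is large relative to the spread.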
Chebyshev’s Inequality
• Chebyshev’s inequality asserts that, for any t > 0, the
proportion of observations outside the interval
(mean – t SD, mean + t SD) is at most $t^{-2} = 1/t^2$
• Using Chebyshev’s inequality we have the
proportion of observations outside
i) (mean – 2 SD, mean + 2 SD) is at most 1/4 = 0.25
ii) (mean – 3 SD, mean + 3 SD) is at most 1/9 ≈ 0.11
iii) (mean – 4 SD, mean + 4 SD) is at most 1/16 ≈ 0.06
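The bound can be checked numerically on any data set. The following sketch, with a hypothetical function name `proportion_outside` and made-up data, counts the observations lying outside the open interval (mean – t SD, mean + t SD) and compares the proportion with 1/t². It assumes the SD is computed with divisor n and is not zero.

```python
import math


def proportion_outside(xs, t):
    """Proportion of observations outside (mean - t*SD, mean + t*SD).

    Assumes SD > 0; if all values are equal the interval is empty and
    the comparison with 1/t**2 is not meaningful.
    """
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    # A point is outside the open interval when |x - mean| >= t * SD.
    outside = sum(1 for x in xs if abs(x - mean) >= t * sd)
    return outside / n


# Hypothetical sample, not taken from the slides.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
for t in (2, 3, 4):
    print(t, proportion_outside(sample, t), "bound:", 1 / t ** 2)
```

For most data sets the observed proportion is far below the bound; Chebyshev’s inequality is a worst-case guarantee, not an estimate.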
Impact of Outliers
• Both the mean and the SD are quite
sensitive to the presence of outliers
• If mean and SD are proposed to be used
for summarizing a data set it is better to
delete the outliers first and then proceed to
calculate the mean and SD
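To illustrate this sensitivity, the sketch below compares the mean and SD of a small made-up sample with and without a single extreme value. The data and the helper name `summarize` are assumptions for illustration; `statistics.pstdev` is used because it divides by n.

```python
import statistics


def summarize(xs):
    """Return (mean, SD with divisor n) for a list of numbers."""
    return statistics.mean(xs), statistics.pstdev(xs)


# Hypothetical data: the same sample without and with one extreme value.
clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = clean + [60]

mean_clean, sd_clean = summarize(clean)
mean_out, sd_out = summarize(with_outlier)

print("without outlier:", mean_clean, round(sd_clean, 2))
print("with outlier:   ", mean_out, round(sd_out, 2))
```

A single extreme observation shifts the mean noticeably and inflates the SD several times over, which is why the outliers are dealt with before these two summaries are reported.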
Frequency Distribution
• Often, particularly for large data sets, it is
advantageous to summarize data using a
frequency distribution.
• The entire range (Range = Max – Min) of the
data is divided into a few disjoint classes each of
which is an interval
• A frequency distribution gives the list of the
classes along with the number of observations in
each class (called the frequency of the class)
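A frequency distribution of this kind can be built with a few lines of Python. In the sketch below the function name `frequency_distribution`, the class boundaries and the data are all made up; it also assumes that each class includes its lower boundary and excludes its upper boundary, except the last class, which includes both (the slides do not state a boundary convention).

```python
from bisect import bisect_right


def frequency_distribution(data, edges):
    """Count observations in the classes defined by consecutive edges.

    Assumed convention: classes are [lower, upper), except the last
    class, which is closed on both sides so that the maximum is counted.
    """
    counts = [0] * (len(edges) - 1)
    for x in data:
        if x < edges[0] or x > edges[-1]:
            raise ValueError(f"{x} lies outside the class boundaries")
        # Index of the class whose lower edge is the largest one <= x.
        i = min(bisect_right(edges, x) - 1, len(counts) - 1)
        counts[i] += 1
    return [(edges[i], edges[i + 1], counts[i]) for i in range(len(counts))]


# Hypothetical data and class boundaries, not taken from the slides.
data = [1, 2, 2, 3, 5, 7, 8, 10]
for lower, upper, freq in frequency_distribution(data, [0, 4, 8, 12]):
    print(f"{lower} – {upper}: {freq}")
```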
Example
Duration of Pregnancy (days)   Frequency   Relative Frequency   Frequency Density
255 – 258.5                    76          0.07                 0.028
258.5 – 265.5                  121         0.11                 0.044
265.5 – 272.5                  334         0.31                 0.122
272.5 – 279.5                  348         0.32                 0.127
279.5 – 286.5                  205         0.19                 0.075
286.5 – 289                    10          0.01                 0.004
Total                          1094
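The relative frequencies in the table are the class frequencies divided by the total number of observations. The sketch below recomputes that column from the frequencies shown; the variable names are illustrative, and only the relative-frequency column is reproduced here.

```python
# Class boundaries and frequencies as listed in the table.
classes = [
    (255.0, 258.5, 76),
    (258.5, 265.5, 121),
    (265.5, 272.5, 334),
    (272.5, 279.5, 348),
    (279.5, 286.5, 205),
    (286.5, 289.0, 10),
]

total = sum(freq for _, _, freq in classes)

# Relative frequency = class frequency / total, rounded to two decimals.
relative = [round(freq / total, 2) for _, _, freq in classes]

for (lower, upper, freq), rel in zip(classes, relative):
    print(f"{lower} – {upper}: frequency {freq}, relative frequency {rel}")
print("Total:", total)
```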
Histogram
[Figure: histogram of the pregnancy duration data. Vertical axis: Frequency Density (0.000 to 0.140). Horizontal axis: Duration of Pregnancy, with classes 255 – 258.5, 258.5 – 265.5, 265.5 – 272.5, 272.5 – 279.5, 279.5 – 286.5 and 286.5 – 289.]
Sturges’ Rule
• The number of classes is $k = [1 + \log_2 n] + 1$, where
$[1 + \log_2 n]$ is the greatest integer less than or
equal to $1 + \log_2 n$
• In other words, choose k such that
$2^{k-2} \le n < 2^{k-1}$
• E.g. n = 35 => k = 7
n = 83 => k = 8
• The class width (h) is computed as Range
divided by the number of classes.
• h = (Max – Min) / k
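The rule can be sketched in Python by implementing the formula k = [1 + log2 n] + 1 literally, followed by the class width h = (Max – Min) / k. The function names are illustrative assumptions; the usage lines borrow n = 1094, Min = 255 and Max = 289 from the pregnancy duration data set in these slides.

```python
import math


def sturges_classes(n):
    """Number of classes k = [1 + log2(n)] + 1, with [.] the floor."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.floor(1 + math.log2(n)) + 1


def class_width(min_value, max_value, k):
    """Class width h = Range / k."""
    return (max_value - min_value) / k


# Pregnancy duration data: n = 1094, Min = 255, Max = 289.
k = sturges_classes(1094)
h = class_width(255, 289, k)
print("classes:", k, "width:", round(h, 2))
```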
Freedman-Diaconis Rule
• The class width is given by $h = \dfrac{2\,\mathrm{IQR}}{n^{1/3}}$
• The number of classes is $k = \left[\dfrac{\mathrm{Range}}{h}\right] + 1$,
where $[\,\cdot\,]$ is the greatest integer function.
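A minimal sketch of the rule in Python is given below. The function names are illustrative assumptions; the usage lines take n = 1094, Range = 34 and IQR = 11.2 from the pregnancy duration data set in these slides.

```python
import math


def fd_class_width(iqr, n):
    """Freedman-Diaconis class width h = 2 * IQR / n^(1/3)."""
    return 2 * iqr / n ** (1 / 3)


def fd_num_classes(data_range, h):
    """Number of classes k = [Range / h] + 1, with [.] the floor."""
    return math.floor(data_range / h) + 1


# Pregnancy duration data: n = 1094, Range = 289 - 255 = 34, IQR = 11.2.
h = fd_class_width(iqr=11.2, n=1094)
k = fd_num_classes(data_range=34, h=h)
print("width:", round(h, 3), "classes:", k)
```

Unlike Sturges’ rule, the width here depends on the spread of the middle half of the data (through the IQR), so it is less affected by extreme observations than a width based on the range alone.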
Example
• For the pregnancy duration data set we have n =
1094, Range = 289-255 = 34
Q1 = 267.1, Q3=278.3, IQR = 11.2
• The Freedman-Diaconis rule gives the class width
h = 2 × 11.2 / 1094^(1/3) ≈ 2.17, taken here as 2.18
• The number of class intervals is therefore [34 / 2.18] + 1 = 16.
• The class intervals are 255 – 257.18, 257.18 –
259.36, … , 285.52 – 287.7 and 287.7 – 289
(note the last interval has shorter length)
[Figure: plot with horizontal axis tm (about -10 to 5) and vertical axis from 0 to 10.]
[Figure: plot with horizontal axis tm (about -8 to 4) and vertical axis from 0 to 8.]