Analysis of Data Session 1

Prof. Arnab K Laha - Analysis of Data

(PGP-X)
Analysis of Data
Session – I
Summarizing Data
• Raw data is often voluminous and difficult
to handle
• Decision makers want a few numbers to
summarize the entire data
• Summarization leads to loss of information
but can help focus on key aspects of the
dataset
Five Number Summary of Data

• (Min, Q1, Median, Q3, Max) is called the five number
summary of data
• 25% of the observations are below Q1 and 75% of the
observations are above Q1.
• Q1 is called the first quartile
• 50% of the observations are below Median and 50% are
above Median
• 75% of the observations are below Q3 and 25% are
above Q3.
• Q3 is called the third quartile.
• Each of the segments Min-Q1, Q1-Med, Med-Q3 and
Q3-Max contains 25% of the data.
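
A minimal sketch of how the five number summary might be computed in Python. Quartile conventions differ between textbooks and software; this sketch assumes the convention "smallest observation with at least a fraction p of the data at or below it". The function names and the sample data are illustrative only.

```python
import math

def quantile(data, p):
    """Smallest observation with at least a fraction p of the data at or below it."""
    xs = sorted(data)
    rank = max(1, math.ceil(p * len(xs)))  # 1-based rank of that observation
    return xs[rank - 1]

def five_number_summary(data):
    """Return (Min, Q1, Median, Q3, Max) under the convention assumed above."""
    return (min(data), quantile(data, 0.25), quantile(data, 0.5),
            quantile(data, 0.75), max(data))

sample = [7, 2, 9, 4, 12, 5, 4, 10]  # made-up data
print(five_number_summary(sample))   # (2, 4, 5, 9, 12)
```

Other software may interpolate between neighbouring observations and so report slightly different quartiles for the same data.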
Box Plot
[Figure: Boxplot of C1. Vertical axis: C1, scale 0 to 160, with Min, Q1, Median, Q3 and Max marked on the plot]
Unusual Observations (Outliers)

• It is sometimes seen that one or a few
observations in a dataset are remarkably
different from the rest
• These observations are called outliers
• Outliers can substantially impact the
statistical analysis and hence need to be
considered separately
• Outliers are not necessarily ‘bad’
observations; but they are different
Identifying Outliers
• The interquartile range (IQR) is defined as IQR =
Q3 – Q1
• An observation (x) is a ‘soft’ or ‘possible’ outlier if
x > Q3 + 1.5 IQR or x < Q1 – 1.5 IQR
• An observation (x) is a ‘hard’ or ‘confirmed’
outlier if x > Q3 + 3 IQR or x < Q1 – 3 IQR
• Note: All ‘hard’ outliers are also ‘soft’ outliers but
not vice versa.
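
The two rules can be sketched in Python as below. The function name, the sample data and the quartile values passed in are hypothetical; Q1 and Q3 are assumed to have been computed beforehand.

```python
def find_outliers(data, q1, q3):
    """Return (soft, hard) lists of outliers using the IQR rules."""
    iqr = q3 - q1
    soft_low, soft_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    hard_low, hard_high = q1 - 3 * iqr, q3 + 3 * iqr
    # Every hard outlier also satisfies the soft condition, so it appears in both lists.
    soft = [x for x in data if x < soft_low or x > soft_high]
    hard = [x for x in data if x < hard_low or x > hard_high]
    return soft, hard

sample = [2, 4, 4, 5, 7, 9, 10, 12, 18, 30]   # made-up data
soft, hard = find_outliers(sample, q1=4, q3=9)  # assumed quartiles, IQR = 5
print(soft)  # [18, 30]
print(hard)  # [30]
```

Here the soft fences are -3.5 and 16.5 and the hard fences are -11 and 24, so 18 is only a possible outlier while 30 is a confirmed one.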
Dealing with outliers (if present)

• Accommodative Approach: Use methods which
are resistant to the presence of outliers (Robust
methods)
• Example: 5% Trimmed Mean
• 5% Trimmed Mean is computed as follows:
Arrange the data in increasing order and then
delete the lower 5% of the observations and also
the upper 5% of the observations. Compute the
simple average of the remaining observations.
• Deletion Approach: Delete the outliers and work
with the remaining data set.
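
A minimal sketch of the trimmed mean in Python. The text does not say what to do when 5% of n is not a whole number; rounding the number of deleted observations down is an assumption made here, and the function name and data are illustrative.

```python
def trimmed_mean(data, percent=5):
    """Mean after deleting the lowest and highest `percent`% of the observations."""
    xs = sorted(data)
    k = len(xs) * percent // 100   # observations dropped from each end (rounded down)
    kept = xs[k:len(xs) - k]       # assumes percent < 50, so something remains
    return sum(kept) / len(kept)

sample = list(range(1, 20)) + [1000]   # 1, 2, ..., 19 and one gross outlier
print(sum(sample) / len(sample))       # 59.5
print(trimmed_mean(sample))            # 10.5
```

With 20 observations, one value is removed from each end, so the outlier 1000 no longer affects the result.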
Two number summary of data

• Often data sets are summarized by giving
only two numbers:
- a measure of central tendency and
- a measure of spread (around the
measure of central tendency)
Mean and Standard Deviation
• Arithmetic Mean: $\bar{x} = \dfrac{x_1 + \dots + x_n}{n}$
• Standard Deviation: $s = \sqrt{\dfrac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n}} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}$
Mean and Standard Deviation

• Note that Standard Deviation (SD) = 0 only
when all the observations are equal.
• The higher the SD the higher is the spread
around the mean value
• Lower SD indicates ‘better reliability’ of the
mean value in representing the dataset.
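
The mean and the SD (with divisor n, as defined in this session) can be sketched in Python as follows. Both forms of the SD formula are shown; the function names and data are illustrative.

```python
import math

def mean(data):
    return sum(data) / len(data)

def sd(data):
    """SD with divisor n: root of the average squared deviation from the mean."""
    m = mean(data)
    return math.sqrt(sum((x - m) ** 2 for x in data) / len(data))

def sd_shortcut(data):
    """Equivalent form: root of (average of squares minus square of the mean)."""
    m = mean(data)
    return math.sqrt(sum(x * x for x in data) / len(data) - m ** 2)

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data
print(mean(sample), sd(sample), sd_shortcut(sample))  # 5.0 2.0 2.0
print(sd([3, 3, 3]))                                  # 0.0 (all observations equal)
```

The standard library function `statistics.pstdev` uses the same divisor n, while `statistics.stdev` divides by n - 1. The shortcut form can lose accuracy through rounding when the SD is tiny relative to the mean, so the first form is usually the safer one in code.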
Chebyshev’s Inequality
• For any data set (with SD > 0), Chebyshev’s inequality asserts
that the proportion of observations outside the interval
(mean – t SD, mean + t SD) is at most $1/t^2$, for every t > 0
• Using Chebyshev’s inequality we have the
proportion of observations outside
i) (mean – 2 SD, mean + 2 SD) is at most 1/4 = 0.25
ii) (mean – 3 SD, mean + 3 SD) is at most 1/9 ≈ 0.111
iii) (mean – 4 SD, mean + 4 SD) is at most 1/16 = 0.0625
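
The bound can be checked numerically on any data set. The sketch below uses made-up data and a hypothetical helper name, and computes the SD with divisor n.

```python
import math

def proportion_outside(data, t):
    """Proportion of observations outside (mean - t*SD, mean + t*SD)."""
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / n)  # SD with divisor n
    outside = sum(1 for x in data if abs(x - m) >= t * s)
    return outside / n

sample = [2, 4, 4, 4, 5, 5, 7, 9, 30]   # made-up data with one large value
for t in (2, 3, 4):
    # Observed proportion next to the Chebyshev bound 1/t^2
    print(t, proportion_outside(sample, t), 1 / t ** 2)
```

The observed proportion is typically far below the bound, because the inequality makes no assumption about the shape of the data.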
Impact of Outliers
• Both the mean and the SD are quite
sensitive to the presence of outliers
• If mean and SD are proposed to be used
for summarizing a data set it is better to
delete the outliers first and then proceed to
calculate the mean and SD
Median and MAD

• An alternative to using mean and SD as a summary
of the data is to use Median and MAD
• MAD is the acronym for Median Absolute Deviation
about Median
• If med is the median of the dataset x1,…,xn, MAD is
the median of the set of numbers {|x1-med|,
|x2-med|,…,|xn-med|}
• Usually 1.4826 × MAD is used as a measure of
spread (the factor makes it comparable to the SD
for normally distributed data)
• The Median and MAD are both far less sensitive to
presence of outliers than mean and SD. No deletion
of data is required if these are used.
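
A small sketch comparing the two summaries on made-up data, with and without a gross outlier. It uses `statistics.median_low`, which always returns an actual observation; other median conventions average the two middle values when n is even, so that choice is an assumption.

```python
import statistics

def mad(data):
    """Median absolute deviation about the median."""
    med = statistics.median_low(data)
    return statistics.median_low([abs(x - med) for x in data])

clean = [10, 12, 11, 13, 12, 11, 12]   # made-up data
dirty = clean[:-1] + [120]             # last observation replaced by an outlier

for name, data in (("clean", clean), ("dirty", dirty)):
    print(name,
          "mean:", round(statistics.mean(data), 2),
          "SD:", round(statistics.pstdev(data), 2),
          "median:", statistics.median_low(data),
          "1.4826*MAD:", round(1.4826 * mad(data), 4))
```

In this example the median (12) and the MAD (1) are the same for both data sets, while the mean moves from about 11.57 to 27 and the SD grows by a large factor.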
Empirical Cumulative Distribution Function

• Let $\{x_1, \dots, x_n\}$ be the given data set.
The Empirical Cumulative Distribution Function (ECDF) is
defined as $F_n(t) = \dfrac{\#\{\text{observations} \le t\}}{n}$
• $F_n(t) - F_n(s)$ = proportion of observations greater than s but
less than or equal to t (for s < t).
• The smallest number $Q_1$ satisfying $F_n(Q_1) \ge 0.25$ is called the
first quartile
• The smallest number $\tilde{m}$ satisfying $F_n(\tilde{m}) \ge 0.5$ is called the median
• The smallest number $Q_3$ satisfying $F_n(Q_3) \ge 0.75$ is called the
third quartile
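
The ECDF and the quartile definitions based on it can be sketched in Python as below. The helper names and the sample data are illustrative, not part of the original material.

```python
from bisect import bisect_right

def ecdf(data):
    """Return the function F_n(t) = (number of observations <= t) / n."""
    xs = sorted(data)
    n = len(xs)
    def F(t):
        return bisect_right(xs, t) / n  # bisect_right counts the values <= t
    return F

def smallest_with(F, data, p):
    """Smallest observation x with F(x) >= p."""
    return next(x for x in sorted(data) if F(x) >= p)

sample = [7, 2, 9, 4, 12, 5, 4, 10]   # made-up data
F = ecdf(sample)
print(F(4))          # 0.375 (three of the eight observations are <= 4)
print(F(9) - F(4))   # 0.375 (the observations 5, 7 and 9 lie in (4, 9])
print([smallest_with(F, sample, p) for p in (0.25, 0.5, 0.75)])  # [4, 5, 9]
```

Because F_n is a step function that only jumps at observed values, the smallest number satisfying each inequality is always one of the observations, which is why the search runs over the sorted data.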
Frequency Distribution
• Often, particularly for large data sets, it is
advantageous to summarize data using a
frequency distribution.
• The entire range (Range = Max – Min) of the
data is divided into a few disjoint classes each of
which is an interval
• A frequency distribution gives the list of the
classes along with the number of observations in
each class (called the frequency of the class)
Example of Frequency Distribution

Duration of Pregnancy (days)    Cases (frequency)
256 – 258                       76
259 – 265                       121
266 – 272                       334
273 – 279                       348
280 – 286                       205
287 – 289                       10
Total                           1094
From: Bhat & Khustagi: Singapore Med J 2006; 47(12)
Relative Frequency and Frequency Density
• Relative Frequency of a class is the
frequency of the class divided by the total
number of observations
• Frequency Density of a class is the
Relative frequency of a class divided by
the class width.
Example
Duration of Pregnancy (days)    Frequency    Relative Frequency    Frequency Density
255 – 258.5                     76           0.07                  0.028
258.5 – 265.5                   121          0.11                  0.044
265.5 – 272.5                   334          0.31                  0.122
272.5 – 279.5                   348          0.32                  0.127
279.5 – 286.5                   205          0.19                  0.075
286.5 – 289                     10           0.01                  0.004
Total                           1094
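
A minimal sketch that recomputes the relative frequency and frequency density columns from the class limits and frequencies in the table, following the definitions given earlier (relative frequency divided by class width). The function name is illustrative.

```python
# (lower limit, upper limit, frequency) for each class in the table
classes = [(255, 258.5, 76), (258.5, 265.5, 121), (265.5, 272.5, 334),
           (272.5, 279.5, 348), (279.5, 286.5, 205), (286.5, 289, 10)]

def summarize(classes):
    """Return rows of (lower, upper, frequency, relative frequency, density)."""
    total = sum(freq for _, _, freq in classes)
    rows = []
    for lower, upper, freq in classes:
        rel = freq / total
        rows.append((lower, upper, freq, rel, rel / (upper - lower)))
    return rows

for lower, upper, freq, rel, density in summarize(classes):
    print(f"{lower:>6} - {upper:<6} {freq:>5} {rel:6.2f} {density:8.4f}")
```

The relative frequencies reproduce the table. With the class limits as printed, the widths are 3.5, 7, 7, 7, 7 and 2.5, and this sketch gives densities of about 0.020, 0.016, 0.044, 0.045, 0.027 and 0.004. The tabulated Frequency Density column instead matches each relative frequency divided by 2.5, so those figures are worth re-checking against the definition.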
Histogram
[Figure: histogram of the pregnancy duration data. Horizontal axis: Duration of Pregnancy, with classes 255 – 258.5, 258.5 – 265.5, 265.5 – 272.5, 272.5 – 279.5, 279.5 – 286.5 and 286.5 – 289. Vertical axis: Frequency Density, scale 0.000 to 0.140]
Histogram: How many classes?

• For construction of the histogram it is important to decide
on the number of classes or equivalently the class width
• The shape of the histogram heavily depends on the
choice of the number of classes / class width.
• Two popular approaches are:
a) Sturges’ rule
b) Freedman – Diaconis rule
• Both these rules usually (but not always) give similar
results if the number of observations is less than 200.
• The Freedman – Diaconis rule is generally the preferred rule for
determining the class width of a histogram.
Sturges’ Rule
• The number of classes is $k = [1 + \log_2 n] + 1$, where
$[1 + \log_2 n]$ is the greatest integer less than or
equal to $1 + \log_2 n$
• In other words, choose k such that
$2^{k-2} \le n < 2^{k-1}$
• E.g. n = 35 => k = 7
n = 83 => k = 8
• The class width (h) is computed as the Range
divided by the number of classes:
h = (Max – Min) / k
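
A short sketch of Sturges' rule in Python, implementing the formula for k exactly as written (greatest integer in 1 + log2 n, plus one). The function names and the inputs are illustrative.

```python
import math

def sturges_classes(n):
    """Number of classes k = [1 + log2 n] + 1, with [.] the greatest integer function."""
    return math.floor(1 + math.log2(n)) + 1

def sturges_width(minimum, maximum, n):
    """Class width h = (Max - Min) / k."""
    return (maximum - minimum) / sturges_classes(n)

print(sturges_classes(55))     # 7
print(sturges_classes(1094))   # 12
print(sturges_width(0, 70, 55))  # 10.0
```

Because k grows only with the logarithm of n, the rule suggests relatively few classes even for large data sets, and it ignores the spread of the data altogether.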
Freedman-Diaconis Rule
• The class width is given by $h = \dfrac{2\,\mathrm{IQR}}{n^{1/3}}$
• The number of classes is $k = \left[\dfrac{\mathrm{Range}}{h}\right] + 1$,
where $[\,\cdot\,]$ is the greatest integer function.
Example
• For the pregnancy duration data set we have n =
1094, Range = 289-255 = 34
Q1 = 267.1, Q3=278.3, IQR = 11.2
• Freedman – Diaconis rule gives the class width
to be 2.18
• The number of class intervals is therefore 16.
• The class intervals are 255 – 257.18, 257.18 –
259.36, … , 285.52 – 287.7 and 287.7 – 289
(note the last interval has shorter length)
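
The calculation in this example can be reproduced with a few lines of Python. The function name is illustrative; the inputs are the values of n, IQR and Range quoted in the example.

```python
import math

def freedman_diaconis(n, iqr, data_range):
    """Return (class width h, number of classes k) under the Freedman-Diaconis rule."""
    h = 2 * iqr / n ** (1 / 3)
    k = math.floor(data_range / h) + 1
    return h, k

h, k = freedman_diaconis(n=1094, iqr=11.2, data_range=34)
print(round(h, 2), k)   # 2.17 16
```

The unrounded width is about 2.17, which agrees with the quoted 2.18 up to rounding, and either value leads to 16 classes.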
Example: Stock Returns

• 55 daily returns
• Five number summary:
Min = -8.213, Q1 = -1.413, Median = -0.1364
Q3 = 0.817, Max = 4.071
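
As an illustration, the IQR-based outlier rules from earlier in the session can be applied to this summary. Only the five reported numbers are used, so the sketch can classify the minimum and maximum but not the other 53 returns; the function name is hypothetical.

```python
def outlier_status(x, q1, q3):
    """Classify x as 'hard', 'soft' or 'none' using the 3 IQR and 1.5 IQR rules."""
    iqr = q3 - q1
    if x > q3 + 3 * iqr or x < q1 - 3 * iqr:
        return "hard"
    if x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr:
        return "soft"
    return "none"

q1, q3 = -1.413, 0.817        # from the five number summary; IQR = 2.23
minimum, maximum = -8.213, 4.071

print(outlier_status(minimum, q1, q3))  # hard
print(outlier_status(maximum, q1, q3))  # none
```

The lower hard fence is Q1 – 3 IQR = -8.103, so the minimum return of -8.213 is a confirmed outlier, while the maximum of 4.071 lies inside the upper soft fence of 4.162.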
Histogram using Sturges’ rule

[Figure: Histogram of tm, with classes chosen by Sturges’ rule. Horizontal axis: tm, scale -10 to 5. Vertical axis: Frequency, scale 0 to 20]
Histogram using Freedman-Diaconis rule

[Figure: Histogram of tm, with classes chosen by the Freedman-Diaconis rule. Horizontal axis: tm, scale -8 to 4. Vertical axis: Frequency, scale 0 to 14]