
Prof. Arnab K Laha - Analysis of Data (PGP-X)

Analysis of Data

Session – I


Summarizing Data
• Raw data is often voluminous and difficult
to handle
• Decision makers want a few numbers to
summarize the entire data
• Summarization leads to loss of information
but can help focus on key aspects of the
dataset


Five Number Summary of Data


• (Min, Q1, Median, Q3, Max) is called the five number
summary of data
• 25% of the observations are below Q1 and 75% of the
observations are above Q1.
• Q1 is called the first quartile
• 50% of the observations are below Median and 50% are
above Median
• 75% of the observations are below Q3 and 25% are
above Q3.
• Q3 is called the third quartile.
• Each of the segments Min-Q1, Q1-Med, Med-Q3 and
Q3-Max contains 25% of the data.
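
As a minimal sketch of the five number summary, the snippet below uses Python's standard `statistics.quantiles` with the "inclusive" method. This is only one of several quartile conventions and is an assumption here, not necessarily the convention used in the course; the sample values are hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (Min, Q1, Median, Q3, Max) for a list of numbers."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    # The "inclusive" method interpolates between order statistics;
    # other conventions can give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, median, q3, max(data))

# Hypothetical sample, not taken from the slides
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(sample)
print(summary)
```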

Box Plot
[Figure: Boxplot of C1, with Min, Q1, Median, Q3 and Max marked along the vertical axis (scale 0 to 160)]

Unusual Observations (Outliers)


• It is sometimes seen that one or a few
observations in a dataset are remarkably
different from the rest
• These observations are called outliers
• Outliers can substantially impact the
statistical analysis and hence need to be
considered separately
• Outliers are not necessarily ‘bad’
observations; but they are different

Identifying Outliers
• The interquartile range (IQR) is defined as IQR =
Q3 – Q1
• An observation (x) is a ‘soft’ or ‘possible’ outlier if
x > Q3 + 1.5 IQR or x < Q1 – 1.5 IQR
• An observation (x) is a ‘hard’ or ‘confirmed’
outlier if x > Q3 + 3 IQR or x < Q1 – 3 IQR
• Note: All ‘hard’ outliers are also ‘soft’ outliers but
not vice versa.
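
The two rules above can be sketched as follows. The quartile convention (`statistics.quantiles` with the "inclusive" method) and the sample data are assumptions made for illustration only.

```python
import statistics

def classify_outliers(data):
    """Return (soft, hard) lists of outliers using the 1.5 IQR and 3 IQR rules."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Soft (possible) outliers lie beyond Q1 - 1.5 IQR or Q3 + 1.5 IQR
    soft = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
    # Hard (confirmed) outliers lie beyond Q1 - 3 IQR or Q3 + 3 IQR
    hard = [x for x in data if x < q1 - 3 * iqr or x > q3 + 3 * iqr]
    return soft, hard

# Hypothetical sample, not taken from the slides
sample = [10, 12, 13, 14, 15, 16, 17, 18, 30, 60]
soft, hard = classify_outliers(sample)
print("soft:", soft)
print("hard:", hard)
```

Every value in `hard` also appears in `soft`, which mirrors the note that all hard outliers are soft outliers but not the other way round.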


Dealing with outliers (if present)


• Accommodative Approach: Use methods which
are resistant to the presence of outliers (Robust
methods)
• Example: 5% Trimmed Mean
• 5% Trimmed Mean is computed as follows:
Arrange the data in increasing order and then
delete the lower 5% of the observations and also
the upper 5% of the observations. Compute the
simple average of the remaining observations.
• Deletion Approach: Delete the outliers and work
with the remaining data set.
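
A minimal sketch of the trimmed mean described above. The slide does not say what to do when 5% of n is not a whole number; rounding down is an assumption made here, and the sample data are hypothetical.

```python
import math

def trimmed_mean(data, proportion=0.05):
    """Mean after deleting the lowest and highest `proportion` of observations."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(data)
    n = len(ordered)
    # Assumption: round down when proportion * n is not an integer
    k = math.floor(proportion * n)
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

# Hypothetical sample: 19 ordinary values and one extreme value
sample = list(range(1, 20)) + [1000]
plain_mean = sum(sample) / len(sample)
robust_mean = trimmed_mean(sample)
print(plain_mean, robust_mean)
```

In this sample the single extreme value pulls the plain mean far above the bulk of the data, while the 5% trimmed mean stays close to the centre of the remaining observations.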


Two number summary of data


• Often data sets are summarized by giving
only two numbers:
- a measure of central tendency and
- a measure of spread (around the
measure of central tendency)


Mean and Standard Deviation

• Arithmetic Mean: $\bar{x} = \dfrac{x_1 + \dots + x_n}{n}$

• Standard Deviation: $s = \sqrt{\dfrac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n}} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}$
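
The sketch below computes the mean and the standard deviation with divisor n, as on this slide, and checks that the two forms of the SD formula agree. The sample data are hypothetical.

```python
import math

def mean_and_sd(data):
    """Return (mean, SD from deviations, SD from the shortcut formula)."""
    n = len(data)
    mean = sum(data) / n
    # Definition: root of the average squared deviation (divisor n, not n - 1)
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    # Shortcut: root of (average of squares minus square of the mean)
    sd_shortcut = math.sqrt(sum(x * x for x in data) / n - mean ** 2)
    return mean, sd, sd_shortcut

# Hypothetical sample, not taken from the slides
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mean, sd, sd_shortcut = mean_and_sd(sample)
print(mean, sd, sd_shortcut)
```

Many software packages divide by n - 1 by default, so their SD will differ slightly from the divisor-n value computed here.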


Mean and Standard Deviation


• Note that Standard Deviation (SD) = 0 only
when all the observations are equal.
• The higher the SD the higher is the spread
around the mean value
• Lower SD indicates ‘better reliability’ of the
mean value in representing the dataset.


Chebyshev’s Inequality
• For any data set and any t > 0,
Chebyshev’s inequality asserts that the
proportion of observations outside the interval
(mean – t SD, mean + t SD) is at most $t^{-2} = 1/t^2$
• Using Chebyshev’s inequality we have the
proportion of observations outside
i) (mean – 2 SD, mean + 2 SD) is at most 0.25
ii) (mean – 3 SD, mean + 3 SD) is at most 0.11
iii) (mean – 4 SD, mean + 4 SD) is at most 0.06
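
As a rough check of the bounds listed above, the sketch below evaluates 1/t² and compares it with the observed proportion of points outside the interval for a small hypothetical sample. The SD uses divisor n and is assumed to be positive.

```python
import math

def chebyshev_bound(t):
    """Upper bound on the proportion outside (mean - t SD, mean + t SD)."""
    return 1.0 / t ** 2

def proportion_outside(data, t):
    """Observed proportion of points outside the open interval (assumes SD > 0)."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    outside = [x for x in data if abs(x - mean) >= t * sd]
    return len(outside) / n

# Hypothetical sample, not taken from the slides
sample = [2, 4, 4, 4, 5, 5, 7, 9]
for t in (2, 3, 4):
    print(t, round(chebyshev_bound(t), 2), proportion_outside(sample, t))
```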

Impact of Outliers
• Both the mean and the SD are quite
sensitive to the presence of outliers
• If mean and SD are proposed to be used
for summarizing a data set it is better to
delete the outliers first and then proceed to
calculate the mean and SD


Median and MAD


• An alternative to using mean and SD as a summary
of the data is to use Median and MAD
• MAD is the acronym for Median Absolute Deviation
about Median
• If med is the median of the dataset x1,…,xn, MAD is
the median of the set of numbers {|x1-med|,
|x2-med|,…,|xn-med|}
• Usually 1.4826 MAD is used as a measure of
spread
• The Median and MAD are both far less sensitive to
presence of outliers than mean and SD. No deletion
of data is required if these are used.
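
A minimal sketch of the Median and MAD computation described above, using only the standard library. The sample data are hypothetical and include two extreme values.

```python
import statistics

def median_and_mad(data):
    """Return (median, MAD), where MAD is the median of |x_i - median|."""
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    return med, mad

# Hypothetical sample, not taken from the slides
sample = [10, 12, 13, 14, 15, 16, 17, 18, 30, 60]
med, mad = median_and_mad(sample)
spread = 1.4826 * mad  # scaled MAD used as the measure of spread
print(med, mad, spread)
```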

Empirical Cumulative Distribution Function


• Let $\{x_1, \dots, x_n\}$ be the given data set.
The Empirical Cumulative Distribution Function (ECDF) is
defined as $F_n(t) = \dfrac{\#\{\text{observations} \le t\}}{n}$
• $F_n(t) - F_n(s)$ = proportion of observations greater than s but
less than or equal to t.
• The smallest number $Q_1$ satisfying $F_n(Q_1) \ge 0.25$ is called the
first quartile
• The smallest number $\tilde{m}$ satisfying $F_n(\tilde{m}) \ge 0.5$ is called the median
• The smallest number $Q_3$ satisfying $F_n(Q_3) \ge 0.75$ is called the
third quartile
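
The ECDF-based definitions above can be sketched as follows. The function names and the sample data are hypothetical; the search relies on the fact that the ECDF only jumps at observed values, so the smallest qualifying number is always one of the observations.

```python
def ecdf(data, t):
    """F_n(t): proportion of observations less than or equal to t."""
    return sum(1 for x in data if x <= t) / len(data)

def smallest_with_ecdf_at_least(data, p):
    """Smallest number q with F_n(q) >= p, for 0 < p <= 1."""
    # F_n is a step function that rises only at data points,
    # so it is enough to scan the sorted observations.
    for x in sorted(data):
        if ecdf(data, x) >= p:
            return x

# Hypothetical sample, not taken from the slides
sample = [3, 1, 4, 1, 5, 9, 2, 6]
q1 = smallest_with_ecdf_at_least(sample, 0.25)
median = smallest_with_ecdf_at_least(sample, 0.5)
q3 = smallest_with_ecdf_at_least(sample, 0.75)
print(q1, median, q3)
```

Quartiles defined this way are always actual data values, so they can differ from the interpolated quartiles that some software reports.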

Frequency Distribution
• Often, particularly for large data sets, it is
advantageous to summarize data using a
frequency distribution.
• The entire range (Range = Max – Min) of the
data is divided into a few disjoint classes each of
which is an interval
• A frequency distribution gives the list of the
classes along with the number of observations in
each class (called the frequency of the class)


Example of Frequency Distribution


Duration of Pregnancy (days)    Cases (frequency)
256 – 258                       76
259 – 265                       121
266 – 272                       334
273 – 279                       348
280 – 286                       205
287 – 289                       10
Total                           1094

From: Bhat & Khustagi: Singapore Med J 2006; 47(12)


Relative Frequency and Frequency Density

• Relative Frequency of a class is the
frequency of the class divided by the total
number of observations
• Frequency Density of a class is the
relative frequency of the class divided by
the class width.


Example
Duration of Pregnancy (days)   Frequency   Relative Frequency   Frequency Density
255 – 258.5                    76          0.07                 0.020
258.5 – 265.5                  121         0.11                 0.016
265.5 – 272.5                  334         0.31                 0.044
272.5 – 279.5                  348         0.32                 0.045
279.5 – 286.5                  205         0.19                 0.027
286.5 – 289                    10          0.01                 0.004
Total                          1094
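
The definitions of relative frequency and frequency density can be applied to the class boundaries and frequencies of this example as sketched below. The function name is hypothetical; the numbers are the ones in the table.

```python
# Class boundaries and frequencies from the pregnancy-duration example
boundaries = [255, 258.5, 265.5, 272.5, 279.5, 286.5, 289]
frequencies = [76, 121, 334, 348, 205, 10]

def relative_frequency_and_density(boundaries, frequencies):
    """Return one (lower, upper, frequency, rel_freq, density) tuple per class."""
    total = sum(frequencies)
    rows = []
    for lower, upper, freq in zip(boundaries[:-1], boundaries[1:], frequencies):
        rel_freq = freq / total               # frequency / total observations
        density = rel_freq / (upper - lower)  # relative frequency / class width
        rows.append((lower, upper, freq, rel_freq, density))
    return rows

rows = relative_frequency_and_density(boundaries, frequencies)
for lower, upper, freq, rel_freq, density in rows:
    print(f"{lower} - {upper}: {freq:5d} {rel_freq:.2f} {density:.3f}")
```

Because each density is a relative frequency divided by a width, density times width summed over all classes equals 1, which is why a histogram drawn on the density scale has total area 1.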


Histogram
[Figure: Histogram of the pregnancy-duration data; x-axis: Duration of Pregnancy (classes 255 – 258.5 through 286.5 – 289), y-axis: Frequency Density]


Histogram: How many classes?


• For construction of the histogram it is important to decide
on the number of classes or equivalently the class width
• The shape of the histogram heavily depends on the
choice of the number of classes / class width.
• Two popular approaches are:
a) Sturges’ rule
b) Freedman – Diaconis rule
• Both these rules usually (but not always) give similar
results if the number of observations is less than 200.
• The Freedman – Diaconis rule is generally the preferred rule for
determining the class width of a histogram.


Sturges’ Rule
• The number of classes (k) = $[1 + \log_2 n] + 1$ where
$[1 + \log_2 n]$ is the greatest integer less than or
equal to $1 + \log_2 n$
• In other words, choose k such that
$2^{k-2} \le n < 2^{k-1}$
• E.g. n = 35 => k = 7
n = 83 => k = 8
• The class width (h) is computed as Range
divided by the number of classes.
• h = (Max – Min) / k
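
A minimal sketch of the rule as written in the first bullet, k = [1 + log2 n] + 1, with class width h = (Max – Min) / k. The function name is hypothetical. Integer arithmetic is used for the logarithm to avoid floating-point rounding at exact powers of two.

```python
def sturges(n, data_min, data_max):
    """Return (k, h): number of classes and class width for n observations."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # For a positive integer n, floor(log2 n) equals n.bit_length() - 1,
    # so floor(1 + log2 n) equals n.bit_length().
    k = n.bit_length() + 1
    h = (data_max - data_min) / k
    return k, h

# Hypothetical range (0 to 70) used only to show the call
k, h = sturges(35, 0.0, 70.0)
print(k, h)
```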

Freedman-Diaconis Rule
• The class width is given by $h = \dfrac{2\,\mathrm{IQR}}{n^{1/3}}$
• The number of classes $k = \left[\dfrac{\text{Range}}{h}\right] + 1$
where $[\,\cdot\,]$ is the greatest integer function.


Example
• For the pregnancy duration data set we have n =
1094, Range = 289-255 = 34
Q1 = 267.1, Q3=278.3, IQR = 11.2
• Freedman – Diaconis rule gives the class width
to be 2.18
• The number of class intervals is therefore 16.
• The class intervals are 255 – 257.18, 257.18 –
259.36, … , 285.52 – 287.7 and 287.7 – 289
(note the last interval has shorter length)
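
The Freedman – Diaconis calculation for this example can be reproduced with the short sketch below; the function name is hypothetical, and the inputs are the n, IQR, Min and Max quoted above.

```python
import math

def freedman_diaconis(n, iqr, data_min, data_max):
    """Return (h, k): class width 2 IQR / n^(1/3) and number of classes."""
    h = 2 * iqr / n ** (1 / 3)
    k = math.floor((data_max - data_min) / h) + 1
    return h, k

# Values from the pregnancy-duration example
h, k = freedman_diaconis(n=1094, iqr=11.2, data_min=255, data_max=289)
print(round(h, 2), k)
```

Evaluated without intermediate rounding, the width comes out at roughly 2.17, in line with the 2.18 quoted above, and the number of classes is 16 either way.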

Example: Stock Returns


• 55 daily returns
• Five number summary:
Min = -8.213, Q1 = -1.413, Median = -0.1364
Q3 = 0.817, Max = 4.071
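
As an illustration, the IQR-based outlier rules from the earlier slide can be applied directly to this five number summary. The sketch below only checks the two extremes, since the individual returns are not listed; the function name is hypothetical.

```python
# Five number summary of the 55 daily returns
minimum, q1, q3, maximum = -8.213, -1.413, 0.817, 4.071
iqr = q3 - q1

def classify(x):
    """Label x as 'hard', 'soft' or 'none' using the 3 IQR and 1.5 IQR fences."""
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "hard"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "soft"
    return "none"

print("Min:", classify(minimum))
print("Max:", classify(maximum))
```

With these figures the lower hard fence is about -8.10, so the minimum of -8.213 falls just beyond it, while the maximum of 4.071 stays inside the upper soft fence of about 4.16.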


Histogram using Sturges’ rule


[Figure: Histogram of tm using Sturges’ rule; x-axis: tm, y-axis: Frequency]


Histogram using Freedman-Diaconis rule


[Figure: Histogram of tm using the Freedman-Diaconis rule; x-axis: tm, y-axis: Frequency]