
Preliminaries

Empirical Data Distributions

Summary Measures

STAT101 Introductory Statistics


Data Distributions & Summary Measures

Sutaip L. C. Saw, Ph.D.

Last Revision: 29 July 2014

Miscellany

Preliminaries
Topics:
What is Statistics?
Typical Descriptive Statistics Problems
Note for the Student
It is recommended that students read this section in its
entirety before coming to class for the lecture to ensure that
they have the required background information. (This also applies
to the Preliminaries section of subsequent lecture slides.)
During the lecture I will mainly focus on sections which have
a direct bearing on the lecture topic under discussion.
Material in the last section serves to complement what we
cover during the lecture.

Statistics Overview
Topics:
What is Statistics?
Applications of Statistics
Learning Objectives:
Learn the nature of Statistics and study its relevance to
Business Research Analysis and Decision Making.
Learn about the different subdisciplines of Statistics concerned
with extracting descriptive information from data, assessing
uncertainty and making statistical inferences & predictions.

What is Statistics?
Statistics is the discipline which makes use of mathematical and
computational techniques to, among other things,
collect data using surveys, observational studies or designed
experiments;
describe, summarize and present the collected data;
assess and quantify uncertainty;
draw inferences about population characteristics based on
sample information;
assess the statistical significance of observed differences or
presence of associations;
construct empirical models to obtain estimates, test
hypotheses or make predictions;
make projections using cross-sectional or time series data.

Applications of Statistics
Some Applications:
Marketing Research
E.g. Assessing Brand Preferences for a Given Product
Finance
E.g. Measuring the Credit Risk of a Counterparty
Insurance
E.g. Measuring Risk of an Insurance Portfolio
Reliability Engineering
E.g. Assessing the Reliability of an Aircraft Engine
Medical Research
E.g. Determining the Efficacy of a New Drug
Q: Do you think Statistics is worth learning? If so, why?

Typical Descriptive Statistics Problems


Organizing Data
Forty students in an Introductory Statistics course were asked to
state their political affiliations (i.e., whether they favoured the
Democratic (D), Republican (R) or Other (O) party). The
following results were obtained.

D  R  O  R  R  R  R  R
D  O  R  D  O  O  R  D
D  R  O  D  R  R  O  R
D  O  D  D  D  R  O  D
O  R  D  R  R  R  R  D

What type of data are we dealing with?
What can we say about the distribution of political affiliations?
Source: Adapted from Weiss (2012, p. 40).

Summarizing Data
Arterial blood pressures (in mm of mercury) for a sample of 16
children of diabetic mothers are given below.
81.6   84.1   87.6   82.8
82.0   88.9   86.7   96.4
84.6  104.9   90.8   94.0
69.4   78.9   75.2   91.0

What does the data tell you about the average blood pressure
of a child whose mother is diabetic?
What can we conclude about the variability of the blood
pressure measurements?
Source: Adapted from Weiss (2012, p. 95)

Empirical Data Distributions


Topics:
Tabulating Data Distributions
Graphing Data Distributions
Learning Objectives:
Learn tabular and graphical techniques for organizing and
presenting data.
Learn how to choose among the available techniques for a
given problem in descriptive statistical analysis.
Note:
Much of the material in this and the next section is of a review nature.
We'll quickly review such material but spend more time on material
students are less familiar with.

Tabulating Data Distributions


Tabulating Categorical Data
The first column of the table contains the possible categories
and the second column the corresponding absolute frequencies
(optionally, relative frequencies may also be given in another
column).
Example
Consider the political affiliation data given in the first illustrative
problem. Following is the frequency table for the data.

Affiliation   Abs Freq   Rel Freq
Democratic       13       0.325
Republican       18       0.450
Other             9       0.225
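
The tabulation above can be reproduced with a short script. The following is a minimal Python sketch (Python is an assumption here, not the course's own tooling); the variable names are illustrative, and the 40 responses are typed in from the first illustrative problem.

```python
from collections import Counter

# Political affiliations of the 40 students (D, R or O), typed in from the
# first illustrative problem. Order does not matter for a frequency table.
responses = list(
    "DDDDO" "ROROR" "ORODD" "RDDDR"
    "RORDR" "RORRR" "RROOR" "RDRDD"
)

n = len(responses)
abs_freq = Counter(responses)                       # absolute frequencies
rel_freq = {c: abs_freq[c] / n for c in abs_freq}   # relative frequencies

for category in ("D", "R", "O"):
    print(category, abs_freq[category], f"{rel_freq[category]:.3f}")
```

The relative frequencies are simply the absolute frequencies divided by the number of observations, so they sum to 1.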

Tabulating Numerical Data


In an absolute frequency table, the number of observations in
each class (i.e., pre-defined sub-interval) is presented.

Class           Frequency
$(l_1, u_1]$    $n_1$
$(l_2, u_2]$    $n_2$
$(l_3, u_3]$    $n_3$
$\vdots$        $\vdots$
$(l_k, u_k]$    $n_k$

Abs Frequency Table

Class       Frequency
(10, 20]        3
(20, 30]        7
(30, 40]        4
(40, 50]        4
(50, 60]        2

Note: (10, 20] refers to values between 10 (exclusive) and 20 (inclusive) etc.

Example [Frequency Tables]


The absolute frequency table in the previous slide was obtained
from the following raw data
12 13 17 21 24 24 26 27 27 30
32 35 37 38 41 43 44 46 53 58
The corresponding relative and cumulative frequency tables are:

Class       Rel Freq        Class       Cum Freq
(10, 20]      0.15          (10, 20]      0.15
(20, 30]      0.35          (20, 30]      0.50
(30, 40]      0.20          (30, 40]      0.70
(40, 50]      0.20          (40, 50]      0.90
(50, 60]      0.10          (50, 60]      1.00

Q: What can we deduce from each table?
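
The three tables can be derived from the raw data mechanically. Below is a minimal Python sketch, assuming equal-width classes that are open on the left and closed on the right as in the note on the previous slide; the names `lower`, `width` and `k` are illustrative.

```python
import math
from itertools import accumulate

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

lower, width, k = 10, 10, 5   # classes (10,20], (20,30], ..., (50,60]

# A value x belongs to class j (counting from 0) when
# lower + j*width < x <= lower + (j+1)*width, so right end-points are included.
abs_freq = [0] * k
for x in data:
    j = math.ceil((x - lower) / width) - 1
    abs_freq[j] += 1

n = len(data)
rel_freq = [f / n for f in abs_freq]              # relative frequencies
cum_freq = [c / n for c in accumulate(abs_freq)]  # cumulative relative frequencies

print(abs_freq)
print(rel_freq)
print(cum_freq)
```

Note that the observation 30 is counted in (20, 30], not (30, 40], because each class includes its upper limit.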

Graphing Data Distributions


Graphing Distributions for Categorical Data
Pie Chart
A circle is divided into pie slices. The area of each slice is
proportional to the relative frequency of each category.
Example
For the political affiliation data, we have the following pie chart.

Pie Slice     Angle
Democratic    117 deg
Republican    162 deg
Other          81 deg

Q: How can we improve on this graphical display?

Bar Chart
Each category is represented by a vertical (or horizontal) bar.
The height (or width) of each bar is equal or proportional to
the absolute or relative frequency of a category.
Example
For the political affiliation data, we have the following bar chart.

Q: Which is preferred? A pie chart or bar chart?

Side-by-Side Bar Chart


This chart may be used to present bivariate categorical data.
Example [Side-by-Side Bar Chart]
Consider the following distribution of student grades by gender.
          A     B     C     D     E
Female    3     9     7     1     1
Male      4     6     5     3     1

In relative terms, we have the following table.

          A     B     C     D     E
Female  0.14  0.43  0.33  0.05  0.05
Male    0.21  0.32  0.26  0.16  0.05

Example [Side-by-Side Bar Chart] (contd)


Information in the first (second) table may be displayed by the
chart in the left (right) panel of the following figure.

Q: What conclusion(s) can be drawn from the above figure?


Q: Does it matter which chart you base your conclusions on?
Source: Adapted from Chow et al (2007, p. 7).

Graphing Distributions for Numerical Data


Absolute Frequency Histogram
Displays information contained in an absolute frequency table
using vertical bars with no gaps between bars.
The height of each bar gives the number of observations that
lie in the interval determined by the base of the bar.
Example
Class       Frequency
(10, 20]        3
(20, 30]        7
(30, 40]        4
(40, 50]        4
(50, 60]        2

Relative Frequency Histogram


Displays information in a relative frequency table by vertical
bars with no gaps between bars.
The area of each bar gives the fraction of observations that lie
in the interval determined by the base of the bar.
Example
Class       Frequency
(10, 20]      0.15
(20, 30]      0.35
(30, 40]      0.20
(40, 50]      0.20
(50, 60]      0.10

Q: What can you conclude from the above figure?

Digression: Identifying Distribution Shapes

Cumulative Frequency Polygon


Displays a plot of cumulative frequency against upper class limit in
an expanded cumulative frequency table (as illustrated below).
Example

Class       Cum Freq (%)
(0, 10]          0
(10, 20]        15
(20, 30]        50
(30, 40]        70
(40, 50]        90
(50, 60]       100

Q: What useful statistic(s) can we deduce from such plots?

Digression: Quartiles
Let $x_1, x_2, \ldots, x_n$ denote a set of $n$ observations for our study.
Usually, the $x_i$'s are unordered.
For some applications, we need to work with ordered values in the
dataset, i.e., with $x_{(i)}$'s such that
$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}.$$
Define
$$Q_2 = \text{second quartile of the } x_i\text{'s} =
\begin{cases}
\frac{1}{2}\left(x_{(k)} + x_{(k+1)}\right), & \text{if } n = 2k, \\
x_{(k+1)}, & \text{if } n = 2k + 1.
\end{cases}$$
Note that $Q_2$ is also referred to as the median of the $x_i$'s.

The first quartile, denoted Q1, may be defined as the median of the xi
values less than or equal to Q2.
The third quartile, denoted Q3, may be defined as the median of the
xi values greater than or equal to Q2.
Example
For the following set of 5 observations
101.96   109.76   99.63   99.76   100.22
the corresponding ordered sample is
99.63   99.76   100.22   101.96   109.76.
Here,
Q1 = 99.76, Q2 = 100.22 and Q3 = 101.96.
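
The quartile definition above translates directly into code. The following is a minimal Python sketch; the function name `quartiles` is illustrative. Statistical packages often use other quartile conventions, so their output may differ slightly from the values obtained with this definition.

```python
from statistics import median

def quartiles(xs):
    """Return (Q1, Q2, Q3) using the definition given on these slides."""
    q2 = median(xs)
    q1 = median([x for x in xs if x <= q2])   # values less than or equal to Q2
    q3 = median([x for x in xs if x >= q2])   # values greater than or equal to Q2
    return q1, q2, q3

print(quartiles([101.96, 109.76, 99.63, 99.76, 100.22]))
```

There is no need to sort the observations first, since `statistics.median` orders the values internally.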

Stem and Leaf Diagram


A stem and leaf diagram (like the one shown below) is a graphical
display that shows the distribution of a set of numerical values.
From it, one can
sometimes recover the original data;
easily infer empirical percentiles;
obtain measures of central tendency and dispersion.
Example
1 | 67788899
2 | 0012257
3 | 28
4 | 2

Ordered data: 16, 17, . . . , 38, 42.


Distribution is right-skewed.
Q1 = 18, Q2 = 20 and Q3 = 23.5
Min = 16 and Max = 42.
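
A display like the one above can be built by splitting each value into a stem and a leaf. Here is a minimal Python sketch, assuming non-negative integer data with the tens digit as stem and the units digit as leaf; the function name `stem_and_leaf` is illustrative, and stems without any leaves are simply omitted.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its string of leaves (units digits)."""
    rows = defaultdict(str)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 27 -> stem 2, leaf 7
        rows[stem] += str(leaf)
    return dict(rows)

# Ordered data read off the display above.
data = [16, 17, 17, 18, 18, 18, 19, 19,
        20, 20, 21, 22, 22, 25, 27, 32, 38, 42]

for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {leaves}")
```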

Example [Stem and Leaf Display]


For the Cord Strength dataset

25  25  36  31  26  36  29  37  37  20
34  27  21  35  30  41  33  21  26  26
19  25  14  32  30  29  31  26  22  24
34  33  28  26  43  30  40  32  32  31
25  26  27  34  33  27  33  29  30  31

we obtain

1 | 4
1 | 9
2 | 01124
2 | 55556666667778999
3 | 000011112223333444
3 | 56677
4 | 013

Boxplots
We introduce the boxplot via a couple of examples.
Example [Boxplot]
Weekly television viewing times (in hours) of a sample of 20 people
are given below.
25  41  27  32  43
66  35  31  15   5
34  26  32  38  16
30  38  30  20  21

To obtain a boxplot, begin by finding the quartiles.

 5  15  16  20  21
25  26  27  30  30          Q1 = 23
31  32  32  34  35          Q2 = 30.5
38  38  41  43  66          Q3 = 36.5

Example [Boxplot] (contd)


Then, determine the following limits
$$\text{Lower Limit} = Q_1 - 1.5 \times \text{IQR} = 2.75,$$
$$\text{Upper Limit} = Q_3 + 1.5 \times \text{IQR} = 56.75,$$
where IQR = 36.5 - 23 = 13.5. Finally, obtain 5 and 43 as the
adjacent values and note that 66 is a potential outlier since it falls
outside the interval (2.75, 56.75).

Note: Adjacent values are the most extreme values that lie within the lower and
upper limits; they are the most extreme observations that are not potential
outliers (Weiss, 2012, p. 120).
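
The boxplot ingredients for the television viewing data can be checked numerically. This is a minimal Python sketch that uses the quartile definition from the earlier digression; the variable names are illustrative.

```python
from statistics import median

# Weekly television viewing times (in hours) for the sample of 20 people.
times = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
         34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

q2 = median(times)
q1 = median([x for x in times if x <= q2])
q3 = median([x for x in times if x >= q2])
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Adjacent values: most extreme observations still inside the limits.
inside = [x for x in times if lower_limit <= x <= upper_limit]
adjacent = (min(inside), max(inside))

# Potential outliers: observations outside the limits.
outliers = sorted(x for x in times if x < lower_limit or x > upper_limit)

print(q1, q2, q3, iqr)
print(lower_limit, upper_limit, adjacent, outliers)
```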

Example [Parallel Boxplots]


Measurements on skinfold thickness (in mm) for samples of
runners and nonrunners in the same age group are given below.
     Runners      |       Nonrunners
------------------+-------------------------
 7.3   6.7   8.7  |  24.0  19.9   7.5  18.4
 3.0   5.1   8.8  |  28.0  29.4  20.3  19.0
 7.8   3.8   6.2  |   9.3  18.1  22.8  24.2
 5.4   6.4   6.3  |   9.6  19.4  16.3  16.3
 3.7   7.5   4.6  |  12.4   5.2  12.2  15.6

Group Statistics      Runners                     Nonrunners
5 Num Summary         3.0, 4.85, 6.3, 7.4, 8.8    5.2, 12.3, 18.25, 21.55, 29.4
Limits                1.025, 11.225               -1.575, 35.425
Adjacent Values       3.0, 8.8                    5.2, 29.4
Potential Outliers    None                        None

Example [Parallel Boxplots] (contd)

Q: What conclusions can you draw from the above figure?


Source: Adapted from Weiss (2012, pp. 121-122)

Summary Measures
Topics:
Location & Spread of a Distribution
Measures of Central Tendency
Measures of Dispersion
Summary Measures for Grouped Data
Learning Objectives:
Learn how to measure the location and spread of the
distribution of raw data for a single numerical variable.
Learn how to obtain summary measures from grouped data.
Learn how to interpret and choose between the various
summary measures.
Learn the role played by robustness in the selection of a
summary measure.

Location & Spread of a Distribution

Measures of Central Tendency


Let $x_1, x_2, \ldots, x_n$ denote a set of $n$ observations with corresponding
ordered values $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$.
Some measures of central tendency are given below.
Mean
$$\text{mean} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}, \text{ say.}$$
Median
$$\text{median} =
\begin{cases}
\frac{1}{2}\left(x_{(k)} + x_{(k+1)}\right), & \text{if } n = 2k, \\
x_{(k+1)}, & \text{if } n = 2k + 1.
\end{cases}$$
Mode
mode = data value with highest frequency.

Example
Consider dataset
101.96, 109.76, 99.63, 99.76, 100.22
with corresponding ordered values
99.63, 99.76, 100.22, 101.96, 109.76.
Here, the mean is
$$\bar{x} = \frac{101.96 + 109.76 + 99.63 + 99.76 + 100.22}{5} \approx 102.27$$
and
$$\text{median} = x_{(3)} = 100.22.$$
Q: What about the mode?
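
The mean and median in this example can be verified with Python's standard `statistics` module. This is only a minimal sketch; the variable names are illustrative.

```python
from statistics import mean, median

x = [101.96, 109.76, 99.63, 99.76, 100.22]

x_bar = mean(x)    # arithmetic mean: sum of the values divided by n
med = median(x)    # middle ordered value, since n = 5 is odd

print(round(x_bar, 2), med)
```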

Advantages & Disadvantages


Feature                        Mean   Median   Mode
Always Exists?                  Y       Y       N
Always Unique?                  Y       N       N
Not Affected by Outliers?       N       Y       Y
Further Analysis Potential?     Y       N       N

Note
Use a robust (i.e., resistant) measure of central tendency
when outlying values (assuming these are valid) are present.
The trimmed mean is an example of a robust measure of
location - see Exercise 3.54 on p. 101 of Weiss (2012) for a
specific illustration.
Q: What about the mean and median?

Example [Robustness]
The mean is not robust since it is affected by outlying (extreme)
observations.
> set.seed(2012)
> x <- rnorm(50, 10, 1)
> mean(x)
[1] 10.03585
> median(x)
[1] 10.09504
Note that I've decided to stop using R for this course. You may ignore the R
code that you see in this and the next three examples.

Example [Robustness] (contd)


> x <- sort(x)
> x[50] <- 30
> mean(x)
[1] 10.37307
> median(x)
[1] 10.09504
The median is not affected by extreme observations and hence it is
a robust measure of central tendency.
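
For readers not using R, the same experiment can be sketched in Python. This is an illustrative re-run, not a reproduction of the R session: Python's random number generator differs from R's, so the simulated values (and hence the printed means and medians) will not match the figures shown above.

```python
import random
from statistics import mean, median

random.seed(2012)   # arbitrary seed; does not reproduce R's random stream

# 50 draws from a normal distribution with mean 10 and standard deviation 1.
x = sorted(random.gauss(10, 1) for _ in range(50))
mean_before, median_before = mean(x), median(x)

x[-1] = 30          # replace the largest observation by an extreme value
mean_after, median_after = mean(x), median(x)

print(mean_before, median_before)
print(mean_after, median_after)
```

The mean shifts upward because every observation enters the sum, whereas the median depends only on the two middle ordered values, which are untouched.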

Relative Magnitude of Location Measures


Example
> table(x)
x
 1  2  3  4  5  6  7
 4  7 23 32 23  7  4

> mean(x)
[1] 4
> median(x)
[1] 4
The above example illustrates the case when
mean = median = mode.

In the next example, we have


mean < median = mode.
Example
> table(x)
x
1 2 3 4 5 6 7
2 4 7 12 15 33 27
> mean(x)
[1] 5.41
> median(x)
[1] 6

It is also possible that


mean > median = mode.
Example
> table(x)
x
 1  2  3  4  5  6  7
27 33 15 12  7  4  2

> mean(x)
[1] 2.59
> median(x)
[1] 2
Q: What is the practical significance of these examples?
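
The three frequency tables in these examples can be expanded back into raw observations to confirm the stated relationships between the mean, median and mode. The following is a minimal Python sketch; the helper `expand` and the labels `example_1` to `example_3` are illustrative.

```python
from statistics import mean, median, mode

def expand(freq):
    """Turn a {value: count} frequency table into a list of raw observations."""
    return [value for value, count in freq.items() for _ in range(count)]

example_1 = {1: 4, 2: 7, 3: 23, 4: 32, 5: 23, 6: 7, 7: 4}     # mean = median = mode
example_2 = {1: 2, 2: 4, 3: 7, 4: 12, 5: 15, 6: 33, 7: 27}    # mean < median = mode
example_3 = {1: 27, 2: 33, 3: 15, 4: 12, 5: 7, 6: 4, 7: 2}    # mean > median = mode

results = {}
for name, freq in [("example_1", example_1),
                   ("example_2", example_2),
                   ("example_3", example_3)]:
    x = expand(freq)
    results[name] = (mean(x), median(x), mode(x))
    print(name, results[name])
```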

Example [Mean vs Median]


The ordered sample and stem and leaf display for some data on
arterial blood pressure are given below.
69.4   75.2   78.9   81.6
82.0   82.8   84.1   84.6
86.7   87.6   88.9   90.8
91.0   94.0   96.4  104.9

 6 | 9
 7 | 59
 8 | 22345789
 9 | 1146
10 | 5

Here,
$\bar{x} = 86.18$ and median = 85.65.
Q: Which measure do you recommend for the data at hand?

Measures of Dispersion
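Some measures of dispersion are given below.
Range
$$\text{range} = x_{(n)} - x_{(1)}$$
Interquartile Range
$$\text{IQR} = \text{Third Quartile} - \text{First Quartile}$$
Variance
$$\text{variance} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
Standard Deviation
$$\text{standard deviation} = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)}$$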

Example
Consider the (ordered) dataset
99.63, 99.76, 100.22, 101.96, 109.76.
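Here,
$$\text{range} = 109.76 - 99.63 = 10.13$$
and
$$\text{IQR} = 101.96 - 99.76 = 2.2.$$
Furthermore,
$$\text{variance} = \frac{99.63^2 + \cdots + 109.76^2 - 5 \times 102.27^2}{5 - 1} \approx 18.42$$
and
$$\text{standard deviation} \approx \sqrt{18.42} = 4.29.$$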
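
The variance can be computed either from the defining formula or from the computational shortcut used in the standard deviation formula; both give the same result up to rounding. Below is a minimal Python sketch with illustrative variable names, cross-checked against the standard library's sample variance.

```python
from math import sqrt
from statistics import mean, variance

x = [99.63, 99.76, 100.22, 101.96, 109.76]
n = len(x)
x_bar = mean(x)

# Defining formula: sum of squared deviations divided by n - 1.
var_def = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Computational shortcut: (sum of squares - n * mean^2) / (n - 1).
var_short = (sum(xi ** 2 for xi in x) - n * x_bar ** 2) / (n - 1)

sd = sqrt(var_def)

print(round(var_def, 2), round(var_short, 2), round(sd, 2))
print(round(variance(x), 2))   # library value, for comparison
```

Carrying the unrounded mean through the calculation avoids the small rounding error that arises when 102.27 is substituted by hand.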

A relative measure of dispersion is
$$\text{coefficient of variation} = \frac{\text{standard deviation}}{\text{mean}}.$$

Example
For data in the previous example,
$$\text{coefficient of variation} = \frac{4.29}{102.27} \approx 0.04.$$

Advantages & Disadvantages


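Feature                       R    V    SD   IQR   CV
Always Exists?                Y    Y    Y     Y    Y
Always Unique?                Y    N    N     N    N
Not Affected by Outliers?     N    N    N     Y    N
Absolute Measure?             Y    Y    Y     Y    N
Same Units?                   Y    N    Y     Y    N

(R = range, V = variance, SD = standard deviation, IQR = interquartile range,
CV = coefficient of variation.)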

Example [Comparing Stock Performance]


Following are annual logarithmic returns of Microsoft (MSFT) and
Hewlett-Packard (HWP) for the period spanning 1995-1999.

     |   1995    1996    1997    1998    1999
-----+----------------------------------------
MSFT | 0.3644  0.6622  0.5026  0.7648  0.5290
HWP  | 0.5014  0.1836  0.2156  0.1864  0.4921

Some summary statistics for the returns are as follows:

             |   MSFT     HWP
-------------+------------------
Mean         | 0.5646  0.3158
Std Dev      | 0.1539  0.1657
Median       | 0.5290  0.2156
IQR          | 0.1596  0.3057
Coef of Var  | 0.2727  0.5246

Q: Which of the two stocks performed better over 1995-1999?
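
The summary statistics in the table can be recomputed from the returns. The following is a minimal Python sketch; the dictionary layout and variable names are illustrative, and the IQR is omitted because it depends on the quartile convention described earlier.

```python
from statistics import mean, median, stdev

returns = {
    "MSFT": [0.3644, 0.6622, 0.5026, 0.7648, 0.5290],
    "HWP":  [0.5014, 0.1836, 0.2156, 0.1864, 0.4921],
}

summary = {}
for name, r in returns.items():
    m, s = mean(r), stdev(r)          # stdev uses the n - 1 divisor
    summary[name] = {
        "mean": m,
        "sd": s,
        "median": median(r),
        "cv": s / m,                  # coefficient of variation
    }
    print(name, {k: round(v, 4) for k, v in summary[name].items()})
```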

Mean & Variance for Grouped Data


Grouped data refers to data in a frequency distribution.
Example
  Class  |  Freq.   Percent     Cum.
---------+----------------------------
 (10,15] |     1      2.00      2.00
 (15,20] |     2      4.00      6.00
 (20,25] |     8     16.00     22.00
 (25,30] |    17     34.00     56.00
 (30,35] |    15     30.00     86.00
 (35,40] |     5     10.00     96.00
 (40,45] |     2      4.00    100.00
---------+----------------------------

Information in the first column together with any one of the remaining
three columns of the above table constitutes grouped data.

Let
$m_i$ = mid-point of $i$-th class,
$n_i$ = frequency of $i$-th class,
$k$ = number of classes,
$n$ = total frequency.
The grouped data mean is
$$\bar{x}_g = \sum_{i=1}^{k} m_i \frac{n_i}{n} = \frac{1}{n}\sum_{i=1}^{k} m_i n_i.$$
The grouped data variance is
$$s_g^2 = \sum_{i=1}^{k} m_i^2 \frac{n_i}{n} - \bar{x}_g^2 = \frac{1}{n}\sum_{i=1}^{k} m_i^2 n_i - \bar{x}_g^2.$$


Example
For the grouped data given earlier, we have

  Class  |   ni     mi     mi*ni    mi^2*ni
---------+-----------------------------------
 (10,15] |    1    12.5     12.5     156.25
 (15,20] |    2    17.5     35.0     612.50
 (20,25] |    8    22.5    180.0    4050.00
 (25,30] |   17    27.5    467.5   12856.25
 (30,35] |   15    32.5    487.5   15843.75
 (35,40] |    5    37.5    187.5    7031.25
 (40,45] |    2    42.5     85.0    3612.50
---------+-----------------------------------
   Total |   50           1455.0   44162.50

Hence,
$$\bar{x}_g = \frac{1455.0}{50} = 29.1 \quad\text{and}\quad s_g^2 = \frac{44162.50}{50} - 29.1^2 = 36.44.$$
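
The grouped data formulas can be applied programmatically once the class limits and frequencies are known. This is a minimal Python sketch with illustrative variable names; each class is represented by its mid-point, as in the definitions above.

```python
# Class limits (lower, upper] and frequencies from the grouped data table.
classes = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35), (35, 40), (40, 45)]
freqs = [1, 2, 8, 17, 15, 5, 2]

n = sum(freqs)                                  # total frequency
mids = [(lo + hi) / 2 for lo, hi in classes]    # class mid-points m_i

sum_mn = sum(m * f for m, f in zip(mids, freqs))         # sum of m_i * n_i
sum_m2n = sum(m ** 2 * f for m, f in zip(mids, freqs))   # sum of m_i^2 * n_i

mean_g = sum_mn / n
var_g = sum_m2n / n - mean_g ** 2

print(sum_mn, sum_m2n)
print(round(mean_g, 2), round(var_g, 2))
```

Note that the grouped data variance as defined here divides by n rather than n - 1.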

Miscellany

Topics:
Summation Notation
Classification of Statistical Studies
Questions for Class Discussion
Learning Objectives:
Review the notation used for summation.
Learn about different types of statistical studies.

Summation Notation
Given numerical values $x_1, \ldots, x_n$, we have:
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$
$$\sum_{i=1}^{n} (a x_i + b) = (a x_1 + b) + \cdots + (a x_n + b) = a\sum_{i=1}^{n} x_i + nb$$

Example
If the $x_i$'s are given by 1.75, 2.25, 2.25, 2.25, 1.75, 2.00, 1.50, we have
$$\sum_{i=1}^{7} x_i = 13.75 \quad\text{and}\quad \sum_{i=1}^{7} x_i^2 = 1.75^2 + \cdots + 1.50^2 = 27.5625.$$
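
The sums in this example, and the linearity rule above, can be checked numerically. Below is a minimal Python sketch; the constants `a` and `b` are arbitrary values chosen purely for illustration.

```python
x = [1.75, 2.25, 2.25, 2.25, 1.75, 2.00, 1.50]
n = len(x)

total = sum(x)                        # sum of the x_i
total_sq = sum(xi ** 2 for xi in x)   # sum of the squared x_i

a, b = 2, 3                           # arbitrary constants (illustrative)
lhs = sum(a * xi + b for xi in x)     # sum of (a*x_i + b)
rhs = a * total + n * b               # a * (sum of x_i) + n*b

print(total, total_sq)
print(lhs, rhs)
```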

Classification of Statistical Studies


Observational Study
Observed relationships and other inferences apply only to
the study subjects (or objects) under investigation.
No control of extraneous sources of variation.
Example [Vasectomies & Prostate Cancer]
A study found an association between vasectomy and prostate
cancer: elevated risk after vasectomy.
No information that the study was based on a properly chosen
sample or a properly designed experiment.
We cannot infer causation nor generalize the observed association.
Source: Adapted from Weiss (2012, p. 7).

Inferential Study
The study is based on a properly chosen sample (e.g., random
sample).
Inferences made from sample information may be generalized
to a larger population.
Example [Testing Baseballs]
An independent testing company investigated the liveliness of 85
randomly selected Rawlings baseballs from the 1977 supplies of
major league teams.
The Rawlings baseball was found to be more lively than the 1976
Spalding baseball.
Source: Adapted from Weiss (2012, p. 6).

Designed Experiments
A proper randomization technique is used to allocate subjects
(or objects) to treatment and control groups.
Relevant sources of extraneous variation are controlled.
Example [Folic Acid & Birth Defects]
4753 women prior to conception were divided randomly into two
groups. One group took daily doses of folic acid while the other
took only trace elements.
Incidence of major birth defects was much reduced for the group
taking folic acid.
Here, we can infer presence of a causal relationship.
Source: Adapted from Weiss (2012, p. 7).

Questions for Class Discussion


Question 1
A stem-and-leaf display of daily protein intake (in grams) for a
sample of 51 female vegetarians is shown below.
The decimal point is 1 digit(s) to the right of the |
0
1
2
3
4
5
6
7
8

|
|
|
|
|
|
|
|
|

1259
34558
01889
013566688899
001235567
002234467899
88
05

Question 1 (contd)
A similar display for a sample of 53 female nonvegetarians is given
below.
The decimal point is 1 digit(s) to the right of the |
0 | 5
1 | 14
2 | 34557
3 | 4567779
4 | 0112444569
5 | 0003345577
6 | 0113334799
7 | 1157
8 | 1444
Question 1 (contd)
(a) The quartiles for both groups of females are partially given in
the following table. Fill in the missing entries in the table.

Group            1st Quartile   2nd Quartile   3rd Quartile
Vegetarian           ___             39            ___
Nonvegetarian         38            ___             63

Table: Quartiles of Vegetarian and Nonvegetarian Females

(b) Based on information in (the completed) table, compare the
location and spread of the two sets of data.
(c) Identify potential outliers, if any, for each dataset. Do you
obtain results that are consistent with what you observe in the
stem-and-leaf displays?

Question 2
(a) Which of the following is not a property of the coefficient of
variation?
(i) It is not always unique.
(ii) It is resistant to outliers.
(iii) It is a relative measure.
(iv) It is not in the same units as the original data.
(b) The (arithmetic) mean computed from raw data is always
unique. The same is true of the mean computed from
grouped data. True or False?
(c) The sample mid-range is a robust measure of location. True
or False?

Question 3
Suppose you obtain the following five number summaries from
data on annual (percentage) returns for common stock and
government bonds over a fifteen year period.
Investment: Bonds
[1] -10.460   1.035   4.600  14.080  42.980

Investment: Stocks
[1] -25.930  -0.495  10.710  23.760  44.770

(a) What types of statistics do the numbers in each summary
represent?

Question 3 (contd)
(b) One of the values given in the five number summary for the
bond returns looks unusual. Is it a potential outlier?
(c) Of the two financial instruments, which is preferred if your
primary investment objective is to choose the one that gives
you the greater level of return on average?
(d) Which is preferred if risk aversion is the key factor influencing
your choice of investment to make?
(e) Is there anything wrong with the following statement?
Under appropriate conditions, the coefficient of variation is a
useful measure to consider when making risk-reward trade-offs
amongst several investment alternatives.

Question 4
Consider the following absolute frequency distribution obtained
from data on distance (in miles) travelled to work for a random
sample of 50 workers.
Classes   | (10,20]  (20,30]  (30,40]  (40,50]
----------+------------------------------------
Frequency |    3       19       23        5
(a) Determine the grouped data variance using information
provided by the above empirical distribution.
(b) Determine one other grouped data measure of dispersion.

Acknowledgements

The current slides are based in part on material from:


Introductory Statistics (9th Edition) by Neil A. Weiss.
Introductory Statistics (2nd Edition) by H. K. Chow, A.
Ghosh, D. H. Y. Leung and Y. K. Tse.
The slides were produced using The Beamer Class package and
MikTeX (a public domain document preparation system).
Customized computations and graphics were produced using R (a
public domain statistical software package).
I am grateful to the developers of the above resources for making
them available.