5 Dispersion
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Describe the concept of dispersion measures;
2. Explain the concept of range as a dispersion measure;
3. Categorise the distribution curve by its symmetry and non-symmetry; and
4. Analyse variance and standard deviation.
INTRODUCTION
Earlier, we discussed position quantities such as the mean and quartiles,
which can be used to summarise distributions. However, these quantities are
ordered numbers located on the horizontal axis of the distribution graph. As
numbers along a line, they cannot describe in quantitative terms, for
example, the shape of the distribution.
In this topic, we will learn about quantities that measure the shape of a
distribution. For example, the variance is usually used to measure the
dispersion of observations around their mean. The range is used to describe
the coverage of a given data set. The coefficient of skewness is used to
measure the asymmetry of a distribution curve. The coefficient of kurtosis
is used to measure the peakedness of a distribution curve.
ACTIVITY 5.1
Figure 5.1(a) shows two distribution curves with different location centres but
possibly of the same dispersion measure (they may have the same range of
coverage, but of different variances). Curve 1 could represent a distribution of
mathematics marks of male students from School A; while Curve 2 represents
distribution of mathematics marks of female students in the same examination
from the same school.
Figure 5.1 (b) shows two distribution curves with the same location centre but
possibly different dispersion measures (they may have different ranges of
coverage, as well as different variances). Curve 3 could represent a distribution
of physics marks of male students from School A and Curve 4 the distribution of
physics marks of male students in the same examination but from School B.
Figure 5.1 (b): Two distribution curves with same location centre
but possibly of different dispersion measures
Figure 5.1 (c) shows two distribution curves with different location centres but
possibly the same dispersion measure (they may have the same range of
coverage and possibly the same variance). However, Curve 5 is slightly
skewed to the right and Curve 6 is slightly skewed to the left. Curve 5 could
represent a distribution of mathematics marks of students from School A and
Curve 6 may represent the distribution of mathematics marks of students in the
same trial examination but from a different school.
Figure 5.1 (c): Two distribution curves with different location centres
but possibly of same dispersion measure
Figures 5.1(a), (b) and (c) show that, besides the mean, we need to know other
quantities such as the variance, range and coefficient of skewness in order to
describe or summarise a given distribution completely.
The range is defined as the difference between the maximum value and the
minimum value of observations.
Thus,
$$\text{Range} = \text{Maximum value} - \text{Minimum value} \tag{5.1}$$
As can be seen from formula 5.1, range can be easily calculated. However, it
depends on the two extreme values to measure the overall data coverage. It does
not explain anything about the variation of observations between the two
extreme values.
Example 5.1:
Comment on the scatter of the observations in each of the following data
sets:
Set 1 12 6 7 3 15 5 10 18 5
Set 2 9 3 8 8 9 7 8 9 18
Solution:
(a) Arrange observations in ascending order of values, and draw scatter points
plot for each set of data.
Set 1 3 5 5 6 7 10 12 15 18
Set 2 3 7 8 8 8 9 9 9 18
Figure 5.2
(c) Observations in Set 1 are scattered almost evenly throughout the range.
However, for Set 2, most of the observations are concentrated around
numbers 8 and 9.
(d) We can consider numbers 3 and 18 as outliers to the main body of the data
in Set 2.
(e) From this exercise, we learn that it is not good enough to compare only the
overall data coverage using range. Some other dispersion measures have to
be considered too.
To conclude, Figure 5.2 shows that two distributions can have the same range
but different shapes, which the range cannot explain.
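The comparison in Example 5.1 can be checked with a short Python sketch. The helper names `data_range` and `count_between` are illustrative only and do not come from the module; the interval 7 to 9 is chosen here simply to show how the observations in Set 2 cluster.

```python
# Data from Example 5.1
set_1 = [12, 6, 7, 3, 15, 5, 10, 18, 5]
set_2 = [9, 3, 8, 8, 9, 7, 8, 9, 18]


def data_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


def count_between(values, low, high):
    """Count the observations lying in the closed interval [low, high]."""
    return sum(1 for v in values if low <= v <= high)


range_1 = data_range(set_1)
range_2 = data_range(set_2)
print(range_1, range_2)  # 15 15: the two sets have the same range

# The range hides how differently the observations are concentrated.
near_centre_1 = count_between(set_1, 7, 9)
near_centre_2 = count_between(set_2, 7, 9)
print(near_centre_1, near_centre_2)  # 1 7
```

Both sets report a range of 15, yet seven of the nine observations in Set 2 lie between 7 and 9, compared with only one in Set 1.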
ACTIVITY 5.2
Range does not explain the density of a data set. What do you
understand about this statement? Discuss it with your coursemates.
EXERCISE 5.1
Set A: 45, 48, 52, 54, 55, 55, 57, 59, 60, 65
Set B: 25, 32, 40, 45, 53, 60, 61, 71, 78, 85
(a) Calculate the mean and the range of both data sets.
Set C 35 62 42 75 26 50 57 8 88 80 18 83
Set D 50 42 60 62 57 43 46 56 53 88 8 59
(a) Calculate the mean and the range of both data sets.
A larger inter quartile range indicates that the observations in the central
main body are more scattered. This measure can be used to complement the overall
range of the data, as the latter fails to explain the variation of observations
between the two extreme values.
Besides, the inter quartile range does not depend on the two extreme values.
Thus, it can be used to measure the dispersion of the main body of data. It
is also recommended as a complement to the overall data range when we compare
two sets of data.
For example, let us consider question 2 in Exercise 5.1, where Set C is compared
with Set D. Although they have the same overall data range (88 − 8 = 80), they
have different distributions. The inter quartile range for Set C is larger than
that for Set D. This indicates that the main body of data in Set D is less
scattered than the main body of data in Set C.
$$\mathrm{IQR} = |Q_3 - Q_1| \tag{5.2}$$
The bars | | denote the absolute value. Some reference books prefer to use the
Semi Inter Quartile Range, which is given by:
$$Q = \frac{\mathrm{IQR}}{2} \tag{5.3}$$
Example 5.2:
By using the inter quartile range, compare the spread of data between Set C and
Set D in question 2 (Exercise 5.1).
Set C 35 62 42 75 26 50 57 8 88 80 18 83
Set D 50 42 60 62 57 43 46 56 53 88 8 59
Solution:
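(a) Q1 is at the position $\frac{1}{4}(12 + 1) = 3.25$
    Q3 is at the position $\frac{3}{4}(12 + 1) = 9.75$
    In the ordered data below, Q1 therefore lies between the 3rd and 4th
    observations and Q3 between the 9th and 10th.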
Set C: 8, 18, 26, 35, 42, 50, 57, 62, 75, 80, 83, 88
Set D: 8, 42, 43, 46, 50, 53, 56, 57, 59, 60, 62 ,88
(d) Then the inter quartile range for each data set is given by
(i) IQR (C) = 78.75 − 28.25 = 50.5; and
(ii) IQR (D) = 59.75 − 43.75 = 16.0.
(e) Since IQR (D) < IQR (C), data Set D is considered less spread out than
Set C.
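The steps of Example 5.2 can be sketched in Python as follows. The function names `quartile` and `inter_quartile_range` are illustrative, and the sketch assumes the position rule used in the example: the quartile sits at position p(n + 1) in the ordered data, with linear interpolation between neighbouring observations.

```python
def quartile(sorted_values, p):
    """Value at 1-based position p * (n + 1), interpolating linearly."""
    n = len(sorted_values)
    position = p * (n + 1)
    k = int(position)        # whole part of the position
    fraction = position - k  # fractional part of the position
    if k < 1:
        return sorted_values[0]
    if k >= n:
        return sorted_values[-1]
    lower = sorted_values[k - 1]
    upper = sorted_values[k]
    return lower + fraction * (upper - lower)


def inter_quartile_range(values):
    """IQR = |Q3 - Q1| for a list of observations."""
    ordered = sorted(values)
    return abs(quartile(ordered, 0.75) - quartile(ordered, 0.25))


set_c = [35, 62, 42, 75, 26, 50, 57, 8, 88, 80, 18, 83]
set_d = [50, 42, 60, 62, 57, 43, 46, 56, 53, 88, 8, 59]

iqr_c = inter_quartile_range(set_c)
iqr_d = inter_quartile_range(set_d)
print(iqr_c, iqr_d)  # 50.5 16.0
```

Other quartile conventions exist and can give slightly different values, so results from a library routine may not match the hand calculation exactly unless the same position rule is selected.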
Coefficient of Quartile Variation, VQ
The inter quartile range (IQR) and the semi inter quartile range (Q) are two
quantities which have dimensions. Therefore, they become meaningless when used
to compare two data sets with different units, for instance data on age (years)
and data on weight (kg). To avoid this problem, we can use the coefficient of
quartile variation, which has no dimension and is given by:
$$V_Q = \frac{Q}{TTQ} = \frac{|Q_3 - Q_1| / 2}{(Q_3 + Q_1) / 2} = \frac{|Q_3 - Q_1|}{Q_3 + Q_1} \tag{5.4}$$
In the above formula, TTQ is the mid point between Q1 and Q3, and the two bars
| | denote the absolute value.
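As an illustration only, the coefficient of quartile variation can be computed for Set C and Set D using the quartiles found in Example 5.2. The function name `quartile_variation` is hypothetical, and the module itself does not work this particular calculation.

```python
def quartile_variation(q1, q3):
    """V_Q = Q / TTQ, where Q is the semi inter quartile range
    and TTQ is the mid point between Q1 and Q3."""
    q = abs(q3 - q1) / 2
    ttq = (q3 + q1) / 2
    return q / ttq


# Quartiles taken from Example 5.2
vq_c = quartile_variation(28.25, 78.75)
vq_d = quartile_variation(43.75, 59.75)
print(round(vq_c, 3), round(vq_d, 3))  # 0.472 0.155
```

Because the result is a ratio of two quantities with the same unit, it has no dimension and can be compared across data sets measured in different units.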
EXERCISE 5.2
Given the following three sets of data, answer the following questions.
Set E:
Age (Yrs)             5-14   15-24   25-34   35-44   45-54   55-64   65-74   Σf
Number of Residents   35     90      120     98      130     52      25      550

Set F:
Value of Products (RM) x 100   10-14.99   15-19.99   20-24.99   25-29.99   30-34.99   35-39.99   40-44.99   Σf
Number of Products             2          6          15         22         35         15         5          100

Set G:
Extra Charges (RM)   1-1.02   1.03-1.05   1.06-1.08   1.09-1.11   1.12-1.14   1.15-1.17   1.18-1.20   Σf
Number of Shops      3        15          28          30          25          14          5           120
If we have two distributions, the one with the larger variance is more spread
out and hence its frequency curve is flatter. The population variance is denoted
by the symbol σ² and is never negative. The standard deviation is obtained by
taking the square root of the variance. In this module, we will treat the given
data as a population.
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} \tag{5.5}$$
Example 5.3:
Obtain the standard deviation of the data set 20, 30, 40, 50, 60.
Solution:
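x      x − μ             (x − μ)²
20     20 − 40 = −20     (−20)² = 400
30     30 − 40 = −10     (−10)² = 100
40     0                 0
50     10                100
60     20                400
Sum = 200                Sum = 1000

$$\mu = \frac{200}{5} = 40$$
$$\sigma^2 = \frac{\sum (x - \mu)^2}{n} = \frac{1000}{5} = 200$$
$$\sigma = \sqrt{200} \approx 14.14$$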
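The calculation in Example 5.3 can be reproduced with a minimal Python sketch that follows the definition directly, treating the data as a population (division by n). The variable names are illustrative.

```python
import math

data = [20, 30, 40, 50, 60]

n = len(data)
mu = sum(data) / n                                 # population mean
squared_deviations = [(x - mu) ** 2 for x in data]
variance = sum(squared_deviations) / n             # population variance
std_dev = math.sqrt(variance)                      # population standard deviation

print(mu, variance, round(std_dev, 2))  # 40.0 200.0 14.14
```

Python's standard library offers `statistics.pvariance` and `statistics.pstdev` for the same population quantities.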
EXERCISE 5.3
$$\sigma = \sqrt{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2} \tag{5.6}$$
Example 5.4:
Calculate the variance of the set of numbers 3, 6, 7, 2, 4, 5 and 8.
Solution:
x:   3   6   7   2   4   5   8     Σx = 35
x²:  9   36  49  4   16  25  64    Σx² = 203

$$\sigma^2 = \frac{203}{7} - \left(\frac{35}{7}\right)^2 = 29 - 25 = 4$$
Example 5.5:
Obtain the standard deviation of data Set 2 in Example 5.1.
Set 2 9 3 8 8 9 7 8 9 18
Solution:
x:   9   3   8   8   9   7   8   9   18     Σx = 79
x²:  81  9   64  64  81  49  64  81  324    Σx² = 817

$$\sigma = \sqrt{\frac{817}{9} - \left(\frac{79}{9}\right)^2} = 3.71$$
(You can compare this with the answer obtained in Exercise 5.3.)
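The computational form used in Examples 5.4 and 5.5 (mean of the squares minus the square of the mean) can be sketched in Python as below. The function names are illustrative and the data are treated as a population.

```python
import math


def population_variance_shortcut(values):
    """Variance as the mean of the squares minus the square of the mean."""
    n = len(values)
    mean_of_squares = sum(x * x for x in values) / n
    square_of_mean = (sum(values) / n) ** 2
    return mean_of_squares - square_of_mean


def population_std_shortcut(values):
    """Standard deviation as the square root of the shortcut variance."""
    return math.sqrt(population_variance_shortcut(values))


example_5_4 = [3, 6, 7, 2, 4, 5, 8]
example_5_5 = [9, 3, 8, 8, 9, 7, 8, 9, 18]  # Set 2 from Example 5.1

variance_5_4 = population_variance_shortcut(example_5_4)
std_5_5 = population_std_shortcut(example_5_5)
print(round(variance_5_4, 2), round(std_5_5, 2))  # 4.0 3.71
```

This form needs only the two running totals Σx and Σx², which is convenient for hand calculation. With floating-point arithmetic it can lose accuracy when the mean is very large compared with the spread, in which case the definition-based form is safer.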
$$\sigma = \sqrt{\frac{\sum f_i x_i^2}{n} - \left(\frac{\sum f_i x_i}{n}\right)^2} \tag{5.7}$$
where xi is the class mid-point of the ith class whose frequency is fi.
Example 5.6:
Obtain the standard deviation of the weekly book sales given in Table 2.5.
The solution can be seen in Table 2.6.
Class 34 - 43 44 - 53 54 - 63 64 - 73 74 - 83 84 - 93 94 - 103
Frequency (f) 2 5 12 18 10 2 1
Solution:
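$$\sigma = \sqrt{\frac{\sum f_i x_i^2}{n} - \left(\frac{\sum f_i x_i}{n}\right)^2} = \sqrt{\frac{227242.5}{50} - \left(\frac{3315}{50}\right)^2} = 12.21 \approx 12 \text{ books}$$
The variance is $\sigma^2 = 149.16 \approx 149$ (in squared units, books²).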
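The grouped-data calculation in Example 5.6 can be checked with the following Python sketch. It assumes, as the text states, that each class is represented by its mid-point; the variable names are illustrative.

```python
import math

# Class limits and frequencies from Example 5.6
classes = [(34, 43), (44, 53), (54, 63), (64, 73), (74, 83), (84, 93), (94, 103)]
frequencies = [2, 5, 12, 18, 10, 2, 1]

# Each class is represented by its mid-point
mid_points = [(low + high) / 2 for low, high in classes]

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, mid_points))
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, mid_points))

variance = sum_fx2 / n - (sum_fx / n) ** 2
std_dev = math.sqrt(variance)

print(n, sum_fx, sum_fx2)                     # 50 3315.0 227242.5
print(round(variance, 2), round(std_dev, 2))  # 149.16 12.21
```

The result is an approximation to the standard deviation of the raw data, since every observation in a class is treated as if it were equal to the class mid-point.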
ACTIVITY 5.3
Do you think obtaining standard deviation for grouped data is easier than
for ungrouped data? Justify your answer.
Coefficient of Variation
When we want to compare the dispersion of two data sets with different units,
such as data for age and weight, the variance is not appropriate simply because
this quantity has a unit. The coefficient of variation, V, given below is
dimensionless and is therefore more appropriate.
$$V = \frac{\text{Standard Deviation}}{\text{Mean}} \tag{5.9}$$
Example 5.7:
Data set A has mean of 120 and standard deviation of 36 but data set B has mean
of 10 and standard deviation of 2. Compare the variation in data set A and data
set B.
Solution:
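$$\mu_A = 120, \quad \sigma_A = 36, \quad \mu_B = 10, \quad \sigma_B = 2$$
$$V_A = \frac{36}{120} \times 100\% = 30\%, \qquad V_B = \frac{2}{10} \times 100\% = 20\%$$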
Data set A has more variation, while data set B is more consistent. Data set B is
more reliable compared to data set A.
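The comparison in Example 5.7 can be expressed as a small Python sketch. The function name `coefficient_of_variation` is illustrative, and the result is returned as a percentage, as in the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation relative to the mean, expressed as a percentage."""
    return 100 * std_dev / mean


v_a = coefficient_of_variation(36, 120)  # data set A
v_b = coefficient_of_variation(2, 10)    # data set B
print(v_a, v_b)  # 30.0 20.0

# The data set with the smaller coefficient is the more consistent one.
more_consistent = "A" if v_a < v_b else "B"
print(more_consistent)  # B
```

Note that the ratio is only meaningful when the mean is not zero and the data are measured on a scale with a true zero.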
EXERCISE 5.4
5.5 SKEWNESS
In practice, we may have a distribution which is:
(a) Symmetrical
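$$\mathrm{PCS}(1) = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} = \frac{\bar{x} - \hat{x}}{s} \tag{5.10}$$
$$\mathrm{PCS}(2) = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}} = \frac{3(\bar{x} - \tilde{x})}{s} \tag{5.11}$$
where $\bar{x}$ is the mean, $\hat{x}$ the mode, $\tilde{x}$ the median and $s$ the standard deviation.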
Example 5.8:
If data set A has a mean of 9.333, a median of 9.5 and a standard deviation of
2.357, calculate Pearson's coefficient of skewness for the data set.
Solution:
$$\mathrm{PCS} = \frac{3(\bar{x} - \tilde{x})}{s} = \frac{3(9.333 - 9.5)}{2.357} = -0.213$$
In our case, the distribution is negatively skewed. There are more values above
the average than below the average.
EXERCISE 5.5
Distribution A:
Weight (Kg)       20-29   30-39   40-49   50-59   60-69   70-79   80-89
No. of Students   4       10      20      30      25      10      1

Distribution B:
Marks             20-29   30-39   40-49   50-59   60-69   70-79   80-89
No. of Students   10      20      30      20      15      4       1
(a) Obtain: mean, mode, median, Q1, Q3, standard deviation and the
coefficient of variation.
In this topic, you have studied various measures of dispersion which can be
used to describe the shape of a frequency curve.
It was mentioned earlier that the overall range cannot explain the pattern of
observations lying between the minimum and the maximum values.
Thus, we introduced the inter quartile range (IQR) to measure the dispersion of
the data in the middle 50%, or the main body.
The variance, which is also called a shape parameter, is another measure of
dispersion. However, to compare two sets of data which have different units,
the coefficient of variation is used.