Some terms
Raw data
Raw data are data recorded in the sequence in which they are collected, before they are processed or ranked.
Table 1: The weights of 20 students in kg (quantitative raw data)
61  66  68  65  65  62  67  67  68  60
71  73  69  69  63  70  74  70  64  71
B  D  C  B  B  A  C  B  A  B  C  B  B  A  B  C  A  D
Arrays
An arrangement of numerical raw data in ascending or descending order of magnitude.
60  61  62  63  64  65  65  66  67  67
68  68  69  69  70  70  71  71  73  74
Ungrouped data
Contains information on each member of a sample or population individually
Examples: Data presented in Table 1 and Table 2
Grouped data
Data presented in classes or intervals.
Example:
UCCM2623 Scores   Number of students
10-12             4
13-15             12
16-18             20
19-21             14
1.2
Solution.
Course        Frequency
Biotech       8
Business      6
Engineering   4
Infotech      3
Others        4
Total:        25
Example 1.2. Determine the relative frequency and percentage distributions for the data in Example 1.1.
Solution.
Course        Relative Frequency   Percentage
Biotech       0.32                 32%
Business      0.24                 24%
Engineering   0.16                 16%
Infotech      0.12                 12%
Others        0.16                 16%
Total:        1.00                 100%
Example 1.3. Construct a bar chart for the data in Example 1.1.
Solution.
[Figure: bar chart of frequency (vertical axis, 0 to 8) against course (horizontal axis, Biotech to Others).]
1.3
A frequency distribution lists all the classes and the number of values that belong to each class.
Data presented in the form of a frequency distribution are called grouped data.
Note:
The classes are non-overlapping, i.e. each value belongs to one and only one class.
Class
An interval that includes all the values that fall between two numbers, the lower and upper limits.
Class limits
Endpoints of each interval
Class Boundary
Class boundary is the dividing line between two classes. It is given by the midpoint of the upper limit of one class and the lower limit of the next higher class.
Class width / class size
Class width is the difference between the upper and lower class boundaries.
class width = upper boundary - lower boundary
Class mark / class midpoint
Class mark is the midpoint of the class interval
class mark = (lower class limit + upper class limit ) / 2
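The definitions above can be checked with a short Python sketch. The function name `class_summary` and the `gap` parameter are hypothetical; the sketch assumes every class is separated from its neighbour by the same gap (1 for data recorded as whole numbers), so each boundary lies half a gap beyond the class limit.

```python
def class_summary(lower_limit, upper_limit, gap=1):
    """Boundaries, width and mark of one class.

    gap is the (assumed constant) distance between the upper limit of one
    class and the lower limit of the next, e.g. 1 for classes 80-89, 90-99.
    """
    lower_boundary = lower_limit - gap / 2
    upper_boundary = upper_limit + gap / 2
    return {
        "boundaries": (lower_boundary, upper_boundary),
        "width": upper_boundary - lower_boundary,
        "mark": (lower_limit + upper_limit) / 2,
    }

summary = class_summary(80, 89)
print(summary)
```

For the class 80-89 this gives boundaries 79.5 and 89.5, a class width of 10 and a class mark of 84.5.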
Constructing frequency distribution tables
1. Determine the number of classes. It usually varies from 5 to 20, depending mainly on the number of observations in the data set. Find $2^k$, where k is the smallest number such that $2^k$ is greater than the number of observations (n).
2.
3. Determine the lower limit of the first class, or the starting point. Any convenient number that is equal to or less than the smallest value in the data set can be used as the lower limit of the first class.
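The $2^k$ rule in step 1 can be sketched in Python as follows. The function name `smallest_k` is hypothetical, and reading k as a suggested number of classes is an assumption; the rule is only a guide, and a nearby convenient number of classes may be chosen instead.

```python
def smallest_k(n):
    """Smallest k such that 2**k is greater than n observations."""
    k = 0
    while 2 ** k <= n:
        k += 1
    return k

k_for_50 = smallest_k(50)   # 2**5 = 32 is too small, 2**6 = 64 exceeds 50
print(k_for_50)
```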
Chapter 1 - 3
Example 1.4. A sample of birth-weights (oz) from 50 consecutive deliveries is given below. Construct a frequency distribution table.
86   120  123  104  121  111  91   128  133  104
118  89   134  132  98   121  122  115  106  115
92   115  84   98   107  124  138  138  125  127
108  118  140  146  122  104  99   105  108  135
132  95   124  132  126  125  115  144  98   89
Solution.
Birthweights (oz)   f
80-89               4
90-99               7
100-109             8
110-119             7
120-129             13
130-139             8
140-149             3
Total:              50
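The tallying step can be reproduced with a short Python sketch. The variable names are illustrative; the class limits 80-89 up to 140-149 are the ones used in this example.

```python
birth_weights = [
    86, 120, 123, 104, 121, 111, 91, 128, 133, 104,
    118, 89, 134, 132, 98, 121, 122, 115, 106, 115,
    92, 115, 84, 98, 107, 124, 138, 138, 125, 127,
    108, 118, 140, 146, 122, 104, 99, 105, 108, 135,
    132, 95, 124, 132, 126, 125, 115, 144, 98, 89,
]
class_limits = [(80, 89), (90, 99), (100, 109), (110, 119),
                (120, 129), (130, 139), (140, 149)]

# Count the values that fall inside each pair of class limits.
frequencies = [sum(lo <= w <= hi for w in birth_weights)
               for lo, hi in class_limits]

for (lo, hi), f in zip(class_limits, frequencies):
    print(f"{lo}-{hi}: {f}")
```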
Example 1.5. Calculate the relative frequency and percentage distributions for the data in Example 1.4.
Solution.
Birthweights (oz)   Class Boundaries   Relative Frequency   Percentage
80-89               79.5 - 89.5        0.08                 8%
90-99               89.5 - 99.5        0.14                 14%
100-109             99.5 - 109.5       0.16                 16%
110-119             109.5 - 119.5      0.14                 14%
120-129             119.5 - 129.5      0.26                 26%
130-139             129.5 - 139.5      0.16                 16%
140-149             139.5 - 149.5      0.06                 6%
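A relative frequency is a class frequency divided by the total number of observations, and a percentage is the relative frequency multiplied by 100. A minimal Python sketch, using the class frequencies counted from the Example 1.4 data (variable names are illustrative):

```python
frequencies = [4, 7, 8, 7, 13, 8, 3]   # classes 80-89 up to 140-149
n = sum(frequencies)

relative_frequencies = [f / n for f in frequencies]
percentages = [100 * f / n for f in frequencies]

print(relative_frequencies)
print(percentages)
```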
1.3.3 Histogram
Three types of histogram:
1. Frequency histogram
2. Relative frequency histogram
3. Percentage histogram
A frequency histogram consists of a set of rectangles having
a) bases on a horizontal axis, with centres at the class marks and lengths equal to the class interval sizes;
b) areas proportional to the class frequencies.
If the class intervals all have equal size, the heights of the rectangles are proportional to the class frequencies; otherwise, the heights of the rectangles must be adjusted.
Procedure to draw a histogram:
1. Mark the class boundaries of each interval on the horizontal axis.
2. For each class, mark the frequencies (or relative frequencies or percentages) on the vertical axis.
3. Draw a bar for each class so that its height represents the frequency of that class. (There are no gaps between the bars.)
4. Label the histogram.
1.3.4 Polygon
A polygon is a line graph formed by joining the midpoints of the tops of successive bars in a histogram. Two more classes (with zero frequencies) are then marked, one at each end, and their midpoints are joined to the graph.
Three types of polygon:
1. Frequency polygon
2. Relative frequency polygon
3. Percentage polygon
[Figures: three graphs with horizontal axis Birth-weight (oz), marked at the class boundaries 79.5, 89.5, 99.5, 109.5, 119.5, 129.5, 139.5 and 149.5.]
Example 1.7. The frequency distribution gives the weight of 35 objects, measured to the nearest kg.
Draw a histogram to illustrate the data.
Weight (kg)   Frequency
6-8           4
9-11          6
12-17         10
18-20         3
21-29         12
Solution.
adjusted frequency =
Weight (kg)   Class width   Frequency   Adjusted Frequency
6-8           3             4
9-11          3             6
12-17         6             10
18-20         3             3
21-29         9             12
[Figure: histogram of adjusted frequency (vertical axis, 1 to 6) against weight (kg), with the horizontal axis marked from 5.5 to 29.5 in steps of 3.]
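One way to carry out the height adjustment for unequal class widths is sketched below in Python. It assumes the adjusted frequency is the class frequency rescaled to a standard class width, taken here as the smallest width (3 kg); this convention is an assumption, chosen because it agrees with the 1 to 6 vertical scale of the histogram. Class widths are measured between class boundaries.

```python
# (lower limit, upper limit, frequency) for each weight class in kg
classes = [(6, 8, 4), (9, 11, 6), (12, 17, 10), (18, 20, 3), (21, 29, 12)]

# Width between boundaries, e.g. 8.5 - 5.5 = 3 for the class 6-8.
widths = [(hi + 0.5) - (lo - 0.5) for lo, hi, _ in classes]
standard_width = min(widths)   # assumed standard width

# Rescale each frequency so that bar area stays proportional to frequency.
adjusted = [f * standard_width / w for (_, _, f), w in zip(classes, widths)]

print(widths)
print(adjusted)
```

Under this assumption the bar heights are 4, 6, 5, 3 and 4, so the wide classes 12-17 and 21-29 no longer look over-represented.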
Example 1.8. Refer to data in Example 1.4, construct its cumulative frequency distribution, cumulative
relative frequency and cumulative percentage.
Birthweights (oz)   Cumulative frequency   Cumulative relative frequency   Cumulative percentage, %
<79.5               0                      0                               0%
<89.5               4                      0.08                            8%
<99.5               11                     0.22                            22%
<109.5              19                     0.38                            38%
<119.5              26                     0.52                            52%
<129.5              39                     0.78                            78%
<139.5              47                     0.94                            94%
<149.5              50                     1                               100%
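The cumulative columns are running totals of the class frequencies. A short Python sketch with `itertools.accumulate`, using the frequencies counted from the Example 1.4 data (variable names are illustrative):

```python
from itertools import accumulate

frequencies = [4, 7, 8, 7, 13, 8, 3]   # classes 80-89 up to 140-149
n = sum(frequencies)

# Start at 0 for "less than 79.5", then add one class at a time.
cumulative = [0] + list(accumulate(frequencies))
cumulative_relative = [c / n for c in cumulative]
cumulative_percentage = [100 * c / n for c in cumulative]

print(cumulative)
```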
Note:
1. The ogive starts at the lower boundary of the first class and ends at the upper boundary of the last class.
2. If relative cumulative frequency is used in place of cumulative frequency, the graph is called a relative cumulative frequency curve or percentage ogive.
Example 1.9. Draw an ogive for the data in Example 1.4. Estimate from the ogive,
a) the total number of deliveries with birth-weights of less than 95 oz.
b) the value of X, if 20% of the deliveries had birth-weights of X oz or more.
Solution.
[Figure: ogive of cumulative frequency (vertical axis, 0 to 55) against birth-weight (oz), plotted at the class boundaries 79.5 to 149.5.]
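Reading values off an ogive amounts to interpolating between the plotted points. The Python sketch below assumes the points are joined by straight lines; the function names are hypothetical, and an estimate read by eye from a hand-drawn curve may differ slightly.

```python
boundaries = [79.5, 89.5, 99.5, 109.5, 119.5, 129.5, 139.5, 149.5]
cumulative = [0, 4, 11, 19, 26, 39, 47, 50]
points = list(zip(boundaries, cumulative))

def cumulative_below(x):
    """Estimated number of deliveries with birth-weight less than x."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return c0 + (c1 - c0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the range of the ogive")

def weight_at(c):
    """Estimated birth-weight below which c deliveries fall."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= c <= c1:
            return x0 + (x1 - x0) * (c - c0) / (c1 - c0)
    raise ValueError("c lies outside the range of the ogive")

part_a = cumulative_below(95)       # about 7.85, i.e. roughly 8 deliveries
part_b = weight_at(0.80 * 50)       # 20% at X or more means 40 below X
print(part_a, part_b)
```

Under the straight-line assumption, part a) gives about 8 deliveries and part b) gives X of about 130.75 oz.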
1.4
1.4.1 Median
The median is the value of the middle term in a data set that has been ranked in increasing or decreasing order.
Median = value of the $\frac{n+1}{2}$th term in a ranked data set, where n is the total number of elements in the set.
Note:
1. If n is odd, the median is the value of the middle term in the ranked data.
2. If n is even, the median is the average of the two middle terms.
Example 1.10. Find the median of set A = { 10, 5, 19, 8, 3 } and set B = { 2, 7, 3, 6, 4, 5 }
Solution.
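A minimal Python sketch of the $\frac{n+1}{2}$ rule, applied to sets A and B (the function name `median` is illustrative; Python's own `statistics.median` follows the same rule):

```python
def median(values):
    """Middle term of the ranked data, or the average of the two middle terms."""
    ranked = sorted(values)
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:                       # n odd: a single middle term
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2   # n even

median_a = median([10, 5, 19, 8, 3])     # ranked: 3, 5, 8, 10, 19
median_b = median([2, 7, 3, 6, 4, 5])    # ranked: 2, 3, 4, 5, 6, 7
print(median_a, median_b)
```

Set A has median 8 (the 3rd term) and set B has median 4.5 (the average of the 3rd and 4th terms).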
Note:
The median is not influenced by extreme values. (Extreme values are values that are very small or very large relative to the majority of the values in a data set.)
For grouped data in the form of a frequency distribution of single-valued classes
The median can be found either from the ungrouped frequency distribution or from the cumulative frequency distribution.
Value       0   1   2    3   4   5
Frequency   3   5   12   9   4   2
Solution.
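The median of a frequency distribution of single values can be located by accumulating the frequencies until the middle position is reached. The Python sketch below reads the table above as the values 0 to 5 with frequencies 3, 5, 12, 9, 4 and 2; that reading of the flattened table is an assumption, and the function name is hypothetical.

```python
values = [0, 1, 2, 3, 4, 5]
frequencies = [3, 5, 12, 9, 4, 2]

def median_from_frequencies(values, frequencies):
    n = sum(frequencies)
    # 1-indexed position(s) of the middle term(s) in the ranked data
    positions = [(n + 1) // 2] if n % 2 == 1 else [n // 2, n // 2 + 1]

    def term(position):
        running = 0
        for value, f in zip(values, frequencies):
            running += f                 # cumulative frequency so far
            if position <= running:
                return value

    return sum(term(p) for p in positions) / len(positions)

result = median_from_frequencies(values, frequencies)
print(result)
```

Here n = 35, so the median is the 18th term; the cumulative frequencies 3, 8, 20 show that it falls in the class with value 2.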
1.4.2 Mode
Mode is the value that occurs with the highest frequency in a data set.
Example 1.12. Find the mode of each of the following data set.
i) 74, 9, 5, 8, 3, 8, 8
ii) 2, 2, 6, 6, 8, 8, 9, 9
iii) 2, 6, 6, 6, 3, 8, 8, 8, 3
iv) B, C, D, A, A, C, C, C, B, A
Solution.
Note:
1. The mode is not influenced by extreme values.
2. A data set may have no mode, one mode (unimodal), two modes (bimodal) or more than two modes (multimodal).
3. The mode can be used for both quantitative and qualitative data.
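The four data sets of Example 1.12 can be checked with `collections.Counter`. The sketch below assumes the convention that a data set in which every distinct value occurs equally often has no mode; the function name `modes` is hypothetical, and the data set must not be empty.

```python
from collections import Counter

def modes(values):
    """All values with the highest frequency; empty list if there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if all distinct values tie, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

mode_i = modes([74, 9, 5, 8, 3, 8, 8])
mode_ii = modes([2, 2, 6, 6, 8, 8, 9, 9])
mode_iii = modes([2, 6, 6, 6, 3, 8, 8, 8, 3])
mode_iv = modes(list("BCDAACCCBA"))
print(mode_i, mode_ii, mode_iii, mode_iv)
```

Set i) is unimodal (8), set ii) has no mode, set iii) is bimodal (6 and 8), and set iv) shows that the mode also applies to qualitative data (C).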
Value       0   1   2    3   4
Frequency   3   5   12   9   4
Solution.
1.4.3 Mean
The mean for population data $x_1, x_2, \ldots, x_N$ is denoted by $\mu$ and is defined as
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
The mean for sample data $x_1, x_2, \ldots, x_n$ is denoted by $\bar{X}$ and is defined as
$$\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Example 1.14. Find the arithmetic mean for the data set { 158, 189, 265, 127, 191 }
Solution.
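A minimal Python check of the arithmetic mean for this data set (variable names are illustrative):

```python
data = [158, 189, 265, 127, 191]

total = sum(data)            # 158 + 189 + 265 + 127 + 191
mean = total / len(data)     # sum divided by the number of values
print(total, mean)
```

The sum is 930, so the mean is 930 / 5 = 186, a value that does not itself appear in the data set.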
Note:
1. The mean does not necessarily take one of the values in the original data.
2. The mean is influenced by extreme values.
For grouped data in the form of a frequency distribution of single-valued classes
$$\bar{X} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} f_i x_i = \frac{\sum f_i x_i}{\sum f_i}$$
fi
2
1
5
3
6
4
8
2
xi
fi
f i xi
24
16
xi
Solution.
5
2
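$$\mu = \frac{\sum f_i m_i}{N}, \qquad N = \sum f_i = \text{population size}$$
$$\bar{X} = \frac{\sum f_i m_i}{n}, \qquad n = \sum f_i = \text{sample size}$$
where $m_i$ is the class midpoint and $f_i$ is the class frequency.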
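The grouped mean can be sketched in Python using the weight distribution from Example 1.7 (classes 6-8, 9-11, 12-17, 18-20 and 21-29 kg). Variable names are illustrative; each class is represented by its midpoint, so the result is an approximation to the mean of the underlying raw data.

```python
# (lower limit, upper limit, frequency) for each weight class in kg
classes = [(6, 8, 4), (9, 11, 6), (12, 17, 10), (18, 20, 3), (21, 29, 12)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
frequencies = [f for _, _, f in classes]

products = [f * m for f, m in zip(frequencies, midpoints)]   # f_i * m_i
n = sum(frequencies)
grouped_mean = sum(products) / n
print(midpoints, products, grouped_mean)
```

The products $f_i m_i$ sum to 590 over 35 objects, giving a mean of about 16.86 kg.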
Class interval   Frequency
6-8              4
9-11             6
12-17            10
18-20            3
21-29            12
Solution.
Class interval   Class midpoint ($m_i$)   Frequency ($f_i$)   $f_i m_i$
6-8              7                        4                   28
9-11             10                       6                   60
12-17            14.5                     10                  145
18-20            19                       3                   57
21-29            25                       12                  300
1.5 Measures of dispersion
Measures of central tendency alone are sometimes not enough to reveal the whole picture of the distribution of a data set, because they do not describe how the data are spread out.
Data set   Data                                    Mean   Median   Mode
A          1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11    6      6        6
B          4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8      6      6        6
[Figure: plots of Set A (values 1 to 11) and Set B (values 4 to 8) on a number line.]
Note: The mean, median and mode are the same for data sets A and B, but the distributions of the data are different.
Example 1.17. Find the range for data set A and data set B above.
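The range of a data set is the difference between its largest and smallest values. A minimal Python sketch for the two data sets (variable names are illustrative):

```python
set_a = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]
set_b = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]

range_a = max(set_a) - min(set_a)   # 11 - 1
range_b = max(set_b) - min(set_b)   # 8 - 4
print(range_a, range_b)
```

Although both sets share the same mean, median and mode, set A (range 10) is spread much more widely than set B (range 4).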
Variance
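The variance is the average of the squared deviations of the data from the mean.
Population variance:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2, \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
Sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{X})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$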
Standard Deviation
The standard deviation is the positive square root of the variance
Example 1.18. Data shows the salary per day for all 6 employees of a small company.
29.50, 16.50, 35.40, 21.30, 49.70, 24.60
Calculate the variance and standard deviation for these data.
Solution.
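Mean, $\mu = \frac{1}{N}\sum x_i = \frac{177.00}{6} = 29.50$
$x_i$     $x_i - \mu$   $(x_i - \mu)^2$   $x_i^2$
29.50     0.00          0.00              870.25
16.50     -13.00        169.00            272.25
35.40     5.90          34.81             1253.16
21.30     -8.20         67.24             453.69
49.70     20.20         408.04            2470.09
24.60     -4.90         24.01             605.16
Total     0.00          703.10            5924.60
Method 1:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 = \frac{703.10}{6} \approx 117.18$$
Method 2:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2 = \frac{5924.60}{6} - 29.50^2 \approx 117.18$$
Standard deviation, $\sigma = \sqrt{117.18} \approx 10.83$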
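Both methods can be verified with a short Python sketch (variable names are illustrative). Because the data cover all 6 employees, the population formulas with divisor N apply.

```python
from math import sqrt

salaries = [29.50, 16.50, 35.40, 21.30, 49.70, 24.60]
N = len(salaries)
mu = sum(salaries) / N

# Method 1: average squared deviation from the mean.
variance_1 = sum((x - mu) ** 2 for x in salaries) / N

# Method 2: mean of the squares minus the square of the mean.
variance_2 = sum(x ** 2 for x in salaries) / N - mu ** 2

std_dev = sqrt(variance_1)
print(mu, variance_1, variance_2, std_dev)
```

The two methods agree up to floating-point rounding, giving a variance of about 117.18 and a standard deviation of about 10.83.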
Example 1.19. A sample consists of 5 data values: 72, 49, 79, 55 and 57. Calculate the variance and
standard deviation.
Solution.
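$n = 5$, $\sum x_i = 312$, $\sum x_i^2 = 20100$
$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] = \frac{1}{4}\left[20100 - \frac{312^2}{5}\right] = 157.8$$
Standard deviation, $s = \sqrt{157.8} \approx 12.56$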
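A minimal Python sketch of the sample variance for these five values, using the computational formula with divisor n - 1 (variable names are illustrative; `statistics.variance` from the standard library uses the same divisor):

```python
from math import sqrt

sample = [72, 49, 79, 55, 57]
n = len(sample)

sum_x = sum(sample)                      # sum of the values
sum_x2 = sum(x ** 2 for x in sample)     # sum of the squared values

sample_variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
sample_std_dev = sqrt(sample_variance)
print(sum_x, sum_x2, sample_variance, sample_std_dev)
```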
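For grouped data with class midpoints $m_i$ and class frequencies $f_i$:
Population variance:
$$\sigma^2 = \frac{1}{N}\sum f_i (m_i - \mu)^2 = \frac{1}{N}\left[\sum f_i m_i^2 - \frac{\left(\sum f_i m_i\right)^2}{N}\right]$$
Sample variance:
$$s^2 = \frac{1}{n-1}\sum f_i (m_i - \bar{X})^2 = \frac{1}{n-1}\left[\sum f_i m_i^2 - \frac{1}{n}\left(\sum f_i m_i\right)^2\right]$$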
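The grouped formulas can be sketched in Python with the height distribution of Example 1.20 (classes 20-22 up to 32-34 with frequencies 3, 6, 12, 9 and 2). Variable names are illustrative; the only difference between the population and sample results is the final divisor.

```python
# (lower limit, upper limit, frequency) for each height class
classes = [(20, 22, 3), (23, 25, 6), (26, 28, 12), (29, 31, 9), (32, 34, 2)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
frequencies = [f for _, _, f in classes]

total = sum(frequencies)                                        # N or n
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
sum_fm2 = sum(f * m ** 2 for f, m in zip(frequencies, midpoints))

corrected = sum_fm2 - sum_fm ** 2 / total   # shared bracketed term
population_variance = corrected / total
sample_variance = corrected / (total - 1)
print(population_variance, sample_variance)
```

With 32 observations this gives a population variance of about 9.83 and a sample variance of about 10.15.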
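Example 1.20. Find the variance of the following frequency distribution if it represents
a) a population
b) a sample
Height (m)   Frequency
20-22        3
23-25        6
26-28        12
29-31        9
32-34        2
Solution.
Height   Midpoint, m   Frequency, f   fm    $fm^2$
20-22    21            3              63    1323
23-25    24            6              144   3456
26-28    27            12             324   8748
29-31    30            9              270   8100
32-34    33            2              66    2178
Total:                 32             867   23805
a)
$$\sigma^2 = \frac{1}{N}\left[\sum f_i m_i^2 - \frac{\left(\sum f_i m_i\right)^2}{N}\right] = \frac{1}{32}\left[23805 - \frac{867^2}{32}\right] \approx 9.83$$
b)
$$s^2 = \frac{1}{n-1}\left[\sum f_i m_i^2 - \frac{1}{n}\left(\sum f_i m_i\right)^2\right] = \frac{1}{31}\left[23805 - \frac{867^2}{32}\right] \approx 10.15$$
1.6 Measures of position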
Measures of position determine the position of a single value in relation to other values in a sample or a
population data set.
1.6.1 Quartiles
Quartiles are three summary measures that divide a ranked data set into four equal parts.
The second quartile (Q2) is the median of the data set.
The first quartile (Q1) is the value of the middle term among the observations that are less than the median.
The third quartile (Q3) is the value of the middle term among the observations that are greater than the median.
Q1 = $\frac{1}{4}(n+1)$th value
Q2 = $\frac{1}{2}(n+1)$th value
Q3 = $\frac{3}{4}(n+1)$th value
When n is odd, the rule locates the exact position of the quartiles.
When n is even,
a)
n
2
2.5
6.5
1
3
( n + 1) or ( n + 1) values,
4
4
b)
1
3
( n + 1) or ( n + 1)
4
4
value which is greater than .5 value and round down the values which is smaller than .5 value, for
example:
3.75
4
2
2.25
n
2
Quartile deviation = $\frac{Q_3 - Q_1}{2}$
1.6.3 Percentiles
The (approximate) value of the kth percentile, denoted by $P_k$, is
$P_k$ = value of the $\frac{kn}{100}$th term in a ranked data set
where k denotes the number of the percentile and n represents the sample size. Note that $\frac{kn}{100}$ is rounded to the nearest integer or .5 value, for example:
2.2 -> 2.0
2.3 -> 2.5
2.7 -> 2.5
2.8 -> 3.0
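The percentile rule can be sketched in Python as follows. Two points are assumptions: a position ending in .5 is read as the average of the two neighbouring terms, and the rounded position is assumed to lie between 1 and n. The function names are hypothetical. The usage line applies the rule to the 12 scores of Example 1.21 with k = 62.

```python
def percentile_position(k, n):
    """kn/100 rounded to the nearest integer or .5 value."""
    return round(2 * k * n / 100) / 2

def percentile(values, k):
    ranked = sorted(values)
    position = percentile_position(k, len(ranked))
    whole = int(position)
    if position == whole:                 # integer position: take that term
        return ranked[whole - 1]
    # .5 position: assumed to mean the average of the two adjacent terms
    return (ranked[whole - 1] + ranked[whole]) / 2

scores = [75, 80, 68, 53, 99, 58, 76, 73, 85, 88, 91, 79]
p62 = percentile(scores, 62)   # 62 * 12 / 100 = 7.44, rounded to 7.5
print(p62)
```

Position 7.5 falls between the 7th and 8th ranked scores (79 and 80), so the 62nd percentile is estimated as 79.5.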
Example 1.21. The following are the scores of 12 students in a mathematics class.
75  80  68  53  99  58  76  73  85  88  91  79
a) Find the values of the three quartiles. Where does the score of 88 lie in relation to these quartiles?
b) Find the interquartile range.
c) Find the quartile deviation.
d) Find the value of the 62nd percentile.
Solution.
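Parts a) to c) can be sketched in Python using the definition of Q1 and Q3 as the middle terms of the observations below and above the median. This is one convention among several; quartile rules based on rounded positions can give slightly different values for other data sets. The function name `median` is illustrative.

```python
def median(ranked):
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2

scores = [75, 80, 68, 53, 99, 58, 76, 73, 85, 88, 91, 79]
ranked = sorted(scores)

q2 = median(ranked)
q1 = median([x for x in ranked if x < q2])   # observations below the median
q3 = median([x for x in ranked if x > q2])   # observations above the median

interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2
print(q1, q2, q3, interquartile_range, quartile_deviation)
```

This gives Q1 = 70.5, Q2 = 77.5 and Q3 = 86.5, so the score of 88 lies above the third quartile, in the top 25% of the scores. The interquartile range is 16 and the quartile deviation is 8.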