Unit 1: Probability Theory

PROBABILITY THEORY C 5606/1/

PROBABILITY THEORY

OBJECTIVES

General Objective

To understand the concept of probability.

Specific Objectives

At the end of the unit you should be able to:

Define Classical Probability and state the concept of Experiments
and Events

Define the following events:
- Conditional Events
- Independent Events
- Mutually Exclusive Events

List the elements in a sample space
Find the probability of an event based on classical probability
Use set theory to explain:
- Venn diagrams
- Complement of a set
- Union of sets
- Intersection of sets
- Null sets

UNIT 1


Probability as a general concept can be
defined as the chance of an event occurring. Probability
is the basis of inferential statistics. For example,
predictions are based on probability, and hypotheses are
tested by using probability.

1.0 INTRODUCTION

So much in people's lives is affected by chance. From the time
a person awakes until he or she goes to bed, that person makes
decisions regarding possible events that are governed at least in
part by chance. For example:
Should I carry an umbrella to class today?
Will my motorcycle battery last until the end of the semester?
Should I accept a new job?

INPUT

1.1 PROBABILITY

A probability experiment is a chance process that leads to well-defined
results called outcomes.
An outcome is the result of a single trial of a probability experiment.
A sample space is the set of all possible outcomes of a probability
experiment.
An event consists of a set of outcomes of a probability experiment.

Some sample spaces for various probability experiments are shown here.

Experiment                      Sample space
Toss one coin                   Head, Tail
Roll a die                      1, 2, 3, 4, 5, 6
Answer a true-false question    True, False
Toss two coins                  Head-head, tail-tail, head-tail, tail-head
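
The sample spaces in the table can also be listed mechanically. The following is a minimal sketch in Python, assuming each experiment is built from independent stages (one stage per coin or die); the variable names are illustrative only and are not part of the module.

```python
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Tossing two coins: every ordered pair of coin faces.
two_coins = list(product(coin, repeat=2))

# Rolling two dice: every ordered pair (die 1, die 2).
two_dice = list(product(die, repeat=2))

# Tossing a coin and rolling a die: every pair (coin face, die face).
coin_and_die = list(product(coin, die))

print(two_coins)
print(len(two_dice), len(coin_and_die))
```

Counting the listed outcomes gives 4 for two coins, 36 for two dice, and 12 for a coin together with a die.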

Example 1.1

1. a) Find the sample space for rolling two dice.
b) Find the sample space for the gender of the children if a family has three
children. Use B for boy and G for girl.

2. Use a tree diagram to find the sample space for the gender of three children
in a family, as in question 1(b).

Solution to Example 1.1

1. a

Since each die can land in six different ways, and two dice are rolled, the
sample space can be represented by a rectangular array of 6 x 6 = 36 pairs.
The sample space is the list of pairs of numbers in the chart.

                        Die 2
Die 1    1       2       3       4       5       6
1      (1, 1)  (1, 2)  (1, 3)  (1, 4)  (1, 5)  (1, 6)
2      (2, 1)  (2, 2)  (2, 3)  (2, 4)  (2, 5)  (2, 6)
3      (3, 1)  (3, 2)  (3, 3)  (3, 4)  (3, 5)  (3, 6)
4      (4, 1)  (4, 2)  (4, 3)  (4, 4)  (4, 5)  (4, 6)
5      (5, 1)  (5, 2)  (5, 3)  (5, 4)  (5, 5)  (5, 6)
6      (6, 1)  (6, 2)  (6, 3)  (6, 4)  (6, 5)  (6, 6)

1. b

There are two genders, male and female, and each child could be either
gender. Hence there are eight possibilities, as shown here.

BBB BBG BGB GBB GGG GGB GBG BGG

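2. Tree diagram for the gender of three children:

First child   Second child   Third child   Outcome
B             B              B             BBB
                             G             BBG
              G              B             BGB
                             G             BGG
G             B              B             GBB
                             G             GBG
              G              B             GGB
                             G             GGG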

ACTIVITY 1A

TEST YOUR UNDERSTANDING BEFORE YOU CONTINUE WITH THE NEXT
INPUT!

1. Find the sample space for tossing two coins.

2. A die is rolled and a coin is tossed. Show the sample space.


FEEDBACK TO ACTIVITY 1A

1. S = {(H, H), (H, T), (T, H), (T, T)}

2.
Coin \ Die   1       2       3       4       5       6
Head (H)     (H, 1)  (H, 2)  (H, 3)  (H, 4)  (H, 5)  (H, 6)
Tail (T)     (T, 1)  (T, 2)  (T, 3)  (T, 4)  (T, 5)  (T, 6)

1.2 CERTAIN AND COMPLEMENTARY EVENTS

When a die is rolled, the sample space is S = {1, 2, 3, 4, 5, 6}. Now let us define
event A as "the number 1 appears on the die's surface". The complement of A
(written as A' or Ā) consists of all the numbers on the die's surface excluding 1;
therefore A' = {2, 3, 4, 5, 6}.

Events can be presented pictorially by Venn diagrams. Figure (a) shows a simple
event E. The area inside the rectangle represents all the outcomes in the sample
space (S). Figure (b) shows the complement of an event (E'), which is the area
inside the rectangle but outside the circle representing E.

[Fig. (a): Venn diagram of event E inside the sample space S. Fig. (b): Venn diagram of the complement E' inside S.]

INPUT

1.2.1 SET DESIGNATIONS

1. The sample space S is represented by a rectangle containing its elements. Any
event E is represented by a circle containing its elements.

2. E' or Ē is the complement of E. E' means event E did not occur.

3. E1 ∪ E2 means either E1 or E2 or both have occurred.

4. E1 ∩ E2 means both E1 and E2 have occurred.

5. E1 and E2 are two mutually exclusive events if E1 ∩ E2 = ∅. They
have no shared outcomes.

6. E1, E2, E3, ..., En are mutually exclusive and exhaustive if and only if
i) Ei ∩ Ej = ∅ for every i ≠ j,
ii) E1 ∪ E2 ∪ E3 ∪ ... ∪ En = S

[Venn diagrams illustrating these designations appeared here: the complement E', the union E1 ∪ E2, the intersection E1 ∩ E2, two mutually exclusive events E1 and E2, and a sample space S divided into E1 to E5.]

1.2.2 SET IDENTITIES

The following identities can be used if there is a need.

 1. A ∪ A = A              8. A ∩ B = B ∩ A
 2. A ∪ ∅ = A              9. A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
 3. A ∪ S = S             10. A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
 4. A ∩ A = A             11. (A ∪ B)' = A' ∩ B'
 5. A ∪ B = B ∪ A         12. (A ∩ B)' = A' ∪ B'
 6. A ∩ S = A             13. A ∪ B = A ∪ (A' ∩ B)
 7. A ∩ ∅ = ∅             14. B = (A ∩ B) ∪ (A' ∩ B)

Example 1.2

1. A = {1, 2, 3, 4, 5, 6} and B = {1, 3, 6, 7}, find:

i) A ∩ B
ii) A ∪ B
iii) n(A ∪ B)

2. A = {1, 2, 3, 4, 5, 6}, B = {2, 4, 6} and C = {3, 5, 7, 9}, find:

i) A ∩ B
ii) A ∩ C
iii) B ∩ C
iv) A ∩ B ∩ C
v) n(A ∪ B ∪ C)

Solution to Example 1.2

1. i) A ∩ B = {1, 3, 6}
ii) A ∪ B = {1, 2, 3, 4, 5, 6, 7}
iii) n(A ∪ B) = n(A) + n(B) - n(A ∩ B) = 6 + 4 - 3 = 7

*If A, B and C are finite sets, then:

n(A ∪ B ∪ C) = n(A) + n(B) + n(C) - n(A ∩ B) - n(A ∩ C) - n(B ∩ C) + n(A ∩ B ∩ C)

2. i) A ∩ B = {2, 4, 6}
ii) A ∩ C = {3, 5}
iii) B ∩ C = { } = ∅
iv) A ∩ B ∩ C = { } = ∅
v) n(A ∪ B ∪ C)
= n(A) + n(B) + n(C) - n(A ∩ B) - n(A ∩ C) - n(B ∩ C) + n(A ∩ B ∩ C)
= 6 + 3 + 4 - 3 - 2 - 0 + 0
= 8
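
The counting formula for three sets can be checked against a direct count. The sketch below uses Python's built-in set type with the sets from question 2 of Example 1.2; the function name `union_size` is an illustrative assumption, not something defined in this module.

```python
A = {1, 2, 3, 4, 5, 6}
B = {2, 4, 6}
C = {3, 5, 7, 9}

def union_size(a, b, c):
    """n(A ∪ B ∪ C) computed with the three-set counting formula."""
    return (len(a) + len(b) + len(c)
            - len(a & b) - len(a & c) - len(b & c)
            + len(a & b & c))

print(A & B)                 # intersection of A and B
print(A & C)                 # intersection of A and C
print(B & C)                 # empty set: B and C share no elements
print(union_size(A, B, C))   # size from the formula
print(len(A | B | C))        # size from listing the union directly
```

In Python, `&` is set intersection and `|` is set union, so the last two printed values should agree.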


ACTIVITY 1B

TEST YOUR UNDERSTANDING BEFORE YOU CONTINUE WITH THE NEXT
INPUT!

1. Find the sample space for tossing two coins.

2. Nasir normally has one type of drink after lunch every day. He randomly
drinks tea, coffee or simply plain water. If event A represents Nasir having
one type of drink after lunch, list the elements in the sample space
S and event A. Find the relationship between the sample space and the set
A.

3. If n(A ∪ B) = 45, n(A ∩ B) = 5, and n(B) = 22, find n(A).


FEEDBACK TO ACTIVITY 1B

1. S = {HH, HT, TH, TT}

2. S = {tea, coffee, water}, A = {tea, coffee, water}, therefore S = A

3. n(A) = n(A ∪ B) - n(B) + n(A ∩ B) = 45 - 22 + 5 = 28


1.3 CLASSICAL PROBABILITY

Classical probability uses sample spaces to determine the numerical probability
that an event will happen. One does not actually have to perform the experiment
to determine that probability. Classical probability assumes that all outcomes in
the sample space are equally likely to occur. For example, when a single die is
rolled, each outcome has the same probability of occurring. Since there are six
outcomes, each outcome has a probability of 1/6.

Equally likely events are events that have the same probability of
occurring.

Formula for Classical Probability

The probability of any event E is

    Number of outcomes in E
    --------------------------------------------
    Total number of outcomes in the sample space

This probability is denoted by

    P(E) = n(E) / n(S)

This probability is called classical probability, and it uses
the sample space S.

INPUT

Probability Rule 1
The probability of any event E is a number (either a fraction or decimal) between
and including 0 and 1. This is denoted by 0 ≤ P(E) ≤ 1.
Probability Rule 2
If an event E cannot occur (i.e., the event contains no members in the sample
space), the probability is zero.
Probability Rule 3
If an event E is certain, then P(E) = 1.
Probability Rule 4
The sum of the probabilities of the outcomes in the sample space is 1.
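
The classical formula P(E) = n(E) / n(S) and the four rules can be tried out on the single-die sample space. This is a minimal sketch that assumes equally likely outcomes; the function name `classical_probability` is hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(E) = n(E) / n(S), assuming all outcomes in S are equally likely."""
    space = set(sample_space)
    favourable = set(event) & space   # outcomes outside S cannot occur
    return Fraction(len(favourable), len(space))

die = {1, 2, 3, 4, 5, 6}

p_even = classical_probability({2, 4, 6}, die)   # an ordinary event
p_nine = classical_probability({9}, die)         # Rule 2: impossible event
p_certain = classical_probability(die, die)      # Rule 3: certain event
total = sum(classical_probability({x}, die) for x in die)  # Rule 4

print(p_even, p_nine, p_certain, total)
```

Using `Fraction` keeps the results exact, so the sum in Rule 4 comes out as exactly 1 rather than a rounded decimal.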

The next example illustrates the probability rules.

Example 1.3

1. If a family has three children, find the probability that all the children are girls.

2. When a single die is rolled, find the probability of getting a 9.

3. When a single die is rolled, what is the probability of getting a number less
than 7?


Solution to Example 1.3

1. The sample space for the gender of children for a family that has three
children is BBB, BBG, BGB, GBB, GGG, GGB, GBG, and BGG (see the
tree diagram in the solution to Example 1.1). Since there is one way in eight
possibilities for all three children to be girls,

P(GGG) = 1/8

2. Since the sample space is 1, 2, 3, 4, 5, and 6, it is impossible to get a 9.
Hence, the probability is P(9) = 0/6 = 0

3. Since all outcomes, 1, 2, 3, 4, 5, and 6, are less than 7, the probability is

P(number less than 7) = 6/6 = 1

** (Rule 4) For example, in a roll of a fair die, each outcome in the sample
space has a probability of 1/6. Hence, the sum of the probabilities of the
outcomes is as shown.

Outcome        1     2     3     4     5     6
Probability   1/6   1/6   1/6   1/6   1/6   1/6

Sum = 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1

ACTIVITY 1C

TEST YOUR UNDERSTANDING BEFORE YOU CONTINUE WITH THE NEXT
INPUT!

1. What is the probability of throwing a number greater than 4 with a die whose
faces are numbered from 1 to 6?

2. In a competition a prize is given for correctly forecasting the results of six
football matches. If a competitor sends in ten different forecasts, what is the
probability, that he receives the prize?

3. a) One red and one black marble are concealed in a bag. Find the
probability of drawing a red marble.
b) When three red and one black marbles are placed in the bag, find the
probability of drawing one red marble. What is the probability of drawing
one black marble?

4. A box contains 132 rivets of which 23 are undersized, 47 are oversized and
62 are satisfactory. Determine the probability of drawing at random:
(a) one undersized; (b) one oversized; and
(c) one satisfactory rivet from the box.

5. Four hundred resistors are examined and 6% are found to be defective.
Determine the probability that one selected at random will be defective and
also the probability that it will not be defective.

6. A purse contains 7 copper and 13 silver coins. Determine the probability of
selecting a copper coin when one is taken at random.

7. Determine the probability of winning a prize in a raffle by buying 3 tickets,
when there are 7 prizes and a total of 450 tickets sold.

8. Determine the probability of an event not happening when the probability of it
happening is 7/93.


FEEDBACK TO ACTIVITY 1C

1. 1/3 or 0.33 or 33 1/3 %

2. 0.0137 or 1.37%

3. (a) 1/2, (b) 3/4 and 1/4

4. (a) 23/132 (b) 47/132 (c) 31/66

5. 3/50 (defective), 47/50 (not defective)

6. 7/20

7. 7/150

8. 86/93

1.4 COMPLEMENTARY EVENTS

Formula for empirical probability
Given a frequency distribution, the probability of an event being in a given
class is

    P(E) = f / n

where f is the frequency for the class and n is the total of the
frequencies in the distribution.

The complement of an event E is the set of outcomes in the sample space
that are not included in the outcomes of event E. The complement of E is
denoted by E' or Ē (read "E bar").

Rule for complementary events

    P(Ē) = 1 - P(E)

INPUT

Example 1.4

1. 25 students were asked if they like this module. The responses were
classified as yes, no, or undecided. The results were categorized in a
frequency distribution, as shown. Find the probability that a person
responded no.

Response Frequency
Yes 15
No 8
Undecided 2
Total 25

2. In a sample of 50 students, 21 had type O blood, 22 had type A blood, 5
had type B blood, and 2 had type AB blood. Set up a frequency
distribution and find the following probabilities:
a. A student has type O blood
b. A person has type A or type B blood
c. A person has neither type A nor type O blood
d. A person does not have type AB blood

3. Hostel records indicated that students stayed in the hostel for the number
of days during school break shown in the distribution.

Number of days stayed   Frequency
3                       15
4                       32
5                       56
6                       19
7                       5
Total                   127

Find the probabilities.
a. A student stayed exactly 5 days.
b. A student stayed less than 6 days.
c. A student stayed at most 4 days.
d. A student stayed at least 5 days.

4. Based on the Venn diagram below, A and B are two events in the sample
space S. Find:

a. P(A)          g. P(A' ∩ B)
b. P(A')         h. P(A' ∩ B')
c. P(B)          i. P(A ∪ B)
d. P(B')         j. P(A ∪ B')
e. P(A ∩ B)      k. P(A' ∪ B)
f. P(A ∩ B')     l. P(A' ∪ B')

[Venn diagram: sample space S with overlapping events A and B; 35 outcomes in A only, 5 in both A and B, 20 in B only, and 40 outside both A and B.]


Solution to Example 1.4

1. P(No) = f/n = 8/25

2.
Type     Frequency
A        22
B        5
AB       2
O        21
Total    50

a. P(O) = f/n = 21/50
b. P(A or B) = 22/50 + 5/50 = 27/50 (Add the frequencies of the two
classes.)
c. P(neither A nor O) = 5/50 + 2/50 = 7/50 (Neither A nor O means that a
student has either type B or type AB blood.)
d. P(not AB) = 1 - P(AB) = 1 - 2/50 = 48/50 = 24/25

3. a) P(5) = 56/127

b) P(less than 6 days) = 15/127 + 32/127 + 56/127 = 103/127 (Less than
6 days means either 3, or 4, or 5 days.)

c) P(at most 4 days) = 15/127 + 32/127 = 47/127 (At most 4 days means 3 or 4
days.)

d) P(at least 5 days) = 56/127 + 19/127 + 5/127 = 80/127 (At least 5 days means
either 5, or 6, or 7 days.)

4. n(S) = 35 + 5 + 20 + 40 = 100

a. n(A) = 40 and P(A) = n(A)/n(S) = 40/100 = 2/5

b. n(A') = 60 and P(A') = n(A')/n(S) = 60/100 = 3/5

c. n(B) = 25 and P(B) = n(B)/n(S) = 25/100 = 1/4

d. n(B') = 75 and P(B') = n(B')/n(S) = 75/100 = 3/4

e. n(A ∩ B) = 5 and P(A ∩ B) = n(A ∩ B)/n(S) = 5/100 = 1/20

f. n(A ∩ B') = 35 and P(A ∩ B') = n(A ∩ B')/n(S) = 35/100 = 7/20

g. n(A' ∩ B) = 20 and P(A' ∩ B) = n(A' ∩ B)/n(S) = 20/100 = 1/5

h. n(A' ∩ B') = 40 and P(A' ∩ B') = n(A' ∩ B')/n(S) = 40/100 = 2/5

i. n(A ∪ B) = 60 and P(A ∪ B) = n(A ∪ B)/n(S) = 60/100 = 3/5

j. n(A ∪ B') = 80 and P(A ∪ B') = n(A ∪ B')/n(S) = 80/100 = 4/5

k. n(A' ∪ B) = 65 and P(A' ∪ B) = n(A' ∪ B)/n(S) = 65/100 = 13/20

l. n(A' ∪ B') = 95 and P(A' ∪ B') = n(A' ∪ B')/n(S) = 95/100 = 19/20
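
The empirical formula P(E) = f / n and the rule for complementary events can be applied directly to a frequency table. The sketch below uses the blood-type frequencies from question 2 of Example 1.4; the dictionary layout and the function name `empirical_probability` are illustrative assumptions.

```python
from fractions import Fraction

# Frequency distribution from question 2 of Example 1.4.
blood_types = {"A": 22, "B": 5, "AB": 2, "O": 21}

def empirical_probability(freq, classes):
    """P(E) = f / n, where f sums the frequencies of the chosen classes."""
    n = sum(freq.values())
    f = sum(freq[c] for c in classes)
    return Fraction(f, n)

p_o = empirical_probability(blood_types, ["O"])
p_a_or_b = empirical_probability(blood_types, ["A", "B"])
p_neither_a_nor_o = empirical_probability(blood_types, ["B", "AB"])
# Rule for complementary events: P(not AB) = 1 - P(AB).
p_not_ab = 1 - empirical_probability(blood_types, ["AB"])

print(p_o, p_a_or_b, p_neither_a_nor_o, p_not_ab)
```

The printed fractions are reduced automatically, so 48/50 appears as 24/25.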


ACTIVITY 1D

TEST YOUR UNDERSTANDING BEFORE YOU CONTINUE WITH THE NEXT
INPUT!

1. If there are 50 tickets sold at a raffle and one person buys 7 tickets, what is
the probability of that person winning a prize?

2. A survey found that 53% of Polytechnic students think this module is the best
of all the modules ever published. If a student is selected at random, find the
probability that he or she will disagree or have no opinion at all.

3. A couple has three children. Find each probability.
a) Of all boys
b) Of all girls
c) Of exactly two boys or two girls
d) Of at least one child of each gender


FEEDBACK TO ACTIVITY 1D

1. 7/50

2. 47%

3. a. 1/8 b. 1/8 c. 3/4 d. 3/4

1.5 THE ADDITION RULES FOR PROBABILITY

Many problems involve finding the probability of two or more events. For
example, at a class gathering, one might wish to know the probability that a
person selected at random is a female or is wearing glasses. In this case there are
three possibilities to consider.
1. The person is a female.
2. The person is wearing glasses.
3. The person is a female and she is wearing glasses.

Consider another example. In the same gathering there are male and female
students. If a person is selected at random, what is the probability that the person
is a male or a female student? In this case, there are only two possibilities:
1. The person is a female.
2. The person is a male.

The difference between the two examples is that in the first case, the person
selected can be a female and is wearing glasses at the same time. In the second
case, the person selected cannot be both a female and a male at the same time.
In the second case, the two events are said to be mutually exclusive; in the first
case, they are not mutually exclusive.

INPUT

Two events are mutually exclusive if they cannot occur at the same time (i.e.,
they have no outcomes in common).

Addition Rule 1
When two events A and B are mutually exclusive, the probability that A or B
will occur is
P(A or B) = P(A) + P(B)

Addition Rule 2

If A and B are not mutually exclusive, then

P(A or B) = P(A) + P(B) - P(A and B)

Example 1.5

1. A restaurant has 3 pieces of chicken karipap, 5 pieces of potato karipap and
4 pieces of fish karipap. If a customer selects a piece of karipap for dessert,
find the probability that it will be either potato or fish.

2. There are 20 buffaloes, 13 cows and 6 goats in a lorry. If an animal is selected
at random, find the probability that that animal is either a cow or a goat.

3. A day of the week is selected at random; find the probability that it is a
weekend day (Saturday or Sunday).

4. In a hospital unit there are eight nurses and five doctors. Seven nurses and
three doctors are females. If a staff person is selected, find the probability that the
subject is a nurse or a male.


Solution to Example 1.5

1. Since there are 12 pieces of karipap,
P(potato or fish) = P(potato) + P(fish) = 5/12 + 4/12 = 9/12 = 3/4
The events are mutually exclusive.

2. P(cow or goat) = P(cow) + P(goat) = 13/39 + 6/39 = 19/39

3. P(Saturday or Sunday) = P(Saturday) + P(Sunday) = 1/7 + 1/7 = 2/7

4. The sample space is shown below:

Staff      Females   Males   Total
Nurses     7         1       8
Doctors    3         2       5
Total      10        3       13

The probability is
P(nurse or male) = P(nurse) + P(male) - P(male nurse)
= 8/13 + 3/13 - 1/13 = 10/13
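
Addition Rule 2 can be compared with a direct count over the staff table. This is a minimal sketch using the hospital data above; the dictionary keys and the helper name `prob` are hypothetical choices made for illustration.

```python
from fractions import Fraction

# Staff counts from the table: (role, gender) -> number of people.
staff = {
    ("nurse", "female"): 7,
    ("nurse", "male"): 1,
    ("doctor", "female"): 3,
    ("doctor", "male"): 2,
}
total = sum(staff.values())

def prob(predicate):
    """Probability that a randomly selected staff member satisfies predicate."""
    count = sum(n for key, n in staff.items() if predicate(key))
    return Fraction(count, total)

p_nurse = prob(lambda key: key[0] == "nurse")
p_male = prob(lambda key: key[1] == "male")
p_male_nurse = prob(lambda key: key == ("nurse", "male"))

# Addition Rule 2: subtract the overlap so male nurses are not counted twice.
p_rule = p_nurse + p_male - p_male_nurse
# Direct count of everyone who is a nurse or a male.
p_direct = prob(lambda key: key[0] == "nurse" or key[1] == "male")

print(p_rule, p_direct)
```

Both values should be the same, which is the point of subtracting P(A and B) in the rule.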


ACTIVITY 1E

TEST YOUR UNDERSTANDING BEFORE YOU CONTINUE WITH THE NEXT
INPUT!

1. At a convention there are seven mathematics instructors, five computer
science instructors, three statistics instructors, and four science instructors. If
an instructor is selected, find the probability of getting a science instructor or a
math instructor.

2. In a statistics class there are 18 juniors and 10 seniors; 6 of the seniors are
females, and 12 of the juniors are males. If a student is selected at random,
find the probabilities of selecting the following:

a. a junior or a female
b. A senior or a female
c. A junior or a senior

3. A women's clothing store owner buys from three companies: A, B, and C. The
most recent purchases are shown here.

Product    A    B    C
Dresses    24   18   12
Blouses    13   36   15

If one item is selected at random, find the following probabilities.
a. It was purchased from company A or is a dress.
b. It was purchased from company B or company C.
c. It is a blouse or it was purchased from company A.


4. A grocery store employs cashiers, stock clerks, and deli personnel. The
distribution of employees according to marital status is shown next.

Marital status   Cashiers   Stock clerks   Deli personnel
Married          8          12             3
Not married      5          15             2

If an employee is selected at random, find these probabilities:
a. The employee is a stock clerk or married.
b. The employee is not married.
c. The employee is a cashier or is not married.

5. RTM, TV3, and NTV7 have quiz shows, comedies, and dramas. The number
of each is shown here.

Type of show   RTM   TV3   NTV7
Quiz show      5     2     1
Comedy         3     2     8
Drama          4     4     2

If a show is selected at random, find these probabilities.
a. The show is a quiz show or it is shown on TV3.
b. The show is a drama or a comedy.
c. The show is shown on NTV7 or it is a drama.


FEEDBACK TO ACTIVITY 1E

1. 11/19

2. a. 6/7 b. 4/7 c. 1

3. a. 67/118 b. 81/118 c. 44/59

4. a. 38/45 b. 22/45 c. 2/3

5. a. 14/31 b. 23/31 c. 19/31

1.6 THE MULTIPLICATION RULES AND CONDITIONAL
PROBABILITY

The previous section showed that the addition rules are used to compute
probabilities for mutually exclusive and not mutually exclusive events. This
section introduces two more rules, the multiplication rules. These rules can be
used to find the probability of two or more events that occur in sequence. For
example, if a coin is tossed and then a die is rolled, one can find the probability of
getting a head on the coin and a 4 on the die. These two events are said to be
independent since the outcome of the first event (tossing a coin) does not affect
the probability of the outcome of the second event (rolling a die).

In order to find the probability of two independent events that occur in sequence,
one must find the probability of each event occurring separately and then multiply
the answers. For example, if a coin is tossed twice, the probability of getting two
heads is 1/2 · 1/2 = 1/4. The result can be verified by looking at the sample space, HH,
HT, TH, and TT. Then P(HH) = 1/4.

Two events A and B are independent if the fact that A occurs does not affect the
probability of B occurring.

INPUT

Multiplication Rule 1

When two events are independent, the probability of both
occurring is
P(A and B) = P(A) · P(B)

Example 1.6

1. A coin is flipped and a die is rolled. Find the probability of getting a head on
the coin and a 4 on the die.

2. An urn contains three red balls, two blue balls, and five white balls. A ball is
selected and its color noted. Then it is replaced. A second ball is selected and
its color noted. Find the probability of each of the following.

a. Selecting two blue balls.
b. Selecting a blue ball and then a white ball.
c. Selecting a red ball and then a blue ball.

3. A poll found that 46% of students say they have suffered great stress at least
once in the exam week. If three students are selected at random, find the
probability that all three will say that they have suffered great stress at least once in
the exam week.


Solution to Example 1.6

1. P(head and 4) = P(head) · P(4) = 1/2 · 1/6 = 1/12. Note that the sample space for the
coin is H, T; and for the die it is 1, 2, 3, 4, 5, 6.

2. a. P(blue and blue) = P(blue) · P(blue) = 2/10 · 2/10 = 4/100 = 1/25
b. P(blue and white) = P(blue) · P(white) = 2/10 · 5/10 = 10/100 = 1/10
c. P(red and blue) = P(red) · P(blue) = 3/10 · 2/10 = 6/100 = 3/50

3. Let S denote stress, then P(S and S and S) =
P(S) · P(S) · P(S) = (0.46)(0.46)(0.46) = 0.097
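
Multiplication Rule 1 can be checked by listing every ordered pair of draws. The sketch below models the urn from question 2 (three red, two blue, five white) drawn with replacement; the list `urn` and the function `p_sequence` are illustrative names, not part of the module.

```python
from fractions import Fraction
from itertools import product

# Urn from question 2 of Example 1.6.
urn = ["red"] * 3 + ["blue"] * 2 + ["white"] * 5

def p_sequence(first, second):
    """P(first colour, then second colour) when the ball is replaced."""
    draws = list(product(urn, repeat=2))  # 10 x 10 equally likely ordered pairs
    hits = sum(1 for a, b in draws if a == first and b == second)
    return Fraction(hits, len(draws))

p_blue = Fraction(urn.count("blue"), len(urn))
p_white = Fraction(urn.count("white"), len(urn))

# Counting pairs agrees with multiplying the separate probabilities.
print(p_sequence("blue", "white"), p_blue * p_white)
print(p_sequence("blue", "blue"), p_sequence("red", "blue"))
```

Because the first ball is replaced, the second draw sees the same ten balls, which is why the two events are independent.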

ACTIVITY 1F

TEST YOUR UNDERSTANDING BEFORE YOU CONTINUE WITH THE NEXT
INPUT!

1. Two balls are drawn in turn with replacement from a bag containing 8 red
balls, 15 white balls, 24 black balls and 17 orange balls. Determine the
probabilities of having:
(a) two red balls;
(b) a red and a white ball;
(c) no orange balls;
(d) a black and red or black and orange ball;
(e) at least one black ball;
(f) a white ball on the first draw but the second ball not white.

2. The probability of three events happening are 1/8 for event A, 1/5 for
event B, and 2/7 for event C. Determine:
(a) the probability of all three events happening;
(b) the probability of event A and B but not C happening;
(c) the probability of only event B happening; and
(d) the probability of event A or event B happening but not
event C.

3. One bag contains 3 red and 5 black marbles and a second bag contains 4
green and 7 white marbles. One marble is drawn from the first bag and
two marbles from the second bag, without replacement. Determine the
probability of having;
(a) one red and two white marbles;
(b) no green marbles; and
(c) either one black and two green or one black and two white
marbles.


FEEDBACK TO ACTIVITY 1F

1. (a) 1/64
(b) 15/256
(c) 2209/4096
(d) 75/256
(e) 39/64
(f) 3807/4096
(g) 735/4096

2. (a) 1/140
(b) 1/56
(c) 1/8
(d) 13/56

3. (a) 63/440
(b) 21/55
(c) 27/88

1.7 CONDITIONAL PROBABILITY

In the previous section, the events were independent of each other, since the
occurrence of the first event in no way affected the outcome of the second event.
On the other hand, when the occurrence of the first event changes the probability
of the second event, the two events are said to be dependent. Refer to Example 1.6,
question 2(b). The probability of selecting the blue ball is 2/10. If the ball is not replaced,
then the probability of selecting the second (white) ball is 5/9, since there are only
nine balls remaining.

The conditional probability of an event B in relationship to event A is the
probability that event B occurs after event A has already occurred. The notation
for conditional probability is P(B/A). This notation does not mean that B is
divided by A; rather, it means the probability that event B occurs given that event
A has already occurred. In the ball example (Example 1.6, question 2(b)), P(B/A) is 5/9
since the first ball was not replaced.

When the outcome or occurrence of the first event affects the outcome or
occurrence of the second event in such a way that the probability is changed, the
events are said to be dependent.

INPUT

Multiplication Rule 2
When two events are dependent, the probability of both occurring is
P(A and B) = P(A) · P(B/A)

The conditional probability of an event B in relationship to an event A
was defined as the probability that event B occurs after event A has already
occurred. It can be found by dividing both sides of the equation for Multiplication
Rule 2 by P(A), as shown:

P(A and B) = P(A) · P(B/A), therefore P(B/A) = P(A and B) / P(A)

The Venn diagram for conditional probability is shown in the figure
below. In this case,

P(B/A) = P(A and B) / P(A)

which is represented by the area in the intersection or overlapping part of the
circles A and B divided by the area of circle A. The reasoning here is that if one
assumes A has occurred, then A becomes the sample space for the next calculation
and P(A) is the denominator of the probability fraction P(A and B) / P(A). The
numerator P(A and B) represents the probability of the part of B contained in A.
Hence, P(A and B) becomes the numerator of the probability fraction.
Imposing a condition reduces the sample space.
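
The relation P(B/A) = P(A and B) / P(A) can be illustrated with the ball example drawn without replacement. The following sketch assumes the urn of three red, two blue and five white balls and takes A = "first ball is blue" and B = "second ball is white"; the names `urn`, `draws` and `prob` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

urn = ["red"] * 3 + ["blue"] * 2 + ["white"] * 5

# Without replacement: ordered pairs of two different balls (10 x 9 pairs).
draws = list(permutations(range(len(urn)), 2))

def prob(condition):
    """Probability that (first colour, second colour) satisfies condition."""
    hits = sum(1 for i, j in draws if condition(urn[i], urn[j]))
    return Fraction(hits, len(draws))

p_a = prob(lambda first, second: first == "blue")
p_a_and_b = prob(lambda first, second: first == "blue" and second == "white")

# Conditional probability: restrict attention to the draws where A occurred.
p_b_given_a = p_a_and_b / p_a

print(p_a, p_a_and_b, p_b_given_a)
```

Dividing by P(A) is the same as counting only the pairs whose first ball is blue, which is the reduced sample space described above.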


[Figure: Venn diagram for conditional probability, with regions labelled P(A), P(A and B) and P(B); P(B/A) = P(A and B) / P(A).]

1.7.1 PROBABILITIES FOR "AT LEAST"

The multiplication rules can be used with the rule for complementary events
to simplify solving probability problems involving "at least". The next examples
illustrate how this is done.

Example 1.7

1. In a shipment of 25 radios, 2 are defective. If two radios are randomly selected
and tested, find the probability that both are defective if the first one is not
replaced after it has been tested.

2. Box 1 contains two red balls and one blue ball. Box 2 contains three blue balls
and one red ball. A coin is tossed. If it falls heads up, box 1 is selected and a
ball is drawn. If it falls tails up, box 2 is selected and a ball is drawn. Find the
probability of selecting a red ball.

3. A box contains white chips and black chips. A person selects two chips without
replacement. If the probability of selecting a black chip and a white chip is 15/56,
and the probability of selecting a black chip on the first draw is 3/8, find the
probability of selecting the white chip on the second draw, given that the first
chip selected was a black chip.

4. The probability Ali parks in a no-parking zone and gets a parking ticket is 0.06,
and the probability that Ali cannot find a legal parking space and has to park in
the no-parking zone is 0.20. On Monday, Ali arrives at school and has to park
in a no-parking zone. Find the probability that he will get a parking ticket.

5. A recent survey asked 100 people if they thought Sardin Cap Ayam is the best
sardin. The results of the survey are shown in the table.

Gender Yes No Total
Male 32 18 50
Female 8 42 50
Total 40 60 100

Find these probabilities.
a. The respondent answered yes, given that the respondent was a
female.
b. The respondent was a male, given that the respondent answered
no.



6. A coin is tossed five times. Find the probability of getting at least one tail.

7. A survey reported that 3% of pens sold in the Politeknik are Pilot pens. If 4
students who purchased a pen are randomly selected, find the probability
that at least one purchased a Pilot pen.

8. A coin is tossed three times. Find the probability of getting
i) Exactly 2 tails
ii) At least 2 tails


Solution to Example 1.7

1. Since the events are dependent,
P(D1 and D2) = P(D1) · P(D2/D1) = (2/25)(1/24) = 2/600 = 1/300

2. With the use of a tree diagram, the sample space can be determined as
shown below. First assign the probabilities to each branch. Next, using
the multiplication rule, multiply the probabilities for each branch.

Box                   Ball                      Probability
Box 1, P(B1) = 1/2    Red,  P(R/B1) = 2/3       1/2 · 2/3 = 2/6
                      Blue, P(B/B1) = 1/3       1/2 · 1/3 = 1/6
Box 2, P(B2) = 1/2    Red,  P(R/B2) = 1/4       1/2 · 1/4 = 1/8
                      Blue, P(B/B2) = 3/4       1/2 · 3/4 = 3/8

Finally, use the addition rule, since a red ball can be obtained from box 1 or
box 2:
P(red) = 2/6 + 1/8 = 8/24 + 3/24 = 11/24

Tree diagrams can be used when the events are independent or dependent, and
they can also be used for sequences of three or more events.

3. Let B = selecting a black chip and W = selecting a white chip.
Then P(W/B) = P(B and W) / P(B) = (15/56) / (3/8) = 5/7
Hence the probability of selecting a white chip on the second draw, given
that the first chip selected was black, is 5/7.

4. Let N = parking in a no-parking zone and T = getting a ticket, then

P(T/N) = P(N and T) / P(N) = 0.06 / 0.20 = 0.30

Hence, Ali has a 0.30 probability of
getting a parking ticket, given that he parked in a no-parking zone.

5. Let M = respondent was a male, Y = respondent answered yes,
F = respondent was a female, N = respondent answered no.

a) The problem is to find P(Y/F). The rule states P(Y/F) = P(F
and Y) / P(F). The probability P(F and Y) is the number of
females who responded yes divided by the total number of
respondents: P(F and Y) = 8/100
The probability P(F) is the probability of selecting a female:
P(F) = 50/100

Then,

P(Y/F) = P(F and Y) / P(F) = (8/100) / (50/100) = 4/25

b) The problem is to find P(M/N).

P(M/N) = P(N and M) / P(N) = (18/100) / (60/100) = 3/10

6. It is easier to find the probability of the complement of the event, which is
all heads, and then subtract the probability from 1 to get the probability of
at least one tail.
P(E) = 1 - P(Ē)
P(at least 1 tail) = 1 - P(all heads)
P(all heads) = (1/2)^5 = 1/32

Hence,
P(at least one tail) = 1 - 1/32 = 31/32

7. Let E = at least one Pilot pen purchased and
Ē = no Pilot pen purchased.

For each student selected,
P(Pilot pen) = 0.03 and P(no Pilot pen) = 1 - 0.03 = 0.97

Since 4 students are selected,
P(no Pilot pen purchased) = (0.97)(0.97)(0.97)(0.97) = 0.885; hence, P(at
least one Pilot pen purchased) = 1 - 0.885 = 0.115

8. i) P(exactly 2 tails) = 1/8 + 1/8 + 1/8 = 3/8
ii) P(at least 2 tails) = P(2 tails and 1 head) + P(3 tails)
= 3/8 + 1/8 = 1/2
(## Draw a tree diagram to verify your answer.)
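
The "at least" answers can be cross-checked in two ways: with the complement rule and by listing all coin sequences. This is a minimal sketch assuming fair coins and independent tosses; the function names are illustrative and not part of the module.

```python
from fractions import Fraction
from itertools import product

def p_at_least_one_tail(n_tosses):
    """Complement rule: 1 - P(all heads) for n fair coin tosses."""
    return 1 - Fraction(1, 2) ** n_tosses

def p_at_least_k_tails(n_tosses, k):
    """Direct count over all 2**n equally likely head/tail sequences."""
    outcomes = list(product("HT", repeat=n_tosses))
    hits = sum(1 for outcome in outcomes if outcome.count("T") >= k)
    return Fraction(hits, len(outcomes))

# Question 6: five tosses, at least one tail (both methods).
print(p_at_least_one_tail(5), p_at_least_k_tails(5, 1))

# Question 8 ii): three tosses, at least two tails.
print(p_at_least_k_tails(3, 2))

# Question 7: four students, each with probability 0.03 of a Pilot pen.
p_pilot_at_least_one = 1 - 0.97 ** 4
print(round(p_pilot_at_least_one, 3))
```

The complement rule needs a single multiplication, while the direct count grows with 2^n sequences, which is why the rule is preferred for "at least one" problems.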


ACTIVITY 1G

TEST YOUR UNDERSTANDING BEFORE YOU CONTINUE WITH THE NEXT
INPUT!

1. The probabilities of an engine failing are given by: p1, failure due to
overheating; p2, failure due to ignition problems; p3, failure due to fuel
blockage. When p1 = 1/7, p2 = 2/9 and p3 = 3/11, determine the
probabilities of:
(a) both p1 and p2 happening;
(b) either p2 or p3 happening;
(c) both p1 and either p2 or p3 happening.

2. Actuarial tables show that the life expectancy of three men, A, B and C,
over a twenty-year period depends on their age and is given by PA = 4/15,
PB = 11/15 and PC = 14/15. Determine the probabilities that in twenty
years:
(a) all three men will be alive;
(b) A will be alive but B and C will be dead;
(c) At least one man will be alive.

3. When exploration for oil occurs, a test hole is drilled. If as a result of this
test drilling it seems likely that a really large quantity of oil exists (a bonanza),
then the well is said to have structure. Examination of past records reveals
the following information:

              Structure   No structure   Total
Bonanza       0.20        0.05           0.25
No bonanza    0.15        0.60           0.75
Total         0.35        0.65           1.00

a) Find P(Bonanza/structure)
b) Find P(No bonanza/structure)
c) Find P(Bonanza/no structure)
d) Find P(No bonanza/no structure)

4. Suppose we have 100 urns. Each type 1 urn (of which there are 70) contains
5 black and 5 white balls. Each type 2 urn (of which there are 30) contains 8
black and 2 white balls. An urn is randomly selected and a ball is drawn from
that urn. If the ball chosen was black, what is the probability the ball came
from a type 1 urn?


FEEDBACK TO ACTIVITY 1G

1. (a) 2/63 (b) 49/99 (c) 7/99

2. (a) 616/3375 (b) 16/3375 (c) 3331/3375

3. (a) 0.571 (b) 0.429 (c) 0.077 (d) 0.923

4. 0.593

SELF-ASSESSMENT 1

You are approaching success. Try all the questions in this self-assessment
section and check your answers with those given in the Feedback to
Self-Assessment 1 on the next page. If you have problems, consult your instructor.
Good luck.

1. State which events are independent and which are dependent.
i) Tossing a coin and throwing a die
ii) Drawing a ball from an urn, not replacing it, and then drawing a
second ball.
iii) Getting a raise in salary and purchasing a new car.
iv) Driving on ice and having an accident.
v) Having a large shoe size and having a high IQ.
vi) A father being left-handed and a daughter being left-handed.
vii) Smoking excessively and having lung cancer.
viii) Eating an excessive amount of ice cream and smoking an
excessive amount of cigarettes.

2. A survey found that 68% of book buyers are 40 or older. If two book buyers
are selected at random, find the probability that both are 40 or older.

3. A salesman finds that the probability of making a sale is 0.23. If he talks to four
customers today, find the probability that he will make four sales.

4. Find the probability of selecting two people at random who were born in the
same month.

5. If three people are selected, find the probability that all three were born in
January.

6. What is the probability that a husband, wife and daughter have the same
birthday?



7. A radio uses six batteries, two of which are defective. If two are selected at
random without replacement, find the probability that the first battery tests
good and the second one is defective.

8. Out of 120 students, 90 of them put on white t-shirts. If five students are
selected at random, one by one, find the probability that all will put on white t-
shirts.

9. Urn 1 contains five red balls and three black balls. Urn 2 contains three red
balls and one black ball. Urn 3 contains four red balls and two black balls. If
an urn is selected at random and a ball is drawn, find the probability it will be
red.

10. If a die is rolled three times, find the probability of getting at least one even
number.



FEEDBACK TO SELF-ASSESSMENT 1

Have you tried all the questions? If YES, check your answers now.

1. i) Independent      v) Independent
   ii) Dependent       vi) Dependent
   iii) Dependent      vii) Dependent
   iv) Dependent       viii) Independent

2. 0.462

3. 0.003

4. 1/12

5. 1/1728

6. 1/133,225

7. 4/15

8. 243/1024

9. 49/72

10. 7/8

PROBABILITY DISTRIBUTIONS C 5606/2/


UNIT 2

PROBABILITY DISTRIBUTIONS

OBJECTIVES

General Objective

To understand the concept of probability distributions

Specific Objectives

Construct a probability distribution for a random variable.
Find the mean, variance, and expected value for a discrete random
variable.
Find the exact probability for X successes in n trials of a binomial
experiment.
Find the mean, variance, and standard deviation for the variable of a
binomial distribution.
Fit a theoretical distribution given a frequency list.
Identify distributions as symmetrical or skewed.
Identify the properties of the normal distribution.
Find the area under the standard normal distribution, given various z
values
Find probabilities for a normally distributed variable by transforming it into
a standard normal variable.
Find specific data values for given percentages using the standard normal
distribution.



2.0 INTRODUCTION

A variable can be defined as a characteristic or attribute that can assume
different values. Various letters of the alphabet, such as X, Y, or Z, are used to
represent variables. Since the variables in this unit are associated with
probability, they are called random variables.

If a variable can assume only a specific number of values, such as the outcomes
for the roll of a die or the outcomes for the toss of a coin, then the variable is
called a discrete variable. Discrete variables have values that can be
counted.

Variables that can assume all values in the interval between any given two
values are called continuous variables. For example, if the temperature goes
from 60° to 75° in a 24-hour period, it has passed through every possible value
from 60 to 75. Continuous random variables are obtained from data that can
be measured rather than counted.

A random variable is a variable whose values are determined by chance.

INPUT

2.1 PROBABILITY DISTRIBUTION

The procedure shown here for constructing a probability distribution for a discrete
random variable uses the probability experiment of tossing three coins. Recall
that when three coins are tossed, the sample space is represented as TTT, TTH,
THT, HTT, HHT, HTH, THH, HHH, and if X is the random variable for the number
of heads, then X assumes the value 0, 1,2, or 3.

Probabilities for the value of X can be determined as follows:

No heads   One head        Two heads       Three heads
TTT        TTH  THT  HTT   HHT  HTH  THH   HHH
1/8        1/8  1/8  1/8   1/8  1/8  1/8   1/8
1/8        3/8             3/8             1/8

Hence, the probability of getting no heads is 1/8, one head is 3/8, two heads
is 3/8, and three heads is 1/8. From these values, a probability distribution can
be constructed by listing the outcomes and assigning the probability of each
outcome, as shown.

Number of heads, X   0     1     2     3
Probability, P(X)    1/8   3/8   3/8   1/8

A discrete probability distribution consists of the values a random variable can
assume and the corresponding probabilities of the values. The probabilities are
determined theoretically or by observation.
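
The three-coin distribution above can also be built by listing the sample space in code. The following Python sketch is one possible way to do it; the variable names are illustrative assumptions, and each of the eight outcomes is assumed to be equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 8 equally likely outcomes of tossing three coins.
sample_space = ["".join(toss) for toss in product("HT", repeat=3)]

# X = number of heads in each outcome; count how often each value occurs.
counts = Counter(outcome.count("H") for outcome in sample_space)

# P(X) = (number of outcomes with X heads) / (total number of outcomes).
distribution = {x: Fraction(counts[x], len(sample_space)) for x in sorted(counts)}

for x, p in distribution.items():
    print(x, p)
```

The same pattern works for any small experiment whose outcomes can be listed, such as rolling two dice.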



Example 2.1

1. Construct a probability distribution for rolling a single die.

2. Represent graphically the probability distribution for the sample space for
tossing three coins.

3. In the rainy months, a store selling umbrellas keeps track of the number of
umbrellas it sells each day during a period of 90 days. The number of
umbrellas sold per day is represented by the variable X. The results are
shown here.

X Number of days
0 45
1 30
2 15
Total 90

Compute the probability P(X) for each X, and construct a probability
distribution and graph for the data.

Two requirements for a Probability Distribution

1. The sum of the probabilities of all events in the sample space must equal 1;
that is, $\sum P(X) = 1$.
2. The probability of each event in the sample space must be between or
equal to 0 and 1; that is, $0 \le P(X) \le 1$.
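
The two requirements can be turned into a simple check. The function below is a minimal sketch; its name is_probability_distribution is an assumption made for illustration, and exact fractions are used so that the sum is compared with 1 without rounding error.

```python
from fractions import Fraction

def is_probability_distribution(probabilities):
    """Return True if every P(X) lies in [0, 1] and the P(X) values sum to 1."""
    each_in_range = all(0 <= p <= 1 for p in probabilities)
    sums_to_one = sum(probabilities) == 1
    return each_in_range and sums_to_one

# Three coins: 1/8 + 3/8 + 3/8 + 1/8 = 1, so both requirements hold.
print(is_probability_distribution(
    [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]))

# A value above 1 breaks the second requirement.
print(is_probability_distribution(
    [Fraction(12, 10), Fraction(3, 10), Fraction(5, 10)]))
```

If decimal probabilities are used instead of fractions, the comparison with 1 would need a small tolerance.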



1. Since the sample space is S = {1, 2, 3, 4, 5, 6} and each outcome has a
probability of 1/6, the distribution is as follows:

Outcome, X          1     2     3     4     5     6
Probability, P(X)   1/6   1/6   1/6   1/6   1/6   1/6

2. The values X assumes are located on the x-axis, and the values for P(X)
are located on the y axis.

Number of heads, X   0     1     2     3
Probability, P(X)    1/8   3/8   3/8   1/8

3. The probability P(X) can be computed for each X by dividing the number
of days on which X umbrellas were sold by the total number of days.

For 0 umbrellas: 45/90 = 0.5
For 1 umbrella: 30/90 = 0.33
For 2 umbrellas: 15/90 = 0.17



The distribution is as follows:

Number of umbrellas sold 0 1 2
Probability, P(X) 0.5 0.33 0.17

[Graph: probability P(X) plotted against the number of umbrellas sold]

ACTIVITY 2A

TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT
INPUT!

1. Define and give three examples of a random variable.

2. Give three examples of a discrete random variable.

3. What is a probability distribution?

Questions 4 to 6: Determine whether the distribution represents a
probability distribution. If it does not, state why.

4.
X      3     6     9     12    15
P(X)   4/9   2/9   1/9   1/9   1/9

5.
X      1      2      3      4      5
P(X)   3/10   1/10   1/10   2/10   3/10

6.
X 5 10 15
P(X) 1.2 0.3 0.5


For questions 7 to 9, state whether the variable is discrete or continuous.

7. The number of cups of coffee a fast-food restaurant serves each day.

8. The weight of a rhinoceros.

9. The number of lecturers in your polytechnic

For questions 10 to 12, construct a probability distribution for the data and draw a
graph for the distribution.

10. The probabilities that a patient will have 0, 1, 2, or 3 medical tests performed
on entering a hospital are 6/15, 5/15, 3/15, and 1/15, respectively.

11. The probabilities of a machine manufacturing 0, 1, 2, 3, 4, or 5 defective
parts in one day are 0.75, 0.17, 0.04, 0.025, 0.01, and 0.005, respectively.

12. A die is loaded in such a way that the probabilities of getting 1, 2, 3, 4, 5,
and 6 are 1/2, 1/6, 1/12, 1/12, 1/12, and 1/12, respectively.


FEEDBACK TO ACTIVITY 2A


1. A random variable is a variable whose values are determined by chance.
Examples will vary.
2. The number of commercials a radio station plays during each hour. The
number of times a student uses his or her calculator during a mathematics
exam. The number of leaves on a specific type of tree.
3. A probability distribution is a distribution that consists of the values a
random variable can assume along with the corresponding probability of
these values.
4. Yes
5. Yes
6. No, the probability values cannot be greater than one.
7. Discrete
8. Continuous
9. Discrete

10 to 12. Refer to the examples given in the input.



2.2 RANDOM VARIABLES OF THE CONTINUOUS TYPE

If X is a continuous variable and f(x) is a function assigned to it, then

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

with the following requirements satisfied:
i) $f(x) \ge 0$ for every possible value of x
ii) $\int_{-\infty}^{\infty} f(x)\,dx = 1$

When both requirements hold, f(x) is a probability density function (pdf) for X.

Example 2.2

Prove Y is a continuous random variable given the following pdf:

$$f(y) = \begin{cases} \tfrac{1}{2}(2 - y), & 0 \le y \le 2 \\ 0, & \text{otherwise} \end{cases}$$

i) Find $P(Y > 1)$
ii) Find $P(Y < 1)$
iii) Find $P(0.5 < Y \le 1)$
iv) Draw the graph for f(y) and double check your answers to (i)
through (iii) by finding the area under the graph.

INPUT


$$\int_{-\infty}^{\infty} f(y)\,dy = \int_{-\infty}^{0} 0\,dy + \int_{0}^{2} \tfrac{1}{2}(2 - y)\,dy + \int_{2}^{\infty} 0\,dy = 0 + \tfrac{1}{2}\left[2y - \tfrac{y^2}{2}\right]_0^2 + 0 = 1$$

It is proven that Y is a continuous random variable.

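i) $P(Y > 1) = \int_{1}^{2} \tfrac{1}{2}(2 - y)\,dy = \tfrac{1}{2}\left[2y - \tfrac{y^2}{2}\right]_1^2 = 1/4$
ii) $P(Y < 1) = \int_{0}^{1} \tfrac{1}{2}(2 - y)\,dy = \tfrac{1}{2}\left[2y - \tfrac{y^2}{2}\right]_0^1 = 3/4$
iii) $P(0.5 < Y \le 1) = \int_{0.5}^{1} \tfrac{1}{2}(2 - y)\,dy = \tfrac{1}{2}\left[2y - \tfrac{y^2}{2}\right]_{0.5}^{1} = 5/16$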

iv) The graph of f(y) is a straight line falling from f(0) = 1 to f(2) = 0, so each
probability is the area of a simple shape under the line:

P(Y > 1) = area of triangle = (1/2)(1)(1/2) = 1/4
P(Y < 1) = area of trapezium = (1/2)(1 + 1/2)(1) = 3/4
P(0.5 < Y ≤ 1) = area of trapezium = (1/2)(3/4 + 1/2)(1/2) = 5/16

[Figures: graphs of f(y) against y for parts (i) to (iii)]
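
The integrals in Example 2.2 can be double checked numerically. The sketch below assumes the pdf f(y) = (1/2)(2 - y) on 0 ≤ y ≤ 2 and uses its antiderivative directly; the helper names F and prob are illustrative assumptions.

```python
from fractions import Fraction

def F(y):
    """Antiderivative of f(y) = (1/2)(2 - y) on 0 <= y <= 2, with F(0) = 0."""
    y = Fraction(y)
    return (2 * y - y * y / 2) / 2

def prob(a, b):
    """P(a < Y <= b) for 0 <= a <= b <= 2: the area under f between a and b."""
    return F(b) - F(a)

print(prob(0, 2))               # total area under the pdf
print(prob(1, 2))               # P(Y > 1)
print(prob(0, 1))               # P(Y < 1)
print(prob(Fraction(1, 2), 1))  # P(0.5 < Y <= 1)
```

Because Y is continuous, using < or ≤ at the end points does not change any of these areas.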



ACTIVITY 2B

TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT
INPUT!

1. The continuous random variable Y has the following pdf:

$$f(y) = \begin{cases} 0.2, & -1 < y \le 0 \\ 0.2 + cy, & 0 < y \le 1 \\ 0, & \text{otherwise} \end{cases}$$

Find the constant c and
a) $P(0 \le Y \le 0.5)$
b) $P(-0.5 < Y \le 0.5)$
c) $P(Y > 0.5)$
d) $P(Y > 0.5) / P(Y > 0.1)$

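2. The continuous random variable X has the following pdf:

$$f(x) = \begin{cases} \tfrac{1}{4}x, & 0 \le x \le 2 \\ c(2x - 3), & 2 < x \le 3 \\ 0, & \text{otherwise} \end{cases}$$

a) Find the constant c
b) Sketch y = f(x)
c) Find $P(X \le 1)$
d) Find $P(X > 2.5)$
e) Find $P(1 \le X \le 2.3)$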


3. The continuous random variable X has the following pdf:

$$f(x) = \begin{cases} 1/x^2, & 1 < x < \infty \\ 0, & \text{otherwise} \end{cases}$$

If A = { x : 1 < x < 2 } and B = { x : 4 < x < 5 }, find
a) $P(A \cap B)$
b) $P(A \cup B)$

4. Find the mode for each of the following:

a) $P(X = x) = \left(\tfrac{1}{2}\right)^x$, x = 1, 2, 3, ...
b) $f(x) = 12x^2(1 - x)$, 0 < x < 1
c) $f(x) = \tfrac{1}{2}x^2 e^{-x}$, $0 < x < \infty$


FEEDBACK TO ACTIVITY 2B


1. c = 1.2 (a) 0.25 (b) 0.35 (c) 0.55 (d) P(Y > 0.5) = 0.55 and P(Y > 0.1) = 0.774,
so P(Y > 0.5) / P(Y > 0.1) = 0.71

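2. a) c = 1/4

b) The sketch of y = f(x): the line y = (1/4)x rises from 0 at x = 0 to 1/2 at
x = 2; the line y = (1/4)(2x - 3) then rises from 1/4 at x = 2 to 3/4 at x = 3.

c) P = 1/8
d) P = 5/16
e) P = 0.4725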


3. a) $P(A \cap B) = 0$
b) $P(A \cup B) = 11/20$

4. a) X = 1

b) X = 2/3

c) X = 2



2.3 THE BINOMIAL DISTRIBUTION

Many types of probability problems have only two outcomes, or they can be
reduced to two outcomes. For example when a coin is tossed, it can land heads
or tails. When a baby is born, it will be either male or female. A multiple-choice
question, even though there are four or five answer choices, can be classified as
correct or incorrect. Situations like these are called binomial experiments.

A binomial experiment is a probability experiment that satisfies the following
four requirements:

1. Each trial can have only two outcomes or outcomes that can be reduced to
two outcomes. The outcomes can be considered as either success or failure.
2. There must be a fixed number of trials.
3. The outcomes of each trial must be independent of each other.
4. The probability of a success must remain the same for each trial.

INPUT

A binomial experiment and its results give rise to a special probability
distribution called the binomial distribution. The outcomes of a binomial
experiment and the corresponding probabilities of these outcomes are called a
binomial distribution.

The probability of a success in a binomial experiment can be computed with the
following formula.

Binomial Probability Formula

In a binomial experiment, the probability of exactly X successes in n trials is

$$P(X) = \frac{n!}{(n - X)!\,X!} \cdot p^{X} \cdot q^{\,n - X}$$

Notation for the Binomial Distribution

P(S) The symbol for the probability of success
P(F) The symbol for the probability of failure
p The numerical probability of a success
q The numerical probability of a failure

P(S) = p and P(F) = 1 - p = q
n The number of trials
X The number of successes

Note that $0 \le X \le n$
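
The binomial probability formula translates directly into code. The following Python sketch is an illustration only; the function name binomial_pmf is an assumption, and math.comb(n, x) is used for n! / ((n - X)! X!).

```python
from math import comb

def binomial_pmf(n, x, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p."""
    q = 1 - p
    # comb(n, x) equals n! / ((n - x)! x!)
    return comb(n, x) * p**x * q**(n - x)

# Exactly two heads in three tosses of a fair coin.
print(binomial_pmf(3, 2, 0.5))
# Exactly three correct guesses on five questions with five choices each.
print(round(binomial_pmf(5, 3, 0.2), 4))
```

The function assumes that the four requirements of a binomial experiment hold; it does not check them.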


Example 2.3

1. A coin is tossed three times. Find the probability of getting exactly two
heads.

2. If a student randomly guesses at five multiple-choice questions, find the
probability that the student gets exactly three correct. Each question has
five possible choices.

3. Solve problem 1 using the binomial distribution table.

4. A hypothetical survey reported that 5% of Malaysians are afraid of being alone at
night. If a random sample of 20 Malaysians is selected, find the following
probabilities using the binomial table:

i) There are exactly five people in the sample who are afraid of
being alone at night.
ii) There are at most three people in the sample who are afraid
of being alone at night.
iii) There are at least three people in the sample who are afraid
of being alone at night.


1. This problem can be solved by looking at the sample space. There are
three ways to get two heads.
S = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}

The answer is 3/8, or 0.375.

From the standpoint of a binomial experiment, you can show that it meets
the four requirements.

a. There are only two outcomes for each trial, heads or tails.
b. There is a fixed number of trials (three).
c. The outcomes are independent of each other (the outcome of one toss in
no way affects the outcome of another toss).
d. The probability of a success (heads) is 1/2 in each case.

In this case, n = 3, X = 2, p = 1/2, and q = 1/2. Hence substituting in the
formula gives

$$P(2\ \text{heads}) = \frac{3!}{(3 - 2)!\,2!} \left(\tfrac{1}{2}\right)^{2} \left(\tfrac{1}{2}\right)^{1} = \frac{3}{8} = 0.375$$

2. In this case n = 5, X = 3, and p = 1/5, since there is one chance in five of
guessing a correct answer. Then,

$$P(3) = \frac{5!}{(5 - 3)!\,3!} \left(\tfrac{1}{5}\right)^{3} \left(\tfrac{4}{5}\right)^{2} = 0.05$$

3. You need the binomial distribution table. Using n = 3, p = 0.5, and X = 2 in
the table, the answer is 0.375.

4. i) n = 20, p = 0.05, and X = 5. From the table, one gets 0.002.

ii) n = 20 and p = 0.05. At most three people means 0, or 1, or 2, or 3.
Hence the solution is P(0) + P(1) + P(2) + P(3) = 0.358 + 0.377 + 0.189 +
0.060 = 0.984

iii) n = 20 and p = 0.05. At least three people means 3, 4, 5, ..., 20. This
problem can best be solved by finding P(0) + P(1) + P(2) and
subtracting from 1.

P(0) + P(1) + P(2) = 0.358 + 0.377 + 0.189 = 0.924

1 - 0.924 = 0.076
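
The "exactly", "at most", and "at least" probabilities for n = 20 and p = 0.05 can be reproduced without a table. This is a sketch under the same binomial assumptions; binomial_pmf is an illustrative helper, not part of any library.

```python
from math import comb

def binomial_pmf(n, x, p):
    """Probability of exactly x successes in n trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 20, 0.05

exactly_five = binomial_pmf(n, 5, p)
# At most three: X = 0, 1, 2, or 3.
at_most_three = sum(binomial_pmf(n, x, p) for x in range(0, 4))
# At least three: 1 minus the probability of X = 0, 1, or 2.
at_least_three = 1 - sum(binomial_pmf(n, x, p) for x in range(0, 3))

print(round(exactly_five, 3), round(at_most_three, 3), round(at_least_three, 3))
```

Values computed this way can differ from table-based answers in the last decimal place, because the table entries are already rounded before they are added.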



ACTIVITY 2C

TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT
INPUT!

1. Which of the following are binomial experiments or can be reduced to
binomial experiments?
i) Surveying 100 people to determine if they like Siti Nurhaliza.
ii) Tossing a coin 100 times to see how many heads occur.
iii) Testing four different brands of aspirin to see which brands are
effective.
iv) Testing one brand of aspirin using 10 people to determine whether it
is effective.

2. Compute the probability of X successes, using the binomial formula.
i) n = 6, X = 3, p = 0.03
ii) n = 4, X = 2, p = 0.18
iii) n = 5, X = 3, p = 0.63
iv) n = 9, X = 0, p = 0.42

3. A student takes a 10-question, true-false exam and guesses on each
question. Find the probability of passing if the lowest passing grade is 6
correct out of 10. Based on your answer, would it be a good idea not to study
and depend on guessing?



Feedback to Activity 2C

1. i) Yes, ii) Yes, iii) No, iv) Yes

2. Refer to the table

3. 0.377; no, the chance of passing by guessing is only about 38%



2.4 MEAN, VARIANCE AND STANDARD DEVIATION FOR THE
BINOMIAL DISTRIBUTION

In the use of the binomial distribution, the outcomes must be independent. For
example, in a selection of components from a batch to be tested, each
component must be replaced before the next one is selected. Otherwise, the
outcomes are not independent. However, a dilemma arises because there is a
chance that the same component could be selected again. This situation can be
avoided by not replacing the component and using the hypergeometric
distribution to calculate the probabilities.

Mean, Variance, and Standard Deviation for the Binomial
Distribution

mean: $\mu = n \cdot p$
variance: $\sigma^2 = n \cdot p \cdot q$
standard deviation: $\sigma = \sqrt{n \cdot p \cdot q}$
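
These three formulas can be wrapped in one small function. The sketch below is illustrative; the function name binomial_summary is an assumption, and q is computed as 1 - p.

```python
from math import sqrt

def binomial_summary(n, p):
    """Return (mean, variance, standard deviation) of a binomial variable."""
    q = 1 - p
    mean = n * p
    variance = n * p * q
    return mean, variance, sqrt(variance)

print(binomial_summary(4, 1 / 2))    # four coin tosses, counting heads
print(binomial_summary(480, 1 / 6))  # 480 rolls of a die, counting one face
```

The first call gives a mean of 2 and a variance of 1; the second gives a mean of about 80 and a standard deviation of about 8.2.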

INPUT

Example 2.4

1. A coin is tossed four times. Find the mean, variance, and standard
deviation of the number of heads that will be obtained.

2. A die is rolled 480 times. Find the mean, variance, and standard deviation
of the number of 2s that will be rolled.


1. With the formulas for the binomial distribution and n = 4, p = 1/2, and q = 1/2, the
results are

$$\mu = n \cdot p = 4 \cdot \tfrac{1}{2} = 2$$

$$\sigma^2 = n \cdot p \cdot q = 4 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = 1$$

$$\sigma = \sqrt{1} = 1$$

2. n = 480, p = 1/6. This is a binomial situation, where getting a 2 is a success and
not getting a 2 is a failure; hence,

$$\mu = n \cdot p = 480 \cdot \tfrac{1}{6} = 80$$

$$\sigma^2 = n \cdot p \cdot q = 480 \cdot \tfrac{1}{6} \cdot \tfrac{5}{6} = 66.7$$

$$\sigma = \sqrt{n \cdot p \cdot q} = \sqrt{66.7} = 8.2$$

On average, there will be 80 twos. The standard deviation is about 8.2.



ACTIVITY 2C

TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT
INPUT!

1. A statistical bulletin reported that 2% of all births result in twins. If a random
sample of 8000 births is taken, find the mean, variance, and standard deviation
of the number of births that would result in twins.

2. If 2% of automobile carburetors are defective, find the mean, variance, and
standard deviation of the number of defective carburetors in a sample of 500.


FEEDBACK TO ACTIVITY 2C

1. $\mu = 160$, $\sigma^2 = 156.8$, $\sigma \approx 12.5$

2. $\mu = 10$, $\sigma^2 = 9.8$, $\sigma \approx 3.1$


2.5 THE NORMAL DISTRIBUTION

What is Normal?

It is known that the normal range of systolic blood pressure is 110 to 140. The
normal interval for a person's triglycerides is from 30 to 200 mg/dl. By measuring
these variables, a doctor can determine if a patient's vital statistics are within the
normal interval, or if some type of treatment is needed to correct the condition
and avoid future illness. The question then is, "How does one determine the
so-called normal interval?"

Recall that a continuous variable can assume all values between any two given
values of the variables. Many continuous variables have distributions that are
bell-shaped and are called approximately normally distributed variables. For
example, if we select a random sample of 100 adult women, measure their
heights, and construct a histogram, we get a graph similar to the one in fig (a).
Now if we increase the sample size and decrease the width of the classes, the
histogram will look like the ones shown in fig. (b) and (c). Finally, if it were
possible to measure exactly the heights of all adult females in Malaysia and plot
them, the histogram would approach what is called the normal distribution,
shown in (d).

INPUT

When the data values are evenly distributed about the mean, the distribution is
said to be symmetrical. See fig (a) below. When the majority of the data values
fall to the left or right of the mean, the distribution is said to be skewed. See fig
(b) and (c)

Normal and Skewed Distributions

For the sake of simplicity, a theoretical curve, called the normal distribution curve
can be used to study many variables that are not perfectly normally distributed
but are nevertheless approximately normal. The mathematical equation for the
normal distribution curve is

$$y = \frac{e^{-(X - \mu)^2 / (2\sigma^2)}}{\sigma\sqrt{2\pi}}$$

where $e \approx 2.718$, $\pi \approx 3.14$, $\mu$ = population mean, and
$\sigma$ = population standard deviation.
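
The equation of the normal distribution curve can be evaluated directly. The sketch below is illustrative only; the function name normal_curve and the values mean 100 and standard deviation 15 are assumptions chosen for the example.

```python
from math import exp, pi, sqrt

def normal_curve(x, mu, sigma):
    """Height y of the normal distribution curve at x,
    for mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# The curve is highest at the mean ...
print(normal_curve(100, 100, 15))
# ... and has the same height at equal distances on either side of it.
print(normal_curve(85, 100, 15), normal_curve(115, 100, 15))
```

Note that y is the height of the curve, not a probability; probabilities come from areas under the curve.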



Another important aspect in statistics is that the area under the normal
distribution curve is more important than frequencies. Therefore, when the
normal distribution is pictured, the y axis, which indicates the frequencies, is
sometimes omitted.

The shape and position of the normal distribution curve depend on two
parameters, the mean and standard deviation. Each normally distributed variable
has its own normal distribution curve, which depends on the values of the
variable's mean and standard deviation.

Fig (a) shows two normal distributions with the same mean values but different
standard deviations. The larger the standard deviation, the more dispersed, or
spread out the distribution is. Fig (b) shows two normal distributions with the
same standard deviation but with different
means. These curves have the same shapes but are located at different
positions on the x axis. Fig (c) shows two normal distributions with different
means and different standard deviation.



The normal distribution is a continuous, symmetric, bell-shaped distribution of a
variable.

Summary of the Properties of the Theoretical Normal
Distribution

1. The normal distribution curve is bell-shaped.
2. The mean, median, and mode are equal and
located at the center of the distribution.
3. The normal distribution curve is unimodal (i.e., it
has only one mode).
4. The curve is symmetrical about the mean, which is
equivalent to saying that its shape is the same on
both sides of a vertical line passing through the
center.
5. The curve is continuous i.e., there are no gaps or
holes. For each value of X, there is a corresponding
value of Y.
6. The curve never touches the x axis. Theoretically,
no matter how far in either direction the curve
extends, it never meets the x axis, but it gets
increasingly closer.
7. The total area under the normal distribution curve is
equal to 1.00 or 100%.
8. The area under the normal curve that lies within one
standard deviation of the mean is approximately
0.68, or 68%; within two standard deviations, about
0.95, or 95%; and within three standard deviations,
about 0.997, or 99.7%. See the figure below that
shows the area in each region.
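
The areas quoted in property 8 can be checked numerically. The sketch below relies on the standard relationship between the normal curve and the error function, math.erf; the function name area_within is an assumption made for illustration.

```python
from math import erf, sqrt

def area_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The results agree with the rounded figures of about 68%, 95%, and 99.7%.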



The Standard Normal Distribution

Since each normally distributed variable has its own mean and standard
deviation, the shape and location of these curves will vary. In practical
applications, one would have to have a table of areas under the curve for each
variable. To simplify this situation, statisticians use what is called the standard
normal distribution.

Areas under the Normal Distribution Curve

The standard normal distribution is a normal distribution with a mean of 0 and a
standard deviation of 1.



Standard Normal Distribution

The values under the curve indicate the proportion of area in each section. For
example, the area between the mean and one standard deviation above or below
the mean is about 0.3413, or 34.13%.

The formula for the standard normal distribution is

$$y = \frac{e^{-X^2/2}}{\sqrt{2\pi}}$$

All normally distributed variables can be transformed into the standard normally
distributed variable by using the formula for the standard score

$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \quad \text{or} \quad z = \frac{X - \mu}{\sigma}$$

This is the same formula used in Section 3-4. The use of this formula will be
explained in the next section.

As stated earlier, the area under the normal distribution curve is used to solve practical
application problems, such as finding the percentage of adult women whose
height is between 5 feet 4 inches and 5 feet 7 inches, or finding the probability
that a new battery will last longer than four years. Hence, the major emphasis of
this section will be to show the procedure for finding the area under the standard
normal distribution curve for any z value. The application will be shown in the
next section. Once the X values are transformed using the preceding formula,
they are called z values. The z value is actually the number of standard deviations
that a particular X value is away from the mean.
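
The standard score and the corresponding area can be sketched in a few lines of Python. The function names z_value and area_left_of are illustrative assumptions, and the numbers used (X = 115, mean 100, standard deviation 15) are hypothetical; the area is computed with math.erf rather than read from a table.

```python
from math import erf, sqrt

def z_value(x, mu, sigma):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mu) / sigma

def area_left_of(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z = z_value(115, 100, 15)
print(z)
# Area between the mean (z = 0) and z, as a table of areas would list it.
print(round(area_left_of(z) - 0.5, 4))
```

Areas to the right of z follow as 1 - area_left_of(z), and areas between two z values as the difference of two calls.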



For the solution of problems using the normal distribution, a four-step procedure
is recommended with the use of the Procedure Table shown below.

STEP 1 Draw a picture

STEP 2 Shade the area desired.

STEP 3 Find the correct figure in the following Procedure Table.

STEP 4 Follow the directions given in the appropriate block of the
Procedure Table to get the desired area.

Finding The Area under The Normal Distribution Curve



Example 2.5

Find the probability (i.e. the area under the curve) for each part of Questions 1
and 2.

Question 1

a. $P(Z > 0.5)$
b. $P(Z > 1.0)$
c. $P(Z > 1.53)$
d. $P(Z > 1.998)$
e. $P(Z \le 1.86)$

Question 2

a. If X~N(1, 4), find $P(-3 \le X \le 3)$

b. If X~N(2, 4), find $P(1 \le X \le 4)$

c. If X~N(2, 9), find $P(0 \le X \le 12)$

d. If X~N(-5, 36), find $P(10 \le X \le 12)$




Question 1



Question 2



ACTIVITY 2D

TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT
INPUT!

1. Find the area to the right of Z = 2.43 and to the left of Z = -3.01.

2. Find the probability for each.
a) P(0<z<2.32)
b) P(z<1.65)
c) P(z>1.91)

3. Find the z value such that the area under the normal distribution curve
between 0 and the z value is 0.2123.



FEEDBACK TO ACTIVITY 2D

1. 0.0088, or 0.88%

2. a) 0.4898, or 48.98%
b) 0.9505, or 95.05%
c) 0.0281, or 2.81%

3. 0.56



2.6 APPLICATIONS OF THE NORMAL DISTRIBUTION

To solve problems by using the standard normal distribution, transform the
original variable into a standard normal distribution variable
by using the formula

$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \quad \text{or} \quad z = \frac{X - \mu}{\sigma}$$

For example, suppose that the scores for a standardized test are normally
distributed, have a mean of 100, and have a standard deviation of 15. When the
scores are transformed into z values, the two distributions coincide, as shown in
the figure below. (Recall that the z distribution has a mean of 0 and a standard
deviation of 1)

INPUT

Example 2.6

1. If the scores for the test have a mean of 100 and a standard deviation of
15, find the percentage of scores that will fall below 112.

2. Each month, a Malaysian household generates an average of 28 kg of
newspaper for garbage recycling. Assume the standard deviation is 2 kg and
the variable is approximately normally distributed. If a household is
selected at random, find the probability of its generating

i) Between 27 and 31 kg per month.
ii) More than 30.2 kg per month.

3. The normal distribution can also be used to answer questions of
"how many?" The Automobile Association of Malaysia reports that the average
time it takes to respond to an emergency call is 25 minutes. Assume the
variable is approximately normally distributed and the standard deviation is 4.5
minutes. If 80 calls are randomly selected, approximately how many will
be responded to in less than 15 minutes?

4. In order to qualify for Askar Wataniah Politeknik training program,
candidates must score in the top 10% on a general abilities test. The test
has a mean of 200 and a standard deviation of 20. Find the lowest
possible score to qualify. Assume the test scores are normally distributed.

5. For a medical study, a researcher wishes to select people in the middle
60% of the population based on blood pressure. If the mean systolic blood
pressure is 120 and the standard deviation is 8, find the upper and lower
readings that would qualify people to participate in the study.




1.



Hence, the probability that a randomly selected household generates between
27 and 31 kg of newspaper per month is 62.47%.
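
The probabilities asked for in questions 1 and 2 of Example 2.6 can be checked with Python's standard library. This is a sketch, assuming both variables are normally distributed as the example states; it uses statistics.NormalDist (available from Python 3.8) instead of a table of areas, and the variable names are illustrative.

```python
from statistics import NormalDist

# Test scores: mean 100, standard deviation 15.
scores = NormalDist(mu=100, sigma=15)
below_112 = scores.cdf(112)

# Newspaper per household: mean 28 kg, standard deviation 2 kg.
newspaper = NormalDist(mu=28, sigma=2)
between_27_and_31 = newspaper.cdf(31) - newspaper.cdf(27)
more_than_30_2 = 1 - newspaper.cdf(30.2)

print(round(below_112, 4), round(between_27_and_31, 4), round(more_than_30_2, 4))
```

Here cdf(x) is the area to the left of x, so a "between" probability is the difference of two cdf values and a "more than" probability is 1 minus a cdf value.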



3.

STEP 2 Find the z value for 15.

$$z = \frac{X - \mu}{\sigma} = \frac{15 - 25}{4.5} = -2.2$$

STEP 3 Find the appropriate area. The area obtained from the table is
0.4868, which corresponds to the area between z = 0 and z = -2.2

STEP 4 Subtract 0.4868 from 0.5000 to get 0.0132.

STEP 5 To find how many calls will be responded to in less than 15 minutes,
multiply the sample size (80) by the area (0.0132) to get 1.056.
Hence, 1.056, or approximately one, call will be responded to in
under 15 minutes.



4.

STEP 1 Subtract 0.1000 from 0.5000 to get the area under the normal
distribution between 200 and X: 0.5000 - 0.1000 = 0.4000

STEP 2 Find the z value that corresponds to an area of 0.4000 by looking
up 0.4000 in the table. If the specific value cannot be found, use the
closest value, for example 0.3997. The corresponding z value is
1.28.

STEP 3 Substitute in the formula $z = \frac{X - \mu}{\sigma}$ and solve for X.

$$1.28 = \frac{X - 200}{20}$$

of which the answer is X = 225.6, or 226 when rounded.
A score of 226 should be used as the cutoff. Anybody scoring 226 or
higher qualifies.

5. Assume that blood pressure readings are normally distributed; then the cutoff
points are as shown in the figure below.

Note that two values are needed, one above the mean and one below the
mean. Find the value to the right of the mean first. The closest z value for an
area of 0.3000 is 0.84. Substituting in the formula $X = z\sigma + \mu$, one gets

$$X_1 = z\sigma + \mu = (0.84)(8) + 120 = 126.72$$

On the other side, z = -0.84; hence

$$X_2 = (-0.84)(8) + 120 = 113.28$$

Therefore, the middle 60% will have blood pressure readings of
113.28 < X < 126.72
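
Questions 4 and 5 work backwards from an area to a data value. The sketch below assumes normally distributed variables, as the example states, and uses the inv_cdf method of statistics.NormalDist (Python 3.8 or later) in place of a reverse table lookup; the variable names are illustrative.

```python
from statistics import NormalDist

# General abilities test: mean 200, standard deviation 20; the top 10% qualify.
abilities = NormalDist(mu=200, sigma=20)
cutoff = abilities.inv_cdf(0.90)  # 90% of scores fall below the cutoff

# Systolic blood pressure: mean 120, standard deviation 8; middle 60% wanted.
pressure = NormalDist(mu=120, sigma=8)
lower = pressure.inv_cdf(0.20)    # 20% of readings fall below the lower limit
upper = pressure.inv_cdf(0.80)    # 20% of readings fall above the upper limit

print(round(cutoff, 1), round(lower, 1), round(upper, 1))
```

The results differ slightly from the table-based answers because the table gives z only to two decimal places (1.28 and 0.84).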



ACTIVITY 2E

TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT
INPUT!

1. The mean income of daily labourers is RM60. Assume the standard deviation
is RM10. If a labourer is selected at random, what is the probability that his
salary is:
a) more than RM70
b) below RM62
c) more than RM54
d) between RM62 and RM72
e) between RM55 and RM80

2. Absorption tests on 500 bricks gave a mean absorption and standard
deviation of 10.3% and 2.1% respectively. Determine the number of bricks
with absorption
a) between 8.9% and 11.1%
b) greater than 15%.

3. Chip-board partition units are produced with a mean length of 750 mm and
standard deviation 6 mm. Determine the rejection rate if the permissible
length deviations are ±15 mm (assume the lengths to be normally
distributed).



FEEDBACK TO ACTIVITY 2E

1. a) P(X>70) = 0.1587
b) P(X<62) = 0.5793
c) P(X>54) = 0.7257
d) $P(62 \le X \le 72) = 0.3056$
e) $P(55 \le X \le 80) = 0.6687$

2. a) P(8.9<X<11.1) = 0.3966, that is about 198 bricks
b) z = 2.24, so P(X>15) = 0.0125, that is about 6 bricks

3. P(X<735 or X>765) = 0.0124, that is, the rejection rate is 1.24%



SELF-ASSESSMENT 2

You are approaching success. Try all the questions in this self-assessment
section and check your answers against those given on the next page. If you face
any problems, consult your instructor. Good luck.

1. What is a probability distribution? Give an example.
2. Determine whether the distribution represents a probability distribution. If it
is not, state why

X 3 6 9 12 15
P(X) 4/9 2/9 1/9 1/9 1/9

For questions 3 and 4, state whether the variable is discrete or continuous.

For questions 5 to 8, construct a probability distribution for the data and draw a
graph for the distribution.

3. The number of cups of coffee a fast food restaurant serves each day.
4. The weight of a rhinoceros
5. The probabilities that a patient will have 0, 1, 2, or 3 medical tests performed
on entering a hospital are 6/15, 5/15, 3/15, and 1/15, respectively.

6. The probabilities that a customer will purchase 0,1,2, or 3 books are 0.45,
0.30 ,0.15 and 0.10, respectively.

7. The probabilities that a customer selects 1, 2, 3, 4 or 5 items at a
convenience store are 0.32, 0.12, 0.23, 0.18 and 0.15, respectively.

8. Construct a probability distribution for drawing a card from a deck of 40
cards consisting of 10 cards numbered 1, 10 cards numbered 2, 15
cards numbered 3, and 5 cards numbered 4.


FEEDBACK TO SELF-ASSESSMENT 2

Have you tried the questions??? If YES, check your answers now.

1. A probability distribution is a distribution that consists of the values a
random variable can assume along with the corresponding probabilities of
these values.

2. Yes

3. Discrete

4. Continuous

5 to 8: Refer to Example 2.1

SAMPLE AND SAMPLE DISTRIBUTIONS C 5606/3/

SAMPLE AND SAMPLE DISTRIBUTIONS

OBJECTIVES

General Objective

To understand the concept of sampling and sample distributions

Specific Objectives


Define the sampling distribution concept which is the base for inferential
statistics.
Express the relationship between statistical samples and population
parameters.
Explain the concept of sampling distribution of sample means based on
random sample taken with and without replacement from a population.
Calculate the mean, variance and standard deviation of the distribution of
the sample means taken with or without replacement from a population.
State the criteria for big samples (n>30).
Study the characteristics of the distributions of the means of samples
taken from a population.
Use the central limit theorem to solve the probability problems involving
distribution of sample means for large number of samples.

UNIT 3


3.0 INTRODUCTION

As an engineer, you are required to find out the mean value of the service life for
newly developed light bulbs. One of the approaches is to randomly pick out, say
50 light bulbs from the whole population of thousand bulbs produced and have
them tested. In doing so, you can approximate the mean value for the bulbs. This
method is known as sampling.

3.1 SAMPLE DISTRIBUTIONS

Every sample is a subset from a population. By studying the sample, it is
possible to find out the characteristics of the sample and eventually determine
the characteristics of the whole population. It would be ideal if the sample were a
perfect miniature of the population in all characteristics. This ideal, however, is
impossible to achieve. The best that can be done is to select a sample that will
be representative with respect to some characteristics, preferably those
pertaining to the study.

For a sample to be a random sample, every member of the population must
have an equal chance of being selected. A sample selected without bias will
be representative of the population.

INPUT

3.1.1 SAMPLE STATISTICS AND POPULATION PARAMETERS

The probability distribution concept can be applied to sample statistics. Examples
of sample statistics are measures of central tendency for a given sample, such as
the mean $\bar{x}$, and measures of variation, such as the standard deviation $s$.
The population mean $\mu$ and the population standard deviation $\sigma$ are the
corresponding measures for the whole population. Below is a table of sample
statistics and population parameters:

Quantity             Sample statistic   Population parameter
Size                 $n$                $N$
Mean                 $\bar{x}$          $\mu$
Variance             $s^2$              $\sigma^2$
Standard deviation   $s$                $\sigma$
Proportion           $\hat{p}$          $p$

3.1.2 DISTRIBUTION OF SAMPLE MEANS

Suppose we select 100 samples of a specific size from a large population and compute
the mean of the same variable for each of the 100 samples. The sample means,
$\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{100}$, constitute a sampling distribution of sample means.

If the samples are randomly selected with replacement, the sample means will, for
the most part, be somewhat different from the population mean $\mu$. These
differences are caused by sampling error.

Properties of the Distribution of Sample Means

1. The mean of the sample means will be the same as the
population mean: $\mu_{\bar{x}} = \mu$.
2. The standard deviation of the sample means will be smaller than
the standard deviation of the population, and will be equal to the
population standard deviation divided by the square root of the
sample size: $\sigma_{\bar{x}} = \sigma / \sqrt{n}$.


Example 3.1

1. Suppose a lecturer gave an eight point quiz to a small class of four
students. The results of the quiz were 2, 6, 4, and 8. Assume the four
students constitute the population.

Find i) The population mean $\mu$ and standard deviation $\sigma$, and draw the graph of the sample means.
ii) $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$ of the sample means

2. Assume that we have a population consisting of three numbers 1, 2, and
3. The probability distributions for these numbers are

X 1 2 3
P(x) 1/3 1/3 1/3

Find i) The population mean, variance and standard deviation
ii) Now, if all samples of size 2 are taken with replacement, and the
mean of each sample is found, find:

a) The probability distribution of the sample means, $\bar{x}$ (draw a table)
b) The mean of the sample means
c) The variance and standard deviation of the sample means


1. The mean of the population is

$$\mu = \frac{2 + 6 + 4 + 8}{4} = 5$$

The standard deviation of the population is

$$\sigma = \sqrt{\frac{(2-5)^2 + (6-5)^2 + (4-5)^2 + (8-5)^2}{4}} = 2.236$$


Below is the graph of the original population of scores. Each score occurs
once, so the distribution is uniform.

[Figure: histogram of the population scores; horizontal axis: score, vertical axis: frequency]

Now, if all samples of size 2 are taken with replacement, and the mean of each
sample is found, the distribution is shown next. (You can draw a tree diagram if
you wish)

Sample Mean Sample Mean
2, 2 2 6, 2 4
2, 4 3 6, 4 5
2, 6 4 6, 6 6
2, 8 5 6, 8 7
4, 2 3 8, 2 5
4, 4 4 8, 4 6
4, 6 5 8, 6 7
4, 8 6 8, 8 8

A frequency distribution of the sample means is as follows.

$\bar{X}$   f
2           1
3           2
4           3
5           4
6           3
7           2
8           1


Below is the histogram of the sample means. Its shape appears to be
approximately normal.

[Figure: histogram of the sample means; horizontal axis: sample mean (2 to 8), vertical axis: frequency (0 to 5)]

The mean of the sample means, denoted by $\mu_{\bar{x}}$, is

$$\mu_{\bar{x}} = \frac{2 + 3 + \cdots + 8}{16} = \frac{80}{16} = 5$$

which is the same as the population mean. Hence $\mu_{\bar{x}} = \mu$.

The standard deviation of the sample means, denoted by $\sigma_{\bar{x}}$, is

$$\sigma_{\bar{x}} = \sqrt{\frac{(2-5)^2 + (3-5)^2 + \cdots + (8-5)^2}{16}} = 1.581$$

which is the same as the population standard deviation divided by $\sqrt{2}$:

$$\sigma_{\bar{x}} = \frac{2.236}{\sqrt{2}} = 1.581$$

Note: If all possible samples of size n are taken with replacement from the same
population, the mean of the sample means, denoted by $\mu_{\bar{x}}$, equals the
population mean $\mu$; and the standard deviation of the sample means, denoted
by $\sigma_{\bar{x}}$, equals $\sigma / \sqrt{n}$.
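
The note above can be checked by brute force. The following is a minimal Python sketch, not part of the original module, that lists every ordered sample of size 2 drawn with replacement from the quiz scores 2, 6, 4 and 8 and compares the results with $\mu$ and $\sigma / \sqrt{n}$. The function names are illustrative assumptions.

```python
from itertools import product
from math import isclose, sqrt


def sample_means_with_replacement(population, n):
    """Mean of every ordered sample of size n drawn with replacement."""
    return [sum(sample) / n for sample in product(population, repeat=n)]


def mean(values):
    return sum(values) / len(values)


def population_sd(values):
    """Standard deviation with divisor len(values), as used in this unit."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


population = [2, 6, 4, 8]   # quiz scores from Example 3.1
n = 2                       # sample size

sample_means = sample_means_with_replacement(population, n)
mu, sigma = mean(population), population_sd(population)
mu_xbar, sigma_xbar = mean(sample_means), population_sd(sample_means)

print(len(sample_means))                        # 16 samples (4 * 4)
print(mu, mu_xbar)                              # 5.0 5.0
print(round(sigma, 3), round(sigma_xbar, 3))    # 2.236 1.581
print(isclose(sigma_xbar, sigma / sqrt(n)))     # True
```

Changing `population` to `[1, 2, 3]` reproduces the second part of Example 3.1 in the same way.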

2. Population mean,

$$\mu = E(X) = \sum x\,p(x) = 1(1/3) + 2(1/3) + 3(1/3) = \tfrac{1}{3}(1 + 2 + 3) = 6/3 = 2$$

Population variance,

$$\sigma^2 = E(X^2) - [E(X)]^2 = 1^2(\tfrac{1}{3}) + 2^2(\tfrac{1}{3}) + 3^2(\tfrac{1}{3}) - 2^2 = \tfrac{1}{3}(1 + 4 + 9) - 2^2 = \tfrac{14}{3} - 2^2 = \tfrac{2}{3}$$

Therefore $\sigma = \sqrt{2/3}$


ii)

Sample   Mean, $\bar{x}$
1, 1     1.0
1, 2     1.5
1, 3     2.0
2, 1     1.5
2, 2     2.0
2, 3     2.5
3, 1     2.0
3, 2     2.5
3, 3     3.0

The probability distribution of the sample means, $\bar{x}$:

$\bar{x}$      1.0   1.5   2.0   2.5   3.0
$P(\bar{x})$   1/9   2/9   3/9   2/9   1/9

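You can draw a histogram of $P(\bar{x})$ against the sample means $\bar{x}$, as in Activity 3A,
and then find the mean of the sample means:

$$\mu_{\bar{x}} = E(\bar{x}) = \sum \bar{x}\,p(\bar{x}) = 1(\tfrac{1}{9}) + 1.5(\tfrac{2}{9}) + 2(\tfrac{3}{9}) + 2.5(\tfrac{2}{9}) + 3(\tfrac{1}{9}) = \tfrac{18}{9} = 2 = \mu \text{ (population mean)}$$

Variance of the sample means:

$$\sigma_{\bar{x}}^2 = E(\bar{x}^2) - [E(\bar{x})]^2 = \sum \bar{x}^2 p(\bar{x}) - \mu_{\bar{x}}^2$$

$$= 1^2(\tfrac{1}{9}) + 1.5^2(\tfrac{2}{9}) + 2.0^2(\tfrac{3}{9}) + 2.5^2(\tfrac{2}{9}) + 3.0^2(\tfrac{1}{9}) - 2^2 = \tfrac{13}{3} - 2^2 = \tfrac{1}{3}$$

Standard deviation of the sample means: $\sigma_{\bar{x}} = \sqrt{1/3}$

Look:

$$\sigma_{\bar{x}} = \sqrt{\tfrac{1}{3}} = \frac{\sqrt{2/3}}{\sqrt{2}} = \frac{\sigma}{\sqrt{n}}, \quad \text{since } \sigma = \sqrt{2/3} \text{ and } n = 2$$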


ACTIVITY 3A

INPUT!

1. Let the population consist of the digits 1, 2 and 3. Find the population
mean and the population standard deviation.

2. 10000 female students are found to have a mean weight of 63 kg with a
standard deviation of 7 kg. 100 samples of size 36 are taken, without
replacement, from the above. Estimate the mean and standard deviation
of the sample-means.



1. $\mu = 2$, $\sigma = \sqrt{2/3}$

2. $\mu_{\bar{x}}$ = 63 kg and $\sigma_{\bar{x}}$ = 1.17 kg


3.2 THE CENTRAL LIMIT THEOREM

As the sample size n increases, the shape of the distribution of the sample
means taken with replacement from a population with mean $\mu$ and standard
deviation $\sigma$ will approach the normal distribution. As previously shown, this
distribution will have a mean $\mu$ and a standard deviation $\sigma / \sqrt{n}$.

The central limit theorem can be used to answer questions about sample means
in the same manner that the normal distribution can be used to answer questions
about individual values. The only difference is that a new formula must be used
for the z values:

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

Notice that $\bar{X}$ is the sample mean, and the denominator is the
standard error of the mean. It is important to remember two things when using
the central limit theorem:

- When the original variable is normally distributed, the distribution of the sample
means will be normally distributed, for any sample size n.
- When the distribution of the original variable departs from normality, a sample
size of 30 or more is needed to use the normal distribution to approximate the
distribution of the sample means. The larger the sample, the better the
approximation will be.
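
The central limit theorem can also be illustrated by simulation. The sketch below is not part of the original module: it assumes a skewed (exponential) population with mean 1 and standard deviation 1, a sample size of 30 and 5000 repeated samples, all of which are hypothetical choices made only for illustration. If the theorem holds, the mean of the sample means should be close to the population mean, their standard deviation close to $\sigma / \sqrt{n}$, and roughly 95% of them should lie within 1.96 standard errors of the population mean.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(42)   # fixed, arbitrary seed so repeated runs agree

POP_MEAN = 1.0            # an exponential population with rate 1 has mean 1 ...
POP_SD = 1.0              # ... and standard deviation 1 (skewed, not normal)
n = 30                    # size of each sample (hypothetical choice)
num_samples = 5000        # number of samples drawn (hypothetical choice)

# Mean of each random sample drawn from the skewed population.
sample_means = [
    sum(rng.expovariate(1.0 / POP_MEAN) for _ in range(n)) / n
    for _ in range(num_samples)
]

standard_error = POP_SD / sqrt(n)
within = sum(abs(m - POP_MEAN) <= 1.96 * standard_error for m in sample_means)

print(round(mean(sample_means), 3))     # expected to be close to POP_MEAN
print(round(pstdev(sample_means), 3))   # expected to be close to the standard error
print(round(standard_error, 3))         # sigma / sqrt(n)
print(within / num_samples)             # expected to be close to 0.95
```

The exact figures depend on the seed and on the simulated population, so they are only an approximate check of the theorem, not a proof.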

INPUT

NOTE

When the sample size is 30 or larger, the normality assumption is not necessary.
When do we use $z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ and when do we use $z = \frac{X - \mu}{\sigma}$?

The formula $z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ should be used to gain information about a sample mean,
whereas the formula $z = \frac{X - \mu}{\sigma}$ is used to gain information about an individual
data value obtained from the population. See the example below.

Example 3.2

1. Students in semester 1 and 2 in Polytechnics spend an average of 25
hours sleeping in a week. Assume the variable is normally distributed and
the standard deviation is 3 hours. If 20 students from semester 1 and 2
are randomly selected, find the probability that the mean of the number of
hours they sleep will be greater than 26.3 hours.

2. The average age of motorcycles registered in polytechnics is 8 years, or
96 months. Assume the standard deviation is 16 months. If a random
sample of 36 motorcycles is selected, find the probability that the mean of
their ages is between 90 and 100 months.

3. The average number of pounds of meat a person consumes a year is
218.4 pounds. Assume that the standard deviation is 25 pounds and the
distribution is approximately normal.
i) Find the probability that a person selected at random consumes
less than 224 pounds per year.
ii) If a sample of 40 individuals is selected, find the probability that
the mean of the sample will be less than 224 pounds per year.



1. Since the variable is approximately normally distributed, the distribution of
sample means will be approximately normal, with a mean of 25. The
standard deviation of the sample means is

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{20}} = 0.671$$

The z-value is

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{26.3 - 25}{3 / \sqrt{20}} = \frac{1.3}{0.671} = 1.94$$

The area between 0 and 1.94 is 0.4738. Since the desired area is in the tail,
subtract 0.4738 from 0.5000. Hence 0.5000 - 0.4738 = 0.0262, or 2.62%.
One can conclude that the probability of obtaining a sample mean larger than
26.3 hours is 2.62% (i.e., $P(\bar{X} > 26.3) = 2.62\%$).

[Figure: distribution of the sample means, centred at 25, with the area to the right of 26.3 shaded]

2. The desired area is shown in the figure below.

The two z-values are

$$z_1 = \frac{90 - 96}{16 / \sqrt{36}} = -2.25 \quad \text{and} \quad z_2 = \frac{100 - 96}{16 / \sqrt{36}} = 1.50$$

The two areas corresponding to the z values of -2.25 and 1.50, respectively, are
0.4878 and 0.4332. Since the z-values are on opposite sides of the mean, find
the probability by adding the areas: 0.4878 + 0.4332 = 0.9210, or 92.1%.

Hence, the probability of obtaining a sample mean between 90 and 100 months
is 92.1%, i.e., $P(90 < \bar{X} < 100) = 92.1\%$.

[Figure: distribution of the sample means, centred at 96, with the area between 90 and 100 shaded]

3. (i) Since the question asks about an individual person, the formula
$z = \frac{X - \mu}{\sigma}$ is used. The distribution is shown in the figure below.

The z value is

$$z = \frac{X - \mu}{\sigma} = \frac{224 - 218.4}{25} = 0.22$$

The area between 0 and 0.22 is 0.0871; this area must be added to 0.5000 to get
the total area to the left of z = 0.22.

0.0871 + 0.5000 = 0.5871

Hence, the probability of selecting an individual who consumes less than 224
pounds of meat per year is 0.5871, or 58.71% (i.e., P(X < 224) = 0.5871).

[Figure: distribution of individual data values for the population, centred at 218.4, with the area to the left of 224 shaded]

(ii) Since the question concerns the mean of a sample with a size of
40, the formula $z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ is used. The area is shown in the figure
below.

The z value is

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{224 - 218.4}{25 / \sqrt{40}} = 1.42$$

The area between z = 0 and z = 1.42 is 0.4222; this value must be added to
0.5000 to get the total area.
0.4222 + 0.5000 = 0.9222

Hence, the probability that the mean of a sample of 40 individuals is less than
224 pounds per year is 0.9222, or 92.22%. That is, $P(\bar{X} < 224) = 0.9222$.

[Figure: distribution of the sample means, centred at 218.4, with the area to the left of 224 shaded]

Comparing the two probabilities, one can see that the probability of selecting an
individual who consumes less than 224 pounds of meat per year is 58.71%, but
the probability of selecting a sample of 40 people with a mean consumption of
meat that is less than 224 pounds per year is 92.22%. This rather large
difference is due to the fact that the distribution of sample means is much less
variable than the distribution of individual data values.
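
The comparison above can be reproduced numerically. The sketch below is not part of the original module; it uses `NormalDist` from Python's standard `statistics` module in place of the printed z table. Because it does not round z to two decimal places, its probabilities differ slightly in the third decimal from the table-based values 0.5871 and 0.9222.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 218.4, 25.0   # population mean and standard deviation (pounds)
n = 40                    # sample size in part (ii)
x = 224.0                 # value of interest

# z value for an individual and for the mean of a sample of size n.
z_individual = (x - mu) / sigma
z_sample_mean = (x - mu) / (sigma / sqrt(n))

# Individual values follow N(mu, sigma); sample means follow N(mu, sigma / sqrt(n)).
p_individual = NormalDist(mu, sigma).cdf(x)
p_sample_mean = NormalDist(mu, sigma / sqrt(n)).cdf(x)

print(round(z_individual, 2), round(z_sample_mean, 2))   # 0.22 1.42
print(round(p_individual, 3))    # about 0.589 (table value 0.5871)
print(round(p_sample_mean, 3))   # about 0.922 (table value 0.9222)
```

The only difference between the two calculations is the standard deviation passed to `NormalDist`, which is why the sample mean is so much more likely to fall below 224 pounds.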


ACTIVITY 3B

INPUT!

1. The average salary for workers at an electronics factory is RM13.50 per
hour. Assume that the standard deviation is RM2.90 per hour and the
distribution is approximately normal. If $\bar{X}$ is the mean salary per hour for a
random sample of the workers at the factory, find the mean and standard
deviation of the sampling distribution of $\bar{X}$ if the sample size is (a) 30 workers,
and (b) 75 workers.

2. The average weight of sugar sachets is 32 grams. Assume the standard
deviation is 0.3 gram. If a random sample of 20 sachets is selected, find
the probability that the mean of their weight is between 31.8 and 31.9
grams.

3. Analysis of 150 compressive strength results gave a mean strength of 32
N/mm² and standard deviation 6.5 N/mm². Given that 10 samples of 12
results are considered, find the number of samples with mean strength
greater than 33 N/mm².

4. Asbestos-cement sheets are manufactured with a mean length 2400 mm
and standard deviation 3 mm. Given that 20 batches consisting of 3 dozen
sheets are considered, determine

(a) the probability that a batch (chosen at random) has a mean
length between 2399.5 mm and 2400.6 mm
(b) the number of batches with mean length less than 2399.3 mm.



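1. a) $\mu_{\bar{x}} = \mu$ = RM13.50, $\sigma_{\bar{x}}$ = RM0.53
b) $\mu_{\bar{x}} = \mu$ = RM13.50, $\sigma_{\bar{x}}$ = RM0.33

2. 0.0667 or 6.67%

3. $\mu_{\bar{x}} = \mu$ = 32 N/mm², $\sigma_{\bar{x}}$ = 1.81 N/mm², $P(\bar{x} > 33) = 0.2912$, so approximately 3 samples

4. (a) $P(2399.5 < \bar{x} < 2400.6) = 0.7262$ (b) $P(\bar{x} < 2399.3) = 0.0808$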


3.3 DISTRIBUTION OF THE SAMPLE MEANS

a. Distribution of the sample means with replacement

Statement 1

The distribution of the sample means $\bar{X}$ taken with replacement
from a normally distributed population with known mean $\mu$ and standard
deviation $\sigma$ is normal, regardless of the sample size (n). As previously shown,
this distribution will have a mean $\mu$ and a standard deviation $\sigma / \sqrt{n}$.

Statement 2

If the sample is taken from any population with known $\mu$ and $\sigma$, and the sample
size is large (n > 30), the distribution of the sample mean is approximately normal with
mean $\mu$ and standard deviation $\sigma / \sqrt{n}$, that is, $\bar{x} \sim N(\mu, \sigma^2 / n)$.

b. Distribution of the sample means without replacement

The formula for the standard error of the mean, $\sigma / \sqrt{n}$, is accurate when the samples
are drawn with replacement, or without replacement from a very large or infinite
population. Since sampling with replacement is for the most part unrealistic, a
correction factor is necessary for computing the standard error of the mean for
samples drawn without replacement from a finite population. Compute the
correction factor by using the following formula:

$$\sqrt{\frac{N - n}{N - 1}}$$

where N is the population size and n is the sample size.

This correction factor is necessary if relatively large samples are
taken from a small population, because the sample mean will then
estimate the population mean more accurately and there will be less error in the
estimation. Therefore, the standard error of the mean must be multiplied by the
correction factor to adjust it for large samples taken from a small population. That
is

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$$

Finally, the formula for the z value becomes

$$z = \frac{\bar{X} - \mu}{\dfrac{\sigma}{\sqrt{n}} \sqrt{\dfrac{N - n}{N - 1}}}$$

When the population is large and the sample is small, the correction factor is
generally not used, since it will be very close to 1.000. Therefore

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
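
As a small illustration of the correction factor, the Python sketch below (not part of the original module) computes the standard error of the mean with and without the finite population correction. The function name `standard_error` and its optional argument are illustrative assumptions; the numbers are taken from question 1 of Activity 3C (2500 men, $\sigma$ = 7 cm, samples of 30).

```python
from math import sqrt


def standard_error(sigma, n, population_size=None):
    """Standard error of the mean.

    If population_size (N) is given, sampling is taken to be without
    replacement and the finite population correction sqrt((N - n) / (N - 1))
    is applied.
    """
    se = sigma / sqrt(n)
    if population_size is not None:
        if population_size < 2 or not 0 < n <= population_size:
            raise ValueError("need 0 < n <= N and N >= 2")
        se *= sqrt((population_size - n) / (population_size - 1))
    return se


# Heights of 2500 men, sigma = 7 cm, samples of 30 (Activity 3C, question 1).
with_replacement = standard_error(7, 30)
without_replacement = standard_error(7, 30, population_size=2500)

print(round(with_replacement, 3))      # 1.278
print(round(without_replacement, 3))   # 1.271
```

With N = 2500 and n = 30 the two values differ only in the third decimal place, which matches the remark that the correction factor is close to 1 when the population is large and the sample is small.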


Example 3.3

1. The average price of houses in Jitra is RM157000 and the distribution of prices
is rather skewed. Assume the standard deviation is RM29500. If $\bar{x}$ is the mean
price for a sample of 400 houses selected at random, find the probability:
a) That the sample mean is between RM154000 and RM160000.
b) That the mean price for this sample is below RM154000.

2. The average time taken by line workers in an electronics firm to assemble
the electronic components is 80 hours with a standard deviation of 8
hours. Find the following probabilities for the mean assembly time if a random
sample consisting of 16 workers is selected.

a. $P(78 \le \bar{x} \le 82)$
b. $P(76 \le \bar{x} \le 84)$
c. $P(74 \le \bar{x} \le 86)$

3. The average service life of 400 batteries is 800 hours with a standard
deviation of 45 hours. If a random sample of 45 batteries is selected, what is the
probability that the sample mean is between 790 and 810 hours?

4. The data show the number of children belonging to a group of 50
Polytechnic lecturers.

No. of children    0    1    2    3    4
No. of lecturers   1   18   24    4    3

a. Find the mean and the standard deviation of the data above.
b. If a sample of 10 lecturers is taken, find the probability that the mean
number of children in this sample is more than 2.



1. Although the price of houses in Jitra is skewed and not normally distributed,
the sample mean price is approximately normal due to the big sample size (n = 400).
Therefore the central limit theorem is applicable.

Given $\mu$ = RM157000 and $\sigma$ = RM29500,

$$\mu_{\bar{x}} = \mu = \text{RM}157000, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{29500}{\sqrt{400}} = \text{RM}1475$$

Therefore $\bar{x} \sim N(157000,\ 1475^2)$.

a. $$P(154000 \le \bar{x} \le 160000) = P\left(\frac{154000 - 157000}{1475} \le \frac{\bar{x} - 157000}{1475} \le \frac{160000 - 157000}{1475}\right) = P(-2.03 \le Z \le 2.03) = 0.9576$$

b. $$P(\bar{x} \le 154000) = P\left(\frac{\bar{x} - 157000}{1475} \le \frac{154000 - 157000}{1475}\right) = P(Z \le -2.03) = 0.0212$$

2. Although the sample size is small (n = 16), the time taken to
assemble the components is normally distributed. Therefore the distribution
of the sample mean is normally distributed with mean $\mu_{\bar{x}}$ = 80 hours and
standard deviation $\sigma_{\bar{x}} = 8 / \sqrt{16} = 2$ hours.

a. $$P(78 \le \bar{x} \le 82) = P\left(\frac{78 - 80}{2} \le \frac{\bar{x} - 80}{2} \le \frac{82 - 80}{2}\right) = P(-1 \le Z \le 1) = 0.6826$$

b. $$P(76 \le \bar{x} \le 84) = P\left(\frac{76 - 80}{2} \le \frac{\bar{x} - 80}{2} \le \frac{84 - 80}{2}\right) = P(-2 \le Z \le 2) = 0.9544$$

c. $$P(74 \le \bar{x} \le 86) = P\left(\frac{74 - 80}{2} \le \frac{\bar{x} - 80}{2} \le \frac{86 - 80}{2}\right) = P(-3 \le Z \le 3) = 0.9974$$

3. The probability that the sample mean is between 790 and 810 hours is
0.9066.

4. The probability distribution is:

No. of children (x)        0      1      2      3      4
Relative frequency, p(x)   0.02   0.36   0.48   0.08   0.06

a)
$$\mu = \sum x\,p(x) = 0(0.02) + 1(0.36) + 2(0.48) + 3(0.08) + 4(0.06) = 1.8 \approx 2$$

$$\sigma^2 = \sum x^2 p(x) - \mu^2 = 0^2(0.02) + 1^2(0.36) + 2^2(0.48) + 3^2(0.08) + 4^2(0.06) - 1.8^2 = 0.72$$

$$\sigma = \sqrt{0.72} = 0.8485$$

b) Since the 10 lecturers are selected without replacement from a small
population (N = 50), the correction factor is applied. The sampling distribution
of the sample mean is taken to be approximately normal with $\mu_{\bar{x}} = 1.8$ and

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}} = \frac{0.8485}{\sqrt{10}} \sqrt{\frac{50 - 10}{50 - 1}} = 0.2424$$

Therefore $\bar{x} \sim N(1.8,\ 0.2424^2)$ and

$$P(\bar{x} > 2) = P\left(\frac{\bar{x} - 1.8}{0.2424} > \frac{2 - 1.8}{0.2424}\right) = P(Z > 0.83) = 0.2033$$

ACTIVITY 3C

INPUT!

1. The heights of 2500 men are normally distributed with a mean of 170 cm and a
standard deviation of 7 cm. If random samples are taken of 30 men, predict
the standard deviation and the mean of sampling distribution of means, if
sampling is done (a) with replacement, and (b) without replacement.

2. A group of 1000 ingots of metal have a mean mass of 7.4 kg and a standard
deviation of 0.4 kg. Find the probability that a sample of 50 ingots chosen at
random from the group, without replacement, will have a combined mass of (a)
between 360 and 377.5 kg, and (b) more than 375 kg.

3. Determine the mean and standard deviation of the set of numbers 1, 2, 4, 5,
and 6, correct to three decimal places. By selecting all possible different
samples of size 2 which can be drawn with replacement (25 pairs) determine
(a) the mean of the sampling distribution of means, and (b) the standard error
of the means, correct to three decimal places.

4. Determine the standard error of the means for problem 3, if sampling is without
replacement, correct to three significant figures.

5. The length of 1500 bolts is normally distributed with a mean of 22.4 cm and a
standard deviation of 0.048 cm. If 30 samples are drawn at random from this
population, each of size 36 bolts, determine the mean of the sampling
distribution and the standard error of the means when sampling is done with
replacement.

6. Determine the standard error of the means in problem 5, if sampling is done
without replacement, correct to 4 decimal places.


7. If a random sample of 64 lamps is drawn from a batch, determine the
probability that the mean time to failure will be less than 785 hours, correct to 3
decimal places.

8. Determine the probability that the mean time to failure of a random sample of
16 lamps will be between 790 hours and 810 hours, correct to 3 decimal
places.

9. For a random sample of 64 lamps, determine the probability that the mean
time to failure will exceed 820 hours, correct to 2 significant figures.



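1. (a) $\sigma_{\bar{x}}$ = 1.278 cm (b) $\sigma_{\bar{x}}$ = 1.271 cm; in both cases $\mu_{\bar{x}}$ = 170 cm

2. The mean of the sampling distribution of means, $\mu_{\bar{x}} = \mu$ = 7.4 kg
The standard error of the means, $\sigma_{\bar{x}}$ = 0.0552 kg
(a) 0.9966 (b) 0.0351

3. $\mu = 3.600$, $\sigma = 1.855$ (a) $\mu_{\bar{x}} = 3.600$, (b) $\sigma_{\bar{x}} = 1.312$

4. $\sigma_{\bar{x}}$ = 1.136

5. $\mu_{\bar{x}}$ = 22.4 cm, $\sigma_{\bar{x}}$ = 0.008 cm

6. $\sigma_{\bar{x}}$ = 0.0079 cm

7. 0.023

8. 0.497

9. 0.0038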


SELF ASSESSMENT 3

You are approaching success. Try all the questions in this self-assessment
section and check your answers on the next page. If you encounter any
problems, consult your instructor. Good luck.

1. If the samples of a specific size are selected from a population and the
means are computed, what is this distribution of means called?
2. What is the mean of the sample means?
3. What does the central limit theorem say about the shape of the distribution
of sample means?
4. What formula is used to gain information about a sample mean when the
variable is normally distributed or when the sample size is 30 or more?

For the exercises below, assume that the sample is taken from a large
population and the correction factor can be ignored.

5. The mean serum cholesterol of a large population of overweight
adults is 220 mg/dl and the standard deviation is 16.3 mg/dl. If a sample of
adults is selected, find the probability that the mean will be between 220
and 222 mg/dl.

6. The mean weight of 18-year-old females is 126 pounds, and the standard
deviation is 15.7 pounds. If a sample of 25 females is selected, find the
probability that the mean of the sample will be greater than 128.3 pounds.
Assume the variable is normally distributed.


7. The average price of a pound of sliced bacon is RM2.02. Assume the
standard deviation is RM0.08. If a random sample of 40 one-pound
packages is selected, find the probability that the mean of the sample will
be less than RM2.00.

8. The mean score on a dexterity test for 12-year-olds is 30. The standard
deviation is 5. If a psychologist administers the test to a class of 22 students,
find the probability that the mean of the sample will be between 27 and 31.
Assume the variable is normally distributed.

9. The average age of lawyers is 43.6 years, with a standard deviation of 5.1
years. If a law firm employs 50 lawyers, find the probability that the
average age of the group is greater than 44.2 years.

10. Procter & Gamble reported that an American family of 4 washes an
average of one ton (2000 pounds) of clothes each year. If the standard
deviation of the distribution is 187.5 pounds, find the probability that the
mean of a randomly selected sample of 50 families of four will be
between 1980 and 1990 pounds.

11. The average time it takes a group of adults to complete a certain
achievement test is 46.2 minutes. The standard deviation is 8.0 minutes.
Assume the variable is normally distributed.

a) Find the probability that a randomly selected adult will
complete the test in less than 43 minutes.
b) Find the probability that if 50 randomly selected adults take
the test, the mean time it takes the group to complete the
test will be less than 43 minutes.
c) Does it seem reasonable that an adult would finish the test in
less than 43 minutes? Explain
d) Does it seem reasonable that the mean of the 50 adults
could be less than 43 minutes?


12. The average cholesterol content of a certain brand of eggs is 215
milligrams and the standard deviation is 15 milligrams. Assume the
variable is normally distributed.

a) If a single egg is selected, find the probability that the
cholesterol content will be more than 220 milligrams.
b) If a sample of eggs is selected, find the probability that the
mean of the sample will be larger than 220 milligrams.

13. The average labour cost for car repairs for a large chain of car repair shops is
RM 48.25. The standard deviation is RM 4.20. Assume the variable is
normally distributed.

(a) If a store is selected at random, find the probability that the
labour cost will range between RM 46 and RM 48.
(b) If stores are selected at random, find the probability that the
mean of the sample will be between RM 46 and RM 48.
(c) Which answer is larger? Explain why.


FEEDBACK TO SELF-ASSESSMENT 3


1. The distribution is called the sampling distribution of sample means.
2. The mean of the sample means is equal to the population mean.
3. The distribution will be approximately normal when the sample size is
large.
4. $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

5. 0.2486
6. 0.2327
7. 0.0571
8. 0.8239
9. 0.2033
10. 0.1254
11. a) 0.3446
b) 0.0023
c) Yes , since it is within one standard deviation of the mean.
d) very unlikely

12. a) 0.3707 b) 0.0475

13. a) 0.1815 b) 0.3854 c) Means are less variable than individual data.

STATISTICAL ESTIMATION AND SMALL SAMPLING THEORIES C5606/4/

1

STATISTICAL ESTIMATION AND SMALL SAMPLING
THEORIES

OBJECTIVES

GENERAL OBJECTIVE

Use Statistics to make estimates of parameters

SPECIFIC OBJECTIVE

After completing this unit, you should be able to

Find the confidence interval for the mean when $\sigma$ is known.
Determine the minimum sample size for finding a confidence interval for
the mean.
Find the confidence interval for the mean when $\sigma$ is unknown and $n \ge 30$.
Estimate the population parameters based on a large sample size using
point and interval estimates and explain the concept of a confidence
interval
Estimate the mean of the population when the standard deviation of the
population is known
Estimate the mean and standard deviation of a population from sample
data
Estimate the mean of a population based on a small sample size

UNIT 4



4.0 CONFIDENCE INTERVAL AND SAMPLE SIZE

One aspect of inferential statistics is estimation, which is the process of
estimating the value of a parameter from information obtained from a sample.
Look at the following statements:
- One out of four Polytechnic students is currently dieting
- 72% of Malaysians cannot afford to buy a brand new Mercedes Benz
- The average kindergarten student has seen more than 5000 hours
of television
- The average amount of pocket money for a Poly student is RM500 per
semester

Since the populations from which these values were obtained are large, these
values are only estimates of the true parameters and are derived from data
collected from samples.

The statistical procedures for estimating the population mean, variance and
standard deviation will be explained in this module.

An important question in estimation is that of sample size. How large should the
sample be in order to make an accurate estimate?

INPUT

4.1 CONFIDENCE INTERVALS FOR THE MEAN ($\sigma$ Known or $n \ge 30$)

Suppose a Poly director wishes to estimate the average age of the students
attending classes this semester. The director could select a random sample of
100 students and find the average age of these students, say 22.3 years. From
the sample mean, the director could infer that the average age of all the students
is 22.3 years. This type of estimate is called a point estimate.

Sample measures (i.e., statistics) are used to estimate population measures (i.e.,
parameters). The sample mean is the best estimate of the population mean
because the means of samples vary less than other statistics such as medians
and modes when many samples are selected from the same population.

A point estimate is a specific numerical value estimate of a
parameter. The best point estimate of the population mean $\mu$ is
the sample mean $\bar{X}$.

A good estimator should satisfy the three properties described next.

Three properties of a good estimator

1. The estimator should be an unbiased estimator. That is, the expected
value or the mean of the estimates obtained from samples of a given
size is equal to the parameter being estimated.
2. The estimator should be consistent. For a consistent estimator, as
sample size increases, the value of the estimator approaches the
value of the parameter estimated.
3. The estimator should be a relatively efficient estimator. That is, of
all the statistics that can be used to estimate a parameter, the
relatively efficient estimator has the smallest variance.



4.1.1 CONFIDENCE INTERVALS

As stated in the previous module, the sample mean will be, for the most part,
somewhat different from the population mean due to sampling error. How good,
then, is a point estimate? As the accuracy of a point estimate is questionable,
statisticians use another type of estimate called an interval estimate.

For example, an interval estimate for the average age of all the students might be
$26.9 < \mu < 27.7$, or $27.3 \pm 0.4$ years.

Either the interval contains the parameter or it does not. A degree of confidence
(usually a percentage) can be assigned before an interval estimate is made. For instance,
one may wish to be 95% confident that the interval contains the true population
mean. Another question then arises. Why 95%? Why not 99% or 99.5%?

If one desires to be more confident (99% or 99.5%), then the interval must be
larger. For example, a 99% confidence interval for the mean age of the Poly
students might be $26.7 < \mu < 27.9$, or $27.3 \pm 0.6$. Hence, a trade-off occurs. To be
more confident that the interval contains the true population mean, one must
make the interval wider.

An interval estimate of a parameter is an interval or a range of values
used to estimate the parameter. This estimate may or may not contain the
value of the parameter being estimated.
The confidence level of an interval estimate of a parameter is the probability
that the interval estimate will contain the parameter.

A confidence interval is a specific interval estimate of a parameter
determined by using data obtained from a sample and by using the specific
confidence level of the estimate.


Intervals constructed in this way are called confidence intervals. Three
common confidence intervals are used: The 90%, the 95%, and the 99%
confidence interval.

The central limit theorem states that when the sample size is large,
approximately 95% of the sample means will fall within $\pm 1.96$ standard errors of
the population mean. That is,

$$\mu \pm 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$$

Now, if a specific sample mean is selected, say $\bar{X}$, there is a 95% probability that
it falls within the range of $\mu \pm 1.96(\sigma / \sqrt{n})$. Likewise, there is a 95% probability that
the interval specified by $\bar{X} \pm 1.96(\sigma / \sqrt{n})$ will contain $\mu$. Stated another way,

$$\bar{X} - 1.96\left(\frac{\sigma}{\sqrt{n}}\right) < \mu < \bar{X} + 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$$

Hence, one can be 95% confident that the population mean is contained within
that interval when the values of the variable are normally distributed in the
population.

Since other confidence intervals are used in statistics, the symbol $z_{\alpha/2}$ is used in
the general formula for confidence intervals. The Greek letter $\alpha$ (alpha)
represents the total area in both of the tails of the standard normal distribution
curve; $\alpha/2$ represents the area in each one of the tails.

The relationship between $\alpha$ and the confidence level is that the stated
confidence level is the percentage equivalent to the decimal value of $1 - \alpha$, and
vice versa. When the 95% confidence interval is to be found, $\alpha = 0.05$, since
1 - 0.05 = 0.95. When $\alpha = 0.01$, then $1 - \alpha = 1 - 0.01 = 0.99$, and the 99%
confidence interval is being calculated.

The term $z_{\alpha/2}\left(\sigma / \sqrt{n}\right)$ is called the maximum error of estimate. For a specific value,
say $\alpha = 0.05$, 95% of the sample means will fall within this error value on either
side of the population mean.
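
The steps above (find $\alpha$, find $z_{\alpha/2}$, compute the maximum error of estimate, then form the interval) can be sketched in Python as follows. This is an illustrative sketch, not part of the original module: the function name `confidence_interval` is an assumption, and the critical value is obtained from `NormalDist().inv_cdf` in the standard `statistics` module rather than from the printed table. The sample figures are those of question 1 in Example 4.1 (mean 23.2 years, $\sigma$ = 2 years, n = 50, 95% confidence).

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(x_bar, sigma, n, level=0.95):
    """Confidence interval for the mean when sigma is known (or n >= 30).

    Returns (lower, upper, z, max_error).
    """
    alpha = 1 - level                          # total area in both tails
    z = NormalDist().inv_cdf(1 - alpha / 2)    # critical value z_(alpha/2)
    max_error = z * sigma / sqrt(n)            # maximum error of estimate
    return x_bar - max_error, x_bar + max_error, z, max_error


lower, upper, z, max_error = confidence_interval(23.2, 2, 50, level=0.95)

print(round(z, 2))                        # 1.96
print(round(max_error, 2))                # 0.55
print(round(lower, 1), round(upper, 1))   # 22.6 23.8
```

The computed critical values are slightly more precise than the table values used in this unit (about 2.576 rather than 2.58 for 99%, and about 1.645 rather than 1.65 for 90%), so results may differ in the last decimal place.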

Example 4.1

1. One of the Polytechnic directors wishes to estimate the average age of the
students currently enrolled. From last year's records, it is known that the standard
deviation is 2 years. A sample of 50 students is selected, and the mean
age is found to be 23.2 years. Find the 95% confidence interval of the population mean.

2. A well-known tonic drink is known to increase the pulse rate of its users. The
standard deviation of the pulse rate is known to be 5 beats per minute. A
sample of 30 users had an average pulse rate of 104 beats per minute. Find
the 99% confidence interval of the true mean.

3. A sample of 30 koperasi has the mean $\bar{X}$ = 11.091 (assets in millions
of RM) and the standard deviation s = 14.405. Find the 90% confidence
interval of the mean.

Formula for the Confidence Interval of the Mean for a Specific o

|
|
.
|
\
|
n
z X
o
o
2
< <
|
|
.
|
\
|
+
n
z X
o
o
2

For a 95% confidence interval,
2
o
z = 1.96; and for a 99% confidence
interval,
2
o
z = 2.58

The maximum error of estimate is the maximum likely difference between
the point estimate of a parameter and the actual value of the parameter.


7

4. It is required to determine the mean diameter of a long length of wire. The
diameter of the wire is measured in 16 places selected at random throughout
its length and the mean of these values is 0.314 mm. If the standard
deviation of the diameter of the wire is given by the manufacturers as 0.025
mm, determine (a) the 80% confidence interval of the estimated mean
diameter of the wire, and (b) with what degree of confidence it can be said
that the mean diameter is 0.314 ± 0.01 mm.


1. Since the 95% confidence interval is desired, $z_{\alpha/2} = 1.96$. Hence,
substituting in the formula

$$\bar{X} - 1.96\left(\frac{\sigma}{\sqrt{n}}\right) < \mu < \bar{X} + 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$$

$$23.2 - 1.96\left(\frac{2}{\sqrt{50}}\right) < \mu < 23.2 + 1.96\left(\frac{2}{\sqrt{50}}\right)$$

$$23.2 - 0.6 < \mu < 23.2 + 0.6$$

$$22.6 < \mu < 23.8 \quad \text{or} \quad 23.2 \pm 0.6 \text{ years.}$$

The director can say, with 95% confidence, that the average age of the
students is between 22.6 and 23.8 years, based on 50 students.

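2. Since the 99% confidence interval is desired, $z_{\alpha/2} = 2.58$. Substituting in the formula

$$\bar{X} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) < \mu < \bar{X} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$

$$104 - 2.58\left(\frac{5}{\sqrt{30}}\right) < \mu < 104 + 2.58\left(\frac{5}{\sqrt{30}}\right)$$

$$104 - 2.4 < \mu < 104 + 2.4$$

$$101.6 < \mu < 106.4$$

$$102 < \mu < 106 \ \text{(approximately)} \quad \text{or} \quad 104 \pm 2$$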



One can be 99% confident that the mean pulse rate of all users is
between 102 and 106 beats per minute, based on a sample of 30 users.

3.
STEP 1 It is known that the mean ($\bar{X}$) is 11.091 and the standard
deviation ($s$) is 14.405.
STEP 2 Find $\alpha/2$. Since the 90% confidence interval is to be used,
$\alpha = 1 - 0.90 = 0.10$, and $\dfrac{\alpha}{2} = \dfrac{0.10}{2} = 0.05$.
STEP 3 Find $z_{\alpha/2}$. Subtract 0.05 from 0.5000 to get 0.4500. The
corresponding z from the table is 1.65.
STEP 4 Substitute in the formula

$$\bar{X} - z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) < \mu < \bar{X} + z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$

($s$ is used in place of $\sigma$ when $\sigma$ is unknown, since $n \ge 30$)

$$11.091 - 1.65\left(\frac{14.405}{\sqrt{30}}\right) < \mu < 11.091 + 1.65\left(\frac{14.405}{\sqrt{30}}\right)$$

$$6.752 < \mu < 15.430$$

Hence, one can be 90% confident that the population mean of the assets
is between RM6.752 million and RM15.430 million, based on a sample of 30
cooperatives.



4. (a) For the population: $\sigma = 0.025$ mm; for the sample: $n = 16$, $\bar{X} = 0.314$ mm.
Because an infinite number of measurements can be obtained for the
diameter of the wire, the population is infinite and the estimated value
of the confidence interval of the population mean is given by

$$\bar{X} \pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) = 0.314 \pm \frac{1.28(0.025)}{\sqrt{16}} = 0.314 \pm 0.008 \text{ mm.}$$

That is, the 80% confidence interval is from 0.306 mm to 0.322 mm.
This indicates that the estimated mean diameter of the wire is between
0.306 mm and 0.322 mm and that this prediction is likely to be correct 80 times
out of 100.

(b) To determine the confidence level, the given data is equated to the
expression

$$0.314 \pm 0.01 \text{ mm} = \bar{X} \pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right), \quad \text{i.e.} \quad z_{\alpha/2} = \frac{0.01\sqrt{16}}{0.025} = 1.6$$

From the standard normal table, the area between 0 and z = 1.6 is 0.4452, so
the confidence level is 2(0.4452) = 0.8904, or about 89%.
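
Part (b) of this problem can be checked numerically. The sketch below, with the hypothetical helper name `confidence_for_error`, solves $E = z_{\alpha/2}\,\sigma/\sqrt{n}$ for z and then converts z to the central area under the standard normal curve. It assumes the z-based interval of this section and uses `statistics.NormalDist` in place of the printed table.

```python
from math import sqrt
from statistics import NormalDist


def confidence_for_error(max_error, sigma, n):
    """Return (z, confidence level) implied by a given maximum error of estimate."""
    # Solve E = z * sigma / sqrt(n) for z
    z = max_error * sqrt(n) / sigma
    # Area between -z and +z under the standard normal curve
    level = 2 * NormalDist().cdf(z) - 1
    return z, level


# Wire diameter problem: E = 0.01 mm, sigma = 0.025 mm, n = 16
z, level = confidence_for_error(0.01, 0.025, 16)
print(f"z = {z:.2f}, confidence = {level:.1%}")
```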



ACTIVITY 4A

INPUT!

1. What is the difference between a point estimate and an interval estimate
of a parameter? Which is better? Why?

2. What is the maximum error of estimate?

3. What are the three properties of a good estimator?

4. What is necessary to determine a sample size?

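5. Find each:
a) $z_{\alpha/2}$ for the 99% confidence interval
b) $z_{\alpha/2}$ for the 98% confidence interval
c) $z_{\alpha/2}$ for the 95% confidence interval
d) $z_{\alpha/2}$ for the 90% confidence interval
e) $z_{\alpha/2}$ for the 94% confidence interval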

6. A sample of the mathematics test scores for 35 first semester students
has a mean of 82. The standard deviation of the sample is 15.
a) Find the 95% confidence interval of the mean test scores of the
entire first semester students.

b) Find the 99% confidence interval of the mean test scores of the
entire first semester students.

c) Which interval is larger? Explain why.



SOLUTION TO ACTIVITY 4A

1. A point estimate of a parameter specifies a specific value, such as $\mu = 87$;
an interval estimate specifies a range of values for the parameter,
such as $84 < \mu < 90$. The advantage of an interval estimate is that a specific
confidence level (say 95%) can be selected, and one can be 95%
confident that the interval contains the parameter that is being estimated.

2. The maximum error of estimate is the likely range of values to the right or
left of the statistic which may contain the parameter.

3. A good estimator should be unbiased, consistent, and relatively efficient.

4. For one to be able to determine sample size, the maximum error of
estimate and degree of confidence must be specified and the population
standard deviation must be known.

5. a) 2.58
b) 2.33
c) 1.96
d) 1.65
e) 1.88

6. a) $77 < \mu < 87$
b) $75 < \mu < 89$
c) The 99% confidence interval is larger because the confidence level is
larger.



4.2 SAMPLE SIZE

Sample size determination is closely related to statistical estimation. How large a
sample is necessary to make an accurate estimate? The answer depends on three
things: the maximum error of estimate, the population standard deviation, and the
degree of confidence. For the purpose of this unit, it is assumed that the
population standard deviation of the variable is known or has been estimated
from a previous study.

The formula for sample size is derived from the maximum error of estimate
formula,

$$E = z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$

and this formula is solved for n as follows:

$$n = \left(\frac{z_{\alpha/2}\cdot\sigma}{E}\right)^2$$

Example 4.2

You are asked to estimate the average age of the students in this Poly.
How large a sample is necessary? You want to be 99% confident that the estimate
is accurate within one year. From a previous study, the standard
deviation of the ages is known to be 3 years.

I NPUT



Since $\alpha = 0.01$ (or $1 - 0.99$), $z_{\alpha/2} = 2.58$, and $E = 1$, substituting in the
formula, you get

$$n = \left(\frac{z_{\alpha/2}\cdot\sigma}{E}\right)^2 = \left(\frac{(2.58)(3)}{1}\right)^2 = 59.9$$

which is rounded up to 60. Therefore, you
need a sample size of at least 60 students in order to be 99% confident that the
estimate is within 1 year of the true mean age.
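
The sample size calculation can be sketched in a few lines of Python. The function name `required_sample_size` is a hypothetical helper; the z value is passed in as read from the table, and the result is always rounded up to the next whole number, since a fractional observation is not possible.

```python
from math import ceil


def required_sample_size(z, sigma, max_error):
    """Smallest whole n satisfying n >= (z * sigma / E)^2."""
    return ceil((z * sigma / max_error) ** 2)


# Example 4.2: 99% confidence (table value z = 2.58), sigma = 3 years, E = 1 year
n = required_sample_size(2.58, 3, 1)
print(n)
```

Rounding up rather than to the nearest integer keeps the maximum error at or below the requested value.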



ACTIVITY 4B

INPUT!

1. A study of 40 poly lecturers showed that they spent, on average, 12.6
minutes correcting a student's weekly quiz.
a) Find the 90% confidence interval of the mean time for all quizzes
when $\sigma = 2.5$ minutes.
b) If a lecturer stated that he spent, on average, 30 minutes correcting
a quiz, what would be your reaction?

2. A study found that Poly students spend an average of RM18.50 per
month on their cellular phone bills. If a sample of 49 students was used, find
the 90% confidence interval of the mean. Assume the standard deviation of
the sample is RM1.56.

3. The mean weight of 84 soil samples is 61.2 grams and the standard
deviation is 7.9 grams. Find the 95% confidence interval for the true mean.

4. A poly director wishes to estimate the average number of hours his part-time
lecturers teach per week. The standard deviation from the previous study is
2.6 hours. How large a sample must be selected if he wants to be 99%
confident of finding whether the true mean differs from the sample mean by
1 hour?

5. You are required to estimate the fresh weights of concrete cubes. How large
a sample must be selected if you are required to be 90% confident that the
true mean is within 600 grams of the sample mean? The standard deviation
of the fresh weights is known to be 800 grams.

6. A class lecturer would like to estimate the average number of sick days that
students use per year. It is assumed that the standard deviation is 2.5 days.
How large a sample must be selected if the lecturer wants to be 95%
confident of getting an interval that contains the true mean with a maximum
error of 1 day?



SOLUTION TO ACTIVITY 4B

1. (a) $11.9 < \mu < 13.3$ (b) It would be highly unlikely, since this is far larger
than 13.3

2. $\text{RM}18.13 < \mu < \text{RM}18.87$

3. $59.5 < \mu < 62.9$

4. 45

5. 5

6. 25



4.3 CONFIDENCE INTERVALS FOR THE MEAN ($\sigma$ unknown and $n < 30$)

When $\sigma$ is known and the variable is normally distributed, or when $\sigma$ is unknown
and $n \ge 30$, the standard normal distribution is used to find confidence intervals
for the mean. However, in many situations, the population standard deviation is
not known and the sample size is less than 30. In such situations, the standard
deviation from the sample can be used in place of the population standard
deviation for confidence intervals. But a somewhat different distribution, called the t
distribution, must be used when the sample size is less than 30 and the
variable is normally or approximately normally distributed.

Characteristics of the t Distribution

The t distribution shares some characteristics of the normal distribution
and differs from it in others. The t distribution is similar to the standard
normal distribution in the following ways.

1. It is bell-shaped.
2. It is symmetrical about the mean.
3. The mean, median, and mode are equal to 0 and are located at the
center of the distribution.
4. The curve never touches the x-axis.

The t distribution differs from the standard normal distribution in the
following ways.

1. The variance is greater than 1.
2. The t distribution is actually a family of curves based on the concept
of degrees of freedom, which is related to the sample size.


3. As the sample size increases, the t distribution approaches the
standard normal distribution.
See the figure below.

Many statistical distributions use the concept of degrees of freedom, and the
formulas for finding the degrees of freedom vary for different statistical tests. The
degrees of freedom are the number of values that are free to vary after a sample
statistic has been computed, and they tell the researcher which specific curve to use
when a distribution consists of a family of curves.

For example, if the mean of 5 values is 10, then 4 of the 5 values are free to vary.
But once 4 values are selected, the 5th value must be a specific number to get a
sum of 50, since 50/5 = 10. Hence, the d.f. are $5 - 1 = 4$, and this value tells the
researcher which t curve to use.

The symbol d.f. will be used for degrees of freedom. The d.f. for a confidence
interval for the mean are found by subtracting 1 from the sample size, i.e.
d.f. = $n - 1$.

Formula for a Specific Confidence Interval for the Mean When $\sigma$ Is Unknown and
$n < 30$

$$\bar{X} - t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) < \mu < \bar{X} + t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$

The degrees of freedom are $n - 1$.

Notes: When to use the z or t distribution



Example 4.3

1. Find the $t_{\alpha/2}$ value for a 95% confidence interval when the sample size is
22.

2. Ten randomly selected automobiles were stopped, and the tread depth of
the right front tire was measured. The mean was 0.32 mm, and the
standard deviation was 0.08 mm. Find the 95% confidence interval of the
mean depth. Assume that the variable is approximately normally
distributed.

3. The data represent a sample of the number of home fires started by
candles for the past several years. Find the 99% confidence interval for
the mean number of home fires started by candles each year.

5460 5900 6090 6310 7160 8440
9930

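Flowchart (when to use the z or t distribution):

Is $\sigma$ known?
- Yes: use $z_{\alpha/2}$ values no matter what the sample size is.
- No: is $n \ge 30$?
  - Yes: use $z_{\alpha/2}$ values and $s$ in place of $\sigma$ in the formula.
  - No: use $t_{\alpha/2}$ values and $s$ in the formula.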
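
The decision rule for choosing between z and t values can be written as a small function. This is only a sketch of the flowchart logic; the name `choose_table` and the returned labels are assumptions made for illustration.

```python
def choose_table(sigma_known, n):
    """Return (table value to use, standard deviation to use) per the flowchart."""
    if sigma_known:
        # Population standard deviation known: z values for any sample size
        return "z", "sigma"
    if n >= 30:
        # Large sample: z values, with s standing in for sigma
        return "z", "s"
    # Small sample, sigma unknown: t values with n - 1 degrees of freedom
    return "t", "s"


# Tread depth problem: sigma unknown, sample of 10 tires
print(choose_table(sigma_known=False, n=10))
```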



1. d.f. = $22 - 1$, or 21. Find 21 in the left column and 95% in the row labeled
confidence intervals. The intersection where the two meet gives the value
for $t_{\alpha/2}$, which is 2.080. See the figure below. Note: At the bottom of the
table where d.f. = $\infty$, the $z_{\alpha/2}$ values can be found for specific confidence intervals.
The reason is that as the degrees of freedom increase, the t distribution
approaches the standard normal distribution.

2. Since $\sigma$ is unknown and $s$ must replace it, the t distribution (see table
F) must be used for 95%. Hence with 9 degrees of freedom, $t_{\alpha/2} = 2.262$.
The 95% confidence interval of the population mean is found by
substituting in the formula

$$\bar{X} - t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) < \mu < \bar{X} + t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$

$$0.32 - 2.262\left(\frac{0.08}{\sqrt{10}}\right) < \mu < 0.32 + 2.262\left(\frac{0.08}{\sqrt{10}}\right)$$

$$0.26 < \mu < 0.38$$

Therefore, one can be 95% confident that the population mean
tread depth of all right front tires is between 0.26 and 0.38 mm, based on a
sample of 10 tires.


3.

STEP 1

Find the mean and standard deviation for the data.
Use the formulas or your calculator.
The mean $\bar{X} = 7041.4$
The standard deviation $s = 1610.3$

STEP 2

Find $t_{\alpha/2}$ from table F. Use the 99% confidence interval with
d.f. = 6. It is 3.707.

STEP 3

Substitute in the formula and solve

$$\bar{X} - t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) < \mu < \bar{X} + t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$

$$7041.4 - 3.707\left(\frac{1610.3}{\sqrt{7}}\right) < \mu < 7041.4 + 3.707\left(\frac{1610.3}{\sqrt{7}}\right)$$

$$4785.2 < \mu < 9297.6$$

One can be 99% confident that the population mean of home fires started by
candles each year is between 4785.2 and 9297.6, based on a sample of home
fires occurring over a period of 7 years.
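
The three steps of this solution can be reproduced with Python's standard library. In this sketch the t value is still read from table F and passed in by hand, because the standard library has no t distribution; the function name `t_confidence_interval` is a hypothetical helper, not something defined in this unit.

```python
from math import sqrt
from statistics import mean, stdev


def t_confidence_interval(data, t_value):
    """Interval for the mean using a t value looked up for n - 1 d.f."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    max_error = t_value * s / sqrt(n)
    return x_bar - max_error, x_bar + max_error


# Home fires started by candles; t = 3.707 for 99% confidence with d.f. = 6
fires = [5460, 5900, 6090, 6310, 7160, 8440, 9930]
lower, upper = t_confidence_interval(fires, t_value=3.707)
print(f"{lower:.1f} < mu < {upper:.1f}")
```

Because the code carries unrounded values of the mean and standard deviation, the last digit may differ slightly from a hand calculation that rounds at each step.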



ACTIVITY 4C

INPUT!

For the following activities, assume that all variables are approximately
normally distributed.

1. A sample of 8 measurements of the diameter of a bar are made and the
mean of the sample is 2.470 cm. The standard deviation of the samples is
0.21 mm. Determine (a) the 95% confidence interval and (b) the 80%
confidence interval for an estimate of the actual diameter of the bar.

2. A sample of 15 electric lamps is selected randomly from a large batch
and tested until they fail. The mean and standard deviation of the time
to failure are 1177 hours and 25 hours respectively. Determine the
confidence level based on an estimated failure time of 1177 ± 5.8 hours.

3. The value of the ultimate tensile strength of a material is determined by
measurements on 10 samples of the material. The mean and standard
deviation of the results are found to be 4.38 MPa and 0.06 MPa
respectively. Determine the 95% confidence interval for the mean of the
ultimate tensile strength of the material.

4. Use data in problem #3 above to determine the 99% confidence interval
for the mean of the ultimate tensile strength of the material.

5. The time taken for a chemical reaction to take place is measured 5 times
and is found to be: 0.28 hours, 0.30 hours, 0.27 hours, 0.33 hours and
0.31 hours. Determine the 95% and 99% confidence intervals for the
estimated true reaction time.



SOLUTION TO ACTIVITY 4C

1. (a) The 95% confidence limits are 2.455 cm and 2.485 cm.
(b) The 80% confidence limits are 2.463 cm and 2.477 cm.

2. It is likely that 80% of all the lamps will fail between 1171.2 and
1182.8 hours ($t_{\alpha/2} = 0.868$).

3. $4.343 < \mu < 4.417$

4. $4.324 < \mu < 4.436$

5. $0.275 < \mu < 0.321$; $0.258 < \mu < 0.338$



SELF ASSESSMENT 4

You are approaching success. Try all the questions in this self-
assessment section and check your answers on the next page. If you encounter
any problems, consult your instructor. Good luck.

1. When should the t distribution be used to find a confidence interval for the
mean?

2. Determine whether the statement is true or false. If the statement is false,
explain why.
a) Interval estimates are preferred over point estimates since a confidence
level can be specified.
b) An estimator is consistent if, as the sample size decreases, the value
of the estimator approaches the value of parameter estimated.

Select the best answer.

3. When a 99% confidence interval is calculated instead of 95% confidence
interval with n being the same, the maximum error of estimate will be
a. Smaller
b. Larger
c. The same
d. It cannot be determined



4. When the population standard deviation is unknown and the sample size is
less than 30, what table value should be used in computing a confidence
interval for a mean?
a. z
b. t
c. None of the above

Complete the following statement with the best answer.

5. The maximum difference between the point estimate of a parameter and
the actual value of the parameter is
called__________________________________.

6. The three confidence intervals used most often are the ____%, ______%,
and ___%.

7. The specific resistance of a reel of German silver wire of nominal diameter
0.5 mm is estimated by determining the resistance of 7 samples of the
wire. These were found to have resistance values (in ohms per meter) of
1.12, 1.15, 1.10, 1.14, 1.15, 1.10 and 1.11. Determine the 95% confidence
interval for the true specific resistance of the reel of wire.

8. In determining the melting point of a metal, 5 determinations of the melting
point are made. The mean and standard deviation of the five results are
232.27 °C and 0.742 °C. Calculate the confidence with which the prediction
that the melting point of the metal is between 232.48 °C and 233.06 °C can be
made.

9. The standard deviation of the masses of 500 blocks is 150 kg. A random
sample of 40 blocks has a mean mass of 2.40 Mg.
a) Determine the 95% and 99% confidence intervals for estimating the
mean mass of the remaining 460 blocks, and
b) With what degree of confidence can it be said that the mean mass of
the remaining 460 blocks is 2.40 ± 0.035 Mg?



In the following exercises, assume that all variables are approximately
normally distributed.

10. The average hemoglobin for a sample of 20 lecturers was 16 grams per
100 milliliters, with a sample standard deviation of 2 grams. Find the 99%
confidence interval of the true mean.

11. A sample of 6 adult elephants had an average weight of 12200 pounds,
with a sample standard deviation of 200 pounds. Find the 95% confidence
interval of the true mean.

12. A recent study of 28 city residents showed that the mean of the time they
had lived at their present address was 9.3 years. The standard deviation
of the sample was 2 years. Find the 90% confidence interval of the true
mean.

13. A recent study of 25 students showed that they spent an average
RM18.53 for petrol per week, the standard deviation of the sample was
RM3.00. Find the 95% confidence interval of the true mean.

14. The average yearly income for 28 married couples in Politeknik is
RM58219.00. The standard deviation of the sample is RM56.00. Find the
95% confidence interval of the true mean.



FEEDBACK TO SELF ASSESSMENT 4


1. The t distribution should be used when $\sigma$ is unknown and $n < 30$.

2. (a) True (b) False

3. b

4. b

5. Maximum error of estimate

6. 90; 95; 99

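7. $1.11\ \Omega\,\text{m}^{-1} < \mu < 1.14\ \Omega\,\text{m}^{-1}$

8. 95%

9. (a) $2.355 < \mu < 2.445$; $2.341 < \mu < 2.459$ (b) 86%

10. $15 < \mu < 17$

11. $11990 < \mu < 12410$

12. $8.7 < \mu < 9.9$

13. $\text{RM}17.29 < \mu < \text{RM}19.77$

14. $\text{RM}58197.00 < \mu < \text{RM}58241.00$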

CORRELATION AND REGRESSION C 5606 / 5/ 1

CORRELATION AND REGRESSION

OBJECTIVES

General Objective

To understand and apply the concept of correlation and regression

Specific Objectives

At the end of the unit, you should be able to:

Draw a scatterplot for a set of ordered pairs
Compute the correlation coefficient
Compute the equation of the regression line

UNIT 5


5.0 CORRELATION

So far we have considered the statistics of one variable. Of course, we sometimes get
data involving two variables. For example, look at the marks obtained on two
Mathematics papers by a group of students below.

Student A B C D E F G H I J
Paper 1 42 84 50 42 33 50 69 81 50 35
Paper 2 31 83 42 60 28 63 59 92 73 40

So what can we find out from the data? Students B and H have done very well on
both papers, E has done very badly on both papers, and student I has done much better
on paper 2 than on paper 1.

A graph might help us to make more sense of the data, as would the average (mean)
mark for papers 1 and 2. The most useful type of graph is a scatter diagram.

I NPUT

5.1 CORRELATION- SCATTER DIAGRAM

If we plot the data as points, with marks for paper 1 on the x-axis and for paper 2 on
the y-axis, we obtain a graph like the one shown here. Note that we do not need to
start the scales at zero.

We see that the points go roughly from bottom left to top right (this is made clearer by
enclosing the points as shown below).


From the data, the mean value for paper 1 is $\bar{x} = 53.6$
and for paper 2 is $\bar{y} = 57.1$.

We now plot the lines $x = 53.6$ and $y = 57.1$ on the scatter diagram.

The lines divide the graph into four quadrants:

Top right: all points have both x values and y values greater than their respective
means, i.e. $(x - \bar{x}) > 0$, $(y - \bar{y}) > 0$. The product would be positive.

Bottom left: all points have both x values and y values less than their respective
means, i.e. $(x - \bar{x}) < 0$, $(y - \bar{y}) < 0$. The product would be positive.

Top left: x values less than $\bar{x}$, y values greater than $\bar{y}$. Product negative.

Bottom right: x values greater than $\bar{x}$, y values less than $\bar{y}$. Product negative.

Look at the scattergrams (scatter diagrams) below. The patterns seem to be very
different.


Roughly speaking:

Positive correlation: the higher the value of x, the higher the value of y.
Negative correlation: the higher the value of x, the lower the value of y.
Zero correlation: no fixed relationship between x and y.

Again this is made clearer by drawing the lines $y = \bar{y}$, $x = \bar{x}$.

You have met scatter diagrams in your earlier work, where you may have drawn a line of
best fit on the graph in order to estimate a value of y given a value of x. The line was
drawn by eye, but you would know that the line passes through the mean values
($\bar{x}$, $\bar{y}$) as shown below.


The lines on the first two diagrams are relatively easy to draw, but where do we draw
a line on the third and having drawn it, would it be of any practical use?

Notice that we have been looking for a special type of relationship between the x and
y values: a straight-line or linear relationship. The fact that we can't find such a
relationship does not mean that there is no relationship at all.

The product-moment formula for determining the linear correlation coefficient

The convention of dealing with data

Horizontal (x) axis The independent variable

Vertical (y) axis The dependent variable

Let us look at some data on the height of students and the distance they can throw a
cricket ball.

Height (x) cm 122 124 133 138 144 156 158 161 164 168
Distance (y) m 41 38 52 56 29 54 59 61 63 67

Just looking at the data, a general response might be that the taller a person is, the further
they can throw a cricket ball (apart from the odd person!).


Does a scatter diagram support that hypothesis?

The example below shows one drawback: SCALE


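One of the measures of the degree of linear correlation between two variables is
called the coefficient of correlation, denoted by the symbol r. The coefficient of
correlation for two variables, say X and Y, is given by:

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$

or simply

$$r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}}$$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.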

Example 5.1

a) Determine the coefficient of correlation between X and Y based on the data below.

X 4 5 6 9
Y 12 10 8 6

b) The data given below gives the experimental values obtained for the torque output
from an electric motor, X, against the current taken from the supply, Y. Determine
the value, degree and nature of the coefficient of linear correlation between the
variables X and Y (if there is one).

X 0 1 2 3 4 5 6 7 8 9
Y 4 6 6 6 8 10 10 10 14 12

The value of the correlation coefficient ranges from
+1 for a perfect positive correlation
to -1 for a perfect negative correlation.


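a) Construct a table from the given data.

  1     2     3                  4                  5      6       7
  X     Y     $x = X - \bar{X}$  $y = Y - \bar{Y}$  $xy$   $x^2$   $y^2$
  4    12    -2                  3                 -6      4       9
  5    10    -1                  1                 -1      1       1
  6     8     0                 -1                  0      0       1
  9     6     3                 -3                 -9      9       9

$\sum X = 24$, $\sum Y = 36$, $\sum xy = -16$, $\sum x^2 = 14$, $\sum y^2 = 20$

$\bar{X} = \dfrac{24}{4} = 6$, $\bar{Y} = \dfrac{36}{4} = 9$

$$r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} = \frac{-16}{\sqrt{(14)(20)}} = \frac{-16}{\sqrt{280}} = -0.9562$$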

b)

  X     Y     $x = X - \bar{X}$  $y = Y - \bar{Y}$  $xy$    $x^2$    $y^2$
  0     4    -4.5               -4.6               20.7    20.25    21.16
  1     6    -3.5               -2.6                9.1    12.25     6.76
  2     6    -2.5               -2.6                6.5     6.25     6.76
  3     6    -1.5               -2.6                3.9     2.25     6.76
  4     8    -0.5               -0.6                0.3     0.25     0.36
  5    10     0.5                1.4                0.7     0.25     1.96
  6    10     1.5                1.4                2.1     2.25     1.96
  7    10     2.5                1.4                3.5     6.25     1.96
  8    14     3.5                5.4               18.9    12.25    29.16
  9    12     4.5                3.4               15.3    20.25    11.56

$\sum X = 45$, $\bar{X} = \dfrac{45}{10} = 4.5$; $\sum Y = 86$, $\bar{Y} = \dfrac{86}{10} = 8.6$

$\sum xy = 81.0$, $\sum x^2 = 82.5$, $\sum y^2 = 88.4$

$$r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} = \frac{81}{\sqrt{(82.5)(88.4)}} = 0.95$$

A good direct correlation exists between the values of X and Y.
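
The tabular method used in Example 5.1 translates directly into code. The sketch below follows the same columns (deviations, products and squares); the function name `correlation_coefficient` is an illustrative assumption.

```python
from math import sqrt


def correlation_coefficient(xs, ys):
    """Product-moment r computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    dx = [x - x_bar for x in xs]  # column x = X - mean of X
    dy = [y - y_bar for y in ys]  # column y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)


# Data from Example 5.1 a) and b)
r_a = correlation_coefficient([4, 5, 6, 9], [12, 10, 8, 6])
r_b = correlation_coefficient(list(range(10)), [4, 6, 6, 6, 8, 10, 10, 10, 14, 12])
print(round(r_a, 4), round(r_b, 2))
```

The sign of r comes from the sum of the products xy: in part a) that sum is negative, so the correlation is inverse.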

ACTIVITY 5A

TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT INPUT...!

1. Determine the coefficient of correlation up to 4 decimal places between X and Y
based on the data below.

X 122 124 133 138 144 156 158 161 164 168
Y 41 38 52 56 29 54 59 61 63 67

2. The co-ordinates given below refer to an experiment to verify Newton's law of
cooling over a limited range of values. Determine the value, degree and nature of
the coefficient of correlation.

Time (min) 4 8 10 12 16 22
Temperature (°C) 46 34 30 26 24 20

3. The following results were obtained experimentally when verifying Hooke's law:

Load (N) 2 5 8 11 15
Extension (mm) 2 23 62 119 223

Determine the value, degree and nature of the coefficient of correlation.

4. The thickness of case-hardening achieved varies with temperature, and some
co-ordinates obtained by experiment are as shown.

Temperature (°C) 400 420 350 320 400 480 440 370
Thickness (m) 3.7 3.4 3.7 3.8 3.6 3.3 3.4 3.7

Determine the coefficient of correlation based on these values.



1. r = 0.7289
2. r = -0.92, good, inverse
3. r = 0.97, good, direct
4. r = -0.93


5.2 LEAST SQUARES REGRESSION LINE

Scatter Diagrams: Line of Best Fit

We have already referred to the drawing of a line of best fit by eye.

The only calculation involved was determining $\bar{x}$ and $\bar{y}$, since the line of best fit
passes through the point ($\bar{x}$, $\bar{y}$).

From the line you might be expected to estimate a y value given an x value. Of
course, line fitting by eye is a subjective matter of trying to minimise the distances
between the points and the line.

A mathematical computation method is available to produce two lines, known as y
on x (to estimate values of y) and x on y (to estimate values of x).

These are known as (Linear) Regression Lines or Least-Squares Regression Lines.

I NPUT

Scatter Diagrams: The y on x Regression Line

Since the line must pass through ($\bar{x}$, $\bar{y}$), the parameters that can vary are the
gradient of the line and the point where the line cuts the y-axis.

The equation of the y on x line will be of the form y = a + bx (some syllabuses use the
Greek letters $\alpha$ and $\beta$ instead of a and b).

The y on x line minimises the sum of the squares of the vertical distances from the
points to the regression line (the square of the distance is used to ensure a positive
result).

As with correlation, there is a formula derived from a proof and a corresponding
computational method. (The proof is not required at A/AS Level.)

For y = a + bx:

$$b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}, \qquad a = \bar{y} - b\bar{x}$$

where $\bar{y}$ and $\bar{x}$ are the mean values of y and x.


Example 5.2

a) y on x Regression Line (Least Squares Regression Line)

x 2.5 4 8 5 7 9.5 8.5 12.5 12.5 14.5
y 3.5 3 6.5 7 8 11 9 10.5 13 13

$\sum x = 84$, $\sum y = 84.5$, $\sum xy = 827$, $\sum x^2 = 845.5$, $n = 10$, $\bar{x} = 8.4$, $\bar{y} = 8.45$

Calculate the regression line y on x.

b) Based on the data already calculated, find the regression line y on x and estimate
the value of y when x = 160.

$\sum x = 1468$, $\sum y = 520$, $\sum xy = 77689$, $\sum x^2 = 218070$, $n = 10$, $\bar{x} = 146.8$, $\bar{y} = 52$

a) To calculate the regression line y on x:

$$b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{827 - \dfrac{(84)(84.5)}{10}}{845.5 - \dfrac{(84)^2}{10}} = 0.8377$$

$$a = \bar{y} - b\bar{x} = 8.45 - (0.8377 \times 8.4) = 1.4133$$

So the least squares regression line y on x is y = 1.4133 + 0.8377x.
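
The computational formula for the y on x line needs only the summary totals, so it is easy to check by machine. The following sketch assumes a hypothetical helper `regression_from_sums`; it keeps b unrounded when computing a, so a comes out near 1.4130 rather than the 1.4133 obtained by rounding b to 0.8377 first.

```python
def regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for the y-on-x line y = a + b*x from summary totals."""
    # Gradient: (sum xy - sum x * sum y / n) / (sum x^2 - (sum x)^2 / n)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: the line passes through the point of means
    a = sum_y / n - b * sum_x / n
    return a, b


# Totals from Example 5.2 a)
a, b = regression_from_sums(n=10, sum_x=84, sum_y=84.5, sum_xy=827, sum_x2=845.5)
print(f"y = {a:.4f} + {b:.4f}x")
```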

Least Squares Regression Line: y on x

From the working above, the least squares regression line y on x is:

y = 1.4133 + 0.8377x

We can now use this equation to calculate (estimate) a value of y for a given value of
x.

For example, find a value for y given x = 10.

Substituting: y = 1.4133 + (0.8377 × 10) = 9.7903

Finding a value from within the range of x is called interpolation.

Warning: estimating a value from outside the data range (say x = 20) is called
extrapolation and should be avoided (at all costs), since you do not know that the
relationship between x and y will hold for larger and smaller values than those
recorded.

b) For the regression line y on x,

$$b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{77689 - \dfrac{(1468)(520)}{10}}{218070 - \dfrac{(1468)^2}{10}} = 0.5270$$

$$a = \bar{y} - b\bar{x} = 52 - (0.5270 \times 146.8) = -25.3636$$

So the regression line is y = -25.3636 + 0.5270x.

When x = 160, y = -25.3636 + (0.5270 × 160) = 58.96


ACTIVITY 5B

TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT INPUT...!

a. The table shows the results for a number of athletes. X represents long
jump (metres)

$\sum x = 19$, $\sum y = 66$, $\sum xy = 126.22$, $\sum x^2 = 36.44$, $n = 10$

  x      y      $x^2$    $y^2$    $xy$
  1.8    6.7    3.24     44.89    12.06
  2.1    7.6    4.41     57.76    15.96
  1.9    6.3    3.61     39.69    11.97
  2.0    6.8    4.00     46.24    13.60
  1.8    5.9    3.24     34.81    10.62
  1.8    7.9    3.24     62.41    14.22
  1.6    5.5    2.56     30.25     8.80
  1.8    5.6    3.24     31.36    10.08
  1.9    6.5    3.61     42.25    12.35
  2.3    7.2    5.29     51.84    16.56
  19     66     36.44    441.5    126.22   (totals)

Calculate the value of b for the regression line y = a + bx.

b. The length y metres of a cable subjected to a load of x kilograms is given by
$y = \alpha + \beta x$. In an experiment to estimate $\alpha$ and $\beta$ for a particular cable, the value
of y was measured for each of 15 values of x. The following quantities were calculated from
the 15 pairs of values.

$\sum x = 225$, $\sum y = 238.5$, $\sum xy = 3581$, $\sum x^2 = 3625$

Calculate the least squares estimates of $\alpha$ and $\beta$.


c. A set of bivariate data can be summarised as follows:

$\sum x = 21$, $\sum y = 43$, $\sum xy = 171$, $\sum x^2 = 91$, $\sum y^2 = 335$, $n = 6$

i) Calculate the equation of the regression line of y on x. Give your answer in
the form y = a + bx, where the values of a and b should be stated to 3
significant figures.
ii) It is required to estimate the value of y for a given value of x. State
circumstances under which the regression line of x on y should be used,
rather than the regression line of y on x.



a. b = 2.4118

b. $y = \alpha + \beta x$: y = 15.69 + 0.014x

c. i) a = 3.0667, b = 1.1714; the regression line is y = 3.07 + 1.17x (3 significant figures)
ii) Use the regression line of x on y to estimate the value of x when y is the
independent variable.


SELF ASSESSMENT 5

You are approaching success. Try all the questions in this self-assessment section
and check your answers given on the next page. If you encounter any problems,
consult your instructor. Good luck.

1. The data given below refer to the relationship between man-hours worked
and production achieved in a factory. Determine the coefficient of
correlation.

Index of production, man-hour basis 100 97 100 101 93 103 91 89 110 86
Index of production, actual basis 94 91 100 105 84 112 83 80 123 78

2. The number of man-days lost per week due to sickness in two similar
departments of a factory are shown for a 12-week period.

Department A 20 18 19 21 17 18 12 16 14 17 13 15
Department B 18 21 18 20 17 19 16 15 15 18 16 18

Determine the coefficient of correlation and comment on its degree and
nature.


3. The masses and heights of ten people were measured and the results are
as shown.

Mass (kg) 38 38 38 44 44 51 32 51 77 32
Height (cm) 135 140 137 141 147 145 132 149 164 130

Calculate the coefficient of correlation for this data.

4. The relationship between the pressure and volume of a gas was measured
and the following results were obtained:

Pressure (kPa) 58 62 67 73 81 81 86 92 104
Volume (m³) 0.36 0.97 0.43 0.52 0.48 0.29 0.31 0.75 0.27

Determine the coefficient of correlation and comment on the result
obtained.

5. The caloric intake of rats varies with body mass as shown below.

Body mass (g) 2.0 3.1 3.6 4.6 5.0 6.0 7.0 8.0 8.5 9.0 10.0
Caloric intake (cal h⁻¹) 1.5 2.1 3.2 3.6 3.6 3.9 4.1 4.2 4.5 4.6 5.9

Is there a linear correlation between these results?


6. Determine the coefficient of correlation for the data given below and test
the null hypothesis that $\rho = 0$ at a level of significance of 0.1. The
data given relate the number of hours of sunshine per week to the hours
lost due to sickness.

Hours of sunshine/week 10 13 15 17 18 20 22 23 24
Hours lost due to sickness 90 75 75 65 55 45 55 45 35

7. The length y metres of a cable subjected to a load of x kilograms is given
by $y = \alpha + \beta x$. In an experiment to estimate $\alpha$ and $\beta$ for a particular cable, the
value of y was measured for each of 15 values of x. The following
quantities were calculated from the pairs of values.

$\sum x = 225$, $\sum y = 238.5$, $\sum xy = 3581$, $\sum x^2 = 3625$

a) Calculate the least squares estimates of $\alpha$ and $\beta$.

8. A set of bivariate data can be summarised as follows:

$\sum x = 21$, $\sum y = 43$, $\sum xy = 171$, $\sum x^2 = 91$, $\sum y^2 = 335$, $n = 6$

i) Calculate the equation of the regression line of y on x. Give your
answer in the form y = a + bx, where the values of a and b should
be stated to 3 significant figures.
ii) It is required to estimate the value of y for a given value of x. State
circumstances under which the regression line of x on y should
be used, rather than the regression line of y on x.

9. The data given below show the relationship between the heights and masses of ten
people.

Height, X cm 175 180 193 165 187 171 198 168 184 177
Mass, Y kg 82 78 86 72 91 80 95 72 89 74

Determine the equation of the regression line of mass on height,
expressing the regression coefficients correct to two decimal places.


10. The power needed to drive a lathe increases as the cutting angle of the tool
increases when cutting at a constant speed and depth of cut. The relationship for
mild steel is:

Cutting angle (degrees), X 50 55 60 65 70 75 80 85 90
Power (kW), Y 6.2 6.8 7.6 8.2 8.1 8.8 9.7 10.0 10.4

Determine a) the equation of the regression line of power on cutting angle and
b) the equation of the regression line of cutting angle on power,
expressing the regression coefficients correct to three significant
figures in each case.


FEEDBACK TO SELF ASSESSMENT 5

Have you tried all the questions? If YES, check your answers now.

1. 0.97
2. 0.70, fair, direct

3. 0.97
4. -0.31. It is probable that the measurements were made at different
temperatures.

5. r = 0.94, hence there is a good, direct correlation.

6. r = -0.95; t.99 with 7 degrees of freedom = 1.42; |t| = 8.05; the hypothesis is rejected.

7. $\alpha = 15.69$, $\beta = 0.014$, y = 15.69 + 0.014x

8. i) y = 3.07 + 1.17x
ii) Use the regression line of x on y to estimate the value of x when y is the
independent variable.

9. y = -36.83 + 0.66x

10. a) Y = 1.14 + 0.104X
b) X = -9.27 + 9.41Y

