Probability and Statistics (Tanton)

THINKING
MATHEMATICS
A Refreshingly Clear Reference

Series for Teachers and Students
and all those seeking
True and Joyous Understanding!
Volume 8
PROBABILITY AND STATISTICS
LEVEL:
TABLE OF CONTENTS
PART 1: BASIC PROBABILITY THEORY

Simplistic Overview
Naïve Probability Theory
OR
AND
Sequence Principle
The Empirical Model
Law of Large Numbers
Monte Carlo
Expected Value
Conditional Probability
Bayes' Theorem
PROBLEM SET I
PART 2: COUNTING
COUNTING PRINCIPLES
The Multiplication Principle
Factorials
The Labeling Principle
Multi-stage Labeling
Fun with Poker
PASCAL'S TRIANGLE
A Grid of Numbers
The Binomial Theorem
PROBLEM SET II
PART 3: STATISTICS
Displaying and Summarising Data
Measures of Central Tendency
Measures of Dispersion
Scatter Plots
Lines of Best Fit
Correlation Coefficient
Null Hypothesis
Distributions
Central Limit Theorem
Normal Distribution
68-95-99.7 Rule
z-scores
Roulette
Confidence Intervals
P-values
Gallup Polls
Sampling
Chi-Squared test
Quality Control
Run Tests
Rank Correlation
PROBLEM SET III
PART 4: ADVANCED TOPICS

Random Variables
Sum, differences, multiples
Connection to Central Limit Theorem
Cereal Box Problem
Geometric Distribution
Binomial Distribution
Proportions
Student's t-distribution
Chi Squared distribution
Chebyshev's Inequality
Informal Course Notes
PART I of IV
James Tanton
© 2007 James Tanton
CONTENTS:
Simplistic Overview
Naïve Probability Theory
OR
AND
Sequence Principle
The Empirical Model
Monte Carlo
Expected Value
Conditional Probability
Bayes' Theorem

PART ONE: 2
SIMPLISTIC OVERVIEW
Probability and Statistics represent two sides of the same coin.
PROBABILITY: Explores what can be said about an unknown sample from a known
collection of objects.
e.g. We know all possible combinations from rolling a pair of dice. What is the most
likely outcome?
STATISTICS: Explores what can be said about an unknown collection from a known
sample.
e.g. We surveyed 100 people and found that 37 chewed gum. What does this say
about the gum-chewing habits of the entire nation?
BASIC IDEAS:
Probability: If a situation can be described in terms of possible outcomes that are
deemed equally likely, then the probability of any one particular outcome occurring
is defined to be:
\[ \text{Prob} = \frac{1}{\text{Total number of outcomes}} \]
e.g. The possible outcomes from rolling a die are: 1, 2, 3, 4, 5, 6. Each is usually
deemed equally likely. Then:
\[ \text{Prob}(3) = \frac{1}{6}, \qquad \text{Prob}(5) = \frac{1}{6} \]
etc.
Probability relies on the ability to COUNT things.
e.g. Four cards are dealt from a deck. What's the probability of getting four aces?
This problem relies on being able to count all 4-card hands. (A bit tricky.)
STATISTICS: There are two branches:
Descriptive Statistics is concerned with methods of collecting, tabulating and
summarizing data.
Inferential Statistics is concerned with making inferences and predictions based
on collected data.
e.g. A medical study records the heights of 100 eight-year-olds.
average height = a statistic

tallest height = a statistic
third to shortest height = a statistic
THESE ARE ALL DESCRIPTIVE
Making a judgment about whether a particular child's height is abnormal
is an INFERENTIAL JUDGMENT
COMMENT: The word "statistik" was coined by German political scientist Gottfried
Achenwall (1719-1772) to mean a summary of how things stand. It is based on the
Latin word "stare" meaning "to stand".
HISTORY
PROBABILITY
The start of probability theory can, essentially, be pinpointed to a single moment in
time. In 1654 French nobleman Chevalier de Méré wrote to prominent
mathematician Blaise Pascal asking for advice on the following problem:
Two friends each lay down $100 in a friendly "best of seven" tennis match.
But rain interrupts play after just four games with one person having won
three games and the other just one. How should the $200 be divvied up
between the two players so as to properly reflect the likelihood of each
winning?
Pascal shared this problem with Pierre de Fermat. Both solved it independently using
different techniques. Through this problem, probability theory was born.
Comment: Italian mathematician Girolamo Cardano (1501-1576) actually worked

with ideas akin to probability theory before this but did not publish his work. And
of course gambling games have been in existence for centuries and scholars have
wondered about their results. But the first definitive analysis of chance began
with the work of Pascal and Fermat.
STATISTICS
The study of statistics (descriptive statistics, at least) is ancient.
3050 B.C.E. Egyptians collated data on population wealth
2300 B.C.E. Ancient Chinese did the same
594 B.C.E. Greeks took a census for tax collection
309 B.C.E. Greeks took a census for population figures
Later Romans kept census records, birth and death records,
conducted geographic surveys, etc.
Middle ages: Very little done
The start of inferential statistics can be pinpointed to:
1662 John Graunt analysed birth and death records to create
life tables which were used to predict life expectancies
of different social groups.
1790 First U.S. census taken.
GETTING OUR FEET WET:
For fun let's go back and analyse de Méré's problem. Let's imagine, like Pascal and
Fermat, we are seeing it for the first time. How would you like to approach it?
Recall:
Best of 7 games but only 4 games played.
Player A has won 3 games.
Player B has won 1 game.
How best to divvy up a $200 pot?
Here's some space for writing notes!

COMMENT: There are a number of interesting concepts at play in this example. If
you have some familiarity with these terms, you may be able to identify in your
work the Law of Large Numbers, the definition of the probability of an event, and
the notion of expected value. We'll, of course, talk about these concepts in detail
later in these notes.
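One possible way to explore the problem (not necessarily how Pascal or Fermat approached it) is to replay the unfinished match many times on a computer and split the pot in proportion to how often each player wins. The sketch below assumes that every remaining game is a fair 50/50 contest; the function name and its parameters are illustrative choices, not part of the original notes.

```python
import random

def simulate_split(trials=100_000, pot=200, seed=0):
    """Estimate each player's share by replaying the interrupted match."""
    rng = random.Random(seed)
    a_wins = 0
    for _ in range(trials):
        a, b = 3, 1                  # score when the rain started
        while a < 4 and b < 4:       # first to 4 games wins a best-of-seven
            if rng.random() < 0.5:   # assumption: each game is a fair 50/50
                a += 1
            else:
                b += 1
        a_wins += (a == 4)
    share_a = pot * a_wins / trials  # pot split by observed win frequency
    return share_a, pot - share_a

share_a, share_b = simulate_split()
print(f"A: ${share_a:.2f}, B: ${share_b:.2f}")
```

Compare the printed split with the one you reasoned out by hand.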
NAÏVE PROBABILITY THEORY
Recall the basic principle of probability:
If a situation can be described in terms of possible outcomes that are
deemed equally likely, then the probability of one particular outcome
occurring is defined to be:
\[ \text{Prob} = \frac{1}{\text{Total number of outcomes}} \]
Example: In rolling a die there are six possible outcomes: 1, 2, 3, 4, 5, 6, each
deemed equally likely. Then:
\[ \text{Prob}(4) = \frac{1}{6}, \qquad \text{Prob}(6) = \frac{1}{6} \]
etc.
Definition:
The set of all possible outcomes of an experiment is called the sample space.
Example: In tossing a coin Sample Space = {H, T}
In rolling a die Sample Space = {1, 2, 3, 4, 5, 6}
Ascertaining someone's age (in years)
Sample Space = {0, 1, 2, ..., 120(?)}
Definition: An event is a set of outcomes (or just a single outcome).
Example: In rolling a die: Sample Space = {1, 2, 3, 4, 5, 6}
An event could be: {2, 4, 6} (rolling an even number)
or: {3} (rolling a three)
or: {1, 2, 3, 4, 5, 6} (rolling any number!)
Definition: Given a sample space S for an experiment and an event E, the
probability of E occurring is defined to be:
\[ P(E) = \frac{\#\,\text{elements of } E}{\#\,\text{elements of } S} \]
(This is, of course, assuming that the sample space has just a finite number of
elements, and that every single outcome is equally likely.)
Example: In rolling a die S = {1, 2, 3, 4, 5, 6}
The probability of rolling an even number, E = {2, 4, 6}, is:
\[ P(\text{even}) = \frac{\#\,\text{elements of } E}{\#\,\text{elements of } S} = \frac{3}{6} = \frac{1}{2}. \]
Note, also:
\[ P(\{3\}) = \frac{1}{6} \]
\[ P(\{1, 2, 4, 5\}) = \frac{4}{6} = \frac{2}{3} \]
\[ P(\text{rolling a 7}) = \frac{0}{6} = 0 \]
\[ P(\text{rolling any number}) = \frac{6}{6} = 1 \]
COMMENT: We always have \( 0 \le P(E) \le 1 \).
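The counting definition translates almost word for word into a few lines of code. The following is a minimal sketch; the helper name prob is an illustrative choice, and it assumes a finite sample space of equally likely outcomes, exactly as the definition does.

```python
from fractions import Fraction

def prob(event, sample_space):
    """Naive probability: (# elements of E) / (# elements of S)."""
    event = set(event) & set(sample_space)   # outcomes outside S cannot occur
    return Fraction(len(event), len(sample_space))

S = {1, 2, 3, 4, 5, 6}
print(prob({2, 4, 6}, S))     # 1/2
print(prob({3}, S))           # 1/6
print(prob({1, 2, 4, 5}, S))  # 2/3
print(prob({7}, S))           # 0
print(prob(S, S))             # 1
```

Using exact fractions rather than decimals keeps the answers in the same form as the hand calculations.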

WARNING: This naïve approach to probability assumes each individual outcome is
equally likely. THIS IS NOT AN EASY CONCEPT!
For example: In rolling a pair of dice and computing their sum, the set of all
possible outcomes is:
S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
But these individual events, 2, 3, ..., 12, are not equally likely.
Somehow we are meant to know that the underlying equally likely quantity here is
not the sums 2, 3, ..., 12, but the pairs of numbers behind each sum, with order
considered important!
There are 36 possible ordered pairs:
1-1 1-2 1-3 1-4 1-5 1-6

2-1 2-2 2-3 2-4 2-5 2-6
3-1 3-2 3-3 3-4 3-5 3-6
4-1 4-2 4-3 4-4 4-5 4-6
5-1 5-2 5-3 5-4 5-5 5-6
6-1 6-2 6-3 6-4 6-5 6-6
These are the entities deemed equally likely.
Now, only one of these pairs gives a sum of 2, so:
\[ P(2) = \frac{1}{36} \]
Also, we see:
\[ P(3) = \frac{2}{36} = \frac{1}{18} \]
\[ P(4) = \frac{3}{36} = \frac{1}{12} \]
EXERCISE: Write down P(5), P(6), P(7), P(8), P(9), P(10), P(11) and P(12).
[QUESTION: Is this correct? Is the order of the pair indeed important? Or
should 6-3, say, be deemed equivalent to 3-6? How is one meant to know?]
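If we accept that the 36 ordered pairs are the equally likely entities, the counting can be handed to a computer. This sketch simply lists the pairs and tallies the sums; the variable names are illustrative, and the choice to treat order as important is the assumption under discussion, not something the code proves.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

pairs = list(product(range(1, 7), repeat=2))   # 36 ordered pairs (first die, second die)
counts = Counter(a + b for a, b in pairs)      # how many pairs give each sum
dist = {s: Fraction(c, len(pairs)) for s, c in sorted(counts.items())}

for s, p in dist.items():
    print(s, p)
```

It is a useful check on the exercise above once you have written down your own answers.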
This might not seem too much of an issue here, but in more complicated examples
one might be given a sample space, such as S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, and
one is meant to somehow know whether or not these outcomes are equally likely,
or whether or not this sample space is a result of a more fundamentally equally
likely set.
THIS IS VERY PERTURBING TO A MATHEMATICIAN!
This is equally perturbing to students, and it should be!
Example: Consider the command: "Pick a whole number at random." What does this
mean? Is each number equally likely? Does the term "equally likely" even apply?
Example: THE WALLET GAME
Two people decide to play the following game:
Each pulls out her wallet.
Whoever possesses the least amount of money in her wallet wins.
Her prize? The contents of the other player's wallet.
Each person can reason: "I stand to win more than I lose. Thus the game is in my
favour!"
A game can't be favourable simultaneously to both players! Something is very
strange here. What are the odds of winning? Is everything really balanced and
equally likely?
ADDITIONAL COMMENT:
The very act of defining probability in terms of equally probable events is
circular: using the term "equally likely" assumes you already know what probability
means! The very basis of naïve probability is philosophically flawed.
AND YET ANOTHER COMMENT:
We defined, for a sample space S and an event E, the probability of E occurring as:
\[ P(E) = \frac{\#\,\text{elements of } E}{\#\,\text{elements of } S} \]
If we change our wording here and say:
\[ P(E) = \frac{\text{size of } E}{\text{size of } S} \]
then we can extend our notion of probability to geometric settings where the "size" of
a set is taken to be its area. For example, a circle sits inside a square of side
length 2 inches. We can ask: If a point inside the square is chosen at random, what
are the chances of it landing outside the circle?
The area of the shaded region, the region of interest (the event E), is
\( 2^2 - \pi \cdot 1^2 = 4 - \pi \) and the area of the entire square (the sample space S) is \( 2^2 = 4 \).
Then the probability we seek is:
\[ P(E) = \frac{\text{size of } E}{\text{size of } S} = \frac{4 - \pi}{4} \]
QUESTION: What does "equally likely" mean in this setting? Is the probability of
picking any specific point zero? If so, is the probability of picking any point from a
collection of points also zero?
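One way to get a feel for the area-based definition is to throw many random points at the square and count how many miss the circle. The sketch below assumes "at random" means the two coordinates are chosen uniformly and independently, and that the circle is inscribed in the square (radius 1); the function name is illustrative.

```python
import math
import random

def estimate_outside_circle(trials=200_000, seed=1):
    """Fraction of uniform random points in the square that miss the circle."""
    rng = random.Random(seed)
    outside = 0
    for _ in range(trials):
        x = rng.uniform(-1, 1)       # square of side 2 centred at the origin
        y = rng.uniform(-1, 1)
        if x * x + y * y > 1:        # outside the inscribed circle of radius 1
            outside += 1
    return outside / trials

estimate = estimate_outside_circle()
exact = (4 - math.pi) / 4
print(f"estimate {estimate:.4f}, exact {exact:.4f}")
```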
And while we are mired in philosophical woes, consider the following disturbing
problem:
EXERCISE: BERTRAND'S PARADOX
A chord is chosen at random in a circle. What is the probability that the chord is
longer than the side-length of an inscribed equilateral triangle?
Answer 1: By rotating the circle we may as well assume that one end of the chosen
chord is positioned at the left end of the circle. Then we can see that the chord
will be longer than the side of an inscribed equilateral triangle if its second end lies
in the shaded portion shown.
This represents \( \frac{1}{3} \) of the circumference of the circle. Thus the probability we
seek is:
\[ P = \frac{1}{3} \]
This argument is mathematically sound and the result is absolutely correct.

Answer 2: By rotating the circle we may as well assume that the chosen chord is
horizontal. Then the chosen chord will be longer than the side-length of an
inscribed equilateral triangle if its mid-point lies on the shaded portion of the
diameter shown:
An exercise in geometry shows that this represents \( \frac{1}{2} \) of the diameter. Thus the
probability we seek is:
\[ P = \frac{1}{2} \]
This argument is mathematically sound and the result is absolutely correct!!!
The problem here is that the term "at random" is absolutely vague! The first
answer defines "at random" to mean: select a point on the circumference of the
circle and connect it with a given previously chosen point. The second solution
assumes "at random" means: draw a circle on the floor and roll a broom handle from
one side of the room across the circle.
It is possible to define "at random" by many different means for this problem and
arrive at different, but absolutely valid, answers. (If one draws a circle on a piece
of paper and drops straws from above onto the figure, one finds that about \( \frac{1}{4} \) of
them give chords of the length we seek!)
Examples like these paradoxes alerted mathematicians to the problems with
beginning approaches to probability theory. Terms such as "equally likely" and "at
random", and even "probability" itself, are fundamentally vague notions. No wonder
one's intuition is so challenged by this subject!
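The dependence on what "at random" means can be seen experimentally. The sketch below tries three different recipes for a random chord of a unit circle: random endpoints on the circumference, a random distance from the centre, and a random midpoint inside the disc. These three recipes and all function names are illustrative choices; the third is a common stand-in for the value of about one quarter and is not claimed to model dropped straws exactly.

```python
import math
import random

SIDE = math.sqrt(3)  # side of an equilateral triangle inscribed in a unit circle

def chord_from_endpoints(rng):
    # "At random" = two independent uniform points on the circumference.
    a, b = rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi)
    return 2 * abs(math.sin((a - b) / 2))

def chord_from_radius(rng):
    # "At random" = uniform distance of the chord from the centre.
    d = rng.uniform(0, 1)
    return 2 * math.sqrt(1 - d * d)

def chord_from_midpoint(rng):
    # "At random" = chord midpoint uniform over the disc (rejection sampling).
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return 2 * math.sqrt(1 - (x * x + y * y))

def estimate(method, trials=100_000, seed=2):
    """Fraction of chords longer than the triangle's side under one recipe."""
    rng = random.Random(seed)
    return sum(method(rng) > SIDE for _ in range(trials)) / trials

results = {m.__name__: estimate(m) for m in
           (chord_from_endpoints, chord_from_radius, chord_from_midpoint)}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Each recipe is a perfectly reasonable reading of "choose a chord at random", yet the three estimates settle on three different values.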
For fun ...
SICHERMAN DICE:
Most people believe that ordered pairs of values are the fundamental equally likely
entities to be considered when rolling a pair of dice and computing their sum. In
this case, the results of rolling two dice can be nicely displayed in a table:
We see now that \( P(3) = \frac{2}{36} \), \( P(7) = \frac{6}{36} \) and so forth.
36 36
Suppose instead we roll two dice, one numbered 1-2-2-3-3-4 and the
other 1-3-4-5-6-8.
Complete the following addition table and verify that these dice give exactly the
same probabilities for any given sum as ordinary dice.
CHALLENGE: Is there a way to renumber a pair of tetrahedral dice so that the

probability of any given sum is the same as from a pair of ordinary tetrahedral
dice (numbered 1-2-3-4 and 1-2-3-4)?
COMMENT: These dice were discovered by Col. George Sicherman in the 1970s.
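Once the addition table has been completed by hand, the claim about the renumbered dice can be double-checked by brute force. The sketch below compares how often each sum arises among the 36 ordered pairs for the two kinds of dice; the helper name sum_counts is an illustrative choice.

```python
from collections import Counter
from itertools import product

def sum_counts(die1, die2):
    """Count how many of the ordered face pairs produce each sum."""
    return Counter(a + b for a, b in product(die1, die2))

ordinary = [1, 2, 3, 4, 5, 6]
sicherman_a = [1, 2, 2, 3, 3, 4]
sicherman_b = [1, 3, 4, 5, 6, 8]

standard = sum_counts(ordinary, ordinary)
relabelled = sum_counts(sicherman_a, sicherman_b)
print(standard == relabelled)  # True
```

The same helper can be pointed at candidate tetrahedral dice when attempting the challenge.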
NAÏVE PROBABILITY THEORY CONTINUED
Putting philosophical woes on hold for now ...
Given a (finite) sample space S and an event A, we have defined the probability of
event A occurring as:
\[ P(A) = \frac{\text{size of } A}{\text{size of } S} \]
Next, we need to explore the possibilities of combining actions.
Definition: Two actions are said to be independent if the outcomes of one action
in no way affect the outcomes of the other.
Example: Tossing a coin and rolling a die are independent events.
Here the sample space is the set of twelve pairs:
(H, 1) (H, 2) (H, 3) (H, 4) (H, 5) (H, 6)

(T, 1) (T, 2) (T, 3) (T, 4) (T, 5) (T, 6)
Example: Picking a card from a deck of cards, destroying it, and then picking a
second card from the deck, are NOT independent events: The result of the first
action affects possible outcomes for the second. For instance, picking the ace of
spades first no longer allows the ace of spades to be chosen second.
Example: Deciding what to wear and the weather forecast are not independent
events.
COMMENT: We generally rely on our intuitive understanding of the world to
conclude whether or not two events are independent. (Again, this can be a difficult
issue.) This notion, as it stands, is just as vague and problematic as the term
"equally likely".
THE USE OF THE WORD "OR"
Let's examine a basic example of two independent actions:
ACTION 1: Toss a coin
ACTION 2: Roll a die
The set of all possible outcomes can be displayed in a tree diagram:
The twelve different outcomes are now explicit. It is easy to compute
probabilities. For example:
\[ P(\{\text{H, even}\}) = \frac{3}{12} \]
\[ P(\{\text{T, 5 or 6}\}) = \frac{2}{12} \]
\[ P(\{\text{H, even}\} \text{ OR } \{\text{T, 5 or 6}\}) = \frac{3+2}{12} = \frac{3}{12} + \frac{2}{12} = P(\{\text{H, even}\}) + P(\{\text{T, 5 or 6}\}) \]

This illustrates a general principle:
If A = one set of desired outcomes
B = a second set of desired outcomes
and these sets share no outcomes in common
Then:
P(A or B) = P(A) + P(B)
Example: Let A = head and any number
B = tail and 3
Then \( P(A \cup B) = P(A) + P(B) = \frac{6}{12} + \frac{1}{12} = \frac{7}{12} \)
Comment: The union symbol \( \cup \) is interpreted as "or".
EXERCISE: What if A and B do share outcomes in common? Use the following Venn
diagram to explain the formula: \( P(A \cup B) = P(A) + P(B) - P(A \cap B) \).
EXERCISE: Find P({H, odd} or {H, 1, 2, or 3}).
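The addition principle, and the correction needed when two sets overlap, can both be checked by listing the twelve coin-and-die outcomes as a set. The following is a sketch only: the sets C and D are hypothetical overlapping events chosen for illustration (they are not the events of the exercise above), and the helper prob is an illustrative name.

```python
from fractions import Fraction
from itertools import product

S = set(product("HT", range(1, 7)))           # the twelve (coin, die) outcomes

def prob(event):
    """Size of the event divided by the size of the sample space."""
    return Fraction(len(event & S), len(S))

A = {(c, d) for c, d in S if c == "H" and d % 2 == 0}    # {H, even}
B = {(c, d) for c, d in S if c == "T" and d in (5, 6)}   # {T, 5 or 6}
print(prob(A | B), prob(A) + prob(B))                    # 5/12 5/12

# Hypothetical overlapping events, to illustrate the Venn-diagram formula.
C = {(c, d) for c, d in S if c == "H"}        # head and any number
D = {(c, d) for c, d in S if d == 3}          # either coin face and a 3
print(prob(C | D), prob(C) + prob(D) - prob(C & D))      # 7/12 7/12
```

Python's set operators | and & play the roles of union and intersection.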

THE USE OF THE WORD "AND"
There is a second model called the "square model" that is useful for analyzing
probabilities.
Example: Suppose 100 people walk down a garden path that leads to a fork. A left
turn leads to house A, a right turn to house B.
Assume that there is a 50% chance that a person will turn one way over another.
In this set-up we'd expect, basically, 50 people to end up at house A and 50 people
at house B. The following diagram of one-hundred dots (for 100 people) depicts
this outcome:
The number 100 here is immaterial. The point is that if a square is used to denote
the entire population of people walking down the path, then half the area of the
square (half the people) ends up with result A, and the second half of the square
with result B.
This was a very simple example. Let's practice some more complicated scenarios.
EXERCISE: Folk walk down the following system of paths. Use the square model to
compute the fraction of people that end up at house A, at house B, and at house C.
[Assume that each choice encountered at a fork in the path is equally likely.]
EXERCISE: Folk walk down the following system of paths. Use the square model to
compute the fraction of people that end up at each house. [Again assume that each
choice encountered at a fork in the path is equally likely.]
EXAMPLE: I roll a die and then toss a coin. What are the chances of getting an
even number followed by a head?
Answer: Think of this as a path-walking problem with two houses labeled "WANT"
and "DON'T WANT". The forks in the road represent the options that can occur
(each with a 50% chance of occurring):
This leads to the square model diagram:
We see that the desired outcome represents one quarter (half of a half) of the
square:
\[ P(\text{even AND head}) = \frac{1}{4} \]

EXAMPLE: I toss a quarter, then I toss a dime, and then I roll a die. What are the
chances of receiving HEAD, HEAD, and 5 or 6?
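Answer: Here's the garden path:
This gives the square model:
We have:
\[ P(\text{H and H and } \{5,6\}) = \frac{2}{6} \text{ of } \frac{1}{2} \text{ of } \frac{1}{2} \text{ of the square} = \frac{2}{6} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{12} \]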
We see the multiplication principle at hand:
If A represents the set of desired outcomes for one action, and B the set of
desired outcomes for a second action, and these actions are independent, then:
\[ P(A \text{ and } B) = P(A) \times P(B) \]
In summary:
OR means addition (for sets of outcomes with nothing in common)
AND means multiplication (for independent actions)
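The multiplication principle can be compared against plain counting. This sketch lists every outcome of tossing a quarter, tossing a dime and rolling a die (24 equally likely triples, assuming fair coins and a fair die) and checks that counting agrees with multiplying; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", "HT", range(1, 7)))   # (quarter, dime, die): 24 triples
wanted = [o for o in outcomes
          if o[0] == "H" and o[1] == "H" and o[2] in (5, 6)]

by_counting = Fraction(len(wanted), len(outcomes))
by_multiplying = Fraction(1, 2) * Fraction(1, 2) * Fraction(2, 6)
print(by_counting, by_multiplying)   # 1/12 1/12
```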
ASIDE: Consider a garden path that leads to two possible houses, A and B.
Suppose there are an infinite number of three-way forks, each with left turn
leading to A, right turn leading to B, and straight path leading to the next fork.
Create a beautiful depiction of the square model for this situation that makes it
visually clear that this formula must be true:
\[ \frac{1}{3} + \frac{1}{9} + \frac{1}{27} + \frac{1}{81} + \cdots = \frac{1}{2} \]
ANOTHER CONFUSING MATTER
DOES THE ORDER IN WHICH ONE COMPUTES TASKS MATTER?
For example, in tossing a coin and rolling a die there are three possibilities:
1. Toss the coin and then roll the die
2. Roll the die and then toss the coin
3. Roll the die and toss the coin simultaneously.
Standing back and thinking about this one would likely say that all three scenarios
are philosophically equivalent, even though the tree diagrams and square diagrams
for possibilities 1. and 2. are different. (Draw them! Is it possible to draw a tree
diagram for simultaneous actions?)
It's good to spell things out and make explicit the following:
SEQUENCE PRINCIPLE: If two actions are independent, then performing the two
actions simultaneously is philosophically equivalent to performing them one at a
time (and it does not matter in which order one opts to do them).
The following example is typical of the difficulties that can arise:
EXAMPLE: I roll a pair of dice. What are the chances of getting a 6 and a 2?
Answer: It is best to avoid thinking of events that occur simultaneously and
instead tease them apart into a sequence of actions. Here we can imagine rolling
one die and then the other. Now it is clear that there are two desirable
possibilities:
Roll a 6 and then roll a 2
OR
Roll a 2 and then roll a 6
Using + for OR and × for AND, the probability we seek is:
\[ P = P(6) \times P(2) + P(2) \times P(6) = \frac{1}{6} \times \frac{1}{6} + \frac{1}{6} \times \frac{1}{6} = \frac{1}{18} \]
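Teasing the simultaneous roll apart into "first die, then second die" is exactly what an enumeration of ordered pairs does. The short sketch below, with illustrative variable names, lists the ordered rolls that show a 6 and a 2 and confirms the count.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))        # ordered (first die, second die)
six_and_two = [r for r in rolls if sorted(r) == [2, 6]]

print(six_and_two)                                  # [(2, 6), (6, 2)]
print(Fraction(len(six_and_two), len(rolls)))       # 1/18
```

Changing the target list to [2, 2] shows at a glance why a double behaves differently from a mixed pair.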
EXERCISE: I roll three dice simultaneously. What are the chances of seeing two 6s
and one 5?
EXERCISE: In rolling two dice, what are the chances of getting a 2 and a 2?
This is essentially all there is to naïve probability theory. Of course, there are
subtle issues to explore (and we'll come to those in applications), but for now, let's
practice the basic ideas.
EXERCISE: Suppose A is an event for some sample space S. Explain the following
formula:
\[ P(\text{not } A) = 1 - P(A) \]
EXERCISE:
The probability that any one person will be bitten at least once in life by a dog is
\( \frac{1}{20} \). The probability of being bitten by a cat is \( \frac{1}{50} \).
Find:
a) The probability that a person will be bitten by both a cat and a dog some
time in life
b) The probability that a person will never be bitten by a dog.
c) The probability that someone will be bitten by a cat or a dog but not both.
EXERCISE: Three dice are tossed simultaneously. What are the chances of:
a) Receiving three 1s?

b) Receiving no 1s?
c) Receiving two 1s and one 2?
Three dice and two coins are tossed simultaneously. What are the chances of:
d) Receiving three 1s and two Hs?

e) Receiving no 1s and no Hs?
EXERCISE: A couple has two children.
a) Draw a tree diagram illustrating all possibilities re gender.

b) What is the probability that the couple has two boys?
c) What is the probability that the couple has a child of each gender?
[Assume that the chances of having a boy match those of having a girl.]
EXERCISE:
a) You know that Jenny has two children and that her first child is a boy. What
is the probability that her other child is a girl?
b) You know that Mike has two children and that one is a boy. What are the
chances that the other child is a girl?
EXERCISE:
Lulu has four children and you are told that at least one of the four is a boy. What
is the probability that
a) Exactly two of her children are boys?
b) At least two are boys?
EXERCISE: Two dice are rolled.

a) What is the probability that their sum is odd? Even?
b) What is the probability that their product is odd? Even?
c) When rolling a pair of dice, the most likely sum is 7. What is the most
likely product?
EXERCISE: A pop-quiz has 10 multiple choice questions, each with choices: A, B, C
or D. I didn't study for this quiz and decided to circle the answers at random.
a) What are the chances that I will get all 10 questions right?
b) What are the chances that I will get 9 out of 10 correct?
c) What are the chances that I will get 8 out of 10 correct?
d) What are the chances that I will get at least three right?
EXAMPLE: A CLIFF HANGER
Dorothy is not feeling good. She stands at the edge of a cliff, conveniently labeled
position 1, with an infinite expanse of land at her back (conveniently labeled in
steps 2, 3, 4, ...). Regard location 0 as "off the cliff."
Dorothy lays her fate in the toss(es) of a coin. She pulls out a quarter to toss and
decides:
If it lands HEADS, I will step forward (to my doom).
If it lands TAILS, I will step backwards one place and toss again.
She does this repeatedly (stepping forward with each land of HEADS, backwards
with each land of TAILS) until she either meets her doom or ends up wandering
forever in the infinite expanse behind her.
What are the chances that Dorothy will walk over the cliff?
Answer: We'll answer this in a series of steps.
Let \( p = p(1 \to 0) \) be the probability, when standing at position 1, of Dorothy
eventually reaching position zero. (This may either be by stepping one pace forward
right away, or taking a step back and then two paces forward, and so forth.) It is
this value p that we seek.
Define \( p(2 \to 1) \), \( p(3 \to 2) \), and so on the same way.
STEP 1: Can you see that \( p(1 \to 0) \) and \( p(2 \to 1) \) and \( p(3 \to 2) \) and so on, are each,
philosophically, the same problem and so have the same value p?
Notice that:
\[
\begin{aligned}
p(1 \to 0) &= p(\text{stepping forward right away OR stepping back to position 2 and reaching 0 sometime later}) \\
&= p(\text{stepping forward right away}) + p(\text{stepping back AND moving } 2 \to 0) \\
&= \frac{1}{2} + \frac{1}{2}\, p(2 \to 0) \\
&= \frac{1}{2} + \frac{1}{2}\, p(\text{moving from 2 to 1 AND moving from 1 to 0}) \\
&= \frac{1}{2} + \frac{1}{2}\, p(2 \to 1)\, p(1 \to 0)
\end{aligned}
\]
This leads to the quadratic equation:
\[ p = \frac{1}{2} + \frac{1}{2} p^2 \]
STEP 2: Solve for p.
Is Dorothy's doom certain?
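An experiment can be set beside the algebra. The sketch below walks Dorothy from position 1 many times and records how often she reaches position 0. Since a computer cannot wait forever, each walk is cut off after max_steps tosses, so the printed figure slightly undercounts walks that would have reached 0 later. The function name, its parameters and the cut-off are all assumptions made for illustration; p_heads is included so the same sketch can be tried with a biased coin.

```python
import random

def fall_frequency(p_heads=0.5, trials=2_000, max_steps=2_000, seed=3):
    """Fraction of walks from position 1 that reach 0 within max_steps tosses."""
    rng = random.Random(seed)
    falls = 0
    for _ in range(trials):
        position = 1
        for _ in range(max_steps):
            # HEADS: one step towards the cliff; TAILS: one step away.
            position += -1 if rng.random() < p_heads else 1
            if position == 0:
                falls += 1
                break
    return falls / trials

fair = fall_frequency()
weighted = fall_frequency(p_heads=1 / 3)
print(f"fair coin: {fair:.3f}")
print(f"coin with 1/3 chance of HEADS: {weighted:.3f}")
```

Raising max_steps pushes the fair-coin figure closer to the value that solves the quadratic.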
EXERCISE: Explain why
\[ p(N \to 0) = p(N \to N-1) \cdots p(2 \to 1)\, p(1 \to 0) = 1 \cdot 1 \cdots 1 = 1. \]
What is this saying?
EXERCISE: GAMBLER'S RUIN
A gambler repeatedly plays a simple game: a 50% chance of winning a dollar, a 50%
chance of losing a dollar. If she starts with $N, what are the chances of her losing
all her money?
COMMENT: We are flirting with the notion of a "random walk" and have essentially
proven that in a one-dimensional walk, the walker will visit each and every cell of the
line an infinite number of times. Feel free to conduct some internet research on
this topic.
EXAMPLE: A SMARTER GAMBLER

Another gambler repeatedly plays the same simple game with a 50% chance of
winning a dollar, 50% chance of losing a dollar. She starts with $4.
She decides she will stop playing when she either reaches $0 (loses all her money)
or gets to $10.
What are her chances of reaching $10?
Answer - almost: Let \( p(N) \) = probability of reaching $10 starting with $N in
hand. We certainly have \( p(0) = 0 \) (we've already lost our money and have no chance
of reaching $10) and \( p(10) = 1 \) (with $10 in hand we are guaranteed having $10!).
Now, suppose N is between 1 and 9 inclusive. Then:
\[
\begin{aligned}
p(N) &= p(\text{lose a dollar AND play with } \$N-1 \text{ OR win a dollar AND play with } \$N+1) \\
&= \frac{1}{2}\, p(N-1) + \frac{1}{2}\, p(N+1)
\end{aligned}
\]
Thus each number \( p(N) \) is the average of its two neighboring values.
CHALLENGE: If \( p(1) = x \), show that this means that \( p(2) = 2x \), and that \( p(3) = 3x \),
and so forth. What must be the value of x?
And, so, what is the value of \( p(4) \)?
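Whatever value you find for p(4), it can be compared with an experiment. The sketch below plays the game many times from a given starting amount and reports how often the gambler reaches the target before going broke. The function name and parameters are illustrative, and the game is assumed to be a fair 50/50 bet of one dollar each round.

```python
import random

def reach_target_frequency(start=4, target=10, trials=50_000, seed=4):
    """Fraction of games in which the gambler reaches `target` before $0."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        money = start
        while 0 < money < target:
            money += 1 if rng.random() < 0.5 else -1   # win or lose a dollar
        wins += (money == target)
    return wins / trials

frequency = reach_target_frequency()
print(f"reached $10 in {100 * frequency:.1f}% of games")
```

Trying other starting amounts gives a quick experimental check of the pattern p(2) = 2x, p(3) = 3x, and so on.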
EXERCISE: Back to Dorothy
Suppose she uses a weighted coin that has only a \( \frac{1}{3} \) chance of landing HEADS and
a \( \frac{2}{3} \) chance of landing TAILS. Show that her chance of survival after a possibly
infinite number of tosses is now 50%!
A not-so-exciting example:
EXERCISE: A bag contains:
Two red balls

Three white balls
Six blue balls
A ball is chosen at random. What is the probability that the ball is:
a) Blue?
b) Either red or blue?
c) Neither red nor blue?
A not-unexciting example:
EXERCISE: A bag contains one Red, one Blue and one White ball. John picks out a
ball at random.
If it is Red, he wins.
If it is blue, he loses.
If it is white, he puts the ball back, and adds to the bag another red ball and
another blue ball. (So the bag now contains 2Rs, 2Bs and one W.)
John then chooses a ball at random.
If it is red, he wins now.

If it is blue, he loses.
If it is white, he puts the ball back and adds another red ball and another blue ball
and picks again.
John keeps doing this until he either wins or loses.
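Use this problem to establish the bizarre formula:
\[ \frac{1}{3} + \frac{1}{3} \cdot \frac{2}{5} + \frac{1}{3} \cdot \frac{1}{5} \cdot \frac{3}{7} + \frac{1}{3} \cdot \frac{1}{5} \cdot \frac{1}{7} \cdot \frac{4}{9} + \cdots = \frac{1}{2} \]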
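The formula can be tested numerically before it is explained. In the sketch below, the n-th term is the chance that John draws white in each of the first n - 1 rounds and then red in round n, when the bag holds n red, n blue and one white ball. The function name is an illustrative choice, and the code is a numerical check rather than a proof.

```python
from fractions import Fraction

def partial_sum(terms):
    """Sum of the first `terms` terms of the series, as an exact fraction."""
    total = Fraction(0)
    reach = Fraction(1)               # chance of drawing white in every earlier round
    for n in range(1, terms + 1):
        balls = 2 * n + 1             # n red, n blue, 1 white in round n
        total += reach * Fraction(n, balls)   # ...and then red in round n
        reach *= Fraction(1, balls)           # ...or white again, so play continues
    return total

for terms in (1, 2, 3, 4, 10):
    print(terms, float(partial_sum(terms)))
```

The partial sums creep up towards one half very quickly, which is a strong hint about what the explanation should be.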
EXAMPLE: NON-TRANSITIVE DICE
Here are designs for four dice: A, B, C and D.
You and a friend decide to play the following game.
Your friend chooses a die and then you choose a die.

You each roll your chosen die. Whoever receives the largest number wins.
Show that if you choose the die to the left of your friend's choice (or choose die D
if your friend chooses die A) you will win this game two-thirds of the time. That is,
show that:
die A beats die B two-thirds of the time
die B beats die C two-thirds of the time
die C beats die D two-thirds of the time
die D beats die A two-thirds of the time
HINT: The following table shows all possible wins if dice A and B are rolled:
We indeed see that A beats B two thirds of the time.

EXERCISE: ANOTHER MAGICAL PROPERTY OF A MAGIC SQUARE
Here's the standard 3x3 magic square.
(Each row, column and diagonal has the same sum of 15).
Player A picks a number at random from row A, player B a number at random from
row B, and player C a number at random from row C.
Let's say that A beats B if A's number is larger than B's.
a) Show that chances are A will beat B.

b) Show that chances are B will beat C.
c) Show that chances are C will beat A!
ANOTHER CLASSIC: THE BIRTHDAY PROBLEM
Two random people are kidnapped. What are the chances that their birthdays land
on the same day of the year? Give your answer as a percentage to one decimal
place.
Answer:
Three random people are kidnapped. What are the chances that at least two of
them have the same birthday?
Answer:
Four people are kidnapped. What are the chances that at least two of them have
the same birthday?
Answer:
If you are game, fill out the following table:
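For larger groups the arithmetic becomes tedious, and a short program can take over. The sketch below uses the complement idea: it computes the chance that all birthdays are different and subtracts from 1. It assumes 365 equally likely birthdays (no leap years, no seasonal effects) and that people's birthdays are independent; the function name is illustrative.

```python
def shared_birthday_probability(people, days=365):
    """P(at least two of `people` share a birthday), ignoring leap years."""
    all_different = 1.0
    for k in range(people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        all_different *= (days - k) / days
    return 1 - all_different

for n in (2, 3, 4, 23):
    print(n, f"{100 * shared_birthday_probability(n):.1f}%")
```

Work the small cases by hand first, then use the function to check them and to fill in the larger entries.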

THE EMPIRICAL MODEL
One way to attempt computing probability values is to repeat an experiment a large

number of times and see how often the desired outcome occurs. This is usually the
only option available if a particular experiment is extremely complicated or
difficult to compute. For example:
What is the probability that 5 letters chosen at random spell an English

word?
The best thing to do would probably be: Have a computer develop 10,000
examples of five letters chosen at random and match each of these with its built-in
dictionary to see what proportion of them are English words. This won't be an
exact answer to our question, but we suspect it would be a very close
approximation.
We are using a principle here called the Law of Large Numbers. This law makes
intuitive sense and is often assumed without explicit mention. In the nineteenth
and twentieth centuries, as mathematicians attempted to put probability theory on
a sound, rigorous footing, one of the significant check-points of their work was
the ability to prove this Law of Large Numbers as true according to the axioms of
their theory. Here's the principle:
LAW OF LARGE NUMBERS

The more times a random phenomenon is performed the closer the proportion of
trials in which a particular desired outcome occurs approximates the true
probability of that outcome occurring.
Example: If you toss a coin some number of times, you would expect approximately
half of the tosses to be HEADS and half TAILS.
With 10 tosses, it is unlikely you will receive exactly half of each. (Try it!)
With 100 tosses, the proportion of heads would be closer to 50%
With 1000 tosses, significantly closer to 50%.
Even better with 100000000000000000 tosses!
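If tossing a real coin thousands of times sounds tiring, a computer can do the tossing. The sketch below reports the proportion of HEADS for increasingly long runs; it assumes the pseudo-random generator is an adequate stand-in for a fair coin, and the function name is illustrative.

```python
import random

def heads_proportion(tosses, seed=5):
    """Proportion of HEADS in a run of simulated fair-coin tosses."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return heads / tosses

for n in (10, 100, 1_000, 100_000):
    print(n, heads_proportion(n))
```

Try different seeds: short runs wander noticeably, long runs stay close to one half.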
Comment: Many gamblers incorrectly interpret this result as follows:
If a run of trials did not produce the desired outcome, then the chances of
that outcome occurring on the very next trial is increased.
Example: You toss a coin nine times and get four HEADS and five TAILS. That the
next toss will be HEADS is thus almost certain? NOT TRUE!
Example: You toss a coin 999 times and get TAILS every time. The chance that
the next toss will be HEADS, alas, is still only 50%.
Gamblers often feel that a string of losses must produce a win on the next turn.
MONTE CARLO METHOD:

The act of repeating an experiment multiple times to determine an (approximate)
value for a probability is often called the Monte Carlo Method. Many
casinos determined odds for their games simply by playing them multiple times and
observing the frequency of outcomes. (How do you determine the chances of winning a
hand of blackjack? An easy method is to observe the play of a large number of games.)
Aside: Read "Bringing Down the House" to learn how MIT students tipped the odds
of blackjack in their favour by certain tactical plays.
One can use the Monte Carlo Method to work out areas of complicated regions.
Example: Here is an aerial photograph of an oil spill.
It is known that the area of the rectangular region photographed is 40 square
kilometers.
One can compute (a fairly accurate approximation to) the area of the spill by
digitizing the photograph and having a computer select 10,000 points at random in
the photograph. If, say, 6,473 of those points land in the shaded region, then we
can say that the area of the spill is ?
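The idea is that the fraction of random points landing in the region approximates the fraction of the rectangle's area that the region occupies. Here is a minimal sketch of that idea; since the photograph is not available, it uses a hypothetical stand-in region (a disk of radius 2 km inside an 8 km by 5 km rectangle), and all names are illustrative.

```python
import random

def estimate_area(inside, width, height, num_points, rng):
    """Estimate the area of a region within a width-by-height rectangle.

    `inside(x, y)` returns True when the point lies in the region.
    """
    hits = 0
    for _ in range(num_points):
        x = rng.uniform(0, width)
        y = rng.uniform(0, height)
        if inside(x, y):
            hits += 1
    # fraction of points that hit, times the known area of the rectangle
    return (hits / num_points) * (width * height)

# Hypothetical stand-in for the spill: a disk of radius 2 km centred in an
# 8 km by 5 km rectangle (area 40 square km). Its true area is 4*pi.
def in_disk(x, y):
    return (x - 4) ** 2 + (y - 2.5) ** 2 <= 4

rng = random.Random(0)
print(estimate_area(in_disk, 8, 5, 10_000, rng))
```

For the disk the estimate can be compared with the exact value 4π (about 12.57 square kilometers); for a real spill there is no exact value to compare with, which is the point of the method.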
ACTIVITY:
In a group of three, call one person player 0, one person player 2 and the remaining person
player 3. Each person places his right hand behind his back and secretly holds up one, two
or three fingers. The players then show their hands.
If all three numbers match, player 3 receives a point.
If two of the three numbers match, then player 2 receives a point.
If there are no matches, player 0 receives a point.
a) Play this game a large number of times and tally points in a table. From your data,
who seems to have the largest chance of receiving a point in any single game?
Estimate that probability of winning. Estimate the chances of a win for each of the
remaining two players.
b) Use theory to determine the actual probability of a win for each of the three
players.
c) Suppose that, instead of one point for each win, players 3, 2, and 0 are awarded a, b
and c points, respectively, for each win. Choose values for a, b and c that make this
game fair.
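After playing the game by hand, one way to check part b) is to list every possible show of hands. The sketch below assumes each player picks one, two or three fingers with equal likelihood and independently of the others (real players may not!); the function name is illustrative.

```python
from fractions import Fraction
from itertools import product

def finger_game_probabilities():
    """Enumerate all 27 equally likely shows of hands and classify each."""
    counts = {3: 0, 2: 0, 0: 0}  # keyed by the player who scores
    for hands in product((1, 2, 3), repeat=3):
        distinct = len(set(hands))
        if distinct == 1:      # all three numbers match
            counts[3] += 1
        elif distinct == 2:    # exactly two numbers match
            counts[2] += 1
        else:                  # no matches
            counts[0] += 1
    total = sum(counts.values())
    return {player: Fraction(n, total) for player, n in counts.items()}

print(finger_game_probabilities())
```

Compare the exact fractions with the tallies from part a) before tackling part c).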
EXERCISE: a) A bag contains four balls each colored either red or blue. Jenny pulls out
two balls at random and gets a pair of blue balls. She returns the balls to the bag, gives it
a shake, and pulls out another pair of balls. She does this 100 times recording the results
along the way:
BB = 52 times
BR = 48 times
RR = 0 times
Most likely, how many blue balls and how many red balls are in that bag?
b) Suppose, instead, Jenny obtained the result:

BB = 16 times
BR = 70 times
RR = 14 times
What might you conclude about the colors of the balls?
OPTIONAL EXERCISE: BUFFON NEEDLE PROBLEM

Do some internet research on how one can use probability theory to approximate π.
EXAMPLE: KRUSKAL COUNT

In the early 1980s, Princeton physicist Martin Kruskal discovered a remarkable
mathematical property that all passages of written text seem to possess. This
phenomenon is now referred to as Kruskal's count. To illustrate, consider the
familiar nursery rhyme:
Twinkle twinkle little star,

How I wonder what you are,
Up above the world so high,
Like a diamond in the sky.
Twinkle twinkle little star,
How I wonder what you are.
Perform the following steps:

1. Select any word from the first or second line and count the number of
letters it contains.
2. Count that many words forward through the passage to land on a new
word. (For example, choosing the word "star", with four letters, will
transport you to the word "what".)
3. Count the number of letters in the new word, and move forward again
that many places.
4. Repeat this procedure until you can go no further (that is, counting
forward will take you off the nursery rhyme.)
5. Observe the final word on which you have landed.
Surprisingly, no matter on which word you start this counting task, the procedure
always takes you to the same word in the final line, namely, the word "you".
Kruskal observed that this same phenomenon seems to occur with any sufficiently
large piece of text - counting forward in this way from any choice of beginning
word lands you at the same place at the end of the page. This provides an amusing
activity for several people to perform simultaneously, all working with the same
text, but starting with different choices of initial word.
Why does this seem to work?
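The five steps can be carried out mechanically. Here is a minimal sketch that runs the procedure from every word in the first two lines of the rhyme; the function and variable names are illustrative, and punctuation is ignored when counting letters.

```python
import re

RHYME = """Twinkle twinkle little star,
How I wonder what you are,
Up above the world so high,
Like a diamond in the sky.
Twinkle twinkle little star,
How I wonder what you are."""

def kruskal_final_word(words, start):
    """Follow the counting procedure from index `start`; return the last word reached."""
    position = start
    # keep moving while the next jump still lands inside the passage
    while position + len(words[position]) < len(words):
        position += len(words[position])
    return words[position]

words = re.findall(r"[A-Za-z]+", RHYME)
first_two_lines = 10  # the first two lines contain ten words
finals = {kruskal_final_word(words, s) for s in range(first_two_lines)}
print(finals)
```

Swapping in a different passage for `RHYME` is an easy way to test Kruskal's observation on other texts.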

EXPECTED VALUE
Suppose you play a game with some monetary values associated with it.
For example: Flip a coin.

If it comes up HEADS, you win $2.
If it comes up TAILS, you lose $1.
Definition: The expected value of a game is the average profit (or loss) one would
expect if the game were played a large number of times.
For example, suppose we played the above game 200 times. Then, on average, we
would expect to win $2 one hundred times and lose $1 one hundred times.
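\[
\text{Average win}
= \frac{2 + 2 + \cdots + 2 + (-1) + (-1) + \cdots + (-1)}{200}
= \frac{2 \cdot 100 + (-1) \cdot 100}{200}
= 2 \cdot \frac{1}{2} + (-1) \cdot \frac{1}{2}
= \frac{1}{2} = 50 \text{ cents}
\]
Thus, we'd expect to win 50 cents per game.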
COMMENT: Many text-book questions phrase a game like the one described above
as follows:
You pay $1 to play the following game:

Flip a coin. If it is HEADS, win $3. If it is TAILS, you win nothing.
Do you see that it is exactly the same game as before? The idea of having to pay
first often offers a point of confusion for students. It is good to tease such
questions apart and list the end outcomes explicitly:
If HEADS: I'm up $2 overall
If TAILS: I am down $1 overall
Now it is clear how to handle analysis of the game.

EXAMPLE: Imagine the following simple dice game:
Roll 1: You win $10

Roll 2: You win $5
Roll 3,4,5,6: You lose $3
Is this game worth playing?
Answer: Imagine playing 600 rounds. (Why did I choose the number 600?)
On average, we'd expect:
100 times a win of $10
100 times a win of $5
400 times a loss of $3
\[
\text{Average profit}
= \frac{100 \cdot 10 + 100 \cdot 5 + 400 \cdot (-3)}{600}
= \frac{1}{6} \cdot 10 + \frac{1}{6} \cdot 5 + \frac{4}{6} \cdot (-3)
= 0.50
\]
This game is in your favour. It is worth playing. You can expect to win, on average,
50 cents per game.
Note: In this calculation we see the appearance of the probabilities of each
outcome multiplied by their respective values of outcomes.
In general:
If a game offers values \( x_1, x_2, \ldots, x_n \) with probabilities
\( p_1, p_2, \ldots, p_n \), then the expected value of the game is:
\[
x_1 p_1 + x_2 p_2 + \cdots + x_n p_n
\]
This number is often denoted μ or E.
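The general formula is a one-line computation. Here is a minimal sketch using exact fractions, applied to the coin game and the dice game above; the function name and the list-of-pairs layout are illustrative choices.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Return x1*p1 + ... + xn*pn for a list of (value, probability) pairs."""
    assert sum(p for _, p in outcomes) == 1, "probabilities must sum to 1"
    return sum(x * p for x, p in outcomes)

coin_game = [(2, Fraction(1, 2)), (-1, Fraction(1, 2))]
dice_game = [(10, Fraction(1, 6)), (5, Fraction(1, 6)), (-3, Fraction(4, 6))]
print(expected_value(coin_game), expected_value(dice_game))  # 1/2 1/2
```

Both games come out at 50 cents per play, in agreement with the averages computed by hand.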
Exercise: Find the expected value of tossing a pair of dice.
Answer:

EXAMPLE: A coin is tossed.
If H comes up on the first toss, you win $1.

If H first appears in the second toss, you lose $1.
If H first appears in the third toss, you win $1.
and so on.
(Basically: You win $1 if H first appears on an odd toss, lose $1 if H first appears
on even toss.)
What is the expected value of this game?
Answer:
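The probability of getting heads on the first toss is \( \frac{1}{2} \).
The probability of first getting heads on the second toss is
\( \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4} \). (Why?)
The probability of first getting heads on the third toss is
\( \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8} \).
And so on.
Thus:
\[
\mu = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot (-1) + \frac{1}{8} \cdot 1 + \frac{1}{16} \cdot (-1) + \cdots
    = \frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} + \cdots
\]
One can use the geometric series formula to evaluate this infinite sum. Another
approach (a neat trick) is to multiply this sum by two:
\[
2\mu = 2\left( \frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} + \frac{1}{32} - \cdots \right)
     = 1 - \frac{1}{2} + \frac{1}{4} - \frac{1}{8} + \frac{1}{16} - \cdots
     = 1 - \left( \frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} + \cdots \right)
     = 1 - \mu
\]
So \( 2\mu = 1 - \mu \), giving \( \mu = \frac{1}{3} \). Done!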
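For comparison, here is a sketch of the geometric series route. It writes μ for the expected value of this game and uses the standard formula a/(1 − r) for a geometric series with first term a and common ratio r, valid when |r| < 1; here a = 1/2 and r = −1/2.

```latex
\[
\mu \;=\; \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{2^{n}}
    \;=\; \frac{\tfrac{1}{2}}{1 - \left(-\tfrac{1}{2}\right)}
    \;=\; \frac{1/2}{3/2}
    \;=\; \frac{1}{3}.
\]
```

Both routes give the same value, as they must.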

EXERCISE:
a) Consider the example we first studied in this section, phrased as the textbooks
tend to phrase it:
You pay $1 to play the following game:

Flip a coin. If it is HEADS, win $3. If it is TAILS, you win nothing.
What is the expected profit in playing this game?
A student decides to answer the problem as follows:
"Well, \( \mu = 3 \cdot \frac{1}{2} + 0 \cdot \frac{1}{2} = \$1.50 \). But since we paid a dollar, we must
subtract $1 from this amount. The expected profit is therefore 50 cents."
This agrees with our previous answer. Coincidence?
COMMENT: Some students prefer to follow this approach to questions like these.
b) A casino offers a game in which one can win \( x_1, x_2, x_3 \) or \( x_4 \) dollars with
probabilities \( p_1, p_2, p_3, p_4 \) respectively. Suppose the expected value of this game is
\( \mu = x_1 p_1 + x_2 p_2 + x_3 p_3 + x_4 p_4 \). As a promotion, the casino decides to offer a "bonus
night" during which all payouts increase by $2. (Thus the payouts are now
\( x_1 + 2, x_2 + 2, x_3 + 2, x_4 + 2 \) dollars.) Prove, mathematically, that μ increases by 2. Does
the mathematics used here also explain the result of part a)?
TWO TIDBITS
Here's a seemingly paradoxical exercise:
EXERCISE:
In Tiny-Town, 90% of the city cabs are purple and the remaining 10% are blue.
A crime was committed and an eye-witness claims she saw a blue cab at the scene.
Subsequent tests showed that this witness is correct in her observations four
times out of five, that is, 80% of the time.
What are the chances that the cab at the scene really was blue?
COMMENT: The answer is surprisingly small!
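If theory feels slippery here, the Monte Carlo Method offers a check. The sketch below assumes the witness is right 80% of the time whatever the true colour of the cab (the exercise does not spell this out); the function and parameter names are illustrative.

```python
import random

def blue_given_reported_blue(num_trials, rng, p_blue=0.10, p_correct=0.80):
    """Estimate P(cab was blue | witness says blue) by simulation."""
    said_blue = 0
    truly_blue = 0
    for _ in range(num_trials):
        is_blue = rng.random() < p_blue
        correct = rng.random() < p_correct
        # a correct witness reports the true colour, otherwise the other colour
        says_blue = is_blue if correct else not is_blue
        if says_blue:
            said_blue += 1
            truly_blue += is_blue
    return truly_blue / said_blue

print(blue_given_reported_blue(100_000, random.Random(0)))
```

Only the trials in which the witness says "blue" are counted, which is exactly the information we have about the crime.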

EXAMPLE: MONTY HALL PROBLEM
Named after the host of the popular American TV game show "Let's Make a Deal!", the
Monty Hall problem is a classic puzzler often used to test initiates in the field of
probability theory. It goes as follows:
On a game show three closed doors stand before you. The host informs you
that a cash prize lies behind one of the doors, and nothing behind the other
two. You select a door, but before you open it, the host quickly opens one of
the remaining two doors to show you that the prize is not there. He now
gives you the chance to change your mind and open instead the third remaining
door. The question is: What should you do? Should you stay with your original
choice of door, or switch to the other option? Is there any advantage to
switching?
One's typical first reaction to this puzzle is that there is no advantage at all to
switching: since two doors remain with only one containing a prize, the chance of
selecting the correct door, either by staying with the chosen door or switching, is
always 50 percent. Surprisingly, this reasoning is not correct, for it makes no use
of the subtle information the host presents to you, which you can actually use to
your advantage.
a) Play the game with a partner using playing cards as doors: one black and
two red. Take turns being host and being contestant. What do you notice
about your choices as host?
b) Explain why your odds of winning double if you choose to always switch
rather than stick.
c) Suppose the host presents you with 100 doors with only one containing a
prize. You reach for a door but just before you open it, the host reveals to
you the empty contents of 98 other doors. There are now two closed doors,
one with your hand on it. The host then offers you the chance to change
your mind and open instead the remaining closed door. Should you stick
with your original choice or switch?
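If no partner is at hand, a computer can play host. Here is a minimal sketch, assuming the host knows where the prize is, opens only empty doors, and picks at random when he has a choice; it handles any number of doors, with the host opening all but your door and one other. The names are illustrative.

```python
import random

def play(num_doors, switch, rng):
    """Play one game; the host opens every door but yours and one other."""
    prize = rng.randrange(num_doors)
    choice = rng.randrange(num_doors)
    if choice == prize:
        # Host leaves one randomly chosen empty door closed.
        other = rng.choice([d for d in range(num_doors) if d != choice])
    else:
        # Host must leave the prize door closed.
        other = prize
    final = other if switch else choice
    return final == prize

def win_rate(num_doors, switch, trials, rng):
    """Fraction of games won when always sticking or always switching."""
    return sum(play(num_doors, switch, rng) for _ in range(trials)) / trials

rng = random.Random(0)
for doors in (3, 100):
    print(doors, win_rate(doors, False, 20_000, rng), win_rate(doors, True, 20_000, rng))
```

Each printed line shows the number of doors, then the win rate for sticking, then the win rate for switching.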
For the bold
Consider the following variation of the game:
A game contains four doors, one with a fabulous cash prize behind it, the remaining
three empty. The host invites you to select a door and you place your hand on its
knob.
At this stage, Monty opens one of the remaining three doors and shows its empty
contents. He offers you the chance to stick with your current door choice or
switch to one of the remaining two closed doors. You make your choice.
Next, Monty opens a second door to reveal its empty contents. Two closed doors
remain, one with your hand on its knob. He again offers you the chance to stick or
switch.
At this stage the game ends and you accept the consequences.
There are four strategies to this two-stage game: Stick-Stick; Stick-Switch;
Switch-Stick; Switch-Switch.
Which of these four possibilities gives you the greatest chance of winning?
FOR THE BOLDER

In our analysis of the original (and subsequent) Monty Hall problems we made two
assumptions:
i) Monty knows behind which door the prize lies.
ii) Monty has no preference as to which non-prize door he opens when faced with a
choice.
Let's consider some alternative scenarios:
Assume that Monty knows where the prize lies but has a preference as to which
non-prize door he opens when faced with a choice.
(For an example of preference, suppose the doors are numbered 1, 2, and 3 and Monty will
always open the lowest numbered door he can.)
In this scenario, switching is not always better! Sometimes a stick is just as good! Matters
depend on individual plays.
For example, suppose the contestant reaches for door number 1. If Monty opens door
number 3 to reveal a non-prize, then the contestant should switch for a certain win. (What
stopped Monty from opening door 2 if that was his preference?) If, on the other hand,
Monty did open door 2 to reveal a non-prize, then there is no advantage to sticking or
switching. (Monty's action here reveals no information about the possible location of the
prize.)
EXPERIMENT: Conduct a card experiment with a friend mimicking this scenario. How
often can your friend deduce for certain the location of the black card? In general, what
are the odds of your friend winning this version of the game?
Assume that Monty has no knowledge of the location of the prize and, by luck,
opened a non-prize door.
There is no advantage to sticking or to switching: each produces a 50% chance of winning.

To see why, imagine playing the game 30 times. On average, the contestant will have his
hand on the correct door for 10 of those games, and on an incorrect door for 20 of those
games. In those 20 games, Monty will accidentally reveal the prize half the time, and so we
must reject those occurrences. (We are told that this did not happen.) So we are left with
20 games to ponder, 10 of each type. Sticking leads to a win for 10 of those 20. Switching
leads to a win for the remaining 10 of the 20.
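The counting argument above can be checked by simulation. In this minimal sketch Monty opens one of the two other doors at random, and any game in which he accidentally reveals the prize is thrown away, just as in the argument; the names are illustrative.

```python
import random

def ignorant_monty(trials, rng):
    """Return (stick win rate, switch win rate) among games where Monty,
    opening a door at random, happened not to reveal the prize."""
    kept = stick_wins = switch_wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)
        choice = rng.randrange(3)
        opened = rng.choice([d for d in range(3) if d != choice])
        if opened == prize:
            continue  # prize revealed by accident: discard this game
        kept += 1
        remaining = next(d for d in range(3) if d not in (choice, opened))
        stick_wins += (choice == prize)
        switch_wins += (remaining == prize)
    return stick_wins / kept, switch_wins / kept

print(ignorant_monty(60_000, random.Random(0)))
```

Both rates should hover near one half, in contrast to the game in which Monty knows where the prize lies.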
CHALLENGE: Consider the final scenario in which Monty does not know the location of the
prize but will open the lowest numbered door available to him. It turns out to be a non-prize.
Should the contestant stick or switch, or does it depend?
CONDITIONAL PROBABILITY
Analysis of the Monty Hall problem flirts with the difficulty of what to do when partial
information is revealed in a situation. This leads us to the notion of conditional
probability.
Definition: The probability of an event occurring given knowledge that another
event has already occurred is called a conditional probability.
Example: Two cards are drawn at random from a deck. Knowledge of the colour of
the first card will affect the likelihood that the second card is red. Specifically:
If we are told that the first card was black, then:
\[ P(\text{second card red}) = \frac{26}{51} \]
If we are told that the first card was red, then:
\[ P(\text{second card red}) = \frac{25}{51} \]
If we are told nothing about the colour of the first card, then:
\[ P(\text{second card red}) = \frac{26}{52} = \frac{1}{2} \]
[The first card might just as well still be in the deck.]
Notation: The probability that event A will occur given knowledge that event B has
already occurred is denoted:
P(A|B)
e.g. We have: \( P(\text{second card red} \mid \text{first card black}) = \frac{26}{51} \)
We can conduct a thought activity to determine a formula for P(A|B).
Imagine running the experiment a large number of times and observing the number
of times B occurs.
We want P(A|B), the proportion of times A occurs among those times B has already
happened. That is, we want the number of times both A and B occurred compared
to the number of times just B occurred. This suggests:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]
[Recall: The intersection symbol ∩ means "and".]
Example: Draw a card from a deck. A friend tells you that the card is red. What is
the probability that it is an ace?
Answer:
\[
P(\text{ace} \mid \text{red})
= \frac{P(\text{ace and red})}{P(\text{red})}
= \frac{P(\text{red ace})}{P(\text{red})}
= \frac{2/52}{1/2}
= \frac{1}{13}
\]
[And this makes sense since among the 26 red cards, two are aces.]
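The formula can also be checked by brute force: list all 52 cards and count. Here is a minimal sketch; the deck representation and the helper names are illustrative choices.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely cards

def prob(event):
    """Probability that a randomly drawn card satisfies `event`."""
    return Fraction(sum(1 for card in DECK if event(card)), len(DECK))

def conditional(event_a, event_b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(lambda c: event_a(c) and event_b(c)) / prob(event_b)

is_ace = lambda card: card[0] == "A"
is_red = lambda card: card[1] in ("hearts", "diamonds")
print(conditional(is_ace, is_red))  # 1/13
print(conditional(is_red, is_ace))  # 1/2
```

The same two helper functions work for any event that can be written as a test on a single card.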
Exercise: A die is rolled. Someone yells out that the answer is odd.
Given this information, what is the probability that the roll was a 3? a 4?
Answer these questions by practicing the formula for conditional probability (and
then check that the answers make sense!)
CONDITIONAL PROBABILITY AND INDEPENDENT EVENTS
Recall that two actions are independent if the outcomes of one in no way affect
the outcomes of the other. So, if A and B are independent events, we'd expect:
\[ P(A \mid B) = P(A) \]
[Knowledge of B occurring in no way affects the likelihood of A occurring.]
We can prove this mathematically.
Recall, for independent events, we have: \( P(A \text{ and } B) = P(A) \cdot P(B) \). Thus:
\[
P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(A) \cdot P(B)}{P(B)} = P(A)
\]
Example: A coin is tossed and a die is rolled. What is the probability of getting a
HEAD given that the die rolled a 6?
Answer:
\[
P(\text{HEAD} \mid \text{SIX})
= \frac{P(\text{HEAD and SIX})}{P(\text{SIX})}
= \frac{\frac{1}{2} \cdot \frac{1}{6}}{\frac{1}{6}}
= \frac{1}{2}
\]
COMMENT: In a sophisticated theory of probability, one defines a "measure" on
a set of objects that specifies what "at random" means for the problem at hand.
This obviates the issue of "equally likely" and begins to put the theory on a sound
logical footing. Mathematicians then take the relation P(A|B) = P(A) as the
definition of what it means for two events A and B to be independent, that is, A
and B are said to be independent if the measure P(A|B) equals the measure
P(A).
BAYES' THEOREM (1763)
What is the relationship between P(A|B) and P(B|A) ?
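e.g. Draw a card from a deck. Then:
\[ P(\text{ace} \mid \text{red}) = \frac{1}{13}, \qquad P(\text{red} \mid \text{ace}) = \frac{1}{2} \]
What is the connection between these two numbers?
RECALL:
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
\quad \text{and} \quad
P(B \mid A) = \frac{P(B \cap A)}{P(A)}
\]
We have:
\[
P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A \cap B)}{P(A)}
= \frac{P(A \cap B)}{P(B)} \cdot \frac{P(B)}{P(A)}
= P(A \mid B) \cdot \frac{P(B)}{P(A)}
\]
That is:
\[ P(B \mid A) = P(A \mid B) \cdot \frac{P(B)}{P(A)} \]
Example:
\[
P(\text{red} \mid \text{ace}) = P(\text{ace} \mid \text{red}) \cdot \frac{P(\text{red})}{P(\text{ace})}
= \frac{1}{13} \cdot \frac{1/2}{1/13} = \frac{1}{2}
\]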
More generally
Suppose \( B_1 \) and \( B_2 \) are two non-overlapping events that cover the whole sample
space.
e.g. \( B_1 \) = getting a red card
     \( B_2 \) = getting a black card
Suppose A is another event.
e.g. A = getting an ace
Then:
BAYES' THEOREM:
\[
P(B_1 \mid A) = \frac{P(A \mid B_1) \, P(B_1)}{P(A \mid B_1) \, P(B_1) + P(A \mid B_2) \, P(B_2)}
\]
This looks worse than it is. It is also fairly straightforward to prove.
Here's some blank space to write out the proof:
Proof:

Let's do an example:
EXAMPLE:
Bag 1 contains 5 red and 2 white balls.
Bag 2 contains 7 red and 4 white balls.
A bag is selected at random. A ball is selected at random from that bag.
You are told the ball is red.

What is the probability that that ball came from bag 1?
Answer: Let:
\( B_1 \) = ball comes from bag 1
\( B_2 \) = ball comes from bag 2
A = ball is red
We want \( P(B_1 \mid A) \), the probability that the ball came from bag 1 given that it is red.
According to Bayes' theorem:
\[
P(B_1 \mid A)
= \frac{P(A \mid B_1) \, P(B_1)}{P(A \mid B_1) \, P(B_1) + P(A \mid B_2) \, P(B_2)}
= \frac{\frac{5}{7} \cdot \frac{1}{2}}{\frac{5}{7} \cdot \frac{1}{2} + \frac{7}{11} \cdot \frac{1}{2}}
= \frac{55}{104}
\]
Would you have guessed this answer?
The theorem looks complicated, but it allows you to compute some nasty problems
with relative ease.
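The arithmetic in the two-bag example is easy to get wrong by hand, so here is a minimal sketch that evaluates the two-case form of Bayes' theorem with exact fractions. The function name and argument order are illustrative.

```python
from fractions import Fraction

def bayes_two_cases(p_a_given_b1, p_b1, p_a_given_b2, p_b2):
    """Return P(B1 | A) when B1 and B2 partition the sample space."""
    numerator = p_a_given_b1 * p_b1
    return numerator / (numerator + p_a_given_b2 * p_b2)

half = Fraction(1, 2)
print(bayes_two_cases(Fraction(5, 7), half, Fraction(7, 11), half))  # 55/104
```

Changing the four inputs adapts the same computation to other two-bag questions.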
Here's a simple, but philosophically confusing, example:
EXERCISE: TWO CARD PARADOX

One card is red on one side and black on the other.
A second card is red on both sides.
Both cards are put in a bag and one is pulled out at random. You see that one side
of the chosen card is red. What is the probability that the other side of this
chosen card is also red?
a) Answer this question first by using logical reasoning.
b) Answer this question a second time using Bayes' theorem.
Answers:
EXERCISE: Yale psychologists have coined the term "cognitive dissonance" for the act of
devaluing an object after being told it is not available, and conducted an experiment with
monkeys to show that they too might engage in cognitive dissonance. (See
http://www.nytimes.com/2008/04/08/science/08tier.html?_r=1&8dpc&oref=slogin ).
Scientists had discovered that monkeys prefer red, green and blue M&Ms over all other
colors. Monkeys were then given two M&Ms of different colors, say one red and one
green. The monkeys would grab one candy, say the red one, and then have the other one
taken away. Next the monkey would be offered another two M&Ms but of the colors it
had not eaten, in our example, blue and green. The monkey had already experienced the
green M&M being taken away, and the scientists found that about two thirds of the
monkeys opted to take the blue one instead. Had the majority of monkeys indeed devalued
the green M&M given the previous loss?
Show that this is a mathematical result and not a psychological result. That is, show that,
of all the monkeys that prefer red M&Ms over green M&Ms, two thirds of them also
prefer blue M&Ms over green M&Ms, irrespective of whatever experiment is to be
conducted!
PROBLEM SET I
Question 1: Analyse and answer the following variation of de Méré's problem:
Two players play a series of games for the best out of five. The winner is to
receive a prize of $1000. After three games of play, in which the first player had
won one game and the other two games, the match was interrupted by an
earthquake. How should the $1000 be divvied up between the two players so as to
properly reflect their likelihoods of having won the series? (Assume that each
player has a 50% chance of winning any particular game.)
Question 2: Repeat question 1 but this time assume that the first player has only
a 10% chance of winning any individual game.
Question 3: 8640 people walk down the following garden path. At each fork, equal
numbers of people take each option. Find the number of people that end up in each
of the houses A, B, C, and D.
Question 4: Assume that exactly 50% of children born are boys and 50% are girls.
A couple has three children.
a) Draw a tree diagram displaying the possible genders of their three children,
b) What are the chances that the couple has three boys?
c) What are the chances that the couple has at least one boy?
d) What are the chances that the couple has exactly two boys?
Suppose that we are now told that their first child was a girl.
e) What are the chances that the other two children are also girls?
f) What are the chances that at least one of their three children is a girl?
Question 5: Billy's girlfriend has a dimple on her left cheek (there is a 1/100 chance
that this occurs), blue eyes (there is a 1/100 chance this occurs), and likes math
(there is a 1/100 chance that this occurs). He says that his girlfriend is "one in a
million." Is he correct?
Question 6: A card is drawn at random from a deck of 52 playing cards.
a) Describe the sample space if suits are not considered relevant.
b) Describe the sample space if suits and numerical value are considered
relevant.
c) Describe the sample space if the value of the card is considered irrelevant.
Question 7: A card is drawn at random from a deck of 52 cards. What is the
probability of:
a) Drawing an ace?
b) Drawing a club?
c) Drawing the ace of clubs?
d) Drawing an ace or a club?
e) Drawing neither an ace nor a club?
f) Drawing any suit except clubs?
g) Drawing the three of clubs or the king of diamonds or any heart?
Question 8: An urn contains 5 red balls, 8 blue balls, and 7 white balls. A ball is
selected at random. What are the chances of:
a) selecting a red ball?

b) not selecting a blue ball?
c) selecting a ball that is white or blue?
After the ball is selected, it is returned to the urn, and the experiment is
repeated.
What are the chances, in the run of these two experiments, of
d) selecting a red ball followed by another red ball?

e) selecting two balls of the same colour?
Question 9:
The chances that someone gets bitten by a dog at least once in life is 0.02 .
The chances of being hit by a meteorite at least once in life is 0.001 .
The chances of stepping in gum at least once in life is 0.99 .
What is the probability of:
a) Being hit by a meteorite and being bitten by a dog in your life?
b) Never being hit by a meteorite?
c) Never stepping in gum and never being hit by a meteorite?
d) All three events happening in your life?
e) None of these events happening in your life?
Question 10: M&Ms come in six colors. Here's a table showing the probability that
a randomly chosen M&M has a particular colour:

Colour       Brown  Red  Yellow  Green  Orange  Blue
Probability  30%    20%  20%     10%    10%     ____

a) Fill in the missing number for blue.
b) What are the chances that an M&M chosen at random is either brown or
red?
c) What are the chances that two M&Ms chosen at random from an extremely
large bag are both blue?
d) What are the chances that three M&Ms chosen at random from an
extremely large bag are all blue?
I've been told that the colour distribution for Peanut M&Ms is different.
Obtain a bag of peanut M&Ms and use your sample to make estimates for the
entries in the following table:
Colour Brown Red Yellow Green Orange Blue

Probability
Question 11 (ANNOYING BUT INTERESTING):

It is said that a Friday falls on the 13th day of the month 48 times every 28 years.
a) Verify this calculation.
b) What is the probability that a randomly chosen Friday is a Friday the 13th?
Question 12: (HARD-ISH, BUT REALLY INTERESTING!)

There are eight possible outcomes in tossing a coin three times:
HHH, HHT, HTH, THH, HTT, THT, TTH, and TTT.
Two players decide to play the following game. Player A chooses the sequence HHH
and player B the sequence THH. A coin is tossed repeatedly until one of these
sequences appears. For example, the coin might produce T, T, H, T, H, H and
player B wins. If the coin produces the sequence H, T, T, H, H, H then player A
wins.
a) Play the game 10 times. Does player B seem to win the majority of times?
b) Explain why player B has the advantage.
c) Suppose instead player A chooses the sequence HHT and player B the
sequence THH. Play the game 10 times. Does player B again win the majority
of times? Can you explain why?
d) Here's a table showing all the options A could choose and what B chooses in
response.

If A chooses this...   Then B chooses this...
HHH                    THH
HHT                    THH
HTH                    HHT
THH                    TTH
TTH                    HTT
THT                    TTH
HTT                    HHT
TTT                    HTT
Play the game 10 times for each of the eight rows in the table. Verify that B
wins the majority of times in each case. Show me the results you obtained.
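Ten games per row is a small sample, so after playing by hand it can be instructive to let a computer play many more. Here is a minimal sketch assuming a fair coin; the function names are illustrative and only four of the eight rows are shown.

```python
import random

def winner(seq_a, seq_b, rng):
    """Toss a fair coin until seq_a or seq_b appears; return 'A' or 'B'."""
    recent = ""
    while True:
        recent = (recent + rng.choice("HT"))[-3:]  # keep the last three tosses
        if recent == seq_a:
            return "A"
        if recent == seq_b:
            return "B"

def b_win_rate(seq_a, seq_b, games, rng):
    """Fraction of games won by player B."""
    return sum(winner(seq_a, seq_b, rng) == "B" for _ in range(games)) / games

rng = random.Random(0)
for seq_a, seq_b in [("HHH", "THH"), ("HHT", "THH"), ("HTH", "HHT"), ("TTT", "HTT")]:
    print(seq_a, seq_b, b_win_rate(seq_a, seq_b, 10_000, rng))
```

Adding the remaining rows of the table to the list extends the check to all eight cases.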
Question 13: A bag contains a red ball and a white ball. Jodie takes out a ball at
random. If it is red she wins. If it is white, she then moves to a bag that contains
two red balls, and a single white ball, and pulls out a ball. If it is red, she wins. If it
is white, she then moves to a bag that contains three red balls and a single white
ball, and pulls out a ball. If it is red, she wins. If it is white, she then moves on to ...
There are an infinite number of bags available to her, and she keeps playing this
game until she eventually wins.
Explain how this hypothetical scenario proves the following equation:
\[
\frac{1}{2} + \frac{1}{2} \cdot \frac{2}{3} + \frac{1}{2} \cdot \frac{1}{3} \cdot \frac{3}{4}
+ \frac{1}{2} \cdot \frac{1}{3} \cdot \frac{1}{4} \cdot \frac{4}{5} + \cdots = 1
\]
(That is, in math notation, we've established: \( \sum_{n=1}^{\infty} \frac{n}{(n+1)!} = 1 \).)
Question 14: A bag contains a red ball, a blue ball, and a white ball. Schuyler pulls a
ball out at random. If it is red, he wins. If it is blue, he loses. If it is white, then
he moves on to a bag that contains two red, two blue and one white ball. He pulls
one out at random. If it is red, he wins. If it is blue, he loses. If it is white he
moves on to a bag that contains four red, four blue, and one white ball. And so on,
with double the number of red balls and double the number of blue balls from bag
to bag.
a) Explain why Schuyler's chances of winning this game are 1/2.
b) Write an interesting infinite sum based on this hypothetical scenario whose
value is 1/2.
Question 15: Three dice are tossed simultaneously. What are the chances of
rolling
a) three sixes?
b) two sixes and a one?
c) no sixes?
d) at least one six?
Question 16: Three dice and two coins are tossed simultaneously. What are the
chances of receiving
a) three sixes and two heads?

b) no sixes and two heads?
c) no sixes and no heads?
Question 17: Five cards are drawn from a deck of 52 cards:
a) Explain why the chances of pulling out five cards that are all hearts is:
\[
\frac{1}{4} \cdot \frac{12}{51} \cdot \frac{11}{50} \cdot \frac{10}{49} \cdot \frac{9}{48} \approx 0.05\%
\]
b) Find the probability of pulling out five black cards.
c) Show that the probability of pulling out four Kings among the five cards is
close to 0.002%.
d) Show that the probability of pulling out three Kings and two Queens is close
to 0.001%.
Question 18: Consider the following magic square:
Player A chooses a number at random from the first row; player B chooses a
number at random from the second row, and player C chooses a number at random
from the third row.
a) What are the chances that player B's number is higher than player A's?
b) What are the chances that player C's number is higher than player B's?
c) What are the chances that player A's number is higher than player C's?
[In this game of chance, B has the advantage over A, C has the advantage over B,
and A has the advantage over C!]
Question 19:
a) Eleven numbers are arranged in a line. The first number is 0, the last number
is 0, and every number in between is the average of its two neighbors. What
are the 11 numbers and why?
b) Eleven numbers are arranged in a line. The first number is 0, the last number
is 1, and every number in between is the average of its two neighbors. What
are the 11 numbers and why?
Question 20: You are a game show contestant and the game show host presents to
you 100 boxes. She tells you that inside one box lies a fabulous prize and all the
remaining boxes are empty. You select a box at random and are about to open it
when the host interrupts you and opens 98 boxes to reveal to you their emptiness.
This leaves two boxes: the one you selected and one other.
You are now given the chance to stick with the box you first chose, or to switch
and open instead the second box.
a) If you decide to stick, what are your chances of winning the prize?
b) If you decide to switch, what are your chances of winning the game?
Suppose the game show host opens only 97 boxes. This leaves three boxes: the one
you first selected and two others.
The host now gives you the choice to either stick with your original box or to
switch to either one of the remaining boxes.
c) If you decide to stick, what are your chances of winning the prize?
d) If you switch to a different box, what now are your chances of winning?
Question 21: In a game, if a outcomes are deemed favorable and the remaining b
possible outcomes unfavorable, then folk may say (in horse racing circles in
particular) that the odds in favor of winning are "a to b," or alternatively that the
odds against are "b to a." For example, in rolling a die the odds in favor of rolling a
6 are 1:5. The odds against rolling a 5 or a 6 are 4:2 (which could be reduced to
2:1). In a horse race, if the odds against a horse are 7:2, this means that bookies
believe that the horse has only a 2/9 chance of winning.
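Converting between odds and probabilities is simple bookkeeping. Here is a minimal sketch of the two conversions described above; the function names are illustrative.

```python
from fractions import Fraction

def probability_from_odds_against(against, in_favor):
    """Odds against of `against`:`in_favor` correspond to this chance of winning."""
    return Fraction(in_favor, against + in_favor)

def odds_against_from_probability(p):
    """Return odds against as a reduced pair (against, in_favor)."""
    p = Fraction(p)
    return (p.denominator - p.numerator, p.numerator)

print(probability_from_odds_against(7, 2))            # 2/9
print(odds_against_from_probability(Fraction(1, 6)))  # (5, 1)
```

Because fractions are reduced automatically, the second function reports odds such as 4:2 in the reduced form 2:1.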
CORRECT or INCORRECT?
a) A bookie at a horse race says that the odds against a particular horse are
5:8. This means that the probability the horse will win the race is 8/13.
b) A game yields a 30% chance of a win. The odds against winning the game are
thus 7:3.
c) In casting a die, the odds in favor of rolling a number smaller than 5 are 2:1.
d) In tossing a coin twice, the odds against receiving two heads is 3:1.
Question 22:
Bag 1 contains 13 red balls and 14 blue balls.
Bag 2 contains 12 red balls and 7 blue balls.
A bag is selected at random and a ball is pulled out of that bag at random. We are
told that the ball is red.
What is the probability that the ball came from bag 1?

Question 23: A die is tossed. What is the probability that the result is a number
less than 4 if
a) We are told no other information?

b) We are told that the result was an odd number?
c) We are told that the result wasn't 5?
d) We are told that the result wasn't 1?
Question 24:
a) Two ordinary dice are tossed. What is the probability of NOT getting a total
of 7 or 11?
b) Two Sicherman dice are tossed. What is the probability of NOT getting a
total of 7 or 11?
Question 25: One bag contains 4 red and 5 white balls. A second bag contains 3
red and 6 white balls. A ball is drawn from each bag. What is the probability that
a) Both balls are white

b) Both balls are red
c) One is white and the other is red
Question 26: One bag contains 2 red and 3 white balls. A second bag contains 3 red
balls and 1 white ball. A ball is drawn from each bag. Suppose we are told that one ball
chosen was red. What are the chances that the second ball is also red?
Question 27: A bag contains 5 red and 4 white balls. A ball is selected and then,
without replacing the first ball, a second ball is selected. I tell you that the second
ball is white. What is the probability that the first ball was white?
Question 28: You play a simple coin-tossing game. If the coin lands heads, you win
$3. If it lands tails, you must pay $1.
a) If you play this game 100 times, how much money do you expect to have?
b) What is the expected value of this simple game?
Question 29: Roll a die. If it comes up even, you win that many dollars. If it comes
up odd, you must pay that many dollars. (For example, a roll of 4 wins you four
dollars. With a roll of 5, you lose five dollars.)
What is the expected value of this game? Would you want to play it?
Question 30: A coin is tossed once, possibly twice. If a head appears on the first
toss, you win $10 and the game stops. If it lands tails, the coin is tossed again. If
the second toss lands heads you win $4, otherwise you pay $20.
a) If you played this game 100 times, on average, how many times will you win $10?
How many times will you win $4? How many times will you lose?
b) What is the expected value of this game? Would you want to play it?
Question 31: A gambling game is called fair if its expected value is zero.
A die is rolled. If it rolls 1, 2, 3, or 4, you win $300. If it rolls 5 or 6 you lose $x.
Find a value of x that makes this game fair.
Question 32: A gambling game is called fair if its expected value is zero.
A coin is tossed three times. If at least two heads appear, you win $100. If exactly
one head appears, you win $50. If no head appears, you lose $x.
Find a value of x that makes this game fair.
Question 33: A die is rolled. If it lands 1 you win $10. If it lands 2 you win $300.
If it lands 3 you win $1. If it lands 4 you lose $500. If it lands 5 or 6 you win $x.
Find a value of x so that the expected value of this game is fifty cents.
Question 34: There is a one-in-twenty-million chance of winning the lottery.
This week's jackpot is $50,000,000. If it costs $1 to buy a ticket, what is the
expected value of this game? Are the odds in your favour?
[ASIDE: They are! But what fact of social behaviour are we ignoring in this
argument? Why should one still not bother to buy a lottery ticket even if the prize
is so high as to give the impression that the odds are in your favour?]
Question 35: PSEUDO-RANDOM NUMBERS
It is not possible to generate truly random numbers with a computer (any program
follows a predetermined set of instructions), but it is possible to create a list that
appears to be random. Several methods for doing so exist. One of the best known is the
middle-square method developed in 1946 by John von Neumann. It works as
follows:
Step 1: Select a four-digit number.
Step 2: Square the number to produce an eight-digit number. (You might
have to place zeros at the front of the number to get eight digits.)
Step 3: Use the middle four digits of this eight-digit number as the next
number in the sequence.
REPEAT
This procedure produces a seemingly random list of numbers between 0 and 9999.
a) Verify that starting with the number 7254 yields the sequence:
7254, 6205, 5020, 2004, 0160, 0256, 0655, 4290
b) What happens if, instead, you start with the initial number 1049?
This procedure (and, in fact, every procedure that currently exists) is not without
flaw.
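The three steps of the middle-square method can be sketched in a few lines of Python. This is only an illustration: the function name middle_square and the choice to return zero-padded strings are assumptions made here, not part of the original notes.

```python
def middle_square(seed: int, count: int) -> list[str]:
    """Return `count` four-digit terms of the middle-square sequence, starting with `seed`."""
    terms = []
    value = seed
    for _ in range(count):
        terms.append(f"{value:04d}")
        squared = f"{value * value:08d}"   # Step 2: pad the square with zeros to eight digits
        value = int(squared[2:6])          # Step 3: keep the middle four digits
    return terms


print(", ".join(middle_square(7254, 8)))
# 7254, 6205, 5020, 2004, 0160, 0256, 0655, 4290
```

Calling middle_square(1049, 10) is one way to explore part (b).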
SOME MTEL-TYPE QUESTIONS
Question 36: The inner-most circle has diameter 6 inches, and each circle
thereafter has diameter 4 inches greater than the previous circle.
a) What is the ratio of area D to area B?

b) What is the ratio of the perimeter of the largest circle in the diagram to the
perimeter of the smallest circle?
c) A dart is randomly thrown at this target. What is the probability that it lands in
either region A or region C?
Question 37: A survey displays milk preferences amongst men and women:
                 Men   Women
Whole Milk        10       3
2% Milk           18      16
Non-Fat            7      15
No Preference      6      12
a) A woman who was surveyed is chosen at random. What is the probability she
prefers whole milk?
b) A person who likes whole milk is chosen at random. What is the chance that this
person is male?
Question 38: AE = 2AC = 3DE = 4BC
a) Is the ratio CD:AB greater than or less than one-third?
b) A point is chosen in AE at random. What is the probability that it will lie in BD ?
Question 39: I select a card at random from a standard deck of 52 cards and
then a second card. I put the remaining 50 cards aside and lay the two selected
cards face down on the table-top in front of me.
I look at the first card. It is black. Knowing this, what is the probability that the
second card is also black?
Question 40: At a party, 30% of the people present select red as their favourite
colour, 40% select blue, and 30% select yellow. If a person is chosen at random,
what are the chances that he or she does NOT prefer yellow.
Question 41: A computer is programmed to select a single digit 1, 2, 3, 4, 5, 6, 7,
8, or 9 at random.
a) What are the chances that the computer will select an even digit?
b) The computer selects two digits, one after the other. What are the chances
of obtaining an odd digit followed by an even digit?
PART II of IV
James Tanton
2007 James Tanton
CONTENTS:
COUNTING PRINCIPLES
The Multiplication Principle 2
Factorials 4
The Labeling Principle 9
Multi-stage Labeling 14
Fun with Poker 18
PASCAL'S TRIANGLE
A Grid of Numbers 22
The Binomial Theorem 29
Exercises 33
PART TWO: 2
THE MULTIPLICATION PRINCIPLE
Here's a very simple puzzle:
There are three major highways from Adelaide to Brisbane, and four major
highways from Brisbane to Canberra.
How many different routes can one take to travel from Adelaide to
Canberra?
The answer to this question is clearly 12. But pause for a moment and ask
yourself why. Is it obvious that the number of routes from A to C really is
\(3 \times 4\), that is, three groups of four?
Make sure you are comfortable that multiplication really is the right
arithmetic operation here (as opposed to direct addition).
Let's take the puzzle up a notch:
Suppose there are also six major highways from Canberra to Darwin.
How many different routes are there from A to D?
Be sure that you are convinced the answer is given by multiplication:
Number of routes = \(3 \times 4 \times 6 = 72\)
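One way to convince yourself that multiplication is the right operation is to list every route explicitly. The sketch below does this with itertools.product. The highway labels (AB1, BC1, and so on) are made up for illustration; only the counts 3, 4 and 6 come from the puzzle.

```python
from itertools import product

# Hypothetical highway labels; only the counts (3, 4 and 6) come from the puzzle.
adelaide_to_brisbane = ["AB1", "AB2", "AB3"]
brisbane_to_canberra = ["BC1", "BC2", "BC3", "BC4"]
canberra_to_darwin = ["CD1", "CD2", "CD3", "CD4", "CD5", "CD6"]

# A route is one highway chosen for each leg of the trip.
routes_a_to_c = list(product(adelaide_to_brisbane, brisbane_to_canberra))
routes_a_to_d = list(product(adelaide_to_brisbane, brisbane_to_canberra, canberra_to_darwin))

print(len(routes_a_to_c))  # 12
print(len(routes_a_to_d))  # 72
```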
EXERCISE: I own five different shirts, four different pairs of trousers
and two sets of shoes. How many different outfits could you see me in?
EXERCISE: There are ten possible movies I can see and ten possible snacks
I can eat whilst at the movies. I am going to see a film tonight and I will eat
a snack. How many choices do I have in all for a movie/snack combo?
We have:
THE MULTIPLICATION PRINCIPLE
If there are a ways to complete one task and b ways to complete a second
task, and the outcomes of the first task in no way affect the choices made
for the second task, then the number of different ways to complete both
tasks is \(a \times b\).
This principle readily extends to the completion of more than two tasks.
EXERCISE: Explain the clause stated in the middle of the multiplication
principle. What could happen if different outcomes from the first task
affect choices available for the second task? Give a concrete example.
FACTORIALS:
In how many ways can six people stand in a line?
Answer: There are six possibilities for the task of placing someone in the
first spot, five possibilities for who to place second, four for third, three
for fourth, two for fifth and one for sixth. By the multiplication principle there
are thus:
\(6 \times 5 \times 4 \times 3 \times 2 \times 1 = 720\)
ways to complete the task of lining up all six people.
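Definition: The product of the integers from 1 to N is called N factorial and is
denoted N!.
These numbers grow very large very quickly:
\(1! = 1\)
\(2! = 2 \times 1 = 2\)
\(3! = 3 \times 2 \times 1 = 6\)
\(4! = 4 \times 3 \times 2 \times 1 = 24\)
\(5! = 5 \times 4 \times 3 \times 2 \times 1 = 120\)
\(6! = 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 720\)
\(7! = 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 5040\)
\(8! = 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 40320\)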
COMMENT: In 1729, at the age of 22, Swiss mathematician Leonhard Euler
found a formula for a function that generalizes the factorial function. It is
now called the Gamma Function. The curious thing is that you can put
fractional and irrational values into his gamma function and obtain
meaningful answers. Euler discovered, for instance, that \(\left(\tfrac{1}{2}\right)!\) equals \(\tfrac{\sqrt{\pi}}{2}\). Very
strange!
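Python's standard library provides the Gamma function as math.gamma, so Euler's value can be checked numerically. The sketch below relies on the standard identity x! = Gamma(x + 1). The helper name generalized_factorial is an assumption made here for illustration.

```python
import math


def generalized_factorial(x: float) -> float:
    """Extend n! to non-integer x via the identity x! = Gamma(x + 1)."""
    return math.gamma(x + 1)


print(generalized_factorial(5))     # agrees with 5! = 120
print(generalized_factorial(0.5))   # the value of (1/2)!
print(math.sqrt(math.pi) / 2)       # sqrt(pi)/2, for comparison with the line above
```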
EXERCISE: What is the highest factorial your calculator can handle?

WORD GAMES:
EXAMPLE: My name is JIM. In how many ways can one rearrange the
letters of my name?
Answer 1: By brute force we can list all possibilities and see that there are
six arrangements: JIM JMI MIJ MJI IMJ IJM
Answer 2: We can use the multiplication principle. We have three slots to

fill:
___ ___ ___
The first task is to fill the first slot with a letter. There are 3 ways to
complete this task.
The second task is to fill the second slot. There are 2 ways to complete this
task. (Once the first slot is filled, there are only two choices of letters to
use for the second slot.)
The third task is to fill the third slot. There is only 1 way to complete this
task (once slots one and two are filled).
By the multiplication principle, there are thus \(3 \times 2 \times 1 = 3!\) ways to complete
this task.
EXERCISE: In how many ways can one arrange the letters HOUSE ?
EXERCISE: How many ways are there to rearrange the letters BOB? Assume
the B's are indistinguishable.
Comment: One can certainly answer this second exercise by brute force:
just list the possibilities. But is there a sophisticated way to think about how
to handle the repeated letter? Think about this before reading on.
PRACTICE EXERCISE:
In how many ways can one arrange the letters HOUSES?
Certainly if the S's were distinguishable (written, say, as S1 and S2) then
the problem is easy to answer:
There are 6! ways to rearrange the letters HOUS1ES2.
The list of arrangements might begin:
HOUS1ES2
HOUS2ES1
OHUS1S2E
OHUS2S1E
S1S2UEOH
S2S1UEOH
But notice, if the S's are no longer distinguishable, then pairs in this list of
answers collapse to give the same arrangement. We must alter our answer
by a factor of two and so the number of arrangements of the word HOUSES
is:
\[\frac{6!}{2} = 360\]
QUESTION: What is this 2 in the denominator? To properly understand
it, work out the answer to this next problem:
EXERCISE: How many ways are there to rearrange the letters of the word
CHEESE?
Think about this before reading on.

Answer: If the three E's are distinct (written E1, E2, and E3, say) then
there are 6! ways to rearrange the letters CHE1E2SE3. But the three E's can
be rearranged 3! = 6 different ways within any one particular arrangement
of letters. These six arrangements would be seen as the same if the E's were
no longer distinct:
HE1E2SCE3    HE3E1SCE2
HE1E3SCE2    HE3E2SCE1    →  HEESCE
HE2E1SCE3    HE2E3SCE1
Thus we must divide our answer of 6! by 3! to account for the groupings of
six that become identical. There are thus \(\frac{6!}{3!} = 120\) ways to arrange the
letters of CHEESE.
Comment: The number of ways to rearrange the letters HOUSES is \(\frac{6!}{2!}\). The
2 in the denominator is really 2!.
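The counts for JIM, HOUSES and CHEESE can be checked by brute force: list every ordering of the letters and discard the repeats. The sketch below does this with a set. The function name distinct_arrangements is an assumption made here, and the approach is only practical for short words.

```python
from itertools import permutations
from math import factorial


def distinct_arrangements(word: str) -> int:
    """Count distinct rearrangements by listing all orderings and discarding repeats."""
    return len(set(permutations(word)))


print(distinct_arrangements("JIM"), factorial(3))                      # 6 6
print(distinct_arrangements("HOUSES"), factorial(6) // factorial(2))   # 360 360
print(distinct_arrangements("CHEESE"), factorial(6) // factorial(3))   # 120 120
```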
EXERCISE: Explain why the number of ways to arrange the letters of the
word CHEESES is \(\frac{7!}{3!\,2!}\).
EXERCISE:
In how many ways can one arrange the letters CHEEEEESIEST?
How about of CHEESIESTESSNESS?
Comment: Consider the word DOODLED. Its letters can be arranged in \(\frac{7!}{2!\,3!}\)
different ways, with 2! in the denominator arising from the fact that there
are two O's, and the 3! from the three D's. If we wished, we could also
include in the denominator a 1! (which equals 1) for the fact that there is a
single L in the word and another 1! for the single E. Thus the number of ways
to arrange the letters DOODLED might be better written:
\[\frac{7!}{2!\,3!\,1!\,1!}\]
This has the advantage of offering a self check: the numbers in the
denominator should sum to the number in the numerator.
Let's take this further.
Each number in the denominator corresponds to the number of times a
letter appears in the original word: O two times, D thrice, E once and L once.
The letter P appears zero times, so we could actually write:
\[\frac{7!}{2!\,3!\,1!\,1!\,0!}\]
Also, the letter J appears zero times as well, so perhaps we should write:
\[\frac{7!}{2!\,3!\,1!\,1!\,0!\,0!}\]
and so on.
THIS IS ALL FINE AND CONSISTENT IF WE CHOOSE

TO DEFINE 0! TO BE THE NUMBER 1.
It is for this reason that mathematicians set 0! = 1. Even if one is being silly,
the formulas still remain correct.
THE LABELING PRINCIPLE
We can rephrase the letter-arranging problem. Again consider the word
CHEESIEST. Rearranging these letters corresponds to assigning letters to
nine slots:
1 slot is to be labeled C
1 slot is to be labeled H
3 slots are to be labeled E
2 slots are to be labeled S
1 slot is to be labeled I
1 slot is to be labeled T
We know the answer to the problem is: \(\frac{9!}{1!\,1!\,3!\,2!\,1!\,1!}\).
This is the same problem as the following:
Nine people are to be given hats. One is to be given a cranberry-red
hat (C), one is to be given a hot-pink hat (H), three emerald-green
hats (E), two sky-blue hats (S), one an indigo-blue hat (I), and one a
teal hat (T). In how many ways can this be done?
We see that rearranging letters is equivalent to assigning labels to distinct
objects (people or specific slots) and the answer to the problem is the
fraction with numerator the number of objects, factorialised, and
denominator given by the counts of objects with each label, factorialised.
We have:
THE LABELING PRINCIPLE
Each of N distinct objects is to be given a label. If \(k_1\) of them are to have
label 1, \(k_2\) label 2, and so on, all the way to \(k_r\) of them label r, then the total
number of ways to assign all labels is given by:
\[\frac{N!}{k_1!\,k_2!\cdots k_r!}\]
This is an extremely powerful result.
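The labeling principle translates directly into a small function. The sketch below is illustrative only: the name labeling_count is an assumption, and N is taken to be the sum of the label counts, so every object must receive a label.

```python
from math import factorial


def labeling_count(*label_counts: int) -> int:
    """Number of ways to assign labels, given how many objects receive each label.

    The total number of objects N is the sum of the label counts.
    """
    total = factorial(sum(label_counts))
    for k in label_counts:
        total //= factorial(k)   # each division is exact, so integer division is safe
    return total


print(labeling_count(1, 1, 3, 2, 1, 1))   # CHEESIEST: 30240
print(labeling_count(2, 3, 1, 1))         # DOODLED: 420
print(labeling_count(2, 3, 1, 1, 0, 0))   # extra 0! factors change nothing: 420
```

Because 0! = 1, passing a count of zero for an unused label leaves the result unchanged.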
SOME EXAMPLES:
1. Four people from a group of ten are needed for a committee. In how many
different ways can a committee be formed?
Answer: The ten folk are to be labeled as follows: 4 as "on the committee"
and 6 as "off." The answer must be \(\frac{10!}{4!\,6!}\).
2. Fifteen horses run a race. How many possibilities are there for first,
second, and third place?
Answer: One horse will be labeled first, one will be labeled second, one
third, and twelve will be labeled losers. The answer must be: \(\frac{15!}{1!\,1!\,1!\,12!}\).
3. A "feel good" running race has 20 participants. Three will be deemed
equal first place winners, five will be deemed equal second place winners,
and the rest will be deemed equal third place winners. How many different
outcomes can occur?
Answer: Easy! \(\frac{20!}{3!\,5!\,12!}\).
4. From an office of 20 people, two committees are needed. The first
committee shall have 7 members, one of whom shall be the chair and one the
treasurer. The second committee shall have 8 members. This committee will
have 3 co-chairs, 2 co-secretaries and 1 treasurer. In how many ways can
this be done?
Answer: Keep track of the labels. Here they are:
1 person will be labeled chair of first committee
1 person will be labeled treasurer of first committee
5 people will be labeled ordinary members of first committee
3 people will be labeled co-chairs of second committee
2 people will be labeled co-secretaries of second committee
1 person will be labeled treasurer of second committee
2 people will be labeled ordinary members of second committee
5 people will be labeled lucky: they are on neither committee.
The total number of possibilities is thus: \(\frac{20!}{1!\,1!\,5!\,3!\,2!\,1!\,2!\,5!}\). Easy!
COMMENT: Students are usually taught to distinguish between a
"permutation" and a "combination." They differ by whether or not the order
of terms is important. This is unnecessarily confusing and somewhat
artificial. People would call the first example a combination. They would call
the second example a permutation. There are no names for examples 3 and 4!
COMMENT:
The formula \(\frac{N!}{k_1!\,k_2!\cdots k_r!}\) with \(k_1 + k_2 + \cdots + k_r = N\) is called a generalized
combinatorial coefficient. It is denoted:
\[\binom{N}{k_1\ k_2\ \cdots\ k_r}\]
For example, \(\binom{6}{2\ 3\ 1} = \frac{6!}{2!\,3!\,1!} = 60\).
EXAMPLE: In how many different ways can one arrange seven As and nine
Bs?
Answer: We have sixteen slots, seven of which are to be labeled A and
nine to be labeled B. This gives:
\[\frac{16!}{7!\,9!}\]
possible arrangements.
EXAMPLE: Ten circles are drawn in a row. In how many different ways can
we color two of them black and leave the rest white?
Answer: Two circles are to be labeled black and eight as white. There are:
\[\frac{10!}{2!\,8!} = \frac{10 \times 9}{2} = 45\]
possibilities.
CHALLENGE: How many solutions are there to the equation 8 = a + b + c if
each of a, b and c is a positive integer or zero?
HINT:
IF YOU REALLY ARE WORRIED ABOUT ORDER
1. SELECTION WITHOUT ORDER IS JUST LABELING
An example will explain:
Suppose 5 people are to be chosen from 12 and the order in which folk
are chosen is not important. In how many ways can this be done?
Answer:
5 people will be labeled chosen and 7 not chosen. There are \(\frac{12!}{5!\,7!}\) ways to
accomplish this task.
2. SELECTION WITH ORDER IS JUST LABELING
An example will again explain:
Suppose 5 people are to be chosen from 12 for a team and the order
in which they are chosen is considered important. In how many ways
can this be done?
Answer: We have:
1 person labeled first
1 person labeled second
1 person labeled third
1 person labeled fourth
1 person labeled fifth
7 people labeled not chosen
This can be done in \(\frac{12!}{1!\,1!\,1!\,1!\,1!\,7!}\) ways.
Again there is no need to fuss about order. Just come up with the labeling
scheme that is appropriate for the problem.
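For readers who know the conventional names, the two labeling answers above agree with what Python's math module calls comb (selection without order) and perm (selection with order). The sketch below simply compares the values; the variable names are chosen here for illustration.

```python
from math import comb, factorial, perm

# Selection without order: 5 people labeled "chosen", 7 labeled "not chosen".
without_order = factorial(12) // (factorial(5) * factorial(7))

# Selection with order: labels "first" to "fifth" (one person each), 7 "not chosen".
with_order = factorial(12) // factorial(7)   # the five 1! factors all equal 1

print(without_order, comb(12, 5))   # 792 792
print(with_order, perm(12, 5))      # 95040 95040
```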
EXERCISE: Coming full circle. Explain, using the labeling principle, why the
number of ways to arrange six people in a line is 6! (which is really \(\frac{6!}{1!\,1!\,1!\,1!\,1!\,1!}\)).
MULTI-STAGE LABELING
Although the labeling principle helps remove the confusion of order vs. non-order,
many standard arrangement problems still possess a level of complication that is
delicate. For example, consider the following typical standardized test problem:
In how many ways can one arrange the letters of the word ORANGE if the
first and last letters must each be a vowel?
This is not a straightforward labeling problem as some objects are given preferred
status over others: the vowels require a different type of consideration from the
consonants. It is really a two-stage challenge:
STAGE 1: Contend with the vowels

STAGE 2: Contend with the remaining letters
Each of these stages can be handled separately. The Multiplication Principle tells us
to then multiply the results.
Solution:
STAGE 1: One vowel shall be labeled first position, one last position and one
shall be labeled placed with the consonants. There are \(\frac{3!}{1!\,1!\,1!} = 6\) ways to complete
stage 1.
STAGE 2: We now have four letters (the consonants R, N and G, and the remaining vowel) to label
as second, third, fourth and fifth. There are \(\frac{4!}{1!\,1!\,1!\,1!} = 24\) ways to accomplish stage
2.
Thus there are \(6 \times 24 = 144\) desired arrangements of ORANGE.
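The two-stage count can be checked by brute force. The sketch below lists all 720 orderings of ORANGE and keeps those that begin and end with a vowel; the names VOWELS and valid are chosen here for illustration.

```python
from itertools import permutations

VOWELS = set("AEIOU")

# List every arrangement of ORANGE and keep those that start and end with a vowel.
valid = [
    "".join(p)
    for p in permutations("ORANGE")
    if p[0] in VOWELS and p[-1] in VOWELS
]

print(len(valid))  # 144
```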
COMMENT: Many might prefer to present the answer as a six-stage process:
Do you see what is meant by this diagram?

EXAMPLE: A company would like to send out a team of five plumbers to a
construction site. They will send two expert plumbers and three trainee plumbers.
If there are a total of 10 expert plumbers available and 8 trainees, how many
different teams are possible?
Answer: This too is a two-stage process:
STAGE 1: Select the experts
There are \(\frac{10!}{2!\,8!}\) possible ways to label two expert plumbers as chosen and
the rest not chosen.
STAGE 2: Select the trainees
There are \(\frac{8!}{3!\,5!}\) possible ways to label three trainees as chosen and the
rest not chosen.
By the multiplication principle, there are thus \(\frac{10!}{2!\,8!} \times \frac{8!}{3!\,5!}\) possible teams.
COMMENT: Notice that we have no control over who is labeled "expert" and who is
labeled "trainee." We only have control over the labels "chosen" and "not chosen."
That there are some fixed, previously assigned labels is a hint that this must be dealt with as
a multi-stage problem.
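The two stages can be mirrored in code: choose the experts, choose the trainees, and pair every first-stage choice with every second-stage choice. The plumber names below are placeholders invented for this sketch.

```python
from itertools import combinations, product
from math import comb

experts = [f"expert{i}" for i in range(1, 11)]    # 10 hypothetical expert names
trainees = [f"trainee{i}" for i in range(1, 9)]   # 8 hypothetical trainee names

# Stage 1 picks 2 experts, stage 2 picks 3 trainees; a team pairs one choice from each stage.
teams = list(product(combinations(experts, 2), combinations(trainees, 3)))

print(len(teams))                 # 2520
print(comb(10, 2) * comb(8, 3))   # 2520
```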
EXAMPLE: In how many ways can one arrange the letters ABCDE so that A is never
at the beginning or the end?
We'll give three answers to this problem, even though most people would prefer to
answer the question in just the first way we present. (We offer two more approaches just
to illustrate that there are multiple ways to approach these problems.)
Answer 1: Think of this as a five-stage process! Deal with the first letter, deal
with the last letter, deal with the second letter, deal with the third letter, and deal
with the fourth letter. By the multiplication principle, we multiply the results.
Answer 2: The letters B, C, D, E have a different status than A.
STAGE 1: Place the letter A
There are 3 possible locations for this letter
STAGE 2: Place the remaining letters
There are four remaining positions for four letters. They can be placed in
these positions in \(\frac{4!}{1!\,1!\,1!\,1!} = 24\) ways.
Thus there are \(3 \times 24 = 72\) desired arrangements.
Answer 3: There are five slots in which to place letters with the two end slots
having a different status than the middle three.
STAGE 1: Fill the end slots.
There are four letters to work with, yielding \(\frac{4!}{1!\,1!\,2!} = 12\) possibilities.
(The labels here are first slot, last slot and not used.)
STAGE 2: Fill the middle three slots.
There are three letters to work with, yielding \(\frac{3!}{1!\,1!\,1!} = 6\) possibilities.
By the multiplication principle we have \(12 \times 6 = 72\) permissible arrangements.

EXAMPLE: Six people (Albert, Bilbert, Cuthbert, Dilbert, Egbert and Filbert) are to
sit in a circle. How many different arrangements are possible if rotations of the
same arrangement are considered equivalent?
Answer: This question is tricky in that there are no clear labels associated with
the question: there is no clear first seat or second seat, and so forth. We can
think of it as a multi-stage process nonetheless by having the men take a seat one
at a time:
Albert must sit somewhere. He can sit anywhere (since all rotations are
deemed equivalent) and there is thus only 1 action for him to take.
Bilbert now has 5 options: take the seat one place to Albert's left, two
places to his left, and so on. Cuthbert has 4 options. Dilbert has 3. Egbert
has 2. Filbert has 1.
Thus by the multiplication principle, there are \(1 \times 5 \times 4 \times 3 \times 2 \times 1 = 120\)
possible arrangements.
We can be a little slicker and think of this as a two-stage process:
STAGE 1: Albert takes a seat
There is only 1 option: Albert takes any seat.
STAGE 2: The remaining five each take a seat.
This is a labeling problem as Albert's position now defines five labels: one
place to his left, two places to his left, and so on. There are thus
\(\frac{5!}{1!\,1!\,1!\,1!\,1!} = 120\) possibilities.
By the multiplication principle there are \(1 \times 120 = 120\) possible configurations.
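The idea that Albert's seat fixes the frame of reference can also be checked by brute force. The sketch below seats the six men in numbered chairs in all 6! ways, then rotates each seating so that Albert comes first and counts the distinct results. The helper name rotate_to_front is an assumption made here.

```python
from itertools import permutations

people = ["Albert", "Bilbert", "Cuthbert", "Dilbert", "Egbert", "Filbert"]


def rotate_to_front(seating: tuple, person: str) -> tuple:
    """Rotate a circular seating so that `person` occupies position 0."""
    i = seating.index(person)
    return seating[i:] + seating[:i]


# All 6! seatings of numbered chairs, with rotations of the same circle merged.
circles = {rotate_to_front(p, "Albert") for p in permutations(people)}

print(len(circles))  # 120
```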

FUN WITH POKER HANDS
One plays poker with a deck of 52 cards, which come in 4 suits (hearts,
clubs, spades, diamonds) with 13 values per suit (A, 2, 3, ..., 10, J, Q, K).
In poker one is dealt five cards and certain combinations of cards are
deemed valuable. For example, a "four of a kind" consists of four cards of
the same value and a fifth card of arbitrary value. A "full house" is a set of
three cards of one value and two cards of a second value. A "flush" is a set
of five cards of the same suit. The order in which one holds the cards in
one's hand is immaterial.
EXAMPLE: How many flushes are possible in poker?
Answer: Again this is a multi-stage problem with each stage being its own
separate labeling problem. One way to help tease apart stages is to imagine
that you've been given the task of writing a computer program to create
poker hands. How will you instruct the computer to create a flush?
First of all, there are four suits (hearts, spades, clubs and diamonds) and
we need to choose one to use for our flush. That is, we need to label one suit
as used and three suits as not used. There are \(\frac{4!}{1!\,3!} = 4\) ways to do this.
Second stage: Now that we have a suit, we need to choose five cards from
the 13 cards of that suit to use for our hand. Again, this is a labeling
problem: label five cards as used and eight cards as not used. There are
\(\frac{13!}{5!\,8!} = 1287\) ways to do this.
By the multiplication principle there are \(4 \times 1287 = 5148\) ways to complete both
stages. That is, there are 5148 possible flushes.
Comment: There are \(\frac{52!}{5!\,47!} = 2598960\) five-card hands in total in poker.
(Why?) The chances of being dealt a flush are thus: \(\frac{5148}{2598960} \approx 0.20\%\).
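Taking the "write a computer program" suggestion literally, here is a minimal sketch that builds every flush in two stages: pick the suit, then pick five of its thirteen cards. The card representation (a value paired with a suit name) is an assumption made for illustration. As in the count above, straight flushes are included.

```python
from itertools import combinations
from math import comb

SUITS = ["hearts", "clubs", "spades", "diamonds"]
VALUES = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]

# Stage 1: choose the suit. Stage 2: choose five of that suit's 13 cards.
flushes = [
    tuple((value, suit) for value in five_values)
    for suit in SUITS
    for five_values in combinations(VALUES, 5)
]

total_hands = comb(52, 5)
print(len(flushes))                          # 5148
print(total_hands)                           # 2598960
print(f"{len(flushes) / total_hands:.2%}")   # 0.20%
```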
EXAMPLE: How many full houses are possible in poker?
Answer: This problem is really a three-stage labeling issue.
First we must select which of the thirteen card values (A, 2, 3, 4, 5, 6, 7, 8,
9, 10, J, Q, K) is going to be used for the triple, which will be used for the
double, and which 11 values are going to be ignored. There are
\(\frac{13!}{1!\,1!\,11!} = 13 \times 12 = 156\) ways to accomplish this task.
Among the four cards of the value selected for the triple, three will be used
for the triple and one will be ignored. There are \(\frac{4!}{3!\,1!} = 4\) ways to accomplish
this task. Among the four cards of the value selected for the double, two
will be used and two will be ignored. There are \(\frac{4!}{2!\,2!} = 6\) ways to accomplish
this.
By the multiplication principle, there are \(156 \times 4 \times 6 = 3744\) possible full
houses.
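The three stages can be written as three nested loops. The sketch below collects the resulting hands in a set, which also confirms that no hand is produced twice. The one-letter suit codes are an assumption made here for brevity.

```python
from itertools import combinations, permutations

SUITS = "HCSD"   # hypothetical one-letter codes for hearts, clubs, spades, diamonds
VALUES = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]

full_houses = set()
for triple_value, pair_value in permutations(VALUES, 2):   # stage 1: 13 x 12 = 156 choices
    for triple_suits in combinations(SUITS, 3):            # stage 2: 4 choices
        for pair_suits in combinations(SUITS, 2):          # stage 3: 6 choices
            hand = frozenset(
                [(triple_value, s) for s in triple_suits]
                + [(pair_value, s) for s in pair_suits]
            )
            full_houses.add(hand)

print(len(full_houses))  # 3744
```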
COMMENT: High-school teacher Sam Miskin recently used this labeling
method to count poker hands with his high-school students. To count how
many "one pair" hands there are (that is, hands with one pair of cards of the same
numerical value and three remaining cards each of different value) he found
it instructive to bring 13 students to the front of the room and hand each
student the four cards of one value from a single deck of cards.
He then asked the remaining students to select which of the thirteen
students should be the "pair" and which three should be the "singles." He
had the remaining nine students return to their seats.
He then asked the pair student to raise his four cards in the air and asked
the seated students to select which two of the four should be used for the
pair. He then asked each of the three single students in turn to hold up
their cards while the seated students selected one of the four cards to
make a singleton.
This process made the multi-stage procedure clear to all and the count of
possible one pair hands, namely,
\[\frac{13!}{1!\,3!\,9!} \times \frac{4!}{2!\,2!} \times 4 \times 4 \times 4\]
readily apparent.
EXERCISE: "Two pair" consists of two cards of one value, two cards of a
different value, and a fifth card of a third value. What are the chances of
being dealt two pair in poker?
EXAMPLE: A "straight" consists of five cards with values forming a string
of five consecutive values (with no "wrap around"). For example, 45678,
A2345 and 10JQKA are considered straights, but KQA23 is not. (Suits are
immaterial for straights.)
How many different straights are there in poker?
Answer: A straight can begin with A, 2, 3, 4, 5, 6, 7, 8, 9 or 10. We must
first select which of these values is to be the start of our straight. There
are 10 choices.
For the starting value we must select which of the four suits it will be.
There are 4 choices.
There are also 4 choices for the suit of the second card in the straight, 4
for the third, 4 for the fourth, and 4 for the fifth.
By the multiplication principle, the total number of straights is:
\(10 \times 4 \times 4 \times 4 \times 4 \times 4 = 10240\).
The chance of being dealt a straight is about 0.39%.
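The count of straights can be reproduced by generating them stage by stage: a starting value, then an independent suit for each of the five cards. In this sketch the list VALUE_CYCLE repeats the ace at both ends so that both A2345 and 10JQKA are found; that representation is an assumption made here. As in the count above, straights that are also flushes are included.

```python
from itertools import product
from math import comb

SUITS = "HCSD"   # hypothetical one-letter suit codes
# The ace appears at both ends so that A2345 and 10JQKA are both generated.
VALUE_CYCLE = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]

straights = [
    tuple(zip(VALUE_CYCLE[start:start + 5], suits))
    for start in range(10)                  # 10 possible starting values, A through 10
    for suits in product(SUITS, repeat=5)   # an independent suit choice for each card
]

print(len(straights))                          # 10240
print(f"{len(straights) / comb(52, 5):.2%}")   # 0.39%
```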

Another popular gambling game:
EXERCISE: KENO
In a KENO game ten numbers are selected at random from the numbers 1
through 80. Players of the game submit tickets beforehand selecting between 1
and 10 numbers. They win prizes according to the number of matches
they receive.
a) Poindexter plays a 10-spot game, meaning that he selects 10
numbers on his ticket. What are his chances of obtaining 10 matches
out of 10? What are his chances of receiving 9 matches out of 10?
b) Bilbert pays $1 to play a 4-spot game, meaning that he selects 4
numbers on his ticket. The payouts for the 4-spot game are as follows:
Match all four: $50
Match three out of four: $5
Match two out of four: $1
What is the expected value of this 4-spot game?

A GRID OF NUMBERS
Here's a famous puzzle:
Starting at the top-left cell marked S and taking horizontal steps one
place only to the right or vertical steps downwards only, how many
different paths are there to the location marked E?
Play with this puzzle for a while before reading on. As you play, perhaps
contemplate the following questions:
1. Given the location of the point E, is the grid shown in the diagram
unnecessarily large?
2. Marking in different paths from S to E is awfully complicated. One
could first count paths to different cells, ones easier to handle,
and look for patterns. For example, how many distinct paths are there
from S to any cell on the top row? Write the answers in those cells.
How many distinct paths take you to any cell in the leftmost column?
To cells in the second row? Second column? Third row?
3. If you are willing to trust patterns, can you make a good guess as to
the answer to the original puzzle?
There are two ways to approach this puzzle.
APPROACH NUMBER 1: FORMULAS
Every path from S to E can be described by a sequence of letters R and D.
For example, the path given in the diagram can be described by the
sequence:
RDRRDDRRRDRR
This sequence contains eight R's and four D's. Moreover, any sequence of
eight R's and four D's corresponds to a path from S to E.
Exercise: Mark in on the diagram the paths given by DDRRRRRDRRRD and
RRRRRRRRDDDD.
Thus the number of paths from S to E matches the number of ways to
arrange twelve letters: eight R's and four D's. (That is, to label twelve slots
with eight R's and four D's.) The answer to the original puzzle is:
\(\frac{12!}{8!\,4!} = 495\) paths.
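Labeling twelve slots with eight R's and four D's amounts to choosing which four of the twelve slots receive a D. The sketch below generates every such path string that way; the helper name path_from_down_positions is an assumption made here.

```python
from itertools import combinations
from math import factorial

STEPS, DOWN_STEPS = 12, 4


def path_from_down_positions(down_positions: tuple) -> str:
    """Build the R/D string whose D's sit at the given slot positions."""
    return "".join("D" if i in down_positions else "R" for i in range(STEPS))


paths = [path_from_down_positions(c) for c in combinations(range(STEPS), DOWN_STEPS)]

print(len(paths))                                       # 495
print(factorial(12) // (factorial(8) * factorial(4)))   # 495
print("RDRRDDRRRDRR" in paths)                          # True
```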
Exercise: How many paths are there from S to the bottom-right cell of the
grid?
Exercise: Suppose the cell E is a steps to the right of S and b steps down
from S. Show that the number of paths from S to E is given by:
\[\binom{a+b}{a\ b} = \frac{(a+b)!}{a!\,b!}\]
In fact, number the rows 0, 1, 2, ... (with the top row being the zero-th row)
and number the columns 0, 1, 2, ... (with the leftmost column being the
zero-th column). The cell E in the original diagram thus has position row 4,
column 8, and the number of paths to it is \(\frac{12!}{4!\,8!}\). (Paths to this cell involve 4
D's and 8 R's.)
In general, numbering rows and columns this way, the cell in row a and column b
requires a D's and b R's to get to it and so the number of paths to it is:
\[\frac{(a+b)!}{a!\,b!}\]
Exercise: Is this formula still correct for the cells in the zero-th row? In
the zero-th column? (Good thing we set 0! = 1.)
What value should we place in the cell labeled S (row zero, column zero)?
How many paths should we say start at S and end at S?
APPROACH 2: PATTERNS
If you fill in the answers for the number of paths to each cell, the following
grid of numbers appears:
(The exercise above suggests that the position labeled S should also be
assigned the number 1.)
Exercise: Explain why the table is symmetrical about the southeast diagonal
line.
We have the formula \(\frac{(a+b)!}{a!\,b!}\) for the entry in the a-th row and b-th column
(starting the counts at zero).
Have you noticed that each entry in an interior cell is the sum of two
numbers: the number just above the cell and the number just to the left of
the cell? This makes sense in terms of counting paths. Consider the circled
cell. To reach this cell one can either first reach the cell just above (there
are 15 ways to do this) and then step down, or reach the cell just to its left
(there are 20 ways to accomplish this) and then step right. This gives a
total of \(15 + 20 = 35\) paths to the circled cell.
Exercise: Use this observation to fill in the remainder of the table.
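The "above plus left" rule is enough to fill the whole table mechanically. The sketch below assumes a grid of 5 rows and 9 columns, just large enough to reach row 4, column 8. The size of the grid in the original diagram is not stated, so ROWS and COLS are guesses made for illustration.

```python
from math import factorial

ROWS, COLS = 5, 9   # assumed grid size: just large enough to reach row 4, column 8

grid = [[1] * COLS for _ in range(ROWS)]   # the top row and leftmost column are all 1s
for a in range(1, ROWS):
    for b in range(1, COLS):
        grid[a][b] = grid[a - 1][b] + grid[a][b - 1]   # paths from above + paths from the left

for row in grid:
    print(" ".join(f"{n:4d}" for n in row))

print(grid[3][4])   # 35, reached as 15 + 20
print(grid[4][8])   # 495
```

Every entry produced this way agrees with the factorial formula for row a and column b.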
This grid of numbers possesses a number of curious properties. For example,
start at any 1 on the top row, head down any number of cells, and then turn
right for one cell to create a "stocking." The number in the toe of the stocking
always equals the sum of the numbers in the leg of the stocking.
There is a horizontal version of this stocking property also.
Exercise: Explain why the stocking property works.
HINT: Use the fact that the number in the toe is really the sum of two
other numbers in the grid.
The grid of numbers appearing in the cells is just the grid of numbers made
famous by French mathematician Blaise Pascal (1623-1662) for his work in
probability theory.
Each row of this triangle is a diagonal of the grid.
Regard the top row of the triangle (the single 1) as the zero-th row.
(Then the sixth row of the triangle, for example, is 1 6 15 20 15 6 1).
In any row of the triangle call the left-most entry the zero-th entry.
(Thus in the sixth row, 1 is the zero-th entry, 6 is the first entry, 15 is
the second entry, and so on.)
Then the formula for the entry in the n-th row k places in from the left (and
n - k places from the right) is:
$\frac{n!}{k!\,(n-k)!}$
COMMENT: Make sure you understand that this is correct.
Consider an entry in the a-th row and b-th column of the grid of numbers.
Then the formula for that entry is: $\frac{(a+b)!}{a!\,b!}$. But a + b is the number of the
diagonal on which that cell belongs, a is the number of places in from one
end of the diagonal, and b the number of places in from the other end of
the diagonal.
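The link between the path-counting grid, the factorial formula, and the rows of the triangle can be checked by machine. The following is a minimal Python sketch, not part of the original notes: the names `path_grid`, `formula` and `diagonal` are made up for illustration, and the 7-by-7 grid size is an arbitrary choice.

```python
from math import factorial

def path_grid(rows, cols):
    """Count paths from the top-left cell S using only steps right or down."""
    grid = [[1] * cols for _ in range(rows)]  # top row and left column are all 1s
    for a in range(1, rows):
        for b in range(1, cols):
            # interior cell = number just above + number just to the left
            grid[a][b] = grid[a - 1][b] + grid[a][b - 1]
    return grid

def formula(a, b):
    """The entry in the a-th row and b-th column: (a + b)! / (a! b!)."""
    return factorial(a + b) // (factorial(a) * factorial(b))

def diagonal(grid, n):
    """Read off the n-th diagonal of the grid: the cells with a + b = n."""
    return [grid[a][n - a] for a in range(n + 1)]

grid = path_grid(7, 7)
print(grid[3][4])          # 35, the 15 + 20 cell discussed in the text
print(diagonal(grid, 6))   # [1, 6, 15, 20, 15, 6, 1]
print(all(grid[a][b] == formula(a, b) for a in range(7) for b in range(7)))  # True
```

The sixth diagonal of the grid comes out as the sixth row of the triangle, which is exactly the claim of the COMMENT above.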
Exercise: a) Draw a grid of squares and mark the cell in the 3-rd row and 4-th
column. Verify that it is indeed on the 7-th diagonal of the grid, 3 and 4
places in from each end.
b) Consider the entry on the 5-th row of Pascal's triangle, 2 places in from
the left. Find the corresponding cell in your grid of squares. What are a and
b for this cell and what does 5 correspond to?
For the grid of numbers we saw that every entry was the sum of the entry
just above it and just to the left of it. In Pascal's triangle this translates to:
EACH INTERIOR ENTRY IS THE SUM OF THE TWO ENTRIES ABOVE IT
Exercise:
a) Without doing the computation, explain why the sum of entries in the
bottom row shown above will turn out to be double the sum of the entries in
the row just above it.
HINT: Each entry in the bottom row is the sum of two entries in the row
above it.
b) Explain why the sum of entries in the n-th row is sure to be $2^n$.
(Remember to call the top row of Pascal's triangle, the one with a single 1,
the zero-th row.)
Exercise: Explain why each alternating sum in Pascal's triangle, beyond the
zero-th row, is zero:
1 - 1 = 0
1 - 2 + 1 = 0
1 - 3 + 3 - 1 = 0
1 - 4 + 6 - 4 + 1 = 0
1 - 5 + 10 - 10 + 5 - 1 = 0

The following property is strange.
Look at the powers of 11:
$11^0 = 1$
$11^1 = 11$
$11^2 = 121$
$11^3 = 1331$
$11^4 = 14641$
$11^5 = 161051$ = 1 | 5 | 10 | 10 | 5 | 1
Any guesses as to why these powers appear as rows of Pascal's triangle?

THE BINOMIAL THEOREM

Recall the act of expanding brackets: Select one term from each set of
parentheses and make sure to collect all possible combinations,
e.g. $(a+b+c)(x+y)(p+q+r)(s+t+u+v) = axps + ayqu + cxps + \cdots$
Imagine expanding the quantity:
$(x+y)^5 = (x+y)(x+y)(x+y)(x+y)(x+y)$
The term $x^5$ will appear once by choosing the term x from each set of
parentheses.
The term $x^4 y$ will appear five times:
once by choosing x, x, x, x, and then y.
once by choosing x, x, x, y, and then x.
once by choosing x, x, y, x, and then x.
once by choosing x, y, x, x, and then x.
once by choosing y, x, x, x, and then x.
That is, $x^4 y$ will appear the same number of times as it is possible to
arrange four x's and one y. This can be done $\frac{5!}{4!\,1!} = 5$ ways, which is also the
number of paths in the square grid from S to row 4, column 1. This is an
entry of Pascal's triangle.
The term $x^3 y^2$ will appear as many times as it is possible to arrange three x's
and two y's, that is, $\frac{5!}{3!\,2!} = 10$ times.
The term $x^2 y^3$ appears ten times, the term $x y^4$ five times, and the term $y^5$ once.
We have:
$(x+y)^5 = x^5 + 5x^4 y + 10x^3 y^2 + 10x^2 y^3 + 5x y^4 + y^5$.
The numbers 1, 5, 10, 10, 5, 1 are the entries of the fifth row of Pascal's
triangle.
We have:
$(x+y)^0 = 1$
$(x+y)^1 = x + y$
$(x+y)^2 = x^2 + 2xy + y^2$
$(x+y)^3 = x^3 + 3x^2 y + 3x y^2 + y^3$
$(x+y)^4 = x^4 + 4x^3 y + 6x^2 y^2 + 4x y^3 + y^4$
and so on.
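The "select one term from each set of parentheses" recipe can be carried out literally by a computer. Here is a small Python sketch, offered only as an illustration: the function name `expand_counts` is an assumption, and the brute-force enumeration is meant for small n only.

```python
from collections import Counter
from itertools import product

def expand_counts(n):
    """Choose x or y from each of n brackets and tally each term x^a y^b."""
    counts = Counter()
    for choice in product("xy", repeat=n):
        # a choice such as ('x', 'x', 'y', 'x', 'x') contributes one x^4 y
        counts[(choice.count("x"), choice.count("y"))] += 1
    return counts

counts = expand_counts(5)
coefficients = [counts[(5 - b, b)] for b in range(6)]
print(coefficients)  # [1, 5, 10, 10, 5, 1]
```

There are $2^5 = 32$ choices in all, and grouping them by how many x's and y's they contain recovers the fifth row of the triangle.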
SOME FUN
1. Put x = 10 and y = 1. Notice that, for instance:
$11^4 = (10+1)^4 = 10^4 + 4 \cdot 10^3 + 6 \cdot 10^2 + 4 \cdot 10 + 1 = 10000 + 4000 + 600 + 40 + 1 = 14641$.
This explains the connection of the powers of 11.
2. Put x = 1 and y = 1:
$2^4 = (1+1)^4 = 1^4 + 4 \cdot 1^3 + 6 \cdot 1^2 + 4 \cdot 1 + 1 = 1 + 4 + 6 + 4 + 1$
This explains, again, why the sum of entries in a row of
Pascal's triangle is a power of two.
3. Put x = 1 and y = -1:
$0 = (1-1)^4 = 1^4 + 4 \cdot 1^3 (-1) + 6 \cdot 1^2 (-1)^2 + 4 \cdot 1 \cdot (-1)^3 + 1 \cdot (-1)^4 = 1 - 4 + 6 - 4 + 1$
This explains, again, why the alternating sum of entries in a row of
Pascal's triangle is always zero.
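These three substitutions can be checked for several rows at once. The sketch below is an illustration only; `binomial_sum` is a made-up helper, and it relies on Python's `math.comb` (available from Python 3.8) for the entries of the triangle.

```python
from math import comb

def binomial_sum(x, y, n):
    """Add up the terms of (x + y)^n using the n-th row of the triangle."""
    return sum(comb(n, k) * x ** (n - k) * y ** k for k in range(n + 1))

for n in range(6):
    assert binomial_sum(10, 1, n) == 11 ** n   # powers of 11
    assert binomial_sum(1, 1, n) == 2 ** n     # row sums are powers of two
for n in range(1, 6):
    assert binomial_sum(1, -1, n) == 0         # alternating sums vanish

print(binomial_sum(10, 1, 5))  # 161051
```

For n = 5 the entry 10 is too big to sit in a single decimal place, which is why $11^5$ shows the row 1, 5, 10, 10, 5, 1 only after carrying.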
Stated formally, we have
Binomial Theorem:
$(x+y)^n = x^n + \frac{n!}{(n-1)!\,1!} x^{n-1} y + \frac{n!}{(n-2)!\,2!} x^{n-2} y^2 + \cdots + \frac{n!}{a!\,b!} x^a y^b + \cdots + y^n$
The coefficients are the entries of the n-th row of Pascal's triangle.
COMMENT: We have used the notation $\binom{n}{a \;\; b}$ for the expression $\frac{n!}{a!\,b!}$. Thus
the binomial theorem can be written:
$(x+y)^n = \binom{n}{n \;\; 0} x^n + \binom{n}{n-1 \;\; 1} x^{n-1} y + \binom{n}{n-2 \;\; 2} x^{n-2} y^2 + \cdots + \binom{n}{a \;\; b} x^a y^b + \cdots + \binom{n}{0 \;\; n} y^n$
Often mathematicians suppress one of the terms in the notation and write
just $\binom{n}{a}$ for $\binom{n}{a \;\; b}$. (We must have b = n - a.)
For example, $\binom{7}{5} = \binom{7}{5 \;\; 2} = \frac{7!}{5!\,2!}$. Thus the binomial theorem might be
written:
$(x+y)^n = \binom{n}{n} x^n + \binom{n}{n-1} x^{n-1} y + \binom{n}{n-2} x^{n-2} y^2 + \cdots + \binom{n}{a} x^a y^b + \cdots + \binom{n}{0} y^n$
COMMENT: The entries of Pascal's triangle, $\binom{n}{a}$, are also called binomial
coefficients.
PART II PROBLEMS
Question 42: How many different paths are there from A to G?
Question 43: The word BOOKKEEPING is the only word in the English
language with three consecutive double letters. In how many ways can one
arrange the letters of this word?
Question 44: In how many ways can you arrange the letters of your full
name?
Question 45: Evaluate the following expressions:
a) $\frac{800!}{799!}$   b) $\frac{15!}{13!\,2!\,0!}$   c) $\frac{87!}{89!}$
Simplify the following expressions as far as possible:
d) $\frac{N!}{N!}$   e) $\frac{N!}{(N-1)!}$   f) $\frac{n!}{(n-2)!}$
g) $\frac{1}{k+1} \cdot \frac{(k+2)!}{k!}$   h) $\frac{n!\,(n-2)!}{((n-1)!)^2}$
Question 46: a) Suppose a and b are positive integers with a + b = n. Show
that:
$\binom{n}{a \;\; b} = \binom{n-1}{a-1 \;\; b} + \binom{n-1}{a \;\; b-1}$
b) Suppose a, b and c are three positive integers with a + b + c = n. Show
that:
$\binom{n}{a \;\; b \;\; c} = \binom{n-1}{a-1 \;\; b \;\; c} + \binom{n-1}{a \;\; b-1 \;\; c} + \binom{n-1}{a \;\; b \;\; c-1}$
Question 47: a) A mathematics department has 10 members. Four people are

to be selected for a committee. In how many different ways can this be
done?
b) An English department has 10 members. Four people are needed for a
committee and in that committee one person needs to be the chair. In
how many different ways can the department form a committee of
four with one chair?
c) A Medieval Tibetan Poetry department has 10 members. Four people
are needed for a committee which has two co-chairs. In how many
different ways can one form a committee of four with two co-chairs?
d) A Dramatic Arts of Left-Handed Mimists Department has 10
members. Two committees are to be formed: one with four members
with two co-chairs, and one with three members and a single chair. In
how many different ways can this be done? Assume no person is on
both committees.
Question 48: Twelve horses run in a race.
a) A ribbon will be presented to each of first, second, third, and fourth
place. In how many possible ways can this be done?
b) Make a comment as to why there is absolutely no need to mention or
even think of permutations when answering questions like these.
Question 49:
a) In how many different ways can one arrange five A's and five B's?
b) A coin is tossed 10 times. In how many different ways could exactly five
heads appear?
Question 50: In how many ways can 10 people sit on a bench if only four
seats are available?
Question 51: Five pink marbles, two red marbles, and three rose marbles
are to be arranged in a row. If marbles of the same colour are identical, in
how many different ways can these marbles be arranged?
Question 52:
a) Twelve white dots lie in a row. Two are to be coloured red. In how
many ways can this be done?
b) Consider the equation 10 = x + y + z . How many solutions does it have if
each variable is to be a positive integer or zero?
Question 53:
a) In how many ways can the letters ABCDEFGH be arranged?
b) In how many ways can the letters ABCDEFGH be arranged with letter G
appearing somewhere to the left of letter D?
c) In how many ways can the letters ABCDEFGH be arranged with the
letters F and H not adjacent?
Question 54:
a) Hats are to be distributed to 20 people at a party. Five hats are red,
five hats are blue, and 10 hats are purple. In how many different ways
can this be done? (Assume the people are mingling and moving about.)
b) If the 20 people are clones and cannot be distinguished, in how many
essentially different ways can these hats be distributed?
Question 55: Let's establish the formulas from the textbooks.
a) Suppose r objects are to be selected from a collection of n objects
with the order in which they are selected considered important. Use
the labeling principle to show that this can be done in
${}_nP_r = \frac{n!}{(n-r)!}$ different ways.
b) Suppose r objects are to be selected from a collection of n objects
without regard to order. Use the labeling principle to show that this
can be done in ${}_nC_r = \frac{n!}{r!\,(n-r)!}$ different ways.
NOW FORGET THESE FORMULAS. You don't ever need them!
Question 56: a) In poker "three of a kind" is a set of three cards of the
same value with neither of the two remaining cards that value (or of value
equal to each other). What is the probability of being dealt three-of-a-kind?
b) What is the probability of being dealt "four of a kind" in poker?
Question 57: Consider the question
"In how many different ways can 8 people sit around a round table?"
This is a vague question. What does "different" mean?
a) Answer the question if the chairs of the table are marked North,
Northeast, East, Southeast, South, ..., Northwest.
b) Answer the question if the chairs are not marked so that two
different rotations of the same arrangement of people would be
considered the same.
c) Answer the question under the assumption that rotations are
considered the same and reflections about a diameter of the table are
considered the same.
Suppose two particular people must not sit next to one another. Answer each
of the questions a), b) and c) with this added restriction.
(HINT: First count the number of arrangements with that couple seated
together.)
Question 58: EXTREMELY HARD

An ice-cream stand offers the mega-bowl special: twelve-scoops in a bowl
from a choice of twelve possible flavors. How many different mega-bowl
combinations does it offer?
COMMENT: The problem here is that scoops, like the clones of question
54b), are indistinguishable AND you are not told how many scoops there are
to be of a particular label (flavor). Problems like these are hard and fall
under the category of what is called "multi-choosing."
Question 59: A committee of five must be formed from five men and seven
women.
a) How many committees can be formed if gender is irrelevant?
b) How many committees can be formed if there must be exactly two
women on the committee?
c) How many committees can be formed if one particular man must be on
the committee and one particular woman must not be on the
committee?
d) How many committees can be formed if one particular couple (one man
and one woman) can't be together on the committee?
Question 60:
a) From 10 people k are needed for a committee. Write down a formula
for the number of ways this can be done.
b) Suppose we want our formula to hold NO MATTER WHAT. Set k = 11
in your formula. What value should (-1)! have so that your formula
is correct for the number of ways to select 11 people from 10 for a
committee?
Question 61:
a) Prove that the product of any 3 consecutive integers is sure to be
divisible by 3! = 6.
b) Prove that the product of any 7 consecutive integers is sure to be
divisible by 7! = 5040.
c) Prove that the product of any k consecutive integers is sure to be
divisible by k!
HINT: Consider the problem: k people from N are to be selected for a
committee. In how many ways can this be done? We know that the answer to
this must be a whole number. Thus, $\frac{N!}{k!\,(N-k)!}$ is always a whole number.
Question 62: A factorian is a number that equals the sum of its digits
factorialised. For example, 145 is a factorian since 1! + 4! + 5! = 145. The
number 1 is a factorian, as is the number 2. (We have 1! = 1 and 2! = 2.)
There is only one other factorian. What is it?
CHALLENGE: Prove that there are only four factorians.

PART III of IV
James Tanton
2007 James Tanton
CONTENTS:
Displaying and Summarising Data
Measures of Central Tendency
Measures of Dispersion
Scatter Plots
Lines of Best Fit
Correlation Coefficient
Null Hypothesis
Distributions
Central Limit Theorem
Normal Distribution
68-95-99.7 Rule
z-scores
Roulette
Confidence Intervals
P-values
Gallup Polls
Sampling
Chi-Squared test
Quality Control
Run Tests
Rank Correlation
Exercises
DISPLAYING AND SUMMARIZING DATA
The practice and the study of the tools and techniques for collecting, displaying
and summarizing numerical information is called descriptive statistics.
For example, a medical study might record the blood types of 100 university
students and present the information obtained in a list, a table, or a diagram of
some kind. Inferences and conclusions might then be drawn from the information
presented. For this example the data, in and of itself, is not numerical but rather
categorical (the categories type A, type B, type AB and type O are examined) but
the count of entries that fall into each particular category is numerical. In other
examples, numerical information might adopt a potentially continuous array of values
(such as age, height, or weight) and might be divided into categories for ease
(height 30-36 inches, 37-42 inches, 43-48 inches, etc., for instance).
Once categories have been established, there are a number of standard methods
for presenting and summarizing data.
Presenting Data
A frequency distribution is a table or chart that shows the count (number) of
individuals in each possible category considered.
For example, of the 100 students tested above, suppose 20 have blood type A, 27
blood type B, 16 blood type AB and 37 blood type O. This information can be
summarized in a frequency table, or a bar chart or a pie graph.
Notice for the bar chart that the individual bars are separated to emphasize the
distinct nature of the different categories.
If the data presented comes from a continuous array of values, then the bars are
drawn without separation. In this situation, the bar chart is called a histogram.
A frequency polygon is a histogram with a polygonal line drawn in connecting the
midpoints at the height of each bar (with end points touching the axis).
Tables of whole number values are sometimes displayed via stem-and-leaf plots.
Each number is divided into two parts: the units digit (the leaf), and the set of
digits to its left (the stem). (Or perhaps the hundreds are separated from the
tens and units together, or some other variation.)
COMMENT: Each stem-and-leaf plot should be accompanied with a legend
indicating how the numbers are split. For example, to the table above we might
attach the comment "4|0 = forty."
Describing Data
Statisticians use three broad features to swiftly describe data.
1. The General Shape of a Histogram:

Distributions rarely conform to exact shapes, but statisticians still find it useful
to describe general features of the shape of a histogram.
2. Measures of Central Tendency:

Statisticians usually seek some means to identify a single measurement that, in
some sense, represents the middle value or most typical value of an entire data
set. Four different measures of central tendency are commonly used (which we
shall describe).
3. Measures of Dispersion:
Statisticians also seek some means to measure the spread of a data set. If data
is tightly clustered about a single central value, then that central value is a
meaningful representative of the entire data set. If, on the other hand, the data
is widely scattered across a spectrum of values, attributing meaning to a value
of central tendency must be done with care.
In detail
MEASURES OF CENTRAL TENDENCY

A measure of central tendency is a single measurement that, in some sense, is
typical of the entire data set. It represents the approximate centre of the
frequency distribution.
Four different measures of central tendency are in common use today:
Mean or Average [Usually denoted by the Greek letter $\mu$]
This is simply the arithmetic mean (average) of the data values at hand. It is found
by summing together all the data values and dividing by the total number of
measurements.
Example: Consider the data set 5, 6, 9, 9.
Then the mean is: $\mu = \frac{5+6+9+9}{4} = 7.25$.
Example: If in a study the value 7 occurs 32 times and the value 9 occurs 25
times, then the mean is: $\mu = \frac{7 \times 32 + 9 \times 25}{57} \approx 7.88$.
CHALLENGE EXERCISE: Prove that the sum of the differences of each data value
from the mean is always zero. (For example, for the data set 5, 6, 9, 9 we have
$(5 - 7.25) + (6 - 7.25) + (9 - 7.25) + (9 - 7.25) = 0$.)
The mean is the most commonly used measure of central tendency.
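The two mean computations above are easy to reproduce. This is a minimal Python sketch for illustration; the helper names `mean` and `weighted_mean`, and the choice of a dictionary mapping each value to its count, are assumptions rather than anything fixed by these notes.

```python
def mean(values):
    """Sum the data values and divide by how many there are."""
    return sum(values) / len(values)

def weighted_mean(value_counts):
    """value_counts maps each data value to the number of times it occurs."""
    total = sum(value * count for value, count in value_counts.items())
    return total / sum(value_counts.values())

data = [5, 6, 9, 9]
mu = mean(data)
print(mu)                                       # 7.25
print(round(weighted_mean({7: 32, 9: 25}), 2))  # 7.88
print(sum(x - mu for x in data))                # 0.0
```

The last line illustrates, for one data set only, the claim of the challenge exercise; it is not a proof.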
Exercise: Some texts might give the following formula for the mean:
$\mu = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n}$
Can you interpret what the symbols in this formula mean and why the formula is
correct?
Mode
The mode is the value in the data set that occurs most often.
Example: For the ten data values 3, 6, 5, 3, 1, 6, 5, 3, 8, 3 the mode is 3.
Example: For the data set 4, 5, 8, 8 the mode is 8.
Example: The data set 5, 5, 6, 6, 9, 9, 3, 3, 10, 10 has no mode.
Example: The data set 1, 1, 1, 1, 5, 5, 7, 7, 7, 7, 8, 8, 9, 9, 9 is bimodal.
For non-numerical data (such as colours, or letters of the alphabet) the mode is
the only measure of central tendency available.
Median
Arrange the data set in increasing order. Then the median is the middle value of
the sequence of data values.
Example: The median of the data set 3, 3, 5, 6, 7, 16, 16, 19, 37 is 7.
If the data set contains an even number of entries, then the average of the middle
two values is taken as the median.
Example: The median of 3, 4, 4, 5, 8, 8, 10, 12 is $\frac{5+8}{2} = 6.5$.
The median is useful for finding the value at the center of the distribution. It
divides the data set into two equally sized groups.
Midrange
The midrange of a set of data is the average of the smallest and largest values.
Example: The midrange of the data set 5, 6, 9, 9 is $\frac{5+9}{2} = 7$.
The midrange provides a quick estimate of a central value. It is easy to compute,
but is highly affected by extremely low or high values in the data set.
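The examples for mode, median and midrange can be reproduced with Python's standard `statistics` module. This is a sketch for illustration only; `midrange` is a helper written here, since the module does not provide one. Note one difference in convention: `multimode` returns every value tied for most frequent, so for a data set in which all values are equally frequent (the "no mode" case above) it simply returns all of them.

```python
from statistics import median, multimode

def midrange(values):
    """Average of the smallest and largest data values."""
    return (min(values) + max(values)) / 2

print(multimode([3, 6, 5, 3, 1, 6, 5, 3, 8, 3]))                 # [3]
print(multimode([1, 1, 1, 1, 5, 5, 7, 7, 7, 7, 8, 8, 9, 9, 9]))  # [1, 7]
print(median([3, 3, 5, 6, 7, 16, 16, 19, 37]))                   # 7
print(median([3, 4, 4, 5, 8, 8, 10, 12]))                        # 6.5
print(midrange([5, 6, 9, 9]))                                    # 7.0
```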
COMMENT: MTEL (the Massachusetts licensure exam) likes to toy with the
interplay between these different measures of central tendency.
EXERCISE:
a) Find FIVE data values with:
Median = 10
Mode = 10
Mean = 1000
b) Now find five data values with median = 10, mode = 1000 and mean = 10.
c) Can you find five data values with median = 1000, mode = 10, mean = 10?
EXERCISE: Repeat the previous exercise but this time for SIX data values.
EXERCISE: Scientists observe the speed of different turtles. Their observation

of a number of turtles yields a data set with:
Mean = 3.0 ft/min

Mode = 2.9 ft/min
Midrange = 3.2 ft/min
Median = 3.1 ft/min
On her walk back home from the lab, one scientist finds a turtle with ground speed
1000 ft/min.
How would the addition of this extra data value to the data set likely affect the
mean, median, mode, and midrange of the data?
ASIDE: SIMPSON'S PARADOX
Two students, Albert and Bilbert, each took a sample of math questions over a
series of two days. There were 100 questions in total and Albert scored 65% and
Bilbert 64% overall. So Albert proved himself a better test taker.
But here are the scores day-by-day:
FIRST DAY:
Albert = 71%
Bilbert = 80%
SECOND DAY:
Albert = 50%
Bilbert = 57%
So each day Bilbert did a better job than Albert, but did not beat Albert overall!
How is this possible?
The following table shows raw data of their test results.
This paradox arises because Albert and Bilbert did not complete the same number
of questions each day and the averages we computed are not equally weighted. This
curious phenomenon is known as Simpson's paradox, after the statistician Edward
Simpson, who described it in 1951. A well-known instance arose in the 1970s in the
examination of graduate school admission rates for men and women at UC Berkeley.
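To see how unequal weighting produces the paradox, here is a small Python sketch. The counts below are an assumption chosen only because they reproduce the quoted percentages (100 questions each, split 70/30 for Albert and 30/70 for Bilbert); they are not claimed to be the raw data of the original table.

```python
# (correct, attempted) for day 1 and day 2: hypothetical illustrative counts
albert = [(50, 70), (15, 30)]
bilbert = [(24, 30), (40, 70)]

def percent(correct, attempted):
    """Score as a whole-number percentage."""
    return round(100 * correct / attempted)

def overall(days):
    """Pool both days before taking the percentage."""
    return percent(sum(c for c, _ in days), sum(a for _, a in days))

print([percent(c, a) for c, a in albert])   # [71, 50]
print([percent(c, a) for c, a in bilbert])  # [80, 57]
print(overall(albert), overall(bilbert))    # 65 64
```

With these counts Albert did most of his questions on the day both students scored well, and Bilbert did most of his on the day both scored poorly, so the pooled scores reverse the day-by-day comparison.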
MEASURES OF DISPERSION
How well clustered about the central value is a set of data? Is the data spread
out or tight about this value?
Example: Consider the following two sets of data, each with $\mu = 5.22$.
DATA SET I : 5.0 5.1 5.2 5.4 5.4 ($\mu = 5.22$)
DATA SET II: 0.1 2.2 2.3 8.9 12.6 ($\mu = 5.22$)
These sets are very different! We need to quantify methods for measuring scatter
about a mean. There are several approaches.
Range
The range of a data set is simply the difference between the lowest and highest
values in the set.
Example: The range of the data set 5, 6, 9, 9 is: 9 - 5 = 4.
The range is a very simplistic measure of dispersion, and does not reveal any
information about how the data values are distributed. It is highly affected by
extremely low or high values in the data set. However, the range is often a useful
measurement in practical daily issues. For example, weather forecasts usually give
the range of temperatures to expect for the day.
Deviation from the Mean
Example: Consider the data set 5, 6, 9, 9 which has mean $\mu = 7.25$.
We can measure the deviation of each data point from the mean:
$|5 - 7.25| = 2.25$
$|6 - 7.25| = 1.25$
$|9 - 7.25| = 1.75$
$|9 - 7.25| = 1.75$
The average of these deviations gives a good measure of overall scatter. Here, the
average deviation is:
$\frac{|5-7.25| + |6-7.25| + |9-7.25| + |9-7.25|}{4} = \frac{2.25 + 1.25 + 1.75 + 1.75}{4} = 1.75$.
Thus we can say:
The data set 5, 6, 9, 9 has mean $\mu = 7.25$ with average deviation from the
mean of 1.75.
A SUBTLE POINT ...
A subtle point should be noted. Given n data values $x_1, x_2, \ldots, x_n$, one first computes
the mean $\mu$, and then the n deviations: $|x_1 - \mu|, |x_2 - \mu|, \ldots, |x_n - \mu|$.
Once the first n - 1 of these quantities are computed (and these could turn out to
be of any value), the value of the n-th quantity is forced: the data set
must conform to the mean $\mu$.
Exercise: As an example, suppose I tell you that a set of three data values has
mean $\mu = 3$ and two of the data values are 1 and 5. Do you know the third data
value?
Thus there are only n - 1 independent computations to be made. For this reason
mathematicians choose to divide the sum of deviations by n - 1 rather than n.
Thus a measure of scatter for the data set 5, 6, 9, 9, for example, is computed:
$\frac{|5-7.25| + |6-7.25| + |9-7.25| + |9-7.25|}{3} \approx 2.33$.
NOTE: When dealing with thousands of data points, dividing by n - 1 as opposed to
dividing by n will have very little effect. The results will be practically the same.
WARNING: Textbooks are confused about this. Some texts will choose to divide
by n while others will choose to divide by n - 1.
Watch out for this when you read different books.
Mathematicians prefer to divide by n - 1.
ANOTHER MEASURE OF DISPERSION
Working with absolute values in mathematical equations is difficult.
[CHALLENGE: Solve | x 2 | 3x 5 x = 7 .]
Another way to work with positive quantities is to square values, rather than take
absolute values. One can later apply a square root if desired.
The variance of a data set is the sum of all deviations squared, divided by one less
than the number of data values.
Example: The variance of the four data values 5, 6, 9, 9 with mean 7.25 is:
$\frac{(5-7.25)^2 + (6-7.25)^2 + (9-7.25)^2 + (9-7.25)^2}{3} = 4.25$.
Variance is usually denoted by the symbol $\sigma^2$ (read "sigma squared").
For n data values $x_1, x_2, \ldots, x_n$ variance is given by the formula:
$\sigma^2 = \frac{(x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_n-\mu)^2}{n-1}$.
To nullify the effect of squaring, mathematicians will next take a square root.
The standard deviation of a set of n data values $x_1, x_2, \ldots, x_n$ is:
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{(x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_n-\mu)^2}{n-1}}$.
Example: The standard deviation of the four data values 5, 6, 9, 9 is
$\sigma = \sqrt{4.25} \approx 2.06$.
NOTE: The mean is still computed by dividing by N.
$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N}$
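The measures of dispersion computed so far for the data set 5, 6, 9, 9 can be reproduced in a few lines. This Python sketch is for illustration; the variable names are arbitrary.

```python
from math import sqrt

data = [5, 6, 9, 9]
n = len(data)
mu = sum(data) / n                       # the mean still divides by n

abs_dev = [abs(x - mu) for x in data]
avg_dev_n = sum(abs_dev) / n             # average deviation, dividing by n
avg_dev_n1 = sum(abs_dev) / (n - 1)      # same sum, dividing by n - 1

var = sum((x - mu) ** 2 for x in data) / (n - 1)   # variance, n - 1 convention
sd = sqrt(var)                                     # standard deviation

print(avg_dev_n)             # 1.75
print(round(avg_dev_n1, 2))  # 2.33
print(var)                   # 4.25
print(round(sd, 2))          # 2.06
```

If you prefer a library, Python's `statistics.variance` and `statistics.stdev` also divide by n - 1, while `statistics.pvariance` and `statistics.pstdev` divide by n, which mirrors the textbook disagreement mentioned in the WARNING.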
COMMENT: If the data is given in terms of inches, say, then $\sigma^2$ is in terms of
inches squared, but $\sigma$ is back to being in terms of inches. Standard deviation
always has the same units as the original data set.
Standard deviation is the most commonly used measure of dispersion.
EXERCISE: Compute the standard deviation of the rolls of a die.
Answer:

A COMMENT ON OUTLIERS
The median of a set of data is a value which divides the data into two equal
halves. (If the number of data values is even, then exactly 50% of the data
values lie below the median, and 50% above. If the number of data values is odd,
then close to 50% of the data values lie below and above.)
The median of the lower half of data values is called the first quartile, denoted
Q1 , and the median of the upper half the third quartile, Q3 . (And the median itself
can be called the second quartile, Q2 .) These quartiles divide the data into
approximately 25% blocks.
COMMENT: There is some confusion here in the literature. Some texts insist that
the quartile values correspond to actual data values and some don't. Some handle
the cases of an odd number of data values differently than others. For large data
sets these differences are negligible and hence the lack of uniformity. It does
cause concern, however, for standardized test makers who will have students work
with small data sets for which the differences can be striking. One must examine
any particular author's protocols with care with regard to this matter. (See also
questions 63 and 66.)
The interquartile range IQR of a set of data is the value $Q_3 - Q_1$.
EXAMPLE: The data 2 4 4 7 10 12 25 has

Q1 = 4
Q2 = 7
Q3 = 12
IQR = 12 - 4 = 8
An outlier is a data value that seems too large or too small to be coherent with the
data set. (Maybe an error occurred during the experiment, or a value was recorded
incorrectly, for example.) This is a subjective call and often statisticians will bring
a little more solidity to this notion by declaring:
A data value is suspected to be an outlier if its value is 1.5 × IQR or more above the
third quartile or 1.5 × IQR or more below the first quartile.
In our example 2 4 4 7 10 12 25 we have:
$Q_3 + 1.5 \times \mathrm{IQR} = 12 + 1.5 \times 8 = 24$
$Q_1 - 1.5 \times \mathrm{IQR} = 4 - 1.5 \times 8 = -8$
As the value 25 is larger than 24 it might be considered an outlier.
As 2 is larger than -8, its value would likely be accepted.
Folks like to identify outliers as they may adversely affect the analysis of data:
the mean and standard deviation of a data set change with the inclusion of
outliers.
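The quartile and outlier computations for this example can be sketched in Python. Since conventions for quartiles differ between texts, the code below assumes the one used in the example above: split the sorted data into a lower and an upper half, leaving the median out of both halves when the count is odd. The function names are made up for illustration, and at least two data values are assumed.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    data = sorted(values)
    half = len(data) // 2                    # middle value is excluded if count is odd
    lower, upper = data[:half], data[-half:]
    return median(lower), median(data), median(upper)

def suspected_outliers(values):
    """Values 1.5 x IQR or more beyond the first or third quartile."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x <= low or x >= high]

data = [2, 4, 4, 7, 10, 12, 25]
print(quartiles(data))           # (4, 7, 12)
print(suspected_outliers(data))  # [25]
```

A library routine such as `statistics.quantiles` may use a different interpolation rule and so can give different quartiles on small data sets, which is exactly the caution raised in the COMMENT above.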
SCATTER PLOTS
A scientist records the pH level of a reactive solution every 10 minutes. She
records the data on a graph.
A graph such as this, displaying the measurements of two quantities (here pH level
and time), is called a scatter diagram.
A scatter diagram can show whether there seems to be some relationship between
the two quantities. In this example, it looks as though there is a fairly good linear
relationship of positive slope between pH and time.
The following scatter diagram between IQ levels and shoe size suggests no
relationship between these quantities:
If a scatter diagram suggests a linear relationship of positive slope, then we say
that the two quantities depicted are positively correlated. If the relationship
seems to be linear of negative slope then we say that they are negatively
correlated.
EXERCISE: Find a group of friends. Draw a scatter diagram for shoe size and
height. Any correlation?
[Warning: Men's and women's shoe sizes are computed differently. Perhaps use a
yard stick to measure the lengths of people's feet in inches?]
LINES OF BEST FIT
Suppose we have some data, for x and y values, that looks as though it is linearly
correlated.
We want to determine an equation for the line that fits the data well.
There are two approaches:
1. Just eyeball one.
2. Use mathematics to derive the equation of the line that fits the data well
in some sense.
Notice: In conducting an experiment, one usually has complete control of one
variable, the x variable. For example, in measuring pH levels, one has control of the
times that the measurements are taken, but not of the pH levels one reads.
Thus, deviations of data points from a line of best fit should be measured as
vertical segments (variations of the y-values) with no deviation horizontally. For
this reason, people look for lines that minimize vertical deviations only (or, to
avoid absolute values, the squares of the vertical deviations).
Here's one method for doing this, the least squares method. We'll explain it with an
example.
EXAMPLE: Here are three data points: (1,2) (2,5) (6,8)
Choose a line that minimizes the squares of the vertical deviations.
Answer: One thing that seems reasonable (and turns out to be a true property of
the general theory) is that a line of best fit should properly represent the data and
go through the "most average" data point.
Let:
$\bar{x}$ = average of the x-values = $\frac{1+2+6}{3} = 3$
$\bar{y}$ = average of the y-values = $\frac{2+5+8}{3} = 5$
So the line should go through the point (3, 5).
Now the question is: What should the slope of this line be?
If we call the slope m then the equation of the line will be:
$\frac{y-5}{x-3} = m$
That is: $y = m(x-3) + 5$
Let's work out the y-values of this line for the given x-values and compare them to
the actual y-values of the data points:
x      y on the line      actual y
1      5 - 2m             2
2      5 - m              5
6      5 + 3m             8
The sum of the differences squared is:
$(5-2m-2)^2 + (5-m-5)^2 + (5+3m-8)^2 = 14m^2 - 30m + 18$
This has smallest value when $m = -\frac{b}{2a}$, that is, when $m = \frac{30}{28} = \frac{15}{14}$.
So the line of best fit (by minimizing squared differences) is:
$y = \frac{15}{14}(x-3) + 5$
Definition: The process of finding a line of best fit is called regression.
The method of choosing a line of best fit by minimizing squares of differences is
called the least squares method.
For completeness, here are the general formulas for the least squares method:
LEAST SQUARES METHOD
Suppose we have N data points $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ in a scatter diagram.
Let:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
$\bar{y} = \frac{y_1 + y_2 + \cdots + y_N}{N}$
Let:
$S_{xx} = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_N-\bar{x})^2}{N-1}$ (called the variance of the x-values).
$S_{yy} = \frac{(y_1-\bar{y})^2 + (y_2-\bar{y})^2 + \cdots + (y_N-\bar{y})^2}{N-1}$ (called the variance of the y-values).
$S_{xy} = \frac{(x_1-\bar{x})(y_1-\bar{y}) + (x_2-\bar{x})(y_2-\bar{y}) + \cdots + (x_N-\bar{x})(y_N-\bar{y})}{N-1}$
(called the covariance of the x- and y-values).
Then the line of best fit goes through the point $(\bar{x}, \bar{y})$ and has slope $\frac{S_{xy}}{S_{xx}}$.
The equation is: $y - \bar{y} = \frac{S_{xy}}{S_{xx}}(x - \bar{x})$.
Example: For our three data points (1,2), (2,5), (6,8) we have:
$\bar{x} = \frac{1+2+6}{3} = 3$
$\bar{y} = \frac{2+5+8}{3} = 5$
$S_{xx} = \frac{(1-3)^2 + (2-3)^2 + (6-3)^2}{2} = 7$
$S_{yy} = \frac{(2-5)^2 + (5-5)^2 + (8-5)^2}{2} = 9$
$S_{xy} = \frac{(1-3)(2-5) + (2-3)(5-5) + (6-3)(8-5)}{2} = 7.5$
The line of best fit goes through (3,5) and has slope $\frac{7.5}{7} = \frac{15}{14}$, just as we have
seen.
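The least squares formulas translate directly into code. The following Python sketch is an illustration under two assumptions: the function name `least_squares` is made up, and the coordinates are taken to be integers so that exact fractions can be used and the slope 15/14 appears without rounding.

```python
from fractions import Fraction

def least_squares(points):
    """Return (x_bar, y_bar, slope) of the least squares line for integer data."""
    n = len(points)
    x_bar = Fraction(sum(x for x, _ in points), n)
    y_bar = Fraction(sum(y for _, y in points), n)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points) / (n - 1)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points) / (n - 1)
    return x_bar, y_bar, s_xy / s_xx   # line: y - y_bar = slope * (x - x_bar)

x_bar, y_bar, slope = least_squares([(1, 2), (2, 5), (6, 8)])
print(x_bar, y_bar, slope)  # 3 5 15/14
```

Notice that the factors of N - 1 cancel in the ratio $S_{xy} / S_{xx}$, so the slope would be the same had we divided by N instead.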

Question: Why did we compute $S_{yy}$?
Answer: It is involved in answering the question: How good is the fit really?
MEASURING THE DEGREE OF FIT: THE CORRELATION COEFFICIENT.
Here are some data values:
We chose a line y = mx + b that made the sum of deviations squared:
$D = (y_1 - (mx_1 + b))^2 + (y_2 - (mx_2 + b))^2 + \cdots + (y_N - (mx_N + b))^2$
the smallest.
This quantity reflects the amount of variation of the points about the regression
line.
Now:
$T = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_N - \bar{y})^2$
represents the amount of variation of the y-values in general in the sense of
measuring the amount of variation about the mean given by the horizontal line
$y = \bar{y}$.
Since the regression line is designed to be better than any other line, we
necessarily have:
$D \le T$.
This prompts one to think of the proportion:
$\frac{T - D}{T}$
This is a number guaranteed to be between 0 and 1.
T D
If equals 1, then this is saying that D= 0, which means that there is no
T
scatter about the regression line. That is, all data points lie exactly on a line.
T D
If equals 0, then this is saying that T = D. That is, the amount of scatter
T
about the regression line is no different than the amount of scatter in general.
That is, computing a regression line has no effect on scatter, and so there is no
relationship between the x- and y-values of any significance.
T D
Since is always a positive number we give it a name that is always a positive
T
quantity:
T D
Definition: R 2 =
T
A tedious (but not difficult) exercise in algebra shows that this quantity is given
by the formula:
(S )
2
xy
R2 =
S xx S yy
Usually people take the square root of this quantity:
(S )
2
xy
R=
S xx S yy
choosing the + sign to indicate data has a positive slope and the sign to
indicate negative slope .
The number R is called the correlation coefficient of the data.

Example: Let's compute the correlation coefficient of our data:
Since the data has positive slope:
$$R = +\sqrt{\frac{(S_{xy})^2}{S_{xx}\, S_{yy}}} = \sqrt{\frac{(7.5)^2}{7 \cdot 9}} \approx 0.945$$
This is very good. [Of course, with just three data points there is little information to go on. CHALLENGE: Explain why the correlation coefficient will have value $R = \pm 1$ - indicating perfect fit - if we work with a data set of just two data points.]
One wants a correlation coefficient pretty close to 1 or to -1.
A value around 0.85 or higher (or -0.85 and lower) is usually deemed good.
One wouldn't want to make predictions by interpolation or extrapolation with poorly fitting lines.
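The claim that the proportion (T - D)/T agrees with the formula in terms of S_xy, S_xx and S_yy can be checked numerically. The following Python sketch does so for the three data points (1, 2), (2, 5), (6, 8) of the running example; the variable names are chosen for this illustration only.

```python
import math

# The three data points of the running example.
xs = [1, 2, 6]
ys = [2, 5, 8]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
s_yy = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Least squares line y = m x + b through (x_bar, y_bar).
m = s_xy / s_xx
b = y_bar - m * x_bar

# D: scatter about the regression line.  T: scatter about the mean line.
D = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
T = sum((y - y_bar) ** 2 for y in ys)

r_squared_from_scatter = (T - D) / T
r_squared_from_formula = s_xy ** 2 / (s_xx * s_yy)

# R takes the sign of the slope, which is the sign of S_xy.
r = math.copysign(math.sqrt(r_squared_from_formula), s_xy)

print(r_squared_from_scatter, r_squared_from_formula, r)
```

Both versions of R squared agree (up to rounding in the computer's arithmetic), and R comes out at about 0.945.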
EXERCISE: Consider the following data.
a) Use the least squares method to find a line of best fit.

b) Find the correlation coefficient for this line.
c) Does it seem reasonable to use this line of best fit for general analysis?
d) Make a prediction as to the y-value of the data when x= 1.7. (Interpolation)
e) Make a prediction for the y-value when x = 13.2. (Extrapolation)
A WORD OF WARNING
It's always wise to LOOK at a data set before diving in and completing a linear regression. For example, although we can certainly find a line of best fit to the data shown, it would have little meaning. (We might wish to find a quadratic or an exponential curve to fit the data.)
If you suspect data fits a curve of the form $y = ac^x$, taking logarithms gives $\log y = x \log c + \log a$, a straight line relationship between $x$ and $\log y$. Perform a linear regression (via the methods of this section) on the table of data values shown.
Suppose we obtain a line of best fit $\log y = mx + b$ with:
$m = 1.3$
$b = -0.2$
This gives: $\log y = 1.3x - 0.2$ and so $y = 10^{-0.2}\left(10^{1.3}\right)^x = 0.63 \cdot 19.95^x$.
If you suspect data follows a curve of the form $y = ax^2$, take square roots and fit a line to the data $\sqrt{y}$ and $x$.
And so forth.
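Here is a small Python sketch of the logarithm trick. The data are made up for the illustration: they are generated to lie exactly on the curve with m = 1.3 and b = -0.2, so that we can see the method recover those values. The helper `fit_line` is a hypothetical name for the least squares method of this section.

```python
import math


def fit_line(xs, ys):
    """Least squares line through the points: returns (slope, intercept).

    The factors of N - 1 in S_xy and S_xx cancel in the ratio, so they
    are left out here.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


# Made-up data lying exactly on an exponential curve (an assumption).
xs = [0, 1, 2, 3]
ys = [10 ** (1.3 * x - 0.2) for x in xs]

# Fit a straight line to (x, log y) rather than to (x, y).
log_ys = [math.log10(y) for y in ys]
m, b = fit_line(xs, log_ys)

# Undo the logarithm: y = a * c**x with a = 10**b and c = 10**m.
a = 10 ** b
c = 10 ** m
print("m =", m, " b =", b, " a =", a, " c =", c)
```

With real (noisy) data the recovered m and b would of course only be approximate.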
INFERENTIAL STATISTICS
THE NULL HYPOTHESIS
Here's something fun ...
PUZZLE: A disease is spreading across the country at an alarming rate. Fifty percent of the people who get it get better on their own. The remaining fifty percent die.
Two serums, A and B, have been developed hurriedly and little time has been given to test them. The only information available right now is:
3 patients with the disease who were given serum A all survived.
7 out of 8 patients who were given serum B survived.
You have just learned that you have the disease. Which serum should you take?
Some comments ...
Three-out-of-three is a 100% success rate, but only three test patients isn't much to go on.
Seven-out-of-eight is not perfect, but it is a larger sample.
Which seems more promising?
KEY IDEA:
Assume that serum A has no effect and ask: How likely is it that 3/3 people would naturally survive on their own?
Assume that serum B has no effect. How likely is it that 7/8 people would survive on their own?
Answers:
If serum A has no effect, then the chances that three people would all naturally survive is:
$$\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8} = 12.5\%.$$
If serum B has no effect, then the chances that exactly 7 of 8 people survive is (there are 8 choices for the one patient who does not survive):
$$8 \times \left(\frac{1}{2}\right)^8 = \frac{1}{32} = 3.125\%$$
It is quite unlikely that we'd see 7/8 people surviving if serum B had no effect. (Far more unlikely than seeing 3/3 people survive.) We conclude then: there is a good chance that serum B is having an effect.
I would take serum B.

The act of assuming that there is no effect at play and working to see where that assumption leads is called testing the null hypothesis.
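The two probabilities in the serum puzzle can be reproduced with a few lines of Python. This is only a sketch: the function name `prob_exactly` is invented here, and the 50% natural survival rate is the one stated in the puzzle.

```python
from math import comb


def prob_exactly(k, n, p=0.5):
    """Chance that exactly k of n patients survive, each with chance p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Null hypothesis: the serum does nothing, so each patient survives
# with the natural probability of 1/2.
serum_a = prob_exactly(3, 3)   # all three of three survive
serum_b = prob_exactly(7, 8)   # exactly seven of eight survive

# One might also ask for "seven or more" survivors out of eight.
serum_b_or_better = prob_exactly(7, 8) + prob_exactly(8, 8)

print(f"Serum A: {serum_a:.3%}")
print(f"Serum B: {serum_b:.3%}")
print(f"Serum B, seven or more: {serum_b_or_better:.3%}")
```

Even if one counts "seven or more" survivors, the chance under the null hypothesis stays well below the 12.5% found for serum A.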
EXAMPLE: You have a suspect coin in hand.
a) You toss the coin 10 times and get ten HEADS in a row. Would you likely conclude that the coin is biased?
b) Suppose instead when you tossed the coin you got nine HEADS out of ten. Are you still likely to conclude that the coin is biased?
c) What if you got 8 HEADS out of ten? Just seven?
Answer: Let's test the null hypothesis and assume for the moment that the coin is fair (that is, that nothing suspect is going on).
a) What are the chances of naturally getting 10/10 heads with a fair coin?
$$\left(\frac{1}{2}\right)^{10} = \frac{1}{1024} \approx 0.1\%$$
With 99.9% confidence I would say that the coin is biased.
(Note: There is a 0.1% chance that I am wrong.)
b) What are the chances of receiving 9/10 heads with a fair coin?
$$\frac{10!}{1!\,9!}\left(\frac{1}{2}\right)^{10} \approx 1.0\%$$
In this case I would say, with 99.0% confidence, that the coin is biased.
c) What are the chances of receiving 8/10 heads naturally?
$$\frac{10!}{2!\,8!}\left(\frac{1}{2}\right)^{10} \approx 4.4\%$$
With 95.6% confidence I would say that the coin is biased.
COMMENT: There is an issue of wording here. The chances of seeing eight or more heads, as one might phrase a test for bias, is greater than 4.4%.
In fact, let's make a table:
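A table of this kind is easy to generate by computer. The following Python sketch lists the chance of each possible number of heads in ten tosses of a fair coin, and also the chance of "eight or more" heads; the variable names are chosen for this illustration.

```python
from math import comb

N_TOSSES = 10

# (number of heads, probability) for a fair coin tossed N_TOSSES times.
table = [(k, comb(N_TOSSES, k) / 2 ** N_TOSSES) for k in range(N_TOSSES + 1)]

for heads, prob in table:
    print(f"{heads:>2} heads: {prob:6.1%}")

# "Eight or more heads" adds up the last three rows of the table.
at_least_8 = sum(prob for heads, prob in table if heads >= 8)
print(f"8 or more heads: {at_least_8:.1%}")
```

The probabilities in the table add up to 1, as they must.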

With 7/10 heads I am less confident to conclude that the coin is biased.
Even less so for 6/10 heads.

NOTICE: We've presented here a table displaying the likelihood of each and every possible outcome. This is an example of a distribution.
Here we have also talked about confidence in making some kind of inference about the meaning of a result.
We're now in the thick of inferential statistics.
[Comment: The distribution in the above table is called the binomial distribution.]
DISTRIBUTIONS
Loose Definition: A distribution is a table or a diagram that illustrates the frequency of measurements or counts from an experiment or study.
Histograms lead to distributions.
For example, consider the following histogram displaying the heights of 1000 people:
If the heights of the bars are percentages, not actual counts, then this diagram has total area 1.
We can make the information displayed on the diagram more precise by choosing smaller category intervals,
and smaller and smaller ...
In the limit we get a smooth curve of area 1, the height distribution curve.
Of course, one cannot do this in practice, but we do like to think that human heights follow some kind of smooth distribution curve of area 1.
Then we like to say the probability that someone chosen at random has height $72'' \le x \le 78''$ is given by
$P(72 \le x \le 78)$ = the fraction of area above the interval $72 \le x \le 78$.
Since the area under the whole curve is 1, this fraction of area matches the actual area above the interval $72 \le x \le 78$.
Better Definition: A distribution for a quantity $X$ (such as height or foot length) is a curve with area 1 such that the probability that a randomly chosen value for $X$ lies between $a$ and $b$ is:
$P(a \le X \le b)$ = the area under the curve from $a$ to $b$.
This is abstract!
Usually in practice, one doesn't actually know any formula for the distribution curve a quantity $X$ seems to follow.
ARTIFICIAL EXERCISE: Suppose people's ages among the world's population are distributed as follows:
a) Verify that the area under this distribution is indeed 1.

b) A person is chosen at random. According to this model, what are the chances that this person is between 20 and 30 years old?
Comment: People like to use the following adjectives for distributions:

BIG QUESTION
How does one find or estimate the distribution for a quantity?
Answer: Take some samples and make a guess based on what you observe.
But there is something deeper going on ...
ACTIVITY: What is the distribution of heights of the people in this class?

What is the mean and the standard deviation?
Here's a curious idea:
What if we took another class somewhere in the nation with the same number of participants and worked out their mean height? And say we did this for a third class as well. Actually for 1000 other classes!
We'd expect all the means we calculate to be close to each other.
The means have their own distribution.
Here's the curious thing ...

In the 1700s, when scientists conducted experiments multiple times and computed the average result or aggregate result over many runs of the same experiment, they noticed that the means always seemed to follow a bell-shaped curve no matter what type of experiment was being conducted:
They thought this odd.
Human height comes in a bell-shaped curve. (The height of each human is the aggregate effect of growth rates of a collection of cells. Thus each human is the mean result of a collection of experiments.)
The lengths of carrots come in a bell-shaped curve. (Each carrot is the aggregate result of cell growth.)
Scholars began work on identifying this curve and finding a formula for it.
[These scholars include Gauss (~ 1820), Laplace (1818) and Lyapunov (1901).]
This special curve is today called the normal distribution.
These scholars managed to prove the famous central limit theorem, which we shall discuss next.
For those interested ...
The normal distribution follows the formula
$$y = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2}}$$
where $\mu$ is the mean of the original experiment and $\sigma$ is the standard deviation of the original experiment.
THE CENTRAL LIMIT THEOREM
ACTIVITY:
Here's a simple activity you can perform with 20 of your closest friends to illustrate the Central Limit Theorem in action.
Have each person roll a single die FOUR times and compute the average of the four values obtained. Repeat three more times.

      First Roll   Second Roll   Third Roll   Fourth Roll   AVERAGE
  1.
  2.
  3.
  4.

Also, for the 16 rolls of the die recorded above, list the total number of 1s, 2s, 3s, 4s, 5s, and 6s that occurred.

      1s   2s   3s   4s   5s   6s

On the board, create two charts: one for the raw data and one for the means.
Have each person come to the board and place a dot on the left chart for each of the 16 rolls of his or her die, and one dot for each of the four average values on the right chart. Do you see distributions akin to the diagrams below? Is the distribution of dots on the left close to uniform? Is the distribution on the right approximately normal?
Here's the theorem:
CENTRAL LIMIT THEOREM:
Suppose a quantity has some kind of distribution with mean $\mu$ and standard deviation $\sigma$.
Take a sample of $N$ measurements of this quantity and calculate its mean.
Do this repeatedly for many different samples of size $N$.
Then the means of all the samples will closely follow a normal distribution with:
mean $= \mu$
standard deviation $= \dfrac{\sigma}{\sqrt{N}}$
The mathematical proof of this result is HARD!! But the idea behind it is fairly clear.
The last statement of the theorem states, essentially:
The larger the sample size, the less deviation one obtains.
Example: Suppose you are looking at the heights of people.
Select 10 people at a time and plot their means: Expect a lot of spread.
Select 100 people at a time and plot their means: Expect less spread.
Select 1000 people at a time and plot their means: Expect even less spread.
Select 6.5 billion people at a time (that is, the entire world's population!) and plot their means: Expect no spread!
People say:
Error decreases as sample size increases.
EXAMPLE: A manufacturer makes light bulbs. Their average bulb lasts
$\mu = 55.0$ hours
with standard deviation
$\sigma = 1.8$ hours.
They ship boxes of 100 bulbs to their distributors.
Suppose we open a box and compute the average life span of all 100 bulbs. We repeat this act for many many more boxes.
Then, according to the central limit theorem, the average box lifespan has:
mean $= \mu = 55.0$ hours
standard deviation $= \dfrac{\sigma}{\sqrt{100}} = 0.18$ hours
OPTIONAL ACTIVITY FOR THOSE WITH COMPUTER SKILLS
a) Have a computer select ten random numbers from the range 1 to 100, and
compute the mean of those ten numbers. Have the computer do this one
hundred times and then plot the means. Does the distribution look bell-
shaped?
b) Repeat this exercise but this time have the computer select twenty numbers
at random each time and compute their mean. Does the resulting curve look
more normal?
c) Repeat this exercise but this time have the computer select fifty numbers
each time and compute their mean. How does the distribution appear?
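For those who would like a starting point, here is one possible Python sketch of this activity. It is not the only way to do it: the function name `sample_means`, the fixed seed (used so that runs are repeatable) and the choice to summarise the means by their average and spread, rather than plot them, are all assumptions of this illustration.

```python
import random
import statistics


def sample_means(sample_size, repetitions=100, low=1, high=100, seed=0):
    """Means of `repetitions` samples, each of `sample_size` random integers."""
    rng = random.Random(seed)
    means = []
    for _ in range(repetitions):
        sample = [rng.randint(low, high) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return means


# Parts (a), (b), (c): samples of ten, twenty and fifty numbers.
for size in (10, 20, 50):
    means = sample_means(size)
    print(
        f"sample size {size:>2}: "
        f"mean of means = {statistics.mean(means):6.2f}, "
        f"spread of means = {statistics.stdev(means):5.2f}"
    )
```

To see the bell shape itself, feed the list returned by `sample_means` into any histogram plotting tool you have at hand. The central limit theorem suggests the spread of the means should shrink roughly like one over the square root of the sample size.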
UNDERSTANDING THE NORMAL DISTRIBUTION
The normal curve is symmetrical, bell-shaped, and of area 1.
Now we haven't talked about how to compute the mean and standard deviation of a continuum of values (that is, over a smooth curve); only how to compute these for a finite set of discrete values.
[Recall: $\mu = \dfrac{x_1 + x_2 + \cdots + x_N}{N}$ and $\sigma = \sqrt{\dfrac{(x_1 - \mu)^2 + \cdots + (x_N - \mu)^2}{N - 1}}$]
One way to deal with a continuous set is to approximate it as a finite collection of values as though it is a histogram:
One computes the mean and standard deviation for these finite values, and then takes the limit as one repeats this for a finer and finer histogram. (This is calculus.)
The upshot is ...
For the normal distribution:
mean $= \mu$ = central value
standard deviation $= \sigma$ = measure of spread
THE 68 - 95 - 99.7 RULE:
For the normal distribution:
68% of the data lies within one standard deviation of the mean:
95% of the data lies within two standard deviations of the mean:
99.7% of the data lies within three standard deviations of the mean:
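These three percentages can be checked with the `NormalDist` class in Python's standard `statistics` module (available from Python 3.8). The sketch below uses the normal curve with mean 0 and standard deviation 1; the helper name `within` is made up here.

```python
from statistics import NormalDist

# The normal distribution with mean 0 and standard deviation 1.
normal = NormalDist(mu=0, sigma=1)


def within(k):
    """Fraction of the area lying within k standard deviations of the mean."""
    return normal.cdf(k) - normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within(k):.1%}")
```

The computed values are slightly more precise than the rule of thumb (for instance, two standard deviations capture a little more than 95%), but the rounded figures 68, 95 and 99.7 are the ones to remember.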
EXERCISE:
A species of carrot has average length 12.3 cm with standard deviation 0.8 cm.
Assume carrots are normally distributed.
a) What are the chances that a carrot chosen at random lies in the range 11.5
cm to 13.1 cm.?
b) What are the chances that a carrot chosen at random is longer than 13.9cm?
Answer:

Z-SCORES
The following example leads to an important concept:
EXAMPLE: Here are the college test scores on math along with John's scores.

  Subject      Mean   Standard deviation   John's score
  Algebra      730    40                   670
  Calculus     380    35                   450
  Statistics   640    25                   660

On which subject did John do best? Worst?
[Assume scores are normally distributed.]
Answer:
Notice that John is 60 points below the mean in algebra, that is, $1\tfrac{1}{2}$ standard deviations below. This is not good.
Notice that John is 70 points above the mean in calculus, that is, $2\sigma$ above. This is very good!
John is 20 points above the mean in statistics, that is, $0.8\sigma$ above. This is fairly good.
Even though John got the lowest number score in calculus, this was his best result!

In this example we were able to compare different scores by bringing them to the same standard: the number of standard deviations above or below the mean.
This leads to the notion of a z-score:
Definition: If $x$ is the value of an experiment which has mean $\mu$ and standard deviation $\sigma$, then its z-score is:
$$z = \frac{x - \mu}{\sigma}$$
This is the number of standard deviations $x$ lies above or below the mean.
Example: John's z-scores are:
Algebra: $z = \dfrac{670 - 730}{40} = -1\tfrac{1}{2}$ (That is, $1\tfrac{1}{2}$ standard deviations below the mean.)
Calculus: $z = \dfrac{450 - 380}{35} = 2$ (That is, 2 standard deviations above the mean.)
Statistics: $z = \dfrac{660 - 640}{25} = 0.8$ (That is, 0.8 standard deviations above the mean.)

NOTE: The transformation $z = \dfrac{x - \mu}{\sigma}$ converts the points $x = \mu - \sigma$, $x = \mu$ and $x = \mu + \sigma$ into $z = -1$, $z = 0$, and $z = +1$, and thus transforms a distribution centered about $\mu$ with a spread of $\sigma$ into a distribution centered about 0 with a spread of 1.
So ...
The distribution of the z-scores of a normal distribution is a normal distribution with mean 0 and standard deviation 1.
(A normal distribution with $\mu = 0$ and $\sigma = 1$ is called the standard normal distribution.)
It is convenient to define the following function:
Definition: $\Phi(z)$ = the area under the standard normal distribution ($\mu = 0$ and $\sigma = 1$) to the left of the value $z$.
(This is called the cumulative distribution function for the standard normal distribution.)
Books publish tables of $\Phi(z)$ values.
(Such tables are also easy to find on the internet.)
For example: $\Phi(0) = 0.500$ (Do you see why?)
According to one of these tables we also have:
$\Phi(1) = 0.841$, $\Phi(1.72) = 0.957$, $\Phi(-3.0) = 0.0013$
EXAMPLE: The wealth index of a floogle is normally distributed with mean 0 and standard deviation 1.
A floogle is selected at random. What is the probability that its wealth index is between 1.00 and 1.72?
Answer:
Probability = area between 1 and 1.72
$= \Phi(1.72) - \Phi(1)$
$= 0.957 - 0.841$
$= 0.116 = 11.6\%$
EXAMPLE: The wealth index of a woogle is normally distributed with mean 18 and standard deviation 4.
A woogle is selected at random. What is the probability that its wealth index is between 22 and 24.88?
Answer: We need to convert this to a problem about a normal distribution with mean 0 and standard deviation 1. That is, we need to work with z-scores.
$x = 22 \;\rightarrow\; z = \dfrac{22 - 18}{4} = 1$
$x = 24.88 \;\rightarrow\; z = \dfrac{24.88 - 18}{4} = 1.72$
So $P(\text{between 22 and 24.88}) = \Phi(1.72) - \Phi(1) = 11.6\%$.
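In place of a printed table, one can ask a computer for the area to the left of a z-score. The following Python sketch uses `NormalDist` from the standard `statistics` module (Python 3.8 or later) and redoes the floogle and woogle examples; the names `phi` and `prob_between` are chosen for this illustration.

```python
from statistics import NormalDist

# Area under the standard normal curve to the left of z.
phi = NormalDist(mu=0, sigma=1).cdf


def prob_between(a, b, mu=0.0, sigma=1.0):
    """Chance that a normally distributed value lies between a and b."""
    z_a = (a - mu) / sigma   # convert each endpoint to a z-score
    z_b = (b - mu) / sigma
    return phi(z_b) - phi(z_a)


floogle = prob_between(1.00, 1.72)
woogle = prob_between(22, 24.88, mu=18, sigma=4)

print(f"floogle: {floogle:.1%}")
print(f"woogle:  {woogle:.1%}")
```

Both answers agree, since the woogle problem converts to exactly the same z-scores as the floogle problem.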

COMMENT: Sometimes texts will only give values for the areas between 0 and a positive value $z$ as shown:
Exercise: Find a text that gives values as described here. Use it to compute:
a) $\Phi(0.2)$
b) $\Phi(-0.2)$
c) $\Phi(1.16)$
d) $\Phi(2.21)$
A QUICK TASTE OF HYPOTHESIS TESTING
EXAMPLE: It is generally believed that parsnip length is normally distributed with

mean 18.5 cm and standard deviation 3.2 cm.
Johnny found a parsnip next to a nuclear power plant that is 25.0 cm long.
Should we conclude that something unusual happened to the parsnip?
Answer: What are the chances of finding a parsnip of that length under normal
circumstances? (NULL HYPOTHESIS!!)
Recall the 68-95-99.7 rule. The chances that we naturally find ourselves in the
shaded region is only 2.5%.
Thus, with 97.5% confidence we can say that something strange happened to
the parsnip!

THE CENTRAL LIMIT THEOREM CONTINUED
Recall the result ...
CENTRAL LIMIT THEOREM:

Suppose a quantity has some kind of distribution with mean $\mu$ and standard deviation $\sigma$.
Take a sample of $N$ measurements of this quantity and calculate its mean.
Do this repeatedly for many different samples of size $N$.
Then the means of all the samples will closely follow a normal distribution with:
mean $= \mu$
standard deviation $= \dfrac{\sigma}{\sqrt{N}}$
And recall that the normal distribution has 95% of its values lying within two standard deviations of its mean, and 99.7% of its values within three standard deviations.
These are the key ideas behind statistical hypothesis testing.
One example ...
ANALYSIS OF ROULETTE:
A Roulette wheel has 18 red spaces, 18 black spaces and two green spaces. In the simplest version of the game one can place a $1 bet on either red or black. If your colour comes up, you win $1. If it doesn't, you lose $1. (Thus the two green spaces give a slight advantage to the house.)
So: Your chances of winning $1 are 18/38.
Your chances of losing $1 are 20/38.
In one round of the game your expected profit is:
$$\mu = 1 \cdot \frac{18}{38} + (-1) \cdot \frac{20}{38} = -0.053$$
So, on average, you lose 5.3 cents per bet.
What's the standard deviation here? It is not clear what this means in this case (what are the data points one is referring to here?) but one can reason as follows:
If we were to play 38 games, then we would expect, on average, to receive the data value +1 eighteen times and the data value -1 twenty times. The standard deviation from the mean of -0.053 is given by:
$$\sigma^2 = \frac{(1 - (-0.053))^2 + \cdots + (1 - (-0.053))^2 + (-1 - (-0.053))^2 + \cdots + (-1 - (-0.053))^2}{37}$$
$$= \frac{(1 - (-0.053))^2 \cdot 18 + (-1 - (-0.053))^2 \cdot 20}{37} = 1.0242$$
$$\sigma = \sqrt{1.0242} = 1.0120$$
So, for a single roulette bet:
$\mu = -0.053$
$\sigma = 1.012$
Now suppose I am a habitual gambler and go to the casino every night to play 100 rounds of roulette. By the central limit theorem the results of my nightly activity (measured as average profit per bet) closely follow a normal distribution with:
$\mu = -0.053$
$\sigma = \dfrac{1.012}{\sqrt{100}} = 0.1012$
Almost all of my nightly results (99.7% of them) lie within the range:
$\mu - 3\sigma$ to $\mu + 3\sigma$
i.e. within the range -0.3566 to 0.2506.
For 100 bets of a dollar each, this means that for almost all evenings my winnings lie between -$35.66 and $25.06.
Although I often lose, I also often win. This keeps me coming back!
FROM THE CASINO'S POINT OF VIEW
I am not the only person playing the game each night. They may see something like 100 000 rounds of roulette being played per night.
These samples of 100 000 rounds per night closely follow a normal distribution with:
$\mu = -0.053$
$\sigma = \dfrac{1.012}{\sqrt{100\,000}} = 0.003$
So almost all evenings (99.7% of them) outcomes lie within the range:
$\mu - 3\sigma$ to $\mu + 3\sigma$
i.e. within the range -0.062 to -0.044.
With 100 000 bets of $1 this corresponds to a nightly win for gamblers in the range -$6200 to -$4400.
The casino sees an almost guaranteed profit somewhere between $4400 and $6200 per night from Roulette alone.
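The roulette analysis can be redone in a few lines of Python. This is a sketch only: the function name `nightly_range` is made up, and because the code keeps all decimal places (rather than rounding the mean to -0.053 and the casino's standard deviation to 0.003) the dollar figures it prints differ a little from the rounded ones above.

```python
import math

# Outcomes of 38 "typical" one-dollar bets: 18 wins and 20 losses.
outcomes = [+1] * 18 + [-1] * 20
n = len(outcomes)

mu = sum(outcomes) / n
sigma = math.sqrt(sum((x - mu) ** 2 for x in outcomes) / (n - 1))


def nightly_range(bets):
    """Range of total winnings (in dollars) covering 99.7% of nights."""
    sigma_of_mean = sigma / math.sqrt(bets)   # central limit theorem
    low = mu - 3 * sigma_of_mean              # average profit per bet
    high = mu + 3 * sigma_of_mean
    return bets * low, bets * high            # scale up to all bets


print(f"single bet: mean = {mu:.3f}, standard deviation = {sigma:.3f}")
for bets in (100, 100_000):
    low, high = nightly_range(bets)
    print(f"{bets:>7} bets: between {low:,.2f} and {high:,.2f} dollars")
```

Notice that for 100 bets the range straddles zero (the gambler sometimes wins), whereas for 100 000 bets the whole range is negative for the gamblers, that is, positive for the casino.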
CONFIDENCE INTERVALS
Often people are comfortable making statements that have a 95% chance of being correct!
EXAMPLE: It is generally believed that the lengths of carrots are normally distributed with mean 18.5 cm and standard deviation 3.2 cm.
Johnny found a carrot growing next to the nuclear power plant that is 25.0 cm long. He wants to say that something unusual happened to this carrot. With what level of confidence can he say this?
Answer: We know that 95% of data in a normal distribution lies within two standard deviations from the mean. In this case, 95% of carrot lengths should lie within the range $18.5 \pm 6.4$ cm, that is, in the interval [12.1, 24.9] cm. Johnny's carrot has length outside of this range. There is only a 5% chance that this would happen naturally. Thus, with 95% confidence, Johnny can say that something strange happened to his carrot.
EXAMPLE: The average mass of all planets in the universe is not known, but scientific theories do suggest that planet masses vary with a standard deviation of the order $\sigma = 3000$ units.
Astronomers have observed the motion of 24 planets and have calculated their average mass to be $m = 27650$ units.
What can we say about the value of $\mu$, the mean mass of all planets in the universe?
Answer: For sample sizes of 24 planets we expect to obtain a distribution very close to the normal distribution with:
mean $= \mu$
standard deviation $= \dfrac{\sigma}{\sqrt{24}} = 612$
We obtained: $m = 27650$.
Now, in general, there is a 95% chance that $m$ lies within two standard deviations of $\mu$.
By the same token, there is a 95% chance that $\mu$ lies within two standard deviations of $m$!! (If $m$ is within two units of $\mu$, then $\mu$ is within two units of $m$.)
So we can say, with 95% confidence, that the true value for the mean $\mu$ lies somewhere in the interval $27650 \pm 1224$, that is, in the range [26426, 28874].
We call [26426, 28874] the 95% confidence interval.
IN GENERAL:
Suppose a population has unknown mean $\mu$ and known standard deviation $\sigma$.
A sample of size $N$ yields an average value $m$.
Then, with a 95% level of confidence, we can say that the true mean $\mu$ lies somewhere in the interval:
$$\left[\, m - 2\frac{\sigma}{\sqrt{N}},\; m + 2\frac{\sigma}{\sqrt{N}} \,\right]$$
This is called the 95% confidence interval for the mean.
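The general recipe is short enough to write as a function. Here is a minimal Python sketch, applied to the planet example; the function name `confidence_interval_95` is invented for this illustration, and it uses the "two standard deviations" rule of thumb from these notes.

```python
import math


def confidence_interval_95(sample_mean, sigma, n):
    """95% confidence interval for the true mean, using the 2-sigma rule.

    sample_mean: the average m of the sample
    sigma:       the known standard deviation of the population
    n:           the sample size N
    """
    half_width = 2 * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Planet example: m = 27650, sigma = 3000, N = 24.
low, high = confidence_interval_95(27650, 3000, 24)
print(f"95% confidence interval: [{low:.0f}, {high:.0f}]")
```

The endpoints may differ by a unit or so from the hand calculation, which rounded the standard deviation to 612 before doubling it.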
EXAMPLE: A newspaper reports that the average height of an Australian male is $175 \pm 8$ cm with 95% confidence. What does this mean?
Answer: We are 95% sure that the true mean for the entire population of Australian males lies somewhere between 167 and 183 cm.
NOTE: It DOES NOT mean that 95% of Australian men have height somewhere between 167 and 183 cm. THIS IS A COMMON MISCONCEPTION!
EXAMPLE: A manufacturer of pipes should produce pipes with diameter 3 inches.
However, the manufacturing equipment is not perfect and has a standard deviation of $\sigma = 0.02$ inches in the pipes it produces.
Inspectors select 100 pipes at random and find their average diameter to be 2.98 inches. Are they pleased?
Answer: The true mean lies somewhere in the interval:
$$\left[\, 2.98 - 2\frac{0.02}{\sqrt{100}},\; 2.98 + 2\frac{0.02}{\sqrt{100}} \,\right] = [2.976, 2.984]$$
with 95% confidence. Things don't look good. It is very unlikely that the manufacturer is producing pipes with the required mean of 3 inches.
EXAMPLE: A filter removes dust from my living room. The amount of dust it removes from day to day has standard deviation $\sigma = 0.3$ mg.
I measured the weight of dust collected over three days and got the figures:
13.2 mg, 13.7 mg, 12.9 mg.
Compute the 95% confidence interval for the true average weight of dust removed per day.
Answer: The sample has mean
$$m = \frac{13.2 + 13.7 + 12.9}{3} = 13.27 \text{ mg}$$
The 95% confidence interval is $m \pm 2\dfrac{\sigma}{\sqrt{3}} = 13.27 \pm 0.35$ mg.
QUESTION: What is a 99.7% confidence interval? How would I have to adjust my calculations to find this level of confidence?
P-VALUES
Sometimes we like to be more specific and follow our hunches about what we think the value of a mean should actually be.
EXAMPLE: Suppose a population has a normal distribution.
$\mu$ = unknown, but we suspect it has value 12
$\sigma = 3$
A sample of size 40 was found to have average value $m = 12.98$.
What do we now think about our hunch that $\mu = 12$?
Answer: Let's assume that $\mu$ really is 12 and ask how likely it is we would have obtained a sample mean of $m = 12.98$.
Now samples of size 40 have means that follow a normal distribution with:
mean = 12
standard deviation $= \dfrac{3}{\sqrt{40}} = 0.47$
Now 95% of the means should lie between 11.06 and 12.94.
We got the value 12.98. There is only a 2.5% chance that we'd land above 12.94 if the true mean really were $\mu = 12$.
We reject our hunch that $\mu = 12$ with 97.5% certainty that this is the right thing to do.
Some people like to go further:
In the same example:
If $\mu = 12$, what is the probability that we'd get a sample mean of $m = 12.98$ or more?
Answer:
Convert this to a z-score so that we can look up values on a table of standard normal distribution values:
$$z = \frac{12.98 - 12}{0.47} = 2.09$$
According to the tables, this region has area $0.5000 - 0.4817 = 0.0183 = 1.83\%$.
This value is called the p-value of the sample mean 12.98 for the assumption that $\mu = 12$.
The chance of getting a sample mean of 12.98 or more under the assumption that $\mu = 12$ is extremely low. We reject the claim that $\mu = 12$ with 98.17% level of confidence that we are doing the right thing.
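A right-tail p-value of this kind can also be computed directly, without tables. The Python sketch below uses `NormalDist` from the standard `statistics` module; the function name `right_tail_p_value` is an invention of this illustration. Because the code does not round the standard deviation to 0.47, its z-score and p-value differ slightly from the hand calculation.

```python
import math
from statistics import NormalDist


def right_tail_p_value(sample_mean, assumed_mean, sigma, n):
    """z-score of a sample mean, and the chance of a mean at least this large.

    Assumes sample means are normally distributed about `assumed_mean`
    with standard deviation sigma / sqrt(n).
    """
    sigma_of_mean = sigma / math.sqrt(n)
    z = (sample_mean - assumed_mean) / sigma_of_mean
    p_value = 1 - NormalDist().cdf(z)   # area to the right of z
    return z, p_value


z, p = right_tail_p_value(12.98, 12, 3, 40)
print(f"z = {z:.2f}, right-tail p-value = {p:.2%}")
```

Either way, the p-value is close to 2%, so the conclusion is unchanged.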
We've just computed a right-tail p-value for a given sample mean. One can also compute left-tail p-values.
EXERCISE: Suppose a population has a normal distribution.
$\mu$ = unknown, but we suspect it has value 12
$\sigma = 3$
A sample of size 40 was found to have average value $m = 11.03$.
Find the left-tail p-value for this sample.
Would you reject the claim that $\mu = 12$?
A p-value gives a measure of the likelihood that we would obtain a sample mean at least as extreme as the one we observed, under the assumption that beliefs about the true value of the mean are correct.
GALLUP POLLS and the like:

There is another version of the central limit theorem that comes in handy for interpreting Gallup Polls, for example.
Suppose we are interested in determining what percentage of a population has a certain characteristic, say, predicting the percentage of Americans that will vote Republican at the next elections. Call the proportion of the population with the desired property $p$. Our goal is to estimate the value of $p$.
Suppose we take a sample of size $N$ from the population and find that percentage $\hat{p}$ has the desired property. (For example, we interview just 1000 Americans and find that 32% say they will likely vote Republican.)
CENTRAL LIMIT THEOREM: Version II

Let $p$ be the (unknown) percentage of a population possessing a certain characteristic. If samples of size $N$ are taken and the values $\hat{p}$ are computed for those samples, then:
The values $\hat{p}$ have a distribution that is approximately normal and the larger the sample size $N$ the better the approximation.
The mean and standard deviation of the distribution of $\hat{p}$ values are:
$\mu = p$
$\sigma = \sqrt{\dfrac{p\,(100 - p)}{N}}$
(Here, $p$ is given as a percentage.)
Given a particular value $\hat{p}$ for one sample, the 95% confidence interval for that sample is:
$[\hat{p} - 2\sigma,\; \hat{p} + 2\sigma]$
where $\sigma$ is the standard deviation computed with $\hat{p}$ rather than $p$.
With 95% confidence we can say that the true value $p$ lies within this range.
EXAMPLE: Of 1500 adult Americans that were polled, 3.2% of them said they had an overall thoroughly pleasurable experience studying math in high-school. Estimate the proportion of ALL adult Americans that will say the same.
Answer: We have
$\hat{p} = 3.2$
$\sigma = \sqrt{\dfrac{3.2 \times 96.8}{1500}} = 0.454$
With 95% confidence we can say that the percentage of Americans who claim to have enjoyed high-school math lies in the range [2.292, 4.108].
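The poll calculation is easily automated. In this Python sketch the function name `poll_interval_95` is hypothetical; it takes the sample percentage and the sample size and returns the standard deviation together with the 95% range, exactly as in Version II of the theorem.

```python
import math


def poll_interval_95(sample_percent, n):
    """Standard deviation and 95% range for a polled percentage.

    sample_percent: percentage of the sample with the characteristic
    n:              number of people polled
    """
    sigma = math.sqrt(sample_percent * (100 - sample_percent) / n)
    return sigma, (sample_percent - 2 * sigma, sample_percent + 2 * sigma)


# The high-school math example: 3.2% of 1500 people polled.
sigma, (low, high) = poll_interval_95(3.2, 1500)
print(f"sigma = {sigma:.3f}")
print(f"95% range: [{low:.3f}, {high:.3f}] percent")
```

Trying the function with larger values of `n` shows how quickly (or slowly!) the range narrows as more people are polled.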
COMMENT: IN THE NEWSPAPER
Newspaper reports usually don't mention a "confidence level" but rather a "margin of error." They mean by this the same thing with the understanding that we are talking about a 95% confidence level. For example, if a journalist writes
"According to our survey, 36% of Americans are now afraid to eat cheese. The margin of error in this report is 2.6 percentage points."
the journalist means by this that, with 95% confidence, we can say that the true percentage of Americans afraid to eat cheese lies in the interval [33.4, 38.6].
WARNING: Some organizations prefer to use 90% confidence levels (phrased in terms of margins of error). For example, the Bureau of Labor Statistics reports unemployment figures with 90% confidence.
POLLS CAN BE MISLEADING!
There is a large sub-branch of statistics concerned with the issue of how to select an appropriately representative sample. By considering this very question, one may begin to influence the types of people one may accept to interview for a survey. For example, in studying the shopping habits of Americans one may think to go to the local mall to interview shoppers. Right there you have a bias in your sample: you are considering only people who like to shop at malls!
A famous historical example of an erroneous prediction based on biased sampling occurred during the 1936 U.S. presidential elections. The popular magazine Literary Digest, as part of the sensationalism leading up to the election, conducted a poll to predict the outcome of the race. After interviewing a sample of eligible voters, chosen by drawing names at random from telephone books from across the nation, the editors of the publication concluded that the election was a foregone conclusion - Alfred Landon was to win with a comfortable lead - and they subsequently published much editorial commentary to this effect. It turned out, however, that Landon's opponent, Franklin Roosevelt, won the election by a landslide. Members of the Digest did not realize that they had worked with a biased sample: only affluent Americans could afford telephones at the time of the Great Depression and be listed in telephone books. This was a class of voter then more likely to vote Republican. Consequently, the Digest's prediction was erroneous. The publication folded in 1937 due to both the sampling fiasco and the difficult times of the depression.
Today, a number of sampling methods are commonly used to help ensure that no bias occurs. These methods include:
Random Sampling
Each subject of the population is assigned a number, and numbers are generated randomly with the aid of a computer to select members.
Systematic Sampling
Each subject of the population is assigned a number, and, starting at a random number, every kth member from then on is selected. For example, one might select every 23rd person, starting with the 533rd member.
Stratified Sampling
When a population is naturally divided into groups (such as male/female, or age by decade), selecting a random sample from within each group produces what is called a stratified sample. Samples produced this way are used to ensure representatives of each subgroup are present in the study. For example, in a study involving college freshmen and sophomores, one might select twenty-five students at random from each group (freshman males, freshman females, sophomore males and sophomore females) to make a sample of one hundred students.
Cluster Sampling
If an intact subgroup of a population is used as a representative sample of the entire population then the sample is called a cluster sample. For example, the set of all freshman females might be used to represent the population of all college students for the purposes of one study, or the 12 eggs in one carton of eggs as representative of all the eggs handled by a particular supermarket.
The list goes on!
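The four methods just described can be sketched in Python with the standard `random` module. Everything specific here is an assumption made for the illustration: the population is simply the numbers 1 to 1000, the two strata and the "cartons" of 12 are invented groupings, and the seed is fixed only so that runs are repeatable.

```python
import random

rng = random.Random(42)

# A hypothetical population of 1000 subjects, numbered 1 to 1000.
population = list(range(1, 1001))

# Random sampling: let the computer pick 20 distinct members.
random_sample = rng.sample(population, 20)

# Systematic sampling: every 23rd member, starting with the 533rd.
# (Member number 533 sits at list position 532.)
systematic_sample = population[532::23]

# Stratified sampling: a random sample from within each (made-up) group.
strata = {
    "freshmen": population[:600],
    "sophomores": population[600:],
}
stratified_sample = {
    name: rng.sample(members, 5) for name, members in strata.items()
}

# Cluster sampling: use one intact subgroup, here one "carton" of 12.
cartons = [population[i:i + 12] for i in range(0, len(population), 12)]
cluster_sample = rng.choice(cartons)

print("random:    ", sorted(random_sample))
print("systematic:", systematic_sample[:5], "...")
print("stratified:", stratified_sample)
print("cluster:   ", cluster_sample)
```

Note how differently the samples are spread through the population: the cluster sample, for instance, consists entirely of neighbouring numbers.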
PRACTICE EXERCISE:
a) Take a deck of cards and divide it into four suits. Randomly select 2 cards from
each suit and list the eight cards here:
Which of the above types of sampling is this?
Is there an 8-card sample that would not arise via this method?
b) Shuffle the deck of cards. Select two suits by randomly drawing two cards from
the deck. Now separate those two suits from the deck. Randomly select four cards
from each of those two suits. Write the sample of eight cards here:
This is an example of a multistage random sample.

Can all possible samples arise this way?
c) Describe a procedure for selecting 8 cards from a deck that illustrates the method
of systematic sampling.
Statisticians often prefer to work with a simple random sample (SRS) scheme: not only
does each person in the population have equal chance of being selected, but every
combination of people has the same chance of appearing as any other combination. For
example, random sampling and systematic sampling can be SRS, but stratified sampling and
cluster sampling are certainly not.
The central limit theorem assumes that the samples chosen are simple random
samples.
There is a wonderful student exercise being bandied about teaching conferences.

Sadly, I have not been able to track down the originator of this idea to assign
proper credit.
As you will see, the following activity can be attempted on the first day of class,
and as student sophistication grows, can be repeated with more powerful tools.
ACTIVITY: HOW MANY RED BOOKS ARE IN THE LIBRARY?
Your mission is simply to give a reasonable answer to this question!
Working in teams of three, formulate an approach to an answer by following the

three steps outlined below. These are precisely the steps a statistician must follow
when she commences a new type of project.
STEP 1: DESCRIBING THE DATA

The question to be explored is vague. Can you come up with some reasonable points
of clarification?
What does it mean for a book to be red? Go to the library and look at some books.
Try to come up with a consensus within your team as to what red-ness should mean.
Write down your definition of a red book.
Does the phrase "in the library" need to be clarified?
STEP 2: COLLECTING DATA

Use a method that your team believes will give the best approximate answer to the
question. Describe your method and approach.
STEP 3: CONCLUSION
Write down your answer. Approximately how many books in the library are red?
LATER
Select a sampling method that seems appropriate for garnering a non-biased
sample of 200 or more library books. Count the number in that sample that fit
your definition of being red, and express this count as a proportion p (written
as a percentage) of the N books sampled.
Using
$\mu = p$
$\sigma = \sqrt{\dfrac{p(100 - p)}{N}}$
write a 95% confidence interval for the true proportion of library books that are
red.
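Here is a minimal sketch of that calculation. It assumes p is measured as a percentage and that "95%" means two standard deviations either side of p (the 68-95-99.7 rule). The counts of 30 red books out of 200 are invented purely for illustration.

```python
import math

def proportion_interval(red_count, sample_size, z=2.0):
    """Return (low, high), in percent, as p plus or minus z standard deviations."""
    p = 100.0 * red_count / sample_size               # sample percentage
    sigma = math.sqrt(p * (100.0 - p) / sample_size)  # spread of sample percentages
    return p - z * sigma, p + z * sigma

# Hypothetical data: 30 of 200 sampled books fit the team's definition of red.
low, high = proportion_interval(30, 200)
print(f"Between {low:.1f}% and {high:.1f}% of library books are red (95% confidence).")
```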
TWO-VARIABLE ANALYSIS: CHI-SQUARED TESTS
It's easiest just to begin with an example:
EXAMPLE: Is there any correlation between hair colour and eye colour?
A team goes out and examines a random sample of people. The data collected is
displayed in the following contingency table:
Does there seem to be some connection?
Answer: Here's one way to think about this.
First, compute each row sum, column sum and grand total.
We see that 38 out of the 110 people examined had blue eyes. That is, 38/110 =
34.5% of the sample had blue eyes.
We also see that 37 of the people examined were blonde, 46 brown
haired and 27 red haired.
If there is absolutely no influence of hair colour on eye colour, then we'd expect
34.5% of the blonde population to have blue eyes, 34.5% of the brown haired
population to have blue eyes, and 34.5% of the red haired population to have blue
eyes. That is,
THE EXPECTED FREQUENCY (aka count) OF BLONDE PEOPLE WITH BLUE EYES IS:
$\dfrac{38}{110} \times 37 = 12.8$
THE EXPECTED FREQUENCY OF BROWN HAIRED PEOPLE WITH BLUE EYES IS:
$\dfrac{38}{110} \times 46 = 15.9$
THE EXPECTED FREQUENCY OF RED HAIRED PEOPLE WITH BLUE EYES IS:
$\dfrac{38}{110} \times 27 = 9.3$
How many green-eyed people with red hair would we expect?
$\dfrac{27}{110} \times 25 = 6.1$
(The proportion of our sample with red hair is 27/110. This proportion of the 25
green-eyed folk should have red hair.)
In general:
The expected frequency of the entry in the i-th row and j-th column is given
by:
$\dfrac{(\text{row } i \text{ total}) \times (\text{col } j \text{ total})}{\text{grand total}}$
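The rule is easy to check by machine. The sketch below uses only the totals quoted in the text (37 blonde, 46 brown haired, 27 red haired, 38 blue eyed, 25 green eyed, 110 in all); the function name is just a label chosen for this example.

```python
def expected_frequency(row_total, col_total, grand_total):
    """Expected count for one cell if the two qualities are unrelated."""
    return row_total * col_total / grand_total

grand_total = 110
hair_totals = {"blonde": 37, "brown": 46, "red": 27}
blue_total = 38
green_total = 25

for hair, total in hair_totals.items():
    e = expected_frequency(total, blue_total, grand_total)
    print(f"{hair} hair, blue eyes: {e:.1f}")

e = expected_frequency(hair_totals["red"], green_total, grand_total)
print(f"red hair, green eyes: {e:.1f}")
```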
We can go and fill in all the expected frequencies under the assumption that there
is no relationship between the two qualities.
Looking at this it seems that our observed frequencies are vastly different from
the expected frequencies (for no relationship). It seems that something is going
on.
To make this more precise ...
The CHI-SQUARED STATISTIC for a table of observed frequencies (o) and
expected frequencies (e) is:
$\chi^2 = \text{sum of all calculations } \dfrac{(o - e)^2}{e}$
In our example
$\chi^2 = \dfrac{(23 - 12.8)^2}{12.8} + \dfrac{(8 - 15.9)^2}{15.9} + \cdots + \dfrac{(5 - 6.1)^2}{6.1} = 17.88$
Now if there truly were no relationship twixt the two variables then you'd expect
the observed values to be very close to the expected values, that is:
A $\chi^2$ value close to zero suggests no connection.
A large $\chi^2$ value suggests something is going on.
In our example, we seem to have a large $\chi^2$ value.
Statisticians have done the mathematics on the $\chi^2$ statistic and have computed
the distribution one would expect it to follow. [NASTY MATH!] Each table of a
different size has its own chi-squared distribution.
A table with r rows and c columns is said to have
$\nu = (r - 1)(c - 1)$
degrees of freedom.
E.g. In our example we have $\nu = 2 \times 2 = 4$ degrees of freedom.
(COMMENT: Why r-1 and c-1? We know that all r rows add to 100% of the
data. Thus if you know the sum of the first r-1 rows, then you do not need to
be given the sum of the r-th row. Its value is not free. Ditto for the
columns.)
See the internet or almost any stats book for a table of $\chi^2$ values for each number of degrees of freedom.
In our example: $\chi^2 = 17.88$ with 4 degrees of freedom.
According to the table of values, this value lies between $\chi^2_{0.995}$ and $\chi^2_{0.999}$.
If there were no relationship, a $\chi^2$ value this large would occur less than 0.5% of the time. Thus, with 99.5%
confidence, we can say that there is some kind of correlation between eye
colour and hair colour.
NOTE: There is no claim as to what that connection is. Further independent

analysis is required.
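If no table is to hand, the tail probability can be computed directly. The sketch below relies on a closed form that holds only when the number of degrees of freedom is even (which covers our example with 4 degrees of freedom); it is an illustrative shortcut, not a general-purpose routine, and the function names are made up for this example.

```python
import math

def degrees_of_freedom(rows, cols):
    """Degrees of freedom for a table with the given numbers of rows and columns."""
    return (rows - 1) * (cols - 1)

def chi_squared_tail(x, dof):
    """Chance that a chi-squared value exceeds x, for EVEN dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form only covers even, positive dof")
    half = x / 2.0
    terms = (half ** k / math.factorial(k) for k in range(dof // 2))
    return math.exp(-half) * sum(terms)

dof = degrees_of_freedom(3, 3)        # three hair colours, three eye colours
tail = chi_squared_tail(17.88, dof)   # chi-squared value from the example
print(f"dof = {dof}, chance of a value this large = {100 * tail:.2f}%")
```

The chance comes out between 0.1% and 0.5%, in agreement with the position of 17.88 in the table of values.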
EXERCISE: Analyse this (fictional) data from interviews with 5406 fifteen-year-olds.
CAVEAT: LOW EXPECTED FREQUENCIES (values lower than 5) TEND TO SKEW

CHI-SQUARED TESTS. Statisticians have the rule of thumb that if more than
20% of the entries in a table have expected values less than 5, then the test is
unreliable.
QUALITY CONTROL
EXAMPLE: A pipe manufacturer makes pipes of diameter 3 inches. Consumers will

tolerate a spread of values with standard deviation $\sigma = 0.06$ inches.
To test the quality of their manufacturing techniques, each day a random sample of
10 pipes is selected and their mean diameter is computed. Here are the results of
twelve days of data:
DAY    1     2     3     4     5     6     7     8     9     10    11    12
Mean   2.98  3.01  3.04  2.97  2.99  3.01  3.05  3.04  3.07  3.08  3.06  3.09
Is the wear and tear of the production equipment having an effect?
Answer: Let's plot the data. Also, the mean is meant to be 3.00 so let's plot that
line as well:
It looks like the data is drifting away from the target mean. To make this more
precise ...
Samples of size ten should have mean 3.00 and standard deviation $\dfrac{0.06}{\sqrt{10}} \approx 0.02$.
99.7% of the results should lie within three standard deviations of this mean. That
the data is drifting above the critical line of +3 standard deviations suggests that
quality is out of control. (Day 9 is the first day of concern.)
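The same check can be sketched in code, using the twelve daily means from the table. The sketch keeps the unrounded value of 0.06/√10 (about 0.019) rather than 0.02, so a reading that sits right on the rounded critical line may be classified differently than by eye.

```python
import math

target = 3.00   # intended pipe diameter, in inches
sigma = 0.06    # tolerated standard deviation of individual pipes
n = 10          # pipes sampled each day
daily_means = [2.98, 3.01, 3.04, 2.97, 2.99, 3.01,
               3.05, 3.04, 3.07, 3.08, 3.06, 3.09]

sigma_of_mean = sigma / math.sqrt(n)   # spread expected for a mean of ten pipes
upper = target + 3 * sigma_of_mean     # +3 standard deviation critical line
lower = target - 3 * sigma_of_mean     # -3 standard deviation critical line

out_of_control = [day for day, mean in enumerate(daily_means, start=1)
                  if mean > upper or mean < lower]
print(f"critical lines: {lower:.3f} and {upper:.3f}")
print("first day of concern:", out_of_control[0])
```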
EXAMPLE: Two machines each produce 1 000 bolts per day. The following table
shows the number of defective bolts each machine manufactured over a ten day
period.
MACHINE 1 42 37 18 37 17 26 35 21 18 17
MACHINE 2 44 36 23 41 24 25 31 35 23 21
Using only basic techniques, is one machine significantly less reliable than the
other?
Answer: There isn't much to work with on this problem.
One approach is to count totals:
Over 10 days, machine 1 produced 268 defective bolts out of 10 000:
Percentage: 2.68%
Over 10 days, machine 2 produced 303 defective bolts out of 10 000:
Percentage: 3.03%
These seem on par.
ANOTHER APPROACH: Perform a COUNTS TEST.
Here's the list showing which machine produced the greater number of defective
bolts each day:
2 1 2 2 2 1 1 2 2 2
Machine 2 is listed seven times out of the ten days.
If there is no difference in the quality of the machines, that is, if each is equally
likely to be listed on a day as having produced the most defective bolts for that
day, then this sequence is akin to the sequence of H's and T's in flipping a coin.
Is it unusual to get seven H's in a run of ten flips? That is, is the fact that machine
2 is listed seven times at all significant?
Question: What are the chances of receiving seven heads in flipping a coin ten
times?
Answer: $\dfrac{10!}{3!\,7!} \times \dfrac{1}{2^{10}} \approx 11.7\%$.
It can happen. (It occurs about 12% of the time.)
This is not considered rare enough to be significant. So we would say that there
is no significant evidence to suggest that machine 2 is behaving differently to
machine 1.
COMMENT: We usually look for events that are rare, say with at most a 5% chance of
occurring, to say, with 95% confidence, that something unusual is occurring.
For example, suppose machine 2 was listed NINE times in the string of ten.
The chance of this naturally occurring is $\dfrac{10!}{1!\,9!} \times \dfrac{1}{2^{10}} \approx 1\%$, so we would conclude, with
99% confidence, that machine 2 is indeed less reliable than machine 1.
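The counts test is short enough to sketch in full. The data are the defective-bolt counts from the table above; the helper name chance_of_exactly is just a label for this example, and the code assumes no day ends in a tie (true for this data).

```python
from math import comb

machine_1 = [42, 37, 18, 37, 17, 26, 35, 21, 18, 17]
machine_2 = [44, 36, 23, 41, 24, 25, 31, 35, 23, 21]

# Which machine made more defective bolts on each day?
worse_each_day = [2 if m2 > m1 else 1 for m1, m2 in zip(machine_1, machine_2)]
times_machine_2 = worse_each_day.count(2)

def chance_of_exactly(k, n):
    """Chance of exactly k heads in n flips of a fair coin."""
    return comb(n, k) / 2 ** n

print(worse_each_day, times_machine_2)
print(f"seven of ten: {100 * chance_of_exactly(7, 10):.1f}%")
print(f"nine of ten:  {100 * chance_of_exactly(9, 10):.1f}%")
```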
RUN TESTS FOR RANDOMNESS
Suppose some activity has two possible outcomes: A or B.
e.g. Toss a coin: H or T

Roll a die: Even or Odd
Height of a person: Above the mean or Below the mean
Suppose we perform the activity a number of times and record the sequence of A's
and B's that result:
e.g.
A A | B B B | A | B | A A A A A | B B B |A A | B | A A A
Definition: A run is a string of repeated letters in the sequence.
One usually separates runs with a | to make them easier to see.

In the example above there are nine runs.
TWO COMMENTS:
a) A sequence with a large number of runs suggests that the sequence of A's and
B's generated is not truly random. For instance, the following sequence has
the maximal possible number of runs. You would be unlikely to believe it to be a random
sequence:
A|B|A|B|A|B|A|B|A|B|A|B|A|B|A
b) A sequence with very few runs doesn't seem that random either.
AAAAAAAAA|BBBBBBBB|AAAAAAAAAAA
There seems to be too much clustering.

So the count of runs in a sequence should, in some way, give an indication of just
how random that sequence is.
Some mathematical facts
Suppose in a string of N symbols we have a A's and b B's. (So N = a + b.)
There are $\dfrac{N!}{a!\,b!}$ possible ways to arrange these A's and B's.
List them all and count the number of runs in each possible example.
Mathematicians have proven that the count of runs has mean and standard
deviation given by these formulae:
$\mu = \dfrac{2ab}{N} + 1$
$\sigma = \sqrt{\dfrac{2ab(2ab - N)}{N^2(N - 1)}}$
They have also shown that if a and b are each 7 or greater, then 95% of the
run counts lie within two standard deviations of this mean.
COMMENT: This is using the version of standard deviation with N in the

denominator rather than N-1.
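One way to gain confidence in these formulae is to let a computer do the "list them all" step. The sketch below compares the formulae with a brute-force count over every arrangement; the function names are made up for this example, and the choice of 7 A's and 9 B's is simply a case small enough to enumerate quickly.

```python
import math
from itertools import combinations

def count_runs(sequence):
    """Number of runs: a new run starts wherever the symbol changes."""
    return sum(1 for i, s in enumerate(sequence)
               if i == 0 or s != sequence[i - 1])

def runs_mean_and_sd(a, b):
    """Mean and standard deviation of the run count, from the formulae."""
    n = a + b
    mean = 2 * a * b / n + 1
    sd = math.sqrt(2 * a * b * (2 * a * b - n) / (n ** 2 * (n - 1)))
    return mean, sd

def brute_force_mean_and_sd(a, b):
    """List every arrangement of a A's and b B's and measure the run counts."""
    n = a + b
    counts = []
    for a_positions in combinations(range(n), a):
        chosen = set(a_positions)
        sequence = ["A" if i in chosen else "B" for i in range(n)]
        counts.append(count_runs(sequence))
    mean = sum(counts) / len(counts)
    # Standard deviation with the number of arrangements in the denominator.
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return mean, math.sqrt(variance)

print(runs_mean_and_sd(7, 9))
print(brute_force_mean_and_sd(7, 9))
```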
EXERCISE:
a) Write down all the possible ways to list three A's and two B's.
b) Count the runs in each.
c) Find the mean and the standard deviation of the count of runs. Verify that the
above formulae give the same values.
EXAMPLE: Consider the following string:
HHHTTHHHTTTHTTTT
How likely is it that this came from flipping a coin?
Answer: We have a = 7 heads and b = 9 tails. Here N = 16.
There are 6 runs.
Now, according to the previous result, the runs should follow a distribution with:
$\mu = 8.875$
$\sigma = 1.9$
The count of six runs is within the range of two standard deviations from the
mean. We cannot conclude that this example is unusual.
HTHHTHTHTTHHHTHT
Answer: We have a = 9 heads, b = 7 tails, and N = 16.
There are 12 runs.
Again:
$\mu = 8.875$
$\sigma = 1.9$
The number 12 is within 2 standard deviations from the mean. We cannot conclude
that this sequence is not random.
HHHHHTTTTTTHHTTT
Answer: a = 7 heads; b = 9 tails; N = 16.
There are 4 runs.
Again:
$\mu = 8.875$
$\sigma = 1.9$
The count of 4 runs is more than two standard deviations below the mean. With
95% confidence we can say that this sequence was not produced by a random
phenomenon.
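The three examples above can be re-run with a small helper. This is a sketch only: runs_test is a name invented here, and "unusual" is taken to mean more than two standard deviations from the mean, as in the text.

```python
import math

def runs_test(sequence):
    """Return (runs, mean, sd, within_two_sd) for a two-symbol sequence."""
    symbols = sorted(set(sequence))
    if len(symbols) != 2:
        raise ValueError("the sequence must use exactly two symbols")
    a = sequence.count(symbols[0])
    b = sequence.count(symbols[1])
    n = a + b
    runs = 1 + sum(1 for x, y in zip(sequence, sequence[1:]) if x != y)
    mean = 2 * a * b / n + 1
    sd = math.sqrt(2 * a * b * (2 * a * b - n) / (n ** 2 * (n - 1)))
    return runs, mean, sd, abs(runs - mean) <= 2 * sd

for flips in ["HHHTTHHHTTTHTTTT", "HTHHTHTHTTHHHTHT", "HHHHHTTTTTTHHTTT"]:
    runs, mean, sd, ok = runs_test(flips)
    verdict = "nothing unusual" if ok else "not random (95% confidence)"
    print(f"{flips}: {runs} runs, mean {mean:.3f}, sd {sd:.2f}: {verdict}")
```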
TWO APPLICATIONS
ABOVE- and BELOW- the MEDIAN TEST
To determine whether or not a set of numerical data is random:
a) Write the data in the order it was collected.
b) Compute the median of the data.
c) Write A or B next to each data point to indicate whether that point is
above or below the median. (If an entry has the same value as the median,
then omit it.)
d) Do a runs test on the sequence of A's and B's.
If the data really was generated by a random phenomenon, then the sequence of
A's and B's produced should be random.
EXAMPLE: Here's some data. Does it seem random? Use the above/below median
test.
16 12 23 18 37 21 13 14 30 79 11
Answer: We need to find the median. (Unfortunately, this means ordering the
data!):
11 12 13 14 16 18 21 23 30 37 79
median = 18.
Now here's the sequence in terms of aboves and belows:
B B A * A A B B A A B
(The star indicates the omitted value.)

We have: a = 5, b = 5 with N = 10. There are 5 runs. (I know that these a- and b-
values are a bit low, but let's follow the test anyway just for the fun of it!)
This gives:
$\mu = 6$
$\sigma = 1.49$
The value of 5 runs is not outside two standard deviations from the mean. The data
seems to be following a random phenomenon.
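Steps a) to d) translate directly into code. The sketch below uses the eleven data values from the example; the function names are invented for illustration, and the median comes from Python's standard statistics module.

```python
import math
from statistics import median

def above_below_median(data):
    """Label each value A (above) or B (below); drop values equal to the median."""
    m = median(data)
    return ["A" if x > m else "B" for x in data if x != m]

def runs_summary(labels):
    """Return (a, b, runs, mean, sd) for a list of 'A'/'B' labels."""
    a, b = labels.count("A"), labels.count("B")
    n = a + b
    runs = 1 + sum(1 for x, y in zip(labels, labels[1:]) if x != y)
    mean = 2 * a * b / n + 1
    sd = math.sqrt(2 * a * b * (2 * a * b - n) / (n ** 2 * (n - 1)))
    return a, b, runs, mean, sd

data = [16, 12, 23, 18, 37, 21, 13, 14, 30, 79, 11]   # in order of collection
labels = above_below_median(data)
a, b, runs, mean, sd = runs_summary(labels)
print("".join(labels), a, b, runs, mean, round(sd, 2))
print("looks random" if abs(runs - mean) <= 2 * sd else "does not look random")
```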
DIFFERENCE IN POPULATIONS TEST
Suppose two samples of sizes m and n are denoted:
$a_1, a_2, a_3, \ldots, a_m$
$b_1, b_2, \ldots, b_n$
To decide whether or not the two samples came from the same type of population,
arrange all m + n values in increasing order. (If some values are repeated, choose an
order among them at random.) Record a sequence of A's and B's to show from which
sample each data point came.
If the resulting sequence of A's and B's is random, then we can conclude that the
samples are not really different and come from the same source.
If the sequence is not random, then no such conclusion can be made.

EXAMPLE: Twelve people from a mall were interviewed for their ages. Call these
the M values: 13 18 34 17 16 30 13 47 37 35 15 35
Twelve people at an art museum were interviewed for their ages. Call these the A
values: 45 52 17 28 41 63 48 23 38 60 40 40
Are these ages from the same type of population?
Answer: Arrange the data in numerical order and keep track of which are M's and
which are A's.
13 13 15 16 17 17 18 23 28 30 34 35 35 37 38 40 40 41 45 47 48 52 60 63
M  M  M  M  A  M  M  A  A  M  M  M  M  M  A  A  A  A  A  M  A  A  A  A
We have:
a = 12
b = 12
N = 24
There are 8 runs.
For these values:
$\mu = 13$
$\sigma = 2.40$
The value of 8 runs is more than two standard deviations away from the mean.
With 95% confidence we can say that these two sets of data are not coming from
the same type of population!
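Here is a sketch of the same test in code, using the mall and museum ages from the example. Repeated values are ordered at random, as the method prescribes, by attaching a random tie-breaker to each value; the function name and the seed are arbitrary choices made for this illustration.

```python
import math
import random

def label_sequence(sample_1, label_1, sample_2, label_2, rng):
    """Merge two samples in increasing order, breaking ties at random."""
    tagged = [(value, rng.random(), label_1) for value in sample_1]
    tagged += [(value, rng.random(), label_2) for value in sample_2]
    tagged.sort()   # by value first, then by the random tie-breaker
    return [label for _, _, label in tagged]

mall = [13, 18, 34, 17, 16, 30, 13, 47, 37, 35, 15, 35]
museum = [45, 52, 17, 28, 41, 63, 48, 23, 38, 60, 40, 40]

labels = label_sequence(mall, "M", museum, "A", random.Random(1))
a, b = labels.count("M"), labels.count("A")
n = a + b
runs = 1 + sum(1 for x, y in zip(labels, labels[1:]) if x != y)
mean = 2 * a * b / n + 1
sd = math.sqrt(2 * a * b * (2 * a * b - n) / (n ** 2 * (n - 1)))

print("".join(labels))
print(runs, mean, round(sd, 2))
print("same type of population?", abs(runs - mean) <= 2 * sd)
```

For this data the only tie between the two samples is the pair of 17s, and either ordering of that pair gives the same number of runs.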
Here's a fun example:
EXAMPLE: Here are the first twenty digits of $\pi$:
3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4
Do they seem random?
Answer: Do the median test:
One checks that the median is 4.5. The sequence of Aboves and Belows is:
BBBB|AA|B|AA|B|AAAAA|BBB|A|B
Here:
a = 10
b = 10
N = 20
There are 9 runs.
We have:
$\mu = 11$
$\sigma = 2.18$
The value of 9 runs is within two standard deviations of the mean. This sequence
looks random!
EXERCISE:
a) Write a sequence of H's and T's twenty symbols long that looks random to
you. (The number of H's need not be the same as the number of T's.) Perform
the runs test. Is your sequence random?
b) Flip a coin 20 times and record results. Perform a runs test for randomness
on your sequence!
RANK CORRELATION
Here's an opportunity to offer students a challenging exercise that illustrates the

way tools and ideas in statistics are created.
THE PROBLEM:
Five men (Albert, Bilbert, Cuthbert, Dilbert and Egbert) take part in a singing
contest and are ranked by two judges from 1 to 5 (with 1 as best and 5 as least
favored). For example, a possible outcome of the contest might be:
          Albert  Bilbert  Cuthbert  Dilbert  Egbert
Judge 1     1       4        3         5        2
Judge 2     3       5        2         4        1
If the judges followed purely objective assessment criteria and were completely
free of personal preferences, then we would expect the two rankings should be
identical. If, on the other hand, the judges followed no set procedures for their
ranking schemes and assigned rankings in a random fashion, then we would expect
very little or no correlation between the two lists. In the example presented above
we seem to be somewhere between these two extremes.
THE CHALLENGE:
Develop an index that takes two lists of rankings from two judges and, from
those lists, applies some formula or algorithm to those lists and computes a
numerical value, which we shall call R.
We would like R to have the following properties:
i) $0 \le R \le 1$
ii) R has value 1 if the two lists are identical.
iii) R has value 0 if the two lists are in complete disagreement. (e.g. The first judge
lists the candidates in the order 1, 2, 3, 4, 5 and the second judge in the order 5, 4,
3, 2, 1.)
Compute the value of your Rank Correlation Coefficient for the example above and
interpret the results.
Here are some possible approaches:
APPROACH 1: Given two lists, compute the difference of scores of each
contestant, square, and sum. This gives a number D.
In our example, we have:
$D = (1 - 3)^2 + (4 - 5)^2 + (3 - 2)^2 + (5 - 4)^2 + (2 - 1)^2 = 8$
The largest value D can possess (for a list of five numbers) is 40 and this occurs
when the rankings are in reverse order. (Why?). The smallest value D can possess is
0, and this occurs when the orders are in complete agreement. So set:
$R = 1 - \dfrac{D}{40}$
This does the trick.
In our example, $R = 1 - \dfrac{8}{40} = 0.8$, which indicates some disagreement.
Comment: This is the approach Charles Spearman took in 1904. He defined his
index to be $\rho = 1 - \dfrac{2D}{M}$ where M is the maximum value D could be for two lists n
entries long. Here $\rho = 1$ corresponds to complete agreement and $\rho = -1$ to
complete disagreement.
Note: One can show that $M = \dfrac{n(n^2 - 1)}{3}$ and this occurs if the two lists are in
reverse order of one another. [To see this, show what happens to the value D if
two numbers in one list are swapped. Show that the value of D increases if we swap
two elements that aren't already in reverse order.]
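Approach 1 and Spearman's variant can both be sketched in a few lines. The rankings are those of the two judges in the example; the function names are made up for this illustration.

```python
def sum_squared_differences(ranks_1, ranks_2):
    """D: sum of squared differences between the two judges' ranks."""
    return sum((x - y) ** 2 for x, y in zip(ranks_1, ranks_2))

def max_sum_squared(n):
    """M = n(n^2 - 1)/3, the value of D for two lists in reverse order."""
    return n * (n * n - 1) // 3

def agreement_index(ranks_1, ranks_2):
    """R = 1 - D/M, which runs from 0 (reversed) to 1 (identical)."""
    d = sum_squared_differences(ranks_1, ranks_2)
    return 1 - d / max_sum_squared(len(ranks_1))

def spearman_rho(ranks_1, ranks_2):
    """Spearman's index 1 - 2D/M, which runs from -1 to 1."""
    d = sum_squared_differences(ranks_1, ranks_2)
    return 1 - 2 * d / max_sum_squared(len(ranks_1))

judge_1 = [1, 4, 3, 5, 2]   # Albert, Bilbert, Cuthbert, Dilbert, Egbert
judge_2 = [3, 5, 2, 4, 1]

print(sum_squared_differences(judge_1, judge_2))
print(agreement_index(judge_1, judge_2), spearman_rho(judge_1, judge_2))
```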
APPROACH 2: Use absolute values instead of squaring in the previous approach.
What is the maximal value D can obtain in this case and when does it occur?
APPROACH 3: We can reorder the names of the contestants so that the list of ranks
for the first judge is 1, 2, 3, 4, 5. The list of ranks for the second judge changes
accordingly.
          Albert  Egbert  Cuthbert  Bilbert  Dilbert
Judge 1     1       2        3         4        5
Judge 2     3       1        2         5        4
Now look at each contestant in turn along the second row. Count the number of
scores to the right of each entrant with a lower score.
In our example, according to Judge 2, Albert has TWO lower scores to his right.
Egbert has ZERO lower scores to his right, Cuthbert ZERO, Bilbert ONE, Dilbert
ZERO.
Summing these scores gives a value S = 2 + 0 + 0 + 1 + 0 = 3.
If the rankings were in perfect agreement, then S would have value 0. If they
were in perfect disagreement (in reverse order), then S would have value 10, and
this is maximal. Set
$R = 1 - \dfrac{S}{10}$.
In our example, R = 0.7, indicating some disagreement.
Comment: In 1938 M. G. Kendall took an approach similar to this one.
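Approach 3 can be sketched as follows. The maximal value 10 for five contestants is replaced in the code by n(n − 1)/2, the number of pairs of contestants; that generalisation is an assumption added here, as are the function names.

```python
def lower_scores_to_right(judge_1, judge_2):
    """S: reorder by judge 1, then count lower judge-2 scores to the right."""
    order = sorted(range(len(judge_1)), key=lambda i: judge_1[i])
    second = [judge_2[i] for i in order]
    return sum(1
               for i in range(len(second))
               for j in range(i + 1, len(second))
               if second[j] < second[i])

def inversion_index(judge_1, judge_2):
    """R = 1 - S/max, where max = n(n - 1)/2 occurs for reversed lists."""
    n = len(judge_1)
    max_s = n * (n - 1) // 2
    return 1 - lower_scores_to_right(judge_1, judge_2) / max_s

judge_1 = [1, 4, 3, 5, 2]   # Albert, Bilbert, Cuthbert, Dilbert, Egbert
judge_2 = [3, 5, 2, 4, 1]

print(lower_scores_to_right(judge_1, judge_2))
print(inversion_index(judge_1, judge_2))
```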
****
Many approaches, of course, are possible.
The difficulty in this work is determining when and how a maximal value for a count
occurs (and generalizing this to a list of n contestants and not just five). [Approach
2 is problematic in this regard.]
PROBLEM SET III
Question 63: PERCENTILES and QUARTILES

One of the 99 values that divide a set of data in numerical order into 100 equal parts is
called a percentile. For example, the 90th percentile is the data value such that 90 percent
of the data points are below that value. Often scores in standardized tests are presented
in terms of percentiles. For example, if 525 students take an exam and 95% of the
students receive a score lower than 74 (and some student actually did earn a score of 74),
then the 95th percentile for the exam is 74.
It is often convenient to divide data sets into four equal parts. The lower (or first)
quartile, denoted Q1, is the 25th percentile. The middle (or second) quartile, Q2, the
median, is the 50th percentile, and the upper (or third) quartile, Q3, is the 75th percentile.
(COMMENT: As we have seen on page 14, there is some confusion over the definition of a
quartile. Notice that this paragraph defines it as the 25th percentile, and so, here at least,
a quartile should correspond to a data value.)
The following table shows test scores for 120 participants:
Score                 Number of Participants
97                    3
95                    1
89                    8
88                    10
86                    2
85                    6
83                    1
80                    31
Between 70 and 79     28
Below 70              30
a) What is the 90th percentile for this data?

b) What is the third quartile for this data?
c) What is the median for this data?
d) From the information presented is it possible to determine the mode? The mean?
The midrange?
Question 64: a) Find an example of SIX data points with:
Mean = 1000
Median = 10
Mode = 10
b) Find an example of SIX data points with:
Mean = 10
Median = 10
Mode = 1000
c) Find an example, if possible, of SIX data points with:
Mean = 10
Median = 1000
Mode = 10
Question 65: QUIRKY AVERAGES
The average American has one ovary and one testicle.
The average square on a checkerboard is grey.
The average roll on a die is 3.5.
On average, each planet of the solar system has 722 million human inhabitants.
Each of these statements is technically correct, but meaningless! (There is no roll
of 3.5 on a die, checkerboard squares are either black or white, who lives on Pluto?
and as for the average American, well ...)
Come up with two more quirky mis-uses of the average.

Question 66: BOX PLOTS

A box plot is a quick graphical representation showing the range of a data set,
the median of the data set, and the medians of the lower half and of the upper
half of the data set. For example, the following picture is a box plot:
We see that the data ranges from 20 to 70 and that the three divider marks at 35
or so (the left end of the box), at 42 or so (the line within the box), and at 65 or
so (the right end of the box) divide the data into four groups each representing
25% of the data.
a) What is the range and median according to the following box plot? 50%-
75% of the data lies within which range of values?
b) Draw a box plot for the following set of data:

1, 1, 1, 3, 4, 6, 6, 6, 6, 7, 10, 14.
Following page 14, be clear in your mind that the left end of the box lies
at position 2 and the right end at position 6.5. (The line within the box is the
median of the data. The lines at the end of the box are the medians of the
lower and upper halves of the data.)
Question 67: STEM AND LEAF PLOTS

Often data, given as whole numbers, is presented via a stem-and-leaf plot. Each
number is divided into two parts: the units digit (the leaf) and the digits to the left of
the units digit (the stem). In one column all the stems are listed, and in the second, all the
corresponding leaves.
For example, the data set:
22, 23, 26, 31, 31, 31, 38, 42, 63, 69, 69, 127, 129, 131
is presented:
 2 | 2, 3, 6
 3 | 1, 1, 1, 8
 4 | 2
 6 | 3, 9, 9
12 | 7, 9
13 | 1
KEY: 3|8 = 38
a) Draw a stem-and-leaf plot for the data:
113, 113, 114, 115, 116, 123, 123, 130, 130, 203, 308, 308, 319
b) Find the mean, median, and mode for the following data presented via stem-and-
leaf:
 0 | 1, 1, 4
 2 | 2, 3, 3
 4 | 2, 2, 2
 8 | 1, 8, 8
14 | 0
18 | 9, 9
KEY: 18|9 = 189
COMMENT: Other types of stem-and-leaf plots are possible. For example, one might use
the key 13|04 = 1304.
COMMENT: Stem-and-leaf plots are used to give a quick sense of the shape of the
data's distribution.
Question 68: Find the mean and standard deviation of the following data set:
5.6 5.2 4.6 5.7 4.9 6.4
Question 69: Consider the following set of data:
x Y
2 3
3 5
5 12
6 20
Draw a scatter diagram for this data.

Find the equation of the line of best fit for this data. Sketch that line on the
scatter diagram.
Visually does this line seem to fit the data well?
Compute the correlation coefficient for this data. What does this say about the
fit of the line?
Question 70: Consider the following set of data:
X Y
1 1.94
3 5.98
4 8.04
6 12.02
Draw a scatter diagram for this data.

Find the equation of the line of best fit for this data. Sketch that line on the
scatter diagram.
Visually does this line seem to fit the data well?
Compute the correlation coefficient for this data. What does this say about the
fit of the line?
Question 71:
a) Find the mean, median, and mode of the following test scores:
Score Number of Students

100 3
95 2
93 1
92 2
87 3
81 1
79 4
60 1
b) Another student later took the test and scored just three points. Describe, in
words only, what effect such a low-value additional data point will have on each of
the mean, median, and mode.
Question 72: In what way is the following graph misleading?

Question 73: Here is a stem-and-leaf plot. What is the mode of this data set?
(Here the data values are: 120, 120, 120, 130, ..., 615)
Question 74: The median of a data set is significantly larger than the mean of the
data set. What could cause this?
(A) There are a few exceptionally small values in the data set
(B) There are a few exceptionally large values in the data set
(C) The data values are tightly clustered around one value
(D) The data values are evenly spread across a range of values
Question 75: Draw a reasonably accurate pie chart for the following data:
Question 76: A sample of people at a mall were measured for their heights. The
results are displayed in the following histogram.
Based on this data, what are the chances that a person at the mall selected at
random is less than 61 inches tall?
Question 77: An investment company offers fifteen different investment options,

varying from low-risk to high-risk plans. The average rates of return on these plans
have been:
5% 5% 5% 5% 5% 5% 10% 10% 10% 15% 15% 15% 30% 90% 200%
a) Display this data by any visual means of your choice.

b) Compute the mean, mode, and median of this data.
A representative of this investment firm is talking with a potential new client.

When speaking about the central tendency of the company's return-rate figures,
would you advise the representative to speak about the mean, the mode, or the
median of the data values? Choose one and give justification for your choice.
Question 78: Scientists come up with an equation of best fit:
y = 0.98 x + 0.02
with correlation coefficient r = 0.01 . Would they want to use this equation to
predict values for y ? Explain.
Question 79: Here is some data displayed on a graph.
Which of the following seems like a reasonable line of best fit?
(A) 2y − 6t = 90 (B) 6t + 2y = 90 (C) 180 − 2y − 6t = 0 (D) 3t − y − 90 = 0
What seems to be a reasonable value for the correlation coefficient?
(A) 1.28 (B) -0.01 (C) 0.95 (D) 0.90
Question 80: Here is some data displayed on a graph.
The line of best fit is:

y = 45 − 2.1t
What does this model predict for the y-value at t = 2.5 ?

BACK TO NON-MTEL QUESTIONS
Question 81: A terrible disease is sweeping across the nation at an alarming rate.
Only 10% of people who catch the disease survive.
Two experimental serums have hurriedly been developed but only limited testing
has been done on them. Two people with the disease were given serum A and both
survived. Six people with the disease were given serum B and four survived.
a) Assuming that serum A had no effect, show that the chance of two people
naturally surviving the disease is 1%.
b) Assuming that serum B had no effect, show that the chance of four out of
six people naturally surviving the disease is 0.12%.
c) Given these figures, which serum is more likely to have had a better effect
on recovery?
Question 82: Another terrible disease is sweeping across the nation at an alarming
rate. 50% of people who catch the disease survive.
Three experimental serums have hurriedly been developed but only limited testing
has been done on them.
Serum A: Three people with the disease were given the serum and all three
survived.
Serum B: Ten people with the disease were given the serum and eight survived.
Serum C: Five people with the disease were given the serum and four survived.
You've just contracted the disease! Based on this limited information, which of the
three serums would you take and why?
Question 83: The following diagram gives the distribution of the number of
minutes Australian women can tolerate left-handed 8 year-old boys whistling while
chewing gum.
a) Verify that the area under this curve is one unit.
According to this distribution, if an Australian woman is chosen at random, what
is the probability that she can tolerate whistling
b) for a length of time between 5 and 15 minutes?

c) for less than one minute?
d) between 6 and 8 minutes?
e) for more than 3 minutes?
Question 84: A coin is tossed 8 times. Complete the following table.
# Heads that appear   8       7       6        5   4   3   2   1   0
Probability           0.391%  3.125%  10.938%
Question 85: You suspect a coin is biased. You toss it 12 times and heads appear
ten times. With what level of confidence would you say that the coin is indeed
biased?
Question 86: SOMETHING REALLY COOL !!!

American pennies are biased!
a) Stand 20 American pennies on edge on a table and then bang the table so
that they all fall over. Count the number of heads that appear. What do you
notice? Repeat.
b) Spin 20 American pennies and let them come to rest. Count the number of
tails that appear. What do you notice? Repeat.
Question 87: You suspect a die is biased towards landing 6. You toss the die six
times and get a six three of those times.
a) What are the chances of obtaining exactly three sixes in six rolls of a fair die?
b) Would you conclude that the die above is biased? If so, with what level of
confidence would you make such a claim?
Question 88: A company manufactures bolts. If 5% of the bolts they produce are
defective, what are the chances that four bolts chosen at random are:
a) all defective?
b) all but one is defective?
c) none are defective?
Question 89: The distribution of weights of woogles is known to be symmetrical

and triangular, with mean 50 pounds and range 20 pounds.
a) A woogle is selected at random. What is the probability that its weight is

between 30 and 50 pounds?
b) A woogle is selected at random. What is the probability that its weight is
over 60 pounds?
c) Find the value c so that 75% of the woogles have weight between 30 and c
pounds.
Question 90: Use a table of values for the normal distribution curve (with mean 0
and standard deviation 1) to find the area under the curve between:
a) z = 0 and z = 1.4
b) z = -0.70 and z = 0
c) z = 1.1 and z = 1.2
d) z = -1.8 and z = 0.6
e) all z values below -0.1
f) all z values above 0.1
Question 91: The mean weight of 1000 high-school students is 147 pounds with
standard deviation 17 pounds. Assume the weights are normally distributed.
a) How many students weigh between 130 and 164 pounds?

b) How many students weigh between 113 and 181 pounds?
Use z-scores and the table of values for the normal distribution curve (with mean
0 and standard deviation 1) to find
c) The number of students who weigh between 147 and 152 pounds.
d) The number of students who weigh between 132 and 150 pounds.
Question 92: A company produces cars that last an average of 10.2 years on the
road with standard deviation 4.3 years. You buy a car from the company. Assuming
that car ages are normally distributed, what are the chances that your car will be
on the road for over 20 years?
Question 93: A company produces light bulbs with average lifespan 40 hours
(standard deviation 4 hours).
a) Consumer advocates across the nation test 100 light bulbs and calculate the
average life span of the bulb according to their samples. To a close approximation,
what mean do they obtain with what standard deviation?
b) A year later they repeat the experiment but this time testing 1000 light bulbs
each. To a close approximation, what mean do they obtain with what standard
deviation?
Question 94: In a hand of Blackjack one has a 45% chance of winning a dollar and
a 55% chance of losing a dollar.
Following the same analysis as we did in class for the game of Roulette
a) Show that the mean and standard deviation for a single hand of Blackjack
are given by:
$\mu = -0.10$
$\sigma = 1.00$
(This issue of whether to divide by n or n-1 for standard deviation is annoying.
Here I divided by n-1.)
b) A habitual gambler attends the casino every night and plays 100 hands of
blackjack, betting a dollar each and every time. Find the range of winnings
she can expect (99.7% of the time) for each night of gambling.
c) The casino sees 100,000 hands of blackjack played per night. Find the range
of profit they can expect (99.7% of the time) per night from Blackjack.
Question 95: A simple dice game is played as follows:
Roll a six, win $4.

Roll anything else, lose a dollar.
a) Show that the mean and standard deviation for a single play of this game
are:
$\mu = -0.167$
$\sigma = 2.041$
b) A habitual gambler plays 50 rounds of this game every day. Find the range of
winnings she can expect (99.7% of the time) for each day of gambling.
c) The casino sees 1 000 000 rounds of this game per day. Find the range of
profit they can expect (99.7% of the time) per day from this game.
Question 96: Rabbits who eat carrots only have weights that are normally
distributed with mean 12.5 pounds and standard deviation 3.2 pounds.
a) Attilla is a rabbit weighing 15.2 pounds. Is this unusual?
b) Priscilla is a rabbit weighing 19.1 pounds. Does Priscilla eat only carrots?
With what level of confidence do you answer this question?
Question 97: All men with the name of JIM have a handsomeness value that is
normally distributed with mean 86.6 and standard deviation 2.3.
a) What proportion of men named Jim have handsomeness value 82 or less?
b) What proportion of men named Jim have handsomeness value 93.5 or higher?
c) Your instructor has handsomeness value of 100. How many standard
deviations above the norm is this?
Question 98: The mean weight of floogles is not known, but it is known that their
weights vary with standard deviation $\sigma = 4$ units.
A biologist measured the weights of 60 floogles and found that her sample had
mean m = 143.2.
Find the 95% confidence interval for the mean weight of all floogles.
Question 99: Gibgobs have heat factors that vary about some unknown mean with
standard deviation $\sigma = 12$. A scientist measured the heat factor of four gibgobs.
He obtained the values:
133, 146, 137 and 140
Find the 95% confidence interval for the mean heat factor of all gibgobs.
Question 100: A soccer ball company is meant to produce soccer balls of radius
12.4 cm with an error of at most 0.4 cm.
They set their machines so that the balls they produce have mean $\mu = 12.4$ with
standard deviation $\sigma = 0.2$. They produce 500 balls per day. How many balls per day
must be rejected?
Question 101: Suppose a population has normal distribution.
$\mu$ is unknown, but it is suspected to have value 4.
$\sigma = 0.03$
A sample of size 50 was found to have average value m = 4.01.
What do you think about the suspicion that $\mu = 4$?
Question 102: Find the p-value for the sample mean of 4.01 in the previous
example.
Question 103: A company produces cables with breaking strengths of mean 1800
lbs and standard deviation 100 lb. A new manufacturing technique, however, claims
to increase the average breaking strength. To test this claim, a sample of 50
cables is examined and is found to have mean breaking strength 1845 lb.
a) If the new technique had no effect on breaking strength, how likely is it
that a batch of 50 cables would have an average breaking strength of 1845
lbs?
b) Do you think the new technique had an effect?
Question 104: Two drugs, A and B, are being tested for possible cure to a disease.
The following data has been collected thus far. Does there seem to be any
correlation worth pursuing?
Question 105: Does there seem to be a correlation between favourite colour and
shoe size?
Question 106: The following table shows test scores of students in a physics
course and the same students in a math course. Does there seem to be a
correlation between math and physics proficiency?
Question 107: According to this data do you think there is some connection
between marital status and performance in a Prob. and Statistics course?
Question 108: A ball bearing company produces balls with a diameter, hopefully,
of mean value $\mu = 11.5$ mm with a tolerance given by standard deviation $\sigma = 0.2$.
Each day the company selects 20 balls at random and computes the mean of that
sample. Over the course of three weeks they collected the following values:
11.602 11.547 11.312 11.449 11.401 11.608 11.471 11.453 11.446 11.522 11.664
11.823 11.629 11.602 11.756 11.707 11.612 11.628 11.602 11.816 11.812
a) Is there a trend moving away from the mean 11.5?

b) What is the first data value that deviates from the 99.7% (three standard
deviation) range from the mean?
Question 109: Twenty-five people were asked to try a new type of gum. They
were asked whether or not they liked it: yes or no. The results are as follows:
YYNNNNYYYNYNN YNNNNNYYYYNN
How many runs are in this sequence? Does this sequence appear random?
Question 110:
a) In how many different ways can one arrange 3 As and 3Bs?
b) List all the ways and count the number of runs that appear in each.
c) What is the mean number of runs and what is the standard deviation for the
number of runs?
Question 111: Here are the first 20 digits of $\sqrt{2}$. Do they seem random?
14142135623730950488
Question 112: Use the above/below median test to determine whether or not the
following list of data values appear random:
8 15 9 12 10 7 11 8 13 9 11
Question 113:
a) Write a list of Hs and Ts that seem random to you. Do a list that is 20 long.
(The number of Hs and Ts do not have to be the same.)
Test your sequence for randomness.
b) Flip a coin 20 times and record the list of Hs and Ts that result. Test the
sequence for randomness.
Question 114: According to a statistical survey: Australian men have blood
pressure 118 ± 13 ms (with 95% confidence). What does this mean?
(A) 95% of all Australian men have blood pressure 118.
(B) 95% of all Australian men have blood pressure under 131.
(C) 95% of all Australian men have blood pressure between 105 and 131.
(D) There is a 95% chance that the average blood pressure of all Australian
men lies somewhere between 105 and 131.
Question 115: A survey displays milk preferences amongst men and women:
                 Men   Women
Whole Milk        10       3
2% Milk           18      16
Non-Fat            7      15
No Preference      6      12
Does there seem to be a correlation between gender and milk preference?
(Actually, MTEL will not ask you to do a chi-squared analysis, but do this one in any
case!)
Question 116: Administrators at a local grocery store conduct a survey on the
average number of gallons of milk a customer buys per day. They interview N
customers per day, and thus work with samples of size N. Administrators later
decide to interview 3N customers per day, thereby tripling the sample sizes.
What effect will this have on the sampling error?
(A) No effect
(B) Decrease sampling error
(C) Increase sampling error
(D) Not enough information to say.
COMMENT: The term "sampling error" is vague here. The question is really:
"Which would yield the least spread of data values (that is, the least standard
deviation): calculating the mean of samples of size N or calculating the mean
of samples of size 3N?"
PART IV of IV
BRIEF INTRODUCTION TO
MORE ADVANCED THINKING
(and filling in some gaps!)
James Tanton
2008 James Tanton
CONTENTS:
Mean and Variance Revisited: The Human Perspective
One Data Set
Two Data Sets
Playing with Formulas
Vectors
Mean and Variance: The Perspective of the Gods
Random Variables
Sum, differences, multiples
Connection to Central Limit Theorem
Cereal Box Problem
Geometric Distribution
Binomial Distribution
Proportions
Student's t-distribution
Chi Squared distribution
Chebyshev's Inequality
PART FOUR: 2
MEAN AND VARIANCE REVISITED: THE HUMAN PERSPECTIVE
ONE SET OF DATA: A geometric perspective
Suppose we run an experiment and gain from it n data values:
$$x_1, x_2, \ldots, x_n$$
and, as mere mortals, we know nothing more about the situation than these n values. (That
is, we have no understanding about what to expect from the experiment such as the mean
value, the variation from the mean, the underlying frequency of data values behind the
scenes, etc.)
But if the experiment were ideal, meaning that outcomes were absolutely and utterly
repeatable, then we would expect no variation in data values at all. This means that all
measurements would adopt exactly the same value q, say.
How close is our data $(x_1, x_2, \ldots, x_n)$ to an ideal $(q, q, \ldots, q)$?
To answer this question we seek a value q so that the point $M = (q, q, \ldots, q)$ is as close as
possible to our point $P = (x_1, x_2, \ldots, x_n)$. We want to choose a value q that minimizes the
distance:
$$PM = \sqrt{(x_1 - q)^2 + (x_2 - q)^2 + \cdots + (x_n - q)^2}$$
It is easier just to minimize the quantity under the square root sign.
Now
$$(x_1 - q)^2 + (x_2 - q)^2 + \cdots + (x_n - q)^2 = nq^2 - 2(x_1 + \cdots + x_n)q + (x_1^2 + \cdots + x_n^2)$$
is a quadratic in q and has minimum value for:
$$q = \frac{2(x_1 + \cdots + x_n)}{2n} = \frac{x_1 + \cdots + x_n}{n} = \bar{x},$$
the data's mean.
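A quick numerical check of this claim can be reassuring. The sketch below uses a made-up data set and helper names of our own choosing (nothing here comes from the notes themselves). It scans candidate values q on a fine grid and confirms that the sum of squared deviations is smallest at the data mean.

```python
def sum_sq_deviations(data, q):
    """Quantity under the square root: (x1 - q)^2 + ... + (xn - q)^2."""
    return sum((x - q) ** 2 for x in data)


def data_mean(data):
    """The data mean x-bar."""
    return sum(data) / len(data)


# Hypothetical data set, chosen only for illustration.
data = [2.0, 3.0, 7.0, 8.0]
x_bar = data_mean(data)

# Candidate values q on a grid of step 0.01 centred on the mean.
candidates = [x_bar + k / 100 for k in range(-200, 201)]
best_q = min(candidates, key=lambda q: sum_sq_deviations(data, q))

print("data mean:", x_bar)
print("best q on the grid:", best_q)
```

A grid search is of course no proof; the algebra above is the proof. The code only illustrates it for one data set.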

The minimum value under the square root sign thus occurs when $q = \bar{x}$ and the minimum
value is:
$$(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2$$
This is a sum of individual deviations, squared, and in and of itself is a measure of the total
spread of values. Dividing by n gives an average spread. The resulting quantity is the
VARIANCE of the data:
$$\mathrm{Var}(x_1, \ldots, x_n) = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}$$
COMMENT: If the data is a measurement of length, say, then each $x_i$ has a unit of meters
perhaps and so $\mathrm{Var}(x_1, \ldots, x_n)$ has units of meters squared. It is handy to have a measure
of spread in the same units as the data. For this reason, folk take the square root of
variance and call the result STANDARD DEVIATION:
$$\sigma(x_1, \ldots, x_n) = \sqrt{\mathrm{Var}(x_1, \ldots, x_n)} = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}}$$
COMMENT: As we have seen, many texts alter these definitions slightly. Mathematicians
note the following:
FACT: $(x_1 - \bar{x}) + (x_2 - \bar{x}) + \cdots + (x_n - \bar{x})$ equals zero.
(EXERCISE: Show this!)
Thus if one knows the value of $n - 1$ of the terms $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$, then one can
deduce the value of the nth one from the fact that their sum should be zero.
So among the values $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$, there are only $n - 1$ real pieces of
information. To reflect this, many choose to divide by $n - 1$ rather than n and set
$$\mathrm{Var}(x_1, \ldots, x_n) = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} \quad \text{and} \quad \sigma(x_1, \ldots, x_n) = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$
IN THIS CHAPTER OF THE NOTES WE SHALL DIVIDE BY n.
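To see the two conventions side by side, here is a small sketch on a made-up data set. The function names are ours. The comparison calls `statistics.pvariance` (divide by n) and `statistics.variance` (divide by n - 1) are from Python's standard library.

```python
import statistics


def variance_n(data):
    """Variance dividing by n (the convention used in this chapter)."""
    x_bar = sum(data) / len(data)
    return sum((x - x_bar) ** 2 for x in data) / len(data)


def variance_n_minus_1(data):
    """Variance dividing by n - 1 (the alternative convention)."""
    x_bar = sum(data) / len(data)
    return sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)


# Hypothetical data set, chosen only for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
x_bar = sum(data) / len(data)
deviations = [x - x_bar for x in data]

print("sum of deviations:", sum(deviations))  # the FACT above: zero
print("divide by n:      ", variance_n(data), statistics.pvariance(data))
print("divide by n - 1:  ", variance_n_minus_1(data), statistics.variance(data))
```

The divide-by-(n - 1) value is always a little larger, and the gap shrinks as n grows.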

TWO SETS OF DATA: A summary and a geometric interpretation
Suppose we run an experiment and record two sets of data values from it. (For example,
our experiment could be to ask passersby for their heights and their shoe sizes.) We have
data values:
$$x_1, x_2, \ldots, x_n$$
$$y_1, y_2, \ldots, y_n$$
We can plot the points $(x_i, y_i)$ on a diagram to create a SCATTER PLOT.
The plot might reveal a relationship (CORRELATION) between the data values. If there
seems to be a linear correlation, then one might be interested in finding a straight line
that approximates the data points well.
LINE OF BEST FIT:

It seems reasonable to believe that the best line for the data should go through the
most average point for the data: $(\bar{x}, \bar{y})$. So the line of best fit should have an equation of
the form:
$$y - \bar{y} = m(x - \bar{x})$$
for some best slope m yet to be determined.
The line of best fit should minimize the total sum of squared deviations from that line. So for the
data point $x_i$ the line of best fit predicts the value $m(x_i - \bar{x}) + \bar{y}$ compared to the actual
data value $y_i$. We need a value m that minimizes:
$$\begin{aligned}
D &= \left(y_1 - \bar{y} - m(x_1 - \bar{x})\right)^2 + \cdots + \left(y_n - \bar{y} - m(x_n - \bar{x})\right)^2 \\
&= m^2\left((x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right) - 2m\left((x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})\right) \\
&\quad + \left((y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2\right)
\end{aligned}$$
This is a quadratic in m and has minimum value for:
$$m = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})}{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}$$
Folk define:
$$S_{xx} = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n} = \mathrm{Var}(x_1, \ldots, x_n)$$
$$S_{xy} = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})}{n}$$
$$S_{yy} = \frac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n} = \mathrm{Var}(y_1, \ldots, y_n)$$
And the line of best fit (LEAST SQUARES METHOD) is:
$$y - \bar{y} = \frac{S_{xy}}{S_{xx}}(x - \bar{x})$$
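The least squares recipe is short enough to compute by hand, or in a few lines of code. The sketch below follows the formulas above literally (dividing by n throughout). The data values and function names are invented for illustration.

```python
def mean(values):
    """The data mean."""
    return sum(values) / len(values)


def s(us, vs):
    """S_uv: the average product of deviations, dividing by n."""
    u_bar, v_bar = mean(us), mean(vs)
    return sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs)) / len(us)


def line_of_best_fit(xs, ys):
    """Return (slope, intercept) of the line y - y_bar = (S_xy / S_xx)(x - x_bar)."""
    slope = s(xs, ys) / s(xs, xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept


# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, intercept = line_of_best_fit(xs, ys)
print("slope m =", slope)
print("line: y =", slope, "* x +", intercept)
```

Note that the factor 1/n cancels in the ratio $S_{xy}/S_{xx}$, so the slope is the same whether one divides by n or by n - 1.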
CORRELATION COEFFICIENT:
We created a line that minimizes the total amount of scattering D of y-values about that
line. Here:
$$D = \left(y_1 - \bar{y} - m(x_1 - \bar{x})\right)^2 + \cdots + \left(y_n - \bar{y} - m(x_n - \bar{x})\right)^2$$
The quantity $T = (y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2$ is a measure of the amount of scattering of y-
values in general. We can also view this as the amount of scattering about the horizontal
line $y = \bar{y}$, which is not the line of best fit. Since D is the minimal value for all lines, we
have $D \le T$.
As we have seen, the proportion $\frac{T - D}{T}$, with value between 0 and 1, is a measure of
desired scattering. To make sense of this note that:
$\frac{T - D}{T} = 0$ means $T = D$, which says that the amount of scattering about a
supposed line of best fit is no different from the amount of scattering in general.
THERE IS NO CORRELATION between the data values at all.
$\frac{T - D}{T} = 1$ means $D = 0$, which says that there is absolutely no scatter about the
line of best fit, that is, the data fits this line exactly. We have PERFECT LINEAR
CORRELATION between the data values.
An exercise in algebra gives:
$$\frac{T - D}{T} = \frac{(S_{xy})^2}{S_{xx} S_{yy}}$$
and people usually denote this quantity $R^2$, calling it THE CORRELATION COEFFICIENT.
COMMENT: Actually people usually set $R = \pm\sqrt{\frac{(S_{xy})^2}{S_{xx} S_{yy}}}$ using the + sign if the slope m is
positive and the - sign if m is negative.
IF $(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y}) = 0$ THEN THERE IS ABSOLUTELY NO
CORRELATION BETWEEN DATA VALUES.
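The "exercise in algebra" can at least be sanity-checked numerically. The sketch below computes $R^2$ both ways, once as $(T - D)/T$ and once as $(S_{xy})^2 / (S_{xx} S_{yy})$, on an invented data set. The function name and the data are our own.

```python
def r_squared_two_ways(xs, ys):
    """Return ((T - D) / T, S_xy^2 / (S_xx * S_yy)) for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    s_xx = sum((x - x_bar) ** 2 for x in xs) / n
    s_yy = sum((y - y_bar) ** 2 for y in ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

    m = s_xy / s_xx  # slope of the line of best fit

    # T: scatter about the horizontal line y = y_bar.
    t = sum((y - y_bar) ** 2 for y in ys)
    # D: scatter about the line of best fit.
    d = sum((y - y_bar - m * (x - x_bar)) ** 2 for x, y in zip(xs, ys))

    return (t - d) / t, s_xy ** 2 / (s_xx * s_yy)


# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

from_scatter, from_formula = r_squared_two_ways(xs, ys)
print(from_scatter, from_formula)
```

The two numbers agree up to rounding, as the algebra promises. Data lying exactly on a line gives the value 1.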
PLAYING WITH FORMULAS FOR THE FUN OF IT:
Given two sets of data values from an experiment (which we label set X and set Y):
$$X: x_1, x_2, \ldots, x_n$$
$$Y: y_1, y_2, \ldots, y_n$$
we can create new data sets by adding or multiplying all values (which we label X + Y and
XY):
$$X + Y: x_1 + y_1, x_1 + y_2, \ldots, x_n + y_n$$
$$XY: x_1 y_1, x_1 y_2, \ldots, x_n y_n$$
NOTE: If there are originally n data values for X and for Y, there are $n^2$ values for X + Y
and for XY.
The mean of X + Y:
The average value of the X + Y data set is:
$$\begin{aligned}
\frac{(x_1 + y_1) + (x_1 + y_2) + \cdots + (x_n + y_n)}{n^2}
&= \frac{nx_1 + (y_1 + \cdots + y_n) + nx_2 + (y_1 + \cdots + y_n) + \cdots + nx_n + (y_1 + \cdots + y_n)}{n^2} \\
&= \frac{n^2 \bar{x} + n^2 \bar{y}}{n^2} \\
&= \bar{x} + \bar{y}
\end{aligned}$$
The mean of XY:
The average value of the XY data set is:
$$\begin{aligned}
\frac{x_1 y_1 + x_1 y_2 + \cdots + x_n y_n}{n^2}
&= \frac{x_1(y_1 + \cdots + y_n) + x_2(y_1 + \cdots + y_n) + \cdots + x_n(y_1 + \cdots + y_n)}{n^2} \\
&= \frac{nx_1 \bar{y} + nx_2 \bar{y} + \cdots + nx_n \bar{y}}{n^2} \\
&= \frac{n^2 \bar{x}\,\bar{y}}{n^2} = \bar{x}\,\bar{y}
\end{aligned}$$
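The variance of X + Y: The variance of the X + Y data set, about its data mean of $\bar{x} + \bar{y}$,
is a little long and scary looking, but actually not too tricky to work out. Here goes:
$$\begin{aligned}
&(x_1 + y_1 - \bar{x} - \bar{y})^2 + (x_1 + y_2 - \bar{x} - \bar{y})^2 + \cdots + (x_n + y_n - \bar{x} - \bar{y})^2 \\
&= (x_1 - \bar{x})^2 + 2(x_1 - \bar{x})(y_1 - \bar{y}) + (y_1 - \bar{y})^2 + (x_1 - \bar{x})^2 + 2(x_1 - \bar{x})(y_2 - \bar{y}) + (y_2 - \bar{y})^2 \\
&\quad + \cdots + (x_n - \bar{x})^2 + 2(x_n - \bar{x})(y_n - \bar{y}) + (y_n - \bar{y})^2 \\
&= n(x_1 - \bar{x})^2 + \cdots + n(x_n - \bar{x})^2 + n(y_1 - \bar{y})^2 + \cdots + n(y_n - \bar{y})^2 \\
&\quad + 2(x_1 - \bar{x})\left((y_1 - \bar{y}) + (y_2 - \bar{y}) + \cdots + (y_n - \bar{y})\right) \\
&\quad + 2(x_2 - \bar{x})\left((y_1 - \bar{y}) + (y_2 - \bar{y}) + \cdots + (y_n - \bar{y})\right) \\
&\quad + \cdots + 2(x_n - \bar{x})\left((y_1 - \bar{y}) + (y_2 - \bar{y}) + \cdots + (y_n - \bar{y})\right) \\
&= n(x_1 - \bar{x})^2 + \cdots + n(x_n - \bar{x})^2 + n(y_1 - \bar{y})^2 + \cdots + n(y_n - \bar{y})^2 + 0 + 0 + \cdots + 0 \\
&= n(x_1 - \bar{x})^2 + \cdots + n(x_n - \bar{x})^2 + n(y_1 - \bar{y})^2 + \cdots + n(y_n - \bar{y})^2
\end{aligned}$$
[We used the fact that $(y_1 - \bar{y}) + (y_2 - \bar{y}) + \cdots + (y_n - \bar{y}) = 0$.]
Divide by $n^2$ to get:
$$\begin{aligned}
\mathrm{Var}(X + Y) &= \frac{n(x_1 - \bar{x})^2 + \cdots + n(x_n - \bar{x})^2}{n^2} + \frac{n(y_1 - \bar{y})^2 + \cdots + n(y_n - \bar{y})^2}{n^2} \\
&= \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n} + \frac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n} \\
&= \mathrm{Var}(X) + \mathrm{Var}(Y)
\end{aligned}$$
We have:
FOR ANY TWO DATA SETS X AND Y:
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$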
(for variance computed with respect to the data mean).
COMMENT: There is no easy formula for Var(XY). (Try it!)
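For those who like to see such identities confirmed on actual numbers, here is a small sketch. It builds all $n^2$ sums and products from two made-up data sets and checks the formulas for the mean of X + Y, the mean of XY, and the variance of X + Y. Exact fractions are used so that no rounding clouds the comparison. The data and function names are our own.

```python
from fractions import Fraction
from itertools import product


def mean(values):
    """Data mean as an exact fraction (values must be integers or fractions)."""
    return Fraction(sum(values), len(values))


def variance(values):
    """Data variance about the data mean, dividing by the number of values."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Hypothetical data sets, chosen only for illustration.
X = [1, 4, 6]
Y = [2, 3, 10]

# All n^2 pairings, exactly as in the text.
sums = [x + y for x, y in product(X, Y)]
products = [x * y for x, y in product(X, Y)]

print("mean of X + Y:", mean(sums), "=", mean(X) + mean(Y))
print("mean of XY:   ", mean(products), "=", mean(X) * mean(Y))
print("Var(X + Y):   ", variance(sums), "=", variance(X) + variance(Y))
print("Var(XY):      ", variance(products), "versus", variance(X) * variance(Y))
```

The last line shows that the obvious guess for Var(XY) fails on this data, in keeping with the comment above.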

ASIDE ON VECTORS:
Given a set of data values $x_1, x_2, \ldots, x_n$ form the vector
$$v_x = \langle x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x} \rangle$$
This vector has the property that its entries sum to zero.
Our formulas can be rewritten in terms of vector notation.
For example,
$$\mathrm{Var}(X) = \frac{\|v_x\|^2}{n}$$
$$\sigma(X) = \frac{\|v_x\|}{\sqrt{n}}$$
$$S_{xy} = \frac{v_x \cdot v_y}{n}$$
$$R^2 = \frac{(S_{xy})^2}{S_{xx} S_{yy}} = \frac{(v_x \cdot v_y)^2}{\|v_x\|^2 \|v_y\|^2} = \left(\frac{v_x \cdot v_y}{\|v_x\| \, \|v_y\|}\right)^2 = \cos^2\theta$$
where $\theta$ is the angle between the vectors $v_x$ and $v_y$.

MEAN AND VARIANCE: THE PERSPECTIVE OF THE GODS
Let's now assume that we are omniscient and are fully aware of all information
about all experiments ever run. For any experiment we now assume we know all
possible values that can occur and the likelihood of each and every particular value
actually occurring. That is, we know the PROBABILITY DISTRIBUTION of any
given experiment.
Definition: A (discrete) RANDOM VARIABLE X is a set of values $x_1, x_2, x_3, \ldots$ along
with a set of probabilities $p_1, p_2, p_3, \ldots$ with $p_i$ representing the chances of the value $x_i$
actually appearing. (We have that the sum of probabilities $p_1 + p_2 + p_3 + \cdots$ is 1.)
COMMENT: We are usually not God-like and do not know the true nature of a random
variable. For example, who really knows the probability distribution of the number of
hummingbirds that will visit someone's feeder during a given hour of the day while the
homeowner happens to be watching the TV tuned to a prime-numbered station?
On occasion we mere mortals do have glimpses into the world of the Gods. For example, let
X be the random variable: "All values of a fair die." We know the values of this random
variable are 1, 2, 3, 4, 5, 6 with probability distribution $\frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}$.
Definition: The EXPECTED VALUE, denoted E(X) or $\mu$, of a random variable X is the
quantity:
$$\mu = E(X) = p_1 x_1 + p_2 x_2 + p_3 x_3 + \cdots$$
This is the Gods' version of the average value. To see why, imagine we ran the experiment
n times. Then $p_i$, in an ideal setting, is the proportion of times we can expect to see the
value $x_i$. So for n runs of the experiment we should expect to see $x_i$ a total of $np_i$ times.
The average result we expect to see, in the ideal case, is thus:
$$\frac{p_1 n x_1 + p_2 n x_2 + p_3 n x_3 + \cdots}{n} = p_1 x_1 + p_2 x_2 + p_3 x_3 + \cdots = E(X)$$
The expected value of rolling a die is $\mu = \frac{1}{6} \cdot 1 + \frac{1}{6} \cdot 2 + \frac{1}{6} \cdot 3 + \frac{1}{6} \cdot 4 + \frac{1}{6} \cdot 5 + \frac{1}{6} \cdot 6 = 3.5$.
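The definition of expected value translates directly into code. Below is a minimal sketch, with a function name of our own choosing, that computes $p_1 x_1 + p_2 x_2 + \cdots$ for a finite list of values and applies it to the fair die. Exact fractions keep the answer free of rounding.

```python
from fractions import Fraction


def expected_value(values, probabilities):
    """E(X) = p1*x1 + p2*x2 + ... for a finite discrete random variable."""
    if sum(probabilities) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in zip(values, probabilities))


# The fair die: values 1 to 6, each with probability 1/6.
die_values = [1, 2, 3, 4, 5, 6]
die_probabilities = [Fraction(1, 6)] * 6

mu = expected_value(die_values, die_probabilities)
print("E(X) for a fair die:", mu, "=", float(mu))
```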
COMMENT: We need to be careful to distinguish between the ideal God-like situation and
the non-ideal mortal reality. When running an experiment, such as rolling a die, it is unlikely
our data values will ever actually have mean value E(X)! (Try it. Roll a die six times and
compute the average result. It probably isn't 3.5, though it can happen!)
Suppose we do run an experiment n times and obtain n data values: $x_1, x_2, \ldots, x_n$.
God knows the value of the true mean $\mu = E(X)$.
We don't know the true mean. We can only compute the data mean:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
We can hope that the data mean $\bar{x}$ is a close approximation to the true mean $\mu$.
COMMENT: We saw for the data mean that the values $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$ sum to zero
and so represent only $n - 1$ truly independent values.
In the God world, the values $x_1 - \mu, x_2 - \mu, \ldots, x_n - \mu$ are not guaranteed to sum to zero and
so actually do represent n independent values.
So from the perspective of the Gods, dividing by n, rather than by $n - 1$, is
always appropriate.
Moving on ...
As mere mortals we defined the variance as the average of the squared deviations from the
data mean. The Gods' analogue of variance is thus:
$$\mathrm{Var}(X) = (x_1 - \mu)^2 p_1 + (x_2 - \mu)^2 p_2 + \cdots$$
And the standard deviation of a random variable is:
$$\sigma(X) = \sqrt{\mathrm{Var}(X)} = \sqrt{(x_1 - \mu)^2 p_1 + (x_2 - \mu)^2 p_2 + \cdots}$$
Playing with formulas:
If X is a random variable and k is a constant, then kX is the random variable with all values
multiplied by k with the same underlying probability distribution; and X + k is the random
variable with all values increased by k. (See the section SUMS, DIFFERENCES and MULTIPLES
OF RANDOM VARIABLES for more.) We shall prove in the next section:
THEOREM:
$$E(kX) = kE(X)$$
$$\mathrm{Var}(kX) = k^2 \mathrm{Var}(X)$$
$$E(X + k) = E(X) + k$$
$$\mathrm{Var}(X + k) = \mathrm{Var}(X)$$
If X and Y are two random variables then X + Y is the random variable with values $x_i + y_j$
where $x_i$ is a value adopted by X, $y_j$ is a value adopted by Y, and the probability
associated with $x_i + y_j$ is $P(X = x_i \text{ and } Y = y_j)$. How one computes this probability
depends on the nature of X and Y. The nicest situation of all would be if, as in naïve
probability theory, the word "and" continues to translate into an action of multiplication.
Definition: Two random variables are INDEPENDENT if $P(X = x_i \text{ and } Y = y_j)$ equals the
product $P(X = x_i) P(Y = y_j)$.
For example, if X is the roll of a die and Y is the spin of a spinner with numbers 1 through
10 (each equally likely), then there are 60 equally likely outcomes for the pair (X, Y) and
$P(X = 3 \text{ and } Y = 8)$, say, equals $\frac{1}{60}$. This equals $\frac{1}{6} \cdot \frac{1}{10}$, which is $P(X = 3) P(Y = 8)$. Here X
and Y are independent.
THEOREM: If X and Y are independent, then
$$E(X + Y) = E(X) + E(Y)$$
$$E(XY) = E(X) E(Y)$$
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$
$$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$
Proof: Next section.

COMMENT: It is curious that Var ( X + Y ) = Var ( X ) + Var (Y ) is always true in our mortal
world (calculated with data means) and not always true for the God-world (calculated with
actual means). We require the condition that X and Y should be independent for this
result to hold in the God world. What is the difference?
In our mortal world we examined two data sets:
$$x_1, x_2, \ldots, x_n$$
$$y_1, y_2, \ldots, y_n$$
We did not know the true means of the underlying random variables X and Y, but
calculated instead just the data means $\bar{x}$ and $\bar{y}$:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} x_1 + \frac{1}{n} x_2 + \cdots + \frac{1}{n} x_n$$
$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{1}{n} y_1 + \frac{1}{n} y_2 + \cdots + \frac{1}{n} y_n$$
But these expressions each look like the expected value of a random variable.
Let's create our own, human, random variables: X' and Y':
X' has values $x_1, x_2, \ldots, x_n$ with probabilities $\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}$
Y' has values $y_1, y_2, \ldots, y_n$ with probabilities $\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}$
Then E(X') is the data mean $\bar{x}$ and E(Y') is the data mean $\bar{y}$.
X' and Y' aren't the real random variables lurking behind the data sets, but they are the
ones we see by only looking at the data.
We next calculated the sums of data values:
$$x_1 + y_1, x_1 + y_2, \ldots, x_n + y_n$$
This gives $n^2$ values and we computed their mean as $\frac{(x_1 + y_1) + (x_1 + y_2) + \cdots + (x_n + y_n)}{n^2}$.
But in doing this we tacitly assumed that each of the pairs in this sum, $x_i + y_j$, has the
same frequency as any other pair. That is, we assumed each pair is equally likely, and so,
since there are $n^2$ pairs in all, each pair $x_i + y_j$ comes with probability $\frac{1}{n^2}$.
But this makes X' and Y' independent random variables:
$$P(X' = x_i \text{ and } Y' = y_j) = \frac{1}{n^2}$$
$$P(X' = x_i) P(Y' = y_j) = \frac{1}{n} \cdot \frac{1}{n} = \frac{1}{n^2}$$
So $P(X' = x_i \text{ and } Y' = y_j) = P(X' = x_i) P(Y' = y_j)$.
Our human constructs X' and Y' obey all the conditions of the Gods and so obey:
$$\mathrm{Var}(X' + Y') = \mathrm{Var}(X') + \mathrm{Var}(Y')$$
We did not realize it at the time, but in Part I of these notes we were implicitly drawn to
mimicking the ways of the Gods! (Such is always the wont of mankind?)
RANDOM VARIABLES: THE THEORY AND PROOFS
Loosely speaking, a random variable X is a quantity whose value is not known but whose
probability of taking a particular value or a range of values is known. Thus random variables
come a priori with a probability distribution function in mind (at least in principle).
A random variable is said to be discrete if it adopts only finitely many possible values
(each having a known probability of occurring) or a list of possible values. It is continuous
if it can adopt a continuous range of values with probability values $P(a \le X \le b)$ known and
given by a probability distribution curve.
Example: A roll of a die is a random variable X. It can have values 1, 2, 3, 4, 5, or 6 each
with probability defined to be $\frac{1}{6}$.
Example: A roll of a biased die could be a random variable Y with probability distribution:
Example: The height of a person chosen at random is a (continuous) random variable H

with probability distribution assumed to be a normal curve.
Definition: Suppose a discrete random variable X has value x with probability P(x). Then
the expected value (or mean) of the random variable is:
$$\mu = E(X) = \sum x \, P(x)$$
(Here $\sum$ denotes summation. So if X takes on values $x_1, x_2, \ldots, x_n$ with probabilities
$p_1, p_2, \ldots, p_n$, then this expression is stating: $\mu = E(X) = x_1 p_1 + x_2 p_2 + \cdots + x_n p_n$.)
Example: The expected value of rolling an ordinary die is, as before,
$$E(X) = \frac{1}{6} \cdot 1 + \frac{1}{6} \cdot 2 + \frac{1}{6} \cdot 3 + \frac{1}{6} \cdot 4 + \frac{1}{6} \cdot 5 + \frac{1}{6} \cdot 6 = 3.5$$
COMMENT: If the random variable is continuous, then the summation is replaced by an
integral (continuous summation).
Generalising the definition of variance and standard deviation
Definition: The variance of a discrete random variable X is:
$$\sigma^2 = \mathrm{Var}(X) = \sum (x - \mu)^2 P(x)$$
Its standard deviation is $\sigma = \mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)}$.
(Again, there is an integral analogue for continuous random variables.)
Example: Standard deviation of rolling an ordinary die is
$$\sigma^2 = \frac{1}{6}(1 - 3.5)^2 + \frac{1}{6}(2 - 3.5)^2 + \frac{1}{6}(3 - 3.5)^2 + \frac{1}{6}(4 - 3.5)^2 + \frac{1}{6}(5 - 3.5)^2 + \frac{1}{6}(6 - 3.5)^2 = \frac{35}{12} \approx 2.92$$
$$\sigma \approx 1.71$$
NOTE: The $n - 1$ versus n issue comes into play here! This definition assumes the
"divide by n" convention.
NOTE: $\mathrm{Var}(X) = E\left((X - \mu)^2\right)$
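The die example is easy to check by machine. The sketch below stores a distribution as a dictionary from values to probabilities (a representation we chose for convenience, not one used in the notes) and applies the two definitions above. For a fair die the exact variance works out to 35/12, about 2.92, so the standard deviation is about 1.71.

```python
from fractions import Fraction
from math import sqrt


def expected_value(dist):
    """E(X) = sum of x * P(x); dist maps each value x to its probability P(x)."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(X) = sum of (x - mu)^2 * P(x)."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())


# The fair die: each face 1..6 with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

var_die = variance(die)
sd_die = sqrt(var_die)

print("E(X)   =", expected_value(die))
print("Var(X) =", var_die, "=", float(var_die))
print("SD(X)  =", sd_die)
```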
SUMS, DIFFERENCES and MULTIPLES OF RANDOM VARIABLES
If all the numbers on a die are doubled then we'd expect the mean value of a roll to
double (from 3.5 to 7) and the standard deviation (spread) of values to double as well (from
1.71 to 3.42).
$$E(2X) = 2E(X)$$
$$\mathrm{SD}(2X) = 2\,\mathrm{SD}(X)$$
(By 2X we mean a new random variable with values double those of X, but with the same
underlying probability distribution.)
In general:
$$E(aX) = aE(X)$$
$$\mathrm{Var}(aX) = a^2 \mathrm{Var}(X)$$
so that
$$\mathrm{SD}(aX) = |a| \, \mathrm{SD}(X)$$
Proof: If X has values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$, then aX has values
$ax_1, ax_2, \ldots, ax_n$ with probabilities $p_1, p_2, \ldots, p_n$. Thus:
$$E(aX) = ax_1 p_1 + ax_2 p_2 + \cdots + ax_n p_n = a(x_1 p_1 + \cdots + x_n p_n) = aE(X)$$
If $\mu = E(X)$ then
$$\mathrm{Var}(aX) = (ax_1 - a\mu)^2 p_1 + \cdots + (ax_n - a\mu)^2 p_n = a^2\left((x_1 - \mu)^2 p_1 + \cdots + (x_n - \mu)^2 p_n\right) = a^2 \mathrm{Var}(X)$$
NOTE: These proofs are more compactly written in $\sum$ notation:
$$E(aX) = \sum ax \, P(x) = a \sum x \, P(x) = aE(X)$$
$$\mathrm{Var}(aX) = \sum (ax - a\mu)^2 P(x) = a^2 \sum (x - \mu)^2 P(x) = a^2 \mathrm{Var}(X)$$
We shall follow this style of presentation in the proofs that follow. (But of course it is
always possible and usually helpful to translate the lines in these proofs back to the
form without $\sum$ notation.)
Suppose we add 3 to all the values of the die. Then we'd expect the mean roll to increase
by three (from 3.5 to 6.5) but the spread of values, the standard deviation, not to change:
$$E(X + 3) = E(X) + 3$$
$$\mathrm{Var}(X + 3) = \mathrm{Var}(X)$$
In general:
$$E(X + c) = E(X) + c$$
$$\mathrm{Var}(X + c) = \mathrm{Var}(X)$$
Proof: $E(X + c) = \sum (x + c) P(x) = \sum x P(x) + c \sum P(x) = E(X) + c \cdot 1 = E(X) + c$
(Note that all probabilities sum to 1. Thus $\sum P(x) = 1$.) If we set $E(X) = \mu$ we have just
established that $E(X + c) = \mu + c$. We thus have:
$$\mathrm{Var}(X + c) = \sum \left((x + c) - (\mu + c)\right)^2 P(x) = \sum (x - \mu)^2 P(x) = \mathrm{Var}(X)$$
Let X be the random variable associated to rolling a die, and Y the random variable of
rolling the die a second time. (OR X and Y can be the random variables associated to rolling
two different dice simultaneously.)
Then X + Y is the random variable that corresponds to the sum of two dice and XY the
random variable that corresponds to the product of the two rolls.
Definition: Two (discrete) random variables X and Y with probability distributions
$P(X = x)$ and $P(Y = y)$, respectively, are said to be independent if the probability
distribution $P(X = x \text{ and } Y = y)$ is given by the product $P(X = x) P(Y = y)$.
Example: If X and Y are the rolls of two separate dice, then
$P(X = 3 \text{ and } Y = 2) = \frac{1}{36} = \frac{1}{6} \cdot \frac{1}{6} = P(X = 3) P(Y = 2)$. This is true for all values (not just
X = 3 and Y = 2) and so X and Y are independent.
The following formulas are true, but they take some work to establish:
For independent random variables X and Y:
$$E(X + Y) = E(X) + E(Y)$$
$$E(XY) = E(X) E(Y)$$
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$
Example: For X and Y the rolls of two dice,
$$\begin{aligned}
E(X + Y) &= 2P(X = 1 \text{ and } Y = 1) + 3P(X = 1 \text{ and } Y = 2) \\
&\quad + 3P(X = 2 \text{ and } Y = 1) + \cdots + 12P(X = 6 \text{ and } Y = 6) \\
&= 2 \cdot \frac{1}{36} + 3 \cdot \frac{1}{36} + 3 \cdot \frac{1}{36} + \cdots + 12 \cdot \frac{1}{36} \\
&= 7
\end{aligned}$$
And $E(X) + E(Y) = 3.5 + 3.5 = 7$.
Notice: From the third line we have that
$$\begin{aligned}
\mathrm{Var}(X - Y) &= \mathrm{Var}(X + (-Y)) \\
&= \mathrm{Var}(X) + \mathrm{Var}(-Y) \\
&= \mathrm{Var}(X) + (-1)^2 \mathrm{Var}(Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)
\end{aligned}$$
That is,
$$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$
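With only 36 equally likely outcomes, the two-dice case can be checked by brute force. The sketch below enumerates every pair of rolls and computes each expectation directly from the definition. The helper names are ours, and exact fractions are used throughout.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)
P_PAIR = Fraction(1, 36)  # independence: P(X = x and Y = y) = 1/6 * 1/6


def expectation(f):
    """E(f(X, Y)) for two independent fair dice, summed over all 36 pairs."""
    return sum(f(x, y) * P_PAIR for x, y in product(FACES, FACES))


def variance(f):
    """Var(f(X, Y)) = E((f(X, Y) - mu)^2), where mu = E(f(X, Y))."""
    mu = expectation(f)
    return expectation(lambda x, y: (f(x, y) - mu) ** 2)


e_x = expectation(lambda x, y: x)
var_x = variance(lambda x, y: x)

e_sum = expectation(lambda x, y: x + y)
e_prod = expectation(lambda x, y: x * y)
var_sum = variance(lambda x, y: x + y)
var_diff = variance(lambda x, y: x - y)

print("E(X + Y)   =", e_sum, " and E(X) + E(Y) =", 2 * e_x)
print("E(XY)      =", e_prod, " and E(X)E(Y) =", e_x * e_x)
print("Var(X + Y) =", var_sum, " and Var(X) + Var(Y) =", 2 * var_x)
print("Var(X - Y) =", var_diff)
```

Here Y has the same distribution as X, which is why E(Y) and Var(Y) are replaced by e_x and var_x in the comparisons.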

Proof (OPTIONAL READING):
Suppose s is a given value and we wish to compute the probability that the sum X + Y
equals s. We can do this by listing all the x-values and all the y-values that sum to
s. Suppose this list appears:
$$\begin{aligned}
x_1 + y_1 &= s \\
x_2 + y_2 &= s \\
&\;\;\vdots \\
x_k + y_k &= s
\end{aligned}$$
Then
$$\begin{aligned}
P(X + Y = s) &= P(X = x_1, Y = y_1 \text{ OR } X = x_2, Y = y_2 \text{ OR } \cdots \text{ OR } X = x_k, Y = y_k) \\
&= \sum_{x + y = s} P(X = x, Y = y) \\
&= \sum_{x + y = s} P(X = x) P(Y = y)
\end{aligned}$$
where $\sum_{x + y = s}$ denotes summation over all the pairs of x- and y-values that sum to s.
Note that the double sum $\sum_s \sum_{x + y = s}$ is the same as summing over all x- and y-values: $\sum_x \sum_y$.
Using these facts we can now see:
$$\begin{aligned}
E(X + Y) &= \sum_s s P(X + Y = s) \\
&= \sum_s \sum_{x + y = s} s P(X = x) P(Y = y) \\
&= \sum_x \sum_y (x + y) P(X = x) P(Y = y) \\
&= \sum_x \sum_y x P(X = x) P(Y = y) + \sum_x \sum_y y P(X = x) P(Y = y) \\
&= \sum_x x P(X = x) \sum_y P(Y = y) + \sum_y y P(Y = y) \sum_x P(X = x) \\
&= \sum_x x P(X = x) \cdot 1 + \sum_y y P(Y = y) \cdot 1 \\
&= E(X) + E(Y)
\end{aligned}$$
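Also:
$$\begin{aligned}
E(XY) &= \sum_s s P(XY = s) \\
&= \sum_s \sum_{xy = s} xy \, P(X = x \text{ and } Y = y) \\
&= \sum_x \sum_y xy \, P(X = x \text{ and } Y = y) \\
&= \sum_x x P(X = x) \sum_y y P(Y = y) \\
&= \sum_x x P(X = x) E(Y) \\
&= E(Y) \sum_x x P(X = x) = E(X) E(Y)
\end{aligned}$$
Finally, if X and Y are independent with $E(X) = \mu$ and $E(Y) = \nu$, then:
$$\begin{aligned}
\mathrm{Var}(X + Y) &= E\left((X + Y - (\mu + \nu))^2\right) \\
&= E\left(((X - \mu) + (Y - \nu))^2\right) \\
&= E\left((X - \mu)^2 + 2(X - \mu)(Y - \nu) + (Y - \nu)^2\right) \\
&= E\left((X - \mu)^2\right) + 2E(X - \mu)E(Y - \nu) + E\left((Y - \nu)^2\right) \\
&= \mathrm{Var}(X) + 2(\mu - \mu)(\nu - \nu) + \mathrm{Var}(Y) \\
&= \mathrm{Var}(X) + \mathrm{Var}(Y)
\end{aligned}$$
CHALLENGE EXERCISE: Show that $\mathrm{Var}(X) = E\left((X - \mu)^2\right) = E(X^2) - (E(X))^2$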
CONNECTION TO THE CENTRAL LIMIT THEOREM
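Suppose $X_1, X_2, \ldots, X_n$ is a collection of random variables that represent the results of
running an experiment n times. Let
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
This is the average result.
The central limit theorem states that if each of the (independent) random variables $X_1, X_2, \ldots, X_n$ has mean
$\mu$ and standard deviation $\sigma$, then:
The probability distribution of $\bar{X}$ is well approximated by the normal
distribution with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.
The hard part of this theorem is proving the approximation to the normal curve.
Calculating the mean and standard deviation of $\bar{X}$ is now easy!
$$E(\bar{X}) = \frac{1}{n} E(X_1 + \cdots + X_n) = \frac{1}{n}\left(E(X_1) + \cdots + E(X_n)\right) = \frac{1}{n}(\mu + \cdots + \mu) = \mu$$
$$\mathrm{Var}(\bar{X}) = \frac{1}{n^2}\left(\mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n)\right) = \frac{1}{n^2}\left(\sigma^2 + \cdots + \sigma^2\right) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
Taking the square root gives the standard deviation $\frac{\sigma}{\sqrt{n}}$.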
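The mean and variance of the average can be confirmed exactly for small n by listing every possible sequence of die rolls. The sketch below does this for a fair die, where $\mu = 7/2$ and $\sigma^2 = 35/12$. The function names are ours. It checks only the mean and variance; it says nothing about the bell shape, which is the hard part of the theorem.

```python
from fractions import Fraction
from itertools import product


def sample_mean_distribution(n):
    """Exact distribution {value: probability} of the average of n fair-die rolls."""
    dist = {}
    p = Fraction(1, 6 ** n)  # each sequence of n rolls is equally likely
    for rolls in product(range(1, 7), repeat=n):
        avg = Fraction(sum(rolls), n)
        dist[avg] = dist.get(avg, 0) + p
    return dist


def mean_and_variance(dist):
    """Return (E(X), Var(X)) for a distribution given as {value: probability}."""
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    return mu, var


for n in (1, 2, 4):
    mu, var = mean_and_variance(sample_mean_distribution(n))
    print("n =", n, " mean =", mu, " variance =", var, " sigma^2 / n =", Fraction(35, 12) / n)
```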
THE CEREAL BOX PROBLEM
Krunchy-Munch Cereal company has placed a prize in every third box. How many boxes of
cereal can I expect to buy before seeing a prize?
1
The chances of finding a prize in a given box is p = and the chances of failing to see a
3
2
prize is q = . We have:
3
The probability of finding a prize in the first box you try is p .
The probability of first finding the prize in the second box you try is: qp .
The probability of first finding the prize in the third box you try is: q 2 p
and so on.
Let X be the random variable which is the box number you open when you first find the
prize. Then:
E ( X ) = 1 p + 2 qp + 3 q 2 p + 4 q 3 p +
COMMENT: Here X can have one of an infinite set of discrete values. It is also called a
discrete random variable.
To evaluate this sum we need to make use of the famous geometric formula.
ASIDE: Proving that $1 + x + x^2 + x^3 + \cdots = \frac{1}{1-x}$
Suppose we wish to find the value of this infinite sum. Let's call its value S:
$$1 + x + x^2 + x^3 + \cdots = S$$
Multiply through by x:
$$x + x^2 + x^3 + x^4 + \cdots = xS$$
That is:
$$S - 1 = xS$$
Solving gives:
$$S = \frac{1}{1-x}$$
COMMENT: We've actually only proven the statement: IF the sum $1 + x + x^2 + x^3 + \cdots$ has a finite value, then that value must be $\frac{1}{1-x}$. In a calculus course, one proves that the sum does indeed converge to a finite answer for values $-1 < x < 1$.
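A quick numerical check makes the convergence claim concrete. This is a minimal sketch, with `partial_sum` a name invented for the illustration: it adds up the first few terms of the series and compares the total with 1/(1 - x).

```python
def partial_sum(x, terms):
    """Sum of the first `terms` terms of 1 + x + x^2 + x^3 + ..."""
    total, power = 0.0, 1.0
    for _ in range(terms):
        total += power
        power *= x
    return total

x = 2 / 3
for terms in (5, 20, 80):
    # the partial sums creep up toward 1 / (1 - x), which is 3 here
    print(terms, partial_sum(x, terms), 1 / (1 - x))
```

Trying a value such as x = 2 instead shows the partial sums growing without bound, which is why the restriction on x matters.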
Returning to our cereal box problem we have:
$$E(X) = 1 \cdot p + 2 \cdot qp + 3 \cdot q^2 p + 4 \cdot q^3 p + \cdots$$
Notice that $qE(X) = qp + 2q^2 p + 3q^3 p + 4q^4 p + \cdots$, from which it follows that:
$$E(X) - qE(X) = p + qp + q^2 p + q^3 p + \cdots$$
That is:
$$(1 - q)E(X) = p(1 + q + q^2 + q^3 + \cdots) = p \cdot \frac{1}{1-q}$$
That is:
$$pE(X) = p \cdot \frac{1}{p} = 1$$
and so:
$$E(X) = \frac{1}{p}$$
With $p = \frac{1}{3}$ this has value 3.
We can expect to buy three boxes before seeing a prize.
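The value 1/p can also be seen numerically by adding up many terms of the series for E(X). The following is a rough sketch; the function name and the number of terms are choices made for this illustration only.

```python
def expected_first_success(p, terms=2000):
    """Partial sum of 1*p + 2*q*p + 3*q^2*p + ... with q = 1 - p."""
    q = 1 - p
    total, q_power = 0.0, 1.0       # q_power holds q^(k-1)
    for k in range(1, terms + 1):
        total += k * q_power * p
        q_power *= q
    return total

print(expected_first_success(1 / 3))   # close to 3
```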
FOR THE BOLD: Suppose Krunchy-Munch Cereal company actually has three different prizes and each box contains one of these prizes. (Assume there is a one-third chance of finding any particular prize in a given box.)
Show that one can expect to buy $3\left(1 + \frac{1}{2} + \frac{1}{3}\right) = 5.5$ boxes before seeing all three prizes.
[In general: If there are n prizes to be had, show that one can expect to buy $n\left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}\right)$ boxes before seeing all n prizes.]
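Before attempting the proof it can be reassuring to see the claimed value appear in an experiment. Here is a small Monte Carlo sketch (in the spirit of part I); the function names, the seed and the number of trials are all arbitrary choices for the illustration, and a simulation is of course evidence rather than proof.

```python
import random

def boxes_until_all_prizes(n, rng):
    """Open boxes, each holding one of n equally likely prizes,
    until every prize has been seen; return the number opened."""
    seen, opened = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        opened += 1
    return opened

def claimed_value(n):
    """The value n(1 + 1/2 + ... + 1/n) stated in the exercise."""
    return n * sum(1 / k for k in range(1, n + 1))

rng = random.Random(2007)            # fixed seed so runs are repeatable
trials = 100_000
average = sum(boxes_until_all_prizes(3, rng) for _ in range(trials)) / trials
print(average, claimed_value(3))     # the simulated average should sit near 5.5
```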
THE GEOMETRIC DISTRIBUTION
The cereal box problem is an example of a general situation. Suppose an experiment is run and the probability of observing a success is p and of observing a failure $q = 1 - p$. (For example, in tossing a coin with heads deemed a success, we have $p = q = \frac{1}{2}$. In rolling a die with six deemed a success, we have $p = \frac{1}{6}$ and $q = \frac{5}{6}$.)
Let X be the random variable:
X = the number of runs of the experiment needed for seeing a first success.
There is a p chance that one will see a success on the first run, so X has value 1 with
probability p.
The probability that X = 2 is $qp$. (Fail and then succeed.)
The probability that X = 3 is $q^2 p$. (Two failures then a success.)
And so on.
The probability distribution associated to X assigns the values $p, qp, q^2 p, \ldots$ to $X = 1, 2, 3, \ldots$
This is called the geometric distribution. We have $P(X = n) = q^{n-1} p$.
We have seen above that $E(X) = \frac{1}{p}$. With some work it is possible to show that $\mathrm{Var}(X) = \frac{q}{p^2}$. (One uses the formula $\mathrm{Var}(X) = E(X^2) - \big(E(X)\big)^2$ and shows that
$$E(X^2) = 1 \cdot p + 4qp + 9q^2 p + 16q^3 p + \cdots = \frac{p(1+q)}{(1-q)^3} = \frac{1+q}{p^2}.)$$
In summary:
For the geometric distribution:
$$P(X = n) = q^{n-1} p$$
$$E(X) = \frac{1}{p}$$
$$\mathrm{Var}(X) = \frac{q}{p^2}$$
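These summary formulas can be sanity-checked by summing the series directly. The sketch below is only an illustration (the name `geometric_moments` and the cutoff of 5000 terms are assumptions made here); it uses the die example, where a six counts as a success.

```python
def geometric_moments(p, terms=5000):
    """Approximate E(X) and Var(X) for P(X = n) = q^(n-1) p
    by summing the first `terms` terms of each series."""
    q = 1 - p
    mean = second_moment = 0.0
    prob = p                           # P(X = 1)
    for n in range(1, terms + 1):
        mean += n * prob
        second_moment += n * n * prob
        prob *= q                      # step to P(X = n + 1)
    return mean, second_moment - mean ** 2

p = 1 / 6                              # rolling a six
mean, var = geometric_moments(p)
print(mean, 1 / p)                     # both close to 6
print(var, (1 - p) / p ** 2)           # both close to 30
```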
EXAMPLE: You would like to know the meaning of life, but only 1 in 100 people on this
planet know the answer. You decide to ask each person you meet until you find someone
who can tell you the answer.
a) How many people do you expect to meet before finding someone with the answer?
b) What is the probability that you will find the person with the answer within the
first three people you meet?
Answer: This is a geometric probability situation with p = 0.01 and q = 0.99 .
a) $E(X) = \frac{1}{p} = 100$. We can expect to meet 100 people before finding the answer.
b) $P(X \le 3) = P(X = 1) + P(X = 2) + P(X = 3) = p + qp + q^2 p \approx 2.97\%$.
The geometric probability distribution has a nice feature:
$$\begin{aligned}
P(X > n) &= 1 - P(X \le n) \\
&= 1 - p - qp - \cdots - q^{n-1} p \\
&= 1 - p\,(1 + q + \cdots + q^{n-1}) \\
&= 1 - p \cdot \frac{1 - q^n}{1 - q} \\
&= 1 - (1 - q^n) \\
&= q^n
\end{aligned}$$
The probability that it will take me more than 100 people to find the meaning of life is thus $(0.99)^{100} \approx 36.6\%$.
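The numbers in this example are easy to reproduce. A minimal sketch, with the helper name `prob_first_success_at` invented for the purpose:

```python
p, q = 0.01, 0.99

def prob_first_success_at(n):
    """P(X = n) = q^(n-1) p for the geometric distribution."""
    return q ** (n - 1) * p

within_three = sum(prob_first_success_at(n) for n in (1, 2, 3))
more_than_100 = 1 - sum(prob_first_success_at(n) for n in range(1, 101))

print(within_three)              # about 0.0297
print(more_than_100, q ** 100)   # both about 0.366
```

The second line computes P(X > 100) the long way, term by term, and agrees with the shortcut $q^n$.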
THE BINOMIAL DISTRIBUTION
In tossing a coin ten times we have already worked out the distribution of probabilities of obtaining exactly 10, 9, 8, …, 2, 1 and no heads. (See part 1.) This is a specific example of a more general situation.
Suppose an experiment has probability p of producing a success and probability $q = 1 - p$ of producing a failure.
Let $B_n(k)$ be the probability of producing exactly k successes in a run of n experiments.
[For example, in tossing a coin 10 times, $B_{10}(3) = \frac{10!}{3!\,7!}\left(\frac{1}{2}\right)^3\left(\frac{1}{2}\right)^7 \approx 11.7\%$.]
For each value n we have a series of probability values, one for each of k = 0 to k = n.
Each of these distributions of probabilities is called a binomial distribution.
In general, we have:
$$B_n(k) = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k}$$
In our studies of the binomial theorem in part II, we saw that:
$$(p + q)^n = p^n + n p^{n-1} q + \cdots + \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k} + \cdots + q^n$$
Hence the name of the binomial distribution.
We shall prove:
Consider the binomial distribution of n trials. If X is the random variable that counts how many successes occur, then, as we have seen:
$$P(X = k) = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k}$$
For this random variable we have:
$$E(X) = np$$
$$\mathrm{Var}(X) = npq$$
Proof:
First consider a single run of the experiment ($n = 1$). There is either 1 success (with probability p) or zero successes (with probability q).
In this simple case:
$$E(X) = 1 \cdot p + 0 \cdot q = p$$
Using $\mathrm{Var}(X) = \sum (x - \mu)^2 P(x)$ with $\mu = p$ we have:
$$\begin{aligned}
\mathrm{Var}(X) &= (1 - p)^2 p + (0 - p)^2 q \\
&= q^2 p + p^2 q \\
&= pq\,(p + q) \\
&= pq
\end{aligned}$$
Now consider the situation of n runs of the experiment. Let
X = the count of successes in all n experiments
$X_1$ = the count of successes in just the first experiment
$X_2$ = the count of successes in just the second experiment
⋮
$X_n$ = the count of successes in just the last experiment
Then $X = X_1 + X_2 + \cdots + X_n$. As $X_1, X_2, \ldots, X_n$ are independent we have:
$$E(X) = E(X_1) + E(X_2) + \cdots + E(X_n) = p + p + \cdots + p = np$$
$$\mathrm{Var}(X) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots + \mathrm{Var}(X_n) = pq + pq + \cdots + pq = npq$$
This completes the proof.
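The result can be confirmed on the coin example by computing the mean and variance straight from the definition. This is just a sketch; `binomial_pmf` is a name chosen here, and `math.comb` supplies the count n!/(k!(n-k)!).

```python
from math import comb

def binomial_pmf(n, k, p):
    """B_n(k): probability of exactly k successes in n runs."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5
probs = [binomial_pmf(n, k, p) for k in range(n + 1)]
mean = sum(k * pr for k, pr in enumerate(probs))
var = sum((k - mean) ** 2 * pr for k, pr in enumerate(probs))

print(probs[3])                 # 0.1171875, the 11.7% quoted for ten coin tosses
print(mean, n * p)              # both 5.0
print(var, n * p * (1 - p))     # both 2.5
```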

EXAMPLE: Returning to my pursuit of the meaning of life… Suppose I invite 40 people at random over for a party. What are the mean and standard deviation of the number of people at my party I can expect to know the answer to my question? What is the probability that 2 or 3 people among this group have the answer?
Answer: Here p = 0.01 and q = 0.99. Among a sample of n = 40 people we can expect:
$$E(X) = 40p = 0.4$$
$$SD(X) = \sqrt{npq} = \sqrt{40 \times 0.01 \times 0.99} \approx 0.63$$
Also
$$\begin{aligned}
P(X = 2 \text{ or } 3) &= P(X = 2) + P(X = 3) \\
&= \frac{40!}{2!\,38!}(0.01)^2(0.99)^{38} + \frac{40!}{3!\,37!}(0.01)^3(0.99)^{37} \\
&\approx 6.0\%
\end{aligned}$$
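The arithmetic for the party can be reproduced in a few lines. This sketch takes n = 40 and p = 0.01 from the example; the helper name `prob_exactly` is made up for the illustration. With these values $\sqrt{npq}$ evaluates to about 0.63.

```python
from math import comb, sqrt

n, p = 40, 0.01
q = 1 - p

def prob_exactly(k):
    """Binomial probability of exactly k successes among the n guests."""
    return comb(n, k) * p ** k * q ** (n - k)

mean = n * p
sd = sqrt(n * p * q)
two_or_three = prob_exactly(2) + prob_exactly(3)
print(mean, sd, two_or_three)    # about 0.4, 0.63 and 0.060
```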
CONNECTION TO THE NORMAL CURVE

Imagine that as a hospital administrator I need 100 samples of O- blood. There will be 450 donors over the next week, but only 6% of people have this blood type. In order to find the probability of obtaining exactly 100 such donors we need to compute:
$$\frac{450!}{100!\,350!}\,(0.06)^{100}(0.94)^{350}$$
This is extraordinarily unwieldy, and all but impossible to compute by hand.
Fortunately, for large samples (and we'll explain what we mean by large in a moment) the binomial distribution can be well approximated by a normal curve with:
$$\mu = np$$
$$\sigma = \sqrt{npq}$$
We can then use the tables of the normal curve values in order to approximate the probabilities we need.
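With a computer the exact binomial values are within reach, so the quality of the normal approximation can be inspected directly. The sketch below uses the donor figures n = 450 and p = 0.06. The cutoff of 30 donors and the half-step "continuity correction" are assumptions introduced for this illustration and are not part of the notes; `math.erf` is used in place of a table of normal curve values.

```python
from math import comb, erf, sqrt

n, p = 450, 0.06
q = 1 - p
mu, sigma = n * p, sqrt(n * p * q)

def binomial_cdf(k):
    """Exact P(X <= k) for the binomial distribution."""
    return sum(comb(n, j) * p ** j * q ** (n - j) for j in range(k + 1))

def normal_cdf(x):
    """P(N <= x) for a normal curve with mean mu and sd sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

k = 30                                   # illustrative cutoff, not from the notes
exact = binomial_cdf(k)
approx = normal_cdf(k + 0.5)             # half-step continuity correction
print(mu, sigma, exact, approx)
```

The two probabilities should agree to within a percentage point or two, which is the sense in which the normal curve "well approximates" the binomial distribution here.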
WHY SHOULD THE NORMAL CURVE COME INTO PLAY?

There are two reasons. Firstly, the binomial distribution analyses results of running experiments multiple times (the probability of obtaining k = 0, k = 1, …, k = n successes over n runs). If n is large, the central limit theorem states that the situation should begin to approximate a normal curve.
The true reason is that there is a connection between factorials and the number e. Stirling proved that:
$$n! \approx \sqrt{2\pi n}\; n^n e^{-n}$$
with the approximation only improving as n grows larger. Thus it seems plausible (and is in fact the case) that the formulas one obtains from the binomial distribution begin to look like a formula for a curve of the form $y = e^{-x^2}$, a normal curve.
[Calculus gives a hint as to why Stirling's formula might be true. Notice that:
$$\ln n! = \ln 1 + \ln 2 + \ln 3 + \cdots + \ln n$$
appears as the sum of areas of rectangles that approximate the region under the curve $y = \ln x$. Thus:
$$\ln n! \approx \int_1^n \ln x \, dx = \big[x \ln x - x\big]_1^n = n \ln n - n + 1 \approx n \ln n - n$$
which gives $n! \approx n^n e^{-n}$. (Stirling gave a more refined version of this argument.)]
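Stirling's formula is easy to test numerically. This short sketch (the function name `stirling` is just a label for the illustration) prints the ratio of the true factorial to the approximation $\sqrt{2\pi n}\, n^n e^{-n}$.

```python
from math import factorial, sqrt, pi, e

def stirling(n):
    """Stirling's approximation sqrt(2 pi n) * n^n * e^(-n)."""
    return sqrt(2 * pi * n) * n ** n * e ** (-n)

for n in (5, 10, 50):
    print(n, factorial(n) / stirling(n))   # the ratio approaches 1
```

Even for n = 10 the approximation is already within one percent of the true value.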
HOW LARGE IS LARGE ENOUGH?

The normal curve extends infinitely far both to the left and to the right, but 99.7% of the region under the curve lies within $3\sigma$ of the mean $\mu$. The two tails constitute only 0.3% of the region under the curve.
The binomial distribution, however, only has probability values for k = 0, k = 1, …, k = n, and does not extend infinitely far to the left and to the right. In fact, it does not have values for k less than zero or for k greater than n. We would like these non-existent regions to match the two tails of the normal curve so that the mismatch of missing region is negligible.
For the binomial distribution $\mu = np$ and $\sigma = \sqrt{npq}$, and we would like $\mu + 3\sigma < n$ and $0 < \mu - 3\sigma$ so that the two tails correspond to moving beyond n and below 0.
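The statement $0 < \mu - 3\sigma$ gives:
$$np > 3\sqrt{npq}$$
Squaring both sides and dividing by np:
$$np > 9q$$
Since $q \le 1$, this is guaranteed whenever:
$$np > 9$$
In the same way, the other relation is guaranteed whenever:
$$nq > 9$$
For simplicity, folk usually work with the number 10 to say:
If $np \ge 10$ and $nq \ge 10$ then the binomial distribution can be approximated as a normal distribution.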
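The rule of thumb and the three-sigma conditions it comes from can both be written as simple checks. The function names below are invented for this sketch; the two test cases reuse the blood-donor figures (n = 450, p = 0.06) and the party figures (n = 40, p = 0.01) from earlier examples.

```python
from math import sqrt

def normal_approx_ok(n, p, threshold=10):
    """Rule of thumb: both np and nq should be at least `threshold`."""
    return n * p >= threshold and n * (1 - p) >= threshold

def three_sigma_inside(n, p):
    """Check 0 < mu - 3 sigma and mu + 3 sigma < n directly."""
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    return 0 < mu - 3 * sigma and mu + 3 * sigma < n

print(normal_approx_ok(450, 0.06), three_sigma_inside(450, 0.06))  # True True
print(normal_approx_ok(40, 0.01), three_sigma_inside(40, 0.01))    # False False
```

So the normal curve is a reasonable stand-in for the donor problem but not for the party of 40.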
PROPORTIONS: POPULATION PROPORTIONS and SAMPLE PROPORTIONS
Suppose we're interested in finding the proportion of Americans who can whistle. Call this true, and unknown, proportion $p$. To estimate $p$ we could collect a sample of 1000 people, say, and find the sample proportion $\hat{p}$ of those who can whistle.
To understand how good an estimate $\hat{p}$ is for $p$, we can analyse the distribution of $\hat{p}$ values. This turns out to be a binomial distribution. To see why, note the following:
The probability of selecting an American at random who can whistle is $p$.
Since the population of Americans is so large, removing this first person from the pool really won't alter the probability of choosing a second American who can whistle. The chances are still, for all meaningful purposes, $p$.
Thus, selecting n = 1000 Americans is equivalent to running an experiment 1000 times with $p$ chance of success and $q = 1 - p$ chance of failure. The quantity $n\hat{p}$ is counting the number of successes and so must have a binomial distribution. It has mean $np$ and standard deviation $\sqrt{npq}$. Dividing by n gives the distribution of $\hat{p}$. It has:
$$\mu = p$$
$$\sigma = \sqrt{\frac{pq}{n}}$$
Moreover, if $np$ and $nq$ are each $\ge 10$, this distribution is approximately normal.
This is the central limit theorem: Version II from part III of these notes.
STUDENT'S t-DISTRIBUTION
Throughout section III of these notes, in calculating confidence intervals and the like, we assumed that the standard deviation $\sigma$ of a distribution is known. This is rarely the case.
In many cases we approximate $\sigma$ with the value $\hat{\sigma}$ given by the standard deviation of the sample at hand.
EXAMPLE: Of 1500 adult Americans that were polled, 3.2% said that they had a thoroughly enjoyable experience studying math in high-school. Estimate the proportion of ALL adult Americans that will say the same.
We answered this question in part III as follows:
Answer: We have
$$\hat{p} = 3.2$$
$$\hat{\sigma} = \sqrt{\frac{3.2 \times 96.8}{1500}} \approx 0.454$$
Using $\hat{\sigma}$ as an approximation for $\sigma$, with 95% confidence we can say that the percentage of Americans who felt this way about math lies in the range [2.292, 4.108] (that is, within two standard deviations of 3.2).
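The interval quoted in this answer can be recomputed as follows. The sketch assumes, in line with the 68-95-99.7 rule of part III, that "95% confidence" means two standard deviations either side of the sample value; the variable names are choices made for the illustration.

```python
from math import sqrt

n = 1500
p_hat = 3.2                                    # sample percentage
sigma_hat = sqrt(p_hat * (100 - p_hat) / n)    # in percentage points
low, high = p_hat - 2 * sigma_hat, p_hat + 2 * sigma_hat
print(round(sigma_hat, 3), round(low, 2), round(high, 2))   # 0.454 2.29 4.11
```

The small difference from [2.292, 4.108] comes from rounding the standard deviation to 0.454 before doubling it.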
Gosset noticed, when performing statistical checks in his brewing company, that unacceptable errors occurred in his analyses by making use of these approximate values $\hat{\sigma}$ for $\sigma$. He set to work analyzing the distribution of $\hat{\sigma}$ values in their own right, in the same way we analyse the distribution of sample means $\bar{x}$ via the central limit theorem.
To be more precise, Gosset realized that since we always need to convert entities to their z-score, it is best to analyse the distribution of the quantities:
$$t = \frac{\bar{x} - \mu}{\hat{\sigma} / \sqrt{n}}$$
He completely determined the mathematics of these entities, showing that, for each n (the size of the sample) they follow their own distribution curves called Student's t-distribution with $n - 1$ degrees of freedom. (Gosset published under the pseudonym "Student.")
Thus, one can create more accurate confidence intervals for means from a sample by working with tables from Student's t-distributions rather than working with the normal distribution and approximate values for the standard deviation.
EXAMPLE: The speeds of 23 cars along a particular road with posted speed limit 40 mph were recorded. Their mean speed was $\bar{x} = 41.0$ mph with $\hat{\sigma} = 4.23$ mph. Is there reason to believe that the mean speed of all cars along this road is greater than 40 mph?
Answer: Assuming that car speeds follow something akin to a normal distribution (we can plot the results and see if they appear bell shaped) we can follow Gosset's model. We'll make the assumption that $\mu = 40.0$ and see what we conclude.
Here:
$$t = \frac{41.0 - 40.0}{4.23 / \sqrt{23}} \approx 1.13$$
There are $n - 1 = 22$ degrees of freedom. According to a table of t-distributions:
$$P(t_{22} > 1.13) = 0.136 = 13.6\%$$
This is not sufficiently rare to conclude that receiving a sample mean of 41.0 is unusual under the assumption that the true mean is 40.0. That is, we have no reason to reject the idea that the true mean is 40.0 mph.
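In place of a printed table, the tail probability can be estimated by integrating the t-distribution's density curve numerically. The sketch below is one way to do this with only the standard library; the density formula is the standard one for Student's t-distribution, while the function names and the use of Simpson's rule with 2000 steps are choices made for this illustration.

```python
from math import gamma, pi, sqrt

def t_density(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the interval [0, t]."""
    h = t / steps
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3           # half the area lies right of 0

n, x_bar, sigma_hat, mu0 = 23, 41.0, 4.23, 40.0
t = (x_bar - mu0) / (sigma_hat / sqrt(n))
print(round(t, 2))                       # 1.13
print(upper_tail(t, n - 1))              # roughly 0.13 to 0.14
```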
COMPARING TWO MEANS
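Two companies make light bulbs: company 1 and company 2. We'd like to know if there is any difference between the mean life-time of the bulbs they each produce. We take a sample of bulbs of size $n_1$ from company 1 and compute their sample mean $\bar{x}_1$, and a sample of size $n_2$ from company 2 and compute their sample mean $\bar{x}_2$. What does the difference $\bar{x}_1 - \bar{x}_2$ of these two sample means tell us about the difference of the two true means $\mu_1 - \mu_2$?
Let $\bar{X}_1$ be the random variable of possible $\bar{x}_1$ values, and define $\bar{X}_2$ similarly. By the central limit theorem:
$$E(\bar{X}_1) = \mu_1 \qquad E(\bar{X}_2) = \mu_2$$
$$SD(\bar{X}_1) = \frac{\sigma_1}{\sqrt{n_1}} \qquad SD(\bar{X}_2) = \frac{\sigma_2}{\sqrt{n_2}}$$
We also have:
$$E(\bar{X}_1 - \bar{X}_2) = E(\bar{X}_1) - E(\bar{X}_2) = \mu_1 - \mu_2$$
$$SD(\bar{X}_1 - \bar{X}_2) = \sqrt{\mathrm{Var}(\bar{X}_1 - \bar{X}_2)} = \sqrt{\mathrm{Var}(\bar{X}_1) + \mathrm{Var}(\bar{X}_2)} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
This is, of course, assuming that we know the true standard deviations. Without this knowledge, we can still test if the difference $\mu_1 - \mu_2 = 0$ by using Student's t-distribution by computing the value:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\hat{\sigma}_1^2}{n_1} + \dfrac{\hat{\sigma}_2^2}{n_2}}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\hat{\sigma}_1^2}{n_1} + \dfrac{\hat{\sigma}_2^2}{n_2}}}$$
(where $\hat{\sigma}_1$ and $\hat{\sigma}_2$ are the standard deviations of the two samples) and seeing if this value is an acceptable distance from 0. Details are omitted here, but the gist is clear.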
COMMENT: One of the details omitted is determining the number of degrees of freedom
that are appropriate for this problem. Because we have a mix of sample sizes, the count of
degrees of freedom is complicated considerably.
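The t value itself is straightforward to compute once the sample figures are in hand. The sketch below uses made-up light bulb data purely for illustration (none of these numbers come from the notes), and it stops at the t value because, as the comment above says, choosing the degrees of freedom needs more care.

```python
from math import sqrt

def two_sample_t(x1_bar, s1, n1, x2_bar, s2, n2):
    """t value for testing mu1 - mu2 = 0 from two independent samples,
    using the sample standard deviations s1 and s2."""
    return (x1_bar - x2_bar) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# Hypothetical bulb lifetimes in hours (made-up numbers for illustration).
t = two_sample_t(1010.0, 40.0, 25, 990.0, 30.0, 36)
print(t)                                 # about 2.12
```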
A COMMENT ON CHI-SQUARED DISTRIBUTION
If $X_1, X_2, \ldots, X_\nu$ are independent random variables with normal probability distributions (each with mean 0 and standard deviation 1), then situations can arise in which one wishes to study the random variable (or close variations of it):
$$X_1^2 + X_2^2 + \cdots + X_\nu^2$$
[This arises in our study of contingency tables.]
The mathematics of this random variable is well understood and its probability distribution is known. It is called the chi squared distribution with $\nu$ degrees of freedom. This distribution has $\mu = \nu$ and $\sigma = \sqrt{2\nu}$.
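A simulation gives a feel for these two values. This is a rough sketch under the assumption that the variables being squared are independent standard normal variables (mean 0, standard deviation 1); the seed, the choice of 4 degrees of freedom and the number of trials are arbitrary.

```python
import random

def chi_squared_sample(df, rng):
    """One draw of X_1^2 + ... + X_df^2 with each X_i standard normal."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(df))

rng = random.Random(8)                   # fixed seed for repeatability
df, trials = 4, 20_000
draws = [chi_squared_sample(df, rng) for _ in range(trials)]
mean = sum(draws) / trials
var = sum((d - mean) ** 2 for d in draws) / trials
print(mean, var)                         # near 4 and 8 (so sd near sqrt(8))
```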
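CHEBYSHEV'S INEQUALITY
Here is an astounding fact:
For any probability distribution, if the mean of the associated random variable is $\mu$ and its standard deviation $\sigma$, then the area under the curve more than k standard deviations from the mean is no more than $\frac{1}{k^2}$.
Its proof is swift! We have:
$$\begin{aligned}
\sigma^2 = \sum_{\text{all } x} (x - \mu)^2 P(x) &\ge \sum_{|x - \mu| \ge k\sigma} (x - \mu)^2 P(x) \\
&\ge \sum_{|x - \mu| \ge k\sigma} k^2 \sigma^2 P(x) \\
&= k^2 \sigma^2 \sum_{|x - \mu| \ge k\sigma} P(x) \\
&= k^2 \sigma^2 \, P(|x - \mu| \ge k\sigma)
\end{aligned}$$
and so:
$$P(|x - \mu| \ge k\sigma) \le \frac{1}{k^2}$$
CHALLENGE: Prove, more generally, $P(|x - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}$.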
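The inequality can be watched in action on any small distribution. The sketch below checks it for a fair die; the function name and the particular values of k are choices made for this illustration.

```python
from math import sqrt

def chebyshev_check(dist, k):
    """Return (tail probability, bound 1/k^2) for a discrete
    distribution given as {value: probability}."""
    mu = sum(x * p for x, p in dist.items())
    sigma = sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))
    tail = sum(p for x, p in dist.items() if abs(x - mu) >= k * sigma)
    return tail, 1 / k ** 2

die = {face: 1 / 6 for face in range(1, 7)}
for k in (1.2, 1.4, 2):
    print(k, chebyshev_check(die, k))    # the tail never exceeds the bound
```

The bound is far from tight here, which is the price paid for an inequality that holds for every distribution.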
LAW OF LARGE NUMBERS
One of the great check points of probability theory was the proof of the intuitively obvious result, the law of large numbers. It showed that all was on the right track.
In formal language, here is the result:
Suppose $X_1, X_2, \ldots, X_n$ are independent random variables each of mean $\mu$ and standard deviation $\sigma$. We usually think of these random variables as the result of running the same experiment n different times. Let $S_n = X_1 + X_2 + \cdots + X_n$ (so that $\frac{S_n}{n}$ is the average result). Then, for every error value $\varepsilon > 0$:
$$\lim_{n \to \infty} P\left( \left| \frac{S_n}{n} - \mu \right| > \varepsilon \right) = 0.$$
That is, the probability that the average $\frac{S_n}{n}$ differs from the true mean $\mu$ by more than some error value $\varepsilon$ goes to zero as n grows, no matter the degree of error you wish to tolerate.
COMMENT: This is not actually saying that $\lim_{n \to \infty} \frac{S_n}{n}$ equals $\mu$, only, in some sense, that $\lim_{n \to \infty} \frac{S_n}{n}$ equals $\mu$ with probability 1. (Very strange! But curious issues do arise in the theory. For example, the chances of choosing a whole number by selecting a point at random along a number line is zero, even though integers are themselves valid points to be selected!)
Proof: By the results of pages 17-19 we have: $E\left(\frac{S_n}{n}\right) = \mu$ and $\mathrm{Var}\left(\frac{S_n}{n}\right) = \frac{\sigma^2}{n}$.
By Chebyshev's inequality: $P\left( \left| \frac{S_n}{n} - \mu \right| > \varepsilon \right) \le \frac{\sigma^2}{n \varepsilon^2}$.
And so $\lim_{n \to \infty} P\left( \left| \frac{S_n}{n} - \mu \right| > \varepsilon \right) = 0$ since clearly $\lim_{n \to \infty} \frac{\sigma^2}{n \varepsilon^2} = 0$.
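The proof can be illustrated with die rolls. The sketch below compares the Chebyshev bound $\frac{\sigma^2}{n\varepsilon^2}$ with the fraction of simulated runs in which the average misses the true mean 3.5 by more than $\varepsilon$. The tolerance 0.25, the seed and the number of trials are arbitrary choices for the illustration; the variance 35/12 of a single roll is the standard value for a fair die.

```python
import random

def chebyshev_bound(sigma2, n, eps):
    """Upper bound sigma^2 / (n eps^2) on P(|S_n/n - mu| > eps)."""
    return sigma2 / (n * eps ** 2)

def observed_fraction(n, eps, trials, rng):
    """Fraction of trials in which the average of n die rolls
    misses the true mean 3.5 by more than eps."""
    misses = 0
    for _ in range(trials):
        average = sum(rng.randint(1, 6) for _ in range(n)) / n
        if abs(average - 3.5) > eps:
            misses += 1
    return misses / trials

rng = random.Random(1)                   # fixed seed for repeatability
sigma2, eps = 35 / 12, 0.25              # variance of one roll; error tolerated
for n in (10, 100, 1000):
    print(n, observed_fraction(n, eps, 1000, rng),
          chebyshev_bound(sigma2, n, eps))
```

For small n the bound exceeds 1 and says nothing, but both columns shrink toward zero as n grows, which is the content of the law.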