MATHEMATICS
Volume 8
PROBABILITY AND STATISTICS
LEVEL:
TABLE OF CONTENTS
PROBLEM SET I
PART 2: COUNTING
COUNTING PRINCIPLES
The Multiplication Principle
Factorials
The Labeling Principle
Multi-stage Labeling
Fun with Poker
PASCAL'S TRIANGLE
A Grid of Numbers
The Binomial Theorem
PROBLEM SET II
PART 3: STATISTICS
Displaying and Summarising Data
Measures of Central Tendency
Measures of Dispersion
Scatter Plots
Lines of Best Fit
Correlation Coefficient
Null Hypothesis
Distributions
Central Limit Theorem
Normal Distribution
68-95-99.7 Rule
z-scores
Roulette
Confidence Intervals
P-values
Gallup Polls
Sampling
Chi-Squared test
Quality Control
Run Tests
Rank Correlation
PART I of IV
James Tanton
© 2007 James Tanton
CONTENTS:
Simplistic Overview
Expected Value
Conditional Probability
Bayes' Theorem
SIMPLISTIC OVERVIEW
PROBABILITY: Explores what can be said about an unknown sample from a known
collection of objects.
e.g. We know all possible combinations from rolling a pair of dice. What is the most
likely outcome?
STATISTICS: Explores what can be said about an unknown collection from a known
sample.
e.g. We surveyed 100 people and found that 37 chewed gum. What does this say
about the gum-chewing habits of the entire nation?
BASIC IDEAS:
e.g. The possible outcomes from rolling a die are: 1, 2, 3, 4, 5, 6. Each is usually
deemed equally likely. Then:
\[ \mathrm{Prob}(3) = \frac{1}{6} \]
\[ \mathrm{Prob}(5) = \frac{1}{6} \]
etc.
e.g. Four cards are dealt from a deck. What's the probability of getting four aces?
This problem relies on being able to count all 4-card hands. (A bit tricky.)
PART ONE: 3
COMMENT: The word "statistik" was coined by German political scientist Gottfried
Achenwall (1719-1772) to mean a summary of how things stand. It is based on the
Latin word "stare", meaning "to stand".
HISTORY
PROBABILITY
The start of probability theory can, essentially, be pinpointed to a single moment in
time. In 1654 French nobleman Chevalier de Méré wrote to prominent
mathematician Blaise Pascal asking for advice on the following problem:
Two friends each lay down $100 in a friendly best-of-seven series of tennis games.
But rain interrupts play after just four games, with one person having won
three games and the other just one. How should the $200 be divvied up
between the two players so as to properly reflect the likelihood of each
winning?
Pascal shared this problem with Pierre de Fermat. Both solved it independently using
different techniques. Through this problem, probability theory was born.
STATISTICS
The study of statistics (descriptive statistics, at least) is ancient.
3050 B.C.E. Egyptians collated data on population wealth
2300 B.C.E. Ancient Chinese did the same
594 B.C.E. Greeks took a census for tax collection
309 B.C.E. Greeks took a census for population figures
For fun, let's go back and analyse de Méré's problem. Let's imagine, like Pascal and
Fermat, we are seeing it for the first time. How would you like to approach it?
Recall:
Best of 7 games but only 4 games played.
Player A has won 3 games
Player B has won 1 game.
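One possible line of attack, offered here only as a sketch and not as Pascal's or Fermat's own method: imagine the remaining games being played out, assume each player is equally likely to win any single game, and split the pot according to how often each player would take the series. The function name `fair_split` and the equal-chance assumption are illustrative choices, not part of the original problem statement.

```python
from fractions import Fraction
from itertools import product

def fair_split(pot, a_needs, b_needs):
    """Split `pot` by each player's chance of winning the series.

    Assumes every remaining game is a fair 50/50 contest.
    a_needs / b_needs: wins each player still requires.
    """
    # The series is certainly decided within this many further games.
    remaining = a_needs + b_needs - 1
    a_takes_series = 0
    for games in product("AB", repeat=remaining):
        # If A collects enough wins among these games, A wins the series.
        if games.count("A") >= a_needs:
            a_takes_series += 1
    p_a = Fraction(a_takes_series, 2 ** remaining)
    return pot * p_a, pot * (1 - p_a)

# Best of seven: A has 3 wins (needs 1 more), B has 1 win (needs 3 more).
share_a, share_b = fair_split(200, a_needs=1, b_needs=3)
print(share_a, share_b)
```

Playing out all the remaining games, even those after the series is already decided, keeps every listed sequence equally likely, which is what makes simple counting legitimate here.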
Definition:
The set of all possible outcomes of an experiment is called the sample space.
\[ P(E) = \frac{\#\text{ elements of } E}{\#\text{ elements of } S} \]
(This is, of course, assuming that the sample space has just a finite number of
elements, and that every single outcome is equally likely.)
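\[ P(\text{even}) = \frac{\#\text{ elements of } E}{\#\text{ elements of } S} = \frac{3}{6} = \frac{1}{2} \]
Note, also:
\[ P(\{3\}) = \frac{1}{6} \]
\[ P(\{1, 2, 4, 5\}) = \frac{4}{6} = \frac{2}{3} \]
\[ P(\text{rolling a 7}) = \frac{0}{6} = 0 \]
\[ P(\text{rolling any number}) = \frac{6}{6} = 1 \]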
For example: In rolling a pair of dice and computing their sum, the set of all
possible outcomes is: S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.
But these individual events, 2, 3, ..., 12, are not equally likely.
Somehow we are meant to know that the underlying equally likely quantity here is
not the sums 2, 3, ..., 12, but the pairs of numbers behind each sum, with order
considered important!
\[ P(3) = \frac{2}{36} = \frac{1}{18} \]
\[ P(4) = \frac{3}{36} = \frac{1}{12} \]
EXERCISE: Write down P(5), P(6), P(7), P(8), P(9), P(10), P(11) and P(12).
This might not seem too much of an issue here, but in more complicated examples
one might be given a sample space, such as S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, and
one is meant to somehow know whether or not these outcomes are equally likely,
or whether or not this sample space is a result of a more fundamentally equally
likely set.
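The idea that ordered pairs, rather than sums, are the equally likely outcomes can be checked by brute-force counting. The sketch below is an illustration only; the helper name `sum_distribution` is hypothetical, and it assumes two fair six-sided dice.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sum_distribution(die1, die2):
    """Probability of each sum, treating every ordered pair of faces as equally likely."""
    counts = Counter(a + b for a, b in product(die1, die2))
    total = len(die1) * len(die2)
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}

faces = [1, 2, 3, 4, 5, 6]
dist = sum_distribution(faces, faces)
for s, p in dist.items():
    print(s, p)
```

Each sum's probability is just the number of ordered pairs producing it, divided by 36.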
Example: Consider the command: "Pick a whole number at random." What does this
mean? Is each number equally likely? Does the term "equally likely" even apply?
Each person can reason: "I stand to win more than I lose. Thus the game is in my
favour!"
ADDITIONAL COMMENT:
The very act of defining probability in terms of "equally probable events" is
circular: using the term "equally likely" assumes you already know what probability
means! The very basis of naive probability is philosophically flawed.
If we rewrite
\[ P(E) = \frac{\#\text{ elements of } E}{\#\text{ elements of } S} \]
as
\[ P(E) = \frac{\text{size of } E}{\text{size of } S} \]
then we can extend our notion of probability to geometric settings where the "size" of
a set is taken to be its area. For example, a circle sits inside a square of side
length 2 inches. We can ask: If a point inside the square is chosen at random, what
are the chances of it landing outside the circle?
The area of the shaded region, the region of interest (the event E), is
\( 2^2 - \pi \cdot 1^2 = 4 - \pi \) and the area of the entire square (the sample space S) is \( 2^2 = 4 \).
Then the probability we seek is:
\[ P(E) = \frac{\text{size of } E}{\text{size of } S} = \frac{4 - \pi}{4} \]
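As a sanity check on this geometric probability, one can throw random points at the square and count how many miss the circle. The sketch below assumes "at random" means uniformly over the area of the square, with the circle of radius 1 centred in a square of side 2; the function name and the fixed seed are illustrative choices. The exact value it is compared against is (area outside the circle) / (area of the square) = (4 - π)/4.

```python
import math
import random

def estimate_outside_circle(trials, seed=0):
    """Estimate the chance that a uniform point in the square [-1, 1] x [-1, 1]
    lands outside the unit circle centred at the origin."""
    rng = random.Random(seed)
    outside = 0
    for _ in range(trials):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        if x * x + y * y > 1:
            outside += 1
    return outside / trials

estimate = estimate_outside_circle(100_000)
exact = (4 - math.pi) / 4
print(estimate, exact)
```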
QUESTION: What does "equally likely" mean in this setting? Is the probability of
picking any specific point zero? If so, is the probability of picking any point from a
collection of points also zero?
And while we are mired in philosophical woes, consider the following disturbing
problem:
Answer 1: By rotating the circle we may as well assume that one end of the chosen
chord is positioned at the left end of the circle. Then we can see that the chord
will be longer than the side of an inscribed equilateral triangle if its second end lies
in the shaded portion shown.
This represents \( \frac{1}{3} \) of the circumference of the circle. Thus the probability we
seek is:
\[ P = \frac{1}{3} \]
Answer 2: By rotating the circle we may as well assume that the chosen chord is
horizontal. Then the chosen chord will be longer than the side-length of an
inscribed equilateral triangle if its mid-point lies on the shaded portion of the
diameter shown:
An exercise in geometry shows that this represents \( \frac{1}{2} \) of the diameter. Thus the
probability we seek is:
\[ P = \frac{1}{2} \]
The problem here is that the term "at random" is absolutely vague! The first
answer defines "at random" to mean: select a point on the circumference of the
circle and connect it with a given previously chosen point. The second solution
assumes "at random" means: draw a circle on the floor and roll a broom handle from
one side of the room across the circle.
It is possible to define "at random" by many different means for this problem and
arrive at different, but absolutely valid, answers. (If one draws a circle on a piece
of paper and drops straws from above onto the figure, one finds that about \( \frac{1}{4} \) of
them give chords of the length we seek!)
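The two readings of "at random" can be simulated side by side. The sketch below is an illustration under stated assumptions: a unit circle (so the inscribed equilateral triangle has side √3), chord end-points chosen uniformly on the circumference for the first reading, and the chord's mid-point chosen uniformly along a radius for the second. The function names and seed are hypothetical.

```python
import math
import random

SIDE = math.sqrt(3)  # side of an equilateral triangle inscribed in a unit circle

def chord_by_endpoints(rng):
    """Answer 1's reading: join two points chosen uniformly on the circumference."""
    a = rng.uniform(0, 2 * math.pi)
    b = rng.uniform(0, 2 * math.pi)
    return 2 * abs(math.sin((a - b) / 2))

def chord_by_midpoint(rng):
    """Answer 2's reading: place the chord's mid-point uniformly along a radius."""
    d = rng.uniform(0, 1)            # distance of the mid-point from the centre
    return 2 * math.sqrt(1 - d * d)

def fraction_longer(chooser, trials=100_000, seed=1):
    """Fraction of simulated chords longer than the triangle's side."""
    rng = random.Random(seed)
    return sum(chooser(rng) > SIDE for _ in range(trials)) / trials

p_endpoints = fraction_longer(chord_by_endpoints)
p_midpoint = fraction_longer(chord_by_midpoint)
print(p_endpoints, p_midpoint)
```

Both simulations are "correct"; they simply model different physical procedures for choosing a chord.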
For fun
SICHERMAN DICE:
Most people believe that ordered pairs of values are the fundamental equally likely
entities to be considered when rolling a pair of dice and computing their sum. In
this case, the results of rolling two dice can be nicely displayed in a table:
We see now that \( P(3) = \frac{2}{36} \), \( P(7) = \frac{6}{36} \) and so forth.
Suppose instead we roll two dice, one numbered 1-2-2-3-3-4 and the
other 1-3-4-5-6-8.
Complete the following addition table and verify that these dice give exactly the
same probabilities for any given sum as ordinary dice.
COMMENT: These dice were discovered by Col. George Sicherman in the 1970s.
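The verification asked for above can also be done by machine. This short sketch (the helper name `sum_counts` is an illustrative choice) counts, for each pair of dice, how many ordered pairs of faces give each sum and compares the two tallies.

```python
from collections import Counter
from itertools import product

def sum_counts(die1, die2):
    """Number of ordered face pairs producing each sum."""
    return Counter(a + b for a, b in product(die1, die2))

ordinary = sum_counts([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6])
sicherman = sum_counts([1, 2, 2, 3, 3, 4], [1, 3, 4, 5, 6, 8])

# Identical counts out of 36 pairs means identical probabilities for every sum.
print(ordinary == sicherman)
```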
Given a (finite) sample space S and an event A, we have defined the probability of
event A occurring as:
\[ P(A) = \frac{\text{size of } A}{\text{size of } S} \]
Definition: Two actions are said to be independent if the outcomes of one action
in no way affect the outcomes of the other.
Example: Picking a card from a deck of cards, destroying it, and then picking a
second card from the deck, are NOT independent actions: The result of the first
action affects possible outcomes for the second. For instance, picking the ace of
spades first no longer allows the ace of spades to be chosen second.
Example: Deciding what to wear and the weather forecast are not independent
events.
\[ P(\{\text{H, even}\}) = \frac{3}{12} \]
\[ P(\{\text{T, 5 or 6}\}) = \frac{2}{12} \]
\[ P(\{\text{H, even}\} \text{ OR } \{\text{T, 5 or 6}\}) = \frac{3 + 2}{12} = \frac{3}{12} + \frac{2}{12} \]
Then:
P(A or B) = P(A) + P(B)
Then \( P(A \cup B) = P(A) + P(B) = \frac{6}{12} + \frac{1}{12} = \frac{7}{12} \)
EXERCISE: What if A and B do share outcomes in common? Use the following Venn
diagram to explain the formula: \( P(A \cup B) = P(A) + P(B) - P(A \cap B) \).
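The formula for overlapping events can be checked by counting on a small sample space. The sketch below uses the coin-and-die experiment; the particular events A ("head and even") and B ("die shows 6") are illustrative choices made here so that the two events overlap.

```python
from fractions import Fraction
from itertools import product

# Sample space: one coin toss together with one die roll (12 outcomes).
S = set(product("HT", range(1, 7)))

def prob(event):
    """Probability of an event (a subset of S) by counting."""
    return Fraction(len(event), len(S))

A = {(c, d) for c, d in S if c == "H" and d % 2 == 0}   # head and even
B = {(c, d) for c, d in S if d == 6}                     # die shows 6

lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)
```

Subtracting the overlap corrects for the outcome (H, 6) being counted once in A and once in B.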
There is a second model called the square model that is useful for analyzing
probabilities.
Example: Suppose 100 people walk down a garden path that leads to a fork. A left
turn leads to house A, a right turn to house B.
Assume that there is a 50% chance that a person will turn one way over another.
In this set-up we'd expect, basically, 50 people to end up at house A and 50 people
at house B. The following diagram of one-hundred dots (for 100 people) depicts
this outcome:
The number 100 here is immaterial. The point is that if a square is used to denote
the entire population of people walking down the path, then half the area of the
square (half the people) end up with results A, and the second half the square
result B.
This was a very simple example. Let's practice some more complicated scenarios.
EXERCISE: Folk walk down the following system of paths. Use the square model to
compute the fraction of people that end up at house A, at house B, and at house C.
[Assume that each choice encountered at a fork in the path is equally likely.]
EXERCISE: Folk walk down the following system of paths. Use the square model to
compute the fraction of people that end up at each house. [Again assume that each
choice encountered at a fork in the path is equally likely.]
EXAMPLE: I roll a die and then toss a coin. What are the chances of getting an
even number followed by a head?
Answer: Think of this as a path-walking problem with two houses labeled "WANT"
and "DON'T WANT". The forks in the road represent the options that can occur
(each with a 50% chance of occurring):
We see that the desired outcome represents one quarter (half of a half) of the
square:
\[ P(\text{even AND head}) = \frac{1}{4} \]
EXAMPLE: I toss a quarter, then I toss a dime, and then I roll a die. What are the
chances of receiving HEAD, HEAD, and 5 or 6?
We have:
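\[ P(\text{H and H and } \{5, 6\}) = \tfrac{2}{6} \text{ of } \tfrac{1}{2} \text{ of } \tfrac{1}{2} \text{ of the square} = \frac{2}{6} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{12} \]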
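The "fraction of a fraction of a fraction" calculation can be compared with a direct count of all outcomes. This is a minimal sketch, assuming a fair quarter, a fair dime and a fair die; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Every (quarter, dime, die) outcome: 2 x 2 x 6 = 24 equally likely triples.
outcomes = list(product("HT", "HT", range(1, 7)))
wanted = [o for o in outcomes if o[0] == "H" and o[1] == "H" and o[2] in (5, 6)]

by_counting = Fraction(len(wanted), len(outcomes))
by_multiplying = Fraction(2, 6) * Fraction(1, 2) * Fraction(1, 2)
print(by_counting, by_multiplying)
```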
If A represents the set of desired outcomes for one action, and B the set of
desired outcomes for a second action, and these actions are independent then:
\[ P(A \text{ and } B) = P(A) \times P(B) \]
In summary:
ASIDE: Consider a garden path that leads to two possible houses, A and B.
Suppose there is an infinite number of three-way forks, each with left turn
leading to A, right turn leading to B, and straight path leading to the next fork.
Create a beautiful depiction of the square model for this situation that makes it
visually clear that this formula must be true:
\[ \frac{1}{3} + \frac{1}{9} + \frac{1}{27} + \frac{1}{81} + \cdots = \frac{1}{2} \]
For example, in tossing a coin and rolling a die there are three possibilities:
Standing back and thinking about this one would likely say that all three scenarios
are philosophically equivalent, even though the tree diagrams and square diagrams
for possibilities 1. and 2. are different. (Draw them! Is it possible to draw a tree
diagram for simultaneous actions?)
It's good to spell things out and make explicit the following:
SEQUENCE PRINCIPLE: If two actions are independent, then performing the two
actions simultaneously is philosophically equivalent to performing them one at a
time (and it does not matter in which order one opts to do them).
EXAMPLE: I roll a pair of dice. What are the chances of getting a 6 and a 2?
EXERCISE: I roll three dice simultaneously. What are the chances of seeing two 6s
and one 5?
EXERCISE: In rolling two dice, what are the chances of getting a 2 and a 2?
This is essentially all there is to naive probability theory. Of course, there are
subtle issues to explore (and we'll come to those in applications), but for now, let's
practice the basic ideas.
EXERCISE: Suppose A is an event for some sample space S. Explain the following
formula:
P(not A) = 1 - P(A)
EXERCISE:
The probability that any one person will be bitten at least once in life by a dog is
\( \frac{1}{20} \). The probability of being bitten by a cat is \( \frac{1}{50} \).
Find:
a) The probability that a person will be bitten by both a cat and a dog some
time in life
b) The probability that a person will never be bitten by a dog.
c) The probability that someone will be bitten by a cat or a dog but not both.
EXERCISE: Three dice are tossed simultaneously. What are the chances of:
Three dice and two coins are tossed simultaneously. What are the chances of:
[Assume that the chances of having a boy match those of having a girl.]
EXERCISE:
a) You know that Jenny has two children and that her first child is a boy. What
is the probability that her other child is a girl?
b) You know that Mike has two children and that one is a boy. What are the
chances that the other child is a girl?
EXERCISE:
Lulu has four children and you are told that at least one of the four is a boy. What
is the probability that
a) Exactly two of her children are boys?
b) At least two are boys?
a) What are the chances that I will get all 10 questions right?
b) What are the chances that I will get 9 out of 10 correct?
c) What are the chances that I will get 8 out of 10 correct?
d) What are the chances that I will get at least three right?
Dorothy is not feeling good. She stands at the edge of a cliff, conveniently labeled
position 1, with an infinite expanse of land at her back (conveniently labeled in
steps 2, 3, 4, ...). Regard location 0 as "off the cliff."
Dorothy lays her fate in the toss(es) of a coin. She pulls out a quarter to toss and
decides:
She does this repeatedly, stepping forward with each landing of HEADS and backwards
with each landing of TAILS, until she either meets her doom or ends up wandering
forever in the infinite expanse behind her.
What are the chances that Dorothy will walk over the cliff?
STEP 1: Can you see that \( p(1 \to 0) \) and \( p(2 \to 1) \) and \( p(3 \to 2) \) and so on, are each,
philosophically, the same problem and so have the same value \( p \)?
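Notice that:
\[
\begin{aligned}
p(1 \to 0) &= p(\text{stepping forward right away OR stepping back to position 2 and reaching 0 sometime later}) \\
&= p(\text{stepping forward right away}) + p(\text{stepping back AND moving } 2 \to 0) \\
&= \frac{1}{2} + \frac{1}{2}\, p(2 \to 0) \\
&= \frac{1}{2} + \frac{1}{2}\, p(\text{moving from 2 to 1 AND moving from 1 to 0}) \\
&= \frac{1}{2} + \frac{1}{2}\, p(2 \to 1)\, p(1 \to 0)
\end{aligned}
\]
\[ p = \frac{1}{2} + \frac{1}{2} p^2 \]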
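The final equation, p = 1/2 + (1/2)p², can be solved with a few lines of algebra. The working below is added as a sketch to show where the derivation leads; it assumes only the equation itself and a fair coin.

```latex
\begin{align*}
p &= \tfrac{1}{2} + \tfrac{1}{2}p^2 \\
2p &= 1 + p^2 \\
p^2 - 2p + 1 &= 0 \\
(p - 1)^2 &= 0 \\
p &= 1
\end{align*}
```

With a fair coin, then, the only value consistent with the equation is p = 1: Dorothy walks over the cliff with probability 1.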
COMMENT: We are flirting with the notion of a "random walk" and have essentially
proven that, in a one-dimensional random walk, the walker will visit each and every cell of the
line an infinite number of times. Feel free to conduct some internet research on
this topic.
She decides she will stop playing when she either reaches $0 (loses all her money)
or gets to $10.
\[
\begin{aligned}
p(N) &= p(\text{lose a dollar AND play with } \$(N-1) \text{ OR win a dollar AND play with } \$(N+1)) \\
&= \frac{1}{2}\, p(N - 1) + \frac{1}{2}\, p(N + 1)
\end{aligned}
\]
CHALLENGE: If \( p(1) = x \), show that this means that \( p(2) = 2x \), and that \( p(3) = 3x \),
and so forth. What must be the value of \( x \)?
Suppose she uses a weighted coin that has only a \( \frac{1}{3} \) chance of landing HEADS and
a \( \frac{2}{3} \) chance of landing TAILS. Show that her chance of survival after a possibly
infinite number of tosses is now 50%!
A not-so-exciting example:
A ball is chosen at random. What is the probability that the ball is:
a) Blue?
b) Either red or blue?
c) Neither red nor blue?
A not-unexciting example:
EXERCISE: A bag contains one Red, one Blue and one White ball. John picks out a
ball at random.
If it is Red, he wins.
If it is blue, he loses.
If it is white, he puts the ball back, and adds to the bag another red ball and
another blue ball. (So the bag now contains 2Rs, 2Bs and one W.)
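\[ \frac{1}{3} + \frac{1}{3} \cdot \frac{2}{5} + \frac{1}{3} \cdot \frac{1}{5} \cdot \frac{3}{7} + \frac{1}{3} \cdot \frac{1}{5} \cdot \frac{1}{7} \cdot \frac{4}{9} + \cdots = \frac{1}{2} \]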
Show that if you choose the die to the left of your friend's choice (or choose die D
if your friend chooses die A) you will win this game two-thirds of the time. That is,
show that:
die A beats die B two-thirds of the time
die B beats die C two-thirds of the time
die C beats die D two-thirds of the time
die D beats die A two-thirds of the time
HINT: The following table shows all possible wins if dice A and B are rolled:
Player A picks a number at random from row A, player B a number at random from
row B, and player C a number at random from row C.
Two random people are kidnapped. What are the chances that their birthdays land
on the same day of the year? Give your answer as a percentage to one decimal
place.
Answer:
Three random people are kidnapped. What are the chances that at least two of
them have the same birthday?
Answer:
Four people are kidnapped. What are the chances that at least two of them have
the same birthday?
Answer:
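A common way to handle "at least two share a birthday" is to compute the chance that all birthdays are different and subtract from 1. The sketch below follows that route; it assumes 365 equally likely birthdays (no leap years) and independent people, and the function name is an illustrative choice.

```python
from fractions import Fraction

def shared_birthday_probability(people, days=365):
    """Chance that at least two of `people` share a birthday.

    Assumes every day is equally likely and ignores leap years.
    """
    all_different = Fraction(1)
    for k in range(people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        all_different *= Fraction(days - k, days)
    return 1 - all_different

for n in (2, 3, 4):
    print(n, f"{float(shared_birthday_probability(n)):.1%}")
```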
We are using a principle here called the Law of Large Numbers. This law makes
intuitive sense and is often assumed without explicit mention. In the nineteenth
and twentieth centuries, as mathematicians attempted to put probability theory on
a sound, rigorous footing, one of the significant check-points of their work was
the ability to prove this Law of Large Numbers as true according to the axioms of
their theory. Here's the principle:
Example: If you toss a coin some number of times, you would expect approximately
half of the tosses to be HEADS and half TAILS.
With 10 tosses, it is unlikely you will receive exactly half of each. (Try it!)
With 100 tosses, the proportion of heads would be closer to 50%
With 1000 tosses, significantly closer to 50%.
Even better with 100000000000000000 tosses!
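The trend described in this example is easy to watch in a simulation. This is a minimal sketch assuming a fair coin modelled by Python's pseudo-random generator; the function name and the fixed seed (used only to make runs repeatable) are illustrative.

```python
import random

def heads_proportion(tosses, seed=0):
    """Proportion of HEADS in a simulated run of fair coin tosses."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return heads / tosses

for n in (10, 100, 1000, 100_000):
    print(n, heads_proportion(n))
```

Any single run may wobble, but the proportions for longer runs tend to sit closer to 50%.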
If a run of trials did not produce the desired outcome, then the chances of
that outcome occurring on the very next trial is increased.
Example: You toss a coin nine times and get four HEADS and five TAILS. That the
next toss will be HEADS is thus almost certain. NOT TRUE!
Example: You toss a coin 999 times and get TAILS every time. The chance that
the next toss will be HEADS, alas, is still only 50%.
Gamblers often feel that a string of losses must produce a win on the next turn.
Aside: Read Bringing Down the House to learn how MIT students tipped the odds
of blackjack in their favour by certain tactical plays.
One can use the Monte Carlo Method to work out areas of complicated regions.
One can compute (a fairly accurate approximation to) the area by digitizing the
photograph and having a computer select 10,000 points at random in the photograph.
If, say, 6,473 of those points land in the shaded region, then we can say that the
area of the spill is ?
ACTIVITY:
In a group of three call one person player 0, one person player 2 and the remaining person
player 3. Each person places his right hand behind his back and secretly holds up one, two
or three fingers. The players then show their hands.
a) Play this game a large number of times and tally points in a table. From your data,
who seems to have the largest chance of receiving a point in any single game?
Estimate that player's probability of winning. Estimate the chances of a win for each of the
remaining two players.
b) Use theory to determine the actual probability of a win for each of the three
players.
c) Suppose players 3, 2, and 0 are assigned, instead of one point for each win, a, b and
c points, respectively, for each win. Choose values for a, b and c that make this
game fair.
EXERCISE: a) A bag contains four balls each colored either red or blue. Jenny pulls out
two balls at random and gets a pair of blue balls. She returns the balls to the bag, gives it
a shake, and pulls out another pair of balls. She does this 100 times recording the results
along the way:
BB = 52 times
BR = 48 times
RR = 0 times
Most likely, how many blue balls and how many red balls are in that bag?
Surprisingly, no matter on which word you start this counting task, the procedure
always takes you to the same word in the final line, namely, the word you.
Kruskal observed that this same phenomenon seems to occur with any sufficiently
large piece of text - counting forward in this way from any choice of beginning
word lands you at the same place at the end of the page. This provides an amusing
activity for several people to perform simultaneously, all working with the same
text, but starting with different choices of initial word.
EXPECTED VALUE
Suppose you play a game with some monetary values associated with it.
Definition: The expected value of a game is the average profit (or loss) one would
expect if the game were played a large number of times.
For example, suppose we played the above game 200 times. Then, on average, we
would expect to win $2 one hundred times and lose $1 one hundred times.
Expected value = \( 2 \cdot \frac{1}{2} + (-1) \cdot \frac{1}{2} = \frac{1}{2} \) dollars = 50 cents
COMMENT: Many text-book questions phrase a game like the one described above
as follows:
Do you see that it is exactly the same game as before? The idea of having to pay
first often offers a point of confusion for students. It is good to tease such
questions apart and list the end outcomes explicitly:
If HEADS, I'm up $2 overall
If TAILS, I am down $1 overall
Answer: Imagine playing 600 rounds. (Why did I choose the number 600?)
This game is in your favour. It is worth playing. You can expect to win, on average,
50 cents per game.
In general
\[ x_1 p_1 + x_2 p_2 + \cdots + x_n p_n \]
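The general sum of "value times probability" translates directly into a few lines of code. The sketch below is illustrative: the function name `expected_value` and the list-of-pairs layout are choices made here, and the sample game is the coin game with outcomes of +$2 on HEADS and -$1 on TAILS.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Sum of value * probability over (value, probability) pairs.

    The probabilities are expected to add up to 1.
    """
    assert sum(p for _, p in outcomes) == 1
    return sum(x * p for x, p in outcomes)

# Coin game: up $2 on HEADS, down $1 on TAILS.
coin_game = [(2, Fraction(1, 2)), (-1, Fraction(1, 2))]
print(expected_value(coin_game))
```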
Answer:
EXAMPLE: A coin is tossed.
and so on.
(Basically: You win $1 if H first appears on an odd toss, lose $1 if H first appears
on an even toss.)
Answer:
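The probability of getting heads on the first toss is \( \frac{1}{2} \).
The probability of first getting heads on the second toss is: \( \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \). (Why?)
The probability of first getting heads on the third toss is: \( \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8} \).
And so on.
Thus, writing \( E \) for the expected value:
\[ E = 1 \cdot \frac{1}{2} + (-1) \cdot \frac{1}{4} + 1 \cdot \frac{1}{8} + (-1) \cdot \frac{1}{16} + \cdots = \frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} + \cdots \]
One can use the geometric series formula to evaluate this infinite sum. Another
approach (a neat trick) is to multiply this sum by two:
\[ 2E = 2\left(\frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} + \frac{1}{32} - \cdots\right) = 1 - \frac{1}{2} + \frac{1}{4} - \frac{1}{8} + \frac{1}{16} - \cdots = 1 - \left(\frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} + \cdots\right) = 1 - E \]
So \( 2E = 1 - E \) giving \( E = \frac{1}{3} \). Done!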
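The value 1/3 can be cross-checked numerically by adding up the first several terms of the alternating sum 1/2 - 1/4 + 1/8 - 1/16 + ... . This is an illustrative sketch; the function name `partial_sum` is hypothetical.

```python
from fractions import Fraction

def partial_sum(terms):
    """Sum of the first `terms` terms of 1/2 - 1/4 + 1/8 - 1/16 + ..."""
    total = Fraction(0)
    for n in range(1, terms + 1):
        payoff = 1 if n % 2 == 1 else -1   # win $1 on odd tosses, lose $1 on even
        total += payoff * Fraction(1, 2 ** n)
    return total

for k in (1, 2, 5, 10, 20):
    print(k, float(partial_sum(k)))
```

The partial sums hop back and forth across 1/3, closing in on it from alternating sides.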
EXERCISE:
a) Consider the example we first studied in this section, phrased as the textbooks
tend to phrase it:
Well, the expected payout is \( 3 \cdot \frac{1}{2} + 0 \cdot \frac{1}{2} \) = $1.50. But since we paid a dollar, we must subtract $1
from this amount. The expected profit is therefore 50 cents.
COMMENT: Some students prefer to follow this approach to questions like these.
TWO TIDBITS
EXERCISE:
In Tiny-Town, 90% of the city cabs are purple and the remaining 10% are blue.
A crime was committed and an eye-witness claims she saw a blue cab at the scene.
Subsequent tests showed that this witness is correct in her observations four
times out of five, that is, 80% of the time.
What are the chances that the cab at the scene really was blue?
Named after the host of the popular American TV game show "Let's Make a Deal!", the
Monty Hall problem is a classic puzzler often used to test initiates in the field of
probability theory. It goes as follows:
On a game show three closed doors stand before you. The host informs you
that a cash prize lies behind one of the doors, and nothing behind the other
two. You select a door, but before you open it, the host quickly opens one of
the remaining two doors to show you that the prize is not there. He now
gives you the chance to change your mind and open instead the third remaining
door. The question is: What should you do? Should you stay with your original
choice of door, or switch to the other option? Is there any advantage to
switching?
One's typical first reaction to this puzzle is that there is no advantage at all to
switching: since two doors remain with only one containing a prize, the chance of
selecting the correct door, either by staying with the chosen door or switching, is
always 50 percent. Surprisingly, this reasoning is not correct, for it makes no use
of the subtle information the host presents to you, which you can actually use to
your advantage.
a) Play the game with a partner using playing cards as doors: one black and
two red. Take turns being host and being contestant. What do you notice
about your choices as host?
b) Explain why your odds of winning double if you choose to always switch
rather than stick.
c) Suppose the host presents you with 100 doors with only one containing a
prize. You reach for a door but just before you open it, the host reveals to
you the empty contents of 98 other doors. There are now two closed doors,
one with your hand on it. The host then offers you the chance to change
your mind and open instead the remaining closed door. Should you stick
with your original choice or switch?
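Besides playing with cards, the game can be played many times by computer. The sketch below is an illustration under stated assumptions: the host knows where the prize is, never reveals it, and opens every unchosen door except one (one door in the 3-door game, 98 doors in the 100-door game). The function names and seed are hypothetical choices.

```python
import random

def play(switch, doors=3, rng=random):
    """Play one game; return True if the contestant wins the prize."""
    prize = rng.randrange(doors)
    choice = rng.randrange(doors)
    # The host leaves exactly one other door closed and never reveals the prize.
    if choice == prize:
        remaining = rng.choice([d for d in range(doors) if d != choice])
    else:
        remaining = prize
    final = remaining if switch else choice
    return final == prize

def win_rate(switch, doors=3, trials=100_000, seed=0):
    """Fraction of simulated games won with a fixed stick-or-switch policy."""
    rng = random.Random(seed)
    return sum(play(switch, doors, rng) for _ in range(trials)) / trials

print("3 doors, stick: ", win_rate(False))
print("3 doors, switch:", win_rate(True))
print("100 doors, switch:", win_rate(True, doors=100))
```

The simulation only reports frequencies; explaining why they come out this way is still the point of parts b) and c).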
A game contains four doors, one with a fabulous cash prize behind it, the remaining
three empty. The host invites you to select a door and you place your hand on its
knob.
At this stage, Monty opens one of the remaining three doors and shows its empty
contents. He offers you the chance to stick with your current door choice or
switch to one of the remaining two closed doors. You make your choice.
Next, Monty opens a second door to reveal its empty contents. Two closed doors
remain, one with your hand on its knob. He again offers you the chance to stick or
switch.
At this stage the game ends and you accept the consequences.
Which of these four possibilities gives you the greatest chance of winning?
Assume that Monty knows where the prize lies but has a preference as to which
non-prize door he opens when faced with a choice.
(For an example of preference, suppose the doors are numbered 1, 2, and 3 and Monty will
always open the lowest numbered door he can.)
In this scenario, switching is not always better! Sometimes a stick is just as good! Matters
depend on individual plays.
For example, suppose the contestant reaches for door number 1. If Monty opens door
number 3 to reveal a non-prize, then the contestant should switch for a certain win. (What
stopped Monty from opening door 2 if that was his preference?) If, on the other hand,
Monty did open door 2 to reveal a non-prize, then there is no advantage to sticking or
switching. (Monty's action here reveals no information about the possible location of the
prize.)
EXPERIMENT: Conduct a card experiment with a friend mimicking this scenario. How
often can your friend deduce for certain the location of the black card? In general, what
are the odds of your friend winning this version of the game?
Assume that Monty has no knowledge of the location of the prize and, by luck,
opened a non-prize door.
CHALLENGE: Consider the final scenario in which Monty does not know the location of the
prize but will open the lowest numbered door available to him. It turns out to be a non-prize.
Should the contestant stick or switch, or does it depend?
CONDITIONAL PROBABILITY
Analysis of the Monty Hall problem flirts with the difficulty of what to do when partial
information is revealed in a situation. This leads us to the notion of conditional
probability.
Example: Two cards are drawn at random from a deck. Knowledge of the colour of
the first card will affect the likelihood that the second card is red. Specifically:
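If the first card is black: \( P(\text{second card red}) = \frac{26}{51} \)
If the first card is red: \( P(\text{second card red}) = \frac{25}{51} \)
If we are told nothing about the colour of the first card, then:
\( P(\text{second card red}) = \frac{26}{52} = \frac{1}{2} \)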
Notation: The probability that event A will occur given knowledge that event B has
already occurred is denoted:
P(A|B)
e.g. We have: \( P(\text{second card red} \mid \text{first card black}) = \frac{26}{51} \)
Imagine running the experiment a large number of times and observing the number
of times B occurs.
We want P(A|B), the proportion of times A occurs among those times B has already
happened. That is, we want the number of times both A and B occurred compared
to the number of times just B occurred. This suggests:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]
Example: Draw a card from a deck. A friend tells you that the card is red. What is
the probability that it is an ace?
\[ P(\text{ace} \mid \text{red}) = \frac{P(\text{red ace})}{P(\text{red})} = \frac{2/52}{1/2} = \frac{1}{13} \]
[And this makes sense since among the 26 red cards, two are aces.]
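The conditional probability formula, P(A|B) = P(A and B) / P(B), can be tried out on an explicit 52-card deck. The sketch below is illustrative only: the deck representation and the helper names `prob` and `conditional` are assumptions made here.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
SUITS = {"hearts": "red", "diamonds": "red", "clubs": "black", "spades": "black"}
deck = list(product(RANKS, SUITS))   # 52 (rank, suit) pairs

def prob(event):
    """Probability that a randomly drawn card satisfies `event`."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def conditional(a, b):
    """P(A | B) computed as P(A and B) / P(B)."""
    return prob(lambda card: a(card) and b(card)) / prob(b)

def is_ace(card):
    return card[0] == "A"

def is_red(card):
    return SUITS[card[1]] == "red"

print(conditional(is_ace, is_red))   # P(ace | red)
print(conditional(is_red, is_ace))   # P(red | ace)
```

Swapping the roles of the two events gives a different number, which is the asymmetry Bayes' theorem is designed to handle.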
Exercise: A die is rolled. Someone yells out that the answer is odd.
Given this information, what is the probability that the roll was a 3? a 4?
Answer these questions by practicing the formula for conditional probability (and
then check that the answers make sense!)
Recall that two actions are independent if the outcomes of one in no way affect
the outcomes of the other. So, if A and B are independent events, we'd expect
then:
P(A|B) = P(A)
\[ P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(A)\,P(B)}{P(B)} = P(A) \]
Example: A coin is tossed and a die is rolled. What is the probability of getting a
HEAD given that the die rolled a 6?
\[ P(\text{HEAD} \mid 6) = \frac{\frac{1}{2} \cdot \frac{1}{6}}{\frac{1}{6}} = \frac{1}{2} \]
\[ P(\text{ace} \mid \text{red}) = \frac{1}{13} \]
\[ P(\text{red} \mid \text{ace}) = \frac{1}{2} \]
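RECALL:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]
and
\[ P(B \mid A) = \frac{P(B \cap A)}{P(A)} \]
We have:
\[ P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A \cap B)}{P(A)} = \frac{P(A \cap B)}{P(B)} \cdot \frac{P(B)}{P(A)} = P(A \mid B)\, \frac{P(B)}{P(A)} \]
That is:
\[ P(B \mid A) = P(A \mid B)\, \frac{P(B)}{P(A)} \]
Example: \( P(\text{red} \mid \text{ace}) = P(\text{ace} \mid \text{red}) \cdot \frac{P(\text{red})}{P(\text{ace})} = \frac{1}{13} \cdot \frac{1/2}{1/13} = \frac{1}{2} \).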
More generally
Suppose B1 and B2 are two non-overlapping events that cover the whole sample
space.
Then:
BAYES' THEOREM:
\[ P(B_1 \mid A) = \frac{P(A \mid B_1)\, P(B_1)}{P(A \mid B_1)\, P(B_1) + P(A \mid B_2)\, P(B_2)} \]
Proof:
Let's do an example:
EXAMPLE:
Bag 1 contains 5 red and 2 white balls.
Bag 2 contains 7 red and 4 white balls.
Answer: Let:
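We want \( P(B_1 \mid A) \), the probability that the ball came from bag 1 given that it is red.
\[ P(B_1 \mid A) = \frac{P(A \mid B_1)\, P(B_1)}{P(A \mid B_1)\, P(B_1) + P(A \mid B_2)\, P(B_2)} = \frac{\frac{5}{7} \cdot \frac{1}{2}}{\frac{5}{7} \cdot \frac{1}{2} + \frac{7}{11} \cdot \frac{1}{2}} = \frac{55}{104} \]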
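The two-bag calculation can be reproduced with exact fractions. The sketch below is an illustration: the function name `bayes_two_cases` is hypothetical, and it assumes, as the worked numbers indicate, that each bag is chosen with probability 1/2 and that A is the event "the ball drawn is red".

```python
from fractions import Fraction

def bayes_two_cases(prior_1, likelihood_1, prior_2, likelihood_2):
    """P(B1 | A) for two non-overlapping cases B1, B2 covering the sample space.

    prior_i = P(Bi), likelihood_i = P(A | Bi).
    """
    numerator = likelihood_1 * prior_1
    return numerator / (numerator + likelihood_2 * prior_2)

posterior = bayes_two_cases(
    prior_1=Fraction(1, 2), likelihood_1=Fraction(5, 7),    # bag 1: 5 red of 7
    prior_2=Fraction(1, 2), likelihood_2=Fraction(7, 11),   # bag 2: 7 red of 11
)
print(posterior)
```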
The theorem looks complicated, but it allows you to compute some nasty problems
with relative ease.
Both cards are put in a bag and one is pulled out at random. You see that one side
of the chosen card is red. What is the probability that the other side of this
chosen card is also red?
Answers:
EXERCISE: Yale psychologists have coined the term "cognitive dissonance" for the act of
devaluing an object after being told it is not available, and conducted an experiment with
monkeys to show that they too might engage in cognitive dissonance. (See
http://www.nytimes.com/2008/04/08/science/08tier.html?_r=1&8dpc&oref=slogin ).
Scientists had discovered that monkeys prefer red, green and blue M&Ms over all other
colors. Monkeys were then given two M&Ms of different colors, say one red and one
green. The monkeys would grab one candy, say the red one, and then have the other one
taken away. Next the monkey would be offered another two M&Ms but of the colors it
had not eaten, in our example, blue and green. The monkey had already experienced the
green M&M being taken away, and the scientists found that about two thirds of the
monkeys opted to take the blue one instead. Had the majority of monkeys indeed devalued
the green M&M given the previous loss?
Show that this is a mathematical result and not a psychological result. That is, show that,
of all the monkeys that prefer red M&Ms over green M&Ms, two thirds of them also
prefer blue M&Ms over green M&Ms, irrespective of whatever experiment is to be
conducted!
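One way to see the counting behind this exercise is to list every possible preference ordering of the three colors. The sketch below assumes, purely for illustration, that each monkey has a strict ranking of red, green and blue and that all six rankings are equally common; the helper name `prefers` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

# Every strict ranking of the three colors, most preferred first.
orderings = list(permutations(["red", "green", "blue"]))

def prefers(order, a, b):
    """True if color `a` is ranked ahead of color `b` in this ordering."""
    return order.index(a) < order.index(b)

red_over_green = [o for o in orderings if prefers(o, "red", "green")]
also_blue_over_green = [o for o in red_over_green if prefers(o, "blue", "green")]

fraction = Fraction(len(also_blue_over_green), len(red_over_green))
print(fraction)
```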
PROBLEM SET I
Question 1: Two players play a series of games for the best out of five. The winner is to
receive a prize of $1000. After three games of play, in which the first player had
won one game and the other had won two games, the match was interrupted by an
earthquake. How should the $1000 be divvied up between the two players so as to
properly reflect their likelihoods of having won the series? (Assume that each
player has a 50% chance of winning any particular game.)
Question 2: Repeat question 1 but this time assume that the first player has only
a 10% chance of winning any individual game.
Question 3: 8640 people walk down the following garden path. At each fork, equal
numbers of people take each option. Find the number of people that end up in each
of the houses A, B, C, and D.
Question 4: Assume that exactly 50% of children born are boys and 50% are girls.
A couple has three children.
a) Draw a tree diagram displaying the possible genders of their three children.
b) What are the chances that the couple has three boys?
c) What are the chances that the couple has at least one boy?
d) What are the chances that the couple has exactly two boys?
Suppose that we are now told that their first child was a girl.
e) What are the chances that the other two children are also girls?
f) What are the chances that at least one of their three children is a girl?
Question 5: Billy's girlfriend has a dimple on her left cheek (there is a 1/100 chance
that this occurs), blue eyes (there is a 1/100 chance this occurs), and likes math
(there is a 1/100 chance that this occurs). He says that his girlfriend is one in a
million. Is he correct?
Question 8: An urn contains 5 red balls, 8 blue balls, and 7 white balls. A ball is
selected at random. What are the chances of:
After the ball is selected, it is returned to the urn, and the experiment is
repeated.
Question 9:
The chances that someone gets bitten by a dog at least once in life is 0.02 .
The chances of being hit by a meteorite at least once in life is 0.001 .
The chances of stepping in gum at least once in life is 0.99 .
Question 10: M&Ms come in six colors. Here's a table showing the probability that
a randomly chosen M&M has a particular colour:
c) What are the chances that two M&Ms chosen at random from an extremely
large bag are both blue?
d) What are the chances that three M&Ms chosen at random from an
extremely large bag are all blue?
I've been told that the colour distribution for Peanut M&Ms is different.
Obtain a bag of peanut M&Ms and use your sample to make estimates for the
entries in the following table:
Two players decide to play the following game. Player A chooses the sequence HHH
and player B the sequence THH. A coin is tossed repeatedly until one of these
sequences appears. For example, the coin might produce T, T, H, T, H, H and
player B wins. If the coin produces the sequence H, T, T, H, H, H then player A
wins.
a) Play the game 10 times. Does player B seem to win the majority of times?
b) Explain why player B has the advantage.
c) Suppose instead player A chooses the sequence HHT and player B the
sequence THH. Play the game 10 times. Does player B again win the majority
of times? Can you explain why?
d) Here's a table showing all the options A could choose and what B chooses in
response.
Play the game 10 times for each of the eight rows in the table. Verify that B
wins the majority of times in each case. Show me the results you obtained.
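Ten hand-played games give only a rough picture, so a simulation can complement them. The sketch below plays the coin game many times for the two pairings named in parts a) and c); the function names, the number of games and the fixed seed are illustrative choices, and the table from part d) is not reproduced here.

```python
import random

def play(seq_a, seq_b, rng):
    """Toss a fair coin until seq_a or seq_b appears; return 'A' or 'B'."""
    history = ""
    while True:
        history += rng.choice("HT")
        if history.endswith(seq_a):
            return "A"
        if history.endswith(seq_b):
            return "B"

def b_win_rate(seq_a, seq_b, games=20000, seed=0):
    """Fraction of simulated games won by player B."""
    rng = random.Random(seed)
    wins = sum(play(seq_a, seq_b, rng) == "B" for _ in range(games))
    return wins / games

print(b_win_rate("HHH", "THH"))   # A chooses HHH, B chooses THH
print(b_win_rate("HHT", "THH"))   # A chooses HHT, B chooses THH
```

A simulation only suggests who has the advantage; explaining why is still the point of parts b) and c).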
Question 13: A bag contains a red ball and a white ball. Jodie takes out a ball at
random. If it is red she wins. If it is white, she then moves to a bag that contains
two red balls, and a single white ball, and pulls out a ball. If it is red, she wins. If it
is white, she then moves to a bag that contains three red balls and a single white
ball, and pulls out a ball. If it is red, she wins. If it is white, she then moves on to
There are an infinite number of bags available to her, and she keeps playing this
game until she eventually wins.
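$\frac{1}{2} + \frac{1}{2}\cdot\frac{2}{3} + \frac{1}{2}\cdot\frac{1}{3}\cdot\frac{3}{4} + \frac{1}{2}\cdot\frac{1}{3}\cdot\frac{1}{4}\cdot\frac{4}{5} + \cdots = 1$
(That is, in math notation, we've established: $\sum_{n=1}^{\infty} \frac{n}{(n+1)!} = 1$.)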
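The infinite sum of the terms n/(n + 1)! can be watched converging to 1 numerically. This is a small sketch using exact fractions; `partial_sum` is a hypothetical helper name.

```python
from fractions import Fraction
from math import factorial

def partial_sum(terms):
    """Sum of n / (n + 1)! for n = 1, ..., terms, as an exact fraction."""
    return sum(Fraction(n, factorial(n + 1)) for n in range(1, terms + 1))

for terms in (1, 2, 3, 4, 10):
    s = partial_sum(terms)
    print(terms, s, float(s))
```

In each case the gap between the partial sum and 1 is exactly 1/(terms + 1)!, the chance that Jodie has drawn white from every bag so far.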
Question 14: A bag contains a red ball, a blue ball, and a white ball. Schuyler pulls a
ball out at random. If it is red, he wins. If it is blue, he loses. If it is white, then
he moves on to a bag that contains two red, two blue and one white ball. He pulls
one out at random. If it is red, he wins. If it is blue, he loses. If it is white he
moves on to a bag that contains four red, four blue, and one white ball. And so on,
with double the number of red balls and double the number of blue balls from bag
to bag.
Question 15: Three dice are tossed simultaneously. What are the chances of
rolling
a) three sixes?
b) two sixes and a one?
c) no sixes?
d) at least one six?
Question 16: Three dice and two coins are tossed simultaneously. What are the
chances of receiving
a) Explain why the chances of pulling out five cards that are all hearts is:
$\frac{1}{4}\cdot\frac{12}{51}\cdot\frac{11}{50}\cdot\frac{10}{49}\cdot\frac{9}{48} \approx 0.05\%$.
b) Find the probability of pulling out five black cards.
c) Show that the probability of pulling out four Kings among the five cards is
close to 0.002%.
d) Show that the probability of pulling out three Kings and two Queens is close
to 0.001%.
Player A chooses a number at random from the first row; player B chooses a
number at random from the second row, and player C chooses a number at random
from the third row.
a) What are the chances that player B's number is higher than player A's?
b) What are the chances that player C's number is higher than player B's?
c) What are the chances that player A's number is higher than player C's?
[In this game of chance, B has the advantage over A, C has the advantage over B,
and A has the advantage over C!]
Question 19:
a) Eleven numbers are arranged in a line. The first number is 0, the last number
is 0, and every number in between is the average of its two neighbors. What
are the 11 numbers and why?
b) Eleven numbers are arranged in a line. The first number is 0, the last number
is 1, and every number in between is the average of its two neighbors. What
are the 11 numbers and why?
Question 20: You are a game show contestant and the game show host presents to
you 100 boxes. She tells you that inside one box lies a fabulous prize and all the
remaining boxes are empty. You select a box at random and are about to open it
when the host interrupts you and opens 98 boxes to reveal to you their emptiness.
This leaves two boxes: the one you selected and one other.
You are now given the chance to stick with the box you first chose, or to switch
and open instead the second box.
a) If you decide to stick, what are your chances of winning the prize?
b) If you decide to switch, what are your chances of winning the game?
Suppose the game show host opens only 97 boxes. This leaves three boxes: the one
you first selected and two others.
The host now gives you the choice to either stick with your original box or to
switch to either one of the remaining boxes.
c) If you decide to stick, what are your chances of winning the prize?
d) If you switch to a different box, what now are your chances of winning?
Question 21: In a game, if a outcomes are deemed favorable and the remaining b
possible outcomes unfavorable, then folk may say (in horse racing circles in
particular) that the odds in favor of winning are a to b, or alternatively that the
odds against are b to a. For example, in rolling a die the odds in favor of rolling a
6 are 1:5. The odds against rolling a 5 or a 6 are 4:2 (which could be reduced to
2:1). In a horse race, if the odds against a horse are 7:2, this means that bookies
believe that the horse has only a $\frac{2}{9}$ chance of winning.
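The translation between odds and probability described in Question 21 can be sketched in a few lines. The two function names are hypothetical; the examples are the ones quoted in the question.

```python
from fractions import Fraction

def prob_from_odds_in_favor(a, b):
    """Odds in favor a:b mean a favorable and b unfavorable outcomes."""
    return Fraction(a, a + b)

def prob_from_odds_against(b, a):
    """Odds against b:a mean b unfavorable and a favorable outcomes."""
    return Fraction(a, a + b)

print(prob_from_odds_in_favor(1, 5))   # 1/6, rolling a 6 with one die
print(prob_from_odds_against(7, 2))    # 2/9, the horse in the example
print(prob_from_odds_against(4, 2))    # 1/3, rolling a 5 or a 6
```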
CORRECT or INCORRECT?
a) A bookie at a horse race says that the odds against a particular horse are
5:8. This means that the probability the horse will win the race is $\frac{8}{13}$.
b) A game yields a 30% chance of a win. The odds against winning the game are
thus 7:3.
c) In casting a die, the odds in favor of rolling a number smaller than 5 are 2:1.
d) In tossing a coin twice, the odds against receiving two heads is 3:1.
Question 22:
Bag 1 contains 13 red balls and 14 blue balls.
Bag 2 contains 12 red balls and 7 blue balls.
A bag is selected at random and a ball is pulled out of that bag at random. We are
told that the ball is red.
Question 23: A die is tossed. What is the probability that the result is a number
less than 4 if
Question 24:
a) Two ordinary dice are tossed. What is the probability of NOT getting a total
of 7 or 11?
b) Two Sicherman dice are tossed. What is the probability of NOT getting a
total of 7 or 11?
Question 25: One bag contains 4 red and 5 white balls. A second bag contains 3
red and 6 white balls. A ball is drawn from each bag. What is the probability that
Question 26 One bag contains 2 red and 3 white balls. A second bag contains 3 red
and 1 white balls. A ball is drawn from each bag. Suppose we are told that one ball
chosen was red. What are the chances that the second ball is also red?
Question 27: A bag contains 5 red and 4 white balls. A ball is selected and then,
without replacing the first ball, a second ball is selected. I tell you that the second
ball is white. What is the probability that the first ball was white?
Question 28: You play a simple coin-tossing game. If the coin lands heads, you win
$3. If it lands tails, you must pay $1.
a) If you play this game 100 times, how much money do you expect to have?
b) What is the expected value of this simple game?
Question 29: Roll a die. If it comes up even, you win that many dollars. If it comes
up odd, you must pay that many dollars. (For example, a roll of 4 wins you four
dollars. With a roll of 5, you lose five dollars.)
What is the expected value of this game? Would you want to play it?
Question 30: A coin is tossed once, possibly twice. If a head appears on the first
toss, you win $10 and the game stops. If it lands tails, the coin is tossed again. If
the second toss lands heads you win $4, otherwise you pay $20.
a) If you played this game 100 times, on average, how many times will you win $10?
How many times will you win $4? How many times will you lose?
b) What is the expected value of this game? Would you want to play it?
Question 31: A gambling game is called fair if its expected value is zero.
A die is rolled. If it rolls 1, 2, 3, or 4, you win $300. If it rolls 5 or 6 you lose $x.
Find a value of x that makes this game fair.
Question 32: A gambling game is called fair if its expected value is zero.
A coin is tossed three times. If at least two heads appear, you win $100. If exactly
one head appears, you win $50. If no head appears, you lose $x.
Find a value of x that makes this game fair.
Question 33: A die is rolled. If it lands 1 you win $10. If it lands 2 you win $300.
If it lands 3 you win $1. If it lands 4 you lose $500. If it lands 5 or 6 you win $x.
Find a value of x so that the expected value of this game is fifty cents.
[ASIDE: They are! But what fact of social behaviour are we ignoring in this
argument? Why should one still not bother to buy a lottery ticket even if the prize
is so high as to give the impression that the odds are in your favor?]
It is not possible to generate truly random numbers with a computer (any program
follows a predetermined set of instructions), but it is possible to create a list that
appears to be random. Several methods for doing so exist. The most popular is the
middle-square method developed in 1946 by John von Neumann. It works as
follows:
This procedure produces a seemingly random list of numbers between 0 and 9999.
a) Verify that starting with the number 7254 yields the sequence:
b) What happens if, instead, you start with the initial number 1049?
This procedure (and, in fact, every procedure that currently exists) is not without
flaw.
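The steps of the method are not reproduced in this text, so the sketch below follows the usual four-digit statement of the middle-square method as an assumption: square the current number, pad the square with leading zeros to eight digits, and keep the middle four digits as the next number. The function name and its parameters are illustrative.

```python
def middle_square(seed, count):
    """Return `count` successive middle-square values starting from a 4-digit seed."""
    values = []
    x = seed
    for _ in range(count):
        squared = str(x * x).zfill(8)   # pad the square to eight digits
        x = int(squared[2:6])           # keep the middle four digits
        values.append(x)
    return values

print(middle_square(7254, 5))   # compare with the sequence in part a)
print(middle_square(1049, 8))   # part b): watch what happens to the values
```

Running the second call shows one of the flaws mentioned above: some seeds collapse quickly to a value that repeats forever.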
Question 36: The inner-most circle has diameter 6 inches, and each circle
thereafter has diameter 4 inches greater than the previous circle.
Question 37: A survey displays milk preferences amongst men and women:
                 Men   Women
Whole Milk        10       3
2% Milk           18      16
Non-Fat            7      15
No Preference      6      12
a) A woman who was surveyed is chosen at random. What is the probability she
prefers whole milk?
b) A person who likes whole milk is chosen at random. What is the chance that this
person is male?
Question 39: I select a card at random from a standard deck of 52 cards and
then a second card. I put the remaining 50 cards aside and lay the two selected
cards face down on the table-top in front of me.
I look at the first card. It is black. Knowing this, what is the probability that the
second card is also black?
Question 40: At a party, 30% of the people present select red as their favourite
colour, 40% select blue, and 30% select yellow. If a person is chosen at random,
what are the chances that he or she does NOT prefer yellow?
a) What are the chances that the computer will select an even digit?
b) The computer selects two digits, one after the other. What are the chances
of obtaining an odd digit followed by an even digit?
PROBABILITY AND STATISTICS
Informal Course Notes
PART II of IV
James Tanton
2007 James Tanton
CONTENTS:
COUNTING PRINCIPLES
The Multiplication Principle
Factorials
The Labeling Principle
Multi-stage Labeling
Fun with Poker
PASCAL'S TRIANGLE
A Grid of Numbers
The Binomial Theorem
Exercises
There are three major highways from Adelaide to Brisbane, and four major
highways from Brisbane to Canberra.
How many different routes can one take to travel from Adelaide to
Canberra?
The answer to this question is clearly 12. But pause for a moment and ask
yourself why. Is it obvious that the number of routes from A to C really is
$3 \times 4$, that is, three groups of four?
Make sure you are comfortable that multiplication really is the right
arithmetic operation here (as opposed to direct addition).
Suppose there are also six major highways from Canberra to Darwin.
#routes = $3 \times 4 \times 6 = 72$
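To see that multiplication really is the right operation, one can list every route explicitly. The sketch below uses made-up highway labels (the notes do not name the highways) and pairs each choice on one leg with each choice on the next.

```python
from itertools import product

# Hypothetical labels for the highways on each leg of the journey.
adelaide_to_brisbane = ["AB1", "AB2", "AB3"]
brisbane_to_canberra = ["BC1", "BC2", "BC3", "BC4"]
canberra_to_darwin = ["CD1", "CD2", "CD3", "CD4", "CD5", "CD6"]

routes_a_to_c = list(product(adelaide_to_brisbane, brisbane_to_canberra))
routes_a_to_d = list(product(adelaide_to_brisbane, brisbane_to_canberra, canberra_to_darwin))

print(len(routes_a_to_c))  # 12
print(len(routes_a_to_d))  # 72
```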
EXERCISE: There are ten possible movies I can see and ten possible snacks
I can eat whilst at the movies. I am going to see a film tonight and I will eat
a snack. How many choices do I have in all for a movie/snack combo?
We have the
This principle readily extends to the completion of more than one task.
FACTORIALS:
Answer: There are six possibilities for the task of placing someone in the
first spot, five possibilities for who to place second, four for third, three
for fourth, two for the fifth and one for sixth. By the multiplication principle there
are thus:
$6 \times 5 \times 4 \times 3 \times 2 \times 1 = 720$
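$1! = 1$
$2! = 2 \times 1 = 2$
$3! = 3 \times 2 \times 1 = 6$
$4! = 4 \times 3 \times 2 \times 1 = 24$
$5! = 5 \times 4 \times 3 \times 2 \times 1 = 120$
$6! = 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 720$
$7! = 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 5040$
$8! = 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 40320$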
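The factorial values in this list can be checked against a direct count of orderings. A minimal sketch, with placeholder names standing in for the six people:

```python
from itertools import permutations
from math import factorial

def line_ups(people):
    """Count the orderings of `people` by listing every one of them."""
    return sum(1 for _ in permutations(people))

six = ["P1", "P2", "P3", "P4", "P5", "P6"]   # placeholder names
print(line_ups(six), factorial(6))            # 720 720
for n in range(1, 9):
    print(n, factorial(n))
```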
WORD GAMES:
EXAMPLE: My name is JIM. In how many ways can one rearrange the
letters of my name?
Answer 1: By brute force we can list all possibilities and see that there are
six arrangements: JIM JMI MIJ MJI IMJ IJM
The first task is to fill the first slot with a letter. There are 3 ways to
complete this task.
The second task is to fill the second slot. There are 2 ways to complete this
task. (Once the first slot is filled, there are only two choices of letters to
use for the second slot.)
The third task is to fill the third slot. There is only 1 way to complete this
task (once slots one and two are filled).
EXERCISE: In how many ways can one arrange the letters HOUSE ?
EXERCISE: How many ways are there to rearrange the letters BOB? Assume
the B's are indistinguishable.
Comment: One can certainly answer this second exercise by brute force:
just list the possibilities. But is there a sophisticated way to think about how
to handle the repeated letter? Think about this before reading on.
PRACTICE EXERCISE:
In how many ways can one arrange the letters HOUSES?
HOUS1ES2
HOUS2ES1
OHUS1S2E
OHUS2S1E
S1S2UEOH
S2S1UEOH
But notice, if the S's are no longer distinguishable, then pairs in this list of
answers collapse to give the same arrangement. We must alter our answer
by a factor of two and so the number of arrangements of the word HOUSES
is:
$\frac{6!}{2} = 360$
EXERCISE: How many ways are there to rearrange the letters of the word
CHEESE?
Comment: The number of ways to rearrange the letters HOUSES is $\frac{6!}{2!}$. The
2 in the denominator is really 2!.
EXERCISE: Explain why the number of ways to arrange the letters of the
word CHEESES is $\frac{7!}{3!\,2!}$.
EXERCISE:
In how many ways can one arrange the letters CHEEEEESIEST?
How about of CHEESIESTESSNESS?
Comment: Consider the word DOODLED. Its letters can be arranged in $\frac{7!}{2!\,3!}$
different ways, with 2! in the denominator arising from the fact that there
are two O's, and the 3! from the three D's. If we wished, we could also
include in the denominator a 1! (which equals 1) for the fact that there is a
single L in the word and another 1! for the single E. Thus the number of ways
to arrange the letters DOODLED might be better written:
$\frac{7!}{2!\,3!\,1!\,1!}$
This has the advantage of offering a self check: the numbers in the
denominator should sum to the number in the numerator.
$\frac{7!}{2!\,3!\,1!\,1!\,0!}$
Also, the letter J appears zero times as well, so perhaps we should write:
$\frac{7!}{2!\,3!\,1!\,1!\,0!\,0!}$
and so on.
It is for this reason that mathematicians set 0! = 1. Even if one is being silly,
the formulas still remain correct.
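The repeated-letter formula can be compared with a brute-force listing for short words. In the sketch below, `arrangements` applies the formula and `brute_force` lists every rearrangement; both names are illustrative. Python's `math.factorial(0)` returns 1, matching the convention 0! = 1.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def arrangements(word):
    """Number of distinct rearrangements: N! divided by k! for each letter count k."""
    total = factorial(len(word))
    for count in Counter(word).values():
        total //= factorial(count)
    return total

def brute_force(word):
    """Count distinct rearrangements by listing them all (short words only)."""
    return len(set(permutations(word)))

for word in ("JIM", "HOUSES", "CHEESES", "DOODLED"):
    print(word, arrangements(word), brute_force(word))
```

The brute-force count grows as N!, so it is only practical as a check on words of about ten letters or fewer.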
1 slot is to be labeled C
1 slot is to be labeled H
3 slots are to be labeled E
2 slots are to be labeled S
1 slot is to be labeled I
1 slot is to be labeled T
We know the answer to the problem is: $\frac{9!}{1!\,1!\,3!\,2!\,1!\,1!}$.
We have:
$\frac{N!}{k_1!\,k_2!\cdots k_r!}$
SOME EXAMPLES:
1. Four people from a group of ten are needed for a committee. In how many
different ways can a committee be formed?
2. Fifteen horses run a race. How many possibilities are there for first,
second, and third place?
Answer: One horse will be labeled first, one will be labeled second, one
third, and twelve will be labeled losers. The answer must be: $\frac{15!}{1!\,1!\,1!\,12!}$.
3. A feel good running race has 20 participants. Three will be deemed
equal first place winners, five will be deemed equal second place winners,
and the rest will be deemed equal third place winners. How many different
outcomes can occur?
Answer: Easy! $\frac{20!}{3!\,5!\,12!}$.
The total number of possibilities is thus: $\frac{20!}{1!\,1!\,5!\,3!\,2!\,1!\,7!}$. Easy!
COMMENT:
The formula $\frac{N!}{k_1!\,k_2!\cdots k_r!}$ with $k_1 + k_2 + \cdots + k_r = N$ is called a generalized
combinatorial coefficient. It is denoted:
$\binom{N}{k_1\;k_2\;\cdots\;k_r}$
For example, $\binom{6}{2\;3\;1} = \frac{6!}{2!\,3!\,1!} = 60$.
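The generalized combinatorial coefficient is easy to compute directly from the group sizes. A minimal sketch, where the function name `labelings` is hypothetical and each argument is the number of objects receiving one label:

```python
from math import comb, factorial

def labelings(*group_sizes):
    """N! / (k1! k2! ... kr!) where N is the sum of the group sizes."""
    total = factorial(sum(group_sizes))
    for k in group_sizes:
        total //= factorial(k)
    return total

print(labelings(2, 3, 1))         # 60, the worked example
print(labelings(1, 1, 1, 12))     # podium places among fifteen horses
print(labelings(3, 5, 12))        # the "feel good" race with 20 runners
```

Because the function takes N from the sum of its arguments, it enforces the self check automatically: the labels must account for every object.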
EXAMPLE: In how many different ways can one arrange seven A's and nine
B's?
$\frac{16!}{7!\,9!}$
possible arrangements.
EXAMPLE: Ten circles are drawn in a row. In how many different ways can
we color two of them black and leave the rest white?
Answer: Two circles are to be labeled black and eight as white. There are:
$\frac{10!}{2!\,8!} = \frac{10 \times 9}{2} = 45$
possibilities.
HINT:
Answer:
5 people will be labeled chosen and 7 not chosen. There are $\frac{12!}{5!\,7!}$ ways to
accomplish this task.
Answer: We have:
1 person labeled first
1 person labeled second
1 person labeled third
1 person labeled fourth
1 person labeled fifth
7 people labeled not chosen
This can be done in $\frac{12!}{1!\,1!\,1!\,1!\,1!\,7!}$ ways.
Again there is no need to fuss about order. Just come up with the labeling
scheme that is appropriate for the problem.
EXERCISE: Coming full circle. Explain, using the labeling principle, why the
number of ways to arrange six people in a line is 6! (which is really $\frac{6!}{1!\,1!\,1!\,1!\,1!\,1!}$).
MULTI-STAGE LABELING
Although the labeling principle helps remove the confusion of order vs. non-order,
many standard arrangement problems still possess a level of complication that is
delicate. For example, consider the following typical standardized test problem:
In how many ways can one arrange the letters of the word ORANGE if the
first and last letters must each be a vowel?
This is not a straightforward labeling problem as some objects are given preferred
status over others: the vowels require a different type of consideration from the
consonants. It is really a two-stage challenge:
Each of these stages can be handled separately. The Multiplication Principle tells us
to then multiply the results.
Solution:
STAGE 1: One vowel shall be labeled first position, one last position and one
shall be labeled placed with the consonants. There are $\frac{3!}{1!\,1!\,1!} = 6$ ways to complete
stage 1.
STAGE 2: We now have four letters, the consonants R, N, G and the remaining vowel, to label
as second, third, fourth and fifth. There are $\frac{4!}{1!\,1!\,1!\,1!} = 24$ ways to accomplish stage
2.
By the multiplication principle there are $6 \times 24 = 144$ arrangements in all.
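The two-stage count for ORANGE can be confirmed by listing all 720 arrangements and keeping those that start and end with a vowel. A short sketch:

```python
from itertools import permutations

VOWELS = set("AEIOU")

count = sum(
    1
    for p in permutations("ORANGE")
    if p[0] in VOWELS and p[-1] in VOWELS
)
print(count)  # 144, which equals 6 * 24
```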
COMMENT: Notice that we have no control over who is labeled expert and who is
labeled trainee. We only have control over the labels chosen and not chosen.
That there are some fixed, previously assigned labels is a hint that this must be dealt with as
a multi-stage problem.
EXAMPLE: In how many ways can one arrange the letters ABCDE so that A is never
at the beginning or the end?
We'll give three answers to this problem, even though most people would prefer to
answer the question just the first way we present. (We offer two more approaches just
to illustrate that there are multiple ways to approach these problems.)
Answer 1: Think of this as a five-stage process! Deal with the first letter, deal
with the last letter, deal with the second letter, deal with the third letter, and deal
with the fourth letter. By the multiplication principle, we multiply the results.
There are four remaining positions for four letters. They can be placed in
these positions in $\frac{4!}{1!\,1!\,1!\,1!} = 24$ ways.
Answer 3: There are five slots in which to place letters with the two end slots
having a different status than the middle three.
EXAMPLE: Six people (Albert, Bilbert, Cuthbert, Dilbert, Egbert and Filbert) are to
sit in a circle. How many different arrangements are possible if rotations of the
same arrangement are considered equivalent?
Answer: This question is tricky in that there are no clear labels associated with
the question: there is no clear first seat or second seat, and so forth. We can
think of it as a multi-stage process nonetheless by having the men take a seat one
at a time:
Albert must sit somewhere. He can sit anywhere (since all rotations are
deemed equivalent) and there is thus only 1 action for him to take.
Bilbert now has 5 options: take the seat one place to Albert's left, two
places to his left, and so on. Cuthbert has 4 options. Dilbert has 3. Egbert
has 2. Filbert has 1.
This is a labeling problem as Albert's position now defines five labels: one
place to his left, two places to his left, and so on. There are thus
$\frac{5!}{1!\,1!\,1!\,1!\,1!} = 120$ possibilities.
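The idea of letting Albert fix the reference point can be mirrored in code: rotate every seating so that Albert comes first, then count the distinct results. The initials and the helper name `canonical` are illustrative.

```python
from itertools import permutations

people = "ABCDEF"   # initials of Albert, Bilbert, Cuthbert, Dilbert, Egbert, Filbert

def canonical(seating):
    """Rotate a seating so that A comes first; rotations then look identical."""
    i = seating.index("A")
    return seating[i:] + seating[:i]

distinct = {canonical(p) for p in permutations(people)}
print(len(distinct))  # 120
```

Each circular arrangement appears six times among the 720 line-ups, once for each rotation, which is why 720 / 6 = 120.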
One plays poker with a deck of 52 cards, which come in 4 suits (hearts,
clubs, spades, diamonds) with 13 values per suit (A, 2, 3, ..., 10, J, Q, K).
In poker one is dealt five cards and certain combinations of cards are
deemed valuable. For example, a four of a kind consists of four cards of
the same value and a fifth card of arbitrary value. A full house is a set of
three cards of one value and two cards of a second value. A flush is a set
of five cards of the same suit. The order in which one holds the cards in
one's hand is immaterial.
Answer: Again this is a multi-stage problem with each stage being its own
separate labeling problem. One way to help tease apart stages is to imagine
that you've been given the task of writing a computer program to create
poker hands. How will you instruct the computer to create a flush?
First of all, there are four suits (hearts, spades, clubs and diamonds) and
we need to choose one to use for our flush. That is, we need to label one suit
as used and three suits as not used. There are $\frac{4!}{1!\,3!} = 4$ ways to do this.
Second stage: Now that we have a suit, we need to choose five cards from
the 13 cards of that suit to use for our hand. Again, this is a labeling
problem: label five cards as used and eight cards as not used. There are
$\frac{13!}{5!\,8!} = 1287$ ways to do this.
By the multiplication principle there are $4 \times 1287 = 5148$ ways to complete both
stages. That is, there are 5148 possible flushes.
Comment: There are $\frac{52!}{5!\,47!} = 2598960$ five-card hands in total in poker.
(Why?) The chances of being dealt a flush are thus: $\frac{5148}{2598960} \approx 0.20\%$.
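The two stages of the flush count translate directly into code. The sketch below uses Python's `math.comb(n, k)`, which computes n!/(k!(n - k)!), the same quantity as labeling k objects used and n - k not used. As in the count above, straight flushes are included.

```python
from math import comb

suit_choices = comb(4, 1)        # label one suit "used", three "not used"
card_choices = comb(13, 5)       # label five cards "used", eight "not used"
flushes = suit_choices * card_choices
hands = comb(52, 5)

print(flushes)                   # 5148
print(hands)                     # 2598960
print(f"{flushes / hands:.2%}")  # 0.20%
```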
Among the four cards of the value selected for the triple, three will be used
for the triple and one will be ignored. There are $\frac{4!}{3!\,1!} = 4$ ways to accomplish
this task. Among the four cards of the value selected for the double, two
will be used and two will be ignored. There are $\frac{4!}{2!\,2!} = 6$ ways to accomplish
this.
He then asked the pair student to raise his four cards in the air and asked
the seated students to select which two of the four should be used for the
pair. He then asked each of the three single students in turn to hold up
their cards while the seated students selected one of the four cards to
make a singleton.
This process made the multi-stage procedure clear to all and the count of
possible one pair hands, namely,
$\frac{13!}{1!\,3!\,9!} \times \frac{4!}{2!\,2!} \times 4 \times 4 \times 4$
readily apparent.
EXERCISE: Two pair consists of two cards of one value, two cards of a
different value, and a fifth card of a third value. What are the chances of
being dealt two pair in poker?
For the starting value we must select which of the four suits it will be.
There are 4 choices.
There are also 4 choices for the suit of the second card in the straight, 4
for the third, 4 for the fourth, and 4 for the fifth.
$10 \times 4 \times 4 \times 4 \times 4 \times 4 = 10240$.
EXERCISE: KENO
In a KENO game ten numbers are selected at random from the numbers 1
through 80. Players of the game submit tickets beforehand selecting 1
through 10 numbers. They win prizes according to the number of matches
they receive.
A GRID OF NUMBERS
Starting at the top-left cell marked S and taking horizontal steps one
place only to the right or vertical steps downwards only, how many
different paths are there to the location marked E?
Play with this puzzle for a while before reading on. As you play, perhaps
contemplate the following questions:
1. Given the location of the point E, is the grid shown in the diagram
unnecessarily large?
3. If you are willing to trust patterns, can you make a good guess as to
the answer to the original puzzle?
RDRRDDRRRDRR
This sequence contains eight R's and four D's. Moreover, any sequence of
eight R's and four D's corresponds to a path from S to E.
Exercise: How many paths are there from S to the bottom-right cell of the
grid?
Exercise: Suppose the cell E is a steps to the right of S and b steps down
from S. Show that the number of paths from S to E is given by:
$\binom{a+b}{a\;b} = \frac{(a+b)!}{a!\,b!}$
In fact, number the rows 0, 1, 2, ... (with the top row being the zero-th row)
and number the columns 0, 1, 2, ... (with the leftmost column being the
zero-th column). The cell E in the original diagram thus has position row 4,
column 8, and the number of paths to it is $\frac{12!}{4!\,8!}$. (Paths to this cell involve 8
R's and 4 D's.)
In general, numbering rows and columns this way, the cell in row a and column b
requires a D's and b R's to get to it and so the number of paths to it is:
$\frac{(a+b)!}{a!\,b!}$
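The path formula can be checked by listing step sequences. In the sketch below, a path to the cell in row a and column b is a string of a + b steps, and listing the paths amounts to choosing which a of those steps are of one kind. Both function names are illustrative.

```python
from itertools import combinations
from math import factorial

def paths_by_formula(a, b):
    """(a + b)! / (a! b!) paths to the cell in row a, column b."""
    return factorial(a + b) // (factorial(a) * factorial(b))

def paths_by_listing(a, b):
    """Count step sequences by choosing which a of the a + b steps are of one kind."""
    steps = a + b
    return sum(1 for _ in combinations(range(steps), a))

print(paths_by_formula(4, 8), paths_by_listing(4, 8))  # 495 495
```

The formula is symmetric in a and b, so swapping the roles of rows and columns gives the same count.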
Exercise: Is this formula still correct for the cells in the zero-th row? In
the zero-th column? (Good thing we set 0! = 1 .)
What value should we place in the cell labeled S (row zero, column zero)?
How many paths should we say start at S and end at S?
APPROACH 2: PATTERNS
If you fill in the answers for the number of paths to each cell, the following
grid of numbers appears:
(The exercise above suggests that the position labeled S should also be
assigned the number 1.)
Exercise: Explain why the table is symmetrical about the southeast diagonal
line.
We have the formula $\frac{(a+b)!}{a!\,b!}$ for the entry in the a-th row and b-th column
(starting the counts at zero).
Have you noticed that each entry in an interior cell is the sum of two
numbers: the number just above the cell and the number just to the left of
the cell? This makes sense in terms of counting paths. Consider the circled
cell. To reach this cell one can either first reach the cell just above (there
are 15 ways to do this) and then step down, or reach the cell just to its left
(there are 20 ways to accomplish this) and then step right. This gives a
total of 15 + 20 = 35 paths to the circled cell.
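The "above plus left" rule is enough to fill in the whole grid without any factorials. The sketch below builds a grid this way; the size of 5 rows by 9 columns is an assumption chosen so that the cell in row 4, column 8 is the bottom-right corner, since the original diagram is not reproduced here.

```python
from math import comb

def path_grid(rows, cols):
    """Fill a grid where each interior cell is the sum of the cell above and the cell to the left."""
    grid = [[1] * cols for _ in range(rows)]    # top row and left column are all 1
    for r in range(1, rows):
        for c in range(1, cols):
            grid[r][c] = grid[r - 1][c] + grid[r][c - 1]
    return grid

grid = path_grid(5, 9)
for row in grid:
    print(row)
print(grid[2][4], grid[3][3], grid[3][4])  # 15 20 35
```

The last line picks out one cell with value 35 together with the 15 above it and the 20 to its left, matching the circled-cell argument.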
HINT: Use the fact that the number in the toe is really the sum of two
other numbers in the grid.
The grid of numbers appearing in the cells is just the grid of numbers made
famous by French mathematician Blaise Pascal (1623-1662) for his work in
probability theory.
Regard the top row of the triangle (the single 1) as the zero-th row.
(Then the sixth row of the triangle, for example, is 1 6 15 20 15 6 1).
In any row of the triangle call the left-most entry the zero-th entry.
(Thus in the sixth row, 1 is the zero-th entry, 6 is the first entry, 15 is
the second entry, and so on.)
Then the formula for the entry in the n-th row k places in from the left (and
$n - k$ places from the right) is:
$\frac{n!}{k!\,(n-k)!}$
Consider an entry in the a-th row and b-th column of the grid of numbers.
Then the formula for that entry is: $\frac{(a+b)!}{a!\,b!}$. But a + b is the number of the
diagonal on which that cell belongs, a is the number of places in from one
end of the diagonal, and b the number of places in from the other end of
the diagonal.
Exercise: a) Draw a grid of squares and mark the cell in the 3-rd row and 4-th
column. Verify that it is indeed on the 7-th diagonal of the grid, 3 and 4
places in from each end.
b) Consider the entry on the 5-th row of Pascal's triangle, 2 places in from
the left. Find the corresponding cell in your grid of squares. What are a and
b for this cell and what does 5 correspond to?
For the grid of numbers we saw that every entry was the sum of the entry
just above it and just to the left of it. In Pascal's triangle this translates to:
Exercise:
a) Without doing the computation, explain why the sum of entries in the
bottom row shown above will turn out to be double the sum of the entries in
the row just above it.
HINT: Each entry in the bottom row is the sum of two entries in the row
above it.
Exercise: Explain why each alternating sum in Pascal's triangle, beyond the
zero-th row, is zero:
$1 - 1 = 0$
$1 - 2 + 1 = 0$
$1 - 3 + 3 - 1 = 0$
$1 - 4 + 6 - 4 + 1 = 0$
$1 - 5 + 10 - 10 + 5 - 1 = 0$
$11^0 = 1$
$11^1 = 11$
$11^2 = 121$
$11^3 = 1331$
$11^4 = 14641$
$11^5 = 161051$ = 1 | 5 | 10 | 10 | 5 | 1
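The link between powers of 11 and the rows of Pascal's triangle, including the carrying needed once entries reach 10, can be tested numerically. The sketch below treats the entries of row n as digits, weighting the k-th entry by 10 to the power k so that large entries carry into the next place; `row_as_number` is a hypothetical helper name.

```python
from math import comb

def row_as_number(n):
    """Read row n of Pascal's triangle as base-10 digits, carrying where an entry exceeds 9."""
    return sum(comb(n, k) * 10 ** k for k in range(n + 1))

for n in range(8):
    print(n, [comb(n, k) for k in range(n + 1)], row_as_number(n), 11 ** n)
```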
$(x + y)^5 = (x + y)(x + y)(x + y)(x + y)(x + y)$
The term $x^5$ will appear once by choosing the term x from each set of
parentheses.
The term $x^2 y^3$ ten times, the term $x y^4$ five times, and the term $y^5$ once.
We have:
$(x + y)^5 = x^5 + 5x^4 y + 10x^3 y^2 + 10x^2 y^3 + 5x y^4 + y^5$.
The numbers 1, 5, 10, 10, 5, 1 are the entries of the fifth row of Pascal's
triangle.
We have:
$(x + y)^0 = 1$
$(x + y)^1 = x + y$
$(x + y)^2 = x^2 + 2xy + y^2$
$(x + y)^3 = x^3 + 3x^2 y + 3x y^2 + y^3$
$(x + y)^4 = x^4 + 4x^3 y + 6x^2 y^2 + 4x y^3 + y^4$
and so on.
SOME FUN
.
This explains the connection of the powers of 11.
$2^4 = (1 + 1)^4 = 1^4 + 4 \cdot 1^3 + 6 \cdot 1^2 + 4 \cdot 1 + 1 = 1 + 4 + 6 + 4 + 1$
Binomial Theorem:
$(x + y)^n = x^n + \frac{n!}{(n-1)!\,1!} x^{n-1} y + \frac{n!}{(n-2)!\,2!} x^{n-2} y^2 + \cdots + \frac{n!}{a!\,b!} x^a y^b + \cdots + y^n$
The coefficients are the entries of the n-th row of Pascal's triangle.
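The binomial theorem can be checked by actually multiplying out the factors. In the sketch below a polynomial in x and y is stored as a list of coefficients, ordered from the pure power of x down to the pure power of y; this representation and the function names are illustrative choices.

```python
from math import comb

def multiply_by_x_plus_y(coeffs):
    """Multiply a polynomial in x and y (coefficients listed from x^n down to y^n) by (x + y)."""
    result = [0] * (len(coeffs) + 1)
    for i, c in enumerate(coeffs):
        result[i] += c       # the x term keeps the power of y unchanged
        result[i + 1] += c   # the y term raises the power of y by one
    return result

def expand(n):
    """Coefficients of (x + y)^n, found by multiplying out n factors."""
    coeffs = [1]
    for _ in range(n):
        coeffs = multiply_by_x_plus_y(coeffs)
    return coeffs

print(expand(5))  # [1, 5, 10, 10, 5, 1]
```

Each multiplication adds neighbouring coefficients, which is the same rule that builds each row of Pascal's triangle from the row above.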
COMMENT: We have used the notation $\binom{n}{a\;b}$ for the expression $\frac{n!}{a!\,b!}$. Thus
the binomial theorem can be written:
$(x + y)^n = \binom{n}{n\;0} x^n + \binom{n}{n-1\;\,1} x^{n-1} y + \binom{n}{n-2\;\,2} x^{n-2} y^2 + \cdots + \binom{n}{a\;b} x^a y^b + \cdots + \binom{n}{0\;n} y^n$
Often mathematicians suppress one of the terms in the notation and write
just $\binom{n}{a}$ for $\binom{n}{a\;b}$. (We must have $b = n - a$.)
For example, $\binom{7}{5} = \binom{7}{5\;2} = \frac{7!}{5!\,2!}$. Thus the binomial theorem might be
written:
$(x + y)^n = \binom{n}{n} x^n + \binom{n}{n-1} x^{n-1} y + \binom{n}{n-2} x^{n-2} y^2 + \cdots + \binom{n}{a} x^a y^b + \cdots + \binom{n}{0} y^n$
COMMENT: The entries of Pascal's triangle, $\binom{n}{a}$, are also called binomial
coefficients.
PART II PROBLEMS
Question 43: The word BOOKKEEPING is the only word in the English
language with three consecutive double letters. In how many ways can one
arrange the letters of this word?
Question 44: In how many ways can you arrange the letters of your full
name?
d) $\frac{N!}{N!}$   e) $\frac{N!}{(N-1)!}$   f) $\frac{n!}{(n-2)!}$
Question 49:
a) In how many different ways can one arrange five A's and five B's?
b) A coin is tossed 10 times. In how many different ways could exactly five
heads appear?
Question 50: In how many ways can 10 people sit on a bench if only four
seats are available?
Question 51: Five pink marbles, two red marbles, and three rose marbles
are to be arranged in a row. If marbles of the same colour are identical, in
how many different ways can these marbles be arranged?
Question 52:
a) Twelve white dots lie in a row. Two are to be coloured red. In how
many ways can this be done?
b) Consider the equation 10 = x + y + z . How many solutions does it have if
each variable is to be a positive integer or zero?
Question 53:
a) In how many ways can the letters ABCDEFGH be arranged?
b) In how many ways can the letters ABCDEFGH be arranged with letter G
appearing somewhere to the left of letter D?
c) In how many ways can the letters ABCDEFGH be arranged with the
letters F and H not adjacent?
Question 54:
a) Hats are to be distributed to 20 people at a party. Five hats are red,
five hats are blue, and 10 hats are purple. In how many different ways
can this be done? (Assume the people are mingling and moving about.)
b) If the 20 people are clones and cannot be distinguished, in how many
essentially different ways can these hats be distributed?
In how many different ways can 8 people sit around a round table?
a) Answer the question if the chairs of the table are marked North,
Northeast, East, Southeast, South, ..., Northwest.
b) Answer the question if the chairs are not marked so that two
different rotations of the same arrangement of people would be
considered the same.
c) Answer the question under the assumption that rotations are
considered the same and reflections about a diameter of the table are
considered the same.
Suppose two particular people must not sit next to one another. Answer each
of the questions a), b) and c) with this added restriction.
(HINT: First count the number of arrangements with that couple seated
together.)
COMMENT: The problem here is that scoops, like the clones of question
47b), are indistinguishable AND you are not told how many scoops there are
to be of a particular label (flavor). Problems like these are hard and fall
under the category of what is called multi-choosing.
Question 59: A committee of five must be formed from five men and seven
women.
Question 60:
a) From 10 people k are needed for a committee. Write down a formula
for the number of ways this can be done.
b) Suppose we want our formula to hold NO MATTER WHAT. Set k = 11
into your formula. What value should (-1)! have so that your formula
is correct for the number of ways to select 11 people from 10 for a
committee?
Question 61:
a) Prove that the product of any 3 consecutive integers is sure to be
divisible by 3! = 6.
b) Prove that the product of any 7 consecutive integers is sure to be
divisible by 7! = 5040.
c) Prove that the product of any k consecutive integers is sure to be
divisible by k!
PART III of IV
James Tanton
2007 James Tanton
CONTENTS:
Displaying and Summarising Data
Measures of Central Tendency
Measures of Dispersion
Scatter Plots
Lines of Best Fit
Correlation Coefficient
Null Hypothesis
Distributions
Central Limit Theorem
Normal Distribution
68-95-99.7 Rule
z-scores
Roulette
Confidence Intervals
P-values
Gallup Polls
Sampling
Chi-Squared test
Quality Control
Run Tests
Rank Correlation
Exercises
The practice and the study of the tools and techniques for collecting, displaying
and summarizing numerical information is called descriptive statistics.
For example, a medical study might record the blood types of 100 university
students and present the information obtained in a list, a table, or a diagram of
some kind. Inferences and conclusions might then be drawn from the information
presented. For this example the data, in and of itself, is not numerical but rather
categorical (the categories type A, type B, type AB and type O are examined) but
the count of entries that fall into each particular category is numerical. In other
examples, numerical information might take a potentially continuous range of values
(such as age, height, or weight) and might be divided into categories for ease
(height 30-36 inches, 37-42 inches, 43-48 inches, etc., for instance).
Once categories have been established, there are a number of standard methods
for presenting and summarizing data.
Presenting Data
A frequency distribution is a table or chart that shows the count (number) of
individuals in each possible category considered.
For example, of the 100 students tested above, suppose 20 have blood type A, 27
blood type B, 16 blood type AB and 37 blood type O. This information can be
summarized in a frequency table, or a bar chart or a pie graph.
Notice for the bar chart that the individual bars are separated to emphasize the
distinct nature of the different categories.
If the data presented comes from a continuous array of values, then the bars are
drawn without separation. In this situation, the bar chart is called a histogram.
Tables of whole number values are sometimes displayed via stem-and-leaf plots.
Each number is divided into two parts: the units digit (the leaf) and the set of
digits to its left (the stem). (Other variations are possible: perhaps the hundreds
are separated from the tens and units together.)
COMMENT: Each stem-and-leaf plot should be accompanied by a legend
indicating how the numbers are split. For example, to the table above we might
attach the comment 4|0 = forty.
Describing Data
3. Measures of Dispersion:
Statisticians also seek some means to measure the spread of a data set. If data
is tightly clustered about a single central value, then that central value is a
meaningful representative of the entire data set. If, on the other hand, the data
is widely scattered across a spectrum of values, attributing meaning to a value
of central tendency must be done with care.
In detail
This is simply the arithmetic mean (average) of the data values at hand. It is found
by summing together all the data values and dividing by the total number of
measurements.
Example: If in a study the value 7 occurs 32 times and the value 9 occurs 25
times, the mean is: $\mu = \frac{7 \cdot 32 + 9 \cdot 25}{57} \approx 7.88$
CHALLENGE EXERCISE: Prove that the sum of the differences of each data value
from the mean is always zero. (For example, for the data set 5, 6, 9, 9 with mean 7.25 we have
$(5 - 7.25) + (6 - 7.25) + (9 - 7.25) + (9 - 7.25) = 0$.)
Exercise: Some texts might give the following formula for the mean:
$\mu = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n}$
Can you interpret what the symbols in this formula mean and why the formula is
correct?
Mode
The mode is the value in the data set that occurs most often.
For non-numerical data (such as colours, or letters of the alphabet) the mode is
the only measure of central tendency available.
Median
Arrange the data set in increasing order. Then the median is the middle value of
the sequence of data values.
If the data set contains an even number of entries, then the average of the middle
two values is taken as the median.
Example: The median of 3, 4, 4, 5, 8, 8, 10, 12 is $\frac{5 + 8}{2} = 6.5$.
The median is useful for finding the value at the center of the distribution. It
divides the data set into two equally sized groups.
Midrange
The midrange of a set of data is the average of the smallest and largest values.
Example: The midrange of the data set 5, 6, 9, 9 is $\frac{5 + 9}{2} = 7$.
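The four measures of central tendency can be compared side by side on the small data sets used in these examples. This is a minimal Python sketch, not part of the original notes, using the standard `statistics` module; the variable names are our own.

```python
from statistics import mean, median, multimode

data = [5, 6, 9, 9]

data_mean = mean(data)                     # sum of values / number of values
data_median = median(data)                 # average of the two middle values here
data_modes = multimode(data)               # list of the most frequent value(s)
data_midrange = (min(data) + max(data)) / 2

print(data_mean, data_median, data_modes, data_midrange)  # 7.25 7.5 [9] 7.0

# The median example from the text (an even number of entries).
print(median([3, 4, 4, 5, 8, 8, 10, 12]))  # 6.5

# The differences from the mean always sum to zero.
print(sum(x - data_mean for x in data))    # 0.0
```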
COMMENT: MTEL (the Massachusetts licensure exam) likes to toy with the
interplay between these different measures of central tendency.
EXERCISE:
a) Find FIVE data values with:
Median = 10
Mode = 10
Mean = 1000
b) Now find five data values with median = 10, mode = 1000 and mean = 10.
c) Can you find five data values with median = 1000, mode = 10, mean = 10?
EXERCISE: Repeat the previous exercise but this time for SIX data values.
On her walk back home from the lab, one scientist finds a turtle with ground speed
1000 ft/min.
How would the addition of this extra data value to the data set likely affect the
mean, median, mode, and midrange of the data?
Two students Albert and Bilbert each took a sample of math questions over a
series of two days. There were 100 questions in total and Albert scored 65% and
Bilbert 64% overall. So Albert proved himself a better test taker.
FIRST DAY:
Albert = 71%
Bilbert = 80%
SECOND DAY:
Albert = 50%
Bilbert = 57%
So each day Bilbert did a better job than Albert, but did not beat Albert overall!
How is this possible?
This paradox arises because Albert and Bilbert did not complete the same number
of questions each day and the averages we computed are not equally weighted. This
curious phenomenon is known as Simpson's paradox, named after the statistician
Edward Simpson, who described it in 1951. A famous instance arose in the
examination of graduate school admission rates for men and women at UC Berkeley
in 1973.
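To see how unequal weighting produces the reversal, here is a minimal Python sketch, not part of the original notes. The text does not say how many questions each student answered on each day, so the counts below are hypothetical: they were chosen only because they reproduce the stated percentages (to the nearest percent) and the 100-question totals.

```python
# Hypothetical (correct, attempted) counts per day, chosen to match the
# percentages quoted in the text; the notes themselves do not give them.
albert = {"day1": (50, 70), "day2": (15, 30)}
bilbert = {"day1": (24, 30), "day2": (40, 70)}

def rate(correct, attempted):
    """Percentage of attempted questions answered correctly."""
    return 100 * correct / attempted

def overall(record):
    """Overall percentage across both days, weighting each question equally."""
    correct = sum(c for c, _ in record.values())
    attempted = sum(a for _, a in record.values())
    return rate(correct, attempted)

for day in ("day1", "day2"):
    print(day, round(rate(*albert[day])), round(rate(*bilbert[day])))
print("overall", overall(albert), overall(bilbert))
```

Under these assumed counts Albert answers most of his questions on his strong day, while Bilbert answers most of his on his weak day, which is what lets Albert win overall while losing on both days.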
MEASURES OF DISPERSION
How well clustered about the central value is a set of data? Is the data spread
out or tight about this value?
Example: Consider the following two sets of data, each with mean $\mu = 5.22$
These sets are very different! We need to quantify methods for measuring scatter
about a mean. There are several approaches.
Range
The range of a data set is simply the difference between the lowest and highest
values in the set.
The range is a very simplistic measure of dispersion, and does not reveal any
information about how the data values are distributed. It is highly affected by
extremely low or high values in the data set. However, the range is often a useful
measurement in practical daily issues. For example, weather forecasts usually give
the range of temperatures to expect for the day.
We can measure the deviation of each data point from the mean:
$|5 - 7.25| = 2.25$
$|6 - 7.25| = 1.25$
$|9 - 7.25| = 1.75$
$|9 - 7.25| = 1.75$
The average of these deviations gives a good measure of overall scatter. Here, the
average deviation is:
$\frac{2.25 + 1.25 + 1.75 + 1.75}{4} = 1.75$
The data set 5, 6, 9, 9 has mean $\mu = 7.25$ with average deviation from the
mean of 1.75.
A SUBTLE POINT
A subtle point should be noted. Given n data values $x_1, x_2, \ldots, x_n$, one first computes
the mean $\mu$, and then the n deviations: $|x_1 - \mu|, |x_2 - \mu|, \ldots, |x_n - \mu|$.
Once the first n - 1 of these quantities are computed (and these could turn out to
be of any value), the value of the nth quantity is forced: the data set
must conform to the mean $\mu$.
Exercise: As an example, suppose I tell you that a set of three data values has
mean $\mu = 3$ and two of the data values are 1 and 5. Do you know the third data
value?
Thus there are only n - 1 independent computations to be made. For this reason
mathematicians choose to divide the sum of deviations by n - 1 rather than n.
Thus a measure of scatter for the data set 5, 6, 9, 9, for example, is computed:
$\frac{2.25 + 1.25 + 1.75 + 1.75}{3} \approx 2.33$
NOTE: When dealing with thousands of data points, dividing by n - 1 as opposed to
dividing by n will have very little effect. The results will be practically the same.
WARNING: Textbooks are confused about this. Some texts will choose to divide
by n while others will choose to divide by n-1.
[CHALLENGE: Solve | x 2 | 3x 5 x = 7 .]
Another way to work with positive quantities is to square values, rather than take
absolute values. One can later apply a square root if desired.
The variance of a data set is the sum of all deviations squared, divided by one less
than the number of data values.
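Example: The variance of the four data values 5, 6, 9, 9 with mean 7.25 is:
$\sigma^2 = \frac{(5 - 7.25)^2 + (6 - 7.25)^2 + (9 - 7.25)^2 + (9 - 7.25)^2}{3} = 4.25$
In general:
$\sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2}{n - 1}$.
To nullify the effect of squaring, mathematicians will next take a square root.
This gives the standard deviation:
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2}{n - 1}}$.
Recall that the mean of N data values is:
$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N}$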
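The measures of scatter discussed here can be computed for the data set 5, 6, 9, 9 as follows. This is a minimal Python sketch, not part of the original notes; the variable names are our own. Note that `statistics.variance` and `statistics.stdev` divide by n - 1, matching the convention adopted in the text, whereas `statistics.pvariance` and `statistics.pstdev` divide by n.

```python
from statistics import mean, variance, stdev

data = [5, 6, 9, 9]
mu = mean(data)

# Absolute deviations of each data value from the mean.
abs_devs = [abs(x - mu) for x in data]
avg_dev_n = sum(abs_devs) / len(data)                # divide by n
avg_dev_n_minus_1 = sum(abs_devs) / (len(data) - 1)  # divide by n - 1

var = variance(data)  # sum of squared deviations divided by n - 1
sd = stdev(data)      # square root of the variance

print(avg_dev_n)                    # 1.75
print(round(avg_dev_n_minus_1, 2))  # 2.33
print(var)                          # 4.25
print(round(sd, 3))                 # 2.062
```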
Answer:
A COMMENT ON OUTLIERS
The median of a set of data is a value which divides the data into two equal
halves. (If the number of data values is even, then exactly 50% of the data
values lie below the median, and 50% above. If the number of data values is odd,
then close to 50% of the data values lie below and above.)
The median of the lower half of data values is called the first quartile, denoted
Q1 , and the median of the upper half the third quartile, Q3 . (And the median itself
can be called the second quartile, Q2 .) These quartiles divide the data into
approximately 25% blocks.
COMMENT: There is some confusion here in the literature. Some texts insist that
the quartile values correspond to actual data values and some don't. Some handle
the cases of an odd number of data values differently than others. For large data
sets these differences are negligible, hence the lack of uniformity. It does
cause concern, however, for standardized test makers who will have students work
with small data sets for which the differences can be striking. One must examine
any particular author's protocols with care with regard to this matter. (See also
questions 63 and 66.)
An outlier is a data value that seems too large or too small to be coherent with the
data set. (Maybe an error occurred during the experiment, or a value was recorded
incorrectly, for example.) This is a subjective call and often statisticians will bring
a little more solidity to this notion by declaring:
A data value is suspected to be an outlier if its value is 1.5 x IQR or more above the
third quartile or 1.5 x IQR or more below the first quartile. (Here IQR denotes the
interquartile range, $Q_3 - Q_1$.)
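The quartile and outlier rules above can be sketched in a few lines of Python. This is not part of the original notes. It adopts one particular convention (the medians of the lower and upper halves, leaving the middle value out of both halves when the count is odd), which is only one of the protocols the comment above warns about; the IQR is taken to be Q3 - Q1, and the data set and function names are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves.

    When the number of values is odd, the middle value is left out of
    both halves. Other texts use other conventions.
    """
    s = sorted(values)
    half = len(s) // 2
    lower, upper = s[:half], s[len(s) - half:]
    return median(lower), median(s), median(upper)

def suspected_outliers(values):
    """Values lying 1.5 x IQR or more beyond the first or third quartile."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x <= low or x >= high]

speeds = [3, 4, 4, 5, 8, 8, 10, 12, 1000]  # hypothetical data
print(quartiles(speeds))           # (4.0, 8, 11.0)
print(suspected_outliers(speeds))  # [1000]
```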
Folks like to identify outliers as they may adversely affect the analysis of data:
the mean and standard deviation of a data set change with the inclusion of
outliers.
SCATTER PLOTS
A graph such as this, displaying the measurements of two quantities (here pH level
and time), is called a scatter diagram.
A scatter diagram can show whether there seems to be some relationship between
the two quantities. In this example, it looks as though there is a fairly good linear
relationship of positive slope between pH and time.
The following scatter diagram between IQ levels and shoe size suggests no
relationship between these quantities:
EXERCISE: Find a group of friends. Draw a scatter diagram for shoe size and
height. Any correlation?
[Warning: Men's and women's shoe sizes are computed differently. Perhaps use a
yard stick to measure the lengths of people's feet in inches?]
Suppose we have some data, for x- and y-values, that looks as though it is linearly
correlated.
We want to determine an equation for the line that fits the data well.
2. Use mathematics to derive the equation of the line that fits the data well
in some sense.
Thus, deviations of data points from a line of best fit should be measured as
vertical segments (variations of the y-values) with no deviation horizontally. For
this reason, people look for lines that minimize vertical deviations only (or, to
avoid absolute values, the squares of the vertical deviations).
Here's one method for doing this, the least squares method. We'll explain it with an
example.
Answer: One thing that seems reasonable (and turns out to be a true property of
the general theory) is that a line of best fit would properly represent the data and
go through the most average data point.
Let:
$\bar{x}$ = average of the x-values $= \frac{1 + 2 + 6}{3} = 3$
$\bar{y}$ = average of the y-values $= \frac{2 + 5 + 8}{3} = 5$
Now the question is: What should the slope of this line be?
If we call the slope m then the equation of the line will be:
$\frac{y - 5}{x - 3} = m$
That is: $y = m(x - 3) + 5$
Let's work out the y-values of this line for the given x-values and compare them to
the actual y-values of the data points. The sum of the squares of the vertical
deviations is:
$(5 - 2m - 2)^2 + (5 - m - 5)^2 + (5 + 3m - 8)^2 = 14m^2 - 30m + 18$
This has smallest value when $m = -\frac{b}{2a}$, that is, when $m = \frac{30}{28} = \frac{15}{14}$.
The line of best fit is therefore:
$y = \frac{15}{14}(x - 3) + 5$
For completeness, here are the general formulas for the least squares method:
Let:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
$\bar{y} = \frac{y_1 + y_2 + \cdots + y_N}{N}$
Let:
$S_{xx} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_N - \bar{x})^2}{N - 1}$ (called the variance of the x-values).
$S_{yy} = \frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_N - \bar{y})^2}{N - 1}$ (called the variance of the y-values).
$S_{xy} = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + (x_2 - \bar{x})(y_2 - \bar{y}) + \cdots + (x_N - \bar{x})(y_N - \bar{y})}{N - 1}$
(called the covariance of the x- and y-values).
Then the line of best fit goes through the point $(\bar{x}, \bar{y})$ and has slope $\frac{S_{xy}}{S_{xx}}$.
The equation is: $y - \bar{y} = \frac{S_{xy}}{S_{xx}}(x - \bar{x})$.
we have:
$\bar{x} = \frac{1 + 2 + 6}{3} = 3$
$\bar{y} = \frac{2 + 5 + 8}{3} = 5$
$S_{xx} = \frac{(1 - 3)^2 + (2 - 3)^2 + (6 - 3)^2}{2} = 7$
$S_{yy} = \frac{(2 - 5)^2 + (5 - 5)^2 + (8 - 5)^2}{2} = 9$
$S_{xy} = \frac{(1 - 3)(2 - 5) + (2 - 3)(5 - 5) + (6 - 3)(8 - 5)}{2} = 7.5$
The line of best fit goes through (3, 5) and has slope $\frac{7.5}{7} = \frac{15}{14}$, just as we have
seen.
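The general least squares formulas translate directly into code. The following is a minimal Python sketch, not part of the original notes; the function name `best_fit` is our own, and exact fractions are used so that the slope 15/14 appears without rounding.

```python
from fractions import Fraction

def best_fit(xs, ys):
    """Return (x_bar, y_bar, slope) of the least squares line,
    using slope = S_xy / S_xx with both sums divided by N - 1."""
    n = len(xs)
    x_bar = Fraction(sum(xs), n)
    y_bar = Fraction(sum(ys), n)
    s_xx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    return x_bar, y_bar, s_xy / s_xx

# The three data points of the worked example: (1, 2), (2, 5), (6, 8).
x_bar, y_bar, slope = best_fit([1, 2, 6], [2, 5, 8])
print(x_bar, y_bar, slope)  # 3 5 15/14
```

The line of best fit is then y = slope * (x - x_bar) + y_bar. The common factor N - 1 cancels in the ratio, so dividing by N instead would give the same slope.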
Answer: It is involved in answering the question: How good is the fit really?
the smallest.
This quantity reflects the amount of variation of the points about the regression
line.
Now:
$T = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_N - \bar{y})^2$
Since the regression line is designed to be better than any other line, we
necessarily have:
$D \le T$.
Consider the quantity:
$\frac{T - D}{T}$
If $\frac{T - D}{T}$ equals 1, then this is saying that D = 0, which means that there is no
scatter about the regression line. That is, all data points lie exactly on a line.
If $\frac{T - D}{T}$ equals 0, then this is saying that T = D. That is, the amount of scatter
about the regression line is no different than the amount of scatter in general.
That is, computing a regression line has no effect on scatter, and so there is no
relationship between the x- and y-values of any significance.
Since $\frac{T - D}{T}$ is never a negative number we give it a name that is never a negative
quantity:
Definition: $R^2 = \frac{T - D}{T}$
A tedious (but not difficult) exercise in algebra shows that this quantity is given
by the formula:
$R^2 = \frac{(S_{xy})^2}{S_{xx} S_{yy}}$
Taking square roots:
$R = \pm\sqrt{\frac{(S_{xy})^2}{S_{xx} S_{yy}}}$
choosing the + sign to indicate data has a positive slope and the - sign to
indicate a negative slope.
For our example:
$R = +\sqrt{\frac{(7.5)^2}{7 \cdot 9}} \approx 0.945$
This is very good. [Of course, with just three data points there is little
information to go on. CHALLENGE: Explain why the correlation coefficient will have
value $R = \pm 1$, indicating perfect fit, if we work with a data set of just two data points.]
A value around 0.85 or higher (or -0.85 and lower) is usually deemed good.
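The two descriptions of R squared can be compared numerically for the three-point example. This is a minimal Python sketch, not part of the original notes. It assumes that D denotes the sum of squared vertical deviations from the regression line and T the sum of squared deviations from the horizontal line through the mean of the y-values; the variable names are our own.

```python
from math import sqrt

xs, ys = [1, 2, 6], [2, 5, 8]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
s_yy = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
slope = s_xy / s_xx

# T: scatter about the horizontal line y = y_bar.
# D: scatter about the regression line (assumed meaning of D in the text).
T = sum((y - y_bar) ** 2 for y in ys)
D = sum((y - (y_bar + slope * (x - x_bar))) ** 2 for x, y in zip(xs, ys))

r_squared = (T - D) / T
r = s_xy / sqrt(s_xx * s_yy)  # carries the sign of the slope automatically

print(round(r_squared, 4), round(r ** 2, 4))  # the two values agree
print(round(r, 3))                            # 0.945
```

Writing R as S_xy divided by the square root of S_xx times S_yy is the same as the plus-or-minus formula, since S_xy is positive exactly when the slope is positive.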
A WORD OF WARNING
It's always wise to LOOK at a data set before diving in and completing a linear
regression. For example, although we can certainly find a line of best fit to the
data shown, it would have little meaning. (We might wish to find a quadratic or an
exponential curve to fit the data.)
If you suspect data fits a curve of the form $y = ac^x$, taking logarithms gives
$\log y = x \log c + \log a$, a straight line relationship between $x$ and $\log y$. Perform a linear
regression (via the methods of this section) to the table of data values shown
If you suspect data follows a curve of the form $y = ax^2$, take square roots and fit a line to
the data $\sqrt{y}$ and $x$.
And so forth.
INFERENTIAL STATISTICS
Fifty percent of the people who get it, get better on their own. The remaining
fifty percent die.
Two serums, A and B, have been developed hurriedly and little time has been given
to test them. The only information available right now is:
3 patients with the disease who were given serum A all survived
7 out of 8 patients who were given serum B survived.
You have just learned that you have the disease. Which serum should you take?
Some comments
KEY IDEA:
Assume that serum A has no effect and ask: How likely is it that 3/3
people would naturally survive on their own?
Assume that serum B has no effect. How likely is it that 7/8 people would
survive on their own?
Answers:
If serum A has no effect then the chances that three people would all naturally
survive is: $\frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8} = 12.5\%$.
If serum B has no effect, then the chances that 7/8 people survive is:
$8 \times \left(\frac{1}{2}\right)^8 = \frac{1}{32} = 3.125\%$
It is quite unlikely that we'd see 7/8 people surviving if serum B had no effect.
(Far more unlikely than seeing 3/3 people survive.) We conclude then: there is a
good chance that serum B is having an effect.
The act of assuming that there is no effect at play and working to see where that
assumption leads is called testing the null hypothesis.
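The two null-hypothesis computations for the serums follow one pattern: count the ways to choose which patients survive, then multiply by the chance of each such pattern. This is a minimal Python sketch, not part of the original notes; the function name `prob_exactly` is our own.

```python
from math import comb

def prob_exactly(k, n, p=0.5):
    """Chance that exactly k of n independent patients survive,
    if each survives with probability p (the 'no effect' assumption)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

serum_a = prob_exactly(3, 3)  # 3 out of 3 survive by chance
serum_b = prob_exactly(7, 8)  # exactly 7 out of 8 survive by chance

print(serum_a)  # 0.125
print(serum_b)  # 0.03125
```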
a) You toss the coin 10 times and get ten HEADS in a row. Would you likely
conclude that the coin is biased?
b) Suppose instead when you tossed the coin you got nine HEADS out of ten.
Are you still likely to conclude that the coin is biased?
c) What if you got 8 HEADS out of ten? Just seven?
Answer: Let's test the null hypothesis and assume for the moment that the coin is
fair (that is, that nothing suspect is going on).
a) What are the chances of naturally getting 10/10 heads with a fair coin?
$\left(\frac{1}{2}\right)^{10} = \frac{1}{1024} \approx 0.1\%$
b) What are the chances of receiving 9/10 heads with a fair coin?
$\frac{10!}{1!\,9!} \left(\frac{1}{2}\right)^{10} \approx 1.0\%$
In this case I would say, with 99.0% confidence, that the coin is biased.
c) What are the chances of receiving 8/10 heads with a fair coin?
$\frac{10!}{2!\,8!} \left(\frac{1}{2}\right)^{10} \approx 4.4\%$
With 7/10 heads I am less confident to conclude that the coin is biased.
Even less so for 6/10 heads.
NOTICE: We've presented here a table displaying the likelihood of each and every
possible outcome. This is an example of a distribution.
Here we have also talked about confidence in making some kind of inference about
the meaning of a result.
[Comment: The distribution in the above table is called the binomial distribution.]
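A table of this kind for ten tosses of a fair coin can be generated directly. This is a minimal Python sketch, not part of the original notes; the dictionary name `table` is our own.

```python
from math import comb

n = 10  # number of tosses of a fair coin

# Probability of each possible number of heads, 0 through 10.
table = {heads: comb(n, heads) / 2 ** n for heads in range(n + 1)}

for heads, p in table.items():
    print(f"{heads:2d} heads: {100 * p:5.1f}%")

# The probabilities of all possible outcomes add up to 1.
print(sum(table.values()))  # 1.0
```

The rows for 10, 9 and 8 heads reproduce the 0.1%, 1.0% and 4.4% figures worked out above.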
DISTRIBUTIONS
For example, consider the following histogram displaying the heights of 1000
people:
If the heights of the bars are percentages, not actual counts, then this diagram
has total area 1.
We can make the information displayed on the diagram more precise by choosing
small category intervals.
In the limit we get a smooth curve of area 1, the height distribution curve.
Of course, one cannot do this in practice, but we do like to think that human
heights follow some kind of smooth distribution curve of area 1.
Then we like to say the probability that someone chosen at random has height
$72'' \le x \le 78''$ is given by the fraction of the total area under the curve that
lies above that interval.
Since the area of the whole curve is 1, this fraction of area matches actual area
above the interval $72 \le x \le 78$.
This is abstract!
Usually in practice, one doesn't actually know any formula for the distribution curve
a quantity X seems to follow.
BIG QUESTION
Answer: Take some samples and make a guess based on what you observe.
What if we took another class somewhere in the nation with the same number of
participants and worked out their mean height? And say we did this for a third
class as well. Actually for 1000 other classes!
In the 1700s, when scientists conducted experiments multiple times and computed
the average result or aggregate result over many runs of the same experiment,
they noticed that the means always seemed to follow a bell-shaped curve no matter
what type of experiment was being conducted:
Human height comes in a bell-shaped curve. (The height of each human is the
aggregate effect of growth rates of a collection of cells. Thus each human is the
mean result of a collection of experiments.)
The lengths of carrots come in a bell-shaped curve. (Each carrot is the aggregate
result of cell growth.)
Scholars began work on identifying this curve and finding a formula for it.
[These scholars include Gauss (~ 1820), Laplace (1818) and Lyapunov (1901).]
These scholars managed to prove the famous central limit theorem, which we
shall discuss next.
ACTIVITY:
Here's a simple activity you can perform with 20 of your closest friends to
illustrate the Central Limit Theorem in action.
Have each person roll a single die FOUR times and compute the average of the four
values obtained. Repeat three more times.
Also, for the 16 rolls of the die recorded above, list the total number of 1s, 2s, 3s,
4s, 5s, and 6s that occurred.
1s 2s 3s 4s 5s 6s
On the board, create two charts: one for the raw data and one for the means.
Have each person come to the board and place a dot on the left chart for each of
the 16 rolls of his or her die, and one dot for each of the four average values on
the right chart. Do you see distributions akin to the diagrams below? Is the
distribution of dots on the left close to uniform? Is the distribution on the right
approximately normal?
Suppose a quantity has some kind of distribution with mean $\mu$ and standard
deviation $\sigma$, and suppose we repeatedly take samples of N values and compute
the mean of each sample.
Then the means of all the samples will closely follow a normal distribution with:
mean $= \mu$
standard deviation $= \frac{\sigma}{\sqrt{N}}$
The mathematical proof of this result is HARD!! But the idea behind it is fairly
clear.
The larger the sample size, the less deviation one obtains.
Select 10 people at a time and plot their means: Expect a lot of spread.
Select 100 people at a time and plot their means: Expect less spread.
Select 1000 people at a time and plot their means: Expect even less
spread.
Select 6.5 billion people at a time (that is, the entire world's population!)
and plot their means: Expect no spread!
People say:
$\mu = 55.0$ hours
with standard deviation
$\sigma = 1.8$ hours
Suppose we open a box and compute the average life span of all 100 bulbs. We
repeat this act for many, many more boxes.
Then, according to the central limit theorem, the average box lifespan has:
mean $= 55.0$ hours and standard deviation $= \frac{1.8}{\sqrt{100}} = 0.18$ hours.
a) Have a computer select ten random numbers from the range 1 to 100, and
compute the mean of those ten numbers. Have the computer do this one
hundred times and then plot the means. Does the distribution look bell-
shaped?
b) Repeat this exercise but this time have the computer select twenty numbers
at random each time and compute their mean. Does the resulting curve look
more normal?
c) Repeat this exercise but this time have the computer select fifty numbers
each time and compute their mean. How does the distribution appear?
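One way to carry out this exercise is sketched below in Python. It is not part of the original notes: the function name `sample_means`, the fixed random seed, and the choice to summarise each run by its mean and spread (rather than drawing a plot) are our own assumptions. The spread of the means is compared with the value predicted by the central limit theorem, the standard deviation of a single draw divided by the square root of the sample size.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so that repeated runs give the same numbers

def sample_means(sample_size, trials=100):
    """Means of `trials` samples, each of `sample_size` random integers 1..100."""
    return [
        mean([random.randint(1, 100) for _ in range(sample_size)])
        for _ in range(trials)
    ]

# Standard deviation of a single uniform draw from the integers 1..100.
population_sigma = sqrt((100 ** 2 - 1) / 12)

results = {}
for size in (10, 20, 50):
    means = sample_means(size)
    results[size] = (mean(means), stdev(means))
    predicted = population_sigma / sqrt(size)
    print(size, round(results[size][0], 1), round(results[size][1], 2),
          round(predicted, 2))
```

To see the bell shape itself, the list returned by `sample_means` can be passed to any histogram tool.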
Now we haven't talked about how to compute the mean and standard deviation of
a continuum of values (that is, over a smooth curve); only how to compute these for a
finite set of discrete values.
[Recall: $\mu = \frac{x_1 + x_2 + \cdots + x_N}{N}$ and $\sigma = \sqrt{\frac{(x_1 - \mu)^2 + \cdots + (x_N - \mu)^2}{N - 1}}$]
One computes the mean and standard deviation for these finite values, and then
takes the limit as one repeats this for a finer and finer histogram. (This is calculus.)
The upshot is
68% of the data lies within one standard deviation of the mean:
95% of the data lies within two standard deviations of the mean:
99.7% of the data lies within three standard deviations of the mean:
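These three percentages can be confirmed with the standard library class `statistics.NormalDist`. This is a minimal Python sketch, not part of the original notes; the variable names are our own.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)

shares = {}
for k in (1, 2, 3):
    # Area under the normal curve within k standard deviations of the mean.
    shares[k] = standard.cdf(k) - standard.cdf(-k)
    print(k, round(100 * shares[k], 1))  # 68.3, then 95.4, then 99.7
```

The rule rounds these to 68%, 95% and 99.7%.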
EXERCISE:
A species of carrot has average length 12.3 cm with standard deviation 0.8 cm.
Assume carrots are normally distributed.
a) What are the chances that a carrot chosen at random lies in the range 11.5
cm to 13.1 cm?
b) What are the chances that a carrot chosen at random is longer than 13.9cm?
Answer:
Z-SCORES
The following example leads to an important concept:
EXAMPLE: Here are the college test scores on math along with John's scores.
Answer:
Notice that John is 60 points below the mean in algebra, that is, 1½ standard
deviations below. This is not good.
Notice that John is 70 points above the mean in calculus, that is, 2 standard
deviations above. This is very good!
John is 20 points above the mean in statistics, that is, 0.8 standard deviations
above. This is fairly good.
Even though John got the lowest number score in calculus, this was his best result!
In this example we were able to compare different scores by bringing them to the
same standard: the number of standard deviations above or below the mean.
Algebra: $z = \frac{670 - 730}{40} = -1\tfrac{1}{2}$ (That is, 1½ standard deviations below the mean.)
Calculus: $z = \frac{450 - 380}{35} = 2$ (That is, 2 standard deviations above the mean.)
Statistics: $z = \frac{660 - 640}{25} = 0.8$ (That is, 0.8 standard deviations above the mean.)
NOTE: The transformation $z = \frac{x - \mu}{\sigma}$ converts the points $x = \mu - \sigma$, $x = \mu$ and
$x = \mu + \sigma$ into $z = -1$, $z = 0$, and $z = +1$, and thus transforms a distribution centered
about $\mu$ with a spread of $\sigma$ into a distribution centered about 0 with a spread of
1.
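The z-score computation for John's three results can be sketched as follows. This is a minimal Python sketch, not part of the original notes; the function name `z_score` and the dictionary layout are our own, with each entry holding John's score, the class mean, and the class standard deviation as read off from the three computations above.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# subject: (John's score, class mean, class standard deviation)
scores = {
    "Algebra": (670, 730, 40),
    "Calculus": (450, 380, 35),
    "Statistics": (660, 640, 25),
}

z = {subject: z_score(*values) for subject, values in scores.items()}
for subject, value in z.items():
    print(subject, value)  # Algebra -1.5, Calculus 2.0, Statistics 0.8

best = max(z, key=z.get)
print(best)  # Calculus
```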
So
(This is called the probability density function for the standard normal
distribution.)
(3.0) = 0.013
EXAMPLE: The wealth index of a floogle is normally distributed with mean 0 and
standard deviation 1.
A floogle is selected at random. What is the probability that its wealth index is
between 1.00 and 1.72?
Answer:
Probability $= \Phi(1.72) - \Phi(1)$
$= 0.957 - 0.841$
$= 0.116 = 11.6\%$
EXAMPLE: The wealth index of a woogle is normally distributed with mean 18 and
standard deviation 4.
A woogle is selected at random. What is the probability that its wealth index is
between 22 and 24.88?
$x = 22 \;\Rightarrow\; z = \frac{22 - 18}{4} = 1$
$x = 24.88 \;\Rightarrow\; z = \frac{24.88 - 18}{4} = 1.72$
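Because the woogle's z-scores are 1 and 1.72, its probability matches the floogle's. A minimal Python sketch, not part of the original notes, checks this with `statistics.NormalDist`, whose `cdf` method gives the area under the normal curve to the left of a value (the quantity a table of standard normal values lists); the variable names are our own.

```python
from statistics import NormalDist

floogle = NormalDist(mu=0, sigma=1)
woogle = NormalDist(mu=18, sigma=4)

# Area under each curve between the two endpoints.
p_floogle = floogle.cdf(1.72) - floogle.cdf(1.00)
p_woogle = woogle.cdf(24.88) - woogle.cdf(22)

print(round(p_floogle, 3), round(p_woogle, 3))  # 0.116 0.116
```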
COMMENT: Sometimes texts will only give values for the areas between 0 and a
positive value z as shown:
Exercise: Find a text that gives values as described here. Use it to compute:
a) $\Phi(0.2)$
b) $\Phi(-0.2)$
c) $\Phi(1.16)$
d) $\Phi(2.21)$
Johnny found a parsnip next to a nuclear power plant that is 25.0 cm long.
Should we conclude that something unusual happened to the parsnip?
Answer: What are the chances of finding a parsnip of that length under normal
circumstances? (NULL HYPOTHESIS!!)
Recall the 68-95-99.7 rule. The chances that we naturally find ourselves in the
shaded region are only 2.5%.
Thus, with 97.5% confidence we can say that something strange happened to
the parsnip!
Then the means of all the samples will closely follow a normal distribution with:
mean $= \mu$
standard deviation $= \frac{\sigma}{\sqrt{N}}$
And recall that the normal distribution has 95% of its values lying within two
standard deviations of its mean, and 99.7% of its values within three standard
deviations.
One example
ANALYSIS OF ROULETTE:
A Roulette wheel has 18 red spaces, 18 black spaces and two green spaces. In the
simplest version of the game one can place a $1 bet on either red or black. If
your colour comes up, you win $1. If it doesn't, you lose $1. (Thus the two green
spaces give a slight advantage to the house.)
So:
Your chances of winning $1 are 18/38.
Your chances of losing $1 are 20/38.
On average, a single bet returns:
$\mu = 1 \cdot \frac{18}{38} + (-1) \cdot \frac{20}{38} \approx -0.053$
What's the standard deviation here? It is not clear what this means in this case
(what are the data points one is referring to here?) but one can reason as follows:
If we were to play 38 games, then we would expect, on average, to receive the data
value +1 eighteen times and the data value -1 twenty times. The standard deviation
from the mean of -0.053 is given by:
$\sigma^2 = \frac{(1 - (-0.053))^2 \cdot 18 + (-1 - (-0.053))^2 \cdot 20}{37} = 1.0242$
$\sigma = \sqrt{1.0242} = 1.0120$
So:
$\mu = -0.053$
$\sigma = 1.012$
Now suppose I am a habitual gambler and go to the casino every night to play 100
rounds of roulette. By the central limit theorem the results of my nightly activity
closely follow a normal distribution with:
mean $= -0.053$
standard deviation $= \frac{1.012}{\sqrt{100}} = 0.1012$
Almost all of my nightly results (99.7% of them) lie within the range:
mean - 3 standard deviations to mean + 3 standard deviations
For 100 bets of a dollar each, this means that for almost all evenings my
winnings lie between -$35.66 and $25.06.
Although I often lose, I also often win. This keeps me coming back!
I am not the only person playing the game each night. The casino may see something like
100 000 rounds of roulette being played per night.
These samples of 100 000 rounds per night closely follow a normal distribution
with:
mean $= -0.053$
standard deviation $= \frac{1.012}{\sqrt{100\,000}} \approx 0.003$
So almost all evenings (99.7% of them) outcomes lie within the range:
mean - 3 standard deviations to mean + 3 standard deviations
With 100 000 bets of $1 this corresponds to a nightly win for gamblers between
the range -$6200 and -$4400.
The casino sees an almost guaranteed profit somewhere between $4400 and
$6200 per night from Roulette alone.
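The roulette analysis can be reproduced in a few lines. This is a minimal Python sketch, not part of the original notes; the function name `three_sigma_range` is our own. It keeps full precision throughout, so its results differ slightly from the figures in the text, which round the mean to -0.053 and the nightly standard deviation to 0.003 before multiplying.

```python
from math import sqrt

# One "typical" run of 38 bets: 18 wins of +1 and 20 losses of -1.
outcomes = [+1] * 18 + [-1] * 20

mu = sum(outcomes) / len(outcomes)
variance = sum((x - mu) ** 2 for x in outcomes) / (len(outcomes) - 1)
sigma = sqrt(variance)

def three_sigma_range(rounds):
    """Range of total winnings (in dollars) covering about 99.7% of nights
    on which `rounds` one-dollar bets are placed."""
    spread = sigma / sqrt(rounds)  # standard deviation of the average per bet
    return rounds * (mu - 3 * spread), rounds * (mu + 3 * spread)

print(round(mu, 3), round(sigma, 3))  # -0.053 1.012
print(three_sigma_range(100))         # one gambler's night
print(three_sigma_range(100_000))     # the whole casino's night
```

For a single gambler the range straddles zero, while for the casino's 100 000 rounds both ends are negative: the gamblers as a group lose almost every night.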
CONFIDENCE INTERVALS
Often people are comfortable making statements that have a 95% chance of being
correct!
Johnny found a carrot growing next to the nuclear power plant that is 25.0 cm long.
He wants to say that something unusual happened to this carrot. With what level of
confidence can he say this?
Answer: We know that 95% of data in a normal distribution lies within two
standard deviations of the mean. In this case, 95% of carrot lengths should lie
within the range 18.5 ± 6.4 cm, that is, in the interval [12.1, 24.9] cm. Johnny's
carrot has length outside of this range. There is only a 5% chance that this would
happen naturally. Thus, with 95% confidence, Johnny can say that something
strange happened to his carrot.
EXAMPLE: The average mass of all planets in the universe is not known, but
scientific theories do suggest that planet masses vary with a standard deviation of
the order $\sigma = 3000$ units.
Astronomers have observed the motion of 24 planets and have calculated their
average mass to be m = 27650 units.
What can we say about the value of $\mu$, the mean mass of all planets in the
universe?
We obtained: m = 27650.
Now, in general, there is a 95% chance that m lies within two standard deviations
of $\mu$. (The standard deviation that matters here is that of the sample mean,
$\sigma/\sqrt{N} = 3000/\sqrt{24} \approx 612$.)
By the same token, there is a 95% chance that $\mu$ lies within two standard
deviations of m!! (If m is within two units of $\mu$, then $\mu$ is within two units of m.)
So we can say, with 95% confidence, that the true value for the mean $\mu$ lies
somewhere in the interval 27650 ± 1224, that is, in the range [26426, 28874].
IN GENERAL:
Suppose a population has unknown mean $\mu$ and known standard deviation $\sigma$.
Then, with a 95% level of confidence, we can say that the true mean $\mu$ lies
somewhere in the interval:
$\left[\, m - 2\dfrac{\sigma}{\sqrt{N}},\; m + 2\dfrac{\sigma}{\sqrt{N}} \,\right]$
Answer: We are 95% sure that the true mean for the entire population of
Australian males lies somewhere between 167 and 183 cm.
NOTE: It DOES NOT mean that 95% of Australian men have height somewhere
between 167 and 183 cm. THIS IS A COMMON MISCONCEPTION!
Inspectors select 100 pipes at random and find their average diameter to be 2.98
inches. Are they pleased?
$\left[\, 2.98 - 2\dfrac{0.02}{\sqrt{100}},\; 2.98 + 2\dfrac{0.02}{\sqrt{100}} \,\right] = [2.976, 2.984]$
with 95% confidence. Things don't look good. It is very unlikely that the
manufacturer is producing pipes with the required mean of 3 inches.
EXAMPLE: A filter removes dust from my living room. The amount of dust it
removes from day to day has standard deviation $\sigma = 0.3$ mg.
I measured the weight of dust collected over three days and got the figures:
13.2 mg, 13.7 mg, 12.9 mg.
Compute the 95% confidence interval for the true average weight of dust removed per
day.
The 95% confidence interval is $m \pm 2\dfrac{\sigma}{\sqrt{3}} = 13.27 \pm 0.35$ mg.
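The three worked examples above (planets, pipes and dust) all apply the same interval formula. As a small sketch, with `confidence_interval_95` being a name chosen here purely for illustration:

```python
import math

def confidence_interval_95(sample_mean, sigma, n):
    """Interval m - 2*sigma/sqrt(N) to m + 2*sigma/sqrt(N) for a known sigma."""
    half_width = 2 * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

planets = confidence_interval_95(27650, 3000, 24)
pipes = confidence_interval_95(2.98, 0.02, 100)

dust_mean = (13.2 + 13.7 + 12.9) / 3          # mean of the three daily readings
dust = confidence_interval_95(dust_mean, 0.3, 3)

print("planets:", planets)
print("pipes:  ", pipes)
print("dust:   ", dust)
```

The factor 2 is the rounded value used throughout these notes for a 95% interval; many textbooks use 1.96 instead, which narrows each interval slightly.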
P-VALUES
Sometimes we like to be more specific and follow our hunches about what we think
the value of a mean should actually be.
Answer: Let's assume that $\mu$ really is 12 and ask how likely it is that we would
have obtained a sample mean of m = 12.98.
Now samples of size 40 have means that follow a normal distribution with:
mean = 12
standard deviation = $\dfrac{\sigma}{\sqrt{40}} = 0.47$
Now 95% of the means should lie between 11.06 and 12.94.
We got the value 12.98. There is only a 2.5% chance that we'd land above this range
if the true mean really were $\mu = 12$.
We reject our hunch that $\mu = 12$ with 97.5% certainty that this is the right thing
to do.
If $\mu = 12$, what is the probability that we'd get a sample mean of m = 12.98 or more?
Answer:
$z = \dfrac{12.98 - 12}{0.47} = 2.09$
According to the tables, this region has area 0.5000 − 0.4817 = 0.0183 = 1.83%.
This value is called the p-value of the sample mean 12.98 for the assumption that
$\mu = 12$.
The chance of getting a sample mean of 12.98 or more under the assumption that
$\mu = 12$ is extremely low. We reject the claim that $\mu = 12$ with 98.17% level of
confidence that we are doing the right thing.
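If no table is at hand, the right-tail area can be computed directly. The sketch below uses the complementary error function from Python's standard library in place of the printed table; the function name `right_tail_p_value` is an illustrative choice, not something from the notes.

```python
import math

def right_tail_p_value(sample_mean, assumed_mean, sd_of_mean):
    """Return (z, p) where p is the chance of a sample mean at least this large."""
    z = (sample_mean - assumed_mean) / sd_of_mean
    # For a standard normal Z, P(Z >= z) = 0.5 * erfc(z / sqrt(2)).
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p

z, p = right_tail_p_value(12.98, 12, 0.47)
print("z =", round(z, 2), " p-value =", round(p, 4))
```

Because z is not rounded to 2.09 before the area is computed, the result differs from the table value of 0.0183 in the fourth decimal place.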
We've just computed a right-tail p-value for a given sample mean. One can also
compute left-tail p-values.
A p-value gives a measure of the likelihood that we would obtain a sample mean at
least as extreme as the one we observed, under the assumption that beliefs about
the true value of the mean are correct.
Suppose we take a sample of size N from the population and find that percentage
$\hat{p}$ of the sample has the desired property. (For example, we interview just 1000
Americans and find that 32% say they will likely vote Republican.)
The values $\hat{p}$ have a distribution that is approximately normal, and the larger the
sample size N the better the approximation. This distribution has:
$\mu = p$
$\sigma = \sqrt{\dfrac{p(100 - p)}{N}}$
(Here, p is given as a percentage.)
Given a particular value $\hat{p}$ for one sample, the 95% confidence interval for that
sample is:
$[\, \hat{p} - 2\sigma,\; \hat{p} + 2\sigma \,]$
where $\sigma$ is the standard deviation computed with $\hat{p}$ rather than p.
With 95% confidence we can say that the true value p lies within this range.
EXAMPLE: Of 1500 adult Americans that were polled, 3.2% of them said they had
an overall thoroughly pleasurable experience studying math in high-school. Estimate
the proportion of ALL adult Americans that will say the same.
Answer: We have the sample percentage
$\hat{p} = 3.2$
$\sigma = \sqrt{\dfrac{3.2 \times 96.8}{1500}} = 0.454$
With 95% confidence we can say that the percentage of Americans who claim to
have enjoyed high-school math lies in the range [2.292, 4.108].
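The same calculation as a short sketch, working in percentage points as the notes do. The name `poll_interval_95` is hypothetical.

```python
import math

def poll_interval_95(sample_percentage, n):
    """95% interval for a true percentage, using sigma = sqrt(p(100 - p)/N)."""
    sigma = math.sqrt(sample_percentage * (100 - sample_percentage) / n)
    return sample_percentage - 2 * sigma, sample_percentage + 2 * sigma

low, high = poll_interval_95(3.2, 1500)
print(round(low, 3), round(high, 3))
```

The endpoints differ from [2.292, 4.108] only in the third decimal place, because the notes round the standard deviation to 0.454 before doubling it.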
Newspaper reports usually don't mention a confidence level but rather a "margin
of error." They mean by this the same thing, with the understanding that we are
talking about a 95% confidence level. For example, if a journalist writes
"According to our survey, 36% of Americans are now afraid to eat cheese.
The margin of error in this report is ±2.6 percentage points."
the journalist means by this that, with 95% confidence, we can say that the true
percentage of Americans afraid to eat cheese lies in the interval [33.4, 38.6].
There is a large sub-branch of statistics concerned with the issue of how to select an
appropriately representative sample. By considering this very question, one may begin to
influence the types of people one may accept to interview for a survey. For example, in
studying the shopping habits of Americans one may think to go to the local mall to
interview shoppers. Right there you have a bias in your sample: you are considering only
people who like to shop at malls!
Today, a number of sampling methods are commonly used to help ensure that no bias
occurs. These methods include:
Random Sampling
Each subject of the population is assigned a number, and numbers are generated randomly
with the aid of a computer to select members.
Systematic Sampling
Each subject of the population is assigned a number, and, starting at a random number,
every kth member from then on is selected. For example, one might select every 23rd
person, starting with the 533rd member.
Stratified Sampling
When a population is naturally divided into groups (such as male/female, or age by decade),
selecting a random sample from within each group produces what is called a stratified
sample. Samples produced this way are used to ensure representatives of each subgroup
are present in the study. For example, in a study involving college freshmen and
sophomores, one might select twenty-five students at random from each group (freshman
males, freshman females, sophomore males and sophomore females) to make a sample of
one hundred students.
Cluster Sampling
If an intact subgroup of a population is used as a representative sample of the entire
population then the sample is called a cluster sample. For example, the set of all freshman
females might be used to represent the population of all college students for the purposes
of one study, or the 12 eggs in one carton of eggs might be taken as representative of all
the eggs handled by a particular supermarket.
PRACTICE EXERCISE:
a) Take a deck of cards and divide it into four suits. Randomly select 2 cards from
each suit and list the eight cards here:
Is there an 8-card sample that would not arise via this method?
b) Shuffle the deck of cards. Select two suits by randomly drawing two cards from
the deck. Now separate those two suits from the deck. Randomly select four cards
from each of those two suits. Write the sample of eight cards here:
c) Describe a procedure for selecting 8 cards from a deck that illustrates the method
of systematic sampling.
Statisticians often prefer to work with a simple random sample (SRS) scheme: not only
does each person in the population have equal chance of being selected, but every
combination of people has the same chance of appearing as any other combination. For
example, random sampling and systematic sampling can be SRS, but stratified sampling and
cluster sampling are certainly not.
The central limit theorem assumes that the samples chosen are simple random
samples.
As you will see, the following activity can be attempted on the first day of class,
and as student sophistication grows, can be repeated with more powerful tools.
What does it mean for a book to be red? Go to the library and look at some books.
Try to come up with a consensus within your team as to what "red-ness" should mean.
Write down your definition of a red book.
STEP 3: CONCLUSION
Write down your answer. Approximately how many books in the library are red?
LATER
Using
$\mu = p$
$\sigma = \sqrt{\dfrac{p(100 - p)}{N}}$
write a 95% confidence interval for the true proportion of library books that are
red.
EXAMPLE: Is there any correlation between hair colour and eye colour?
A team goes out and examines a random sample of people. The data collected is
displayed in the following contingency table:
First, compute each row sum, column sum and grand total.
We see that 38 out of the 110 people examined had blue eyes. That is, 38/110 =
34.5% of the sample had blue eyes.
We also see that 37 of the people examined were blonde, 46 brown
haired and 27 red haired.
If there is absolutely no influence of hair colour on eye colour, then we'd expect
34.5% of the blonde population to have blue eyes, 34.5% of the brown haired
population to have blue eyes, and 34.5% of the red haired population to have blue
eyes. That is,
THE EXPECTED FREQUENCY (aka count) OF BLONDE PEOPLE WITH BLUE EYES
IS:
$\dfrac{38}{110} \times 37 = 12.8$.
THE EXPECTED FREQUENCY OF BROWN HAIRED PEOPLE WITH BLUE EYES IS:
$\dfrac{38}{110} \times 46 = 15.9$
THE EXPECTED FREQUENCY OF RED HAIRED PEOPLE WITH BLUE EYES IS:
$\dfrac{38}{110} \times 27 = 9.3$
THE EXPECTED FREQUENCY OF RED HAIRED PEOPLE WITH GREEN EYES IS:
$\dfrac{27}{110} \times 25 = 6.1$
(The proportion of our sample with red hair is 27/110. This proportion of 25 green
eyed folk should have red hair.)
In general:
The expected frequency of the entry in the i-th row and j-th column is given
by:
$\dfrac{(\text{row } i \text{ total}) \times (\text{col } j \text{ total})}{\text{grand total}}$
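The general formula is easy to apply mechanically. The sketch below uses only the totals quoted in the text (38 blue-eyed people, 25 green-eyed people, hair totals 37, 46 and 27, and a grand total of 110); the function name `expected_count` is an illustrative choice.

```python
def expected_count(row_total, col_total, grand_total):
    """Expected frequency of a cell if the two qualities are unrelated."""
    return row_total * col_total / grand_total

GRAND_TOTAL = 110
HAIR_TOTALS = {"blonde": 37, "brown": 46, "red": 27}
BLUE_EYES_TOTAL = 38
GREEN_EYES_TOTAL = 25

# Expected number of blue-eyed people within each hair colour.
blue_expected = {hair: round(expected_count(total, BLUE_EYES_TOTAL, GRAND_TOTAL), 1)
                 for hair, total in HAIR_TOTALS.items()}
# Expected number of red haired people with green eyes.
red_green_expected = round(expected_count(HAIR_TOTALS["red"], GREEN_EYES_TOTAL, GRAND_TOTAL), 1)

print(blue_expected)
print(red_green_expected)
```

Since the formula is a product of the two totals, it does not matter which quality is laid out along the rows and which along the columns.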
We can go and fill in all the expected frequencies under the assumption that there
is no relationship between the two qualities.
Looking at this it seems that our observed frequencies are vastly different from
the expected frequencies (for no relationship). It seems that something is going
on.
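$\chi^2 = \sum \dfrac{(o - e)^2}{e}$
(Here o is an observed frequency, e is the matching expected frequency, and the sum runs
over every entry of the table.)
In our example
$\chi^2 = \dfrac{(23 - 12.8)^2}{12.8} + \dfrac{(8 - 15.9)^2}{15.9} + \cdots + \dfrac{(5 - 6.1)^2}{6.1} = 17.88$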
Now if there truly were no relationship twixt the two variables then you'd expect
the observed values to be very close to the expected values, that is, you'd expect
$\chi^2$ to be close to zero.
Statisticians have done the mathematics on the $\chi^2$ statistic and have computed
the distribution one would expect it to follow. [NASTY MATH!] Each table of a
different size has its own chi-squared distribution. A table with r rows and c columns has
$\nu = (r - 1)(c - 1)$
degrees of freedom.
(COMMENT: Why r-1 and c-1? We know that all r rows add to 100% of the
data. Thus if you know the sum of the first r-1 rows, then you do not need to
be given the sum of the r-th row. Its value is not "free." Ditto for the
columns.)
See internet or most any stats book for a table of $\chi^2$ values for each value of $\nu$.
According to the table of values, this value lies between $\chi^2_{0.995}$ and $\chi^2_{0.999}$.
The chance of a $\chi^2$ value this large occurring when there is no relationship is
below 0.5%. Thus, with 99.5% confidence, we can say that there is some kind of
correlation between eye colour and hair colour.
EXERCISE: Analyse this (fictional) data from interviews with 5406 fifteen-year-olds.
QUALITY CONTROL
To test the quality of their manufacturing techniques, each day a random sample of
10 pipes is selected and their mean diameter is computed. Here are the results of
twelve days of data:
DAY 1 2 3 4 5 6 7 8 9 10 11 12
Mean 2.98 3.01 3.04 2.97 2.99 3.01 3.05 3.04 3.07 3.08 3.06 3.09
Answer: Let's plot the data. Also, the mean is meant to be 3.00 so let's plot that
line as well:
It looks like the data is drifting away from the target mean. To make this more
precise:
Samples of size ten should have mean 3.00 and standard deviation $\dfrac{0.06}{\sqrt{10}} = 0.02$.
99.7% of the results should lie within three standard deviations of this mean. That
the data is drifting above the critical line of +3 standard deviations suggests that
quality is out of control. (Day 9 is the first day of concern.)
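A control check of this kind is simple to automate. The following sketch flags every day whose sample mean falls outside three standard deviations of the target; the variable names are illustrative, and the value 0.06 for the standard deviation of a single pipe is the one used in the calculation above.

```python
import math

TARGET_MEAN = 3.00
SIGMA = 0.06        # standard deviation of a single pipe, as used above
SAMPLE_SIZE = 10

daily_means = [2.98, 3.01, 3.04, 2.97, 2.99, 3.01,
               3.05, 3.04, 3.07, 3.08, 3.06, 3.09]

sd_of_mean = SIGMA / math.sqrt(SAMPLE_SIZE)
upper = TARGET_MEAN + 3 * sd_of_mean
lower = TARGET_MEAN - 3 * sd_of_mean

# Days are numbered from 1, matching the table.
out_of_control = [day for day, mean in enumerate(daily_means, start=1)
                  if mean > upper or mean < lower]

print("control limits:", round(lower, 3), round(upper, 3))
print("days outside the limits:", out_of_control)
```

With the unrounded standard deviation the upper limit is about 3.057, so day 11 (mean 3.06) is also flagged; with the rounded value 0.02 the limit is 3.06 and day 11 sits exactly on it. Either way, day 9 is the first day outside the limits.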
EXAMPLE: Two machines each produce 1 000 bolts per day. The following table
shows the number of defective bolts each machine manufactured over a ten day
period.
MACHINE 1 42 37 18 37 17 26 35 21 18 17
MACHINE 2 44 36 23 41 24 25 31 35 23 21
Using only basic techniques, is one machine significantly less reliable than the
other?
Here's the list showing which machine produced the greatest number of defective
bolts per day:
2 1 2 2 2 1 1 2 2 2
If there is no difference in the quality of the machines, that is, if each is equally
likely to be listed on a day as having produced the most defective bolts for that
day, then this sequence is akin to the sequence of Hs and Ts in flipping a coin.
Is it unusual to get seven Hs in a run of ten flips? That is, is the fact that machine
2 is listed seven times at all significant?
Question: What are the chances of receiving seven heads in flipping a coin ten
times?
Answer: $\dfrac{10!}{3!\,7!} \times \dfrac{1}{2^{10}} \approx 11.7\%$.
This is not considered rare enough to be significant. So we would say that there
is no significant evidence to suggest that machine 2 is behaving differently to
machine 1.
COMMENT: We usually look for events that are rare, say with at most a 5% chance of
occurring, to say, with 95% confidence, that something unusual is occurring.
For example, suppose machine 2 was listed NINE times in the string of ten.
The chance of this naturally occurring is $\dfrac{10!}{1!\,9!} \times \dfrac{1}{2^{10}} \approx 1\%$, so we would conclude, with
99% confidence, that machine 2 is indeed less reliable than machine 1.
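Both percentages come from the same counting formula, which Python's standard library can evaluate directly. A brief sketch, with `prob_heads` being a name invented here:

```python
from math import comb

def prob_heads(k, n):
    """Chance of exactly k heads in n flips of a fair coin: C(n, k) / 2^n."""
    return comb(n, k) / 2 ** n

p_seven = prob_heads(7, 10)   # machine 2 listed seven times out of ten
p_nine = prob_heads(9, 10)    # machine 2 listed nine times out of ten

print(f"seven of ten: {p_seven:.1%}")
print(f"nine of ten:  {p_nine:.1%}")
```

These are chances of exactly seven and exactly nine listings, matching the calculation in the notes; a stricter test would add up the chances of all outcomes at least that extreme.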
Suppose we perform the activity a number of times and record the sequence of As
and Bs that result:
e.g.
A A | B B B | A | B | A A A A A | B B B |A A | B | A A A
TWO COMMENTS:
a) A sequence with a large number of runs suggests that the sequence of As and
Bs generated is not truly random. For instance, the following sequence has
the maximal possible number of runs. You would be unlikely to believe it to be a
random sequence:
A|B|A|B|A|B|A|B|A|B|A|B|A|B|A
b) A sequence with very few runs doesn't seem that random either.
AAAAAAAAA|BBBBBBBB|AAAAAAAAAAA
So the count of runs in a sequence should, in some way, give an indication of just
how random that sequence is.
There are $\dfrac{N!}{a!\,b!}$ possible ways to arrange these As and Bs. (Here a is the
number of As, b is the number of Bs, and N = a + b.)
List them all and count the number of runs in each possible example.
Mathematicians have proven that the count of runs has mean and standard
deviation given by these formulae:
$\mu = \dfrac{2ab}{N} + 1$
$\sigma = \sqrt{\dfrac{2ab\,(2ab - N)}{N^2 (N - 1)}}$
They have also shown that if a and b are each 7 or greater, then 95% of the
run counts lie within two standard deviations of this mean.
EXERCISE:
a) Write down all the possible ways to list three As and two Bs.
b) Count the runs in each
c) Find the mean and the standard deviation of the count of runs. Verify that the
above formulae give the same values.
HHHTTHHHTTTHTTTT
Now, according to the previous result, the runs should follow a distribution with:
$\mu = 8.875$
$\sigma = 1.9$
The count of six runs is within the range of two standard deviations from the
mean. We cannot conclude that this example is unusual.
HTHHTHTHTTHHHTHT
The number 12 is within 2 standard deviations from the mean. We cannot conclude
that this sequence is not random.
HHHHHTTTTTTHHTTT
Again:
$\mu = 8.875$
$\sigma = 1.9$
The count of 4 runs is more than two standard deviations below the mean. With
95% confidence we can say that this sequence was not produced by a random
phenomenon.
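The three coin sequences above can be run through the formulae mechanically. This is a sketch only: `count_runs` and `runs_test` are names chosen for illustration, and the "within two standard deviations" rule is the one stated in these notes.

```python
import math

def count_runs(seq):
    """Number of maximal blocks of identical consecutive symbols."""
    return sum(1 for i, s in enumerate(seq) if i == 0 or s != seq[i - 1])

def runs_test(seq):
    """Return (runs, mu, sigma, within_two_sd) for a sequence of two symbols."""
    a = sum(1 for s in seq if s == seq[0])   # count of the first symbol
    b = len(seq) - a                         # count of the other symbol
    n = a + b
    mu = 2 * a * b / n + 1
    sigma = math.sqrt(2 * a * b * (2 * a * b - n) / (n ** 2 * (n - 1)))
    runs = count_runs(seq)
    return runs, mu, sigma, abs(runs - mu) <= 2 * sigma

results = {flips: runs_test(flips)
           for flips in ["HHHTTHHHTTTHTTTT", "HTHHTHTHTTHHHTHT", "HHHHHTTTTTTHHTTT"]}
for flips, outcome in results.items():
    print(flips, outcome)
```

All three sequences have seven Hs and nine Ts, which is why they share the same mean and standard deviation; only the run counts (6, 12 and 4) differ.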
TWO APPLICATIONS
If the data really was generated by a random phenomenon, then the sequence of
As and Bs produced should be random.
EXAMPLE: Here's some data. Does it seem random? Use the above/below median
test.
16 12 23 18 37 21 13 14 30 79 11
Answer: We need to find the median. (Unfortunately, this means ordering the
data!):
11 12 13 14 16 18 21 23 30 37 79
median = 18.
Marking each value of the original list as above (A) or below (B) the median, with *
for the median itself, gives:
B B A * A A B B A A B
We have: a = 5, b = 5 with N = 10. There are 5 runs. (I know that these a- and b-
values are a bit low, but let's follow the test anyway just for the fun of it!)
This gives:
$\mu = 6$
$\sigma = 1.49$
The value of 5 runs is not outside two standard deviations from the mean. The data
seems to be following a random phenomenon.
$a_1, a_2, a_3, \ldots, a_m$
$b_1, b_2, \ldots, b_n$
To decide whether or not the two samples came from the same type of population,
arrange all m + n values in increasing order. (If some values are repeated, choose an
order among them at random.) Record a sequence of As and Bs to show which
sample each data point came from.
If the resulting sequence of As and Bs is random, then we can conclude that the
samples are not really different and come from the same source.
EXAMPLE: Twelve people from a mall were interviewed for their ages. Call these
the M values: 13 18 34 17 16 30 13 47 37 35 15 35
Twelve people at an art museum were interviewed for their ages. Call these the A
values: 45 52 17 28 41 63 48 23 38 60 40 40
Answer: Arrange the data in numerical order and keep track of which are Ms and
which are As.
13 13 15 16 17 17 18 23 28 30 34 35 35 37 38 40 40 41 45 47 48 52 60 63
M  M  M  M  A  M  M  A  A  M  M  M  M  M  A  A  A  A  A  M  A  A  A  A
We have:
a = 12
b = 12
N = 24
$\mu = 13$
$\sigma = 2.40$
The value of 8 runs is more than two standard deviations away from the mean.
With 95% confidence we can say that these two sets of data are not coming from
the same type of population!
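The merging and labelling step is the fiddly part of this test, so here is a sketch of it in Python. One assumption is made for reproducibility: tied ages are ordered by label rather than at random (for this data the run count is 8 whichever way the tied 17s are ordered).

```python
import math

mall = [13, 18, 34, 17, 16, 30, 13, 47, 37, 35, 15, 35]
museum = [45, 52, 17, 28, 41, 63, 48, 23, 38, 60, 40, 40]

# Tag each age with its source, then sort by age. Ties are broken by label here
# (an assumption; the notes suggest breaking ties at random).
tagged = sorted([(age, "M") for age in mall] + [(age, "A") for age in museum])
labels = "".join(tag for _, tag in tagged)

# A new run starts at every place where neighbouring labels differ.
runs = 1 + sum(1 for x, y in zip(labels, labels[1:]) if x != y)

a, b = labels.count("A"), labels.count("M")
n = a + b
mu = 2 * a * b / n + 1
sigma = math.sqrt(2 * a * b * (2 * a * b - n) / (n ** 2 * (n - 1)))

print(labels)
print("runs =", runs, " mu =", mu, " sigma =", round(sigma, 2))
print("more than two standard deviations from the mean:", abs(runs - mu) > 2 * sigma)
```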
3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 6 4 3
One checks that the median is 4.5. The sequence of Aboves and Belows is:
BBBB|AA|B|AA|B|AAAAA|BB|A|BB
Here:
a = 10
b = 10
N = 20
There are 9 runs.
We have:
$\mu = 11$
$\sigma = 2.17$
The value of 9 runs is within two standard deviations of the mean. This sequence
looks random!
EXERCISE:
a) Write a sequence of Hs and Ts twenty symbols long that looks random to
you. (The number of Hs need not be the same as the number of Ts.) Perform
the runs test. Is your sequence random?
b) Flip a coin 20 times and record results. Perform a runs test for randomness
on your sequence!
RANK CORRELATION
THE PROBLEM:
Five men (Albert, Bilbert, Cuthbert, Dilbert and Egbert) take part in a singing
contest and are ranked by two judges from 1 to 5 (with 1 as best and 5 as least
favored). For example, a possible outcome of the contest might be:
If the judges followed purely objective assessment criteria and were completely
free of personal preferences, then we would expect the two rankings should be
identical. If, on the other hand, the judges followed no set procedures for their
ranking schemes and assigned rankings in a random fashion, then we would expect
very little or no correlation between the two lists. In the example presented above
we seem to be somewhere between these two extremes.
THE CHALLENGE:
Develop an index that takes two lists of rankings from two judges and applies
some formula or algorithm to those lists to compute a numerical value, which
we shall call R, such that:
i) $0 \le R \le 1$
ii) R has value 1 if the two lists are identical.
iii) R has value 0 if the two lists are in complete disagreement. (e.g. The first judge
lists the candidates in the order 1, 2, 3, 4, 5 and the second judge in the order 5, 4,
3, 2, 1.)
Compute the value of your Rank Correlation Coefficient to the example above and
interpret the results.
$D = (1 - 3)^2 + (4 - 5)^2 + (3 - 2)^2 + (5 - 4)^2 + (2 - 1)^2 = 8$
The largest value D can possess (for a list of five numbers) is 40 and this occurs
when the rankings are in reverse order. (Why?). The smallest value D can possess is
0, and this occurs when the orders are in complete agreement. So set:
$R = 1 - \dfrac{D}{40}$
In our example, $R = 1 - \dfrac{8}{40} = 0.8$, which indicates some disagreement.
Comment: This is the approach Charles Spearman took in 1904. He defined his
index to be $\rho = 1 - \dfrac{2D}{M}$ where M is the maximum value D could be for two lists n
entries long. Here $\rho = 1$ corresponds to complete agreement and $\rho = -1$ to
complete disagreement.
Note: One can show that $M = \dfrac{n(n^2 - 1)}{3}$ and this occurs if the two lists are in
reverse order of one another. [To see this, show what happens to the value D if
two numbers in one list are swapped. Show that the value of D increases if we swap
two elements that aren't already in reverse order.]
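A short sketch of this approach in Python. The two rank lists are read off from the terms of the sum for D above, on the assumption that the first number in each bracket is Judge 1's rank and the second is Judge 2's, in the order Albert to Egbert; the function names are illustrative.

```python
def squared_rank_distance(ranks_1, ranks_2):
    """D: the sum of squared differences between the two judges' ranks."""
    return sum((x - y) ** 2 for x, y in zip(ranks_1, ranks_2))

def max_distance(n):
    """M = n(n^2 - 1)/3; n(n^2 - 1) = (n-1)n(n+1) is always divisible by 3."""
    return n * (n ** 2 - 1) // 3

judge_1 = [1, 4, 3, 5, 2]   # assumed ranks for Albert, Bilbert, Cuthbert, Dilbert, Egbert
judge_2 = [3, 5, 2, 4, 1]

D = squared_rank_distance(judge_1, judge_2)
M = max_distance(len(judge_1))
R = 1 - D / M          # the index of this approach, between 0 and 1
rho = 1 - 2 * D / M    # Spearman's version, between -1 and 1

print("D =", D, " M =", M, " R =", R, " rho =", rho)
```

As a check on the claim about M, two lists in reverse order, such as 1, 2, 3, 4, 5 against 5, 4, 3, 2, 1, give D = 16 + 4 + 0 + 4 + 16 = 40.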
What is the maximal value D can obtain in this case and when does it occur?
APPROACH 3: We can reorder the names of the contestants so that list of ranks
for the first judge is 1, 2, 3, 4, 5. The list of ranks for the second judge changes
accordingly.
Now look at each contestant in turn along the second row. Count the number of
scores to the right of each entrant with a lower score.
In our example, according to Judge 2, Albert has TWO lower scores to his right.
Egbert has ZERO lower scores to his right, Cuthbert ZERO, Bilbert ONE, Dilbert
ZERO.
If the rankings were in perfect agreement, then S would have value 0. If they
were in perfect disagreement (in reverse order), then S would have value 10, and
this is maximal. Set
$R = 1 - \dfrac{S}{10}$.
****
Many approaches, of course, are possible.
The difficulty in this work is determining when and how a maximal value for a count
occurs (and generalizing this to a list of n contestants and not just five). [Approach
2 is problematic in this regard.]
It is often convenient to divide data sets into four equal parts. The lower (or first)
quartile, denoted Q1, is the 25th percentile. The middle (or second) quartile, Q2, the
median, is the 50th percentile, and the upper (or third) quartile, Q3, is the 75th percentile.
(COMMENT: As we have seen on page 14, there is some confusion over the definition of a
quartile. Notice that this paragraph defines it as the 25th percentile, and so, here at least,
a quartile should correspond to a data value.)
Mean = 1000
Median = 10
Mode = 10
b) Find an example of SIX data points with:
Mean = 10
Median = 10
Mode = 1000
Mean = 10
Median = 1000
Mode = 10
On average, each planet of the solar system has 722 million human inhabitants.
We see that the data ranges from 20 to 70 and that the three divider marks at 35
or so (the left end of the box), at 42 or so (the line within the box), and at 65 or
so (the right end of the box) divide the data into four groups each representing
25% of the data.
a) What is the range and median according to the following box plot? 50%-
75% of the data lies within which range of values?
22, 23, 26, 31, 31, 31, 38, 42, 63, 69, 69, 127, 129, 131
is presented:
 2 | 2, 3, 6
 3 | 1, 1, 1, 8
 4 | 2
 6 | 3, 9, 9
12 | 7, 9
13 | 1
KEY: 3|8 = 38
113, 113, 114, 115, 116, 123, 123, 130, 130, 203, 308, 308, 319
b) Find the mean, median, and mode for the following data presented via stem-and-
leaf:
 0 | 1, 1, 4
 2 | 2, 3, 3
 4 | 2, 2, 2
 8 | 1, 8, 8
14 | 0
18 | 9, 9
COMMENT: Other types of stem-and-leaf plots are possible. For example, one might use
the key 13|04 = 1304.
COMMENT: Stem and leaf plots are used to give a quick sense of a shape as to how the
data is distributed.
Question 68: Find the mean and standard deviation of the following data set:
5.6 5.2 4.6 5.7 4.9 6.4
x Y
2 3
3 5
5 12
6 20
X Y
1 1.94
3 5.98
4 8.04
6 12.02
Question 71:
a) Find the mean, median, and mode of the following test scores:
b) Another student later took the test and scored just three points. Describe, in
words only, what effect such a low-value additional data point will have on each of
the mean, median, and mode.
Question 73: Here is a stem-and-leaf plot. What is the mode of this data set?
(Here the data values are: 120, 120, 120, 130, …, 615)
Question 74: The median of a data set is significantly larger than the mean of the
data set. What could cause this?
(A) There are a few exceptionally small values in the data set
(B) There are a few exceptionally large values in the data set
(C) The data values are tightly clustered around one value
(D) The data values are evenly spread across a range of values
Question 75: Draw a reasonably accurate pie chart for the following data:
Question 76: A sample of people at a mall were measured for their heights. The
results are displayed in the following histogram.
Based on this data, what are the chances that a person at the mall selected at
random is less than 61 inches tall?
y = 0.98 x + 0.02
with correlation coefficient r = 0.01 . Would they want to use this equation to
predict values for y ? Explain.
Question 81: A terrible disease is sweeping across the nation at an alarming rate.
Only 10% of people who catch the disease survive.
Two experimental serums have hurriedly been developed but only limited testing
has been done on them. Two people with the disease were given serum A and both
survived. Six people with the disease were given serum B and four survived.
a) Assuming that serum A had no effect, show that the chances of two people
naturally surviving the disease is 1%.
b) Assuming that serum B had no effect, show that the chances of four out of
six people naturally surviving the disease is 0.12%.
c) Given these figures, which serum is more likely to have had a better effect
on recovery?
Question 82: Another terrible disease is sweeping across the nation at an alarming
rate. 50% of people who catch the disease survive.
Three experimental serums have hurriedly been developed but only limited testing
has been done on them.
Serum A: Three people with the disease were given the serum and all three
survived.
Serum B: Ten people with the disease were given the serum and eight survived.
Serum C: Five people with the disease were given the serum and four survived.
You've just contracted the disease! Based on this limited information, which of the
three serums would you take and why?
Question 83: The following diagram gives the distribution of the number of
minutes Australian women can tolerate left-handed 8 year-old boys whistling while
chewing gum.
Question 85: You suspect a coin is biased. You toss it 12 times and heads appear
ten times. With what level of confidence would you say that the coin is indeed
biased?
Question 87: You suspect a die is biased towards landing 6. You toss the die six
times and get a six three of those times.
a) What are the chances of obtaining exactly three sixes in six rolls of a fair die?
b) Would you conclude that the die above is biased? If so, with what level of
confidence would you make such a claim?
Question 88: A company manufactures bolts. If 5% of the bolts they produce are
defective, what are the chances that four bolts chosen at random are:
a) all defective?
b) all but one is defective?
c) none are defective?
Question 90: Use a table of values for the normal distribution curve (with mean 0
and standard deviation 1) to find the area under the curve between:
a) z = 0 and z = 1.4
b) z = -0.70 and z = 0
c) z = 1.1 and z = 1.2
d) z = -1.8 and z = 0.6
e) all z values below -0.1
f) all z values above 0.1
Question 91: The mean weight of 1000 high-school students is 147 pounds with
standard deviation 17 pounds. Assume the weights are normally distributed.
Use z-scores and the table of values for the normal distribution curve (with mean
0 and standard deviation 1) to find
c) The number of students who weigh between 147 and 152 pounds.
d) The number of students who weigh between 132 and 150 pounds.
Question 92: A company produces cars that last an average of 10.2 years on the
road with standard deviation 4.3 years. You buy a car from the company. Assuming
that car ages are normally distributed, what are the chances that your car will be
on the road for over 20 years?
Question 93: A company produces light bulbs with average lifespan 40 hours
(standard deviation 4 hours).
a) Consumer advocates across the nation test 100 light bulbs and calculate the
average life span of the bulb according to their samples. To a close approximation,
what mean do they obtain with what standard deviation?
b) A year later they repeat the experiment but this time testing 1000 light bulbs
each. To a close approximation, what mean do they obtain with what standard
deviation?
Question 94: In a hand of Blackjack one has a 45% chance of winning a dollar and
a 55% chance of losing a dollar.
Following the same analysis as we did in class for the game of Roulette
a) Show that the mean and standard deviation for a single hand of Blackjack
are given by:
$\mu = -0.10$
$\sigma = 1.00$
(This issue of whether to divide by n or n-1 for standard deviation is annoying.
Here I divided by n-1.)
b) A habitual gambler attends the casino every night and plays 100 hands of
blackjack, betting a dollar each and every time. Find the range of winnings
she can expect (99.7% of the time) for each night of gambling.
c) The casino sees 100,000 hands of blackjack played per night. Find the range
of profit they can expect (99.7% of the time) per night from Blackjack.
a) Show that the mean and standard deviation for a single play of this game
are:
$\mu = 0.167$
$\sigma = 2.041$
b) A habitual gambler plays 50 rounds of this game every day. Find the range of
winnings she can expect (99.7% of the time) for each day of gambling.
c) The casino sees 1 000 000 rounds of this game per day. Find the range of
profit they can expect (99.7% of the time) per day from this game.
Question 96: Rabbits who eat carrots only have weights that are normally
distributed with mean 12.5 pounds and standard deviation 3.2 pounds.
Question 97: All men with the name of JIM have a handsomeness value that is
normally distributed with mean 86.6 and standard deviation 2.3.
Question 98: The mean weight of floogles is not known, but it is known that their
weights vary with standard deviation $\sigma = 4$ units.
A biologist measured the weights of 60 floogles and found that her sample had
mean m = 143.2.
Find the 95% confidence interval for the mean weight of all floogles.
Question 99: Gibgobs have heat factors that vary about some unknown mean with
standard deviation $\sigma = 12$. A scientist measured the heat factor of four gibgobs.
He obtained the values:
Find the 95% confidence interval for the mean heat factor of all gibgobs.
Question 100: A soccer ball company is meant to produce soccer balls of radius
12.4 cm with an error of at most 0.4 cm.
They set their machines so that the balls they produce have mean $\mu = 12.4$ with
standard deviation $\sigma = 0.2$. They produce 500 balls per day. How many balls per day
must be rejected?
Question 102: Find the p-value for the sample mean of 4.01 in the previous
example.
Question 103: A company produces cables with breaking strengths of mean 1800
lbs and standard deviation 100 lb. A new manufacturing technique, however, claims
to increase the average breaking strength. To test this claim, a sample of 50
cables is examined and is found to have mean breaking strength 1845 lb.
Question 104: Two drugs, A and B, are being tested for possible cure to a disease.
The following data has been collected thus far. Does there seem to be any
correlation worth pursuing?
Question 105: Does there seem to be a correlation between favourite colour and
shoe size?
Question 106: The following table shows test scores of students in a physics
course and the same students in a math course. Does there seem to be a
correlation between math and physics proficiency?
Question 107: According to this data do you think there is some connection
between marital status and performance in a Prob. and Statistics course?
Question 108: A ball bearing company produces balls with a diameter, hopefully,
of mean value $\mu = 11.5$ mm with a tolerance given by standard deviation $\sigma = 0.2$.
Each day the company selects 20 balls at random and computes the mean of that
sample. Over the course of three weeks they collected the following values:
11.602 11.547 11.312 11.449 11.401 11.608 11.471 11.453 11.446 11.522 11.664
11.823 11.629 11.602 11.756 11.707 11.612 11.628 11.602 11.816 11.812
Question 109: Twenty-five people were asked to try a new type of gum. They
were asked whether or not they liked it: yes or no. The results are as follows:
YYNNNNYYYNYNN YNNNNNYYYYNN
How many runs are in this sequence? Does this sequence appear random?
Question 110:
a) In how many different ways can one arrange 3 As and 3 Bs?
b) List all the ways and count the number of runs that appear in each.
c) What is the mean number of runs and what is the standard deviation for the
number of runs?
Question 111: Here are the first 20 digits of $\sqrt{2}$. Do they seem random?
14142135623730950488
Question 112: Use the above/below median test to determine whether or not the
following list of data values appear random:
8 15 9 12 10 7 11 8 13 9 11
PART THREE:103
Question 113:
a) Write a list of H's and T's that seems random to you. Make the list 20 long.
(The number of H's and T's do not have to be the same.)
b) Flip a coin 20 times and record the list of H's and T's that results. Test the
sequence for randomness.
(B) 95% of all Australian men have blood pressure under 131.
(C) 95% of all Australian men have blood pressure between 105 and 131.
(D) There is a 95% chance that the average blood pressure of all Australian
men lies somewhere between 105 and 131.
Question 115: A survey displays milk preferences amongst men and women:
Men Women
Whole Milk 10 3
2% Milk 18 16
Non-Fat 7 15
No Preference 6 12
(Actually, MTEL will not ask you to do a chi-squared analysis, but do this one in any
case!)
PART THREE:104
(A) No effect
(B) Decrease sampling error
(C) Increase sampling error
(D) Not enough information to say.
COMMENT: The term "sampling error" is vague here. The question is really:
"Which would yield the least spread of data values (that is, the least standard
deviation): calculating the mean of samples of size N or calculating the mean
of samples of size 3N?"
PROBABILITY AND STATISTICS
Informal Course Notes
PART IV of IV
BRIEF INTRODUCTION TO
MORE ADVANCED THINKING
(and filling in some gaps!)
James Tanton
2008 James Tanton
CONTENTS:
Mean and Variance Revisited: The Human Perspective 2
One Data Set 2
Two Data Sets 4
Playing with Formulas 7
Vectors 9
Mean and Variance: The Perspective of the Gods 10
Random Variables 15
Sum, differences, multiples 17
Connection to Central Limit Theorem 22
Cereal Box Problem 22
Geometric Distribution 25
Binomial Distribution 27
Proportions 32
Student's t-distribution 33
Chi-Squared distribution 36
Chebyshev's Inequality 36
Law of Large Numbers 37
PART FOUR: 2
$$x_1, x_2, \ldots, x_n$$
and, as mere mortals, we know nothing more about the situation than these n values. (That
is, we have no understanding about what to expect from the experiment such as the mean
value, the variation from the mean, the underlying frequency of data values behind the
scenes, etc.)
But if the experiment were ideal, meaning that outcomes were absolutely and utterly
repeatable, then we would expect no variation in data values at all. This means that all
measurements would adopt exactly the same value q, say.
$$PM = \sqrt{(x_1 - q)^2 + (x_2 - q)^2 + \cdots + (x_n - q)^2}$$
It is easier just to minimize the quantity under the square root sign.
Now
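$$(x_1 - q)^2 + (x_2 - q)^2 + \cdots + (x_n - q)^2 = nq^2 - 2(x_1 + \cdots + x_n)q + (x_1^2 + \cdots + x_n^2)$$
This is a quadratic in q, and it takes its smallest value at
$$q = \frac{2(x_1 + \cdots + x_n)}{2n} = \frac{x_1 + \cdots + x_n}{n} = \bar{x}$$
The minimum value under the square root sign thus occurs when $q = \bar{x}$ and the minimum
value is:
$$(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2$$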
This is a sum of individual deviations, squared, and in and of itself is a measure of the total
spread of values. Dividing by n gives an average spread. The resulting quantity is the
VARIANCE of the data:
$$Var(x_1, \ldots, x_n) = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}$$
COMMENT: If the data is a measurement of length, say, then each $x_i$ has a unit of meters
perhaps and so $Var(x_1, \ldots, x_n)$ has units of meters squared. It is handy to have a measure
of spread in the same units as the data. For this reason, folk take the square root of
variance and call the result STANDARD DEVIATION:
$$\sigma(x_1, \ldots, x_n) = \sqrt{Var(x_1, \ldots, x_n)} = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}}$$
COMMENT: As we have seen, many texts alter these definitions slightly. Mathematicians
note the following:
FACT: $(x_1 - \bar{x}) + (x_2 - \bar{x}) + \cdots + (x_n - \bar{x})$ equals zero.
(EXERCISE: Show this!)
Thus if one knows the value of n - 1 of the terms $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$, then one can
deduce the value of the nth one from the fact that their sum should be zero.
So among the values $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$ there are only n - 1 real pieces of
information. To reflect this, many choose to divide by n - 1 rather than n and set
$$Var(x_1, \ldots, x_n) = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} \quad \text{and} \quad \sigma(x_1, \ldots, x_n) = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$
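The two conventions are easy to compare numerically. Here is a minimal Python sketch; the function names, the `ddof` parameter and the data values are made up for illustration and are not part of these notes.

```python
import math

def variance(data, ddof=0):
    """Average squared deviation from the data mean.

    ddof=0 divides by n; ddof=1 divides by n - 1.
    """
    n = len(data)
    mean = sum(data) / n
    total_spread = sum((x - mean) ** 2 for x in data)
    return total_spread / (n - ddof)

def standard_deviation(data, ddof=0):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(data, ddof))

# Made-up measurements, for illustration only.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
deviations = [x - sum(data) / len(data) for x in data]

print(sum(deviations))                            # the FACT: deviations sum to zero
print(variance(data), standard_deviation(data))   # divide-by-n convention
print(variance(data, ddof=1))                     # divide-by-(n - 1) convention
```

The divide-by-(n - 1) answer is always a little larger, and the gap shrinks as n grows.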
Suppose we run an experiment and record two sets of data values from it. (For example,
our experiment could be to ask passersby for their heights and their shoe sizes.) We have
data values:
$$x_1, x_2, \ldots, x_n$$
$$y_1, y_2, \ldots, y_n$$
The plot might reveal a relationship (CORRELATION) between the data values. If there
seems to be a linear correlation, then one might be interested in finding a straight line
that approximates the data points well.
The line of best fit should minimize the total sum of squared deviations from that line. So for the
data point $x_i$ the line of best fit predicts the value $m(x_i - \bar{x}) + \bar{y}$ compared to the actual
data value $y_i$. We need a value m that minimizes:
PART FOUR: 5
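$$D = (y_1 - \bar{y} - m(x_1 - \bar{x}))^2 + \cdots + (y_n - \bar{y} - m(x_n - \bar{x}))^2$$
$$= m^2\left((x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right) - 2m\left((x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})\right) + \left((y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2\right)$$
This is a quadratic in m, and it takes its smallest value at
$$m = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})}{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}$$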
Folk define:
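$$S_{xx} = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n} = Var(x_1, \ldots, x_n)$$
$$S_{xy} = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})}{n}$$
$$S_{yy} = \frac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n} = Var(y_1, \ldots, y_n)$$
In this notation the slope is $m = S_{xy}/S_{xx}$ and the line of best fit is:
$$y - \bar{y} = \frac{S_{xy}}{S_{xx}}(x - \bar{x})$$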
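As a quick check of the slope formula, here is a small Python sketch. The helper name `best_fit_line` and the five data points are assumptions chosen for illustration only.

```python
def best_fit_line(xs, ys):
    """Return (slope, x_bar, y_bar) for the line y - y_bar = m (x - x_bar)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs) / n
    return s_xy / s_xx, x_bar, y_bar

# Made-up data points, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, x_bar, y_bar = best_fit_line(xs, ys)

def predict(x):
    """Value the line of best fit predicts at x."""
    return y_bar + m * (x - x_bar)

print(m, predict(3), predict(6))
```

Notice that the line always passes through the point of means: `predict(x_bar)` returns `y_bar`.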
CORRELATION COEFFICIENT:
We created a line that minimizes the total amount of scattering D of y-values about that
line. Here:
$$D = (y_1 - \bar{y} - m(x_1 - \bar{x}))^2 + \cdots + (y_n - \bar{y} - m(x_n - \bar{x}))^2$$
The quantity $T = (y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2$ is a measure of the amount of scattering of
y-values in general. We can also view this as the amount of scattering about the horizontal
PART FOUR: 6
line $y = \bar{y}$, which is usually not the line of best fit. Since D is the minimal value for all lines, we
have $D \le T$.
As we have seen, the proportion $\frac{T - D}{T}$, with value between 0 and 1, is a measure of
desired scattering. To make sense of this note that:
$\frac{T - D}{T} = 0$ means $T = D$, which says that the amount of scattering about a
supposed line of best fit is no different from the amount of scattering in general.
THERE IS NO CORRELATION between the data values at all.
$\frac{T - D}{T} = 1$ means $D = 0$, which says that there is absolutely no scatter about the
line of best fit, that is, the data fits this line exactly. We have PERFECT LINEAR
CORRELATION between the data values.
$$\frac{T - D}{T} = \frac{(S_{xy})^2}{S_{xx} S_{yy}}$$
and people usually denote this quantity $R^2$, calling it THE CORRELATION COEFFICIENT.
COMMENT: Actually people usually set $R = \pm\sqrt{\frac{(S_{xy})^2}{S_{xx} S_{yy}}}$, using the + sign if the slope m is
positive and the - sign if m is negative.
IF $(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y}) = 0$ THEN THERE IS ABSOLUTELY NO
CORRELATION BETWEEN DATA VALUES.
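The identity between the scattering proportion (T - D)/T and the formula in terms of S-values can be checked on a small example. This is only a sketch: the data points and variable names are made up for illustration.

```python
import math

# Made-up data points, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs) / n
s_yy = sum((y - y_bar) ** 2 for y in ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

m = s_xy / s_xx                                     # slope of the line of best fit
T = sum((y - y_bar) ** 2 for y in ys)               # scatter about the line y = y_bar
D = sum((y - y_bar - m * (x - x_bar)) ** 2          # scatter about the line of best fit
        for x, y in zip(xs, ys))

r_squared = s_xy ** 2 / (s_xx * s_yy)
r = math.copysign(math.sqrt(r_squared), m)          # sign of R follows the sign of m

print((T - D) / T, r_squared, r)
```

Both routes give the same value of R squared, which is the point of the algebra above.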
PART FOUR: 7
Given two sets of data values from an experiment (which we label set X and set Y):
$$X: x_1, x_2, \ldots, x_n$$
$$Y: y_1, y_2, \ldots, y_n$$
we can create new data sets by adding or multiplying all values (which we label X + Y and
XY):
$$X + Y: x_1 + y_1, x_1 + y_2, \ldots, x_n + y_n$$
$$XY: x_1 y_1, x_1 y_2, \ldots, x_n y_n$$
NOTE: If there are originally n data values for X and for Y, there are $n^2$ values for X + Y
and for XY.
The mean of X + Y :
The mean of XY :
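$$\frac{(x_1 y_1) + (x_1 y_2) + \cdots + (x_n y_n)}{n^2} = \frac{x_1(y_1 + \cdots + y_n) + x_2(y_1 + \cdots + y_n) + \cdots + x_n(y_1 + \cdots + y_n)}{n^2}$$
$$= \frac{n x_1 \bar{y} + n x_2 \bar{y} + \cdots + n x_n \bar{y}}{n^2}$$
$$= \frac{n^2 \bar{x}\,\bar{y}}{n^2} = \bar{x}\,\bar{y}$$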
PART FOUR: 8
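The variance of X + Y: The variance of the X + Y data set, about its data mean of $\bar{x} + \bar{y}$,
is a little long and scary looking, but actually not too tricky to work out. Here goes:
$$(x_1 + y_1 - \bar{x} - \bar{y})^2 + (x_1 + y_2 - \bar{x} - \bar{y})^2 + \cdots + (x_n + y_n - \bar{x} - \bar{y})^2$$
$$= (x_1 - \bar{x})^2 + 2(x_1 - \bar{x})(y_1 - \bar{y}) + (y_1 - \bar{y})^2 + (x_1 - \bar{x})^2 + 2(x_1 - \bar{x})(y_2 - \bar{y}) + (y_2 - \bar{y})^2$$
$$\quad + \cdots + (x_n - \bar{x})^2 + 2(x_n - \bar{x})(y_n - \bar{y}) + (y_n - \bar{y})^2$$
$$= n(x_1 - \bar{x})^2 + \cdots + n(x_n - \bar{x})^2 + n(y_1 - \bar{y})^2 + \cdots + n(y_n - \bar{y})^2$$
$$\quad + 2(x_1 - \bar{x})\left((y_1 - \bar{y}) + (y_2 - \bar{y}) + \cdots + (y_n - \bar{y})\right)$$
$$\quad + 2(x_2 - \bar{x})\left((y_1 - \bar{y}) + (y_2 - \bar{y}) + \cdots + (y_n - \bar{y})\right)$$
$$\quad + \cdots + 2(x_n - \bar{x})\left((y_1 - \bar{y}) + (y_2 - \bar{y}) + \cdots + (y_n - \bar{y})\right)$$
$$= n(x_1 - \bar{x})^2 + \cdots + n(x_n - \bar{x})^2 + n(y_1 - \bar{y})^2 + \cdots + n(y_n - \bar{y})^2 + 0 + 0 + \cdots + 0$$
$$= n(x_1 - \bar{x})^2 + \cdots + n(x_n - \bar{x})^2 + n(y_1 - \bar{y})^2 + \cdots + n(y_n - \bar{y})^2$$
[We used the fact that $(y_1 - \bar{y}) + (y_2 - \bar{y}) + \cdots + (y_n - \bar{y}) = 0$.]
Divide by $n^2$ to get:
$$Var(X + Y) = \frac{n(x_1 - \bar{x})^2 + \cdots + n(x_n - \bar{x})^2}{n^2} + \frac{n(y_1 - \bar{y})^2 + \cdots + n(y_n - \bar{y})^2}{n^2}$$
$$= \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n} + \frac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n}$$
$$= Var(X) + Var(Y)$$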
We have:
FOR ANY TWO DATA SETS X AND Y:
Var ( X + Y ) = Var ( X ) + Var (Y )
(for variance computed with respect to the data mean).
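The claim can be tested on any pair of small data sets by building all n^2 sums and products directly. The sketch below uses exact fractions so that the equalities hold without rounding; the data values and helper names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

def mean(values):
    """Exact data mean of a list of integers."""
    return Fraction(sum(values), len(values))

def variance(values):
    """Exact variance about the data mean (divide-by-n convention)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Made-up data sets of the same size n = 3, for illustration only.
X = [1, 2, 6]
Y = [0, 5, 10]

sums = [x + y for x, y in product(X, Y)]       # all n^2 values of X + Y
products = [x * y for x, y in product(X, Y)]   # all n^2 values of XY

print(mean(sums), mean(X) + mean(Y))
print(mean(products), mean(X) * mean(Y))
print(variance(sums), variance(X) + variance(Y))
```

Each printed pair agrees exactly, whatever integers are placed in X and Y.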
ASIDE ON VECTORS:
$$v_x = \langle x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x} \rangle$$
This vector has the property that its entries sum to zero.
For example,
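$$Var(X) = \frac{\|v_x\|^2}{n}$$
$$\sigma(X) = \frac{\|v_x\|}{\sqrt{n}}$$
$$S_{xy} = \frac{v_x \cdot v_y}{n}$$
$$R^2 = \frac{(S_{xy})^2}{S_{xx} S_{yy}} = \frac{(v_x \cdot v_y)^2}{\|v_x\|^2 \|v_y\|^2} = \left(\frac{v_x \cdot v_y}{\|v_x\| \, \|v_y\|}\right)^2 = \cos^2\theta$$
where $\theta$ is the angle between the vectors $v_x$ and $v_y$.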
Let's now assume that we are omniscient and are fully aware of all information
about all experiments ever run. For any experiment we now assume we know all
possible values that can occur and the likelihood of each and every particular value
actually occurring. That is, we know the PROBABILITY DISTRIBUTION of any
given experiment.
COMMENT: We are usually not God-like and do not know the true nature of a random
variable. For example, who really knows the probability distribution of the number of
hummingbirds that will visit someone's feeder during a given hour of the day while the
homeowner happens to be watching the TV tuned to a prime-numbered station?
On occasion we mere mortals do have glimpses into the world of the Gods. For example, let
X be the random variable: "All values of a fair die." We know the values of this random
variable are 1, 2, 3, 4, 5, 6 with probability distribution $\frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}$.
$$\mu = E(X) = p_1 x_1 + p_2 x_2 + p_3 x_3 + \cdots$$
This is the Gods' version of the average value. To see why, imagine we ran the experiment
n times. Then $p_i$, in an ideal setting, is the proportion of times we can expect to see the
value $x_i$. So for n runs of the experiment we should expect to see $x_i$ a total of $n p_i$ times.
The average result we expect to see, in the ideal case, is thus:
$$\frac{n p_1 x_1 + n p_2 x_2 + n p_3 x_3 + \cdots}{n} = p_1 x_1 + p_2 x_2 + p_3 x_3 + \cdots$$
The expected value of rolling a die is $\mu = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5$
PART FOUR: 11
COMMENT: We need to be careful to distinguish between the ideal God-like situation and
the non-ideal mortal reality. When running an experiment, such as rolling a die, it is unlikely
our data values will ever actually have mean value E(X)! (Try it. Roll a die six times and
compute the average result. It probably isn't 3.5, though it can happen!)
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
We can hope that the data mean $\bar{x}$ is a close approximation to the true mean $\mu$.
COMMENT: We saw for the data mean that the values $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$ sum to zero
and so represent only n - 1 truly independent values.
In the God world, the values $x_1 - \mu, x_2 - \mu, \ldots, x_n - \mu$ are not guaranteed to sum to zero and
so actually do represent n independent values.
Moving on
As mere mortals we defined the variance as the average of the squared deviations from the
data mean. The Gods' analogue of variance is thus:
$$Var(X) = (x_1 - \mu)^2 p_1 + (x_2 - \mu)^2 p_2 + \cdots$$
$$\sigma(X) = \sqrt{Var(X)} = \sqrt{(x_1 - \mu)^2 p_1 + (x_2 - \mu)^2 p_2 + \cdots}$$
PART FOUR: 12
If X is a random variable and k is a constant, then kX is the random variable with all values
multiplied by k with the same underlying probability distribution; and X + k is the random
variable with all values increased by k. (See page 17 for more.) We shall prove in the next
section:
THEOREM:
E (kX ) = kE ( X )
Var (kX ) = k 2Var ( X )
E( X + k ) = E( X ) + k
Var ( X + k ) = Var ( X )
If X and Y are two random variables then X + Y is the random variable with values $x_i + y_j$
where $x_i$ is a value adopted by X, $y_j$ is a value adopted by Y, and the probability
associated with $x_i + y_j$ is $P(X = x_i \text{ and } Y = y_j)$. How one computes this probability
depends on the nature of X and Y. The nicest situation of all would be if, as in naive
probability theory, the word "and" continues to translate into an action of multiplication.
For example, if X is the roll of a die and Y is the spin of a spinner with numbers 1 through
10 (each equally likely), then there are 60 equally likely outcomes for the pair (X, Y) and
$P(X = 3 \text{ and } Y = 8)$, say, equals $\frac{1}{60}$. This equals $\frac{1}{6} \cdot \frac{1}{10}$, which is $P(X = 3)\,P(Y = 8)$. Here X
and Y are independent.
COMMENT: It is curious that Var ( X + Y ) = Var ( X ) + Var (Y ) is always true in our mortal
world (calculated with data means) and not always true for the God-world (calculated with
actual means). We require the condition that X and Y should be independent for this
result to hold in the God world. What is the difference?
$$x_1, x_2, \ldots, x_n$$
$$y_1, y_2, \ldots, y_n$$
We did not know the true means of the underlying random variables X and Y, but
calculated instead just the data means $\bar{x}$ and $\bar{y}$:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} x_1 + \frac{1}{n} x_2 + \cdots + \frac{1}{n} x_n$$
$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{1}{n} y_1 + \frac{1}{n} y_2 + \cdots + \frac{1}{n} y_n$$
But these expressions each look like the expected value of a random variable.
X' has values $x_1, x_2, \ldots, x_n$ with probabilities $\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}$
Y' has values $y_1, y_2, \ldots, y_n$ with probabilities $\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}$
Then E(X') is the data mean $\bar{x}$ and E(Y') is the data mean $\bar{y}$.
X' and Y' aren't the real random variables lurking behind the data sets, but they are the
ones we see by only looking at the data.
$$x_1 + y_1, x_1 + y_2, \ldots, x_n + y_n$$
This gives $n^2$ values and we computed their mean as $\frac{(x_1 + y_1) + (x_1 + y_2) + \cdots + (x_n + y_n)}{n^2}$.
PART FOUR: 14
But in doing this we tacitly assumed that each of the pairs in this sum, $x_i + y_j$, has the
same frequency as any other pair. That is, we assumed each pair is equally likely, and so,
since there are $n^2$ pairs in all, each pair $x_i + y_j$ comes with probability $\frac{1}{n^2}$:
$$P(X' = x_i \text{ and } Y' = y_j) = \frac{1}{n^2}$$
This agrees with the product of the individual probabilities:
$$P(X' = x_i)\,P(Y' = y_j) = \frac{1}{n} \cdot \frac{1}{n} = \frac{1}{n^2}$$
Our human constructs X' and Y' obey all the conditions of the Gods and so obey:
$$Var(X' + Y') = Var(X') + Var(Y')$$
We did not realize at the time, but in Part I of these notes we were implicitly drawn to
mimicking the ways of the Gods! (Such is always the wont of mankind?)
PART FOUR: 15
Loosely speaking, a random variable X is a quantity whose value is not known but whose
probability of taking a particular value or a range of values is known. Thus random variables
come a priori with a probability distribution function in mind (at least in principle).
A random variable is said to be discrete if it adopts only finitely many possible values
(each having a known probability of occurring) or a list of possible values. It is continuous
if it can adopt a continuous range of values with probability values $P(a \le X \le b)$ known and
given by a probability distribution curve.
Example: A roll of a biased die could be a random variable Y with probability distribution:
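Definition: Suppose a discrete random variable X has value x with probability P(x). Then
the expected value (or mean) of the random variable is:
$$\mu = E(X) = \sum x\,P(x)$$
For example, for the roll of a fair die:
$$E(X) = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5$$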
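$$\sigma^2 = Var(X) = \sum (x - \mu)^2 P(x)$$
For the roll of a fair die this gives
$$Var(X) = (1 - 3.5)^2 \cdot \frac{1}{6} + (2 - 3.5)^2 \cdot \frac{1}{6} + \cdots + (6 - 3.5)^2 \cdot \frac{1}{6} = \frac{35}{12} \approx 2.92$$
so the standard deviation is $\sigma = \sqrt{35/12} \approx 1.71$.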
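These two definitions translate directly into a few lines of Python. This is a sketch only: representing a distribution as a dictionary from values to probabilities is a choice made here for illustration, not something used elsewhere in these notes.

```python
from fractions import Fraction
import math

def expected_value(dist):
    """Sum of x * P(x) over a {value: probability} dictionary."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Sum of (x - mu)^2 * P(x) over a {value: probability} dictionary."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

# A fair die: each face 1..6 has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

mean = expected_value(die)   # exact fraction 7/2
var = variance(die)          # exact fraction 35/12
sd = math.sqrt(var)          # about 1.71

print(mean, var, round(sd, 3))
```

For a fair die the mean comes out as 7/2 = 3.5, the variance as 35/12 and the standard deviation as roughly 1.71.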
NOTE: The n 1 versus n issue comes into play here! This definition assumes the
divide by n convention.
NOTE: $Var(X) = E\left((X - \mu)^2\right)$
PART FOUR: 17
If all the numbers on a die are doubled then we'd expect the mean value of a roll to
double (from 3.5 to 7) and the standard deviation (spread) of values to double as well (for a
fair die, from about 1.71 to about 3.42).
$$E(2X) = 2E(X)$$
$$SD(2X) = 2\,SD(X)$$
(By 2X we mean a new random variable with values double those of X, but with the same
underlying probability distribution.)
In general:
E (aX ) = aE ( X )
Var (aX ) = a 2Var ( X )
so that
SD (aX ) = | a | SD ( X )
If $\mu = E(X)$ then $E(aX) = a\mu$ and
$$Var(aX) = (a x_1 - a\mu)^2 p_1 + \cdots + (a x_n - a\mu)^2 p_n = a^2\left((x_1 - \mu)^2 p_1 + \cdots + (x_n - \mu)^2 p_n\right) = a^2\,Var(X)$$
In $\Sigma$ notation the same computations read:
$$E(aX) = \sum a x\,P(x) = a \sum x\,P(x) = a\,E(X)$$
$$Var(aX) = \sum (ax - a\mu)^2 P(x) = a^2 \sum (x - \mu)^2 P(x) = a^2\,Var(X)$$
We shall follow this style of presentation in the proofs that follow. (But of course it is
always possible and usually helpful to translate the lines in these proofs back to the
form without $\Sigma$ notation.)
PART FOUR: 18
Suppose we add 3 to all the values of the die. Then we'd expect the mean roll to increase
by three (from 3.5 to 6.5) but the spread of values, the standard deviation, not to change:
$$E(X + 3) = E(X) + 3$$
$$Var(X + 3) = Var(X)$$
In general:
E ( X + c) = E ( X ) + c
Var ( X + c) = Var ( X )
Proof: $E(X + c) = \sum (x + c) P(x) = \sum x P(x) + c \sum P(x) = E(X) + c \cdot 1 = E(X) + c$
(Note that all probabilities sum to 1. Thus $\sum P(x) = 1$.) If we set $E(X) = \mu$ we have just
established that $E(X + c) = \mu + c$. We thus have:
$$Var(X + c) = \sum \left((x + c) - (\mu + c)\right)^2 P(x) = \sum (x - \mu)^2 P(x) = Var(X)$$
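The scaling and shifting rules can be checked exactly on the fair die. The following is a sketch under the assumption that a distribution is stored as a dictionary from values to probabilities; the helper names are hypothetical.

```python
from fractions import Fraction

def expected_value(dist):
    """Sum of x * P(x) over a {value: probability} dictionary."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Sum of (x - mu)^2 * P(x) over a {value: probability} dictionary."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def transform(dist, f):
    """Apply f to every value, keeping the same underlying probabilities.

    Assumes f sends different values to different values.
    """
    return {f(x): p for x, p in dist.items()}

die = {face: Fraction(1, 6) for face in range(1, 7)}
doubled = transform(die, lambda x: 2 * x)   # the random variable 2X
shifted = transform(die, lambda x: x + 3)   # the random variable X + 3

print(expected_value(doubled), variance(doubled))   # 2 E(X) and 4 Var(X)
print(expected_value(shifted), variance(shifted))   # E(X) + 3 and Var(X)
```

Doubling multiplies the variance by 4 (so the standard deviation by 2), while shifting leaves the variance alone.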
Let X be the random variable associated to rolling a die, and Y the random variable of
rolling the die a second time. (OR X and Y can be the random variables associated to rolling
two different dice simultaneously.)
Then X + Y is the random variable that corresponds to the sum of two dice and XY the
random variable that corresponds to the product of the two rolls.
The following formulas are true, but they take some work to establish:
E ( X + Y ) = E ( X ) + E (Y )
E ( XY ) = E ( X ) E (Y )
Var ( X + Y ) = Var ( X ) + Var (Y )
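$$E(X + Y) = 2P(X = 1 \text{ and } Y = 1) + 3P(X = 1 \text{ and } Y = 2) + 3P(X = 2 \text{ and } Y = 1) + \cdots + 12P(X = 6 \text{ and } Y = 6)$$
$$= 2 \cdot \frac{1}{36} + 3 \cdot \frac{1}{36} + 3 \cdot \frac{1}{36} + \cdots + 12 \cdot \frac{1}{36}$$
$$= 7$$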
$$Var(X - Y) = Var(X + (-Y))$$
$$= Var(X) + Var(-Y)$$
$$= Var(X) + (-1)^2\,Var(Y) = Var(X) + Var(Y)$$
That is,
Suppose s is a given value and we wish to compute the probability that the sum X + Y
equals s. We can do this by listing all the x-values and all the y-values that sum to
s. Suppose this list appears:
$$x_1 + y_1 = s$$
$$x_2 + y_2 = s$$
$$\vdots$$
$$x_k + y_k = s$$
Then
$$P(X + Y = s) = P(X = x_1, Y = y_1 \text{ OR } X = x_2, Y = y_2 \text{ OR } \cdots \text{ OR } X = x_k, Y = y_k)$$
$$= \sum_{x + y = s} P(X = x, Y = y)$$
$$= \sum_{x + y = s} P(X = x)\,P(Y = y)$$
where $\sum_{x + y = s}$ denotes summation over all the pairs of x- and y-values that sum to s. (The
last step uses the independence of X and Y.)
$$E(X + Y) = \sum_s s\,P(X + Y = s)$$
$$= \sum_s \sum_{x + y = s} s\,P(X = x)\,P(Y = y)$$
$$= \sum_x \sum_y (x + y)\,P(X = x)\,P(Y = y)$$
$$= \sum_x \sum_y x\,P(X = x)\,P(Y = y) + \sum_x \sum_y y\,P(X = x)\,P(Y = y)$$
$$= \sum_x x\,P(X = x) \sum_y P(Y = y) + \sum_y y\,P(Y = y) \sum_x P(X = x)$$
$$= \sum_x x\,P(X = x) \cdot 1 + \sum_y y\,P(Y = y) \cdot 1$$
$$= E(X) + E(Y)$$
PART FOUR: 21
Also:
$$E(XY) = \sum_s s\,P(XY = s)$$
$$= \sum_s \sum_{xy = s} xy\,P(X = x \text{ and } Y = y)$$
$$= \sum_x \sum_y xy\,P(X = x \text{ and } Y = y)$$
$$= \sum_x x\,P(X = x) \sum_y y\,P(Y = y)$$
$$= \sum_x x\,P(X = x)\,E(Y)$$
$$= E(Y) \sum_x x\,P(X = x) = E(X)\,E(Y)$$
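Writing $\mu_X = E(X)$ and $\mu_Y = E(Y)$, so that $E(X + Y) = \mu_X + \mu_Y$:
$$Var(X + Y) = E\left((X + Y - \mu_X - \mu_Y)^2\right)$$
$$= E\left(((X - \mu_X) + (Y - \mu_Y))^2\right)$$
$$= E\left((X - \mu_X)^2 + 2(X - \mu_X)(Y - \mu_Y) + (Y - \mu_Y)^2\right)$$
$$= E\left((X - \mu_X)^2\right) + 2E(X - \mu_X)\,E(Y - \mu_Y) + E\left((Y - \mu_Y)^2\right)$$
$$= Var(X) + 2(\mu_X - \mu_X)(\mu_Y - \mu_Y) + Var(Y)$$
$$= Var(X) + Var(Y)$$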
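For two independent fair dice all three formulas can be confirmed by brute force over the 36 equally likely outcomes. This is an illustrative sketch with made-up variable names, using exact fractions.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p_pair = Fraction(1, 36)                      # each (x, y) pair is equally likely
pairs = list(product(faces, repeat=2))

e_x = sum(Fraction(x, 6) for x in faces)                      # E(X) = E(Y) = 7/2
var_x = sum((x - e_x) ** 2 * Fraction(1, 6) for x in faces)   # Var(X) = Var(Y)

e_sum = sum((x + y) * p_pair for x, y in pairs)               # E(X + Y)
e_prod = sum(x * y * p_pair for x, y in pairs)                # E(XY)
var_sum = sum((x + y - e_sum) ** 2 * p_pair for x, y in pairs)  # Var(X + Y)

print(e_sum, e_prod, var_sum)
```

The sum has mean 7, the product has mean (7/2)(7/2) = 49/4, and the variance of the sum is twice the variance of a single die.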
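$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
The central limit theorem states that if each of the random variables $X_1, X_2, \ldots, X_n$ has mean
$\mu$ and standard deviation $\sigma$, then $\bar{X}$ is approximately normally distributed with mean $\mu$
and standard deviation $\sigma/\sqrt{n}$.
The hard part of this theorem is proving the approximation to the normal curve.
Calculating the mean and standard deviation of $\bar{X}$ is now easy!
$$E(\bar{X}) = \frac{1}{n} E(X_1 + \cdots + X_n) = \frac{1}{n}\left(E(X_1) + \cdots + E(X_n)\right) = \frac{1}{n}(\mu + \cdots + \mu) = \mu$$
$$Var(\bar{X}) = \frac{1}{n^2}\left(Var(X_1) + \cdots + Var(X_n)\right) = \frac{1}{n^2}\left(\sigma^2 + \cdots + \sigma^2\right) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$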
PART FOUR: 23
Krunchy-Munch Cereal company has placed a prize in every third box. How many boxes of
cereal can I expect to buy before seeing a prize?
The chance of finding a prize in a given box is $p = \frac{1}{3}$ and the chance of failing to see a
prize is $q = \frac{2}{3}$. We have:
The probability of finding a prize in the first box you try is $p$.
The probability of first finding the prize in the second box you try is: $qp$.
The probability of first finding the prize in the third box you try is: $q^2 p$
and so on.
Let X be the random variable which is the box number you open when you first find the
prize. Then:
$$E(X) = 1 \cdot p + 2 \cdot qp + 3 \cdot q^2 p + 4 \cdot q^3 p + \cdots$$
COMMENT: Here X can have one of an infinite set of discrete values. It is also called a
discrete random variable.
To evaluate this sum we need to make use of the famous geometric formula.
ASIDE: Proving that $1 + x + x^2 + x^3 + \cdots = \frac{1}{1 - x}$
Suppose we wish to find the value of this infinite sum. Let's call its value S:
$$1 + x + x^2 + x^3 + \cdots = S$$
Multiply through by x:
$$x + x^2 + x^3 + x^4 + \cdots = xS$$
That is:
$$S - 1 = xS$$
Solving gives:
$$S = \frac{1}{1 - x}$$
COMMENT: We've actually only proven the statement: IF the sum $1 + x + x^2 + x^3 + \cdots$ has a
finite value, then that value must be $\frac{1}{1 - x}$. In a calculus course, one proves that the sum
does indeed converge to a finite answer for values $-1 < x < 1$.
PART FOUR: 24
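$$E(X) = 1 \cdot p + 2 \cdot qp + 3 \cdot q^2 p + 4 \cdot q^3 p + \cdots$$
Multiplying through by q gives $qE(X) = 1 \cdot qp + 2 \cdot q^2 p + 3 \cdot q^3 p + \cdots$, and subtracting:
$$E(X) - qE(X) = p + qp + q^2 p + q^3 p + \cdots$$
That is:
$$(1 - q)E(X) = p\left(1 + q + q^2 + q^3 + \cdots\right) = p \cdot \frac{1}{1 - q}$$
That is:
$$pE(X) = p \cdot \frac{1}{p} = 1$$
and so:
$$E(X) = \frac{1}{p}$$
With $p = \frac{1}{3}$ this has value 3.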
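One can watch the infinite sum settle on 1/p by adding up a large but finite number of terms. This is a rough numerical sketch; the function name and the cut-off of 500 terms are arbitrary choices made here.

```python
def expected_boxes(p, terms=500):
    """Partial sum of 1*p + 2*q*p + 3*q^2*p + ... with q = 1 - p."""
    q = 1 - p
    return sum(k * q ** (k - 1) * p for k in range(1, terms + 1))

approx = expected_boxes(1 / 3)
print(approx)   # very close to 1/p = 3
```

With p = 1/3 the terms shrink quickly, so 500 terms already agree with 3 to many decimal places.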
FOR THE BOLD: Suppose Krunchy-Munch Cereal company actually has three different
prizes and each box contains one of these prizes. (Assume there is a one-third chance of
finding any particular prize in a given box.)
Show that one can expect to buy $3\left(1 + \frac{1}{2} + \frac{1}{3}\right) = 5.5$ boxes before seeing all three prizes.
[In general: If there are n prizes to be had, show that one can expect to buy
$n\left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}\right)$ boxes before seeing all n prizes.]
PART FOUR: 25
The cereal box problem is an example of a general situation. Suppose an experiment is run
and the probability of observing a success is p and of observing a failure $q = 1 - p$. (For
example, in tossing a coin with heads deemed a success, we have $p = q = \frac{1}{2}$. In rolling a
die with six deemed a success, we have $p = \frac{1}{6}$ and $q = \frac{5}{6}$.)
X = the number of runs of the experiment needed for seeing a first success.
There is a p chance that one will see a success on the first run, so X has value 1 with
probability p.
We have seen on the previous page that $E(X) = \frac{1}{p}$. With some work it is possible to show
that $Var(X) = \frac{q}{p^2}$. (One uses the formula $Var(X) = E(X^2) - (E(X))^2$ and shows that
$E(X^2) = 1 \cdot p + 4qp + 9q^2 p + 16q^3 p + \cdots = \frac{p(1 + q)}{(1 - q)^3}$.)
PART FOUR: 26
In summary:
$$P(X = n) = q^{n-1} p$$
$$E(X) = \frac{1}{p}$$
$$Var(X) = \frac{q}{p^2}$$
EXAMPLE: You would like to know the meaning of life, but only 1 in 100 people on this
planet know the answer. You decide to ask each person you meet until you find someone
who can tell you the answer.
a) How many people do you expect to meet before finding someone with the answer?
b) What is the probability that you will find the person with the answer within the
first three people you meet?
a) $E(X) = \frac{1}{p} = 100$. We can expect to meet 100 people before finding the answer.
b) $P(X \le 3) = P(X = 1) + P(X = 2) + P(X = 3) = p + qp + q^2 p \approx 2.97\%$.
In general:
$$P(X > n) = 1 - P(X \le n)$$
$$= 1 - p - qp - \cdots - q^{n-1} p$$
$$= 1 - p\left(1 + q + \cdots + q^{n-1}\right)$$
$$= 1 - p \cdot \frac{1 - q^n}{1 - q}$$
$$= 1 - (1 - q^n)$$
$$= q^n$$
The probability that it will take me more than 100 people to find the meaning of life is
thus $(0.99)^{100} \approx 36.6\%$.
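The numbers in this example can be reproduced in a few lines. This is a minimal sketch; the function names are made up for illustration.

```python
def geometric_pmf(n, p):
    """P(X = n): first success on run n, i.e. q^(n-1) * p."""
    return (1 - p) ** (n - 1) * p

def prob_more_than(n, p):
    """P(X > n) = q^n: the first n runs are all failures."""
    return (1 - p) ** n

p = 0.01
within_three = sum(geometric_pmf(k, p) for k in (1, 2, 3))
more_than_100 = prob_more_than(100, p)
direct = 1 - sum(geometric_pmf(k, p) for k in range(1, 101))  # 1 - P(X <= 100)

print(round(100 * within_three, 2), round(100 * more_than_100, 1))
```

The last two computed quantities agree, which checks the closed form q^n against the direct sum.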
PART FOUR: 27
In tossing a coin ten times we have already worked out the distribution of probabilities of
obtaining exactly 10, 9, 8, ..., 2, 1 and no heads. (See Part I.) This is a specific example of a
more general situation.
[For example, in tossing a coin 10 times, $B_{10}(3) = \frac{10!}{3!\,7!}\left(\frac{1}{2}\right)^3\left(\frac{1}{2}\right)^7 \approx 11.7\%$.]
For each value n we have a series of probability values, one for each of k = 0 to k = n.
Each of these distributions of probabilities is called a binomial distribution.
In general, we have:
$$B_n(k) = \frac{n!}{k!\,(n - k)!}\,p^k q^{n-k}$$
$$(p + q)^n = p^n + n p^{n-1} q + \cdots + \frac{n!}{k!\,(n - k)!}\,p^k q^{n-k} + \cdots + q^n$$
We shall prove:
Consider the binomial distribution of n trials. If X is the random variable that counts how
many successes occur, then, as we have seen:
$$P(X = k) = \frac{n!}{k!\,(n - k)!}\,p^k q^{n-k}$$
Proof:
First consider a single run of the experiment ( n = 1 ). There is either 1 success or zero
successes:
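$$E(X) = 1 \cdot p + 0 \cdot q = p$$
Using $Var(X) = \sum (x - \mu)^2 P(x)$ with $\mu = p$ we have:
$$Var(X) = (1 - p)^2 p + (0 - p)^2 q$$
$$= q^2 p + p^2 q$$
$$= pq(p + q)$$
$$= pq$$
For n independent runs write $X = X_1 + X_2 + \cdots + X_n$, where $X_i$ counts the successes on
the ith run. Then:
$$E(X) = E(X_1) + E(X_2) + \cdots + E(X_n) = p + p + \cdots + p = np$$
$$Var(X) = Var(X_1) + Var(X_2) + \cdots + Var(X_n) = pq + pq + \cdots + pq = npq$$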
Answer: Here $p = 0.01$ and $q = 0.99$. Among a sample of 40 people we can expect:
$$E(X) = 40p = 0.4$$
$$SD(X) = \sqrt{npq} = \sqrt{40 \times 0.01 \times 0.99} \approx 0.63$$
Also
$$P(X = 2 \text{ or } 3) = P(X = 2) + P(X = 3)$$
$$= \frac{40!}{2!\,38!}(0.01)^2 (0.99)^{38} + \frac{40!}{3!\,37!}(0.01)^3 (0.99)^{37}$$
$$\approx 6.0\%$$
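Binomial probabilities like these are tedious by hand but quick to compute. Here is a small Python sketch using the standard library's `math.comb` (Python 3.8 or later); the function name `binomial_pmf` is made up for illustration.

```python
import math

def binomial_pmf(n, k, p):
    """P(exactly k successes in n independent trials with success chance p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Ten coin tosses, exactly three heads.
b10_3 = binomial_pmf(10, 3, 0.5)

# Sample of n = 40 people with p = 0.01.
n, p = 40, 0.01
mean = n * p
sd = math.sqrt(n * p * (1 - p))
two_or_three = binomial_pmf(n, 2, p) + binomial_pmf(n, 3, p)

print(round(100 * b10_3, 1), mean, round(sd, 2), round(100 * two_or_three, 1))
```

As a sanity check, the probabilities for k = 0 through n always add up to 1, reflecting the expansion of (p + q)^n.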
$$\frac{450!}{150!\,300!}(0.06)^{150}(0.94)^{300}$$
Fortunately, for large samples (and we'll explain what we mean by "large" in a moment) the
binomial distribution can be well approximated by a normal curve with:
$$\mu = np$$
$$\sigma = \sqrt{npq}$$
We can then use the tables of the normal curve values in order to approximate the
probabilities we need.
PART FOUR: 30
The true reason is that there is a connection between factorials and the number e. Stirling
proved that:
$$n! \approx \sqrt{2\pi n}\; n^n e^{-n}$$
with the approximation only improving as n grows larger. Thus it seems plausible (and is in
fact the case) that the formulas one obtains from the binomial distribution begin to look
like a formula for the curve of the form $y = e^{-x^2}$, a normal curve.
[Calculus gives a hint as to why Stirling's formula might be true. Notice that:
$$\ln n! = \ln 1 + \ln 2 + \ln 3 + \cdots + \ln n$$
$$\ln n! \approx \int_1^n \ln x \, dx = \left[x \ln x - x\right]_1^n = n \ln n - n + 1 \approx n \ln n - n$$
which gives $n! \approx n^n e^{-n}$. (Stirling gave a more refined version of this argument.)]
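To see how good Stirling's approximation is, one can compare it with exact factorials for a few values of n. This is a quick sketch with made-up function names; it is limited to moderate n so that the floating-point numbers do not overflow.

```python
import math

def stirling(n):
    """Stirling's approximation sqrt(2 pi n) * n^n * e^(-n)."""
    return math.sqrt(2 * math.pi * n) * n ** n * math.exp(-n)

def ratio(n):
    """Exact n! divided by Stirling's approximation."""
    return math.factorial(n) / stirling(n)

for n in (5, 10, 50, 100):
    print(n, round(ratio(n), 5))
```

The ratios sit slightly above 1 and creep toward 1 as n grows, in line with the statement that the approximation improves for larger n.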
For the binomial distribution $\mu = np$ and $\sigma = \sqrt{npq}$ and we would like: $\mu + 3\sigma < n$ and
$0 < \mu - 3\sigma$, so that the two tails correspond to moving beyond n and below 0.
PART FOUR: 31
$$np > 3\sqrt{npq}$$
$$np > 9q$$
Since $q \le 1$, it is enough to require:
$$np > 9$$
The other relation gives, in the same way:
$$nq > 9$$
Suppose we're interested in finding the proportion of Americans who can whistle. Call this
true, and unknown, proportion $p$. To estimate $p$ we could collect a sample of 1000 people,
say, and find the sample proportion $\hat{p}$ of those who can whistle.
Since the population of Americans is so large, removing this first person from the
pool really wont alter the probability of choosing a second American who can
whistle. The chances are still, for all meaningful purposes, p.
$$\mu = p$$
$$\sigma = \sqrt{\frac{pq}{n}}$$
This is the central limit theorem: Version II from part III of these notes.
PART FOUR: 33
STUDENTS t-DISTRIBUTION
Throughout Part III of these notes, in calculating confidence intervals and the like, we
assumed that the standard deviation $\sigma$ of a distribution is known. This is rarely
the case.
In many cases we approximate $\sigma$ with the value $\hat{\sigma}$ given by the standard deviation of the
sample at hand.
EXAMPLE: Of 1500 adult Americans that were polled, 3.2% said that they had a
thoroughly enjoyable experience studying math in high-school. Estimate the proportion of
ALL adult Americans that will say the same.
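Answer: We have (working in percentage points)
$$\hat{p} = 3.2$$
$$\hat{\sigma} = \sqrt{\frac{3.2 \times 96.8}{1500}} = 0.454$$
Using $\hat{\sigma}$ as an approximation for $\sigma$, with 95% confidence we can say that the percentage
of Americans who felt this way about math lies in the range [2.292, 4.108].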
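The interval is just the sample percentage plus or minus two estimated standard deviations. A short Python sketch of that arithmetic follows; the variable names are made up, and the "plus or minus 2" rule is the 95% rule of thumb used in these notes rather than the more precise 1.96.

```python
import math

n = 1500
p_hat = 3.2   # sample percentage who enjoyed math

# Estimated standard deviation of the sample percentage, in percentage points.
sigma_hat = math.sqrt(p_hat * (100 - p_hat) / n)

# Roughly 95% of sample percentages fall within 2 standard deviations.
low = p_hat - 2 * sigma_hat
high = p_hat + 2 * sigma_hat

print(round(sigma_hat, 3), round(low, 2), round(high, 2))
```

Small differences in the last decimal place compared with the range quoted above come from rounding the standard deviation to 0.454 before doubling it.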
Gosset noticed, when performing statistical checks in his brewing company, that
unacceptable errors occurred in his analyses by making use of these approximate values
$\hat{\sigma}$ for $\sigma$. He set to work analyzing the distribution of $\hat{\sigma}$ values in their own right, in the
same way we analyse the distribution of sample means $\bar{x}$ via the central limit theorem.
To be more precise, Gosset realized that since we always need to convert entities to their
z-score, it is best to analyse the distribution of the quantities:
$$t = \frac{\bar{x} - \mu}{\hat{\sigma}/\sqrt{n}}$$
He completely determined the mathematics of these entities, showing that, for each n
(the size of the sample) they follow their own distribution curves called Student's
t-distribution with n - 1 degrees of freedom. (Gosset published under the pseudonym
"Student.")
PART FOUR: 34
Thus, one can create more accurate confidence intervals for means from a sample by
working with tables from Student's t-distributions rather than working with the normal
distribution and approximate values for the standard deviation.
EXAMPLE: The speeds of 23 cars along a particular road with posted speed limit 40 mph were
recorded. Their mean speed was $\bar{x} = 41.0$ mph with $\hat{\sigma} = 4.23$ mph. Is there reason to
believe that the mean speed of all cars along this road is greater than 40 mph?
Answer: Assuming that car speeds follow something akin to a normal distribution (we can
plot the results and see if they appear bell shaped) we can follow Gosset's model. We'll
make the assumption that $\mu = 40.0$ and see what we conclude.
Here:
$$t = \frac{41.0 - 40.0}{4.23/\sqrt{23}} \approx 1.13$$
This is not sufficiently rare to conclude that receiving a sample mean of 41.0 is unusual
under the assumption that the true mean is 40.0. That is, we have no reason to reject the
idea that the true mean is 40.0 mph.
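The t-value itself is a one-line computation. The sketch below only reproduces the arithmetic; judging whether the value is "rare" still needs a table of Student's t-distribution with 22 degrees of freedom, which is not included here. The function name is hypothetical.

```python
import math

def t_statistic(sample_mean, assumed_mean, sample_sd, n):
    """Distance of the sample mean from the assumed mean, in units of sample_sd / sqrt(n)."""
    return (sample_mean - assumed_mean) / (sample_sd / math.sqrt(n))

t = t_statistic(sample_mean=41.0, assumed_mean=40.0, sample_sd=4.23, n=23)
degrees_of_freedom = 23 - 1

print(round(t, 2), degrees_of_freedom)
```

A larger sample with the same mean and spread would give a larger t, since the denominator shrinks like one over the square root of n.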
PART FOUR: 35
Two companies make light bulbs: company 1 and company 2. We'd like to know if there is
any difference between the mean life-time of the bulbs they each produce. We take a
sample of bulbs of size $n_1$ from company 1 and compute their sample mean $\bar{x}_1$, and a sample
of size $n_2$ from company 2 and compute their sample mean $\bar{x}_2$. What does the difference
$\bar{x}_1 - \bar{x}_2$ of these two sample means tell us about the difference of the two true means $\mu_1 - \mu_2$?
Let $\bar{X}_1$ be the random variable of possible $\bar{x}_1$ values, and define $\bar{X}_2$ similarly. By the
central limit theorem:
$$E(\bar{X}_1) = \mu_1 \qquad E(\bar{X}_2) = \mu_2$$
$$SD(\bar{X}_1) = \frac{\sigma_1}{\sqrt{n_1}} \qquad SD(\bar{X}_2) = \frac{\sigma_2}{\sqrt{n_2}}$$
We also have:
$$E(\bar{X}_1 - \bar{X}_2) = E(\bar{X}_1) - E(\bar{X}_2) = \mu_1 - \mu_2$$
$$SD(\bar{X}_1 - \bar{X}_2) = \sqrt{Var(\bar{X}_1 - \bar{X}_2)} = \sqrt{Var(\bar{X}_1) + Var(\bar{X}_2)} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
This is, of course, assuming that we know the true standard deviations. Without this
knowledge, we can still test if the difference $\mu_1 - \mu_2 = 0$ by using Student's t-distribution
by computing the value:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\hat{\sigma}_1^2}{n_1} + \dfrac{\hat{\sigma}_2^2}{n_2}}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\hat{\sigma}_1^2}{n_1} + \dfrac{\hat{\sigma}_2^2}{n_2}}}$$
and seeing if this value is an acceptable distance from $\mu_1 - \mu_2 = 0$. Details are omitted
here, but the gist is clear.
COMMENT: One of the details omitted is determining the number of degrees of freedom
that are appropriate for this problem. Because we have a mix of sample sizes, the count of
degrees of freedom is complicated considerably.
PART FOUR: 36
The mathematics of this random variable is well understood and its probability distribution
is known. It is called the chi squared distribution with $\nu$ degrees of freedom. This
distribution has $\mu = \nu$ and $\sigma = \sqrt{2\nu}$.
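CHEBYSHEV'S INEQUALITY
For any probability distribution, if the mean of the associated random variable is $\mu$
and its standard deviation $\sigma$, then the area under the curve greater than k
standard deviations from the mean is no more than $\frac{1}{k^2}$.
Proof: Keeping only the terms of the variance with $|x - \mu| \ge k\sigma$, each of which has $(x - \mu)^2 \ge k^2\sigma^2$:
$$\sigma^2 = \sum (x - \mu)^2 P(x) \ge \sum_{|x - \mu| \ge k\sigma} (x - \mu)^2 P(x)$$
$$\ge \sum_{|x - \mu| \ge k\sigma} k^2 \sigma^2 P(x)$$
$$= k^2 \sigma^2 \sum_{|x - \mu| \ge k\sigma} P(x)$$
$$= k^2 \sigma^2\, P(|x - \mu| \ge k\sigma)$$
and so:
$$P(|x - \mu| \ge k\sigma) \le \frac{1}{k^2}$$
CHALLENGE: Prove, more generally, $P(|x - \mu| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2}$.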
PART FOUR: 37
One of the great check points of probability theory was the proof of the intuitively
obvious result, the law of large numbers. It showed that all was on the right track.
That is, the probability that the average $\frac{S_n}{n}$ differs from the true mean $\mu$ by more than
some error value $\epsilon$ goes to zero as n grows, no matter the degree of error you wish to
tolerate.
COMMENT: This is not actually saying that $\lim_{n \to \infty} \frac{S_n}{n}$ equals $\mu$, only, in some sense,
that $\lim_{n \to \infty} \frac{S_n}{n}$ equals $\mu$ with probability 1. (Very strange! But curious issues do arise in
the theory. For example, the chance of choosing a whole number by selecting a point at
random along a number line is zero, even though integers are themselves valid points to be
selected!)
Proof: By the results of pages 17-19 we have: $E\left(\frac{S_n}{n}\right) = \mu$ and $Var\left(\frac{S_n}{n}\right) = \frac{\sigma^2}{n}$.
By Chebyshev's inequality: $P\left(\left|\frac{S_n}{n} - \mu\right| \ge \epsilon\right) \le \frac{\sigma^2}{n\epsilon^2}$.
And so $\lim_{n \to \infty} P\left(\left|\frac{S_n}{n} - \mu\right| > \epsilon\right) = 0$ since clearly $\lim_{n \to \infty} \frac{\sigma^2}{n\epsilon^2} = 0$.
PART FOUR: 38