Você está na página 1de 217

Sirindhorn International Institute of Technology

Thammasat University
School of Information, Computer and Communication Technology

ECS315 2013/1
1

Part I.1

Dr.Prapun

Probability and You

Whether you like it or not, probabilities rule your life. If you have
ever tried to make a living as a gambler, you are painfully aware
of this, but even those of us with more mundane life stories are
constantly affected by these little numbers.
Example 1.1. Some examples from daily life where probability
calculations are involved are the determination of insurance premiums, the introduction of new medications on the market, opinion
polls, weather forecasts, and DNA evidence in courts. Probabilities also rule who you are. Did daddy pass you the X or the Y
chromosome? Did you inherit grandmas big nose?
Meanwhile, in everyday life, many of us use probabilities in our
language and say things like Im 99% certain or There is a onein-a-million chance or, when something unusual happens, ask the
rhetorical question What are the odds?. [18, p 1]
1.1

Randomness

1.2. Many clever people have thought about and debated what
randomness really is, and we could get into a long philosophical
discussion that could fill up a whole book. Lets not. The French
mathematician Laplace (17491827) put it nicely:
Probability is composed partly of our ignorance, partly
of our knowledge.
4

Inspired by Laplace, let us agree that you can use probabilities


whenever you are faced with uncertainty. [18, p 2]
1.3. Random phenomena arise because of [13]:
(a) our partial ignorance of the generating mechanism
(b) the laws governing the phenomena may be fundamentally random (as in quantum mechanics; see also Ex. 1.7.)
(c) our unwillingness to carry out exact analysis because it is not
worth the trouble
Example 1.4. Communication Systems [24]: The essence of
communication is randomness.
(a) Random Source: The transmitter is connected to a random
source, the output of which the receiver cannot predict with
certainty.
If a listener knew in advance exactly what a speaker
would say, and with what intonation he would say it,
there would be no need to listen!
(b) Noise: There is no communication problem unless the transmitted signal is disturbed during propagation or reception in
a random way.
(c) Probability theory is used to evaluate the performance of communication systems.
Example 1.5. Random numbers are used directly in the transmission and security of data over the airwaves or along the Internet.
(a) A radio transmitter and receiver could switch transmission
frequencies from moment to moment, seemingly at random,
but nevertheless in synchrony with each other.
(b) The Internet data could be credit-card information for a consumer purchase, or a stock or banking transaction secured by
the clever application of random numbers.
5

Example 1.6. Randomness is an essential ingredient in games of


all sorts, computer or otherwise, to make for unexpected action
and keen interest.
Example 1.7. On a more profound level, quantum physicists
teach us that everything is governed by the laws of probability.
They toss around terms like the Schrodinger wave equation and
Heisenbergs uncertainty principle, which are much too difficult for
most of us to understand, but one thing they do mean is that the
fundamental laws of physics can only be stated in terms of probabilities. And the fact that Newtons deterministic laws of physics
are still useful can also be attributed to results from the theory of
probabilities. [18, p 2]
1.8. Most people have preconceived notions of randomness that
often differ substantially from true randomness. Truly random
data sets often have unexpected properties that go against intuitive
thinking. These properties can be used to test whether data sets
have been tampered with when suspicion arises. [22, p 191]
[14, p 174]: people have a very poor conception of randomness; they do not recognize it when they see it and they cannot
produce it when they try
Example 1.9. Apple ran into an issue with the random shuffling
method it initially employed in its iPod music players: true randomness sometimes produces repetition, but when users heard the
same song or songs by the same artist played back-to-back, they
believed the shuffling wasnt random. And so the company made
the feature less random to make it feel more random, said Apple
founder Steve Jobs. [14, p 175]
1.2

Background on Some Frequently Used Examples

Probabilists love to play with coins and dice. We like the idea of
tossing coins, rolling dice, and drawing cards as experiments that
have equally likely outcomes.
1.10. Coin flipping or coin tossing is the practice of throwing
a coin in the air to observe the outcome.
6

When a coin is tossed, it does not necessarily fall heads or


tails; it can roll away or stand on its edge. Nevertheless, we shall
agree to regard head (H) and tail (T) as the only possible
outcomes of the experiment. [4, p 7]
Typical experiment includes
Flip a coin N times. Observe the sequence of heads and
tails or Observe the number of heads.
1.11. Historically, dice is the plural of die, but in modern standard English dice is used as both the singular and the plural. [Excerpted from Compact Oxford English Dictionary.]
Usually assume six-sided dice
Usually observe the number of dots on the side facing upwards.
1.12. A complete set of cards is called a pack or deck.
(a) The subset of cards held at one time by a player during a
game is commonly called a hand.
(b) For most games, the cards are assembled into a deck, and
their order is randomized by shuffling.
(c) A standard deck of 52 cards in use today includes thirteen
ranks of each of the four French suits.
The four suits are called spades (), clubs (), hearts
(), and diamonds (). The last two are red, the first
two black.
(d) There are thirteen face values (2, 3, . . . , 10, jack, queen, king,
ace) in each suit.
Cards of the same face value are called of the same kind.
court or face card: a king, queen, or jack of any suit.

1.3

A Glimpse at Probability Theory

1.13. Probabilities are used in situations that involve randomness. A probability is a number used to describe how likely
something is to occur, and probability (without indefinite article) is the study of probabilities. It is the art of being certain
of how uncertain you are. [18, p 24] If an event is certain
to happen, it is given a probability of 1. If it is certain not to
happen, it has a probability of 0. [7, p 66]
1.14. Probabilities can be expressed as fractions, as decimal numbers, or as percentages. If you toss a coin, the probability to get
heads is 1/2, which is the same as 0.5, which is the same as 50%.
There are no explicit rules for when to use which notation.
In daily language, proper fractions are often used and often
expressed, for example, as one in ten instead of 1/10 (one
tenth). This is also natural when you deal with equally likely
outcomes.
Decimal numbers are more common in technical and scientific reporting when probabilities are calculated from data.
Percentages are also common in daily language and often with
chance replacing probability.
Meteorologists, for example, typically say things like there
is a 20% chance of rain. The phrase the probability of rain
is 0.2 means the same thing.
When we deal with probabilities from a theoretical viewpoint,
we always think of them as numbers between 0 and 1, not as
percentages.
See also 3.5.

[18, p 10]

Definition 1.15. Important terms [13]:


(a) An activity or procedure or observation is called a random
experiment if its outcome cannot be predicted precisely because the conditions under which it is performed cannot be
predetermined with sufficient accuracy and completeness.
8

The term experiment is to be construed loosely. We do


not intend a laboratory situation with beakers and test
tubes.
Tossing/flipping a coin, rolling a dice, and drawing a card
from a deck are some examples of random experiments.
(b) A random experiment may have several separately identifiable
outcomes. We define the sample space as a collection
of all possible (separately identifiable) outcomes/results/measurements of a random experiment. Each outcome () is an
element, or sample point, of this space.
Rolling a dice has six possible identifiable outcomes
(1, 2, 3, 4, 5, and 6).
(c) Events are sets (or classes) of outcomes meeting some specifications.
Any1 event is a subset of .

Intuitively, an event is a statement about the outcome(s)


of an experiment.

The goal of probability theory is to compute the probability of various events of interest. Hence, we are talking about a set function
which is defined on subsets of .
Example 1.16. The statement when a coin is tossed, the probability to get heads is l/2 (50%) is a precise statement.
(a) It tells you that you are as likely to get heads as you are to
get tails.
(b) Another way to think about probabilities is in terms of average long-term behavior. In this case, if you toss the coin
repeatedly, in the long run you will get roughly 50% heads
and 50% tails.
1

For our class, it may be less confusing to allow event A to be any collection of outcomes
(, i.e. any subset of ).
In more advanced courses, when we deal with uncountable , we limit our interest to only
some subsets of . Technically, the collection of these subsets must form a -algebra.

Although the outcome of a random experiment is unpredictable,


there is a statistical regularity about the outcomes. What you
cannot be certain of is how the next toss will come up. [18, p 4]
1.17. Long-run frequency interpretation: If the probability
of an event A in some actual physical experiment is p, then we
believe that if the experiment is repeated independently over and
over again, then a theorem called the law of large numbers
(LLN) states that, in the long run, the event A will happen approximately 100p% of the time. In other words, if we repeat an
experiment a large number of times then the fraction of times the
event A occurs will be close to P (A).
Example 1.18. Return to the coin tossing experiment in Ex. 1.16:

Definition 1.19. Let A be one of the events of a random experiment. If we conduct a sequence of n independent trials of this
experiment, and if the event A occurs in N (A, n) out of these n
trials, then the fraction

is called the relative frequency of the event A in these n trials.


1.20. The long-run frequency interpretation mentioned in 1.17
can be restated as
N (A, n)
.
n
n

P (A) = lim

1.21. Another interpretation: The probability of an outcome can


be interpreted as our subjective probability, or degree of belief,
that the outcome will occur. Different individuals will no doubt
assign different probabilities to the same outcomes.
10

1.22. In terms of practical range, probability theory is comparable


with geometry ; both are branches of applied mathematics that
are directly linked with the problems of daily life. But while pretty
much anyone can call up a natural feel for geometry to some extent,
many people clearly have trouble with the development of a good
intuition for probability.
Probability and intuition do not always agree. In no other
branch of mathematics is it so easy to make mistakes
as in probability theory.
Students facing difficulties in grasping the concepts of probability theory might find comfort in the idea that even the
genius Leibniz, the inventor of differential and integral calculus along with Newton, had difficulties in calculating the
probability of throwing 11 with one throw of two dice. (See
Ex. 3.4.)
[22, p 4]

11

Review of Set Theory

Example 2.1. Let = {1, 2, 3, 4, 5, 6}

2.2. Venn diagram is very useful in set theory. It is often used


to portray relationships between sets. Many identities can be read
out simply by examining Venn diagrams.
2.3. If is a member of a set A, we write A.
Definition 2.4. Basic set operations (set algebra)
Complementation: Ac = { :
/ A}.
Union: A B = { : A or B}
Here oris inclusive; i.e., if A, we permit to belong
either to A or to B or to both.
Intersection: A B = { : A and B}
Hence, A if and only if belongs to both A and B.
A B is sometimes written simply as AB.
The set difference operation is defined by B \ A = B Ac .
B \ A is the set of B that do not belong to A.
When A B, B \ A is called the complement of A in B.

12

2.5. Basic Set Identities:


Idempotence: (Ac )c = A
Commutativity (symmetry):
AB =BA, AB =BA
Associativity:
A (B C) = (A B) C

A (B C) = (A B) C

Distributivity
A (B C) = (A B) (A C)

A (B C) = (A B) (A C)

de Morgan laws

(A B)c = Ac B c

(A B)c = Ac B c
2.6. Disjoint Sets:

Sets A and B are said to be disjoint (A B) if and only if


A B = . (They do not share member(s).)
A collection of sets (Ai : i I) is said to be pairwise disjoint or mutually exclusive [9, p. 9] if and only if Ai Aj =
when i 6= j.
Example 2.7. Sets A, B, and C are pairwise disjoint if

2.8. For a set of sets, to avoid the repeated use of the word set,
we will call it a collection/class/family of sets.

13

Definition 2.9. Given a set S, a collection = (A : I) of


subsets2 of S is said to be a partition of S if
S
(a) S = AI and
(b) For all i 6= j, Ai Aj (pairwise disjoint).
Remarks:
The subsets A , I are called the parts of the partition.

A part of a partition may be empty, but usually there is no


advantage in considering partitions with one or more empty
parts.
Example 2.10 (Slide:maps).
Example 2.11. Let E be the set of students taking ECS315

Definition 2.12. The cardinality (or size) of a collection or set


A, denoted |A|, is the number of elements of the collection. This
number may be finite or infinite.
A finite set is a set that has a finite number of elements.
A set that is not finite is called infinite.
Countable sets:

In this case, the subsets are indexed or labeled by taking values in an index or label
set I

14

Empty set and finite sets are automatically countable.

An infinite set A is said to be countable if the elements


of A can be enumerated or listed in a sequence: a1 , a2 , . . . .

A singleton is a set with exactly one element.


Ex. {1.5}, {.8}, {}.

Caution: Be sure you understand the difference between


the outcome -8 and the event {8}, which is the set consisting of the single outcome 8.

2.13. We can categorize sets according to their cardinality:

Example 2.14. Examples of countably infinite sets:


the set N = {1, 2, 3, . . . } of natural numbers,
the set {2k : k N} of all even numbers,
the set {2k 1 : k N} of all odd numbers,
the set Z of integers,

15

Set Theory Probability Theory


Set
Event
Universal set Sample Space ()
Element
Outcome ()
Table 1: The terminology of set theory and probability theory
A
Ac
AB
AB

Event Language
A occurs
A does not occur
Either A or B occur
Both A and B occur

Table 2: Event Language

Example 2.15. Example of uncountable sets3 :


R = (, )
interval [0, 1]
interval (0, 1]
(2, 3) [5, 7)
Definition 2.16. Probability theory renames some of the terminology in set theory. See Table 1 and Table 2.
Sometimes, s are called states, and is called the state
space.
2.17. Because of the mathematics required to determine probabilities, probabilistic methods are divided into two distinct types,
discrete and continuous. A discrete approach is used when the
number of experimental outcomes is finite (or infinite but countable). A continuous approach is used when the outcomes are continuous (and therefore infinite). It will be important to keep in
mind which case is under consideration since otherwise, certain
paradoxes may result.
3

We use a technique called diagonal argument to prove that a set is not countable and
hence uncountable.

16

Classical Probability

Classical probability, which is based upon the ratio of the number


of outcomes favorable to the occurrence of the event of interest to
the total number of possible outcomes, provided most of the probability models used prior to the 20th century. It is the first type
of probability problems studied by mathematicians, most notably,
Frenchmen Fermat and Pascal whose 17th century correspondence
with each other is usually considered to have started the systematic study of probabilities. [18, p 3] Classical probability remains
of importance today and provides the most accessible introduction
to the more general theory of probability.
Definition 3.1. Given a finite sample space , the classical
probability of an event A is
P (A) =

|A|
||

(1)

[6, Defn. 2.2.1 p 58]. In traditional language, a probability is


a fraction in which the bottom represents the number of possible outcomes, while the number on top represents the number of
outcomes in which the event of interest occurs.
Assumptions: When the following are not true, do not calculate probability using (1).
Finite : The number of possible outcomes is finite.

Equipossibility: The outcomes have equal probability of


occurrence.
The bases for identifying equipossibility were often
physical symmetry (e.g. a well-balanced dice, made of
homogeneous material in a cubical shape) or
a balance of information or knowledge concerning the various possible outcomes.
Equipossibility is meaningful only for finite sample space, and,
in this case, the evaluation of probability is accomplished
through the definition of classical probability.
17

We will NOT use this definition beyond this section. We will


soon introduce a formal definition in Section 5.
In many problems, when the finite sample space does not
contain equally likely outcomes, we can redefine the sample
space to make the outcome equipossible.

Example 3.2 (Slide). In drawing a card from a deck, there are 52


equally likely outcomes, 13 of which are diamonds. This leads to
a probability of 13/52 or 1/4.
3.3. Basic properties of classical probability: From Definition 3.1,
we can easily verified4 the properties below.
P (A) 0
P () = 1
P () = 0
P (Ac ) = 1 P (A)
P (A B) = P (A) + P (B) P (A B) which comes directly
from
|A B| = |A| + |B| |A B|.
A B P (A B) = P (A) + P (B)
Suppose
= {1 , . . . , n } and P ({i }) = n1 . Then P (A) =
P
P ({}).
A

The probability of an event is equal to the sum of the


probabilities of its component outcomes because outcomes
are mutually exclusive
4

Because we will not rely on Definition 3.1 beyond this section, we will not worry about
how to prove these properties. In Section 5, we will prove the same properties in a more
general setting.

18

Example 3.4 (Slides). When rolling two dice, there are 36 (equiprobable) possibilities.
P [sum of the two dice = 5] = 4/36.
Though one of the finest minds of his age, Leibniz was not
immune to blunders: he thought it just as easy to throw 12 with
a pair of dice as to throw 11. The truth is...
P [sum of the two dice = 11] =
P [sum of the two dice = 12] =
Definition 3.5. In the world of gambling, probabilities are often
expressed by odds. To say that the odds are n:1 against the event
A means that it is n times as likely that A does not occur than
that it occurs. In other words, P (Ac ) = nP (A) which implies
1
n
P (A) = n+1
and P (Ac ) = n+1
.
Odds here has nothing to do with even and odd numbers.
The odds also mean what you will win, in addition to getting your
stake back, should your guess prove to be right. If I bet $1 on a
horse at odds of 7:1, I get back $7 in winnings plus my $1 stake.
The bookmaker will break even in the long run if the probability
of that horse winning is 1/8 (not 1/7). Odds are even when they
are 1:1 - win $1 and get back your original $1. The corresponding
probability is 1/2.
3.6. It is important to remember that classical probability relies
on the assumption that the outcomes are equally likely.
Example 3.7. Mistake made by the famous French mathematician Jean Le Rond dAlembert (18th century) who is an author of
several works on probability:
The number of heads that turns up in those two tosses can
be 0, 1, or 2. Since there are three outcomes, the chances of each
must be 1 in 3.

19

4 Permutation of r types of n objects


Tuesday, July 10, 2012
2:53 PM

315 2013 L1 Page 1

315 2013 L1 Page 2

Sirindhorn International Institute of Technology


Thammasat University
School of Information, Computer and Communication Technology

ECS315 2013/1
4

Part I.2

Dr.Prapun

Enumeration / Combinatorics / Counting

There are many probability problems, especially those concerned


with gambling, that can ultimately be reduced to questions about
cardinalities of various sets. Combinatorics is the study of systematic counting methods, which we will be using to find the cardinalities of various sets that arise in probability.
4.1

Four Principles

4.1. Addition Principle (Rule of sum):


When there are m cases such that the ith case has ni options,
for i = 1, . . . , m, and no two of the cases have any options in
common, the total number of options is n1 + n2 + + nm .
In set-theoretic terms, suppose that a finite set S can be partitioned5 into (pairwise disjoint parts) S1 , S2 , . . . , Sm . Then,
5

|S| = |S1 | + |S2 | + + |Sm |.

The art of applying the addition principle is to partition the set S to be counted into
manageable parts; that is, parts which we can readily count. But this statement needs to
be qualified. If we partition S into too many parts, then we may have defeated ourselves.
For instance, if we partition S into parts each containing only one element, then applying the

20

In words, if you can count the number of elements in all of


the parts of a partition of S, then |S| is simply the sum of the
number of elements in all the parts.
Example 4.2. We may find the number of people living in a country by adding up the number from each province/state.
Example 4.3. [1, p 28] Suppose we wish to find the number of
different courses offered by SIIT. We partition the courses according to the department in which they are listed. Provided there is
no cross-listing (cross-listing occurs when the same course is listed
by more than one department), the number of courses offered by
SIIT equals the sum of the number of courses offered by each department.
Example 4.4. [1, p 28] A student wishes to take either a mathematics course or a biology course, but not both. If there are four
mathematics courses and three biology courses for which the student has the necessary prerequisites, then the student can choose
a course to take in 4 + 3 = 7 ways.
Example 4.5. Let A, B, and C be finite sets. How many triples
are there of the form (a,b,c), where a A, b B, c C?

4.6. Tree diagrams: When a set can be constructed in several


steps or stages, we can represent each of the n1 ways of completing
the first step as a branch of a tree. Each of the ways of completing
the second step can be represented as n2 branches starting from
addition principle is the same as counting the number of parts, and this is basically the same
as listing all the objects of S. Thus, a more appropriate description is that the art of applying
the addition principle is to partition the set S into not too many manageable parts.[1, p 28]

21

the ends of the original branches, and so forth. The size of the set
then equals the number of branches in the last level of the tree,
and this quantity equals
n1 n2
4.7. Multiplication Principle (Rule of product):
When a procedure/operation can be broken down into m
steps,
such that there are n1 options for step 1,
and such that after the completion of step i 1 (i = 2, . . . , m)
there are ni options for step i (for each way of completing step
i 1),
the number of ways of performing the procedure is n1 n2 nm .
In set-theoretic terms, if sets S1 , . . . , Sm are finite, then |S1
S2 Sm | = |S1 | |S2 | |Sm |.
For k finite sets A1 , ..., Ak , there are |A1 | |Ak | k-tuples
of the form (a1 , . . . , ak ) where each ai Ai .
Example 4.8. Suppose that a deli offers three kinds of bread,
three kinds of cheese, four kinds of meat, and two kinds of mustard.
How many different meat and cheese sandwiches can you make?
First choose the bread. For each choice of bread, you then
have three choices of cheese, which gives a total of 3 3 = 9
bread/cheese combinations (rye/swiss, rye/provolone, rye/cheddar, wheat/swiss, wheat/provolone ... you get the idea). Then
choose among the four kinds of meat, and finally between the
two types of mustard or no mustard at all. You get a total of
3 3 4 3 = 108 different sandwiches.
Suppose that you also have the choice of adding lettuce, tomato,
or onion in any combination you want. This choice gives another
2 x 2 x 2 = 8 combinations (you have the choice yes or no
three times) to combine with the previous 108, so the total is now
108 8 = 864.
That was the multiplication principle. In each step you have
several choices, and to get the total number of combinations, multiply. It is fascinating how quickly the number of combinations
22

grow. Just add one more type of bread, cheese, and meat, respectively, and the number of sandwiches becomes 1,920. It would take
years to try them all for lunch. [18, p 33]
Example 4.9 (Slides). In 1961, Raymond Queneau, a French poet
and novelist, wrote a book called One Hundred Thousand Billion
Poems. The book has ten pages, and each page contains a sonnet,
which has 14 lines. There are cuts between the lines so that each
line can be turned separately, and because all lines have the same
rhyme scheme and rhyme sounds, any such combination gives a
readable sonnet. The number of sonnets that can be obtained in
this way is thus 1014 which is indeed a hundred thousand billion.
Somebody has calculated that it would take about 200 million
years of nonstop reading to get through them all. [18, p 34]
Example 4.10. There are 2n binary strings/sequences of length
n.

Example 4.11. For a finite set A, the cardinality of its power set
2A is
|2A | = 2|A| .

Example 4.12. (Slides) Jack is so busy that hes always throwing


his socks into his top drawer without pairing them. One morning
Jack oversleeps. In his haste to get ready for school, (and still a
bit sleepy), he reaches into his drawer and pulls out 2 socks. Jack
knows that 4 blue socks, 3 green socks, and 2 tan socks are in his
drawer.
(a) What are Jacks chances that he pulls out 2 blue socks to
match his blue slacks?

23

(b) What are the chances that he pulls out a pair of matching
socks?

Example 4.13. [1, p 2930] Determine the number of positive


integers that are factors of the number
34 52 117 138 .
The numbers 3,5,11, and 13 are prime numbers. By the fundamental theorem of arithmetic, each factor is of the form
3i 5j 11k 13` ,
where 0 i 4, 0 j 2, 0 k 7, and 0 ` 8. There are
five choices for i, three for j, eight for k, and nine for `. By the
multiplication principle, the number of factors is
5 3 8 9 = 1080.
4.14. Subtraction Principle: Let A be a set and let S be a
larger set containing A. Then
|A| = |S| |S \ A|
When S is the same as , we have |A| = |S| |Ac |
Using the subtraction principle makes sense only if it is easier
to count the number of objects in S and in S \ A than to
count the number of objects in A.
Example 4.15. Chevalier de Meres Scandal of Arithmetic:
Which is more likely, obtaining at least one six in 4 tosses
of a fair dice (event A), or obtaining at least one double
six in 24 tosses of a pair of dice (event B)?

24

We have

64 54
P (A) =
=1
64

 4
5
.518
6

and

 24
3624 3524
35
P (B) =
=1
.491.
24
36
36
Therefore, the first case is more probable.
Remark 1: Probability theory was originally inspired by gambling problems. In 1654, Chevalier de Mere invented a gambling
system which bet even money6 on event B above. However, when
he began losing money, he asked his mathematician friend Pascal to analyze his gambling system. Pascal discovered that the
Chevaliers system would lose about 51 percent of the time. Pascal became so interested in probability and together with another
famous mathematician, Pierre de Fermat, they laid the foundation
of probability theory. [U-X-L Encyclopedia of Science]
Remark 2: de Mere originally claimed to have discovered a
contradiction in arithmetic. De Mere correctly knew that it was
advantageous to wager on occurrence of event A, but his experience
as gambler taught him that it was not advantageous to wager on
occurrence of event B. He calculated P (A) = 1/6 + 1/6 + 1/6 +
1/6 = 4/6 and similarly P (B) = 24 1/36 = 24/36 which is
the same as P (A). He mistakenly claimed that this evidenced a
contradiction to the arithmetic law of proportions, which says that
24
4
6 should be the same as 36 . Of course we know that he could not
simply add up the probabilities from each tosses. (By De Meres
logic, the probability of at least one head in two tosses of a fair
coin would be 2 0.5 = 1, which we know cannot be true). [22, p
3]
4.16. Division Principle (Rule of quotient): When a finite
set S is partitioned into equal-sized parts of m elements each, there
are |S|
m parts.
6
Even money describes a wagering proposition in which if the bettor loses a bet, he or she
stands to lose the same amount of money that the winner of the bet would win.

25

4.2

Four Kinds of Counting Problems

4.17. Choosing objects from a collection is called sampling, and


the chosen objects are known as a sample. The four kinds of
counting problems are [9, p 34]:
(a) Ordered sampling of r out of n items with replacement: nr ;
(b) Ordered sampling of r n out of n items without replacement: (n)r ;
(c) Unordered
 sampling of r n out of n items without replacen
ment: r ;
(d) Unordered
sampling of r out of n items with replacement:

n+r1
.
r
See 4.33 for bars and stars argument.
Many counting problems can be simplified/solved by realizing
that they are equivalent to one of these counting problems.
4.18. Ordered Sampling: Given a set of n distinct items/objects,
select a distinct ordered7 sequence (word) of length r drawn from
this set.
(a) Ordered sampling with replacement: n,r = nr
Ordered sampling of r out of n items with replacement.
The with replacement part means an object can be
chosen repeatedly.
Example: From a deck of n cards, we draw r cards with
replacement; i.e., we draw a card, make a note of it, put
the card back in the deck and re-shuffle the deck before
choosing the next card. How many different sequences of
r cards can be drawn in this way? [9, Ex. 1.30]

Different sequences are distinguished by the order in which we choose objects.

26

(b) Ordered sampling without replacement:


(n)r =

r1
Y
i=0

(n i) =

n!
(n r)!

= n (n 1) (n (r 1));
{z
}
|
r terms

rn

Ordered sampling of r n out of n items without replacement.


The without replacement means once we choose an
object, we remove that object from the collection and we
cannot choose it again.
In Excel, use PERMUT(n,r).

Sometimes referred to as the number of possible r-permutations


of n distinguishable objects
Example: The number of sequences8 of size r drawn from
an alphabet of size n without replacement.
(3)2 = 3 2 = 6 is the number of sequences of size 2
drawn from an alphabet of size 3 without replacement.
Suppose the alphabet set is {A, B, C}. We can list all
sequences of size 2 drawn from {A, B, C} without replacement:
AB
AC
BA
BC
CA
CB
Example: From a deck of 52 cards, we draw a hand of 5
cards without replacement (drawn cards are not placed
back in the deck). How many hands can be drawn in this
way?
8

Elements in a sequence are ordered.

27

For integers r, n such that r > n, we have (n)r = 0.

We define (n)0 = 1. (This makes sense because we usually


take the empty product to be 1.)
(n)1 = n

(n)r = (n(r1))(n)r1 . For example, (7)5 = (74)(7)4 .



1, if r = 1
(1)r =
0, if r > 1
Extended definition: The definition in product form
(n)r =

r1
Y

(n i) = n (n 1) (n (r 1))
|
{z
}
i=0
r terms

can be extended to any real number n and a non-negative


integer r.
Example 4.19. (Slides) The Seven Card Hustle: Take five red
cards and two black cards from a pack. Ask your friend to shuffle
them and then, without looking at the faces, lay them out in a row.
Bet that them cant turn over three red cards. The probability that
they CAN do it is

Definition 4.20. For any integer n greater than 1, the symbol n!,
pronounced n factorial, is defined as the product of all positive
integers less than or equal to n.

(a) 0! = 1! = 1
(b) n! = n(n 1)!
(c) n! =

et tn dt

(d) Computation:

28

(i) MATLAB: Use factorial(n). Since double precision numbers only have about 15 digits, the answer is only accurate
for n 21. For larger n, the answer will have the right
magnitude, and is accurate for the first 15 digits.
(ii) Googles web search box built-in calculator: Use n!
(e) Approximation: Stirlings Formula [5, p. 52]:



n
1
n n
n! 2nn e =
2e e(n+ 2 ) ln( e ) .

(2)

In some references, the sign is replaced by to emphasize


that the ratio of the two sides converges to unity as n .
4.21. Factorial and Permutation: The number of arrangements (permutations) of n 0 distinct items is (n)n = n!.
Meaning: The number of ways that n distinct objects can be
ordered.
A special case of ordered sampling without replacement
where r = n.
In MATLAB, use perms(v), where v is a row vector of length
n, to creates a matrix whose rows consist of all possible permutations of the n elements of v. (So the matrix will contain
n! rows and n columns.)
Example 4.22. In MATLAB, perms([3 4 7]) gives
7
7
4
4
3
3

4
3
7
3
4
7

3
4
3
7
7
4

29

2) 1 P Ai P ( Ai ) ( n 1) n
n

i =1 i =1

1
0.37
e

[Szekely86, p 14].

1
1

Similarly, perms(abcd) gives


dcba dcab dbca
dbac dabc dacb
x
1 0.5

1 cbda cbad cabd cadb


cdba cdab
x
bcda bcad bdca
bdac badc bacd
x
acbd (acdb
abdc adbc adcb
xabcd
1
x 1) x
0
1

Example 4.23.e (Slides) Finger-Smudge on Touch-Screen Devices

a)

4
8
10birthday: ProbExample 4.24. (Slides)2 Probability
of6 coincidence
1
x
10
ability that there is at
least two people
who have
the same birth1
day9 in a group of r persons:

P ( A1 A2 ) P ( A1 ) P ( A2 )

Random Variable
3) Let i.i.d. X 1 ,, X r be uniformly distributed on a finite set {a1 ,, an } . Then, the
r 1

probability that X 1 ,, X r are all distinct is pu ( n, r ) = 1 1 1 e


n
i =1
a) Special case: the birthday paradox in (1).

r ( r 1)
2n

pu(n,r) for n = 365


1

pu ( n, r )

0.9

0.6

0.8
0.7

1 e

0.6

p = 0.9
p = 0.7
p = 0.5

0.5

r ( r 1)
2n

0.4

r
n

0.5
0.4

r = 23

0.3

n = 365

r = 23
n = 365

p = 0.3
p = 0.1

0.3

0.2

0.2

0.1
0.1
0

10

15

20

25

30

35

40

45

50

55

50

100

150

200

250

300

350

b) The approximation comes from 1 + x e x .

Figure 1: pu (n, r): The probability of the event that at least one element appears

c) From
approximation,
to size
haver pwith
r ) = p , we need
u ( n, replacement
twice the
in random
sample of
is taken from a population
of n elements.

Example 4.25. It is surprising to see how quickly the probability


in Example 4.24 approaches 1 as r grows larger.
9

We ignore February 29 which only comes in leap years.

30

Birthday Paradox : In a group of 23 randomly selected people, the probability that at least two will share a birthday (assuming birthdays are equally likely to occur on any given day of the
year10 ) is about 0.5.
At first glance it is surprising that the probability of 2 people
having the same birthday is so large11 , since there are only 23
people compared with 365 days on the calendar. Some
 of the
23
surprise disappears if you realize that there are 2 = 253
pairs of people who are going to compare their birthdays. [3,
p. 9]
Example 4.26. Another variant of the birthday coincidence paradox: The group size must be at least 253 people if you want a
probability > 0.5 that someone will have the same birthday
as

364 r
you. [3, Ex. 1.13] (The probability is given by 1 365 .)

A naive (but incorrect) guess is that d365/2e = 183 people


will be enough. The problem is that many people in the
group will have the same birthday, so the number of different
birthdays is smaller than the size of the group.
On late-night televisions The Tonight Show with Johnny Carson, Carson was discussing the birthday problem in one of his
famous monologues. At a certain point, he remarked to his
audience of approximately 100 people: Great! There must
10

In reality, birthdays are not uniformly distributed. In which case, the probability of a
match only becomes larger for any deviation from the uniform distribution. This result can
be mathematically proved. Intuitively, you might better understand the result by thinking of
a group of people coming from a planet on which people are always born on the same day.
11
In other words, it was surprising that the size needed to have 2 people with the same
birthday was so small.

31

be someone here who was born on my birthday! He was off


by a long shot. Carson had confused two distinctly different
probability problems: (1) the probability of one person out of
a group of 100 people having the same birth date as Carson
himself, and (2) the probability of any two or more people out
of a group of 101 people having birthdays on the same day.
[22, p 76]
4.27. Now, lets revisit ordered sampling of r out of n different
items without replacement. This is also called the number of possible r-permutations of n different items. One way to look at the
sampling is to first consider the n! permutations of the n items.
Now, use only the first r positions. Because we do not care about
the last nr positions, we will group the permutations by the first
r positions. The size of each group will be the number of possible
permutations of the n r items that has not already been used in
the first r positions. So, each group will contain (n r)! members.
By the division principle, the number of groups is n!/(n r)!.
4.28. The number of permutations of n = n1 + n2 + + nr
objects of which n1 are of one type, n2 are of one type, n2 are of
second type, . . . , and nr are of an rth type is

n!
.
n1 !n2 ! nr !
Example 4.29. The number of permutations of AABC

Example 4.30. Bar Codes: A part is labeled by printing with


four thick lines, three medium lines, and two thin lines. If each
ordering of the nine lines represents a different label, how many
different labels can be generated by using this scheme?

4.31. Binomial coefficient:


 
n
(n)r
n!
=
=
r
r!
(n r)!r!
32

(a) Read n choose r.


(b) Meaning:
(i) Unordered sampling of r n out of n distinct items
without replacement

(ii) The number of subsets of size r that can be formed from


a set of n elements (without regard to the order of selection).
(iii) The number of combinations of n objects selected r at a
time.
(iv) the number of r-combinations of n objects.
(v) The number of (unordered) sets of size r drawn from an
alphabet of size n without replacement.
(c) Computation:
(i) MATLAB:
nchoosek(n,r), where n and r are nonnegative integers, returns nr .
nchoosek(v,r), where v is a row vector of length n,
creates a matrix whose rows consist of all possible
combinations of the n elements
 of v taken r at a time.
n
The matrix will contains r rows and r columns.
Example: nchoosek(abcd,2) gives
ab
ac
ad
33

bc
bd
cd
(ii) Excel: combin(n,r)
(iii) Mathcad: combin(n,r)

(iv) Maple: nr
(d)
(e)
(f)
(g)
(h)

(v) Googles web search box built-in calculator: n choose r




n
Reflection property: nr = nr
.


n
n
=
n
0 = 1.


n
n
=
1
n1 = n.

n
r = 0 if n < r or r is a negative integer.


n
max nr = b n+1
.
2 c
r

Example 4.32. In bridge, 52 cards are dealt to four players;


hence, each player has 13 cards. The order in which the cards
are dealt is not important, just the final 13 cards each player ends
up with. How many different bridge games can be dealt? (Answer:
53,644,737,765,488,792,839,237,440,000)

4.33. The bars and stars argument:


Example: Find all nonnegative integers x1 , x2 , x3 such that
x1 + x2 + x3 = 3.

34

0
0
0
0
1
1
1
2
2
3

+
+
+
+
+
+
+
+
+
+

0
1
2
3
0
1
2
0
1
0

+
+
+
+
+
+
+
+
+
+

3
2
1
0
2
1
0
1
0
0

1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1



n+r1
There are n+r1
=
distinct vector x = xn1 of nonr
n1
negative integers such that x1 + x2 + + xn = r. We use
n 1 bars to separate r 1s.
(a) Suppose we further require that
 the xi are strictly positive
r1
(xi 1), then there are n1 solutions.

(b) Extra Lower-bound Requirement: Suppose we further require that xi ai where the ai are some given
nonnegative integers, then the number of solution is


r (a1 + a2 + + an ) + n 1
.
n1
Note that here we P
work with equivalent problem: y1 +
y2 + + yn = r ni=1 ai where yi 0.
Consider the distribution of r = 10 indistinguishable balls
into n = 5 distinguishable cells. Then, we only concern with
the number of balls in each cell. Using n 1 = 4 bars, we
can divide r = 10 stars into n = 5 groups. For example,
****|***||**|*
would mean (4,3,0,2,1). In general, there are

n+r1
ways of arranging the bars and stars.
r
4.34. Unordered sampling with replacement: There are
n items. We sample r out of these n items with replacement.
Because the order in the sequences is not important in this kind
of sampling, two samples are distinguished by the number of each
item in the sequence. In particular, Suppose r letters are drawn
35

with replacement from a set {a1 , a2 , . . . , an }. Let xi be the number


of ai in the drawn sequence. Because we sample r times, we know
that, for every sample, x1 + x2 + xn = r where
the xi are non
n+r1
negative integers. Hence, there are
possible unordered
r
samples with replacement.
4.3

Binomial Theorem and Multinomial Theorem



4.35. Binomial theorem: Sometimes, the number nr is called
a binomial coefficient because it appears as the coefficient of
xr y nr in the expansion of the binomial (x+y)n . More specifically,
for any positive integer n, we have,
n  
X
n r nr
(x + y)n =
xy
(3)
r
r=0
(Slide) To see how we get (3), lets consider a smaller case of
n = 3. The expansion of (x + y)3 can be found using combinatorial
reasoning instead of multiplying the three terms out. When (x +
y)3 = (x+y)(x+y)(x+y) is expanded, all products of a term in the
first sum, a term in the second sum, and a term in the third sum
are added. Terms of the form x3 , x2 y, xy 2 , and y 3 arise. To obtain
a term of the form x3 , an x must be chosen in each of the sums,
and this can be done in only one way. Thus, the x3 term in the
product has a coefficient of 1. To obtain a term of the form x2 y,
an x must be chosen in two of the three sums (and consequently
a y in the other sum). Hence. the number of such terms
is the

3
number of 2-combinations of three objects, namely, 2 . Similarly,
the number of terms of the form xy 2 is the number of ways to pick
one of the three sums to obtain an x (and consequently take
 ay
3
from each of the other two terms). This can be done in 1 ways.
Finally, the only way to obtain a y 3 term is to choose the y for
each of the three sums in the product, and this can be done in
exactly one way. Consequently. it follows that
(x + y)3 = x3 + 3x2 y + 3xy 2 + y 3 .
Now, lets state a combinatorial proof of the binomial theorem
(3). The terms in the product when it is expanded are of the form
36

xr y nr for r = 0, 1, 2, . . . , n. To count the number of terms of the


form xr y nr , note that to obtain such a term it is necessary to
choose r xs from the n sums (so that the other n r terms
 in the
n
r nr
product are ys). Therefore. the coefficient of x y
is r .
From (3), if we let x = y = 1, then we get another important
identity:
n  
X
n
= 2n .
(4)
r
r=0

4.36. Multinomial Counting : The multinomial coefficient




n
n1 n2 nr

is defined as
i1
P 
r 
Y
n
nk
k=0

i=1

ni

  
 

 
n
n n1
n n1 n2
nr
=

n1
n2
n3
nr
n!
= Q
.
r
ni !
i=1

We have seen this before in (4.28). It is the number of ways that


r
P
we can arrange n =
ni tokens when having r types of symbols
i=1

and ni indistinguishable copies/tokens of a type i symbol.


4.37. Multinomial Theorem:
X
n!
n
(x1 + . . . + xr ) =
xi11 xi22 xirr ,
i1 !i2 ! ir !
where the sum ranges over all ordered r-tuples of integers i1 , . . . , ir
satisfying the following conditions:
i1 0, . . . , ir 0,

i1 + i2 + + ir = n.

When r = 2 this reduces to the binomial theorem.


37

Sirindhorn International Institute of Technology


Thammasat University
School of Information, Computer and Communication Technology

ECS315 2013/1
5

Part II

Dr.Prapun

Probability Foundations

Constructing the mathematical foundations of probability theory


has proven to be a long-lasting process of trial and error. The
approach consisting of defining probabilities as relative frequencies
in cases of repeatable experiments leads to an unsatisfactory theory.
The frequency view of probability has a long history that goes
back to Aristotle. It was not until 1933 that the great Russian
mathematician A. N. Kolmogorov (1903-1987) laid a satisfactory
mathematical foundation of probability theory. He did this by
taking a number of axioms as his starting point, as had been done
in other fields of mathematics. [22, p 223]
We will try to avoid several technical details14 15 in this class.
Therefore, the definition given below is not the complete definition. Some parts are modified or omitted to make the definition
easier to understand.

14

To study formal definition of probability, we start with the probability space (, A, P ).


Let be an arbitrary space or set of points . Recall (from Definition 1.15) that, viewed
probabilistically, a subset of is an event and an element of is a sample point. Each
event is a collection of outcomes which are elements of the sample space .
The theory of probability focuses on collections of events, called event -algebras, typically denoted by A (or F), that contain all the events of interest (regarding the random
experiment E) to us, and are such that we have knowledge of their likelihood of occurrence.
The probability P itself is defined as a number in the range [0, 1] associated with each event
in A.
15
The class 2 of all subsets can be too large for us to define probability measures with
consistency, across all member of the class. (There is no problem when is countable.)

44

Definition 5.1. Kolmogorovs Axioms for Probability [12]:


A probability measure16 is a real-valued set function17 that satisfies
P1 Nonnegativity :
P (A) 0.
P2 Unit normalization:
P () = 1.
P3 Countable additivity or -additivity : For every countable
sequence (An )
n=1 of disjoint events,
!

X
[
P (An ).
P
An =
n=1

n=1

The number P (A) is called the probability of the event A


The entire sample space is called the sure event or the
certain event.
If an event A satisfies P (A) = 1, we say that A is an almostsure event.
A support of P is any set A for which P (A) = 1.
From the three axioms18 , we can derive many more properties
of probability measure. These properties are useful for calculating
probabilities.

16

Technically, probability measure is defined on a -algebra A of . The triple (, A, P ) is


called a probability measure space, or simply a probability space
17
A real-valued set function is a function the maps sets to real numbers.
18
Remark: The axioms do not determine probabilities; the probabilities are assigned based
on our knowledge of the system under study. (For example, one approach is to base probability
assignments on the simple concept of equally likely outcomes.) The axioms enable us to easily
calculate the probabilities of some events from knowledge of the probabilities of other events.

45

5.2. P () = 0.

5.3. Finite additivity19 : If A1 , . . . , An are disjoint events, then


!
n
n
[
X
P
Ai =
P (Ai ).
i=1

i=1

Special case when n = 2: Addition rule (Additivity)


If A B = , then P (A B) = P (A) + P (B).

19

(5)

It is not possible to go backwards and use finite additivity to derive countable additivity
(P3).

46

5.4. The probability of a finite or countable event equals the sum


of the probabilities of the outcomes in the event.
(a) In particular, if A is countable, e.g. A = {a1 , a2 , . . .}, then
P (A) =

X
n=1

P ({an }).



(b) Similarly, if A is finite, e.g. A = a1 , a2 , . . . , a|A| , then
P (A) =

|A|
X
n=1

P ({an }).

This greatly simplifies20 construction of probability measure.

Remark: Note again that the set A under consideration here


is finite or countably infinite. You can not apply the properties
above to uncountable set.21
20
Recall that a probability measure P is a set function that assigns number (probability) to
all set (event) in A. When is countable (finite or countably infinite), we may let A = 2 =
the power set of the sample space. In other words, in this situation, it is possible to assign
probability value to all subsets of .
To define P , it seems that we need to specify a large number of values. Recall that to
define a function g(x) you usually specify (in words or as a formula) the value of g(x) at all
possible x in the domain of g. The same task must be done here because we have a function
that maps sets in A to real numbers (or, more specifically, the interval [0, 1]). It seems that
we will need to explicitly specify P (A) for each set A in A. Fortunately, 5.4 implies that we
only need to define P for all the singletons (when is countable).
21
In Section 11, we will start talking about (absolutely) continuous random variables. In
such setting, we have P ({}) = 0 for any . However, it is possible to have an uncountable
set A with P (A) > 0. This does not contradict the properties that we discussed in 5.4. If A
is finite or countably infinite, we can still write
X
X
P (A) =
P ({}) =
0 = 0.
A

For event A that is uncountable, the properties in 5.4 are not enough to evaluate P (A).

47

Example 5.5. A random experiment can result in one of the outcomes {a, b, c, d} with probabilities 0.1, 0.3, 0.5, and 0.1, respectively. Let A denote the event {a, b}, B the event {b, c, d}, and C
the event {d}.

P (A) =
P (B) =
P (C) =
P (Ac ) =
P (A B) =
P (A C) =
5.6. Monotonicity : If A B, then P (A) P (B)

Example 5.7. Let A be the event to roll a 6 and B the event


to roll an even number. Whenever A occurs, B must also occur.
However, B can occur without A occurring if you roll 2 or 4.
5.8. If A B, then P (B \ A) = P (B) P (A)

5.9. P (A) [0, 1].


5.10. P (A B) can not exceed P (A) and P (B). In other words,
the composition of two events is always less probable than (or at
most equally probable to) each individual event.

48

Example 5.11 (Slides). Experiments by psychologists Kahneman


and Tversky.
Example 5.12. Let us consider Mrs. Boudreaux and Mrs. Thibodeaux who are chatting over their fence when the new neighbor
walks by. He is a man in his sixties with shabby clothes and a
distinct smell of cheap whiskey. Mrs.B, who has seen him before,
tells Mrs. T that he is a former Louisiana state senator. Mrs. T
finds this very hard to believe. Yes, says Mrs.B, he is a former
state senator who got into a scandal long ago, had to resign, and
started drinking. Oh, says Mrs. T, that sounds more likely.
No, says Mrs. B, I think you mean less likely.
Strictly speaking, Mrs. B is right. Consider the following two
statements about the shabby man: He is a former state senator
and He is a former state senator who got into a scandal long ago,
had to resign, and started drinking. It is tempting to think that
the second is more likely because it gives a more exhaustive explanation of the situation at hand. However, this reason is precisely
why it is a less likely statement. Note that whenever somebody
satisfies the second description, he must also satisfy the first but
not vice versa. Thus, the second statement has a lower probability
(from Mrs. Ts subjective point of view; Mrs. B of course knows
who the man is).
This example is a variant of examples presented in the book
Judgment under Uncertainty [11] by Economics Nobel laureate
Daniel Kahneman and co-authors Paul Slovic and Amos Tversky.
They show empirically how people often make similar mistakes
when they are asked to choose the most probable among a set of
statements. It certainly helps to know the rules of probability. A
more discomforting aspect is that the more you explain something
in detail, the more likely you are to be wrong. If you want to be
credible, be vague. [18, p 1112]

49

5.13. Complement Rule:


P (Ac ) = 1 P (A) .
The probability that something does not occur can be computed as one minus the probability that it does occur.
Named probabilitys Trick Number One in [10]
5.14. Probability of a union (not necessarily disjoint):
P (A B) = P (A) + P (B) P (A B)

P (A B) P (A) + P (B).
Approximation: If P (A)  P (B) then we may approximate
P (A B) by P (A).
Example 5.15 (Slides). Combining error probabilities from various sources in DNA testing
Example 5.16. In his bestseller Innumeracy, John Allen Paulos
tells the story of how he once heard a local weatherman claim that
there was a 50% chance of rain on Saturday and a 50% chance of
rain on Sunday and thus a 100% chance of rain during the weekend.
Clearly absurd, but what is the error?
Answer: Faulty use of the addition rule (5)!
If we let A denote the event that it rains on Saturday and B
the event that it rains on Sunday, in order to use P (A B) =
P (A) + P (B), we must first confirm that A and B cannot occur at
50

the same time (P (A B) = 0). More generally, the formula that is


always holds regardless of whether P (A B) = 0 is given by 5.14:
P (A B) = P (A) + P (B) P (A B).
The event A B describes the case in which it rains both days.
To get the probability of rain over the weekend, we now add 50%
and 50%, which gives 100%, but we must then subtract the probability that it rains both days. Whatever this is, it is certainly
more than 0 so we end up with something less than 100%, just like
common sense tells us that we should.
You may wonder what the weatherman would have said if the
chances of rain had been 75% each day. [18, p 12]
5.17. Probability of a union of three events:
P (A B C) = P (A) + P (B) + P (C)
P (A B) P (A C) P (B C)
+ P (A B C)
5.18. Two bounds:
(a) Subadditivity or Booles Inequality: If A1 , . . . , An are
events, not necessarily disjoint, then
!
n
n
X
[
P (Ai ).
P
Ai
i=1

i=1

(b) -subadditivity or countable subadditivity: If A1 , A2 ,


. . . is a sequence of measurable sets, not necessarily disjoint,
then
!

[
X
P
Ai
P (Ai )
i=1

i=1

This formula is known as the union bound in engineering.

51

5.19. If a (finite) collection {B1 , B2 , . . . , Bn } is a partition of ,


then
n
X
P (A) =
P (A Bi )
i=1

Similarly, if a (countable) collection {B1 , B2 , . . .} is a partition


of , then

X
P (A) =
P (A Bi )
i=1

5.20. Connection to classical probability theory: Consider an


experiment with finite sample space = {1 , 2 , . . . , n } in which
each outcome i is equally likely. Note that n = ||.

We must have

1
, i.
n
event A, we can apple 5.4 to get

P ({i }) =
Now, given any event finite22
P (A) =

P ({}) =

X1
A

|A| |A|
=
.
n
||

We can then say that the probability theory we are working on


right now is an extension of the classical probability theory. When
the conditons/assumptions of classical probability theory are met,
then we get back the defining definition of classical classical probability. The extended part gives us ways to deal with situation
where assumptions of classical probability theory are not satisfied.

22

In classical probability, the sample space is finite; therefore, any event is also finite.

52

Event-based Independence and Conditional


Probability

Example

Example
6.1. Roll a dice. . .
Roll a fair dice

Sneak peek:
Figure 3: Conditional Probability Example: Sneak Peek

Example 6.2 (Slides). Diagnostic Tests.


3

6.1

Event-based Conditional Probability

Definition 6.3. Conditional Probability : The conditional probability P (A|B) of event A, given that event B 6= occurred, is
given by
P (A B)
.
(6)
P (A|B) =
P (B)
Some ways to say23 or express the conditional probability,
P (A|B), are:
the probability of A, given B

the probability of A, knowing B

the probability of A happening, knowing B has already


occurred

23

Note also that although the symbol P (A|B) itself is practical, it phrasing in words can be
so unwieldy that in practice, less formal descriptions are used. For example, we refer to the
probability that a tested-positive person has the disease instead of saying the conditional
probability that a randomly chosen person has the disease given that the test for this person
returns positive result.

53

Defined only when P (B) > 0.


If P (B) = 0, then it is illogical to speak of P (A|B); that
is P (A|B) is not defined.
6.4. Interpretation: Sometimes, we refer to P (A) as
a priori probability , or
the prior probability of A, or
the unconditional probability of A.
It is sometimes useful to interpret P (A) as our knowledge of
the occurrence of event A before the experiment takes place. Conditional probability P (A|B) is the updated probability of the
event A given that we now know that B occurred (but we still do
not know which particular outcome in the set B occurred).
Example 6.5. Back to Example 6.1. Roll a fair dice.

Sneak peek:

Figure 4: Sneak Peek: A Revisit

Example 6.6. In diagnostic tests Example 6.2, we learn whether


we have the disease from test result. Originally, before taking the
test, the probability of having the disease is 0.01%. Being tested
positive from the 99%-accurate test updates the probability of
having the disease to about 1%.
More specifically, let D be the event that the testee has the
disease and TP be the event that the test returns positive result.
Before taking the test, the probability of having the disease
is P (D) = 0.01%.
Using a 99%-accurate test means

    P(TP|D) = 0.99 and P(TP^c|D^c) = 0.99.

Our calculation shows that P(D|TP) ≈ 0.01.
6.7. Prelude to the concept of independence:
If the occurrence of B does not give you more information about
A, then
P (A|B) = P (A)
(7)
and we say that A and B are independent.
Meaning: learning that event B has occurred does not change
the probability that event A occurs.
We will soon define independence in Section 6.2. Property
(7) can be regarded as a practical definition for independence.
However, there are some technical issues24 that we need to deal
with when we actually define independence.

24

Here, the statement assumes P(B) > 0 because it considers P(A|B). The concept of
independence to be defined in Section 6.2 will not rely directly on conditional probability and
therefore it will include the case where P(B) = 0.


6.8. Similar properties to the three probability axioms:


(a) Nonnegativity: P(A|B) ≥ 0

(b) Unit normalization: P(Ω|B) = 1.
    In fact, for any event A such that B ⊂ A, we have P(A|B) = 1.
    This implies

        P(Ω|B) = P(B|B) = 1.

(c) Countable additivity: For every countable sequence (An)_{n=1}^{∞}
    of disjoint events,

        P(⋃_{n=1}^{∞} An | B) = ∑_{n=1}^{∞} P(An|B).

    In particular, if A1 and A2 are disjoint,

        P(A1 ∪ A2 | B) = P(A1|B) + P(A2|B).
6.9. More Properties:

    P(A|Ω) = P(A)

    P(A^c|B) = 1 − P(A|B)

    P(A ∩ B|B) = P(A|B)

    P(A1 ∪ A2|B) = P(A1|B) + P(A2|B) − P(A1 ∩ A2|B)

    P(A ∩ B) ≤ P(A|B)

6.10. When Ω is finite and all outcomes have equal probabilities,

    P(A|B) = P(A ∩ B)/P(B) = (|A ∩ B|/|Ω|) / (|B|/|Ω|) = |A ∩ B|/|B|.

This formula can be regarded as the classical version of conditional
probability.
Example 6.11. Someone has rolled a fair dice twice. You know
that one of the rolls turned up a face value of six. The probability
that the other roll turned up a six as well is 1/11 (not 1/6). [22,
Example 8.1, p. 244]
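A quick way to double-check the 1/11 answer is a short Monte Carlo simulation in MATLAB (a sketch; the sample size below is an arbitrary choice):

    % Simulate many pairs of fair-dice rolls and condition on "at least one six"
    M = 1e6;                              % number of simulated experiments (arbitrary)
    rolls = randi(6, M, 2);               % each row is one pair of rolls
    atLeastOneSix = any(rolls == 6, 2);
    bothSix       = all(rolls == 6, 2);
    sum(bothSix)/sum(atLeastOneSix)       % close to 1/11 = 0.0909, not 1/6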
6.12. Probability of compound events
(a) P(A ∩ B) = P(A) P(B|A)

(b) P(A ∩ B ∩ C) = P(A ∩ B) P(C|A ∩ B)

(c) P(A ∩ B ∩ C) = P(A) P(B|A) P(C|A ∩ B)

When we have many sets intersected in the conditioned part, we
often use "," instead of "∩".
Example 6.13. Most people reason as follows to find the probability of getting two aces when two cards are selected at random
from an ordinary deck of cards:
(a) The probability of getting an ace on the first card is 4/52.
(b) Given that one ace is gone from the deck, the probability of
getting an ace on the second card is 3/51.
(c) The desired probability is therefore

    (4/52) × (3/51).

[22, p 243]
Question: What about the unconditional probability P (B)?
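One way to explore this numerically (a sketch, using the convention that cards 1 to 4 stand for the aces) is to check both the product above and the unconditional probability that the second card dealt is an ace:

    (4/52)*(3/51)                         % P(two aces) ~ 0.0045
    M = 1e5; secondIsAce = false(M,1);    % M is an arbitrary sample size
    for k = 1:M
        deck = randperm(52);              % a shuffled deck
        secondIsAce(k) = deck(2) <= 4;    % cards 1..4 play the role of the aces
    end
    mean(secondIsAce)                     % unconditional P(B) ~ 4/52 = 1/13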


Example 6.14. You know that roughly 5% of all used cars have
been flood-damaged and estimate that 80% of such cars will later
develop serious engine problems, whereas only 10% of used cars
that are not flood-damaged develop the same problems. Of course,
no used car dealer worth his salt would let you know whether your
car has been flood damaged, so you must resort to probability
calculations. What is the probability that your car will later run
into trouble?
You might think about this problem in terms of proportions.

If you solved the problem in this way, congratulations. You


have just used the law of total probability.
6.15. Total Probability Theorem: If a (finite or infinitely)
countable collection of events {B1, B2, . . .} is a partition of Ω, then

    P(A) = ∑_i P(A|Bi) P(Bi).        (8)
This is a formula25 for computing the probability of an event
that can occur in different ways.
6.16. Special case:
    P(A) = P(A|B) P(B) + P(A|B^c) P(B^c).
This gives exactly the same calculation as what we discussed in
Example 6.14.
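In MATLAB, the calculation of Example 6.14 via this special case might look like the following sketch (the numbers are those stated in the example):

    P_F  = 0.05;                  % P(flood-damaged)
    P_T_given_F  = 0.80;          % P(engine trouble | flood-damaged)
    P_T_given_Fc = 0.10;          % P(engine trouble | not flood-damaged)
    P_T = P_T_given_F*P_F + P_T_given_Fc*(1 - P_F)   % = 0.135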
25 The tree diagram is useful for helping you understand the process. However, when the
number of possible cases is large (many Bi for the partition), drawing the tree diagram may
be too time-consuming and therefore you should also learn how to apply the total probability
theorem directly without the help of the tree diagram.

Example 6.17. Continue from the Diagnostic Tests Example


6.2 and Example 6.6.
    P(TP) = P(TP ∩ D) + P(TP ∩ D^c)
          = P(TP|D) P(D) + P(TP|D^c) P(D^c).

For conciseness, we define

    pD = P(D)

and

    pTE = P(TP|D^c) = P(TP^c|D).

Then,

    P(TP) = (1 − pTE) pD + pTE (1 − pD).

6.18. Bayes' Theorem:

(a) Form 1:

    P(B|A) = P(A|B) P(B)/P(A).

(b) Form 2: If a (finite or infinitely) countable collection of events
{B1, B2, . . .} is a partition of Ω, then

    P(Bk|A) = P(A|Bk) P(Bk)/P(A) = P(A|Bk) P(Bk) / ∑_i P(A|Bi) P(Bi).

Extremely useful for making inferences about phenomena that


cannot be observed directly.
Sometimes, these inferences are described as reasoning about
causes when we observe effects.

Example 6.19. Continue from the Disease Testing Examples
6.2, 6.6, and 6.17:

    P(D|TP) = P(D ∩ TP)/P(TP) = P(TP|D) P(D)/P(TP)
            = (1 − pTE) pD / ((1 − pTE) pD + pTE (1 − pD)).

Figure 5: Probability P(D|TP) that a person will have the disease given that
the test result is positive. The conditional probability is evaluated as a function of pD, which tells how common the disease is. Three values of the test error
probability pTE are shown: pTE = 1 − 0.99 = 0.01, pTE = 1 − 0.9 = 0.1, and
pTE = 1 − 0.5 = 0.5.
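The numbers behind Figure 5 can be reproduced with a few lines of MATLAB (a sketch; pD = 0.01% and pTE = 0.01 are the values from Example 6.6):

    pD  = 1e-4;                   % prior probability of having the disease
    pTE = 0.01;                   % test error probability (99%-accurate test)
    P_D_given_TP = (1-pTE)*pD / ((1-pTE)*pD + pTE*(1-pD))   % ~ 0.0098, i.e. about 1%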

Example 6.20. Medical Diagnostic: Because a new medical procedure has been shown to be effective in the early detection of an
illness, a medical screening of the population is proposed. The
probability that the test correctly identifies someone with the illness as positive is 0.99, and the probability that the test correctly
identifies someone without the illness as negative is 0.95. The incidence of the illness in the general population is 0.0001. You take
the test, and the result is positive. What is the probability that
you have the illness? [15, Ex. 2-37]
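A sketch of the corresponding calculation in MATLAB, using Bayes' theorem with the numbers stated in the example:

    pI   = 0.0001;                % incidence of the illness
    sens = 0.99;                  % P(test positive | ill)
    spec = 0.95;                  % P(test negative | not ill)
    P_ill_given_pos = sens*pI / (sens*pI + (1-spec)*(1-pI))   % ~ 0.002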


Example 6.21. Bayesian networks are used on the Web sites of


high-technology manufacturers to allow customers to quickly diagnose problems with products. An oversimplified example is presented here.
A printer manufacturer obtained the following probabilities from
a database of test results. Printer failures are associated with three
types of problems: hardware, software, and other (such as connectors), with probabilities 0.1, 0.6, and 0.3, respectively. The probability of a printer failure given a hardware problem is 0.9, given
a software problem is 0.2, and given any other type of problem is
0.5. If a customer enters the manufacturers Web site to diagnose
a printer failure, what is the most likely cause of the problem?
Let the events H, S, and O denote a hardware, software, or
other problem, respectively, and let F denote a printer failure.
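One way to organize the computation (a sketch; the vectors simply collect the probabilities stated above) is:

    prior = [0.1 0.6 0.3];        % P(H), P(S), P(O)
    lik   = [0.9 0.2 0.5];        % P(F|H), P(F|S), P(F|O)
    P_F   = sum(lik .* prior)     % total probability of a failure = 0.36
    post  = (lik .* prior) / P_F  % P(H|F)=0.25, P(S|F)~0.33, P(O|F)~0.42
                                  % so "other" is the most likely cause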

Example 6.22 (Slides). Prosecutor's Fallacy and the Murder of
Nicole Brown
Example 6.23. In the early 1990s, a leading Swedish tabloid
tried to create an uproar with the headline "Your ticket is thrown
away!". This was in reference to the popular Swedish TV show
Bingolotto where people bought lottery tickets and mailed them
to the show. The host then, in live broadcast, drew one ticket from
a large mailbag and announced a winner. Some observant reporter
noticed that the bag contained only a small fraction of the hundreds of thousands of tickets that were mailed. Thus the conclusion:


Your ticket has most likely been thrown away!
Let us solve this quickly. Just to have some numbers, let us
say that there are a total of N = 100, 000 tickets and that n =
1, 000 of them are chosen at random to be in the final drawing.
If the drawing was from all tickets, your chance to win would
be 1/N = 1/100, 000. The way it is actually done, you need to
both survive the first drawing to get your ticket into the bag and
then get your ticket drawn from the bag. The probability to get
your entry into the bag is n/N = 1, 000/100, 000. The conditional
probability to be drawn from the bag, given that your entry is in
it, is 1/n = 1/1, 000. Multiply to get 1/N = 1/100, 000 once more.
There were no riots in the streets. [18, p 22]
6.24. Chain rule of conditional probability [9, p 58]:
    P(A ∩ B|C) = P(B|C) P(A|B ∩ C).
Example 6.25. Your teacher tells the class there will be a surprise
exam next week. On one day, Monday-Friday, you will be told in
the morning that an exam is to be given on that day. You quickly
realize that the exam will not be given on Friday; if it was, it would
not be a surprise because it is the last possible day to get the
exam. Thus, Friday is ruled out, which leaves Monday-Thursday.
But then Thursday is impossible also, now having become the last
possible day to get the exam. Thursday is ruled out, but then
Wednesday becomes impossible, then Tuesday, then Monday, and
you conclude: There is no such thing as a surprise exam! But the
teacher decides to give the exam on Tuesday, and come Tuesday
morning, you are surprised indeed.
This problem, which is often also formulated in terms of surprise fire drills or surprise executions, is known by many names, for
example, the "hangman's paradox" or, by serious philosophers, as
the "prediction paradox". To resolve it, let's treat it as a probability problem. Suppose that the day of the exam is chosen randomly
among the five days of the week. Now start a new school week.
What is the probability that you get the test on Monday? Obviously 1/5 because this is the probability that Monday is chosen.

If the test was not given on Monday, what is the probability that
it is given on Tuesday? The probability that Tuesday is chosen
to start with is 1/5, but we are now asking for the conditional
probability that the test is given on Tuesday, given that it was not
given on Monday. As there are now four days left, this conditional
probability is 1/4. Similarly, the conditional probabilities that the
test is given on Wednesday, Thursday, and Friday conditioned on
that it has not been given thus far are 1/3, 1/2, and 1, respectively.
We could define the surprise index each day as the probability
that the test is not given. On Monday, the surprise index is therefore 0.8, on Tuesday it has gone down to 0.75, and it continues to
go down as the week proceeds with no test given. On Friday, the
surprise index is 0, indicating absolute certainty that the test will
be given that day. Thus, it is possible to give a surprise test but
not in a way so that you are equally surprised each day, and it is
never possible to give it so that you are surprised on Friday. [18,
pp 23-24]
Example 6.26. Today Bayesian analysis is widely employed throughout science and industry. For instance, models employed to determine car insurance rates include a mathematical function describing, per unit of driving time, your personal probability of having
zero, one, or more accidents. Consider, for our purposes, a simplified model that places everyone in one of two categories: high
risk, which includes drivers who average at least one accident each
year, and low risk, which includes drivers who average less than
one.
If, when you apply for insurance, you have a driving record
that stretches back twenty years without an accident or one that
goes back twenty years with thirty-seven accidents, the insurance
company can be pretty sure which category to place you in. But if
you are a new driver, should you be classified as low risk (a kid who
obeys the speed limit and volunteers to be the designated driver)
or high risk (a kid who races down Main Street swigging from a
half-empty $2 bottle of Boones Farm apple wine)?
Since the company has no data on you, it might assign you
an equal prior probability of being in either group, or it might

use what it knows about the general population of new drivers


and start you off by guessing that the chances you are a high risk
are, say, 1 in 3. In that case the company would model you as a
hybrid (one-third high risk and two-thirds low risk) and charge you
one-third the price it charges high-risk drivers plus two-thirds the
price it charges low-risk drivers.
Then, after a year of observation, the company can employ the
new datum to reevaluate its model, adjust the one-third and two-thirds proportions it previously assigned, and recalculate what it
ought to charge. If you have had no accidents, the proportion of
low risk and low price it assigns you will increase; if you have had
two accidents, it will decrease. The precise size of the adjustment
is given by Bayes's theory. In the same manner the insurance
company can periodically adjust its assessments in later years to
reflect the fact that you were accident-free or that you twice had
an accident while driving the wrong way down a one-way street,
holding a cell phone with your left hand and a doughnut with
your right. That is why insurance companies can give out good
driver discounts: the absence of accidents elevates the posterior
probability that a driver belongs in a low-risk group. [14, pp 111-112]


6.2

Event-based Independence

Plenty of random things happen in the world all the time, most of
which have nothing to do with one another. If you toss a coin and
I roll a dice, the probability that you get heads is 1/2 regardless of
the outcome of my dice. Events that are unrelated to each other
in this way are called independent.
Definition 6.27. Two events A, B are called (statistically26 )
independent if

    P(A ∩ B) = P(A) P(B).        (9)

Notation: A ⊥⊥ B

Read "A and B are independent" or "A is independent of B".

We call (9) the multiplication rule for probabilities.
If two events are not independent, they are dependent. Intuitively, if two events are dependent, the probability of one
changes with the knowledge of whether the other has occurred.
6.28. Intuition: Again, here is how you should think about independent events: If one event has occurred, the probability of the
other does not change.
    P(A|B) = P(A) and P(B|A) = P(B).        (10)

In other words, the unconditional and the conditional probabilities are the same. We can almost use (10) as the definition for
independence. This is what we mentioned in 6.7. However, we use
(9) instead because it (1) also works with events whose probabilities are zero and (2) also has clear symmetry in the expression (so
that A ⊥⊥ B and B ⊥⊥ A can clearly be seen as the same). In fact,
in 6.32, we show how (10) can be used to define independence with
an extra condition that deals with the case when zero probability is
involved.
26 Sometimes our definition for independence above does not agree with the everyday-language use of the word "independence". Hence, many authors use the term "statistical
independence" to distinguish it from other definitions.

Example 6.29. [26, Ex. 5.4] Which of the following pairs of events
are independent?

(a) The card is a club, and the card is black.

Figure 6: A Deck of Cards (clubs, diamonds, hearts, spades)

(b) The card is a king, and the card is black.

6.30. An event with probability 0 or 1 is independent of any event


(including itself).
In particular, ∅ and Ω are independent of any events.
6.31. An event A is independent of itself if and only if P (A) is 0
or 1.
6.32. Two events A, B with positive probabilities are independent
if and only if P (B |A) = P (B), which is equivalent to P (A |B ) =
P (A).
When A and/or B has zero probability, A and B are automatically independent.
6.33. When A and B have nonzero probabilities, the following
statements are equivalent:


6.34. The following four statements are equivalent:

    A ⊥⊥ B,    A ⊥⊥ B^c,    A^c ⊥⊥ B,    A^c ⊥⊥ B^c.

Example 6.35. If P (A|B) = 0.4, P (B) = 0.8, and P (A) = 0.5,


are the events A and B independent? [15]

6.36. Keep in mind that independent and disjoint are not


synonyms. In some contexts these words can have similar meanings, but this is not the case in probability.
If two events cannot occur at the same time (they are disjoint),
are they independent? At first you might think so. After all,
they have nothing to do with each other, right? Wrong! They
have a lot to do with each other. If one has occurred, we know
for certain that the other cannot occur. [18, p 12]
To check whether A and B are disjoint, you only need to
look at the sets themselves and see whether they have shared
element(s). This can be answered without knowing probabilities.
To check whether A and B are independent, you need to look
at the probabilities P(A), P(B), and P(A ∩ B).

Reminder: If events A and B are disjoint, you calculate the
probability of the union A ∪ B by adding the probabilities
of A and B. For independent events A and B you calculate
the probability of the intersection A ∩ B by multiplying the
probabilities of A and B.


The two statements "A ⊥⊥ B" and "A and B are disjoint" can occur simultaneously only when P(A) = 0 and/or P(B) = 0.

The reverse is not true in general.


Example 6.37. Experiment of flipping a fair coin twice. Ω =
{HH, HT, TH, TT}. Define event A to be the event that the first
flip gives a H; that is A = {HH, HT }. Event B is the event that
the second flip gives a H; that is B = {HH, T H}. Note that even
though the events A and B are not disjoint, they are independent.

Example 6.38 (Slides). Prosecutor's fallacy: In 1999, a British


jury convicted Sally Clark of murdering two of her children who
had died suddenly at the ages of 11 and 8 weeks, respectively.
A pediatrician called in as an expert witness claimed that the
chance of having two cases of infant sudden death syndrome, or
cot deaths, in the same family was 1 in 73 million. There was
no physical or other evidence of murder, nor was there a motive.
Most likely, the jury was so impressed with the seemingly astronomical odds against the incidents that they convicted. But where
did the number come from? Data suggested that a baby born into
a family similar to the Clarks faced a 1 in 8,500 chance of dying
a cot death. Two cot deaths in the same family, it was argued,
therefore had a probability of (1/8,500)^2, which is roughly equal to
1/73,000,000.
Did you spot the error? I hope you did. The computation
assumes that successive cot deaths in the same family are independent events. This assumption is clearly questionable, and even
a person without any medical expertise might suspect that genetic
factors play a role. Indeed, it has been estimated that if there
is one cot death, the next child faces a much larger risk, perhaps

around 1/100. To find the probability of having two cot deaths in


the same family, we should thus use conditional probabilities and
arrive at the computation (1/8,500) × (1/100), which equals 1/850,000.
Now, this is still a small number and might not have made the jurors judge differently. But what does the probability 1/850,000
have to do with Sally's guilt? Nothing! When her first child died,
it was certified to have been from natural causes and there was
no suspicion of foul play. The probability that it would happen
again without foul play was 1/100, and if that number had been
presented to the jury, Sally would not have had to spend three
years in jail before the verdict was finally overturned and the expert witness (certainly no expert in probability) found guilty of
serious professional misconduct.
You may still ask the question what the probability 1/100 has
to do with Sally's guilt. Is this the probability that she is innocent? Not at all. That would mean that 99% of all mothers who
experience two cot deaths are murderers! The number 1/100 is
simply the probability of a second cot death, which only means
that among all families who experience one cot death, about 1%
will suffer through another. If probability arguments are used in
court cases, it is very important that all involved parties understand some basic probability. In Sallys case, nobody did.
References: [14, 118119] and [18, 2223].
Definition 6.39. Three events A1 , A2 , A3 are independent if and
only if
    P(A1 ∩ A2) = P(A1) P(A2)
    P(A1 ∩ A3) = P(A1) P(A3)
    P(A2 ∩ A3) = P(A2) P(A3)
    P(A1 ∩ A2 ∩ A3) = P(A1) P(A2) P(A3)
Remarks:
(a) When the first three equations hold, we say that the three
events are pairwise independent.
(b) We may use the term mutual independence to further
emphasize that we have independence instead of merely pairwise
independence.

Definition 6.40. The events A1, A2, . . . , An are independent if
and only if for any subcollection Ai1, Ai2, . . . , Aik,

    P(Ai1 ∩ Ai2 ∩ · · · ∩ Aik) = P(Ai1) P(Ai2) · · · P(Aik).

Note that part of the requirement is that

    P(A1 ∩ A2 ∩ · · · ∩ An) = P(A1) P(A2) · · · P(An).

Therefore, if someone tells us that the events A1, A2, . . . , An
are independent, then one of the properties that we can conclude is that

    P(A1 ∩ A2 ∩ · · · ∩ An) = P(A1) P(A2) · · · P(An).

Equivalently, this is the same as the requirement that

    P(⋂_{j∈J} Aj) = ∏_{j∈J} P(Aj)    for all J ⊂ [n] with |J| ≥ 2.

Note that the case when |J| = 1 automatically holds. The case
when |J| = 0 can be regarded as the case of the event Ω, which is
also trivially true.
6.41. Four events A, B, C, D are pairwise independent if and
only if they satisfy the following six conditions:

    P(A ∩ B) = P(A)P(B),
    P(A ∩ C) = P(A)P(C),
    P(A ∩ D) = P(A)P(D),
    P(B ∩ C) = P(B)P(C),
    P(B ∩ D) = P(B)P(D), and
    P(C ∩ D) = P(C)P(D).

They are independent if and only if they are pairwise independent
(satisfy the six conditions above) and also satisfy the following five
more conditions:

    P(B ∩ C ∩ D) = P(B)P(C)P(D),
    P(A ∩ C ∩ D) = P(A)P(C)P(D),
    P(A ∩ B ∩ D) = P(A)P(B)P(D),
    P(A ∩ B ∩ C) = P(A)P(B)P(C), and
    P(A ∩ B ∩ C ∩ D) = P(A)P(B)P(C)P(D).

6.3

Bernoulli Trials

Example 6.42. Consider the following random experiments


(a) Flip a coin 10 times. We are interested in the number of heads
obtained.
(b) Of all bits transmitted through a digital transmission channel,
10% are received in error. We are interested in the number of
bits in error in the next five bits transmitted.
(c) A multiple-choice test contains 10 questions, each with four
choices, and you guess at each question. We are interested in
the number of questions answered correctly.
These examples illustrate that a general probability model that
includes these experiments as particular cases would be very useful.
Example 6.43. Each of the random experiments in Example 6.42
can be thought of as consisting of a series of repeated, random
trials. In all cases, we are interested in the number of trials that
meet a specified criterion. The outcome from each trial either
meets the criterion or it does not; consequently, each trial can be
summarized as resulting in either a success or a failure.
Definition 6.44. A Bernoulli trial involves performing an experiment once and noting whether a particular event A occurs.
The outcome of the Bernoulli trial is said to be
(a) a success if A occurs and
(b) a failure otherwise.
We may view the outcome of a single Bernoulli trial as the outcome of a toss of an unfair coin for which the probability of heads
(success) is p = P(A) and the probability of tails (failure) is 1 − p.
The labeling (success and failure) is not meant to be literal and sometimes has nothing to do with the everyday meaning of the words. We can just as well use A and B or 0 and
1.

Example 6.45. Examples of Bernoulli trials: Flipping a coin,


deciding to vote for candidate A or candidate B, giving birth to
a boy or girl, buying or not buying a product, being cured or not
being cured, even dying or living are examples of Bernoulli trials.
Actions that have multiple outcomes can also be modeled as
Bernoulli trials if the question you are asking can be phrased
in a way that has a yes or no answer, such as Did the dice
land on the number 4? or Is there any ice left on the North
Pole?
Definition 6.46. (Independent) Bernoulli Trials = a Bernoulli
trial is repeated many times.
(a) It is usually assumed that the trials are independent. This
implies that the outcome from one trial has no effect on the
outcome to be obtained from any other trial.
(b) Furthermore, it is often reasonable to assume that the probability of a success in each trial is constant.
An outcome of the complete experiment is a sequence of successes and failures which can be denoted by a sequence of ones
and zeroes.
Example 6.47. If we toss an unfair coin n times, we obtain the
space Ω = {H, T}^n consisting of 2^n elements of the form (ω1, ω2, . . . , ωn)
where ωi = H or T.
Example 6.48. What is the probability of two failures and three
successes in five Bernoulli trials with success probability p.
We observe that the outcomes with three successes in five trials
are 11100, 11010, 11001, 10110, 10101, 10011, 01110, 01101, 01011,
and 00111. We note that the probability of each outcome is a
product of five probabilities, each related to one Bernoulli trial.
In outcomes with three successes, three of the probabilities are p
and the other two are 1 − p. Therefore, each outcome with three
successes has probability (1 − p)^2 p^3. There are 10 of them. Hence,
the total probability is 10 (1 − p)^2 p^3.

6.49. The probability of exactly n1 successes in n = n0 + n1 Bernoulli
trials is

    C(n, n1) (1 − p)^(n−n1) p^(n1) = C(n, n0) (1 − p)^(n0) p^(n−n0),

where C(n, n1) denotes the binomial coefficient "n choose n1".
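As a quick check against Example 6.48 (n = 5 trials, n1 = 3 successes; the value of p below is arbitrary):

    p = 0.3; n = 5; n1 = 3;
    nchoosek(n, n1) * p^n1 * (1-p)^(n-n1)   % = 10*(1-p)^2*p^3
    binopdf(n1, n, p)                       % same value from the binomial pmf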
Example 6.50. At least one occurrence of a 1-in-n-chance event
in n repeated trials: consider n Bernoulli trials with success
probability 1/n each.

Figure 7: A 1-in-n-chance event in n repeated Bernoulli trials (success probability 1/n), plotted against n. As n grows, P[#successes ≥ 1] approaches
1 − 1/e ≈ 0.6321 and P[#successes = 0] approaches 1/e ≈ 0.3679; the value
1/(2e) ≈ 0.1839 is marked near the curves for two and three successes.
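The limiting values quoted in the figure can be checked directly (a sketch; n = 50 matches the right edge of the plot):

    n = 50;
    1 - (1 - 1/n)^n               % P(at least one success), close to 1 - 1/e ~ 0.6321
    (1 - 1/n)^n                   % P(no success), close to 1/e ~ 0.3679
    1 - binopdf(0, n, 1/n)        % the same "at least one success" via the binomial pmf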

Example 6.51. Digital communication over unreliable channels: Consider a communication system below

Here, we consider a simple channel called binary symmetric


channel:

This channel can be described as a channel that introduces random bit errors with probability p.
A crude digital communication system would put binary information into the channel directly; the receiver then takes whatever
value that shows up at the channel output as what the sender
transmitted. Such a communication system would directly suffer a bit
error probability of p.
In situations where this error rate is not acceptable, error control
techniques are introduced to reduce the error rate in the delivered
information.
One method of reducing the error rate is to use error-correcting
codes:

A simple error-correcting code is the repetition code. An example of such a code is described below:
(a) At the transmitter, the encoder box performs the following
task:
(i) To send a 1, it will send 11111 through the channel.
(ii) To send a 0, it will send 00000 through the channel.


(b) When the five bits pass through the channel, some of them may be corrupted. Assume that the channel is binary symmetric and
that it acts on each of the bits independently.

(c) At the receiver, we (or more specifically, the decoder box) get
5 bits, but some of the bits may have been changed by the channel.
To determine what was sent from the transmitter, the receiver
applies the majority rule: Among the 5 received bits,
(i) if #1 > #0, then it claims that 1 was transmitted,
(ii) if #0 > #1, then it claims that 0 was transmitted.

Error Control Coding

Repetition Code at Tx: Repeat the bit n times.

Channel: Binary Symmetric Channel (BSC) with bit error
probability p.

Majority Vote at Rx

Figure 8: Bit error probability for a simple system that uses a repetition code
at the transmitter (repeat each bit n times) and majority vote at the receiver,
shown as a function of p for n = 1, 5, 15, and 25. The channel is assumed to be
binary symmetric with bit error probability p.
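For odd n, the residual bit error probability after majority voting is the probability that more than half of the n transmitted copies are flipped. A sketch of the computation used to draw curves like those in Figure 8 (the values of p and n below are arbitrary choices):

    p = 0.2;  n = 5;                          % channel bit error prob. and repetition length (odd)
    1 - binocdf(floor(n/2), n, p)             % P(more than half of the n bits are in error)
    sum(binopdf(ceil(n/2):n, n, p))           % the same probability by direct summation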


Exercise 6.52 (F2011). Kakashi and Gai are eternal rivals. Kakashi
is a little stronger than Gai and hence for each time that they fight,
the probability that Kakashi wins is 0.55. In a competition, they
fight n times (where n is odd). Assume that the results of the fights
are independent. The one who wins more will win the competition.
Suppose n = 3. What is the probability that Kakashi wins the
competition?


Sirindhorn International Institute of Technology


Thammasat University
School of Information, Computer and Communication Technology

ECS315 2013/1 Part III.1 Dr.Prapun


7

Random variables

In performing a chance experiment, one is often not interested


in the particular outcome that occurs but in a specific numerical
value associated with that outcome. In fact, for most applications, measurements and observations are expressed as numerical
quantities.
Example 7.1. Take this course and observe your grades.

7.2. The advantage of working with numerical quantities is that


we can perform mathematical operations on them.

In the mathematics of probability, averages are called expectations or expected values.


Definition 7.3. A real-valued function X(ω) defined for all points
ω in a sample space Ω is called a random variable (r.v. or RV)27.
So, a random variable is a rule that assigns a numerical value
to each possible outcome of a chance experiment.

Intuitively, a random variable is a variable that takes on its


values by chance.
The convention is to use capital letters such as X, Y, Z to
denote random variables.
Example 7.4. Roll a fair dice: Ω = {1, 2, 3, 4, 5, 6}.

27 The term "random variable" is a misnomer. Technically, if you look at the definition
carefully, a random variable is a deterministic function; that is, it is not random and it is not
a variable. [Toby Berger][26, p 254]
As a function, it is simply a rule that maps points/outcomes ω in Ω to real numbers.
It is also a deterministic function; nothing is random about the mapping/assignment.
The randomness in the observed values is due to the underlying randomness of the
argument of the function X, namely the experiment outcomes ω.
In other words, the randomness in the observed value of X is induced by the underlying
random experiment, and hence we should be able to compute the probabilities of the
observed values in terms of the probabilities of the underlying outcomes.

Example 7.5 (Three Coin Tosses). Counting the number of heads
N in a sequence of three coin tosses:

    Ω = {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH}.

The random variable N maps the outcomes to the real number line:

    TTT            →  0,
    TTH, THT, HTT  →  1,
    THH, HTH, HHT  →  2,
    HHH            →  3.

Example 7.6 (Sum of Two Dice). If S is the sum of the dots


when rolling one fair dice twice, the random variable S assigns the
numerical value i+j to the outcome (i, j) of the chance experiment.
Example 7.7. Continue from Example 7.4,
(a) What is the probability that X = 4?

(b) What is the probability that Y = 4?


Definition 7.8. Events involving random variables:


[some condition(s) on X] = the set of outcomes ω in Ω such that
X(ω) satisfies the conditions.

    [X ∈ B]     = {ω ∈ Ω : X(ω) ∈ B}
    [a ≤ X < b] = [X ∈ [a, b)] = {ω ∈ Ω : a ≤ X(ω) < b}
    [X > a]     = {ω ∈ Ω : X(ω) > a}
    [X = x]     = {ω ∈ Ω : X(ω) = x}
We usually use the corresponding lowercase letter to denote
(a) a possible value (realization) of the random variable
(b) the value that the random variable takes on
(c) the running values for the random variable
All of the above items are sets of outcomes. They are all events!
Example 7.9. Continue from Examples 7.4 and 7.7,
(a) [X = 4] = {ω : X(ω) = 4}

(b) [Y = 4] = {ω : Y(ω) = 4} = {ω : (ω − 3)^2 = 4}

Definition 7.10. To avoid double use of brackets (round brackets over square brackets), we write P[X ∈ B] when we mean
P([X ∈ B]). Hence,

    P[X ∈ B] = P([X ∈ B]) = P({ω : X(ω) ∈ B}).

Similarly,

    P[X < x] = P([X < x]) = P({ω : X(ω) < x}).
Example 7.11. In Example 7.5 (Three Coin Tosses), if the coin
is fair, then
P [N < 2] =


7.12. At a certain point in most probability courses, the sample


space is rarely mentioned anymore and we work directly with random variables. The sample space often disappears along with
the (ω) of X(ω), but they are really there in the background.

Definition 7.13. A set S is called a support of a random variable
X if P[X ∈ S] = 1.
To emphasize that S is a support of a particular variable X,
we denote a support of X by SX .
Practically, we define a support of a random variable X to be
the set of all the possible values of X.28
For any random variable, the set R of all real numbers is
always a support; however, it is not that useful because it does
not further limit the possible values of the random variable.
Recall that a support of a probability measure P is any set
A such that P (A) = 1.
Definition 7.14. The probability distribution is a description
of the probabilities associated with the random variable.
7.15. There are three types of random variables. The first type,
which will be discussed in Section 8, is called discrete random
which will be discussed in Section 8, is called discrete random
variable. To tell whether a random variable is discrete, one simple
way is to consider the possible values of the random variable.
If it is limited to only a finite or countably infinite number of
possibilities, then it is discrete. We will later discuss continuous
random variables whose possible values can be anywhere in
some intervals of real numbers.

28

Later on, you will see that 1) a default support of a discrete random variable is the set
of values where the pmf is strictly positive and 2) a default support of a continuous random
variable is the set of values where the pdf is strictly positive.


8

Discrete Random Variables

Intuitively, to tell whether a random variable is discrete, we simply


consider the possible values of the random variable. If the random
variable is limited to only a finite or countably infinite number of
possibilities, then it is discrete.
Example 8.1. Voice Lines: A voice communication system for
a business contains 48 external lines. At a particular time, the
system is observed, and some of the lines are being used. Let the
random variable X denote the number of lines in use. Then, X
can assume any of the integer values 0 through 48. [15, Ex 3-1]
Definition 8.2. A random variable X is said to be a discrete
random variable if there exists a countable number of distinct
real numbers xk such that

    ∑_k P[X = xk] = 1.        (11)

In other words, X is a discrete random variable if and only if X
has a countable support.
Example 8.3. For the random variable N in Example 7.5 (Three
Coin Tosses),

For the random variable S in Example 7.6 (Sum of Two Dice),

8.4. Although the support SX of a random variable X is defined as


any set S such that P[X ∈ S] = 1. For a discrete random variable,
SX is usually set to be {x : pX (x) > 0}, the set of all possible
values of X.
Definition 8.5. Important Special Case: An integer-valued random variable is a discrete random variable whose xk in (11)
above are all integers.

8.6. Recall, from 7.14, that the probability distribution of a


random variable X is a description of the probabilities associated
with X.
For a discrete random variable, the distribution is often characterized by just a list of the possible values (x1 , x2 , x3 , . . .) along
with the probability of each:
(P [X = x1 ] , P [X = x2 ] , P [X = x3 ] , . . . , respectively) .
In some cases, it is convenient to express the probability in
terms of a formula. This is especially useful when dealing with a
random variable that has an unbounded number of outcomes. It
would be tedious to list all the possible values and the corresponding probabilities.
8.1

PMF: Probability Mass Function

Definition 8.7. When X is a discrete random variable satisfying


(11), we define its probability mass function (pmf) by29
pX (x) = P [X = x].

Sometimes, when we only deal with one random variable or


when it is clear which random variable the pmf is associated
with, we write p(x) or px instead of pX (x).
The argument (x) of a pmf ranges over all real numbers.
Hence, the pmf is defined for x that is not among the xk
in (11). In such case, the pmf is simply 0. This is usually
expressed as pX (x) = 0, otherwise when we specify a pmf
for a particular r.v.

29

Many references (including [15] and MATLAB) use fX (x) for pmf instead of pX (x). We will
NOT use fX (x) for pmf. Later, we will define fX (x) as a probability density function which
will be used primarily for another type of random variable (continuous r.v.)


Example 8.8. Continue from Example 7.5. N is the number of


heads in a sequence of three coin tosses.

8.9. Graphical Description of the Probability Distribution: Traditionally, we use stem plot to visualize pX . To do this, we graph
a pmf by marking on the horizontal axis each value with nonzero
probability and drawing a vertical bar with length proportional to
the probability.
8.10. Any pmf p(·) satisfies two properties:

(a) p(·) ≥ 0

(b) there exist numbers x1, x2, x3, . . . such that ∑_k p(xk) = 1 and
    p(x) = 0 for other x.

When you are asked to verify that a function is a pmf, check these
two properties.
8.11. Finding probability from pmf: for any subset B of R, we
can find

    P[X ∈ B] = ∑_{xk ∈ B} P[X = xk] = ∑_{xk ∈ B} pX(xk).

In particular, for integer-valued random variables,

    P[X ∈ B] = ∑_{k ∈ B} P[X = k] = ∑_{k ∈ B} pX(k).

8.12. Steps to find probability of the form P [some condition(s) on X]


when the pmf pX (x) is known.
(a) Find the support of X.
(b) Consider only the x inside the support. Find all values of x
that satisfies the condition(s).
(c) Evaluate the pmf at x found in the previous step.
(d) Add the pmf values from the previous step.
Example 8.13. Suppose a random variable X has pmf

    pX(x) = { c/x,  x = 1, 2, 3,
            { 0,    otherwise.
(a) The value of the constant c is

(b) Sketch of pmf

(c) P [X = 1]

(d) P [X 2]

(e) P [X > 3]


8.14. Any function p(·) on R which satisfies

(a) p(·) ≥ 0, and

(b) there exist numbers x1, x2, x3, . . . such that ∑_k p(xk) = 1 and
    p(x) = 0 for other x

is a pmf of some discrete random variable.


8.2

CDF: Cumulative Distribution Function

Definition 8.15. The (cumulative) distribution function (cdf )


of a random variable X is the function FX(x) defined by

    FX(x) = P[X ≤ x].

The argument (x) of a cdf ranges over all real numbers.

From its definition, we know that 0 ≤ FX ≤ 1.

Think of it as a function that collects the probability mass
from −∞ up to the point x.

8.16. From pmf to cdf: In general, for any discrete random variable with possible values x1, x2, . . ., the cdf of X is given by

    FX(x) = P[X ≤ x] = ∑_{xk ≤ x} pX(xk).

Example 8.17. Continue from Examples 7.5, 7.11, and 8.8 where
N is defined as the number of heads in a sequence of three coin
tosses. We have
    pN(0) = pN(3) = 1/8  and  pN(1) = pN(2) = 3/8.

(a) FN(0)

(b) FN(1.5)

(c) Sketch of cdf
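A sketch of how the cdf can be built and plotted from the pmf in MATLAB (cumsum simply accumulates the probability mass from the left):

    x  = 0:3;
    pN = [1 3 3 1]/8;             % pmf of N
    FN = cumsum(pN)               % cdf values at x = 0,1,2,3: 1/8, 4/8, 7/8, 1
    stairs([x 4], [FN FN(end)])   % staircase plot of the cdf
    % As a check: the cdf equals 1/8 at x = 0 and 1/2 at x = 1.5.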

8.18. Facts:

For any discrete r.v. X, FX is a right-continuous, staircase
function of x with jumps at a countable set of points xk.

When you are given the cdf of a discrete random variable, you
can derive its pmf from the locations and sizes of the jumps.
If a jump happens at x = c, then pX(c) is the same as the
amount of jump at c. At a location x where there is no
jump, pX(x) = 0.

Example 8.19. Consider a discrete random variable X whose cdf
FX(x) is shown in Figure 9.

Figure 9: CDF for Example 8.19

Determine the pmf pX(x).



8.20. Characterizing properties30 of cdf:

CDF1  FX is non-decreasing (monotone increasing).

CDF2  FX is right-continuous (continuous from the right):
      for all x, FX(x) = lim_{y↓x} FX(y).

CDF3  lim_{x→−∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1.

Figure 10: Right-continuous function at jump point

At any point x, the jump (or saltus) in FX at x is

    P[X = x] = FX(x) − lim_{y↑x} FX(y).

Moreover, for x < y, P[x < X ≤ y] = FX(y) − FX(x).

8.21. FX can be written as

    FX(x) = ∑_{xk} pX(xk) u(x − xk),

where u(x) = 1_{[0,∞)}(x) is the unit step function.

30 These properties hold for any type of random variables. Moreover, for any function F
that satisfies these three properties, there exists a random variable X whose CDF is F.

Sirindhorn International Institute of Technology


Thammasat University
School of Information, Computer and Communication Technology

ECS315 2013/1 Part III.2 Dr.Prapun


8.3

Families of Discrete Random Variables

Many physical systems can be modeled by the same or similar


random experiments and random variables. In this subsection,
we present the analysis of several discrete random variables that
frequently arise in applications.31

Definition 8.22. X is uniformly distributed on a finite set S
if

    pX(x) = P[X = x] = { 1/|S|,  x ∈ S,
                       { 0,      otherwise.

We write X ~ U(S) or X ~ Uniform(S).
Read X is uniform on S or X is a uniform random variable
on set S.
The pmf is usually referred to as the uniform discrete distribution.
Simulation: When the support S contains only consecutive integers32 , it can be generated by the command randi in MATLAB
(R2008b).
31 As mentioned in 7.12, we often omit a discussion of the underlying sample space of the
random experiment and directly describe the distribution of a particular random variable.
32 or, with minor manipulation, only uniformly spaced numbers

Example 8.23. X is uniformly distributed on 1, 2, . . . , n if

In MATLAB, X can be generated by randi(n); for example, randi(10) when n = 10.


Example 8.24. Uniform pmf is used when the random variable
can take finite number of equally likely or totally random values.
Classical game of chance / classical probability
Fair gaming devices (well-balanced coins and dice, well-shuffled
decks of cards)
Example 8.25. Roll a fair dice. Let X be the outcome.

Definition 8.26. X is a Bernoulli random variable if

    pX(x) = { 1 − p,  x = 0,
            { p,      x = 1,        p ∈ (0, 1).
            { 0,      otherwise,

Write X ~ B(1, p) or X ~ Bernoulli(p).

X takes only two values: 0 or 1.

Definition 8.27. X is a binary random variable if

    pX(x) = { 1 − p,  x = a,
            { p,      x = b,        p ∈ (0, 1), b > a.
            { 0,      otherwise,

X takes only two values: a or b.

Definition 8.28. X is a binomial random variable with size
n ∈ N and parameter p ∈ (0, 1) if

    pX(x) = { C(n, x) p^x (1 − p)^(n−x),  x ∈ {0, 1, 2, . . . , n},        (12)
            { 0,                          otherwise.

Write X ~ B(n, p) or X ~ binomial(n, p).

Observe that B(1, p) is Bernoulli with parameter p.

To calculate pX(x), one can use binopdf(x,n,p) in MATLAB.

Interpretation: X is the number of successes in n independent
Bernoulli trials.
Example 8.29. An optical inspection system is to distinguish
among different part types. The probability of a correct classification of any part is 0.98. Suppose that three parts are inspected
and that the classifications are independent.
(a) Let the random variable X denote the number of parts that
are correctly classified. Determine the probability mass function of X. [15, Q3-20]
(b) Let the random variable Y denote the number of parts that
are incorrectly classified. Determine the probability mass
function of Y .
Solution:
(a) X is a binomial random variable with n = 3 and p = 0.98.
Hence,

    pX(x) = { C(3, x) 0.98^x (0.02)^(3−x),  x ∈ {0, 1, 2, 3},        (13)
            { 0,                            otherwise.

In particular, pX(0) = 8 × 10^−6, pX(1) = 0.001176, pX(2) =
0.057624, and pX(3) = 0.941192. Note that in MATLAB, these
probabilities can be calculated by evaluating
binopdf(0:3,3,0.98).

(b) Y is a binomial random variable with n = 3 and p = 0.02.


Hence,

    pY(y) = { C(3, y) 0.02^y (0.98)^(3−y),  y ∈ {0, 1, 2, 3},        (14)
            { 0,                            otherwise.

In particular, pY(0) = 0.941192, pY(1) = 0.057624, pY(2) =
0.001176, and pY(3) = 8 × 10^−6. Note that in MATLAB, these
probabilities can be calculated by evaluating
binopdf(0:3,3,0.02).
Alternatively, note that there are three parts. If X of them are
classified correctly, then the number of incorrectly classified
parts is n − X, which is what we defined as Y. Therefore,
Y = 3 − X. Hence, pY(y) = P[Y = y] = P[3 − X = y] =
P[X = 3 − y] = pX(3 − y).

Example 8.30. Daily Airlines flies from Amsterdam to London


every day. The price of a ticket for this extremely popular flight
route is $75. The aircraft has a passenger capacity of 150. The
airline management has made it a policy to sell 160 tickets for this
flight in order to protect themselves against no-show passengers.
Experience has shown that the probability of a passenger being
a no-show is equal to 0.1. The booked passengers act independently of each other. Given this overbooking strategy, what is the
probability that some passengers will have to be bumped from the
flight?
Solution: This problem can be treated as 160 independent
trials of a Bernoulli experiment with a success rate of p = 9/10,
where a passenger who shows up for the flight is counted as a success. Use the random variable X to denote the number of passengers
that show up for a given flight. The random variable X is binomially distributed with parameters n = 160 and p = 9/10. The
probability in question is given by

    P[X > 150] = 1 − P[X ≤ 150] = 1 − FX(150).

In MATLAB, we can enter 1-binocdf(150,160,9/10) to get 0.0359.


Thus, the probability that some passengers will be bumped from
any given flight is roughly 3.6%. [22, Ex 4.1]

Definition 8.31. A geometric random variable X is defined by
the fact that for some constant β ∈ (0, 1),

    pX(k + 1) = β pX(k)

for all k ∈ S where S can be either N or N ∪ {0}.

(a) When its support is N = {1, 2, . . .},

    pX(x) = { (1 − β) β^(x−1),  x = 1, 2, . . . ,
            { 0,                otherwise.

Write X ~ G1(β) or geometric1(β).

In MATLAB, to evaluate pX(x), use geopdf(x-1,1-β).

Interpretation: X is the number of trials required in
Bernoulli trials to achieve the first success.

In particular, in a series of Bernoulli trials (independent
trials with constant probability p of a success), let the
random variable X denote the number of trials until the
first success. Then X is a geometric random variable with
parameter β = 1 − p and

    pX(x) = { (1 − β) β^(x−1) = p (1 − p)^(x−1),  x = 1, 2, . . . ,
            { 0,                                  otherwise.

(b) When its support is N ∪ {0},

    pX(x) = { (1 − β) β^x = p (1 − p)^x,  x = 0, 1, 2, . . . ,
            { 0,                          otherwise.

Write X ~ G0(β) or geometric0(β).

In MATLAB, to evaluate pX(x), use geopdf(x,1-β).

Interpretation: X is the number of failures in Bernoulli
trials before the first success occurs.

8.32. In 1837, the famous French mathematician Poisson introduced a probability distribution that would later come to be known
as the Poisson distribution, and this would develop into one of the
most important distributions in probability theory. As is often remarked, Poisson did not recognize the huge practical importance of
the distribution that would later be named after him. In his book,
he dedicates just one page to this distribution. It was Bortkiewicz
in 1898, who first discerned and explained the importance of the
Poisson distribution in his book Das Gesetz der Kleinen Zahlen
(The Law of Small Numbers). [22]
Definition 8.33. X is a Poisson random variable with parameter α > 0 if

    pX(x) = { e^(−α) α^x / x!,  x = 0, 1, 2, . . . ,
            { 0,                otherwise.

In MATLAB, use poisspdf(x,alpha).

Write X ~ P(α) or Poisson(α).

We will see later in Example 9.7 that α is the average or
expected value of X.

Instead of X, a Poisson random variable is usually denoted by
Λ. The parameter α is often replaced by λτ, where λ is referred
to as the intensity/rate parameter of the distribution.
Example 8.34. The first use of the Poisson model is said to have
been by a Prussian (German) physician, Bortkiewicz, who found
that the annual number of late-19th-century Prussian (German)
soldiers kicked to death by horses fitted a Poisson distribution [6,
p 150],[3, Ex 2.23]33 .

33

I. J. Good and others have argued that the Poisson distribution should be called the
Bortkiewicz distribution, but then it would be very difficult to say or write.


Example 8.35. The number of hits to a popular website during
a 1-minute interval is given by N ~ P(α) where α = 2.
(a) Find the probability that there is at least one hit between
3:00AM and 3:01AM.

(b) Find the probability that there are at least 2 hits during the
time interval above.
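A sketch of the two computations in MATLAB (α = 2 as stated):

    alpha = 2;
    1 - poisspdf(0, alpha)        % P(at least one hit)  = 1 - exp(-2)            ~ 0.8647
    1 - poisscdf(1, alpha)        % P(at least two hits) = 1 - exp(-2) - 2*exp(-2) ~ 0.5940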

8.36. One of the reasons why Poisson distribution is important is


because many natural phenomena can be modeled by Poisson
processes.
Definition 8.37. A Poisson process (PP) is a random arrangement of marks (denoted by below) on the time line.

The marks may indicate the arrival times or occurrences of


event/phenomenon of interest.
Example 8.38. Examples of processes that can be modeled by
Poisson process include
(a) the sequence of times at which lightning strikes occur or mail
carriers get bitten within some region
(b) the emission of particles from a radioactive source


(c) the arrival of


telephone calls at a switchboard or at an automatic phoneswitching system
urgent calls to an emergency center

(filed) claims at an insurance company

incoming spikes (action potential) to a neuron in human


brain
(d) the occurrence of
serious earthquakes
traffic accidents
power outages
in a certain area.
(e) page view requests to a website
8.39. It is convenient to consider the Poisson process in terms of
customers arriving at a facility.
We focus on a type of Poisson process that is called homogeneous
Poisson process.
Definition 8.40. For a homogeneous Poisson process, there is
only one parameter that describes the whole process. This number
is called the rate and usually denoted by λ.
Example 8.41. If you think about modeling customer arrivals as
a Poisson process with rate λ = 5 customers/hour, then it means
that during any fixed time interval of duration 1 hour (say, from
noon to 1PM), you expect to have about 5 customers arriving in
that interval. If you consider a time interval of duration two hours
(say, from 1PM to 3PM), you expect to have about 2 × 5 = 10
customers arriving in that time interval.
8.42. One important fact which we will revisit later is that, for a
homogeneous Poisson process, the number of arrivals during a time
interval of duration T is a Poisson random variable with parameter
α = λT.

Example 8.43. Examples of Poisson random variables:

#photons emitted by a light source of intensity λ [photons/second] in time τ

#atoms of radioactive material undergoing decay in time τ

#clicks in a Geiger counter in τ seconds when the average
number of clicks in 1 second is λ

#dopant atoms deposited to make a small device such as an
FET

#customers arriving in a queue or workstations requesting
service from a file server in time τ

Counts of demands for telephone connections in time τ

Counts of defects in a semiconductor chip.
Example 8.44. Thongchai produces a new hit song every 7 months
on average. Assume that songs are produced according to a Poisson process. Find the probability that Thongchai produces more
than two hit songs in 1 year.
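A sketch of one way to set this up: with about one hit song every 7 months on average, the number of hit songs in a 12-month period is Poisson with parameter α = 12/7.

    alpha = 12/7;                 % expected number of hit songs in one year
    1 - poisscdf(2, alpha)        % P(more than two hit songs) ~ 0.25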

8.45. Poisson approximation of Binomial distribution: When


p is small and n is large, B(n, p) can be approximated by P(np).

(a) In a large number of independent repetitions of a Bernoulli
trial having a small probability of success, the total number of
successes is approximately Poisson distributed with parameter α = np, where n = the number of trials and p = the
probability of success. [22, p 109]

(b) More specifically, suppose Xn ~ B(n, pn). If pn → 0 and
npn → α as n → ∞, then

    P[Xn = k] = C(n, k) pn^k (1 − pn)^(n−k) → e^(−α) α^k / k!.
Example 8.46. Consider Xn ~ B(n, 1/n).
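A quick numerical comparison of the two pmfs (a sketch; n = 100 and the first few values of k are arbitrary choices):

    n = 100;  k = 0:5;
    [binopdf(k, n, 1/n); poisspdf(k, 1)]   % the two rows are already very close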

Example 8.47. Recall that Bortkiewicz applied the Poisson model


to the number of Prussian cavalry deaths attributed to fatal horse
kicks. Here, indeed, one encounters a very large number of trials
(the Prussian cavalrymen), each with a very small probability of
success (fatal horse kick).
8.48. Summary:

    X                        Support SX             pX(x)
    ---------------------------------------------------------------------------
    Uniform U_n              {1, 2, . . . , n}      1/n
    Uniform U_{0,1,...,n−1}  {0, 1, . . . , n − 1}  1/n
    Bernoulli B(1, p)        {0, 1}                 1 − p for x = 0;  p for x = 1
    Binomial B(n, p)         {0, 1, . . . , n}      C(n, x) p^x (1 − p)^(n−x)
    Geometric G0(β)          N ∪ {0}                (1 − β) β^x
    Geometric G1(β)          N                      (1 − β) β^(x−1)
    Poisson P(α)             N ∪ {0}                e^(−α) α^x / x!

Table 3: Examples of probability mass functions. Here, p, β ∈ (0, 1), α > 0,
and n ∈ N.

8.4

Some Remarks

8.49. Sometimes, it is useful to define and think of pmf as a vector


p of probabilities.
When you use MATLAB, it is also useful to keep track of the
values of x corresponding to the probabilities in p. This can be
done via defining a vector x.

Example 8.50. For B(3, 1/3), we may define

    x = [0, 1, 2, 3]

and

    p = [ C(3,0)(1/3)^0(2/3)^3, C(3,1)(1/3)^1(2/3)^2, C(3,2)(1/3)^2(2/3)^1, C(3,3)(1/3)^3(2/3)^0 ]
      = [ 8/27, 4/9, 2/9, 1/27 ].

8.51. At this point, we have a couple of ways to define probabilities that are associated with a random variable X
(a) We can define P[X ∈ B] for all possible sets B.

(b) For a discrete random variable, we only need to define its pmf
pX(x), which is defined as P[X = x] = P[X ∈ {x}].
(c) We can also define the cdf FX (x).
Definition 8.52. If pX (c) = 1, that is P [X = c] = 1, for some
constant c, then X is called a degenerated random variable.


ECS315 2013/1 Part III.2 Dr.Prapun

9

Expectation and Variance

Two numbers are often used to summarize a probability distribution for a random variable X. The mean is a measure of the center or middle of the probability distribution, and the variance is a
measure of the dispersion, or variability in the distribution. These
two measures do not uniquely identify a probability distribution.
That is, two different distributions can have the same mean and
variance. Still, these measures are simple, useful summaries of the
probability distribution of X.
9.1

Expectation of Discrete Random Variable

The most important characteristic of a random variable is its expectation. Synonyms for expectation are expected value, mean,
and first moment.
The definition of expectation is motivated by the conventional
idea of numerical average. Recall that the numerical average of n
numbers, say a1 , a2 , . . . , an is
    (1/n) ∑_{k=1}^{n} ak.

We use the average to summarize or characterize the entire collection of numbers a1 , . . . , an with a single value.
Example 9.1. Consider 10 numbers: 5, 2, 3, 2, 5, -2, 3, 2, 5, 2.
The average is

    (5 + 2 + 3 + 2 + 5 + (−2) + 3 + 2 + 5 + 2) / 10 = 27/10 = 2.7.

We can rewrite the above calculation as

    (−2) × (1/10) + 2 × (4/10) + 3 × (2/10) + 5 × (3/10).

Definition 9.2. Suppose X is a discrete random variable, we define the expectation (or mean or expected value) of X by
    EX = ∑_x x P[X = x] = ∑_x x pX(x).        (15)

In other words, The expected value of a discrete random variable


is a weighted mean of the values the random variable can take on
where the weights come from the pmf of the random variable.
Some references use mX or μX to represent EX.
For conciseness, we simply write x under the summation symbol in (15); this means that the sum runs over all x values in
the support of X. (Of course, for x outside of the support,
pX (x) is 0 anyway.)
9.3. Analogy: In mechanics, think of point masses on a line with
a mass of pX (x) kg. at a distance x meters from the origin.
In this model, EX is the center of mass (the balance point).
This is why pX (x) is called probability mass function.
Example 9.4. When X Bernoulli(p) with p (0, 1),

Note that, since X takes only the values 0 and 1, its expected
value p is never seen.
9.5. Interpretation: The expected value is in general not a typical
value that the random variable can take on. It is often helpful to
interpret the expected value of a random variable as the long-run
average value of the variable over many independent repetitions
of an experiment

Example 9.6. pX(x) = { 1/4,  x = 0,
                     { 3/4,  x = 2,
                     { 0,    otherwise.

Example 9.7. For X ~ P(α),

    EX = ∑_{i=0}^{∞} i e^(−α) α^i / i! = 0 + ∑_{i=1}^{∞} i e^(−α) α^i / i!
       = e^(−α) α ∑_{i=1}^{∞} α^(i−1) / (i − 1)!
       = e^(−α) α ∑_{k=0}^{∞} α^k / k! = e^(−α) α e^α = α.
Example 9.8. For X ∼ B(n, p),
EX = Σ_{i=0}^{n} i C(n,i) p^i (1−p)^{n−i} = Σ_{i=1}^{n} i · n!/(i!(n−i)!) · p^i (1−p)^{n−i}
   = n Σ_{i=1}^{n} (n−1)!/((i−1)!(n−i)!) · p^i (1−p)^{n−i} = n Σ_{i=1}^{n} C(n−1, i−1) p^i (1−p)^{n−i}.
Let k = i − 1. Then,
EX = n Σ_{k=0}^{n−1} C(n−1, k) p^{k+1} (1−p)^{n−(k+1)} = np Σ_{k=0}^{n−1} C(n−1, k) p^k (1−p)^{n−1−k}.
We now have the expression in the form that we can apply the binomial theorem, which finally gives
EX = np(p + (1 − p))^{n−1} = np.
We shall revisit this example again using another approach in Example 10.45.
Example 9.9. Pascal's wager: Suppose you concede that you don't know whether or not God exists and therefore assign a 50 percent chance to either proposition. How should you weigh these odds when deciding whether to lead a pious life? If you act piously and God exists, Pascal argued, your gain (eternal happiness) is infinite. If, on the other hand, God does not exist, your loss, or negative return, is small: the sacrifices of piety. To weigh these possible gains and losses, Pascal proposed, you multiply the probability of each possible outcome by its payoff and add them all up, forming a kind of average or expected payoff. In other words, the mathematical expectation of your return on piety is one-half infinity (your gain if God exists) minus one-half a small number (your loss if he does not exist). Pascal knew enough about infinity to know that the answer to this calculation is infinite, and thus the expected return on piety is infinitely positive. Every reasonable person, Pascal concluded, should therefore follow the laws of God. [14, p 76]
• Pascal's wager is often considered the founding of the mathematical discipline of game theory, the quantitative study of optimal decision strategies in games.
9.10. Technical issue: Definition (15) is only meaningful if the sum is well defined.
• The sum of infinitely many nonnegative terms is always well-defined, with +∞ as a possible value for the sum.
• Infinite Expectation: Consider a random variable X whose pmf is defined by
p_X(x) = { 1/(c x²), x = 1, 2, 3, ...;  0, otherwise }
Then, c = Σ_{n=1}^{∞} 1/n², which is a finite positive number (= π²/6). However,
EX = Σ_{k=1}^{∞} k p_X(k) = Σ_{k=1}^{∞} k · 1/(c k²) = (1/c) Σ_{k=1}^{∞} 1/k = +∞.
• Some care is necessary when computing expectations of signed random variables that take infinitely many values.
• The sum over countably infinitely many terms is not always well defined when both positive and negative terms are involved.
• For example, the infinite series 1 − 1 + 1 − 1 + ... has the sum 0 when you sum the terms according to (1 − 1) + (1 − 1) + ···, whereas you get the sum 1 when you sum the terms according to 1 + (−1 + 1) + (−1 + 1) + (−1 + 1) + ···.
• Such abnormalities cannot happen when all terms in the infinite summation are nonnegative.

It is the convention in probability theory that EX should be evaluated as
EX = Σ_{x≥0} x p_X(x) − Σ_{x<0} (−x) p_X(x).
• If at least one of these sums is finite, then it is clear what value should be assigned as EX.
• If both sums are +∞, then no value is assigned to EX, and we say that EX is undefined.
Example 9.11. Undefined Expectation: Let
p_X(x) = { 1/(2c x²), x = ±1, ±2, ±3, ...;  0, otherwise }
Then,
EX = Σ_{k=1}^{∞} k p_X(k) − Σ_{k=−∞}^{−1} (−k) p_X(k).
The first sum gives
Σ_{k=1}^{∞} k p_X(k) = Σ_{k=1}^{∞} k · 1/(2c k²) = (1/(2c)) Σ_{k=1}^{∞} 1/k = ∞.
The second sum gives
Σ_{k=−∞}^{−1} (−k) p_X(k) = Σ_{k=1}^{∞} k p_X(k) = Σ_{k=1}^{∞} k · 1/(2c k²) = (1/(2c)) Σ_{k=1}^{∞} 1/k = ∞.
Because both sums are infinite, we conclude that EX is undefined.


9.12. More rigorously, to define EX, we let X⁺ = max{X, 0} and X⁻ = −min{X, 0}. Then observe that X = X⁺ − X⁻ and that both X⁺ and X⁻ are nonnegative r.v.'s. We say that a random variable X admits an expectation if EX⁺ and EX⁻ are not both equal to +∞. In which case, EX = EX⁺ − EX⁻.


9.2 Function of a Discrete Random Variable

Given a random variable X, we will often have occasion to define a new random variable by Y ≡ g(X), where g(x) is a real-valued function of the real-valued variable x. More precisely, recall that a random variable X is actually a function taking points ω of the sample space, Ω, into real numbers X(ω). Hence, we have the following definition.

Definition 9.13. The notation Y = g(X) is actually shorthand for Y(ω) := g(X(ω)).
• The random variable Y = g(X) is sometimes called a derived random variable.
Example 9.14. Let
p_X(x) = { (1/c) x², x = 1, 2;  0, otherwise }
and
Y = X⁴.
Find p_Y(y) and then calculate EY.

9.15. For discrete random variable X, the pmf of a derived random variable Y = g(X) is given by
p_Y(y) = Σ_{x: g(x)=y} p_X(x).
Note that the sum is over all x in the support of X which satisfy g(x) = y.
Example 9.16. A binary random variable X takes only two values a and b with
P[X = b] = 1 − P[X = a] = p.
X can be expressed as X = (b − a)I + a, where I is a Bernoulli random variable with parameter p.
9.3 Expectation of a Function of a Discrete Random Variable

Recall that for a discrete random variable X, the pmf of a derived random variable Y = g(X) is given by
p_Y(y) = Σ_{x: g(x)=y} p_X(x).
If we want to compute EY, it might seem that we first have to find the pmf of Y. Typically, this requires a detailed analysis of g which can be complicated, and it is avoided by the following result.
9.17. Suppose X is a discrete random variable. Then,
E[g(X)] = Σ_x g(x) p_X(x).
This is referred to as the law/rule of the lazy/unconscious statistician (LOTUS) [23, Thm 3.6 p 48], [9, p. 149], [8, p. 50] because it is so much easier to use the above formula than to first find the pmf of Y. It is also called the substitution rule [22, p 271].
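A small MATLAB sketch (with an arbitrarily chosen function g, not taken from the notes) shows both routes giving the same answer: LOTUS applied directly to p_X, and the longer route through p_Y from 9.15.

% LOTUS: E[g(X)] = sum of g(x) p_X(x) over the support of X.
x = [0 1 2 3];  p = [8/27 4/9 2/9 1/27];    % pmf of B(3,1/3) from Example 8.50
g = @(t) (2*t - 1).^2;                      % an arbitrary function for illustration
Eg_lotus = sum(g(x) .* p)                   % direct application of 9.17
% The long way: first derive p_Y for Y = g(X) as in 9.15, then average over y.
y  = unique(g(x));
pY = arrayfun(@(v) sum(p(g(x) == v)), y);   % p_Y(y) = sum of p_X(x) over {x: g(x)=y}
Eg_via_pY = sum(y .* pY)                    % same value as Eg_lotus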
Example 9.18. Back to Example 9.14. Recall that
p_X(x) = { (1/c) x², x = 1, 2;  0, otherwise }
(a) When Y = X⁴, EY =

(b) E[2X − 1]

9.19. Caution: A frequently made mistake of beginning students is to set E[g(X)] equal to g(EX). In general, E[g(X)] ≠ g(EX).
(a) In particular, E[1/X] is not the same as 1/EX.
(b) An exception is the case of a linear function g(x) = ax + b. See also (9.23).
Example 9.20. Continue from Example 9.4. For X ∼ Bernoulli(p),
(a) EX = p
(b) E[X²] = 0²·(1 − p) + 1²·p = p ≠ (EX)².
Example 9.21. Continue from Example 9.7. Suppose X ∼ P(λ).
E[X²] = Σ_{i=0}^{∞} i² e^{−λ} λ^i / i! = λ e^{−λ} Σ_{i=1}^{∞} i λ^{i−1}/(i−1)!    (16)
We can evaluate the infinite sum in (16) by rewriting i as i − 1 + 1:
Σ_{i=1}^{∞} i λ^{i−1}/(i−1)! = Σ_{i=1}^{∞} (i − 1 + 1) λ^{i−1}/(i−1)!
 = Σ_{i=1}^{∞} (i − 1) λ^{i−1}/(i−1)! + Σ_{i=1}^{∞} λ^{i−1}/(i−1)!
 = Σ_{i=2}^{∞} λ^{i−1}/(i−2)! + Σ_{i=1}^{∞} λ^{i−1}/(i−1)!
 = λ Σ_{i=2}^{∞} λ^{i−2}/(i−2)! + e^{λ} = λ e^{λ} + e^{λ} = e^{λ}(λ + 1).
Plugging this back into (16), we get
E[X²] = λ(λ + 1) = λ² + λ.
9.22. Continue from Example 9.8. For X ∼ B(n, p), one can find
E[X²] = np(1 − p) + (np)².

9.23. Some Basic Properties of Expectations
(a) For c ∈ ℝ, E[c] = c.
(b) For c ∈ ℝ, E[X + c] = EX + c and E[cX] = cEX.
(c) For constants a, b, we have
E[aX + b] = aEX + b.
(d) For constants c₁ and c₂,
E[c₁ g₁(X) + c₂ g₂(X)] = c₁ E[g₁(X)] + c₂ E[g₂(X)].
(e) For constants c₁, c₂, ..., cₙ,
E[Σ_{k=1}^{n} c_k g_k(X)] = Σ_{k=1}^{n} c_k E[g_k(X)].

Definition 9.24. Some definitions involving expectation of a function of a random variable:
(a) Absolute moment: E[|X|^k], where we define E[|X|⁰] = 1.
(b) Moment: m_k = E[X^k] = the kth moment of X, k ∈ ℕ.
• The first moment of X is its expectation EX.
• The second moment of X is E[X²].

9.4 Variance and Standard Deviation

An average (expectation) can be regarded as one number that
summarizes an entire probability model. After finding an average,
someone who wants to look further into the probability model
might ask, How typical is the average? or, What are the
chances of observing an event far from the average? A measure
of dispersion/deviation/spread is an answer to these questions
wrapped up in a single number. (The opposite of this measure is
the peakedness.) If this measure is small, observations are likely
to be near the average. A high measure of dispersion suggests that
it is not unusual to observe events that are far from the average.
Example 9.25. Consider your score on the midterm exam. After you find out your score is 7 points above average, you are likely to ask, "How good is that? Is it near the top of the class or somewhere near the middle?"
Example 9.26. In the case that the random variable X is the
random payoff in a game that can be repeated many times under
identical conditions, the expected value of X is an informative
measure on the grounds of the law of large numbers. However, the
information provided by EX is usually not sufficient when X is
the random payoff in a nonrepeatable game.
Suppose your investment has yielded a profit of $3,000 and you
must choose between the following two options:
• the first option is to take the sure profit of $3,000 and
• the second option is to reinvest the profit of $3,000 under the scenario that this profit increases to $4,000 with probability 0.8 and is lost with probability 0.2.
The expected profit of the second option is
0.8 × $4,000 + 0.2 × $0 = $3,200
and is larger than the $3,000 from the first option. Nevertheless,
most people would prefer the first option. The downside risk is
too big for them. A measure that takes into account the aspect of
risk is the variance of a random variable. [22, p 35]

9.27. The most important measures of dispersion are the standard deviation and its close relative, the variance.

Definition 9.28. Variance:
Var X = E[(X − EX)²].    (17)
• Read "the variance of X".
• Notation: D_X, or σ²(X), or σ²_X, or VX [23, p. 51].
• In some references, to avoid confusion from the two expectation symbols, they first define m = EX and then define the variance of X by
Var X = E[(X − m)²].
• We can also calculate the variance via another identity:
Var X = E[X²] − (EX)².
• The units of the variance are squares of the units of the random variable.
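Both expressions for Var X are easy to check numerically; the sketch below again borrows the pmf vectors of Example 8.50.

% Var X from the definition (17) and from the identity Var X = E[X^2] - (EX)^2.
x = [0 1 2 3];  p = [8/27 4/9 2/9 1/27];     % pmf of B(3,1/3)
EX = sum(x .* p);
VarX_def      = sum((x - EX).^2 .* p);       % E[(X - EX)^2]
VarX_identity = sum(x.^2 .* p) - EX^2;       % E[X^2] - (EX)^2
disp([VarX_def, VarX_identity])              % both equal np(1-p) = 2/3 here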
9.29. Basic properties of variance:
• Var X ≥ 0.
• Var X ≤ E[X²].
• Var[cX] = c² Var X.
• Var[X + c] = Var X.
• Var[aX + b] = a² Var X.

Definition 9.30. Standard Deviation:
σ_X = √(Var[X]).
• It is useful to work with the standard deviation since it has the same units as EX.
• Informally, we think of outcomes within σ_X of EX as being in the center of the distribution. Some references would informally interpret sample values within σ_X of the expected value, x ∈ [EX − σ_X, EX + σ_X], as "typical" values of X and other values as "unusual".
• σ_{aX+b} = |a| σ_X.

9.31. σ_X and Var X: Note that the √(·) function is a strictly increasing function. Because σ_X = √(Var X), if one of them is large, the other one is also large. Therefore, both values quantify the amount of spread/dispersion in the RV X (which can be observed from the spread or dispersion of the pmf or the histogram or the relative frequency graph). However, Var X does not have the same unit as the RV X.
9.32. In finance, standard deviation is a key concept and is used
to measure the volatility (risk) of investment returns and stock
returns.
It is common wisdom in finance that diversification of a portfolio
of stocks generally reduces the total risk exposure of the investment. We shall return to this point in Example 10.65.
Example 9.33. Continue from Example 9.25. If the standard
deviation of exam scores is 12 points, the student with a score of
+7 with respect to the mean can think of herself in the middle of
the class. If the standard deviation is 3 points, she is likely to be
near the top.
Example 9.34. Suppose X ∼ Bernoulli(p).
(a) E[X²] = 0²·(1 − p) + 1²·p = p.
(b) Var X = E[X²] − (EX)² = p − p² = p(1 − p).
Alternatively, if we directly use (17), we have
Var X = E[(X − EX)²] = (0 − p)²(1 − p) + (1 − p)²p = p(1 − p)(p + (1 − p)) = p(1 − p).
Example 9.35. Continue from Example 9.7 and Example 9.21. Suppose X ∼ P(λ). We have
Var X = E[X²] − (EX)² = λ² + λ − λ² = λ.
Therefore, for a Poisson random variable, the expected value is the same as the variance.
Example 9.36. Consider the two pmfs shown in Figure 11. The random variable X with pmf at the left has a smaller variance than the random variable Y with pmf at the right because more of its probability mass is concentrated near zero (their mean) in the graph at the left than in the graph at the right. [9, p. 85]

Figure 11: Example 9.36 shows that a random variable whose probability mass is concentrated near the mean has smaller variance. (In [9, Ex. 2.27], both X and Y are zero-mean, Var X = E[X²] = 2, and Var Y = E[Y²] = 3.) [9, Fig. 2.9]

9.37. We have already talked about variance and standard deviation as a number that indicates the spread/dispersion of the pmf. More specifically, let's imagine a pmf that is shaped like a bell curve. As the value of σ_X gets smaller, the spread of the pmf will be smaller and hence the pmf would look sharper. Therefore, the probability that the random variable X would take a value that is far from the mean would be smaller.

The next property involves the use of σ_X to bound the tail probability of a random variable.

9.38. Chebyshev's Inequality:
P[|X − EX| ≥ α] ≤ σ²_X / α²
or equivalently
P[|X − EX| ≥ nσ_X] ≤ 1/n²
• Useful only when α > σ_X.
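The bound is usually loose. As a numerical illustration (the parameters n = 20, p = 0.3 below are chosen arbitrarily), the sketch compares the exact two-sided tail probability of a binomial random variable with the Chebyshev bound at α = 2σ_X.

% Chebyshev's inequality vs. the exact tail probability for X ~ B(20, 0.3).
n = 20; q = 0.3;
x = 0:n;  p = arrayfun(@(k) nchoosek(n,k)*q^k*(1-q)^(n-k), x);
EX    = sum(x .* p);
sigma = sqrt(sum((x - EX).^2 .* p));
alpha = 2*sigma;                             % look two standard deviations out
exact = sum(p(abs(x - EX) >= alpha));        % exact P[|X - EX| >= alpha]
bound = sigma^2 / alpha^2;                   % Chebyshev bound, = 1/4 here
disp([exact, bound])                         % the exact value sits well below the bound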


Example 9.39. If X has mean m and variance σ², it is sometimes convenient to introduce the normalized random variable
Y = (X − m)/σ.

Definition 9.40. Central Moments: A generalization of the variance is the nth central moment which is defined to be
μ_n = E[(X − EX)^n].
(a) μ₁ = E[X − EX] = 0.
(b) μ₂ = σ²_X = Var X: the second central moment is the variance.

Sirindhorn International Institute of Technology


Thammasat University
School of Information, Computer and Communication Technology

ECS315 2013/1 Part IV.1 Dr.Prapun


10 Continuous Random Variables

10.1 From Discrete to Continuous Random Variables

In many practical applications of probability, physical situations are better described by random variables that can take on a continuum of possible values rather than a discrete number of values.
For this type of random variable, the interesting fact is that
• any individual value has probability zero:
P[X = x] = 0 for all x    (18)
and that
• the support is always uncountable.
These random variables are called continuous random variables.
10.1. We can see from (18) that the pmf is going to be useless for
this type of random variable. It turns out that the cdf FX is still
useful and we shall introduce another useful function called probability density function (pdf) to replace the role of pmf. However,
integral calculus34 is required to formulate this continuous analog
of a pmf.
10.2. In some cases, the random variable X is actually discrete
but, because the range of possible values is so large, it might be
more convenient to analyze X as a continuous random variable.
³⁴ This is always a difficult concept for the beginning student.

Example 10.3. Suppose that current measurements are read from


a digital instrument that displays the current to the nearest onehundredth of a mA. Because the possible measurements are limited, the random variable is discrete. However, it might be a more
convenient, simple approximation to assume that the current measurements are values of a continuous random variable.
Example 10.4. If you can measure the heights of people with
infinite precision, the height of a randomly chosen person is a continuous random variable. In reality, heights cannot be measured
with infinite precision, but the mathematical analysis of the distribution of heights of people is greatly simplified when using a
mathematical model in which the height of a randomly chosen
person is modeled as a continuous random variable. [22, p 284]
Example 10.5. Continuous random variables are important models for
(a) voltages in communication receivers
(b) file download times on the Internet
(c) velocity and position of an airliner on radar
(d) lifetime of a battery
(e) decay time of a radioactive particle
(f) time until the occurrence of the next earthquake in a certain
region
Example 10.6. The simplest example of a continuous random
variable is the random choice of a number from the interval
(0, 1).
In MATLAB, this can be generated by the command rand.
In Excel, use rand().
The generation is unbiased in the sense that any number
in the range is as likely to occur as another number.
Histogram is flat over (0, 1).

Formally, this is called a uniform RV on the interval (0, 1).



Definition 10.7. We say that X is a continuous random variable³⁵ if we can find a (real-valued) function³⁶ f such that, for any set B, P[X ∈ B] has the form
P[X ∈ B] = ∫_B f(x) dx.    (19)
In particular,
P[a ≤ X ≤ b] = ∫_a^b f(x) dx.    (20)
• In other words, the area under the graph of f(x) between the points a and b gives the probability P[a ≤ X ≤ b].
• The function f is called the probability density function (pdf) or simply density.
• When we want to emphasize that the function f is a density of a particular random variable X, we write f_X instead of f.

³⁵ To be more rigorous, this is the definition for an absolutely continuous random variable. At this level, we will not distinguish between the continuous random variable and the absolutely continuous random variable. When the distinction between them is considered, a random variable X is said to be continuous (not necessarily absolutely continuous) when condition (18) is satisfied. Alternatively, condition (18) is equivalent to requiring the cdf F_X to be continuous. Another fact worth mentioning is that if a random variable is absolutely continuous, then it is continuous. So, absolute continuity is a stronger condition.
³⁶ Strictly speaking, the δ-function is not a function; so, we can't use the δ-function here.


[Figure: Uniform Random Variable on (0,1). Histograms of X = rand(1e3,1) and X = rand(1e5,1) in MATLAB, plotted with hist(X,10); both are roughly flat over (0, 1).]

Figure 12: For a continuous random variable, the probability distribution is described by a curve called the probability density function, f(x). The total area beneath the curve is 1.0, and the probability that X will take on some value between a and b is the area beneath the curve between points a and b.

Example 10.8. For the random variable generated by the rand


command in MATLAB37 or the rand() command in Excel,

Definition 10.9. Recall that the support SX of a random variable


X is any set S such that P [X S] = 1. For continuous random
variable, SX is usually set to be {x : fX (x) > 0}.

³⁷ The rand command in MATLAB is an approximation for two reasons:
(a) It produces pseudorandom numbers; the numbers seem random but are actually the output of a deterministic algorithm.
(b) It produces a double precision floating point number, represented in the computer by 64 bits. Thus MATLAB distinguishes no more than 2⁶⁴ unique double precision floating point numbers. By comparison, there are uncountably infinitely many real numbers in the interval from 0 to 1.

10.2 Properties of PDF and CDF for Continuous Random Variables

10.10. f_X is determined only almost everywhere³⁸. That is, given a pdf f for a random variable X, if we construct a function g by changing the function f at a countable number of points³⁹, then g can also serve as a pdf for X.

10.11. The cdf of any kind of random variable X is defined as
F_X(x) = P[X ≤ x].
Note that even though there is more than one valid pdf for any given random variable, the cdf is unique. There is only one cdf for each random variable.

10.12. For a continuous random variable, given the pdf f_X(x), we can find the cdf of X by
F_X(x) = P[X ≤ x] = ∫_{−∞}^{x} f_X(t) dt.

10.13. Given the cdf F_X(x), we can find the pdf f_X(x) by
• If F_X is differentiable at x, we will set
(d/dx) F_X(x) = f_X(x).
• If F_X is not differentiable at x, we can set the values of f_X(x) to be any value. Usually, the values are selected to give a simple expression. (In many cases, they are simply set to 0.)

³⁸ Lebesgue-a.e., to be exact.
³⁹ More specifically, if g = f Lebesgue-a.e., then g is also a pdf for X.

Example 10.14. For the random variable generated by the rand


command in MATLAB or the rand() command in Excel,

Example 10.15. Suppose that the lifetime X of a device has the cdf
F_X(x) = { 0, x < 0;  (1/4)x², 0 ≤ x ≤ 2;  1, x > 2 }
Observe that it is differentiable at each point x except at x = 2. The probability density function is obtained by differentiation of the cdf, which gives
f_X(x) = { (1/2)x, 0 < x < 2;  0, otherwise. }
At x = 2, where F_X has no derivative, it does not matter what values we give to f_X. Here, we set it to be 0.
10.16. In many situations when you are asked to find pdf, it may
be easier to find cdf first and then differentiate it to get pdf.
Exercise 10.17. A point is picked at random in the inside of a
circular disk with radius r. Let the random variable X denote the
distance from the center of the disk to this point. Find fX (x).
10.18. Unlike the cdf of a discrete random variable, the cdf of a continuous random variable has no jumps and is continuous everywhere.

10.19. p_X(x) = P[X = x] = P[x ≤ X ≤ x] = ∫_x^x f_X(t) dt = 0.
Again, it makes no sense to speak of the probability that X will take on a pre-specified value. This probability is always zero.
10.20. P[X = a] = P[X = b] = 0. Hence,
P[a < X < b] = P[a ≤ X < b] = P[a < X ≤ b] = P[a ≤ X ≤ b].
• The corresponding integrals over an interval are not affected by whether or not the endpoints are included or excluded.
• When we work with continuous random variables, it is usually not necessary to be precise about specifying whether or not a range of numbers includes the endpoints. This is quite different from the situation we encounter with discrete random variables, where it is critical to carefully examine the type of inequality.
10.21. f_X is nonnegative and ∫_ℝ f_X(x) dx = 1.

Example 10.22. Random variable X has pdf
f_X(x) = { c e^{−2x}, x > 0;  0, otherwise }
Find the constant c and sketch the pdf.

Definition 10.23. A continuous random variable is called exponential if its pdf is given by
f_X(x) = { λ e^{−λx}, x > 0;  0, x ≤ 0 }
for some λ > 0.

Theorem 10.24. Any nonnegative⁴⁰ function that integrates to one is a probability density function (pdf) of some random variable [9, p.139].

⁴⁰ or nonnegative a.e.

10.25. Intuition/Interpretation:
• The use of the word "density" originated with the analogy to the distribution of matter in space. In physics, any finite volume, no matter how small, has a positive mass, but there is no mass at a single point. A similar description applies to continuous random variables.
• Approximately, for a small Δx,
P[X ∈ [x, x + Δx]] = ∫_x^{x+Δx} f_X(t) dt ≈ f_X(x) Δx.
This is why we call f_X the density function.

Figure 13: (a) P[a ≤ X ≤ b] = ∫_a^b f(t) dt is the area of the shaded region under the density f(t). (b) P[x ≤ X ≤ x + Δx] is the area of the shaded vertical strip. [9, Fig. 4.1]

• In other words, the probability of random variable X taking on a value in a small interval around point c is approximately equal to f_X(c)Δc when Δc is the length of the interval.
In fact, f_X(x) = lim_{Δx→0} P[x < X ≤ x + Δx] / Δx.
• The number f_X(x) itself is not a probability. In particular, it does not have to be between 0 and 1.
• f_X(c) is a relative measure for the likelihood that random variable X will take a value in the immediate neighborhood of point c.
Stated differently, the pdf f_X(x) expresses how densely the probability mass of random variable X is smeared out in the neighborhood of point x. Hence, the name density function.

10.26. Histogram and pdf [22, p. 143 and 145]:

Figure 14: From histogram to pdf. (A histogram of 5000 samples, the same histogram with the vertical axis rescaled to relative frequency, and the resulting estimated pdf compared with the true pdf.)

(a) A (probability) histogram is a bar chart that divides the range of values covered by the samples/measurements into intervals of the same width, and shows the proportion (relative frequency) of the samples in each interval.
• To make a histogram, you break up the range of values covered by the samples into a number of disjoint adjacent intervals each having the same width, say width Δ. The height of the bar on each interval [jΔ, (j + 1)Δ) is taken such that the area of the bar is equal to the proportion of the measurements falling in that interval (the proportion of measurements within the interval is divided by the width of the interval to obtain the height of the bar).
• The total area under the histogram is thus standardized/normalized to one.
(b) If you take sufficiently many independent samples from a continuous random variable and make the width of the base intervals of the probability histogram smaller and smaller, the graph of the histogram will begin to look more and more like the pdf.

(c) Conclusion: A probability density function can be seen as a smoothed-out version of a probability histogram; see the MATLAB sketch below.
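A minimal MATLAB sketch of this smoothing effect, using N(0,1) samples (any continuous distribution would do): the area-normalized histogram is overlaid on the true pdf.

% Normalized histogram of many samples vs. the pdf it approximates (10.26).
X = randn(1e5, 1);                           % 1e5 samples from N(0,1)
[counts, centers] = hist(X, 50);             % 50 bins of equal width
width = centers(2) - centers(1);
bar(centers, counts/(numel(X)*width), 1);    % scale heights so total area = 1
hold on
t = -4:0.01:4;
plot(t, exp(-t.^2/2)/sqrt(2*pi), 'r', 'LineWidth', 2)   % true N(0,1) pdf
hold off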
10.3 Expectation and Variance

10.27. Expectation: Suppose X is a continuous random variable with probability density function f_X(x).
EX = ∫_{−∞}^{∞} x f_X(x) dx    (21)
E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx    (22)
In particular,
E[X²] = ∫_{−∞}^{∞} x² f_X(x) dx
Var X = ∫_{−∞}^{∞} (x − EX)² f_X(x) dx = E[X²] − (EX)².
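These integrals can be evaluated numerically in MATLAB. The sketch below uses the pdf from Example 10.15, f_X(x) = x/2 on (0, 2), purely as an illustration.

% EX, E[X^2] and Var X by numerical integration for f_X(x) = x/2 on (0,2).
f    = @(x) (x/2) .* (x > 0 & x < 2);        % pdf, zero outside (0,2)
EX   = integral(@(x) x    .* f(x), 0, 2);    % = 4/3
EX2  = integral(@(x) x.^2 .* f(x), 0, 2);    % = 2
VarX = EX2 - EX^2;                           % = 2/9
disp([EX, EX2, VarX])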

Example 10.28. For the random variable generated by the rand


command in MATLAB or the rand() command in Excel,

Example 10.29. For the exponential random variable introduced


in Definition 10.23,


10.30. If we compare other characteristics of discrete and continuous random variables, we find that with discrete random variables,
many facts are expressed as sums. With continuous random variables, the corresponding facts are expressed as integrals.
10.31. All of the properties for the expectation and variance of discrete random variables also work for continuous random variables as well:
(a) Intuition/interpretation of the expected value: As n → ∞, the average of n independent samples of X will approach EX. This observation is known as the Law of Large Numbers.
(b) For c ∈ ℝ, E[c] = c.
(c) For constants a, b, we have E[aX + b] = aEX + b.
(d) E[Σ_{i=1}^{n} c_i g_i(X)] = Σ_{i=1}^{n} c_i E[g_i(X)].
(e) Var X = E[X²] − (EX)².
(f) Var X ≥ 0.
(g) Var X ≤ E[X²].
(h) Var[aX + b] = a² Var X.
(i) σ_{aX+b} = |a| σ_X.
10.32. Chebyshev's Inequality:
P[|X − EX| ≥ α] ≤ σ²_X / α²
or equivalently
P[|X − EX| ≥ nσ_X] ≤ 1/n²
• This inequality uses variance to bound the tail probability of a random variable.
• Useful only when α > σ_X.

Example 10.33. A circuit is designed to handle a current of 20 mA plus or minus a deviation of less than 5 mA. If the applied current has mean 20 mA and variance 4 mA², use the Chebyshev inequality to bound the probability that the applied current violates the design parameters.
Let X denote the applied current. Then X is within the design parameters if and only if |X − 20| < 5. To bound the probability that this does not happen, write
P[|X − 20| ≥ 5] ≤ Var X / 5² = 4/25 = 0.16.
Hence, the probability of violating the design parameters is at most 16%.
10.34. Interesting applications of expectation:
(a) f_X(x) = E[δ(X − x)]
(b) P[X ∈ B] = E[1_B(X)]

Sirindhorn International Institute of Technology


Thammasat University
School of Information, Computer and Communication Technology

ECS315 2013/1 Part IV.2 Dr.Prapun


10.4 Families of Continuous Random Variables

Theorem 10.24 states that any nonnegative function f(x) whose integral over the interval (−∞, +∞) equals 1 can be regarded as a probability density function of a random variable. In real-world applications, however, special mathematical forms naturally show up. In this section, we introduce a couple of families of continuous random variables that frequently appear in practical applications. The probability densities of the members of each family all have the same mathematical form but differ only in one or more parameters.
10.4.1 Uniform Distribution

Definition 10.35. For a uniform random variable on an interval [a, b], we denote its family by uniform([a, b]) or U([a, b]) or simply U(a, b). Expressions that are synonymous with "X is a uniform random variable" are "X is uniformly distributed", "X has a uniform distribution", and "X has a uniform density". This family is characterized by
f_X(x) = { 0, x < a or x > b;  1/(b − a), a ≤ x ≤ b }
• The random variable X is just as likely to be near any value in [a, b] as any other value.

• In MATLAB,
(a) use X = a+(b-a)*rand or X = random('Uniform',a,b) to generate the RV,
(b) use pdf('Uniform',x,a,b) and cdf('Uniform',x,a,b) to calculate the pdf and cdf, respectively.

Exercise 10.36. Show that F_X(x) = { 0, x < a;  (x − a)/(b − a), a ≤ x ≤ b;  1, x > b }

Figure 15: The pdf and cdf for the uniform random variable. [16, Fig. 3.5]
Example 10.37 (F2011). Suppose X is uniformly distributed on the interval (1, 2). (X ∼ U(1, 2).)
(a) Plot the pdf f_X(x) of X.
(b) Plot the cdf F_X(x) of X.

10.38. The uniform distribution provides a probability model for selecting a point at random from the interval [a, b].
• Use with caution to model a quantity that is known to vary randomly between a and b but about which little else is known.
Example 10.39. [9, Ex. 4.1 p. 140-141] In coherent radio communications, the phase difference between the transmitter and the receiver, denoted by Θ, is modeled as having a uniform density on [−π, π].
(a) P[Θ ≤ 0] = 1/2
(b) P[Θ ≤ π/2] = 3/4

Exercise 10.40. Show that EX = (a + b)/2, Var X = (b − a)²/12, and E[X²] = (1/3)(b² + ab + a²).

10.4.2 Gaussian Distribution

10.41. This is the most widely used model for the distribution
of a random variable. When you have many independent random
variables, a fundamental result called the central limit theorem
(CLT) (informally) says that the sum (technically, the average) of
them can often be approximated by normal distribution.
Definition 10.42. Gaussian random variables:
(a) Often called normal random variables because they occur so frequently in practice.
(b) In MATLAB, use X = random('Normal',m,σ) or X = σ*randn + m.
(c) f_X(x) = (1/(√(2π) σ)) e^{−(1/2)((x−m)/σ)²}.
• In Excel, use NORMDIST(x,m,σ,FALSE).
• In MATLAB, use normpdf(x,m,σ) or pdf('Normal',x,m,σ).
• Figure 16 displays the famous bell-shaped graph of the Gaussian pdf. This curve is also called the normal curve.

(d) F_X(x) has no closed-form expression. However, see 10.48.
• In MATLAB, use normcdf(x,m,σ) or cdf('Normal',x,m,σ).
• In Excel, use NORMDIST(x,m,σ,TRUE).
(e) We write X ∼ N(m, σ²).

Figure 16: The pdf and cdf of N(μ, σ²). [16, Fig. 3.6]

10.43. EX = m and Var X = σ².

10.44. Important probabilities:
P[|X − μ| < σ] = 0.6827;
P[|X − μ| > σ] = 0.3173;
P[|X − μ| > 2σ] = 0.0455;
P[|X − μ| < 2σ] = 0.9545.
These values are illustrated in Figure 19.

Example 10.45. Figure 20 compares several deviation scores and the normal distribution:
(a) Standard scores have a mean of zero and a standard deviation of 1.0.
(b) Scholastic Aptitude Test scores have a mean of 500 and a standard deviation of 100.

Figure 17: Electrical activity of a skeletal muscle: (a) A sample skeletal muscle (emg) signal, and (b) its histogram and pdf fits. [16, Fig. 3.14]



Figure 18: Plots of the zero-mean Gaussian pdf for different values of its standard deviation, σ_X. [16, Fig. 3.15]

Figure 19: Probability density function of X ∼ N(μ, σ²).

[A pasted summary sheet with additional N(0,1) facts (Fourier transform, higher moments, Q-function properties) appears here; the Q-function and erf facts are restated in 10.51 and 10.52 below.]

Figure 20: Comparison of Several Deviation Scores and the Normal Distribution.

(c) Binet Intelligence Scale41 scores have a mean of 100 and a


standard deviation of 16.
In each case there are 34 percent of the scores between the
mean and one standard deviation, 14 percent between one and
two standard deviations, and 2 percent beyond two standard
deviations. [Source: Beck, Applying Psychology: Critical and
Creative Thinking.]
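The probabilities in 10.44 (and the 34/14/2 percent figures quoted above) can be reproduced with MATLAB's normcdf; the lines below are only a numerical check.

% The standard-normal probabilities behind 10.44 and Example 10.45.
p1 = normcdf(1) - normcdf(-1);     % P[|X - mu| < sigma]    ~ 0.6827
p2 = normcdf(2) - normcdf(-2);     % P[|X - mu| < 2*sigma]  ~ 0.9545
disp([p1, 1 - p1, 1 - p2, p2])     % 0.6827, 0.3173, 0.0455, 0.9545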
10.46. N(0, 1) is the standard Gaussian (normal) distribution.
• In Excel, use NORMSINV(RAND()).
• In MATLAB, use randn.
• The standard normal cdf is denoted by Φ(z).
– It inherits all properties of cdf.
– Moreover, note that Φ(−z) = 1 − Φ(z).
10.47. Relationship between N(0, 1) and N(m, σ²).
(a) An arbitrary Gaussian random variable with mean m and variance σ² can be represented as σZ + m, where Z ∼ N(0, 1).
⁴¹ Alfred Binet, who devised the first general aptitude test at the beginning of the 20th century, defined intelligence as the ability to make adaptations. The general purpose of the test was to determine which children in Paris could benefit from school. Binet's test, like its subsequent revisions, consists of a series of progressively more difficult tasks that children of different ages can successfully complete. A child who can solve problems typically solved by children at a particular age level is said to have that mental age. For example, if a child can successfully do the same tasks that an average 8-year-old can do, he or she is said to have a mental age of 8. The intelligence quotient, or IQ, is defined by the formula
IQ = 100 × (Mental Age/Chronological Age).
There has been a great deal of controversy in recent years over what intelligence tests measure. Many of the test items depend on either language or other specific cultural experiences for correct answers. Nevertheless, such tests can rather effectively predict school success. If school requires language and the tests measure language ability at a particular point of time in a child's life, then the test is a better-than-chance predictor of school performance.

• This relationship can be used to generate a general Gaussian RV from a standard Gaussian RV.
(b) If X ∼ N(m, σ²), the random variable
Z = (X − m)/σ
is a standard normal random variable. That is, Z ∼ N(0, 1).
• Creating a new random variable by this transformation is referred to as standardizing.
• The standardized variable is called the standard score or z-score.
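A short MATLAB sketch of both directions of 10.47, with arbitrarily chosen m = 5 and σ = 2:

% Generating N(m, sigma^2) from randn, and standardizing back (10.47).
m = 5; sigma = 2;                   % illustrative parameters
Z = randn(1e5, 1);                  % Z ~ N(0,1)
X = sigma*Z + m;                    % X ~ N(5, 4)
disp([mean(X), var(X)])             % close to m and sigma^2
Zback = (X - m)/sigma;              % the z-scores of the X samples
disp([mean(Zback), var(Zback)])     % close to 0 and 1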
10.48. It is impossible to express the integral of a Gaussian PDF between non-infinite limits (e.g., (20)) as a function that appears on most scientific calculators.
• An old but still popular technique to find integrals of the Gaussian PDF is to refer to tables that have been obtained by numerical integration.
• One such table is the table that lists Φ(z) for many values of positive z.
• For X ∼ N(m, σ²), we can show that the CDF of X can be calculated by
F_X(x) = Φ((x − m)/σ).
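In MATLAB the relation F_X(x) = Φ((x − m)/σ) can be checked directly with normcdf (note that normcdf takes the standard deviation σ, not the variance); the numbers below are arbitrary.

% Checking F_X(x) = Phi((x-m)/sigma) for X ~ N(m, sigma^2).
m = 1; sigma = sqrt(2); x = 2;            % illustrative values (variance = 2)
lhs = normcdf(x, m, sigma);               % cdf of N(m, sigma^2) at x
rhs = normcdf((x - m)/sigma);             % standard normal cdf at the z-score
disp([lhs, rhs])                          % identical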

Example 10.49. Suppose Z ∼ N(0, 1). Evaluate the following probabilities.
(a) P[−1 ≤ Z ≤ 1]

(b) P[−2 ≤ Z ≤ 2]

Example 10.50. Suppose X ∼ N(1, 2). Find P[1 ≤ X ≤ 2].

10.51. Q-function:
Q(z) = ∫_z^∞ (1/√(2π)) e^{−x²/2} dx
corresponds to P[X > z] where X ∼ N(0, 1); that is, Q(z) is the probability of the tail of N(0, 1). The Q function is then a complementary cdf (ccdf).

Figure 21: Q-function

(a) Q is a decreasing function with Q(0) = 1/2.
(b) Q(−z) = 1 − Q(z) = Φ(z)
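Since base MATLAB does not ship a Q-function (qfunc requires the Communications System Toolbox), a common workaround is to express Q through erfc, as in 10.52 below; a quick check:

% Q(z) via the complementary error function and via the standard normal cdf.
Q = @(z) 0.5*erfc(z/sqrt(2));       % equivalent to 1 - Phi(z)
z = 0:0.5:3;
disp([Q(z); 1 - normcdf(z)])        % the two rows agree; Q(0) = 0.5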
10.52. Error function (MATLAB):
erf(z) = (2/√π) ∫_0^z e^{−x²} dx = 1 − 2Q(√2 z)
(a) It is an odd function of z.
(b) For z ≥ 0, it corresponds to P[|X| < z] where X ∼ N(0, 1/2).
(c) lim_{z→∞} erf(z) = 1

(d) erf(−z) = −erf(z)
(e) Φ(x) = (1/2)[1 + erf(x/√2)] = (1/2) erfc(−x/√2)
(f) The complementary error function:
erfc(z) = 1 − erf(z) = 2Q(√2 z) = (2/√π) ∫_z^∞ e^{−x²} dx

Figure 22: erf-function and Q-function

10.4.3 Exponential Distribution

Definition 10.53. The exponential distribution is denoted by E(λ).
(a) λ > 0 is a parameter of the distribution, often called the rate parameter.
(b) Characterized by
• f_X(x) = { λ e^{−λx}, x > 0;  0, x ≤ 0 }
• F_X(x) = { 1 − e^{−λx}, x > 0;  0, x ≤ 0 }
• Survival-, survivor-, or reliability-function:

(c) MATLAB:
• X = exprnd(1/λ) or random('exp',1/λ)
• f_X(x) = exppdf(x,1/λ) or pdf('exp',x,1/λ)
• F_X(x) = expcdf(x,1/λ) or cdf('exp',x,1/λ)
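A quick numerical check of these commands, with an arbitrary rate λ = 1.5 (remember that MATLAB's exponential functions are parameterized by the mean 1/λ):

% Exponential distribution commands from 10.53(c), with lambda = 1.5.
lambda = 1.5;
X = exprnd(1/lambda, 1e5, 1);                      % samples of E(lambda)
disp([mean(X), 1/lambda])                          % sample mean vs. theoretical mean
x = 0:0.01:4;
plot(x, exppdf(x, 1/lambda))                       % pdf lambda*exp(-lambda*x), x > 0
disp(expcdf(2, 1/lambda) - expcdf(1, 1/lambda))    % e.g., P[1 < X < 2] as in Example 10.54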
Example 10.54. Suppose X ∼ E(λ), find P[1 < X < 2].

Exercise 10.55. Exponential random variable as a continuous version of geometric random variable: Suppose X ∼ E(λ). Show that ⌊X⌋ ∼ G₀(e^{−λ}) and ⌈X⌉ ∼ G₁(e^{−λ}).

Example 10.56. The exponential distribution is intimately related to the Poisson process. It is often used as a probability model for the (waiting) time until a rare event occurs.
• time elapsed until the next earthquake in a certain region
• decay time of a radioactive particle
• time between independent events such as arrivals at a service facility or arrivals of customers in a shop
• duration of a cell-phone call
• time it takes a computer network to transmit a message from one node to another

10.57. EX =

Example 10.58. Phone Company A charges $0.15 per minute for telephone calls. For any fraction of a minute at the end of a call, they charge for a full minute. Phone Company B also charges $0.15 per minute. However, Phone Company B calculates its charge based on the exact duration of a call. If T, the duration of a call in minutes, is exponential with parameter λ = 1/3, what are the expected revenues per call E[R_A] and E[R_B] for companies A and B?
Solution: First, note that ET = 1/λ = 3. Hence,
E[R_B] = E[0.15 × T] = 0.15 ET = $0.45,
and
E[R_A] = E[0.15 ⌈T⌉] = 0.15 E⌈T⌉.
Now, recall that ⌈T⌉ ∼ G₁(e^{−λ}). Hence, E⌈T⌉ = 1/(1 − e^{−λ}) ≈ 3.53.
Therefore, E[R_A] = 0.15 E⌈T⌉ ≈ 0.5292.

10.59. Memoryless property: The exponential r.v. is the only continuous⁴² r.v. on [0, ∞) that satisfies the memoryless property:
P[X > s + x | X > s] = P[X > x]
for all x > 0 and all s > 0 [18, p. 157-159]. In words, the future is independent of the past. The fact that it hasn't happened yet tells us nothing about how much longer it will take before it does happen.
• Imagining that the exponentially distributed random variable X represents the lifetime of an item, the residual life of an item has the same exponential distribution as the original lifetime, regardless of how long the item has already been in use.

⁴² For discrete random variables, geometric random variables satisfy the memoryless property.

In other words, there is no deterioration/degradation over time. If it is still currently working after 20 years of use, then today, its condition is just like new.
• In particular, suppose we define the set B + x to be {x + b : b ∈ B}. For any x > 0 and set B ⊂ [0, ∞), we have
P[X ∈ B + x | X > x] = P[X ∈ B]
because
P[X ∈ B + x] / P[X > x] = (∫_{B+x} λe^{−λt} dt) / e^{−λx} = (∫_B λe^{−λ(τ+x)} dτ) / e^{−λx} = ∫_B λe^{−λτ} dτ.

10.5 Function of Continuous Random Variables: SISO

Reconsider the derived random variable Y = g(X).
• Recall that we can find EY easily by (22):
EY = E[g(X)] = ∫_ℝ g(x) f_X(x) dx.
• However, there are cases when we have to evaluate probability directly involving the random variable Y or find f_Y(y) directly.
• Recall that for discrete random variables, it is easy to find p_Y(y) by adding all p_X(x) over all x such that g(x) = y:
p_Y(y) = Σ_{x: g(x)=y} p_X(x).    (23)
• For continuous random variables, it turns out that we can't⁴³ simply integrate the pdf of X to get the pdf of Y.

⁴³ When you apply Equation (23) to continuous random variables, what you get is 0 = 0, which is true but not interesting nor useful.

10.60. For Y = g(X), if you want to find f_Y(y), the following two-step procedure will always work and is easy to remember:
(a) Find the cdf F_Y(y) = P[Y ≤ y].
(b) Compute the pdf from the cdf by finding the derivative f_Y(y) = (d/dy) F_Y(y) (as described in 10.13).
10.61. Linear Transformation: Suppose Y = aX + b. Then, the cdf of Y is given by
F_Y(y) = P[Y ≤ y] = P[aX + b ≤ y] = { P[X ≤ (y−b)/a], a > 0;  P[X ≥ (y−b)/a], a < 0. }
Now, by definition, we know that
P[X ≤ (y−b)/a] = F_X((y−b)/a),
and
P[X ≥ (y−b)/a] = P[X > (y−b)/a] + P[X = (y−b)/a] = 1 − F_X((y−b)/a) + P[X = (y−b)/a].
For a continuous random variable, P[X = (y−b)/a] = 0. Hence,
F_Y(y) = { F_X((y−b)/a), a > 0;  1 − F_X((y−b)/a), a < 0. }
Finally, the fundamental theorem of calculus and the chain rule give
f_Y(y) = (d/dy) F_Y(y) = { (1/a) f_X((y−b)/a), a > 0;  −(1/a) f_X((y−b)/a), a < 0. }
Note that we can further simplify the final formula by using the |·| function:
f_Y(y) = (1/|a|) f_X((y−b)/a),  a ≠ 0.    (24)
• Graphically, to get the plots of f_Y, we compress f_X horizontally by a factor of a, scale it vertically by a factor of 1/|a|, and shift it to the right by b.
• Of course, if a = 0, then we get the uninteresting degenerate random variable Y ≡ b.
10.62. Suppose X ∼ N(m, σ²) and Y = aX + b for some constants a and b. Then, we can use (24) to show that Y ∼ N(am + b, a²σ²).
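Formula (24) is easy to verify empirically. The sketch below (with arbitrarily chosen a, b and an exponential input, not taken from the notes) overlays the histogram of Y = aX + b on the density predicted by (24).

% Empirical check of (24) for Y = aX + b with X ~ E(1).
lambda = 1; a = -2; b = 3;                      % illustrative constants
X = exprnd(1/lambda, 1e5, 1);
Y = a*X + b;
[counts, centers] = hist(Y, 60);
width = centers(2) - centers(1);
bar(centers, counts/(numel(Y)*width), 1); hold on
fX = @(x) lambda*exp(-lambda*x) .* (x > 0);     % pdf of X
fY = @(y) (1/abs(a)) * fX((y - b)/a);           % formula (24)
t = linspace(min(Y), max(Y), 400);
plot(t, fY(t), 'r', 'LineWidth', 2); hold off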

Example 10.63. Amplitude modulation in certain communication systems can be accomplished using various nonlinear devices
such as a semiconductor diode. Suppose we model the nonlinear
device by the function Y = X 2 . If the input X is a continuous
random variable, find the density of the output Y = X 2 .


Exercise 10.64 (F2011). Suppose X is uniformly distributed on the interval (1, 2). (X ∼ U(1, 2).) Let Y = 1/X².
(a) Find f_Y(y).
(b) Find EY.

Exercise 10.65 (F2011). Consider the function
g(x) = { x, x ≥ 0;  −x, x < 0. }
Suppose Y = g(X), where X ∼ U(−2, 2).
Remark: The function g operates like a full-wave rectifier in that if a positive input voltage X is applied, the output is Y = X, while if a negative input voltage X is applied, the output is Y = −X.
(a) Find EY.
(b) Plot the cdf of Y.
(c) Find the pdf of Y.

• P[X ∈ B]:  Discrete: Σ_{x∈B} p_X(x).  Continuous: ∫_B f_X(x) dx.
• P[X = x]:  Discrete: p_X(x) = F(x) − F(x⁻).  Continuous: 0.
• Interval probability:  Discrete: P_X((a,b]) = F(b) − F(a), P_X([a,b]) = F(b) − F(a⁻), P_X([a,b)) = F(b⁻) − F(a⁻), P_X((a,b)) = F(b⁻) − F(a).  Continuous: P_X((a,b]) = P_X([a,b]) = P_X([a,b)) = P_X((a,b)) = ∫_a^b f_X(x) dx = F(b) − F(a).
• EX:  Discrete: Σ_x x p_X(x).  Continuous: ∫_{−∞}^{+∞} x f_X(x) dx.
• For Y = g(X):  Discrete: p_Y(y) = Σ_{x: g(x)=y} p_X(x).  Continuous: f_Y(y) = (d/dy) P[g(X) ≤ y]; alternatively, f_Y(y) = Σ_k f_X(x_k)/|g′(x_k)|, where the x_k are the real-valued roots of the equation y = g(x).
• For Y = g(X), P[Y ∈ B]:  Discrete: Σ_{x: g(x)∈B} p_X(x).  Continuous: ∫_{{x: g(x)∈B}} f_X(x) dx.
• E[g(X)]:  Discrete: Σ_x g(x) p_X(x).  Continuous: ∫_{−∞}^{+∞} g(x) f_X(x) dx.
• E[X²]:  Discrete: Σ_x x² p_X(x).  Continuous: ∫_{−∞}^{+∞} x² f_X(x) dx.
• Var X:  Discrete: Σ_x (x − EX)² p_X(x).  Continuous: ∫_{−∞}^{+∞} (x − EX)² f_X(x) dx.

Table 4: Important Formulas for Discrete and Continuous Random Variables

Sirindhorn International Institute of Technology


Thammasat University
School of Information, Computer and Communication Technology

ECS315 2013/1 Part V.1 Dr.Prapun


11 Multiple Random Variables

One is often interested not only in individual random variables, but


also in relationships between two or more random variables. Furthermore, one often wishes to make inferences about one random
variable on the basis of observations of other random variables.
Example 11.1. If the experiment is the testing of a new medicine,
the researcher might be interested in cholesterol level, blood pressure, and the glucose level of a test person.
11.1 A Pair of Discrete Random Variables

In this section, we consider two discrete random variables, say X and Y, simultaneously.
11.2. The analysis here is different from Section 9.2 in two main aspects. First, there may be no deterministic relationship (such as Y = g(X)) between the two random variables. Second, we want to look at both random variables as a whole, not just X alone or Y alone.
Example 11.3. Communication engineers may be interested in
the input X and output Y of a communication channel.


Example 11.4. Of course, to rigorously define (any) random variables, we need to go back to the sample space Ω. Recall Example 7.4 where we considered several random variables defined on the sample space Ω = {1, 2, 3, 4, 5, 6} where the outcomes are equally likely. In that example, we defined X(ω) = ω and Y(ω) = (ω − 3)².

Example 11.5. Consider the scores of 20 students below:
10, 9, 10, 9, 9, 10, 9, 10, 10, 9  (Room #1),   1, 3, 4, 6, 5, 5, 3, 3, 1, 3  (Room #2).
The first ten scores are from (ten) students in room #1. The last 10 scores are from (ten) students in room #2.
Suppose we have a score report card for each student. Then, in total, we have 20 report cards.

Figure 23: In Example 11.5, we pick a report card randomly from a pile of cards.

I pick one report card up randomly. Let X be the score on that


card.
What is the chance that X > 5? (Ans: P [X > 5] = 11/20.)
What is the chance that X = 10? (Ans: p_X(10) = P[X = 10] = 5/20 = 1/4.)
Now, let the random variable Y denote the room# of the student
whose report card is picked up.

What is the probability that X = 10 and Y = 2?

What is the probability that X = 10 and Y = 1?

What is the probability that X > 5 and Y = 1?

What is the probability that X > 5 and Y = 2?

Now suppose someone informs me that the report card which I


picked up is from a student in room #1. (He may be able to tell
this by the color of the report card of which I have no knowledge.)
I now have an extra information that Y = 1.
What is the probability that X > 5 given that Y = 1?

What is the probability that X = 10 given that Y = 1?


11.6. Recall that, in probability, "," means "and". For example,
P[X = x, Y = y] = P[X = x and Y = y]
and
P[3 ≤ X < 4, Y < 1] = P[3 ≤ X < 4 and Y < 1] = P[X ∈ [3, 4) and Y ∈ (−∞, 1)].
In general,
[Some condition(s) on X, Some condition(s) on Y]
is the same as the intersection of the individual statements:
[Some condition(s) on X] ∩ [Some condition(s) on Y],
which simply means both statements happen.
More technically,
[X ∈ B, Y ∈ C] = [X ∈ B and Y ∈ C] = [X ∈ B] ∩ [Y ∈ C]
and
P[X ∈ B, Y ∈ C] = P[X ∈ B and Y ∈ C] = P([X ∈ B] ∩ [Y ∈ C]).
Remark: Linking back to the original sample space, this shorthand actually says
[X ∈ B, Y ∈ C] = [X ∈ B and Y ∈ C]
 = {ω ∈ Ω : X(ω) ∈ B and Y(ω) ∈ C}
 = {ω ∈ Ω : X(ω) ∈ B} ∩ {ω ∈ Ω : Y(ω) ∈ C}
 = [X ∈ B] ∩ [Y ∈ C].

11.7. The concept of conditional probability can be straightforwardly applied to discrete random variables. For example,
P[Some condition(s) on X | Some condition(s) on Y]    (25)
is the conditional probability P(A|B) where
A = [Some condition(s) on X] and
B = [Some condition(s) on Y].
Recall that P(A|B) = P(A ∩ B)/P(B). Therefore,
P[X = x | Y = y] = P[X = x and Y = y] / P[Y = y],
and
P[3 ≤ X < 4 | Y < 1] = P[3 ≤ X < 4 and Y < 1] / P[Y < 1].
More generally, (25) is
P([Some condition(s) on X] ∩ [Some condition(s) on Y]) / P([Some condition(s) on Y])
 = P([Some condition(s) on X, Some condition(s) on Y]) / P([Some condition(s) on Y])
 = P[Some condition(s) on X, Some condition(s) on Y] / P[Some condition(s) on Y].
More technically,
P[X ∈ B | Y ∈ C] = P([X ∈ B] | [Y ∈ C]) = P([X ∈ B] ∩ [Y ∈ C]) / P([Y ∈ C]) = P[X ∈ B, Y ∈ C] / P[Y ∈ C].

Definition 11.8. Joint pmf: If X and Y are two discrete random variables (defined on the same sample space with probability measure P), the function p_{X,Y}(x, y) defined by
p_{X,Y}(x, y) = P[X = x, Y = y]
is called the joint probability mass function of X and Y.
(a) We can visualize the joint pmf via a stem plot. See Figure 24.
(b) To evaluate the probability for a statement that involves both X and Y random variables:
• We first find all pairs (x, y) that satisfy the condition(s) in the statement, and then add up all the corresponding values of the joint pmf.
• More technically, we can then evaluate P[(X, Y) ∈ R] by
P[(X, Y) ∈ R] = Σ_{(x,y): (x,y)∈R} p_{X,Y}(x, y).

Example 11.9 (F2011). Consider random variables X and Y whose joint pmf is given by
p_{X,Y}(x, y) = { c(x + y), x ∈ {1, 3} and y ∈ {2, 4};  0, otherwise. }
(a) Check that c = 1/20.
(b) Find P[X² + Y² = 13].

In most situations, it is much more convenient to focus on the "important" part of the joint pmf. To do this, we usually present the joint pmf (and the conditional pmf) in their matrix forms:

Definition 11.10. When both X and Y take finitely many values (both have finite supports), say S_X = {x₁, ..., x_m} and S_Y = {y₁, ..., yₙ}, respectively, we can arrange the probabilities p_{X,Y}(x_i, y_j) in an m × n matrix
[ p_{X,Y}(x₁, y₁)  p_{X,Y}(x₁, y₂)  ···  p_{X,Y}(x₁, yₙ)
  p_{X,Y}(x₂, y₁)  p_{X,Y}(x₂, y₂)  ···  p_{X,Y}(x₂, yₙ)
      ⋮                 ⋮                     ⋮
  p_{X,Y}(x_m, y₁)  p_{X,Y}(x_m, y₂)  ···  p_{X,Y}(x_m, yₙ) ]    (26)
• We shall call this matrix the joint pmf matrix.
• The sum of all the entries in the matrix is one.

Figure 24: Example of the plot of a joint pmf. [9, Fig. 2.8]

• p_{X,Y}(x, y) = 0 if⁴⁴ x ∉ S_X or y ∉ S_Y. In other words, we don't have to consider the x and y outside the supports of X and Y, respectively.

⁴⁴ To see this, note that p_{X,Y}(x, y) can not exceed p_X(x) because P(A ∩ B) ≤ P(A). Now, suppose at x = a, we have p_X(a) = 0. Then p_{X,Y}(a, y) must also = 0 for any y because it can not exceed p_X(a) = 0. Similarly, suppose at y = a, we have p_Y(a) = 0. Then p_{X,Y}(x, a) = 0 for any x.

11.11. From the joint pmf, we can find pX(x) and pY(y) by

pX(x) = Σ_y pX,Y(x, y)        (27)
pY(y) = Σ_x pX,Y(x, y)        (28)

In this setting, pX(x) and pY(y) are called the marginal pmfs (to distinguish them from the joint one).

(a) Suppose we have the joint pmf matrix in (26). Then, the sum of the entries in the ith row is⁴⁵ pX(xi), and the sum of the entries in the jth column is pY(yj):

pX(xi) = Σ_{j=1}^{n} pX,Y(xi, yj)   and   pY(yj) = Σ_{i=1}^{m} pX,Y(xi, yj).

(b) In MATLAB, suppose we save the joint pmf matrix as P_XY; then the marginal pmf (row) vectors p_X and p_Y can be found by

p_X = (sum(P_XY,2))'
p_Y = (sum(P_XY,1))
Example 11.12. Consider the following joint pmf matrix:

⁴⁵ To see this, we consider A = [X = xi] and a collection defined by Bj = [Y = yj] and B0 = [Y ∉ SY]. Note that the collection B0, B1, ..., Bn partitions Ω. So, P(A) = Σ_{j=0}^{n} P(A ∩ Bj). Of course, because the support of Y is SY, we have P(A ∩ B0) = 0. Hence, the sum can start at j = 1 instead of j = 0.

Definition 11.13. The conditional pmf of X given Y is defined as

pX|Y(x|y) = P[X = x | Y = y],

which gives

pX,Y(x, y) = pX|Y(x|y) pY(y) = pY|X(y|x) pX(x).        (29)

11.14. Equation (29) is quite important in practice. In most cases, systems are naturally defined/given/studied in terms of their conditional probabilities, say pY|X(y|x). Therefore, it is important that we know how to construct the joint pmf from the conditional pmf.
Example 11.15. Consider a binary symmetric channel. Suppose
the input X to the channel is Bernoulli(0.3). At the output Y of
this channel, the crossover (bit-flipped) probability is 0.1. Find
the joint pmf pX,Y (x, y) of X and Y .
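A minimal MATLAB sketch of the construction in (29) for this channel (the variable names are mine): each row of the conditional-pmf matrix holds pY|X(·|x), and scaling row x by pX(x) gives the joint pmf matrix.

% Binary symmetric channel: X ~ Bernoulli(0.3), crossover probability 0.1
p_X = [0.7 0.3];                 % p_X(0), p_X(1)
P_Y_given_X = [0.9 0.1;          % row for x = 0: p_{Y|X}(0|0), p_{Y|X}(1|0)
               0.1 0.9];         % row for x = 1: p_{Y|X}(0|1), p_{Y|X}(1|1)
P_XY = diag(p_X) * P_Y_given_X   % (29): p_{X,Y}(x,y) = p_{Y|X}(y|x) p_X(x)
                                 % gives [0.63 0.07; 0.03 0.27]; entries sum to 1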

Exercise 11.16. Toss-and-Roll Game:

Step 1: Toss a fair coin. Define X by

X = { 1, if result = H,
    { 0, if result = T.

Step 2: You have two dice, Dice 1 and Dice 2. Dice 1 is fair. Dice 2 is unfair with p(1) = p(2) = p(3) = 2/9 and p(4) = p(5) = p(6) = 1/9.

(i) If X = 0, roll Dice 1.
(ii) If X = 1, roll Dice 2.

Record the result as Y.

Find the joint pmf pX,Y(x, y) of X and Y.
Exercise 11.17 (F2011). Continue from Example 11.9. Random variables X and Y have the following joint pmf:

pX,Y(x, y) = { c(x + y),  x ∈ {1, 3} and y ∈ {2, 4},
             { 0,          otherwise.

(a) Find pX(x).

(b) Find EX.

(c) Find pY|X(y|1). Note that your answer should be of the form

pY|X(y|1) = { ?,  y = 2,
            { ?,  y = 4,
            { 0,  otherwise.

(d) Find pY|X(y|3).

Definition 11.18. The joint cdf of X and Y is defined by

FX,Y(x, y) = P[X ≤ x, Y ≤ y].

Definition 11.19. Two random variables X and Y are said to be identically distributed if, for every B, P[X ∈ B] = P[Y ∈ B].

Example 11.20. Let X ~ Bernoulli(1/2). Let Y = X and Z = 1 − X. Then, all of these random variables are identically distributed.
11.21. The following statements are equivalent:

(a) Random variables X and Y are identically distributed.
(b) For every B, P[X ∈ B] = P[Y ∈ B].
(c) pX(c) = pY(c) for all c.
(d) FX(c) = FY(c) for all c.

Definition 11.22. Two random variables X and Y are said to be independent if the events [X ∈ B] and [Y ∈ C] are independent for all sets B and C.

11.23. The following statements are equivalent:

(a) Random variables X and Y are independent.
(b) [X ∈ B] ⫫ [Y ∈ C] for all B, C.
(c) P[X ∈ B, Y ∈ C] = P[X ∈ B] P[Y ∈ C] for all B, C.
(d) pX,Y(x, y) = pX(x) pY(y) for all x, y.
(e) FX,Y(x, y) = FX(x) FY(y) for all x, y.

Definition 11.24. Two random variables X and Y are said to be independent and identically distributed (i.i.d.) if X and Y are both independent and identically distributed.

11.25. Being identically distributed does not imply independence. Similarly, being independent does not imply being identically distributed.

Example 11.26. Roll a die. Let X be the result. Set Y = X.

Example 11.27. Suppose the pmf of a random variable X is given by

pX(x) = { 1/4,  x = 3,
        { ?,    x = 4,
        { 0,    otherwise.

Let Y be another random variable. Assume that X and Y are i.i.d.

Find

(a) the missing value (?) in the pmf above,
(b) the pmf of Y, and
(c) the joint pmf of X and Y.

Example 11.28. Consider a pair of random variables X and Y whose joint pmf is given by

pX,Y(x, y) = { 1/15,  x = 3, y = 1,
             { 2/15,  x = 4, y = 1,
             { 4/15,  x = 3, y = 3,
             { ?,     x = 4, y = 3,
             { 0,     otherwise.

(a) Are X and Y identically distributed?

(b) Are X and Y independent?

11.2 Extending the Definitions to Multiple RVs

Definition 11.29. Joint pmf:

pX1,X2,...,Xn(x1, x2, ..., xn) = P[X1 = x1, X2 = x2, ..., Xn = xn].

Joint cdf:

FX1,X2,...,Xn(x1, x2, ..., xn) = P[X1 ≤ x1, X2 ≤ x2, ..., Xn ≤ xn].

11.30. Marginal pmf:

Definition 11.31. Identically distributed random variables: The following statements are equivalent.

(a) Random variables X1, X2, ... are identically distributed.
(b) For every B, P[Xj ∈ B] does not depend on j.
(c) pXi(c) = pXj(c) for all c, i, j.
(d) FXi(c) = FXj(c) for all c, i, j.

Definition 11.32. Independence among a finite number of random variables: The following statements are equivalent.

(a) X1, X2, ..., Xn are independent.
(b) [X1 ∈ B1], [X2 ∈ B2], ..., [Xn ∈ Bn] are independent for all B1, B2, ..., Bn.
(c) P[Xi ∈ Bi for all i] = ∏_{i=1}^{n} P[Xi ∈ Bi] for all B1, B2, ..., Bn.
(d) pX1,X2,...,Xn(x1, x2, ..., xn) = ∏_{i=1}^{n} pXi(xi) for all x1, x2, ..., xn.
(e) FX1,X2,...,Xn(x1, x2, ..., xn) = ∏_{i=1}^{n} FXi(xi) for all x1, x2, ..., xn.

Example 11.33. Toss a coin n times. For the ith toss, let

Xi = { 1, if H happens on the ith toss,
     { 0, if T happens on the ith toss.

We then have a collection of i.i.d. random variables X1, X2, X3, ..., Xn.

Example 11.34. Roll a die n times. Let Ni be the result of the ith roll. We then have another collection of i.i.d. random variables N1, N2, N3, ..., Nn.

Example 11.35. Let X1 be the result of tossing a coin. Set X2 = X3 = ... = Xn = X1.

11.36. If X1, X2, ..., Xn are independent, then so is any subcollection of them.

11.37. For i.i.d. Xi ~ Bernoulli(p), Y = X1 + X2 + ... + Xn is B(n, p).

Definition 11.38. A pairwise independent collection of random variables is a collection of random variables any two of which are independent.

(a) Any collection of (mutually) independent random variables is pairwise independent.

(b) Some pairwise independent collections are not independent. See Example 11.39.

Example 11.39. Suppose X, Y, and Z have the following joint probability distribution: pX,Y,Z(x, y, z) = 1/4 for (x, y, z) ∈ {(0,0,0), (0,1,1), (1,0,1), (1,1,0)}. This, for example, can be constructed by starting with independent X and Y that are Bernoulli(1/2). Then set Z = X ⊕ Y = (X + Y) mod 2.

(a) X, Y, Z are pairwise independent.

(b) X, Y, Z are not independent.

ECS315 2013/1, Part V.2, Dr.Prapun

11.3 Function of Discrete Random Variables

11.40. Recall that for a discrete random variable X, the pmf of a derived random variable Y = g(X) is given by

pY(y) = Σ_{x: g(x)=y} pX(x).

Similarly, for discrete random variables X and Y, the pmf of a derived random variable Z = g(X, Y) is given by

pZ(z) = Σ_{(x,y): g(x,y)=z} pX,Y(x, y).

Example 11.41. Suppose the joint pmf of X and Y is given by

pX,Y(x, y) = { 1/15,  x = 0, y = 0,
             { 2/15,  x = 1, y = 0,
             { 4/15,  x = 0, y = 1,
             { 8/15,  x = 1, y = 1,
             { 0,     otherwise.

Let Z = X + Y. Find the pmf of Z.
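Following 11.40, a small MATLAB sketch (my own variable names) that computes pZ for this example by summing pX,Y(x, y) over the pairs with x + y = z:

x = [0 1];  y = [0 1];
P_XY = [1/15 4/15;                     % rows indexed by x = 0,1; columns by y = 0,1
        2/15 8/15];
z = 0:2;  p_Z = zeros(size(z));
for i = 1:numel(x)
  for j = 1:numel(y)
    k = find(z == x(i) + y(j));        % which value of Z this (x,y) pair contributes to
    p_Z(k) = p_Z(k) + P_XY(i,j);
  end
end
p_Z                                    % = [1/15 6/15 8/15]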

Exercise 11.42 (F2011). Continue from Example 11.9. Let Z = X + Y.

(a) Find the pmf of Z.

(b) Find EZ.

11.43. In general, when Z = X + Y,

pZ(z) = Σ_{(x,y): x+y=z} pX,Y(x, y) = Σ_y pX,Y(z − y, y) = Σ_x pX,Y(x, z − x).

Furthermore, if X and Y are independent,

pZ(z) = Σ_{(x,y): x+y=z} pX(x) pY(y)                       (30)
      = Σ_y pX(z − y) pY(y) = Σ_x pX(x) pY(z − x).         (31)

Example 11.44. Suppose Λ1 ~ P(λ1) and Λ2 ~ P(λ2) are independent. Let Λ = Λ1 + Λ2. Use (31) to show⁴⁶ that Λ ~ P(λ1 + λ2).

First, note that pΛ(ℓ) would be positive only on nonnegative integers because a sum of nonnegative integers (Λ1 and Λ2) is still a nonnegative integer. So, the support of Λ is the same as the support of Λ1 and Λ2. Now, we know, from (31), that

P[Λ = ℓ] = P[Λ1 + Λ2 = ℓ] = Σ_i P[Λ1 = i] P[Λ2 = ℓ − i].

Of course, we are interested in ℓ that is a nonnegative integer. The summation runs over i = 0, 1, 2, .... Other values of i would make P[Λ1 = i] = 0. Note also that if i > ℓ, then ℓ − i < 0 and P[Λ2 = ℓ − i] = 0. Hence, we conclude that the index i can only be an integer from 0 to ℓ:

P[Λ = ℓ] = Σ_{i=0}^{ℓ} e^{−λ1} (λ1^i / i!) e^{−λ2} (λ2^{ℓ−i} / (ℓ − i)!)
         = e^{−(λ1+λ2)} (1/ℓ!) Σ_{i=0}^{ℓ} (ℓ! / (i!(ℓ − i)!)) λ1^i λ2^{ℓ−i}
         = e^{−(λ1+λ2)} (λ1 + λ2)^ℓ / ℓ!,

where the last equality is from the binomial theorem. Hence, the sum of two independent Poisson random variables is still Poisson!

pΛ(ℓ) = { e^{−(λ1+λ2)} (λ1 + λ2)^ℓ / ℓ!,   ℓ ∈ {0, 1, 2, ...},
        { 0,                               otherwise.

⁴⁶ Remark: You may feel that simplifying the sum in this example (and in Exercise 11.45) is difficult and tedious. In Section 13, we will introduce another technique which will make the answer obvious. The idea is to realize that (31) is a convolution and hence we can use the Fourier transform to work with a product in another domain.
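A quick numerical sanity check of this result in MATLAB (λ1, λ2 and the truncation point are arbitrary choices of mine): convolving the two pmfs, as in (31), should reproduce the Poisson(λ1 + λ2) pmf up to rounding error.

lambda1 = 2;  lambda2 = 3;  k = 0:30;                % truncate the infinite supports
p1 = exp(-lambda1) * lambda1.^k ./ factorial(k);
p2 = exp(-lambda2) * lambda2.^k ./ factorial(k);
pSum = conv(p1, p2);                                  % discrete convolution, as in (31)
pPoisson = exp(-(lambda1+lambda2)) * (lambda1+lambda2).^k ./ factorial(k);
max(abs(pSum(1:numel(k)) - pPoisson))                 % at the level of rounding error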
Exercise 11.45. Suppose B1 ~ B(n1, p) and B2 ~ B(n2, p) are independent. Let B = B1 + B2. Use (31) to show that B ~ B(n1 + n2, p).

11.4 Expectation of Function of Discrete Random Variables

11.46. Recall that the expected value of any function g of a discrete random variable X can be calculated from

E[g(X)] = Σ_x g(x) pX(x).

Similarly⁴⁷, the expected value of any function g of two discrete random variables X and Y can be calculated from

E[g(X, Y)] = Σ_x Σ_y g(x, y) pX,Y(x, y).

⁴⁷ Again, these are called the law/rule of the lazy statistician (LOTUS) [22, Thm 3.6 p 48], [9, p. 149] because it is so much easier to use the above formula than to first find the pmf of g(X) or g(X, Y). It is also called the substitution rule [21, p 271].
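When the joint pmf is stored as a matrix (as in 11.11), this double sum is a single MATLAB expression. A small sketch (my own variable names; the pmf is the one from Example 11.41) computing E[XY]:

x = [0 1];  y = [0 1];
P_XY = [1/15 4/15; 2/15 8/15];          % rows: x = 0,1; columns: y = 0,1
[Yg, Xg] = meshgrid(y, x);              % Xg(i,j) = x(i), Yg(i,j) = y(j)
E_XY = sum(sum(Xg .* Yg .* P_XY))       % LOTUS with g(x,y) = x*y; here = 8/15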

Table 5: Joint pmf: A Summary

  P[X ∈ B]                      Σ_{x ∈ B} pX(x)
  P[(X, Y) ∈ R]                 Σ_{(x,y): (x,y) ∈ R} pX,Y(x, y)
  Joint to Marginal:            pX(x) = Σ_y pX,Y(x, y)
  (Law of Total Prob.)          pY(y) = Σ_x pX,Y(x, y)
  P[X > Y]                      Σ_x Σ_{y: y<x} pX,Y(x, y) = Σ_y Σ_{x: x>y} pX,Y(x, y)
  P[X = Y]                      Σ_x pX,Y(x, x)
  X ⫫ Y                         pX,Y(x, y) = pX(x) pY(y)
  Conditional                   pX|Y(x|y) = pX,Y(x, y) / pY(y)
  E[g(X, Y)]                    Σ_x Σ_y g(x, y) pX,Y(x, y)

11.47. E[·] is a linear operator: E[aX + bY] = aEX + bEY.

(a) Homogeneous: E[cX] = cEX

(b) Additive: E[X + Y] = EX + EY

(c) Extension: E[Σ_{i=1}^{n} ci gi(Xi)] = Σ_{i=1}^{n} ci E[gi(Xi)].

Example 11.48. Recall from 11.37 that when the Xi are i.i.d. Bernoulli(p), Y = X1 + X2 + ... + Xn is B(n, p). Also, from Example 9.4, we have EXi = p. Hence,

EY = E[Σ_{i=1}^{n} Xi] = Σ_{i=1}^{n} E[Xi] = Σ_{i=1}^{n} p = np.

Therefore, the expectation of a binomial random variable with parameters n and p is np.

Example 11.49. A binary communication link has bit-error probability p. What is the expected number of bit errors in a transmission of n bits?

Theorem 11.50 (Expectation and Independence). Two random variables X and Y are independent if and only if

E[h(X)g(Y)] = E[h(X)] E[g(Y)]

for all functions h and g.

In other words, X and Y are independent if and only if for every pair of functions h and g, the expectation of the product h(X)g(Y) is equal to the product of the individual expectations. One special case is that

X ⫫ Y implies E[XY] = EX·EY.        (32)

However, independence means more than this property. In other words, having E[XY] = (EX)(EY) does not necessarily imply X ⫫ Y. See Example 11.62.
11.51. Let's combine what we have just learned about independence into the definition/equivalent statements that we already have in 11.32. The following statements are equivalent:

(a) Random variables X and Y are independent.
(b) [X ∈ B] ⫫ [Y ∈ C] for all B, C.
(c) P[X ∈ B, Y ∈ C] = P[X ∈ B] P[Y ∈ C] for all B, C.
(d) pX,Y(x, y) = pX(x) pY(y) for all x, y.
(e) FX,Y(x, y) = FX(x) FY(y) for all x, y.
(f) E[h(X)g(Y)] = E[h(X)] E[g(Y)] for all functions h and g.

Exercise 11.52 (F2011). Suppose X and Y are i.i.d. with EX = EY = 1 and Var X = Var Y = 2. Find Var[XY].

11.53. To quantify the amount of dependence between two random variables, we may calculate their mutual information. This quantity is crucial in the study of digital communications and information theory. However, in an introductory probability class (and an introductory communication class), it is traditionally omitted.
11.5 Linear Dependence

Definition 11.54. Given two random variables X and Y, we may calculate the following quantities:

(a) Correlation: E[XY].

(b) Covariance: Cov[X, Y] = E[(X − EX)(Y − EY)].

(c) Correlation coefficient: ρX,Y = Cov[X, Y] / (σX σY).

Exercise 11.55 (F2011). Continue from Example 11.9.

(a) Find E[XY].

(b) Check that Cov[X, Y] = −1/25.

11.56. Cov[X, Y] = E[(X − EX)(Y − EY)] = E[XY] − EX·EY.

• Note that Var X = Cov[X, X].

11.57. Var[X + Y] = Var X + Var Y + 2 Cov[X, Y].

Definition 11.58. X and Y are said to be uncorrelated if and only if Cov[X, Y] = 0.

11.59. The following statements are equivalent:

(a) X and Y are uncorrelated.
(b) Cov[X, Y] = 0.
(c) E[XY] = EX·EY.
(d)

11.60. Independence implies uncorrelatedness; that is, if X ⫫ Y, then Cov[X, Y] = 0. The converse is not true. Uncorrelatedness does not imply independence. See Example 11.62.

11.61. The variance of the sum of uncorrelated (or independent) random variables is the sum of their variances.

Example 11.62. Let X be uniform on {±1, ±2} and Y = |X|.

Exercise 11.63. Suppose two fair dice are tossed. Denote by the random variable V1 the number appearing on the first die and by the random variable V2 the number appearing on the second die. Let X = V1 + V2 and Y = V1 − V2.

(a) Show that X and Y are not independent.

(b) Show that E[XY] = EX·EY.

11.64. Cauchy-Schwarz inequality:

(Cov[X, Y])² ≤ σX² σY²

11.65. Cov[aX + b, cY + d] = ac Cov[X, Y]

Cov[aX + b, cY + d] = E[((aX + b) − E[aX + b])((cY + d) − E[cY + d])]
                    = E[((aX + b) − (aEX + b))((cY + d) − (cEY + d))]
                    = E[(aX − aEX)(cY − cEY)]
                    = ac E[(X − EX)(Y − EY)]
                    = ac Cov[X, Y].

Definition 11.66. Correlation coefficient:

ρX,Y = Cov[X, Y] / (σX σY) = E[((X − EX)/σX)((Y − EY)/σY)] = (E[XY] − EX·EY) / (σX σY).

• ρX,Y is dimensionless.
• ρX,X = 1.
• ρX,Y = 0 if and only if X and Y are uncorrelated.

11.67. Linear Dependence and the Cauchy-Schwarz Inequality

(a) If Y = aX + b, then ρX,Y = sign(a) = { 1,   a > 0,
                                         { −1,  a < 0.

    To be rigorous, we should also require that σX > 0 and a ≠ 0.

(b) Cauchy-Schwarz inequality: |ρX,Y| ≤ 1. In other words, ρX,Y ∈ [−1, 1].

(c) When σY, σX > 0, equality occurs if and only if the following conditions hold:

    ≡ ∃ a ≠ 0 such that (X − EX) = a(Y − EY)
    ≡ ∃ a ≠ 0 and b ∈ R such that X = aY + b
    ≡ ∃ c ≠ 0 and d ∈ R such that Y = cX + d
    ≡ |ρX,Y| = 1

    In this case, |a| = σX/σY and ρX,Y = a/|a| = sgn(a). Hence, ρX,Y is used to quantify linear dependence between X and Y. The closer |ρX,Y| is to 1, the higher the degree of linear dependence between X and Y.

Example 11.68. [21, Section 5.2.3] Consider an important fact that investment experience supports: spreading investments over a variety of funds (diversification) diminishes risk. To illustrate, imagine that the random variable X is the return on every invested dollar in a local fund, and random variable Y is the return on every invested dollar in a foreign fund. Assume that random variables X and Y are i.i.d. with expected value 0.15 and standard deviation 0.12.

If you invest all of your money, say c, in either the local or the foreign fund, your return R would be cX or cY.

• The expected return is ER = cEX = cEY = 0.15c.
• The standard deviation is cσX = cσY = 0.12c.

Now imagine that your money is equally distributed over the two funds. Then, the return R is (1/2)cX + (1/2)cY. The expected return is ER = (1/2)cEX + (1/2)cEY = 0.15c. Hence, the expected return remains at 15%. However,

Var R = Var[(c/2)(X + Y)] = (c²/4) Var X + (c²/4) Var Y = (c²/2) 0.12².

So, the standard deviation is (0.12/√2)c ≈ 0.0849c.

In comparison with the distributions of X and Y, the pmf of (1/2)(X + Y) is concentrated more around the expected value. The centralization of the distribution as random variables are averaged together is a manifestation of the central limit theorem.
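A hedged Monte Carlo illustration of this variance reduction (the Gaussian model and sample size are my own choices; the example only specifies the mean 0.15 and standard deviation 0.12):

N = 1e6;
X = 0.15 + 0.12*randn(N,1);     % stand-in i.i.d. returns: mean 0.15, sd 0.12
Y = 0.15 + 0.12*randn(N,1);
std(X)                          % ~ 0.12      (all money in a single fund)
std((X + Y)/2)                  % ~ 0.12/sqrt(2) ~ 0.0849 (equally split)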

11.69. [21, Section 5.2.3] Example 11.68 is based on the assumption that return rates X and Y are independent of each other. In the world of investment, however, risks are more commonly reduced by combining negatively correlated funds (two funds are negatively correlated when one tends to go up as the other falls).

This becomes clear when one considers the following hypothetical situation. Suppose that two stock market outcomes ω1 and ω2 are possible, and that each outcome will occur with a probability of 1/2. Assume that the domestic and foreign fund returns X and Y are determined by X(ω1) = Y(ω2) = 0.25 and X(ω2) = Y(ω1) = −0.10. Each of the two funds then has an expected return of 7.5%, with equal probability for actual returns of 25% and −10%. The random variable Z = (1/2)(X + Y) satisfies Z(ω1) = Z(ω2) = 0.075. In other words, Z is equal to 0.075 with certainty. This means that an investment that is equally divided between the domestic and foreign funds has a guaranteed return of 7.5%.

Exercise 11.70. The input X and output Y of a system subject to random perturbations are described probabilistically by the following joint pmf matrix:

                 y
             2     4     5
   x   1   0.02  0.10  0.08
       3   0.08  0.32  0.40

(a) Evaluate the following quantities.

(i) EX
(ii) P[X = Y]
(iii) P[XY < 6]
(iv) E[(X − 3)(Y − 2)]
(v) E[X(Y³ − 11Y² + 38Y)]
(vi) Cov[X, Y]
(vii) ρX,Y

(b) Calculate the following quantities using what you got from part (a).

(i) Cov[3X + 4, 6Y − 7]
(ii) ρ_{3X+4, 6Y−7}
(iii) Cov[X, 6X − 7]
(iv) ρ_{X, 6X−7}

Answers:

(a)
(i) EX = 2.6
(ii) P[X = Y] = 0
(iii) P[XY < 6] = 0.2
(iv) E[(X − 3)(Y − 2)] = −0.88
(v) E[X(Y³ − 11Y² + 38Y)] = 104
(vi) Cov[X, Y] = 0.032
(vii) ρX,Y ≈ 0.0447

(b)
(i) Cov[3X + 4, 6Y − 7] = 3 · 6 · Cov[X, Y] = 3 · 6 · 0.032 = 0.576.

(ii) Note that

ρ_{aX+b, cY+d} = Cov[aX + b, cY + d] / (σ_{aX+b} σ_{cY+d}) = ac Cov[X, Y] / (|a|σX |c|σY) = (ac/|ac|) ρX,Y = sign(ac) ρX,Y.

Hence, ρ_{3X+4, 6Y−7} = sign(3 · 6) ρX,Y = ρX,Y ≈ 0.0447.

(iii) Cov[X, 6X − 7] = 1 · 6 · Cov[X, X] = 6 Var[X] = 3.84.

(iv) ρ_{X, 6X−7} = sign(1 · 6) ρX,X = 1.
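The answers above can be checked numerically; a MATLAB sketch (assuming the joint pmf matrix as printed above, with my own variable names):

x = [1 3];  y = [2 4 5];
P_XY = [0.02 0.10 0.08; 0.08 0.32 0.40];
p_X = sum(P_XY,2)';  p_Y = sum(P_XY,1);      % marginal pmfs
EX = x*p_X';  EY = y*p_Y';                   % EX = 2.6, EY = 4.28
[Yg, Xg] = meshgrid(y, x);
EXY = sum(sum(Xg.*Yg.*P_XY));                % E[XY]
C   = EXY - EX*EY                            % Cov[X,Y] = 0.032
sdX = sqrt(x.^2*p_X' - EX^2);
sdY = sqrt(y.^2*p_Y' - EY^2);
rho = C/(sdX*sdY)                            % ~0.0447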


ECS315 2013/1, Part V.3, Dr.Prapun

11.6 Pairs of Continuous Random Variables

In this section, we start to look at more than one continuous random variable. You should find that many of the concepts and formulas are similar, if not the same, as the ones for pairs of discrete random variables which we have already studied. For discrete random variables, we use summations; here, for continuous random variables, we use integrations.

Recall that for a pair of discrete random variables, the joint pmf pX,Y(x, y) completely characterizes the probability model of the two random variables X and Y. In particular, it not only captures the probability of X and the probability of Y individually, it also captures the relationship between them. For continuous random variables, we replace the joint pmf by the joint pdf.

Definition 11.71. We say that two random variables X and Y are jointly continuous with joint pdf fX,Y(x, y) if⁴⁸ for any region R on the (x, y) plane

P[(X, Y) ∈ R] = ∬_{(x,y): (x,y) ∈ R} fX,Y(x, y) dx dy.        (33)

To understand where Definition 11.71 comes from, it is helpful to take a careful look at Table 6.

⁴⁸ Remark: If you want to check that a function f(x, y) is the joint pdf of a pair of random variables (X, Y) by using the above definition, you will need to check that (33) is true for any region R. This is not an easy task. Hence, we do not usually use this definition for such a test. There are some mathematical facts that can be derived from this definition. Such facts produce easier condition(s) than (33), but we will not talk about them here.

Table 6: pmf vs. pdf

                     Discrete                                    Continuous
  P[X ∈ B]           Σ_{x ∈ B} pX(x)                             ∫_B fX(x) dx
  P[(X, Y) ∈ R]      Σ_{(x,y): (x,y) ∈ R} pX,Y(x, y)             ∬_{(x,y): (x,y) ∈ R} fX,Y(x, y) dx dy

Example 11.72. Indicate (sketch) the region of integration for each of the following probabilities:

(a) P[1 < X < 2, −1 < Y < 1]

(b) P[X + Y < 3]

11.73. For us, Definition 11.71 is useful because if you know that a function f(x, y) is a joint pdf of a pair of random variables, then you can calculate countless probabilities involving these two random variables via (33). (See, e.g., Example 11.76.) However, the actual calculation of probability from (33) can be difficult if you have a non-rectangular region R or if you have a complicated joint pdf. In other words, the formula itself is straightforward and simple, but carrying it out may require that you review some multi-variable integration techniques from your calculus class.

11.74. Intuition/Approximation: Note also that the joint pdf's definition extends the interpretation/approximation that we previously discussed for one random variable.

Recall that for a single random variable, the pdf is a measure of probability per unit length. In particular, if you want to find the probability that the value of a random variable X falls inside some small interval, say the interval [1.5, 1.6], then this probability can be approximated by

P[1.5 ≤ X ≤ 1.6] ≈ fX(1.5) × 0.1.

More generally, for a small value of the interval length d, the probability that the value of X falls within a small interval [x, x + d] can be approximated by

P[x ≤ X ≤ x + d] ≈ fX(x) × d.        (34)

Usually, instead of using d, we use Δx and hence

P[x ≤ X ≤ x + Δx] ≈ fX(x) Δx.        (35)

For two random variables X and Y, the joint pdf fX,Y(x, y) measures probability per unit area:

P[x ≤ X ≤ x + Δx, y ≤ Y ≤ y + Δy] ≈ fX,Y(x, y) Δx Δy.        (36)

Do not forget that the comma signifies the "and" (intersection) operation.
11.75. There are two important characterizing properties of a joint pdf:

(a) fX,Y ≥ 0 a.e.

(b) ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} fX,Y(x, y) dx dy = 1

Example 11.76. Consider a probability model of a pair of random variables uniformly distributed over a rectangle in the X-Y plane:

fX,Y(x, y) = { c,  0 ≤ x ≤ 5, 0 ≤ y ≤ 3,
             { 0,  otherwise.

(a) Find the constant c.

(b) Evaluate P[2 ≤ X ≤ 3, 1 ≤ Y ≤ 3] and P[Y > X].
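The integrals in this example can also be checked numerically in MATLAB with integral2; a sketch (it uses c = 1/15, the normalizing constant you should find in part (a)):

c = 1/15;                                    % from (a): c*(5*3) = 1
f = @(x,y) c*ones(size(x));                  % joint pdf on the rectangle
integral2(f, 2, 3, 1, 3)                     % P[2 <= X <= 3, 1 <= Y <= 3] = 2/15
integral2(f, 0, 3, @(x) x, 3)                % P[Y > X]: y from x to 3, x from 0 to 3; = 0.3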

11.77. Other important properties and definitions for a pair of continuous random variables are summarized in Table 7 along with their discrete counterparts.

Table 7: Important formulas for a pair of discrete RVs and a pair of continuous RVs

                        Discrete                                  Continuous
  P[(X, Y) ∈ R]         Σ_{(x,y): (x,y) ∈ R} pX,Y(x, y)           ∬_{(x,y): (x,y) ∈ R} fX,Y(x, y) dx dy
  Joint to Marginal:    pX(x) = Σ_y pX,Y(x, y)                    fX(x) = ∫_{−∞}^{+∞} fX,Y(x, y) dy
  (Law of Total Prob.)  pY(y) = Σ_x pX,Y(x, y)                    fY(y) = ∫_{−∞}^{+∞} fX,Y(x, y) dx
  P[X > Y]              Σ_x Σ_{y: y<x} pX,Y(x, y)                 ∫_{−∞}^{+∞} ∫_{−∞}^{x} fX,Y(x, y) dy dx
                        = Σ_y Σ_{x: x>y} pX,Y(x, y)
  P[X = Y]              Σ_x pX,Y(x, x)                            ∬_{(x,y): x=y} fX,Y(x, y) dx dy = 0
  X ⫫ Y                 pX,Y(x, y) = pX(x) pY(y)                  fX,Y(x, y) = fX(x) fY(y)
  Conditional           pX|Y(x|y) = pX,Y(x, y)/pY(y)              fX|Y(x|y) = fX,Y(x, y)/fY(y)

Exercise 11.78 (F2011). Random variables X and Y have joint pdf

fX,Y(x, y) = { c,  0 ≤ y ≤ x ≤ 1,
             { 0,  otherwise.

(a) Check that c = 2.

(b) In Figure 25, specify the region of nonzero pdf.

Figure 25: Figure for Exercise 11.78b.

(c) Find the marginal density fX(x).

(d) Check that EX = 2/3.

(e) Find the marginal density fY(y).

(f) Find EY.
Definition 11.79. The joint cumulative distribution function (joint cdf) of random variables X and Y (of any type(s)) is defined as

FX,Y(x, y) = P[X ≤ x, Y ≤ y].

Although its definition is simple, we rarely use the joint cdf to study probability models. It is easier to work with a probability mass function when the random variables are discrete, or a probability density function if they are continuous.

11.80. The joint cdf for a pair of random variables (of any type(s)) has the following properties⁴⁹:

(a) 0 ≤ FX,Y(x, y) ≤ 1

(i) FX,Y(∞, ∞) = 1.
(ii) FX,Y(−∞, y) = FX,Y(x, −∞) = 0.

(b) Joint to Marginal: FX(x) = FX,Y(x, ∞) and FY(y) = FX,Y(∞, y). In words, we obtain the marginal cdfs FX and FY directly from FX,Y by setting the unwanted variable to ∞.

(c) If x1 ≤ x2 and y1 ≤ y2, then FX,Y(x1, y1) ≤ FX,Y(x2, y2).

11.81. The joint cdf for a pair of continuous random variables also has the following properties:

(a) FX,Y(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y(u, v) dv du.

(b) fX,Y(x, y) = ∂²/∂x∂y FX,Y(x, y).

⁴⁹ Note that when we write FX,Y(x, ∞), we mean lim_{y→∞} FX,Y(x, y). A similar limiting definition applies to FX,Y(∞, ∞), FX,Y(−∞, y), FX,Y(x, −∞), and FX,Y(∞, y).

11.82. Independence: The following statements are equivalent:

(a) Random variables X and Y are independent.
(b) [X ∈ B] ⫫ [Y ∈ C] for all B, C.
(c) P[X ∈ B, Y ∈ C] = P[X ∈ B] P[Y ∈ C] for all B, C.
(d) fX,Y(x, y) = fX(x) fY(y) for all x, y.
(e) FX,Y(x, y) = FX(x) FY(y) for all x, y.

Exercise 11.83 (F2011). Let X1 and X2 be i.i.d. E(1).

(a) Find P[X1 = X2].

(b) Find P[X1² + X2² = 13].

11.7 Function of a Pair of Continuous Random Variables: MISO

There are many situations in which we observe two random variables and use their values to compute a new random variable.

Example 11.84. Signal in additive noise: When we say that a random signal X is transmitted over a channel subject to additive noise N, we mean that at the receiver, the received signal Y will be X + N. Usually, the noise is assumed to be zero-mean Gaussian noise; that is, N ~ N(0, σN²) for some noise power σN².

Example 11.85. In a wireless channel, the transmitted signal X is corrupted by fading (multiplicative noise). More specifically, the received signal Y at the receiver's antenna is Y = H × X.

Remark: In the actual situation, the signal is further corrupted by additive noise N and hence Y = HX + N. However, this expression for Y involves more than two random variables and hence we will not consider it here.

Table 8: Important formulas for a function of a pair of RVs. Unless stated otherwise, the function is defined as Z = g(X, Y).

                      Discrete                                    Continuous
  E[Z]                Σ_x Σ_y g(x, y) pX,Y(x, y)                  ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} g(x, y) fX,Y(x, y) dx dy
  P[Z ∈ B]            Σ_{(x,y): g(x,y) ∈ B} pX,Y(x, y)            ∬_{(x,y): g(x,y) ∈ B} fX,Y(x, y) dx dy
  Z = X + Y           pZ(z) = Σ_x pX,Y(x, z − x)                  fZ(z) = ∫_{−∞}^{+∞} fX,Y(x, z − x) dx
                           = Σ_y pX,Y(z − y, y)                        = ∫_{−∞}^{+∞} fX,Y(z − y, y) dy
  X ⫫ Y, Z = X + Y    pX+Y = pX ∗ pY                              fX+Y = fX ∗ fY

11.86. Consider a new random variable Z defined by Z = g(X, Y). Table 8 summarizes the basic formulas involving this derived random variable.

11.87. When X and Y are continuous random variables, it may be of interest to find the pdf of the derived random variable Z = g(X, Y). It is usually helpful to divide this task into two steps:

(a) Find the cdf FZ(z) = P[Z ≤ z] = ∬_{(x,y): g(x,y) ≤ z} fX,Y(x, y) dx dy.

(b) Differentiate: fZ(z) = (d/dz) FZ(z).

Example 11.88. Suppose X and Y are i.i.d. E(3). Find the pdf of W = Y/X.
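One way to sanity-check an answer to Example 11.88 is by simulation. The MATLAB sketch below is my own; it compares a histogram-style density estimate of W = Y/X with the candidate answer fW(w) = 1/(1 + w)² for w ≥ 0 that the two-step procedure in 11.87 produces (a curve which, notably, does not depend on the rate 3).

lambda = 3;  N = 1e6;
X = -log(rand(N,1))/lambda;                   % inverse-cdf method for E(lambda)
Y = -log(rand(N,1))/lambda;
W = Y ./ X;
edges = 0:0.25:10;  centers = edges(1:end-1) + 0.125;
pdfEst = histcounts(W, edges) / (N * 0.25);   % empirical density of W on [0, 10]
plot(centers, pdfEst, 'o');  hold on
w = 0:0.01:10;  plot(w, 1./(1+w).^2)          % candidate f_W(w) = 1/(1+w)^2, w >= 0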

Exercise 11.89 (F2011). Let X1 and X2 be i.i.d. E(1).

(a) Define Y = min{X1, X2}. (For example, when X1 = 6 and X2 = 4, we have Y = 4.) Describe the random variable Y. Does it belong to any known family of random variables? If so, what is/are its parameters?

(b) Define Y = min{X1, X2} and Z = max{X1, X2}. Find fY,Z(2, 1).

(c) Define Y = min{X1, X2} and Z = max{X1, X2}. Find fY,Z(1, 2).
11.90. Observe that finding the pdf of Z = g(X, Y) is a time-consuming task. If your goal is to find E[Z], do not forget that it can be calculated directly from

E[g(X, Y)] = ∫∫ g(x, y) fX,Y(x, y) dx dy.

11.91. The following property is valid for any kind of random variables:

E[Σ_i Zi] = Σ_i E[Zi].

Furthermore,

E[Σ_i gi(X, Y)] = Σ_i E[gi(X, Y)].
Table 9: pmf vs. pdf

                        Discrete                                  Continuous
  P[X ∈ B]              Σ_{x ∈ B} pX(x)                           ∫_B fX(x) dx
  P[(X, Y) ∈ R]         Σ_{(x,y): (x,y) ∈ R} pX,Y(x, y)           ∬_{(x,y): (x,y) ∈ R} fX,Y(x, y) dx dy
  Joint to Marginal:    pX(x) = Σ_y pX,Y(x, y)                    fX(x) = ∫_{−∞}^{+∞} fX,Y(x, y) dy
  (Law of Total Prob.)  pY(y) = Σ_x pX,Y(x, y)                    fY(y) = ∫_{−∞}^{+∞} fX,Y(x, y) dx
  P[X > Y]              Σ_x Σ_{y: y<x} pX,Y(x, y)                 ∫_{−∞}^{+∞} ∫_{−∞}^{x} fX,Y(x, y) dy dx
                        = Σ_y Σ_{x: x>y} pX,Y(x, y)
  P[X = Y]              Σ_x pX,Y(x, x)                            ∬_{(x,y): x=y} fX,Y(x, y) dx dy = 0
  X ⫫ Y                 pX,Y(x, y) = pX(x) pY(y)                  fX,Y(x, y) = fX(x) fY(y)
  Conditional           pX|Y(x|y) = pX,Y(x, y)/pY(y)              fX|Y(x|y) = fX,Y(x, y)/fY(y)
  E[g(X, Y)]            Σ_x Σ_y g(x, y) pX,Y(x, y)                ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} g(x, y) fX,Y(x, y) dx dy
  P[g(X, Y) ∈ B]        Σ_{(x,y): g(x,y) ∈ B} pX,Y(x, y)          ∬_{(x,y): g(x,y) ∈ B} fX,Y(x, y) dx dy
  Z = X + Y             pZ(z) = Σ_x pX,Y(x, z − x)                fZ(z) = ∫_{−∞}^{+∞} fX,Y(x, z − x) dx
                             = Σ_y pX,Y(z − y, y)                      = ∫_{−∞}^{+∞} fX,Y(z − y, y) dy

11.92. Independence: At this point, it is useful to summarize what we know about independence. The following statements are equivalent:

(a) Random variables X and Y are independent.
(b) [X ∈ B] ⫫ [Y ∈ C] for all B, C.
(c) P[X ∈ B, Y ∈ C] = P[X ∈ B] P[Y ∈ C] for all B, C.
(d) For discrete RVs, pX,Y(x, y) = pX(x) pY(y) for all x, y.
    For continuous RVs, fX,Y(x, y) = fX(x) fY(y) for all x, y.
(e) FX,Y(x, y) = FX(x) FY(y) for all x, y.
(f) E[h(X)g(Y)] = E[h(X)] E[g(Y)] for all functions h and g.
Definition 11.93. All of the definitions involving expectation of a function of two random variables are the same as in the discrete case:

• Correlation between X and Y: E[XY].

• Covariance between X and Y:

Cov[X, Y] = E[(X − EX)(Y − EY)] = E[XY] − EX·EY.

• Var X = Cov[X, X].

• X and Y are said to be uncorrelated if and only if Cov[X, Y] = 0.

• X and Y are said to be orthogonal if E[XY] = 0.

• Correlation coefficient: ρX,Y = Cov[X, Y] / (σX σY).
Exercise 11.94 (F2011). Continue from Exercise 11.78. We found that the joint pdf is given by

fX,Y(x, y) = { 2,  0 ≤ y ≤ x ≤ 1,
             { 0,  otherwise.

Also recall that EX = 2/3 and EY = 1/3.

(a) Find E[XY].

(b) Are X and Y uncorrelated?

(c) Are X and Y independent?
Example 11.95. The bivariate Gaussian or bivariate normal density is a generalization of the univariate N(m, σ²) density. For the bivariate normal, fX,Y(x, y) is

(1 / (2π σX σY √(1 − ρ²))) exp{ −(1 / (2(1 − ρ²))) [ ((x − EX)/σX)² − 2ρ((x − EX)/σX)((y − EY)/σY) + ((y − EY)/σY)² ] }.        (37)

Important properties:

(a) ρ = Cov[X, Y]/(σX σY) ∈ (−1, 1) [24, Thm. 4.31]

(b) [24, Thm. 4.28]

(c) X ⫫ Y is equivalent to "X and Y are uncorrelated."

Figure 26: Samples from bivariate Gaussian distributions. (Panel parameters (σX, σY, ρ): (1, 1, 0), (1, 2, 0.5), (1, 2, 0.8), (1, 2, 0), (3, 1, 0), (1, 2, 0.99).)

Figure 27: Effect of ρ on the bivariate Gaussian distribution (scatter plots of 2000 samples and the corresponding joint pdfs). Note that the marginal pdfs for both X and Y are all standard Gaussian.

ECS315 2013/1, Part VI, Dr.Prapun

12 Three Types of Random Variables
12.1. Review: You may recall⁵⁰ the following properties for the cdf of discrete random variables. These properties hold for any kind of random variable.

(a) The cdf is defined as FX(x) = P[X ≤ x]. This is valid for any type of random variable.

(b) Moreover, the cdf for any kind of random variable must satisfy three properties which we have discussed earlier:

CDF1 FX is non-decreasing.
CDF2 FX is right-continuous.
CDF3 lim_{x→−∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1.

(c) P[X = x] = FX(x) − FX(x⁻) = the jump or saltus in F at x.

Theorem 12.2. If you find a function F that satisfies CDF1, CDF2, and CDF3 above, then F is the cdf of some random variable.

⁵⁰ If you don't know these properties by now, you should review them as soon as possible.

Example 12.3. Consider an input X to a device whose output Y will be the same as the input if the input level does not exceed 5. For an input level that exceeds 5, the output will be saturated at 5. Suppose X ~ U(0, 6). Find FY(y).

12.4. We can categorize random variables into three types according to their cdf:

(a) If FX(x) is piecewise flat with discontinuous jumps, then X is discrete.

(b) If FX(x) is a continuous function, then X is continuous.

(c) If FX(x) is a piecewise continuous function with discontinuities, then X is mixed.

Figure 28: Typical cdfs: (a) a discrete random variable, (b) a continuous random variable, and (c) a mixed random variable [16, Fig. 3.2].

We have seen in Example 12.3 that some functions can turn a continuous random variable into a mixed random variable. Next, we will work on an example where a continuous random variable is turned into a discrete random variable.

Example 12.5. Let X ~ U(0, 1) and Y = g(X) where

g(x) = { 1,  x < 0.6,
       { 0,  x ≥ 0.6.

Before going deeply into the math, it is helpful to think about the nature of the derived random variable Y. The definition of g(x) tells us that Y has only two possible values, Y = 0 and Y = 1. Thus, Y is a discrete random variable.

Example 12.6. In MATLAB, we have the rand command to generate U(0, 1). If we want to generate a Bernoulli random variable with success probability p, what can we do?
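One common approach (a sketch; there are other ways): compare the uniform sample with p, since P[U < p] = p when U ~ U(0, 1).

p = 0.25;
X = (rand < p)                 % one Bernoulli(p) sample: 1 with probability p
Xvec = (rand(1,1e4) < p);      % many samples at once
mean(Xvec)                     % relative frequency of 1s; should be close to p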

Exercise 12.7. In MATLAB, how can we generate X ~ binomial(2, 1/4) from the rand command?

13 Transform Methods: Characteristic Functions

Definition 13.1. The characteristic function of a random variable X is defined by

φX(v) = E[e^{jvX}].

Remarks:

(a) If X is a continuous random variable with density fX, then

φX(v) = ∫_{−∞}^{+∞} e^{jvx} fX(x) dx,

which is the Fourier transform of fX evaluated at ω = −v. More precisely,

φX(v) = F{fX}(ω)|_{ω=−v}.        (38)

(b) Many references use u or t instead of v.

Example 13.2. You may have learned that the Fourier transform of a Gaussian waveform is a Gaussian waveform. In fact, when X ~ N(m, σ²),

F{fX}(ω) = ∫_{−∞}^{∞} fX(x) e^{−jωx} dx = e^{−jωm − (1/2)σ²ω²}.

Using (38), we have

φX(v) = e^{jvm − (1/2)σ²v²}.

Example 13.3. For X ~ E(λ), we have φX(v) = λ/(λ − jv).

As with the Fourier transform, we can build a large list of commonly used characteristic functions. (You probably remember that a rectangular function in the time domain gives a sinc function in the frequency domain.) When you see a random variable that has the same form of characteristic function as one that you know, you can quickly make a conclusion about the family and parameters of that random variable.

Example 13.4. Suppose a random variable X has the characteristic function φX(v) = 2/(2 − jv). You can quickly conclude that it is an exponential random variable with parameter 2.

For many random variables, it is easy to find the expected value or any moments via the characteristic function. This can be done via the following result.

13.5. φX^{(k)}(v) = j^k E[X^k e^{jvX}] and φX^{(k)}(0) = j^k E[X^k].

Example 13.6. When X ~ E(λ),

(a) EX = 1/λ.

(b) Var X = 1/λ².
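For the record, here is the short computation behind Example 13.6 using 13.5 (it is not spelled out in the notes):

φX(v) = λ(λ − jv)^{−1}, so φ′X(v) = jλ(λ − jv)^{−2} and φ″X(v) = 2j²λ(λ − jv)^{−3}.

By 13.5, φ′X(0) = j/λ = j·EX, so EX = 1/λ; and φ″X(0) = 2j²/λ² = j²·E[X²], so E[X²] = 2/λ².

Hence Var X = E[X²] − (EX)² = 2/λ² − 1/λ² = 1/λ².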

Exercise 13.7 (F2011). Continue from Example 13.2.

(a) Show that for X ~ N(m, σ²), we have

(i) EX = m.
(ii) E[X²] = σ² + m².

(b) For X ~ N(3, 4), find E[X³].

One very important property of the characteristic function is that it is very easy to find the characteristic function of a sum of independent random variables.

13.8. Suppose X ⫫ Y. Let Z = X + Y. Then, the characteristic function of Z is the product of the characteristic functions of X and Y:

φZ(v) = φX(v) φY(v).

Remark: Can you relate this property to the corresponding property of the Fourier transform?

Example 13.9. Use 13.8 to show that the sum of two independent Gaussian random variables is still a Gaussian random variable.

Exercise 13.10. Continue from Example 11.44. Suppose Λ1 ~ P(λ1) and Λ2 ~ P(λ2) are independent. Let Λ = Λ1 + Λ2. Use 13.8 to show that Λ ~ P(λ1 + λ2).

Exercise 13.11. Continue from Exercise 11.45. Suppose B1 ~ B(n1, p) and B2 ~ B(n2, p) are independent. Let B = B1 + B2. Use 13.8 to show that B ~ B(n1 + n2, p).

ECS315 2013/1, Part VII, Dr.Prapun

14 Limiting Theorems

14.1 Law of Large Numbers (LLN)

Definition 14.1. Let X1, X2, ..., Xn be a collection of random variables with a common mean E[Xi] = m for all i. In practice, since we do not know m, we use the numerical average, or sample mean,

Mn = (1/n) Σ_{i=1}^{n} Xi,

in place of the true, but unknown, value m.

Q: Can this procedure of using Mn as an estimate of m be justified in some sense?

A: This can be done via the law of large numbers.

14.2. The law of large numbers basically says that if you have a sequence of i.i.d. random variables X1, X2, ..., then the sample means Mn = (1/n) Σ_{i=1}^{n} Xi will converge to the actual mean as n → ∞.

14.3. The LLN is easy to see via the properties of variance. Note that

E[Mn] = E[(1/n) Σ_{i=1}^{n} Xi] = (1/n) Σ_{i=1}^{n} EXi = m

and

Var[Mn] = Var[(1/n) Σ_{i=1}^{n} Xi] = (1/n²) Σ_{i=1}^{n} Var Xi = σ²/n.        (39)

Remarks:

(a) For (39) to hold, it is sufficient to have uncorrelated Xi's.

(b) From (39), we also have

σ_{Mn} = σ/√n.        (40)

In words, when uncorrelated (or independent) random variables each having the same distribution are averaged together, the standard deviation is reduced according to the square root law. [21, p 142]
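A quick MATLAB illustration of (39)-(40) (the distribution and n are my own choices): the sample standard deviation of Mn should shrink like σ/√n.

n = 100;  trials = 1e4;
X = rand(trials, n);           % i.i.d. U(0,1): m = 0.5, sigma = sqrt(1/12)
Mn = mean(X, 2);               % one sample mean M_n per row
mean(Mn)                       % ~ 0.5
std(Mn)                        % ~ sqrt(1/12)/sqrt(n) ~ 0.0289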
Exercise 14.4 (F2011). Consider i.i.d. random variables X1, X2, ..., X10. Define the sample mean M by

M = (1/10) Σ_{k=1}^{10} Xk.

Let

V1 = (1/10) Σ_{k=1}^{10} (Xk − E[Xk])²

and

V2 = (1/10) Σ_{j=1}^{10} (Xj − M)².

Suppose E[Xk] = 1 and Var[Xk] = 2.

(a) Find E[M].
(b) Find Var[M].
(c) Find E[V1].
(d) Find E[V2].

14.2 Central Limit Theorem (CLT)

In practice, there are many random variables that arise as a sum of many other random variables. In this section, we consider the sum

Sn = Σ_{i=1}^{n} Xi,        (41)

where the Xi are i.i.d. with common mean m and common variance σ².

Note that when we say the Xi are i.i.d., the definition is that they are independent and identically distributed. It is then convenient to talk about a random variable X which shares the same distribution (pdf/pmf) with these Xi. This allows us to write

Xi ~ X (i.i.d.),        (42)

which is much more compact than saying that the Xi are i.i.d. with the same distribution (pdf/pmf) as X. Moreover, we can also use EX and σX² for the common expected value and variance of the Xi.

Q: How does Sn behave?

For the Sn defined above, there are many cases for which we know the pmf/pdf of Sn.

Example 14.5. When the Xi are i.i.d. Bernoulli(p),

Example 14.6. When the Xi are i.i.d. N(m, σ²),

Note that it is not difficult to find the characteristic function of Sn if we know the common characteristic function φX(v) of the Xi:

φ_{Sn}(v) = (φX(v))^n.

If we are lucky, as in the case of the sum of Gaussian random variables in Example 14.6 above, we get a φ_{Sn}(v) of a form that we know. However, φ_{Sn}(v) will usually be something we haven't seen before, or something whose inverse transform is difficult to find. This is one of the reasons why having a way to approximate the sum Sn would be very useful. There are also some situations where the distribution of the Xi is unknown or difficult to find, in which case it would be amazing if we could still say something about the distribution of Sn.

In the previous section, we considered the sample mean of identically distributed random variables. More specifically, we considered the random variable Mn = (1/n)Sn. We found that Mn will converge to m as n increases to ∞. Here, we don't want to rescale the sum Sn by the factor 1/n.

14.7 (Approximation of densities and pmfs using the CLT). The actual statement of the CLT is a bit difficult to state. So, we first give you the interpretation/insight from the CLT, which is very easy to remember and use:

For n large enough, we can approximate Sn by a Gaussian random variable with the same mean and variance as Sn.

Note that the mean and variance of Sn are nm and nσ², respectively. Hence, for n large enough we can approximate Sn by N(nm, nσ²). In particular,

(a) FSn(s) ≈ Φ((s − nm)/(√n σ)).

(b) If the Xi are continuous random variables, then

fSn(s) ≈ (1/(√(2π) √n σ)) e^{−(1/2)((s − nm)/(√n σ))²}.

(c) If the Xi are integer-valued, then

P[Sn = k] = P[k − 1/2 < Sn ≤ k + 1/2] ≈ (1/(√(2π) √n σ)) e^{−(1/2)((k − nm)/(√n σ))²}

[9, eq (5.14), p. 213]. The approximation is best for k near nm [9, p. 211].
Example 14.8. Approximation for the Binomial Distribution: For X ~ B(n, p), when n is large, the binomial distribution becomes difficult to compute directly because of the need to calculate factorial terms.

(a) When p is not close to either 0 or 1, so that the variance is also large, we can use the CLT to approximate

P[X = k] ≈ (1/√(2π Var X)) e^{−(k − EX)²/(2 Var X)}        (43)
         = (1/√(2π np(1 − p))) e^{−(k − np)²/(2np(1 − p))}.        (44)

This is called the Laplace approximation to the binomial distribution [25, p. 282]. (See the MATLAB sketch after Figure 29.)

(b) When p is small, the binomial distribution can be approximated by P(np), as discussed in 8.45.

(c) If p is very close to 1, then n − X will behave approximately Poisson.

Normal approximation to the Poisson distribution with large λ: Let X ~ P(λ). X can be thought of as a sum of n i.i.d. Xi ~ P(λ0), i.e., X = Σ_{i=1}^{n} Xi, with nλ0 = λ. Hence X is approximately normal N(λ, λ) for λ large. Some say that the normal approximation is good when λ > 5.

Figure 29: Gaussian approximation to the Binomial, Poisson, and Gamma distributions. (The figure compares (1) the Poisson pmf at integer x, (2) the Gaussian density, (3) the Gamma density, and (4) the binomial pmf.)
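A MATLAB sketch of the approximation in (44); the parameters n and p below are my own choices:

n = 100;  p = 0.3;  k = 0:n;
logC = gammaln(n+1) - gammaln(k+1) - gammaln(n-k+1);       % log of nchoosek(n,k), vectorized
pExact  = exp(logC + k*log(p) + (n-k)*log(1-p));           % exact B(n,p) pmf
pApprox = exp(-(k - n*p).^2/(2*n*p*(1-p))) / sqrt(2*pi*n*p*(1-p));   % (44)
stem(k, pExact);  hold on;  plot(k, pApprox)
max(abs(pExact - pApprox))                                 % small; best for k near np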

Exercise 14.9 (F2011). Continue from Exercise 6.53. The stronger person (Kakashi) should win the competition if n is very large. (By the law of large numbers, the proportion of fights that Kakashi wins should be close to 55%.) However, because the results are random and n can not be very large, we can not guarantee that Kakashi will win. However, it may be good enough if the probability that Kakashi wins the competition is greater than 0.85.

We want to find the minimal value of n such that the probability that Kakashi wins the competition is greater than 0.85.

Let N be the number of fights that Kakashi wins among the n fights. Then, we need

P[N > n/2] ≥ 0.85.        (45)

Use the central limit theorem and Table 3.1 or Table 3.2 from [Yates and Goodman] to approximate the minimal value of n such that (45) is satisfied.

14.10. A more precise statement of the CLT can be expressed via the convergence of the characteristic function. In particular, suppose that (Xk)_{k≥1} is a sequence of i.i.d. random variables with mean m and variance 0 < σ² < ∞. Let Sn = Σ_{k=1}^{n} Xk. It can be shown that

(a) the characteristic function of (Sn − mn)/(σ√n) converges pointwise to the characteristic function of N(0, 1), and

(b) the characteristic function of (Sn − mn)/√n converges pointwise to the characteristic function of N(0, σ²).

To see this, let Zk = (Xk − m)/σ ~ Z (i.i.d.) and Yn = (1/√n) Σ_{k=1}^{n} Zk. Then, EZ = 0, Var Z = 1, and φ_{Yn}(t) = (φZ(t/√n))^n. By the approximation e^x ≈ 1 + x + (1/2)x², we have φX(t) ≈ 1 + jtEX − (1/2)t²E[X²] and

φ_{Yn}(t) ≈ (1 − (1/2)(t²/n))^n → e^{−t²/2},

which is the characteristic function of N(0, 1).

• The case of Bernoulli(1/2) was derived by Abraham de Moivre around 1733. The case of Bernoulli(p) for 0 < p < 1 was considered by Pierre-Simon Laplace [9, p. 208].

15 Random Vector

In Section 11.2, we introduced the way to deal with more than two random variables. In particular, we introduced the concepts of the joint pmf

pX1,X2,...,Xn(x1, x2, ..., xn) = P[X1 = x1, X2 = x2, ..., Xn = xn]

and the joint pdf

fX1,X2,...,Xn(x1, x2, ..., xn)

of a collection of random variables.

Definition 15.1. You may notice that it is tedious to write the n-tuple (X1, X2, ..., Xn) every time we want to refer to this collection of random variables. A more convenient notation uses a column vector X to represent all of them at once, keeping in mind that the ith component of X is the random variable Xi. This allows us to express

(a) pX1,X2,...,Xn(x1, x2, ..., xn) as pX(x), and

(b) fX1,X2,...,Xn(x1, x2, ..., xn) as fX(x).

When the random variables are separated into two groups, we may label those in one group as X1, X2, ..., Xn and those in another group as Y1, Y2, ..., Ym. In this case, we can express

(a) pX1,...,Xn,Y1,...,Ym(x1, ..., xn, y1, ..., ym) as pX,Y(x, y), and

(b) fX1,...,Xn,Y1,...,Ym(x1, ..., xn, y1, ..., ym) as fX,Y(x, y).
Definition 15.2. Random vectors X and Y are independent if and only if

(a) Discrete: pX,Y(x, y) = pX(x) pY(y).

(b) Continuous: fX,Y(x, y) = fX(x) fY(y).

Definition 15.3. A random vector X contains many random variables. Each of these random variables has its own expected value. We can represent the expected values of all these random variables in the form of a vector as well by using the notation E[X]. This is a vector whose ith component is EXi.

In other words, the expectation E[X] of a random vector X is defined to be the vector of expectations of its entries. E[X] is usually denoted by μX or mX.

Definition 15.4. Recall that a random vector is simply a vector containing random variables as its components. We can also talk about a random matrix, which is simply a matrix whose entries are random variables. In this case, we define the expectation of a random matrix to be the matrix whose entries are the expectations of the corresponding random variables in the random matrix.

Example 15.5.

E [ X ]  =  [ EX ]
  [ Y ]     [ EY ]

and

E [ X  W ]  =  [ EX  EW ]
  [ Y  Z ]     [ EY  EZ ]

15.6. For non-random matrices A, B, C and a random vector X,

E[AXB + C] = A(EX)B + C.

Correlation and covariance are important quantities that capture linear dependence between two random variables. When we have many random variables, there are many possible pairs for which to find the correlation E[XiXj] and the covariance Cov[Xi, Xj]. All of the correlation values can be expressed at once using the correlation matrix.

Definition 15.7. The correlation matrix RX of a random (column) vector X is defined by

RX = E[XX^T].

Note that it is symmetric and that the ij-entry of RX is simply E[XiXj].

Example 15.8. Consider X = [X1; X2].

RX = E[XX^T] = E[ X1²    X1X2 ]  =  [ E[X1²]    E[X1X2] ]
                [ X1X2   X2²  ]     [ E[X1X2]   E[X2²]  ]

Definition 15.9. Similarly, all of the covariance values can be expressed at once using the covariance matrix. The covariance matrix CX of a random vector X is defined as

CX = E[(X − EX)(X − EX)^T] = E[XX^T] − (EX)(EX)^T = RX − (EX)(EX)^T.

Note that it is symmetric and that the ij-entry of CX is simply Cov[Xi, Xj].

• In some references, other symbols (e.g., ΣX) are used instead of CX.

15.10. Properties of the covariance matrix:

(a) For i.i.d. Xi each with variance σ², CX = σ²I.

(b) Cov[AX + b] = A CX A^T.
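A small numerical check of property (b), using the sample covariance as a stand-in for CX (the matrices and sample size below are my own toy choices):

N = 1e5;
X = randn(N,2) * [1 0.5; 0 1];          % correlated samples; each row is one draw of X'
A = [2 1; 0 3];  b = [1; -1];
Y = (A*X' + repmat(b,1,N))';            % corresponding samples of AX + b
cov(Y)                                   % sample covariance of AX + b
A*cov(X)*A'                              % property (b): should match up to sampling error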
In addition to the correlations and covariances of the elements of one random vector, it is useful to refer to the correlations and covariances of elements of two random vectors.

Definition 15.11. If X and Y are both random vectors (not necessarily of the same dimension), then their cross-correlation matrix is

RXY = E[XY^T],

and their cross-covariance matrix is

CXY = E[(X − EX)(Y − EY)^T].

Example 15.12. Jointly Gaussian random vector X ~ N(m, Λ):

fX(x) = (1 / ((2π)^{n/2} √(det Λ))) e^{−(1/2)(x − m)^T Λ^{−1}(x − m)}.

(a) m = EX and Λ = CX = E[(X − EX)(X − EX)^T].

(b) For the bivariate normal, X1 = X and X2 = Y. We have

Λ = [ σX²         Cov[X, Y] ]  =  [ σX²         ρXY σX σY ]
    [ Cov[X, Y]   σY²       ]     [ ρXY σX σY   σY²       ]

16 Introduction to Stochastic Processes (Random Processes)

A random process considers an infinite collection of random variables. These random variables are usually indexed by time. So, the obvious notation for a random process would be X(t). As in a signals-and-systems class, time can be discrete or continuous. When time is discrete, it may be more appropriate to use X1, X2, ... or X[1], X[2], X[3], ... to denote a random process.

Example 16.1. The sequence of results (0 or 1) from a sequence of Bernoulli trials is a discrete-time random process.

16.2. Two perspectives:

(a) We can view a random process as a collection of many random variables indexed by t.

(b) We can also view a random process as the outcome of a random experiment, where the outcome of each trial is a deterministic waveform (or sequence) that is a function of t. The collection of these functions is known as an ensemble, and each member is called a sample function.

Example 16.3. Gaussian Random Processes: A random process X(t) is Gaussian if for all positive integers n and for all t1, t2, ..., tn, the random variables X(t1), X(t2), ..., X(tn) are jointly Gaussian random variables.

16.4. The formal definition of a random process requires going back to the probability space (Ω, A, P).

Recall that a random variable X is in fact a deterministic function of the outcome ω from Ω. So, we should have been writing it as X(ω). However, as we get more familiar with the concept of a random variable, we usually drop the "(ω)" part and simply refer to it as X.

For a random process, we have X(t, ω). This two-argument expression corresponds to the two perspectives that we have just discussed:

(a) When you fix the time t, you get a random variable from a random process.

(b) When you fix ω, you get a deterministic function of time from a random process.

As we get more familiar with the concept of random processes, we again drop the ω argument.
Definition 16.5. A sample function x(t, ω) is the time function associated with the outcome ω of an experiment.

Example 16.6 (Randomly Scaled Sinusoid). Consider the random process defined by

X(t) = A cos(1000t),

where A is a random variable. For example, A could be a Bernoulli random variable with parameter p. This is a good model for a one-shot digital transmission via amplitude modulation.

(a) Consider the time t = 2 ms. X(t) is a random variable taking the value 1 × cos(2) ≈ −0.4161 with probability p and the value 0 × cos(2) = 0 with probability 1 − p.

If you consider t = 4 ms, X(t) is a random variable taking the value 1 × cos(4) ≈ −0.6536 with probability p and the value 0 × cos(4) = 0 with probability 1 − p.

(b) From another perspective, we can look at the process X(t) as two possible waveforms, cos(1000t) and 0. The first one happens with probability p; the second one happens with probability 1 − p. In this view, notice that each of the waveforms is not random. They are deterministic. Randomness in this situation is associated not with the waveform but with the uncertainty as to which waveform will occur in a given trial.

Figure 30: Typical ensemble members for four random processes commonly encountered in communications: (a) thermal noise, (b) uniform phase (encountered in communication systems where it is not feasible to establish timing at the receiver), (c) Rayleigh fading process, and (d) binary random data process (which may represent transmitted bits 0 and 1 that are mapped to +V and −V (volts)). [16, Fig. 3.8]
Definition 16.7. At any particular time t, because we have a random variable, we can also find its expected value. The function mX(t) captures these expected values as a deterministic function of time:

mX(t) = E[X(t)].

16.1 Autocorrelation Function and WSS

One of the most important characteristics of a random process is its autocorrelation function, which leads to the spectral information of the random process. The frequency content of a process depends on the rapidity of the amplitude change with time. This can be measured by correlating the values of the process at two time instances t1 and t2.

Definition 16.8. Autocorrelation Function: The autocorrelation function RX(t1, t2) for a random process X(t) is defined by

RX(t1, t2) = E[X(t1)X(t2)].

Example 16.9. The random process x(t) is a slowly varying process compared to the process y(t) in Figure 31. For x(t), the values at t1 and t2 are similar; that is, they have stronger correlation. On the other hand, for y(t), values at t1 and t2 have little resemblance; that is, they have weaker correlation.

Example 16.10 (Randomly Phased Sinusoid). Consider a random process

X(t) = 5 cos(7t + Θ),

where Θ is a uniform random variable on the interval (0, 2π).

mX(t) = E[X(t)] = ∫_{−∞}^{+∞} 5 cos(7t + θ) fΘ(θ) dθ = ∫_{0}^{2π} 5 cos(7t + θ) (1/(2π)) dθ = 0

Figure 31: Autocorrelation functions for a slowly varying and a rapidly varying random process [13, Fig. 11.4]

and

RX(t1, t2) = E[X(t1)X(t2)] = E[5 cos(7t1 + Θ) × 5 cos(7t2 + Θ)] = (25/2) cos(7(t2 − t1)).
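A Monte Carlo sketch of these two ensemble averages (my own script): averaging over many independent draws of Θ should give mX(t) ≈ 0 and RX(t1, t2) ≈ (25/2) cos(7(t2 − t1)).

M = 1e5;                              % number of sample functions
theta = 2*pi*rand(M,1);               % Theta ~ U(0, 2*pi)
t1 = 0.2;  t2 = 0.5;
X1 = 5*cos(7*t1 + theta);  X2 = 5*cos(7*t2 + theta);
mean(X1)                              % ~ m_X(t1) = 0
mean(X1 .* X2)                        % ~ R_X(t1,t2) = (25/2)*cos(7*(t2 - t1))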
Definition 16.11. A random process whose statistical characteristics do not change with time is classified as a stationary random process. For a stationary process, we can say that a shift of the time origin will be impossible to detect; the process will appear to be the same.

Example 16.12. The random process representing the temperature of a city is an example of a nonstationary process, because the temperature statistics (mean value, for example) depend on the time of day. On the other hand, the noise process is stationary, because its statistics (the mean and the mean square values, for example) do not change with time.

16.13. In general, it is not easy to determine whether a process is stationary. In practice, we can ascertain stationarity if there is no change in the signal-generating mechanism. Such is the case for the noise process.
A process may not be stationary in the strict sense. A more relaxed condition for stationarity can also be considered.
Definition 16.14. A random process X(t) is wide-sense stationary (WSS) if

(a) mX(t) is a constant, and

(b) RX(t1, t2) depends only on the time difference t2 − t1 and does not depend on the specific values of t1 and t2.

In this case, we can write the correlation function as RX(τ) where τ = t2 − t1.

One important consequence is that E[X²(t)] will be a constant as well.
Example 16.15. The randomly phased sinusoid defined in Example 16.10 is WSS with

    RX(τ) = (25/2) cos(7τ).
16.16. Most information signals and noise sources encountered
in communication systems are well modeled as WSS random processes.
Example 16.17. White noise process is a WSS process N(t) whose

(a) E[N(t)] = 0 for all t and

(b) RN(τ) = (N0/2) δ(τ).

See also 16.23 for its definition.

Since RN(τ) = 0 for τ ≠ 0, any two different samples of white noise, no matter how close in time they are taken, are uncorrelated.
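To see the "uncorrelated samples" property in discrete time, here is a small Python sketch (an illustration added here, not from the notes) that estimates the sample autocorrelation of a white Gaussian sequence; it is essentially zero at every nonzero lag.

    import numpy as np

    rng = np.random.default_rng(1)
    n = rng.normal(0.0, 1.0, size=100_000)       # discrete-time white Gaussian noise

    def sample_autocorr(x, lag):
        """Estimate E[X[k] X[k+lag]] from one long realization."""
        return np.mean(x[:len(x) - lag] * x[lag:]) if lag > 0 else np.mean(x * x)

    for lag in range(4):
        print(lag, round(sample_autocorr(n, lag), 4))
    # lag 0 gives about 1 (the variance); lags 1, 2, 3 give values near 0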

Example 16.18. [Thermal noise] A statistical analysis of the random motion (by thermal agitation) of electrons shows that the autocorrelation of thermal noise N(t) is well modeled as

    RN(τ) = kTG e^{−|τ|/t0} / t0   watts,

where k is Boltzmann's constant (k = 1.38 × 10^{−23} joule/degree Kelvin), G is the conductance of the resistor (mhos), T is the (ambient) temperature in degrees Kelvin, and t0 is the statistical average of time intervals between collisions of free electrons in the resistor, which is on the order of 10^{−12} seconds. [16, p. 105]
16.2 Power Spectral Density (PSD)

An electrical engineer instinctively thinks of signals and linear systems in terms of their frequency-domain descriptions. Linear systems are characterized by their frequency response (the transfer
function), and signals are expressed in terms of the relative amplitudes and phases of their frequency components (the Fourier
transform). From the knowledge of the input spectrum and transfer function, the response of a linear system to a given signal can be
obtained in terms of the frequency content of that signal. This is
an important procedure for deterministic signals. We may wonder
if similar methods may be found for random processes.
In the study of stochastic processes, the power spectral density
function, SX (f ), provides a frequency-domain representation of the
time structure of X(t). Intuitively, SX (f ) is the expected value
of the squared magnitude of the Fourier transform of a sample
function of X(t).
You may recall that not all functions of time have Fourier transforms. For many functions that extend over infinite time, the Fourier transform does not exist. Sample functions x(t) of a stationary stochastic process X(t) are usually of this nature. To work with these functions in the frequency domain, we begin with XT(t), a truncated version of X(t). It is identical to X(t) for −T ≤ t ≤ T and 0 elsewhere. We use F{XT}(f) to represent the Fourier transform of XT(t) evaluated at the frequency f.

Definition 16.19. Consider a WSS process X(t). The power spectral density (PSD) is defined as

    SX(f) = lim_{T→∞} (1/(2T)) E[ |F{XT}(f)|² ] = lim_{T→∞} (1/(2T)) E[ | ∫_{−T}^{T} X(t) e^{−j2πft} dt |² ].

We refer to SX(f) as a density function because SX(f) df can be interpreted as the amount of power in X(t) in the small band of frequencies from f to f + df.
16.20. Wiener-Khinchine theorem: the PSD of a WSS random process is the Fourier transform of its autocorrelation function:

    SX(f) = ∫_{−∞}^{+∞} RX(τ) e^{−j2πfτ} dτ

and

    RX(τ) = ∫_{−∞}^{+∞} SX(f) e^{j2πfτ} df.

One important consequence is

    RX(0) = E[X²(t)] = ∫_{−∞}^{+∞} SX(f) df.

Example 16.21. For the thermal noise in Example 16.18, the corresponding PSD is

    SN(f) = 2kTG / (1 + (2πf t0)²)   watts/hertz.
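The following Python sketch (added here for illustration; the parameter values are those quoted for Figure 33 below) numerically Fourier-transforms RN(τ) = kTG e^{−|τ|/t0}/t0 and compares the result with the closed-form PSD 2kTG/(1 + (2πf t0)²), as the Wiener-Khinchine theorem predicts.

    import numpy as np

    k, T, G, t0 = 1.38e-23, 298.15, 0.1, 3e-12     # Boltzmann constant, temperature, conductance, t0

    tau = np.linspace(-1e-10, 1e-10, 200_001)      # fine grid over +/- 100 ps (R has decayed by then)
    dtau = tau[1] - tau[0]
    R = k * T * G * np.exp(-np.abs(tau) / t0) / t0 # autocorrelation of thermal noise

    f = 2e9                                        # check at f = 2 GHz (an arbitrary test frequency)
    S_numeric = (np.sum(R * np.exp(-1j * 2 * np.pi * f * tau)) * dtau).real
    S_exact = 2 * k * T * G / (1 + (2 * np.pi * f * t0) ** 2)
    print(S_numeric, S_exact)                      # the two values should agree closely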
16.22. Observe that the thermal noise's PSD in Example 16.21 is approximately flat over the frequency range 0–10 gigahertz. As far as a typical communication system is concerned, we might as well let the spectrum be flat from 0 to ∞, i.e.,

    SN(f) = N0/2   watts/hertz,

where N0 is a constant; in this case N0 = 4kTG.



Figure 32: Fourier transforms of member functions of a random process. For simplicity, only the magnitude spectra are shown. [16, Fig. 3.9]

What we have managed to accomplish thus far is to create the random variable P, which in some sense represents the power in the process. Now we find the average value of P, i.e.,

    E{P} = E[ (1/(2T)) ∫_{−T}^{T} x_T²(t) dt ] = E[ (1/(2T)) ∫_{−∞}^{+∞} |X_T(f)|² df ].   (3.64)

Definition 16.23. Noise that has a uniform spectrum over the entire frequency range is referred to as white noise. In particular, for white noise,

    SN(f) = N0/2   watts/hertz.

• The factor 2 in the denominator is included to indicate that SN(f) is a two-sided spectrum.

• The adjective "white" comes from white light, which contains equal amounts of all frequencies within the visible band of electromagnetic radiation.

• The average power of white noise is obviously infinite.

(a) White noise is therefore an abstraction since no physical noise process can truly be white.

(b) Nonetheless, it is a useful abstraction.

• The noise encountered in many real systems can be assumed to be approximately white.

• This is because we can only observe such noise after it has passed through a real system, which will have a finite bandwidth. Thus, as long as the bandwidth of the noise is significantly larger than that of the system, the noise can be considered to have an infinite bandwidth.

• As a rule of thumb, noise is well modeled as white when its PSD is flat over a frequency band that is 3–5 times that of the communication system under consideration. [16, p 105]
Theorem 16.24. When we pass a WSS process X(t) through an LTI system whose frequency response is H(f), the PSD of the output Y(t) will be given by

    SY(f) = SX(f) |H(f)|².
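A minimal discrete-time sketch of this relation (added for illustration, not from the notes): white noise is passed through a simple two-tap lowpass FIR filter and the averaged periodogram of the output is compared with σ²|H(f)|². The filter taps and segment counts are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(2)
    N, segments = 4096, 200
    sigma2 = 1.0                                   # PSD of the discrete-time white input is sigma2
    h = np.array([0.5, 0.5])                       # simple FIR lowpass: y[n] = (x[n] + x[n-1]) / 2

    psd_est = np.zeros(N)
    for _ in range(segments):                      # average periodograms of the filter output
        x = rng.normal(0.0, np.sqrt(sigma2), N)
        y = np.convolve(x, h, mode="same")
        psd_est += np.abs(np.fft.fft(y)) ** 2 / N
    psd_est /= segments

    f = np.fft.fftfreq(N)                          # normalized frequency (cycles/sample)
    H = h[0] + h[1] * np.exp(-1j * 2 * np.pi * f)  # frequency response of the FIR filter
    psd_theory = sigma2 * np.abs(H) ** 2           # S_Y = S_X |H|^2 with S_X = sigma2

    print(psd_est[:4].round(3))
    print(psd_theory[:4].round(3))                 # the two should approximately match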



Figure 33: (a) The PSD (SN(f)) and (b) the autocorrelation (RN(τ)) of white noise and thermal noise. (Assume G = 1/10 (mhos), T = 298.15 K, and t0 = 3 × 10^{−12} seconds.) [16, Fig. 3.11]
Figure: A lowpass filter with input x(t) and output y(t). [16, Fig. 3.12]

Finally, since the noise samples of white noise are uncorrelated, if the noise is both white and Gaussian (for example, thermal noise) then the noise samples are also independent.

Example 3.7 [16]. Consider the lowpass filter given in Figure 3.12. Suppose that a (WSS) white noise process, x(t), of zero mean and PSD N0/2 is applied to the input of the filter.

(a) Find and sketch the PSD and autocorrelation function of the random process y(t) at the output of the filter.

A.3 Calculus

A.15. Integration by parts is a technique for simplifying integrals of the form

    ∫ a(x) b(x) dx.

In particular,

    ∫ f(x) g′(x) dx = f(x) g(x) − ∫ f′(x) g(x) dx.   (56)

Sometimes it is easier to remember the formula if we write it in differential form. Let u = f(x) and v = g(x). Then du = f′(x) dx and dv = g′(x) dx. Using the Substitution Rule, the integration by parts formula becomes

    ∫ u dv = uv − ∫ v du.   (57)

• The main goal in integration by parts is to choose u and dv to obtain a new integral that is easier to evaluate than the original. In other words, the goal of integration by parts is to go from an integral ∫ u dv that we don't see how to evaluate to an integral ∫ v du that we can evaluate.

• Note that when we calculate v from dv, we can use any antiderivative. In other words, we may put in v + C instead of v in (57). Had we included this constant of integration C in (57), it would have eventually dropped out. This is always the case in integration by parts.
For definite integrals, the formula corresponding to (56) is

    ∫_a^b f(x) g′(x) dx = f(x) g(x)|_a^b − ∫_a^b f′(x) g(x) dx.   (58)

The corresponding u and v notation is

    ∫_a^b u dv = uv|_a^b − ∫_a^b v du.   (59)

It is important to keep in mind that the variables u and v in this formula are functions of x and that the limits of integration in (59) are limits on the variable x. Sometimes it is helpful to emphasize this by writing (59) as

    ∫_{x=a}^{x=b} u dv = uv|_{x=a}^{x=b} − ∫_{x=a}^{x=b} v du.   (60)

Repeated application of integration by parts gives

    ∫ f(x) g(x) dx = f(x) G1(x) + Σ_{i=1}^{n−1} (−1)^i f^{(i)}(x) G_{i+1}(x) + (−1)^n ∫ f^{(n)}(x) G_n(x) dx,   (61)

where f^{(i)}(x) = (d^i/dx^i) f(x), G1(x) = ∫ g(x) dx, and G_{i+1}(x) = ∫ G_i(x) dx.

A convenient method for organizing the computations into two columns is called tabular integration by parts, shown in Figure 34, which can be used to derive (61).
Figure 34: Tabular integration by parts. (To see this, note that ∫ f(x) g(x) dx = f(x) G1(x) − ∫ f^{(1)}(x) G1(x) dx, ∫ f^{(1)}(x) G1(x) dx = f^{(1)}(x) G2(x) − ∫ f^{(2)}(x) G2(x) dx, and so on.)

Figure 35: Examples of integration by parts using Figure 34 (e.g., ∫ x² e^{3x} dx, ∫ e^x sin x dx, and ∫ x^n e^{ax} dx).
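As a quick check of the repeated-integration-by-parts pattern (an added illustration, not part of the notes), the sympy snippet below evaluates ∫ x² e^{3x} dx, one of the examples in Figure 35, and verifies the result by differentiation.

    import sympy as sp

    x = sp.symbols('x')
    f = x**2 * sp.exp(3*x)

    F = sp.integrate(f, x)                 # sympy applies integration by parts repeatedly
    print(sp.simplify(F))                  # e^(3x) * (x**2/3 - 2*x/9 + 2/27), up to an additive constant
    print(sp.simplify(sp.diff(F, x) - f))  # 0, confirming F is an antiderivative of f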

Example A.16. Use integration by parts to compute the following integrals:

(a) ∫ x ln x dx.

(b) ∫ x² e^x dx.

Solution:

(a) ∫ x ln x dx = (x²/2) ln x − ∫ (x²/2)(1/x) dx = (x²/2) ln x − (1/2) ∫ x dx = (x²/2) ln x − x²/4 + C.

(b) ∫ x² e^x dx = x²(e^x) − (2x)(e^x) + (2)(e^x) + C = e^x(x² − 2x + 2) + C.
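A sympy sketch (added for verification, not in the original notes) reproducing both antiderivatives in Example A.16:

    import sympy as sp

    x = sp.symbols('x', positive=True)           # positivity keeps log(x) simple

    print(sp.integrate(x * sp.log(x), x))        # x**2*log(x)/2 - x**2/4
    print(sp.integrate(x**2 * sp.exp(x), x))     # (x**2 - 2*x + 2)*exp(x)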

A.17. Integration involving the Gaussian function: There are several important results in probability that are derived from such integrations. It is probably easier to remember or start with the formula for the Gaussian pdf because we know that it should integrate to 1:

    ∫_{−∞}^{+∞} (1/(√(2π) σ)) e^{−(1/2)((x−m)/σ)²} dx = 1.   (62)

To actually evaluate (prove) such an integral, we simplify it by a change of variable to get an equivalent expression:

    ∫_{−∞}^{+∞} (1/√(2π)) e^{−x²/2} dx = 1.   (63)

Even this simplified form is quite tricky to evaluate. The typical procedure is to consider the square of the integral:

    ( ∫_{−∞}^{+∞} (1/√(2π)) e^{−x²/2} dx )² = ( ∫_{−∞}^{+∞} (1/√(2π)) e^{−x²/2} dx )( ∫_{−∞}^{+∞} (1/√(2π)) e^{−y²/2} dy ).

After combining the product on the right into a double integral, we change from Cartesian to polar coordinates. Let x = r cos(θ) and y = r sin(θ). Then x² + y² = r² and dx dy = r dr dθ. This gives

    ( ∫_{−∞}^{+∞} (1/√(2π)) e^{−x²/2} dx )² = (1/(2π)) ∫_0^{∞} ∫_0^{2π} r e^{−r²/2} dθ dr
                                            = (1/(2π)) · 2π ∫_0^{∞} r e^{−r²/2} dr
                                            = [ −e^{−r²/2} ]_0^{∞} = 1,

which completes the proof.


Now that we have derive (63), it can then be used to show
several important results some of which are provided below.
Example A.18. Analytically derive the following facts:

(a) ∫_{−∞}^{+∞} e^{−αx²} dx = √(π/α) for α > 0.

(b) ∫_{−∞}^{+∞} x (1/√(2π)) e^{−x²/2} dx = 0.

    Remark: This shows that E[X] = 0 when X ∼ N(0, 1).

(c) ∫_{−∞}^{+∞} x² (1/√(2π)) e^{−x²/2} dx = 1.

    Hint: Write x² e^{−x²/2} as x · (x e^{−x²/2}) and use integration by parts.

    Remark: This shows that E[X²] = Var X = 1 when X ∼ N(0, 1).

(d) ∫_{−∞}^{+∞} x² e^{−αx²} dx = (1/2)√(π/α³) for α > 0.

(e) ∫_{0}^{+∞} x² e^{−αx²} dx = (1/4)√(π/α³) for α > 0.

(f) ∫_{−∞}^{+∞} e^{sx} (1/√(2π)) e^{−x²/2} dx = e^{s²/2}.

    Hint: Complete the square: x² − 2sx = (x − s)² − s².

    Remark: This shows that when X ∼ N(0, 1), E[e^{sX}] = e^{s²/2}.

    To find the Fourier transform of fX, simply substitute s = −jω = −j2πf to get E[e^{−jωX}] = e^{−ω²/2} = e^{−2π²f²}.

    To find the characteristic function of the standard Gaussian X, we substitute s = jt to get φX(t) = E[e^{jtX}] = e^{−t²/2}.

(g) When X ∼ N(m, σ²),

    (i) E[e^{sX}] = e^{sm + (1/2)s²σ²}.

    (ii) Fourier transform: F{fX}(f) = ∫ fX(x) e^{−jωx} dx = e^{−jωm − (1/2)ω²σ²} = e^{−j2πfm − 2π²f²σ²}.

    (iii) Characteristic function: φX(t) = E[e^{jtX}] = e^{jtm − (1/2)σ²t²}.
Solution:

(a) Let y = √(2α) x. Then (1/2)y² = αx² and dx = (1/√(2α)) dy. Hence,

    ∫_{−∞}^{+∞} e^{−αx²} dx = (1/√(2α)) ∫_{−∞}^{+∞} e^{−y²/2} dy.

We have already shown (63), which says that ∫_{−∞}^{+∞} (1/√(2π)) e^{−y²/2} dy = 1. Hence, ∫_{−∞}^{+∞} e^{−y²/2} dy = √(2π) and

    ∫_{−∞}^{+∞} e^{−αx²} dx = √(2π)/√(2α) = √(π/α).

(b) x e^{−x²/2} is an odd function, so the integral is 0.

(c) Use integration by parts: separating x² e^{−x²/2} = x · (x e^{−x²/2}) (so that u = x and dv = x e^{−x²/2} dx, giving v = −e^{−x²/2}), we get

    ∫_{−∞}^{+∞} x² e^{−x²/2} dx = −x e^{−x²/2} |_{−∞}^{+∞} + ∫_{−∞}^{+∞} e^{−x²/2} dx = 0 + √(2π) = √(2π),

so ∫_{−∞}^{+∞} x² (1/√(2π)) e^{−x²/2} dx = 1.

(d) Let y = √(2α) x. Then (1/2)y² = αx² and dx = (1/√(2α)) dy. Hence,

    ∫_{−∞}^{+∞} x² e^{−αx²} dx = ∫_{−∞}^{+∞} (1/(2α)) y² e^{−y²/2} (1/√(2α)) dy = (2α)^{−3/2} ∫_{−∞}^{+∞} y² e^{−y²/2} dy = (2α)^{−3/2} √(2π) = (1/2)√(π/α³).

(e) x² e^{−αx²} is an even function. Hence,

    ∫_{−∞}^{+∞} x² e^{−αx²} dx = 2 ∫_0^{+∞} x² e^{−αx²} dx,

so the half-line integral is half of the answer in (d), namely (1/4)√(π/α³).

(f) Applying the hint, we have

    ∫_{−∞}^{+∞} e^{sx} (1/√(2π)) e^{−x²/2} dx = ∫_{−∞}^{+∞} (1/√(2π)) e^{−(1/2)(x² − 2sx + s²)} e^{(1/2)s²} dx = e^{(1/2)s²} ∫_{−∞}^{+∞} (1/√(2π)) e^{−(1/2)(x−s)²} dx = e^{(1/2)s²}.

(g) For X ∼ N(m, σ²), we have

    fX(x) = (1/(√(2π)σ)) e^{−(1/2)((x−m)/σ)²}.

    (i) With the substitution x = σy + m,

    E[e^{sX}] = ∫_{−∞}^{+∞} e^{sx} (1/(√(2π)σ)) e^{−(1/2)((x−m)/σ)²} dx = ∫_{−∞}^{+∞} e^{s(σy+m)} (1/√(2π)) e^{−y²/2} dy = e^{sm} ∫_{−∞}^{+∞} e^{(sσ)y} (1/√(2π)) e^{−y²/2} dy = e^{sm} e^{(1/2)(sσ)²} = e^{sm + (1/2)s²σ²}.

    (ii) To find the Fourier transform of fX, simply substitute s = −jω = −j2πf into the answer from part (g.i).

    (iii) To find the characteristic function of X, we substitute s = jt into the answer from part (g.i).
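As an added cross-check (not part of the original notes), sympy can verify several of these Gaussian integrals symbolically. Restricting s to be positive below is only to keep the symbolic integration simple; the identities hold for all real s.

    import sympy as sp

    x = sp.symbols('x', real=True)
    a, s = sp.symbols('alpha s', positive=True)

    # parts (a) and (d)
    print(sp.integrate(sp.exp(-a * x**2), (x, -sp.oo, sp.oo)))          # sqrt(pi)/sqrt(alpha)
    print(sp.integrate(x**2 * sp.exp(-a * x**2), (x, -sp.oo, sp.oo)))   # sqrt(pi)/(2*alpha**(3/2))

    # part (c): second moment of the standard Gaussian
    phi = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)
    print(sp.integrate(x**2 * phi, (x, -sp.oo, sp.oo)))                 # 1

    # part (f): E[e^{sX}] for X ~ N(0, 1)
    print(sp.simplify(sp.integrate(sp.exp(s * x) * phi, (x, -sp.oo, sp.oo))))   # exp(s**2/2)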
