Thammasat University
School of Information, Computer and Communication Technology

ECS315 2013/1    Part I.1    Dr. Prapun

1
Whether you like it or not, probabilities rule your life. If you have
ever tried to make a living as a gambler, you are painfully aware
of this, but even those of us with more mundane life stories are
constantly affected by these little numbers.
Example 1.1. Some examples from daily life where probability calculations are involved are the determination of insurance premiums, the introduction of new medications on the market, opinion polls, weather forecasts, and DNA evidence in courts. Probabilities also rule who you are. Did daddy pass you the X or the Y chromosome? Did you inherit grandma's big nose?

Meanwhile, in everyday life, many of us use probabilities in our language and say things like "I'm 99% certain" or "There is a one-in-a-million chance" or, when something unusual happens, ask the rhetorical question "What are the odds?". [18, p 1]
1.1 Randomness
1.2. Many clever people have thought about and debated what randomness really is, and we could get into a long philosophical discussion that could fill up a whole book. Let's not. The French mathematician Laplace (1749-1827) put it nicely:
Probability is composed partly of our ignorance, partly
of our knowledge.
Probabilists love to play with coins and dice. We like the idea of
tossing coins, rolling dice, and drawing cards as experiments that
have equally likely outcomes.
1.10. Coin flipping or coin tossing is the practice of throwing
a coin in the air to observe the outcome.
1.3
1.13. Probabilities are used in situations that involve randomness. A probability is a number used to describe how likely something is to occur, and probability (without indefinite article) is the study of probabilities. It is "the art of being certain of how uncertain you are." [18, p 24] If an event is certain to happen, it is given a probability of 1. If it is certain not to happen, it has a probability of 0. [7, p 66]
1.14. Probabilities can be expressed as fractions, as decimal numbers, or as percentages. If you toss a coin, the probability to get heads is 1/2, which is the same as 0.5, which is the same as 50%. There are no explicit rules for when to use which notation.

In daily language, proper fractions are often used and often expressed, for example, as "one in ten" instead of 1/10 ("one tenth"). This is also natural when you deal with equally likely outcomes.

Decimal numbers are more common in technical and scientific reporting when probabilities are calculated from data.

Percentages are also common in daily language and often with "chance" replacing "probability." Meteorologists, for example, typically say things like "there is a 20% chance of rain." The phrase "the probability of rain is 0.2" means the same thing.

When we deal with probabilities from a theoretical viewpoint, we always think of them as numbers between 0 and 1, not as percentages.

See also 3.5. [18, p 10]
The goal of probability theory is to compute the probability of various events of interest. Hence, we are talking about a set function which is defined on subsets of the sample space Ω.
Example 1.16. The statement "when a coin is tossed, the probability to get heads is 1/2 (50%)" is a precise statement.

(a) It tells you that you are as likely to get heads as you are to get tails.

(b) Another way to think about probabilities is in terms of average long-term behavior. In this case, if you toss the coin repeatedly, in the long run you will get roughly 50% heads and 50% tails.
1
For our class, it may be less confusing to allow event A to be any collection of outcomes (i.e., any subset of the sample space Ω). In more advanced courses, when we deal with an uncountable Ω, we limit our interest to only some subsets of Ω. Technically, the collection of these subsets must form a σ-algebra.
Definition 1.19. Let A be one of the events of a random experiment. If we conduct a sequence of n independent trials of this experiment, and if the event A occurs in N(A, n) out of these n trials, then the fraction N(A, n)/n converges, as n grows large, to the probability of A:

P(A) = lim_{n→∞} N(A, n)/n.
Associativity
A ∪ (B ∪ C) = (A ∪ B) ∪ C

Distributivity
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

de Morgan laws
(A ∪ B)^c = A^c ∩ B^c
(A ∩ B)^c = A^c ∪ B^c
2.6. Disjoint Sets: Sets A and B are said to be disjoint if they share no common element, that is, A ∩ B = ∅.
2.8. For a set of sets, to avoid the repeated use of the word "set", we will call it a collection/class/family of sets.
In this case, the subsets are indexed or labeled by an index i taking values in an index or label set I.
Set Language      Event Language
A                 A occurs
A^c               A does not occur
A ∪ B             Either A or B occurs
A ∩ B             Both A and B occur
We use a technique called the diagonal argument to prove that a set is not countable and hence uncountable.
3 Classical Probability

Definition 3.1. The classical probability of an event A is defined as

P(A) = |A| / |Ω|.    (1)
Because we will not rely on Definition 3.1 beyond this section, we will not worry about
how to prove these properties. In Section 5, we will prove the same properties in a more
general setting.
Example 3.4 (Slides). When rolling two dice, there are 36 (equiprobable) possibilities.
P [sum of the two dice = 5] = 4/36.
Though one of the finest minds of his age, Leibniz was not immune to blunders: he thought it just as easy to throw 12 with a pair of dice as to throw 11. The truth is

P [sum of the two dice = 11] = 2/36,
P [sum of the two dice = 12] = 1/36.
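A quick MATLAB check of these counts (an added sketch, not part of the original notes):

% Enumerate all 36 equally likely outcomes of two fair dice and count sums.
[d1, d2] = ndgrid(1:6, 1:6);       % every ordered pair (d1, d2)
s = d1 + d2;
P5  = nnz(s == 5)  / numel(s)      % 4/36
P11 = nnz(s == 11) / numel(s)      % 2/36
P12 = nnz(s == 12) / numel(s)      % 1/36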
Definition 3.5. In the world of gambling, probabilities are often expressed by odds. To say that the odds are n:1 against the event A means that it is n times as likely that A does not occur than that it occurs. In other words, P(A^c) = nP(A), which implies

P(A) = 1/(n + 1) and P(A^c) = n/(n + 1).
"Odds" here has nothing to do with even and odd numbers.
The odds also mean what you will win, in addition to getting your
stake back, should your guess prove to be right. If I bet $1 on a
horse at odds of 7:1, I get back $7 in winnings plus my $1 stake.
The bookmaker will break even in the long run if the probability
of that horse winning is 1/8 (not 1/7). Odds are even when they
are 1:1 - win $1 and get back your original $1. The corresponding
probability is 1/2.
3.6. It is important to remember that classical probability relies
on the assumption that the outcomes are equally likely.
Example 3.7. Mistake made by the famous French mathematician Jean le Rond d'Alembert (18th century), the author of several works on probability. Consider tossing a fair coin twice. d'Alembert argued:

"The number of heads that turns up in those two tosses can be 0, 1, or 2. Since there are three outcomes, the chances of each must be 1 in 3."
ECS315 2013/1    Part I.2    Dr. Prapun

4.1 Four Principles
The art of applying the addition principle is to partition the set S to be counted into manageable parts; that is, parts which we can readily count. But this statement needs to be qualified. If we partition S into too many parts, then we may have defeated ourselves. For instance, if we partition S into parts each containing only one element, then applying the addition principle amounts to counting the elements of S one at a time, and nothing has been gained.
the ends of the original branches, and so forth. The size of the set then equals the number of branches in the last level of the tree, and this quantity equals n1 × n2.
4.7. Multiplication Principle (Rule of product): When a procedure/operation can be broken down into m steps, such that there are n1 options for step 1, and such that after the completion of step i − 1 (i = 2, . . . , m) there are ni options for step i (for each way of completing step i − 1), the number of ways of performing the procedure is n1 × n2 × ⋯ × nm.

In set-theoretic terms, if the sets S1, . . . , Sm are finite, then |S1 × S2 × ⋯ × Sm| = |S1| · |S2| ⋯ |Sm|.

For k finite sets A1, . . . , Ak, there are |A1| ⋯ |Ak| k-tuples of the form (a1, . . . , ak) where each ai ∈ Ai.
Example 4.8. Suppose that a deli offers three kinds of bread, three kinds of cheese, four kinds of meat, and two kinds of mustard. How many different meat and cheese sandwiches can you make?

First choose the bread. For each choice of bread, you then have three choices of cheese, which gives a total of 3 × 3 = 9 bread/cheese combinations (rye/swiss, rye/provolone, rye/cheddar, wheat/swiss, wheat/provolone ... you get the idea). Then choose among the four kinds of meat, and finally between the two types of mustard or no mustard at all. You get a total of 3 × 3 × 4 × 3 = 108 different sandwiches.

Suppose that you also have the choice of adding lettuce, tomato, or onion in any combination you want. This choice gives another 2 × 2 × 2 = 8 combinations (you have the choice "yes" or "no" three times) to combine with the previous 108, so the total is now 108 × 8 = 864.

That was the multiplication principle. In each step you have several choices, and to get the total number of combinations, multiply. It is fascinating how quickly the number of combinations grows. Just add one more type of bread, cheese, and meat, respectively, and the number of sandwiches becomes 1,920. It would take years to try them all for lunch. [18, p 33]
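A small MATLAB sketch (added here; not from [18]) that confirms the product-rule count by brute-force enumeration:

% Count sandwiches by explicit enumeration and by the multiplication principle.
nBread = 3; nCheese = 3; nMeat = 4; nMustard = 3;   % two mustards or no mustard
count = 0;
for b = 1:nBread
  for c = 1:nCheese
    for m = 1:nMeat
      for s = 1:nMustard
        count = count + 1;            % one distinct sandwich per tuple (b,c,m,s)
      end
    end
  end
end
[count, nBread*nCheese*nMeat*nMustard]   % both 108
count * 2^3                              % 864 with the lettuce/tomato/onion options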
Example 4.9 (Slides). In 1961, Raymond Queneau, a French poet
and novelist, wrote a book called One Hundred Thousand Billion
Poems. The book has ten pages, and each page contains a sonnet,
which has 14 lines. There are cuts between the lines so that each
line can be turned separately, and because all lines have the same
rhyme scheme and rhyme sounds, any such combination gives a
readable sonnet. The number of sonnets that can be obtained in this way is thus 10^14, which is indeed a hundred thousand billion.
Somebody has calculated that it would take about 200 million
years of nonstop reading to get through them all. [18, p 34]
Example 4.10. There are 2^n binary strings/sequences of length n.

Example 4.11. For a finite set A, the cardinality of its power set 2^A is |2^A| = 2^{|A|}.
(b) What are the chances that he pulls out a pair of matching
socks?
We have

P(A) = (6^4 − 5^4)/6^4 = 1 − (5/6)^4 ≈ 0.518

and

P(B) = (36^24 − 35^24)/36^24 = 1 − (35/36)^24 ≈ 0.491.

Therefore, the first case is more probable.
Remark 1: Probability theory was originally inspired by gambling problems. In 1654, Chevalier de Méré invented a gambling system which bet even money^6 on event B above. However, when he began losing money, he asked his mathematician friend Pascal to analyze his gambling system. Pascal discovered that the Chevalier's system would lose about 51 percent of the time. Pascal became so interested in probability that, together with another famous mathematician, Pierre de Fermat, he laid the foundation of probability theory. [U-X-L Encyclopedia of Science]

Remark 2: de Méré originally claimed to have discovered a contradiction in arithmetic. De Méré correctly knew that it was advantageous to wager on the occurrence of event A, but his experience as a gambler taught him that it was not advantageous to wager on the occurrence of event B. He calculated P(A) = 1/6 + 1/6 + 1/6 + 1/6 = 4/6 and similarly P(B) = 24 × 1/36 = 24/36, which is the same as P(A). He mistakenly claimed that this evidenced a contradiction to the arithmetic law of proportions, which says that 4/6 should be the same as 24/36. Of course, we know that he could not simply add up the probabilities from each toss. (By de Méré's logic, the probability of at least one head in two tosses of a fair coin would be 2 × 0.5 = 1, which we know cannot be true.) [22, p 3]
4.16. Division Principle (Rule of quotient): When a finite set S is partitioned into equal-sized parts of m elements each, there are |S|/m parts.
6
Even money describes a wagering proposition in which if the bettor loses a bet, he or she
stands to lose the same amount of money that the winner of the bet would win.
4.2
∏_{i=0}^{r−1} (n − i) = n × (n − 1) × ⋯ × (n − (r − 1))    (r terms)    = n!/(n − r)!,    for r ≤ n.
Definition 4.20. For any positive integer n, the symbol n!, pronounced "n factorial," is defined as the product of all positive integers less than or equal to n.

(a) 0! = 1! = 1

(b) n! = n × (n − 1)!

(c) n! = ∫_0^∞ e^{−t} t^n dt
(d) Computation:
(i) MATLAB: Use factorial(n). Since double-precision numbers only have about 15 digits, the answer is only accurate for n ≤ 21. For larger n, the answer will have the right magnitude and is accurate in the first 15 digits.

(ii) Google's web search box built-in calculator: Use n!
(e) Approximation: Stirling's Formula [5, p. 52]:

n! ≈ √(2πn) n^n e^{−n} = √(2πe) e^{(n + 1/2) ln(n/e)}.    (2)
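To see how good the approximation is, here is a small MATLAB comparison (an added sketch, not part of the original notes):

% Compare n! with Stirling's approximation (2).
n = [5 10 20 50];
exact    = factorial(n);
stirling = sqrt(2*pi*n) .* n.^n .* exp(-n);
relativeError = abs(stirling - exact) ./ exact   % shrinks as n grows (roughly like 1/(12n))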
2) (1 − 1/n)^n ≈ 1/e ≈ 0.37 for large n. [Szekely86, p 14]
Example 4.24. (Slides) Probability of birthday coincidence: the probability that there are at least two people who have the same birthday in a group of r persons.

More generally, let X1, . . . , Xr be i.i.d. uniformly distributed on a finite set {a1, . . . , an}. Then the probability that at least one value appears more than once is

pu(n, r) = 1 − ∏_{i=1}^{r−1} (1 − i/n) ≈ 1 − e^{−r(r−1)/(2n)}.

Figure 1: pu(n, r): the probability of the event that at least one element appears twice when a random sample of size r is taken with replacement from a population of n elements. (The plot marks the case r = 23, n = 365.)
Birthday Paradox: In a group of 23 randomly selected people, the probability that at least two will share a birthday (assuming birthdays are equally likely to occur on any given day of the year^10) is about 0.5.

At first glance it is surprising that the probability of 2 people having the same birthday is so large^11, since there are only 23 people compared with 365 days on the calendar. Some of the surprise disappears if you realize that there are C(23, 2) = 253 pairs of people who are going to compare their birthdays. [3, p. 9]
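A two-line MATLAB check of these numbers (an added sketch, not part of the original notes):

% Probability of at least one shared birthday among r = 23 people (n = 365 days).
n = 365;  r = 23;
pExact  = 1 - prod(1 - (1:r-1)/n)       % exact value, about 0.5073
pApprox = 1 - exp(-r*(r-1)/(2*n))       % exponential approximation, about 0.5000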
Example 4.26. Another variant of the birthday coincidence paradox: The group size must be at least 253 people if you want a probability > 0.5 that someone will have the same birthday as you. [3, Ex. 1.13] (The probability is given by 1 − (364/365)^r.)
10
In reality, birthdays are not uniformly distributed. In that case, the probability of a match only becomes larger for any deviation from the uniform distribution. This result can be mathematically proved. Intuitively, you might better understand the result by thinking of a group of people coming from a planet on which people are always born on the same day.
11
In other words, it was surprising that the size needed to have 2 people with the same birthday was so small.
n!/(n1! n2! ⋯ nr!).

Example 4.29. The number of permutations of AABC is 4!/2! = 12.
cd
(ii) Excel: combin(n,r)
(iii) Mathcad: combin(n,r)
(iv) Maple: binomial(n,r)
(d)
(e)
(f)
(g)
(h)
For example, there are 10 ways to write 3 as an ordered sum of three nonnegative integers:

0+0+3, 0+1+2, 0+2+1, 0+3+0, 1+0+2, 1+1+1, 1+2+0, 2+0+1, 2+1+0, 3+0+0.
There are C(n+r−1, r) = C(n+r−1, n−1) distinct vectors x = (x1, x2, . . . , xn) of nonnegative integers such that x1 + x2 + ⋯ + xn = r. We use n − 1 bars to separate r 1's.

(a) Suppose we further require that the xi are strictly positive (xi ≥ 1); then there are C(r−1, n−1) solutions.

(b) Extra Lower-bound Requirement: Suppose we further require that xi ≥ ai where the ai are some given nonnegative integers; then the number of solutions is

C(r − (a1 + a2 + ⋯ + an) + n − 1, n − 1).

Note that here we work with the equivalent problem y1 + y2 + ⋯ + yn = r − Σ_{i=1}^{n} ai, where yi ≥ 0.
Consider the distribution of r = 10 indistinguishable balls into n = 5 distinguishable cells. Then we are only concerned with the number of balls in each cell. Using n − 1 = 4 bars, we can divide r = 10 stars into n = 5 groups. For example,

****|***||**|*

would mean (4, 3, 0, 2, 1). In general, there are C(n+r−1, r) ways of arranging the bars and stars.
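A quick MATLAB check of this count (an added sketch, not from the original notes):

% Stars and bars: r = 10 indistinguishable balls into n = 5 distinguishable cells.
n = 5;  r = 10;
nchoosek(n + r - 1, r)                   % C(14,10) = 1001
% Brute-force check: count nonnegative solutions of x1 + ... + x5 = 10.
count = 0;
for x1 = 0:r, for x2 = 0:r, for x3 = 0:r, for x4 = 0:r
  if r - x1 - x2 - x3 - x4 >= 0, count = count + 1; end
end, end, end, end
count                                    % also 1001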
4.34. Unordered sampling with replacement: There are n items. We sample r out of these n items with replacement. Because the order in the sequences is not important in this kind of sampling, two samples are distinguished by the number of each item in the sequence. In particular, suppose r letters are drawn
is defined as

∏_{i=1}^{r} C(n − (n1 + ⋯ + n_{i−1}), ni) = C(n, n1) C(n − n1, n2) C(n − n1 − n2, n3) ⋯ = n! / (n1! n2! ⋯ nr!),

where n1 + n2 + ⋯ + nr = n.
ECS315 2013/1    Part II    Dr. Prapun

5 Probability Foundations
P(∪_{n=1}^∞ An) = Σ_{n=1}^∞ P(An).
5.2. P(∅) = 0.
Finite additivity: If A1, . . . , An are pairwise disjoint, then

P(∪_{i=1}^{n} Ai) = Σ_{i=1}^{n} P(Ai).    (5)
It is not possible to go backwards and use finite additivity to derive countable additivity
(P3).
P(A) = Σ_{n=1}^∞ P({an}).

(b) Similarly, if A is finite, e.g. A = {a1, a2, . . . , a_{|A|}}, then

P(A) = Σ_{n=1}^{|A|} P({an}).
For event A that is uncountable, the properties in 5.4 are not enough to evaluate P (A).
Example 5.5. A random experiment can result in one of the outcomes {a, b, c, d} with probabilities 0.1, 0.3, 0.5, and 0.1, respectively. Let A denote the event {a, b}, B the event {b, c, d}, and C
the event {d}.
P (A) =
P (B) =
P (C) =
P (Ac ) =
P (A B) =
P (A C) =
5.6. Monotonicity: If A ⊂ B, then P(A) ≤ P(B).
P(A ∪ B) ≤ P(A) + P(B).

Approximation: If P(A) ≫ P(B), then we may approximate P(A ∪ B) by P(A).
Example 5.15 (Slides). Combining error probabilities from various sources in DNA testing
Example 5.16. In his bestseller Innumeracy, John Allen Paulos
tells the story of how he once heard a local weatherman claim that
there was a 50% chance of rain on Saturday and a 50% chance of
rain on Sunday and thus a 100% chance of rain during the weekend.
Clearly absurd, but what is the error?
Answer: Faulty use of the addition rule (5)!
If we let A denote the event that it rains on Saturday and B the event that it rains on Sunday, in order to use P(A ∪ B) = P(A) + P(B), we must first confirm that A and B cannot occur at the same time (i.e., that they are disjoint), which is clearly not the case here.
P(∪_{i=1}^∞ Ai) ≤ Σ_{i=1}^∞ P(Ai).
P(A) = Σ_{i=1}^∞ P(A ∩ Bi).
We must have

P({ωi}) = 1/n for all i.

Now, given any finite^22 event A, we can apply 5.4 to get

P(A) = Σ_{ω∈A} P({ω}) = Σ_{ω∈A} 1/n = |A|/n = |A|/|Ω|.
22
In classical probability, the sample space is finite; therefore, any event is also finite.
Example 6.1. Roll a fair dice. . .

Sneak peek:

Figure 3: Conditional Probability Example: Sneak Peek
6.1
Definition 6.3. Conditional Probability: The conditional probability P(A|B) of event A, given that event B (≠ ∅) occurred, is given by

P(A|B) = P(A ∩ B) / P(B).    (6)
Some ways to say^23 or express the conditional probability P(A|B) are:

"the probability of A, given B"
23
Note also that although the symbol P(A|B) itself is practical, its phrasing in words can be so unwieldy that in practice less formal descriptions are used. For example, we refer to "the probability that a tested-positive person has the disease" instead of saying "the conditional probability that a randomly chosen person has the disease given that the test for this person returns a positive result."
24
Here, the statement assumes P(B) > 0 because it considers P(A|B). The concept of independence to be defined in Section 6.2 will not rely directly on conditional probability and therefore it will include the case where P(B) = 0.
P(∪_{n=1}^∞ An | B) = Σ_{n=1}^∞ P(An|B).

In particular, if A1 ∩ A2 = ∅,

P(A1 ∪ A2 | B) = P(A1|B) + P(A2|B).
6.9. More Properties:

P(A|Ω) = P(A)
P(A^c|B) = 1 − P(A|B)
P(A ∩ B|B) = P(A|B)
P(A1 ∪ A2|B) = P(A1|B) + P(A2|B) − P(A1 ∩ A2|B)
P(A ∩ B) ≤ P(A|B)
P(A|B) = P(A ∩ B)/P(B) = (|A ∩ B|/|Ω|) / (|B|/|Ω|) = |A ∩ B| / |B|.
Example 6.14. You know that roughly 5% of all used cars have
been flood-damaged and estimate that 80% of such cars will later
develop serious engine problems, whereas only 10% of used cars
that are not flood-damaged develop the same problems. Of course,
no used car dealer worth his salt would let you know whether your
car has been flood damaged, so you must resort to probability
calculations. What is the probability that your car will later run
into trouble?
You might think about this problem in terms of proportions.
P(A) = Σ_i P(A|Bi) P(Bi).    (8)
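Applying (8) to Example 6.14 (a small MATLAB sketch added here; B1 = the car was flood-damaged, B2 = it was not, A = it later develops engine problems):

PB = [0.05 0.95];            % P(B1), P(B2)
PAgivenB = [0.80 0.10];      % P(A|B1), P(A|B2)
PA = sum(PAgivenB .* PB)     % 0.05*0.8 + 0.95*0.1 = 0.135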
25
The tree diagram is useful for helping you understand the process. However, when the number of possible cases is large (many Bi in the partition), drawing the tree diagram may be too time-consuming and therefore you should also learn how to apply the total probability theorem directly without the help of the tree diagram.
P(B|A) = P(A|B) P(B)/P(A).

P(Bk|A) = P(A|Bk) P(Bk)/P(A) = P(A|Bk) P(Bk) / Σ_i P(A|Bi) P(Bi).
P (D |TP ) =
Figure 5: Probability P(D|TP) that a person will have the disease given that the test result is positive. The conditional probability is evaluated as a function of pD, which tells how common the disease is. Three values of the test error probability pTE are shown.
Example 6.20. Medical Diagnostic: Because a new medical procedure has been shown to be effective in the early detection of an
illness, a medical screening of the population is proposed. The
probability that the test correctly identifies someone with the illness as positive is 0.99, and the probability that the test correctly
identifies someone without the illness as negative is 0.95. The incidence of the illness in the general population is 0.0001. You take
the test, and the result is positive. What is the probability that
you have the illness? [15, Ex. 2-37]
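A worked check of Example 6.20 using Bayes' theorem (an added MATLAB sketch, not part of the original notes):

pIll = 0.0001;                        % incidence of the illness (prior)
pPosGivenIll  = 0.99;                 % P(test positive | ill)
pPosGivenWell = 1 - 0.95;             % P(test positive | not ill)
pPos = pPosGivenIll*pIll + pPosGivenWell*(1 - pIll);
pIllGivenPos = pPosGivenIll*pIll / pPos   % about 0.002, i.e., roughly 0.2%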
If the test was not given on Monday, what is the probability that it is given on Tuesday? The probability that Tuesday is chosen to start with is 1/5, but we are now asking for the conditional
to start with is 1/5, but we are now asking for the conditional
probability that the test is given on Tuesday, given that it was not
given on Monday. As there are now four days left, this conditional
probability is 1/4. Similarly, the conditional probabilities that the
test is given on Wednesday, Thursday, and Friday conditioned on
that it has not been given thus far are 1/3, 1/2, and 1, respectively.
We could define the surprise index each day as the probability
that the test is not given. On Monday, the surprise index is therefore 0.8, on Tuesday it has gone down to 0.75, and it continues to
go down as the week proceeds with no test given. On Friday, the
surprise index is 0, indicating absolute certainty that the test will
be given that day. Thus, it is possible to give a surprise test but
not in a way so that you are equally surprised each day, and it is
never possible to give it so that you are surprised on Friday. [18, p 23-24]
Example 6.26. Today Bayesian analysis is widely employed throughout science and industry. For instance, models employed to determine car insurance rates include a mathematical function describing, per unit of driving time, your personal probability of having
zero, one, or more accidents. Consider, for our purposes, a simplified model that places everyone in one of two categories: high
risk, which includes drivers who average at least one accident each
year, and low risk, which includes drivers who average less than
one.
If, when you apply for insurance, you have a driving record
that stretches back twenty years without an accident or one that
goes back twenty years with thirty-seven accidents, the insurance
company can be pretty sure which category to place you in. But if
you are a new driver, should you be classified as low risk (a kid who
obeys the speed limit and volunteers to be the designated driver)
or high risk (a kid who races down Main Street swigging from a half-empty $2 bottle of Boone's Farm apple wine)?
Since the company has no data on you, it might assign you
an equal prior probability of being in either group, or it might
6.2
Event-based Independence
Plenty of random things happen in the world all the time, most of
which have nothing to do with one another. If you toss a coin and
I roll a dice, the probability that you get heads is 1/2 regardless of
the outcome of my dice. Events that are unrelated to each other
in this way are called independent.
Definition 6.27. Two events A, B are called (statistically^26) independent if

P(A ∩ B) = P(A) · P(B).    (9)

Notation: A ⫫ B.

Equivalently (when P(B) > 0),

P(A|B) = P(A).    (10)

In other words, the unconditional and the conditional probabilities are the same. We can almost use (10) as the definition for independence. This is what we mentioned in 6.7. However, we use (9) instead because it (1) also works with events whose probabilities are zero and (2) also has clear symmetry in the expression (so that A ⫫ B and B ⫫ A can clearly be seen as the same). In fact, in 6.32, we show how (10) can be used to define independence with an extra condition that deals with the case when zero probability is involved.
26
Sometimes our definition for independence above does not agree with the everyday-language use of the word independence. Hence, many authors use the term statistical independence to distinguish it from other definitions.
Example 6.29. [26, Ex. 5.4] Which of the following pairs of events
are independent?
hearts
spades
If A ⫫ B, then the following pairs are also independent: A ⫫ B^c, A^c ⫫ B, A^c ⫫ B^c.

The two statements "A ⫫ B" (independence) and "A ∩ B = ∅" (disjointness) can occur simultaneously only when P(A) = 0 and/or P(B) = 0.
P(∩_{j∈J} Aj) = ∏_{j∈J} P(Aj) for every J ⊂ [n] with |J| ≥ 2.

P(A ∩ D) = P(A)P(D),
P(B ∩ C) = P(B)P(C),
6.3 Bernoulli Trials

(Figure: n Bernoulli trials, each with success probability 1/n, plotted for n from 1 to 50. As n grows, P[#successes ≥ 1] → 1 − 1/e ≈ 0.6321, P[#successes = 0] → 1/e ≈ 0.3679, and P[#successes = 2] → 1/(2e) ≈ 0.1839.)
Example 6.51. Digital communication over unreliable channels: Consider a communication system below
This channel can be described as a channel that introduces random bit errors with probability p.

A crude digital communication system would put binary information into the channel directly; the receiver then takes whatever value shows up at the channel output as what the sender transmitted. Such a communication system would directly suffer a bit error probability of p.

In situations where this error rate is not acceptable, error-control techniques are introduced to reduce the error rate in the delivered information.

One method of reducing the error rate is to use error-correcting codes:

A simple error-correcting code is the repetition code. An example of such a code is described below:

(a) At the transmitter, the encoder box performs the following task:

(i) To send a 1, it will send 11111 through the channel.

(ii) To send a 0, it will send 00000 through the channel.
(b) When the five bits pass through the channel, they may be corrupted. Assume that the channel is binary symmetric and that it acts on each of the bits independently.

(c) At the receiver, we (or more specifically, the decoder box) get 5 bits, but some of the bits may have been changed by the channel. To determine what was sent from the transmitter, the receiver applies the majority rule: Among the 5 received bits,

(i) if #1 > #0, then it claims that 1 was transmitted,

(ii) if #0 > #1, then it claims that 0 was transmitted.
Figure 8: Bit error probability for a simple system that uses a repetition code at the transmitter (repeat each bit n times) and majority vote at the receiver. The channel is assumed to be binary symmetric with bit error probability p. Curves are shown for n = 1, 5, 15, and 25.
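For a concrete number (a small MATLAB sketch added here; p = 0.1 is just an assumed example value):

% Error probability of the length-5 repetition code with majority decoding:
% the decoder errs when 3 or more of the 5 transmitted copies are flipped.
p = 0.1;
k = 3:5;
pMajorityError = sum(arrayfun(@(j) nchoosek(5, j), k) .* p.^k .* (1-p).^(5-k))
% about 0.00856, much smaller than the uncoded error probability of 0.1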
Exercise 6.52 (F2011). Kakashi and Gai are eternal rivals. Kakashi
is a little stronger than Gai and hence for each time that they fight,
the probability that Kakashi wins is 0.55. In a competition, they
fight n times (where n is odd). Assume that the results of the fights
are independent. The one who wins more will win the competition.
Suppose n = 3. What is the probability that Kakashi wins the competition?
7 Random Variables
27
The term "random variable" is a misnomer. Technically, if you look at the definition carefully, a random variable is a deterministic function; that is, it is not random and it is not a variable. [Toby Berger][26, p 254]
As a function, it is simply a rule that maps points/outcomes ω in Ω to real numbers.
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

N maps  TTT ↦ 0;  TTH, THT, HTT ↦ 1;  THH, HTH, HHT ↦ 2;  HHH ↦ 3.
28
Later on, you will see that 1) a default support of a discrete random variable is the set
of values where the pmf is strictly positive and 2) a default support of a continuous random
variable is the set of values where the pdf is strictly positive.
29
Many references (including [15] and MATLAB) use fX (x) for pmf instead of pX (x). We will
NOT use fX (x) for pmf. Later, we will define fX (x) as a probability density function which
will be used primarily for another type of random variable (continuous r.v.)
8.9. Graphical Description of the Probability Distribution: Traditionally, we use a stem plot to visualize pX. To do this, we graph a pmf by marking on the horizontal axis each value with nonzero probability and drawing a vertical bar with length proportional to the probability.
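For example, the pmf of N, the number of heads in three fair coin tosses (see Example 8.17 below), can be drawn with a stem plot in MATLAB (an added sketch):

x = 0:3;
p = [1 3 3 1]/8;                % p_N(0), p_N(1), p_N(2), p_N(3)
stem(x, p, 'filled');
xlabel('x'); ylabel('p_N(x)');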
8.10. Any pmf p(·) satisfies two properties:

(a) p(·) ≥ 0

(b) there exist numbers x1, x2, x3, . . . such that Σ_k p(xk) = 1 and p(x) = 0 for other x.

When you are asked to verify that a function is a pmf, check these two properties.
8.11. Finding probability from pmf: for any subset B of R, we can find

P[X ∈ B] = Σ_{xk ∈ B} P[X = xk] = Σ_{xk ∈ B} pX(xk).
(c) P [X = 1]
(d) P [X 2]
(e) P [X > 3]
Example 8.17. Continue from Examples 7.5, 7.11, and 8.8 where N is defined as the number of heads in a sequence of three coin tosses. We have

pN(0) = pN(3) = 1/8 and pN(1) = pN(2) = 3/8.
(a) FN (0)
(b) FN (1.5)
8.18. Facts:

For any discrete r.v. X, FX is a right-continuous, staircase function of x with jumps at a countable set of points xk.

When you are given the cdf of a discrete random variable, you can derive its pmf from the locations and sizes of the jumps. If a jump happens at x = c, then pX(c) is the same as the amount of the jump at c. At a location x where there is no jump, pX(x) = 0.
(Figure: an example cdf of a discrete random variable,

F(x) = 0 for x < 10, 0.25 for 10 ≤ x < 30, 0.75 for 30 ≤ x < 50, and 1 for 50 ≤ x,

shown as a staircase plot.)
For any countable set C, P[X ∈ C] = 0; FX is continuous.

25) Every random variable can be written as a sum of a discrete random variable and a continuous random variable.

26) A random variable can have at most countably many points x such that P[X = x] > 0.

8.20. Characterizing properties of cdf:^30

(CDF1) FX is non-decreasing (and 0 ≤ FX ≤ 1).

(CDF2) FX is right-continuous: for every x, FX(x) = lim_{y↓x} FX(y).

(CDF3) lim_{x→−∞} FX(x) = 0 and lim_{x→+∞} FX(x) = 1.

In addition, P[X = x] = FX(x) − FX(x⁻), the jump or saltus in FX at x; P((x, y]) = FX(y) − FX(x); and FX uniquely determines the distribution of X.

8.21. FX can be written as

FX(x) = Σ_{xk} pX(xk) u(x − xk),

where u(·) is the unit step function.
30
These properties hold for any type of random variables. Moreover, for any function F
that satisfies these three properties, there exists a random variable X whose CDF is F .
As mentioned in 7.12, we often omit a discussion of the underlying sample space of the random experiment and directly describe the distribution of a particular random variable.
32
or, with minor manipulation, only uniformly spaced numbers
pX(x) =
  1 − p,  x = 0,
  p,      x = 1,      where p ∈ (0, 1).
  0,      otherwise,

Write X ~ B(1, p) or X ~ Bernoulli(p).
X takes only two values: 0 or 1.
Definition 8.27. X is a binary random variable if

pX(x) =
  1 − p,  x = a,
  p,      x = b,      where p ∈ (0, 1) and b > a.
  0,      otherwise,

X takes only two values: a or b.
8.32. In 1837, the famous French mathematician Poisson introduced a probability distribution that would later come to be known
as the Poisson distribution, and this would develop into one of the
most important distributions in probability theory. As is often remarked, Poisson did not recognize the huge practical importance of
the distribution that would later be named after him. In his book,
he dedicates just one page to this distribution. It was Bortkiewicz
in 1898, who first discerned and explained the importance of the
Poisson distribution in his book Das Gesetz der Kleinen Zahlen
(The Law of Small Numbers). [22]
Definition 8.33. X is a Poisson random variable with parameter α > 0 if

pX(x) =
  e^{−α} α^x / x!,  x = 0, 1, 2, . . .
  0,                otherwise.

In MATLAB, use poisspdf(x,alpha).

Write X ~ P(α) or Poisson(α).

We will see later in Example 9.7 that α is the average or expected value of X.

Instead of X, a Poisson random variable is often denoted by Λ. The parameter α is often replaced by λ, where λ is referred to as the intensity/rate parameter of the distribution.
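A short MATLAB sketch (added here; alpha = 2 is just an example value) evaluating this pmf both directly and with poisspdf:

alpha = 2;
x = 0:10;
p = exp(-alpha) * alpha.^x ./ factorial(x);   % same values as poisspdf(x, alpha)
stem(x, p, 'filled'); xlabel('x'); ylabel('p_X(x)');
sum(p)                                        % close to 1; the tail beyond x = 10 is tiny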
Example 8.34. The first use of the Poisson model is said to have
been by a Prussian (German) physician, Bortkiewicz, who found
that the annual number of late-19th-century Prussian (German)
soldiers kicked to death by horses fitted a Poisson distribution [6,
p 150],[3, Ex 2.23]33 .
33
I. J. Good and others have argued that the Poisson distribution should be called the
Bortkiewicz distribution, but then it would be very difficult to say or write.
(b) Find the probability that there are at least 2 hits during the
time interval above.
Name                 Support SX             pX(x)
Uniform              {1, 2, . . . , n}      1/n
Uniform              {0, 1, . . . , n−1}    1/n
Bernoulli B(1, p)    {0, 1}                 1 − p for x = 0;  p for x = 1
Binomial B(n, p)     {0, 1, . . . , n}      C(n, x) p^x (1 − p)^{n−x}
Geometric G0(β)      N ∪ {0}                (1 − β) β^x
Geometric G1(β)      N                      (1 − β) β^{x−1}
Poisson P(α)         N ∪ {0}                e^{−α} α^x / x!
8.4 Some Remarks

8.51. At this point, we have a couple of ways to define probabilities that are associated with a random variable X:

(a) We can define P[X ∈ B] for every possible set B.

(b) For a discrete random variable, we only need to define its pmf pX(x), which is defined as pX(x) = P[X = x] = P[X ∈ {x}].

(c) We can also define the cdf FX(x).

Definition 8.52. If pX(c) = 1, that is P[X = c] = 1, for some constant c, then X is called a degenerate random variable.
ECS315 2013/1
9
Two numbers are often used to summarize a probability distribution for a random variable X. The mean is a measure of the center or middle of the probability distribution, and the variance is a
measure of the dispersion, or variability in the distribution. These
two measures do not uniquely identify a probability distribution.
That is, two different distributions can have the same mean and
variance. Still, these measures are simple, useful summaries of the
probability distribution of X.
9.1
The most important characteristic of a random variable is its expectation. Synonyms for expectation are expected value, mean,
and first moment.
The definition of expectation is motivated by the conventional idea of numerical average. Recall that the numerical average of n numbers, say a1, a2, . . . , an, is

(1/n) Σ_{k=1}^{n} ak.

We use the average to summarize or characterize the entire collection of numbers a1, . . . , an with a single value.
Example 9.1. Consider 10 numbers: 5, 2, 3, 2, 5, −2, 3, 2, 5, 2. The average is

(5 + 2 + 3 + 2 + 5 + (−2) + 3 + 2 + 5 + 2)/10 = 27/10 = 2.7.

We can rewrite the above calculation as

−2 × (1/10) + 2 × (4/10) + 3 × (2/10) + 5 × (3/10) = 2.7.
Definition 9.2. Suppose X is a discrete random variable. We define the expectation (or mean or expected value) of X by

EX = Σ_x x P[X = x] = Σ_x x pX(x).    (15)
For example, if X is a Bernoulli random variable with success probability p, then EX = 0 × (1 − p) + 1 × p = p. Note that, since X takes only the values 0 and 1, its expected value p is never seen as an actual observed value of X.
9.5. Interpretation: The expected value is in general not a typical
value that the random variable can take on. It is often helpful to
interpret the expected value of a random variable as the long-run
average value of the variable over many independent repetitions
of an experiment.
Example 9.6. pX(x) =
  1/4,  x = 0,
  3/4,  x = 2,
  0,    otherwise.
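Applying definition (15) to this pmf (a tiny MATLAB sketch, added here):

x = [0 2];  p = [1/4 3/4];
EX = sum(x .* p)        % 0*(1/4) + 2*(3/4) = 1.5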
For a Poisson random variable X with parameter α (cf. Example 9.7),

EX = Σ_{i=0}^∞ i e^{−α} α^i/i! = e^{−α} Σ_{i=1}^∞ α^i/(i − 1)! = α e^{−α} Σ_{k=0}^∞ α^k/k! = α e^{−α} e^{α} = α.
For X ~ B(n, p),

EX = Σ_{i=0}^{n} i C(n, i) p^i (1 − p)^{n−i} = Σ_{i=1}^{n} [n!/((i − 1)!(n − i)!)] p^i (1 − p)^{n−i}
   = n Σ_{i=1}^{n} C(n − 1, i − 1) p^i (1 − p)^{n−i}.

Let k = i − 1. Then,

EX = n Σ_{k=0}^{n−1} C(n − 1, k) p^{k+1} (1 − p)^{n−(k+1)} = np Σ_{k=0}^{n−1} C(n − 1, k) p^k (1 − p)^{n−1−k}.

We now have the expression in the form that we can apply the binomial theorem, which finally gives

EX = np (p + (1 − p))^{n−1} = np.
We shall revisit this example again using another approach in Example 10.45.
Example 9.9. Pascal's wager: Suppose you concede that you don't know whether or not God exists and therefore assign a 50 percent chance to either proposition. How should you weigh these odds when deciding whether to lead a pious life? If you act piously and God exists, Pascal argued, your gain (eternal happiness) is infinite. If, on the other hand, God does not exist, your loss, or negative return, is small: the sacrifices of piety. To weigh these possible gains and losses, Pascal proposed, you multiply the probability of each possible outcome by its payoff and add them all up, forming a kind of average or expected payoff. In other words, the mathematical expectation of your return on piety is one-half infinity (your gain if God exists) minus one-half a small number (your loss if he does not exist). Pascal knew enough about infinity to know that the answer to this calculation is infinite, and thus the expected return on piety is infinitely positive. Every reasonable person, Pascal concluded, should therefore follow the laws of God. [14, p 76]

Pascal's wager is often considered the founding of the mathematical discipline of game theory, the quantitative study of optimal decision strategies in games.
9.10. Technical issue: Definition (15) is only meaningful if the sum is well defined.

The sum of infinitely many nonnegative terms is always well-defined, with +∞ as a possible value for the sum.

Infinite Expectation: Consider a random variable X whose pmf is defined by

pX(x) =
  1/(c x²),  x = 1, 2, 3, . . .
  0,         otherwise.

Then c = Σ_{n=1}^∞ 1/n², which is a finite positive number (= π²/6). However,

EX = Σ_{k=1}^∞ k pX(k) = Σ_{k=1}^∞ k · 1/(c k²) = (1/c) Σ_{k=1}^∞ 1/k = +∞.
Now consider a random variable whose pmf is positive for both positive and negative values, say pX(x) = 1/(2c x²) for x = ±1, ±2, ±3, . . . (and 0 otherwise), where c = Σ_{n=1}^∞ 1/n². Then both

Σ_{k=1}^∞ k pX(k) = Σ_{k=1}^∞ k/(2c k²) = (1/(2c)) Σ_{k=1}^∞ 1/k = +∞

and

Σ_{k=−∞}^{−1} (−k) pX(k) = (1/(2c)) Σ_{k=1}^∞ 1/k = +∞,

so the sum in (15) is of the form ∞ − ∞ and EX is undefined.
9.2
Consider a random variable X with pmf

pX(x) =
  (1/c) x²,  x = 1, 2,
  0,         otherwise,

and Y = X^4. Find pY(y) and then calculate EY.
9.15. For a discrete random variable X, the pmf of a derived random variable Y = g(X) is given by

pY(y) = Σ_{x: g(x)=y} pX(x).
Note that the sum is over all x in the support of X which satisfy
g(x) = y.
Example 9.16. A binary random variable X takes only two values a and b with

P[X = b] = 1 − P[X = a] = p.

X can be expressed as X = (b − a)I + a, where I is a Bernoulli random variable with parameter p.
9.3
(b) E [2X 1]
E[X²] = Σ_{i=0}^∞ i² e^{−α} α^i/i! = α e^{−α} Σ_{i=1}^∞ i α^{i−1}/(i − 1)!.    (16)

Now,

Σ_{i=1}^∞ i α^{i−1}/(i − 1)! = Σ_{i=1}^∞ (i − 1 + 1) α^{i−1}/(i − 1)!
  = Σ_{i=1}^∞ (i − 1) α^{i−1}/(i − 1)! + Σ_{i=1}^∞ α^{i−1}/(i − 1)!
  = α Σ_{i=2}^∞ α^{i−2}/(i − 2)! + Σ_{i=1}^∞ α^{i−1}/(i − 1)!
  = α e^{α} + e^{α} = e^{α}(α + 1).

Hence E[X²] = α e^{−α} · e^{α}(α + 1) = α(α + 1) = α² + α.
9.4

Var X = E[(X − EX)²].    (17)

In some references, to avoid confusion from the two expectation symbols, they first define m = EX and then define the variance of X by

Var X = E[(X − m)²].

We can also calculate the variance via another identity:

Var X = E[X²] − (EX)².

The units of the variance are the squares of the units of the random variable.

9.29. Basic properties of variance:

Var X ≥ 0.
Var X ≤ E[X²].
Var[cX] = c² Var X.
Var[X + c] = Var X.
Var[aX + b] = a² Var X.
Figure 11: Example 9.36 shows that a random variable whose probability mass is concentrated near the mean has a smaller variance. [9, Fig. 2.9]

Example 9.36. [9, Ex. 2.27] Let X and Y be the random variables with the pmfs shown in Figure 11: X puts mass 1/3 at ±1 and 1/6 at ±2, while Y puts mass 1/6 at ±1 and 1/3 at ±2. Compute var(X) and var(Y).

Solution. By symmetry, both X and Y have zero mean, and so var(X) = E[X²] and var(Y) = E[Y²]. Write

E[X²] = (−2)²(1/6) + (−1)²(1/3) + (1)²(1/3) + (2)²(1/6) = 2,

and

E[Y²] = (−2)²(1/3) + (−1)²(1/6) + (1)²(1/6) + (2)²(1/3) = 3.

Thus, X and Y are both zero-mean random variables taking the values ±1 and ±2. But Y is more likely to take values far from its mean. This is reflected by the fact that var(Y) > var(X).

9.37. We have already talked about variance and standard deviation as numbers that indicate the spread/dispersion of the pmf. More specifically, let's imagine a pmf that is shaped like a bell curve. As the value of σX gets smaller, the spread of the pmf will be smaller and hence the pmf will look sharper. Therefore, the probability that the random variable X takes a value far from the mean will be smaller.
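A quick numerical confirmation of Example 9.36 (an added MATLAB sketch):

xVals = [-2 -1 1 2];
pX = [1/6 1/3 1/3 1/6];          % mass of X concentrated near 0
pY = [1/3 1/6 1/6 1/3];          % mass of Y concentrated away from 0
EX2 = sum(xVals.^2 .* pX)        % 2, so var(X) = 2 (X has zero mean)
EY2 = sum(xVals.^2 .* pY)        % 3, so var(Y) = 3 (Y has zero mean)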
or equivalently,

P[|X − EX| ≥ nσX] ≤ 1/n².

We now turn to random variables X for which

P[X = x] = 0 for every x ∈ R,    (18)

and whose support is always uncountable. These random variables are called continuous random variables.
10.1. We can see from (18) that the pmf is going to be useless for this type of random variable. It turns out that the cdf FX is still useful and we shall introduce another useful function called the probability density function (pdf) to replace the role of the pmf. However, integral calculus^34 is required to formulate this continuous analog of a pmf.
10.2. In some cases, the random variable X is actually discrete
but, because the range of possible values is so large, it might be
more convenient to analyze X as a continuous random variable.
Definition 10.7. We say that X is a continuous random variable^35 if we can find a (real-valued) function^36 f such that, for any set B, P[X ∈ B] has the form

P[X ∈ B] = ∫_B f(x) dx.    (19)

In particular,

P[a ≤ X ≤ b] = ∫_a^b f(x) dx.    (20)
35
To be more rigorous, this is the definition for absolutely continuous random variable. At
this level, we will not distinguish between the continuous random variable and absolutely
continuous random variable. When the distinction between them is considered, a random
variable X is said to be continuous (not necessarily absolutely continuous) when condition (18)
is satisfied. Alternatively, condition (18) is equivalent to requiring the cdf FX to be continuous.
Another fact worth mentioning is that if a random variable is absolutely continuous, then it
is continuous. So, absolute continuity is a stronger condition.
36
Strictly speaking, the δ-function is not a function; so we cannot use a δ-function here.
(Figure: the area under the pdf f(x) between a and b equals P(a ≤ x ≤ b).)
37
(a) It produces pseudorandom numbers; the numbers seem random but are actually the output of a deterministic algorithm.
(b) It produces a double-precision floating-point number, represented in the computer by 64 bits. Thus MATLAB distinguishes no more than 2^64 unique double-precision floating-point numbers. By comparison, there are uncountably many real numbers in the interval from 0 to 1.
10.2
10.13. Given the cdf FX(x), we can find the pdf fX(x) as follows:

If FX is differentiable at x, we will set fX(x) = (d/dx) FX(x).

If FX is not differentiable at x, we can set the value of fX(x) to be any value. Usually, the values are selected to give a simple expression. (In many cases, they are simply set to 0.)
38
Lebesgue-a.e., to be exact.
39
More specifically, if g = f Lebesgue-a.e., then g is also a pdf for X.
FX(x) =
  0,      x < 0,
  x²/4,   0 ≤ x ≤ 2,
  1,      x > 2.

Observe that it is differentiable at each point x except at x = 2. The probability density function is obtained by differentiation of the cdf, which gives

fX(x) =
  x/2,  0 < x < 2,
  0,    otherwise.

At x = 2, where FX has no derivative, it does not matter what value we give to fX. Here, we set it to be 0.
10.16. In many situations when you are asked to find the pdf, it may be easier to find the cdf first and then differentiate it to get the pdf.
Exercise 10.17. A point is picked at random in the inside of a
circular disk with radius r. Let the random variable X denote the
distance from the center of the disk to this point. Find fX (x).
10.18. Unlike the cdf of a discrete random variable, the cdf of a
continuous random variable has no jumps and is continuous everywhere.
10.19. pX(x) = P[X = x] = P[x ≤ X ≤ x] = ∫_x^x fX(t) dt = 0. Again, it makes no sense to speak of the probability that X will take on a pre-specified value. This probability is always zero.

10.20. P[X = a] = P[X = b] = 0. Hence,

P[a < X < b] = P[a ≤ X < b] = P[a < X ≤ b] = P[a ≤ X ≤ b].
Definition 10.23. A continuous random variable is called exponential if its pdf is given by

fX(x) =
  λ e^{−λx},  x > 0,
  0,          x ≤ 0,

for some λ > 0.
Theorem 10.24. Any nonnegative40 function that integrates to
one is a probability density function (pdf) of some random
variable [9, p.139].
40
or nonnegative a.e.
for some integrable function f. Since P(X ∈ R) = 1, the function f must integrate to one; i.e., ∫_{−∞}^{∞} f(t) dt = 1. Further, since P(X ∈ B) ≥ 0 for all B, it can be shown that f must be nonnegative. A nonnegative function that integrates to one is called a probability density function (pdf). (Later, when more than one random variable is involved, we write fX(x) instead of f(x).)

Note that for random variables with a density,

P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a < X < b),

since the corresponding integrals over an interval are not affected by whether or not the endpoints are included or excluded.

Figure 4.1. (a) P(a ≤ X ≤ b) = ∫_a^b f(t) dt is the area of the shaded region under the density f(t). (b) P(x ≤ X ≤ x + ∆x) = ∫_x^{x+∆x} f(t) dt is the area of the shaded vertical strip. [9, Fig. 4.1]

10.25. Intuition/Interpretation:

Usually, the set B is an interval such as B = [a, b]. In this case, the probability of the random variable X taking on a value in a small interval around a point c is approximately equal to fX(c) ∆c, when ∆c is the length of the interval:

P[x < X ≤ x + ∆x] ≈ fX(x) ∆x as ∆x → 0.

The number fX(x) itself is not a probability. In particular, fX(c) is a relative measure of the likelihood that the random variable X will take a value in the immediate neighborhood of the point c. Stated differently, the pdf fX(x) expresses how densely the probability mass of the random variable X is smeared out in the neighborhood of the point x. Hence the name "density function."
10.26. Histogram and pdf approximation [22, p 143 and 145]:

(Figure: from histogram to pdf. With 5000 samples, the histogram of the number of occurrences per bin, normalized to the frequency (%) of occurrences per unit bin width, approximates the pdf; the estimated pdf is overlaid on the true pdf.)
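The same histogram-to-pdf normalization can be reproduced in MATLAB (an added sketch; the values m = 14, sigma = 1.5, N = 5000 are assumed example parameters chosen to resemble the figure's axes):

m = 14; sigma = 1.5; N = 5000;
x = m + sigma*randn(N, 1);                   % N samples of X ~ N(m, sigma^2)
[counts, centers] = hist(x, 30);             % raw histogram: occurrences per bin
binWidth = centers(2) - centers(1);
pdfEstimate = counts / (N * binWidth);       % normalize so the total bar area is 1
bar(centers, pdfEstimate, 1); hold on;
t = linspace(min(x), max(x), 200);
plot(t, exp(-(t - m).^2/(2*sigma^2)) / (sigma*sqrt(2*pi)), 'LineWidth', 2);
hold off; xlabel('x'); ylabel('estimated pdf vs. true pdf');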
E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx.    (22)

In particular,

E[X²] = ∫_{−∞}^{∞} x² fX(x) dx,

Var X = ∫_{−∞}^{∞} (x − EX)² fX(x) dx = E[X²] − (EX)².
10.30. If we compare other characteristics of discrete and continuous random variables, we find that with discrete random variables, many facts are expressed as sums. With continuous random variables, the corresponding facts are expressed as integrals.

10.31. All of the properties for the expectation and variance of discrete random variables also work for continuous random variables:

(a) Intuition/interpretation of the expected value: As n → ∞, the average of n independent samples of X will approach EX. This observation is known as the Law of Large Numbers.

(b) For c ∈ R, E[c] = c.

(c) For constants a, b, we have E[aX + b] = aEX + b.

(d) E[Σ_{i=1}^{n} ci gi(X)] = Σ_{i=1}^{n} ci E[gi(X)].

(e) Var X = E[X²] − (EX)².

(f) Var X ≥ 0.

(g) Var X ≤ E[X²].

(h) Var[aX + b] = a² Var X.

(i) σ_{aX+b} = |a| σX.
10.32. Chebyshev's Inequality:

P[|X − EX| ≥ α] ≤ σX²/α²,

or equivalently,

P[|X − EX| ≥ nσX] ≤ 1/n².

This inequality uses the variance to bound the tail probability of a random variable.
For example, if Var X = 4, then P[|X − EX| ≥ 5] ≤ Var X / 5² = 4/25 = 0.16.
10.4.1 Uniform Distribution

In MATLAB,

(a) use X = a+(b-a)*rand or X = random('Uniform',a,b) to generate the RV,
Figure 15: The pdf and cdf for the uniform random variable. [16, Fig. 3.5]
Example 10.37 (F2011). Suppose X is uniformly distributed on the interval (1, 2). (X ~ U(1, 2).)

(a) Plot the pdf fX(x) of X.

(b) Plot the cdf FX(x) of X.
The problems explore other pdf models. Some of these arise when a random variable
Example 10.39. [9, Ex. 4.1 p. 140-141] In coherent radio communications, the phase difference between the transmitter and the
receiver, denoted by , is modeled as having a uniform density on
[, ].
(a) P [ 0] =
1
2
(b) P 2 =
3
4
Exercise 10.40. Show that EX = (a + b)/2, Var X = (b − a)²/12, and E[X²] = (1/3)(b² + ab + a²).

10.4.2 Gaussian Distribution
10.41. This is the most widely used model for the distribution of a random variable. When you have many independent random variables, a fundamental result called the central limit theorem (CLT) (informally) says that the sum (technically, the average) of them can often be approximated by a normal distribution.

Definition 10.42. Gaussian random variables:

(a) Often called normal random variables because they occur so frequently in practice.

(b) In MATLAB, use X = random('Normal',m,sigma) or X = sigma*randn + m.

(c) fX(x) = (1/(√(2π) σ)) e^{−(1/2)((x − m)/σ)²}.
(d) In MATLAB, use normcdf(x,m,sigma) or cdf('Normal',x,m,sigma). In Excel, use NORMDIST(x,m,sigma,TRUE).

(e) We write X ~ N(m, σ²).

From [16]: A Gaussian (or normal) random variable is described by the following pdf:

fX(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)),    (3.16)

where μ and σ² are two parameters whose meaning is described later. It is usually denoted as N(μ, σ²). Figure 3.6 shows sketches of the pdf and cdf of a Gaussian random variable. The Gaussian random variable is the most important and frequently encountered random variable in communications. This is because thermal noise, which is the major source of noise in communication systems, has a Gaussian distribution.

10.43. EX = m and Var X = σ².

10.44. Important probabilities:

P[|X − μ| < σ] = 0.6827;    P[|X − μ| > σ] = 0.3173;
P[|X − μ| > 2σ] = 0.0455;   P[|X − μ| < 2σ] = 0.9545.

These values are illustrated in Figure 19.

Example 10.45. Figure 20 compares several deviation scores and the normal distribution in terms of the standard score.
109
0.2
0
0.2
0.4
0.6
0.8
0.2
0.4
0.6
0.8
t (s)
(b)
4
Histogram
Gaussian fit
Laplacian fit
3.5
3
fx(x) (1/V)
2.5
2
1.5
1
0.5
Fig. 3.14
0
1
0.5
0
x (V)
0.5
(a) A sample skeletal muscle (emg) signal, and (b) its histogram and pdf fits.
Figure 17: Electrical activity of a skeletal muscle: (a) A sample skeletal muscle
(emg) signal, and (b) its histogram and pdf fits. [16, Fig. 3.14]
1=
=
=
2
fx (x)dx
K1 eax dx
2
2
2
K12
eax dx
eay dy
x=
y=
2
2
K12
ea(x +y ) dxdy.
x= y=
131
2
(3.103)
Figure 18: Plots of the zero-mean Gaussian pdf for different values of the standard deviation σX. [16, Fig. 3.15]

Fourier transform: ∫_{−∞}^{∞} fX(x) e^{jωx} dx = e^{jωm − ω²σ²/2}.

25) P[X > x] = Q((x − μ)/σ);  P[X < x] = 1 − Q((x − μ)/σ) = Q((μ − x)/σ). For example, P[|X − μ| > 2σ] = 2Q(2) ≈ 0.0455 and P[|X − μ| < 2σ] ≈ 0.9545.

Figure 19: Probability density function of X ~ N(μ, σ²).

26) Q-function: Q(z) = ∫_z^∞ (1/√(2π)) e^{−x²/2} dx corresponds to P[X > z] where X ~ N(0, 1); that is, Q(z) is the probability of the tail of N(0, 1).

(a) Q is a decreasing function with Q(0) = 1/2.

(b) Q(−z) = 1 − Q(z).

(c) Q⁻¹(1 − Q(z)) = −z.

(d) Q(x) = (1/π) ∫_0^{π/2} e^{−x²/(2 sin²θ)} dθ;  Q²(x) = (1/π) ∫_0^{π/4} e^{−x²/(2 sin²θ)} dθ.

(e) (d/dx) Q(x) = −(1/√(2π)) e^{−x²/2};  (d/dx) Q(f(x)) = −(1/√(2π)) e^{−f²(x)/2} (d/dx) f(x).

Figure 20: Comparison of Several Deviation Scores and the Normal Distribution. (Standard scores have a mean of 0 and a standard deviation of 1.0. Scholastic Aptitude Test scores have a mean of 500 and a standard deviation of 100. Binet Intelligence Scale scores have a mean of 100 and a standard deviation of 16. In each case there are 34 percent of the scores between the mean and one standard deviation, 14 percent between one and two standard deviations, and 2 percent beyond two standard deviations.) [From Applying Psychology: Critical and Creative Thinking, 1992 Prentice-Hall, Inc.]
132
Alfred Binet, who devised the first general aptitude test at the beginning of the 20th
century, defined intelligence as the ability to make adaptations. The general purpose of the
test was to determine which children in Paris could benefit from school. Binets test, like its
subsequent revisions, consists of a series of progressively more difficult tasks that children of
different ages can successfully complete. A child who can solve problems typically solved by
children at a particular age level is said to have that mental age. For example, if a child can
successfully do the same tasks that an average 8-year-old can do, he or she is said to have a
mental age of 8. The intelligence quotient, or IQ, is defined by the formula:
IQ = 100 (Mental Age/Chronological Age)
There has been a great deal of controversy in recent years over what intelligence tests measure.
Many of the test items depend on either language or other specific cultural experiences for
correct answers. Nevertheless, such tests can rather effectively predict school success. If
school requires language and the tests measure language ability at a particular point of time
in a childs life, then the test is a better-than-chance predictor of school performance.
(b) P[−2 ≤ Z ≤ 2]

(Figure: plot of Q(z) for z from −3 to 3; Q decreases from nearly 1 to nearly 0, with Q(0) = 0.5.)
For X ~ N(0, σ²) [Papoulis, p 111]:

E[X^k] = (k − 1) σ² E[X^{k−2}] =
  0,                       k odd,
  1·3·5⋯(k − 1) σ^k,       k even,

E[|X|^k] =
  √(2/π) · 2·4·6⋯(k − 1) · σ^k,   k odd,
  1·3·5⋯(k − 1) σ^k,              k even.
(f) The complementary error function: erfc(z) = 1 − erf(z) = 2Q(√2 z).

29) Error function (MATLAB): erf(z) = (2/√π) ∫_0^z e^{−x²} dx = 1 − 2Q(√2 z) corresponds to P[|X| < z] where X ~ N(0, 1/2).

(a) lim_{z→∞} erf(z) = 1

(b) erf(−z) = −erf(z)
10.4.3 Exponential Distribution
(c) MATLAB:

X = exprnd(1/lambda) or random('exp',1/lambda)
fX(x) = exppdf(x,1/lambda) or pdf('exp',x,1/lambda)
FX(x) = expcdf(x,1/lambda) or cdf('exp',x,1/lambda)

Example 10.54. Suppose X ~ E(λ); find P[1 < X < 2].
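For a concrete value (an added MATLAB sketch; lambda = 1 is just an example), using FX(x) = 1 − e^{−λx}:

lambda = 1;
pDirect = exp(-lambda*1) - exp(-lambda*2)             % e^{-lambda} - e^{-2*lambda}, about 0.2325
pViaCdf = expcdf(2, 1/lambda) - expcdf(1, 1/lambda)   % same number (expcdf takes the mean 1/lambda)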
10.57. EX = 1/λ.

For discrete random variables, geometric random variables satisfy the memoryless property.

Memoryless property of the exponential distribution: for B, x ≥ 0,

P[X > B + x | X > B] = (∫_{B+x}^∞ λ e^{−λτ} dτ) / (∫_B^∞ λ e^{−λτ} dτ) = e^{−λ(B+x)} / e^{−λB} = e^{−λx}.
For continuous random variables, it turns out that we can't^43 simply integrate the pdf of X to get the pdf of Y.

43
When you apply Equation (23) to continuous random variables, what you get is 0 = 0, which is true but neither interesting nor useful.
FY(y) =
  FX((y − b)/a),        a > 0,
  1 − FX((y − b)/a),    a < 0.

Finally, the fundamental theorem of calculus and the chain rule give

fY(y) = (d/dy) FY(y) =
  (1/a) fX((y − b)/a),     a > 0,
  −(1/a) fX((y − b)/a),    a < 0.

Note that we can further simplify the final formula by using the absolute value:

fY(y) = (1/|a|) fX((y − b)/a),  a ≠ 0.    (24)
Example 10.63. Amplitude modulation in certain communication systems can be accomplished using various nonlinear devices
such as a semiconductor diode. Suppose we model the nonlinear
device by the function Y = X 2 . If the input X is a continuous
random variable, find the density of the output Y = X 2 .
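A sketch of one standard way to work this out (added here, following the find-the-cdf-first recipe of 10.16; the notes' own in-class solution may differ in presentation): for y > 0,

FY(y) = P[X² ≤ y] = P[−√y ≤ X ≤ √y] = FX(√y) − FX(−√y),

and differentiating with the chain rule gives

fY(y) = (1/(2√y)) [fX(√y) + fX(−√y)]  for y > 0,

with fY(y) = 0 for y < 0. This agrees with the general formula fY(y) = Σ_k fX(xk)/|g′(xk)| in the summary table below, since g(x) = x² has the two solutions xk = ±√y and |g′(±√y)| = 2√y.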
                    Discrete                              Continuous

P[X ∈ B]            Σ_{x∈B} pX(x)                         ∫_B fX(x) dx

P[X = x]            pX(x) = F(x) − F(x⁻)                  0

Interval prob.      P_X((a,b]) = F(b) − F(a)              P_X((a,b]) = P_X([a,b]) = P_X([a,b)) = P_X((a,b))
                    P_X([a,b]) = F(b) − F(a⁻)               = ∫_a^b fX(x) dx = F(b) − F(a)
                    P_X([a,b)) = F(b⁻) − F(a⁻)
                    P_X((a,b)) = F(b⁻) − F(a)

EX                  Σ_x x pX(x)                           ∫_{−∞}^{∞} x fX(x) dx

For Y = g(X)        pY(y) = Σ_{x: g(x)=y} pX(x)           fY(y) = (d/dy) P[g(X) ≤ y];
                    P[Y ∈ B] = Σ_{x: g(x)∈B} pX(x)        alternatively, fY(y) = Σ_k fX(xk)/|g′(xk)|,
                                                          where the xk are the real solutions of g(x) = y

E[g(X)]             Σ_x g(x) pX(x)                        ∫_{−∞}^{∞} g(x) fX(x) dx

E[X²]               Σ_x x² pX(x)                          ∫_{−∞}^{∞} x² fX(x) dx

Var X               Σ_x (x − EX)² pX(x)                   ∫_{−∞}^{∞} (x − EX)² fX(x) dx
143
144
Example 11.4. Of course, to rigorously define (any) random variables, we need to go back to the sample space Ω. Recall Example 7.4 where we considered several random variables defined on the sample space Ω = {1, 2, 3, 4, 5, 6} where the outcomes are equally likely. In that example, we defined X(ω) = ω and Y(ω) = (ω − 3)².
The first ten scores are from (ten) students in room #1. The last ten scores are from (ten) students in room #2.
Suppose we have a score report card for each student. Then, in total, we have 20 report cards.
Figure 23: In Example 11.5, we pick a report card randomly from a pile of cards.
and
P[3 ≤ X < 4, Y < 1] = P[3 ≤ X < 4 and Y < 1]
                    = P[X ∈ [3, 4) and Y ∈ (−∞, 1)].
In general,
[Some condition(s) on X, Some condition(s) on Y]
is the same as the intersection of the individual statements:
[Some condition(s) on X] ∩ [Some condition(s) on Y],
which simply means both statements happen.
More technically,
[X ∈ B, Y ∈ C] = [X ∈ B and Y ∈ C] = [X ∈ B] ∩ [Y ∈ C]
and
P[X ∈ B, Y ∈ C] = P[X ∈ B and Y ∈ C] = P([X ∈ B] ∩ [Y ∈ C]).
Remark: Linking back to the original sample space, this shorthand actually says
[X ∈ B, Y ∈ C] = [X ∈ B and Y ∈ C]
              = {ω : X(ω) ∈ B and Y(ω) ∈ C}
              = {ω : X(ω) ∈ B} ∩ {ω : Y(ω) ∈ C}
              = [X ∈ B] ∩ [Y ∈ C].
11.7. The concept of conditional probability can be straightforwardly applied to discrete random variables. For example,
P[Some condition(s) on X | Some condition(s) on Y]        (25)
is the conditional probability P(A|B) where
A = [Some condition(s) on X] and
B = [Some condition(s) on Y].
Recall that P(A|B) = P(A ∩ B)/P(B). Therefore,
P[X = x | Y = y] = P[X = x and Y = y] / P[Y = y],
and
P[3 ≤ X < 4 | Y < 1] = P[3 ≤ X < 4 and Y < 1] / P[Y < 1].
More technically,
P[X ∈ B | Y ∈ C] = P([X ∈ B] | [Y ∈ C]) = P([X ∈ B] ∩ [Y ∈ C]) / P([Y ∈ C])
                 = P[X ∈ B, Y ∈ C] / P[Y ∈ C].
Definition 11.8. Joint pmf: If X and Y are two discrete random variables (defined on the same sample space with probability measure P), the function pX,Y(x, y) defined by
pX,Y (x, y) = P [X = x, Y = y]
is called the joint probability mass function of X and Y .
(a) We can visualize the joint pmf via stem plot. See Figure 24.
(b) To evaluate the probability for a statement that involves both
X and Y random variables:
(b) Find P[X² + Y² = 13].
Definition 11.10. When both X and Y take finitely many values (both have finite supports), say SX = {x₁, ..., x_m} and SY = {y₁, ..., y_n}, respectively, we can arrange the probabilities pX,Y(x_i, y_j) in an m × n matrix

  [ pX,Y(x₁, y₁)   pX,Y(x₁, y₂)   ...   pX,Y(x₁, y_n) ]
  [ pX,Y(x₂, y₁)   pX,Y(x₂, y₂)   ...   pX,Y(x₂, y_n) ]        (26)
  [      ⋮              ⋮          ⋱         ⋮        ]
  [ pX,Y(x_m, y₁)  pX,Y(x_m, y₂)  ...   pX,Y(x_m, y_n) ]

We shall call this matrix the joint pmf matrix.
The sum of all the entries in the matrix is one.
[Figure 24: Stem plot of a joint pmf pX,Y(x, y).]

Note that pX,Y(x, y) = 0 if⁴⁴ x ∉ SX or y ∉ SY. In other words, we don't have to consider the x and y outside the supports of X and Y, respectively.

⁴⁴To see this, note that pX,Y(x, y) can not exceed pX(x) because P(A ∩ B) ≤ P(A). Now, suppose at x = a, we have pX(a) = 0. Then pX,Y(a, y) must also = 0 for any y because it can not exceed pX(a) = 0. Similarly, suppose at y = a, we have pY(a) = 0. Then pX,Y(x, a) = 0 for any x.
11.11. From the joint pmf, we can find pX(x) and pY(y) by
pX(x) = Σ_y pX,Y(x, y)        (27)
pY(y) = Σ_x pX,Y(x, y)        (28)
In this setting, pX(x) and pY(y) are called the marginal pmfs (to distinguish them from the joint one).
(a) Suppose we have the joint pmf matrix in (26). Then, the sum of the entries in the ith row is⁴⁵ pX(x_i), and the sum of the entries in the jth column is pY(y_j):
pX(x_i) = Σ_{j=1}^{n} pX,Y(x_i, y_j)   and   pY(y_j) = Σ_{i=1}^{m} pX,Y(x_i, y_j).
(b) In MATLAB, suppose we save the joint pmf matrix as P_XY; then the marginal pmf (row) vectors p_X and p_Y can be found by
p_X = (sum(P_XY,2))'
p_Y = (sum(P_XY,1))
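For instance (a made-up joint pmf matrix, used only to illustrate the commands above):

P_XY = [1/15 2/15; 4/15 8/15];  % hypothetical joint pmf matrix (rows: values of X, columns: values of Y)
p_X = (sum(P_XY,2))'            % marginal pmf of X: sum across each row
p_Y = sum(P_XY,1)               % marginal pmf of Y: sum down each column
sum(P_XY(:))                    % sanity check: all entries add up to 1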
Example 11.12. Consider the following joint pmf matrix:
(29)
⁴⁵Of course, because the support of Y is SY, we have P(A ∩ B₀) = 0. Hence, the sum can start at j = 1 instead of j = 0.

pY|X(y|1) = { ?, y = 2,
              ?, y = 4,
              0, otherwise.
(d) Find pY|X(y|3).
Definition 11.18. The joint cdf of X and Y is defined by
FX,Y(x, y) = P[X ≤ x, Y ≤ y].
(b) [X ∈ B] and [Y ∈ C] are independent events for all sets B and C.
pX(x) = { 1/4, x = 3,
          ?,   x = 4,
          0,   otherwise.
Let Y be another random variable. Assume that X and Y are i.i.d.
Find
(a) ?,
(b) the pmf of Y, and
(c) the joint pmf of X and Y.
pX,Y(x, y) = { 1/15, x = 3, y = 1,
               2/15, x = 4, y = 1,
               4/15, x = 3, y = 3,
               ?,    x = 4, y = 3,
               0,    otherwise.
(a) Are X and Y identically distributed?
(b) Are X and Y independent?
11.2
ECS315 2013/1
Part V.2
Dr.Prapun
11.3
pX,Y(x, y) = { 1/15, x = 3, y = 1,
               2/15, x = 4, y = 1,
               4/15, x = 3, y = 3,
               8/15, x = 4, y = 3,
               0,    otherwise.
Let Z = X + Y. Find the pmf of Z.
pZ(z) = Σ_{(x,y): x+y=z} pX,Y(x, y) = Σ_y pX,Y(z − y, y) = Σ_x pX,Y(x, z − x).        (30)
When X and Y are independent, this becomes
pZ(z) = Σ_y pX(z − y) pY(y) = Σ_x pX(x) pY(z − x).        (31)
Example 11.44. Suppose Λ₁ ∼ P(λ₁) and Λ₂ ∼ P(λ₂) are independent. Let Λ = Λ₁ + Λ₂. Use (31) to show⁴⁶ that Λ ∼ P(λ₁ + λ₂).
First, note that pΛ(ℓ) would be positive only on nonnegative integers because a sum of nonnegative integers (Λ₁ and Λ₂) is still a nonnegative integer. So, the support of Λ is the same as the support for Λ₁ and Λ₂. Now, we know, from (31), that
P[Λ = ℓ] = P[Λ₁ + Λ₂ = ℓ] = Σ_i P[Λ₁ = i] P[Λ₂ = ℓ − i].
⁴⁶Remark: You may feel that simplifying the sum in this example (and in Exercise 11.45) is difficult and tedious. In Section 13, we will introduce another technique which will make the answer obvious. The idea is to realize that (31) is a convolution and hence we can use the Fourier transform to work with a product in another domain.
Because Λ₁ and Λ₂ take only nonnegative integer values, the index i in the sum can be restricted to integers from 0 to ℓ:
P[Λ = ℓ] = Σ_{i=0}^{ℓ} ( e^{−λ₁} λ₁^i / i! ) ( e^{−λ₂} λ₂^{ℓ−i} / (ℓ−i)! )
        = e^{−(λ₁+λ₂)} (1/ℓ!) Σ_{i=0}^{ℓ} ( ℓ! / (i! (ℓ−i)!) ) λ₁^i λ₂^{ℓ−i}
        = e^{−(λ₁+λ₂)} (λ₁ + λ₂)^ℓ / ℓ!,
where the last equality is from the binomial theorem. Hence, the sum of two independent Poisson random variables is still Poisson!
pΛ(ℓ) = { e^{−(λ₁+λ₂)} (λ₁ + λ₂)^ℓ / ℓ!,   ℓ ∈ {0, 1, 2, ...},
          0,                              otherwise.
Exercise 11.45. Suppose B₁ ∼ B(n₁, p) and B₂ ∼ B(n₂, p) are independent. Let B = B₁ + B₂. Use (31) to show that B ∼ B(n₁ + n₂, p).
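A quick numerical sanity check of (31) for the Poisson case (a sketch; poisspdf is from the Statistics Toolbox, and the rates and truncation point are arbitrary):

lambda1 = 2; lambda2 = 3;                      % arbitrary rates for the check
ell = 0:30;                                    % truncate the (infinite) supports
p1 = poisspdf(ell, lambda1);                   % pmf of Lambda_1
p2 = poisspdf(ell, lambda2);                   % pmf of Lambda_2
pS = conv(p1, p2);                             % discrete convolution, as in (31)
max(abs(pS(1:numel(ell)) - poisspdf(ell, lambda1+lambda2)))   % should be numerically negligible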
11.4
⁴⁷Again, these are called the law/rule of the lazy statistician (LOTUS) [22, Thm 3.6 p 48], [9, p 149] because it is so much easier to use the above formula than to first find the pmf of g(X) or g(X, Y). It is also called the substitution rule [21, p 271].
Summary: formulas for a pair of discrete random variables.

P[X ∈ B]                      Σ_{x∈B} pX(x)
P[(X, Y) ∈ R]                 Σ_{(x,y): (x,y)∈R} pX,Y(x, y)
Joint to marginal             pX(x) = Σ_y pX,Y(x, y)
(Law of Total Prob.)          pY(y) = Σ_x pX,Y(x, y)
P[X > Y]                      Σ_x Σ_{y: y<x} pX,Y(x, y) = Σ_y Σ_{x: x>y} pX,Y(x, y)
P[X = Y]                      Σ_x pX,Y(x, x)
X and Y independent           pX,Y(x, y) = pX(x) pY(y) for all x, y
Conditional                   pX|Y(x|y) = pX,Y(x, y)/pY(y)
E[g(X, Y)]                    Σ_x Σ_y g(x, y) pX,Y(x, y)
Example 11.49. A binary communication link has bit-error probability p. What is the expected number of bit errors in a transmission of n bits?
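One way to see the answer (a sketch using linearity of expectation): let X_i = 1 if bit i is received in error and X_i = 0 otherwise, so that each X_i is a Bernoulli random variable with parameter p. The number of bit errors is Σ_{i=1}^{n} X_i, and
E[ Σ_{i=1}^{n} X_i ] = Σ_{i=1}^{n} E[X_i] = np.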
Theorem 11.50 (Expectation and Independence). Two random
variables X and Y are independent if and only if
E [h(X)g(Y )] = E [h(X)] E [g(Y )]
for all functions h and g.
In other words, X and Y are independent if and only if for
every pair of functions h and g, the expectation of the product
h(X)g(Y ) is equal to the product of the individual expectations.
One special case is that when X and Y are independent,
E[XY] = E[X] E[Y].        (32)
Linear Dependence:
ρX,Y = Cov[X, Y] / (σX σY)
Exercise 11.63. Suppose two fair dice are tossed. Denote by the random variable V₁ the number appearing on the first die and by the random variable V₂ the number appearing on the second die. Let X = V₁ + V₂ and Y = V₁ − V₂.
(a) Show that X and Y are not independent.
(b) Show that E[XY] = EX EY.
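A simulation sketch consistent with part (b) (the sample size is arbitrary): the empirical value of E[XY] − (EX)(EY) should be close to zero even though X and Y are clearly dependent.

N = 1e6;                        % arbitrary number of simulated tosses
V1 = randi(6, N, 1);            % first die
V2 = randi(6, N, 1);            % second die
X = V1 + V2;  Y = V1 - V2;
mean(X.*Y) - mean(X)*mean(Y)    % should be close to 0 (X and Y are uncorrelated)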
11.64. Cauchy-Schwarz Inequality:
(Cov[X, Y])² ≤ σ²_X σ²_Y.
Correlation coefficient: ρX,Y = Cov[X, Y] / (σX σY).
ρX,Y is dimensionless.
ρX,X = 1.
ρX,Y = 0 if and only if X and Y are uncorrelated.
11.67. Linear Dependence and the Cauchy-Schwarz Inequality
(a) If Y = aX + b, then ρX,Y = sign(a) = { 1, a > 0; −1, a < 0 }.
    To be rigorous, we should also require that σX > 0 and a ≠ 0.
(b) Cauchy-Schwarz Inequality: |ρX,Y| ≤ 1.
    In other words, ρX,Y ∈ [−1, 1].
In that case (Y = aX + b), |a| = σY/σX and ρX,Y = a/|a| = sgn a. Hence, ρX,Y is used to quantify linear dependence between X and Y.
The closer |ρX,Y| is to 1, the higher the degree of linear dependence between X and Y.
11.69. [21, Section 5.2.3] Example 11.68 is based on the assumption that return rates X and Y are independent from each other. In the world of investment, however, risks are more commonly reduced by combining negatively correlated funds (two funds are negatively correlated when one tends to go up as the other falls).
This becomes clear when one considers the following hypothetical situation. Suppose that two stock market outcomes ω₁ and ω₂ are possible, and that each outcome will occur with a probability of 1/2. Assume that domestic and foreign fund returns X and Y are determined by X(ω₁) = Y(ω₂) = 0.25 and X(ω₂) = Y(ω₁) = −0.10. Each of the two funds then has an expected return of 7.5%, with equal probability for actual returns of 25% and −10%. The random variable Z = ½(X + Y) satisfies Z(ω₁) = Z(ω₂) = 0.075. In other words, Z is equal to 0.075 with certainty. This means that an investment that is equally divided between the domestic and foreign funds has a guaranteed return of 7.5%.
(ii) ρ_{3X+4, 6Y−7}
Answers:
(a) (i) EX = 2.6
    (ii) P[X = Y] = 0
    (iii) P[XY < 6] = 0.2
    (iv) E[(X − 3)(Y − 2)] = 0.88
    (v) E[X(Y³ − 11Y² + 38Y)] = 104
Cov[aX + b, cY + d] = ac Cov[X, Y]
and
ρ_{aX+b, cY+d} = ac Cov[X, Y] / (|a|σX |c|σY) = (ac/|ac|) ρX,Y = sign(ac) ρX,Y.
ECS315 2013/1
Part V.3
Dr.Prapun
11.6
In this section, we start to look at more than one continuous random variable. You should find that many of the concepts and formulas are similar, if not the same as, the ones for pairs of discrete random variables which we have already studied. For discrete random variables, we use summations; here, for continuous random variables, we use integrations.
Recall that for a pair of discrete random variables, the joint pmf pX,Y(x, y) completely characterizes the probability model of the two random variables X and Y. In particular, it not only captures the probability of X and the probability of Y individually, it also captures the relationship between them. For continuous random variables, we replace the joint pmf by the joint pdf.
Definition 11.71. We say that two random variables X and Y are jointly continuous with joint pdf fX,Y(x, y) if⁴⁸ for any region R on the (x, y) plane
P[(X, Y) ∈ R] = ∬_{(x,y)∈R} fX,Y(x, y) dx dy.        (33)
⁴⁸Remark: If you want to check that a function f(x, y) is the joint pdf of a pair of random variables (X, Y) by using the above definition, you will need to check that (33) is true for any region R. This is not an easy task. Hence, we do not usually use this definition for such kind of test. There are some mathematical facts that can be derived from this definition. Such facts produce easier condition(s) than (33) but we will not talk about them here.
171
P [X B]
P [(X, Y ) R]
Discrete
P
pX (x)
xB
P
pX,Y (x, y)
(x,y):(x,y)R
Continuous
R
fX (x)dx
B
RR
{(x,y):(x,y)R}
11.73. For us, Definition 11.71 is useful because if you know that
a function f (x, y) is a joint pdf of a pair of random variables, then
you can calculate countless possibilities of probabilities involving
these two random variables via (33). (See, e.g. Example 11.76.)
However, the actual calculation of probability from (33) can be
difficult if you have a non-rectangular region R or if you have a
complicated joint pdf. In other words, the formula itself is straightforward and simple, but to carry it out may require that you review
some multi-variable integration technique from your calculus class.
11.74. Intuition/Approximation: Note also that the joint pdf's definition extends the interpretation/approximation that we previously discussed for one random variable.
Recall that for a single random variable, the pdf is a measure of probability per unit length. In particular, if you want to find the probability that the value of a random variable X falls inside some small interval, say the interval [1.5, 1.6], then this probability can be approximated by
P[1.5 ≤ X ≤ 1.6] ≈ fX(1.5) × 0.1.        (34)
More generally,
P[x ≤ X ≤ x + Δx] ≈ fX(x) Δx.        (35)
For two random variables X and Y, the joint pdf fX,Y(x, y) measures probability per unit area:
P[x ≤ X ≤ x + Δx, y ≤ Y ≤ y + Δy] ≈ fX,Y(x, y) Δx Δy.        (36)
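A small numerical illustration of (36) (a sketch; the joint pdf fX,Y(x, y) = e^{−x−y} for x, y ≥ 0 is just a convenient made-up example):

f = @(x,y) exp(-x-y).*(x>=0).*(y>=0);       % an example joint pdf
x0 = 1.5; y0 = 0.5; d = 0.01;               % a small increment
approx = f(x0,y0)*d*d;                      % right-hand side of (36)
exact = integral2(f, x0, x0+d, y0, y0+d);   % left-hand side of (36)
[approx exact]                              % the two numbers should be very close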
Table 7: Important formulas for a pair of discrete RVs and a pair of continuous RVs.

                     Discrete                               Continuous
P[(X, Y) ∈ R]        Σ_{(x,y)∈R} pX,Y(x, y)                 ∬_{(x,y)∈R} fX,Y(x, y) dx dy
Joint to marginal    pX(x) = Σ_y pX,Y(x, y)                 fX(x) = ∫_{−∞}^{+∞} fX,Y(x, y) dy
                     pY(y) = Σ_x pX,Y(x, y)                 fY(y) = ∫_{−∞}^{+∞} fX,Y(x, y) dx
P[X > Y]             Σ_x Σ_{y: y<x} pX,Y(x, y)              ∫_{−∞}^{+∞} ∫_{−∞}^{x} fX,Y(x, y) dy dx
                     = Σ_y Σ_{x: x>y} pX,Y(x, y)
P[X = Y]             Σ_x pX,Y(x, x)                         0
X and Y independent  pX,Y(x, y) = pX(x) pY(y)               fX,Y(x, y) = fX(x) fY(y)
Conditional          pX|Y(x|y) = pX,Y(x, y)/pY(y)           fX|Y(x|y) = fX,Y(x, y)/fY(y)

⁴⁹fX,Y(x, y) = ∂²/∂x∂y FX,Y(x, y). Note that when we write FX,Y(x, ∞), we mean lim_{y→∞} FX,Y(x, y). A similar limiting definition applies to FX,Y(∞, y).
11.82. Independence:
The following statements are equivalent:
(a) Random variables X and Y are independent.
(b) [X ∈ B] and [Y ∈ C] are independent events for all B, C.
There are many situations in which we observe two random variables and use their values to compute a new random variable.
Example 11.84. Signal in additive noise: When we say that a random signal X is transmitted over a channel subject to additive noise N, we mean that at the receiver, the received signal Y will be X + N. Usually, the noise is assumed to be zero-mean Gaussian noise; that is, N ∼ N(0, σ²_N) for some noise power σ²_N.
Example 11.85. In a wireless channel, the transmitted signal X is corrupted by fading (multiplicative noise). More specifically, the received signal Y at the receiver's antenna is Y = H × X.
Remark: In the actual situation, the signal is further corrupted by additive noise N and hence Y = HX + N. However, this expression for Y involves more than two random variables and hence we will not consider it here.
Table 8: Important formulas for a function of a pair of RVs. Unless stated otherwise, the function is defined as Z = g(X, Y).

                      Discrete                                Continuous
E[Z]                  Σ_x Σ_y g(x, y) pX,Y(x, y)              ∬ g(x, y) fX,Y(x, y) dx dy
P[Z ∈ B]              Σ_{(x,y): g(x,y)∈B} pX,Y(x, y)          ∬_{(x,y): g(x,y)∈B} fX,Y(x, y) dx dy
Z = X + Y             pZ(z) = Σ_y pX,Y(z − y, y)              fZ(z) = ∫_{−∞}^{+∞} fX,Y(z − y, y) dy
Z = X + Y, X, Y ind.  pX+Y = pX ∗ pY                          fX+Y = fX ∗ fY

In general, to find the pdf of W = g(X, Y), one can first find the cdf FW(w) = P[g(X, Y) ≤ w] and then differentiate: fW(w) = d/dw FW(w).
Example 11.88. Suppose X and Y are i.i.d. E(3). Find the pdf
of W = Y /X.
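A simulation sketch (assuming exprnd from the Statistics Toolbox; note that MATLAB parameterizes the exponential distribution by its mean, here 1/3). One can verify analytically that FW(w) = P[Y ≤ wX] = w/(1 + w) for w ≥ 0, which does not depend on the rate; the code compares the empirical cdf of W with this expression at two points.

N = 1e6;                       % arbitrary sample size
X = exprnd(1/3, N, 1);         % i.i.d. E(3) samples (mean 1/3)
Y = exprnd(1/3, N, 1);
W = Y ./ X;
[mean(W <= 1), 1/(1+1)]        % empirical vs. candidate cdf w/(1+w) at w = 1
[mean(W <= 4), 4/(1+4)]        % ... and at w = 4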
11.90. Observe that finding the pdf of Z = g(X, Y) is a time-consuming task. If your goal is to find E[Z], do not forget that it can be calculated directly from
E[g(X, Y)] = ∬ g(x, y) fX,Y(x, y) dx dy.
11.91. The following property is valid for any kind of random variables:
E[ Σ_i Z_i ] = Σ_i E[Z_i].
Furthermore,
E[ Σ_i g_i(X, Y) ] = Σ_i E[g_i(X, Y)].
Summary: formulas for a pair of random variables.

                     Discrete                               Continuous
P[X ∈ B]             Σ_{x∈B} pX(x)                          ∫_B fX(x) dx
P[(X, Y) ∈ R]        Σ_{(x,y)∈R} pX,Y(x, y)                 ∬_{(x,y)∈R} fX,Y(x, y) dx dy
Joint to marginal    pX(x) = Σ_y pX,Y(x, y)                 fX(x) = ∫_{−∞}^{+∞} fX,Y(x, y) dy
                     pY(y) = Σ_x pX,Y(x, y)                 fY(y) = ∫_{−∞}^{+∞} fX,Y(x, y) dx
P[X > Y]             Σ_x Σ_{y: y<x} pX,Y(x, y)              ∫_{−∞}^{+∞} ∫_{−∞}^{x} fX,Y(x, y) dy dx
P[X = Y]             Σ_x pX,Y(x, x)                         0
X and Y independent  pX,Y(x, y) = pX(x) pY(y)               fX,Y(x, y) = fX(x) fY(y)
Conditional          pX|Y(x|y) = pX,Y(x, y)/pY(y)           fX|Y(x|y) = fX,Y(x, y)/fY(y)
E[g(X, Y)]           Σ_x Σ_y g(x, y) pX,Y(x, y)             ∬ g(x, y) fX,Y(x, y) dx dy
P[g(X, Y) ∈ B]       Σ_{(x,y): g(x,y)∈B} pX,Y(x, y)         ∬_{(x,y): g(x,y)∈B} fX,Y(x, y) dx dy
Z = X + Y            pZ(z) = Σ_y pX,Y(z − y, y)             fZ(z) = ∫_{−∞}^{+∞} fX,Y(z − y, y) dy
(b) [X ∈ B] and [Y ∈ C] are independent events for all B and C.
Correlation coefficient: ρX,Y = Cov[X, Y] / (σX σY).
EX = 2/3 and EY = 1/3.
fX,Y(x, y) = 1/(2π σX σY √(1 − ρ²)) exp{ −1/(2(1 − ρ²)) [ ((x − EX)/σX)² − 2ρ ((x − EX)/σX)((y − EY)/σY) + ((y − EY)/σY)² ] }.        (37)
Important properties:
(a) ρ = Cov[X, Y]/(σX σY), i.e., the parameter ρ in (37) is the correlation coefficient ρX,Y.
(c) X and Y are independent if and only if ρ = 0.
[Figure: Samples (2000 points per panel) and the corresponding joint pdf surfaces of jointly Gaussian X and Y for several parameter settings: (σX = 1, σY = 1, ρ = 0), (σX = 1, σY = 2, ρ = 0), (σX = 1, σY = 2, ρ = 0.5), (σX = 1, σY = 2, ρ = 0.8), (σX = 1, σY = 2, ρ = 0.99), and (σX = 3, σY = 1, ρ = 0).]
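A sketch of how scatter plots like these can be generated (the parameter values below correspond to one of the panels):

sigX = 1; sigY = 2; rho = 0.8; n = 2000;
Z1 = randn(n,1);  Z2 = randn(n,1);          % independent N(0,1) samples
X = sigX*Z1;                                % X ~ N(0, sigX^2)
Y = sigY*(rho*Z1 + sqrt(1-rho^2)*Z2);       % Y ~ N(0, sigY^2) with Corr(X,Y) = rho
plot(X, Y, '.'); xlabel('x'); ylabel('y')
corrcoef(X, Y)                              % empirical correlation should be near rho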
ECS315 2013/1
Part VI
Dr.Prapun
12
12.1. Review: You may recall⁵⁰ the following properties for the cdf of discrete random variables. These properties hold for any kind of random variable.
(a) The cdf is defined as FX(x) = P[X ≤ x]. This is valid for any type of random variable.
(b) Moreover, the cdf of any kind of random variable must satisfy three properties which we have discussed earlier:
CDF1 FX is non-decreasing
CDF2 FX is right-continuous
CDF3 lim_{x→−∞} FX(x) = 0 and lim_{x→+∞} FX(x) = 1.
⁵⁰If you don't know these properties by now, you should review them as soon as possible.
12.4. We can categorize random variables into three types according to their cdfs:
(a) If FX(x) is piecewise flat with discontinuous jumps, then X is discrete.
(b) If FX(x) is a continuous function, then X is continuous.
(c) If FX(x) is a piecewise continuous function with discontinuities, then X is mixed.
Figure 28: Typical cdfs: (a) a discrete random variable, (b) a continuous random variable, and (c) a mixed random variable [16, Fig. 3.2].
For a discrete random variable, FX(x) is a staircase function, whereas a random variable is called continuous if FX(x) is a continuous function. A random variable is called mixed if it is neither discrete nor continuous. Rather than dealing with the cdf, it is more common to deal with the probability density function (pdf), which is defined as the derivative of FX(x), i.e., fX(x) = dFX(x)/dx. [16]
Example 12.6. In MATLAB, we have the rand command to generate U(0, 1). If we want to generate a Bernoulli random variable
with success probability p, what can we do?
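One common approach (a sketch): compare the uniform sample with p.

p = 0.3;                     % example success probability
X = (rand < p);              % X = 1 with probability p, and 0 otherwise
% Many i.i.d. Bernoulli(p) samples at once:
Xvec = (rand(1,1e4) < p);
mean(Xvec)                   % should be close to p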
13

ΦX(v) = e^{jvm − σ²v²/2} for X ∼ N(m, σ²).

ΦX(v) = λ/(λ − jv) for X ∼ E(λ).
As with the Fourier transform, we can build a large list of commonly used characteristic functions. (You probably remember that a rectangular function in the time domain gives a sinc function in the frequency domain.) When you see a random variable that has the same form of characteristic function as one that you know, you can quickly make a conclusion about the family and parameters of that random variable.
Example 13.4. Suppose a random variable X has the characteristic function ΦX(v) = 2/(2 − jv). You can quickly conclude that it is an exponential random variable with parameter 2.
For many random variables, it is easy to find the expected value or any moment via the characteristic function. This can be done via the following result.
13.5. Φ_X^{(k)}(v) = j^k E[X^k e^{jvX}] and Φ_X^{(k)}(0) = j^k E[X^k].
(b) Var X = 1/λ².
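As an illustration, here is a sketch of how 13.5 recovers the moments of an exponential random variable X ∼ E(λ) from ΦX(v) = λ/(λ − jv):
Φ′X(v) = jλ/(λ − jv)², so EX = Φ′X(0)/j = 1/λ;
Φ″X(v) = −2λ/(λ − jv)³, so E[X²] = Φ″X(0)/j² = 2/λ²,
and hence Var X = E[X²] − (EX)² = 1/λ².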
ECS315 2013/1
Part VII
Dr.Prapun
14
Limiting Theorems
14.1
Law of Large Numbers (LLN)
Var[M_n] = Var[ (1/n) Σ_{i=1}^{n} X_i ] = (1/n²) Σ_{i=1}^{n} Var X_i = σ²/n,        (39)
Remarks:
(a) For (39) to hold, it is sufficient to have uncorrelated X_i's.
(b) From (39), we also have
σ_{M_n} = σ/√n.        (40)
In words, when uncorrelated (or independent) random variables each having the same distribution are averaged together, the standard deviation is reduced according to the square root law. [21, p 142]
Exercise 14.4 (F2011). Consider i.i.d. random variables X₁, X₂, ..., X₁₀. Define the sample mean M by
M = (1/10) Σ_{k=1}^{10} X_k.
Let
V₁ = (1/10) Σ_{k=1}^{10} (X_k − E[X_k])²
and
V₂ = (1/10) Σ_{j=1}^{10} (X_j − M)².
Suppose E[X_k] = 1 and Var[X_k] = 2.
(a) Find E[M].
(b) Find Var[M].
(c) Find E[V₁].
(d) Find E[V₂].
14.2
Central Limit Theorem (CLT)
Consider the sum S_n = Σ_{i=1}^{n} X_i, where the X_i are i.i.d. with common mean m and common variance σ².
Note that when we talk about the X_i being i.i.d., the definition is that they are independent and identically distributed. It is then convenient to talk about a random variable X which shares the same distribution (pdf/pmf) with these X_i. This allows us to write
X_i ∼ X, i.i.d.,        (42)
which is much more compact than saying that the X_i are i.i.d. with the same distribution (pdf/pmf) as X. Moreover, we can also use EX and σ²_X for the common expected value and variance of the X_i.
Q: How does S_n behave?
For the S_n defined above, there are many cases for which we know the pmf/pdf of S_n.
Example 14.5. When the X_i are i.i.d. Bernoulli(p), S_n is a binomial random variable: S_n ∼ B(n, p).
When the X_i are i.i.d., the characteristic function of the sum is
Φ_{S_n}(v) = (Φ_X(v))^n.
If we are lucky, as in the case of the sum of Gaussian random variables in Example 14.6 above, we get a Φ_{S_n}(v) of a form that we know. However, Φ_{S_n}(v) will usually be something we haven't seen before, or something whose inverse transform is difficult to find. This is one of the reasons why having a way to approximate the sum S_n would be very useful.
There are also some situations where the distribution of the X_i is unknown or difficult to find. In such cases, it would be amazing if we could say something about the distribution of S_n.
In the previous section, we considered the sample mean of identically distributed random variables. More specifically, we considered the random variable M_n = (1/n) S_n. We found that M_n converges to m as n increases to ∞. Here, we don't want to rescale the sum S_n by the factor 1/n.
14.7 (Approximation of densities and pmfs using the CLT). The actual statement of the CLT is a bit difficult to state. So, we first give you the interpretation/insight from the CLT, which is very easy to remember and use:
For n large enough, we can approximate S_n by a Gaussian random variable with the same mean and variance as S_n.
Note that the mean and variance of S_n are nm and nσ², respectively. Hence, for n large enough, we can approximate S_n by N(nm, nσ²). In particular,
f_{S_n}(s) ≈ (1/√(2πnσ²)) e^{−(s−nm)²/(2nσ²)}.        (43)
When the X_i are integer-valued, so is S_n, and
P[S_n = k] = P[k − 1/2 < S_n ≤ k + 1/2] ≈ (1/√(2πnσ²)) e^{−(k−nm)²/(2nσ²)}.        (44)
[Figure: Comparison of the binomial pmf with p = 0.05 and its Poisson and Gaussian approximations, for n = 100 and for n = 800.]
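A sketch of this kind of comparison in MATLAB (binopdf and normpdf are from the Statistics Toolbox; n and p follow the first case in the figure):

n = 100; p = 0.05;
k = 0:20;
pmf = binopdf(k, n, p);                      % exact binomial pmf of S_n
approx = normpdf(k, n*p, sqrt(n*p*(1-p)));   % Gaussian approximation, as in (44)
stem(k, pmf); hold on; plot(k, approx); hold off
legend('binomial pmf', 'Gaussian approximation')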
Proof.
Σ_{i=0}^{∞} ( λ g(i+1) − i g(i) ) e^{−λ} λ^i / i!
  = e^{−λ} [ Σ_{i=0}^{∞} g(i+1) λ^{i+1}/i!  −  Σ_{i=1}^{∞} i g(i) λ^i/i! ]
  = e^{−λ} [ Σ_{i=0}^{∞} g(i+1) λ^{i+1}/i!  −  Σ_{m=0}^{∞} g(m+1) λ^{m+1}/m! ]
  = 0,
where the second sum in the last bracket is obtained from the one before it by the substitution m = i − 1.
15
Random Vector
In Section 11.2, we have introduced the way to deal with more than
two random variables. In particular, we introduce the concepts of
joint pmf:
pX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = P [X1 = x1 , X2 = x2 , . . . , Xn = xn ]
and joint pdf:
fX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn )
of a collection of random variables.
Definition 15.1. You may notice that it is tedious to write the
n-tuple (X1 , X2 , . . . , Xn ) every time that we want to refer to this
collection of random variables. A more convenient notation uses
a column vector X to represent all of them at once, keeping in
mind that the ith component of X is the random variable Xi . This
allows us to express
(a) pX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) as pX (x) and
(b) fX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) as fX (x).
When the random variables are separated into two groups, we may
label those in a group as X1 , X2 , . . . , Xn and those in another group
as Y1 , Y2 , . . . , Ym . In which case, we can express
(a) pX1 ,...,Xn ,Y1 ,...,Ym (x1 , . . . , xn , y1 , . . . , ym ) as pX,Y (x, y) and
(b) fX1 ,...,Xn ,Y1 ,...,Ym (x1 , . . . , xn , y1 , . . . , ym ) as fX,Y (x, y).
Definition 15.2. Random vectors X and Y are independent if
and only if
(a) Discrete: pX,Y (x, y) = pX (x)pY (y).
(b) Continuous: fX,Y (x, y) = fX (x)fY (y).
Definition 15.3. A random vector X contains many random variables. Each of these random variables has its own expected value.
We can represent the expected values of all these random variables
in the form of a vector as well by using the notation E [X]. This
is a vector whose ith component is EXi .
E [ X ]   [ EX ]
  [ Y ] = [ EY ]
and, for a random matrix,
E [ X  W ]   [ EX  EW ]
  [ Y  Z ] = [ EY  EZ ]
15.6. For non-random matrices A, B, C and a random vector X,
E[AXB + C] = A (E[X]) B + C.
Correlation and covariance are important quantities that capture linear dependency between two random variables. When we
have many random variables, there are many possible pairs to find
correlation E [Xi Xj ] and covariance Cov [Xi , Xj ]. All of the correlation values can be expressed at once using the correlation matrix.
Definition 15.7. The correlation matrix RX of a random (column) vector X is defined by
RX = E XXT .
Note that it is symmetric and that the ij-entry of RX is simply
E [Xi Xj ].
Example 15.8. Consider X = [X₁; X₂]. Then
R_X = E[XXᵀ] = E[ [X₁; X₂] [X₁  X₂] ] = E[ [X₁²   X₁X₂;  X₁X₂   X₂²] ] = [ E[X₁²]   E[X₁X₂];  E[X₁X₂]   E[X₂²] ].
fX (x) =
n
2
16
Random Processes
A random process considers an infinite collection of random variables. These random variables are usually indexed by time. So, the obvious notation for a random process would be X(t). As in the signals-and-systems class, time can be discrete or continuous. When time is discrete, it may be more appropriate to use X₁, X₂, ... or X[1], X[2], X[3], ... to denote a random process.
Example 16.1. Sequence of results (0 or 1) from a sequence of
Bernoulli trials is a discrete-time random process.
16.2. Two perspectives:
(a) We can view a random process as a collection of many random
variables indexed by t.
(b) We can also view a random process as the outcome of a random experiment, where the outcome of each trial is a deterministic waveform (or sequence) that is a function of t.
The collection of these functions is known as an ensemble,
and each member is called a sample function.
Example 16.3. Gaussian Random Processes: A random process
X(t) is Gaussian if for all positive integers n and for all t1 , t2 , . . . , tn ,
the random variables X(t1 ), X(t2 ), . . . , X(tn ) are jointly Gaussian
random variables.
16.4. A formal definition of a random process requires going back to the probability space (Ω, A, P).
Recall that a random variable X is in fact a deterministic function of the outcome ω from Ω. So, we should have been writing it as X(ω). However, as we get more familiar with the concept of a random variable, we usually drop the (ω) part and simply refer to it as X.
For a random process, we have X(t, ω). This two-argument expression corresponds to the two perspectives that we have just discussed:
(a) When you fix the time t, you get a random variable from a random process.
(b) When you fix ω, you get a deterministic function of time from a random process.
As we get more familiar with the concept of random processes, we again drop the ω argument.
Definition 16.5. A sample function x(t, ω) is the time function associated with the outcome ω of an experiment.
Example 16.6 (Randomly Scaled Sinusoid). Consider the random process defined by
X(t) = A cos(1000t),
where A is a random variable. For example, A could be a Bernoulli random variable with parameter p.
This is a good model for a one-shot digital transmission via amplitude modulation.
(a) Consider the time t = 2 ms. X(t) is a random variable taking the value 1 × cos(2) ≈ −0.4161 with probability p and the value 0 × cos(2) = 0 with probability 1 − p.
If you consider t = 4 ms, X(t) is a random variable taking the value 1 × cos(4) ≈ −0.6536 with probability p and the value 0 × cos(4) = 0 with probability 1 − p.
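A quick sketch of a few sample functions of this process (Bernoulli amplitudes; the time grid is arbitrary):

p = 0.5;  t = 0:1e-5:0.01;             % time grid in seconds
for k = 1:5
    A = (rand < p);                    % Bernoulli(p) amplitude
    plot(t, A*cos(1000*t)); hold on    % one sample function x(t, omega_k)
end
hold off; xlabel('t (s)'); ylabel('X(t)')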
Figure 30: Typical ensemble members for four random processes commonly encountered in communications: (a) thermal noise, (b) uniform phase (encountered in communication systems where it is not feasible to establish timing at the receiver), (c) Rayleigh fading process, and (d) binary random data process (which may represent transmitted bits 0 and 1 that are mapped to +V and −V (volts)). [16, Fig. 3.8]
For the binary random data process, x(t) = Σ_k V(2a_k − 1) [u(t − kT_b + ε) − u(t − (k + 1)T_b + ε)].        (3.46)
E[X(t)] = (1/2π) ∫_0^{2π} 5 cos(7t + θ) dθ = 0.
Figure 31: Autocorrelation functions for a slowly varying and a rapidly varying random process [13, Fig. 11.4]
and
R_X(t₁, t₂) = E[X(t₁) X(t₂)] = E[5 cos(7t₁ + Θ) × 5 cos(7t₂ + Θ)] = (25/2) cos(7(t₂ − t₁)).
Definition 16.11. A random process whose statistical characteristics do not change with time is classified as a stationary random
process. For a stationary process, we can say that a shift of time
origin will be impossible to detect; the process will appear to be
the same.
Example 16.12. The random process representing the temperature of a city is an example of a nonstationary process, because
the temperature statistics (mean value, for example) depend on
the time of the day.
On the other hand, the noise process is stationary, because its statistics (the mean and the mean-square values, for example) do not change with time.
R_N(τ) = (N₀/2) δ(τ).
Example 16.18. [Thermal noise] A statistical analysis of the random motion (by thermal agitation) of electrons shows that the autocorrelation of thermal noise N(t) is well modeled as
R_N(τ) = kTG e^{−|τ|/t₀} / t₀  watts,
where k is Boltzmann's constant (k = 1.38 × 10⁻²³ joules/degree Kelvin), G is the conductance of the resistor (mhos), T is the (ambient) temperature in degrees Kelvin, and t₀ is the statistical average of the time intervals between collisions of free electrons in the resistor, which is on the order of 10⁻¹² seconds. [16, p. 105]
16.2
Power Spectral Density
An electrical engineer instinctively thinks of signals and linear systems in terms of their frequency-domain descriptions. Linear systems are characterized by their frequency response (the transfer
function), and signals are expressed in terms of the relative amplitudes and phases of their frequency components (the Fourier
transform). From the knowledge of the input spectrum and transfer function, the response of a linear system to a given signal can be
obtained in terms of the frequency content of that signal. This is
an important procedure for deterministic signals. We may wonder
if similar methods may be found for random processes.
In the study of stochastic processes, the power spectral density
function, SX (f ), provides a frequency-domain representation of the
time structure of X(t). Intuitively, SX (f ) is the expected value
of the squared magnitude of the Fourier transform of a sample
function of X(t).
You may recall that not all functions of time have Fourier transforms. For many functions that extend over infinite time, the
Fourier transform does not exist. Sample functions x(t) of a stationary stochastic process X(t) are usually of this nature. To work
with these functions in the frequency domain, we begin with X_T(t), a truncated version of X(t). It is identical to X(t) for −T ≤ t ≤ T and 0 elsewhere. We use F{X_T}(f) to represent the Fourier transform of X_T(t) evaluated at the frequency f.
S_X(f) = lim_{T→∞} (1/2T) E[ |F{X_T}(f)|² ].
We refer to S_X(f) as a density function because it can be interpreted as the amount of power in X(t) in the small band of frequencies from f to f + df.
16.20. Wiener-Khinchine theorem: the PSD of a WSS random process is the Fourier transform of its autocorrelation function:
S_X(f) = ∫_{−∞}^{+∞} R_X(τ) e^{−j2πfτ} dτ
and
R_X(τ) = ∫_{−∞}^{+∞} S_X(f) e^{j2πfτ} df.
In particular,
R_X(0) = E[X²(t)] = ∫_{−∞}^{+∞} S_X(f) df.
S_N(f) = N₀/2  watts/hertz.
Figure 32: Fourier transforms of member functions of a random process. For simplicity, only the magnitude spectra are shown. [16, Fig. 3.9]
What we have managed to accomplish thus far is to create the random variable P, which in some sense represents the power in the process. Now we find the average value of P, i.e.,
E{P} = E[ (1/2T) ∫_{−T}^{T} x_T²(t) dt ] = E[ (1/2T) ∫_{−∞}^{∞} |X_T(f)|² df ].        (3.64)
Figure 33: (a) The PSD (S_N(f)) and (b) the autocorrelation (R_N(τ)) of noise, comparing white noise with thermal noise. [16, Fig. 3.11]
Fig. 3.12: A lowpass filter.
Finally, since the noise samples of white noise are uncorrelated, if the noise is both white and Gaussian (for example, thermal noise) then the noise samples are also independent.
Example 3.7. Consider the lowpass filter given in Figure 3.12. Suppose that a (WSS) white noise process, x(t), of zero mean and PSD N₀/2 is applied to the input of the filter.
(a) Find and sketch the PSD and autocorrelation function of the random process y(t) at the output of the filter.
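A rough numerical illustration of filtering noise (a sketch only: butter and pwelch require the Signal Processing Toolbox, and the first-order lowpass filter here is simply a stand-in, not necessarily the filter of Figure 3.12):

fs = 1e4;                       % sampling rate (Hz), arbitrary
x = randn(1, 1e5);              % discrete-time stand-in for white Gaussian noise
fc = 500;                       % cutoff frequency (Hz), arbitrary
[b, a] = butter(1, fc/(fs/2));  % first-order lowpass filter
y = filter(b, a, x);
pwelch(y, [], [], [], fs)       % estimated PSD of the output process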
A.3
Calculus
Integration by parts:
∫_{x=a}^{b} f(x) g′(x) dx = [ f(x) g(x) ]_{x=a}^{b} − ∫_{x=a}^{b} f′(x) g(x) dx.
Repeated integration by parts: differentiate f(x) repeatedly down one column and integrate g(x) repeatedly down the other (G₁ is an antiderivative of g, and G_{i+1} is an antiderivative of G_i); then
∫ f(x) g(x) dx = f(x) G₁(x) − f^{(1)}(x) G₂(x) + f^{(2)}(x) G₃(x) − ⋯ + (−1)^{n−1} f^{(n−1)}(x) G_n(x) + (−1)^n ∫ f^{(n)}(x) G_n(x) dx.
Figure 34: Integration by Parts (the tabular layout of the scheme above).
Figure 35: Examples of Integration by Parts using Figure 34, e.g.
∫ x² e^{3x} dx = x² (e^{3x}/3) − 2x (e^{3x}/9) + 2 (e^{3x}/27) + C,
∫ e^{x} sin x dx = ½ e^{x} (sin x − cos x) + C.
∫ xⁿ e^{ax} dx = (1/a) xⁿ e^{ax} − (n/a) ∫ xⁿ⁻¹ e^{ax} dx   (integration by parts).
Γ(α) = ∫_0^{∞} t^{α−1} e^{−t} dt.
(a) ∫ x ln x dx = (x²/2) ln x − ∫ (x²/2)(1/x) dx = (x²/2) ln x − ½ ∫ x dx = (x²/2) ln x − x²/4 + C.
(b) ∫ x² eˣ dx = x²(eˣ) − (2x)(eˣ) + (2)(eˣ) + C = eˣ(x² + 2x + 2) + C.
∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2} dx = 1.        (62)
More generally,
∫_{−∞}^{∞} (1/(√(2π) σ)) e^{−(x−μ)²/(2σ²)} dx = 1.        (63)
To see (62), note that
( ∫ (1/√(2π)) e^{−x²/2} dx )² = ( ∫ (1/√(2π)) e^{−x²/2} dx ) ( ∫ (1/√(2π)) e^{−y²/2} dy ).
Converting to polar coordinates (x = r cos θ, y = r sin θ),
(1/2π) ∫∫ e^{−(x²+y²)/2} dx dy = (1/2π) ∫_0^{2π} ∫_0^{∞} r e^{−r²/2} dr dθ
                               = ∫_0^{∞} r e^{−r²/2} dr = [ −e^{−r²/2} ]_0^{∞} = 1.
(a) ∫_{−∞}^{∞} e^{−αx²} dx = √(π/α)   for α > 0.
(b) ∫_{−∞}^{∞} x (1/√(2π)) e^{−x²/2} dx = 0.
(c) ∫_{−∞}^{∞} x² (1/√(2π)) e^{−x²/2} dx = 1.
    Hint: Write x² e^{−x²/2} as x × (x e^{−x²/2}) and use integration by parts.
    Remark: This shows that E[X²] = Var X = 1 when X ∼ N(0, 1).
(d) ∫_{−∞}^{∞} x² e^{−αx²} dx = (1/(2α)) √(π/α)   for α > 0.
(e) ∫_0^{∞} x² e^{−αx²} dx = (1/(4α)) √(π/α)   for α > 0.
(f) ∫_{−∞}^{∞} e^{sx} (1/(√(2π) σ)) e^{−x²/(2σ²)} dx = e^{σ²s²/2}. (In particular, E[e^{j2πfX}] = e^{j2πfm − 2(πfσ)²} when X ∼ N(m, σ²).)
(a) Substituting y = (x − μ)/σ (so dx = σ dy) in (63) shows that
∫_{−∞}^{∞} (1/(√(2π) σ)) e^{−(x−μ)²/(2σ²)} dx = ∫_{−∞}^{∞} (1/√(2π)) e^{−y²/2} dy = 1,
and hence
∫_{−∞}^{∞} e^{−y²/2} dy = √(2π).
(b) x e^{−x²/2} is an odd function, so its integral over the whole real line is 0.
(c) Write x² e^{−x²/2} = x × (x e^{−x²/2}) and integrate by parts:
∫_{−∞}^{∞} x² e^{−x²/2} dx = [ −x e^{−x²/2} ]_{−∞}^{∞} + ∫_{−∞}^{∞} e^{−x²/2} dx = 0 + √(2π) = √(2π).
(d) Substituting y = √(2α) x (so dx = dy/√(2α)) gives
∫_{−∞}^{∞} x² e^{−αx²} dx = (1/(2α)) (1/√(2α)) ∫_{−∞}^{∞} y² e^{−y²/2} dy = (1/(2α)) (1/√(2α)) √(2π) = (1/(2α)) √(π/α).
(e) Because x² e^{−αx²} is an even function,
∫_{−∞}^{∞} x² e^{−αx²} dx = 2 ∫_0^{∞} x² e^{−αx²} dx.
(f) Completing the square (here for σ = 1),
∫ e^{sx} (1/√(2π)) e^{−x²/2} dx = ∫ (1/√(2π)) e^{−(x² − 2sx + s²)/2} e^{s²/2} dx = e^{s²/2} ∫ (1/√(2π)) e^{−(x−s)²/2} dx = e^{s²/2}.
For X ∼ N(m, σ²), substituting y = (x − m)/σ,
E[e^{sX}] = ∫ e^{sx} (1/(√(2π) σ)) e^{−(x−m)²/(2σ²)} dx = e^{sm} ∫ e^{sσy} (1/√(2π)) e^{−y²/2} dy = e^{sm} e^{(sσ)²/2} = e^{sm + s²σ²/2}.