1. Motivation
The study of probability stems from the analysis of certain games of chance, and it has found
applications in most branches of science and engineering. In this chapter the basic concepts of probability
theory are presented.
2. Syllabus
Probabilities, independence, total probability; Bayes' Rule and applications - 1 hour
Random variables, their distribution and density functions - 1 hour
Random variables: Probability Mass Function (PMF), Probability Density Function (PDF) and properties - 1 hour
3. Books Recommended
The objective of this module is to help the reader understand the concepts of probability and the different types of probability, and to develop the ability to compute probabilities in various cases.
6. Key Notation
ℤ  Set of integers
RV Random Variable
7. Key Definitions
1. Subset: A set A is called a subset of B (or B is called a superset of A), denoted by A ⊆ B, if all the elements of A are also elements of B.
2. Proper subset: If A is a subset of B and there is at least one element in B which is not an element of A, then A is called a proper subset of B, denoted by A ⊂ B.
3. Universal set:
We always consider all sets for the problem under consideration to be subsets of a (large) set called the universal set. For the binary digital communication problem, the set {0, 1} may be considered as the universal set. We shall denote the universal set by the symbol S.
4. Union: The union of two sets A and B, denoted by A ∪ B, is defined as the set of elements that are either in A or in B or both. In set-builder notation, A ∪ B = {x : x ∈ A or x ∈ B}.
5. Intersection: The intersection of two sets A and B, denoted by A ∩ B, is defined as the set of elements that are common to both A and B. We can write A ∩ B = {x : x ∈ A and x ∈ B}.
6. Difference: The difference of two sets A and B, denoted by A - B, is the set of those elements of A which do not belong to B. Thus A - B = {x : x ∈ A and x ∉ B}.
Clearly, A - B = A ∩ B′.
9. Venn diagram
The sets and set operations can be illustrated by means of the Venn diagrams. A rectangle is used to
represent the universal set and a circle is used to represent any set in it.
Consider the probability space (S, F, P) and a function X mapping the sample space S into the real line.
For two events A and B with P(A) > 0, the conditional probability was defined as P(B|A) = P(A ∩ B) / P(A).
Consider the event {X ≤ x} and any event B involving the random variable X. The conditional distribution function of X given B is defined as F_X(x|B) = P({X ≤ x} ∩ B) / P(B).
8. Key Relations
If an experiment is repeated n times under similar conditions and the event A occurs n_A times, then P(A) = lim (n → ∞) n_A / n.
2. Conditional probability: P(B|A) = P(A ∩ B) / P(A), provided P(A) > 0.
In many applications we have to deal with a finite sample space S, and the elementary events formed by single elements of the set may be assumed equiprobable. In this case, we can define the probability of the event A according to the classical definition discussed earlier:
P(A) = n_A / n,
where n_A = number of elements favourable to A and n is the total number of elements in the sample space S.
5. Bernoulli trial
Suppose in an experiment we are only concerned with whether a particular event A has occurred or not. We call this event the 'success', with probability p, and the complementary event the 'failure', with probability 1 - p.
Probability of Success: P(A) = p
Probability of Failure: P(A') = 1 - p
6. Binomial Law: the probability of k successes in n independent Bernoulli trials is P_n(k) = nCk p^k (1 - p)^(n - k).
9. Theory
The modern approach to probability is based on axiomatically defining probability as a function of a set of events.
A background in set theory is essential for understanding probability.
Set:
A set is a well defined collection of objects. These objects are called elements or members of the
set. Usually uppercase letters are used to denote sets.
Example 1
The elements of a set are enumerated within a pair of curly brackets as shown in this example.
Instead of listing all the elements, we can represent a set in set-builder notation by specifying some property satisfied by all its elements; for example, {x : x has property P} represents the set of all x having that property. We read ':' as 'such that'. Such a representation is particularly useful if a set is infinite, having an infinite number of elements, or if listing all the elements of the set is cumbersome.
The null set or empty set is the set that does not contain any element. A null set is denoted by ∅.
A set A is called a subset of B (or B is called a superset of A), denoted by A ⊆ B, if all the elements of A are also elements of B.
If A is a subset of B and there is at least one element in B which is not an element of A, then A is called a proper subset of B, denoted by A ⊂ B.
Example 2 Let .
Then, .
o Implies that .
These are: .
Universal set
We always consider all sets for the problem under consideration to be subsets of a (large) set called the universal set. For the binary digital communication problem, the set {0, 1} may be considered as the universal set. We shall denote the universal set by the symbol S.
In discussion involving English letters, the alphabet of the English language may be considered as the
universal set.
Two sets A and B are equal if and only if they have the same elements. Thus, A = B if and only if A ⊆ B and B ⊆ A.
We take the above definition of the equality of two sets to establish identities involving set-theoretic operations.
Set operations
We can combine events by set operations to get other events. The following set operations are useful:
Union: The union of two sets A and B, denoted by A ∪ B, is defined as the set of elements that are either in A or in B or both. In set-builder notation, A ∪ B = {x : x ∈ A or x ∈ B}.
Intersection: The intersection of two sets A and B, denoted by A ∩ B, is defined as the set of elements that are common to both A and B. We can write A ∩ B = {x : x ∈ A and x ∈ B}.
Difference: The difference of two sets A and B, denoted by A - B, is the set of those elements of A which do not belong to B. Thus A - B = {x : x ∈ A and x ∉ B}.
Complement: The complement of a set A, denoted by A′, is defined as the set of all elements which are not in A. Thus A′ = S - A.
Clearly, A - B = A ∩ B′.
Example 4
Venn diagram
The sets and set operations can be illustrated by means of the Venn diagrams. A rectangle is used to
represent the universal set and a circle is used to represent any set in it.
1. Identity properties: A ∪ ∅ = A, A ∩ S = A.
3. Associative properties: (A ∪ B) ∪ C = A ∪ (B ∪ C), (A ∩ B) ∩ C = A ∩ (B ∩ C).
4. Distributive properties: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C), A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
5. Complementary properties: A ∪ A′ = S, A ∩ A′ = ∅.
6. De Morgan's laws: (A ∪ B)′ = A′ ∩ B′, (A ∩ B)′ = A′ ∪ B′.
These properties can be proved easily and verified using the Venn diagram. They can be used to
derive useful results.
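As a numerical check (this short Python sketch is an addition to the notes, with arbitrarily chosen sets A, B, C and universal set S), the distributive property and De Morgan's laws can be verified directly:

# Verify the distributive property and De Morgan's laws on small, arbitrarily chosen sets.
S = set(range(10))          # assumed universal set
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {4, 6, 8}

complement = lambda X: S - X

# Distributive property: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
assert A & (B | C) == (A & B) | (A & C)

# De Morgan's laws: (A ∪ B)' = A' ∩ B' and (A ∩ B)' = A' ∪ B'
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)

print("All identities hold for these sets.")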
2. Sample Space: The sample space S is the collection of all possible outcomes of a random experiment.
3. Event: An event A is a subset of the sample space such that a probability can be assigned to it. Thus A ⊆ S.
For a discrete sample space, all subsets are events.
Figure 1
The possible outcomes are H (head) and T (tail). The associated sample space is S = {H, T}. It is a finite sample space. The events associated with the sample space are: ∅, {H}, {T} and {H, T}.
We may have to toss the coin any number of times before a head is obtained. Thus the possible outcomes
are :
H, TH,TTH,TTTH, …..
How many outcomes are there? The outcomes are countable but infinite in number. The countably infinite sample space is S = {H, TH, TTH, TTTH, ...}.
The probability of an event is a number assigned to the event. Let us see how we can define
probability.
Consider a random experiment with a finite number of outcomes. If all the outcomes of the experiment are equally likely, the probability of an event A is defined by
P(A) = n_A / n,
where n_A is the number of outcomes favourable to A and n is the total number of outcomes.
Example 6 A fair die is rolled once. What is the probability of getting a ‘6’ ?
Here S = {1, 2, 3, 4, 5, 6} and P('6') = 1/6.
Example 7 A fair coin is tossed twice. What is the probability of getting two 'heads'?
Here S = {HH, HT, TH, TT} and P(two heads) = 1/4.
Total number of outcomes is 4 and all four outcomes are equally likely.
The classical definition is limited to a random experiment which has only a finite number of
outcomes. In many experiments like that in the above examples, the sample space is finite and each
outcome may be assumed ‘equally likely.' In such cases, the counting method can be used to
compute probabilities of events.
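For illustration (this sketch is an addition, not part of the original notes), the counting method of Examples 6 and 7 can be reproduced in Python by listing the equally likely outcomes:

from itertools import product
from fractions import Fraction

# Example 6: a fair die is rolled once; P(getting a '6').
die_outcomes = range(1, 7)
p_six = Fraction(sum(1 for face in die_outcomes if face == 6), 6)
print(p_six)                                     # 1/6

# Example 7: a fair coin is tossed twice; P(two heads).
coin_outcomes = list(product("HT", repeat=2))    # HH, HT, TH, TT
p_two_heads = Fraction(sum(1 for o in coin_outcomes if o == ("H", "H")),
                       len(coin_outcomes))
print(p_two_heads)                               # 1/4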
Consider the experiment of tossing a fair coin until a ‘head' appears. As we have discussed earlier,
there are countably infinite outcomes. Can you believe that all these outcomes are equally likely?
The notion of equally likely is important here. Equally likely means equally probable. Thus this definition presupposes that all outcomes occur with equal probability, so the definition uses the very concept it is trying to define.
If an experiment is repeated n times under similar conditions and the event A occurs n_A times, then the relative-frequency definition gives P(A) = lim (n → ∞) n_A / n.
Example 8 Suppose a die is rolled 500 times. The following table shows the frequency of each face.
Discussion: This definition is also inadequate from the theoretical point of view, since the limit may not exist and can never be verified by performing the experiment an infinite number of times.
We have earlier defined an event as a subset of the sample space. Does each subset of the sample
space form an event?
The answer is yes for a finite sample space. However, we may not be able to assign probability
meaningfully to all the subsets of a continuous sample space. We have to eliminate those subsets.
The concept of the sigma algebra is meaningful now.
Definition: Let S be a sample space and F a sigma field defined over it. Let P be a mapping from the sigma algebra F into the real line such that for each A ∈ F there exists a unique P(A). Clearly P is a set function and is called probability if it satisfies the following three axioms.
Axiom 1: P(A) ≥ 0 for every A ∈ F.
Axiom 2: P(S) = 1.
Axiom 3: For any countable sequence of mutually exclusive events A1, A2, ... (Ai ∩ Aj = ∅ for i ≠ j), P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ...
If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).
This is a special case of axiom 3 and, for a discrete sample space S, this simpler version may be considered as axiom 3. We shall give a proof of this result below.
In a special case, when the outcomes are equi-probable, we can assign equal probability p to each
elementary event.
Suppose {1}, {2}, ..., {6} represent the elementary events. Thus {1} is the event of getting '1', {2} is the event of getting '2', and so on.
Example 10 Consider the experiment of tossing a fair coin until a head is obtained discussed in Example 3.
Suppose the sample space S is continuous and un-countable. Such a sample space arises when the
outcomes of an experiment are numbers. For example, such sample space occurs when the experiment
consists in measuring the voltage, the current or the resistance. In such a case, the sigma algebra consists
of the Borel sets on the real line.
Example 11 Suppose
Then for
In many applications we have to deal with a finite sample space S, and the elementary events formed by single elements of the set may be assumed equiprobable. In this case, we can define the probability of the event A according to the classical definition discussed earlier:
P(A) = n_A / n,
where n_A = number of elements favourable to A and n is the total number of elements in the sample space S.
Thus the calculation of probability involves finding the number of elements in the sample space and in the event A. Combinatorial rules give us quick algebraic formulae to find the number of elements in a set. We briefly outline some of these rules:
1. Product rule: Suppose we have a set A with m distinct elements and a set B with n distinct elements. Then the number of distinct ordered pairs (a, b) with a in A and b in B is mn.
The above result can be generalized as follows: the number of distinct k-tuples (a1, a2, ..., ak), where each ai is drawn from a set with ni distinct elements, is n1 n2 ... nk.
Solution: The sample space corresponding to two throws of the die is illustrated in the following table.
Clearly, the sample space has 6 × 6 = 36 elements by the product rule. The event corresponding to getting at least one 3 is highlighted and contains 11 elements. Therefore, the required probability is 11/36.
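A quick enumeration in Python (an illustrative addition, not part of the notes) confirms the count of 11 favourable outcomes:

from itertools import product
from fractions import Fraction

# Enumerate the 36 equally likely outcomes of two throws of a fair die.
sample_space = list(product(range(1, 7), repeat=2))
favourable = [o for o in sample_space if 3 in o]     # at least one '3'

print(len(sample_space), len(favourable))            # 36 11
print(Fraction(len(favourable), len(sample_space)))  # 11/36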
Sampling with replacement and with ordering: Suppose we have to choose k objects from a set of n objects. Since sampling is with ordering, each ordered arrangement of k objects is to be considered. Further, after every choice the object is placed back in the set. In this case, the number of distinct ordered k-tuples is n^k. Equivalently, if a random experiment has n outcomes and the experiment is repeated k times, then the combined experiment has n^k outcomes.
Sampling without replacement and with ordering: Suppose we have to choose k objects from a set of n objects by picking one object after another at random. In this case the first object can be chosen from n objects, the second object can be chosen from n - 1 objects, and so on. Therefore, by applying the product rule, the number of distinct ordered k-tuples in this case is n(n - 1)...(n - k + 1).
Clearly, n(n - 1)...(n - k + 1) = n! / (n - k)!.
Example 2 Birthday problem: Given a class of students, what is the probability that at least two students in the class have the same birthday? Plot this probability vs. the number of students and be surprised!
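A short Python sketch for the birthday problem (an addition to the notes, assuming 365 equally likely birthdays and ignoring leap years):

# Probability that at least two of k students share a birthday.
def p_shared_birthday(k: int) -> float:
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (365 - i) / 365   # i-th student avoids the first i birthdays
    return 1 - p_all_distinct

for k in (10, 23, 50):
    print(k, round(p_shared_birthday(k), 4))
# 23 students already give a probability above 0.5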
Sampling without replacement and without ordering: Suppose nCk is the number of ways in which k objects can be chosen out of a set of n objects. In this case the ordering of the objects within the chosen set of k objects is not considered.
Note that k objects can be arranged among themselves in k! ways. Therefore, if ordering of the k objects is considered, the number of ways in which k objects can be chosen out of n objects is nCk × k!. This is the case of sampling with ordering, so nCk × k! = n!/(n - k)!, giving nCk = n! / (k!(n - k)!).
Example 3 An urn contains 6 red balls, 5 green balls and 4 blue balls. 9 balls were picked at random from
the urn without replacement. What is the probability that out of the balls 4 are red, 3 are green and 2 are
blue?
Solution: The number of favourable choices is 6C4 × 5C3 × 4C2 and the total number of ways of choosing 9 balls out of 15 is 15C9. Therefore, the required probability is (6C4 × 5C3 × 4C2) / 15C9 = 900/5005 ≈ 0.18.
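The same numbers can be checked with Python's math.comb (an illustrative sketch, not from the notes):

from math import comb
from fractions import Fraction

# P(4 red, 3 green, 2 blue) when 9 balls are drawn without replacement
# from an urn with 6 red, 5 green and 4 blue balls.
favourable = comb(6, 4) * comb(5, 3) * comb(4, 2)
total = comb(15, 9)
prob = Fraction(favourable, total)
print(prob, float(prob))    # 180/1001 ≈ 0.1798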
5. Arranging n objects into k specific groups: Suppose we want to partition a set of n distinct elements into k groups of sizes n1, n2, ..., nk with n1 + n2 + ... + nk = n. The number of such partitions is the multinomial coefficient n! / (n1! n2! ... nk!).
Example 4 What is the probability that in a throw of 12 dice each face occurs twice?
Solution: The total number of elements in the sample space of the outcomes of a single throw of 12 dice is 6^12.
The number of favourable outcomes is the number of ways in which 12 dice can be arranged in six
groups of size 2 each – group 1 consisting of two dice each showing 1, group 2 consisting of two dice each
showing 2 and so on.
Therefore, the total number of distinct favourable groups is 12!/(2!)^6, and the required probability is 12!/((2!)^6 × 6^12) ≈ 0.0034.
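As a check (an added sketch, not part of the notes), the multinomial count and the resulting probability can be computed directly:

from math import factorial

# Favourable outcomes: ways to split the 12 dice into six groups of two,
# one group per face value (multinomial coefficient 12!/(2!)^6).
favourable = factorial(12) // (factorial(2) ** 6)
total = 6 ** 12                       # all outcomes of throwing 12 dice

print(favourable, total, favourable / total)   # 7484400 2176782336 ≈ 0.0034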
Conditional probability
The answer is the conditional probability of B given A, denoted by P(B|A). We shall develop the concept of the conditional probability and explain under what condition this conditional probability is the same as P(B).
Let us consider the case of equiprobable events discussed earlier. Let n_AB sample points be favourable for the joint event A ∩ B; then P(B|A) = n_AB / n_A.
From the definition of conditional probability, we have the joint probability of two events A and B as follows:
P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B).
Example 2 A family has two children. It is known that at least one of the children is a girl. What is the
probability that both the children are girls?
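A hedged solution sketch in Python (an addition to the notes; it assumes the four birth orders GG, GB, BG and BB are equally likely) gives the answer 1/3:

from itertools import product
from fractions import Fraction

# Equally likely birth orders for two children (G = girl, B = boy).
sample_space = list(product("GB", repeat=2))
at_least_one_girl = [o for o in sample_space if "G" in o]
both_girls = [o for o in at_least_one_girl if o == ("G", "G")]

# Conditional probability P(both girls | at least one girl)
print(Fraction(len(both_girls), len(at_least_one_girl)))   # 1/3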
In the following we show that the conditional probability satisfies the axioms of probability.
By definition, P(A|B) = P(A ∩ B) / P(B), with P(B) > 0.
Axiom 1: Since P(A ∩ B) ≥ 0 and P(B) > 0, we have P(A|B) ≥ 0.
Axiom 2: We have P(S|B) = P(S ∩ B) / P(B) = P(B) / P(B) = 1.
Axiom 3: If A1 and A2 are mutually exclusive events, then so are A1 ∩ B and A2 ∩ B. We have
P(A1 ∪ A2 | B) = P((A1 ∪ A2) ∩ B) / P(B) = [P(A1 ∩ B) + P(A2 ∩ B)] / P(B) = P(A1|B) + P(A2|B).
Remark
(1) A decomposition of a set S into two or more disjoint nonempty subsets is called a partition of S. The subsets A1, A2, ..., An form a partition of S if Ai ∩ Aj = ∅ for i ≠ j and A1 ∪ A2 ∪ ... ∪ An = S.
(2) The theorem of total probability can be used to determine the probability of a complex event in terms of related simpler events. This result will be used in Bayes' theorem, to be discussed at the end of the lecture.
Example 3 Suppose a box contains 2 white and 3 black balls. Two balls are picked at random without
replacement.
Clearly, the events 'the first ball is white' and 'the first ball is black' form a partition of the sample space corresponding to picking two balls from the box.
Independent events
Two events are called independent if the probability of occurrence of one event does not affect the
probability of occurrence of the other. Thus the events A and B are independent if
P(A|B) = P(A) and P(B|A) = P(B),
or equivalently, P(A ∩ B) = P(A) P(B).
This definition can be extended to the independence of n events. The events A1, A2, ..., An are called independent if and only if the probability of the intersection of every sub-collection of these events equals the product of their individual probabilities.
Example 4 Consider the example of tossing a fair coin twice. The resulting sample space is given by S = {HH, HT, TH, TT}.
Let A be the event of getting 'tail' in the first toss and B be the event of getting 'head' in the second toss. Then A = {TH, TT} and B = {HH, TH}, so that P(A) = 1/2 and P(B) = 1/2.
Again, A ∩ B = {TH}, so that P(A ∩ B) = 1/4 = P(A) P(B), and hence the two events are independent.
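The independence claim of Example 4 can also be verified by enumeration (an illustrative Python sketch, using the events A and B defined above):

from itertools import product
from fractions import Fraction

# Sample space of two tosses of a fair coin.
S = list(product("HT", repeat=2))
A = {o for o in S if o[0] == "T"}        # tail on the first toss
B = {o for o in S if o[1] == "H"}        # head on the second toss

P = lambda E: Fraction(len(E), len(S))
print(P(A), P(B), P(A & B))              # 1/2 1/2 1/4
print(P(A & B) == P(A) * P(B))           # True, so A and B are independent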
Example 5 Consider the experiment of picking two balls at random discussed in example 3.
This result is known as Bayes' theorem. The probability P(Ai) is called the a priori probability and P(Ai|B) is called the a posteriori probability. Thus Bayes' theorem enables us to determine the a posteriori probability from the observation that B has occurred. This result is of practical importance and is at the heart of Bayesian classification, Bayesian estimation, etc.
Example 1
In a binary communication system a zero and a one are transmitted with probability 0.6 and 0.4 respectively. Due to error in the communication system a zero becomes a one with probability 0.1 and a one becomes a zero with probability 0.08. Determine the probability (i) of receiving a one and (ii) that a one was transmitted when the received message is a one.
Solution: Let T0 be the event of transmitting 0, T1 be the event of transmitting 1, and R0 and R1 be the corresponding events of receiving 0 and 1 respectively.
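A small Python sketch (added for illustration; the variable names mirror the events T0, T1, R0, R1 above) carries out the total-probability and Bayes computations:

# Prior probabilities of transmitting 0 and 1, and the channel error probabilities.
p_tx0, p_tx1 = 0.6, 0.4
p_rx1_given_tx0 = 0.1          # a transmitted 0 is received as 1
p_rx0_given_tx1 = 0.08         # a transmitted 1 is received as 0

# (i) Total probability of receiving a 1.
p_rx1 = p_tx0 * p_rx1_given_tx0 + p_tx1 * (1 - p_rx0_given_tx1)
# (ii) Bayes' theorem: P(1 transmitted | 1 received).
p_tx1_given_rx1 = p_tx1 * (1 - p_rx0_given_tx1) / p_rx1

print(round(p_rx1, 3), round(p_tx1_given_rx1, 4))   # 0.428 0.8598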
Example 2 In an electronics laboratory, there are identically looking capacitors of three makes
Let D be the event that the item is defective. Here we have to find the conditional probability of each make given D, using Bayes' theorem.
Box A contains 2 red chips; box B contains two white chips; and box C contains 1 red chip and 1 white chip.
A box is selected at random, and one chip is taken at random from that box. What is the probability of
selecting a white chip?
Solution. Let A be the event that Box A is randomly selected; let B be the event that Box B is randomly
selected; and let C be the event that Box C is randomly selected. Because there are three boxes that are
equally likely to be selected, P(A) = P(B) = P(C) = 1/3. Let W be the event that a white chip is randomly
selected. The probability of selecting a white chip from a box depends on the box from which the chip is
selected:
P(W | A) = 0
P(W | B) = 1
P(W | C) = 1/2
Now, a white chip could be selected in one of three ways: (1) Box A could be selected, and then a white
chip be selected from it; or (2) Box B could be selected, and then a white chip be selected from it; or (3) Box
C could be selected, and then a white chip be selected from it. That is, the probability that a white chip is
selected is:
Then, recognizing that the events W ∩ A, W ∩ B, and W ∩ C are mutually exclusive, we get
P(W) = P(W|A)P(A) + P(W|B)P(B) + P(W|C)P(C)
= 0 × 1/3 + 1 × 1/3 + 1/2 × 1/3 = 1/2
In the above example, if the selected chip is white, what is the probability that the other chip in the box is
red?
Solution. The box that contains one white chip and one red chip is Box C. Therefore, we are interested in
finding P(C | W). From previous example P(W) = ½.
P(C|W) = P(C ∩ W) / P(W)
= P(W|C)P(C) / [P(W|A)P(A) + P(W|B)P(B) + P(W|C)P(C)]
= 1/3
Repeated Trials
In our discussions so far, we considered the probability defined over a sample space corresponding to
a random experiment. Often, we have to consider several random experiments in a sequence. For example,
the experiment corresponding to sequential transmission of bits through a communication system may be
considered as a sequence of experiments each representing transmission of single bit through the channel.
PRODUCT: Suppose two experiments E1 and E2 with the corresponding sample spaces S1 and S2 are performed sequentially. Such a combined experiment is called the product of the two experiments E1 and E2. Clearly, the outcome of this combined experiment consists of the ordered pair (s1, s2), where s1 ∈ S1 and s2 ∈ S2. The sample space corresponding to the combined experiment is given by S = S1 × S2. The events in S consist of all the Cartesian products of the form A1 × A2, where A1 is an event in S1 and A2 is an event in S2, and the probability of such an event is determined by the probabilities defined on the events of S1 and S2.
Independent Experiments
In many experiments, the events A1 × S2 and S1 × A2 are independent for every selection of A1 ⊆ S1 and A2 ⊆ S2; such experiments are called independent experiments, and for them P(A1 × A2) = P1(A1) P2(A2).
Example 1
Consider the experiments of rolling a fair die and tossing a fair coin sequentially. What is the probability
that a '2' and a 'head' will occur?
Solution: Suppose S1 is the sample space of the experiment of rolling a six-faced fair die and S2 is the sample space of the experiment of tossing a fair coin. The required probability is P({2} × {H}) = (1/6)(1/2) = 1/12.
Example 2
Solution:
Bernoulli trial
Suppose in an experiment we are only concerned with whether a particular event A has occurred or not. We call this event the 'success', with probability p, and the complementary event the 'failure', with probability 1 - p.
Probability of Success: P(A) = p
Probability of Failure: P(A') = 1 - p
Consider n independent repetitions of the Bernoulli trial. Let S be the sample space associated with each trial, and suppose we are interested in a particular event A and its complement A' such that P(A) = p and P(A') = 1 - p.
Any outcome of the combined experiment is of the form B1 ∩ B2 ∩ ... ∩ Bn, where some Bi's are A and the remaining Bi's are A'. By independence, any such outcome with k A's and n - k A''s has probability p^k (1 - p)^(n - k).
There are nCk such outcomes with k A's and n - k A''s.
Hence the probability of k successes in n independent repetitions of the Bernoulli trial is given by
P_n(k) = nCk p^k (1 - p)^(n - k).
A typical plot of P_n(k) vs k for n = 20 and a particular value of p is shown in the figure.
Example 3 A fair die is rolled 6 times. What is the probability that a 4 appears thrice?
Solution:
We have n = 6 trials, with 'success' meaning that a 4 appears, so p = 1/6 and 1 - p = 5/6.
Hence P(a 4 appears thrice) = 6C3 (1/6)^3 (5/6)^3 ≈ 0.0536.
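The arithmetic of Example 3 can be checked with a short Python helper (an added sketch; binomial_pmf is a hypothetical helper name, not from the notes):

from math import comb

# P(exactly k successes in n Bernoulli trials) = C(n, k) p^k (1 - p)^(n - k)
def binomial_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example 3: a fair die rolled 6 times, '4' appears exactly three times.
print(round(binomial_pmf(3, 6, 1/6), 4))   # ≈ 0.0536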
Example 4
A communication source emits binary symbols 1 and 0 with probability 0.6 and 0.4 respectively. What is the
probability that there will be 5 1's in a message of 20 symbols?
Solution: P(five 1's in 20 symbols) = 20C5 (0.6)^5 (0.4)^15 ≈ 0.0013.
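A one-line check of Example 4 in Python (added for illustration, not part of the notes):

from math import comb

# Example 4: P(exactly five 1's in a message of 20 symbols) with P(1) = 0.6.
p = 0.6
prob = comb(20, 5) * p**5 * (1 - p)**15
print(round(prob, 4))   # ≈ 0.0013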
Example 5 In a binary communication system, bit error occurs with a probability of . What is
the probability of getting at least one error bit in a message of 8 bits?
Solution: P(at least one error in 8 bits) = 1 - P(no error) = 1 - (1 - p)^8, where p is the given bit error probability.
The right-hand side is an expression related to the normal probability law, to be discussed in a later class.
More Problems
Question 1: A die is rolled, find the probability that an even number is obtained.
Solution to Question 1:
S = {1,2,3,4,5,6}
Let E be the event "an even number is obtained" and write it down.
E = {2,4,6}
P(E) = n(E) / n(S) = 3 / 6 = 1 / 2
Question 2: Two coins are tossed, find the probability that two heads are obtained.
Note: Each coin has two possible outcomes H (heads) and T (Tails).
Solution to Question 2:
S = {(H,T),(H,H),(T,H),(T,T)}
E = {(H,H)}
P(E) = n(E) / n(S) = 1 / 4
Question 3: Which of the following numbers cannot represent a probability?
a) -0.00001
Solution to Question 3:
A probability is always greater than or equal to 0 and less than or equal to 1, hence only a) and c) above cannot represent probabilities: -0.00001 is less than 0 and 1.001 is greater than 1.
Question 4: Two dice are rolled, find the probability that the sum is
a) equal to 1
b) equal to 4
c) less than 13
Solution to Question 4:
S = { (1,1),(1,2),(1,3),(1,4),(1,5),(1,6)
(2,1),(2,2),(2,3),(2,4),(2,5),(2,6)
(3,1),(3,2),(3,3),(3,4),(3,5),(3,6)
(4,1),(4,2),(4,3),(4,4),(4,5),(4,6)
(5,1),(5,2),(5,3),(5,4),(5,5),(5,6)
(6,1),(6,2),(6,3),(6,4),(6,5),(6,6) }
a) Let E be the event "sum equal to 1". There are no outcomes which correspond to a sum equal to 1, hence P(E) = 0 / 36 = 0.
b) Three outcomes, (1,3), (2,2) and (3,1), give a sum equal to 4, hence P = 3 / 36 = 1 / 12.
c) All 36 outcomes have a sum less than 13, hence P = 36 / 36 = 1.
Question 5: A die is rolled and a coin is tossed; find the probability that the die shows an odd number and the coin shows a head.
Solution to Question 5:
S = { (1,H),(2,H),(3,H),(4,H),(5,H),(6,H)
(1,T),(2,T),(3,T),(4,T),(5,T),(6,T)}
Let E be the event "the die shows an odd number and the coin shows a head". Event E may be
described as follows
E = {(1,H),(3,H),(5,H)}
P(E) = n(E) / n(S) = 3 / 12 = 1 / 4
Question 6: A card is drawn at random from a deck of cards. Find the probability of getting the 3 of
diamond.
Solution to Question 6:
P(E) = 1 / 52
Question 7: A card is drawn at random from a deck of cards. Find the probability of getting a queen.
Solution to Question 7:
The sample space S of the experiment in question 7 is shown above (see question 6)
Let E be the event "getting a Queen". An examination of the sample space shows that there are 4
"Queens" so that n(E) = 4 and n(S) = 52. Hence the probability of event E occurring is given by
P(E) = 4 / 52 = 1 / 13
Question 8: A jar contains 3 red marbles, 7 green marbles and 10 white marbles. If a marble is drawn from
the jar at random, what is the probability that this marble is white?
Solution to Question 8:
We first construct a table of frequencies that gives the marbles color distributions as follows
color frequency
red 3
green 7
white 10
Total frequencies = 20
P(E) = frequency of white / total frequencies = 10 / 20 = 1 / 2
Question 9: The blood groups of 200 people are distributed as follows: 50 have type A, 65 have type B, 70 have type O and 15 have type AB. If a person from this group is selected at random, what is the probability that this person has blood type O?
Solution to Question 9:
group frequency
A 50
B 65
O 70
AB 15
Total frequencies = 200
P(E) = frequency of group O / total frequencies = 70 / 200 = 0.35
Random Variable
In application of probabilities, we are often concerned with numerical values which are random in
nature. For example, we may consider the number of customers arriving at a service station at a particular
interval of time or the transmission time of a message in a communication system. These random
quantities may be considered as real-valued function on the sample space. Such a real-valued function is
called real random variable and plays an important role in describing random data. We shall introduce the
concept of random variables in the following sections.
Mathematical Preliminaries
Consider a function X that assigns a real number X(s) to each element s of a set S. The set S is called the domain of X and the set R_X = {X(s) : s ∈ S} is called the range of X. Clearly R_X ⊆ ℝ.
A random variable associates the points in the sample space with real numbers.
Consider the probability space (S, F, P) and a function X mapping the sample space S into the real line. For a finite sample space, the inverse image {s : X(s) ∈ B} of any subset B of the real line is always a valid event, but the same may not be true if S is infinite. The concept of the sigma algebra is again necessary to overcome this difficulty. We also need the Borel sigma algebra, the sigma algebra defined on the real line.
The function X : S → ℝ is called a random variable if the inverse image of every Borel set under X is an event. Thus, if X is a random variable, then X^(-1)(B) = {s : X(s) ∈ B} ∈ F for every Borel set B.
Here X^(-1)(B) denotes the inverse image of B under X.
Example 2 Consider the sample space associated with the single toss of a fair die. The sample space is given by S = {1, 2, 3, 4, 5, 6}.
If we define the random variable X that associates a real number equal to the number on the face of the die that shows up, then X is a random variable taking the values 1, 2, ..., 6.
The random variable X induces a probability P_X on the Borel sets through P_X(B) = P(X^(-1)(B)), and this set function satisfies the three axioms of probability:
Axiom 1: P_X(B) ≥ 0.
Axiom 2: P_X(ℝ) = P(S) = 1.
Axiom 3: For disjoint Borel sets B1, B2, ..., P_X(B1 ∪ B2 ∪ ...) = P_X(B1) + P_X(B2) + ...
We have seen that the events {s : X(s) ∈ B} and {X ∈ B} are equivalent and P_X(B) = P({X ∈ B}). The underlying sample space is omitted in the notation and we simply write P(X ∈ B) and {X ∈ B} instead of P({s : X(s) ∈ B}) and {s : X(s) ∈ B} respectively.
Consider the Borel set (-∞, x], where x represents any real number. The equivalent event is X^(-1)((-∞, x]) = {s : X(s) ≤ x}, which is denoted simply by {X ≤ x}.
For example, the event {X ≤ 1} stands for the set of all sample points s with X(s) ≤ 1, and so on.
This probability P(X ≤ x), considered as a function of x, completely characterizes the random variable and is called the cumulative distribution function (abbreviated as CDF) of X, denoted by F_X(x). Thus
F_X(x) = P(X ≤ x).
Properties of the distribution function:
1. 0 ≤ F_X(x) ≤ 1.
This follows from the fact that F_X(x) is a probability and its value should lie between 0 and 1.
2. F_X(x) is a non-decreasing function of x.
3. F_X(x) is right continuous.
4. F_X(-∞) = 0.
5. F_X(∞) = 1.
6. P(x1 < X ≤ x2) = F_X(x2) - F_X(x1).
We have {X ≤ x2} = {X ≤ x1} ∪ {x1 < X ≤ x2}, and the two events on the right are mutually exclusive, so F_X(x2) = F_X(x1) + P(x1 < X ≤ x2).
7. P(X > x) = 1 - F_X(x).
Find a) .
b) .
c) .
d) .
Solution:
• X is called a discrete random variable if F_X(x) is flat except at the points of jump discontinuity. If the sample space is discrete, the random variable defined on it is always discrete.
• X is called a continuous random variable if F_X(x) is an absolutely continuous function of x.
• X is called a mixed random variable if F_X(x) has jump discontinuities at a countable number of points and increases continuously over at least one interval of values of x. For such a random variable X, F_X(x) can be written as a mixture of a discrete part and a continuous part.
Typical plots of F_X(x) for discrete, continuous and mixed random variables are shown in Figure 1, Figure 2 and Figure 3 respectively.
A random variable is said to be discrete if the number of elements in its range is finite or countably infinite.
First assume the range R_X to be countably finite. Let x_1, x_2, ..., x_N be the elements of R_X. Here the mapping X partitions the sample space into the events {X = x_i}, i = 1, ..., N, and the discrete random variable is completely specified by its probability mass function (pmf) p_X(x_i) = P(X = x_i).
• Suppose B ⊆ R_X. Then P(X ∈ B) = Σ_{x_i ∈ B} p_X(x_i).
Interpretation of the probability density function f_X(x): for a small Δx, P(x < X ≤ x + Δx) ≈ f_X(x) Δx,
so that f_X(x) Δx is the probability contained in an interval of length Δx around x. Thus f_X(x) represents the concentration of probability just as the density represents the concentration of mass.
Remark: Using the Dirac delta function we can define the density function for discrete random variables.
Consider the random variable X defined by the probability mass function (pmf) p_X(x_i) = P(X = x_i), i = 1, 2, ..., N.
Then the density function can be written in terms of the Dirac delta function as f_X(x) = Σ_i p_X(x_i) δ(x - x_i).
Example 2
Consider the random variable defined with the distribution function given by,
where
Suppose denotes the countable subset of points on such that the random variable
can be expressed as
Example 5
X is the random variable representing the life time of a device with the PDF for . Define the
following random variable
Find FY(y).
To understand binomial distributions and binomial probability, it helps to understand binomial experiments
and some associated notation; so we cover those topics first.
Binomial Experiment
Consider the following statistical experiment. You flip a coin 2 times and count the number of times the coin lands on heads. This is a binomial experiment because:
The experiment consists of repeated trials: we flip the coin 2 times.
Each trial can result in just two possible outcomes: heads or tails.
The probability of success is constant: 0.5 on every trial.
The trials are independent: the outcome of one coin flip does not affect the outcome of the other.
Notation
Binomial Distribution
A binomial random variable is the number of successes x in n repeated trials of a binomial experiment. The
probability distribution of a binomial random variable is called a binomial distribution.
Suppose we flip a coin two times and count the number of heads (successes). The binomial random variable
is the number of heads, which can take on values of 0, 1, or 2. The binomial distribution is presented below.
Number of heads (x)   Probability P(x)
0                     0.25
1                     0.50
2                     0.25
The binomial probability refers to the probability that a binomial experiment results in exactly x successes.
For example, in the above table, we see that the binomial probability of getting exactly one head in two
coin flips is 0.50.
Given x, n, and P, we can compute the binomial probability based on the binomial formula:
Binomial Formula. Suppose a binomial experiment consists of n trials and results in x successes. If the
probability of success on an individual trial is P, then the binomial probability is:
b(x; n, P) = nCx * P^x * (1 - P)^(n - x)
Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?
Solution: This is a binomial experiment in which the number of trials is equal to 5, the number of successes
is equal to 2, and the probability of success on a single trial is 1/6 or about 0.167. Therefore, the binomial probability is:
b(2; 5, 1/6) = 5C2 * (1/6)^2 * (5/6)^3 ≈ 0.161
A cumulative binomial probability refers to the probability that the binomial random variable falls within a
specified range (e.g., is greater than or equal to a stated lower limit and less than or equal to a stated upper
limit).
For example, we might be interested in the cumulative binomial probability of obtaining 45 or fewer heads
in 100 tosses of a coin (see Example 1 below). This would be the sum of all these individual binomial
probabilities.
Example 1
What is the probability of obtaining 45 or fewer heads in 100 tosses of a fair coin?
Solution: To solve this problem, we compute 46 individual probabilities, using the binomial formula. The
sum of all these probabilities is the answer we seek. Thus,
b(x < 45; 100, 0.5) = b(x = 0; 100, 0.5) + b(x = 1; 100, 0.5) + . . . + b(x = 45; 100, 0.5)
b(x < 45; 100, 0.5) = 0.184
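The quoted value 0.184 can be reproduced with the Python standard library (an illustrative sketch, summing the 46 individual binomial terms directly):

from math import comb

# Cumulative binomial probability P(X <= 45) for n = 100 fair coin tosses.
n, p = 100, 0.5
prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(46))
print(round(prob, 3))   # 0.184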
Example 2
The probability that a student is accepted to a prestigious college is 0.3. If 5 students from the same school
apply, what is the probability that at most 2 are accepted?
Solution: To solve this problem, we compute 3 individual probabilities, using the binomial formula. The sum of all these probabilities is the answer we seek. Thus,
b(x ≤ 2; 5, 0.3) = b(0; 5, 0.3) + b(1; 5, 0.3) + b(2; 5, 0.3) = 0.1681 + 0.3602 + 0.3087 ≈ 0.837
Example 3
What is the probability that the world series will last 4 games? 5 games? 6 games? 7 games? Assume that
the teams are evenly matched.
Solution: This is a very tricky application of the binomial distribution. If you can follow the logic of this
solution, you have a good understanding of the material covered in the tutorial, to this point.
In the world series, there are two baseball teams. The series ends when the winning team wins 4 games.
Therefore, we define a success as a win by the team that ultimately becomes the world series champion.
For the purpose of this analysis, we assume that the teams are evenly matched. Therefore, the probability
that a particular team wins a particular game is 0.5.
Let's look first at the simplest case. What is the probability that the series lasts only 4 games? This can occur
if one team wins the first 4 games. The probability of the National League team winning 4 games in a row
is: 0.5 * 0.5 * 0.5 * 0.5 = 0.0625.
Similarly, when we compute the probability of the American League team winning 4 games in a row, we
find that it is also 0.0625. Therefore, probability that the series ends in four games would be 0.0625 +
0.0625 = 0.125; since the series would end if either the American or National League team won 4 games in
a row.
Now let's tackle the question of finding probability that the world series ends in 5 games. The trick in
finding this solution is to recognize that the series can only end in 5 games, if one team has won 3 out of
the first 4 games. So let's first find the probability that the American League team wins exactly 3 of the first
4 games. This is the binomial probability b(3; 4, 0.5) = 4C3 * (0.5)^3 * (0.5)^1 = 0.25.
Okay, here comes some more tricky stuff, so listen up. Given that the American League team has won 3 of
the first 4 games, the American League team has a 50/50 chance of winning the fifth game to end the
series. Therefore, the probability of the American League team winning the series in 5 games is 0.25 * 0.50
= 0.125. Since the National League team could also win the series in 5 games, the probability that the series
ends in 5 games would be 0.125 + 0.125 = 0.25.
The rest of the problem would be solved in the same way. You should find that the probability of the series
ending in 6 games is 0.3125; and the probability of the series ending in 7 games is also 0.3125.
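The whole world-series argument compresses into a few lines of Python (an added sketch; the factor of 2 accounts for either team being the eventual champion):

from math import comb

# P(series ends in exactly g games) for evenly matched teams:
# the champion wins 3 of the first g-1 games and then wins game g.
for g in range(4, 8):
    prob = 2 * comb(g - 1, 3) * 0.5 ** g
    print(g, prob)    # 0.125, 0.25, 0.3125, 0.3125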
Negative Binomial Distribution
In this lesson, we cover the negative binomial distribution and the geometric distribution. As we will see,
the geometric distribution is a special case of the negative binomial distribution.
A negative binomial experiment is a statistical experiment that has the following properties:
Consider the following statistical experiment. You flip a coin repeatedly and count the number of times the
coin lands on heads. You continue flipping the coin until it has landed 5 times on heads. This is a negative
binomial experiment because:
The experiment consists of repeated trials. We flip a coin repeatedly until it has landed 5 times on
heads.
Each trial can result in just two possible outcomes - heads or tails.
The probability of success is constant - 0.5 on every trial.
The trials are independent; that is, getting heads on one trial does not affect whether we get heads
on other trials.
The experiment continues until a fixed number of successes have occurred; in this case, 5 heads.
Notation
The following notation is helpful, when we talk about negative binomial probability.
A negative binomial random variable is the number X of repeated trials to produce r successes in a
negative binomial experiment. The probability distribution of a negative binomial random variable is called
a negative binomial distribution. The negative binomial distribution is also known as the Pascal
distribution.
Suppose we flip a coin repeatedly and count the number of heads (successes). If we continue flipping the
coin until it has landed 2 times on heads, we are conducting a negative binomial experiment. The negative
binomial random variable is the number of coin flips required to achieve 2 heads. In this example, the
number of coin flips is a random variable that can take on any integer value between 2 and plus infinity.
The negative binomial probability distribution for this example is presented below.
Number of coin flips (x)   Probability P(x)
2                          0.25
3                          0.25
4                          0.1875
5                          0.125
6                          0.078125
7 or more                  0.109375
The negative binomial probability refers to the probability that a negative binomial experiment results in r
- 1 successes after trial x - 1 and r successes after trial x. For example, in the above table, we see that the
negative binomial probability of getting the second head on the sixth flip of the coin is 0.078125.
Given x, r, and P, we can compute the negative binomial probability based on the following formula:
Negative Binomial Formula. Suppose a negative binomial experiment consists of x trials and results in r
successes. If the probability of success on an individual trial is P, then the negative binomial probability is:
b*(x; r, P) = (x-1)C(r-1) * P^r * (1 - P)^(x - r)
If we define the mean of the negative binomial distribution as the average number of trials required to
produce r successes, then the mean is equal to:
μ=r/P
where μ is the mean number of trials, r is the number of successes, and P is the probability of a success on
any given trial.
As if statistics weren't challenging enough, the above definition is not the only definition for the negative
binomial distribution. Two common alternative definitions are:
The negative binomial random variable is R, the number of successes before the binomial
experiment results in k failures. The mean of R is:
μR = kP/Q
The negative binomial random variable is K, the number of failures before the binomial experiment
results in r successes. The mean of K is:
μK = rQ/P
The moral: If someone talks about a negative binomial distribution, find out how they are defining the
negative binomial random variable.
On this web site, when we refer to the negative binomial distribution, we are talking about the definition
presented earlier. That is, we are defining the negative binomial random variable as X, the total number of
trials required for the binomial experiment to produce r successes.
Geometric Distribution
The geometric distribution is a special case of the negative binomial distribution. It deals with the number
of trials required for a single success. Thus, the geometric distribution is negative binomial distribution
where the number of successes (r) is equal to 1.
An example of a geometric distribution would be tossing a coin until it lands on heads. We might ask: What
is the probability that the first head occurs on the third flip? That probability is referred to as a geometric
probability and is denoted by g(x; P). The formula for geometric probability is given below.
g(x; P) = P * Q^(x - 1)
Sample Problems
The problems below show how to apply your new-found knowledge of the negative binomial distribution
(see Example 1) and the geometric distribution (see Example 2).
Example 1
Bob is a high school basketball player. He is a 70% free throw shooter. That means his probability of making
a free throw is 0.70. During the season, what is the probability that Bob makes his third free throw on his
fifth shot?
Solution: This is an example of a negative binomial experiment. The probability of success (P) is 0.70, the
number of trials (x) is 5, and the number of successes (r) is 3.
To solve this problem, we enter these values into the negative binomial formula.
b*(x; r, P) = (x-1)C(r-1) * P^r * Q^(x - r)
b*(5; 3, 0.7) = 4C2 * (0.7)^3 * (0.3)^2
b*(5; 3, 0.7) = 6 * 0.343 * 0.09 = 0.18522
Thus, the probability that Bob will make his third successful free throw on his fifth shot is 0.18522.
Example 2
Let's reconsider the above problem from Example 1. This time, we'll ask a slightly different question: What
is the probability that Bob makes his first free throw on his fifth shot?
Solution: This is an example of a geometric distribution, which is a special case of a negative binomial
distribution. Therefore, this problem can be solved using the negative binomial formula or the geometric
formula. We demonstrate each approach below, beginning with the negative binomial formula.
The probability of success (P) is 0.70, the number of trials (x) is 5, and the number of successes (r) is 1. We
enter these values into the negative binomial formula.
b*(x; r, P) = (x-1)C(r-1) * P^r * Q^(x - r)
b*(5; 1, 0.7) = 4C0 * (0.7)^1 * (0.3)^4
b*(5; 1, 0.7) = 0.00567
We can also solve this problem using the geometric formula:
g(x; P) = P * Q^(x - 1)
g(5; 0.7) = 0.7 * (0.3)^4 = 0.00567
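Both computations can be reproduced with a small Python helper (an added sketch; neg_binomial is a hypothetical helper name, not from the notes):

from math import comb

# Negative binomial: P(the r-th success occurs on trial x) with success probability P.
def neg_binomial(x: int, r: int, P: float) -> float:
    return comb(x - 1, r - 1) * P**r * (1 - P)**(x - r)

# Example 1: third free throw on the fifth shot (P = 0.7, r = 3, x = 5).
print(round(neg_binomial(5, 3, 0.7), 5))   # 0.18522

# Example 2: first free throw on the fifth shot; the geometric case r = 1.
print(round(neg_binomial(5, 1, 0.7), 5))   # 0.00567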
Hypergeometric Distribution
Notation
The following notation is helpful, when we talk about hypergeometric distributions and hypergeometric
probability.
Hypergeometric Experiments
Consider the following statistical experiment. You have an urn of 10 marbles - 5 red and 5 green. You
randomly select 2 marbles without replacement and count the number of red marbles you have selected.
This would be a hypergeometric experiment.
Note that it would not be a binomial experiment. A binomial experiment requires that the probability of
success be constant on every trial. With the above experiment, the probability of a success changes on
every trial. In the beginning, the probability of selecting a red marble is 5/10. If you select a red marble on
the first trial, the probability of selecting a red marble on the second trial is 4/9. And if you select a green
marble on the first trial, the probability of selecting a red marble on the second trial is 5/9.
Hypergeometric Distribution
A hypergeometric random variable is the number of successes that result from a hypergeometric
experiment. The probability distribution of a hypergeometric random variable is called a hypergeometric
distribution.
Given x, N, n, and k, we can compute the hypergeometric probability based on the following formula:
Hypergeometric Formula. Suppose a population consists of N items, k of which are successes. And a
random sample drawn from that population consists of n items, x of which are successes. Then the
hypergeometric probability is:
h(x; N, n, k) = [kCx * (N-k)C(n-x)] / NCn
Example 1
Suppose we randomly select 5 cards without replacement from an ordinary deck of playing cards. What is
the probability of getting exactly 2 red cards (i.e., hearts or diamonds)?
Solution: Here N = 52, n = 5, k = 26 and x = 2, so h(2; 52, 5, 26) = [26C2 * 26C3] / 52C5 ≈ 0.3251.
For example, suppose we randomly select five cards from an ordinary deck of playing cards. We might be
interested in the cumulative hypergeometric probability of obtaining 2 or fewer hearts. This would be the
probability of obtaining 0 hearts plus the probability of obtaining 1 heart plus the probability of obtaining 2
hearts, as shown in the example below.
Example 1
Suppose we select 5 cards from an ordinary deck of playing cards. What is the probability of obtaining 2 or
fewer hearts?
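An illustrative Python sketch (not part of the original text; hypergeom here is a local helper, not a library call) reproduces both card problems:

from math import comb

# Hypergeometric: P(x successes in n draws without replacement
# from a population of N items containing k successes).
def hypergeom(x: int, N: int, n: int, k: int) -> float:
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Exactly 2 red cards (26 reds in a 52-card deck) in 5 draws.
print(round(hypergeom(2, 52, 5, 26), 4))                          # ≈ 0.3251

# 2 or fewer hearts (13 hearts in the deck) in 5 draws.
print(round(sum(hypergeom(x, 52, 5, 13) for x in range(3)), 4))   # ≈ 0.9072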
Poisson Distribution
A Poisson distribution is the probability distribution that results from a Poisson experiment.
Notation
The following notation is helpful, when we talk about the Poisson distribution.
e: A constant equal to approximately 2.71828. (Actually, e is the base of the natural logarithm
system.)
μ: The mean number of successes that occur in a specified region.
x: The actual number of successes that occur in a specified region.
P(x; μ): The Poisson probability that exactly x successes occur in a Poisson experiment, when the
mean number of successes is μ.
Poisson Distribution
A Poisson random variable is the number of successes that result from a Poisson experiment. The
probability distribution of a Poisson random variable is called a Poisson distribution.
Given the mean number of successes (μ) that occur in a specified region, we can compute the Poisson
probability based on the following formula:
Poisson Formula. Suppose we conduct a Poisson experiment, in which the average number of successes
within a given region is μ. Then, the Poisson probability is:
P(x; μ) = e^(-μ) * μ^x / x!
where x is the actual number of successes that result from the experiment, and e is approximately equal to
2.71828.
Example 1
The average number of homes sold by the Acme Realty company is 2 homes per day. What is the
probability that exactly 3 homes will be sold tomorrow?
A cumulative Poisson probability refers to the probability that the Poisson random variable is greater than
some specified lower limit and less than some specified upper limit.
Example 1
Suppose the average number of lions seen on a 1-day safari is 5. What is the probability that tourists will
see fewer than four lions on the next 1-day safari?
To solve this problem, we need to find the probability that tourists will see 0, 1, 2, or 3 lions. Thus, we need
to calculate the sum of four probabilities: P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5). To compute this sum, we use
the Poisson formula:
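Both Poisson examples can be checked with a few lines of Python (an added sketch using only the standard library):

from math import exp, factorial

# Poisson probability P(x; mu) = e^(-mu) * mu^x / x!
def poisson(x: int, mu: float) -> float:
    return exp(-mu) * mu**x / factorial(x)

# Acme Realty: mean 2 homes/day, exactly 3 homes sold tomorrow.
print(round(poisson(3, 2), 4))                          # ≈ 0.1804

# Safari: mean 5 lions/day, fewer than 4 lions seen.
print(round(sum(poisson(x, 5) for x in range(4)), 4))   # ≈ 0.265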
The normal distribution refers to a family of continuous probability distributions described by the normal equation
f(X) = [1 / (σ * sqrt(2π))] * e^(-(X - μ)^2 / (2σ^2))
where X is a normal random variable, μ is the mean, σ is the standard deviation, π is approximately 3.14159, and e is approximately 2.71828.
The random variable X in the normal equation is called the normal random variable. The normal equation
is the probability density function for the normal distribution.
The graph of the normal distribution depends on two factors - the mean and the standard deviation. The
mean of the distribution determines the location of the center of the graph, and the standard deviation
determines the height and width of the graph. When the standard deviation is large, the curve is short and
wide; when the standard deviation is small, the curve is tall and narrow. All normal distributions look like a
symmetric, bell-shaped curve, as shown below.
The curve on the left is shorter and wider than the curve on the right, because the curve on the left has a
bigger standard deviation.
The normal distribution is a continuous probability distribution. This has several implications for probability.
About 68% of the area under the curve falls within 1 standard deviation of the mean.
About 95% of the area under the curve falls within 2 standard deviations of the mean.
About 99.7% of the area under the curve falls within 3 standard deviations of the mean.
Collectively, these points are known as the empirical rule or the 68-95-99.7 rule. Clearly, given a normal
distribution, most outcomes will be within 3 standard deviations of the mean.
To find the probability associated with a normal random variable, use a graphing calculator, an online
normal distribution calculator, or a normal distribution table. In the examples below, we illustrate the use
of Stat Trek's Normal Distribution Calculator, a free tool available on this site. In the next lesson, we
demonstrate the use of normal distribution tables.
Example 1
An average light bulb manufactured by the Acme Corporation lasts 300 days with a standard deviation of
50 days. Assuming that bulb life is normally distributed, what is the probability that an Acme light bulb will
last at most 365 days?
Solution: Given a mean score of 300 days and a standard deviation of 50 days, we want to find the
cumulative probability that bulb life is less than or equal to 365 days. Thus, we know the following:
We enter these values into the Normal Distribution Calculator and compute the cumulative probability. The
answer is: P( X < 365) = 0.90. Hence, there is a 90% chance that a light bulb will burn out within 365 days.
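For readers without the online calculator, the same cumulative probability can be obtained from Python's standard library (an illustrative sketch; statistics.NormalDist requires Python 3.8 or later):

from statistics import NormalDist

# Bulb life ~ Normal(mean = 300 days, sd = 50 days); P(X <= 365).
bulb_life = NormalDist(mu=300, sigma=50)
print(round(bulb_life.cdf(365), 4))   # ≈ 0.9032, i.e. about a 90% chance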
Example 2
Suppose scores on an IQ test are normally distributed. If the test has a mean of 100 and a standard
deviation of 10, what is the probability that a person who takes the test will score between 90 and 110?
We use the Normal Distribution Calculator to compute both probabilities on the right side of the above
equation.
To compute P( X < 110 ), we enter the following inputs into the calculator: The value of the normal
random variable is 110, the mean is 100, and the standard deviation is 10. We find that P( X < 110 )
is 0.84.
To compute P( X < 90 ), we enter the following inputs into the calculator: The value of the normal
random variable is 90, the mean is 100, and the standard deviation is 10. We find that P( X < 90 ) is
0.16.
Thus, about 68% of the test scores will fall between 90 and 110.
The standard normal distribution is a special case of the normal distribution. It is the distribution that
occurs when a normal random variable has a mean of zero and a standard deviation of one.
The normal random variable of a standard normal distribution is called a standard score or a z-score. Every
normal random variable X can be transformed into a z score via the following equation:
z = (X - μ) / σ
where X is a normal random variable, μ is the mean of X, and σ is the standard deviation of X.
A standard normal distribution table shows a cumulative probability associated with a particular z-score.
Table rows show the whole number and tenths place of the z-score. Table columns show the hundredths
place. The cumulative probability (often from minus infinity to the z-score) appears in the cell of the table.
For example, a section of the standard normal table is reproduced below. To find the cumulative probability of a z-score equal to -1.31, cross-reference the row of the table containing -1.3 with the column headed 0.01; the intersection gives the cumulative probability 0.0951.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
... ... ... ... ... ... ... ... ... ... ...
-1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0722 0.0708 0.0694 0.0681
-1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
-1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
... ... ... ... ... ... ... ... ... ... ...
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
Of course, you may not be interested in the probability that a standard normal random variable falls
between minus infinity and a given value. You may want to know the probability that it lies between a
given value and plus infinity. Or you may want to know the probability that a standard normal random
variable lies between two given values. These probabilities are easy to compute from a normal distribution
table. Here's how.
Find P(Z > a). The probability that a standard normal random variable (z) is greater than a given
value (a) is easy to find. The table shows the P(Z < a). The P(Z > a) = 1 - P(Z < a).
Suppose, for example, that we want to know the probability that a z-score will be greater than 3.00.
From the table (see above), we find that P(Z < 3.00) = 0.9987. Therefore, P(Z > 3.00) = 1 - P(Z < 3.00)
= 1 - 0.9987 = 0.0013.
Find P(a < Z < b). The probability that a standard normal random variable lies between two values is also easy to find: P(a < Z < b) = P(Z < b) - P(Z < a).
For example, suppose we want to know the probability that a z-score will be greater than -1.40 and
less than -1.20. From the table (see above), we find that P(Z < -1.20) = 0.1151; and P(Z < -1.40) =
0.0808. Therefore, P(-1.40 < Z < -1.20) = P(Z < -1.20) - P(Z < -1.40) = 0.1151 - 0.0808 = 0.0343.
In school or on the Advanced Placement Statistics Exam, you may be called upon to use or interpret
standard normal distribution tables. Standard normal tables are commonly found in appendices of most
statistics texts.
Often, phenomena in the real world follow a normal (or near-normal) distribution. This allows researchers
to use the normal distribution as a model for assessing probabilities associated with real-world
phenomena. Typically, the analysis involves two steps.
Transform raw data. Usually, the raw data are not in the form of z-scores. They need to be
transformed into z-scores, using the transformation equation presented earlier: z = (X - μ) / σ.
Find probability. Once the data have been transformed into z-scores, you can use standard normal
distribution tables, online calculators (e.g., Stat Trek's free normal distribution calculator), or handheld
graphing calculators to find probabilities associated with the z-scores.
The problem in the next section demonstrates the use of the normal distribution as a model for
measurement.
Chi-Square Distribution
The distribution of the chi-square statistic is called the chi-square distribution. In this lesson, we learn to
compute the chi-square statistic and find the probability associated with the statistic. Chi-square examples
illustrate key points.
Suppose we conduct the following statistical experiment. We select a random sample of size n from a
normal population, having a standard deviation equal to σ. We find that the standard deviation in our
sample is equal to s. Given these data, we can define a statistic, called chi-square, using the following
equation:
Χ^2 = [ ( n - 1 ) * s^2 ] / σ^2
If we repeated this experiment an infinite number of times, we could obtain a sampling distribution for the
chi-square statistic. The chi-square distribution is defined by the following probability density function:
Y = Y0 * (Χ^2)^(v/2 - 1) * e^(-Χ^2 / 2)
where Y0 is a constant that depends on the number of degrees of freedom, Χ2 is the chi-square statistic, v =
n - 1 is the number of degrees of freedom, and e is a constant equal to the base of the natural logarithm
system (approximately 2.71828). Y0 is defined, so that the area under the chi-square curve is equal to one.
In the figure below, the red curve shows the distribution of chi-square values computed from all possible
samples of size 3, where degrees of freedom is n - 1 = 3 - 1 = 2. Similarly, the green curve shows the
distribution for samples of size 5 (degrees of freedom equal to 4); and the blue curve, for samples of size 11
(degrees of freedom equal to 10).
The chi-square distribution is constructed so that the total area under the curve is equal to 1. The area
under the curve between 0 and a particular chi-square value is a cumulative probability associated with
that chi-square value. For example, in the figure below, the shaded area represents a cumulative
probability associated with a chi-square statistic equal to A; that is, it is the probability that the value of a
chi-square statistic will fall between 0 and A.
Fortunately, we don't have to compute the area under the curve to find the probability. The easiest way to
find the cumulative probability associated with a particular chi-square statistic is to use the Chi-Square
Distribution Calculator, a free tool provided by Stat Trek.
The Chi-Square Distribution Calculator solves common statistics problems, based on the chi-square
distribution. The calculator computes cumulative probabilities, based on simple inputs.
Problem 1
The Acme Battery Company has developed a new cell phone battery. On average, the battery lasts 60
minutes on a single charge. The standard deviation is 4 minutes.
Suppose the manufacturing department runs a quality control test. They randomly select 7 batteries. The
standard deviation of the selected batteries is 6 minutes. What would be the chi-square statistic
represented by this test?
Solution
To compute the chi-square statistic, we plug these data in the chi-square equation, as shown below.
Χ^2 = [ ( n - 1 ) * s^2 ] / σ^2
Χ^2 = [ ( 7 - 1 ) * 6^2 ] / 4^2 = 13.5
where Χ2 is the chi-square statistic, n is the sample size, s is the standard deviation of the sample, and σ is
the standard deviation of the population.
Problem 2
Let's revisit the problem presented above. The manufacturing department ran a quality control test, using 7
randomly selected batteries. In their test, the standard deviation was 6 minutes, which equated to a chi-
square statistic of 13.5.
Suppose they repeated the test with a new random sample of 7 batteries. What is the probability that the
standard deviation in the new test would be greater than 6 minutes?
Given the degrees of freedom, we can determine the cumulative probability that the chi-square statistic
will fall between 0 and any positive value. To find the cumulative probability that a chi-square statistic falls
between 0 and 13.5, we enter the degrees of freedom (6) and the chi-square statistic (13.5) into the Chi-
Square Distribution Calculator. The calculator displays the cumulative probability: 0.96.
This tells us that the probability that a standard deviation would be less than or equal to 6 minutes is 0.96.
This means (by the subtraction rule) that the probability that the standard deviation would be greater than 6 minutes
is 1 - 0.96 or .04.
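Without the online calculator, the cumulative probability can be computed in Python; the closed form below holds only for an even number of degrees of freedom, which is the case here (df = 6). This sketch is an addition to the notes:

from math import exp, factorial

# Chi-square CDF for an even number of degrees of freedom 2k:
# F(x; 2k) = 1 - exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
def chi2_cdf_even_df(x: float, df: int) -> float:
    k = df // 2
    return 1 - exp(-x / 2) * sum((x / 2) ** i / factorial(i) for i in range(k))

# Battery test: n = 7, s = 6, sigma = 4  ->  chi-square statistic 13.5, df = 6.
chi_sq = (7 - 1) * 6**2 / 4**2
print(chi_sq, round(chi2_cdf_even_df(chi_sq, 6), 3))   # 13.5 0.964
# P(sd > 6 minutes) ≈ 1 - 0.96 = 0.04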
10. Questions
Objective questions
Question 1: In how many ways can the letters of the word ABACUS be rearranged such that the vowels
always appear together?
B. 3! * 3!
C.
D.
E.
B.
C. 56
D. 23
E.
Question 3:What is the probability that the position in which the consonants appear remain unchanged
when the letters of the word "Math" are re-arranged?
A. 1/4
B. 1/6
C. 1/3
D. 1/24
E. 1/12
Question 4: There are 6 boxes numbered 1, 2, ... 6. Each box is to be filled up either with a red or a green
ball in such a way that at least 1 box contains a green ball and the boxes containing green balls are
consecutively numbered. The total number of ways in which this can be done is:
A. 5
B. 21
C. 33
D. 60
E. 6
Question 5: A man can hit a target once in 4 shots. If he fires 4 shots in succession, what is the probability
that he will hit his target?
A.1
B.
C.
D.
E.
Question 6: In how many ways can 5 letters be posted in 3 post boxes, if any number of letters can be
posted in all of the three post boxes?
A. 5C3
B. 5P3
C. 5^3
D. 3^5
E. 25
Question 7: Ten coins are tossed simultaneously. In how many of the outcomes will the third coin turn up a
head?
A. 2^10
B. 2^9
C. 2^8
D. 2^9
E. None of these
Question 8: In how many ways can the letters of the word "PROBLEM" be rearranged to make seven letter
words such that none of the letters repeat?
A. 7!
B. 7C7
C. 7^7
D. 49
E. None of these
Short Questions
14. Define probability distribution of i) discrete random variable ii) continuous random variable.
Long Questions
1. In a factory, 4 machines A1, A2, A3 and A4 produce 10%, 25%, 35% and 30% of the items respectively.
The percentage of the defective items produced by them is 5%, 4%, 3% and 2% respectively. An item
selected at random is found to be defective. What is the probability that it was produced by machine A2?
2. In a communication system a zero is transmitted with probability 0.55. In the channel a zero is received as a zero with probability 0.9 and a one is received as a one with probability 0.8. Find the probability that
4. Suppose an urn contains ten white balls and five red balls. Two balls are withdrawn at random from the
urn without replacement:
5. Two balanced dice are rolled simultaneously. If the sum of the numbers shown by the two faces is 6, what is the probability that the number shown by one of the faces is 1?
9. Define probability distribution of i) a discrete random variable ii) a continuous random variable.
University Questions
Dec 2012
Q. If A and B are two events such that P(A)=0.3, P(B)=0.4, P(AB)=0.2 find P(A U B), P(A/B)
Q. If A and B are two events, prove that P(A U B)= P(A) + P(B) – P(AB)
Q. Suppose two million lottery tickets are issued with 100 winning tickets among them
(ii) How many tickets should one buy to be 95% confident of having a winning ticket?
Q. What is a Random Variable? Explain continuous and discrete Random Variables with suitable examples.
Define expectation of continuous and discrete Random Variables.
Dec 11
Q. If A and B are two independent events, prove that P(AB)= P(A). P(B)
Q. Suppose five cards are to be drawn at random from a standard deck of cards. If all the drawn cards are red, what is the probability that all of them are hearts?
Q. What is a Random Variable? Explain continuous and discrete Random Variables with suitable examples.
Q. The random variable has exponential probability density function f(x) = K e^(-|x|). Determine the value of K and the corresponding distribution function.
May 2011
(i) What is the probability that the ball chosen will be a white ball?
(ii) Given that the ball chosen is white, what is the probability that it came from box I?
Dec 2010
In a communication system a zero is transmitted with probability 0.4 and a one is transmitted with probability 0.6. Due to noise in the channel a zero can be received as a one with probability 0.1 and as a zero with probability 0.9; similarly, a one can be received as a zero with probability 0.1 and as a one with probability 0.9.
Now-
(i) A one was observed; what is the probability that a zero was transmitted?
(ii) A one was observed; what is the probability that a one was transmitted?
May 2010
Q. (a) Give the following definitions of probability with the shortcomings if any:
Q. A mechanism consists of three parts A, B, C and the probabilities of their failure are p, q, r respectively. The mechanism works if there is no failure in any of these parts. Find the probability that
(i) the mechanism is working
(ii) the mechanism is not working
Dec 2009
Q. In a communication system a zero is transmitted with probability 0.45. In the channel a zero is received as a zero with probability 0.9 and a one is received as a one with probability 0.8. Find the probability that
Q. Explain the concept of Joint and Conditional Probability with one example each.
Q. In a factory, 4 machines A1, A2, A3 and A4 produce 10%, 25%, 35% and 30% of the items respectively.
The percentage of the defective items produced by them is 5%, 4%, 3% and 2% respectively. An item
selected at random is found to be defective. What is the probability that it was produced by machine A2?
Q. If X, Y are two independent exponentially distributed random variables with the same parameter (unity), find the probability density functions of U = X + Y and V = X/(X + Y).
Q. A random variable X takes values 9, 13, 17, ..., (5 + 4n), each with probability 1/n. Find the mean and variance of X.
Q. What is a Random Variable? Explain continuous and discrete Random Variables.
Q. What is a Random Variable? Explain continuous and discrete Random Variables with suitable examples.
May 2007
Q. For a certain binary communication channel, the probability that a transmitted '0' is received as a '0' is 0.95, while the probability that a transmitted '1' is received as a '1' is 0.90. If the probability of transmitting a '0' is 0.4, find the probability that -
Dec 2006
Q. (a) Give the following definitions of probability with the shortcomings if any:
(i) What is the probability that the ball chosen will be a white ball?
(ii) Given that the ball chosen is white, what is the probability that it came from box I?
June 2006
Q(a) (i) Define the conditional probability of an event A given that another event B has occurred.
(ii) A biased coin is tossed till a head appears for the first time. What is the probability that the number of
tosses required is odd? [2+6]
Q.A certain test for a particular cancer is known to be 95% accurate. A person submits to the test and the
results are positive. Suppose the person comes from a population of 100,000 where 2000 people suffer
from that disease. What can we conclude about the probability that the person under test has that
particular cancer?
Dec 2005
(a) Let B1, B2, ..., Bn be a partition of the event space, Bi, i = 1, 2, ..., n. Suppose now an event A occurs. Find an expression for P(Bi | A) in terms of B1, ..., Bn and A. [10]
(b) Two balanced dice are rolled simultaneously. Given that the sum of the numbers shown by the two faces is 7, what is the probability that one of the faces shows 1? [10]
June 2005
Q (a) (i) With the help of a Venn diagram show that the conditional probability of occurrence of an event A given that the event B has occurred is given by -
(b) Suppose that 5 cards are to be drawn at random from a standard deck of 52 cards. If all the cards drawn are red, what is the probability that all of them are hearts?
Dec2004
(a) An experiment is performed N times. During the trials the event A occurs nA times and the event B occurs nAB times during the occurrences of the event A. From the relative frequency approach, define the probability of occurrence of the event A, P(A), the joint probability of occurrence of the events A and B, P(AB), and the conditional probability of the event B given that the event A has occurred, P(B/A), in terms of the frequencies of occurrence nA, nAB and N. Show that P(B/A) = P(AB) / P(A) and P(B'/A) = 1 - P(B/A).
(b) In a throw of a fair die, the event A = (the outcome is greater than 3) and the event B = (the outcome is an even number). Find P(A/B) and P(A'/B).
May 2004
(b) Suppose an urn contains five white balls and seven red balls. Two balls are withdrawn at random from the urn without replacement.
1. Motivation
When we have a random variable which is a function of another random variable, and we know the statistics of one of them, we can obtain the statistics of the unknown random variable in terms of the known one.
2. Syllabus
Characteristic functions
Moment theorem
5. Objective
In this chapter we study a few basic concepts of functions of random variables and investigate the expected
value of a certain function of a random variable. The techniques of moment generating functions and
characteristic functions, which are very useful in some applications, are presented.
6. Key Notation:
7. Key Definitions
8. Key Relations
1. The function of RV
2. Expected value of RV
4. Standard Deviation
5. Variance
6. Chebyshev Inequality
Often we have to consider random variables which are functions of other random variables. Let
Then the rectifier output is given by . We have to find the probability description of the random
variable . We consider the following cases:
Suppose,
Remark
(1) The distribution given by is called a uniform distribution over the interval [0,1].
(2) The above result is particularly important in simulating a random variable with a particular distribution
function.
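As a hedged illustration of how this simulation idea (inverse-transform sampling) might look in practice, the short Python sketch below draws Uniform[0,1] samples and pushes them through the inverse CDF of an assumed Exp(1) distribution; the distribution and the variable names are illustrative only, not part of the original text.

import numpy as np

# Minimal sketch of inverse-transform sampling (assumed example: Exp(1)).
# If U ~ Uniform[0,1] and F is a continuous CDF, then X = F^{-1}(U) has CDF F.
rng = np.random.default_rng(0)

u = rng.uniform(size=100_000)        # U ~ Uniform[0, 1]
x = -np.log(1.0 - u)                 # inverse CDF of Exp(1): F^{-1}(u) = -ln(1 - u)
print(x.mean())                      # close to 1.0, the mean of Exp(1)

# Conversely, F_X(X) is uniform over [0, 1] for a continuous RV X:
f_of_x = 1.0 - np.exp(-x)            # apply the Exp(1) CDF to the samples
print(f_of_x.mean())                 # close to 0.5, as expected for Uniform[0,1]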
We assumed the function to be one-to-one for invertibility. However, the result is more general -
the random variable defined by the distribution function of any random variable is uniformly
distributed over [0,1].
For example, if is a discrete RV,
Proof:
In the above, we assumed the equation to have three roots. In general, if it has n roots, then
Suppose
and
so that
The expectation operation extracts a few parameters of a random variable and provides a summary
description of the random variable in terms of these parameters.
It is far easier to estimate these parameters from data than to estimate the distribution or density
function of the random variable.
Moments are some important parameters obtained through the expectation operation.
provided exists.
is also called the mean or statistical average of the random variable and is denoted by
Note that, for a discrete RV with the probability mass function (pmf) the pdf
is given by
The mean gives an idea about the average value of the random variable. The values of the random
variable are spread about this value.
Therefore, the mean can be also interpreted as the centre of gravity of the pdf curve.
Example 1
Then
pX(x)
Then
Then
Remark
If the pdf is an even function of x, then the integral of x times the pdf vanishes. Thus the mean of a RV with an even symmetric pdf is 0.
We shall illustrate the above result in the special case when the function is one-to-one and monotonically increasing in x. In this case,
Figure 2
a) If is a constant,
Clearly
(b) If are two functions of the random variable and are constants,
Variance
For a random variable with the pdf and mean the variance of is denoted by and
defined as
Example 4
Example 5
Remark
The variance is a central moment and measure of dispersion of the random variable about the
mean.
is the average of the square deviation from the mean. It gives information about the
deviation of the values of the RV about the mean. A smaller implies that the random values are
more clustered about the mean. Similarly, a bigger means that the random values are more
scattered.
For example, consider two random variables with pmfs as shown below. Note that each of them has zero mean. The variances are given by and , implying that one has more spread about the mean.
The pdfs of two continuous random variables with the same mean and different variances are illustrated in Figure 4.
We could have used the mean absolute deviation to know about the deviation of the
random values about the mean. But it is more difficult both for analysis and numerical calculation.
Properties of variance
(1)
(2) If then
3) If is a constant,
We can define the nth moment and the nth central- moment of a random variable X by the following
relations
Note that
The mean is the first moment and the mean-square value is the second moment
The first central moment is 0 and the variance is the second central moment
The third central moment measures lack of symmetry of the pdf of a random variable
is called the coefficient of skewness. If the pdf is symmetric, this coefficient will be zero.
is called kurtosis. If the peak of the pdf is sharper, then the random variable has a
higher kurtosis.
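The sample versions of these quantities are easy to compute; the hedged Python sketch below estimates the mean, variance, coefficient of skewness and kurtosis from simulated data, using an Exp(1) random variable purely as an assumed example.

import numpy as np

# Sketch: estimating the first moments, the coefficient of skewness and the kurtosis.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200_000)     # assumed example distribution: Exp(1)

mean = x.mean()                                  # first moment
var = ((x - mean) ** 2).mean()                   # second central moment (variance)
skew = ((x - mean) ** 3).mean() / var ** 1.5     # coefficient of skewness
kurt = ((x - mean) ** 4).mean() / var ** 2       # kurtosis

print(mean, var, skew, kurt)                     # theoretical values for Exp(1): 1, 1, 2, 9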
The mean and variance also give some quantitative information about the bounds of RVs. Following
inequalities are extremely useful in many practical problems.
Chebyshev Inequality
control department rejects the item if the absolute deviation of from is greater than . What fraction of the manufactured items does the quality control department reject? Can you roughly guess it?
The standard deviation gives us an intuitive idea of how the random variable is distributed about the mean. This idea is more precisely expressed in the remarkable Chebyshev inequality stated below. For a random variable X with mean µ and variance σ², the Chebyshev inequality states that P(|X - µ| >= kσ) <= 1/k² for every k > 0.
Proof:
Remark
Example 6
A nonnegative RV has the mean . Find an upper bound of the probability .
Solution: By Markov's inequality
Just as with the frequency-domain characterisations of discrete-time and continuous-time signals, the probability mass function and the probability density function can also be characterized in the frequency domain by means of the characteristic function of a random variable. These functions are particularly important in
Characteristic function
Consider a random variable with probability density function The characteristic function of
denoted by is defined as
instead of . This implies that the properties of the Fourier transform apply to the characteristic function.
The interpretation that the characteristic function is the expectation of helps in calculating moments with the help of the characteristic function; in the simplest case, the mean follows from the first derivative of the characteristic function at the origin.
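As a hedged numerical illustration of this moment property, the Python sketch below uses the known characteristic function of an Exp(1) random variable (an assumed example) and recovers the mean from its derivative at the origin.

import numpy as np

# phi(w) = 1 / (1 - jw) is the characteristic function of Exp(1);
# the first moment is E[X] = phi'(0) / j.
phi = lambda w: 1.0 / (1.0 - 1j * w)

h = 1e-6
mean_from_cf = ((phi(h) - phi(-h)) / (2 * h)) / 1j     # central difference, divided by j
print(mean_from_cf.real)                               # close to 1.0 = E[X] for Exp(1)

# Monte Carlo check of phi(w) = E[exp(jwX)] at one value of w:
rng = np.random.default_rng(0)
x = rng.exponential(size=200_000)
print(np.mean(np.exp(1j * 0.7 * x)), phi(0.7))         # the two values should be close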
Example 1
Solution:
Example 2
Suppose X is a random variable taking values from the discrete set with corresponding
Then ,
In this case can be interpreted as the discrete-time Fourier transform with substituting
in the original discrete-time Fourier transform. The inverse relation is
Example 3
Then,
Example 4
If the random variable under consideration takes non-negative integer values only, it is convenient to characterize the random variable in terms of the probability generating function G(z) defined by G(z) = E[z^X] = Σk P(X = k) z^k.
Note that
Thus, given the probability generating function, we can get the probability mass function from the derivatives of G(z) at z = 0, since P(X = k) = G^(k)(0) / k!.
More problems
Problem
In a recent little league softball game, each player went to bat 4 times. The number of hits made by each
player is described by the following probability distribution.
Number of hits, x 0 1 2 3 4
Probability, P(x) 0.10 0.20 0.30 0.25 0.15
The correct answer is E. The mean of the probability distribution is 2.15, as defined by the following
equation.
E(X) = Σ [ xi * P(xi) ]
E(X) = 0*0.10 + 1*0.20 + 2*0.30 + 3*0.25 + 4*0.15 = 2.15
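A one-line check of this arithmetic (a hedged Python sketch using the table values above):

x = [0, 1, 2, 3, 4]
p = [0.10, 0.20, 0.30, 0.25, 0.15]
print(sum(xi * pi for xi, pi in zip(x, p)))   # 2.15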
Problem
The number of adults living in homes on a randomly selected city block is described by the following
probability distribution.
Number of adults, x 1 2 3 4
Probability, P(x) 0.25 0.50 0.15 0.10
Solution
The correct answer is D. The solution has three parts. First, find the expected value; then, find the variance;
then, find the standard deviation. Computations are shown below, beginning with the expected value.
E(X) = Σ [ xi * P(xi) ]
E(X) = 1*0.25 + 2*0.50 + 3*0.15 + 4*0.10 = 2.10
σ² = Σ { [ xi - E(x) ]² * P(xi) }
σ² = (1 - 2.1)² * 0.25 + (2 - 2.1)² * 0.50 + (3 - 2.1)² * 0.15 + (4 - 2.1)² * 0.10
σ² = (1.21 * 0.25) + (0.01 * 0.50) + (0.81 * 0.15) + (3.61 * 0.10) = 0.3025 + 0.0050 + 0.1215 + 0.3610 = 0.79
And finally, the standard deviation is equal to the square root of the variance; so the standard deviation is
sqrt(0.79) or 0.889.
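The same computation can be checked with a short Python sketch (values taken from the table above):

import math

x = [1, 2, 3, 4]
p = [0.25, 0.50, 0.15, 0.10]

mean = sum(xi * pi for xi, pi in zip(x, p))                # expected value: 2.10
var = sum((xi - mean) ** 2 * pi for xi, pi in zip(x, p))   # variance: 0.79
print(mean, var, math.sqrt(var))                           # 2.1, 0.79, about 0.889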
Problem
The table on the left shows the joint probability distribution between two random variables - X and Y; and
the table on the right shows the joint probability distribution between two random variables - A and B.
          X                              A
      0     1     2                  0     1     2
Y  3  0.1   0.2   0.2         B  3   0.1   0.2   0.2
Solution
The correct answer is A. The solution requires several computations to test the independence of random
variables. Those computations are shown below.
X and Y are independent if P(x|y) = P(x), for all values of X and Y. From the probability distribution table,
we know the following:
Thus, P(x|y) = P(x), for all values of X and Y, which means that X and Y are independent. We repeat the
same analysis to test the independence of A and B.
Thus, P(a|b) is not equal to P(a), for all values of A and B. For example, P(a=0) = 0.3; but P(a=0 | b=3) = 0.2.
This means that A and B are not independent.
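The independence test can be automated; the hedged Python sketch below uses the fully tabulated X-Y joint PMF from the next problem (the A-B table is not reproduced in full here, so only X and Y are checked).

# Joint PMF cells (x, y) -> probability, from the X-Y table below.
joint = {
    (0, 3): 0.1, (1, 3): 0.2, (2, 3): 0.2,
    (0, 4): 0.1, (1, 4): 0.2, (2, 4): 0.2,
}

px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1, 2)}   # marginal of X
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (3, 4)}      # marginal of Y

# X and Y are independent iff P(X=x, Y=y) = P(X=x) P(Y=y) for every cell.
print(all(abs(joint[(x, y)] - px[x] * py[y]) < 1e-12 for (x, y) in joint))       # True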
Problem
          X
      0     1     2
Y  3  0.1   0.2   0.2
   4  0.1   0.2   0.2
The table on the right shows the joint probability distribution between two random variables - X and Y. (In a
joint probability distribution table, numbers in the cells of the table represent the probability that particular
values of X and Y occur together.)
Solution
E(X) = Σ [ xi * P(xi) ]
E(X) = 0 * (0.1 + 0.1) + 1 * (0.2 + 0.2) + 2 * (0.2 + 0.2) = 0 + 0.4 + 0.8 = 1.2
E(Y) = Σ [ yi * P(yi) ]
E(Y) = 3 * (0.1 + 0.2 + 0.2) + 4 * (0.1 + 0.2 + 0.2) = (3 * 0.5) + (4 * 0.5) = 1.5 + 2 = 3.5
And finally, the mean of the sum of X and Y is equal to the sum of the means. Therefore,
Problem
Suppose X and Y are independent random variables. The variance of X is equal to 16; and the variance of Y
is equal to 9. Let Z = X - Y.
Solution
The correct answer is B. The solution requires us to recognize that Variable Z is a combination of two
independent random variables. As such, the variance of Z is equal to the variance of X plus the variance of
Y.
The standard deviation of Z is equal to the square root of the variance. Therefore, the standard deviation is
equal to the square root of 25, which is 5.
Problem
The average salary for an employee at Acme Corporation is $30,000 per year. This year, management
awards the following bonuses to every employee.
Solution
The correct answer is C. To compute the bonus, management applies the following linear transformation to each employee's salary.
Y = mX + b
Y = 0.10 * X + 500
where Y is the transformed variable (the bonus), X is the original variable (the salary), m is the multiplicative
constant 0.10, and b is the additive constant 500.
Since we know that the mean salary is $30,000, we can compute the mean bonus from the following
equation.
Y = mX + b
Y = 0.10 * $30,000 + $500 = $3,500
10. Questions
Objective Questions
a. A[ ]
b. E[ ]
c. D[ ]
d. Z[ ]
a. E[X2]
b. E2[X]
c. [E[X2]]2
d. E[X2]
a. 0.2
b. 0.5
c. 0.7
Short Questions
1. Write the formula to express the pdf of an RV which is a function of another RV.
3. Write the formula to find the expected value of a continuous and of a discrete RV.
5. Define the variance of an RV.
University Questions
Dec 2012
Q. Explain MGF of discrete random variable and continuous random variable in detail
May 2012
Q. Prove that if two random variables are independent, then density function of their sum is given by
convolution of their density functions.
Dec 2011
May 2011
fY(y) = (1/|a|) fX((y - b)/a)
Q. Let X be a continuous random variable with uniform pdf in (0, 2π). Find probability density function of Y=
cos X
Dec 2010
Q. In medical imaging such as computed tomography, the relation between the detector reading Y and the body absorptivity X follows the law Y = e^X. Let X be N(µ, σ²). Compute the pdf of Y.
f(x) = (m/2) e^(-m|x|), -∞ < x < ∞
Dec 2009
Q. How is the characteristic function Фx(w) of a random variable X defined? Show that Фx(w) can be expressed as
Фx(w) = Σn (j^n w^n / n!) mn, where mn = (1/j^n) [d^n Фx(w) / dw^n] evaluated at w = 0 is the nth order moment of r.v. X
Q. If the probability density function of X is fX(x) = e^(-x) for x > 0, then find the probability density function of Y = X^3.
Dec 2008
fY(y) = (1/|a|) fX((y - b)/a)
May 2007
fY(y) = (1/|a|) fX((y - b)/a)
(b) If a random variable X has uniform distribution in (-2, 2), find the probability density function fy(y) of Y=
3X + 2.
Dec 2006
Q. How is the characteristic function Фx(w) of a random variable X defined? Show that Фx(w) can be expressed as -
Фx(w) = Σn (j^n w^n / n!) mn, where mn = (1/j^n) [d^n Фx(w) / dw^n] evaluated at w = 0 is the nth order moment of r.v. X
June 2005
Q. How is the characteristic function Фx(w) of a random variable X defined? Show that Фx(w) can be expressed as -
Фx(w) = Σn (j^n w^n / n!) mn, where mn = (1/j^n) [d^n Фx(w) / dw^n] evaluated at w = 0 is the nth order moment of r.v. X
Dec 2005
(ii) If X is a random variable and f(x) is given by f(x) = (1/b) e^(-(x-a)/b), find the first and second moments of X.
1. Motivation
When we have a random variable which is a function of another random variable, and we know the statistics of one of them, we can obtain the statistics of the unknown random variable in terms of the known one.
2. Syllabus
independent, uncorrelated
and orthogonal random 1 hour 1 hour
variables.
3. Books Recommended
5. Objective
In this chapter we study a few basic concepts of functions of random variables and investigate the expected
value of a certain function of a random variable. The techniques of moment generating functions and
characteristic functions, which are very useful in some applications, are presented.
6. Key Notation:
7. Key Definitions
8. Key Relations
8. The function of RV
9. Expected value of RV
12. Variance
We may define two or more random variables on the same sample space. Let and be two real
random variables defined on the same probability space The mapping such that for
• The above figure illustrates the mapping corresponding to a joint random variable. The joint random
• We can extend the above definition to define joint random variables of any dimension. The mapping
Example 1 Suppose we are interested in studying the height and weight of the students in a class. We can define the joint RV where represents the height and represents the weight.
Example 2 Suppose in a communication system is the transmitted signal and is the corresponding received signal.
Recall the definition of the distribution of a single random variable. The event was used to
define the probability distribution function . Given , we can find the probability of any event
involving the random variable. Similarly, for two random variables and , the event is used to define the joint cumulative distribution function (CDF) of the random variables and , denoted by .
Figure 2
Note that
•
• is right continuous in both the variables.
Figure 4
To prove this
Similarly .
Example 3
(a)
(b)
If and are two discrete random variables defined on the same probability space such
that takes values from the countable subset and takes values from the countable subset .Then
the joint random variable can take values from the countable subset in . The joint random
Given , we can determine other probabilities involving the random variables and
This is because
• Marginal Probability Mass Functions: The probability mass functions and are obtained
from the joint probability mass function as follows
and similarly
These probability mass functions and obtained from the joint probability mass functions
are called marginal probability mass functions .
Example 4 Consider the random variables and with the joint probability mass function as tabulated in
Table 1. The marginal probabilities and are as shown in the last column and the last row
Table 1
If and are two continuous random variables and their joint distribution function is continuous
provided it exists.
Clearly
The marginal density functions and of two joint RVs and are given by the
derivatives of the corresponding marginal distribution functions. Thus
Remark
• The marginal CDF and pdf are the same as the CDF and pdf of the concerned single random variable. The term "marginal" simply refers to the fact that it is derived from the corresponding joint distribution or density function of two or more joint random variables.
• With the help of the two-dimensional Dirac Delta function, we can define the joint pdf of two discrete
jointly random variables. Thus for discrete jointly random variables and .
Example 6 The joint pdf of two random variables and are given by
• Find .
• Find .
• Find and .
We discussed the conditional CDF and conditional PDF of a random variable conditioned on some events
defined in terms of the same random variable. We observed that
and
Suppose and are two discrete jointly random variable with the joint PMF The conditional
• The conditional PMF satisfies the properties of the probability mass functions.
• From the definition of conditional probability mass functions, we can define two independent random
variables. Two discrete random variables X and Y are said to be independent if and only if
Suppose and are two discrete jointly random variables. Given and we can
Example 1 Consider the random variables and with the joint probability mass function as presented in
the
following table
Consider two continuous jointly random variables and with the joint probability distribution
function We are interested to find the conditional distribution function of one of the random
variables on the condition of a particular value of the other random variable.
We cannot define the conditional distribution function of the random variable on the condition of the
as in the above expression. The conditional distribution function is defined in the limiting
sense as follows:
Because,
Similarly we have
• (4)
Example 2 X and Y are two jointly random variables with the joint pdf given by
find,
(a)
(b)
(c)
Solution:
Since
we get
Given the marginal density function and the conditional density , we can apply the Bayes'
rule for two continuous joint random variables to get as follows. Recall that
In context of the above Bayes rule, is called the a priori density function and is called
the a posteriori density function.
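For reference, the standard form of this Bayes rule for two jointly continuous random variables (written out explicitly here, since the displayed formula is missing from the extracted text) is:

\[
f_{X|Y}(x \mid y) \;=\; \frac{f_{Y|X}(y \mid x)\, f_X(x)}{f_Y(y)}
\;=\; \frac{f_{Y|X}(y \mid x)\, f_X(x)}{\int_{-\infty}^{\infty} f_{Y|X}(y \mid x')\, f_X(x')\, dx'}
\]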
Example 3 For random variables X and Y, the joint probability density function is given by
and
Therefore,
and
variable defined on the same sample space with the conditional probability density function In
practical problems we may have to estimate the conditional PMF of given the observed value We
can define this conditional PMF also in the limiting sense
Example 4
Let and be two random variables characterized by the joint distribution function
and equivalently
We are often interested in finding out the probability density function of a function of two or more RVs.
Following are a few examples.
• The received signal by a communication receiver is given by
where is received signal which is the superposition of the message signal and the noise .
The frequently applied operations on communication signals like modulation, demodulation, correlation
etc. involve multiplication of two signals in the form Z = XY.
We have to know about the probability distribution of in any analysis of . More formally, given two
random variables X and Y with joint probability density function and a function we
have to find .
Consider the event corresponding to each z. We can find a variable subset such that
Consider Figure 2
We have
Example 1
Suppose X and Y are independent random variables, each uniformly distributed over (a, b), and
We have
and
Suppose X and Y are independent zero mean Gaussian random variable with unity standard deviation and
. Then
Here
Suppose X and Y are two independent Gaussian random variables each with mean 0 and variance and
.Then
Suppose X and Y are independent Gaussian variables with non-zero mean and respectively and
constant variance. We have to find the joint density function of the random variable .
Here
Suppose Then
The Rician density is used to model the envelope of a sinusoid plus a narrow-band Gaussian noise.
We consider the transformation We have to find out the joint probability density
Let us see how the corners of the differential region are mapped to the plane. Observe that
Therefore,
Further, it can be shown that the absolute values of the Jacobians of the forward and the inverse transforms are inverses of each other, so that
corresponding to roots. The inverse mapping is illustrated in the following Figure 2 for . As these parallelograms are non-overlapping,
Example 2 Suppose X and Y are two independent Gaussian random variables each with mean 0 and
Solution:
and (2)
From (1)
and
From (2)
Recall that
where
Note that
As is varied over the entire axis, the corresponding (non-overlapping) differential regions in
plane cover the entire plane.
Thus,
Example 2 If
Proof:
Example 3
Consider the discrete random variables discussed .The joint probability mass function of the
(1) We have earlier shown that expectation is a linear operator. We can generally write
Thus
Just like the moments of a random variable provide a summary description of the random variable, so
also the joint moments provide summary description of two random variables. For two continuous random
where and
Remark
(1) If are discrete random variables, the joint expectation of order and is defined as
(2) If and , we have the second-order moment of the random variables given by
We will also show that To establish the relation, we prove the following result:
Non-negativity of the left-hand side implies that its minimum also must be nonnegative.
Now
Thus
Then
• Two random variables may be dependent, but still they may be uncorrelated. If there exists
correlation between two random variables, one may be represented as a linear regression of the other. We
will discuss this point in the next section.
Linear Regression of on
we have,
Solving for ,
so that
Thus is the linear regression of the random variable Y on the random variable X. The linear regression approximates a random variable in terms of another random variable by means of a straight-line fit.
Remark
Note that independence implies uncorrelatedness. But uncorrelatedness does not generally imply independence (except for jointly Gaussian random variables).
Example 4
Because
Many practically occurring random variables are modeled as jointly Gaussian random variables. For example, noise samples at different instants in a communication system are modeled as jointly Gaussian random variables.
Two random variables are called jointly Gaussian if their joint probability density function is
means
variances
correlation coefficient
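For reference, the standard bivariate Gaussian density with these parameters (means µX, µY, variances σX², σY², correlation coefficient ρ), written out here since the displayed formula is missing from the extracted text, is:

\[
f_{X,Y}(x,y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1-\rho^2}}
\exp\!\left\{ -\frac{1}{2(1-\rho^2)}\left[ \frac{(x-\mu_X)^2}{\sigma_X^2}
 - \frac{2\rho\,(x-\mu_X)(y-\mu_Y)}{\sigma_X \sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2} \right] \right\}
\]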
We denote the jointly Gaussian random variables and with these parameters as
The joint pdf has a bell shape centred at as shown in the Figure 1 below. The variances
determine the spread of the pdf surface and determines the orientation of the surface in the
(1) If and are jointly Gaussian, then and are both Gaussian.
Similarly
(2) The converse of the above result is not true. If each of and is Gaussian, and are not
necessarily jointly Gaussian. Suppose
and
Similarly,
(3) If and are jointly Gaussian, then for any constants and ,the random variable given by
Example 1 Suppose X and Y are two jointly-Gaussian 0-mean random variables with variances of 1 and 4
We have
If and are discrete random variables, we can define the joint characteristic function in terms of the
joint probability mass function as follows:
The joint characteristic function has properties similar to those of the characteristic function of a single random variable. We can easily establish the following properties:
1.
2.
3. If and are independent random variables, then
4. We have,
Example 2 The joint characteristic function of the jointly Gaussian random variables and with the
joint pdf
We can use the joint characteristic functions to simplify the probabilistic analysis, as illustrated below:
Suppose then
Thus the linear transformation of two Gaussian random variables is a Gaussian random variable.
Conditional Expectation
Recall that
is given by
Clearly, denotes the centre of mass of the conditional pdf or the conditional pmf, as shown in Figure 1.
Remark
Consider the discrete random variables discussed in Example 4 in lecture 18. The joint probability mass function of the random variables is tabulated in the table. Find the joint expectation of
Example 2
Suppose are jointly uniform random variables with the joint probability density function
given by
Find
Figure 2
We have
Example 3
Find .
We have,
and
Proof :
and similarly
and
Consider two random variables with joint pdf . Suppose is observable and
some a priori information about is available in a sense that some values of are more likely. We can
represent this prior information in the form of a prior density function . We have to estimate for
Clearly
Suppose the optimum estimator is a function of the random variable such that it minimizes the
Since is always positive, the above integral will be minimum if the inner integral is minimum. This
results in the problem :
Suppose are two jointly Gaussian random variables considered in the earlier example. We
If and
then
Thus the MMSE estimator for two zero-mean jointly Gaussian random variables is linearly related to the data . This result plays an important role in the optimal filtering of random signals.
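For reference, the usual closed form of this estimator for jointly Gaussian variables (a sketch, with X estimated from the observation Y = y and the standard parameter notation) is the conditional mean, which is linear in the observation:

\[
\hat{x}_{\mathrm{MMSE}}(y) = E[X \mid Y = y] = \mu_X + \rho\,\frac{\sigma_X}{\sigma_Y}\,(y - \mu_Y),
\qquad \text{so for zero-mean variables } \hat{x}_{\mathrm{MMSE}}(y) = \rho\,\frac{\sigma_X}{\sigma_Y}\, y .
\]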
Markov Inequality
Let us first take a look at the Markov Inequality. Even though the statement looks very simple, clever
application of the inequality is at the heart of more powerful inequalities like Chebyshev or Chernoff.
Initially, we will see the simplest version of the inequality and then we will discuss the more general
version. The basic Markov inequality states that given a random variable X that can only take non-negative values, then P(X >= a) <= (1/a) E[X] for every a > 0.
In the equation above, I separated the fraction 1/a because that is the only varying part. We will later see that for Chebyshev we get a similar fraction. The proof of this inequality is straightforward. There are multiple proofs, even though we will use the following proof as it allows us to show the Markov inequality graphically. This proof is partly taken from Mitzenmacher and Upfal's exceptional book on Randomized Algorithms.
Consider a constant a > 0. Then define an indicator random variable I which takes the value 1 if X >= a, i.e. I = 1 when X >= a and I = 0 otherwise.
Now we make a clever observation. We know that X is non-negative, i.e. X >= 0. This means that the fraction X/a is at least 0 and at most infinity. Also, if X < a, then I = 0 <= X/a. When X >= a, X/a >= 1 = I. Using these facts, I <= X/a always, and taking expectations gives E[I] <= E[X]/a.
But we also know that the expectation of indicator random variable is also the probability that it takes the
value 1. This means E[I] = Pr(X>=a). Putting it all together, we get the Markov inequality.
This is a very powerful technique. Careful selection of f(X) allows you to derive more powerful bounds.
(1) One of the simplest examples is f(X) = |X|, which guarantees f(X) to be non-negative.
(2) Later we will show that the Chebyshev inequality is nothing but the Markov inequality that uses f(X) = (X - E[X])².
(3) Under some additional constraints, the Chernoff inequality uses f(X) = e^(tX).
Simple Examples
Of course we can estimate a finer value using the Binomial distribution, but the core idea here is that we do
not need to know it !
Example 2:
For an example where the Markov inequality gives a bad result, let us take the example of a die. Let X be the face that shows up when we toss it. We know that E[X] = 7/2 = 3.5. Now let's say we want to find the probability that X >= 5. By the Markov inequality, P(X >= 5) <= 3.5/5 = 0.7.
The actual answer of course is 2/6, so the bound is quite far off. This becomes even more bizarre if, for example, we look at P(X >= 3). By the Markov inequality, P(X >= 3) <= 3.5/3, which is about 1.17.
The upper bound is greater than 1! Of course, using the axioms of probability we can cap it at 1, while the actual probability is closer to 0.66. You can play around with the coin example or the score example to find cases where the Markov inequality provides really weak results.
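These numbers are easy to verify; the hedged Python sketch below recomputes the Markov bounds and the exact probabilities for the fair-die example above.

from fractions import Fraction

faces = range(1, 7)
EX = Fraction(sum(faces), 6)                       # E[X] = 7/2

markov_bound = lambda a: EX / a                    # P(X >= a) <= E[X] / a
actual = lambda a: Fraction(sum(1 for f in faces if f >= a), 6)

for a in (5, 3):
    print(a, float(markov_bound(a)), float(actual(a)))
# a = 5: bound 0.7,  actual 1/3
# a = 3: bound ~1.17 (vacuous, exceeds 1),  actual 2/3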
Tightness of Markov
The last example might have made you think that the Markov inequality is useless. On the contrary, it provided a weak bound because the amount of information we gave it is limited. All we provided were that the variable is non-negative and that the expected value is known and finite. In this section, we will show that it is indeed tight - that is, the Markov inequality is already doing as much as it can.
From the previous example, we can see a case where the Markov inequality is tight. If the mean score of 100 students is 20 and if 50 students got a score of exactly 0, then Markov implies that at most 50 students can get a score of at least 40.
Chebyshev Inequality
Chebyshev inequality is another powerful tool that we can use. In this inequality, we remove the restriction that the random variable has to be non-negative. As a price, we now need to know additional information about the variable - its (finite) expected value and (finite) variance. In contrast to Markov, Chebyshev allows you to estimate the deviation of the random variable from its mean. A common use of it estimates the probability of the deviation from the mean in terms of the standard deviation.
Similar to the Markov inequality, we can state two variants of Chebyshev. Let us first take a look at the simplest version. Given a random variable X with finite mean µ and variance σ², we can bound the deviation as P(|X - µ| >= a) <= σ²/a².
We used the Markov inequality in the second line and used the fact that |X - µ| >= a if and only if (X - µ)² >= a².
It should be intuitive to note that the more information we get, the tighter the bound is. For Markov we got 1/a as the fraction. It was 1/a² for the second-order Chebyshev and 1/a^k for the k-th order Chebyshev inequality.
If we want to find the probability that the variable deviates from the mean by a constant C, the bound provided by Chebyshev is σ²/C², which is tight!
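The hedged Python sketch below compares the Chebyshev bound with the empirical deviation probability; the Exp(1) distribution (mean 1, variance 1) is only an assumed example, not one taken from the text.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=500_000)    # assumed example: Exp(1), so mu = sigma = 1
mu, sigma = 1.0, 1.0

for k in (1.5, 2.0, 3.0):
    bound = 1.0 / k ** 2                              # Chebyshev: P(|X - mu| >= k sigma) <= 1/k^2
    empirical = np.mean(np.abs(x - mu) >= k * sigma)  # observed frequency
    print(k, bound, empirical)                        # the bound always dominates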
10. Questions
Objective Questions
a. A[ ]
b. E[ ]
c. D[ ]
d. Z[ ]
a. E[X2]
b. E2[X]
c. [E[X2]]2
d. E[X2]
a. A E[X] + B E[Y]
b. AX + B E[Y]
c. A E[X] + BY
d. A E[X + BY]
a. 0.2
b. 0.5
c. 0.7
d. 0.9
a. E[(X-µX)(Y-µY)]
b. E[X-µY]
c. E[Y-µX]
d. E[µX-µY]
Short Questions
1. Write the formula to express the pdf of RV which is function of another RV.
3. Write the formula to find the expected value of continuous & discrete RV
5. Define variance of RV
Long Questions
=0 otherwise
(ii) Independent.
= 0 otherwise
5. Suppose X and Y are two random variables. Define covariance and correlation coefficient of X and Y.
When do we say that X and Y are
(i) Orthogonal
(ii) independent
(iii) Uncorrelated?
(iv) Are uncorrelated random variables independent?
University Questions
Dec 2012
Q. If X and Y are two independent random variable with identical uniform distribution in (0,1) find
probability density function of (U,V) where U= X + Y and V=X-Y. Are U, V independent ?
May 2012
Q. The joint density function of two dimensional random variable (X,Y) is given by
Q. Suppose X and Y are continuous random variable with joint probability density function
= 0 elsewhere
Q. Prove that if two random variables are independent, then density function of their sum is given by
convolution of their density functions.
Dec 2011
=0 otherwise
Q. Suppose X and Y are two random variables. Define covariance and correlation coefficient of X and Y.
When do we say that X and Y are (i) Orthogonal (ii) independent and (iii) uncorrelated? Are uncorrelated
random variables independent?
May 2011
Q. The joint density function of two dimensional random variable (X,Y) is given by
Dec 2010
for the values of x and y for which (x,y) lies within the triangle as shown
Find (i) C
(ii) fX(x)
(iii) fY(y)
May 2010
fX,Y(x,y) = C e^(-x) e^(-y), 0 < x < y < ∞
= 0 otherwise
(ii) fX(x)
(iii) FY(y)
(iv) FX(x/y)
(v) FY(y/x)
(vi) EY(y/x)
(vii) EX(x/y)
Q. If fX,Y(x,y) = 2 e^(-x) e^(-y), 0 < x < y < ∞
= 0 otherwise
f(x) = (m/2) e^(-m|x|), -∞ < x < ∞
Dec 2009
=0 otherwise
Q. If X and Y are two independent exponential random variables with probability density functions given by
= 0, x < 0 and
= 0, y<0
June 2005
FY(y | x) = lim(h -> 0) FY(y | x < X <= x + h), and applying Bayes' rule it can be written as
Find the characteristic function Фx(w) and hence determine the expected value of X.
May 2007
Q. The joint probability density function of the two-dimensional random variable (X, Y) is given by fXY(x, y) = 4xy e^(-(x² + y²)), x > 0, y > 0
(ii) Find the conditional density function of Y given that X=x and the conditional density
Q. Prove that if two random variables are independent, then density function of their sum is given by
convolution of their density functions.
Q. If X and Y are two independent exponential random variables with probability density functions given by
= 0, x < 0 and
= 0, y < 0
If U = X/Y and V = Y, find the joint probability density function of (U, V) and hence find the probability density function of U.
Q. If X and Y are two random variables with standard deviations σX and σY, and if CXY is the covariance between them, then prove .
Dec 2006
(a) Suppose X and Y are two random variables. Define the covariance and correlation coefficient of X and Y. When do we say that X and Y are
(i) Orthogonal (ii) independent and (iii) uncorrelated? Are uncorrelated random variables independent?
(b) Suppose that X and Y are continuous random variables with joint
= 0 elsewhere
If X and Y are independent Binomial random variables with parameters (m, p) and (n, p) respectively, obtain the distribution of X + Y.
= 0 otherwise
June 2006
(a) Suppose X and Y are two random variables. Define the covariance and correlation coefficient of X and Y. When do we say that X and Y are (i) orthogonal (ii) independent and (iii) uncorrelated?
(b) Suppose X and Y are continuous random variables with joint probability density function.
= 0 elsewhere
(a) Suppose X and Y are independent random variables and each is exponentially distributed with common parameter λ. That is, fX(x) = λ e^(-λx) and fY(y) = λ e^(-λy). Let the random variables U and V be given by U = X + Y and V = X - Y; obtain the joint density of U and V and the marginal density of U.
Dec 2005
(b) If X and Y are independent random variables and Z = X + Y, find f(z) by the transform method.
June 2005
=0 otherwise
Find (i) fX(x) (ii) FX(x) (iii) FY(y) (iv) FX(x/y) (v) fY(y/x) (vi) E(Y/x) (vii) E(X/y)
(i) Derive an expression for their joint moment at the origin. Why is it called correlation?
(ii) Derive an expression for their joint central moment. Why is it called covariance? Explain.
(iii) Derive an expression for their normalized covariance. Why is it called the covariance coefficient? Explain. What is its physical significance? What is its range of values?
(iv) Explain when X and Y are orthogonal, when they are independent and when they are uncorrelated.
=0 otherwise
Dec 2004
(a) Let X and Y be two continuous random variables. What is meant by their correlation function RXY? Derive an expression for RXY given their joint density function fXY(x, y). What happens to RXY?
(b) What is meant by the covariance function CXY of the random variables X and Y? Write an expression for the covariance, given that the mean values of X and Y are μX and μY respectively. Under which conditions is CXY positive? Under which conditions is CXY negative? What is the normalized covariance or covariance coefficient? Write the expression for ρ and its range of values.
(c) Let X and Y be two continuous random variables with means equal to 7/12 and 7/12 respectively and variances equal to 11/44 and 11/44 respectively, and their joint probability density function
=0 otherwise
Q. Define the characteristic functions Φx(w) and Φy(w) of the continuous random variables X and Y
respectively and find the probability density function fz(Z), given Z = X + Y.
Motivation
When we have a random variable which is a function of another random variable, and we know the statistics of one of them, we can obtain the statistics of the unknown random variable in terms of the known one.
Syllabus
Sr. No.   Topic                                        Fine Detailing                                                                      No. of Hours   Self Study
01        Stochastic Convergence and limit theorems    Sequence of random variables; convergence everywhere, almost everywhere; comparison of convergence modes    1 hour         1 hour
Books Recommended
5. Objective
In this chapter the laws of large numbers and the central limit theorem, which is one of the most remarkable results in probability theory, are discussed.
6. Key Notation
σ Standard Deviation
σ 2, var[ ] Variance
1. The weak law of large numbers states that the sample average converges in probability towards
the expected value
How closely does represent the true mean as n is increased? How do we measure the
The Cauchy criterion gives the condition for convergence of a sequence without actually finding the
limit. The sequence converges if and only if, for every there exists a positive
Convergence of a random sequence cannot be defined as above. Note that for each
1. Convergence Everywhere
Note here that the sequence of numbers for each sample point is convergent.
A random sequence may not converge for every . Consider the event
If are independent and identically distributed random variables with a finite mean
Remark:
The strong law of large numbers states that the sample mean converges to the true mean as the
sample size increases.
The SLLN is one of the fundamental theorems of probability. There is a weaker version of the law
that we will discuss later.
The following Cauchy criterion gives the condition for m.s. convergence of a random sequence
without actually finding the limit. The sequence converges in m.s. if and only if ,
Now,
4. Convergence in probability
for every .
[ Markov Inequality ]
Example 2
Clearly,
Therefore
Thus the above sequence converges to a constant in probability.
Remark:
Convergence in probability is also called stochastic convergence.
Suppose are independent and identically distributed random variables, with sample
mean
5. Convergence in distribution
are the distribution functions of and respectively. The sequence is said to converge to in
distribution if
for all x at which is continuous. Here the two distribution functions eventually coincide. We write
Example 3 Suppose is a sequence of RVs with each random variable having the
uniform density
clearly,
Consider independent random variables .The mean and variance of each of the
2. The random variables are independent with same mean and variance, but not
identically distributed.
3. The random variables are independent with different means and same variance and
not identically distributed.
4. The random variables are independent with different means and each variance
being neither too small nor too large.
We shall consider the first condition only. In this case, the central-limit theorem can be stated as follows:
each with mean and variance and Then, the sequence { } converges in
distribution to a Gaussian random variable with mean 0 and variance . That is,
The central limit theorem is really a property of convolution. Consider the sum of two statistically independent random variables. This can also be shown with the help of the characteristic functions as follows:
We can illustrate the CLT by convolving two uniform distributions repeatedly. In Figure 1, the
convolution of two uniform distributions gives a triangular distribution. Further convolution gives a
parabolic distribution and so on.
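A hedged Python sketch of this convolution picture (the grid spacing and the number of convolutions are arbitrary choices, not values from the text):

import numpy as np

dx = 0.001
x = np.arange(0.0, 1.0, dx)
f = np.ones_like(x)                      # Uniform(0,1) density sampled on a grid

g = f.copy()
for n in range(2, 5):
    g = np.convolve(g, f) * dx           # density of the sum of n independent uniforms
    print(n, g.max())                    # n=2 triangular, n=3 parabolic pieces, then increasingly bell-shaped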
We give a less rigorous proof of the theorem with the help of the characteristic function. Further we
Clearly,
We will show that as the characteristic function is of the form of the characteristic function of a
Gaussian random variable.
Substituting
Note also that each term in involves a ratio of a higher moment and a power of and therefore,
which is the characteristic function of a Gaussian random variable with 0 mean and variance .
Remark
1. Under the conditions of the CLT, the sample mean converges in distribution to
In other words, if samples are taken from any distribution with mean and variance
, as the sample size increases, the distribution function of the sample mean approaches to the
distribution function of a Gaussian random variable.
2. The CLT states that the distribution function converges to a Gaussian distribution function.
The theorem does not say that the pdf is a Gaussian pdf in the limit. For example, suppose
each has a Bernoulli distribution. Then the pdf of Y consists of impulses and can never approach a Gaussian pdf.
3. The Cauchy distribution does not meet the conditions for the central limit theorem to hold. As we have noted earlier, this distribution does not have a finite mean or a finite variance. Thus the sum of a large number of Cauchy random variables will not follow a Gaussian distribution.
1. The central limit theorem is one of the most widely used results of probability. If a random variable is the result of several independent causes, then the random variable can be considered to be Gaussian.
For example,
1. the thermal noise in a resistor is the result of the independent motion of billions of electrons and is modelled as Gaussian.
2. the observation error / measurement error of any process is modelled as Gaussian.
2. The CLT can be used to simulate a Gaussian distribution given a routine to simulate a particular
random variable.
3. Normal approximation of the Binomial distribution
4. One of the applications of the CLT is in approximation of the Binomial coefficients. We have already
Thus,
or,
In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing
the same experiment a large number of times. According to the law, the average of the results obtained
from a large number of trials should be close to the expected value, and will tend to become closer as more
trials are performed.
The LLN is important because it "guarantees" stable long-term results for the averages of some random
events. For example, while a casino may lose money in a single spin of the roulette wheel, its earnings will
tend towards a predictable percentage over a large number of spins. Any winning streak by a player will
eventually be overcome by the parameters of the game. It is important to remember that the LLN only
applies (as the name indicates) when a large number of observations are considered. There is no principle
that a small number of observations will coincide with the expected value or that a streak of one value will
immediately be "balanced" by the others.
Examples
For example, a single roll of a fair, six-sided die produces one of the numbers 1, 2, 3, 4, 5, or 6, each with equal probability. Therefore, the expected value of a single die roll is (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.
It follows from the law of large numbers that the empirical probability of success in a series of Bernoulli
trials will converge to the theoretical probability. For a Bernoulli random variable, the expected value is the
theoretical probability of success, and the average of n such variables (assuming they are independent and
identically distributed (i.i.d.)) is precisely the relative frequency.
For example, a fair coin toss is a Bernoulli trial. When a fair coin is flipped once, the theoretical probability
that the outcome will be heads is equal to 1/2. Therefore, according to the law of large numbers, the
proportion of heads in a "large" number of coin flips "should be" roughly 1/2. In particular, the proportion
of heads after n flips will almost surely converge to 1/2 as n approaches infinity.
Though the proportion of heads (and tails) approaches 1/2, almost surely the absolute (nominal) difference
in the number of heads and tails will become large as the number of flips becomes large. That is, the
probability that the absolute difference is a small number, approaches zero as the number of flips becomes
large. Also, almost surely the ratio of the absolute difference to the number of flips will approach zero.
Intuitively, expected absolute difference grows, but at a slower rate than the number of flips, as the
number of flips grows.
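A hedged Python sketch of both effects (the proportion of heads settling near 1/2 while the absolute head-tail difference grows):

import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=1_000_000)     # 1 = heads, 0 = tails

for n in (100, 10_000, 1_000_000):
    heads = int(flips[:n].sum())
    print(n, heads / n, abs(2 * heads - n))    # proportion of heads, |#heads - #tails|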
Forms
Two different versions of the law of large numbers are described below; they are called the strong law of
large numbers, and the weak law of large numbers. Both versions of the law state that – with virtual
certainty – the sample average
where X1, X2, ... is an infinite sequence of i.i.d. Lebesgue integrable random variables with expected value
E(X1) = E(X2) = ...= µ. Lebesgue integrability of Xj means that the expected value E(Xj) exists according to
Lebesgue integration and is finite.
An assumption of finite variance Var(X1) = Var(X2) = ... = σ2 < ∞ is not necessary. Large or infinite variance
will make the convergence slower, but the LLN holds anyway. This assumption is often used because it
makes the proofs easier and shorter.
The difference between the strong and the weak version is concerned with the mode of convergence being
asserted. For interpretation of these modes, see Convergence of random variables.
The weak law of large numbers (also called Khintchine's law) states that the sample average converges in probability towards the expected value.[6]
Interpreting this result, the weak law essentially states that for any nonzero margin specified, no matter
how small, with a sufficiently large sample there will be a very high probability that the average of the
observations will be close to the expected value; that is, within the margin.
Convergence in probability is also called weak convergence of random variables. This version is called the
weak law because random variables may converge weakly (in probability) as above without converging
strongly (almost surely) as below.
Strong law
The strong law of large numbers states that the sample average converges almost surely to the expected
value[7]
That is,
The proof is more complex than that of the weak law.[8] This law justifies the intuitive interpretation of the
expected value (for Lebesgue integration only) of a random variable when sampled repeatedly as the "long-
term average".
Almost sure convergence is also called strong convergence of random variables. This version is called the
strong law because random variables which converge strongly (almost surely) are guaranteed to converge
weakly (in probability). The strong law implies the weak law but not vice versa: when the strong law conditions hold, the variable converges both strongly (almost surely) and weakly (in probability). However, the weak law may hold in conditions where the strong law does not hold, and then the convergence is only weak (in probability).
There are different views among mathematicians whether the two laws could be unified to one law,
thereby replacing the weak law.[9]
The strong law of large numbers can itself be seen as a special case of the pointwise ergodic theorem.
Moreover, if the summands are independent but not identically distributed, then
The weak law states that for a specified large n, the average is likely to be near μ. Thus, it leaves open the possibility that a deviation of the sample average from μ by more than ε happens an infinite number of times, although at infrequent intervals.
The strong law shows that this almost surely will not occur. In particular, it implies that with probability 1, for any ε > 0 the deviation of the sample average from μ stays below ε for all large enough n.
A Central Limit Theorem word problem will most likely contain the phrase “assume the variable is normally
distributed”, or one like it. With these central limit theorem examples, you will be given:
A population (i.e. 29-year-old males, seniors between 72 and 76, all registered vehicles, all cat owners)
General Steps
Step 1: Identify the parts of the problem. Your question should state:
a number associated with “greater than” ( ). Note: this is the sample mean. In other words, the problem
is asking you “What is the probability that a sample mean of x items will be greater than this number?
Step 2: Draw a graph. Label the center with the mean. Shade the area roughly above (i.e. the “greater
than” area). This step is optional, but it may help you see what you are looking for.
Step 3: Use the following formula to find the z-score. Plug in the numbers from step 1.
Click here if you want easy, step-by-step instructions for solving this formula.
Subtract the mean (μ in step 1) from the ‘greater than’ value ( in step 1). Set this number aside for a
moment.
Divide the standard deviation (σ in step 1) by the square root of your sample (n in step 1). For example, if
thirty six children are in your sample and your standard deviation is 3, then 3/√36=0.5
Divide your result from step 1 by your result from step 2 (i.e. step 1/step 2)
Step 4: Look up the z-score you calculated in step 3 in the z-table. If you don’t remember how to look up z-
scores, you can find an explanation in step 1 of this article: Area to the right of a z-score.
Step 5: Subtract your z-score from 0.5. For example, if your score is 0.1554, then 0.5 – 0.1554 = 0.3446.
Step 6: Convert the decimal in Step 5 to a percentage. In our example, 0.3446 = 34.46%.
That’s it!
2. Specific Example
1. General Steps
Step 1: Identify the parts of the problem. Your question should state:
population size
Step 2: Draw a graph. Label the center with the mean. Shade the area roughly below (i.e. the “less than”
area). This step is optional, but it may help you see what you are looking for.
Step 3: Use the following formula to find the z-score. Plug in the numbers from step 1.
Click here if you want simple, step-by-step instructions for using this formula.
If formulas confuse you, all this formula is asking you to do is:
Divide the standard deviation (σ in step 1) by the square root of your sample (n in step 1). For example, if thirty-six children are in your sample and your standard deviation is 3, then 3/√36 = 0.5.
Divide your result from step 1 by your result from step 2 (i.e. step 1/step2)
Step 4: Look up the z-score you calculated in step 3 in the z-table. If you don't remember how to look up z-scores, you can find an explanation in step 1 of this article on area to the right of a z-score in a normal distribution curve.
Step 5: Add your z-score to 0.5. For example, if your z-score is 0.1554, then 0.5 + 0.1554 is 0.6554.
That’s it!
2. Specific Example
A population of 29 year-old males has a mean salary of $29,321 with a standard deviation of $2,120. If a
sample of 100 men is taken, what is the probability their mean salaries will be less than $29,000?
Step 1: Insert the values into the z-formula:
=(29,000-29,321)/2,120/√100 = -321/212 = -1.51.
Step 2: Look up the z-score in the left-hand z-table (or use technology). -1.51 has an area of 93.45%.
However, this is not the answer, as the question is asking for LESS THAN, and 93.45% is the area “greater
than” so you need to subtract from 100%.
100%-93.45%=6.55% or about 0.07.
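The same figure can be cross-checked with a short Python sketch (scipy is an assumed dependency here):

from scipy.stats import norm

mu, sigma, n = 29_321, 2_120, 100
z = (29_000 - mu) / (sigma / n ** 0.5)
print(z, norm.cdf(z))    # z is about -1.51 and the probability about 0.065, i.e. roughly 6.5%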
Sample problem: There are 250 dogs at a dog show who weigh an average of 12 pounds, with a standard
deviation of 8 pounds. If 4 dogs are chosen at random, what is the probability they have an average weight
of greater than 8 pounds and less than 25 pounds?
Step 1: Identify the parts of the problem. Your question should state:
a) Subtract the mean (μ in Step 1) from the greater than value (Xbar in Step 1): 25-12=13.
b) Divide the standard deviation (σ in Step 1) by the square root of your sample (n in Step 1): 8/sqrt4=4
c) Divide your result from a by your result from b: 13/4=3.25
Step 4 Use the formula from Step 3 to find the z-values. This time, use Xbar2 from Step 1 (8).
a) Subtract the mean (μ in Step 1) from the greater than value (Xbar in Step 1): 8-12=-4.
b) Divide the standard deviation (σ in Step 1) by the square root of your sample (n in Step 1): 8/sqrt4=4
c) Divide your result from a by your result from b: -4/4= -1
Note that the bell curve is symmetrical, so if you want to look up a negative value like -1, then just look up
the positive counterpart. The area will be the same.
.8407 = 84.07%
That’s it!
Back to top for more Central Limit Theorem Examples
Sample problem: A population of community college students includes inner city students (p = .33). What is the probability that a random sample of 45 students from the population will have from 20% to 40% inner city students?
Step 1: Press APPS. Highlight the Stats/List Editor by using the scroll keys. Press ENTER.
If you don’t see the Stats/List editor you need to load the app. See instructions here.
Step 4: Scroll down and enter .33 in the Prob Success box.
Step 5: Scroll down and enter 9 in the Lower Value box (because 20% of 45 = 9).
Step 6: Scroll down and enter 18 in the Upper Value box (because 40% of 45 = 18). Press ENTER.
Step 7: Read the Result: Cdf=.857142. This means that the probability your random sample will have 20-
40% inner city students is 85.71%.
10. Questions
Objective questions
a. Weak variable
b. Large variables
c. Strong variable
d. Random variables
a. Weak variable
b. Large variables
c. Strong variable
d. Random variables
a. | X(s) - Xn(s) | → 0
b. | X(s) - Xn(s) | → ∞
c. | X(s) - Xn(s) | → µ
d. | X(s) - Xn(s) | → σ
c. P(Xn(s)) = 1
d. P( X(s) ) = 0
a. E [ Xn- X]2 → 0
b. E [ X]2 → 0
c. E [ Xn]2 → 0
Short Questions
2. Explain when a sequence of random variables is said to converge almost surely (a.s.), i.e. to converge with probability 1.
3. Explain when a sequence of random variables is said to converge in the mean square sense.
Long Questions
1. The scores on a general test have mean 450 and standard deviation 50. It is highly desirable to score over
480 on this exam. A person can get into Smith's College prestigious MBA program if he/she scores over 480.
In one location 25 people sign up to take the exam. The average score of these 25 people exceeds 490. Is
this odd? Should the test center investigate? Answer on the basis of the CLT.
2. A machine fills cereal boxes at a factory. Due to an accumulation of small errors (different flakes sizes,
etc.) it is thought that the amount of cereal in a box is normally distributed with mean 22 oz. for a
3. Sixteen adult males are in a pit which is 98 feet deep. They decide to stand on one another (feet to
head), hoping that the person on top can grip the top of the pit and get out, and, hence go for help. What's
the probability that their plan succeeds?
4. (Weak law of large numbers) If are iid random variables each with mean and
University Questions
Dec 2012
May 2012
Dec 2011
Dec 2010
May 2010
1. Motivation:
This topic develops a fundamental understanding of signals and random phenomena and analyzes their
behavior.
2. Syllabus:
Autocorrelation function and power spectral density of a WSS random sequence, and their properties. 1 hour 1 hour
Autocorrelation and power-spectral density of the output. 1 hour 1 hour
3. References:
5. Prerequisite:
6. Key Notations:
E[ ], µ Expected value
σ Standard Deviation
σ², var[ ] Variance
R(t) Autocorrelation
δ(t) Delta function
h(t) Impulse Response
T(·) Transformation
7. Key Definitions:
1. Random Process:
Thus a random process is a function of the sample point and index variable and may be written as
2. Conditional Probability:
Consider the event and any event B involving the random variable X . The conditional
distribution function of X given B is defined as
4. Linear system
The system is called linear if the principle of superposition applies: the weighted sum of inputs results
in the weighted sum of the corresponding outputs. Thus for a linear system
6. Causal system
The system is called causal if the output of the system at any time depends only on the present and
past values of the input. Thus for a causal system
Introduction
1. Random Process
In practical problems, we deal with time-varying waveforms whose value at a given time is random in nature. For
example, the speech waveform recorded by a microphone, the signal received by a communication receiver
or the daily record of stock-market data represent random variables that change with time. How do we
characterize such data? Such data are characterized as random or stochastic processes. This chapter covers the
fundamentals of random processes.
Recall that a random variable maps each sample point in the sample space to a point in the real line.
A random process maps each sample point to a waveform.
We can define a discrete-time random process on discrete points of time. Particularly, we can get a
The value of a random process at any time can be described from its probabilistic model.
The state is the value taken by at a time , and the set of all such states is called the state
space. A random process is discrete-state if the state-space is finite or countable. It also means that the
corresponding sample space is also finite or countable. Otherwise, the random process is called continuous
state.
Clearly, can take only two values - 0 and 1. Hence is a discrete-time two-state process.
As we have observed above that at a specific time is a random variable and can be described
by its probability distribution function This distribution function is called the first-
order probability distribution function.
To describe the process, we have to use the joint distribution function of the random variables at all possible
instants. Thus a random process can be described by specifying the joint
distribution function.
A random process is called strict-sense stationary (SSS) if its probability structure is invariant with respect to a shift of the time origin.
Thus, the joint distribution functions of any set of random variables do not depend
on the placement of the origin of the time axis. This requirement is very strict. Less strict forms of
stationarity may be defined.
Particularly,
if then is called
order stationary.
A process is called nth-order stationary if its nth-order joint distribution does not depend on the placement of the origin of the time axis.
If is stationary up to order 1
As a consequence
If is stationary up to order 2
Put
Similarly,
Therefore, the autocorrelation function of a SSS process depends only on the time lag
We can also define the joint stationarity of two random processes. Two processes
order is invariant under the translation of time. A complex random process is called
It is very difficult to test whether a process is SSS or not. A subclass of the SSS process called the wide sense
stationary process is extremely important from a practical point of view.
(2) An SSS process is always WSS, but the converse is not always true.
Note that
Note that
and
For any t ,
5. Autocorrelation Function
When are not in the same pulse interval, and hence are independent.
Depending on the delay D , the points may lie on one or two pulse intervals.
Such signals are called power signals. For a power signal the autocorrelation function is defined as
then is the average power delivered to the resistance. In this sense, represents the average
power of the signal.
We see that of the above periodic signal is also periodic and its maximum occurs when
The autocorrelation of the deterministic signal gives us insight into the properties of the autocorrelation
function of a WSS process. We shall discuss these properties next.
Poisson process
In probability theory, a Poisson process is a stochastic process that counts the number of events[note 1] and
the time points at which these events occur in a given time interval. The time between each pair of
consecutive events has an exponential distribution with parameter λ and each of these inter-arrival times is
assumed to be independent of other inter-arrival times. The process is named after the Poisson distribution
introduced by French mathematician Siméon Denis Poisson.[1] It describes the time of events in radioactive
decay,[2] telephone calls at a call center,[3] document requests on a web server,[4] and many other punctual
phenomena where events occur independently from each other.
The Poisson process is a continuous-time stochastic process; the sum of a Bernoulli process can be thought
of as its discrete-time counterpart. A Poisson process is a pure-birth process, the simplest example of a
birth-death process. It is also a point process on the real half-line
Definition
N(0) = 0
Independent increments (the numbers of occurrences counted in disjoint intervals are independent
of each other)
Stationary increments (the probability distribution of the number of occurrences counted in any
time interval only depends on the length of the interval)
Proportionality (the probability of an occurrence in a time interval is proportional to the length of
the time interval)
The probability of simultaneous occurrences equals zero.
The probability distribution of the waiting time until the next occurrence is an exponential
distribution.
For each t≥0, the probability distribution of N(t) is a Poisson distribution with parameter λt. Here
λ>0 is called the rate of the Poisson process.
The occurrences are distributed uniformly on any interval of time. (Note that N(t), the total number
of occurrences, has a Poisson distribution over the non-negative integers, whereas the location of
an individual occurrence on t ∈ (a, b] is uniform.)
There are a series of generalizations of the basic Poisson process defined above; these are also termed
Poisson processes. The first of them, called homogeneous, coincides with the basic Poisson process defined
above.
Homogeneous
A homogeneous Poisson process counts events that occur at a constant rate; it is one of the most well-
known Lévy processes. This process is characterized by a rate parameter λ, also known as intensity, such that the number of events in the time interval (t, t + τ] follows a Poisson distribution:
P[N(t + τ) − N(t) = k] = e^(−λτ) (λτ)^k / k!,  k = 0, 1, 2, …
where N(t + τ) − N(t) = k is the number of events in time interval (t, t + τ].
Just as a Poisson random variable is characterized by its scalar parameter λ, a homogeneous Poisson
process is characterized by its rate parameter λ, which is the expected number of "events" or "arrivals" that
occur per unit time.
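As a small illustration of this counting distribution, the sketch below (Python with NumPy/SciPy, and an assumed rate λ = 2 events per unit time) compares the Poisson probabilities e^(−λτ)(λτ)^k/k! with counts obtained by simulating exponential inter-arrival times.

import numpy as np
from scipy.stats import poisson

lam, tau = 2.0, 3.0                          # assumed rate and interval length
rng = np.random.default_rng(0)

# simulate N(tau) in many runs from Exp(1/lam) inter-arrival times
gaps = rng.exponential(1 / lam, size=(100_000, 40))
counts = (np.cumsum(gaps, axis=1) <= tau).sum(axis=1)

for k in range(4):
    print(k, round(poisson.pmf(k, lam * tau), 4), round((counts == k).mean(), 4))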
N(t) is a sample homogeneous Poisson process, not to be confused with a density or distribution function.
Inhomogeneous
An inhomogeneous Poisson process counts events that occur at a variable rate. In general, the rate
parameter may change over time; such a process is called a non-homogeneous Poisson process or
inhomogeneous Poisson process. In this case, the generalized rate function is given as λ(t). Now the
expected number of events between time a and time b is
λa,b = ∫ λ(t) dt, with the integral taken over [a, b].
Thus, the number of arrivals in the time interval [a, b], given as N(b) − N(a), follows a Poisson distribution
with associated parameter λa,b.
A rate function λ(t) in a non-homogeneous Poisson process can be either a deterministic function of time or
an independent stochastic process, giving rise to a Cox process. A homogeneous Poisson process may be
viewed as a special case when λ(t) = λ, a constant rate.
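A short numerical sketch of the inhomogeneous case, assuming an illustrative rate λ(t) = 2 + sin t: the expected count over [a, b] is the integral of λ(t), and that integral is then the Poisson parameter for N(b) − N(a).

import numpy as np
from scipy.integrate import quad
from scipy.stats import poisson

rate = lambda t: 2 + np.sin(t)          # assumed time-varying rate λ(t)
a, b = 0.0, 5.0

lam_ab, _ = quad(rate, a, b)            # integral of λ(t) from a to b
print(round(lam_ab, 3))                 # expected number of events in [a, b]
print(round(poisson.pmf(8, lam_ab), 4)) # P[N(b) - N(a) = 8]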
Spatial
An important variation on the (notionally time-based) Poisson process is the spatial Poisson process. In the
case of a one-dimension space (a line) the theory differs from that of a time-based Poisson process only in
the interpretation of the index variable. For higher dimension spaces, where the index variable (now x) is in
some vector space V (e.g. R2 or R3), a spatial Poisson process can be defined by the requirement that the
random variables defined as the counts of the number of "events" inside each of a number of non-overlapping finite sub-regions should each have a Poisson distribution and should be independent of each other.
Space-time
A further variation on the Poisson process, the space-time Poisson process, allows for separately
distinguished space and time variables. Even though this can theoretically be treated as a pure spatial
process by treating "time" as just another component of a vector space, it is convenient in most
applications to treat space and time separately, both for modeling purposes in practical applications and
because of the types of properties of such processes that it is interesting to study.
In the special case that this generalized rate function is a separable function of time and space, we have
λ(t, x) = f(x) λ(t)
for some spatial density f(x). (If this is not the case, λ(t) can be scaled appropriately.) Now, f(x) represents the spatial probability
density function of these random events in the following sense. The act of sampling this spatial Poisson
process is equivalent to sampling a Poisson process with rate function λ(t), and associating with each event
a random vector sampled from the probability density function f(x). A similar result can be shown for
the general (non-separable) case.
Characterisation
In its most general form, the only two conditions for a counting process to be a Poisson process are:
Orderliness: roughly speaking, at most one arrival can occur at any instant.
Memorylessness (also called evolution without after-effects): the number of arrivals occurring in
any bounded interval of time after time t is independent of the number of arrivals occurring before
time t.
These seemingly unrestrictive conditions actually impose a great deal of structure in the Poisson process. In
particular, they imply that the time between consecutive events (called interarrival times) are independent
random variables. For the homogeneous Poisson process, these inter-arrival times are exponentially
distributed with parameter λ (mean 1/λ).
Also, the memorylessness property entails that the number of events in any time interval is independent of
the number of events in any other interval that is disjoint from it. This latter property is known as the
independent increments property of the Poisson process.
Properties
As defined above, the stochastic process {N(t)} is a Markov process, or more specifically, a continuous-time
Markov process.
To illustrate the exponentially distributed inter-arrival times property, consider a homogeneous Poisson
process N(t) with rate parameter λ, and let Tk be the time of the kth arrival, for k = 1, 2, 3, ... . Clearly the
number of arrivals before some fixed time t is less than k if and only if the waiting time until the kth arrival
is more than t. In symbols, the event [N(t) < k] occurs if and only if the event [Tk > t] occurs. Consequently
the probabilities of these events are the same:
In particular, consider the waiting time until the first arrival. Clearly that time is more than t if and only if
the number of arrivals before time t is 0. Combining this latter property with the above probability
distribution for the number of homogeneous Poisson process events in a fixed interval gives:
And therefore:
Consequently, the waiting time until the first arrival T1 has an exponential distribution, and is thus
memoryless. One can similarly show that the other interarrival times Tk − Tk−1 share the same distribution.
Hence, they are independent, identically distributed (i.i.d.) random variables with parameter λ > 0; and
expected value 1/λ. For example, if the average rate of arrivals is 5 per minute, then the average waiting
time between arrivals is 1/5 minute.
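The statement about the mean waiting time is easy to check by simulation: with an average of 5 arrivals per minute, inter-arrival times should average 1/5 minute. A minimal NumPy sketch.

import numpy as np

lam = 5.0                                        # arrivals per minute
rng = np.random.default_rng(1)
gaps = rng.exponential(1 / lam, size=1_000_000)  # i.i.d. exponential inter-arrival times
print(round(gaps.mean(), 4))                     # about 0.2 minutes = 1/5 minute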
Applications
The classic example of phenomena well modelled by a Poisson process is deaths due to horse kick in the
Prussian army, as shown in 1898 by Ladislaus Bortkiewicz, a Polish economist and statistician who also
examined data of child suicides.[6][7] The following examples are also well-modeled by the Poisson process:
Gaussian process
In probability theory and statistics, Gaussian processes are a family of stochastic processes. In a Gaussian
process, every point in some input space is associated with a normally distributed random variable.
Moreover, every finite collection of those random variables has a multivariate normal distribution. The
distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables,
and as such, it is a distribution over functions.
The concept of Gaussian processes is named after Carl Friedrich Gauss because it is based on the notion of
the normal distribution which is often called the Gaussian distribution. In fact, Gaussian processes can be
seen as an infinite-dimensional generalization of multivariate normal distributions.
Gaussian processes are important in statistical modelling because of properties inherited from the normal.
For example, if a random process is modeled as a Gaussian process, the distributions of various derived
quantities can be obtained explicitly. Such quantities include the average value of the process over a range
of times and the error in estimating the average using sample values at a small set of times.
A Gaussian process is a stochastic process Xt, t ∈ T, for which any finite linear combination of samples has a
joint Gaussian distribution. More accurately, any linear functional applied to the sample function Xt will give
a normally distributed result. Notation-wise, one can write X ~ GP(m,K), meaning the random function X is
distributed as a GP with mean function m and covariance function K.[1] When the input vector t is two- or
multi-dimensional, a Gaussian process might be also known as a Gaussian random field.[2]
Some authors[3] assume the random variables Xt have mean zero; this greatly simplifies calculations without
loss of generality and allows the mean square properties of the process to be entirely determined by the
covariance function K.[4]
Alternative definitions
Alternatively, a process is Gaussian if and only if for every finite set of indices in the index set
is a multivariate Gaussian random variable. Using characteristic functions of random variables, the Gaussian
property can be formulated as follows: is Gaussian if and only if, for every finite set of
indices , there are real valued , with such that
The numbers and can be shown to be the covariances and means of the variables in the process.[5]
Covariance functions
A key fact of Gaussian processes is that they can be completely defined by their second-order statistics.[2]
Thus, if a Gaussian process is assumed to have mean zero, defining the covariance function completely
defines the process' behaviour. The covariance matrix K between all the pair of points x and x' specifies a
distribution on functions and is known as the Gram matrix. Importantly, because every valid covariance
function is a scalar product of vectors, by construction the matrix K is a non-negative definite matrix.
Equivalently, the covariance function K is a non-negative definite function in the sense that for every finite set of points x1, …, xn
and real coefficients c1, …, cn, Σi Σj ci cj K(xi, xj) ≥ 0; if the inequality is strict for every non-zero choice of coefficients, K is called positive definite. Importantly the non-negative definiteness of K
enables its spectral decomposition using the Karhunen–Loeve expansion. Basic aspects that can be defined
through the covariance function are the process' stationarity, isotropy, smoothness and periodicity.[6][7]
Stationarity refers to the process' behaviour regarding the separation of any two points x and x' . If the
process is stationary, it depends on their separation, x − x', while if non-stationary it depends on the actual
position of the points x and x'; an example of a stationary process is the Ornstein–Uhlenbeck process. On
the contrary, the special case of an Ornstein–Uhlenbeck process, a Brownian motion process, is non-
stationary.
Ultimately Gaussian processes translate as taking priors on functions and the smoothness of these priors
can be induced by the covariance function.[6] If we expect that for "near-by" input points x and x' their
corresponding output points y and y' to be "near-by" also, then the assumption of smoothness is present. If
we wish to allow for significant displacement then we might choose a rougher covariance function. Extreme
examples of the behaviour is the Ornstein–Uhlenbeck covariance function and the squared exponential
where the former is never differentiable and the latter infinitely differentiable.
Periodicity refers to inducing periodic patterns within the behaviour of the process. Formally, this is
achieved by mapping the input x to a two dimensional vector u(x) = (cos(x), sin(x)).
Applications
A Gaussian process can be used as a prior probability distribution over functions in Bayesian inference.[7][9]
Given any set of N points in the desired domain of your functions, take a multivariate Gaussian whose
covariance matrix parameter is the Gram matrix of your N points with some desired kernel, and sample
from that Gaussian.
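The recipe above can be followed directly: build the Gram matrix for a chosen kernel and draw from the corresponding multivariate Gaussian. A minimal NumPy sketch, assuming a zero mean function and a squared-exponential kernel with unit length-scale.

import numpy as np

def sq_exp_kernel(x1, x2, length=1.0):
    # squared-exponential covariance K(x, x') = exp(-(x - x')^2 / (2 length^2))
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

x = np.linspace(0, 5, 100)                              # N points in the desired domain
K = sq_exp_kernel(x, x) + 1e-8 * np.eye(len(x))         # Gram matrix plus jitter for stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)  # three functions from the prior
print(samples.shape)                                    # (3, 100)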
Inference of continuous values with a Gaussian process prior is known as Gaussian process regression, or
kriging; extending Gaussian process regression to multiple target variables is known as cokriging.[10]
Gaussian processes are thus useful as a powerful non-linear multivariate interpolation tool. Additionally,
Gaussian process regression can be extended to address learning tasks in both supervised (e.g. probabilistic
classification[7]) and unsupervised (e.g. manifold learning[2]) learning frameworks.
Consider a real WSS process Since the autocorrelation function of such a process is a
The autocorrelation function is an important function characterising a WSS random process. It possesses
some general properties. We briefly describe them below.
Remark If is a voltage signal applied across a 1 ohm resistance, then is the ensemble average
power delivered to the resistance.
Because,
We have
Proof
It can be shown that the sufficient condition for a function to be the autocorrelation function of a real
Proof: Note that a real WSS random process is called mean-square periodic ( MS periodic ) with a
6. Suppose
The autocorrelation function measures the correlation between two random variables and
If the autocorrelation function drops quickly with the time lag, then samples separated by a large lag will be less correlated. This
in turn means that the signal has a lot of changes with respect to time; such a signal has high-frequency
components. If the autocorrelation drops slowly, the signal samples are highly correlated and such a signal has fewer high-frequency
components. Later on we see that the autocorrelation function is directly related to the frequency-domain
representation of a WSS process.
If and are two real jointly WSS random processes, their cross-correlation functions are
independent of and depends on the time-lag. We can write the cross-correlation function
We Have
Further,
Consider a random process which is sum of two real jointly WSS random processes.
We have
Example 2
Suppose
In many applications, physical systems are modeled as linear time-invariant (LTI) systems. The dynamic
behaviour of an LTI system to deterministic inputs is described by linear differential equations. We are
familiar with time and transform domain (such as Laplace transform and Fourier transform) techniques to
solve these differential equations. In this lecture, we develop the technique to analyze the response of an
LTI system to WSS random process.
A system is modelled by a transformation T that maps an input signal x(t) to an output signal y(t) as
shown in Figure 1. We can thus write,
The system is called linear if the principle of superposition applies: the weighted sum of inputs results
in the weighted sum of the corresponding outputs. Thus for a linear system
Then,
It is easy to check that the differentiator in the above example is a linear time-invariant system.
The system is called causal if the output of the system at any time depends only on the present and
past values of the input. Thus for a causal system
As shown in Figure 2, a linear system can be characterised by its impulse response where
Figure 2
Recall that any function x(t) can be represented in terms of the Dirac delta function as follows
Figure 3 shows the input-output relationship of an LTI system in terms of the impulse response and the
frequency response.
Figure 3
Consider an LTI system with impulse response h(t). Suppose is a WSS process input to the
where we have assumed that the integrals exist in the mean square (m.s.) sense.
The role played by the delta function in the continuous-time case is, for the discrete-time LTI system, played by the unit sample sequence, defined by
As illustrated in Figure 1, discrete-time linear shift-invariant system is characterized by the unit sample
response which is the output of the system to the unit sample sequence .
The DTFT of the unit sample response is the transfer function of the system and given by
where the transfer function is a function of the complex variable z. It is defined on a region of convergence (ROC) in the z-plane.
An analysis similar to that for the continuous-time LTI system can be applied to the discrete-time
LTI system. Such an analysis shows that the response of a linear time-invariant system with
More generally, we can take the z-transforms of the input and the response and show that
Remark
In this case, the ROC of is a region in the given by For example, suppose
Then,
The contour is called the unit circle. Thus represents evaluated on the unit circle.
The polynomials and help us in analyzing the properties of a linear system in terms of the
Pole - the point in the where Consequently at such a point. The ROC of
does not contain any pole. The poles and zeroes and unit circle on the complex plane are illustrated
in Figure 2.
• If all the zeros of the system lie inside the unit circle, the inverse system with a transfer function
will have all its poles inside the unit circle and be stable.
• A discrete-time LTI system is called a maximum-phase system if all its poles and zeros lie outside
the unit circle.
Consider a discrete-time linear time-invariant system with impulse response and input as
shown in Figure 3 below. Assume to be a WSS process with mean and autocorrelation function
Figure 3
where
The cross-correlation between the output and the input random processes is given by
Figure 4
Figure 5
Remark
is a WSS Gaussian random process, then the output process is also Gaussian with the
probability density function determined by its mean and the autocorrelation function.
Example 1
Though the input is an uncorrelated process in the above example, the output is a correlated
process.
For the same white noise input, we can generate random processes with different autocorrelation
functions or power spectral densities.
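This remark can be illustrated numerically: feeding the same white noise through different filters yields outputs with different autocorrelation functions. A minimal sketch, assuming a first-order recursive filter y[n] = a·y[n−1] + w[n] for two values of a; the output autocorrelation should decay roughly like a^k.

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
w = rng.standard_normal(200_000)             # white noise input with unit variance

def autocorr(x, max_lag=5):
    x = x - x.mean()
    return [np.mean(x[: len(x) - k] * x[k:]) / np.var(x) for k in range(max_lag)]

for a in (0.3, 0.9):
    y = lfilter([1.0], [1.0, -a], w)         # y[n] = a*y[n-1] + w[n]
    print(a, [round(r, 3) for r in autocorr(y)])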
Figure 6
Then
We have seen that is the product of a constant and two transfer functions. This
result is of fundamental importance in modeling a WSS process because of the spectral factorization
theorem stated below:
Thus a WSS random signal with continuous spectrum that satisfies the Paley-Wiener
condition can be considered as an output of a linear filter fed by a white noise sequence
{w[n]}as shown in Figure 7(a). The sequence {w[n]} is called the innovation sequence.
symmetrical about the unit circle groups the poles and zeros inside the unit circle
and groups the poles and zeros outside the unit circle.
In general spectral factorization is difficult, however for a signal with rational power spectrum,
spectral factorization can be easily done.
have a filter to filter the given signal to get the innovation sequence.
and are related through an invertible transform; so they contain the same
information.
Example 2 Suppose the power spectral density of a discrete random sequence is given by
Then
Wold's Decomposition
Any WSS signal can be decomposed as a sum of two mutually orthogonal processes: a regular process,
which can be expressed as the output of a linear filter using a white noise sequence as input,
and a predictable process, that is, a process that can be predicted from its own past with zero
prediction error.
Consider the problem of estimating a signal in presence of additive noise. We want to estimate the signal
by filtering the noisy signal.
We have to use the probabilistic properties of the noise to dissociate the noise from the signal. An
optimal filter performs this dissociation. We will consider the case when the signal to be estimated is of
known form (deterministic). For example, in radar application a signal of known form is reflected from a
distant target. The received signal is the sum of the scaled and shifted version of the original signal and the
noise.
That is,
where X(t) is a shifted and scaled version of the known transmitted signal and V(t) is noise assumed
to be WSS with a known power spectral density. We wish to decide whether X(t) is present and to determine its value.
Then
Equality holds if
band. The width of this band is called the bandwidth and its centre is called the centre frequency of the
band-pass process. If the bandwidth is very small compared to the centre frequency, then the process is called a
narrow-band process.
We can similarly define a low-pass random process as a random process if its power spectral
In telecommunication, we often deal with random signals which have PSD concentrated in a small
frequency band and negligible outside this band. The information bearing signals like speech, image
and video are low-pass signals. These information-bearing signals modulate a sinusoidal carrier for
transmitting over the communication channel that acts as a bandpass filter. For example, the
amplitude- modulated waveform received by a communication receiver is modelled as an
amplitude-modulated random-phase sinusoid
The noise associated with communication signal undergoes band-pass filtering in the
communication receiver and the band-pass filtered noise can be modeled as a band-pass process.
We can do the correlation and power spectral analysis of such signals in the usual manner. However, for
analysis of nonlinear operations like the multiplication with a random process, the following trigonometric
representation is useful.
An arbitrary zero-mean WSS process can be represented in terms of the slowly varying
(1)
then ,
and
(3)
Note that
and
Again
and
where and the integral is defined in the mean-square sense. See the illustration in Figure 2.
and
The Hilbert transform of is generally denoted as Therefore, from (2) and (3) we establish
and
The realization for the in phase and the quadrature phase components is shown in Figure 3 below.
From the above analysis, we can summarise the following expressions for the autocorrelation functions
where
Figure 4
Similarly ,
symmetric about
implying that
Example 1
Suppose the band-limited white-noise process has the PSD as shown in Figure 5 below.
(1) The representation of the band-pass process in terms of the in-phase and the quadrature
where
and
These are respectively called the envelope and the phase of the process.
If the in-phase and quadrature components are also Gaussian processes, then the envelope and the phase will be independent. Using the
results on the PDF of functions of RVs, we get the following.
Solution. The random process Xn is a discrete-time, continuous-valued random process. The sample space
is SX = {x : x ≥ 0}. The index parameter set (domain of time) is I = {1, 2, 3, · · ·}.
Example 2 The number of failures N(t), which occur in a computer network over the time interval [0, t), can
be described by a homogeneous Poisson process {N(t), t ≥ 0}. On average, there is a failure every 4
hours, i.e. the intensity of the process is equal to λ = 0.25 h⁻¹.
(a) What is the probability of at most 1 failure in [0, 8), at least 2 failures in [8, 16), and at most 1 failure in
[16, 24) (time unit: hour)?
Solution (a) The probability p = P[N(8) − N(0) ≤ 1, N(16) − N(8) ≥ 2, N(24) − N(16) ≤ 1] is required. In view of
the independence and the homogeneity of the increments of a homogeneous Poisson process, it can be
determined as follows:
p = P[N(8) − N(0) ≤ 1] · P[N(16) − N(8) ≥ 2] · P[N(24) − N(16) ≤ 1] = P[N(8) ≤ 1] · P[N(8) ≥ 2] · P[N(8) ≤ 1].
Since P[N(8) ≤ 1] = P[N(8) = 0] + P[N(8) = 1] = e^(−0.25·8) + 0.25·8·e^(−0.25·8) = 0.406 and
P[N(8) ≥ 2] = 1 − P[N(8) ≤ 1] = 0.594, the desired probability is p = 0.406 × 0.594 × 0.406 = 0.098.
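The arithmetic can be confirmed with SciPy, using the same intensity λ = 0.25 per hour and 8-hour blocks.

from scipy.stats import poisson

mu = 0.25 * 8                               # expected failures in an 8-hour block = 2
p_le_1 = poisson.cdf(1, mu)                 # P[N(8) <= 1], about 0.406
p_ge_2 = 1 - p_le_1                         # P[N(8) >= 2], about 0.594
print(round(p_le_1 * p_ge_2 * p_le_1, 3))   # about 0.098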
Example 3 — Random Telegraph signal Let a random signal X(t) have the structure X(t) = (−1)^N(t) · Y, t ≥ 0,
where {N(t), t ≥ 0} is a homogeneous Poisson process with intensity λ and Y is a binary random variable with
P(Y = 1) = P(Y = −1) = 1/2 which is independent of N(t) for all t. Signals of this structure are called random
telegraph signals. Random telegraph signals are basic modules for generating signals with a more
complicated structure. Obviously, X(t) = 1 or X(t) = −1, and Y determines the sign of X(0).
Since |X(t)|² = 1 < ∞ for all t ≥ 0, the stochastic process {X(t), t ≥ 0} is a second-order process. Letting I(t) =
(−1)^N(t), its trend function is m(t) = E[X(t)] = E[Y]E[I(t)]. Since E[Y] = 0, the trend function is identically
zero: m(t) ≡ 0. It remains to show that the covariance function C(s, t) of this process depends only on |t − s|.
This requires the determination of the probability distribution of I(t). A transition from I(t) = −1 to I(t) = +1
or, conversely, from I(t) = +1 to I(t) = −1 occurs at those time points where Poisson events occur, i.e. where
N(t) jumps.
P(I(t) = 1) = P(even number of jumps in [0, t]) = e^(−λt) Σ_{i=0..∞} (λt)^(2i)/(2i)! = e^(−λt) cosh λt,
P(I(t) = −1) = P(odd number of jumps in [0, t]) = e^(−λt) Σ_{i=0..∞} (λt)^(2i+1)/(2i+1)! = e^(−λt) sinh λt.
Hence the expected value of I(t) is
E[I(t)] = 1 · P(I(t) = 1) + (−1) · P(I(t) = −1) = e^(−λt)[cosh λt − sinh λt] = e^(−2λt).
Since C(s, t) = COV[X(s), X(t)] = E[X(s)X(t)] = E[Y I(s) Y I(t)] = E[Y² I(s)I(t)] = E(Y²) E[I(s)I(t)] and E(Y²) = 1,
C(s, t) = E[I(s)I(t)]. Thus, in order to evaluate C(s, t), the joint distribution of the random vector (I(s), I(t)) must be
determined. In view of the homogeneity of the increments of {N(t), t ≥ 0}, for s < t,
p1,1 = P(I(s) = 1, I(t) = 1) = P(I(s) = 1) P(I(t) = 1 | I(s) = 1) = e^(−λs) cosh λs · P(even number of jumps in (s, t])
     = e^(−λs) cosh λs · e^(−λ(t−s)) cosh λ(t−s) = e^(−λt) cosh λs cosh λ(t−s).
Analogously,
p1,−1 = P(I(s) = 1, I(t) = −1) = e^(−λt) cosh λs sinh λ(t−s),
p−1,1 = P(I(s) = −1, I(t) = 1) = e^(−λt) sinh λs sinh λ(t−s),
p−1,−1 = P(I(s) = −1, I(t) = −1) = e^(−λt) sinh λs cosh λ(t−s).
Since E[I(s)I(t)] = p1,1 + p−1,−1 − p1,−1 − p−1,1, we obtain C(s, t) = e^(−2λ(t−s)), s < t. Note that the order
of s and t can be changed so that C(s, t) = e^(−2λ|t−s|).
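The covariance formula can also be checked by simulation. A minimal NumPy sketch that draws random telegraph values at times s and t = s + τ, using an assumed λ = 1, and compares the sample average of X(s)X(t) with e^(−2λτ).

import numpy as np

lam, s, n_paths = 1.0, 1.0, 200_000
rng = np.random.default_rng(0)
Y = rng.choice([-1.0, 1.0], size=n_paths)        # P(Y = 1) = P(Y = -1) = 1/2

for tau in (0.5, 1.0, 2.0):
    Ns = rng.poisson(lam * s, size=n_paths)      # N(s)
    dN = rng.poisson(lam * tau, size=n_paths)    # independent increment N(t) - N(s)
    Xs = (-1.0) ** Ns * Y
    Xt = (-1.0) ** (Ns + dN) * Y
    print(tau, round(np.mean(Xs * Xt), 3), round(np.exp(-2 * lam * tau), 3))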
Questions
Objective Question
a. Zero
b. Pole
c. Master
d. Slave
a. Zero
b. Pole
c. Master
d. Slave
a. Linear
b. Time Invariant
c. Time Variant
d. Low Pass
a. Realizable
b. Imaginary
c. Ideal
d. Lossless
8. Short Question
9. Long Question
1. Explain in brief:
3. What is a Random Process? State four classes of random processes giving one example each
functions and
5. a) Sketch all the possible realizations of X(t).
(i)
Dec 2012
Q. A random process is given by X(t) = sin(Wt + Y), where Y is uniformly distributed in (0, 2π). Verify whether
{X(t)} is a WSS process.
Q. State and prove the properties of autocorrelation function and cross correlation function.
where θ is uniformly distributed over (−π, π). Prove that X(t) is correlation ergodic.
Q. A WSS random process {X(t)} is applied to the input of an LTI system whose impulse response is t·e^(−at)·u(t),
where a (> 0) is a real constant.
May 2012
Q. A random process is given by X(t) = A cos(Wt + Y), where Y is uniformly distributed in (0, 2π). Verify whether
{X(t)} is a WSS process.
Q. Explain power spectral density. State it’s properties and prove any two of them
Q. Prove that if input to the LTI is WSS then output is also WSS
Dec 2011
Q. What is random process ? state four classes of random process with example
Q. Explain power spectral density. State it’s properties and prove any two of them
Q. If X(t) is given by X(t) = 10 cos(100t + θ), where θ is uniformly distributed over (−π, π), prove that
X(t) is WSS.
May 2011
Q. Find the autocorrelation function and power spectral density of the random process given by X(t) = A cos(Wt + Y),
where Y is uniformly distributed in (−π, π). Verify whether {X(t)} is a WSS process.
Q. Explain power spectral density. State it’s properties and prove any two of them
Dec 2010
where A and a are real positive constants, is applied to the input of the LTI system with impulse response
h(t) = e^(−b|τ|) u(t), where b is a real positive constant. Find the autocorrelation of the output Y(t) of the system.
Dec 2009
Q. State and prove the properties of autocorrelation function and cross correlation function.
where θ is uniformly distributed over (−π, π). Prove that X(t) is correlation ergodic.
Dec 2008
Q.What is a Random Process? State four classes of random processes giving one example each
CHAPTER-6
The objective of this course is to analyze the behavior of signals and random phenomena, with
special emphasis on its applications to communication engineering, signals and linear systems.
2. Syllabus:
1. Introduction
3. References:
5. Prerequisite:
6. Key Definitions:
1. Stochastic process
Dynamical system with stochastic (i.e. at least partially random) dynamics. At each
time t the system is in one state Xt, taken from a set S, the state space. One often
writes such a process as
A Markov chain with transition probabilities that depend only on the length m-n of the separating
time interval,
4. Stochastic matrix
The one-step transition probabilities WXY (1) in a homogeneous Markov chain are from
now on interpreted as entries of a matrix W = { WXY } , the so-called transition matrix of the chain, or
stochastic matrix.
Introduction
7.1 Random walks
A drunk walks along a pavement of width 5. At each time step he/she moves one position forward, and one
position either to the left or to the right with equal probabilities.
Except: when in position 5 he/she can only go to 4 (wall); when in position 1 and going to the
You decide to take part in a roulette game, starting with a capital of C0 pounds. At each round of the game
you gamble £10. You lose this money if the roulette gives an even number, and you double it (so receive
£20) if the roulette gives an odd number.
Suppose the roulette is fair, i.e. the probabilities of even and odd outcomes are exactly
1/2. What is the probability that you will leave the casino broke?
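Before answering exactly, the ruin question can be explored by simulation. A minimal sketch, assuming a hypothetical starting capital of £100, £10 stakes, a fair wheel, and a cap on the number of rounds; with a fair game the estimated ruin fraction keeps creeping towards 1 as the horizon grows.

import numpy as np

rng = np.random.default_rng(0)
start, stake, max_rounds, n_players = 100, 10, 5_000, 1_000

broke = 0
for _ in range(n_players):
    capital = start
    for _ in range(max_rounds):
        capital += stake if rng.random() < 0.5 else -stake   # fair even/odd bet
        if capital == 0:
            broke += 1
            break
print(broke / n_players)   # estimated probability of leaving broke within max_rounds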
Consider two urns A and B in a casino game. Initially A contains two white balls, and
B contains three black balls. The balls are then `shuffled' repeatedly at discrete time
steps according to the following rule: pick at random one ball from each urn, and swap them. The three
possible states of the system during this (discrete time and discrete
Dynamical systems with stochastic (partially or fully random) dynamics. Some are really fundamentally
random; others are `practically' random.
E.g.
Commerce: stock markets & exchange rates, insurance risk, derivative pricing,
We first define stochastic processes generally, and then show how one finds discrete time
Markov chains as probably the most intuitively simple class of stochastic processes.
Dynamical system with stochastic (i.e. at least partially random) dynamics. At each
time t the system is in one state Xt, taken from a set S, the state space. One often writes
such a process as
Consequences, conventions
(i) We can only speak about the probabilities to find the system in certain states at certain times: each Xt is a
random variable.
(ii) To define a process fully: specify the probabilities (or probability densities) for the
(iii) If time discrete: label time steps by integers n ≥0, write X = ( Xn : n ≥0) .
Markov chains are discrete state space processes that have the Markov property. Usually they are defined
to have also discrete time
A discrete time and discrete state space stochastic process is Markovian if and only if the conditional
probabilities do not depend on (X0, …, Xn) in full, but only on the most recent state Xn:
The likelihood of going to any next state at time n + 1 depends only on the state we are in at time n.
Consequences, conventions
Proof:
X ∈ S:
This defines a time dependent probability measure on the set S, with the usual
Properties
(iii) For any two times m > n ≥ 0 the measures Pn(X) and Pm(X) are related via
With -----------(I)
A Markov chain with transition probabilities that depend only on the length m − n of the separating time interval
is called a homogeneous (or stationary) Markov chain. Here the absolute time is
irrelevant: if we re-set our clocks by a uniform shift n → n + K for fixed K, then all
probabilities to make certain transitions during given time intervals remain the same.
consequences, conventions
(i) The transition probabilities in a homogeneous Markov chain obey the Chapman-Kolmogorov equation.
The likelihood to go from Y to X in m steps is the sum over all paths that go first
in m − 1 steps to any intermediate state X', followed by one step from X' to X. The Markovian property
guarantees that the last step is independent of how we got to X'. Stationarity ensures that the likelihood to
go in m − 1 steps to X' does not depend on when the various intermediate steps were made.
Proof:
Rewrite Pm(X) in two ways, first by choosing n = 0 in the right-hand side of (I), second by choosing n = m - 1
in the right-hand side of (I):
Finally we choose P0(X) = δXY, and demand that the above is true for any Y ∈ S:
The one-step transition probabilities WXY (1) in a homogeneous Markov chain are from now on interpreted
as entries of a matrix W = (WXY), the so-called transition matrix of the chain, or stochastic matrix.
consequences, conventions:
This follows directly from (8), in combination with our identification of WXY in Markov chains as the
probability to go from Y to X in one time step.
Examples
If Xn-1 is not known exactly, average over all possible values of Xn-1:
Hence
Since the state space S is discrete, we can represent/label the states by integer numbers,
and write simply S = {1, 2, 3, …}. Now the Xn are themselves integer random variables. To
exploit optimally the simple nature of Markov chains we change our notation.
From now on we will limit ourselves for simplicity to Markov chains with finite state spaces
S = {1, …, |S|}. This is not essential but removes distracting technical complications.
In our new notation the dynamical eqns of the Markov chain becomes
Example Suppose a car rental agency has three locations in Ottawa: Downtown location (labelled A), East
end location (labelled B) and a West end location (labelled C). The agency has a group of delivery drivers to
serve all three locations. The agency's statistician has determined the following:
1. Of the calls to the Downtown location, 30% are delivered in Downtown area, 30% are delivered in
the East end, and 40% are delivered in the West end
2. Of the calls to the East end location, 40% are delivered in Downtown area, 40% are delivered in
the East end, and 20% are delivered in the West end
3. Of the calls to the West end location, 50% are delivered in Downtown area, 30% are delivered in
the East end, and 20% are delivered in the West end.
After making a delivery, a driver goes to the nearest location to make the next delivery. This way, the
location of a specific driver is determined only by his or her previous location.
To make matters simple, let's assume that it takes each delivery person the same amount of time (say 15
minutes) to make a delivery, and then to get to their next location. According to the statistician's data, after
15 minutes, of the drivers that began in A, 30% will again be in A, 30% will be in B, and 40% will be in C.
Since all drivers are in one of those three locations after their delivery, each column sums to 1. Because we
are dealing with probabilities, each entry must be between 0 and 1, inclusive. The most important fact that
lets us model this situation as a Markov chain is that the next location for delivery depends only on the
current location, not previous history. It is also true that our matrix of probabilities does not change during
the time we are observing.
Now, let's start with a simple question. If you begin at location C, what is the probability (say, P) that you
will be in area B after 2 deliveries? Think about how you can get to B in two steps. We can go from C to C,
then from C to B, we can go from C to B, then from B to B, or we can go from C to A, then from A to B. To
figure out P, let P(XY) denote the probability of going from X to Y in one delivery (where X,Y can be A,B or
C). Do you remember how probabilities work? If two (or more) independent events must both (all) happen,
to obtain the probability of them both (all) happening, we multiply their probabilities together. To obtain
the probability of either (any) happening, we add the probabilities of those events together.
This gives us P = P(CA)P(AB) + P(CB)P(BB) + P(CC)P(CB) for the probability that a delivery person goes from C
to B in 2 deliveries. Substituting into our formula using the statistician's data above gives P = (.5)(.3) +
(.3)(.4) + (.2)(.3) = .33. This tells us that if we begin at location C, we have a 33% chance of being in location
B after 2 deliveries.
Let's try this for another pair. If we begin at location B, what is the probability of being at location B after 2
deliveries? Try this yourself before you read further! The probability of going from location B to location B
in two deliveries is P(BA)P(AB) + P(BB)P(BB) + P(BC)P(CB) = (.4)(.3)+(.4)(.4) + (.2)(.3) = .34. Now it wasn't so
bad calculating where you would be after 2 deliveries, but what if you need to know where you will be after
5, or 15 deliveries? That could take a LONG time. There must be an easier way, right? Look carefully at
where these numbers come from. As you might suspect, they are the result of matrix multiplication.
Going from C to B in 2 deliveries is the same as taking the inner product of row 2 and column 3. Going from
B to B in 2 deliveries is the same as taking the inner product of row 2 and column 2. If you multiply T by T,
the (2, 3) and (2, 2) entries are, respectively, the same answers that you got for these two questions above.
The rest of T2 answers the same type of question for any other pair of locations X and Y.
Now that we have this matrix, it should be easier to find where we will be after 3 deliveries. We will let
p(AB) represent the probability of going from A to B in 2 deliveries. Let's find the probability of going from C
to B in 3 deliveries: it is p(CA)P(AB) + p(CB)P(BB) + p(CC)P(CB) = (.37)(.3) + (.33)(.4) + (.3)(.3) = .333. You will
see that this probability is the inner product of row 2 of T² and column 3 of T. Therefore, if we multiply T² by
T, we will get the probability matrix for 3 deliveries.
By now, you probably know how we find the matrix of probabilities for 4, 5 or more deliveries. Notice that
the elements on each column still add to 1. Therefore, it is important that you do not round your answers.
Keep as many decimal places as possible to retain accuracy.
What do you notice about these matrices as we take into account more and more deliveries? The numbers
in each row seem to be converging to a particular number. Think about what this tells us about our long-
term probabilities. This tells us that after a large number of deliveries, it no longer matters which location
we were in when we started. At the end of the week, we have (approximately) a 38.9% chance of being at
location A, a 33.3% chance of being at location B, and a 27.8% chance of being in location C. This
convergence will happen with most of the transition matrices that we consider.
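The two-step and long-run probabilities worked out above can be reproduced by raising the transition matrix to powers. A minimal NumPy sketch; as in the text, column j holds the probabilities of the next location given that the driver is currently at location j (A, B, C).

import numpy as np

T = np.array([[0.3, 0.4, 0.5],
              [0.3, 0.4, 0.3],
              [0.4, 0.2, 0.2]])

T2 = T @ T
print(round(T2[1, 2], 2))   # C to B in 2 deliveries: 0.33
print(round(T2[1, 1], 2))   # B to B in 2 deliveries: 0.34

# after many deliveries every column approaches the same vector
print(np.round(np.linalg.matrix_power(T, 20), 3))   # columns all near [0.389, 0.333, 0.278]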
Remark If all the entries of the transition matrix are between 0 and 1 EXCLUSIVELY, then convergence is
guaranteed to take place. Convergence may take place when 0 and 1 are in the transition matrix, but
convergence is no longer guaranteed.
Think about the situation that this matrix represents in order to understand why Ak oscillates as k grows.
is the vector of initial distribution. After one delivery, the distribution will be (approximately) 40% of our
drivers in area A, 33.4% in area B, and 26.6% in area C. We found this by multiplying our initial distribution
matrix by our transition matrix, as follows:
After many deliveries, we saw that some convergence occurs, so that the area from which we start doesn't
matter. This will mean that we will obtain the same right-hand side no matter with which initial distribution
we start. For example,
If the initial distribution indicates the actual number of people in the system, the following can represent
our system after one delivery:
Did you notice that we now have a fractional number of people in areas A and C after one delivery? We
know that this cannot happen, but this gives us a good idea of approximately how many delivery people are
in each area. After many deliveries, the right-hand side of this equality will also be very close to a particular
vector. For example,
The particular vector that the product converges to is the total number of people in the system (54 in this
case) times any column of the matrix that Ak converges to as k grows.
I hope the above example gave you a good idea about the process of Markov chains. Now here is the
general setting:
Definitions For a Markov chain with n states, the state vector is a column vector whose ith component
represents the probability that the system is in the ith state at that time. Note that the sum of the entries of
a state vector is 1. For example, vectors X0 and X1 in the above example are state vectors. If pij is the
probability of movement (transition) from one state j to state i, then the matrix T=[ pij] is called the
transition matrix of the Markov chain.
The following Theorem gives the relation between two consecutive state vectors:
If Xn+1 and Xn are two consecutive state vectors of a Markov chain with transition matrix T, then Xn+1 = T Xn
For a Markov chain, we are usually interested in the long-term behavior of a general state vector Xn. In
other words, we would like to find the limit of Xn as n→∞. It may happen that this limit does not exist, for
example let
Clearly Xn oscillates between the vectors (0, 1) and (1, 0) and therefore does not approach a fixed vector.
A question is: what makes Xn approach a limiting vector as n→∞. The next theorem will give an answer,
first we need a definition:
Definition A transition matrix T is called regular if, for some integer r, all entries of Tr are strictly positive. (0
is not strictly positive).
is regular since
1. If T is a regular transition matrix, then as n approaches infinity, Tn→S where S is a matrix of the
form [v, v,…,v] with v being a constant vector.
2. If T is a regular transition matrix of a Markov chain process, and if X is any state vector, then as n
approaches infinity, TnX→p, where p is a fixed probability vector (the sum of its entries is 1), all of
whose entries are positive.
Consider a Markov chain with a regular transition matrix T, and let S denote the limit of Tn as n approaches
infinity, then TnX→SX=p, and therefore the system approaches a fixed state vector p called the steady-state
vector of the system. Now since Tn+1 = T Tn and both Tn+1 and Tn approach S, we have S = TS. Note that
any column of this matrix equation gives Tp=p. Therefore, the steady-state vector of a regular Markov chain
with transition matrix T is the unique probability vector p satisfying Tp=p.
Is there a way to compute the steady-state vector of a regular Markov chain without using the limit? Well,
if we can solve Tp = p for p, then yes! You might have seen this sort of thing before (and certainly will in
your first linear algebra course). Recall the definition of an eigenvector and an eigenvalue of a square
matrix:
Given a square matrix A, we say that the number λ is an eigenvalue of A if there exists a nonzero vector X
satisfying: AX=λX. In this case, we say that X is an eigenvector of A corresponding to the eigenvalue λ.
Recall that the eigenvalues of a matrix A are the solutions to the equation det(A- λI)=0 where I is the
identity matrix of the same size as A. If λ is an eigenvalue of A, then an eigenvector corresponding to λ is a
non-zero solution to the homogeneous system (A- λI)X=0. Consequently, there are infinitely many
eigenvectors corresponding to a fixed eigenvalue.
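Solving Tp = p is exactly an eigenvector computation for the eigenvalue 1. A minimal NumPy sketch, applied to the delivery matrix used earlier; the same code gives the steady-state vector of any regular transition matrix, such as the weather example below.

import numpy as np

T = np.array([[0.3, 0.4, 0.5],
              [0.3, 0.4, 0.3],
              [0.4, 0.2, 0.2]])

vals, vecs = np.linalg.eig(T)
k = np.argmin(np.abs(vals - 1))    # eigenvalue closest to 1
p = np.real(vecs[:, k])
p = p / p.sum()                    # normalise so the entries sum to 1
print(np.round(p, 3))              # about [0.389, 0.333, 0.278], the steady-state vector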
Example If you have lived in Ottawa for a while, you must have realized that the weather is a main concern
of the population. An unofficial study of the weather in the city in early spring yields the following
observations:
2. If we have a nice day, we are just as likely to have snow or rain the next day
3. If we have snow or rain, then we have an even chance to have the same the next day
4. If there is a change from snow or rain, only half of the time is this a change to a nice day.
b. If it is nice today, what is the probability of being nice after one week?
Solution 1) Since the weather tomorrow depends only on today, this is a Markov chain process. The
transition matrix of this system is
3) Notice first that we are dealing with a regular Markov chain since the transition matrix is regular,
so we are sure that the steady-state vector exists. To find it we solve the homogeneous system (T-
I)X=0 which has the following coefficient matrix:
So what solution do we choose? Remember that a steady-state vector is in particular a probability vector;
that is the sum of its components is 1: 0.5t+t+t=1 gives t=0.4. Thus, the steady-state vector is
Objective Question
Dec 2012
Q. The transition matrix of markov chain with three states 0,1,2 is given by
¾ ¼ 0
¼ ½ ¼
May 2012
Dec 2011
May 2011
Q. Three boys A, B, C are throwing balls to each other. A always throws the ball to B. B is as likely to throw the
ball to C as to A. The probability that C will throw the ball to A is 2/3. Write the transition probability matrix and
prove that the process is Markovian.