Lecture 1: Introduction

January 4, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Probability theory is the branch of mathematics that is concerned with random phenomena.
So it is natural to ask: what is a random phenomenon? Many phenomena have
the property that their repeated observation under a specified set of conditions invariably
leads to the same outcome. For example, if a ball initially at rest is dropped from a
height of d meters through an evacuated cylinder, it will invariably fall to the ground in
$t = \sqrt{2d/g}$ seconds, where g is the gravitational acceleration. There are other phenomena whose repeated
observation under a specified set of conditions does not always lead to the same outcome.
A familiar example of this type is the tossing of a coin. If a coin is tossed 1000 times the
occurrences of heads and tails alternate in a seemingly erratic and unpredictable manner.
It is such phenomena that we think of as being random and which are the object of our
investigation.
At first glance it might seem impossible to make any worthwhile statements about such
random phenomena, but this is not so. Experience has shown that many nondeterministic
phenomena exhibit a “statistical regularity” that makes them subject to study. For example,
if we toss a coin a large number of times, the proportion of heads seems to fluctuate around 1/2
unless the coin is severely unbalanced.
The eighteenth century French naturalist Comte de Buffon tossed a coin 4040 times and
got 2048 heads. The proportion (or relative frequency) of heads in this case is 0.507. J.E.
Kerrich from Britain, recorded 5067 heads in 10000 tosses. Proportion of heads in this case
is 0.5067. Statistician Karl Pearson spent some more time, making 24000 tosses of a coin.
He got 12012 heads, and thus, proportion of heads in this case is 0.5005.
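The proportions quoted above can be recomputed directly, and a simulation gives a feel for how the relative frequency behaves as the number of tosses grows. The following is only a sketch: the simulated coin is assumed to be fair, and the names `RECORDS`, `relative_frequency` and `simulate_heads` are illustrative, not part of the notes.

```python
import random

# Historical coin-tossing records quoted in the text: (name, tosses, heads).
RECORDS = [
    ("Buffon", 4040, 2048),
    ("Kerrich", 10000, 5067),
    ("Pearson", 24000, 12012),
]


def relative_frequency(heads, tosses):
    """Proportion of heads among the tosses."""
    return heads / tosses


def simulate_heads(tosses, rng):
    """Count heads in simulated tosses of a coin assumed to be fair (p = 1/2)."""
    return sum(rng.random() < 0.5 for _ in range(tosses))


if __name__ == "__main__":
    for name, tosses, heads in RECORDS:
        print(f"{name}: {relative_frequency(heads, tosses):.4f}")
    rng = random.Random(2019)  # fixed seed so that a run is reproducible
    for tosses in (100, 10_000, 100_000):
        heads = simulate_heads(tosses, rng)
        print(f"simulated, n = {tosses}: {relative_frequency(heads, tosses):.4f}")
```

A finite simulation of this kind can only suggest that the proportion stabilises; it cannot establish the existence of a limit.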
Of course, this proportion of heads depends on the number of tosses and will fluctuate, even
wildly, as the number of tosses increases. But if we let the number of tosses go to infinity, will the
sequence of relative frequencies “settle down to a steady value”? Such a question can never
be answered empirically, since by the very nature of a limit we cannot put an end to the
trials. So it is a mathematical idealization (or belief) to assume that such a limit does exist
and is equal to 0.5, and then write P (Head) = 0.5. We think of this limiting proportion 0.5
as the “probability” that the coin will land heads up in a single toss.
So there is a certain level of belief/assumption when we say that probability of head is 0.5.
Now we turn around the game. More generally the statement that a certain experimental
outcome has probability p can be interpreted as meaning that if the experiment is repeated
a large number of times, that outcome would be observed “about” (approximately) 100p
percent of the time. This interpretation of probabilities is called the relative frequency
interpretation. It is very natural in many applications of probability theory to real world
problems, especially to those involving the physical sciences.
Let us summarize what we have done. When we say that the probability of heads is 1/4 while
tossing a coin, we mean that if we toss the coin a large number of times, then approximately 1/4 of
the tosses should come up heads. The same interpretation can be applied when we say that the probability
of an even number is 1/2 when we throw a die.
Question: What interpretation can you give to the following statement: “Modi has a 70% chance
of winning the 2019 Lok Sabha elections”?
It is obvious that we cannot give a relative frequency interpretation to the above statement.
For the mathematical theory of probability the interpretation of probabilities is irrelevant.
We will use the relative frequency interpretation of probabilities only as an intuitive motivation
for the definitions and theorems we will be developing throughout the course.
Lecture 2: Sample Space, Countable & Uncountable Sets
January 7, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Definition 2.1 By a random experiment (or chance experiment), we mean an experiment
(either an imaginary thought experiment, which may never actually be performed but can be conceived
of as being performed, or a real physical process) which has multiple outcomes (at least two)
and for which one does not know in advance which outcome is going to occur unless one performs the
experiment. Also, when the experiment is performed, exactly one of the several possible outcomes
will be produced.

Example 2.2 Tossing a coin and throwing a die are random experiments. Unless you toss the
coin or throw the die you do not know what is coming up, even though we know all possible outcomes of the
random experiment.

Remark 2.3 There is no restriction on what constitutes a random experiment. For example,
it could be a single toss of a coin, or three tosses, or an infinite sequence of tosses. However,
it is important to note that in our mathematical model of randomness, there is only one
random experiment. So, three tosses of a coin constitute a single experiment, rather than
three experiments.

We begin with a model for a random experiment whose performance results in an idealized
outcome from a family of possible outcomes. The first element of the model is the specification
of an abstract sample space (Ω) representing the collection of idealized outcomes of
the random experiment. Next comes the identification of a family of events (F) of interest,
each event represented by an aggregate of elements of the sample space. The final element
of the model is the specification of a consistent scheme of assignment of probabilities (P) to
events. We consider these elements in turn.

2.1 The Sample Space


Definition 2.4 Sample space of a random experiment is the set of all idealized outcomes of
the random experiment. We denote it by the uppercase Greek letter Ω.

Example 2.5 In random experiment of tossing a coin, the sample space is Ω = {head, tail}
or {H, T }.

Remark 2.6 In the sample space for the random experiment in Example 2.5, one might also
include as possible outcomes “coin stands on the edge” or “coin disappears” (Shaktimaan
threw the coin and it crossed the gravity of the earth!!). It is not too serious if we admit
more things into our consideration than can really occur, but we want to make sure that we
do not exclude things that might occur. As a guiding principle, the sample space should have
enough detail to distinguish between all outcomes of interest to the modeler, while avoiding
irrelevant details. In this spirit we have made use of idealized outcomes.

Example 2.7 Toss two coins and note down the number of heads obtained. Here sample
space is Ω = {0, 1, 2}

Example 2.8 In the random experiment of tossing a coin till you get a head, the sample
space is
Ω = {H, TH, TTH, TTTH, ···}

Example 2.9 Consider throwing a dart on a square target and viewing the point of impact
as the outcome. The sample space is the set of all the points on the square.

The sample space of a random experiment may consist of a finite or an infinite number of
possible outcomes. Infinite sets are further divided into two categories: countably infinite
and uncountable. The sample space in Example 2.8 is countably infinite and the sample
space in Example 2.9 is uncountable.
That is to say, there are two different orders of infinity, and uncountable is a higher order
of infinity compared to countably infinite.
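The finite and countably infinite sample spaces of Examples 2.5, 2.7 and 2.8 can be written down explicitly. The sketch below is only an illustration (the names `COIN`, `heads_count_space` and `toss_until_head` are hypothetical); the countably infinite space is represented by a generator that lists its outcomes one at a time, since it cannot be stored in full.

```python
from itertools import count, islice, product

COIN = ("H", "T")

# Example 2.5: a single toss of a coin.
single_toss_space = set(COIN)

# Example 2.7: toss two coins and record only the number of heads.
two_tosses = list(product(COIN, repeat=2))
heads_count_space = sorted({tosses.count("H") for tosses in two_tosses})


def toss_until_head():
    """Example 2.8: yield H, TH, TTH, ... one outcome at a time (never terminates)."""
    for tails in count(0):
        yield "T" * tails + "H"


# Only finitely many outcomes can ever be listed explicitly.
first_outcomes = list(islice(toss_until_head(), 4))
```

The dart-board space of Example 2.9 has no such listing, which is the practical meaning of uncountability.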

2.1.1 Countable Sets

We want to measure the number of elements in a set. When S is a finite set, it is clear
that we may list its elements as {s_1, s_2, ···, s_n} for some positive integer n. When we
deal with infinite sets, things are not so straightforward. For example, the set of natural numbers
N “appears” to contain “double” the number of elements of the set S = {2n | n ∈ N} (the set of all
even positive integers). Can we say that the number of elements in N is 2 times the number of
elements in S? If your answer is yes, then are you saying that ∞ < 2 × ∞? So we run into
problems.
Now here is the idea due to Cantor. Let us recast the notion of finiteness as follows:
A set S has n elements iff there exists a bijection f : {1, 2, ···, n} → S.
So finiteness has been recast as a one-to-one correspondence between the set and a subset
of the natural numbers. This idea paves the way to define the following notion.

Definition 2.10 A set S is said to be countably infinite if there exists a function f : N → S
such that f is one-one and onto.

In other words, a set is countably infinite when it can be put into 1-to-1 correspondence with
the set of positive integers. In this case we say that the set S is as infinite as N.

Remark 2.11 Since every countably infinite set is the range of a 1-1 function defined on
N, we may regard every countably infinite set as the range of a sequence of distinct terms
(Note that in general the terms x1 , x2 , x3 , · · · of a sequence need not be distinct). Speaking
more loosely, we may say that the elements of any countably infinite set can be “arranged”
in a sequence of distinct terms.

Example 2.12 Let S be the set of all even positive integers. Then it is countable.

1    2    3    4    5    6    7    8    9   ···
↕    ↕    ↕    ↕    ↕    ↕    ↕    ↕    ↕
2    4    6    8   10   12   14   16   18   ···

It is highly counterintuitive that both sets have the “same order of infinity”, though
it appears that the set S has half as many elements as N.

Example 2.13 Set of all integers Z is a countable set.

1    2    3    4    5    6    7    8    9   ···
↕    ↕    ↕    ↕    ↕    ↕    ↕    ↕    ↕
0    1   −1    2   −2    3   −3    4   −4   ···

Once again, Z and N have the same number of elements in the sense of bijection.
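The correspondence between N and Z displayed in Example 2.13 can be written as an explicit formula: even n goes to n/2 and odd n goes to −(n − 1)/2. The sketch below (function names are illustrative) implements this map together with its inverse, so that the bijection can be checked on any finite range.

```python
def nat_to_int(n):
    """Map n = 1, 2, 3, ... to 0, 1, -1, 2, -2, ... as in Example 2.13."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n % 2 == 0:
        return n // 2
    return -((n - 1) // 2)


def int_to_nat(z):
    """Inverse map: the position of the integer z in the list 0, 1, -1, 2, -2, ..."""
    return 2 * z if z > 0 else 1 - 2 * z
```

Composing the two maps in either order gives the identity, which is exactly the statement that the map is one-one and onto.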

Example 2.14 The set of all positive rational numbers is countably infinite.
The following figure gives a schematic representation of listing the positive rationals. In
this figure the first row contains all positive rationals with numerator 1, the second all with
numerator 2, etc.; and the first column contains all with denominator 1, the second all with
denominator 2, and so on. Our listing amounts to traversing this array of numbers as the
arrows indicate, where of course all those numbers already encountered are left out.

We see that every natural number is associated with a unique positive rational number and
every positive rational number is associated with some natural number. Hence the set of all positive
rational numbers is countably infinite.
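The traversal described in Example 2.14 can be sketched as a generator. This is an illustration under stated assumptions: entry (p, q) of the array is taken to be p/q (row p = numerator, column q = denominator), the diagonals p + q = 2, 3, 4, ... are walked in alternating directions, and the direction convention is chosen so that the listing starts 1, 1/2, 2, 3, 1/3, ...; a fraction that is not in lowest terms is skipped because its value has already appeared on an earlier diagonal.

```python
from fractions import Fraction
from itertools import islice
from math import gcd


def positive_rationals():
    """Yield every positive rational exactly once by a zigzag walk over the
    diagonals p + q = 2, 3, 4, ... of the array whose (p, q) entry is p/q."""
    s = 2
    while True:
        # Alternate the direction of travel on successive diagonals.
        numerators = range(1, s) if s % 2 == 1 else range(s - 1, 0, -1)
        for p in numerators:
            q = s - p
            if gcd(p, q) == 1:  # skip values already encountered, e.g. 2/2 = 1
                yield Fraction(p, q)
        s += 1


first_nine = list(islice(positive_rationals(), 9))
```

The position of a rational in this listing is the natural number associated with it.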

Example 2.15 Show that set of all rational numbers is countably infinite.

Solution: In proof of countability of positive rationals, we counted as follows:

1    2    3    4    5    6    7    8    9   ···
↕    ↕    ↕    ↕    ↕    ↕    ↕    ↕    ↕
1   1/2   2    3   1/3  1/4  2/3  3/2   4   ···

Now, just as we did when counting the integers, we interleave zero and the negatives to include
the non-positive rationals as follows:

1    2    3    4    5    6    7    8    9   ···
↕    ↕    ↕    ↕    ↕    ↕    ↕    ↕    ↕
0    1   −1   1/2  −1/2  2   −2    3   −3   ···

Definition 2.16 An infinite set which is not countably infinite is called uncountable.

Example 2.17 Set of irrational numbers is uncountable. Hence set of real numbers is also
uncountable. In fact, all the intervals

(a, b), (a, b], [a, b), [a, b]

where a, b ∈ R and a < b, are uncountable.


Lecture 3: Review of Set Theory and Field of Events
January 8, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Definition 3.1 The elements ω of Ω will be called sample points, each sample point ω being
identified with an idealised outcome of the underlying random experiment.

In Example 2.5, Head and Tail are the sample points, in Example 2.7, 0, 1, 2 are the sample
points.

Definition 3.2 Any subset of a sample space is said to be an event.

Example 3.3 {H}, {H, T } are events corresponding to the sample space in Example 2.5.
{0}, {1, 2}, {0, 2} are events corresponding to the sample space in Example 2.7

Since the empty set is a subset of every set, it is also a valid event.
It is called the impossible event. An event to which no outcome corresponds is also
referred to as a null event.

Remark 3.4 H is a sample point and {H} is an event. Similarly, 0 is a sample point and {0} is an event.

Recall that the first two elements of a probability model are the sample space Ω and the σ-field F.
Both are sets. Also, the σ-field is a collection of events (which are subsets of Ω). Our model
therefore makes extensive use of set operations, so let us introduce at the outset the relevant notation
and terminology.
A set is a collection of objects, which are the elements of the set. If S is a set and x is an
element of S, we write x ∈ S. If x is not an element of S, we write x ∉ S. A set can have
no elements (for example, the set of all 20-foot-tall students in the LNMIIT), in which case
it is called the empty set, denoted by ∅.
In probability model the sample space Ω will be considered as universal set, which contains
all objects that could conceivably be of our interest in a particular context.

3.1 Set Operations


If every element of a set S is also an element of a set T , we say that S is a subset of T , and
we write S ⊂ T or T ⊃ S. If S ⊂ T and T ⊂ S, then two sets are equal , and we write
S = T.
Given sets S and T , new sets may be constructed by union, intersection, and set differences.
The union of two sets S and T , is the set of all elements that belong to S or T (or both), and
is denoted by S ∪ T . Also it is trivial to see S ∪ T = T ∪ S and S ∪ (T ∪ U ) = (S ∪ T ) ∪ U .
First one says that union is commutative operation and second one states associativity of
union.
The intersection of two sets S and T is the set of all elements that belong to both S and T ,
and is denoted by S ∩ T . Once again intersection is also commutative and associative.
Because of the associative laws, however, we can write

A ∪ B ∪ C, A ∩ B ∩ C ∩ D

without brackets. But a string of symbols like A ∪ B ∩ C is ambiguous, therefore not defined;
indeed (A ∪ B) ∩ C is not identical with A ∪ (B ∩ C).
The set difference S \ T is the set whose members are those elements of S that are not
contained in T .
Let S ⊂ Ω. Then the complement of S with respect to Ω is the set $\{x \in \Omega \mid x \notin S\}$ of
all elements of Ω that do not belong to S, and is denoted by $S^c$. Note that $\Omega^c = \emptyset$ (with
respect to Ω). It is easy to see that $(S^c)^c = S$ and $S \cap S^c = \emptyset$.

Proposition 3.5 For any three sets A, B, C we have

1. Distributive laws of sets:

(a) $(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$
(b) $(A \cap B) \cup C = (A \cup C) \cap (B \cup C)$

2. De Morgan’s laws:

(a) $(A \cup B)^c = A^c \cap B^c$
(b) $(A \cap B)^c = A^c \cup B^c$
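The identities of Proposition 3.5 can be checked exhaustively on a small universal set. The sketch below is not a proof; it merely tests all triples of subsets of a four-element set (the set `OMEGA` and the helper names are chosen only for illustration).

```python
from itertools import combinations

OMEGA = frozenset(range(4))  # a small universal set, chosen only for illustration


def complement(s):
    """Complement of s with respect to OMEGA."""
    return OMEGA - s


def all_subsets(universe):
    """List every subset of a finite set as a frozenset."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def laws_hold(a, b, c):
    """Check both distributive laws and both De Morgan laws for a, b, c."""
    return (
        (a | b) & c == (a & c) | (b & c)
        and (a & b) | c == (a | c) & (b | c)
        and complement(a | b) == complement(a) & complement(b)
        and complement(a & b) == complement(a) | complement(b)
    )


SUBSETS = all_subsets(OMEGA)
all_ok = all(laws_hold(a, b, c) for a in SUBSETS for b in SUBSETS for c in SUBSETS)
```

The same helpers show why brackets matter: with A = {0}, B = {1}, C = {1}, the set (A ∪ B) ∩ C is {1} whereas A ∪ (B ∩ C) is {0, 1}.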

In some cases, we will have to consider the union or the intersection of infinitely many sets.

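For example, if for every positive integer n, we are given a set $S_n$, then

\[ \bigcup_{n=1}^{\infty} S_n = S_1 \cup S_2 \cup \cdots = \{x \mid x \in S_n \text{ for at least one } n\}, \]
\[ \bigcap_{n=1}^{\infty} S_n = S_1 \cap S_2 \cap \cdots = \{x \mid x \in S_n \text{ for all } n\}. \]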

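These are also called countable union and countable intersection, respectively. If for every
real number α ∈ (0, 1), we are given a set $S_\alpha$, then

\[ \bigcup_{\alpha} S_\alpha = \{x \mid x \in S_\alpha \text{ for at least one } \alpha\} \quad \text{(uncountable union)}, \]
\[ \bigcap_{\alpha} S_\alpha = \{x \mid x \in S_\alpha \text{ for all } \alpha\} \quad \text{(uncountable intersection)}. \]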

Two sets are said to be disjoint if their intersection is empty. More generally, several sets
(finite or infinite) are said to be disjoint if no two of them have a common element.
Sets and the associated operations are easily visualized in terms of Venn diagrams.

3.2 The algebra (or Field) of events


We suppose now that we are dealing with a fixed, sample space Ω representing a chance
experiment. The events of interest are subsets of the space Ω at hand. In the probabilistic
setting, the special set Ω connotes the certain event, ∅ the impossible event. Probabilistic
language lends flavour to the dry terminology of sets. We say that an event A occurs if the
performance of the random experiment yields an outcome ω ∈ A. If A and B are events
and A ⊂ B then the occurrence of A implies the occurrence of B. If, on the other hand,
A ∩ B = ∅, that is to say, they are disjoint, the occurrence of one precludes the occurrence
of the other and we say that A and B are mutually exclusive.
Given events A and B, it is natural to construct new events of interest by applying various set
operations: union, intersection, set difference. Clearly, it is desirable when discussing the
family F of events to include in F all sets that can be obtained by such natural combinations
of events in the family.
Note that intersections may be written in terms of unions and complements alone. That is,
$A \cap B = (A^c \cup B^c)^c$. Similarly, we may write the set difference as $A \setminus B = A \cap B^c = (A^c \cup B)^c$.
Therefore it will be enough for the family F of events under consideration to be closed under
complements and unions. It is clear that, as long as F is non-empty, it must contain
both Ω and ∅. Indeed, if A is any event then so is $A^c$, whence so are $A^c \cup A = \Omega$ and $A^c \cap A = \emptyset$.

Definition 3.6 Let Ω be a nonempty set. A collection F of subsets of Ω is called an algebra
if

(i) $\Omega \in \mathcal{F}$
(ii) $A \in \mathcal{F} \implies A^c \in \mathcal{F}$ (in other words, F is closed under complementation).
(iii) $A, B \in \mathcal{F} \implies A \cup B \in \mathcal{F}$ (i.e., F is closed under finite union).

Remark 3.7 The meaning of the statement $A \in \mathcal{F} \implies A^c \in \mathcal{F}$ is that if A ∈ F then
$A^c \in \mathcal{F}$. It does not mean that F contains the complement of every subset A of Ω. Similarly,
the meaning of the statement $A, B \in \mathcal{F} \implies A \cup B \in \mathcal{F}$ is that if A, B are two sets in F then
their union is also in F. It does not mean that F contains the union of any two subsets A, B
of Ω.

Example 3.8 (i) Let Ω be any nonempty set and let F = {Ω, ∅}. Then F is an algebra.
(ii) Let Ω be any nonempty set; then the power set P(Ω) is an algebra.
(iii) Let Ω = {a, b, c, d}. Consider the classes

$\mathcal{F}_1 = \{\Omega, \emptyset, \{a\}\}$

and

$\mathcal{F}_2 = \{\Omega, \emptyset, \{a\}, \{b, c, d\}\}$.

Then $\mathcal{F}_2$ is an algebra, but $\mathcal{F}_1$ is not an algebra, since $\{a\}^c = \{b, c, d\} \notin \mathcal{F}_1$.
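For a finite Ω, the three conditions of Definition 3.6 can be checked mechanically. The following sketch (the function name `is_algebra` is illustrative) tests membership of Ω, closure under complementation and closure under pairwise union, and applies the test to the two classes of Example 3.8 (iii).

```python
from itertools import combinations


def is_algebra(omega, family):
    """Check conditions (i)-(iii) of Definition 3.6 for a family of subsets
    of a finite set omega."""
    omega = frozenset(omega)
    family = {frozenset(s) for s in family}
    if omega not in family:                             # (i)
        return False
    if any(omega - a not in family for a in family):    # (ii)
        return False
    return all(a | b in family for a, b in combinations(family, 2))  # (iii)


OMEGA = {"a", "b", "c", "d"}
F1 = [OMEGA, set(), {"a"}]
F2 = [OMEGA, set(), {"a"}, {"b", "c", "d"}]

f1_is_algebra = is_algebra(OMEGA, F1)
f2_is_algebra = is_algebra(OMEGA, F2)
```

Closure under pairwise unions suffices here because any finite union can be built up two sets at a time.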
Lecture 4: σ-Field and Probability Measure
January 9, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Example 4.1 (An algebra or field generated by intervals) Identifying Ω with the right-closed
(or half-closed) unit interval (0, 1], let J be the set of all half-closed subintervals of Ω
of the form (a, b]. Let R(J) denote the family of all finite unions of the half-closed intervals
in J. The set R(J) is clearly non-empty (and, in particular, contains (0, 1]), is by definition
closed under finite unions, and, as $(a, b]^c = (0, 1] \setminus (a, b] = (0, a] \cup (b, 1]$, it follows that R(J)
is closed under complements as well. The collection R(J) is hence a field.

When dealing with an infinite sample space, as in Example 4.1, it is natural to consider countably
infinite unions or intersections of events to get other events. For example, let us consider
the events (0, 1/2], (0, 2/3], (0, 3/4], ···, (0, 1 − 1/n], ···. As this is an increasing sequence
of events, the finite union of the first n − 1 of these intervals is (0, 1 − 1/n] (and in particular
is contained in J, hence in R(J)). Now observe that

\[ \bigcup_{n=2}^{\infty} (0, 1 - 1/n] = (0, 1). \]

Also (0, 1) ∉ R(J). It follows that the algebra R(J) is not closed under countable unions:
countably infinite set operations on the members of R(J) can yield quite natural sets that
are not in it. Examples of this nature suggest that it would be quite embarrassing if the
family of events that we are willing to consider did not make provision for the events which
could be obtained by applying countably infinite unions and intersections.
Bearing in mind this caution, we must entertain countably infinite combinations of events.
In this context the Greek letter σ (pronounced “sigma”) is used universally to indicate a
countable infinity of operations.

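Definition 4.2 A collection F of subsets of Ω is called a σ-algebra or σ-field (or sometimes Borel
field) if

(i) $\Omega \in \mathcal{F}$
(ii) $A \in \mathcal{F} \implies A^c \in \mathcal{F}$.
(iii) $A_1, A_2, \cdots \in \mathcal{F} \implies \bigcup_{n=1}^{\infty} A_n \in \mathcal{F}$ (i.e., F is closed under countable union).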

The classes in Example 3.8 (i), (ii) are σ-algebras. If the sample space Ω is finite, then any field is a
σ-field.
One can ask: what is the σ-field in the setup of Example 4.1 which contains the class J? It is not
obvious what that should be. Indeed, this is by no means a trivial problem, and questions
of this type are in the province of a branch of advanced mathematics called measure theory
and cannot be dealt with at this level. But it is clear that this σ-algebra should contain all
the subintervals of (0, 1] of the form

[a, b], (a, b), and [a, b).

Also it will contain all the sets that can be formed by taking (possibly countably infinite)
unions and intersections of sets of above subintervals.
So it is natural to ask: if we are not saying what exactly constitutes the σ-field, then
why should one talk about it at all? We offer two reasons for doing so. First, the notion of
σ-field allows us to define other concepts in probability theory in a precise manner. Second,
auxiliary quantities (especially random variables) quickly become the dominant theme of
the theory and the σ-field itself fades into the background.
Hence now onwards we will be assuming that the events at hand are elements of some suitably
large σ-algebra of events without worrying about the specifics of what exactly constitutes
the family.

4.1 The probability measure


So we now begin with an abstract sample space Ω equipped with a σ-algebra F containing
the events of interest to us. The last element in a probability model is to determine a
consistent scheme of assigning probabilities to events.

Definition 4.3 A function P from F to the set of real numbers is called a probability measure
if it satisfies the following properties:

1. (Nonnegativity) P(A) ≥ 0 for all events A.

2. (Normalization) P(Ω) = 1.

3. (Countable Additivity) If $A_1, A_2, \cdots$ is a sequence of disjoint events (i.e., $A_i \cap A_j = \emptyset$
if $i \neq j$), then the probability of their union satisfies

\[ P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n). \]

The triple (Ω, F, P ) is called a probability space.


How do these axioms gel with our experience? Our intuitive assignment of probabilities to
results of chance experiments is based on an implicit mathematical idealisation of the notion
of limiting relative frequency. Suppose A is an event. If, in n independent trials (we use the
word “independent” here in the sense that we attach to it in ordinary language) A occurs
m times then it is natural to think of the relative frequency m/n of the occurrence of A as
a measure of its probability. Indeed, we anticipate that in a long run of trials the relative
frequency of occurrence becomes a better and better fit to the “true” underlying probability
of A.
As 0 ≤ m/n ≤ 1, the positivity and normalisation axioms are natural if our intuition for
odds in games of chance is to mean anything at all. The selection of 1 as normalisation
constant is a matter of convention.
Likewise, from the point of view of relative frequencies, if A and B are two mutually exclusive events
and if in n independent trials A occurs m1 times and B occurs m2 times, then the relative
frequency of occurrence of either A or B in n trials is (m1 + m2 )/n = m1 /n + m2 /n.
Probability measure is now forced to be additive if it is to be consistent with experience.
As we mentioned earlier, countable additivity is a natural thing to assume in order to
compute the probability of events which are countable unions of disjoint events.
A probability measure is additive. To visualize countable additivity, think of a probability
measure as mass. It assigns a mass of 1 unit to the sample space Ω (think of some physical
object). Every subset (or piece) of Ω has mass ≥ 0. And if we have pieces
A1, A2, ··· which are disjoint, then the mass of their union is the sum of the individual masses.
Probability behaves like length, area, and volume. These are natural examples of measures,
and all the axioms of probability are natural in these examples.
Lecture 5: Properties of Probability Measure
January 11, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Example 5.1 Consider an experiment of throwing a dart at the unit square target. The
“area” of a set is a natural candidate for a probability measure which satisfies all the axioms. The
importance of the σ-field in probability theory is illustrated by the fact that only “nice” subsets of
the unit square are measurable in terms of area. Consider, for example, a subset which is too jagged,
rough, or inaccessible for its area to be defined.

So from a purely mathematical point of view, we cannot work with the power set (the largest
σ-algebra) in this case. Hence we have to restrict ourselves to the class of subsets for which we
can define area. The collection of those nice sets forms our σ-field.

5.1 Deductions from the Axioms


There are many natural properties of a probability measure, which have not been included in
the definition for the simple reason that they can be derived using the axioms. In this respect
the axioms of a mathematical theory are like the constitution of a government. Unless and
until it is changed or amended, every law must be made to follow from it. In mathematics
we have the added assurance that there are no divergent views as to how the constitution
should be construed.

Theorem 5.2 (Properties of Probability Measure) Let (Ω, F, P) be a probability
space. Then we have the following:

1. P (∅) = 0.
2. (Finite Additivity) If A and B are two mutually exclusive events, then
P(A ∪ B) = P(A) + P(B).
3. For any event A, $P(A^c) = 1 - P(A)$.
4. For any two events such that A ⊂ B, we have P (B \ A) = P (B) − P (A).
5. (monotonicity) For any two events such that A ⊂ B, we have P (A) ≤ P (B).
6. P (A) ≤ 1 for any event A.
7. (finite sub-additivity) For any two events A and B, we have P (A ∪ B) ≤ P (A) +
P (B).
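8. (Continuity) Let $A_n$, $n \ge 1$, be events.

(a) If $A_1 \subset A_2 \subset \cdots$, then $P\left(\bigcup_{k=1}^{\infty} A_k\right) = \lim_{k\to\infty} P(A_k)$.

(b) If $A_1 \supset A_2 \supset \cdots$, then $P\left(\bigcap_{k=1}^{\infty} A_k\right) = \lim_{k\to\infty} P(A_k)$.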
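Properties 1 to 7 can be sanity-checked numerically on a small finite probability space. The sketch below is not a proof and does not touch the continuity property 8; it assumes a hypothetical loaded die in which face k has probability k/21, takes the power set as the family of events, and uses exact rational arithmetic so that equalities are tested without rounding error.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset(range(1, 7))
# Hypothetical loaded die: face k has probability k/21 (an assumption for illustration).
WEIGHTS = {face: Fraction(face, 21) for face in OMEGA}


def prob(event):
    """P(A) as the sum of the point masses of the outcomes in A."""
    return sum((WEIGHTS[w] for w in event), Fraction(0))


# All 2^6 = 64 events of the power set.
EVENTS = [frozenset(c) for r in range(len(OMEGA) + 1)
          for c in combinations(sorted(OMEGA), r)]


def check_properties():
    """Return True if properties 1-7 of Theorem 5.2 hold for all events."""
    ok = prob(frozenset()) == 0                              # property 1
    for a in EVENTS:
        ok &= prob(OMEGA - a) == 1 - prob(a)                 # property 3
        ok &= prob(a) <= 1                                   # property 6
        for b in EVENTS:
            if not (a & b):
                ok &= prob(a | b) == prob(a) + prob(b)       # property 2
            if a <= b:
                ok &= prob(b - a) == prob(b) - prob(a)       # property 4
                ok &= prob(a) <= prob(b)                     # property 5
            ok &= prob(a | b) <= prob(a) + prob(b)           # property 7
    return ok
```

Any other choice of non-negative weights summing to 1 could be substituted for `WEIGHTS` without changing the outcome of the check.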

Proof:

1. Let $A_n = \emptyset$ for all n; then the $A_n$’s are disjoint. Therefore by countable additivity

\[ P(\emptyset) = P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n) = \sum_{n=1}^{\infty} P(\emptyset). \]

This is possible iff $P(\emptyset) = 0$.


2. Set $A_1 = A$, $A_2 = B$, $A_k = \emptyset$ for $k = 3, 4, \cdots$. Then $(A_n)$ is a sequence of mutually
exclusive events and $\bigcup_{n=1}^{\infty} A_n = A \cup B$. Therefore by countable additivity we get

\[ P(A \cup B) = P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n) = P(A) + P(B) + \sum_{n=3}^{\infty} P(\emptyset) = P(A) + P(B). \]

3. $A \cup A^c = \Omega$. Also A and $A^c$ are mutually exclusive events; hence, using normalization
and finite additivity of the probability measure, $1 = P(\Omega) = P(A) + P(A^c)$, and we are done.
4. Note that B = A ∪ (B \ A). Now A and B \ A are mutually exclusive. Using finite
additivity, P(B) = P(A) + P(B \ A), which gives the result.

5. Using the previous property and non-negativity, P(B) − P(A) = P(B \ A) ≥ 0, and we are done.

6. Since A ⊂ Ω, by the normalization axiom and monotonicity, P(A) ≤ P(Ω) = 1, and we are done.

7. Note that A ∪ B = A ∪ (B \ (A ∩ B)). Now A and B \ (A ∩ B) are mutually exclusive.
Also, from 4 above, P(B \ (A ∩ B)) = P(B) − P(A ∩ B). So we obtain

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Non-negativity of P(A ∩ B) now gives the desired result.



8. (a) Suppose $A_1 \subset A_2 \subset \cdots$ and $A := \bigcup_{k=1}^{\infty} A_k$. Set $B_1 = A_1$, and for each $n \ge 2$, let
$B_n$ denote those points which are in $A_n$ but not in $A_{n-1}$, i.e., $B_n = A_n \setminus A_{n-1}$. By
definition, the sets $B_n$ are disjoint. Also $A_n = \bigcup_{k=1}^{n} B_k$ and $A = \bigcup_{k=1}^{\infty} B_k = \bigcup_{k=1}^{\infty} A_k$.
Hence

\[ P(A_n) = \sum_{k=1}^{n} P(B_k). \]

Since the left side above cannot exceed 1 for any n and $P(B_k) \ge 0$ for all k, the sequence
of partial sums is increasing and bounded above; hence the series on the right side
must converge. Hence we obtain

\[ \lim_{n\to\infty} P(A_n) = \lim_{n\to\infty} \sum_{k=1}^{n} P(B_k) =: \sum_{k=1}^{\infty} P(B_k) = P(A). \tag{5.1} \]

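(b) Now if $A_1 \supset A_2 \supset \cdots$, then $A_1^c \subset A_2^c \subset \cdots$. Hence by part (a),

\begin{align*}
P\left(\bigcup_{k=1}^{\infty} A_k^c\right) &= \lim_{k\to\infty} P(A_k^c) \\
1 - P\left[\left(\bigcup_{k=1}^{\infty} A_k^c\right)^c\right] &= \lim_{k\to\infty} \left[1 - P(A_k)\right] \\
1 - P\left(\bigcap_{k=1}^{\infty} A_k\right) &= 1 - \lim_{k\to\infty} P(A_k) \\
P\left(\bigcap_{k=1}^{\infty} A_k\right) &= \lim_{k\to\infty} P(A_k)
\end{align*}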
MATH-221: Probability and Statistics
Tutorial # 1 (Countable & uncountable sets, field & σ-field, Probability Measure)

1. If A is a finite set and B is a countably infinite set then show that A ∪ B is countably
infinite.

2. If $A_1, A_2, \cdots$ are countably infinite events, then show that $\bigcup_{n=1}^{\infty} A_n$ is countably infinite.

3. Show that the open interval (0, 1) is uncountable and hence show that the open interval
(a, b) is uncountable, where a and b are any two real numbers with a < b.
4. Let Ω ≠ ∅ be a finite set and let F be a field of subsets of Ω. Show that F is a σ-field
(or Borel field).
5. Let Ω = (−∞, ∞), the real line. Suppose B is a σ-field (or Borel field or σ-algebra)
which contains all the intervals of the form (−∞, x], x ∈ R. Then show that B also
contains all the intervals of the form
[a, b], (a, b), (a, b], [a, b), where a, b ∈ R, a < b.

6. Let A, B, C be events such that P(A) = 0.7, P(B) = 0.6, P(C) = 0.5, P(A ∩ B) =
0.4, P(A ∩ C) = 0.3, P(C ∩ B) = 0.2 and P(A ∩ B ∩ C) = 0.1. Find $P(A \cup B \cup C)$,
$P(A^c \cap C)$ and $P(A^c \cap B^c \cap C^c)$.
7. Prove or disprove: If P (A ∩ B) = 0 then A and B are mutually exclusive events.
8. Does there exist a probability measure (or function) P such that the events A, B, C
satisfy P(A) = 0.6, P(B) = 0.8, P(C) = 0.7, P(A ∩ B) = 0.5, P(A ∩ C) = 0.4,
P(C ∩ B) = 0.5 and P(A ∩ B ∩ C) = 0.1?
9. For events $A_1, A_2, \cdots, A_n$, show that $P(A_1 \cap A_2 \cap \cdots \cap A_n) \ge \sum_{i=1}^{n} P(A_i) - n + 1$.

10. Let Ω = N. Define a set function P as follows: for A ⊂ Ω,

\[ P(A) = \begin{cases} 0 & \text{if } A \text{ is finite}, \\ 1 & \text{if } A \text{ is infinite}. \end{cases} \]

Is P a probability measure (or function)?
11. Let $A_1, A_2, \cdots$ be a sequence of events. Show that

\[ P\left(\bigcup_{n=1}^{\infty} A_n\right) \le \sum_{n=1}^{\infty} P(A_n). \]

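12. Let $A_n$, $n \ge 1$, be a sequence of events. Then prove the following:

(a) If $A_1 \subset A_2 \subset \cdots$, then $P\left(\bigcup_{k=1}^{\infty} A_k\right) = \lim_{k\to\infty} P(A_k)$.

(b) If $A_1 \supset A_2 \supset \cdots$, then $P\left(\bigcap_{k=1}^{\infty} A_k\right) = \lim_{k\to\infty} P(A_k)$.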
Lecture 7: Example of Probability spaces
January 18, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

A probability space has three elements: Ω, the sample space; F, the collection of events; and P, the
probability measure. The choice of sample space is relatively easy and clear from the underlying
random experiment. Choosing an appropriate σ-field might be difficult, but one can assume the
existence of a σ-field which contains the events of interest. Defining a probability function is not
at all an easy task, and the definition makes no attempt to tell which particular function P to
choose; it merely requires P to satisfy the axioms. As we shall see, for a sample space
with the same σ-algebra we may define many different probability measures.
Now we illustrate how to construct probability spaces starting from some common sense
assumptions about the random experiment.

7.1 Finite Sample Space


If the sample space consists of a finite number of possible outcomes, then we can take any
σ-field F (including the largest one, the power set of the sample space). The probability
measure is specified by the probabilities of each single outcome. Let $\Omega = \{\omega_1, \omega_2, \cdots, \omega_n\}$
for some $n \in \mathbb{N}$. Then choose numbers $p_1, p_2, \cdots, p_n$ such that

\[ p_i \ge 0, \quad \text{and} \quad \sum_{i=1}^{n} p_i = 1. \]

Define $P(\{\omega_i\}) = P(\omega_i) = p_i$ for each $i = 1, 2, \cdots, n$. If A is any event in F, then
$P(A) = \sum_{\omega \in A} P(\omega)$. Then P is a probability measure. Indeed,

1. $P(A)$ is the sum of those $P(\omega_i)$ for which $\omega_i \in A$, and $P(\omega_i) = p_i \ge 0$; hence their sum is also
non-negative.

2. $P(\Omega) = \sum_{i=1}^{n} P(\omega_i) = \sum_{i=1}^{n} p_i = 1$.

3. If $A_1, A_2, \cdots$ is a sequence of pairwise disjoint events, then

\[ P\left(\bigcup_k A_k\right) = \sum_{\omega \in \bigcup_k A_k} P(\omega) = P(A_1) + P(A_2) + \cdots \]

Example 7.1 In a coin tossing experiment, let us assign probability p to heads
and 1 − p to tails, where 0 ≤ p ≤ 1. Then this assignment satisfies all the properties of a
probability measure. Hence, we see that there are infinitely many ways to define a probability
law on a sample space.
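The construction of Section 7.1 translates almost word for word into code: choose point masses, check that they are non-negative and sum to 1, and define P(A) as the sum over the outcomes in A. The sketch below is one possible rendering (the names `make_measure` and `coin` are illustrative); exact fractions are assumed as inputs so that the normalisation check is not affected by rounding.

```python
from fractions import Fraction


def make_measure(point_masses):
    """Build P from a dict {outcome: p_i}; P(A) is the sum of p_i over outcomes in A."""
    if any(p < 0 for p in point_masses.values()):
        raise ValueError("point masses must be non-negative")
    if sum(point_masses.values()) != 1:
        raise ValueError("point masses must sum to 1")

    def P(event):
        return sum((point_masses[w] for w in event), Fraction(0))

    return P


def coin(p):
    """Probability law of Example 7.1: heads with probability p, tails with 1 - p."""
    return make_measure({"H": p, "T": 1 - p})


P_quarter = coin(Fraction(1, 4))
P_fair = coin(Fraction(1, 2))
```

Each admissible value of p yields a different measure on the same sample space {H, T} with the same σ-field, which is the point of the example.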

7.1.1 Equally likely outcomes or Discrete Uniform Probability law

There is a special probability law when the sample space is finite. It is based on symmetry, which is a very natural thing to assume in the absence of any bias or additional information. If we assume that all the outcomes $\omega_1, \omega_2, \dots, \omega_n$ are “equally likely”, i.e., they have the same chance of occurring, then $P(\omega_i) = \frac{1}{n}$ for each $i = 1, 2, \dots, n$. This probability assignment is called the discrete uniform probability law. You must realize by now that whatever probability theory you have done in your earlier classes falls into this setup. Recall your definition of the probability of an event $A$:
\[ P(A) = \frac{\text{Number of outcomes favourable to } A}{\text{Number of all possible outcomes of the experiment}}. \]
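The construction of this section can be sketched in a few lines of Python. This is only an illustrative sketch: the helper name `make_measure` and the two example spaces (a coin with $p = 1/3$ and a fair die) are assumptions made here and do not come from the notes.

```python
from fractions import Fraction

def make_measure(point_probs):
    """Build P(A) = sum of point probabilities over the outcomes in A.

    point_probs maps each outcome to its probability p_i (hypothetical helper).
    """
    if any(p < 0 for p in point_probs.values()):
        raise ValueError("point probabilities must be non-negative")
    if sum(point_probs.values()) != 1:
        raise ValueError("point probabilities must sum to 1")

    def P(event):
        # Sum the probabilities of the single outcomes belonging to the event.
        return sum((point_probs[w] for w in event), Fraction(0))

    return P

# A biased coin (Example 7.1 with p = 1/3) and a fair die (discrete uniform law).
coin = make_measure({"H": Fraction(1, 3), "T": Fraction(2, 3)})
die = make_measure({k: Fraction(1, 6) for k in range(1, 7)})

print(coin({"H", "T"}))  # probability of the whole sample space
print(die({2, 4, 6}))    # favourable outcomes / all outcomes
```

Exact fractions are used so that the normalization check is not disturbed by floating-point rounding.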

7.2 Countable Sample Space

We shall show how easy it is to construct probability measures on any countable space $\Omega = \{\omega_1, \omega_2, \dots\}$. Once again we can work with any σ-field $\mathcal{F}$ (including the largest one, the power set of the sample space). The probability measure is specified by the probabilities of the single outcomes. Each sample point $\omega_n$ has probability $p_n$ subject to the conditions
\[ p_n \ge 0 \ \text{for all } n, \qquad \sum_{n=1}^{\infty} p_n = 1. \tag{7.1} \]

In symbols, we write $P(\{\omega_n\}) = P(\omega_n) = p_n$ for all $n$. Now for any subset $A$ of $\Omega$, we define its probability to be
\[ P(A) := \sum_{\omega \in A} P(\omega). \]

Once again it is very easy to verify all the axioms of probability function.

Example 7.2 In the random experiment of tossing a coin until you get a head, the sample space is
\[ \Omega = \{H, TH, TTH, TTTH, \dots\}. \]
It is clear that the sample space is countably infinite. We assign the probabilities
\[ P(H) = \frac{1}{2}, \quad P(TH) = \frac{1}{2^2}, \quad P(TTH) = \frac{1}{2^3}, \quad \dots \]
Since $\sum_{n=1}^{\infty} \frac{1}{2^n} = 1$, this is a valid probability measure. One can define another probability measure:
\[ P(H) = \frac{2}{3}, \quad P(TH) = \frac{2}{3^2}, \quad P(TTH) = \frac{2}{3^3}, \quad \dots \]
Since
\[ \sum_{n=1}^{\infty} \frac{2}{3^n} = 2 \cdot \frac{1/3}{1 - 1/3} = 1, \]
this also defines a valid probability measure.

In fact, let $(a_n)_{n \ge 1}$ be a sequence of non-negative terms, not all zero, such that $\sum_{n=1}^{\infty} a_n < \infty$. Define
\[ S = \sum_{n=1}^{\infty} a_n, \]
which is a positive real number. Now define the probability of each outcome as follows:
\[ P(\omega_n) = \frac{a_n}{S}, \quad \text{for each } n = 1, 2, \dots \]
Then we get a probability measure. As a particular example, recall that $\sum_{n=1}^{\infty} \frac{1}{n^p}$ converges for every real number $p > 1$. Let us denote the sum by $S_p$. Then for every $p > 1$, we define a probability measure as follows:
\[ P(\omega_n) = \frac{1}{S_p \, n^p}, \quad \text{for each } n = 1, 2, \dots \]

So we have defined uncountably many probability measures on the same sample space and the same σ-field.
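A rough numerical sanity check of the two constructions above, as a sketch only. The choice $p = 2$ is an assumption made here for illustration; it uses the known value $S_2 = \pi^2/6$. The truncation points (59 and 1000 terms) are arbitrary.

```python
import math

# Geometric weights 2/3^n from Example 7.2: partial sums approach 1.
partial = sum(2 / 3**n for n in range(1, 60))

# Weights a_n = 1/n^2 normalized by S_2 = pi^2 / 6 (illustrative choice p = 2).
S2 = math.pi**2 / 6

def P(n):
    """Probability of the n-th outcome under the normalized 1/n^2 weights."""
    return 1 / (S2 * n**2)

# The first 1000 outcomes carry almost, but not quite, all of the mass.
mass_first_1000 = sum(P(n) for n in range(1, 1001))
print(partial, mass_first_1000)
```

The second number stays strictly below 1 because the infinitely many remaining outcomes still carry a small positive mass.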

Exercise 7.3 Show that all outcomes of a countably infinite sample space cannot be equally likely.

Solution: Suppose the contrary, i.e., there exists $p \ge 0$ such that $P(\omega_n) = p$ for all $n = 1, 2, \dots$. If $p = 0$, countable additivity gives $P(\Omega) = 0$, which contradicts $P(\Omega) = 1$. If $p > 0$, then by countable additivity
\[ P(\Omega) = \sum_{n=1}^{\infty} P(\omega_n) = \sum_{n=1}^{\infty} p = \infty, \]
which is again a contradiction.

7.3 Uncountable Sample Space


The situation is more complicated when we deal with an uncountable sample space. In general it is not possible to define a probability for every subset of the sample space and at the same time remain consistent with the axioms of probability and their consequences. We have already seen an example where we cannot talk about the area of every subset of a unit square.

Example 7.4 Let $\Omega = [0, \infty)$. Once again it may not be possible to define a probability for every subset of $\Omega$, so we again consider a σ-field which contains all subintervals of $\Omega$. For a subinterval $[a, b)$ define the probability as
\[ P\{[a, b)\} = e^{-a} - e^{-b}, \quad 0 \le a < b \le \infty, \]
with the convention $e^{-\infty} = 0$. Once again it is easy to verify the axioms of probability for the class of intervals of the form $[a, b)$.

1. If $a < b$ then $-a > -b$, and $e^x$ is a strictly increasing function, hence $e^{-a} > e^{-b}$. Hence we have non-negativity.
2. $P(\Omega) = P\{[0, \infty)\} = e^{-0} - 0 = 1$.
3. Let us verify finite additivity on an example:
\[ P\{[0, 2)\} = P\{[0, 1) \cup [1, 2)\}, \qquad e^{-0} - e^{-2} = (e^{-0} - e^{-1}) + (e^{-1} - e^{-2}). \]

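One may jump to the conclusion from Exercise 7.3 that in an uncountable sample space (since every uncountable set has a countably infinite subset) all outcomes cannot be equally likely. But this is not the case. For example, in Example 7.4, continuity of probability along a decreasing sequence of events gives
\[ P\{a\} = P\Big(\bigcap_{n=1}^{\infty} \big[a, a + \tfrac{1}{n}\big)\Big) = \lim_{n \to \infty} P\big[a, a + \tfrac{1}{n}\big) = \lim_{n \to \infty} \big(e^{-a} - e^{-a - 1/n}\big) = 0, \quad \forall a \in \Omega. \]

Hence it follows that for $0 \le a < b < \infty$,
\[ P\{[a, b]\} = P\{(a, b)\} = P\{(a, b]\} = P\{[a, b)\}. \]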
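The properties of the interval probabilities in Example 7.4 can be checked numerically. The following is a minimal sketch; the function name `P` and the test point $a = 1$ are choices made here for illustration.

```python
import math

def P(a, b):
    """P([a, b)) = e^{-a} - e^{-b}; b may be math.inf (then e^{-b} = 0)."""
    return math.exp(-a) - math.exp(-b)

total = P(0, math.inf)                   # probability of the whole sample space
lhs, rhs = P(0, 2), P(0, 1) + P(1, 2)    # finite additivity on [0, 2)

# Shrinking intervals [1, 1 + 1/n): their probabilities decrease towards P({1}) = 0.
shrinking = [P(1, 1 + 1 / n) for n in (1, 10, 100, 1000)]
print(total, lhs, rhs, shrinking)
```

The shrinking sequence illustrates why every single point has probability zero even though the intervals around it have positive probability.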
Lecture 8: Conditional Probability
January 21, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

We know something about the world, and based on what we know we set up a probability model. Then something happens, and somebody tells us a little more about the world, i.e., gives us some new information. This new information, in general, should change our beliefs about what happened or what may happen. So whenever we are given new information, some partial information about the outcome of the experiment, we should revise our beliefs. Conditional probabilities are just the probabilities that apply after this revision of our beliefs.
Let us take an example from our class only. By looking at the attendance data of Mr. Sanskar Bindal (17UCS141), we assign probability zero to Sanskar being present in the class on a given day. Now if I tell you that there is a quiz in class tomorrow, do you still think that the probability of Sanskar being present tomorrow remains unchanged?
In more precise terms, given an experiment, a corresponding sample space, and a probability
law, suppose that we know that the outcome is within some given event B. We wish to
quantify the likelihood that the outcome also belongs to some other given event A. We thus
seek to construct a new probability law that takes into account the available knowledge:
a probability law that for any event A, specifies the conditional probability of A given B,
denoted by P (A|B). We would like the conditional probabilities P (A|B) of different events
A to constitute a legitimate probability law, which satisfies the probability axioms.

Saying that $B$ has occurred means that the outcome already lies in $B$. So $B$ is our new universe. Now we ask what the probability is that the outcome is in the set $A$, given that it is already in $B$. Think of the probability measure as area or mass. It is clear that the answer should be proportional to the area or mass of $A \cap B$. Now we want this new assignment of probability to follow our axioms of probability, one of which is that the probability of the sample space is 1. Since our new universe or sample space is the set $B$, in order to have $P(B|B) = 1$ we divide by $P(B)$. This motivates the definition.

Definition 8.1 Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $A \in \mathcal{F}$ be such that $P(A) > 0$. The conditional probability of an event $B \in \mathcal{F}$ given $A$, denoted by $P(B|A)$, is defined as
\[ P(B|A) = \frac{P(A \cap B)}{P(A)}. \]

Define
FA := {B ∩ A|B ∈ F}
So FA is class (or family or collection or set) of subsets of A.

Exercise 8.2 Show that FA is a σ-field of subsets of A. (Mind you, it is A not Ω.)

Solution:

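1. $A \cap A = A$ and $A \in \mathcal{F}$, hence $A \in \mathcal{F}_A$.

2. Let $C \in \mathcal{F}_A$, i.e., there is some $B \in \mathcal{F}$ such that $C = A \cap B$. Draw a diagram. Keep in mind that we have to take the complement of $C$ in $A$ (not in $\Omega$), which gives $C^c = A \setminus C = A \cap B^c$. Since $B^c \in \mathcal{F}$, we get $C^c \in \mathcal{F}_A$.

3. Let $C_1, C_2, \dots \in \mathcal{F}_A$. Then there exist $B_1, B_2, \dots \in \mathcal{F}$ such that $C_n = A \cap B_n$ for all $n \in \mathbb{N}$. Note that $\bigcup_{n=1}^{\infty} B_n \in \mathcal{F}$ and
\[ \bigcup_{n=1}^{\infty} C_n = \bigcup_{n=1}^{\infty} (A \cap B_n) = A \cap \Big(\bigcup_{n=1}^{\infty} B_n\Big) \in \mathcal{F}_A. \]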

Define PA on FA as follows:

PA (B) = P (B|A), B ∈ FA .

Then $(A, \mathcal{F}_A, P_A)$ is a probability space. We are already assuming $P(A) > 0$, which implies that $A$ is non-empty. We have already shown that $\mathcal{F}_A$ is a σ-field of subsets of $A$. The only thing left is to show that $P_A$ is a probability measure.

1. $P_A(B) = P(B|A) \ge 0$ by definition.

2. $P_A(A) = P(A|A) = 1$.

3. Countable additivity of $P_A$ follows from countable additivity of $P$.

Remark 8.3 Let us also note that since we have P (B|B) = 1, all of the conditional proba-
bility is concentrated on B. Thus, we might as well discard all possible outcomes outside B
and treat the conditional probabilities as a probability law defined on the new universe B.
Since conditional probabilities constitute a legitimate probability law, all general properties of
probability measure remain valid.

Example 8.4 Consider an experiment involving two successive rolls of a fair die. If the
sum of the two rolls is 9, how likely is it that the first roll was a 6?

Solution: Let $B$ be the event that the sum of the two rolls is 9. Then $B = \{(3, 6), (6, 3), (4, 5), (5, 4)\}$. Let $A$ be the event that the first roll is 6; then $A$ has 6 elements and $A \cap B = \{(6, 3)\}$. Now
\[ P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P\{(6, 3)\}}{P(B)} = \frac{1/36}{4/36} = \frac{1}{4}. \]
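The computation in Example 8.4 can be reproduced by brute-force enumeration of the 36 equally likely outcomes. This is a sketch; the helper names `prob` and `cond_prob` are hypothetical and introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two successive rolls of a fair die.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Discrete uniform law: favourable outcomes / all outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B), assuming P(B) > 0."""
    return prob(lambda w: a(w) and b(w)) / prob(b)

first_is_six = lambda w: w[0] == 6
sum_is_nine = lambda w: w[0] + w[1] == 9

print(cond_prob(first_is_six, sum_is_nine))  # 1/4
```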

In Example 8.4 the probability space was specified, and we computed a conditional probability from it. In many problems, however, we proceed in the opposite direction. We are given in advance what some conditional probabilities should be, and we use this information and the rules of probability to compute the requested probabilities. A typical example of this situation is the following.

Example 8.5 Suppose that the population of a certain city is 40% male and 60% female.
Suppose also that 50% of the males and 30% of the females smoke. Find the probability that
a smoker is male.

Solution: Let M denote the event that a person selected is a male and let F denote the event
that the person selected is a female. Also let S denote the event that the person selected
smokes. The given information can be expressed in the form P (S|M ) = .5, P (S|F ) =
.3, P (M ) = .4, and P (F ) = .6. The problem is to compute P (M |S). By definition

P (M ∩ S)
P (M |S) = .
P (S)

Now $P(M \cap S) = P(M)P(S|M) = (.4)(.5) = 0.20$, so the numerator can be computed in terms of the given probabilities. Now how do we compute $P(S)$? Here is a very important technique in probability (known as the total probability theorem, which is proved in Lecture 9): “divide and conquer”. We partition the set $S$ so that it is possible for us to compute the probabilities of the pieces. Note that $S$ is the union of the two disjoint sets $S \cap M$ and $S \cap F$. It follows that
\[ P(S) = P(S \cap M) + P(S \cap F). \]
Now $P(S \cap F) = P(F)P(S|F) = (.6)(.3) = 0.18$. Therefore, $P(S) = 0.20 + 0.18 = 0.38$. Hence
\[ P(M|S) = \frac{0.20}{0.38} \approx 0.526. \]
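The arithmetic of Example 8.5 can be checked with exact fractions. This is a small sketch; the variable names are chosen here for illustration.

```python
from fractions import Fraction

p_m, p_f = Fraction(2, 5), Fraction(3, 5)                    # P(M), P(F)
p_s_given_m, p_s_given_f = Fraction(1, 2), Fraction(3, 10)   # P(S|M), P(S|F)

# Split S into the disjoint pieces S∩M and S∩F, then add their probabilities.
p_s = p_m * p_s_given_m + p_f * p_s_given_f

# Definition of conditional probability: P(M|S) = P(M∩S) / P(S).
p_m_given_s = p_m * p_s_given_m / p_s

print(p_s, p_m_given_s, float(p_m_given_s))
```

The exact answer is $10/19$, which agrees with the decimal value $0.526\ldots$ obtained above.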

Generally speaking, most problems of probability have to do with several events and it is
their mutual relation or joint action that must be investigated. The following result is useful
for such situation.

Proposition 8.6 For arbitrary events A1 , A2 , · · · , An , we have

P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 )P (A2 |A1 )P (A3 |A1 ∩ A2 ) · · · P (An |A1 ∩ A2 · · · ∩ An−1 ) (8.1)

provided P (A1 ∩ A2 · · · An−1 ) > 0.

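Proof: Since
\[ P(A_1) \ge P(A_1 \cap A_2) \ge \cdots \ge P(A_1 \cap A_2 \cap \cdots \cap A_{n-1}) > 0, \]
all the conditional probabilities in (8.1) are well-defined. Now the right side of (8.1) is
\[ P(A_1) \times \frac{P(A_2 \cap A_1)}{P(A_1)} \times \frac{P(A_3 \cap A_2 \cap A_1)}{P(A_2 \cap A_1)} \times \cdots \times \frac{P(A_1 \cap A_2 \cap \cdots \cap A_{n-1} \cap A_n)}{P(A_1 \cap A_2 \cap \cdots \cap A_{n-1})}, \]
and after cancellation of successive numerators and denominators only $P(A_1 \cap A_2 \cap \cdots \cap A_n)$ remains, which is the left side of (8.1).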
Lecture 9: Total Probability Theorem & Bayes Rule
January 22, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Definition 9.1 A collection $\{A_1, A_2, \dots, A_N\}$ of events is said to be a partition of $\Omega$ if

1. the $A_i$'s are pairwise disjoint, and

2. $\bigcup_{i=1}^{N} A_i = \Omega$.

If $N < \infty$ then the partition is said to be a finite partition, and if $N = \infty$ then it is called a countable partition.

Example 9.2 Let $\Omega = \mathbb{N}$ and consider $\{E, O\}$, where $E$ is the set of even numbers and $O$ is the set of odd numbers. This is a finite partition. If we take $\{\{1\}, \{2\}, \dots\}$ as the partition, then it is a countable partition.

Now we explore some further applications of conditional probability. The following theorem is often useful for computing the probabilities of various events, using a “divide-and-conquer” approach.

Theorem 9.3 (Total Probability Theorem) Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{A_1, A_2, \dots, A_N\}$ be a partition of $\Omega$ such that $P(A_i) > 0$ for all $i$. Then for any event $B \in \mathcal{F}$,
\[ P(B) = \sum_{i=1}^{N} P(B|A_i)P(A_i). \]

Proof: The event $B$ is decomposed into the disjoint union
\[ B = (A_1 \cap B) \cup (A_2 \cap B) \cup \cdots \cup (A_N \cap B). \]
Now using additivity of the probability measure we have
\[ P(B) = \sum_{i=1}^{N} P(B \cap A_i). \]
By the definition of conditional probability, $P(B \cap A_i) = P(B|A_i)P(A_i)$. Hence we get the theorem.
One of the uses of the theorem is to compute the probability of various events B for which
the conditional probabilities P (B|Ai ) are known or easy to derive. The key is to choose ap-
propriately the partition {A1 , A2 , · · · , AN } and this choice is often suggested by the structure
of the problem.

Example 9.4 You enter a chess tournament where your probability of winning a game is
0.3 against half the players (call them type 1), 0.4 against a quarter of the players (call them
type 2), and 0.5 against the remaining quarter of the players (call them type 3). You play a
game against a randomly chosen opponent. What is the probability of winning?

Solution: Let Ai be the event of playing with an opponent of type i. We have


P (A1 ) = 0.5, P (A2 ) = 0.25, P (A3 ) = 0.25
Also, let B be the event of winning. We have
P (B|A1 ) = 0.3, P (B|A2 ) = 0.4, P (B|A3 ) = 0.5
Thus, by the total probability theorem, the probability of winning is
P (B) = P (A1 )P (B|A1 ) + P (A2 )P (B|A2 ) + P (A3 )P (B|A3 )
= 0.5 × 0.3 + 0.25 × 0.4 + 0.25 × 0.5
= 0.375.
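The calculation in Example 9.4 is a direct instance of the total probability theorem and can be written as a short function. This is a sketch under the assumption that the partition is finite; the name `total_probability` is hypothetical.

```python
def total_probability(priors, likelihoods):
    """Return sum_i P(B|A_i) P(A_i) for a finite partition A_1, ..., A_N.

    priors[i] is P(A_i) and likelihoods[i] is P(B|A_i).
    """
    if len(priors) != len(likelihoods):
        raise ValueError("need one likelihood per partition event")
    if abs(sum(priors) - 1.0) > 1e-12:
        raise ValueError("the priors of a partition must sum to 1")
    return sum(p * l for p, l in zip(priors, likelihoods))

# Example 9.4: opponent types 1, 2, 3 and the chance of beating each type.
p_win = total_probability([0.5, 0.25, 0.25], [0.3, 0.4, 0.5])
print(p_win)
```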

The total probability theorem is often used in conjunction with the following celebrated
theorem, which relates conditional probabilities of the form P (A|B) with conditional prob-
abilities of the form P (B|A), in which the order of the conditioning is reversed.

Theorem 9.5 (Bayes Theorem) Let $\{A_1, A_2, \dots, A_N\}$ be a partition of the sample space, and assume that $P(A_i) > 0$ for all $i = 1, 2, \dots, N$. Then, for any event $B$ such that $P(B) > 0$, we have
\[ P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{k=1}^{N} P(A_k)P(B|A_k)}, \quad \text{for each } i = 1, 2, \dots, N. \]

Proof: For fixed $i$,
\[ P(A_i|B) = \frac{P(B \cap A_i)}{P(B)} = \frac{P(B|A_i)P(A_i)}{P(B)} = \frac{P(B|A_i)P(A_i)}{\sum_{k=1}^{N} P(A_k)P(B|A_k)}, \]
where the last equality follows from the total probability theorem.


Bayes' rule is often used for inference. There are a number of causes that may result in a certain effect. We observe the effect and wish to infer the cause. The events $A_1, A_2, \dots, A_N$ are associated with the causes and the event $B$ represents the effect. The probability $P(B|A_i)$ that the effect will be observed when the cause $A_i$ is present amounts to a probabilistic model of the cause-effect relation. Given that the effect $B$ has been observed, we wish to evaluate the probability $P(A_i|B)$ that the cause $A_i$ is present. We refer to $P(A_i|B)$ as the posterior probability of the event $A_i$ given the information, to be distinguished from $P(A_i)$, which we call the prior probability.
Bayes' theorem has found numerous applications in all areas of natural phenomena and human behavior. Let us look at a few examples of such “inference”.

Example 9.6 1. If $B$ is a “body” and the $A_n$'s are the several suspects of the murder, then Bayes' theorem will help the jury or court to decide the whodunit.
2. If $B$ is an earthquake and the $A_n$'s are the different physical theories to explain it, then the theorem will help the scientists to choose between them.
3. We observe a shade in a person's X-ray (this is the event $B$, the “effect”) and we want to estimate the likelihood of three mutually exclusive and collectively exhaustive potential causes: cause 1 (event $A_1$) is that there is a malignant tumor, cause 2 (event $A_2$) is that there is a nonmalignant tumor, and cause 3 (event $A_3$) corresponds to reasons other than a tumor. We assume that we know the probabilities $P(A_i)$ and $P(B|A_i)$, $i = 1, 2, 3$. Given that we see a shade (event $B$ occurs), Bayes' theorem gives the posterior probabilities of the various causes as
\[ P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{k=1}^{3} P(A_k)P(B|A_k)}, \quad \text{for each } i = 1, 2, 3. \]

Example 9.7 A test for a certain rare disease is assumed to be correct 95% of the time: if
a person has the disease, the test results are positive with probability 0.95, and if the person
does not have the disease, the test results are negative with probability 0.95. A random person
drawn from a certain population has probability 0.001 of having the disease. Given that the
person just tested positive, what is the probability of having the disease?

Solution: If $A$ is the event that the person has the disease, and $B$ is the event that the test results are positive, the desired probability $P(A|B)$ is
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^c)P(A^c)} = \frac{0.001 \times 0.95}{0.001 \times 0.95 + 0.999 \times 0.05} \approx 0.0187. \]
Here $P(A^c) = 0.999$. Since the test is 95% correct for a person without the disease, $P(B^c|A^c) = 0.95$; and since $P(\cdot|A^c)$ is a probability measure, $P(B|A^c) = 1 - 0.95 = 0.05$.
Note that even though the test was assumed to be fairly accurate, a person who has tested positive is still very unlikely (less than 2%) to have the disease. According to The Economist (February 20th, 1999), 80% of those questioned at a leading American hospital substantially missed the correct answer to a question of this type; most of them thought that the probability that the person has the disease is 0.95.
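The posterior in Example 9.7 can be computed with a small routine implementing Bayes' theorem for a finite partition. This is a sketch; the function name `posterior` is an assumption made here.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem for a finite partition.

    priors[i] is P(A_i), likelihoods[i] is P(B|A_i);
    returns the list of posteriors P(A_i|B).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A_i ∩ B)
    evidence = sum(joint)                                  # P(B), by total probability
    if evidence == 0:
        raise ValueError("P(B) must be positive")
    return [j / evidence for j in joint]

# Partition {A, A^c}: has the disease / does not have the disease.
post = posterior([0.001, 0.999], [0.95, 0.05])
print(round(post[0], 4))  # 0.0187
```

The denominator is dominated by the false positives $0.999 \times 0.05$, which is why the posterior stays small despite the accurate test.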

Remark 9.8 The practical utility of Bayes' formula is limited by our usual lack of knowledge of the various a priori probabilities.
Lecture 10: Independence
January 23, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

We have introduced the conditional probability P (A|B) to capture the partial information
that event B provides about event A. An interesting and important special case arises when
the occurrence of B provides no such information and does not alter the probability that A
has occurred, i.e.,
\[ P(A|B) = P(A). \]
When the above equality holds, we say that $A$ is independent of $B$. Note that by the definition $P(A|B) = \frac{P(A \cap B)}{P(B)}$, this is equivalent to
\[ P(A \cap B) = P(A)P(B). \]

We adopt this latter relation as the definition of independence because it can be used even
when P (B) = 0, in which case P (A|B) is undefined. The symmetry of this relation also
implies that independence is a symmetric property; that is, if A is independent of B, then B
is independent of A, and we can unambiguously say that A and B are independent events.

Definition 10.1 Events A and B are said to be independent if

P (A ∩ B) = P (A)P (B)

Independence is often easy to grasp intuitively. For example, if the occurrence of two events
is governed by distinct and noninteracting physical processes, such events will turn out to
be independent .

Example 10.2 We toss a fair coin two times. Then $\Omega = \{HH, HT, TT, TH\}$ and the probability measure is
\[ P(\omega) = \frac{1}{4}, \quad \forall \omega \in \Omega. \]
Consider the events $A = \{HH, HT\}$, $B = \{HH, TH\}$ and $C = \{HT, TT\}$. Clearly $P(A) = P(B) = P(C) = \frac{1}{2}$. Event $A$ is “the first toss is a head” and event $B$ is “the second toss is a head”. Physically there is no connection between what happens in the first toss and in the second toss, so intuitively it is clear that $A$ and $B$ should be independent. In fact that is the case:
\[ P(A \cap B) = P(HH) = \frac{1}{4} = P(A)P(B). \]
Event $C$, on the other hand, is “the second toss is a tail”. Now $B$ and $C$ determine each other in the sense that if $B$ occurs then $C$ cannot occur and vice versa, so they are not independent. Indeed,
\[ P(C \cap B) = P(\emptyset) = 0 \ne \frac{1}{4} = P(C)P(B). \]
Once again $A$ and $C$ should be independent, and indeed we have
\[ P(A \cap C) = P(HT) = \frac{1}{4} = P(A)P(C). \]

Note that independence is not easily visualized in terms of the sample space. A common
first thought is that two events are independent if they are disjoint, but in fact the opposite
is true: two disjoint events A and B with P (A) > 0 and P (B) > 0 are never independent,
since their intersection A ∩ B is empty and has probability 0. For example, an event A and
its complement Ac are not independent [unless P (A) = 0 or P (A) = 1] , since knowledge
that A has occurred provides precise information about whether Ac has occurred.
Sometimes the notion of independence does not appear intuitive but the mathematical co-
incidence of equality happens and we have to declare the two events independent.

Example 10.3 Let us consider a random experiment of choosing a real number from $(0, 1]$. Then $\Omega = (0, 1]$, $\mathcal{F}$ is a σ-field which contains all the subintervals of $\Omega$, and the probability measure is defined as the “length” of the set. Consider the events
\[ A = \big(0, \tfrac{1}{2}\big], \quad B = \big[\tfrac{1}{4}, \tfrac{3}{4}\big], \quad C = \big[\tfrac{1}{4}, 1\big]. \]
Then $P(A) = 0.5$, $P(B) = 0.5$, $P(C) = 0.75$. Event $A$ is that the chosen number belongs to the interval $(0, 0.5]$, event $B$ is that it belongs to the interval $[0.25, 0.75]$, and event $C$ is that it belongs to the interval $[0.25, 1]$.
In this experiment the notion of independence/dependence is not at all intuitive. Nevertheless, $A, B$ are independent, $A, C$ are dependent and $B, C$ are dependent:
\[ P(A \cap B) = P\{[0.25, 0.5]\} = \tfrac{1}{4} = P(A)P(B), \]
\[ P(A \cap C) = P\{[0.25, 0.5]\} = \tfrac{1}{4} \ne \tfrac{3}{8} = P(A)P(C), \]
\[ P(B \cap C) = P\{B\} = \tfrac{1}{2} \ne \tfrac{3}{8} = P(B)P(C). \]

As mentioned earlier, if A and B are independent, the occurrence of B does not provide
any new information on the probability of A occurring. It is then intuitive that the non-
occurrence of B should also provide no information on the probability of A. Indeed, we have
the following proposition.

Proposition 10.4 If A and B are independent events, then the following pairs are also
independent:

(a) A and B c ,

(b) Ac and B,

(c) Ac and B c .

Proof:

(a) We must show that $P(A \cap B^c) = P(A)P(B^c)$.

\[ P(A \cap B^c) = P(A \setminus (A \cap B)) = P(A) - P(A \cap B) \quad (\because A \cap B \subset A) \]
\[ = P(A) - P(A)P(B) = P(A)[1 - P(B)] = P(A)P(B^c). \]

(b) Let us relabel the event A as event C and B as event D. So events D and C are
independent, hence by part (a) it follows that D and C c are independent. That is to
say B and Ac are independent.

(c) If A and B are independent then Ac and B are independent by part (b). Now let us
relabel Ac as event C and relabel B as event D. Applying part (a) on the pair C and
D, we get independence of C and Dc . But C = Ac and Dc = B c . This completes the
proof.

10.1 Independence of more than two events


We might think that we could say A, B and C are independent if P (A∩B∩C) = P (A)P (B)P (C).
However, this is not the correct condition.

Example 10.5 Consider two rolls of a fair six-sided die, and the following events:

\[ A = \{\text{1st roll is 1, 2, or 3}\}, \quad B = \{\text{1st roll is 3, 4, or 5}\}, \quad C = \{\text{sum of the two rolls is 9}\}. \]
Then $A \cap B = \{(3, i) \mid i = 1, 2, 3, 4, 5, 6\}$, $A \cap C = \{(3, 6)\}$, $B \cap C = \{(3, 6), (4, 5), (5, 4)\}$, and
\[ P(A \cap B) = \frac{6}{36} \ne \frac{18}{36} \cdot \frac{18}{36} = P(A)P(B), \]
\[ P(A \cap C) = \frac{1}{36} \ne \frac{18}{36} \cdot \frac{4}{36} = P(A)P(C), \]
\[ P(C \cap B) = \frac{3}{36} \ne \frac{18}{36} \cdot \frac{4}{36} = P(B)P(C). \]
On the other hand,
\[ P(A \cap B \cap C) = \frac{1}{36} = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{9} = P(A)P(B)P(C). \]
So it would be grossly embarrassing to call the three events $A, B, C$ independent when they are not even pairwise (any two at a time) independent.
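The claims of Example 10.5 can be verified by enumerating the 36 outcomes. This is a sketch with exact fractions; the names `omega`, `P`, `triple_ok` and `pairs_ok` are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))    # two rolls of a fair die
P = lambda e: Fraction(len(e), len(omega))     # discrete uniform law

A = {w for w in omega if w[0] in (1, 2, 3)}
B = {w for w in omega if w[0] in (3, 4, 5)}
C = {w for w in omega if sum(w) == 9}

# The triple product condition holds ...
triple_ok = P(A & B & C) == P(A) * P(B) * P(C)
# ... but none of the three pairs satisfies the product condition.
pairs_ok = [P(X & Y) == P(X) * P(Y) for X, Y in ((A, B), (A, C), (B, C))]

print(triple_ok, pairs_ok)  # True [False, False, False]
```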

A second attempt at a definition of independence of $A$, $B$ and $C$, in light of the previous example, might be to call $A$, $B$ and $C$ independent if all the pairs are independent, i.e., $(A, B)$, $(B, C)$ and $(A, C)$ are independent. The next example shows that this is not enough either.

Example 10.6 Let $\Omega = \{1, 2, 3, 4\}$. Define
\[ P(i) = \frac{1}{4} \quad \text{for } i = 1, 2, 3, 4. \]
Consider the events $A = \{1, 2\}$, $B = \{1, 3\}$ and $C = \{1, 4\}$. Then $P(A) = P(B) = P(C) = \frac{1}{2}$. Note that $A, B, C$ are pairwise independent:
\[ P(A \cap B) = P\{1\} = \frac{1}{4} = P(A)P(B), \]
\[ P(A \cap C) = P\{1\} = \frac{1}{4} = P(A)P(C), \]
\[ P(B \cap C) = P\{1\} = \frac{1}{4} = P(B)P(C), \]
but $P(A \cap B \cap C) = P\{1\} = \frac{1}{4} \ne \frac{1}{8} = P(A)P(B)P(C)$.

The preceding two examples show that mutual (or total) independence of a collection of
events requires an extremely strong condition. The following definition works.
We say that three events $A_1$, $A_2$ and $A_3$ are independent if they satisfy the four conditions
\[ P(A_1 \cap A_2) = P(A_1)P(A_2), \]
\[ P(A_2 \cap A_3) = P(A_2)P(A_3), \]
\[ P(A_1 \cap A_3) = P(A_1)P(A_3), \]
\[ P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2)P(A_3). \]
The first three conditions simply assert that any two events are independent, a property
known as pairwise independence. But the fourth condition is also important and does not
follow from the first three.
Lecture 11: Independence of Multiple Events & Counting
January 28, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Example 11.1 Consider the experiment of tossing a fair coin three times. Then $\Omega = \{HHH, HHT, HTH, THH, TTH, THT, HTT, TTT\}$. Each outcome has an equal chance of occurrence. Let $A_i$ be the event that the $i$th toss is a head, $i = 1, 2, 3$. Intuitively, $A_1, A_2, A_3$ seem to be independent. Let us verify this as per Definition 11.2 (stated below). Note that
\[ A_1 = \{HHH, HHT, HTH, HTT\}, \quad A_2 = \{HHH, HHT, THH, THT\}, \quad A_3 = \{HHH, HTH, THH, TTH\}. \]
Hence $P(A_i) = \frac{1}{2}$ for each $i = 1, 2, 3$, and
\[ P(A_1 \cap A_2 \cap A_3) = P(HHH) = \frac{1}{8} = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = P(A_1)P(A_2)P(A_3), \]
\[ P(A_1 \cap A_2) = P\{HHT, HHH\} = \frac{1}{4} = \frac{1}{2} \times \frac{1}{2} = P(A_1)P(A_2), \]
\[ P(A_2 \cap A_3) = P\{HHH, THH\} = \frac{1}{4} = \frac{1}{2} \times \frac{1}{2} = P(A_2)P(A_3), \]
\[ P(A_1 \cap A_3) = P\{HHH, HTH\} = \frac{1}{4} = \frac{1}{2} \times \frac{1}{2} = P(A_1)P(A_3). \]

The definition of independence can be extended to multiple events (more than three).

Definition 11.2 We say that the events $A_1, A_2, \dots, A_n$ are independent if
\[ P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_m}) = P(A_{i_1})P(A_{i_2}) \cdots P(A_{i_m}) \]
for every $2 \le m \le n$ and for every choice of indices $1 \le i_1 < i_2 < \cdots < i_m \le n$.

For $m = 2$, we have $\binom{n}{2}$ conditions to be checked for pairs. For $m = 3$, we have $\binom{n}{3}$ conditions to be checked for triples, and so on. Hence checking independence of $n$ events requires checking
\[ \binom{n}{2} + \binom{n}{3} + \cdots + \binom{n}{n} = (1 + 1)^n - \binom{n}{0} - \binom{n}{1} = 2^n - n - 1 \]
non-trivial conditions. Independence places very strong requirements on the interrelationships between events.
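For a finite sample space with equally likely outcomes, Definition 11.2 can be checked mechanically by looping over all sub-collections of size at least 2. The following is a sketch under that assumption; the function name `is_independent` and its return convention (verdict plus number of conditions examined) are choices made here.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

def is_independent(events, omega):
    """Check Definition 11.2 under the uniform law on the finite set omega.

    Returns (verdict, number of product conditions examined).
    """
    P = lambda e: Fraction(len(e), len(omega))
    checked = 0
    for m in range(2, len(events) + 1):
        for group in combinations(events, m):
            checked += 1
            if P(set.intersection(*group)) != prod(P(e) for e in group):
                return False, checked
    return True, checked

# Three tosses of a fair coin: A_i = "the i-th toss is a head".
omega3 = {"".join(t) for t in product("HT", repeat=3)}
heads = [{w for w in omega3 if w[i] == "H"} for i in range(3)]
print(is_independent(heads, omega3))   # all 2^3 - 3 - 1 = 4 conditions hold

# Four equally likely points: pairwise independent, but the triple condition fails.
print(is_independent([{1, 2}, {1, 3}, {1, 4}], {1, 2, 3, 4}))
```

The loop visits pairs first and larger sub-collections later, so a failure of pairwise independence is reported before any triple is examined.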
Notion of independence can be extended easily to countably infinite collection of events.

Definition 11.3 We say that a countably infinite collection of events {Ai |i ≥ 1} is inde-
pendent if any finite sub-collection of events is independent.

Example 11.4 If events $A$, $B$, and $C$ are independent, show that $A \cup B$ is also independent of $C$.

Solution:
\[ P((A \cup B) \cap C) = P\{(A \cap C) \cup (B \cap C)\} = P(A \cap C) + P(B \cap C) - P\{(A \cap C) \cap (B \cap C)\} \]
\[ = P(A)P(C) + P(B)P(C) - P(A \cap B \cap C) = P(A)P(C) + P(B)P(C) - P(A)P(B)P(C) \]
\[ = P(C)[P(A) + P(B) - P(A)P(B)] = P(C)[P(A) + P(B) - P(A \cap B)] = P(C)P(A \cup B). \]
Similarly, $A^c \cap B$ is independent of $C$:
\[ P((A^c \cap B) \cap C) = P(A^c \cap B \cap C) = P(B \cap C) - P(B \cap C \cap A) = P(B)P(C) - P(A)P(B)P(C) \]
\[ = P(B)P(C)[1 - P(A)] = P(A^c)P(B)P(C) = P(A^c \cap B)P(C), \]
where the last equality uses the independence of $A^c$ and $B$ (Proposition 10.4).

The moral of Example 11.4 is that if $A$, $B$ and $C$ are independent, then any event determined by the events $A$ and $B$ is independent of the event $C$.

11.1 Counting
Now we start a new topic, namely counting. The reason we are going to talk about counting is that there are a lot of probability problems whose solution reduces to counting the cardinalities of various sets (or the number of outcomes in various events). We have already seen a context where such counting arises: when the sample space $\Omega$ has a finite number of equally likely outcomes, the discrete uniform probability law applies. Then the probability of any event $A$ is given by
\[ P(A) = \frac{\text{number of elements of } A}{\text{number of elements of } \Omega} \]
and involves counting the elements of $A$ and of $\Omega$. Today we are going to just touch the surface of this subject. There is a whole field of mathematics called combinatorics, whose practitioners spend their whole lives counting more and more complicated sets. We are not going to get anywhere close to the full complexity of the field, but we will get just enough tools to address problems of the type that one encounters in most common situations. If somebody gives you $\Omega$ as a list and another set $A$, again as a list, it is easy to count their elements: you just count how many items there are on each list. But sometimes the sets are described in a more implicit way, and we may have to do a little more work.
The counting principle is based on a divide-and-conquer approach, whereby the counting
is broken down into stages. Suppose a process consists of 2 stages. Suppose there are n1
possible results at first stage. For every possible result at the first stage there are n2 possible
results at second stage. Then the total number of possible results of the process is n1 n2 .
This analogy can be extended to any r-step process.

Permutations

Suppose we have 10 students and 10 chairs. We wish to count the total number of arrangements (ordered lists) in which the 10 students can be assigned to the 10 chairs. A permutation is an arrangement of objects in a definite order. The number of different permutations, or ordered lists, of $n$ different objects is $n!$. In particular, the 10 students can be seated in $10!$ ways.

k-permutations

We start with $n$ distinct objects, and let $k$ be some positive integer with $k \le n$. We wish to count the number of different ways that we can pick $k$ out of these $n$ objects and arrange them in a sequence, i.e., the number of distinct $k$-object sequences. A more concrete way of thinking about this problem: we have 99 students in the B2 section of the P&S course, and I want to form a group of 10 students and put them in a particular order. How many possible ordered lists of 10 people can I make? By ordered, I mean that we take those 10 people and say: this is the first person in the group, that is the second person in the group, that is the third person in the group, and so on.
So in how many ways can we do this? Out of these $n$, we want to choose just $k$ and put them in slots, one after the other. We have $n$ choices for the first person in the group. Then we choose the second person in the group; we have used up 1 person, so we have $n - 1$ choices here. When we are ready to select the last (the $k$th) object, we have already chosen $k - 1$ objects, which leaves us with $n - (k - 1)$ choices for the last one. By the counting principle, the number of possible sequences, called $k$-permutations, is
\[ n(n - 1)(n - 2) \cdots (n - k + 1) = \frac{n!}{(n - k)!}. \]
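The $k$-permutation count can be checked both by listing and by formula. This sketch uses Python's standard `itertools.permutations` and `math.perm`; the four-letter alphabet is an illustrative choice, while $n = 99$, $k = 10$ come from the classroom example above.

```python
import math
from itertools import permutations

# List all 2-permutations of four letters and compare with n!/(n-k)!.
letters = "ABCD"
listed = ["".join(p) for p in permutations(letters, 2)]
print(len(listed), math.perm(4, 2))

# Ordered groups of 10 students out of 99, built stage by stage:
# 99 choices, then 98, ..., down to 99 - 10 + 1 = 90.
n, k = 99, 10
stagewise = math.prod(range(n - k + 1, n + 1))
print(stagewise == math.perm(n, k))
```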

Combinations

There are $n$ people and we are interested in forming a committee of $k$. How many different committees are possible? More abstractly, this is the same as the problem of counting the number of $k$-element subsets of a given $n$-element set. Notice that forming a combination is different from forming a $k$-permutation, because in a combination there is no ordering of the selected elements. For example, whereas the 2-permutations of the letters A, B, C, and D are
AB, BA, AC, CA, AD, DA, BC, CB, BD, DB, CD, DC,
the combinations of two out of these four letters are

AB, AC, AD, BC, BD, CD.

Each combination is associated with $k!$ “duplicate” $k$-permutations, so the number $\frac{n!}{(n - k)!}$ of $k$-permutations is equal to the number of combinations times $k!$. Hence the number of possible combinations, which we denote by $\binom{n}{k}$, satisfies
\[ \binom{n}{k} \times k! = \frac{n!}{(n - k)!} \implies \binom{n}{k} = \frac{n!}{(n - k)!\,k!}. \]
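The relation between combinations and $k$-permutations can be illustrated with the standard library functions `itertools.combinations`, `math.comb` and `math.perm`. This is a sketch; the values $n = 99$, $k = 10$ are reused from the earlier classroom example purely as an illustration.

```python
import math
from itertools import combinations

# The six 2-element combinations of the letters A, B, C, D.
pairs = ["".join(c) for c in combinations("ABCD", 2)]
print(pairs)  # ['AB', 'AC', 'AD', 'BC', 'BD', 'CD']

# Each combination corresponds to k! ordered arrangements.
n, k = 99, 10
committees = math.comb(n, k)
print(committees * math.factorial(k) == math.perm(n, k))
```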
MATH-221: Probability and Statistics
Tutorial # 2 (Conditioning, Independence, Bayes Rule, Total Probability Theorem)

1. Let $\Omega = (0, 1]$ be the sample space and let $P(\cdot)$ be a probability function defined by
\[ P((0, x]) = \begin{cases} \dfrac{x}{2} & \text{if } 0 \le x < \dfrac{1}{2}, \\[4pt] x & \text{if } \dfrac{1}{2} \le x \le 1. \end{cases} \]
Then show that $P\left(\left\{\frac{1}{2}\right\}\right) = \frac{1}{4}$ and $P(\{x\}) = 0$ if $x \ne \frac{1}{2}$.
2. (Conditional version of the total probability theorem) Let $\{A_1, A_2, \dots, A_N\}$ be a partition of $\Omega$. Also let $B$ be an event such that $P(A_i \cap B) > 0$ for all $i$. Then for any event $A$, show that $P(A|B) = \sum_{i=1}^{N} P(A_i|B)P(A|A_i \cap B)$.

3. Let $\Omega = \{(a_1, a_2, \dots, a_n) \mid a_i \text{ is either 0 or 1 for each } i\}$. For $i = 1, 2, \dots, n$ set $A_i = \{(a_1, a_2, \dots, a_n) \mid a_i = 1\}$. If all the outcomes are equally likely, then show that $A_1, A_2, \dots, A_n$ are independent.
4. Consider the experiment of tossing a coin two times. Let A be the event that the first toss is a head and B be the event that the second toss is a head. Let P be the probability measure under which all the outcomes are equally likely. Then show that A and B are independent with respect to P. Let Q be the probability measure such that
$$Q(HH) = \frac{1}{3}, \quad Q(HT) = \frac{1}{6}, \quad Q(TH) = \frac{1}{6}, \quad Q(TT) = \frac{1}{3}.$$
Then show that A and B are not independent with respect to Q.
5. Three switches connected in parallel operate independently. Each switch remains closed with probability p. (a) Find the probability of receiving an input signal at the output. (b) Find the probability that switch $S_i$ is open given that an input signal is received at the output.
6. An electronic assembly consists of two subsystems, say A and B. From previous testing procedures, the following probabilities are assumed to be known: P(A fails) = 0.20, P(A and B both fail) = 0.15, P(B fails alone) = 0.15. Evaluate the following probabilities: (a) P(A fails | B has failed), (b) P(A fails alone | A or B fails).
7. In answering a question on a multiple-choice test, a student either knows the answer
or guesses. Let p be the probability that the student knows the answer and 1 − p
the probability that the student guesses. Assume that a student who guesses at the
answer will be correct with probability 1/m, where m is the number of multiple-
choice alternatives. What is the conditional probability that a student knew the
answer to a question, given that he or she answered it correctly?
8. Let a sample of size 4 be drawn (a) with replacement (b) without replacement from
an urn containing 12 balls of which 8 are white. Find the probability that the ball
drawn on the third draw is white given that the sample contains three white balls.
Lecture 12: Random Variables
29 January, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

In many random experiments the outcomes are numerical, e.g., throwing a die, or experiments whose outcome is an instrument reading or a stock price. In other random experiments the outcomes may not be numerical, e.g., tossing a coin or choosing a student randomly from the class. In general, the points of a sample space may be very concrete objects such as apples, molecules, and people. Often we are interested in a certain real-valued (sometimes complex-valued) function of the sample space. For example, if the experiment is the selection of students (for placement!!) from a given population, we may wish to consider their CGPA.
But not all functions defined on the sample space are useful, in the sense that we may not be able to assign probabilities to all basic events associated with the function. So one needs to restrict attention to a certain class of functions on the sample space. This motivates the definition of random variables.

Definition 12.1 Let (Ω, F) be a measurable space, i.e., Ω is a non-empty set and F is a
σ-field on Ω. A function X : Ω → R is said to be a random variable if for each x ∈ R,

{ω ∈ Ω|X(ω) ≤ x} =: {X ≤ x} ∈ F

Remark 12.2 1. The adjective “random” is just to remind us that we are dealing with
a sample space, which is related to something called random phenomena or random
experiment. Once ω is picked, X(ω) is thereby determined and there is nothing vague,
or random about it anymore.

2. Observe that random variables can be defined on a sample space before any probability
is mentioned.

3. When experimenters are concerned with random variables that describe observations, their main interest is in the probabilities with which the random variables take various values. For example, let Ω be the set of students in the B2 Section of the P&S class, labeled as Ω = {ω1, ω2, ..., ω99}. If we are interested in their attendance distribution, let A(ω) denote the attendance of ω. We are then interested in determining the probability of events like {A ≥ 1} = {ω ∈ Ω | A(ω) ≥ 1}, {A = 0}, etc. For that, these events should be in the σ-field. This idea suggests the condition in Definition 12.1.


Example 12.3 Let Ω = {H, T }, F = P(Ω). Define X : Ω → R by

X(H) = 1, X(T ) = 0.

Claim 12.4 X is a random variable.


Let x ∈ R be given.

1. If x < 0, then {X ≤ x} = ∅.

2. If 0 ≤ x < 1, then {X ≤ x} = {T }.

3. If 1 ≤ x, then {X ≤ x} = Ω.

Hence for each x ∈ R, {X ≤ x} ∈ F.

Remark 12.5 If we take F = {∅, Ω} (trivial σ-field), then the function X defined above is
not a random variable, since {X ≤ 0} = {T} does not belong to F.

Example 12.6 Let Ω be a non-empty set. If F = {∅, Ω}, then only constant functions on
Ω are random variables.

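Solution: A non-constant function cannot be a random variable here: if X takes two values a < b, then {X ≤ a} is neither ∅ nor Ω (compare Remark 12.5). So it is enough to show that constant functions are random variables. Let X ≡ c on Ω for some c ∈ R. Then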

1. If x < c, then {X ≤ x} = ∅.

2. If c ≤ x, then {X ≤ x} = Ω.

Hence for each x ∈ R, {X ≤ x} ∈ F.

Example 12.7 Let Ω be a non-empty set. If F = P(Ω) (power set of Ω), then any function
on Ω is a random variable.

Solution: Let x ∈ R be given. Then {X ≤ x} is a subset of Ω, hence is in F.
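On a finite sample space the condition of Definition 12.1 can be checked mechanically. The sketch below is illustrative only: the helper names `power_set` and `is_random_variable` are hypothetical, a random variable is represented as a dictionary, and the second argument is assumed to already be a σ-field. It reproduces Example 12.3, Remark 12.5 and the constant case of Example 12.6.

```python
from itertools import chain, combinations

def power_set(omega):
    """All subsets of a finite set, returned as a set of frozensets."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}

def is_random_variable(X, omega, sigma_field):
    """Check that {X <= x} lies in the sigma-field for every real x.

    On a finite omega it suffices to test x in the range of X: for other x
    the event is either empty or equals the event of the largest range
    value below x. sigma_field is assumed to be a genuine sigma-field.
    """
    for x in {X[w] for w in omega}:
        event = frozenset(w for w in omega if X[w] <= x)
        if event not in sigma_field:
            return False
    return True

omega = {"H", "T"}
X = {"H": 1, "T": 0}            # Example 12.3
constant = {"H": 5, "T": 5}     # a constant function
trivial = {frozenset(), frozenset(omega)}

print(is_random_variable(X, omega, power_set(omega)))   # power set
print(is_random_variable(X, omega, trivial))            # trivial sigma-field
print(is_random_variable(constant, omega, trivial))
```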

Example 12.8 Let Ω = (0, 1], F = σ-field containing all the subintervals of the form (a, b], 0 ≤ a < b ≤ 1.
Define X : Ω → R by
X(ω) = 3ω + 1.

Claim 12.9 X is a random variable.


Note that 0 < ω ≤ 1, so range of X is (1, 4]. Hence for given x ∈ R,

1. If x ≤ 1, then {X ≤ x} = ∅.

2. If 1 < x ≤ 4, then $\{X \le x\} = \left(0, \frac{x-1}{3}\right]$.

3. If 4 < x, then {X ≤ x} = Ω.

Hence for each x ∈ R, {X ≤ x} ∈ F.

If you look at most of the books on probability theory they give the following definition of
random variable.

Definition 12.10 (Engineering or Practical Definition) A random variable is a function from Ω to R (or C).

Remark 12.11 When the sample space is finite or countably infinite, any function can be regarded as a random variable, in the sense that we can always take the power set as the σ-field. Recall that if the sample space is finite or countably infinite, we can define the probability of any event; hence Definition 12.10 is perfectly adequate in this case.
Difficulties arise when the sample space is uncountable and the random variable has a continuous range of possible values. Then it may not be possible to assign a probability to every event. So we cannot work with the power set, and we need a class of events which is rich enough to include all the events of interest. This is precisely the reason we need Definition 12.1. It is comforting to know that mathematical subtleties of this type do not arise in most physical applications.

Example 12.12 In an experiment involving two rolls of a die, the following are examples
of random variables:

1. The sum of the two rolls: X((i, j)) = i + j.

2. The number of sixes in the two rolls:
$$X((i, j)) = \begin{cases} 0 & \text{if } i \ne 6,\ j \ne 6 \\ 1 & \text{if exactly one of } i \text{ or } j \text{ is } 6 \\ 2 & \text{if } i = j = 6. \end{cases}$$

3. The second roll raised to the fifth power: $X((i, j)) = j^5$.

So this example tells us that with a single sample space we may have several random variables
sitting on it.

Starting with some random variables, we can at once make new ones by operating on them
in various ways.

Proposition 12.13 Let (Ω, F) be a measurable space.

1. If X and Y are random variables, then so are
$$X + Y, \quad X - Y, \quad XY, \quad \frac{X}{Y}\ (Y \ne 0),$$
and aX + bY, where a and b are two real numbers.

2. If X is a random variable and f : D ⊂ R → R is a Borel measurable function of one variable, then f(X) is also a random variable.

3. If X and Y are random variables and f : D ⊂ R² → R is a Borel measurable function of two variables, then f(X, Y) is also a random variable.
Lecture 13: Discrete Random Variables
30 January, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Definition 13.1 Let (Ω, F) be a measurable space and X : Ω → R be a random variable. The random variable X is called discrete if its range is a finite or countably infinite subset of R.

The random variables in Example 12.3 and Example 12.12 can take only a finite number of numerical values, and are therefore discrete, whereas the random variable in Example 12.8 has the range (1, 4], which is an uncountable set, and therefore it is not a discrete random variable.
If the sample space is finite, then it follows from the definition of a function that any random variable defined on the sample space necessarily has finite range. Therefore any random variable on a finite sample space is a discrete random variable.
If the sample space is countably infinite, then any random variable defined on the sample space has a range which is either finite or countably infinite. Therefore any random variable on a countably infinite sample space is a discrete random variable.

Example 13.2 In the random experiment of tossing a coin till you get a head, the sample
space is
Ω = {H, T H, T T H, T T T H, · · · }
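which is a countably infinite set. Define a random variable X as the number of tosses required to get the head; then
$$X(\omega) = \begin{cases} 1 & \text{if } \omega = H \\ 2 & \text{if } \omega = TH \\ 3 & \text{if } \omega = TTH \\ \ \vdots & \end{cases}$$
We may define a random variable Y as follows:
$$Y(\omega) = \begin{cases} 1 & \text{if } \omega = H \\ 0 & \text{otherwise.} \end{cases}$$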

Note that we may have a sample space which is uncountable but we may define discrete
random variables on it.


For example, consider the experiment of choosing a point x from the interval [−1, 1], so that Ω = [−1, 1]. The random variable that associates with x the numerical value
$$\operatorname{sgn}(x) = \begin{cases} -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}$$
is discrete.
If X is a discrete random variable and g : R → R is a Borel measurable function, then g(X) is also a discrete random variable.
True/False: If X is not a discrete random variable and g : R → R is a Borel measurable function, then g(X) is also not discrete.
The statement is False. Let Ω = (−1, 1), X(ω) = ω and
$$g(x) = \begin{cases} -1 & \text{if } x < 0 \\ 1 & \text{if } x \ge 0. \end{cases}$$
Then g(X) is discrete since its range is {−1, 1}.
First we focus exclusively on discrete random variables.

13.1 Probability Mass Function


The most important way to characterize a discrete random variable is through the probabilities of the values that it can take. For a discrete random variable X, these are captured by the probability mass function (pmf for short) of X, denoted fX(·). In particular, if x is any possible value of X, the probability mass of x, denoted fX(x), is the probability of the event {X = x} consisting of all outcomes that give rise to a value of X equal to x:

fX (x) = P {X = x}

Definition 13.3 A real-valued function f defined on R by f(x) = P(X = x) is called the pmf of X.

Example 13.4 Consider the experiment consisting of two independent tosses of a fair coin, and let X be the number of heads obtained. Then the pmf of X is
$$f_X(x) = \begin{cases} \frac{1}{4} & \text{if } x = 0 \text{ or } x = 2 \\ \frac{1}{2} & \text{if } x = 1 \\ 0 & \text{otherwise.} \end{cases}$$

The pmf captures the “point probabilities” or “point masses”, in the sense that P(X = x) is the mass attached at the point x. We may draw the pmf of X as follows:

[Figure: plot of the pmf of X, with masses at x = 0, x = 1 and x = 2]

Proposition 13.5 (Properties of pmf ) Let f be the pmf of a discrete random variable
X. Then it has the following properties

1. f (x) ≥ 0 for all x ∈ R.


2. {x ∈ R : f(x) > 0} is a finite or countably infinite subset of R.

3. $\sum_{x} f(x) = 1$, where the summation is over all x belonging to the range of X.

Proof:

1. This holds since P{X = x} ≥ 0 for any x.

2. This holds since the range of X is either finite or countably infinite.

3. As x ranges over all possible values of X, the events {X = x} are disjoint and form a partition of the sample space, that is,
$$\Omega = \bigcup_{x \in R(X)} \{X = x\}.$$
Now the result follows from the additivity and normalization axioms of the probability measure.

The pmf contains all the information required for the random variable, i.e., for any Borel set S ⊂ R we can compute the probability of the event {X ∈ S}:
$$P(X \in S) = \sum_{x \in S \cap R(X)} f_X(x) = \sum_{x \in S \cap R(X)} P\{X = x\}.$$

For example, if X is the number of heads obtained in two tosses of a fair coin, as above, the probability of at least one head is
$$P(X \ge 1) = \sum_{x=1}^{2} f_X(x) = \frac{3}{4},$$
since S = [1, ∞) and R(X) = {0, 1, 2}.
Calculating the PMF of X is conceptually straightforward. For each possible value x of X:

1. Collect all the possible outcomes that give rise to the event {X = x}

2. Add their probabilities to obtain fX (x).
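The two-step recipe above translates directly into code for a finite sample space. The following is a minimal sketch; the function name `pmf` and the dictionary representation of the probability measure are illustrative choices, not notation from the notes. It recomputes the pmf of Example 13.4 and the probability P(X ≥ 1) with exact fractions.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def pmf(outcome_probs, X):
    """Return {x: P(X = x)}: group outcomes by the value of X and add their probabilities."""
    f = defaultdict(Fraction)
    for outcome, prob in outcome_probs.items():
        f[X(outcome)] += prob
    return dict(f)

# Two independent tosses of a fair coin: four equally likely outcomes.
two_tosses = {"".join(t): Fraction(1, 4) for t in product("HT", repeat=2)}

# X = number of heads obtained.
heads = pmf(two_tosses, lambda outcome: outcome.count("H"))
at_least_one = sum(p for x, p in heads.items() if x >= 1)

print(heads)
print(at_least_one)  # P(X >= 1)
```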

Example 13.6 Let Ω = {ω1, ω2, ω3}, F = P(Ω), $P(\omega_i) = \frac{1}{3}$ for i = 1, 2, 3, and define random variables X, Y and Z as follows:

X(ω1 ) = 1, X(ω2 ) = 2, X(ω3 ) = 3;


Y (ω1 ) = 2, Y (ω2 ) = 3, Y (ω3 ) = 1;
Z(ω1 ) = 3, Z(ω2 ) = 1, Z(ω3 ) = 2.

Show that X, Y and Z have the same pmf.

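Solution: Note that R(X) = R(Y) = R(Z) = {1, 2, 3}. Moreover,
$$P\{X = 1\} = P(\omega_1) = \tfrac{1}{3}, \quad P\{X = 2\} = P(\omega_2) = \tfrac{1}{3}, \quad P\{X = 3\} = P(\omega_3) = \tfrac{1}{3},$$
$$P\{Y = 1\} = P(\omega_3) = \tfrac{1}{3}, \quad P\{Y = 2\} = P(\omega_1) = \tfrac{1}{3}, \quad P\{Y = 3\} = P(\omega_2) = \tfrac{1}{3},$$
$$P\{Z = 1\} = P(\omega_2) = \tfrac{1}{3}, \quad P\{Z = 2\} = P(\omega_3) = \tfrac{1}{3}, \quad P\{Z = 3\} = P(\omega_1) = \tfrac{1}{3}.$$
Hence X, Y and Z have the same pmf.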

The moral of Example 13.6 is that different random variables may have the same pmf.
Lecture 14: Standard Discrete Random Variables
4 February, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

The three properties in the Proposition 13.5 characterize the pmf.

Theorem 14.1 Let f : R → R be a function which satisfies the following three properties:

1. f (x) ≥ 0.

2. The set S := {x ∈ R : f(x) > 0} is a finite or countably infinite subset of R.

3. $\sum_{x \in S} f(x) = 1$.

Then there exists a probability space (Ω, F, P) and a discrete random variable X on it such that
f is the pmf of X.

Proof: Take Ω = S. Since S is at most countable, we may take F to be the power set of Ω. It is then enough to define the probability of each singleton, and we define the probability measure P by P({x}) = f(x) for x ∈ Ω. Define the random variable X by X(x) = x. The range of X is S, which is either finite or countably infinite. Note that the pmf of X is

fX(x) = P(X = x) = P({x}) = f(x) ∀x ∈ S.

For x ∈ Sᶜ both f and fX are zero, hence they agree everywhere on the real line. So f is the
pmf of X.

Remark 14.2 The above result assures us that statements like “Let X be a random variable
with the pmf f ” always make sense, even if we do not specify directly a probability space upon
which X is defined.

14.1 Standard Discrete Random Variables


Now we look at some standard random variables.


14.1.1 Discrete Uniform Random Variable:

If the random variable X takes finitely many values (say N) and the probability of taking each value is the same, then we call X a discrete uniform random variable. To standardize, we assume that X takes values in {1, 2, ..., N}. Then the pmf of X is
$$f(i) = P\{X = i\} = \frac{1}{N}, \quad i = 1, 2, \dots, N.$$
The positive integer N is called the parameter of the discrete uniform distribution. That is, if X has the pmf
$$P(X = i) = \frac{1}{10}, \quad i = 1, 2, \dots, 10,$$
then X is a discrete uniform random variable with parameter N = 10. Suppose Z is a random variable with the pmf
$$P(Z = i) = \frac{1}{11}, \quad i = 1, 2, \dots, 11.$$
Then Z is also a discrete uniform random variable, but with parameter N = 11.
It is like having small, medium, XL and XXL sizes of the same PETER ENGLAND shirt: the basic structure of all the shirts is the same, with different values of the parameters (length, size). In statistics, we usually deal with a family of distributions rather than a single distribution. This family is indexed by one or more parameters, which allow us to vary certain characteristics of the distribution while staying with one functional form. For example, we may specify the discrete uniform distribution as a reasonable choice to model a particular population, but we cannot precisely specify the number N. Then we deal with a parametric family of discrete uniform distributions with unspecified parameter N, where N ∈ N.

14.1.2 The Bernoulli Random Variable

Consider the toss of a coin, which comes up a head with probability p and a tail with probability 1 − p. The Bernoulli random variable takes the two values 1 and 0 depending on whether the outcome is a head or a tail. Let Ω = {H, T}, P(H) = p, P(T) = 1 − p where 0 ≤ p ≤ 1. Define X : Ω → R by X(H) = 1, X(T) = 0.
Question: What is the pmf of the Bernoulli random variable?
$$f_X(x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases} \tag{14.1}$$

Any random variable having pmf (14.1) is called a Bernoulli(p) random variable, where p is the parameter. If an experiment involves only two possible results, we say that we have a Bernoulli trial. The two possible results can be anything, e.g., “it rains” or “it doesn’t rain,” but we will often think in terms of a coin toss and refer to the two results as “head” (H) and “tail” (T). For all its simplicity, the Bernoulli random variable is very important. In practice, it is used to model generic probabilistic situations with just two outcomes such as:

1. The state of a telephone at a given time that can either be free or busy.

2. A person is either healthy or sick with a given disease.

3. The preference of a person who can be either for or against Narendra Modi.

14.1.3 Binomial Random Variable

Consider an experiment that consists of n independent tosses of a coin, in which the probability of a head (in each trial) is p (where 0 < p < 1). In this context, independence means that the events $A_1, A_2, \dots, A_n$ are independent, where $A_i$ = {i-th toss is a head}. For example, take n = 3; then any particular outcome (a 3-long sequence of heads and tails) that involves k heads and 3 − k tails has probability $p^k(1-p)^{3-k}$, e.g., $P(HHT) = P(HTH) = p^2(1-p)$. This formula extends to the case of a general number n of tosses: any particular outcome (an n-long sequence of heads and tails) that involves k heads and n − k tails has probability $p^k(1-p)^{n-k}$, for all k from 0 to n. We define a random variable X on Ω such that X(ω) is the number of heads in the n independent tosses. Then X takes values 0, 1, 2, ..., n. Let us compute its pmf.

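$$\begin{aligned} P(X = k) &= P(\text{exactly } k \text{ heads come up in an } n\text{-toss sequence}) \\ &= (\text{number of distinct } n\text{-toss sequences that contain exactly } k \text{ heads}) \times p^k (1-p)^{n-k} \\ &= \binom{n}{k} p^k (1-p)^{n-k} \\ &= \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}. \end{aligned}$$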

Note that the pmf must add to 1, thus we have
$$\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = 1.$$

So a Binomial random variable is determined by two parameters n and p, where n ∈ N is the number of independent trials and p ∈ (0, 1) is the probability of success in each trial.
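The counting argument behind the binomial pmf can be verified for small n by listing all n-toss sequences. This is a sketch under illustrative assumptions: the function names are hypothetical, and the values n = 3, p = 1/3 are chosen only for the demonstration.

```python
from fractions import Fraction
from itertools import product
from math import comb

def binomial_pmf(k, n, p):
    """Closed form: C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_pmf_by_enumeration(k, n, p):
    """Add the probabilities of all n-toss sequences with exactly k heads."""
    total = Fraction(0)
    for seq in product("HT", repeat=n):
        heads = seq.count("H")
        if heads == k:
            total += p**heads * (1 - p) ** (n - heads)
    return total

n, p = 3, Fraction(1, 3)
pmf_values = [binomial_pmf(k, n, p) for k in range(n + 1)]
enumerated = [binomial_pmf_by_enumeration(k, n, p) for k in range(n + 1)]

print(pmf_values)
print(sum(pmf_values))  # the pmf adds to 1
```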

14.1.4 Geometric Distribution

Let us consider a random experiment in which we toss a coin until we get the first head. In each toss the probability of getting a head is p, and we assume the tosses are independent. If X denotes the number of coin tosses required to get the first head, then X is a random variable which takes values 1, 2, .... The event {X = k} means k − 1 tails followed by a head, so the pmf of X is
$$P(X = k) = p(1-p)^{k-1}, \quad k = 1, 2, \dots$$
A random variable with the above pmf is called a geometric random variable with parameter p. The name geometric is motivated by the fact that the pmf resembles the geometric sequence $a, ar, ar^2, \dots$. So a geometric random variable is determined by one parameter p, where 0 < p < 1 is the probability of success (a head) in each trial.
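Because the geometric pmf is a geometric sequence with first term p and ratio 1 − p, its partial sums have the closed form 1 − (1 − p)^n, which tends to 1. The sketch below checks this with exact fractions; the function name `geometric_pmf` and the values p = 1/4, n = 10 are illustrative assumptions.

```python
from fractions import Fraction

def geometric_pmf(k, p):
    """P(X = k) = p (1 - p)^(k - 1): k - 1 tails followed by the first head."""
    return p * (1 - p) ** (k - 1)

p = Fraction(1, 4)
n = 10

# Sum of the first n terms of the geometric sequence p, p(1-p), p(1-p)^2, ...
partial = sum(geometric_pmf(k, p) for k in range(1, n + 1))

print(partial, 1 - (1 - p) ** n)
assert partial == 1 - (1 - p) ** n
```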

14.1.5 Poisson random variable

A random variable X with pmf
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots,$$
is called a Poisson random variable with parameter λ > 0.
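That the Poisson pmf adds to 1 follows from the series $e^{\lambda} = \sum_{k \ge 0} \lambda^k / k!$. A quick floating-point sanity check is sketched below; the function name `poisson_pmf`, the value λ = 2.5 and the truncation at 60 terms are illustrative assumptions.

```python
from math import exp, factorial, isclose

def poisson_pmf(k, lam):
    """P(X = k) = lam^k e^(-lam) / k!."""
    return lam**k * exp(-lam) / factorial(k)

lam = 2.5
# Truncated sum; the neglected tail is far below floating-point precision here.
partial_sum = sum(poisson_pmf(k, lam) for k in range(60))

print(partial_sum)
print(isclose(partial_sum, 1.0, rel_tol=1e-12))
```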
Lecture 15: Random Variables with density
8 February, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

15.1 Random variable with density or absolutely continuous random variable
Definition 15.1 A random variable X defined on a probability space (Ω, F, P) is called absolutely continuous if there is a nonnegative function fX defined on R, called the probability density function of X (sometimes just the density of X), or pdf for short, such that
$$P(X \in S) = \int_S f_X(x)\,dx$$
for every Borel subset S of R.

Examples of Borel sets: every singleton, every countable set, all types of intervals, open sets, closed sets, and their (finite or countable) unions or intersections, etc.
In particular, the probability that the value of X falls within an interval is
$$P(a \le X \le b) = \int_a^b f_X(x)\,dx,$$
and can be interpreted as the area under the graph of the pdf.


For any single value a, we have
$$P(X = a) = \int_a^a f_X(x)\,dx = 0.$$

For this reason, for an absolutely continuous random variable, including or excluding the endpoints of an interval has no effect on its probability:
P(a ≤ X ≤ b) = P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b).

Example 15.2 Show that the range of an absolutely continuous random variable is uncount-
able.

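Solution: Assume on the contrary that the range of X is countable, i.e., $R(X) = \{a_1, a_2, \dots\}$. Note that $(\{X = a_n\})_{n=1}^{\infty}$ is a countable partition of the sample space Ω. Hence
$$\Omega = \bigcup_{n=1}^{\infty} \{X = a_n\} \implies P(\Omega) = 1 = \sum_{n=1}^{\infty} P(X = a_n) = 0,$$
where the last equality follows from the fact that X is absolutely continuous, so that $P(X = a_n) = 0$ for each n. This is a contradiction.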
Question: Does it follow from Example 15.2 that the range of an absolutely continuous random variable necessarily contains an interval?
Answer: Example 15.2 does not exclude the possibility of the set of irrationals being the range of an absolutely continuous random variable. Rather, it is the integrability requirement on the density function which demands that the range must contain an interval. We shall see random variables whose range is an interval but which do not have a pdf.

15.1.1 PDF

Note that to qualify as a pdf, a function f defined on R must be nonnegative, and must also have the normalization property
$$\int_{-\infty}^{\infty} f(x)\,dx = 1.$$

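Example 15.3 Consider the following function
$$f(x) = \begin{cases} \dfrac{1}{2\sqrt{x}} & \text{if } 0 < x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
Is it a valid pdf? Clearly f ≥ 0 and
$$\int_{-\infty}^{\infty} f(x)\,dx = \int_0^1 \frac{1}{2\sqrt{x}}\,dx = \sqrt{x}\,\Big|_0^1 = 1.$$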

It is important to realize that even though the pdf is used to calculate probabilities of events, fX(x) is not the probability of any event. This is in contrast to the pmf, where fX(x) is indeed the probability of the event {X = x}. In particular, a pdf is not restricted to be less than or equal to one. In fact it can take arbitrarily large values. In Example 15.3,
$$f\left(\frac{1}{n^2}\right) = \frac{n}{2}, \quad n = 1, 2, \dots$$
For n ≥ 3 the values of f are bigger than 1, and by taking n large enough we may get arbitrarily large values of the pdf f.
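A short numerical illustration of this point, as a sketch only: the helper `midpoint_integral` is a hypothetical, hand-rolled midpoint rule, and the interval [1/4, 1] is chosen to stay away from the singularity of f at 0. The density of Example 15.3 takes values above 1, yet the probability it assigns to an interval is still at most 1; here P(1/4 ≤ X ≤ 1) = √1 − √(1/4) = 1/2.

```python
from math import isclose, sqrt

def f(x):
    """The pdf of Example 15.3: 1 / (2 sqrt(x)) on (0, 1], zero elsewhere."""
    return 1 / (2 * sqrt(x)) if 0 < x <= 1 else 0.0

def midpoint_integral(g, a, b, steps=10_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

# Values of the density at x = 1/n^2 grow like n/2.
values = {n: f(1 / n**2) for n in (1, 2, 3, 10)}
print(values)

# The probability of an interval is the area under the density.
prob = midpoint_integral(f, 0.25, 1.0)
print(prob)
```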

Interpretation of the pdf For an interval [x, x + δ] with very small length δ, we have
$$P(x \le X \le x + \delta) = \int_x^{x+\delta} f_X(t)\,dt \approx f_X(x) \cdot \delta \implies \frac{P(x \le X \le x + \delta)}{\delta} \approx f_X(x).$$
So we can view fX(x) as the “probability mass per unit length” near x, or in other words, it is the rate at which probability accumulates near the point x.

15.2 Some Standard Absolutely Continuous Random Variables
Continuous Uniform Random variable A random variable X that takes values in an interval [a, b] such that any two sub-intervals of equal length have the same probability is called a uniform random variable and is denoted by U[a, b] or U(a, b). Its pdf is given by
$$f(x) = \begin{cases} \dfrac{1}{b-a}, & \text{if } a \le x \le b \\ 0, & \text{otherwise.} \end{cases}$$

Example 15.4 Let X ∼ U[0, 1]. Find $P(X > \frac{1}{2})$.

Solution:
$$P\left(X > \frac{1}{2}\right) = \int_{1/2}^{\infty} f_X(x)\,dx = \int_{1/2}^{1} dx = \frac{1}{2}.$$

Exponential Random variable: A random variable X with the pdf
$$f(x) = \begin{cases} 0 & \text{if } x < 0 \\ \lambda e^{-\lambda x} & \text{if } x \ge 0 \end{cases}$$
is called an exponential random variable with parameter λ > 0.

Normal Random Variable A random variable with the following pdf
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \forall x \in \mathbb{R},$$
is called a normal random variable with parameters µ, σ², where σ is assumed to be positive.

[Figure: graph of the normal pdf with µ = 1 and σ = 1]

A normal random variable with µ = 0 and σ² = 1 is said to be a standard normal random variable.
Lecture 16: Distribution Function
11 February, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Definition 16.1 Let (Ω, F, P ) be a probability space and X be a random variable on it.
The distribution function (or cumulative distribution function CDF) of X, denoted by FX ,
is defined as
FX (x) = P (X ≤ x), ∀x ∈ R.

Note that the notion of distribution function is well-defined for every random variable (discrete, absolutely continuous or mixed).
In particular, for every x we have
$$F_X(x) = P(X \le x) = \begin{cases} \displaystyle\sum_{t \le x,\ t \in R(X)} f_X(t) & \text{if } X \text{ is discrete} \\[10pt] \displaystyle\int_{-\infty}^{x} f_X(t)\,dt & \text{if } X \text{ is absolutely continuous.} \end{cases}$$

Loosely speaking, the CDF FX (x) “accumulates” probability “up to” the point x.

Example 16.2 Let X be a discrete uniform random variable with parameter N. Determine its CDF.

Solution: Recall R(X) = {1, 2, ..., N}. Therefore
$$F_X(x) = \begin{cases} 0 & \text{if } x < 1 \\ \frac{1}{N} & \text{if } 1 \le x < 2 \\ \frac{2}{N} & \text{if } 2 \le x < 3 \\ \ \vdots & \quad \vdots \\ \frac{N-1}{N} & \text{if } N - 1 \le x < N \\ 1 & \text{if } x \ge N. \end{cases}$$

Exercise 16.3 Write down the distribution function of a Bernoulli random variable with probability of success 3/4.


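Solution: Recall that the pmf of a Bernoulli random variable with probability of success 3/4 is
$$f_X(x) = \begin{cases} \frac{3}{4} & \text{if } x = 1 \\ \frac{1}{4} & \text{if } x = 0 \\ 0 & \text{otherwise.} \end{cases} \tag{16.1}$$

Hence the distribution function is
$$F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{4} & \text{if } 0 \le x < 1 \\ 1 & \text{if } x \ge 1. \end{cases} \tag{16.2}$$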
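For a discrete random variable the CDF is obtained by summing the pmf over the support points not exceeding x. The sketch below applies this to the discrete uniform and Bernoulli(3/4) cases; the helper name `cdf_from_pmf`, the dictionary representation of the pmf and the choice N = 4 are illustrative assumptions.

```python
from fractions import Fraction

def cdf_from_pmf(pmf):
    """Return F with F(x) = sum of pmf[t] over support points t <= x."""
    def F(x):
        return sum((p for t, p in pmf.items() if t <= x), Fraction(0))
    return F

N = 4
uniform_cdf = cdf_from_pmf({i: Fraction(1, N) for i in range(1, N + 1)})
bernoulli_cdf = cdf_from_pmf({0: Fraction(1, 4), 1: Fraction(3, 4)})

# The CDF is a step function: constant between support points, jumping at them.
print([uniform_cdf(x) for x in (0.5, 1, 2.7, 4, 10)])
print([bernoulli_cdf(x) for x in (-1, 0, 0.5, 1)])
```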

Exercise 16.4 Let X ∼ U [−1, 1]. Determine its cdf.

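Solution: The pdf of X is
$$f(x) = \begin{cases} \frac{1}{2}, & \text{if } -1 \le x \le 1 \\ 0, & \text{otherwise.} \end{cases}$$

Hence the cdf is
$$F(x) = \begin{cases} 0 & \text{if } x < -1 \\ \displaystyle\int_{-1}^{x} \frac{1}{2}\,dt = \frac{x+1}{2} & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1. \end{cases} \tag{16.3}$$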

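Example 16.5 Let X ∼ exp(λ). Find its distribution function.

Solution: The pdf of X is
$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x \ge 0 \\ 0, & \text{otherwise.} \end{cases}$$

Hence the cdf is
$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ \displaystyle\int_0^x \lambda e^{-\lambda t}\,dt = -e^{-\lambda t}\,\Big|_0^x = 1 - e^{-\lambda x} & \text{if } x \ge 0. \end{cases} \tag{16.4}$$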

We observe the following properties of the distribution function.



1. If X is a discrete random variable, then FX is discontinuous, with jumps occurring at the values of positive probability.

2. If X is an absolutely continuous random variable, then FX is a continuous function of x. This follows from the fundamental theorem of calculus.

Theorem 16.6 (Properties of Distribution function) Let X be a random variable and F be the distribution function of X. Then F has the following properties.

1. F is non-decreasing.

2. F is right-continuous.

3. $\lim_{x \to -\infty} F(x) = 0$, $\lim_{x \to +\infty} F(x) = 1$.

Proof:

1. For $x_1 \le x_2$, $\{X \le x_1\} \subseteq \{X \le x_2\}$. Hence by monotonicity of the probability measure, we have $P\{X \le x_1\} = F(x_1) \le F(x_2) = P\{X \le x_2\}$.

2. Let us recall the meaning of right-continuity of a real-valued function in terms of sequences. Let f : R → R and c ∈ R; we say f is right-continuous at c if for any sequence $(x_n)$ in [c, ∞) such that $x_n \downarrow c$, we have $f(x_n) \to f(c)$.
Let $x_n \downarrow c$. Set $A_n = \{X \le x_n\}$. Then $A_1 \supseteq A_2 \supseteq \cdots$.
We claim that $\{X \le c\} = \bigcap_{n=1}^{\infty} A_n$. To see this, let ω ∈ {X ≤ c}, i.e., X(ω) ≤ c. Since $c \le x_n$ for each n, $X(\omega) \le x_n$ for each n. Hence $\omega \in A_n$ for each n, i.e., $\omega \in \bigcap_{n=1}^{\infty} A_n$. Thus $\{X \le c\} \subseteq \bigcap_{n=1}^{\infty} A_n$.
To see the other inclusion, let $\omega \in \bigcap_{n=1}^{\infty} A_n$, i.e., $\omega \in A_n$ for each n, that is, $X(\omega) \le x_n$ for each n. Since $x_n \to c$, taking the limit in the above inequality we get X(ω) ≤ c.
Therefore the continuity property of the probability measure gives us
$$\lim_{n \to \infty} F(x_n) = \lim_{n \to \infty} P\{X \le x_n\} = \lim_{n \to \infty} P(A_n) = P\left(\bigcap_{n=1}^{\infty} A_n\right) = P\{X \le c\} = F(c).$$
Lecture 17: Properties of Distribution Function
12 February, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

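Continuation of Proof of Theorem 16.6: First we show that $\lim_{x \to +\infty} F(x) = 1$.

Let us recall the meaning of $\lim_{x \to \infty} f(x) = l$ in terms of sequences. Let f : R → R; we say $\lim_{x \to \infty} f(x) = l$ if for any sequence $(x_n)$ in R such that $x_n \uparrow +\infty$, we have $f(x_n) \to l$.

Let $x_n \uparrow +\infty$ and set $A_n = \{X \le x_n\}$. Then $A_1 \subseteq A_2 \subseteq \cdots$ (since $x_n \le x_{n+1}$ for all n). Also
$$\bigcup_{n=1}^{\infty} A_n = \Omega.$$
Indeed, since each $A_n$ is a subset of Ω, $\bigcup_{n=1}^{\infty} A_n \subseteq \Omega$. For the other inclusion, let ω ∈ Ω. Since $x_n \uparrow +\infty$, for X(ω) there exists $n_0$ such that $x_n \ge X(\omega)$ for all $n \ge n_0$. Hence $\omega \in A_n$ for all $n \ge n_0$. That is, $\Omega \subseteq \bigcup_{n=1}^{\infty} A_n$. By the continuity property of the probability measure, we have
$$\lim_{n \to \infty} F(x_n) = \lim_{n \to \infty} P\{X \le x_n\} = \lim_{n \to \infty} P(A_n) = P\left(\bigcup_{n=1}^{\infty} A_n\right) = P(\Omega) = 1.$$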

Now we show that $\lim_{x \to -\infty} F(x) = 0$.

Let us recall the meaning of $\lim_{x \to -\infty} f(x) = l$ in terms of sequences. Let f : R → R; we say $\lim_{x \to -\infty} f(x) = l$ if for any sequence $(x_n)$ in R such that $x_n \downarrow -\infty$, we have $f(x_n) \to l$.

Let $x_n \downarrow -\infty$ and set $A_n = \{X \le x_n\}$. Then $A_1 \supseteq A_2 \supseteq \cdots$ (since $x_n \ge x_{n+1}$ for all n). Also
$$\bigcap_{n=1}^{\infty} A_n = \emptyset.$$
Indeed, let $\omega \in \bigcap_{n=1}^{\infty} A_n$, i.e., $X(\omega) \le x_n$ for all n. Since $x_n \downarrow -\infty$, taking the limit in the above inequality we obtain X(ω) ≤ −∞. But by definition X(ω) must be a real number, hence there is no element in $\bigcap_{n=1}^{\infty} A_n$. By the continuity property of the probability measure, we have
$$\lim_{n \to \infty} F(x_n) = \lim_{n \to \infty} P\{X \le x_n\} = \lim_{n \to \infty} P(A_n) = P\left(\bigcap_{n=1}^{\infty} A_n\right) = P(\emptyset) = 0.$$

Theorem 17.1 (Characterization of distribution function) Let F : R → [0, 1] be a function which has the following properties.

1. $\lim_{x \to -\infty} F(x) = 0$, $\lim_{x \to +\infty} F(x) = 1$.

2. F is non-decreasing.

3. F is right-continuous.

Then there exists a probability space (Ω, F, P) and a random variable X defined on it such that F is the distribution function of the random variable X.

Example 17.2 Let F(x−) denote the left-hand limit of F at the point x. Show that
$$F(x-) = \lim_{y \uparrow x} F(y) = P\{X < x\}.$$

Solution: Let us recall the meaning of the left-hand limit of a real-valued function in terms of sequences. Let f : R → R and c ∈ R; we say the left-hand limit of f at c exists if there exists some l ∈ R such that for any sequence $(x_n)$ in (−∞, c) with $x_n \uparrow c$, we have $f(x_n) \to l$.
Let $x_n \uparrow x$ with $x_n < x$ for all n. Set $A_n = \{X \le x_n\}$. Then $A_1 \subseteq A_2 \subseteq \cdots$ and we claim that $\{X < x\} = \bigcup_{n=1}^{\infty} A_n$. Let $\omega \in \bigcup_{n=1}^{\infty} A_n$, i.e., $\omega \in A_n$ for some n, i.e., $X(\omega) \le x_n$. Since $x_n < x$ for all n, we get X(ω) < x. That is, $\bigcup_{n=1}^{\infty} A_n \subseteq \{X < x\}$. For the other inclusion, let ω ∈ {X < x}, i.e., X(ω) < x. Since $x_n \uparrow x$, there exists $n_0$ such that $X(\omega) \le x_n < x$ for all $n \ge n_0$. Hence $\omega \in A_n$ for all $n \ge n_0$. That is, $\{X < x\} \subseteq \bigcup_{n=1}^{\infty} A_n$.

By the continuity property of the probability measure, we have
$$F(x-) = \lim_{n \to \infty} F(x_n) = \lim_{n \to \infty} P\{X \le x_n\} = \lim_{n \to \infty} P(A_n) = P\left(\bigcup_{n=1}^{\infty} A_n\right) = P\{X < x\}.$$

From Example 17.2 it follows that for x ∈ R,

P{X = x} = P{X ≤ x} − P{X < x} = F(x) − F(x−).

Consequently, let X be a discrete random variable with distribution function F. Define f : R → [0, 1] by

f(x) = F(x) − F(x−).

Then f is the probability mass function (pmf) of X.


Let X be an absolutely continuous random variable with distribution function F. Then the pdf of X can be obtained by differentiating F:

f(x) = F′(x),

where equality is valid for those x at which the pdf is continuous.
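The relation f = F′ can be illustrated numerically with the exponential distribution, whose cdf 1 − e^{−λx} and pdf λe^{−λx} appear in Example 16.5. The sketch below uses a central difference as a stand-in for the derivative; the helper `central_difference`, the step h and the value λ = 2 are illustrative assumptions, and the check is only made at points where the pdf is continuous.

```python
from math import exp, isclose

def exponential_cdf(x, lam):
    return 1 - exp(-lam * x) if x >= 0 else 0.0

def exponential_pdf(x, lam):
    return lam * exp(-lam * x) if x >= 0 else 0.0

def central_difference(F, x, h=1e-5):
    """Numerical approximation of F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

lam = 2.0
slopes = {}
for x in (0.5, 1.0, 3.0):
    slopes[x] = central_difference(lambda t: exponential_cdf(t, lam), x)
    # The slope of the cdf and the value of the pdf should nearly coincide.
    print(x, slopes[x], exponential_pdf(x, lam))
```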


Lecture 18: Distribution Function
13 February, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Example 18.1 Suppose X is a random variable with the following distribution function.
$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{4} & \text{if } 0 \le x < 1 \\ \frac{3}{4} & \text{if } 1 \le x < 2 \\ 1 & \text{if } x \ge 2. \end{cases}$$

Determine whether X is a discrete random variable. If yes, find its pmf.

Solution: Recall that if X is a random variable with a pdf, then its distribution function is continuous everywhere on R. So if a distribution function is discontinuous even at a single point, then the corresponding random variable cannot have a pdf.
Is this enough to conclude that X is a discrete random variable? Recall that the distribution function of a discrete random variable has jumps precisely at the points of positive probability. For the given distribution function, the points of jump are x = 0, 1, 2:

P(X = 0) = F(0) − F(0−) = 1/4 − 0 = 1/4
P(X = 1) = F(1) − F(1−) = 3/4 − 1/4 = 1/2
P(X = 2) = F(2) − F(2−) = 1 − 3/4 = 1/4

Also note that

P(X = 0) + P(X = 1) + P(X = 2) = 1.
So there is no other point which has positive probability, hence X is a discrete random
variable.

Example 18.2 Suppose X is a random variable with the following cdf
$$F(x) = \begin{cases} 0 & \text{if } x < 1 \\ 1 - (1-p)^k & \text{if } k \le x < k + 1, \end{cases}$$
where 0 < p < 1 and k = 1, 2, 3, .... Determine its pmf.


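Solution: Note that F has jumps exactly at k = 1, 2, .... Also
$$\begin{aligned} P(X = k) = F(k) - F(k-) &= [1 - (1-p)^k] - [1 - (1-p)^{k-1}] = (1-p)^{k-1} - (1-p)^k \\ &= (1-p)^{k-1}\, p, \quad \text{for } k = 1, 2, \dots, \end{aligned}$$
which is the pmf of a geometric random variable with parameter p.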

Recall that we say that a random variable X has a pdf (or density) if there exists a
non-negative, integrable function f such that

\[ F_X(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt, \quad \forall x \in \mathbb{R} \tag{18.1} \]

We note that (18.1) does not define f uniquely, since we can always change the value of
a function at a finite number of points without changing the integral of the function over
intervals. One typical way to define f is by setting f(x) = F′(x) whenever F′(x) exists and
f(x) = 0 otherwise. This defines a density of F provided that F is everywhere continuous
and that F′ exists and is continuous at all but a finite number of points.

Example 18.3 Let X be a random variable with the distribution function

\[
F(x) = \begin{cases}
0 & \text{if } x \le 0 \\
x & \text{if } 0 < x \le 1 \\
1 & \text{if } x > 1
\end{cases}
\]

Determine whether X is a discrete random variable or an absolutely continuous random
variable. Accordingly find its pmf or pdf.

Solution: Since F is continuous everywhere on R, X cannot be a discrete random
variable. Is this enough to say that X has a pdf? Observe that F is not differentiable at the two
points x = 0 and x = 1:

\[
\lim_{h \uparrow 0} \frac{F(0+h) - F(0)}{h} = \lim_{h \uparrow 0} \frac{0 - 0}{h} = 0
\quad \text{and} \quad
\lim_{h \downarrow 0} \frac{F(0+h) - F(0)}{h} = \lim_{h \downarrow 0} \frac{h - 0}{h} = 1.
\]

Also

\[
\lim_{h \uparrow 0} \frac{F(1+h) - F(1)}{h} = \lim_{h \uparrow 0} \frac{1 + h - 1}{h} = 1
\quad \text{and} \quad
\lim_{h \downarrow 0} \frac{F(1+h) - F(1)}{h} = \lim_{h \downarrow 0} \frac{1 - 1}{h} = 0.
\]

So we can say that

\[
F'(x) = \begin{cases}
0 & \text{if } x < 0 \\
1 & \text{if } 0 < x < 1 \\
0 & \text{if } x > 1
\end{cases}
\]

As F′ exists and is continuous everywhere except at the two points x = 0, 1, the following
is our candidate for the pdf:

\[
f(x) = \begin{cases}
0 & \text{if } x \le 0 \\
1 & \text{if } 0 < x < 1 \\
0 & \text{if } x \ge 1
\end{cases}
\]

Recall that this is the pdf of the continuous uniform random variable on the interval (0, 1).

Example 18.4 Let X be the random variable with the distribution function

\[ F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt. \]

Determine its pdf.

Solution: Note that the integrand $\frac{1}{\sqrt{2\pi}} e^{-t^2/2}$ is continuous everywhere on R; hence, by the
fundamental theorem of calculus, F is a differentiable function on R. Hence the pdf is
$f(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$, which is the pdf of a standard normal random variable.

Exercise 18.5 Let X be a random variable with the following cumulative distribution function:

\[
F(x) = \begin{cases}
0 & x < 0 \\
x^2 & 0 \le x < \frac{1}{2} \\
\frac{3}{4} & \frac{1}{2} \le x < 1 \\
1 & x \ge 1
\end{cases}
\]

What type of random variable is X? Accordingly determine its pmf or pdf. Also compute
$P\left(\frac{1}{4} \le X < 1\right)$.

Solution: X is not absolutely continuous, since F is discontinuous at 1/2 and 1.
Also, X is not discrete, since

P (X = 1/2) = F (1/2) − F (1/2−) = 3/4 − 1/4 = 1/2
P (X = 1) = F (1) − F (1−) = 1 − 3/4 = 1/4

X has point masses only at these two points and P (X = 1) + P (X = 1/2) = 3/4 < 1. This means that
the remaining 1/4 unit of mass is distributed over the interval [0, 1/2) according to the density 2x, i.e.,

\[ P(0 \le X < 1/2) = \int_0^{1/2} 2x\,dx = \frac{1}{4}. \]

This is an example of a random variable of mixed type. Now

\[
P\left(\tfrac{1}{4} \le X < 1\right) = P(X < 1) - P\left(X < \tfrac{1}{4}\right)
= F(1-) - F\left(\tfrac{1}{4}-\right)
= \frac{3}{4} - \left(\frac{1}{4}\right)^2 = \frac{11}{16}.
\]

Theorem 18.6 The set of discontinuity points of a distribution function F is at most countable.

Example 18.7 If F is continuous on R, then the set of discontinuities is the empty set, which is a
finite set of cardinality 0. Examples: the continuous uniform, exponential and normal distributions.
The Bernoulli and binomial distributions have discontinuities at finitely many points.
The geometric and Poisson distributions have discontinuities at countably infinitely many points.
Lecture 19: Function of random Variable
15 February, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Theorem 19.1 If X is a random variable on a probability space (Ω, F, P ) and f : D ⊂ R → R
is a Borel measurable function such that R(X) ⊂ D (R(X) denotes the range of X),
then f (X) is a random variable on (Ω, F, P ).

Often, if we are able to model a phenomenon in terms of a random variable X, we will also
be concerned with the behavior of functions of X. We now study techniques that allow us to
obtain complete information (i.e., the probability distributions) about functions
of X that may be of interest.

19.1 PMF of function of random variable


We have seen in Lecture 13 that if X is a discrete random variable then f (X) is necessarily
discrete. So one can ask how to compute the pmf of f (X).

Example 19.2 Let X be a random variable with pmf

\[ f_X(k) = P(X = k) = \frac{1}{9}, \quad k = -4, -3, -2, \dots, 3, 4. \]

Find the pmf of |X|.

Solution: Let Y = |X|. The range of Y is {0, 1, · · · , 4}. We want to find the pmf of Y , i.e.,
P (Y = k) for k = 0, 1, 2, 3, 4.

\[
\{Y = 0\} = \{X = 0\} \implies P(Y = 0) = \frac{1}{9} = \sum_{i : |i| = 0} P(X = i)
\]
\[
\{Y = k\} = \{X = k\} \cup \{X = -k\} \implies P(Y = k) = \frac{2}{9} = \sum_{i : |i| = k} P(X = i), \quad k = 1, 2, 3, 4.
\]

Proposition 19.3 If X is a discrete random variable then we can obtain the pmf of Y = g(X)
as follows:

\[ P\{Y = y\} = \sum_{x : g(x) = y} P\{X = x\} \]

Proof: Let X : Ω → R be a discrete random variable with pmf $f_X$ and g : R → R be a
function. Then the pmf of Z = g(X) can be obtained as follows:

\begin{align*}
f_Z(z) &:= P\{Z = z\} \\
&= P\{\omega \in \Omega : Z(\omega) = z\} \\
&= P\{\omega \in \Omega : g(X(\omega)) = z\} \\
&= P\Big( \bigcup_{x : g(x) = z} \{\omega \in \Omega : X(\omega) = x\} \Big) && \text{(see the figure below)} \\
&= \sum_{x : g(x) = z} P\{X = x\} && \text{(by additivity of the probability measure)}
\end{align*}

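[Figure: the maps X : Ω → R and g : R → R. The events {X = x1}, {X = x2}, {X = x3} in Ω are sent by X to the points x1, x2, x3, which g sends to the single point z; together they make up the event {Z = z}, where Z = g ◦ X = g(X).]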

In the above figure, we have assumed that only three numbers x1 , x2 , x3 from the range of the
random variable X are mapped to the number z under the transformation g. For each
i = 1, 2, 3, the event {X = xi } is a subset of the event {Z = z} (suppose not, i.e., there
exist i ∈ {1, 2, 3} and ω ∈ {X = xi } such that ω ∉ {Z = z}. Then Z(ω) ≠ z and
X(ω) = xi . Therefore Z(ω) = g(X(ω)) = g(xi ) = z, which is a contradiction). Also
note that the three events {X = xi }, i = 1, 2, 3, are pairwise disjoint (if not, then there exists
ω which belongs to {X = xi } ∩ {X = xj } for some i ≠ j. Then X(ω) = xi ≠ xj = X(ω), which is
absurd). Also, if ω ∈ {Z = z} then X(ω) ∈ {x1 , x2 , x3 }, i.e., $\omega \in \bigcup_{i=1}^{3} \{X = x_i\}$. This shows
that {ω ∈ Ω : g(X(ω)) = z} is the disjoint union $\bigcup_{x : g(x) = z} \{\omega \in \Omega : X(\omega) = x\}$. The
figure illustrates this point.
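
Proposition 19.3 translates directly into a short routine: group the points x by the value g(x) and add up their probabilities. The sketch below is illustrative only; the helper name `pmf_of_function` and the dictionary representation of a pmf are assumptions, not notation from these notes. It recomputes Example 19.2.

```python
from collections import defaultdict
from fractions import Fraction

def pmf_of_function(pmf_x, g):
    """Return the pmf of g(X), given the pmf of X as a dict {x: P(X = x)}."""
    pmf_y = defaultdict(Fraction)   # missing keys start at Fraction(0)
    for x, p in pmf_x.items():
        pmf_y[g(x)] += p            # add P(X = x) to the bucket y = g(x)
    return dict(pmf_y)

# Example 19.2: X uniform on {-4, ..., 4} and Y = |X|.
pmf_x = {k: Fraction(1, 9) for k in range(-4, 5)}
pmf_abs = pmf_of_function(pmf_x, abs)
print(pmf_abs)
```

The same helper applied with `g = lambda k: n - k` to a binomial pmf gives a quick check of the kind of computation done by hand in Example 19.4.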

Example 19.4 Let X ∼ B(n, p). Consider the random variable n − X. Find its pmf.

Solution: Since X takes values in {0, 1, · · · , n}, Y = n − X also takes values in
{0, 1, 2, · · · , n}. Also, for given k ∈ {0, 1, 2, · · · , n}, Y = k ⇐⇒ X = n − k. Hence

\[
P(Y = k) = P(X = n - k) = \binom{n}{n-k} p^{n-k} (1-p)^{n-(n-k)} = \binom{n}{k} p^{n-k} (1-p)^{k}
\]

So Y ∼ B(n, 1 − p).

Example 19.5 Suppose X ∼ N (0, 1) and

\[
g(x) = \begin{cases}
-1 & \text{if } x < 0 \\
0 & \text{if } x = 0 \\
1 & \text{if } x > 0.
\end{cases}
\]

Determine the pmf of g(X).

Solution: Let Y = g(X). Then Y takes values in {−1, 0, 1} and

P (Y = −1) = P (X < 0) = 1/2.
P (Y = 0) = P (X = 0) = 0.
P (Y = 1) = P (X > 0) = 1/2.

19.2 PDF of function of random variable


If X is a random variable with density and g is a real-valued function of a real variable, then
g(X) may not have a pdf (see Example 19.5). Even if we take g to be a continuous function,
g(X) may still fail to have a pdf.

Example 19.6 Let X ∼ exp(5). Find the pdf of the random variable min{X, 10} (if it
exists).

Solution: First we determine the distribution function of the random variable Y := min{X, 10}.
For x ∈ R,

{Y ≤ x} = {X ≤ x} ∪ {10 ≤ x}

Note that if 10 > x then {10 ≤ x} = ∅, and if 10 ≤ x then {10 ≤ x} = Ω. Hence

\[
\{Y \le x\} = \begin{cases}
\{X \le x\} & \text{if } x < 10 \\
\Omega & \text{if } x \ge 10
\end{cases}
\]

Hence the distribution function of Y , denoted by $F_Y$, is

\[
F_Y(x) = \begin{cases}
F_X(x) & \text{if } x < 10 \\
1 & \text{if } x \ge 10
\end{cases}
= \begin{cases}
0 & \text{if } x < 0 \\
1 - e^{-5x} & \text{if } 0 \le x < 10 \\
1 & \text{if } x \ge 10
\end{cases}
\]

Since $F_Y$ is discontinuous at x = 10 (the jump there is $P(Y = 10) = P(X \ge 10) = e^{-50}$), Y cannot have a pdf.


Lecture 20: Function of random Variable
16 February, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Example 20.1 Let X be a random variable with density f . Find the pdf of the random variable
$X^2$.

Solution: For x ∈ R,

\[
\{X^2 \le x\} = \begin{cases}
\emptyset & \text{if } x < 0 \\
\{-\sqrt{x} \le X \le \sqrt{x}\} & \text{if } x \ge 0
\end{cases}
\]

Hence the distribution function of $X^2$ is

\[
F_{X^2}(x) = \begin{cases}
0 & \text{if } x < 0 \\
F(\sqrt{x}) - F((-\sqrt{x})-) & \text{if } x \ge 0
\end{cases}
\]

Since F is the distribution function of an absolutely continuous random variable,
F is continuous on R, and we get

\[
F_{X^2}(x) = \begin{cases}
0 & \text{if } x < 0 \\
F(\sqrt{x}) - F(-\sqrt{x}) & \text{if } x \ge 0
\end{cases}
\]

and by differentiation we see that

\[
f_{X^2}(x) = \begin{cases}
0 & \text{if } x < 0 \\
\frac{1}{2\sqrt{x}} F'(\sqrt{x}) + \frac{1}{2\sqrt{x}} F'(-\sqrt{x}) = \frac{1}{2\sqrt{x}} \left[ f(\sqrt{x}) + f(-\sqrt{x}) \right] & \text{if } x > 0
\end{cases}
\]

Although the above formula is valid in general, our derivation depended on differentiation, which may
not be valid at all points.
If we have a concrete random variable X, then we know $F_{X^2}$ explicitly. Hence we know
exactly the points at which $F_{X^2}$ is not differentiable, and there is no problem in the
above procedure. For example, let X ∼ U [0, 1]. Then the distribution function of $X^2$ is

\[
F_{X^2}(x) = P(X^2 \le x) = \begin{cases}
P(\emptyset) = 0 & \text{if } x < 0 \\
P(-\sqrt{x} \le X \le \sqrt{x}) = F_X(\sqrt{x}) - F_X(-\sqrt{x}) = F_X(\sqrt{x}) & \text{if } x \ge 0
\end{cases}
\]

We know that

\[
F_X(x) = \begin{cases}
0 & \text{if } x < 0 \\
x & \text{if } 0 \le x \le 1 \\
1 & \text{if } x > 1
\end{cases}
\tag{20.1}
\]

Hence

\[
F_{X^2}(x) = \begin{cases}
0 & \text{if } x < 0 \\
\sqrt{x} & \text{if } 0 \le x \le 1 \\
1 & \text{if } x > 1
\end{cases}
\]

Therefore the density of $X^2$ is

\[
f_{X^2}(x) = \begin{cases}
0 & \text{if } x \le 0 \\
\frac{1}{2\sqrt{x}} & \text{if } 0 < x < 1 \\
0 & \text{if } x \ge 1
\end{cases}
\]
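
The distribution function $F_{X^2}(x) = \sqrt{x}$ on [0, 1] obtained for X ∼ U [0, 1] can be sanity-checked by simulation. This is only a rough numerical sketch: the seed, the sample size and the helper `empirical_cdf` are arbitrary illustrative choices, and agreement is expected only up to sampling error.

```python
import math
import random

rng = random.Random(0)                      # fixed seed for reproducibility
samples = [rng.random() ** 2 for _ in range(100_000)]   # draws of X^2, X ~ U[0, 1)

def empirical_cdf(data, x):
    """Fraction of observations that are <= x."""
    return sum(1 for v in data if v <= x) / len(data)

# Compare the empirical CDF of X^2 with sqrt(x) at a few points.
checks = {x: (empirical_cdf(samples, x), math.sqrt(x)) for x in (0.04, 0.25, 0.64)}
for x, (emp, exact) in checks.items():
    print(x, round(emp, 4), exact)
```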

An alternative way to derive the density of an absolutely continuous random variable X is to
show that the distribution function F can be written in the form (18.1) for some nonnegative
function f . Then f is necessarily a density of X. This method is essentially equivalent to the
previous one but more complicated than differentiation. However, it is rigorous and avoids
special consideration of the points where F′(x) fails to exist.

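Example 20.2 Let X be a random variable with density f . Find the pdf of the random variable
$X^2$.

Solution: As in Example 20.1, we get

\[
F_{X^2}(x) = \begin{cases}
0 & \text{if } x < 0 \\
F(\sqrt{x}) - F(-\sqrt{x}) & \text{if } x \ge 0
\end{cases}
\]

To give an elementary but completely rigorous proof, we try to express $F_{X^2}(x)$ in the form

\[ \int_{-\infty}^{x} g(t)\,dt, \quad \forall x \in \mathbb{R}. \]

For x < 0, $F_{X^2}(x) = \int_{-\infty}^{x} 0\,dt$. For x ≥ 0,

\begin{align*}
F_{X^2}(x) &= F(\sqrt{x}) - F(-\sqrt{x}) \\
&= \int_{-\infty}^{\sqrt{x}} f(t)\,dt - \int_{-\infty}^{-\sqrt{x}} f(t)\,dt \\
&= \int_{-\sqrt{x}}^{\sqrt{x}} f(t)\,dt \\
&= \int_{-\sqrt{x}}^{0} f(t)\,dt + \int_{0}^{\sqrt{x}} f(t)\,dt \\
&= -\int_{x}^{0} f(-\sqrt{u}) \frac{1}{2\sqrt{u}}\,du + \int_{0}^{x} f(\sqrt{u}) \frac{1}{2\sqrt{u}}\,du
&& (t = -\sqrt{u} \text{ and } t = \sqrt{u} \text{ respectively}) \\
&= \int_{0}^{x} \frac{1}{2\sqrt{u}} \left[ f(\sqrt{u}) + f(-\sqrt{u}) \right] du \\
&= \int_{-\infty}^{x} g(u)\,du,
\end{align*}

where

\[
g(u) = \begin{cases}
0 & \text{if } u \le 0 \\
\frac{1}{2\sqrt{u}} \left[ f(\sqrt{u}) + f(-\sqrt{u}) \right] & \text{if } u > 0
\end{cases}
\]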

20.1 Simulating a random variable

Let X have continuous CDF $F_X$. If $F_X$ is strictly increasing, then $F_X^{-1}$ (the inverse of the CDF
$F_X$) is well defined by

\[ F_X^{-1}(y) = x \iff F_X(x) = y. \tag{20.2} \]

[Figure: graph of a strictly increasing CDF $F_X$ and its inverse.]

However, if $F_X$ is constant on some interval, then $F_X^{-1}$ is not well defined by (20.2), as in the
figure below.

[Figure: graph of a CDF $F_X$ that is constant, equal to y, on an interval [x1 , x2 ]. Any x satisfying x1 ≤ x ≤ x2 satisfies $F_X(x) = y$.]

This problem is resolved by defining $F_X^{-1}(y)$ for 0 < y < 1 by

\[ F_X^{-1}(y) = \inf\{x : F_X(x) \ge y\}, \tag{20.3} \]

a definition that agrees with (20.2) when $F_X$ is strictly increasing and provides an $F_X^{-1}$
that is single-valued even when $F_X$ is not strictly increasing. Using this definition, we have
$F_X^{-1}(y) = x_1$ in the figure above. At the end points of the range of y, $F_X^{-1}$ can also be defined.
We define $F_X^{-1}(1) = +\infty$ if $F_X(x) < 1$ for all x (because the infimum of the empty set is taken to be
+∞) and $F_X^{-1}(0) = -\infty$ (because the infimum of (−∞, ∞) is −∞).
Also, since $F_X$ is continuous, $F_X^{-1}$ is a strictly increasing function on [0, 1] (it is bounded
above if $F_X$ is eventually constant equal to 1; otherwise it tends to +∞ as y → 1).
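
Definition (20.3) can be evaluated numerically by bisection whenever a bracketing interval is known. The sketch below is an illustration under stated assumptions: the routine `generalized_inverse`, the bracket arguments `lo` and `hi`, and the piecewise-linear CDF `F` with a flat stretch on [0.5, 1.5] are all hypothetical and do not come from the notes.

```python
def generalized_inverse(cdf, y, lo, hi, iterations=60):
    """Approximate inf{x : cdf(x) >= y} by bisection.

    Assumes cdf is nondecreasing and cdf(lo) < y <= cdf(hi).
    """
    if not (cdf(lo) < y <= cdf(hi)):
        raise ValueError("need cdf(lo) < y <= cdf(hi)")
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if cdf(mid) >= y:
            hi = mid    # mid is in {x : F(x) >= y}, so the infimum is <= mid
        else:
            lo = mid    # mid lies strictly below the infimum
    return hi

def F(x):
    """A hypothetical continuous CDF that is constant (= 0.5) on [0.5, 1.5]."""
    if x < 0:
        return 0.0
    if x <= 0.5:
        return x
    if x <= 1.5:
        return 0.5
    if x <= 2:
        return x - 1.0
    return 1.0

q_flat = generalized_inverse(F, 0.5, -1.0, 3.0)    # left end of the flat stretch
q_other = generalized_inverse(F, 0.75, -1.0, 3.0)  # ordinary inverse: F(1.75) = 0.75
print(q_flat, q_other)
```

On the flat stretch every x in [0.5, 1.5] solves F(x) = 0.5, and the routine returns the left end point, in agreement with $F_X^{-1}(y) = x_1$ above.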
Lecture 21: Simulation of Random Variables
18 February, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Theorem 21.1 Let X have continuous CDF F . Then the random variable F (X) is a uniform
random variable on (0, 1).

Proof: Set Y = F (X). For 0 < y < 1,

\begin{align*}
P(Y \le y) &= P(F(X) \le y) \\
&= P(F^{-1}(F(X)) \le F^{-1}(y)) && \text{(here } F^{-1} \text{ is as in (20.3), hence strictly increasing)} \\
&= P(X \le F^{-1}(y)) && \text{(by Claim 21.2 below)} \\
&= F(F^{-1}(y)) \\
&= y && \text{(by Claim 21.3 below)}
\end{align*}

Claim 21.2 $P(F^{-1}(F(X)) \le F^{-1}(y)) = P(X \le F^{-1}(y))$.

If F is strictly increasing, then evidently $F^{-1}(F(x)) = x$. However, if F is constant over
some interval, it may be that $F^{-1}(F(x)) \ne x$. Suppose F is constant, equal to y, on [x1 , x2 ], with
$x_1 = F^{-1}(y)$. If we take $x = (x_1 + x_2)/2$ then F (x) = y, but $F^{-1}(y) = x_1 < x$. In fact, if x ∈ (x1 , x2 ] then
$F^{-1}(F(x)) = x_1 < x$.
Even in this case, though, the probability equality holds. Indeed, for x ∈ [x1 , x2 ] we have
P (X ≤ x) = P (X ≤ x1 ) + P (x1 < X ≤ x), and since the CDF of X is constant over [x1 , x2 ],
P (x1 < X ≤ x) = F (x) − F (x1 ) = 0. So the values of X at which $F^{-1}(F(X)) \ne X$ carry no
probability. This completes the proof of the claim.


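Claim 21.3 $F(F^{-1}(y)) = y$.

By definition $F^{-1}(y) = \inf\{x \in \mathbb{R} : F(x) \ge y\}$, i.e., there exists a sequence $(x_n)$ in R, with
$F(x_n) \ge y$, such that $x_n \to F^{-1}(y)$.
By continuity of F , we have $\lim_{n \to \infty} F(x_n) = F(F^{-1}(y))$ and hence $F(F^{-1}(y)) \ge y$.
Conversely, every $x < F^{-1}(y)$ satisfies F (x) < y by the definition of the infimum; letting
$x \uparrow F^{-1}(y)$ and using the continuity of F again gives $F(F^{-1}(y)) \le y$. This proves the claim.

If y ≥ 1, then we have P (Y ≤ y) = P (Y ≤ 1) + P (1 < Y ≤ y) = P (Y ≤ 1) + 0 = 1,
since F is a distribution function and hence Y = F (X(ω)) ≤ 1 always. If y < 0, then we have
P (Y ≤ y) = 0, since F (X) ≥ 0 always.
If y = 0, then P (Y ≤ y) = P (F (X) = 0).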

Claim 21.4 P (Y = 0) = 0.

Note that $\{Y = 0\} = \bigcap_{n=1}^{\infty} \left\{0 \le Y \le \frac{1}{n}\right\}$, so

\[
P(Y = 0) = \lim_{n \to \infty} \left[ P\left(Y \le \tfrac{1}{n}\right) - P(Y < 0) \right]
= \lim_{n \to \infty} \frac{1}{n} - P(Y < 0) = -P(Y < 0) = 0,
\]

as Y ≥ 0. This completes the proof of the claim.

This shows that Y is a uniform random variable over (0, 1).


One application of Theorem 21.1 is in the generation of random samples from a given distribution.
First, let us try to understand what it means to generate, or observe, a random variable.
Take the example of a Bernoulli random variable. When we toss a coin and get, say, a head,
we say that X = 1 is observed. When we repeat the experiment and get a tail, we say that
X = 0 is observed. If we toss the coin n times, we get a random sample of size n of
a Bernoulli random variable, i.e., a finite sequence X1 , X2 , · · · , Xn of iid random variables.
The particular values obtained are called the realized or observed values of the random sample.
Let us take another example: we want to generate a random sample of a Binomial random variable
with n = 10, p = 0.5. We toss a coin 10 times and note down the number of heads; we say that
X1 = x1 is observed. Now we toss the coin again 10 times and note down the number of heads;
we say that X2 = x2 is observed, and so on.
So it is desirable to be able to generate random samples of reasonable size from a given distribution
function. In other words, we want to simulate a random variable by generating a set of
values.

21.1 Simulation of an absolutely continuous random variable

A computer has a subroutine that can generate values of a random variable U that is uniformly
distributed in the interval (0, 1). Such a subroutine can be used to generate values of
a continuous random variable with given d.f. F as follows.
If it is required to generate an observation X = x with cdf F , we generate a uniform random
number U = u between 0 and 1, and solve for x the equation
F (x) = u,
because, by Theorem 21.1, F (X) is uniformly distributed on (0, 1).

Example 21.5 Simulate an exponential random variable with parameter λ > 0.

Solution: The exponential CDF has the form $F(x) = 1 - e^{-\lambda x}$ for x ≥ 0. Thus, to generate
values of X, we should generate values u ∈ (0, 1) of a uniformly distributed random variable
U , and set X to the value x for which $1 - e^{-\lambda x} = u \implies e^{-\lambda x} = 1 - u \implies x = -\ln(1 - u)/\lambda$.
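
The recipe of Example 21.5 can be sketched in a few lines. This is a minimal illustration rather than a production sampler: the function name `sample_exponential`, the seed and the rate λ = 2 are arbitrary choices, and Python's `random.random()` (which returns values in [0, 1)) stands in for the uniform subroutine.

```python
import math
import random

def sample_exponential(lam, rng):
    """Draw one exp(lam) value by solving 1 - exp(-lam * x) = u for x."""
    u = rng.random()                   # uniform on [0, 1), so 1 - u is in (0, 1]
    return -math.log(1.0 - u) / lam

rng = random.Random(42)                # fixed seed for reproducibility
lam = 2.0
xs = [sample_exponential(lam, rng) for _ in range(100_000)]
sample_mean = sum(xs) / len(xs)
print(sample_mean)                     # should be close to 1 / lam = 0.5
```

The sample mean is compared with 1/λ, the mean of the exponential distribution computed in Example 23.5.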

Example 21.6 Let X be a random variable with pdf f . Determine the pdf of aX + b,
where a ≠ 0, b ∈ R.

Solution: We first calculate the distribution function of Y = aX + b in order to find its pdf.
If a > 0,

\begin{align*}
F_Y(x) &= P(Y \le x) \\
&= P(aX + b \le x) \\
&= P\left(X \le \frac{x - b}{a}\right) \\
&= F_X\left(\frac{x - b}{a}\right)
\end{align*}

This implies

\[ F_Y'(x) = \frac{1}{a} F_X'\left(\frac{x - b}{a}\right). \]

If a < 0,

\begin{align*}
F_Y(x) &= P\left(X \ge \frac{x - b}{a}\right) \\
&= 1 - F_X\left(\frac{x - b}{a}\right) && (\because X \text{ has continuous CDF})
\end{align*}

This implies

\[ F_Y'(x) = -\frac{1}{a} F_X'\left(\frac{x - b}{a}\right). \]

Hence for any a ≠ 0,

\[ f_Y(x) = \frac{1}{|a|} f_X\left(\frac{x - b}{a}\right). \]
Lecture 22: Expectation
19 February, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

22.1 Introduction
The distribution function of a random variable gives us complete information (statistically)
about it. In many settings of theoretical and practical interest, however, one is satisfied
with cruder information about the random variable. (And if the distribution in question is
unknown or only partly known, one may have to resort to these cruder information measures
in any case to make a virtue out of necessity.) While a variety of measures can be developed
along these lines, principal among them are measures of central tendency in a random vari-
able. The simplest descriptors of central tendency in a random variable include the mean,
the median, and the mode. Each has utility but, while no single measure of central tendency
is completely satisfactory in all applications, the mean, also called the expectation, is by far
the most important and the most prevalent in theory and application.

22.2 Motivation
Suppose you play a certain game and your rewards are ₹1 with probability 1/6, ₹2 with
probability 1/2 and ₹4 with probability 1/3. So X is the random variable which denotes the
reward; it takes the three values {1, 2, 4} and we are given its pmf. We ask: how much do
you expect to get on average if you play the game a zillion times? You can think
as follows: the probability of getting ₹1 is 1/6, which means that 1/6 of the time (out of a zillion games) we get ₹1
(the relative frequency interpretation of probability); similarly, 1/2 of the time we get ₹2, etc. So

\[ \frac{1}{6} \times 1 + \frac{1}{2} \times 2 + \frac{1}{3} \times 4 = 2.5 \]

Basically what we have done here is:

P {X = 1} · 1 + P {X = 2} · 2 + P {X = 4} · 4

The expected value or expectation of a discrete random variable is a weighted average, where
we interpret the probabilities as weights associated with the values.
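
The weighted average above is a one-line computation. The snippet below is only an illustration; the dictionary `pmf` is an assumed representation of the reward distribution, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Reward distribution from the game: value -> probability.
pmf = {1: Fraction(1, 6), 2: Fraction(1, 2), 4: Fraction(1, 3)}

total_probability = sum(pmf.values())             # a pmf must sum to 1
expected = sum(x * p for x, p in pmf.items())     # weighted average of the values
print(total_probability, expected)
```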


22.3 Expectation of Discrete Random Variable


Definition 22.1 Let X be a discrete random variable with pmf $f_X(x)$. Then the expectation
of X, denoted by E(X), is defined as

\[ E(X) = \sum_{x \in R(X)} x f_X(x) \]

provided the series on the right hand side converges absolutely, i.e., $\sum_{x \in R(X)} |x| f_X(x) < \infty$.

Remark 22.2
1. If X has finite range then the sum $\sum_{x \in R_X} x f_X(x)$ is a finite sum and
there is no need to check the condition $\sum_{x \in R_X} |x| f_X(x) < \infty$ separately (as a finite sum of
real numbers is a real number).

2. If the range of the discrete random variable X is infinite then the sum $\sum_{x \in R_X} x f_X(x)$ is an
infinite sum, and the condition $\sum_{x \in R_X} |x| f_X(x) < \infty$ implies that the infinite sum $\sum_{x \in R_X} x f_X(x)$
converges to a finite value that is independent of the order in which the various terms
are summed (see Example 22.7). Note that the series $\sum_{x \in R_X} x f_X(x)$ may converge while
the series $\sum_{x \in R_X} |x| f_X(x)$ does not. In that case we say that E [X] does not exist (see
Example 22.6 below).

3. If $R_X \subset [0, \infty)$ then convergence of the infinite series $\sum_{x \in R_X} x f_X(x)$ is equivalent to
absolute convergence.

Example 22.3 Let X be a Bernoulli(p) random variable where 0 < p < 1. Determine EX
(if it exists).

Solution: Recall that X has pmf P (X = 0) = 1 − p, P (X = 1) = p. Since X takes finitely
many values, we do not have to worry about absolute convergence; hence EX exists.
Further, E(X) = 0 · (1 − p) + 1 · p = p.

Remark 22.4 Although the number E(X) is called the expected value of the random variable X,
do not expect the value E(X) when X is observed. In Section 22.2, E(X) = 2.5, whereas
X assumes the values 1, 2, 4. So the expected value 2.5 is never going to be observed. In Example
22.3, X takes the values 0, 1, whereas the expected value p lies strictly between 0 and 1. So the expected value p
is never going to be observed.
The expectation of X tells us about the average behavior of the random variable X, or the 'typical value'
taken by the random variable. Sometimes this summary can be quite crude.

Example 22.5 Consider a random variable X that takes the value $2^k$ with probability $2^{-k}$
for each k ∈ N. Find its mean (if it exists).

Solution: Since X takes the positive values $\{2, 2^2, 2^3, \dots\}$, absolute convergence
is the same as convergence. Now

\[ EX = \sum_{k=1}^{\infty} 2^k P(X = 2^k) = \sum_{k=1}^{\infty} 1 = +\infty \]

So EX does not exist.

Example 22.6 Let X have the pmf given by

\[ P\left(X = \frac{(-1)^{n+1} 3^n}{n}\right) = \frac{2}{3^n}, \quad n = 1, 2, \dots \]

Then

\[
\sum_{x \in R_X} |x| P(X = x) = \sum_{n=1}^{\infty} \frac{3^n}{n} \times \frac{2}{3^n} = \sum_{n=1}^{\infty} \frac{2}{n} = \infty.
\]

Therefore E[X] does not exist, although the series

\[
\sum_{x \in R_X} x P(X = x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} 3^n}{n} \times \frac{2}{3^n} = 2 \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} < \infty.
\]

Question: One may ask why we need the condition of absolute convergence, if the expectation
is defined as the sum $\sum_{x \in R(X)} x P(X = x)$.

Answer: Absolute convergence implies that the infinite sum $\sum_{x \in R(X)} x f_X(x)$ converges
to a finite value that is independent of the order in which the various terms are summed. In
order to appreciate this point let us look at the following example.

Example 22.7 Consider the convergent series

\[ \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} = 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \frac{1}{5} - \frac{1}{6} + \cdots \]

and one of its rearrangements

\[ 1 + \frac{1}{3} - \frac{1}{2} + \frac{1}{5} + \frac{1}{7} - \frac{1}{4} + \frac{1}{9} + \frac{1}{11} - \frac{1}{6} + \cdots \tag{22.1} \]

in which two positive terms are always followed by one negative term. If s is the sum of $\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n}$,
then one can show that the rearranged series (22.1) converges to a number strictly bigger
than s.

If we rearrange the series $\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n}$ in such a way that
two negative terms follow each positive term:

\[ 1 - \frac{1}{2} - \frac{1}{4} + \frac{1}{3} - \frac{1}{6} - \frac{1}{8} + \frac{1}{5} - \frac{1}{10} - \frac{1}{12} + \cdots \tag{22.2} \]

then one can show that the rearranged series (22.2) converges to $\frac{s}{2}$.
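
The effect of rearrangement can be observed numerically. The sketch below is illustrative only: the helper `rearranged_sum` and the number of blocks are arbitrary choices. It uses the standard facts that s = ln 2 and that the rearrangement (22.1) converges to (3/2) ln 2; partial sums only approximate these limits.

```python
import math

def rearranged_sum(pos_per_block, neg_per_block, blocks):
    """Partial sum of a rearrangement of 1 - 1/2 + 1/3 - 1/4 + ...

    Each block takes the next `pos_per_block` positive terms 1, 1/3, 1/5, ...
    followed by the next `neg_per_block` negative terms -1/2, -1/4, ...
    """
    total = 0.0
    odd, even = 1, 2
    for _ in range(blocks):
        for _ in range(pos_per_block):
            total += 1.0 / odd
            odd += 2
        for _ in range(neg_per_block):
            total -= 1.0 / even
            even += 2
    return total

s = math.log(2)                                # sum of the original series
original = rearranged_sum(1, 1, 200_000)       # original order
two_pos = rearranged_sum(2, 1, 200_000)        # rearrangement (22.1)
two_neg = rearranged_sum(1, 2, 200_000)        # rearrangement (22.2)
print(original, two_pos, two_neg)
```

The same terms, summed in three different orders, settle near three different values: s, 3s/2 and s/2.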

In fact, the theorem below says something very dramatic and startling.

Theorem 22.8 (Riemann) A conditionally convergent series can be made to converge to
any arbitrary real number, or even made to diverge, by a suitable rearrangement of its terms.

In view of Riemann's Theorem, the following theorem highlights the importance of absolute
convergence.

Theorem 22.9 If the series $\sum_n a_n$ converges absolutely, then every rearrangement of the
series $\sum_n a_n$ converges, and they all converge to the same sum.

Example 22.10 Let X ∼ B(n, p). Find EX.

Solution: We know that a Binomial random variable X has pmf

\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \dots, n \]

So

\begin{align*}
EX &= \sum_{k=0}^{n} k \binom{n}{k} p^k (1-p)^{n-k} = \sum_{k=1}^{n} k \binom{n}{k} p^k (1-p)^{n-k} \\
&= \sum_{k=1}^{n} k \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k} = \sum_{k=1}^{n} \frac{n!}{(k-1)!(n-k)!} p^k (1-p)^{n-k} \\
&= n \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!(n-k)!} p^k (1-p)^{n-k} = n \sum_{k=1}^{n} \binom{n-1}{k-1} p^k (1-p)^{n-k} \\
&= np \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} (1-p)^{n-k} = np \sum_{i=0}^{n-1} \binom{n-1}{i} p^i (1-p)^{(n-1)-i} \quad (\text{put } k = i + 1) \\
&= np
\end{align*}

The last step uses the fact that the B(n − 1, p) probabilities sum to 1.
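
The identity EX = np can be confirmed for particular parameters by summing the definition directly. This is a small check, not a proof; the function name `binomial_mean` and the values n = 10, p = 3/10 are arbitrary, and exact fractions are used so that the comparison is exact.

```python
from fractions import Fraction
from math import comb

def binomial_mean(n, p):
    """Sum k * P(X = k) over k = 0, ..., n for X ~ B(n, p)."""
    return sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

n, p = 10, Fraction(3, 10)
mean = binomial_mean(n, p)
print(mean, n * p)
```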
Lecture 23: Expectation
20 February, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

23.1 Expectation of an Absolutely Continuous Random Variable

Definition 23.1 Let X be a random variable with pdf f . Then the expectation of X, denoted by
E(X), is defined as

\[ E(X) = \int_{-\infty}^{\infty} x f(x)\,dx \]

provided the improper integral on the right hand side converges absolutely, i.e., $\int_{-\infty}^{\infty} |x| f(x)\,dx < \infty$.

This is similar to the discrete case except that the pmf is replaced by the pdf, and summation
is replaced by integration.

Remark 23.2 We emphasize that the condition $\int_{-\infty}^{\infty} |x| f_X(x)\,dx < \infty$ must be checked before
it can be concluded that E [X] exists and equals $\int_{-\infty}^{\infty} x f_X(x)\,dx$.

Example 23.3 Let X ∼ U [a, b]. Find the mean of X.

Solution: The pdf of X is given by

\[
f(x) = \begin{cases}
\frac{1}{b-a}, & \text{if } a \le x \le b \\
0, & \text{otherwise.}
\end{cases}
\]

Hence

\[
E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \frac{1}{b-a} \int_a^b x\,dx = \frac{b+a}{2}
\]

Since the pdf is non-zero only on a bounded interval, absolute convergence of the improper integral
does not need to be checked.

Example 23.4 Consider a random variable X with pdf $f(x) = \frac{2}{\pi(1 + x^2)}$, x ≥ 0. Find its
mean (if it exists).

Solution: Since the pdf is non-zero only on [0, ∞), the integrand x f (x) is non-negative, so absolute
convergence is the same as convergence and need not be checked separately:

\[
\int_0^{\infty} x f(x)\,dx = \int_0^{\infty} \frac{2x}{\pi(1 + x^2)}\,dx = \frac{1}{\pi} \lim_{x \to \infty} \log(1 + x^2) = \infty
\]

Therefore the expectation does not exist.

Example 23.5 Let X ∼ exp(λ). Find its mean (if it exists).

Solution:

\[
E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^{\infty} x \lambda e^{-\lambda x}\,dx
= \lambda \left( \left[ \frac{x e^{-\lambda x}}{-\lambda} \right]_0^{\infty} + \frac{1}{\lambda} \int_0^{\infty} e^{-\lambda x}\,dx \right)
= \left[ \frac{e^{-\lambda x}}{-\lambda} \right]_0^{\infty} = \frac{1}{\lambda}
\]

23.2 Expectation of Function of Random Variable


Theorem 23.6
1. Let X be a discrete random variable with pmf $f_X$, and let g : R → R
be any function. Then

\[ E[g(X)] = \sum_{x \in R_X} g(x) f_X(x) \tag{23.1} \]

provided $\sum_{x \in R_X} |g(x)| f_X(x) < \infty$.

2. Let X be a random variable with pdf f and g : R → R be a Borel function (e.g.,
piecewise continuous, continuous) such that $\int_{-\infty}^{\infty} |g(x)| f(x)\,dx < \infty$. Then

\[ E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx \tag{23.2} \]

Remark 23.7
1. If X is discrete, then for any function g the random variable Z := g(X) admits a
pmf. So one can use the formula $E[Z] = \sum_{z \in R_Z} z f_Z(z)$ to compute the expectation of Z.
But if one is just interested in E[g(X)], then formula (23.1) paves the way without
calculating the pmf of g(X).

2. If X has a density then for some functions g it is possible that g(X) does not have a pdf
(see Example 23.8), but formula (23.2) paves the way to define the expectation of
g(X).

If g(X) does have a pdf, then formula (23.2) tells us how to compute its expectation
without calculating the pdf of g(X).

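Example 23.8 Recall Example 19.6, where X ∼ exp(5) and g(x) = min{x, 10}. We have
seen that g(X) does not have a pdf. But we can compute the expectation of g(X) using formula
(23.2) as follows:

\begin{align*}
E[\min\{X, 10\}] &= \int_{-\infty}^{\infty} \min\{x, 10\} f_X(x)\,dx = 5 \int_0^{\infty} \min\{x, 10\} e^{-5x}\,dx \\
&= 5 \left[ \int_0^{10} \min\{x, 10\} e^{-5x}\,dx + \int_{10}^{\infty} \min\{x, 10\} e^{-5x}\,dx \right] \\
&= 5 \left[ \int_0^{10} x e^{-5x}\,dx + \int_{10}^{\infty} 10 e^{-5x}\,dx \right] \\
&= 5 \left[ \left[ \frac{e^{-5x}}{25} (-5x - 1) \right]_0^{10} - 2 \left[ e^{-5x} \right]_{10}^{\infty} \right] \\
&= \frac{1}{5} \left( -51 e^{-50} + 1 \right) + 10 e^{-50} = \frac{1 - e^{-50}}{5}
\end{align*}

There is no need to check the absolute convergence of the improper integral because g(x) is
non-negative on [0, ∞).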
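
The closed form $(1 - e^{-50})/5$ can be compared with a direct numerical evaluation of the integral in (23.2). The sketch below uses a plain midpoint rule; the function name `expected_min_exp`, the truncation point 40 and the number of steps are arbitrary assumptions chosen so that the neglected tail is negligible.

```python
import math

def expected_min_exp(rate=5.0, cap=10.0, upper=40.0, steps=400_000):
    """Midpoint-rule approximation of the integral of min(x, cap) * f(x)
    over [0, upper], where f is the exp(rate) density."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h                      # midpoint of the i-th subinterval
        total += min(x, cap) * rate * math.exp(-rate * x)
    return total * h

numeric = expected_min_exp()
closed_form = (1 - math.exp(-50)) / 5
print(numeric, closed_form)
```

Both numbers are essentially 0.2 = E[X], which reflects the fact that an exp(5) random variable exceeds 10 only with probability $e^{-50}$.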

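Exercise 23.9 Let Z ∼ N (0, 1) be a random variable. Find E [max{Z, 0}] (if it
exists).

Solution: The random variable Z has pdf

\[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}, \quad x \in \mathbb{R} \]

Hence

\begin{align*}
E[\max\{Z, 0\}] &= \int_{-\infty}^{\infty} \max\{x, 0\} f(x)\,dx \\
&= \int_{-\infty}^{0} \max\{x, 0\} f(x)\,dx + \int_0^{\infty} \max\{x, 0\} f(x)\,dx \\
&= \int_{-\infty}^{0} 0 \cdot f(x)\,dx + \int_0^{\infty} x f(x)\,dx \\
&= \frac{1}{\sqrt{2\pi}} \int_0^{\infty} x e^{-x^2/2}\,dx \\
&= \frac{1}{\sqrt{2\pi}} \int_0^{\infty} e^{-t}\,dt && (\text{substituting } x^2/2 = t) \\
&= \frac{1}{\sqrt{2\pi}}
\end{align*}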

23.3 Higher Order Moments


Definition 23.10 Let X be a random variable. Then $E[X^n]$ is called the nth moment of X
and $E[(X - E(X))^n]$ is called the nth central moment of X. The second central moment is
called the variance.

The most important quantity associated with a random variable X, other than the mean,
is its variance, which is denoted by var(X). The variance provides a measure of the dispersion of X
around its mean. Another measure of dispersion is the standard deviation, which is defined
as the positive square root of the variance and is denoted by $\sigma_X$. So $\sigma_X = \sqrt{\mathrm{var}(X)}$.
If X is a discrete random variable then by Theorem 23.6,

\[ \mathrm{Var}(X) = E[(X - \mu)^2] = \sum_{x \in R(X)} (x - \mu)^2 f_X(x), \]

where µ = EX.

Example 23.11 Let X be a Bernoulli(p) random variable where 0 < p < 1. Find its
variance.

Solution: Recall E(X) = p. Hence

\begin{align*}
\mathrm{Var}(X) = E[(X - p)^2] &= (0 - p)^2 P(X = 0) + (1 - p)^2 P(X = 1) = p^2 (1 - p) + (1 - p)^2 p \\
&= p(1 - p)[p + (1 - p)] = p(1 - p)
\end{align*}

If X is a random variable with density $f_X$ then by Theorem 23.6,

\[ \mathrm{Var}(X) = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx, \]

where µ = EX.

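Example 23.12 Let X be a Uniform[a, b] random variable. Find its variance.

Solution: Recall $E(X) = \frac{a+b}{2}$. Hence

\begin{align*}
\mathrm{Var}(X) &= E\left[ \left( X - \frac{a+b}{2} \right)^2 \right] = \int_a^b \left( x - \frac{a+b}{2} \right)^2 \frac{1}{b-a}\,dx \\
&= \frac{1}{3(b-a)} \left[ \left( x - \frac{a+b}{2} \right)^3 \right]_{x=a}^{x=b}
= \frac{1}{3(b-a)} \left[ \left( \frac{b-a}{2} \right)^3 - \left( \frac{a-b}{2} \right)^3 \right]
= \frac{(b-a)^3}{12(b-a)} \\
&= \frac{(b-a)^2}{12}
\end{align*}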
Standard Normal Cumulative Probability Table

Cumulative probabilities for POSITIVE z-values are shown in the following table:

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879

0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389

1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319

1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767

2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936

2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986

3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
Lecture 24: Normal Random Variable
22 February, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

A normal random variable has several special properties. The following one is particularly
important: normality is preserved under linear transformations.

Exercise 24.1 If X is a normal random variable with mean µ and variance $\sigma^2$, and if
a ≠ 0, b are real numbers, then the random variable

Y = aX + b

is also normal, with mean and variance

\[ E[Y] = a\mu + b, \quad \mathrm{var}(Y) = a^2 \sigma^2 \]

Solution: By Example 21.6, the pdf of Y is given by

\[ f_Y(x) = \frac{1}{|a|} f_X\left(\frac{x - b}{a}\right), \]

where

\[ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad x \in \mathbb{R} \]

Therefore

\[ f_Y(x) = \frac{1}{|a|} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - b - a\mu)^2}{2a^2\sigma^2}}, \quad x \in \mathbb{R} \]

So Y is a normal random variable with mean b + aµ and variance $a^2\sigma^2$.


A normal random variable X with zero mean and unit variance is said to be a standard
normal random variable. Its distribution function is denoted by N (·):

\[ N(x) = P(X \le x) = P(X < x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt \]

There is no closed form for N (x). One uses numerical techniques to find approximate
values of N (x). These values are recorded in a table, which is a very useful tool for calculating various
probabilities involving normal random variables. Note that the table only provides the
values of N (x) for x ≥ 0, because the omitted values can be found using the symmetry of
the pdf. For example, if X is a standard normal random variable, for x > 0 we have

N (−x) = P (X ≤ −x) = P (X ≥ x) = 1 − P (X < x) = 1 − N (x)

Let X be a normal random variable with mean µ and variance $\sigma^2$. We define a new random
variable $Z := \frac{X - \mu}{\sigma}$. Then Z is a linear function of X, hence normal. Also, Z has mean zero
(∵ $\frac{\mu}{\sigma} - \frac{\mu}{\sigma} = 0$) and variance one (∵ $\frac{1}{\sigma^2}\sigma^2 = 1$). Thus Z is a standard normal random
variable. This fact allows us to calculate the probability of any event defined in terms of X:
we redefine the event in terms of Z, then use the standard normal table. This is how
it is done:

\[
P(X \le x) = P\left( \frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma} \right) = P\left( Z \le \frac{x - \mu}{\sigma} \right) = N\left( \frac{x - \mu}{\sigma} \right)
\]

Example 24.2 Let X be a normal random variable with parameters µ and σ 2 . Find (a)
P (µ − σ ≤ X ≤ µ + σ), (b) P (µ − 2σ ≤ X ≤ µ + 2σ), (c) P (µ − 3σ ≤ X ≤ µ + 3σ).

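Solution:
(a)
\begin{align*}
P(\mu - \sigma \le X \le \mu + \sigma) &= P(X \le \mu + \sigma) - P(X < \mu - \sigma) = N\left(\frac{\mu+\sigma-\mu}{\sigma}\right) - N\left(\frac{\mu-\sigma-\mu}{\sigma}\right) \\
&= N(1) - N(-1) = N(1) - [1 - N(1)] = 2N(1) - 1 = 2 \times 0.8413 - 1 = 0.6826.
\end{align*}
(b)
\begin{align*}
P(\mu - 2\sigma \le X \le \mu + 2\sigma) &= P(X \le \mu + 2\sigma) - P(X < \mu - 2\sigma) = N\left(\frac{\mu+2\sigma-\mu}{\sigma}\right) - N\left(\frac{\mu-2\sigma-\mu}{\sigma}\right) \\
&= N(2) - N(-2) = N(2) - [1 - N(2)] = 2N(2) - 1 = 2 \times 0.9772 - 1 = 0.9544.
\end{align*}
(c) In the same way,
\[ P(\mu - 3\sigma \le X \le \mu + 3\sigma) = 2N(3) - 1 = 2 \times 0.9987 - 1 = 1.9974 - 1 = 0.9974. \]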
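The three probabilities in Example 24.2 can be cross-checked numerically. The following is a minimal sketch, assuming the standard identity $N(x) = \frac12\left(1 + \operatorname{erf}(x/\sqrt2)\right)$; the function names are illustrative and not part of the notes.

```python
import math

def std_normal_cdf(x: float) -> float:
    """N(x) for a standard normal variable, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_within(k: float) -> float:
    """P(mu - k*sigma <= X <= mu + k*sigma) = 2 N(k) - 1, for any mu and sigma > 0."""
    return 2.0 * std_normal_cdf(k) - 1.0

for k in (1, 2, 3):
    # Values should agree with the table-based answers up to table rounding.
    print(k, round(prob_within(k), 4))
```

Because the table rounds $N(x)$ to four decimals, the table-based answers may differ from these values in the last digit.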

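Example 24.3 Let $X$ have a normal distribution with parameters $\mu$ and $\sigma^2 = 0.25$. Find a
constant $c$ such that $P(|X - \mu| \le c) = 0.9$.

Solution: Let $Y = \frac{X-\mu}{\sigma}$. Then $Y$ is a standard normal variable. Since $\sigma = 0.5 > 0$,
\begin{align*}
0.9 = P(|X - \mu| \le c) &= P\left(\frac{|X-\mu|}{\sigma} \le \frac{c}{\sigma}\right) = P\left(|Y| \le \frac{c}{\sigma}\right) = P\left(-\frac{c}{\sigma} \le Y \le \frac{c}{\sigma}\right) \\
&= N\left(\frac{c}{\sigma}\right) - N\left(-\frac{c}{\sigma}\right) = N\left(\frac{c}{\sigma}\right) - \left(1 - N\left(\frac{c}{\sigma}\right)\right) = 2N\left(\frac{c}{\sigma}\right) - 1,
\end{align*}
so $N\left(\frac{c}{\sigma}\right) = 0.95$. From the standard normal table, we get $\frac{c}{\sigma} = 1.65$, so $c = 0.5 \times 1.65 = 0.825$.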

Example 24.4 Let $X$ have a normal distribution with parameters $\mu = 1$ and $\sigma^2 = 4$.
Calculate $P(0 \le X \le 3)$.

Solution: If $\mu = 1$, $\sigma^2 = 4$, then $\sigma = 2$ and
\begin{align*}
P(0 \le X \le 3) &= P(X \le 3) - P(X < 0) = N\left(\frac{3-1}{2}\right) - N\left(\frac{0-1}{2}\right) \\
&= N(1) - N(-0.5) = N(1) - [1 - N(0.5)] = N(1) + N(0.5) - 1 \\
&= 0.8413 + 0.6915 - 1 = 0.5328 \quad \text{(using the standard normal table)}.
\end{align*}

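Example 24.5 (Memoryless property) Let $X \sim \exp(\lambda)$. Show that
\[ P(X > x + y \mid X > x) = P(X > y), \quad \forall\, x, y \in [0, \infty). \]

Solution: Note that for $x, y \ge 0$, $\{X > x + y\} \subseteq \{X > x\}$ and $\{X > x + y\} \subseteq \{X > y\}$.
Hence
\begin{align*}
P(X > x + y \mid X > x) &= \frac{P(\{X > x + y\} \cap \{X > x\})}{P(X > x)} \\
&= \frac{P(X > x + y)}{P(X > x)} \\
&= \frac{1 - P(X \le x + y)}{1 - P(X \le x)} \\
&= \frac{e^{-\lambda(x+y)}}{e^{-\lambda x}} \\
&= e^{-\lambda y} \\
&= P(X > y).
\end{align*}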

Remark 24.6 The above property is known as the memoryless property of the exponential distribution.
To interpret it, let us think of $X$ as a model for a waiting time. Then
\[ P(X > x + y \mid X > x) = P(X > y), \quad \forall\, x, y \ge 0 \]
says that if we have already waited for time $x$, the distribution of the further waiting time is the
same as that of the initial waiting time, as if the time already spent had been waited in vain.
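The memoryless identity can also be checked numerically from the tail formula $P(X > t) = e^{-\lambda t}$ used in Example 24.5. This is only an illustrative sketch; the rate `lam = 0.7` and the test points are arbitrary choices, not values from the notes.

```python
import math

def exp_tail(t: float, lam: float) -> float:
    """P(X > t) = exp(-lam * t) for X ~ exp(lam) and t >= 0."""
    return math.exp(-lam * t)

def conditional_tail(x: float, y: float, lam: float) -> float:
    """P(X > x + y | X > x) = P(X > x + y) / P(X > x)."""
    return exp_tail(x + y, lam) / exp_tail(x, lam)

lam = 0.7  # arbitrary rate, chosen only for illustration
for x, y in [(0.0, 1.0), (2.0, 0.5), (5.0, 3.0)]:
    # The two printed numbers should coincide up to floating-point error.
    print(x, y, conditional_tail(x, y, lam), exp_tail(y, lam))
```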
MATH-221: Probability and Statistics
Tutorial # 3 (Random Variables, PMF, PDF, CDF, Functions of Random variable)

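1. Define $F : \mathbb{R} \to [0, 1]$ by
\[ F(x) = \begin{cases} 0 & \text{if } x < 0, \\ \frac{x^2}{4} & \text{if } 0 \le x \le 2, \\ 1 & \text{if } x > 2. \end{cases} \]
Show that $F$ is a distribution function. Find the pdf or pmf (if it exists). Also compute
$P(1 \le X < 3)$, where $X$ has distribution function $F$.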

2. Let $X$ be a random variable with distribution function $F$. Find the distribution
function of the following random variables in terms of $F$: (i) $\max\{X, a\}$, where
$a \in \mathbb{R}$ (ii) $|X|^{\frac{1}{3}}$ (iii) $|X|$ (iv) $e^X$ (v) $-\ln |X|$.

3. (Memoryless property) Let X ∼ exp(λ). Show that

P (X > x + y|X > x) = P (X > y), ∀ x, y ∈ [0, ∞).

4. (Memoryless property) Let $X$ be a geometric($p$) random variable. Show that for
any nonnegative integers $m$ and $n$,
\[ P(X > n + m \mid X > m) = P(X > n). \tag{1} \]
Note that the memoryless property for the geometric distribution holds only for nonnegative
integers. Verify that if $m = n = \frac{1}{2}$ then (1) does not hold.

5. Let $X$ be the uniform random variable on $[0, 1]$. Determine the pdf of (i) $X$ (ii)
$aX + b$, where $a, b \in \mathbb{R}$ (iii) $X^{\frac{1}{4}}$.

6. Let $X$ be a random variable with PMF
\[ f_X(x) = \begin{cases} \frac{x^2}{a} & \text{if } x = -3, -2, -1, 0, 1, 2, 3, \\ 0 & \text{otherwise.} \end{cases} \]
Find $a$. What is the PMF of the random variable $Z = (X - a)^2$?

7. Let X be a binomial random variable with parameters (n, p). What value of p
maximizes P (X = k), k = 0, 1, . . . , n?

8. Let X be a Poisson random variable with parameter λ. If P (X = 1|X ≤ 1) = 0.8,


what is the value of λ?

9. Let X be a normal random variable with parameters µ and σ 2 . Find (a) P (µ − σ ≤


X ≤ µ + σ), (b) P (µ − 2σ ≤ X ≤ µ + 2σ), (c) P (µ − 3σ ≤ X ≤ µ + 3σ).

10. Consider X ∼ U [−1, 1]. Another random variable Y is formed by using the trans-
formation Y = X 2 + X. Find the distribution function FY (y) and density function
fY (y) of the new transformed random variable Y .
11. Let X have a geometric distribution with p = 0.8. Compute (a) P (X > 3) (b)
P (4 ≤ X ≤ 7 or X > 9) (c) P (3 ≤ X ≤ 5 or 7 ≤ X ≤ 10).

12. Let the random variable $X$ denote the decay time of some radioactive particle, and suppose it
follows the exponential distribution. Suppose $\lambda$ is such that $P(X \ge 0.01) = \frac{1}{2}$.
Find a number $t$ such that $P(X \ge t) = 0.9$.

13. Let X have a normal distribution with parameters µ and σ 2 = 0.25. Find a constant
c such that P (|X − µ| ≤ c) = 0.9. (Hint: Use Table for standard normal distribution
function).

14. Define
\[ B_k(n; p) := \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, 1, 2, \dots, n. \]
Then for $\lambda > 0$ show that
\[ \lim_{n \to \infty} B_k\left(n; \frac{\lambda}{n}\right) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \dots \tag{2} \]
The limit in equation (2) is the Poisson approximation to the binomial in the following sense:
if we let $n$ grow to infinity while simultaneously decreasing $p$, in a manner that keeps the
product $np$ at a constant value, say $\lambda$, then the binomial pmf converges
to the Poisson pmf.

15. Let X have a normal distribution with parameters µ = 1 and σ 2 = 4. Calculate


P (0 ≤ X ≤ 3) (Hint: Use X = σY + µ and use Table for Y , a standard normal
variable).

16. (a) Let a random variable $X$ have a continuous distribution function $F$ which is
strictly increasing on $\mathbb{R}$. Show that $Y \sim U(0, 1)$, where $Y = F(X)$.
(Actually the result is true even for continuous distribution functions which are
constant over some intervals, but the proof of this is more involved.)
(b) Use the general version of the result in part (a) to show that if $X$ is exponentially
distributed with parameter $\lambda$, then $-\frac{1}{\lambda} \ln(1 - Y)$ is also exponentially
distributed, where $Y = F_X(X)$.
Lecture 25: Properties of Expectation
8 March, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Theorem 25.1 (Properties of Expectation) Let $X$ be a random variable and let $a$, $b$
and $c$ be real numbers. Suppose $g_1$ and $g_2$ are real-valued functions of one real variable such
that $E[g_1(X)] < \infty$ and $E[g_2(X)] < \infty$. Then

(a) E [ag1 (X) + bg2 (X) + c] = aE [g1 (X)] + bE [g2 (X)] + c.

(b) If g1 (x) ≥ g2 (x) for all x, then E [g1 (X)] ≥ E [g2 (X)].

Proof: We will prove the result only for $X$ which are absolutely continuous, the discrete case being
similar.

(a) By (23.2),
\begin{align*}
E[a g_1(X) + b g_2(X) + c] &= \int_{-\infty}^{\infty} [a g_1(x) + b g_2(x) + c]\, f_X(x)\, dx \\
&= \int_{-\infty}^{\infty} a g_1(x) f_X(x)\, dx + \int_{-\infty}^{\infty} b g_2(x) f_X(x)\, dx + \int_{-\infty}^{\infty} c f_X(x)\, dx \\
&= a \int_{-\infty}^{\infty} g_1(x) f_X(x)\, dx + b \int_{-\infty}^{\infty} g_2(x) f_X(x)\, dx + c \int_{-\infty}^{\infty} f_X(x)\, dx \\
&= a E[g_1(X)] + b E[g_2(X)] + c.
\end{align*}

(b) Since $f_X(x) \ge 0$,
\[ g_1(x) \ge g_2(x) \implies g_1(x) f_X(x) \ge g_2(x) f_X(x) \quad \text{for all } x, \]
so
\[ \int_{-\infty}^{\infty} g_1(x) f_X(x)\, dx \ge \int_{-\infty}^{\infty} g_2(x) f_X(x)\, dx. \]
Hence $E[g_1(X)] \ge E[g_2(X)]$ (by (23.2)).

Alternate Expression for Variance

\begin{align*}
\operatorname{Var}(X) = E[(X - \mu)^2] &= E[X^2 + \mu^2 - 2\mu X] \\
&= E[X^2] + \mu^2 - 2\mu E[X] \\
&\quad \text{(by Theorem 25.1 (a) with } g_1(x) = x^2,\ g_2(x) = x,\ a = 1,\ b = -2\mu,\ c = \mu^2\text{)} \\
&= E[X^2] + \mu^2 - 2\mu^2 = E[X^2] - \mu^2,
\end{align*}
where $\mu = EX$.
Hence
\[ \operatorname{Var}(X) = E[X^2] - (E[X])^2. \]

Second moment implies first moment. Note that the function $g(x) = (x - \mu)^2 \ge 0$ for any
real number $\mu$. Let $\mu = EX$. If a random variable $X$ has finite variance, then it follows from
Theorem 25.1 (b) that $\operatorname{Var}(X) \ge 0$. Hence by the alternate form we get
\[ \infty > E[X^2] - (EX)^2 \ge 0 \implies E[X^2] \ge (EX)^2. \]
Hence if $E[X^2] < \infty$ then $E[X] < \infty$. In other words, if a random variable has a finite second
moment, then its mean is finite. Also, from the alternate expression it follows that if a
random variable has finite variance, then its first and second moments must be finite.

Proposition 25.2 Let $X$ be a random variable with finite variance and $a, b \in \mathbb{R}$. Then
$\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X)$.

Proof: By the alternate expression for the variance,
\begin{align*}
\operatorname{Var}(aX + b) &= E[(aX + b)^2] - [E(aX + b)]^2 \\
&= E[a^2 X^2 + b^2 + 2abX] - [aEX + b]^2 \\
&= a^2 E[X^2] + b^2 + 2ab E[X] - a^2 [EX]^2 - b^2 - 2ab EX \\
&= a^2 \operatorname{Var}(X).
\end{align*}
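Proposition 25.2 can be verified exactly on a small discrete example. The sketch below uses a fair die as $X$ and the arbitrary constants $a = -2$, $b = 3$; the helper names (`mean`, `variance`, `affine`) are illustrative assumptions, and exact rational arithmetic is used so that the two sides can be compared with `==`.

```python
from fractions import Fraction

def mean(pmf):
    """E[X] for a pmf given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2, the alternate expression for the variance."""
    mu = mean(pmf)
    return sum((x ** 2) * p for x, p in pmf.items()) - mu ** 2

def affine(pmf, a, b):
    """pmf of aX + b, obtained by relabelling the values of X."""
    out = {}
    for x, p in pmf.items():
        out[a * x + b] = out.get(a * x + b, 0) + p
    return out

die = {k: Fraction(1, 6) for k in range(1, 7)}  # fair die, illustrative choice
a, b = -2, 3
print(variance(affine(die, a, b)), a ** 2 * variance(die))
```

The shift $b$ has no effect on the result, which mirrors the cancellation of the $b^2$ and $2abEX$ terms in the proof.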

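Example 25.3 If $X$ is a random variable taking nonnegative integer values and has finite
mean, then show that
\[ EX = \sum_{n=1}^{\infty} P\{X \ge n\}. \]

Solution: Note that for any given $n = 0, 1, 2, \dots$
\[ \{X \ge n\} = \{X = n\} \cup \{X = n + 1\} \cup \{X = n + 2\} \cup \cdots, \]
so that $P(X = n) = P(X \ge n) - P(X \ge n + 1)$. Hence
\begin{align*}
EX = \sum_{n=0}^{\infty} n P(X = n) &= \sum_{n=1}^{\infty} n P(X = n) = \sum_{n=1}^{\infty} n [P(X \ge n) - P(X \ge n + 1)] \\
&= \sum_{n=1}^{\infty} n P(X \ge n) - \sum_{n=1}^{\infty} n P(X \ge n + 1) \\
&= \sum_{n=1}^{\infty} n P(X \ge n) - \sum_{n=2}^{\infty} (n - 1) P(X \ge n) \\
&= \sum_{n=1}^{\infty} n P(X \ge n) - \sum_{n=2}^{\infty} n P(X \ge n) + \sum_{n=2}^{\infty} P(X \ge n) \\
&= P(X \ge 1) + \sum_{n=2}^{\infty} P(X \ge n) \\
&= \sum_{n=1}^{\infty} P\{X \ge n\}.
\end{align*}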
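The tail-sum formula of Example 25.3 is easy to test on a finite pmf. The following sketch compares $\sum_n n P(X = n)$ with $\sum_{n \ge 1} P(X \ge n)$ for a binomial$(3, \frac12)$ variable; the choice of distribution and the helper names are assumptions made only for illustration.

```python
from fractions import Fraction

def mean(pmf):
    """E[X] computed directly as the sum of n * P(X = n)."""
    return sum(n * p for n, p in pmf.items())

def tail_sum(pmf):
    """Sum over n >= 1 of P(X >= n), for a pmf supported on finitely many nonnegative integers."""
    top = max(pmf)
    return sum(
        sum(p for k, p in pmf.items() if k >= n)
        for n in range(1, top + 1)
    )

# binomial(3, 1/2) pmf, an illustrative choice
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
print(mean(pmf), tail_sum(pmf))
```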

Example 25.4 Let $X$ be a non-negative random variable with a density and with finite mean.
Prove that
\[ EX = \int_{0}^{\infty} P(X \ge x)\, dx. \]

Solution: $X \ge 0$ means $f_X(t) = 0$ for all $t < 0$. We have
\[ P(X \ge x) = \int_{x}^{\infty} f_X(t)\, dt. \]
Thus, we need to show that
\[ \int_{0}^{\infty} \int_{x}^{\infty} f_X(t)\, dt\, dx = EX. \]
The left hand side is a double integral. In particular, it is the integral of $f_X(t)$ over the
region $\{(x, t) : 0 \le x \le t\}$ (the shaded region in the accompanying figure).
We can take the integral with respect to $x$ or $t$ first. Thus, we can write
\begin{align*}
\int_{0}^{\infty} \int_{x}^{\infty} f_X(t)\, dt\, dx &= \int_{0}^{\infty} \int_{0}^{t} f_X(t)\, dx\, dt \\
&= \int_{0}^{\infty} f_X(t) \left( \int_{0}^{t} 1\, dx \right) dt \\
&= \int_{0}^{\infty} t f_X(t)\, dt = EX \quad (\because X \ge 0).
\end{align*}

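Example 25.5 Suppose that $Y = -2X + 3$. If we know $EY = 1$ and $EY^2 = 9$, find $EX$
and $\operatorname{Var}(X)$.

Solution:
\begin{align*}
EY &= -2EX + 3 \implies EX = 1, \\
\operatorname{Var}(Y) &= E[Y^2] - [E(Y)]^2 = 9 - 1 = 8, \\
\operatorname{Var}(Y) &= (-2)^2 \operatorname{Var}(X) \implies \operatorname{Var}(X) = 2.
\end{align*}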
MATH-221: Probability and Statistics
Tutorial # 4 (Expectation, Variance and Expectation of function of Random
Variable)

1. Find the mean and variance of (i) Bernoulli(p) (ii)binomial(n, p) (iii) geometric(p)
(iv) Poisson(λ) (v) continuous uniform on [a, b] (vi)normal(µ, σ 2 ) (vii) exponential(λ)
distributions.
2. Let X be exponentially distributed with parameter λ. Find the 4th moment.
3. Let X be standard normal random variable. Find E|X|.
4. Let $X$ be a binomial random variable with parameters $n = 4$, $p = \frac{1}{2}$. Find $E\left[\sin\left(\frac{\pi X}{2}\right)\right]$.
5. Let X be a geometrically distributed random variable with parameter p and M be
a positive integer. Find E[min{X, M }].
6. Assume that every time you attend your lecture there is a probability of 0.1 that your
Professor will not show up. Assume his arrival to any given lecture is independent
of his arrival (or non-arrival) to any other lecture. What is the expected number of
classes you must attend until you arrive to find your Professor absent?
7. If X is the number of points rolled with a fair die, find the expected value of g(X) =
2X 2 + 1.
8. Let X be a random variable with E(X) = 6 and E(X 2 ) = 45, and let Y = 20 − 2X.
Find E(Y ) and V ar (Y ).
9. Let
\[ P_X(x) = \begin{cases} \frac{1}{2^x}, & x = 1, 2, 3, \dots, \\ 0, & \text{otherwise} \end{cases} \]
be the pmf of the random variable $X$. Find all the possible values of $t$ such that
$E[e^{tX}]$ exists.
10. Let a random variable X of the continuous type have a pdf f (x) whose graph is
symmetric with respect to x = c. If the mean value of X exists, then prove that
E[X] = c.
11. Let X be a random variable with CDF

1
 x > 1,
x2
FX (x) = x2 + 2
, 0≤x≤1

0, x < 0.

then find E(eX ).


12. The number of messages sent per hour over a computer network has the following
distribution:
x = number of messages 10 11 12 13 14 15
PX (x) 0.08 0.15 0.30 0.20 0.20 0.07
Determine the mean and standard deviation of the number of messages sent per
hour.
Lecture 26: Multiple Random Variables
11 March, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Probabilistic models often involve several random variables. For example, in a medical
diagnosis context, the results of several tests may be significant (these different tests would
be observations on different random variables), or in a networking context, the workloads
of several routers may be of interest (these different workloads would be observations on
different random variables, one for each router measured). Thus, we need to know how to
describe and use probability models that deal with more than one random variable at a
time.
All of these random variables are associated with the same experiment, sample space, and
probability law, and their values may relate in interesting ways. This motivates us to consider
several random variables simultaneously, i.e., the notion of a random vector.

Definition 26.1 Let $(\Omega, \mathcal{F}, P)$ be a probability space. A map $X : \Omega \to \mathbb{R}^n$, $X(\omega) =
(X_1(\omega), X_2(\omega), \dots, X_n(\omega))$, is called an $n$-dimensional random vector on $(\Omega, \mathcal{F}, P)$ if each
$X_i$ is a random variable on $(\Omega, \mathcal{F}, P)$.

We will discuss mainly bivariate random vectors, i.e., two random variables. The extension
to higher dimensions is analogous.

Example 26.2 Consider the experiment of tossing two fair dice. The sample
space is $\Omega = \{(i, j) : i, j = 1, \dots, 6\}$ with all outcomes equally likely, and the $\sigma$-algebra
is the power set. Let us consider the following random variables:
\[ X((i, j)) = i + j, \qquad Y((i, j)) = |i - j|. \]
Then $(X, Y)$ is a random vector.

Having defined a random vector $(X, Y)$, we can now discuss probabilities of events that
are defined in terms of $(X, Y)$. For example, we can ask: what is $P(X = 5 \text{ and } Y = 3)$?
Henceforth, we will write $P(X = 5, Y = 3)$ for $P(X = 5 \text{ and } Y = 3)$. Read the comma as
“and”. It is easy to see that there are only two sample points, $(1, 4)$ and $(4, 1)$, that yield
$X = 5$ and $Y = 3$. Please note that we are interested in those sample points $(i, j)$ such that
$X((i, j)) = 5$ and $Y((i, j)) = 3$ simultaneously. Hence
\[ P(X = 5, Y = 3) = \frac{2}{36} = \frac{1}{18}. \]

Definition 26.3 (Discrete random vector) We say that a random vector $X = (X_1, X_2)$
is a discrete random vector if $X_1$ and $X_2$ are both discrete random variables.

It follows from the definition that the range of a discrete random vector is either finite or
countably infinite.

1. If the ranges of $X_1$ and $X_2$ are both finite, then $X$ has finite range.

2. If the range of $X_1$ is finite and that of $X_2$ is countably infinite, then the range of $X$ is at most
countably infinite. Similarly, if the range of $X_2$ is finite and that of $X_1$ is countably infinite, then
the range of $X$ is at most countably infinite.

3. If the ranges of $X_1$ and $X_2$ are both countably infinite, then the range of $X$ is at most countably infinite.

Definition 26.4 (joint pmf ) Let X = (X1 , X2 ) be a discrete random vector. Define f :
R2 → R by
f (x1 , x2 ) = P {X1 = x1 , X2 = x2 }, ∀ x1 , x2 ∈ R
Then f is called joint pmf of X.

If it is necessary to stress the fact that f is the joint pmf of (X1 , X2 ) rather than some other
random vector, the notation fX1 ,X2 (x1 , x2 ) will be used.
The joint pmf of (X, Y ) completely determines the probability distribution of the discrete
random vector (X, Y ), just as the pmf of a discrete random variable completely defines its
distribution.
For the discrete random vector defined in Example 26.2, there are 21 possible values of
(X, Y ).
Y \ X |   2     3     4     5     6     7     8     9    10    11    12
------+------------------------------------------------------------------
  0   | 1/36    0   1/36    0   1/36    0   1/36    0   1/36    0   1/36
  1   |   0   1/18    0   1/18    0   1/18    0   1/18    0   1/18    0
  2   |   0     0   1/18    0   1/18    0   1/18    0   1/18    0     0
  3   |   0     0     0   1/18    0   1/18    0   1/18    0     0     0
  4   |   0     0     0     0   1/18    0   1/18    0     0     0     0
  5   |   0     0     0     0     0   1/18    0     0     0     0     0
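The joint pmf of Example 26.2 can be generated by enumerating the 36 equally likely outcomes. The sketch below is one way to do this; the variable name `joint` is illustrative and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Build the joint pmf of X = i + j and Y = |i - j| for two fair dice.
joint = {}
for i in range(1, 7):
    for j in range(1, 7):
        key = (i + j, abs(i - j))
        joint[key] = joint.get(key, 0) + Fraction(1, 36)

print(len(joint))           # number of values (x, y) with positive probability
print(joint[(5, 3)])        # P(X = 5, Y = 3)
print(sum(joint.values()))  # total probability
```

Only pairs with positive probability are stored, so the length of the dictionary counts the possible values of $(X, Y)$.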

As in the one-dimensional case, this function $f$ has the following three properties:

1. $f(x, y) \ge 0$ for all $(x, y) \in \mathbb{R}^2$.

2. The set $\{(x, y) : f(x, y) \neq 0\}$ is an at most countably infinite subset of $\mathbb{R}^2$.

3. $\displaystyle \sum_{(x, y) \in R(X, Y)} f(x, y) = 1$, where $R(X, Y)$ denotes the range of $(X, Y)$.

Any real-valued function $f$ defined on $\mathbb{R}^2$ having these three properties will be called a
two-dimensional joint pmf.
One can easily verify all three properties for the joint pmf of Example 26.2.
The joint pmf determines the probability of any event that can be specified in terms of the
discrete random variables $X_1$ and $X_2$.

Theorem 26.5 Let $X = (X_1, X_2)$ be a discrete random vector with joint pmf $f$. Then
for any $A \subseteq \mathbb{R}^2$,
\[ P\{(X_1, X_2) \in A\} = \sum_{(x, y) \in A} f(x, y). \]

Exercise 26.6 Let $X$ be a random variable taking the three values $-2$, $1$ and $3$, and let $Y$
be a random variable that assumes the four values $-1, 0, 4, 6$. Their joint probabilities are given
by the following table.

X \ Y |  -1     0     4     6
------+------------------------
  -2  |  1/9  1/27  1/27   1/9
   1  |  2/9    0    1/9   1/9
   3  |   0     0    1/9  4/27

Compute the probability of the event that $XY$ is odd.

Solution: The product $XY$ is odd exactly when both $X$ and $Y$ are odd, so
$\{XY \text{ is odd}\} = \{X = 1, Y = -1\} \cup \{X = 3, Y = -1\}$. Therefore
\[ P(XY \text{ is odd}) = f(1, -1) + f(3, -1) = \frac{2}{9} + 0 = \frac{2}{9}. \]
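Theorem 26.5 translates directly into a sum over the table entries that satisfy the event. The sketch below encodes the table of Exercise 26.6 as a dictionary and sums over the pairs with odd product; the helper `prob` and the dictionary layout are assumptions made for illustration.

```python
from fractions import Fraction as Fr

# Joint pmf of Exercise 26.6, keyed by (x, y).
joint = {
    (-2, -1): Fr(1, 9), (-2, 0): Fr(1, 27), (-2, 4): Fr(1, 27), (-2, 6): Fr(1, 9),
    (1, -1): Fr(2, 9),  (1, 0): Fr(0),      (1, 4): Fr(1, 9),   (1, 6): Fr(1, 9),
    (3, -1): Fr(0),     (3, 0): Fr(0),      (3, 4): Fr(1, 9),   (3, 6): Fr(4, 27),
}

def prob(event):
    """P((X, Y) in A), where A is described by a predicate event(x, y)."""
    return sum(p for (x, y), p in joint.items() if event(x, y))

print(prob(lambda x, y: (x * y) % 2 == 1))  # P(XY is odd)
print(prob(lambda x, y: True))              # total probability
```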

26.1 Marginal PMF


Even if we are considering a random vector $(X, Y)$, there may be probabilities of interest
that involve only one of the random variables in the random vector. We may wish to
know $P(X = 2)$, for instance, in Example 26.2. The random variable $X$ is discrete and its
probability distribution is described by its pmf $f_X$ (we now use the subscript to distinguish
$f_X(x)$ from $f_{X,Y}(x, y)$). We now call $f_X(x)$ the marginal pmf of $X$ to emphasize the fact
that it is the pmf of $X$ but in the context of the joint pmf of the random vector $(X, Y)$.
The marginal pmf of $X$ or $Y$ is easily calculated from the joint pmf of $(X, Y)$.

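Proposition 26.7 If $f$ is the joint pmf of $X$ and $Y$, then
\[ f_X(x) = \sum_{y \in R(Y)} f(x, y), \qquad f_Y(y) = \sum_{x \in R(X)} f(x, y). \]

Proof: Note that $\Omega = \bigcup_{y \in R(Y)} \{Y = y\}$. For each $x \in \mathbb{R}$, the events $\{X = x, Y = y\}$, $y \in R(Y)$,
are disjoint and their union is the event $\{X = x\}$. Thus
\begin{align*}
f_X(x) &= P(X = x) \\
&= P\left( \bigcup_{y \in R(Y)} \{X = x, Y = y\} \right) \\
&= \sum_{y \in R(Y)} P(X = x, Y = y) \\
&= \sum_{y \in R(Y)} f(x, y).
\end{align*}
The formula for $f_Y$ follows in the same way.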
Lecture 27: Marginal PMFs and Random Vector with density
12 March, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Example 27.1 Let $X$ be a random variable taking the two values $1$ and $2$, and let $Y$ be a
random variable that assumes the four values $1, 2, 3, 4$. Their joint probabilities are given by the
following table.

X \ Y |   1     2     3     4
------+------------------------
  1   |  1/4   1/8  1/16  1/16
  2   | 1/16  1/16   1/4   1/8

Determine the marginal pmfs of the random variables $X$ and $Y$.

Solution:
\begin{align*}
f_X(1) &= \sum_{y=1}^{4} f(1, y) = \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \frac{1}{16} = \frac{1}{2}, \\
f_X(2) &= \sum_{y=1}^{4} f(2, y) = \frac{1}{16} + \frac{1}{16} + \frac{1}{4} + \frac{1}{8} = \frac{1}{2}, \\
f_Y(1) &= \sum_{x=1}^{2} f(x, 1) = \frac{1}{4} + \frac{1}{16} = \frac{5}{16} = f_Y(3), \\
f_Y(2) &= \sum_{x=1}^{2} f(x, 2) = \frac{1}{8} + \frac{1}{16} = \frac{3}{16} = f_Y(4).
\end{align*}

Clearly X is uniformly distributed but Y is not.
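The row and column sums of Proposition 26.7 can be computed mechanically. The sketch below does this for the table of Example 27.1; the function name `marginals` is an illustrative assumption.

```python
from fractions import Fraction as Fr

# Joint pmf of Example 27.1, keyed by (x, y).
joint = {
    (1, 1): Fr(1, 4),  (1, 2): Fr(1, 8),  (1, 3): Fr(1, 16), (1, 4): Fr(1, 16),
    (2, 1): Fr(1, 16), (2, 2): Fr(1, 16), (2, 3): Fr(1, 4),  (2, 4): Fr(1, 8),
}

def marginals(joint):
    """Return (f_X, f_Y): sum the joint pmf over y and over x respectively."""
    f_x, f_y = {}, {}
    for (x, y), p in joint.items():
        f_x[x] = f_x.get(x, 0) + p
        f_y[y] = f_y.get(y, 0) + p
    return f_x, f_y

f_x, f_y = marginals(joint)
print(f_x)
print(f_y)
```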


The marginal pmfs $f_X$ and $f_Y$ do not completely determine the joint pmf of $X$ and $Y$.
Indeed, there are many different joint pmfs that have the same marginal pmfs.

Example 27.2 Define a joint pmf by
\[ f(0, 0) = \frac{1}{12}, \quad f(1, 0) = \frac{5}{12}, \quad f(0, 1) = f(1, 1) = \frac{3}{12}. \]
The marginal pmf of $Y$ is $f_Y(0) = f_Y(1) = \frac{1}{2}$. The marginal pmf of $X$ is $f_X(0) = \frac{1}{3}$, $f_X(1) = \frac{2}{3}$.
Now define another joint pmf by
\[ f(0, 0) = f(0, 1) = \frac{1}{6}, \quad f(1, 0) = f(1, 1) = \frac{1}{3}. \]
Again the marginal pmf of $Y$ is $f_Y(0) = f_Y(1) = \frac{1}{2}$ and the marginal pmf of $X$ is $f_X(0) = \frac{1}{3}$, $f_X(1) = \frac{2}{3}$.

Thus, it is hopeless to try to determine the joint pmf from the knowledge of only the marginal
pmfs. The marginals do not capture the information about how $X$ and $Y$ are interrelated.
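The point of Example 27.2 can be confirmed by computing the marginals of both joint pmfs and comparing them. This is a small sketch under the same dictionary convention as before (keys are pairs $(x, y)$); the names `joint_a`, `joint_b` and `marginals` are illustrative.

```python
from fractions import Fraction as Fr

joint_a = {(0, 0): Fr(1, 12), (1, 0): Fr(5, 12), (0, 1): Fr(3, 12), (1, 1): Fr(3, 12)}
joint_b = {(0, 0): Fr(1, 6), (0, 1): Fr(1, 6), (1, 0): Fr(1, 3), (1, 1): Fr(1, 3)}

def marginals(joint):
    """Return (f_X, f_Y) for a joint pmf keyed by (x, y)."""
    f_x, f_y = {}, {}
    for (x, y), p in joint.items():
        f_x[x] = f_x.get(x, 0) + p
        f_y[y] = f_y.get(y, 0) + p
    return f_x, f_y

print(marginals(joint_a) == marginals(joint_b))  # same marginals
print(joint_a == joint_b)                        # different joint pmfs
```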

27.1 Random Vectors with density


Definition 27.3 A random vector $(X, Y)$ defined on a probability space $(\Omega, \mathcal{F}, P)$ is called
absolutely continuous if there is a nonnegative function $f(x, y)$ defined on $\mathbb{R}^2$, called the joint
pdf of $(X, Y)$ (sometimes just the joint density of $(X, Y)$), such that
\[ P((X, Y) \in S) = \iint_{S} f(x, y)\, dx\, dy \]
for every Borel subset $S$ of $\mathbb{R}^2$.

Examples of Borel subsets of $\mathbb{R}^2$: polygons, disks, ellipses, and finite or countable unions
of such shapes; open sets, closed sets, and their finite or countable unions or intersections.
In particular, the probability that the value of $(X, Y)$ falls within a rectangle $[a, b] \times [c, d]$
is
\[ P(a \le X \le b,\ c \le Y \le d) = \int_{a}^{b} \int_{c}^{d} f(x, y)\, dy\, dx, \]
and can be interpreted as the volume of the region lying below the surface $z = f(x, y)$ and above
the rectangle $[a, b] \times [c, d]$.

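Example 27.4 The joint probability density function of $X$ and $Y$ is
\[ f(x, y) = \begin{cases} 2 & \text{if } x > 0,\ y > 0,\ 0 < x + y < 1, \\ 0 & \text{elsewhere.} \end{cases} \]
Find $P\left(X + Y < \frac{1}{2}\right)$.

Solution: Define $A := \left\{ (x, y) \in \mathbb{R}^2 : x + y < \frac{1}{2} \right\}$. On $A$ the density is nonzero only on the
triangle bounded by $x = 0$, $y = 0$ and the line $x + y = \frac{1}{2}$ (that is, $y = \frac{1}{2} - x$). Then
\begin{align*}
P\left(X + Y < \frac{1}{2}\right) &= \iint_{A} f(x, y)\, dx\, dy \\
&= \int_{0}^{\frac{1}{2}} \left( \int_{0}^{\frac{1}{2} - x} 2\, dy \right) dx = \int_{0}^{\frac{1}{2}} \Big[ 2y \Big]_{0}^{\frac{1}{2} - x} dx \\
&= \int_{0}^{\frac{1}{2}} (1 - 2x)\, dx = \Big[ x - x^2 \Big]_{0}^{\frac{1}{2}} = \frac{1}{4}.
\end{align*}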

27.2 Properties of Joint Density


Let $f$ be the joint pdf of random variables $X$ and $Y$. Then
\[ P(-\infty < X < \infty,\ -\infty < Y < \infty) = P(\Omega \cap \Omega) = 1. \tag{27.1} \]
But by definition we have
\[ P(-\infty < X < \infty,\ -\infty < Y < \infty) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy. \tag{27.2} \]
Hence every joint pdf integrates to one over $\mathbb{R}^2$.

Theorem 27.5 (Characterization of joint pdf) Let $f : \mathbb{R}^2 \to \mathbb{R}$ be a function satisfying

1. $f(x, y) \ge 0$ for all $(x, y) \in \mathbb{R}^2$.

2. $\displaystyle \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy = 1$.

Then there exists a probability space $(\Omega, \mathcal{F}, P)$ and a random vector $(X, Y)$ defined on it such
that $f$ is the joint pdf of $(X, Y)$.

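Example 27.6 Let $f(x, y) = c\, e^{-\frac{x^2 - xy + 4y^2}{2}}$, $x, y \in \mathbb{R}$. Find the value of $c$ such that $f$ is a
joint pdf.

Solution: If $f$ is a joint pdf then $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy = 1$. Completing the square in $x$,
\begin{align*}
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy &= c \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac{x^2 - xy + 4y^2}{2}}\, dx\, dy \\
&= c \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac{x^2 - 2 \cdot x \cdot \frac{y}{2} + \frac{y^2}{4} - \frac{y^2}{4} + 4y^2}{2}}\, dx\, dy \\
&= c \int_{-\infty}^{\infty} e^{-\frac{15 y^2}{8}} \left( \int_{-\infty}^{\infty} e^{-\frac{(x - \frac{y}{2})^2}{2}}\, dx \right) dy \\
&= c \int_{-\infty}^{\infty} e^{-\frac{15 y^2}{8}} \left( \int_{-\infty}^{\infty} e^{-\frac{u^2}{2}}\, du \right) dy \quad \left(\text{put } x = u + \tfrac{y}{2}\right) \\
&= c \sqrt{2\pi} \int_{-\infty}^{\infty} e^{-\frac{15 y^2}{8}}\, dy \\
&= c \sqrt{2\pi} \int_{-\infty}^{\infty} e^{-\frac{u^2}{2}}\, \frac{2\, du}{\sqrt{15}} \quad \left(\text{put } y = \tfrac{2u}{\sqrt{15}}\right) \\
&= c \cdot 2\pi \cdot \frac{2}{\sqrt{15}}.
\end{align*}
Hence $c = \dfrac{\sqrt{15}}{4\pi}$.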
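The value $c = \frac{\sqrt{15}}{4\pi}$ obtained by completing the square can be sanity-checked by numerical integration. The sketch below uses a plain midpoint rule on a truncated rectangle; the step size and the truncation limits are ad hoc assumptions, chosen large enough that the neglected Gaussian tails are negligible.

```python
import math

def unnormalized(x: float, y: float) -> float:
    """exp(-(x^2 - x*y + 4*y^2) / 2), the density of Example 27.6 without the constant c."""
    return math.exp(-(x * x - x * y + 4 * y * y) / 2)

def midpoint_integral(h: float = 0.05, x_max: float = 12.0, y_max: float = 6.0) -> float:
    """Midpoint-rule approximation of the double integral over [-x_max, x_max] x [-y_max, y_max]."""
    nx = int(round(2 * x_max / h))
    ny = int(round(2 * y_max / h))
    total = 0.0
    for i in range(nx):
        x = -x_max + (i + 0.5) * h
        for j in range(ny):
            y = -y_max + (j + 0.5) * h
            total += unnormalized(x, y)
    return total * h * h

c = math.sqrt(15) / (4 * math.pi)
print(c * midpoint_integral())  # should be very close to 1
```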

Lecture 28: Marginal PDFs and Joint CDF
13 March, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

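Proposition 28.1 If $f$ is the joint pdf of $X_1$ and $X_2$, then
\[ f_{X_1}(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\, dx_2, \qquad f_{X_2}(x_2) = \int_{-\infty}^{\infty} f(x_1, x_2)\, dx_1. \]

Proof:
\begin{align*}
P\{X_1 \le x_1\} &= P\{X_1 \le x_1,\ X_2 < \infty\} \\
&= \int_{-\infty}^{x_1} \int_{-\infty}^{\infty} f(x, x_2)\, dx_2\, dx \\
&= \int_{-\infty}^{x_1} g(x)\, dx, \quad \text{where } g(x) = \int_{-\infty}^{\infty} f(x, x_2)\, dx_2.
\end{align*}
Hence
\[ f_{X_1}(x_1) = g(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\, dx_2. \]
The formula for $f_{X_2}$ is obtained in the same way.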

Example 28.2 The joint probability density function of $(X, Y)$ is given as
\[ f(x, y) = \begin{cases} 6(1 - x), & 0 < y < x,\ 0 < x < 1, \\ 0, & \text{otherwise.} \end{cases} \]
Determine the marginal densities of the random variables $X$ and $Y$.

Solution:

Density of $Y$: We have
\[ f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\, dx, \quad \forall y \in \mathbb{R}. \]
Since $f(x, y)$ is zero for $y \ge 1$ and $y \le 0$, we have $f_Y(y) = 0$ if $y \ge 1$ or $y \le 0$.
Now for $y \in (0, 1)$, the density is nonzero for $x$ between the line $y = x$ and the line $x = 1$, so
\[ f_Y(y) = \int_{y}^{1} 6(1 - x)\, dx = 6 \left[ x - \frac{x^2}{2} \right]_{y}^{1} = 3(y - 1)^2. \]
Hence
\[ f_Y(y) = \begin{cases} 3(y - 1)^2, & 0 < y < 1, \\ 0, & \text{otherwise.} \end{cases} \]

Density of $X$: We have
\[ f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy. \]
Again $f(x, y)$ is zero for $x \ge 1$ and $x \le 0$, therefore $f_X(x) = 0$ if $x \ge 1$ or $x \le 0$. For
$x \in (0, 1)$, the density is nonzero for $y$ between the line $y = 0$ and the line $y = x$, so
\[ f_X(x) = \int_{0}^{x} 6(1 - x)\, dy = 6 \Big[ y - xy \Big]_{0}^{x} = 6(x - x^2). \]
Therefore
\[ f_X(x) = \begin{cases} 6(x - x^2), & 0 < x < 1, \\ 0, & \text{otherwise.} \end{cases} \]
28.1 Joint Distribution Function


Definition 28.3 (joint distribution function) Let X = (X1 , X2 ) be a random vector on
(Ω, F, P ). Then the function F : R2 → R given by

F (x1 , x2 ) = P {X1 ≤ x1 , X2 ≤ x2 }

is called the joint distribution function of (X1 , X2 ).

Example 28.4 Suppose the joint pmf of $X$ and $Y$ is given as
\[ f(0, 0) = f(0, 1) = \frac{1}{6}, \quad f(1, 0) = f(1, 1) = \frac{1}{3}. \]
Determine the joint CDF of $X$ and $Y$.

Solution: We have
\[ F(x, y) = \sum_{(i, j)\,:\, i \le x,\ j \le y} P(X = i, Y = j). \]
Hence
\[ F(x, y) = \begin{cases} 0, & x < 0 \text{ or } y < 0, \\ \frac{1}{6}, & 0 \le x < 1,\ 0 \le y < 1, \\ \frac{2}{6}, & 0 \le x < 1,\ y \ge 1, \\ \frac{1}{2}, & x \ge 1,\ 0 \le y < 1, \\ 1, & x \ge 1,\ y \ge 1. \end{cases} \]
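The defining sum for the joint CDF in Example 28.4 can be evaluated directly for any point $(x, y)$. The following sketch does so with exact fractions; the function name `joint_cdf` and the sample points are illustrative choices, one from each region of the piecewise formula.

```python
from fractions import Fraction as Fr

# Joint pmf of Example 28.4, keyed by (i, j).
pmf = {(0, 0): Fr(1, 6), (0, 1): Fr(1, 6), (1, 0): Fr(1, 3), (1, 1): Fr(1, 3)}

def joint_cdf(x, y):
    """F(x, y) = sum of P(X = i, Y = j) over all mass points with i <= x and j <= y."""
    return sum(p for (i, j), p in pmf.items() if i <= x and j <= y)

# One sample point per region of the piecewise formula.
for point in [(-1, 5), (0.5, 0.5), (0.5, 2), (3, 0.5), (1, 1)]:
    print(point, joint_cdf(*point))
```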

The following figures illustrate how to compute the joint CDF $F(x, y)$. In all of the
figures we denote the infinite rectangle $(-\infty, x] \times (-\infty, y]$ by $R$, which is the green
lined region in the figures. Each figure shows the four mass points $(0, 0)$, $(1, 0)$, $(0, 1)$, $(1, 1)$
together with the corner $(x, y)$ of $R$.

Fig. 1. Region $R$ if $x < 0$.
Fig. 2. Region $R$ if $y < 0$.
Fig. 3. Region $R$ if $0 \le x < 1$, $0 \le y < 1$.
Fig. 4. Region $R$ if $0 \le x < 1$, $y \ge 1$.
Fig. 5. Region $R$ if $x \ge 1$, $0 \le y < 1$.
Fig. 6. Region $R$ if $x \ge 1$, $y \ge 1$.
Lecture 29: Properties of Joint CDF
16 March, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Theorem 29.1 (Properties of Joint CDF) Let $F$ be the joint distribution function of a
random vector $X = (X_1, X_2)$. Then $F$ satisfies the following.

1. (a) $F(x_1, x_2) \to 0$ as $x_i \to -\infty$ for at least one $i \in \{1, 2\}$. That is,
      i. $\lim_{x \to -\infty} F(x, y) = 0$, $\forall y \in \mathbb{R}$.
      ii. $\lim_{y \to -\infty} F(x, y) = 0$, $\forall x \in \mathbb{R}$.
      iii. $\lim_{(x, y) \to (-\infty, -\infty)} F(x, y) = 0$.
   (b) $\lim_{(x, y) \to (\infty, \infty)} F(x, y) = 1$.

2. $F$ is right continuous in each argument. That is,
\[ \lim_{h \to 0^+} F(x + h, y) = \lim_{h \to 0^+} F(x, y + h) = F(x, y) \quad \text{for all } (x, y) \in \mathbb{R}^2. \]

3. $F$ is nondecreasing in each argument. That is,
\[ F(x, y) \le F(x + h, y)\ \ \forall h > 0, \qquad F(x, y) \le F(x, y + k)\ \ \forall k > 0. \]

4. For every $(x_1, y_1)$, $(x_2, y_2)$ with $x_1 < x_2$ and $y_1 < y_2$ the following inequality holds:
\[ F(x_2, y_2) - F(x_2, y_1) + F(x_1, y_1) - F(x_1, y_2) \ge 0. \tag{29.1} \]

Proof: We prove only the inequality (29.1). Writing $(X, Y)$ for $(X_1, X_2)$, and decomposing the
rectangle $(x_1, x_2] \times (y_1, y_2]$ as in the figure, we have
\begin{align*}
P\{x_1 < X \le x_2,\ y_1 < Y \le y_2\} &= P\{X \le x_2, Y \le y_2\} + P\{X \le x_1, Y \le y_1\} \\
&\quad - P\{X \le x_1, Y \le y_2\} - P\{X \le x_2, Y \le y_1\} \\
&= F(x_2, y_2) + F(x_1, y_1) - F(x_1, y_2) - F(x_2, y_1) \ge 0.
\end{align*}

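Example 29.2 Let us recall the joint cdf obtained in Example 28.4:
\[ F(x, y) = \begin{cases} 0, & x < 0 \text{ or } y < 0, \\ \frac{1}{6}, & 0 \le x < 1,\ 0 \le y < 1, \\ \frac{2}{6}, & 0 \le x < 1,\ y \ge 1, \\ \frac{1}{2}, & x \ge 1,\ 0 \le y < 1, \\ 1, & x \ge 1,\ y \ge 1. \end{cases} \]

[Figure: the $(x, y)$-plane with origin $(0, 0)$, divided into the regions where $F = 0$, $F = \frac{1}{6}$, $F = \frac{1}{3}$, $F = \frac{1}{2}$ and $F = 1$.]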

Let us verify properties 1-4 in Theorem 29.1.

1. (a) For any given $y \in \mathbb{R}$, $F(x, y) = 0$ for all $x < 0$. Hence $\lim_{x \to -\infty} F(x, y) = 0$, $\forall y \in \mathbb{R}$.
   (b) Similarly, for any given $x \in \mathbb{R}$, $F(x, y) = 0$ for all $y < 0$. Hence $\lim_{y \to -\infty} F(x, y) = 0$, $\forall x \in \mathbb{R}$.
   (c) Also, since $F(x, y) = 0$ in the third quadrant, $\lim_{(x, y) \to (-\infty, -\infty)} F(x, y) = 0$.

2. (a) (Right continuity w.r.t. $x$) If $c < 0$, then along the line $y = c$ the joint cdf is $F = 0$,
   which is continuous for all $x \in \mathbb{R}$. If $0 \le c < 1$, then along the line $y = c$,
\[ F(x, c) = \begin{cases} 0, & x < 0, \\ \frac{1}{6}, & 0 \le x < 1, \\ \frac{1}{2}, & x \ge 1, \end{cases} \]
   which is right continuous everywhere.
   If $c \ge 1$, then along the line $y = c$,
\[ F(x, c) = \begin{cases} 0, & x < 0, \\ \frac{1}{3}, & 0 \le x < 1, \\ 1, & x \ge 1, \end{cases} \]
   which is right continuous everywhere.

   (b) (Right continuity w.r.t. $y$) If $c < 0$, then along the line $x = c$ the joint cdf is $F = 0$,
   which is continuous for all $y \in \mathbb{R}$. If $0 \le c < 1$, then along the line $x = c$,
\[ F(c, y) = \begin{cases} 0, & y < 0, \\ \frac{1}{6}, & 0 \le y < 1, \\ \frac{1}{3}, & y \ge 1, \end{cases} \]
   which is right continuous everywhere.
   If $c \ge 1$, then along the line $x = c$,
\[ F(c, y) = \begin{cases} 0, & y < 0, \\ \frac{1}{2}, & 0 \le y < 1, \\ 1, & y \ge 1, \end{cases} \]
   which is right continuous everywhere.

3. That $F$ is nondecreasing in each argument is also clear from the above discussion of right
continuity in each coordinate.

4. This also holds true.

Theorem 29.3 (Characterization of Joint CDF) Any function F defined on R2 and


satisfying conditions 1-4 in Theorem 29.1 can be identified as a joint distribution function
of some 2-dimensional random vector.

Example 29.4 Let $F$ be a function of two variables defined by
\[ F(x, y) = \begin{cases} 0, & x < 0 \text{ or } x + y < 1 \text{ or } y < 0, \\ 1, & \text{otherwise.} \end{cases} \]
Determine whether $F$ is a joint CDF.

Solution: Let us verify properties 1-4 in Theorem 29.1.

1. (a) For any given $y \in \mathbb{R}$, $F(x, y) = 0$ for all $x < 0$. Hence $\lim_{x \to -\infty} F(x, y) = 0$, $\forall y \in \mathbb{R}$.
   (b) Similarly, for any given $x \in \mathbb{R}$, $F(x, y) = 0$ for all $y < 0$. Hence $\lim_{y \to -\infty} F(x, y) = 0$, $\forall x \in \mathbb{R}$.
   (c) Also, since $F(x, y) = 0$ in the third quadrant, $\lim_{(x, y) \to (-\infty, -\infty)} F(x, y) = 0$.

2. (a) (Right continuity w.r.t. $x$) If $c < 0$, then along the line $y = c$ the function is $F = 0$,
   which is continuous for all $x \in \mathbb{R}$. If $0 \le c < 1$, then along the line $y = c$,
\[ F(x, c) = \begin{cases} 0, & x < 1 - c, \\ 1, & x \ge 1 - c, \end{cases} \]
   which is right continuous everywhere.
   If $c \ge 1$, then along the line $y = c$,
\[ F(x, c) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0, \end{cases} \]
   which is right continuous everywhere.

   (b) (Right continuity w.r.t. $y$) If $c < 0$, then along the line $x = c$ the function is $F = 0$,
   which is continuous for all $y \in \mathbb{R}$. If $0 \le c < 1$, then along the line $x = c$,
\[ F(c, y) = \begin{cases} 0, & y < 1 - c, \\ 1, & y \ge 1 - c, \end{cases} \]
   which is right continuous everywhere.
   If $c \ge 1$, then along the line $x = c$,
\[ F(c, y) = \begin{cases} 0, & y < 0, \\ 1, & y \ge 0, \end{cases} \]
   which is right continuous everywhere.

3. That $F$ is nondecreasing in each argument is also clear from the above discussion of right
continuity in each coordinate.

4. Take $(x_1, y_1) = \left(\frac{1}{3}, \frac{1}{3}\right)$ and $(x_2, y_2) = (1, 1)$. Then
\[ F(x_2, y_2) - F(x_2, y_1) + F(x_1, y_1) - F(x_1, y_2) = 1 - 1 + 0 - 1 = -1. \]

Hence the given $F$ is not a CDF.
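The failure of the rectangle inequality (29.1) in Example 29.4 can be reproduced in a few lines. In this sketch, `rectangle_mass` is an illustrative name for the left-hand side of (29.1), which would equal $P\{x_1 < X \le x_2,\ y_1 < Y \le y_2\}$ if $F$ were a genuine joint CDF.

```python
from fractions import Fraction as Fr

def F(x, y):
    """The candidate function of Example 29.4."""
    return 0 if (x < 0 or y < 0 or x + y < 1) else 1

def rectangle_mass(F, x1, y1, x2, y2):
    """Left-hand side of (29.1): F(x2, y2) - F(x2, y1) + F(x1, y1) - F(x1, y2)."""
    return F(x2, y2) - F(x2, y1) + F(x1, y1) - F(x1, y2)

third = Fr(1, 3)
# A negative value means F cannot be a joint CDF.
print(rectangle_mass(F, third, third, 1, 1))
```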


Lecture 30: Marginal Distributions
18 March, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

30.1 Marginal Distribution Function:


Given a random vector $X = (X_1, X_2, \dots, X_n)$, the distribution function of $X_1$, denoted by
$F_{X_1}$, is called the marginal distribution function of $X_1$. The marginal distribution function
$F_{X_i}$ of $X_i$ is defined similarly. Given the joint distribution function $F$ of $X$, one can recover the
corresponding marginal distribution functions as follows:
\[ F_{X_1}(x_1) = P\{X_1 \le x_1\} = P\{X_1 \le x_1,\ X_2 \in \mathbb{R},\ \dots,\ X_n \in \mathbb{R}\} = \lim_{x_k \to +\infty\ \forall k \ge 2} F(x_1, x_2, \dots, x_n). \]
Similarly,
\[ F_{X_i}(x_i) = \lim_{x_k \to +\infty\ \forall k \neq i} F(x_1, x_2, \dots, x_n). \]
Given the marginal distribution functions of $X_1$ and $X_2$, in general it is impossible to construct
the joint distribution function of $(X_1, X_2)$. Note that the marginal distribution functions
do not contain information about the dependence between $X_1$ and $X_2$.

Example 30.1 Suppose the joint pmf of $X$ and $Y$ is given as
\[ f(0, 0) = f(0, 1) = \frac{1}{6}, \quad f(1, 0) = f(1, 1) = \frac{1}{3}. \]
Let us recall the joint cdf obtained in Example 28.4:
\[ F(x, y) = \begin{cases} 0, & x < 0 \text{ or } y < 0, \\ \frac{1}{6}, & 0 \le x < 1,\ 0 \le y < 1, \\ \frac{2}{6}, & 0 \le x < 1,\ y \ge 1, \\ \frac{1}{2}, & x \ge 1,\ 0 \le y < 1, \\ 1, & x \ge 1,\ y \ge 1. \end{cases} \]
Then
\[ F_X(x) = \lim_{y \to \infty} F(x, y) = \begin{cases} 0, & x < 0, \\ \frac{1}{3}, & 0 \le x < 1, \\ 1, & x \ge 1. \end{cases} \]
This corresponds to the pmf $P(X = 0) = \frac{1}{3}$, $P(X = 1) = \frac{2}{3}$.

Similarly we have
\[ F_Y(y) = \lim_{x \to \infty} F(x, y) = \begin{cases} 0, & y < 0, \\ \frac{1}{2}, & 0 \le y < 1, \\ 1, & y \ge 1. \end{cases} \]
This corresponds to the pmf $P(Y = 0) = P(Y = 1) = \frac{1}{2}$.

30.2 Joint CDF and Joint Density


The joint cdf is usually not very handy for a discrete random vector. But for a random
vector with a density we have the important relationship
\[ F(x_1, x_2) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} f(x, y)\, dy\, dx, \quad \forall\, x_1, x_2 \in \mathbb{R}. \tag{30.1} \]

1. As in the one-dimensional case, the joint density $f(x, y)$ is not uniquely determined by (30.1).
We can change $f$ at a finite number of points, or even over a finite number of smooth
curves in the plane, without affecting integrals of $f$ over sets in the plane.
2. Given the joint CDF $F(x, y)$, we can determine the joint PDF $f(x, y)$ through the following
formula:
\[ f(x, y) = \frac{\partial^2 F}{\partial x\, \partial y} \tag{30.2} \]
for every $(x, y)$ at which the joint PDF $f$ is continuous. This relationship is useful
in situations where an expression for $F(x, y)$ can be found. The mixed partial
derivative can then be computed to find the joint pdf.

Example 30.2 Suppose a joint cdf is given as
\[ F(x, y) = \begin{cases} 6xy + y^3 - 3y^2 - 3x^2 y, & 0 < y < x,\ 0 < x < 1, \\ 3x^2 - 2x^3, & 0 < x < 1,\ y \ge x, \\ 3y + y^3 - 3y^2, & 0 < y < 1,\ x \ge 1, \\ 1, & y \ge 1,\ x \ge 1, \\ 0, & \text{otherwise.} \end{cases} \]
Find the joint density (if it exists).

Solution: Instead of checking that $F(x, y)$ is continuous everywhere on the plane $\mathbb{R}^2$
and then computing mixed partials to obtain the density, it is usually simpler to compute the
mixed partials first and show that the function $f$ obtained from (30.2) satisfies both the
conditions of Theorem 27.5.
We may avoid the boundary points of the various regions and compute the mixed partials only at
the interior points.
\[ \frac{\partial F}{\partial x}(x, y) = \begin{cases} 6y - 6xy, & 0 < y < x,\ 0 < x < 1, \\ 6x - 6x^2, & y > x,\ 0 < x < 1, \\ 0, & 0 < y < 1,\ x > 1, \\ 0, & y > 1,\ x > 1, \\ 0, & \text{otherwise.} \end{cases} \]
Further,
\[ \frac{\partial^2 F}{\partial x\, \partial y}(x, y) = \begin{cases} 6 - 6x, & 0 < y < x,\ 0 < x < 1, \\ 0, & y > x,\ 0 < x < 1, \\ 0, & 0 < y < 1,\ x > 1, \\ 0, & y > 1,\ x > 1, \\ 0, & \text{otherwise.} \end{cases} \]
Therefore, we obtain the following candidate for the joint pdf:
\[ f(x, y) = \begin{cases} 6(1 - x), & 0 < y < x,\ 0 < x < 1, \\ 0, & \text{otherwise.} \end{cases} \]
Clearly $f$ is non-negative and
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy = \int_{0}^{1} \left( \int_{y}^{1} 6(1 - x)\, dx \right) dy = \int_{0}^{1} 3(y - 1)^2\, dy = 1. \]
Hence $f$ is the desired joint pdf.

Example 30.3 Let $(X, Y)$ be a random vector with joint PDF given by
\[
f(x, y) = \begin{cases}
e^{-(x+y)}, & 0 < x < \infty,\ 0 < y < \infty \\
0, & \text{otherwise.}
\end{cases}
\]

Determine the joint CDF.

Solution: If either $x \le 0$ or $y \le 0$, the joint cdf $F \equiv 0$, as the joint pdf is zero in this region.

Let $(x, y)$ be an interior point of the first quadrant. Then
\[
F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(s, t)\,dt\,ds = \int_0^x \int_0^y e^{-(s+t)}\,dt\,ds = \left( \int_0^x e^{-s}\,ds \right) \left( \int_0^y e^{-t}\,dt \right) = (1 - e^{-x})(1 - e^{-y}).
\]

Therefore the joint CDF is
\[
F(x, y) = \begin{cases}
(1 - e^{-x})(1 - e^{-y}), & 0 < x < \infty,\ 0 < y < \infty \\
0, & \text{otherwise.}
\end{cases}
\]
Lecture 31: Independent Random Variables
19 March, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

We now discuss concepts of independence related to random variables. These are analogous
to the concepts of independence between events. They are developed by simply introducing
suitable events involving the possible values of various random variables, and by considering
the independence of these events.

Definition 31.1 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $(X, Y)$ be a random vector defined on it. We say that the random variables $X$ and $Y$ are independent if the events $\{X \in A\}$ and $\{Y \in B\}$ are independent for all Borel subsets $A$ and $B$ of $\mathbb{R}$.

Intuitively, independence means that the value of Y provides no information on the value of
X.

Example 31.2 Consider the experiment of tossing a fair coin and rolling a fair die simul-
taneously. Intuitively, we feel that whatever the outcome of the coin toss is, it should have no
influence on the outcome of the die roll, and vice-versa. Let X be a random variable that is
1 or 0 according as the coin lands heads or tails, i.e., such that the event {X = 1} represents
the outcome that the coin lands heads and the event {X = 0} represents the outcome that
the coin lands tails. In a similar way we represent the outcome of the die roll by a random
variable Y that takes the value 1, 2, · · · , or 6 according as the die roll results in the face
number 1, 2, · · · , or 6. Our intuitive notion that the outcome of the coin toss and die roll have
no influence on each other can be stated precisely by saying that if x is one of the numbers 1
or 0 and y is one of the numbers 1, 2, · · · , 6, then the events {X = x} and {Y = y} should
be independent.
To put all this in the framework of Definition 31.1, we proceed as follows. Take
\[
\Omega = \{(H, i), (T, i) \mid i = 1, 2, \dots, 6\}, \quad \mathcal{F} = \mathcal{P}(\Omega), \quad P(\{\omega\}) = \frac{1}{12}, \ \forall\, \omega \in \Omega.
\]
Define random variable X : Ω → R as

X(H, i) = 1, X(T, i) = 0, ∀i = 1, 2, · · · , 6.

Define random variable Y : Ω → R as

Y (H, i) = i = Y (T, i), ∀i = 1, 2, · · · , 6.


Then
\[
\begin{aligned}
P(X = 1, Y \in \{3, 4\}) &= P\big(\{(H, i) : i = 1, \dots, 6\} \cap \{(H, 3), (H, 4), (T, 3), (T, 4)\}\big) \\
&= P(\{(H, 3), (H, 4)\}) = \frac{2}{12}, \\
P(X = 1) &= P(\{(H, i) : i = 1, \dots, 6\}) = \frac{1}{2}, \\
P(Y \in \{3, 4\}) &= P(\{(H, 3), (H, 4), (T, 3), (T, 4)\}) = \frac{1}{3}.
\end{aligned}
\]
We have
\[
P(X = 1, Y \in \{3, 4\}) = P(X = 1)\,P(Y \in \{3, 4\}).
\]
Similarly we may verify for other events.
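The remaining events can also be checked mechanically. The following Python sketch is not part of the original notes; the helper names `P`, `subsets` and `is_independent` are illustrative assumptions. It enumerates the twelve outcomes of Example 31.2 and tests the product rule for every pair of sets $A \subseteq \{0, 1\}$, $B \subseteq \{1, \dots, 6\}$, which suffices because $\{X \in A\}$ depends only on $A \cap \{0, 1\}$ and $\{Y \in B\}$ only on $B \cap \{1, \dots, 6\}$.

```python
from fractions import Fraction
from itertools import product

# Sample space of Example 31.2: (coin, die) pairs, each with probability 1/12.
omega = list(product("HT", range(1, 7)))
prob = {w: Fraction(1, 12) for w in omega}


def X(w):
    """1 if the coin shows heads, 0 otherwise."""
    return 1 if w[0] == "H" else 0


def Y(w):
    """Face shown by the die."""
    return w[1]


def P(event):
    """Probability of the set of outcomes satisfying the predicate `event`."""
    return sum(prob[w] for w in omega if event(w))


def subsets(values):
    """Yield every subset of `values` as a Python set."""
    values = list(values)
    for mask in range(1 << len(values)):
        yield {v for k, v in enumerate(values) if (mask >> k) & 1}


def is_independent():
    """Check P(X in A, Y in B) = P(X in A) P(Y in B) for all A, B."""
    for A in subsets([0, 1]):
        for B in subsets(range(1, 7)):
            joint = P(lambda w: X(w) in A and Y(w) in B)
            if joint != P(lambda w: X(w) in A) * P(lambda w: Y(w) in B):
                return False
    return True


print(is_independent())
```

Exact rational arithmetic (`Fraction`) is used so that the equalities are tested exactly rather than up to rounding.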

One can characterize the independence of $X$ and $Y$ in terms of their joint and marginal distribution functions, as in the following theorem.

Theorem 31.3 Let $(X, Y)$ be a random vector with joint distribution function $F$, and let $F_X$ and $F_Y$ be the distribution functions of $X$ and $Y$ respectively. Then $X$ and $Y$ are independent iff
\[
F(x, y) = F_X(x) F_Y(y), \quad \forall\, (x, y) \in \mathbb{R}^2.
\]

Remark 31.4 Definition 31.1 and Theorem 31.3 do not assume any special structure on the random variables $X$ or $Y$. In particular, we may take $X$ to be discrete and $Y$ to be absolutely continuous, or vice-versa, or one of them may be a general (neither discrete nor absolutely continuous) random variable.

Theorem 31.3 tells us that if $X$ and $Y$ are independent random variables, then the marginal distributions of $X$ and $Y$ uniquely determine the joint distribution of $X$ and $Y$. This suggests one way to construct the joint CDF.

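Example 31.5 Suppose $X \sim$ Bernoulli(1/2) and $Y \sim$ Discrete Uniform over the set $\{1, 2, \dots, 6\}$. Recall that
\[
F_X(x) = \begin{cases}
0, & x < 0 \\
\frac{1}{2}, & 0 \le x < 1 \\
1, & x \ge 1,
\end{cases}
\qquad
F_Y(y) = \begin{cases}
0 & \text{if } y < 1 \\
\frac{1}{6} & \text{if } 1 \le y < 2 \\
\frac{2}{6} & \text{if } 2 \le y < 3 \\
\ \vdots & \qquad \vdots \\
\frac{5}{6} & \text{if } 5 \le y < 6 \\
1 & \text{if } y \ge 6.
\end{cases}
\]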

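If we assume independence of $X$ and $Y$, then the joint CDF of $X$ and $Y$ is
\[
F(x, y) = \begin{cases}
0, & x < 0 \text{ or } y < 1 \\
\frac{1}{12}, & 0 \le x < 1,\ 1 \le y < 2 \\
\frac{2}{12}, & 0 \le x < 1,\ 2 \le y < 3 \\
\frac{3}{12}, & 0 \le x < 1,\ 3 \le y < 4 \\
\frac{4}{12}, & 0 \le x < 1,\ 4 \le y < 5 \\
\frac{5}{12}, & 0 \le x < 1,\ 5 \le y < 6 \\
\frac{1}{2}, & 0 \le x < 1,\ y \ge 6 \\
\frac{1}{6}, & x \ge 1,\ 1 \le y < 2 \\
\frac{2}{6}, & x \ge 1,\ 2 \le y < 3 \\
\frac{3}{6}, & x \ge 1,\ 3 \le y < 4 \\
\frac{4}{6}, & x \ge 1,\ 4 \le y < 5 \\
\frac{5}{6}, & x \ge 1,\ 5 \le y < 6 \\
1, & x \ge 1,\ y \ge 6.
\end{cases}
\]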

Theorem 31.6 Let $X$ and $Y$ be two discrete random variables. Then $X$ and $Y$ are independent iff the joint pmf can be written as the product of the marginal pmfs, i.e.,
\[
P\{X = x, Y = y\} = P\{X = x\}\,P\{Y = y\}, \quad \text{for all } x \in R(X),\ y \in R(Y).
\]

Remark 31.7 Let us recall that if we are given only the marginal distributions of random variables $X$ and $Y$, it is in general impossible to determine the joint distribution of $X$ and $Y$. But in a very special situation, namely when the random variables $X$ and $Y$ are independent, knowledge of the marginal distributions is enough to construct the joint distribution, thanks to Theorem 31.3.

Example 31.8 Let the random vector $(X, Y)$ have joint probabilities given by the following table.
\[
\begin{array}{c|ccc}
X \backslash Y & -1 & 0 & 1 \\ \hline
-1 & 0 & \frac{1}{4} & 0 \\
0 & \frac{1}{4} & 0 & \frac{1}{4} \\
1 & 0 & \frac{1}{4} & 0
\end{array}
\]

Determine whether $X$ and $Y$ are independent.

Solution: Note that $X$ and $Y$ are identically distributed and
\[
P(X = -1) = P(X = 1) = \frac{1}{4} \quad \text{and} \quad P(X = 0) = \frac{1}{2}.
\]
However, $X$ and $Y$ are not independent since
\[
P(X = -1, Y = -1) = 0 \ne \frac{1}{16} = P(X = -1)\,P(Y = -1).
\]
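The same comparison can be run over the whole table. The following is a minimal Python sketch, not taken from the notes (the names `joint`, `p_x`, `p_y` and `violations` are illustrative): it computes both marginals from the table of Example 31.8 and lists the cells where the joint probability differs from the product of the marginals.

```python
from fractions import Fraction

q = Fraction(1, 4)
# joint[(x, y)] = P(X = x, Y = y), copied from the table in Example 31.8.
joint = {
    (-1, -1): 0, (-1, 0): q, (-1, 1): 0,
    (0, -1): q, (0, 0): 0, (0, 1): q,
    (1, -1): 0, (1, 0): q, (1, 1): 0,
}
values = (-1, 0, 1)

# Marginal pmfs: sum the joint pmf over the other variable.
p_x = {x: sum(joint[(x, y)] for y in values) for x in values}
p_y = {y: sum(joint[(x, y)] for x in values) for y in values}

# Cells where the product rule of Theorem 31.6 fails.
violations = [
    (x, y) for x in values for y in values
    if joint[(x, y)] != p_x[x] * p_y[y]
]

print(p_x)
print(violations)
```

A single violating cell is already enough to rule out independence; listing them all simply shows how far the table is from a product table.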

Theorem 31.9 Let the random vector $(X, Y)$ have joint pdf $f(x, y)$, and let $f_X(x)$ and $f_Y(y)$ be the pdfs of the random variables $X$ and $Y$, respectively. Then $X$ and $Y$ are independent iff the joint pdf can be written as the product of the marginal pdfs, i.e.,
\[
f(x, y) = f_X(x) f_Y(y), \tag{31.1}
\]
for each $(x, y) \in \mathbb{R}^2$ where both $f(x, y)$ and $g(x, y) := f_X(x) f_Y(y)$ are continuous.

Example 31.10 Let $X$ and $Y$ be jointly distributed with the joint density
\[
f(x, y) = \begin{cases}
\dfrac{1 + xy}{4}, & |x| < 1,\ |y| < 1 \\
0, & \text{otherwise.}
\end{cases}
\]

Determine whether $X$ and $Y$ are independent.

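Solution:
\[
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \begin{cases}
\displaystyle\int_{-1}^{1} \frac{1 + xy}{4}\,dy = \frac{1}{2}, & |x| < 1 \\
0, & \text{otherwise.}
\end{cases}
\]
Since the joint pdf is symmetric in $x$ and $y$,
\[
f_Y(y) = \begin{cases}
\frac{1}{2}, & |y| < 1 \\
0, & \text{otherwise.}
\end{cases}
\]
But $f(x, y) \ne f_X(x) f_Y(y)$ at many points in the square $(-1, 1) \times (-1, 1)$ where both the joint density $f$ and the product $f_X f_Y$ are continuous: inside the square the product equals $\frac{1}{4}$, while $f(x, y) = \frac{1 + xy}{4}$ differs from $\frac{1}{4}$ whenever $xy \ne 0$. Hence the random variables $X$ and $Y$ are not independent.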
Lecture 32: Independence of Several Random Variables &
Real-Valued Functions of Random Vectors
25 March, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Theorem 32.1 Let X and Y be independent random variables and f and g be Borel-
measurable functions. Then f (X) and g(Y ) are also independent.

Example 32.2 If $X$ and $Y$ are independent then, using the previous theorem,

1. $X^2$ and $Y^2$ are independent.

2. $X$ and $\sin Y$ are independent.

3. $|X|$ and $e^Y$ are independent.

32.1 Independence of Several Random Variables


Definition 32.3 We say X1 , X2 , · · · , Xn are independent if events {X1 ∈ A1 }, {X2 ∈
A2 }, · · · , {Xn ∈ An } are independent for all A1 , A2 , · · · , An Borel subsets of R.

Theorem 32.4 A collection of jointly distributed RVs $X_1, X_2, \dots, X_n$ is independent (also called mutually or completely independent) iff
\[
F(x_1, x_2, \dots, x_n) = F_{X_1}(x_1) F_{X_2}(x_2) \cdots F_{X_n}(x_n), \quad \forall\, (x_1, \dots, x_n) \in \mathbb{R}^n,
\]
where $F$ is the joint CDF of $X_1, X_2, \dots, X_n$.

Theorem 32.5 1. Suppose X, Y, Z are discrete random variables. Then they are inde-
pendent iff

P (X = x, Y = y, Z = z) = P (X = x)P (Y = y)P (Z = z)

for all x ∈ R(X), y ∈ R(Y ), z ∈ R(Z).


2. Suppose X, Y, Z are random variables with joint pdf f (x, y, z). Then they are indepen-
dent iff
f (x, y, z) = fX (x)fY (y)fZ (z),

for all (x, y, z) ∈ R3 where both f and g(x, y, z) := fX (x)fY (y)fZ (z) are continuous.

When we study the law of large numbers and the central limit theorem, we encounter sequences of independent random variables $X_1, X_2, \dots$. So we need to understand the meaning of independence for a countably infinite collection of random variables. In view of Theorem 32.4, we have

Definition 32.6 We say that a sequence of random variables (Xn )n∈N is independent if for
every n = 2, 3, · · · the random variables X1 , X2 , · · · , Xn are independent.

32.2 Real-Valued Functions of Random Vectors

When there are multiple random variables of interest, it is possible to generate new random variables by considering functions involving several of these random variables. In particular, suppose we have two random variables $X : \Omega \to \mathbb{R}$ and $Y : \Omega \to \mathbb{R}$, and let $g : \mathbb{R}^2 \to \mathbb{R}$ be a function. Then $Z := g(X, Y) : \Omega \to \mathbb{R}$ defines another random variable. If both $X$ and $Y$ are discrete (a random variable that can take on at most a countable number of possible values is said to be discrete), then $Z$ is also discrete. The next natural question is: what is the PMF of the random variable $Z$?

PMF of function of two random variables

Now we can employ the same idea to derive the PMF of $Z = g(X, Y)$ from the joint PMF $f(x, y)$ according to
\[
\begin{aligned}
f_Z(z) = P\{Z = z\} &= P\{g(X, Y) = z\} \\
&= P\Bigg( \bigcup_{(x, y) : g(x, y) = z} \{X = x, Y = y\} \Bigg) \\
&= \sum_{(x, y) : g(x, y) = z} f(x, y).
\end{aligned}
\]

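Example 32.7 Let $X$ and $Y$ be random variables with the joint pmf given by the following table.
\[
\begin{array}{c|cccc}
X \backslash Y & -1 & 0 & 2 & 6 \\ \hline
-2 & \frac{1}{9} & \frac{1}{27} & \frac{1}{27} & \frac{1}{9} \\
1 & \frac{2}{9} & 0 & \frac{1}{9} & \frac{1}{9} \\
3 & 0 & 0 & \frac{1}{9} & \frac{4}{27}
\end{array}
\]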

Find the PMF of |Y − X|.

Solution: So here g(x, y) = |y − x|. First we compute the range of the random variable
Z = g(X, Y ) = |Y − X|.

Fix $x = -2$; then $z = |y - (-2)| = |y + 2|$. Running through all the $y$ values, we get $z = 1, 2, 4, 8$.
Fix $x = 1$; then $z = |y - 1|$. Running through all the $y$ values, we get $z = 2, 1, 1, 5$.
Fix $x = 3$; then $z = |y - 3|$. Running through all the $y$ values, we get $z = 4, 3, 1, 3$.

So the range of the random variable $Z$ is $\{1, 2, 3, 4, 5, 8\}$. Now


\[
\begin{aligned}
P(Z = 1) &= \sum_{(x, y) : |y - x| = 1} f(x, y) = f(-2, -1) + f(1, 0) + f(1, 2) + f(3, 2) = \frac{1}{9} + 0 + \frac{1}{9} + \frac{1}{9} = \frac{1}{3}, \\
P(Z = 2) &= \sum_{(x, y) : |y - x| = 2} f(x, y) = f(-2, 0) + f(1, -1) = \frac{1}{27} + \frac{2}{9} = \frac{7}{27}, \\
P(Z = 3) &= \sum_{(x, y) : |y - x| = 3} f(x, y) = f(3, 0) + f(3, 6) = 0 + \frac{4}{27} = \frac{4}{27}, \\
P(Z = 4) &= \sum_{(x, y) : |y - x| = 4} f(x, y) = f(-2, 2) + f(3, -1) = \frac{1}{27} + 0 = \frac{1}{27}, \\
P(Z = 5) &= \sum_{(x, y) : |y - x| = 5} f(x, y) = f(1, 6) = \frac{1}{9}, \\
P(Z = 8) &= \sum_{(x, y) : |y - x| = 8} f(x, y) = f(-2, 6) = \frac{1}{9}.
\end{aligned}
\]

Tip for the exam: One check on the calculation of the pmf in Example 32.7 is that
\[
\sum_{z \in R_Z} f_Z(z) = 1,
\]
where $R_Z$ denotes the range of the random variable $Z$.
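The grouping of table cells by the value of $g$ is easy to automate. The following Python sketch is an illustration only (the function name `pmf_of_function` and the variable names are assumptions, not notation from the notes); it rebuilds the pmf of $Z = |Y - X|$ from the table of Example 32.7 and applies the sum-to-one check.

```python
from collections import defaultdict
from fractions import Fraction as F

# Joint pmf of Example 32.7: rows indexed by x, columns by y.
ys = (-1, 0, 2, 6)
rows = {
    -2: (F(1, 9), F(1, 27), F(1, 27), F(1, 9)),
    1: (F(2, 9), F(0), F(1, 9), F(1, 9)),
    3: (F(0), F(0), F(1, 9), F(4, 27)),
}
joint = {(x, y): p for x, row in rows.items() for y, p in zip(ys, row)}


def pmf_of_function(joint, g):
    """pmf of Z = g(X, Y): add f(x, y) over all cells with g(x, y) = z."""
    pmf = defaultdict(F)  # F() is the fraction 0
    for (x, y), p in joint.items():
        if p > 0:  # cells of probability zero do not contribute to the range
            pmf[g(x, y)] += p
    return dict(pmf)


pmf_z = pmf_of_function(joint, lambda x, y: abs(y - x))
for z in sorted(pmf_z):
    print(z, pmf_z[z])
print("total:", sum(pmf_z.values()))
```

Passing a different `g` (for example `lambda x, y: x + y`) gives the pmf of another function of the same pair without changing the rest of the code.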

Example 32.8 Let X and Y be iid (independent & identically distributed) discrete uniform
random variables with parameter N . Find the pmf of the random variable min{X, Y }.

Solution: Set $Z := \min\{X, Y\}$. It is clear that the range of the random variable $Z$ is $\{1, 2, \dots, N\}$. Now we find its pmf,
\[
P\{Z = i\} = \sum_{(x, y) : \min\{x, y\} = i} f(x, y).
\]

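Now both $x$ and $y$ range over the set $\{1, 2, \dots, N\}$. Since $X$ and $Y$ are independent, the joint pmf is $f(x, y) = P(X = x)P(Y = y) = \frac{1}{N^2}$ for every pair $(x, y)$, as in the following table:
\[
\begin{array}{c|cccccc}
X \backslash Y & 1 & 2 & \cdots & i & \cdots & N \\ \hline
1 & \frac{1}{N^2} & \frac{1}{N^2} & \cdots & \frac{1}{N^2} & \cdots & \frac{1}{N^2} \\
2 & \frac{1}{N^2} & \frac{1}{N^2} & \cdots & \frac{1}{N^2} & \cdots & \frac{1}{N^2} \\
\vdots & \frac{1}{N^2} & \frac{1}{N^2} & \cdots & \frac{1}{N^2} & \cdots & \frac{1}{N^2} \\
i & \frac{1}{N^2} & \frac{1}{N^2} & \cdots & \frac{1}{N^2} & \cdots & \frac{1}{N^2} \\
\vdots & \frac{1}{N^2} & \frac{1}{N^2} & \cdots & \frac{1}{N^2} & \cdots & \frac{1}{N^2} \\
N & \frac{1}{N^2} & \frac{1}{N^2} & \cdots & \frac{1}{N^2} & \cdots & \frac{1}{N^2}
\end{array}
\]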

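Hence for given $i \in \{1, 2, \dots, N\}$,
\[
\begin{aligned}
\sum_{(x, y) : \min\{x, y\} = i} f(x, y) &= \sum_{y = i}^{N} f(i, y) + \sum_{x = i + 1}^{N} f(x, i) \\
&= \frac{N - (i - 1)}{N^2} + \frac{N - i}{N^2} \\
&= \frac{2N - 2i + 1}{N^2}.
\end{aligned}
\]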
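The closed form can be compared with a brute-force count over the $N \times N$ table. The following Python sketch is only a sanity check under the assumptions of Example 32.8 (the function name `pmf_min` is illustrative): it accumulates $1/N^2$ for each cell into the bin $\min\{x, y\}$.

```python
from fractions import Fraction


def pmf_min(N):
    """pmf of min{X, Y} for X, Y iid discrete uniform on {1, ..., N}."""
    pmf = {i: Fraction(0) for i in range(1, N + 1)}
    for x in range(1, N + 1):
        for y in range(1, N + 1):
            pmf[min(x, y)] += Fraction(1, N * N)  # each cell has mass 1/N^2
    return pmf


N = 6
pmf = pmf_min(N)
for i in range(1, N + 1):
    # Brute-force value next to the closed form (2N - 2i + 1) / N^2.
    print(i, pmf[i], Fraction(2 * N - 2 * i + 1, N * N))
```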
Lecture 33: Density of Function of two Random Variables
26 March, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Suppose $X$ and $Y$ have a joint pdf and $g$ is a function such that $Z = g(X, Y)$ is an absolutely continuous random variable. How do we compute the PDF of $Z$?

Example 33.1 Let X and Y be independent and exponential random variables with param-
eters λ and µ, respectively. Find the density of max{X, Y } (if it exists).

Solution: Set Z := max{X, Y }. Then for each z ∈ R, we have

{Z ≤ z} = {X ≤ z} ∩ {Y ≤ z}.

Therefore CDF of Z is

FZ (z) = P (Z ≤ z) = P (X ≤ z, Y ≤ z) = P (X ≤ z)P (Y ≤ z) = FX (z)FY (z),

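where
\[
F_X(z) = \begin{cases}
1 - e^{-\lambda z} & \text{if } z \ge 0 \\
0 & \text{if } z < 0
\end{cases},
\qquad
F_Y(z) = \begin{cases}
1 - e^{-\mu z} & \text{if } z \ge 0 \\
0 & \text{if } z < 0
\end{cases}.
\]

Therefore
\[
F_Z(z) = \begin{cases}
[1 - e^{-\lambda z}][1 - e^{-\mu z}] = 1 - e^{-\mu z} - e^{-\lambda z} + e^{-(\lambda + \mu) z} & \text{if } z \ge 0 \\
0 & \text{if } z < 0
\end{cases}.
\]

At $z = 0$, $F_Z(0) = 0$, hence the CDF is continuous everywhere. We may differentiate it to get the density:
\[
F_Z'(z) = \begin{cases}
\mu e^{-\mu z} + \lambda e^{-\lambda z} - (\lambda + \mu) e^{-(\lambda + \mu) z} & \text{if } z > 0 \\
0 & \text{if } z < 0
\end{cases}.
\]

In fact $F_Z'(0) = 0$, since both one-sided derivatives vanish at $z = 0$. So the random variable $Z$ has the density
\[
f_Z(z) = \begin{cases}
\mu e^{-\mu z} + \lambda e^{-\lambda z} - (\lambda + \mu) e^{-(\lambda + \mu) z} & \text{if } z \ge 0 \\
0 & \text{if } z < 0
\end{cases}.
\]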
Lecture 34: Expectation of function of random vector
27 March, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Example 34.1 The joint density function of $X$ and $Y$ is given by
\[
f(x, y) = \begin{cases}
e^{-(x + y)} & \text{if } 0 < x, y < \infty \\
0 & \text{otherwise}
\end{cases}.
\]
Find the density function of the random variable $\dfrac{X}{Y}$.

Solution: Let $Z := \dfrac{X}{Y}$. Let $z \in \mathbb{R}$ be given. Then
\[
\{Z \le z\} = \left\{ \frac{X}{Y} \le z \right\} = \{(X, Y) \in A_z\},
\]
where $A_z = \{(x, y) \in \mathbb{R}^2 : \frac{x}{y} \le z\}$. For points with $y > 0$ the condition reads $x \le yz$, and for points with $y < 0$ it reads $x \ge yz$. Now we plot the straight line $x = yz$, which we further divide into two cases:

1. When $y > 0$:

(a) When $z \ge 0$: For $z > 0$ the straight line $x = yz$ can be rewritten as $y = \frac{x}{z}$, which is a line with positive slope. Hence we are interested in the region $y \ge \frac{x}{z}$.

[Figure: the region $A_z$ above the line $y = \frac{x}{z}$ ($z > 0$).]

(b) When $z < 0$: The straight line $x = yz$ can be rewritten as $y = \frac{x}{z}$, which is a line with negative slope. Hence we are interested in the region $y \le \frac{x}{z}$.

[Figure: the region $A_z$ for $z < 0$.]

2. When $y < 0$: This case is futile to analyse, as the joint pdf is non-zero only in the first quadrant, so we get no contribution from here. In this case $A_z = \{(x, y) \in \mathbb{R}^2 : x \ge yz\}$.

(a) When $z \ge 0$: For $z > 0$ the straight line $x = yz$ can be rewritten as $y = \frac{x}{z}$, which is a line with positive slope. Hence we are interested in the region $y \le \frac{x}{z}$.

[Figure: the region $A_z$ below the line $y = \frac{x}{z}$ ($z > 0$).]

(b) When $z < 0$: The straight line $x = yz$ can be rewritten as $y = \frac{x}{z}$, which is a line with negative slope. Hence we are interested in the region $y \ge \frac{x}{z}$.

[Figure: the region $A_z$ for $z < 0$.]
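Therefore if $z \ge 0$,
\[
\begin{aligned}
F_Z(z) := P\{(X, Y) \in A_z\} &= \int_0^{\infty} e^{-y} \left( \int_0^{zy} e^{-x}\,dx \right) dy \\
&= \int_0^{\infty} e^{-y} \left[ -e^{-x} \right]_0^{zy} dy = \int_0^{\infty} e^{-y} \left( 1 - e^{-zy} \right) dy \\
&= \left[ -e^{-y} + \frac{1}{z + 1} e^{-(z + 1) y} \right]_0^{\infty} = -0 + 0 + 1 - \frac{1}{z + 1} = 1 - \frac{1}{z + 1}.
\end{aligned}
\]
If $z < 0$, then
\[
F_Z(z) := P\{(X, Y) \in A_z\} = 0.
\]
Hence
\[
F_Z(z) = \begin{cases}
1 - \dfrac{1}{z + 1} & \text{if } z \ge 0 \\
0 & \text{if } z < 0
\end{cases}.
\]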
Since $F_Z$ is a continuous function, we may differentiate it to get the density:
\[
F_Z'(z) = \begin{cases}
\dfrac{1}{(z + 1)^2} & \text{if } z > 0 \\
0 & \text{if } z < 0
\end{cases}.
\]
$F_Z$ is not differentiable at $z = 0$, so we set the density equal to zero at this point. Hence the pdf of $Z$ is
\[
f_Z(z) = \begin{cases}
\dfrac{1}{(z + 1)^2} & \text{if } z > 0 \\
0 & \text{if } z \le 0
\end{cases}.
\]

Example 34.2 Let X and Y be iid uniform random variables distributed over interval (0, 1).
Find the density of X + Y (if it exists).

Solution: Define $Z := X + Y$. For fixed $z \in \mathbb{R}$ the event $\{Z \le z\}$ is equivalent to the event $\{(X, Y) \in A_z\}$, where $A_z$ is the subset of $\mathbb{R}^2$ defined by $A_z = \{(x, y) \in \mathbb{R}^2 \mid x + y \le z\}$. Thus
\[
\begin{aligned}
F_Z(z) &= P(Z \le z) \\
&= P((X, Y) \in A_z) \\
&= \iint_{A_z} f(x, y)\,dx\,dy.
\end{aligned}
\]

Since our joint density is non-zero only on unit square therefore we analyse the set Az for
various values of z.

1. If $-\infty < z \le 0$:

[Figure: the region $A_z$ below the line $y = z - x$, drawn for $z = 0$ and $z = -1$.]

\[
\iint_{A_z} f(x, y)\,dx\,dy = 0.
\]

2. If $0 < z \le 1$:

[Figure: the line $y = z - x$ cutting a triangle off the unit square.]

\[
\iint_{A_z} f(x, y)\,dx\,dy = \text{Area of the triangle with vertices } (0, 0), (z, 0), (0, z) = \frac{z^2}{2}.
\]

3. If $1 < z \le 2$:

[Figure: the line $y = z - x$ crossing the unit square, with the points $(z - 1, 1)$, $(1, z - 1)$, $(z - 1, z - 1)$ and $(1, 0)$ marked.]

Thanks to Anshu Musaddi for the following short cut:
\[
\begin{aligned}
\iint_{A_z} f(x, y)\,dx\,dy &= \text{Area of the unit square} - \text{Area of the triangle with vertices } (1, z - 1), (1, 1), (z - 1, 1) \\
&= 1 - \frac{1}{2} \times (2 - z) \times (2 - z) = 1 - \frac{(2 - z)^2}{2}.
\end{aligned}
\]

4. If $z \ge 2$:

[Figure: the line $y = z - x$ lying above the unit square.]

\[
\iint_{A_z} f(x, y)\,dx\,dy = 1.
\]
Hence
\[
F_Z(z) = \begin{cases}
0 & \text{if } z \le 0 \\
\dfrac{z^2}{2} & \text{if } 0 < z \le 1 \\
1 - \dfrac{(2 - z)^2}{2} & \text{if } 1 < z \le 2 \\
1 & \text{if } z > 2
\end{cases}.
\]

It is continuous everywhere. Hence we may differentiate it to get the density:
\[
F_Z'(z) = \begin{cases}
0 & \text{if } z < 0 \\
z & \text{if } 0 < z < 1 \\
2 - z & \text{if } 1 < z < 2 \\
0 & \text{if } z > 2
\end{cases}.
\]
The one-sided derivatives agree at $z = 0, 1, 2$, which tells us that $F_Z$ is actually differentiable at these points also. Hence we get the following pdf of the random variable $Z$:
\[
f_Z(z) = \begin{cases}
0 & \text{if } z < 0 \\
z & \text{if } 0 \le z < 1 \\
2 - z & \text{if } 1 \le z < 2 \\
0 & \text{if } z \ge 2
\end{cases}.
\]
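The area argument can be checked numerically. In the following Python sketch (an illustration only; the function names and the grid size are assumptions) the unit square is covered by an $n \times n$ grid of cell midpoints, and the fraction of midpoints with $x + y \le z$ approximates $P(X + Y \le z)$ because the joint density equals 1 on the square.

```python
def cdf_formula(z):
    """F_Z(z) for Z = X + Y with X, Y iid uniform(0, 1), as derived above."""
    if z <= 0:
        return 0.0
    if z <= 1:
        return z * z / 2
    if z <= 2:
        return 1 - (2 - z) ** 2 / 2
    return 1.0


def cdf_grid(z, n=200):
    """Fraction of the n*n cell midpoints of the unit square with x + y <= z."""
    h = 1.0 / n
    count = sum(
        1
        for i in range(n)
        for j in range(n)
        if (i + 0.5) * h + (j + 0.5) * h <= z
    )
    return count / (n * n)


for z in (-0.5, 0.3, 1.0, 1.6, 2.5):
    print(z, cdf_formula(z), cdf_grid(z))
```

The grid estimate is only accurate to roughly $1/n$, since the cells cut by the line $y = z - x$ are counted as wholly inside or wholly outside.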

34.1 Expectation of function of two random variables


Theorem 34.3 1. Let $X, Y$ be two discrete random variables with joint pmf $f(x, y)$. Let $g : \mathbb{R}^2 \to \mathbb{R}$ be a function; then
\[
E[g(X, Y)] = \sum_{x \in R_X} \sum_{y \in R_Y} g(x, y) f(x, y), \tag{34.1}
\]
provided $\displaystyle\sum_{x \in R_X} \sum_{y \in R_Y} |g(x, y)| f(x, y) < \infty$.

2. Let $X, Y$ be two absolutely continuous random variables with joint pdf $f(x, y)$. Let $g : \mathbb{R}^2 \to \mathbb{R}$ be a Borel measurable function (e.g., continuous functions, indicators of reasonable sets (Borel subsets of $\mathbb{R}^2$), and functions that are continuous except across some smooth boundaries); then
\[
E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f(x, y)\,dx\,dy, \tag{34.2}
\]
provided $\displaystyle\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} |g(x, y)| f(x, y)\,dx\,dy < \infty$.

Example 34.4 Recall Example 32.7, where $X$ and $Y$ were random variables with the joint pmf given by the following table.
\[
\begin{array}{c|cccc}
X \backslash Y & -1 & 0 & 2 & 6 \\ \hline
-2 & \frac{1}{9} & \frac{1}{27} & \frac{1}{27} & \frac{1}{9} \\
1 & \frac{2}{9} & 0 & \frac{1}{9} & \frac{1}{9} \\
3 & 0 & 0 & \frac{1}{9} & \frac{4}{27}
\end{array}
\]

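Then the PMF of $Z := |Y - X|$ is
\[
P(Z = 1) = \frac{1}{3}, \quad P(Z = 2) = \frac{7}{27}, \quad P(Z = 3) = \frac{4}{27}, \quad P(Z = 4) = \frac{1}{27}, \quad P(Z = 5) = \frac{1}{9}, \quad P(Z = 8) = \frac{1}{9}.
\]
So
\[
E[Z] = \sum_{z \in R_Z} z\,P(Z = z) = 1 \times \frac{1}{3} + 2 \times \frac{7}{27} + 3 \times \frac{4}{27} + 4 \times \frac{1}{27} + 5 \times \frac{1}{9} + 8 \times \frac{1}{9} = \frac{26}{9}.
\]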

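Suppose we are interested only in $E|Y - X|$; then by formula (34.1), we have
\[
\begin{aligned}
E|Y - X| &= \sum_{y} \sum_{x} |y - x| f(x, y) \\
&= \sum_{y} |y - (-2)| f(-2, y) + \sum_{y} |y - 1| f(1, y) + \sum_{y} |y - 3| f(3, y), \\
\sum_{y} |y - (-2)| f(-2, y) &= 1 \times \frac{1}{9} + 2 \times \frac{1}{27} + 4 \times \frac{1}{27} + 8 \times \frac{1}{9} = \frac{11}{9}, \\
\sum_{y} |y - 1| f(1, y) &= 2 \times \frac{2}{9} + 1 \times 0 + 1 \times \frac{1}{9} + 5 \times \frac{1}{9} = \frac{10}{9}, \\
\sum_{y} |y - 3| f(3, y) &= 4 \times 0 + 3 \times 0 + 1 \times \frac{1}{9} + 3 \times \frac{4}{27} = \frac{5}{9}.
\end{aligned}
\]
Adding the three sums gives $E|Y - X| = \frac{11}{9} + \frac{10}{9} + \frac{5}{9} = \frac{26}{9}$, in agreement with the value obtained from the pmf of $Z$.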
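Both routes to the expectation can be reproduced in a few lines. The Python sketch below is illustrative (variable names such as `direct` and `via_pmf` are assumptions): it evaluates $E|Y - X|$ once by formula (34.1), summing over the joint table, and once by first building the pmf of $Z$.

```python
from fractions import Fraction as F

# Joint pmf of Example 32.7 / 34.4: rows indexed by x, columns by y.
ys = (-1, 0, 2, 6)
rows = {
    -2: (F(1, 9), F(1, 27), F(1, 27), F(1, 9)),
    1: (F(2, 9), F(0), F(1, 9), F(1, 9)),
    3: (F(0), F(0), F(1, 9), F(4, 27)),
}
joint = {(x, y): p for x, row in rows.items() for y, p in zip(ys, row)}


def g(x, y):
    return abs(y - x)


# Route 1: formula (34.1), a double sum of g(x, y) f(x, y) over the table.
direct = sum(g(x, y) * p for (x, y), p in joint.items())

# Route 2: build the pmf of Z = g(X, Y), then sum z * P(Z = z).
pmf_z = {}
for (x, y), p in joint.items():
    pmf_z[g(x, y)] = pmf_z.get(g(x, y), F(0)) + p
via_pmf = sum(z * p for z, p in pmf_z.items())

print(direct, via_pmf)
```

Route 1 avoids computing the distribution of $Z$ altogether, which is the practical point of Theorem 34.3.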
Lecture 35: Expectation of Function of Random Vector
29 March, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Example 35.1 Let X and Y be independent and exponential random variables with param-
eters λ and µ, respectively. Find the mean of max{X, Y } (if it exists).

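Solution: The joint pdf is non-zero in the first quadrant only, and the function $\max\{x, y\}$ is non-negative there. The integrand is therefore non-negative, so absolute convergence of the improper integral is equivalent to its convergence.
\[
\begin{aligned}
E[\max\{X, Y\}] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \max\{x, y\} f(x, y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \max\{x, y\} f_X(x) f_Y(y)\,dx\,dy \quad (\because X, Y \text{ are independent}) \\
&= \iint_{\{x \ge 0,\, y \ge 0\}} \max\{x, y\}\,\lambda \mu\, e^{-\lambda x} e^{-\mu y}\,dx\,dy \\
&= \iint_{\{x \ge 0,\, y \ge 0\} \cap \{y \ge x\}} \max\{x, y\}\,\lambda \mu\, e^{-\lambda x} e^{-\mu y}\,dx\,dy + \iint_{\{x \ge 0,\, y \ge 0\} \cap \{y < x\}} \max\{x, y\}\,\lambda \mu\, e^{-\lambda x} e^{-\mu y}\,dx\,dy \\
&= \lambda \mu \left[ \int_0^{\infty} e^{-\lambda x} \left( \int_x^{\infty} y e^{-\mu y}\,dy \right) dx + \int_0^{\infty} x e^{-\lambda x} \left( \int_0^x e^{-\mu y}\,dy \right) dx \right] \\
&= \lambda \mu \left[ \int_0^{\infty} e^{-\lambda x} \left( \frac{x e^{-\mu x}}{\mu} + \frac{e^{-\mu x}}{\mu^2} \right) dx + \int_0^{\infty} x e^{-\lambda x} \left( \frac{1 - e^{-\mu x}}{\mu} \right) dx \right] \\
&= \lambda \mu \left[ \int_0^{\infty} e^{-\lambda x} \frac{e^{-\mu x}}{\mu^2}\,dx + \int_0^{\infty} x e^{-\lambda x} \frac{1}{\mu}\,dx \right] \\
&= \lambda \mu \left[ \frac{1}{\mu^2} \left[ \frac{-e^{-(\lambda + \mu) x}}{\lambda + \mu} \right]_0^{\infty} + \frac{1}{\mu \lambda^2} \right] = \frac{\lambda}{\mu (\lambda + \mu)} + \frac{1}{\lambda} = \frac{1}{\lambda} + \frac{1}{\mu} - \frac{1}{\lambda + \mu}.
\end{aligned}
\]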


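Here we used the following integrals:
\[
\begin{aligned}
\int_x^{\infty} y e^{-\mu y}\,dy &= \left[ \frac{y e^{-\mu y}}{-\mu} \right]_{y = x}^{y = \infty} + \int_x^{\infty} \frac{e^{-\mu y}}{\mu}\,dy = \frac{x e^{-\mu x}}{\mu} - \frac{1}{\mu^2} \left[ e^{-\mu y} \right]_x^{\infty} = \frac{x e^{-\mu x}}{\mu} + \frac{e^{-\mu x}}{\mu^2}, \\
\int_0^x e^{-\mu y}\,dy &= \left[ \frac{e^{-\mu y}}{-\mu} \right]_0^x = \frac{1 - e^{-\mu x}}{\mu}, \\
\int_0^{\infty} x e^{-\lambda x}\,dx &= -\frac{1}{\lambda} \left[ x e^{-\lambda x} \right]_0^{\infty} + \frac{1}{\lambda} \int_0^{\infty} e^{-\lambda x}\,dx = -\frac{1}{\lambda^2} \left[ e^{-\lambda x} \right]_0^{\infty} = \frac{1}{\lambda^2}.
\end{aligned}
\]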
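As a cross-check of the value $\frac{1}{\lambda} + \frac{1}{\mu} - \frac{1}{\lambda + \mu}$, one can integrate $z f_Z(z)$ numerically, where $f_Z$ is the density of $\max\{X, Y\}$ found in Example 33.1. The Python sketch below is only an illustration: the function names, the midpoint rule, the truncation point `upper` and the test values $\lambda = 1$, $\mu = 2$ are all assumptions, and `upper` must be large compared with $1/\min(\lambda, \mu)$ for the neglected tail to be negligible.

```python
import math


def density_max(z, lam, mu):
    """Density of max{X, Y} from Example 33.1 (X ~ Exp(lam), Y ~ Exp(mu))."""
    if z < 0:
        return 0.0
    return (mu * math.exp(-mu * z) + lam * math.exp(-lam * z)
            - (lam + mu) * math.exp(-(lam + mu) * z))


def mean_max_numeric(lam, mu, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of z * f_Z(z) over [0, upper]."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        z = (k + 0.5) * h
        total += z * density_max(z, lam, mu)
    return total * h


def mean_max_closed_form(lam, mu):
    """Value obtained in Example 35.1."""
    return 1 / lam + 1 / mu - 1 / (lam + mu)


print(mean_max_numeric(1.0, 2.0), mean_max_closed_form(1.0, 2.0))
```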

Theorem 35.2 Let $X$ and $Y$ be two random variables on a probability space $(\Omega, \mathcal{F}, P)$ such that both have finite mean. Then

(a) $E[X + Y] = EX + EY$.
(b) If $X$ and $Y$ are independent, then
$E[XY] = EX\,EY$.

Proof: We shall prove it only in the case when $(X, Y)$ has a joint density, though both results are true even if $X$ is discrete and $Y$ has a pdf.

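(a)
\[
\begin{aligned}
E[X + Y] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x + y) f(x, y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f(x, y)\,dx\,dy + \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y f(x, y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} x \left( \int_{-\infty}^{\infty} f(x, y)\,dy \right) dx + \int_{-\infty}^{\infty} y \left( \int_{-\infty}^{\infty} f(x, y)\,dx \right) dy \\
&= \int_{-\infty}^{\infty} x f_X(x)\,dx + \int_{-\infty}^{\infty} y f_Y(y)\,dy \\
&= EX + EY.
\end{aligned}
\]

(b)
\[
\begin{aligned}
E[XY] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y f(x, y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y f_X(x) f_Y(y)\,dx\,dy \\
&= \left( \int_{-\infty}^{\infty} x f_X(x)\,dx \right) \left( \int_{-\infty}^{\infty} y f_Y(y)\,dy \right) \\
&= EX\,EY.
\end{aligned}
\]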

35.1 Conditional Distributions


The conditional probability $P(A|B)$ is the probability of the event $A$ in the new universe (or sample space) $B$. Now we extend this idea to conditioning one random variable on another, in order to quantify the dependence of one random variable on the other when the random variables are not independent. We first look at discrete random variables.

Conditional PMF

Definition 35.3 Let $X$ and $Y$ be two discrete random variables associated with the same random experiment. Then the conditional pmf $f_{X|Y}$ of $X$ given $Y = y$ is defined as
\[
f_{X|Y}(x|y) = \begin{cases}
P\{X = x \mid Y = y\} & \text{if } P\{Y = y\} > 0 \\
0 & \text{if } P\{Y = y\} = 0.
\end{cases}
\]

In the original sample space $\Omega$, the random variable $X$ has some probability distribution. Now we are told that the event $\{Y = y\}$ has occurred. If $X$ depends on $Y$, this new information provides partial knowledge about the value of $X$. Hence the probability distribution of $X$ in the new universe determined by the event $\{Y = y\}$ should change. This change is captured by the conditional pmf.
A conditional pmf can be thought of as an ordinary pmf over a new universe determined by the conditioning event. For this, note that for fixed $y$, $f_{X|Y}(x|y) \ge 0$ for all $x \in R_X$. Also, if $P(Y = y) > 0$ then
\[
\sum_{x \in R_X} f_{X|Y}(x|y) = \sum_{x \in R_X} P(X = x \mid Y = y) = P\Bigg( \bigcup_{x \in R_X} \{X = x\} \,\Bigg|\, Y = y \Bigg) = P(\Omega \mid Y = y) = 1.
\]

If $X, Y$ have joint pmf $f$, then using the definition of conditional probability we obtain
\[
f_{X|Y}(x|y) = \begin{cases}
\dfrac{f(x, y)}{f_Y(y)} & \text{if } f_Y(y) > 0 \\
0 & \text{if } f_Y(y) = 0.
\end{cases}
\]

Conditional Distribution Function

Recall that the distribution function $F_X$ of any random variable $X$ (discrete, continuous or mixed) is defined as
\[
F_X(x) = P\{X \le x\}, \quad \forall\, x \in \mathbb{R}.
\]
We define the conditional distribution function of $X$ given $Y = y$ as
\[
F_{X|Y}(x|y) := P(X \le x \mid Y = y).
\]
So the conditional distribution function is an ordinary (or unconditional) distribution function in the new universe determined by the conditioning event.
Recall that if $X$ is a discrete random variable with pmf $f_X$ then
\[
F_X(x) = \sum_{t \in R_X : t \le x} f_X(t).
\]
Similarly, if $X$ is a discrete random variable with conditional pmf $f_{X|Y}$ then
\[
F_{X|Y}(x|y) = \sum_{t \in R_X : t \le x} f_{X|Y}(t|y).
\]
Recall that if $X$ is a discrete random variable with pmf $f_X$ and $A \subset \mathbb{R}$, then
\[
P(X \in A) = \sum_{x \in A \cap R_X} f_X(x).
\]
Similarly, if $X$ is a discrete random variable with conditional pmf $f_{X|Y}$ and $A \subset \mathbb{R}$, then we have
\[
P(X \in A \mid Y = y) = \sum_{x \in R_X \cap A} f_{X|Y}(x|y).
\]

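Example 35.4 Let the joint pmf of $X$ and $Y$ be given as follows:
\[
\begin{array}{c|ccc}
X \backslash Y & -1 & 0 & 1 \\ \hline
-1 & 0 & \frac{1}{4} & 0 \\
0 & \frac{1}{4} & 0 & \frac{1}{4} \\
1 & 0 & \frac{1}{4} & 0
\end{array}
\]

Compute the conditional pmf of $X$ given $Y = 0$. Also compute the corresponding conditional distribution function.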

Solution: Note that $P(Y = 0) = \frac{1}{2}$. Hence
\[
f_{X|Y}(x|0) = \begin{cases}
\frac{1}{2} & \text{if } x = -1, 1 \\
0 & \text{if } x = 0.
\end{cases}
\]
Now the conditional distribution function is
\[
F_{X|Y}(x|0) = \begin{cases}
0 & \text{if } x < -1 \\
\frac{1}{2} & \text{if } -1 \le x < 1 \\
1 & \text{if } x \ge 1.
\end{cases}
\]
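The two steps (divide the column $Y = y$ by its total, then accumulate) can be written generically. The following Python sketch is an illustration, not part of the notes; the function names `conditional_pmf_x_given_y` and `conditional_cdf` are assumptions. It uses the table of Example 35.4.

```python
from fractions import Fraction

q = Fraction(1, 4)
# joint[(x, y)] = P(X = x, Y = y), from the table of Example 35.4.
joint = {
    (-1, -1): 0, (-1, 0): q, (-1, 1): 0,
    (0, -1): q, (0, 0): 0, (0, 1): q,
    (1, -1): 0, (1, 0): q, (1, 1): 0,
}


def conditional_pmf_x_given_y(joint, y):
    """f_{X|Y}(. | y) = f(x, y) / f_Y(y), and 0 everywhere if f_Y(y) = 0."""
    xs = sorted({x for x, _ in joint})
    f_y = sum(p for (_, yy), p in joint.items() if yy == y)
    if f_y == 0:
        return {x: Fraction(0) for x in xs}
    return {x: Fraction(joint[(x, y)]) / f_y for x in xs}


def conditional_cdf(cond_pmf, x):
    """F_{X|Y}(x | y): sum of the conditional pmf over points t <= x."""
    return sum(p for t, p in cond_pmf.items() if t <= x)


cond = conditional_pmf_x_given_y(joint, 0)
print(cond)
print([conditional_cdf(cond, x) for x in (-2, -1, 0, 1)])
```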

Remark 35.5 We have said that a conditional pmf is a pmf in the new universe determined by the conditioning event. In Example 35.4, the probability distribution of $X$ is
\[
P(X = -1) = P(X = 1) = \frac{1}{4}, \quad P(X = 0) = \frac{1}{2},
\]
whereas in the new universe determined by the event $\{Y = 0\}$, the probability distribution of $X$ is revised as
\[
P(X = -1 \mid Y = 0) = P(X = 1 \mid Y = 0) = \frac{1}{2}, \quad P(X = 0 \mid Y = 0) = 0.
\]
Similarly, the distribution function of $X$ is
\[
F_X(x) = \begin{cases}
0 & \text{if } x < -1 \\
\frac{1}{4} & \text{if } -1 \le x < 0 \\
\frac{3}{4} & \text{if } 0 \le x < 1 \\
1 & \text{if } x \ge 1.
\end{cases}
\]

$F_X$ got revised as $F_{X|Y}(x|0)$ in the new universe determined by the event $\{Y = 0\}$. Also note that $F_{X|Y}(x|0)$ satisfies all the properties of a distribution function:

1. $\lim\limits_{x \to -\infty} F_{X|Y}(x|0) = 0$, $\lim\limits_{x \to +\infty} F_{X|Y}(x|0) = 1$.

2. $F_{X|Y}(\cdot|0)$ is non-decreasing on $\mathbb{R}$.

3. $F_{X|Y}(\cdot|0)$ is right-continuous on $\mathbb{R}$.

The conditional PMF can also be used to calculate the marginal PMFs. In particular, by using the definitions, we have
\[
f_X(x) = \sum_{y} f(x, y) = \sum_{y} f_{X|Y}(x|y) f_Y(y).
\]

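Example 35.6 Suppose
\[
f_Y(y) = \begin{cases}
\frac{5}{6} & \text{if } y = 10^2 \\
\frac{1}{6} & \text{if } y = 10^4
\end{cases},
\quad
f_{X|Y}(x|10^2) = \begin{cases}
\frac{1}{2} & \text{if } x = 10^{-2} \\
\frac{1}{3} & \text{if } x = 10^{-1} \\
\frac{1}{6} & \text{if } x = 1
\end{cases},
\quad
f_{X|Y}(x|10^4) = \begin{cases}
\frac{1}{2} & \text{if } x = 1 \\
\frac{1}{3} & \text{if } x = 10 \\
\frac{1}{6} & \text{if } x = 100
\end{cases}.
\]

Find the pmf of $X$.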

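Solution: By looking at the conditional pmf $f_{X|Y}$ we see that $X$ takes the 5 values $10^{-2}, 10^{-1}, 1, 10, 100$. Now
\[
\begin{aligned}
f_X(10^{-2}) &= \frac{1}{2} \times \frac{5}{6} = \frac{5}{12}, \\
f_X(10^{-1}) &= \frac{1}{3} \times \frac{5}{6} = \frac{5}{18}, \\
f_X(1) &= \frac{1}{6} \times \frac{5}{6} + \frac{1}{2} \times \frac{1}{6} = \frac{8}{36} = \frac{2}{9}, \\
f_X(10) &= \frac{1}{3} \times \frac{1}{6} = \frac{1}{18}, \\
f_X(100) &= \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}.
\end{aligned}
\]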
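The computation is an instance of $f_X(x) = \sum_y f_{X|Y}(x|y) f_Y(y)$ and can be sketched in Python as follows (the dictionary layout and names are assumptions made for this illustration, not notation from the notes).

```python
from fractions import Fraction as F

# Data of Example 35.6.
f_y = {100: F(5, 6), 10000: F(1, 6)}
f_x_given_y = {
    100: {F(1, 100): F(1, 2), F(1, 10): F(1, 3), F(1): F(1, 6)},
    10000: {F(1): F(1, 2), F(10): F(1, 3), F(100): F(1, 6)},
}

# f_X(x) = sum over y of f_{X|Y}(x | y) * f_Y(y).
f_x = {}
for y, p_y in f_y.items():
    for x, p in f_x_given_y[y].items():
        f_x[x] = f_x.get(x, F(0)) + p * p_y

for x in sorted(f_x):
    print(x, f_x[x])
print("total:", sum(f_x.values()))
```

The value $x = 1$ is the only one reachable from both values of $Y$, which is why it is the only entry that is a sum of two products.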
Lecture 36: Conditional PDF & Law of Total Probability
1 April, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Definition 36.1 (Conditional Densities) Let $X$ and $Y$ be two random variables with joint pdf $f$. The conditional density of $X$ given $Y = y$ is defined as
\[
f_{X|Y}(x|y) = \begin{cases}
\dfrac{f(x, y)}{f_Y(y)} & \text{if } f_Y(y) > 0 \\
0 & \text{if } f_Y(y) = 0.
\end{cases}
\]

As in the case of the conditional pmf, the conditional pdf can be thought of as an ordinary pdf over a new universe determined by the conditioning event. For this, note that for fixed $y$, $f_{X|Y}(x|y) \ge 0$ for all $x \in \mathbb{R}$. Also, if $f_Y(y) > 0$ then
\[
\int_{-\infty}^{\infty} f_{X|Y}(x|y)\,dx = \frac{1}{f_Y(y)} \int_{-\infty}^{\infty} f(x, y)\,dx = \frac{f_Y(y)}{f_Y(y)} = 1.
\]

Suppose that in sampling from a human population $X$ denotes a person's weight and $Y$ denotes the person's height. Surely we would think it more likely that $X > 200$ pounds if we were told that $Y = 6$ feet than if we were told that $Y = 4$ feet. Knowledge about the value of $Y$ gives us some information about the value of $X$ even if it does not tell us the value of $X$ exactly. If we model $(X, Y)$ as continuous random variables then $P(Y = 6) = 0$, yet in actuality a value of $Y$ is observed. If, to the limit of our measurement, we see $Y = 6$, this knowledge might give us information about $X$. The conditional pdf allows us to define appropriately the conditional distribution of $X$ given $Y = y$.
Recall that if $X$ is a continuous random variable with pdf $f_X$ and $B$ is any Borel subset of $\mathbb{R}$, then
\[
P(X \in B) = \int_B f_X(x)\,dx.
\]

The above motivates the following definition.

Definition 36.2 Let $X, Y$ be jointly continuous random variables and let $f_{X|Y}(\cdot|y)$ denote the conditional density of $X$ given $Y = y$. Then for any Borel subset $B$ of $\mathbb{R}$, we define
\[
P(X \in B \mid Y = y) = \int_B f_{X|Y}(x|y)\,dx. \tag{36.1}
\]

Remark 36.3 The conditional probabilities $P(X \in B \mid Y = y)$ were left undefined when $P\{Y = y\} = 0$. But the above formula provides a natural way of defining such conditional probabilities in the present context. In addition, it allows us to view the conditional PDF $f_{X|Y}$ (as a function of $x$) as a description of the probability law of $X$, given that the event $\{Y = y\}$ has occurred.

In view of equation (36.1), the conditional CDF of $X$ given $Y = y$ is
\[
F_{X|Y}(x|y) = \int_{-\infty}^{x} f_{X|Y}(t|y)\,dt.
\]
We have
\[
\frac{d}{dx} F_{X|Y}(x|y) = f_{X|Y}(x|y),
\]
where equality holds at points $(x, y)$ at which the joint pdf $f$ is continuous, $f_Y(y) > 0$ and $f_Y(\cdot)$ is continuous at $y$.

Example 36.4 Let $X$ and $Y$ be two random variables having the joint probability density function
\[
f(x, y) = \begin{cases}
2 & \text{if } 0 < x < y < 1 \\
0 & \text{elsewhere.}
\end{cases}
\]
Find the conditional probability $P\left( X \le \frac{2}{3} \,\middle|\, Y = \frac{3}{4} \right)$.

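Solution: We are supposed to use the following definition
\[
P(X \in B \mid Y = y) = \int_B f_{X|Y}(x|y)\,dx,
\]
i.e., we need to compute the conditional density $f_{X|Y}\left( x \,\middle|\, \frac{3}{4} \right)$, and for this we need to compute $f_Y\left( \frac{3}{4} \right)$.

[Figure: the triangle $0 < x < y < 1$ above the line $y = x$, on which $f(x, y) = 2$, together with the horizontal line $y = \frac{3}{4}$.]

\[
f_Y\left( \frac{3}{4} \right) = \int_{-\infty}^{\infty} f\left( x, \frac{3}{4} \right) dx = \int_0^{3/4} f\left( x, \frac{3}{4} \right) dx = \int_0^{3/4} 2\,dx = 2 \times \frac{3}{4} = \frac{3}{2}.
\]
Since $f_Y\left( \frac{3}{4} \right) > 0$,
\[
f_{X|Y}\left( x \,\middle|\, \frac{3}{4} \right) = \begin{cases}
\dfrac{f\left( x, \frac{3}{4} \right)}{f_Y\left( \frac{3}{4} \right)} = \dfrac{2}{3/2} = \dfrac{4}{3} & \text{if } 0 < x < \frac{3}{4} \\
0 & \text{elsewhere.}
\end{cases}
\]
Hence
\[
\begin{aligned}
P\left( X \le \frac{2}{3} \,\middle|\, Y = \frac{3}{4} \right) &= \int_{-\infty}^{2/3} f_{X|Y}\left( x \,\middle|\, \frac{3}{4} \right) dx \\
&= \int_0^{2/3} \frac{4}{3}\,dx \\
&= \frac{8}{9}.
\end{aligned}
\]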

Example 36.5 Let X and Y be independent continuous random variables with pdf fX and
fY respectively. Let Z = X + Y . Determine conditional density of Z given X.

Solution: We first determine the conditional distribution function of $Z$ given $X$, i.e., $P(Z \le z \mid X = x)$. Then we have the relation
\[
P(Z \le z \mid X = x) = \int_{-\infty}^{z} f_{Z|X}(t|x)\,dt.
\]
Now
\[
\begin{aligned}
P(Z \le z \mid X = x) &= P(X + Y \le z \mid X = x) \\
&= P(x + Y \le z \mid X = x) \\
&= P(x + Y \le z) \quad (\because X, Y \text{ are independent}) \\
&= P(Y \le z - x) \\
&= \int_{-\infty}^{z - x} f_Y(y)\,dy \\
&= \int_{-\infty}^{z} f_Y(t - x)\,dt \quad (\text{put } y = t - x).
\end{aligned}
\]
Hence $f_{Z|X}(z|x) = f_Y(z - x)$.

Remark 36.6 In Example 36.5, if we try to compute the conditional density of $X + Y$ given $X$ by definition, then we are required to compute the joint density of $X + Y$ and $X$. We have not studied this type of problem.
Rather than going by the definition, we adopt the technique used for finding the pdf of a real-valued function of two random variables: we first compute the conditional distribution function and then differentiate it to obtain the conditional density.

36.1 Law of Total Probability


Proposition 36.7 (Law of total probability) Let $Y$ be a discrete random variable on the probability space $(\Omega, \mathcal{F}, P)$. Then for any event $B \in \mathcal{F}$,
\[
P(B) = \sum_{y \in R_Y} P(B \mid Y = y) f_Y(y), \tag{36.2}
\]
where $f_Y$ is the pmf of $Y$.

Proof: If $Y$ is a discrete random variable with range $R_Y \subset \mathbb{R}$, then the collection of events $\{\{Y = y\}\}_{y \in R_Y}$ forms a partition of the sample space $\Omega$. Thus, we can use the total probability theorem:
\[
P(B) = \sum_{y \in R_Y} P(B \mid Y = y) P(Y = y) = \sum_{y \in R_Y} P(B \mid Y = y) f_Y(y).
\]

We state the law of total probability for continuous random variable, which is completely
analogous to the discrete case.

Theorem 36.8 Let $X$ be a random variable (on the probability space $(\Omega, \mathcal{F}, P)$) with pdf $f_X$. Then for any event $B \in \mathcal{F}$,
\[
P(B) = \int_{-\infty}^{\infty} P(B \mid X = x) f_X(x)\,dx.
\]

Example 36.9 Let $X$ and $Y$ be two independent uniform(0,1) random variables. Find $P(X^3 + Y > 1)$.

Solution: We discuss two solution methods and see which one makes our life simpler.

1. One approach would be to form the joint pdf $f(x, y) = f_X(x) f_Y(y)$ and then compute the following integral:
\[
P(X^3 + Y > 1) = \iint_A f(x, y)\,dx\,dy,
\]
where $A := \{(x, y) \in \mathbb{R}^2 \mid x^3 + y > 1\}$. Since $X$ and $Y$ are independent, the joint pdf of $X$ and $Y$ is $f(x, y) = f_X(x) f_Y(y)$, where
\[
f_X(x) = \begin{cases}
1 & \text{if } 0 < x < 1 \\
0 & \text{otherwise}
\end{cases},
\qquad
f_Y(y) = \begin{cases}
1 & \text{if } 0 < y < 1 \\
0 & \text{otherwise}
\end{cases}.
\]
Hence
\[
f(x, y) = \begin{cases}
1 & \text{if } 0 < x < 1,\ 0 < y < 1 \\
0 & \text{otherwise}
\end{cases}.
\]
We need to plot the curve $y = 1 - x^3$. We know the graph of $y = x^3$. The graph of $y = -x^3$ is its reflection about the $y$-axis, and $y = 1 - x^3$ is obtained by lifting that graph by 1 unit.

[Figure: graphs of $y = x^3$, $y = -x^3$ and $y = 1 - x^3$, the last drawn together with the line $y = 1$.]

Setting up the limits as in the figure:
\[
\iint_A f(x, y)\,dx\,dy = \int_0^1 \left( \int_{1 - x^3}^{1} dy \right) dx = \int_0^1 \left[ 1 - (1 - x^3) \right] dx = \int_0^1 x^3\,dx = \frac{1}{4}.
\]

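2. Now we illustrate how conditioning is useful. One can condition either w.r.t. $Y = y$ or $X = x$.

(a) We condition w.r.t. $Y$. By the total probability law:
\[
\begin{aligned}
P(X^3 + Y > 1) &= \int_{-\infty}^{\infty} P(X^3 + Y > 1 \mid Y = y) f_Y(y)\,dy = \int_0^1 P(X^3 + y > 1 \mid Y = y)\,dy \\
&= \int_0^1 P(X^3 > 1 - y \mid Y = y)\,dy = \int_0^1 P(X^3 > 1 - y)\,dy \\
&= \int_0^1 P\big( X > (1 - y)^{1/3} \big)\,dy \quad (\because\ 0 < y < 1,\ X^3 > 1 - y \iff X > (1 - y)^{1/3}) \\
&= \int_0^1 \left( \int_{\sqrt[3]{1 - y}}^{1} dx \right) dy \\
&= \int_0^1 \left[ 1 - \sqrt[3]{1 - y} \right] dy = \int_1^0 (1 - \sqrt[3]{u})(-du) = \left[ u - \frac{3}{4} u^{4/3} \right]_0^1 = \frac{1}{4},
\end{aligned}
\]
where we substituted $u = 1 - y$.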

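(b) We condition w.r.t. $X$. By the total probability law:
\[
\begin{aligned}
P(X^3 + Y > 1) &= \int_{-\infty}^{\infty} P(X^3 + Y > 1 \mid X = x) f_X(x)\,dx = \int_0^1 P(X^3 + Y > 1 \mid X = x)\,dx \\
&= \int_0^1 P(x^3 + Y > 1 \mid X = x)\,dx = \int_0^1 P(Y > 1 - x^3 \mid X = x)\,dx \\
&= \int_0^1 P(Y > 1 - x^3)\,dx = \int_0^1 \left( \int_{1 - x^3}^{1} dy \right) dx \\
&= \int_0^1 x^3\,dx = \left[ \frac{x^4}{4} \right]_0^1 = \frac{1}{4}.
\end{aligned}
\]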
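The conditioning argument in (b) reduces the problem to a one-dimensional integral, $\int_0^1 P(Y > 1 - x^3)\,dx$, which is easy to approximate numerically. The Python sketch below is an illustration under the assumptions of Example 36.9 (the function names and the midpoint rule are choices made here, not taken from the notes).

```python
def tail_uniform01(t):
    """P(Y > t) for Y ~ uniform(0, 1)."""
    return min(1.0, max(0.0, 1.0 - t))


def prob_by_conditioning(n=10_000):
    """Midpoint-rule value of the integral over (0, 1) of P(Y > 1 - x^3) f_X(x) dx.

    Here f_X(x) = 1 on (0, 1), so the integrand is just P(Y > 1 - x^3).
    """
    h = 1.0 / n
    return sum(tail_uniform01(1.0 - ((k + 0.5) * h) ** 3) for k in range(n)) * h


print(prob_by_conditioning())
```

The printed value should be close to the exact answer $\frac{1}{4}$ obtained above.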
Lecture 37: Conditional Expectation
2 April, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Now we define the conditional expectation, denoted by $E[X \mid Y = y]$, of the random variable X given the information $Y = y$. A conditional expectation is the same as an ordinary expectation, except that it refers to the new universe, and the pmf/pdf is replaced by its conditional counterpart.

Definition 37.1 Let X and Y be discrete random variables with conditional pmf $f_{X|Y}$ of X given Y. Then the conditional expectation of X given $Y = y$ is defined as
\[
E[X \mid Y = y] = \sum_{x \in R_X} x\, f_{X|Y}(x \mid y),
\]
provided $\sum_{x \in R_X} |x|\, f_{X|Y}(x \mid y) < \infty$.

Example 37.2 Let X, Y be independent random variables, each with geometric distribution of parameter $0 < p < 1$. Calculate $E[Y \mid X + Y = n]$ where $n \ge 2$.

Solution: First we find the conditional pmf of Y given $X + Y = n$, where $n \ge 2$. Since the range of X and of Y is $\mathbb{N}$, the range of the random variable $Z := X + Y$ is $\{2, 3, \dots\}$. Let $n \ge 2$ be given. If $X + Y = n$ then Y can only assume values in $\{1, 2, \dots, n - 1\}$. Therefore
\[
P(Y = y \mid Z = n) = 0 \quad \text{for } y = n, n + 1, n + 2, \dots
\]

For $y \in \{1, 2, \dots, n - 1\}$,
\begin{align*}
P(Y = y \mid Z = n) &= \frac{P(Y = y, X + Y = n)}{P(X + Y = n)} = \frac{P(Y = y, X + y = n)}{P\left( \bigcup_{k=1}^{n-1} \{X = k, Y = n - k\} \right)} \\
&= \frac{P(Y = y, X = n - y)}{\sum_{k=1}^{n-1} P(X = k, Y = n - k)} \qquad \text{(by finite additivity of the probability measure)} \\
&= \frac{P(Y = y) P(X = n - y)}{\sum_{k=1}^{n-1} P(X = k) P(Y = n - k)} \qquad (\because\ X, Y \text{ are independent}) \\
&= \frac{p(1 - p)^{y-1}\, p(1 - p)^{n-y-1}}{\sum_{k=1}^{n-1} p(1 - p)^{k-1}\, p(1 - p)^{n-k-1}} = \frac{p^2 (1 - p)^{n-2}}{\sum_{k=1}^{n-1} p^2 (1 - p)^{n-2}} = \frac{1}{n - 1}.
\end{align*}

This shows that
\[
f_{Y|X+Y}(y \mid n) = \begin{cases} \dfrac{1}{n - 1} & \text{if } y = 1, \dots, n - 1, \\ 0 & \text{if } y \ge n. \end{cases}
\]
Hence Y is geometrically distributed in the original universe, but in the new universe determined by the event $X + Y = n$, Y is (discrete) uniformly distributed over the set $\{1, 2, \dots, n - 1\}$.
Hence
\begin{align*}
E[Y \mid Z = n] &= \sum_y y\, f_{Y|Z}(y \mid n) = \sum_{y=1}^{n-1} y \cdot \frac{1}{n - 1} \\
&= \frac{1}{n - 1} \times \frac{(n - 1)(n - 1 + 1)}{2} = \frac{n}{2}.
\end{align*}
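The uniformity of the conditional pmf can be confirmed with exact rational arithmetic. This is a small illustrative sketch; the helper names `geometric_pmf` and `conditional_pmf` and the values $n = 6$, $p = 1/3$ are assumptions made for the example, not part of the notes.

```python
from fractions import Fraction


def geometric_pmf(k: int, p: Fraction) -> Fraction:
    """P(X = k) = p (1 - p)^(k - 1) for k = 1, 2, ..."""
    return p * (1 - p) ** (k - 1)


def conditional_pmf(n: int, p: Fraction) -> dict:
    """Exact pmf of Y given X + Y = n for independent geometric(p) X and Y."""
    joint = {y: geometric_pmf(y, p) * geometric_pmf(n - y, p) for y in range(1, n)}
    total = sum(joint.values())  # this is P(X + Y = n)
    return {y: weight / total for y, weight in joint.items()}


pmf = conditional_pmf(6, Fraction(1, 3))
mean = sum(y * weight for y, weight in pmf.items())
print(pmf)   # every value equals 1/5, i.e. 1/(n - 1)
print(mean)  # equals n/2 = 3
```

Changing `p` leaves the output unchanged, which matches the fact that the conditional pmf does not depend on $p$.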

Definition 37.3 Let X and Y be random variables with conditional pdf $f_{X|Y}$ of X given Y. The conditional expectation of X given $\{Y = y\}$ is defined as
\[
E[X \mid Y = y] = \int_{-\infty}^{\infty} x\, f_{X|Y}(x \mid y)\,dx,
\]
provided $\int_{-\infty}^{\infty} |x|\, f_{X|Y}(x \mid y)\,dx < \infty$.

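Theorem 37.4 1. Let X, Y be discrete random variables with joint pmf f. If Y has finite mean then
\[
E[Y] = \sum_x E[Y \mid X = x]\, f_X(x).
\]

2. Let X, Y be random variables with joint pdf f. If Y has finite mean then
\[
E[Y] = \int_{-\infty}^{\infty} E[Y \mid X = x]\, f_X(x)\,dx.
\]

Proof: We prove the discrete case; the continuous case is analogous, with sums replaced by integrals.
\begin{align*}
\sum_x E[Y \mid X = x]\, f_X(x) &= \sum_x \left( \sum_y y\, f_{Y|X}(y \mid x) \right) f_X(x) \\
&= \sum_x \sum_y y\, f(x, y) \\
&= \sum_y y \sum_x f(x, y) \\
&= \sum_y y\, f_Y(y) \\
&= E[Y].
\end{align*}
The interchange of the order of summation is justified because Y has finite mean, so the double sum converges absolutely.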

Remark 37.5 The above theorem is called the total expectation theorem. It expresses the fact that "the unconditional average can be obtained by averaging the conditional averages". It can be used to calculate the unconditional expectation $E[Y]$ from the conditional expectations $E[Y \mid X = x]$.

Example 37.6 Let X, Y be continuous random variables with joint pdf given by
\[
f(x, y) = \begin{cases} 6(y - x) & ; \ 0 \le x \le y \le 1 \\ 0 & ; \ \text{otherwise.} \end{cases}
\]
Find $E[Y \mid X = x]$ and hence calculate $EY$.

Solution: In order to calculate $E[Y \mid X = x]$ we need to find $f_{Y|X}$, which is by definition equal to $\dfrac{f(x, y)}{f_X(x)}$.

[Figure: the line $y = x$ with vertical lines $x = c$ drawn for $c < 0$, $0 \le c \le 1$, and $c > 1$.]

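Note that
\[
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \begin{cases} \displaystyle\int_x^1 f(x, y)\,dy & ; \ 0 \le x \le 1 \\ 0 & ; \ \text{otherwise.} \end{cases}
\]
For $0 \le x \le 1$,
\begin{align*}
\int_x^1 6(y - x)\,dy &= 6 \left[ \frac{y^2}{2} - xy \right]_x^1 = 6 \left( \frac{x^2}{2} - x + \frac{1}{2} \right) = 3(x - 1)^2,
\end{align*}
so that
\[
f_X(x) = \begin{cases} 3(x - 1)^2 & ; \ 0 \le x \le 1 \\ 0 & ; \ \text{otherwise.} \end{cases}
\]
This implies that
\[
f_{Y|X}(y \mid x) = \begin{cases} \dfrac{2(y - x)}{(x - 1)^2} & ; \ 0 \le x \le y < 1 \\ 0 & ; \ \text{otherwise.} \end{cases}
\]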

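Hence $E[Y \mid X = x]$ would be non-zero only if $0 \le x < 1$, and in that case
\begin{align*}
E[Y \mid X = x] &= \int_{-\infty}^{\infty} y\, f_{Y|X}(y \mid x)\,dy = \int_x^1 y\, \frac{2(y - x)}{(x - 1)^2}\,dy \\
&= \frac{2}{(x - 1)^2} \int_x^1 (y^2 - xy)\,dy = \frac{2}{(x - 1)^2} \left[ \frac{y^3}{3} - x \frac{y^2}{2} \right]_x^1 \\
&= \frac{2}{(x - 1)^2} \left( \frac{1}{3} - \frac{x}{2} + \frac{x^3}{6} \right) = \frac{2(x^3 - 3x + 2)}{6(x - 1)^2} \\
&= \frac{x^2 + x - 2}{3(x - 1)} = \frac{x + 2}{3}.
\end{align*}
Therefore
\begin{align*}
EY &= \int_{-\infty}^{\infty} E[Y \mid X = x]\, f_X(x)\,dx = \int_0^1 \frac{x^2 + x - 2}{3(x - 1)} \times 3(x - 1)^2\,dx \\
&= \int_0^1 (x^2 + x - 2)(x - 1)\,dx = \int_0^1 (x^3 - 3x + 2)\,dx = \frac{3}{4}.
\end{align*}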
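As a numerical cross-check of Example 37.6, the sketch below compares the total expectation formula with a direct double integral of $y f(x, y)$. The midpoint-rule helper and all function names are illustrative assumptions; the closed forms $f_X(x) = 3(x-1)^2$ and $E[Y \mid X = x] = (x+2)/3$ are the ones derived in the example.

```python
def midpoint_integral(func, a, b, steps=400):
    """Approximate the integral of func over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))


def joint_pdf(x, y):
    """Joint pdf of Example 37.6."""
    return 6.0 * (y - x) if 0.0 <= x <= y <= 1.0 else 0.0


def marginal_x(x):
    """f_X(x) = 3 (x - 1)^2 on [0, 1]."""
    return 3.0 * (x - 1.0) ** 2


def conditional_mean(x):
    """E[Y | X = x] = (x + 2) / 3 for 0 <= x < 1."""
    return (x + 2.0) / 3.0


# Total expectation theorem: integrate E[Y | X = x] f_X(x) over x.
via_conditioning = midpoint_integral(lambda x: conditional_mean(x) * marginal_x(x), 0.0, 1.0)

# Direct computation: double integral of y f(x, y) over the triangle x <= y <= 1.
direct = midpoint_integral(
    lambda x: midpoint_integral(lambda y: y * joint_pdf(x, y), x, 1.0), 0.0, 1.0
)

print(via_conditioning, direct)  # both should be close to 3/4
```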

Theorem 37.7 1. Let X and Y be discrete random variables with joint pmf f. If g is a function then
\[
E[g(X) \mid Y = y] = \sum_x g(x)\, f_{X|Y}(x \mid y),
\]
provided $\sum_x |g(x)|\, f_{X|Y}(x \mid y) < \infty$.

2. Let X and Y be random variables with joint pdf f. If $g : \mathbb{R} \to \mathbb{R}$ is a Borel function then
\[
E[g(X) \mid Y = y] = \int_{-\infty}^{\infty} g(x)\, f_{X|Y}(x \mid y)\,dx,
\]
provided $\int_{-\infty}^{\infty} |g(x)|\, f_{X|Y}(x \mid y)\,dx < \infty$.

It is immediate that conditional expectation satisfies the usual properties of an expectation. The following results are easy to prove. We assume existence of the indicated expectations.

Theorem 37.8 (Properties of Conditional Expectation) Let X and Y be two random variables on a probability space $(\Omega, \mathcal{F}, P)$, and let a, b and c be real numbers. Suppose $g_1$ and $g_2$ are real-valued functions of one real variable such that $E|g_1(X)| < \infty$ and $E|g_2(X)| < \infty$. Then

(a) $E[a g_1(X) + b g_2(X) + c \mid Y = y] = a E[g_1(X) \mid Y = y] + b E[g_2(X) \mid Y = y] + c$.

(b) If $g_1(x) \ge g_2(x)$ for all x, then $E[g_1(X) \mid Y = y] \ge E[g_2(X) \mid Y = y]$.

(c) Let Z be another random variable. Then $E[X + Y \mid Z = z] = E[X \mid Z = z] + E[Y \mid Z = z]$.

(d) If X and Y are independent, then
\[
E[X \mid Y = y] = E[X], \qquad E[Y \mid X = x] = E[Y].
\]


Lecture 38: Covariance & Correlation
3 April, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

When two random variables are dependent, we say there exists a relationship between them. But if there is a relationship, it may be strong or weak. Now we discuss two numerical measures of the strength of a relationship between two random variables: the covariance and the correlation.

38.1 Covariance
Definition 38.1 Let X and Y be jointly distributed on the probability space (Ω, F, P ). The
covariance of two random variables X and Y , denoted by cov(X, Y ) is defined by

cov(X, Y ) = E [(X − EX)(Y − EY )] ,

provided E |(X − EX)(Y − EY )| < ∞. When cov(X, Y ) = 0, we say that X and Y are
uncorrelated.

The covariance gives information about how the random variables X and Y are linearly related. Intuitively, the covariance between X and Y indicates how the values of X and Y move relative to each other. If large values of X tend to happen with large values of Y (and small values of X with small values of Y), then the covariance is positive and we say X and Y are positively correlated. On the other hand, if X tends to be small when Y is large (or vice versa), then the covariance is negative and we say X and Y are negatively correlated.

Alternate expression for Covariance: By appealing to the linearity of the expectation,
\begin{align*}
\mathrm{cov}(X, Y) &= E[XY - X\,EY - Y\,EX + EX\,EY] \\
&= E[XY] - EX\,EY - EY\,EX + EX\,EY \\
&= E[XY] - E[X]E[Y].
\end{align*}

Independence & Covariance: If X and Y are independent then $E[XY] = EX\,EY$, therefore $\mathrm{cov}(X, Y) = 0$. But the converse is not true in general.


Example 38.2 Let the joint probabilities of random variables X and Y be given by the following table.
\[
\begin{array}{c|ccc}
X \backslash Y & -1 & 0 & 1 \\ \hline
-1 & 0 & \frac{1}{4} & 0 \\
0 & \frac{1}{4} & 0 & \frac{1}{4} \\
1 & 0 & \frac{1}{4} & 0
\end{array}
\]
Then X and Y are identically distributed and X has the following pmf:
\[
P(X = -1) = P(X = 1) = \frac{1}{4} \quad \text{and} \quad P(X = 0) = \frac{1}{2}.
\]
Also, it is easy to see that $E[X] = E[Y] = 0$. Furthermore, the random variable XY takes values in $\{-1, 0, 1\}$ with the pmf
\[
P(XY = 1) = 0 = P(XY = -1) \quad \text{and} \quad P(XY = 0) = 1.
\]
Therefore $E[XY] = 0$, which in turn implies $\mathrm{cov}(X, Y) = 0$. However, X and Y are not independent since
\[
P(X = -1, Y = -1) = 0 \ne \frac{1}{16} = P(X = -1) P(Y = -1).
\]
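The claims of Example 38.2 can be verified by direct enumeration. The sketch below encodes the table as a dictionary; the helper names `expectation` and `marginal` are illustrative, not from the notes.

```python
from fractions import Fraction

quarter = Fraction(1, 4)
# Non-zero entries of the joint pmf table, keyed by (x, y).
joint = {(-1, 0): quarter, (0, -1): quarter, (0, 1): quarter, (1, 0): quarter}


def expectation(func):
    """E[func(X, Y)] under the joint pmf."""
    return sum(func(x, y) * p for (x, y), p in joint.items())


def marginal(index, value):
    """Marginal probability of X (index 0) or Y (index 1) taking the given value."""
    return sum(p for point, p in joint.items() if point[index] == value)


covariance = (
    expectation(lambda x, y: x * y)
    - expectation(lambda x, y: x) * expectation(lambda x, y: y)
)
# Independence would require the joint pmf to factor at every point of the grid.
independent = all(
    joint.get((x, y), Fraction(0)) == marginal(0, x) * marginal(1, y)
    for x in (-1, 0, 1)
    for y in (-1, 0, 1)
)
print(covariance, independent)
```

The covariance comes out as zero while the factorization test fails, which is exactly the "uncorrelated but not independent" situation described above.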

Proposition 38.3 For any random variables X, Y and Z, and any $a, b \in \mathbb{R}$,

1. $\mathrm{cov}(X, X) = \mathrm{var}(X)$.

2. $\mathrm{cov}(X, Y) = \mathrm{cov}(Y, X)$.

3. $\mathrm{cov}(X, aY + b) = a\,\mathrm{cov}(X, Y)$.

4. $\mathrm{cov}(X, Y + Z) = \mathrm{cov}(X, Y) + \mathrm{cov}(X, Z)$.

Proof: All the proofs follow from the definition of covariance and linearity of expectation.

1. $\mathrm{cov}(X, X) = E[X^2] - EX\,EX = \mathrm{var}(X)$.

2. Immediate from the definition.

3.
\begin{align*}
\mathrm{cov}(X, aY + b) &= E[X(aY + b)] - EX\,E[aY + b] = E[aXY + bX] - EX\,[a\,EY + b] \\
&= a E[XY] + b\,EX - a\,EX\,EY - b\,EX = a\,[E[XY] - EX\,EY] = a\,\mathrm{cov}(X, Y).
\end{align*}

4.
\begin{align*}
\mathrm{cov}(X, Y + Z) &= E[X(Y + Z)] - EX\,E(Y + Z) = E[XY + XZ] - EX\,[EY + EZ] \\
&= E[XY] + E[XZ] - EX\,EY - EX\,EZ = \mathrm{cov}(X, Y) + \mathrm{cov}(X, Z).
\end{align*}

Example 38.4 Let X and Y be two independent N(0, 1) random variables and $Z = 1 + X + XY^2$, $W = 1 + X$. Find $\mathrm{cov}(Z, W)$.

Solution:
\begin{align*}
\mathrm{cov}(Z, W) &= \mathrm{cov}(1 + X + XY^2, 1 + X) \\
&= \mathrm{cov}(X + XY^2, X) \qquad (\because \text{adding a constant to any rv does not affect the covariance}) \\
&= \mathrm{cov}(X, X) + \mathrm{cov}(XY^2, X) \\
&= \mathrm{var}(X) + E[XY^2 X] - E[XY^2]\,EX = 1 + E[X^2]E[Y^2] - 0 = 2.
\end{align*}
Example 38.5 For any random variables X, Y , show that

var(X + Y ) = var(X) + var(Y ) + 2cov(X, Y ).

Solution:

\begin{align*}
\mathrm{var}(X + Y) &= \mathrm{cov}(X + Y, X + Y) = \mathrm{cov}(X + Y, X) + \mathrm{cov}(X + Y, Y) \\
&= \mathrm{cov}(X, X) + \mathrm{cov}(Y, X) + \mathrm{cov}(X, Y) + \mathrm{cov}(Y, Y) \\
&= \mathrm{var}(X) + \mathrm{var}(Y) + 2\,\mathrm{cov}(X, Y).
\end{align*}

If X and Y are independent then $\mathrm{cov}(X, Y) = 0$, so from Example 38.5 it follows that
\[
\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y).
\]
So the variance of a sum of independent random variables is the sum of the variances of the random variables.

38.2 Correlation
The sign of cov(X, Y ) gives information about the linear relationship of X and Y ; however,
its actual magnitude does not have much meaning since it depends on the variability of X
and Y . Therefore cov(X, Y ) the number itself does not give information about the strength
of the relationship between X and Y .
The correlation coefficient removes, in a sense, the individual variability of each X and Y by
dividing the covariance by the product of the standard deviations, and thus the correlation
coefficient is a better measure of the linear relationship of X and Y than is the covariance.
Also, the correlation coefficient is unitless.

Definition 38.6 The correlation coefficient of two random variables X and Y, denoted by $\rho(X, Y)$, is defined as
\[
\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}},
\]
provided $\mathrm{var}(X) > 0$ and $\mathrm{var}(Y) > 0$.
Lecture 39: Correlation & Characteristics Function
5 April, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

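Example 39.1 Let the joint pdf of (X, Y) be
\[
f(x, y) = \begin{cases} 1 & ; \ 0 < x < 1,\ x < y < x + 1 \\ 0 & ; \ \text{otherwise.} \end{cases}
\]

[Figure: the strip between the lines $y = x$ and $y = x + 1$ over $0 < x < 1$, with the points $(0, 0)$ and $(1, 0)$ marked.]

The marginal pdf of X is
\[
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \begin{cases} 0 & ; \ x \le 0 \\ \displaystyle\int_x^{x+1} 1\,dy = 1 & ; \ 0 < x < 1 \\ 0 & ; \ x \ge 1. \end{cases}
\]
Then $X \sim U(0, 1)$. Hence $EX = \frac{1}{2}$ and $\mathrm{var}(X) = \frac{1}{12}$. The marginal pdf of Y is
\[
f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx = \begin{cases} \displaystyle\int_0^y 1\,dx = y & ; \ 0 < y < 1 \\ \displaystyle\int_{y-1}^1 1\,dx = 2 - y & ; \ 1 \le y < 2 \\ 0 & ; \ \text{otherwise.} \end{cases}
\]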

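Hence
\[
EY = \int_{-\infty}^{\infty} y f_Y(y)\,dy = \int_0^1 y^2\,dy + \int_1^2 (2y - y^2)\,dy = \left[ \frac{y^3}{3} \right]_0^1 + \left[ y^2 - \frac{y^3}{3} \right]_1^2 = \frac{1}{3} + \frac{2}{3} = 1
\]
and $\mathrm{var}(Y) = E[Y^2] - 1 = \frac{1}{6}$, because
\[
E[Y^2] = \int_{-\infty}^{\infty} y^2 f_Y(y)\,dy = \int_0^1 y^3\,dy + \int_1^2 (2y^2 - y^3)\,dy = \left[ \frac{y^4}{4} \right]_0^1 + \left[ \frac{2y^3}{3} - \frac{y^4}{4} \right]_1^2 = \frac{1}{4} + \frac{4}{3} - \frac{5}{12} = \frac{1}{4} + \frac{11}{12} = \frac{7}{6}.
\]
We also have
\begin{align*}
E[XY] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy f(x, y)\,dx\,dy = \int_0^1 \left( \int_x^{x+1} xy\,dy \right) dx = \int_0^1 x \left[ \frac{y^2}{2} \right]_x^{x+1} dx = \int_0^1 x\, \frac{(x + 1)^2 - x^2}{2}\,dx \\
&= \int_0^1 x\, \frac{(2x + 1)}{2}\,dx = \int_0^1 \left( x^2 + \frac{x}{2} \right) dx = \left[ \frac{x^3}{3} + \frac{x^2}{4} \right]_0^1 = \frac{7}{12}.
\end{align*}
Hence
\[
\mathrm{Cov}(X, Y) = E[XY] - EX\,EY = \frac{7}{12} - \frac{1}{2} = \frac{1}{12}.
\]
The correlation is
\[
\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} = \frac{\frac{1}{12}}{\sqrt{\frac{1}{12} \times \frac{1}{6}}} = \frac{1}{12} \times 6\sqrt{2} = \frac{1}{\sqrt{2}}.
\]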

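Example 39.2 Let the joint pdf of (X, Y) be
\[
f(x, y) = \begin{cases} 10 & ; \ 0 < x < 1,\ x < y < x + \frac{1}{10} \\ 0 & ; \ \text{otherwise.} \end{cases}
\]

[Figure: the narrow strip between the lines $y = x$ and $y = x + \frac{1}{10}$ over $0 < x < 1$, with the points $(0, 0)$ and $(1, 0)$ marked.]

The marginal pdf of X is
\[
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \begin{cases} 0 & ; \ x \le 0 \\ \displaystyle\int_x^{x + \frac{1}{10}} 10\,dy = 1 & ; \ 0 < x < 1 \\ 0 & ; \ x \ge 1. \end{cases}
\]
Then $X \sim U(0, 1)$. Hence $EX = \frac{1}{2}$ and $\mathrm{var}(X) = \frac{1}{12}$. The marginal pdf of Y is
\[
f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx = \begin{cases}
\displaystyle\int_0^y 10\,dx = 10y & ; \ 0 < y < \frac{1}{10} \\
\displaystyle\int_{y - \frac{1}{10}}^{y} 10\,dx = 1 & ; \ \frac{1}{10} \le y < 1 \\
\displaystyle\int_{y - \frac{1}{10}}^{1} 10\,dx = 10\left( 1 - y + \frac{1}{10} \right) = 11 - 10y & ; \ 1 \le y < 1 + \frac{1}{10} \\
0 & ; \ \text{otherwise.}
\end{cases}
\]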

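Hence
\begin{align*}
EY &= \int_{-\infty}^{\infty} y f_Y(y)\,dy = \int_0^{1/10} 10y^2\,dy + \int_{1/10}^1 y\,dy + \int_1^{1 + 1/10} (11y - 10y^2)\,dy \\
&= \left[ \frac{10y^3}{3} \right]_0^{1/10} + \left[ \frac{y^2}{2} \right]_{1/10}^1 + \left[ \frac{11y^2}{2} - \frac{10y^3}{3} \right]_1^{1 + 1/10} = \frac{11}{20}
\end{align*}
and $\mathrm{var}(Y) = \frac{29}{75} - \frac{121}{400} = \frac{101}{1200}$, because
\begin{align*}
E[Y^2] &= \int_{-\infty}^{\infty} y^2 f_Y(y)\,dy = \int_0^{1/10} 10y^3\,dy + \int_{1/10}^1 y^2\,dy + \int_1^{1 + 1/10} (11y^2 - 10y^3)\,dy \\
&= \frac{1}{4000} + \frac{333}{1000} + \frac{641}{12000} = \frac{29}{75}.
\end{align*}
We also have
\begin{align*}
E[XY] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy f(x, y)\,dx\,dy = 10 \int_0^1 x \left[ \int_x^{x + \frac{1}{10}} y\,dy \right] dx = 10 \int_0^1 x \left[ \frac{y^2}{2} \right]_x^{x + \frac{1}{10}} dx \\
&= 10 \int_0^1 x\, \frac{(x + \frac{1}{10})^2 - x^2}{2}\,dx \\
&= 10 \int_0^1 x \times \frac{(2x + \frac{1}{10})}{2} \times \frac{1}{10}\,dx = \int_0^1 \left( x^2 + \frac{x}{20} \right) dx = \left[ \frac{x^3}{3} + \frac{x^2}{40} \right]_0^1 = \frac{43}{120}.
\end{align*}
Hence
\[
\mathrm{Cov}(X, Y) = E[XY] - EX\,EY = \frac{43}{120} - \frac{11}{40} = \frac{1}{12}.
\]
The correlation is
\[
\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} = \frac{\frac{1}{12}}{\sqrt{\frac{1}{12} \times \frac{101}{1200}}} = \frac{1}{12} \times \frac{120}{\sqrt{101}} = \frac{10}{\sqrt{101}} = \sqrt{\frac{100}{101}}.
\]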

Superiority of Correlation over Covariance

Observe that in both Examples 39.1 and 39.2 the value of Cov(X, Y) is the same, but $\rho(X, Y)$ in Example 39.2 is much larger than $\rho(X, Y)$ in Example 39.1. These two examples together tell us that the magnitude of the covariance does not give any information about the strength of the linear relationship between the random variables X and Y, but the correlation does give that information.
In Example 39.1, look at the region in which the joint density is non-zero. If we draw a line $x = c$ passing through this region, then the random variable Y can take values in an interval of length 1 along that line.
But in Example 39.2, if we draw a line $x = c$ passing through the region, then the random variable Y takes values only in an interval of length $\frac{1}{10}$, which is very restrictive. This suggests that there is a very strong relationship in this case, even though in both cases there is a linearly increasing relationship between X and Y.
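The contrast between the two examples can also be seen in simulation. The sketch below is illustrative only: it uses the observation that, in both examples, X is uniform(0,1) and, given $X = x$, Y is uniform on $(x, x + w)$ with strip width $w = 1$ or $w = \frac{1}{10}$. The function name `sample_correlation`, the sample size, and the seed are assumptions.

```python
import math
import random


def sample_correlation(width: float, num_samples: int, seed: int = 0) -> float:
    """Sample correlation of (X, Y) uniform on the strip 0 < x < 1, x < y < x + width."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(num_samples):
        x = rng.random()
        xs.append(x)
        ys.append(x + width * rng.random())  # given X = x, Y is uniform on (x, x + width)
    mean_x = sum(xs) / num_samples
    mean_y = sum(ys) / num_samples
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / num_samples
    var_x = sum((x - mean_x) ** 2 for x in xs) / num_samples
    var_y = sum((y - mean_y) ** 2 for y in ys) / num_samples
    return cov / math.sqrt(var_x * var_y)


wide = sample_correlation(1.0, 100_000)    # Example 39.1, exact value 1/sqrt(2)
narrow = sample_correlation(0.1, 100_000)  # Example 39.2, exact value 10/sqrt(101)
print(wide, 1 / math.sqrt(2))
print(narrow, 10 / math.sqrt(101))
```

Both strips give a sample covariance near 1/12, while the sample correlation is near 0.707 for the wide strip and near 0.995 for the narrow one.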
The nature of the linear relationship measured by the covariance and correlation is somewhat explained in the following proposition.

Proposition 39.3 The correlation coefficient between two random variables X and Y satisfies the following properties.

1. $|\rho(X, Y)| \le 1$.

2. $|\rho(X, Y)| = 1$ if and only if there exist real numbers a, b with $a \ne 0$ such that $Y = aX + b$. If $\rho(X, Y) = 1$ then $a > 0$, and if $\rho(X, Y) = -1$ then $a < 0$.

Remark 39.4 Intuitively, if there is a line $y = ax + b$, with $a \ne 0$, such that the values of (X, Y) have high probability of being near this line, then the correlation between X and Y will be near 1 or −1. But if no such line exists, the correlation will be near zero.

Example 39.5 A standard normal random variable X satisfies $EX = 0$, $EX^2 = 1$, $EX^3 = 0$, $EX^4 = 3$. Let $Y = a + bX + cX^2$. Find the correlation coefficient $\rho(X, Y)$.

Solution:
\begin{align*}
\mathrm{cov}(X, Y) &= E[XY] - E[X]E[Y] = E\left[ aX + bX^2 + cX^3 \right] - 0 \times E[Y] \\
&= a\,EX + b\,EX^2 + c\,EX^3 = b, \\
\mathrm{var}(X) &= EX^2 - (EX)^2 = 1, \\
\mathrm{var}(Y) &= EY^2 - (EY)^2 = E\left( a^2 + b^2 X^2 + 2abX + c^2 X^4 + 2c(a + bX)X^2 \right) - (a + c)^2 \\
&= a^2 + b^2 + 3c^2 + 2ac - a^2 - c^2 - 2ac = b^2 + 2c^2.
\end{align*}
Therefore, provided b and c are not both zero,
\[
\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} = \frac{b}{\sqrt{b^2 + 2c^2}}.
\]

When Correlation fails

Covariance and correlation measure only a particular kind of relationship, namely a linear one. It may happen that X and Y have a strong relationship but their covariance and correlation are small or even zero, because the relationship is not linear. In fact, in Example 39.5 we see that for $c \ne 0$
\[
|\rho(X, Y)| \le \frac{|b|}{\sqrt{2}\,|c|}.
\]
Therefore if b is small and c is large then the correlation is small. If $b = 0$, then $\mathrm{cov}(X, Y) = 0$ and $\rho(X, Y) = 0$, even though $Y = a + cX^2$ is completely determined by X.

39.1 Complex-valued Random Variables

A complex-valued random variable $Z : \Omega \to \mathbb{C}$ can be written in the form $Z = X + iY$, where X and Y are real-valued random variables. Its expectation $EZ$ is defined as $EZ = E(X + iY) = EX + i\,EY$ whenever $EX$ and $EY$ are well defined and finite. The formula $E(a_1 Z_1 + a_2 Z_2) = a_1 EZ_1 + a_2 EZ_2$ is valid whenever $a_1$ and $a_2$ are complex constants and $Z_1$ and $Z_2$ are complex-valued random variables having finite expectation.

39.2 Characteristic Function

We introduce the notion of the characteristic function of a random variable and study its properties. The characteristic function serves as an important tool for analyzing random phenomena.

Definition 39.6 The characteristic function of a random variable X is defined by
\[
\phi_X(t) = E\left[ e^{itX} \right], \quad t \in \mathbb{R}.
\]
So $\phi_X : \mathbb{R} \to \mathbb{C}$.

The advantage of the characteristic function is that it is defined for all real-valued random variables. Indeed, for any real-valued random variable X and any real number t, the random variables $\cos tX$ and $\sin tX$ are bounded by 1. Therefore both have finite expectation bounded by 1 in absolute value, hence $\phi_X(t)$ is defined for all t and for all X.

Characteristic Function of a discrete random variable: If X is a discrete random variable then
\begin{align*}
\phi_X(t) = E\left[ e^{itX} \right] &= E[\cos(tX)] + i E[\sin(tX)] \\
&= \sum_{x \in R(X)} \cos(tx) P(X = x) + i \sum_{x \in R(X)} \sin(tx) P(X = x) \\
&= \sum_{x \in R(X)} [\cos(tx) + i \sin(tx)]\, P(X = x) = \sum_{x \in R(X)} e^{itx} P(X = x).
\end{align*}

Characteristic Function of a random variable with density: If the random variable X has density $f_X$ then
\begin{align*}
\phi_X(t) = E\left[ e^{itX} \right] &= E[\cos(tX)] + i E[\sin(tX)] \\
&= \int_{-\infty}^{\infty} \cos(tx) f_X(x)\,dx + i \int_{-\infty}^{\infty} \sin(tx) f_X(x)\,dx \\
&= \int_{-\infty}^{\infty} [\cos(tx) + i \sin(tx)]\, f_X(x)\,dx = \int_{-\infty}^{\infty} e^{itx} f_X(x)\,dx.
\end{align*}

Example 39.7 Let $X \sim$ Bernoulli(p). Find its characteristic function.

Solution:
\begin{align*}
\phi_X(t) := E\left[ e^{itX} \right] &= e^{it} P(X = 1) + e^{0} P(X = 0) \\
&= e^{it} p + (1 - p).
\end{align*}
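The discrete formula translates directly into code. The following sketch evaluates $\phi_X(t) = \sum_x e^{itx} P(X = x)$ for a finite pmf and compares it with the Bernoulli closed form from Example 39.7; the function name `characteristic_function` and the values of `p` and `t` are illustrative assumptions.

```python
import cmath


def characteristic_function(pmf: dict, t: float) -> complex:
    """phi_X(t) = sum over x of exp(i t x) P(X = x), for a finite pmf {x: P(X = x)}."""
    return sum(cmath.exp(1j * t * x) * prob for x, prob in pmf.items())


p = 0.3
bernoulli = {0: 1 - p, 1: p}
t = 1.7

numeric = characteristic_function(bernoulli, t)
closed_form = cmath.exp(1j * t) * p + (1 - p)  # formula from Example 39.7
print(numeric, closed_form)
print(abs(numeric))  # a characteristic function never exceeds 1 in modulus
```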
MATH-221: Probability and Statistics
Solved Problems (Joint Distributions, Conditional Distributions, Conditional
Expectation, Covariance, Correlation, Characteristic Function)

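1. Let X and Y be random variables with the joint pmf $P\{X = i, Y = j\} = \frac{1}{N^2}$, $i, j = 1, 2, \dots, N$. Find (a) $P\{X \ge Y\}$ (b) $P\{X = Y\}$.

Solution: (a)
\begin{align*}
P\{X \ge Y\} &= \sum_{(x, y) : x \ge y} f(x, y) = \sum_{y=1}^{N} \sum_{x=y}^{N} f(x, y) = \sum_{y=1}^{N} \sum_{x=y}^{N} \frac{1}{N^2} \\
&= \frac{1}{N^2} \sum_{y=1}^{N} \sum_{x=y}^{N} 1 = \frac{1}{N^2} \left[ \sum_{x=1}^{N} 1 + \sum_{x=2}^{N} 1 + \sum_{x=3}^{N} 1 + \cdots + \sum_{x=N}^{N} 1 \right] \\
&= \frac{1}{N^2} \left[ N + (N - 1) + (N - 2) + \cdots + 1 \right] \\
&= \frac{1}{N^2} \times \frac{N(N + 1)}{2} = \frac{1}{2} + \frac{1}{2N}.
\end{align*}

(b)
\[
P\{X = Y\} = \sum_{(x, y) : x = y} f(x, y) = \sum_{y=1}^{N} f(y, y) = \sum_{y=1}^{N} \frac{1}{N^2} = \frac{1}{N}.
\]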

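2. Let X and Y be random variables with joint pdf given by
\[
f(x, y) = \begin{cases} \lambda^2 e^{-\lambda y}, & \text{if } 0 \le x \le y \\ 0, & \text{otherwise,} \end{cases} \qquad \text{where } \lambda > 0.
\]
Find the marginal pdfs of X and Y. Also find the joint distribution function of X, Y.

Solution: The marginal pdf of X is
\[
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \begin{cases} \displaystyle\int_x^{\infty} \lambda^2 e^{-\lambda y}\,dy & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases} = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0. \end{cases}
\]
The marginal pdf of Y is
\[
f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx = \begin{cases} \displaystyle\int_0^{y} \lambda^2 e^{-\lambda y}\,dx & \text{if } y \ge 0 \\ 0 & \text{if } y < 0 \end{cases} = \begin{cases} \lambda^2 y e^{-\lambda y} & \text{if } y \ge 0 \\ 0 & \text{if } y < 0. \end{cases}
\]
The joint distribution function of X and Y is, for $0 \le x \le y$,
\begin{align*}
F(x, y) &= \int_{-\infty}^{x} \int_{-\infty}^{y} f(t, s)\,ds\,dt = \int_0^x \left( \int_t^y \lambda^2 e^{-\lambda s}\,ds \right) dt \\
&= \int_0^x \lambda \left[ e^{-\lambda t} - e^{-\lambda y} \right] dt = \lambda \left[ \frac{e^{-\lambda t}}{-\lambda} - t e^{-\lambda y} \right]_0^x \\
&= 1 - e^{-\lambda x} - \lambda x e^{-\lambda y}.
\end{align*}
If $0 \le y < x$ then
\[
F(x, y) = F(y, y) = 1 - e^{-\lambda y}(1 + \lambda y).
\]
In the second, third and fourth quadrants (that is, whenever $x < 0$ or $y < 0$), $F(x, y) = 0$.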
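3. Let X, Y be continuous random variables with joint density function
\[
f_{X,Y}(x, y) = \begin{cases} e^{-y}(1 - e^{-x}) & \text{if } 0 < x < y < \infty \\ e^{-x}(1 - e^{-y}) & \text{if } 0 < y \le x < \infty. \end{cases}
\]
Show that X and Y are identically distributed. Also compute $E[X]$ and $E[Y]$.

Solution:
\[
f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \begin{cases} \displaystyle\int_0^{\infty} f_{X,Y}(x, y)\,dy & \text{if } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}
\]
Now for $x > 0$,
\begin{align*}
\int_0^{\infty} f_{X,Y}(x, y)\,dy &= \int_x^{\infty} e^{-y}(1 - e^{-x})\,dy + \int_0^x e^{-x}(1 - e^{-y})\,dy \\
&= (1 - e^{-x}) \left[ -e^{-y} \right]_x^{\infty} + e^{-x} \left[ y + e^{-y} \right]_0^x \\
&= e^{-x}(1 - e^{-x}) + e^{-x}\left[ x + e^{-x} - 1 \right] \\
&= x e^{-x}.
\end{align*}
Therefore,
\[
f_X(x) = \begin{cases} x e^{-x} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}
\]
Similarly,
\[
f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx = \begin{cases} \displaystyle\int_0^{\infty} f_{X,Y}(x, y)\,dx & \text{if } 0 < y < \infty \\ 0 & \text{otherwise.} \end{cases}
\]
Now for $y > 0$,
\begin{align*}
\int_0^{\infty} f_{X,Y}(x, y)\,dx &= \int_0^y e^{-y}(1 - e^{-x})\,dx + \int_y^{\infty} e^{-x}(1 - e^{-y})\,dx \\
&= e^{-y} \left[ x + e^{-x} \right]_0^y + (1 - e^{-y}) \left[ -e^{-x} \right]_y^{\infty} \\
&= e^{-y}\left[ y + e^{-y} - 1 \right] + e^{-y}(1 - e^{-y}) \\
&= y e^{-y}.
\end{align*}
Therefore,
\[
f_Y(y) = \begin{cases} y e^{-y} & \text{if } 0 < y < \infty \\ 0 & \text{otherwise.} \end{cases}
\]
Hence X and Y are identically distributed. Further, $E[X] = E[Y] = \int_0^{\infty} x^2 e^{-x}\,dx = 2$.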


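4. Let X and Y be continuous random variables with joint pdf f. Find the pdf of (a) $XY$ (b) $X - Y$ (c) $|Y - X|$.

Solution: (a) We first determine the distribution function of XY. For $z \in \mathbb{R}$, set
\[
A_z = \{(x, y) \in \mathbb{R}^2 \mid xy \le z\}.
\]
Now
\begin{align*}
F_{XY}(z) &= P\{XY \le z\} = P\{(X, Y) \in A_z\} = \iint_{A_z} f(x, y)\,dx\,dy \\
&= \int_{-\infty}^{0} \left( \int_{z/x}^{\infty} f(x, y)\,dy \right) dx + \int_0^{\infty} \left( \int_{-\infty}^{z/x} f(x, y)\,dy \right) dx.
\end{align*}
For the first term, replace x by −x and then put $y = -u/x$:
\begin{align*}
\int_{-\infty}^{0} \left( \int_{z/x}^{\infty} f(x, y)\,dy \right) dx &= \int_0^{\infty} \left( \int_{-z/x}^{\infty} f(-x, y)\,dy \right) dx \\
&= \int_0^{\infty} \left( \int_{-\infty}^{z} f\left( -x, -\frac{u}{x} \right) \frac{du}{x} \right) dx \\
&= \int_{-\infty}^{z} \left( \int_0^{\infty} \frac{1}{x} f\left( -x, -\frac{u}{x} \right) dx \right) du.
\end{align*}
For the second term, put $y = u/x$:
\begin{align*}
\int_0^{\infty} \left( \int_{-\infty}^{z/x} f(x, y)\,dy \right) dx &= \int_0^{\infty} \left( \int_{-\infty}^{z} f\left( x, \frac{u}{x} \right) \frac{du}{x} \right) dx \\
&= \int_{-\infty}^{z} \left( \int_0^{\infty} \frac{1}{x} f\left( x, \frac{u}{x} \right) dx \right) du.
\end{align*}
Adding the two terms,
\begin{align*}
F_{XY}(z) &= \int_{-\infty}^{z} \left( \int_0^{\infty} \frac{1}{x} \left[ f\left( -x, -\frac{u}{x} \right) + f\left( x, \frac{u}{x} \right) \right] dx \right) du \\
&= \int_{-\infty}^{z} \left( \int_{-\infty}^{\infty} \frac{1}{|x|} f\left( x, \frac{u}{x} \right) dx \right) du.
\end{align*}
Therefore the pdf of XY is
\[
f_{XY}(z) = \int_{-\infty}^{\infty} \frac{1}{|x|} f\left( x, \frac{z}{x} \right) dx.
\]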
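(b) We first determine the distribution function of $Z := X - Y$. For $z \in \mathbb{R}$, set
\[
A_z = \{(x, y) \in \mathbb{R}^2 \mid x - y \le z\}.
\]
Now
\begin{align*}
F_Z(z) &= P\{X - Y \le z\} = P\{(X, Y) \in A_z\} = \iint_{A_z} f(x, y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \left( \int_{x - z}^{\infty} f(x, y)\,dy \right) dx \\
&= \int_{-\infty}^{\infty} \left( \int_{z}^{-\infty} -f(x, x - s)\,ds \right) dx \qquad (\text{put } y = x - s) \\
&= \int_{-\infty}^{\infty} \left( \int_{-\infty}^{z} f(x, x - s)\,ds \right) dx \\
&= \int_{-\infty}^{z} \left( \int_{-\infty}^{\infty} f(x, x - s)\,dx \right) ds.
\end{align*}
Therefore the pdf of $X - Y$ is
\[
f_{X - Y}(z) = \int_{-\infty}^{\infty} f(x, x - z)\,dx.
\]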

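(c) We first determine the distribution function of $Z := |Y - X|$. For $z \in \mathbb{R}$, set
\[
A_z = \{(x, y) \in \mathbb{R}^2 \mid |y - x| \le z\}.
\]
If $z < 0$ then $\{(X, Y) \in A_z\} = \emptyset$. Therefore $F_Z(z) = 0$ if $z < 0$. Now for $z \ge 0$,
\begin{align*}
F_Z(z) &= P\{|Y - X| \le z\} = P\{(X, Y) \in A_z\} \\
&= \iint_{A_z} f(x, y)\,dx\,dy = \iint_{\{y - x \ge 0\} \cap A_z} f(x, y)\,dx\,dy + \iint_{\{y - x < 0\} \cap A_z} f(x, y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \left( \int_x^{x + z} f(x, y)\,dy \right) dx + \int_{-\infty}^{\infty} \left( \int_{x - z}^{x} f(x, y)\,dy \right) dx \\
&= \int_{-\infty}^{\infty} \left( \int_0^z f(x, x + s)\,ds \right) dx \qquad (\text{put } y = x + s) \\
&\quad + \int_{-\infty}^{\infty} \left( \int_z^0 -f(x, x - s)\,ds \right) dx \qquad (\text{put } y = x - s) \\
&= \int_{-\infty}^{\infty} \left( \int_0^z [f(x, x + s) + f(x, x - s)]\,ds \right) dx \\
&= \int_0^z \left( \int_{-\infty}^{\infty} [f(x, x + s) + f(x, x - s)]\,dx \right) ds.
\end{align*}
Therefore
\[
f_{|Y - X|}(z) = \begin{cases} \displaystyle\int_{-\infty}^{\infty} [f(x, x + z) + f(x, x - z)]\,dx & \text{if } z \ge 0 \\ 0 & ; \ \text{otherwise.} \end{cases}
\]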

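5. Suppose X and Y are random variables that assume the four values 1, 2, 3, 4. Their joint probabilities are given by the following table.
\[
\begin{array}{c|cccc}
X \backslash Y & 1 & 2 & 3 & 4 \\ \hline
1 & \frac{1}{20} & \frac{1}{20} & \frac{1}{20} & 0 \\
2 & \frac{1}{20} & \frac{2}{20} & \frac{3}{20} & \frac{1}{20} \\
3 & \frac{1}{20} & \frac{2}{20} & \frac{3}{20} & \frac{1}{20} \\
4 & 0 & \frac{1}{20} & \frac{1}{20} & \frac{1}{20}
\end{array}
\]
Find the pmf of $X + Y$.

Solution: First of all, the values taken by the random variable $Z := X + Y$ are $\{2, 3, 4, 5, 6, 7, 8\}$. Hence the pmf of Z is
\begin{align*}
f_Z(2) &= P(X + Y = 2) = P(X = 1, Y = 1) = \frac{1}{20}, \\
f_Z(3) &= P(X + Y = 3) = P(X = 1, Y = 2) + P(X = 2, Y = 1) = \frac{2}{20}, \\
f_Z(4) &= f(2, 2) + f(1, 3) + f(3, 1) = \frac{4}{20}.
\end{align*}
The remaining values are obtained in the same way, by summing the table entries along each anti-diagonal:
\begin{align*}
f_Z(5) &= f(1, 4) + f(2, 3) + f(3, 2) + f(4, 1) = \frac{5}{20}, \\
f_Z(6) &= f(2, 4) + f(3, 3) + f(4, 2) = \frac{5}{20}, \\
f_Z(7) &= f(3, 4) + f(4, 3) = \frac{2}{20}, \\
f_Z(8) &= f(4, 4) = \frac{1}{20}.
\end{align*}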
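The pmf of a sum of two discrete random variables can be read off a joint table by summing along anti-diagonals. The sketch below does this for Problem 5; the table entries are a reading of the printed table (in units of 1/20), and the function name `pmf_of_sum` is an illustrative choice.

```python
from collections import defaultdict
from fractions import Fraction

# Joint probabilities in units of 1/20; rows are X = 1..4, columns are Y = 1..4.
# This layout is a reading of the table in Problem 5 and should be checked against it.
table = [
    [1, 1, 1, 0],
    [1, 2, 3, 1],
    [1, 2, 3, 1],
    [0, 1, 1, 1],
]


def pmf_of_sum(table):
    """Return {z: P(X + Y = z)} by accumulating table entries with the same i + j."""
    pmf = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for i, row in enumerate(table, start=1):
        for j, weight in enumerate(row, start=1):
            pmf[i + j] += Fraction(weight, 20)
    return dict(pmf)


pmf_z = pmf_of_sum(table)
for z in sorted(pmf_z):
    print(z, pmf_z[z])
```

The first three values reproduce $f_Z(2) = \frac{1}{20}$, $f_Z(3) = \frac{2}{20}$ and $f_Z(4) = \frac{4}{20}$ from the solution, and the probabilities sum to 1.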

6.

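7. Let $X \sim$ uniform[0, 1] and $Y \sim$ Bernoulli(0.5), and assume that X, Y are independent. Determine the joint cumulative distribution function of X and Y.

Solution: By independence, the joint CDF is given by
\[
F(x, y) = F_X(x) F_Y(y),
\]
where
\[
F_X(x) = \begin{cases} 0 & \text{if } x \le 0 \\ x & \text{if } 0 \le x \le 1 \\ 1 & \text{if } x \ge 1 \end{cases}, \qquad
F_Y(y) = \begin{cases} 0 & \text{if } y < 0 \\ \frac{1}{2} & \text{if } 0 \le y < 1 \\ 1 & \text{if } y \ge 1. \end{cases}
\]
Hence
\[
F(x, y) = \begin{cases}
0 & \text{if } x \le 0 \text{ or } y < 0 \\
\frac{x}{2} & \text{if } 0 < x \le 1 \text{ and } 0 \le y < 1 \\
\frac{1}{2} & \text{if } x \ge 1 \text{ and } 0 \le y < 1 \\
x & \text{if } 0 \le x \le 1 \text{ and } y \ge 1 \\
1 & \text{if } x \ge 1 \text{ and } y \ge 1.
\end{cases}
\]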

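8. Let X and Y be independent Poisson random variables with parameters $\lambda_1$, $\lambda_2$ respectively. Find $P(Y = m \mid X + Y = n)$ for $m = 0, 1, 2, \dots, n$.

Solution: Set $Z := X + Y$; then Z takes values $0, 1, 2, \dots$. For $m = 0, 1, 2, \dots, n$,
\begin{align*}
P(Y = m \mid Z = n) &= \frac{P(Y = m, X + Y = n)}{P(X + Y = n)} = \frac{P(Y = m, X = n - m)}{\sum_{k=0}^{n} P(X = k, Y = n - k)} \\
&= \frac{P(Y = m) P(X = n - m)}{\sum_{k=0}^{n} P(X = k) P(Y = n - k)} \\
&= \frac{e^{-\lambda_2} \frac{\lambda_2^{m}}{m!} \times e^{-\lambda_1} \frac{\lambda_1^{n-m}}{(n - m)!}}{\sum_{k=0}^{n} e^{-\lambda_1} \frac{\lambda_1^{k}}{k!} \times e^{-\lambda_2} \frac{\lambda_2^{n-k}}{(n - k)!}}
= \frac{\frac{\lambda_2^{m}}{m!} \frac{\lambda_1^{n-m}}{(n - m)!}}{\sum_{k=0}^{n} \frac{\lambda_1^{k}}{k!} \times \frac{\lambda_2^{n-k}}{(n - k)!}} \\
&= \frac{n!\, \frac{\lambda_2^{m}}{m!} \frac{\lambda_1^{n-m}}{(n - m)!}}{\sum_{k=0}^{n} n!\, \frac{\lambda_1^{k}}{k!} \times \frac{\lambda_2^{n-k}}{(n - k)!}}
= \frac{n!\, \frac{\lambda_2^{m}}{m!} \frac{\lambda_1^{n-m}}{(n - m)!}}{(\lambda_1 + \lambda_2)^n}
= \frac{\binom{n}{m} \lambda_1^{n-m} \lambda_2^{m}}{(\lambda_1 + \lambda_2)^n},
\end{align*}
where the denominator was evaluated with the binomial theorem. In other words, given $X + Y = n$, Y is binomial with parameters n and $\frac{\lambda_2}{\lambda_1 + \lambda_2}$.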
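9. Let X and Y be independent and identically distributed random variables such that $P(X = 0) = P(X = 1) = \frac{1}{2}$. Show that X and $|X - Y|$ are independent.

Solution: Set $W = |X - Y|$. Then W takes values in $\{0, 1\}$ with pmf $P(W = 0) = \frac{1}{2}$, $P(W = 1) = \frac{1}{2}$. Note that X and W are both Bernoulli(0.5). Now
\begin{align*}
P(X = 0, W = 0) &= P(X = 0, |X - Y| = 0) = P(X = 0, Y = 0) = \tfrac{1}{4} = P(X = 0) P(W = 0), \\
P(X = 0, W = 1) &= P(X = 0, |X - Y| = 1) = P(X = 0, Y = 1) = \tfrac{1}{4} = P(X = 0) P(W = 1),
\end{align*}
and similarly $P(X = 1, W = 0) = P(X = 1, Y = 1) = \frac{1}{4}$ and $P(X = 1, W = 1) = P(X = 1, Y = 0) = \frac{1}{4}$.
Hence X and W are independent.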
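10. Let X and Y be two random variables. Suppose that $\mathrm{var}(X) = 4$ and $\mathrm{var}(Y) = 9$. If we know that the two random variables $Z = 2X - Y$ and $W = X + Y$ are independent, find $\rho(X, Y)$.

Solution: Since independent random variables are uncorrelated, we have
\begin{align*}
0 = \mathrm{cov}(Z, W) &= \mathrm{cov}(2X - Y, X + Y) = \mathrm{cov}(2X - Y, X) + \mathrm{cov}(2X - Y, Y) \\
&= \mathrm{cov}(2X, X) + \mathrm{cov}(-Y, X) + \mathrm{cov}(2X, Y) + \mathrm{cov}(-Y, Y) \\
&= 2\,\mathrm{cov}(X, X) - \mathrm{cov}(Y, X) + 2\,\mathrm{cov}(X, Y) - \mathrm{cov}(Y, Y) = 2\,\mathrm{var}(X) + \mathrm{cov}(X, Y) - \mathrm{var}(Y) \\
&= 8 + \mathrm{cov}(X, Y) - 9 \implies \mathrm{cov}(X, Y) = 1.
\end{align*}
Therefore
\[
\rho(X, Y) = \frac{1}{\sqrt{4 \times 9}} = \frac{1}{6}.
\]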
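11. Suppose X and Y are two independent random variables such that $EX^4 = 2$, $EY^2 = 1$, $EX^2 = 1$, and $EY = 0$. Compute $\mathrm{var}(X^2 Y)$.

Solution:
\begin{align*}
\mathrm{var}(X^2 Y) &= E\left[ (X^2 Y)^2 \right] - \left( E\left[ X^2 Y \right] \right)^2 \\
&= E\left[ X^4 Y^2 \right] - \left( E\left[ X^2 Y \right] \right)^2 \\
&= E[X^4] E[Y^2] - \left( E\left[ X^2 \right] E[Y] \right)^2 \\
&= 2 \times 1 - (1 \times 0)^2 = 2.
\end{align*}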
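12. The speed of a typical vehicle that drives past a police radar is modeled as an exponentially distributed random variable X with mean 50 miles per hour. The police radar's measurement Y of the vehicle's speed has an error which is modeled as a normal random variable with zero mean and standard deviation equal to one tenth of the vehicle's speed. What is the joint PDF of X and Y?

Solution: Recall that if $X \sim \exp(\lambda)$ then $E[X] = \frac{1}{\lambda}$. So we have $X \sim \exp\left( \frac{1}{50} \right)$. Therefore the pdf of X is
\[
f_X(x) = \begin{cases} \frac{1}{50} e^{-\frac{x}{50}} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0. \end{cases}
\]
The measurement is the true speed plus the error. Hence, conditioned on $X = x$, the measurement Y has a normal density with mean x (the error has mean 0) and variance $\left( \frac{x}{10} \right)^2$. Therefore,
\[
f_{Y|X}(y \mid x) = \frac{1}{\frac{x}{10} \sqrt{2\pi}}\, e^{-\frac{(y - x)^2}{2 \left( \frac{x}{10} \right)^2}} = \frac{10}{x \sqrt{2\pi}}\, e^{-\frac{50 (y - x)^2}{x^2}} \quad \text{for all } x > 0,\ y \in \mathbb{R}.
\]
Hence the joint pdf of X and Y is
\[
f(x, y) = f_{Y|X}(y \mid x) f_X(x) = \begin{cases} \dfrac{10}{x \sqrt{2\pi}} \times \dfrac{1}{50}\, e^{-\frac{x}{50}}\, e^{-\frac{50 (y - x)^2}{x^2}} & \text{if } x > 0 \\ 0 & \text{if } x \le 0. \end{cases}
\]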

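13. Let X and Y have the joint density f given by $f(x, y) = \dfrac{\sqrt{3}}{4\pi}\, e^{-\frac{x^2 - xy + y^2}{2}}$, $x, y \in \mathbb{R}$. Find the conditional density of Y given X.

Solution: First compute $f_X(x)$.
\begin{align*}
f_X(x) &= \int_{-\infty}^{\infty} f(x, y)\,dy = \frac{\sqrt{3}}{4\pi} \int_{-\infty}^{\infty} e^{-\frac{x^2 - xy + y^2}{2}}\,dy = \frac{\sqrt{3}}{4\pi} e^{-\frac{x^2}{2}} \int_{-\infty}^{\infty} e^{-\frac{y^2 - xy}{2}}\,dy \\
&= \frac{\sqrt{3}}{4\pi} e^{-\frac{x^2}{2}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left( y^2 - 2 \cdot \frac{x}{2} y + \frac{x^2}{4} - \frac{x^2}{4} \right)}\,dy = \frac{\sqrt{3}}{4\pi} e^{-\frac{x^2}{2} + \frac{x^2}{8}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left( y - \frac{x}{2} \right)^2}\,dy \\
&= \frac{\sqrt{3}}{4\pi} e^{-\frac{3x^2}{8}} \int_{-\infty}^{\infty} e^{-\frac{u^2}{2}}\,du \qquad (\text{substituting } y - \tfrac{x}{2} = u) \\
&= \frac{\sqrt{3}}{4\pi} e^{-\frac{3x^2}{8}} \sqrt{2\pi} = \frac{1}{\frac{2}{\sqrt{3}} \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x}{2/\sqrt{3}} \right)^2},
\end{align*}
that is, $X \sim N\left( 0, \frac{4}{3} \right)$. Now
\begin{align*}
f_{Y|X}(y \mid x) &= \frac{f(x, y)}{f_X(x)} = \frac{\frac{\sqrt{3}}{4\pi} e^{-\frac{x^2 - xy + y^2}{2}}}{\frac{\sqrt{3}}{4\pi} e^{-\frac{3x^2}{8}} \sqrt{2\pi}} = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2 - xy + y^2}{2} + \frac{3x^2}{8} \right) \\
&= \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} \cdot \frac{x^2}{4} - \frac{y^2}{2} + \frac{xy}{2} \right) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left( y - \frac{x}{2} \right)^2}.
\end{align*}
In other words, the conditional density of Y given $X = x$ is the normal density with mean $\frac{x}{2}$ and variance 1.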
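14. Let X and Y be continuous random variables having a joint density f. Suppose that Y and $\phi(X) Y$ have finite expectation. Show that
\[
E[\phi(X) Y] = \int_{-\infty}^{\infty} \phi(x) E[Y \mid X = x] f_X(x)\,dx.
\]

Solution:
\begin{align*}
E[\phi(X) Y] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \phi(x)\, y\, f(x, y)\,dx\,dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \phi(x)\, y\, f_{Y|X}(y \mid x) f_X(x)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \phi(x) f_X(x) \left( \int_{-\infty}^{\infty} y\, f_{Y|X}(y \mid x)\,dy \right) dx = \int_{-\infty}^{\infty} \phi(x) f_X(x) E[Y \mid X = x]\,dx.
\end{align*}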
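15. Let X and Y be two independent Poisson distributed random variables having parameters $\lambda_1$ and $\lambda_2$ respectively. Compute $E[Y \mid X + Y = z]$ where z is a nonnegative integer.

Solution: Recall from the solution of Problem 8 that, for $m = 0, 1, 2, \dots, z$,
\begin{align*}
f_{Y|X+Y}(m \mid z) &= P(Y = m \mid X + Y = z) = \frac{P(Y = m, X + Y = z)}{P(X + Y = z)} = \frac{P(Y = m, X = z - m)}{P(X + Y = z)} \\
&= \frac{P(Y = m) P(X = z - m)}{P(X + Y = z)} = \frac{e^{-\lambda_1} \frac{\lambda_1^{z-m}}{(z - m)!} \times e^{-\lambda_2} \frac{\lambda_2^{m}}{m!}}{e^{-(\lambda_1 + \lambda_2)} \frac{(\lambda_1 + \lambda_2)^z}{z!}} \\
&= \frac{\binom{z}{m} \lambda_1^{z-m} \lambda_2^{m}}{(\lambda_1 + \lambda_2)^z}.
\end{align*}
Therefore
\begin{align*}
E[Y \mid X + Y = z] &= \sum_{m=0}^{z} m\, f_{Y|X+Y}(m \mid z) = \sum_{m=0}^{z} m\, \frac{\binom{z}{m} \lambda_1^{z-m} \lambda_2^{m}}{(\lambda_1 + \lambda_2)^z} \\
&= \frac{1}{(\lambda_1 + \lambda_2)^z} \sum_{m=0}^{z} m\, \frac{z!}{m!(z - m)!}\, \lambda_1^{z-m} \lambda_2^{m} \\
&= \frac{1}{(\lambda_1 + \lambda_2)^z} \sum_{m=1}^{z} m\, \frac{z!}{m!(z - m)!}\, \lambda_1^{z-m} \lambda_2^{m} \\
&= \frac{1}{(\lambda_1 + \lambda_2)^z} \sum_{m=1}^{z} \frac{z!}{(m - 1)!(z - m)!}\, \lambda_1^{z-m} \lambda_2^{m} \\
&= \frac{z}{(\lambda_1 + \lambda_2)^z} \sum_{m=1}^{z} \frac{(z - 1)!}{(m - 1)!(z - m)!}\, \lambda_1^{z-m} \lambda_2^{m} \\
&= \frac{z}{(\lambda_1 + \lambda_2)^z} \sum_{m=1}^{z} \binom{z - 1}{m - 1} \lambda_1^{z-m} \lambda_2^{m} \\
&= \frac{z}{(\lambda_1 + \lambda_2)^z} \sum_{k=0}^{z-1} \binom{z - 1}{k} \lambda_1^{z-k-1} \lambda_2^{k+1} \qquad (\text{put } m = k + 1) \\
&= \frac{z \lambda_2}{(\lambda_1 + \lambda_2)^z} \sum_{k=0}^{z-1} \binom{z - 1}{k} \lambda_2^{k} \lambda_1^{(z-1)-k} = \frac{z \lambda_2 (\lambda_2 + \lambda_1)^{z-1}}{(\lambda_1 + \lambda_2)^z} = \frac{z \lambda_2}{\lambda_1 + \lambda_2}.
\end{align*}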

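16. Let X and Y have a joint density f that is uniform over the interior of the triangle with vertices at (0, 0), (2, 0), and (1, 2). Find the conditional expectation of Y given X.

[Figure: the triangle with vertices $(0, 0)$, $(2, 0)$ and $(1, 2)$, bounded by the lines $y = 2x$ and $y + 2x = 4$.]

Solution: The area of the triangle with vertices at (0, 0), (2, 0), and (1, 2) is $\frac{1}{2} \times 2 \times 2 = 2$. Therefore the joint density is $\frac{1}{2}$ over the triangle and zero elsewhere. First we need to compute $E[Y \mid X = x]$, so we need the conditional density of Y given X. If $x \le 0$ or $x \ge 2$ then $f(x, y) = 0$, therefore $f_X(x) = 0$. For $x \in (0, 2)$,
\[
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \begin{cases} \displaystyle\frac{1}{2} \int_0^{2x} dy = x & ; \ 0 < x \le 1 \\ \displaystyle\frac{1}{2} \int_0^{-2x + 4} dy = 2 - x & ; \ 1 < x < 2. \end{cases}
\]
Therefore
\[
f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} = \begin{cases} \dfrac{1}{2x} & ; \ 0 < x \le 1,\ 0 < y < 2x \\ \dfrac{1}{2(2 - x)} & ; \ 1 < x < 2,\ 0 < y < 4 - 2x. \end{cases}
\]
For $x \in (0, 2)$, we have
\[
E[Y \mid X = x] = \int_{-\infty}^{\infty} y\, f_{Y|X}(y \mid x)\,dy = \begin{cases} \displaystyle\int_0^{2x} y\, \frac{1}{2x}\,dy = x & ; \ 0 < x \le 1, \\ \displaystyle\int_0^{4 - 2x} y\, \frac{1}{4 - 2x}\,dy = 2 - x & ; \ 1 < x < 2. \end{cases}
\]
For other values of x, we set $E[Y \mid X = x] = 0$.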

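17. Let X and Y be iid exponential random variables with parameter $\lambda$, and set $Z = X + Y$. Find the conditional expectation of X given Z.

Solution: We need to compute the conditional pdf of X given Z, which is by definition
\[
f_{X|Z}(x \mid z) = \begin{cases} \dfrac{f(x, z)}{f_Z(z)} & \text{if } f_Z(z) > 0 \\ 0 & \text{if } f_Z(z) = 0, \end{cases}
\]
where f is the joint density of X and Z. In order to find the joint density of X and Z we use the formula
\[
f(x, z) = f_X(x)\, f_{Z|X}(z \mid x).
\]
We first determine the conditional distribution function of Z given X, i.e., $P(Z \le z \mid X = x)$. Then we have the relation
\[
P(Z \le z \mid X = x) = \int_{-\infty}^{z} f_{Z|X}(t \mid x)\,dt.
\]
Now
\begin{align*}
P(Z \le z \mid X = x) &= P(X + Y \le z \mid X = x) \\
&= P(x + Y \le z \mid X = x) \\
&= P(x + Y \le z) \qquad (\because\ X, Y \text{ are independent}) \\
&= P(Y \le z - x) \\
&= \int_{-\infty}^{z - x} f_Y(y)\,dy \\
&= \int_{-\infty}^{z} f_Y(t - x)\,dt \qquad (\text{put } y = t - x).
\end{align*}
Hence $f_{Z|X}(z \mid x) = f_Y(z - x)$.

Also, the density of a sum of independent random variables is the convolution of the marginal densities, hence
\[
f_Z(z) = (f_X * f_Y)(z) = \begin{cases} 0 & \text{if } z < 0 \\ \lambda^2 z e^{-\lambda z} & \text{if } z \ge 0. \end{cases}
\]
Hence
\begin{align*}
f_{X|Z}(x \mid z) &= \begin{cases} \dfrac{f_X(x) f_Y(z - x)}{f_Z(z)} & \text{if } f_Z(z) > 0 \\ 0 & \text{if } f_Z(z) = 0 \end{cases} \\
&= \begin{cases} \dfrac{\lambda e^{-\lambda x}\, \lambda e^{-\lambda (z - x)}}{\lambda^2 z e^{-\lambda z}} & \text{if } z > 0,\ z - x \ge 0,\ x \ge 0 \\ 0 & \text{otherwise} \end{cases} \\
&= \begin{cases} \dfrac{1}{z} & \text{if } 0 \le x \le z \\ 0 & \text{otherwise.} \end{cases}
\end{align*}
That is, given $Z = z > 0$, X is uniform on $[0, z]$. Now
\[
E[X \mid Z = z] = \int_{-\infty}^{\infty} x\, f_{X|Z}(x \mid z)\,dx = \int_0^z x\, \frac{1}{z}\,dx = \frac{z}{2}.
\]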

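18. Suppose that X and Y are random variables with the same variance. Show that $X - Y$ and $X + Y$ are uncorrelated.

Solution:
\begin{align*}
E[(X + Y)(X - Y)] &= E\left[ X^2 - Y^2 \right] = E[X^2] - E[Y^2] \\
&= \mathrm{var}(X) + (E[X])^2 - \mathrm{var}(Y) - (EY)^2 = (EX + EY)(EX - EY) \\
&= E(X + Y)\, E(X - Y).
\end{align*}
Hence $\mathrm{cov}(X + Y, X - Y) = E[(X + Y)(X - Y)] - E(X + Y) E(X - Y) = 0$.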

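19. Find the characteristic function of the following distributions: (a) $\exp(\lambda)$ (b) geometric(p).

Solution: (a) $\exp(\lambda)$:
\[
\phi_X(t) = E\left[ e^{itX} \right] = \int_0^{\infty} e^{itx} \lambda e^{-\lambda x}\,dx = \lambda \int_0^{\infty} e^{(it - \lambda)x}\,dx = \lambda \left[ \frac{e^{(it - \lambda)x}}{it - \lambda} \right]_0^{\infty} = \frac{-\lambda}{it - \lambda} = \frac{\lambda}{\lambda - it}.
\]

(b) geometric(p):
\begin{align*}
\phi_X(t) = E\left[ e^{itX} \right] &= \sum_{k=1}^{\infty} e^{itk} P(X = k) = \sum_{k=1}^{\infty} e^{itk} (1 - p)^{k-1} p = p e^{it} \sum_{k=1}^{\infty} e^{it(k-1)} (1 - p)^{k-1} \\
&= p e^{it} \sum_{k=1}^{\infty} \left[ e^{it} (1 - p) \right]^{k-1} = \frac{p e^{it}}{1 - e^{it}(1 - p)}.
\end{align*}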

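20. Let X be a random variable such that X and −X have the same distribution. Show that $\phi_X(t) \in \mathbb{R}$ for all $t \in \mathbb{R}$.

Solution: If X and −X have the same distribution, then for all $t \in \mathbb{R}$
\begin{align*}
\phi_X(t) &= \phi_{-X}(t) \\
&= E\left[ e^{it(-X)} \right] \\
&= E\left[ e^{i(-t)X} \right] \\
&= \phi_X(-t) \\
&= \overline{\phi_X(t)}.
\end{align*}
A complex number that equals its own conjugate is real. Hence $\phi_X(t) \in \mathbb{R}$ for all $t \in \mathbb{R}$.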


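21. Find the characteristic function of the random variable with pmf given by
\[
f(x) = \frac{1}{2^x}, \quad x = 1, 2, \dots
\]

Solution:
\[
\phi_X(t) = E\left[ e^{itX} \right] = \sum_{x=1}^{\infty} e^{itx} P(X = x) = \sum_{x=1}^{\infty} e^{itx} \frac{1}{2^x} = \sum_{x=1}^{\infty} \left( \frac{e^{it}}{2} \right)^x = \frac{\frac{e^{it}}{2}}{1 - \frac{e^{it}}{2}} = \frac{e^{it}}{2 - e^{it}}.
\]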

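22. Find the characteristic function of the random variable X with pdf given by
\[
f(x) = \begin{cases} 1 - |x|, & \text{if } |x| \le 1 \\ 0, & \text{otherwise.} \end{cases}
\]

Solution: For $t \ne 0$,
\begin{align*}
\phi_X(t) = E\left[ e^{itX} \right] &= \int_{-1}^{1} e^{itx} (1 - |x|)\,dx = \int_{-1}^{1} (1 - |x|) \cos tx\,dx + i \int_{-1}^{1} (1 - |x|) \sin tx\,dx \\
&= 2 \int_0^1 (1 - |x|) \cos tx\,dx + 0 = 2 \int_0^1 (1 - x) \cos tx\,dx \\
&= 2 \left\{ \left[ \frac{\sin tx}{t} \right]_0^1 - \left[ x\, \frac{\sin tx}{t} \right]_0^1 + \int_0^1 \frac{\sin tx}{t}\,dx \right\} = 2 \left\{ \frac{\sin t}{t} - \frac{\sin t}{t} - \left[ \frac{\cos tx}{t^2} \right]_0^1 \right\} \\
&= \frac{2}{t^2} (1 - \cos t),
\end{align*}
and $\phi_X(0) = 1$.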
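The closed form $\frac{2}{t^2}(1 - \cos t)$ can be checked against a direct numerical integration of $\int_{-1}^{1} \cos(tx)(1 - |x|)\,dx$ (the imaginary part vanishes by symmetry). The midpoint rule, the function names, and the test values of $t$ below are illustrative assumptions.

```python
import math


def triangular_cf_numeric(t: float, steps: int = 20000) -> float:
    """Midpoint-rule approximation of the integral of cos(t x) (1 - |x|) over [-1, 1]."""
    h = 2.0 / steps
    total = 0.0
    for i in range(steps):
        x = -1.0 + (i + 0.5) * h
        total += math.cos(t * x) * (1.0 - abs(x))
    return total * h


def triangular_cf_closed(t: float) -> float:
    """Closed form 2 (1 - cos t) / t^2, with the value 1 at t = 0."""
    return 1.0 if t == 0 else 2.0 * (1.0 - math.cos(t)) / t ** 2


for t in (0.5, 2.0, 7.0):
    print(t, triangular_cf_numeric(t), triangular_cf_closed(t))
```

An even number of steps is used so that the kink of the density at $x = 0$ falls on a cell boundary rather than inside a cell.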

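23. Let X be a random variable such that $P(X \in \mathbb{Z}) = 1$. Show that
\[
\phi_X(t) = \phi_X(t + 2\pi), \quad \forall t \in \mathbb{R}.
\]

Solution:
\begin{align*}
\phi_X(t + 2\pi) &= E\left[ e^{i(t + 2\pi)X} \right] \\
&= E\left[ e^{itX} e^{i 2\pi X} \right] \\
&= E\left[ e^{itX} \{\cos(2\pi X) + i \sin(2\pi X)\} \right].
\end{align*}
Since $P(X \in \mathbb{Z}) = 1$, the range of X is contained in $\mathbb{Z}$. Hence $\cos 2\pi X = 1$ and $\sin 2\pi X = 0$ with probability 1, and therefore $\phi_X(t + 2\pi) = E\left[ e^{itX} \right] = \phi_X(t)$.

24. Let X and Y be independent, identically distributed random variables. Show that $\phi_{X - Y}(t) = |\phi_X(t)|^2$.

Solution:
\begin{align*}
\phi_{X - Y}(t) &= E\left[ e^{it(X - Y)} \right] = E\left[ e^{itX} e^{-itY} \right] = E\left[ e^{itX} \right] E\left[ e^{-itY} \right] = E\left[ e^{itX} \right] E\left[ e^{i(-t)Y} \right] \\
&= E\left[ e^{itX} \right] E\left[ e^{i(-t)X} \right] = \phi_X(t) \phi_X(-t) = \phi_X(t) \overline{\phi_X(t)} = |\phi_X(t)|^2.
\end{align*}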


   

25. Let X and Y be independent and identically distributed continuous random variables with density f. Find $P\{X^2 > Y\}$.

Solution: Set
\[
B = \{(x, y) \in \mathbb{R}^2 \mid x^2 > y\}.
\]
Now
\begin{align*}
P\{X^2 > Y\} &= P\{(X, Y) \in B\} = \iint_B f_X(x) f_Y(y)\,dx\,dy = \iint_B f(x) f(y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} f(x) \left( \int_{-\infty}^{x^2} f(y)\,dy \right) dx.
\end{align*}

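26. Let $X$ and $Y$ be random variables having joint density $f$. Find the density of $\frac{Y}{X}$.

Solution: Set $Z = \frac{Y}{X}$ and $A_z = \left\{ (x, y) \in \mathbb{R}^2 \,\middle|\, \frac{y}{x} \le z \right\}$. Since $X$ has a density, $P(X = 0) = 0$, so $Z$ is defined with probability 1. Note that, if $x < 0$, then $\frac{y}{x} \le z$ if and
only if $y \ge xz$. Thus
\[
A_z = \{(x, y) \mid x < 0,\ y \ge xz\} \cup \{(x, y) \mid x > 0,\ y \le xz\}.
\]
Consequently,
\[
\begin{aligned}
F_Z(z) &= \iint_{A_z} f(x, y)\,dx\,dy \\
&= \int_{-\infty}^{0} \left( \int_{xz}^{\infty} f(x, y)\,dy \right) dx + \int_{0}^{\infty} \left( \int_{-\infty}^{xz} f(x, y)\,dy \right) dx.
\end{aligned}
\]
In the inner integrals, we make the change of variable $y = xv$ (with $dy = x\,dv$). For $x < 0$, the limits $y = xz$ and $y = \infty$ correspond to $v = z$ and $v = -\infty$, so we obtain
\[
\begin{aligned}
F_Z(z) &= \int_{-\infty}^{0} \left( \int_{z}^{-\infty} x f(x, xv)\,dv \right) dx + \int_{0}^{\infty} \left( \int_{-\infty}^{z} x f(x, xv)\,dv \right) dx \\
&= \int_{-\infty}^{0} \left( \int_{-\infty}^{z} (-x) f(x, xv)\,dv \right) dx + \int_{0}^{\infty} \left( \int_{-\infty}^{z} x f(x, xv)\,dv \right) dx \\
&= \int_{-\infty}^{0} \left( \int_{-\infty}^{z} |x| f(x, xv)\,dv \right) dx + \int_{0}^{\infty} \left( \int_{-\infty}^{z} |x| f(x, xv)\,dv \right) dx \\
&= \int_{-\infty}^{\infty} \left( \int_{-\infty}^{z} |x| f(x, xv)\,dv \right) dx \\
&= \int_{-\infty}^{z} \left( \int_{-\infty}^{\infty} |x| f(x, xv)\,dx \right) dv.
\end{aligned}
\]
Hence $Z$ has the density
\[
f_Z(z) = \int_{-\infty}^{\infty} |x| f(x, xz)\,dx.
\]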
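To see the formula $f_Z(z) = \int |x| f(x, xz)\,dx$ in action, the sketch below evaluates it numerically for one assumed example that is not in the notes: $X$ and $Y$ independent $N(0, 1)$. In that case the integral can be done by hand and gives the Cauchy density $1/(\pi(1 + z^2))$. The integration window and step count are illustrative choices.

```python
import math

def joint_density(x, y):
    """Assumed example: X and Y independent standard normal."""
    return math.exp(-(x * x + y * y) / 2.0) / (2.0 * math.pi)

def ratio_density(z, half_width=12.0, steps=40000):
    """Midpoint rule for integral of |x| f(x, xz) over [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * h
        total += abs(x) * joint_density(x, x * z)
    return h * total

def cauchy_density(z):
    """Standard Cauchy density, the known law of a ratio of independent N(0,1)."""
    return 1.0 / (math.pi * (1.0 + z * z))

for z in (0.0, 1.0, -2.5):
    print(z, ratio_density(z), cauchy_density(z))
```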
Lecture 40: Characteristic Function
8 April, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

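Example 40.1 Find the characteristic function of the Poisson($\lambda$) distribution.

Solution:
\[
\begin{aligned}
\phi_X(t) = E[e^{itX}] &= \sum_{k=0}^{\infty} e^{itk} P(X = k) = \sum_{k=0}^{\infty} e^{itk} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(e^{it}\lambda)^k}{k!} = e^{-\lambda} \exp(e^{it}\lambda) \\
&= \exp\left( \lambda(e^{it} - 1) \right).
\end{aligned}
\]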


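Example 40.2 Let $X \sim N(0, 1)$. Find its characteristic function.

Solution:
\[
\begin{aligned}
\phi_X(t) &= E[\cos tX + i\sin tX] = E[\cos tX] + iE[\sin tX] \\
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \cos tx\, e^{-\frac{x^2}{2}}\,dx + i\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \sin tx\, e^{-\frac{x^2}{2}}\,dx.
\end{aligned}
\]
Since the characteristic function exists for every random variable, both improper
integrals exist, so the value of each improper integral agrees with its Cauchy principal value.
We have
\[
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \sin tx\, e^{-\frac{x^2}{2}}\,dx = \lim_{a \to \infty} \int_{-a}^{a} \frac{1}{\sqrt{2\pi}} \sin tx\, e^{-\frac{x^2}{2}}\,dx = 0,
\]
because $\sin$ is an odd function. Also
\[
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \cos tx\, e^{-\frac{x^2}{2}}\,dx = \lim_{a \to \infty} \int_{-a}^{a} \frac{1}{\sqrt{2\pi}} \cos tx\, e^{-\frac{x^2}{2}}\,dx = \frac{2}{\sqrt{2\pi}} \int_{0}^{\infty} \cos tx\, e^{-\frac{x^2}{2}}\,dx = e^{-\frac{t^2}{2}},
\]
where the last integral can be computed by differentiating under the integral sign. Let $t \in \mathbb{R}$ be
given. Define
\[
I(t) = \int_{0}^{\infty} \cos tx\, e^{-\frac{x^2}{2}}\,dx.
\]
Then, integrating by parts,
\[
\begin{aligned}
I'(t) &= -\int_{0}^{\infty} x \sin tx\, e^{-\frac{x^2}{2}}\,dx \\
&= -\left( \left[ -\sin tx\, e^{-\frac{x^2}{2}} \right]_{0}^{\infty} + t\int_{0}^{\infty} \cos tx\, e^{-\frac{x^2}{2}}\,dx \right) \\
&= 0 - tI(t).
\end{aligned}
\]
Therefore $\ln I(t) = -\frac{t^2}{2} + C \implies I(t) = Ke^{-\frac{t^2}{2}}$. Also $I(0) = \int_{0}^{\infty} e^{-\frac{x^2}{2}}\,dx = \frac{\sqrt{2\pi}}{2}$. So
$K = \sqrt{\frac{\pi}{2}}$, and hence $\frac{2}{\sqrt{2\pi}} I(t) = e^{-\frac{t^2}{2}}$, that is, $\phi_X(t) = e^{-\frac{t^2}{2}}$.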
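The value $I(t) = \sqrt{\pi/2}\, e^{-t^2/2}$ obtained from the differential equation can be compared with a direct numerical evaluation of the integral. This is a rough sketch under assumed settings (midpoint rule, integration truncated at $x = 12$, hypothetical function names), not a replacement for the argument above.

```python
import math

def integral_I(t, upper=12.0, steps=60000):
    """Midpoint rule for integral_0^upper cos(tx) exp(-x^2/2) dx."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += math.cos(t * x) * math.exp(-x * x / 2.0)
    return h * total

def closed_form_I(t):
    """sqrt(pi/2) * exp(-t^2/2), the solution of I'(t) = -t I(t)."""
    return math.sqrt(math.pi / 2.0) * math.exp(-t * t / 2.0)

for t in (0.0, 0.7, 1.5, 3.0):
    print(t, integral_I(t), closed_form_I(t))
```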

Example 40.3 Let $X$ be a random variable and let $a$ and $b$ be real constants. Then
\[
\phi_{a+bX}(t) = E[e^{it(a+bX)}] = E[e^{ita} e^{itbX}] = e^{ita} E[e^{itbX}] = e^{ita}\phi_X(bt).
\]


     

Example 40.4 Let $X \sim N(\mu, \sigma^2)$, where it is implicit that $\sigma > 0$. Then $Y = \frac{X - \mu}{\sigma}$ has
mean zero and variance 1, and $Y \sim N(0, 1)$. Hence, by Examples 40.2 and 40.3, $X = \sigma Y + \mu$ has the
characteristic function
\[
\phi_X(t) = \phi_{\sigma Y + \mu}(t) = e^{it\mu}\phi_Y(\sigma t) = e^{it\mu} e^{-\frac{\sigma^2 t^2}{2}}.
\]

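Example 40.5 Let $X$ and $Y$ be independent random variables. Show that
\[
\phi_{X+Y}(t) = \phi_X(t)\phi_Y(t).
\]

Solution:
\[
\phi_{X+Y}(t) = E[e^{it(X+Y)}] = E[e^{itX} e^{itY}] = E[e^{itX}]\,E[e^{itY}] = \phi_X(t)\phi_Y(t),
\]
where the third equality uses the independence of $X$ and $Y$.

More generally, if $X_1, X_2, \cdots, X_n$ are $n$ independent random variables, then
\[
\phi_{X_1 + X_2 + \cdots + X_n}(t) = \phi_{X_1}(t)\phi_{X_2}(t)\cdots\phi_{X_n}(t).
\]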

Example 40.6 Compute the characteristic function of a Binomial($n, p$) random variable.

Solution: A Binomial($n, p$) random variable is a sum of $n$ independent Bernoulli($p$) random
variables, each with characteristic function $e^{it}p + (1 - p)$. Therefore, by Example 40.5, its characteristic function is
\[
\left( e^{it}p + (1 - p) \right)^n.
\]

Theorem 40.7 (Uniqueness Theorem) Let $X_1$ and $X_2$ be two random variables such
that $\phi_{X_1} = \phi_{X_2}$. Then $X_1$ and $X_2$ have the same distribution.

Example 40.8 Let X ∼ B(n1 , p) and Y ∼ B(n2 , p) be two independent Binomial random
variables. Show that X + Y is a Binomial(n1 + n2 , p) random variable.

Solution: Since $X \sim B(n_1, p)$ and $Y \sim B(n_2, p)$ are independent, the characteristic function of $X + Y$ is
\[
\phi_{X+Y}(t) = \phi_X(t)\phi_Y(t) = \left( e^{it}p + (1 - p) \right)^{n_1} \left( e^{it}p + (1 - p) \right)^{n_2} = \left( e^{it}p + (1 - p) \right)^{n_1 + n_2}.
\]
The right-hand side is the characteristic function of a Binomial($n_1 + n_2, p$) random variable. Therefore, by the
uniqueness theorem, $X + Y \sim$ Binomial($n_1 + n_2, p$).
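The same conclusion can be checked directly on the pmfs, since the pmf of a sum of independent discrete random variables is the convolution of their pmfs. The sketch below uses arbitrarily chosen illustrative parameters ($n_1 = 5$, $n_2 = 7$, $p = 0.3$) and hypothetical helper names.

```python
from math import comb

def binom_pmf(n, p):
    """List of P(X = k), k = 0..n, for X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def convolve(a, b):
    """pmf of the sum of two independent variables with pmfs a and b."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

n1, n2, p = 5, 7, 0.3  # illustrative values, not from the notes
pmf_sum = convolve(binom_pmf(n1, p), binom_pmf(n2, p))
pmf_direct = binom_pmf(n1 + n2, p)
max_gap = max(abs(u - v) for u, v in zip(pmf_sum, pmf_direct))
print(max_gap)
```

The largest gap between the two pmfs should be at the level of floating-point rounding.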

Example 40.9 Let $X \sim$ Poisson($\lambda$) and $Y \sim$ Poisson($\mu$) be two independent Poisson random
variables. Show that $X + Y$ is a Poisson($\lambda + \mu$) random variable.

Solution: The characteristic function of $X + Y$ is
\[
\phi_{X+Y}(t) = \phi_X(t)\phi_Y(t) = \exp\left( \lambda(e^{it} - 1) \right) \exp\left( \mu(e^{it} - 1) \right) = \exp\left( (\lambda + \mu)(e^{it} - 1) \right).
\]
The right-hand side is the characteristic function of a Poisson($\lambda + \mu$) random variable. Therefore, by the uniqueness
theorem, $X + Y \sim$ Poisson($\lambda + \mu$).

Example 40.10 Let $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ be independent normal random
variables. Show that $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.

Solution: By Example 40.4 we have
\[
\phi_X(t) = e^{it\mu_1} e^{-\frac{\sigma_1^2 t^2}{2}}, \qquad \phi_Y(t) = e^{it\mu_2} e^{-\frac{\sigma_2^2 t^2}{2}}.
\]
Now
\[
\begin{aligned}
\phi_{X+Y}(t) &:= E[e^{it(X+Y)}] = E[e^{itX} e^{itY}] = E[e^{itX}]\,E[e^{itY}] = \phi_X(t)\phi_Y(t) \\
&= e^{it(\mu_1 + \mu_2)} e^{-\frac{(\sigma_1^2 + \sigma_2^2)t^2}{2}}.
\end{aligned}
\]
The right-hand side is the characteristic function of a normal random variable with mean
$\mu_1 + \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$. Therefore, by the uniqueness theorem, we conclude that $X + Y \sim
N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
Lecture 41: Inequalities
9 April, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Definition 41.1 Let $I \subseteq \mathbb{R}$ be an interval and $f : I \to \mathbb{R}$ be a function. We say that

1. $f$ is convex on $I$ (or concave upward on $I$) if for any $x_1, x_2 \in I$ and any $t \in (0, 1)$ we have
\[
f((1 - t)x_1 + tx_2) \le (1 - t)f(x_1) + tf(x_2).
\]

2. $f$ is concave on $I$ (or concave downward on $I$) if for any $x_1, x_2 \in I$ and any $t \in (0, 1)$ we have
\[
f((1 - t)x_1 + tx_2) \ge (1 - t)f(x_1) + tf(x_2).
\]

It follows from the definition that $f$ is convex if and only if $-f$ is concave (reflection about the x-axis).

Theorem 41.2 (Jensen's Inequality) Let $f : I \to \mathbb{R}$ be a convex function, where $I \subset \mathbb{R}$
is an interval, and let $X$ be a random variable taking values in $I$ such that $X$ and $f(X)$ have finite means. Then
\[
f(EX) \le E[f(X)].
\]

If $f$ is a concave function, then $-f$ is convex, so by Jensen's inequality
\[
\begin{aligned}
-f(EX) &\le E(-f(X)) \\
&= -E[f(X)] \quad \text{(by linearity of expectation)} \\
\implies f(EX) &\ge E[f(X)].
\end{aligned}
\]

Example 41.3 Note that f (x) = |x| is a convex function hence by Jensen’s inequality

EX ≤ |EX| ≤ E|X|.

Definition 41.4 Let $r$ be a positive real number and $X$ be a random variable. Then $E[X^r]$
is called the $r$-th moment of $X$ about the origin, or the raw moment of $X$ of order $r$.
$E|X|^r$ is called the $r$-th absolute moment of $X$ about the origin. (The central moment of order $r$, by contrast, is taken about the mean: $E[(X - EX)^r]$.)


We know from the definition of expectation that $E[X^r]$ exists and is a finite number if $E[|X|^r] <
\infty$. In that case, applying Example 41.3 to $X^r$ gives
\[
E[X^r] \le |E[X^r]| \le E[|X|^r].
\]

Example 41.5 If the moment of order $q > 0$ exists for a random variable $X$, then show
that the moments of order $p$, where $0 < p < q$, exist.

Solution: Let $f : (0, \infty) \to \mathbb{R}$ be defined as $f(x) = x^r$, where $r > 1$ is a real number. Then
$f'(x) = rx^{r-1}$ and $f''(x) = r(r - 1)x^{r-2}$. Since $r > 1$, $f''(x) > 0$ on $(0, \infty)$, i.e., $f$ is a convex
function on $(0, \infty)$. Hence, by Jensen's inequality applied to $|X|$,
\[
[E|X|]^r \le E[|X|^r] \implies E|X| \le \left( E[|X|^r] \right)^{\frac{1}{r}}. \tag{41.1}
\]
Let $0 < p < q$. Then we take $r = \frac{q}{p} > 1$ in (41.1) and we get
\[
E|X| \le \left( E\left[ |X|^{\frac{q}{p}} \right] \right)^{\frac{p}{q}}. \tag{41.2}
\]
Now replacing $|X|$ by $|X|^p$ in (41.2), we get
\[
E|X|^p \le \left( E[|X|^q] \right)^{\frac{p}{q}}.
\]
If $E|X|^q < \infty$, then $(E|X|^q)^{\frac{p}{q}} < \infty$ and therefore $E|X|^p < \infty$.

Example 41.6 Let $X$ be a positive random variable with $EX = 10$. Show that $E\left[ \ln\sqrt{X} \right] \le \frac{1}{2}\ln 10$.

Solution: Consider $f(x) = \ln\sqrt{x} = \frac{1}{2}\ln x$, for $x \in (0, \infty)$. Then $f'(x) = \frac{1}{2x}$ and $f''(x) =
-\frac{1}{2x^2} < 0$ on $(0, \infty)$. Hence $f$ is a concave function. Therefore, by Jensen's inequality,
\[
\frac{1}{2}\ln 10 = f(EX) \ge E[f(X)] = E\left[ \ln\sqrt{X} \right].
\]

Now we derive some important inequalities. These inequalities use the mean and possibly
the variance of a random variable to draw conclusions on the probabilities of certain events.
They are primarily useful in situations where exact values or bounds for the mean and
variance of a random variable X are easily computable, but the distribution of X is either
unavailable or hard to calculate.

Theorem 41.7 (Markov Inequality) Let $X$ be a non-negative random variable with finite
$n$th moment. Then we have for each $\epsilon > 0$,
\[
P\{X \ge \epsilon\} \le \frac{E[X^n]}{\epsilon^n}.
\]

Loosely speaking, the Markov inequality asserts that if a nonnegative random variable has a
small $n$th moment, then the probability that it takes a large value must also be
small.
As a corollary we have Chebyshev's inequality.

Corollary 41.8 (Chebyshev's inequality) Let $X$ be a random variable with finite mean
$\mu$ and finite variance $\sigma^2$. Then for every $\epsilon > 0$,
\[
P\{|X - \mu| \ge \epsilon\} \le \frac{\sigma^2}{\epsilon^2}.
\]

Proof: Apply the Markov inequality with $n = 2$ to the non-negative random variable $|X - \mu|$, and note that
$E|X - \mu|^2 = E[(X - \mu)^2] = \sigma^2$.

Remark 41.9 Loosely speaking, Chebyshev’s inequality asserts that if a random variable
has small variance, then the probability that it takes a value far from its mean is also small.
Note that the Chebyshev inequality does not require the random variable to be nonnegative.

Example 41.10 (Illustrating Chebyshev) If we take $\epsilon = 2\sigma$, then
\[
P\{|X - \mu| \ge 2\sigma\} \le \frac{\sigma^2}{4\sigma^2} = 0.25,
\]
so there is at least a 75% chance that a random variable will be within $2\sigma$ of its mean, no
matter what the distribution of $X$ is.
Similarly, if we take $\epsilon = 3\sigma$, then
\[
P\{|X - \mu| \ge 3\sigma\} \le \frac{\sigma^2}{9\sigma^2} = 0.1111111\cdots,
\]
so there is at least an $8/9 \approx 88.9\%$ chance that a random variable will be within $3\sigma$ of its mean,
no matter what the distribution of $X$ is. Recall that for the normal distribution there is about a 99.7%
chance that the random variable will be within $3\sigma$ of its mean. So the estimates provided
by Chebyshev's inequality can be crude. One reason for this is that it puts no restrictions on
the underlying distribution. Chebyshev's inequality is necessarily conservative, but
its applicability is wide. For specific distributions, we can often get tighter bounds.
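The gap between the distribution-free Chebyshev bound $1/k^2$ and the exact normal tail $P(|X - \mu| \ge k\sigma)$ can be tabulated with the standard library. This is a small sketch with hypothetical function names; it uses the identity $2(1 - \Phi(k)) = \operatorname{erfc}(k/\sqrt{2})$ for the normal case.

```python
import math

def chebyshev_bound(k):
    """Chebyshev upper bound on P(|X - mu| >= k*sigma), valid for any distribution."""
    return 1.0 / k**2

def normal_two_sided_tail(k):
    """Exact P(|X - mu| >= k*sigma) when X is normal: 2(1 - Phi(k))."""
    return math.erfc(k / math.sqrt(2.0))

for k in (2, 3, 4):
    print(k, chebyshev_bound(k), normal_two_sided_tail(k))
```

For every $k$ the normal tail is far below the Chebyshev bound, which illustrates how conservative the bound is for a well-behaved distribution.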

Example 41.11 Let $X \sim B(n, p)$. Estimate $P(X \ge \alpha n)$, where $p < \alpha < 1$, using the Markov
inequality (for the first moment) and Chebyshev's inequality. Compare both estimates for $p = \frac{1}{2}$ and
$\alpha = \frac{3}{4}$.

Solution: Note that $X$ takes values in $\{0, 1, \cdots, n\}$, hence is a nonnegative random variable,
and $EX = np$. Applying Markov's inequality, we obtain
\[
P(X \ge \alpha n) \le \frac{EX}{\alpha n} = \frac{pn}{\alpha n} = \frac{p}{\alpha}.
\]
Chebyshev's inequality gives an estimate for $P(|X - EX| \ge \epsilon)$, so we have to rewrite the event
$\{X \ge \alpha n\}$ so that we can use Chebyshev's inequality:
\[
\begin{aligned}
P\{X \ge \alpha n\} &= P\{X - np \ge \alpha n - np\} \\
&\le P(|X - np| \ge \alpha n - np) \quad (\because \{|Y| \ge a\} = \{Y \le -a\} \cup \{Y \ge a\}) \\
&\le \frac{\mathrm{var}(X)}{(\alpha n - np)^2} = \frac{np(1 - p)}{n^2(\alpha - p)^2} = \frac{p(1 - p)}{n(\alpha - p)^2}.
\end{aligned}
\]
By the Markov inequality for $p = \frac{1}{2}$ and $\alpha = \frac{3}{4}$, we get
\[
P\left( X \ge \frac{3n}{4} \right) \le \frac{2}{3}.
\]
By Chebyshev's inequality for $p = \frac{1}{2}$ and $\alpha = \frac{3}{4}$, we get
\[
P\left( X \ge \frac{3n}{4} \right) \le \frac{4}{n}.
\]
If $n > 6$, then the estimate given by Chebyshev's inequality is sharper than the estimate provided by the
Markov inequality (for $n = 6$ the two bounds coincide). Also, as $n$ increases, the estimate given by Chebyshev's inequality decreases,
i.e., gives more information, whereas the estimate provided by the Markov inequality remains
constant as $n$ varies.
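To get a feeling for how loose both bounds are, one can compare them with the exact binomial tail. The sketch below is illustrative only: the function names and the choice of $n$ values are assumptions, and $n$ is taken as a multiple of 4 so that $\alpha n$ is an integer.

```python
from math import comb

def exact_tail(n, p, alpha):
    """Exact P(X >= alpha*n) for X ~ Binomial(n, p)."""
    threshold = alpha * n
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n + 1) if k >= threshold)

def markov_bound(p, alpha):
    """First-moment Markov bound p / alpha."""
    return p / alpha

def chebyshev_tail_bound(n, p, alpha):
    """Chebyshev bound p(1 - p) / (n (alpha - p)^2)."""
    return p * (1 - p) / (n * (alpha - p) ** 2)

p, alpha = 0.5, 0.75
for n in (4, 8, 16, 64):
    print(n, exact_tail(n, p, alpha), markov_bound(p, alpha),
          chebyshev_tail_bound(n, p, alpha))
```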
Lecture 42: Law of Large Numbers
10 April, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

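Theorem 42.1 (Weak Law of Large Numbers) Let $X_1, X_2, \cdots$ be a sequence of independent
and identically distributed random variables, each having finite mean $\mu$. Then for
every $\delta > 0$,
\[
\lim_{n \to \infty} P\left( \left| \frac{S_n}{n} - \mu \right| \ge \delta \right) = 0, \quad \text{or equivalently} \quad \lim_{n \to \infty} P\left( \left| \frac{S_n}{n} - \mu \right| < \delta \right) = 1, \tag{42.1}
\]
where $S_n = X_1 + X_2 + \cdots + X_n$.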

Remark 42.2 1. The weak law of large numbers states that for large $n$, the bulk of the
distribution of $\frac{S_n}{n}$ is concentrated near $\mu$. That is, if we consider a positive length
interval $[\mu - \delta, \mu + \delta]$ around $\mu$, then there is high probability that $S_n/n$ will fall in that
interval; as $n \to \infty$, this probability converges to 1. Of course, if $\delta$ is very small, we
may have to wait longer (i.e., need a larger value of $n$) before we can assert that $S_n/n$
is highly likely to fall in that interval.

2. To understand the convergence in the weak law, think in terms of the PMF (if the $X_i$ are discrete
random variables) or the PDF (if the $X_i$ have a pdf, then we know that $S_n$ will possess a
pdf) of the random variable $S_n/n$. The weak law states that "almost all" of the PMF or PDF
of $S_n/n$ is concentrated within a $\delta$-neighborhood of $\mu$ for large values of $n$.

3. The limit in (42.1) means: for all $\delta, \epsilon > 0$, there exists $n_0(\epsilon, \delta)$ such that for all $n \ge n_0(\epsilon, \delta)$
we have
\[
P\left\{ \omega : \left| \frac{S_n(\omega)}{n} - \mu \right| < \delta \right\} > 1 - \epsilon.
\]
If we refer to $\delta$ as the accuracy level and $\epsilon$ as the confidence level, the weak law takes
the following intuitive form: for any given level of accuracy and confidence, $S_n/n$ will
be equal to $\mu$, within these levels of accuracy and confidence, provided $n$ is large enough.
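The concentration described in this remark can be computed exactly in a simple assumed case that is not taken from the notes: $X_i \sim$ Bernoulli($p$), so that $S_n \sim$ Binomial($n, p$). The sketch below evaluates $P(|S_n/n - p| < \delta)$ for growing $n$. The pmf is computed in log space to avoid underflow for large $n$, and exact fractions are used for the comparison so that boundary cases are not misclassified by floating-point rounding.

```python
from fractions import Fraction
import math

def binom_pmf(n, k, p):
    """Binomial(n, p) pmf at k, computed in log space to avoid underflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def prob_within(n, p, delta):
    """P(|S_n/n - p| < delta) for S_n ~ Binomial(n, p); p and delta are Fractions."""
    return sum(binom_pmf(n, k, float(p)) for k in range(n + 1)
               if abs(Fraction(k, n) - p) < delta)

p, delta = Fraction(1, 2), Fraction(1, 20)  # illustrative accuracy level
for n in (10, 100, 1000, 10000):
    print(n, prob_within(n, p, delta))
```

The printed probabilities increase toward 1, which is the statement of the weak law for this particular choice of $\delta$.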

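Example 42.3 Let $X_1, X_2, \cdots$ be independent and identically distributed random variables
with $E[X_i] = 0$ and $Var(X_i) = 1$ for all $i$. Let $S_n = X_1 + X_2 + \cdots + X_n$. Then, for any
$x > 0$, compute $\lim_{n \to \infty} P(-nx < S_n < nx)$.

Solution: For any $x > 0$, we have
\[
\begin{aligned}
P(-nx < S_n < nx) &= P\left( -x < \frac{S_n}{n} < x \right) = P\left( \left| \frac{S_n}{n} - 0 \right| < x \right) \\
&= 1 - P\left( \left| \frac{S_n}{n} - 0 \right| \ge x \right).
\end{aligned}
\]
By the weak law of large numbers (with $\mu = 0$ and $\delta = x$), we have
\[
\lim_{n \to \infty} P\left( \left| \frac{S_n}{n} - 0 \right| \ge x \right) = 0.
\]
Hence $\lim_{n \to \infty} P(-nx < S_n < nx) = 1$.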

Random Sampling

Let $X_1, \cdots, X_n$ be $n$ independent random variables having the same distribution. These random
variables may be thought of as $n$ independent measurements of some quantity that is
distributed according to their common distribution (e.g., height of students in LNMIIT campus).
In this sense we sometimes speak of the random variables $X_1, \cdots, X_n$ as constituting
a random sample of size $n$ from this distribution.
Suppose that the common distribution of these random variables has finite mean $\mu$. Then
for $n$ sufficiently large we would expect that the sample mean $\frac{S_n}{n} = (X_1 + \cdots + X_n)/n$ should
be close to the true mean $\mu$.
The weak law of large numbers asserts that the sample mean of a large number of independent
identically distributed random variables is very close to the true mean, with high probability.
The weak law of large numbers says that for any $\delta > 0$,
\[
\lim_{n \to \infty} P\left( \left| \frac{S_n}{n} - \mu \right| \ge \delta \right) = 0. \tag{42.2}
\]
We may interpret (42.2) in the following way. The number $\delta$ can be thought of as the desired
accuracy in the approximation of $\mu$ by $S_n/n$. Equation (42.2) assures us that no matter how
small $\delta$ may be chosen, the probability that $S_n/n$ approximates $\mu$ to within this accuracy,
that is, $P\left( \left| \frac{S_n}{n} - \mu \right| < \delta \right)$, converges to 1 as the number of observations gets large.

Theorem 42.4 (Strong Law of Large Numbers) Let $X_1, X_2, \cdots$ be a sequence of independent
and identically distributed random variables, each having finite mean $\mu$. Then
\[
P\left( \lim_{n \to \infty} \frac{S_n}{n} = \mu \right) = 1, \tag{42.3}
\]
where $S_n = X_1 + X_2 + \cdots + X_n$.
Lecture 43: Central Limit Theorem
12 April, 2019
Sunil Kumar Gauttam
Department of Mathematics, LNMIIT

Theorem 43.1 (Central Limit Theorem) Let $X_1, X_2, \cdots$ be a sequence of independent
and identically distributed random variables, each having finite mean $\mu$ and finite non-zero variance
$\sigma^2$. Define
\[
S_n := X_1 + X_2 + \cdots + X_n, \qquad Z_n := \frac{S_n - n\mu}{\sigma\sqrt{n}}.
\]
Then
\[
\lim_{n \to \infty} P(Z_n \le x) = N(x), \quad \forall x \in \mathbb{R},
\]
where $N(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\,dt$.

Another frequently used notation for $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\,dt$ is $\Phi(x)$.

The central limit theorem is surprisingly general. Besides independence, and the implicit
assumption that the mean and variance are finite, it places no other requirement on the
distribution of the Xi , which could be discrete, continuous, or mixed.
Although CLT gives us a useful general approximation, we have no automatic way of knowing
how good the approximation is in general. In fact the goodness of the approximation is a
function of the original distribution, and so must be checked case by case.
To get a feeling for the CLT, let us look at some examples.

Example 43.2 Let the $X_i$ be independent Bernoulli($p$). Then $EX_i = p$, $Var(X_i) = p(1 - p)$.
Also, $S_n = X_1 + X_2 + \cdots + X_n$ has a Binomial($n, p$) distribution. Thus,
\[
Z_n = \frac{S_n - np}{\sqrt{np(1 - p)}}.
\]
We plot the PMF of $Z_n$ for different values of $n$ by choosing $p = \frac{1}{3}$.

[Figure: PMF of $Z_n$ for several values of $n$, with $p = \frac{1}{3}$.]

As you see, the shape of the PMF gets closer to a normal PDF curve as $n$ increases. Correspondingly,
the CDF of $Z_n$ converges to the standard normal CDF:
\[
F_{Z_n}(x) = \sum_{z \in R_{Z_n} :\, z \le x} f_{Z_n}(z) \to N(x) \text{ (or } \Phi(x)\text{)},
\]
where $R_{Z_n}$ denotes the range of $Z_n$. That is what the CLT states.
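The convergence of the CDF of $Z_n$ in this Bernoulli example can be quantified without a plot. The sketch below (hypothetical helper names, $p = \frac{1}{3}$ as in the example) computes the largest gap between $F_{Z_n}$ and $\Phi$. Because $F_{Z_n}$ is a step function and $\Phi$ is continuous and increasing, it suffices to check both sides of each jump.

```python
import math

def phi_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binom_pmf(n, k, p):
    """Binomial(n, p) pmf at k, computed in log space to avoid underflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def max_cdf_gap(n, p):
    """Largest |F_{Z_n}(z) - Phi(z)|, checked just before and at each jump."""
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    cdf, gap = 0.0, 0.0
    for k in range(n + 1):
        z = (k - mean) / sd
        gap = max(gap, abs(cdf - phi_cdf(z)))  # left limit at the jump
        cdf += binom_pmf(n, k, p)
        gap = max(gap, abs(cdf - phi_cdf(z)))  # value at the jump
    return gap

for n in (10, 100, 1000):
    print(n, max_cdf_gap(n, 1 / 3))
```

The gap shrinks as $n$ grows, but it does so slowly, which is consistent with the remark that the quality of the approximation has to be checked case by case.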

Example 43.3 Let the $X_i$ be independent Uniform(0, 1). Then $EX_i = \frac{1}{2}$, $Var(X_i) = \frac{1}{12}$. Let
$S_n = X_1 + X_2 + \cdots + X_n$. In this case,
\[
Z_n = \frac{S_n - \frac{n}{2}}{\sqrt{n/12}}.
\]
We have derived a general formula for the pdf of the sum of two independent random variables
with pdfs. Applying it repeatedly, $S_n$, and hence $Z_n$, has a pdf. We plot the PDF of $Z_n$ for different values
of $n$.

[Figure: PDF of $Z_n$ for several values of $n$.]

As you see, the shape of the PDF gets closer to a normal PDF curve as $n$ increases. Correspondingly,
the CDF of $Z_n$ converges to the standard normal CDF:
\[
F_{Z_n}(x) = \int_{-\infty}^{x} f_{Z_n}(z)\,dz \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{z^2}{2}}\,dz.
\]
That is what the CLT states.

Question: Suppose $(X_n)$ is a sequence of iid discrete (continuous) random variables with
finite mean and finite non-zero variance. Does the CLT say that the pmf (pdf) of $Z_n$ converges to the pdf of a
standard normal random variable?

Answer: No. The CLT does not say that the pmf (pdf) of $Z_n$ converges to the pdf of a standard normal
random variable. The CLT is about convergence of distribution functions. Suppose the $X_i$ are
continuous with pdf $f$. Then $S_n$ has pdf $f^{*n}$, the $n$-fold convolution of $f$ with itself. Since $P(Z_n \le x) = P(S_n \le n\mu + \sigma\sqrt{n}\,x)$, the CLT says that for all $x \in \mathbb{R}$
\[
\lim_{n \to \infty} \int_{-\infty}^{n\mu + \sigma\sqrt{n}\,x} f^{*n}(t)\,dt = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-\frac{t^2}{2}}\,dt.
\]

Normal Approximation Based on the Central Limit Theorem: The central limit
theorem allows us to calculate probabilities related to $Z_n$ as if $Z_n$ were normal: the CLT says
$P(Z_n \le x) \approx N(x)$ for large values of $n$. Note that $S_n = \sigma\sqrt{n}\,Z_n + n\mu$. Since normality is
preserved under linear transformations, treating $Z_n$ as $N(0, 1)$ is equivalent to treating $S_n$
as a normal random variable with mean $n\mu$ and variance $n\sigma^2$.
Let $S_n = X_1 + \cdots + X_n$, where the $X_i$ are independent identically distributed random
variables with mean $\mu$ and variance $\sigma^2$. If $n$ is large, the probability $P(S_n \le c)$ can be
approximated by treating $S_n$ as if it were normal, according to the following procedure.

Step 1 Calculate the mean $n\mu$ and the variance $n\sigma^2$ of $S_n$.

Step 2 Use the approximation
\[
P(S_n \le c) \approx N\left( \frac{c - n\mu}{\sigma\sqrt{n}} \right).
\]

Example 43.4 We load on a plane 100 packages whose weights are independent random
variables that are uniformly distributed between 5 and 50 kg. What is the probability that the
total weight will exceed 3000 kg?

Solution: Let us translate the problem into a probabilistic model. Let $X_i$ denote the weight
of the $i$th package. Then $X_1, X_2, \cdots, X_{100}$ are iid uniform random variables with density
\[
f(x) = \begin{cases} \frac{1}{45}, & \text{if } 5 \le x \le 50 \\ 0, & \text{otherwise.} \end{cases}
\]
Let $S = X_1 + X_2 + \cdots + X_{100}$ denote the total weight. The question is to calculate
$P(S > 3000)$. Convolution is associative: if $f$, $g$, and $h$ are functions on the reals and the convolutions
$(f * g) * h$ and $f * (g * h)$ exist, then $(f * g) * h = f * (g * h)$. Using this result and the fact that $S$
is a sum of independent random variables, we see that the pdf of $S$ is the 100-fold convolution of $f$.
This is not easy to calculate, so it is very difficult to find the desired probability exactly,
but an approximate answer can be quickly obtained using the central limit theorem. Treat
$S$ as a normal random variable. Its mean and variance are $100\mu$ and
$100\sigma^2$, where $\mu = E[X_i]$ and $\sigma^2 = \mathrm{var}(X_i)$.
\[
\begin{aligned}
E(X_i) &= \int_{-\infty}^{\infty} x f(x)\,dx = \frac{1}{45} \int_{5}^{50} x\,dx = \frac{5 + 50}{2} = 27.5, \\
E(X_i^2) &= \int_{-\infty}^{\infty} x^2 f(x)\,dx = \frac{1}{45} \int_{5}^{50} x^2\,dx = \frac{(50)^3 - 5^3}{3 \times 45} = \frac{(50)^2 + 50 \times 5 + 5^2}{3} = 925, \\
\mathrm{var}(X_i) &= 925 - (27.5)^2 = 925 - 756.25 = 168.75.
\end{aligned}
\]
Now
\[
P(S > 3000) = 1 - P(S \le 3000) \approx 1 - N\left( \frac{3000 - 2750}{10\sqrt{168.75}} \right) = 1 - N(1.92) \approx 0.0274.
\]
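The two-step normal approximation for this package example is easy to reproduce with the standard library. The sketch below is illustrative: the variable names are assumptions, and it uses the standard uniform-variance formula $(b - a)^2/12$, which agrees with the value 168.75 computed above.

```python
import math

def phi_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, a, b, c = 100, 5.0, 50.0, 3000.0   # packages, weight range in kg, threshold in kg
mu = (a + b) / 2.0                    # mean of one package: 27.5
var = (b - a) ** 2 / 12.0             # variance of one package: 168.75

# Step 1: mean n*mu and variance n*var of the total weight S.
# Step 2: P(S <= c) is approximated by Phi((c - n*mu) / sqrt(n*var)).
z = (c - n * mu) / math.sqrt(n * var)
tail = 1.0 - phi_cdf(z)
print(z, tail)
```

Because the code keeps the unrounded value of $z$ instead of the table entry for 1.92, the printed tail probability can differ slightly in the later decimal places from the table-based figure.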

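Example 43.5 Let $X_1, X_2, \cdots$ be iid Poisson($\lambda$) random variables. Then we know $EX_1 = \mathrm{var}(X_1) = \lambda$.
Hence, by the CLT, $S_n = X_1 + X_2 + \cdots + X_n$ has approximately an $N(n\lambda, n\lambda)$ distribution for large
$n$. Let $n = 64$ and $\lambda = 0.125$, so that $n\lambda = 8$. A sum of independent Poisson random variables is again Poisson (Example 40.9),
hence the exact distribution is $S_{64} \sim$ Poisson($64 \times 0.125 = 8$), and from Poisson distribution
tables $P(S_{64} = 10) = \frac{8^{10} e^{-8}}{10!} = 0.099261534$. Using the normal approximation,
\[
\begin{aligned}
P(S_n = 10) = P(9.5 < S_n < 10.5) &= P\left( \frac{9.5 - 8}{\sqrt{0.125} \times 8} < \frac{S_n - n\lambda}{\sqrt{\lambda}\sqrt{n}} < \frac{10.5 - 8}{\sqrt{0.125} \times 8} \right) \\
&\approx P(0.530330086 < Z < 0.883883476) \approx \Phi(0.88) - \Phi(0.53) = 0.1087,
\end{aligned}
\]
where $Z \sim N(0, 1)$ and the final value uses normal table entries at the rounded points 0.88 and 0.53.

Here we have used the "continuity correction" to compute $P(S_n = 10)$, which would be
zero if $S_n$ were taken to be a normal random variable without this correction.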
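The exact Poisson probability and its continuity-corrected normal approximation can be recomputed as follows. This is a minimal sketch with assumed variable names. It evaluates $\Phi$ at the unrounded endpoints, so its result can differ in the third decimal place from a value read off a table at rounded arguments.

```python
import math

def phi_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

lam, n, k = 0.125, 64, 10
mean = var = n * lam                  # S_n ~ Poisson(8): mean and variance are both 8

# Exact Poisson probability P(S_n = k).
exact = math.exp(-mean) * mean**k / math.factorial(k)

# Continuity correction: replace {S_n = k} by {k - 0.5 < S_n < k + 0.5}.
lower = (k - 0.5 - mean) / math.sqrt(var)
upper = (k + 0.5 - mean) / math.sqrt(var)
approx = phi_cdf(upper) - phi_cdf(lower)
print(exact, approx)
```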
