
Master Information, Systems, and Technology

Lecture 452
Statistical signal processing
Ver. 3.0
Alexandre Renaux
Université Paris-Sud 11
Contents
General introduction 1
I Elements of probability theory 3
1 Introduction 4
2 Probability 6
2.1 Vocabulary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Probability space and important properties . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 σ-Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Borel σ-algebra over R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Borel σ-algebra over R^d . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.4 Borel σ-algebra over C and C^d . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.5 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.6 Examples of measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.6.1 Dirac measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.6.2 Counting measure . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.6.3 Lebesgue measure over (R, B (R)) . . . . . . . . . . . . . . . . . . 10
2.2.6.4 Lebesgue measure over (R^d, B(R^d)) . . . . . . . . . . . . . . . . . 10
2.2.6.5 Lebesgue measure over (C, B(C)) and (C^d, B(C^d)) . . . . . . . . 11
2.2.7 Probabilities and probability spaces . . . . . . . . . . . . . . . . . . . . . . 11
2.2.8 Examples of discrete probability measures . . . . . . . . . . . . . . . . . . . 11
2.3 How to play with probabilities? Fundamental properties . . . . . . . . . . . . . . . 12
2.3.1 Probabilities of the union, intersection, and complement of events . . . . . . 12
2.3.2 Conditional probabilities, Bayes' theorem and marginalization . . . . . . . . 13
2.3.3 Independence in probability (part 1/3) . . . . . . . . . . . . . . . . . . . . . 15
2.3.3.1 Independence of two events . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3.2 Independence of a finite family of events . . . . . . . . . . . . . . 15
2.3.3.3 Independence of a countable family of events . . . . . . . . . . . . 15
2.3.3.4 Independence of σ-algebras . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Topological spaces and σ-algebras . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Random variables 17
3.1 Measurable mappings/functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 Definitions and first properties . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 Operations of measurable functions leading to measurable functions . . . . 19
3.1.3 Limit of measurable functions . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Random variables and probability distribution (part 1/2) . . . . . . . . . . . . . . 21
3.2.1 Denition of a random variable . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 Probability distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Independence in probability (part 2/3) . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Independence of random variables . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.2 Relationships between independences . . . . . . . . . . . . . . . . . . . . . . 21
3.4 Integration theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.1 Riemann integral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.2 Integral with respect to a measure . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.3 Integral with respect to the Dirac measure . . . . . . . . . . . . . . . . . . . 22
3.4.4 Integral with respect to a discrete measure . . . . . . . . . . . . . . . . . . 22
3.4.5 Lebesgue integral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.6 Negligibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.7 Beppo-Levi theorem and Lebesgue theorem . . . . . . . . . . . . . . . . . . 22
3.4.8 Absolute continuity and density . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Random variables and probability distribution (part 2/2) . . . . . . . . . . . . . . 22
3.5.1 Cumulative distribution function . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5.2 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5.3 Continuous random variables and probability density function . . . . . . . . 23
3.5.4 Mathematical expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5.5 L^p(Ω, T, P) and 𝓛^p(Ω, T, P) spaces . . . . . . . . . . . . . . . . . . . . . . 24
3.5.5.1 Definitions and main properties . . . . . . . . . . . . . . . . . . . 24
3.5.5.2 Hilbert spaces and L^2(Ω, T, P) . . . . . . . . . . . . . . . . . . . . 24
3.5.5.3 Radon-Nikodym theorem and duality in L^p(Ω, T, P) spaces . . . . 24
3.5.6 Moments and central moments . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5.7 Characteristic function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.6 Examples of important probability distributions for discrete random variables . . . 26
3.6.1 Discrete uniform probability distribution . . . . . . . . . . . . . . . . . . . . 26
3.6.2 Bernoulli probability distribution . . . . . . . . . . . . . . . . . . . . . . . . 26
3.6.3 Binomial probability distribution . . . . . . . . . . . . . . . . . . . . . . . . 26
3.6.4 Poisson probability distribution . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.6.5 Binomial probability distribution . . . . . . . . . . . . . . . . . . . . . . . . 27
3.6.6 Hypergeometric probability distribution . . . . . . . . . . . . . . . . . . . . 27
3.7 Examples of important probability density functions for continuous random variables 27
3.7.1 Uniform probability density function . . . . . . . . . . . . . . . . . . . . . . 27
3.7.2 Real Gaussian or normal probability density function . . . . . . . . . . . . 27
3.7.3 Complex circular Gaussian probability density function . . . . . . . . . . . 28
3.7.4 Exponential probability density function . . . . . . . . . . . . . . . . . . . . 28
3.7.5 Gamma probability density function . . . . . . . . . . . . . . . . . . . . . . 28
3.7.6 Beta probability density function . . . . . . . . . . . . . . . . . . . . . . . . 28
3.7.7 Student probability density function . . . . . . . . . . . . . . . . . . . . . . 28
3.7.8 χ² probability density function . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.8 Substitution of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4 Random vectors 29
4.1 Product probability space and probability distribution . . . . . . . . . . . . . . . . 29
4.2 Back to integration theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Joint and marginal cumulative distribution function . . . . . . . . . . . . . . . . . 29
4.4 Joint and marginal probability density function . . . . . . . . . . . . . . . . . . . . 30
4.5 Conditional probability density function . . . . . . . . . . . . . . . . . . . . . . . . 31
4.6 Independence in probability (part 3/3) . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.7 Mathematical expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.8 Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.9 Covariance and correlation matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.10 Correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.11 Correlation and independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.12 Characteristic function of a random vector . . . . . . . . . . . . . . . . . . . . . . . 35
4.13 Examples of important probability distributions for random vectors . . . . . . . . 35
4.13.1 Real multivariate Gaussian distribution . . . . . . . . . . . . . . . . . . . . 35
4.13.2 Complex multivariate Gaussian distribution . . . . . . . . . . . . . . . . . . 36
4.13.3 Real Wishart distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.13.4 Complex Wishart distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.14 Sum of independent random variables . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.15 Substitution of random vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5 Convergences and additional results 38
5.1 Convergences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.1.1 Convergence in probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.1.2 Convergence in distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.2.2 Lévy's continuity theorem . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.3 Relationship between convergence in probability and convergence in distribution . . 39
5.2 Weak law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3 Central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
II Random signals 42
6 Introduction 43
7 Temporal representations 44
7.1 Stochastic processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
7.2 Mean and correlation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8 Main classes of random signals 46
8.1 Stationary signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
8.2 Ergodic signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
8.3 Theoretical white noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.4 Gaussian processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8.5 Poisson processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8.6 Markov processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9 Spectral representations 49
9.1 Power spectral density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
9.2 Wiener-Khintchine theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
9.3 Interference formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
10 Random signal models 53
10.1 Autoregressive processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
10.2 Moving average processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
10.3 Autoregressive moving average processes . . . . . . . . . . . . . . . . . . . . . . . . 55
III Elements of estimation theory 56
11 Introduction 57
12 Estimation of deterministic parameters 58
12.1 Least-squares estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
12.1.1 Philosophy and example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
12.1.2 Performance of the least-squares estimator . . . . . . . . . . . . . . . . . . 59
12.2 Likelihood function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
12.3 Estimation performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
12.3.1 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
12.3.2 Variance and mean square error . . . . . . . . . . . . . . . . . . . . . . . . . 60
12.3.3 Cramér-Rao bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
12.4 Maximum Likelihood estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
12.5 Properties of the maximum likelihood estimator . . . . . . . . . . . . . . . . . . . . 66
12.5.1 Study of (1/√n) L′_n(θ_0) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
12.5.2 Study of (1/n) L″_n(θ_0) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
12.5.3 Study of (1/n) L″_n(θ_n) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
12.5.4 End of the proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
13 Estimation of random parameters 71
13.1 Estimation performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
13.1.1 Local mean square error versus global mean square error . . . . . . . . . . . 71
13.1.2 Bayesian Cramér-Rao bound . . . . . . . . . . . . . . . . . . . . . . . . . . 71
13.2 Minimum Mean Square Error estimator . . . . . . . . . . . . . . . . . . . . . . . . 71
13.3 Maximum A Posteriori estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
IV Exercises 72
14 Problems 73
14.1 σ-algebras and Borel σ-algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
14.2 Independence of events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
14.3 Random variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
14.4 Correlation and independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
14.5 Correlation and independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
14.6 Couple of Gaussian random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 75
14.7 Substitution of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
14.8 Sum of independent Gaussian random variables . . . . . . . . . . . . . . . . . . . . 75
15 Problems 76
15.1 Problem 1: stationarity and PSD of a sinusoidal signal . . . . . . . . . . . . . . . . 76
15.2 Problem 2: another sinusoidal signal . . . . . . . . . . . . . . . . . . . . . . . . . . 76
15.3 Problem 3: modulated signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
15.4 Problem 4: sum of cisoids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
15.5 Problem 5: quasi-monochromatic signal . . . . . . . . . . . . . . . . . . . . . . . . 77
15.6 Problem 6: discrete signal and filtering . . . . . . . . . . . . . . . . . . . . . . . . . 77
15.7 Problem 7: another modulated signal . . . . . . . . . . . . . . . . . . . . . . . . . . 78
15.8 Problem 8: sum of cisoids and filtering . . . . . . . . . . . . . . . . . . . . . . . . . 78
15.9 Problem 9: non-linear system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
16 Problems 80
16.1 Problem 1: linear observation model and LS estimator . . . . . . . . . . . . . . . . 80
16.2 Problem 2: Cramér-Rao bound and line fitting . . . . . . . . . . . . . . . . . . . . 80
16.3 Problem 3: nuisance parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
16.4 Problem 4: linear observation model and ML estimator . . . . . . . . . . . . . . . 81
16.5 Problem 5: noise power estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
16.6 Problem 6: maximum likelihood estimation of the parameter of a Poisson distribution 81
16.7 Problem 7: maximum likelihood estimation of the parameter of a Rayleigh distribution 81
V Annexes 82
17 Borel σ-algebra over a topological space 83
18 Additional results on measure theory 84
19 Elements of integration theory 85
VI Practical works 86
20 PW1: Matlab 101 and basic signal processing problems 87
20.1 Some important Matlab functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
20.1.1 Preliminary remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
20.1.2 Variables definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
20.1.3 Vectors definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
20.1.4 Drawing plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
20.1.5 Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
20.2 Application to signal processing problems . . . . . . . . . . . . . . . . . . . . . . . 88
20.2.1 Noises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
20.2.2 Central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
20.2.3 Linear estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
20.2.4 Non-linear estimation: spectral analysis . . . . . . . . . . . . . . . . . . . . 89
21 PW2: Speech processing 90
21.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
21.2 Parameters estimation of an AR process . . . . . . . . . . . . . . . . . . . . . . . . 90
21.3 Analysis of the sound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
21.4 Synthesis of the sound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
22 PW3: Sources localization 94
22.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
22.2 Observation model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
22.2.1 Quasi-monochromatic signals . . . . . . . . . . . . . . . . . . . . . . . . . . 95
22.2.2 Case of one source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
22.2.3 Case of several sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
22.2.4 Full observation model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
22.3 Beamforming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
22.4 Maximum likelihood estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
General introduction
Let us start with some definitions (from Wikipedia):
Signal: a physical quantity that can carry information.
Signal processing: the field of techniques used to extract information from signals.
Noise: fluctuations in, and the addition of external factors to, the stream of target information
(signal) being received at a detector.
Statistics: a mathematical science pertaining to the collection, analysis, interpretation, and
presentation of data.
These four definitions are the cornerstone of the present manuscript, which is devoted to statistical
signal processing. Signal processing is both a wide and an ancillary topic. Wide, because the range
of problems that can be solved by signal processing techniques is large: one can cite prediction,
reconstruction, filtering, data compression, but also estimation of parameters of interest hidden
in data. More generally, signal processing deals with any kind of information that a user wants
to know or to modify. As a by-product, signal processing is an ancillary topic, since the range of
applications served by signal processing techniques is also large. As examples, one can mention
radar/sonar, digital communications, medical imaging, econometrics, weather forecasting,
astrophysics, etc. Note that, generally, each of these applications needs its own processing
techniques. A lot of systems in real life use signal processing methods. Basically, when you
listen to an MP3 file, when you make a hotel reservation on the internet, when you check the
weather forecast for tomorrow on TV, when you use your cell phone, or when you check your position
with a GPS system, you are using signal processing methods developed by engineers
and researchers. To finish with generalities, signal processing can be seen as the set of techniques
applied at the output of a particular system of sensors to extract the information of interest.
One of the most important parts of signal processing is statistical signal processing. As we will
see in this lecture, the data that we want to process can generally be modelled by using the physics
of the problem. For example, we know that we are measuring a sequence of 0s and 1s (as in digital
communications) or we know that we are measuring a sum of sinusoids (as in spectral analysis).
Unfortunately, a physical model is never perfect. For example, a radar system emits a known
waveform and uses the echoes of this waveform to detect a target or to estimate its range. The
problem is that the measured signal is composed of the main echo due to the target but also
of a lot of unwanted echoes due to the environment. It is clear that it is impossible to model such
echoes in a precise manner. The solution is to use some well-known mathematical tools in order
to model these uncertainties. These tools come from probability theory and statistics, which
are well adapted to solving this kind of problem. This is why we use the term statistical signal
processing. Note that these methods are powerful because, as we will see, we will be able to
model a lot of uncertainties at different levels and to represent these random signals
with great fidelity as simple quantities measurable through time or across space. These tools are
also interesting because they give methods to evaluate the performance of practical algorithms,
i.e., to tell whether we are building good or bad algorithms.
The aim of the present manuscript is to provide a basis to describe, to model and, of course, to
process random signals. The level is the first year of the master's degree. A lot of the methods
described here will be emphasized and extended during the second year (depending on the speciality).
The required background is then the bachelor's degree and, more particularly, the analysis of
deterministic signals (Fourier transforms, etc.). It is also assumed that the student is familiar
with basic mathematical concepts such as algebra (groups, vector spaces, matrices, determinants,
etc.), Riemann integration and differentiation, sequences and series, and optimization. However,
in order to be self-consistent, several useful results are given in the present manuscript.
This manuscript is divided into three parts. The first part deals with probability theory
concepts which are necessary throughout the lectures. The second part deals with random signals,
or stochastic processes. The emphasis is placed on their representations (and the difference with
deterministic signals) and on the modelling of such signals. Finally, the third part deals with
estimation theory, i.e., how to build efficient and robust algorithms to recover the information
hidden in a random signal. The philosophy used here is to alternate definitions, theorems (with
proofs most of the time) and exercises given at the end of the manuscript. At the very end of
this manuscript, the student will find some practical works using the software Matlab in order to
illustrate the theoretical concepts given in this lecture. Of course, the present document focuses on
basic concepts and is far from being enough to fully understand this wide topic. Consequently,
I encourage the student who wants to go deeper into the techniques presented here to read the
literature, which is abundant. One can cite these books (from which this document is largely inspired):
Bernard Picinbono, Random Signals and Systems, Prentice Hall, 1993.
Steven M. Kay, Fundamentals of Statistical Signal Processing - Part 1: Estimation Theory,
Prentice Hall, 1993.
Harry L. Van Trees, Detection, Estimation, and Modulation Theory Part I, Wiley, 1968.
Athanasios Papoulis, Probability, Random Variables and Stochastic Processes, McGraw-Hill,
1991.
Alberto Leon-Garcia, Probability and Random Processes for Electrical Engineering, Addison
Wesley, 2nd ed., 1993.
Part I
Elements of probability theory
Chapter 1
Introduction
In this part, we present the basic concepts used to describe random variables and random vectors.
Vocabulary and important definitions for what follows are introduced.
Here are some summaries of probability theory found on the internet:
(from Wikipedia): Probability is the likelihood or chance that something is the case or will hap-
pen. Theoretical probability is used extensively in areas such as finance, statistics, gambling,
mathematics, science and philosophy to draw conclusions about the likelihood of potential
events and the underlying mechanics of complex systems. [...]. The word probability does not
have a consistent direct definition. In fact, there are two broad categories of probability in-
terpretations, whose adherents possess different (and sometimes conflicting) views about the
fundamental nature of probability: (1) Frequentists talk about probabilities only when dealing
with well-defined random experiments. The probability of a random event denotes the relative
frequency of occurrence of an experiment's outcome, when repeating the experiment. Fre-
quentists consider probability to be the relative frequency "in the long run" of outcomes. (2)
Bayesians, however, assign probabilities to any statement whatsoever, even when no random
process is involved. Probability, for a Bayesian, is a way to represent an individual's degree
of belief in a statement, given the evidence. [...]. Probability has an interesting etymology.
Its meaning today is almost the opposite of the meaning of the word from which it originated.
Before the seventeenth century, legal evidence in Europe was considered to have greater weight if
a person testifying had probity. Empirical evidence was barely a concept. Probity was
a measure of authority, so evidence came from authority. A noble person had probity. Yet
today, probability is the very measure of the weight of empirical evidence in science, arrived
at from inductive or statistical inference.
(from MathWorld): Probability is the branch of mathematics that studies the possible out-
comes of given events together with the outcomes' relative likelihoods and distributions. In
common usage, the word "probability" is used to mean the chance that a particular event (or
set of events) will occur, expressed on a linear scale from 0 (impossibility) to 1 (certainty), also
expressed as a percentage between 0 and 100%. The analysis of events governed by probability
is called statistics. There are several competing interpretations of the actual "meaning" of
probabilities. Frequentists view probability simply as a measure of the frequency of outcomes
(the more conventional interpretation), while Bayesians treat probability more subjectively as
a statistical procedure that endeavors to estimate parameters of an underlying distribution
based on the observed distribution. A properly normalized function that assigns a probability
"density" to each possible outcome within some interval is called a probability function (or
probability distribution function), and its cumulative value (integral for a continuous distri-
bution or sum for a discrete distribution) is called a distribution function (or cumulative
distribution function).
(from the Stanford Encyclopedia of Philosophy): "Interpreting probability" is a commonly used but
misleading name for a worthy enterprise. The so-called "interpretations of probability" would
be better called "analyses of various concepts of probability", and "interpreting probability" is
the task of providing such analyses. Or perhaps better still, if our goal is to transform inexact
concepts of probability familiar to ordinary folk into exact ones suitable for philosophical and
scientific theorizing, then the task may be one of "explication" in the sense of Carnap (1950).
Normally, we speak of interpreting a formal system, that is, attaching familiar meanings to
the primitive terms in its axioms and theorems, usually with an eye to turning them into true
statements about some subject of interest. However, there is no single formal system that
is "probability", but rather a host of such systems. To be sure, Kolmogorov's axiomatization,
which we will present shortly, has achieved the status of orthodoxy, and it is typically what
philosophers have in mind when they think of "probability theory". Nevertheless, several of
the leading "interpretations of probability" fail to satisfy all of Kolmogorov's axioms, yet they
have not lost their title for that. Moreover, various other quantities that have nothing to
do with probability do satisfy Kolmogorov's axioms, and thus are "interpretations" of it in a
strict sense: normalized mass, length, area, volume, and indeed anything that falls under the
scope of measure theory, the abstract mathematical theory that generalizes such quantities.
Nobody seriously considers these to be "interpretations of probability", however, because they
do not play the right role in our conceptual apparatus. Instead, we will be concerned here
with various probability-like concepts that purportedly do. Be all that as it may, we will follow
common usage and drop the cringing scare quotes in our survey of what philosophers have
taken to be the chief "interpretations of probability". Whatever we call it, the project of finding
such interpretations is an important one. Probability is virtually ubiquitous. It plays a role
in almost all the sciences. It underpins much of the social sciences: witness, for example,
the prevalence of the use of statistical testing, confidence intervals, regression methods, and
so on. It finds its way, moreover, into much of philosophy. In epistemology, the philosophy of
mind, and cognitive science, we see states of opinion being modeled by subjective probability
functions, and learning being modeled by the updating of such functions. Since probability
theory is central to decision theory and game theory, it has ramifications for ethics and
political philosophy. It figures prominently in such staples of metaphysics as causation and
laws of nature. It appears again in the philosophy of science in the analysis of confirmation
of theories, scientific explanation, and in the philosophy of specific scientific theories, such as
quantum mechanics, statistical mechanics, and genetics. It can even take center stage in the
philosophy of logic, the philosophy of language, and the philosophy of religion. Thus, problems
in the foundations of probability bear at least indirectly, and sometimes directly, upon central
scientific, social scientific, and philosophical concerns. The interpretation of probability is
one of the most important such foundational problems.
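The frequentist notion quoted above, probability as a relative frequency "in the long run", is easy to make concrete numerically. Here is a minimal sketch (written in Python purely for illustration; the practical works of this lecture use Matlab, and the function name is not from these notes):

```python
import random

random.seed(0)  # make the pseudo-random draws reproducible

def head_frequency(n):
    """Empirical relative frequency of "head" over n tosses of a fair coin."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    return heads / n

# The relative frequency approaches P(head) = 1/2 as n grows;
# this convergence is the weak law of large numbers (Section 5.2).
for n in (10, 1000, 100000):
    print(n, head_frequency(n))
```

For small n the empirical frequency fluctuates noticeably around 1/2; for large n it stabilizes, which is exactly the frequentist reading of P(head) = 1/2.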
Chapter 2
Probability
2.1 Vocabulary
Let us start with some vocabulary. An experiment is any procedure that can be innitely repeated
and has a well dened set of outcomes. For example, a rolling dice or tossing a coin. The sample
space of an experiment is the set of all possible outcomes. For example, if the experiment is tossing
a coin, the sample space is the set {head, tail}. For tossing a single six-sided die, the sample space
is Face 1, Face 2, Face 3, Face 4, Face 5, Face 6. For some kinds of experiments, there may
be two or more plausible sample spaces available.A random event is a set of outcomes (a subset
of the sample space) to which a probability will be assigned. For example, if the experiment is
tossing a coin (not crooked), the natural probabilities associated to each elements of the sample
space are P (head) = P (tail) = 1/2. In order to be more rigorous, we have to introduce some
notations. The sample space (sometimes called the Universe) will be denoted Ω. An elementary random event will be denoted ω. Consequently, with the definitions proposed above, Ω = {ω₁, ω₂, . . . , ω_N} (note that, in the following, we will also sometimes consider Ω countable or even infinite). Note that a random event E can be E = {ω₅} (i.e. an elementary event) or E = {ω₂, ω₁₀, ω₁₉} depending on the kind of experiment. A complementary event Ē is an event which is realized if the event E is not realized. For example, if Ω = {ω₁, ω₂, ω₃} and E = {ω₁, ω₃}, then Ē = {ω₂}. Note that Ω̄ = ∅ (empty set).

Let A and B be two events. The event A + B or A ∪ B is the event in which one of the events A or B is realized. On the other hand, the event A.B or A ∩ B is the event in which both the events A and B are realized. For example, with Ω = {ω₁, ω₂, . . . , ω₆}, if A = {ω₂, ω₄, ω₆} and B = {ω₃, ω₄, ω₅, ω₆}, then A + B = {ω₂, ω₃, ω₄, ω₅, ω₆} and A.B = {ω₄, ω₆}. Of course, ∀E ⊂ Ω, one has E + Ē = Ω and E.Ē = ∅.
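As an illustrative aside (not part of the course text), the event algebra above maps directly onto Python's set operations; the faces are encoded as the integers 1 to 6 and all names are arbitrary:

```python
# Events as Python sets, mirroring the die example above (Ω = {ω1, ..., ω6}).
omega = {1, 2, 3, 4, 5, 6}          # sample space Ω (faces of a die)
A = {2, 4, 6}                        # event "even face"
B = {3, 4, 5, 6}                     # event "face >= 3"

union = A | B                        # A + B: at least one of A, B is realized
inter = A & B                        # A.B: both A and B are realized
A_bar = omega - A                    # complementary event Ā

print(union)                                     # {2, 3, 4, 5, 6}
print(inter)                                     # {4, 6}
print(A | A_bar == omega, A & A_bar == set())    # True True
```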
2.2 Probability space and important properties

In the light of the previously introduced vocabulary, our goal is to be able to assign a weight or a measure (that we will call "probability") to each possible event given a sample space Ω. It is clear that to solve this problem, we need to build a set of all these possible events given a sample space Ω. More precisely, we want to build a set of sets, which will be called a σ-algebra. Note that this set, denoted ℱ, can be finite, countable or infinite.
2.2.1 σ-Algebra
If we go back to the previous Section, given a sample space Ω, it is natural to want to be able to assign a probability to:
• the sample space Ω itself. This means that Ω ∈ ℱ;
• the complementary event Ē (if E ∈ ℱ). This means that we want to assign a probability to the fact that an event is realizable or not;
• the union of two (or more) events included in ℱ. This means that if we are able to assign a probability to two (or more) events, we want to be able to assign a probability to the fact that at least one of these events occurs.
This is exactly the mathematical definition of a σ-algebra.

Definition 1 Given a sample space Ω, a family ℱ of events of Ω is a σ-algebra if

(i) Ω ∈ ℱ,
(ii) E ∈ ℱ ⇒ Ē ∈ ℱ,
(iii) E_n ∈ ℱ ∀n (countable) ⇒ ∪_n E_n ∈ ℱ.    (2.1)

In other words, one builds a family of events ℱ closed under complementation, union, and intersection. Note that this definition implies that

(iv) ∅ ∈ ℱ, from (i) and (ii),
(v) ∀n, E_n ∈ ℱ (countable) ⇒ ∩_n E_n ∈ ℱ, from (ii), (iii) and via the De Morgan laws.    (2.2)

Some examples of σ-algebras are
• ℱ = {∅, Ω}. This σ-algebra is sometimes called the minimal (smallest) or trivial σ-algebra over Ω.
• Let E be an event of Ω. ℱ = {∅, E, Ē, Ω} is a σ-algebra over Ω.
• Let us define 𝒫(Ω), the power set of Ω, i.e., the set of all the subsets of Ω. ℱ = 𝒫(Ω) is a σ-algebra. This σ-algebra is the biggest one because it contains all the other σ-algebras over Ω.
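The second example can be checked mechanically. The sketch below (illustrative only; the chosen sets are arbitrary) verifies the three axioms of Definition 1 for ℱ = {∅, E, Ē, Ω} over a finite sample space, where closure under countable unions reduces to closure under pairwise unions:

```python
# Check that F = {∅, E, Ē, Ω} satisfies the σ-algebra axioms of Definition 1.
omega = frozenset({1, 2, 3, 4, 5, 6})
E = frozenset({1, 3})
F = {frozenset(), E, omega - E, omega}   # candidate σ-algebra

# (i) Ω ∈ F
assert omega in F
# (ii) closure under complementation
assert all(omega - A in F for A in F)
# (iii) closure under unions (pairwise suffices since F is finite)
for A in F:
    for B in F:
        assert A | B in F
print("F is a sigma-algebra over omega")
```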
Definition 2 When ℱ is a σ-algebra over Ω, the couple (Ω, ℱ) is called a measurable space. Any set F ∈ ℱ is measurable.
Note that the theory of σ-algebras is out of the scope of this document. We will just need some results (given by the next theorems) for the following of our study.

Theorem 3 The intersection of a countable family of σ-algebras is still a σ-algebra. More formally, let I be a subset of N and (ℱ_i)_{i∈I} a family of σ-algebras over the same sample space Ω. Then, ℱ = ∩_{i∈I} ℱ_i is a σ-algebra.

Proof. (i) Ω ∈ ℱ_i ∀i, then Ω ∈ ℱ. (ii) Let E ∈ ℱ ⇒ ∀i, E ∈ ℱ_i ⇒ ∀i, Ē ∈ ℱ_i ⇒ Ē ∈ ℱ. (iii) ∀n, E_n ∈ ℱ ⇒ ∀n ∀i, E_n ∈ ℱ_i ⇒ ∀i, ∪_n E_n ∈ ℱ_i ⇒ ∪_n E_n ∈ ℱ.
Note that, in contrast, the union of σ-algebras is not, in general, a σ-algebra.
Theorem 4 Let (Ω, ℱ) be a measurable space and Ω′ ⊂ Ω. Then, (Ω′, {A ∩ Ω′ : A ∈ ℱ}) is a measurable space (i.e., {A ∩ Ω′ : A ∈ ℱ} is a σ-algebra over Ω′).

Proof. Let 𝒞 = {A ∩ Ω′ : A ∈ ℱ}. (i) Ω ∈ ℱ and Ω ∩ Ω′ = Ω′; consequently, Ω′ ∈ 𝒞. (ii) Let C ∈ 𝒞; then there exists A ∈ ℱ such that A ∩ Ω′ = C. Since Ā ∈ ℱ, Ā ∩ Ω′ ∈ 𝒞. Moreover, the complementary event of C with respect to Ω′ is given by Ω′ \ C = Ω′ \ (A ∩ Ω′) = Ā ∩ Ω′ ∈ 𝒞. (iii) Let (C_n) be a countable family such that ∀n, C_n ∈ 𝒞; then there exists a family (A_n) ∈ ℱ such that C_n = A_n ∩ Ω′. Consequently, ∪_n C_n = ∪_n (A_n ∩ Ω′) = (∪_n A_n) ∩ Ω′. Finally, since ∪_n A_n ∈ ℱ, ∪_n C_n ∈ 𝒞.
Theorem 5 Let 𝒯 be a family of subsets of Ω. The smallest σ-algebra containing 𝒯 exists. This σ-algebra is called the σ-algebra generated by 𝒯 and is denoted σ(𝒯).

Proof. The set of σ-algebras containing 𝒯 is not empty since 𝒯 ⊂ 𝒫(Ω), the power set of Ω. Moreover, from Theorem 3, the intersection of all σ-algebras containing 𝒯 is still a σ-algebra, and this intersection is obviously the smallest σ-algebra containing 𝒯.

Remark 6 It is important to note that it is possible to have σ(𝒯₁) = σ(𝒯₂) for two different families 𝒯₁ and 𝒯₂. For example, if E is an event of Ω, σ({E}) = σ({Ē}) = {∅, E, Ē, Ω}.

Remark 7 If ℱ is a σ-algebra, then ℱ = σ(ℱ).
2.2.2 Borel σ-algebra over R

We will now define one of the most useful σ-algebras that we will use in the following: the Borel σ-algebra² (or σ-algebra of Borel sets).

Definition 8 Let the sample space be Ω = R. The σ-algebra generated by the open sets of R is called the Borel σ-algebra and will be denoted B(R). An element of B(R) is called a Borelian.
Theorem 9 Any open, closed or semi-open set belongs to B(R), and any union (finite or countable) of open, closed or semi-open sets belongs to B(R).

Proof. ∀(x, y) ∈ R², ]x, y[ ∈ B(R) by definition. Then, since [x, y] = ∩_{n∈N*} ]x − 1/n, y + 1/n[, we have [x, y] ∈ B(R) ∀(x, y) ∈ R² from (v). The proof concerning the semi-open sets is similar.
Theorem 10 The Borel σ-algebra is also the σ-algebra generated by the sets ]−∞, x], ]−∞, x[, ]x, +∞[, [x, +∞[ ∀x ∈ R.

Proof. Let ℱ be the σ-algebra generated by the sets ]x, +∞[ ∀x ∈ R. Then, ℱ ⊂ B(R). Then, [x, +∞[ = ∩_{n∈N*} ]x − 1/n, +∞[ ∈ ℱ. We also have that ]−∞, x[ is the complement of [x, +∞[, so ]−∞, x[ ∈ ℱ due to (ii). Moreover, with x < y, ]x, y[ = ]−∞, y[ ∩ ]x, +∞[ ∈ ℱ due to (iii). Finally, since any open set of R is the (countable) union of sets of the form ]−∞, x[, ]x, y[, and ]x, +∞[, we have B(R) ⊂ ℱ, which concludes the proof for the sets ]x, +∞[ ∀x ∈ R. The proof concerning the other sets is similar.
Theorem 11 The Borel σ-algebra is also the σ-algebra generated by the sets ]−∞, x], ]−∞, x[, ]x, +∞[, [x, +∞[ ∀x ∈ Q.

Proof. Let ℐ = {]−∞, x] : x ∈ R} and ℐ′ = {]−∞, x] : x ∈ Q}. We have B(R) = σ(ℐ) from the previous theorem, and it is clear that ℐ′ ⊂ ℐ, so σ(ℐ′) ⊂ B(R). We then have to prove that σ(ℐ′) ⊃ B(R), which is equivalent to verifying that ∀(x, y) ∈ R², ]x, y[ ∈ σ(ℐ′). There exist two sequences (x_n) and (y_n) of rational numbers such that x < x_n < y_n < y, with x_n ↓ x and y_n ↑ y. Since ]x_n, y_n] = ]−∞, y_n] ∩ ]x_n, +∞[ = ]−∞, y_n] \ ]−∞, x_n], we have ]x_n, y_n] ∈ σ(ℐ′). We also have ]x, y[ = ∪_n ]x_n, y_n], so ∀(x, y) ∈ R², ]x, y[ ∈ σ(ℐ′). The proof concerning the other sets is similar.
Theorem 12 B(R) has the power of the continuum, so B(R) ≠ 𝒫(R).

Proof. Admitted.

² Note that for this course, the Borel σ-algebra is defined in the particular context of Ω = R (and Ω = R^d in the following). In fact, one can define the Borel σ-algebra for any topological space (E, 𝒯).
2.2.3 Borel σ-algebra over R^d

We will see in the following that we will have to work with Ω = R^d and not only with Ω = R. This is why we introduce here the Borel σ-algebra over R^d.

Definition 13 Let the sample space be Ω = R^d. The σ-algebra generated by the open sets ]x₁, y₁[ × ]x₂, y₂[ × ··· × ]x_d, y_d[ = ∏_{i=1}^d ]x_i, y_i[ ∀(x_i, y_i) ∈ R² is called the Borel σ-algebra over R^d and will be denoted B(R^d).

Note that a more general definition using the topology over R^d is also available.
Theorem 14 B(R^d) is also the σ-algebra generated by the sets ∏_{i=1}^d [x_i, y_i], ∏_{i=1}^d ]x_i, y_i], and ∏_{i=1}^d [x_i, y_i[ ∀(x_i, y_i) ∈ R^{2d}, and by the sets ∏_{i=1}^d ]−∞, x_i], ∏_{i=1}^d ]−∞, x_i[, ∏_{i=1}^d ]x_i, +∞[, ∏_{i=1}^d [x_i, +∞[ ∀x_i ∈ R or Q.

Proof. The proofs are similar to the ones used for B(R).
2.2.4 Borel σ-algebra over C and C^d

To be updated.
2.2.5 Measures

Until now, given a sample space Ω, we are able to build some natural sets (the so-called σ-algebras) closed under complementation, union, and intersection. To each element of these sets, we want to assign a weight representing a probability. In order to define what a probability is, we need to introduce the concept of measure. Remember that when ℱ is a σ-algebra over Ω, the couple (Ω, ℱ) is called a measurable space and any set F ∈ ℱ is measurable.

Definition 15 A (positive) measure μ over a measurable space (Ω, ℱ) is any application from ℱ to R̄₊ = [0, +∞] such that

μ(∅) = 0,
(E_n)_{n∈N} ∈ ℱ (pairwise disjoint) ⇒ μ(∪_{n∈N} E_n) = Σ_{n∈N} μ(E_n).    (2.3)

The last condition is called σ-additivity. The first condition permits to avoid some trivial situations. Indeed, ∀E ∈ ℱ, one has E = E ∪ ∅ ∪ ∅ ∪ ···, and then by σ-additivity μ(E) = μ(E) + Σ_{n∈N} μ(∅).

As we will see in the next subsection, a probability is in fact a particular measure; consequently, the measure properties will be given in the next subsection. However, we still give two important definitions and some examples of measures in the following.
Definition 16 Let (Ω, ℱ) be a measurable space such that ∀ω ∈ Ω, {ω} ∈ ℱ.

• A measure μ over (Ω, ℱ) is called discrete if there exists a family D = {ω_n : n ∈ I} (where I is a finite or countable set of indices) of elements of Ω such that

μ(Ω \ D) = 0,
∀E ∈ ℱ, μ(E) = μ(E ∩ D) = Σ_{ω_n ∈ E∩D} μ({ω_n}).    (2.4)

• A measure μ over (Ω, ℱ) is called continuous if ∀ω ∈ Ω, one has μ({ω}) = 0.
2.2.6 Examples of measures

First, it is easy to see that μ = 0 for every element of ℱ, and μ = +∞ for every element of ℱ except the empty set (μ(∅) = 0), are measures.
2.2.6.1 Dirac measure

Definition 17 The Dirac measure at point ω₀ ∈ Ω is a discrete measure, denoted δ_{ω₀}(E) ∀E ∈ ℱ, such that

δ_{ω₀}(E) = 1 if ω₀ ∈ E,
δ_{ω₀}(E) = 0 if ω₀ ∉ E.    (2.5)

In other words,

δ_{ω₀}(E) = I_E(ω₀),    (2.6)

where I_E is the indicator function.
2.2.6.2 Counting measure

Definition 18 The counting measure over (N, 𝒫(N)) is defined by

μ(.) = Σ_{n=0}^{+∞} δ_n(.).    (2.7)

It is easy to prove that μ is a measure. Moreover, it is a discrete measure over (N, 𝒫(N)) since D = N is countable and μ(N \ N) = μ(∅) = 0. Moreover, this measure is called the counting measure because

∀E ∈ 𝒫(N), μ(E) = Σ_{n=0}^{+∞} δ_n(E) = number of elements of E.    (2.8)

Remark 19 The counting measure can also be defined over (R, B(R)) in the same way (i.e., μ(.) = Σ_{n=0}^{+∞} δ_n(.)). In this case, the measure is still discrete since μ(R \ N) = 0 and, ∀E ∈ B(R), μ(E) is the number of integers in E.
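A minimal sketch of these two measures (the helper names are hypothetical, not course code): the Dirac measure δ_{ω₀} as a closure over the point ω₀, and the counting measure of a finite set of non-negative integers as a sum of Dirac measures, as in Eqn. (2.7):

```python
def dirac(omega0):
    """Return the Dirac measure at omega0: E -> 1 if omega0 in E, else 0."""
    return lambda E: 1 if omega0 in E else 0

def counting_measure(E):
    """Counting measure of a finite set E of non-negative integers:
    sum of Dirac measures delta_n(E), i.e. the number of elements of E."""
    return sum(dirac(n)(E) for n in range(max(E) + 1)) if E else 0

E = {2, 5, 7}
print(dirac(5)(E), dirac(3)(E))   # 1 0
print(counting_measure(E))        # 3
```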
2.2.6.3 Lebesgue measure over (R, B(R))

Theorem 20 (and Definition) There exists a unique measure over (R, B(R)), denoted λ and called the Lebesgue measure, such that³ for A = ]a, b[ or A = [a, b] or A = [a, b[ or A = ]a, b],

λ(A) = b − a.    (2.9)

Consequently, over R, the Lebesgue measure represents the length.

Proof. Admitted.

Corollary 21 The Lebesgue measure is continuous, i.e., ∀x ∈ R, λ({x}) = 0.

Proof. If A = [x, x], λ(A) = x − x = 0.

Corollary 22 The Lebesgue measure of any finite or countable subset of R is equal to zero.

Proof. Let A ⊂ R such that A = {a_n}_{n∈I}, where I is a countable or finite index set. Then, A = ∪_{n∈I} {a_n} ∈ B(R) and, by σ-additivity, λ(∪_{n∈I} {a_n}) = Σ_{n∈I} λ({a_n}) = 0.
2.2.6.4 Lebesgue measure over (R^d, B(R^d))

Theorem 23 (and Definition) There exists a unique measure over (R^d, B(R^d)), denoted λ_d and called the Lebesgue measure, such that⁴ for A = ∏_{i=1}^d ]a_i, b_i[ or A = ∏_{i=1}^d [a_i, b_i] or A = ∏_{i=1}^d [a_i, b_i[ or A = ∏_{i=1}^d ]a_i, b_i],

λ_d(A) = ∏_{i=1}^d (b_i − a_i).    (2.10)

Consequently, over R² and R³, the Lebesgue measure represents the surface and the volume, respectively.

Proof. Admitted.

³ Remember that ]a, b[ ∈ B(R) ∀(a, b) ∈ R².
⁴ Remember that ]a, b[ ∈ B(R) ∀(a, b) ∈ R².
2.2.6.5 Lebesgue measure over (C, B(C)) and (C^d, B(C^d))

To be updated.
2.2.7 Probabilities and probability spaces

Definition 24 Any measure P over (Ω, ℱ) such that P(Ω) = 1 is called a probability. The triple (Ω, ℱ, P) is called a probability space.

Note that the definition claims that P is a measure on (Ω, ℱ) such that P(Ω) = 1. It means that a measure on (Ω, ℱ) is a (positive) function which assigns a real number to elements of ℱ (made from Ω). In probability theory, the particular measure is such that P(Ω) = 1. In other words⁵, 0 ≤ P(E) ≤ 1 always. If P(E) = 1, the event is certain (it must happen as a result of the experiment without fail). If P(E) = 0, the event is impossible (it cannot happen as a result of the experiment).

The physical interpretation of the concept of probability is connected to the notion of limit frequency. Indeed, the relative frequency of occurrence of an event, in a number of repetitions of the experiment, is a measure of the probability of that event. Thus, if N_T is the total number of trials and N_X is the number of trials where the event X occurred, the probability P(X) of the event occurring will be approximated by the relative frequency as follows:

P(X) ≈ N_X / N_T.    (2.11)

A further and more controversial claim is that in the "long run," as the number of trials approaches infinity, the relative frequency will converge exactly to the probability:

P(X) = lim_{N_T→∞} N_X / N_T.    (2.12)
Note that one objection to this is that we can only ever observe a finite sequence, and thus the extrapolation to the infinite involves unwarranted metaphysical assumptions. This conflicts with the standard claim that the frequency interpretation is somehow more "objective" than other probability theories.
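The approximation (2.11) is easy to observe numerically. The sketch below (the seed and trial count are arbitrary choices, not from the course) estimates P(head) = 1/2 for a fair coin from the relative frequency of 100 000 simulated tosses:

```python
# Monte Carlo illustration of Eqn. (2.11): the relative frequency N_X / N_T
# of "heads" in N_T fair coin tosses is close to P(X) = 1/2.
import random

random.seed(0)                        # arbitrary seed for reproducibility
N_T = 100_000                         # total number of trials
N_X = sum(1 for _ in range(N_T) if random.random() < 0.5)  # heads count
freq = N_X / N_T                      # relative frequency
print(abs(freq - 0.5) < 0.01)         # the frequency is close to 1/2
```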
Other kinds of interpretations of the concept of probability exist (see, for example, logical, epistemic and inductive probability, propensity probability, and, maybe the most important, Bayesian probability).
2.2.8 Examples of discrete probability measures

Definition 25 The Dirac measure is a discrete probability. Indeed, it is easy to prove that δ_{ω₀}(Ω) = 1.

⁵ Literally, P(E) means the probability that the event E occurs.
Definition 26 General discrete probability: Let (ω_n)_{n∈I} be a finite or countable family of elements of Ω and let (p_n)_{n∈I} be a family of real numbers such that ∀n ∈ I, p_n ≥ 0 and Σ_{n∈I} p_n = 1. Then,

P(.) = Σ_{n∈I} p_n δ_{ω_n}(.)    (2.13)

is a probability over (Ω, 𝒫(Ω)) or any measurable space (Ω, ℱ) where ∀n ∈ I, {ω_n} ∈ ℱ. It is easy to prove that P is a measure. Moreover, P(Ω) = Σ_{n∈I} p_n δ_{ω_n}(Ω) = Σ_{n∈I} p_n = 1.

Note that we have

∀n ∈ I, P({ω_n}) = p_n,    (2.14)

and

∀ω ∉ {ω_n}_{n∈I}, P({ω}) = 0.    (2.15)
From this probability, it is easy to define equiprobability. Indeed, if Ω is a finite set, one chooses p_n = 1/Card Ω ∀n. Consequently, for any E ⊂ Ω, one has

P(E) = (1/Card Ω) Σ_{n∈I} δ_{ω_n}(E) = Card E / Card Ω.    (2.16)
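A quick sketch of Eqn. (2.16) for a fair die (illustrative only); exact rational arithmetic keeps P(E) = Card E / Card Ω exact:

```python
# Equiprobability on a finite sample space: P(E) = Card E / Card Ω.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}           # fair six-sided die

def prob(E):
    """Probability of the event E under equiprobability over omega."""
    return Fraction(len(E & omega), len(omega))

print(prob({2, 4, 6}))   # 1/2
print(prob({1}))         # 1/6
```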
Remark 27 The aforementioned counting measure and the Lebesgue measure are not probabilities since μ(Ω) = +∞.
2.3 How to play with probabilities? Fundamental properties

2.3.1 Probabilities of the union, intersection, and complement of events
Definition 28 Mutually exclusive events: let (A, B) ∈ ℱ² be two events. A and B are mutually exclusive if A ∩ B = ∅.

Corollary 29 Countable additivity probability axiom: Let (E_n)_{n=1,...,N} ∈ ℱ be a countable set of events. If the (E_n)_{n∈N} are mutually exclusive, then

P(E₁ ∪ E₂ ∪ ··· ∪ E_N) = P(∪_{n=1}^N E_n) = Σ_{n=1}^N P(E_n).    (2.17)

Proof. This comes from the measure definition.
Theorem 30 Let (A, B) ∈ ℱ² be two events such that B ⊂ A. Then,

P(B) ≤ P(A).    (2.18)

Proof. Let C be the event such that A = B ∪ C; note that B and C are mutually exclusive. Then, P(A) = P(B) + P(C). Since, by definition, P(C) ≥ 0, one obtains the result. This result is true for any measure.

Note that the converse is wrong, i.e., P(B) ≤ P(A) does not imply B ⊂ A. For example, for tossing a fair six-sided die (i.e., Ω = {Face 1, Face 2, Face 3, Face 4, Face 5, Face 6}), the event E = {Face 2, Face 4, Face 6} has a probability P(E) = 1/2 ≥ P({Face 1}) = 1/6, and yet {Face 1} ⊄ E.
Theorem 31 Let (A, B) ∈ ℱ² be two events. Then,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).    (2.19)
Proof. Let A₁ be the event such that A = A₁ ∪ (A ∩ B) and let B₁ be the event such that B = B₁ ∪ (A ∩ B). Consequently, A ∪ B = A₁ ∪ (A ∩ B) ∪ B₁. Note that A₁, B₁ and (A ∩ B) are mutually exclusive. Then, P(A ∪ B) = P(A₁) + P(A ∩ B) + P(B₁). Moreover, P(A) = P(A₁) + P(A ∩ B) and P(B) = P(A ∩ B) + P(B₁). Consequently, P(A ∪ B) = P(A) + P(B) − P(A ∩ B). This result is true for any measure.
A generalization of the previous theorem can be done, leading to the so-called Poincaré formula:

Theorem 32 Let (E_n)_{n=1,...,N} ∈ ℱ be a finite set of events. Then,

P(∪_{n=1}^N E_n) = Σ_n P(E_n) − Σ_{i<j} P(E_i ∩ E_j) + Σ_{i<j<k} P(E_i ∩ E_j ∩ E_k) − ··· + (−1)^{N−1} P(∩_{n=1}^N E_n).    (2.20)

Proof. To be updated.

For example, when N = 3, one obtains

P(E₁ ∪ E₂ ∪ E₃) = P(E₁) + P(E₂) + P(E₃)
− P(E₁ ∩ E₂) − P(E₂ ∩ E₃) − P(E₁ ∩ E₃)
+ P(E₁ ∩ E₂ ∩ E₃).    (2.21)
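The N = 3 case (2.21) can be verified exhaustively under equiprobability on a die; the three events below are arbitrary illustrative choices:

```python
# Numerical check of the Poincaré formula (2.21) for N = 3.
from fractions import Fraction

omega = frozenset(range(1, 7))                     # fair die
P = lambda E: Fraction(len(E), len(omega))         # equiprobability
E1, E2, E3 = frozenset({1, 2}), frozenset({2, 3, 4}), frozenset({4, 5})

lhs = P(E1 | E2 | E3)
rhs = (P(E1) + P(E2) + P(E3)
       - P(E1 & E2) - P(E2 & E3) - P(E1 & E3)
       + P(E1 & E2 & E3))
print(lhs == rhs, lhs)   # True 5/6
```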
Theorem 33 Boole inequality. Let (E_n)_{n=1,...,N} ∈ ℱ be a finite set of events. Then,

P(∪_{n=1}^N E_n) ≤ Σ_{n=1}^N P(E_n).    (2.22)

Proof. From (2.19), the formula is satisfied for N = 2. Assume that it is still correct at order N − 1; then, at order N, we have

P(∪_{n=1}^N E_n) = P((∪_{n=1}^{N−1} E_n) ∪ E_N) = P(∪_{n=1}^{N−1} E_n) + P(E_N) − P((∪_{n=1}^{N−1} E_n) ∩ E_N)
≤ P(∪_{n=1}^{N−1} E_n) + P(E_N) ≤ Σ_{n=1}^{N−1} P(E_n) + P(E_N) = Σ_{n=1}^N P(E_n).
Theorem 34 Let E ∈ ℱ be an event and let Ē be its complementary event. Then,

P(E) + P(Ē) = 1.    (2.23)

Proof. Since E ∩ Ē = ∅, P(E ∪ Ē) = P(E) + P(Ē). Moreover, P(E ∪ Ē) = P(Ω) = 1.
2.3.2 Conditional probabilities, Bayes theorem and marginalization

The following definition is fundamental in probability. It deals with the probability of an event given the knowledge of another. For instance, consider the probability of getting a heart from a deck of 32 cards. Without any other knowledge, the natural probability is clearly 8/32 = 1/4; but if we know that the selected card is red, the probability becomes 8/16 = 1/2. We see with this simple example that any additional knowledge (should) modify the probability assigned to each event.

More formally, consider a probability space (Ω, ℱ, P) and an event B ∈ ℱ with P(B) > 0. If we know that B is realized (i.e., ω ∈ B), then every event A ∈ ℱ is replaced by A ∩ B. Then the application μ : ℱ → R₊, A ↦ P(A ∩ B), is a measure over ℱ but not a probability, since μ(Ω) = P(Ω ∩ B) = P(B) ≠ 1. However, by normalizing by P(B), we obtain a new application, denoted P(A | B), which is a probability.
Definition 35 Given a probability space (Ω, ℱ, P) and two events A ∈ ℱ and B ∈ ℱ with P(B) > 0, the conditional probability of A given B is defined by

P(A | B) = P(A ∩ B) / P(B).    (2.24)

If P(A) > 0, the conditional probability of B given A is defined by

P(B | A) = P(A ∩ B) / P(A).    (2.25)

In other words, the conditional probability P(A | B) is the probability of some event A, given the occurrence of some other event B; it is read "the probability of A, given B". In the same way, the conditional probability P(B | A) is the probability of some event B, given the occurrence of some other event A; it is read "the probability of B, given A".
Example 36 In the case of tossing a fair six-sided die (i.e., Ω = {Face 1, Face 2, Face 3, Face 4, Face 5, Face 6}): if we consider the event A of getting Face 2 the first time a die is rolled and the event B that the sum of the face numbers seen on the first and second trials is 8, we obtain: P(A) = 1/6, P(B) = 5/36 and P(B | A) = P(getting Face 6 the second time) = 1/6.
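Example 36 can be checked by brute-force enumeration of the 36 equally likely outcomes of two rolls (an illustrative sketch, not course code):

```python
# Exhaustive check of Example 36 over the 36 outcomes of two die rolls.
from fractions import Fraction

outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
A = [(i, j) for (i, j) in outcomes if i == 2]          # Face 2 on first roll
B = [(i, j) for (i, j) in outcomes if i + j == 8]      # sum of faces equals 8
AB = [o for o in A if o in B]                          # A ∩ B

P = lambda E: Fraction(len(E), len(outcomes))
print(P(A), P(B))        # 1/6 5/36
print(P(AB) / P(A))      # P(B|A) = 1/6
```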
Remark 37 Let (E_n)_{n=1,...,N} ∈ ℱ be a countable set of events. If the E_n are mutually exclusive, then

P(∪_n E_n | A) = P((∪_n E_n) ∩ A) / P(A) = P(∪_n (E_n ∩ A)) / P(A) = Σ_n P(E_n ∩ A) / P(A) = Σ_n P(E_n | A).    (2.26)
Theorem 38 Bayes theorem: Given a probability space (Ω, ℱ, P) and two events A ∈ ℱ and B ∈ ℱ with P(A) > 0 and P(B) > 0, one has

P(A | B) = P(B | A) P(A) / P(B).    (2.27)

Proof. Obvious by combining Eqn. (2.24) and Eqn. (2.25).
Intuitively, Bayes theorem in this form describes the way in which one's beliefs about observing A are updated by having observed B.
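As a sketch, Bayes theorem (2.27) can be verified on the two-dice events of Example 36: computing P(A|B) via P(B|A) P(A) / P(B) must agree with the direct ratio P(A ∩ B) / P(B):

```python
# Verify Bayes theorem (2.27) on the events of Example 36.
from fractions import Fraction

outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = lambda E: Fraction(len(E), len(outcomes))
A = [(i, j) for (i, j) in outcomes if i == 2]          # Face 2 first
B = [(i, j) for (i, j) in outcomes if i + j == 8]      # sum equals 8
AB = [o for o in A if o in B]

direct = P(AB) / P(B)                     # definition (2.24) of P(A|B)
bayes = (P(AB) / P(A)) * P(A) / P(B)      # P(B|A) P(A) / P(B)
print(direct == bayes, direct)            # True 1/5
```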
Definition 39 Marginal, prior, posterior and joint probabilities: In the framework of Bayes theorem, P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A | B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from, or depends upon, the specified value of B. P(B | A) is the conditional probability of B given A. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. P(A ∩ B) is the joint probability of A and B.
Definition 40 Marginalization: the marginal probability of an event is obtained by summing the joint probabilities over all outcomes. Let (E_n)_{n=1,...,N} ∈ ℱ be a set of countable events which are mutually exclusive and whose union is Ω (i.e., a partition of Ω). Then,

P(M) = Σ_{n=1}^N P(M ∩ E_n) = Σ_{n=1}^N P(M | E_n) P(E_n).    (2.28)
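A sketch of Eqn. (2.28) (illustrative only, with the first-roll face as the partition E₁, . . . , E₆): summing P(M | Eₙ) P(Eₙ) recovers P(M) for the event "the sum of two rolls equals 8":

```python
# Marginalization (2.28) over the partition given by the first-roll face.
from fractions import Fraction

outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = lambda E: Fraction(len(E), len(outcomes))
M = [(i, j) for (i, j) in outcomes if i + j == 8]      # sum equals 8
partition = [[(i, j) for (i, j) in outcomes if i == n] for n in range(1, 7)]

# P(M) = sum_n P(M | E_n) P(E_n)
total = sum(P([o for o in M if o in E_n]) / P(E_n) * P(E_n)
            for E_n in partition)
print(total == P(M), total)   # True 5/36
```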
2.3.3 Independence in probability (part 1/3)

In this subsection, we define the concept of statistical independence between events and between σ-algebras. The concept of statistical independence between random variables (which have not yet been defined) and the relationships between all the kinds of independence (events, σ-algebras and random variables) will be given in the next chapter.

2.3.3.1 Independence of two events

Definition 41 Independence of events: Let A ∈ ℱ and B ∈ ℱ be two events. A and B are said to be independent or statistically independent if and only if

P(A ∩ B) = P(A) P(B).    (2.29)

And we use the notation A ⊥ B.

To say that two events are statistically independent intuitively means that the occurrence of one event makes it neither more nor less probable that the other occurs.
Example 42 The event of getting Face 6 the first time a die is rolled and the event of getting Face 6 the second time are independent. By contrast, if we consider the event A of getting Face 2 the first time a die is rolled and the event B that the sum of the numbers seen on the first and second trials is 8, we obtain: P(A) = 1/6, P(B) = 5/36 and P(B | A) = P(getting Face 6 the second time) = 1/6. Since P(B | A) ≠ P(B), A and B are not independent.
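Both claims of Example 42 can be confirmed by enumeration (an illustrative sketch):

```python
# Independence checks for Example 42 via Definition 41: P(A ∩ B) = P(A) P(B).
from fractions import Fraction

outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = lambda E: Fraction(len(E), len(outcomes))

six_first = [(i, j) for (i, j) in outcomes if i == 6]    # Face 6 first roll
six_second = [(i, j) for (i, j) in outcomes if j == 6]   # Face 6 second roll
both = [o for o in six_first if o in six_second]
print(P(both) == P(six_first) * P(six_second))           # True: independent

A = [(i, j) for (i, j) in outcomes if i == 2]            # Face 2 first roll
B = [(i, j) for (i, j) in outcomes if i + j == 8]        # sum equals 8
AB = [o for o in A if o in B]
print(P(AB) == P(A) * P(B))                              # False: dependent
```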
Remark 43 From Eqn. (2.24) and Eqn. (2.25), the independence of two events A and B implies that

P(A | B) = P(A),    (2.30)

and

P(B | A) = P(B).    (2.31)

In other words, the conditional probability of A given B is the same as the marginal probability of A, and the conditional probability of B given A is the same as the marginal probability of B.
2.3.3.2 Independence of a finite family of events

The generalization to more than two events is straightforward.

Definition 44 Mutual independence: Let (E_n)_{n=1,...,N} ∈ ℱ be a finite set of events. These events are mutually independent if and only if

P(∩_{n=1}^N E_n) = ∏_{n=1}^N P(E_n),    (2.32)

and, more generally, the same product rule holds for every sub-family (E_n)_{n∈K}, K ⊂ {1, . . . , N}. And we use the notation ⊥_{n=1}^N E_n.

Remark 45 Mutual independence and mutual exclusivity are two different concepts. Remember that two events A and B are mutually exclusive if and only if A ∩ B = ∅. Then P(A ∩ B) = 0. Therefore, if P(B) > 0, then P(A | B) is defined and equal to 0.
2.3.3.3 Independence of a countable family of events

Definition 46 Mutual independence: Let (E_n)_{n∈I} ∈ ℱ be a countable set of events. These events are mutually independent if and only if, ∀K such that K is a finite subset of I, the family (E_n)_{n∈K} is mutually independent.
2.3.3.4 Independence of σ-algebras

Definition 47 Let (Ω, ℱ) be a measurable space and (ℱ_n)_{n=1,...,N} ⊂ ℱ a finite family of σ-algebras. This family is independent if, for every family of events E_n ∈ ℱ_n, one has

P(∩_{n=1}^N E_n) = ∏_{n=1}^N P(E_n).    (2.33)

We use the notation ⊥_{n=1}^N ℱ_n.
2.4 Topological spaces and σ-algebras

This section is here for the interested reader and can be skipped because it is out of the scope of this course.

Let us first recall the concepts of topology and topological space and their main properties. Topology is an area of mathematics unifying and generalizing the well-known concepts of limits, continuity, and convergence of functions from R^d → R^p to more abstract spaces.

Definition 48 Let X be a set. A topology on X is a family 𝒯 of subsets of X such that

X ∈ 𝒯 and ∅ ∈ 𝒯,
O_n ∈ 𝒯 for a finite family ⇒ ∩_{n=1}^N O_n ∈ 𝒯,
O_n ∈ 𝒯 for an arbitrary (possibly infinite) family ⇒ ∪_n O_n ∈ 𝒯.    (2.34)

The elements of 𝒯 are called the open sets of X, and their complements in X are called closed sets.

Definition 49 The couple (X, 𝒯), where 𝒯 is a topology on X, is called a topological space.

Definition 50 Let (X, 𝒯) be a topological space. We call the Borel σ-algebra over X the σ-algebra generated by the open sets of X, i.e. σ(𝒯).

Definition 51 Let (X, 𝒯) and (Y, 𝒮) be two topological spaces. An application f from X to Y is called continuous if

∀S ∈ 𝒮, f⁻¹(S) ∈ 𝒯.    (2.35)
To be updated.
2.5 Summary

From a set Ω which represents all the possible outcomes of an experiment, we have built another set ℱ (called a σ-algebra) which contains all the reasonable (coherent) subsets of Ω on which we want to assign a probability, i.e., a measurable set. We have shown that for the particular and important case where Ω = R or R^d, a good corresponding σ-algebra is the Borel σ-algebra. Then, we have seen that the action of "assigning a probability" is in fact to measure the σ-algebra. Depending on the needs, we have introduced the main measures useful in probability theory. Moreover, from the properties of σ-algebras and measures, we have introduced different formulas to calculate probabilities from an experiment concerning the inclusion, intersection, and union of events. Some other definitions which naturally appear when one wants to play with probabilities have been given, such as conditional and marginal probabilities. Finally, the first concepts of independence have been presented. So, we are able to play with events and probabilities of events inside a probability space (Ω, ℱ, P). The goal of the next chapters is to connect this probability space (Ω, ℱ, P) to another measurable space (Ω′, 𝒜), which will mainly be (R^d, B(R^d)), d ≥ 1, by way of an application that we will call a random variable, and to see how the probability P is transported from (Ω, ℱ, P) to (Ω′, 𝒜).
Chapter 3

Random variables

As we will see, a random variable associates (with respect to a regularity condition defined below) to each elementary event ω a real number X(ω). For example, for a coin with Ω = {head, tail}, one can associate the value 0 to the event head and the value 1 to the event tail. It comes X(head) = 0 and X(tail) = 1. In this simple case, it is easy to see that X⁻¹(0) = head and that X⁻¹(1) = tail. Consequently, the random variable is just an application from Ω to Ω′ = {0, 1} such that we can build a σ-algebra over Ω′ from a σ-algebra over Ω. This is the kind of property that we will need to work with random variables, and the goal of the first part of this chapter is to define this concept of regularity (that we will call measurable functions), which will naturally lead to the definition of a random variable.
3.1 Measurable applications/functions

In this section, we will introduce the class of measurable applications/functions. Then, we will study the main properties of these functions, i.e., which operations over measurable applications/functions still produce measurable ones, and the limits of measurable applications/functions.
3.1.1 Definitions and first properties

Remark 52 Let Ω and Ω′ be two sets. Let I be a finite, countable or infinite index set. An application f from Ω to Ω′ satisfies: f⁻¹(∅) = ∅; f⁻¹(Ω′) = Ω; ∀F ⊂ Ω′, f⁻¹(F̄) is the complement of f⁻¹(F); ∀(F_i)_{i∈I}, ∪_{i∈I} f⁻¹(F_i) = f⁻¹(∪_{i∈I} F_i); ∀(F_i)_{i∈I}, ∩_{i∈I} f⁻¹(F_i) = f⁻¹(∩_{i∈I} F_i).
Theorem 53 Let f be an application from Ω₁ to Ω₂ and ℱ₂ be a σ-algebra over Ω₂. Then, ℱ₁ = f⁻¹(ℱ₂) is a σ-algebra over Ω₁.

Proof. (i) By definition, Ω₂ ∈ ℱ₂, then f⁻¹(Ω₂) = Ω₁ ∈ ℱ₁. (ii) Let E ∈ ℱ₁; then there exists F ∈ ℱ₂ such that f⁻¹(F) = E. Moreover, by definition, F̄ ∈ ℱ₂, and then f⁻¹(F̄), the complement of f⁻¹(F), equals Ē ∈ ℱ₁. (iii) Let (E_n)_{n∈I} ∈ ℱ₁, with I a countable index set; then there exists (F_n)_{n∈I} ∈ ℱ₂ such that f⁻¹(F_n) = E_n ∀n ∈ I. Moreover, by definition, ∪_{n∈I} F_n ∈ ℱ₂, then f⁻¹(∪_{n∈I} F_n) = ∪_{n∈I} f⁻¹(F_n) = ∪_{n∈I} E_n ∈ ℱ₁.
Definition 54 Let (Ω, ℱ) and (Ω′, 𝒜) be two measurable spaces. An application f from Ω to Ω′ is called measurable from (Ω, ℱ) to (Ω′, 𝒜) if, ∀A ∈ 𝒜, one has f⁻¹(A) ∈ ℱ.

Note that from Theorem 53, we know that f⁻¹(𝒜) is a σ-algebra. Then the definition is equivalent to f⁻¹(𝒜) ⊂ ℱ.

Remark 55 Measurable applications play the same role for measurable spaces as continuous applications for topological spaces.
Definition 56 Let (Ω, ℱ) and (R, B(R)) be two measurable spaces. A measurable application f from Ω to R is called a measurable function.

Definition 57 Let (R^d, B(R^d)) and (R^p, B(R^p)) be two measurable spaces¹. A measurable function f from R^d to R^p is called a Borel function.

Definition 58 Let (f_i)_{i∈I} be a family of functions from (Ω, ℱ) to (Ω′, 𝒜). The σ-algebra generated by (f_i)_{i∈I}, denoted σ((f_i)_{i∈I}), is the smallest σ-algebra containing f_i⁻¹(𝒜) ∀i. In other words, it is the smallest σ-algebra over Ω making all the f_i, i ∈ I, measurable.
The next theorem is important because it connects the concept of measurable set and the concept of measurable function.

Theorem 59 Let (Ω, ℱ) and (R, B(R)) be two measurable spaces and E ⊂ Ω. Let I_E be the indicator function, i.e. the application

I_E : Ω → R : e ↦ I_E(e) = 1 if e ∈ E, 0 else.    (3.1)

Then, E is measurable (i.e., E ∈ ℱ) if and only if I_E is a measurable function from (Ω, ℱ) to (R, B(R)). One can also say that I_E is a measurable function from (Ω, ℱ) to (R, B(R)) for any E ∈ ℱ.

Proof. Since ∀x ∈ R, ]x, +∞[ ∈ B(R), we have

∀x ∈ R, I_E⁻¹(]x, +∞[) = ∅ if x ≥ 1; E if 0 ≤ x < 1; Ω if x < 0.    (3.2)

Consequently, I_E is a measurable function from (Ω, ℱ) to (R, B(R)) if and only if ∀x ∈ R, I_E⁻¹(]x, +∞[) ∈ ℱ. By definition, ∅ and Ω belong to ℱ, so I_E is a measurable function from (Ω, ℱ) to (R, B(R)) if and only if E ∈ ℱ.
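The case analysis (3.2) can be made concrete on a finite Ω (illustrative sketch; the sets are arbitrary):

```python
# Preimages of ]x, +inf[ under the indicator function I_E, as in Eqn. (3.2).
omega = frozenset({1, 2, 3, 4})
E = frozenset({2, 3})
I_E = lambda e: 1 if e in E else 0      # indicator function of E

def preimage(x):
    """Return I_E^{-1}(]x, +inf[) = {e in omega : I_E(e) > x}."""
    return frozenset(e for e in omega if I_E(e) > x)

print(preimage(1.5) == frozenset())   # x >= 1      -> empty set
print(preimage(0.5) == E)             # 0 <= x < 1  -> E
print(preimage(-1.0) == omega)        # x < 0       -> omega
```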
The goal of the next theorems is to show what kinds of applications/functions are measurable and what kinds of operations between measurable applications/functions lead to measurable applications/functions.

Theorem 60 Let (Ω, ℱ) and (Ω′, 𝒜) be two measurable spaces, and let ℱ = 𝒫(Ω). Any application f from Ω to Ω′ is measurable from (Ω, 𝒫(Ω)) to (Ω′, 𝒜).

Proof. Obvious.
Theorem 61 Let (Ω, ℱ) be a measurable space and Ω′ be a set. Let f be an application from Ω to Ω′. It is always possible to build a σ-algebra 𝒜 over Ω′ such that f is measurable from (Ω, ℱ) to (Ω′, 𝒜).

Proof. Obviously, one can choose 𝒜 = {∅, Ω′}, since f⁻¹(∅) = ∅ ∈ ℱ and f⁻¹(Ω′) = Ω ∈ ℱ. This is the smallest possible σ-algebra. One can also choose 𝒜 = {A ⊂ Ω′ : f⁻¹(A) ∈ ℱ}. It is clear that f is measurable if 𝒜 is a σ-algebra. Moreover, (i) Ω′ ∈ 𝒜 since f⁻¹(Ω′) = Ω ∈ ℱ. (ii) If A ∈ 𝒜, then f⁻¹(A) ∈ ℱ ⇒ the complement of f⁻¹(A) belongs to ℱ ⇒ f⁻¹(Ā) ∈ ℱ ⇒ Ā ∈ 𝒜. (iii) If (A_i)_{i∈I} ∈ 𝒜, where I is at most a countable index set, then ∀i, f⁻¹(A_i) ∈ ℱ ⇒ ∪_{i∈I} f⁻¹(A_i) ∈ ℱ ⇒ f⁻¹(∪_{i∈I} A_i) ∈ ℱ ⇒ ∪_{i∈I} A_i ∈ 𝒜.
¹ This definition is valid for any topological spaces with their respective Borel σ-algebras. Namely, if (X, 𝒯) and (Y, 𝒮) are two topological spaces and (X, σ(𝒯)) and (Y, σ(𝒮)) are the two associated measurable spaces, then a measurable function f from X to Y, i.e. such that ∀S ∈ σ(𝒮), one has f⁻¹(S) ∈ σ(𝒯), is called a Borel function.
Theorem 62 Let Ω be a set and (Ω′, 𝒜) be a measurable space. Let f be an application from Ω to Ω′. It is always possible to build a σ-algebra ℱ over Ω such that f is measurable from (Ω, ℱ) to (Ω′, 𝒜). The smallest such σ-algebra is the one generated by f.

Proof. By the previous theorem, ℱ = 𝒫(Ω) is a correct choice. However, one can also choose ℱ = {f⁻¹(A) : A ∈ 𝒜}. Of course, if ℱ is a σ-algebra, then f is measurable. We conclude the proof by using Theorem 53, which proves that ℱ is a σ-algebra.
Theorem 63 Let f be an application from Ω to Ω′. Let 𝒜 be a family of subsets of Ω′ and σ(𝒜) the σ-algebra generated by 𝒜. Then, a necessary and sufficient condition for f to be measurable from (Ω, ℱ) to (Ω′, σ(𝒜)) is that ∀A ∈ 𝒜, f⁻¹(A) ∈ ℱ, or equivalently f⁻¹(𝒜) ⊂ ℱ.

Note that the previous theorem is slightly different from the classical definition. Indeed, one does not have to check all the elements of σ(𝒜), but only those of 𝒜, which is not a σ-algebra.

Proof. Since 𝒜 ⊂ σ(𝒜), the necessary condition f⁻¹(σ(𝒜)) ⊂ ℱ ⇒ f⁻¹(𝒜) ⊂ ℱ is obvious. Let us now prove the sufficient condition, i.e. that f⁻¹(𝒜) ⊂ ℱ ⇒ f⁻¹(σ(𝒜)) ⊂ ℱ. Let 𝒜′ = {A ⊂ Ω′ : f⁻¹(A) ∈ ℱ}. From Theorem 61, 𝒜′ is a σ-algebra over Ω′ and f is measurable from (Ω, ℱ) to (Ω′, 𝒜′). Moreover, it is clear that 𝒜 ⊂ 𝒜′; then, since σ(𝒜) is the smallest σ-algebra containing 𝒜, σ(𝒜) ⊂ σ(𝒜′) = 𝒜′. Finally, f⁻¹(σ(𝒜)) ⊂ f⁻¹(𝒜′) ⊂ ℱ.
3.1.2 Operations on measurable functions leading to measurable functions

Theorem 64 Let (Ω, ℱ) and (Ω′, 𝒜) be two measurable spaces. The constant application defined such that ∀ω ∈ Ω, f(ω) = a, with a ∈ Ω′ fixed, is measurable from (Ω, ℱ) to (Ω′, 𝒜).

Proof. ∀A ∈ 𝒜, f⁻¹(A) = Ω (if a ∈ A) or ∅ (if a ∉ A). Since Ω and ∅ belong to ℱ by definition, this concludes the proof.
Theorem 65 Let (E, ℰ), (F, ℱ), and (G, 𝒢) be three measurable spaces. Let f be a measurable application from (E, ℰ) to (F, ℱ) and let g be a measurable application from (F, ℱ) to (G, 𝒢). Then, the application h = g ∘ f is a measurable application from (E, ℰ) to (G, 𝒢).

Proof. By definition, h = g ∘ f is a measurable application from (E, ℰ) to (G, 𝒢) if ∀A ∈ 𝒢, h⁻¹(A) ∈ ℰ. However, h⁻¹(A) = f⁻¹(g⁻¹(A)). Since ∀A ∈ 𝒢, g⁻¹(A) ∈ ℱ and f⁻¹(ℱ) ⊂ ℰ, one has ∀A ∈ 𝒢, h⁻¹(A) ∈ ℰ.
Theorem 66 Let (ℝ^d, ℬ(ℝ^d)) and (ℝ^p, ℬ(ℝ^p)) be two measurable spaces². Any continuous function f from ℝ^d to ℝ^p is a Borel function (i.e., measurable) from (ℝ^d, ℬ(ℝ^d)) to (ℝ^p, ℬ(ℝ^p)).
Proof. To be updated.
Theorem 67 Let (f_i)_{i=1,...,n} be a finite family of measurable functions from (Ω, ℱ) to (ℝ, ℬ(ℝ)). Let g be a (Borel) function from (ℝⁿ, ℬ(ℝⁿ)) to (ℝ^p, ℬ(ℝ^p)). Then³, the function h from Ω to ℝ^p defined by h = g(f₁, f₂, ..., fₙ) is a measurable function from (Ω, ℱ) to (ℝ^p, ℬ(ℝ^p)).
Proof. Let f = (f₁, f₂, ..., fₙ) be the application from Ω to ℝⁿ. Consequently, h = g ∘ f and, from Theorem 65, since g is a measurable function from (ℝⁿ, ℬ(ℝⁿ)) to (ℝ^p, ℬ(ℝ^p)), h will be measurable from (Ω, ℱ) to (ℝ^p, ℬ(ℝ^p)) if f is measurable from (Ω, ℱ) to (ℝⁿ, ℬ(ℝⁿ)). To be updated.
² This result is valid for any topological spaces with their respective Borel σ-algebras. Namely, if (X, 𝒯) and (Y, 𝒮) are two topological spaces and (X, σ(𝒯)) and (Y, σ(𝒮)) the two associated measurable spaces, then any continuous function f from X to Y is a Borel function (i.e., measurable).
³ This result is valid for any topological space with its Borel σ-algebra and g continuous. Namely, let (Ω, ℱ) be a measurable space and (X, 𝒯) a topological space (i.e., (X, σ(𝒯)) is a measurable space). Let (f_i)_{i=1,...,n} be a finite family of measurable functions from (Ω, ℱ) to (ℝ, ℬ(ℝ)). Finally, let g be a continuous function (i.e., a Borel function, i.e., measurable, i.e., ∀T ∈ σ(𝒯), g⁻¹(T) ∈ ℬ(ℝⁿ)) from ℝⁿ to X. Then, the function h from (Ω, ℱ) to (X, σ(𝒯)) defined by h = g(f₁, f₂, ..., fₙ) is a measurable function from (Ω, ℱ) to (X, σ(𝒯)).
Statistical signal processing 19
20 CHAPTER 3. RANDOM VARIABLES
Corollary 68 A finite linear combination of measurable functions from (Ω, ℱ) to (ℝ, ℬ(ℝ)) is measurable from (Ω, ℱ) to (ℝ, ℬ(ℝ)).
Proof. Let g(x₁, ..., xₙ) = ∑_{i=1}^{n} a_i x_i, where ∀i, a_i ∈ ℝ. Since g is continuous from ℝⁿ to ℝ, from Theorem 66 g is measurable (Borel function) from (ℝⁿ, ℬ(ℝⁿ)) to (ℝ, ℬ(ℝ)). Consequently, h = ∑_{i=1}^{n} a_i f_i is a measurable function from (Ω, ℱ) to (ℝ, ℬ(ℝ)).
Corollary 69 Let (f_i)_{i=1,...,n} be a finite family of measurable functions from (Ω, ℱ) to (ℝ, ℬ(ℝ)). Then ∏_{i=1}^{n} (f_i)^{a_i}, with a_i ∈ ℤ and a_i > 0 if f_i can be equal to zero, is a measurable function from (Ω, ℱ) to (ℝ, ℬ(ℝ)).
Proof. Let g(x₁, ..., xₙ) = ∏_{i=1}^{n} (x_i)^{a_i}, with a_i ∈ ℤ and a_i > 0 if f_i can be equal to zero. Since g is continuous from ℝⁿ to ℝ, from Theorem 66 g is measurable (Borel function) from (ℝⁿ, ℬ(ℝⁿ)) to (ℝ, ℬ(ℝ)). Consequently, h = ∏_{i=1}^{n} (f_i)^{a_i} is a measurable function from (Ω, ℱ) to (ℝ, ℬ(ℝ)).
Corollary 70 Let ℳ be the set of the measurable functions from (Ω, ℱ) to (ℝ, ℬ(ℝ)). Then, (ℳ, +, ·, ×) is an ℝ-algebra⁴. Of course, this implies that (ℳ, +, ·) is a vector space over ℝ and that (ℳ, +) is an Abelian group.
Proof. To be updated.
Corollary 71 Let f₁ and f₂ be two measurable functions from (Ω, ℱ) to (ℝ, ℬ(ℝ)). Then max(f₁, f₂) and min(f₁, f₂) are measurable functions from (Ω, ℱ) to (ℝ, ℬ(ℝ)). Of course, since the constant application ∀ω ∈ Ω, g(ω) = 0 is measurable, for any measurable function f from (Ω, ℱ) to (ℝ, ℬ(ℝ)), f⁺ = max(f, 0) and f⁻ = −min(f, 0) are measurable functions from (Ω, ℱ) to (ℝ, ℬ(ℝ)).
Proof. Let g(x₁, x₂) = max(x₁, x₂) (or g(x₁, x₂) = min(x₁, x₂)). Since g is continuous from ℝ² to ℝ, from Theorem 66 g is measurable (Borel function) from (ℝ², ℬ(ℝ²)) to (ℝ, ℬ(ℝ)). Consequently, h = max(f₁, f₂) (or h = min(f₁, f₂)) is a measurable function from (Ω, ℱ) to (ℝ, ℬ(ℝ)).
Corollary 72 Let ℳ be the set of the measurable functions from (Ω, ℱ) to (ℝ, ℬ(ℝ)). Then, (ℳ, sup, inf) is a lattice.
3.1.3 Limit of measurable functions
To be updated.
⁴ Remember that (A, ⊕, ⊙), where ⊕ is an internal operation (A is closed under ⊕) and ⊙ is an external operation over a field K, is a vector space over K if:
• (A, ⊕) is an Abelian group (associativity, identity element, inverse element, and commutativity).
• ∀(u, v) ∈ A² and ∀(λ, μ) ∈ K²: (i) λ ⊙ (u ⊕ v) = (λ ⊙ u) ⊕ (λ ⊙ v); (ii) (λ + μ) ⊙ u = (λ ⊙ u) ⊕ (μ ⊙ u); (iii) λ ⊙ (μ ⊙ u) = (λμ) ⊙ u; (iv) ⊙ admits an identity element.
Remember that (A, ⊕, ⊙, ⊗), where ⊕ and ⊗ are internal operations (A is closed under ⊕ and ⊗) and ⊙ is an external operation, is an algebra over a field K if:
• (A, ⊕, ⊙) is a vector space over K.
• (A, ⊗) is a monoid (associativity, identity element) and ⊗ is distributive w.r.t. ⊕. (This bullet and the first bullet of the definition of a vector space mean that (A, ⊕, ⊗) is a ring.)
• ⊙ and ⊗ are such that ∀(u, v) ∈ A² and ∀λ ∈ K, λ ⊙ (u ⊗ v) = u ⊗ (λ ⊙ v) = (λ ⊙ u) ⊗ v.
3.1.4 Summary
In conclusion, the reader has to remember that:
• 𝟙_E (the indicator function) is a measurable function for any measurable set E.
• Any application f from (Ω, 𝒫(Ω)) to (E, 𝒜) is measurable.
• The constant application from (Ω, ℱ) to (E, 𝒜) is measurable.
• The composition of measurable applications is measurable.
• Any continuous function f from ℝ^d to ℝ^p, with their associated Borel σ-algebras, is measurable.
• The sum or the subtraction of two (or more) measurable functions is measurable (any finite linear combination of measurable functions is measurable).
• The product and the quotient (with denominator different from zero) of measurable functions are measurable. The integer power of a measurable function is a measurable function.
• The minimum and the maximum of measurable functions are measurable functions.
3.2 Random variables and probability distribution (part 1/2)
3.2.1 Definition of a random variable
Definition 73 Let (Ω, ℱ, P) be a probability space and (E, 𝒜) a measurable space. A measurable application X from (Ω, ℱ, P) to (E, 𝒜) is called a random variable.
Remark 74 From the two previous definitions, it will be possible to take the supremum or the limit of a family of random variables.
Remark 75 A random variable is not a variable in the classical sense but an application.
Definition 76 A random variable X from (Ω, ℱ, P) to (ℝ, ℬ(ℝ)) is called a real random variable.
3.2.2 Probability distribution
Definition 77 Let X be a random variable from (Ω, ℱ, P) to (E, 𝒜). The application denoted P_X from 𝒜 to [0, 1] such that
∀A ∈ 𝒜,  P_X(A) = P(X⁻¹(A)) = P({ω ∈ Ω : X(ω) ∈ A}),   (3.3)
is called the probability distribution of the random variable X.
Remark 78 The random variable X is totally characterized if we know P_X. This is why, in practice, we will not take care of (Ω, ℱ, P): only (E, 𝒜) and P_X will be studied.
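To make Definition 77 concrete, here is a toy sketch (the two-dice space and the map X are illustrative assumptions, not taken from the lecture): a random variable is just an application on Ω, and P_X is the pushforward of P through it.

```python
from fractions import Fraction

# Finite probability space: Omega = outcomes of two fair dice, P uniform.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = {w: Fraction(1, 36) for w in omega}

# A random variable is simply a (measurable) application X : Omega -> R.
def X(w):
    return w[0] + w[1]  # sum of the two dice

# Its probability distribution: P_X(A) = P(X^{-1}(A)) = P({w : X(w) in A}).
def P_X(A):
    return sum(P[w] for w in omega if X(w) in A)

print(P_X({7}))                  # 1/6
print(P_X(set(range(2, 13))))    # 1 (certain event)
```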
3.3 Independence in probability (part 2/3)
3.3.1 Independence of random variables
To be updated.
3.3.2 Relationships between independences
To be updated.
3.4 Integration theory
3.4.1 Riemann integral
3.4.2 Integral with respect to a measure
3.4.3 Integral with respect to the Dirac measure
3.4.4 Integral with respect to a discrete measure
3.4.5 Lebesgue integral
3.4.6 Negligibility
3.4.7 Beppo-Levi theorem and Lebesgue theorem
3.4.8 Absolute continuity and density
3.5 Random variables and probability distribution (part 2/2)
3.5.1 Cumulative distribution function
Our aim is now to introduce tools to characterize random variables. It is clear from above that a random variable is described by P_X. This quantity defines the probability distribution of the random variable X. A probability distribution identifies either the probability of each value of a random variable (when the variable is discrete), or the probability of the value falling within a particular interval (when the variable is continuous). The probability distribution describes the range of possible values that a random variable can attain and the probability that the value of the random variable lies within any (measurable) subset of that range. In order to describe the probability distribution of random variables, we will introduce two useful functions called the cumulative distribution function and the probability density function.
Definition 79 Cumulative distribution function: the cumulative distribution function F_X(x) of a real random variable X is defined as follows:
F_X : ℝ → [0, 1]
x ↦ F_X(x) = P_X(]−∞, x]) = P(X ≤ x).
It can be regarded as the proportion of the population whose value is less than or equal to a particular value x.
The properties of the cumulative distribution function are the following:
• 0 ≤ F_X(x) ≤ 1, ∀x ∈ ℝ.
• lim_{x→−∞} F_X(x) = 0, since P(X ≤ −∞) = 0 (impossible event).
• lim_{x→+∞} F_X(x) = 1, since P(X ≤ +∞) = 1 (certain event).
• Every cumulative distribution function F_X(x) is (not necessarily strictly) monotonically increasing and right-continuous.
Theorem 80 If a function satisfies the four aforementioned properties, then it is the cumulative distribution function of a random variable.
Some other properties are:
• P(a < X ≤ b) = F_X(b) − F_X(a).
• The gap F_X(x₀) − F_X(x₀⁻) at a value x = x₀ is equal to P(X = x₀).
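A minimal numerical sketch of these properties (the choice of the standard normal CDF, obtained from the error function, is an assumption for the example):

```python
import math

# CDF of a standard normal N(0,1) via the error-function identity.
def F(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Limits and monotonicity:
assert F(-10.0) < 1e-12 and F(10.0) > 1.0 - 1e-12
assert F(0.0) == 0.5

# P(a < X <= b) = F(b) - F(a):
a, b = -1.0, 1.0
print(round(F(b) - F(a), 4))     # 0.6827 (the "68% rule")
```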
3.5.2 Discrete random variables
Definition 81 A random variable is called discrete if its probability distribution is a discrete measure.
From Definition 16, this means that there exists a family D = {xₙ} (finite or countable) of elements such that P_X(ℝ∖D) = 0, and ∀E ∈ ℬ(ℝ), P_X(E) = ∑_{x ∈ E∩D} P_X({x}). The elements of D are the discrete values that the random variable can take and are called atoms.
Theorem 82 If X is a discrete random variable attaining values x₁, x₂, ..., x_N with probabilities P(X = x_i) = p_i, i = 1, ..., N, then the CDF of X is discontinuous at the points x_i and constant in between:
F_X(x) = ∑_{x_i ≤ x} P(X = x_i) = ∑_{x_i ≤ x} p_i.   (3.4)
3.5.3 Continuous random variables and probability density function
Definition 83 A random variable is called continuous if its probability distribution is a continuous measure.
From Definition 16, this means that ∀x ∈ ℝ, P_X({x}) = 0.
Theorem 84 A random variable is continuous if and only if its cumulative distribution function is continuous.
Definition 85 Probability density function: the probability distribution P_X of a random variable X has a density, denoted f_X, if
f_X ≥ 0  and  F_X(x) = ∫_{−∞}^{x} f_X(u) du.   (3.5)
Such a random variable is called absolutely continuous.
Theorem 86 A function f over ℝ is the probability density function of a random variable if:
• f ≥ 0,
• f is measurable,
• f is integrable and ∫_ℝ f(x) dx = 1.
Theorem 87 If a probability density function is continuous over [a, b], then the cumulative distribution function is differentiable over [a, b] and we have
f_X(x) = dF_X(x)/dx.   (3.6)
Proof. To be updated.
Theorem 88 An absolutely continuous random variable is continuous (the converse is false).
Proof. To be updated.
To conclude, the main properties of the probability density function are the following:
• f_X(x) ≥ 0.
• f_X(−∞) = f_X(+∞) = 0.
• ∫_a^b f_X(x) dx = F_X(b) − F_X(a) = P(a < X ≤ b).
Of course, the latter equality implies straightforwardly that:
• ∫_{−∞}^{x} f_X(u) du = F_X(x).
• ∫_{−∞}^{+∞} f_X(x) dx = 1.
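These density-to-CDF relations can be checked numerically; the exponential density with rate λ = 2 below is an assumed example, and the integral is a simple midpoint Riemann sum:

```python
import math

# Assumed example density: exponential with rate lam = 2 on [0, +inf).
lam = 2.0
def f(x):
    return lam * math.exp(-lam * x) if x >= 0.0 else 0.0

# F_X(x) = integral of f from -inf to x (midpoint Riemann sum here).
def F(x, n=100_000):
    if x <= 0.0:
        return 0.0
    h = x / n
    return sum(f((k + 0.5) * h) for k in range(n)) * h

print(round(F(1.0), 4))          # 0.8647 = 1 - exp(-2)
print(round(F(50.0), 4))         # 1.0 (total mass)
```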
3.5.4 Mathematical expectation
In practice, it can be difficult to know the CDF or the probability density function of a random variable, or one may prefer to characterize the random variable with a small number of parameters. In the following, these parameters will be called moments. To calculate these moments, one needs to introduce a new operator called the mathematical expectation.
Remark 89 Rigorously, the expected value of a random variable is the integral of the random variable with respect to its probability measure. Here, we use a simplified definition which will be enough for the scope of this document. Note that this definition implicitly assumes that all the continuous random variables used here are absolutely continuous with a continuous density.
Definition 90 Let g(X) be a deterministic function of an (absolutely) continuous random variable X. The expectation (denoted E[g(X)]) of the random variable g(X) is given by
E[g(X)] = ∫_ℝ g(x) f_X(x) dx,   (3.7)
where the integral is assumed to exist.
In the case of a discrete random variable X taking values x_i, the expectation is given by
E[g(X)] = ∑_i g(x_i) P(X = x_i).   (3.8)
The properties of the mathematical expectation are the following:
• E[c] = c if c is a constant (deterministic variable), since, by definition, E[c] = ∫_ℝ c f_X(x) dx = c ∫_ℝ f_X(x) dx = c (the last integral being equal to 1).
• E[g(X)] is a constant (independent of x); consequently, E[E[g(X)]] = E[g(X)].
• E[αg(X) + βh(X)] = αE[g(X)] + βE[h(X)] if α and β are constants (due to the linearity of the integral operator and of the sum operator).
Theorem 91 Jensen inequality: let φ be a convex function; then
φ(E[X]) ≤ E[φ(X)].   (3.9)
Proof. To be updated.
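A Monte-Carlo sketch of the expectation operator and of Jensen's inequality (the choice X ∼ 𝒩(0, 1) and g(x) = x² is an assumption for the example):

```python
import random

random.seed(0)
# Monte-Carlo estimate of E[g(X)] for X ~ N(0,1) and g(x) = x^2.
n = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
E_X = sum(xs) / n
E_X2 = sum(x * x for x in xs) / n
print(round(E_X2, 1))            # ~1.0, since E[X^2] = var + mean^2 = 1

# Jensen's inequality with the convex function phi(x) = x^2:
assert E_X ** 2 <= E_X2          # phi(E[X]) <= E[phi(X)]
```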
3.5.5 ℒ^p(Ω, ℱ, μ) and L^p(Ω, ℱ, μ) spaces
3.5.5.1 Definitions and main properties
To be updated.
3.5.5.2 Hilbert spaces and L²(Ω, ℱ, μ)
To be updated.
3.5.5.3 Radon-Nikodym theorem and duality in L^p(Ω, ℱ, μ) spaces
To be updated.
3.5.6 Moments and central moments
The concept of moment in probability evolved from the concept of moment in physics.
Definition 92 The moment of order k ∈ ℕ, denoted m_k, of a random variable X is defined as follows:
m_k = E[X^k].   (3.10)
Definition 93 The central moment of order k ∈ ℕ, denoted μ_k, of a random variable X is defined as follows:
μ_k = E[(X − E[X])^k] = E[(X − m₁)^k].   (3.11)
Definition 94 m₁ = E[X] is called the mean of the random variable. If m₁ = 0, then the random variable is said to be centered.
Definition 95 μ₂ = E[(X − E[X])²] is called the variance of the random variable. √μ₂ is called the standard deviation of the random variable.
The variance of a random variable is one measure of statistical dispersion, averaging the squared distance of its possible values from the mean. Whereas the mean is a way to describe the location of a distribution, the variance is a way to capture its scale or degree of spread.
Remark 96 In this document, we are mainly interested in the mean and the variance of random variables. Note that it is possible to study the higher-order moments of a random variable. For example, the third central moment, μ₃, is a measure of the lopsidedness of the distribution; any symmetric distribution will have a third central moment, if defined, of zero. The third central moment is called the skewness. A distribution that is skewed to the left (the tail of the distribution is heavier on the left) will have a negative skewness; a distribution that is skewed to the right (the tail of the distribution is heavier on the right) will have a positive skewness. The fourth central moment, μ₄, is a measure of whether the distribution is tall and skinny or short and squat, compared to a normal distribution of the same variance. The fourth central moment is called the kurtosis.
Theorem 97 The relationship between m₁, m₂ and μ₂ is given by
μ₂ = m₂ − m₁².   (3.12)
Proof. Let us expand μ₂ = E[(X − E[X])²] with E[X] = m₁:
μ₂ = E[X² + m₁² − 2Xm₁] = E[X²] + m₁² − 2m₁E[X] = m₂ + m₁² − 2m₁² = m₂ − m₁².
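Identity (3.12) also holds exactly for empirical moments; a sketch with an assumed X ∼ U[0, 1] (m₁ = 1/2, μ₂ = 1/12):

```python
import random

random.seed(1)
# Empirical check of mu_2 = m_2 - m_1^2 for an assumed X ~ U[0,1].
xs = [random.random() for _ in range(100_000)]
n = len(xs)
m1 = sum(xs) / n                          # first moment
m2 = sum(x * x for x in xs) / n           # second moment
mu2 = sum((x - m1) ** 2 for x in xs) / n  # second central moment
assert abs(mu2 - (m2 - m1 ** 2)) < 1e-9   # identity (3.12), up to rounding
print(round(mu2, 3))                      # ~0.083 = 1/12
```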
3.5.7 Characteristic function
The characteristic function, denoted φ_X(u), of any random variable completely defines its probability distribution and is useful to calculate its moments. On the real line it is given by the following formula, where X is any random variable with the distribution in question:
φ_X(u) = E[e^{iuX}] = ∫_ℝ f_X(x) e^{iux} dx,   (3.13)
where i² = −1. Every probability distribution on ℝ has a characteristic function, since ∫_ℝ |f_X(x)| dx = 1 < ∞. One can remark that the characteristic function is simply the Fourier transform of the PDF.
The properties of the characteristic function are the following:
• From the Jensen inequality, |φ_X(u)| ≤ 1 = φ_X(0).
• With a change of variable, one sees that φ_{aX}(u) = φ_X(au).
• φ_X(u) is a continuous function even if X is a discrete random variable.
Theorem 98 The moments of a random variable X are given by
m_k = i^{−k} d^k φ_X(u)/du^k |_{u=0}.   (3.14)
Proof. The exponential function can be written as a power series: e^x = ∑_{k=0}^{∞} x^k/k!. Consequently,
φ_X(u) = ∫_ℝ f_X(x) e^{iux} dx = ∑_{k=0}^{∞} ∫_ℝ f_X(x) (iux)^k/k! dx = ∑_{k=0}^{∞} (iu)^k/k! ∫_ℝ x^k f_X(x) dx = ∑_{k=0}^{∞} (iu)^k/k! m_k.
By identification with a Taylor series h(u) = ∑_{k=0}^{∞} (d^k h(u)/du^k |_{u=0}) u^k/k!, one finishes the proof.
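A numerical sketch of (3.14): the Poisson(λ = 3) characteristic function exp(λ(e^{iu} − 1)) is an assumed example, and the derivatives at u = 0 are taken by central finite differences rather than analytically:

```python
import cmath

# Characteristic function of an assumed Poisson(lam = 3) variable.
lam = 3.0
def phi(u):
    return cmath.exp(lam * (cmath.exp(1j * u) - 1.0))

# Moments via (3.14), derivatives approximated by central differences.
h = 1e-5
m1 = ((phi(h) - phi(-h)) / (2 * h) / 1j).real            # i^{-1} phi'(0)
m2 = (-(phi(h) - 2 * phi(0.0) + phi(-h)) / h ** 2).real  # i^{-2} phi''(0)
print(round(m1, 3), round(m2, 3))   # 3.0 12.0, i.e. lam and lam + lam^2
```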
3.6 Examples of important probability distributions for discrete random variables
3.6.1 Discrete uniform probability distribution
To be updated.
3.6.2 Bernoulli probability distribution
The Bernoulli distribution is the distribution of a discrete random variable which takes value 1 with success probability p and value 0 with failure probability q = 1 − p. So, if X is a random variable with this distribution, we have:
P(X = 1) = 1 − P(X = 0) = 1 − q = p.   (3.15)
The probability distribution of X is given by
P(X = x) = p^x (1 − p)^{1−x},   (3.16)
where x ∈ {0, 1}. The mean of X is given by E[X] = p. The variance of X is given by E[(X − E[X])²] = p(1 − p). The distribution of heads and tails in coin tossing is an example of a Bernoulli distribution with p = q = 1/2.
3.6.3 Binomial probability distribution
The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. When n = 1, the binomial distribution is a Bernoulli distribution. The distribution of a random variable X with this law is given by
P(X = x) = C_n^x p^x (1 − p)^{n−x},   (3.17)
where x = 0, 1, ..., n and where C_n^x = n!/(x!(n−x)!) is the binomial coefficient. The formula can be understood as follows: we want x successes (p^x) and n − x failures ((1 − p)^{n−x}); however, the x successes can occur anywhere among the n trials, and there are C_n^x different ways of distributing x successes in a sequence of n trials.
The mean of X is given by E[X] = np. The variance of X is given by E[(X − E[X])²] = np(1 − p).
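A quick check of (3.17) and of the stated mean and variance, for assumed parameters n = 10 and p = 0.3:

```python
from math import comb

# Binomial pmf (3.17) for assumed parameters n = 10, p = 0.3.
n, p = 10, 0.3
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

assert abs(sum(pmf) - 1.0) < 1e-12           # the probabilities sum to one
mean = sum(x * q for x, q in enumerate(pmf))
var = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))
print(round(mean, 6), round(var, 6))         # 3.0 2.1 = np and np(1-p)
```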
3.6.4 Poisson probability distribution
The Poisson distribution is a discrete probability distribution that expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume. Such a random variable counts, among other things, a number of discrete occurrences (sometimes called "arrivals") that take place during a time interval of given length. If the expected number of occurrences in this interval is λ, then the probability that there are exactly k occurrences (k ∈ ℕ) is equal to
P(K = k) = λ^k e^{−λ} / k!.   (3.18)
A classic example of the Poisson distribution is the nuclear decay of atoms. The mean of K is given by E[K] = λ. The variance of K is given by E[(K − E[K])²] = λ.
3.6.5 Binomial probability distribution
To be updated.
3.6.6 Hypergeometric probability distribution
To be updated.
3.7 Examples of important probability density functions for continuous random variables
3.7.1 Uniform probability density function
The continuous uniform probability density function is a family of probability density functions such that, for each member of the family, all intervals of the same length on the distribution's support are equally probable. The support is defined by the two parameters, a and b, which are its minimum and maximum values. The distribution is abbreviated X ∼ U_{[a,b]}. The probability density function of the continuous uniform distribution is:
f_X(x) = 1/(b−a) if a ≤ x ≤ b, and 0 otherwise,   (3.19)
where x ∈ ℝ.
This PDF will be useful when we only know the support of a random variable. The mean of X is given by E[X] = (a+b)/2. The variance of X is given by E[(X − E[X])²] = (b−a)²/12.
3.7.2 Real Gaussian or normal probability density function
The normal probability density function, also called the Gaussian probability density function, is an important family of continuous probability distributions, applicable in many fields. It will be the main PDF that we will study during these lectures (with the uniform PDF). The importance of the normal distribution as a model of quantitative phenomena in the natural and behavioral sciences is due in part to the central limit theorem (see below). Moreover, closed-form expressions of various problems can be obtained when we deal with normal distributions. The distribution is abbreviated X ∼ 𝒩(m, σ²). The PDF is given by
f_X(x) = (1/(σ√(2π))) e^{−(x−m)²/(2σ²)},   (3.20)
where x ∈ ℝ. The mean of X is given by E[X] = m. The variance of X is given by E[(X − E[X])²] = σ².
3.7.3 Complex circular Gaussian probability density function
To be updated.
3.7.4 Exponential probability density function
To be updated.
3.7.5 Gamma probability density function
To be updated.
3.7.6 Beta probability density function
To be updated.
3.7.7 Student probability density function
To be updated.
3.7.8 χ² probability density function
To be updated.
3.8 Substitution of random variables
Theorem 99 Let X be an absolutely continuous random variable (with density f_X(x)) taking values in an open set of ℝ denoted A. Let g(·) be a one-to-one mapping from A to an open subset of ℝ denoted B = Im g. g is assumed to be differentiable, and its inverse as well. Then, the random variable Y = g(X) is absolutely continuous, taking values in B, with density
f_Y(y) = |dx/dy|_{x=g⁻¹(y)} f_X(g⁻¹(y)),   (3.21)
where g⁻¹ is the inverse function of g.
Proof. To be updated.
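A sketch of (3.21) under the assumed mapping Y = exp(X) with X ∼ 𝒩(0, 1), which yields the log-normal density; the Monte-Carlo and quadrature estimates of the same probability should agree:

```python
import math, random

random.seed(2)
# Change of variables (3.21) for Y = g(X) = exp(X), X ~ N(0,1):
# g^{-1}(y) = log y, |dx/dy| = 1/y, so f_Y(y) = f_X(log y) / y.
def f_X(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def f_Y(y):
    return f_X(math.log(y)) / y

# Check P(1 < Y <= 2) two ways: Monte-Carlo on Y vs integrating f_Y on [1, 2].
ys = [math.exp(random.gauss(0.0, 1.0)) for _ in range(200_000)]
p_mc = sum(1.0 < y <= 2.0 for y in ys) / len(ys)
h = 1e-3
p_int = sum(f_Y(1.0 + (k + 0.5) * h) * h for k in range(1000))
assert abs(p_mc - p_int) < 0.01     # the two estimates agree
```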
Chapter 4
Random vectors
In this chapter, one extends all the previous results to the case of random vectors.
4.1 Product probability space and probability distribution
Theorem 100 Let X = [X₁ X₂ ⋯ X_N]ᵀ be a vector made from random variables defined on the same probability space (Ω, ℱ, P) with values in (ℝ, ℬ(ℝ)). Then, X is a random variable from (Ω, ℱ, P) to (ℝ^N, ℬ(ℝ^N)), where ℬ(ℝ^N) is the Borel σ-algebra over ℝ^N, i.e., the σ-algebra generated by the open sets ]−∞, x₁[ × ]−∞, x₂[ × ⋯ × ]−∞, x_N[, ∀(x₁, x₂, ..., x_N) ∈ ℝ^N.
Proof. To be updated.
Remark 101 This theorem is not general, since we focus on (ℝ^N, ℬ(ℝ^N)) only, but it will be enough for the remainder of this document. This means that we will only deal with random vectors made from real random variables.
Definition 102 Let X be a random vector with values in (ℝ^N, ℬ(ℝ^N)). The application denoted P_X from ℬ(ℝ^N) to [0, 1] such that
∀A ∈ ℬ(ℝ^N),  P_X(A) = P(X⁻¹(A)) = P({ω : X(ω) ∈ A}),   (4.1)
is called the probability distribution of the random vector X.
is called the probability distribution of the random vector X.
4.2 Back to integration theory
4.3 Joint and marginal cumulative distribution function
Definition 103 Joint CDF: the joint cumulative distribution function of a random vector is given by
F_X(x) = F_{X₁,X₂,...,X_N}(x₁, x₂, ..., x_N) = P_X(]−∞, x₁] × ⋯ × ]−∞, x_N]) = P(∩_{i=1}^{N} {X_i ≤ x_i}),   (4.2)
where (x₁, x₂, ..., x_N) ∈ ℝ^N.
This function is from ℝ^N to [0, 1].
Definition 104 Any vector made from a (strict) subset of the components of the random vector X is called a marginal random vector.
Definition 105 Marginal CDF: if we are only interested in the cumulative distribution function of a marginal random vector, this cumulative distribution function is called the marginal cumulative distribution function and is obtained by taking the limit of the joint cumulative distribution function when all the variables other than the ones in the marginal vector tend to +∞. This function is from ℝ^P to [0, 1], where P < N is the size of the marginal vector.
Example 106 ∀i = 1, ..., N,
F_{X_i}(x_i) = P(X₁ < ∞ ∩ ⋯ ∩ X_i ≤ x_i ∩ ⋯ ∩ X_N < ∞) = P(X_i ≤ x_i).   (4.3)
∀(i, j) ∈ {1, ..., N}², with i ≠ j,
F_{X_i,X_j}(x_i, x_j) = P(X₁ < ∞ ∩ ⋯ ∩ X_i ≤ x_i ∩ ⋯ ∩ X_j ≤ x_j ∩ ⋯ ∩ X_N < ∞) = P(X_i ≤ x_i ∩ X_j ≤ x_j).   (4.4)
4.4 Joint and marginal probability density function
Definition 107 Joint probability density function: the random vector X is said to be absolutely continuous if there exists a measurable function f_X from (ℝ^N, ℬ(ℝ^N)) to (ℝ₊, ℬ(ℝ₊)) such that ∀(x₁, x₂, ..., x_N) ∈ ℝ^N one has
F_X(x) = ∫_{−∞}^{x₁} ∫_{−∞}^{x₂} ⋯ ∫_{−∞}^{x_N} f_X(u₁, u₂, ..., u_N) du₁ du₂ ⋯ du_N.   (4.5)
This function is called the joint probability density function of the random vector X. It is a function from ℝ^N to ℝ₊.
Theorem 108 A function f from ℝ^N to ℝ₊ is the joint probability density function of a random vector if:
• f ≥ 0,
• f is measurable,
• f is integrable and ∫_{ℝ^N} f(x) dx = ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} f(x₁, ..., x_N) dx₁ ⋯ dx_N = 1.
Theorem 109 For any set B = [a₁, b₁] × [a₂, b₂] × ⋯ × [a_N, b_N] over ℝ^N, we have
P(a₁ ≤ X₁ ≤ b₁ ∩ a₂ ≤ X₂ ≤ b₂ ∩ ⋯ ∩ a_N ≤ X_N ≤ b_N) = ∫_B f_X(x) dx.   (4.6)
Theorem 110 If the joint probability density function is continuous at the point (x₁⁰, ..., x_N⁰), we have
f_X(x₁⁰, ..., x_N⁰) = ∂^N F_X(x₁, ..., x_N) / (∂x₁ ⋯ ∂x_N) |_{(x₁⁰, ..., x_N⁰)}.   (4.7)
Definition 111 Marginal probability density function: if we are only interested in the probability density function of a marginal random vector, this probability density function is called the marginal probability density function and is given, ∀x_i ∈ ℝ, by
f_{X_i}(x_i) = ∫_{ℝ^{N−1}} f_X(x) dx₁ dx₂ ⋯ dx_{i−1} dx_{i+1} ⋯ dx_N,   (4.8)
since
P_{X_i}(]−∞, x_i]) = P(X ∈ ℝ × ⋯ × ℝ × ]−∞, x_i] × ℝ × ⋯ × ℝ)
= ∫_{ℝ×⋯×]−∞,x_i]×⋯×ℝ} f_X(u) du₁ du₂ ⋯ du_N = ∫_{]−∞,x_i]} f_{X_i}(u_i) du_i.   (4.9)
Remark 112 Note that if the marginal probability density function is continuous, it can also be obtained by taking the derivative of the marginal cumulative distribution function. For example:
f_{X_i}(x_i) = dF_{X_i}(x_i)/dx_i = ∫_{ℝ^{N−1}} f_X(x) dx₁ dx₂ ⋯ dx_{i−1} dx_{i+1} ⋯ dx_N.   (4.10)
Proof. Let us prove this relationship in the case N = 2 (the generalization is straightforward). Since f_X(x) = ∂²F_X(x)/(∂x₁∂x₂), then
f_{X₁}(x₁) = ∫_ℝ f_X(x) dx₂ = ∫_ℝ ∂²F_X(x)/(∂x₁∂x₂) dx₂ = ∂/∂x₁ ∫_ℝ ∂F_X(x)/∂x₂ dx₂
= ∂/∂x₁ [P(X₁ ≤ x₁ ∩ X₂ ≤ +∞) − P(X₁ ≤ x₁ ∩ X₂ ≤ −∞)]
= ∂/∂x₁ [P(X₁ ≤ x₁) − 0] = dF_{X₁}(x₁)/dx₁.   (4.11)
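A numerical sketch of marginalization (4.8); the joint density f(x₁, x₂) = x₁ + x₂ on the unit square is an assumed example, whose marginal is f_{X₁}(x₁) = x₁ + 1/2:

```python
# Assumed joint density on [0,1]^2: f(x1, x2) = x1 + x2 (integrates to 1).
def f(x1, x2):
    return x1 + x2 if (0.0 <= x1 <= 1.0 and 0.0 <= x2 <= 1.0) else 0.0

# Marginal (4.8): integrate out x2 with a midpoint rule
# (exact here, since f is linear in x2).
def f_X1(x1, n=10_000):
    h = 1.0 / n
    return sum(f(x1, (k + 0.5) * h) for k in range(n)) * h

print(round(f_X1(0.25), 6))   # 0.75 = x1 + 1/2
```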
4.5 Conditional probability density function
Recall that we have defined the conditional probability of an event A given B by P(A|B) = P(A∩B)/P(B). In the same way, one can define a conditional probability density function, which gives a relationship between the marginal and the joint probability density functions, as
f_{X₁|X₂}(x₁|x₂) = f_{X₁,X₂}(x₁, x₂) / f_{X₂}(x₂),   (4.12)
or
f_{X₂|X₁}(x₂|x₁) = f_{X₁,X₂}(x₁, x₂) / f_{X₁}(x₁).   (4.13)
Note that the conditional probability density function is a probability density function, since
∫_ℝ f_{X₁|X₂}(x₁|x₂) dx₁ = ∫_ℝ f_{X₁,X₂}(x₁, x₂)/f_{X₂}(x₂) dx₁ = (1/f_{X₂}(x₂)) ∫_ℝ f_{X₁,X₂}(x₁, x₂) dx₁ = f_{X₂}(x₂)/f_{X₂}(x₂) = 1.   (4.14)
Theorem 113 Bayes theorem for probability density functions:
f_{X₂|X₁}(x₂|x₁) = f_{X₁|X₂}(x₁|x₂) f_{X₂}(x₂) / f_{X₁}(x₁).   (4.15)
Proof. Obvious by combining Eqn. (4.12) and Eqn. (4.13).
4.6 Independence in probability (part 3/3)
Let X = [X₁ X₂ ⋯ X_N]ᵀ be a random vector. The random variables X₁, X₂, ..., X_N are mutually independent if
f_X(x) = ∏_{i=1}^{N} f_{X_i}(x_i).   (4.16)
In other words, the joint probability density function simply reduces to the product of the marginal probability density functions.
Proof. Since the random variables X₁, X₂, ..., X_N are mutually independent, one can write
F_X(x) = P(∩_{i=1}^{N} {X_i ≤ x_i}) = ∏_{i=1}^{N} P(X_i ≤ x_i) = ∏_{i=1}^{N} F_{X_i}(x_i).   (4.17)
Consequently,
f_X(x) = ∂^N F_X(x)/(∂x₁∂x₂⋯∂x_N) = ∏_{i=1}^{N} dF_{X_i}(x_i)/dx_i = ∏_{i=1}^{N} f_{X_i}(x_i).   (4.18)
Note that Eqn. (4.16) implies that
f_{X₂|X₁}(x₂|x₁) = f_{X₂}(x₂),   (4.19)
and
f_{X₁|X₂}(x₁|x₂) = f_{X₁}(x₁).   (4.20)
4.7 Mathematical expectation
Definition 114 Let X = [X₁ X₂ ⋯ X_N]ᵀ be an absolutely continuous random vector and let g(X) be a deterministic function from ℝ^N to ℝ; then
E[g(X)] = ∫_ℝ ⋯ ∫_ℝ g(x₁, x₂, ..., x_N) f_{X₁,X₂,...,X_N}(x₁, x₂, ..., x_N) dx₁ dx₂ ⋯ dx_N = ∫_{ℝ^N} g(x) f_X(x) dx,   (4.21)
where the integral is assumed to exist. E[·] is called the expectation operator.
4.8 Mean
Definition 115 The moment of order one, denoted m_X, of a random vector X is defined as follows:
m_X = E[X] = E[[X₁, X₂, ..., X_N]ᵀ] = [E[X₁], E[X₂], ..., E[X_N]]ᵀ.   (4.22)
In other words, the expectation of a vector is defined as the vector of the expectations.
Remark 116 Note that, ∀i = 1, ..., N, we have
E[X_i] = ∫_{ℝ^N} x_i f_X(x) dx = ∫_ℝ x_i (∫_{ℝ^{N−1}} f_X(x) dx₁ ⋯ dx_{i−1} dx_{i+1} ⋯ dx_N) dx_i = ∫_ℝ x_i f_{X_i}(x_i) dx_i,   (4.23)
which is simply the expectation over the marginal PDF f_{X_i}(x_i).
4.9 Covariance and correlation matrices
Definition 117 The moment of order 2 of a random vector is called the correlation matrix; it is denoted C and is given by
C = E[XXᵀ] = E[[X₁ X₂ ⋯ X_N]ᵀ [X₁ X₂ ⋯ X_N]],
i.e., the matrix whose (i, j)-th entry is E[X_i X_j]:
C = [ E[X₁²]       E[X₁X₂]   ⋯   E[X₁X_N]
      E[X₁X₂]      E[X₂²]          ⋮
      ⋮                 ⋱      E[X_{N−1}X_N]
      E[X₁X_N]  ⋯  E[X_{N−1}X_N]   E[X_N²] ].   (4.24)
In other words, the correlation matrix gathers all the second-order moments of the components of X.
Remark 118 Note that, ∀i, j = 1, ..., N, we have
E[X_i X_j] = ∫_{ℝ^N} x_i x_j f_X(x) dx = ∫_{ℝ²} x_i x_j f_{X_i,X_j}(x_i, x_j) dx_i dx_j.   (4.25)
If i = j, we have straightforwardly
E[X_i²] = ∫_ℝ x_i² f_{X_i}(x_i) dx_i.   (4.26)
Remark 119 Note that C is a symmetric matrix, since C = Cᵀ. C is also a positive semi-definite matrix (C ⪰ 0), since ∀u ≠ 0, we have uᵀCu ≥ 0.
Definition 120 The central moment of order 2 of a random vector is called the covariance matrix; it is denoted Σ and is given by
Σ = E[(X − E[X])(X − E[X])ᵀ] = E[(X − m_X)(X − m_X)ᵀ].   (4.27)
Remark 121 Note that Σ is a symmetric matrix, since Σ = Σᵀ. Σ is also a positive semi-definite matrix (Σ ⪰ 0), since ∀u ≠ 0, we have uᵀΣu ≥ 0.
Theorem 122 The relationship between m_X, C, and Σ is given by
Σ = C − m_X m_Xᵀ.   (4.28)
Proof.
Σ = E[(X − m_X)(X − m_X)ᵀ] = E[XXᵀ + m_X m_Xᵀ − m_X Xᵀ − X m_Xᵀ]
= E[XXᵀ] + m_X m_Xᵀ − m_X E[Xᵀ] − E[X] m_Xᵀ = C − m_X m_Xᵀ.   (4.29)
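Identity (4.28) also holds exactly for empirical moments; a sketch with an assumed three-component sample (the correlation between the first two components is introduced by hand):

```python
import numpy as np

rng = np.random.default_rng(0)
# N = 3 components with some correlation; empirical check of (4.28).
X = rng.standard_normal((3, 10_000))
X[1] += 0.5 * X[0]                      # make X_1 and X_2 correlated

n = X.shape[1]
m = X.mean(axis=1, keepdims=True)       # m_X
C = (X @ X.T) / n                       # correlation matrix E[X X^T]
Sigma = ((X - m) @ (X - m).T) / n       # covariance matrix

assert np.allclose(Sigma, C - m @ m.T)  # Sigma = C - m_X m_X^T
assert np.allclose(Sigma, Sigma.T)      # symmetry
assert np.all(np.linalg.eigvalsh(Sigma) > -1e-10)  # positive semi-definite
```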
Remark 124 Note that if the random vector X is a complex vector, the definition of the correlation and covariance matrices is modified as follows:
C = E[XX^H],   (4.30)
and
Σ = E[(X − E[X])(X − E[X])^H],   (4.31)
where (·)^H indicates the conjugate transpose operator.
4.10 Correlation coefficient
For a random vector, we see that the elements of the correlation and of the covariance matrices have the structure E[X_i X_j], i, j = 1, ..., N. This structure will help us to quantify the strength and direction of a linear relationship between two random variables.
Definition 125 The correlation coefficient, denoted ρ_{X_i,X_j}, between two random variables X_i and X_j is defined as follows:
ρ_{X_i,X_j} = E[(X_i − m_{X_i})(X_j − m_{X_j})] / √(E[(X_i − m_{X_i})²] E[(X_j − m_{X_j})²]),   (4.32)
where E[(X_i − m_{X_i})(X_j − m_{X_j})] is called the covariance between X_i and X_j.
The properties of the correlation coefficient are the following:
• |ρ_{X_i,X_j}| ≤ 1 (from the Cauchy-Schwarz inequality).
• |ρ_{X_i,X_j}| = 1 ⟺ αX_i + βX_j = constant. This means that there is a linear relationship between the two random variables.
• If ρ_{X_i,X_j} = 0, we say that the two random variables are uncorrelated.
• |ρ_{X_i,X_j}| takes some value between 0 and 1 in all other cases, indicating the degree of linear dependence between the variables. The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.
Remark 126 If \rho_{X_i, X_j} = 0 for all i \neq j, then the covariance matrix is diagonal:

\Sigma = diag( E[(X_1 - m_{X_1})^2], E[(X_2 - m_{X_2})^2], ..., E[(X_N - m_{X_N})^2] ). (4.33)
4.11 Correlation and independence

Theorem 127 The independence between two random variables X and Y implies

E[(X - m_X)(Y - m_Y)] = E[X - m_X] E[Y - m_Y]. (4.34)
Proof. Let X and Y be two random variables. Independence states that

f_{X,Y}(x, y) = f_X(x) f_Y(y). (4.35)

Consequently,

E[(X - m_X)(Y - m_Y)] = \int_R \int_R (x - m_X)(y - m_Y) f_{X,Y}(x, y) dx dy
  = \int_R \int_R (x - m_X)(y - m_Y) f_X(x) f_Y(y) dx dy
  = \int_R (x - m_X) f_X(x) dx \int_R (y - m_Y) f_Y(y) dy
  = E[X - m_X] E[Y - m_Y]. (4.36)
Theorem 128 The independence between two random variables implies that they are uncorrelated.

Proof. Since E[X - m_X] = E[X] - m_X = m_X - m_X = 0, we have \rho_{X,Y} = 0.
Remark 129 The converse is false in general (except when X and Y are jointly Gaussian random variables), i.e., in general,

\rho_{X,Y} = 0 does not imply f_{X,Y}(x, y) = f_X(x) f_Y(y). (4.37)
4.12 Characteristic function of a random vector

Definition 130 The characteristic function of a random vector X \in R^N is given by

\phi_X(u) = E[e^{j u^T X}], (4.38)

where u \in R^N.
4.13 Examples of important probability distributions for random vectors

4.13.1 Real multivariate Gaussian distribution

X = [ X_1  X_2  ...  X_N ]^T is a Gaussian random vector if and only if every linear combination \alpha_1 X_1 + \alpha_2 X_2 + ... + \alpha_N X_N is a Gaussian random variable.

A fundamental consequence is that any linear transformation of a Gaussian random vector is a Gaussian random vector.

A Gaussian random vector X with mean m_X and covariance matrix \Sigma_X will be denoted X ~ N(m_X, \Sigma_X).
Definition 131 The joint PDF of a Gaussian random vector X \in R^N with mean m_X and covariance matrix \Sigma_X is given by

f_X(x) = 1 / ( (2\pi)^{N/2} |\Sigma_X|^{1/2} ) e^{ -\frac{1}{2} (x - m_X)^T \Sigma_X^{-1} (x - m_X) }. (4.39)
4.13.2 Complex multivariate Gaussian distribution
To be updated.
4.13.3 Real Wishart distribution
To be updated.
4.13.4 Complex Wishart distribution
To be updated.
4.14 Sum of independent random variables

Theorem 132 Let X and Y be two independent random variables and let Z = X + Y. Then the probability density function f_Z(z) is given by

f_Z(z) = \int_R f_X(x) f_Y(z - x) dx, (4.40)

which is the convolution of f_X with f_Y.
Proof. By using the marginalization rules, we have

f_Z(z) = \int_R \int_R f_{X,Y,Z}(x, y, z) dx dy, (4.41)

and since X and Y are assumed to be independent, one obtains

f_Z(z) = \int_R \int_R f_X(x) f_Y(y) f_{Z|X,Y}(z | x, y) dx dy. (4.42)

Since Z = X + Y, we have f_{Z|X,Y}(z | x, y) = \delta(z - (x + y)), so

f_Z(z) = \int_R \int_R f_X(x) f_Y(y) \delta(z - (x + y)) dx dy
  = \int_R f_X(x) f_Y(z - x) dx, (4.43)

since \int_R f_Y(y) \delta(z - x - y) dy = f_Y(z - x).
4.15 Substitution of random vectors

Theorem 133 Let X and Y be two random vectors (in R^N) such that Y = g(X), where g is a one-to-one deterministic function from R^N to R^N. In other words,

Y_1 = g_1(X_1, X_2, ..., X_N)
Y_2 = g_2(X_1, X_2, ..., X_N)
...
Y_N = g_N(X_1, X_2, ..., X_N)

Then f_X(x) and f_Y(y) are linked by the following relationship:

f_Y(y) = |J| f_X(x)|_{x_i = g_i^{-1}(y_1, y_2, ..., y_N)}, (4.44)

with J (called the Jacobian determinant) given by

J = det( \partial x_i / \partial y_j )_{i,j = 1, ..., N}. (4.45)
Proof. To be updated.
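While the proof is deferred, formula (4.44) can be checked numerically in a case where the answer is known in closed form. The sketch below (assuming NumPy; the matrix A and test point are illustrative) applies (4.44) to the linear map Y = AX with X standard bivariate Gaussian, and compares with the known density of N(0, A A^T):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 1.5]])
Ainv = np.linalg.inv(A)
absJ = abs(np.linalg.det(Ainv))     # |J| in Eq. (4.44); here |det(dx/dy)| = 1/|det A|

def f_X(x):
    # standard bivariate Gaussian density for X
    return np.exp(-0.5 * (x @ x)) / (2 * np.pi)

def f_Y(y):
    # Eq. (4.44): substitute x = g^{-1}(y) = A^{-1} y and multiply by |J|
    return absJ * f_X(Ainv @ y)

# Closed form for comparison: Y = A X is Gaussian with covariance A A^T
Sigma = A @ A.T
def f_Y_closed(y):
    quad = y @ np.linalg.solve(Sigma, y)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

y0 = np.array([0.7, -1.2])          # arbitrary test point
```

At any test point, f_Y and f_Y_closed agree to machine precision, as the change-of-variables formula predicts.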
Chapter 5
Convergences and additional results
5.1 Convergences
5.1.1 Convergence in probability

Definition 134 A sequence of random variables Y_n converges in probability to a constant c if

\forall \epsilon > 0, P(|Y_n - c| < \epsilon) \to 1 as n \to +\infty, (5.1)

or, equivalently, if

\forall \epsilon > 0, P(|Y_n - c| \geq \epsilon) \to 0 as n \to +\infty. (5.2)

We will use the notation Y_n \to^P c. Convergence in probability means that, for large n, the random variable Y_n is close to the constant c.

It is sometimes difficult to prove one of the two previous statements directly. One can instead use Chebyshev's inequality, which gives a sufficient (but not necessary) condition for convergence in probability.
Lemma 135 Chebyshev's inequality: for any random variable Y and any constants a > 0 and c, we have

E[(Y - c)^2] \geq a^2 P(|Y - c| \geq a). (5.3)

Proof. Let Z = Y - c. Then

E[Z^2] = \int_R z^2 f_Z(z) dz = \int_{|z| < a} z^2 f_Z(z) dz + \int_{|z| \geq a} z^2 f_Z(z) dz
  \geq \int_{|z| \geq a} z^2 f_Z(z) dz \geq a^2 \int_{|z| \geq a} f_Z(z) dz = a^2 P(|Z| \geq a). (5.4)
Theorem 136 A sufficient condition for Y_n \to^P c is

E[(Y_n - c)^2] \to 0 as n \to +\infty. (5.5)

Proof. \forall \epsilon > 0, Chebyshev's inequality gives

P(|Y_n - c| \geq \epsilon) \leq (1/\epsilon^2) E[(Y_n - c)^2]. (5.6)

Hence, if E[(Y_n - c)^2] \to 0 as n \to +\infty, then P(|Y_n - c| \geq \epsilon) \to 0 as n \to +\infty.
5.1.2 Convergence in distribution

5.1.2.1 Definition

Definition 137 Let Y_1, ..., Y_n be a sequence of random variables with cumulative distribution functions F_{Y_1}(.), ..., F_{Y_n}(.) (in other words, F_{Y_i}(y) = P(Y_i \leq y)). Let Y be a random variable with cumulative distribution function F_Y(y) = P(Y \leq y). We say that Y_n converges in distribution to Y if

F_{Y_n}(y) \to F_Y(y) as n \to +\infty, \forall y \in R where F_Y is continuous. (5.7)

Definition 138 An equivalent definition is that Y_n converges in distribution to Y if and only if, for every bounded continuous function f,

lim_{n \to \infty} E[f(Y_n)] = E[f(Y)]. (5.8)

We will use the notation Y_n \to^L Y. Beware: this notation could suggest that convergence in distribution implies that, for large n, the random variable Y_n is close to Y (which makes no sense, since Y is itself a random variable). Convergence in distribution means that the distribution of Y_n is close to the distribution of Y. In this sense, convergence in distribution is weaker than convergence in probability.

We now give a theorem (Lévy's continuity theorem) which provides an alternative way to prove convergence in distribution.
5.1.2.2 Lévy's continuity theorem

Theorem 139 Let Y_1, ..., Y_n be a sequence of random variables with characteristic functions \phi_{Y_1}(t), ..., \phi_{Y_n}(t). Let Y be a random variable with characteristic function \phi_Y(t). Then

( \forall t \in R : \phi_{Y_n}(t) \to \phi_Y(t) ) \iff ( Y_n \to^L Y ), (5.9)

where \phi_{Y_n}(t) \to \phi_Y(t) denotes pointwise convergence (n \to \infty).

Proof. Omitted.
5.1.3 Relationship between convergence in probability and convergence in distribution

It may seem odd to relate convergence in probability and convergence in distribution, since convergence in probability (as defined above) means that Y_n converges to a constant c, while convergence in distribution means that Y_n converges to a random variable Y. We therefore define the convergence in probability of a sequence of random variables to another random variable.

Definition 140 A sequence of random variables Y_n converges in probability to a random variable Y if

Y_n - Y \to^P 0. (5.10)

We will also use the notation Y_n \to^P Y.

Theorem 141 (Slutsky's theorem) If Y_n \to^L Y, A_n \to^P a, and B_n \to^P b, where a and b are constants, then

A_n + B_n Y_n \to^L a + b Y. (5.11)

Proof. Omitted.

Note that in this last theorem, the random variables A_n and B_n are not necessarily independent of Y_n.

Theorem 142 If Y_n \to^P Y then Y_n \to^L Y (the converse is false).

Proof. Omitted.
5.2 Weak law of large numbers

Theorem 143 Let Y_1, ..., Y_n be a sequence of i.i.d. random variables with mean E[Y_i] = c \forall i and variance E[(Y_i - c)^2] = \sigma^2 < +\infty \forall i. Then

(1/n)(Y_1 + ... + Y_n) \to^P E[Y_i] = c. (5.12)

Proof. Let \bar{Y}_n = (1/n)(Y_1 + ... + Y_n). We have

E[(\bar{Y}_n - c)^2] = E[((1/n)(Y_1 + ... + Y_n) - c)^2]
  = E[((1/n)(Y_1 + ... + Y_n) - (1/n) n c)^2]
  = E[((1/n) \sum_{i=1}^{n} (Y_i - c))^2]
  = (1/n^2) E[\sum_{i=1}^{n} \sum_{i'=1}^{n} (Y_i - c)(Y_{i'} - c)]
  = (1/n^2) \sum_{i=1}^{n} \sum_{i'=1}^{n} E[(Y_i - c)(Y_{i'} - c)]
  = (1/n^2) \sum_{i=1}^{n} E[(Y_i - c)^2] = \sigma^2 / n, (5.13)

since E[(Y_i - c)(Y_{i'} - c)] = 0 if i \neq i' (independent random variables). Consequently, E[(\bar{Y}_n - c)^2] = \sigma^2 / n \to 0 as n \to +\infty, which concludes the proof by Theorem 136.
5.3 Central limit theorem

This theorem is one of the main justifications of the usefulness of the Gaussian probability density function.
Theorem 144 Let Y_1, ..., Y_n be a sequence of i.i.d. random variables with mean E[Y_i] = c \forall i and variance E[(Y_i - c)^2] = \sigma^2 < +\infty \forall i. Then, setting \bar{Y}_n = (1/n)(Y_1 + ... + Y_n), we have

\sqrt{n} (\bar{Y}_n - c) \to^L Y ~ N(0, \sigma^2), (5.14)

or equivalently

\sqrt{n} (\bar{Y}_n - c) / \sigma \to^L Y ~ N(0, 1). (5.15)

Proof. The proof relies on characteristic functions. Recall that the characteristic function \phi_Y(t) of a random variable Y is given by \phi_Y(t) = E[e^{itY}]. Moreover, we have seen that, when the moments of order k, denoted \mu_k, exist and the series converges, we have \phi_Y(t) = \sum_{k=0}^{\infty} (i^k \mu_k / k!) t^k. Consequently, for a random variable with zero mean and unit variance, the characteristic function admits the expansion \phi(t) = 1 + (i^2/2) t^2 + o(t^2) = 1 - (1/2) t^2 + o(t^2). Moreover, from a random variable Y with mean c and variance \sigma^2, one can build the random variable X = (Y - c)/\sigma with zero mean and unit variance. Let Y_1, ..., Y_n be a sequence of i.i.d. random variables with mean E[Y_i] = c \forall i and variance E[(Y_i - c)^2] = \sigma^2 < +\infty \forall i. Let \bar{Y}_n = (1/n)(Y_1 + ... + Y_n) and Z_n = \sqrt{n} (\bar{Y}_n - c)/\sigma. Then

Z_n = (\sqrt{n}/\sigma)((1/n)(Y_1 + ... + Y_n) - c) = (\sqrt{n}/(\sigma n))((Y_1 + ... + Y_n) - nc)
  = (1/(\sigma \sqrt{n})) \sum_{i=1}^{n} (Y_i - c) = \sum_{i=1}^{n} X_i / \sqrt{n}, (5.16)

where the X_i = (Y_i - c)/\sigma are random variables with zero mean and unit variance. We know that the characteristic function of a sum of independent random variables is equal to the product of the characteristic functions of the individual random variables, and that for a constant a, \phi_{aY}(t) = \phi_Y(at). Therefore, the characteristic function of the random variable Z_n is given by

\phi_{Z_n}(t) = \prod_{i=1}^{n} \phi_{X_i}(t/\sqrt{n}) = (\phi_{X_1}(t/\sqrt{n}))^n, (5.17)

since the Y_i, and hence the X_i, are assumed identically distributed. Therefore

\phi_{Z_n}(t) = (1 - t^2/(2n) + o(t^2/n))^n. (5.18)

Hence,

lim_{n \to \infty} \phi_{Z_n}(t) = e^{-t^2/2}, (5.19)

in which we recognize the characteristic function of a zero-mean Gaussian random variable with unit variance. The proof is completed by invoking Lévy's continuity theorem.
The central limit theorem also applies to sequences that are not identically distributed, provided one of a number of conditions holds (see Lyapunov's central limit theorem).

In practice, for example in the case of electronic noise, one can often regard a single measured value as the weighted average of a large number of small effects that are difficult to describe one by one. Using the central limit theorem, one can then see that this would often (though not always) produce a final distribution that is approximately Gaussian. This justifies the common use of this distribution to stand in for the effects of unobserved variables in models.
Part II
Random signals
Chapter 6
Introduction
As previously stated, most introductions to signals and systems deal strictly with deterministic signals. Each value of these signals is fixed and can be determined by a mathematical expression, rule, or table. Because of this, future values of any deterministic signal can be calculated from past values. For this reason, these signals are relatively easy to analyze as they do not change, and we can make accurate assumptions about their past and future behavior. Unlike deterministic signals, stochastic signals, or random signals, are not so nice. Random signals cannot be characterized by a simple, well-defined mathematical equation, and their future values cannot be well predicted. Rather, we must use the previously introduced probability and statistics to analyze their behavior. Signals in radar, sonar, audio, communication, control systems and biomedical engineering are a few examples.

Just as a deterministic signal can be represented both in the time domain and in the frequency domain (by way of the Fourier transform), we will present, in this part, the temporal and the frequency, or spectral, representation of a random signal. We will see the strong connection between random vectors and random signals and introduce the moments of order one and two (the so-called correlation function) of a random signal. Just as deterministic signals can be classified (for example, periodic signals, real or complex signals, etc.), we will introduce the main classes of random signals. Finally, we will discuss models of random signals, which are often used to model and predict various types of natural phenomena.
Chapter 7
Temporal representations
7.1 Stochastic processes
In order to study random signals, we want to look at a collection of these signals rather than just
one instance of that signal. This collection of signals is called a stochastic process.
Definition 145 A stochastic process is a collection of random variables x(\omega, t) such that

x(\omega, t) : (\Omega, T, P) \times R (or Z) \to R (or R^N or C^N), (7.1)

where \omega is the outcome of an event and t a parametrization of the collection (the time in the case of random signals).

For simplicity, we will use the simplified notation x(t) = x(\omega, t). With the previous definition, it is clear that all the random variables x(t) of a stochastic process can be put in a vector, which becomes a random vector

x = [ ...  x(-1)  x(0)  x(1)  ... ]^T. (7.2)

Consequently, a random signal is completely characterized by the joint PDF, f_x(x), of the vector x. Unfortunately, in general, this joint PDF is impossible to obtain, because only one realization is available and/or only a part of the signal is available. The solution to this problem is to use the previously introduced moments (which are able to characterize a random variable with a small number of parameters) of the joint PDF to obtain the characteristics of the studied random signal.
7.2 Mean and correlation function

Definition 146 The mean of a random signal is defined as

m(t) = E[x(t)], \forall t, (7.3)

which is exactly the definition of the mean of x.

Definition 147 If E[x(t)] = 0 \forall t, the signal is said to be centered.

Definition 148 For two time instants t_1 and t_2, the correlation (or covariance) function \Gamma_x(t_1, t_2) of a real random signal x(t) is given by

\Gamma_x(t_1, t_2) = E[(x(t_1) - E[x(t_1)])(x(t_2) - E[x(t_2)])]. (7.4)

We recognize a particular element of the covariance matrix. If the studied random signal is complex, the definition is modified as follows:

\Gamma_x(t_1, t_2) = E[(x(t_1) - E[x(t_1)])(x(t_2) - E[x(t_2)])^*]. (7.5)
The properties of the correlation function are:

- \Gamma_x(t_1, t_2) = (\Gamma_x(t_2, t_1))^*.

- |\Gamma_x(t_1, t_2)|^2 \leq \Gamma_x(t_1, t_1) \Gamma_x(t_2, t_2). The equality holds if t_1 = t_2.

- If t_1 = t_2, the correlation function \Gamma_x(t_1, t_1) = E[(x(t_1) - E[x(t_1)])^2] represents the variance of the random signal at time t_1.

The mean and the correlation function of a random signal are useful to characterize it. The correlation function is a numerical measure of the randomness of a signal. If \Gamma_x(t_1, t_2) = 0 \forall t_1 \neq t_2, this means that x(t_1) and x(t_2) are uncorrelated; consequently, the signal is "strongly" random.
Chapter 8
Main classes of random signals
In this chapter, we present some particular classes of random signals which appear in nature.
8.1 Stationary signals

Definition 149 A random signal is second-order stationary if the two following conditions are satisfied:

m(t) = m, constant \forall t, (8.1)

and

\Gamma_x(t_1, t_2) = \Gamma_x(t_1 - t_2). (8.2)

In other words, the moments of order one and two do not depend on the time, but only on the delay \tau = t_1 - t_2.

If a signal is stationary, the properties of the correlation function become:

- \Gamma_x(\tau) = (\Gamma_x(-\tau))^* if the signal is complex, and \Gamma_x(\tau) = \Gamma_x(-\tau) if the signal is real (the correlation function is then even).

- |\Gamma_x(\tau)| \leq \Gamma_x(0). The equality holds if \tau = 0.

- \Gamma_x(0) is the variance (or the power) of the random signal.

The correlation function of a stationary signal can be simplified as follows: \Gamma_x(\tau) = E[x(t_1) x^*(t_2)] - |m|^2.

The consequences of stationarity are very important. Stationarity means that the mean and the covariance of the signal do not depend on the time. Consequently, if several realizations of the signal can be recorded during a finite (large enough) duration, one can estimate these two main properties.
Example 150 An example of a stationary signal is given by x(t) = e^{i 2\pi (f_0 t + \phi)}, where f_0 is a constant and \phi is a uniform random variable U(0, 1). Indeed, m(t) = E[x(t)] = E[e^{i 2\pi (f_0 t + \phi)}] = e^{i 2\pi f_0 t} E[e^{i 2\pi \phi}], with

E[e^{i 2\pi \phi}] = \int_R e^{i 2\pi \phi} f_\phi(\phi) d\phi = \int_0^1 e^{i 2\pi \phi} d\phi = 0.

Then m(t) = 0 = constant \forall t. Moreover,

\Gamma_x(t_1, t_2) = E[x(t_1) (x(t_2))^*] = E[e^{i 2\pi (f_0 t_1 + \phi)} e^{-i 2\pi (f_0 t_2 + \phi)}] = e^{i 2\pi f_0 (t_1 - t_2)}

depends only on \tau = t_1 - t_2.
8.2 Ergodic signals

Definition 151 A continuous-time random signal is ergodic if the two following conditions are satisfied:

lim_{T \to \infty} (1/(2T)) \int_{-T}^{T} x(t) dt = E[x(t)] = m, (8.3)

and

lim_{T \to \infty} (1/(2T)) \int_{-T}^{T} x(t) x^*(t - \tau) dt = \Gamma_x(t, t - \tau). (8.4)

For a discrete-time signal x_k, k \in Z, the two conditions are modified as follows:

lim_{T \to \infty} (1/(2T)) \sum_{k=-T}^{T} x_k = E[x_k] = m, (8.5)

and

lim_{T \to \infty} (1/(2T)) \sum_{k=-T}^{T} x_k x^*_{k-n} = \Gamma_x(n). (8.6)

Theorem 152 Ergodicity implies stationarity.

Proof. lim_{T \to \infty} (1/(2T)) \int_{-T}^{T} x(t) dt and lim_{T \to \infty} (1/(2T)) \int_{-T}^{T} x(t) x^*(t - \tau) dt are independent of t.

The converse of the last theorem is false.

The consequences of ergodicity are also very important. Ergodicity means that the mean and the covariance of the signal do not depend on the time (stationarity), that the statistical mean and the temporal mean are equal, and that the correlation function and the temporal correlation are equal. Consequently, only one realization of the signal has to be recorded during a finite (large enough) duration to know these two main properties.
Example 153 Let us now check whether the previously studied stationary signal is also ergodic. We have x(t) = e^{i 2\pi (f_0 t + \phi)}. Then

\int_{-T}^{T} x(t) dt = \int_{-T}^{T} e^{i 2\pi (f_0 t + \phi)} dt = e^{i 2\pi \phi} [ e^{i 2\pi f_0 t} / (i 2\pi f_0) ]_{-T}^{T} = (1/(\pi f_0)) sin(2\pi f_0 T) e^{i 2\pi \phi}. (8.7)

Then,

lim_{T \to \infty} (1/(2T)) \int_{-T}^{T} x(t) dt = lim_{T \to \infty} (1/(2\pi T f_0)) sin(2\pi f_0 T) e^{i 2\pi \phi} = 0, (8.8)

since -1 \leq sin(2\pi f_0 T) \leq 1. The first condition is therefore satisfied. Concerning the second condition,

\int_{-T}^{T} x(t) x^*(t - \tau) dt = \int_{-T}^{T} e^{i 2\pi (f_0 t + \phi) - i 2\pi (f_0 (t - \tau) + \phi)} dt = e^{i 2\pi f_0 \tau} \int_{-T}^{T} 1 dt = 2T e^{i 2\pi f_0 \tau}. (8.9)

Then,

lim_{T \to \infty} (1/(2T)) \int_{-T}^{T} x(t) x^*(t - \tau) dt = e^{i 2\pi f_0 \tau} = \Gamma_x(\tau). (8.10)

Consequently, the second condition is also satisfied and the signal is ergodic.
8.3 Theoretical white noise

Definition 154 A discrete-time random signal is called a theoretical white noise if

\Gamma_x(\tau) = \sigma^2 if \tau = 0, and 0 otherwise, (8.11)

where \sigma^2 is called the power of the noise. Note that it is a stationary signal.

Definition 155 By extension, a continuous-time random signal is called a theoretical white noise if

\Gamma_x(\tau) = \sigma^2 \delta(\tau), (8.12)

where \delta(\tau) is the Dirac delta function.

Due to the structure of the correlation function, \Gamma_x(t_1, t_2) = 0 if t_1 \neq t_2, so, for any two different instants, the values x(t_1) and x(t_2) are uncorrelated. This means that this kind of signal is "strongly" random.
8.4 Gaussian processes

Gaussian random signals are the most important ones, from a theoretical point of view as well as for applications. As we will see, Gaussian signals are strongly connected to Gaussian random vectors, which are widely justified by the central limit theorem.

Definition 156 A signal is said to be Gaussian if the random vector x = [ ...  x(-1)  x(0)  x(1)  ... ]^T is Gaussian.
8.5 Poisson processes
To be updated.
8.6 Markov processes

In some situations, the evolution of a random signal does not depend on the full past, but just on a finite number of previous samples. Such a signal is called Markovian.

Definition 157 A random signal is a Markov signal (or process) of order N if its conditional PDF can be written

f_{X(t+1)|X(t), ..., X(1)}(x(t+1) | x(t), ..., x(1)) = f_{X(t+1)|X(t), ..., X(t-N)}(x(t+1) | x(t), ..., x(t-N)). (8.13)

The advantage of such signals is that their joint PDF can be simplified. For example, the joint PDF of a Markov signal of order one can be simplified as

f_{X(t), X(t-1), ..., X(1)}(x(t), x(t-1), ..., x(1))
  = f_{X(1)}(x(1)) f_{X(2)|X(1)}(x(2) | x(1)) f_{X(3)|X(1),X(2)}(x(3) | x(1), x(2)) ... f_{X(t)|X(t-1), ..., X(1)}(x(t) | x(t-1), ..., x(1))
  = f_{X(1)}(x(1)) f_{X(2)|X(1)}(x(2) | x(1)) f_{X(3)|X(2)}(x(3) | x(2)) ... f_{X(t)|X(t-1)}(x(t) | x(t-1))
  = f_{X(1)}(x(1)) \prod_{i=2}^{t} f_{X(i)|X(i-1)}(x(i) | x(i-1)). (8.14)

A Markov signal can be seen as a tradeoff between a mutually independent signal and a fully dependent signal.

An example of a Markov signal of order 1 is given by the following equation:

x(t) = a x(t-1) + u(t), (8.15)

where u(t) is a random signal and a is a constant.
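The recursion (8.15) is easy to simulate. The sketch below (assuming NumPy; a, the noise power, and the sample size are illustrative, with |a| < 1 so the process is stationary) generates such a signal and checks two standard properties of a stable AR(1) recursion, namely the stationary variance \sigma_u^2 / (1 - a^2) and the lag-1 correlation a:

```python
import numpy as np

rng = np.random.default_rng(6)
a, sigma_u, n = 0.9, 1.0, 100_000
u = rng.normal(0.0, sigma_u, size=n)
x = np.empty(n)
x[0] = u[0]
for t in range(1, n):
    x[t] = a * x[t - 1] + u[t]          # order-1 Markov recursion, Eq. (8.15)

# Standard results for a stable first-order recursion with white noise input:
var_emp = np.var(x[1_000:])             # discard the transient
var_th = sigma_u**2 / (1 - a**2)        # stationary variance
corr = np.corrcoef(x[1_000:-1], x[1_001:])[0, 1]   # lag-1 correlation, ~ a
```

Each sample depends on the past only through the previous sample, yet the signal is strongly correlated over time when a is close to 1.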
Chapter 9
Spectral representations
The power spectral density (PSD) is a positive real function of a frequency variable associated with a stationary stochastic process, or a deterministic function of time, which has dimensions of power per Hz, or energy per Hz. It is often called simply the spectrum of the signal. Intuitively, the power spectral density captures the frequency content of a stochastic process and helps identify periodicities. It is the counterpart, for random signals, of the Fourier transform of deterministic signals.
9.1 Power spectral density

In this section, in order to be rigorous, we use the notation x(\omega, t).

Consider the set of signals {x(\omega_i, t), i \in Z} obtained from a stationary stochastic process. We observe a particular realization x(\omega_i, t) of duration T. Let x(\omega_i, t, T) be this observation. We have

x(\omega_i, t, T) = x(\omega_i, t) \Pi_T(t), (9.1)

where \Pi_T(t) = 1 for t \in [-T/2, T/2] and 0 otherwise. Let X(\omega_i, f, T) be the Fourier transform of this particular signal. We have

X(\omega_i, f, T) = \int_R x(\omega_i, t, T) e^{-i 2\pi f t} dt = \int_{-T/2}^{T/2} x(\omega_i, t) e^{-i 2\pi f t} dt. (9.2)

The power spectrum of x(\omega_i, t, T) is given by

\Phi_x(\omega_i, f, T) = (1/T) |X(\omega_i, f, T)|^2. (9.3)

In order to take into account all the possible realizations of x(\omega_i, t, T), one takes the expectation (statistical mean) of \Phi_x(\omega_i, f, T):

\Phi_x(f, T) = E[\Phi_x(\omega_i, f, T)]. (9.4)

The power spectral density (PSD), denoted \Phi_x(f), of a random signal is then given by taking the limit of \Phi_x(f, T) when T \to \infty:

\Phi_x(f) = lim_{T \to \infty} E[\Phi_x(\omega_i, f, T)] = lim_{T \to \infty} (1/T) E[|X(\omega_i, f, T)|^2]. (9.5)

This formula, while interesting, can be useless in practice. Indeed, it can be difficult to obtain the PDF of X(\omega_i, f, T) in order to calculate the expectation. Fortunately, the Wiener-Khinchin theorem provides a simple alternative.
9.2 Wiener-Khinchin theorem

Theorem 158 The PSD \Phi_x(f) of a stationary random signal x(t) with correlation function \Gamma_x(\tau) is given by

\Phi_x(f) = TF{\Gamma_x(\tau)} = \int_R \Gamma_x(\tau) e^{-i 2\pi f \tau} d\tau. (9.6)

Proof. We start from the definition of the PSD given by Eqn. (9.5):

\Phi_x(f) = lim_{T \to \infty} (1/T) E[|X(\omega_i, f, T)|^2]. (9.7)

|X(\omega_i, f, T)|^2 can be rewritten as

|X(\omega_i, f, T)|^2 = X(\omega_i, f, T) X^*(\omega_i, f, T)
  = \int_R x(\omega_i, t, T) e^{-i 2\pi f t} dt \int_R x^*(\omega_i, t', T) e^{i 2\pi f t'} dt'
  = \int_R \int_R x(\omega_i, t, T) x^*(\omega_i, t', T) e^{-i 2\pi f (t - t')} dt dt'. (9.8)

We know that x(\omega_i, t, T) = x(\omega_i, t) \Pi_T(t), so

|X(\omega_i, f, T)|^2 = \int_R \int_R x(\omega_i, t) x^*(\omega_i, t') \Pi_T(t) \Pi_T(t') e^{-i 2\pi f (t - t')} dt dt'. (9.9)

Since the random part in the previous integral is x(\omega_i, t) x^*(\omega_i, t'), we have

E[|X(\omega_i, f, T)|^2] = \int_R \int_R E[x(\omega_i, t) x^*(\omega_i, t')] \Pi_T(t) \Pi_T(t') e^{-i 2\pi f (t - t')} dt dt'. (9.10)

Since we assume that x(\omega_i, t) is stationary, we have E[x(\omega_i, t) x^*(\omega_i, t')] = \Gamma_x(\tau), where \tau = t - t'. Then, with the change of variables t' = t - \tau,

E[|X(\omega_i, f, T)|^2] = \int_R \int_R \Gamma_x(\tau) \Pi_T(t) \Pi_T(t - \tau) e^{-i 2\pi f \tau} dt d\tau
  = \int_R \Gamma_x(\tau) e^{-i 2\pi f \tau} \int_R \Pi_T(t) \Pi_T(t - \tau) dt d\tau. (9.11)

The integral I(\tau) = \int_R \Pi_T(t) \Pi_T(t - \tau) dt is the convolution of the rectangle function with itself. If \tau \geq T or \tau \leq -T, we have I(\tau) = 0. If 0 \leq \tau \leq T, we have I(\tau) = T - \tau. If -T \leq \tau \leq 0, we have I(\tau) = T + \tau. Denoting by Tri_{2T}(\tau) the triangle function equal to 1 + \tau/T if -T \leq \tau \leq 0, 1 - \tau/T if 0 \leq \tau \leq T, and 0 otherwise, we have I(\tau) = T Tri_{2T}(\tau). Consequently,

(1/T) E[|X(\omega_i, f, T)|^2] = \int_R \Gamma_x(\tau) e^{-i 2\pi f \tau} Tri_{2T}(\tau) d\tau, (9.12)

and

\Phi_x(f) = lim_{T \to \infty} (1/T) E[|X(\omega_i, f, T)|^2] = \int_R \Gamma_x(\tau) e^{-i 2\pi f \tau} d\tau, (9.13)

since lim_{T \to \infty} Tri_{2T}(\tau) = 1.
The Wiener-Khinchin theorem provides a simple formula to calculate the PSD of a stationary random signal.

Example 159 The PSD of a continuous-time white noise x(t) is given by

\Phi_x(f) = TF{\Gamma_x(\tau)} = TF{\sigma^2 \delta(\tau)} = \sigma^2. (9.14)

Thus, a continuous-time white noise has a constant PSD at all frequencies (this is why we use the term "white", by analogy with white light).
9.3 Interference formula

Theorem 160 Let x_1(t) and x_2(t) be two possibly correlated, stationary signals. x_1(t) is the input of a linear filter with impulse response h_1(t) and x_2(t) is the input of a linear filter with impulse response h_2(t). Then the outputs of the filters, y_1(t) and y_2(t), are also stationary and we have

\Gamma_{y_1 y_2}(\tau) = (h_1(t) \star h_2^*(-t) \star \Gamma_{x_1 x_2}(t))(\tau), (9.15)

and

\Phi_{y_1 y_2}(f) = H_1(f) H_2^*(f) \Phi_{x_1 x_2}(f), (9.16)

which is called the interference formula, where

\Phi_{y_1 y_2}(f) = TF{\Gamma_{y_1 y_2}(\tau)} = TF{E[y_1(t) y_2^*(t - \tau)]}, (9.17)

\Phi_{x_1 x_2}(f) = TF{\Gamma_{x_1 x_2}(\tau)} = TF{E[x_1(t) x_2^*(t - \tau)]}, (9.18)

H_1(f) = TF{h_1(t)}, and H_2(f) = TF{h_2(t)}, (9.19)

(z_1(t) \star z_2(t))(\tau) = \int_R z_1(t) z_2(\tau - t) dt. (9.20)
Proof. The intercorrelation function \Gamma_{y_1 y_2}(t_1 - t_2) can be rewritten as

\Gamma_{y_1 y_2}(t_1 - t_2) = E[y_1(t_1) y_2^*(t_2)]
  = E[(x_1 \star h_1)(t_1) ((x_2 \star h_2)(t_2))^*]
  = E[ \int_R h_1(\theta_1) x_1(t_1 - \theta_1) d\theta_1 \int_R h_2^*(\theta_2) x_2^*(t_2 - \theta_2) d\theta_2 ]
  = \int_R \int_R h_1(\theta_1) h_2^*(\theta_2) E[x_1(t_1 - \theta_1) x_2^*(t_2 - \theta_2)] d\theta_1 d\theta_2
  = \int_R \int_R h_1(\theta_1) h_2^*(\theta_2) \Gamma_{x_1 x_2}(t_1 - \theta_1 - t_2 + \theta_2) d\theta_1 d\theta_2, (9.21)

where E[x_1(t_1 - \theta_1) x_2^*(t_2 - \theta_2)] = \Gamma_{x_1 x_2}(t_1 - \theta_1 - t_2 + \theta_2) since x_1(t) and x_2(t) are stationary signals. Note that

(h_1(\theta_1) \star \Gamma_{x_1 x_2}(\theta_1))(t_1 - t_2 + \theta_2) = \int_R h_1(\theta_1) \Gamma_{x_1 x_2}(t_1 - \theta_1 - t_2 + \theta_2) d\theta_1, (9.22)

then

\Gamma_{y_1 y_2}(t_1 - t_2) = \int_R (h_1(\theta_1) \star \Gamma_{x_1 x_2}(\theta_1))(t_1 - t_2 + \theta_2) h_2^*(\theta_2) d\theta_2. (9.23)

Let us set \tau = t_1 - t_2 and g(\tau + \theta_2) = (h_1(\theta_1) \star \Gamma_{x_1 x_2}(\theta_1))(\tau + \theta_2). We have

\Gamma_{y_1 y_2}(\tau) = \int_R g(\tau + \theta_2) h_2^*(\theta_2) d\theta_2. (9.24)

With the change of variables \theta_2' = \theta_2 + \tau (d\theta_2' = d\theta_2), we obtain

\Gamma_{y_1 y_2}(\tau) = \int_R g(\theta_2') h_2^*(\theta_2' - \tau) d\theta_2'
  = (g(\theta_2') \star h_2^*(-\theta_2'))(\tau)
  = ((h_1(\theta_1) \star \Gamma_{x_1 x_2}(\theta_1))(\theta_2') \star h_2^*(-\theta_2'))(\tau)
  = (h_1(t) \star h_2^*(-t) \star \Gamma_{x_1 x_2}(t))(\tau), (9.25)

which proves the first relationship of the theorem. Concerning the interference formula, we have

\Phi_{y_1 y_2}(f) = TF{\Gamma_{y_1 y_2}(\tau)} = TF{(h_1(t) \star h_2^*(-t) \star \Gamma_{x_1 x_2}(t))(\tau)}
  = TF{h_1(t)} TF{h_2^*(-t)} TF{\Gamma_{x_1 x_2}(t)}. (9.26)

Since H_1(f) = TF{h_1(t)} and \Phi_{x_1 x_2}(f) = TF{\Gamma_{x_1 x_2}(\tau)}, it only remains to study TF{h_2^*(-t)}:

TF{h_2^*(-t)} = \int_{-\infty}^{+\infty} h_2^*(-t) e^{-i 2\pi f t} dt. (9.27)

By using the change of variable t' = -t (dt' = -dt), we obtain

TF{h_2^*(-t)} = \int_{-\infty}^{+\infty} h_2^*(t') e^{i 2\pi f t'} dt' = ( \int_{-\infty}^{+\infty} h_2(t') e^{-i 2\pi f t'} dt' )^* = (TF{h_2(t)})^* = H_2^*(f). (9.28)

Finally,

\Phi_{y_1 y_2}(f) = H_1(f) H_2^*(f) \Phi_{x_1 x_2}(f). (9.29)
Remark 161 If x_1(t) = x_2(t) = x(t) and h_1(t) = h_2(t) = h(t), the interference formula simplifies as follows:

\Phi_y(f) = |H(f)|^2 \Phi_x(f). (9.30)

The interference formula tells us how the PSD of a signal is modified at the output of a linear filter.
Chapter 10
Random signal models
In some applications, we want to reproduce (create a copy of) a signal in order to analyze it, but also to predict its future behavior. One can cite the problems of financial market analysis or some problems of speech processing (see practical work 2).

These kinds of signals are generally complex and cannot be modeled by a simple Gaussian signal or a white noise.

The goal of this section is therefore to present some very useful models in signal processing able to represent these complex signals. We restrict our analysis to stationary signals. Statistical models able to describe non-stationary signals are out of the scope of this document.

Moreover, as we will see, these models depend on parameters p and q which have to be chosen. Algorithms to find these parameters can be found in the literature (see model order selection, Akaike criterion, etc.), but this study is out of the scope of this document.
10.1 Autoregressive processes

Definition 162 An autoregressive process of order p, denoted AR(p), is defined as follows:

x_n = a_1 x_{n-1} + a_2 x_{n-2} + ... + a_p x_{n-p} + u_n = \sum_{i=1}^{p} a_i x_{n-i} + u_n, (10.1)

where u_n is a centered white noise and the coefficients a_1, a_2, ..., a_p are left to the user.

This model means that we assume the signal at time n is a weighted function of the p previous instants plus some noise.

Note that an AR(p) process can be seen as the output of a discrete-time linear filter, with input u_n, described by its transfer function H(z). Indeed, the z-transform of Eqn. (10.1) leads to

X(z) = a_1 z^{-1} X(z) + a_2 z^{-2} X(z) + ... + a_p z^{-p} X(z) + U(z). (10.2)

Consequently, the transfer function of the filter generating an AR(p) process is

H(z) = X(z)/U(z) = 1 / (1 - \sum_{i=1}^{p} a_i z^{-i}), (10.3)

which can be studied (stability, ...) with the classical tools of discrete-time filter theory. Some constraints on the values of the parameters of this model are necessary in order for the model to remain stationary. For example, processes in the AR(1) model with |a_1| \geq 1 are not stationary.
Theorem 163 For a given signal x_n, n = 1, ..., N, with N \geq p, the coefficients a_1, a_2, ..., a_p are given by

a = M^{-1} \gamma, (10.4)

where

a = [ a_1  a_2  ...  a_p ]^T, (10.5)

\gamma = [ \Gamma_x(1)  \Gamma_x(2)  ...  \Gamma_x(p) ]^T, (10.6)

and where M is the Toeplitz matrix

M = [ \Gamma_x(0)      \Gamma_x(1)    ...  \Gamma_x(p-1) ;
      \Gamma_x(1)      \Gamma_x(0)    ...  \Gamma_x(p-2) ;
      ...                                              ;
      \Gamma_x(p-1)    \Gamma_x(p-2)  ...  \Gamma_x(0)   ]. (10.7)

This equation is called the Yule-Walker equation. Consequently, the coefficients are completely described by the correlation function \Gamma_x(k), which can easily be estimated.
Proof. Since x_n = \sum_{i=1}^{p} a_i x_{n-i} + u_n, the correlation function \Gamma_x(k) = E[x_n x_{n-k}] is given by

\Gamma_x(k) = E[ ( \sum_{i=1}^{p} a_i x_{n-i} + u_n ) x_{n-k} ]
  = \sum_{i=1}^{p} a_i E[x_{n-i} x_{n-k}] + E[u_n x_{n-k}]
  = \sum_{i=1}^{p} a_i \Gamma_x(k - i) + E[u_n x_{n-k}]. (10.8)

Note that E[u_n x_{n-k}] = 0 \forall k > 0. Indeed,

x_1 = u_1,
x_2 = a_1 x_1 + u_2 = a_1 u_1 + u_2,
x_3 = a_1 x_2 + a_2 x_1 + u_3 = (a_1^2 + a_2) u_1 + a_1 u_2 + u_3,
...

Consequently, x_n can always be written as x_n = \sum_{i=1}^{n-1} b_i u_i + u_n, where the b_i are functions of the coefficients a_1, a_2, ..., a_p. It comes

E[u_n x_{n-k}] = E[u_{n+k} x_n] = E[ u_{n+k} ( \sum_{i=1}^{n-1} b_i u_i + u_n ) ] (10.9)
  = \sum_{i=1}^{n-1} b_i E[u_i u_{n+k}] + E[u_{n+k} u_n] = 0, (10.10)

if k > 0, since u_n is a white noise. Consequently, we have \Gamma_x(k) = \sum_{i=1}^{p} a_i \Gamma_x(k - i) for k = 1, ..., p, or, with matrix notations,

\gamma = M a. (10.11)
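The Yule-Walker procedure is easy to carry out on data. The sketch below (assuming NumPy; the AR(2) coefficients and sample size are illustrative, chosen so the process is stable) simulates an AR(2) signal, estimates \Gamma_x(k) empirically, and solves (10.4):

```python
import numpy as np

rng = np.random.default_rng(9)
a_true = np.array([0.6, -0.3])             # a stable AR(2)
n = 100_000
u = rng.normal(size=n)
x = np.empty(n)
x[0], x[1] = u[0], u[1]
for t in range(2, n):
    x[t] = a_true[0] * x[t - 1] + a_true[1] * x[t - 2] + u[t]

def gamma(k):
    # empirical estimate of the correlation function Gamma_x(k)
    return np.mean(x[: n - k] * x[k:])

p = 2
g = np.array([gamma(k) for k in range(p + 1)])
M = np.array([[g[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz (10.7)
a_hat = np.linalg.solve(M, g[1 : p + 1])   # Yule-Walker solution, Eq. (10.4)
```

The recovered coefficients a_hat match a_true up to O(1/\sqrt{n}) estimation error, illustrating that the AR coefficients are fully determined by the correlation function.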
10.2 Moving average processes
Denition 164 A moving average process of order p, denoted MA(q) , is dened as follows,
x
n
= b
0
u
n
+b
1
u
n1
+ +b
q
u
nq
=
q

i=0
b
i
u
ni
, (10.12)
where u
n
is a centered white noise and the coecients b
0
, b
1
, b
2
, . . . , b
q
are left to the user.
Statistical signal processing 54
10.3. AUTOREGRESSIVE MOVING AVERAGE PROCESSES 55
This model means that we assume the signal at time n is a weighted function of the q+1 most recent noise samples. That is, a moving average model is conceptually a linear regression of the current value of the series against the white noise or random shocks of one or more prior values of the series.
Note that a MA(q) process can be seen as the output of a (discrete-time) linear filter with input u_n, described by its transfer function H(z). Indeed, the z-transform of Eqn. (10.12) leads to

X(z) = b_0 U(z) + b_1 z^{−1} U(z) + b_2 z^{−2} U(z) + ··· + b_q z^{−q} U(z).  (10.13)

Consequently, the transfer function of the filter generating a MA(q) process is

H(z) = X(z)/U(z) = Σ_{i=0}^q b_i z^{−i},  (10.14)

which is the transfer function of a finite impulse response filter. So, this filter is always stable.
Unfortunately, the coefficients b_0, b_1, b_2, ..., b_q of MA processes cannot be computed by solving a linear system of equations, as the coefficients a_1, a_2, ..., a_p of AR processes can. Indeed, the correlation function of a MA process is given by

γ_x(n − m) = E[x_n x_m].  (10.15)

From the definition, it is clear that γ_x(n − m) is a sum of terms of the form E[u_n u_m] = σ² if n = m and 0 otherwise. Consequently,

γ_x(n − m) = 0 if n − m > q,  (10.16)

and

γ_x(n − m) = (b_0 b_{n−m} + ··· + b_{q−(n−m)} b_q) σ² if 0 ≤ n − m ≤ q,  (10.17)

which is a non-linear system that can only be solved numerically (and the solution has no reason to be unique). Moreover, a MA process generally needs a large value of q to be useful.
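The cutoff property (10.16) — the correlation vanishes beyond lag q — is easy to check by simulation. A short sketch, assuming NumPy, with illustrative MA(2) coefficients not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
b = np.array([1.0, 0.6, -0.4])   # b_0, b_1, b_2: an MA(2), so q = 2
q, N = len(b) - 1, 200_000
u = rng.standard_normal(N + q)   # unit-variance white noise (sigma^2 = 1)

# x_n = sum_{i=0}^{q} b_i u_{n-i}
x = sum(b[i] * u[q - i : q - i + N] for i in range(q + 1))

def gamma(k):                    # sample correlation E[x_n x_{n-k}]
    return np.mean(x[k:] * x[:N - k]) if k > 0 else np.mean(x * x)

# Theory: gamma(1) = b_0 b_1 + b_1 b_2 = 0.36, gamma(2) = b_0 b_2 = -0.4,
# and gamma(k) = 0 for every k > q
print([round(gamma(k), 3) for k in range(5)])
```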
10.3 Autoregressive moving average processes
Definition 165 An autoregressive moving average process, denoted ARMA(p, q), is the combination of an AR process of order p and of a MA process of order q. In other words,

x_n = Σ_{i=1}^p a_i x_{n−i} + Σ_{i=0}^q b_i u_{n−i},  (10.18)

where u_n is a centered white noise and the coefficients a_1, a_2, ..., a_p, b_0, b_1, b_2, ..., b_q are left to the user.
ARMA is appropriate when a system is a function of a series of unobserved shocks (the MA part) as well as its own behavior (the AR part). For example, stock prices may be shocked by fundamental information as well as exhibiting technical trending and mean-reversion effects due to market participants. Note that an ARMA(p, q) process can be seen as the output of a (discrete-time) linear filter with input u_n, described by its transfer function H(z). Indeed, the z-transform of Eqn. (10.18) leads to

X(z) = a_1 z^{−1} X(z) + a_2 z^{−2} X(z) + ··· + a_p z^{−p} X(z) + b_0 U(z) + b_1 z^{−1} U(z) + b_2 z^{−2} U(z) + ··· + b_q z^{−q} U(z).  (10.19)

Consequently, the transfer function of the filter generating an ARMA(p, q) process is

H(z) = X(z)/U(z) = (Σ_{i=0}^q b_i z^{−i}) / (1 − Σ_{i=1}^p a_i z^{−i}),  (10.20)

which can be studied (stability, etc.) with the classical tools of discrete-time filter theory.
Due to the MA part in the model, one needs to solve a non-linear system of equations to find the coefficients a_1, a_2, ..., a_p, b_0, b_1, b_2, ..., b_q.
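Since (10.20) is a rational transfer function, generating an ARMA sequence amounts to filtering white noise with that filter. A sketch assuming SciPy is available (scipy.signal.lfilter implements B(z)/A(z) filtering; the ARMA(2,1) coefficients below are illustrative):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(2)
a = np.array([0.5, -0.2])   # AR coefficients a_1, a_2 (illustrative)
b = np.array([1.0, 0.4])    # MA coefficients b_0, b_1 (illustrative)
N = 10_000
u = rng.standard_normal(N)

# H(z) = B(z) / (1 - a_1 z^-1 - a_2 z^-2): lfilter's denominator is [1, -a_1, -a_2]
x = lfilter(b, np.concatenate(([1.0], -a)), u)

# The same recursion written out: x_n = a_1 x_{n-1} + a_2 x_{n-2} + b_0 u_n + b_1 u_{n-1}
x_ref = np.zeros(N)
for n in range(N):
    x_ref[n] = b[0] * u[n]
    if n >= 1:
        x_ref[n] += b[1] * u[n - 1] + a[0] * x_ref[n - 1]
    if n >= 2:
        x_ref[n] += a[1] * x_ref[n - 2]

print(np.max(np.abs(x - x_ref)))  # agreement up to rounding
```

Note the sign convention: with the text's definition (10.18), the denominator coefficients handed to the filter are 1, −a_1, ..., −a_p.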
Part III
Elements of estimation theory
Chapter 11
Introduction
Again, let us start with some definitions:
Estimation: to find (to estimate) some unknown quantities hidden in a random signal.
Parameters: the unknown quantities wanted by a user. In this part the parameter will be denoted θ.
Observations: one or several realizations of the random signal. In this part the observation will be denoted by y(t).
Observations model: a mathematical model taking into account the physics of the problem and/or a statistical modelization (for example of the noise or interferences).
Estimator: an algorithm to estimate the parameters. In this part the estimator will be denoted θ̂.
Estimation performance: criteria to quantify whether the estimator is good or not.
The entire purpose of estimation theory is to arrive at an estimator, and preferably an implementable one that could actually be used. These are the general steps to arrive at an estimator:
In order to arrive at a desired estimator for estimating a single or multiple parameters, it is first necessary to determine a model for the system. This model should incorporate the process being modeled as well as points of uncertainty and noise. The model describes the physical scenario in which the parameters apply.
After deciding upon a model, it is helpful to find the limitations placed upon an estimator. This limitation, for example, can be found through the Cramér-Rao bound.
Next, an estimator needs to be developed, or an already known estimator applied if it is valid for the model. The estimator needs to be tested against the limitations to determine if it is an optimal estimator (if so, then no other estimator will perform better).
Finally, experiments or simulations can be run using the estimator to test its performance.
After arriving at an estimator, real data might show that the model used to derive the estimator is incorrect, which may require repeating these steps to find a new estimator. A non-implementable or infeasible estimator may need to be scrapped and the process started anew.
In summary, the estimator estimates the parameters of a physical model based on measured data.
Note that an estimator is a function of the observable sample data, which will be modeled by using the techniques of the previous part. The observations will then be considered as random signals. Consequently, an estimator is a random variable or a random vector, and the performance of estimation can be studied by using classical probability tools (moments).
Chapter 12
Estimation of deterministic
parameters
In this chapter, the parameters that have to be estimated are assumed deterministic (but, of course, unknown). It means that we assume the true values of the parameters would not change if we were able to repeat the experiment.
12.1 Least Square estimator
12.1.1 Philosophy and example
The idea behind the Least-Square (LS) estimator is very simple. First, we consider that our observations y(t) at the output of the system can be written as follows:

y(t) = y_m(t, θ) + e(t), t = 1, ..., N,  (12.1)

where θ is a vector of parameters of interest, where y_m(t, θ) is the observation issued from a physical model without perturbation, and where e(t) is an error. Consequently, this model simply means that the observation at the output of a system is an observation without noise (that we are able to correctly model) plus an error. Note that this error is then given by

e(t, θ) = y(t) − y_m(t, θ),  (12.2)

where the dependence of the error on θ is explicitly shown.
Since we have N observations, we can use the following matrix form for this equation:

e(θ) = y − y_m(θ),  (12.3)

where

y = [y(1) y(2) ··· y(N)]^T,  (12.4)

and

y_m(θ) = [y_m(1, θ) y_m(2, θ) ··· y_m(N, θ)]^T.  (12.5)

The goal of the Least-Square estimator is to find the value of θ, denoted θ̂_LS, which minimizes the error e. We arbitrarily choose the squared-norm criterion e^T(θ) e(θ).

Definition 166 Consequently, the Least-Square estimator is given by

θ̂_LS = arg min_θ J_LS(θ),  (12.6)

where

J_LS(θ) = e^T(θ) e(θ) = Σ_{t=1}^N (y(t) − y_m(t, θ))².  (12.7)
Example 167 If y_m(t, θ) = θ, i.e., if the observations can be modeled as y(t) = θ + e(t), the LS criterion is then given by J_LS(θ) = Σ_{t=1}^N (y(t) − θ)². In order to calculate θ̂_LS = arg min_θ J_LS(θ), one can differentiate J_LS(θ) with respect to θ and see where this derivative is equal to 0; in other words, dJ_LS(θ)/dθ |_{θ̂_LS} = 0, since J_LS(θ) is a convex function. The derivative of J_LS(θ) is obtained easily as

dJ_LS(θ)/dθ = d/dθ Σ_{t=1}^N (y(t) − θ)² = Σ_{t=1}^N d/dθ (y(t) − θ)² = −2 Σ_{t=1}^N y(t) + 2Nθ.

Consequently, dJ_LS(θ)/dθ |_{θ̂_LS} = 0 leads to θ̂_LS = (1/N) Σ_{t=1}^N y(t), which is simply a sample temporal mean.
Note that in the last example, we found an explicit formula for θ̂_LS. It means that we do not have to perform a numerical minimization of the criterion J_LS(θ). Unfortunately, this is not always the case.
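When y_m(t, θ) is non-linear in θ, there is no closed form like the sample mean and J_LS(θ) has to be minimized numerically. A minimal sketch, assuming NumPy and a hypothetical model y_m(t, θ) = cos(θt) chosen purely for illustration, with a brute-force grid search:

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true = 0.7                       # hypothetical true parameter
t = np.arange(1, 51)                   # t = 1, ..., N with N = 50
y = np.cos(theta_true * t) + 0.1 * rng.standard_normal(t.size)

def J_LS(theta):                       # J_LS(theta) = sum_t (y(t) - y_m(t, theta))^2
    return np.sum((y - np.cos(theta * t)) ** 2)

# No explicit minimizer here: evaluate the criterion on a grid and pick the best
grid = np.linspace(0.0, 1.0, 10_001)
theta_ls = grid[np.argmin([J_LS(th) for th in grid])]
print(theta_ls)  # close to theta_true
```

A grid search is crude and in practice one would use a dedicated optimizer, but the point stands: θ̂_LS is still well defined as the minimizer of J_LS(θ), even without a closed form.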
12.1.2 Performance of the least-square estimator
Definition 168 From the previous definition, it is clear that the minimum error of the LS estimator is given by

|Minimum error|² = J_LS(θ̂_LS).  (12.8)

Example 169 In our previous example, y_m(t, θ) = θ, we found J_LS(θ) = Σ_{t=1}^N (y(t) − θ)² and θ̂_LS = (1/N) Σ_{t=1}^N y(t). Then,

J_LS(θ̂_LS) = Σ_{t=1}^N (y(t) − θ̂_LS)² = Σ_{t=1}^N y²(t) + N θ̂²_LS − 2 θ̂_LS Σ_{t=1}^N y(t) = Σ_{t=1}^N y²(t) − N θ̂²_LS = Σ_{t=1}^N y²(t) − (1/N) (Σ_{t=1}^N y(t))².

As we can see in this example, the minimum error of θ̂_LS is different from zero.

Remark 170 Note that the LS estimator does not include a statistical modelization of the observations and, particularly, of the noise. Consequently, it can be applied to a wide class of problems and is often called robust but non-optimal.
12.2 Likelihood function
Our goal is now to design optimal estimators (contrary to the LS estimator). As we will see, the term optimal is connected to a full statistical description of the observation model.
The likelihood function (often simply the likelihood) is a function of the parameters of a statistical model that plays a key role in statistical inference. In non-technical usage, "likelihood" is a synonym for "probability", but throughout this document only the technical definition is used. Informally, if "probability" allows us to predict unknown outcomes based on known parameters, then "likelihood" allows us to estimate unknown parameters based on known outcomes.

Definition 171 Given a parameterized family of probability density functions

y ↦ p(y|θ),  (12.9)

where θ is the parameter vector, the likelihood function is defined as

θ ↦ p(y|θ),  (12.10)

where y is the observed outcome of an experiment. In other words, when p(y|θ) is viewed as a function of y with θ fixed, it is a probability density function, and when viewed as a function of θ with y fixed, it is a likelihood function.
12.3 Estimation performance
Let us think about the following theoretical experiment. We are able to design two different estimators of a parameter θ, called θ̂_1 and θ̂_2. The question is: how do we know whether we should use θ̂_1 or θ̂_2 for our experiment? As previously stated, both of these estimators are based on the set of observations y(t), t = 1, ..., N. If the observation model giving y(t) takes into account a probabilistic model of the noise, y(t) can be considered as a random signal, and it follows that θ̂_1 and θ̂_2 are random variables (or random vectors in the multiple-parameter case). Consequently, both θ̂_1 and θ̂_2 can be described by the methods introduced in Part one and, particularly, the moments of θ̂_1 and θ̂_2 can be calculated. We will see in this section that these moments help us to answer this question.
Note that all the expectations in this section, and until the next chapter, are taken with respect to the likelihood function p(y|θ). In other words, E[g(y, θ)] = ∫_{R^N} g(y, θ) p(y|θ) dy.
12.3.1 Bias
Definition 172 The bias, b_θ̂(θ), of an estimator θ̂ of θ is defined as follows:

b_θ̂(θ) = E[θ̂] − θ.  (12.11)

In other words, the bias of an estimator is simply the difference between the moment of order one (the so-called mean) of this estimator and the true value of the parameter θ. If an estimator has a bias equal to zero, meaning that, on average, θ̂ is around θ, it is called an unbiased estimator. Generally, we are looking for estimators with small (or zero) bias.
The extension to the multiple-parameter case is trivial and leads to b_θ̂(θ) = E[θ̂] − θ, where b_θ̂(θ) is a vector.
12.3.2 Variance and mean square error
The bias is not the only criterion that we have to use in order to assess the performance of an estimator. Indeed, an estimator can be unbiased and still lead to a value far from the true value of the parameter for one realization of the signal. The tool previously introduced to analyse the dispersion of a random variable around its mean was called the variance. In the same way, we define the variance of an estimator as follows.

Definition 173 The variance, V_θ̂(θ), of an estimator θ̂ of θ is defined as follows:

V_θ̂(θ) = E[(θ̂ − E[θ̂])²].  (12.12)

Generally, we are looking for estimators with the smallest variance.
The extension to the multiple-parameter case is trivial and leads to

V_θ̂(θ) = E[(θ̂ − E[θ̂])(θ̂ − E[θ̂])^T],  (12.13)

where V_θ̂(θ) is a matrix (the so-called covariance matrix).
Note that an estimator can have a small variance and a huge bias, because the variance is the dispersion of an estimator around its own mean. In order to combine both criteria of bias and variance, we can define the mean square error of an estimator as follows.

Definition 174 The mean square error, MSE_θ̂(θ), of an estimator θ̂ of θ is defined as follows:

MSE_θ̂(θ) = E[(θ̂ − θ)²].  (12.14)

This criterion combines both bias and variance of an estimator since,

Theorem 175 The relationship between the bias, the variance and the MSE is given by

MSE_θ̂(θ) = V_θ̂(θ) + b²_θ̂(θ).  (12.15)

Proof. From the definition of the variance we have

V_θ̂(θ) = E[(θ̂ − E[θ̂])²] = E[(θ̂ − θ − b_θ̂(θ))²],  (12.16)

since b_θ̂(θ) = E[θ̂] − θ. This expression can be rewritten as

V_θ̂(θ) = E[(θ̂ − θ)² + b²_θ̂(θ) − 2 (θ̂ − θ) b_θ̂(θ)]
        = MSE_θ̂(θ) + b²_θ̂(θ) − 2 E[θ̂ − θ] b_θ̂(θ)
        = MSE_θ̂(θ) − b²_θ̂(θ).  (12.17)

The extension to the multiple-parameter case is trivial and leads to

MSE_θ̂(θ) = E[(θ̂ − θ)(θ̂ − θ)^T] = V_θ̂(θ) + b_θ̂(θ) b^T_θ̂(θ).  (12.18)

Since a "good" estimator has both small bias and small variance, it also has a small mean square error. Note that, due to the relationship between the bias, the variance and the mean square error, an unbiased estimator will sometimes have a higher mean square error than a biased one (i.e., with b(θ) ≠ 0). This kind of interesting problem is called a bias-variance tradeoff and will not be discussed here. In this document, we will focus only on unbiased estimators with minimum variance.
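Theorem 175 can be checked by Monte Carlo on a deliberately biased estimator. A sketch assuming NumPy; the shifted sample mean below is a hypothetical estimator chosen only to create a non-zero bias:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, N, trials = 2.0, 1.0, 10, 200_000

# trials independent experiments of N observations y(t) = theta + n(t)
y = theta + sigma * rng.standard_normal((trials, N))
theta_hat = y.mean(axis=1) + 0.5   # biased on purpose: b(theta) should be 0.5

bias = theta_hat.mean() - theta
var = theta_hat.var()
mse = np.mean((theta_hat - theta) ** 2)
print(bias, var, mse, var + bias**2)  # MSE = V + b^2, as in Eqn. (12.15)
```

The identity MSE = V + b² holds exactly for the empirical moments as well, so the last two printed numbers agree up to rounding.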
Example 176 Let us consider the following observation model:

y(t) = θ + n(t), t = 1, ..., N,  (12.19)

where we assume E[n(t)] = 0 ∀t and E[n(t_1) n(t_2)] = σ² δ(t_1 − t_2). We propose to study the performance, in terms of bias and variance, of the three following estimators:

θ̂_1 = y(1),  (12.20)

θ̂_2 = (1/N) Σ_{t=1}^N y(t),  (12.21)

and

θ̂_3 = (1/(N−1)) Σ_{t=1}^N y(t).  (12.22)

Let us start with the bias analysis. By definition, the bias, b_θ̂_1(θ), of θ̂_1 is given by

b_θ̂_1(θ) = E[θ̂_1] − θ.  (12.23)

We have

E[θ̂_1] = E[y(1)] = E[θ + n(1)] = θ + E[n(1)] = θ,  (12.24)

since E[n(t)] = 0 ∀t. Consequently, θ̂_1 is an unbiased estimator (b_θ̂_1(θ) = 0). Concerning the estimator θ̂_2, the bias, b_θ̂_2(θ), is given by

b_θ̂_2(θ) = E[θ̂_2] − θ = E[(1/N) Σ_{t=1}^N y(t)] − θ = (1/N) (Σ_{t=1}^N E[y(t)]) − θ = (1/N) (Σ_{t=1}^N (θ + E[n(t)])) − θ = (1/N) (Σ_{t=1}^N θ) − θ = 0.  (12.25)

Consequently, θ̂_2 is also an unbiased estimator. Concerning the estimator θ̂_3, the bias, b_θ̂_3(θ), is given by

b_θ̂_3(θ) = (N/(N−1)) θ − θ = θ/(N−1).  (12.26)

Consequently, θ̂_3 is a biased estimator since b_θ̂_3(θ) ≠ 0. Moreover, b_θ̂_3(θ) depends on θ and the bias cannot be removed. However, note that θ̂_3 is asymptotically unbiased (i.e., unbiased at the limit) since lim_{N→∞} b_θ̂_3(θ) = 0.
Since θ̂_1 and θ̂_2 are unbiased estimators while θ̂_3 is biased, we will only focus on the variances of θ̂_1 and θ̂_2.
Concerning θ̂_1, we have

V_θ̂_1(θ) = E[(θ̂_1 − E[θ̂_1])²] = E[(y(1) − θ)²] = E[n²(1)] = σ².  (12.27)

Concerning θ̂_2, we have

V_θ̂_2(θ) = E[(θ̂_2 − E[θ̂_2])²] = E[((1/N) Σ_{t=1}^N y(t) − θ)²] = E[((1/N) Σ_{t=1}^N (θ + n(t)) − θ)²] = E[((1/N) Σ_{t=1}^N n(t))²]
         = (1/N²) E[Σ_{t=1}^N Σ_{t'=1}^N n(t) n(t')] = (1/N²) Σ_{t=1}^N E[n²(t)] = (1/N²) Σ_{t=1}^N σ² = σ²/N,  (12.28)

since E[n(t_1) n(t_2)] = σ² δ(t_1 − t_2). Consequently, for N > 1, V_θ̂_2(θ) < V_θ̂_1(θ). Then, θ̂_2 is a better estimator than θ̂_1. Moreover, note that lim_{N→∞} V_θ̂_2(θ) = 0. It means that the longer the observation window, the better the estimation.
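The bias and variance results of Example 176 can be reproduced by Monte Carlo. A sketch assuming NumPy and Gaussian noise (the example itself only specifies the first two moments of n(t)):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, sigma, N, trials = 1.0, 2.0, 8, 300_000

y = theta + sigma * rng.standard_normal((trials, N))
t1 = y[:, 0]                   # theta_hat_1 = y(1)
t2 = y.mean(axis=1)            # theta_hat_2 = (1/N) sum_t y(t)
t3 = y.sum(axis=1) / (N - 1)   # theta_hat_3 = (1/(N-1)) sum_t y(t)

print(t1.mean() - theta, t2.mean() - theta, t3.mean() - theta)
# empirical biases: ~0, ~0, ~theta/(N-1), as in (12.24)-(12.26)
print(t1.var(), t2.var())
# empirical variances: ~sigma^2 and ~sigma^2/N, as in (12.27)-(12.28)
```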
12.3.3 Cramér-Rao bound
From the previous example, it is clear that we are able to compare several estimators of the same parameters to decide which is the best one. However, this does not tell us whether there exists another estimator with better performance — in other words, whether there exists an ultimate limit on the variance of any estimator. This limit is given, for the particular case of unbiased estimators, by the Cramér-Rao bound.

Definition 177 The scalar Cramér-Rao bound is given by

CRB(θ) = 1 / E[(∂ ln p(y|θ)/∂θ)²] = −1 / E[∂² ln p(y|θ)/∂θ²],  (12.29)

where E[(∂ ln p(y|θ)/∂θ)²] = −E[∂² ln p(y|θ)/∂θ²] is called the Fisher information and is assumed to exist.

Theorem 178 Cramér-Rao inequality. For any unbiased estimator θ̂ of a single parameter θ, if we have E[∂ ln p(y|θ)/∂θ] = 0 ∀θ, then

V_θ̂(θ) ≥ CRB(θ).  (12.30)

Proof. Let us first analyse the regularity condition E[∂ ln p(y|θ)/∂θ] = 0. We have

E[∂ ln p(y|θ)/∂θ] = ∫_{R^N} (∂ ln p(y|θ)/∂θ) p(y|θ) dy = ∫_{R^N} ∂p(y|θ)/∂θ dy.  (12.31)

If the integral operator and the derivative operator can be exchanged, we have

E[∂ ln p(y|θ)/∂θ] = ∂/∂θ ∫_{R^N} p(y|θ) dy = ∂1/∂θ = 0.  (12.32)

Consequently, the regularity condition holds if and only if the bounds of the integrals do not depend on θ (this assumption will be assumed to hold in the following).
To prove the Cramér-Rao inequality, we start from the fact that we are studying unbiased estimators, i.e.,

E[θ̂] = ∫_{R^N} θ̂ p(y|θ) dy = θ.  (12.33)

Then,

∂/∂θ ∫_{R^N} θ̂ p(y|θ) dy = ∂θ/∂θ = 1
⇒ ∫_{R^N} θ̂ (∂p(y|θ)/∂θ) dy = 1
⇒ ∫_{R^N} θ̂ (∂ ln p(y|θ)/∂θ) p(y|θ) dy = 1
⇒ ∫_{R^N} (θ̂ − θ) (∂ ln p(y|θ)/∂θ) p(y|θ) dy = 1,  (12.34)

since

∫_{R^N} θ (∂ ln p(y|θ)/∂θ) p(y|θ) dy = θ ∫_{R^N} ∂p(y|θ)/∂θ dy = 0.  (12.35)

Now, let us recall the Cauchy-Schwarz inequality. Let u(y), v(y), and w(y) be three integrable functions from R^N to R. Then, the Cauchy-Schwarz inequality states that

(∫_{R^N} u(y) v(y) w(y) dy)² ≤ ∫_{R^N} u(y) v²(y) dy ∫_{R^N} u(y) w²(y) dy.  (12.36)

By applying the Cauchy-Schwarz inequality to the left-hand side of Eqn. (12.34) with

u(y) = p(y|θ),  (12.37)
v(y) = θ̂ − θ,  (12.38)
w(y) = ∂ ln p(y|θ)/∂θ,  (12.39)

one obtains

(∫_{R^N} (θ̂ − θ) (∂ ln p(y|θ)/∂θ) p(y|θ) dy)² ≤ ∫_{R^N} (θ̂ − θ)² p(y|θ) dy ∫_{R^N} (∂ ln p(y|θ)/∂θ)² p(y|θ) dy.  (12.40)

Finally, since ∫_{R^N} (θ̂ − θ)(∂ ln p(y|θ)/∂θ) p(y|θ) dy = 1 and ∫_{R^N} (θ̂ − θ)² p(y|θ) dy = V_θ̂(θ), one obtains

V_θ̂(θ) ≥ 1 / E[(∂ ln p(y|θ)/∂θ)²].  (12.41)

To complete the proof, one has to show that E[(∂ ln p(y|θ)/∂θ)²] = −E[∂² ln p(y|θ)/∂θ²]. Since E[∂ ln p(y|θ)/∂θ] = 0, one has easily

E[∂ ln p(y|θ)/∂θ] = 0
⇒ ∂/∂θ E[∂ ln p(y|θ)/∂θ] = 0
⇒ ∫_{R^N} ∂/∂θ [(∂ ln p(y|θ)/∂θ) p(y|θ)] dy = 0
⇒ ∫_{R^N} (∂² ln p(y|θ)/∂θ²) p(y|θ) dy + ∫_{R^N} (∂ ln p(y|θ)/∂θ)² p(y|θ) dy = 0.  (12.42)
Definition 179 An (unbiased) estimator which achieves the Cramér-Rao bound is called efficient. If an estimator achieves the Cramér-Rao bound when the number of observations tends to infinity, it is called asymptotically efficient.
We will see in the next section how the Cramér-Rao bound is linked to an optimal estimator which is called the Maximum Likelihood estimator.
Example 180 Let us go back to the previous example. For the observation model y(t) = θ + n(t), t = 1, ..., N, where E[n(t)] = 0 ∀t and E[n(t_1) n(t_2)] = σ² δ(t_1 − t_2), we found that the estimator θ̂ = (1/N) Σ_{t=1}^N y(t) was unbiased and had a variance given by V_θ̂(θ) = σ²/N. Let us now calculate the Cramér-Rao bound for this problem. Here the full pdf of the observations has to be known in order to calculate p(y|θ). So, we will assume that n(t) is a Gaussian random process (note that we also implicitly assume that σ² is known, which is not always the case). Consequently, the likelihood function p(y(t)|θ) is Gaussian with mean θ and variance σ² (see Problem 6 of Part one). Due to the whiteness and the Gaussian behavior of n(t), the observations y(t) are independent. Then, using the notation y = [y(1) y(2) ··· y(N)]^T, we have

p(y|θ) = Π_{t=1}^N p(y(t)|θ) = (2πσ²)^{−N/2} exp(−(1/(2σ²)) Σ_{t=1}^N (y(t) − θ)²).  (12.43)

It follows that

ln p(y|θ) = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{t=1}^N (y(t) − θ)²,  (12.44)

∂ ln p(y|θ)/∂θ = −(1/(2σ²)) Σ_{t=1}^N ∂/∂θ (y(t) − θ)² = (1/σ²) Σ_{t=1}^N (y(t) − θ),  (12.45)

∂² ln p(y|θ)/∂θ² = −(1/σ²) Σ_{t=1}^N 1 = −N/σ².  (12.46)

Consequently,

CRB(θ) = −1 / E[∂² ln p(y|θ)/∂θ²] = −1 / (−N/σ²) = σ²/N,  (12.47)

which is equal to the variance of θ̂ = (1/N) Σ_{t=1}^N y(t). This means that θ̂ is an efficient estimator if n(t) is a Gaussian random process.
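The Fisher information N/σ² obtained above can also be verified by Monte Carlo, averaging the squared score (12.45) over many realizations. A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(6)
theta, sigma, N, trials = 0.0, 1.5, 20, 200_000

y = theta + sigma * rng.standard_normal((trials, N))

# Score d ln p(y|theta)/d theta = (1/sigma^2) sum_t (y(t) - theta), Eqn. (12.45)
score = (y - theta).sum(axis=1) / sigma**2

fisher_mc = np.mean(score**2)   # estimates E[(d ln p / d theta)^2]
print(fisher_mc, N / sigma**2)  # both ~ N / sigma^2 = 1 / CRB(theta)
```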
Theorem 181 In the case of several parameters to estimate, θ becomes a vector (θ ∈ R^p). An unbiased estimator θ̂ of θ is then characterized by its p × p covariance matrix

V_θ̂(θ) = E[(θ̂ − E[θ̂])(θ̂ − E[θ̂])^T].  (12.48)

The Cramér-Rao bound (inequality) also becomes a p × p matrix (inequality):

V_θ̂(θ) ⪰ I^{−1}(θ) = CRB(θ),  (12.49)

where I(θ) ∈ R^{p×p} is called the Fisher information matrix and where A ⪰ B means that the matrix A − B is non-negative definite. The elements of the Fisher information matrix are given by

[I(θ)]_{i,j} = −E[∂² ln p(y|θ) / (∂θ_i ∂θ_j)],  (12.50)

where θ_i, respectively θ_j, is the i-th element, respectively the j-th element, of θ. The Fisher information matrix can also be written in a matrix form as

I(θ) = −E[∂² ln p(y|θ) / (∂θ ∂θ^T)],  (12.51)

where the operator ∂/∂θ is defined as follows:

∂/∂θ = [∂/∂θ_1 ··· ∂/∂θ_p]^T.  (12.52)

Proof. To be updated.
Theorem 182 Slepian-Bang formula
12.4 Maximum Likelihood estimator
The maximum likelihood estimator is an optimal estimator (contrary to the LS estimator) because, to be applied, this method needs the full probability model. For a fixed set of data and underlying probability model, maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them. Maximum likelihood estimation gives a unique and easy way to determine the solution in the case of the normal distribution and many other problems, although in very complex problems this may not be the case.

Definition 183 By extension of the previous principle, for a parameter vector θ and an observation vector y, the maximum likelihood estimator is given by

θ̂_ML = arg max_θ p(y|θ).  (12.53)

Note that, for any increasing C¹ function f(x), one has the alternative estimator

θ̂_ML = arg max_θ f(p(y|θ)).  (12.54)

As we will see in the following, it can be very useful for certain problems to choose f(x) = ln(x).

Example 184 Again, let y(t) = θ + n(t), t = 1, ..., N, where n(t) is a white Gaussian noise with zero mean and known variance σ². As for the calculation of the Cramér-Rao bound, the log-likelihood of the observations is given by

ln p(y|θ) = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{t=1}^N (y(t) − θ)².  (12.55)

The maximum likelihood solution is then given by

∂ ln p(y|θ)/∂θ |_{θ̂_ML} = 0,  (12.56)

which is equivalent to

Σ_{t=1}^N (y(t) − θ̂_ML) = 0 ⇒ θ̂_ML = (1/N) Σ_{t=1}^N y(t).  (12.57)

This means that θ̂_ML = θ̂_LS if n(t) is a Gaussian random process.
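The closed-form solution (12.57) can be cross-checked by maximizing the log-likelihood (12.55) numerically. A sketch assuming NumPy, with a crude grid maximization:

```python
import numpy as np

rng = np.random.default_rng(7)
theta_true, sigma, N = 3.0, 1.0, 100
y = theta_true + sigma * rng.standard_normal(N)

def log_lik(theta):   # Eqn. (12.55)
    return (-N / 2 * np.log(2 * np.pi * sigma**2)
            - np.sum((y - theta) ** 2) / (2 * sigma**2))

grid = np.linspace(0.0, 6.0, 6_001)
theta_ml = grid[np.argmax([log_lik(th) for th in grid])]
print(theta_ml, y.mean())  # the numerical maximizer matches the sample mean
```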
12.5 Properties of the maximum likelihood estimator
The following theorems are very powerful results which show why the maximum likelihood estimator is called optimal.

Theorem 185 If an estimator is efficient, then it is the maximum likelihood estimator.

Proof. The cornerstone of the Cramér-Rao inequality proof is the Cauchy-Schwarz inequality

(∫_{R^N} u(y) v(y) w(y) dy)² ≤ ∫_{R^N} u(y) v²(y) dy ∫_{R^N} u(y) w²(y) dy,  (12.58)

where the equality holds if and only if v(y) = α w(y), where α is a constant — in other words, from a geometric point of view, when the vectors v(y) and w(y) are collinear. Moreover, we know that, if the equality holds, then the estimator is efficient. Consequently, since

v(y) = θ̂ − θ, and w(y) = ∂ ln p(y|θ)/∂θ,  (12.59)

an efficient estimator, θ̂, satisfies

θ̂ − θ = α ∂ ln p(y|θ)/∂θ.  (12.60)

On the other hand, we know that the maximum likelihood estimator is a solution of

∂ ln p(y|θ)/∂θ |_{θ̂_ML} = 0.  (12.61)

Consequently, by plugging Eqn. (12.61) into Eqn. (12.60) evaluated at θ = θ̂_ML, we obtain

θ̂ − θ̂_ML = α · 0 ⇒ θ̂ = θ̂_ML.  (12.62)

Note that, unfortunately, the converse is false. It means that the maximum likelihood estimator can be non-efficient (V_θ̂_ML(θ) > CRB(θ)).
The following result is often called asymptotic normality of the maximum likelihood estimator.
We now return to the context of parametric estimation. The observations coming from the model are modeled by a sequence of random variables Y_1, ..., Y_n. The likelihood function is denoted p(y_1, ..., y_n; θ_0), where y_i is a particular realization of the random variable Y_i, i = 1, ..., n, and where θ_0 denotes the true value of the parameter of interest (tied to the observation model). The maximum likelihood estimator, denoted θ̂_n = f(Y_1, ..., Y_n), satisfies the likelihood equation

d ln p(y_1, ..., y_n; θ)/dθ |_{θ̂_n} = 0.  (12.63)

We will assume the following:
1. The model is identifiable, that is, P_{θ_1} = P_{θ_2} ⇒ θ_1 = θ_2.
2. The parameter θ_0 lives in an open interval ]a, b[ of R, with a ≠ b.
3. The sequence of random variables Y_1, ..., Y_n is i.i.d. with a probability density assumed continuous. Consequently, p(y_1, ..., y_n; θ_0) = Π_{i=1}^n p(y_i; θ_0).
4. The set C = {y : p(y; θ_0) > 0} does not depend on θ_0.
5. ∀y ∈ C, p(y; θ) is three times differentiable with respect to θ, and the third derivative is continuous in θ.
6. Differentiation with respect to θ can be taken under the integral sign (integral over y).
7. There exist a positive number c(θ_0) and a function M_{θ_0}(y) such that

|d³ ln p(y; θ)/dθ³| ≤ M_{θ_0}(y) ∀y ∈ C, ∀θ with |θ − θ_0| < c(θ_0), and E[M_{θ_0}(y)] < ∞.  (12.64)

In this part we will write ln p(y_1, ..., y_n; θ) = L_n(θ). Consequently, L'_n(θ̂_n) = 0, where we define L'_n(θ̂_n) = d ln p(y_1, ..., y_n; θ)/dθ |_{θ̂_n}.
We will assume that θ̂_n → θ_0 (in fact, one can prove that, under the above conditions, θ̂_n →^P θ_0). A Taylor expansion (Taylor-Lagrange formula with derivative remainder) of L'_n(θ̂_n) around θ_0 gives

L'_n(θ̂_n) = L'_n(θ_0) + (θ̂_n − θ_0) L''_n(θ_0) + (1/2)(θ̂_n − θ_0)² L'''_n(θ̃_n),  (12.65)

where θ̃_n is a number strictly between θ̂_n and θ_0. Since we know that L'_n(θ̂_n) = 0,

L'_n(θ_0) + (θ̂_n − θ_0) L''_n(θ_0) + (1/2)(θ̂_n − θ_0)² L'''_n(θ̃_n) = 0  (12.66)

⇒ (θ̂_n − θ_0) [L''_n(θ_0) + (1/2)(θ̂_n − θ_0) L'''_n(θ̃_n)] = −L'_n(θ_0)  (12.67)

⇒ θ̂_n − θ_0 = −L'_n(θ_0) / [L''_n(θ_0) + (1/2)(θ̂_n − θ_0) L'''_n(θ̃_n)]  (12.68)

⇒ θ̂_n − θ_0 = −(1/n) L'_n(θ_0) / [(1/n) L''_n(θ_0) + (1/(2n))(θ̂_n − θ_0) L'''_n(θ̃_n)]  (12.69)

⇒ √n (θ̂_n − θ_0) = (1/√n) L'_n(θ_0) / [−(1/n) L''_n(θ_0) − (1/(2n))(θ̂_n − θ_0) L'''_n(θ̃_n)].  (12.70)
12.5.1 Study of (1/√n) L'_n(θ_0)
We first concentrate on the term

(1/√n) L'_n(θ_0) = (√n/n) d ln p(y_1, ..., y_n; θ)/dθ |_{θ=θ_0} = (√n/n) Σ_{i=1}^n d ln p(y_i; θ)/dθ |_{θ=θ_0}.  (12.71)

Now, by the weak law of large numbers,

(1/n) Σ_{i=1}^n d ln p(y_i; θ)/dθ |_{θ=θ_0} →^P E[d ln p(Y_i; θ)/dθ |_{θ=θ_0}].  (12.72)

Moreover,

E[d ln p(Y_i; θ)/dθ |_{θ=θ_0}] = ∫ (d ln p(y_i; θ)/dθ |_{θ=θ_0}) p(y_i; θ_0) dy_i = ∫ dp(y_i; θ)/dθ |_{θ=θ_0} dy_i = d/dθ ∫ p(y_i; θ) dy_i |_{θ=θ_0} = 0.  (12.73)

If we now use the central limit theorem, then

√n [(1/n) Σ_{i=1}^n d ln p(y_i; θ)/dθ |_{θ=θ_0}] →^L N(0, Var(d ln p(y_i; θ)/dθ |_{θ=θ_0})).  (12.74)

Now,

Var(d ln p(y_i; θ)/dθ |_{θ=θ_0}) = E[(d ln p(Y_i; θ)/dθ |_{θ=θ_0})²] = F(θ_0),  (12.75)

where we recognize the Fisher information F(θ_0). Consequently,

(1/√n) L'_n(θ_0) →^L N(0, F(θ_0)).  (12.76)
12.5.2 Study of (1/n) L''_n(θ_0)
Let us now detail

−(1/n) L''_n(θ_0) = −(1/n) d² ln p(y_1, ..., y_n; θ)/dθ² |_{θ=θ_0}
= −(1/n) Σ_{i=1}^n d² ln p(y_i; θ)/dθ² |_{θ=θ_0}
= −(1/n) Σ_{i=1}^n d/dθ [(1/p(y_i; θ)) dp(y_i; θ)/dθ] |_{θ=θ_0}
= −(1/n) Σ_{i=1}^n [d/dθ (1/p(y_i; θ)) · dp(y_i; θ)/dθ + (1/p(y_i; θ)) d²p(y_i; θ)/dθ²] |_{θ=θ_0}
= −(1/n) Σ_{i=1}^n [−(1/p²(y_i; θ)) (dp(y_i; θ)/dθ)² + (1/p(y_i; θ)) d²p(y_i; θ)/dθ²] |_{θ=θ_0}
= (1/n) Σ_{i=1}^n [(1/p(y_i; θ)) dp(y_i; θ)/dθ |_{θ=θ_0}]² − (1/n) Σ_{i=1}^n (1/p(y_i; θ)) d²p(y_i; θ)/dθ² |_{θ=θ_0}.  (12.77)

Now, by the weak law of large numbers,

(1/n) Σ_{i=1}^n [(1/p(y_i; θ)) dp(y_i; θ)/dθ |_{θ=θ_0}]² →^P E[((1/p(y_i; θ)) dp(y_i; θ)/dθ |_{θ=θ_0})²] = F(θ_0),  (12.78)

and

(1/n) Σ_{i=1}^n (1/p(y_i; θ)) d²p(y_i; θ)/dθ² |_{θ=θ_0} →^P E[(1/p(y_i; θ)) d²p(y_i; θ)/dθ²] = 0.  (12.79)

Consequently,

−(1/n) L''_n(θ_0) →^P F(θ_0).  (12.80)
12.5.3 Study of (1/n) L'''_n(θ̃_n)
The last term can be written as

(1/n) L'''_n(θ̃_n) = (1/n) d³ ln p(y_1, ..., y_n; θ)/dθ³ |_{θ=θ̃_n} = (1/n) Σ_{i=1}^n d³ ln p(y_i; θ)/dθ³ |_{θ=θ̃_n}.  (12.81)

Hence, by hypothesis 7,

|(1/n) L'''_n(θ̃_n)| < (1/n) Σ_{i=1}^n M_{θ_0}(y_i),  (12.82)

when |θ̃_n − θ_0| < c(θ_0). By the weak law of large numbers,

(1/n) Σ_{i=1}^n M_{θ_0}(y_i) →^P E[M_{θ_0}(y)].  (12.83)

Hence, (1/n) L'''_n(θ̃_n) is bounded in probability.
12.5.4 End of the proof
Recall that

√n (θ̂_n − θ_0) = (1/√n) L'_n(θ_0) / [−(1/n) L''_n(θ_0) − (1/(2n))(θ̂_n − θ_0) L'''_n(θ̃_n)].  (12.84)

Since (1/n) L'''_n(θ̃_n) is bounded in probability, θ̂_n →^P θ_0 and −(1/n) L''_n(θ_0) →^P F(θ_0), we have

(1/(2n))(θ̂_n − θ_0) L'''_n(θ̃_n) →^P 0,  (12.85)

and

−(1/n) L''_n(θ_0) − (1/(2n))(θ̂_n − θ_0) L'''_n(θ̃_n) →^P F(θ_0).  (12.86)

Moreover, since (1/√n) L'_n(θ_0) →^L N(0, F(θ_0)), we obtain

√n (θ̂_n − θ_0) →^L N(0, 1/F(θ_0)),  (12.87)

where 1/F(θ_0) is the Cramér-Rao bound. The maximum likelihood estimator is therefore asymptotically unbiased and asymptotically efficient.
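The limiting behavior √n(θ̂_n − θ_0) → N(0, 1/F(θ_0)) can be visualized by Monte Carlo. A sketch assuming NumPy and i.i.d. exponential observations with rate θ_0 (an illustrative model, not from the text): the MLE is θ̂_n = n / Σ y_i, F(θ_0) = 1/θ_0², so the limiting standard deviation is θ_0.

```python
import numpy as np

rng = np.random.default_rng(8)
theta0, n, trials = 2.0, 500, 20_000

# MLE of the rate of an exponential distribution: theta_hat = n / sum(y_i)
y = rng.exponential(scale=1 / theta0, size=(trials, n))
theta_hat = 1.0 / y.mean(axis=1)

z = np.sqrt(n) * (theta_hat - theta0)
print(z.mean(), z.std())  # ~ N(0, theta0^2): mean near 0, std near theta0
```

With finite n a small residual bias remains (of order θ_0/n before rescaling), consistent with the estimator being only asymptotically unbiased.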
Theorem 186 Let us consider the following general observation model: y = f(θ) + n, where θ ∈ R^P is a deterministic vector of parameters, where y ∈ R^N, with N > P, is the observation vector, where f is a set of known C^∞ functions from R^P to R^N, and where n is a Gaussian random vector with zero mean and known covariance matrix σ²I. Then, when σ² → 0 (i.e., when the signal-to-noise ratio is large), we have

θ̂_ML ∼ N(θ, CRB(θ)).  (12.88)

Proof. The proof of this theorem is out of the scope of this lecture.
Chapter 13
Estimation of random parameters
13.1 Estimation performance
To be updated.
13.1.1 Local mean square error versus global mean square error
To be updated.
13.1.2 Bayesian Cramér-Rao bound
To be updated.
13.2 Minimum Mean Square Error estimator
To be updated.
13.3 Maximum A Posteriori estimator
To be updated.
Part IV
Exercises
Chapter 14
Problems
14.1 σ-algebras and Borel σ-algebra
Let Ω = {ω_1, ω_2, ω_3}; then

P(Ω) = {∅, {ω_1}, {ω_2}, {ω_3}, {ω_1, ω_2}, {ω_2, ω_3}, {ω_1, ω_3}, {ω_1, ω_2, ω_3}}.

Prove that P(Ω) is a σ-algebra.
14.2 Independence of events
Prove that if two events A and B (defined on the same probability space) are statistically independent, then (i) A and B̄ are statistically independent, (ii) Ā and B are statistically independent, (iii) Ā and B̄ are statistically independent.
What is the probability of an event that is independent of itself?
14.3 Random variable
A random variable X has a probability density function represented in Figure 14.1.
[Figure 14.1: Probability density function of X.]
1. Calculate the value of c.
2. Calculate the probability Pr(−1.5 ≤ X ≤ 1.5).
14.4 Correlation and independence
Let (X, Y) be a couple of real random variables with joint uniform density defined as follows:

f_{X,Y}(x, y) = α if |x| + |y| ≤ 1, and 0 otherwise.  (14.1)

1. Calculate the constant α.
2. Calculate the marginal pdfs f_X(x) and f_Y(y).
3. Calculate E[X] and E[Y].
4. Calculate E[XY] and the correlation coefficient ρ_{X,Y}.
5. Are X and Y independent?
Let (X, Y) be a couple of real random variables with joint uniform density defined as follows:

f_{X,Y}(x, y) = α if 0 ≤ x ≤ a and 0 ≤ y ≤ a, and 0 otherwise.  (14.2)

6. Calculate the constant α.
7. Calculate the marginal pdfs f_X(x) and f_Y(y).
8. Are X and Y independent?
9. Calculate the density of the random variable Z = X + Y.
14.5 Correlation and independence
Let X and Y be two random variables. The joint probability density function p_{XY}(x, y) is given in Figure 14.2.
1. Calculate α.
2. Calculate the marginal probability density functions f_X(x) and f_Y(y). Are X and Y independent?
3. Calculate E[XY]. Are X and Y uncorrelated?

[Figure 14.2: Joint probability density function of X and Y.]
14.6 Couple of Gaussian random variables
Let X = [X Y]^T be a real Gaussian random vector. We assume that the vector is zero mean (i.e., E[X] = E[Y] = 0) and that E[X²] = E[Y²] = σ².
1. Calculate the joint pdf f_{X,Y}(x, y) as a function of σ and of the correlation coefficient ρ_{X,Y}.
2. Prove that non-correlation ⇒ independence in the Gaussian case.
14.7 Substitution of random variables
1. Let Y = aX
2
. Calculate f
Y
as a function of f
X
. Assume that X U
_

1
2
,
1
2
_
and simplify
the expression of f
Y
.
2. Let Y = X
0
cos () with X
0
> 0 a constant and with U (0, 2) . Calculate f
Y
.
3. Let R =

X
2
+Y
2
and = tan
1
_
Y
X
_
. The couple X, Y is assumed to be independent
random variables with Gaussian pdf with zero mean and variance
2
X
and
2
Y
, respectively.
Calculate the joint pdf f
R,
(r, ) . Assume that
2
X
=
2
Y
=
2
. Calculate f
R,
(r, ) , f
R
(r)
and f

() . Are R and independent?
14.8 Sum of independent Gaussian random variables
Let X ∼ N(m_X, σ²_X) and Y ∼ N(m_Y, σ²_Y) be two independent random variables.
1. Prove that Z = X + Y ∼ N(m_X + m_Y, σ²_X + σ²_Y) by using the result of Eqn. (4.40).
2. Calculate the characteristic functions of X and Y.
3. Prove that Z = X + Y ∼ N(m_X + m_Y, σ²_X + σ²_Y) by using the characteristic functions.
4. Prove that αX + β ∼ N(αm_X + β, α²σ²_X), where α and β are two constants.
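A quick Monte-Carlo sanity check of question 1 (the values of m_X, m_Y, σ_X, σ_Y below are arbitrary example choices), written in Python:

```python
import random
import statistics

random.seed(0)
m_x, m_y, s_x, s_y = 1.0, -2.0, 0.8, 1.5   # assumed example parameters
# Z = X + Y with X, Y independent Gaussians:
z = [random.gauss(m_x, s_x) + random.gauss(m_y, s_y) for _ in range(200_000)]

mean_z = statistics.fmean(z)
var_z = statistics.pvariance(z)
# Expected: mean m_x + m_y = -1.0, variance s_x^2 + s_y^2 = 2.89
```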
Chapter 15
Problems
15.1 Problem 1: stationarity and PSD of a sinusoidal signal
Let x(t) = A sin(2πf₀t + φ) be a random signal, where f₀ and A are two constants and where φ is a uniform random variable ∼ U(0, 2π).
1. Is x(t) stationary?
2. Deduce the PSD of x(t).
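As a numerical hint (not a proof), E[x(t) x(t − τ)] can be estimated by Monte-Carlo over φ; it should not depend on t and should match (A²/2) cos(2πf₀τ), a standard fact quoted here only as a check. The values of A, f₀, t and τ below are arbitrary:

```python
import math
import random

random.seed(1)
A, f0 = 2.0, 5.0
M = 200_000  # number of Monte-Carlo draws of the random phase

def corr(t, tau):
    # Monte-Carlo estimate of E[x(t) x(t - tau)] over phi ~ U(0, 2*pi).
    acc = 0.0
    for _ in range(M):
        phi = random.uniform(0.0, 2 * math.pi)
        acc += (A * math.sin(2*math.pi*f0*t + phi)
                * A * math.sin(2*math.pi*f0*(t - tau) + phi))
    return acc / M

tau = 0.03
r1, r2 = corr(0.2, tau), corr(1.7, tau)     # two different time origins
theory = (A**2 / 2) * math.cos(2*math.pi*f0*tau)
```

The two estimates agree with each other and with the theoretical value, which is what stationarity of the correlation predicts.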
15.2 Problem 2: another sinusoidal signal
Let x(t) = cos(ωt + B) be a random signal where ω is constant and where B is a random variable.
1. Calculate E[x(t)] as a function of E[cos(B)] and E[sin(B)].
2. Give a condition on the characteristic function Φ_B(1) for x(t) to be centered.
3. We assume E[x(t)] = 0. Calculate the correlation function γ_x(t, t − τ) as a function of E[cos(2B)] and E[sin(2B)].
4. Give a condition on the characteristic function Φ_B(2) for x(t) to be stationary.
5. We assume now that B ∼ U(−π, π). Is x(t) centered? Is x(t) stationary?
15.3 Problem 3: modulated signal
We consider the random signal w(t) = x(t) cos(ωt) + y(t) sin(ωt), where ω is constant and where x(t) and y(t) are both real, centered, stationary random signals.
1. Give the expression of the correlation function γ_w(t, t − τ) as a function of γ_x(τ), γ_y(τ) and γ_xy(τ) = E[x(t) y(t − τ)].
2. Deduce a condition under which w(t) becomes stationary.
15.4 Problem 4: sum of cisoids
We consider the signal s(t) = Σ_{k=1}^{q} e^{i(2πf_k t + φ_k)}, where the f_k are constant frequencies and where the φ_k are random variables, each with a uniform PDF over [0, 2π].
1. Calculate the mean of s(t).
2. Calculate the correlation function γ_s(t₁, t₂). Is s(t) stationary?
3. We assume that q = 2. Give an integral representation of the PDF of the random variable Y = Re s(0).
4. Let x(t) = s(t) + n(t), where n(t) is a centered white noise with variance σ². n(t) is assumed to be independent of s(t). Calculate E[x(t)] and γ_x(t, t − τ). Is x(t) stationary?
15.5 Problem 5: quasi-monochromatic signal
An emitter creates a random signal x(t) assumed to be real, centered, stationary and quasi-monochromatic, i.e., its PSD satisfies

Γ_x(f) = 0 for |f − f₀| > Δf, with f₀ ≫ Δf. (15.1)

This signal is transmitted and undergoes a reflection before being received. Consequently, we receive y(t) = x(t) + αx(t − τ), with α and τ constants.
1. Calculate the mean E[y(t)].
2. Calculate the correlation function γ_y(t₁, t₂) as a function of γ_x(t₂ − t₁). Is y(t) stationary?
3. Calculate E[y²(t)] and the PSD Γ_y(f).
4. Show that y(t) can be seen as the output of a linear filter and give the frequency response H(f) of this filter. Recover Γ_y(f) by using the interference formula.
5. Is y(t) quasi-monochromatic?
6. We assume that x(t) is the real part of the random signal z(t) = e^{i(2πf₀t + φ(t))}, where φ(t) is a real random signal. Calculate E[z(t)] and give a condition on the characteristic function Φ_φ(u) to have E[z(t)] = 0.
7. We assume φ(t) = A·t, where A ∼ U(−α, α). Calculate the correlation function γ_z(t₁, t₂) and conclude on the stationarity of z(t).
8. We assume φ(t) = A·t², where A ∼ U(−α, α). Calculate the correlation function γ_z(t₁, t₂) and conclude on the stationarity of z(t).
9. We assume φ(t) = A·t + B, where A ∼ N(0, σ²) and B is a constant. Calculate the correlation function γ_z(t₁, t₂) and conclude on the stationarity of z(t).
10. We assume that φ(t) is a stationary Gaussian random signal with correlation function γ_φ(τ). Calculate the correlation function γ_z(t₁, t₂) and conclude on the stationarity of z(t).
15.6 Problem 6: discrete signal and ltering
Let u_k be a discrete-time signal, real, centered and white (i.e., γ_u(ℓ) = σ² if ℓ = 0 and 0 otherwise).
1. Is u_k stationary?
2. u_k is the input of a linear filter with impulse response h_k = a^k for k ≥ 0 and h_k = 0 for k < 0. What is the condition on a to have a stable filter?
3. Calculate the output x_k of the filter and E[x_k].
4. Calculate the correlation function of x_k.
5. Calculate the spectral density of x_k.
6. For p ≥ 0, show that γ_x(p) can be written as a function of γ_x(p − 1) only. Deduce the expression of γ_x(p) as a function of γ_x(1) and γ_x(0).
7. Show that x_{k+1} can be calculated as a function of x_k and u_{k+1}.
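A numerical sanity check of questions 6–7 (a, σ² and the sample size below are arbitrary choices): the filtered signal can be generated by the recursion x_{k+1} = a x_k + u_{k+1}, and the empirical autocorrelation should satisfy γ_x(p) ≈ a γ_x(p − 1) with γ_x(0) = σ²/(1 − a²) — standard facts for this filter, quoted here only as a check:

```python
import random
import statistics

random.seed(2)
a, sigma2, n = 0.7, 1.0, 400_000
u = [random.gauss(0.0, sigma2**0.5) for _ in range(n)]

# Recursive form of the convolution with h_k = a^k (k >= 0):
# x_k = a * x_{k-1} + u_k, starting from x_0 = u_0.
x = [0.0] * n
x[0] = u[0]
for k in range(1, n):
    x[k] = a * x[k - 1] + u[k]

def gamma(p, burn=1000):
    # Empirical autocorrelation, dropping the initial transient.
    return statistics.fmean(x[k] * x[k - p] for k in range(burn, n))

g0, g1, g2 = gamma(0), gamma(1), gamma(2)
```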
15.7 Problem 7: another modulated signal
1. Let x(t) be a stationary random signal and let y(t) = x(t) cos(2πft), where f is constant. Is y(t) stationary?
2. Let z(t) = x(t) cos(2πft + θ), where f is constant and θ is a uniform random variable over [0, 2π], independent of x(t). Is z(t) stationary?
15.8 Problem 8: sum of cisodes and ltering
We consider the random signal s(t) = Σ_{i=1}^{N} A_i exp(j2πf_i t), where all the A_i are real random variables, centered, mutually independent, with the same variance σ², and all the f_i are positive constants.
1. Is the signal s(t) centered?
2. Calculate the correlation function γ_s(t, t − τ). Is s(t) stationary?
3. Calculate the power spectral density Γ_s(f) of s(t).
4. We introduce the signal x(t) = s(t) + exp(j(2πf₀t + θ)), where θ is a random variable independent of the A_i. Calculate the correlation function γ_x(t, t − τ) and the power spectral density Γ_x(f) of x(t).
5. We assume that x(t) is the input of a filter with transfer function H(f) = Rect_{2B}(f) defined as

Rect_{2B}(f) = { 1 if −B ≤ f ≤ B; 0 otherwise.

We assume that 0 < f₀ < f₁ < B < f₂ < f₃ < ··· < f_N. Calculate the power spectral density Γ_y(f) of the signal y(t) at the output of the filter (y(t) is stationary) and calculate the power E[|y(t)|²].
6. Calculate E[|y(t)|²] when 0 < f₀ < f₁ < ··· < f_k < B < f_{k+1} < ··· < f_N, with 1 ≤ k ≤ N.
15.9 Problem 9: non-linear system
A real random signal x(t) is centered and stationary, with variance σ², correlation function γ_x(τ) and power spectral density Γ_x(f). A non-linear system provides:

y(t) = [x(t) + a cos(2πf₀t + φ)]²,

where a and f₀ are two constants and φ is a random variable independent of x(t) with a uniform probability density function over [0, 2π].
1. Prove that E[cos²(2πf₀t + φ)] = 1/2. Calculate the mean of y(t).
2. Prove that

E[cos(2πf₀t + φ) cos(2πf₀(t − τ) + φ)] = (1/2) cos(2πf₀τ)

and

E[cos²(2πf₀t + φ) cos²(2πf₀(t − τ) + φ)] = (1/4)(cos²(2πf₀τ) + 1/2).
3. Calculate the correlation function γ_y(t, t − τ) of y(t) as a function of E[(x(t) x(t − τ))²], cos(2πf₀τ), γ_x(τ), a, and σ².
4. Calculate the power spectral density Γ_y(f) of y(t) as a function of Γ_x(f).
Chapter 16
Problems
16.1 Problem 1: linear observation model and LS estimator
Let y_m(θ) = Aθ, where θ ∈ R^P is a deterministic vector of parameters, where y_m(θ) ∈ R^N, with N > P, is the observation vector issued from a physical model, and where A ∈ R^{N×P} is a known matrix. We assume that the matrix A^T A is invertible.
1. Prove that

θ̂_LS = (A^T A)⁻¹ A^T y. (16.1)
2. Show that the minimum error is given by

J_LS(θ̂_LS) = y^T (I − A (A^T A)⁻¹ A^T) y, (16.2)

where I is the identity matrix of size N × N. (Hint: prove that (I − A (A^T A)⁻¹ A^T)² = I − A (A^T A)⁻¹ A^T.)
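Both formulas can be verified numerically on a small example. The sketch below (pure Python, with an assumed two-parameter line model and arbitrary values) solves the normal equations (A^T A)θ̂ = A^T y by Cramer's rule and checks that the residual sum of squares equals y^T(I − A(A^T A)⁻¹A^T)y:

```python
import random

random.seed(3)
N, theta = 50, (2.0, -0.5)                      # assumed true parameters
A = [[1.0, float(t)] for t in range(1, N + 1)]  # columns: constant, slope
y = [theta[0]*r[0] + theta[1]*r[1] + random.gauss(0.0, 0.3) for r in A]

# Normal equations (A^T A) theta_hat = A^T y, a 2x2 system:
s00 = sum(r[0]*r[0] for r in A); s01 = sum(r[0]*r[1] for r in A)
s11 = sum(r[1]*r[1] for r in A)
b0 = sum(r[0]*yi for r, yi in zip(A, y))
b1 = sum(r[1]*yi for r, yi in zip(A, y))
det = s00*s11 - s01*s01
th_hat = ((s11*b0 - s01*b1)/det, (s00*b1 - s01*b0)/det)

# Minimum error: residual sum of squares ...
residuals = [yi - th_hat[0]*r[0] - th_hat[1]*r[1] for r, yi in zip(A, y)]
J = sum(e*e for e in residuals)
# ... versus y^T (I - A (A^T A)^{-1} A^T) y = y^T y - b^T theta_hat:
J_formula = sum(yi*yi for yi in y) - (b0*th_hat[0] + b1*th_hat[1])
```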
16.2 Problem 2: Cramér-Rao bound and line fitting
Let y(t) = A + Bt + n(t), t = 1, ..., N, where n(t) is a white Gaussian noise with zero mean and known variance σ². The parameter vector of interest is θ = [A B]^T.
1. Calculate the Fisher information matrix I(θ) for this problem.
2. Compare the element CRB(θ)_{1,1} on the parameter A with the scalar Cramér-Rao bound obtained if B = 0.
16.3 Problem 3: nuisance parameters
Let I(θ) = [a b; b c] be a 2 × 2 Fisher information matrix.
1. Show that

[I⁻¹(θ)]_{1,1} = c / (ac − b²) ≥ 1/a = 1 / [I(θ)]_{1,1}. (16.3)

2. What does this mean about estimating a parameter when a second one is either known or unknown?
3. When does equality hold and why?
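A small numerical illustration of inequality (16.3), with arbitrary values a, b, c chosen so that I(θ) is positive definite:

```python
# Arbitrary valid Fisher matrix entries (a, c > 0, ac > b^2):
a, b, c = 3.0, 1.2, 2.0

lhs = c / (a*c - b*b)   # [I^{-1}(theta)]_{1,1}: unknown nuisance parameter
rhs = 1.0 / a           # 1 / [I(theta)]_{1,1}: nuisance parameter known
gap = lhs - rhs         # algebraically b^2 / (a*(a*c - b^2)) >= 0,
                        # zero if and only if b = 0 (decoupled parameters)
```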
16.4 Problem 4: linear observation model and ML estimator
Let y = Aθ + n, where θ ∈ R^P is a deterministic vector of parameters, where y ∈ R^N, with N > P, is the observation vector, where A ∈ R^{N×P} is a known matrix, and where n is a Gaussian random vector with zero mean and known covariance matrix Σ_n. We assume that the matrix A^T A is invertible.
1. By using the two following identities, ∂(x^T a)/∂x = ∂(a^T x)/∂x = a and ∂(x^T A x)/∂x = (A + A^T) x, prove that

θ̂_ML = (A^T Σ_n⁻¹ A)⁻¹ A^T Σ_n⁻¹ y. (16.4)
2. Prove that θ̂_ML is an unbiased estimator and that the covariance matrix of θ̂_ML is equal to

V_{θ̂_ML}(θ) = E[(θ̂_ML − E[θ̂_ML])(θ̂_ML − E[θ̂_ML])^T] = (A^T Σ_n⁻¹ A)⁻¹. (16.5)

3. Prove that θ̂_ML is an efficient estimator.
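A Monte-Carlo illustration of (16.4)–(16.5) in the scalar special case P = 1 with a diagonal noise covariance (all numerical values below are arbitrary choices for the sketch):

```python
import random
import statistics

random.seed(4)
# Scalar case: y_i = a_i * theta + n_i, with Sigma_n = diag(s2).
a = [1.0, 2.0, 0.5, 1.5]
s2 = [0.2, 1.0, 0.1, 0.5]
theta = 3.0
w = sum(ai*ai/si for ai, si in zip(a, s2))   # A^T Sigma_n^{-1} A (a scalar)

def theta_ml():
    # One realization of the weighted estimator (16.4).
    y = [ai*theta + random.gauss(0.0, si**0.5) for ai, si in zip(a, s2)]
    return sum(ai*yi/si for ai, yi, si in zip(a, y, s2)) / w

est = [theta_ml() for _ in range(100_000)]
bias = statistics.fmean(est) - theta
var = statistics.pvariance(est)   # should approach (A^T Sigma_n^{-1} A)^{-1} = 1/w
```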
16.5 Problem 5: noise power estimation
We consider the following observation model: y(t) = n(t), t = 1, ..., N, where n(t) is a white Gaussian noise with zero mean and unknown variance θ = σ².
1. Derive the maximum likelihood estimator θ̂_ML. (Hint: the fourth-order moment of a zero-mean Gaussian random variable X is given by E[X⁴] = 3σ⁴.)
2. Prove that θ̂_ML is an unbiased and efficient estimator.
16.6 Problem 6: maximum likelihood estimation of the parameter of a Poisson distribution
Let y(t), with t = 1, ..., N, be an independent random signal following a Poisson distribution. In other words, p(y(t)|λ) = e^{−λ} λ^{y(t)} / y(t)!, where λ is an unknown parameter. Using ln p(y(1), y(2), ..., y(N)|λ), calculate the maximum likelihood estimator λ̂ of λ.
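As a numerical hint, the log-likelihood can be scanned over a grid of λ values; for a Poisson sample its maximizer coincides with the sample mean (the closed form you are asked to derive, quoted here only as a check). The data below are an assumed toy sample:

```python
import math

y = [3, 1, 4, 1, 5, 9, 2, 6]   # assumed toy data
N = len(y)

def loglik(lam):
    # ln p(y(1), ..., y(N) | lambda) for i.i.d. Poisson observations.
    return sum(-lam + yt*math.log(lam) - math.lgamma(yt + 1) for yt in y)

# Grid search over lambda; the log-likelihood is strictly concave,
# so the grid argmax sits at the grid point closest to the true maximizer.
grid = [k / 1000 for k in range(1, 20001)]
lam_hat = max(grid, key=loglik)
```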
16.7 Problem 7: maximum likelihood estimation of the parameter of a Rayleigh distribution
Let y(t), with t = 1, ..., N, be an independent random signal following a Rayleigh distribution. In other words, p(y(t)|σ²) = (y(t)/σ²) e^{−y²(t)/(2σ²)}, where σ² is an unknown parameter. Using ln p(y(1), y(2), ..., y(N)|σ²), calculate the maximum likelihood estimator σ̂² of σ².
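A similar numerical check for the Rayleigh case. It uses the closed form σ̂² = (1/2N) Σ y²(t) — the standard result you are asked to derive, quoted here only as a check — and the fact that √(X² + Y²) with X, Y independent N(0, σ²) is Rayleigh-distributed:

```python
import math
import random

random.seed(5)
s2_true = 2.0
N = 50_000
# Draw a Rayleigh sample as the modulus of a zero-mean Gaussian pair:
y = [math.hypot(random.gauss(0.0, s2_true**0.5),
                random.gauss(0.0, s2_true**0.5)) for _ in range(N)]

# Maximum likelihood estimate of sigma^2:
s2_hat = sum(v*v for v in y) / (2 * N)
```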
Part V
Annexes
Chapter 17
Borel σ-algebra over a topological space
The Borel σ-algebra over R and R^d has been defined in Chapter 2. In this part we define the Borel σ-algebra over any topological space. Let us start with the definition of a topological space and its properties.
Chapter 18
Additional results on measure theory
Chapter 19
Elements of integration theory
Part VI
Practical works
Chapter 20
PW1: Matlab 101 and basic signal
processing problems
In order to save your programs, it is preferable to use Matlab scripts (.m files).
20.1 Some important Matlab functions
20.1.1 Preliminary remarks
- Matlab distinguishes between capital and small letters.
- Every command has to be followed by a ";" if we do not want to see its output.
- Matlab is sequential software: the operations are executed in the order in which they are written.
- The command help followed by the name of a function gives the help on this function. The command help alone gives the general help of Matlab.
20.1.2 Variables denition
- In order to put the value 5 in the variable a, we write a = 5;
- π is written pi.
20.1.3 Vectors denition
- A column vector t = [2 6 5]^T is written as follows: t = [2; 6; 5];
- A row vector t = [4 7 1] is written as follows: t = [4 7 1];
- More generally, a row vector starting at value a and ending at value b with a step p is written as follows: t = [a : p : b]; If we want p = 1, the command reduces to t = [a : b];
- The transpose of a vector v is written as follows: v';
- The modulus of a vector v is written as follows: abs(v);
- In order to take the square of each element of a vector x we write: x.^2;
- If we want to append the value 5 to the vector t = [4 7 1] we can write: t = [t 5];
- The length of a vector is given by the command length(t).
20.1.4 Drawing plots
- To plot a vector x as a function of a vector t we write: plot(t, x). If only plot(x) is used, Matlab uses t = [1 : length(x)].
- The command hold on is used to plot several curves on the same plot.
- The command subplot is used to create axes in tiled positions.
20.1.5 Loop
In order to repeat statements a specific number of times, we will use the command for.
20.2 Application to signal processing problems
20.2.1 Noises
- Use the command rand to create a vector of random variables. First, use 10 as the size of the vector. Plot this vector. Plot an approximation of the probability density function of this random variable (use the command hist). Increase the size of the vector and see what kind of probability density function we obtain. What are the mean and the variance of this random variable (use the commands mean and var)?
- Same questions as before, but using the command randn instead of rand.
20.2.2 Central limit theorem
- Generate several values of the random variable S_j = Σ_{i=1}^{N} n_i, where n_i is a uniform random variable over [−1, 1] and j = 1, ..., 1000. Plot the probability density function of S_j for different values of N. What happens when N becomes large?
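The same experiment can be sketched in Python (the practical work itself uses Matlab); N = 12 and J = 20000 are arbitrary choices. Each n_i has variance (1 − (−1))²/12 = 1/3, so S_j should have mean 0 and variance N/3, and for large N it should look Gaussian:

```python
import random
import statistics

random.seed(6)
N, J = 12, 20_000
S = [sum(random.uniform(-1.0, 1.0) for _ in range(N)) for _ in range(J)]

m = statistics.fmean(S)        # near 0
v = statistics.pvariance(S)    # near N/3 = 4
# If S is approximately Gaussian, about 95% of values fall within 2 std:
frac_within = sum(abs(s) < 2 * (N / 3) ** 0.5 for s in S) / J
```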
20.2.3 Linear estimation
We have the following observation model

y(t) = m + n(t), t = 1, ..., N, (20.1)

where m is a constant that we want to estimate and n(t) is a Gaussian random process with zero mean and variance σ².
- Choose one value of m and generate the vector y = [y(1) y(2) ··· y(N)]^T. Plot this vector as a function of time for several values of σ².
- In order to estimate m we use the Maximum Likelihood estimator m̂_ML = (1/N) Σ_{t=1}^{N} y(t). Write the code of this estimator and apply it to several realizations of the noise for a fixed value of σ². Do it for several values of σ². What happens when σ² increases?
- We want to know the bias b(m̂_ML) = E[m̂_ML] − m of our estimator. For that purpose, since E[·] is a mathematical (i.e., theoretical) operator, we will use an estimate of it: E[m̂_ML] ≈ (1/mc) Σ_{i=1}^{mc} (m̂_ML)_i, where the (m̂_ML)_i are several estimates of m for a fixed variance σ². Show that this estimator is unbiased when mc is high.
- Same question with m̂ = (1/(N − 1)) Σ_{t=1}^{N} y(t). Show that this estimator is biased when m is high and N is low. Show that this estimator is asymptotically unbiased.
- We want to plot the variance of m̂_ML: var(m̂_ML) = E[(m̂_ML − E[m̂_ML])²]. Use the same technique¹ as with the bias to approximate the expectation operator. Plot the variance as a function of N for fixed σ², and as a function of σ² for fixed N.
- Compare the previous results with the Cramér-Rao bound given by CRLB = σ²/N. Conclude on the efficiency of the estimator.
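The whole Monte-Carlo procedure of this section can be sketched in Python (the practical work uses Matlab; m, σ², N and mc below are arbitrary choices). The empirical variance of m̂_ML should match the Cramér-Rao bound σ²/N, which is the efficiency statement:

```python
import random
import statistics

random.seed(7)
m, sigma2, N, mc = 4.0, 1.5, 100, 20_000

def m_ml():
    # One Monte-Carlo realization of the ML estimator (sample mean).
    return statistics.fmean(random.gauss(m, sigma2**0.5) for _ in range(N))

est = [m_ml() for _ in range(mc)]
bias_hat = statistics.fmean(est) - m        # should be near 0 (unbiased)
var_hat = statistics.pvariance(est)         # should be near sigma2 / N
crlb = sigma2 / N                           # = 0.015
```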
20.2.4 Non-linear estimation: spectral analysis
We have the following observation model

y(t) = a₁ sin(2πf₀t) + a₂ sin(2πf₁t) + n(t), t = 1, ..., N, (20.2)

where a₁ and a₂ are known amplitudes and f₀ and f₁ are unknown frequencies (between 0 and 1/2) that we want to estimate. n(t) is a Gaussian random process with zero mean and variance σ².
- Choose one value of a₁, a₂, f₀ and f₁ and generate the vector y = [y(1) y(2) ··· y(N)]^T. Plot this vector as a function of time for several values of σ².
- Due to the model, give an algorithm to compute the Least-Squares criterion

J(f₀, f₁) = Σ_{t=1}^{N} (y(t) − a₁ sin(2πf₀t) − a₂ sin(2πf₁t))². (20.3)

Due to the function sin(·), this problem is non-linear in the parameters, so we have to minimize J(f₀, f₁) numerically. With a loop on the frequency f, plot this criterion. Choose a small step with respect to the possible values of f₀ and f₁. Note that we will not program the minimization here: the argument of the minimum of J(f₀, f₁) is found by hand (for example with the command ginput).
- For a₁ = 1 and a₂ = 0, plot the criterion for several values of f₀ with fixed variance.
- For a₁ = 1 and a₂ = 0, plot the criterion for several values of σ² with fixed frequency f₀.
- Repeat the last two bullets with a₁ = 1 and a₂ = 1.
- Since n(t) is a Gaussian random process with zero mean and variance σ², the Maximum Likelihood estimator is given by the same minimization, but now J(f₀, f₁) is seen as a surface. Modify your algorithm to plot J(f₀, f₁) as a surface (use the command plot3) and do the same analysis.
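The grid search can be sketched in Python for the single-frequency case a₁ = 1, a₂ = 0 (the practical work uses Matlab; the true frequency, noise level and grid step below are arbitrary choices):

```python
import math
import random

random.seed(8)
N, f0_true, sigma = 100, 0.21, 0.5
y = [math.sin(2*math.pi*f0_true*t) + random.gauss(0.0, sigma)
     for t in range(1, N + 1)]

def J(f0):
    # Least-squares criterion (20.3) with a1 = 1, a2 = 0.
    return sum((y[t - 1] - math.sin(2*math.pi*f0*t))**2
               for t in range(1, N + 1))

# Grid search with a small step over (0, 1/2):
grid = [k / 2000 for k in range(1, 1000)]
f0_hat = min(grid, key=J)
```

The argmin of the scanned criterion lands at the grid point nearest the true frequency, which is what the "read it off the plot" step of the practical work exploits.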
¹This technique is called Monte-Carlo simulation.
Chapter 21
PW2: Speech processing
In order to save your programs, it is preferable to use Matlab scripts (.m files).
21.1 Introduction
The goal of this practical work is to analyse and synthesize some particular sounds. The recording
can be modelled by the repetition of K elementary functions similar to weighted sinus functions
with a repetition period T (see Figure 21.1).

Figure 21.1: Sound signal recording (amplitude versus time t (s); left: about 0.2 s of the recording, right: zoom on about 0.025 s).
A weighted sinus function can be easily modelled by an autoregressive (AR) process. These processes are very useful since they need few parameters (approximately 20 for a sound). But, in order to fit the signal, we have to estimate the coefficients of the AR process in order to characterize the sound. The synthesis is the inverse process: if we have the AR coefficients, the number of repetitions K and the repetition period T, how do we recover the sound?
21.2 Parameters estimation of an AR process
An AR process can be written

s_n = a₁ s_{n−1} + a₂ s_{n−2} + ··· + a_d s_{n−d} + e_n = e_n + Σ_{i=1}^{d} a_i s_{n−i}, (21.1)

where e_n is a stationary white noise of power P and where a₁, a₂, ..., a_d are the coefficients of the model. d is called the order of the model.
We can also rewrite this model as

s_n = e_n + Σ_{i=1}^{d} b_i e_{n−i}, (21.2)

where the coefficients b_i are functions of the coefficients a_i.
It can be seen that s_n is the output of a digital filter with transfer function

H(z) = 1 / (1 − Σ_{i=1}^{d} a_i z^{−i}), (21.3)

with input e_n.
Since e_n is assumed to be stationary and white, we have

γ_e(k) = E[e_n e_{n−k}] = P δ(k), (21.4)

where δ(k) is the Kronecker symbol, i.e., δ(k) = 1 if k = 0 and 0 otherwise.
- Give the expression of the two vectors Num and Den corresponding to the coefficients of the polynomials of the numerator and the denominator of H(z) (the i-th element of these vectors is equal to the coefficient of z^{−i}). These vectors will be used with Matlab.
By multiplying Eqn. (21.2) by e_{n+k} and taking the expectation operator we get

γ_es(k) = E[e_n s_{n−k}] = 0 for k > 0. (21.5)

By multiplying Eqn. (21.1) by s_{n−k} (k > 0) and taking the expectation operator we get

γ_s(k) = E[s_n s_{n−k}] = Σ_{i=1}^{d} a_i γ_s(k − i), k > 0. (21.6)
Let us set

a = [a₁ a₂ ··· a_d]^T, (21.7)
γ = [γ_s(1) γ_s(2) ··· γ_s(d)]^T, (21.8)
M =
[ γ_s(0)      γ_s(1)      ···  γ_s(d − 1)
  γ_s(1)      γ_s(0)      ···  γ_s(d − 2)
  ⋮            ⋮           ⋱    ⋮
  γ_s(d − 1)  γ_s(d − 2)  ···  γ_s(0)   ]. (21.9)

- Using Eqns. (21.6), (21.7), (21.8), and (21.9), show that

γ = M a. (21.10)
- Deduce that the knowledge of γ_s(k), k = 0, ..., d, leads to the knowledge of the coefficients a₁, a₂, ..., a_d.

This kind of equation is called a Yule-Walker system and allows us to determine the coefficients of an AR process of order d.
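As a quick numerical illustration of the Yule-Walker system (21.10) for d = 2, here is a pure-Python sketch (the coefficients a₁ = 0.5, a₂ = 0.3 are an arbitrary stable choice, not values from the practical work):

```python
import random

random.seed(9)
a1, a2, n = 0.5, 0.3, 400_000   # stable AR(2), chosen for illustration
s = [0.0, 0.0]
for _ in range(n):
    s.append(a1*s[-1] + a2*s[-2] + random.gauss(0.0, 1.0))
s = s[1000:]                    # drop the initial transient

def g(k):
    # Empirical autocorrelation, as in the estimator (21.11).
    return sum(s[i]*s[i - k] for i in range(k, len(s))) / len(s)

g0, g1, g2 = g(0), g(1), g(2)
# Yule-Walker system gamma = M a, with M = [[g0, g1], [g1, g0]] and
# gamma = [g1, g2]^T; solved here by Cramer's rule:
det = g0*g0 - g1*g1
a1_hat = (g0*g1 - g1*g2) / det
a2_hat = (g0*g2 - g1*g1) / det
```

The recovered coefficients match the ones used to generate the data, which is exactly what the iden function does (for d = 20) on the recorded sounds.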
Give the expression of Den as a function of the vector a.
The file synthe22.mat contains some real recorded sound signals (the letters "a", "e", "i", "o", and "u") with a sampling frequency Fe. Use the command load synthe22 to load all the sounds.
Plot the letter "a" as a function of the real time. Isolate only the useful part of this signal
and save it in a new vector.
Unfortunately, in practice, we do not have access to the autocorrelation function γ_s(k); we have to estimate it. We will use the following estimator

γ̂_s(k) = (1/N) Σ_{n=k+1}^{N} s_n s_{n−k} ≈ γ_s(k). (21.11)
- The Matlab function iden(s,d), especially written for this practical work, gives the vector Den corresponding to the coefficients of the polynomial of the denominator of H(z), where s is the signal that you want to analyse and d is the order of the AR model. a is computed by the Yule-Walker equations and the approximation of Eqn. (21.11). Give the vector Den for your signal "a" with d = 20.
- Write a program which uses the vector Den to:
  - calculate the roots of the polynomial 1 − Σ_{i=1}^{d} a_i z^{−i} = 0 and plot these roots in the complex plane (use the command roots);
  - calculate the moduli and the phases of these roots (commands abs and angle), and verify that all the moduli are lower than one;
  - calculate the frequencies F_i = (1/2π) arg(p_i) F_e, where p_i, i = 1, ..., d, are the roots previously calculated;
  - calculate and plot the impulse response and the frequency response of H(z) (commands impz and freqz).
21.3 Analysis of the sound
The recorded sound can be modelled by the repetition of K elementary functions similar to weighted sinus functions with a repetition period T (see Figure 21.1). The repetition period is generally equal to T = 10 ms for a man and T = 5 ms for a woman.
- What is the value of T? Is it a man or a woman who recorded this sound?
- What is the value of K?
- Write a program which
  - calculates the Fourier transform of a signal recorded on N samples with a sampling frequency F_e (command fft);
  - plots the modulus of this Fourier transform as a function of the real frequency axis.
- After plotting the Fourier transform of your signal, read off the value of T.
- Why does the Fourier transform seem to be made of peaks?
21.4 Synthesis of the sound
We want to reproduce a synthetic sound by using our previous analysis. We know from Eqn. (21.3) that one repetition of our signal is the output of a fully known filter (since we have estimated a) when the input is a white noise (an impulse in the time domain). Consequently, if we create K impulses in the time domain as the input of our filter, we should recover our sound.
- What is the time between two impulses?
- Create a program which generates the input of the filter.
- Create a program which gives the output of the filter when the input is e_n (command filter).
- Compare the synthesized signal to the real one in the time domain and in the frequency domain.
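The synthesis loop can be sketched in Python (the practical work uses Matlab's filter; the AR coefficients, K and T below are placeholders, not values estimated from the recordings). It drives the all-pole filter of Eqn. (21.3) with a train of K impulses spaced T samples apart:

```python
# Hypothetical AR coefficients (in practice, iden would supply them):
a = [0.6, 0.2]
K, T = 5, 50                   # placeholder repetition count and period
n_samples = K * T

# Input: impulse train, one impulse every T samples.
e = [0.0] * n_samples
for k in range(K):
    e[k * T] = 1.0

# Output of H(z) = 1 / (1 - sum_i a_i z^{-i}):
# s_n = e_n + sum_i a_i * s_{n-i}
s = [0.0] * n_samples
for n in range(n_samples):
    s[n] = e[n] + sum(a[i] * s[n - 1 - i]
                      for i in range(len(a)) if n - 1 - i >= 0)
```

Because the impulse response decays within a period, the output becomes (nearly) periodic with period T, which is the mechanism behind the synthetic vowel.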
Chapter 22
PW3: Sources localization
In order to save your programs, it is preferable to use Matlab scripts (.m files).
22.1 Introduction
Problems such as
- emergency cell phone localization (avalanche, ...),
- tracking of a singer on a stage,
- radar localization,
can be solved by triangulation techniques if we know the angle of the source with respect to 3 base stations (see Figure 22.1). Each base station uses an antenna with several sensors.
Figure 22.1: Example of a source localization problem. "Portable à localiser" = "Cell phone to localize" and "Station de base" = "Base station".
The goal of this practical work is to study a statistical model of observations obtained at a base station, together with two estimation techniques for the angles of arrival. The first one, called beamforming, does not use the statistical properties of the observation model; the second one is based on the Maximum Likelihood algorithm, which uses the full statistical properties of the observation model.
22.2 Observation model
We want to find the directions of arrival θ_m (m = 1, ..., M) of M sources (M = 1 or 2 in this practical work) with the help of an antenna of N > M sensors (N = 4 in this practical work). We denote by d₀ the distance between two consecutive sensors (d₀ = 4 cm in this practical work). The index n = 1, ..., N represents the index of the sensors.
Figure 22.2: Microphone array ("antenne de microphones") of N sensors spaced d₀ apart (n = 1, ..., N), followed by an ADC ("CAN") and the estimation algorithm producing the estimate θ̂_m for source m.
We assume that the signals emitted by the sources are quasi-monochromatic. Moreover, the waves are plane (which is equivalent to saying that the sources are assumed to be far from the antenna). We denote by s(t) the signal received by sensor number n = 1, which is our reference. Signals received by the antenna are sampled with K points at the sampling frequency F_s (in this practical work, K = 2048 and F_s = 20 kHz) and are recorded in a matrix S (the n-th row of this matrix corresponds to the signal of sensor number n).
22.2.1 Quasi-monochromatic signals
A quasi-monochromatic signal has a spectrum equal to zero outside the bandwidth [f₀ − Δf, f₀ + Δf], with f₀ ≫ Δf. This signal can be written

s(t) = a(t) cos(2πf₀t + φ(t)), (22.1)

where a(t) is the instantaneous amplitude and 2πf₀t + φ(t) is the instantaneous phase. We call envelope of the signal s(t) the signal e(t) such that

e(t) = a(t) e^{jφ(t)}. (22.2)

The envelope is very important since all the information is contained in it.
In order to remove the noise we will filter the signal with a band-pass filter, using the function [f₀, S_filtered] = trait(S, F_s), which filters the signals in the matrix S (sampled at F_s) and gives as output the central frequency f₀ and the filtered signals in a new matrix S_filtered.
- An example of a noisy quasi-monochromatic signal is recorded in the vector s (use the command load s). Use the command trait to filter this signal and draw the filtered signals.
- In order to obtain the envelope e(t) from s(t), we use the command E = env(S, f₀, F_s). Using the command env, calculate and draw the envelopes of the noisy and the filtered signals s(t).
22.2.2 Case of one source
If a plane wave is emitted from direction θ, the delay between two sensors is equal to

τ = (d₀ / v) sin(θ), (22.3)

where v (v = 343 m/s in this practical work) is the velocity of the wave.
- Prove this result.
Consequently, the signal received by the n-th sensor is a delayed version of the signal received by the first sensor and is equal to s(t − (n − 1)τ). Consequently, the envelope at the n-th sensor is

a(t − (n − 1)τ) e^{jφ(t−(n−1)τ)} e^{−j2πf₀(n−1)τ}. (22.4)

Since the antenna is small and the signal is narrow band, we have the following approximations

a(t − (n − 1)τ) ≈ a(t), (22.5)
φ(t − (n − 1)τ) ≈ φ(t). (22.6)

Consequently, the envelope at the n-th sensor is

e_ref(t) e^{−j2πf₀(n−1)(d₀/v) sin(θ)}, (22.7)

where e_ref(t) = a(t) e^{jφ(t)} is the envelope at the first sensor. We denote by e(t) the vector of all the sensor envelopes (here a vector with N = 4 entries):

e(t) = e_ref(t) [1  e^{−j2πf₀(d₀/v) sin(θ)}  ···  e^{−j2πf₀(N−1)(d₀/v) sin(θ)}]^T = e_ref(t) g(θ), (22.8)

where g(θ) is called the steering vector and contains all the information about the array geometry and the angle of arrival.
22.2.3 Case of several sources
If M > 1, we have several envelopes e_ref^i(t), i = 1, ..., M, and the vector e(t) of all the sensor envelopes is written

e(t) = [g(θ₁) g(θ₂) ··· g(θ_M)] [e_ref^1(t)  e_ref^2(t)  ···  e_ref^M(t)]^T = G(θ) e_ref(t), (22.9)

where θ = [θ₁ θ₂ ··· θ_M]^T and where G(θ) is called the steering matrix; its (n, m) entry is e^{−j2πf₀(n−1)(d₀/v) sin(θ_m)} (its first row is all ones).
22.2.4 Full observation model
Since the noise corrupts our observations, the final observation model is

y(t) = e(t) + n(t) = G(θ) e_ref(t) + n(t), t = 1, ..., K, (22.10)

where K is the total number of observations and n(t) is a Gaussian noise with zero mean and covariance matrix σ² I_N.
22.3 Beamforming
The beamforming principle is given in Figure 22.3.
Figure 22.3: Beamforming principle (delay-and-sum): each sensor output n = 1, ..., N is advanced by (n − 1)τ̂, the N channels are summed, and the energy (1/T)∫₀^T |·|² dt of the sum is computed.
Since the structure of g(θ) is known, we multiply the observed signals by g^H(θ̃):

g^H(θ̃) y(t) = e_ref(t) g^H(θ̃) g(θ) + g^H(θ̃) n(t). (22.11)

Consequently, the energy of the output is given by

|g^H(θ̃) y(t)|² = g^H(θ̃) y(t) y^H(t) g(θ̃), (22.12)

which will be maximal when θ̃ = θ. Consequently, the beamforming estimator is given by

θ̂_B = arg max_{θ̃} |g^H(θ̃) y(t)|². (22.13)

Note that, as with the least-squares approach, we do not take into account the statistical properties of our model (the probability density function of n(t) is not used here).
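A noise-free sketch of the beamforming scan (22.13) in Python (the practical work uses the provided Matlab functions; the carrier frequency f₀, the true angle and the scan grid below are arbitrary choices made for the illustration):

```python
import cmath
import math

# Assumed array parameters: N = 4 sensors, d0 = 4 cm, v = 343 m/s,
# and an example carrier frequency f0 = 2 kHz.
N, d0, v, f0 = 4, 0.04, 343.0, 2000.0

def g(theta):
    # Steering vector of Eqn. (22.8); theta in radians, sensor index n = 0..N-1.
    return [cmath.exp(-2j * math.pi * f0 * n * d0 / v * math.sin(theta))
            for n in range(N)]

theta_true = math.radians(20.0)
e_ref = 1.0 + 0.5j                          # one envelope sample at the reference
y = [e_ref * gn for gn in g(theta_true)]    # noise-free snapshot, Eqn. (22.8)

def criterion(theta):
    # |g^H(theta) y|^2, the beamforming criterion of Eqn. (22.13).
    return abs(sum(gn.conjugate() * yn for gn, yn in zip(g(theta), y)))**2

grid = [math.radians(t / 10) for t in range(-900, 901)]   # scan -90..90 degrees
theta_hat = max(grid, key=criterion)
```

The scan peaks at the true direction of arrival; adding noise and a second source (as in the exercises below) degrades and eventually merges the peaks.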
The Matlab function S = sim_1(θ) creates a matrix of observations S with a single direction of arrival θ (without noise). The Matlab function [θ, P] = balayage(S) gives the criterion |g^H(θ̃) y(t)|².
- For a given value of θ (for example, θ = 0), generate the signal S.
- Draw the four sensor outputs as a function of time.
- Calculate and draw the criterion |g^H(θ̃) y(t)|².
- Read off the direction of arrival θ.
- Change the value of θ and analyse the resolution power of the algorithm.
- What happens if we add some noise?
The Matlab function S = sim_2_d(θ₁, θ₂) creates a matrix of observations S with two sources of directions of arrival θ₁ and θ₂ (without noise).
- For given values of θ₁ and θ₂, generate the signal S and calculate the criterion |g^H(θ̃) y(t)|².
- What happens if θ₁ and θ₂ are too close? What is the limit angle (θ₁ − θ₂) of this algorithm?
- In which cases is it a good method?
22.4 Maximum likelihood estimator
Now, we want to take into account the full statistical properties of our model by using the likelihood of the observations, in order to increase the resolution of our algorithm.
The likelihood of one observation can be written

p(y(t)|θ) = (πσ²)^{−N} exp(−(1/σ²) (y(t) − G(θ) e_ref(t))^H (y(t) − G(θ) e_ref(t))). (22.14)

Consequently,

p(y(1), y(2), ..., y(K)|θ) = (πσ²)^{−NK} exp(−(1/σ²) Σ_{k=1}^{K} (y(k) − G(θ) e_ref(k))^H (y(k) − G(θ) e_ref(k))). (22.15)
And the Maximum Likelihood estimator of θ is given by

θ̂_ML = arg max_θ p(y(1), y(2), ..., y(K)|θ), (22.16)

which can be reduced, after some calculus effort, to

θ̂_ML = arg min_θ trace(Π(θ) R̂_y), (22.17)

where

Π(θ) = I_N − G(θ) (G^H(θ) G(θ))⁻¹ G^H(θ), (22.18)
R̂_y = (1/K) Σ_{k=1}^{K} y(k) y^H(k). (22.19)
The Matlab function S = sim_2_c(θ₁, θ₂, c, SNR) generates a noisy observation matrix S with two sources of directions of arrival θ₁ and θ₂; c is the correlation coefficient between the two sources and SNR sets the power of the noise. The Matlab function [θ₁, θ₂] = music(S) gives the projected maximum likelihood criterion.
- For given values of θ₁ and θ₂, with SNR = 0 and c = 0, generate the signal S and show that this algorithm is better than beamforming.
- Change the value of the SNR and/or c and see what happens.