
Algoritmos de Análise Discriminativa Linear
Linear Discriminant Analysis Algorithms


Pedro Miguel Correia Guerreiro


Dissertation submitted in fulfillment of the requirements for the degree of
Master of Science in Electrical and Computer Engineering


Jury
President:    Prof. Carlos Jorge Ferreira Silvestre
Supervisors:  Prof. João Manuel de Freitas Xavier
              Prof. Pedro Manuel Quintas Aguiar
Members:      Prof. José Manuel Bioucas Dias


September 2008

Acknowledgments

I thank my parents for all the support they have given me over these years and for their constant
encouragement to always do better. I thank Prof. João Xavier and Prof. Pedro Aguiar for their absolutely
essential help, without which writing this thesis would not have been possible. I also thank the Instituto
Superior Técnico and all of its teaching staff for the education they provided me.

Resumo

Propõem-se novos algoritmos para o cálculo de discriminantes lineares usados na redução de dimensão
de dados de Rn para Rp, com p < n. São apresentadas alternativas ao critério clássico da Distância
de Fisher; nomeadamente, investigam-se novos critérios baseados em: Distância de Chernoff,
J-Divergência e Divergência de Kullback-Leibler. Os problemas de optimização que emergem do uso
destas alternativas são não convexos e consequentemente difíceis de resolver. No entanto, apesar da
não convexidade, os algoritmos desenvolvidos garantem que o discriminante linear é globalmente
óptimo para p = 1. Tal foi possível devido a reformulações do problema e a recentes resultados na teoria
da optimização [8],[9]. Uma abordagem subóptima é desenvolvida para 1 < p < n.

Palavras-Chave: Discriminantes Lineares, Redução de Dimensão de Dados, Distância de Fisher,
Distância de Chernoff, resultados não convexos de dualidade forte, Divergência de Kullback-Leibler.

Abstract

We propose new algorithms for computing linear discriminants to perform data dimensionality reduction
from Rn to Rp with p < n. We propose alternatives to the classical Fisher's Distance criterion; namely,
we investigate new criteria based on the Chernoff Distance, the J-Divergence and the Kullback-Leibler
Divergence. The optimization problems that emerge from using these alternative criteria are non-convex
and thus hard to solve. However, despite the non-convexity, our algorithms guarantee global optimality of
the linear discriminant when p = 1. This is possible due to problem reformulations and recent developments
in optimization theory [8],[9]. A greedy suboptimal approach is developed for 1 < p < n.

Keywords: Linear Discriminants, Data Dimensionality Reduction, Fisher's Distance, Chernoff Distance,
Nonconvex Strong Duality Results, Kullback-Leibler Divergence.

Contents

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Algorithms for Dimensionality Reduction to R 8


2.1 Kullback-Leibler Divergence Maximization . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 J-Divergence Maximization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Interval Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Chernoff Distance Maximization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Greedy Algorithms 18

4 Computer Simulations 20
4.1 Hit Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.1 Dimensionality Reduction to R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.2 Dimensionality Reduction to Rp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1.3 Asymptotic Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 ROC-Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5 Conclusions and Future Work 29

A Quadratic Program with Quadratic Constraints, Strong-Duality result 31


A.1 Introduction to Strong Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
A.2 Strong Duality Result Demonstration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

B Stiefel Matrix Constraint Invariance 35

C Set Properties 36

D Criteria Equivalence 37

List of Figures

1.1 1-dimensional pdf’s f0 and f1 obtained through Fisher’s Distance. . . . . . . . . . . . . . 6


1.2 1-dimensional pdf’s f0 and f1 obtained through J-Divergence. . . . . . . . . . . . . . . . . 6

2.1 Geometrical interpretation of the set of restrictions. . . . . . . . . . . . . . . . . . . . . . . 14

4.1 Distinct Means and Distinct Covariance Matrices. Legend: Magenta - Chernoff Distance
, Red - Fisher’s Distance, Blue - Kullback-Leibler Divergence, Green - J-Divergence. . . . 27
4.2 Equal Means and Distinct Covariance Matrices. Legend: Magenta - Chernoff Distance ,
Red - Fisher’s Distance, Blue - Kullback-Leibler Divergence, Green - J-Divergence. . . . . 27
4.3 Distinct Means and Equal Covariance Matrices. Legend: Magenta - Chernoff Distance ,
Red - Fisher’s Distance, Blue - Kullback-Leibler Divergence, Green - J-Divergence. . . . . 28

List of Tables

4.1 Distinct Means and Distinct Covariance Matrices . . . . . . . . . . . . . . . . . . . . . . . 21


4.2 Equal Means and Distinct Covariance Matrices . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Distinct Means and Equal Covariance Matrices . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4 Chernoff Distance Criteria with Distinct Means and Distinct Covariance Matrices . . . . . 22
4.5 Fisher’s Distance Criteria with Distinct Means and Distinct Covariance Matrices . . . . . . 22
4.6 Kullback-Leibler Divergence Criteria with Distinct Means and Distinct Covariance Matrices 22
4.7 J-Divergence Criteria with Distinct Means and Distinct Covariance Matrices . . . . . . . . 22
4.8 Chernoff Distance Criteria with Equal Means and Distinct Covariance Matrices . . . . . . 24
4.9 Fisher’s Distance Criteria with Equal Means and Distinct Covariance Matrices . . . . . . . 24
4.10 Kullback-Leibler Divergence Criteria with Equal Means and Distinct Covariance Matrices . 24
4.11 J-Divergence Criteria with Equal Means and Distinct Covariance Matrices . . . . . . . . . 24
4.12 Chernoff Distance Criteria with Distinct Means and Equal Covariance Matrices . . . . . . 25
4.13 Fisher’s Distance Criteria with Distinct Means and Equal Covariance Matrices . . . . . . . 25
4.14 Kullback-Leibler Divergence Criteria with Distinct Means and Equal Covariance Matrices . 25
4.15 J-Divergence Criteria with Distinct Means and Equal Covariance Matrices . . . . . . . . . 25
4.16 Distinct Covariance Matrices and Distinct Means . . . . . . . . . . . . . . . . . . . . . . . 26
4.17 Distinct Covariance Matrices and Equal Means . . . . . . . . . . . . . . . . . . . . . . . . 26
4.18 Equal Covariance Matrices and Distinct Means . . . . . . . . . . . . . . . . . . . . . . . . 26

Chapter 1

Introduction

1.1 Background
Linear Discriminant Analysis (LDA) is a very important tool in a wide variety of problems. It is commonly
used in machine learning problems such as pattern recognition [1],[2], face recognition [4] and feature
extraction [3], and in data dimensionality reduction.
A problem treated in LDA is the binary class assignment problem: given one sample in a high-dimensional
space, say x ∈ Rn, decide to which class, C0 or C1, it belongs. Usually the two classes C0 and C1
represent two random sources. The classification could be made in high dimension, i.e. in Rn, therefore
using all the information available. However, this might be computationally too heavy for certain real-time
applications. So, instead of using all the n entries of the sample x directly, an appropriate linear
combination of them is made. With this linear combination, we try to capture some data features
(hopefully those where C0 and C1 differ most), and then perform the data classification. Making such a
linear combination generically leads to information loss, and consequently increases the probability of
erroneous classifications. However, this problem can be attenuated by making more than one linear
combination and collecting them in a vector y, to perform the classification. The number of linear
combinations is denoted by p. That is,

y = Qx (1.1)

where Q ∈ Rp×n is called the linear discriminant, y ∈ Rp is the vector that collects the p linear combinations,
and x ∈ Rn is the sample to be classified. The classification is made through the low-dimensional vector
y ∈ Rp, which acts as a signature of the sample x.
The key issue here is the design of the linear discriminant Q. This design is generically formulated as an
optimization problem, where the objective function measures class separability in the projected space
Rp, i.e.

\[ \max_{Q \in \mathbb{R}^{p\times n}} f(Q). \tag{1.2} \]

The choice of the cost function in (1.2) plays a critical role. An obvious proposal for such a cost function
would be f(Q) = −Pe(Q), where Pe(Q) stands for the probability of error of the optimum detector in Rp
for the given setup; the minus sign reflects the fact that the optimization problem in (1.2) has been written
as a maximization. However, in general there is no closed-form expression for Pe(Q). This motivates the
introduction of alternative suboptimum choices, which are nonetheless tractable.

1.2 Previous Work
We now give a precise formulation of the problem to be solved and review previous works in this area.

Problem Statement. In what follows, the two classes C0 and C1 introduced in section 1.1 are identified
with two random sources, denoted here by S0 for source 0 and by S1 for source 1. We focus on the
Gaussian case.
Given the two independent n-dimensional Gaussian sources

\[ S_0:\; x \sim F_0 = \mathcal{N}(\mu_0, \Sigma_0), \qquad S_1:\; x \sim F_1 = \mathcal{N}(\mu_1, \Sigma_1), \tag{1.3} \]

we wish to find a linear discriminant Q, for data dimensionality reduction, minimizing erroneous
classification of the samples generated by these sources in low dimension.
Let x ∈ Rn be a sample generated by one of the n-dimensional sources S0 or S1, which are considered
to be equally probable. A linear mapping from Rn to Rp is made with the linear discriminant Q ∈ Rp×n, i.e.

y = Qx.

Due to this linear mapping, we have

\[ s_0:\; y \sim f_0 = \mathcal{N}(Q\mu_0, Q\Sigma_0 Q^T), \qquad s_1:\; y \sim f_1 = \mathcal{N}(Q\mu_1, Q\Sigma_1 Q^T), \tag{1.4} \]

where s0 and s1 denote the p-dimensional sources that result from the dimensionality reduction induced
by the linear discriminant Q ∈ Rp×n.
Whenever a sample x is available, it has to be classified. The classification is made with the maximum
likelihood criterion, better known in this context as the Neyman-Pearson detector. The linear map Q is
applied to the sample x, forming y = Qx, and the maximum likelihood criterion is then applied to the
random variable y. If N(Qµ0, QΣ0QT)(y) > N(Qµ1, QΣ1QT)(y), then y is considered to have been
generated by the p-dimensional source s0 and x is therefore considered to have been generated by the
n-dimensional source S0, and vice-versa.
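This decision rule is straightforward to implement. The following is a minimal sketch, assuming NumPy and SciPy are available; the function name and argument layout are illustrative and not taken from the thesis.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def classify(x, Q, mu0, Sigma0, mu1, Sigma1):
    """Maximum-likelihood (Neyman-Pearson) classification of a sample x
    based on its low-dimensional signature y = Qx."""
    y = Q @ x
    logf0 = mvn(Q @ mu0, Q @ Sigma0 @ Q.T).logpdf(y)
    logf1 = mvn(Q @ mu1, Q @ Sigma1 @ Q.T).logpdf(y)
    return 0 if logf0 > logf1 else 1   # index of the decided source
```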

Previous Work. In the following we discuss several proposals for the cost function f(Q) in (1.2), and
we analyze the strengths and weaknesses of previous works that utilize such cost functions.

Fisher's Distance Maximization Criterion. A popular choice is the Fisher's Distance Maximization
criterion, which is now reviewed.
We wish to optimally separate, in Fisher's sense, the signatures y from s0 from the signatures from s1.
Intuitively, this is equivalent to separating as much as possible the respective probability density
functions f0 and f1 defined in (1.4).
The general optimization problem in (1.2) under the Fisher's Distance Maximization criterion (see [6]) is

\[ \max_{Q \in \mathbb{R}^{p\times n}}\; \mathrm{tr}\Big\{\big(Q(\Sigma_0+\Sigma_1)Q^T\big)^{-1}\big(Q(\mu_0-\mu_1)(\mu_0-\mu_1)^T Q^T\big)\Big\}, \tag{1.5} \]

where the objective function is the Fisher's Distance between the low-dimensional distributions f0 and f1.
In order to better understand what the Fisher's Distance measures, the case Q ∈ R1×n is presented.
Putting Q = [qT], where q ∈ Rn, (1.5) boils down to

\[ \max_{q \in \mathbb{R}^n}\; \frac{q^T(\mu_0-\mu_1)(\mu_0-\mu_1)^T q}{q^T(\Sigma_0+\Sigma_1)q}. \tag{1.6} \]

It is now easy to see that the outer-class variance q^T(µ0 − µ1)(µ0 − µ1)^T q = [q^Tµ0 − q^Tµ1]^2 is being
maximized while the total inner-class variance q^T(Σ0 + Σ1)q is being minimized.
The solution Q of (1.5) can be obtained from the eigenvalue decomposition of
(Σ0 + Σ1)^{-1/2}(µ0 − µ1)(µ0 − µ1)^T(Σ0 + Σ1)^{-1/2}, taking for the p rows of Q the p eigenvectors
associated with the p largest eigenvalues, see [6]. However, since
(Σ0 + Σ1)^{-1/2}(µ0 − µ1)(µ0 − µ1)^T(Σ0 + Σ1)^{-1/2} has rank 1, it is easy to see that the optimum
discriminant for p > 1 achieves the same performance, as measured by (1.5), as the optimum
discriminant for p = 1. That is, there is no gain in projecting to spaces of dimension p > 1. For p = 1,
the optimum discriminant is Q = [qT] where q is a solution of (1.6), that is

\[ q = (\Sigma_0+\Sigma_1)^{-1}(\mu_0-\mu_1). \tag{1.7} \]

In sum, the Fisher's Distance Maximization criterion enjoys a closed-form solution and a very intuitive
interpretation. However, it only allows dimensionality reduction to p = 1.
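For reference, the closed-form solution (1.7) is essentially a one-liner. A minimal sketch assuming NumPy (the function name is illustrative):

```python
import numpy as np

def fisher_discriminant(mu0, mu1, Sigma0, Sigma1):
    """Closed-form Fisher discriminant (1.7): q = (Sigma0 + Sigma1)^{-1} (mu0 - mu1)."""
    q = np.linalg.solve(Sigma0 + Sigma1, mu0 - mu1)
    return q / np.linalg.norm(q)   # the criterion (1.6) is scale-invariant, so normalize
```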

Other Criteria. It was said previously that, in general, there is no closed-form expression for the
classification error rate, which leads to the use of suboptimal surrogate measures. The theoretical basis
for the cost functions used in [6] and [7] is now presented.

Stein's Lemma. [10] Suppose we have k statistically independent samples from the same source, and
the classification is made through the maximum-likelihood detector. Then we have

\[ \lim_{k\to+\infty} \frac{\log P_F(k)}{k} = -D_{KL}(f_0\|f_1) \quad \text{for fixed } P_M \tag{1.8} \]
\[ \lim_{k\to+\infty} \frac{\log P_e(k)}{k} = -C(f_0, f_1) \tag{1.9} \]
\[ \lim_{k\to+\infty} \frac{\log P_e(k)}{k} \ge -JD(f_0, f_1) \tag{1.10} \]

where PF(k) is the probability of false alarm, Pe(k) is the classification error probability and PM is the
miss probability, when k samples from the same source are used to make the classification. Note that
f0 and f1 are the p-dimensional probability density functions resulting from the dimensionality reduction
induced by the linear discriminant Q.
These probabilities are well known from hypothesis testing. The probability of false alarm PF(k) is the
probability of deciding for s0 when y was generated by s1. The miss probability PM is the probability of
deciding for s1 when y was generated by s0. Pe(k) is simply the probability of wrong classification of the
sample. Note that the p-dimensional sources s0 and s1, with which the classification process is
performed, are used here.
The exponents DKL(f0||f1), JD(f0, f1) and C(f0, f1) in (1.8)-(1.10) are the Kullback-Leibler Divergence,
the J-Divergence and the Chernoff Distance, whose definitions for generic p-dimensional probability
density functions f0, f1 are

\[ D_{KL}(f_0\|f_1) = \int_{\mathbb{R}^p} f_0(y)\,\log\frac{f_0(y)}{f_1(y)}\,dy \tag{1.11} \]
\[ JD(f_0,f_1) = \frac{D_{KL}(f_0\|f_1) + D_{KL}(f_1\|f_0)}{2} \tag{1.12} \]
\[ C(f_0,f_1) = \max_{0\le t\le 1}\; -\log \int_{\mathbb{R}^p} f_0(y)^t f_1(y)^{1-t}\,dy \tag{1.13} \]

respectively.
Particularizing these expressions for the p-dimensional Gaussian probability density functions resulting
from the dimensionality reduction performed by the linear discriminant Q, f0(Q) = N(Qµ0, QΣ0Q^T) and
f1(Q) = N(Qµ1, QΣ1Q^T), we have

\[ D_{KL}(f_0\|f_1)(Q) = \frac{1}{2}\left[\log\frac{|Q\Sigma_1Q^T|}{|Q\Sigma_0Q^T|} + \mathrm{tr}\big((Q\Sigma_1Q^T)^{-1}(Q\Sigma_0Q^T)\big) + (\mu_0-\mu_1)^TQ^T(Q\Sigma_1Q^T)^{-1}Q(\mu_0-\mu_1) - p\right] \tag{1.14} \]

\[ JD(f_0,f_1)(Q) = \frac{1}{4}\Big[\mathrm{tr}\big((Q\Sigma_1Q^T)^{-1}(Q\Sigma_0Q^T) + (Q\Sigma_0Q^T)^{-1}(Q\Sigma_1Q^T)\big) + (\mu_0-\mu_1)^TQ^T\big[(Q\Sigma_0Q^T)^{-1}+(Q\Sigma_1Q^T)^{-1}\big]Q(\mu_0-\mu_1) - 2p\Big] \tag{1.15} \]

\[ C(f_0,f_1)(Q) = \max_{0\le t\le 1}\; \frac{t(1-t)}{2}(\mu_0-\mu_1)^TQ^T\big[tQ\Sigma_0Q^T+(1-t)Q\Sigma_1Q^T\big]^{-1}Q(\mu_0-\mu_1) + \frac{1}{2}\log\frac{|tQ\Sigma_0Q^T+(1-t)Q\Sigma_1Q^T|}{|Q\Sigma_0Q^T|^t\,|Q\Sigma_1Q^T|^{1-t}} \tag{1.16} \]
For p = 1, and attending to Q = [qT], the expressions simplify further and become

\[ D_{KL}(f_0\|f_1)(q) = \frac{1}{2}\left[-\log\frac{q^T\Sigma_0 q}{q^T\Sigma_1 q} + \frac{q^T\Sigma_0 q}{q^T\Sigma_1 q} + \frac{[q^T(\mu_0-\mu_1)]^2}{q^T\Sigma_1 q} - 1\right] \tag{1.17} \]

\[ JD(f_0,f_1)(q) = \frac{1}{4}\left[\frac{(q^T\Sigma_0 q + q^T\Sigma_1 q)^2}{q^T\Sigma_0 q\; q^T\Sigma_1 q} + \frac{q^T\Sigma_0 q + q^T\Sigma_1 q}{q^T\Sigma_0 q\; q^T\Sigma_1 q}\, q^T(\mu_0-\mu_1)(\mu_0-\mu_1)^T q - 4\right] \tag{1.18} \]

\[ C(f_0,f_1)(q) = \max_{0\le t\le 1}\; \frac{1}{2}\left[ t(1-t)\,\frac{q^T(\mu_0-\mu_1)(\mu_0-\mu_1)^T q}{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q} + \log\frac{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q}{(q^T\Sigma_0 q)^t\,(q^T\Sigma_1 q)^{1-t}} \right] \tag{1.19} \]
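These criteria are inexpensive to evaluate for a candidate discriminant. A minimal sketch assuming NumPy; the inner maximization over t in the Chernoff Distance is done here by a simple grid, which is an illustrative choice rather than the thesis's procedure:

```python
import numpy as np

def gaussian_criteria(Q, mu0, mu1, Sigma0, Sigma1, t_grid=np.linspace(0.01, 0.99, 99)):
    """Evaluate (1.14)-(1.16) for a discriminant Q (p x n); for p = 1 pass Q = q.reshape(1, -1)."""
    Q = np.atleast_2d(Q)
    p = Q.shape[0]
    d = Q @ (mu0 - mu1)
    S0, S1 = Q @ Sigma0 @ Q.T, Q @ Sigma1 @ Q.T
    logdet = lambda A: np.linalg.slogdet(A)[1]
    dkl = 0.5 * (logdet(S1) - logdet(S0) + np.trace(np.linalg.solve(S1, S0))
                 + d @ np.linalg.solve(S1, d) - p)                                    # (1.14)
    jd = 0.25 * (np.trace(np.linalg.solve(S1, S0)) + np.trace(np.linalg.solve(S0, S1))
                 + d @ (np.linalg.solve(S0, d) + np.linalg.solve(S1, d)) - 2 * p)     # (1.15)
    def chernoff(t):                                                                  # (1.16), fixed t
        St = t * S0 + (1 - t) * S1
        return (0.5 * t * (1 - t) * d @ np.linalg.solve(St, d)
                + 0.5 * (logdet(St) - t * logdet(S0) - (1 - t) * logdet(S1)))
    chf = max(chernoff(t) for t in t_grid)
    return dkl, jd, chf
```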
Stein's Lemma gives asymptotic expressions for PF(k) and Pe(k). The heuristic behind using Stein's
Lemma is that, even though a large number k of samples is needed for the asymptotic expressions to
give a good approximation of PF(k) and Pe(k), they are expected to behave well when k is small or even
equal to one. Stein's Lemma thus heuristically fills the lack of closed-form expressions for the
probabilities.
The Kullback-Leibler Divergence has a very easy interpretation:

\[ D_{KL}(f_0\|f_1) = \int_{\mathbb{R}^p} f_0(y)\,\log\frac{f_0(y)}{f_1(y)}\,dy. \]

Maximizing it can be interpreted as maximizing the expected value of the log-likelihood ratio
log(f0(y)/f1(y)) under f0, which is equivalent to maximizing the number of correct detections of samples
from s0 when such samples were generated by s0. Note that this criterion is asymmetric.
The J-Divergence is simply the symmetrization of the Kullback-Leibler Divergence; with this
symmetrization, the two sources are treated equally:

\[ JD(f_0,f_1) = \frac{D_{KL}(f_0\|f_1) + D_{KL}(f_1\|f_0)}{2}. \]

From (1.10) it can be seen that, by maximizing the J-Divergence, an asymptotic lower bound on the
classification error rate is minimized; it is expected that by minimizing this lower bound the classification
error rate itself is also reduced.
The Chernoff Distance is related to the geodesic distance between the probability density functions f0
and f1 on the probability manifold (see [6]).
There is a characteristic that the J-Divergence, the Chernoff Distance and the Kullback-Leibler
Divergence share, and it will turn out to be the main advantage of these methods when compared to the
Fisher's Distance: the ability of these criteria to discriminate the probability density functions f0 and f1 by
their variances. Looking at the expressions of these criteria in (1.17) to (1.19), it is easy to see that, when
they are maximized in q, very different variances for f0 and f1 emerge. In the Kullback-Leibler Divergence
expression, the term that contributes to this phenomenon is

\[ \frac{q^T\Sigma_0 q}{q^T\Sigma_1 q} = \frac{\sigma_0^2}{\sigma_1^2}. \tag{1.20} \]

For the J-Divergence and the Chernoff Distance this role is played by

\[ \frac{(q^T\Sigma_0 q + q^T\Sigma_1 q)^2}{q^T\Sigma_0 q\; q^T\Sigma_1 q} \qquad \text{and} \qquad \log\frac{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q}{(q^T\Sigma_0 q)^t\,(q^T\Sigma_1 q)^{1-t}}, \tag{1.21} \]

respectively, which can be interpreted as ratios of the arithmetic mean to the geometric mean of
(q^TΣ0q, q^TΣ1q), where in the second expression the two quantities are weighted by t. Since this quotient
(arithmetic over geometric mean) attains its minimum when the quantities involved are equal, it is clear
why maximizing these terms drives the two variances apart.
Figures 1.1 and 1.2 show what happens when the Fisher's Distance Maximization criterion and the
J-Divergence Maximization criterion are used, and clearly illustrate this capability.
Looking at Stein's Lemma in (1.8)-(1.10), it can be seen that in order to minimize PF(k) and Pe(k), the
quantities DKL(f0||f1), JD(f0, f1) and C(f0, f1) must be maximized. This is precisely what is done in [7],
where the J-Divergence is maximized, and in [6], where the same is done with the Chernoff Distance.
Figure 1.1: 1-dimensional pdf’s f0 and f1 obtained through Fisher’s Distance.

Figure 1.2: 1-dimensional pdf’s f0 and f1 obtained through J-Divergence.

1.3 Contribution
This thesis treats the class (source) assignment problem for the case where the sources S0 and S1, or
classes C0 and C1, are Gaussian distributed. The criteria, or cost functions, used are the Chernoff
Distance, the Kullback-Leibler Divergence and the J-Divergence. The major difficulty with the choice of
these criteria is that the respective optimization problems are very hard to solve.
Previous works, namely those presented in [7] and [6], do not solve the problems in full generality or use
methods that do not guarantee global optimality. The work developed in this thesis takes the next step
by solving the optimization problems resulting from the use of the Chernoff Distance, the Kullback-Leibler
Divergence and the J-Divergence in full generality and with guaranteed global optimality. Global
optimality is only guaranteed when reducing to one dimension, i.e. p = 1.

1.4 Thesis Outline


Chapter 2. This chapter presents the algorithms that compute the linear discriminants maximizing the
Kullback-Leibler Divergence, the J-Divergence and the Chernoff Distance when projecting the
n-dimensional samples to R.

Chapter 3. This chapter presents the suboptimal approach for computing the linear discriminants when
projecting the n-dimensional samples to Rp.

Chapter 4. This chapter presents performance results, i.e. rates of correct classification of the
n-dimensional samples, for the several criteria used to compute the linear discriminants.

Chapter 2

Algorithms for Dimensionality


Reduction to R

This chapter presents the algorithms that compute the linear discriminant Q = [qT] which performs
dimensionality reduction to R by maximizing the criteria presented in chapter 1, i.e. the Kullback-Leibler
Divergence DKL(f0||f1)(Q), the J-Divergence JD(f0, f1)(Q) and the Chernoff Distance C(f0, f1)(Q).
As mentioned in chapter 1, the probability density functions f0, f1 appearing in these expressions are
those that characterize the output y = Qx of the 1-dimensional sources s0, s1 resulting from the
dimensionality reduction process.

2.1 Kullback-Leibler Divergence Maximization


In chapter 1, the expression for the Kullback-Leibler Divergence between the 1-dimensional probability
density functions f0 , f1 was found to be (see equation (1.17))

\[ D_{KL}(f_0\|f_1)(q) = \frac{1}{2}\left[-\log\frac{q^T\Sigma_0 q}{q^T\Sigma_1 q} + \frac{q^T\Sigma_0 q}{q^T\Sigma_1 q} + \frac{[q^T(\mu_0-\mu_1)]^2}{q^T\Sigma_1 q} - 1\right] \tag{2.1} \]

The goal is to find the global maximizer q of (2.1), i.e.

\[ q = \arg\max_{q\neq 0}\; \frac{1}{2}\left[-\log\frac{q^T\Sigma_0 q}{q^T\Sigma_1 q} + \frac{q^T\Sigma_0 q}{q^T\Sigma_1 q} + \frac{[q^T(\mu_0-\mu_1)]^2}{q^T\Sigma_1 q} - 1\right]. \tag{2.2} \]

It is easy to verify that (2.1) does not depend on the norm of q, so a restriction that does not eliminate
any direction for q is admissible.
In order to simplify the objective function of the optimization problem in (2.2), and without eliminating any
direction for q, the restriction q^TΣ1q = 1 is chosen. Applying the restriction, the optimization problem in
(2.2) becomes

\[ q = \arg\max_{q^T\Sigma_1 q = 1}\; q^T\Sigma_0 q - \log q^T\Sigma_0 q + [q^T(\mu_0-\mu_1)]^2 \tag{2.3} \]
\[ \phantom{q} = \arg\max_{q^T\Sigma_1 q = 1}\; q^T\Sigma_0 q - \log q^T\Sigma_0 q + q^T(\mu_0-\mu_1)(\mu_0-\mu_1)^T q \tag{2.4} \]
\[ \phantom{q} = \arg\max_{q^T\Sigma_1 q = 1}\; q^T\big[\Sigma_0 + (\mu_0-\mu_1)(\mu_0-\mu_1)^T\big]q - \log q^T\Sigma_0 q. \tag{2.5} \]

In what follows, Σ0 + (µ0 − µ1)(µ0 − µ1)^T is denoted by R, resulting in

\[ q = \arg\max_{q^T\Sigma_1 q = 1}\; q^T R q - \log q^T\Sigma_0 q. \tag{2.6} \]

Problem Reformulation. The optimization problem in (2.6) is non-convex, so a reformulation is made by
introducing the variables x and y,

\[ x = q^T R q \tag{2.7} \]
\[ y = q^T\Sigma_0 q, \tag{2.8} \]

resulting for (2.6) in

\[ \max\; x - \log y \quad \text{s.t.} \quad (x, y) \in \mathcal{C} = \{(q^TRq,\, q^T\Sigma_0 q) : q^T\Sigma_1 q = 1\}. \tag{2.9} \]

With this reformulation, the optimization is carried out in only two variables (x, y) ∈ C. However, the
complexity of the original problem is hidden in the definition of the set C. The strategy to solve (2.6)
consists in finding the solution (x*, y*) of (2.9) and then computing a corresponding q, i.e. a q that solves
the following system of quadratic equations:

\[ q^T R q = x^*, \qquad q^T\Sigma_0 q = y^*, \qquad q^T\Sigma_1 q = 1. \tag{2.10} \]

The set C is compact and connected: it results from a continuous quadratic mapping of an ellipsoid,
implying that the variables x and y, considered separately, belong to closed intervals of R. We need to
compute the closed interval for the x variable; since it is a connected set, it suffices to calculate the
endpoints of the interval, i.e.

\[ x_{\min} = \min\{\, q^T R q \;:\; q^T\Sigma_1 q = 1 \,\} \tag{2.11} \]
\[ x_{\max} = \max\{\, q^T R q \;:\; q^T\Sigma_1 q = 1 \,\}. \tag{2.12} \]

The solutions to (2.11) and (2.12) are x_min = λ_min(Σ1^{-1/2} R Σ1^{-1/2}) and
x_max = λ_max(Σ1^{-1/2} R Σ1^{-1/2}) respectively, and thus x ∈ [x_min, x_max].
Knowing this, the strategy to solve (2.9) consists in discretizing the above interval, fixing a value for x,
and optimizing over the y variable. Given the objective function in (2.9), this corresponds to minimizing y.
This procedure has to be repeated for all points x of the discretization of [x_min, x_max]. Once this
procedure is finished, the best pair (x*, y*) is chosen and the corresponding q, defined in (2.10), is the
one that solves (2.6).
Fixing a value for x ∈ [x_min, x_max] and attending to (2.7), the problem related to the optimization of the
y variable is

\[ \min\; q^T\Sigma_0 q \quad \text{s.t.} \quad q^T R q = x, \;\; q^T\Sigma_1 q = 1. \tag{2.13} \]

This problem is non-convex and will be solved through the duality theory presented in appendix A.1.
In the process of finding the pair (x*, y*) that solves the optimization problem in (2.9), for a fixed value of
x it is only necessary to know the best attainable value of y (calculated as in (2.13)). For
x ∈ ]x_min, x_max[, strong duality holds for (2.13), and the value of the y variable is calculated through
the dual problem, which is

\[ \max_{(\lambda_1,\lambda_2)\in\mathbb{R}^2}\; -\lambda_1 x - \lambda_2 \quad \text{s.t.} \quad \Sigma_0 + \lambda_1 R + \lambda_2 \Sigma_1 \succeq 0. \tag{2.14} \]

As explained in appendix A.1, the dual problem is an optimization problem in just two variables. Once
the pair (x*, y*) is computed, the optimal q* that solves (2.6) must be recovered. For this, the bi-dual
problem of (2.13) is used. We know that the set of optimal points that contains the solution q is such that
q^T R q = x*, so this restriction is represented in the bi-dual problem (2.15) through tr(RQ) = x*:

\[ \min\; \mathrm{tr}(\Sigma_0 Q) \quad \text{s.t.} \quad \mathrm{tr}(RQ) = x^*, \;\; \mathrm{tr}(\Sigma_1 Q) = 1, \;\; Q \succeq 0. \tag{2.15} \]

Provided that the Slater condition is verified, i.e. x ∈ ]x_min, x_max[, and that the solution of the bi-dual
problem is unique, Q is a rank-1 positive semidefinite matrix and its only eigenvector (associated with a
nonzero eigenvalue) is the solution of the problem, i.e. it is the linear discriminant that optimizes the
Kullback-Leibler Divergence criterion.
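The whole procedure is easy to prototype. The following is a minimal sketch, assuming NumPy, SciPy and cvxpy (with the SCS solver) are available; the function name, the uniform grid and its size are illustrative stand-ins for the discretization described above:

```python
import numpy as np
import cvxpy as cp
from scipy.linalg import eigh

def kl_discriminant_1d(mu0, mu1, Sigma0, Sigma1, n_grid=100):
    """Sketch of section 2.1: grid [x_min, x_max], solve the bi-dual SDP (2.15) for each
    fixed x, and keep the pair (x, y) that maximizes the objective x - log(y) of (2.9)."""
    n = len(mu0)
    d = (mu0 - mu1).reshape(-1, 1)
    R = Sigma0 + d @ d.T                              # R = Sigma0 + (mu0 - mu1)(mu0 - mu1)^T
    lam = eigh(R, Sigma1, eigvals_only=True)          # eigenvalues of Sigma1^{-1/2} R Sigma1^{-1/2}
    x_min, x_max = lam[0], lam[-1]
    best_val, q_best = -np.inf, None
    for x in np.linspace(x_min, x_max, n_grid)[1:-1]:   # open interval: strong duality holds
        Q = cp.Variable((n, n), PSD=True)
        prob = cp.Problem(cp.Minimize(cp.trace(Sigma0 @ Q)),
                          [cp.trace(R @ Q) == x, cp.trace(Sigma1 @ Q) == 1])
        prob.solve(solver=cp.SCS)
        y = prob.value
        if y is None or y <= 0:
            continue
        if x - np.log(y) > best_val:
            best_val = x - np.log(y)
            w, V = np.linalg.eigh(Q.value)            # rank-1 optimum: keep the leading eigenvector
            q_best = V[:, -1] * np.sqrt(max(w[-1], 0.0))
    return q_best
```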

2.2 J-Divergence Maximization
In chapter 1, the expression for the J-Divergence between the 1-dimensional probability density
functions f0, f1 was found to be (see equation (1.18))

\[ JD(f_0,f_1)(q) = \frac{1}{4}\left[\frac{(q^T\Sigma_0 q + q^T\Sigma_1 q)^2}{q^T\Sigma_0 q\; q^T\Sigma_1 q} + \frac{q^T\Sigma_0 q + q^T\Sigma_1 q}{q^T\Sigma_0 q\; q^T\Sigma_1 q}\, q^T(\mu_0-\mu_1)(\mu_0-\mu_1)^T q - 4\right]. \tag{2.16} \]

The goal is to find the global maximizer q of (2.16), i.e.

\[ q = \arg\max_{q\neq 0}\; \frac{1}{4}\left[\frac{(q^T\Sigma_0 q + q^T\Sigma_1 q)^2}{q^T\Sigma_0 q\; q^T\Sigma_1 q} + \frac{q^T\Sigma_0 q + q^T\Sigma_1 q}{q^T\Sigma_0 q\; q^T\Sigma_1 q}\, q^T(\mu_0-\mu_1)(\mu_0-\mu_1)^T q - 4\right]. \tag{2.17} \]

The expression for the J-Divergence in (2.16) does not depend on the norm of q, but only on its direction.
Taking advantage of this property, the following restriction is added:

\[ q^T\Sigma_0 q\; q^T\Sigma_1 q = 1. \tag{2.18} \]

As in the previous optimization problem, this restriction is admissible in the sense that it does not
eliminate any direction: given any q ∈ Rn, it is possible to scale it, without changing its direction, until it
verifies (2.18). Applying this restriction and dropping the multiplicative and additive constants of the
objective function, the optimization problem in (2.17) becomes

\[ q = \arg\max_{q^T\Sigma_0 q\, q^T\Sigma_1 q = 1}\; (q^T\Sigma_0 q + q^T\Sigma_1 q)^2 + (q^T\Sigma_0 q + q^T\Sigma_1 q)\, q^T(\mu_0-\mu_1)(\mu_0-\mu_1)^T q. \tag{2.19} \]

The restriction in (2.18) can be written as q^TΣ1q = 1/(q^TΣ0q), so q^TΣ1q is substituted by 1/(q^TΣ0q)
in (2.19), resulting in

\[ q = \arg\max_{q^T\Sigma_0 q\, q^T\Sigma_1 q = 1}\; \left(q^T\Sigma_0 q + \frac{1}{q^T\Sigma_0 q}\right)^2 + \left(q^T\Sigma_0 q + \frac{1}{q^T\Sigma_0 q}\right) q^T(\mu_0-\mu_1)(\mu_0-\mu_1)^T q. \tag{2.20} \]

Problem Reformulation. The optimization problem in (2.20) is non-convex, so a reformulation is made by
introducing the variables x and y,

\[ x = q^T\Sigma_0 q \tag{2.21} \]
\[ y = q^T(\mu_0-\mu_1)(\mu_0-\mu_1)^T q, \tag{2.22} \]

resulting for (2.20) in

\[ \max\; \left(x + \frac{1}{x}\right)^2 + \left(x + \frac{1}{x}\right) y \quad \text{s.t.} \quad (x, y) \in \mathcal{C} = \{(q^T\Sigma_0 q,\, q^T M q) : q^T\Sigma_0 q\, q^T\Sigma_1 q = 1\}, \tag{2.23} \]

where M = (µ0 − µ1)(µ0 − µ1)^T.

The strategy to solve (2.20), as previously seen, consists in finding the solution (x*, y*) of (2.23) and then
computing the corresponding q, i.e. a q that solves

\[ q^T\Sigma_0 q = x^*, \qquad q^T M q = y^*, \qquad q^T\Sigma_0 q\, q^T\Sigma_1 q = 1. \tag{2.24} \]

Appendix C shows that the set C is compact and connected, implying that the variables x and y,
considered separately, belong to closed intervals of R. We need to compute the closed interval for the x
variable; since it is a connected set, it suffices to calculate the endpoints of the interval, i.e.

\[ x_{\min} = \min\{\, q^T\Sigma_0 q \;:\; q^T\Sigma_0 q\, q^T\Sigma_1 q = 1 \,\} \tag{2.25} \]
\[ x_{\max} = \max\{\, q^T\Sigma_0 q \;:\; q^T\Sigma_0 q\, q^T\Sigma_1 q = 1 \,\}. \tag{2.26} \]

The solutions to (2.25) and (2.26) are

\[ x_{\min} = \frac{1}{\sqrt{\lambda_{\max}\big(\Sigma_0^{-1/2}\Sigma_1\Sigma_0^{-1/2}\big)}} \tag{2.27} \]
\[ x_{\max} = \frac{1}{\sqrt{\lambda_{\min}\big(\Sigma_0^{-1/2}\Sigma_1\Sigma_0^{-1/2}\big)}}, \tag{2.28} \]

so x ∈ [x_min, x_max]. Section 2.2.1 shows how these expressions for the endpoints of the interval of the
x variable were obtained.
Knowing this, the strategy to solve (2.23) consists in discretizing the above interval, fixing a value for x,
and optimizing over the y variable. The goal is to solve (2.23), and so optimizing y consists in maximizing
it. This procedure has to be repeated for all points x of the discretization of [x_min, x_max]. Once this
procedure is finished, the best pair (x*, y*) is chosen, and the corresponding q, according to (2.24), is the
one that solves (2.20).
Fixing a value x ∈ [x_min, x_max] and attending to (2.21), the problem related to the optimization of the
y variable is

\[ \max\; q^T M q \quad \text{s.t.} \quad q^T\Sigma_0 q = x, \;\; q^T\Sigma_1 q = \frac{1}{x}, \tag{2.29} \]

which, written as a minimization problem, becomes

\[ \min\; q^T(-M)q \quad \text{s.t.} \quad q^T\Sigma_0 q = x, \;\; q^T\Sigma_1 q = \frac{1}{x}. \tag{2.30} \]

This problem is solved through the duality theory presented in section A.1. From this point on, everything
follows the same procedure as for the Kullback-Leibler Divergence algorithm (see section 2.1). It is
important to note that strong duality for (2.30) only holds for x ∈ ]x_min, x_max[.

2.2.1 Interval Computation
This section shows how to compute the endpoints of the interval [x_min, x_max] ⊂ R to which the x
variable defined in (2.21) belongs.
The definition of these endpoints was given in (2.25) and (2.26) and is repeated here:

\[ x_{\min} = \min\{\, q^T\Sigma_0 q \;:\; q^T\Sigma_0 q\, q^T\Sigma_1 q = 1 \,\} \tag{2.31} \]
\[ x_{\max} = \max\{\, q^T\Sigma_0 q \;:\; q^T\Sigma_0 q\, q^T\Sigma_1 q = 1 \,\}. \tag{2.32} \]

Introducing the variables a and b, defined as

\[ a = q^T\Sigma_0 q \tag{2.33} \]
\[ b = q^T\Sigma_1 q, \tag{2.34} \]

the problems in (2.31) and (2.32) are equivalent to

\[ x_{\min} = \min\{\, a \;:\; ab = 1,\; (a, b) \in K \,\} \tag{2.35} \]
\[ x_{\max} = \max\{\, a \;:\; ab = 1,\; (a, b) \in K \,\}, \tag{2.36} \]

where K = {(q^TΣ0q, q^TΣ1q) : q ∈ Rn}. The restrictions in the reformulated problems are therefore
ab = 1 together with (a, b) ∈ K.

Due to the positive definiteness of the matrices Σ0 and Σ1 involved in the definition of the variables a and
b, it is easy to see that a and b belong to the first orthant.
The set K is the image of a quadratic mapping from Rn to R2 and, due to a theorem by Dines (see [9]), it
is a closed convex cone in the first orthant. With this graphical interpretation, it is easy to see that x_min
and x_max are the points of intersection of the straight lines that delimit the set K with the graph of the
hyperbola ab = 1.

In order to calculate these intersections, the mathematical expressions of the straight lines that delimit
the set K are needed. Since these straight lines pass through the origin, they are of the form b = ma.
From figure 2.1 it can be seen that, in order to calculate x_min, the slope of the upper straight line is
needed; to obtain it, a point (a, b) of this straight line must be computed. Attending to the definition of the
set K and fixing a = 1, b is equal to

\[ b = \max\{\, q^T\Sigma_1 q \;:\; q^T\Sigma_0 q = 1 \,\} = \lambda_{\max}\big(\Sigma_0^{-1/2}\Sigma_1\Sigma_0^{-1/2}\big). \tag{2.37} \]

Figure 2.1: Geometrical interpretation of the set of restrictions.

The restriction q^TΣ0q = 1 in the optimization problem above, needed to compute b, corresponds to
fixing a = 1. The choice a = 1 was made in order to obtain directly the slope of the upper straight line.
With the slope of the upper straight line that delimits the set K computed, this straight line must be
intersected with the graph of the hyperbola, i.e. the following system of equations must be solved:

\[ ab = 1 \tag{2.38} \]
\[ b = ma, \tag{2.39} \]

where m = λ_max(Σ0^{-1/2} Σ1 Σ0^{-1/2}). This results in

\[ a = \frac{1}{\sqrt{\lambda_{\max}\big(\Sigma_0^{-1/2}\Sigma_1\Sigma_0^{-1/2}\big)}}, \qquad b = \sqrt{\lambda_{\max}\big(\Sigma_0^{-1/2}\Sigma_1\Sigma_0^{-1/2}\big)}, \]

where a is x_min.
Following the same process used to calculate x_min, it can be shown that
x_max = 1/\sqrt{\lambda_{\min}\big(\Sigma_0^{-1/2}\Sigma_1\Sigma_0^{-1/2}\big)}.
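These endpoints are cheap to compute numerically. A minimal sketch assuming NumPy/SciPy (the function name is illustrative):

```python
import numpy as np
from scipy.linalg import eigh

def jd_x_interval(Sigma0, Sigma1):
    """Endpoints (2.27)-(2.28) of the interval for x = q^T Sigma0 q under q^T Sigma0 q q^T Sigma1 q = 1.
    eigh(Sigma1, Sigma0) solves Sigma1 v = lambda Sigma0 v, whose eigenvalues coincide with
    those of Sigma0^{-1/2} Sigma1 Sigma0^{-1/2}."""
    lam = eigh(Sigma1, Sigma0, eigvals_only=True)          # ascending generalized eigenvalues
    return 1.0 / np.sqrt(lam[-1]), 1.0 / np.sqrt(lam[0])   # (x_min, x_max)
```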

2.3 Chernoff Distance Maximization
In chapter 1, the expression for the Chernoff Distance between the 1-dimensional probability density
functions f0, f1 was found to be (see equation (1.19))

\[ C(f_0,f_1)(q) = \max_{0\le t\le 1}\; \frac{1}{2}\left[ t(1-t)\,\frac{q^T(\mu_0-\mu_1)(\mu_0-\mu_1)^T q}{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q} + \log\frac{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q}{(q^T\Sigma_0 q)^t\,(q^T\Sigma_1 q)^{1-t}} \right]. \tag{2.40} \]

The goal is to find the global maximizer q of (2.40), i.e.

\[ q = \arg\max_{q\neq 0}\; \max_{0\le t\le 1}\; \frac{1}{2}\left[ t(1-t)\,\frac{q^T(\mu_0-\mu_1)(\mu_0-\mu_1)^T q}{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q} + \log\frac{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q}{(q^T\Sigma_0 q)^t\,(q^T\Sigma_1 q)^{1-t}} \right]. \tag{2.41} \]

The expression of the Chernoff Distance in (2.40) involves an intrinsic optimization in the variable t, so
the optimization problem in (2.41) can be written making that characteristic explicit, giving

\[ (q, t) = \arg\max_{q\neq 0,\; 0\le t\le 1}\; \frac{1}{2}\left[ t(1-t)\,\frac{q^T M q}{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q} + \log\frac{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q}{(q^T\Sigma_0 q)^t\,(q^T\Sigma_1 q)^{1-t}} \right], \tag{2.42} \]

where M = (µ0 − µ1)(µ0 − µ1)^T.

The optimization problem (2.42) is non-convex and has to be solved in two variables, q ∈ Rn and
t ∈ [0, 1]. Due to its non-convexity, it is useful to rewrite it in the following equivalent form:

\[ \max_{0\le t\le 1}\; \max_{q\neq 0}\; \frac{1}{2}\left[ t(1-t)\,\frac{q^T M q}{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q} + \log\frac{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q}{(q^T\Sigma_0 q)^t\,(q^T\Sigma_1 q)^{1-t}} \right]. \tag{2.43} \]

This equivalent optimization problem is solved by fixing t on a discretization of [0, 1] and optimizing in the
q variable. The main advantage of this rewriting is that the cost function in (2.43) is independent of the
norm of q for a fixed t. Since this optimization problem is non-convex, it has necessarily to be solved by
searching, for every point t of the discretization of [0, 1], the best corresponding q. Once this procedure is
finished, the best pair (q*, t*) is chosen.

Due to the independence of the norm of q for a fixed t, the optimization problem can be further simplified
by introducing the following restriction:

\[ t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q = 1, \tag{2.44} \]

resulting for (2.43) in

\[ \max_{0\le t\le 1}\; \max_{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q = 1}\; \frac{1}{2}\left[ t(1-t)\,q^T M q - \log\big((q^T\Sigma_0 q)^t\,(q^T\Sigma_1 q)^{1-t}\big) \right]. \tag{2.45} \]

With this approach, for a fixed t we have to find the global maximizer q of the following subproblem:

\[ \max_{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q = 1}\; \frac{1}{2}\left[ t(1-t)\,q^T M q - \log\big((q^T\Sigma_0 q)^t\,(q^T\Sigma_1 q)^{1-t}\big) \right]. \tag{2.46} \]

It is expected that t* belongs to the open interval ]0, 1[; otherwise, the information about one of the
covariance matrices Σ0 or Σ1 would be neglected. With this assumption, the endpoints of [0, 1] are not
evaluated. This enables rewriting the restriction introduced in (2.44) as

\[ q^T\Sigma_1 q = \frac{1 - t\,q^T\Sigma_0 q}{1-t}. \]

Taking this into account, and using the properties of the logarithm, (2.46) becomes

\[ \max_{t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q = 1}\; \frac{1}{2}\left[ t(1-t)\,q^T M q - t\log q^T\Sigma_0 q - (1-t)\log\frac{1 - t\,q^T\Sigma_0 q}{1-t} \right]. \tag{2.47} \]

Subproblem reformulation. The optimization problem in (2.47) is non-convex, so, as in the previous
situations, a reformulation is made by introducing the variables x and y,

\[ x = q^T\Sigma_0 q \tag{2.48} \]
\[ y = q^T M q, \tag{2.49} \]

resulting for (2.47) in

\[ \max\; t(1-t)\,y - t\log x - (1-t)\log\frac{1-tx}{1-t} \quad \text{s.t.} \quad (x, y) \in \mathcal{C} = \{(q^T\Sigma_0 q,\, q^T M q) : t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q = 1\}. \tag{2.50} \]

Again, the strategy to solve (2.47) consists in finding the solution (x*, y*) of (2.50) and then computing
the corresponding q, i.e. a q that solves the following system of quadratic equations:

\[ q^T\Sigma_0 q = x^*, \qquad q^T M q = y^*, \qquad t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q = 1. \tag{2.51} \]

The set C is compact and connected: it results from a continuous quadratic mapping of an ellipsoid,
implying that the variables x and y, considered separately, belong to closed intervals of R.
We need to compute the closed interval for the x variable; since it is a connected set, it suffices to
calculate the endpoints of the interval, i.e.

\[ x_{\min} = \min\{\, q^T\Sigma_0 q \;:\; t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q = 1 \,\} \tag{2.52} \]
\[ x_{\max} = \max\{\, q^T\Sigma_0 q \;:\; t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q = 1 \,\}. \tag{2.53} \]

Knowing this, the strategy to solve (2.50) consists in discretizing the above interval, fixing a value for x,
and optimizing over the y variable. Given the objective function in (2.50), this corresponds to maximizing
y.
This procedure has to be repeated for all points x of the discretization of [x_min, x_max]. Once this
procedure is finished, the best pair (x*, y*) is chosen and the corresponding q, defined in (2.51), is the
one that solves (2.47).

Fixing a value for x ∈ [x_min, x_max] and attending to (2.48), the problem related to the optimization of
the y variable is

\[ \max\; q^T M q \quad \text{s.t.} \quad q^T\Sigma_0 q = x, \;\; t\,q^T\Sigma_0 q + (1-t)\,q^T\Sigma_1 q = 1. \tag{2.54} \]

This problem is solved through the duality theory presented in section A.1. From this point on, everything
follows the same procedure as for the Kullback-Leibler Divergence algorithm (see section 2.1). It is
important to note that strong duality for (2.54) only holds for x ∈ ]x_min, x_max[.

Observation. In all three previous algorithms, it was stated that strong duality holds for
x ∈ ]x_min, x_max[, and consequently the endpoints were not evaluated. This does not represent a
problem, since the objective functions in (2.9), (2.23) and (2.50) are continuous on the respective closed
intervals, as is the optimized y variable seen as a function of the fixed x.

Chapter 3

Greedy Algorithms

The algorithms presented in chapter 2 perform a dimensionality reduction from n dimensions to one
dimension. This is done by linearly mapping the n-dimensional samples x ∈ Rn through the linear
discriminant q. Such a drastic dimensionality reduction may cause an unacceptable loss of information,
making the projected distributions almost indistinguishable; the consequence is a significant
classification error rate. In order to make the dimensionality reduction less drastic, and hopefully lower
the classification error rate, this chapter considers dimensionality reduction to Rp with p > 1. The
algorithms developed here perform a dimensionality reduction from n dimensions to p dimensions, with
p > 1, through the linear discriminant matrix Q ∈ Rp×n, i.e.

y = Qx (3.1)

where x is the n-dimensional sample, Q is the linear discriminant matrix, and y ∈ Rp is the signature of
the sample x, used in the classification procedure.
In order to better understand the greedy version of the algorithms presented in chapter 2, the greedy
algorithm that maximizes the Kullback-Leibler Divergence is presented. For the other criteria the
algorithms follow a similar pattern, which will not be repeated here.
The optimal linear discriminant Q ∈ Rp×n that maximizes the Kullback-Leibler Divergence between the
p-dimensional probability density functions f0(Q) = N(Qµ0, QΣ0Q^T) and f1(Q) = N(Qµ1, QΣ1Q^T) is
found by solving the optimization problem

\[ \max_{Q\in\mathbb{R}^{p\times n}}\; D_{KL}(f_0\|f_1)(Q), \tag{3.2} \]

where

\[ D_{KL}(f_0\|f_1)(Q) = \frac{1}{2}\left[\log\frac{|Q\Sigma_1Q^T|}{|Q\Sigma_0Q^T|} + \mathrm{tr}\big((Q\Sigma_1Q^T)^{-1}(Q\Sigma_0Q^T)\big) + (\mu_0-\mu_1)^TQ^T(Q\Sigma_1Q^T)^{-1}Q(\mu_0-\mu_1) - p\right]. \tag{3.3} \]

The main problem with this approach is the non-convexity of the objective function. Although the case
p = 1 could be treated through a series of reformulations and simplifications which made it possible to
find the solution efficiently, we were not able to extend this procedure to p > 1. So a sub-optimal
approach to solve (3.2) is taken. This approach consists in computing the p rows of Q ∈ Rp×n one by
one, by solving p 1-dimensional optimization problems like the one in (2.2) for the case of the
Kullback-Leibler Divergence.
We start by noting that, without loss of optimality, the matrix Q in (3.2) can be taken to be Stiefel, i.e.
with orthonormal rows. This is proved in appendix B. The fact that Q can be a Stiefel matrix motivates
the following procedure to compute its p rows.

Computation of the rows of Q.

\[ Q = \begin{bmatrix} q_1^T \\ \vdots \\ q_p^T \end{bmatrix} \]

The first row q1^T coincides with the (transposed) linear discriminant q of the 1-dimensional problem
(see (2.2)), i.e.

\[ q_1 = q = \arg\max_{q\neq 0}\; \frac{1}{2}\left[-\log\frac{q^T\Sigma_0 q}{q^T\Sigma_1 q} + \frac{q^T\Sigma_0 q}{q^T\Sigma_1 q} + \frac{[q^T(\mu_0-\mu_1)]^2}{q^T\Sigma_1 q} - 1\right]. \tag{3.4} \]

The second row is computed by running the algorithm again, but now imposing that this row is
orthogonal to the first, i.e.

\[ q_2 = O_1 g, \tag{3.5} \]

where O1 ∈ Rn×(n−1) is a matrix whose columns generate the orthogonal complement of the subspace
generated by q1, and g ∈ Rn−1 is the vector that collects the coefficients of the linear combination of the
columns of O1.
In order to compute q2, a modified version of (3.4) is solved, i.e.

\[ g = \arg\max_{g\neq 0}\; \frac{1}{2}\left[-\log\frac{g^TO_1^T\Sigma_0 O_1 g}{g^TO_1^T\Sigma_1 O_1 g} + \frac{g^TO_1^T\Sigma_0 O_1 g}{g^TO_1^T\Sigma_1 O_1 g} + \frac{[g^TO_1^T(\mu_0-\mu_1)]^2}{g^TO_1^T\Sigma_1 O_1 g} - 1\right]. \tag{3.6} \]

The modification introduced was the substitution of q in (3.4) by O1 g, which imposes the orthogonality
condition. Note that this optimization problem has exactly the same form as the 1-dimensional problem
in (3.4) and is therefore solved in exactly the same way, after which q2 = O1 g.
To solve for row i, it is just a matter of substituting O1 by O_{i−1}, the matrix whose columns generate the
orthogonal complement of the subspace generated by the i − 1 previously computed rows.
It is important to note that the complexity of the sub-problems solved to compute the p rows is
decreasing, since the optimization is carried out in subspaces of decreasing dimension.
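The row-by-row construction is easy to express on top of any of the p = 1 solvers from chapter 2. A minimal sketch assuming NumPy/SciPy; solve_1d is a placeholder for such a solver (its name and signature are illustrative):

```python
import numpy as np
from scipy.linalg import null_space

def greedy_discriminant(mu0, mu1, Sigma0, Sigma1, p, solve_1d):
    """Greedy construction of Q (chapter 3): each new row is obtained by running a
    p = 1 solver on the parameters projected onto the orthogonal complement of the
    rows already computed."""
    n = len(mu0)
    rows = []
    for _ in range(p):
        O = np.eye(n) if not rows else null_space(np.array(rows))  # orthogonal complement basis
        g = solve_1d(O.T @ mu0, O.T @ mu1, O.T @ Sigma0 @ O, O.T @ Sigma1 @ O)
        q = O @ g
        rows.append(q / np.linalg.norm(q))   # keep Q with orthonormal rows (Stiefel)
    return np.array(rows)
```

Any 1-dimensional solver with the same argument layout (for instance a Fisher or Kullback-Leibler solver) can be passed as solve_1d.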

Chapter 4

Computer Simulations

4.1 Hit Rates


In this section we compare the performance of the four criteria used to construct linear discriminants:
Kullback-Leibler Divergence (KLD), J-Divergence (JD), Chernoff Distance (CHF) and Fisher's Distance
(FLDA). The index of performance is the hit rate, i.e. the percentage of correct decisions in the
lower-dimensional space Rp. We consider the following simulation scenarios: dimensionality reduction to
R (p = 1) (4.1.1), dimensionality reduction to Rp (4.1.2), and the use of k > 1 samples from the same
source in the detection process (4.1.3). The hit rates were computed by Monte Carlo simulations: we
generated 100000 samples from each Gaussian source in Rn; each sample is projected to Rp by each of
the four linear discriminants and classified by the optimum detector (the maximum likelihood detector).
The hit rate for a given linear discriminant corresponds to the fraction of correct decisions over the
200000 samples from both sources.
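A minimal sketch of this Monte Carlo estimate, assuming NumPy/SciPy (the function name, the seed and the default sample count are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def hit_rate(Q, mu0, mu1, Sigma0, Sigma1, n_samples=100000, seed=0):
    """Fraction of correct ML decisions on signatures y = Qx, over samples of both sources."""
    rng = np.random.default_rng(seed)
    Q = np.atleast_2d(Q)
    f0 = mvn(Q @ mu0, Q @ Sigma0 @ Q.T)   # low-dimensional densities
    f1 = mvn(Q @ mu1, Q @ Sigma1 @ Q.T)
    hits = 0
    for mu, Sigma, label in [(mu0, Sigma0, 0), (mu1, Sigma1, 1)]:
        y = rng.multivariate_normal(mu, Sigma, size=n_samples) @ Q.T   # signatures
        decide_1 = f1.logpdf(y) > f0.logpdf(y)                         # ML (Neyman-Pearson) detector
        hits += np.count_nonzero(decide_1 == bool(label))
    return hits / (2 * n_samples)
```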

4.1.1 Dimensionality Reduction to R
The results of the simulations are presented for three distinct cases concerning the parameters of the
sources: distinct means and distinct covariance matrices (table 4.1), equal means and distinct covariance
matrices (table 4.2), and distinct means and equal covariance matrices (table 4.3), with increasing data
dimensionality n = 10, 20, 30, 40, 50.

Table 4.1: Distinct Means and Distinct Covariance Matrices


JD KLD CHF FLDA
n=10 0.8508 0.8309 0.8607 0.5969
n=20 0.9870 0.9870 0.9415 0.6412
n=30 0.9010 0.9013 0.9376 0.7056
n=40 0.9935 0.9936 0.9426 0.6931
n=50 0.9891 0.9893 0.9430 0.7088

Table 4.2: Equal Means and Distinct Covariance Matrices


JD KLD CHF FLDA
n=10 0.9852 0.9853 0.9404 0.6006
n=20 0.9827 0.9830 0.9411 0.5159
n=30 0.9820 0.9821 0.9400 0.5255
n=40 0.9867 0.9868 0.9403 0.5243
n=50 0.9583 0.9586 0.9376 0.5053

Table 4.3: Distinct Means and Equal Covariance Matrices


JD KLD CHF FLDA
n=10 0.6740 0.6743 0.6743 0.6737
n=20 0.9255 0.9254 0.9253 0.9253
n=30 0.9203 0.9202 0.9204 0.9206
n=40 0.9665 0.9667 0.9666 0.9668
n=50 1.0000 1.0000 1.0000 1.0000

As we can see, FLDA always yields the worst performance in tables 4.1 and 4.2. In table 4.3 all criteria
perform about the same, as predicted by the theory (see appendix D).

4.1.2 Dimensionality Reduction to Rp
The next tables collect the results obtained when reducing the n-dimensional data to Rp with p > 1,
through the linear discriminants that maximize the Chernoff Distance, the Kullback-Leibler Divergence,
the J-Divergence and the Fisher's Distance. For each criterion there is a table with the classification hit
rates, for varying data dimensionality n = 10, 20, 30, 40, 50 and increasing p. Once again, we present the
results for the three different cases concerning the parameters of the sources.
For the same data dimension n (rows of the tables) and for the same source setup, the different criteria
may be compared; e.g. row 1 of table 4.4 can be compared with row 1 of tables 4.5, 4.6 and 4.7.

Table 4.4: Chernoff Distance Criteria with Distinct Means and Distinct Covariance Matrices
p=1 p=7 p=17 p=27 p=37 p=47
n=10 0.9027 0.9827 x x x x
n=20 0.9420 0.9995 0.9999 x x x
n=30 0.9419 0.9994 1.0000 1.0000 x x
n=40 0.9409 1.0000 1.0000 1.0000 1.0000 x
n=50 0.9431 0.9994 1.0000 1.0000 1.0000 1.0000

Table 4.5: Fisher’s Distance Criteria with Distinct Means and Distinct Covariance Matrices
p=1 p=7 p=17 p=27 p=37 p=47
n=10 0.5915 0.8523 x x x x
n=20 0.6036 0.8200 0.9943 x x x
n=30 0.6396 0.7933 0.9735 1.0000 x x
n=40 0.6439 0.7368 0.9302 0.9924 1.0000 x
n=50 0.6758 0.7677 0.9080 0.9863 0.9994 1.0000

Table 4.6: Kullback-Leibler Divergence Criteria with Distinct Means and Distinct Covariance Matrices
p=1 p=7 p=17 p=27 p=37 p=47
n=10 0.8426 0.9341 x x x x
n=20 0.9905 0.9997 0.9998 x x x
n=30 0.9899 1.0000 1.0000 1.0000 x x
n=40 0.9748 0.9998 1.0000 1.0000 1.0000 x
n=50 0.9893 1.0000 1.0000 1.0000 1.0000 1.0000

Table 4.7: J-Divergence Criteria with Distinct Means and Distinct Covariance Matrices
p=1 p=7 p=17 p=27 p=37 p=47
n=10 0.8881 0.9808 x x x x
n=20 0.9901 0.9997 0.9999 x x x
n=30 0.9898 0.9999 1.0000 1.0000 x x
n=40 0.9747 0.9998 1.0000 1.0000 1.0000 x
n=50 0.9891 0.9999 1.0000 1.0000 1.0000 1.0000

For the case of distinct means and distinct covariance matrices, it can be seen that the hit rates improve
as p increases. This is to be expected, as less information is being discarded. In section 1.2 of chapter 1
we saw that the pure Fisher's Distance Maximization criterion cannot handle dimensionality reduction to
Rp with p > 1. However, what can be seen from the tables is that the Fisher's Distance Maximization
criterion (FLDA) benefits from the greedy technique proposed in chapter 3.

Table 4.8: Chernoff Distance Criteria with Equal Means and Distinct Covariance Matrices
p=1 p=7 p=17 p=27 p=37 p=47
n=10 0.9013 0.9817 x x x x
n=20 0.9413 0.9994 0.9999 x x x
n=30 0.9417 0.9995 1.0000 1.0000 x x
n=40 0.9402 1.0000 1.0000 1.0000 1.0000 x
n=50 0.9406 0.9994 1.0000 1.0000 1.0000 1.0000

Table 4.9: Fisher’s Distance Criteria with Equal Means and Distinct Covariance Matrices
p=1 p=7 p=17 p=27 p=37 p=47
n=10 0.5138 0.8480 x x x x
n=20 0.5472 0.8111 0.9942 x x x
n=30 0.5473 0.7730 0.9719 1.0000 x x
n=40 0.5173 0.6958 0.9237 0.9914 1.0000 x
n=50 0.5104 0.7183 0.8932 0.9846 0.9994 1.0000

Table 4.10: Kullback-Leibler Divergence Criteria with Equal Means and Distinct Covariance Matrices
p=1 p=7 p=17 p=27 p=37 p=47
n=10 0.8412 0.9327 x x x x
n=20 0.9906 0.9997 0.9998 x x x
n=30 0.9904 1.0000 1.0000 1.0000 x x
n=40 0.9749 0.9998 0.9999 1.0000 1.0000 x
n=50 0.9890 1.0000 1.0000 1.0000 1.0000 1.0000

Table 4.11: J-Divergence Criteria with Equal Means and Distinct Covariance Matrices
p=1 p=7 p=17 p=27 p=37 p=47
n=10 0.8867 0.9801 x x x x
n=20 0.9903 0.9997 0.9999 x x x
n=30 0.9905 0.9999 1.0000 1.0000 x x
n=40 0.9748 0.9998 1.0000 1.0000 1.0000 x
n=50 0.9890 0.9999 1.0000 1.0000 1.0000 1.0000

In this case, the means of the sources are equal and the covariance matrices are distinct. All the criteria
exhibit better performance with increasing p. The most notable case corresponds to FLDA: for p = 1 it
can be seen (table 4.9) that the detector is as good as a random classifier, whereas for p = 7 the
performance increases drastically. As before, the other criteria outperform the FLDA criterion, which
demonstrates their ability to discriminate the sources through their covariances.

Table 4.12: Chernoff Distance Criteria with Distinct Means and Equal Covariance Matrices
p=1 p=7 p=17 p=27 p=37 p=47
n=10 0.8402 0.8402 x x x x
n=20 0.7752 0.7753 0.7753 x x x
n=30 0.9685 0.9686 0.9686 0.9686 x x
n=40 1.0000 1.0000 1.0000 1.0000 1.0000 x
n=50 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

Table 4.13: Fisher’s Distance Criteria with Distinct Means and Equal Covariance Matrices
p=1 p=7 p=17 p=27 p=37 p=47
n=10 0.8402 0.8402 x x x x
n=20 0.7753 0.7753 0.7753 x x x
n=30 0.9686 0.9686 0.9686 0.9686 x x
n=40 1.0000 1.0000 1.0000 1.0000 1.0000 x
n=50 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

Table 4.14: Kullback-Leibler Divergence Criteria with Distinct Means and Equal Covariance Matrices
p=1 p=7 p=17 p=27 p=37 p=47
n=10 0.8402 0.8402 x x x x
n=20 0.7753 0.7753 0.7753 x x x
n=30 0.9686 0.9686 0.9686 0.9686 x x
n=40 1.0000 1.0000 1.0000 1.0000 1.0000 x
n=50 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

Table 4.15: J-Divergence Criteria with Distinct Means and Equal Covariance Matrices
p=1 p=7 p=17 p=27 p=37 p=47
n=10 0.8402 0.8402 x x x x
n=20 0.7753 0.7753 0.7753 x x x
n=30 0.9685 0.9686 0.9686 0.9686 x x
n=40 1.0000 1.0000 1.0000 1.0000 1.0000 x
n=50 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

In this situation the distributions have equal covariance matrices and distinct means. As shown in
appendix D, all the criteria perform the same, which is confirmed by analyzing tables 4.12 to 4.15.
An intuitive interpretation is that, apart from an invertible linear transformation of the samples, this
situation is no different from one where the covariance matrices are the identity and the means are
distinct: in that case it is obvious that the best possible linear discriminant is a vector aligned with
µ0 − µ1, since in the orthogonal directions the distributions are exactly the same and there is nothing to
discriminate.

4.1.3 Asymptotic Behavior
In this last set of tables, the detection is performed using k > 1 samples from the same source, as
described in section 4.1 (tables 4.16 to 4.18).

Table 4.16: Distinct Covariance Matrices and Distinct Means


JD KLD CHF FLDA
k=1 0.8881 0.8426 0.9027 0.5915
k=3 0.9931 0.9934 0.9971 0.6709
k=5 0.9999 0.9999 0.9998 0.7584
k=7 1.0000 1.0000 1.0000 0.8595
k=9 1.0000 1.0000 1.0000 0.8634

Table 4.17: Distinct Covariance Matrices and Equal Means


JD KLD CHF FLDA
k=1 0.9314 0.9314 0.8987 0.5044
k=3 1.0000 1.0000 0.9983 0.5465
k=5 1.0000 1.0000 0.9998 0.5512
k=7 1.0000 1.0000 1.0000 0.7413
k=9 1.0000 1.0000 1.0000 0.7114

Table 4.18: Equal Covariance Matrices and Distinct Means


JD KLD CHF FLDA
k=1 0.6235 0.6235 0.6235 0.6235
k=3 1.0000 1.0000 1.0000 1.0000
k=5 1.0000 1.0000 1.0000 1.0000

As would be expected, the performance of the detector increases as k increases. It is important
to note that the FLDA criterion never achieved a hit rate of 1, whereas in Tables 4.16 and 4.17 the other
criteria reach hit rates of essentially 1 (at least 0.9998) for k = 5 and of exactly 1.0000 for k = 7. Table 4.18
again shows the equivalence of the criteria under equal covariance matrices.
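The monotone improvement with k can be illustrated with a short Monte Carlo sketch. It assumes, purely for this illustration, that k is the number of i.i.d. one-dimensional signatures combined into a single log-likelihood-ratio decision, and it uses hypothetical Gaussian signature densities; the numbers are not those of the thesis experiments.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
f0 = norm(loc=0.0, scale=1.0)    # hypothetical signature density under source s0
f1 = norm(loc=1.0, scale=1.5)    # hypothetical signature density under source s1

def hit_rate(k, trials=20000):
    """Empirical probability of a correct decision when k i.i.d. signatures are combined."""
    y0 = f0.rvs(size=(trials, k), random_state=rng)   # trials generated by s0
    y1 = f1.rvs(size=(trials, k), random_state=rng)   # trials generated by s1
    llr = lambda y: (f0.logpdf(y) - f1.logpdf(y)).sum(axis=1)  # aggregated log-likelihood ratio
    return ((llr(y0) > 0).mean() + (llr(y1) < 0).mean()) / 2

for k in (1, 3, 5, 7, 9):
    print(k, round(hit_rate(k), 4))   # the hit rate approaches 1 as k grows
```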

4.2 ROC-Curves
The ROC curves give the probability of detection as a function of the probability of false alarm. In
Figures 4.1, 4.2 and 4.3 we present the results for the several criteria and for the three cases of the
parameters of the n-dimensional probability density functions. The n-dimensional samples were classified
by their 1-dimensional signatures, obtained by applying the linear discriminants designed under the several
criteria. The dimension n of the high-dimensional samples is 10.
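As a minimal sketch of how such curves can be traced from the one-dimensional signatures (illustrative code with assumed parameters and an arbitrary discriminant, not the thesis's experimental setup), one can sweep a threshold over the signatures and record the empirical (P_FA, P_D) pairs:

```python
import numpy as np

def roc_from_signatures(sig0, sig1, num_thresholds=200):
    """Empirical ROC: sig0 holds signatures from the null source s0, sig1 from source s1;
    source s1 is declared whenever the signature exceeds the threshold."""
    thresholds = np.linspace(min(sig0.min(), sig1.min()),
                             max(sig0.max(), sig1.max()), num_thresholds)
    pfa = np.array([(sig0 > t).mean() for t in thresholds])   # s0 wrongly declared as s1
    pd = np.array([(sig1 > t).mean() for t in thresholds])    # s1 correctly declared as s1
    return pfa, pd

# Hypothetical setup with n = 10 and an arbitrary (illustrative) discriminant q.
rng = np.random.default_rng(3)
n = 10
mu0, mu1 = np.zeros(n), 0.5 * np.ones(n)
S0 = np.eye(n)
M = rng.standard_normal((n, n))
S1 = 0.1 * M @ M.T + np.eye(n)
q = np.linalg.solve(S0 + S1, mu1 - mu0)            # one possible choice of discriminant
sig0 = rng.multivariate_normal(mu0, S0, 5000) @ q  # 1-D signatures under s0
sig1 = rng.multivariate_normal(mu1, S1, 5000) @ q  # 1-D signatures under s1
pfa, pd = roc_from_signatures(sig0, sig1)
```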

Figure 4.1: Distinct Means and Distinct Covariance Matrices. Legend: Magenta - Chernoff Distance,
Red - Fisher's Distance, Blue - Kullback-Leibler Divergence, Green - J-Divergence.

Figure 4.2: Equal Means and Distinct Covariance Matrices. Legend: Magenta - Chernoff Distance, Red
- Fisher's Distance, Blue - Kullback-Leibler Divergence, Green - J-Divergence.

Figure 4.3: Distinct Means and Equal Covariance Matrices. Legend: Magenta - Chernoff Distance, Red
- Fisher's Distance, Blue - Kullback-Leibler Divergence, Green - J-Divergence.

Once again, the J-Divergence, Chernoff Distance and Kullback-Leibler Divergence criteria outperform
Fisher's Distance criterion. Figure 4.3 again makes clear the equivalence of the criteria for the case of
equal covariance matrices. For the first two situations, Figures 4.1 and 4.2, where the covariance matrices
are different, it can be seen that below probabilities of false alarm of 0.1 the Kullback-Leibler Divergence
maximization criterion is the one that achieves the greatest probability of detection. This fact is an
implication of the asymmetric character of this criterion. Looking at the general definition of the
Kullback-Leibler Divergence,
$$
D_{KL}(f_0 \| f_1) = \int_{\mathbb{R}^p} f_0(y) \log \frac{f_0(y)}{f_1(y)}\, dy,
$$
this fact is easily interpreted: the criterion maximizes the expected value of the log-likelihood ratio
log(f0(y)/f1(y)) under f0, which is equivalent to maximizing the number of correct detections of samples
from s0, when such samples were indeed generated by s0.
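To make the asymmetry concrete, the short sketch below evaluates the closed-form Kullback-Leibler Divergence between two hypothetical univariate Gaussian signature densities in both directions; the two values differ, whereas their sum is the symmetric J-Divergence.

```python
import numpy as np

def kl_gauss_1d(m0, s0, m1, s1):
    """Closed-form D_KL( N(m0, s0^2) || N(m1, s1^2) )."""
    return np.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5

m0, s0 = 0.0, 1.0   # hypothetical signature density under s0
m1, s1 = 0.3, 2.0   # hypothetical signature density under s1
print(kl_gauss_1d(m0, s0, m1, s1))                                # D_KL(f0 || f1)
print(kl_gauss_1d(m1, s1, m0, s0))                                # D_KL(f1 || f0), a different value
print(kl_gauss_1d(m0, s0, m1, s1) + kl_gauss_1d(m1, s1, m0, s0))  # their sum, the J-Divergence
```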

Chapter 5

Conclusions and Future Work

In this thesis we proposed new criteria for designing linear discriminants for data dimensionality reduc-
tion prior to the application of a binary detector. We also developed algorithms to solve the non-convex
optimization problems corresponding to the design of these new linear discriminants. These algorithms
compute the linear discriminants that maximize the Chernoff Distance, the J-Divergence and Kullback-
Leibler Divergence between the probability density functions that characterize the low-dimensional sig-
natures of the original data.
The optimization problems that result from maximizing these measures of dissimilarity between the two
sources are non-convex. However, it was possible to solve them efficiently (with global optimality) through
reformulations and the use of duality theory for the case where the n-dimensional samples are mapped
to R. A suboptimal strategy was proposed for the case of mapping the samples to Rp , with p greater
than one.
The results presented in Chapter 4 show unequivocally that the new techniques outperform the Fisher's
Distance criterion. This is due to the fact that the new criteria can discriminate the probability density
functions through their covariances (see Figures 1.1 and 1.2). It is important to note that a Gaussian
probability density function is characterized by its first two moments, the mean and the covariance matrix;
a good discriminator should therefore use both to distinguish the sources. This is ensured by the Chernoff
Distance, J-Divergence and Kullback-Leibler Divergence criteria.

Future Work. In this thesis we focused on the Gaussian case. This framework models many practical
situations, but it is far from exhausting all practical applications. This observation leads immediately to
two generalizations. The first would be to consider the case where the two sources, instead of following
Gaussian distributions, follow Gaussian mixtures. The importance of this generalization is clear: any
regular probability density function can be well approximated by a mixture of Gaussians. The second
generalization would be to study the multiclass problem. The main obstacles to these generalizations are:
for the Gaussian mixture case, the computation of closed-form expressions for the Chernoff Distance,
J-Divergence and Kullback-Leibler Divergence, and the non-convexity of the design; for the multiclass
problem, there are no asymptotic expressions such as the ones available for the two-class situation
(Stein's Lemma), but it may be useful to optimize pairwise Chernoff Distances and J-Divergences between
the probability density functions that characterize the several classes.

Bibliography

[1] R.O. Duda, P.E. Hart, D.G. Stork. Pattern Classification (2nd ed.). Wiley-Interscience, 2000.

[2] G.J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley-Interscience, new ed., 2004.

[3] Jian Yang, Hui Ye, David Zhang. "A new LDA-KL combined method for feature extraction and its
generalisation." Pattern Analysis and Applications, vol. 7, no. 2, July 2004. ISSN 1433-7541.

[4] Yongping Li, Josef Kittler, Jiri Matas. "Effective Implementation of Linear Discriminant Analysis for
Face Recognition and Verification." Proceedings of the 8th International Conference on Computer
Analysis of Images and Patterns, pp. 234-242, September 1-3, 1999.

[5] T.M. Cover, J.A. Thomas. Elements of Information Theory. John Wiley and Sons, Inc., 1991.

[6] Luis Rueda, Myriam Herrera. "A New Linear Dimensionality Reduction Technique Based on Chernoff
Distance." IBERAMIA-SBIA 2006, pp. 299-308.

[7] Louis L. Scharf. Statistical Signal Processing: Detection, Estimation, and Time Series Analysis.
Addison-Wesley, 1991.

[8] B.T. Polyak. "Convexity of Quadratic Transformations and Its Use in Control and Optimization."
Journal of Optimization Theory and Applications, vol. 99, no. 3, pp. 553-583, December 1998.

[9] Lloyd L. Dines. "On the Mapping of Quadratic Forms." Bull. Amer. Math. Soc., vol. 47, no. 6,
pp. 494-498, 1941.

[10] Don H. Johnson, Sinan Sinanovic. "Symmetrizing the Kullback-Leibler Distance." Computer and
Information Technology Institute, Department of Electrical Engineering, Rice University, Houston, TX.

Appendix A

Quadratic Program with Quadratic Constraints: Strong Duality Result

A.1 Introduction to Strong Duality


All the optimization problems presented in Chapter 2 require the solution of a particular kind of
optimization problem, a quadratic program with two quadratic constraints,
$$
\begin{aligned}
\min\;& q^T A q\\
& q^T B q = b\\
& q^T C q = c\\
\text{var:}\;& q \in \mathbb{R}^n
\end{aligned}
\tag{A.1}
$$
where A is a symmetric matrix, B and C are positive definite symmetric matrices, and b and c are
positive scalars.
This optimization problem, in all its generality, is very hard to solve due to its non-convex quadratic
constraints. However, when the matrices involved in the quadratic forms have certain properties,
strong duality holds.
Strong duality is a very important tool in optimization theory: it makes it possible to solve a very hard
optimization problem through an easy one. The very hard problem is usually called the Primal Problem,
and the easy one is called the Dual Problem. Strong duality states that, under certain conditions, the
Primal and Dual problems give the same result, that is, their objective functions attain the same value
at their respective globally optimal points.
In optimization theory, the Primal Problem and the respective Dual Problem are:
In optimization theory, the Primal Problem and the respective Dual Problem are:

Primal Problem.
$$
\begin{aligned}
p^* = \min\;& f(x)\\
& x \in X\\
& h(x) = 0\\
& g(x) \le 0\\
\text{var:}\;& x \in \mathbb{R}^n
\end{aligned}
\tag{A.2}
$$

where f : R^n → R ∪ {+∞}, X ⊂ R^n, h : R^n → R^p with h = (h_1, h_2, ..., h_p), and
g : R^n → R^m with g = (g_1, g_2, ..., g_m).

Dual Problem.
$$
\begin{aligned}
d^* = \max\;& L(\lambda, \mu)\\
& \mu \ge 0\\
\text{var:}\;& (\lambda, \mu) \in \mathbb{R}^p \times \mathbb{R}^m
\end{aligned}
\tag{A.3}
$$

where
$$
L(\lambda, \mu) = \inf_{x \in X} L(x, \lambda, \mu),
\tag{A.4}
$$

with L(x, λ, µ) = f(x) + λ^T h(x) + µ^T g(x) the Lagrangian function.

Applying the definition of the Dual Problem to the quadratic program with quadratic constraints in
(A.1) yields
$$
\begin{aligned}
\max\;& -\lambda_1 b - \lambda_2 c\\
& A + \lambda_1 B + \lambda_2 C \succeq 0\\
\text{var:}\;& (\lambda_1, \lambda_2) \in \mathbb{R}^2
\end{aligned}
\tag{A.5}
$$
which is an SDP (semidefinite program) and therefore convex, in just two variables, independently of n,
which can be very large.
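As an illustration (not the implementation used in the thesis), the dual SDP in (A.5) can be solved with an off-the-shelf modelling tool such as CVXPY; the matrices below are random placeholders satisfying the stated assumptions on A, B, C, b and c.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)); A = (A + A.T) / 2            # symmetric
Mb = rng.standard_normal((n, n)); B = Mb @ Mb.T + np.eye(n)   # positive definite
Mc = rng.standard_normal((n, n)); C = Mc @ Mc.T + np.eye(n)   # positive definite
b, c = 1.0, 1.0                                               # positive scalars

lam = cp.Variable(2)
dual = cp.Problem(cp.Maximize(-lam[0] * b - lam[1] * c),
                  [A + lam[0] * B + lam[1] * C >> 0])         # the LMI constraint of (A.5)
dual.solve()
print(dual.value, lam.value)   # d* and the optimal multipliers (lambda_1, lambda_2)
```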

The algorithms presented in Chapter 2 also require the Bi-Dual Problem. This is simply the dual
problem of the first dual; since the first dual problem is convex, strong duality holds between them.

Bi-Dual Problem.
$$
\begin{aligned}
\min\;& \operatorname{tr}(AQ)\\
& \operatorname{tr}(BQ) = b\\
& \operatorname{tr}(CQ) = c\\
& Q \succeq 0
\end{aligned}
\tag{A.6}
$$

where Q is a symmetric positive semidefinite matrix of dimension n.


The Bi-Dual Problem in (A.6) is much harder to solve than the Dual Problem in (A.5), due to the
dimensionality of the optimization variable Q ∈ S^n_+.
Provided that strong duality holds for the problem in (A.1) and the Bi-Dual Problem has a single
solution, that solution Q is a rank-1 matrix, and its eigenvector associated with the single nonzero
eigenvalue (suitably scaled) is the solution of the Primal Problem in (A.1). This is how the solution
q ∈ R^n is retrieved.
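A minimal sketch of this retrieval step, assuming a numerical solution Q of the Bi-Dual Problem is available as a NumPy array: the eigenpair associated with the single nonzero eigenvalue gives q, up to sign.

```python
import numpy as np

def extract_primal(Q, tol=1e-6):
    """Recover q from a (numerically) rank-1 PSD matrix Q, so that Q ≈ q q^T."""
    eigval, eigvec = np.linalg.eigh(Q)           # eigenvalues in ascending order
    if eigval[-2] > tol * eigval[-1]:
        raise ValueError("Q is not numerically rank-1; the rank-1 retrieval does not apply.")
    return np.sqrt(eigval[-1]) * eigvec[:, -1]   # q, determined up to sign

# Usage on a synthetic rank-1 example:
q_true = np.array([1.0, -2.0, 0.5])
q_hat = extract_primal(np.outer(q_true, q_true))
print(np.allclose(np.outer(q_hat, q_hat), np.outer(q_true, q_true)))  # True (the sign ambiguity cancels)
```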

A.2 Strong Duality Result Demonstration


It is easy to show, as a consequence of the definition of the Dual Problem, that p* ≥ d*; so, in order to
show that strong duality holds, i.e. p* = d*, it is only needed to show that p* ≤ d*.
The following result, due to B. T. Polyak [8] and related to quadratic mappings, is of fundamental
importance for this demonstration:

Theorem. If there exist (µ_a, µ_b, µ_c) ∈ R^3 such that
$$
\mu_a A + \mu_b B + \mu_c C \succ 0,
\tag{A.7}
$$
then the set
$$
\{(x^T A x,\; x^T B x,\; x^T C x) : x \in \mathbb{R}^n\}
\tag{A.8}
$$
is a closed convex pointed cone, for n ≥ 3.

Let U = {(f, u, v) ∈ R^3 : (f, u, v) = (x^T A x, x^T B x, x^T C x), x ∈ R^n} and
V = {(f, u, v) ∈ R^3 : f < p*, u = b, v = c}.

The matrices B and C are positive definite, thus the conditions of the theorem in (A.7) are verified
with (µ_a, µ_b, µ_c) = (0, 1, 1) and therefore the set U is a closed convex pointed cone. The set V is simply
an open half-line and is therefore also convex.

The sets U and V are disjoint. Supposing they were not, there would exist x_0 ∈ R^n such that
(x_0^T A x_0, x_0^T B x_0, x_0^T C x_0), belonging to the set U, would also belong to the set V, verifying
x_0^T A x_0 < p*, x_0^T B x_0 = b and x_0^T C x_0 = c, in contradiction with the fact that
$$
\begin{aligned}
p^* = \min\;& x^T A x\\
& x^T B x = b\\
& x^T C x = c.
\end{aligned}
$$

Because the sets U and V are convex and disjoint, there is a hyperplane that separates them, i.e.,
$$
\exists\, s \in \mathbb{R}^3 \setminus \{0\}:\quad s^T z \ge r\ \ \forall\, z \in U \quad \wedge \quad s^T z \le r\ \ \forall\, z \in V.
\tag{A.9}
$$

Attending to the definitions of the sets U and V, and writing s = (s_f, s_u, s_v), the proposition in (A.9)
is equivalent to the following two:
$$
s_f\, x^T A x + s_u\, x^T B x + s_v\, x^T C x \ge r \quad \forall\, x \in \mathbb{R}^n
\tag{A.10}
$$
$$
s_f\, f + s_u\, b + s_v\, c \le r \quad \forall\, f < p^*.
\tag{A.11}
$$

The proposition in (A.11) implies that s_f ≥ 0. Otherwise, since f ∈ ]−∞, p*[, the product s_f f could be
made arbitrarily large by choosing f arbitrarily negative, and the proposition would not be satisfied.
The fact that s_f ≥ 0 leads to the strong duality result. This inequality is separated into two different
situations: the first is s_f > 0 and the second is s_f = 0. The first situation leads directly to the strong
duality result, and the second generates a contradiction, by implying the non-existence of the separating
hyperplane that was proven to exist.

First situation: s_f > 0. For this situation the propositions in (A.10) and (A.11) can be rewritten as
$$
x^T A x + \left(\frac{s_u}{s_f}\right) x^T B x + \left(\frac{s_v}{s_f}\right) x^T C x \ge \frac{r}{s_f} \quad \forall\, x \in \mathbb{R}^n
\tag{A.12}
$$
$$
f + \left(\frac{s_u}{s_f}\right) b + \left(\frac{s_v}{s_f}\right) c \le \frac{r}{s_f} \quad \forall\, f < p^*.
\tag{A.13}
$$
Defining s_u' = s_u/s_f and s_v' = s_v/s_f and chaining the two inequalities in (A.12) and (A.13) through r/s_f,
it follows that
$$
x^T A x + s_u'\, x^T B x + s_v'\, x^T C x \ge f + s_u'\, b + s_v'\, c \quad \forall\, x \in \mathbb{R}^n,\ f < p^* \;\Leftrightarrow
\tag{A.14}
$$
$$
x^T A x + s_u' (x^T B x - b) + s_v' (x^T C x - c) \ge f \quad \forall\, x \in \mathbb{R}^n,\ f < p^* \;\Leftrightarrow
\tag{A.15}
$$
$$
\inf_{x \in \mathbb{R}^n}\; x^T A x + s_u' (x^T B x - b) + s_v' (x^T C x - c) \ge f \quad \forall\, f < p^*.
\tag{A.16}
$$
The left-hand side of inequality (A.16) is the Dual function evaluated at the point (s_u', s_v'). It is known
that the Dual function is always less than or equal to p*; but, attending to the definition of the set V, f can
be made arbitrarily close to p*, so the Dual function at (s_u', s_v') is at least p*, and the strong duality
result p* = d* follows.

Second situation: s_f = 0. This situation will lead to s = 0, implying the non-existence of the separating
hyperplane for the sets U and V.
Since s_f = 0, the inequalities in the proposition (A.9) are equivalent to the following two:
$$
s_u\, x^T B x + s_v\, x^T C x \ge r \quad \forall\, x \in \mathbb{R}^n
\tag{A.17}
$$
$$
s_u\, b + s_v\, c \le r.
\tag{A.18}
$$

Combining the previous inequalities, it follows that
$$
s_u (x^T B x - b) + s_v (x^T C x - c) \ge 0 \quad \forall\, x \in \mathbb{R}^n.
\tag{A.19}
$$

Supposing the existence of x_{b+} and x_{b-} such that
$$
x_{b+}^T B x_{b+} - b > 0, \qquad
x_{b-}^T B x_{b-} - b < 0, \qquad
x_{b+}^T C x_{b+} - c = 0, \qquad
x_{b-}^T C x_{b-} - c = 0,
\tag{A.20}
$$
the following propositions are true:
$$
s_u \underbrace{(x_{b+}^T B x_{b+} - b)}_{>0} + s_v \underbrace{(x_{b+}^T C x_{b+} - c)}_{=0} \ge 0 \;\Rightarrow\; s_u \ge 0,
\qquad
s_u \underbrace{(x_{b-}^T B x_{b-} - b)}_{<0} + s_v \underbrace{(x_{b-}^T C x_{b-} - c)}_{=0} \ge 0 \;\Rightarrow\; s_u \le 0.
\tag{A.21}
$$

The previous propositions combined imply that s_u = 0; the same reasoning implies s_v = 0. The conclusion
is that assuming s_f = 0 leads to s_u = 0 and s_v = 0, and thus s = 0, implying the non-existence of the
separating hyperplane for the sets U and V, which was proven to exist.

Appendix B

Stiefel Matrix Constraint Invariance

Here it is shown that the linear discriminant Q ∈ R^{p×n} can be taken to be a Stiefel matrix. For
convenience of notation, in this section the linear discriminant Q is denoted by A. This is proven by
showing the result for the Kullback-Leibler Divergence, whose expression,
$$
D_{KL}(f_0 \| f_1)(A) = \frac{1}{2}\left[ \log\frac{|A \Sigma_1 A^T|}{|A \Sigma_0 A^T|} + \operatorname{tr}\!\left((A \Sigma_1 A^T)^{-1} (A \Sigma_0 A^T)\right) + (\mu_0 - \mu_1)^T A^T (A \Sigma_1 A^T)^{-1} A (\mu_0 - \mu_1) - p \right],
\tag{B.1}
$$
was previously presented in (1.14).
The statement that is going to be proved is
$$
D_{KL}(f_0 \| f_1)(A) = D_{KL}(f_0 \| f_1)(Q),
\tag{B.2}
$$
where Q ∈ R^{p×n} satisfies Q Q^T = I_p, i.e., Q is a Stiefel matrix.

Assuming linear independence between the rows of A, it is possible to write A = RQ. This is the RQ
factorization, where R ∈ R^{p×p} is an invertible upper triangular matrix and Q ∈ R^{p×n} has orthonormal
rows.
Inserting A = RQ in (B.1) yields
$$
D_{KL}(f_0 \| f_1)(RQ) = \frac{1}{2}\left[ \log\frac{|R Q \Sigma_1 Q^T R^T|}{|R Q \Sigma_0 Q^T R^T|} + \operatorname{tr}\!\left((R Q \Sigma_1 Q^T R^T)^{-1} (R Q \Sigma_0 Q^T R^T)\right) + (\mu_0 - \mu_1)^T Q^T R^T (R Q \Sigma_1 Q^T R^T)^{-1} R Q (\mu_0 - \mu_1) - p \right].
\tag{B.3}
$$

Attending to the fact that R is invertible, it is a matter of using the algebraic properties of the determinant
and the trace of a matrix to see that R cancels out in (B.3), leading to (B.1) evaluated at Q, which
establishes (B.2).
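The invariance can also be checked numerically. The sketch below (illustrative, with random parameters) computes (B.1) for a random full-row-rank A and for the orthonormal factor Q of its RQ factorization obtained with scipy.linalg.rq; the two values coincide, as stated in (B.2).

```python
import numpy as np
from scipy.linalg import rq

def dkl_gaussian(A, mu0, mu1, S0, S1):
    """Kullback-Leibler Divergence between N(A mu0, A S0 A^T) and N(A mu1, A S1 A^T), as in (B.1)."""
    p = A.shape[0]
    S0p, S1p = A @ S0 @ A.T, A @ S1 @ A.T
    d = A @ (mu0 - mu1)
    return 0.5 * (np.log(np.linalg.det(S1p) / np.linalg.det(S0p))
                  + np.trace(np.linalg.solve(S1p, S0p))
                  + d @ np.linalg.solve(S1p, d) - p)

rng = np.random.default_rng(0)
n, p = 10, 3
A = rng.standard_normal((p, n))              # arbitrary discriminant with linearly independent rows
R, Q = rq(A, mode='economic')                # A = R Q, with Q having orthonormal rows (Stiefel)
mu0, mu1 = rng.standard_normal(n), rng.standard_normal(n)
M0 = rng.standard_normal((n, n)); S0 = M0 @ M0.T + np.eye(n)   # positive definite covariances
M1 = rng.standard_normal((n, n)); S1 = M1 @ M1.T + np.eye(n)
print(dkl_gaussian(A, mu0, mu1, S0, S1))
print(dkl_gaussian(Q, mu0, mu1, S0, S1))     # same value, illustrating (B.2)
```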

Appendix C

Set Properties

In this section we show that the set C, defined in (2.23) as C = {(q^T Σ_0 q, q^T M q) : q^T Σ_0 q q^T Σ_1 q = 1},
is compact and connected.
Defining A = {q ∈ R^n : q^T Σ_0 q q^T Σ_1 q = 1} and
$$
F : \mathbb{R}^n \to \mathbb{R}^2, \qquad q \mapsto (q^T \Sigma_0 q,\; q^T M q),
\tag{C.1}
$$
it follows that C = F(A), where F is a continuous map. Thus the compactness and connectedness of the
set C follow from the compactness and connectedness of A, which is now shown.
Defining B = {q ∈ R^n : ||q|| = 1} and
$$
\Phi : \mathbb{R}^n \to \mathbb{R}^n, \qquad q \mapsto \frac{q}{\sqrt[4]{q^T \Sigma_0 q\; q^T \Sigma_1 q}},
\tag{C.2}
$$
we see that A = Φ(B), where Φ is a continuous function over B, and B is the compact and connected unit
sphere, implying therefore that A is compact and connected.
In order to show A = Φ(B), we prove Φ(B) ⊂ A and A ⊂ Φ(B).

Case Φ(B) ⊂ A. Given q ∈ B, it is a matter of algebra to verify that Φ(q) satisfies
Φ(q)^T Σ_0 Φ(q) Φ(q)^T Σ_1 Φ(q) = 1, and thus belongs to the set A.

Case A ⊂ Φ(B). Given q' ∈ A, and therefore verifying q'^T Σ_0 q' q'^T Σ_1 q' = 1, the point q = q'/||q'|| ∈ B
is such that Φ(q) = q'. So for every q' ∈ A there is a point q ∈ B such that Φ(q) = q', implying therefore
A ⊂ Φ(B).

Appendix D

Criteria Equivalence

In this section it is shown that when F0 = N (µ0 , Σ0 ) and F1 = N (µ1 , Σ1 ) have equal covariance matrices
(Σ0 = Σ1 ) the Kullback-Leibler Divergence, J-Divergence and Chernoff Distance criteria are all equiva-
lent to Fisher’s Distance. This result is illustrated just for the Kullback-Leibler Divergence. The same line
of reasoning shows this result is valid for the other criteria.

Demonstration. The Kullback-Leibler Divergence D_{KL}(f_0||f_1)(Q) between the probability density
functions f_0(Q) = N(Qµ_0, QΣ_0Q^T) and f_1(Q) = N(Qµ_1, QΣ_1Q^T) is
$$
D_{KL}(f_0 \| f_1)(Q) = \frac{1}{2}\left[ \log\frac{|Q \Sigma_1 Q^T|}{|Q \Sigma_0 Q^T|} + \operatorname{tr}\!\left((Q \Sigma_1 Q^T)^{-1} (Q \Sigma_0 Q^T)\right) + (\mu_0 - \mu_1)^T Q^T (Q \Sigma_1 Q^T)^{-1} Q (\mu_0 - \mu_1) - p \right].
\tag{D.1}
$$
Inserting Σ_0 = Σ_1 = Σ in (D.1), the log-determinant term vanishes and the trace term equals p, canceling
the −p, which yields
$$
D_{KL}(f_0 \| f_1)(Q) = \frac{1}{2} (\mu_0 - \mu_1)^T Q^T (Q \Sigma Q^T)^{-1} Q (\mu_0 - \mu_1).
\tag{D.2}
$$

Equation (D.2) coincides with Fisher's Distance presented in (1.5), whose expression is reproduced
here for Σ_0 = Σ_1 = Σ:
$$
FD(f_0 \| f_1)(Q)
= \frac{1}{2} \operatorname{tr}\!\left\{ (Q \Sigma Q^T)^{-1} \, Q (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T Q^T \right\}
= \frac{1}{2} \operatorname{tr}\!\left\{ (\mu_0 - \mu_1)^T Q^T (Q \Sigma Q^T)^{-1} Q (\mu_0 - \mu_1) \right\}
= \frac{1}{2} (\mu_0 - \mu_1)^T Q^T (Q \Sigma Q^T)^{-1} Q (\mu_0 - \mu_1).
\tag{D.3}
$$
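A quick numerical sanity check of this equivalence (illustrative only; the helpers below simply re-implement (D.1) and (D.3)) can be written as follows.

```python
import numpy as np

def dkl(Q, mu0, mu1, S0, S1):
    """Kullback-Leibler Divergence of the low-dimensional signatures, as in (D.1)."""
    p = Q.shape[0]
    S0p, S1p = Q @ S0 @ Q.T, Q @ S1 @ Q.T
    d = Q @ (mu0 - mu1)
    return 0.5 * (np.log(np.linalg.det(S1p) / np.linalg.det(S0p))
                  + np.trace(np.linalg.solve(S1p, S0p))
                  + d @ np.linalg.solve(S1p, d) - p)

def fisher(Q, mu0, mu1, S):
    """Fisher's Distance of the signatures for a common covariance S, as in (D.3)."""
    d = Q @ (mu0 - mu1)
    return 0.5 * d @ np.linalg.solve(Q @ S @ Q.T, d)

rng = np.random.default_rng(4)
n, p = 8, 2
Q = rng.standard_normal((p, n))
mu0, mu1 = rng.standard_normal(n), rng.standard_normal(n)
M = rng.standard_normal((n, n)); Sigma = M @ M.T + np.eye(n)   # common covariance matrix
print(np.isclose(dkl(Q, mu0, mu1, Sigma, Sigma), fisher(Q, mu0, mu1, Sigma)))  # True
```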

