
Metric Distances Between Probability Distributions of Different Sizes

M. Vidyasagar
Cecil & Ida Green Chair
The University of Texas at Dallas
M.Vidyasagar@utdallas.edu
www.utdallas.edu/~m.vidyasagar

Johns Hopkins University, 20 October 2011


Outline

1 Motivation
    Problem Formulation
    Source of Difficulty
2 Information Theory Background
3 A Metric Distance Between Distributions
    Definition of the Metric Distance
    Computing the Distance
4 Optimal Order Reduction
5 Concluding Remarks


Motivation

Originally: Given a hidden Markov process with a very large state
space, how can one approximate it ‘optimally’ with another HMP
with a much smaller state space?

Analogy: analog control ← digital control; signals in R^d ← signals
over finite sets.

Applications: control over networks, data compression, reduced-order
noise modeling, etc.

If u, y are the input and output, view {u_t, y_t} as a stochastic
process over some finite set U × Y; then ‘reduced order modeling’
means approximating {u_t, y_t} by another stochastic process over a
smaller-cardinality set U′ × Y′.


General Problem: Simplified Modeling of Stochastic Processes

Suppose {X_t} is a stochastic process assuming values in a ‘large’ but
finite set A with n elements; we wish to approximate it by another
process {Y_t} assuming values in a ‘small’ finite set B with m < n
elements.

Questions:
- How do we define the ‘distance’ (think ‘modeling error’) between
  the two processes {X_t} and {Y_t}?
- Given {X_t}, how do we find the ‘best possible’ reduced-order
  approximation to {X_t} in the chosen distance?


Scope of Today’s Talk

Scope of this talk: i.i.d. (independent, identically distributed)
processes.

An i.i.d. process {X_t} is completely described by its
‘one-dimensional marginal,’ i.e., the distribution of X_1 (or of any X_t).

Questions:
- Given two probability distributions φ, ψ on finite sets A, B, how
  can we define a ‘distance’ between them?
- Given a distribution φ with n components and an integer m < n, how
  can we find the ‘best possible’ m-dimensional approximation to φ?

Full paper at: http://arxiv.org/pdf/1104.4521v2


Total Variation Metric


Suppose A = {a_1, . . . , a_n} is a finite set, and φ, ψ are probability
distributions on A. Then the total variation metric is

ρ(φ, ψ) = (1/2) ∑_{i=1}^n |φ_i − ψ_i| = ∑_{i=1}^n {φ_i − ψ_i}_+ = − ∑_{i=1}^n {φ_i − ψ_i}_− ,

where {x}_+ = max{x, 0}, {x}_− = min{x, 0}.
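As a concrete illustration (ours, not from the talk), ρ can be computed
directly from the definition; here is a minimal Python sketch, with the
function name total_variation chosen by us:

```python
# Minimal sketch of the total variation metric on a common finite set.
# phi and psi are sequences of probabilities of the same length.
def total_variation(phi, psi):
    assert len(phi) == len(psi)
    return 0.5 * sum(abs(p - q) for p, q in zip(phi, psi))
```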


ρ is permutation-invariant if the same permutation is applied to
the components of φ, ψ (rearranging the elements of A). So if π is
a permutation of the elements of A, then

ρ(π(φ), π(ψ)) = ρ(φ, ψ).

But what if φ, ψ are probability measures on different sets?



Permutation Invariance Example

Example:

A = {H, T }, B = {M, W }, φ = [0.55 0.45], ψ = [0.48 0.52].

What is the distance between φ and ψ?

If we identify H ↔ M, T ↔ W, then ρ(φ, ψ) = 0.07.
But if we identify H ↔ W, T ↔ M, then ρ(φ, ψ) = 0.03.
Which one is more ‘natural’? Answer: There is no natural
association!
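Using the total_variation sketch from the previous slide, the two
identifications simply amount to comparing ψ as given or with its
components permuted:

```python
phi = [0.55, 0.45]
psi = [0.48, 0.52]
print(total_variation(phi, psi))        # H <-> M, T <-> W: ~0.07
print(total_variation(phi, psi[::-1]))  # H <-> W, T <-> M: ~0.03
```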


Permutation Invariance: An Inherent Feature

Suppose φ, ψ are probability distributions on distinct sets A, B, and
we wish to ‘compare’ them. Suppose d(φ, ψ) is the distance (yet to be
defined).

Claim: Suppose π is a permutation on A, ξ is a permutation on B.
Then we must have

d(φ, ψ) = d(π(φ), ξ(ψ)).

Ergo: Any definition of the distance must be invariant under possibly
different permutations of A, B.

In particular, if A, B are distinct sets, even if |A| = |B|, our
distance cannot reduce to ρ.


Notation

A = {1, . . . , n}¹, B = {1, . . . , m}; φ is a probability distribution
on A, ψ is a probability distribution on B; X is a random variable on
A with distribution φ, Y is a r.v. on B with distribution ψ.

S_n := {v ∈ R^n_+ : ∑_{i=1}^n v_i = 1}.

So φ ∈ S_n, ψ ∈ S_m. S_{n×m} denotes the set of n × m ‘stochastic
matrices’, i.e., matrices whose rows add up to one:

S_{n×m} = {P ∈ [0, 1]^{n×m} : P e_m = e_n},

where e_n is the column vector of n ones.

¹ We don’t write A = {a_1, . . . , a_n}, but that is what we mean.

Entropy

Suppose φ ∈ S_n. Then the (Shannon) entropy of φ is

H(φ) = − ∑_{i=1}^n φ_i log φ_i .

We can also call this H(X) (i.e., associate entropy with a r.v. or
its distribution).
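A minimal sketch of this computation (ours, not from the paper), using
the natural logarithm and the convention 0 log 0 = 0:

```python
import math

# Shannon entropy of a probability vector (natural log; use math.log2 for bits).
def entropy(phi):
    return -sum(p * math.log(p) for p in phi if p > 0)
```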


Mutual Information

Suppose X, Y are r.v.s on A, B and let θ denote their joint
distribution. Then θ ∈ S_{nm}, with marginal distributions θ_A = φ and
θ_B = ψ.

I(X, Y) := H(X) + H(Y) − H(X, Y)

is called the mutual information between X and Y.

Alternate formula:

I(X, Y) = ∑_{i∈A} ∑_{j∈B} θ_ij log ( θ_ij / (φ_i ψ_j) ).

Note that I(X, Y ) = I(Y, X).
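Assuming the entropy sketch above, mutual information can be computed
directly from the defining identity, with the joint distribution θ
given as an n × m matrix (an illustrative sketch only):

```python
# I(X;Y) = H(X) + H(Y) - H(X,Y), computed from the joint distribution theta.
def mutual_information(theta):
    phi = [sum(row) for row in theta]            # marginal on A (row sums)
    psi = [sum(col) for col in zip(*theta)]      # marginal on B (column sums)
    joint = [t for row in theta for t in row]    # theta flattened into a vector
    return entropy(phi) + entropy(psi) - entropy(joint)
```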


Conditional Entropy

The quantity
H(X|Y ) = H(X, Y ) − H(Y )
is called the conditional entropy of X given Y . Note that
H(X|Y) ≠ H(Y|X) in general. In fact

H(Y |X) = H(X|Y ) + H(Y ) − H(X).

If θ is the joint distribution of X, Y , then

H(X|Y ) = H(θ) − H(ψ), H(Y |X) = H(θ) − H(φ).


Conditional Entropy: Alternate Formulation

Define

Θ = [θ_ij] = [Pr{X = i & Y = j}] ∈ [0, 1]^{n×m},

and define P = [Diag(φ)]^{−1} Θ. Clearly

p_ij = θ_ij / φ_i = Pr{Y = j | X = i}.

So P ∈ S_{n×m} and

H(Y|X) = ∑_{i∈A} φ_i H(p_i) =: J_φ(P),

where p_i is the i-th row of P.
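Continuing the sketches above, J_φ(P) is simply a φ-weighted average of
the row entropies of P:

```python
# H(Y|X) = sum_i phi_i * H(p_i), where p_i is the i-th row of the
# conditional-probability matrix P (each row sums to one).
def conditional_entropy(phi, P):
    return sum(f * entropy(row) for f, row in zip(phi, P))
```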



Maximizing Mutual Information (MMI)


Given φ ∈ S_n, ψ ∈ S_m, we look for a ‘joint’ distribution θ ∈ S_{nm}
such that θ_A = φ, θ_B = ψ, and in addition the entropy H(θ) is
minimum; equivalently, the conditional entropy of Y given X is minimum
(as is that of X given Y); or again equivalently, the mutual
information between X and Y is maximum. In other words, define

W(φ, ψ) := min H(θ)     s.t. θ_A = φ, θ_B = ψ
         = min H(X, Y)  s.t. θ_A = φ, θ_B = ψ,

V(φ, ψ) := W(φ, ψ) − H(φ)
         = min H(Y|X)   s.t. θ_A = φ, θ_B = ψ.

We try to make Y ‘as deterministic as possible’ given X.
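For the smallest nontrivial case n = m = 2 the joint distribution has a
single free parameter, so V(φ, ψ) can be found by brute force. The
sketch below is ours (it reuses the entropy function defined earlier):
it grid-searches that parameter, and since the objective is concave the
minimum sits at an endpoint of the feasible interval.

```python
# Brute-force V(phi, psi) for n = m = 2: parameterize theta by t = theta_11.
def V_2x2(phi, psi, steps=10000):
    lo = max(0.0, phi[0] + psi[0] - 1.0)   # feasibility: all entries >= 0
    hi = min(phi[0], psi[0])
    best = float("inf")
    for k in range(steps + 1):
        t = lo + (hi - lo) * k / steps
        theta = [t, phi[0] - t, psi[0] - t, phi[1] - psi[0] + t]
        best = min(best, entropy(theta) - entropy(phi))   # H(Y|X) = H(theta) - H(phi)
    return best

print(V_2x2([0.55, 0.45], [0.48, 0.52]))   # > 0: psi is not a permutation of phi
```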



The Variation of Information Metric


The quantity
d(φ, ψ) := V (φ, ψ) + V (ψ, φ)
is called the variation of information metric. Since

V (ψ, φ) = V (φ, ψ) + H(φ) − H(ψ),

we need to compute only one of V (ψ, φ), V (φ, ψ).


The quantity d satisfies:
- d(φ, φ) = 0 and d(φ, ψ) ≥ 0 for all φ, ψ.
- d(φ, ψ) = d(ψ, φ).
- The triangle inequality: d(φ, ξ) ≤ d(φ, ψ) + d(ψ, ξ).
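Using the identity above, d needs only one evaluation of V. A sketch
for the 2 × 2 case, reusing the V_2x2 and entropy sketches from earlier:

```python
# d(phi, psi) = V(phi, psi) + V(psi, phi) = 2 V(phi, psi) + H(phi) - H(psi).
def d_2x2(phi, psi):
    return 2 * V_2x2(phi, psi) + entropy(phi) - entropy(psi)
```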



Proof of Triangle Inequality


Suppose X, Y, Z are r.v.s on finite sets A, B, C. Then the
conditional entropy satisfies a ‘one-sided’ triangle inequality:

H(X|Y) ≤ H(X|Z) + H(Z|Y).

Proof:

H(X|Y) ≤ H(X, Z|Y) = H(Z|Y) + H(X|Z, Y) ≤ H(Z|Y) + H(X|Z).

So if we define

v(X, Y) = H(X|Y) + H(Y|X),

then v satisfies the triangle inequality. Note that

d(φ, ψ) = min v(X, Y),

where the minimum is over joint distributions of (X, Y) with X, Y
having distributions φ, ψ respectively.



A Key Property

Actually d is a pseudometric, not a metric. In other words, d(φ, ψ)
can be zero even if φ ≠ ψ.

Theorem: d(φ, ψ) = 0 if and only if n = m and φ, ψ are permutations
of each other.

Consequence: The (pseudo)metric d is not convex!

Example: Let n = m = 2,

φ = [0.75 0.25], ψ = [0.25 0.75], ξ = 0.5φ + 0.5ψ = [0.5 0.5].

Then d(φ, φ) = d(φ, ψ) = 0 but d(φ, ξ) > 0.


Computing the Metric Distance

Change the variable of optimization from θ, the joint distribution, to
P, the matrix of conditional probabilities:

V(φ, ψ) = min_{θ ∈ S_{nm}}  H(θ) − H(φ)                  s.t. θ_A = φ, θ_B = ψ
        = min_{P ∈ S_{n×m}} J_φ(P) := ∑_{i∈A} φ_i H(p_i)   s.t. φP = ψ.

Since J_φ(·) is a strictly concave function and the feasible region is
polyhedral (the convex hull of a finite number of extreme points), the
minimum occurs at one of these extreme points.

Also, a ‘principle of optimality’ allows us to break large problems
down into smaller ones.


The m = 2 Case

Partition the set {1, . . . , n} into two sets I_1, I_2 such that

|ψ_1 − ∑_{i∈I_1} φ_i| = |ψ_2 − ∑_{i∈I_2} φ_i|

is minimized.

Interpretation: Given two bins with capacities ψ_1, ψ_2, assign
φ_1, . . . , φ_n to the two bins so that the overflow in one bin (and
the under-utilization in the other bin) is minimized.
Theorem: If both bins are filled exactly, then V (φ, ψ) = 0.
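For small n the optimal two-bin partition can be found by exhaustive
search. The sketch below is our own illustration (exponential in n, so
only for toy examples); it returns the minimal overflow and an
assignment achieving it:

```python
from itertools import product

# Exhaustive search over all assignments of phi_1..phi_n to two bins:
# minimize |psi_1 - (mass placed in bin 1)| (= |psi_2 - mass in bin 2|).
def best_two_bin_partition(phi, psi1):
    best_gap, best_assign = float("inf"), None
    for assign in product((1, 2), repeat=len(phi)):
        load1 = sum(p for p, a in zip(phi, assign) if a == 1)
        gap = abs(psi1 - load1)
        if gap < best_gap:
            best_gap, best_assign = gap, assign
    return best_gap, best_assign
```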


The m = 2 Case (Continued)

Theorem: If the two bins cannot both be filled exactly, and if the
smallest element of φ (call it φ_0) belongs to the overstuffed bin,
then

V(φ, ψ) = φ_0 H([u/φ_0, (φ_0 − u)/φ_0]),

where u is the unutilized capacity (equivalently, the overflow).

If φ_0 belongs to the underutilized bin, the above is an upper bound
for V(φ, ψ) but may not equal V(φ, ψ).

Bad news: Computing the optimal partitioning is NP-hard!
So we need an approximation procedure to upper bound d(φ, ψ).
No special reason to restrict to m = 2.


Best Fit Algorithm for Bin-Packing

Think of ψ_1, . . . , ψ_m as the capacities of m bins.

- Arrange the ψ_j in descending order.
- For i = 1, . . . , n, place each φ_i into the bin with maximum
  unutilized capacity.
- If a bin overflows, don't fill it any more.

Complexity is O(n log m).

Provable suboptimality bound: the total bin size is ≤ 1.25 times the
optimal bin size (the best known bound). The corresponding bound on
the overflow is not as good: it is ≤ 0.25 + 1.25 × the optimal value.
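A sketch of the best fit heuristic as described above (our rendering;
tie-breaking, and what to do with an item once every bin has
overflowed, are implementation choices not fixed by the slide):

```python
# Best-fit heuristic: each phi_i goes into the bin with the largest remaining
# (unutilized) capacity; a bin that has already overflowed receives nothing more.
def best_fit(phi, psi):
    remaining = sorted(psi, reverse=True)       # capacities, in descending order
    bins = [[] for _ in psi]                    # indices of phi placed in each bin
    for i, p in enumerate(phi):
        open_bins = [j for j in range(len(psi)) if remaining[j] > 0]
        j = max(open_bins, key=lambda k: remaining[k]) if open_bins else 0
        bins[j].append(i)
        remaining[j] -= p
    return bins, remaining                      # negative remaining = overflow
```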


Illustration of Best Fit Algorithm


Suppose ψ = [0.45 0.30 0.25], and φ ∈ S_14 is given by

φ = 10^{−2} · [ 14 13 12 9 8 7 6 6 5 5 4 4 4 3 ]

[Figure: the masses φ_1, . . . , φ_14 stacked into the three bins of
capacities 0.45, 0.30 and 0.25 according to the best fit rule.]
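Running the best_fit sketch from the previous slide on this example
packs the fourteen masses into the three bins (tie-breaking may differ
from the original figure, so the exact assignment shown in the talk is
not asserted here):

```python
psi = [0.45, 0.30, 0.25]
phi = [0.14, 0.13, 0.12, 0.09, 0.08, 0.07, 0.06, 0.06,
       0.05, 0.05, 0.04, 0.04, 0.04, 0.03]
bins, remaining = best_fit(phi, psi)
for j, members in enumerate(bins):
    load = sum(phi[i] for i in members)
    print(f"bin {j+1} (capacity {psi[j]}):", [i + 1 for i in members], round(load, 2))
```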


A Greedy Algorithm for Bounding d(φ, ψ)

Think of ψ_1, . . . , ψ_m as m bins to be packed. Modify the ‘best fit’
algorithm: sort ψ in decreasing order, and put each φ_i into the bin
with the largest unutilized capacity. If φ_i does not fit into any
bin, put it aside (the departure from the best fit algorithm).

When all of φ has been processed, say k entries have been put aside;
then k < m. Now solve a k-by-m bin-packing problem, and repeat. See
the full paper for details.

This results in a sequence of bin-packing problems of decreasing size.
The outcome is an upper bound for d(φ, ψ). Complexity is
O((n + m²) log m).


Problem Formulation

Problem: Given φ ∈ S_n and m < n, find ψ ∈ S_m that minimizes
d(φ, ψ).

Definition: ψ ∈ S_m is said to be an aggregation of φ ∈ S_n if
there exists a partition I_1, . . . , I_m of {1, . . . , n} such that

ψ_j = ∑_{i∈I_j} φ_i .


Results

Theorem: Any optimal approximation of φ in the variation of
information metric d must be an aggregation of φ.

Theorem: An aggregation φ^(a) of φ is an optimal approximation of φ
in the variation of information metric d if and only if it has
maximum entropy (amongst all aggregations).

Note: u_m, the uniform distribution with m elements, has maximum
entropy in S_m. So should we try to minimize ρ(φ^(a), u_m)? This is
yet another bin-packing problem, with all bin capacities equal to
1/m. But how valid is this approach?


NP-Hardness of Optimal Order Reduction

Suppose m = 2 and let φ^(a) be an aggregation of φ ∈ S_n. Then
φ^(a) has maximum entropy if and only if the total variation distance
ρ(φ^(a), u_2) is minimized.

Therefore:
- When m = 2, minimizing ρ(φ^(a), u_2) gives an optimal reduced-order
  approximation to φ.
- For m ≥ 3, minimizing ρ(φ^(a), u_m) gives a suboptimal reduced-order
  approximation to φ.


Reformulation of Problem

Problem: Given φ ∈ S_n and m < n, find an aggregation φ^(a) ∈ S_m of φ
with maximum entropy.

Reformulation: Since the uniform distribution u_m has maximum entropy
in S_m, find an aggregation φ^(a) such that the total variation
distance ρ(φ^(a), u_m) is minimized.

More general problem (not any harder): Given φ ∈ S_n, m < n and
ξ ∈ S_m, find an aggregation φ^(a) minimizing the total variation
distance ρ(φ^(a), ξ).

Use the best-fit algorithm to find an aggregation of φ that is close
to the uniform distribution. Complexity is O(n log m).
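A sketch of this recipe (ours), reusing the best_fit and
total_variation sketches from earlier, and the 14-component φ from the
packing illustration:

```python
# Aggregate phi into m bins of capacity 1/m via best fit, then compare the
# resulting reduced distribution with the uniform distribution u_m.
def aggregate_best_fit(phi, m):
    bins, _ = best_fit(phi, [1.0 / m] * m)
    return [sum(phi[i] for i in members) for members in bins]

phi_a = aggregate_best_fit(phi, 3)               # phi from the earlier example
print(phi_a, total_variation(phi_a, [1.0 / 3] * 3))
```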


Bound on Performance of Best Fit Algorithm

Theorem: Let φ^(a) denote the aggregation of φ produced by the best
fit algorithm. Then

ρ(φ^(a), ξ) ≤ 0.25 m φ_max ,

where φ_max is the largest component of φ and ρ is the total variation
metric.

This can be turned into a (messy) bound on the entropy of φ^(a).


Achievements

- Definition of a proper metric between probability distributions
  defined on sets of different cardinalities, obtained by maximizing
  the mutual information between the associated random variables.
- Study of the properties of the distance and of the problem of
  computing it, showing the close relationship to the bin-packing
  problem with overstuffing.
- Since bin-packing is NP-hard, adaptation of the best-fit algorithm
  to generate upper bounds for the distance in polynomial time.
- Characterization of the solution of the optimal order reduction
  problem in terms of aggregating to maximize entropy; formulation as
  a bin-packing problem with overstuffing; upper bound on the
  performance of the algorithm.


Next Steps

- Extension of the metric to Markov processes, hidden Markov
  processes, and arbitrary (but stationary and ergodic) stochastic
  processes.
- Alternatives to the best-fit heuristic algorithm.

Thank You!
