
Metric Distances Between Probability Distributions of Different Sizes

M. Vidyasagar
Cecil & Ida Green Chair
The University of Texas at Dallas
M.Vidyasagar@utdallas.edu
www.utdallas.edu/~m.vidyasagar

Johns Hopkins University, 20 October 2011


Outline

1 Motivation
    Problem Formulation
    Source of Difficulty
2 Information Theory Background
3 A Metric Distance Between Distributions
    Definition of the Metric Distance
    Computing the Distance
4 Optimal Order Reduction
5 Concluding Remarks


Motivation

Originally: Given a hidden Markov process with a very large state
space, how can one approximate it ‘optimally’ with another HMP
with a much smaller state space?

Analogy: analog control ← digital control; signals in R^d ← signals
over finite sets.

Applications: control over networks, data compression, reduced-order
noise modeling, etc.

If u, y are the input and output, view {u_t, y_t} as a stochastic
process over some finite set U × Y; then ‘reduced order modeling’
means approximating {u_t, y_t} by another stochastic process over a
smaller-cardinality set U′ × Y′.


General Problem: Simplified Modeling of Stochastic Processes

Suppose {X_t} is a stochastic process assuming values in a ‘large’ but
finite set A with n elements; we wish to approximate it by another
process {Y_t} assuming values in a ‘small’ finite set B with m < n
elements.

Questions:
- How do we define the ‘distance’ (think ‘modeling error’) between
  the two processes {X_t} and {Y_t}?
- Given {X_t}, how do we find the ‘best possible’ reduced-order
  approximation to {X_t} in the chosen distance?


Scope of Today’s Talk

Scope of this talk: i.i.d. (independent, identically distributed)
processes.

An i.i.d. process {X_t} is completely described by its
‘one-dimensional marginal,’ i.e., the distribution of X_1 (or of any X_t).

Questions:
- Given two probability distributions φ, ψ on finite sets A, B, how
  can we define a ‘distance’ between them?
- Given a distribution φ with n components and an integer m < n, how
  can we find the ‘best possible’ m-dimensional approximation to φ?

Full paper at: http://arxiv.org/pdf/1104.4521v2


Total Variation Metric


Suppose A = {a_1, . . . , a_n} is a finite set, and φ, ψ are probability
distributions on A. Then the total variation metric is

ρ(φ, ψ) = (1/2) ∑_{i=1}^n |φ_i − ψ_i| = ∑_{i=1}^n {φ_i − ψ_i}_+ = − ∑_{i=1}^n {φ_i − ψ_i}_− ,

where {x}_+ = max{x, 0}, {x}_− = min{x, 0}.
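As a concrete illustration (ours, not from the talk), ρ can be computed
directly from the definition; here is a minimal Python sketch, with the
function name total_variation chosen by us:

```python
# Minimal sketch of the total variation metric on a common finite set.
# phi and psi are sequences of probabilities of the same length.
def total_variation(phi, psi):
    assert len(phi) == len(psi)
    return 0.5 * sum(abs(p - q) for p, q in zip(phi, psi))
```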


ρ is permutation-invariant if the same permutation is applied to
the components of φ, ψ (rearranging the elements of A). So if π is
a permutation of the elements of A, then

ρ(π(φ), π(ψ)) = ρ(φ, ψ).

But what if φ, ψ are probability measures on different sets?



Permutation Invariance Example

Example:

A = {H, T }, B = {M, W }, φ = [0.55 0.45], ψ = [0.48 0.52].

What is the distance between φ and ψ?

If we identify H ↔ M, T ↔ W, then ρ(φ, ψ) = 0.07.
But if we identify H ↔ W, T ↔ M, then ρ(φ, ψ) = 0.03.
Which one is more ‘natural’? Answer: There is no natural
association!
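Using the total_variation sketch from the previous slide, the two
identifications simply amount to comparing ψ as given or with its
components permuted:

```python
phi = [0.55, 0.45]
psi = [0.48, 0.52]
print(total_variation(phi, psi))        # H <-> M, T <-> W: ~0.07
print(total_variation(phi, psi[::-1]))  # H <-> W, T <-> M: ~0.03
```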


Permutation Invariance: An Inherent Feature

Suppose φ, ψ are probability distributions on distinct sets A, B, and
we wish to ‘compare’ them. Suppose d(φ, ψ) is the distance (yet to be
defined).

Claim: Suppose π is a permutation on A, ξ is a permutation on B.
Then we must have

d(φ, ψ) = d(π(φ), ξ(ψ)).

Ergo: Any definition of the distance must be invariant under possibly
different permutations of A, B.

In particular, if A, B are distinct sets, even if |A| = |B|, our
distance cannot reduce to ρ.


Notation

A = {1, . . . , n}¹, B = {1, . . . , m}; φ is a probability distribution
on A, ψ is a probability distribution on B; X is a random variable on
A with distribution φ, Y is a r.v. on B with distribution ψ.

S_n := {v ∈ R^n_+ : ∑_{i=1}^n v_i = 1}.

So φ ∈ S_n, ψ ∈ S_m. S_{n×m} denotes the set of n × m ‘stochastic
matrices’, i.e., matrices whose rows add up to one:

S_{n×m} = {P ∈ [0, 1]^{n×m} : P e_m = e_n},

where e_n is the column vector of n ones.

¹ We don’t write A = {a_1, . . . , a_n}, but that is what we mean.

Entropy

Suppose φ ∈ S_n. Then the (Shannon) entropy of φ is

H(φ) = − ∑_{i=1}^n φ_i log φ_i .

We can also call this H(X) (i.e., associate entropy with a r.v. or
its distribution).
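A minimal sketch of this computation (ours, not from the paper), using
the natural logarithm and the convention 0 log 0 = 0:

```python
import math

# Shannon entropy of a probability vector (natural log; use math.log2 for bits).
def entropy(phi):
    return -sum(p * math.log(p) for p in phi if p > 0)
```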


Mutual Information

Suppose X, Y are r.v.s on A, B and let θ denote their joint
distribution. Then θ ∈ S_{nm}, with marginal distributions θ_A = φ and
θ_B = ψ.

I(X, Y) := H(X) + H(Y) − H(X, Y)

is called the mutual information between X and Y.

Alternate formula:

I(X, Y) = ∑_{i∈A} ∑_{j∈B} θ_ij log ( θ_ij / (φ_i ψ_j) ).

Note that I(X, Y ) = I(Y, X).
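Assuming the entropy sketch above, mutual information can be computed
directly from the defining identity, with the joint distribution θ
given as an n × m matrix (an illustrative sketch only):

```python
# I(X;Y) = H(X) + H(Y) - H(X,Y), computed from the joint distribution theta.
def mutual_information(theta):
    phi = [sum(row) for row in theta]            # marginal on A (row sums)
    psi = [sum(col) for col in zip(*theta)]      # marginal on B (column sums)
    joint = [t for row in theta for t in row]    # theta flattened into a vector
    return entropy(phi) + entropy(psi) - entropy(joint)
```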


Conditional Entropy

The quantity
H(X|Y ) = H(X, Y ) − H(Y )
is called the conditional entropy of X given Y . Note that
H(X|Y) ≠ H(Y|X) in general. In fact

H(Y |X) = H(X|Y ) + H(Y ) − H(X).

If θ is the joint distribution of X, Y , then

H(X|Y ) = H(θ) − H(ψ), H(Y |X) = H(θ) − H(φ).


Conditional Entropy: Alternate Formulation

Define

Θ = [θ_ij] = [Pr{X = i & Y = j}] ∈ [0, 1]^{n×m},

and define P = [Diag(φ)]^{−1} Θ. Clearly

p_ij = θ_ij / φ_i = Pr{Y = j | X = i}.

So P ∈ S_{n×m} and

H(Y|X) = ∑_{i∈A} φ_i H(p_i) =: J_φ(P),

where p_i is the i-th row of P.
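Continuing the sketches above, J_φ(P) is simply a φ-weighted average of
the row entropies of P:

```python
# H(Y|X) = sum_i phi_i * H(p_i), where p_i is the i-th row of the
# conditional-probability matrix P (each row sums to one).
def conditional_entropy(phi, P):
    return sum(f * entropy(row) for f, row in zip(phi, P))
```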



Maximizing Mutual Information (MMI)


Given φ ∈ S_n, ψ ∈ S_m, we look for a ‘joint’ distribution θ ∈ S_{nm}
such that θ_A = φ, θ_B = ψ, and in addition the entropy H(θ) is
minimum; equivalently, the conditional entropy of Y given X is minimum
(as is that of X given Y); or again equivalently, the mutual
information between X and Y is maximum. In other words, define

W(φ, ψ) := min H(θ)     s.t. θ_A = φ, θ_B = ψ
         = min H(X, Y)  s.t. θ_A = φ, θ_B = ψ,

V(φ, ψ) := W(φ, ψ) − H(φ)
         = min H(Y|X)   s.t. θ_A = φ, θ_B = ψ.

We try to make Y ‘as deterministic as possible’ given X.
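For the smallest nontrivial case n = m = 2 the joint distribution has a
single free parameter, so V(φ, ψ) can be found by brute force. The
sketch below is ours (it reuses the entropy function defined earlier):
it grid-searches that parameter, and since the objective is concave the
minimum sits at an endpoint of the feasible interval.

```python
# Brute-force V(phi, psi) for n = m = 2: parameterize theta by t = theta_11.
def V_2x2(phi, psi, steps=10000):
    lo = max(0.0, phi[0] + psi[0] - 1.0)   # feasibility: all entries >= 0
    hi = min(phi[0], psi[0])
    best = float("inf")
    for k in range(steps + 1):
        t = lo + (hi - lo) * k / steps
        theta = [t, phi[0] - t, psi[0] - t, phi[1] - psi[0] + t]
        best = min(best, entropy(theta) - entropy(phi))   # H(Y|X) = H(theta) - H(phi)
    return best

print(V_2x2([0.55, 0.45], [0.48, 0.52]))   # > 0: psi is not a permutation of phi
```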



The Variation of Information Metric


The quantity
d(φ, ψ) := V (φ, ψ) + V (ψ, φ)
is called the variation of information metric. Since

V (ψ, φ) = V (φ, ψ) + H(φ) − H(ψ),

we need to compute only one of V (ψ, φ), V (φ, ψ).


The quantity d satisfies:
- d(φ, φ) = 0 and d(φ, ψ) ≥ 0 for all φ, ψ.
- d(φ, ψ) = d(ψ, φ).
- The triangle inequality: d(φ, ξ) ≤ d(φ, ψ) + d(ψ, ξ).
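Using the identity above, d needs only one evaluation of V. A sketch
for the 2 × 2 case, reusing the V_2x2 and entropy sketches from earlier:

```python
# d(phi, psi) = V(phi, psi) + V(psi, phi) = 2 V(phi, psi) + H(phi) - H(psi).
def d_2x2(phi, psi):
    return 2 * V_2x2(phi, psi) + entropy(phi) - entropy(psi)
```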



Proof of Triangle Inequality


Suppose X, Y, Z are r.v.s on finite sets A, B, C. Then the
conditional entropy satisfies a ‘one-sided’ triangle inequality:

H(X|Y) ≤ H(X|Z) + H(Z|Y).

Proof:

H(X|Y) ≤ H(X, Z|Y) = H(Z|Y) + H(X|Z, Y) ≤ H(Z|Y) + H(X|Z).

So if we define

v(X, Y) = H(X|Y) + H(Y|X),

then v satisfies the triangle inequality. Note that

d(φ, ψ) = min v(X, Y),

where the minimum is over joint distributions of (X, Y) with X, Y
having distributions φ, ψ respectively.



A Key Property

Actually d is a pseudometric, not a metric. In other words, d(φ, ψ)
can be zero even if φ ≠ ψ.

Theorem: d(φ, ψ) = 0 if and only if n = m and φ, ψ are permutations
of each other.

Consequence: The (pseudo)metric d is not convex!

Example: Let n = m = 2,

φ = [0.75 0.25], ψ = [0.25 0.75], ξ = 0.5φ + 0.5ψ = [0.5 0.5].

Then d(φ, φ) = d(φ, ψ) = 0 but d(φ, ξ) > 0.


Computing the Metric Distance

Change the variable of optimization from θ, the joint distribution, to
P, the matrix of conditional probabilities:

V(φ, ψ) = min_{θ ∈ S_{nm}}  H(θ) − H(φ)                  s.t. θ_A = φ, θ_B = ψ
        = min_{P ∈ S_{n×m}} J_φ(P) := ∑_{i∈A} φ_i H(p_i)   s.t. φP = ψ.

Since J_φ(·) is a strictly concave function and the feasible region is
polyhedral (the convex hull of a finite number of extreme points), the
minimum occurs at one of these extreme points.

Also, a ‘principle of optimality’ allows us to break large problems
down into smaller ones.


The m = 2 Case

Partition the set {1, . . . , n} into two sets I_1, I_2 such that

|ψ_1 − ∑_{i∈I_1} φ_i| = |ψ_2 − ∑_{i∈I_2} φ_i|

is minimized.

Interpretation: Given two bins with capacities ψ_1, ψ_2, assign
φ_1, . . . , φ_n to the two bins so that the overflow in one bin (and
the under-utilization in the other bin) is minimized.
Theorem: If both bins are filled exactly, then V (φ, ψ) = 0.
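For small n the optimal two-bin partition can be found by exhaustive
search. The sketch below is our own illustration (exponential in n, so
only for toy examples); it returns the minimal overflow and an
assignment achieving it:

```python
from itertools import product

# Exhaustive search over all assignments of phi_1..phi_n to two bins:
# minimize |psi_1 - (mass placed in bin 1)| (= |psi_2 - mass in bin 2|).
def best_two_bin_partition(phi, psi1):
    best_gap, best_assign = float("inf"), None
    for assign in product((1, 2), repeat=len(phi)):
        load1 = sum(p for p, a in zip(phi, assign) if a == 1)
        gap = abs(psi1 - load1)
        if gap < best_gap:
            best_gap, best_assign = gap, assign
    return best_gap, best_assign
```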


The m = 2 Case (Continued)

Theorem: If the two bins cannot both be filled exactly, and if the
smallest element of φ (call it φ_0) belongs to the overstuffed bin,
then

V(φ, ψ) = φ_0 H([u/φ_0, (φ_0 − u)/φ_0]),

where u is the unutilized capacity (equivalently, the overflow).

If φ_0 belongs to the underutilized bin, the above is an upper bound
for V(φ, ψ) but may not equal V(φ, ψ).

Bad news: Computing the optimal partitioning is NP-hard!
So we need an approximation procedure to upper bound d(φ, ψ).
No special reason to restrict to m = 2.


Best Fit Algorithm for Bin-Packing

Think of ψ_1, . . . , ψ_m as the capacities of m bins.

- Arrange the ψ_j in descending order.
- For i = 1, . . . , n, place each φ_i into the bin with maximum
  unutilized capacity.
- If a bin overflows, don't fill it any more.

Complexity is O(n log m).

Provable suboptimality bound: the total bin size is ≤ 1.25 times the
optimal bin size (the best known bound). The corresponding bound on
the overflow is not as good: it is ≤ 0.25 + 1.25 × the optimal value.
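A sketch of the best fit heuristic as described above (our rendering;
tie-breaking, and what to do with an item once every bin has
overflowed, are implementation choices not fixed by the slide):

```python
# Best-fit heuristic: each phi_i goes into the bin with the largest remaining
# (unutilized) capacity; a bin that has already overflowed receives nothing more.
def best_fit(phi, psi):
    remaining = sorted(psi, reverse=True)       # capacities, in descending order
    bins = [[] for _ in psi]                    # indices of phi placed in each bin
    for i, p in enumerate(phi):
        open_bins = [j for j in range(len(psi)) if remaining[j] > 0]
        j = max(open_bins, key=lambda k: remaining[k]) if open_bins else 0
        bins[j].append(i)
        remaining[j] -= p
    return bins, remaining                      # negative remaining = overflow
```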


Illustration of Best Fit Algorithm


Suppose ψ = [0.45 0.30 0.25], and φ ∈ S_14 is given by

φ = 10^{−2} · [ 14 13 12 9 8 7 6 6 5 5 4 4 4 3 ]

[Figure: the masses φ_1, . . . , φ_14 stacked into the three bins of
capacities 0.45, 0.30 and 0.25 according to the best fit rule.]
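Running the best_fit sketch from the previous slide on this example
packs the fourteen masses into the three bins (tie-breaking may differ
from the original figure, so the exact assignment shown in the talk is
not asserted here):

```python
psi = [0.45, 0.30, 0.25]
phi = [0.14, 0.13, 0.12, 0.09, 0.08, 0.07, 0.06, 0.06,
       0.05, 0.05, 0.04, 0.04, 0.04, 0.03]
bins, remaining = best_fit(phi, psi)
for j, members in enumerate(bins):
    load = sum(phi[i] for i in members)
    print(f"bin {j+1} (capacity {psi[j]}):", [i + 1 for i in members], round(load, 2))
```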


A Greedy Algorithm for Bounding d(φ, ψ)

Think of ψ_1, . . . , ψ_m as m bins to be packed. Modify the ‘best fit’
algorithm: sort ψ in decreasing order, and put each φ_i into the bin
with the largest unutilized capacity. If φ_i does not fit into any
bin, put it aside (the departure from the best fit algorithm).

When all of φ has been processed, say k entries have been put aside;
then k < m. Now solve a k-by-m bin-packing problem, and repeat. See
the full paper for details.

This results in a sequence of bin-packing problems of decreasing size.
The outcome is an upper bound for d(φ, ψ). Complexity is
O((n + m²) log m).


Problem Formulation

Problem: Given φ ∈ S_n and m < n, find ψ ∈ S_m that minimizes
d(φ, ψ).

Definition: ψ ∈ S_m is said to be an aggregation of φ ∈ S_n if
there exists a partition I_1, . . . , I_m of {1, . . . , n} such that

ψ_j = ∑_{i∈I_j} φ_i .


Results

Theorem: Any optimal approximation of φ in the variation of
information metric d must be an aggregation of φ.

Theorem: An aggregation φ^(a) of φ is an optimal approximation of φ
in the variation of information metric d if and only if it has
maximum entropy (amongst all aggregations).

Note: u_m, the uniform distribution with m elements, has maximum
entropy in S_m. So should we try to minimize ρ(φ^(a), u_m)? This is
yet another bin-packing problem, with all bin capacities equal to
1/m. But how valid is this approach?


NP-Hardness of Optimal Order Reduction

Suppose m = 2 and let φ^(a) be an aggregation of φ ∈ S_n. Then
φ^(a) has maximum entropy if and only if the total variation distance
ρ(φ^(a), u_2) is minimized.

Therefore:
- When m = 2, minimizing ρ(φ^(a), u_2) gives an optimal reduced-order
  approximation to φ.
- For m ≥ 3, minimizing ρ(φ^(a), u_m) gives a suboptimal reduced-order
  approximation to φ.


Reformulation of Problem

Problem: Given φ ∈ S_n and m < n, find an aggregation φ^(a) ∈ S_m of φ
with maximum entropy.

Reformulation: Since the uniform distribution u_m has maximum entropy
in S_m, find an aggregation φ^(a) such that the total variation
distance ρ(φ^(a), u_m) is minimized.

More general problem (not any harder): Given φ ∈ S_n, m < n and
ξ ∈ S_m, find an aggregation φ^(a) minimizing the total variation
distance ρ(φ^(a), ξ).

Use the best-fit algorithm to find an aggregation of φ that is close
to the uniform distribution. Complexity is O(n log m).
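A sketch of this recipe (ours), reusing the best_fit and
total_variation sketches from earlier, and the 14-component φ from the
packing illustration:

```python
# Aggregate phi into m bins of capacity 1/m via best fit, then compare the
# resulting reduced distribution with the uniform distribution u_m.
def aggregate_best_fit(phi, m):
    bins, _ = best_fit(phi, [1.0 / m] * m)
    return [sum(phi[i] for i in members) for members in bins]

phi_a = aggregate_best_fit(phi, 3)               # phi from the earlier example
print(phi_a, total_variation(phi_a, [1.0 / 3] * 3))
```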


Bound on Performance of Best Fit Algorithm

Theorem: Let φ^(a) denote the aggregation of φ produced by the best
fit algorithm. Then

ρ(φ^(a), ξ) ≤ 0.25 m φ_max ,

where φ_max is the largest component of φ and ρ is the total variation
metric.

This can be turned into a (messy) bound on the entropy of φ^(a).


Achievements

- Definition of a proper metric between probability distributions
  defined on sets of different cardinalities, obtained by maximizing
  the mutual information between the associated random variables.
- Study of the properties of the distance and of the problem of
  computing it, showing the close relationship to the bin-packing
  problem with overstuffing.
- Since bin-packing is NP-hard, adaptation of the best-fit algorithm
  to generate upper bounds for the distance in polynomial time.
- Characterization of the solution of the optimal order reduction
  problem in terms of aggregating to maximize entropy; formulation as
  a bin-packing problem with overstuffing; upper bound on the
  performance of the algorithm.


Next Steps

- Extension of the metric to Markov processes, hidden Markov
  processes, and arbitrary (but stationary and ergodic) stochastic
  processes.
- Alternatives to the best-fit heuristic algorithm.

Thank You!
