
Data Assimilation – The Schrödinger Perspective

Sebastian Reich∗
July 24, 2018
arXiv:1807.08351v1 [math.NA] 22 Jul 2018

Abstract
Data assimilation addresses the general problem of how to combine model–based predictions with partial and noisy observations of the process in an optimal manner. This survey focuses on sequential data
assimilation techniques using probabilistic particle–based algorithms. In addition to surveying recent de-
velopments for discrete and continuous–time data assimilation both in terms of mathematical foundations
and algorithmic implementations, we also provide a unifying framework from the perspective of coupling of
measures and Schrödinger’s boundary value problem for diffusion processes in particular.

1 Introduction
This survey focuses on sequential data assimilation techniques for state and parameter estimation in the context
of discrete and/or continuous–time stochastic diffusion processes. The field itself is well–established (Evensen
2006, Särkkä 2013, Law, Stuart and Zygalakis 2015, Reich and Cotter 2015, Asch, Bocquet and Nodet 2017),
but also undergoes continual development due to new challenges arising from emerging application areas such
as medicine, traffic control, biology, cognitive sciences and geosciences.
Data assimilation is typically formulated within a Bayesian framework in order to combine partial and noisy
observations with model predictions and their uncertainties with the goal of adjusting model states and model
parameters in an optimal manner. In the case of linear systems and Gaussian distributions, this task leads to the celebrated Kalman filter (Särkkä 2013), which even today forms the basis of a number of popular data assimilation schemes and which has given rise to the widely used ensemble Kalman filter (Evensen 2006). Contrary to standard sequential Monte Carlo methods (Doucet, de Freitas and Gordon (eds.) 2001, Bain and Crisan 2008), the ensemble Kalman filter does not provide a consistent approximation to the sequential filtering problem, but it remains applicable to very high–dimensional problems. This and other advances have widened the scope of
sequential data assimilation and have led to an avalanche of new methods in recent years.
We will focus in this review on probabilistic methods (in contrast to data assimilation techniques based on
optimisation, such as 3DVar and 4DVar) in the form of sequential particle methods. The essential challenge of
sequential particle methods is to convert a sample of M particles from a filtering distribution at time tk into M
samples from the filtering distribution at time tk+1 without having access to the full filtering distributions. It
will also often be the case in practical applications that this sample size will be small to moderate in comparison
to the number of variables we need to estimate.
Sequential particle methods can be viewed as a special instance of interacting particle systems (del Moral
2004). We will view such interacting particle systems in this review from the perspective of approximating
a certain boundary value problem in the space of probability measures, where the boundary conditions are
provided by the underlying dynamic process, the data, and Bayes’ theorem. This point of view leads to
natural links to optimal transportation (Villani 2003, Reich and Cotter 2015) and, more importantly for this
review, to Schrödinger’s problem (Föllmer and Gantert 1997, Leonard 2014, Chen, Georgiou and Pavon 2014), as
formulated first by E. Schrödinger as a certain boundary value problem for Brownian motion (Schrödinger 1931).

Remark 1.1. We will primarily refer to the methods considered in this survey as particle or ensemble methods
instead of the also widely used notion of sequential Monte Carlo methods. We will also use the notions of
particles, samples and ensemble members synonymously. Since the ensemble size, M , is generally assumed to
be small to moderate relative to the number of variables of interest, we will focus on robust but generally biased
particle methods.
∗ Department of Mathematics, University of Potsdam & University of Reading, sebastian.reich@uni-potsdam.de

Figure 1: A schematic illustration of sequential data assimilation: model states are propagated forward in time under the given model dynamics and adjusted whenever data become available at discrete instances in time. In this paper, we look at a single transition from a given model state, conditioned on all previous and current data, to the next instance in time, and at its adjustment under the assimilation of the newly available data.

This survey consists of four main parts. We start by reviewing the key mathematical concepts of data assimilation
when the data becomes available at discrete instances in time. The underlying dynamic models can be either
continuous or discrete–in–time. After recalling the standard concepts of filtering and smoothing from stochastic
analysis, the Schrödinger formulation will be introduced as the natural mathematical framework for sequential
data assimilation. The second part will summarise popular probabilistic, i.e. particle–based, computational
approaches for discrete–in–time data assimilation. Those include the ensemble Kalman filter and its extensions
to the more general class of linear ensemble transform filters. This part is followed by a chapter on data
assimilation for data arriving continuously in time. We will distinguish between data that is smooth as a function
of time and data which has been perturbed by Brownian motion. In both cases, we will derive appropriate
mean–field equations, which produce the correct conditional marginal distributions in the state variables. The
final chapter of this review discusses some numerical approximations for these mean–field equations in the form
of interacting particle systems.

2 Mathematical foundation of discrete–time DA


Let us assume that we are given partial and noisy observations yk, k = 1,...,K, of a stochastic process in regular time intervals of length T = 1. Given a likelihood function π(y|z), a Markov transition kernel q+(z'|z) and an initial distribution Π0, the associated prior and posterior probability density functions (PDFs) are given by
\[
\pi(z_{0:K}) := \Pi_0(z_0) \prod_{k=1}^{K} q_+(z_k|z_{k-1}) \tag{1}
\]
and
\[
\pi(z_{0:K}|y_{1:K}) := \Pi_0(z_0) \prod_{k=1}^{K} \pi(y_k|z_k)\, q_+(z_k|z_{k-1}), \tag{2}
\]
respectively (Jazwinski 1970, Särkkä 2013). While it is of broad interest to approximate the posterior or smoothing PDF (2), we will focus in this paper on the recursive approximation of the filtering PDFs π(z_k|y_{1:k}) using particle filters. More specifically, we will assume that we have M equally weighted Monte Carlo samples z_{k−1}^i, i = 1,...,M, from the filtering PDF π(z_{k−1}|y_{1:k−1}) at time t = k−1 available, and we wish to produce M equally weighted samples from the filtering PDF π(z_k|y_{1:k}) at time t = k having access to the transition kernel q+(z_k|z_{k−1}) and the likelihood π(y_k|z_k) only. Since the computational task is exactly the same for all indices k ≥ 1, we simply set k = 1 throughout this paper.
We introduce some notation. The PDF at t0 is given by
\[
\pi_0(z_0) := \frac{1}{M} \sum_{i=1}^{M} \delta(z_0 - z_0^i), \tag{3}
\]

where δ(z) denotes the Dirac delta function. We abbreviate the filtering PDF π(z1|y1) at t = 1 by π̂1(z1) and the likelihood π(y1|z1) by l(z1). Since the forecast PDF is given by
\[
\pi_1(z_1) := \int q_+(z_1|z_0)\, \pi_0(z_0)\, \mathrm{d}z_0 = \frac{1}{M} \sum_{i=1}^{M} q_+(z_1|z_0^i), \tag{4}
\]

the filtering PDF is given by
\[
\hat\pi_1(z_1) := \frac{l(z_1)\, \pi_1(z_1)}{\pi_1[l]} = \frac{1}{\pi_1[l]} \frac{1}{M}\, l(z_1) \sum_{i=1}^{M} q_+(z_1|z_0^i) \tag{5}
\]
according to Bayes' theorem.


Here we have used the shorthand
\[
\pi[f] = \int f(z)\, \pi(z)\, \mathrm{d}z
\]
for the expectation of a function f under a PDF π. Similarly, integration with respect to a probability measure P, not necessarily absolutely continuous with respect to the Lebesgue measure, will be denoted by
\[
P[f] = \int f(z)\, P(\mathrm{d}z).
\]

Remark 2.1. If the model depends on parameters, λ, or different models are to be compared, then it is important to explicitly compute the evidence
\[
\beta = \pi_1[l] = \frac{1}{M} \sum_{i=1}^{M} \int l(z_1)\, q_+(z_1|z_0^i)\, \mathrm{d}z_1, \tag{6}
\]
which otherwise only appears as a normalising constant in (5). More specifically, if q+(z1|z0; λ) depends on a parameter λ, then β = β(λ) in (6), and larger values of β(λ) indicate a better fit of the transition kernel to the data for that parameter value.

The filtering distribution π̂1 at time t = 1 implies a smoothing distribution at time t = 0, which we denote by π̂0(z0) and which is given by
\[
\hat\pi_0(z_0) := \frac{1}{\beta} \int l(z_1)\, q_+(z_1|z_0)\, \pi_0(z_0)\, \mathrm{d}z_1 = \frac{1}{M} \sum_{i=1}^{M} \gamma^i\, \delta(z_0 - z_0^i) \tag{7}
\]
with weights
\[
\gamma^i := \frac{1}{\beta} \int l(z_1)\, q_+(z_1|z_0^i)\, \mathrm{d}z_1. \tag{8}
\]
It is important to note that the filtering PDF π̂1 can be obtained from π̂0 using the transition kernels
\[
\hat q_+(z_1|z_0^i) := \frac{l(z_1)\, q_+(z_1|z_0^i)}{\beta\, \gamma^i}, \tag{9}
\]

Figure 2: A schematic illustration of a single data assimilation cycle. The distribution π0 characterises the distribution of states conditioned on all observations up to and including t0, which we set here to t = 0 for simplicity. The predictive distribution at time t1 = 1, as generated by the model dynamics, is denoted by π1. Upon assimilation of the data y1 and application of Bayes' formula, one obtains the filtering distribution π̂1. The conditional distribution of states at time t0 conditioned on all the available data including y1 is denoted by π̂0. Control theory provides the adjusted model dynamics for transforming π̂0 into π̂1. Finally, the Schrödinger problem links π0 and π̂1 in the form of a penalised boundary value problem in the space of joint probability measures. Data assimilation scenario A corresponds to the blue lines, scenario B to the brown lines, and scenario C to the red line.

i.e.
\[
\hat\pi_1(z_1) = \frac{1}{M} \sum_{i=1}^{M} \hat q_+(z_1|z_0^i)\, \gamma^i.
\]
At this point, three different scenarios arise for producing the desired samples ẑ1^i, i = 1,...,M, from the filtering PDF (5).

(A) One first produces samples, z1^i, from the forecast PDF π1 and then transforms those samples into samples, ẑ1^i, from π̂1. This can be viewed as introducing a Markov transition kernel q1(ẑ1|z1) with the property that
\[
\hat\pi_1(\hat z_1) = \int q_1(\hat z_1|z_1)\, \pi_1(z_1)\, \mathrm{d}z_1. \tag{10}
\]
We will use techniques from optimal transportation (Villani 2003, Villani 2009, Reich and Cotter 2015) to find appropriate transition kernels.

(B) One first produces M samples from the smoothing PDF (7) via resampling with replacement and then samples from π̂1 using the smoothing transition kernels (9). The resampling can be represented in terms of a Markov transition matrix Q0 ∈ R^{M×M} such that
\[
\gamma = Q_0\, p.
\]

Here we have introduced the associated probability vectors
\[
\gamma = \left( \tfrac{\gamma^1}{M}, \ldots, \tfrac{\gamma^M}{M} \right)^{\rm T} \in R^M, \qquad p = \left( \tfrac{1}{M}, \ldots, \tfrac{1}{M} \right)^{\rm T} \in R^M. \tag{11}
\]
Again we will explore techniques from optimal transport to find such Markov transition matrices in section 3.

(C) One directly seeks Markov transition kernels q_+^*(z1|z0^i), i = 1,...,M, with the property that
\[
\hat\pi_1(z_1) = \frac{1}{M} \sum_{i=1}^{M} q_+^*(z_1|z_0^i) \tag{12}
\]
and then draws a single sample, ẑ1^i, from each kernel q_+^*(z1|z0^i). We will show that such kernels can be found by solving a Schrödinger problem (Leonard 2014, Chen et al. 2014).

Approach (A) forms the basis of the classic bootstrap particle filter (Doucet et al. 2001, Liu 2001, Bain and Crisan 2008, Arulampalam, Maskell, Gordon and Clapp 2002) and also provides the starting point for many currently used ensemble–based data assimilation algorithms (Evensen 2006, Reich and Cotter 2015, Law et al. 2015). Approach (B) is also well known in the context of particle filters under the notion of optimal proposal densities (Doucet et al. 2001, Arulampalam et al. 2002). The exploration of optimal and other proposal densities in the context of data assimilation has started more recently (Vanden-Eijnden and Weare 2012, Van Leeuwen 2015). There has also been renewed recent interest in approach (B) from the perspective of optimal control and twisting approaches (Guarniero, Johansen and Lee 2017, Heng, Bishop, Deligiannidis and Doucet 2018, Kappen and Ruiz 2016, Ruiz and Kappen 2017). Finally, approach (C) has not yet been explored in the context of particle filters and data assimilation. However, as we argue in this paper, progress on the numerical solution of Schrödinger's problem (Cuturi 2013, Peyre and Cuturi 2018) turns it into a viable option in addition to providing a unifying mathematical framework for data assimilation.
The accuracy of an ensemble–based data assimilation method can be characterised in terms of its effective sample size M_eff (Liu 2001). The relevant effective sample size for approach (B) is, for example, given by
\[
M_{\rm eff} = \frac{M^2}{\sum_{i=1}^M (\gamma^i)^2} = \frac{1}{\|\gamma\|^2}.
\]
One finds that M ≥ M_eff ≥ 1, and the accuracy of a data assimilation step decreases with decreasing M_eff, i.e., the convergence rate 1/√M of a standard Monte Carlo method gets replaced by 1/√M_eff (Agapiou, Papaspiliopoulos, Sanz-Alonso and Stuart 2017). Approach (C) offers a route around this problem by bridging π0 with π̂1 directly.
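As an illustration, the following minimal Python sketch computes M_eff from a set of weights; the randomly generated weights are purely hypothetical stand-ins for the γ^i of (8), normalised as in (20).

```python
import numpy as np

# Hypothetical weights standing in for the gamma^i of (8),
# normalised so that they sum to M as in (20).
M = 100
gamma = np.random.default_rng(0).exponential(size=M)
gamma *= M / gamma.sum()

# Effective sample size M_eff = M^2 / sum_i (gamma^i)^2; see Liu (2001).
M_eff = M**2 / np.sum(gamma**2)
print(M_eff)  # always lies between 1 and M
```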
Example 2.2. We illustrate the three approaches with a simple example. The prior samples are given by M = 11 equally spaced particles z0^i ∈ R from the interval [−1, 1]. The predictive distribution π1 is provided by
\[
\pi_1(z) = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{(2\pi)^{1/2} \sigma} \exp\left( -\frac{1}{2\sigma^2} (z - z_0^i)^2 \right)
\]
with variance σ² = 0.1. The likelihood function is given by
\[
\pi(y_1|z) = \frac{1}{(2\pi R)^{1/2}} \exp\left( -\frac{1}{2R} (y_1 - z)^2 \right)
\]
with R = 0.1 and y1 = −0.5. The implied filtering and smoothing distributions can be found in figure 3. Since π̂1 is in the form of a weighted Gaussian mixture distribution, the Markov chain leading from π̂0 to π̂1 can be stated explicitly, i.e., (9) is provided by
\[
\hat q_+(z_1|z_0^i) = \frac{1}{(2\pi)^{1/2} \hat\sigma} \exp\left( -\frac{1}{2\hat\sigma^2} (z_1 - \bar z_1^i)^2 \right) \tag{13}
\]
with
\[
\hat\sigma^2 = \sigma^2 - \frac{\sigma^4}{\sigma^2 + R}, \qquad \bar z_1^i = z_0^i - \frac{\sigma^2}{\sigma^2 + R} (z_0^i - y_1).
\]
The resulting transition kernels are displayed in figure 4 together with the corresponding transition kernels for the Schrödinger approach, which connects π0 directly with π̂1.
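For readers who wish to reproduce the quantities behind figures 3 and 4, the following Python sketch evaluates σ̂², the means z̄1^i, and the smoothing weights γ^i for this example; variable names are ours, and the closed form for γ^i uses that y1|z0^i ∼ N(z0^i, σ² + R) in this scalar setting.

```python
import numpy as np

# Setting of example 2.2: M = 11 equally spaced prior particles on [-1, 1],
# transition variance sigma^2 = 0.1, observation y1 = -0.5 with noise R = 0.1.
M, sigma2, R, y1 = 11, 0.1, 0.1, -0.5
z0 = np.linspace(-1.0, 1.0, M)

# Parameters of the optimal (smoothing) kernels (13):
sigma2_hat = sigma2 - sigma2**2 / (sigma2 + R)   # variance of q_hat
z1_bar = z0 - sigma2 / (sigma2 + R) * (z0 - y1)  # means of q_hat

# Smoothing weights (8): gamma^i proportional to N(y1; z0^i, sigma^2 + R),
# normalised so that they sum to M as in (20).
g = np.exp(-0.5 * (z0 - y1)**2 / (sigma2 + R))
gamma = M * g / g.sum()
```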

Figure 3: Displayed are the initial PDF π0, the predictive PDF π1, the filtering PDF π̂1, and the smoothing PDF π̂0 for a simple Gaussian transition kernel.

Remark 2.3. It is often assumed in rare event simulations arising from statistical mechanics that Π0 in (1) is a
point measure, i.e., the starting point of the simulation is known exactly. See, for example, Hartmann, Richter,
Schütte and Zhang (2017). This corresponds to (3) with M = 1. It turns out that the associated smoothing
problem becomes equivalent to Schrödinger’s problem under this particular setting since the distribution at t = 0
is fixed.

The remainder of this section is structured as follows. We first recapitulate the pure prediction problem for discrete–time Markov processes and continuous–time diffusion processes. We then discuss the filtering and smoothing problem for a single data assimilation step. The final subsection is devoted to the Schrödinger problem (Leonard 2014, Chen et al. 2014) of bridging the filtering distribution, π0, at t = 0 with the filtering distribution, π̂1, at t = 1.

2.1 Prediction
We assume under the chosen computational setting that we have access to M samples z0^i ∈ R^{Nz}, i = 1,...,M, from the filtering distribution at t = 0. We also assume that we know (explicitly or implicitly) the forward transition probabilities q+(z1|z0^i) of the underlying Markovian stochastic process. This leads to the prediction PDF π1 as given by (4).
We also introduce the backward transition kernel q−(z0|z1) for later use, which is defined through the equality
\[
q_-(z_0|z_1)\, \pi_1(z_1) = q_+(z_1|z_0)\, \pi_0(z_0).
\]

Note that q−(z0|z1) as well as π0 are not absolutely continuous with respect to the underlying Lebesgue measure,

Figure 4: The panel on the left displays the transition kernels (13) for the M = 11 different particles z0^i. These correspond to the optimal control path in figure 2. The corresponding transition kernels, which lead directly from π0 to π̂1, are displayed in the panel on the right. These correspond to the Schrödinger path in figure 2. Details on how to compute those Schrödinger transition kernels, q_+^*(z1|z0^i), can be found in section 3.4.1.

i.e.,
\[
q_-(z_0|z_1) = \frac{1}{M} \sum_{i=1}^{M} \frac{q_+(z_1|z_0^i)}{\pi_1(z_1)}\, \delta(z_0 - z_0^i). \tag{14}
\]
The backward transition kernel q−(z0|z1) reverses the prediction process in the sense that
\[
\pi_0(z_0) = \int q_-(z_0|z_1)\, \pi_1(z_1)\, \mathrm{d}z_1.
\]

Remark 2.4. Let us assume that detailed balance
\[
q_+(z_1|z_0)\, \pi(z_0) = q_+(z_0|z_1)\, \pi(z_1)
\]
holds for some PDF π and forward transition kernel q+(z1|z0). Then π1 = π for π0 = π and q−(z0|z1) = q+(z0|z1).

We now derive a class of forward transition kernels using the concept of twisting (Guarniero et al. 2017, Heng
et al. 2018), which is an application of Doob’s H–transform technique (Doob 1984) to the smoothing problem.

Definition 2.5. Given a non-negative twisting function ψ1(z1) such that the modified transition kernel
\[
q_+^\psi(z_1|z_0) := \psi_1(z_1)\, q_+(z_1|z_0)\, \psi_0(z_0)^{-1} \tag{15}
\]
with
\[
\psi_0(z_0) := \int q_+(z_1|z_0)\, \psi_1(z_1)\, \mathrm{d}z_1 \tag{16}
\]
is well defined, one can define the twisted prediction density
\[
\pi_1^\psi(z_1) := \frac{1}{M} \sum_{i=1}^{M} q_+^\psi(z_1|z_0^i) = \frac{1}{M} \sum_{i=1}^{M} \frac{\psi_1(z_1)}{\psi_0(z_0^i)}\, q_+(z_1|z_0^i). \tag{17}
\]

The PDFs π1 and π1^ψ are related by
\[
\frac{\pi_1(z_1)}{\pi_1^\psi(z_1)} = \frac{\sum_{i=1}^M q_+(z_1|z_0^i)}{\sum_{i=1}^M \frac{\psi_1(z_1)}{\psi_0(z_0^i)}\, q_+(z_1|z_0^i)}. \tag{18}
\]

Eq. (18) gives rise to importance weights
\[
w^i \propto \frac{\pi_1(z_1^i)}{\pi_1^\psi(z_1^i)} \tag{19}
\]
for samples z1^i = Z1^i(ω) drawn from the twisted prediction distribution, i.e.,
\[
Z_1^i \sim q_+^\psi(\cdot\,|z_0^i),
\]
and
\[
\pi_1(z) \approx \frac{1}{M} \sum_{i=1}^{M} w^i\, \delta(z - z_1^i)
\]
in a weak sense. Here we have assumed that the normalisation constant in (19) is chosen such that
\[
\sum_{i=1}^{M} w^i = M. \tag{20}
\]

Twisted transition kernels will become important when looking at the filtering and smoothing as well as the
Schrödinger problem later in this section.

Remark 2.6. The transition kernel q+ (z1 |z0 ) might depend on unknown parameters λ. This leads to combined
state and parameter estimation problems. One approach is to extend state space by those additional parameters
and to adapt the transition kernel appropriately. Another approach is to compute the evidence β = β(λ) of the
data under the given model parameters and to compare models based on their evidence. See Kantas, Doucet,
Singh, Maciejowski and Chopin (2015) for a recent survey.

Let us now discuss a couple of specific models which give rise to transition kernels q+ (z1 |z0 ). These models will
be used throughout this paper to illustrate mathematical and algorithmic concepts.

2.1.1 Gaussian model error


Let us consider the discrete–time stochastic process
\[
Z_1 = \Psi(Z_0) + \gamma^{1/2}\, \Xi_0 \tag{21}
\]
for a given map Ψ : R^{Nz} → R^{Nz}, scaling factor γ > 0, and Gaussian distributed random variable Ξ0 with mean zero and covariance matrix B ∈ R^{Nz×Nz}. The associated forward transition kernel is given by
\[
q_+(z_1|z_0) = {\rm n}(z_1; \Psi(z_0), \gamma B). \tag{22}
\]
Here we have introduced the shorthand n(z; z̄, P) for the PDF of a Gaussian random variable with mean z̄ and covariance matrix P.
Let us consider a twisting potential ψ1 of the form
\[
\psi_1(z_1) \propto \exp\left( -\frac{1}{2} (Hz_1 - d)^{\rm T} R^{-1} (Hz_1 - d) \right)
\]
for given H ∈ R^{Nd×Nz}, d ∈ R^{Nd}, and covariance matrix R ∈ R^{Nd×Nd}. We define
\[
K := BH^{\rm T} (HBH^{\rm T} + \gamma^{-1} R)^{-1} \tag{23}
\]
and
\[
\bar B := B - KHB, \qquad \bar z_1^i := \Psi(z_0^i) - K(H\Psi(z_0^i) - d). \tag{24}
\]
The twisted prediction kernels are given by
\[
q_+^\psi(z_1|z_0^i) = {\rm n}(z_1; \bar z_1^i, \gamma \bar B)
\]
and
\[
\psi_0(z_0^i) \propto \exp\left( -\frac{1}{2} (H\Psi(z_0^i) - d)^{\rm T} (R + \gamma HBH^{\rm T})^{-1} (H\Psi(z_0^i) - d) \right)
\]
for i = 1,...,M.
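A minimal numpy sketch of the computation of (23)–(24) reads as follows; the function name is ours and Ψ is assumed to act on a single state vector.

```python
import numpy as np

# A sketch of the twisted kernel parameters (23)-(24) for the Gaussian
# model error case; Psi, H, B, R, d, gamma are all assumed given.
def twisted_kernel_params(z0, Psi, H, B, R, d, gamma):
    # K = B H^T (H B H^T + gamma^{-1} R)^{-1}, cf. (23)
    S = H @ B @ H.T + R / gamma
    K = B @ H.T @ np.linalg.inv(S)
    Bbar = B - K @ H @ B                      # cf. (24)
    z1bar = Psi(z0) - K @ (H @ Psi(z0) - d)   # mean of q_+^psi(.|z0)
    return z1bar, gamma * Bbar                # mean and covariance
```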

2.1.2 SDE models
Consider the (forward) SDE (Pavliotis 2014)
\[
\mathrm{d}Z_t^+ = f_t(Z_t^+)\, \mathrm{d}t + \gamma^{1/2}\, \mathrm{d}W_t^+ \tag{25}
\]
with initial condition Z0^+ = z0 and γ > 0. Here Wt^+ stands for standard Brownian motion in the sense that the distribution of W_{t+∆t}^+, ∆t > 0, conditioned on w_t^+ = W_t^+(ω) is Gaussian with mean w_t^+ and covariance matrix ∆t I (Pavliotis 2014), and the process Zt^+ is adapted to Wt^+.
The resulting time–t transition kernels q_t^+(z|z0), t ∈ (0, 1], satisfy the Fokker–Planck equation (Pavliotis 2014)
\[
\partial_t q_t^+(\cdot\,|z_0) = -\nabla_z \cdot \left( q_t^+(\cdot\,|z_0)\, f_t \right) + \frac{\gamma}{2} \Delta_z q_t^+(\cdot\,|z_0)
\]
with initial condition q_0^+(z|z_0) = δ(z − z_0), and the time–one forward transition kernel q+(z1|z0) is given by q_+(z_1|z_0) = q_1^+(z_1|z_0).
We introduce the Fokker–Planck operator L by
\[
L\pi := -\nabla_z \cdot (\pi f_t) + \frac{\gamma}{2} \Delta_z \pi \tag{26}
\]
and its dual operator (Pavliotis 2014)
\[
L^\dagger g := \nabla_z g \cdot f_t + \frac{\gamma}{2} \Delta_z g
\]
for later reference.
Solutions (realisations) z_{[0,1]} = Z^+_{[0,1]}(ω) of the SDE (25) with initial conditions drawn from π0 are continuous functions of time, i.e., z_{[0,1]} ∈ C := C([0,1], R^{Nz}), and define a probability measure Q on C, i.e.,
\[
Z^+_{[0,1]} \sim Q.
\]
We note that the marginal distributions πt of Q, given by
\[
\pi_t(z_t) = \int q_t^+(z_t|z_0)\, \pi_0(z_0)\, \mathrm{d}z_0,
\]
also satisfy the Fokker–Planck equation, i.e.,
\begin{align}
\partial_t \pi_t &= L \pi_t \tag{27}\\
&= -\nabla_z \cdot \left( \pi_t (f_t - \gamma \nabla_z \log \pi_t) \right) - \frac{\gamma}{2} \Delta_z \pi_t \tag{28}\\
&= -\nabla_z \cdot \left( \pi_t \left( f_t - \frac{\gamma}{2} \nabla_z \log \pi_t \right) \right) \tag{29}
\end{align}
for a given PDF π0 at time t = 0. Furthermore, we can read off from (28) the backward SDE
\begin{align}
\mathrm{d}Z_t^- &= f_t(Z_t^-)\, \mathrm{d}t - \gamma \nabla_z \log \pi_t\, \mathrm{d}t + \gamma^{1/2}\, \mathrm{d}W_t^- \nonumber\\
&= b_t(Z_t^-)\, \mathrm{d}t + \gamma^{1/2}\, \mathrm{d}W_t^- \tag{30}
\end{align}
with final condition Z1^- ∼ π1, Wt^- backward Brownian motion, and density dependent drift term
\[
b_t(z) := f_t(z) - \gamma \nabla_z \log \pi_t
\]
(Nelson 1984, Chen et al. 2014). Here backward Brownian motion is to be understood in the sense that the distribution of W_{t−∆τ}^-, ∆τ > 0, conditioned on w_t^- = W_t^-(ω) is Gaussian with mean w_t^- and covariance matrix ∆τ I, with all other properties of Brownian motion appropriately adjusted. The process Zt^- is adapted to Wt^-.

Remark 2.7. The backward SDE (30) induces a corresponding backward transition kernel q_{1−τ}^-(z|z1) with τ ∈ [0,1], which satisfies the Fokker–Planck equation
\[
-\partial_\tau q_{1-\tau}^-(\cdot\,|z_1) = -\nabla_z \cdot \left( q_{1-\tau}^-(\cdot\,|z_1)\, b_t \right) - \frac{\gamma}{2} \Delta_z q_{1-\tau}^-(\cdot\,|z_1)
\]
with initial condition q_1^-(z|z_1) = δ(z − z_1) at τ = 0. The induced backward transition kernel q−(z0|z1) is then given by
\[
q_-(z_0|z_1) = q_0^-(z_0|z_1)
\]
and satisfies (14).

We also note that the mean field equation,
\[
\frac{\mathrm{d}}{\mathrm{d}t} z_t = f_t(z_t) - \frac{\gamma}{2} \nabla_z \log \pi_t(z_t) = \frac{1}{2} \left( f_t(z_t) + b_t(z_t) \right), \tag{31}
\]
resulting from (29), leads to the same marginal distributions πt as the forward and backward SDEs, respectively. It should be kept in mind, however, that the path measure generated by (31) is different from the path measure Q generated by (25).

Remark 2.8. The notion of a backward SDE also arises in a different context where the driving Brownian
motion is still adapted to the past, i.e. Wt+ in our notation, and a final condition is prescribed as for (30). See
(52) below and Carmona (2016) for more details.

Please also note that the backward SDE and the mean field equation (31) become singular as t → 0 for the given (3). A meaningful solution can be defined via regularisation of the Dirac delta function, i.e.,
\[
\pi_0(z) \approx \frac{1}{M} \sum_{i=1}^{M} {\rm n}(z; z_0^i, \epsilon I),
\]
and taking the limit ε → 0.


We will find later that it is sometimes advantageous to modify the given SDE (25) by a time–dependent drift term ut(z). Such a modification provides the time–continuous analog to the twisted transition kernel (15) introduced earlier in subsection 2.1. The modified forward SDE
\[
\mathrm{d}Z_t^+ = f_t(Z_t^+)\, \mathrm{d}t + u_t(Z_t^+)\, \mathrm{d}t + \gamma^{1/2}\, \mathrm{d}W_t^+ \tag{32}
\]
with Z0^+ ∼ π0 generates a path measure which we denote by Q^u. Realisations of this path measure are denoted by z^u_{[0,1]}. According to Girsanov's theorem (Pavliotis 2014), the two path measures Q and Q^u are absolutely continuous with respect to each other with Radon–Nikodym derivative
\[
\frac{\mathrm{d}Q^u}{\mathrm{d}Q}\Big|_{z^u_{[0,1]}} = \exp\left( \frac{1}{2\gamma} \int_0^1 \left( \|u_t\|^2\, \mathrm{d}t + 2\gamma^{1/2}\, u_t \cdot \mathrm{d}W_t^+ \right) \right) \tag{33}
\]
provided that the Kullback–Leibler divergence KL(Q^u||Q) between Q^u and Q, given by
\[
{\rm KL}(Q^u||Q) := \int \left( \frac{1}{2\gamma} \int_0^1 \|u_t\|^2\, \mathrm{d}t \right) Q^u(\mathrm{d}z^u_{[0,1]}), \tag{34}
\]
is finite. Recall that the Kullback–Leibler divergence between two path measures P ≪ Q on C is defined by
\[
{\rm KL}(P||Q) = \int \log \frac{\mathrm{d}P}{\mathrm{d}Q}\, P(\mathrm{d}z_{[0,1]}).
\]
If the modified SDE (32) is used to make predictions, then its solutions z^u_{[0,1]} need to be weighted according to the inverse Radon–Nikodym derivative
\[
\frac{\mathrm{d}Q}{\mathrm{d}Q^u}\Big|_{z^u_{[0,1]}} = \exp\left( -\frac{1}{2\gamma} \int_0^1 \left( \|u_t\|^2\, \mathrm{d}t + 2\gamma^{1/2}\, u_t \cdot \mathrm{d}W_t^+ \right) \right) \tag{35}
\]
in order to reproduce the desired marginal distribution π1 of the original SDE (25).

2.2 Filtering and Smoothing


We now add the likelihood
\[
l(z_1) = \pi(y_1|z_1)
\]
of the data y1 at time t1 = 1 to the picture. Bayes' theorem tells us that, given the prediction PDF π1 at time t1, the posterior PDF π̂1 is given by (5). The distribution π̂1 solves the filtering problem at time t1 given the data y1. We also recall the definition of the evidence (6). The quantity F = −log β is called the free energy in statistical physics (Hartmann et al. 2017).
An appropriate transition kernel q1(ẑ1|z1), satisfying (10), is required in order to complete the transition from π0 to π̂1 following approach (A). A suitable framework for finding such transition kernels is the theory of optimal transportation (Villani 2003). More specifically, let Π denote the set of all joint probability measures π(z1, ẑ1) with marginals
\[
\int \pi(z_1, \hat z_1)\, \mathrm{d}\hat z_1 = \pi_1(z_1), \qquad \int \pi(z_1, \hat z_1)\, \mathrm{d}z_1 = \hat\pi_1(\hat z_1).
\]
We seek the joint measure π*(z1, ẑ1) ∈ Π which minimises the expected squared Euclidean distance between the two associated random variables Z1 and Ẑ1, i.e.
\[
\pi^* = \arg \inf_{\pi \in \Pi} \int\!\!\int \|z_1 - \hat z_1\|^2\, \pi(z_1, \hat z_1)\, \mathrm{d}z_1\, \mathrm{d}\hat z_1. \tag{36}
\]

The minimising joint measure is of the form
\[
\pi^*(z_1, \hat z_1) = \delta(\hat z_1 - \nabla_z \Phi(z_1))\, \pi_1(z_1) \tag{37}
\]
with a convex potential Φ under appropriate conditions on the PDFs π1 and π̂1 (Villani 2003). These conditions are satisfied for dynamical systems with Gaussian model errors and typical SDE models. Once the potential Φ (or an approximation) is available, samples z1^i, i = 1,...,M, from the prediction PDF π1 can be converted into samples ẑ1^i, i = 1,...,M, from the filtering distribution π̂1 via
\[
\hat z_1^i = \nabla_z \Phi(z_1^i). \tag{38}
\]
We will discuss in section 3 how to approximate the transformation (38). We will find that many popular data assimilation schemes, such as the ensemble Kalman filter, can be viewed as approximations to (38) (Reich and Cotter 2015).
We recall at this point that classic particle filters start from the importance weights
\[
w^i \propto \frac{\hat\pi_1(z_1^i)}{\pi_1(z_1^i)} = \frac{l(z_1^i)}{\beta}
\]
and obtain the desired samples ẑ1^i by an appropriate resampling with replacement scheme (Doucet et al. 2001, Arulampalam et al. 2002, Douc and Cappe 2005).

Remark 2.9. If one replaces the forward transition kernel q+(z1|z0) by a twisted kernel (15), then, using (18), the filtering distribution (5) satisfies
\[
\frac{\hat\pi_1(z_1)}{\pi_1^\psi(z_1)} = \frac{l(z_1)}{\beta} \frac{\sum_{j=1}^M q_+(z_1|z_0^j)}{\psi_1(z_1) \sum_{j=1}^M \frac{1}{\psi_0(z_0^j)}\, q_+(z_1|z_0^j)}. \tag{39}
\]
Hence drawing samples z1^i, i = 1,...,M, from π1^ψ instead of π1 leads to modified importance weights
\[
w^i \propto \frac{l(z_1^i) \sum_{j=1}^M q_+(z_1^i|z_0^j)}{\psi_1(z_1^i) \sum_{j=1}^M \frac{1}{\psi_0(z_0^j)}\, q_+(z_1^i|z_0^j)}. \tag{40}
\]
We will see later that finding the optimal twisting potential, i.e., the one with π̂1 = π1^ψ and importance weights w^i = 1 in (40), is equivalent to solving the Schrödinger problem (59)–(62).

The associated smoothing distribution at time t = 0 can be defined as follows. First introduce
\[
\psi_1(z_1) := \frac{\hat\pi_1(z_1)}{\pi_1(z_1)} = \frac{l(z_1)}{\beta}. \tag{41}
\]
Next we set
\[
\psi_0(z_0) := \int q_+(z_1|z_0)\, \psi_1(z_1)\, \mathrm{d}z_1 = \beta^{-1} \int q_+(z_1|z_0)\, l(z_1)\, \mathrm{d}z_1, \tag{42}
\]
and introduce π̂0 := π0 ψ0, i.e.,
\begin{align}
\hat\pi_0(z_0) &= \frac{1}{M} \sum_{i=1}^{M} \psi_0(z_0^i)\, \delta(z_0 - z_0^i) \nonumber\\
&= \frac{1}{M} \sum_{i=1}^{M} \gamma^i\, \delta(z_0 - z_0^i) \tag{43}
\end{align}
since ψ0(z0^i) = γ^i with γ^i defined by (8). Again sampling with replacement can be used to produce equally weighted particles from π̂0.

Lemma 2.10. The smoothing PDFs π̂0 and π̂1 satisfy
\[
\hat\pi_0(z_0) = \int q_-(z_0|z_1)\, \hat\pi_1(z_1)\, \mathrm{d}z_1 \tag{44}
\]
with the backward transition kernel defined by (14). Furthermore,
\[
\hat\pi_1(z_1) = \int \hat q_+(z_1|z_0)\, \hat\pi_0(z_0)\, \mathrm{d}z_0
\]
with modified forward transition kernels
\begin{align}
\hat q_+(z_1|z_0^i) &= \psi_1(z_1)\, q_+(z_1|z_0^i)\, \psi_0(z_0^i)^{-1} \nonumber\\
&= \frac{l(z_1)}{\beta \gamma^i}\, q_+(z_1|z_0^i) \tag{45}
\end{align}
and γ^i, i = 1,...,M, defined by (8).
Proof. We note that
\[
q_-(z_0|z_1)\, \hat\pi_1(z_1) = \frac{\pi_0(z_0)}{\pi_1(z_1)}\, q_+(z_1|z_0)\, \hat\pi_1(z_1) = \frac{l(z_1)}{\beta}\, q_+(z_1|z_0)\, \pi_0(z_0),
\]
which implies the first equation. The second equation follows from π̂0 = ψ0 π0 and
\[
\int \hat q_+(z_1|z_0)\, \hat\pi_0(z_0)\, \mathrm{d}z_0 = \frac{1}{M} \sum_{i=1}^{M} \frac{l(z_1)}{\beta}\, q_+(z_1|z_0^i).
\]
In other words, we have defined a twisted forward transition kernel of the form (15).

Seen from a more abstract perspective, we have provided an alternative formulation of the joint smoothing distribution
\[
\hat\pi(z_0, z_1) := \frac{l(z_1)\, q_+(z_1|z_0)\, \pi_0(z_0)}{\beta} \tag{46}
\]
in the form of
\begin{align}
\hat\pi(z_0, z_1) &= \frac{l(z_1)\, \psi_1(z_1)\, \psi_0(z_0)}{\beta\, \psi_1(z_1)\, \psi_0(z_0)}\, q_+(z_1|z_0)\, \pi_0(z_0) \nonumber\\
&= \hat q_+(z_1|z_0)\, \hat\pi_0(z_0) \tag{47}
\end{align}
because of (41). Note that the marginal distributions of π̂ are provided by π̂0 and π̂1, respectively.
One can exploit these formulations computationally as follows. If one has generated M equally weighted particles ẑ0^j from the smoothing distribution (43) at time t = 0 via resampling with replacement, then one can obtain equally weighted samples ẑ1^j from the filtering distribution π̂1 using the modified prediction distributions (45). This is the idea behind the optimal proposal particle filter (Doucet et al. 2001, Arulampalam et al. 2002, Fearnhead and Künsch 2018) and provides an implementation of approach (B) as introduced earlier.
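The following Python sketch implements one such cycle of approach (B) for the Gaussian model error setting of section 2.1.1; it is only an illustration under the stated Gaussian assumptions, with function and variable names of our own choosing, and Ψ is assumed to act on a single state vector.

```python
import numpy as np

rng = np.random.default_rng(1)

def optimal_proposal_step(Z0, Psi, H, B, R, gamma, y1):
    # One cycle of approach (B) for the Gaussian model error case (section
    # 2.2.1): resample at t = 0 with the weights gamma^i of (8), then draw
    # from the optimal kernels (45). Z0 has shape (M, Nz).
    M = Z0.shape[0]
    S = R + gamma * H @ B @ H.T              # innovation covariance
    Sinv = np.linalg.inv(S)
    K = gamma * B @ H.T @ Sinv               # equals B H^T (H B H^T + R/gamma)^{-1}, cf. (23)
    Bbar = B - K @ H @ B                     # cf. (24)
    L = np.linalg.cholesky(gamma * Bbar)
    # smoothing weights: y1 | z0^i ~ N(H Psi(z0^i), S)
    logw = np.array([-0.5 * (H @ Psi(z) - y1) @ Sinv @ (H @ Psi(z) - y1) for z in Z0])
    w = np.exp(logw - logw.max()); w /= w.sum()
    idx = rng.choice(M, size=M, p=w)         # resampling with replacement at t = 0
    # propagate with the optimal kernels q_hat(.|z0^i) = N(z1bar^i, gamma*Bbar)
    Z1 = np.array([Psi(Z0[i]) - K @ (H @ Psi(Z0[i]) - y1)
                   + L @ rng.standard_normal(Z0.shape[1]) for i in idx])
    return Z1                                # equally weighted samples from pi_hat_1
```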

Remark 2.11. We remark that backward simulation methods utilise (44) in order to address the smoothing
problem (2) in a sequential forward–backward manner. Since we are not interested in the general smoothing
problem in this paper, we refer the reader to the survey by Lindsten and Schön (2013) for more details.
Definition 2.12. It is not necessary that one chooses ψ1 as in (41). Instead one can pick a suitable twisting potential ψ1, as already introduced in subsection 2.1, such that
\[
l^\psi(z_1) := \frac{l(z_1)}{\beta\, \psi_1(z_1)}\, \pi_0[\psi_0]
\]
is well defined with ψ0 given by (16). The modified forward transition kernel is given by (15) and the modified initial distribution by
\[
\pi_0^\psi(z_0) := \frac{\psi_0(z_0)\, \pi_0(z_0)}{\pi_0[\psi_0]}.
\]
Finally, the smoothing distribution (46) becomes
\[
\hat\pi(z_0, z_1) = l^\psi(z_1)\, q_+^\psi(z_1|z_0)\, \pi_0^\psi(z_0). \tag{48}
\]

Remark 2.13. As mentioned before, the choice (41) implies l^ψ = const and leads to the well–known optimal proposal density for particle filters. The more general formulation (48) has recently been explored and expanded by Guarniero et al. (2017) and Heng et al. (2018) in order to derive efficient proposal densities for the general smoothing problem (2). Within the simplified formulation (48), such approaches reduce to a change of measure from π0 to π0^ψ at t0, followed by a forward transition according to q+^ψ and a subsequent reweighting by the modified likelihood l^ψ at t1; hence they lead to particle filters that combine approaches (A) and (B) as introduced earlier.

2.2.1 Gaussian model errors (cont.)


We return to the discrete–time process (21) and assume a Gaussian measurement error leading to a Gaussian likelihood
\[
l(z_1) \propto \exp\left( -\frac{1}{2} (Hz_1 - y_1)^{\rm T} R^{-1} (Hz_1 - y_1) \right).
\]
We set ψ1 = l/β in order to derive the optimal forward kernel for the associated smoothing/filtering problem. Following the discussion from section 2.1.1, this leads to the modified transition kernels
\[
\hat q_+(z_1|z_0^i) := {\rm n}(z_1; \bar z_1^i, \gamma \bar B)
\]
with B̄ and K defined by (24) and (23), respectively, and
\[
\bar z_1^i := \Psi(z_0^i) - K(H\Psi(z_0^i) - y_1).
\]
The smoothing distribution π̂0 is given by
\[
\hat\pi_0(z_0) = \frac{1}{M} \sum_{i=1}^{M} \gamma^i\, \delta(z_0 - z_0^i)
\]
with coefficients
\[
\gamma^i \propto \exp\left( -\frac{1}{2} (H\Psi(z_0^i) - y_1)^{\rm T} (R + \gamma HBH^{\rm T})^{-1} (H\Psi(z_0^i) - y_1) \right).
\]
It is easily checked that, indeed,
\[
\hat\pi_1(z_1) = \int q_+^\psi(z_1|z_0)\, \hat\pi_0(z_0)\, \mathrm{d}z_0.
\]
The results from this subsection have been used in simplified form in example 2.2 in order to compute (13). We also note that a non–optimal, i.e., ψ1(z1) ≠ l(z1)/β, but Gaussian choice for ψ1 leads to a Gaussian l^ψ, and the transition kernels q+^ψ(z1|z0^i) in (48) remain Gaussian as well. This is in contrast to the Schrödinger problem, which we discuss in the following section 2.3 and which leads to forward transition kernels of the form (67).
2.2.2 SDE models (cont.)
The likelihood l(z1) introduces a change of measure over path space z_{[0,1]} ∈ C from the prediction measure Q with marginals πt to the smoothing measure P̂ via the Radon–Nikodym derivative
\[
\frac{\mathrm{d}\hat P}{\mathrm{d}Q}\Big|_{z_{[0,1]}} = \frac{l(z_1)}{\beta}. \tag{49}
\]
We denote the marginal distributions of the smoothing measure P̂ by π̂t. It is also known that the density ratio
\[
\psi_t(z) := \frac{\hat\pi_t(z)}{\pi_t(z)} \tag{50}
\]
satisfies the backward Kolmogorov equation (Pavliotis 2014)
\[
\partial_t \psi_t = -L^\dagger \psi_t = -\nabla_z \psi_t \cdot f_t - \frac{\gamma}{2} \Delta_z \psi_t \tag{51}
\]
with final condition ψ1(z) = l(z)/β at t = 1.

Remark 2.14. Alternatively to solving the backward Kolmogorov equation (51) in ψt, the smoothing distributions π̂t can also be obtained by solving the backward SDE (30) with final condition Z1^- ∼ π̂1. This statement follows from the corresponding result (44) for Markovian transition kernels q+(z1|z0) and their backward transition kernel q−(z0|z1), as defined by (14), applied to the time–δt transition kernels q_{δt}(z'|z) of the forward SDE (25). More specifically,
\[
\psi_t(z) = \int q_{\delta t}(z'|z)\, \psi_{t+\delta t}(z')\, \mathrm{d}z'
\]
implies
\[
\hat\pi_t(z) = \int \frac{\pi_t(z)\, q_{\delta t}(z'|z)}{\pi_{t+\delta t}(z')}\, \hat\pi_{t+\delta t}(z')\, \mathrm{d}z',
\]
and the time–δt backward transition kernel at time t is provided by
\[
q_{-\delta t}(z|z', t) := \frac{\pi_t(z)\, q_{\delta t}(z'|z)}{\pi_{t+\delta t}(z')}.
\]
Finally, taking the limit δt → 0 leads to the (time–dependent) backward SDE (30).

Remark 2.15. Itô's formula,
\[
\mathrm{d}\psi_t = \left( \partial_t \psi_t + \frac{\gamma}{2} \Delta_z \psi_t \right) \mathrm{d}t + \nabla_z \psi_t \cdot \mathrm{d}Z_t^+,
\]
and the backward Kolmogorov equation (51) imply that
\[
\mathrm{d}\psi_t = \gamma^{1/2}\, \nabla_z \psi_t \cdot \mathrm{d}W_t^+
\]
along solutions of the forward SDE (25). In other words, the quantities ψt are materially advected along solutions of the forward SDE in expectation or, in the language of stochastic analysis, ψt is a martingale. Hence, by the martingale representation theorem, there exists a unique process Vt such that
\[
\mathrm{d}\psi_t = V_t \cdot \mathrm{d}W_t^+ \tag{52}
\]
and ψ1(z) = l(z)/β at t = 1. Here (52) has to be understood as a backward SDE in the sense of Carmona (2016), where the solution (ψt, Vt) is adapted to the past s ≤ t, whereas the solution Zt^- to the backward SDE (30) is adapted to the future s ≥ t.

Lemma 2.16. The modified forward SDE, which transports the smoothing distribution π̂0 into the filtering distribution π̂1, is given by
\[
\mathrm{d}Z_t^+ = \left( f_t(Z_t^+) + \gamma \nabla_z \log \psi_t(Z_t^+) \right) \mathrm{d}t + \gamma^{1/2}\, \mathrm{d}W_t^+ \tag{53}
\]
with Z0^+ ∼ π̂0, and the path measure generated by (53) is equivalent to the smoothing path measure P̂.

Proof. The backward SDE (30) with final condition Z1^- ∼ π̂1 leads to the associated Fokker–Planck equation
\begin{align*}
\partial_t \hat\pi_t &= -\nabla_z \cdot \left( \hat\pi_t (f_t - \gamma \nabla_z \log \pi_t) \right) - \frac{\gamma}{2} \Delta_z \hat\pi_t\\
&= -\nabla_z \cdot \left( \hat\pi_t (f_t - \gamma \nabla_z \log \pi_t + \gamma \nabla_z \log \hat\pi_t) \right) + \frac{\gamma}{2} \Delta_z \hat\pi_t,
\end{align*}
and the second equality corresponds to the Fokker–Planck equation for the forward smoothing SDE (53) since ψt = π̂t/πt.

If one compares (53) with the controlled SDE formulation (32), then one finds that
\[
u_t(z) := \gamma \nabla_z \log \psi_t(z). \tag{54}
\]

Note that the initial distributions for (32) and (53) are different. We will reconcile this fact in the following
subsection by considering the associated Schrödinger problem (Föllmer and Gantert 1997, Leonard 2014, Chen
et al. 2014).

Remark 2.17. A variational characterisation of P̂ is given by the Donsker–Varadhan principle
\[
\hat P = \arg \inf_{P \ll Q} \left\{ -P[\log l] + {\rm KL}(P||Q) \right\}, \tag{55}
\]
i.e., the distribution P̂ is chosen such that the expected loss, −P[log l], is minimised subject to the penalty introduced by the Kullback–Leibler divergence with respect to the original path measure Q. Note that
\[
\inf_{P \ll Q} \left\{ P[-\log l] + {\rm KL}(P||Q) \right\} = -\log \beta \tag{56}
\]
with β = Q[l]. It also follows from (34) that (55) is equivalent to
\[
u^* := \arg \inf_u \left\{ -Q^u \left[ \log l - \frac{1}{2\gamma} \int_0^1 \|u_t\|^2\, \mathrm{d}t \right] \right\}. \tag{57}
\]
In other words, solving the smoothing problem is equivalent to an optimal control problem and P̂ = Q^{u*}. Furthermore, the optimal control law u_t^* is provided by (54). See Hartmann et al. (2017) for an in-depth discussion of variational formulations and their numerical implementation in the context of rare event simulations, for which it is generally assumed that π0(z) = δ(z − z0) in (3), i.e., the ensemble size is M = 1 when viewed within the context of this paper. See also Ruiz and Kappen (2017) for a discussion of the relation of smoothing to optimal control problems and their numerical implementation.

Remark 2.18. One can choose ψt differently from the choice made in (50) by changing the final condition for the backward Kolmogorov equation (51) to any suitable ψ1. As already discussed for twisted discrete–time smoothing, such modifications give rise to alternative representations of the smoothing distribution P̂ in terms of modified forward SDEs, likelihoods and initial distributions. See Kappen and Ruiz (2016) and Ruiz and Kappen (2017) for an application of these ideas to importance sampling in the context of partially observed diffusion processes. More specifically, let ut denote some suitable control law for the forward SDE (32) with given initial distribution Z0^+ ∼ q0; then
\[
\frac{\mathrm{d}\hat P}{\mathrm{d}Q}\Big|_{z^u_{[0,1]}} = \frac{\mathrm{d}\hat P}{\mathrm{d}Q^u}\Big|_{z^u_{[0,1]}} \frac{\mathrm{d}Q^u}{\mathrm{d}Q}\Big|_{z^u_{[0,1]}},
\]
which, using (33) and (49), implies
\[
\frac{\mathrm{d}\hat P}{\mathrm{d}Q^u}\Big|_{z^u_{[0,1]}} = \frac{l(z_1^u)\, \pi_0(z_0^u)}{\beta\, q_0(z_0^u)} \exp\left( -\frac{1}{2\gamma} \int_0^1 \left( \|u_t\|^2\, \mathrm{d}t + 2\gamma^{1/2}\, u_t \cdot \mathrm{d}W_t^+ \right) \right). \tag{58}
\]
It follows from the previous remark that the control law (54) together with q0 = π̂0 leads to Q^u = P̂, i.e., the right–hand side of (58) becomes equal to one.

2.3 Schrödinger Problem
We now return to the twisting potential approach as introduced in section 2.2 with two important modifications, which lead to the, so called, Schrödinger problem. These modifications are, first, that the twisting potential ψ1 is determined implicitly and, second, that the modified transition kernel q+^ψ is applied to π0 instead of the tilted initial density π0^ψ as in (48). More specifically:

Definition 2.19. We seek the solution pair (ψ0, ψ1) of the boundary value problem
\begin{align}
\pi_0(z_0) &= \pi_0^\psi(z_0)\, \psi_0(z_0) \tag{59}\\
\hat\pi_1(z_1) &= \pi_1^\psi(z_1)\, \psi_1(z_1) \tag{60}\\
\pi_1^\psi(z_1) &= \int q_+(z_1|z_0)\, \pi_0^\psi(z_0)\, \mathrm{d}z_0 \tag{61}\\
\psi_0(z_0) &= \int q_+(z_1|z_0)\, \psi_1(z_1)\, \mathrm{d}z_1 \tag{62}
\end{align}
for given marginal (filtering) distributions π0 and π̂1 at t = 0 and t = 1, respectively. The required modified PDFs π0^ψ and π1^ψ are defined through (59) and (60), respectively. The solution (ψ0, ψ1) of the, so called, Schrödinger system (59)–(62) leads to the modified transition kernel
\[
q_+^*(z_1|z_0) := \psi_1(z_1)\, q_+(z_1|z_0)\, \psi_0(z_0)^{-1}, \tag{63}
\]
which satisfies
\[
\hat\pi_1(z_1) = \int q_+^*(z_1|z_0)\, \pi_0(z_0)\, \mathrm{d}z_0.
\]
The modified transition kernel q_+^*(z1|z0) couples the two marginal distributions π0 and π̂1 with the twisting potential ψ1 implicitly defined. In other words, q_+^* provides the transition density for going from the initial distribution (3) at time t0 to the filtering distribution at time t1 without the need for any reweighting (approach (C)). See Leonard (2014) and Chen et al. (2014) for more mathematical details on the Schrödinger problem.

Remark 2.20. Let us compare the Schrödinger system to the twisting potential approach (48) for the smoothing problem from subsection 2.2 in some more detail. First, note that the twisting potential approach to smoothing replaces (59) by
\[
\pi_0^\psi(z_0) = \pi_0(z_0)\, \psi_0(z_0)
\]
and (60) by
\[
\pi_1^\psi(z_1) = \pi_1(z_1)\, \psi_1(z_1)
\]
with ψ1 a given twisting potential. The associated ψ0 is determined by (62) as in the twisting approach. In both cases, the modified transition kernel is given by (63). Finally, (61) is replaced by the prediction step (4).

In order to solve the Schrödinger system for our given initial distribution (3) and the associated filtering distribution π̂1, we make the ansatz
\[
\pi_0^\psi(z_0) = \frac{1}{M} \sum_{i=1}^{M} \alpha^i\, \delta(z_0 - z_0^i), \qquad \sum_{i=1}^{M} \alpha^i = M.
\]
This ansatz together with (59)–(62) immediately implies
\[
\psi_0(z_0^i) = \frac{1}{\alpha^i}, \qquad \pi_1^\psi(z_1) = \frac{1}{M} \sum_{i=1}^{M} \alpha^i\, q_+(z_1|z_0^i),
\]
as well as
\[
\psi_1(z_1) = \frac{\hat\pi_1(z_1)}{\pi_1^\psi(z_1)} = \frac{l(z_1)}{\beta} \frac{\frac{1}{M}\sum_{i=1}^M q_+(z_1|z_0^i)}{\frac{1}{M}\sum_{i=1}^M \alpha^i\, q_+(z_1|z_0^i)}. \tag{64}
\]

Hence we arrive at the equations
\begin{align}
\frac{1}{\alpha^j} = \psi_0(z_0^j) &= \int \psi_1(z_1)\, q_+(z_1|z_0^j)\, \mathrm{d}z_1 \tag{65}\\
&= \int \frac{l(z_1)}{\beta} \frac{\sum_{i=1}^M q_+(z_1|z_0^i)}{\sum_{i=1}^M \alpha^i\, q_+(z_1|z_0^i)}\, q_+(z_1|z_0^j)\, \mathrm{d}z_1 \tag{66}
\end{align}
for j = 1,...,M. These M equations have to be solved for the M unknown coefficients α^j. In other words, the Schrödinger problem becomes finite–dimensional in the context of this paper. More specifically:

Lemma 2.21. The forward Schrödinger transition kernel (63) is given by
\begin{align}
q_+^*(z_1|z_0^i) &= \alpha^i\, q_+(z_1|z_0^i)\, \frac{\hat\pi_1(z_1)}{\pi_1^\psi(z_1)} \nonumber\\
&= \frac{\alpha^i \sum_{j=1}^M q_+(z_1|z_0^j)}{\sum_{j=1}^M \alpha^j\, q_+(z_1|z_0^j)}\, \frac{l(z_1)}{\beta}\, q_+(z_1|z_0^i) \tag{67}
\end{align}
for each particle z0^i with the coefficients α^j, j = 1,...,M, defined by (65)–(66).


Proof. Because of (65)–(66), the forward transition kernels (67) satisfy
\[
\int q_+^*(z_1|z_0^i)\, \mathrm{d}z_1 = 1 \tag{68}
\]
for all i = 1,...,M, and
\begin{align}
\int q_+^*(z_1|z_0)\, \pi_0(z_0)\, \mathrm{d}z_0 &= \frac{1}{M} \sum_{i=1}^{M} q_+^*(z_1|z_0^i) \nonumber\\
&= \frac{1}{M} \sum_{i=1}^{M} \alpha^i\, q_+(z_1|z_0^i)\, \frac{\hat\pi_1(z_1)}{\frac{1}{M}\sum_{j=1}^M \alpha^j\, q_+(z_1|z_0^j)} \nonumber\\
&= \hat\pi_1(z_1) \tag{69}
\end{align}
as desired.

Numerical implementations will be discussed in section 3. Note that knowledge of the normalising constant β is not required a priori for solving (65)–(66) since it appears only as a common scaling factor.
We note that the coefficients {α^j} together with the associated potential ψ1 from the Schrödinger system provide the optimally twisted prediction kernel (15) with respect to the filtering distribution π̂1, i.e., one sets ψ0(z0^i) = 1/α^i in (17) and defines the potential ψ1 by (64).
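Although the numerical treatment is deferred to section 3, the finite-dimensional structure of (65)–(66) already suggests a simple Monte Carlo fixed-point iteration for the coefficients α^j. The following Python sketch implements one plausible such iteration (function names are ours; q_sample, q_dens and lik are assumed given); it is not the scheme of section 3.4.1, which operates on a discrete Markov chain instead.

```python
import numpy as np

def schroedinger_alpha(Z0, q_sample, q_dens, lik, K=100, iters=50):
    # Monte Carlo fixed-point iteration for the coefficients alpha^j of
    # (65)-(66); q_sample(z0) draws from q_+(.|z0), q_dens(z1, z0) evaluates
    # q_+(z1|z0), and lik evaluates the likelihood l at a single state.
    M = len(Z0)
    Z1 = [[q_sample(z0) for _ in range(K)] for z0 in Z0]   # K proposals per particle
    Q = np.array([[[q_dens(Z1[j][k], Z0[i]) for i in range(M)]
                   for k in range(K)] for j in range(M)])   # Q[j,k,i] = q_+(z1^{j,k}|z0^i)
    Lv = np.array([[lik(Z1[j][k]) for k in range(K)] for j in range(M)])
    alpha = np.ones(M)
    for _ in range(iters):
        # psi_1 of (64), up to the constant factor 1/beta, at the proposals
        psi1 = Lv * Q.sum(axis=2) / (Q @ alpha)
        alpha = 1.0 / psi1.mean(axis=1)   # Monte Carlo version of (65)-(66)
        alpha *= M / alpha.sum()          # enforce the ansatz constraint sum(alpha) = M
    return alpha
```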

Remark 2.22. The Schrödinger problem is closely linked to optimal transportation (Cuturi 2013, Leonard 2014, Chen et al. 2014). For example, consider the Gaussian transition kernel (22) with Ψ(z) = z and B = I. Then the solution (67) of the associated Schrödinger problem of coupling π0 and π̂1 reduces to the solution π* of the associated optimal transport problem
\[
\pi^* = \arg \inf_{\pi \in \Pi} \int\!\!\int \|z_0 - z_1\|^2\, \pi(z_0, z_1)\, \mathrm{d}z_0\, \mathrm{d}z_1
\]
in the limit γ → 0. Here a joint PDF π(z0, z1) is an element of Π if
\[
\int \pi(z_0, z_1)\, \mathrm{d}z_1 = \pi_0(z_0), \qquad \int \pi(z_0, z_1)\, \mathrm{d}z_0 = \hat\pi_1(z_1).
\]
2.3.1 SDE models (cont.)
At the level of SDEs, Schrödinger's problem amounts to continuously bridging the given initial PDF π0 with the PDF π̂1 at final time using an appropriate modification of the stochastic process Z^+_{[0,1]} ∼ Q defined by the forward SDE (25) with initial distribution π0 at t = 0. The desired modified stochastic process P* is defined as the minimiser of
\[
L(\tilde P) := {\rm KL}(\tilde P||Q)
\]
subject to the constraint that the marginal distributions π̃t of P̃ at times t = 0 and t = 1 satisfy π0 and π̂1, respectively (Föllmer and Gantert 1997, Leonard 2014, Chen et al. 2014).

Remark 2.23. We note that the variational principle (55), characterising the smoothing path measure P̂, can be replaced by
\[
P^* = \arg \inf_{\tilde P \in \Pi} \left\{ -\tilde\pi_1[\log l] + {\rm KL}(\tilde P||Q) \right\}
\]
with
\[
\Pi = \{ \tilde P \ll Q : \tilde\pi_1 = \hat\pi_1,\; \tilde\pi_0 = \pi_0 \}
\]
in the context of Schrödinger's problem. The associated
\[
-\log \beta^* := \inf_{\tilde P \in \Pi} \left\{ -\tilde\pi_1[\log l] + {\rm KL}(\tilde P||Q) \right\} = -\hat\pi_1[\log l] + {\rm KL}(P^*||Q)
\]
can be viewed as a generalisation of (56) and gives rise to a generalised evidence β*, which could be used for model comparison and parameter estimation.

The Schrödinger process P* corresponds to a Markovian process across the whole time domain [0, 1] (Leonard 2014, Chen et al. 2014). More specifically, consider the controlled forward SDE (32) with initial conditions
\[
Z_0^+ \sim \pi_0
\]
and a given control law ut for t ∈ [0, 1]. Let P^u denote the path measure associated to this process. Then one can find a time–dependent potential ψt and an associated control (54) such that the marginal of the associated path measure P^u at time t = 1 satisfies
\[
\pi_1^u = \hat\pi_1
\]
and, more generally,
\[
P^* = P^u.
\]
The desired potential ψt in (54) can, for example, be obtained as follows. Let (ψ0, ψ1) denote the solution of the associated Schrödinger system (59)–(62), where q+(z1|z0) now denotes the time–one forward transition kernel of (25). Then ψt in (54) is the solution of the backward Kolmogorov equation (51) with prescribed ψ1 at final time.

Remark 2.24. Given the solution (ψ0, ψ1) and the implied PDF π̃0^+ := π0^ψ = π0/ψ0 of the Schrödinger system (59)–(62), let π̃t^+, t ≥ 0, denote the marginals of the forward SDE (25) with Z0^+ ∼ π̃0^+. Furthermore, consider the backward SDE (30) with drift term
\[
b_t(z) = f_t(z) - \gamma \nabla_z \log \tilde\pi_t^+(z) \tag{70}
\]
and final time condition Z1^- ∼ π̂1. Then the choice of π̃0^+ ensures that Z0^- ∼ π0. Furthermore, the desired control in (32) is provided by
\[
u_t = \gamma \nabla_z \log \frac{\tilde\pi_t^-}{\tilde\pi_t^+},
\]
where π̃t^- denotes the marginal distributions of the backward SDE (30) with drift term (70) and π̃1^- = π̂1. We will return to this reformulation of the Schrödinger problem in section 3 when considering it as the limit of a sequence of smoothing problems.

Remark 2.25. The solution to the Schrödinger problem for linear SDEs and Gaussian marginal distributions
has been discussed in detail by Chen, Georgiou and Pavon (2016).

2.3.2 Discrete measures
We finally discuss the Schrödinger problem in the context of finite state Markov chains in some more detail.
These results will be needed in the following sections on the numerical implementation of the Schrödinger
approach to sequential data assimilation.

Remark 2.26. Let u ∈ R^N; then D(u) ∈ R^{N×N} denotes the diagonal matrix with entries (D(u))_{ii} = u_i, i = 1,...,N. We also denote the N × 1 vector of ones by 1 = (1,...,1)^T ∈ R^N.

Let us consider an example which will be closely related to the discussion in the following section 3. We are given a bi–stochastic matrix Q ∈ R^{L×M} with all entries satisfying q_{lj} > 0 and two discrete probability measures represented by vectors p1 ∈ R^L and p0 ∈ R^M, respectively. Again we assume that all entries in p1 and p0 are strictly positive for simplicity. We introduce the set of all bi–stochastic L × M matrices with those distributions as marginals, i.e.
\[
\Pi := \left\{ P \in R^{L \times M} : p_{lj} \geq 0,\; P^{\rm T} 1_L = p_0,\; P 1_M = p_1 \right\}. \tag{71}
\]
Solving Schrödinger's system (59)–(62) corresponds to finding two non–negative vectors u ∈ R^L and v ∈ R^M such that
\[
P^* := D(u)\, Q\, D(v)^{-1} \in \Pi.
\]
It turns out that P* is uniquely determined and minimises the Kullback–Leibler divergence between all P ∈ Π and the reference matrix Q, i.e.
\[
P^* = \arg \min_{P \in \Pi} {\rm KL}(P||Q), \qquad {\rm KL}(P||Q) := \sum_{l,j} p_{lj} \log \frac{p_{lj}}{q_{lj}}. \tag{72}
\]
See Peyre and Cuturi (2018) for more details.

Remark 2.27. If one makes the ansatz
\[
p_{lj} = \frac{u_l\, q_{lj}}{v_j},
\]
then the minimisation problem (72) becomes equivalent to
\[
P^* = \arg \min_{P \in \Pi} \sum_{l,j} p_{lj} (\log u_l - \log v_j)
\]
subject to the constraints
\[
P 1_M = D(u)\, Q\, D(v)^{-1} 1_M = p_1, \qquad P^{\rm T} 1_L = D(v)^{-1} Q^{\rm T} D(u) 1_L = p_0.
\]
Note that these constraints determine u > 0 and v > 0 up to a common scaling factor. Hence (72) can be reduced to finding (u, v) > 0 such that
\[
u^{\rm T} 1_L = 1, \qquad P 1_M = p_1, \qquad P^{\rm T} 1_L = p_0.
\]
Hence we have shown that solving the Schrödinger system is equivalent to solving the minimisation problem (72) for discrete measures. It holds that
\[
\min_{P \in \Pi} {\rm KL}(P||Q) = p_1^{\rm T} \log u - p_0^{\rm T} \log v.
\]

Definition 2.28. The Sinkhorn iteration
\begin{align}
u^{k+1} &:= D(P^k 1_M)^{-1} p_1, \tag{73}\\
v^{k+1} &:= D(p_0)^{-1} (D(u^{k+1})\, P^k)^{\rm T} 1_L, \tag{74}\\
P^{k+1} &:= D(u^{k+1})\, P^k\, D(v^{k+1})^{-1}, \tag{75}
\end{align}
with initial P^0 = Q ∈ R^{L×M}, provides an algorithm for computing P*, i.e.
\[
\lim_{k\to\infty} P^k = P^*.
\]
It follows that
\[
\lim_{k\to\infty} u^k = 1_L, \qquad \lim_{k\to\infty} v^k = 1_M.
\]
See Cuturi (2013) for a computationally efficient and robust implementation of the Sinkhorn iteration and Peyre and Cuturi (2018) for a convergence proof.
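A minimal numpy implementation of the iteration (73)–(75) might look as follows; it is a plain textbook version without the stabilisation tricks discussed by Cuturi (2013).

```python
import numpy as np

def sinkhorn(Q, p1, p0, iters=1000):
    # Sinkhorn iteration (73)-(75): given Q > 0 and marginals p1 (length L)
    # and p0 (length M), iterate towards P* = D(u) Q D(v)^{-1} in Pi of (71).
    P = Q.copy()
    for _ in range(iters):
        u = p1 / P.sum(axis=1)                   # (73): D(P^k 1_M)^{-1} p1
        v = (u[:, None] * P).sum(axis=0) / p0    # (74)
        P = u[:, None] * P / v[None, :]          # (75)
    return P
```

For the setting of definition 3.1 below one would call, for example, sinkhorn(Q, np.ones(L)/L, np.ones(M)/M).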

3 Numerical methods
After having summarised the relevant mathematical foundation for prediction, filtering and smoothing, and the
Schrödinger problem, we now discuss numerical approximations suitable for ensemble based data assimilation.
It is clearly impossible to cover all available methods and we will focus on a selection of approaches which are
built around the idea of optimal transport, ensemble transform methods and Schrödinger systems. We will also
focus on methods that can be applied or extended to problems with high–dimensional state spaces even though
we will not cover this topic in this survey. See Reich and Cotter (2015), Van Leeuwen (2015), and Asch et al.
(2017) instead.

3.1 Prediction
Generating samples from the forecast distributions q+(·|z0^i) is in most cases straightforward. The computational expense can, however, vary dramatically, which impacts on the choice of algorithms for sequential data assimilation. We demonstrate in this subsection how samples from the prediction PDF π1 can be used to construct an associated finite state Markov chain that transforms π0 into an empirical approximation of π1.

Definition 3.1. Let us assume that we have L ≥ M independent and balanced samples z1^l from the M forecast distributions q+(·|z0^j), j = 1,...,M. We introduce the L × M matrix Q with entries
\[
q_{lj} := q_+(z_1^l|z_0^j). \tag{76}
\]
We now consider the associated bi–stochastic matrix P* ∈ R^{L×M}, as defined by (72), with the two probability vectors in (71) given by p1 = 1_L/L ∈ R^L and p0 = 1_M/M ∈ R^M, respectively. The finite state Markov chain
\[
Q_+ := M\, P^* \tag{77}
\]
provides a sample–based approximation to the forward transition kernel q+(z'|z).
More precisely, the ith column of Q+ provides an empirical approximation to q+(·|z0^i). Also note that the initial distribution π0 gives rise to the probability vector p0 := 1_M/M ∈ R^M and the Markov chain Q+ satisfies
\[
Q_+ p_0 = p_1 = \frac{1}{L} 1_L,
\]
which is in agreement with the fact that the z1^l are equally weighted samples from the forecast PDF π1.
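As an illustration, the following sketch assembles Q from (76) for the Gaussian kernel (22) with B = I and bi-normalises it with the sinkhorn routine sketched in section 2.3.2; the function names are ours and Ψ is assumed to act on the rows of z0. Note that the omitted Gaussian normalisation constant is common to all entries and does not affect the minimiser of (72).

```python
import numpy as np

def forward_chain(z1, z0, Psi, gamma):
    # Build Q of (76) for q_+(z1|z0) = n(z1; Psi(z0), gamma*I) from L forecast
    # samples z1 (shape (L, Nz)) and M particles z0 (shape (M, Nz)), then
    # compute Q_+ of (77) via the Sinkhorn iteration (sketch from section 2.3.2).
    L, M = len(z1), len(z0)
    d2 = ((z1[:, None, :] - Psi(z0)[None, :, :])**2).sum(axis=2)
    Q = np.exp(-0.5 * d2 / gamma)     # kernel values up to a common constant
    P = sinkhorn(Q, np.ones(L) / L, np.ones(M) / M)
    return M * P                      # Q_+ of (77)
```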

Remark 3.2. The associated backward transition kernel Q− ∈ R^{M×L} satisfies
\[
Q_- D(p_1) = (Q_+ D(p_0))^{\rm T}
\]
and, hence, is given by
\[
Q_- = (Q_+ D(p_0))^{\rm T} D(p_1)^{-1} = \frac{L}{M} Q_+^{\rm T}.
\]
It holds that
\[
Q_- p_1 = D(p_0)\, Q_+^{\rm T} 1_L = D(p_0)\, 1_M = p_0,
\]
as desired.
Definition 3.3. We can extend the concept of twisting to discrete Markov chains such as (77). A twisting potential ψ1 gives rise to a vector u ∈ R^L with normalised entries
\[
u_l = \frac{\psi_1(z_1^l)}{\sum_{k=1}^L \psi_1(z_1^k)},
\]
l = 1,...,L. The twisted finite state Markov kernel is now defined by
\[
Q_+^\psi := D(u)\, Q_+\, D(v)^{-1}, \qquad v := (D(u)\, Q_+)^{\rm T} 1_L, \tag{78}
\]
and it holds that 1_L^T Q_+^ψ = 1_M^T as required for a Markov kernel. The twisted prediction probability is given by
\[
p_1^\psi := Q_+^\psi\, p_0
\]
with p0 = 1_M/M. Furthermore, if we set p0 = v then p1^ψ = u.

3.1.1 Gaussian model errors (cont.)
The proposal density is given by (22) and it is easy to produce K > 1 samples from each of the M proposals q+(·|z0^j). Hence we can make the total sample size L = KM as large as desired. In order to produce M samples, z̃1^j, from a twisted finite state Markov chain (78), one can now take the M columns of Q+^ψ, draw an index l(j) ∈ {1,...,L} with probability (Q+^ψ)_{lj} for the jth column, and finally set
\[
\tilde z_1^j = z_1^l, \qquad l = l(j),
\]
j = 1,...,M. We will provide more details when discussing the Schrödinger problem in the context of Gaussian model errors in section 3.4.1.

3.1.2 SDE models (cont.)


The Euler–Maruyama method (Kloeden and Platen 1992)
\[
Z_{n+1}^+ = Z_n^+ + f_{t_n}(Z_n^+)\, \Delta t + (\gamma \Delta t)^{1/2}\, \Xi_n, \qquad \Xi_n \sim {\rm N}(0, I), \tag{79}
\]
n = 0,...,N−1, will be used for the numerical approximation of (25) with step-size ∆t := 1/N and tn = n∆t. In other words, we replace Z_{tn}^+ by its numerical approximation Zn^+. A numerical approximation (realisation) of the whole solution path z_{[0,1]} will be denoted by z_{0:N} = Z_{0:N}(ω) and can be computed recursively due to the Markov property of the Euler–Maruyama scheme. The marginal PDFs of Zn^+ are denoted by πn.
For any finite number of time-steps N, we can define a joint PDF π_{0:N} on U_N = R^{Nz×(N+1)} via
\[
\pi_{0:N}(z_{0:N}) \propto \exp\left( -\frac{1}{2\Delta t} \sum_{n=0}^{N-1} \|\eta_n\|^2 \right) \pi_0(z_0) \tag{80}
\]
with
\[
\eta_n := \gamma^{-1/2} \left( z_{n+1} - z_n - f_{t_n}(z_n)\, \Delta t \right) \tag{81}
\]
and ηn = ∆t^{1/2} Ξn(ω). Note that the joint PDF π_{0:N}(z_{0:N}) can also be expressed in terms of z0 and η_{0:N−1}. The numerical approximation of SDEs provides an example for which the increase in computational cost for producing L > M samples from the PDF π_{0:N} versus L = M is non-trivial, in general.
We now extend definition 3.1 to the case of temporally discretised SDEs in the form of (79).
Definition 3.4. Let us assume that we have L = M independent numerical solutions z_{0:N}^i of (79). We introduce an M × M matrix Q^n for each n = 1,...,N with entries
\[
q_{lj}^n = q_+(z_n^l|z_{n-1}^j) := {\rm n}\left(z_n^l;\; z_{n-1}^j + \Delta t\, f(z_{n-1}^j),\; \gamma \Delta t\, I\right).
\]
With each Q^n we associate a finite state Markov chain Q_+^n, as defined by (77) for general transition densities q+ in definition 3.1. An approximation of the Markov transition from time t0 = 0 to t1 = 1 is now provided by
\[
Q_+ := \prod_{n=1}^{N} Q_+^n. \tag{82}
\]

Remark 3.5. The approximation (77) can be related to the diffusion map approximation of the infinitesimal generator of the Brownian dynamics
\[
\mathrm{d}Z_t^+ = -\nabla_z U(Z_t^+)\, \mathrm{d}t + \sqrt{2}\, \mathrm{d}W_t^+ \tag{83}
\]
with potential U(z) = −log π*(z) in the following sense. First note that π* is invariant under the associated Fokker–Planck equation (27) with operator L given by
\[
L\pi = \nabla_z \cdot \left( \pi^* \nabla_z \frac{\pi}{\pi^*} \right).
\]
Let z^i, i = 1,...,M, denote M samples from the invariant PDF π* and define the symmetric matrix Q ∈ R^{M×M} with entries
\[
q_{lj} = {\rm n}(z^l; z^j, 2\Delta t\, I).
\]
Then the associated (symmetric) matrix (77), as introduced in definition 3.1, provides a discrete approximation to the evolution of a probability vector p0 ∝ π0/π* over a time–interval ∆t and, hence, to the semigroup operator e^{∆t L†} with the dual operator of L given by
\[
L^\dagger g = \frac{1}{\pi^*} \nabla_z \cdot (\pi^* \nabla_z g). \tag{84}
\]
We formally obtain
\[
L^\dagger \approx \frac{Q_+ - I}{\Delta t} \tag{85}
\]
for ∆t sufficiently small. The symmetry of Q+ reflects the fact that L† is self–adjoint with respect to the weighted inner product
\[
\langle f, g \rangle_{\pi^*} = \int f(z)\, g(z)\, \pi^*(z)\, \mathrm{d}z.
\]
See Harlim (2018) for a discussion of alternative diffusion map approximations to the infinitesimal generator L† and appendix A for an application to the feedback particle filter formulation of continuous–time data assimilation.

We also consider the discretisation
\[
Z_{n+1}^+ = Z_n^+ + \left( f_{t_n}(Z_n^+) + u_{t_n}(Z_n^+) \right) \Delta t + (\gamma \Delta t)^{1/2}\, \Xi_n, \tag{86}
\]
n = 0,...,N−1, of the controlled SDE (32) with associated PDF π^u_{0:N} defined by
\[
\pi_{0:N}^u(z_{0:N}^u) \propto \exp\left( -\frac{1}{2\Delta t} \sum_{n=0}^{N-1} \|\eta_n^u\|^2 \right) \pi_0(z_0^u), \tag{87}
\]
where
\[
\eta_n^u := \gamma^{-1/2} \left( z_{n+1}^u - z_n^u - (f_{t_n}(z_n^u) + u_{t_n}(z_n^u))\, \Delta t \right).
\]
Here z^u_{0:N} denotes a realisation of the discretisation (86) with control laws u_{t_n}. One finds that
\begin{align*}
\frac{1}{2\Delta t} \|\eta_n^u\|^2 &= \frac{1}{2\Delta t} \|\eta_n\|^2 - \frac{1}{\gamma^{1/2}}\, u_{t_n}(z_n^u)^{\rm T} \eta_n + \frac{\Delta t}{2\gamma} \|u_{t_n}(z_n^u)\|^2\\
&= \frac{1}{2\Delta t} \|\eta_n\|^2 - \frac{1}{\gamma^{1/2}}\, u_{t_n}(z_n^u)^{\rm T} \eta_n^u - \frac{\Delta t}{2\gamma} \|u_{t_n}(z_n^u)\|^2
\end{align*}
and, hence,
\[
\frac{\pi_{0:N}^u(z_{0:N}^u)}{\pi_{0:N}(z_{0:N}^u)} = \exp\left( \frac{1}{2\gamma} \sum_{n=0}^{N-1} \left( \|u_{t_n}(z_n^u)\|^2\, \Delta t + 2\gamma^{1/2}\, u_{t_n}(z_n^u)^{\rm T} \eta_n^u \right) \right), \tag{88}
\]
which provides a discrete version of (33) since η_n^u = ∆t^{1/2} Ξn(ω) are increments of Brownian motion over time intervals of length ∆t.
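The following Python sketch propagates an ensemble with the controlled discretisation (86) and accumulates the logarithm of the inverse of the ratio (88), so that the returned ensemble can be used as a weighted sample under the law of the uncontrolled dynamics; f and u are assumed given and vectorised over particles, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(3)

def controlled_em(Z0, f, u, gamma, N=100):
    # Controlled Euler-Maruyama steps (86) plus the inverse of the discrete
    # Girsanov weights (88); Z0 has shape (M, Nz).
    dt = 1.0 / N
    Z = Z0.copy()
    logw = np.zeros(len(Z))
    for n in range(N):
        t = n * dt
        xi = np.sqrt(dt) * rng.standard_normal(Z.shape)   # eta_n^u = dt^{1/2} Xi_n
        un = u(t, Z)
        # inverse of (88): subtract (1/2gamma)(||u||^2 dt + 2 gamma^{1/2} u . eta^u)
        logw -= (np.sum(un**2, axis=1) * dt
                 + 2 * np.sqrt(gamma) * np.sum(un * xi, axis=1)) / (2 * gamma)
        Z = Z + (f(t, Z) + un) * dt + np.sqrt(gamma) * xi
    w = np.exp(logw - logw.max())
    return Z, w * len(Z) / w.sum()   # weights normalised as in (20)
```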

Remark 3.6. Instead of discretising the forward SDE (25) in order to produce samples from the prediction PDF π1, one can also start from the mean field formulation (31) and its time discretisation, e.g.
\[
z_{n+1}^i = z_n^i + \left( f_{t_n}(z_n^i) + u_{t_n}(z_n^i) \right) \Delta t \tag{89}
\]
for i = 1,...,M and
\[
u_{t_n}(z) = -\frac{\gamma}{2} \nabla_z \log \tilde\pi_n(z).
\]
Here π̃n stands for an approximation to the marginal PDF π_{t_n} based on the available samples z_n^i, i = 1,...,M. A simple approximation is obtained by the Gaussian PDF
\[
\tilde\pi_n(z) = {\rm n}(z; \bar z_n, P_n)
\]
with empirical mean
\[
\bar z_n = \frac{1}{M} \sum_{i=1}^{M} z_n^i
\]
and empirical covariance matrix
\[
P_n = \frac{1}{M-1} \sum_{i=1}^{M} (z_n^i - \bar z_n)(z_n^i - \bar z_n)^{\rm T}.
\]
The system (89) then becomes
\[
z_{n+1}^i = z_n^i + \left( f_{t_n}(z_n^i) + \frac{\gamma}{2} P_n^{-1} (z_n^i - \bar z_n) \right) \Delta t,
\]
i = 1,...,M, and provides an example of an interacting particle approximation.
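A single step of this interacting particle system can be sketched in a few lines of Python; the factor γ/2 stems from u_{tn} = −(γ/2)∇z log π̃n with the Gaussian π̃n, and f is assumed given and vectorised.

```python
import numpy as np

def mean_field_step(Z, f, t, gamma, dt):
    # One step of the interacting particle approximation (89) with the
    # Gaussian mean-field control u = (gamma/2) P_n^{-1} (z - z_bar).
    M = Z.shape[0]
    z_bar = Z.mean(axis=0)
    Pn = (Z - z_bar).T @ (Z - z_bar) / (M - 1)       # empirical covariance
    U = 0.5 * gamma * np.linalg.solve(Pn, (Z - z_bar).T).T
    return Z + (f(t, Z) + U) * dt
```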

3.2 Filtering
Let us assume that we are given M samples, z1^i, from the prediction PDF obtained using the forward transition kernels q+(·|z0^i), i = 1,...,M. The likelihood function l(z) leads to importance weights
\[
w^i \propto l(z_1^i). \tag{90}
\]
We also normalise these importance weights such that (20) holds.


Remark 3.7. The model evidence β can be estimated from the samples, z1^i, and the likelihood l(z) as follows:
\[
\tilde\beta := \frac{1}{M} \sum_{i=1}^{M} l(z_1^i).
\]
If the likelihood is of the form
\[
l(z) \propto \exp\left( -\frac{1}{2} (y_1 - h(z))^{\rm T} R^{-1} (y_1 - h(z)) \right)
\]
and the prior distribution in y = h(z) can be approximated as being Gaussian with covariance
\[
P_{hh} := \frac{1}{M-1} \sum_{i=1}^{M} h(z_1^i)(h(z_1^i) - \bar h)^{\rm T}, \qquad \bar h := \frac{1}{M} \sum_{i=1}^{M} h(z_1^i),
\]
then the evidence can be approximated by
\[
\tilde\beta \approx \frac{1}{(2\pi)^{N_y/2} |P_{yy}|^{1/2}} \exp\left( -\frac{1}{2} (y_1 - \bar h)^{\rm T} P_{yy}^{-1} (y_1 - \bar h) \right)
\]
with
\[
P_{yy} := R + P_{hh}.
\]
Such an approximation has been used, for example, in Carrassi, Bocquet, Hannart and Ghil (2017). See also Reich and Cotter (2015) for more details on how to compute and use model evidence in the context of sequential data assimilation.
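A direct transcription of this Gaussian evidence approximation into Python might read as follows; h is assumed to be vectorised over the ensemble and the function name is ours.

```python
import numpy as np

def evidence_estimate(z1, h, R, y1):
    # Gaussian approximation of the evidence beta from remark 3.7;
    # z1 has shape (M, Nz), R is the observation error covariance.
    Hz = h(z1)                               # shape (M, Ny)
    h_bar = Hz.mean(axis=0)
    A = Hz - h_bar
    Pyy = R + A.T @ A / (len(z1) - 1)        # P_yy = R + P_hh
    r = y1 - h_bar
    Ny = len(y1)
    return np.exp(-0.5 * r @ np.linalg.solve(Pyy, r)) / \
           np.sqrt((2 * np.pi)**Ny * np.linalg.det(Pyy))
```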

Sequential data assimilation requires producing M equally weighted samples ẑ1^j ∼ π̂1 from the M weighted samples z1^i ∼ π1 with weights w^i. This is a standard problem in Monte Carlo integration and there are many possibilities for tackling it; among those are multinomial, residual, systematic and stratified resampling (Douc and Cappe 2005). Here we focus on those resampling methods which are based on a discrete Markov chain P ∈ R^{M×M} with the property that
\[
w = \frac{1}{M} P 1_M, \qquad w = \left( \frac{w^1}{M}, \ldots, \frac{w^M}{M} \right)^{\rm T}. \tag{91}
\]
The Markov property of P implies that P^T 1_M = 1_M. We denote the set of all Markov chains satisfying (91) by Π. Any Markov chain P ∈ Π can be used for resampling, but we seek the Markov chain P* ∈ Π which minimises the expected distance between the samples, i.e.
\[
P^* = \arg \min_{P \in \Pi} \sum_{i,j=1}^{M} p_{ij}\, \|z_1^i - z_1^j\|^2. \tag{92}
\]

Note that (92) is a special case of the optimal transport problem (36) with the involved probability measures
being discrete measures. Resampling can now be performed according to

    P[ẑ^j_1 = z^i_1] = p∗_ij   (93)

for j = 1, . . . , M.

Since it is known that (92) converges to (36) as M → ∞ (McCann 1995) and since (36) leads to a transformation
(38), the resampling step (93) has been replaced by

    ẑ^j_1 = Σ_{i=1}^M z^i_1 p∗_ij   (94)

in the so–called ensemble transform particle filter (ETPF) (Reich 2013, Reich and Cotter 2015). In other
words, the ETPF replaces resampling with probabilities p∗_ij by its mean (94) for each j = 1, . . . , M. The ETPF
leads to a biased but, in the limit M → ∞, consistent approximation to the resampling step.
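A self–contained sketch of the ETPF update (92)–(94), using scipy's linear programming solver as a stand-in for dedicated optimal transport codes; the weights are assumed to be normalised so that they sum to M:

```python
import numpy as np
from scipy.optimize import linprog

def etpf_transform(z1, w):
    """Solve (92) and apply the transformation (94); z1 is (M, d), the
    weights w sum to M. Returns the equally weighted ensemble (M, d)."""
    M = z1.shape[0]
    cost = ((z1[:, None, :] - z1[None, :, :]) ** 2).sum(-1).ravel()
    # equality constraints: sum_j p_ij = w^i and sum_i p_ij = 1
    A_rows = np.zeros((M, M * M))
    A_cols = np.zeros((M, M * M))
    for i in range(M):
        A_rows[i, i * M:(i + 1) * M] = 1.0   # row sums (index i*M + j)
        A_cols[i, i::M] = 1.0                # column sums
    A = np.vstack([A_rows, A_cols[:-1]])     # drop one redundant constraint
    b = np.concatenate([w, np.ones(M - 1)])
    res = linprog(cost, A_eq=A, b_eq=b, bounds=(0.0, None), method='highs')
    P = res.x.reshape(M, M)                  # P*, columns sum to one
    return (z1.T @ P).T                      # (94): hat z^j = sum_i z^i p*_ij

# usage: turn a weighted ensemble into an equally weighted one
rng = np.random.default_rng(2)
z1 = rng.normal(size=(20, 2))
w = np.exp(-0.5 * z1[:, 0] ** 2)
w *= 20.0 / w.sum()
z1_hat = etpf_transform(z1, w)
```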
The general formulation (94) with the coefficients p∗_ij chosen appropriately¹ leads to a large class of so–called
ensemble transform particle filters (Reich and Cotter 2015). Ensemble transform particle filters result,
in general, in biased and inconsistent but robust estimates which have found applications to high–dimensional
state space models (Evensen 2006, Vetra-Carvalho, van Leeuwen, Nerger, Barth, Altaf, Brasseur, Kirchgessner
and Beckers 2018) for which traditional particle filters fail due to the curse of dimensionality (Bengtsson, Bickel
and Li 2008). More specifically, the class of ensemble transform particle filters includes the popular ensemble
Kalman filters (EnKF) (Evensen 2006, Reich and Cotter 2015, Vetra-Carvalho et al. 2018, Carrassi, Bocquet,
Bertino and Evensen 2018) and so–called second–order accurate particle filters with coefficients p∗_ij in (94)
chosen such that the ensemble mean

    z̄_1 := (1/M) Σ_{i=1}^M w^i z^i_1

and the ensemble covariance matrix

    P̃ := (1/M) Σ_{i=1}^M w^i (z^i_1 − z̄_1)(z^i_1 − z̄_1)^T

are exactly reproduced by the transformed, equally weighted particles ẑ^j_1, j = 1, . . . , M, i.e.,

    (1/M) Σ_{j=1}^M ẑ^j_1 = z̄_1 ,   (1/(M−1)) Σ_{j=1}^M (ẑ^j_1 − z̄_1)(ẑ^j_1 − z̄_1)^T = P̃ .

See the survey paper by Vetra-Carvalho et al. (2018) and the paper by Acevedo et al. (2017) for more details.
A summary of the ensemble Kalman filter can be found in appendix C.

¹ The coefficients p∗_ij of an ensemble transform particle filter do not need to be non–negative and only satisfy Σ_{i=1}^M p∗_ij = 1
(Acevedo, de Wiljes and Reich 2017).
In addition, hybrid methods (Frei and Künsch 2013, Chustagulprom, Reich and Reinhardt 2016), which
bridge between classic particle filters and the EnKF, have recently been successfully applied to atmospheric
fluid dynamics (Robert, Leuenberger and Künsch 2018).

3.3 Smoothing
Recall that the joint smoothing distribution π̂(z_0, z_1) can be represented in the form (47) with modified transition
kernel (9) and smoothing distribution (7) at time t_0 with weights γ^i determined by (8).

Let us assume that it is possible to sample from q̂_+(z_1|z^i_0) and that the weights γ^i are available. Then we
can utilise (47) in sequential data assimilation as follows. One first resamples the z^i_0 at time t_0 using a discrete
Markov chain P ∈ R^{M×M} satisfying

    p̂_0 = (1/M) P 1_M ,   p̂_0 := γ   (95)

with γ defined in (11). Again optimal transportation can be used to identify a suitable P. More explicitly, if we
define Π as the set of all M × M Markov chains P which satisfy (95), then the Markov chain P∗ arising
from the associated optimal transport problem (92) can be used either for resampling, i.e.,

    P[z̃^j_0 = z^i_0] = p∗_ij ,

or in a transformation step

    ẑ^j_0 = Σ_{i=1}^M z^i_0 p∗_ij .

Once equally weighted samples ẑ^j_0, j = 1, . . . , M, from π̂_0 have been determined, the desired samples ẑ^i_1 from
π̂_1 are simply given by

    ẑ^i_1 := Ẑ^i_1(ω) ,   Ẑ^i_1 ∼ q̂_+(·|ẑ^i_0) ,

for i = 1, . . . , M.
The required transition kernels (9) are explicitly available for state space models with Gaussian model errors
and Gaussian likelihood functions. In many other cases, these kernels are not explicitly available and/or are
difficult to sample from. In such cases, one can resort to sample–based transition kernels.

For example, consider the twisted discrete Markov kernel (78) with twisting potential ψ_1(z) = l(z). The
vector v gives rise to a probability vector p̂_0 = v ∈ R^M and

    p̂_1 := Q^ψ_+ p̂_0   (96)

approximates the filtering distribution at time t_1. The Markov transition matrix Q^ψ_+ ∈ R^{L×M} together with p̂_0
provide an approximation to the smoothing kernel q̂_+(z_1|z_0) and to π̂_0, respectively.

The approximations Q^ψ_+ ∈ R^{L×M} and p̂_0 ∈ R^M can be used to first generate equally weighted samples
ẑ^i_0 ∈ {z^1_0, . . . , z^M_0} with distribution p̂_0 via, for example, resampling with replacement. If ẑ^i_0 = z^k_0 for an index
k = k(i) ∈ {1, . . . , M}, then

    P[ẑ^i_1 = z^l_1] = (Q^ψ_+)_{lk}

for each i = 1, . . . , M. The ẑ^i_1's are equally weighted samples from the discrete filtering distribution p̂_1, which
is an approximation to the continuous filtering PDF π̂_1. A sketch of this two–stage sampler is given below.
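The sketch assumes Q^ψ_+ is available as a column–stochastic L × M array and p̂_0 as a probability vector:

```python
import numpy as np

def smoothing_filter_samples(Q_psi, p0_hat, z0, z1, M, rng):
    """Draw equally weighted pairs (hat z_0^i, hat z_1^i) from the discrete
    smoothing/filtering approximation (96); Q_psi has column-stochastic
    entries, z0 is (M0, d) and z1 is (L, d)."""
    L = Q_psi.shape[0]
    # stage 1: resample with replacement at time t_0
    k = rng.choice(len(p0_hat), size=M, p=p0_hat)
    z0_hat = z0[k]
    # stage 2: propagate each resampled index through column k of Q_psi
    l = np.array([rng.choice(L, p=Q_psi[:, ki]) for ki in k])
    z1_hat = z1[l]
    return z0_hat, z1_hat
```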

Remark 3.8. One has to take computational complexity and robustness into account when deciding whether to
utilise methods from section 3.2 or from this section to advance M samples z^i_0 from the prior distribution π_0 into M
samples ẑ^i_1 from the posterior distribution π̂_1. While the methods from section 3.2 are easier to implement in
general, the methods of this section benefit from the fact that

    M > 1/‖γ‖² ≥ 1/‖w‖² ≥ 1 ,

where the importance weights γ ∈ R^M and w ∈ R^M are defined in (11) and (91), respectively. In other words,
the methods from this section lead to larger effective sample sizes (Liu 2001, Agapiou et al. 2017).

Remark 3.9. We mention that finding efficient methods for solving the more general smoothing problem (2)
is an active area of research. See, for example, the recent contributions by Guarniero et al. (2017) and Heng et
al. (2018) for discrete–time Markov processes and Kappen and Ruiz (2016) as well as Ruiz and Kappen (2017)
for smoothing in the context of SDEs. Ensemble transform methods of the form (94) can also be extended to the
general smoothing problem. See, for example, Evensen (2006) and Carrassi et al. (2018) for extensions of the
ensemble Kalman filter and Kirchgessner, Tödter, Ahrens and Nerger (2017) for an extension of the nonlinear
ensemble transform filter to the smoothing problem.

3.3.1 SDE models (cont.)

After discretisation in time, smoothing leads to a change from the predictive PDF (80) to

    π̂_{0:N}(z_{0:N}) := l(z_N) π_{0:N}(z_{0:N}) / π_{0:N}[l]
                      ∝ exp( −(1/(2∆t)) Σ_{n=0}^{N−1} ‖ξ_n‖² ) π_0(z_0) l(z_N)

with ξ_n given by (81), or, alternatively,

    π̂_{0:N}(z_{0:N}) / π_{0:N}(z_{0:N}) = l(z_N) / π_{0:N}[l] .

Remark 3.10. Efficient MCMC methods for sampling high–dimensional smoothing distributions can be found
in Beskos, Girolami, Lan, Farrell and Stuart (2017) and Beskos, Pinski, Sanz-Serna and Stuart (2011).
Improved sampling can also be achieved by using regularized Störmer-Verlet time-stepping methods (Reich and
Hundertmark 2011) in a hybrid Monte Carlo method (Liu 2001). See appendix B for more details.

3.4 Schrödinger Problem


Recall that the Schrödinger system (59)–(62) reduces in our context to solving equations (65)–(66) for the
unknown coefficients α^i, i = 1, . . . , M. In order to make this problem tractable, we need to replace the required
expectation values with respect to q_+(z_1|z^j_0) by Monte Carlo approximations. More specifically, let us assume
that we have L ≥ M balanced samples z^l_1 from the predictive distribution π_1. The associated L × M matrix Q
with entries (76) provides a discrete approximation to the underlying Markov process defined by q_+(z_1|z_0) and
initial PDF (3).

The importance weights in the associated approximation to the filtering distribution

    π̂_1(z) = (1/L) Σ_{l=1}^L w^l δ(z − z^l_1)

are given by (90), with the weights normalised such that

    Σ_{l=1}^L w^l = L .   (97)

Finding the coefficients {α^i} in (65)–(66) can now be reformulated as finding two vectors u ∈ R^L and v ∈ R^M
such that

    P∗ := D(u) Q D(v)^{−1}   (98)

satisfies P∗ ∈ Π with

    Π = { P ∈ R^{L×M} : p_lj ≥ 0 ,  Σ_{l=1}^L p_lj = 1 ,  (1/M) Σ_{j=1}^M p_lj = w^l/L } .   (99)

We note that the two conditions in (99) are discrete approximations to (68) and (69), respectively. The scaling
factor ψ_0 in (59) is approximated by the vector v up to a normalisation constant, while the vector u provides an
approximation to ψ_1 in (60). Finally, the desired approximations to the Schrödinger transition kernels q∗_+(z_1|z^i_0),
i = 1, . . . , M, are provided by the columns of P∗, i.e.,

    P[ẑ^i_1 = z^l_1] = p∗_li

characterises the desired samples ẑ^i_1, i = 1, . . . , M, from the filtering distribution π̂_1. See the following subsection
for more details.

The required vectors u and v can be computed using the iterative Sinkhorn algorithm (73)–(75) (Cuturi
2013, Peyre and Cuturi 2018).
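A minimal Sinkhorn iteration for the scaling problem (98)–(99) might look as follows; this is a sketch (production implementations typically work in log–space for numerical stability, cf. Peyre and Cuturi (2018)):

```python
import numpy as np

def sinkhorn(Q, w, n_iter=1000, tol=1e-10):
    """Find u, v such that P* = D(u) Q D(v)^{-1} lies in the set (99):
    columns of P* sum to one and (1/M) sum_j p*_lj = w^l/L."""
    L, M = Q.shape
    r = M * w / L                       # implied row sums of P*
    u = np.ones(L)
    v = np.ones(M)
    for _ in range(n_iter):
        P = u[:, None] * Q / v[None, :]
        u *= r / P.sum(axis=1)          # match the row sums
        P = u[:, None] * Q / v[None, :]
        col = P.sum(axis=0)
        if np.abs(col - 1.0).max() < tol:
            break
        v *= col                        # match the column sums
    return u[:, None] * Q / v[None, :], u, v
```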

Remark 3.11. The approximation (98) can be extended to an approximation of the Schrödinger forward transition
kernels (67) in the following sense. We use α^i = 1/v_i in (67) and note that the resulting approximation
satisfies (69) while (68) no longer holds exactly. Instead we find that

    u_l = (w^l/L) · 1 / ( (1/M) Σ_{j=1}^M q_+(z^l_1|z^j_0)/v_j )

in (98) and, therefore,

    ∫ q∗_+(z_1|z^i_0) dz_1 ≈ (1/L) Σ_{l=1}^L ( l(z^l_1)/β ) ( q_+(z^l_1|z^i_0)/v_i ) / ( (1/M) Σ_{j=1}^M q_+(z^l_1|z^j_0)/v_j )
                          ≈ (1/L) Σ_{l=1}^L w^l ( q_+(z^l_1|z^i_0)/v_i ) / ( (1/M) Σ_{j=1}^M q_+(z^l_1|z^j_0)/v_j ) = Σ_{l=1}^L p∗_li = 1

because of (99). Furthermore, one can use such approximations in combination with Monte Carlo sampling
methods which do not require normalised target PDFs.

Remark 3.12. Note that one can replace the forward transition kernel q_+(z_1|z_0) in (76) by any suitable twisted
prediction kernel (15). This results in a modified matrix Q in (98) and modified weights w^l in (99). However, the
resulting matrix (98) still provides an approximation to the Schrödinger problem.

3.4.1 Gaussian model errors (cont.)

One can easily generate L ≥ M i.i.d. samples z^l_1 from the predictive distribution (22), i.e.,

    Z^l_1 ∼ (1/M) Σ_{j=1}^M n(· ; Ψ(z^j_0), γB) ,

with the filtering distribution π̂_1 characterised through the importance weights (90).
We define the distance matrix D ∈ R^{L×M} with entries

    d_lj := (1/2) ‖z^l_1 − Ψ(z^j_0)‖²_B ,   ‖z‖²_B := z^T B^{−1} z ,

and the matrix Q ∈ R^{L×M} with entries

    q_lj := e^{−d_lj/γ} .
The Markov chain P∗ ∈ R^{L×M} is now given by

    P∗ = arg min_{P∈Π} KL(P||Q)

with the set Π defined by (99).

Once P∗ has been computed, the desired Schrödinger transitions from π_0 to π̂_1 can be represented as follows.
The Schrödinger transition kernels q∗_+(z_1|z^i_0) are approximated for each z^i_0 by

    q̃_+(z_1|z^i_0) := Σ_{l=1}^L p∗_li δ(z_1 − z^l_1) ,   i = 1, . . . , M .   (100)

The empirical measure in (100) converges weakly to the desired q∗_+(z_1|z^i_0) as L → ∞, and

    π̂_1(z_1) ≈ (1/M) Σ_{i=1}^M δ(z_1 − ẑ^i_1) ,

with

    ẑ^i_1 = Ẑ^i_1(ω) ,   Ẑ^i_1 ∼ q̃_+(· |z^i_0) ,   (101)

provides the desired approximation of π̂_1 by M equally weighted particles ẑ^i_1, i = 1, . . . , M.

We remark that (100) has been used to produce the Schrödinger transition kernels for example 2.2 and the
right panel of figure 4 in particular. More specifically, we have used M = 11 and L = 11000. Since the particles
z^l_1 ∈ R, l = 1, . . . , L, are distributed according to the predictive PDF π_1, a functional representation of q̃_+(z_1|z^i_0)
over all of R is provided by interpolating p∗_li onto R and multiplying this interpolated function by π_1(z).

For γ ≪ 1, the measure in (100) can also be approximated by a Gaussian measure with mean

    z̄^j_1 := Σ_{l=1}^L z^l_1 p∗_lj

and covariance matrix γB, i.e., we replace (101) by

    Ẑ^i_1 ∼ N(z̄^i_1, γB)

for i = 1, . . . , M.


Figure 5: Histograms produced from M = 200 Monte Carlo samples of the initial PDF π0 , the prediction PDF
b2 at time t = 2, and the smoothing PDF π
π2 at time t = 2, the filtering distribution π b0 for a Brownian particle
moving in a double well potential.

3.4.2 SDE (cont.)

One can also apply (98) with constraints (99) to approximate the Schrödinger problem associated with SDE
models. One typically uses L = M in this case and utilises (82) in place of Q in (98).

Example 3.13. We consider the scalar–valued motion of a Brownian particle in a double well potential, i.e.,

    dZ^+_t = ( Z^+_t − (Z^+_t)³ ) dt + γ^{1/2} dW^+_t   (102)

with γ = 0.5 and initial distribution Z_0 ∼ N(−1, 0.3). At time t = 2 we measure the location y = 1 with
measurement error variance R = 0.2. We simulate the dynamics using M = 200 particles and a time–step of
∆t = 0.01 in the Euler–Maruyama discretisation (79). One can find histograms produced from the Monte Carlo
samples in figure 5. The samples from the filtering and smoothing distributions are obtained by resampling
with replacement from the weighted distributions with weights given by (90). Next we compute (82) from the
M = 200 Monte Carlo samples of (102). Eleven out of the two hundred transition kernels from π_0 to π_2
(prediction problem) and from π_0 to π̂_2 (Schrödinger problem) are displayed in figure 6.
The Sinkhorn approach might require relatively large sample sizes M in order to lead to useful approximations.
Alternatively, we may assume that there is an approximative control term u^(0)_t with associated forward SDE

    dZ^+_t = f_t(Z^+_t) dt + u^(0)_t(Z^+_t) dt + γ^{1/2} dW^+_t ,   t ∈ [0, 1] ,   (103)

and Z^+_0 ∼ π_0. We denote the associated path measure by Q^(0). Girsanov's theorem implies that the Radon–
Nikodym derivative of Q with respect to Q^(0) is given by (compare (33))

    dQ/dQ^(0) |_{z^(0)_{[0,1]}} = exp(−V^(0)) ,

Figure 6: Left panel: Approximations of typical transition kernels from π_0 to π_2 under the Brownian
dynamics model (102). Right panel: Approximations of typical Schrödinger transition kernels from π_0 to
π̂_2. All approximations were computed using the Sinkhorn algorithm and by linear interpolation between the
M = 200 data points.

where V^(0) is defined through the stochastic integral

    V^(0) := (1/(2γ)) ∫_0^1 ( ‖u^(0)_t‖² dt + 2γ^{1/2} u^(0)_t · dW^+_t )

along solution paths z^(0)_{[0,1]} of (103). Because of

    dP̂/dQ^(0) |_{z^(0)_{[0,1]}} = ( dP̂/dQ |_{z^(0)_{[0,1]}} ) ( dQ/dQ^(0) |_{z^(0)_{[0,1]}} ) ∝ l(z^(0)_1) exp(−V^(0)) ,

we can now use (103) to importance sample from the filtering PDF π̂_1. The control u^(0)_t should be chosen such
that the variance in the modified likelihood function

    l^(0)(z^(0)_{[0,1]}) := l(z^(0)_1) exp(−V^(0))   (104)

gets reduced compared to the uncontrolled case u^(0)_t ≡ 0. In particular, the filter distribution π̂_1 at time t = 1
satisfies

    π̂_1(z^(0)_1) ∝ l^(0)(z^(0)_{[0,1]}) π^(0)_1(z^(0)_1) ,

where π^(0)_t, t ∈ (0, 1], denote the marginal PDFs generated by (103).
We now describe an algorithm for solving the associated Schrödinger problem. The desired optimal control
law is defined iteratively by

    u^(k)_t = u^(k−1)_t + γ ∇_z log ψ^(k)_t ,   k = 1, . . . ,   (105)

with u^(0)_t given. We denote the path measure induced by (32) with u_t = u^(k−1)_t and initial PDF π_0 by Q^(k).
The potentials ψ^(k)_t are now given as the solutions of the backward Kolmogorov equation

    ∂_t ψ^(k)_t = −L†^(k) ψ^(k)_t ,   L†^(k) g := ∇_z g · ( f_t + u^(k−1)_t ) + (γ/2) ∆_z g ,   (106)

with final time condition

    ψ^(k)_1(z) := π̂_1(z) / π^(k−1)_1(z) .   (107)

Here π^(k−1)_1 denotes the time–one marginal of Q^(k−1). Note that ψ^(k)_t needs to be determined only up to a
constant of proportionality since the associated control law is determined from ψ^(k)_t by (105). The recursion (105)
is stopped whenever the final time condition (107) is sufficiently close to a constant function.

Remark 3.14. One can replace (106) by any other method for solving the smoothing problem associated with the
SDE (32) with Z^+_0 ∼ π_0, control law u^(k−1)_t, and likelihood function l(z) = ψ^(k)_1(z).

One needs to restrict the class of possible control laws u^(k)_t in order to obtain computationally feasible
implementations in practice. For example, a simple class of control laws is provided by linear controls of the
form

    u^(k)_t(z) = −B^(k)_t ( z − m^(k)_t )

with appropriately chosen symmetric positive–definite matrices B^(k)_t and vectors m^(k)_t. Such approximations
can, for example, be obtained from the smoother extensions of ensemble transform methods mentioned earlier.
See also the recent work by Kappen and Ruiz (2016) and Ruiz and Kappen (2017) on numerical methods for
the SDE smoothing problem.

4 DA for continuous-time data


In this section, we focus on the continuous–time filtering problem over the time interval [0, 1]. The filtering
problem coincides in this case with the Schrödinger problem. We distinguish between smooth and non-smooth
data yt , t ∈ [0, 1].

4.1 Smooth data

We start from a forward SDE model (25) with associated path measure Q over the space of continuous functions
C. However, contrary to the previous sections, the likelihood l is defined along a whole solution path z_{[0,1]} as
follows:

    dP̂/dQ |_{z_{[0,1]}} ∝ l(z_{[0,1]}) ,   l(z_{[0,1]}) := exp( −∫_0^1 V_t(z_t) dt )

with the assumption that Q[l] < ∞ and V_t(z) ≥ 0. A specific example of a suitable V_t is provided by

    V_t(z) = (1/2) ‖h(z) − y_t‖² ,   (108)

where the data y_t ∈ R, t ∈ [0, 1], is a smooth function of time and h(z) is a forward operator connecting the
model states to the observations/data. This problem has, for example, been addressed by Fleming (1997) from
an optimal control perspective. Here we focus on a mean–field interpretation of a modified Fokker–Planck
equation.

Lemma 4.1. The marginal distributions π̂_t of P̂ satisfy the modified Fokker–Planck equation

    ∂_t π̂_t = Lπ̂_t − π̂_t ( V_t − π̂_t[V_t] )   (109)

with L defined by (26).

Proof. This can be seen by setting γ = 0 and f_t ≡ 0 in (25) for simplicity and by considering the incremental
change of measure induced by the likelihood, i.e.,

    π̂_{t+δt} / π̂_t ∝ e^{−V_t δt} ≈ 1 − V_t δt ,

and taking the limit δt → 0 under the constraint that π̂_t[1] = 1 is preserved.

Next we rewrite (109) in the form

    ∂_t π̂_t = Lπ̂_t + ∇_z · ( π̂_t ∇_z φ_t ) ,   (110)

where the potential φ_t : R^{N_z} → R satisfies the elliptic PDE

    ∇_z · ( π̂_t ∇_z φ_t ) = −π̂_t ( V_t − π̂_t[V_t] ) .   (111)

With (110) in place, we obtain the mean field equation

    dZ^+_t = ( f_t(Z^+_t) − ∇_z φ_t(Z^+_t) ) dt + γ^{1/2} dW^+_t .   (112)

The marginal distributions π^u_t of the controlled SDE (112) with

    u_t(z) = −∇_z φ_t(z)

agree with the marginals π̂_t of the path measure P̂.

The control u_t is not uniquely determined. For example, one can replace (111) by

    ∇_z · ( π_t M_t ∇_z φ_t ) = −π_t ( V_t − π_t[V_t] ) ,   (113)

where M_t is a symmetric positive definite matrix. More specifically, let us assume that π_t is Gaussian with
mean z̄_t and covariance matrix P_t and that h(z) is linear, i.e., h(z) = Hz. Then (113) can be solved analytically
for M_t = P_t with

    ∇_z φ_t(z) = (1/2) H^T ( Hz + H z̄_t − 2y_t ) .

The resulting mean field equation becomes

    dZ^+_t = ( f_t(Z^+_t) − (1/2) P_t H^T ( HZ^+_t + H z̄_t − 2y_t ) ) dt + γ^{1/2} dW^+_t ,   (114)

which gives rise to the ensemble Kalman–Bucy filter upon Monte Carlo discretisation (Bergemann and Reich
2012). See also section 5 below and appendix C.

Remark 4.2. The approach described in this subsection can also be applied to standard Bayesian inference
without model dynamics. More specifically, let us assume that we have samples z^i_0, i = 1, . . . , M, from a prior
distribution π_0 which we would like to transform into samples, z^i_1, from a posterior distribution

    π̂_1(z) := l(z) π_0(z) / π_0[l]

with likelihood l(z) = π(y|z). One can introduce a homotopy connecting π_0 with π̂_1, for example, via

    π_s(z) = l(z)^s π_0(z) / π_0[l^s]   (115)

with s ∈ [0, 1]. We find that

    ∂π_s/∂s = π_s ( log l − π_s[log l] ) .   (116)

We now seek a differential equation

    (d/ds) Z_s = u_s(Z_s)   (117)

with Z_0 ∼ π_0 such that its marginal distributions π_s satisfy (116). This condition, together with Liouville's
equation for the time–evolution of marginal densities under a differential equation (117), leads to

    −∇_z · ( π_s u_s ) = π_s ( log l − π_s[log l] ) .   (118)

In order to define u_s in (117) uniquely, we make the ansatz

    u_s(z) = −∇_z φ_s(z) ,   (119)

which leads to the elliptic PDE

    ∇_z · ( π_s ∇_z φ_s ) = π_s ( log l − π_s[log l] )   (120)

in the potential φ_s. The desired samples z^i_1 from π̂_1 are now obtained as the time–one solutions of (117) with
'control law' (119) satisfying (120) and initial conditions z^i_0, i = 1, . . . , M. There are many modifications of
this basic procedure (Daum and Huang 2011, Reich 2011, Moselhy and Marzouk 2012), with some of them
leading to explicit expressions for (117) such as, for example, for Gaussian PDFs (Bergemann and Reich 2010) and
Gaussian mixture PDFs (Reich 2012). We finally mention that the limit s → ∞ in (115) leads, formally, to the
PDF π_∞ = δ(z − z_ML), where z_ML denotes the minimiser of V(z) = −log π(y|z), i.e. the maximum likelihood
estimator, which we assume here to be unique, e.g., V is convex. In other words, these homotopy methods can
be used to solve optimisation problems via mean–field equations and their interacting particle approximations.
See, for example, Zhang, Taghvaei and Mehta (2017) and appendix C for more details.
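For the linear–Gaussian case mentioned in remark 4.2, where log l(z) = −(1/2)(y − Hz)^T R^{−1}(y − Hz) and the elliptic PDE (120), augmented by a matrix M_s = P_s in analogy with (113), is solved by a Kalman–type formula, a sketch of the resulting particle flow reads as follows (one of many possible variants, not the general algorithm):

```python
import numpy as np

def gaussian_homotopy_flow(z0, H, R, y, n_steps=100):
    """Particle flow for the homotopy of remark 4.2, Gaussian likelihood;
    uses the closed-form control -(1/2) P_s H^T R^{-1} (Hz + H z_bar - 2y)."""
    z = z0.copy()
    ds = 1.0 / n_steps
    Rinv = np.linalg.inv(np.atleast_2d(R))
    for _ in range(n_steps):
        z_bar = z.mean(axis=0)
        P = (z - z_bar).T @ (z - z_bar) / (z.shape[0] - 1)
        innov = z @ H.T + z_bar @ H.T - 2.0 * y   # Hz + H z_bar - 2y, row-wise
        z = z - 0.5 * ds * innov @ Rinv @ H @ P.T
    return z

# usage: prior samples -> (approximate) posterior samples
rng = np.random.default_rng(3)
z_prior = rng.normal(0.0, 1.0, (500, 2))
H = np.array([[1.0, 0.0]])                        # observe first component
z_post = gaussian_homotopy_flow(z_prior, H, R=np.array([[0.5]]),
                                y=np.array([1.0]))
```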

Remark 4.3. Yet another approach for transforming prior samples into posterior ones is provided through the
mean–field interpretation

    (d/ds) Z_s = −∇_z log ( π_s(Z_s) / π̂_1(Z_s) )   (121)

of the Fokker–Planck equation (29) for Z_s ∼ π_s with drift term f_t(z) = ∇_z log π̂_1 and γ = 2, i.e.,

    ∂_s π_s = ∇_z · ( π_s ∇_z log (π_s/π̂_1) ) ,

and for which it holds under fairly general assumptions that

    lim_{s→∞} π_s = π̂_1

(Pavliotis 2014). The more common approach would be to solve the Brownian dynamics

    dZ_s = ∇_z log π̂_1(Z_s) ds + 2^{1/2} dW_s

for each sample. In other words, formulation (121) replaces stochastic Brownian dynamics by a deterministic
interacting particle system. See appendix A for more details.

4.2 Random data

We now replace (108) by an observation model of the form

    dY_t = h(Z^+_t) dt + dV^+_t ,

where we set Y_t ∈ R for simplicity and V^+_t denotes standard Brownian motion. The forward operator
h : R^{N_z} → R is also assumed to be known. The PDFs π̂_t(z) for Z_t conditioned on all observations y_s with
s ∈ [0, t] satisfy the Kushner–Stratonovitch equation (Jazwinski 1970)

    dπ̂_t = Lπ̂_t dt + π̂_t ( h − π̂_t[h] ) ( dY_t − π̂_t[h] dt )   (122)

with L defined by (26). The following observation is important for the subsequent discussions.

Remark 4.4. Consider the state dependent diffusion

    dZ^+_t = γ_t(Z^+_t) ◦ dU^+_t   (123)

in its Stratonovitch interpretation (Pavliotis 2014), where U^+_t is scalar–valued Brownian motion and
γ_t(z) ∈ R^{N_z×1}. The associated Fokker–Planck equation for the marginal PDFs π_t takes the form

    ∂_t π_t = (1/2) ∇_z · ( γ_t ∇_z · (π_t γ_t) )   (124)

and expectation values evolve in time according to

    π_t[f] = π_0[f] + ∫_0^t π_s[Af] ds   (125)

with operator A defined by

    Af = (1/2) γ_t^T ∇_z ( γ_t^T ∇_z f ) .

Consider now the mean field equation

    (d/dt) Z_t = −(1/2) γ_t(Z_t) J_t ,   J_t := π_t^{−1} ∇_z · (π_t γ_t) .   (126)

The associated Liouville equation is

    ∂_t π_t = (1/2) ∇_z · ( π_t γ_t J_t ) = (1/2) ∇_z · ( γ_t ∇_z · (γ_t π_t) ) .

In other words, the marginal PDFs π_t and the expectation values π_t[f] evolve identically under (123) and (126),
respectively.

We now state a formulation of the continuous–time filtering problem in terms of appropriate mean–field equations.
These equations follow the framework of the feedback particle filter (FPF) as first introduced by Yang,
Mehta and Meyn (2013). See Crisan and Xiong (2010) and Xiong (2011) for alternative formulations.

Lemma 4.5. The mean field SDE

    dZ^+_t = f_t(Z^+_t) dt + γ^{1/2} dW^+_t − K_t(Z^+_t) ◦ dI_t   (127)

with

    dI_t := h(Z^+_t) dt − dY_t + dU^+_t ,

U^+_t standard Brownian motion, and K_t := ∇_z φ_t with the potential φ_t satisfying the elliptic PDE

    ∇_z · ( π_t ∇_z φ_t ) = −π_t ( h − π_t[h] )   (128)

leads to the same evolution of its conditional marginal distributions π_t as (122).

Proof. We set γ = 0 and f_t ≡ 0 in (127) for simplicity. Then, following (124) with γ_t = K_t, the Fokker–Planck
equation for the marginal distributions π_t of (127) conditioned on {Y_s}_{s∈[0,t]} is given by

    dπ_t = ∇_z · ( π_t K_t (h(z) dt − dY_t) ) + ∇_z · ( K_t ∇_z · (π_t K_t) ) dt   (129)
         = (π_t[h] dt − dY_t) ∇_z · (π_t K_t) + ∇_z · ( π_t K_t (h(z) − π_t[h]) ) dt + ∇_z · ( K_t ∇_z · (π_t K_t) ) dt   (130)
         = π_t ( h − π_t[h] ) ( dY_t − π_t[h] dt )   (131)

as desired, where we have used (128) twice to get from (130) to (131). Also note that both Y_t and U^+_t contribute
to the diffusion–induced last term in (129) and, hence, the factor one half in (124) gets replaced by one.

Remark 4.6. Using the reformulation (126) of (123) in Stratonovitch form with γ_t = K_t together with (128),
one can replace K_t ◦ dU^+_t by (1/2) K_t (π_t[h] − h) dt, which leads to the alternative

    dI_t = (1/2) ( h + π_t[h] ) dt − dY_t

for the innovation I_t, as originally proposed by Yang et al. (2013) in their FPF formulation.

A special case of the FPF is the ensemble Kalman–Bucy filter (Bergemann and Reich 2012) with the Kalman
gain factor K_t being independent of the state variable z and of the form

    K_t = P^{zh}_t .   (132)

Here P^{zh}_t denotes the covariance matrix between Z_t and h(Z_t) at time t.

5 Numerical methods
In this section, we discuss some numerical implementations of the mean–field approach to continuous–time data
assimilation. An introduction to standard particle filter implementations can, for example, be found in Bain
and Crisan (2008). We start with the continuous–time formulation of the ensemble Kalman filter and state a
numerical implementation of the FPF using a Schrödinger formulation in the second part of this section. See
also appendix A for some more details on a particle–based solution of the elliptic PDEs (111), (120), and (128),
respectively.

5.1 Ensemble Kalman–Bucy filter


Let us start with the ensemble Kalman–Bucy filter (EnKBF), which arises naturally from the mean–field
equations (114) and (127), respectively, with Kalman gain (132) (Bergemann and Reich 2012). We state the
EnKBF here in the form

    dZ^i_t = f_t(Z^i_t) dt + γ^{1/2} dW^{i,+}_t − K^M_t dI^i_t   (133)

for i = 1, . . . , M and

    K^M_t := (1/(M−1)) Σ_{i=1}^M Z^i_t ( h(Z^i_t) − h̄^M_t )^T ,   h̄^M_t := (1/M) Σ_{i=1}^M h(Z^i_t) .

The innovations dI^i_t take different forms depending on whether the data is smooth in time, i.e.,

    dI^i_t = (1/2) ( h(Z^i_t) + h̄^M_t − 2y_t ) dt ,

or contains stochastic contributions, i.e.,

    dI^i_t = (1/2) ( h(Z^i_t) + h̄^M_t ) dt − dy_t ,   (134)

or, alternatively,

    dI^i_t = h(Z^i_t) dt + dU^i_t − dy_t ,

where U^i_t denotes standard Brownian motion. The SDEs (133) can be discretised in time by any suitable
time–stepping method such as the Euler–Maruyama scheme (Kloeden and Platen 1992). However, one has to be
careful with the choice of the step–size ∆t due to potentially stiff contributions from K^M_t dI^i_t. See, for example,
Amezcua, Kalnay, Ide and Reich (2014).
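A minimal Euler–Maruyama sketch of (133) with the innovation (134), assuming vectorised drift and forward maps (the function signatures are hypothetical):

```python
import numpy as np

def enkbf_step(z, f, h, dy, dt, gamma, rng):
    """One Euler-Maruyama step of the EnKBF (133) with innovation (134).

    z : (M, d) ensemble, f : (M, d) -> (M, d) drift, h : (M, d) -> (M, k)
    forward map, dy : observed increment of y_t over the step."""
    M = z.shape[0]
    hz = h(z)
    h_bar = hz.mean(axis=0)
    K = z.T @ (hz - h_bar) / (M - 1)    # gain K_t^M, shape (d, k)
    dI = 0.5 * (hz + h_bar) * dt - dy   # innovations (134)
    dW = rng.normal(0.0, np.sqrt(dt), z.shape)
    return z + f(z) * dt + np.sqrt(gamma) * dW - dI @ K.T
```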

Remark 5.1. It is of broad interest to study the stability and accuracy of interacting particle filter algorithms
such as the discrete–time EnKF and the continuous–time EnKBF for fixed particle numbers M. On the negative
side, it has been shown by Kelly, Majda and Tong (2015) that such algorithms can undergo finite–time
instabilities, while it has also been demonstrated (González-Tokman and Hunt 2013, Kelly, Law and Stuart 2014, Tong,
Majda and Kelly 2016, de Wiljes, Reich and Stannat 2018) that such algorithms can be stable and accurate
under appropriate conditions on the dynamics and measurement process. Asymptotic properties of the EnKF
and EnKBF in the limit of M → ∞ have also been studied, for example, by Gland, Monbet and Tran (2011),
Kwiatowski and Mandel (2015), and de Wiljes et al. (2018).

5.2 Feedback particle filter


A Monte Carlo implementation of the FPF (127) faces two main obstacles. First, one needs to approximate the
potential φ_t in (128) with the density π_t only available in terms of an empirical measure

    π_t(z) = (1/M) Σ_{i=1}^M δ(z − z^i_t) .

Several possible approximations have been discussed by Taghvaei and Mehta (2016) and Taghvaei, de Wiljes,
Mehta and Reich (2017). Here we would like to mention in particular an approximation based on diffusion maps,
which we summarise in appendix A. Second, one needs to apply a suitable time–stepping method for the SDE
(127) in Stratonovitch form. Here we suggest to use Heun's method (Burrage, Burrage and Tian 2004)

    z̃^i_{n+1} = z^i_n + ∆t f_{t_n}(z^i_n) + (γ∆t)^{1/2} ξ^i_n + K_n(z^i_n) ∆I^i_n ,
    z^i_{n+1} = z^i_n + (∆t/2) ( f_{t_n}(z^i_n) + f_{t_{n+1}}(z̃^i_{n+1}) ) + (γ∆t)^{1/2} ξ^i_n
              + (1/2) ( K_n(z^i_n) ∆I^i_n + K̃_{n+1}(z̃^i_{n+1}) ∆Ĩ^i_{n+1} ) ,

where K̃_{n+1} and ∆Ĩ^i_{n+1} are computed using the predictor values z̃^i_{n+1}, i = 1, . . . , M.
While the above implementation of the FPF requires solving the elliptic PDE (128) twice per time–step,
we now suggest a time–stepping approach in terms of an associated Schrödinger problem. Let us assume that
we have M equally weighted particles z^i_n representing the conditional filtering distribution at time t_n. We first
propagate these particles forward under the drift term alone, i.e.,

    ẑ^i := z^i_n + ∆t f_{t_n}(z^i_n) ,   i = 1, . . . , M .

In a next step, we draw L = KM, with K ≥ 1, balanced samples z̃^l from the prediction PDF

    π̃(z) := (1/M) Σ_{i=1}^M n(z; ẑ^i, γ∆t I)

and assign importance weights

    w^l ∝ exp( −(∆t/2) h(z̃^l)² + ∆y_n h(z̃^l) )

to them with normalisation (97). Recall that we assumed y_t ∈ R for simplicity; here ∆y_n := y_{t_{n+1}} − y_{t_n}.
We then solve the Schrödinger problem

    P∗ = arg min_{P∈Π} KL(P||Q)   (135)

with the entries of Q ∈ R^{L×M} given by

    q_li = exp( −(1/(2γ∆t)) ‖z̃^l − ẑ^i‖² )

and the set Π defined by (99). The desired particles z^i_{n+1} are finally given as realisations of

    Z^i_{n+1} = Σ_{l=1}^L z̃^l p∗_li + (γ∆t)^{1/2} Ξ^i_n ,   Ξ^i_n ∼ N(0, I) ,   (136)

for i = 1, . . . , M.

The update (136) with P∗ defined by (135) can be viewed as combining an approximation of (128) and the
time–stepping of (127) into a single step.
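One complete update might be sketched as follows, with a plain Sinkhorn loop inlined (cf. section 3.4) and i.i.d. draws from the Gaussian mixture standing in for balanced sampling:

```python
import numpy as np

def schroedinger_fpf_step(z, f, h, dy, dt, gamma, K, rng, n_sink=500):
    """One step of the Schroedinger-based update (135)-(136); a sketch."""
    M, d = z.shape
    L = K * M
    z_hat = z + dt * f(z)                            # drift-only forecast
    idx = rng.integers(0, M, L)                      # draws from the mixture
    z_tld = z_hat[idx] + np.sqrt(gamma * dt) * rng.normal(size=(L, d))
    hz = h(z_tld)                                    # scalar observations
    w = np.exp(-0.5 * dt * hz**2 + dy * hz)          # importance weights
    w *= L / w.sum()                                 # normalisation (97)
    Q = np.exp(-((z_tld[:, None, :] - z_hat[None, :, :])**2).sum(-1)
               / (2.0 * gamma * dt))
    u, v = np.ones(L), np.ones(M)                    # Sinkhorn for (99)
    for _ in range(n_sink):
        P = u[:, None] * Q / v[None, :]
        u *= (M * w / L) / P.sum(axis=1)             # row sums M w^l / L
        P = u[:, None] * Q / v[None, :]
        v *= P.sum(axis=0)                           # column sums one
    P = u[:, None] * Q / v[None, :]
    return (z_tld.T @ P).T + np.sqrt(gamma * dt) * rng.normal(size=(M, d))
```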

Remark 5.2. One can also use the matrix P∗ from (135) to implement a resampling scheme

    P[z^i_{n+1} = z̃^l] = p∗_li   (137)

for i = 1, . . . , M. Note that, contrary to classic resampling schemes based on the weighted particles (z̃^l, w^l),
l = 1, . . . , L, the sampling probabilities p∗_li take the underlying geometry of the forecasts ẑ^i in state space into
account.

Example 5.3. We consider the SDE formulation

    dZ_t = f(Z_t) dt + γ^{1/2} dW_t

of a stochastically perturbed Lorenz–63 model (Lorenz 1963, Reich and Cotter 2015, Law et al. 2015) with
diffusion constant γ = 0.1. The system is fully observed according to

    dY_t = f(Z_t) dt + R^{1/2} dV_t

with measurement error variance R = 0.1, and the system is simulated over a time interval t ∈ [0, 40000] with
step–size ∆t = 0.01. We implemented a standard particle filter with resampling performed after each time–
step and compare the resulting RMS errors with those arising from using (136) (Schrödinger transform) and
(137) (Schrödinger resample), respectively. See figure 7. It can be seen that the Schrödinger–based methods
outperform the standard particle filter in terms of RMS errors for small ensemble sizes. The Schrödinger
transform method is particularly robust for very small ensemble sizes, while the Schrödinger resample variant
performs better at larger sample sizes. We also implemented the EnKBF (133) and found that it diverged for
the smallest ensemble size of M = 5 and performed worse than the other methods for larger ensemble sizes.

6 Conclusions
We have summarised sequential data assimilation techniques suitable for state and parameter estimation of
discrete or continuous–time stochastic processes. In addition to algorithmic approaches based on the standard
filtering and smoothing framework of stochastic analysis, we have drawn a connection to a boundary value


Figure 7: RMS errors as a function of sample size, M, for a standard particle filter, the EnKBF, and
implementations of (136) (Schrödinger transform) and (137) (Schrödinger resample), respectively. Both
Schrödinger–based methods outperform the standard particle filter for small ensemble sizes. The EnKBF diverged
for the smallest ensemble size of M = 5 and performed worse than all other methods for this highly nonlinear
problem.

problem over joint probability measures first formulated by E. Schrödinger. We have argued that sequential
data assimilation essentially needs to approximate such a boundary value problem with the boundary conditions
given by the filtering distributions at consecutive observation times.

The application of these techniques to high–dimensional problems arising, for example, from the spatial
discretisation of PDEs requires further approximations in the form of localisation and inflation, which we have
not discussed in this survey. See, for example, Evensen (2006), Reich and Cotter (2015), and Asch et al. (2017)
for further details.
Acknowledgement. This research has been partially funded by Deutsche Forschungsgemeinschaft (DFG)
through grant CRC 1294 “Data Assimilation”.

References
W. Acevedo, J. de Wiljes and S. Reich (2017), ‘Second-order accurate ensemble transform particle filters’,
SIAM Journal on Scientific Computing 39, A1834–A1850.
S. Agapiou, O. Papaspiliopoulos, D. Sanz-Alonso and A. Stuart (2017), 'Importance sampling: Computational
complexity and intrinsic dimension', Statistical Science 32, 405–431.
J. Amezcua, E. Kalnay, K. Ide and S. Reich (2014), ‘Ensemble transform Kalman-Bucy filters’, Q.J.R. Meteor.
Soc. 140, 995–1004.
J. Anderson (2010), ‘A non-Gaussian ensemble filter update for data assimilation’, Monthly Weather Review
138, 4186–4198.
M. Arulampalam, S. Maskell, N. Gordon and T. Clapp (2002), ‘A tutorial on particle filters for online
nonlinear/non–Gaussian Bayesian tracking’, IEEE Trans. Sign. Process. 50, 174–188.

M. Asch, M. Bocquet and M. Nodet (2017), Data assimilation: Methods, algorithms and applications, SIAM,
Philadelphia.
A. Bain and D. Crisan (2008), Fundamentals of stochastic filtering, Vol. 60 of Stochastic modelling and applied
probability, Springer-Verlag, New-York.

T. Bengtsson, P. Bickel and B. Li (2008), Curse of dimensionality revisited: Collapse of the particle filter in
very large scale systems, in IMS Lecture Notes - Monograph Series in Probability and Statistics: Essays in
Honor of David F. Freedman, Vol. 2, Institute of Mathematical Sciences, pp. 316–334.
K. Bergemann and S. Reich (2010), ‘A mollified ensemble Kalman filter’, Q. J. R. Meteorological Soc.
136, 1636–1643.

K. Bergemann and S. Reich (2012), ‘An ensemble Kalman–Bucy filter for continuous data assimilation’, Me-
teorolog. Zeitschrift 21, 213–219.
A. Beskos, M. Girolami, S. Lan, P. Farrell and A. Stuart (2017), ‘Geometric MCMC for infinite–dimensional
inverse problems’, J. Comput. Phys. 335, 327–351.

A. Beskos, F. Pinski, J. Sanz-Serna and A. Stuart (2011), ‘Hybrid Monte Carlo on Hilbert spaces’, Stochastic
Processes and their Applications 121, 2201–2230.
L. Bottou, F. Curtis and J. Nocedal (2018), ‘Optimization methods for large–scale machine learning’, SIAM
Review 60, 223–311.

K. Burrage, P. M. Burrage and T. Tian (2004), ‘Numerical methods for strong solutions of stochastic differential
equations: An overview’, Proc. R. Soc. Lond. A 460, 373–402.
R. Carmona (2016), Lecures on BSDEs, stochastic control, and stochastic differential games with financial
applications, SIAM, Philadelphia.
A. Carrassi, M. Bocquet, L. Bertino and G. Evensen (2018), ‘Data assimilation in the geosciences: An overview
of methods, issues, and perspectives’, WIREs Clim Change.
A. Carrassi, M. Bocquet, A. Hannart and M. Ghil (2017), 'Estimating model evidence using data assimilation',
Q.J.R. Meteorol. Soc. 143, 866–880.
Y. Chen, T. Georgiou and M. Pavon (2014), ‘On the relation between optimal transport and Schrödinger
bridges: A stochastic control viewpoint’, J. Optim. Theory Appl. 169, 671–691.
Y. Chen, T. Georgiou and M. Pavon (2016), ‘Optimal steering of a linear stochastic system to a final probability
distribution, Part I’, Trans. Automatic Control 61, 1158–1169.
N. Chustagulprom, S. Reich and M. Reinhardt (2016), ‘A hybrid ensemble transform filter for nonlinear and
spatially extended dynamical systems’, SIAM/ASA J. Uncertainty Quantification 4, 592–608.

D. Crisan and J. Xiong (2010), ‘Approximate McKean-Vlasov representation for a class of SPDEs’, Stochastics
82, 53–68.
M. Cuturi (2013), Sinkhorn distances: Lightspeed computation of optimal transport, in NIPS 2013.
F. Daum and J. Huang (2011), Particle filter for nonlinear filters, in Acoustics, Speech and Signal Processing
(ICASSP), 2011 IEEE International Conference on, pp. 5920–5923.
J. de Wiljes, S. Reich and W. Stannat (2018), ‘Long–time stability and accuracy of the ensemble Kalman–Bucy
filter for fully observed processes and small measurement noise’, SIAM J. Appl. Dyn. Syst. 17, 1152–1181.
P. Degond and F.-J. Mustieles (1990), ‘A deterministic approximation of diffusion equations using particles’,
SIAM J. Sci. Comput. 11, 293–310.
P. del Moral (2004), Feynman-Kac formulae: Genealogical and interacting particle systems with applications,
Springer-Verlag, New York.

J. Doob (1984), Classical potential theory and its probabilistic counterpart, Springer-Verlag, New York.
R. Douc and O. Cappe (2005), Comparison of resampling schemes for particle filtering, in Proceedings of the
4th International Symposium on Image and Signal Processing and Analysis (ISPA), pp. 64–69.
A. Doucet, N. de Freitas and N. Gordon, eds (2001), Sequential Monte Carlo methods in practice, Springer-Verlag,
Berlin Heidelberg New York.

G. Evensen (2006), Data assimilation. The ensemble Kalman filter, Springer-Verlag, New York.
P. Fearnhead and H. Künsch (2018), ‘Particle filters and data assimilation’, Annual Review of Statistics and
its Application 5, 421–449.
W. Fleming (1997), ‘Deterministic nonlinear filtering’, Annali della Scuola Normalle Superiore di Pisa 25, 435–
454.
H. Föllmer and N. Gantert (1997), ‘Entropy minimization and Schrödinger processes in infinite dimensions’,
The Annals of Probability 25, 901–926.
M. Frei and H. Künsch (2013), ‘Bridging the ensemble Kalman and particle filters’, Biometrika 100, 781–800.

F. L. Gland, V. Monbet and V. Tran (2011), Large sample asymptotics for the ensemble Kalman filter, in The
Oxford Handbook of Nonlinear Filtering, Oxford University Press, Oxford, pp. 598–631.
C. González-Tokman and B. Hunt (2013), ‘Ensemble data assimilation for hyperbolic systems’, Physica D
243, 128–142.

P. Guarniero, A. Johansen and A. Lee (2017), ‘The iterated auxiliary particle filter’, Journal of the American
Statistical Association 112, 1636–1647.
J. Harlim (2018), Data–driven computational methods, Cambridge University Press, Cambridge.
C. Hartmann, L. Richter, C. Schütte and W. Zhang (2017), ‘Variational characterization of free energy: Theory
and algorithms’, Entropy 19, 629.

J. Heng, A. Bishop, G. Deligiannidis and A. Doucet (2018), Controlled sequential Monte Carlo, Technical
Report arXiv:1708.08396v2, Harvard University.
A. Jazwinski (1970), Stochastic processes and filtering theory, Academic Press, New York.
N. Kantas, A. Doucet, S. Singh, J. Maciejowski and N. Chopin (2015), ‘On particle methods for parameter
estimation in state–space models', Statistical Science 30, 328–351.
H. Kappen and H. Ruiz (2016), ‘Adaptive importance sampling for control and inference’, Journal of Statistical
Physics 162, 1244–1266.
H. Kappen, V. Gomez and M. Opper (2012), ‘Optimal control as a graphical model inference problem’, Machine
learning 87, 159–182.
D. Kelly, A. Majda and X. Tong (2015), ‘Concrete ensemble Kalman filters with rigorous catastrophic filter
divergence’, Proc. Natl. Acad. Sci. USA 112, 10589–10594.
D. T. Kelly, K. J. H. Law and A. Stuart (2014), ‘Well-posedness and accuracy of the ensemble Kalman filter
in discrete and continuous time’, Nonlinearity 27, 2579–2604.

P. Kirchgessner, J. Tödter, B. Ahrens and L. Nerger (2017), ‘The smoother extension of the nonlinear ensemble
transform filter’, Tellus A 69, 1327766.
P. Kloeden and E. Platen (1992), Numerical solution of stochastic differential equations, Springer-Verlag,
Berlin Heidelberg New York.

E. Kwiatowski and J. Mandel (2015), ‘Convergence of the square root ensemble Kalman filter in the large
ensemble limit’, SIAM/ASA J. Uncertainty Quantification 3, 1–17.

K. Law, A. Stuart and K. Zygalakis (2015), Data assimilation: A mathematical introduction, Springer-Verlag,
New York.
B. Leimkuhler and S. Reich (2005), Simulating Hamiltonian dynamics, Cambridge University Press, Cam-
bridge.

C. Leonard (2014), ‘A survey of the Schrödinger problem and some of its connections with optimal transporta-
tion’, Discrete Contin. Dyn. Syst. A 34, 1533–1574.
F. Lindsten and T. Schön (2013), ‘Backward simulation methods for Monte Carlo statistical inference’, Foun-
dation and Trends in Machine Learning 6, 1–143.

J. Liu (2001), Monte Carlo strategies in scientific computing, Springer-Verlag, New York.
Q. Liu and D. Wang (2016), Stein variational gradient descent: A general purpose Bayesian inference algorithm,
in NIPS 2016.
E. Lorenz (1963), 'Deterministic nonperiodic flow', J. Atmos. Sci. 20, 130–141.

J. Lu, Y. Lu and J. Nolen (2018), Scaling limit of the Stein variational gradient descent, Part I: The mean
field regime, Technical Report arXiv:1805.04035, Duke University.
R. McCann (1995), ‘Existence and uniqueness of monotone measure–preserving maps’, Duke Mathematical
Journal 80, 309–323.
T. E. Moselhy and Y. Marzouk (2012), ‘Bayesian inference with optimal maps’, J. Comput. Phys. 231, 7815–
7850.
R. Neal (1996), Bayesian learning for neural networks, Springer-Verlag, New York.
E. Nelson (1984), Quantum fluctuations, Princeton University Press, Princeton.
G. Pavliotis (2014), Stochastic processes and applications, Springer–Verlag, New York.

G. Peyre and M. Cuturi (2018), Computational optimal transport, Technical Report arXiv:1803.00567, CNRS,
ENS, CREST, ENSAE.
S. Reich (2011), ‘A dynamical systems framework for intermittent data assimilation’, BIT Numer Math
51, 235–249.

S. Reich (2012), ‘A Gaussian mixture ensemble transform filter’, Q. J. R. Meterolog. Soc. 138, 222–233.
S. Reich (2013), ‘A nonparametric ensemble transform method for Bayesian inference’, SIAM J. Sci. Comput.
35, A2013–A2024.
S. Reich and C. Cotter (2015), Probabilistic forecasting and Bayesian data assimilation, Cambridge University
Press, Cambridge.
S. Reich and T. Hundertmark (2011), ‘On the use of constraints in molecular and geophysical fluid dynamics’,
European Physical Journal Special Topics 200, 259–270.
S. Robert, D. Leuenberger and H. Künsch (2018), ‘A local ensemble transform Kalman particle filter for
convective–scale data assimilation’, Quarterly Journal of the Royal Meteorological Society.

H. Ruiz and H. Kappen (2017), ‘Particle smoothing for hidden diffusion processes: Adaptive path integral
smoother’, IEEE Transactions on Signal Processing 62, 3191–3203.
G. Russo (1990), ‘Deterministic diffusion of particles’, Comm. Pure Appl. Math. 43, 697–733.
S. Särkkä (2013), Bayesian filtering and smoothing, Cambridge University Press, Cambridge.

C. Schillings and A. Stuart (2017), ‘Analysis of the ensemble Kalman filter for inverse problems’, SIAM J.
Numer. Anal. 55, 1264–1290.

E. Schrödinger (1931), ‘Über die Umkehrung der Naturgesetze’, Sitzungsberichte der Preußischen Akademie
der Wissenschaften, Physikalisch-mathematische Klasse pp. 144–153.
A. Taghvaei, J. de Wiljes, P. Mehta and S. Reich (2017), 'Kalman filter and its modern extensions for the
continuous–time nonlinear filtering problem', ASME. J. Dyn. Sys., Meas., Control. 140, 030904.
A. Taghvaei and P. Mehta (2016), Gain function approximation in the feedback particle filter, Technical Report
arXiv:1603.05496, University of Illinois.
S. Thijssen and H. Kappen (2015), ‘Path integral control and state–dependent feedback’, Physical Review E
91, 032104.

X. Tong, A. Majda and D. Kelly (2016), ‘Nonlinear stability and ergodicity of ensemble based Kalman filters’,
Nonlinearity 29(2), 657.
P. Van Leeuwen (2015), Nonlinear data assimilation for high–dimensional systems, in Frontiers in Applied
Dynamical Systems: Reviews and Tutorials, Vol. 2, Springer-Verlag, New York, pp. 1–73.

E. Vanden-Eijnden and J. Weare (2012), ‘Data assimilation in the low noise regime with application to the
Kuroshio’, Monthly Weather Review 141, 1822–1841.
S. Vetra-Carvalho, P. van Leeuwen, L. Nerger, A. Barth, M. Altaf, P. Brasseur, P. Kirchgessner and J.-M.
Beckers (2018), ‘State–of–the–art stochastic data assimilation methods for high–dimensional non–Gaussian
problems’, Tellus A: Dynamic Meteorology and Oceanography 70, 1445364.

C. Villani (2003), Topics in optimal transportation, American Mathematical Society, Providence, Rhode Island,
NY.
C. Villani (2009), Optimal transportation: Old and new, Springer-Verlag, Berlin Heidelberg.
J. Xiong (2011), Particle approximations to the filtering problem in continuous time, in The Oxford Handbook
of Nonlinear Filtering (D. Crisan and B. Rozovskii, eds), Oxford University Press, Oxford, pp. 635–655.
T. Yang, P. Mehta and S. Meyn (2013), ‘Feedback particle filter’, IEEE Trans. Automatic Control 58, 2465–
2480.
C. Zhang, A. Taghvaei and P. Mehta (2017), A controlled particle filter for global optimization, Technical
Report arXiv:1701.02413, University of Illinois at Urbana-Champaign.

7 Appendices

Appendix A. Mesh–free approximations to Fokker–Planck and backward Kolmogorov equations
In this appendix, we discuss two closely related approximations: first, to the Fokker–Planck equation (27) with
the operator (26) taking the special form

    Lπ = −∇_z · ( π ∇_z log π∗ ) + ∆_z π = ∇_z · ( π∗ ∇_z (π/π∗) ) ,

and, second, to its dual operator L† given by (84).

The approximation to the Fokker–Planck equation (27) with drift term

    f_t(z) = ∇_z log π∗(z)   (138)

can be used to transform samples z^i_0, i = 1, . . . , M, from a (prior) PDF π_0 into samples from a target (posterior)
PDF π∗ using an evolution equation of the form

    (d/ds) Z_s = F_s(Z_s) ,   (139)

with Z_0 ∼ π_0 such that

    lim_{s→∞} Z_s ∼ π∗ .

The evolution of the marginal PDFs π_s is given by Liouville's equation

    ∂_s π_s = −∇_z · ( π_s F_s ) .   (140)

We now choose F_s such that the Kullback–Leibler divergence KL(π_s||π∗) is non–increasing in time, i.e.,

    (d/ds) KL(π_s||π∗) = ∫ π_s F_s · ∇_z log (π_s/π∗) dz ≤ 0 .

A natural choice is

    F_s(z) := −∇_z log (π_s/π∗)(z) ,

which renders (140) formally equivalent to the Fokker–Planck equation (27) with drift term (138) (Reich and
Cotter 2015, Peyre and Cuturi 2018).
Let us now approximate the evolution equation (139) over a reproducing kernel Hilbert space (RKHS) H
with kernel k(z − z′) and inner product ⟨f, g⟩_H, which satisfies

    ⟨k(· − z′), f⟩_H = f(z′) .   (141)

Following Russo (1990) and Degond and Mustieles (1990), we first introduce the approximation

    π̃_s(z) := (1/M) Σ_{i=1}^M k(z − z^i_s)   (142)

to the marginal densities π_s. Note that (141) implies that

    ⟨f, π̃_s⟩_H = (1/M) Σ_{i=1}^M f(z^i_s) .

Given some evolution equations

    (d/ds) z^i_s = u^i_s

for the particles z^i_s, i = 1, . . . , M, we find that (142) satisfies Liouville's equation, i.e.,

    ∂_s π̃_s = −∇_z · ( π̃_s F̃_s )

with

    F̃_s(z) = ( Σ_{i=1}^M k(z − z^i_s) u^i_s ) / ( Σ_{i=1}^M k(z − z^i_s) ) .

We finally introduce the functional

    V({z^l_s}) := (1/M) Σ_{i=1}^M log ( (1/M) Σ_{j=1}^M k(z^i_s − z^j_s) / π∗(z^i_s) ) = ⟨π̃_s, log (π̃_s/π∗)⟩_H

as an approximation to the Kullback–Leibler divergence in the RKHS H, and set

    u^i_s := −∇_{z^i_s} V({z^l_s}) ,   (143)

which constitutes the desired particle approximation to the Fokker–Planck equation (27) with drift term (138).

We also remark that an alternative interacting particle system, approximating the same asymptotic PDF π∗
in the limit s → ∞, has been proposed recently by Liu and Wang (2016) under the notion of Stein variational
gradient descent. See Lu, Lu and Nolen (2018) for a theoretical analysis of Stein variational gradient descent,
which implies in particular that Stein variational gradient descent can be viewed as a Lagrangian particle
approximation to the modified evolution equation

    ∂_s π_s = ∇_z · ( π_s² ∇_z log (π_s/π∗) ) = ∇_z · ( π_s ( ∇_z π_s − π_s ∇_z log π∗ ) )

in the marginal PDFs π_s, i.e., one uses

    F_s(z) := −π_s(z) ∇_z log (π_s/π∗)(z)

in (139).
We now turn our attention to the dual operator L†, defined by (84), which arises from (113) and (128). More
specifically, let us rewrite (128) in the form

    Aφ_t = −( h − π_t[h] )   (144)

with the operator A defined by

    Ag := (1/π_t) ∇_z · ( π_t ∇_z g ) ,

and we find that A is of the form L† with π_t taking the role of π∗.

We also recall that (85) provides an approximation to L† and, hence, to A. This observation allows one to
introduce a sample–based method for approximating the potential φ defined by the elliptic partial differential
equation (144) for a given function h(z).
We follow here instead the presentation of Taghvaei and Mehta (2016) and Taghvaei et al. (2017) and assume
that we have M samples z^i from a PDF π_t. The method is based on

    ( φ − e^{εA} φ ) / ε ≈ h − π_t[h]   (145)

for ε > 0 sufficiently small and upon replacing e^{εA} by a diffusion map approximation (Harlim 2018) of the form

    e^{εA} φ(z) ≈ T_ε φ(z) := Σ_{i=1}^M k_ε(z, z^i) φ(z^i) .   (146)

The required kernel functions k_ε(z, z^i) are defined as follows. Let

    n_ε(z) := n(z; 0, 2ε I)

and

    p_ε(z) := (1/M) Σ_{j=1}^M n_ε(z − z^j) = (1/M) Σ_{j=1}^M n(z; z^j, 2ε I) .

Then

    k_ε(z, z^i) := n_ε(z − z^i) / ( c_ε(z) p_ε(z^i)^{1/2} )

with normalisation factor

    c_ε(z) := Σ_{l=1}^M n_ε(z − z^l) / p_ε(z^l)^{1/2} .

In other words, the operator T_ε reproduces constant functions.
The approximations (145) and (146) lead to the fixed–point problem²

    φ_j = Σ_{i=1}^M k_ε(z^j, z^i) ( φ_i + ε ∆h_i ) ,   j = 1, . . . , M ,   (147)

in the scalar coefficients φ_j, j = 1, . . . , M, for given

    ∆h_i := h(z^i) − h̄ ,   h̄ := (1/M) Σ_{l=1}^M h(z^l) .

Since T_ε reproduces constant functions, (147) determines the φ_i only up to a constant contribution, which we
fix by requiring

    Σ_{i=1}^M φ_i = 0 .

The desired functional approximation φ̃ to the potential φ is now provided by

    φ̃(z) = Σ_{i=1}^M k_ε(z, z^i) ( φ_i + ε ∆h_i ) .   (148)

² It would also be possible to employ the approximation (85) in the fixed–point problem (147), i.e., to replace k_ε(z^j, z^i) by
(Q_+)_{ji} in (85) with ∆t = ε and π∗ = π_t.

Furthermore, since

    ∇_z k_ε(z, z^i) = −(1/(2ε)) k_ε(z, z^i) ( (z − z^i) − Σ_{l=1}^M k_ε(z, z^l)(z − z^l) )
                    = (1/(2ε)) k_ε(z, z^i) ( z^i − Σ_{l=1}^M k_ε(z, z^l) z^l ) ,

we obtain

    ∇_z φ̃(z^j) = Σ_{i=1}^M ∇_z k_ε(z^j, z^i) r_i = Σ_{i=1}^M z^i a_ij

with

    r_i = φ_i + ε ∆h_i

and

    a_ij := (1/(2ε)) k_ε(z^j, z^i) ( r_i − Σ_{l=1}^M k_ε(z^j, z^l) r_l ) .

We note that

    Σ_{i=1}^M a_ij = 0

and

    lim_{ε→∞} a_ij = (1/M) ∆h_i + O(M^{−2})

since

    lim_{ε→∞} k_ε(z^j, z^i) = 1/M .

In other words,

    lim_{ε→∞} ∇_z φ̃(z^j) = (1/M) Σ_{i=1}^M z^i ( h(z^i) − h̄ ) ,

independent of z^j, which is equal to an empirical estimator for the covariance between z and h(z) and which,
in the context of the FPF, leads to the EnKBF formulations of section 5.1. See Taghvaei et al. (2017) for more
details.
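The construction of (147)–(148) and the gradient formula can be put together in a few lines; a sketch, with the fixed–point problem solved by simple iteration on the mean–zero subspace:

```python
import numpy as np

def diffusion_map_gain(z, h_vals, eps, n_iter=200):
    """Approximate grad phi at the particles via the diffusion map
    construction of appendix A; z is (M, d), h_vals is (M,)."""
    M = z.shape[0]
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    n_eps = np.exp(-d2 / (4.0 * eps))     # unnormalised n_eps(z^j - z^i)
    p_eps = n_eps.mean(axis=1)            # density estimate at the particles
    k = n_eps / np.sqrt(p_eps)[None, :]   # divide by p_eps(z^i)^{1/2}
    k /= k.sum(axis=1, keepdims=True)     # row normalisation c_eps
    dh = h_vals - h_vals.mean()
    phi = np.zeros(M)
    for _ in range(n_iter):               # fixed-point iteration for (147)
        phi = k @ (phi + eps * dh)
        phi -= phi.mean()                 # fix the free constant
    r = phi + eps * dh
    a = (k * r[None, :] - k * (k @ r)[:, None]) / (2.0 * eps)  # a[j, i] = a_ij
    return a @ z                          # row j: grad phi_tilde(z^j)
```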

Appendix B. Regularized Störmer-Verlet for HMC


One is often faced with the task of sampling from a high–dimensional PDF of the form

    π(x) ∝ exp(−V(x)) ,   V(x) := (1/2) (x − x̄)^T B^{−1} (x − x̄) + U(x) ,

for known x̄ ∈ R^{N_x}, B ∈ R^{N_x×N_x}, and U : R^{N_x} → R. The hybrid Monte Carlo (HMC) method (Neal 1996, Liu
2001) has emerged as a popular Markov chain Monte Carlo (MCMC) method for tackling this problem. HMC
relies on a symplectic discretization (Leimkuhler and Reich 2005) of the Hamiltonian equations of motion

    (d/dτ) x = M^{−1} p ,
    (d/dτ) p = −∇_x V(x) = −B^{−1} (x − x̄) − ∇_x U(x)

in an artificial time τ. The conserved energy (or Hamiltonian) is provided by

    H(x, p) = (1/2) p^T M^{−1} p + V(x) .   (149)

The symmetric positive–definite mass matrix M ∈ R^{N_x×N_x} can be chosen arbitrarily, and a natural choice
in terms of sampling efficiency is M = B^{−1} (Beskos et al. 2011). However, when also taking into account
computational efficiency, a Störmer–Verlet discretisation

    p_{n+1/2} = p_n − (∆τ/2) ∇_x V(x_n) ,   (150)
    x_{n+1} = x_n + ∆τ M̃^{−1} p_{n+1/2} ,   (151)
    p_{n+1} = p_{n+1/2} − (∆τ/2) ∇_x V(x_{n+1}) ,   (152)

with step–size ∆τ > 0, mass matrix M = I in (149) and modified mass matrix

    M̃ = I + (∆τ²/4) B^{−1}   (153)

in (151) emerges as an attractive alternative, since it implies

    H(x_n, p_n) = H(x_{n+1}, p_{n+1})

for all ∆τ > 0 provided U (x) ≡ 0. The Störmer–Verlet formulation (150)–(152) is based on a regularised
formulation of Hamiltonian equations of motion for highly–oscillatory systems as discussed, for example, by
Reich and Hundertmark (2011).
Energy conserving time–stepping methods for linear Hamiltonian systems have become an essential build-
ing block for applications of HMC to infinite–dimensional inference problems, where B corresponds to the
discretisation of a positive, self–adjoint and trace–class operator B. See, for example, Beskos et al. (2017).
Note that the Störmer–Verlet discretization (150)–(152) together with (153) can be easily extended to
inference problems with constraints g(x) = 0 (Leimkuhler and Reich 2005) and that (150)–(152) conserves
equilibria, i.e., points x∗ with ∇V (x∗ ) = 0, regardless of the step–size ∆τ .
HMC methods, based on (150)–(152) and (153), can be used to sample from the smoothing distribution of
an SDE as considered in sections 2.2 and 3.3.
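A compact sketch of an HMC sampler built on the steps (150)–(152) with the modified mass matrix (153); it is suitable only for moderate dimensions, where forming B^{−1} explicitly is acceptable:

```python
import numpy as np

def hmc_sample(x0, x_bar, B, U, grad_U, dtau=0.1, n_leap=20, n_samples=1000,
               rng=None):
    """HMC with the regularised Stoermer-Verlet steps (150)-(152)."""
    rng = rng or np.random.default_rng()
    Binv = np.linalg.inv(B)
    M_mod = np.eye(len(x0)) + 0.25 * dtau**2 * Binv   # modified mass (153)
    grad_V = lambda x: Binv @ (x - x_bar) + grad_U(x)
    V = lambda x: 0.5 * (x - x_bar) @ Binv @ (x - x_bar) + U(x)
    H = lambda x, p: 0.5 * p @ p + V(x)               # M = I in (149)
    x, samples = x0.copy(), []
    for _ in range(n_samples):
        p = rng.normal(size=x.shape)
        xn, pn, H0 = x.copy(), p.copy(), H(x, p)
        for _ in range(n_leap):
            pn = pn - 0.5 * dtau * grad_V(xn)            # (150)
            xn = xn + dtau * np.linalg.solve(M_mod, pn)  # (151)
            pn = pn - 0.5 * dtau * grad_V(xn)            # (152)
        if np.log(rng.uniform()) < H0 - H(xn, pn):       # Metropolis step
            x = xn
        samples.append(x.copy())
    return np.array(samples)
```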

Appendix C. Ensemble Kalman filter


We summarise the formulation of an ensemble Kalman filter in the form (94). We start with the stochastic
ensemble Kalman filter (Evensen 2006), which is given by

    Ẑ^j_1 = z^j_1 − K ( h(z^j_1) − y_1 + Θ^j ) ,   Θ^j ∼ N(0, R) ,   (154)

with Kalman gain matrix

    K = P_zh (P_hh + R)^{−1} = (1/(M−1)) Σ_{i=1}^M z^i_1 ( h(z^i_1) − h̄ )^T (P_hh + R)^{−1}

and

    P_hh := (1/(M−1)) Σ_{l=1}^M h(z^l_1) ( h(z^l_1) − h̄ )^T ,   h̄ := (1/M) Σ_{l=1}^M h(z^l_1) .

Formulation (154) can be rewritten in the form (94) with

    p∗_ij = δ_ij − (1/(M−1)) ( h(z^i_1) − h̄ )^T (P_hh + R)^{−1} ( h(z^j_1) − y_1 + Θ^j ) ,   (155)

where δ_ij denotes the Kronecker delta, i.e., δ_ij = 0 if i ≠ j and δ_ii = 1.
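A sketch of the stochastic EnKF written directly in the transform form (155)/(94):

```python
import numpy as np

def enkf_transform(z1, h, y1, R, rng):
    """Stochastic EnKF update (154) via the coefficients (155);
    z1 is (M, d), h : (M, d) -> (M, k), returns the analysis ensemble."""
    M = z1.shape[0]
    hz = h(z1)
    A = hz - hz.mean(axis=0)                      # anomalies h(z^i) - h_bar
    Phh = A.T @ A / (M - 1)
    Theta = rng.multivariate_normal(np.zeros(R.shape[0]), R, M)
    S = np.linalg.solve(Phh + R, (hz - y1 + Theta).T)  # (Phh+R)^{-1}(...)
    P = np.eye(M) - A @ S / (M - 1)               # coefficients (155)
    return (z1.T @ P).T                           # (94): hat z^j = sum z^i p*_ij

# applying the same coefficients P to the ensemble at time t_0 yields the
# smoothing samples discussed below
```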
More generally, one can think of ensemble Kalman filters and their generalisations (Anderson 2010)
as first defining appropriate updates ŷ^i_1 to the predicted y^i_1 = h(z^i_1) using the observed y_1, which are then
extrapolated to the state variable z via linear regression, i.e.,

    ẑ^j_1 = z^j_1 + (1/(M−1)) Σ_{i=1}^M z^i_1 ( h(z^i_1) − h̄ )^T P_hh^{−1} ( ŷ^j_1 − y^j_1 ) ,   (156)

which can easily be reformulated in the form (94) (Reich and Cotter 2015). Note that the consistency result

    H ẑ^i_1 = ŷ^i_1

follows from (156) for linear forward maps h(z) = Hz.


Within such a linear regression framework, one can easily derive ensemble transformations for the particles
z^i_0 at time t = 0. One simply takes the coefficients p∗_ij, as defined for example by an ensemble Kalman filter
(155), and applies them to the z^i_0, i.e.,

    ẑ^j_0 = Σ_{i=1}^M z^i_0 p∗_ij .

These transformed particles can be used to approximate the smoothing distribution π̂_0. See, for example,
Evensen (2006) and Kirchgessner et al. (2017) for more details.
Finally, one can also interpret the EnKF as a continuous update in artificial time s ≥ 0 of the form

    dz^i_s = −P_zh R^{−1} dI^i_s   (157)

with the innovations I^i_s given either by

    dI^i_s = (1/2) ( h(z^i_s) + h̄_s ) ds − y_1 ds   (158)

or, alternatively, by

    dI^i_s = h(z^i_s) ds + R^{1/2} dV^i_s − y_1 ds ,

where V^i_s stands for standard Brownian motion (Bergemann and Reich 2010, Reich 2011, Bergemann and
Reich 2012). Equation (157) with innovation (158) can be given a gradient flow structure (Bergemann and
Reich 2010, Reich and Cotter 2015) of the form

    (d/ds) z^i_s = −P_zz ∇_{z^i} V({z^j_s}) ,   (159)

with potential

    V({z^j}) := ((1−α)/4) Σ_{i=1}^M ( h(z^i) − y_1 )^T R^{−1} ( h(z^i) − y_1 ) + ((1+α)M/4) ( h̄ − y_1 )^T R^{−1} ( h̄ − y_1 )

and α = 0 in case of the standard EnKF, while α > 0 can be seen as a form of variance inflation (Reich and
Cotter 2015).

A theoretical study of such dynamic formulations in the limit s → ∞ has been initiated by Schillings and
Stuart (2017). There is an interesting link to stochastic gradient methods (Bottou, Curtis and Nocedal 2018),
which find application in situations where the dimension of the data y_1 is very high and the computation of
the complete gradient ∇_z h(z) becomes prohibitive. More specifically, the basic concepts of stochastic gradient
methods can be extended to (159) if R is diagonal, in which case one would pick at random paired components
of h and y_1 at the k–th time–step of a discretisation of (159) with the step–size ∆s_k chosen appropriately.
