Abstract—Autonomous learning has been a promising direction in control and robotics for more than a decade, since data-driven
learning allows us to reduce the amount of engineering knowledge that is otherwise required. However, autonomous reinforcement
learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real
systems, such as robots, where many interactions can be impractical and time-consuming. To address this problem, current learning
approaches typically require task-specific knowledge in form of expert demonstrations, realistic simulators, pre-shaped policies, or
specific knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting
more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system.
By explicitly incorporating model uncertainty into long-term planning and controller learning, our approach reduces the effects of model
errors, a key problem in model-based learning. Compared to state-of-the-art RL, our model-based policy search method achieves an
unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.
Index Terms—Policy search, robotics, control, Gaussian processes, Bayesian inference, reinforcement learning
Fig. 1. Effect of model errors. Left: Small data set of observed transitions from an idealized one-dimensional
representation of states and actions (xt, ut) to the next state xt+1 = f(xt, ut). Center: Multiple plausible deterministic
models. Right: Probabilistic model. The probabilistic model describes the uncertainty about the latent function by a
probability distribution on the set of all plausible transition functions. Predictions with deterministic models are claimed
with full confidence, while the probabilistic model expresses its predictive uncertainty by a probability distribution.
...ous state-action domains and, hence, is directly applicable to complex mechanical systems, such as robots.

In this article, we provide a detailed overview of the key ingredients of the PILCO learning framework. In particular, we assess the quality of two different approximate inference methods in the context of policy search. Moreover, we give a concrete example of the importance of Bayesian modeling and inference for fast learning from scratch. We demonstrate that PILCO's unprecedented learning speed makes it directly applicable to realistic control and robotic hardware platforms.

This article is organized as follows: After discussing related work in Sec. 2, we describe the key ideas of the PILCO learning framework in Sec. 3, i.e., the dynamics model, policy evaluation, and gradient-based policy improvement. In Sec. 4, we detail two approaches for long-term predictions for policy evaluation. In Sec. 5, we describe how the policy is represented and practically implemented. A particular cost function and its natural exploration/exploitation trade-off are discussed in Sec. 6. Experimental results are provided in Sec. 7. In Sec. 8, we discuss key properties, limitations, and extensions of the PILCO framework before concluding in Sec. 9.

2 RELATED WORK

Controlling systems under parameter uncertainty has been investigated for decades in robust and adaptive control [4], [35]. Typically, a certainty equivalence principle is applied, which treats estimates of the model parameters as if they were the true values [58]. Approaches to designing adaptive controllers that explicitly take uncertainty about the model parameters into account are stochastic adaptive control [4] and dual control [23]. Dual control aims to reduce parameter uncertainty by explicit probing, which is closely related to the exploration problem in RL. Robust, adaptive, and dual control are most often applied to linear systems [58]; nonlinear extensions exist in special cases [22].

The specification of parametric models for a particular control problem is often challenging and requires intricate knowledge about the system. Sometimes, a rough model estimate with uncertain parameters is sufficient to solve challenging control problems. For instance, in [3], this approach was applied together with locally optimal controllers and temporal bias terms for handling model errors. The key idea was to ground policy evaluations using real-life trials, but not the approximate model.

All above-mentioned approaches to finding controllers require more or less accurate parametric models. These models are problem specific and have to be manually specified, i.e., they are not suited for learning models for a broad range of tasks. Nonparametric regression methods, however, are promising to automatically extract the important features of the latent dynamics from data. In [7], [49], locally weighted Bayesian regression was used as a nonparametric method for learning these models. To deal with model uncertainty, in [7] model parameters were sampled from the parameter posterior, which accounts for temporal correlation. In [49], model uncertainty was treated as noise. The approach to controller learning was based on stochastic dynamic programming in discretized spaces, where the model errors at each time step were assumed independent.

PILCO builds upon the idea of treating model uncertainty as noise [49]. However, unlike [49], PILCO is a policy search method and does not require state space discretization. Instead, closed-form Bayesian averaging over infinitely many plausible dynamics models is possible by using nonparametric GPs.

Nonparametric GP dynamics models in RL were previously proposed in [17], [30], [46], where the GP training data were obtained from "motor babbling". Unlike PILCO, these approaches model global value functions to derive policies, requiring accurate value function models. To reduce the effect of model errors in the value functions, many data points are necessary as value functions are often discontinuous, rendering value-function based methods in high-dimensional state spaces often statistically and computationally impractical. Therefore, [17], [19], [46], [57] propose to learn GP value function models to address the issue of model errors in the value function. However, these methods can usually only be applied to low-dimensional RL problems. As a policy search method, PILCO does not require an explicit global value function model but rather searches directly in policy space.
[Figure residue: legend entries "Ground truth", "Moment matching", "Linearization" (cf. Fig. 2); axis ∆t over (x̃t, ut).]

...that the expected cost in (11) is differentiable with respect to the moments of the state distribution. Moreover, we assume that the moments of the control distribution µu

..., respectively, where µ∆^a is known from (17). The off-diagonal terms σ²_ab do not contain the additional term E_{x̃t}[cov_f[∆a, ∆b | x̃t]] because of the conditional independence assumption of the GP models: Different target dimensions do not covary for given x̃t.

We start the computation of the covariance matrix with the terms that are common to both the diagonal and the off-diagonal entries:

E_{x̃t,f}[x̃t ∆t^a] = ∫ x̃t ∑_{i=1}^{n} β_ai k_fa(x̃t, x̃i) p(x̃t) dx̃t ,   (29)

where the (posterior) GP mean function m_f(x̃t) was represented as a finite kernel expansion. Note that the x̃i are the state-action pairs that were used to train the dynamics GP model. By pulling the constant β_ai out of
the integral and changing the order of summation and integration, we obtain

E_{x̃t,f}[x̃t ∆t^a] = ∑_{i=1}^{n} β_ai ∫ x̃t k_fa(x̃t, x̃i) p(x̃t) dx̃t   (30)

for all predictive dimensions a = 1, . . . , E. With c₁c₂^{-1} = q_ai, see (19), and ψ_i = Σ̃t(Σ̃t + Λa)^{-1} x̃i + Λa(Σ̃t + Λa)^{-1} µ̃t, we simplify (31) and obtain

cov_{x̃t,f}[x̃t, ∆t^a] = ∑_{i=1}^{n} β_ai q_ai Σ̃t(Σ̃t + Λa)^{-1}(x̃i − µ̃t) ,   (32)

a = 1, . . . , E. The desired covariance cov[xt, ∆t] is a D × E submatrix of the (D + F) × E cross-covariance computed in (32).

A visualization of the approximation of the predictive distribution by means of exact moment matching is given in Fig. 2.

4.2 Linearization of the Posterior GP Mean Function

An alternative way of approximating the predictive distribution p(∆t) by a Gaussian for x̃t ∼ N(x̃t | µ̃t, Σ̃t) is to linearize the posterior GP mean function. Fig. 2 visualizes the approximation by means of linearizing the posterior GP mean function.

The predicted mean is obtained by evaluating the posterior GP mean in (5) at the mean µ̃t of the input distribution, i.e.,

µ∆^a = E_f[fa(µ̃t)] = m_fa(µ̃t) = βa^⊤ ka(X̃, µ̃t) ,   (33)

a = 1, . . . , E, where βa is given in (18).

To compute the GP predictive covariance matrix Σ∆, we explicitly linearize the posterior GP mean function around µ̃t. By applying standard results for mapping Gaussian distributions through linear models, the predictive covariance is given by
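A toy illustration may help to contrast the two approximations. The sketch below is our own construction, not code from the paper: it propagates a Gaussian input through a fixed nonlinear function standing in for the posterior GP mean, compares the linearization of Sec. 4.2 with brute-force Monte Carlo, and ignores the GP's own model uncertainty, so it only illustrates the geometry of the approximation.

```python
# Toy comparison (assumed example): linearization vs. Monte Carlo moments.
import numpy as np

def f(x):                      # stand-in for the posterior GP mean function
    return np.sin(x) + 0.5 * np.sin(3 * x)

def df(x, h=1e-5):             # numerical slope used by the linearization
    return (f(x + h) - f(x - h)) / (2 * h)

mu, s2 = 0.9, 0.5**2           # input distribution x ~ N(mu, s2)

# (a) Linearization: evaluate f at the mean, scale the variance by slope^2.
lin_mean, lin_var = f(mu), df(mu)**2 * s2

# (b) "Ground truth" moments of p(f(x)) by brute-force sampling.
x = np.random.default_rng(0).normal(mu, np.sqrt(s2), 1_000_000)
print(f"linearization: mean={lin_mean:.3f}, var={lin_var:.3f}")
print(f"Monte Carlo  : mean={f(x).mean():.3f}, var={f(x).var():.3f}")
```

Near flat or strongly curved regions of the function, the linearized variance can be much too small, which is the over-confidence reported for the linearization-based learner in Sec. 7.1.2.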
5.1 Predictive Distribution over Controls

During the long-term predictions, the states are given by a probability distribution p(xt), t = 0, . . . , T. The probability distribution of the state xt induces a predictive distribution p(ut) = p(π(xt)) over controls, even when the policy is deterministic. We approximate the distribution over controls using moment matching, which is in many interesting cases analytically tractable.

5.2 Constrained Control Signals

In practical applications, force or torque limits are present and must be accounted for during planning. Suppose the control limits are such that u ∈ [−umax, umax]. Let us consider a preliminary policy π̃ with an unconstrained amplitude. To account for the control limits coherently during simulation, we squash the preliminary policy π̃ through a bounded and differentiable squashing function, which limits the amplitude of the final policy π. As a squashing function, we use

σ(x) = (9/8) sin(x) + (1/8) sin(3x) ∈ [−1, 1] ,   (36)

which is the third-order Fourier series expansion of a trapezoidal wave, normalized to the interval [−1, 1]. The squashing function in (36) is computationally convenient as we can analytically compute predictive moments for Gaussian distributed states. Subsequently, we multiply the squashed policy by umax and obtain the final policy

π(x) = umax σ(π̃(x)) ∈ [−umax, umax] ,   (37)
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 5, MAY 2014 7
π(x)
π̃(x)
0 0
policy π̃ by
−1 −1 N
X
−5 0 5 −5 0 5 π̃(x∗ ) = k(mi , x∗ )(K + σπ2 I)−1 t = k(M , x∗ )> α , (41)
x x
i=1
(a) Preliminary policy π̃ as a func- (b) Policy π = σ(π̃(x)) as a func-
tion of the state. tion of the state.
where x∗ is a test input, α = (K + 0.01I)−1 t, where
Fig. 3. Constraining the control signal. Panel (a) shows t plays the role of a GP’s training targets. In (41),
an example of an unconstrained preliminary policy π̃ as a M = [m1 , . . . , mN ] are the centers of the (axis-aligned)
function of the state x. Panel (b) shows the constrained Gaussian basis functions
policy π(x) = σ(π̃(x)) as a function of the state x.
k(xp , xq ) = exp − 12 (xp − xq )> Λ−1 (xp − xq ) . (42)
an illustration of which is shown in Fig. 3. Although the We call the policy representation in (41) a deterministic GP
squashing function in (36) is periodic, it is almost always with a fixed number of N basis functions. Here, “deter-
used within a half wave if the preliminary policy π̃ is ministic” means that there is no uncertainty about the
initialized to produce function values that do not exceed underlying function, that is, varπ̃ [π̃(x)] = 0. Therefore,
the domain of a single period. Therefore, the periodicity the deterministic GP is a degenerate model, which is
does not matter in practice. functionally equivalent to a regularized RBF network.
To compute a distribution over constrained control The deterministic GP is functionally equivalent to the
signals, we execute the following steps: posterior GP mean function in (6), where we set the
p(xt ) 7→ p(π̃(xt )) 7→ p(umax σ(π̃(xt ))) = p(ut ) . (38) signal variance to 1, see (42), and the noise variance to
0.01. As the preliminary policy will be squashed through
First, we map the Gaussian state distribution p(xt ) σ in (36) whose relevant support is the interval [− π2 , π2 ],
through the preliminary (unconstrained) policy π̃. Thus, a signal variance of 1 is about right. Setting additionally
we require a preliminary policy π̃ that allows for closed- the noise standard deviation to 0.1 corresponds to fixing
form computation of the moments of the distribution the signal-to-noise ratio of the policy to 10 and, hence,
over controls p(π̃(xt )). Second, we squash the approx- the regularization.
imate Gaussian distribution p(π̃(x)) according to (37) For a Gaussian distributed state x∗ ∼ N (µ∗ , Σ∗ ), the
and compute exactly the mean and variance of p(π̃(x)). predictive mean of π̃(x∗ ) as defined in (41) is given as
Details are given in the Appendix. We approximate
p(π̃(x)) by a Gaussian with these moments, yielding the Ex∗ [π̃(x∗ )] = α>
a Ex∗ [k(M , x∗ )]
distribution p(u) over controls in (38). Z
= α>
a k(M , x∗ )p(x∗ ) dx∗ = α>
a ra , (43)
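As a concrete instance of the second step in (38), the following sketch (ours; all numbers made up) computes the exact mean of u = umax σ(π̃(x)) for a Gaussian π̃(x) ∼ N(m, s²), using the trigonometric integrals of Appendix A, and checks it against Monte Carlo. The variance follows analogously from the sin² and sin·sin identities.

```python
import numpy as np

def squash(v):
    # Squashing function (36): third-order Fourier expansion of a trapezoid.
    return 9/8 * np.sin(v) + 1/8 * np.sin(3 * v)

def mean_squashed(m, s2, u_max=1.0):
    # E[u_max * squash(v)] for v ~ N(m, s2), using E[sin(v)] = exp(-s2/2) sin(m)
    # from Appendix A; note 3v ~ N(3m, 9 s2).
    return u_max * (9/8 * np.exp(-s2/2) * np.sin(m)
                    + 1/8 * np.exp(-9*s2/2) * np.sin(3*m))

m, s2, u_max = 0.4, 0.3, 2.0                       # made-up moments and limit
v = np.random.default_rng(1).normal(m, np.sqrt(s2), 1_000_000)
print("analytic :", mean_squashed(m, s2, u_max))
print("MC       :", u_max * squash(v).mean())
```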
5.3 Representations of the Preliminary Policy

In the following, we present two representations of the preliminary policy π̃, which allow for closed-form computations of the mean and covariance of p(π̃(x)) when the state x is Gaussian distributed. We consider both a linear and a nonlinear representation of π̃.

5.3.1 Linear Policy

The linear preliminary policy is given by

π̃(x∗) = Ax∗ + b ,   (39)

where A is a parameter matrix of weights and b is an offset vector. In each control dimension d, the policy in (39) is a linear combination of the states (the weights are given by the dth row in A) plus an offset bd.

The predictive distribution p(π̃(x∗)) for a state distribution x∗ ∼ N(µ∗, Σ∗) is an exact Gaussian with mean and covariance

E_{x∗}[π̃(x∗)] = Aµ∗ + b ,   cov_{x∗}[π̃(x∗)] = AΣ∗A^⊤ ,   (40)

respectively. A drawback of the linear policy is that it is not flexible. However, a linear controller can often be used for stabilization around an equilibrium.

5.3.2 Nonlinear Policy

In the nonlinear case, we represent the preliminary policy π̃ by

π̃(x∗) = ∑_{i=1}^{N} k(mi, x∗) ((K + σπ² I)^{-1} t)_i = k(M, x∗)^⊤ α ,   (41)

where x∗ is a test input, α = (K + 0.01 I)^{-1} t, and t plays the role of a GP's training targets. In (41), M = [m1, . . . , mN] are the centers of the (axis-aligned) Gaussian basis functions

k(xp, xq) = exp(−½ (xp − xq)^⊤ Λ^{-1} (xp − xq)) .   (42)

We call the policy representation in (41) a deterministic GP with a fixed number of N basis functions. Here, "deterministic" means that there is no uncertainty about the underlying function, that is, var_π̃[π̃(x)] = 0. Therefore, the deterministic GP is a degenerate model, which is functionally equivalent to a regularized RBF network.

The deterministic GP is functionally equivalent to the posterior GP mean function in (6), where we set the signal variance to 1, see (42), and the noise variance to 0.01. As the preliminary policy will be squashed through σ in (36), whose relevant support is the interval [−π/2, π/2], a signal variance of 1 is about right. Setting additionally the noise standard deviation to 0.1 corresponds to fixing the signal-to-noise ratio of the policy to 10 and, hence, the regularization.

For a Gaussian distributed state x∗ ∼ N(µ∗, Σ∗), the predictive mean of π̃(x∗) as defined in (41) is given as

E_{x∗}[π̃a(x∗)] = α_a^⊤ E_{x∗}[k(M, x∗)] = α_a^⊤ ∫ k(M, x∗) p(x∗) dx∗ = α_a^⊤ r_a ,   (43)

where for i = 1, . . . , N and all policy dimensions a = 1, . . . , F

r_ai = |Σ∗Λ_a^{-1} + I|^{-1/2} exp(−½ (µ∗ − mi)^⊤ (Σ∗ + Λ_a)^{-1} (µ∗ − mi)) .

The diagonal matrix Λ_a contains the squared length-scales ℓ_i, i = 1, . . . , D. The predicted mean in (43) is equivalent to the standard predicted GP mean in (17).

For a, b = 1, . . . , F, the entries of the predictive covariance matrix are computed according to

cov_{x∗}[π̃a(x∗), π̃b(x∗)] = E_{x∗}[π̃a(x∗)π̃b(x∗)] − E_{x∗}[π̃a(x∗)] E_{x∗}[π̃b(x∗)] ,

where E_{x∗}[π̃{a,b}(x∗)] is given in (43). Hence, we focus on the term E_{x∗}[π̃a(x∗)π̃b(x∗)], which for a, b = 1, . . . , F is given by

E_{x∗}[π̃a(x∗)π̃b(x∗)] = α_a^⊤ E_{x∗}[k_a(M, x∗) k_b(M, x∗)^⊤] α_b = α_a^⊤ Q α_b .
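To make (43) concrete, the sketch below (our own toy setup, with arbitrary numbers) evaluates the predictive mean of a deterministic-GP policy for a Gaussian state and verifies it by sampling; Λ_a is diagonal as in the text.

```python
import numpy as np
rng = np.random.default_rng(2)

D, N = 2, 5                                # state dimension, basis functions
M = rng.normal(size=(N, D))                # basis function centers m_i
alpha = rng.normal(size=N)                 # weights alpha for one control dim
Lam = np.diag([0.5, 1.5])                  # Lambda_a: squared length-scales
mu, Sigma = np.array([0.2, -0.4]), np.diag([0.2, 0.1])  # state x* ~ N(mu, Sigma)

# Predictive mean (43): E[pi~(x*)] = alpha^T r, with
# r_i = |Sigma Lam^-1 + I|^(-1/2) exp(-1/2 (mu-m_i)^T (Sigma+Lam)^-1 (mu-m_i)).
c = np.linalg.det(Sigma @ np.linalg.inv(Lam) + np.eye(D)) ** -0.5
r = np.array([c * np.exp(-0.5 * (mu - m) @ np.linalg.solve(Sigma + Lam, mu - m))
              for m in M])
print("analytic mean:", alpha @ r)

# Monte Carlo check: sample states and evaluate the deterministic GP (41).
xs = rng.multivariate_normal(mu, Sigma, 500_000)
K = np.exp(-0.5 * (((xs[:, None, :] - M[None, :, :]) ** 2) / np.diag(Lam)).sum(-1))
print("MC mean      :", (K @ alpha).mean())
```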
Algorithm 2 Computing the Successor State Distribution
1: init: xt ∼ N(µt, Σt)
2: Control distribution p(ut) = p(umax σ(π̃(xt, θ)))
3: Joint state-control distribution p(x̃t) = p(xt, ut)
4: Predictive GP distribution of change in state p(∆t)
5: Distribution of successor state p(xt+1)

For i, j = 1, . . . , N, we compute the entries of Q as

Q_ij = ∫ k_a(mi, x∗) k_b(mj, x∗) p(x∗) dx∗
     = k_a(mi, µ∗) k_b(mj, µ∗) |R|^{-1/2} exp(½ z_ij^⊤ T z_ij) ,
R = Σ∗(Λ_a^{-1} + Λ_b^{-1}) + I ,   T^{-1} = Λ_a^{-1} + Λ_b^{-1} + Σ∗^{-1} ,
z_ij = Λ_a^{-1}(µ∗ − mi) + Λ_b^{-1}(µ∗ − mj) .

Combining this result with (43) fully determines the predictive covariance matrix of the preliminary policy. Unlike the predictive covariance of a probabilistic GP, see (21)–(22), the predictive covariance matrix of the deterministic GP does not comprise any model uncertainty in its diagonal entries.
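The closed form of Q_ij can be sanity-checked numerically. The snippet below (ours, with arbitrary numbers) compares the expression above against a Monte Carlo estimate of E_{x∗}[k_a(mi, x∗) k_b(mj, x∗)].

```python
import numpy as np
rng = np.random.default_rng(3)

D = 2
La, Lb = np.diag([0.6, 1.2]), np.diag([0.9, 0.4])    # Lambda_a, Lambda_b
mi, mj = rng.normal(size=D), rng.normal(size=D)      # two basis centers
mu, Sigma = rng.normal(size=D), np.array([[0.3, 0.05], [0.05, 0.2]])

def kern(x, c, L):
    d = x - c
    return np.exp(-0.5 * d @ np.linalg.solve(L, d))

# Closed form as reconstructed above.
iLa, iLb = np.linalg.inv(La), np.linalg.inv(Lb)
R = Sigma @ (iLa + iLb) + np.eye(D)
T = np.linalg.inv(iLa + iLb + np.linalg.inv(Sigma))
z = iLa @ (mu - mi) + iLb @ (mu - mj)
Q = (kern(mu, mi, La) * kern(mu, mj, Lb)
     * np.linalg.det(R)**-0.5 * np.exp(0.5 * z @ T @ z))
print("closed form:", Q)

# Monte Carlo estimate of E[k_a(m_i, x*) k_b(m_j, x*)].
xs = rng.multivariate_normal(mu, Sigma, 1_000_000)
da, db = xs - mi, xs - mj
mc = np.mean(np.exp(-0.5 * (da @ iLa * da).sum(1))
             * np.exp(-0.5 * (db @ iLb * db).sum(1)))
print("Monte Carlo:", mc)
```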
5.4 Policy Parameters

In the following, we describe the policy parameters for both the linear and the nonlinear policy⁴.

5.4.1 Linear Policy

The linear policy in (39) possesses D + 1 parameters per control dimension: For control dimension d there are D weights in the dth row of the matrix A. One additional parameter originates from the offset parameter bd.

5.4.2 Nonlinear Policy

The parameters of the deterministic GP in (41) are the locations M of the centers (DN parameters), the (shared) length-scales of the Gaussian basis functions (D length-scale parameters per target dimension), and the N targets t per target dimension. In the case of multivariate controls, the basis function centers M are shared.

5.5 Computing the Successor State Distribution

Alg. 2 summarizes the computational steps required to compute the successor state distribution p(xt+1) from p(xt). The computation of a distribution over controls p(ut) from the state distribution p(xt) requires two steps: First, for a Gaussian state distribution p(xt) at time t, a Gaussian approximation of the distribution p(π̃(xt)) of the preliminary policy is computed analytically. Second, the preliminary policy is squashed through σ and an approximate Gaussian distribution of p(umax σ(π̃(xt))) is computed analytically in (38) using results from the Appendix. Third, we analytically compute a Gaussian approximation to the joint distribution p(xt, ut) = p(xt, π(xt)). For this, we compute (a) a Gaussian approximation to the joint distribution p(xt, π̃(xt)), which is exact if π̃ is linear, and (b) an approximate fully joint Gaussian distribution p(xt, π̃(xt), ut). We obtain cross-covariance information between the state xt and the control signal ut = umax σ(π̃(xt)) via

cov[xt, ut] = cov[xt, π̃(xt)] cov[π̃(xt), π̃(xt)]^{-1} cov[π̃(xt), ut] ,

where we exploit the conditional independence of xt and ut given π̃(xt). Then, we integrate π̃(xt) out to obtain the desired joint distribution p(xt, ut). This leads to an approximate Gaussian joint probability distribution p(xt, ut) = p(xt, π(xt)) = p(x̃t). Fourth, with the approximate Gaussian input distribution p(x̃t), the distribution p(∆t) of the change in state is computed using the results from Sec. 4. Finally, the mean and covariance of a Gaussian approximation of the successor state distribution p(xt+1) are given by (9) and (10), respectively.

All required computations can be performed analytically because of the choice of the Gaussian covariance function for the GP dynamics model, see (3), the representations of the preliminary policy π̃, see Sec. 5.3, and the choice of the squashing function, see (36).
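The following mock-up (ours) traces the five steps of Alg. 2 on a 1D toy system, using plain Monte Carlo in place of the paper's analytic moment matching; the dynamics stand-in for the GP model is made up.

```python
import numpy as np
rng = np.random.default_rng(4)

u_max, a, b = 2.0, -0.8, 0.1                                # toy linear policy
squash = lambda v: 9/8 * np.sin(v) + 1/8 * np.sin(3*v)      # squashing (36)

# 1: init x_t ~ N(mu_t, Sigma_t), represented here by samples
x = rng.normal(0.5, 0.3, 500_000)
# 2: control distribution p(u_t) = p(u_max * squash(pi~(x_t)))
u = u_max * squash(a * x + b)
# 3: joint state-control distribution p(x~_t) = p(x_t, u_t)
print("cov(x_t, u_t):", np.cov(x, u)[0, 1])
# 4: distribution p(Delta_t) of the change in state (made-up dynamics)
delta = 0.1 * np.sin(x) + 0.2 * u + rng.normal(0.0, 0.05, x.size)
# 5: successor state distribution p(x_{t+1}), summarized by two moments
x_next = x + delta
print("x_{t+1}: mean %.3f, var %.3f" % (x_next.mean(), x_next.var()))
```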
6 COST FUNCTION

In our learning set-up, we use a cost function that solely penalizes the Euclidean distance d of the current state to the target state. Using only distance penalties is often sufficient to solve a task: Reaching a target xtarget with high speed naturally leads to overshooting and, thus, to high long-term costs. In particular, we use the generalized binary saturating cost

c(x) = 1 − exp(−d(x, xtarget)² / (2σc²)) ∈ [0, 1] ,   (44)

which is locally quadratic but saturates at unity for large deviations d from the desired target xtarget. In (44), the geometric distance from the state x to the target state is denoted by d, and the parameter σc controls the width of the cost function.⁵

In classical control, typically a quadratic cost is assumed. However, a quadratic cost tends to focus attention on the worst deviation from the target state along a predicted trajectory. In the early stages of learning, the predictive uncertainty is large and, therefore, the policy gradients, which are described in Sec. 3.3, become less useful. Therefore, we use the saturating cost in (44) as a default within the PILCO learning framework.

The immediate cost in (44) is an unnormalized Gaussian with mean xtarget and variance σc², subtracted from unity.
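For a Gaussian state distribution, the expected saturating cost is a standard Gaussian integral, since (44) is one minus an unnormalized Gaussian centered at the target. The sketch below (ours; cost parameters made up) implements (44) and this expectation and checks it by Monte Carlo.

```python
import numpy as np
rng = np.random.default_rng(5)

def cost(x, target, sigma_c):
    # Saturating cost (44): 1 - exp(-d^2 / (2 sigma_c^2)), d = ||x - target||.
    d2 = ((x - target) ** 2).sum(axis=-1)
    return 1.0 - np.exp(-0.5 * d2 / sigma_c**2)

def expected_cost(mu, Sigma, target, sigma_c):
    # E[c(x)] for x ~ N(mu, Sigma), via the Gaussian-times-Gaussian integral.
    D = len(mu)
    det = np.linalg.det(np.eye(D) + Sigma / sigma_c**2) ** -0.5
    S = Sigma + sigma_c**2 * np.eye(D)
    return 1.0 - det * np.exp(-0.5 * (mu - target) @ np.linalg.solve(S, mu - target))

mu, Sigma = np.array([0.3, -0.2]), np.diag([0.04, 0.09])
target, sigma_c = np.zeros(2), 0.25
xs = rng.multivariate_normal(mu, Sigma, 1_000_000)
print("analytic:", expected_cost(mu, Sigma, target, sigma_c))
print("MC      :", cost(xs, target, sigma_c).mean())
```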
4. For notational convenience, with a (non)linear policy we mean the (non)linear preliminary policy π̃ mapped through the squashing function σ and subsequently multiplied by umax.
5. In the context of sensorimotor control, the saturating cost function in (44) resembles the cost function in human reasoning as experimentally validated by [31].
[Figure residue: log-scale CPU computation time in s over state space dimensionality 5–50, for GP training set sizes of 100, 250, 500, and 1000 training points; panels (a) Linearizing the mean function, (b) Moment matching.]

...GP training set sizes of n = 100, 250, 500, 1000 data points. We set the predictive dimension E = D. The CPU time (single core) for computing a predictive state distribution and the required derivatives are shown

Fig. 7. Results for the cart-pole swing-up task. (a) Learning curves for moment matching and linearization (simulation task), (b) required interaction time for solving the cart-pole swing-up task compared with other algorithms (C: Coulom 2002; KK: Kimura & Kobayashi 1999; D: Doya 2000; WP: Wawrzynski & Pacut 2004; RT: Raiko & Tornio 2009; pilco: Deisenroth & Rasmussen 2011).

7.1.2.2 Learning Speed: For eight different random initial trajectories and controller initializations, PILCO followed Alg. 1 to learn policies. In the cart-pole swing-up task, PILCO learned for 15 episodes, which corresponds to a total of 37.5 s of data. In the double-pendulum swing-up task, PILCO learned for 30 episodes, corresponding to a total of 75 s of data. To evaluate the learning progress, we applied the learned controllers after each policy search (see line 10 in Alg. 1) 20 times for 2.5 s, starting from 20 different initial states x0 ∼ p(x0). The learned controller was considered successful when the tip of the pendulum was close to the target location from 2 s to 2.5 s, i.e., at the end of the rollout.

• Cart-Pole Swing-Up. Fig. 7(a) shows PILCO's average learning success for the cart-pole swing-up task as a function of the total experience. We evaluated both approximate inference methods for policy evaluation, moment matching and linearization of the posterior GP mean function. Fig. 7(a) shows that learning using the computationally more demanding moment matching is more reliable than using the computationally more advantageous linearization. On average, after 15 s–20 s of experience, PILCO reliably, i.e., in ≈ 95% of the test runs, solved the cart-pole swing-up task, whereas the linearization resulted in a success rate of about 83%. Fig. 7(b) relates PILCO's learning speed (blue bar) to other RL methods (black bars), which solved the cart-pole swing-up task from scratch, i.e., without human demonstrations or known dynamics models [11], [18], [27], [45], [56]. Dynamics models were only learned in [18], [45], using RBF networks and multi-layered perceptrons, respectively. In all cases without state-space discretization, cost functions similar to ours (see (44)) were used. Fig. 7(b) stresses PILCO's data efficiency: PILCO outperforms any other currently existing RL algorithm by at least one order of magnitude.

Fig. 8. Average success as a function of the total data used for learning (double pendulum swing-up). The blue error bars show the 95% confidence bounds of the standard error for the moment matching (MM) approximation, the red area represents the corresponding confidence bounds of success when using approximate inference by means of linearizing the posterior GP mean (Lin).

• Double-Pendulum Swing-Up with Two Actuators. Fig. 8 shows the learning curves for the double-pendulum swing-up task when using either moment matching or mean function linearization for approximate inference during policy evaluation. Fig. 8 shows that PILCO learns faster (learning already kicks in after 20 s of data) and overall more successfully with moment matching. Policy evaluation based on linearization of the posterior GP mean function achieved about 80% success on average, whereas moment matching on average solved the task reliably after about 50 s of data with a success rate ≈ 95%.

Summary. We have seen that both approximate inference methods have pros and cons: Moment matching requires more computational resources than linearization, but learns faster and more reliably. The reason why linearization did not reliably succeed in learning the tasks is that it gets relatively easily stuck in local minima, which is largely a result of underestimating predictive variances, an example of which is given in Fig. 2. Propagating too confident predictions over a longer horizon often worsens the problem. Hence, in the following, we focus solely on the moment matching approximation.

7.1.3 Quality of the Gaussian Approximation

PILCO strongly relies on the quality of approximate inference, which is used for long-term predictions and policy evaluation, see Sec. 4. We already saw differences
(a) Robotic unicycle. (b) Histogram (after 1,000 test runs) of the distances of the flywheel from being upright.
Fig. 10. Robotic unicycle system and simulation results. The state space is R^12, the control space R^2.

Fig. 11. Real cart-pole system [15]. Snapshots of a controlled trajectory of 20 s length after having learned the task. To solve the swing-up plus balancing, PILCO required only 17.5 s of interaction with the physical system.

Fig. 12. Low-cost robotic arm by Lynxmotion [1]. The manipulator does not provide any pose feedback. Hence, PILCO learns a controller directly in the task space using visual feedback from a PrimeSense depth camera.

...comprised the commanded pulse widths for the first four servo motors. Wrist rotation and gripper opening/closing were not learned. For block tracking, we used real-time (50 Hz) color-based region growing to estimate the extent and 3D center of the object, which was used as the state x ∈ R^3 by PILCO.

As an initial state distribution, we chose p(x0) = N(x0 | µ0, Σ0) with µ0 being a single noisy measurement of the 3D camera coordinates of the block in the gripper when the robot was in its initial configuration. The initial covariance Σ0 was diagonal, where the 95%-confidence bounds were the edge length b of the block. Similarly, the target state was set based on a single noisy measurement using the PrimeSense camera. We used linear preliminary policies, i.e., π̃(x) = u = Ax + b, and initialized the controller parameters θ = {A, b} ∈ R^16 to zero. The Euclidean distance d of the end effector from the camera was approximately 0.7 m–2.0 m, depending on the robot's configuration. The cost function in (44) penalized the Euclidean distance of the block in the gripper from its desired target location on top of the current tower. Both the frequency at which the controls were changed and the time discretization were set to 2 Hz; the planning horizon T was 5 s. After 5 s, the robot opened the gripper and released the block.

We split the task of building a tower into learning individual controllers for each target block B2–B6 (bottom to top), see Fig. 12, starting from a configuration in which the robot arm was upright. All independently trained controllers shared the same initial trial.

The motion of the block in the end effector was modeled by GPs. The inferred system noise standard deviations, which comprised stochasticity of the robot arm, synchronization errors, delays, image processing errors, etc., ranged from 0.5 cm to 2.0 cm. Here, the y-coordinate, which corresponded to the height, suffered from larger noise than the other coordinates. The reason for this is that the robot movement was particularly jerky in the up/down movements. These learned noise levels were in the right ballpark since they were slightly larger than the expected camera noise [2]. The signal-to-noise ratio in our experiments ranged from 2 to 6.

A total of ten learning-interacting iterations (including the random initial trial) generally sufficed to learn both good forward models and good controllers, as shown in Fig. 13(a), which displays the learning curve for a typical training session, averaged over ten test runs after each learning stage and all blocks B2–B6. The effects of learning became noticeable after about four learning iterations. After 10 learning iterations, the block in the gripper was expected to be very close (approximately at noise level) to the target. The required interaction time sums up to only 50 s per controller and 230 s in total (the initial random trial is counted only once). This speed of learning is difficult to achieve by other RL methods that learn from scratch, as shown in Sec. 7.1.1.2.

Fig. 13(b) gives some insights into the quality of the learned forward model after 10 controlled trials. It shows the marginal predictive distributions and the actual trajectories of the block in the gripper. The robot learned to pay attention to stabilizing the y-coordinate quickly: Moving the arm up/down caused relatively large "system noise" as the arm was quite jerky in this direction: In the y-coordinate, the predictive marginal distribution noticeably increases between 0 s and 2 s. As soon as the y-coordinate was stabilized, the predictive uncertainty in all three coordinates collapsed. Videos of the block-stacking robot are available at http://www.youtube.com/user/PilcoLearner.

8 DISCUSSION

We have shed some light on essential ingredients for successful and efficient policy learning: (1) a probabilistic forward model with a faithful representation of model uncertainty and (2) Bayesian inference. We focused on very basic representations: GPs for the probabilistic forward model and Gaussian distributions for the state and control distributions. More expressive representations and Bayesian inference methods are conceivable to account for multi-modality, for instance. However, even with our current set-up, PILCO can already learn complex control and robotics tasks. In [8], our framework was used in an industrial application for throttle valve control in a combustion engine.

PILCO is a model-based policy search method, which uses the GP forward model to predict state sequences given the current policy. These predictions are based on deterministic approximate inference, e.g., moment matching. Unlike all model-free policy search methods, which are inherently based on sampling trajectories [14], PILCO exploits the learned GP model to compute analytic gradients of an approximation to the expected long-term cost Jπ for policy search. Finite differences or more efficient sampling-based approximations of the gradients require many function evaluations, which limits the effective number of policy parameters [14], [42]. Instead, PILCO computes the gradients analytically and, therefore, can learn thousands of policy parameters [15].
(a) Average learning curve (block-stacking task). The horizontal axis shows the learning stage, the vertical axis the average distance to the target at the end of the episode. (b) Marginal long-term predictive distributions and actually incurred trajectories. The red lines show the trajectories of the block in the end effector, the two dashed blue lines represent the 95% confidence intervals of the corresponding multi-step ahead predictions using moment matching. The target state is marked by the straight lines. All coordinates are measured in cm.
Fig. 13. Robot block stacking task: (a) Average learning curve with 95% standard error, (b) Long-term predictions.
It is possible to exploit the learned GP model for sampling trajectories using the PEGASUS algorithm [39], for instance. Sampling with GPs can be straightforwardly parallelized, and was exploited in [32] for learning meta controllers. However, even with high parallelization, policy search methods based on trajectory sampling do usually not rely on gradients [7], [30], [32], [40] and are practically limited by a relatively small number of a few tens of policy parameters they can manage [38].⁷

In Sec. 6.1, we discussed PILCO's natural exploration property as a result of Bayesian averaging. It is, however, also possible to explicitly encourage additional exploration in a UCB (upper confidence bounds) sense [6]: Instead of summing up expected immediate costs, see (2), we would add the sum of cost standard deviations, weighted by a factor κ ∈ R. Then, Jπ(θ) = ∑t (E[c(xt)] + κσ[c(xt)]). This type of utility function is also often used in experimental design [10] and Bayesian optimization [9], [33], [41], [51] to avoid getting stuck in local minima. Since PILCO's approximate state distributions p(xt) are Gaussian, the cost standard deviations σ[c(xt)] can often be computed analytically. For further details, we refer the reader to [12].
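A small sketch (ours) of this UCB-style objective, estimating E[c(xt)] and σ[c(xt)] per time step by Monte Carlo for a made-up sequence of Gaussian state marginals (the paper computes these terms analytically where possible):

```python
import numpy as np
rng = np.random.default_rng(6)

def cost(x, target=0.0, sigma_c=0.25):
    return 1.0 - np.exp(-0.5 * (x - target)**2 / sigma_c**2)   # cf. (44), 1D

def ucb_objective(means, variances, kappa, n=200_000):
    # J(theta) = sum_t ( E[c(x_t)] + kappa * std[c(x_t)] )
    J = 0.0
    for m, v in zip(means, variances):
        c = cost(rng.normal(m, np.sqrt(v), n))
        J += c.mean() + kappa * c.std()
    return J

# Made-up predicted rollout: state marginals p(x_t) = N(m_t, v_t).
means = np.linspace(1.0, 0.0, 10)        # approaching the target at 0
variances = np.linspace(0.3, 0.05, 10)   # shrinking predictive uncertainty
print("kappa=0 :", ucb_objective(means, variances, kappa=0.0))
print("kappa=1 :", ucb_objective(means, variances, kappa=1.0))
# kappa < 0 rewards uncertain states (exploration); kappa > 0 is cautious.
```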
One of PILCO's key benefits is the reduction of model errors by explicitly incorporating model uncertainty into planning and control. PILCO, however, does not take temporal correlation into account. Instead, model uncertainty is treated as noise, which can result in an underestimation of model uncertainty [49]. On the other hand, the moment-matching approximation used for approximate inference is typically a conservative approximation.

In this article, we focused on learning controllers in MDPs with transition dynamics that suffer from system noise, see (1). The case of measurement noise is more challenging: Learning the GP models is a real challenge since we no longer have direct access to the state. However, approaches for training GPs with noise on both the training inputs and training targets yield initial promising results [36]. For a more general POMDP set-up, Gaussian Process Dynamical Models (GPDMs) [29], [54] could be used for learning both a transition mapping and the observation mapping. However, GPDMs typically need a good initialization [53] since the learning problem is very high dimensional.

In [25], the PILCO framework was extended to allow for learning reference tracking controllers instead of solely controlling the system to a fixed target location. In [16], we used PILCO for planning and control in constrained environments, i.e., environments with obstacles. This learning set-up is important for practical robot applications. By discouraging obstacle collisions in the cost function, PILCO was able to find paths around obstacles without ever colliding with them, not even during training. Initially, when the model was uncertain, the policy was conservative to stay away from obstacles.

The PILCO framework has been applied in the context of model-based imitation learning to learn controllers that minimize the Kullback-Leibler divergence between a distribution of demonstrated trajectories and the predictive distribution of robot trajectories [20], [21]. Recently, PILCO has also been extended to a multi-task set-up [13].

9 CONCLUSION

We have introduced PILCO, a practical model-based policy search method using analytic gradients for policy learning. PILCO advances state-of-the-art RL methods for continuous state and control spaces in terms of learning speed by at least an order of magnitude. Key to PILCO's success is a principled way of reducing the effect of model errors in model learning, long-term planning, and policy learning. PILCO is one of the few RL methods that has been directly applied to robotics without human demonstrations or other kinds of informative initializations or prior knowledge.

The PILCO learning framework has demonstrated that Bayesian inference and nonparametric models for learning controllers is not only possible but also practicable. Hence, nonparametric Bayesian models can play a fundamental role in classical control set-ups, while avoiding the typically excessive reliance on explicit models.
7. “Typically, PEGASUS policy search algorithms have been using
[...] maybe on the order of ten parameters or tens of parameters; so, damental role in classical control set-ups, while avoiding
30, 40 parameters, but not thousands of parameters [...]”, A. Ng [38]. the typically excessive reliance on explicit models.
ACKNOWLEDGMENTS

The research leading to these results has received funding from the EC's Seventh Framework Programme (FP7/2007–2013) under grant agreement #270327, ONR MURI grant N00014-09-1-1052, and Intel Labs.

REFERENCES

[1] http://www.lynxmotion.com.
[2] http://www.primesense.com.
[3] P. Abbeel, M. Quigley, and A. Y. Ng. Using Inaccurate Models in Reinforcement Learning. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[4] K. J. Åström and B. Wittenmark. Adaptive Control. Dover Publications, 2008.
[5] C. G. Atkeson and J. C. Santamaría. A Comparison of Direct and Model-Based Reinforcement Learning. In Proceedings of the International Conference on Robotics and Automation, 1997.
[6] P. Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
[7] J. A. Bagnell and J. G. Schneider. Autonomous Helicopter Control using Reinforcement Learning Policy Search Methods. In Proceedings of the International Conference on Robotics and Automation, 2001.
[8] B. Bischoff, D. Nguyen-Tuong, T. Koller, H. Markert, and A. Knoll. Learning Throttle Valve Control Using Policy Search. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, 2013.
[9] E. Brochu, V. M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. Technical Report TR-2009-023, Department of Computer Science, University of British Columbia, 2009.
[10] K. Chaloner and I. Verdinelli. Bayesian Experimental Design: A Review. Statistical Science, 10:273–304, 1995.
[11] R. Coulom. Reinforcement Learning Using Neural Networks, with Applications to Motor Control. PhD thesis, Institut National Polytechnique de Grenoble, 2002.
[12] M. P. Deisenroth. Efficient Reinforcement Learning using Gaussian Processes. KIT Scientific Publishing, 2010. ISBN 978-3-86644-569-7.
[13] M. P. Deisenroth, P. Englert, J. Peters, and D. Fox. Multi-Task Policy Search. http://arxiv.org/abs/1307.0813, July 2013.
[14] M. P. Deisenroth, G. Neumann, and J. Peters. A Survey on Policy Search for Robotics, volume 2 of Foundations and Trends in Robotics. NOW Publishers, 2013.
[15] M. P. Deisenroth and C. E. Rasmussen. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proceedings of the International Conference on Machine Learning, 2011.
[16] M. P. Deisenroth, C. E. Rasmussen, and D. Fox. Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning. In Proceedings of Robotics: Science and Systems, 2011.
[17] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian Process Dynamic Programming. Neurocomputing, 72(7–9):1508–1524, 2009.
[18] K. Doya. Reinforcement Learning in Continuous Time and Space. Neural Computation, 12(1):219–245, 2000.
[19] Y. Engel, S. Mannor, and R. Meir. Bayes Meets Bellman: The Gaussian Process Approach to Temporal Difference Learning. In Proceedings of the International Conference on Machine Learning, 2003.
[20] P. Englert, A. Paraschos, J. Peters, and M. P. Deisenroth. Model-based Imitation Learning by Probabilistic Trajectory Matching. In Proceedings of the IEEE International Conference on Robotics and Automation, 2013.
[21] P. Englert, A. Paraschos, J. Peters, and M. P. Deisenroth. Probabilistic Model-based Imitation Learning. Adaptive Behavior, 21:388–403, 2013.
[22] S. Fabri and V. Kadirkamanathan. Dual Adaptive Control of Nonlinear Stochastic Systems using Neural Networks. Automatica, 34(2):245–253, 1998.
[23] A. A. Fel'dbaum. Dual Control Theory, Parts I and II. Automation and Remote Control, 21(11):874–880, 1961.
[24] D. Forster. Robotic Unicycle. Report, Department of Engineering, University of Cambridge, UK, 2009.
[25] J. Hall, C. E. Rasmussen, and J. Maciejowski. Reinforcement Learning with Reference Tracking Control in Continuous State Spaces. In Proceedings of the IEEE International Conference on Decision and Control, 2011.
[26] T. T. Jervis and F. Fallside. Pole Balancing on a Real Rig Using a Reinforcement Learning Controller. Technical Report CUED/F-INFENG/TR 115, University of Cambridge, December 1992.
[27] H. Kimura and S. Kobayashi. Efficient Non-Linear Control by Combining Q-learning with Local Linear Controllers. In Proceedings of the 16th International Conference on Machine Learning, 1999.
[28] J. Ko and D. Fox. GP-BayesFilters: Bayesian Filtering using Gaussian Process Prediction and Observation Models. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2008.
[29] J. Ko and D. Fox. Learning GP-BayesFilters via Gaussian Process Latent Variable Models. In Proceedings of Robotics: Science and Systems, 2009.
[30] J. Ko, D. J. Klein, D. Fox, and D. Haehnel. Gaussian Processes and Reinforcement Learning for Identification and Control of an Autonomous Blimp. In Proceedings of the IEEE International Conference on Robotics and Automation, 2007.
[31] K. P. Körding and D. M. Wolpert. The Loss Function of Sensorimotor Learning. In J. L. McClelland, editor, Proceedings of the National Academy of Sciences, volume 101, pages 9839–9842, 2004.
[32] A. Kupcsik, M. P. Deisenroth, J. Peters, and G. Neumann. Data-Efficient Generalization of Robot Skills with Contextual Policy Search. In Proceedings of the AAAI Conference on Artificial Intelligence, 2013.
[33] D. Lizotte. Practical Bayesian Optimization. PhD thesis, University of Alberta, Edmonton, Alberta, 2008.
[34] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
[35] D. C. McFarlane and K. Glover. Lecture Notes in Control and Information Sciences, volume 138, chapter Robust Controller Design using Normalised Coprime Factor Plant Descriptions. Springer-Verlag, 1989.
[36] A. McHutchon and C. E. Rasmussen. Gaussian Process Training with Input Noise. In Advances in Neural Information Processing Systems, 2011.
[37] Y. Naveh, P. Z. Bar-Yoseph, and Y. Halevi. Nonlinear Modeling and Control of a Unicycle. Journal of Dynamics and Control, 9(4):279–296, October 1999.
[38] A. Y. Ng. Stanford Engineering Everywhere CS229—Machine Learning, Lecture 20, 2008. http://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture20.html.
[39] A. Y. Ng and M. Jordan. PEGASUS: A Policy Search Method for Large MDPs and POMDPs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2000.
[40] A. Y. Ng, H. J. Kim, M. I. Jordan, and S. Sastry. Autonomous Helicopter Flight via Reinforcement Learning. In Advances in Neural Information Processing Systems, 2004.
[41] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian Processes for Global Optimization. In Proceedings of the International Conference on Learning and Intelligent Optimization, 2009.
[42] J. Peters and S. Schaal. Policy Gradient Methods for Robotics. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robotics Systems, 2006.
[43] J. Peters and S. Schaal. Reinforcement Learning of Motor Skills with Policy Gradients. Neural Networks, 21:682–697, 2008.
[44] J. Quiñonero-Candela, A. Girard, J. Larsen, and C. E. Rasmussen. Propagation of Uncertainty in Bayesian Kernel Models—Application to Multiple-Step Ahead Forecasting. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2003.
[45] T. Raiko and M. Tornio. Variational Bayesian Learning of Nonlinear Hidden State-Space Models for Model Predictive Control. Neurocomputing, 72(16–18):3702–3712, 2009.
[46] C. E. Rasmussen and M. Kuss. Gaussian Processes in Reinforcement Learning. In Advances in Neural Information Processing Systems 16. The MIT Press, 2004.
[47] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[48] S. Schaal. Learning From Demonstration. In Advances in Neural Information Processing Systems 9. The MIT Press, 1997.
[49] J. G. Schneider. Exploiting Model Uncertainty Estimates for Safe Dynamic Control Learning. In Advances in Neural Information Processing Systems, 1997.
[50] E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Advances in Neural Information Processing Systems, 2006.
[51] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Proceedings of the International Conference on Machine Learning, 2010.
[52] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
[53] R. Turner, M. P. Deisenroth, and C. E. Rasmussen. State-Space Inference and Learning with Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010.
[54] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian Process Dynamical Models for Human Motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.
[55] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, Cambridge, UK, 1989.
[56] P. Wawrzynski and A. Pacut. Model-free off-policy Reinforcement Learning in Continuous Environment. In Proceedings of the International Joint Conference on Neural Networks, 2004.
[57] A. Wilson, A. Fern, and P. Tadepalli. Incorporating Domain Models into Bayesian Optimization for RL. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, 2010.
[58] B. Wittenmark. Adaptive Dual Control Methods: An Overview. In Proceedings of the IFAC Symposium on Adaptive Systems in Control and Signal Processing, 1995.

Marc Peter Deisenroth conducted his Ph.D. research at the Max Planck Institute for Biological Cybernetics (2006–2007) and at the University of Cambridge (2007–2009) and received his Ph.D. degree in 2009. He is a Research Fellow at the Department of Computing at Imperial College London. He is also adjunct researcher at the Computer Science Department at TU Darmstadt, where he has been Group Leader and Senior Researcher from December 2011 to August 2013. From February 2010 to December 2011, he has been a Research Associate at the University of Washington. His research interests center around modern Bayesian machine learning and its application to autonomous control and robotic systems.

Dieter Fox received the Ph.D. degree from the University of Bonn, Germany. He is Professor in the Department of Computer Science & Engineering at the University of Washington, where he heads the UW Robotics and State Estimation Lab. From 2009 to 2011, he was also Director of the Intel Research Labs Seattle. Before going to UW, he spent two years as a postdoctoral researcher at the CMU Robot Learning Lab. His research is in artificial intelligence, with a focus on state estimation applied to robotics and activity recognition. He has published over 150 technical papers and is coauthor of the text book Probabilistic Robotics. Fox is an editor of the IEEE Transactions on Robotics, was program co-chair of the 2008 AAAI Conference on Artificial Intelligence, and served as the program chair of the 2013 Robotics Science and Systems conference. He is a fellow of AAAI and a senior member of IEEE.

Carl Edward Rasmussen is Reader in Information Engineering at the Department of Engineering at the University of Cambridge. He was a Junior Research Group Leader at the Max Planck Institute for Biological Cybernetics in Tübingen, and a Senior Research Fellow at the Gatsby Computational Neuroscience Unit at UCL. He has wide interests in probabilistic methods in machine learning, including nonparametric Bayesian inference, and has co-authored the text book Gaussian Processes for Machine Learning, the MIT Press 2006.

APPENDIX A
TRIGONOMETRIC INTEGRATION

This section gives exact integral equations for trigonometric functions, which are required to implement the discussed algorithms. The following expressions can be found in the book by [1], where x ∼ N(x | µ, σ²) is Gaussian distributed with mean µ and variance σ².
E_x[sin(x)] = ∫ sin(x) p(x) dx = exp(−σ²/2) sin(µ) ,
E_x[cos(x)] = ∫ cos(x) p(x) dx = exp(−σ²/2) cos(µ) ,
E_x[sin(x)²] = ∫ sin(x)² p(x) dx = ½ (1 − exp(−2σ²) cos(2µ)) ,
E_x[cos(x)²] = ∫ cos(x)² p(x) dx = ½ (1 + exp(−2σ²) cos(2µ)) ,
E_x[sin(x) cos(x)] = ∫ sin(x) cos(x) p(x) dx = ∫ ½ sin(2x) p(x) dx = ½ exp(−2σ²) sin(2µ) .
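These identities are straightforward to verify numerically; a quick Monte Carlo check (ours):

```python
import numpy as np
rng = np.random.default_rng(7)

mu, sigma = 0.7, 1.3
x = rng.normal(mu, sigma, 2_000_000)
s2 = sigma**2

checks = {
    "E[sin x]      ": (np.sin(x).mean(), np.exp(-s2/2) * np.sin(mu)),
    "E[cos x]      ": (np.cos(x).mean(), np.exp(-s2/2) * np.cos(mu)),
    "E[sin^2 x]    ": ((np.sin(x)**2).mean(), 0.5 * (1 - np.exp(-2*s2) * np.cos(2*mu))),
    "E[cos^2 x]    ": ((np.cos(x)**2).mean(), 0.5 * (1 + np.exp(-2*s2) * np.cos(2*mu))),
    "E[sin x cos x]": ((np.sin(x) * np.cos(x)).mean(), 0.5 * np.exp(-2*s2) * np.sin(2*mu)),
}
for name, (mc, exact) in checks.items():
    print(f"{name}: MC {mc:+.4f} vs exact {exact:+.4f}")
```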
APPENDIX B
GRADIENTS

In the beginning of this section, we give a few derivative identities that will come in handy. After that, we detail derivative computations in the context of the moment-matching approximation.

B.1 Identities

Let us start with a set of basic derivative identities [2] that will prove useful in the following:

∂|K(θ)|/∂θ = |K| tr(K^{-1} ∂K/∂θ) ,
∂|K|/∂K = |K| (K^{-1})^⊤ ,
∂K^{-1}(θ)/∂θ = −K^{-1} (∂K(θ)/∂θ) K^{-1} ,
∂(θ^⊤Kθ)/∂θ = θ^⊤(K + K^⊤) ,
∂tr(AKB)/∂K = A^⊤B^⊤ ,
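The identities can likewise be checked by finite differences; for example, for the first one (a sketch, ours, with an arbitrary matrix-valued function K(θ)):

```python
import numpy as np
rng = np.random.default_rng(8)

A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
K = lambda t: A + t * B      # matrix-valued function K(theta); dK/dtheta = B

t0, h = 0.4, 1e-6
lhs = (np.linalg.det(K(t0 + h)) - np.linalg.det(K(t0 - h))) / (2 * h)
rhs = np.linalg.det(K(t0)) * np.trace(np.linalg.solve(K(t0), B))
print(lhs, rhs)              # d|K|/dtheta  vs  |K| tr(K^{-1} dK/dtheta)
```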
Q_ij = σ²_fa σ²_fb |(Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I|^{-1/2}
     × exp(−½ (x̃i − x̃j)^⊤ (Λ_a + Λ_b)^{-1} (x̃i − x̃j))
     × exp(−½ (ẑ_ij − µ̃_{t−1})^⊤ ((Λ_a^{-1} + Λ_b^{-1})^{-1} + Σ̃_{t−1})^{-1} (ẑ_ij − µ̃_{t−1})) ,   (63)

ẑ_ij := Λ_b(Λ_a + Λ_b)^{-1} x̃i + Λ_a(Λ_a + Λ_b)^{-1} x̃j ,   (64)

where i, j = 1, . . . , n.
B.2.2.1 Derivative with Respect to the Input Mean: For the derivative of the entries of the predictive covariance matrix with respect to the input mean, we obtain

∂σ²_∆ab/∂µ̃_{t−1} = β_a^⊤ ( ∂Q/∂µ̃_{t−1} − (∂q_a/∂µ̃_{t−1}) q_b^⊤ − q_a (∂q_b^⊤/∂µ̃_{t−1}) ) β_b
   + δ_ab tr( −(K_a + σ²_wa I)^{-1} ∂Q/∂µ̃_{t−1} ) ,   (65)

where the derivative of Q_ij with respect to the input mean is given as

∂Q_ij/∂µ̃_{t−1} = Q_ij (ẑ_ij − µ̃_{t−1})^⊤ ((Λ_a^{-1} + Λ_b^{-1})^{-1} + Σ̃_{t−1})^{-1} .   (66)

B.2.2.2 Derivative with Respect to the Input Covariance Matrix: The derivative of the entries of the predictive covariance matrix with respect to the covariance matrix of the input distribution is

∂σ²_∆ab/∂Σ̃_{t−1} = β_a^⊤ ( ∂Q/∂Σ̃_{t−1} − (∂q_a/∂Σ̃_{t−1}) q_b^⊤ − q_a (∂q_b^⊤/∂Σ̃_{t−1}) ) β_b
   + δ_ab tr( −(K_a + σ²_wa I)^{-1} ∂Q/∂Σ̃_{t−1} ) .   (67)

Since the partial derivatives ∂q_a/∂Σ̃_{t−1} and ∂q_b/∂Σ̃_{t−1} are known from Eq. (56), it remains to compute ∂Q/∂Σ̃_{t−1}. The entries Q_ij, i, j = 1, . . . , n, are given in Eq. (63). By defining

c := σ²_fa σ²_fb exp(−½ (x̃i − x̃j)^⊤ (Λ_a + Λ_b)^{-1} (x̃i − x̃j)) ,
e₂ := exp(−½ (ẑ_ij − µ̃_{t−1})^⊤ ((Λ_a^{-1} + Λ_b^{-1})^{-1} + Σ̃_{t−1})^{-1} (ẑ_ij − µ̃_{t−1})) ,

we obtain the desired derivative

∂Q_ij/∂Σ̃_{t−1} = c [ −½ |(Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I|^{-3/2} (∂|(Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I| / ∂Σ̃_{t−1}) e₂
   + |(Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I|^{-1/2} (∂e₂/∂Σ̃_{t−1}) ] .   (68)

For the determinant, the identity ∂|K(θ)|/∂θ = |K| tr(K^{-1} ∂K/∂θ) yields

∂|(Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I| / ∂Σ̃_{t−1}
   = |(Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I| tr( ((Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I)^{-1} (Λ_a^{-1} + Λ_b^{-1}) ∂Σ̃_{t−1}/∂Σ̃_{t−1} ) .   (69)–(70)

With this, the partial derivative of Q_ij with respect to the covariance matrix Σ̃_{t−1} is given as

∂Q_ij/∂Σ̃_{t−1} = c [ −½ |(Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I|^{-3/2} |(Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I| e₂
      × tr( ((Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I)^{-1} (Λ_a^{-1} + Λ_b^{-1}) ∂Σ̃_{t−1}/∂Σ̃_{t−1} )
   + |(Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I|^{-1/2} (∂e₂/∂Σ̃_{t−1}) ]   (71)
   = c |(Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I|^{-1/2}
      × [ −½ ( ((Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I)^{-1} (Λ_a^{-1} + Λ_b^{-1}) )^⊤ e₂ + ∂e₂/∂Σ̃_{t−1} ] ,   (72)–(73)

where the partial derivative of e₂ with respect to the entries Σ̃^(p,q)_{t−1} is given as

∂e₂/∂Σ̃^(p,q)_{t−1} = −½ (ẑ_ij − µ̃_{t−1})^⊤ ( ∂((Λ_a^{-1} + Λ_b^{-1})^{-1} + Σ̃_{t−1})^{-1} / ∂Σ̃^(p,q)_{t−1} ) (ẑ_ij − µ̃_{t−1}) e₂ .   (74)

The missing partial derivative in (74) is given by

∂((Λ_a^{-1} + Λ_b^{-1})^{-1} + Σ̃_{t−1})^{-1} / ∂Σ̃^(p,q)_{t−1} = −Ξ^(pq) ,   (75)

where we define

Ξ^(pq) = ½ (Φ^(pq) + Φ^(qp)) ∈ R^{(D+F)×(D+F)} ,   (76)

p, q = 1, . . . , D + F, with

Φ^(pq) = ( ((Λ_a^{-1} + Λ_b^{-1})^{-1} + Σ̃_{t−1})^{-1} )_(:,p) ( ((Λ_a^{-1} + Λ_b^{-1})^{-1} + Σ̃_{t−1})^{-1} )_(q,:) .   (77)
= −½ Q_ij [ ( ((Λ_a^{-1} + Λ_b^{-1}) Σ̃_{t−1} + I)^{-1} (Λ_a^{-1} + Λ_b^{-1}) )^⊤
   − (ẑ_ij − µ̃_{t−1})^⊤ Ξ (ẑ_ij − µ̃_{t−1}) ] ,   (79)
REFERENCES
[1] I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and
Products. Academic Press, 6th edition, July 2000.
[2] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook, October
2008. Version 20081110.