A general construction for parallelizing Metropolis−Hastings algorithms
Ben Calderhead¹
Department of Mathematics, Imperial College London, London SW7 2AZ, United Kingdom
Edited by Wing Hung Wong, Stanford University, Stanford, CA, and approved October 30, 2014 (received for review May 3, 2014)
Markov chain Monte Carlo methods (MCMC) are essential tools for solving many modern-day statistical and computational problems; however, a major limitation is the inherently sequential nature of these algorithms. In this paper, we propose a natural generalization of the Metropolis−Hastings algorithm that allows for parallelizing a single chain using existing MCMC methods. We do so by proposing multiple points in parallel, then constructing and sampling from a finite-state Markov chain on the proposed points such that the overall procedure has the correct target density as its stationary distribution. Our approach is generally applicable and straightforward to implement. We demonstrate how this construction may be used to greatly increase the computational speed and statistical efficiency of a variety of existing MCMC methods, including Metropolis-Adjusted Langevin Algorithms and Adaptive MCMC. Furthermore, we show how it allows for a principled way of using every integration step within Hamiltonian Monte Carlo methods; our approach increases robustness to the choice of algorithmic parameters and results in increased accuracy of Monte Carlo estimates with little extra computational cost.

Markov chain Monte Carlo | Bayesian inference | parallel computation | Hamiltonian dynamics

Significance

Many computational problems in modern-day statistics are heavily dependent on Markov chain Monte Carlo (MCMC) methods. These algorithms allow us to evaluate arbitrary probability distributions; however, they are inherently sequential in nature due to the Markov property, which severely limits their computational speed. We propose a general approach that allows scalable parallelization of existing MCMC methods. We do so by defining a finite-state Markov chain on multiple proposals in a way that ensures asymptotic convergence to the correct stationary distribution. In example simulations, we demonstrate up to two orders of magnitude improvement in overall computational performance.

Author contributions: B.C. designed research; B.C. performed research; and B.C. wrote the paper.
The author declares no conflict of interest.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option.
¹Email: b.calderhead@imperial.ac.uk.

Since its introduction in the 1970s, the Metropolis−Hastings algorithm has revolutionized computational statistics (1). The ability to draw samples from an arbitrary probability distribution, π(X), known only up to a constant, by constructing a Markov chain that converges to the correct stationary distribution has enabled the practical application of Bayesian inference for modeling a huge variety of scientific phenomena, and has resulted in Metropolis−Hastings being noted as one of the top 10 most important algorithms from the 20th century (2). Despite regular increases in available computing power, Markov chain Monte Carlo (MCMC) algorithms can still be computationally very expensive; many thousands of iterations may be necessary to obtain low-variance estimates of the required quantities, with an oftentimes complex statistical model being evaluated for each set of proposed parameters. Furthermore, many Metropolis−Hastings algorithms are severely limited by their inherently sequential nature.

Many approaches have been proposed for improving the statistical efficiency of MCMC, and although such algorithms are guaranteed to converge asymptotically to the stationary distribution, their performance over a finite number of iterations can vary hugely. Much research effort has therefore focused on developing transition kernels that enable moves to be proposed far from the current point and subsequently accepted with high probability, taking into account, for example, the correlation structure of the parameter space (3, 4), or by using Hamiltonian dynamics (5) or diffusion processes (6). A recent investigation into proposal kernels suggests that more exotic distributions, such as the Bactrian kernel, might also be used to increase the statistical efficiency of MCMC algorithms (7). The efficient exploration of high-dimensional and multimodal distributions is hugely challenging and provides the motivation for many further methods (8).

Computational efficiency of MCMC algorithms is another important issue. Approaches have been suggested for making use of the increasingly low-cost parallelism that is available in modern-day computing, with the most straightforward based on running multiple Markov chains simultaneously (9). These may explore the same distribution, or some product of related distributions, as in parallel tempering (10). Furthermore, the locations of other chains may additionally be used to guide the proposal mechanism (11). More recent work combines the use of multiple chains with adaptive MCMC in an attempt to use these multiple sources of information to learn an appropriate proposal distribution (12, 13). Sometimes, specific MCMC algorithms are directly amenable to parallelization, such as independent Metropolis−Hastings (14) or slice sampling (15), as indeed are some statistical models via careful reparameterization (16) or implementation on specialist hardware, such as graphics processing units (GPUs) (17, 18); however, these approaches are often problem specific and not generally applicable. For problems involving large amounts of data, parallelization may in some cases also be possible by partitioning the data and analyzing each subset using standard MCMC methods simultaneously on multiple machines (19). The individual Markov chains in these methods, however, are all based on the standard sequential Metropolis−Hastings algorithm.

The idea of parallelizing Metropolis−Hastings using multiple proposals has been investigated previously; however, the main shortcoming of such attempts has been their lack of computational efficiency. Algorithms such as Multiple Try Metropolis (20), Ensemble MCMC (21), and "prefetching" approaches (22, 23) all allow the computation of multiple proposals or future proposed paths in parallel, although only one proposal or path is subsequently accepted by the Markov chain, resulting in wasted computation of the remaining points. Another class of approaches involves incorporating rejected states into the Monte Carlo estimator with an appropriate weighting, such that the resulting estimate is still unbiased. An overview of such approaches is given by Frenkel (24). In particular, the work presented by Tjelmeland (25) makes use of multiple rejected states at each iteration and is related to the ideas presented here, although recent investigations
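The multiple-proposal construction described in the abstract can be made concrete with a minimal sketch. The code below is a simplified instance, not the paper's general algorithm: it assumes an independence proposal q, for which the stationary weights of the finite-state chain on the N + 1 candidate points (current state plus N proposals) reduce to importance ratios π(x)/q(x), and it samples all N next states directly from those weights. All function and parameter names are illustrative.

```python
import numpy as np

def generalized_mh(log_pi, sample_q, log_q, x0, n_iters, n_props, rng):
    """Sketch of a multiple-proposal Metropolis-Hastings step.

    Each iteration draws n_props candidate points (the step that can run
    in parallel), then samples n_props indices from the finite-state
    distribution on the n_props + 1 points {current, proposals}, whose
    weights w_i = pi(x_i) / q(x_i) leave pi invariant for an
    independence proposal q.
    """
    x = x0
    samples = []
    for _ in range(n_iters):
        pool = np.concatenate([[x], sample_q(n_props, rng)])
        log_w = log_pi(pool) - log_q(pool)       # unnormalized log weights
        w = np.exp(log_w - log_w.max())          # stabilized exponentiation
        p = w / w.sum()
        idx = rng.choice(pool.size, size=n_props, p=p)
        samples.extend(pool[idx])                # keep all draws as samples
        x = pool[idx[-1]]                        # last draw seeds next iteration
    return np.array(samples)
```

For example, with target N(3, 2²) and independence proposal N(0, 6²), the retained samples recover the target's moments; the paper's general construction instead builds a full transition matrix over the candidate points, which accommodates arbitrary, state-dependent proposal kernels.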
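The waste-recycling idea surveyed by Frenkel (24) can be sketched for a standard random-walk Metropolis chain: rather than discarding a rejected candidate, each iteration contributes both the current and the proposed point to the estimator, weighted by the acceptance probability. This is a hypothetical minimal illustration under that scheme, not code from the paper; names are illustrative.

```python
import numpy as np

def waste_recycling_mean(log_pi, f, x0, step, n_iters, rng):
    """Estimate E_pi[f(X)] with random-walk Metropolis while recycling
    rejected proposals: each iteration contributes
    alpha * f(y) + (1 - alpha) * f(x), where alpha is the acceptance
    probability (a Rao-Blackwellization of the Metropolis step)."""
    x, lp_x = x0, log_pi(x0)
    total = 0.0
    for _ in range(n_iters):
        y = x + step * rng.normal()              # symmetric proposal
        lp_y = log_pi(y)
        alpha = min(1.0, np.exp(lp_y - lp_x))
        total += alpha * f(y) + (1.0 - alpha) * f(x)
        if rng.random() < alpha:                 # the chain itself is unchanged
            x, lp_x = y, lp_y
    return total / n_iters
```

Every density evaluation contributes to the estimate, so no computation is wasted; as the text notes, however, such weighting schemes recycle information within a sequential chain rather than parallelizing it.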
Fig. 2. Results are based on 10 MCMC runs, each of 5,000 samples, using both the SmMALA and Adaptive MCMC kernels for posterior inference over the Fitzhugh−Nagumo ODE model (Top) and the logistic regression model (Bottom). The first-column plots show the effective sample size (ESS) for step sizes resulting in a range of acceptance rates. The second- and third-column plots show the ESS and MSJD for increasing numbers of proposals with N = [1, 5, 10, 20, 50, 100, 200, 500, 1,000]. The fourth-column plots show the corresponding acceptance rates when the samples are drawn directly from the stationary distribution of the finite-state Markov chain.
Fig. 3. Sampling performance of HMC using standard (Top) and Generalized (Bottom) Metropolis−Hastings for a variety of step sizes.
1. Diaconis P (2008) The Markov chain Monte Carlo revolution. Bull Am Math Soc 46:179–205.
2. Beichl I, Sullivan F (2000) The Metropolis algorithm. Comput Sci Eng 2(1):65–69.
3. Roberts GO, Rosenthal JS (2009) Examples of Adaptive MCMC. J Comput Graph Stat 18(2):349–367.
4. Girolami M, Calderhead B (2011) Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J R Stat Soc B 73(2):123–214.
5. Duane S, Kennedy AD, Pendleton BJ, Roweth D (1987) Hybrid Monte Carlo. Phys Lett B 195(2):216–222.
6. Roberts G, Stramer O (2003) Langevin diffusions and Metropolis−Hastings algorithms. Methodol Comput Appl Probab 4(4):337–358.
7. Yang Z, Rodríguez CE (2013) Searching for efficient Markov chain Monte Carlo proposal kernels. Proc Natl Acad Sci USA 110(48):19307–19312.
8. Brooks S, Gelman A, Jones G, Meng X (2011) Handbook of Markov Chain Monte Carlo (Chapman and Hall, Boca Raton, FL).
9. Rosenthal JS (2000) Parallel computing and Monte Carlo algorithms. Far East J Theor Stat 4:207–236.
10. Laskey KB, Myers JW (2003) Population Markov chain Monte Carlo. Mach Learn 50(1-2):175–196.
11. Warnes A (2001) The normal kernel coupler: An adaptive Markov chain Monte Carlo method for efficiently sampling from multimodal distributions, PhD thesis (University of Washington, Seattle).
12. Craiu RV, Rosenthal J, Yang C (2009) Learn from thy neighbor: Parallel chain and regional adaptive MCMC. J Am Stat Assoc 104(488):1454–1466.
13. Solonen A, et al. (2012) Efficient MCMC for climate model parameter estimation: Parallel adaptive chains and early rejection. Bayesian Anal 7(3):715–736.
14. Jacob P, Robert CP, Smith MH (2011) Using parallel computation to improve Independent Metropolis-Hastings based estimation. J Comput Graph Stat 20(3):616–635.
15. Tibbits M, Haran M, Liechty J (2011) Parallel multivariate slice sampling. Stat Comput 21(3):415–430.
16. Yan J, Cowles MK, Wang S, Armstrong MP (2007) Parallelizing MCMC for Bayesian spatiotemporal geostatistical models. Stat Comput 17(4):323–335.
17. Suchard MA, et al. (2010) Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures. J Comput Graph Stat 19(2):419–438.
18. Lee A, Yau C, Giles MB, Doucet A, Holmes CC (2010) On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. J Comput Graph Stat 19(4):769–789.
19. Neiswanger W, Wang C, Xing E (2013) Asymptotically exact, embarrassingly parallel MCMC. arXiv:1311.4780.
20. Liu JS, Liang F, Wong WH (2000) The multiple-try method and local optimization in Metropolis sampling. J Am Stat Assoc 95(449):121–134.
21. Neal R (2011) MCMC using ensembles of states for problems with fast and slow variables. arXiv:1101.0387.
22. Brockwell A (2006) Parallel Markov chain Monte Carlo simulation by pre-fetching. J Comput Graph Stat 15(1):246–261.
23. Strid I (2009) Efficient parallelisation of Metropolis−Hastings algorithms using a prefetching approach. Comput Stat Data Anal 54(11):2814–2835.
24. Frenkel D (2006) Waste-recycling Monte Carlo. Computer Simulations in Condensed Matter Systems: From Materials to Chemical Biology, Lecture Notes in Physics (Springer, Berlin), Vol 1, pp 127−137.
25. Tjelmeland H (2004) Using All Metropolis-Hastings Proposals to Estimate Mean Values (Norwegian University of Science and Technology, Trondheim, Norway), Tech Rep 4.
26. Delmas J, Jourdain B (2009) Does waste recycling really improve the multi-proposal Metropolis−Hastings algorithm? J Appl Probab 46:938–959.
27. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equations of state calculations by fast computing machines. J Chem Phys 21(6):1087–1092.
28. Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109.
29. Billera LJ, Diaconis P (2001) A geometric interpretation of the Metropolis−Hastings algorithm. Stat Sci 16(4):335–339.
30. Peskun PH (1973) Optimum Monte-Carlo sampling using Markov chains. Biometrika 60(3):607–612.
31. Rao CR (1945) Information and accuracy attainable in the estimation of statistical parameters. Bull Calc Math Soc 37:81–91.
32. Atchade Y (2006) An adaptive version for the Metropolis Adjusted Langevin Algorithm with a truncated drift. Methodol Comput Appl Probab 8(2):235–254.
33. Neal RM (2010) MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, eds Brooks S, Gelman A, Jones GL, Meng X-L (Chapman and Hall, Boca Raton, FL), pp 113−162.
34. Geyer CJ (1992) Practical Markov chain Monte Carlo. Stat Sci 7(4):473–483.
35. Pasarica C, Gelman A (2010) Adaptively scaling the Metropolis Algorithm using expected squared jumped distance. Stat Sin 20:343–364.
36. Roberts GO, Rosenthal JS (1998) Optimal scaling of discrete approximations to Langevin diffusions. J R Stat Soc B 60(1):255–268.
37. Diaconis P, Holmes S, Neal RM (2000) Analysis of a nonreversible Markov chain sampler. Ann Appl Probab 10(3):726–752.