Metropolis-Hastings Algorithm from Scratch

Metropolis Algorithm

The Metropolis algorithm is a widely used acceptance/rejection algorithm for sampling from a target distribution and a key tool for Bayesian inference. It generates a sequence of draws that, over time, approximates the target distribution. At each time step t a candidate is drawn from a proposal distribution and is either accepted or rejected based on the ratio of the target density evaluated at the proposal and at the previous draw. The Metropolis algorithm requires the proposal distribution J_t(\theta^{\ast}|\theta^{t-1}) to be symmetric, meaning it satisfies the condition J_t(\theta_b|\theta_a)=J_t(\theta_a|\theta_b). The algorithm is as follows:

  1. Draw a starting point \theta_0. This can be a user specified value or an approximation based on the data.
  2. For t=1,2,...
    a) Sample \theta^{\ast} from the proposal distribution. It is common to choose a proposal distribution of the same form as the target distribution, but scaled down by roughly 20-30%.
    b) Calculate the ratio of the target densities evaluated at \theta^{\ast} and \theta^{t-1},

        \[ r = \frac{p(\theta^{\ast}|y)}{p(\theta^{t-1}|y)} \]

  3. Set,

        \[ \theta^{t} = \begin{cases} \theta^{\ast} & \text{with probability } \min(r, 1) \\ \theta^{t-1} & \text{otherwise} \end{cases} \]
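The original R chunks were not preserved in this extraction, so as a language-agnostic illustration, here is a minimal Python sketch of the steps above applied to a simple one-dimensional target (a standard normal). The function and variable names are mine, and the acceptance test is done on the log scale to avoid numerical underflow:

```python
import numpy as np

def metropolis(log_target, theta0, n_iter, prop_sd, rng):
    """Generic Metropolis sampler with a symmetric normal proposal."""
    theta = theta0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        proposal = theta + rng.normal(0.0, prop_sd)        # symmetric jump
        log_r = log_target(proposal) - log_target(theta)   # log of the ratio r
        if np.log(rng.uniform()) < log_r:                  # accept w.p. min(r, 1)
            theta = proposal
        chain[t] = theta
    return chain

rng = np.random.default_rng(42)
# Target: standard normal, log density up to an additive constant
chain = metropolis(lambda x: -0.5 * x**2, theta0=0.0, n_iter=20_000,
                   prop_sd=1.0, rng=rng)
```

The draws in `chain` should have mean near 0 and standard deviation near 1, matching the standard normal target.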

We will demonstrate this on a linear regression example, using the Metropolis algorithm to estimate the model parameters \beta_0, \beta_1 and \sigma. First we'll simulate the data to be estimated.
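The simulation chunk itself was not preserved, so the true parameter values below are hypothetical stand-ins; a sketch in Python of simulating data from a linear model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "true" parameters (the original chunk's values were not preserved)
beta0_true, beta1_true, sigma_true = 2.0, 1.5, 1.0
n = 30

# Simulate y = beta0 + beta1 * x + Normal(0, sigma) noise
x = rng.uniform(-5, 5, size=n)
y = beta0_true + beta1_true * x + rng.normal(0.0, sigma_true, size=n)
```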

(Plot: the simulated data)

Next, set up a function to evaluate the likelihood of the model given the data. We will use non-informative prior distributions for the parameters. Since we work on the log scale, the (unnormalised) log posterior is the sum of the log-likelihood and the log-priors.
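Since the original R function is not shown, here is a Python sketch of the same idea: a Gaussian log-likelihood for the regression model, vague priors (my choices: wide normals on the coefficients, flat on \sigma > 0), and the log posterior as their sum:

```python
import numpy as np

def norm_logpdf(z, mu, sd):
    # Log density of a normal distribution, element-wise
    return -0.5 * np.log(2 * np.pi * sd**2) - (z - mu)**2 / (2 * sd**2)

def log_likelihood(theta, x, y):
    beta0, beta1, sigma = theta
    if sigma <= 0:
        return -np.inf            # sigma must be positive
    return norm_logpdf(y, beta0 + beta1 * x, sigma).sum()

def log_prior(theta):
    beta0, beta1, sigma = theta
    if sigma <= 0:
        return -np.inf
    # Vague priors: wide normals on the coefficients, flat on sigma > 0
    return norm_logpdf(beta0, 0.0, 100.0) + norm_logpdf(beta1, 0.0, 100.0)

def log_posterior(theta, x, y):
    # On the log scale the posterior is the sum of log-likelihood and log-prior
    return log_likelihood(theta, x, y) + log_prior(theta)
```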

A key part of the Metropolis algorithm is the proposal distribution. From this we draw the proposal value, calculate the ratio $r$ and randomly accept or reject the proposal. Here we set the proposal distribution to be normal, with standard deviations of (0.5, 0.1, 0.25) for the three parameters. These are reasonable values chosen from the linear model above.
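A Python sketch of such a proposal function (the function name is mine), drawing a symmetric normal jump centred on the current parameter values with the standard deviations given above:

```python
import numpy as np

PROP_SD = np.array([0.5, 0.1, 0.25])  # sds for (beta0, beta1, sigma), as in the text

def propose(theta, rng):
    # Symmetric normal jump centred on the current value,
    # so J(a|b) = J(b|a) and no Hastings correction is needed
    return theta + rng.normal(0.0, PROP_SD)
```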

Now we run the Metropolis algorithm following the steps above.
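Putting the pieces together, here is a self-contained Python sketch of the full run (the true parameter values, priors and chain settings are my stand-ins, since the original R chunks were not preserved):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data (hypothetical true values, as the original chunk is not shown)
beta0_true, beta1_true, sigma_true, n = 2.0, 1.5, 1.0, 30
x = rng.uniform(-5, 5, n)
y = beta0_true + beta1_true * x + rng.normal(0, sigma_true, n)

def log_posterior(theta):
    beta0, beta1, sigma = theta
    if sigma <= 0:
        return -np.inf
    resid = y - (beta0 + beta1 * x)
    # Gaussian log-likelihood plus flat (improper) priors, up to a constant
    return -n * np.log(sigma) - 0.5 * np.sum(resid**2) / sigma**2

prop_sd = np.array([0.5, 0.1, 0.25])   # proposal sds for (beta0, beta1, sigma)

def run_metropolis(theta0, n_iter):
    theta = np.asarray(theta0, dtype=float)
    chain = np.empty((n_iter, 3))
    accepted = 0
    lp = log_posterior(theta)
    for t in range(n_iter):
        proposal = theta + rng.normal(0, prop_sd)      # symmetric jump
        lp_prop = log_posterior(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:       # accept w.p. min(r, 1)
            theta, lp = proposal, lp_prop
            accepted += 1
        chain[t] = theta
    return chain, accepted / n_iter

chain, acc_rate = run_metropolis([0.0, 0.0, 2.0], 20_000)
burned = chain[5_000:]                 # discard burn-in
print(burned.mean(axis=0), acc_rate)
```

The posterior means of the retained draws should land near the values used to simulate the data.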

Visualise the results.

(Plot: trace plots and posterior distributions of the sampled parameters)

In this example the parameter estimates are not too bad: a little off given the small number of data points, but this at least demonstrates the implementation of the Metropolis algorithm.

Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm is a generalisation of the Metropolis algorithm which relaxes the requirement that the jumping distribution be symmetric. Step 2.b now becomes,

    \[ r = \frac{p(\theta^{\ast}|y)/J_t(\theta^{\ast}|\theta^{t-1})}{p(\theta^{t-1}|y)/J_t(\theta^{t-1}|\theta^{\ast})} \]
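To make the Hastings correction concrete, here is a small Python sketch (my own example, not from the original post) using an asymmetric log-normal random-walk proposal to sample a positive parameter from an Exponential(1) target. For this proposal the correction J_t(\theta^{t-1}|\theta^{\ast})/J_t(\theta^{\ast}|\theta^{t-1}) works out to \theta^{\ast}/\theta^{t-1}:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_target(theta):
    # Exponential(1) target on (0, inf), log density up to a constant
    return -theta if theta > 0 else -np.inf

s = 0.5  # proposal scale on the log axis

def mh_step(theta):
    # Asymmetric proposal: log-normal random walk, theta* = theta * exp(s * z)
    proposal = theta * np.exp(s * rng.normal())
    # Hastings correction for this proposal: J(old|new)/J(new|old) = new/old
    log_r = (log_target(proposal) - log_target(theta)
             + np.log(proposal) - np.log(theta))
    return proposal if np.log(rng.uniform()) < log_r else theta

theta, draws = 1.0, []
for _ in range(30_000):
    theta = mh_step(theta)
    draws.append(theta)
draws = np.array(draws[5_000:])   # discard burn-in
```

Without the correction term the chain would be biased towards larger values, since the log-normal proposal jumps upward more easily than downward.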

(Plot: results of the Metropolis-Hastings run)

This is a good demonstration of how the algorithm works; however, it could use some fine tuning, e.g. choosing better proposal distributions. The traces for the intercept and slope exhibit some autocorrelation, which could be reduced by thinning the sequence. The best thing about R is that there are great packages to help us.

MHadaptive and MCMCpack

The Metropolis algorithm can be implemented much more simply using the fantastic MHadaptive and MCMCpack packages. All that is needed is to specify the likelihood function of the model; the likelihood function from above can be passed directly to these functions.

(Plots: MHadaptive results)

Similarly using the MCMCpack package and the MCMCmetrop1R function.

(Plot: MCMCpack results)

As you can see, the acceptance rate is 0.45 for both methods, whereas for the algorithm we implemented it was approximately 0.25. This is likely due to the chosen proposal distributions. These acceptance rates are reasonable. Essentially, we don't want an acceptance rate that is too high: that means our jumping distribution is too narrow, so new proposals are almost always accepted but the chain moves in tiny steps, giving sub-optimal simulation of the posterior. Likewise, if it's too low, our jumping distribution is too wide and the algorithm will essentially get stuck in one place for a long time and not explore the parameter space adequately. Fortunately the MHadaptive and MCMCpack functions take care of that tuning for you, and you can see the difference in this output.
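The trade-off between proposal width and acceptance rate is easy to see empirically. Here is a small Python sketch (my own illustration) running the Metropolis algorithm on a standard normal target with narrow, moderate and very wide proposals:

```python
import numpy as np

rng = np.random.default_rng(0)

def acceptance_rate(prop_sd, n_iter=20_000):
    """Metropolis on a standard normal target; returns the fraction of accepted jumps."""
    theta, accepted = 0.0, 0
    for _ in range(n_iter):
        proposal = theta + rng.normal(0, prop_sd)
        # log r for a standard normal target: 0.5 * (theta^2 - proposal^2)
        if np.log(rng.uniform()) < 0.5 * (theta**2 - proposal**2):
            theta = proposal
            accepted += 1
    return accepted / n_iter

# Narrow jumps are almost always accepted; very wide jumps are mostly rejected
rates = {sd: acceptance_rate(sd) for sd in (0.1, 2.4, 20.0)}
```

The narrow proposal (sd 0.1) gives a very high acceptance rate but a slowly wandering chain, while the very wide proposal (sd 20) is rejected most of the time, leaving the chain stuck for long stretches.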

I encourage you to play around with different parameter sets and proposal distributions to see the differences in the posterior sequences.
