Metropolis Algorithm
The Metropolis algorithm is a common acceptance/rejection algorithm for sampling from a target distribution and a key tool for Bayesian inference. The algorithm generates a sequence of draws which, over time, approximate the target distribution. At each time step a candidate is drawn from a proposal distribution and is either accepted or rejected based on the ratio of the target density evaluated at the proposal and at the previous draw. The Metropolis algorithm requires the proposal distribution $J_t(\theta_a \mid \theta_b)$ to be symmetric, meaning it satisfies the condition $J_t(\theta_a \mid \theta_b) = J_t(\theta_b \mid \theta_a)$ for all $\theta_a$, $\theta_b$ and $t$. The algorithm is as follows:
- Draw a starting point $\theta^{0}$ for which $p(\theta^{0} \mid y) > 0$. This can be a user-specified value or an approximation based on the data.
- For $t = 1, 2, \ldots$:
a) Sample $\theta^{*}$ from the proposal distribution $J_t(\theta^{*} \mid \theta^{t-1})$. It is common to choose a proposal of the same form as the target distribution, with its scale tuned so that a reasonable proportion of proposals are accepted (roughly 20-45% is a common target).
b) Calculate the ratio of the target densities evaluated at $\theta^{*}$ and $\theta^{t-1}$,
$$r = \frac{p(\theta^{*} \mid y)}{p(\theta^{t-1} \mid y)}$$
- Set $\theta^{t} = \theta^{*}$ with probability $\min(r, 1)$, otherwise set $\theta^{t} = \theta^{t-1}$.
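To make these steps concrete before the regression example, here's a minimal one-dimensional sketch: sampling from a standard normal target with a symmetric normal proposal (the function name and proposal scale are illustrative choices, not part of the example that follows).
# minimal Metropolis sketch: standard normal target, symmetric normal proposal
target_logdens <- function(theta) dnorm(theta, log = TRUE)
n_iter <- 10000
theta <- numeric(n_iter)
theta[1] <- 0  # starting point
for (t in 2:n_iter) {
  # step a: sample from the symmetric proposal centred on the current draw
  theta_star <- rnorm(1, mean = theta[t-1], sd = 1)
  # step b: ratio of target densities (computed on the log scale)
  r <- exp(target_logdens(theta_star) - target_logdens(theta[t-1]))
  # accept with probability min(r, 1)
  theta[t] <- if (runif(1) < r) theta_star else theta[t-1]
}
# the draws should approximate N(0, 1)
c(mean(theta), sd(theta))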
This will be demonstrated on a linear regression example, using the Metropolis algorithm to estimate the model parameters $b_0$, $b_1$ and $\sigma$. First we'll simulate the data to be estimated.
# libraries
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(MHadaptive))
suppressPackageStartupMessages(library(MCMCpack))
set.seed(1600)
# model set up
nobs <- 50
b0 <- 2
b1 <- 4
sd <- 2
# simulate data
x <- rnorm(nobs, 10, sqrt(5))
y <- b0 + b1*x + rnorm(nobs, 0, sd)
# plot data
df <- data.frame(x, y)
g1 <- ggplot(df, aes(x=x, y=y)) + geom_point() + geom_abline(slope = b1, intercept = b0, col = "red")
g1
# for reference
summary(lm(y~x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.204 -1.454 -0.066  1.153  4.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.5181     1.5264   0.995    0.325    
## x             4.0258     0.1516  26.548   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.084 on 48 degrees of freedom
## Multiple R-squared:  0.9362, Adjusted R-squared:  0.9349 
## F-statistic: 704.8 on 1 and 48 DF,  p-value: < 2.2e-16
Set up a function to evaluate the log posterior of the model given the data. We will use weakly informative normal priors for the parameters. Since we work on the log scale, the log posterior is (up to an additive constant) the sum of the log-likelihood and the log-priors.
posterior <- function(par){
  # take the input parameters of the intercept, slope and std dev
  b0 <- par[1]
  b1 <- par[2]
  sd <- par[3]
  # the standard deviation must be positive; reject invalid proposals outright
  if (sd <= 0) return(-Inf)
  # compute the expected value given the input parameters
  y_hat <- b0 + b1*x
  # compute the log likelihoods
  loglikelihoods <- dnorm(y, mean = y_hat, sd = sd, log = TRUE)
  # sum the log likelihoods
  sumll <- sum(loglikelihoods)
  # priors - weakly informative
  b0_prior <- dnorm(b0, sd=5, log=TRUE)
  b1_prior <- dnorm(b1, sd=5, log=TRUE)
  sd_prior <- dnorm(sd, sd=5, log=TRUE)
  # return the sum of all components
  return(sum(sumll, b0_prior, b1_prior, sd_prior))
}
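As a quick sanity check (with illustrative inputs), the log posterior should be higher near the true parameters than far from them:
posterior(c(2, 4, 2))    # at the true values
posterior(c(0, 0, 10))   # far from the truth; should be much lower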
A key part of the Metropolis algorithm is the proposal distribution. From this we draw the proposal value, calculate the ratio $r$ and randomly accept or reject the proposal. Here we set the proposal distribution to be normal with standard deviations (0.5, 0.1, 0.25) for the intercept, slope and residual standard deviation respectively. These are reasonable values informed by the standard errors of the linear model fit above.
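For instance, the standard errors from the reference lm fit give a feel for sensible step sizes (a sketch, not how the values above were derived):
# standard errors of the intercept and slope from the reference fit
coef(summary(lm(y ~ x)))[, "Std. Error"]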
Now we run the Metropolis algorithm using the steps described above.
# Metropolis algorithm
metropolisMCMC <- function(theta0, proposal_sd, iter, burnin){
  # initialise the chain
  chain <- matrix(NA, nrow=iter+1, ncol=length(theta0))
  chain[1,] <- theta0
  acceptance <- rep(0, iter)
  # each iteration: take a draw from the proposal for each parameter,
  # calculate the acceptance probability r,
  # then either accept or reject the proposal with probability min(r, 1)
  for (i in 1:iter){
    theta_star <- rnorm(3, mean = chain[i,], sd = proposal_sd)
    r <- exp(posterior(theta_star) - posterior(chain[i,]))
    if (runif(1) < r){
      chain[i+1,] <- theta_star
      acceptance[i] <- 1
    }else{
      chain[i+1,] <- chain[i,]
    }
  }
  cat("Acceptance rate:", mean(acceptance[burnin:iter]), "\n")
  # discard the burn-in draws
  return(chain[burnin:iter,])
}
theta0 <- c(1, 2, 1)
prop_sd <- c(0.5, 0.1, 0.25)
its <- 50000
burn <- 5000
chain <- metropolisMCMC(theta0, prop_sd, its, burn)
## Acceptance rate: 0.2369281
Visualise the results.
# label the parameters and add an iteration index for plotting
colnames(chain) <- c("b0", "b1", "sd")
burnt_chain_df <- cbind(data.frame(chain), data.frame(iter = 1:nrow(chain)))
# plot function: posterior density and trace plot for each parameter
posterior_densities <- function(){
  # true values used to simulate the data (drawn as red reference lines)
  true_vals <- c(b0 = b0, b1 = b1, sd = sd)
  # initialise list
  gg_list <- list()
  # posterior density of each parameter, with posterior mean (black) and truth (red)
  for (p in names(true_vals)) {
    gg_list[[p]] <- ggplot(burnt_chain_df, aes(x = .data[[p]], y = after_stat(density))) +
      geom_histogram(fill = "darkcyan", col = "black") +
      geom_area(stat = "density", fill = "darkcyan", alpha = 0.5) +
      geom_vline(xintercept = true_vals[p], col = "red", lty = 2) +
      geom_vline(xintercept = mean(burnt_chain_df[[p]]), lty = 2) +
      labs(title = paste(p, "posterior"))
  }
  # trace of each parameter over the iterations
  for (p in names(true_vals)) {
    gg_list[[paste0(p, "_trace")]] <- ggplot(burnt_chain_df, aes(x = iter, y = .data[[p]])) +
      geom_line() +
      geom_hline(yintercept = mean(burnt_chain_df[[p]])) +
      geom_hline(yintercept = true_vals[p], col = "red") +
      labs(title = paste(p, "trace"))
  }
  # arrange plots in a grid: densities in one column, traces in the other
  return(grid.arrange(grobs = gg_list, nrow = 3, ncol = 2, as.table = FALSE))
}
posterior_densities()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In this example the parameter estimates are not too bad. They are a little off, given the small number of data points, but this at least demonstrates the implementation of the Metropolis algorithm.
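To quantify this, here's a quick numeric summary of the draws (a sketch using the chain object from above):
# posterior means and 95% credible intervals for each parameter
t(apply(chain, 2, function(s) c(mean = mean(s), quantile(s, c(0.025, 0.975)))))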
Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm is the generalisation of the Metropolis algorithm: it relaxes the constraint that the jumping distribution must be symmetric. To correct for the asymmetry, step 2(b) now becomes
$$r = \frac{p(\theta^{*} \mid y) / J_t(\theta^{*} \mid \theta^{t-1})}{p(\theta^{t-1} \mid y) / J_t(\theta^{t-1} \mid \theta^{*})}$$
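Note that for the fixed-scale Gaussian proposal used here the correction cancels, since the normal density is symmetric in its two arguments. A quick numeric check with arbitrary values:
# J(a | b) equals J(b | a) for a Gaussian jump, so the MH correction cancels here
a <- 1.3; b <- 0.7
dnorm(a, mean = b, sd = 0.5) == dnorm(b, mean = a, sd = 0.5)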
# Metropolis-Hastings algorithm
metropHastingsMCMC <- function(theta0, proposal_sd, iter, burnin){
  # initialise the chain
  chain <- matrix(NA, nrow=iter+1, ncol=length(theta0))
  chain[1,] <- theta0
  acceptance <- rep(0, iter)
  # each iteration: take a draw from the proposal for each parameter,
  # calculate the acceptance probability r including the proposal densities,
  # then either accept or reject the whole proposal with probability min(r, 1)
  for (i in 1:iter){
    theta_star <- rnorm(3, mean = chain[i,], sd = proposal_sd)
    # the joint proposal log density is the sum of the per-parameter log densities;
    # for this symmetric Gaussian proposal the two correction terms cancel exactly
    r <- exp(posterior(theta_star) - sum(dnorm(theta_star, mean = chain[i,], sd = proposal_sd, log = TRUE)) -
             posterior(chain[i,]) + sum(dnorm(chain[i,], mean = theta_star, sd = proposal_sd, log = TRUE)))
    if (runif(1) < r){
      chain[i+1,] <- theta_star
      acceptance[i] <- 1
    }else{
      chain[i+1,] <- chain[i,]
    }
  }
  cat("Acceptance rate:", mean(acceptance[burnin:iter]), "\n")
  # discard the burn-in draws
  return(chain[burnin:iter,])
}
chain <- metropHastingsMCMC(theta0, prop_sd, its, burn)
## Acceptance rate: 0.25095
# label the parameters and add an iteration index for plotting
colnames(chain) <- c("b0", "b1", "sd")
burnt_chain_df <- cbind(data.frame(chain), data.frame(iter = 1:nrow(chain)))
posterior_densities()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is a good demonstration of how the algorithm works, however it could use some fine tuning, e.g. choosing better proposal distributions. The traces for the intercept and slope exhibit some autocorrelation, which could be reduced by thinning the sequence (sketched below). The best thing about R is that there are great packages to help us.
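As a quick sketch of thinning (the interval of 10 is arbitrary), along with the effective sample size from coda, which is attached with MCMCpack:
# thin the chain: keep every 10th draw to reduce autocorrelation
thinned <- chain[seq(1, nrow(chain), by = 10), ]
# lag-1 autocorrelation of the slope before and after thinning
acf(chain[, "b1"], plot = FALSE)$acf[2]
acf(thinned[, "b1"], plot = FALSE)$acf[2]
# effective sample size of each parameter (coda is attached by MCMCpack)
effectiveSize(chain)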
MHadaptive and MCMCpack
The Metropolis algorithm can be implemented much more simply using the fantastic MHadaptive and MCMCpack packages. All that is needed is a function evaluating the (log) posterior density of the model, so the posterior function from above can be passed directly to these packages.
runMH <- Metro_Hastings(li_func=posterior, par=theta0, par_names = c("b0", "b1", "sd"))
## [1] "updating: 10%" ## [1] "updating: 20%" ## [1] "updating: 30%" ## [1] "updating: 40%" ## [1] "updating: 50%" ## [1] "updating: 60%" ## [1] "updating: 70%" ## [1] "updating: 80%" ## [1] "updating: 90%" ## [1] "updating: 100%"
plotMH(runMH)
runMH$acceptance_rate
## [1] 0.4436939
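If memory serves, MHadaptive also provides the BCI and mcmc_thin helpers for summarising and thinning the fitted object; treat this as a sketch and check the package documentation for the exact interface:
# credible intervals for each parameter and a thinned copy of the chain
BCI(runMH)
runMH_thinned <- mcmc_thin(runMH, thin = 10)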
Similarly, we can use the MCMCpack package and its MCMCmetrop1R function.
runMH_mcmcpack <- MCMCmetrop1R(fun=posterior, theta.init=theta0, logfun=TRUE, burnin = 5000)
## 
## 
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
## The Metropolis acceptance rate was 0.46704
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
plot(runMH_mcmcpack)
As you can see the acceptance rate is around 0.45 for both packages, whereas the algorithm we implemented achieved approximately 0.25. This is likely due to the chosen proposal distribution, and both are reasonable acceptance rates. We don't want an acceptance rate that is too high: that means the jumping distribution is too narrow, so nearly every proposal is accepted but the chain only takes small steps and explores the posterior slowly. Likewise, if it's too low, the jumping distribution is too wide and the algorithm will essentially get stuck in one place for a long time, failing to explore the parameter space adequately. (Classic results suggest an optimal acceptance rate of about 0.44 for a one-dimensional target, falling towards 0.23 in high dimensions.) Fortunately the MHadaptive and MCMCpack functions take care of this tuning for you, and you can see the difference in the output.
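To see this trade-off with our own sampler, try rerunning it with a much narrower and a much wider proposal (the scalings below are arbitrary, purely for illustration):
# a narrow proposal accepts almost everything but moves slowly
invisible(metropolisMCMC(theta0, prop_sd / 5, 10000, 1000))
# a wide proposal rejects most draws and the chain sticks in place
invisible(metropolisMCMC(theta0, prop_sd * 5, 10000, 1000))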
I encourage you to play around with different parameter sets and proposal distributions to see the differences in the posterior sequences.