Gradient descent is one of the more important optimisation techniques. It is used in a wide variety of machine learning methods due to its flexibility: it can be applied to any differentiable objective function. With each iteration a step is taken in the direction of the negative gradient until the algorithm converges to a local minimum. As it approaches the local minimum the steps become smaller, until either a specified tolerance is met or the maximum number of iterations is reached. To understand how this works, gradient descent is applied to a common method: simple linear regression.
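As a minimal illustration of the idea before turning to regression, here is gradient descent on a one-dimensional function (the function, starting value, learning rate and tolerance here are arbitrary choices for demonstration):

```r
# minimise f(x) = (x - 3)^2; its gradient is f'(x) = 2*(x - 3)
grad <- function(x) 2*(x - 3)

x <- 0        # starting value
gamma <- 0.1  # learning rate
tol <- 1e-8   # stop once the step size falls below this tolerance

for(k in 1:1000){
  step <- gamma*grad(x)
  x <- x - step
  if(abs(step) < tol) break
}
x
```

Each iteration moves `x` a fraction `gamma` of the gradient towards the minimiser at 3, and the loop exits early once the steps become negligibly small.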

The simple linear regression model is of the form

$$y = X\boldsymbol{\beta} + \epsilon$$

where $y$ is the vector of responses, $X$ the model matrix, $\boldsymbol{\beta} = (\beta_0, \beta_1)^\top$ the vector of coefficients and $\epsilon$ the error term.

The objective is to find the parameters $\boldsymbol{\beta}$ such that they minimise the mean squared error

$$L(\boldsymbol{\beta}) = \frac{1}{n}(y - X\boldsymbol{\beta})^\top(y - X\boldsymbol{\beta})$$

This is a good problem to start with since we know the analytical solution is given by

$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1}X^\top y$$

and can check our results.
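This closed form comes from setting the gradient of the mean squared error to zero, which gives the normal equations:

$$
\nabla L(\boldsymbol{\beta}) = -\frac{2}{n}X^\top(y - X\boldsymbol{\beta}) = 0
\quad\Longrightarrow\quad
X^\top X\,\boldsymbol{\beta} = X^\top y
$$

Solving for $\boldsymbol{\beta}$, assuming $X^\top X$ is invertible, gives the analytical solution.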

## Example

Set up:

```r
library(ggplot2)

set.seed(241)
nobs <- 250
b0 <- 4
b1 <- 2

# simulate data
x <- rnorm(nobs, 0, 1)
y <- b0 + b1*x + rnorm(nobs, 0, 0.5)
df <- data.frame(x, y)

# plot data
g1 <- ggplot(df, aes(x = x, y = y)) +
  geom_point()
g1
```

The analytical solution can be found manually by

```r
# set model matrix
X <- model.matrix(y ~ x, data = df)
beta <- solve(t(X)%*%X)%*%t(X)%*%y
beta
```

```
##                 [,1]
## (Intercept) 4.009283
## x           2.016444
```

And, just to convince ourselves this is correct:

```r
# linear model formulation
lm1 <- lm(y ~ x, data = df)
coef(lm1)
```

```
## (Intercept)           x
##    4.009283    2.016444
```

```r
g1 + geom_abline(slope = coef(lm1)[2], intercept = coef(lm1)[1], col = "darkred")
```

## Gradient descent

The objective is to achieve the same result using gradient descent. Gradient descent works by updating the parameters at each iteration in the direction of the negative gradient, i.e.

$$\boldsymbol{\beta}_{k+1} = \boldsymbol{\beta}_k - \gamma \nabla L(\boldsymbol{\beta}_k)$$

where $\gamma$ is the learning rate and

$$\nabla L(\boldsymbol{\beta}) = -\frac{2}{n}X^\top(y - X\boldsymbol{\beta})$$

The learning rate ensures we don't jump too far with each iteration, instead taking only some proportion of the gradient; otherwise we could overshoot the local minimum, taking much longer to converge or never finding the optimal solution at all. Applying this to the problem above, we'll initialise $\boldsymbol{\beta}$ to something sensible, e.g. $\boldsymbol{\beta}_0 = (1, 1)^\top$. I have chosen $\gamma = 0.01$ with 1000 iterations, which is a reasonable learning rate to start with for this problem. It's worth trying different values of $\gamma$ to see how convergence changes. The algorithm is set up as

```r
# by gradient descent
gamma <- 0.01
its <- 1000
beta <- matrix(NA, nrow = its+1, ncol = 2)
loss <- matrix(NA, nrow = its+1, ncol = 2)

# initial values for beta
beta[1,] <- c(1, 1)
X <- model.matrix(~x, data = df)
n <- nrow(df)

# gradient descent. not much to it really
for(k in 1:its){
  loss[k,] <- -2/n*(t(X)%*%(y - X%*%beta[k,]))
  beta[k+1,] <- beta[k,] - gamma*loss[k,]
}

# plotting results
g2 <- ggplot(df, aes(x = x, y = y)) +
  geom_point()
z <- seq(1, its+1, 10)
for(k in z){
  g2 <- g2 + geom_abline(slope = beta[k,2], intercept = beta[k,1],
                         col = "darkred", alpha = 0.2)
}
tail(beta, 1)
```

```
##             [,1]     [,2]
## [1001,] 4.009283 2.016444
```

```r
g2
```

As expected, we obtain the same result. The lines show how the fitted line moves with each update as the parameters converge to the optimal values.
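To see the effect of the learning rate mentioned earlier, a small experiment can count how many iterations each $\gamma$ needs before the parameter updates become negligible. This is a sketch: `iters_to_converge` is a hypothetical helper, the tolerance and $\gamma$ values are arbitrary choices, and the data are re-simulated so the snippet stands alone.

```r
set.seed(241)
x <- rnorm(250, 0, 1)
y <- 4 + 2*x + rnorm(250, 0, 0.5)
X <- cbind(1, x)
n <- length(y)

# hypothetical helper: iterations until the total parameter change
# falls below tol, or NA if it never does within max_its
iters_to_converge <- function(gamma, tol = 1e-8, max_its = 20000){
  b <- c(1, 1)
  for(k in 1:max_its){
    g <- as.vector(-2/n*(t(X) %*% (y - X %*% b)))
    b_new <- b - gamma*g
    if(sum(abs(b_new - b)) < tol) return(k)
    b <- b_new
  }
  NA
}

its_needed <- sapply(c(0.001, 0.01, 0.1), iters_to_converge)
its_needed
```

On this well-behaved problem, larger learning rates converge in fewer iterations; pushing $\gamma$ too high, however, makes the updates overshoot and the algorithm diverges.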

Let’s try a different set of starting values to see how well it converges.

```r
# initial values for beta
beta[1,] <- c(6, -1)

for(k in 1:its){
  loss[k,] <- -2/n*(t(X)%*%(y - X%*%beta[k,]))
  beta[k+1,] <- beta[k,] - gamma*loss[k,]
}

# plotting results
g3 <- ggplot(df, aes(x = x, y = y)) +
  geom_point()
z <- seq(1, its+1, 10)
for(k in z){
  g3 <- g3 + geom_abline(slope = beta[k,2], intercept = beta[k,1],
                         col = "darkred", alpha = 0.2)
}
tail(beta, 1)
```

```
##             [,1]     [,2]
## [1001,] 4.009283 2.016444
```

```r
g3
```

This example shows how robust the method is on a simple problem like linear regression. Even when the initial values are far from the true values, it converges quickly.
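The loops above run for a fixed 1000 iterations; as noted in the introduction, a stopping tolerance can be used instead. A sketch of that variant (the `repeat`/`break` structure and the tolerance are my own choices, and the data are re-simulated so the snippet stands alone):

```r
set.seed(241)
x <- rnorm(250, 0, 1)
y <- 4 + 2*x + rnorm(250, 0, 0.5)
X <- cbind(1, x)
n <- length(y)

gamma <- 0.01
tol <- 1e-10
b <- c(1, 1)

# iterate until the largest coefficient change is below the tolerance
repeat{
  g <- as.vector(-2/n*(t(X) %*% (y - X %*% b)))
  b_new <- b - gamma*g
  if(max(abs(b_new - b)) < tol) break
  b <- b_new
}
round(b, 6)
```

This stops as soon as the updates stall rather than running a fixed number of iterations, which avoids guessing how many iterations are enough.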
