Outlier detection and treatment is an important part of data analysis and one of the first things you should do with a new data set. Outliers are typically extraneous data points that may have occurred due to an error in measurement or they could be genuine only very unlikely to occur. They have the potential to distort and skew your data and depending on the problem could completely change the model that you build. There’s no one way to detect and treat outliers and is unwise to blindly use an out-of-the-box procedure. Outliering takes careful consideration.

In this post we’ll get a gentle introduction to a few methods of outlier detection, Cook’s Distance, the interquartile range, and Mahalanobis distance.

## Cook’s Distance

Cook’s Distance is a measure of the influence of a point in regression analysis. Data points with high influence can significantly skew results. A linear regression model is expressed

where is assumed to be iid . The paramter vector contains coefficients i.e. . It can be shown the solution to the regression model which minimises the sum of squares is

Following from this the predicted values of are given by

For Cook’s Distance calculations the key component is the projection matrix . The diagonal elements of are the leverage values. Leverage is defined as the measure of how far away the predictor values are from the response variable values .

Taking a step back, intuitively we want to know how influential any given point is. With this in mind it makes sense to evaluate how much the model changes when observation is removed from the data frame. The observation with the largest change will be the observation that is most influential on the model fit. Using the projection matrix, Cook’s Distance is defined as

where and are the residuals.

A simple example will be set up to demonstrate the calculation of Cook’s distance. This won’t be the best model to fit the data just a useful one for explanation. The Boston data set will be down sampled and some artificial outliers will be put in.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
suppressPackageStartupMessages(library(arm)) suppressPackageStartupMessages(library(ggplot2)) suppressPackageStartupMessages(library(gridExtra)) data("Boston") set.seed(555) # down sampling nsamp <- 30 bost <- Boston[sample(1:nrow(Boston), nsamp),] # chuck in some outliers outliers <- sample(1:nsamp, 2) bost$lstat[outliers] <- c(45, 47) bost$medv[outliers] <- exp(c(2.2, 2.15)) # with and without outliers bwith <- coef(lm(log(medv) ~ lstat, bost)) bwithout <- coef(lm(log(medv) ~ lstat, bost[-outliers,])) ggplot(bost, aes(x = lstat, y = log(medv))) + geom_point() + geom_abline(slope = bwith[2], intercept = bwith[1], col = "red") + geom_abline(slope = bwithout[2], intercept = bwithout[1], col = "red", lty=2) |

Clearly the model without the outliers (dashed red line) is a better fit for the data in the example. The independent variable “lstat” was chosen only because it is a strongly correlated variable however, other variables will be included in the model that will remain untouched.

1 2 3 4 5 6 |
# model formula for demostration p <- 4 form <- as.formula(log(medv) ~ lstat + crim + ptratio) lm.bost <- lm(form, data = bost) summary(lm.bost) |

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
## ## Call: ## lm(formula = form, data = bost) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.185610 -0.074669 0.006372 0.082992 0.164398 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.559948 0.279472 16.316 3.54e-15 *** ## lstat -0.025454 0.002375 -10.718 4.91e-11 *** ## crim -0.016321 0.006143 -2.657 0.013308 * ## ptratio -0.061695 0.015488 -3.983 0.000488 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.11 on 26 degrees of freedom ## Multiple R-squared: 0.9187, Adjusted R-squared: 0.9093 ## F-statistic: 97.95 on 3 and 26 DF, p-value: 2.724e-14 |

To calculate Cook’s Distance the projection matrix is first calculated.

1 2 3 4 5 6 |
# model matrix X <- model.matrix(form, data = bost) # projection matrix H <- X %*% solve(t(X) %*% X) %*% t(X) |

Calculate the residuals.

1 2 3 4 5 6 |
# residuals res <- (diag(nsamp)-H)%*%log(bost$medv) # or residuals(lm.bist) # mse mse <- 1/(nsamp-p)*sum(res^2) |

Cook’s distance is now calculated using the equation above. With all outlier procedures there is an element of judgment namely, how far away from the bulk of the data does a data point need to be before it’s considered an outlier? In real world situations this requires a good understanding of your data. The user is free to decide on what that threshold is. A good rule of thumb for Cook’s distance is taking all those points where to be outliers.

1 2 3 4 5 6 |
cook <- res^2/(mse*p)*(diag(H)/(1-diag(H))^2) # or by cooks.distnace(lm.bost) df <- data.frame(index = 1:nsamp, cook, lstat = bost$lstat) # plotting ggplot(df, aes(x=index, y=cook)) + geom_point() + geom_hline(yintercept = 4/nsamp, col = "red", lty = 2) |

Based on the plot Cook’s distance has identified the 2 outliers we inserted into the data. It’s good practice to manually calculate and implement these process from scratch to aid understanding rather than just using the in built functions. This result can be achieved more simply by ‘cooks.distance(lm.bost)’.

## Interquartile range

The interquartile range is a basic method but often a good place to start for univariate approaches. A data point is considered an outlier if it lies outside the range where . These calculations are effectively the components which make up box plots.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
Q <- quantile(bost$lstat, c(0.25, 0.75)) IQR <- Q[2]-Q[1] interval <- c(Q[1]-1.5*IQR, Q[2]+1.5*IQR) # identify outliers id_outlier <- function(x){ if(x < interval[1]){ return(1) }else if(x > interval[2]){ return(1) }else{ return(0) } } outliers_iqr <- which(sapply(bost$lstat, id_outlier) == 1) |

The detected outliers are data points 8, 15 and the inserted outliers are data points 8, 15.

1 2 3 |
Outlier <- as.factor(sapply(bost$lstat, id_outlier)) ggplot(df, aes(x=index, y=lstat, col = Outlier)) + geom_point() + scale_color_manual(values=c("black", "red")) |

If time is of the essence this is an easy and quick calculation to make but it is quite basic. It is still recommended to observe the data and choose a more appropriate threshold if needed.

## Mahalanobis Distance

The Mahalanobis distance aims to do essentially the same thing, find out how far any given point is away from the bulk of the data points. The intuition behind the Mahalanobis is it measures how many standard deviations an given point is away from the mean. Like Cook’s distance it is a multivariate approach. There are a few ways this can be calculated. The most common way is

where is a vector of values for observation , is a vector of length equal to the number of columns in and is the covariance matrix for .

1 2 3 4 5 6 |
Z <- bost[,c("medv", "lstat", "ptratio", "crim")] Z$medv <- log(Z$medv) # mahal distance mahal <- apply(Z, 1, function(x) t(as.matrix(x-colMeans(Z))) %*% solve(cov(Z)) %*% as.matrix(x-colMeans(Z))) |

There is no rule of thumb with regards to a cutoff for Mahalanobis distance, unlike the other two methods. In this case we just have to inspect the data and choose an appropriate cutoff.

1 2 3 |
df <- data.frame(index = 1:nsamp, mahal, lstat = bost$lstat) ggplot(df, aes(x=index, y=mahal)) + geom_point() + geom_hline(yintercept = 7.5, col = "red") |

The Mahalanobis Distance can be calculated simply in R using the in built function.

1 2 3 |
mahal_r <- mahalanobis(Z, colMeans(Z), cov(Z)) all.equal(mahal, mahal_r) |

1 |
## [1] TRUE |

## Final thoughts

Here we tested 3 basic distance based methods which all identify the outliers we inserted into the data. These are a great place to start when detecting outliers in data. More sophisticated methods will be needed for more complex data sources. Remember, knowing your data is always the first step in the outliering process. These methods help in understanding which are the most influential points but shouldn’t be used as a black box.

In later posts I’ll take a look at the outliers package in R and common methods for treatment of outliers and missing values.

Follow me on social media: