Introduction to Outlier Detection

Outlier detection and treatment is an important part of data analysis and one of the first things you should do with a new data set. Outliers are typically extraneous data points that may have arisen from an error in measurement, or they may be genuine observations that are simply very unlikely to occur. They have the potential to distort and skew your data and, depending on the problem, could completely change the model that you build. There is no one way to detect and treat outliers, and it is unwise to blindly apply an out-of-the-box procedure. Outlier treatment takes careful consideration.

In this post we’ll get a gentle introduction to a few methods of outlier detection: Cook’s distance, the interquartile range, and the Mahalanobis distance.

Cook’s Distance

Cook’s distance is a measure of the influence of a point in regression analysis. Data points with high influence can significantly skew results. A linear regression model is expressed as

    \[ \mathbf{y} = X\mathbf{\beta} + \epsilon \]

where \epsilon is assumed to be iid N(0, \sigma^2\text{I}). The parameter vector \mathbf{\beta} contains p coefficients, i.e. \mathbf{\beta} = (\beta_0, \dots, \beta_{p-1}). It can be shown that the least squares estimate of \mathbf{\beta}, the one which minimises the sum of squared residuals, is

    \[ \hat{\mathbf{\beta}} = (X^TX)^{-1}X^T\mathbf{y} \]

Following from this, the predicted values of \mathbf{y} are given by

    \[ \begin{aligned} \mathbf{\hat{y}} &= X\hat{\mathbf{\beta}} \\ &= X(X^TX)^{-1}X^T\mathbf{y} \\ &= H\mathbf{y} \end{aligned} \]

For Cook’s distance calculations the key component is the projection matrix H, also called the hat matrix since it maps \mathbf{y} to \mathbf{\hat{y}}. The diagonal elements h_{ii} of H are the leverage values. Leverage measures how far an observation’s predictor values are from those of the other observations; a high-leverage point has the potential to pull the fit towards itself.
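
As a quick aside, leverage values can be read straight off a fitted model with R’s built-in hatvalues() function. Below is a minimal sketch on the toy cars data set (the names fit0, X0 and H0 are my own) confirming they match the diagonal of the projection matrix.

# toy model on the built-in cars data set
fit0 <- lm(dist ~ speed, data = cars)
X0 <- model.matrix(fit0)
H0 <- X0 %*% solve(t(X0) %*% X0) %*% t(X0)

# leverage values are the diagonal of the hat matrix
all.equal(unname(diag(H0)), unname(hatvalues(fit0)))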

Taking a step back, intuitively we want to know how influential any given point is. With this in mind it makes sense to evaluate how much the model changes when observation i is removed from the data. The observation producing the largest change is the one most influential on the model fit. Using the projection matrix, Cook’s distance is defined as

    \[ D_i = \frac{e_i^2}{s^2p}\left[ \frac{h_{ii}}{(1-h_{ii})^2} \right] \]

where s^2 = \frac{1}{n-p}\sum_{i=1}^n e_i^2 is the mean squared error and e_i = y_i-\hat{y}_i are the residuals (in vector form, \mathbf{e} = (\text{I}-H)\mathbf{y}).
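
This closed form is equivalent to the more intuitive leave-one-out definition D_i = \frac{1}{ps^2}\sum_j (\hat{y}_j - \hat{y}_{j(i)})^2, where \hat{y}_{j(i)} are the fitted values when observation i is dropped. A brute-force sketch of that definition is below (loo_cooks is my own helper name, not a standard function); the closed form above avoids refitting the model n times.

# brute-force Cook's distance: refit without each observation and
# measure the shift in the fitted values
loo_cooks <- function(fit, data) {
  yhat <- fitted(fit)
  p <- length(coef(fit))
  s2 <- summary(fit)$sigma^2 # mse from the full fit
  sapply(seq_len(nrow(data)), function(i) {
    fit_i <- update(fit, data = data[-i, ])
    yhat_i <- predict(fit_i, newdata = data)
    sum((yhat - yhat_i)^2) / (p * s2)
  })
}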

A simple example will be set up to demonstrate the calculation of Cook’s distance. This won’t be the best model for the data, just a useful one for explanation. The Boston data set will be down-sampled and some artificial outliers inserted.

suppressPackageStartupMessages(library(arm))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(gridExtra))
data("Boston")
set.seed(555)

# down sampling
nsamp <- 30
bost <- Boston[sample(1:nrow(Boston), nsamp),]


# chuck in some outliers
outliers <- sample(1:nsamp, 2)
bost$lstat[outliers] <- c(45, 47)
bost$medv[outliers] <- exp(c(2.2, 2.15))

# with and without outliers
bwith <- coef(lm(log(medv) ~ lstat, bost))
bwithout <- coef(lm(log(medv) ~ lstat, bost[-outliers,]))
ggplot(bost, aes(x = lstat, y = log(medv))) + geom_point() +
  geom_abline(slope = bwith[2], intercept = bwith[1], col = "red") +
  geom_abline(slope = bwithout[2], intercept = bwithout[1], col = "red", lty=2)

[Plot: log(medv) against lstat with the fitted line including the outliers (solid red) and excluding them (dashed red)]

Clearly the model without the outliers (dashed red line) is a better fit to the data. The predictor lstat was chosen for the plot only because it is strongly correlated with the response; other variables will also be included in the model, but they remain untouched by the inserted outliers.
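
As a rough check of that claim (this snippet is my own addition), the columns of the full Boston data can be ranked by their absolute correlation with log(medv); medv itself trivially tops the list.

# rank columns by absolute correlation with log(medv)
sort(abs(cor(Boston, log(Boston$medv))[, 1]), decreasing = TRUE)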

# model formula for demonstration
p <- 4 # number of coefficients (intercept + 3 predictors)
form <- as.formula(log(medv) ~ lstat + crim + ptratio)
lm.bost <- lm(form, data = bost)
summary(lm.bost)
## 
## Call:
## lm(formula = form, data = bost)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.185610 -0.074669  0.006372  0.082992  0.164398 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.559948   0.279472  16.316 3.54e-15 ***
## lstat       -0.025454   0.002375 -10.718 4.91e-11 ***
## crim        -0.016321   0.006143  -2.657 0.013308 *  
## ptratio     -0.061695   0.015488  -3.983 0.000488 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.11 on 26 degrees of freedom
## Multiple R-squared:  0.9187, Adjusted R-squared:  0.9093 
## F-statistic: 97.95 on 3 and 26 DF,  p-value: 2.724e-14


To calculate Cook’s distance, first construct the projection matrix.

# model matrix
X <- model.matrix(form, data = bost)

# projection matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)
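
Two optional sanity checks, my own aside rather than part of the original walkthrough: H is symmetric and idempotent (projecting twice changes nothing), and its trace equals the number of coefficients p.

# H is a projection: symmetric, idempotent, trace equal to p
all.equal(H, t(H))
all.equal(H %*% H, H)
sum(diag(H)) # should return p = 4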


Next, calculate the residuals and the mean squared error s^2.

# residuals
res <- (diag(nsamp)-H)%*%log(bost$medv) # or residuals(lm.bost)

# mse
mse <- 1/(nsamp-p)*sum(res^2)


Cook’s distance is now calculated using the equation above. With all outlier procedures there is an element of judgment, namely: how far away from the bulk of the data does a point need to be before it is considered an outlier? In real-world situations this requires a good understanding of your data, and the user is free to decide on the threshold. A good rule of thumb for Cook’s distance is to treat all points where D_i > 4/n as outliers.

cook <- res^2/(mse*p)*(diag(H)/(1-diag(H))^2) # or cooks.distance(lm.bost)
df <- data.frame(index = 1:nsamp, cook, lstat = bost$lstat)

# plotting
ggplot(df, aes(x=index, y=cook)) + geom_point() + geom_hline(yintercept = 4/nsamp, col = "red", lty = 2)

[Plot: Cook’s distance by observation index, with the 4/n threshold shown as a dashed red line]


Based on the plot, Cook’s distance has identified the 2 outliers we inserted into the data. It’s good practice to calculate and implement these procedures from scratch to aid understanding rather than just using the built-in functions. The same result can be achieved more simply with cooks.distance(lm.bost).
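
As a quick check, the manual calculation can be compared against the built-in directly (the manual result is a one-column matrix, so the dimensions and names are dropped first).

# compare the from-scratch calculation with R's built-in
all.equal(as.numeric(cook), unname(cooks.distance(lm.bost)))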

Interquartile range

The interquartile range is a basic method, but it is often a good place to start for univariate approaches. A data point is considered an outlier if it lies outside the range [\text{Q1}-1.5\,\text{IQR}, \text{Q3}+1.5\,\text{IQR}] where \text{IQR}=\text{Q3}-\text{Q1}. These are effectively the components which make up a box plot.

Q <- quantile(bost$lstat, c(0.25, 0.75))
IQR <- Q[2]-Q[1]
interval <- c(Q[1]-1.5*IQR, Q[2]+1.5*IQR)

# identify outliers
id_outlier <- function(x){
  if(x < interval[1]){
    return(1)
  }else if(x > interval[2]){
    return(1)
  }else{
    return(0)
  }
}
outliers_iqr <- which(sapply(bost$lstat, id_outlier) == 1)


The detected outliers are data points 8 and 15, which are exactly the points where we inserted the artificial outliers.

Outlier <- as.factor(sapply(bost$lstat, id_outlier))
ggplot(df, aes(x=index, y=lstat, col = Outlier)) + geom_point() + scale_color_manual(values=c("black", "red"))

[Plot: lstat by observation index, with points flagged by the IQR rule shown in red]

If time is of the essence this is a quick and easy calculation to make, but it is quite basic. It is still recommended to inspect the data and choose a more appropriate threshold if needed.
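
One more option worth knowing (my own aside): base R’s boxplot.stats() applies essentially the same 1.5 IQR fences in one line, though it computes the quartiles from fivenum() hinges rather than quantile(), so the flagged points can differ slightly on small samples.

# values of lstat beyond the box plot whiskers
boxplot.stats(bost$lstat)$out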

Mahalanobis Distance

The Mahalanobis distance aims to do essentially the same thing: find out how far any given point is from the bulk of the data. The intuition is that it measures how many standard deviations a given point is from the mean, accounting for the correlations between variables. Like Cook’s distance it is a multivariate approach. There are a few ways it can be calculated; the most common is

    \[ D^2(\mathbf{x}_i) = (\mathbf{x}_i-\mathbf{\mu})^T S^{-1}(\mathbf{x}_i-\mathbf{\mu}) \]

where \mathbf{x}_i is the vector of values for observation i, \mathbf{\mu} is the vector of column means (of length p, the number of variables) and S is the covariance matrix of the data.

Z <- bost[,c("medv", "lstat", "ptratio", "crim")]
Z$medv <- log(Z$medv)

# column means and inverse covariance matrix, computed once
mu <- colMeans(Z)
S_inv <- solve(cov(Z))

# squared Mahalanobis distance for each observation
mahal <- apply(Z, 1, function(x) t(as.matrix(x - mu)) %*% S_inv %*% as.matrix(x - mu))


Unlike the other two methods, there is no hard rule of thumb for a Mahalanobis distance cutoff. Under multivariate normality the squared distance D^2 approximately follows a \chi^2 distribution with p degrees of freedom, which suggests a quantile of that distribution as a threshold; in practice we can also just inspect the data and choose an appropriate cutoff.
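
If the \chi^2 approximation seems reasonable for your data, a cutoff can be computed directly; this is my own hedged addition, and the plot below sticks with a visually chosen cutoff of 7.5 instead.

# 97.5% chi-squared quantile with p = ncol(Z) degrees of freedom
cutoff <- qchisq(0.975, df = ncol(Z))
which(mahal > cutoff)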

df <- data.frame(index = 1:nsamp, mahal, lstat = bost$lstat)
ggplot(df, aes(x=index, y=mahal)) + geom_point() + geom_hline(yintercept = 7.5, col = "red")

[Plot: Mahalanobis distance by observation index, with the chosen cutoff of 7.5 shown as a red line]

The Mahalanobis distance can be calculated simply in R using the built-in mahalanobis() function, and the result agrees with the manual calculation.

mahal_r <- mahalanobis(Z, colMeans(Z), cov(Z))
all.equal(mahal, mahal_r)
## [1] TRUE

Final thoughts

Here we tested 3 basic distance-based methods, all of which identify the outliers we inserted into the data. These are a great place to start when detecting outliers. More sophisticated methods will be needed for more complex data sources. Remember, knowing your data is always the first step in the outlier detection process. These methods help identify the most influential points, but they shouldn’t be used as a black box.

In later posts I’ll take a look at the outliers package in R and common methods for treatment of outliers and missing values.
