Outlier detection and treatment is an important part of data analysis and one of the first things you should do with a new data set. Outliers are typically extraneous data points that may have occurred due to an error in measurement, or they could be genuine but simply very unlikely to occur. They have the potential to distort and skew your data and, depending on the problem, could completely change the model that you build. There is no single way to detect and treat outliers, and it is unwise to blindly apply an out-of-the-box procedure. Handling outliers takes careful consideration.
In this post we’ll get a gentle introduction to a few methods of outlier detection: Cook’s Distance, the interquartile range and the Mahalanobis distance.
Cook’s Distance
Cook’s Distance is a measure of the influence of a point in regression analysis. Data points with high influence can significantly skew results. A linear regression model is expressed as

$$y = X\beta + \varepsilon$$

where $\varepsilon$ is assumed to be iid $N(0, \sigma^2)$. The parameter vector $\beta$ contains $p$ coefficients, i.e. $\beta = (\beta_0, \beta_1, \dots, \beta_{p-1})^T$. It can be shown that the solution to the regression model which minimises the sum of squares is

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

Following from this, the predicted values of $y$ are given by

$$\hat{y} = X\hat{\beta} = X(X^T X)^{-1} X^T y = H y$$

For Cook’s Distance calculations the key component is the projection (hat) matrix $H = X(X^T X)^{-1} X^T$. The diagonal elements $h_{ii}$ of $H$ are the leverage values. Leverage measures how far an observation’s predictor values lie from those of the other observations.
Taking a step back, intuitively we want to know how influential any given point is. With this in mind it makes sense to evaluate how much the model changes when observation $i$ is removed from the data frame. The observation with the largest change is the one that is most influential on the model fit. Using the projection matrix, Cook’s Distance is defined as

$$D_i = \frac{e_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

where $\hat{\sigma}^2 = \frac{1}{n - p}\sum_{j=1}^{n} e_j^2$ is the mean squared error and $e_i = y_i - \hat{y}_i$ are the residuals.
A simple example will be set up to demonstrate the calculation of Cook’s Distance. This won’t be the best model for the data, just a useful one for explanation. The Boston data set will be down-sampled and some artificial outliers will be added.
suppressPackageStartupMessages(library(arm))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(gridExtra))
data("Boston")
set.seed(555)
# down sampling
nsamp <- 30
bost <- Boston[sample(1:nrow(Boston), nsamp),]
# chuck in some outliers
outliers <- sample(1:nsamp, 2)
bost$lstat[outliers] <- c(45, 47)
bost$medv[outliers] <- exp(c(2.2, 2.15))
# with and without outliers
bwith <- coef(lm(log(medv) ~ lstat, bost))
bwithout <- coef(lm(log(medv) ~ lstat, bost[-outliers,]))
ggplot(bost, aes(x = lstat, y = log(medv))) + geom_point() +
geom_abline(slope = bwith[2], intercept = bwith[1], col = "red") +
geom_abline(slope = bwithout[2], intercept = bwithout[1], col = "red", lty=2)
Clearly the model without the outliers (dashed red line) is a better fit for the data in this example. The independent variable “lstat” was chosen only because it is strongly correlated with the response; other variables will also be included in the model but will remain untouched.
# model formula for demonstration
# p is the number of model parameters (intercept + 3 predictors)
p <- 4
form <- as.formula(log(medv) ~ lstat + crim + ptratio)
lm.bost <- lm(form, data = bost)
summary(lm.bost)
## 
## Call:
## lm(formula = form, data = bost)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.185610 -0.074669  0.006372  0.082992  0.164398 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.559948   0.279472  16.316 3.54e-15 ***
## lstat       -0.025454   0.002375 -10.718 4.91e-11 ***
## crim        -0.016321   0.006143  -2.657 0.013308 *  
## ptratio     -0.061695   0.015488  -3.983 0.000488 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.11 on 26 degrees of freedom
## Multiple R-squared:  0.9187, Adjusted R-squared:  0.9093 
## F-statistic: 97.95 on 3 and 26 DF,  p-value: 2.724e-14
To calculate Cook’s Distance, the projection matrix is computed first.
# model matrix
X <- model.matrix(form, data = bost)
# projection matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)
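As a quick sanity check (a minimal sketch, assuming the lm.bost fit and the H matrix from above), the diagonal of the manually constructed projection matrix can be compared with the leverage values returned by R’s built-in hatvalues() function.
# sanity check: diag(H) should equal the built-in leverage values
all.equal(unname(diag(H)), unname(hatvalues(lm.bost)))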
Calculate the residuals.
# residuals
res <- (diag(nsamp)-H)%*%log(bost$medv) # or residuals(lm.bost)
# mse
mse <- 1/(nsamp-p)*sum(res^2)
Cook’s Distance is now calculated using the equation above. With all outlier procedures there is an element of judgment, namely: how far away from the bulk of the data does a data point need to be before it’s considered an outlier? In real-world situations this requires a good understanding of your data, and the user is free to decide on the threshold. A good rule of thumb for Cook’s Distance is to take all points where $D_i > 4/n$ to be outliers.
cook <- res^2/(mse*p)*(diag(H)/(1-diag(H))^2) # or by cooks.distance(lm.bost)
df <- data.frame(index = 1:nsamp, cook, lstat = bost$lstat)
# plotting
ggplot(df, aes(x=index, y=cook)) + geom_point() + geom_hline(yintercept = 4/nsamp, col = "red", lty = 2)
Based on the plot, Cook’s Distance has identified the 2 outliers we inserted into the data. It’s good practice to manually calculate and implement these procedures from scratch to aid understanding rather than just relying on the built-in functions. The same result can be achieved more simply with cooks.distance(lm.bost).
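To double-check, a short comparison (using the cook vector and lm.bost fit defined above) verifies the manual calculation against the built-in function and applies the rule-of-thumb threshold directly.
# verify the manual Cook's Distance against the built-in function
all.equal(as.vector(cook), unname(cooks.distance(lm.bost)))
# observations flagged by the D_i > 4/n rule of thumb
which(cooks.distance(lm.bost) > 4/nsamp)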
Interquartile range
The interquartile range is a basic method, but often a good place to start for univariate approaches. A data point is considered an outlier if it lies outside the range $[Q_1 - 1.5 \times \mathrm{IQR},\; Q_3 + 1.5 \times \mathrm{IQR}]$, where $Q_1$ and $Q_3$ are the first and third quartiles and $\mathrm{IQR} = Q_3 - Q_1$. These calculations are effectively the components which make up box plots.
Q <- quantile(bost$lstat, c(0.25, 0.75))
IQR <- Q[2]-Q[1]
interval <- c(Q[1]-1.5*IQR, Q[2]+1.5*IQR)
# identify outliers
id_outlier <- function(x){
if(x < interval[1]){
return(1)
}else if(x > interval[2]){
return(1)
}else{
return(0)
}
}
outliers_iqr <- which(sapply(bost$lstat, id_outlier) == 1)
The detected outliers are data points 8 and 15, the same as the outliers we inserted into the data.
Outlier <- as.factor(sapply(bost$lstat, id_outlier))
ggplot(df, aes(x=index, y=lstat, col = Outlier)) + geom_point() + scale_color_manual(values=c("black", "red"))
If time is of the essence this is an easy and quick calculation to make, but it is quite basic. It is still recommended to inspect the data and choose a more appropriate threshold if needed.
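Since these are the same components that make up a box plot, a quick alternative sketch is to let base R’s boxplot.stats() flag the points beyond the whiskers (1.5 times the IQR by default). Note that boxplot.stats() uses Tukey’s hinges for the quartiles, so on some data sets it may flag slightly different points than the quantile() based calculation above.
# values lying beyond the box plot whiskers
out_vals <- boxplot.stats(bost$lstat)$out
# their positions in the data
which(bost$lstat %in% out_vals)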
Mahalanobis Distance
The Mahalanobis distance aims to do essentially the same thing: find out how far any given point is from the bulk of the data points. The intuition behind the Mahalanobis distance is that it measures how many standard deviations a given point is away from the mean, taking the correlation between variables into account. Like Cook’s Distance it is a multivariate approach. There are a few ways this can be calculated; the most common form is

$$D_i^2 = (x_i - \bar{x})^T S^{-1} (x_i - \bar{x})$$

where $x_i$ is the vector of values for observation $i$, $\bar{x}$ is the vector of column means (with length equal to the number of columns in $Z$) and $S$ is the covariance matrix of $Z$. Both the calculation below and R’s mahalanobis() function return this squared form.
Z <- bost[,c("medv", "lstat", "ptratio", "crim")]
Z$medv <- log(Z$medv)
# mahal distance
mahal <- apply(Z, 1, function(x) t(as.matrix(x-colMeans(Z))) %*% solve(cov(Z)) %*% as.matrix(x-colMeans(Z)))
Unlike the other two methods, there is no hard rule of thumb for a Mahalanobis distance cutoff; in this case we just have to inspect the data and choose an appropriate cutoff.
df <- data.frame(index = 1:nsamp, mahal, lstat = bost$lstat)
ggplot(df, aes(x=index, y=mahal)) + geom_point() + geom_hline(yintercept = 7.5, col = "red")
The Mahalanobis distance can be calculated simply in R using the built-in mahalanobis() function.
mahal_r <- mahalanobis(Z, colMeans(Z), cov(Z))
all.equal(mahal, mahal_r)
## [1] TRUE
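If a less ad hoc cutoff is desired, one common alternative, sketched here under the assumption that the data are roughly multivariate normal, is to compare the squared distances against a chi-squared quantile with degrees of freedom equal to the number of variables.
# assumed alternative cutoff: 97.5% quantile of a chi-squared
# distribution with df equal to the number of columns of Z
cutoff <- qchisq(0.975, df = ncol(Z))
which(mahal_r > cutoff)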
Final thoughts
Here we tested 3 basic distance-based methods, all of which identified the outliers we inserted into the data. These are a great place to start when detecting outliers. More sophisticated methods will be needed for more complex data sources. Remember, knowing your data is always the first step in the outlier detection process. These methods help in understanding which points are the most influential, but they shouldn’t be used as a black box.
In later posts I’ll take a look at the outliers package in R and common methods for treatment of outliers and missing values.