Recently I was asked about the design matrix (or model matrix) for a regression model and why it is important. In a nutshell it is a matrix, usually denoted $X$, of size $n \times p$ where $n$ is the number of observations and $p$ is the number of parameters to be estimated. In simple linear regression $p = 2$, i.e. a parameter for the intercept and a parameter for the slope. The typical model formulation is

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where $\varepsilon_i \sim N(0, \sigma^2)$. The interpretation of the slope is: as $x$ increases by 1, $y$ changes by $\beta_1$.
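Before moving to real data, here is a minimal sketch (with made-up x and y, not from the original analysis) showing what this design matrix looks like and that the OLS estimates solve the normal equations $\hat{\beta} = (X^T X)^{-1} X^T y$:

# toy data, made up purely for illustration
set.seed(42)
x <- runif(20, 0, 10)
y <- 2 + 3 * x + rnorm(20)

# design matrix: a column of 1s for the intercept bound to the predictor
X <- cbind(1, x)
dim(X)  # n = 20 rows, p = 2 columns

# OLS via the normal equations; solve(A, b) avoids an explicit inverse
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat

# matches the estimates from lm()
coef(lm(y ~ x))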
Consider a short example using the Boston housing data set. This won't be the best fit for the data; it's just for explanation.
# arm attaches MASS, which provides the Boston data set
suppressPackageStartupMessages(library(arm))
data("Boston")
set.seed(908345713)
# reducing data
bost.df <- Boston[,c("medv", "lstat")]
# linear model
lm.bost <- lm(medv ~ lstat, data = bost.df)
summary(lm.bost)
## 
## Call:
## lm(formula = medv ~ lstat, data = bost.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
The interpretation of the lstat coefficient is: for each one-unit increase in lstat (the percentage of the population with lower status), the median value of homes (in $1000s) decreases by 0.95.
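As a quick numerical check (my addition, not part of the original analysis), predicting at two lstat values one unit apart shows the fitted medv changing by exactly the slope:

# fitted medv at lstat = 10 and lstat = 11
preds <- predict(lm.bost, newdata = data.frame(lstat = c(10, 11)))
# the difference equals the slope coefficient, roughly -0.95
diff(preds)
coef(lm.bost)["lstat"]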
It's somewhat more interesting when considering categorical variables. Each category needs to be converted to a numerical representation, which means expanding the design matrix into a number of columns that depends on the number of categories. For example, let's convert the lstat variable from the model above into the categories 'low', 'med' and 'high'.
quants <- quantile(bost.df$lstat, c(0.33, 0.66))

# map a single lstat value to a category
to_cat <- function(x){
  if(x < quants[1]){
    return("low")
  }else if(x < quants[2]){
    return("med")
  }else{
    return("high")
  }
}
bost.df$lstat_cat <- as.factor(sapply(bost.df$lstat, to_cat))
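As an aside, the same categorisation can be done in vectorised form with cut(). This is an equivalent alternative rather than the original code; note that cut() orders the factor levels by the breaks ('low', 'med', 'high') rather than alphabetically, which would change the reference group below, so the sapply() version is kept for the models.

# equivalent vectorised categorisation; right = FALSE matches the
# strict upper bounds used in to_cat()
lstat_cat_alt <- cut(bost.df$lstat,
                     breaks = c(-Inf, quants, Inf),
                     labels = c("low", "med", "high"),
                     right = FALSE)
# should agree perfectly with the sapply() version
table(bost.df$lstat_cat, lstat_cat_alt)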
With 3 levels, R will automatically set the contrasts (treatment coding by default) to be:
contrasts(bost.df$lstat_cat)
##      low med
## high   0   0
## low    1   0
## med    0   1
What this means is that the 'high' category is the reference group, and 'low' and 'med' are modelled against 'high'. To see this algebraically, the model is

$$\text{medv}_i = \beta_0 + \beta_1 \text{low}_i + \beta_2 \text{med}_i + \varepsilon_i$$

where $\text{low}_i$ and $\text{med}_i$ are 0/1 indicator (dummy) variables and $\varepsilon_i \sim N(0, \sigma^2)$. We now consider the cases where an observation is in the high, med or low category and estimate their respective 'medv' values using the model:

$$E[\text{medv} \mid \text{high}] = \beta_0$$
$$E[\text{medv} \mid \text{low}] = \beta_0 + \beta_1$$
$$E[\text{medv} \mid \text{med}] = \beta_0 + \beta_2$$

From the above, $\beta_0$ is the average 'medv' for the high category. For the low and med categories the average 'medv' value is given by $\beta_0 + \beta_1$ and $\beta_0 + \beta_2$ respectively. This is important since it shows that $\beta_1$ and $\beta_2$ are the differences with respect to the high category. To show this with our example we get
lm2 <- lm(medv ~ lstat_cat, data = bost.df)
summary(lm2)
## 
## Call:
## lm(formula = medv ~ lstat_cat, data = bost.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.1934  -3.9205  -0.8152   2.6463  28.2713 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   15.0017     0.4881  30.733   <2e-16 ***
## lstat_catlow  16.0917     0.6955  23.138   <2e-16 ***
## lstat_catmed   6.7270     0.6955   9.673   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.402 on 503 degrees of freedom
## Multiple R-squared:  0.5174, Adjusted R-squared:  0.5155 
## F-statistic: 269.6 on 2 and 503 DF,  p-value: < 2.2e-16
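A quick way to verify this algebra (my addition): compare the raw group means with the intercept plus each coefficient.

# group means of medv computed directly
tapply(bost.df$medv, bost.df$lstat_cat, mean)

# the same means recovered from the treatment-coded coefficients
b2 <- coef(lm2)
c(high = b2[["(Intercept)"]],
  low  = b2[["(Intercept)"]] + b2[["lstat_catlow"]],
  med  = b2[["(Intercept)"]] + b2[["lstat_catmed"]])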
There are other ways to set up the design matrix, which alter the interpretation of the coefficients. Another popular choice is Helmert contrasts, where each level is compared with the mean of the preceding levels. In R the design matrix can be changed to the Helmert design with the following commands.
# change contrasts
contrasts(bost.df$lstat_cat) <- contr.helmert(3)
contrasts(bost.df$lstat_cat)
##      [,1] [,2]
## high   -1   -1
## low     1   -1
## med     0    2
lm3 <- lm(medv ~ lstat_cat, data = bost.df)
summary(lm3)
## 
## Call:
## lm(formula = medv ~ lstat_cat, data = bost.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.1934  -3.9205  -0.8152   2.6463  28.2713 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.6080     0.2846  79.431   <2e-16 ***
## lstat_cat1    8.0458     0.3477  23.138   <2e-16 ***
## lstat_cat2   -0.4396     0.2018  -2.179   0.0298 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.402 on 503 degrees of freedom
## Multiple R-squared:  0.5174, Adjusted R-squared:  0.5155 
## F-statistic: 269.6 on 2 and 503 DF,  p-value: < 2.2e-16
Under this design the estimates of the high, low and med group means now become

$$E[\text{medv} \mid \text{high}] = \beta_0 - \beta_1 - \beta_2$$
$$E[\text{medv} \mid \text{low}] = \beta_0 + \beta_1 - \beta_2$$
$$E[\text{medv} \mid \text{med}] = \beta_0 + 2\beta_2$$

so $\beta_0$ is the mean of the three group means, $\beta_1$ is half the difference between the 'low' and 'high' means, and $\beta_2$ is a third of the difference between the 'med' mean and the average of the 'high' and 'low' means. And in the example:

$$\widehat{\text{medv}}_{\text{high}} = 22.6080 - 8.0458 - (-0.4396) = 15.0018$$
$$\widehat{\text{medv}}_{\text{low}} = 22.6080 + 8.0458 - (-0.4396) = 31.0934$$
$$\widehat{\text{medv}}_{\text{med}} = 22.6080 + 2(-0.4396) = 21.7288$$
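To check this numerically (again my addition), the same group means can be recovered from the Helmert-coded coefficients:

# group means recovered from the Helmert coefficients
b3 <- coef(lm3)
c(high = b3[[1]] - b3[[2]] - b3[[3]],
  low  = b3[[1]] + b3[[2]] - b3[[3]],
  med  = b3[[1]] + 2 * b3[[3]])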
The estimated group means are the same under each design; however, the interpretation of the coefficients is completely different. The first model has a much more coherent interpretation than the second. Under a multiple regression model the same applies, although the design matrix will be wider to accommodate all the other variables, continuous or categorical. Finally, if you want to view the full model matrix you can do so with the command
head(model.matrix(lm3))
##   (Intercept) lstat_cat1 lstat_cat2
## 1           1          1         -1
## 2           1          0          2
## 3           1          1         -1
## 4           1          1         -1
## 5           1          1         -1
## 6           1          1         -1
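For comparison (not shown in the original), the treatment-coded design matrix can be viewed the same way; model.matrix() on a fitted model uses the contrasts stored at fit time, so lm2 still shows the dummy coding even though the factor's contrasts have since been changed:

# the first categorical model keeps its original treatment coding
head(model.matrix(lm2))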
This demonstrates the role of the design matrix and its importance: if it isn't well understood, the interpretation of the model can change completely. If you're interested, try fitting the model
lm4 <- lm(medv ~ lstat_cat - 1, data = bost.df)
and see what happens. The estimates should look familiar.