Recently I was asked about the design matrix (or model matrix) for a regression model and why it is important. In a nutshell it is a matrix usually denoted of size where is the number of observations and is the number of parameters to be estimated. In simple linear regression i.e. a parameter for the intercept and a parameter for the slope. The typical model formulation is:

where

The interpretation of the slope is, as increases by 1 changes by . Consider a short example using the Boston housing data set. This won’t be the best fit for the data, it’s just for explanation.

suppressPackageStartupMessages(library(arm))
data("Boston")
set.seed(908345713)

# reducing data
bost.df <- Boston[,c("medv", "lstat")]

# linear model
lm.bost <- lm(medv ~ lstat, data = bost.df)
summary(lm.bost)

##
## Call:
## lm(formula = medv ~ lstat, data = bost.df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.168  -3.990  -1.318   2.034  24.500
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16


The interpretation of the lstat coefficient is, for each increase in lstat (lower status of the population) the median value of homes (in $1000’s) decreases by 0.95. It’s somewhat more interesting when considering categorical variables. Each category needs to be converted to a numerical representation, this means expanding the matrix out into a number of columns depending on the number of categories. For example, let’s convert the lstat variable from the model above to categories high, medium and low. quants <- quantile(bost.df$lstat, c(0.33, 0.66))
to_cat <- function(x){
if(x < quants[1]){
return("low")
}else if(x < quants[2]){
return("med")
}else{
return("high")
}
}
bost.df$lstat_cat <- as.factor(sapply(bost.df$lstat, to_cat))


With 3 levels R will automatically set the contrasts to be

contrasts(bost.df$lstat_cat)  ## low med ## high 0 0 ## low 1 0 ## med 0 1  What this means is the ‘high’ category is the reference group and ‘low’ and ‘med’ are modelled against ‘high’. To see this algebraically and We now consider all cases where observation is either in the high, med or low category and estimate their respective ‘medv’ values using the model. From the above it is shown is the average ‘medv’ for the high category. For the low and med categories the average ‘medv’ value is given by and respectively. This is important since it shows that and are the differences with repect to the high category. To show this with our example we get lm2 <- lm(medv ~ lstat_cat, data = bost.df) summary(lm2)  ## ## Call: ## lm(formula = medv ~ lstat_cat, data = bost.df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -19.1934 -3.9205 -0.8152 2.6463 28.2713 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 15.0017 0.4881 30.733 <2e-16 *** ## lstat_catlow 16.0917 0.6955 23.138 <2e-16 *** ## lstat_catmed 6.7270 0.6955 9.673 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 6.402 on 503 degrees of freedom ## Multiple R-squared: 0.5174, Adjusted R-squared: 0.5155 ## F-statistic: 269.6 on 2 and 503 DF, p-value: < 2.2e-16  There are other ways to set up the design matrix which alters the interpretation of the coefficients. Another popular design is the Helmert contrasts. In R the design matrix can be changed to the Helmert design with the following commands. # change contrasts contrasts(bost.df$lstat_cat) <- contr.helmert(3)
contrasts(bost.df\$lstat_cat)

##      [,1] [,2]
## high   -1   -1
## low     1   -1
## med     0    2

lm3 <- lm(medv ~ lstat_cat, data = bost.df)
summary(lm3)

##
## Call:
## lm(formula = medv ~ lstat_cat, data = bost.df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -19.1934  -3.9205  -0.8152   2.6463  28.2713
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  22.6080     0.2846  79.431   <2e-16 ***
## lstat_cat1    8.0458     0.3477  23.138   <2e-16 ***
## lstat_cat2   -0.4396     0.2018  -2.179   0.0298 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.402 on 503 degrees of freedom
## Multiple R-squared:  0.5174, Adjusted R-squared:  0.5155
## F-statistic: 269.6 on 2 and 503 DF,  p-value: < 2.2e-16


Under this design the estimates of high, med and low now become

And in the example

The estimates of are the same for each category however the interpretation of the coefficients is completely different. The first model has much more coherent interpretation than the second. Under a multivariate regression model the same applies although the design matrix will be wider to accommodate all other variables, continuous or categorical. Finally, if you want to view the full model matrix you can do so by the command

head(model.matrix(lm3))

##   (Intercept) lstat_cat1 lstat_cat2
## 1           1          1         -1
## 2           1          0          2
## 3           1          1         -1
## 4           1          1         -1
## 5           1          1         -1
## 6           1          1         -1


This demonstrates the role of the design matrix and its importance. If it isn’t well understood it could completely change the interpretation of the model. If you’re interested try fitting the model

lm4 <- lm(medv ~ lstat_cat-1, data = bost.df)

And see what happens. It should look familiar.