Design Matrix for Regression Explained

Recently I was asked about the design matrix (or model matrix) for a regression model and why it is important. In a nutshell, it is a matrix, usually denoted X, of size n \times p, where n is the number of observations and p is the number of parameters to be estimated. In simple linear regression p = 2: one parameter for the intercept and one for the slope. The typical model formulation is:

    \[ \mathbf{y} = X\mathbf{\beta} + \mathbf{\epsilon} = \beta_0 + \beta_1\mathbf{x} + \mathbf{\epsilon} \]

where

    \[ \left( \begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{matrix} \right) = \left( \begin{matrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \\ \end{matrix} \right) \left( \begin{matrix} \beta_0 \\ \beta_1 \end{matrix} \right) + \left( \begin{matrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \\ \end{matrix} \right) \]
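
To make this concrete in R, here is a minimal sketch (with simulated data, so x is purely illustrative) showing that this matrix is just a column of ones bound to the predictor, and that model.matrix() builds the same thing from a formula:

# tiny simulated example: for simple linear regression the design
# matrix is a column of 1s (the intercept) bound to the predictor
set.seed(1)
x <- rnorm(5)
X <- cbind(1, x)    # built by hand: n rows, p = 2 columns
X
model.matrix(~ x)   # R constructs the same matrix from the formula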

The interpretation of the slope \beta_1 is: as x increases by 1, y changes by \beta_1. Consider a short example using the Boston housing data set. This won’t be the best fit for the data; it’s just for explanation.

suppressPackageStartupMessages(library(arm))  # arm attaches MASS, which provides the Boston data
data("Boston")
set.seed(908345713)

# reducing data
bost.df <- Boston[,c("medv", "lstat")]

# linear model
lm.bost <- lm(medv ~ lstat, data = bost.df)
summary(lm.bost)
## 
## Call:
## lm(formula = medv ~ lstat, data = bost.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16


The interpretation of the lstat coefficient is: for each one-unit increase in lstat (the percentage of the population classed as lower status), the median value of homes (in $1000’s) decreases by 0.95, i.e. roughly $950.
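
A quick way to check this interpretation numerically is to predict medv at two lstat values one unit apart (10 and 11 here are arbitrary):

# predicted medv at lstat = 10 and lstat = 11; the difference between
# the two predictions equals the slope coefficient, roughly -0.95
preds <- predict(lm.bost, newdata = data.frame(lstat = c(10, 11)))
diff(preds)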

It’s somewhat more interesting when considering categorical variables. Each category needs a numerical representation, which means expanding the design matrix into a number of columns that depends on the number of categories. For example, let’s convert the lstat variable from the model above into the categories high, medium and low.

# cut points at the 33rd and 66th percentiles of lstat
quants <- quantile(bost.df$lstat, c(0.33, 0.66))
# classify a single lstat value as "low", "med" or "high"
to_cat <- function(x){
  if(x < quants[1]){
    return("low")
  }else if(x < quants[2]){
    return("med")
  }else{
    return("high")
  }
}
bost.df$lstat_cat <- as.factor(sapply(bost.df$lstat, to_cat))
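
As an aside, the same categorisation can be done in a vectorised way with cut(). This is only an alternative sketch; the levels are reordered at the end so they match the alphabetical ordering that as.factor() produces above.

# vectorised alternative to the sapply() call above; relevel so the
# level order matches the alphabetical order ("high", "low", "med")
lstat_cut <- cut(bost.df$lstat,
                 breaks = c(-Inf, quants, Inf),
                 labels = c("low", "med", "high"),
                 right  = FALSE)
lstat_cut <- factor(lstat_cut, levels = levels(bost.df$lstat_cat))
all(lstat_cut == bost.df$lstat_cat)  # should be TRUE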


With three levels, R will automatically set the contrasts to the default treatment coding:

contrasts(bost.df$lstat_cat)
##      low med
## high   0   0
## low    1   0
## med    0   1


What this means is that the ‘high’ category is the reference group, and ‘low’ and ‘med’ are modelled against ‘high’. To see this algebraically,

    \[ \mathbf{y} = \beta_0 + \beta_{1} \mathbf{x}_{\text{low}} + \beta_{2} \mathbf{x}_{\text{med}} + \mathbf{\epsilon} \]

and

    \[ \left( \begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{matrix} \right) = \left( \begin{matrix} 1 & x_{1,\text{low}} & x_{1,\text{med}}\\ 1 & x_{2,\text{low}} & x_{2,\text{med}} \\ \vdots & \vdots & \vdots \\ 1 & x_{n,\text{low}} & x_{n,\text{med}} \\ \end{matrix} \right) \left( \begin{matrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{matrix} \right) + \left( \begin{matrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \\ \end{matrix} \right) \]

We now consider all cases where observation i is either in the high, med or low category and estimate their respective ‘medv’ values using the model.

    \[ \begin{array}{l l l} \hat{y}_{\text{high}} &= \beta_0 + \beta_1(0) + \beta_2(0) &= \beta_0 \\ \hat{y}_{\text{low}} &= \beta_0 + \beta_1(1) + \beta_2(0) &= \beta_0 + \beta_1 \\ \hat{y}_{\text{med}} &= \beta_0 + \beta_1(0) + \beta_2(1) &= \beta_0 + \beta_2 \end{array} \]

The above shows that \beta_0 is the average ‘medv’ for the high category. For the low and med categories the average ‘medv’ value is given by \beta_0 + \beta_1 and \beta_0 + \beta_2 respectively. This is important since it shows that \beta_1 and \beta_2 are the differences with respect to the high category. To show this with our example we get

lm2 <- lm(medv ~ lstat_cat, data = bost.df)
summary(lm2)
## 
## Call:
## lm(formula = medv ~ lstat_cat, data = bost.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.1934  -3.9205  -0.8152   2.6463  28.2713 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   15.0017     0.4881  30.733   <2e-16 ***
## lstat_catlow  16.0917     0.6955  23.138   <2e-16 ***
## lstat_catmed   6.7270     0.6955   9.673   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.402 on 503 degrees of freedom
## Multiple R-squared:  0.5174, Adjusted R-squared:  0.5155 
## F-statistic: 269.6 on 2 and 503 DF,  p-value: < 2.2e-16


    \[ \begin{array}{l l l} \hat{y}_{\text{high}} &= 15.0017 \\ \hat{y}_{\text{low}} &= 15.0017 + 16.0917 &= 31.0934 \\ \hat{y}_{\text{med}} &= 15.0017 + 6.7270 &= 21.7287 \end{array} \]
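
These fitted values are simply the per-category means of medv, which can be verified directly:

# group means of medv by lstat_cat; these match the fitted values above
tapply(bost.df$medv, bost.df$lstat_cat, mean)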

There are other ways to set up the design matrix, each of which alters the interpretation of the coefficients. Another popular choice is Helmert contrasts. In R the design matrix can be changed to the Helmert coding with the following commands.

# change contrasts
contrasts(bost.df$lstat_cat) <- contr.helmert(3)
contrasts(bost.df$lstat_cat)
##      [,1] [,2]
## high   -1   -1
## low     1   -1
## med     0    2
lm3 <- lm(medv ~ lstat_cat, data = bost.df)
summary(lm3)
## 
## Call:
## lm(formula = medv ~ lstat_cat, data = bost.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.1934  -3.9205  -0.8152   2.6463  28.2713 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.6080     0.2846  79.431   <2e-16 ***
## lstat_cat1    8.0458     0.3477  23.138   <2e-16 ***
## lstat_cat2   -0.4396     0.2018  -2.179   0.0298 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.402 on 503 degrees of freedom
## Multiple R-squared:  0.5174, Adjusted R-squared:  0.5155 
## F-statistic: 269.6 on 2 and 503 DF,  p-value: < 2.2e-16


Under this design the estimates for the high, med and low categories become

    \[ \begin{array}{l l l} \hat{y}_{\text{high}} &= \beta_0 + \beta_1(-1) + \beta_2(-1) &= \beta_0 - \beta_1 - \beta_2\\ \hat{y}_{\text{low}} &= \beta_0 + \beta_1(1) + \beta_2(-1) &= \beta_0 + \beta_1 - \beta_2 \\ \hat{y}_{\text{med}} &= \beta_0 + \beta_1(0) + \beta_2(2) &= \beta_0 + 2\beta_2 \end{array} \]

And in the example

    \[ \begin{array}{l l l} \hat{y}_{\text{high}} &= 22.6080 - 8.0458 + 0.4396 &= 15.0017 \\ \hat{y}_{\text{low}} &= 22.6080 + 8.0458 + 0.4396 &= 31.0934 \\ \hat{y}_{\text{med}} &= 22.6080 + 2(-0.4396) &= 21.7288 \end{array} \]
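
One way to confirm that the two parameterisations give the same group means is to multiply the contrast matrix, prefixed with a column of ones for the intercept, by the Helmert coefficients (a small sketch):

# recover the group means from the Helmert fit: each row of
# [1 | contrast matrix] multiplied by the coefficient vector
drop(cbind(1, contrasts(bost.df$lstat_cat)) %*% coef(lm3))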

The estimates \hat{y} are the same for each category; however, the interpretation of the coefficients is completely different. The first model has a much more coherent interpretation than the second. The same applies under a multiple regression model, although the design matrix will be wider to accommodate the other variables, continuous or categorical. Finally, if you want to view the full model matrix you can do so with the command

head(model.matrix(lm3))
##   (Intercept) lstat_cat1 lstat_cat2
## 1           1          1         -1
## 2           1          0          2
## 3           1          1         -1
## 4           1          1         -1
## 5           1          1         -1
## 6           1          1         -1
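
For comparison, the treatment-coded design from lm2 can be viewed the same way; since the contrasts used at fit time are stored with the model object, this should show the 0/1 dummy columns described earlier.

head(model.matrix(lm2))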


This demonstrates the role of the design matrix and its importance. If it isn’t well understood, the coefficients of the model can easily be misinterpreted. If you’re interested, try fitting the model

lm4 <- lm(medv ~ lstat_cat-1, data = bost.df)

And see what happens. It should look familiar.
