Design Matrix for Regression Explained

Recently I was asked about the design matrix (or model matrix) for a regression model and why it is important. In a nutshell, it is a matrix, usually denoted X, of size n \times p, where n is the number of observations and p is the number of parameters to be estimated. In simple linear regression p = 2: one parameter for the intercept and one for the slope. The typical model formulation is:

    \[ \mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon} = \beta_0 + \beta_1\mathbf{x} + \boldsymbol{\epsilon} \]


    \[ \left( \begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{matrix} \right) = \left( \begin{matrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \\ \end{matrix} \right) \left( \begin{matrix} \beta_0 \\ \beta_1 \end{matrix} \right) + \left( \begin{matrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \\ \end{matrix} \right) \]

The interpretation of the slope \beta_1 is: as x increases by 1, y changes by \beta_1. Consider a short example using the Boston housing data set. This won't be the best fit for the data; it's just for explanation.
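The original code block is missing here; a minimal sketch of a fit like the one described, assuming the Boston data from the MASS package and a hypothetical model name lm1, might look like this:

```r
# Simple linear regression of median home value on lstat
# (Boston data from the MASS package; lm1 is a hypothetical name)
library(MASS)

bost.df <- Boston
lm1 <- lm(medv ~ lstat, data = bost.df)
coef(lm1)
# The lstat slope comes out at approximately -0.95
```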


The interpretation of the lstat coefficient is: for each one-unit increase in lstat (lower status of the population), the median value of homes (in $1000's) decreases by 0.95.

It’s somewhat more interesting when considering categorical variables. Each category needs to be converted to a numerical representation, which means expanding the matrix into a number of columns that depends on the number of categories. For example, let’s convert the lstat variable from the model above into the categories ‘high’, ‘med’ and ‘low’.
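The code that built the categories is missing; one plausible reconstruction, using terciles as the cut points (the original cut points aren't shown) and the variable names bost.df and lstat_cat that appear later in the post:

```r
# Bin lstat into three categories; tercile breaks are an assumption,
# the original post's cut points are not shown
bost.df$lstat_cat <- cut(bost.df$lstat,
                         breaks = quantile(bost.df$lstat, probs = c(0, 1/3, 2/3, 1)),
                         labels = c("low", "med", "high"),
                         include.lowest = TRUE)

# Reorder the levels alphabetically so 'high' is the reference group,
# matching the contrasts discussed below
bost.df$lstat_cat <- factor(bost.df$lstat_cat, levels = c("high", "low", "med"))

# Refit with the categorical predictor (lm2 is a hypothetical name)
lm2 <- lm(medv ~ lstat_cat, data = bost.df)
```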


With 3 levels, R will automatically set the contrasts to be
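The contrast matrix shown in the original is missing; assuming lstat_cat is a factor with levels high, low and med, the default treatment contrasts can be inspected with:

```r
# Default (treatment) contrasts for a 3-level factor with
# levels high, low, med -- 'high' is the reference level
contrasts(bost.df$lstat_cat)
#      low med
# high   0   0
# low    1   0
# med    0   1
```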


What this means is that the ‘high’ category is the reference group, and ‘low’ and ‘med’ are modelled against ‘high’. To see this algebraically:

    \[ \mathbf{y} = \beta_0 + \beta_{1} \mathbf{x}_{\text{low}} + \beta_{2} \mathbf{x}_{\text{med}} + \boldsymbol{\epsilon} \]


    \[ \left( \begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{matrix} \right) = \left( \begin{matrix} 1 & x_{1,\text{low}} & x_{1,\text{med}} \\ 1 & x_{2,\text{low}} & x_{2,\text{med}} \\ \vdots & \vdots & \vdots \\ 1 & x_{n,\text{low}} & x_{n,\text{med}} \\ \end{matrix} \right) \left( \begin{matrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{matrix} \right) + \left( \begin{matrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \\ \end{matrix} \right) \]

We now consider the cases where observation i is in the high, med or low category, and estimate the respective ‘medv’ values using the model.

    \[ \begin{array}{l l l} \hat{y}_{\text{high}} &= \beta_0 + \beta_1(0) + \beta_2(0) &= \beta_0 \\ \hat{y}_{\text{low}} &= \beta_0 + \beta_1(1) + \beta_2(0) &= \beta_0 + \beta_1 \\ \hat{y}_{\text{med}} &= \beta_0 + \beta_1(0) + \beta_2(1) &= \beta_0 + \beta_2 \end{array} \]

The above shows that \beta_0 is the average ‘medv’ for the high category. For the low and med categories the average ‘medv’ value is given by \beta_0 + \beta_1 and \beta_0 + \beta_2 respectively. This is important since it shows that \beta_1 and \beta_2 are the differences with respect to the high category. To show this with our example we get
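The coefficient output is missing here; assuming the categorical model was stored as, say, lm2, it would be retrieved like this (the numbers below are the estimates quoted in this post):

```r
# Treatment-contrast coefficients for the categorical model
# (lm2 is a hypothetical name; values are those quoted in the post)
coef(lm2)
#  (Intercept) lstat_catlow lstat_catmed
#      15.0017      16.0917       6.7270
```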


    \[ \begin{array}{l l l} \hat{y}_{\text{high}} &= 15.0017 \\ \hat{y}_{\text{low}} &= 15.0017 + 16.0917 &= 31.0934 \\ \hat{y}_{\text{med}} &= 15.0017 + 6.7270 &= 21.7287 \end{array} \]

There are other ways to set up the design matrix, which alter the interpretation of the coefficients. Another popular choice is the Helmert contrasts. In R the design matrix can be changed to the Helmert design with the following commands.
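The commands themselves are missing from this copy; a sketch of what they would look like, again assuming the factor lstat_cat and a hypothetical model name lm3:

```r
# Switch the factor to Helmert contrasts and refit
contrasts(bost.df$lstat_cat) <- contr.helmert(3)
contrasts(bost.df$lstat_cat)
#      [,1] [,2]
# high   -1   -1
# low     1   -1
# med     0    2

lm3 <- lm(medv ~ lstat_cat, data = bost.df)
```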


Under this design the estimates of high, med and low now become

    \[ \begin{array}{l l l} \hat{y}_{\text{high}} &= \beta_0 + \beta_1(-1) + \beta_2(-1) &= \beta_0 - \beta_1 - \beta_2\\ \hat{y}_{\text{low}} &= \beta_0 + \beta_1(1) + \beta_2(-1) &= \beta_0 + \beta_1 - \beta_2 \\ \hat{y}_{\text{med}} &= \beta_0 + \beta_1(0) + \beta_2(2) &= \beta_0 + 2\beta_2 \end{array} \]

And in the example

    \[ \begin{array}{l l l} \hat{y}_{\text{high}} &= 22.6080 - 8.0458 + 0.4396 &= 15.0017 \\ \hat{y}_{\text{low}} &= 22.6080 + 8.0458 + 0.4396 &= 31.0934 \\ \hat{y}_{\text{med}} &= 22.6080 + 2(-0.4396) &= 21.7288 \end{array} \]

The estimates of \hat{y} are the same for each category; however, the interpretation of the coefficients is completely different. The first model has a much more coherent interpretation than the second. Under a multiple regression model the same applies, although the design matrix will be wider to accommodate the other variables, continuous or categorical. Finally, if you want to view the full model matrix you can do so with the command
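The command is missing from this copy; in R the design matrix of a fitted model (here the hypothetical lm2) is extracted with model.matrix:

```r
# View the design (model) matrix of a fitted model
# (lm2 is a hypothetical name for the categorical fit above)
model.matrix(lm2)

# Or build it directly from a formula without fitting:
model.matrix(~ lstat_cat, data = bost.df)
```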


This demonstrates the role of the design matrix and its importance. If it isn’t well understood, the interpretation of the model can be completely wrong. If you’re interested, try fitting the model

lm4 <- lm(medv ~ lstat_cat - 1, data = bost.df)

and see what happens. It should look familiar.
