Deep Neural Network from Scratch in R

Neural networks evolved in the computer science domain are often the first thing people think of when they hear machine learning. A neural network is made up of layers and nodes often illustrated in complicated looking network diagrams. In truth neural nets aren’t that complicated. A simple feed forward neural net can be thought of a set of stacked logistic regression models (when the logistic activation function is used) and should be fairly straight forward to understand for anyone familiar with regression. The activation function is essentially the non-linear component of the model which inserts some complexity and customisation to find non-linear patterns in the data.

Consider a simple neural net with 3 input variables, 1 hidden layer with 4 nodes and 2 variables in the output layer. The 3 input variables can be thought of the covariates in a regression model. The hidden layer is what gives neural nets their power. Each node can be thought of a feature that has been generated from the input variables. Each node in the hidden layer is connected to input variables by a set of weights. These weights play essentially the same role as the coefficients in a regression model, meaning each node is a linear combination of the input variables. A non-linear function is then applied to the linear combination.

Then next step is then to use the generated hidden layer nodes as inputs to another set of models to produce the output in the same way. Typically neural nets are supervised models meaning there are labels or some dependent variable on the input data set we would like to predict with as high accuracy as possible. To do this, the error of the output \hat{\mathbf{y}} when compared with the true \mathbf{y} is propagated back through the model via backpropagation and gradient descent. The weights are adjusted in the direction which reduces the error of the output. This is iterated on until a certain level of tolerance or maximum allowed iterations is achieved.

Algebraically this is as follows.

    \[ \begin{array}{l l} X = \left[ \begin{matrix} 1 & x_{11} & x_{12} & x_{13} \\ 1 & x_{21} & x_{22} & x_{23} \\ \vdots & & & \vdots \\ 1 & x_{n1} & x_{n2} & x_{n3} \end{matrix} \right]; & W_1 = \left[ \begin{matrix} \beta^1_{01} & \beta^1_{02} & \beta^1_{03} & \beta^1_{04} \\ w^1_{11} & w^1_{12} & w^1_{13} & w^1_{14} \\ w^1_{21} & w^1_{22} & w^1_{23} & w^1_{24} \\ w^1_{31} & w^1_{32} & w^1_{33} & w^1_{34} \\ \end{matrix} \right]; \\ W_2 = \left[ \begin{matrix} \beta^2_{01} & \beta^2_{02} \\ w^2_{11} & w^2_{12} \\ w^2_{21} & w^2_{22} \\ w^2_{31} & w^2_{32} \\ w^2_{41} & w^2_{42} \\ \end{matrix} \right]; & \hat{\mathbf{y}} = \left[ \begin{matrix} \hat{y}_{11} & \hat{y}_{12} \\ \hat{y}_{21} & \hat{y}_{22} \\ \vdots & \vdots \\ \hat{y}_{n1} & \hat{y}_{n2} \\ \end{matrix} \right] \end{array} \]

The X matrix is the same as the design matrix in a regression model where the first column is a column of 1’s for the intercept term (commonly referred to as the bias parameter). There are 3 input nodes in this example. The weight matrices are essentially the parameters of the models which will be come clearer later. W_1 connects the input layer and the hidden layer and W_2 connects the hidden layer and the output layer in this example. The first row \mathbf{\beta}, I’ve labelled differently to indicate that this is the bias parameter (I’ll refer to it as the bias term from now on because the other literature refer to it as such but you can see how statisticians would call it the intercept). The reason for labelling it differently is the bias term is not influenced by the previous layer, it is simply added to each layer. This can be viewed as the node not having any incoming connections unlike the other nodes. The number of columns in \mathbf{W}_1 indicate the number of hidden layer nodes.

The first step is to initialise the weight matrices. This is done by simply taking random draws from a uniform distribution on the interval [-1,1].

    \[ W \sim U(-1,1) \]

The first calculation is to construct the hidden layer.

    \[ \begin{array}{l l} \mathbf{A}_1 & = \mathbf{X}\mathbf{W}_1 \\ \mathbf{H} & = f(\mathbf{A}_1) \end{array} \]

This will be a matrix of size n \times 4 i.e. n observations and 4 hidden layer nodes. The function f is the activation function, commonly the logistic function (sigmoid) or ReLU. There are quite a number of activation functions and some work better on certain data. Most literature show ReLU tends to converge faster. For this example I’ll use the sigmoid function.

    \[ f(x) = \frac{1}{1+e^{-x}}; & f^{\prime}(x) = f(x)(1-f(x)) \]

Just to make it explicit

    \[ h_{ij} = f(\beta^1_{0j} + w^1_{1j}x_{i1} + w^1_{2j}x_{i2} + w^1_{3j}x_{i3}) \]

which should look familiar. The hidden layer nodes will act as input into the output layer. They can be thought of features that have been generated from the input layer. This is something that a straight regression model does not do, uncover the hidden patterns in the data which is the power of neural nets. Similar to the input layer and the matrix \mathbf{X} a column of 1’s needs to be added to \mathbf{H} in order to factor in the bias associated with the connections between the hidden layer and the output layer.

    \[ \mathbf{\bar{H}} = \left[ \begin{matrix} 1 & h_{11} & h_{12} & h_{13} & h_{14} \\ 1 & h_{21} & h_{22} & h_{23} & h_{24} \\ \vdots & & & & \vdots \\ 1 & h_{n1} & h_{n2} & h_{n3} & h_{n4} \end{matrix} \right] \]

Notation in other posts detailing the maths behind neural nets often incorporate the bias differently or not at all to simplify it. I find it easier to explain it this way and show the relationship between neural networks and regression by adding it into the input matrices.

Finally the output layer is then given by

    \[ \begin{array}{l l} \mathbf{A}_2 &= \mathbf{\bar{H}}\mathbf{W}_2 \\ \hat{\mathbf{y}} &= f(\mathbf{A}_2) \end{array} \]

Initially the output will not be very accurate. The weights need to be adjusted in the direction that reduces the error. This is done by backpropagation. Firstly, the derivative of the error function is calculated. Typically this is the sum of squares or the mean square error. It doesn’t make a lot of difference which is used because the key component is the difference between the true and predicted values. The constant parameters can be dealt with by the learning rate when the weights are adjusted. Often the sum of squares is multiplied by ½ as a convenience for when the derivative is taken it drops out.

    \[ C = \frac{1}{2}\sum_{ij}(y_{ij}-\hat{y}_{ij})^2 \]

The partial derivative of the cost function with respect to the weights needs to be calculated in order to adjust the weights in the direction of the gradient. To do this we use the chain rule. The adjustment required between the hidden and output layer is given by the following.

    \[ \frac{\partial C}{\partial w^2_{ij}} = \frac{\partial C}{\partial \hat{y}_j} \frac{\partial \hat{y}_j}{\partial a^2_{j}} \frac{\partial a^2_{j}}{\partial w^2_{ij}} \]

Here w^2_{ij} refers to the weight connecting node i to node j in layer 2. Fortunately the partial derivatives are straight forward to solve for from here. The components of the above equation are given by the following.

    \[ \begin{array}{l l} \frac{\partial a^2_{j}}{\partial w^2_{ij}} & = \frac{\partial} {\partial w^2_{ij}}\left( \mathbf{\bar{h}}_i \mathbf{w}^2_j \right) = \bar{h}_{i} \\ \frac{\partial \hat{y}_j}{\partial a^l_{j}} & =  \frac{\partial f(a^l_j)}{\partial a^l_{j}} = f^{\prime}(a^l_j) \\ \frac{\partial C}{\partial \hat{y}_j} & =  y_j - \hat{y}_j \end{array} \]

The errors are multiplied by the gradient of the output meaning the weights will be adjusted in the direction of that minimises the error.

    \[ \mathbf{\delta}_y = (\mathbf{y} - \mathbf{\hat{y}}) \cdot f^{\prime}(\mathbf{A}_2) \]

Note that the dot here is indicating direct multiplication rather than matrix multiplication, therefore (\mathbf{y} - \mathbf{\hat{y}}) and f^{\prime}(\mathbf{A}_2) must be the same dimensions. In a similar way we need to find the error in the hidden layer. We don’t know the true values of the hidden layer since the model is generating them. Instead we use the output error to calculate the change required in the hidden layer. This error is propagated back to the hidden layer by the following.

    \[ \mathbf{\delta}_h = \mathbf{\delta_{y}} \mathbf{W}^T_2 \cdot f^{\prime}(\mathbf{A}_{1}) \]

Using these equations, the weight matrices are now updated as follows.

    \[ \begin{array}{l l} \mathbf{W}_2^{t+1} & =  \mathbf{W}_2^{t} + \gamma\mathbf{\bar{H}}^{T} \cdot \mathbf{\delta}_y \\ \mathbf{W}_1^{t+1} & =  \mathbf{W}_1^{t} + \gamma\mathbf{X}^T \cdot \mathbf{\delta}_h \end{array} \]

where \gamma is the learning rate. These steps are then repeated until convergence or the maximum number of iterations has been reached.

Where’s the deep part?

There isn’t really a consensus on what makes deep learning deep. The deep part tends to refer to the number of layers in a neural net. Artificial neural networks are not new methodologies however it is only recently there is the computational power to train models with 10+ layers with hundreds to thousands of nodes. It is often said that a neural net with 2 or more hidden layers is considered deep. This is probably debatable, but for this post that’s what I’ll stick with since neural nets are easily salable. A model with 10 hidden layers isn’t any more mathematically complex than a model with 2 hidden layers, there are just more steps in the process and more weights to update. I won’t go through the maths for a model with 2 layers here, but by following the code and the theory above it should be straight forward how it fits together.


In R we’ll set up a toy example, apply it to the iris data set and compare the results to the ‘neuralnet’ package. This example will have

  • 8 observations
  • 3 input variables
  • 2 x 6 node hidden layer
  • 2 output nodes

For this model there will be 3 sets of weights connecting the layers: input – H_1, H_1H_2 and H2 – output. The sigmoid (logistic) activation function will be used for it’s simple form and easy implementation.


The function below trains the neural net.


Our function will be applied to our toy data with 2 hidden layers with 6 nodes each, a learning rate of 0.02 and a 50000 iterations.

plot of chunk unnamed-chunk-3


The model converged quite quickly given the output and achieves an accuracy of 100%. To test it out, this function will be applied to the iris data set. For demonstration I’m not going to split into training and test sets however when building models for production that is an important practice to avoid over fitting.

plot of chunk unnamed-chunk-4


The model achieves a very high level of accuracy for the iris data (on the training data at least). The plot of the MSE shows convergence after approximately 15000 iterations. While the error exhibits a decreasing trend, it bounces around until it finally converges. This could be an indication of the learning rate being too high.

Often with neural nets there is work tweaking the parameters e.g. number of hidden layers, size of the hidden layers, learning rate, etc, to achieve the best result. In this case this was not done, in large part due to the iris data set being a very easy and clean data set to test classification methods. It’s important to keep this in mind for more complex problems.

We’ll now compare our function with the ‘neuralnet’ package.


As our function did, ‘neuralnet’ achieves an accuracy of 100% on the toy data set.


And does equally well on the iris data set. One of the best things about the neural net package is the plot.

The function we wrote has it’s limitations, at this stage can only train a model with 2 hidden layers and can only uses the sigmoid activation function however once we understand the maths and are confident we can build one ourselves, we can then defer to more sophisticated functions such as ‘neuralnet’. It has the ability to change activation functions, the objective function and the model structure. In future posts I’ll look at using Tensorflow in R.

Follow me on social media: