State space modelling is a popular technique for forecasting and smoothing time series data. There are two main components which make up state space models, an observed data and the unobserved states,
The observed data are conditionally independent given the states. The transition probabilities of the states are defined as implying the Markov property of the probability of moving to the next state is only dependent on the previous state for all
. The probability of observing
at time
is given by
implying the observed data is conditional on the current state.
Dynamic linear models are a special case of state space models where the errors of the state and observed components are normally distributed. The formulation of these models is as follows
Where
is a
matrix
is a
matrix
is an
matrix
is a
matrix
The formulation as a set of equations is given by
where the errors are iid
Under this formulation there is huge flexibility and can specify a wide range of models such as high order polynomials, regression and arima models with seasonal components.
Time series analysis with the DLM package and the Kalman filter
The dlm package in r is fantastic. It is modular so you have the freedom to build models with multiple components for example you can specify a linear trend model with a quarterly seasonal component. In this example we will look at forecasting the co2 and air passenger data by specifying our own state space model. This will include applying a Kalman filter, Kalman smoothing, estimation of the parameters and finally forecasting the 6 years of observations.
We should all know the co2 data
data(co2)
suppressPackageStartupMessages(library(forecast))
suppressPackageStartupMessages(library(dlm))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(zoo))
suppressPackageStartupMessages(library(gridExtra))
# the co2 dataset
ggplot(data.frame(x=index(co2), y=co2), aes(x=x, y=y)) + geom_line()
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
It’s a classic data set with a very obvious increasing trend and cyclic component. There is 1 observation for each month, therefore we’re looking at a 12 order seasonal component. Data like this doesn’t usually happen in the real world! It’s often much noisier and unclear what the underlying pattern is. Let’s make this more interesting by adding in a bit of noise.
# noisy co2
co2 <- co2 + rnorm(length(co2), 0, 2)
ggplot(data.frame(x=index(co2), y=co2), aes(x=x, y=y)) + geom_line()
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
Much better! Firstly we’ll write a function to build the state space model.
# forecasting using state sapce models
model.build <- function(p) {
return(
dlmModPoly(2, dV=p[1], dW=p[2:3]) +
dlmModSeas(12, dV=p[4])
)
}
If we know the parameters we can plug them straight into the model, however that’s not usually the case. Instead we estimate them using MLE.
model.mle <- dlmMLE(co2, parm=c(0.1, 0, 1, 1), build=model.build)
if(model.mle$convergence==0) print("converged") else print("did not converge")
## [1] "did not converge"
model.mle$par
## [1] 9.036397e-01 0.000000e+00 9.934028e-06 1.803640e+00
It’s best practice to check the convergence of the MLE procedure. As with any MLE process we need to have appropriate initial starting points to ensure the algorithm will converge to the right maximum. The dlmMLE function calls the base R function optim so there is some flexibility in optimisation techniques if needed. These ML estimates are now passed to the model function to build the final parameterisation of the model.
model.fit <- model.build(model.mle$par)
Now we can apply a Kalman filter to the data using the dlmFilter function.
model.filtered <- dlmFilter(co2, model.fit)
It’s worth spending some time with this output to understand what has been calculated. There are 3 main output components:
is the state forecast for one time step and is calculated by
where
and
is the forecast step of the observations and is calculated by
where
and
is the posterior and is calculated by
and
The and
components have 13 columns where columns 3 to 13 are the seasonal components of the model. The first 2 columns are the linear trend components. This is convenient to know since we can easily plot the trend without the seasonal component if we wish. If we wish to plot the forecast observations we would use
. Just to convince ourselves that this is correct let’s calculate
from the output and check.
all(model.filtered$m %*% t(model.fit$GG) == model.filtered$a)
## TRUE
We can also add a smoother to the data as follows.
# kalman smoother
model.smoothed <- dlmSmooth(co2, model.fit)
The main output here is . Again there are 13 columns, the first 2 from the linear trend model and the last columns from the seasonal component. Plotting these together we can see the relationship and accuracy of the forecasting.
n <- 6*12
model.forecast <- dlmForecast(model.filtered, nAhead=n)
x <- index(co2)
xf <- seq(max(x), max(x)+n/12, 1/12)
aa <- model.forecast$a[,-1]*(-1)
aa <- cbind(model.forecast$a[,1], aa)
a <- drop(model.forecast$a%*%t(FF(model.fit)))
a <- c(tail(co2,1), a)
df <- rbind(
data.frame(x=x, y=as.numeric(co2), series="original"),
data.frame(x=x, y=apply(model.filtered$m[-1,1:2], 1, sum), series="filtered"),
data.frame(x=x, y=apply(model.smoothed$s[-1,1:2], 1, sum), series="smoothed"),
data.frame(x=xf, y=a, series="forecast")
)
g.dlm <- ggplot(subset(df, x>1970), aes(x=x, y=y, colour=series)) + geom_line()
g.dlm
As you can see this forecasts fairly well. The filtered and smoothed series are pretty similar. Here we have ignored the seasonal components and just plot the trend components to demonstrate it’s utility.
For reference the same data will be forecast using the a traditional ARIMA model.
# forecasting using arima
model <- auto.arima(co2)
model.forecast <- forecast(model, h = 6*12)
plot(model.forecast)
Interestingly it doesn’t pick up on the seasonal component like the state space model. We could formulate the ARIMA model manually to improve on this (usually auto.arima does a pretty good job though).
Air passenger data
The air passenger dataset is another classic. It has an obvious increasing trend and an annual seasonal cycle, therefore we can essentially use the SSM we build above, only re-estimating the parameters. We do need to apply a log transform to the data first to counter the heteroscedastic variance.
air <- AirPassengers
log.air <- log(air)
grid.arrange(
qplot(x=index(air), y=air, geom = "line"),
qplot(x=index(air), y=log.air, geom = "line"),
nrow=2
)
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous. ## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
Now applying the state space model as we did above for the co2 data. For this data we’ll forecast the last 2 years and compare with the observed data.
# forecasting using state sapce models
model.build <- function(p) {
return(
dlmModPoly(2, dV=p[1], dW=p[2:3]) +
dlmModSeas(12, dV=p[4])
)
}
#log.air <- log(air) + rnorm(length(log.air), 0, 0.15)
log.air <- log(air)
train <- log.air[1:120]
test <- log.air[121:144]
model.mle <- dlmMLE(train, parm=c(1, 1, 1, 1), build=model.build)
model.fit <- model.build(model.mle$par)
model.filtered <- dlmFilter(train, model.fit)
model.smoothed <- dlmSmooth(train, model.fit)
n <- 2*12
model.forecast <- dlmForecast(model.filtered, nAhead=n)
x <- index(log.air)
a <- drop(model.forecast$a%*%t(FF(model.fit)))
df <- rbind(
data.frame(x=index(log.air), y=as.numeric(log.air), series="original"),
data.frame(x=x[1:120], y=apply(model.filtered$m[-1,1:2], 1, sum), series="filtered"),
data.frame(x=x[1:120], y=apply(model.smoothed$s[-1,1:2], 1, sum), series="smoothed"),
data.frame(x=x[121:144], y=a, series="forecast")
)
g.dlm <- ggplot(subset(df, x>1950), aes(x=x, y=y, colour=series)) + geom_line()
g.dlm
The forecast series and the original observed series are very close. In this case the state space model and done very well in forecasting the time series.
Summary
In summary here we have
- Formulated state space models under the dynamic linear model framework where the errors are assumed to be normally distributed.
- Forecast a noisy co2 series with a seasonal and trend component and
- Forecast the air passenger data using the linear trend plus seasonal model.
The dlm package is very flexible to handle a number of different model specifications. What I really like about this package is that it’s not a black box, you are required to think about the best specification for your model and interpret the output in context of what you aim to achieve.
There is a lot more to explore with this package such as implementing a regression component, ARIMA model under a state space framework and MCMC approaches for parameter estimation. I aim to dive deeper in later posts.
Follow me on social media: