Forecasting Air Pollution Levels with State Space Models

Recently I was looking at a Kaggle dataset on air quality for 2017 and the relationships between the concentrations of contaminants and environmental variables such as wind speed and direction, temperature, humidity, time of day and time of year. The idea was to develop a state space model for forecasting the level of particulate matter in the atmosphere. Using this model we’ll know whether it is safe to go outside or best to shut the windows and wait for the smog to pass. I also used Bayesian methods to estimate the probability of the next day being over the threshold of air pollution standards.

Data cleaning

Firstly we’ll import the dataset and change the column names to something a little more usable.
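The original analysis is in R; here is a minimal pandas sketch of the same step, with hypothetical raw column names standing in for the Kaggle file’s (in practice you’d `pd.read_csv` the file rather than build the frame inline):

```python
import pandas as pd

# A tiny stand-in for the Kaggle CSV; these raw column names are
# hypothetical examples of the awkward originals.
raw = pd.DataFrame({
    "Date": ["01/02/2017"],
    "Wind Speed (m/s)": [3.1],
    "PM10 (ug/m^3)": [12.0],
})

# Map the awkward raw names to short, usable ones.
df = raw.rename(columns={
    "Date": "date",
    "Wind Speed (m/s)": "wind_speed",
    "PM10 (ug/m^3)": "pm10",
})
```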

On inspection there appear to be quite a few missing values. These need to be treated in a sensible way, otherwise we will end up with not-so-sensible results. We’ll use the MICE package for investigating and treating the missing values.

The “co” and “co8hr” columns are completely empty so these will be removed. The ozone and NO columns are also fairly patchy; these will be treated using multiple imputation, however we’ll look at the correlation structure of the full dataset first. If we look at the correlation structure after imputation there will be dependencies in the data. Given only 10% of the values are missing for ozone it isn’t likely to be a problem, but it’s good practice to be mindful of the structure of your data.
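As a sketch of this inspection step (Python/pandas here, where the post uses R; the values below are made up to mimic the pattern of fully empty and patchy columns):

```python
import numpy as np
import pandas as pd

# Stand-in frame: "co" and "co8hr" are completely empty, ozone is patchy.
df = pd.DataFrame({
    "pm10":  [10.0, 12.0, 11.0, 13.0],
    "ozone": [0.02, np.nan, 0.03, 0.04],
    "co":    [np.nan] * 4,
    "co8hr": [np.nan] * 4,
})

# Proportion of missing values per column.
missing = df.isna().mean()

# Drop the columns that are entirely empty, as with co / co8hr.
df = df.dropna(axis=1, how="all")
```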

plot of chunk correlation plot

The correlation structure in this data isn’t particularly strong; the strongest pairs are wind speed and sd1 at -0.63, and ozone and temperature at 0.62. We’re mostly interested in the correlations with \text{PM}_{10} and \text{PM}_{2.5}, which aren’t strong at all.

Using a multiple imputation technique we’ll treat the missing values.
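MICE works by chained regression imputation; a single hand-rolled regression-imputation pass in numpy gives the flavour, on toy data mimicking the ozone–temperature relationship (MICE iterates passes like this, cycling over variables and adding noise to preserve uncertainty):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the air-quality frame: ozone correlates with
# temperature, and ~10% of ozone readings are knocked out.
rng = np.random.default_rng(0)
temp = rng.normal(20.0, 5.0, 200)
ozone = 0.002 * temp + rng.normal(0.0, 0.005, 200)
df = pd.DataFrame({"temp": temp, "ozone": ozone})
df.loc[df.sample(frac=0.1, random_state=1).index, "ozone"] = np.nan

# One pass of regression imputation: fit ozone ~ temp on the observed
# rows, then fill the missing rows from the fitted line.
obs = df["ozone"].notna()
slope, intercept = np.polyfit(df.loc[obs, "temp"], df.loc[obs, "ozone"], 1)
df.loc[~obs, "ozone"] = intercept + slope * df.loc[~obs, "temp"]
```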

plot of chunk density plots

The plots above show the density of the imputed values (red) against the observed (blue). If the data is missing at random or missing completely at random, we would expect the densities to be similar. For the most part they are; however, wind direction and humidity differ a little from the observed. This may require further investigation to find a more suitable imputation method. The other variables seem ok. We’ll come back to address wind direction and humidity.

Fix the date and time into a POSIXct structure.
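In pandas terms (the post uses R’s POSIXct), with hypothetical separate date and time columns as in the raw file:

```python
import pandas as pd

# Hypothetical separate date and time columns, as in the raw file.
df = pd.DataFrame({
    "date": ["01/02/2017", "01/02/2017"],
    "time": ["01:00", "02:00"],
    "pm10": [12.0, 14.0],
})

# The pandas analogue of R's POSIXct: parse into a datetime index.
df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"],
                                format="%d/%m/%Y %H:%M")
df = df.set_index("datetime").drop(columns=["date", "time"])
```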

Data exploration

Firstly it is important to see how the variables are distributed. These will essentially be those plotted above but it will be good to see them in slightly more detail. The ozone 4 hr window variable will be dropped now that the finer level ozone (1 hr) has been imputed.

plot of chunk time series plot

We want to visualise this data as a time series and identify any interesting patterns over time. Given this data only extends over one year, seasonal trends can’t be picked up; however, there may be day-to-day cycles we can observe, such as temperature fluctuations. Since ozone and temperature showed the highest correlation I’ll start my search there.

plot of chunk time series plot 2

There is a lot of detail in these plots. To reduce the granularity and gain a better picture I could aggregate to the day level by taking the max or mean of the values for each day, or simply choose a smaller window to observe, such as a single month. The first option will eliminate much of the noise in the data, but it will also eliminate the daily cycle of temperatures increasing until midday and then decreasing. I expect to see this same pattern in the ozone data since the two variables show some positive correlation, albeit weak. Given the hottest day last year occurred in February, I’ll focus my attention there (as good a reason as any).

plot of chunk time series plot 3

When viewing an individual month it is easy to see the relationship between these two features. The colour scale indicates the time of day at a crude level where the darkest colour occurs at midnight and the lightest at midday. This all makes intuitive sense, as the sun goes down the temperatures drop.

Air pollution

A quick Google search returned that particulate matter 10 micrometers (\text{PM}_{10}) or less in diameter and 2.5 micrometers (\text{PM}_{2.5}) or less in diameter are standardised measures of air pollution which inform environmental policies; for example, the Australian government has set a national ambient air quality standard of 50\mu g/m^3 for \text{PM}_{10} averaged over a 24 hour period. A recent study has shown that an increase of 10 \mu g/m^3 on the previous day is associated with a 0.6% increase in the risk of mortality. The air quality index (AQI) is a standardised measure used to advise on the health risk of outdoor activities, calculated as

    \[ \text{AQI} = 100\,\frac{\text{PM}_{10}}{50} \]

This provides us with some good direction. We will build a state space model to forecast the \text{PM}_{10} levels for the next day and, for any given day, compute the probability of experiencing an average greater than 50\mu g/m^3 (or an AQI > 100) using Bayesian simulation.
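The AQI mapping above is a one-liner; a quick sketch, with the standard pinned at 50\mu g/m^3:

```python
def aqi_pm10(pm10):
    """Air quality index for PM10 against the 50 ug/m^3 standard:
    AQI = 100 * PM10 / 50, so AQI = 100 exactly at the standard."""
    return 100.0 * pm10 / 50.0

print(aqi_pm10(50.0))  # 100.0 -- right at the standard
```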

First of all, let’s look at the \text{PM}_{10} series for 2017 and for February.

plot of chunk time series plot 4

There are some large spikes scattered throughout the year which are greater than 50\mu g/m^3, although these values are recorded at the hour. To control some of these spikes it may be worth taking the square root transform, as there appears to be some skewness. There are also some negative values in the data which cause some problems. In this instance we’ll replace those values with the smallest non-zero value in the data (I expect these are incorrect readings anyway and should probably be imputed with something more reasonable).
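The cleaning step, sketched in pandas on a toy series (the real series is hourly for all of 2017):

```python
import numpy as np
import pandas as pd

# Toy hourly PM10 readings with a couple of (erroneous) negative values.
pm10 = pd.Series([12.0, -1.0, 30.0, 55.0, -0.5, 8.0])

# Replace non-positive readings with the smallest positive value,
# then square root the series to tame the right skew.
smallest_pos = pm10[pm10 > 0].min()
pm10_clean = pm10.where(pm10 > 0, smallest_pos)
pm10_sqrt = np.sqrt(pm10_clean)
```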

plot of chunk histograms

To be consistent with the air pollution standard, these need to be averaged over a 24 hour period. A 24 hr moving average will be applied to the series.
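In pandas this is a rolling mean; a small sketch on four days of toy hourly data:

```python
import numpy as np
import pandas as pd

# Four days of toy hourly readings standing in for the real series.
idx = pd.date_range("2017-02-01", periods=96, freq="h")
pm10 = pd.Series(np.random.default_rng(0).gamma(2.0, 8.0, 96), index=idx)

# Trailing 24 hr moving average, matching the standard's averaging
# period; the first 23 hours have no full window and come out missing.
pm10_24h = pm10.rolling(window=24).mean()
```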

plot of chunk time series plot 5

plot of chunk time series plot 5

Observing the series for all of 2017, it still appears quite noisy without any clear cyclic pattern, even with a 24 hour moving average applied. Only 1 point of the \text{PM}_{10} levels exceeds 50 (or about 7 on the square root scale). It would be interesting to find out what caused this spike; a bushfire/burn-off perhaps?

State space model

Given the data is quite noisy and doesn’t have any clear trend, a suitable model is a local linear model with a regression component using temperature, wind speed, wind direction and humidity. A Kalman filter will be applied to the series to smooth out the noise and to forecast the next day(s).
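The fitted model has a local linear trend plus regressors; as a stripped-down sketch of just the filtering mechanics, here is a hand-rolled Kalman filter for a plain local level model (no trend or regression component) on toy data:

```python
import numpy as np

def kalman_local_level(y, sigma_eps2, sigma_eta2, a0=0.0, p0=1e7):
    """Kalman filter for a local level model:
    y_t = mu_t + eps_t,   mu_{t+1} = mu_t + eta_t."""
    n = len(y)
    level = np.empty(n)   # filtered state means
    var = np.empty(n)     # filtered state variances
    a_pred, p_pred = a0, p0
    for t in range(n):
        # Update: weight the new observation by the Kalman gain.
        gain = p_pred / (p_pred + sigma_eps2)
        level[t] = a_pred + gain * (y[t] - a_pred)
        var[t] = (1.0 - gain) * p_pred
        # Predict: the state is a random walk, so the mean carries over
        # and the variance grows by the state noise.
        a_pred, p_pred = level[t], var[t] + sigma_eta2
    return level, var

# Toy series: a slowly drifting level observed with noise.
rng = np.random.default_rng(0)
mu = 5.0 + np.cumsum(rng.normal(0.0, 0.3, 200))  # latent level
y = mu + rng.normal(0.0, 1.0, 200)               # noisy observations
level, var = kalman_local_level(y, sigma_eps2=1.0, sigma_eta2=0.09)
```

The filtered level tracks the latent state more closely than the raw observations do, which is exactly the smoothing the post relies on.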

The raw data has a clear daily cycle, in which case a seasonal component would be suitable. Given the data points are sampled hourly, this would mean 24 seasonal factors to estimate, and that would take a while to run. For the time being the model will be fit without seasonal components. Since temperature has this cyclic component and is incorporated in the regression part of the model, the daily cycle still comes into play to some degree.

plot of chunk ssm output

The above graph is fit using the observations for February, forecasting the first 3 days of March. The observations look very similar to what you’d expect from a random walk, and this is essentially what the SSM picks up on. With systems like this there is a lot of uncertainty in the data and going back further in time won’t add much to the model; instead, the short-term observations drive most of the forecasting power.

plot of chunk ssm output 2

While it’s still higher than the observed, it is picking up on the features in the series a little more.

plot of chunk ssm output 3

Plotting a random time period shows that the model is definitely picking up on some features in the data. This appears to be doing quite well forecasting the next 3 days. While the first 1.5 days are fairly flat, it does pick up on the dip near the end of the 21st. This information is gained from the regression component of the model. A key feature it doesn’t pick up on is the large spike in the evening of the 22nd. This appears to be an outlier and would require a far more sophisticated model than the one here. Ordinarily outliers are removed or treated, but that largely depends on the context. Here we would want to ensure we are capturing the likelihood of observing such spikes, given we are talking about people’s health. This could be a separate model on its own.

If a 24 hr moving average is applied to the forecasted series I expect the forecasted and observed would be quite close. Feel free to take this function and plot for more time frames.

Bayesian Estimation

When there is a lot of uncertainty in complex systems, Bayesian approaches work quite well. A Bayesian framework can be applied to state space models; however, for demonstration a simple Bayesian regression model will be fit to the data using the MCMCpack package. The previous day’s environmental observations will be used to estimate the posterior distribution of \text{PM}_{10} levels for the following day. This enables the calculation of the probability that a level greater than 50\mu g/m^3 will be observed on the following day. For simplicity the dataset will be averaged to the daily level, so we don’t have to untangle the correlations introduced by taking a moving average. This is a good approximation though: once built, the observed moving-average values can be applied to the model to produce the predictive posterior distribution, as the variances should be approximately the same.

On the averaged dataset set the dependent variable to be at lag 1 i.e. one day ahead.
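The lag step, sketched with pandas on a hypothetical daily frame (the actual model fitting uses R’s MCMCpack):

```python
import numpy as np
import pandas as pd

# Daily-averaged toy frame standing in for the aggregated 2017 data.
rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "pm10": rng.gamma(2.0, 8.0, 30),
    "temp": rng.normal(20.0, 5.0, 30),
})

# Tomorrow's PM10 becomes the dependent variable: shift it back one
# day so each row pairs today's observations with tomorrow's level.
daily["pm10_next"] = daily["pm10"].shift(-1)
daily = daily.dropna(subset=["pm10_next"])
```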

The model formula will include temp, wind speed and humidity. Wind direction wasn’t as strong as the other 3. There are some interesting relationships between ozone and no2 however I want to keep this model reduced to simple environmental observations at this time.

plot of chunk fit bayesian model
plot of chunk fit bayesian model
plot of chunk fit bayesian model

All environmental variables are quite strong predictors. The posterior distributions of the parameters are now used to simulate the predictive posterior distribution of \text{PM}_{10} for the following day. To test this a day will be randomly chosen and treated as a new observation.

plot of chunk predictive posterior distbn

The predictive posterior distribution is N(\mu, \sigma^2) where \mu= 4.3877499 and \sigma= 0.5172252. The probability that we will observe \text{PM}_{10} levels greater than 50\mu g/m^3 given these observations is 0: out of 10,000 simulations, none exceeded the threshold, effectively saying it’s extremely unlikely. This isn’t too surprising; the data shows only one instance where the levels exceeded the threshold, and that could be an outlier.
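A sketch of that simulation; note I’m assuming the reported mean and standard deviation are on the square-root scale used earlier, so the threshold sits at \sqrt{50} \approx 7.07:

```python
import numpy as np

# Simulate from the reported predictive posterior N(4.3877, 0.5172^2),
# assumed to be on the square-root scale; the 50 ug/m^3 standard is
# therefore sqrt(50) ~ 7.07, over 5 standard deviations from the mean.
rng = np.random.default_rng(0)
draws = rng.normal(4.3877499, 0.5172252, 10_000)
p_exceed = float(np.mean(draws > np.sqrt(50.0)))
```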

For demonstration, let’s create “the perfect storm”.

plot of chunk a perfect storm

Under these values there is a 5.9% chance of observing levels greater than 50\mu g/m^3. I’m sure with more observations and a more complex model this probability would be higher; however, based on the 2017 data this is what we see. This Bayesian model could be improved by applying suitable priors to better reflect the prior beliefs about the system.


We have fit a basic state space model to forecast the next few days of \text{PM}_{10} levels from historical and environmental observations, and a Bayesian model to estimate the probability of observing \text{PM}_{10} > 50\mu g/m^3 on 24 hr averaged data. This is an initial, basic analysis of the data. Next steps include:

  • Applying to PM2.5
  • Fitting a seasonal component to the SSM
  • Incorporating more covariates into the SSM and Bayesian model
  • Adding suitable priors to the Bayesian model
  • Better imputation of missing values
  • Sourcing more historical data