Improve Your Training Set with Unsupervised Learning

In my previous post, Advanced Survey Design and Application to Big Data, I mentioned that unsupervised learning can be used to generate a stratification variable. In this post I want to elaborate on that point and show how the two can work together to improve estimates and training data for predictive models.

SRS and stratified samples

Consider the estimators of the population total under a simple random sample (SRS) and a stratified sample.

    \[ \begin{array}{l l} \hat{T}_{\text{srs}} & =  \sum^n_{i = 1} {\frac{N}{n}} y_{i} \\ \hat{T}_{\text{str}} & =  \sum^H_{h = 1} \sum^{n_h}_{i \in S_h} {\frac{N_h}{n_h}} y_{ih} \end{array} \]
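These two estimators are simple to compute directly. Below is a minimal sketch in Python; the function names and toy numbers are my own for illustration.

```python
import numpy as np

def total_srs(y_sample, N):
    """SRS estimator of the total: each of the n sampled
    units is weighted by N / n."""
    n = len(y_sample)
    return (N / n) * np.sum(y_sample)

def total_str(samples, N_h):
    """Stratified estimator: weight stratum h's sample mean by N_h.
    `samples` is a list of arrays, one per stratum; `N_h` holds the
    stratum population sizes."""
    return sum(Nh * np.mean(y_h) for y_h, Nh in zip(samples, N_h))

# Toy example (made-up numbers): population of N = 100,
# strata of sizes 40 and 60.
y_srs = np.array([10.0, 12.0, 9.0, 11.0])
print(total_srs(y_srs, N=100))          # (100/4) * 42 = 1050.0

y_str = [np.array([10.0, 12.0]), np.array([20.0, 22.0])]
print(total_str(y_str, N_h=[40, 60]))   # 40*11 + 60*21 = 1700.0
```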

The variances of these estimators are given by

    \[ \begin{array}{l l} \text{Var} \left( \hat{T}_{\text{srs}} \right) & = \left( 1 - \frac{n}{N} \right) N^2 \frac{s^2}{n} \\ \text{Var} \left( \hat{T}_{\text{str}} \right) & = \sum^H_{h = 1}{ \left( 1 - \frac{n_h}{N_h} \right)} N^2_h \frac{s^2_h}{n_h} \end{array} \]
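The variance formulas translate directly to code. Here is a sketch, again with illustrative names and numbers of my own:

```python
import numpy as np

def var_total_srs(n, N, s2):
    """Variance of the SRS total: (1 - n/N) * N^2 * s^2 / n."""
    return (1 - n / N) * N**2 * s2 / n

def var_total_str(n_h, N_h, s2_h):
    """Variance of the stratified total, summed over strata h."""
    return sum((1 - nh / Nh) * Nh**2 * s2h / nh
               for nh, Nh, s2h in zip(n_h, N_h, s2_h))

# Toy check: with two equal strata whose within-stratum variance
# equals the overall variance, the two formulas coincide.
print(var_total_srs(10, 100, 4.0))                    # 0.9 * 10000 * 4/10 = 3600.0
print(var_total_str([5, 5], [50, 50], [4.0, 4.0]))    # 2 * 0.9 * 2500 * 4/5 = 3600.0
```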

The total sum of squares decomposes into two components: the between-strata and within-strata sums of squares.

    \[ \begin{array}{r l} SSB & = \sum^H_{h = 1} \sum^{N_h}_{i \in S_h}{\left( \bar{y}_h - \bar{y} \right)^2} = \sum^H_{h = 1}{N_h \left( \bar{y}_h - \bar{y} \right)^2} \\ SSW & = \sum^H_{h = 1} \sum^{N_h}_{i \in S_h}{\left( y_{ih} - \bar{y}_h \right)^2} = \sum^H_{h = 1} \left( N_h - 1 \right) s^2_h \\ SSTO & = SSB + SSW = \left( N - 1 \right) s^2 \end{array} \]

With some algebra (and assuming proportional allocation, \( n_h = n N_h / N \)) it can be shown that

    \[ \begin{array}{r l} \text{Var}\left( \hat{T}_{\text{srs}} \right) & = \left( 1 - \frac{n}{N} \right) N^2 \frac{s^2}{n} \\ & = \left( 1 - \frac{n}{N} \right) \frac{N^2}{n} \frac{SSTO}{N - 1} \\ & = \left( 1 - \frac{n}{N} \right) \frac{N^2}{n \left(N - 1 \right)} \left( SSB + SSW \right) \\ & = \text{Var} \left( \hat{T}_{\text{str}} \right) + \left( 1 - \frac{n}{N} \right) \frac{N}{n \left(N - 1 \right)} \left[ N \left( SSB \right) - \sum^H_{h = 1}{\left( N - N_h \right)s^2_h} \right] \\ & = \text{Var} \left( \hat{T}_{\text{str}} \right) + \left( 1 - \frac{n}{N} \right) \frac{N^2}{n \left(N - 1 \right)} \left[ SSB - \sum^H_{h = 1}{\left( 1 - \frac{N_h}{N} \right)s^2_h} \right] \\ \end{array} \]

This result shows that \( \text{Var} \left( \hat{T}_{\text{str}} \right) < \text{Var} \left( \hat{T}_{\text{srs}} \right) \) whenever \( SSB - \sum^H_{h = 1}{\left( 1 - \frac{N_h}{N} \right)s^2_h} > 0 \), and the larger SSB is, the more the stratified estimator improves on the SRS estimator.
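The decomposition can be verified numerically. The sketch below builds a made-up population of three strata, uses proportional allocation, and checks both the SSTO identity and the variance decomposition to floating-point tolerance (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up population: three strata with well-separated means, and
# proportional allocation so that n_h = n * N_h / N exactly.
N_h = np.array([40, 60, 100])
n_h = np.array([4, 6, 10])
N, n = N_h.sum(), n_h.sum()
strata = [rng.normal(mu, 1.0, size=Nh) for mu, Nh in zip([0, 5, 10], N_h)]
y = np.concatenate(strata)

ybar, s2 = y.mean(), y.var(ddof=1)
ybar_h = np.array([s.mean() for s in strata])
s2_h = np.array([s.var(ddof=1) for s in strata])

SSB = np.sum(N_h * (ybar_h - ybar)**2)
SSW = np.sum((N_h - 1) * s2_h)
assert np.isclose(SSB + SSW, (N - 1) * s2)   # SSTO = SSB + SSW

var_srs = (1 - n/N) * N**2 * s2 / n
var_str = np.sum((1 - n_h/N_h) * N_h**2 * s2_h / n_h)
gap = (1 - n/N) * N**2 / (n * (N - 1)) * (SSB - np.sum((1 - N_h/N) * s2_h))
assert np.isclose(var_srs, var_str + gap)    # the decomposition above
print(var_str < var_srs)                      # True: large SSB helps
```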

Unsupervised Learning

Unsupervised learning attempts to uncover hidden structure in the observed data by sorting the observations into a chosen number of clusters. The simplest such algorithm is k-means, which proceeds as follows:

  1. Choose K (the number of clusters)
  2. Choose K random points as the initial centers c_k
  3. Compute the distance between each point and each center
  4. Assign each observation to its nearest center
  5. Recompute the centers given the cluster allocation, c_k = \frac{1}{n_k} \sum^{n_k}_{i \in S_k}{x_i}, where S_k contains the points allocated to cluster k
  6. Compute the between and within sums of squares
  7. Repeat steps 3-6 until the clusters no longer change, a specified tolerance is met, or the maximum number of iterations is reached
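The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation (for instance, it does not handle clusters that become empty):

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-8, seed=0):
    """Plain k-means (Lloyd's algorithm) following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: K random data points as initial centers.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Steps 3-4: assign each point to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 5: recompute each center as the mean of its cluster.
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 7: stop once the centers stop moving.
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    # Step 6: within-cluster sum of squares for the final allocation.
    ssw = sum(((X[labels == k] - centers[k])**2).sum() for k in range(K))
    return labels, centers, ssw

# Toy usage: two well-separated 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers, ssw = kmeans(X, K=2)
```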

The algorithm minimises the within-cluster sum of squares and, because the total is fixed, equivalently maximises the between-cluster sum of squares.

    \[ \begin{array}{l l} SSW & = \sum^K_{k = 1} \sum^{n_k}_{i \in S_k} (x_i - c_k)^2 \\ SSB & = \sum^K_{k = 1} n_k(c_k - \bar{c})^2 \\ SSTO & = SSW + SSB \end{array} \]

As we saw from the formula above, the estimator under a stratified sample performs better than under an SRS when

    \[ SSB > \sum^H_{h=1} \left( 1- \frac{N_h}{N} \right) s^2_h \]

From here it's easy to see that if we construct a stratification variable which minimises SSW and maximises SSB, the estimator under the corresponding sample design will also perform better than one based on a less efficient variable. There may be practical reasons why this isn't possible and a natural stratification variable makes more sense, but there are many situations where using unsupervised learning to construct a stratification variable can improve the estimator, or the training set to be used for modelling. This isn't isolated to k-means: most clustering algorithms aim to do the same thing in different ways, and each has its benefits given the structure of the data. The idea also extends to more sophisticated sampling techniques and is not confined to simple stratified samples.
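As a hypothetical end-to-end sketch, suppose an auxiliary variable x reflects the same latent groups that drive the study variable y. Clustering x (here with an inline 1-D Lloyd's loop, quantile-initialised for stability) yields strata that satisfy the condition above; all variables and parameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population: y is driven by a latent group structure
# that an auxiliary variable x also reflects.
group = rng.integers(0, 3, size=3000)
x = group * 4.0 + rng.normal(0, 1.0, 3000)    # observed covariate
y = group * 10.0 + rng.normal(0, 2.0, 3000)   # study variable

# 1-D k-means on x: Lloyd's algorithm with quantile initialisation.
centers = np.quantile(x, [1/6, 3/6, 5/6])
for _ in range(50):
    labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([x[labels == k].mean() for k in range(3)])

# Check the stratification condition: SSB > sum_h (1 - N_h/N) s_h^2.
N = len(y)
N_h = np.array([(labels == k).sum() for k in range(3)])
ybar_h = np.array([y[labels == k].mean() for k in range(3)])
s2_h = np.array([y[labels == k].var(ddof=1) for k in range(3)])
SSB = np.sum(N_h * (ybar_h - y.mean())**2)
print(SSB > np.sum((1 - N_h / N) * s2_h))   # expected True: clusters track y
```

The clusters found on x act as strata for y, so the stratified estimator built on them beats the SRS estimator by the condition derived earlier.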
