Unsupervised Random Forest Example

The need for unsupervised learning or clustering procedures crops up regularly for problems such as customer behavior segmentation, clustering of patients with similar symptoms for diagnosis, or anomaly detection. Unsupervised models are usually more challenging than supervised ones because interpreting the clusters comes back to strong subject matter knowledge and knowing your data. Profiling the clusters is arguably the most challenging aspect of the work. In a business context it can take weeks of iterating on the model and socialising the results with the business areas before there is broad agreement on what the clusters mean and how they can be used.

Kmeans, partitioning around medoids (PAM) and Gaussian mixture models are go-to methods for clustering, and I have had success with all of them. A technique I have not used before but am interested in is the unsupervised random forest (URF). This post goes into some of the detail and intuition behind URFs and works through an example on the iris data set comparing the different methods.

Supervised Random Forest

Everyone loves the random forest algorithm. It’s fast, it’s robust and it’s surprisingly accurate for many complex problems. To start off with we’ll fit a normal supervised random forest model. I’ll preface this with the point that a random forest isn’t really the best model for this data. A random forest builds a set of weak learners, each grown on a bootstrap sample with only a random subset of the features considered at each split. With only 4 features in this data set there is little to choose from: if 2 features are tried at each split, there are just 6 possible pairs, so the trees end up looking much alike. But let’s put that aside and push on, because we all know the iris data set and that makes learning the methods easier.

 

As expected, it does a pretty good job on the holdout sample.

Unsupervised Random Forest

In the unsupervised case we don’t have labels to train on. Instead, as with other clustering procedures, we need to find the underlying structure in the data. For an unsupervised random forest the setup is as follows.

  1. A joint distribution of the explanatory variables is constructed and draws are taken from this distribution to create synthetic data. In most cases the same number of draws as there are observations in the real data set will be taken.
  2. The real and synthetic data are combined. A label is then created, say 1 for the real data and 0 for the synthetic data.
  3. The random forest model then works in the same way as before, building a set of weak learners and determining whether observation i is real or synthetic.
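Steps 1 and 2 can be sketched in a few lines. This is a minimal illustration, not the code used for the results in this post. It assumes the synthetic rows are drawn by sampling each column independently from its observed values, which is one common way of building the reference distribution; the function names and the toy rows are made up for the example, and step 3 (fitting the forest) is left out.

```python
import random

def make_synthetic(real_rows, rng):
    """Draw one synthetic row per real row.

    Assumption: each column is sampled independently from its observed
    values, so the marginals are kept but the dependence between the
    columns is broken.
    """
    columns = list(zip(*real_rows))
    return [tuple(rng.choice(col) for col in columns) for _ in real_rows]

def build_training_set(real_rows, seed=0):
    """Stack real and synthetic rows and label them 1 (real) / 0 (synthetic)."""
    rng = random.Random(seed)
    synthetic_rows = make_synthetic(real_rows, rng)
    X = list(real_rows) + synthetic_rows
    y = [1] * len(real_rows) + [0] * len(synthetic_rows)
    return X, y

# Toy rows with four measurements each (illustrative values only).
real = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (6.3, 3.3, 6.0, 2.5),
    (5.8, 2.7, 5.1, 1.9),
]
X, y = build_training_set(real)
```

The classifier can only separate the two classes by picking up on the dependence between the variables, which is the structure we want the trees to encode.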

The key output we want is the proximity (or similarity) matrix; one minus the proximity gives the corresponding dissimilarity. The proximity matrix is an n × n matrix where each value is the proportion of trees in which observations i and j were in the same terminal node. For example, if 100 trees were fit and the (i, j)th entry is 0.9, it means that in 90 trees out of 100 observations i and j were in the same terminal node. With this matrix we can then perform a normal clustering procedure such as kmeans or PAM (a number of other cool things could be done once the proximity matrix is created).
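To make the definition concrete, here is a small sketch that computes the proximity matrix from the terminal node each observation lands in for each tree. The input layout (`leaf_ids[t][i]`) and the toy node ids are assumptions for illustration; in practice a random forest library would supply the terminal nodes or the proximity matrix directly.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[t][i] is the terminal node of observation i in tree t.

    Returns an n x n matrix whose (i, j) entry is the proportion of trees
    in which observations i and j share a terminal node.
    """
    n_trees = len(leaf_ids)
    n_obs = len(leaf_ids[0])
    counts = [[0] * n_obs for _ in range(n_obs)]
    for tree_leaves in leaf_ids:
        for i in range(n_obs):
            for j in range(n_obs):
                if tree_leaves[i] == tree_leaves[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]

# Hypothetical terminal nodes for 3 observations across 4 trees.
leaf_ids = [
    [0, 0, 1],
    [2, 2, 2],
    [5, 6, 6],
    [1, 1, 3],
]
prox = proximity_matrix(leaf_ids)

# Dissimilarity is one minus proximity.
dissim = [[1.0 - p for p in row] for row in prox]
```

Here observations 0 and 1 share a terminal node in three of the four trees, so their proximity is 0.75 and their dissimilarity is 0.25.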

 

Now that the unsupervised random forest model is fit, we’ll extract the proximity matrix and use it as input to a PAM procedure.
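As a rough illustration of clustering from a proximity matrix, the sketch below converts proximities to dissimilarities and finds the set of medoids that minimises the total dissimilarity of each point to its nearest medoid. That is the objective PAM targets, but the search here is brute force over all medoid sets rather than PAM’s build and swap steps, so it is only practical for tiny inputs. The function name and the toy proximity values are made up for the example.

```python
from itertools import combinations

def k_medoids_exhaustive(dissim, k):
    """Try every set of k medoids and keep the one with the lowest cost.

    Cost = sum over observations of the dissimilarity to the nearest medoid.
    Brute force, so only suitable for very small n.
    """
    n = len(dissim)
    best_cost, best_medoids = None, None
    for medoids in combinations(range(n), k):
        cost = sum(min(dissim[i][m] for m in medoids) for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_medoids = cost, medoids
    # Each observation is labelled by the index of its nearest medoid.
    labels = [min(best_medoids, key=lambda m: dissim[i][m]) for i in range(n)]
    return best_medoids, labels

# Toy proximity matrix: observations 0 and 1 are close, as are 2 and 3.
prox = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
dissim = [[1.0 - p for p in row] for row in prox]
medoids, labels = k_medoids_exhaustive(dissim, k=2)
```

On this toy input the two pairs of similar observations end up in separate clusters.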

[Figure: plot of the clusters from PAM on the random forest proximity matrix]

 

Only 18 observations were misclassified, which isn’t too bad for an unsupervised procedure. Strangely, there is one point in the Setosa bunch that was classified as Versicolor. Ordinarily this would not occur.

Comparison with straight kmeans and PAM

A standard kmeans and PAM procedure will be fit for comparison.

[Figure: plot of the kmeans clusters]

[Figure: plot of the PAM clusters]

 

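It’s fair to assume that the clustering procedures do pretty well, so when the clusters are tabulated against species, the largest count for each cluster is taken to be the correctly allocated observations and the rest are counted as incorrectly allocated.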

In these examples:

  • kmeans incorrectly allocates 25 observations,
  • PAM incorrectly allocates 23 observations, and
  • random forest incorrectly allocates 18 observations.
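The counting rule behind these numbers, taking the largest count in each cluster as the correct allocation, can be sketched as follows. The cluster ids and species labels below are toy values for illustration, not the iris results, and the function name is made up.

```python
from collections import Counter

def count_misallocated(clusters, species):
    """Count observations that are not in their cluster's majority species."""
    by_cluster = {}
    for c, s in zip(clusters, species):
        by_cluster.setdefault(c, Counter())[s] += 1
    # The largest count in each cluster is treated as correctly allocated.
    correct = sum(max(counts.values()) for counts in by_cluster.values())
    return len(species) - correct

# Toy example: one virginica observation falls into the versicolor cluster.
clusters = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
species = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 4
misallocated = count_misallocated(clusters, species)
```

Note that this rule is lenient: nothing stops two clusters from claiming the same species, so it is only reasonable when the clusters line up well with the labels.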

Inspecting the plots, the random forest model tends to do a little better at clustering the fringe Versicolor/Virginica observations around petal length 5. Even though the random forest procedure probably isn’t the best suited to this data set, with only 4 independent variables, it still does well. On a more complex data set with many independent variables I expect it to work very well.
