Simulating data with Bayesian networks

Bayesian networks are useful for many applications, and one of those is simulating new data. A Bayes net represents the joint distribution of the data as a probabilistic graph, and from this structure it is straightforward to simulate new observations. This post demonstrates how to do this with bnlearn.

Fit a Bayesian network

Before simulating new data we need a model to simulate from. Using the same Australian Institute of Sport dataset from my previous post on Bayesian networks, we'll set up a simple model. For convenience I'll subset the data to six variables (a sketch of this step follows the list below).

The variables sex and sport are straightforward. The remaining four are

  • pcBfat – percent of body fat
  • hg – hemoglobin concentration
  • rcc – red cell count
  • hc – hematocrit percentage
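
As a rough sketch of this step (assuming the ais data frame from the DAAG package, which contains these columns), the subset could be built as follows:

    # load bnlearn and the Australian Institute of Sport data (assumed to be DAAG::ais)
    library(bnlearn)
    library(DAAG)

    # keep only the six variables used in the network
    dt <- ais[, c("sex", "sport", "pcBfat", "hg", "rcc", "hc")]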

I’ve allowed the data to learn the structure of the network, bar one arc, sport to percentage of body fat, which I set manually. The details are not shown here, but check out the post above on how to fit the structure algorithmically (I also suggest heading to the bnlearn documentation, which has great examples of a number of networks that can be downloaded). The structure is defined by a string and converted to a bn class object.
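
The exact structure string from the post isn't reproduced here, but a sketch consistent with the description (sex as the root, sport and hg as children of sex, the fixed arc from sport to pcBfat, and rcc as the terminal node) might look like this:

    # illustrative structure string -- the arcs here are an assumption,
    # not necessarily the ones learned in the post
    structure <- model2network("[sex][sport|sex][hg|sex][pcBfat|sex:sport][hc|hg:pcBfat][rcc|hc]")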

Now that we have set the structure of the model it is fit to the data with bn.fit using maximum likelihood estimation.
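
A minimal sketch of the fitting step, assuming the dt and structure objects from above:

    # fit the parameters of the network to the data; the default estimator
    # for each node type is maximum likelihood
    bn_mod <- bn.fit(structure, data = dt)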

The output is quite detailed, so it’s worth running bn_mod to view the conditional probability tables and Gaussian distributions.

Simulate data

New data is simulated from a Bayes net by first sampling from each of the root nodes, in this case sex, then from the children conditional on their parent(s) (e.g. sport | sex and hg | sex), and so on until data for all nodes has been drawn. The numbers on the nodes below indicate the sequence in which the data is simulated, noting that rcc is the terminal node.

From this point it’s easy to simulate new data using rbn. Here we simulate a dataset the same size as the original, but you can simulate as many rows as needed.
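
A sketch of the simulation step, assuming the fitted bn_mod object from above:

    # simulate a synthetic dataset the same size as the original
    set.seed(2020)  # arbitrary seed for reproducibility
    sim <- rbn(bn_mod, n = nrow(dt))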

Done. We now have a fully synthetic dataset which retains the properties of the original data. And it only took a few lines of code.

An important property of synthetic data generation is that real values are not used directly in the simulated records, meaning any predictors need to be simulated first (my post on the synthpop package explains this in more detail). A Bayesian network retains this property since the data is generated sequentially as per the structure of the network. Also, when using synthpop the order in which the variables are simulated needs to be set. The order can alter the accuracy of the simulated dataset, so it’s important to spend the time to get it right. For a Bayesian network the order is determined by the structure, so in effect this step is already done.

Compare original and simulated datasets

The original and simulated datasets are compared in three ways: 1) observing the distributions of the variables, 2) comparing the output from various models, and 3) comparing conditional probability queries. The third test is more of a sanity check. If the data is generated from the original Bayes net, then a new one fit on the simulated data should be approximately the same. The more rows we generate, the closer the parameters will be to the original values.

The variable distributions are very close to the original with only a small amount of variation, mostly observed in sport. Red cell count may show a slight bi-modal distribution, but for the most part it’s a good fit. This amount of variation is reasonable since there are only 202 simulated observations. Simulating more rows would give a closer fit, but there are often practical reasons for retaining the same size dataset.
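
The plots themselves aren't reproduced here, but one simple way to compare the distributions (assuming the dt and sim objects above) is to overlay densities for a variable, e.g. with ggplot2:

    library(ggplot2)

    # stack the original and simulated red cell counts and overlay their densities
    comparison <- rbind(
      data.frame(source = "original",  rcc = dt$rcc),
      data.frame(source = "simulated", rcc = sim$rcc)
    )

    ggplot(comparison, aes(x = rcc, fill = source)) +
      geom_density(alpha = 0.4)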

For the second check, two linear models are fit, one to the original data and one to the simulated data, to predict hematocrit levels with sex, hemoglobin concentration, percentage of body fat and red cell count as predictors. Sport was left out of the model since it was not a strong predictor of hc and only increased the error.
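
A sketch of the two models, assuming the dt and sim objects above:

    # same formula fit to the original and the simulated data
    lm_orig <- lm(hc ~ sex + hg + pcBfat + rcc, data = dt)
    lm_sim  <- lm(hc ~ sex + hg + pcBfat + rcc, data = sim)

    summary(lm_orig)
    summary(lm_sim)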

The coefficients and test statistics of the models are very similar, so both datasets lead to the same conclusions. Percent of body fat is the least accurate but still supports the same conclusion. In practice you should fit more models to assess the quality of the simulated data.

As mentioned, the third check is more of a sanity check, but it is also a good demonstration of the process. By fitting the same structure to the simulated data we expect to estimate the same parameters and calculate very similar conditional probabilities. Here we simulate 20,000 observations to better estimate the parameters, then query the conditional probability of an athlete's red cell count given the sport they compete in, i.e. what is the probability that the athlete's red cell count is greater than x, where x is the 33rd or 66th percentile?
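
A sketch of this check, assuming the objects above; the percentile cut-off and the sport level "Row" are just examples, not necessarily the values used in the post:

    # simulate a larger dataset and refit the same structure to it
    sim_large <- rbn(bn_mod, n = 20000)
    bn_mod_sim <- bn.fit(structure, data = sim_large)

    # 33rd percentile of red cell count in the original data
    q33 <- quantile(dt$rcc, 0.33)

    # P(rcc > q33 | sport == "Row") estimated from each fitted network
    cpquery(bn_mod,     event = (rcc > q33), evidence = (sport == "Row"))
    cpquery(bn_mod_sim, event = (rcc > q33), evidence = (sport == "Row"))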

The conditional probabilities from the simulated data are very close to the original as expected. Now we can be confident that our simulated data can be used as an alternative to the original.

Impute data

Another useful feature of Bayes nets is imputing missing values. This is easily done using impute. We’ll remove 25% of the observations from the variables hg and hc, and allow the Bayes net to impute them.
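
A sketch of knocking out the values, assuming the dt object from above:

    # copy the data and set 25% of hg and hc to missing at random
    dt_missing <- dt
    n_miss <- round(0.25 * nrow(dt))
    dt_missing$hg[sample(nrow(dt), n_miss)] <- NA
    dt_missing$hc[sample(nrow(dt), n_miss)] <- NA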

The table confirms there are 50 missing observations in the hemoglobin and hematocrit variables. Now impute using Bayesian likelihood weighting.
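
A sketch of the imputation step, assuming bn_mod and dt_missing from above:

    # impute the missing values using Bayesian likelihood weighting
    dt_imputed <- impute(bn_mod, data = dt_missing, method = "bayes-lw")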

Plotting the imputed against the true values shows the Bayes net imputed the missing values quite well.
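
A simple way to make that comparison (not necessarily the plot used in the post) is to plot imputed against true values for the rows that were set to missing:

    # compare imputed values with the held-out true values for hg
    idx <- which(is.na(dt_missing$hg))
    plot(dt$hg[idx], dt_imputed$hg[idx],
         xlab = "true hg", ylab = "imputed hg")
    abline(0, 1)  # points near this line indicate accurate imputation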

I’ve only tested and shown two variables, but the others would perform similarly for the subset of data I have chosen. This data is normally distributed so I expected it to work well; however, if your data has more complex relationships you’ll need to be more rigorous in defining the structure.

Code bits
