Bayesian Modeling with Cross Entropy Optimized Decision Trees

Page content

NOTE: This article is only a draft, but I wanted to make sure it finally gets published in some form. Revisions and improvements (e.g. illustrations) to come.

In this article I will briefly summarize a research project I have been playing with off and on for about a year. My goal is to provide a workable introduction to my research project and a brief discussion of the theory and rationale.

This guide accompanies the public release of two of my personal research projects:

  1. Euclidean Modeler – This proof of concept uses a continuous, uniform apriori model, which is essentially identical to euclidean volume.
  2. Monte Carlo Modeler – This project uses a sampled model, with artificially (and strategically) provided apriori samples produced by recombination of existing data points.

Although I still believe this technology has some important and unique uses, I have thus far been unable to produce any results that demonstrate improvement (or even parity) over current best results. However, I have not yet conducted a methodical survey or comparison with other methods. I plan to produce such a report, but for now I wanted to publish what I have.

Decisions trees are a widely used technique in ML that perform particularly well in ensemble models (Random Forests). Decision trees typically resolve a particular dependent variable and produce rules that will predict the output variable’s expected value/range. In order to build these trees, the rules are chosen to optimize an entropy function. Optimization is generally done greedily one layer at a time, and seeks to optimize sum(x log x) where the sum is over each possible discrete value and x represents the portion of records which had that value of that parameter.

However, these conventional trees have many limitations. The output variable needs to be specified in advance, and generally only one discrete-valued parameter is chosen for the dependent variable. The output predictions do not have any valid certainty measure to express fuzzy outcomes / logic.

As a twist on decision trees, let’s say instead of predicting an output variable, let’s attempt to use the same structure to model the joint probability distribution of said events, including both dependent and independent variables. At this point let’s identify we really want to build a Bayesian model that adjusts a given apriori distribution to conform to sampled data. Now at each graph vertex, when we decide to create each rule, we can now consider two different quantities: apriori and aposteriori weights.

The trick now is, what entropy function should we use to optimize the created tree? When we consider that what we want to do is resolve the current apriori so the product distribution has the most information gained per point sampled from the apriori distribution, the choice is obvious. Our goal is to maximize the cross-entropy of the apriori and aposteriori distribution. Another way to think about it is to say that we want to create an encoding scheme that while encoding aposteriori points efficiently, will “reject” apriori points by encoding them at a low efficiency.

Now that we’ve chosen the objective function, and we have selected data to learn, we need to define our apriori model. There are two main ways to define an apriori model, and I will call them:

  1. Euclidean – This defines the apriori model as a (uniform) continuous density function to be integrated. For the typical uniform density function, apriori weights can be visualized as volumes, which works particularly well since aposteriori weight / volume = density
  2. Monte-Carlo – Points are generated by a generating function, which either draws points from a custom apriori function or recombines data points to produce apriori distributions which intrinsically carry joint distribution information between particular (and often deliberately chosen) variable sets. Working code examples can be found for both these approaches in the aforementioned github projects.

Early experiments showed that this method of modeling joint probability distributions worked on a wide variety of distributions, particularly ones rich in geometric features. I created two testing distributions to test the modeling performance. One distribution is based on a multivariate gaussian extruded over a spline path, creating long snake-like clusters. The next is an adaptation of the 2d logistic map into a 3d space. This distribution is rich in many types of detail and it is reproduced quite well (qualitatively speaking) by the model.

One technique to improve modeling is to introduce a coordinate transform at each tree node that is determined by PCA analysis of the points contained in a given node. The provided PCA coordinate system, when combined with the rule generating step, produces density functions that converge on embedded lower-dimensional manifolds much more effectively. This was especially useful in the euclidean modeler when modeling the 3d logistic map

Another technique, used in the monte carlo modeler, is to generate apriori points based on recombination from data points. The trick here is that by defining the recombination function, we can produce an apriori distribution identical in all respects to the apostori distribution except those respects we are interested in modeling. For example, when modeling the apriori distribution of the classic forest cover problem, we recombined all the independent parameters (from point A) with all the dependent (i.e. cover type) parameters from another point. The result is a distribution that has identical characteristics as the data except for the joint distribution features of each independent variable wrt the dependent variable.


At this point results are largely qualitative. The only well-known benchmark this technology has been applied to is the US Forest Cover dataset, and it only achieved ~70% accuracy where state of the art results seem to be a bit over 80%. However, while testing this with other datasets I have seen this method match the predictive accuracy of neural networks while also predicting certainty and handling bifurcated predictions.

Pros/Cons Vs Other Methods:

  1. Cannot currently match random forest accuracy for forest cover data set
  2. Is able to predict outcomes following complex distributions, such as bifurcated predictions
  3. Accounts for uncertainty using rigorous statistical methods
  4. Handles high numbers of dimensions poorly