pomegranate: probabilistic modelling in python

author: Jacob Schreiber
contact: jmschr@cs.washington.edu

Gerolamo Cardano is one of the first Europeans to publish works in the field of probability theory. As a polymath, he was widely versed in many areas, and expressed his knowledge through a series of works. At the time, Gerolamo was using probability mostly as an edge in gambling in order to pay off the debts he wracked up by being an academic. He should have rebranded as a data scientist and collecting his 6 figure salary. Instead, he chose to publish many seminal works, including Liber de ludo aleae, which had an entire chapter devoted to how to cheat in gambling.

However, the field has progressed slightly outside of the field of gambling, and probabilistic modelling is now widely used in many fields. One of the reasons is because, like graduate students, these models carry their uncertainty around with them, allowing them to prroduce more robust estimates and make more informed decisions.

Probabilistic modelling is a natural way of thinking. Cam Davidson Pilon gives an example in his excellent book Bayesian Methods for Hackers:

You are a skilled programmer, but bugs still slip into your code. After a particularly difficult implementation of an algorithm, you decide to test your code on a trivial example. It passes. You test the code on a harder problem. It passes once again. And it passes the next, even more difficult, test too! You are starting to believe that there may be no bugs in this code...

This is a Bayesian way of thinking. As you acquire more evidence that there are no bugs in the code, your belief in the correctness of the code increases, but you're never completely positive. However, the presence of a single bug would tell you that the code cannot be bug free.

pomegranate

pomegranate initially started out as Yet Another Hidden Markov Model (yahmm), a library written by my friend Adam Novak during his rotation in the UCSC Nanopore Lab. He was disappointed in the lack of an easy installable hidden Markov model library for Python, and so, being the badass he was, wrote his own from scratch in order to pursue his rotation project.

I then adopted his project for my own work in the nanopore lab, but found that a pure python implementation of HMMs was not fast enough for the massive models I was running on gigabytes of data. I ended up rewriting it in Cython, and significantly expanding its feature set.

A few months after entering graduate school, I realized many of the components of yahmm could be reorganized to help me finish homeworks quicker. Since then, it has been expanded to include a wide variety of distributions, finite state machines, hidden Markov models, general mixture models, Bayesian networks, and factor graphs. More to come!

However, while Adam and I disagreed on many things, such as his belief that we needed to do "research" as a graduate student instead of writing open source software, we agreed fundamentally that the code we wrote needed to be fundamentally easy to use. Much like communication skills in general, if what you've done is uninterpretable to other people, it will never be used.

Installation is always as easy as pip install pomegranate (NOTE: Unless you're on a windows, in which case you need a 64 bit C compiler first).

Lets go through probabilistic models of increasing complexity, and how to use them in pomegranate.

In [98]:
%pylab inline
from pomegranate import *
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['norm', 'random', 'clf', 'i0', 'cov', 'log', 'exp']
`%matplotlib` prevents importing * from pylab and numpy

Distributions

The distribution is the simplest unit of probabilistic modelling. You are saying that all of your data are realizations of the same stationary underlying distribution. This is a good fit for many types of simple data where each instance is a fixed feature size and independant of all other instances. Here is a full list of the distributions currently supported:

  • UniformDistribution
  • NormalDistribution
  • LogNormalDistribution
  • ExponentialDistribution
  • BetaDistribution
  • GammaDistribution
  • DiscreteDistribution
  • LambdaDistribution
  • GaussianKernelDensity
  • UniformKernelDensity
  • TriangleKernelDensity
  • IndependentComponentsDistribution
  • MultivariateGaussianDistribution
  • ConditionalProbabilityTable
  • JointProbabilityTable

Normal Distribution

A very common distribution is the normal distribution which is parameterized by $\mu$ and $\sigma$. You can create it in the following way:

In [99]:
d = NormalDistribution( 0, 1 ) # The normal distribution

If we want to visualize this distribution, we can call the 'plot' command, and pass in the number of samples we want to draw. This is required only because more complex distributions, such as kernel densities, can be difficult to plot the exact densities of.

In [100]:
plt.figure( figsize=(10,6) )
d.plot( n=10000, edgecolor='c', facecolor='c', alpha=1, bins=25 )

Good, this looks like a normal distribution.

The most common task is to try to determine the probability of some points given the distribution, $P(D|M)$, where the model is just this distribution.

In [101]:
print d.log_probability(  2 )
print d.log_probability( -2 )
-2.9189385332
-2.9189385332

The log probability for both are the same, which makes sense since they're both the same distance from 0. Calculating log probabilities are more efficient than calculating probabilities, and usually lead to more efficient downstream algorithms, and so are the only option provided.

In [102]:
import timeit
from scipy.stats import norm

print "scipy norm: {}s".format( timeit.timeit( "norm.logpdf(2, 0, 1)", setup="from scipy.stats import norm", number=10000 ) )
print "pomegranate: {}s".format( timeit.timeit( "d.log_probability(2)", setup="from pomegranate import NormalDistribution; d = NormalDistribution(0, 1)", number=10000) )
print "logp difference: {}".format( norm.logpdf(2, 0, 1) - d.log_probability( 2 ) )
scipy norm: 0.705425977707s
pomegranate: 0.00138688087463s
logp difference: -3.99236199655e-13

The next most common operation is to try to train this model on some data. If we train a distribution, we ignore its current parameterization, and instead calculate the MLE estimates as to the parameters. Lets train the distribution on points drawn from $Norm(0.4, 0.9)$, and see if we can recover the underlying distribution. Since we're trying to stick to scikit-learns interface as closely as possible, we use the fit method.

In [103]:
data = np.random.randn(100) * 0.9 + 0.4
d.fit( data )

We can see that we recovered the distribution fairly well.

pomegranate does distribution fitting in summary statistics and cython, so it is significantly faster than other options. Lets see how long it takes to train a pomegranate distribution, versus using numpy.

In [104]:
%timeit data.mean(); data.std()
%timeit d.fit( data )
The slowest run took 4.74 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 29.8 µs per loop
The slowest run took 5.00 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 10.8 µs per loop

Looks like it's ~3x faster.

Since numpy uses summary statistics, these updates can be done out of core using summarize followed by from_summaries, and still get exact answers. summarize will summarize the data into summary statistics, and allow you to get more data, before finally updating the parameters of the distribution in the from_summaries method. Lets look at this in action.

In [105]:
for i in xrange(10):
    data = np.random.randn(10000) * 2.583 + 17.813
    d.summarize(data)

d.from_summaries()
print "Norm({:.5}, {:.5})".format(*d.parameters)
Norm(17.813, 2.5848)

Expectedly, we get close, but not exact, recovery of the parameters.

We can also serialize this into a JSON for human readability, or future storage after a computationally intensive training step.

In [106]:
print d.to_json()
{
    "frozen" : false,
    "class" : "Distribution",
    "parameters" : [
        17.813416418032517,
        2.5848352529257643
    ],
    "name" : "NormalDistribution"
}

Discrete Distribution

A second commonly used distribution is the discrete distribution, which contains a fixed set of keys, and their associated probabilities. In pomegranate, the keys can be ~any~ Python object. They are instantiated by passing in a dictionary with the associated keys and values.

In [107]:
d = DiscreteDistribution({'A': 0.1, 'C': 0.25, 'G': 0.50, 'T': 0.15})

All of the previous functions still work (except plot). We can calculate the log probability of points under this distribution, which is just a dictionary lookup.

In [108]:
print "P({}|M) = {:.3}".format( 'A', np.e ** d.log_probability( 'A' ) )
print "P({}|M) = {:.3}".format( 'G', np.e ** d.log_probability( 'G' ) )
print "P({}|M) = {:.3}".format( '?', np.e ** d.log_probability( '?' ) )
P(A|M) = 0.1
P(G|M) = 0.5
P(?|M) = 0.0

We can also train on lists of data.

In [109]:
d.fit( list('CAGCATCATCATCATAGCACCATAGAAAGATAAAAT') )
print d.parameters
[{'A': 0.47222222222222221, 'C': 0.22222222222222221, 'T': 0.19444444444444445, 'G': 0.1111111111111111}]

And we can also serialize it into a JSON.

In [110]:
print d.to_json()
{
    "frozen" : false,
    "class" : "Distribution",
    "parameters" : [
        {
            "A" : 0.47222222222222221,
            "C" : 0.22222222222222221,
            "T" : 0.19444444444444445,
            "G" : 0.1111111111111111
        }
    ],
    "name" : "DiscreteDistribution"
}

Kernel Densities

The above distributions were parametric, in that they were defined by a handful of parameters. Sometimes, this can lose the excentricities of the data. In contrast, nonparametric distributions require only data, not parameters. We can see this clearly below, where we have some points drawn from the unit norm distribution, and a few outliers.

In [111]:
data = np.concatenate( (np.random.randn(12), [2, 5, 9] ))

plt.figure( figsize=(10, 6))

d1 = NormalDistribution(0, 1)
d1.fit( data )
d1.plot( n=25000, edgecolor='c', facecolor='c', bins=50, alpha=0.3, label="Normal" )

d2 = GaussianKernelDensity(data)
d2.plot( n=25000, edgecolor='r', facecolor='r', bins=50, alpha=0.3, label="Gaussian Kernel Density" )
plt.legend()
Out[111]:
<matplotlib.legend.Legend at 0x7f3afb020810>

We can see that the Gaussian Kernel Densities can be especially useful in modelling outliers without having to come up with complicated distributions. A downside is that calculating log probabilities takes time proportional to the number of points in the kernel density, as opposed to constant time. It's much like kNN in that it takes $O(1)$ to train, and $O(n)$ to predict.

In [112]:
%timeit d1.log_probability(5)
%timeit d2.log_probability(5)
The slowest run took 119.96 times longer than the fastest. This could mean that an intermediate result is being cached 
10000000 loops, best of 3: 143 ns per loop
The slowest run took 10.87 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 636 ns per loop

Even with intermediate state caching, the Gaussian kernel density is ~5x slower on only 12 points. This is even after a recent update where I made them >30x faster than they used to be.

However, it is true that kernel densities do have a parameter, which is the bandwidth. For Gaussian kernel densities, it is similar to adjusting the variance in the Gaussian distribution. Lets take a look:

In [113]:
plt.figure( figsize=(10, 6))

d1 = NormalDistribution(0, 1)
d1.fit( data )
d1.plot( n=25000, edgecolor='c', facecolor='c', bins=50, alpha=0.3, label="Normal" )

d2 = GaussianKernelDensity(data, bandwidth=2)
d2.plot( n=25000, edgecolor='r', facecolor='r', bins=50, alpha=0.3, label="Gaussian Kernel Density" )
plt.legend()
Out[113]:
<matplotlib.legend.Legend at 0x7f3b011d5150>

We can clearly see now that the outliers are dragging the normal distribution to the right, but the kernel density remains unchanged. We can look at a slightly more extreme example to see this truly.

In [114]:
data = np.concatenate( (np.random.randn(20), [1.5, 2, 3, 5, 7, 9, 12, 17, 23, 30] ))

plt.figure( figsize=(10, 6))

d1 = NormalDistribution(0, 1)
d1.fit( data )
d1.plot( n=25000, edgecolor='c', facecolor='c', bins=50, alpha=0.3, label="Normal" )

d2 = GaussianKernelDensity(data, bandwidth=2)
d2.plot( n=25000, edgecolor='r', facecolor='r', bins=50, alpha=0.3, label="Gaussian Kernel Density" )
plt.legend()
Out[114]:
<matplotlib.legend.Legend at 0x7f3b0118a590>

This kernel density can thus capture heavy tailed distributions without having an unnecessarily skewed variance. Lastly, you can also assign weights to all of the points if you choose to. We can see what this looks like when we highly weight the outliers.

In [115]:
plt.figure( figsize=(10, 6))

d1 = NormalDistribution(0, 1)
d1.fit( data )
d1.plot( n=25000, edgecolor='c', facecolor='c', bins=50, alpha=0.3, label="Normal" )

d2 = GaussianKernelDensity(data, bandwidth=2, weights=[1]*20 + [1, 1, 1, 1, 1, 1, 1, 1, 1, 20])
d2.plot( n=25000, edgecolor='r', facecolor='r', bins=50, alpha=0.3, label="Gaussian Kernel Density" )
plt.legend()
Out[115]:
<matplotlib.legend.Legend at 0x7f3afb2330d0>

IndependentComponentsDistribution

Sometimes, a univariate distribution isn't good enough. Many times data comes in the form of multiple, independent, dimensions, each with their own distribution. We can model this with a multivariate distribution which contains a tuple of independent components.

A common example may be to look at data whose signal is normally distributed, and whose duration is modelled by an exponential.

In [116]:
d = IndependentComponentsDistribution([ NormalDistribution(10, 1), ExponentialDistribution(0.5) ])
print d.log_probability( (11, 3) )
-3.61208571376

We can also weight the contributions of each distribution. In this example, lets say that the signal is far more important than the duration.

In [117]:
d = IndependentComponentsDistribution([ NormalDistribution(10, 1), ExponentialDistribution(0.5) ], weights=[3, 1])
print d.log_probability( (11, 3) )
-2.5134734251

We see that the point becomes more probable, because we care more about the good fit to the signal dimension (the normal distribution) than it poorly fits the duration (exponential) distribution. You must be careful with the weightings though, because they don't have to sum to 1. In this sense, weights = [3,1] is not the same as weights = [0.75, 0.25].

We can do updates in the same manner. Drawing sampels from a Gaussian with mu = 12 and sigma = 2, and an exponential with mu = 5 (0.2 since the inverse mu is more prominently used to parameterize an exponential), we can estimate the underlying distributions with 1000 points.

In [118]:
data = numpy.zeros((1000, 2))
data[:,0] = np.random.randn(1000) * 2 + 12 
data[:,1] = np.random.exponential(5, 1000)
print data[:5]
d.fit(data)

norm = d.parameters[0][0].parameters
exp = d.parameters[0][1].parameters

print "Norm({:.5}, {:.5}), Exp({:.5})".format( norm[0], norm[1], exp[0] )
[[ 14.16270132   3.91009252]
 [ 11.4654779    3.62657347]
 [ 13.50441036   0.51138313]
 [ 11.36421671   4.43302188]
 [ 10.14263669   0.49746098]]
Norm(12.109, 1.9893), Exp(0.2002)

Multivariate Gaussian Distribution

It's not always good enough to have independent distributions, as signals can be heavily correlated. A common distribution for this is the multivariate gaussian distribution, utilizing a full covariance matrix. You provide a vector of mus, and a covariance matrix, and you're good to go!

In [119]:
d = MultivariateGaussianDistribution( np.arange(5) * 5, np.eye(5) )
print d.sample()
[ -0.36251825   5.7197841   10.03720788  14.95714953  20.55504366]

We can now train on data which has a diagonal covariance matrix, but whose mu's are now [0, 8, 16, 24, 32] instead of [0, 1, 2, 3, 4]. Lets see how well the model captures the underlying distribution.

In [120]:
data = np.random.randn(1000, 5) + np.arange(5) * 8

d.fit(data)
print "mu: [{:.5}, {:.5}, {:.5}, {:.5}, {:.5}]".format( *d.parameters[0] )
print "cov: \n {}".format( np.array(d.parameters[1]))
mu: [0.015146, 8.0062, 16.002, 23.987, 32.006]
cov: 
 [[ 0.91913188  0.03702049  0.02675236 -0.01291572  0.00245252]
 [ 0.03702049  0.9448676   0.01494961 -0.04437964 -0.04505308]
 [ 0.02675236  0.01494961  0.94968484  0.04456826 -0.05820884]
 [-0.01291572 -0.04437964  0.04456826  1.0064762   0.03046517]
 [ 0.00245252 -0.04505308 -0.05820884  0.03046517  0.99535008]]
In [121]:
print d.sample()
[  1.55308694   8.73921974  15.93548238  24.27235393  31.79424509]

pomegranate is also faster than numpy in calculating these, due to an efficient cython implementation for calculating the covariance directly, instead of doing vector-matrix-vector multiplication, which allocates intermediate arrays.

In [122]:
%timeit data.mean( axis=0 ), np.cov( data.T )
%timeit d.fit(data)
10000 loops, best of 3: 136 µs per loop
The slowest run took 24.98 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 85.3 µs per loop

Naive Bayes

Given these tools, it's pretty simple to set up a Naive Bayes classifier using any combination of distributions you'd like. While not currently built in, you can set up the code in the following way:

In [123]:
def naive_bayes_fit( x, y, distributions ):
    """
    Fit a Naive Bayes classifier to some data.
    """
    
    n = len(distributions)
    w = numpy.zeros( len(distributions))
    for i in xrange(n):
        data = x[y==i, 0]
        distributions[i].fit( data )
        w[i] = data.shape[0]
    
    return w
    
def naive_bayes_predict( x, distributions, weights ):
    """
    A Naive Bayes classifier. Calculate the normalized probability
    of the point under each distribution.
    """
    
    n = len(distributions)
    probs, total = numpy.zeros(n), 0
    for i in xrange(n):
        probs[i] = np.e ** distributions[i].log_probability(x) * weights[i]
        total += probs[i]
    
    return probs / total
In [124]:
from sklearn.naive_bayes import GaussianNB

x = numpy.array([np.concatenate( (np.random.randn(1000), np.random.randn(750) + 2) )]).T
y = np.concatenate( (np.zeros(1000), np.ones(750)) )

clf = GaussianNB()
ds = [ NormalDistribution(5, 1), NormalDistribution(7, 1) ]
In [125]:
%timeit naive_bayes_fit( x, y, ds )
%timeit clf.fit( x, y )
The slowest run took 6.51 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 67.6 µs per loop
1000 loops, best of 3: 513 µs per loop

Looks like pomegranate is faster at Naive Bayes too, but lets make sure we've come to the same distribution in the end.

In [126]:
w = naive_bayes_fit( x, y, ds )

print naive_bayes_predict( 1, ds, w )
print clf.predict_proba( 1 )
[ 0.57549787  0.42450213]
[[ 0.57549787  0.42450213]]

General Mixture Models

It is frequently the case that the data you have is not explained by a single underlying distribution. If we want to try to recover the underlying distributions, we need to have a model which has multiple components. An example is the following data, which is clearly two gaussian distributions, but with different parameters.

In [127]:
data = np.concatenate( (np.random.randn(1000) * 2.75 + 1.25, np.random.randn(2000) * 1.2 + 7.85) )

plt.figure( figsize=(10,6))
plt.hist( data, edgecolor='c', color='c', bins=20 )
plt.show()

We must start with an initial estimate of what this distribution is. In this case, we would likely initialize our General Mixture Model to be of two Gaussian distributions, one centered around 2 with a unit variance, and the other centered around 8 with a unit variance. It looks like there are more points in the distribution to the right, so maybe we want to start off with a higher weight on that distribution.

In [128]:
weights = np.array([0.33, 0.67])
d = GeneralMixtureModel( [NormalDistribution(2, 1), NormalDistribution(8, 1)], weights=weights )

Before we do any training, lets take a look at what the model predicts. All models have a fit and a predict method, in keeping with the scikit-learn API. In this case, we want to predict each point and see which component it falls under.

In [129]:
labels = d.predict( data )
print "{} 1 labels, {} 0 labels".format( labels.sum(), labels.shape[0] - labels.sum() )
2087 1 labels, 913 0 labels

We started off with 2000 1 labels and 1000 0 labels, so this is somewhat close. It's a decent first look. But we can go on to visualize this to get a better look.

In [130]:
plt.figure( figsize=(10,6) )
plt.hist( data[ labels == 0 ], edgecolor='r', color='r', bins=20 )
plt.hist( data[ labels == 1 ], edgecolor='c', color='c', bins=20 )
plt.show()

This looks okay. We see there is a clear linear boundary between the two. This makes sense, because if you have two Gaussians of any dimensionality, the best separator is just a hyperplane. This is one of the reasons that Naive Bayes and Logistic Regresson can work very well on Gaussian distributed data.

However, we are using probabilistic models here. We should take advantage of this, and predict the probability that each point belongs to each distribution, in order to get softer estimates.

In keeping with the scikit-learn API, if we want probabilistic measurements we use the predict_proba method, and columns become each component and rows become each sample.

In [131]:
labels = d.predict_proba( data )
print labels[:5]
print "{} 0 labels, {} 1 labels".format( *labels.sum(axis=0) )
[[  1.00000000e+00   3.70902661e-12]
 [  1.00000000e+00   2.54622788e-17]
 [  1.00000000e+00   1.50804294e-24]
 [  2.24944577e-06   9.99997751e-01]
 [  8.70542265e-01   1.29457735e-01]]
915.821794476 0 labels, 2084.17820552 1 labels

Looks like we get ~1% closer. This is not terribly significant, but useful when we do training, which uses the expectation-maximization (EM) algorithm. I won't go into the details here, but the gist is that we want to train this distribution in an unsupervised fashion, and so we iterate between expectation (calculating probabilities of each point given the data) and maximization (updating model parameters given these probabilities). We iterate between those two steps until convergence.

In [132]:
d.fit( data, verbose=True )
Improvement: 2124.51210034
Improvement: 6.43402970011
Improvement: 1.76558489033
Improvement: 0.989186194303
Improvement: 0.618068106967
Improvement: 0.397525575591
Improvement: 0.260513509817
Improvement: 0.173351275969
Improvement: 0.116822251575
Improvement: 0.0795574999711
Out[132]:
2135.3467393425626
In [133]:
labels = d.predict( data )

plt.figure( figsize=(10,6) )
plt.hist( data[ labels == 0 ], edgecolor='r', color='r', bins=20 )
plt.hist( data[ labels == 1 ], edgecolor='c', color='c', bins=20 )
plt.show()

print "Hard Classification"
print "{} 0 labels, {} 1 labels".format( labels.shape[0] - labels.sum(), labels.sum() )

print
print "Soft Classification"
labels = d.predict_proba( data )
print "{} 0 labels, {} 1 labels".format( *labels.sum(axis=0) )
Hard Classification
934 0 labels, 2066 1 labels

Soft Classification
984.271725243 0 labels, 2015.72827476 1 labels

Training the models exemplifies why sometimes we want to use soft classification. The difference between 981 and 944 is significant, particularly in fuzzy cases where it's not always clear what's going on. As a continued theme, it allows us to bring our uncertainty along with us, and make more informed decisions because of this.

Gaussian Mixture Models

However, a more common type of mixture model is the Gaussian Mixture Model, commonly confused with the General Mixture Model because they share similar names. However, the General Mixture Model can be a mixture of any number of distributions of heterogeneous types.

Lets take a look at a Gaussian Mixture Model using MultivariateGaussianDistributions.

In [134]:
from pomegranate import MultivariateGaussianDistribution as Gaussian

cov = [[0.5, -0.8], [2.4, 4.7]]
a = (np.random.randn(500, 2) + [1, -1.5]).dot( cov )
cov = [[-1.2, 0.8], [1.3, -2.3]]
b = (np.random.randn( 750, 2 ) + [2.5, 4]).dot( cov )

data = np.concatenate((a, b))

plt.figure( figsize=(10, 6))
plt.scatter( data[:,0], data[:,1], c='c', linewidth=0 )
plt.show()
In [135]:
cov = np.eye(2)
gmm = GeneralMixtureModel([ Gaussian([-4, -5], cov), Gaussian([2.0, -10], cov) ])
pred = gmm.predict_proba( data )

plt.figure( figsize=(10, 6) )

colors_a = np.zeros((data.shape[0], 4))
colors_a[:,0] = 1
colors_a[:,3] = pred[:,0]

plt.scatter( data[:,0], data[:,1], color=colors_a )

colors_b = np.zeros((data.shape[0], 4))
colors_b[:,2] = 1
colors_b[:,3] = pred[:,1]
plt.scatter( data[:,0], data[:,1], color=colors_b )
plt.show()
In [136]:
gmm.fit( data, verbose=True )
Improvement: 10202.4717651
Improvement: 166.670691887
Improvement: 395.144134551
Improvement: 390.329118546
Improvement: 81.3088095421
Improvement: 7.76178837896
Improvement: 1.40002008708
Improvement: 0.306573695679
Improvement: 0.06816011516
Out[136]:
11245.461061912698
In [137]:
plt.figure( figsize=(10, 6) )
pred = gmm.predict_proba( data )

colors_a = np.zeros((data.shape[0], 4))
colors_a[:,0] = 1
colors_a[:,3] = pred[:,0]

plt.scatter( data[:,0], data[:,1], color=colors_a )

colors_b = np.zeros((data.shape[0], 4))
colors_b[:,2] = 1
colors_b[:,3] = pred[:,1]
plt.scatter( data[:,0], data[:,1], color=colors_b )
plt.show()
In [138]:
import timeit 
print "scikit-learn: {}s".format( timeit.timeit( "d.fit(data)", setup="from sklearn.mixture import GMM; from __main__ import data; d=GMM(n_components=2, covariance_type='full')", number=100 ) )
print "pomegranate: {}s".format( timeit.timeit( "d.fit(data, stop_threshold=1e-3)", setup="from pomegranate import *; from __main__ import data; import numpy; cov = numpy.eye(2); d=GeneralMixtureModel([ MultivariateGaussianDistribution([-4, -5], cov), MultivariateGaussianDistribution([2.0, -10], cov) ])", number=100 ))
scikit-learn: 3.49997591972s
pomegranate: 1.15622901917s
<timeit-src>:2: SyntaxWarning: import * only allowed at module level

pomegranate is several times faster than scikit-learn in this case, though scikit-learn does offer more fully-featured GMMs (such as DPGMM and tied covariance matrices), and it is a bit easier to initialize a GMM.

Hidden Markov Models

Hidden Markov models (HMMs) are the flagship of the pomegranate package, in that most time is spent improving their implementation, and these improvements sometimes trickle down into the other algorithms. Lets delve into the features which pomegranate offers.

Hidden Markov models are a form of structured prediction method which extend general mixture models to sequences of data, where position in the sequence is relevant. If each point in this sequence is completely independent of the other points, then HMMs are not the right tools and GMMs (or more complicated Bayesian networks) may be a better tool.

The most common examples of HMMs come from bioinformatics and natural language processing. Since I am a bioinformatician, I will predominately use examples from bioinformatics.

Lets take the simplified example of CG island detection on a sequence of DNA. DNA is made up of the four canonical nucleotides, abbreviated 'A', 'C', 'G', and 'T'. Specific organizations of these nucleotides encode enough information to build you, a human being. One simple region in the genome is called the 'CG' island, where the nucleotides 'C' and 'G' are enriched. Lets compare the predictions of a GMM with the predictions of a HMM, to both understand conceptually the differences between the two, and to see how easy it is to use pomegranate.

In [139]:
seq = list('CGACTACTGACTACTCGCCGACGCGACTGCCGTCTATACTGCGCATACGGC')

d1 = DiscreteDistribution({'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25})
d2 = DiscreteDistribution({'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10})

# General Mixture Model Setup
gmm = GeneralMixtureModel( [d1, d2] )

# Hidden Markov Model Setup
s1 = State( d1, name='background' )
s2 = State( d2, name='CG island' )

hmm = HiddenMarkovModel('Island Finder')
hmm.add_states(s1, s2)
hmm.add_transition( hmm.start, s1, 0.5 )
hmm.add_transition( hmm.start, s2, 0.5 )
hmm.add_transition( s1, s1, 0.5 )
hmm.add_transition( s1, s2, 0.5 )
hmm.add_transition( s2, s1, 0.5 )
hmm.add_transition( s2, s2, 0.5 )
hmm.bake()

plt.figure( figsize=(10,6) )
hmm.draw()
In [140]:
gmm_predictions = gmm.predict( np.array(seq) )
hmm_predictions = hmm.predict( seq )

print "sequence: {}".format( ''.join( seq ) )
print "gmm pred: {}".format( ''.join( map( str, gmm_predictions ) ) )
print "hmm pred: {}".format( ''.join( map( str, hmm_predictions ) ) )
sequence: CGACTACTGACTACTCGCCGACGCGACTGCCGTCTATACTGCGCATACGGC
gmm pred: 110100101010010111110111101011110100001011110001111
hmm pred: 110100101010010111110111101011110100001011110001111

Note: The HMM and GMM predictions may be the inverse of each other, because HMM states undergo a topological sort in order to properly handle silent states (more later), which can change from the order they were inserted into the model.

At this point, a dense HMM with equal probabilities between each state is ~equivalent~ to a GMM. However, this framework gives us great flexibility to add prior knowledge through edge transition probabilities, whereas a GMM doesn't. If we look at the predictions, we see that it's bifurcating between "background" and "CG island" very quickly--in essence, calling every C or G a 'CG island'. This is not likely to be true. We know that CG islands have some As and Ts in them, and background sequence has Cs and Gs. We can change the transition probabilities to account for this, and prevent switching from occuring too rapidly.

In [141]:
hmm = HiddenMarkovModel()
hmm.add_states(s1, s2)
hmm.add_transition( hmm.start, s1, 0.5 )
hmm.add_transition( hmm.start, s2, 0.5 )
hmm.add_transition( s1, s1, 0.9 )
hmm.add_transition( s1, s2, 0.1 )
hmm.add_transition( s2, s1, 0.1 )
hmm.add_transition( s2, s2, 0.9 )
hmm.bake()

plt.figure( figsize=(10,6) )
hmm.draw()
hmm_predictions = hmm.predict( seq )

print "sequence: {}".format( ''.join( seq ) )
print "gmm pred: {}".format( ''.join( map( str, gmm_predictions ) ) )
print "hmm pred: {}".format( ''.join( map( str, hmm_predictions ) ) )
print
print "hmm state 0: {}".format( hmm.states[0].name )
print "hmm state 1: {}".format( hmm.states[1].name )
sequence: CGACTACTGACTACTCGCCGACGCGACTGCCGTCTATACTGCGCATACGGC
gmm pred: 110100101010010111110111101011110100001011110001111
hmm pred: 000000000000000111111111111111100000000000000001111

hmm state 0: background
hmm state 1: CG island

This seems far more reasonable. There is a single CG island surrounded by background sequence, and something at the end. If we knew that CG islands cannot occur at the end of sequences, we need only modify the underlying structure of the HMM in order to say that the sequence must end from the background state.

In [142]:
hmm = HiddenMarkovModel()
hmm.add_states(s1, s2)
hmm.add_transition( hmm.start, s1, 0.5 )
hmm.add_transition( hmm.start, s2, 0.5 )
hmm.add_transition( s1, s1, 0.89 )
hmm.add_transition( s1, s2, 0.10 )
hmm.add_transition( s1, hmm.end, 0.01 )
hmm.add_transition( s2, s1, 0.1 )
hmm.add_transition( s2, s2, 0.9 )
hmm.bake()

plt.figure( figsize=(10,6) )
hmm.draw()

hmm_predictions = hmm.predict( seq )

print "sequence: {}".format( ''.join( seq ) )
print "gmm pred: {}".format( ''.join( map( str, gmm_predictions ) ) )
print "hmm pred: {}".format( ''.join( map( str, hmm_predictions ) ) )
print
print "hmm state 0: {}".format( hmm.states[0].name )
print "hmm state 1: {}".format( hmm.states[1].name )
sequence: CGACTACTGACTACTCGCCGACGCGACTGCCGTCTATACTGCGCATACGGC
gmm pred: 110100101010010111110111101011110100001011110001111
hmm pred: 111111111111111000000000000000011111111111111111111

hmm state 0: CG island
hmm state 1: background

Looks like we managed to get rid of that pesky end (again, the numbers may have flipped, look at the indices). Modifying transition probabilities and using non-dense graphical structures are two major ways in which HMMs account for data in a sequence not being independent and identically distributed (i.i.d.). In fact, in most applications, the graphical structure of a HMM is very sparse.

If we want a more probabilistic view of what's going on, we can get the probability of each symbol in the sequence being in each of the states in the model easily. This is useful to get a soft estimate of classification, which allows us to include confidence as well as prediction. Values close to 50-50 get masked when you make hard classifications, but this uncertainty can be passed to future applications if you use soft assignments. Each row in the matrix is one symbol in the sequence, and the columns correspond to the two states identified above (CG island or background).

In [143]:
print hmm.predict_proba( seq )
[[ 0.4827054   0.5172946 ]
 [ 0.42204064  0.57795936]
 [ 0.28598775  0.71401225]
 [ 0.25756456  0.74243544]
 [ 0.16040038  0.83959962]
 [ 0.13871983  0.86128017]
 [ 0.17217854  0.82782146]
 [ 0.14750165  0.85249835]
 [ 0.18021186  0.81978814]
 [ 0.15446181  0.84553819]
 [ 0.18788044  0.81211956]
 [ 0.16252887  0.83747113]
 [ 0.19841088  0.80158912]
 [ 0.32919701  0.67080299]
 [ 0.38366073  0.61633927]
 [ 0.58044619  0.41955381]
 [ 0.69075524  0.30924476]
 [ 0.74653183  0.25346817]
 [ 0.76392808  0.23607192]
 [ 0.7479817   0.2520183 ]
 [ 0.69407484  0.30592516]
 [ 0.74761829  0.25238171]
 [ 0.76321595  0.23678405]
 [ 0.74538467  0.25461533]
 [ 0.68896078  0.31103922]
 [ 0.57760471  0.42239529]
 [ 0.58391467  0.41608533]
 [ 0.53953778  0.46046222]
 [ 0.6030831   0.3969169 ]
 [ 0.61516689  0.38483311]
 [ 0.57928847  0.42071153]
 [ 0.48505793  0.51494207]
 [ 0.30518744  0.69481256]
 [ 0.25379428  0.74620572]
 [ 0.12610747  0.87389253]
 [ 0.08105965  0.91894035]
 [ 0.07637934  0.92362066]
 [ 0.10767468  0.89232532]
 [ 0.20431225  0.79568775]
 [ 0.23273876  0.76726124]
 [ 0.35851961  0.64148039]
 [ 0.40985726  0.59014274]
 [ 0.40161837  0.59838163]
 [ 0.33141706  0.66858294]
 [ 0.17892403  0.82107597]
 [ 0.12850549  0.87149451]
 [ 0.13285026  0.86714974]
 [ 0.19603531  0.80396469]
 [ 0.19760431  0.80239569]
 [ 0.13801161  0.86198839]
 [ 0.          1.        ]]

There is a corresponding hmm.predict_log_proba method present if you want to get the log values. These are the emission probability values calculated by the forward backward algorithm, and can also be retrieved by calling hmm.forward_backward( seq ), which returns both the emission and the transition probability tables.

Lets take a look at these tables!

In [144]:
trans, ems = hmm.forward_backward( seq )
print trans
[[ 15.78100555   2.89559314   0.           0.        ]
 [  2.41288774  28.91051356   1.           0.        ]
 [  0.           0.           0.           0.        ]
 [  0.4827054    0.5172946    0.           0.        ]]

This is the transition table, which has the soft count of the number of transitions across an edge in the model given a single sequence. It is a square matrix of size equal to the number of states (including start and end state), with number of transitions from (row_id) to (column_id). This is exemplified by the 1.0 in the first row, indicating that there is one transition from background state to the end state, as that's the only way to reach the end state. However, the third (or fourth, depending on ordering) row is the transitions from the start state, and it only slightly favors the background state. These counts are not normalized to the length of the input sequence, but can easily be done so by dividing by row sums, column sums, or entire table sums, depending on your application.

A possible reason not to normalize is to run several sequences through and add up their tables, because normalizing in the end and extracting some domain knowledge. It is extremely useful in practice. For example, we can see that there is an expectation of 2.8956 transitions from CG island to background, and 2.4 from background to CG island. This could be used to infer that there are ~2-3 edges, which makes sense if you consider that the start and end of the sequence seem like they might be part of the CG island states except for the strict transition probabilities used (look at the first few rows of the emission table above.)

We've been using the forward backward algorithm and maximum a posteriori for decoding thus far, however maximum a posteriori decoding has the side effect that it is possible that it predicts impossible sequences in some edge cases. An alternative is Viterbi decoding, which at each step takes the most likely path, instead of sum of all paths, to produce hard assignments.

In [145]:
hmm_predictions = hmm.predict( seq, algorithm='viterbi' )[1:-1]

print "sequence: {}".format( ''.join( seq ) )
print "gmm pred: {}".format( ''.join( map( str, gmm_predictions ) ) )
print "hmm pred: {}".format( ''.join( map( str, hmm_predictions ) ) )
print
print "hmm state 0: {}".format( hmm.states[0].name )
print "hmm state 1: {}".format( hmm.states[1].name )
sequence: CGACTACTGACTACTCGCCGACGCGACTGCCGTCTATACTGCGCATACGGC
gmm pred: 110100101010010111110111101011110100001011110001111
hmm pred: 111111111111111111111111111111111111111111111111111

hmm state 0: CG island
hmm state 1: background

We see here a case in which it does not do too well. The Viterbi path can be more conservative in its transitions due to the hard assignments it makes. In essence, if multiple possibile paths are possible at a given point, it takes the most likely path, even if the sum of all other paths is greater than the sum of that path. In problems with a lower signal to noise ratio, this can mask the signal. As a side note, we can use the following to get the maximum a posteriori and Viterbi paths:

Sequence Alignment

Lets move on to a more complicated structure, that of a profile HMM. A profile HMM is used to align a sequence to a reference 'profile', where the reference profile can either be a single sequence, or an alignment of many sequences (such as a reference genome). In essence, this profile has a 'match' state for every position in the reference profile, and 'insert' state, and a 'delete' state. The insert state allows the external sequence to have an insertion into the sequence without throwing off the entire alignment, such as the following:

ACCG : Sequence
|| |
AC-G : Reference

or a deletion, which is the opposite:

A-G : Sequence
| |
ACG : Reference

The bars in the middle refer to a perfect match, whereas the lack of a bar means either a deletion/insertion, or a mismatch. A mismatch is where two positions are aligned together, but do not match. This models the biological phenomena of mutation, where one nucleotide can convert to another over time. It is usually more likely in biological sequences that this type of mutation occurs than that the nucleotide was deleted from the sequence (shifting all nucleotides over by one) and then another was inserted at the exact location (moving all nucleotides over again). Since we are using a probabilistic model, we get to define these probabilities through the use of distributions! If we want to model mismatches, we can just set our 'match' state to have an appropriate distribution with non-zero probabilities over mismatches.

Lets now create a three nucleotide profile HMM, which models the sequence 'ACT'. We will fuzz this a little bit in the match states, pretending to have some prior information about what mutations occur at each position. If you don't have any information, setting a uniform, small, value over the other values is usually okay.

In [146]:
model = HiddenMarkovModel( "Global Alignment")

# Define the distribution for insertions
i_d = DiscreteDistribution( { 'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25 } )

# Create the insert states
i0 = State( i_d, name="I0" )
i1 = State( i_d, name="I1" )
i2 = State( i_d, name="I2" )
i3 = State( i_d, name="I3" )

# Create the match states
m1 = State( DiscreteDistribution({ "A": 0.95, 'C': 0.01, 'G': 0.01, 'T': 0.02 }) , name="M1" )
m2 = State( DiscreteDistribution({ "A": 0.003, 'C': 0.99, 'G': 0.003, 'T': 0.004 }) , name="M2" )
m3 = State( DiscreteDistribution({ "A": 0.01, 'C': 0.01, 'G': 0.01, 'T': 0.97 }) , name="M3" )

# Create the delete states
d1 = State( None, name="D1" )
d2 = State( None, name="D2" )
d3 = State( None, name="D3" )

# Add all the states to the model
model.add_states( [i0, i1, i2, i3, m1, m2, m3, d1, d2, d3 ] )

# Create transitions from match states
model.add_transition( model.start, m1, 0.9 )
model.add_transition( model.start, i0, 0.1 )
model.add_transition( m1, m2, 0.9 )
model.add_transition( m1, i1, 0.05 )
model.add_transition( m1, d2, 0.05 )
model.add_transition( m2, m3, 0.9 )
model.add_transition( m2, i2, 0.05 )
model.add_transition( m2, d3, 0.05 )
model.add_transition( m3, model.end, 0.9 )
model.add_transition( m3, i3, 0.1 )

# Create transitions from insert states
model.add_transition( i0, i0, 0.70 )
model.add_transition( i0, d1, 0.15 )
model.add_transition( i0, m1, 0.15 )

model.add_transition( i1, i1, 0.70 )
model.add_transition( i1, d2, 0.15 )
model.add_transition( i1, m2, 0.15 )

model.add_transition( i2, i2, 0.70 )
model.add_transition( i2, d3, 0.15 )
model.add_transition( i2, m3, 0.15 )

model.add_transition( i3, i3, 0.85 )
model.add_transition( i3, model.end, 0.15 )

# Create transitions from delete states
model.add_transition( d1, d2, 0.15 )
model.add_transition( d1, i1, 0.15 )
model.add_transition( d1, m2, 0.70 ) 

model.add_transition( d2, d3, 0.15 )
model.add_transition( d2, i2, 0.15 )
model.add_transition( d2, m3, 0.70 )

model.add_transition( d3, i3, 0.30 )
model.add_transition( d3, model.end, 0.70 )

# Call bake to finalize the structure of the model.
model.bake()

plt.figure( figsize=(20, 16) )
model.draw()

for sequence in map( list, ('ACT', 'GGC', 'GAT', 'ACC') ):
    logp, path = model.viterbi( sequence )
    print "Sequence: '{}'  -- Log Probability: {} -- Path: {}".format(
        ''.join( sequence ), logp, " ".join( state.name for idx, state in path[1:-1] ) )
Sequence: 'ACT'  -- Log Probability: -0.513244900357 -- Path: M1 M2 M3
Sequence: 'GGC'  -- Log Probability: -11.0481012413 -- Path: I0 I0 D1 M2 D3
Sequence: 'GAT'  -- Log Probability: -9.12551967402 -- Path: I0 M1 D2 M3
Sequence: 'ACC'  -- Log Probability: -5.08795587886 -- Path: M1 M2 M3

The first and last sequence are entirely matches, meaning that it thinks the most likely alignment between the profile ACT and ACT is A-A, C-C, and T-T, which makes sense, and the most likely alignment between ACT and ACC is A-A, C-C, and T-C, which includes a mismatch. Essentially, it's more likely that there's a T-C mismatch at the end then that there was a deletion of a T at the end of the sequence, and a separate insertion of a C.

The two middle sequences don't match very well, as expected! G's are not very likely in this profile at all. It predicts that the two G's are inserts, and that the C matches the C in the profile, before hitting the delete state because it can't emit a T. The third sequence thinks that the G is an insert, as expected, and then aligns the A and T in the sequence to the A and T in the master sequence, missing the middle C in the profile.

By using deletes, we can handle other sequences which are shorter than three characters. Lets look at some more sequences of different lengths.

In [147]:
for sequence in map( list, ('A', 'GA', 'AC', 'AT', 'ATCC', 'ACGTG', 'ATTT', 'TACCCTC', 'TGTCAACACT') ):
    logp, path = model.viterbi( sequence )
    print "Sequence: '{}'  -- Log Probability: {} -- Path: {}".format(
        ''.join( sequence ), logp, " ".join( state.name for idx, state in path[1:-1] ) )
Sequence: 'A'  -- Log Probability: -5.40618101242 -- Path: M1 D2 D3
Sequence: 'GA'  -- Log Probability: -10.8868199358 -- Path: I0 M1 D2 D3
Sequence: 'AC'  -- Log Probability: -3.62447187905 -- Path: M1 M2 D3
Sequence: 'AT'  -- Log Probability: -3.64488075068 -- Path: M1 D2 M3
Sequence: 'ATCC'  -- Log Probability: -10.6743329646 -- Path: M1 D2 M3 I3 I3
Sequence: 'ACGTG'  -- Log Probability: -10.3938248352 -- Path: M1 M2 I2 I2 I2 D3
Sequence: 'ATTT'  -- Log Probability: -8.67126440175 -- Path: M1 I1 I1 D2 M3
Sequence: 'TACCCTC'  -- Log Probability: -16.9034517961 -- Path: I0 I0 I0 I0 D1 M2 M3 I3
Sequence: 'TGTCAACACT'  -- Log Probability: -16.4516996541 -- Path: I0 I0 I0 I0 I0 I0 I0 M1 M2 M3

Again, more of the same expected. You'll notice most of the use of insertion states are at I0, because most of the insertions are at the beginning of the sequence. It's more probable to simply stay in I0 at the beginning instead of go from I0 to D1 to I1, or going to another insert state along there. You'll see other insert states used when insertions occur in other places in the sequence, like 'ATTT' and 'ACGTG'. Now that we have the path, we need to convert it into an alignment, which is significantly more informative to look at.

In [148]:
def path_to_alignment( x, y, path ):
    """
    This function will take in two sequences, and the ML path which is their alignment,
    and insert dashes appropriately to make them appear aligned. This consists only of
    adding a dash to the model sequence for every insert in the path appropriately, and
    a dash in the observed sequence for every delete in the path appropriately.
    """
    
    for i, (index, state) in enumerate( path[1:-1] ):
        name = state.name
        
        if name.startswith( 'D' ):
            y = y[:i] + '-' + y[i:]
        elif name.startswith( 'I' ):
            x = x[:i] + '-' + x[i:]

    return x, y

for sequence in map( list, ('A', 'GA', 'AC', 'AT', 'ATCC', 'ACGTG', 'ATTT', 'TACCCTC', 'TGTCAACACT') ):
    logp, path = model.viterbi( sequence )
    x, y = path_to_alignment( 'ACT', ''.join(sequence), path )
    
    print "Sequence: {}, Log Probability: {}".format( ''.join(sequence), logp )
    print "{}\n{}".format( x, y )
    print
Sequence: A, Log Probability: -5.40618101242
ACT
A--

Sequence: GA, Log Probability: -10.8868199358
-ACT
GA--

Sequence: AC, Log Probability: -3.62447187905
ACT
AC-

Sequence: AT, Log Probability: -3.64488075068
ACT
A-T

Sequence: ATCC, Log Probability: -10.6743329646
ACT--
A-TCC

Sequence: ACGTG, Log Probability: -10.3938248352
AC---T
ACGTG-

Sequence: ATTT, Log Probability: -8.67126440175
A--CT
ATT-T

Sequence: TACCCTC, Log Probability: -16.9034517961
----ACT-
TACC-CTC

Sequence: TGTCAACACT, Log Probability: -16.4516996541
-------ACT
TGTCAACACT

There are two main algorithms for training hidden Markov models-- Baum Welch (structured version of Expectation Maximization), and Viterbi training. Since we don't start off with labels on the data, these are both unsupervised training algorithms. In order to assign labels, Baum Welch uses EM to assign soft labels (weights in this case) to each point belonging to each state, and then using weighted MLE estimates to update the distributions. Viterbi assigns hard labels to each observation using the Viterbi algorithm, and then updates the distributions based on these hard labels.

pomegranate is extremely well featured when it comes to regularization methods for training, supporting tied emissions and edges, edge and emission inertia, freezing nodes or edges, edge pseudocounts, and multithreaded training. See this tutorial for more information.

Bayesian Networks

Bayesian networks are a powerful probabilistic inference tool, in which a set of variables are represented as nodes, and the lack of an edge represents a conditional independence statement between them. Bayesian networks are powerful in their ability to infer the value of unknown variables given any combination of known variables. Obviously, as more is known about the system, the more accurate these inferences will be, but the same algorithm can handle very different amounts of information.

While Bayesian networks can have extremely complex emission probabilities, usually Gaussian or conditional Gaussian distributions, pomegranate currently supports only discrete Bayesian networks. In order to do inference, factor graph belief propogation is used.

If you didn't understand that, it's okay! Lets look at the simple Monty Hall example. The Monty Hall problem arose from the gameshow Let's Make a Deal, where a guest had to choose which one of three doors had a car behind it. The twist was that after the guest chose, the host, originally Monty Hall, would then open one of the doors the guest did not pick and ask if the guest wanted to switch which door they had picked. Initial inspection may lead you to believe that if there are only two doors left, there is a 50-50 chance of you picking the right one, and so there is no advantage one way or the other. However, it has been proven both through simulations and analytically that there is in fact a 66% chance of getting the prize if the guest switches their door, regardless of the door they initially went with.

We can reproduce this result using Bayesian networks with three nodes, one for the guest, one for the prize, and one for the door Monty chooses to open. The door the guest initially chooses and the door the prize is behind are completely random processes across the three doors, but the door which Monty opens is dependent on both the door the guest chooses (it cannot be the door the guest chooses), and the door the prize is behind (it cannot be the door with the prize behind it).

In [149]:
# The guests initial door selection is completely random
guest = DiscreteDistribution( { 'A': 1./3, 'B': 1./3, 'C': 1./3 } )

# The door the prize is behind is also completely random
prize = DiscreteDistribution( { 'A': 1./3, 'B': 1./3, 'C': 1./3 } )

    # Monty is dependent on both the guest and the prize. 
monty = ConditionalProbabilityTable(
        [[ 'A', 'A', 'A', 0.0 ],
         [ 'A', 'A', 'B', 0.5 ],
         [ 'A', 'A', 'C', 0.5 ],
         [ 'A', 'B', 'A', 0.0 ],
         [ 'A', 'B', 'B', 0.0 ],
         [ 'A', 'B', 'C', 1.0 ],
         [ 'A', 'C', 'A', 0.0 ],
         [ 'A', 'C', 'B', 1.0 ],
         [ 'A', 'C', 'C', 0.0 ],
         [ 'B', 'A', 'A', 0.0 ],
         [ 'B', 'A', 'B', 0.0 ],
         [ 'B', 'A', 'C', 1.0 ],
         [ 'B', 'B', 'A', 0.5 ],
         [ 'B', 'B', 'B', 0.0 ],
         [ 'B', 'B', 'C', 0.5 ],
         [ 'B', 'C', 'A', 1.0 ],
         [ 'B', 'C', 'B', 0.0 ],
         [ 'B', 'C', 'C', 0.0 ],
         [ 'C', 'A', 'A', 0.0 ],
         [ 'C', 'A', 'B', 1.0 ],
         [ 'C', 'A', 'C', 0.0 ],
         [ 'C', 'B', 'A', 1.0 ],
         [ 'C', 'B', 'B', 0.0 ],
         [ 'C', 'B', 'C', 0.0 ],
         [ 'C', 'C', 'A', 0.5 ],
         [ 'C', 'C', 'B', 0.5 ],
         [ 'C', 'C', 'C', 0.0 ]], [guest, prize] )  

# State objects hold both the distribution, and a high level name.
s1 = State( guest, name="guest" )
s2 = State( prize, name="prize" )
s3 = State( monty, name="monty" )

# Create the Bayesian network object with a useful name
network = BayesianNetwork( "Monty Hall Problem" )

# Add the three states to the network 
network.add_states( [ s1, s2, s3 ] )

# Add transitions which represent conditional dependencies, where the second node is conditionally dependent on the first node (Monty is dependent on both guest and prize)
network.add_transition( s1, s3 )
network.add_transition( s2, s3 )
network.bake()

pomegranate uses loopy belief propogation on the factor graph to calculate marginals, meaning that it is an inexact algorithm, but converges to the exact solution on Bayesian networks which have a tree structure. We can use the predict_proba method in order to ask questions for single data points. As a baseline, lets see what happens if we don't put in any information. This should give us the marginal of the graph, which is that everything is equally likely.

In [150]:
print network.predict_proba({})
[ {
    "frozen" : false,
    "class" : "Distribution",
    "parameters" : [
        {
            "A" : 0.33333333333333337,
            "C" : 0.33333333333333337,
            "B" : 0.33333333333333337
        }
    ],
    "name" : "DiscreteDistribution"
}
 {
    "frozen" : false,
    "class" : "Distribution",
    "parameters" : [
        {
            "A" : 0.33333333333333337,
            "C" : 0.33333333333333337,
            "B" : 0.33333333333333337
        }
    ],
    "name" : "DiscreteDistribution"
}
 {
    "frozen" : false,
    "class" : "Distribution",
    "parameters" : [
        {
            "A" : 0.33333333333333337,
            "C" : 0.33333333333333337,
            "B" : 0.33333333333333337
        }
    ],
    "name" : "DiscreteDistribution"
}]

Now lets do something different, and say that the guest has chosen door 'A'. We do this by passing a dictionary to predict_proba with key pairs consisting of the name of the state (in the state object), and the value which that variable has taken.

In [151]:
network.predict_proba({'guest': 'A'})
Out[151]:
array([ {
    "frozen" : false,
    "class" : "Distribution",
    "parameters" : [
        {
            "A" : 1.0,
            "C" : 0.0,
            "B" : 0.0
        }
    ],
    "name" : "DiscreteDistribution"
},
       {
    "frozen" : false,
    "class" : "Distribution",
    "parameters" : [
        {
            "A" : 0.33333333333333337,
            "C" : 0.33333333333333337,
            "B" : 0.33333333333333337
        }
    ],
    "name" : "DiscreteDistribution"
},
       {
    "frozen" : false,
    "class" : "Distribution",
    "parameters" : [
        {
            "A" : 0.0,
            "C" : 0.5,
            "B" : 0.5
        }
    ],
    "name" : "DiscreteDistribution"
}], dtype=object)

We can see that now Monty will not open door 'A', because the guest has chosen it. At the same time, the distribution over the prize has not changed, it is still equally likely that the prize is behind each door.

Now, lets say that Monty opens door 'C' and see what happens.

In [152]:
network.predict_proba({'guest': 'A', 'monty': 'C'})
Out[152]:
array([ {
    "frozen" : false,
    "class" : "Distribution",
    "parameters" : [
        {
            "A" : 1.0,
            "C" : 0.0,
            "B" : 0.0
        }
    ],
    "name" : "DiscreteDistribution"
},
       {
    "frozen" : false,
    "class" : "Distribution",
    "parameters" : [
        {
            "A" : 0.3333333333333334,
            "C" : 0.0,
            "B" : 0.6666666666666666
        }
    ],
    "name" : "DiscreteDistribution"
},
       {
    "frozen" : false,
    "class" : "Distribution",
    "parameters" : [
        {
            "A" : 0.0,
            "C" : 1.0,
            "B" : 0.0
        }
    ],
    "name" : "DiscreteDistribution"
}], dtype=object)