author: Jacob Schreiber

contact: jmschr@cs.washington.edu

Gerolamo Cardano was one of the first Europeans to publish works in the field of probability theory. As a polymath, he was widely versed in many areas, and expressed his knowledge through a series of works. At the time, Gerolamo used probability mostly as an edge in gambling, in order to pay off the debts he racked up as an academic. He should have rebranded as a data scientist and collected his six figure salary. Instead, he chose to publish many seminal works, including *Liber de ludo aleae*, which devoted an entire chapter to how to cheat at gambling.

However, the field has since progressed well beyond gambling, and probabilistic modelling is now widely used in many fields. One reason is that, like graduate students, these models carry their uncertainty around with them, allowing them to produce more robust estimates and make more informed decisions.

Probabilistic modelling is a natural way of thinking. Cam Davidson Pilon gives an example in his excellent book *Bayesian Methods for Hackers*:

> You are a skilled programmer, but bugs still slip into your code. After a particularly difficult implementation of an algorithm, you decide to test your code on a trivial example. It passes. You test the code on a harder problem. It passes once again. And it passes the next, even more difficult, test too! You are starting to believe that there may be no bugs in this code...

This is a Bayesian way of thinking. As you acquire more evidence that there are no bugs in the code, your belief in the correctness of the code increases, but you're never completely positive. However, the presence of a single bug would tell you that the code cannot be bug free.
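This reasoning can be sketched numerically. Assume, purely for illustration, a prior belief of 0.2 that the code is bug-free, and that buggy code still passes any single test with probability 0.5; Bayes' rule then updates the belief after each passing test:

```python
# Assumed (hypothetical) numbers: prior belief of 0.2 that the code is
# bug-free; buggy code still passes any single test with probability 0.5;
# bug-free code always passes.
prior_bug_free = 0.2
p_pass_if_buggy = 0.5

belief = prior_bug_free
for test in range(3):
    # Bayes' rule: P(bug-free | pass) = P(pass | bug-free) * P(bug-free) / P(pass)
    p_pass = 1.0 * belief + p_pass_if_buggy * (1.0 - belief)
    belief = belief / p_pass
    print("after passing test {}: P(bug-free) = {:.3f}".format(test + 1, belief))

# belief climbs 1/3 -> 1/2 -> 2/3, but never reaches 1; a single failed
# test would send it straight to 0
```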

pomegranate initially started out as Yet Another Hidden Markov Model (yahmm), a library written by my friend Adam Novak during his rotation in the UCSC Nanopore Lab. He was disappointed by the lack of an easily installable hidden Markov model library for Python, and so, being the badass he was, wrote his own from scratch in order to pursue his rotation project.

I then adopted his project for my own work in the nanopore lab, but found that a pure python implementation of HMMs was not fast enough for the massive models I was running on gigabytes of data. I ended up rewriting it in Cython, and significantly expanding its feature set.

A few months after entering graduate school, I realized many of the components of yahmm could be reorganized to help me finish homework assignments more quickly. Since then, the package has been expanded to include a wide variety of distributions, finite state machines, hidden Markov models, general mixture models, Bayesian networks, and factor graphs. More to come!

However, while Adam and I disagreed on many things, such as his belief that we needed to do "research" as graduate students instead of writing open source software, we agreed that the code we wrote needed to be fundamentally easy to use. Much like communication in general, if what you've done is uninterpretable to other people, it will never be used.

Installation is always as easy as **pip install pomegranate** (NOTE: unless you're on Windows, in which case you need a 64-bit C compiler first).

Let's go through probabilistic models of increasing complexity, and see how to use each of them in pomegranate.

In [98]:

```
%pylab inline
from pomegranate import *
```

The distribution is the simplest unit of probabilistic modelling. Using one asserts that all of your data are realizations of the same stationary underlying distribution. This is a good fit for many types of simple data where each instance has a fixed number of features and is independent of all other instances. Here is a full list of the distributions currently supported:

- UniformDistribution
- NormalDistribution
- LogNormalDistribution
- ExponentialDistribution
- BetaDistribution
- GammaDistribution
- DiscreteDistribution
- LambdaDistribution
- GaussianKernelDensity
- UniformKernelDensity
- TriangleKernelDensity
- IndependentComponentsDistribution
- MultivariateGaussianDistribution
- ConditionalProbabilityTable
- JointProbabilityTable

A very common distribution is the normal distribution which is parameterized by $\mu$ and $\sigma$. You can create it in the following way:

In [99]:

```
d = NormalDistribution( 0, 1 ) # The normal distribution
```

In [100]:

```
plt.figure( figsize=(10,6) )
d.plot( n=10000, edgecolor='c', facecolor='c', alpha=1, bins=25 )
```

Good, this looks like a normal distribution.

The most common task is to try to determine the probability of some points given the distribution, $P(D|M)$, where the model is just this distribution.

In [101]:

```
print d.log_probability( 2 )
print d.log_probability( -2 )
```
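As a sanity check, the standard normal log-density has a simple closed form, $\log N(x; \mu, \sigma) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$, so the two values above can be verified by hand (a plain-Python sketch, independent of pomegranate):

```python
import math

def normal_logpdf(x, mu=0.0, sigma=1.0):
    # log N(x; mu, sigma) = -0.5 * log(2*pi*sigma^2) - (x - mu)^2 / (2*sigma^2)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

print(normal_logpdf(2))   # ~ -2.9189
print(normal_logpdf(-2))  # identical: the density is symmetric about the mean
```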

In [102]:

```
import timeit
from scipy.stats import norm
print "scipy norm: {}s".format( timeit.timeit( "norm.logpdf(2, 0, 1)", setup="from scipy.stats import norm", number=10000 ) )
print "pomegranate: {}s".format( timeit.timeit( "d.log_probability(2)", setup="from pomegranate import NormalDistribution; d = NormalDistribution(0, 1)", number=10000) )
print "logp difference: {}".format( norm.logpdf(2, 0, 1) - d.log_probability( 2 ) )
```

In [103]:

```
data = np.random.randn(100) * 0.9 + 0.4
d.fit( data )
```

We can see that we recovered the distribution fairly well.

pomegranate does distribution fitting using summary statistics and Cython, so it is significantly faster than many other options. Let's see how long it takes to train a pomegranate distribution versus using numpy.

In [104]:

```
%timeit data.mean(); data.std()
%timeit d.fit( data )
```

Looks like it's ~3x faster.

Since pomegranate uses summary statistics, these updates can be done out of core using `summarize` followed by `from_summaries`, and still give exact answers. `summarize` reduces a batch of data to its summary statistics, allowing you to keep feeding in more data, before finally updating the parameters of the distribution with the `from_summaries` method. Let's look at this in action.

In [105]:

```
for i in xrange(10):
    data = np.random.randn(10000) * 2.583 + 17.813
    d.summarize(data)

d.from_summaries()
print "Norm({:.5}, {:.5})".format(*d.parameters)
```

Expectedly, we get close, but not exact, recovery of the parameters.
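The same out-of-core idea can be sketched in plain Python: a normal distribution's sufficient statistics are just the count, the sum, and the sum of squares, so batches can be folded in one at a time and the exact MLE recovered at the end. (This is an illustrative sketch, not pomegranate's internal implementation.)

```python
import math
import random

class StreamingNormal(object):
    """Accumulates the sufficient statistics of a normal distribution:
    the count, the sum, and the sum of squares."""
    def __init__(self):
        self.n, self.s, self.ss = 0, 0.0, 0.0

    def summarize(self, batch):
        # fold one batch into the running statistics, then discard the batch
        self.n += len(batch)
        self.s += sum(batch)
        self.ss += sum(x * x for x in batch)

    def from_summaries(self):
        # exact MLE from the accumulated statistics (biased, like a plain .std())
        mu = self.s / self.n
        sigma = math.sqrt(self.ss / self.n - mu * mu)
        return mu, sigma

random.seed(0)
sn = StreamingNormal()
for _ in range(10):
    sn.summarize([random.gauss(17.813, 2.583) for _ in range(10000)])

mu, sigma = sn.from_summaries()
print("Norm({:.5}, {:.5})".format(mu, sigma))
```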

We can also serialize this into a JSON for human readability, or future storage after a computationally intensive training step.

In [106]:

```
print d.to_json()
```

A second commonly used distribution is the discrete distribution, which contains a fixed set of keys and their associated probabilities. In pomegranate, the keys can be *any* Python object. A discrete distribution is instantiated by passing in a dictionary of keys and their probabilities.

In [107]:

```
d = DiscreteDistribution({'A': 0.1, 'C': 0.25, 'G': 0.50, 'T': 0.15})
```

In [108]:

```
print "P({}|M) = {:.3}".format( 'A', np.e ** d.log_probability( 'A' ) )
print "P({}|M) = {:.3}".format( 'G', np.e ** d.log_probability( 'G' ) )
print "P({}|M) = {:.3}".format( '?', np.e ** d.log_probability( '?' ) )
```

We can also train on lists of data.

In [109]:

```
d.fit( list('CAGCATCATCATCATAGCACCATAGAAAGATAAAAT') )
print d.parameters
```

And we can also serialize it into a JSON.

In [110]:

```
print d.to_json()
```

The above distributions were parametric, in that they were defined by a handful of parameters. Sometimes this can smooth over the eccentricities of the data. In contrast, nonparametric distributions are defined directly by the data rather than by parameters. We can see this clearly below, where we have some points drawn from the unit normal distribution, plus a few outliers.

In [111]:

```
data = np.concatenate( (np.random.randn(12), [2, 5, 9] ))
plt.figure( figsize=(10, 6))
d1 = NormalDistribution(0, 1)
d1.fit( data )
d1.plot( n=25000, edgecolor='c', facecolor='c', bins=50, alpha=0.3, label="Normal" )
d2 = GaussianKernelDensity(data)
d2.plot( n=25000, edgecolor='r', facecolor='r', bins=50, alpha=0.3, label="Gaussian Kernel Density" )
plt.legend()
```

Out[111]:

In [112]:

```
%timeit d1.log_probability(5)
%timeit d2.log_probability(5)
```

Even with intermediate state caching, the Gaussian kernel density is ~5x slower on only 12 points, and that is after a recent update which made kernel densities >30x faster than they used to be.
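The slowdown is inherent: evaluating a Gaussian kernel density means summing one Gaussian bump per training point, so the cost grows linearly with the size of the dataset. A minimal sketch of the idea (not pomegranate's implementation):

```python
import math

def gaussian_kde_logpdf(x, points, bandwidth=1.0):
    # p(x) = (1/n) * sum_i N(x; x_i, bandwidth): one Gaussian bump is
    # evaluated per stored point, so the cost is linear in len(points)
    norm = 1.0 / (bandwidth * math.sqrt(2 * math.pi))
    density = sum(norm * math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                  for xi in points) / len(points)
    return math.log(density)

print(gaussian_kde_logpdf(0.0, [0.1, -0.4, 0.3, 5.0]))
```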

However, kernel densities do have one parameter: the bandwidth. For Gaussian kernel densities, adjusting it is similar to adjusting the variance of a Gaussian distribution. Let's take a look:

In [113]:

```
plt.figure( figsize=(10, 6))
d1 = NormalDistribution(0, 1)
d1.fit( data )
d1.plot( n=25000, edgecolor='c', facecolor='c', bins=50, alpha=0.3, label="Normal" )
d2 = GaussianKernelDensity(data, bandwidth=2)
d2.plot( n=25000, edgecolor='r', facecolor='r', bins=50, alpha=0.3, label="Gaussian Kernel Density" )
plt.legend()
```

Out[113]:

In [114]:

```
data = np.concatenate( (np.random.randn(20), [1.5, 2, 3, 5, 7, 9, 12, 17, 23, 30] ))
plt.figure( figsize=(10, 6))
d1 = NormalDistribution(0, 1)
d1.fit( data )
d1.plot( n=25000, edgecolor='c', facecolor='c', bins=50, alpha=0.3, label="Normal" )
d2 = GaussianKernelDensity(data, bandwidth=2)
d2.plot( n=25000, edgecolor='r', facecolor='r', bins=50, alpha=0.3, label="Gaussian Kernel Density" )
plt.legend()
```

Out[114]:

In [115]:

```
plt.figure( figsize=(10, 6))
d1 = NormalDistribution(0, 1)
d1.fit( data )
d1.plot( n=25000, edgecolor='c', facecolor='c', bins=50, alpha=0.3, label="Normal" )
d2 = GaussianKernelDensity(data, bandwidth=2, weights=[1]*20 + [1, 1, 1, 1, 1, 1, 1, 1, 1, 20])
d2.plot( n=25000, edgecolor='r', facecolor='r', bins=50, alpha=0.3, label="Gaussian Kernel Density" )
plt.legend()
```

Out[115]:

Sometimes a univariate distribution isn't good enough. Data often comes in the form of multiple independent dimensions, each with its own distribution. We can model this with a multivariate distribution which contains a tuple of independent components.

A common example is data whose signal is normally distributed and whose duration is modelled by an exponential.

In [116]:

```
d = IndependentComponentsDistribution([ NormalDistribution(10, 1), ExponentialDistribution(0.5) ])
print d.log_probability( (11, 3) )
```

In [117]:

```
d = IndependentComponentsDistribution([ NormalDistribution(10, 1), ExponentialDistribution(0.5) ], weights=[3, 1])
print d.log_probability( (11, 3) )
```

We see that the point becomes more probable, because we care more about the good fit in the signal dimension (the normal distribution) than about the poor fit in the duration (exponential) dimension. Be careful with the weightings, though, because they don't have to sum to 1; in this sense, weights=[3, 1] is not the same as weights=[0.75, 0.25].
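The arithmetic behind this is easy to sketch: the joint log-probability of independent dimensions is the sum of the per-dimension log-probabilities, and the weights scale each dimension's contribution. (I haven't verified this is exactly pomegranate's internal weighting scheme; the sketch, with made-up log-probabilities, just illustrates why unnormalized weights give different numbers.)

```python
def weighted_joint_logp(logps, weights):
    # joint log-probability of independent dimensions: the sum of the
    # per-dimension log-probabilities, each scaled by its weight
    return sum(w * lp for lp, w in zip(logps, weights))

logps = [-1.4, -2.9]  # hypothetical per-dimension log-probabilities

print(weighted_joint_logp(logps, [1, 1]))        # -4.3, the unweighted sum
print(weighted_joint_logp(logps, [3, 1]))        # -7.1
print(weighted_joint_logp(logps, [0.75, 0.25]))  # -1.775
```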

We can do updates in the same manner. Drawing samples from a Gaussian with mu = 12 and sigma = 2, and an exponential with mean 5 (rate 0.2, since the rate, the inverse of the mean, is the more common parameterization of an exponential), we can estimate the underlying distributions from 1000 points.

In [118]:

```
data = numpy.zeros((1000, 2))
data[:,0] = np.random.randn(1000) * 2 + 12
data[:,1] = np.random.exponential(5, 1000)
print data[:5]
d.fit(data)
norm = d.parameters[0][0].parameters
exp = d.parameters[0][1].parameters
print "Norm({:.5}, {:.5}), Exp({:.5})".format( norm[0], norm[1], exp[0] )
```

It's not always good enough to have independent dimensions, as signals can be heavily correlated. A common distribution for this case is the multivariate Gaussian distribution, which uses a full covariance matrix. You provide a vector of means and a covariance matrix, and you're good to go!

In [119]:

```
d = MultivariateGaussianDistribution( np.arange(5) * 5, np.eye(5) )
print d.sample()
```

In [120]:

```
data = np.random.randn(1000, 5) + np.arange(5) * 8
d.fit(data)
print "mu: [{:.5}, {:.5}, {:.5}, {:.5}, {:.5}]".format( *d.parameters[0] )
print "cov: \n {}".format( np.array(d.parameters[1]))
```

In [121]:

```
print d.sample()
```

In [122]:

```
%timeit data.mean( axis=0 ), np.cov( data.T )
%timeit d.fit(data)
```

Given these tools, it's pretty simple to set up a Naive Bayes classifier using any combination of distributions you'd like. While not currently built in, you can set it up in the following way:

In [123]:

```
def naive_bayes_fit( x, y, distributions ):
    """
    Fit a Naive Bayes classifier to some data.
    """
    n = len(distributions)
    w = numpy.zeros( len(distributions) )

    for i in xrange(n):
        data = x[y==i, 0]
        distributions[i].fit( data )
        w[i] = data.shape[0]

    return w

def naive_bayes_predict( x, distributions, weights ):
    """
    A Naive Bayes classifier. Calculate the normalized probability
    of the point under each distribution.
    """
    n = len(distributions)
    probs, total = numpy.zeros(n), 0

    for i in xrange(n):
        probs[i] = np.e ** distributions[i].log_probability(x) * weights[i]
        total += probs[i]

    return probs / total
```

In [124]:

```
from sklearn.naive_bayes import GaussianNB
x = numpy.array([np.concatenate( (np.random.randn(1000), np.random.randn(750) + 2) )]).T
y = np.concatenate( (np.zeros(1000), np.ones(750)) )
clf = GaussianNB()
ds = [ NormalDistribution(5, 1), NormalDistribution(7, 1) ]
```

In [125]:

```
%timeit naive_bayes_fit( x, y, ds )
%timeit clf.fit( x, y )
```

In [126]:

```
w = naive_bayes_fit( x, y, ds )
print naive_bayes_predict( 1, ds, w )
print clf.predict_proba( 1 )
```

It is frequently the case that your data are not explained by a single underlying distribution. If we want to recover the underlying distributions, we need a model with multiple components. An example is the following data, which is clearly two Gaussian distributions with different parameters.

In [127]:

```
data = np.concatenate( (np.random.randn(1000) * 2.75 + 1.25, np.random.randn(2000) * 1.2 + 7.85) )
plt.figure( figsize=(10,6))
plt.hist( data, edgecolor='c', color='c', bins=20 )
plt.show()
```

In [128]:

```
weights = np.array([0.33, 0.67])
d = GeneralMixtureModel( [NormalDistribution(2, 1), NormalDistribution(8, 1)], weights=weights )
```

GeneralMixtureModel has both a `fit` and a `predict` method, in keeping with the scikit-learn API. In this case, we want to predict each point and see which component it falls under.

In [129]:

```
labels = d.predict( data )
print "{} 1 labels, {} 0 labels".format( labels.sum(), labels.shape[0] - labels.sum() )
```

In [130]:

```
plt.figure( figsize=(10,6) )
plt.hist( data[ labels == 0 ], edgecolor='r', color='r', bins=20 )
plt.hist( data[ labels == 1 ], edgecolor='c', color='c', bins=20 )
plt.show()
```

This looks okay. We see there is a clear linear boundary between the two. This makes sense, because if you have two Gaussians of any dimensionality, the best separator is just a hyperplane. This is one of the reasons that Naive Bayes and Logistic Regression can work very well on Gaussian distributed data.

However, we are using **probabilistic** models here. We should take advantage of this, and predict the probability that each point belongs to each distribution, in order to get softer estimates.

In keeping with the scikit-learn API, if we want probabilistic measurements we use the `predict_proba` method, where the columns correspond to the components and the rows to the samples.
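The values returned are the posterior "responsibilities" from Bayes' rule: each component's weighted likelihood, normalized across the components. A minimal sketch for two univariate Gaussians (illustrative, not pomegranate's code):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def predict_proba(x, components, weights):
    # posterior responsibility of component k: w_k * p_k(x) / sum_j w_j * p_j(x)
    scores = [w * normal_pdf(x, mu, sigma)
              for (mu, sigma), w in zip(components, weights)]
    total = sum(scores)
    return [s / total for s in scores]

# a point equidistant from both means: the posterior falls back to the weights
print(predict_proba(5.0, [(2, 1), (8, 1)], [0.33, 0.67]))
```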

In [131]:

```
labels = d.predict_proba( data )
print labels[:5]
print "{} 0 labels, {} 1 labels".format( *labels.sum(axis=0) )
```

In [132]:

```
d.fit( data, verbose=True )
```

Out[132]:

In [133]:

```
labels = d.predict( data )
plt.figure( figsize=(10,6) )
plt.hist( data[ labels == 0 ], edgecolor='r', color='r', bins=20 )
plt.hist( data[ labels == 1 ], edgecolor='c', color='c', bins=20 )
plt.show()
print "Hard Classification"
print "{} 0 labels, {} 1 labels".format( labels.shape[0] - labels.sum(), labels.sum() )
print
print "Soft Classification"
labels = d.predict_proba( data )
print "{} 0 labels, {} 1 labels".format( *labels.sum(axis=0) )
```

Training the model exemplifies why we sometimes want to use soft classification. The difference between 981 and 944 is significant, particularly in fuzzy cases where it's not always clear what's going on. As a continued theme, soft assignments allow us to bring our uncertainty along with us and make more informed decisions because of it.

A more common type of mixture model, however, is the Gaussian Mixture Model, often conflated with the General Mixture Model because the two share an acronym. The General Mixture Model is more flexible: it can be a mixture of any number of distributions of heterogeneous types.

Let's take a look at a Gaussian mixture model built from MultivariateGaussianDistributions.

In [134]:

```
from pomegranate import MultivariateGaussianDistribution as Gaussian
cov = [[0.5, -0.8], [2.4, 4.7]]
a = (np.random.randn(500, 2) + [1, -1.5]).dot( cov )
cov = [[-1.2, 0.8], [1.3, -2.3]]
b = (np.random.randn( 750, 2 ) + [2.5, 4]).dot( cov )
data = np.concatenate((a, b))
plt.figure( figsize=(10, 6))
plt.scatter( data[:,0], data[:,1], c='c', linewidth=0 )
plt.show()
```

In [135]:

```
cov = np.eye(2)
gmm = GeneralMixtureModel([ Gaussian([-4, -5], cov), Gaussian([2.0, -10], cov) ])
pred = gmm.predict_proba( data )
plt.figure( figsize=(10, 6) )
colors_a = np.zeros((data.shape[0], 4))
colors_a[:,0] = 1
colors_a[:,3] = pred[:,0]
plt.scatter( data[:,0], data[:,1], color=colors_a )
colors_b = np.zeros((data.shape[0], 4))
colors_b[:,2] = 1
colors_b[:,3] = pred[:,1]
plt.scatter( data[:,0], data[:,1], color=colors_b )
plt.show()
```

In [136]:

```
gmm.fit( data, verbose=True )
```

Out[136]:

In [137]:

```
plt.figure( figsize=(10, 6) )
pred = gmm.predict_proba( data )
colors_a = np.zeros((data.shape[0], 4))
colors_a[:,0] = 1
colors_a[:,3] = pred[:,0]
plt.scatter( data[:,0], data[:,1], color=colors_a )
colors_b = np.zeros((data.shape[0], 4))
colors_b[:,2] = 1
colors_b[:,3] = pred[:,1]
plt.scatter( data[:,0], data[:,1], color=colors_b )
plt.show()
```

In [138]:

```
import timeit
print "scikit-learn: {}s".format( timeit.timeit( "d.fit(data)", setup="from sklearn.mixture import GMM; from __main__ import data; d=GMM(n_components=2, covariance_type='full')", number=100 ) )
print "pomegranate: {}s".format( timeit.timeit( "d.fit(data, stop_threshold=1e-3)", setup="from pomegranate import *; from __main__ import data; import numpy; cov = numpy.eye(2); d=GeneralMixtureModel([ MultivariateGaussianDistribution([-4, -5], cov), MultivariateGaussianDistribution([2.0, -10], cov) ])", number=100 ))
```

Hidden Markov models (HMMs) are the flagship of the pomegranate package, in that most development time is spent improving their implementation, and these improvements sometimes trickle down into the other models. Let's delve into the features which pomegranate offers.

Hidden Markov models are a structured prediction method which extends general mixture models to sequences of data, where the position in the sequence matters. If each point in the sequence is completely independent of the other points, then HMMs are not the right tool, and GMMs (or more complicated Bayesian networks) may be a better fit.

The most common examples of HMMs come from bioinformatics and natural language processing. Since I am a bioinformatician, I will predominantly use examples from bioinformatics.

Let's take the simplified example of CG island detection in a DNA sequence. DNA is made up of four canonical nucleotides, abbreviated 'A', 'C', 'G', and 'T'. Specific arrangements of these nucleotides encode enough information to build you, a human being. One simple type of region in the genome is the 'CG' island, where the nucleotides 'C' and 'G' are enriched. Let's compare the predictions of a GMM with those of a HMM, both to understand the conceptual differences between the two and to see how easy it is to use pomegranate.

In [139]:

```
seq = list('CGACTACTGACTACTCGCCGACGCGACTGCCGTCTATACTGCGCATACGGC')
d1 = DiscreteDistribution({'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25})
d2 = DiscreteDistribution({'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10})
# General Mixture Model Setup
gmm = GeneralMixtureModel( [d1, d2] )
# Hidden Markov Model Setup
s1 = State( d1, name='background' )
s2 = State( d2, name='CG island' )
hmm = HiddenMarkovModel('Island Finder')
hmm.add_states(s1, s2)
hmm.add_transition( hmm.start, s1, 0.5 )
hmm.add_transition( hmm.start, s2, 0.5 )
hmm.add_transition( s1, s1, 0.5 )
hmm.add_transition( s1, s2, 0.5 )
hmm.add_transition( s2, s1, 0.5 )
hmm.add_transition( s2, s2, 0.5 )
hmm.bake()
plt.figure( figsize=(10,6) )
hmm.draw()
```

In [140]:

```
gmm_predictions = gmm.predict( np.array(seq) )
hmm_predictions = hmm.predict( seq )
print "sequence: {}".format( ''.join( seq ) )
print "gmm pred: {}".format( ''.join( map( str, gmm_predictions ) ) )
print "hmm pred: {}".format( ''.join( map( str, hmm_predictions ) ) )
```

Note: The HMM and GMM predictions may be the inverse of each other, because HMM states undergo a topological sort (in order to properly handle silent states, more on that later), which can change their order from the one in which they were inserted into the model.

At this point, a dense HMM with equal transition probabilities between all states is *equivalent* to a GMM. However, this framework gives us great flexibility to add prior knowledge through the edge transition probabilities, whereas a GMM doesn't. If we look at the predictions, we see that the model bifurcates between "background" and "CG island" very quickly; in essence, it calls every C or G a 'CG island'. This is not likely to be true. We know that CG islands have some As and Ts in them, and that background sequence has some Cs and Gs. We can change the transition probabilities to account for this, and prevent the switching from occurring too rapidly.

In [141]:

```
hmm = HiddenMarkovModel()
hmm.add_states(s1, s2)
hmm.add_transition( hmm.start, s1, 0.5 )
hmm.add_transition( hmm.start, s2, 0.5 )
hmm.add_transition( s1, s1, 0.9 )
hmm.add_transition( s1, s2, 0.1 )
hmm.add_transition( s2, s1, 0.1 )
hmm.add_transition( s2, s2, 0.9 )
hmm.bake()
plt.figure( figsize=(10,6) )
hmm.draw()
hmm_predictions = hmm.predict( seq )
print "sequence: {}".format( ''.join( seq ) )
print "gmm pred: {}".format( ''.join( map( str, gmm_predictions ) ) )
print "hmm pred: {}".format( ''.join( map( str, hmm_predictions ) ) )
print
print "hmm state 0: {}".format( hmm.states[0].name )
print "hmm state 1: {}".format( hmm.states[1].name )
```

In [142]:

```
hmm = HiddenMarkovModel()
hmm.add_states(s1, s2)
hmm.add_transition( hmm.start, s1, 0.5 )
hmm.add_transition( hmm.start, s2, 0.5 )
hmm.add_transition( s1, s1, 0.89 )
hmm.add_transition( s1, s2, 0.10 )
hmm.add_transition( s1, hmm.end, 0.01 )
hmm.add_transition( s2, s1, 0.1 )
hmm.add_transition( s2, s2, 0.9 )
hmm.bake()
plt.figure( figsize=(10,6) )
hmm.draw()
hmm_predictions = hmm.predict( seq )
print "sequence: {}".format( ''.join( seq ) )
print "gmm pred: {}".format( ''.join( map( str, gmm_predictions ) ) )
print "hmm pred: {}".format( ''.join( map( str, hmm_predictions ) ) )
print
print "hmm state 0: {}".format( hmm.states[0].name )
print "hmm state 1: {}".format( hmm.states[1].name )
```

Looks like we managed to get rid of that pesky end (again, the numbers may have flipped, look at the indices). Modifying transition probabilities and using non-dense graphical structures are two major ways in which HMMs account for data in a sequence not being independent and identically distributed (i.i.d.). In fact, in most applications, the graphical structure of a HMM is very sparse.

If we want a more probabilistic view of what's going on, we can easily get the probability of each symbol in the sequence being emitted from each state in the model. This is useful for a soft estimate of classification, which includes confidence along with the prediction. Values close to 50-50 get masked when you make hard classifications, but this uncertainty can be passed to downstream applications if you use soft assignments. Each row in the matrix is one symbol in the sequence, and the columns correspond to the two states identified above (CG island or background).

In [143]:

```
print hmm.predict_proba( seq )
```

There is a corresponding `hmm.predict_log_proba` method if you want the log values. These are the emission probability values calculated by the forward backward algorithm, and they can also be retrieved by calling `hmm.forward_backward( seq )`, which returns both the emission and the transition probability tables.

Let's take a look at these tables!

In [144]:

```
trans, ems = hmm.forward_backward( seq )
print trans
```

This is the transition table, which holds the soft count of transitions across each edge in the model given a single sequence. It is a square matrix of size equal to the number of states (including the start and end states), where entry (row, column) is the expected number of transitions from the row state to the column state. This is exemplified by the 1.0 in the first row, indicating one transition from the background state to the end state, as that is the only way to reach the end state. The third (or fourth, depending on ordering) row contains the transitions from the start state, and it only slightly favors the background state. These counts are not normalized to the length of the input sequence, but can easily be normalized by dividing by row sums, column sums, or the table sum, depending on your application.

A possible reason not to normalize is so that you can run several sequences through, add up their tables, and normalize only at the end; extracting domain knowledge this way is extremely useful in practice. For example, we can see that there is an expectation of 2.8956 transitions from CG island to background, and 2.4 from background to CG island. This could be used to infer that there are ~2-3 transitions between the regions, which makes sense if you consider that the start and end of the sequence seem like they might be part of the CG island state except for the strict transition probabilities used (look at the first few rows of the emission table above).

We've been using the forward backward algorithm and maximum a posteriori decoding thus far; however, maximum a posteriori decoding can, in some edge cases, predict state sequences that are actually impossible under the model. An alternative is Viterbi decoding, which at each step takes the single most likely path, instead of the sum over all paths, to produce hard assignments.
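Viterbi decoding itself is simple to sketch: at each position keep, for every state, the single best-scoring path ending there, then trace back from the best final state. Below is an illustrative log-space implementation for a dense two-state model mirroring the CG island example (a sketch, not pomegranate's Cython implementation; it ignores silent states and explicit start/end states):

```python
import math

def viterbi(seq, states, log_start, log_trans, log_emit):
    # best[t][s] is the log-probability of the best path ending in state s
    # after emitting seq[:t+1]; back[t][s] remembers that path's predecessor
    best = [{s: log_start[s] + log_emit[s][seq[0]] for s in states}]
    back = []
    for symbol in seq[1:]:
        scores, ptrs = {}, {}
        for s in states:
            # take the max over predecessors (the forward algorithm would sum)
            prev, score = max(((p, best[-1][p] + log_trans[p][s]) for p in states),
                              key=lambda item: item[1])
            scores[s] = score + log_emit[s][symbol]
            ptrs[s] = prev
        best.append(scores)
        back.append(ptrs)
    # trace back from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    return path[::-1]

log = math.log
states = ('bg', 'cg')
log_start = {'bg': log(0.5), 'cg': log(0.5)}
log_trans = {'bg': {'bg': log(0.9), 'cg': log(0.1)},
             'cg': {'bg': log(0.1), 'cg': log(0.9)}}
log_emit = {'bg': {c: log(0.25) for c in 'ACGT'},
            'cg': {'A': log(0.1), 'C': log(0.4), 'G': log(0.4), 'T': log(0.1)}}

seq = 'CGCGCGAATA'
print(viterbi(seq, states, log_start, log_trans, log_emit))
```

On this toy sequence the C/G-rich prefix decodes to the island state and the A/T-rich suffix to background, with exactly one switch, because the sticky 0.9 self-transitions make extra switches more expensive than the emission probabilities can repay.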

In [145]:

```
hmm_predictions = hmm.predict( seq, algorithm='viterbi' )[1:-1]
print "sequence: {}".format( ''.join( seq ) )
print "gmm pred: {}".format( ''.join( map( str, gmm_predictions ) ) )
print "hmm pred: {}".format( ''.join( map( str, hmm_predictions ) ) )
print
print "hmm state 0: {}".format( hmm.states[0].name )
print "hmm state 1: {}".format( hmm.states[1].name )
```

Let's move on to a more complicated structure: the profile HMM. A profile HMM is used to align a sequence to a reference 'profile', where the reference can be either a single sequence or an alignment of many sequences (such as a reference genome). In essence, the profile has a 'match' state for every position in the reference, plus an 'insert' state and a 'delete' state for each position. The insert state allows the query sequence to contain an insertion relative to the reference without throwing off the entire alignment, such as the following:

```
ACCG : Sequence
|| |
AC-G : Reference
```

or a deletion, which is the opposite:

```
A-G : Sequence
| |
ACG : Reference
```

The bars in the middle indicate a perfect match, whereas the lack of a bar means either an insertion/deletion or a mismatch. A mismatch is where two positions are aligned together but do not match. This models the biological phenomenon of mutation, where one nucleotide can convert to another over time. In biological sequences it is usually more likely that this type of mutation occurred than that the nucleotide was deleted from the sequence (shifting all nucleotides over by one) and another was inserted at the exact same location (moving all nucleotides over again). Since we are using a probabilistic model, we get to define these probabilities through the use of distributions! If we want to model mismatches, we can just give our 'match' states distributions with non-zero probabilities over mismatches.

Let's now create a three nucleotide profile HMM which models the sequence 'ACT'. We will fuzz the match states a little bit, pretending to have some prior information about which mutations occur at each position. If you don't have any such information, setting a small uniform value over the other symbols is usually okay.

In [146]:

```
model = HiddenMarkovModel( "Global Alignment")
# Define the distribution for insertions
i_d = DiscreteDistribution( { 'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25 } )
# Create the insert states
i0 = State( i_d, name="I0" )
i1 = State( i_d, name="I1" )
i2 = State( i_d, name="I2" )
i3 = State( i_d, name="I3" )
# Create the match states
m1 = State( DiscreteDistribution({ "A": 0.95, 'C': 0.01, 'G': 0.01, 'T': 0.02 }) , name="M1" )
m2 = State( DiscreteDistribution({ "A": 0.003, 'C': 0.99, 'G': 0.003, 'T': 0.004 }) , name="M2" )
m3 = State( DiscreteDistribution({ "A": 0.01, 'C': 0.01, 'G': 0.01, 'T': 0.97 }) , name="M3" )
# Create the delete states
d1 = State( None, name="D1" )
d2 = State( None, name="D2" )
d3 = State( None, name="D3" )
# Add all the states to the model
model.add_states( [i0, i1, i2, i3, m1, m2, m3, d1, d2, d3 ] )
# Create transitions from match states
model.add_transition( model.start, m1, 0.9 )
model.add_transition( model.start, i0, 0.1 )
model.add_transition( m1, m2, 0.9 )
model.add_transition( m1, i1, 0.05 )
model.add_transition( m1, d2, 0.05 )
model.add_transition( m2, m3, 0.9 )
model.add_transition( m2, i2, 0.05 )
model.add_transition( m2, d3, 0.05 )
model.add_transition( m3, model.end, 0.9 )
model.add_transition( m3, i3, 0.1 )
# Create transitions from insert states
model.add_transition( i0, i0, 0.70 )
model.add_transition( i0, d1, 0.15 )
model.add_transition( i0, m1, 0.15 )
model.add_transition( i1, i1, 0.70 )
model.add_transition( i1, d2, 0.15 )
model.add_transition( i1, m2, 0.15 )
model.add_transition( i2, i2, 0.70 )
model.add_transition( i2, d3, 0.15 )
model.add_transition( i2, m3, 0.15 )
model.add_transition( i3, i3, 0.85 )
model.add_transition( i3, model.end, 0.15 )
# Create transitions from delete states
model.add_transition( d1, d2, 0.15 )
model.add_transition( d1, i1, 0.15 )
model.add_transition( d1, m2, 0.70 )
model.add_transition( d2, d3, 0.15 )
model.add_transition( d2, i2, 0.15 )
model.add_transition( d2, m3, 0.70 )
model.add_transition( d3, i3, 0.30 )
model.add_transition( d3, model.end, 0.70 )
# Call bake to finalize the structure of the model.
model.bake()
plt.figure( figsize=(20, 16) )
model.draw()
for sequence in map( list, ('ACT', 'GGC', 'GAT', 'ACC') ):
    logp, path = model.viterbi( sequence )
    print "Sequence: '{}' -- Log Probability: {} -- Path: {}".format(
        ''.join( sequence ), logp, " ".join( state.name for idx, state in path[1:-1] ) )
```

The first and last sequences are entirely matches, meaning that the model thinks the most likely alignment between the profile ACT and ACT is A-A, C-C, and T-T, which makes sense, and the most likely alignment between ACT and ACC is A-A, C-C, and T-C, which includes a mismatch. Essentially, it's more likely that there's a T-C mismatch at the end than that there was a deletion of the T at the end of the sequence and a separate insertion of a C.

The two middle sequences don't match very well, as expected! G's are not very likely in this profile at all. For the second sequence, the model predicts that the two G's are inserts and that the C matches the C in the profile, before passing through the delete state because it can't emit a T. For the third sequence, it treats the G as an insert, as expected, and then aligns the A and T in the sequence to the A and T in the profile, passing through a delete state for the missing middle C.

By using the delete states, we can also handle sequences which are shorter than three characters. Let's look at some more sequences of different lengths.

In [147]:

```
for sequence in map( list, ('A', 'GA', 'AC', 'AT', 'ATCC', 'ACGTG', 'ATTT', 'TACCCTC', 'TGTCAACACT') ):
    logp, path = model.viterbi( sequence )
    print "Sequence: '{}' -- Log Probability: {} -- Path: {}".format(
        ''.join( sequence ), logp, " ".join( state.name for idx, state in path[1:-1] ) )
```

In [148]:

```
def path_to_alignment( x, y, path ):
    """
    This function will take in two sequences, and the ML path which is their
    alignment, and insert dashes appropriately to make them appear aligned.
    This consists only of adding a dash to the model sequence for every insert
    in the path, and a dash in the observed sequence for every delete in the path.
    """
    for i, (index, state) in enumerate( path[1:-1] ):
        name = state.name
        if name.startswith( 'D' ):
            y = y[:i] + '-' + y[i:]
        elif name.startswith( 'I' ):
            x = x[:i] + '-' + x[i:]
    return x, y

for sequence in map( list, ('A', 'GA', 'AC', 'AT', 'ATCC', 'ACGTG', 'ATTT', 'TACCCTC', 'TGTCAACACT') ):
    logp, path = model.viterbi( sequence )
    x, y = path_to_alignment( 'ACT', ''.join(sequence), path )
    print "Sequence: {}, Log Probability: {}".format( ''.join(sequence), logp )
    print "{}\n{}".format( x, y )
    print
```

There are two main algorithms for training hidden Markov models: Baum-Welch (a structured version of Expectation-Maximization) and Viterbi training. Since we don't start off with labels on the data, these are both unsupervised training algorithms. In order to assign labels, Baum-Welch uses EM to assign soft labels (weights, in this case) marking how much each point belongs to each state, and then uses weighted MLE estimates to update the distributions. Viterbi training assigns hard labels to each observation using the Viterbi algorithm, and then updates the distributions based on these hard labels.
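To make the hard-label idea concrete, here is a minimal, self-contained sketch of one round of Viterbi training on a toy two-state coin HMM. Every number, state name, and the sequence here are invented for illustration, and this is plain Python rather than pomegranate's training method (which also re-estimates the transition probabilities).

```python
import math

# Toy two-state HMM over coin flips {'H', 'T'}.
states = ['fair', 'biased']
start = {'fair': 0.5, 'biased': 0.5}
trans = {'fair': {'fair': 0.9, 'biased': 0.1},
         'biased': {'fair': 0.1, 'biased': 0.9}}
emit = {'fair': {'H': 0.5, 'T': 0.5},
        'biased': {'H': 0.9, 'T': 0.1}}

def viterbi(seq):
    # Standard log-space Viterbi decode: most likely state path for seq.
    V = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for obs in seq[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            ptr[s] = prev
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][obs])
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

seq = list('HHHHHHHHTHTHTHTT')
labels = viterbi(seq)

# The hard-label MLE update: each observation counts fully toward the one
# state Viterbi assigned it to. Baum-Welch would instead weight each
# observation by its posterior probability of belonging to each state.
for s in states:
    assigned = [o for o, l in zip(seq, labels) if l == s]
    if assigned:
        emit[s] = {sym: assigned.count(sym) / len(assigned) for sym in 'HT'}
print(labels)
print(emit)
```

Running more rounds of decode-then-update until the labels stop changing is exactly Viterbi training; replacing the hard assignment with posterior weights turns it into Baum-Welch.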

pomegranate is extremely well featured when it comes to regularization methods for training, supporting tied emissions and edges, edge and emission inertia, freezing nodes or edges, edge pseudocounts, and multithreaded training. See this tutorial for more information.

Bayesian networks are a powerful probabilistic inference tool, in which a set of variables are represented as nodes, and the lack of an edge represents a conditional independence statement between them. Bayesian networks are powerful in their ability to infer the value of unknown variables given any combination of known variables. Naturally, the more that is known about the system, the more accurate these inferences will be, but the same algorithm can handle very different amounts of information.

While Bayesian networks can have extremely complex emission probabilities, such as Gaussian or conditional Gaussian distributions, pomegranate currently supports only discrete Bayesian networks. In order to do inference, belief propagation on a factor graph is used.

If you didn't understand that, it's okay! Let's look at the simple Monty Hall example. The Monty Hall problem arose from the game show Let's Make a Deal, where a guest had to choose which one of three doors had a car behind it. The twist was that after the guest chose, the host, originally Monty Hall, would open one of the doors the guest did not pick and ask if the guest wanted to switch doors. Initial inspection may lead you to believe that with only two doors left there is a 50-50 chance of picking the right one, and so no advantage either way. However, it has been shown both through simulation and analytically that there is in fact a 2/3 chance of winning the prize if the guest switches, regardless of which door they initially chose.
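Before setting up the network, we can verify the 2/3 figure with a quick Monte Carlo simulation. This is plain Python, independent of pomegranate; the `play` helper and the seed are just illustrative choices.

```python
import random

# Switching wins exactly when the guest's first pick was wrong,
# which happens 2/3 of the time.
def play(switch, rng):
    doors = ['A', 'B', 'C']
    prize = rng.choice(doors)
    guest = rng.choice(doors)
    # Monty opens a door that is neither the guest's pick nor the prize.
    monty = rng.choice([d for d in doors if d != guest and d != prize])
    if switch:
        guest = next(d for d in doors if d != guest and d != monty)
    return guest == prize

rng = random.Random(0)
n = 100000
wins_switch = sum(play(True, rng) for _ in range(n)) / n
wins_stay = sum(play(False, rng) for _ in range(n)) / n
print(wins_switch, wins_stay)  # close to 0.667 and 0.333
```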

We can reproduce this result using Bayesian networks with three nodes, one for the guest, one for the prize, and one for the door Monty chooses to open. The door the guest initially chooses and the door the prize is behind are completely random processes across the three doors, but the door which Monty opens is dependent on both the door the guest chooses (it cannot be the door the guest chooses), and the door the prize is behind (it cannot be the door with the prize behind it).

In [149]:

```
# The guests initial door selection is completely random
guest = DiscreteDistribution( { 'A': 1./3, 'B': 1./3, 'C': 1./3 } )
# The door the prize is behind is also completely random
prize = DiscreteDistribution( { 'A': 1./3, 'B': 1./3, 'C': 1./3 } )
# Monty is dependent on both the guest and the prize.
monty = ConditionalProbabilityTable(
    [[ 'A', 'A', 'A', 0.0 ],
     [ 'A', 'A', 'B', 0.5 ],
     [ 'A', 'A', 'C', 0.5 ],
     [ 'A', 'B', 'A', 0.0 ],
     [ 'A', 'B', 'B', 0.0 ],
     [ 'A', 'B', 'C', 1.0 ],
     [ 'A', 'C', 'A', 0.0 ],
     [ 'A', 'C', 'B', 1.0 ],
     [ 'A', 'C', 'C', 0.0 ],
     [ 'B', 'A', 'A', 0.0 ],
     [ 'B', 'A', 'B', 0.0 ],
     [ 'B', 'A', 'C', 1.0 ],
     [ 'B', 'B', 'A', 0.5 ],
     [ 'B', 'B', 'B', 0.0 ],
     [ 'B', 'B', 'C', 0.5 ],
     [ 'B', 'C', 'A', 1.0 ],
     [ 'B', 'C', 'B', 0.0 ],
     [ 'B', 'C', 'C', 0.0 ],
     [ 'C', 'A', 'A', 0.0 ],
     [ 'C', 'A', 'B', 1.0 ],
     [ 'C', 'A', 'C', 0.0 ],
     [ 'C', 'B', 'A', 1.0 ],
     [ 'C', 'B', 'B', 0.0 ],
     [ 'C', 'B', 'C', 0.0 ],
     [ 'C', 'C', 'A', 0.5 ],
     [ 'C', 'C', 'B', 0.5 ],
     [ 'C', 'C', 'C', 0.0 ]], [guest, prize] )
# State objects hold both the distribution, and a high level name.
s1 = State( guest, name="guest" )
s2 = State( prize, name="prize" )
s3 = State( monty, name="monty" )
# Create the Bayesian network object with a useful name
network = BayesianNetwork( "Monty Hall Problem" )
# Add the three states to the network
network.add_states( [ s1, s2, s3 ] )
# Add transitions which represent conditional dependencies, where the second
# node is conditionally dependent on the first (Monty depends on both guest and prize)
network.add_transition( s1, s3 )
network.add_transition( s2, s3 )
network.bake()
```

In [150]:

```
print network.predict_proba({})
```

In [151]:

```
network.predict_proba({'guest': 'A'})
```

Out[151]:

We can see that Monty will now not open door 'A', because the guest has chosen it. At the same time, the distribution over the prize has not changed; it is still equally likely that the prize is behind each door.

Now, let's say that Monty opens door 'C' and see what happens.

In [152]:

```
network.predict_proba({'guest': 'A', 'monty': 'C'})
```

Out[152]:
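As a sanity check on the network's answer, we can compute the same posterior by brute-force application of Bayes' rule, using exact fractions. This is standalone Python that re-encodes the conditional probability table above as a function; `p_monty` is a helper name of my choosing, not a pomegranate API.

```python
from fractions import Fraction

doors = 'ABC'
third = Fraction(1, 3)

def p_monty(guest, prize, monty):
    # Monty never opens the guest's door or the prize door, and picks
    # uniformly among whatever doors remain -- the same CPT as above.
    if monty == guest or monty == prize:
        return Fraction(0)
    allowed = [d for d in doors if d != guest and d != prize]
    return Fraction(1, len(allowed))

# P(prize | guest='A', monty='C') is proportional to
# P(prize) * P(monty='C' | guest='A', prize)
posterior = {prize: third * p_monty('A', prize, 'C') for prize in doors}
total = sum(posterior.values())
posterior = {d: p / total for d, p in posterior.items()}
print({d: str(p) for d, p in posterior.items()})  # {'A': '1/3', 'B': '2/3', 'C': '0'}
```

The prize is behind door 'B' with probability 2/3, matching both the network's inference and the switching argument: staying with door 'A' wins only 1/3 of the time.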