I am a fifth-year Ph.D. student and IGERT big data fellow in the Computer Science and Engineering department at the University of Washington. My primary research focus is the application of machine learning methods, primarily deep learning, to the massive amount of data being generated in the field of genome science. My research projects have involved predicting the three-dimensional structure of the genome using convolutional neural networks and learning a latent representation of the human epigenome, as characterized by the Roadmap consortium, using deep tensor factorization. Additionally, I routinely contribute to the Python open source community as the core developer of the pomegranate package for flexible probabilistic modeling and, in the past, as a developer for the scikit-learn project. Future projects include graduating.
Machine Learning | Deep Learning | Big Data | Computational Biology | Chromatin Architecture | Epigenomics
J. Schreiber, Z. L. Wescoe, R. Abu-shumays, J. T. Vivian, B. Baatar, K. Karplus, and M. Akeson (2013) Error rates for nanopore discrimination among cytosine, methylcytosine, and hydroxymethylcytosine along individual DNA strands, Proceedings of the National Academy of Sciences 110(47) 18910-18915
- pomegranate: Probabilistic Modeling in Python: ODSC East (2017, Invited), Tesla Autopilot Maps (2017, Invited), Data Intelligence (2017), SciPy (2017), UW eScience (2015, 2016, 2017), Moore-Sloan Data Science Summit (2016, 2017), Seattle Data-Analytics and Machine Learning Meetup (2016), PyData Chicago (2016)
- Rambutan: Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture: SciPy (2017), UCSF (2017, Invited), Great Lakes Bioinformatics Conference (2017)
- Large Scale HMMs for Nanopore Data Analysis: UW eScience (2016), UW ACMS Research Seminar (2016)
- Parallel Processing in Python: UW eScience (2016)
- Accelerating Scientific Code: INRIA Parietal (2015)
- What is a bioinformatician?: UCSC (2013-2016, Invited)
POMEGRANATE Fast and flexible probabilistic modeling in Python with a speedy Cython implementation. Currently supports basic probability distributions, general mixture models, naive Bayes classifiers, Markov chains, hidden Markov models, factor graphs, and Bayesian networks, all with a convenient sklearn-like API for easy use. It also supports multithreaded training for speed and out-of-core training on datasets that don't fit into memory.
SCIKIT-LEARN Scikit-learn is a machine learning package for Python with high performance implementations of many classical machine learning algorithms, maintained by experts in the field and widely used. Its focus is on a consistent and easy to use API for various machine learning and data processing algorithms.
RAMBUTAN Rambutan is a convolutional neural network for the prediction of 3D chromatin architecture using only nucleotide sequence and DNaseI sensitivity. Specifically, it predicts Hi-C contacts which are statistically significant with respect to the genomic distance effect.
YAHMM Yet Another Hidden Markov Model. Hidden Markov models for Python, implemented in Cython for speed and using sparse matrix representations for efficiency. Allows users to build HMMs node by node and edge by edge, run classical HMM algorithms (forward, backward, Viterbi, posterior decoding, and their weighted counterparts), and train the model from sample data (Viterbi, Baum-Welch, and supervised training are all supported). YAHMM is extremely feature-rich, especially in training, which allows for tied emissions, tied edges, freezing of emissions, pseudocounts on edges, and inertia on both edges and distributions. MERGED WITH POMEGRANATE.
PYPORE PyPore is a Python library for the analysis of nanopore data. It allows easy loading of data from Axon binary files (.abf) into file objects, detection of biomolecule translocation events, segmentation of these events into discrete steps through the nanopore, and analysis of these segments using hidden Markov models to extract meaningful information. It can store these results in both MySQL databases and JSON files and read the analyses back in later. Lastly, it has extensive support for visualizing each step along the way.
Tesla Motors Inc.
June 2017-September 2017 (3 months)
This internship focused on exploring new ways that machine learning can improve Tesla Autopilot, and involved processing terabytes of fleet data, doing exploratory data analysis, and building working machine learning prototypes.
June 2016-September 2016 (3 months)
As a member of the AspenTech data science team, my focus was on implementing scalable and efficient probabilistic models for the analysis of large amounts of data.
Software Engineering Intern
Parietal, INRIA Saclay
June 2015-September 2015 (3 months)
As a software engineering intern on the scikit-learn team, my focus was on speeding up the gradient boosting implementation. My work ended up speeding up most tree-based estimators in addition to gradient boosting.
Nanopore Group, University of California, Santa Cruz
July 2013-September 2014 (1 year 2 months)
Worked as an on-hand computational specialist for the UCSC Nanopore Group. This involved using a variety of machine learning methods (predominantly Random Forests, SVMs, and HMMs) to analyze large volumes of data and to automate analyses previously done by hand, increasing speed and accuracy. I maintained a MySQL database of experimental metadata, results, summaries, and other information in a normal-form compliant manner. Lastly, I was involved in writing several papers based on my findings, writing a grant that was accepted by the NIH, and, in general, explaining my computational work to others without my background.
Undergraduate Research Assistant
Nanopore Group, University of California, Santa Cruz
January 2011-June 2013 (2 years 6 months)
I performed both wet lab and computational research with the UCSC Nanopore Group. This involved purifying DNA, running nanopore experiments, and analyzing them using simple machine learning methods. This resulted in a first-author paper in which I used supervised learning methods to show that the MspA nanopore can distinguish between epigenetic modifications of the cytosine nucleotide, and described error rates for this classification.