I am a fifth-year Ph.D. student and IGERT big data fellow in the Computer Science and Engineering department at the University of Washington. My primary research focus is the application of machine learning methods, primarily deep learning, to the massive amount of data being generated in the field of genome science. My research projects have involved predicting the three-dimensional structure of the genome using convolutional neural networks and learning a latent representation of the human epigenome using deep tensor factorization. These projects typically involve hundreds of millions to billions of samples, making them markedly "big data" problems. Additionally, I routinely contribute to the Python open source community as the core developer of pomegranate, a package for flexible probabilistic modeling, and apricot, a package for data summarization for machine learning, and in the past as a core developer for the scikit-learn project. Future projects include graduating.
Machine Learning | Submodular Selection | Big Data | Computational Biology | Chromatin Architecture | Epigenomics
A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens, Cell, 2019
M. Gasperini, A.J. Hill, J.L. McFaline-Figueroa, B. Martin, S. Kim, M.D. Zhang, D. Jackson, A. Leith, J. Schreiber, W.S. Noble, C. Trapnell, N. Ahituv, J. Shendure
A pitfall for machine learning methods aiming to predict across cell types, bioRxiv, 2019
J. Schreiber, R. Singh, J. Bilmes, and W.S. Noble
Massively parallel profiling and predictive modeling of the outcomes of CRISPR/Cas9-mediated double-strand break repair, bioRxiv, 2019
W. Chen, A. McKenna, J. Schreiber, Y. Yin, V. Agarwal, W.S. Noble, and J. Shendure
Discrimination among Protein Variants Using an Unfoldase-Coupled Nanopore, ACS Nano, 2014
J. Nivala, L. Mulroney, G. Li, J. Schreiber, and M. Akeson
Nanopores Discriminate Among Five C5 Cytosine Variants in DNA, Journal of the American Chemical Society, 2014
Z. L. Wescoe, J. Schreiber, M. Akeson
Error rates for nanopore discrimination among cytosine, methylcytosine, and hydroxymethylcytosine along individual DNA strands, Proceedings of the National Academy of Sciences, 2013
J. Schreiber, Z. L. Wescoe, R. Abu-shumays, J. T. Vivian, B. Baatar, K. Karplus, and M. Akeson
Avocado: Multi-scale deep tensor factorization learns a latent representation of the human epigenome
- Intelligent Systems for Molecular Biology (ISMB 2018)
- Stanford Center for Genomics and Personalized Medicine (2018)
pomegranate: Probabilistic Modeling in Python
- ODSC East (2017, Invited)
- Tesla Autopilot Maps (2017, Invited)
- Data Intelligence (2017)
- scipy (2017)
- UW eScience Community Seminar (2015, 2016, 2017)
- Moore-Sloan Data Science Summit (2016, 2017)
- Seattle Data-Analytics and Machine Learning Meetup (2016)
- PyData Chicago (2016)
Rambutan: Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture
- scipy (2017)
- UCSF (2017, Invited)
- Stanford Center for Genomics and Personalized Medicine (SCGPM 2017)
- Great Lakes Bioinformatics Conference (2017)
Large Scale HMMs for Nanopore Data Analysis
- UW eScience Community Seminar (2016)
- UW ACMS Research Seminar (2016)
Parallel Processing in Python: UW eScience (2016)
Accelerating Scientific Code: INRIA Parietal (2015)
What is a bioinformatician?: UCSC (2013-2016, Invited)
Tesla Motors Inc.
June 2017-September 2017 (3 months)
This internship focused on exploring new ways that machine learning can improve Tesla Autopilot, and involved processing terabytes of fleet data, doing exploratory data analysis, and building working machine learning prototypes.
June 2016-September 2016 (3 months)
As a member of the AspenTech data science team, my focus was on implementing scalable and efficient probabilistic models for the analysis of large amounts of data.
Software Engineering Intern
Parietal, INRIA Saclay
June 2015-September 2015 (3 months)
As a software engineering intern on the scikit-learn team, my focus was on speeding up the gradient boosting implementation. My work ended up speeding up most tree-based estimators in addition to gradient boosting trees.
Nanopore Group, University of California, Santa Cruz
July 2013-September 2014 (1 year 2 months)
I worked as an on-hand computational specialist to the UCSC Nanopore Group. This involved using a variety of machine learning methods (predominantly Random Forests, SVMs, and HMMs) to analyze large volumes of data and to automate analyses previously done by hand, increasing speed and accuracy. I maintained a MySQL database of experimental metadata, results, summaries, and other information in a normal-form compliant manner. Lastly, I was involved in writing several papers based on my findings and a grant that was accepted by the NIH, and in explaining my computational work to others without my background.
Undergraduate Research Assistant
Nanopore Group, University of California, Santa Cruz
January 2011-June 2013 (2 years 6 months)
I performed both wet lab and computational research with the UCSC Nanopore Group. This involved purifying DNA, running nanopore experiments, and analyzing the results using simple machine learning methods. This work resulted in a first-author paper in which I used supervised learning methods to show that the MspA nanopore can distinguish between epigenetic modifications of the cytosine nucleotide, and described error rates for this classification.