CSE599v Machine Learning in Biology


  • The first meeting will be held on Monday March 29.

  • Please subscribe to the course mailing list.


  • Instructor: Su-In Lee (CSE 536, office hours: M 11:30am-1pm or by appointment)

  • Meetings: MW 10:10am @ CSE 303

Course Description

Biological sciences are becoming data-rich and information-intensive. It has become possible to obtain very detailed information about living organisms. For instance, we can retrieve DNA sequence information (a string about 3 billion characters long), expression (activity) levels of >20,000 genes, and various clinical measurements from humans. The growing availability of such information promises a better understanding of important questions (e.g. the causes of diseases). However, the complexity of biological systems and the high dimensionality and noise of the data make it difficult to infer the underlying mechanisms from data.

Machine learning (ML) techniques have become very useful tools for resolving important questions in biology by providing mathematical frameworks to analyze vast amounts of biological information. Biology is also a fascinating application area for ML because it presents new sets of computational challenges that can ultimately advance ML. In this course, we will discuss recent papers describing successful examples of ML techniques applied to exciting problems in biology.


We will discuss one paper in each meeting. One student will present the paper and lead the discussion. The discussion should address the reading questions, which will be given 1 week before the class. The instructor will then give a mini-lecture to provide the background knowledge needed for the topic to be discussed in the next meeting.


The course grade will be based on participation in discussions.

Students who take the course for S/NS: reading assigned papers; writing evaluations on 3 papers (30%); leading the discussion on 1-2 papers (30%); participating in discussions (10%)

Letter grade: working on a mini-project (30%), in addition to reading papers and leading/participating in discussions. The final report (<20 pages, font size 11) for the project should be submitted by June 13th.

Topics to be covered

Topic ID Date Topic and readings Discussion leader
1 3/29 Introduction & overview of topics

Syllabus [pdf], Lecture note 1 [ppt]
Su-In Lee
2 3/31 Introduction & overview of topics

Lecture note 2 [ppt]
Su-In Lee
3 4/5 A feature-based approach to modeling protein-DNA interactions, Sharon E, Lubliner S, Segal E. PLoS Comput Biol. 2008. (shorter version: Sharon E, Segal E. RECOMB 2007)

Optional readings:
Efficient structure learning of Markov networks using L1-regularization, Lee SI, Ganapathi V, Koller D. NIPS 2007.

Reading questions, Lecture note 3 [ppt]
Aniruddh Nath [pdf]
4 4/7 Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data, Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N. Nat Genet. 2003.

Optional readings:
Learning module networks, Segal E, Pe'er D, Regev A, Koller D, Friedman N. J Machine Learning Research (JMLR) 2005.

Reading questions, Lecture note 4 [ppt]
Su-In Lee
5 4/12 Probabilistic discovery of overlapping cellular processes and their regulation, Battle A, Segal E, Koller D. J Comput Biol. 2005.
(shorter version: Battle A, Segal E, Koller D. RECOMB 2004)

Reading questions, Lecture note 5 [ppt]
Casey L. Overby [ppt]
6 4/14 Learning a Prior on Regulatory Potential from eQTL Data, Lee SI, Dudley AM, Drubin D, Silver PA, Krogan N, Koller D. PLoS Genet. 2009.

Optional readings:
Learning a meta-level prior for feature relevance from multiple related tasks, Lee SI, Chatalbashev V, Vickrey D, Koller D. ICML 2007.

Reading questions, Lecture note 6 [ppt]
Su-In Lee
7 4/19 An integrative genomics approach to infer causal associations between gene expression and disease, Schadt EE et al. Nat Genet. 2005.

Reading questions, Lecture note 7 [ppt]
Xu Miao [ppt]
8 4/21 Characterizing dynamic changes in the human blood transcriptional network, Zhu J, Chen Y, Leonardson AS, Wang K, Lamb JR, Emilsson V, Schadt EE. PLoS Comput Biol. 2010.

Reading questions, Lecture note 8 [ppt]
Rolfe Schmidt [ppt]
9 4/26 Statistical estimation of correlated genome associations to a quantitative trait network, Kim SY, Xing E. PLoS Genet. 2010.

Reading questions, Lecture note 9 [ppt]
Eric Garcia [ppt]
10 4/28 Population structure and eigenanalysis, Patterson N, Price AL, Reich D. PLoS Genet 2006.

Principal components analysis corrects for stratification in genome-wide association studies, Price AL, Patterson N, Plenge RM, Weinblatt M, Shadick NA, Reich D. Nat Genet. 2006.

Reading questions, Lecture note 10 [ppt]
James Chen [ppt]
11 5/3 SNP imputation in association studies, Halperin E, Stephan DA. Nat Biotechnology. 2009.

Imputation-Based Analysis of Association Studies: Candidate Regions and Quantitative Traits, Bertrand Servin, Matthew Stephens. PLoS Genet. 2007.

Reading questions, Lecture note 11 [ppt]
Cindy Desmarais [ppt]
12 5/5 Reconstructing genetic ancestry blocks in admixed individuals, Tang H, Coram M, Wang P, Zhu X, Risch N. American Journal of Human Genetics (AJHG) 2006.

Reading questions
Elizabeth Tseng [ppt]
13 5/10 Tag SNP selection in genotype data for maximizing SNP prediction accuracy, Halperin E, Kimmel G, Shamir R. Bioinformatics 2005.

BNTagger: improved tagging SNP selection using Bayesian networks, Lee PH, Shatkay H. Bioinformatics 2006.

Reading questions
Will Mortensen [ppt]
14 5/12 Causal protein-signaling networks derived from multiparameter single-cell data, Sachs K, Perez O, Pe'er D, Lauffenburger DA, Nolan GP. Science 2005.

Reading questions
Kristi Tsukida [ppt]
15 5/17 CONTRAfold: RNA secondary structure prediction without physics-based models, Do CB, Woods DA, Batzoglou S. Bioinformatics 2006.

A max-margin model for efficient simultaneous alignment and folding of RNA sequences, Do CB, Foo CS, Batzoglou S. Bioinformatics 2008.

Reading questions
Daniel Jones [ppt]
16 5/19 Automatic parameter learning for multiple network alignment, Flannick J, Novak A, Do CB, Srinivasan BS, Batzoglou S. J Comput Biol. 2009.

Optional readings:
Modeling cellular machinery through biological network comparison, Sharan R, Ideker T. Nat Biotech. 2006.

Graemlin: General and robust alignment of multiple large interaction networks, Jason Flannick, Antal Novak, Balaji S. Srinivasan, Harley H. McAdams and Serafim Batzoglou. Genome Res. 2006.

Reading questions
Adrienne Wang [ppt]
17 5/24 Cell type-specific gene expression differences in complex tissues, Shen-Orr SS, Tibshirani R, Khatri P, Bodian DL, Staedtler F, Perry NM, Hastie T, Sarwal MM, Davis MM, Butte AJ. Nature Methods 2010.

An integrative modular approach to systematically predict gene-phenotype associations, Mehan MR, Nunez-Iglesias J, Dai C, Waterman MS, Zhou XJ. BMC Bioinformatics 2010.

Reading questions
Nathan Parrish [ppt] [ppt]
18 5/26 Aging Mice Show a Decreasing Correlation of Gene Expression within Genetic Modules, Southworth LK, Owen AB, Kim SK. PLoS Genetics 2009.

Reading questions
Austin Webb
19 6/2 Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification, Klammer AA, Reynolds SM, Bilmes JA, MacCoss MJ, Noble WS. Bioinformatics 2008.

Background paper:
An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database, Jimmy K. Eng, Ashley L. McCormack and John R. Yates III. Journal of the American Society for Mass Spectrometry 1994.

Assigning Significance to Peptides Identified by Tandem Mass Spectrometry Using Decoy Databases, Lukas Kall, John D. Storey, Michael J. MacCoss and William Stafford Noble. Journal of Proteome Research 2008.

Reading questions
Adam Gustafson [ppt]