Introduction to Statistical and Computational Genomics

GENOME 559
Department of Genome Sciences
University of Washington School of Medicine


Homework 5

Due: 2/24/09, at start of class.

Notes: I will be adding extra credit options to some questions. These will have at most a small effect on your grade (and no effect on anyone else's grade, e.g. they won't "raise the curve"), so do them because they're fun, or interesting or to push yourself to do a little more practice. You're not limited to the extensions I suggest, either; if some other related investigation appeals to you, feel free to write it up and hand it in. Talk to me first if in doubt. To avoid losing points while trying to do extra credit, always get the basic assignment working first, and save it, before starting to modify it for extra credit. Turn in both versions.

This week, please try to keep rough track of how may hours you spend on the homework, and separate "basic" from "extra credit." Again, won't affect grades, but I want to track it. After you finish the assignment, please respond to this simple Poll.  (I always welcome constructive feedback. For technical reasons, your poll submissions are tied to unique IDs. In principle I could correlate them to individuals, but it's definitely extra work, and I don't intend to do that. So, it's not quite anonymous, but close. If you prefer to send me anonymous email, feel free.)

Electronic Turnin: I want you to turn in your programs electronically so I can run them if necessary. This is easy. Just visit this Catalyst Dropbox page, click "Homework 5", then click the "Choose File" button to select each file you want to submit. It's probably a good idea to try it once well in advance of the deadline in case there are issues, but I've generally found it works smoothly. You can also easily delete/resubmit any file if you find/fix a bug, etc. (and revisit the poll to update your hours, etc.) if desired.

Please turn in your two Python programs hw5wmm.py and hw5gbk.py. If you also did extra credit versions, include those additional files with "-ec" added to the file name, e.g., hw5wmm-ec.py. Don't forget to include any necessary Python module files you might have chosen to create.

PLEASE *DON'T* include the GenBank files; I'd really rather not get 20 more copies of them!

The "written" portion of the assignment may either be turned in on paper in class as usual, or you may include it electronically (as a .doc, .rtf , .pdf or . txt plain text file). I'd prefer it all in one file.

Problems:

  1. Lecture notes on 2/10 & 2/12 show how to build weight matrix models based on nucleotide counts (or freqencies), even when background nucleotide frequencies are nonuniform.

    1. Write a Python program hw5wmm.py to do this. Your input is a file name specified on the command line. For a motif of length n, this file should contain 4 lines, each with n+1 nonnegative integers. The 4 lines correspond to counts of A, C, G, T, resp., in the i-th position of observed motif instances (1 <= i <= n) for the first n columns, and in the background (the last column of numbers). For simplicity, you may assume that no counts are zero and that the total count in each column is exactly 100. Print the resulting weight matrix scores. (Use 10 x log2, rounded, as shown in the examples in the notes. ">>>help(math.log)" will tell you how to get base 2 logarithms; ">>>help(round)" for rounding.)

    2. Apply your code to the "TATA Box" frequencies given in the 2/10 notes (tweaked to avoid the zero in column 6), assuming uniform background frequencies; your scores should reproduce the weight matrix given in the notes. Use the file tata-unif.txt as your input. E.g., the output from my version is:

          Weight matrix scores:
            -36    19     1    12    10   -46 
            -15   -36    -8    -9    -3   -31 
            -13   -46    -6    -7    -9   -46 
             17   -31     8    -9    -6    19 
      
      (NB: these frequencies come from old, incomplete data; fine as an example, but don't use them for a real study.)

    3. The uniform background assumption is very inaccurate for some organisms. E.g., the macronuclear genome of Tetrahymena thermophila is about 74% AT. Repeat (b), assuming this background frequency. Use the file tata-Tt.txt as your input. I'm not claiming that the E.coli TATA box frequencies are correct for T. thermophila, but if they were, scoring should change accordingly. Based on your intuition and your two score tables, do you think the WMM would be more or less accurate at finding these motifs in an AT-rich genome? Why?

    4. Extra credit options: Generalize your code to drop the assumption that column sums are 100. Instead, divide counts by column totals to get frequencies. To avoid problems with zero counts, add a "pseudocount" of 1 to each entry before converting counts to frequencies. Compute and print total- and per-column relative entropies, as shown in the 2/12 notes. Implement the "scanning" algorithm, i.e., given a WMM, compute the score for each position of a nucleotide sequence (except the few at the end). Try it, somehow, on E. coli.

  2. "Scores" in many bioinformatic applications are quite arbitrary, and can be difficult to interpret, other than a general sense of "bigger is better." In contrast, the scores produced from a weight matrix model have a very precise probabilistic interpretation, under the assumption that "sites" versus "nonsites" arise by independent selection of nucleotides from successive columns of the foreground versus background models.

    1. Under these assumptions, what does a score of "+90" mean? (Again assume scores are 10 x log2, rounded. Hint: "Likelihood Ratio Test.")

    2. Usual rules for "preponderance of scientific evidence" suggest that you should have at least 95% confidence before rejecting the null hypothesis that "this is just random DNA, not a TATA box" (or similar motif). What score does that correspond to? (Again, assume 10 x log2.)

    3. Extra Credit: Many researchers would demand a higher score than that before being confident that a functional motif instance has been identified. Give one or two reasons why such a higher standard is prudent.

  3. Download the complete GenBank file for Methanocaldococcus jannaschii. You can find it by going to the NCBI Genome page, click "Genome Projects" under "Microbial" in the left column, then scroll through the list of available microbial genomes to find "Methanocaldococcus jannaschii DSM 2661". Click its name. Read the brief description of it. To get the genome sequence itself, follow the Refseq FTP link and download two files: NC_000909.gbk and NC_000909.rnt. The .rnt file is a succinct summary of noncoding RNAs in this genome (in this case, limited to tRNAs and rRNAs). We're going to reproduce part of that data by directly extracting it from the .gbk file. Specifically, you should write a Python program hw5gbk.py that reads the .gbk file and prints one line for each tRNA gene found therein, giving

    • the tRNA's genomic coordinates,
    • length (in nucleotides), and
    • strand.

    You must use the Python re module to identify the relevant lines of the .gbk file. I recommend using the re module's "parenthesized groups" facilities to extract coodinates, but this is not required; use regular Python string processing facilities if you prefer. Be sure to find tRNAs on both + and - strands. You can check your resuts by looking at the .rnt file.

    Planning and Incremental Development: Among the keys to any (successful) large software project are planning and incremental development. No one has the brainpower to write a big program from top to bottom and get it correct in one pass. Instead, outlining an incremental plan will pay big dividends. You should think about this for problem 1 as well, but here's a suggestion for this one.

    • Look at the .gbk and .rnt files to see what you're up against.
    • Write a function that takes a file name as its argument and prints the first (say) 10 or 20 lines of that file.
    • Modify that function to also take a string argument, which is to be used as a pattern, and prints the first 10 or 20 lines of the file that contain a match to the pattern. (This is similar to the Unix "grep" utility I described on 2/10.)
    • Now start playing with re patterns. Except in the very rare case where there's a detailed, lawyerly specification of your input, one of the main issues with "parsing" a text input file (using an re package or otherwise) is to define patterns that are loose enough to give you all matches of interest, including plausible variants of examples you've looked at closely, but tight enough to exclude extraneous junk. (Spaces or tabs? Always 42 of them? Does case matter? Are leading zeros allowed on numbers? Etc., etc.) For this problem, the pattern r'tRNA' is an obvious starting point, but I'm sure you'll find it is too loose. Tighten your pattern in various ways until you've homed in on the lines defining tRNA genes. Compare to the .rnt file to make sure.
    • Modify your pattern to capture the genomic coordinates of each hit as "parenthesized groups".
    • Change your function to print coordinates and lengths rather than the actual line you hit.
    • Declare Victory. Celebrate.
    • (As a general principle of software design, most people prefer that functions have a fairly succinct description and minimalist interface to their callers. E.g., "grep(pattern,file) prints lines in file matching pattern" is pretty simple. Appending "Oh, by the way, pattern should specify a tRNA line in a .gbk file and we don't print the hits, we print blah blah blah based on paren groups in pattern..." is a little baroque. For this assignment, that's OK, but I encourage you to think about ways to slightly restructure your final solution so it's less baroque.

    Extra Credit Options: Find rRNAs as well. For each tRNA, identify its associated amino acid. (Perhaps looking in Lutz or the Python docs for information about iterators and the "next" method would be useful.) Reproduce the .rnt file from information in the .gbk file (which is probably how NCBI generates .rnt files...). Extract the nucleotide sequence of each tRNA. What is the average G-C content of tRNAs versus the genome as a whole? Is this a surprise? Care to speculate about why?