University of Washington Computer Science & Engineering
  Modeling and Searching for Non-Coding RNA: Homework

Homework

Notes: (1) The assignment is deliberately a bit open-ended. Feel free to explore and tell me what you did in addition to/instead of what I've outlined below.

(2) The Infernal package is now at version 1.0.2. CMfinder (and parts of the following assignment) is/are based on the earlier 0.81 Infernal release, and, unfortunately, the covariance model files built by Infernal release 0.81 (and, hence, by CMfinder) appear to be incompatible with those expected by version 1.0. I think the fix is to use the new cmbuild to rebuild the CM from the Stockholm alignment that CMfinder generates, but I need to verify this. You can at least start on the rest of the assignment in the meantime.

  1. Download the Infernal software package (infernal.tar.gz, version 1.0.2).

  2. Read 00README and sections 1-3 of Userguide.pdf (skip or skim some of the installation details like large file system, rigfilters, MPI, and probably some of the usage options including local alignments, accelerating alignments and optional annotation).

  3. Build and install it following the instructions in section 2 of the manual. (If you do the make install step, it copies 7 executable files including cmalign, cmbuild, cmscore, cmsearch into /usr/local/bin; you can easily delete them afterwards, if you don't want to keep them. Alternatively, add an appropriate --bindir option to ./configure, or add .../infernal-1.0.2/src to your path, so these programs can be found.)

    Documentation says it is most extensively tested on Linux, but should work on "any system with an ANSI C compiler", including Mac & Windows. It installed smoothly on my Mac. I haven't tried it elsewhere; you may need to fiddle with the config files to install.
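Before moving on to the tutorial, it may help to confirm that the programs can actually be found on your path. The following is a minimal sketch using only the Python standard library; the list of names is just the four executables mentioned in step 3 (the install copies others as well), and `locate_programs` is a hypothetical helper name, not part of Infernal.

```python
import shutil

# Program names taken from step 3; the install copies other executables too.
INFERNAL_PROGRAMS = ["cmalign", "cmbuild", "cmscore", "cmsearch"]


def locate_programs(names):
    """Map each program name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_programs(INFERNAL_PROGRAMS).items():
        print(f"{name}: {path if path else 'NOT FOUND on PATH'}")
```

If any program is reported as not found, revisit the --bindir option or your path setting before continuing.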

  4. Follow the tutorial steps outlined in section 3 "Getting Started", at least to the bottom of page 13.

  5. The cmbuild example builds a model ("my.cm") for tRNA based on 5 yeast tRNAs. Given so few sequences and such closely related ones, it's a surprisingly good model. I've extracted a handful of tRNA sequences from the Genbank records for Pyrococcus furiosus (an anaerobic archaeon found in 100°C sediments near sea floor vents, presumably not a close relative of S. cerevisiae). Here are 3 versions of the sequences:

    Run cmsearch on pfur.fa using your "my.cm" model.

  6. Deliverable #1: send me the output of cmsearch above, together with the bit scores and E-values of the best scoring true tRNA and worst scoring false tRNA (true/false according to the Genbank annotation). How do the bit scores compare to the "rule of thumb" for significance given in the "executive summary" at the start of Section 5 of the user guide? Some of the weak matches have rather good E-values; do you trust them? Why do you think this is happening?

    Note: cmsearch searches both strands, and coordinates of plus strand hits are with respect to the input sequence, but in earlier releases the coordinates it reported for minus strand hits counted positions from the front of the reversed sequence. I think this has been corrected, but be careful in case not.
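If you suspect that a minus strand hit is reported in reversed-sequence coordinates, the conversion back to input-sequence coordinates is simple arithmetic: with 1-based positions, position p counted from the front of the reversed sequence is position L - p + 1 on the original sequence of length L. A minimal sketch follows; `revcomp_to_forward` is a hypothetical helper name, and whether your cmsearch output needs this conversion at all is something to check against the Genbank annotation.

```python
def revcomp_to_forward(start, end, seq_len):
    """Map a 1-based inclusive interval given in reversed-sequence coordinates
    to the same bases in original (input) sequence coordinates.

    Position p on the reversed sequence is position seq_len - p + 1 on the
    original. Returns (low, high) with low <= high.
    """
    if not (1 <= start <= seq_len and 1 <= end <= seq_len):
        raise ValueError("coordinates must lie within 1..seq_len")
    a = seq_len - start + 1
    b = seq_len - end + 1
    return (min(a, b), max(a, b))
```

The mapping is its own inverse, so applying it twice returns the interval you started with. Comparing a hit's coordinates against the annotated tRNA positions both before and after conversion should show which convention your release uses.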

  7. This model did pretty well, but maybe that's all due to Eddy having very carefully selected his example tRNA sequences and very carefully aligned them manually.

    Use Zizhen Yao's CMfinder to automatically discover a tRNA motif in the P. furiosus sequences. The webserver version is being finicky, so I suggest you download and install the software via the above link; I'll update this if I get it fixed soon... Since 8 of the 10 tRNAs in this data happen to be on the reverse strand, I suggest you use pfurrc.fa rather than pfur.fa for this step; CMfinder only looks at one strand. I'd suggest you set CMfinder's parameter for expected number of stemloops to 3.
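For reference, here is how a reverse-strand version of a FASTA file can be produced. This is a minimal sketch that assumes pfurrc.fa is simply the reverse complement of each record in pfur.fa (the file name suggests this, but I am stating it as an assumption). The function names are hypothetical, and ambiguity codes other than N are passed through uncomplemented, which is a simplification.

```python
# Complement table for DNA/RNA letters; U is complemented to A.
# Ambiguity codes other than N are left unchanged (a simplification).
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_fasta(text, width=60):
    """Reverse-complement every record in FASTA-formatted text.

    Header lines are kept as-is; sequences are re-wrapped at `width` columns.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    out = []
    for header, seq in records:
        rc = reverse_complement(seq)
        out.append(header)
        out.extend(rc[i:i + width] for i in range(0, len(rc), width))
    return "\n".join(out) + "\n"
```

Remember that coordinates in a reversed file count from the other end of each sequence, so hits found there must be mapped back before comparing them with the Genbank annotation.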

  8. Use this model to cmsearch pfur.fa. It will do pretty well -- no surprise, it can find the sequences it was built from. Also use it to search infernal-1.0.2/intro/tutorial.fa, a FASTA file that contains the 5 yeast tRNAs from which you built my.cm. You should find that the CMfinder model built from the P. furiosus data doesn't do as well at recognizing yeast tRNAs as the hand-built yeast model did at finding P. furiosus tRNAs.

  9. Deliverable #2: Send me the results of scanning tutorial.fa, together with the lowest true positive and highest false positive scores. (All plus strand hits are "True" in this case.)
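The two scores requested here can be picked out mechanically once each hit has a bit score and a true/false label. The sketch below assumes you have already copied the hits out of the cmsearch output by hand into (bit score, strand) pairs; it does not parse cmsearch output, and the function names are hypothetical.

```python
def label_by_strand(hits):
    """Turn (bit_score, strand) pairs into (bit_score, is_true) pairs.

    For the tutorial.fa scan, all plus strand hits count as true.
    """
    return [(score, strand == "+") for score, strand in hits]


def score_thresholds(labeled_hits):
    """Return (lowest true positive score, highest false positive score).

    `labeled_hits` is a list of (bit_score, is_true) pairs. An entry is None
    when there are no hits of that kind.
    """
    true_scores = [score for score, is_true in labeled_hits if is_true]
    false_scores = [score for score, is_true in labeled_hits if not is_true]
    lowest_tp = min(true_scores) if true_scores else None
    highest_fp = max(false_scores) if false_scores else None
    return lowest_tp, highest_fp
```

When the lowest true positive score exceeds the highest false positive score, a single bit score threshold between the two separates the classes cleanly; when it does not, the overlap tells you how many hits any one threshold must misclassify.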

  10. Improve the CMfinder model, so that it does a better job of finding yeast tRNAs, without significantly reducing its success on P. furiosus tRNAs. Try to think about doing this in a situation where you have a few "trusted" examples, e.g. the ones in P. furiosus, but none in yeast. There are several ways I can think of that might accomplish this. E.g.:

    You can probably think of other strategies.

  11. Deliverable #3: Try one or more of the above strategies, and/or one or more of your own, and tell me in a couple of paragraphs what you did and how well it worked (e.g., send me scan results and true/false score thresholds as above). Also send me the refined .sto file you created. If you have the time and patience, scan more of the yeast genome to see how it does, in terms of false positives, for example. Definitely use Infernal's cmcalibrate on your model before doing the scan, so that you get the benefit of the HMM filtering and E-value calculations.
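The order of operations for a refined model is: build the CM from the .sto file, calibrate it, then search. The sketch below only assembles those three command lines, using the basic positional forms (cmbuild takes the output CM file then the alignment; cmcalibrate takes the CM file; cmsearch takes the CM file then the sequence file). The file names refined.sto, refined.cm and yeast.fa are placeholders, the helper names are hypothetical, and no options are passed; check each program's help output for your release before relying on this.

```python
import subprocess


def plan_commands(sto_file, cm_file, seq_file):
    """Return the build, calibrate, search commands as argument lists, in order."""
    return [
        ["cmbuild", cm_file, sto_file],   # build the CM from the Stockholm alignment
        ["cmcalibrate", cm_file],         # calibrate before searching
        ["cmsearch", cm_file, seq_file],  # scan the sequence file with the model
    ]


def run_plan(commands):
    """Run each command in turn, stopping at the first failure.

    Assumes the Infernal programs are on PATH. cmbuild may refuse to
    overwrite an existing CM file; if so, remove the old file first.
    """
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder file names; substitute your own.
    for cmd in plan_commands("refined.sto", "refined.cm", "yeast.fa"):
        print(" ".join(cmd))
```

Rebuilding with the current cmbuild from the Stockholm alignment is also the fix suggested in note (2) for CMfinder's older-format model files, so the same sequence of steps applies to the CMfinder output.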

  12. "Alternate/Extra Credit" try the following:

Email the "deliverables" to me (ruzzo at u.washington.edu), preferably by Monday, May 24. Bundling them all into one zip or tar archive would probably be most convenient.

Please don't hesitate to contact me if you have questions, problems installing the software, etc.



Larry Ruzzo

Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX