Modeling and Searching for Non-Coding RNA: Homework

University of Washington Computer Science & Engineering

Homework

Notes: (1) The assignment is deliberately a bit open-ended. Feel free to explore, and tell me what you did in addition to, or instead of, what I've outlined below.

(2) The Infernal package is now at version 1.0.2. CMfinder (and parts of the following assignment) are based on the earlier Infernal 0.81 release and, unfortunately, the covariance model files built by release 0.81 (and, hence, by CMfinder) appear to be incompatible with those expected by version 1.0. I think the fix is to use the new cmbuild to rebuild the CM from the Stockholm alignment that CMfinder generates, but I need to verify this. You can at least start on the rest of the assignment in the meantime.

Download the Infernal software package (infernal.tar.gz, version 1.0.2).
Read 00README and sections 1-3 of Userguide.pdf (skip or skim some of the installation details like large file system, rigfilters, MPI, and probably some of the usage options including local alignments, accelerating alignments and optional annotation).
Build and install it following the instructions in section 2 of the manual. (If you do the make install step, it copies 7 executable files including cmalign, cmbuild, cmscore, cmsearch into /usr/local/bin; you can easily delete them afterwards, if you don't want to keep them. Alternatively, add an appropriate --bindir option to ./configure, or add .../infernal-1.0.2/src to your path, so these programs can be found.)
Documentation says it is most extensively tested on Linux, but should work on "any system with an ANSI C compiler", including Mac & Windows. It installed smoothly on my Mac. I haven't tried it elsewhere; you may need to fiddle with the config files to install.
Follow the tutorial steps outlined in section 3 "Getting Started", at least to the bottom of page 13.
The cmbuild example builds a model ("my.cm") for tRNA based on 5 yeast tRNAs. Given so few sequences and such closely related ones, it's a surprisingly good model. I've extracted a handful of tRNA sequences from the Genbank records for Pyrococcus furiosus (an anaerobic archaeon found in 100°C sediments near sea floor vents, presumably not a close relative of S. cerevisiae). Here are 3 versions of the sequences:
- pfur.gb Genbank records, including annotations showing the tRNAs' exact locations.
- pfur.fa FASTA format; same as above, but without the annotations.
- pfurrc.fa FASTA-formatted reverse complement of the sequences (courtesy of http://searchlauncher.bcm.tmc.edu/seq-util/seq-util.html).
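If the web tool is unavailable, a reverse-complemented FASTA file can be produced locally. The following is a minimal sketch, not the tool used to make pfurrc.fa; the function names are hypothetical, and it assumes plain nucleotide sequences (ambiguity codes such as N are passed through unchanged, and U is complemented to A).

```python
# Complement table for DNA/RNA letters; characters not listed (e.g. N) are left as-is.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string (DNA letters out)."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reverse_complement_fasta(lines):
    """Return FASTA text in which every record has been reverse-complemented."""
    out = []
    for header, seq in read_fasta(lines):
        out.append(">" + header + " (reverse complement)")
        out.append(reverse_complement(seq))
    return "\n".join(out) + "\n"
```

For example, `reverse_complement_fasta(open("pfur.fa"))` would return the text of a file comparable to pfurrc.fa, with each sequence on a single line.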
Run cmsearch on pfur.fa using your "my.cm" model.
Deliverable #1: send me the output of cmsearch above, together with the bit scores and E-values of the best scoring true tRNA and worst scoring false tRNA (true/false according to the Genbank annotation). How do the bit scores compare to the "rule of thumb" for significance given in the "executive summary" at the start of Section 5 of the user guide? Some of the weak matches have rather good E-values; do you trust them? Why do you think this is happening?
Note: cmsearch searches both strands, and coordinates of plus strand hits are with respect to the input sequence, but in earlier releases the coordinates it reported for minus strand hits counted positions from the front of the reversed sequence. I think this has been corrected, but be careful in case not.
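If you do run into coordinates counted from the front of the reversed sequence, they can be mapped back to the input sequence by mirroring. A minimal sketch, assuming 1-based inclusive coordinates and a known sequence length (the function name is hypothetical):

```python
def to_plus_strand(start, end, seq_len):
    """Map 1-based inclusive coordinates on the reverse-complemented sequence
    back to the original (plus-strand) sequence.

    Position p on the reversed sequence is position seq_len - p + 1 on the
    original, so the interval is mirrored and its endpoints swap roles.
    Returns (low, high) on the original sequence.
    """
    lo, hi = min(start, end), max(start, end)
    return seq_len - hi + 1, seq_len - lo + 1
```

Comparing the converted interval against the locations annotated in pfur.gb is a quick way to tell which convention your copy of cmsearch is using.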
This model did pretty well, but maybe that's all due to Eddy having very carefully selected his example tRNA sequences and very carefully aligned them manually.
Use Zizhen Yao's CMfinder to automatically discover a tRNA motif in the P. furiosus sequences. The webserver version is being finicky, so I suggest you download and install the software via the above link; I'll update this if I get it fixed soon... Since 8 of the 10 tRNAs in this data happen to be on the reverse strand, I suggest you use pfurrc.fa rather than pfur.fa for this step; CMfinder only looks at one strand. I'd suggest you set CMfinder's parameter for expected number of stemloops to 3.
Use this model to cmsearch pfur.fa. It will do pretty well -- no surprise, it can find the sequences it was built from. Also use it to search infernal-1.0.2/intro/tutorial.fa, a FASTA file that contains the 5 yeast tRNAs from which you built my.cm. You should find that the CMfinder model built from the P. furiosus data doesn't do as well at recognizing yeast tRNAs as the hand-built yeast model did at finding P. furiosus tRNAs.
Deliverable #2: Send me the results of scanning tutorial.fa, together with the lowest true positive and highest false positive scores. (All plus strand hits are "True" in this case.)
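One way to get the lowest true positive and highest false positive scores is to label each hit by whether it overlaps an annotated tRNA. The sketch below assumes you have already copied the hits and annotations into tuples by hand (it does not parse cmsearch or Genbank output), ignores strand, and counts any overlap as a match; all names are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one position.
    Endpoints may be given in either order (as for minus-strand hits)."""
    lo_a, hi_a = sorted((a_start, a_end))
    lo_b, hi_b = sorted((b_start, b_end))
    return lo_a <= hi_b and lo_b <= hi_a


def score_thresholds(hits, annotations):
    """hits: list of (seq_name, start, end, bit_score).
    annotations: list of (seq_name, start, end) for the known tRNAs.
    Returns (lowest true-positive score, highest false-positive score);
    either value is None if that class is empty."""
    true_scores, false_scores = [], []
    for name, start, end, score in hits:
        is_true = any(
            name == a_name and overlaps(start, end, a_start, a_end)
            for a_name, a_start, a_end in annotations
        )
        (true_scores if is_true else false_scores).append(score)
    lowest_true = min(true_scores) if true_scores else None
    highest_false = max(false_scores) if false_scores else None
    return lowest_true, highest_false
```

A gap between the two returned scores (lowest true above highest false) means a single bit-score threshold separates the annotated tRNAs from everything else in that scan.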
Improve the CMfinder model, so that it does a better job of finding yeast tRNAs, without significantly reducing its success on P. furiosus tRNAs. Try to think about doing this in a situation where you have a few "trusted" examples, e.g. the ones in P. furiosus, but none in yeast. There are several ways I can think of that might accomplish this. E.g.:
- Based on the cmsearch results of the P. furiosus data using the P. furiosus model, are there sequences that should be deleted from the model or additional sequences that could be added? Cmalign/cmbuild can be used to rebuild the model.
- Would adding a sequence or two from some third species help? E.g., maybe from another Pyrococcus species, wherein you could plausibly find additional examples based on the initial CMfinder model? Cmalign/cmbuild can be used to rebuild the model. Even using one of the yeast tRNAs to help find the rest would be interesting.
- Take the Stockholm alignment produced by CMfinder and tune it up by hand. E.g., it's likely that some residues are misaligned; some columns are incorrectly annotated as paired/unpaired, etc. Use cmbuild to build a new model from your improved alignment. This step could be done on its own or in conjunction with the previous steps. [For a small example like this, you can just edit the Stockholm alignment with your favorite text editor. The Rfam curators use the Ralee emacs mode for this, which you could also download.]
You can probably think of other strategies.
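When editing a Stockholm alignment by hand, it is easy to leave the structure annotation with an unmatched bracket. The following is a rough sanity check, not part of Infernal or CMfinder: it collects the `#=GC SS_cons` line(s) and reports columns whose bracket has no partner. It assumes the four bracket pairs `<>`, `()`, `[]`, `{}` mark base pairs, checks each bracket type independently, and ignores everything else (including letter-coded pseudoknots); the function names are hypothetical.

```python
# Closing bracket -> its opening partner.
PAIRS = {">": "<", ")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def ss_cons_from_stockholm(lines):
    """Concatenate all '#=GC SS_cons' annotation strings (handles interleaved blocks)."""
    parts = []
    for line in lines:
        fields = line.split()
        if len(fields) == 3 and fields[0] == "#=GC" and fields[1] == "SS_cons":
            parts.append(fields[2])
    return "".join(parts)


def unbalanced_columns(ss_cons):
    """Return sorted 0-based column indices of brackets that have no partner."""
    stacks = {opener: [] for opener in OPENERS}
    bad = []
    for i, ch in enumerate(ss_cons):
        if ch in OPENERS:
            stacks[ch].append(i)
        elif ch in PAIRS:
            stack = stacks[PAIRS[ch]]
            if stack:
                stack.pop()          # matched with the most recent opener
            else:
                bad.append(i)        # closer with nothing to close
    for stack in stacks.values():
        bad.extend(stack)            # openers that were never closed
    return sorted(bad)
```

An empty list from `unbalanced_columns(ss_cons_from_stockholm(open("refined.sto")))` only says the brackets pair up; it does not say the pairs are biologically sensible.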

Deliverable #3: Try one or more of the above strategies, and/or one or more of your own, and tell me in a couple paragraphs what you did and how well it worked (e.g., send me scan results and true/false score thresholds as above). Also send me the refined .sto file you created. If you have the time and patience, scan more of the yeast genome to see how it does, in terms of false positives, for example. Definitely use Infernal's cmcalibrate on your model before doing the scan, so that you get the benefit of the HMM filtering and E-value calculations.
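The rebuild, calibrate, and scan sequence can be scripted. The sketch below only assembles the three commands in the basic positional form used in the Infernal 1.0 tutorial (`cmbuild <cm> <sto>`, `cmcalibrate <cm>`, `cmsearch <cm> <fasta>`), with no options; the file names and function names are hypothetical, and nothing is executed until you call `run_pipeline`.

```python
import shutil
import subprocess


def infernal_commands(sto_path, cm_path, target_fasta):
    """Return the argv lists for: build a CM from the alignment, calibrate it, search with it."""
    return [
        ["cmbuild", cm_path, sto_path],
        ["cmcalibrate", cm_path],
        ["cmsearch", cm_path, target_fasta],
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure.
    Output goes to the terminal; redirect it if you want to keep the scan results."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(argv[0] + " not found on PATH")
        subprocess.run(argv, check=True)


# Hypothetical file names, for illustration only.
commands = infernal_commands("refined.sto", "refined.cm", "yeast_chr.fa")
```

Calling `run_pipeline(commands)` would then run the three steps, provided the Infernal programs are on your path as described in the installation step.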

"Alternate/Extra Credit": try the following:
- If you'd like a more challenging example, instead of, or in addition to the tRNA example, go find a riboswitch: Using Neph & Tompa's MicroFootPrinter, select your favorite gene in your favorite bacterium (e.g. select Bacillus subtilis, any of cbiB, emrE, gcvT, glmS, guaA, lysA, metK (the MFP sample output), mgtE, or terC, restrict to Firmicutes clade). It will show you conserved motifs, which may be TFBS motifs, or may perhaps be conserved patches in 5' UTRs of the orthologous genes. It also automates the tedious steps of finding orthologs and collecting their upstream sequences, which you can then copy/paste to CMfinder. CMfinder is fairly slow on these examples (bigger, more complex structures, more sequence data, slow server, etc.), so be patient (maybe 20-40 minutes on the webserver, or maybe more), but you should get back several motifs. Select a promising one and try using cmsearch --hmmfilter to scan the entire genome of one of these species to find more instances. (I'd guess this step will take an hour or more. Searching just intergenic regions would be faster, if you want to write a small script to pick them out.) The amorphous deliverable here is to tell me what you did/what you learned. E.g., how do these examples compare to the Rfam families? Can you refine the model manually and/or based on more examples to improve it? Do the MicroFootPrinter motifs relate to the riboswitch (and how)? Can you identify the other genes presumably being controlled? Do their annotations suggest a common denominator for what the riboswitch might be sensing? Etc.!
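For the "small script to pick them out" idea, the core step is computing the gaps between annotated genes. A minimal sketch, assuming you have already extracted gene coordinates (1-based, inclusive) from the genome annotation and that the genome is treated as linear; the function name and the `min_len` parameter are hypothetical.

```python
def intergenic_regions(genes, genome_len, min_len=1):
    """genes: iterable of (start, end) pairs, 1-based inclusive, in any order
    and with endpoints in either orientation.
    Returns a list of (start, end) intervals of length >= min_len that are
    not covered by any gene. Overlapping genes are handled by tracking the
    first position not yet covered."""
    regions = []
    next_free = 1
    for start, end in sorted((min(s, e), max(s, e)) for s, e in genes):
        if start - next_free >= min_len:
            regions.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if genome_len - next_free + 1 >= min_len:
        regions.append((next_free, genome_len))
    return regions
```

Each returned interval can be sliced out of the genome sequence and written as its own FASTA record for cmsearch; since riboswitches sit in 5' UTRs, you may prefer not to set `min_len` too aggressively.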

Email the "deliverables" to me (ruzzo at u.washington.edu), preferably by Monday, May 24. Bundling them all into one zip or tar archive would probably be most convenient.

Please don't hesitate to contact me if you have questions, problems installing the software, etc.

Larry Ruzzo


	Computer Science & Engineering University of Washington Box 352350 Seattle, WA 98195-2350 (206) 543-1695 voice, (206) 543-2969 FAX