Notes:
(1) The assignment is deliberately a bit open-ended. Feel free to explore and tell me what you did in addition to/instead of what I've outlined below.
(2) The Infernal package is now at version 1.1. CMfinder (and parts of the following assignment) is/are based on the earlier 0.81 Infernal release, and, unfortunately, the covariance model files built by Infernal release 0.81 (and, hence, by CMfinder) are incompatible with those expected by version 1.1. The fix is to discard CMfinder's CMs (the "seq.fasta.cm..." files) and use the new Infernal's cmbuild to rebuild them from CMfinder's stockholm alignments ("seq.fasta.motif..." files).
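As a rough illustration of the fix in note (2), the sketch below assembles one cmbuild command per Stockholm alignment in a directory. The glob pattern "seq.fasta.motif*" and the ".v11.cm" output suffix are assumptions made for illustration; adjust them to whatever file names CMfinder actually produced. Nothing is executed by the function itself.

```python
from pathlib import Path


def cmbuild_commands(workdir, pattern="seq.fasta.motif*"):
    """Return one cmbuild command (as an argv list) per Stockholm alignment.

    Nothing is run here; pass each list to subprocess.run() once
    Infernal 1.1 is installed and on your PATH.
    """
    commands = []
    for sto in sorted(Path(workdir).glob(pattern)):
        if sto.suffix == ".cm":  # skip files that are already models
            continue
        # Hypothetical naming scheme for the rebuilt model file.
        cm_out = sto.with_name(sto.name + ".v11.cm")
        # Usage: cmbuild [options] <cmfile_out> <msafile>; -F allows overwriting.
        commands.append(["cmbuild", "-F", str(cm_out), str(sto)])
    return commands
```

Each returned list can then be run with subprocess.run(cmd, check=True).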
Steps:
Download the Infernal software package (infernal.tar.gz, version 1.1).
Read 00README and sections 1-3 of Userguide.pdf.
Build and install it. For installation, the "Quick installation instructions" in section 2 may be all you need. Even better, note that the Infernal web page includes downloads with pre-built binaries for major platforms. Otherwise, follow the instructions in section 2 of the manual.
Follow the tutorial steps outlined in section 3 "Getting Started", roughly to the bottom of page 23.
The cmbuild example builds a model ("my.cm") for tRNA based on 5 yeast tRNAs. Given so few sequences and such closely related ones, it's a surprisingly good model. I've extracted a handful of tRNA sequences from the Genbank records for Pyrococcus furiosus (an anaerobic archaeon found in 100°C sediments near sea floor vents, presumably not a close relative of S. cerevisiae). Here are 3 versions of the sequences:
Run cmsearch on pfur.fa using your "my.cm" model.
Deliverable #1: send me the output of cmsearch above, together with the bit scores and E-values of the best scoring true tRNA and worst scoring false tRNA (true/false according to the Genbank annotation). How do the bit scores compare to the "rule of thumb" for significance given in the "executive summary" at the start of Section 5 of the user guide? Some of the weak matches have rather good E-values; do you trust them? Why do you think this is happening?
Note: cmsearch searches both strands, and coordinates of plus strand hits are with respect to the input sequence, but in earlier releases the coordinates it reported for minus strand hits counted positions from the front of the reversed sequence. I think this has been corrected, but be careful in case not.
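If you do run into the older convention, the conversion is simple arithmetic. A minimal sketch, assuming 1-based inclusive coordinates and that the minus strand hit was reported as positions counted from the front of the reversed sequence; the function name is made up for illustration:

```python
def revcomp_to_forward(start, end, seq_len):
    """Map a 1-based inclusive interval on the reversed sequence back to
    coordinates on the original input sequence.

    Position p on the reversed sequence corresponds to seq_len - p + 1
    on the original, so the two endpoints swap order.
    """
    if not (1 <= start <= end <= seq_len):
        raise ValueError("expected 1 <= start <= end <= seq_len")
    return seq_len - end + 1, seq_len - start + 1
```

A quick sanity check is to apply the mapping twice: it should return the interval you started with.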
This model did pretty well, but maybe that's all due to Eddy having very carefully selected his example tRNA sequences and very carefully aligned them manually.
Use Zizhen Yao's CMfinder to automatically discover a tRNA motif in the P. furiosus sequences. The webserver version is slow and can be finicky. If so, you can try to download and install the software via the above link.
Since 8 of the 10 tRNAs in this data happen to be on the reverse strand, I suggest you use pfurrc.fa rather than pfur.fa for this step; CMfinder only looks at one strand. I'd suggest you set CMfinder's parameter for expected number of stemloops to 3 (rather than the pair of default parameter sets offered by the web server.) Note that CMfinder usually returns several candidate motifs. You may try more than one of them, or try to select the most promising based on structure (recall the canonical tRNA cloverleaf) and/or score (visible at the ends of the second group of comment lines in the Stockholm files.)
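For reference, pfurrc.fa is presumably the reverse complement of each record in pfur.fa (an assumption based on the file name; use the provided file as is). A minimal sketch of that transformation in plain Python, handling only A/C/G/T/U/N and a simple FASTA layout; other characters pass through uncomplemented:

```python
# Complement table for the common DNA/RNA letters (U is treated like T).
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_fasta(text):
    """Reverse-complement every record in a FASTA-formatted string."""
    out = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:  # flush the previous record
                out.append(header)
                out.append(reverse_complement("".join(chunks)))
            header, chunks = line, []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:  # flush the last record
        out.append(header)
        out.append(reverse_complement("".join(chunks)))
    return "\n".join(out) + "\n"
```

Each sequence is written back on a single line; re-wrapping to a fixed width is left out for brevity.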
Use this model to cmsearch pfur.fa. It will do pretty well -- no surprise, it can find the sequences it was built from. Also use it to search tutorial.fa, a FASTA file that contains the 5 yeast tRNAs from which you built my.cm. You should find that the CMfinder model built from the P. furiosus data doesn't do as well at recognizing yeast tRNAs as the hand-built yeast model did at finding P. furiosus tRNAs.
Deliverable #2: Send me the results of scanning tutorial.fa, together with the lowest true positive and highest false positive scores. (All plus strand hits are "True" in this case.)
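The two numbers requested here can be pulled out of a hit list mechanically. A small sketch, assuming you have already transcribed each hit as a (strand, bit score) pair and applying the rule above that plus strand hits are true; the values in the example are made up, not real cmsearch output:

```python
def score_thresholds(hits):
    """Return (lowest true-positive score, highest false-positive score).

    hits: iterable of (strand, bit_score) pairs, strand being "+" or "-".
    Plus strand hits count as true, minus strand hits as false.
    Either value is None when that class has no hits.
    """
    hits = list(hits)
    true_scores = [score for strand, score in hits if strand == "+"]
    false_scores = [score for strand, score in hits if strand == "-"]
    lowest_tp = min(true_scores) if true_scores else None
    highest_fp = max(false_scores) if false_scores else None
    return lowest_tp, highest_fp


# Made-up example values, for illustration only.
example_hits = [("+", 41.2), ("+", 18.7), ("-", 12.3), ("-", 5.0)]
lowest_tp, highest_fp = score_thresholds(example_hits)
```

A model separates the two classes cleanly when the lowest true positive score is above the highest false positive score.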
Improve the CMfinder model, so that it does a better job of finding yeast tRNAs, without significantly reducing its success on P. furiosus tRNAs. Try to think about doing this in a situation where you have a few "trusted" examples, e.g. the ones in P. furiosus, but none in yeast. There are several ways I can think of that might accomplish this. E.g.:
You can probably think of other strategies.
Deliverable #3: Try one or more of the above strategies, and/or one or more of your own, and tell me in a couple paragraphs what you did and how well it worked (e.g., send me scan results and true/false score thresholds as above). Also send me the refined .sto file you created. If you have the time and patience, scan more of the yeast genome to see how it does, in terms of false positives, for example. Definitely use Infernal's cmcalibrate on your model before doing the scan, so that you get the benefit of the HMM filtering and E-value calculations.
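A sketch of the build, calibrate, scan sequence as command lists, assuming Infernal 1.1 is installed and on your PATH. All file names are placeholders chosen for illustration. The --tblout option asks cmsearch to also write a tabular hit summary, which is easier to post-process than the main output. Expect cmcalibrate to take a while.

```python
import subprocess


def calibrated_search_commands(sto_file, cm_file, seq_file, tbl_file):
    """Return the three commands (argv lists) for a calibrated scan."""
    return [
        # Build the model from the refined Stockholm alignment (-F: overwrite).
        ["cmbuild", "-F", cm_file, sto_file],
        # Calibrate E-value parameters; this modifies cm_file in place.
        ["cmcalibrate", cm_file],
        # Scan the sequence file, writing a tabular summary to tbl_file.
        ["cmsearch", "--tblout", tbl_file, cm_file, seq_file],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

For example, run_all(calibrated_search_commands("refined.sto", "refined.cm", "yeast.fa", "hits.tbl")), where all four names are hypothetical.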
"Alternate/Extra Credit" try the following:
cmsearch to scan the entire genome of one of these species to find more instances. (I'd guess this step will take an hour or more. Searching just intergenic regions would be faster, if you want to write a small script to pick them out.) The amorphous deliverable here is to tell me what you did/what you learned. E.g., how do these examples compare to the Rfam families? Can you refine the model manually and/or based on more examples to improve it? Do the MicroFootPrinter motifs relate to the riboswitch (and how)? Can you identify the other genes presumably being controlled? Do their annotations suggest a common denominator for what the riboswitch might be sensing? Etc.!
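One way to pick out intergenic regions, sketched under the assumption that you have already extracted gene coordinates (1-based, inclusive, strand ignored) from the genome annotation. Parsing the annotation file itself is left out, and the min_len cutoff is an arbitrary illustrative parameter:

```python
def intergenic_regions(genes, genome_len, min_len=1):
    """Return 1-based inclusive (start, end) gaps not covered by any gene.

    genes: iterable of (start, end) pairs; overlapping genes are handled.
    Gaps shorter than min_len are dropped.
    """
    regions = []
    next_free = 1  # first position not yet covered by a gene
    for start, end in sorted(genes):
        if start > next_free and start - next_free >= min_len:
            regions.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    # Trailing gap between the last gene and the end of the genome.
    if genome_len >= next_free and genome_len - next_free + 1 >= min_len:
        regions.append((next_free, genome_len))
    return regions
```

The resulting intervals can be used to slice the genome sequence into a smaller FASTA file for cmsearch; remember to add each region's start offset back to any hit coordinates.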
As I said in class, I think you will learn something whether you spend 5 hours on this, or 50. You get to decide where your personal cost/benefit tradeoff is optimized. Email the "deliverables" you complete to me (ruzzo at uw.edu), say, by the end of classes this quarter. Bundling them all into one zip or tar archive would probably be most convenient.
Please don't hesitate to contact me if you have questions, problems installing the software, etc.
Computer Science & Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350, (206) 543-1695 voice, (206) 543-2969 FAX