Note: The Infernal package has recently been updated to version 1.0. CMfinder (and hence the following assignment) is based on the earlier 0.81 Infernal release and, unfortunately, the covariance model files built by Infernal release 0.81 (and therefore by CMfinder) appear to be incompatible with those expected by version 1.0. So you should download and use the 0.81 version of Infernal rather than the latest version.
Download the Infernal software package (infernal.tar.gz, version 0.81).
Read 00README and sections 1-3 of Userguide.pdf (skip or skim some of the installation details like large file system, rigfilters, MPI, and probably some of the usage options including local alignments, accelerating alignments and optional annotation).
Build and install it following the instructions in section 2 of the manual. (If you do the make install step, it copies the 4-6 executable files including cmalign, cmbuild, cmscore, cmsearch into /usr/local/bin; you can easily delete them afterwards, if you don't want to keep them. Alternatively, add an appropriate --bindir option to ./configure, or add .../infernal-0.81/src to your path, so these programs can be found.)
I had no success installing this version on my Mac (even after applying the patch given on the infernal page; probably I did something stupid), but installation on Linux was smooth.
Follow the tutorial steps outlined in section 3 "Getting Started".
The cmbuild example builds a model ("my.cm") for tRNA based on 5 yeast tRNAs. Given so few sequences and such closely related ones, it's a surprisingly good model. I've extracted a handful of tRNA sequences from the Genbank records for Pyrococcus furiosus (an anaerobic archaeon found in 100°C sediments near sea floor vents, presumably not a close relative of S. cerevisiae). Here are 3 versions of the sequences:
Run cmsearch on pfur.fa using your "my.cm" model. [Note: cmsearch will also work on pfur.gb, but it seems to silently replace letters other than ACGT/U by random nucleotides, so the Genbank comments (lines beginning with semicolons) become "junk DNA". This is unlikely to match the CM, but does disrupt the coordinate system.]
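Since the replacement described in the note is silent, it can be worth checking a sequence file for stray letters before searching it. A minimal sketch, assuming plain FASTA input (">" header lines followed by sequence lines); the function name and the sample text are made up for illustration and are not part of Infernal:

```python
def count_non_nucleotide_letters(fasta_text, allowed="ACGTU"):
    """Count letters outside `allowed` on sequence lines, per FASTA record."""
    counts = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = 0
        elif line and name is not None:
            counts[name] += sum(
                1 for ch in line if ch.isalpha() and ch.upper() not in allowed
            )
    return counts


# Hypothetical input for illustration only.
sample = ">a some description\nACGTNNRY\nacgu\n>b\nACGT\n"
print(count_non_nucleotide_letters(sample))
```

A nonzero count for a record means some of its letters would not be searched as written.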
Deliverable #1: send me the output of cmsearch above, together with the scores of the lowest scoring true tRNA and highest scoring false tRNA (true/false according to the Genbank annotation). How do these compare to the "rough guide" for score significance given near the bottom of page 10 of the user guide?
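Picking out the two scores asked for here is simple bookkeeping once each hit has been labeled. A minimal sketch, assuming you have transcribed each hit's score by hand and marked it true or false against the Genbank annotation; the `example_hits` values are made-up placeholders, not real cmsearch output:

```python
def score_thresholds(hits):
    """Return (lowest true-hit score, highest false-hit score).

    `hits` is a list of (score, is_true) pairs. Either value is None
    when there are no hits of that kind.
    """
    true_scores = [score for score, is_true in hits if is_true]
    false_scores = [score for score, is_true in hits if not is_true]
    lowest_true = min(true_scores) if true_scores else None
    highest_false = max(false_scores) if false_scores else None
    return lowest_true, highest_false


# Hypothetical scores for illustration only.
example_hits = [(61.2, True), (48.7, True), (12.3, False), (9.8, False)]
print(score_thresholds(example_hits))
```

If the lowest true score is above the highest false score, any threshold between the two separates the hits cleanly.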
Note: cmsearch searches both strands, and coordinates on the "hit n:..." lines are always with respect to the input sequence, but the coordinates it reports in its alignments for hits on the reverse strand count positions from the front of the reversed sequence.
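To compare a reverse-strand alignment against the Genbank annotation, its positions need to be mapped back to the input sequence. A minimal sketch of that arithmetic, assuming 1-based coordinates and that the "reversed sequence" is the reverse complement of the whole input sequence; the function names and example numbers are made up for illustration:

```python
def revcomp_to_forward(pos, seq_len):
    """Map a 1-based position on the reversed sequence back to the
    1-based position on the original input sequence."""
    if not 1 <= pos <= seq_len:
        raise ValueError("position out of range")
    return seq_len - pos + 1


def revcomp_interval_to_forward(start, end, seq_len):
    """Convert an interval (start <= end) on the reversed sequence to
    input-sequence coordinates, returned as (low, high)."""
    return revcomp_to_forward(end, seq_len), revcomp_to_forward(start, seq_len)


# Hypothetical hit at positions 11..83 of a reversed 1000-base sequence.
print(revcomp_interval_to_forward(11, 83, 1000))  # (918, 990)
```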
This model did pretty well, but maybe that's all due to Eddy having very carefully selected his example tRNA sequences and very carefully aligning them manually.
Use Zizhen Yao's CMfinder to automatically discover a tRNA motif in the P. furiosus sequences. The webserver is old and slow (5-15 minutes for this example), so you may prefer to download and install the software via the above link. Since 8 of the 10 tRNAs in this data happen to be on the reverse strand, I suggest you use pfurrc.fa rather than pfur.fa for this step; CMfinder only looks at one strand. I'd suggest you set CMfinder's parameter for expected number of stemloops to 3.
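Because CMfinder only looks at one strand, you may want to reverse-complement other sequence files yourself before feeding them to it. A minimal sketch, assuming plain FASTA text with DNA letters; the file name suggests pfurrc.fa is the reverse complement of pfur.fa, but that is an assumption, and this is not necessarily how that file was made:

```python
# Complement table for DNA; N and any other letters are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reverse_complement_fasta(text, width=60):
    """Reverse-complement every record, wrapping sequence lines at `width`."""
    out = []
    for header, seq in parse_fasta(text):
        rc = reverse_complement(seq)
        out.append(">" + header + " (reverse complement)")
        out.extend(rc[i:i + width] for i in range(0, len(rc), width))
    return "\n".join(out) + "\n"


print(reverse_complement_fasta(">x\nAACG\nT\n"), end="")
```

Remember that positions reported on the reverse-complemented file count from the other end of the original sequence.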
Use this model to cmsearch pfur.fa. It will do pretty well -- no surprise, since it can find the sequences it was built from. Also use it to search the tutorial.fa file from the Infernal distribution, which just contains the 5 yeast tRNAs from which you built my.cm. You should find that the CMfinder model built from the P. furiosus data doesn't do as well at recognizing yeast tRNAs as the hand-built yeast model did at finding P. furiosus tRNAs.
Deliverable #2: Send me the results of scanning tutorial.fa, together with the lowest true positive and highest false positive scores.
Improve the CMfinder model so that it does a better job of finding yeast tRNAs without significantly reducing its success on P. furiosus tRNAs. Try to think about doing this in a situation where you have a few "trusted" examples, e.g. the ones in P. furiosus, but none in yeast. There are several ways I can think of that might accomplish this. E.g.:
You can probably think of other strategies.
Deliverable #3: Try one or more of the above strategies, and/or one or more of your own, and tell me in a couple paragraphs what you did and how well it worked (e.g., send me scan results and true/false score thresholds as above). Also send me the refined .sto file you created. If you have the time and patience, scan more of the yeast genome to see how it does, in terms of false positives, for example. Infernal includes an implementation of some of the HMM filtering ideas I talked about; read about --hmmfilter in Userguide.pdf. Alternatively, you can get the latest version of Zasha Weinberg's RaveNnA filtering software here. (An early version is included with Infernal but requires some non-default options during installation; look under "rigfilters" in Userguide.)
"Alternate/Extra Credit": do either or both of the following:
cmsearch --hmmfilter to scan the entire genome of one of these species to find more instances. (I'd guess this step will take an hour or more. Searching just intergenic regions would be faster, if you want to write a small script to pick them out.) The amorphous deliverable here is to tell me what you did/what you learned. E.g., how do these examples compare to the Rfam families? Can you refine the model manually and/or based on more examples to improve it? Do the MicroFootPrinter motifs relate to the riboswitch (and how)? Can you identify the other genes presumably being controlled? Do their annotations suggest a common denominator for what the riboswitch might be sensing? Etc.!
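The "small script" for picking out intergenic regions could look something like the following. A minimal sketch, assuming you have already pulled gene coordinates out of the annotation as 1-based inclusive (start, end) pairs and that the genome is treated as linear; the function name and example coordinates are made up for illustration:

```python
def intergenic_regions(genes, genome_len, min_len=1):
    """Return 1-based inclusive (start, end) intervals not covered by any gene.

    `genes` is an iterable of 1-based inclusive (start, end) pairs on
    either strand. Overlapping genes are handled by tracking the
    furthest gene end seen so far.
    """
    regions = []
    covered_to = 0  # last position covered by a gene so far
    for start, end in sorted(genes):
        if start > covered_to + 1:
            gap = (covered_to + 1, start - 1)
            if gap[1] - gap[0] + 1 >= min_len:
                regions.append(gap)
        covered_to = max(covered_to, end)
    if covered_to < genome_len:
        gap = (covered_to + 1, genome_len)
        if gap[1] - gap[0] + 1 >= min_len:
            regions.append(gap)
    return regions


# Hypothetical gene coordinates for illustration only.
genes = [(100, 300), (250, 400), (900, 950)]
print(intergenic_regions(genes, 1000))  # [(1, 99), (401, 899), (951, 1000)]
```

Setting `min_len` to something near the expected motif length skips gaps too short to hold a hit.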
Email the "deliverables" to me (ruzzo at u.washington.edu), preferably by Monday, May 25. Bundling them all into one zip or tar archive would probably be most convenient.
Please don't hesitate to contact me if you have questions, problems installing the software, etc.
Computer Science & Engineering University of Washington Box 352350 Seattle, WA 98195-2350 (206) 543-1695 voice, (206) 543-2969 FAX