Abstract: Non-coding RNAs (ncRNAs) are functional RNA molecules that do not code for proteins. Covariance Models (CMs) are a useful statistical tool for finding new members of an ncRNA gene family in a large genome database, using both sequence and, importantly, RNA secondary structure information. Unfortunately, CM searches are slow. This paper shows how to make CM searches faster while provably sacrificing none of their accuracy. Specifically, based on the CM, our software builds a profile hidden Markov model (HMM), which filters the genome database. This HMM is a rigorous filter, i.e., its filtering eliminates only sequences that provably could not be annotated as homologs. The CM is run only on what remains. Optimizing the HMM for filtering involves minimizing an exponential objective function with linear inequality constraints. For most known ncRNA families, this allows an 8-gigabase database to be scanned in 2-20 days instead of years, and it finds new family members missed by other techniques for speeding up CM searches.
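The rigorous-filter idea in the abstract can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the paper's software: the hairpin layout, the score table, and all names (`PAIR_SCORE`, `LEFT`, `RIGHT`, `scan`) are hypothetical, and the "expensive" model is a simple base-pair scorer standing in for a CM. The cheap filter scores each position independently, as a profile HMM would, and its per-position scores are chosen to satisfy the linear inequalities `LEFT[a] + RIGHT[b] >= PAIR_SCORE[a, b]`, so the filter score can never fall below the exact score. The split used here is just one valid choice; the paper additionally optimizes the filter scores, which this sketch does not attempt.

```python
from itertools import product

BASES = "ACGU"
WINDOW = 6
PAIRS = [(0, 5), (1, 4)]  # hypothetical hairpin: two base pairs around a 2-nt loop

# Toy pair scores (illustrative values, not taken from any real CM).
PAIR_SCORE = {(a, b): -2.0 for a, b in product(BASES, repeat=2)}
PAIR_SCORE.update({("G", "C"): 3.0, ("C", "G"): 2.0,
                   ("A", "U"): 1.0, ("U", "A"): 1.0})

# Split each pair score into two position-independent scores.
# Each half is at least PAIR_SCORE[a, b] / 2, so the sum is an upper bound.
LEFT = {a: max(PAIR_SCORE[a, b] for b in BASES) / 2 for a in BASES}
RIGHT = {b: max(PAIR_SCORE[a, b] for a in BASES) / 2 for b in BASES}


def is_rigorous():
    """Check the linear inequalities that make the filter an upper bound."""
    return all(LEFT[a] + RIGHT[b] >= PAIR_SCORE[a, b]
               for a, b in product(BASES, repeat=2))


def exact_score(w):
    """Stand-in for the slow, structure-aware model."""
    return sum(PAIR_SCORE[w[i], w[j]] for i, j in PAIRS)


def bound_score(w):
    """Cheap filter: scores each position on its own, ignoring pairing."""
    return sum(LEFT[w[i]] + RIGHT[w[j]] for i, j in PAIRS)


def scan(seq, threshold, use_filter=True):
    """Return (hit start positions, number of exact-model evaluations)."""
    hits, exact_calls = [], 0
    for start in range(len(seq) - WINDOW + 1):
        w = seq[start:start + WINDOW]
        # Safe to skip: the exact score cannot exceed the bound.
        if use_filter and bound_score(w) < threshold:
            continue
        exact_calls += 1
        if exact_score(w) >= threshold:
            hits.append(start)
    return hits, exact_calls
```

Calling `scan(seq, threshold)` and `scan(seq, threshold, use_filter=False)` on the same input should return identical hit lists, while the filtered scan evaluates the exact model on fewer windows. That equality is what "rigorous" means here.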
Keywords: Non-coding RNA, gene families, covariance models, genome annotation, profile hidden Markov models, Iron Response Element, hyperthermophile archaea snoRNA, Histone Downstream Element, rigorous filter
Download Preprint: PDF
E-mail: ruzzo /at/ cs /dot/ washington /dot/ edu