GENOME 559, Winter 2009, Homework 6

Introduction to Statistical and Computational Genomics
GENOME 559
Department of Genome Sciences
University of Washington School of Medicine

Homework 6

Due: 3/5
Notes: Again, please track hours you spend on this, and report them here. (As before, it's not quite anonymous, but close.) There are extra credit options; I again suggest that you complete the basic version first, and turn in both.
Electronic Turnin: Here. Please turn in hw6.py. If you also did extra credit, include additional file(s) with "-ec" added to the file name, e.g., hw6-ec.py. Also include any necessary Python module files you might have chosen to create. The short "written" portion of the assignment may either be turned in on paper in class, or electronically.

Problems:
In this homework, you'll calculate the "Sum of Pairs" score for a multiple alignment. The SP score for an alignment is the sum of the SP scores for its columns, and the SP score for a column is the sum of the scores for all pairs of rows in the column. E.g., the SP score, based on BLOSUM62 scores for pairs, for a column containing (A,A,-,R) is +2: +4 for the single A-A pair, -1 for each of the two A-R pairs (and nothing for the 3 pairs involving '-'; yes, I'm ignoring gaps).
As described in lecture, an amino acid substitution matrix is a 20 x 20 symmetric matrix of integers whose entries give "scores" for substituting one amino acid for another. The file b62.txt is the BLOSUM62 substitution matrix; the 1st row and column label the amino acids.
Write a python function readSubMatrix(filename) to read such a file, returning a tuple (mat, map) where mat is the score matrix (20 x 20, or 21 x 21, as you prefer, but with the scores as ints, not character strings) and map is a dictionary mapping amino acids (single letter codes, as in b62.txt) to row/column numbers in mat. E.g., mat[map['A']][map['G']] gives the score for an alanine/glycine substitution. You may assume the number of rows and columns are equal, and that the aa labels are in the same order for both. (Your solution or my solution to hw5 may be useful.)

Read in a group of aligned sequences in Phylip format, as in hw4. (Your solution or Mary's solution to hw4 may be useful.) Use hem6.txt as your test input.
Calculate and print the SP score for each column of the alignment, and the total score. E.g., a portion of my answer is:
                 1   2   3   4   5   6   7   8   9  10  ...  35  36  37  38  39  40  
    P68871homo   -   V   H   L   T   P   E   E   K   S  ...   L   -   V   Y   P   W 
    P68873chmp   -   V   H   L   T   P   E   E   K   S  ...   L   -   I   Y   P   W 
    P02088mous   -   V   H   L   T   D   A   E   K   A  ...   L   I   V   Y   P   W 
    P02112chic   -   V   H   W   T   A   E   E   K   Q  ...   L   I   V   Y   P   W 
    Q802A3fugu   M   V   E   W   T   D   Q   E   R   T  ...   L   I   V   Y   P   W 
    Q540F0cowp   M   V   A   F   S   D   K   Q   E   G  ...   L   I   V   A   P   A 
    Score:       5  60  39  13  55  11  20  60  42   2  ...  60  24  55  60 105  95 
    Total: 1340
Tip: A nested loop like
    for i in range(n):
        for j in range(n):
            do something with i & j
will, for example, visit the pair {1,2} twice; once with i=1, j=2 and again with i=2, j=1. It also includes the "pair" i=1, j=1. Sometimes that's just what you want, but not in this case. To get each pair only once, use for j in range(i+1,n): for the inner loop.
Extra Credit: Create a Python class to hold a substitution matrix. Its constructor (__init__ method) should build the object by reading a file, as above. Your class definition should include two other methods, one returning the score for a single pair of amino acids, the other returning the SP score for a column of residues in a multiple alignment. Use your class to score the given alignment, as in a-c. For extra extra credit, make a class for alignments, perhaps with index/slice-like operations to enable you to extract columns or subalignments.
Written Part: The alignment in hem6.txt is from a ClustalW alignment of 5 vertebrate hemoglobins and a plant leghaemoglobin, except that I mucked around with the sequences and alignment in the last 5 columns. (SwissProt accession numbers are part of the "species" identifiers in the file, if you're curious about the real deal.) The alignment would get a better SP score if the "-V/-I" in columns 36-37 of the human/chimp alignments were changed to "-V/I-", hence aligning the chimp I to I's in the other 4 species. Explain why a progressive alignment method is very unlikely to construct the later alignment.