Homework #1

Due Thursday, January 15, 2009, at the beginning of class. Assignments turned in more than 5 minutes after the beginning of class will be penalized 10 points, with an additional 10 points every 24 hours thereafter.

  1. (10 points) Here is a pair of aligned protein sequences:

    GDIFYPGYCPDVKPVNKQFDLSAFAGAWHEIAKLP
    GDNFHLGKCPSPLPVQENFDVKKYLGRWYEIEKIP
    

    If this alignment were included in the data set used to generate statistics for the BLOSUM matrices, which of the following matrices would it be used to help generate: BLOSUM90, BLOSUM80, BLOSUM62, BLOSUM52, BLOSUM45? Why?
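
The number in a BLOSUM matrix name refers to a percent identity threshold, so it can help to compute the percent identity of an alignment. The following is a minimal sketch, assuming an ungapped alignment of two equal-length strings; the function name `percent_identity` is hypothetical, and the sequences are toy examples rather than the ones given above.

```python
def percent_identity(seq_a, seq_b):
    """Return the percentage of aligned columns holding identical residues.

    Assumes an ungapped alignment, i.e. two strings of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if not seq_a:
        raise ValueError("alignment is empty")
    # Count the columns in which both sequences carry the same letter.
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # Toy sequences (not the ones from the problem): 3 of 4 columns match.
    print(percent_identity("ACGT", "ACGA"))  # prints 75.0
```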

  2. (5 points) You can find a copy of the BLOSUM45 matrix at ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM45. Which amino acid has the largest number of negative scores associated with it? Why?

  3. (10 points)

    RVVNLVP----WVLATDYKNY
    QFFPLMPPAPYWILATDYENY
    

    Score the above alignment using

    • BLOSUM45 with a linear gap penalty of -4
    • BLOSUM80 with affine gap penalties: gap open of -9 and gap extension of -1

    You can find the BLOSUM80 matrix at ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM80. Be sure to show your work.
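
The bookkeeping for the two gap schemes can be sketched in code. This is an illustration under stated assumptions, not a definitive scorer: the names `score_alignment` and `toy_substitution` are hypothetical, the toy +1/-1 scores stand in for a real BLOSUM matrix, and the affine convention used here (the first position of a gap costs the gap open penalty, each further position costs the gap extension penalty) is only one of several in use, so check it against the convention used in class. A linear gap penalty is the special case where both values are equal.

```python
def score_alignment(row_a, row_b, substitution, gap_open, gap_extend):
    """Score a given pairwise alignment; '-' marks a gap position.

    Assumed convention: the first position of a gap run costs gap_open and
    every additional position in the same run costs gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    total = 0
    previous_gap_row = None  # which row ("a" or "b") had a gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            # Continue an existing gap run in the same row, or open a new one.
            total += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            total += substitution(a, b)
            previous_gap_row = None
    return total


def toy_substitution(a, b):
    """Hypothetical stand-in for a BLOSUM lookup: +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


if __name__ == "__main__":
    # Four matching columns and one gap of length two.
    print(score_alignment("AC--GT", "ACTTGT", toy_substitution, -4, -4))  # linear: prints -4
    print(score_alignment("AC--GT", "ACTTGT", toy_substitution, -9, -1))  # affine: prints -6
```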

  4. (20 points) Draw and fill in the dynamic programming matrix to align these two sequences: CATTC and CGATC. Use this substitution matrix:

        A   C   G   T
    A   2  -7  -3  -7
    C  -7   2  -7  -3
    G  -3  -7   2  -7
    T  -7  -3  -7   2

    and use a fixed gap penalty of -5. What is the score of the optimal global alignment?
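
For checking a hand-filled matrix, the global alignment recurrence can be sketched as follows. This is a minimal sketch assuming a linear gap penalty; the function name `global_alignment_matrix` and the toy +1/-1 scoring are hypothetical, and the example deliberately uses different sequences and scores from the ones in this problem.

```python
def global_alignment_matrix(seq_a, seq_b, substitution, gap):
    """Fill the global alignment dynamic programming matrix.

    Rows correspond to seq_a and columns to seq_b. The optimal global
    alignment score ends up in the bottom-right cell.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    matrix = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        matrix[i][0] = i * gap
    for j in range(1, cols):
        matrix[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = matrix[i - 1][j - 1] + substitution(seq_a[i - 1], seq_b[j - 1])
            up = matrix[i - 1][j] + gap    # letter of seq_a aligned to a gap
            left = matrix[i][j - 1] + gap  # letter of seq_b aligned to a gap
            matrix[i][j] = max(diagonal, up, left)
    return matrix


def toy_substitution(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


if __name__ == "__main__":
    matrix = global_alignment_matrix("GA", "GCA", toy_substitution, -2)
    for row in matrix:
        print(row)
    print("optimal score:", matrix[-1][-1])  # prints optimal score: 0
```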

  5. (10 points) Write a program that takes as input the first three command line arguments (after the program name) and prints them in uppercase letters on a single line with spaces between.

    > python get-three-args.py con stan tinople
    CON STAN TINOPLE
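
In Python, command line arguments are available as the list `sys.argv`, with the program name at index 0. The sketch below shows one way to read and format them; the helper name `format_args` and its `separator` parameter are hypothetical, chosen only for illustration.

```python
import sys


def format_args(argv, separator=" "):
    """Uppercase the first three arguments after the program name and join them.

    argv is the full argument list, with the program name at index 0.
    """
    return separator.join(arg.upper() for arg in argv[1:4])


if __name__ == "__main__":
    if len(sys.argv) < 4:
        # Too few arguments: report the expected usage instead of failing.
        print("usage: python script.py ARG1 ARG2 ARG3", file=sys.stderr)
    else:
        print(format_args(sys.argv))
```

Passing `separator=""` joins the arguments with nothing between them.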
    
  6. (15 points) Write a program similar to the previous one, but print the three arguments without spaces between.

    > python get-three-args.py con stan tinople
    CONSTANTINOPLE
    
  7. (15 points) Write a program that takes as input two command line arguments: the first argument is a DNA or protein sequence, and the second is an integer n. Print the nth character of the given sequence, counting from 1 (so n = 1 is the first character).

    > python get-nth-character.py curmudgeon 5
    u
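
Python strings are indexed from 0, while the example above counts characters from 1, and command line arguments arrive as strings. A minimal sketch of the conversion, with the hypothetical helper name `nth_character`:

```python
def nth_character(sequence, n_text):
    """Return the nth character of sequence, counting from 1.

    n_text is the position as a string, the way it arrives on the command line.
    """
    n = int(n_text)  # command line arguments are strings
    if n < 1 or n > len(sequence):
        raise ValueError("n must be between 1 and the sequence length")
    return sequence[n - 1]  # shift from 1-based to 0-based indexing


if __name__ == "__main__":
    print(nth_character("curmudgeon", "5"))  # prints u
```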
    
  8. (15 points) Write a program that takes as input two command line arguments, counts how many times the second one appears inside the first one, and then tells the user how many there are, like this:

    > python count-substrings-in-string.py acgtacgtttgacgtacc acg
    The substring acg appears in the sequence acgtacgtttgacgtacc 3 times.
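
One point worth knowing when counting substrings: Python's built-in `str.count` counts non-overlapping occurrences. For the example above this makes no difference, since `acg` cannot overlap itself, but for a pattern such as `aa` the two ways of counting disagree. The sketch below contrasts them; `count_overlapping` is a hypothetical helper, and the problem statement does not say which interpretation is intended.

```python
def count_overlapping(sequence, pattern):
    """Count occurrences of pattern in sequence, allowing overlaps."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Resume the search one position later so overlapping hits are found.
        start = sequence.find(pattern, start + 1)
    return count


if __name__ == "__main__":
    sequence, pattern = "acgtacgtttgacgtacc", "acg"
    print(sequence.count(pattern))               # prints 3
    print(count_overlapping(sequence, pattern))  # prints 3
    print("aaaa".count("aa"))                    # prints 2 (non-overlapping)
    print(count_overlapping("aaaa", "aa"))       # prints 3 (overlapping)
```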