Exercise 3 - Key

Section A            45 pts

Section B            55 pts

Section C            50 pts

Questions           144 pts

TOTAL     294 pts


{A. Dot Matrix Comparisons.}

{1. Find three protein and three DNA sequences as described above. Convert the sequences to GCG format as you did in exercise 2.}

5 pts for sequences

{2. Read the documentation on the COMPARE and DOTPLOT programs. Run COMPARE several times with a variety of "stringency" or threshold values. Make a small table showing the number of points for the related and unrelated sequences at different threshold levels.}

10 pts for making the table

{3. Compare the dotplots for the related and unrelated sequences. Can you clearly distinguish the similar regions in your plot?}

5 pts for production of dotplots

5 pts for answering the question

{4. Repeat 2 and 3 with your DNA sequences.}

10 pts for making the table

5 pts for production of dotplots

5 pts for answering the question

{B. Sequence Alignments.}

{1. Global alignment using the GCG GAP program. As usual step 1 is to read the documentation. We will also use the related program BESTFIT; read this document as well.}

{2. Run the GAP program and make an alignment.}

10 pts for making an alignment


{3. Use the "FETCH" command to get a copy of the default scoring table and look at the values in it.}

5 pts for fetching and commenting about the scoring table


{4. Repeat the alignment with different values for the "gap creation" and "gap extension" penalties.}

10 pts for doing alignment with different penalties

5 pts for answering the question


{5. Repeat B2 and B4, above, using the program BESTFIT.}

10 pts for doing alignment with different penalties

5 pts for answering the question


{6. Run GAP with Monte Carlo statistics turned on by using the -random switch. }

10 pts for doing Monte Carlo and Z-score
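As a rough illustration of the Monte Carlo idea, the sketch below shuffles a sequence (keeping its residue composition) and computes a Z-score as (observed score - mean of randomized scores) / standard deviation of randomized scores. This is the usual definition of a Z-score and is assumed here, not taken from the GCG documentation. The function names, the example sequence, and the scores are all made up for illustration.

```python
import random
import statistics


def shuffle_sequence(seq, rng):
    """Return a shuffled copy of seq with the same residue composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def z_score(observed, random_scores):
    """Distance of the observed score from the randomized mean, in SD units."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)  # sample standard deviation
    return (observed - mean) / sd


rng = random.Random(0)  # fixed seed so the run is repeatable
original = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence
randomized = shuffle_sequence(original, rng)

# Shuffling changes the order but not the composition.
print(sorted(randomized) == sorted(original))

# Hypothetical alignment scores: one real alignment, five randomized ones.
print(z_score(60, [40, 44, 36, 42, 38]))
```

In a real analysis the randomized scores would come from aligning the first sequence against many shuffled copies of the second.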


{C. Searching with FASTA}

{1. Obtain sequences to use.}

5 pts for the sequences


{2. Run FASTA on your protein sequence using one of the recommended  FASTA web sites and a ktup of 2.}

5 pts for running FASTA

10 pts for summary and answering the questions


{3. Repeat this search with a ktup of 1.}

15 pts for running FASTA, summary and answering the questions


{4. Perform a search using the cognate DNA sequence.}

15 pts for running FASTA, summary and answering the questions


 

{D. Questions.}    144 pts
1. What is the lowest score possible for comparing two amino acid residues using the default scoring table used with the GCG GAP program? [3 pts]
Many pairs score -4: B:L, B:W, C:E, C:Z, D:L, D:W, E:C, F:P, G:I, G:L, N:W, P:W.
2. What is the highest? [3 pts]
W:W, score = 11.
3. Is a 95% threshold (i.e., 5% of windows expected to score higher than the threshold) based on unrelated sequences a good threshold for a dotplot? Why or why not? Hint: Consider how many "matching" windows and "non-matching" windows you expect in related sequences. [10 pts]
It depends on the length of the sequences. A 95% threshold means that 5% of the windows score above the threshold in unrelated sequences (or on the unrelated diagonals of truly related sequences). For two sequences 100 residues long there are about 10,000 windows, so you expect about 500 randomly matching windows. If the sequences are very similar, you would expect to see about 100 windows that represent the true match (assuming they all score above the threshold). This gives about 600 total points, a little high by my "3-5 times the sequence length" rule of thumb. For sequences 1000 residues long, there are about 1,000,000 windows, so you would get about 50,000 randomly matching windows and only 1000 true matches. This would be an even noisier plot. The problem is that the number of random matches at a 5% threshold scales with the square of the sequence length, but the number of true matches scales linearly with the sequence length.
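The arithmetic in this answer can be checked with a few lines of Python. This is only a sketch: it assumes two sequences of equal length, roughly length x length windows, and one true match per position along the main diagonal. The function name is made up for illustration.

```python
def expected_dots(length, random_percent=5):
    """Return (random matches, true matches) expected in a dotplot.

    Assumes two sequences of the same length, about length * length
    windows in total, and one true matching window per position.
    """
    windows = length * length
    random_hits = windows * random_percent // 100
    true_hits = length
    return random_hits, true_hits


for n in (100, 1000):
    random_hits, true_hits = expected_dots(n)
    print(n, random_hits, true_hits, random_hits // true_hits)
```

The last column is the ratio of random to true matches, which grows in proportion to the sequence length.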
4. For two distantly related protein sequences, would a longer window in COMPARE/DOTPLOT make it easier to detect the similarity? Explain? [5 pts]
Up to a point, yes. When using a longer window, you are effectively averaging over more observations (each observation is a pair of amino acid residues, one from each sequence). Averaging over more observations should allow you to distinguish between distributions whose means are closer together, e.g. the residue pairs in distantly homologous sequences and in unrelated sequences. However, as sequences become more distant, insertions and deletions accumulate. Insertions and deletions disrupt the one-to-one matching across the window, making a longer window less effective than a short one. The tradeoff is between increased discrimination and sensitivity with a long window, and the possibility of missing related regions that contain insertions and deletions.
5. Are "randomized" sequences the same as unrelated sequences? Why or why not? [3 pts]
"Randomized" sequences, as used by GAP and BESTFIT, take a given sequence and shuffle the residues around in order to eliminate diagonals of sequence similarity, while maintaining the residue composition. Randomized sequences are therefore good models of unrelated sequences in which all information on residue order has been removed. However, truly unrelated sequences often have weak similarities that are due to the fact that they are biological molecules. In proteins, features such as the clustering of hydrophobic and hydrophilic amino acid residues in the interior and exterior of the protein respectively, transmembrane spans, and patterns characteristic of alpha helices and beta sheets all give a vague similarity to any pair of protein sequences. In DNA, features such as overabundance of A/T in promoter regions, or avoidance of CpG sequences, produce similar effects.
6. Describe what you would see when you make a dotplot of two sequences, one of which has two tandem repeats of a 20 residue long sequence? [5 pts]
One would see two parallel diagonals. Depending on the size of the window, the length of the diagonals along each sequence may vary slightly. The diagonals would have a space of 20 residues between them.
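A small sketch of this case, assuming that the other sequence carries a single copy of the 20-residue unit and that a dot is drawn only for an exact match across the whole window (COMPARE uses a scoring threshold instead). The sequences and the function name are made up for illustration.

```python
def matching_diagonals(seq_a, seq_b, window):
    """Return the diagonal offsets (j - i) with at least one exact window match."""
    offsets = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                offsets.add(j - i)
    return offsets


unit = "ACDEFGHIKLMNPQRSTVWY"              # hypothetical 20-residue unit
single = unit                              # one copy
repeated = "AAAA" + unit + unit + "CCCC"   # two tandem copies with flanks

diagonals = sorted(matching_diagonals(single, repeated, window=10))
print(diagonals)                       # offsets of the two diagonals
print(diagonals[1] - diagonals[0])     # spacing between them
```

The two offsets differ by the length of the repeat unit, which is the 20-residue spacing described above.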
7. Which is better able to detect similarity between two sequences, a protein dotplot or a nucleotide dotplot? Why? [5 pts]
Protein dotplots and alignments are generally more sensitive. The 20-letter protein alphabet makes it easier to distinguish related sequences from noise, and protein sequences are not as prone to compositional effects (e.g. different GC composition) as nucleic acid sequences.
8. What does it mean to weight end gaps? [5 pts]
To weight end-gaps means to charge a penalty for gaps at the ends of the sequences, just as for internal gaps.
9. Does the gap program weight end gaps? [5 pts]
No, the default setting is to not weight end-gaps.
10. Does the bestfit program weight end gaps? [5 pts]
No, local alignment approaches never charge penalties for the gaps at the ends of the sequences.
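The effect of weighting end gaps (questions 8 to 10) can be seen in a simplified global alignment. The sketch below is not the GAP implementation: it assumes a linear gap penalty and simple match/mismatch scores in place of GAP's scoring table and separate creation and extension penalties. All names and values are made up for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2, weight_end_gaps=True):
    """Score of the best global alignment of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if weight_end_gaps:
        # Leading gaps are charged like internal gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    if weight_end_gaps:
        return score[-1][-1]
    # Trailing gaps are free: take the best cell in the last row or column.
    return max(max(score[-1]), max(row[-1] for row in score))


short, long_seq = "ACGT", "TTACGTTT"
print(global_score(short, long_seq, weight_end_gaps=True))
print(global_score(short, long_seq, weight_end_gaps=False))
```

With weighted end gaps the four overhanging positions are penalized; with unweighted end gaps the short sequence can sit anywhere along the long one at no cost.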
11. Assume that you have made an alignment using GAP with the gap penalties set so high that no gaps can be inserted into the alignment. Will this alignment include the best diagonal segment seen in a dotplot? [10 pts]
Probably, assuming that we are talking about a long region of similarity. When using GAP with very high gap penalties, no insertions or deletions are allowed. The best diagonal region may not be included in the alignment if there are sufficient "random" matches in other regions to achieve a higher score. This is because GAP is a global alignment program and will try to fit all of both sequences.
12. Do you expect a different result for BESTFIT? Explain. [5 pts]
Yes, the result should be different than with GAP. The BESTFIT alignment should be trimmed at the ends to remove less similar regions.
13. If you align two sequences that have only a 20 residue segment in common, do you expect the same or different results as in the last question? Explain. [5 pts]
The results may be strikingly different. GAP is highly likely to miss the short similarity, as explained in 11, above. BESTFIT looks only for local alignments and should find the parts of each sequence that align to each other. BESTFIT is much more likely to find the short region of strong similarity regardless of its location.
14. Assume that protein B is derived from protein A by a duplication that happened many millions of years ago. That is, protein B was originally a tandem repeat of protein A, and the sequences have now extensively diverged. Protein B remains approximately twice as long as protein A. What program would you expect to give a better alignment, GAP or BESTFIT. Why? [10 pts]
BESTFIT would be more likely to find the alignment by aligning A alternately with the first half of B then the second half of B. GAP with a relatively high gap penalty MIGHT align A with either the first or second half of B, but this is unlikely (note that GAP uses unweighted end gaps). It is more likely that GAP would distribute sequence A over the whole length of B, because in some areas A is more like the first repeat in B, and in others more like the second.
15. For the duplication described above, describe what the dotplot would look like? [5 pts]
The dotplot would look similar to the case described in 6, above, except that the diagonals would have many spaces indicating regions of divergence.
16. Suggest an appropriate gap opening and gap extension penalty setting to use when comparing a eukaryotic cDNA and genomic DNA. [5 pts]
Since introns may be very long, you would want to use a low gap extension penalty, e.g., zero. The gap opening penalty should be relatively large so that gaps are only inserted when introns are encountered, e.g., ten (assuming matches score one).
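To see why a zero extension penalty suits introns, the sketch below assumes the common affine form in which a gap of length n costs opening + extension x n. Some programs charge extension x (n - 1) instead, so treat the formula as an assumption. The function name and the intron length are made up for illustration.

```python
def gap_cost(length, opening, extension):
    """Affine gap cost, assuming cost = opening + extension * length."""
    return opening + extension * length


intron_length = 5000  # hypothetical intron

# Zero extension penalty: the cost does not depend on intron length.
print(gap_cost(intron_length, opening=10, extension=0))

# Even a small extension penalty swamps the match scores (one per base).
print(gap_cost(intron_length, opening=10, extension=1))
```

With matches scoring one, an opening penalty of ten means a gap is only worthwhile when it recovers more than ten matching bases, which keeps spurious short gaps out of the exons.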
17. In FASTA, what is the meaning of the init1 score? of the initn score? [10 pts]
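In the first step of FASTA, the algorithm identifies matching ktup-long words in the two sequences being compared. Words on the same diagonal are assembled into initial matching regions and rescored using a 20x20 scoring table. The init1 score is the score of the highest scoring initial diagonal. The initn score is the score obtained by combining several initial diagonals into a longer assemblage, allowing some gaps (see 19).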
18. What is the meaning of the opt score? [5 pts]
In FASTA, the opt score is an optimized Smith-Waterman dynamic programming alignment score computed in the region of the init1 diagonal.
19. Is the initn score ever likely to be less than the init1 score? Why? [5 pts]
The initn score is the score achieved by combining initial diagonals into longer assemblages including some gaps. This score cannot be lower than the init1 score and should almost always be higher.
20. Why is the opt score sometimes less than the initn score? [10 pts]
The dynamic programming alignment uses different penalties and may include regions not on the initial diagonals. It can thus be lower than initn.
21. What is a lookup table? [5 pts]
In FASTA, a lookup table is an index that tells where each ktup-long word is found in each sequence.
22. How does FASTA use a lookup or "Hash" Table? [5 pts]
The lookup or hash table is used to identify initial diagonals with high frequencies of identical matches. The lookup tables for pairs of sequences are compared and local matching regions built up, using the difference between the ktup positions in the two tables to identify which ktups lie on the same diagonals.
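A minimal sketch of the lookup-table idea, not the FASTA implementation. As a simplification it builds one table for the query and scans the second sequence against it, and it only counts ktup matches per diagonal without rescoring them. The function names and sequences are made up for illustration.

```python
from collections import Counter, defaultdict


def build_lookup(seq, ktup):
    """Map each ktup-long word to the list of positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(seq) - ktup + 1):
        table[seq[pos:pos + ktup]].append(pos)
    return table


def diagonal_counts(query, subject, ktup):
    """Count matching ktup words on each diagonal (subject pos - query pos)."""
    q_table = build_lookup(query, ktup)
    counts = Counter()
    for s_pos in range(len(subject) - ktup + 1):
        word = subject[s_pos:s_pos + ktup]
        for q_pos in q_table.get(word, ()):
            counts[s_pos - q_pos] += 1
    return counts


query = "MKTAYIAK"        # hypothetical sequences
subject = "GGMKTAYIAKGG"
print(diagonal_counts(query, subject, ktup=2).most_common(1))
```

Words whose positions differ by the same amount lie on the same diagonal, so the diagonal with the highest count marks the most promising initial region.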
23. Why is a lookup table useful in FASTA? [5 pts]
The lookup table is much smaller than the sequence. Two lookup tables can therefore be compared much more quickly.
24. What is a ktup? [5 pts]
In FASTA, the ktup is the length of the substring or word used to construct the lookup table.
25. How are e-values calculated in FASTA? [5 pts]
In FASTA, e-values are calculated by an iterative fitting process in which the results of the database search are fit to an extreme-value distribution. Values very far from the expected value, i.e. those likely to be homologs, are thrown out in several cycles of iterative fitting.