{A. Dot Matrix Comparisons.}
{1. Find three protein and three DNA sequences as described above. Convert the sequences to GCG format as you did in exercise 2.}
{2. Read the documentation on the COMPARE and DOTPLOT programs. Run COMPARE several times with a variety of "stringency" or threshold values. Make a small table showing the number of points for the related and unrelated sequences at different threshold levels.}
{3. Compare the dotplots for the related and unrelated sequences. Can you clearly distinguish the similar regions in your plot?}
{4. Repeat 2 and 3 with your DNA sequences.}
{B. Sequence Alignments.}
{1. Global alignment using the GCG GAP program. As usual step 1 is to read the documentation. We will also use the related program BESTFIT; read this document as well.}
{2. Run the GAP program and make an alignment.}
{3. Use the "FETCH" command to get a copy of the
default scoring table and look at the values in it.}
{4. Repeat the alignment with different values for the "gap
creation" and "gap extension" penalties.}
{5. Repeat B2 and B4, above, using the program BESTFIT.}
{6. Run GAP with Monte Carlo statistics turned on by using
the -random switch. }
{C. Searching with FASTA}
{1. Obtain sequences to use.}
{2. Run FASTA on your protein sequence using one of the
recommended FASTA web sites and a ktup of 2.}
{3. Repeat this search with a ktup of 1.}
{4. Perform a search using the cognate DNA sequence.}
{D. Questions.}
144 pts
1. What is the lowest score possible for comparing two amino acid
residues using the default scoring table used with the GCG GAP
program? [3 pts]
Many pairs score -4: B:L, B:W, C:E,
C:Z, D:L, D:W, E:C, F:P, G:I, G:L, N:W, P:W.
2. What is the highest? [3 pts]
W:W, score = 11.
3. Is a 95% threshold (i.e., 5% of windows expected to score higher
than the threshold) based on unrelated sequences a good threshold
for a dotplot? Why or why not? Hint: Consider how many "matching"
windows and "non-matching" windows you expect in related
sequences. [10 pts]
It depends on the length of the
sequences. A 95% threshold means that 5% of the windows score above
the threshold in unrelated sequences (or on the unrelated diagonals
of truly related sequences). For two sequences of length 100 there
are about 10,000 windows, so you expect about 500 random matching
windows. If the sequences are very similar, you would expect to see
about 100 windows that represent the true match (assuming they all
score above the threshold). This gives about 600 total points, a
little high by my "3-5 times the sequence length" rule of thumb. For
two sequences of length 1000, there are about 1,000,000 windows, so
you would get about 50,000 randomly matching windows and only 1000
true matches. This would be an even noisier plot. The problem is that
the number of random matches at a fixed 5% threshold scales with the
square of the sequence length, but the number of true matches scales
linearly with the sequence length.
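The scaling argument can be checked with a short sketch. It assumes, as the answer does, that the number of windows is roughly the product of the two sequence lengths and that every window on the true diagonal scores above the threshold. The function name and its parameters are illustrative only.

```python
def expected_dots(length, false_positive_rate=0.05):
    """Return (random_dots, true_dots) for two sequences of equal length."""
    windows = length * length            # roughly one window per pair of positions
    random_dots = round(false_positive_rate * windows)
    true_dots = length                   # one main diagonal, assumed fully above threshold
    return random_dots, true_dots

for n in (100, 1000):
    random_dots, true_dots = expected_dots(n)
    # The noise-to-signal ratio grows linearly with the sequence length.
    print(n, random_dots, true_dots, random_dots / true_dots)
```

At length 100 the random dots outnumber the true ones about 5 to 1; at length 1000 the ratio is about 50 to 1.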
4. For two distantly related protein sequences, would a longer
window in COMPARE/DOTPLOT make it easier to detect the similarity?
Explain? [5 pts]
Up to a point, yes. When using
a longer window, you are effectively averaging over more observations
(each observation is a pair of amino acid residues, one from each
sequence). Averaging over more observations should allow you
to distinguish between distributions whose means are closer together,
e.g. the residue-pairs in distantly homologous sequences and unrelated
sequences. However, as sequences become more distant, insertions
and deletions accumulate. Insertions and deletions disrupt the
one-to-one matching across the window making a longer window less
effective than a short one. The tradeoff is between increased
discrimination and sensitivity with a long window, and the possibility
of missing related regions that contain insertions and deletions.
5. Are "randomized" sequences the same as unrelated
sequences? Why or why not? [3 pts]
"Randomized" sequences,
as used by GAP and BESTFIT, take a given sequence and shuffle
the residues around in order to eliminate diagonals of sequence
similarity, while maintaining the residue composition. Randomized
sequences are therefore good models of unrelated sequences in
which all information on residue order has been removed. However,
true unrelated sequences often have weak similarities that are
due to the fact that they are biological molecules. In proteins,
features such as the clustering of hydrophobic and hydrophilic
amino acid residues in the interior and exterior of the protein
respectively, transmembrane spans, patterns characteristic of
alpha helices and beta-sheets all give a vague similarity to any
pair of protein sequences. In DNA, features such as overabundance of
A/T in promoter regions, or avoidance of CpG sequences, have similar
effects.
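A minimal sketch of what "randomizing" a sequence means: the residues are permuted, so the composition is preserved but the order is not. This is not GCG's implementation; the function name, the seed parameter, and the example sequence are made up for illustration.

```python
import random
from collections import Counter

def shuffle_sequence(seq, seed=None):
    """Return a randomly permuted copy of seq; residue composition is unchanged."""
    rng = random.Random(seed)
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

original = "MKTAYIAKQRQISFVKSHFSRQ"   # arbitrary example sequence
shuffled = shuffle_sequence(original, seed=1)
print(shuffled)
print(Counter(original) == Counter(shuffled))   # composition is preserved
```

Because only the order changes, a shuffled sequence cannot reproduce the order-dependent features (hydrophobic clustering, helix periodicity) that real unrelated proteins share.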
6. Describe what you would see when you make a dotplot of two
sequences, one of which has two tandem repeats of a 20 residue
long sequence? [5 pts]
One would see two parallel diagonals.
Depending on the size of the window, the length of the diagonals
along each sequence may vary slightly. The diagonals would have
a space of 20 residues between them.
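A small sketch can confirm the two parallel diagonals. It assumes the other sequence carries a single copy of the 20-residue unit, and it uses exact window identity instead of a scoring table, which is simpler than what COMPARE does. All names and sequences are illustrative.

```python
def dot_diagonals(seq_a, seq_b, window=10):
    """Return the set of diagonals (j - i) on which some window matches exactly."""
    diagonals = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                diagonals.add(j - i)
    return diagonals

unit = "ACDEFGHIKLMNPQRSTVWY"            # 20 distinct residues
seq_a = "GGGG" + unit + "GGGG"           # one copy of the unit
seq_b = "SSSS" + unit + unit + "SSSS"    # two tandem copies

print(sorted(dot_diagonals(seq_a, seq_b)))   # -> [0, 20]
```

The two diagonals are offset by exactly the repeat length, 20 residues.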
7. Which is better able to detect similarity between two
sequences, a protein dotplot or a nucleotide dotplot? Why? [5 pts]
Protein dotplots and alignments
are generally more sensitive. The 20 letter protein alphabet makes
it easier to distinguish related sequences from noise, and protein
sequences are not as prone to compositional effects (i.e. different
GC composition) as nucleic acid sequences.
8. What does it mean to weight end gaps?
[5 pts]
To weight end-gaps means to charge
a penalty for gaps at the ends of the sequences, just as for internal
gaps.
9. Does the GAP program weight end gaps?
[5 pts]
No, the default setting is to
not weight end-gaps.
10. Does the BESTFIT program weight end gaps? [5 pts]
No, local alignment approaches
never charge penalties for the gaps at the ends of the sequences.
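The effect of weighting end gaps (questions 8 to 10) can be illustrated with a small global-alignment scorer. This is only a sketch: it uses a simple match/mismatch score and a linear gap penalty, not the scoring table or the gap creation/extension scheme used by GAP, and the function and parameter names are hypothetical.

```python
def global_score(a, b, match=1, mismatch=-1, gap=2, weight_end_gaps=True):
    """Best global alignment score, with or without a charge for end gaps."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    if weight_end_gaps:
        # Leading gaps cost the same as internal gaps.
        for i in range(1, n + 1):
            score[i][0] = -gap * i
        for j in range(1, m + 1):
            score[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] - gap,
                              score[i][j - 1] - gap)
    if weight_end_gaps:
        return score[n][m]
    # Trailing gaps are free: take the best cell in the last row or column.
    return max(max(score[n]), max(row[m] for row in score))

short, long_ = "ACGT", "TTACGTTT"
print(global_score(short, long_, weight_end_gaps=True))    # -> -4
print(global_score(short, long_, weight_end_gaps=False))   # -> 4
```

With end gaps weighted, the four unavoidable overhang positions cancel the four matches; with end gaps unweighted, the short sequence simply slides to its best position.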
11. Assume that you have made an alignment using GAP with
the gap penalties set so high that no gaps can be inserted into
the alignment. Will this alignment include the best diagonal segment
seen in a dotplot? [10 pts]
Probably, assuming that we are
talking about a long region of similarity. When using GAP with
very high gap penalties, no insertions or deletions are allowed.
The best diagonal region may not be included in the alignment
if there are sufficient "random" matches in other regions to
achieve a higher score. This is because GAP is a global alignment
program and will try to fit all of both sequences.
12. Do you expect a different result for BESTFIT? Explain.
[5 pts]
Yes, the result should be different
than with GAP. The BESTFIT alignment should be trimmed at the
ends to remove less similar regions.
13. If you align two sequences that have only a 20 residue
segment in common, do you expect the same or different results
as in the last question? Explain. [5
pts]
The results may be strikingly
different. GAP is highly likely to miss the short similarity as
explained in 11, above. BESTFIT looks only for local alignments
and should find the parts of each sequence that align to each other.
BESTFIT is much more likely to find the short region of strong
similarity regardless of its location.
14. Assume that protein B is derived from protein A by
a duplication that happened many millions of years ago. That is,
protein B was originally a tandem repeat of protein A, and the
sequences have now extensively diverged. Protein B remains approximately
twice as long as protein A. What program would you expect to give
a better alignment, GAP or BESTFIT? Why?
[10 pts]
BESTFIT would be more likely
to find the alignment by aligning A alternately with the first
half of B then the second half of B. GAP at a relatively high gap
penalty MIGHT align A with either the first or second half of B but it
is unlikely (note that GAP uses unweighted end-gaps). It is more
likely that GAP would distribute sequence A over the whole length
of B, because in some areas A is more like the first repeat in
B, and others more like the second.
15. For the duplication described above, describe what the dotplot
would look like? [5 pts]
The dotplot would look similar
to the case described in 6, above, except that the diagonals would
have many spaces indicating regions of divergence.
16. Suggest an appropriate gap opening and gap extension
penalty setting to use when comparing a eukaryotic cDNA and genomic
DNA. [5 pts]
Since introns may be very long,
you would want to use a low gap extension penalty, e.g., zero.
The gap opening penalty should be relatively large so that gaps
are only inserted when introns are encountered, e.g., ten (assuming
matches score one).
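The suggested settings can be checked with a small calculation. The sketch assumes the common affine form, in which a gap costs one opening charge plus an extension charge per gapped position; the exact convention in a given program may differ, and the function name is illustrative.

```python
def gap_cost(length, gap_open, gap_extend):
    """Affine gap cost: one opening charge plus a per-position extension charge."""
    return gap_open + gap_extend * length

for intron_length in (1, 100, 5000):
    # Compare extension = 0 (suggested above) with extension = 1.
    print(intron_length,
          gap_cost(intron_length, gap_open=10, gap_extend=0),
          gap_cost(intron_length, gap_open=10, gap_extend=1))
```

With an extension penalty of zero, a 5000-base intron costs the same as a one-base gap, so intron length does not distort the alignment, while the opening penalty of ten still discourages spurious gaps.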
17. In FASTA, what is the meaning of the init1 score? of
the initn score? [10 pts]
In the first step of FASTA, the
algorithm identifies matching ktup long words in the two sequences
being compared. Words on the same diagonal are assembled into
initial matching regions and rescored using a 20x20 scoring table.
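The init1 score is the score of the highest-scoring initial diagonal.
The initn score is the score obtained by joining compatible initial
regions, allowing some gaps (see 19).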
18. What is the meaning of the opt score?
[5 pts]
In FASTA, the opt score is an
optimized Smith-Waterman dynamic programming alignment score performed
in the region of the init1 diagonal.
19. Is the initn score ever likely to be less than the
init1 score? Why? [5 pts]
The initn score is the score
achieved by combining initial diagonals into longer assemblages
including some gaps. This score cannot be lower than the init1
score and should almost always be higher.
20. Why is the opt score sometimes less than the initn
score? [10 pts]
The dynamic programming alignment
uses different penalties and may include regions not on the initial
diagonals. It can thus be lower than initn.
21. What is a lookup table?
[5 pts]
In FASTA, a lookup table is an
index that tells where each ktup-long word is found in each sequence.
22. How does FASTA use a lookup or "Hash" Table?
[5 pts]
The lookup or hash table is used
to identify initial diagonals with high frequencies of identical
matches. The lookup tables for pairs of sequences are compared
and local matching regions built up using the difference between
the ktup positions in the two tables to identify which ktups lie
on the same diagonals.
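The idea can be sketched in a few lines. This is a simplified illustration, not FASTA's actual code: it indexes only the query, counts identical ktup words per diagonal, and does no rescoring. The function names and example sequences are hypothetical.

```python
from collections import Counter, defaultdict

def build_lookup(seq, ktup):
    """Map each ktup-long word to the list of positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(seq) - ktup + 1):
        table[seq[pos:pos + ktup]].append(pos)
    return table

def diagonal_hits(query, target, ktup=2):
    """Count identical ktup words on each diagonal (target pos minus query pos)."""
    lookup = build_lookup(query, ktup)
    hits = Counter()
    for t_pos in range(len(target) - ktup + 1):
        word = target[t_pos:t_pos + ktup]
        for q_pos in lookup.get(word, ()):
            hits[t_pos - q_pos] += 1    # same difference means same diagonal
    return hits

query = "MKTAYIAKQR"
target = "GGMKTAYIAKQRGG"
print(diagonal_hits(query, target).most_common(1))   # -> [(2, 9)]
```

Each target word is looked up directly, so the work is proportional to the sequence lengths plus the number of word hits instead of to every pair of positions.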
23. Why is a lookup table useful in FASTA?
[5 pts]
The lookup table is much smaller
than the sequence. Two lookup tables can therefore be compared
much more quickly.
24. What is a ktup? [5 pts]
In FASTA, the ktup is the length
of the substring or word used to construct the lookup table.
25. How are e-values calculated in FASTA?
[5 pts]
In FASTA, e-values are calculated
by an iterative fitting process in which the results of the database
search are fit to an extreme-value distribution. Values very far
from the expected value, i.e. those likely to be homologs, are
thrown out in several cycles of iterative fitting.