{A. Dot Matrix Comparisons.}

{1. Find three protein and three DNA sequences as described above. Convert the sequences to GCG format as you did in exercise 2.}

{2. Read the documentation on the COMPARE and DOTPLOT programs. Run COMPARE several times with a variety of "stringency" or threshold values. Make a small table showing the number of points for the related and unrelated sequences at different threshold levels.}

{3. Compare the dotplots for the related and unrelated sequences. Can you clearly distinguish the similar regions in your plot?}

{4. Repeat 2 and 3 with your DNA sequences.}
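The table asked for in step 2 can be previewed with a small windowed dot-matrix sketch. This is an illustration only, not the COMPARE algorithm: it scores each window by simple identity counts rather than a scoring table, and the function name `dot_matrix` and the three toy sequences are made up for the example.

```python
def dot_matrix(seq_a, seq_b, window, stringency):
    """Return (i, j) window start positions that meet the stringency.

    Assumption: a window scores one point per identical residue.
    COMPARE itself uses a scoring table, so its counts will differ.
    """
    points = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= stringency:
                points.append((i, j))
    return points


if __name__ == "__main__":
    # Hypothetical toy sequences, not real database entries.
    related_a = "ACGTACGTTAGCTAGGCTA"
    related_b = "ACGTACCTTAGCTTGGCTA"
    unrelated = "TTGACCGATAGGCTCAAGT"

    print("stringency  related  unrelated")
    for stringency in (4, 6, 8):
        n_related = len(dot_matrix(related_a, related_b, 8, stringency))
        n_unrelated = len(dot_matrix(related_a, unrelated, 8, stringency))
        print(f"{stringency:10d}  {n_related:7d}  {n_unrelated:9d}")
```

Running it prints one row per stringency value, which is the same shape as the table you are asked to build from the COMPARE output.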

{B. Sequence Alignments.}

{1. Global alignment using the GCG GAP program. As usual, step 1 is to read the documentation. We will also use the related program BESTFIT; read its documentation as well.}

{2. Run the GAP program and make an alignment.}

{3. Use the "FETCH" command to get a copy of the default scoring table and look at the values in it.}

{4. Repeat the alignment with different values for the "gap creation" and "gap extension" penalties.}

{5. Repeat B2 and B4, above, using the program BESTFIT.}

{6. Run GAP with Monte Carlo statistics turned on by using the -random switch.}
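The ideas behind steps 4 and 6 can be sketched in a few lines: a global alignment score under separate gap creation and gap extension penalties, followed by a Monte Carlo comparison against shuffled copies of one sequence. This is a rough illustration, not the GAP implementation. The names `global_score` and `shuffle_stats`, the match/mismatch scoring, the penalty values, and the assumption that a gap of length L costs creation + L × extension are all choices made for the example. End gaps are penalized like internal gaps here, purely as a simplification.

```python
import random
import statistics

NEG_INF = float("-inf")


def global_score(a, b, match=1, mismatch=-1, create=5, extend=1):
    """Global alignment score with affine gaps (three-state recurrence).

    Assumption: a gap of length L costs create + L * extend.
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: gap in b; Y: gap in a.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(create + i * extend)
    for j in range(1, m + 1):
        Y[0][j] = -(create + j * extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - create - extend,
                          Y[i - 1][j] - create - extend,
                          X[i - 1][j] - extend)
            Y[i][j] = max(M[i][j - 1] - create - extend,
                          X[i][j - 1] - create - extend,
                          Y[i][j - 1] - extend)
    return max(M[n][m], X[n][m], Y[n][m])


def shuffle_stats(a, b, trials=100, seed=0, **penalties):
    """Score a against shuffled copies of b; return (mean, standard deviation)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        letters = list(b)
        rng.shuffle(letters)  # keeps composition, destroys order
        scores.append(global_score(a, "".join(letters), **penalties))
    return statistics.mean(scores), statistics.pstdev(scores)


if __name__ == "__main__":
    seq_a = "ACGTACGTAGCTAGCT"  # hypothetical toy sequences
    seq_b = "ACGTACGAAGCTAGT"
    real = global_score(seq_a, seq_b)
    mean, sd = shuffle_stats(seq_a, seq_b)
    print("real score:", real)
    print("shuffled mean and sd:", mean, sd)
    if sd > 0:
        print("standard deviations above shuffled mean:", (real - mean) / sd)
```

Changing `create` and `extend` in the call to `global_score` mirrors step 4; comparing the real score with the shuffled mean and standard deviation mirrors the kind of output to look for in step 6.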

{C. Searching with FASTA.}

{1. Obtain sequences to use.}

{2. Run FASTA on your protein sequence using one of the recommended FASTA web sites and a ktup of 2.}

{3. Repeat this search with a ktup of 1.}

{4. Perform a search using the cognate DNA sequence.}
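The ktup setting in steps 2 and 3 is the word length used in the first, word-matching stage of the search. The sketch below illustrates that stage only: it indexes every ktup-long word of a query by position and then counts word matches per diagonal in a second sequence. It is a simplified illustration, not the FASTA source code; the function names `build_lookup` and `diagonal_hits` and the toy peptides are made up for the example, and the later rescoring and joining steps are left out.

```python
from collections import Counter, defaultdict


def build_lookup(query, ktup):
    """Map each ktup-long word in the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    return table


def diagonal_hits(query, subject, ktup):
    """Count word matches on each diagonal (query position minus subject position)."""
    table = build_lookup(query, ktup)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        word = subject[j:j + ktup]
        for i in table.get(word, ()):  # one lookup per subject word
            hits[i - j] += 1
    return hits


if __name__ == "__main__":
    query = "MKVLATGHWQ"       # hypothetical toy peptides
    subject = "AAMKVLSTGHWQ"
    for ktup in (2, 1):
        hits = diagonal_hits(query, subject, ktup)
        print("ktup", ktup, "most populated diagonals:", hits.most_common(3))
```

Running it with ktup 2 and then ktup 1 on your own sequences shows how the number of word matches, and therefore the amount of work and background noise, changes with the word length.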

{D. Questions.}

1. What is the lowest score possible for comparing two amino acid residues using the default scoring table used with the GCG GAP program?
2. What is the highest?
3. Is a 95% threshold (i.e., 5% of windows expected to score higher than the threshold) based on unrelated sequences a good threshold for a dotplot? Why or why not? Hint: Consider how many "matching" windows and "non-matching" windows you expect in related sequences.
4. For two distantly related protein sequences, would a longer window in COMPARE/DOTPLOT make it easier to detect the similarity? Explain.
5. Are "randomized" sequences the same as unrelated sequences? Why or why not?
6. Describe what you would see when you make a dotplot of two sequences, one of which has two tandem repeats of a 20-residue sequence.
7. Which is better able to detect similarity between two sequences, a protein dotplot or a nucleotide dotplot? Why?
8. What does it mean to weight end gaps?
9. Does the GAP program weight end gaps?
10. Does the BESTFIT program weight end gaps?
11. Assume that you have made an alignment using GAP with the gap penalties set so high that no gaps can be inserted into the alignment. Will this alignment include the best diagonal segment seen in a dotplot?
12. Do you expect a different result for BESTFIT? Explain.
13. If you align two sequences that have only a 20-residue segment in common, do you expect the same result as in the previous question or a different one? Explain.
14. Assume that protein B is derived from protein A by a duplication that happened many millions of years ago. That is, protein B was originally a tandem repeat of protein A, and the sequences have now extensively diverged. Protein B remains approximately twice as long as protein A. Which program would you expect to give the better alignment, GAP or BESTFIT? Why?
15. For the duplication described above, describe what the dotplot would look like.
16. Suggest an appropriate gap opening and gap extension penalty setting to use when comparing a eukaryotic cDNA and genomic DNA.
17. In FASTA, what is the meaning of the init1 score? Of the initn score?
18. What is the meaning of the opt score?
19. Is the initn score ever likely to be less than the init1 score? Why?
20. Why is the opt score sometimes less than the initn score?
21. What is a lookup table?
22. How does FASTA use a lookup or "hash" table?
23. Why is a lookup table useful in FASTA?
24. What is a ktup?
25. How are E-values calculated in FASTA?