Exercise 3 - Dotplots, Sequence Alignments, and FASTA



The purpose of this exercise is to introduce two methods for sequence comparisons: dot matrix plots or dotplots, and sequence alignments. Both of these techniques produce the maps that allow us to apply information about a known sequence to a novel sequence. We will investigate the parameters that affect the ability of these methods to identify homology, and introduce some concepts of how the significance of what one sees can be tested. In the third section of this exercise we will examine the use of FASTA, one of the two common database searching programs. In real life situations, database searches normally preced dot-matrix or alignment analyses, but for pedagogical purmposes we reverse the order here. The specific objectives are:





BIMM 140: | Main | 140_Info | Syllabus | Lectures | Exams | DNASYSTEM | CMS MBR |
BIMM 141: | Main | 141_Info | Syllabus | Exercises | DNASYSTEM | CMS MBR |



Main Specific Tasks to Perform in Exercise 2:

  1. Dot Matrix Sequence Comparisons.
    1. Find three protein and three DNA sequences as described above.
    2. Run COMPARE several times with different "stringency" or threshold values.
    3. Compare the dotplots for the related and unrelated sequences.
    4. Repeat 2 and 3 with your DNA sequences.
  2. Sequence Alignments.
    1. Global alignment using the GCG GAP program.
    2. Run the GAP program and make an alignment.
    3. Use the "FETCH" command to get a copy of the default scoring table and look at the values in it.
    4. Repeat the alignment with different values for the "gap creation" and "gap extension" penalties.
    5. Repeat B2 and B4, above, using the program BESTFIT.
    6. Run GAP with the /random switch.
  3. Searching with FASTA
    1. Obtain sequences
    2. Run FASTA on your protein sequence using ktup of 2
    3. Repeat this search with a ktup of 1.
    4. Perform a search using the cognate DNA sequence.
  4. Questions.

Dotplots and Alignments

Dotplots and sequence alignments are the two main ways of producing the one to one mapping between sequences that has been so useful to molecular biology. The usefulness of dotplots is often underestimated because their interpretation always includes evaluation by eye. This is unfortunate because there are a number of situations in which dotplots provide superior information to the completely computational methods discussed later in this exercise.



BIMM 140: | Main | 140_Info | Syllabus | Lectures | Exams | DNASYSTEM | CMS MBR |
BIMM 141: | Main | 141_Info | Syllabus | Exercises | DNASYSTEM | CMS MBR |


{A. Dot Matrix Comparisons.}

For the first part of this exercise, you will need three DNA sequences and three protein sequences. Within each group of three, two of the sequences should be related (i.e. homologous), and the other should be unrelated but of similar length. Make an effort to find the DNA sequences that correspond to the protein sequences because our goal is to compare the relative power of protein and nucleotide comparisons in detecting homology. However this is not critical to the exercise and you may select disctinct sets of protein and nucleotide sequences if you are unable to find the correct matching sequences.

I suggest that you use the ExPaSy SWISS-PROT server to look up sequences and use the genus/species suffix to find a distant relative, i.e. vertebrate to bacteria. An alternative is to use Entrez to locate the similar sequences and to then use the "related sequences" feature to find other related sequences. The problem with this is that it is more difficult to tell how closely related the sequences are. Pick another sequence of about the same length, but completely unrelated for the third sequence. You can follow the cross references to EMBL and Genbank in SWISS-PROT sequences or the "nucleotide links" in Entrez to find the cognate DNA sequences if you wish.

We will investigate the use of dotplot to compare the sequences using the GCG programs COMPARE and DOTPLOT. The COMPARE program performs a comparison of all of the "windows" in each sequence, and produces a table of all of the windows whose match score is above the specified threshold. It does not produce a dotplot itself. The DOTPLOT program is used to actually produce the plot.

Once you have initialized the gcg package using the prep gcg command, programs in the GCG package are run by simply typing their name (you saw this in exercise 2 when you ran TESTPLOT).

{1. Find three protein and three DNA sequences as described above.Convert the sequences to GCG format as you did in exercise 2.} Record the sequence entries you chose.

{2. Read the documentation on the COMPARE and DOTPLOT programs. Run COMPARE several times with a variety of "stringency" or threshold values. Make a small table showing the number of points for the related and unrelated sequences at different threshold levels.} For example
Thresholdrelatedunrelated
5127105
106832
113116
12154
1380
etcetera
It is not necessary to extend this table past a thousand points. Note that the number of points (or windows over the threshold) is shown in the output in the line that looks something like

Window: 20  Stringency: 12.0  Points: 342  January 13, 1997 09:40 ..
Now we can actually produce a plot. Make sure that you have selected a graphics option using either xwindows or png. Run the dotplot program by typing its name and entering the name of the point file you made with compare The results appear as a graphical plot.

{3. Compare the dotplots for the related and unrelated sequences. Can you clearly distinguish the similar regions in your plot?} Make plots with different window/stringency levels and include the best plot in your notebook.

{4. Repeat 2 and 3 with your DNA sequences.}



BIMM 140: | Main | 140_Info | Syllabus | Lectures | Exams | DNASYSTEM | CMS MBR |
BIMM 141: | Main | 141_Info | Syllabus | Exercises | DNASYSTEM | CMS MBR |


{B. Sequence Alignments.}

In the first part of the exercise, you made some dotplots and and saw that the similar regions of the sequence appear as diagonal lines (often called simply diagonals). In this part of the exercise move on to sequence alignments. Sequence alignments, generally produced with a computational technique called "dynamic programming", are one of the most ubiquitous computational techniques used in molecular biology. In spite of this, sequence alignments are often misunderstood or overinterpreted. We will come back to the subject of sequence alignments a number of times. The goal of this first look at alignments is mostly to get a general feel for the approach and an understanding of the basic parameters. We will look briefly at how one evaluates the "goodness" of alignments, but will defer a detailed discussion until later.

In the following section you will again need to know the names of two sequences. For the time being, it probably makes the most sense to use protein sequences, but you may use DNA sequences if you like. However, if you use DNA sequences, you should limit the length of the sequences to no more than a few hundred bases. It makes no sense to make sequence alignments between sequences that are unrelated (except for statistical purposes) so choose sequences that you are confident should show some sequence similarity. The sequences you used above are fine if you want to continue with them.

{1. Global alignment using the GCG GAP program. As usual step 1 is to read the documentation. We will also use the related program BESTFIT; read this document as well.}

The most important parts of the documentation to read are the sections on description, algorithm, and alignment metrics. In the lectures we will discuss in detail how the alignment algorithms work. The main problem of alignment is how to add "gaps" to the sequences to "optimize" the alignment. Both GAP and BESTFIT consider an "optimal" alignment to be one that maximizes the following score:

Score = Matches - N_Gaps( open_pen ) - L_Gaps( extend_pen )
where Matches represents the sum of the comparison scores for all of the aligned residues, i.e. the paired up residues. Positive comparisons indicate similar residues and negative comparison represent dissimilar residues. N_Gaps is the number of "gaps" inserted in the sequences. L_Gaps is the total length of the gaps in the alignment. This is the same as the number of "space" characters inserted - GCG uses a '.' for the gap or space character. In the example below, N_gaps=2 and L_Gaps=7.
THISISANONSENSE......SEQUENCE
|||||| ||            ||||||||
THISIS.NOTABIOLOGICALSEQUENCE

Open_pen and extend_pen are negative scores or penalties that prevent gaps from being inserted too frequently. The idea is that you should only insert gaps into the alignment if they gain you more than a certain amount of matches. That "certain amount" is represented by the gap penalties. GAP and BESTFIT refer to the open_pen and extend_pen as the "gap creation" and "gap extension" penalties, respectively.

{2. Run the GAP program and make an alignment.}

GAP runs interactively in a very similar way to COMPARE, above.

Look at the output file. You will see something like this:


$ ty hahu.pair
 BESTFIT of: Hahu  check: 407  from: 1  to: 141
 
P1;HAHU - hemoglobin alpha chain - human
N;Contains: neo-kyotorphin
C;Species: Homo sapiens (man)
C;Date: #sequence_revision 01-Sep-1981 #text_change 07-Jul-1995
C;Accession: A90807; A93865; A92272; A91637; A92014; A90444; A24631;
A02248
R;Michelson, A.M.; Orkin, S.H. . . . 
 
 to: Myhu  check: 5868  from: 1  to: 153
 
P1;MYHU - myoglobin - human
C;Species: Homo sapiens (man)
C;Date: #text_change 11-Nov-1994
C;Accession: A90999; A93398; A90582; A90584; A02464
R;Weller, P.; Jeffreys, A.J.; Wilson, V.; Blanchetot, A.
EMBO J. 3, 439-446, 1984 . . . 

 Symbol comparison table: 
Gencoredisk:[Gcgcore.Data.Rundata]Swgappep.Cmp
CompCheck: 1254

         Gap Weight:  3.000      Average Match:  0.540
      Length Weight:  0.100   Average Mismatch: -0.396

            Quality:   82.8             Length:    147
              Ratio:  0.587               Gaps:      1
 Percent Similarity: 48.227   Percent Identity: 26.950

 Hahu x Myhu               January 18, 1996 19:23  ..

                  .         .         .         .         .
       1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.... 46
         .||.::.  | ..||||:|. .:.|.|.| |:| : |.| . |..|    
       1 GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK 50
                  .         .         .         .         .
      47 ..DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV D 94
           |  .:|.::| || .| .||.. : . :. ...:.:|.: || | ::.
      51 SEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP 100
                  .         .         .         .       
      95 PVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR 141
         .  :.::|.|:: .|... |::|.:..::.::| |. ... :.|.|:
     101 VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYK 147

"Quality" is the score of the alignment, and "Ratio" is the quality divided by the length of the shorter sequence. One very useful number is the percent identity. This is the number of aligned residues that are identical in the two sequences divided by the length of the shorter sequence. Other useful information includes the parameters (e.g., gap creation and gap extension penalties) used in making the alignment, and the "symbol comparison table" or scoring table used to calculate the "Quality".

{3. Use the "FETCH" command to get a copy of the default scoring table and look at the values in it.}

To do this, type fetch blosum62.cmp. The default scoring table for nucleotide sequences is compardna.cmp.

The values that you want to use for the gap penalties depend, of course, on the scores that you give for matching residues. The "symbol comparison table" contains the scores that are given for aligned residues. For instance, the score for cysteine, C, aligned with aspartic acid, D, is -3 while the score for alanine aligned with serine, S, is 1. This is not surprising since alanine and serine both have relatively small side chains and cysteine and tryptophan are very different.

{4. Repeat the alignment with different values for the "gap creation" and "gap extension" penalties.}

Try setting both penalties to zero and see what the resulting alignment looks like. Also try making the penalties ridiculously large, for instance set them to 100, and see what it does to the alignment. Try some other settings and include the alignment you think looks best in your notebook. Explain why you think it is a good alignment.

{5. Repeat B2 and B4, above, using the program BESTFIT.}

GAP and BESTFIT are very similar, as you can see. The difference is that GAP produces a "global" alignment. A global alignment is one that uses all of the residues in both sequences - if they can't be matched up with a residue in the other sequence, they must be aligned with space characters. BESTFIT makes a "local" alignment. A local alignment shows the best matching segment of each sequence and trims off parts that are not distinctly similar. Compare the alignments made with GAP and BESTFIT using the same gap penalties, and comment on the differences.

{6. Run GAP with Monte Carlo statistics turned on by using the -random switch. }

This option instructs the GAP program to produce the alignment and then randomize your input sequences and calculate alignments 100 times.


 Average quality based on 100 randomizations: 43.3 +/- 3.3

will be written to the output file. The average quality and standard deviation (the number to the right of the +/-) given in the last line can be used to put your score into "SD units".

SD_Units = (quality - average_quality) / standard_deviation

For the alignment above I get a score of 11.97! Scores given in "SD units" are also called standardized scores or Z scores. They are frequently used to compare scores produced by different methods because they are independent of scoring system. I can use a Z score to compare results from different programs even though the absolute scores for the two methods are on completely different scales. The "random" option also works with BESTFIT.

In your notebook, calculate the Z score for the alignment you saved in B4, above.



BIMM 140: | Main | 140_Info | Syllabus | Lectures | Exams | DNASYSTEM | CMS MBR |
BIMM 141: | Main | 141_Info | Syllabus | Exercises | DNASYSTEM | CMS MBR |


{C. Searching with FASTA}

Recommended FASTA Sites:
GenomeNet in Japan
EBI in the UK

{1. Obtain sequences to use.}
Continue in this section using a protein sequence and its correspnding nucleotide sequence. It is best to use a DNA sequence that encodes one and only one gene, and one that isn't too long (2000 bases or less). It will be most convenient to store these sequences on your local computer in FASTA format.  You many use the one of the pairs of cognate Protein-DNA sequences you used in Exercise above, if you like.

Copy some information about these sequences into your notebook. Please include the name, or a brief description, so that we can tell what you are working on

{2. Run FASTA on your protein sequence using one of the recommended  FASTA web sites and a ktup of 2.}
Examine the results and decide where the division between homologous and non-homologous sequences is. Include a sufficient amount of the summary scores to justify your conclusion. Please do not include hundreds of alignments. Are there homologs that score lower than non-homologs?

{3. Repeat this search with a ktup of 1.}
Summarize the differences between this search and the one above, including relevant parts of the output.

{4. Perform a search using the cognate DNA sequence.}
Summarize the differences between this search and the protein searches above, including relevant parts of the output. Which search, DNA or protein, was more successful at finding homologous sequences?



BIMM 140: | Main | 140_Info | Syllabus | Lectures | Exams | DNASYSTEM | CMS MBR |
BIMM 141: | Main | 141_Info | Syllabus | Exercises | DNASYSTEM | CMS MBR |


{D. Questions.}

  1. What is the lowest score possible for comparing two amino acid residues using the default scoring table used with the GCG GAP program?
  2. What is the highest?
  3. Is a 95% threshold (i.e., 5% of windows expected to score higher than the threshold) based on unrelated sequences a good threshold for a dotplot? Why or why not?
  4. Hint:Consider how many "matching" windows and "non-matching" windows you expect in related sequences.
  5. For two distantly related protein sequences, would a longer window in COMPARE/DOTPLOT make it easier to detect the similarity? Explain?
  6. Are "randomized" sequences the same as unrelated sequences? Why or why not?
  7. Describe what you would see when you make a dotplot of two sequences, one of which has two tandem repeats of a 20 residue long sequence?
  8. Which is better able to detect similarity between two sequences, a protein dotplot or a nucleotide dotplot? Why?
  9. What does it mean to weight end gaps?
  10. Does the gap program weight end gaps?
  11. Does the bestfit program weight end gaps?
  12. Assume that you have made an alignment using GAP with the gap penalties set so high that no gaps can be inserted into the alignment. Will this alignment include the best diagonal segment seen in a dotplot?
  13. Do you expect a different result for BESTFIT? Explain?
  14. If you align two sequences that have only a 20 residue segment in common, do you expect the same or different results as in the last question? Explain?
  15. Assume that protein B is derived from protein A by a duplication that happened many millions of years ago. That is, protein B was originally a tandem repeat of protein A, and the sequences have now extensively diverged. Protein B remains approximately twice as long as protein A. What program would you expect to give a better alignment, GAP or BESTFIT. Why?
  16. For the duplication desribed above, describe what the dotplot would look like?
  17. Suggest an appropriate gap opening and gap extension penalty setting to use when comparing a eukaryotic cDNA and genomic DNA.
  18. In FASTA, what is the meaning of the init1 score? of the initn score?
  19. What is the meaning of the opt score?
  20. Is the initn score ever likely to be less than the init1 score? Why?
  21. Why is the opt score sometimes less than the initn score?
  22. What is a lookup table?
  23. How does FASTA use a lookup or "Hash" Table?
  24. Why is a lookup table useful in FASTA?
  25. What is a ktup?
  26. How are e-values calculated in FASTA


Updated 15 April 2001