The purpose of this exercise is to introduce two methods for sequence comparisons: dot matrix plots or dotplots, and sequence alignments. Both of these techniques produce the maps that allow us to apply information about a known sequence to a novel sequence. We will investigate the parameters that affect the ability of these methods to identify homology, and introduce some concepts of how the significance of what one sees can be tested. In the third section of this exercise we will examine the use of FASTA, one of the two common database searching programs. In real life situations, database searches normally preced dot-matrix or alignment analyses, but for pedagogical purmposes we reverse the order here. The specific objectives are:
For the first part of this exercise, you will need three DNA sequences and three protein sequences. Within each group of three, two of the sequences should be related (i.e. homologous), and the other should be unrelated but of similar length. Make an effort to find the DNA sequences that correspond to the protein sequences because our goal is to compare the relative power of protein and nucleotide comparisons in detecting homology. However this is not critical to the exercise and you may select disctinct sets of protein and nucleotide sequences if you are unable to find the correct matching sequences.
I suggest that you use the ExPaSy SWISS-PROT server to look up sequences and use the genus/species suffix to find a distant relative, i.e. vertebrate to bacteria. An alternative is to use Entrez to locate the similar sequences and to then use the "related sequences" feature to find other related sequences. The problem with this is that it is more difficult to tell how closely related the sequences are. Pick another sequence of about the same length, but completely unrelated for the third sequence. You can follow the cross references to EMBL and Genbank in SWISS-PROT sequences or the "nucleotide links" in Entrez to find the cognate DNA sequences if you wish.
We will investigate the use of dotplot to compare the sequences using the GCG programs COMPARE and DOTPLOT. The COMPARE program performs a comparison of all of the "windows" in each sequence, and produces a table of all of the windows whose match score is above the specified threshold. It does not produce a dotplot itself. The DOTPLOT program is used to actually produce the plot.
Once you have initialized the gcg package using the prep gcg command, programs in the GCG package are run by simply typing their name (you saw this in exercise 2 when you ran TESTPLOT).
{1. Find three protein and three DNA sequences as described above.Convert the sequences to GCG format as you did in exercise 2.} Record the sequence entries you chose.
{2. Read the documentation on the COMPARE and DOTPLOT programs. Run COMPARE several times
with a variety of "stringency" or threshold values. Make a
small table showing the number of points for the related and unrelated
sequences at different threshold levels.}
For example
| Threshold | related | unrelated |
| 5 | 127 | 105 |
| 10 | 68 | 32 |
| 11 | 31 | 16 |
| 12 | 15 | 4 |
| 13 | 8 | 0 |
Window: 20 Stringency: 12.0 Points: 342 January 13, 1997 09:40 ..Now we can actually produce a plot. Make sure that you have selected a graphics option using either xwindows or png. Run the dotplot program by typing its name and entering the name of the point file you made with compare The results appear as a graphical plot.
{3. Compare the dotplots for the related and unrelated sequences. Can you clearly distinguish the similar regions in your plot?} Make plots with different window/stringency levels and include the best plot in your notebook.
{4. Repeat 2 and 3 with your DNA sequences.}
In the first part of the exercise, you made some dotplots and and saw that the similar regions of the sequence appear as diagonal lines (often called simply diagonals). In this part of the exercise move on to sequence alignments. Sequence alignments, generally produced with a computational technique called "dynamic programming", are one of the most ubiquitous computational techniques used in molecular biology. In spite of this, sequence alignments are often misunderstood or overinterpreted. We will come back to the subject of sequence alignments a number of times. The goal of this first look at alignments is mostly to get a general feel for the approach and an understanding of the basic parameters. We will look briefly at how one evaluates the "goodness" of alignments, but will defer a detailed discussion until later.
In the following section you will again need to know the names of two sequences. For the time being, it probably makes the most sense to use protein sequences, but you may use DNA sequences if you like. However, if you use DNA sequences, you should limit the length of the sequences to no more than a few hundred bases. It makes no sense to make sequence alignments between sequences that are unrelated (except for statistical purposes) so choose sequences that you are confident should show some sequence similarity. The sequences you used above are fine if you want to continue with them.
{1. Global alignment using the GCG GAP program. As usual step 1 is to read the documentation. We will also use the related program BESTFIT; read this document as well.}
The most important parts of the documentation to read are the sections on description, algorithm, and alignment metrics. In the lectures we will discuss in detail how the alignment algorithms work. The main problem of alignment is how to add "gaps" to the sequences to "optimize" the alignment. Both GAP and BESTFIT consider an "optimal" alignment to be one that maximizes the following score:
Score = Matches - N_Gaps( open_pen ) - L_Gaps( extend_pen )where Matches represents the sum of the comparison scores for all of the aligned residues, i.e. the paired up residues. Positive comparisons indicate similar residues and negative comparison represent dissimilar residues. N_Gaps is the number of "gaps" inserted in the sequences. L_Gaps is the total length of the gaps in the alignment. This is the same as the number of "space" characters inserted - GCG uses a '.' for the gap or space character. In the example below, N_gaps=2 and L_Gaps=7.
THISISANONSENSE......SEQUENCE |||||| || |||||||| THISIS.NOTABIOLOGICALSEQUENCE
{2. Run the GAP program and make an alignment.}
GAP runs interactively in a very similar way to COMPARE, above.
Look at the output file. You will see something like this:
$ ty hahu.pair
BESTFIT of: Hahu check: 407 from: 1 to: 141
P1;HAHU - hemoglobin alpha chain - human
N;Contains: neo-kyotorphin
C;Species: Homo sapiens (man)
C;Date: #sequence_revision 01-Sep-1981 #text_change 07-Jul-1995
C;Accession: A90807; A93865; A92272; A91637; A92014; A90444; A24631;
A02248
R;Michelson, A.M.; Orkin, S.H. . . .
to: Myhu check: 5868 from: 1 to: 153
P1;MYHU - myoglobin - human
C;Species: Homo sapiens (man)
C;Date: #text_change 11-Nov-1994
C;Accession: A90999; A93398; A90582; A90584; A02464
R;Weller, P.; Jeffreys, A.J.; Wilson, V.; Blanchetot, A.
EMBO J. 3, 439-446, 1984 . . .
Symbol comparison table:
Gencoredisk:[Gcgcore.Data.Rundata]Swgappep.Cmp
CompCheck: 1254
Gap Weight: 3.000 Average Match: 0.540
Length Weight: 0.100 Average Mismatch: -0.396
Quality: 82.8 Length: 147
Ratio: 0.587 Gaps: 1
Percent Similarity: 48.227 Percent Identity: 26.950
Hahu x Myhu January 18, 1996 19:23 ..
. . . . .
1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.... 46
.||.::. | ..||||:|. .:.|.|.| |:| : |.| . |..|
1 GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK 50
. . . . .
47 ..DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV D 94
| .:|.::| || .| .||.. : . :. ...:.:|.: || | ::.
51 SEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP 100
. . . .
95 PVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR 141
. :.::|.|:: .|... |::|.:..::.::| |. ... :.|.|:
101 VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYK 147
{3. Use the "FETCH" command to get a copy of the default scoring table and look at the values in it.}
To do this, type fetch blosum62.cmp. The default scoring table for nucleotide sequences is compardna.cmp.
The values that you want to use for the gap penalties depend, of course, on the scores that you give for matching residues. The "symbol comparison table" contains the scores that are given for aligned residues. For instance, the score for cysteine, C, aligned with aspartic acid, D, is -3 while the score for alanine aligned with serine, S, is 1. This is not surprising since alanine and serine both have relatively small side chains and cysteine and tryptophan are very different.
{4. Repeat the alignment with different values for the "gap creation" and "gap extension" penalties.}
Try setting both penalties to zero and see what the resulting alignment looks like. Also try making the penalties ridiculously large, for instance set them to 100, and see what it does to the alignment. Try some other settings and include the alignment you think looks best in your notebook. Explain why you think it is a good alignment.
{5. Repeat B2 and B4, above, using the program BESTFIT.}
GAP and BESTFIT are very similar, as you can see. The difference is that GAP produces a "global" alignment. A global alignment is one that uses all of the residues in both sequences - if they can't be matched up with a residue in the other sequence, they must be aligned with space characters. BESTFIT makes a "local" alignment. A local alignment shows the best matching segment of each sequence and trims off parts that are not distinctly similar. Compare the alignments made with GAP and BESTFIT using the same gap penalties, and comment on the differences.
{6. Run GAP with Monte Carlo statistics turned on by using the -random switch. }
This option instructs the GAP program to produce the alignment and then randomize your input sequences and calculate alignments 100 times.
Average quality based on 100 randomizations: 43.3 +/- 3.3
SD_Units = (quality - average_quality) / standard_deviation
For the alignment above I get a score of 11.97! Scores given in "SD units" are also called standardized scores or Z scores. They are frequently used to compare scores produced by different methods because they are independent of scoring system. I can use a Z score to compare results from different programs even though the absolute scores for the two methods are on completely different scales. The "random" option also works with BESTFIT.
In your notebook, calculate the Z score for the alignment you saved in B4, above.
Recommended FASTA Sites:
GenomeNet in Japan
EBI in the UK
{1. Obtain sequences to use.}
Continue in this section using a protein sequence and its correspnding nucleotide sequence.
It is best to use a DNA sequence that encodes one and only one gene, and one that isn't too long (2000 bases or less). It will be most convenient to store these sequences on your local computer in FASTA format. You many use the one of the pairs of cognate Protein-DNA sequences you used in Exercise above, if you like.
Copy some information about these sequences into your notebook. Please include the name, or a brief description, so that we can tell what you are working on
{2. Run FASTA on your protein sequence using one of the recommended FASTA web sites and a ktup of 2.}
Examine the results and decide where the division between homologous and non-homologous sequences is. Include a sufficient amount of the summary scores to justify your conclusion. Please do not include hundreds of alignments. Are there homologs that score lower than non-homologs?
{3. Repeat this search with a ktup of 1.}
Summarize the differences between this search and the one above, including relevant parts of the output.
{4. Perform a search using the cognate DNA sequence.}
Summarize the differences between this search and the protein searches above, including relevant parts of the output. Which search, DNA or protein, was more successful at finding homologous sequences?