Pharm 207/Bio 207 

Using Internet Resources in Molecular Biology - Lecture 2 

DNA & Protein Sequence Comparison

 

Lecturer: Philip Bourne 
Last Update: October 7, 2003 


Introduction

DNA and protein sequences are being determined at an ever-increasing rate. These data are fundamental to bioinformatics as we know it today. The area of bioinformatics known as sequence analysis concerns the discovery of biological knowledge from these sequences. Typically the user begins with one or more sequences (the probe(s)) and wishes to learn something about the biology of the organism from which they were obtained. This involves (at least as far as this lecture is concerned) comparing each probe to itself and to one or more other sequences, from which comparative information is inferred. For this inference to be correct, the annotation of the known sequences must be correct and the similarity found by the comparison must be judged real rather than a chance match. This becomes more problematic as more of the annotation is itself derived from comparison rather than from direct biochemical evidence, a natural consequence of the amount of sequence data being generated.

The Basics

The search for homology is well supported by Internet-accessible tools, but getting the correct answer with these tools requires some understanding of the relationship between sequence alignments, database searches, and the statistics of sequence matching. Related methods for finding sequence or structural motifs require a similar understanding.

The Basics - Amino Acid Properties

To understand substitution matrices and other features of protein sequence pairwise alignment it is necessary to understand the basic properties of amino acids.

The Single-Letter Amino Acid Code

 

G  Glycine        Gly      P  Proline        Pro
A  Alanine        Ala      V  Valine         Val
L  Leucine        Leu      I  Isoleucine     Ile
M  Methionine     Met      C  Cysteine       Cys
F  Phenylalanine  Phe      Y  Tyrosine       Tyr
W  Tryptophan     Trp      H  Histidine      His
K  Lysine         Lys      R  Arginine       Arg
Q  Glutamine      Gln      N  Asparagine     Asn
E  Glutamic Acid  Glu      D  Aspartic Acid  Asp
S  Serine         Ser      T  Threonine      Thr

 


Amino Acid Structures

All amino acids have the same general formula, H2N-CH(R)-COOH, where R is the side chain that distinguishes one amino acid from another:


The twenty amino acids found in biological systems are:

 

Courtesy Nancy Watson at nwatson@ilstu.edu

More details on amino acids can be found in the Principles of Protein Structure Using the Internet.

It is possible to display values for static properties on a percentage scale as defined by R.A. Bogardt et al. (J. Mol. Evol. 15, 197-218, 1980).

The values used are as follows:

Amino Acid   Volume   Polarity   Isoelectric Point   Hydrophobicity   Mean Sol. Accessibility

Gln            51.3       45.7                34.4              0.0                      43.6
Ser            18.1       32.1                34.8              1.9                      40.5
Thr            34.0       21.0                45.7              1.9                      35.3
Asn            35.4       63.0                31.3              2.4                      46.1
Gly             0.0       37.0                38.5              2.7                      54.0
Asp            31.3      100.0                 0.0             17.5                      45.0
Glu            47.2       93.8                 3.2             17.8                      48.6
Arg            70.8       51.9               100.0             22.6                      50.1
Ala            15.9       25.9                39.2             23.1                      37.4
His            49.2       43.2                59.2             23.1                      28.1
Cys            28.0        7.4                26.3             40.3                       7.4
Lys            68.0       64.2                86.9             43.5                      54.3
Val            47.7        8.6                38.5             49.6                      19.6
Met            62.8        4.9                35.7             44.3                       3.9
Leu            63.6        0.0                38.6             57.6                      10.1
Tyr            78.5        9.9                34.4             70.8                      30.1
Pro            41.0       21.0                40.2             73.5                      66.2
Phe            77.2        1.2                38.6             76.1                       5.5
Ile            63.6        0.0                39.2             83.6                       7.5
Trp           100.0        4.9                37.7            100.0                      13.8
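One rough way to get a feel for these numbers is to treat each residue as a point in the five-property space and measure how far apart two residues lie. The sketch below does that for a handful of residues, using values copied from the table. The equal weighting of the five properties and the use of Euclidean distance are assumptions made purely for illustration; real substitution matrices such as PAM and BLOSUM are derived from observed substitution frequencies, not from a distance like this one.

```python
import math

# Percentage-scale values copied from the table:
# (volume, polarity, isoelectric point, hydrophobicity, mean solvent accessibility)
PROPS = {
    "Leu": (63.6, 0.0, 38.6, 57.6, 10.1),
    "Ile": (63.6, 0.0, 39.2, 83.6, 7.5),
    "Asp": (31.3, 100.0, 0.0, 17.5, 45.0),
    "Glu": (47.2, 93.8, 3.2, 17.8, 48.6),
    "Lys": (68.0, 64.2, 86.9, 43.5, 54.3),
}


def property_distance(a, b):
    """Euclidean distance between two residues, all five properties weighted equally
    (an illustrative assumption, not a published similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(PROPS[a], PROPS[b])))


for pair in [("Leu", "Ile"), ("Asp", "Glu"), ("Leu", "Asp"), ("Asp", "Lys")]:
    print(pair, round(property_distance(*pair), 1))
```

Pairs that are usually regarded as conservative substitutions (Leu/Ile, Asp/Glu) come out much closer together than pairs such as Leu/Asp, which is the intuition the manual alignment practical relies on.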


Yet more information from the Jena Image Library of Biological Macromolecules

Practical: Manual Sequence Alignments

Align the following sequences by hand using what you know about amino acid properties.

Use:

|   for a perfect match

.   for a conservative substitution

    (blank) for no match

- for a gap in one or other of the sequences.

1.  Align ANALYSIS and ANALYZED (no gaps)

2. Align ANALYSISPRIMER and ANALYZEDERIVED (no gaps)

3. Align VVVVVASCDEFGYYYYY and  QQQQQATCEEFGLLLLL

(Think about this in the context of a local versus a global (total) alignment.)

4. Align TKTYFPHFDLSHGSAQVKGHGKKV and TQRFFESFGDLSTPDAVMGNPKVKAHGKKV

(gapped)
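Once you have an alignment on paper, a small script can draw the match line for you so that you can compare it with your own. The following is a minimal sketch. The residue groupings in CONSERVATIVE_GROUPS are a hypothetical choice made for this example (other groupings are equally defensible), and the toy sequences are not taken from the exercise.

```python
# Hypothetical grouping of residues with similar side-chain properties.
# This is an assumption for illustration; it is not a standard definition.
CONSERVATIVE_GROUPS = [
    set("AG"), set("ST"), set("DE"), set("NQ"),
    set("KRH"), set("ILVM"), set("FYW"),
]


def match_line(seq1, seq2):
    """Return the annotation line for two already-aligned sequences of equal length.

    '|' marks a perfect match, '.' a conservative substitution, and a blank
    marks either no match or a column that contains a gap ('-').
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            symbols.append(" ")
        elif a == b:
            symbols.append("|")
        elif any(a in group and b in group for group in CONSERVATIVE_GROUPS):
            symbols.append(".")
        else:
            symbols.append(" ")
    return "".join(symbols)


# Toy example with made-up sequences.
top, bottom = "ACDKLS", "ACEKIT"
print(top)
print(match_line(top, bottom))
print(bottom)
```

The script only annotates an alignment you supply; it does not decide where gaps should go, which is the part of the exercise worth doing by hand.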

More Detailed Concepts [See BIMM140 for further details]

Homology versus Similarity

Homology

 Global and Local Alignment

Alignments in General - Dynamic Programming

Scores and Gap Penalties

Statistical Significance of Alignments

Database Searching

Filtering

 Dot Plots

Dot plots are probably the oldest way of comparing sequences (Maizel and Lenk). A dot plot is a visual representation of the similarities between two sequences. Each axis of a rectangular array represents one of the two sequences to be compared. A window length is fixed, together with a criterion for when two sequence windows are deemed similar. Whenever a window in one sequence resembles a window in the other sequence, a dot or short diagonal is drawn at the corresponding position of the array. Thus, when two sequences share similarity over their entire length, a diagonal line extends from one corner of the dot plot to the diagonally opposite corner. If two sequences share only patches of similarity, this is revealed by shorter diagonal stretches.
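The procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used by any particular dot plot program. The match and mismatch values of +5 and -4 follow the hemoglobin example in this lecture, while the window length of 3, the threshold of 15 (every position in the window must match) and the function names are assumptions chosen to keep the toy output small.

```python
def window_score(a, b, match=5, mismatch=-4):
    """Sum of per-position scores for two windows of equal length."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def dot_plot(seq_x, seq_y, window=3, threshold=15):
    """Return a grid of booleans: one row per window of seq_y (vertical axis),
    one column per window of seq_x (horizontal axis). A cell is True when the
    two windows score at least `threshold`."""
    nx = len(seq_x) - window + 1
    ny = len(seq_y) - window + 1
    return [
        [
            window_score(seq_x[i:i + window], seq_y[j:j + window]) >= threshold
            for i in range(nx)
        ]
        for j in range(ny)
    ]


def render(plot):
    """Draw the grid as text, '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if hit else "." for hit in row) for row in plot)


# Toy example: the horizontal sequence contains the vertical one twice,
# so two parallel diagonals appear.
print(render(dot_plot("GATTACAGATTACA", "GATTACA")))
```

Lowering the threshold below the all-match score lets partially similar windows produce dots as well, at the cost of more background noise; a real program would also scale the dot intensity with the score rather than apply a single cutoff.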

Figure 1 shows an example of a dot plot in which the alpha chain of human hemoglobin is compared to the beta chain. For this computation the window length was set to 31, and matches and mismatches were assigned similarity values of +5 and -4 respectively. The gray values of the dots scale with the similarity of the two windows. One can clearly discern a diagonal trace along the entire length of the two sequences. Note the places where this trace jumps to another diagonal of the array. These jumps correspond to positions where one sequence has more (or fewer) letters than the other, i.e. to insertions or deletions.

Figure 1. Dot plot of two coding DNA sequences: the alpha chain of human hemoglobin is assigned to the horizontal axis and the beta chain of human hemoglobin to the vertical axis.


Alignment of the A and B protein chains of deoxy hemoglobin.

 
Dot plots are a very powerful method of comparing two sequences. They do not bias the analysis in any way, which makes them an ideal first-pass method. Based on the dot plot, the user can decide whether the case is one of global (beginning-to-end) similarity or local similarity. Local similarity denotes similar regions between two sequences that are embedded in otherwise dissimilar sequence. Sequences may also contain regions of self-similarity, frequently termed internal repeats. A dot plot comparison of a sequence with itself will reveal internal repeats as several parallel diagonals.
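The parallel diagonals produced by internal repeats can also be summarized numerically: each diagonal of a self-comparison corresponds to a fixed offset between the two copies of the sequence, so counting the dots on each diagonal points to the repeat spacing. The sketch below is a simplified illustration. It requires windows to match exactly, which is an assumption made for brevity (a real comparison would score windows with a substitution matrix), and the sequence and function name are made up.

```python
def repeat_offsets(seq, window=4, min_hits=2):
    """Compare a sequence with itself and count, for each offset d > 0,
    how many windows seq[i:i+window] are identical to seq[i+d:i+d+window].

    Each offset corresponds to one diagonal above the main diagonal of the
    self dot plot. Offsets with fewer than `min_hits` dots are discarded.
    """
    n = len(seq) - window + 1
    hits = {}
    for i in range(n):
        for j in range(i + 1, n):
            if seq[i:i + window] == seq[j:j + window]:
                hits[j - i] = hits.get(j - i, 0) + 1
    return {d: count for d, count in sorted(hits.items()) if count >= min_hits}


# Made-up sequence consisting of three copies of a 7-residue unit.
print(repeat_offsets("MKVLAGT" * 3))
```

For a sequence built from exact copies of one unit, the surviving offsets are multiples of the unit length; diverged repeats in a real protein would give weaker and more broken diagonals.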

Exercise 1

1. Download and install windot (read the dotplot text file to see what to do)

2. Download the sequences for the A and B chains of hemoglobin

Go to NCBI, get your pet protein sequence, and compare it to itself in the same way. Describe your findings.

{NOTE RED represents the minimum of what I expect to see described in your final documentary}

BLAST

 Tutorial 1

 Tutorial 2

PSI-BLAST

Tutorial

Exercise 2

Run your pet protein sequence through PSI-BLAST using different scoring matrices. Attempt to optimize the number of remote homologs found and report on the biological implications of your findings. Include details of any interesting alignments in your report.

Reading Materials

Tools
  1. Molecular Biology Workbench - General Purpose Sequence Analysis Tool
  2. Dotter program
  3. Database Searches
  4. General Information - Lists of many sites and tools