Pharm 207/Bio 207
Using Internet Resources in Molecular Biology
- Lecture 2
DNA & Protein Sequence Comparison
Lecturer: Philip Bourne
Last Update: October 7, 2003
Introduction
DNA and protein sequences are being determined at an ever-increasing rate.
These data are fundamental to bioinformatics as we know it today. The area of
bioinformatics known as sequence analysis concerns the discovery of
biological knowledge from these sequences. Typically the user begins with one or
more sequences (the probe(s)) and wishes to learn something about the
biology of the organism from which the sequence(s) was obtained. This involves
(at least as far as this lecture is concerned) comparison of the probe to itself
and to one or more other sequences, from which comparative information is
inferred. For this inference to be correct, the detailed annotation of the known
sequences must be correct and the relationship revealed by the comparison must
be deemed real. This becomes more problematic as more of the annotation is
itself derived from comparison rather than direct biochemical evidence - a
natural consequence of the amount of sequence data being generated.
The Basics
- Comparison of an unknown sequence to an annotated sequence permits us to
infer structural, functional and evolutionary relationships.
- Wherever possible use the protein sequence, since this conveys more
information.
- Sequence alignment provides an explicit mapping between sequences - a
pairwise alignment involves 2 sequences only.
- This comparative process becomes more valuable the more annotated sequences
there are.
- As the size of the databases grows, more efficient algorithms are needed.
- Note the distinction between similarity and homology:
- Similarity is simply a measure of how alike two sequences
are
- Homology means there is an evolutionary relationship between two
sequences. There are not degrees of homology.
- Extending this to individual residues they are either identical or
similar - similar implies that they at least share certain properties.
- Substitutions, deletions and insertions all occur as part of the natural
evolutionary process.
The search for homology is well supported by internet
accessible tools, but getting the correct answer with these tools requires
that one understand something about the relationship between sequence alignments,
database searches, and the statistics of sequence matching. Related
methods for finding sequential or structural motifs require similar understanding.
The Basics - Amino Acid Properties
To understand substitution matrices and other features of protein
sequence pairwise alignment it is necessary to understand the basic properties
of amino acids.
The Single-Letter Amino Acid Code
Code | Amino acid    | Abbrev. | Code | Amino acid    | Abbrev.
G    | Glycine       | Gly     | P    | Proline       | Pro
A    | Alanine       | Ala     | V    | Valine        | Val
L    | Leucine       | Leu     | I    | Isoleucine    | Ile
M    | Methionine    | Met     | C    | Cysteine      | Cys
F    | Phenylalanine | Phe     | Y    | Tyrosine      | Tyr
W    | Tryptophan    | Trp     | H    | Histidine     | His
K    | Lysine        | Lys     | R    | Arginine      | Arg
Q    | Glutamine     | Gln     | N    | Asparagine    | Asn
E    | Glutamic Acid | Glu     | D    | Aspartic Acid | Asp
S    | Serine        | Ser     | T    | Threonine     | Thr
Amino Acid Structures
All amino acids have the same general formula:

The twenty amino acids found in biological systems are:

Courtesy Nancy Watson at
nwatson@ilstu.edu
More details on amino acids can be found in the Principles
of Protein Structure Using the Internet.
It is possible to display values for static properties on a percentage scale
as defined by R.A. Bogardt, et al. (J. Mol. Evol. 15, 197-218, 1980).
The values used are as follows:
Amino Acid | Volume | Polarity | Isoelectric Point | Hydrophobicity | Mean Sol. Accessibility
Gln        |   51.3 |     45.7 |              34.4 |            0.0 |                    43.6
Ser        |   18.1 |     32.1 |              34.8 |            1.9 |                    40.5
Thr        |   34.0 |     21.0 |              45.7 |            1.9 |                    35.3
Asn        |   35.4 |     63.0 |              31.3 |            2.4 |                    46.1
Gly        |    0.0 |     37.0 |              38.5 |            2.7 |                    54.0
Asp        |   31.3 |    100.0 |               0.0 |           17.5 |                    45.0
Glu        |   47.2 |     93.8 |               3.2 |           17.8 |                    48.6
Arg        |   70.8 |     51.9 |             100.0 |           22.6 |                    50.1
Ala        |   15.9 |     25.9 |              39.2 |           23.1 |                    37.4
His        |   49.2 |     43.2 |              59.2 |           23.1 |                    28.1
Cys        |   28.0 |      7.4 |              26.3 |           40.3 |                     7.4
Lys        |   68.0 |     64.2 |              86.9 |           43.5 |                    54.3
Val        |   47.7 |      8.6 |              38.5 |           49.6 |                    19.6
Met        |   62.8 |      4.9 |              35.7 |           44.3 |                     3.9
Leu        |   63.6 |      0.0 |              38.6 |           57.6 |                    10.1
Tyr        |   78.5 |      9.9 |              34.4 |           70.8 |                    30.1
Pro        |   41.0 |     21.0 |              40.2 |           73.5 |                    66.2
Phe        |   77.2 |      1.2 |              38.6 |           76.1 |                     5.5
Ile        |   63.6 |      0.0 |              39.2 |           83.6 |                     7.5
Trp        |  100.0 |      4.9 |              37.7 |          100.0 |                    13.8
Yet more information from the Jena
Image Library of Biological Macromolecules
Practical: Manual Sequence Alignments
Align the following sequences by hand using what you know about amino acid
properties.
Use:
| for a perfect match
. for a conservative substitution
(blank) for no match
- for a gap in one or other of the sequences.
1. Align ANALYSIS and ANALYZED (no gaps)
2. Align ANALYSISPRIMER and ANALYZEDERIVED (no gaps)
3. Align VVVVVASCDEFGYYYYY and QQQQQATCEEFGLLLLL
(Think about this in the context of a local versus global (total) alignment)
4. Align TKTYFPHFDLSHGSAQVKGHGKKV and TQRFFESFGDLSTPDAVMGNPKVKAHGKKV
(gapped)
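The marker notation above can be checked mechanically once you have placed two sequences side by side. The following is a minimal sketch, not part of the original practical: the property groups in CONSERVATIVE_GROUPS and the function names are assumptions made for illustration, and a different grouping would mark different positions as conservative.

```python
# Hypothetical property groups. The lecture does not define exactly which
# substitutions count as conservative, so this grouping is an assumption.
CONSERVATIVE_GROUPS = [
    set("AVLIMC"),  # aliphatic / hydrophobic
    set("FYW"),     # aromatic
    set("KRH"),     # basic
    set("DE"),      # acidic
    set("STNQ"),    # polar, uncharged
]


def is_conservative(a, b):
    """True if both residues fall in the same (assumed) property group."""
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def annotate(row1, row2):
    """Return the marker line for two aligned rows of equal length.

    '|' marks a perfect match, '.' a conservative substitution, and a
    blank marks no match or a column where either row has a gap ('-').
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    marks = []
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            marks.append(" ")
        elif a == b:
            marks.append("|")
        elif is_conservative(a, b):
            marks.append(".")
        else:
            marks.append(" ")
    return "".join(marks)
```

For example, annotate("KVLT", "RVIS") gives ".|.." under the grouping assumed here. Try the exercises by hand first and use the function only to check your marker line.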
More Detailed Concepts [See BIMM140
for further details]
Homology versus Similarity
- homology:
- the presence of a similar feature because of descent from a common
ancestor
- homoplasy - the presence of a similar feature because of convergence
- similarity - quantitative measure of 'likeness' of two or more
sequences
- similarity - use statistics to determine 'significance' of the
similarity
Homology
- Homology cannot be observed.
- We can’t actually see the ancestral organisms/molecules and trace
descent.
- These ancestral organisms / molecules no longer exist in today's
world.
- Homology is an inference, a conclusion we draw based on observed
similarity.
- In practice, if the similarity statistics indicate a high enough
degree of significance, one infers that the two sequences are homologues
- As homologues, one implicitly infers common ancestry for the two
sequences.
- It is, therefore, subject to degrees of certainty.
- Homology is an all-or-none relationship
- Homology is like Pregnancy: one is ... or one isn't ... all-or-none
- Homology strongly suggests similar structure and function for the
proteins
- Significantly similar molecular sequences are very unlikely to arise by
chance - i.e. homoplasy on the molecular level is very unlikely
- Caveat: horizontal transfer of sequences from one organism to another
- Because molecular homoplasy is unlikely, significant sequence
similarity is a strong indication of homology
Global and Local Alignment
- An alignment that spans the whole length of the overlapping sequences is
considered a global alignment.
- This may not be optimal given that proteins exist as domains and given
sequences may contain multiple domains, often rearranged. As we shall see
the dotplot provides a mechanism for revealing multiple regions of local
alignment.
- A global alignment is useful when you know you have a single domain and
sequences have not diverged substantially.
Alignments in General - Dynamic Programming
- There are many possible alignments, so it is necessary to define a score so
that optimal alignments can be recognized based on that scoring scheme. The
optimal alignment is the one with the highest score.
- The score is usually derived by summing incremental contributions at each
step along the way.
- One method for the determination of the best path strategy is known as dynamic
programming and has its roots in graph theory.
- This was first applied by Needleman and Wunsch and published in
1970 in perhaps the most cited paper in bioinformatics.
- The idea is one of extension of optimal subpaths to provide a complete
(global) path.
- An extension of the original work to provide local alignments simply says
the alignment need not reach the edges of the search graph. Rather, a
local alignment is extended and stops if the score falls to zero, at which
point a new local alignment is started. This may lead to a number of local
alignments; the one with the best score is reported as the optimal local
alignment.
- The optimal local alignment may not be the most biologically meaningful -
it pays to review a number of sub-optimal alignments.
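The points above can be sketched in a few lines of code. This is a simplified illustration under stated assumptions, not the published algorithms in full: it uses a plain match/mismatch score and a single linear gap penalty (the default values are arbitrary), and the function name align is hypothetical. With local=False the path must run from corner to corner (global, in the style of Needleman and Wunsch); with local=True scores are floored at zero and the best-scoring cell is traced back until the score reaches zero.

```python
def align(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Dynamic-programming alignment of strings a and b.

    Returns (score, aligned_a, aligned_b). Scoring values are
    illustrative assumptions; gaps are penalized linearly.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap

    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + s,   # extend the diagonal
                       score[i - 1][j] + gap,     # gap in b
                       score[i][j - 1] + gap)     # gap in a
            if local:
                cell = max(cell, 0)               # restart when score drops
                if cell > best:
                    best, best_pos = cell, (i, j)
            score[i][j] = cell

    # Trace back from the end (global) or from the best cell (local).
    i, j = best_pos if local else (n, m)
    final = score[i][j]
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if local and score[i][j] == 0:
            break
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return final, "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these default scores, align("ACGT", "AGT") returns (1, "ACGT", "A-GT"), and align("TTTTACGTTTT", "GGGACGGGG", local=True) returns (3, "ACG", "ACG"). Only one optimal path is reported; a fuller implementation would also list the sub-optimal alignments mentioned above.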
Scores and Gap Penalties
- The scoring function is critical to determining good alignments.
- The absolute match/mismatch rule for scoring works okay for DNA sequences
but for proteins does not take advantage of the observation that some amino
acid substitutions are more likely than others - these are referred to as
conservative substitutions.
- These substitutions conform to the property groups we have discussed
already - for example, isoleucine for valine (both small and hydrophobic) and
serine for threonine (both polar).
- Scoring provides a scale that distinguishes absolute conservation from
conservative substitution and from non-conserved substitution. This is
captured in a substitution matrix. [More
on matrices from BIMM 140]
- Gaps represent insertions and deletions in one or other of the sequences
being aligned.
- Too large a gap may represent an implausible set of mutations - in other
words, it could imply a loss of biological function.
- Gap penalties are defined by two parameters - the gap opening penalty and
the gap extension penalty. The rationale here (which is empirical) is that it
is difficult to open a gap, but once it is open it is not so difficult to
extend it.
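To make the scoring scale and the two gap parameters concrete, here is a small sketch that scores an alignment that has already been written out with '-' for gaps. Everything numeric in it is an assumption chosen for illustration: the three-level pair score stands in for a real substitution matrix, the property groups are hypothetical, and the convention that the first gap position costs gap_open and each further position costs gap_extend is only one of several in use.

```python
# Illustrative property groups and scores. All values are assumptions and
# are not taken from any published substitution matrix.
GROUPS = ["AVLIMC", "FYW", "KRH", "DE", "STNQ"]


def pair_score(a, b, identical=4, conservative=1, other=-2):
    """Three-level score: identity > conservative > non-conserved."""
    if a == b:
        return identical
    if any(a in group and b in group for group in GROUPS):
        return conservative
    return other


def score_alignment(row1, row2, gap_open=5, gap_extend=1):
    """Score two aligned rows of equal length that use '-' for gaps.

    The first position of a gap costs gap_open; every further position
    of the same gap costs gap_extend.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    prev = None  # which row holds the current gap: 1, 2, or None
    for a, b in zip(row1, row2):
        cur = 1 if a == "-" else 2 if b == "-" else None
        if cur is None:
            total += pair_score(a, b)
        elif cur == prev:
            total -= gap_extend   # gap continues in the same row
        else:
            total -= gap_open     # a new gap is opened
        prev = cur
    return total
```

With these values, one gap of length two is cheaper than two separate gaps of length one: score_alignment("AC--E", "ACDDE") is 6, whereas score_alignment("A-C-E", "ADCDE") is 2.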
Statistical Significance of Alignments
- It is easy to define a score, but how do you know that score represents
something biologically meaningful, i.e. some evidence of homology?
- Such a statistical treatment is difficult for global alignments, but
better understood for local alignments.
- The statistics are derived from local alignments without gaps - such a
segment is referred to as a high-scoring segment pair (HSP).
- The probability density function used to describe the behavior of HSP
scores is known as the extreme value distribution, which differs from a
normal distribution.
- By relating a score S to the expected distribution it is possible to
calculate a p value which indicates the probability that the specified score
could occur by chance.
- Thus p values close to zero are the most significant.
- A related quantity is E - the expectation value - the expected number of
chance alignments that would achieve a score of S or higher. The smaller the
value of E the more statistically significant the alignment.
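The relationship between S, E and p can be written down directly. The sketch below uses the standard Karlin-Altschul form for ungapped local alignments, E = K * m * n * exp(-lambda * S), with p = 1 - exp(-E). The parameters K and lambda depend on the scoring matrix and residue composition; the default values here are placeholders assumed for illustration only, and the function names are hypothetical.

```python
import math


def expect_value(score, query_len, db_len, K=0.1, lam=0.3):
    """Expected number of chance HSPs scoring at least `score`.

    E = K * m * n * exp(-lambda * S), where m and n are the query and
    database lengths. K and lam are placeholder values (assumptions).
    """
    return K * query_len * db_len * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance HSP: p = 1 - exp(-E)."""
    return -math.expm1(-e)
```

Two properties are worth noticing: E falls exponentially as the score rises and grows linearly with the size of the database searched, and for small E the values of p and E are nearly equal.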
Database Searching
- Pairwise sequence alignments can be extended such that one sequence
becomes the query sequence and any of the sequences in a database become a
target sequence to determine a large set of pairwise alignments.
- To scale computationally, either parallel computation or heuristic methods
are needed.
- Heuristic methods make assumptions at the risk of missing some alignments.
- Word based approaches are the most popular - rather than comparing
individual amino acids or nucleotides, words (groupings of amino acids or
nucleotides) are compared.
- Both FASTA and BLAST - two programs used in this workshop - use the concept
of words and will be discussed in more detail.
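The word idea can be illustrated with a lookup table built once for the whole database. This is a deliberately simplified sketch using exact word matches only; it is not the FASTA or BLAST algorithm, both of which do considerably more with the hits they find. The function names, the toy database and the default word length k=3 are assumptions.

```python
from collections import defaultdict


def build_word_index(database, k=3):
    """Map each k-letter word to the (sequence id, offset) pairs where it occurs.

    `database` is a dict of {sequence id: sequence string}.
    """
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def word_hits(query, index, k=3):
    """Count shared words per (sequence id, diagonal).

    The diagonal is the offset in the database sequence minus the offset
    in the query; several hits on one diagonal suggest an ungapped
    region of similarity worth aligning in full.
    """
    hits = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        word = query[qpos:qpos + k]
        for seq_id, dpos in index.get(word, ()):
            hits[(seq_id, dpos - qpos)] += 1
    return dict(hits)
```

For a toy database {"s1": "MKVLAAGIT", "s2": "GGGGGGG"} and the query "KVLA", word_hits reports two word hits on diagonal 1 of s1 and none for s2. The speed comes from never examining s2 at all, which is also where the risk of missing weak similarities comes from.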
Filtering
- Both protein and DNA sequences contain low-complexity regions and DNA
sequences in particular contain repetitive elements.
- Filtering attempts to remove the confusion such regions can cause to a
database search.
- This can be done either by first checking the query sequence or filtering
the results.
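One way to picture filtering of the query is to mask windows whose composition is dominated by very few residue types. The sketch below uses Shannon entropy over a sliding window as a stand-in complexity measure; this is an assumption made for illustration and is not claimed to be the measure used by any particular filtering program. The window size, threshold and mask character are likewise arbitrary choices.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue that lies in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)
```

For example, mask_low_complexity("ACDEFGHKAAAAAAAA", window=4, threshold=1.0) returns "ACDEFGHXXXXXXXXX". The run of alanines is masked, together with the one residue that shares a low-entropy window with it, so that the run cannot seed spurious database hits.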
Dot Plots
Dot plots are probably the oldest way of comparing sequences (Maizel and Lenk).
A dot plot is a visual representation of the similarities between two sequences.
Each axis of a rectangular array represents one of the two sequences to be
compared. A window length is fixed, together with a criterion for when two sequence
windows are deemed to be similar. Whenever a window in one sequence
resembles a window in the other sequence, a dot or short diagonal is
drawn at the corresponding position of the array. Thus, when two sequences share
similarity over their entire length a diagonal line will extend from one corner
of the dot plot to the diagonally opposite corner. If two sequences only share
patches of similarity this will be revealed by diagonal stretches.
Figure 1 shows an example of a dot plot. There, the alpha chain of human
hemoglobin is compared to the beta chain of human hemoglobin. For this
computation, the window length was set to 31, matches and mismatches were
assigned similarity values of +5 and -4 respectively. The gray values of the
dots scale with the similarity of two windows. One can clearly discern a
diagonal trace along the entire length of the two sequences. Note the points
where this trace jumps to another diagonal of the array. These jumps correspond
to positions where one or the other sequence has more (or fewer) letters than the
other one.
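The windowed comparison just described is simple enough to sketch directly. The match and mismatch values default to the +5 and -4 quoted above; the function names, the text rendering and the default window and threshold are assumptions made for illustration, and this sketch draws a dot or no dot rather than gray levels.

```python
def dot_plot(seq1, seq2, window=5, threshold=15, match=5, mismatch=-4):
    """Compute a dot plot as a list of rows of booleans.

    plot[i][j] is True when the window starting at seq2[i] (vertical
    axis) and the window starting at seq1[j] (horizontal axis) reach
    the threshold score.
    """
    rows = max(len(seq2) - window + 1, 0)
    cols = max(len(seq1) - window + 1, 0)
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            score = sum(match if seq1[j + k] == seq2[i + k] else mismatch
                        for k in range(window))
            row.append(score >= threshold)
        plot.append(row)
    return plot


def render(plot):
    """Draw the plot as text: '*' for a dot, '.' for no dot."""
    return "\n".join("".join("*" if dot else "." for dot in row)
                     for row in plot)
```

Comparing the toy sequence ACGTACGT with itself using window=4 and threshold=20 (all four positions must match) gives:

```
*...*
.*...
..*..
...*.
*...*
```

The main diagonal is the sequence matching itself; the two isolated dots off the diagonal arise because ACGT occurs twice in the sequence.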
Figure 1: Dot plot of two coding DNA sequences. The alpha chain of human
hemoglobin is assigned to the horizontal axis and the beta chain of human
hemoglobin is assigned to the vertical axis.
Alignment of the A and B protein chains of deoxy
hemoglobin.
Dot plots are a very powerful method of comparing two sequences. They do not
bias the analysis in any way, which makes them an ideal
first-pass analysis method. Based on the dot plot the user can decide whether the
case is one of global, i.e. beginning-to-end, similarity or of local
similarity. Local similarity denotes the existence of similar regions between
two sequences that are embedded in overall sequences which otherwise lack
similarity. Sequences may contain regions of self-similarity, which are
frequently termed internal repeats. A dot plot comparison of a sequence with
itself will reveal internal repeats by displaying several parallel diagonals.
Exercise 1
1. Download and install
windot
(read the dotplot text file to see what to do)
2. Download the sequences for the A and B chains of hemoglobin
- Go to the PDB
- Enter 4HHB
- Go to sequence details
- Download the A and B chains as FASTA files
- Experiment with different window sizes and thresholds
Go to NCBI and get your pet protein sequence and
compare it to itself in the same way. Describe your findings.
{NOTE RED represents the minimum of what I expect to
see described in your final documentary}
BLAST
Tutorial 1
Tutorial 2
PSI-BLAST
Tutorial
Exercise 2
Run your pet protein sequence through PSI-BLAST using different scoring
matrices. Attempt to optimize the number of remote homologs found and report on
the biological implications of your findings. Include details of any interesting
alignments in your report.
Reading Materials
- Required Chapter 7 of A.D. Baxevanis and B.F.
Ouellette. "Bioinformatics: A Practical Guide to the Analysis of Genes
and Proteins". John Wiley, 1998.
- Optional Chapter 6 of Attwood and Parry-Smith
"Introduction to Bioinformatics". Addison Wesley Longman (UK),
1999. ISBN: 0 582 327881
Tools
- Molecular Biology Workbench
- General Purpose Sequence Analysis Tool
- Dotter program
- Database Searches
- General Information
Lists of many sites and tools