Pharm 207/Bio 207, Lecture 3: Using Internet Resources in Molecular Biology - Protein Motifs & Domains, also Using BioEditor
Lecturer: Phil Bourne
Lecture Outline
By now you should have BioEditor installed and accessible on a Windows machine. We will go over:
This lecture extends the ideas of sequence/sequence comparison to sequence/motif or sequence/family comparisons. These latter comparisons are often more informative because motifs and families are already associated with structural and functional information (as individual sequences often are not). Motif or family comparisons can also be more sensitive because motifs represent higher-level generalizations of the features that are important for a given structural or functional feature. As the sequence databases grow rapidly, there is a distinct trend away from the traditional sequence/sequence comparison and towards these comparisons based on multiple sequences.
A motif is also known as a fingerprint, site, pattern, or signature.
While sequence patterns are very useful, there are a number of protein families as well as functional or structural domains that cannot be detected using patterns due to their extreme sequence divergence. Typical examples of important functional domains, which are weakly conserved, are the globins, and the SH2 and SH3 domains. In such domains there are only a few sequence positions which are well conserved.
Techniques based on profiles or weight matrices (the two terms are used synonymously here) allow the detection of such proteins or domains. A profile is a table of position-specific amino acid weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or between parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence. A distinguishing feature between a pattern and a profile is that the former is usually confined to a small region with high sequence similarity, whereas the latter attempts to characterize a protein family or domain over its entire length.
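The scoring idea can be illustrated with a minimal Python sketch. It is a simplification, not the method of any particular profile database: gap costs are left out, so only ungapped windows are scored, and the function name `scan_profile`, the `default` weight for unlisted residues, and the toy weights are all invented for illustration.

```python
def scan_profile(profile, sequence, cutoff, default=-1):
    """Return (start, score) for every ungapped window scoring >= cutoff.

    profile is a list of dicts, one per position, mapping an amino acid
    to its weight. Residues not listed at a position get `default`.
    Gap costs are deliberately omitted in this sketch.
    """
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the position-specific weight of each residue in the window.
        score = sum(column.get(residue, default)
                    for column, residue in zip(profile, window))
        if score >= cutoff:
            hits.append((start, score))
    return hits


# Hypothetical three-position profile with made-up weights.
toy_profile = [
    {"G": 3, "A": 1},
    {"K": 2, "R": 2},
    {"S": 3, "T": 2},
]

hits = scan_profile(toy_profile, "MAGKSLGRT", cutoff=6)
print(hits)  # [(2, 8), (6, 7)]
```

Raising the cut-off to 8 keeps only the window starting at index 2, which shows how the cut-off value decides what counts as a motif occurrence.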
The PA (PAttern) lines contain the definition of a PROSITE pattern. The patterns are described using the following conventions:
The standard IUPAC one-letter codes for the amino acids are used.
The symbol 'x' is used for a position where any amino acid is accepted.
Ambiguities are indicated by listing the acceptable amino acids for a given position between square brackets '[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
Ambiguities are also indicated by listing between a pair of curly brackets '{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
Each element in a pattern is separated from its neighbor by a '-'.
Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
When a pattern is restricted to either the N- or C-terminus of a sequence, that pattern either starts with a '<' symbol or, respectively, ends with a '>' symbol. In some rare cases (e.g. PS00267 or PS00539), '>' can also occur inside square brackets for the C-terminal element. 'F-[GSTV]-P-R-L-[G>]' means that either 'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>' is accepted.
A period ends the pattern.
Examples:
PA [AC]-x-V-x(4)-{ED}.
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
PA <A-x-[ST](2)-x(0,1)-V.
This pattern, which must be at the N-terminus of the sequence ('<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
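Because these conventions map closely onto regular expressions, a pattern can be translated mechanically. The following Python sketch covers only the conventions listed above; the function name `prosite_to_regex` is hypothetical, and this is not the code used by PROSITE's own scanning tools.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE PA pattern into a Python regular expression."""
    body = pattern.strip().rstrip(".")       # a period ends the pattern
    pieces = []
    for element in body.split("-"):          # elements are separated by '-'
        prefix = suffix = ""
        if element.startswith("<"):          # restricted to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # restricted to the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                      # any amino acid
            atom = "."
        elif core.startswith("[") and ">" in core:
            # Rare case: '>' inside brackets means "or end of sequence".
            atom = "(?:[" + core[1:-1].replace(">", "") + "]|$)"
        elif core.startswith("{"):           # amino acids NOT accepted
            atom = "[^" + core[1:-1] + "]"
        else:                                # single residue or [..] set
            atom = core
        if low is not None:
            atom += "{" + low + ("," + high if high else "") + "}"
        pieces.append(prefix + atom + suffix)
    return "".join(pieces)


print(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))     # [AC].V.{4}[^ED]
print(prosite_to_regex("<A-x-[ST](2)-x(0,1)-V."))  # ^A.[ST]{2}.{0,1}V
```

The resulting string can be passed to `re.search` together with a protein sequence written in one-letter code. For example, the second pattern matches "AKSTV" but not "MAKSTV", because the match must begin at the first residue.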
Developing and searching a domain database isn't as simple as developing and searching a sequence database. You can't just simply throw a bag of sequences together in a flatfile. Some of the choices that must be made are:
What do you mean by "family"? -- what sequences go together? When do you break a big family into separate families?
There is no objective definition. The SCOP definition is now widely used: a "family" is clearly related by sequence similarity, a "superfamily" is composed of families whose sequence relationship isn't clear, but which are believed on structural and functional grounds to be homologous, and a "fold" is a group of superfamilies that share a common structural topology but are not necessarily homologous.
It is probably not well appreciated that the same problem has been faced by taxonomists. The idea of imposing strict classifications on life goes back to Aristotle; the idea of imposing strict but hierarchical classifications was popularized by Linnaeus; and all of it was driven by an essentially creationist viewpoint that did not recognize the continuity of evolving forms. Classifications are useful, but should not be taken too seriously. The idea of arguing over the number of sequence families, for instance, is faintly ridiculous. Just as in species taxonomy, protein sequence "taxonomists" vary from "lumpers" (that tend towards a few large families) to "splitters" (that tend toward many small families).
How will you define the bounds of the domain?
Again, there is no objective rule for this definition. "Independently folding structural unit" is a common definition of a protein domain, but it very much falls into the "I know it when I see it" class of definition.
Choice of computational representation.
The simplest way to search a domain database is the strategy used by SBASE -- annotate your sequence fragments for what family they belong to, and BLAST search against your annotated sequences.
Many domain databases make a consensus representation of the family in the hopes of increased sensitivity and efficiency. There is a plethora of different computational representations. This factor alone is probably the highest bar to integration of the domain databases.
patterns (regular expressions) -- PROSITE
blocks (ungapped HMMs) -- BLOCKS, PRINTS
annotated subsequences -- SBASE
consensus sequences -- PRODOM
full alignments -- PFAM, ISREC PROFILES, SMART
classification by means other than alignments -- PROCLASS
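As an illustration of one of the simpler representations in this list, a consensus sequence can be derived from a multiple alignment by taking the most common residue in each column. The Python sketch below is a generic majority-vote consensus on an invented toy alignment; it is not a description of how PRODOM or any other database actually builds its consensus sequences, and the function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Return the majority-vote consensus of equal-length aligned sequences."""
    result = []
    for column in zip(*alignment):           # walk the alignment column by column
        counts = Counter(residue for residue in column if residue != gap)
        # Use the most common residue; keep a gap if the column is all gaps.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


# Made-up alignment of four short fragments.
toy_alignment = ["GKSA", "GRSA", "GKTA", "AKS-"]
print(consensus(toy_alignment))  # GKSA
```

A consensus discards the per-column variation that a profile keeps, which is one reason the representations differ in sensitivity.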
A tradeoff between completeness and correctness.
Because family and domain definition are inherently somewhat arbitrary, and must be decided primarily based on what makes sense to human biologists, it is difficult to automate the creation of domain databases -- computers aren't smart enough.
On the other hand, there are a lot of domains out there, so it is also difficult to manually produce domain databases -- humans aren't fast enough (or rather, they get bored easily).
Therefore there is a tradeoff between automated/complete/low quality domain databases, and manual/incomplete/high quality domain databases.
The manually annotated databases still use automated tools, of course. PFAM is an example of an explicitly hybrid approach, in which "seed" alignments are manually curated and infrequently changed, but "full" alignments are automatically generated using the seeds, and can be rebuilt with every release of the main sequence databases.
Manual classification and alignments -- PROSITE, PRINTS, PFAM
Full automatic from other annotation -- BLOCKS, SBASE, PROCLASS
Full automatic (clustering approaches) -- PRODOM
How many different domains need to be modeled? (100 - 10,000)
InterPro - Integrated Resource of Protein Domains and Functional Sites
BLOCKS - BLOCKS db
Pfam - Protein families db (HMM derived) [Mirror at St. Louis (USA)]
PRINTS - Protein Motif fingerprint db
ProDom - Protein domain db (Automatically generated)
PROTOMAP - An automatic hierarchical classification of SWISS-PROT proteins
SBASE - SBASE domain db
SMART - Simple Modular Architecture Research Tool
TIGRFAMs - TIGR protein families db
Repeat this exercise on your "pet" protein and include in your BioEditor documentary.
Reading for motifs and patterns is covered in lecture 3. Additional Reading:
Falquet L, Pagni M, Bucher P, Hulo N, Sigrist CJ, Hofmann K, Bairoch A. The PROSITE database, its status in 2002. Nucleic Acids Res. 2002 Jan 1;30(1):235-8.
Corpet F, Gouzy J, Kahn D. Recent improvements of the ProDom database of protein domain families. Nucleic Acids Res. 1999 Jan 1;27(1):263-7.