Pharm 207/Bio 207

Lecture 3

Using Internet Resources in Molecular Biology - Lecture 3

Protein Motifs & Domains

also Using BioEditor 

Lecturer: Phil Bourne 
 
   

Lecture Outline

  • Using BioEditor (including Chime/Rasmol commands)
  • What is a motif?
  • What is a domain?
  • Methods for describing motifs and domains
  • Group exercise with a protein kinase
  • Describe your pet protein
 

 


 

Using BioEditor

By now you should have BioEditor installed and accessible on a Windows machine. We will go over how to use it, including Chime/Rasmol commands.

Motifs - Introduction

This lecture extends the ideas of sequence/sequence comparison to sequence/motif or sequence/family comparisons. These latter comparisons are often more informative because motifs and families are already associated with structural and functional information, whereas individual sequences often are not. Motif or family comparisons can also be more sensitive because motifs represent higher-level generalizations of the features that are important for a given structural or functional feature. As the sequence databases grow rapidly, there is a distinct trend away from the traditional sequence/sequence comparison and towards these comparisons based on multiple sequences.

A motif is also known as a fingerprint, site, pattern, or signature.

While sequence patterns are very useful, there are a number of protein families as well as functional or structural domains that cannot be detected using patterns due to their extreme sequence divergence. Typical examples of important functional domains, which are weakly conserved, are the globins, and the SH2 and SH3 domains. In such domains there are only a few sequence positions which are well conserved.

The use of techniques based on profiles or weight matrices (the two terms are used synonymously here) allows the detection of such proteins or domains. A profile is a table of position-specific amino acid weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or between parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence. A distinguishing feature between a pattern and a profile is that the former is usually confined to a small region with high sequence similarity, whereas the latter attempts to characterize a protein family or domain over its entire length.
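The scoring rule described above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm used by any particular profile database: it slides an ungapped window along the sequence and ignores gap costs entirely. The weights in PROFILE, the default score for unlisted residues, and the function names score_window and find_occurrences are all made up for the example.

```python
# Toy profile: one dict of residue weights per position (hypothetical values).
PROFILE = [
    {"G": 2, "A": 1},   # position 1 prefers Gly, tolerates Ala
    {"K": 2, "R": 1},   # position 2 prefers Lys, tolerates Arg
    {"S": 2, "T": 2},   # position 3 accepts Ser or Thr equally
]


def score_window(profile, window, default=-1):
    """Sum the position-specific weights for an ungapped window.

    Residues not listed at a position receive the (assumed) default penalty.
    """
    return sum(column.get(residue, default)
               for column, residue in zip(profile, window))


def find_occurrences(profile, sequence, cutoff):
    """Return (start, score) for every window scoring at or above the cut-off."""
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(profile, sequence[start:start + width])
        if score >= cutoff:
            hits.append((start, score))
    return hits


print(find_occurrences(PROFILE, "MAGKTLL", cutoff=4))  # [(2, 6)]
```

With a cut-off of 4, only the window GKT (starting at index 2, counting from 0) counts as a motif occurrence. A real profile would also carry gap costs and would be aligned to the sequence by dynamic programming rather than by a fixed-width window.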

Motifs - Example (from PROSITE)

The PA (PAttern) lines contain the definition of a PROSITE pattern. The patterns are described using the following conventions:

The standard IUPAC one-letter codes for the amino acids are used.

The symbol 'x' is used for a position where any amino acid is accepted.

Ambiguities are indicated by listing the acceptable amino acids for a given position between square brackets '[ ]'. For example: [ALT] stands for Ala or Leu or Thr.

Ambiguities are also indicated by listing between a pair of curly brackets '{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.

Each element in a pattern is separated from its neighbor by a '-'.

Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to x-x or x-x-x or x-x-x-x.

When a pattern is restricted to either the N- or C-terminus of a sequence, that pattern either starts with a '<' symbol or ends with a '>' symbol, respectively. In some rare cases (e.g. PS00267 or PS00539), '>' can also occur inside square brackets for the C-terminal element. 'F-[GSTV]-P-R-L-[G>]' means that either 'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>' is accepted.

A period ends the pattern.

Examples:

   PA   [AC]-x-V-x(4)-{ED}.

This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}

   PA   <A-x-[ST](2)-x(0,1)-V.

This pattern, which must be at the N-terminus of the sequence ('<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
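Because PROSITE patterns are essentially regular expressions, the conventions above can be translated mechanically. The following Python sketch converts a pattern (without the leading "PA" line code) into a regular expression for the standard re module. The function name prosite_to_regex is an assumption made for this example, and only the conventions listed in this section are handled.

```python
import re

# One pattern element: a residue, 'x', [..] or {..}, with optional (n) or (n,m).
ELEMENT = re.compile(
    r"(?P<core>[A-Zx]|\[[A-Z>]+\]|\{[A-Z]+\})"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regular expression."""
    body = pattern.strip().rstrip(".")          # a period ends the pattern
    anchored_start = body.startswith("<")       # '<' restricts to the N-terminus
    anchored_end = body.endswith(">")           # '>' restricts to the C-terminus
    body = body.lstrip("<").rstrip(">")

    parts = []
    for element in body.split("-"):             # elements are separated by '-'
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unrecognised pattern element: {element!r}")
        core = match.group("core")
        if core == "x":
            piece = "."                          # any amino acid
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"      # excluded residues
        elif core.startswith("[") and ">" in core:
            # '>' inside brackets: one of the residues, or the end of the sequence
            piece = "(?:[" + core[1:-1].replace(">", "") + "]|$)"
        else:
            piece = core                         # single residue or [ABC]
        lo, hi = match.group("lo"), match.group("hi")
        if lo is not None:
            piece += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(piece)

    return ("^" if anchored_start else "") + "".join(parts) + ("$" if anchored_end else "")


print(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))      # [AC].V.{4}[^ED]
print(prosite_to_regex("<A-x-[ST](2)-x(0,1)-V."))   # ^A.[ST]{2}.{0,1}V
```

The resulting string can be passed to re.search to scan a protein sequence written in one-letter code. For instance, the first pattern matches the made-up sequence MKAGVLLLLKQ starting at the Ala, but not MKAGVLLLLEQ, where the final position is Glu.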

Domains and Domain Databases

Developing and searching a domain database isn't as simple as developing and searching a sequence database. You can't simply throw a bag of sequences together in a flat file. Some of the choices that must be made are:

What do you mean by "family"? -- what sequences go together? When do you break a big family into separate families?

There is no objective definition. The SCOP definition is now widely used: a "family" is clearly related by sequence similarity, a "superfamily" is composed of families whose sequence relationship isn't clear, but which are believed on structural and functional grounds to be homologous, and a "fold" is a group of superfamilies that share a common structural topology but are not necessarily homologous.

It is probably not well appreciated that the same problem has been faced by taxonomists. The idea of imposing strict classifications on life goes back to Aristotle; the idea of imposing strict but hierarchical classifications was popularized by Linnaeus; and all of it was driven by an essentially creationist viewpoint that did not recognize the continuity of evolving forms. Classifications are useful, but should not be taken too seriously. The idea of arguing over the number of sequence families, for instance, is faintly ridiculous. Just as in species taxonomy, protein sequence "taxonomists" vary from "lumpers" (who tend towards a few large families) to "splitters" (who tend towards many small families).

How will you define the bounds of the domain?

Again, there is no objective rule for this definition. "Independently folding structural unit" is a common definition of a protein domain, but it very much falls into the "I know it when I see it" class of definition.

Choice of computational representation.

The simplest way to search a domain database is the strategy used by SBASE -- annotate your sequence fragments for what family they belong to, and BLAST search against your annotated sequences.

Many domain databases make a consensus representation of the family in the hope of increased sensitivity and efficiency. There is a plethora of different computational representations. This factor alone is probably the highest bar to integration of the domain databases.

patterns (regular expressions) -- PROSITE

blocks (ungapped HMMs) -- BLOCKS, PRINTS

annotated subsequences -- SBASE

consensus sequences -- PRODOM

full alignments -- PFAM, ISREC PROFILES, SMART

classification by means other than alignments -- PROCLASS
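To make one of the representations listed above concrete, here is a minimal Python sketch of a consensus sequence derived from an ungapped alignment by majority vote per column. It is only an illustration of the idea and is not the procedure used by ProDom or any other database; the alignment, the threshold min_fraction, the placeholder 'x', and the function name consensus are all assumptions made for the example.

```python
from collections import Counter


def consensus(alignment, min_fraction=0.6, unknown="x"):
    """Majority-vote consensus of equal-length, ungapped aligned sequences.

    A column contributes its most common residue only if that residue makes
    up at least min_fraction of the column; otherwise 'unknown' is emitted.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):               # iterate over alignment columns
        residue, count = Counter(column).most_common(1)[0]
        result.append(residue if count / len(column) >= min_fraction else unknown)
    return "".join(result)


# Hypothetical four-sequence alignment.
ALIGNMENT = ["GKSA", "GKTL", "GRSV", "AKTI"]
print(consensus(ALIGNMENT))  # GKxx
```

The first two columns are well conserved and survive in the consensus, while the last two are too variable and collapse to 'x'. This also shows what a consensus throws away compared with a profile or a full alignment: the information about which alternatives occur at the variable positions.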

A tradeoff between completeness and correctness.

Because family and domain definition are inherently somewhat arbitrary, and must be decided primarily based on what makes sense to human biologists, it is difficult to automate the creation of domain databases -- computers aren't smart enough.

On the other hand, there are a lot of domains out there, so it is also difficult to manually produce domain databases -- humans aren't fast enough (or rather, they get bored easily).

Therefore there is a tradeoff between automated/complete/low quality domain databases, and manual/incomplete/high quality domain databases.

The manually annotated databases still use automated tools, of course. PFAM is an example of an explicitly hybrid approach, in which "seed" alignments are manually curated and infrequently changed, but "full" alignments are automatically generated using the seeds, and can be rebuilt with every release of the main sequence databases.

Manual classification and alignments -- PROSITE, PRINTS, PFAM

Fully automatic from other annotation -- BLOCKS, SBASE, PROCLASS

Fully automatic (clustering approaches) -- PRODOM

How many different domains need to be modeled? (100 - 10,000)

Common Sites and Domain Web Resources

InterPro - Integrated Resource of Protein Domains and Functional Sites

BLOCKS - BLOCKS db

Pfam - Protein families db (HMM derived) [Mirror at St. Louis (USA)]

PRINTS - Protein Motif fingerprint db

ProDom - Protein domain db (Automatically generated)

PROTOMAP - An automatic hierarchical classification of SWISS-PROT proteins

SBASE - SBASE domain db

SMART - Simple Modular Architecture Research Tool

TIGRFAMs - TIGR protein families db

 

Exercise

  1. Go to the PDB site and select the protein kinase with PDB ID 2CPK
  2. Establish what motifs it has and how they relate to the structure
  3. What can you say about the domain architecture?

Repeat this exercise on your "pet" protein and include the results in your BioEditor documentary.

Reading Materials

Reading for motifs and patterns is covered in lecture 3. Additional Reading:

Falquet L, Pagni M, Bucher P, Hulo N, Sigrist CJ, Hofmann K, Bairoch A. The PROSITE database, its status in 2002. Nucleic Acids Res. 2002 Jan 1;30(1):235-8.

Corpet F, Gouzy J, Kahn D. Recent improvements of the ProDom database of protein domain families. Nucleic Acids Res. 1999 Jan 1;27(1):263-7.