Exercise 8 - Spring 2001 - BIMM 141

KEY

Points: 200 pts
30 pts A. Protein families and domains
15 pts B. Prosite signature database
40 pts C. Weight Matrix Patterns
115 pts D. Questions

{A. Protein families and domains.}

5 points {1. InterPro as a source of protein domain and family information.}

5 points {2. Look up protein domain information in ProDom.}

5 points {3. Look up protein fingerprint information in PRINTS.}

5 points {4. Look up protein family information using the Pfam database.}

5 points {5. Find out about your protein family/superfamily via the PIR/MIPS FamBase.}

5 points {6. Find out about your protein family/superfamily via the Stanford eMotif site.}

 

{B. PROSITE signature database.}

{1. Learn about PROSITE and how signatures are made by reading the user manual.}

5 points {2. Check to see if your sequences contain known PROSITE signatures.}

10 points {3. Use the GCG program MOTIFS to search your sequence for motifs.}
(5 points for doing at least one run)
(10 points for all three runs plus comparison)

 

{C. Weight matrix patterns}

5 points {1. Weight your sequences using the WEIGHT program.}

5 points {2. Create a profile from a group of aligned sequences.}

5 points {3. Search against database using PROFILESEARCH.}

5 points {4. Compare results to BLAST search with some of the single sequences.}

20 points {5. Use MEME server to learn motifs.}

 

{D. Questions.}

1. Briefly explain what is meant by a signature. If you check your sequence for all PROSITE signatures and find no matches, can you be confident that your sequence is unrelated to known structural and functional motifs? Explain. [15 pts]
A signature is a description of a sequence motif shared by a family of proteins. Signatures are constructed by aligning the proteins in a family and looking for regions that are conserved or similar.
No. If you do a PROSITE search and find no matches, your sequence may still carry a more diverged, less conserved version of a signature. You can search again allowing for mismatches. In the extreme case, you may have to "redefine" the signature if you are relatively certain your sequence should be part of the family.
+12 for defining "signature" and explaining how it is made.
+3 for answering what to do if you find "no matches".
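
As a rough illustration of what a pattern-type signature looks like in practice, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function names and the test sequence are invented for this example. The pattern is the PROSITE N-glycosylation site pattern, and only the basic syntax is handled (x, [..], {..}, repeat counts, and the < and > anchors at the ends of the pattern).

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")          # patterns end with a period
    parts = []
    for element in pattern.split("-"):     # elements are separated by '-'
        prefix = suffix = ""
        if element.startswith("<"):        # '<' anchors to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # '>' anchors to the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if m:
            element, repeat = m.group(1), "{" + m.group(2) + "}"
        if element == "x":                 # x = any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # {..} = residues NOT allowed
        else:
            core = element                 # single residue or [..] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def find_signature(pattern, sequence):
    """Return (1-based start, matched text) for non-overlapping exact matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]

# Hypothetical test sequence; N at position 8 is followed by P, so it is rejected.
hits = find_signature("N-{P}-[ST]-{P}.", "MKNGSAANPTQ")
print(hits)  # [(3, 'NGSA')]
```

An exact scan of this kind reports nothing for a sequence that differs from the pattern at even one position, which is why a "no match" result is weak evidence that the sequence is unrelated to the family.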
2. How are sequences clustered into families? What is the difference between a family and a superfamily? [5 pts]

Families are at least 50% identical. Superfamilies are related at a significance of p < 10^-3.

From the MIPS server:

Families/Superfamilies

The PIR-International Protein Sequence Database is structured by sequence homology. Proteins are clustered into homeomorphic protein families (50% sequence identity). Protein families are further clustered into protein superfamilies (~30% sequence identity). Complete sequences belong to the same protein (super)family when they are homologous from the amino end to the carboxyl end. Each completely sequenced protein belongs to exactly one protein superfamily. For each classified database entry you get

o the superfamily number
o the family number for all families that belong to the superfamily
o for each family: the list of all entries that are classified into that family
o the multiple alignment with annotated features (for families with more than one member)

Homology Domains

Regions of sequence homology that do not cover the full sequence length are annotated as homology domains (~30% sequence identity). A protein sequence may contain a number of different homology domains. Homology domains may be repeated within a protein sequence. For each database entry you get

o all homology domains annotated in the database entry
o for each homology domain: all other annotated occurrences of the homology domain in the database
o the multiple alignment (homology domain without neighbouring sequences!)
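
The identity thresholds quoted above can be made concrete with a small sketch. It assumes the two sequences are already aligned (gaps written as "-") and counts identity over ungapped columns only. The function names, the default cutoffs, and the toy sequences are illustrative assumptions. This is not a description of how PIR or MIPS actually cluster entries, which also requires homology from the amino end to the carboxyl end.

```python
def percent_identity(a, b):
    """Percent identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def classify(a, b, family_cutoff=50.0, superfamily_cutoff=30.0):
    """Label a pair using the ~50% and ~30% identity levels from the MIPS text."""
    pid = percent_identity(a, b)
    if pid >= family_cutoff:
        return "same family"
    if pid >= superfamily_cutoff:
        return "same superfamily"
    return "unclassified"

# Invented aligned sequences, 10 columns each.
ref = "GKSTLAGKST"
close = "GKSTLAGRSS"      # 8 of 10 columns identical
distant = "GRTTLPAKAA"    # 4 of 10 columns identical

print(percent_identity(ref, close), classify(ref, close))
print(percent_identity(ref, distant), classify(ref, distant))
```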

3. How does a profile differ from a simple frequency matrix? [5 pts]
A profile is a position-dependent weight matrix: residue weights and gap penalties are assigned separately for each position in the alignment. A simple frequency matrix, by contrast, assigns a single weight relating each residue to itself and to every other residue, independent of the position of the residue in the sequence, and uses a single position-independent gap penalty.
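
The contrast between position-dependent and position-independent scoring can be sketched in a few lines. This is a toy under stated assumptions: the alignment and query are invented, the position-specific weights are plain column fractions rather than the weights PROFILESEARCH would use, and gap penalties are left out entirely.

```python
from collections import Counter

# Toy gap-free alignment of a hypothetical family (invented for illustration).
alignment = ["GKSTL", "GKTTL", "GRSSL", "AKSTL"]

def position_specific_weights(aligned):
    """Build one residue -> weight table per alignment column.

    The weight is the fraction of family members with that residue
    in that column, so each position gets its own table.
    """
    n = len(aligned)
    return [
        {res: count / n for res, count in Counter(column).items()}
        for column in zip(*aligned)
    ]

def score_position_specific(weights, seq):
    """Sum the column-specific weight of each residue in seq."""
    return sum(col.get(res, 0.0) for col, res in zip(weights, seq))

def score_position_independent(reference, seq, match=1.0, mismatch=0.0):
    """Score seq against one reference using the same table at every position."""
    return sum(match if a == b else mismatch for a, b in zip(reference, seq))

weights = position_specific_weights(alignment)
query = "ARTSL"  # every residue occurs in the family at that column
profile_like_score = score_position_specific(weights, query)
single_table_score = score_position_independent(alignment[0], query)
print(profile_like_score, single_table_score)  # 2.0 1.0
```

The two scores are on different scales, so only the pattern matters: the position-specific tables give the query some credit at every column, while the single match/mismatch table credits only the final L.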
 
4. What is the purpose of weighting sequences when determining motifs and patterns? [5 pts]
In a collection of sequences used for determining motifs and patterns, some will be more closely related to each other than others are. A series of closely related sequences will bias the results toward the specific motif sequences found in these related sequences. To avoid this, each of these closely related sequences is weighted down, which lowers the importance of each of them. The same issue arises in building the BLOSUM matrices, where sequences are clustered at an identity threshold, giving BLOSUM80, BLOSUM62, etc.
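
One way to see the effect of down-weighting is the position-based scheme of Henikoff and Henikoff, sketched below. It is chosen here only for illustration; this key does not say which method the WEIGHT program uses, so treat the choice as an assumption. The alignment is invented: three identical sequences and one divergent one.

```python
from collections import Counter

def position_based_weights(aligned):
    """Position-based sequence weights, normalized to sum to 1.

    In each column a sequence receives 1 / (k * n), where k is the number
    of distinct residues in the column and n is the number of sequences
    sharing this sequence's residue there.
    """
    weights = [0.0] * len(aligned)
    for column in zip(*aligned):
        counts = Counter(column)
        k = len(counts)
        for i, res in enumerate(column):
            weights[i] += 1.0 / (k * counts[res])
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical gap-free alignment: three copies of one sequence plus an outlier.
family = ["GKSTL", "GKSTL", "GKSTL", "ARTSL"]
w = position_based_weights(family)
print([round(x, 3) for x in w])  # [0.183, 0.183, 0.183, 0.45]
```

The three identical sequences together carry 0.55 of the total weight instead of 0.75, so the divergent sequence is no longer swamped.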
 
5. Explain why a profile or motif based search is likely to give more complete results than a single-sequence search. [5 pts]
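A profile contains information obtained from all of the sequences used to build it, whereas a single sequence contains only the information found in that sequence. A profile search can therefore pick up family members that resemble the family as a whole but have diverged too far from any one query sequence to be found by a single-sequence search.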
 
6. MEME motifs do not contain gaps. Why is this not a severe problem in using MEME to describe the conserved motifs in a family of sequences? [5 pts]
MEME would simply generate two distinct motifs if there were a motif that contains a gap. These two motifs would routinely be found adjacent to each other in the same set of proteins, thereby indicating that they are each part of a single motif.
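
The idea that two ungapped motifs flanking a variable spacer still mark one underlying feature can be checked with a sketch like the following. The motif strings and sequences are invented, and literal strings stand in for the position-specific matrices that MEME actually reports; the function name is hypothetical.

```python
import re

def spacer_lengths(sequences, motif_a, motif_b):
    """For each sequence, return the number of residues between the first
    motif_a and the next motif_b after it (None if either is missing)."""
    result = {}
    for name, seq in sequences.items():
        a = re.search(motif_a, seq)
        # Search only downstream of motif_a, so b.start() is the spacer length.
        b = re.search(motif_b, seq[a.end():]) if a else None
        result[name] = b.start() if b else None
    return result

# Hypothetical sequences: the two half-motifs are separated by 2, 1, and 4 residues.
proteins = {
    "p1": "MMGKSTAAHELLQ",
    "p2": "MGKSTAHELLQQ",
    "p3": "GKSTAAAAHELL",
    "p4": "MMMMHELL",      # lacks the first half-motif
}
spacers = spacer_lengths(proteins, "GKST", "HELL")
print(spacers)  # {'p1': 2, 'p2': 1, 'p3': 4, 'p4': None}
```

When both halves turn up in the same order in most family members, with only the spacer length varying, that is the signal that they belong to one gapped motif.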
 
7. What is InterPro and, in one or two sentences, what is its function and objectives? [5 pts]
InterPro is an integrated documentation resource for protein families, domains, and functional sites, bringing together the efforts of the PROSITE, PRINTS, Pfam and ProDom database projects all in one site.
 
8. What does the "Graphical" link under "Matches" in your InterPro entry do? [5 pts]
This links to a graphic of the protein showing its motifs and domains, with image-map links to these motifs in the PROSITE, Pfam, and PRINTS databases.
 
9. What are the differences and similarities between Pfam, PRINTS, PROSITE, and ProDom? [10 pts]
Pfam is a database of multiple sequence alignments and HMMs covering many common protein domains.
PRINTS is a compendium of protein 'fingerprints'. A fingerprint is a group of conserved motifs used to characterize a protein family. These fingerprints can encode protein secondary structure folds, as well as functional motifs.
PROSITE is a database of protein families and domains that uses patterns (regular expressions) or profiles to describe a given motif or domain. Proteins containing that motif or domain are described; these comprise the protein "family" containing a given motif or domain.
ProDom is a French database of protein domains, shown graphically (when the server is working).
 
10. How do the alignments at ProDom compare with your alignments in Exercise 7? [5 pts]
Statement with some appropriate words comparing these.

11. How do the family trees at ProDom compare with those you obtained in Exercise 7? [5 pts]
Statement with some appropriate words comparing these.
 
12. Some SwissProt entries have more than one PRINTS entry. Why is this? [5 pts]
 
13. How do the alignments at PRINTS compare with your alignments in Exercise 7? [5 pts]
Statement with some appropriate words comparing these.

14. How do the family trees at PRINTS compare with those you obtained in Exercise 7? [5 pts]
Statement with some appropriate words comparing these.
 
15. How do the alignments at Pfam compare with your alignments in Exercise 7? [5 pts]
Statement with some appropriate words comparing these.

16. How do the family trees at Pfam compare with those you obtained in Exercise 7? [5 pts]
Statement with some appropriate words comparing these.
 
17. Comments on the use of Jalview? What is Jalview? [5 pts]
Jalview is a graphical alignment viewer written in Java, used for example to visualize Pfam alignments. It has a lot of capability, including editing, selecting subsets of sequences, sorting, and redrawing the alignment in color according to several sets of criteria (amino acid type, hydrophobicity, secondary structure propensity, etc.).

18. How do the alignments at FamBase in PIR compare with your alignments above? [5 pts]
Statement with some appropriate words comparing these.

19. What is the output of eMotif? [5 pts]
Output from eMotif is textual and depends on which eMotif program is used (eMotif Maker, eMotif Scan, or eMotif Search). Output usually contains the sequence, a regular expression describing the motif, an E-value describing the statistical significance of finding the motif in a given sequence, the SwissProt sequence name with a link, and so on.

20. What does the "eMotif Maker" application do? [5 pts]
eMotif Maker takes aligned sequences as input and outputs the motifs found in them, indicating the number of matches among the aligned sequences and the number expected. Many of the motifs it finds are essentially identical to each other.