Gribskov & Smith

 BIMM 141 Laboratory

Spring 2001

Introduction to Bioinformatics


 

Exercise 8


Protein Families and Pattern Analysis


This exercise takes a brief look at protein families, regular expression descriptions of motifs, and weight matrix patterns. All three are major types of information derived from multiple alignments, the main topic of Exercise 7.

The main topics addressed in this exercise are:



BIMM 140: | Main | 140_Info | Syllabus | Lectures | Exams | DNASYSTEM | CMS MBR |
BIMM 141: | Main | 141_Info | Syllabus | Exercises | DNASYSTEM | CMS MBR |



Specific Tasks to Perform in Exercise 8:

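A. Protein families and domains
1. InterPro as a source of protein domain and family information
2. Look up protein domain information in ProDom
3. Look up protein fingerprint information in PRINTS
4. Look up protein family information using the Pfam database
5. Find out about your protein family/superfamily via the PIR/MIPS FamBase
6. Find out about your protein family/superfamily via the Stanford eMotif site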
B. PROSITE signature database
1. Learn about PROSITE and how signatures are made by reading the user manual
2. Check to see if your sequences have known PROSITE signatures
3. Use the GCG program MOTIFS to search your sequence for motifs
C. Weight matrix patterns
1. Weight your sequences using the WEIGHT program
2. Create a profile from a group of aligned sequences.
3. Search against database using PROFILESEARCH
4. Compare results to BLAST search with single sequences.
5. Use MEME server to learn motifs.
D. Questions

{A. Protein families and domains.}

As the total number of sequenced proteins increases and interest expands in proteome analysis, there is an ongoing effort to organize proteins into families and to describe their component domains and motifs. We examine here some of the developing resources in this area.

You should use the same family of sequences you used for Exercise 7.
 

{1. InterPro as a source of protein domain and family information.}
Go to the InterPro Web site, and peruse the Documentation page to learn about InterPro.
Get information from InterPro about one of the SwissProt protein sequences in the family you used in Exercise 7 by entering its SwissProt Accession Number in the InterPro "Text Search".
Briefly describe in your notebook the information available.
Examine and comment on a couple of the links available.
In particular, examine the "Graphical" link at the bottom of the Web page under "Matches".
Comment on what you did and found in your notebook Web page.
 

{2. Look up protein domain information in ProDom.}
Using the same SwissProt protein sequence as above, you can get the ProDom entry by simply looking up the sequence entry at the ExPASy site, and following the link marked "PRODOM [Domain Structure]".
Alternatively, follow the "Databases" link on the InterPro home page to reach ProDom, and then "search" using a SwissProt Accession Number or an InterPro domain Accession Number.

Be patient with ProDom; it is notoriously slow!

Look at some of the alignments for a given domain.
Also look at some of the cartoons used for different domains.
Comment on what you did and found in your notebook Web page.
 

{3. Look up protein fingerprint information in PRINTS.}
Using the same SwissProt protein sequence as above, you can get the PRINTS entry by simply looking up the sequence entry at the ExPASy site, and following the link marked "PRINTS [Fingerprint]".
Alternatively, follow the "Databases" link on the InterPro home page to reach PRINTS, and then "search" using a SwissProt Accession Number or other text feature.

Look at some of the alignments for a given domain.
Try using the Cinema multiple sequence alignment tool.
Also look at the "relations" for your sequence.
Comment on what you did and found in your notebook Web page.
 

{4. Look up protein family information using the Pfam database.}
Go to the Wash U. mirror site of the Pfam database of protein domains and HMMs. As above, you can also go directly to the Pfam family you are examining via a link from an ExPASy SwissProt protein entry. Examine the information available for your Pfam family.

Examine also the Pfam alignment for the family members.
Try viewing the Pfam alignment using Jalview.
Comment on what you did and found in your notebook Web page.
 

{5. Find out about your protein family/superfamily via the PIR/MIPS FamBase.}
Here we will use the FamBase classification of protein families and superfamilies found at the PIR server at Georgetown. Look up your protein family/superfamily by first going to the PIR Web page for your protein, and then following the links at the bottom to "Alignments" and "iProClass".
Have a look at the multiple alignments using the "PIR Alignment" entries from the links above.
Comment on what you did and found in your notebook Web page.
 

{6. Find out about your protein family/superfamily via the Stanford eMotif site.}
Here we will use the eMotif site at Stanford to look for motifs.
Use "eMotif Search" and the raw sequence of your 'typical' entry for your family/superfamily.
Have a look at the "eMotif Maker" application.
Comment on what you did and found in your notebook Web page.
 
 

{B. PROSITE signature database.}

The PROSITE database is a collection of "signatures" specific for structural or functional features. Here we use the main Web site for PROSITE, located at the ExPASy server in Switzerland.

{1. Learn about PROSITE and how signatures are made by reading the user manual.}
 

{2. Check to see if your sequences contain known PROSITE signatures.}

This is most easily done by looking up the sequence in SWISS-PROT and looking for links to PROSITE. Follow these links to see a description of the feature (PDOC) and the actual feature definition (PS). The PS entry is, incidentally, one of the fastest ways to discover large groups of proteins that have related domains (look in the DR fields). If your sequence does not have any PROSITE matches, browse several PDOC and PS entries instead.
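
A PROSITE pattern can be read as a regular expression. The following is a minimal Python sketch of that translation; it is not the code used by PROSITE or GCG. It assumes the usual pattern syntax: elements separated by "-", x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, and < or > for the sequence ends. The function names and the test sequence are invented for illustration; the pattern shown is the N-glycosylation site signature.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}.' into a Python regex.

    Simplification: '<' or '>' inside square brackets is not handled.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        # [ABC] and single letters are already valid regex syntax.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)


def find_signature(pattern, sequence):
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets overlapping matches be reported.
    return [(m.start() + 1, m.group(1))
            for m in re.finditer("(?=(%s))" % regex, sequence)]


# Made-up test sequence, for illustration only.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_signature("N-{P}-[ST]-{P}.", "MKNGSAANPTQNKTW"))
```

In the test sequence, the N at position 8 is followed by P and so is not reported, which is the effect of the {P} element.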
 

{3. Use the GCG program MOTIFS to search your sequence for motifs.}

The MOTIFS program is located in the "Protein Analysis" menu. Perform your first run using the default settings.

There are two main options you can use with MOTIFS. First, the default mode excludes PROSITE signatures that correspond to very frequent motifs, such as potential glycosylation sites. Try a second run that includes them by clicking on "search includes patterns that are frequently found in many proteins". Second, you can relax the stringency of the search by allowing mismatches (i.e., match the signature at all but n positions). Try a third run with frequent patterns excluded and the "Number of allowed mismatches" set to 1. Compare the results of these three searches.
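
To make the idea of "matching at all but n positions" concrete, here is a small Python sketch. It is only an illustration of the concept; how MOTIFS implements mismatches is not described in this exercise. The sketch assumes a fixed-length PROSITE pattern (no (n,m) ranges and no < or > anchors), and the function names and test sequence are hypothetical.

```python
import re


def expand_positions(pattern):
    """Expand a fixed-length PROSITE pattern into one regex per residue position."""
    positions = []
    for element in pattern.strip().rstrip(".").split("-"):
        if "," in element or "<" in element or ">" in element:
            raise ValueError("only fixed-length, unanchored patterns are supported")
        m = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        core, count = m.group(1), int(m.group(2) or 1)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        positions.extend([core] * count)
    return positions


def search_with_mismatches(pattern, sequence, max_mismatches=0):
    """Return (1-based start, window, mismatch count) for each acceptable window."""
    positions = expand_positions(pattern)
    width = len(positions)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Count the pattern positions that the window fails to satisfy.
        mismatches = sum(1 for pos, res in zip(positions, window)
                         if not re.fullmatch(pos, res))
        if mismatches <= max_mismatches:
            hits.append((start + 1, window, mismatches))
    return hits


# Made-up sequence: an exact search finds one site, one mismatch adds a second.
seq = "MKNGSAANPTQ"
print(search_with_mismatches("N-{P}-[ST]-{P}", seq, max_mismatches=0))
print(search_with_mismatches("N-{P}-[ST]-{P}", seq, max_mismatches=1))
```

The extra hit found with one mismatch (NPTQ) violates the {P} position, which shows how relaxing stringency can admit sites the signature was written to exclude.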
 
 


{C. Weight matrix patterns.}

Generating a profile requires a multiple sequence alignment in GCG MSF format. You may make one yourself using the GCG program PILEUP, or obtain one by looking up sequences in the MIPS superfamily classification. Five or six sequences, less than 80% identical to one another, would be ideal; as few as three should be OK. Limit the sequences to fewer than 500 residues. For this part of the exercise we will use the Profile Server at SDSC. At the time this exercise was posted, this server was temporarily unavailable. Since I have not been able to remedy this quickly, you may skip to C5.

{1. Weight your sequences using the WEIGHT program.}
You can either paste the entire MSF file (not just the sequences, the whole thing) into the box, or upload the file from your directory.

To upload from GCG, you must go to the "" window and use the upload function.

After running WEIGHT, view the file and use the Netscape "save" function to save a copy for your notebook. While you have the output from WEIGHT open, look at the top section where each sequence is listed. The last number is the sequence weight. Do these weights seem plausible to you?
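
To help judge whether weights are plausible, it can be useful to compute weights by hand for a toy alignment. The Python sketch below uses position-based weighting (the Henikoff and Henikoff scheme), chosen here only because it is simple; this exercise does not say which method the WEIGHT program uses, so do not expect identical numbers. The function name and the toy alignment are invented for illustration.

```python
from collections import Counter

GAP_CHARS = ".-~"


def position_based_weights(alignment):
    """Position-based sequence weights for a list of equal-length aligned sequences.

    In each column, a residue type shared by n sequences, in a column with
    k different residue types, contributes 1 / (k * n) to each of those
    sequences. Gap characters are skipped. Weights are scaled to sum to 1.
    """
    weights = [0.0] * len(alignment)
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        counts = Counter(res for res in column if res not in GAP_CHARS)
        k = len(counts)
        for i, res in enumerate(column):
            if res in counts:
                weights[i] += 1.0 / (k * counts[res])
    total = sum(weights)
    return [w / total for w in weights]


# Toy alignment: two identical sequences and one divergent sequence.
toy = ["AAAA", "AAAA", "CCCC"]
print(position_based_weights(toy))
```

The two identical sequences share their weight (0.25 each), while the divergent sequence receives 0.5, so the duplicated sequence does not dominate the alignment.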
 

{2. Create a profile from a group of aligned sequences.}
This requires using the next section of the SDSC Profile server, entitled ProfileMake. Your weighted alignment file should be available on the server.
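
As a rough picture of what a weight matrix contains, the Python sketch below builds a simple log-odds matrix from a weighted alignment. This is a simplified stand-in and not the ProfileMake calculation, which is not described in this exercise. The uniform background frequency, the pseudocount value, the function name, and the toy alignment are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_log_odds_matrix(alignment, weights, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2 odds score.

    Assumptions: uniform background frequencies, and a total pseudocount
    spread evenly over the 20 amino acids. Gap characters are ignored.
    """
    background = 1.0 / len(AMINO_ACIDS)
    matrix = []
    for col in range(len(alignment[0])):
        counts = {aa: pseudocount * background for aa in AMINO_ACIDS}
        for seq, weight in zip(alignment, weights):
            if seq[col] in counts:
                counts[seq[col]] += weight    # weighted residue count
        total = sum(counts.values())
        matrix.append({aa: math.log2(counts[aa] / total / background)
                       for aa in AMINO_ACIDS})
    return matrix


# Toy alignment with weights that sum to 1 (illustrative values).
toy = ["AC", "AC", "GC"]
toy_weights = [0.25, 0.25, 0.5]
matrix = build_log_odds_matrix(toy, toy_weights)
print(round(matrix[1]["C"], 2), round(matrix[1]["W"], 2))
print(round(matrix[0]["A"], 2), round(matrix[0]["G"], 2))
```

In the first column, A and G receive the same score even though A appears twice, because the sequence weights give the single G-containing sequence as much influence as the two A-containing sequences combined.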

{3. Search against a database using PROFILESEARCH.}
Use the profile you just generated and the "profilesearch application" with default parameters. The results are returned ranked by e-value. You should see all of the sequences you used in the multiple sequence alignment ranked near the top.
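
The idea of ranking database sequences against a weight matrix can be sketched in a few lines of Python. This sketch is much simpler than a real profile search: it allows no gaps, ranks by raw score, and computes no E-values. The hand-made matrix, the score for unlisted residues, and the sequence names are hypothetical.

```python
def best_window_score(matrix, sequence, unknown=-1.0):
    """Slide the matrix along the sequence; return (best score, 1-based start).

    Residues missing from a column's dict receive the `unknown` score.
    Returns None if the sequence is shorter than the matrix.
    """
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, unknown) for col, res in zip(matrix, window))
        if best is None or score > best[0]:
            best = (score, start + 1)
    return best


def rank_database(matrix, database):
    """Return (score, name, start) tuples sorted from best to worst score."""
    hits = []
    for name, seq in database.items():
        result = best_window_score(matrix, seq)
        if result is not None:
            hits.append((result[0], name, result[1]))
    return sorted(hits, reverse=True)


# Hand-made three-column matrix favouring the motif C-H-C (illustrative only).
toy_matrix = [{"C": 2.0}, {"H": 2.0}, {"C": 2.0}]
toy_database = {"seqA": "MKCHCL", "seqB": "MKAHAL", "short": "MK"}
for hit in rank_database(toy_matrix, toy_database):
    print(hit)
```

The sequence containing the full motif ranks first, the one with a partial match ranks below it, and the sequence shorter than the matrix is skipped.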
 

{4. Compare results to a BLAST search with some of the single sequences.}
Use one or more of the sequences you included in the profile to do a BLAST search.
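
One simple way to organize the comparison is to sort the hits into three groups: found by both searches, found only by the profile, and found only by BLAST. The Python sketch below assumes you have copied the hit identifiers from each result page into lists by hand; the function name and the identifiers shown are placeholders, not real results.

```python
def compare_hits(profile_hits, blast_hits):
    """Group hit identifiers by which search method found them."""
    profile_set, blast_set = set(profile_hits), set(blast_hits)
    return {
        "both": sorted(profile_set & blast_set),
        "profile_only": sorted(profile_set - blast_set),
        "blast_only": sorted(blast_set - profile_set),
    }


# Placeholder identifiers, for illustration only.
profile_hits = ["PROT_A", "PROT_B", "PROT_C", "PROT_D"]
blast_hits = ["PROT_A", "PROT_B", "PROT_E"]
for group, names in compare_hits(profile_hits, blast_hits).items():
    print(group, names)
```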
 

{5. Use the MEME server to learn motifs.}
Submit the same group of sequences to the MEME server (http://www.sdsc.edu/MEME) for automated motif learning. To do this you must first convert your sequences using the GCG TOFASTA program. Your results will be returned by email. Note that the MEME server submits jobs to a batch queue, and the results may take as long as a day to be returned, so do not leave this to the last minute. How do the results compare?
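
If you are curious what the conversion amounts to, the Python sketch below writes FASTA records from sequences held in memory, removing alignment gap characters along the way. It is not a replacement for TOFASTA and does not read MSF files; the function name, the line width, and the toy sequences are assumptions made for illustration.

```python
def to_fasta(aligned, line_width=60):
    """Return FASTA text for a dict of name -> (possibly gapped) sequence.

    Everything that is not a letter (for example the gap characters
    '.', '-' and '~') is dropped, and residues are upper-cased.
    """
    records = []
    for name, seq in aligned.items():
        residues = "".join(ch for ch in seq if ch.isalpha()).upper()
        lines = [residues[i:i + line_width]
                 for i in range(0, len(residues), line_width)]
        records.append(">" + name + "\n" + "\n".join(lines))
    return "\n".join(records) + "\n"


# Toy gapped sequences, for illustration only.
print(to_fasta({"s1": "MK..LV~A", "s2": "mk-lv"}, line_width=4), end="")
```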

{D. Questions.}

1. Briefly explain what is meant by a signature. If you check your sequence for all PROSITE signatures and find no matches, can you be confident that your sequence is unrelated to known structural and functional motifs? Explain. [15 pts]
2. How are sequences clustered into families? What is the difference between a family and a superfamily? [5 pts]
3. How does a profile differ from a simple frequency matrix?
4. What is the purpose of weighting sequences when determining motifs and patterns? [5 pts]
5. Explain why a profile or motif based search is likely to give more complete results than a single-sequence search. [5 pts]
6. MEME motifs do not contain gaps. Why is this not a severe problem in using MEME to describe the conserved motifs in a family of sequences? [5 pts]
7. What is InterPro and, in one or two sentences, what are its function and objectives? [5 pts]
8. What does the "Graphical" link under "Matches" in your InterPro entry do? [5 pts]
9. What are the differences and similarities between Pfam, PRINTS, PROSITE, and ProDom? [10 pts]
10. How do the alignments at ProDom compare with your alignments in Exercise 7? [5 pts]
11. How do the family trees at ProDom compare with those you obtained in Exercise 7? [5 pts]
12. Some SwissProt entries have more than one PRINTS entry. Why is this? [5 pts]
13. How do the alignments at PRINTS compare with your alignments in Exercise 7? [5 pts]
14. How do the family trees at PRINTS compare with those you obtained in Exercise 7? [5 pts]
15. How do the alignments at Pfam compare with your alignments in Exercise 7? [5 pts]
16. How do the family trees at Pfam compare with those you obtained in Exercise 7? [5 pts]
17. What is Jalview? Comment on your use of it. [5 pts]
18. How do the alignments at FamBase in PIR compare with your alignments above? [5 pts]
19. What is the output of eMotif? [5 pts]
20. What does the "eMotif Maker" application do? [5 pts]
 
 
 
 
 





Latest modification: Spring, 2001

If you have problems or questions, send email to Michael Gribskov or Doug Smith or Hiren Patel