Gribskov & Smith
BIMM 141 Laboratory
Spring 2001
Introduction to Bioinformatics
This exercise takes a brief look at protein families, regular expression descriptions of motifs, and weight matrix patterns. These are among the main kinds of information derived from multiple alignments, a major topic of Exercise 7.
The main topics addressed in this exercise are:
As the total number of sequenced proteins increases and interest expands in proteome analysis, there is an ongoing effort to organize proteins into families and to describe their component domains and motifs. We examine here some of the developing resources in this area.
You should use the same family of sequences you used for Exercise
7.
{1. InterPro as a source of protein
domain and family information.}
Go to the InterPro
Web site, and peruse the Documentation page to learn about InterPro.
Get information from InterPro about one of the SwissProt protein
sequences in your family of sequences used in Exercise 7 by using
the SwissProt Accession Number as a text search item in the InterPro
"Text Search".
Briefly describe in your notebook the information available.
Examine and comment on a couple of the links available.
In particular, examine the "Graphical" link at the bottom
of the Web page under "Matches".
Comment on what you did and found in your notebook Web page.
{2. Look up protein domain information
in ProDom.}
Using the same SwissProt protein sequence as above, you can get
the ProDom entry by simply looking up the sequence entry at the
ExPASy site, and following the link marked "PRODOM [Domain
Structure]".
Alternatively, you can use the link from InterPro by going to
the InterPro "Databases" from the InterPro Home page
to get to ProDom and then "search" using a SwissProt
Accession Number or InterPro domain Accession Number.
Be patient with ProDom; it is notoriously slow.
Look at some of the alignments for a given domain.
Also look at some of the cartoons used for different domains.
Comment on what you did and found in your notebook Web page.
{3. Look up protein fingerprint information
in PRINTS.}
Using the same SwissProt protein sequence as above, you can get
the PRINTS entry by simply looking up the sequence entry at the
ExPASy site, and following the link marked "PRINTS [Fingerprint]".
Alternatively, you can use the link from InterPro by going to
the InterPro "Databases" from the InterPro Home page
to get to PRINTS and then "search" using a SwissProt
Accession Number or other text feature.
Look at some of the alignments for a given domain.
Try using the Cinema multiple sequence alignment tool.
Also look at the "relations" for your sequence.
Comment on what you did and found in your notebook Web page.
{4. Look up protein family information
using the Pfam database.}
Go to the Wash U. mirror
site of the Pfam database of protein domains and HMMs.
As above, you can also go directly to the Pfam family of proteins
you are examining via a link from an ExPASy SwissProt protein
entry. Examine the information available for your Pfam family.
Examine also the Pfam alignment for the family members.
Try viewing the Pfam alignment using Jalview.
Comment on what you did and found in your notebook Web page.
{5. Find out about your protein family/superfamily
via the PIR/MIPS FamBase.}
Here we will use the FamBase classification of protein families
and superfamilies found at the PIR server in Georgetown. Look up your protein
family/superfamily by first going to the PIR Web page for your
protein, and then following the links at the bottom to "Alignments"
and "iProClass".
Have a look at the multiple alignments using the "PIR Alignment"
entries from the links above.
Comment on what you did and found in your notebook Web page.
{6. Find out about your protein family/superfamily
via the Stanford eMotif site.}
Here we will use the eMotif site at Stanford to look for motifs.
Use "eMotif Search" and the raw sequence of your 'typical'
entry for your family/superfamily.
Have a look at the "eMotif Maker" application.
Comment on what you did and found in your notebook Web page.
The PROSITE database is a collection of "signatures" specific for structural or functional features. Here we use the main Web site for PROSITE, located at the ExPASy server in Switzerland.
{1. Learn about PROSITE and how signatures
are made by reading the user manual.}
{2. Check to see if your sequences contain known PROSITE signatures.}
This is most easily done by simply looking up the sequence
in SWISS-PROT and looking for links to PROSITE. Follow these links
to see a description of the feature (PDOC) and the actual feature
definition (PS). The PS entry is, by the way, one of the fastest
ways to discover large groups of proteins that have related domains
(look in the DR fields). If your sequence does not have any PROSITE
matches, just browse around and look at several PDOC and PS entries.
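PROSITE patterns use their own notation, but most of them translate directly into regular expressions. As a rough illustration only, the sketch below converts a PROSITE-style pattern into a Python regular expression and searches a sequence with it. The function names and the test sequence are made up for this example, and only the common elements of the notation are handled (x, [..], {..}, repeat counts, and the < and > anchors). The pattern used is the N-glycosylation site pattern, N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; it covers only the common syntax.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        # [ABC] and single letters are already valid regex syntax.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_matches(sequence, pattern):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# Made-up test sequence; positions are 0-based.
print(prosite_to_regex("N-{P}-[ST]-{P}."))              # N[^P][ST][^P]
print(find_matches("MKNGTAANPSA", "N-{P}-[ST]-{P}."))   # [(2, 'NGTA')]
```

The second N in the test sequence is followed by P, which the pattern forbids, so only the first site is reported.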
{3. Use the GCG program MOTIFS to search your sequence for motifs.}
The MOTIFS program is located in the "Protein Analysis" menu. Perform your first run using the default settings.
There are two main options you can use with MOTIFS. Firstly,
the default mode is to exclude PROSITE signatures that correspond
to very frequent motifs such as potential glycosylation sites.
Try a second run with this option turned off by clicking on "search
includes patterns that are frequently found in many proteins".
Secondly, you can relax the stringency of the search by allowing
mismatches (i.e. match the signature at all but n positions).
Try a run with frequent patterns excluded and the "Number
of allowed mismatches" set to 1. Compare the results of these
three searches.
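To get a feel for what "allowed mismatches" means, here is a small Python sketch that slides a pattern along a sequence and counts the positions that fail to match. It is not how MOTIFS is implemented; the function names and the test sequence are invented for this example, and it handles only fixed-length patterns (single residues, x, [..] and {..}, with no repeat counts).

```python
def parse_fixed_pattern(pattern):
    """Turn a simple PROSITE-style pattern into one test function per position."""
    tests = []
    for element in pattern.strip().rstrip(".").split("-"):
        if element == "x":                       # any residue matches
            tests.append(lambda aa: True)
        elif element.startswith("["):            # one of the listed residues
            allowed = set(element[1:-1])
            tests.append(lambda aa, s=allowed: aa in s)
        elif element.startswith("{"):            # anything except these residues
            forbidden = set(element[1:-1])
            tests.append(lambda aa, s=forbidden: aa not in s)
        else:                                    # exactly this residue
            tests.append(lambda aa, c=element: aa == c)
    return tests


def find_with_mismatches(sequence, pattern, max_mismatches=0):
    """Return (start, window, mismatches) for windows within the mismatch limit."""
    tests = parse_fixed_pattern(pattern)
    hits = []
    for start in range(len(sequence) - len(tests) + 1):
        window = sequence[start:start + len(tests)]
        mismatches = sum(not test(aa) for test, aa in zip(tests, window))
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# Made-up test sequence; positions are 0-based.
sequence = "MKNGTAANPSA"
print(find_with_mismatches(sequence, "N-{P}-[ST]-{P}", max_mismatches=0))
print(find_with_mismatches(sequence, "N-{P}-[ST]-{P}", max_mismatches=1))
```

With no mismatches only the window NGTA is reported; allowing one mismatch also reports NPSA, where the forbidden P is tolerated. This is why relaxed searches return more hits, many of them less convincing.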
Generation of a profile requires a multiple sequence alignment in GCG MSF format. You may make one yourself using the GCG program PILEUP, or by looking up sequences in the MIPS superfamily classification. Five or six sequences, each less than 80% identical to the others, would be ideal; as few as three should be OK. Limit the sequences to fewer than 500 residues.
For this part of the exercise we will use the Profile Server at SDSC. At the time this exercise was posted, the server was temporarily unavailable. Since I have not been able to quickly remedy this, you may skip to C5.
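If you want to check the 80% identity guideline for your own alignment, a short Python sketch like the one below may help. It assumes the aligned sequences are already available as equal-length strings, and it defines percent identity over the columns where both sequences have a residue; other definitions exist and give somewhat different numbers. The function names and the three toy sequences are invented for this example.

```python
from itertools import combinations


def percent_identity(a, b, gap_chars=".-~"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = same = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in gap_chars or y in gap_chars:
            continue                  # skip columns containing a gap
        compared += 1
        same += (x == y)
    return 100.0 * same / compared if compared else 0.0


def pairs_above(alignment, threshold=80.0):
    """List (name1, name2, identity) for pairs at or above the threshold."""
    result = []
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        identity = percent_identity(s1, s2)
        if identity >= threshold:
            result.append((n1, n2, identity))
    return result


# Toy alignment (made up); '.' marks a gap as in MSF files.
alignment = {
    "a": "MKT.AYIAK",
    "b": "MKTWAYLAK",
    "c": "MRS.GFLSR",
}
print(pairs_above(alignment))   # [('a', 'b', 87.5)]
```

Here sequences a and b are more than 80% identical, so one of them could be dropped in favor of a more distant family member.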
{1. Weight your sequences using the
WEIGHT program.}
You can either paste the entire MSF file (not just the sequences,
the whole thing) into the box, or upload the file from your directory.
To upload from GCG, you must go to the "" window and use the upload function.
After running WEIGHT, view the file and use the Netscape "save" function
to save a copy for your notebook.
While you have the output from WEIGHT open, look at the top section
where each sequence is listed. The last number is the sequence
weight. Do these weights seem plausible to you?
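To help judge whether weights are plausible, it can be useful to compute a set of weights by hand for a tiny alignment. The Python sketch below uses the position-based weighting scheme of Henikoff and Henikoff, which is a different and simpler method chosen for illustration; it is not claimed to be what the WEIGHT program does. The function name and the toy alignment are invented for this example.

```python
from collections import Counter


def position_based_weights(alignment, gap_chars=".-~"):
    """Position-based sequence weights, normalized to sum to 1.

    In each column a sequence with residue r receives 1 / (k * n_r), where
    k is the number of different residues in the column and n_r is the
    number of sequences that have r there. Gap positions contribute nothing.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa not in gap_chars)
        k = len(counts)
        for i, aa in enumerate(column):
            if aa not in gap_chars:
                weights[i] += 1.0 / (k * counts[aa])
    total = sum(weights)
    return [w / total for w in weights]


# Toy alignment (made up) of four short sequences.
alignment = ["GYVGS", "GFDGF", "GYDGF", "GYQGG"]
for seq, w in zip(alignment, position_based_weights(alignment)):
    print(seq, round(w, 3))
```

In this toy case the third sequence, which shares the most residues with the others, gets the smallest weight (0.2, against about 0.27 for the rest). Whatever method is used, closely related sequences should generally receive lower weights than divergent ones.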
{2. Create a profile from a group of
aligned sequences. }
This requires using the next section of the SDSC Profile server,
entitled ProfileMake. Your weighted alignment file should be available on the server.
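The idea of turning a weighted alignment into a position-specific scoring table can be sketched in a few lines of Python. This is a deliberately simplified log-odds weight matrix with a uniform background and a small pseudocount; ProfileMake itself may differ substantially, for example in how it treats substitutions and gaps. All names, the weights, and the toy alignment are assumptions made for this example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_log_odds_matrix(alignment, weights, pseudocount=1.0):
    """Return one {residue: log2-odds score} dictionary per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)      # uniform background (assumption)
    matrix = []
    for col in range(len(alignment[0])):
        # Start from a pseudocount spread evenly over all residues.
        counts = {aa: pseudocount * background for aa in AMINO_ACIDS}
        for seq, w in zip(alignment, weights):
            aa = seq[col].upper()
            if aa in counts:                 # skip gaps and ambiguity codes
                counts[aa] += w
        total = sum(counts.values())
        matrix.append({aa: math.log2((counts[aa] / total) / background)
                       for aa in AMINO_ACIDS})
    return matrix


def score(matrix, segment):
    """Sum the column scores for an ungapped segment of the same length."""
    return sum(col.get(aa, 0.0) for col, aa in zip(matrix, segment.upper()))


# Toy alignment and weights (made up); the weights sum to the number of sequences.
alignment = ["GYVGS", "GFDGF", "GYDGF", "GYQGG"]
weights = [1.07, 1.07, 0.80, 1.06]
matrix = build_log_odds_matrix(alignment, weights)
print(round(score(matrix, "GYDGF"), 2), round(score(matrix, "AAAAA"), 2))
```

A segment that resembles the family scores well above zero, while an unrelated segment scores below zero. A database search with a profile applies this kind of scoring, with gaps allowed, to every sequence in the database.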
{3. Search against the database using PROFILESEARCH.}
Use the profile you just generated and the "profilesearch application"
with default parameters. The results are
returned ranked by E-value. You should see all of the sequences
you used in the multiple sequence alignment ranked near the top.
{4. Compare results to BLAST search
with some of the single sequences. }
Use one or more of the sequences you included in the profile to
do a BLAST search.
{5. Use the MEME server to learn motifs.}
Submit the same group of sequences to the MEME server (http://www.sdsc.edu/MEME) for
automated motif learning. To do this you must first convert your
sequences using the GCG TOFASTA program. How do the results compare?
Your results
will be returned by email. Note that the MEME server submits jobs
to a batch queue and the results may take as long as a day to
be returned, so do not leave this to the last minute.
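For reference, the FASTA format that MEME expects is simple: a header line beginning with ">" followed by the unaligned residues. The Python sketch below shows the conversion in principle, assuming your aligned sequences are already available as name and string pairs; it is not a replacement for TOFASTA and does not read MSF files. The function name and the toy sequences are invented for this example.

```python
def to_fasta(sequences, width=60, gap_chars=".-~"):
    """Format {name: aligned_sequence} as FASTA text with gaps removed."""
    records = []
    for name, seq in sequences.items():
        # Drop gap symbols and whitespace; MEME takes unaligned sequences.
        residues = "".join(
            c for c in seq if c not in gap_chars and not c.isspace()
        ).upper()
        lines = [residues[i:i + width] for i in range(0, len(residues), width)]
        records.append("\n".join([">" + name] + lines))
    return "\n".join(records) + "\n"


# Toy aligned sequences (made up); '.' and '~' are gap symbols.
aligned = {"seq1": "MK..TAY", "seq2": "MKW~TAY"}
print(to_fasta(aligned, width=4), end="")
```

The width of 4 is used here only to show the line wrapping on such short sequences; 60 residues per line is a common choice.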
Latest modification: Spring, 2001
If you have problems or questions, send email to Michael Gribskov, Doug Smith, or Hiren Patel.