Exercise 8 - Spring 2001 - BIMM 141
KEY
Points: 200 pts
30 pts A. Protein families and domains
15 pts B. Prosite signature database
40 pts C. Weight Matrix Patterns
115 pts D. Questions
{A. Protein families and domains.}
5 points {1. InterPro
as a source of protein domain and family information.}
5 points {2. Look
up protein domain information in ProDom.}
5 points {3. Look
up protein fingerprint information in PRINTS.}
5 points {4. Look
up protein family information using the Pfam database.}
5 points {5. Find
out about your protein family/superfamily via the PIR/MIPS FamBase.}
5 points {6. Find
out about your protein family/superfamily via the Stanford eMotif
site.}
{B. PROSITE signature database.}
{1. Learn about PROSITE and how signatures are made by
reading the user manual.}
5 points {2. Check
to see if your sequences contain known PROSITE signatures.}
10 points {3.
Use the GCG program MOTIFS to search your sequence for motifs.}
(5 points for doing at least
one run)
(10 points for all three runs plus comparison)
{C. Weight matrix patterns }
5 points {1. Weight
your sequences using the WEIGHT program.}
5 points {2. Create
a profile from a group of aligned sequences. }
5 points {3. Search
against database using PROFILESEARCH }
5 points {4. Compare
results to BLAST search with some of the single sequences. }
20 points {5. Use
MEME server to learn motifs. }
{D. Questions.}
- 1. Briefly explain what is meant by a signature. If you check
your sequence for all PROSITE signatures and find no matches,
can you be confident that your sequence is unrelated to known
structural and functional motifs? Explain. [15
pts]
- A signature is a description of a motif or sequence pattern
shared by a family of proteins. Signatures are constructed by
aligning the proteins in a family and looking for regions
that are conserved or similar.
If you do a PROSITE search and find no matches, you cannot be
confident that your sequence is unrelated to known motifs: it may
carry a more derived (less conserved) version of a signature.
You can search again allowing for mismatches.
In the extreme case, you may have to "redefine" the
signature if you are relatively certain your sequence should
be part of the family.
+12 for defining "signature" and explaining how it is made.
+3 for answering what to do if you find "no matches".
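For illustration, a signature written as a PROSITE-style pattern can be turned into a regular expression and scanned against a sequence. The sketch below is a minimal Python example, not the PROSITE or GCG MOTIFS implementation; the function names `prosite_to_regex` and `scan` are hypothetical, the zinc-finger-like pattern and the test sequence are illustrative only, and only the basic syntax elements (`x`, `[..]`, `{..}`, repeat counts, `<` and `>` anchors) are handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper: handles x, [ABC], {ABC}, (n) / (n,m) repeats,
    and the '<' (N-terminal) and '>' (C-terminal) anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as x(3) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        # '[ABC]' and single letters are already valid regex syntax.
        quantifier = "{" + repeat + "}" if repeat else ""
        parts.append(prefix + core + quantifier + suffix)
    return "".join(parts)

def scan(pattern: str, sequence: str):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]

if __name__ == "__main__":
    # Illustrative pattern and made-up sequence (assumptions, not real data).
    pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
    sequence = "MK" + "CAACAAALAAAAAAAAHAAAH" + "GG"
    print(prosite_to_regex(pattern))
    print(scan(pattern, sequence))
```

An exact scan like this reports nothing for a sequence carrying a diverged version of the motif, which is why a "no matches" result is not conclusive.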
- 2. How are sequences clustered into families? What is the
difference between a family and a superfamily? [5
pts]
Families are at least 50% identical.
Superfamilies are related at a significance of p < 10^-3.
From the MIPS server:
Families/Superfamilies
The PIR-International Protein Sequence
Database is structured by sequence homology. Proteins are clustered
into homeomorphic protein families (50% sequence identity). Protein
families are further clustered into protein superfamilies (~30%
sequence identity). Complete sequences belong to the same protein
(super)family when they are homologous from the amino end to the
carboxyl end. Each completely sequenced protein belongs to exactly
one protein superfamily. For each classified database entry you
get:
o the superfamily number
o the family number for all families that belong to the superfamily
o for each family: the list of all entries that are classified into
that family
o the multiple alignment with annotated features (for families with
more than one member)
Homology Domains
Regions of sequence homology that do
not cover the full sequence length are annotated as homology domains
(~30% sequence identity). A protein sequence may contain a number
of different homology domains. Homology domains may be repeated
within a protein sequence. For each database entry you get:
o all homology domains annotated in the database entry
o for each homology domain: all other annotated occurrences of the
homology domain in the database
o the multiple alignment (homology domain without
neighbouring sequences!)
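The identity thresholds quoted above can be made concrete with a small calculation. This is a minimal Python sketch, not the PIR/MIPS clustering procedure: the function names are hypothetical, identity is computed over aligned columns where neither sequence has a gap (one of several conventions, assumed here), and the 50% and 30% cutoffs are taken from the quoted text. The real classification also requires homology from the amino end to the carboxyl end, which this sketch does not check.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences of equal length.

    Columns where either sequence has a gap ('-') are skipped
    (an assumed convention for this sketch).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

def classify(identity: float,
             family_cutoff: float = 50.0,
             superfamily_cutoff: float = 30.0) -> str:
    """Label a pair by the identity cutoffs quoted from the MIPS text."""
    if identity >= family_cutoff:
        return "family"
    if identity >= superfamily_cutoff:
        return "superfamily"
    return "unclassified"

if __name__ == "__main__":
    # Made-up aligned fragments, for illustration only.
    ident = percent_identity("AC-EF", "ACDEY")
    print(ident, classify(ident))
```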
- 3. How does a profile differ from a simple frequency matrix?
[5 pts]
A profile is a position-dependent
weight matrix (weights and gap penalties are assigned for each
and every position), whereas a simple frequency matrix assigns
a single weight relating each residue to itself and to every other
residue, independent of the position of the residue in the sequence,
and uses single position-independent gap penalties.
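The position dependence can be seen in a small sketch that builds per-column log-odds scores from an ungapped alignment. This is an illustrative Python example, not the GCG profile algorithm: the function names are hypothetical, the uniform background frequency and the pseudocount of 1 are assumptions, and position-specific gap penalties are left out entirely.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes an ungapped alignment and a uniform background (1/20).
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum the position-specific scores of a segment the length of the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))

if __name__ == "__main__":
    # Made-up three-column alignment, for illustration only.
    profile = build_profile(["ACD", "ACE", "ACD", "GCD"])
    # The same residue scores differently depending on the column.
    print(profile[0]["C"], profile[1]["C"])
    print(score(profile, "ACD"), score(profile, "WWW"))
```

The same residue (here C) receives a different score in each column, which is the property a position-independent matrix cannot express.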
-
- 4. What is the purpose of weighting sequences when determining
motifs and patterns? [5 pts]
- In a collection of sequences used for
determining motifs and patterns, some sequences will be more closely
related to each other than others are. A group of closely
related sequences will bias the results toward the specific motif
sequences found in that group. To avoid this, each
of these closely related sequences is weighted down, lowering
its individual importance. This is similar
to the issue in constructing BLOSUM matrices, where clustering of
closely related sequences gives the BLOSUM80, BLOSUM62, etc.
matrices.
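One way to see the effect of down-weighting is the position-based scheme of Henikoff and Henikoff, sketched below in Python. This is a swapped-in technique chosen for brevity and is not claimed to be what the GCG WEIGHT program computes; the function name is hypothetical, and gap characters are simply treated as another symbol (an assumption of this sketch).

```python
from collections import Counter

def position_based_weights(alignment):
    """Position-based sequence weights, normalized to sum to 1.

    In each column a sequence receives 1 / (k * n), where k is the number
    of distinct symbols in the column and n is the number of sequences
    sharing this sequence's symbol. Contributions are summed over columns.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        distinct = len(counts)
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (distinct * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    # Three identical sequences plus one divergent sequence (made-up data).
    alignment = ["ACDEF", "ACDEF", "ACDEF", "GHIKL"]
    print(position_based_weights(alignment))
```

In this example the three identical sequences share half of the total weight and the single divergent sequence receives the other half, so the redundant group no longer dominates the motif.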
-
- 5. Explain why a profile or motif based search is likely
to give more complete results than a single-sequence search.
[5 pts]
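- The profile contains information obtained
from all of the sequences comprising the profile, whereas
a single sequence contains only the information found in that
sequence. The profile can therefore detect family members that
differ too much from any one sequence to be found with it alone.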
-
- 6. MEME motifs do not contain gaps. Why is this not a severe
problem in using MEME to describe the conserved motifs in a family
of sequences? [5 pts]
- MEME would simply generate two distinct
motifs if there were a motif that contains a gap. These two motifs
would routinely be found adjacent to each other in the same set
of proteins, thereby indicating that they are each part of a
single motif.
-
- 7. What is InterPro and, in one or two sentences, what is
its function and objectives? [5 pts]
- InterPro is an integrated documentation
resource for protein families, domains, and functional sites,
bringing together the efforts of the PROSITE, PRINTS, Pfam and
ProDom database projects all in one site.
-
- 8. What does the "Graphical" link under "Matches"
in your InterPro entry do? [5 pts]
- This links to a graphic of the protein
showing motifs and domains, and providing image map links to
these motifs in the PROSITE, Pfam, and PRINTS databases.
-
- 9. What are the differences and similarities between Pfam,
PRINTS, PROSITE, and ProDom? [10 pts]
- Pfam is a database or collection of
multiple sequence alignments and HMMs covering many common protein
domains.
- PRINTS is a compendium of protein 'fingerprints',
groups of conserved motifs used to characterize a protein family.
These fingerprints can encode protein secondary structure folds, as well
as functional motifs.
- PROSITE is a database of protein families
and domains that uses patterns (regular expressions) or profiles
to describe a given motif or domain. Proteins containing that
motif or domain are described; these comprise the protein "family"
containing a given motif or domain.
- ProDom is a French database of protein
domains, shown graphically (when the server is working).
-
- 10. How do the alignments at ProDom compare with your alignments
in Exercise 7? [5 pts]
- Statement with some appropriate words
comparing these.
- 11. How do the family trees at ProDom compare with those you
obtained in Exercise 7? [5 pts]
- Statement with some appropriate words
comparing these.
-
- 12. Some SwissProt entries have more than one PRINTS entry.
Why is this? [5 pts]
-
- 13. How do the alignments at PRINTS compare with your alignments
in Exercise 7? [5 pts]
- Statement with some appropriate words
comparing these.
- 14. How do the family trees at PRINTS compare with those you
obtained in Exercise 7? [5 pts]
- Statement with some appropriate words
comparing these.
-
- 15. How do the alignments at Pfam compare with your alignments
in Exercise 7? [5 pts]
- Statement with some appropriate words
comparing these.
- 16. How do the family trees at Pfam compare with those you obtained
in Exercise 7? [5 pts]
- Statement with some appropriate words
comparing these.
-
- 17. Comments on the use of Jalview? What is Jalview? [5 pts]
- Jalview is a graphical alignment viewer written
in Java, used for example to visualize Pfam alignments. It has
many capabilities, including editing, selecting subsets of
sequences, sorting, and redrawing the alignment in color according
to several sets of criteria (amino acid type, hydrophobicity, secondary
structure propensity, etc.).
- 18. How do the alignments at FamBase in PIR compare with your
alignments above? [5 pts]
- Statement with some appropriate words
comparing these.
- 19. What is the output of eMotif? [5 pts]
- Output from eMotif is textual and depends
on which eMotif program is used (eMotif Maker, eMotif Scan, or
eMotif Search). Output usually contains the sequence, a regular
expression describing the motif, an E-value giving the statistical
significance of the motif's occurrence in a given sequence, the
SwissProt sequence name with a link, and so on.
- 20. What does the "eMotif Maker" application do? [5 pts]
- eMotif Maker uses aligned sequences
as input to generate an output containing the motifs found in
the aligned sequences, indicating number of matches in the aligned
sequences and providing expected numbers. Many of the found motifs
are essentially identical to each other.