The input to MEME is a set of unaligned sequences of the same type (peptide or nucleotide). These sequences are called the training set. MEME's output is a set of motifs. This is illustrated in the sketch below.
| MEME Overview |
|---|
|
For each motif it discovers, MEME reports the occurrences (sites), consensus sequence and the level of conservation (information content) at each position in the pattern. A typical MEME motif is shown below.
| MEME Motif |
|---|
|
MEME also produces block diagrams showing where all of the discovered motifs occur in the training set sequences. This illuminates the spatial arrangement of protein domains or DNA features (e.g., protein binding sites) within the input sequences. A typical MEME block diagram is shown below.
| MEME Block Diagram |
|---|
|
MEME's hypertext (HTML) output also contains buttons that allow you to conveniently use the motifs as input to other tools that allow you to:
MEME searches for motifs by looking for "sites" (stretches of letters) in the input sequences that are highly similar to one or more other sites. MEME looks for the most "significant" motifs in the input sequences, where "significant" is a function of the length of the pattern, number of times it occurs and degree of similarity among the occurrences. MEME uses a statistical objective function based on the information content of the motif to make this idea of "significance" concrete. MEME reports an E-value for each motif it finds, which is an estimate of the number of (equally or more interesting) motifs one would expect to find by chance if the letters in the input sequences were shuffled. Motifs with small E-values (e.g., less than 0.001) are very unlikely to be random sequence artifacts.
The motifs discovered by MEME can help elucidate the function, homology and structure of the sequences. For example, by finding the most highly conserved parts of a set of distantly related peptide sequences, MEME motifs can identify functionally and structurally key parts of the proteins. The motif block diagrams show how each protein is built up of (or missing) the motifs, which can shed light on evolutionary relationships among the proteins. Additional homologs can be found by using the motifs as the query in a database search. Another important application for MEME is to discover protein binding sites and other regulatory elements in groups of upstream regions from co-regulated genes.
MEME can be used remotely over the web (web MEME), with results returned by email, or you can install it on your own UNIX-based computer and run it from the UNIX command line (command line MEME). The web interface has the advantage of not requiring you to install any software, but some MEME features are only available in the command line version. Web access is free (currently available at http://meme.sdsc.edu and http://bioweb.pasteur.fr/seqanal/motif/meme ). The command line version is free for non-commercial use or with a commercial license, and can be downloaded over the web. Command-line MEME works on many uniprocessor computers and on some multiprocessor computers and clusters that have the MPICH message-passing software installed. A list of supported operating systems and their manufacturers is available at ftp://ftp.sdsc.edu/pub/sdsc/biology/meme/README.
When using MEME via the web interface, your results will typically arrive within a few hours. It is not possible to predict when your MEME results will arrive because the computers on which MEME runs at SDSC and the Pasteur Institute are shared resources. Depending on the load, it can sometimes take a day or more for your job to be processed. Please be patient. You can avoid this unpredictability by installing command line MEME locally on your own Unix-based computer.
This protocol describes the use of MEME via the MEME web interface or from the command line to discover motifs in a family of protein sequences. It also discusses how to interpret the motifs, compare them with known motifs, use them in sequence homology searches and construct phylogenetic trees based on them.
Note that your sequences must be in FASTA format if you are using command line MEME. Other formats, described on the MEME website, are supported if you are using MEME via the web interface. If you are using the web interface, the total number of characters in the sequences may not exceed 60,000.
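Before submitting a training set to the web interface, it can be convenient to check its size locally. The following is a minimal sketch, not part of MEME itself: the function names are hypothetical, and it assumes that the 60,000-character limit counts sequence letters only (header lines and whitespace are ignored).

```python
from typing import Iterable, List, Tuple

MAX_WEB_CHARS = 60_000  # limit quoted in the text for the web interface


def read_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse FASTA-formatted lines into a list of (name, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            name = fields[0] if fields else ""  # first word after '>'
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def total_characters(records: List[Tuple[str, str]]) -> int:
    """Total number of sequence letters (assumption: headers do not count)."""
    return sum(len(seq) for _, seq in records)


def fits_web_limit(records: List[Tuple[str, str]], limit: int = MAX_WEB_CHARS) -> bool:
    """True if the training set is within the assumed web size limit."""
    return total_characters(records) <= limit
```

For a file on disk, something like `fits_web_limit(read_fasta(open("tf4.fasta")))` would report whether the training set is small enough under these assumptions.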
There are many ways to construct a family of protein sequences for input to MEME. For example, file tf4.fasta contains a family of bacterial protein sequences related to Entrez sequence "gi|15897224|ref|NP_341829.1| Hypothetical protein [Sulfolobus solfataricus]". It was constructed by doing a BLASTP search of the non-redundant protein database using the sequence named above (gi|15897224) as the query. The accession numbers of all of the sequences matching the query with BLAST E-values less than or equal to 0.01 were then placed in file tf4.acc. Then, Batch Entrez was used with the file of accession numbers to download the sequences in FASTA format into file tf4.fasta.
The data file (tf4.fasta) used in this example should be downloaded from the Current Protocols Web site (http://www.currentprotocols.com).
| email address | Enter the email address where you want to receive your results |
|---|---|
| Description | (Optional) Enter information describing the sequences and/or parameters of the MEME run. This information will be included in the subject of the email message you receive from MEME and can be very useful if you submit many MEME runs. |
| name of a file | Use the "Browse" button to enter the path to your training set file. |
| number of motifs | Enter 10. |
| MEME Input Form |
|---|
|
| MEME Verification Screen |
|---|
|
It is a good idea to take a few moments to check the confirmation message to see that everything looks right. Check that your email address is correct. Also, check that "type of sequence" is correct since MEME guesses this based on the sequence characters present in your training set. If MEME gets fooled by your training set, there are instructions on the MEME website explaining how to fix this.
| MEME confirmation email message |
|---|
Subject: MEME job 365072 confirmation: tf4 with Zero or one site distribution (Use
web browser to view results)
Date: Fri, 5 Apr 2002 05:29:10 GMT
From: MEME |
| MEME email results |
|---|
Subject: MEME job 365072 results: tf4 with Zero or one site distribution (Use web
browser to view results)
Date: Thu, 4 Apr 2002 21:30:22 -0800
From: MEME
|
In addition to reporting all of the important parameters used by MEME in its search, this section also shows the frequencies of each letter in your training set. Below these, MEME shows the letter frequencies of the 0-order background model used in computing motif E-values. The two sets of frequencies will be the same unless a background model was specified in MEME's input (command line version of MEME only).
| MEME Command Line Summary Section |
|---|
|
| MEME Training Set Section |
|---|
|
Each motif that MEME finds gets its own motif section. Each motif section contains the following information:
The first line of each motif section is called the motif summary line. This line gives the width, number of occurrences in the training set (`sites'), log likelihood ratio (`llr') and E-value of the motif. Each motif describes a pattern of a fixed width--no gaps are allowed in MEME motifs. MEME numbers the motifs consecutively from one as it finds them. MEME usually finds the most statistically significant (low E-value) motifs first. The statistical significance of a motif is based on its log likelihood ratio, its width and number of occurrences, the background letter frequencies (given in the command line summary), and the size of the training set. The E-value is an estimate of the expected number of motifs with the given log likelihood ratio (or higher), and with the same width and number of occurrences, that one would find in a similarly sized set of random sequences. (In random sequences each position is independent with letters chosen according to the background letter frequencies.) Motifs with E-values larger than 0.01 (1e-2) are possibly just statistical artifacts, and not real motifs. The log likelihood ratio is the logarithm of the ratio of the probability of the occurrences of the motif given the motif model versus their probability given the background model.
| Motif Summary Line |
|---|
|
The first portion of the motif section for motif 5 is shown in the figure below. MEME has discovered a motif of width 29 with 8 sites and a very significant E-value (6.7e-97). The columns of the simplified position-specific probability matrix, information content diagram and multilevel consensus sequence are aligned with the actual motif sites. Below the figure, we describe how to interpret each of these parts of the motif section.
| Simplified PSPM, Information Content Diagram, Consensus and Alignment |
|---|
|
The Simplified Position-Specific Probability Matrix is shown directly below the summary line. This shows a simplified version of the PSPM that MEME's internal EM algorithm uses in its search for motifs. There is one column for each position in motif 5. Each column shows the expected frequency of each possible letter (20 amino acids, in this case) at the corresponding motif position. In order to make it easier to see which letters are most likely in each of the columns of the motif, the simplified motif shows the letter probabilities multiplied by 10 rounded to the nearest integer. Zeros are replaced by ":" (the colon) for readability. The letter "a" represents the number "10" (so that it will fit in one column). Thus, the "a"s in this diagram correspond to the completely conserved residues in certain columns of motif 5.
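The encoding described above can be sketched in a few lines. This is an illustration only, not MEME's own code: the function names are hypothetical, the PSPM is assumed to be a list of per-column dictionaries mapping letters to probabilities, and how MEME rounds exact halves is an assumption (half-up here).

```python
from typing import Dict, List


def simplify_probability(p: float) -> str:
    """Encode one probability as a single character.

    The probability is multiplied by 10 and rounded to the nearest integer
    (half-up rounding is an assumption); 0 becomes ':' and 10 becomes 'a'.
    """
    digit = int(p * 10 + 0.5)
    if digit <= 0:
        return ":"
    if digit >= 10:
        return "a"
    return str(digit)


def simplified_pspm(pspm: List[Dict[str, float]], alphabet: str) -> Dict[str, str]:
    """Return one text row per letter, one character per motif column."""
    return {
        letter: "".join(simplify_probability(col.get(letter, 0.0)) for col in pspm)
        for letter in alphabet
    }
```

With a toy three-column matrix over the letters A and S, a column in which A has probability 1.0 is shown as "a" in the A row and ":" in the S row.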
The information content diagram, aligned directly beneath the simplified PSPM, provides an idea of which positions in the motif are most highly conserved. Each column (position) in a motif can be characterized by the amount of information it contains (measured in bits). Highly conserved positions in the motif have high information; positions where all letters are equally likely have low information. (The information content is relative to the background letter frequencies, which are given in the command line summary section.) The highest information content in motif 5 is achieved by the perfectly conserved histidine in column 19 of the motif. Its information content is higher than that of the other perfectly conserved residues because the background frequency of histidine is lower.
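To see why a perfectly conserved rare letter carries more information than a perfectly conserved common one, the per-column information content can be sketched as a relative entropy against the background. This is a minimal sketch under that assumption; the function name and the background frequencies used in the note below are hypothetical, not values from the tf4 run.

```python
import math
from typing import Dict


def column_information(column: Dict[str, float], background: Dict[str, float]) -> float:
    """Relative entropy (in bits) of one motif column versus the background.

    column:     letter -> probability at this motif position
    background: letter -> background frequency (must be > 0 for every
                letter that has non-zero probability in the column)
    """
    bits = 0.0
    for letter, p in column.items():
        if p > 0.0:  # letters with zero probability contribute nothing
            bits += p * math.log2(p / background[letter])
    return bits
```

For a perfectly conserved column this reduces to log2(1/b), where b is the background frequency of the conserved letter. With a hypothetical background of 0.02 for histidine and 0.10 for leucine, a conserved histidine column would carry about 5.64 bits and a conserved leucine column about 3.32 bits.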
Columns in the information content diagram are colored according to the majority category of the letters occurring in that column of the alignment. If no letter category has frequency above 0.5, the column in the diagram is black. For DNA sequences, the letter categories contain one letter each. For proteins, the categories are based on the biochemical properties of the various amino acids. The categories and their colors are:
|
|
Summing the information content for each position in the motif gives the total information content of the motif (shown in parentheses to the left of the diagram). The total information content is approximately equal to the log likelihood ratio divided by the product of the number of occurrences and ln(2). The total information content gives a measure of the usefulness of the motif for database searches. For a motif to be useful for database searches, it must as a rule contain at least log_2(N) bits of information, where N is the number of sequences in the database being searched. For example, to effectively search a database containing 100,000 sequences for occurrences of a single motif, the motif should have an IC of at least 16.6 bits. Motifs with lower information content are still useful when a family of sequences shares more than one motif, since they can be combined in multiple motif searches (using MAST).
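The two rules of thumb in this paragraph are simple arithmetic and can be sketched as follows. The function names are hypothetical, and the conversion assumes that the reported log likelihood ratio is a natural logarithm, which is what the ln(2) factor in the text suggests.

```python
import math


def min_bits_for_database(n_sequences: int) -> float:
    """Minimum total information content (bits) suggested for searching
    a database of n_sequences sequences: log2(N)."""
    return math.log2(n_sequences)


def total_ic_from_llr(llr: float, n_sites: int) -> float:
    """Approximate total information content (bits) from the log
    likelihood ratio (assumed natural log) and the number of sites."""
    return llr / (n_sites * math.log(2))


def useful_for_search(total_ic_bits: float, n_sequences: int) -> bool:
    """True if a single motif meets the log2(N) rule of thumb."""
    return total_ic_bits >= min_bits_for_database(n_sequences)
```

For the database of 100,000 sequences mentioned above, `min_bits_for_database(100_000)` evaluates to about 16.6 bits, matching the figure in the text.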
The multilevel consensus sequence corresponding to the motif, an aid in remembering and understanding the motif, is located directly below the information content diagram. It is calculated from the motif position-specific probability matrix as follows. Separately for each column of the motif, the letters in the alphabet are sorted in decreasing order by the probability with which they are expected to occur in that position of motif occurrences. The sorted letters are then printed vertically with the most probable letter on top. Only letters with probabilities of 0.2 or higher at that position in the motif are printed.
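The procedure just described can be sketched directly. This is an illustrative reimplementation, not MEME's code: the function name is hypothetical, the PSPM is assumed to be a list of per-column dictionaries, ties are broken alphabetically, and a column with no letter at or above the threshold is simply left blank.

```python
from typing import Dict, List


def multilevel_consensus(pspm: List[Dict[str, float]], threshold: float = 0.2) -> List[str]:
    """Return the multilevel consensus as a list of text lines.

    Line 0 holds the most probable letter of each column, line 1 the
    second most probable, and so on. Only letters with probability
    >= threshold are kept; missing levels are padded with spaces.
    """
    kept_per_column: List[List[str]] = []
    for column in pspm:
        kept = [(p, letter) for letter, p in column.items() if p >= threshold]
        # Sort by decreasing probability; break ties alphabetically (assumption).
        kept.sort(key=lambda item: (-item[0], item[1]))
        kept_per_column.append([letter for _, letter in kept])

    depth = max((len(letters) for letters in kept_per_column), default=0)
    return [
        "".join(letters[level] if level < len(letters) else " "
                for letters in kept_per_column)
        for level in range(depth)
    ]
```

The first line returned is the consensus sequence itself; the lines below it list the alternative letters that may occur at each position.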
The multilevel consensus sequence of motif 5 says, firstly, that the most likely form (consensus) of the motif is:
ASQREKRSRTGAPESILIHDKGLSTDIGI
Secondly, the multilevel consensus shows that only letters A and S have probability greater than 0.2 of occurring in position one of the motif. Thirdly, a rough approximation of the motif can be made by converting the multilevel consensus sequence into the Prosite signature:
The eight sites in the training set that MEME has identified as forming the fifth motif can be seen aligned beneath the information content diagram. These sites are shown aligned with each other with flanking sequence on either side. Each site is identified by the name of the sequence where it occurs and the position in the sequence where the site begins. The sites are listed in order of increasing statistical significance (position p-value). The position p-value of a site is computed from the match score of the site with the position-specific scoring matrix (PSSM) for the motif. The position p-value gives the probability of a random string the length of the site (generated from the background letter frequencies) having the same match score (or higher) as the site.
| Motif 5 in FASTA format |
|---|
|
| Motif 5 in Logos format |
|---|
|
| Motif 5 neighbor-joining Tree |
|---|
|
| Motif 5 LAMA search results |
|---|
|
| MAST input form |
|---|
|
There are two significant hits in the yeast database to these bacterial protein motifs. The first is to transcription factor TFIIB. The second hit is to "RNA polymerase III transcription factor with homology to TFIIB". The matching motifs occur in the same order as the training set sequences, except for a second match to motif 1 in the second sequence where motif 5 would be expected. No other sequences have significant E-values (Expect column in the MAST results).
| MAST search of yeast |
|---|
|
| MetaMEME input form |
|---|
|
The two top hits are the same ones as detected by MAST. The third hit is one that MAST does not detect, and is annotated as containing subunits of transcription factor TFIIK.
| MetaMEME search of yeast |
|---|
|
This protocol describes the use of MEME via the MEME web interface or the command line to discover repeated motifs in a family of protein sequences. It also discusses how to interpret the motifs, compare them with known motifs, use them in sequence homology searches and construct phylogenetic trees based on them.
This protocol is a direct extension of Basic Protocol 1 where non-repeating motifs were discovered. It is usually a good idea to extend the search for motifs in this way. The only difference in the user input is selecting a different "distribution" model for sites on the MEME input form.
Same as for Basic Protocol 1
| email address | Enter the email address where you want to receive your results |
|---|---|
| Description | (Optional) Enter information describing the sequences and/or parameters of the MEME run. |
| name of a file | Use the "Browse" button to enter the path to your training set file. |
| distributed | Click on "Any number of repetitions". |
| number of motifs | Enter 10. |
| MEME Input Form |
|---|
|
| MEME motif summary: repeated motifs |
|---|
|
The repeated domain in these sequences is now clear. In most of the sequences, motif 3 followed by motif 1 occurs twice. However, neither of these motifs occurs at all in the original seed sequence (gi|15897224) around which this training set was assembled. The sequences in the family seem to belong to two distinct subfamilies--those containing the repeated motifs 1 and 3, and those that do not. The only motif common to all of the sequences in the family is motif 2, which does not appear to repeat.
| LAMA search of BLOCKS motif database |
|---|
|
The MEME blocks 1 through 10 are labeled "x1910xbliA" through "x1910xbliJ" in the "block 1" column. The matching motifs from the database are in the "block 2" column. There are three significant hits to motifs, all with E-values of 0 (the best value possible). MEME motif 1 matches "IPB000812C" whose annotation (not shown) reads "Transcription factor TFIIB repeat". This agrees with the fact that MEME discovered repeats of this motif in the training set sequences. Motif 2 matches "IPB000812A" whose annotation is also "Transcription factor TFIIB repeat", although it does not appear to repeat in any of the training set sequences. Motif 3 does not appear to match any of the motifs in the Blocks database. However, motif 4 matches "IPB000812B", another TFIIB repeat motif.
| Neighbor-joining tree of motif 1 |
|---|
|
The numbers following the sequence names in the leaves of the tree are the positions in the sequence of the sites. In this example, the sites nearer the N-terminus (smaller position numbers) and the sites nearer the C-terminus (larger position numbers) cluster separately. The height of the subtree containing the N-terminal sites is greater, suggesting that they may be under less selective pressure.
| MAST search of yeast with repeated motifs |
|---|
|
The same two significant matches as were found in Basic Protocol 1 are found here. Notice that they both contain the repeated motifs 1 and 3, as well as motif 2 that is common to the entire family. One other sequence (ref|NP_010164.1|) has a non-significant E-value (5.7), but does contain weak matches to the two repeated motifs (1 and 3) in the right order suggesting it may be a distant homolog of the TFIIB family. Motif order and spacing can often be informative in this way with distant homologs.
This protocol describes the use of MEME via the MEME web interface or from the command line to discover motifs in a family of DNA sequences. It also discusses how to interpret the motifs and to use them to search sequence databases for sequences containing the motifs.
Note that your sequences must be in FASTA format if you are using command line MEME. Other formats, described on the MEME website, are supported if you are using MEME via the web interface. If you are using the web interface, the total number of characters in the sequences may not exceed 60,000.
There are many ways to construct a set of DNA sequences for input to MEME. You might use a set of upstream regions from genes known to be coregulated as determined by expression microarray experiments. In this example, we will use a file (lex.fasta) that contains a set of E. coli DNA sequences known to bind LexA. We will use MEME to discover the LexA binding sites and characterize the motif.
The data file (lex.fasta) used in this example should be downloaded from the Current Protocols Web site (http://www.currentprotocols.com).
| email address | Enter the email address where you want to receive your results |
|---|---|
| Description | (Optional) Enter information describing the sequences and/or parameters of the MEME run. This information will be included in the subject of the email message you receive from MEME and can be very useful if you submit many MEME runs. |
| name of a file | Use the "Browse" button to enter the path to your training set file. |
| number of motifs | Enter 2. |
| MEME Input Form |
|---|
|
| LexA binding site motif |
|---|
|
MEME finds the experimentally verified LexA binding site motif. It automatically determines the correct width for the motif. The extremely low E-value (3.7e-33) indicates that this motif is extremely statistically significant.
| summary of motifs in lex.fasta |
|---|
|
Even though the "Zero or one per sequence" distribution was specified, the summary of motifs shows all the (non-overlapping) positions that match the motif models MEME discovers. Thus, motif 1 appears three times in sequence recn, and twice in lexA.
| MAST input form |
|---|
|
| MAST results of search of E. coli with lex motif |
|---|
|
The search identifies upstream regions of several genes not in the training set. Some of them have non-significant E-values, but have multiple copies of the binding site close to the start of transcription, suggesting that LexA may be involved in their regulation as well.
This protocol describes the use of MEME via the MEME web interface or the command line to discover repeated motifs in a set of DNA sequences. It also discusses how to interpret the motifs and use them to search sequence databases.
Note that your sequences must be in FASTA format if you are using command line MEME. Other formats, described on the MEME website, are supported if you are using MEME via the web interface. If you are using the web interface, the total number of characters in the sequences may not exceed 60,000.
In this example, we will use a file (INO_up800.fasta) that contains upstream regions from S. cerevisiae genes known to be repressed in the presence of inositol or choline. [J. van Helden, B. Andre' and J. Collado-Vides, J Mol Biol. 1998 Sep 4;281(5):827-42.]
The data file (INO_up800.fasta) used in this example should be downloaded from the Current Protocols Web site (http://www.currentprotocols.com).
meme INO_up800.fasta -dna -revcomp -mod tcm -nmotifs 2 > INO_up800.anr.html
| email address | Enter the email address where you want to receive your results |
|---|---|
| Description | (Optional) Enter information describing the sequences and/or parameters of the MEME run. |
| name of a file | Use the "Browse" button to enter the path to your training set file. |
| distributed | Click on "Any number of repetitions". |
| number of motifs | Enter 2. |
| MEME Input Form |
|---|
|
| INO binding site motif |
|---|
|
The E-value of the motif is only marginally significant (9.5e-2), but the motif matches the known binding site. Notice that MEME has selected more than one site in three of the training set sequences: three in INO1, and two each in CHO1 and CHO2.
MEME searches for motifs by performing Expectation Maximization (EM) on a motif model of a fixed width and using an initial estimate of the number of sites. It then sorts the possible sites by the probability that EM assigns to them. MEME then estimates the E-value (defined below) of the first n sites in the sorted list for different values of n. This procedure (first EM, followed by computing E-values for different numbers of sites) is repeated with different widths and different initial estimates of the number of sites. MEME outputs the motif with the lowest (estimated) E-value. Before reporting the motif, MEME trims it (using a dynamic programming multiple alignment) to eliminate any positions where there is a gap in any of the occurrences. After reporting the motif with the lowest E-value that it could find, MEME "erases" the occurrences of that motif in the training set using a soft-erase function based on how well each occurrence matches the overall motif. MEME then repeats the entire procedure to find additional motifs. The process stops when the requested threshold (number of motifs or maximum E-value) is reached.
MEME defines the E-value of a motif as the number of motifs (with the same width and number of occurrences) that would have equal or higher log likelihood ratio if the training set sequences had been generated randomly according to (the 0-order portion of) a background model. The log likelihood ratio of a motif is
$$\mathrm{llr} = \log \frac{\Pr(\text{sites} \mid \text{motif})}{\Pr(\text{sites} \mid \text{background})}$$
and is a measure of how different the sites are from the background model. The quantity $\Pr(\text{sites} \mid \text{motif})$ is the probability of the sites under the motif model. The motif model assumes that each position in the motif is independent of every other position. The motif model can be described by a position-specific probability matrix (PSPM). The quantity $\Pr(\text{sites} \mid \text{background})$ is the probability of the sites given the background model. MEME uses a 0-order Markov model as the background model. By default, the frequencies of letters in the training set define the background model. However, (command line) MEME allows the user to supply their own n-order Markov model in the form of a file containing the frequencies of all possible tuples of up to length n+1.
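Because both the motif model and the 0-order background treat positions independently, the log likelihood ratio of a set of aligned sites is a sum of per-letter log ratios. The following is a minimal sketch of that calculation, not MEME's implementation: the function names are hypothetical, the PSPM is estimated as plain letter frequencies without pseudocounts, and natural logarithms are assumed.

```python
import math
from collections import Counter
from typing import Dict, List


def pspm_from_sites(sites: List[str]) -> List[Dict[str, float]]:
    """Estimate a PSPM as observed letter frequencies per column
    (no pseudocounts; an assumption made for this sketch)."""
    width = len(sites[0])
    pspm = []
    for j in range(width):
        counts = Counter(site[j] for site in sites)
        pspm.append({letter: n / len(sites) for letter, n in counts.items()})
    return pspm


def log_likelihood_ratio(sites: List[str],
                         pspm: List[Dict[str, float]],
                         background: Dict[str, float]) -> float:
    """log( Pr(sites | motif) / Pr(sites | background) ), natural log.

    Every letter in every site must have non-zero probability in the
    corresponding PSPM column and in the background.
    """
    llr = 0.0
    for site in sites:
        for column, letter in zip(pspm, site):
            llr += math.log(column[letter] / background[letter])
    return llr
```

For example, for the four toy DNA sites AC, AC, AG and TC with a uniform background of 0.25 per letter, each of the six letters that agree with the column majority contributes ln(3) and the two minority letters contribute ln(1) = 0, so the log likelihood ratio is 6 ln(3).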
The E-value estimated by MEME is actually an approximation of the E-value of the log likelihood ratio. An approximation is used because it is far more efficient to compute. The approximation is based on the fact that the log likelihood ratio of a motif is the sum of the log likelihood ratios of each column of the motif. Instead of computing the statistical significance of this sum (its p-value), MEME computes the p-value of each column and then computes the significance of their product. Although not identical to the p-value of the log likelihood ratio, this easier-to-compute objective function works very similarly in practice. The E-value is then computed from the p-value by scaling for the number of possible motifs in the training set.
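One standard way to assess the significance of a product of p-values, assuming the individual p-values are independent and uniformly distributed under the null model, is the closed form P(product <= x) = x * sum_{i=0}^{n-1} (-ln x)^i / i!. The sketch below implements that formula. Whether MEME uses exactly this form is an assumption, the function name is hypothetical, and the final scaling from p-value to E-value is not shown because the text does not specify it.

```python
import math
from typing import List


def product_pvalue(pvalues: List[float]) -> float:
    """Significance of the product of n independent, uniformly
    distributed p-values (each must be in (0, 1])."""
    n = len(pvalues)
    log_product = sum(math.log(p) for p in pvalues)  # ln of the product
    x = -log_product                                 # -ln(product) >= 0
    # Accumulate sum_{i=0}^{n-1} x^i / i! term by term.
    term, total = 1.0, 1.0
    for i in range(1, n):
        term *= x / i
        total += term
    return math.exp(log_product) * total
```

With a single column the result is just that column's p-value. With two columns each having p-value 0.5, the product is 0.25 and its significance is 0.25 * (1 + ln 4), which is larger than 0.25 because a product of two p-values is small more often than a single p-value is.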
Thus, the estimated E-value of a MEME motif depends on: