Current Protocols in Bioinformatics:

Discovering Novel Sequence Motifs with MEME

Basic Protocol 1: Discovering motifs in a protein sequence family using MEME

  1. Run MEME on your training set of sequences (tf4.fasta in this example) by doing one of the following:

  2. Fill in the following fields in the MEME input form:

  3. Click on the "Start search" button. This will submit your search to the MEME web-server at the San Diego Supercomputer Center. Within a few seconds, your browser should display a verification message like the one below.

    It is a good idea to take a few moments to check the confirmation message to see that everything looks right. Check that your email address is correct. Also, check that "type of sequence" is correct since MEME guesses this based on the sequence characters present in your training set. If MEME gets fooled by your training set, there are instructions on the MEME website explaining how to fix this.

  4. Use your email reader to receive the confirmation message MEME will send you. If you do not receive this message, it is possible that you mistyped your email address. In that case, you will have to resubmit your MEME run.

  5. Use your email reader to receive your MEME results. The email containing your MEME results will look like this:

  6. Save your MEME results to a text file. For example, if your email reader is Netscape Messenger, you would click "Save as" under the "File" menu, select "File" and then enter a file name like "tf4.zoops.html".

  7. Use your web browser to view your MEME results. For example, using Netscape Navigator, you would click on "Open Page" in the "File" menu and use "Choose File" to select the file you saved in the previous step: "tf4.zoops.html". Then you would click "Open In Navigator".

  8. At the top of a MEME output document you will see six buttons. Clicking on these allows you to go directly to the different sections of the MEME output. Just beneath these buttons are three additional buttons that allow you to use your MEME output as input to other programs.

  9. Click on the "Command Line" button. This takes you to the section of the document showing the actual command line that was used to run MEME. This information is useful for keeping track of which options were used when MEME was run. It can also be very useful in the event you wish to report a problem with the MEME software.

    In addition to reporting all of the important parameters used by MEME in its search, this section also shows the frequencies of each letter in your training set. Below these, MEME shows the letter frequencies of the 0-order background model used in computing motif E-values. The two sets of frequencies will be the same unless a background model was specified in MEME's input (command line version of MEME only).
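
    As a rough illustration of what the training set letter frequencies represent, the following minimal sketch counts letters across a set of sequences. The function name and the toy sequences are hypothetical, and MEME may adjust the raw counts (for example with pseudocounts) before reporting them.

```python
from collections import Counter


def letter_frequencies(sequences):
    """Return the relative frequency of each letter across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    return {letter: n / total for letter, n in sorted(counts.items())}


# Hypothetical toy training set (not taken from tf4.fasta).
toy_sequences = ["ACGT", "AACC"]
freqs = letter_frequencies(toy_sequences)
for letter, freq in freqs.items():
    print(letter, freq)
```

    With no background model supplied, frequencies of this kind would also serve as the 0-order background model.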

  10. Use your browser to return to the top of the MEME output document.

  11. Click on the "Training Set" button. You will now see a summary description of the training set showing the name, "weight" and length of each sequence you submitted to MEME. (How to weight individual sequences in the training set is described in the MEME on-line documentation.) This section also shows the name of the training set file and the alphabet (protein or DNA). In this case, the training set uses the protein IUPAC alphabet.

  12. Use your browser to return to the top of the MEME output document.

  13. Click on the "First Motif" button. This will take you to a section of the MEME output document describing the first motif that MEME found.

    Each motif that MEME finds gets its own motif section. Each motif section contains the following information:

    1. A summary line showing the width, number of occurrences, log likelihood ratio and statistical significance of the motif.
    2. A simplified position-specific probability matrix.
    3. An information content diagram showing the degree of conservation at each motif position.
    4. A multilevel consensus sequence showing the most conserved letter(s) at each motif position.
    5. The occurrences of the motif (sites) sorted by p-value and aligned with each other.
    6. Motif block diagrams of the occurrences of the motif within each sequence in the training set.
    7. The motif in BLOCKS, FASTA or Raw format for use with the BLOCKS multiple alignment processor and other tools.
    8. A position-specific scoring matrix (PSSM) for use by the MAST sequence database search program.
    9. The position specific probability matrix (PSPM) describing the motif created by MEME's internal EM algorithm.

    The first line of each motif section is called the motif summary line. This line gives the width, number of occurrences in the training set (`sites'), log likelihood ratio (`llr') and E-value of the motif. Each motif describes a pattern of a fixed width--no gaps are allowed in MEME motifs. MEME numbers the motifs consecutively from one as it finds them. MEME usually finds the most statistically significant (low E-value) motifs first. The statistical significance of a motif is based on its log likelihood ratio, its width and number of occurrences, the background letter frequencies (given in the command line summary), and the size of the training set. The E-value is an estimate of the expected number of motifs with the given log likelihood ratio (or higher), and with the same width and number of occurrences, that one would find in a similarly sized set of random sequences. (In random sequences each position is independent with letters chosen according to the background letter frequencies.) Motifs with E-values larger than 0.01 (1e-2) are possibly just statistical artifacts, and not real motifs. The log likelihood ratio is the logarithm of the ratio of the probability of the occurrences of the motif given the motif model versus their probability given the background model.
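
    The log likelihood ratio described above can be sketched as follows. This is a simplified illustration, not MEME's implementation: it assumes the motif model is estimated directly from the column counts of the aligned sites (no pseudocounts), uses natural logarithms, and the function name and toy sites are hypothetical.

```python
import math
from collections import Counter


def log_likelihood_ratio(sites, background):
    """Sum, over all sites and positions, of ln(p_motif / p_background)."""
    width = len(sites[0])
    llr = 0.0
    for j in range(width):
        counts = Counter(site[j] for site in sites)
        for letter, n in counts.items():
            p = n / len(sites)  # motif probability estimated from the sites
            llr += n * math.log(p / background[letter])
    return llr


# Hypothetical toy example: four DNA sites of width 2, uniform background.
toy_sites = ["AC", "AC", "AG", "AC"]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
llr = log_likelihood_ratio(toy_sites, uniform)
print(llr)
```

    The more the sites agree with each other and differ from the background, the larger the ratio becomes.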

  14. To the right of the motif summary line you will see two buttons labeled "P" and "N". Clicking on them will take you directly to the previous or next motif summary, respectively. Click on the "N" button four times in succession to arrive at the "MOTIF 5" summary line.

    The first portion of the motif section for motif 5 is shown in the figure below. MEME has discovered a motif of width 29 with 8 sites and a very significant E-value (6.7e-97). The columns of the simplified position-specific probability matrix, information content diagram and multilevel consensus sequence are aligned with the actual motif sites. Below the figure, we describe how to interpret each of these parts of the motif section.

    The Simplified Position-Specific Probability Matrix is shown directly below the summary line. This shows a simplified version of the PSPM that MEME's internal EM algorithm uses in its search for motifs. There is one column for each position in motif 5. Each column shows the expected frequency of each possible letter (20 amino acids, in this case) at the corresponding motif position. To make it easier to see which letters are most likely in each column of the motif, the simplified motif shows the letter probabilities multiplied by 10 and rounded to the nearest integer. Zeros are replaced by ":" (the colon) for readability. The letter "a" represents the number "10" (so that it will fit in one column). Thus, the "a"s in this diagram correspond to the completely conserved residues in certain columns of motif 5.
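
    The encoding just described can be sketched in a few lines. The function names and the toy matrix are hypothetical, and rounding halves upward is an assumption; MEME's exact rounding rule is not stated here.

```python
def simplify_probability(p):
    """Encode a probability as one character: ':' for 0, 'a' for 10, else a digit."""
    digit = int(p * 10 + 0.5)  # multiply by 10 and round (halves round up)
    if digit == 0:
        return ":"
    if digit == 10:
        return "a"
    return str(digit)


def simplified_rows(pspm, alphabet):
    """pspm is a list of dicts, one per motif position, mapping letter -> probability."""
    return {
        letter: "".join(simplify_probability(col.get(letter, 0.0)) for col in pspm)
        for letter in alphabet
    }


# Hypothetical three-position DNA motif.
toy_pspm = [
    {"A": 1.0},
    {"A": 0.62, "C": 0.31, "G": 0.04, "T": 0.03},
    {"C": 0.5, "G": 0.5},
]
rows = simplified_rows(toy_pspm, "ACGT")
for letter, row in rows.items():
    print(letter, row)
```

    Each printed row corresponds to one letter and each character to one motif position, as in the MEME display.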

    The information content diagram, aligned directly beneath the simplified PSPM, provides an idea of which positions in the motif are most highly conserved. Each column (position) in a motif can be characterized by the amount of information it contains (measured in bits). Highly conserved positions in the motif have high information; positions where all letters are equally likely have low information. (The information content is relative to the background letter frequencies, which are given in the command line summary section.) The highest information content in motif 5 is achieved by the perfectly conserved histidine in column 19 of the motif. Its information content is higher than that of the other perfectly conserved residues because the background frequency of histidine is lower.
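
    One common way to compute information content relative to a background is the relative entropy of the column, sketched below. This is assumed to correspond to what the diagram shows; the function name and the background frequencies used in the example are illustrative, not values reported by MEME.

```python
import math


def column_information(column, background):
    """Relative entropy (bits) of one motif column against the background."""
    return sum(
        p * math.log2(p / background[letter])
        for letter, p in column.items()
        if p > 0
    )


uniform_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# A perfectly conserved column carries the most information.
conserved = column_information({"A": 1.0}, uniform_dna)
# A column matching the background carries none.
flat = column_information(dict(uniform_dna), uniform_dna)

# Illustrative (made-up) protein background frequencies: a rarer letter
# yields more bits when perfectly conserved.
rare = column_information({"H": 1.0}, {"H": 0.02})
common = column_information({"L": 1.0}, {"L": 0.10})
print(conserved, flat, rare, common)
```

    The last two values show why a perfectly conserved histidine can score higher than other perfectly conserved residues.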

    Columns in the information content diagram are colored according to the majority category of the letters occurring in that column of the alignment. If no letter category has frequency above 0.5, the column in the diagram is black. For DNA sequences, the letter categories contain one letter each. For proteins, the categories are based on the biochemical properties of the various amino acids. The categories and their colors are:

    NUCLEIC ACIDS    COLOR
    A                RED
    C                BLUE
    G                ORANGE
    T                GREEN

    AMINO ACIDS      COLOR
    ACFILVM          BLUE
    NQST             GREEN
    DE               MAGENTA
    KR               RED
    H                PINK
    G                ORANGE
    P                YELLOW
    Y                TURQUOISE
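
    The coloring rule can be expressed as a short function. This sketch uses only the protein categories listed above; the function name and example columns are hypothetical, and any letter not listed in a category simply contributes to no category here.

```python
PROTEIN_CATEGORIES = {
    "ACFILVM": "blue",
    "NQST": "green",
    "DE": "magenta",
    "KR": "red",
    "H": "pink",
    "G": "orange",
    "P": "yellow",
    "Y": "turquoise",
}


def column_color(column, categories=PROTEIN_CATEGORIES):
    """Return the color of the category whose total frequency exceeds 0.5, else black."""
    for letters, color in categories.items():
        if sum(column.get(letter, 0.0) for letter in letters) > 0.5:
            return color
    return "black"


print(column_color({"K": 0.4, "R": 0.3, "D": 0.3}))  # K and R together exceed 0.5
print(column_color({"A": 0.3, "K": 0.3, "D": 0.4}))  # no category exceeds 0.5
```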

    Summing the information content for each position in the motif gives the total information content of the motif (shown in parentheses to the left of the diagram). The total information content is approximately equal to the log likelihood ratio divided by the product of the number of occurrences and ln(2). The total information content gives a measure of the usefulness of the motif for database searches. For a motif to be useful for database searches, it must as a rule contain at least log_2(N) bits of information, where N is the number of sequences in the database being searched. For example, to effectively search a database containing 100,000 sequences for occurrences of a single motif, the motif should have an IC of at least 16.6 bits. Motifs with lower information content are still useful when a family of sequences shares more than one motif, since they can be combined in multiple motif searches (using MAST).
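
    The two rules of thumb in this paragraph are simple arithmetic, sketched below. The function names are hypothetical; the log likelihood ratio is assumed to be in natural-log units, which is what the division by ln(2) implies.

```python
import math


def min_bits_for_database(num_sequences):
    """Approximate minimum information content (bits) for a single-motif search."""
    return math.log2(num_sequences)


def total_information_from_llr(llr, num_sites):
    """Approximate total information content (bits) from the log likelihood ratio."""
    return llr / (num_sites * math.log(2))


threshold = min_bits_for_database(100_000)
print(round(threshold, 1))

# Hypothetical toy motif: 4 sites, llr = 4*ln(4) + 3*ln(3).
toy_bits = total_information_from_llr(4 * math.log(4) + 3 * math.log(3), 4)
print(toy_bits)
```

    A toy motif of this size falls far short of the threshold for a 100,000-sequence database, which is why short, weakly conserved motifs are best used in combination.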

    The multilevel consensus sequence corresponding to the motif, an aid in remembering and understanding the motif, is located directly below the information content diagram. It is calculated from the motif position-specific probability matrix as follows. Separately for each column of the motif, the letters in the alphabet are sorted in decreasing order by the probability with which they are expected to occur in that position of motif occurrences. The sorted letters are then printed vertically with the most probable letter on top. Only letters with probabilities of 0.2 or higher at that position in the motif are printed.
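
    The procedure above can be sketched as follows. The function name and toy matrix are hypothetical; ties are broken alphabetically, and columns with fewer qualifying letters are padded with spaces, both of which are assumptions rather than documented MEME behavior.

```python
def multilevel_consensus(pspm, threshold=0.2):
    """Return the consensus as a list of text lines, most probable letters first.

    pspm is a list of dicts, one per motif position, mapping letter -> probability.
    """
    columns = []
    for col in pspm:
        ranked = sorted(col.items(), key=lambda item: (-item[1], item[0]))
        columns.append([letter for letter, p in ranked if p >= threshold])
    depth = max((len(letters) for letters in columns), default=0)
    return [
        "".join(letters[level] if level < len(letters) else " " for letters in columns)
        for level in range(depth)
    ]


# Hypothetical three-position protein motif.
toy_pspm = [
    {"A": 0.6, "S": 0.3, "T": 0.1},
    {"L": 1.0},
    {"K": 0.45, "R": 0.35, "Q": 0.2},
]
consensus_lines = multilevel_consensus(toy_pspm)
for line in consensus_lines:
    print(line)
```

    The top line is the single most likely form of the motif; the lines beneath it list the alternatives at each position.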

    The multilevel consensus sequence of motif 5 says, firstly, that the most likely form (consensus) of the motif is:

    Secondly, the multilevel consensus shows that only letters A and S have probability greater than 0.2 of occurring in position one of the motif. Thirdly, a rough approximation of the motif can be made by converting the multilevel consensus sequence into the Prosite signature:

    (This can be done by taking all of the letters in each column of the motif and enclosing them in brackets.)
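
    A minimal sketch of this conversion is shown below. It follows the usual PROSITE pattern convention of separating positions with hyphens and leaving single-letter positions unbracketed; the function name and the input columns are hypothetical, and writing "x" for a position with no qualifying letter is an assumption.

```python
def prosite_signature(consensus_columns):
    """Build a PROSITE-style pattern from per-position lists of consensus letters."""
    parts = []
    for letters in consensus_columns:
        if not letters:
            parts.append("x")  # assumption: no letter reached the threshold
        elif len(letters) == 1:
            parts.append(letters[0])
        else:
            parts.append("[" + "".join(letters) + "]")
    return "-".join(parts)


# Hypothetical columns, as read top to bottom from a multilevel consensus.
toy_columns = [["A", "S"], ["L"], ["K", "R", "Q"]]
signature = prosite_signature(toy_columns)
print(signature)
```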

    The eight sites in the training set that MEME has identified as forming the fifth motif can be seen aligned beneath the information content diagram. These sites are shown aligned with each other with flanking sequence on either side. Each site is identified by the name of the sequence where it occurs and the position in the sequence where the site begins. The sites are listed in order of increasing statistical significance (position p-value). The position p-value of a site is computed from the match score of the site with the position-specific scoring matrix (PSSM) for the motif. The position p-value gives the probability of a random string the length of the site (generated from the background letter frequencies) having the same match score (or higher) as the site.
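
    The definition of the position p-value can be illustrated by brute force: enumerate every string of the motif width, and add up the background probability of those scoring at least as well as the site. This is only a sketch of the definition, feasible for tiny motifs; it is not how MEME or MAST compute p-values, and the function name and toy scoring matrix are hypothetical.

```python
from itertools import product


def position_p_value(pssm, site, background):
    """Probability that a random string scores at least as high as the site."""
    observed = sum(col[letter] for col, letter in zip(pssm, site))
    p_value = 0.0
    for letters in product(background, repeat=len(pssm)):
        score = sum(col[letter] for col, letter in zip(pssm, letters))
        if score >= observed:
            prob = 1.0
            for letter in letters:
                prob *= background[letter]
            p_value += prob
    return p_value


# Hypothetical width-2 DNA scoring matrix with integer scores.
toy_pssm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 1, "G": 1, "T": -1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
best = position_p_value(toy_pssm, "AC", uniform)
weaker = position_p_value(toy_pssm, "AA", uniform)
print(best, weaker)
```

    The better-matching site has the smaller p-value, since fewer random strings can equal or beat its score.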

  15. Scroll down the MEME output document to just below the aligned sites where the section heading says "Motif 5 block diagrams". The sequences in the training set that contain motif 5 sites (as determined by the MEME algorithm) are shown in schematic format. The sequences are sorted by the lowest p-value among all sites in a given sequence. In this example, the chosen site distribution (Zero or one per sequence) only allows a maximum of one site per sequence, and MEME determined that the best number of total sites is eight.

  16. Scroll down the MEME output document to the heading "Motif 5 in BLOCKS format". There are four buttons here. They can be used to view the sites of the motif in three different formats and to submit the motif to the BLOCKS multiple alignment processor.

  17. Click on the upper left button labeled "View Block 5". This will show the motif in BLOCKS format. Click on the button labeled "View FASTA 5". This will show the sites of the motif in FASTA format. Click on the "View RAW 5" button and you will see the sites of motif 5 in raw sequence format. These three viewing buttons allow you to conveniently cut-and-paste the sites of the motif into other programs and web sites that require one of these three sequence formats.

  18. Click on the button labeled "Submit Block 5". This will connect you to the BLOCKS website. There you can study motif 5 in various ways. For example, clicking on "Logos: GIF" will display motif 5 in "Logos" format:

  19. Click on the button labeled "Submit Block 5" again. Then click on "Tree: GIF". This will display a neighbor-joining tree of the sites in motif 5:

  20. Click on the button labeled "Submit Block 5" once more. Then click on "LAMA". This will compare motif 5 with a large database of known protein motifs. The results are shown in the figure below. There is an extremely strong match (E-value = 0) with motif IPB000812B. Clicking on the link to that motif (IPB000812B) shows that it is the transcription factor TFIIB repeat. (Not shown.)

  21. Scroll down to the section headed "Position-specific scoring matrix". Click on the button labeled "View PSSM 5". This will display the scoring matrix derived from Motif 5. This matrix is a log-odds matrix calculated by taking the log (base 2) of the ratio p/f at each position in the motif, where p is the probability of a particular letter at that position in the motif, and f is the background frequency of the letter (given in the command line summary section). This is the same matrix that is used above in computing the p-values of the motif sites in the aligned sites and block diagrams. The scoring matrix is printed "sideways"--columns correspond to the letters in the alphabet (in the same order as shown in the simplified motif) and rows correspond to the positions of the motif, position one first.
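
    The log-odds calculation can be sketched as follows, together with the match score of a candidate site (the sum of one matrix entry per position). The function names and toy matrix are hypothetical; the sketch assumes every probability is strictly positive and does not reproduce any scaling or rounding MEME applies to the printed matrix.

```python
import math


def log_odds_pssm(pspm, background):
    """Convert a probability matrix into log2(p/f) scores, one dict per position."""
    return [
        {letter: math.log2(col[letter] / background[letter]) for letter in background}
        for col in pspm
    ]


def site_score(pssm, site):
    """Match score of a site: sum of the matrix entries for its letters."""
    return sum(col[letter] for col, letter in zip(pssm, site))


# Hypothetical two-position DNA motif with no zero probabilities.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_pspm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    dict(uniform),
]
toy_pssm = log_odds_pssm(toy_pspm, uniform)
print(site_score(toy_pssm, "AC"))  # positive: favored by the motif
print(site_score(toy_pssm, "CA"))  # negative: favored by the background
```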

  22. Scroll down to the section headed "Position-specific probability matrix". Click on the button labeled "View PSPM 5". This will display the probability matrix generated by the EM algorithm during the search for motif 5.

  23. Scroll back to the top of the MEME output document and click on the "Summary of Motifs" button at the top of the page. You can now see how all the motifs discovered by MEME map to the sequences. These are not simply the sites displayed in the motif block diagrams. The summary of motifs is created by using the MAST algorithm to find a set of non-overlapping sites that match the motifs (actually, the PSPMs for the motifs) with p-values below 0.0001.

  24. Return to the top of the document. Underneath the colored buttons are three additional buttons that allow you to submit the MEME motifs to three web-based services:

  25. Click on the "MAST" button. Fill in the MAST form with your email address, select the "S. cerevisiae" database and click on "Start search". MAST will return your results by email when they are ready. This may take from a few minutes to a few hours depending on how busy the MAST server is, the number of motifs in your query and the size of the database you are searching.

  26. Retrieve your MAST search results using your email program, save them to a file, and view the file with your web browser (in a new window).

    There are two significant hits in the yeast database to these bacterial protein motifs. The first is to transcription factor TFIIB. The second hit is to "RNA polymerase III transcription factor with homology to TFIIB". The matching motifs occur in the same order as the training set sequences, except for a second match to motif 1 in the second sequence where motif 5 would be expected. No other sequences have significant E-values (Expect column in the MAST results).

  27. Return to the MEME output document in your browser, and click on the "MetaMEME" button. Use the "Browse" button to enter the name of the training set you used with MEME. Then select the "Yeast" database to search and click on "Submit". Click on "Submit" on the next form that appears to accept the search defaults. MetaMEME will perform its search and return a hidden Markov model, a multiple alignment of the training set, and search results on the sequence database you chose.

  28. Click on "Database Search Results" on the MetaMEME form to see the Yeast sequences that match the model of the bacterial protein family built by MetaMEME.

    The two top hits are the same ones as detected by MAST. The third hit is one that MAST does not detect, and is annotated as containing subunits of transcription factor TFIIK.

Alternate Protocol 1: Finding repeated motifs in protein sequences

  1. Run MEME on your training set of sequences (tf4.fasta in this example) by doing one of the following:

  2. Fill in the following fields in the MEME input form just as you did for Basic Protocol 1, except now you will select "Any number of repetitions" for the site distribution:

  3. Click on the "Start search" button.

  4. Use your email reader to save your MEME results to a file when they arrive. Call the file "tf4.anr.html".

  5. Use your web browser to open the MEME results file "tf4.anr.html".

  6. Click on the "Summary of Motifs" button at the top of the MEME results document.

    The repeated domain in these sequences is now clear. In most of the sequences, motif 3 followed by motif 1 occurs twice. However, neither of these motifs occurs at all in the original seed sequence (gi|5897224) around which this training set was assembled. The sequences in the family seem to belong to two distinct subfamilies--those containing the repeated motifs 1 and 3, and those that do not. The only motif common to all of the sequences in the family is motif 2, which does not appear to repeat.

  7. Scroll to the top of the MEME results document.

  8. Click on the button labeled "BLOCKS". This will submit all of the motifs to the Blocks multiple alignment processor. (In Basic Protocol 1, we showed how to submit a single motif to the Blocks processor.)

  9. On the Blocks submission form that appears, click on "LAMA" to search your motifs against a database of protein motifs.

    The MEME blocks 1 through 10 are labeled "x1910xbliA" through "x1910xbliJ" in the "block 1" column. The matching motifs from the database are in the "block 2" column. There are three significant hits to motifs, all with E-values of 0 (the best value possible). MEME motif 1 matches "IPB000812C" whose annotation (not shown) reads "Transcription factor TFIIB repeat". This agrees with the fact that MEME discovered repeats of this motif in the training set sequences. Motif 2 matches "IPB000812A" whose annotation is also "Transcription factor TFIIB repeat", although it does not appear to repeat in any of the training set sequences. Motif 3 does not appear to match any of the motifs in the Blocks database. However, motif 4 matches "IPB000812B", another TFIIB repeat motif.

  10. Scroll to the top of the MEME results document.

  11. Click on the "First motif" button.

  12. Scroll down to the "Motif 1 in BLOCKS format" section and click on the "Submit BLOCK 1" button.

  13. On the resulting input form, click on "Tree: GIF", to see a neighbor-joining tree of the sites composing motif 1.

    The numbers following the sequence names in the leaves of the tree are the positions in the sequence of the sites. In this example, the sites nearer the N-terminus (smaller position numbers) and the sites nearer the C-terminus (larger position numbers) cluster separately. The height of the subtree containing the N-terminal sites is greater, suggesting that they may be under less selective pressure.

  14. Scroll to the top of the MEME results document.

  15. Click on the button labeled "MAST". This will submit the motifs to the MAST sequence database search tool.

  16. Fill in the resulting MAST submission form as in Basic Protocol 1, giving your email address and selecting the S. cerevisiae database. In the "Ignore motifs if E-value above:" field select ".001". In this case, this will prevent the last two motifs in the MEME file from being used in the search (since they have E-values greater than 0.001). Using only "significant" MEME motifs in the search can often improve search sensitivity.
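
    The effect of the E-value cutoff can be pictured as a simple filter over the motifs. The function name, motif names and E-values below are hypothetical; they are not the values from this MEME run.

```python
def significant_motifs(motifs, max_evalue=0.001):
    """Keep the names of motifs whose E-value does not exceed the cutoff."""
    return [name for name, evalue in motifs if evalue <= max_evalue]


# Hypothetical (name, E-value) pairs.
toy_motifs = [("motif 1", 1e-50), ("motif 2", 2e-10), ("motif 3", 0.5)]
kept = significant_motifs(toy_motifs)
print(kept)
```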

  17. Click on "Start search" on the MAST submission form.

  18. Use your email program to save the MAST search results to file "tf4.anr.mast.sc.html" when they arrive.

  19. Use your web browser to open file "tf4.anr.mast.sc.html". Click on the "Motif Diagrams" button at the top of the page.

    The same two significant matches as were found in Basic Protocol 1 are found here. Notice that they both contain the repeated motifs 1 and 3, as well as motif 2 that is common to the entire family. One other sequence (ref|NP_010164.1|) has a non-significant E-value (5.7), but does contain weak matches to the two repeated motifs (1 and 3) in the right order suggesting it may be a distant homolog of the TFIIB family. Motif order and spacing can often be informative in this way with distant homologs.

Basic Protocol 2: Discovering DNA motifs in a set of DNA sequences with MEME

  1. Run MEME on your training set of sequences (lex.fasta in this example) by doing one of the following:

  2. Fill in the following fields in the MEME input form:

  3. Click on the "Start search" button. This will submit your search to the MEME web-server at the San Diego Supercomputer Center. Within a few seconds, your browser should display a verification message. (See Basic Protocol 1 for an example).

  4. Use your email reader to receive the confirmation message MEME will send you. If you do not receive this message, it is possible that you mistyped your email address. In that case, you will have to resubmit your MEME run.

  5. Use your email reader to receive your MEME results, as you did in Basic Protocol 1.

  6. Save your MEME results to a text file. For example, if your email reader is Netscape Messenger, you would click "Save as" under the "File" menu, select "File" and then enter a file name like "lex.zoops.html".

  7. Use your web browser to view your MEME results. For example, using Netscape Navigator, you would click on "Open Page" in the "File" menu and use "Choose File" to select the file you saved in the previous step: "lex.zoops.html". Then you would click "Open In Navigator".

  8. Click on the "First motif" button. This will take you to the first motif discovered by MEME.

    MEME finds the experimentally verified LexA binding site motif. It automatically determines the correct width for the motif. The very low E-value (3.7e-33) indicates that this motif is highly statistically significant.

  9. Click on the "N" button on the "MOTIF 1" line to proceed to the second motif found by MEME. The E-value of this second motif is only marginally significant (7.1e-3). Therefore, this motif may be a statistical artifact.

  10. Click on the "N" button on the "MOTIF 2" line. Because motif 2 is the last motif MEME found, this will take you to the "Summary of Motifs" section. (Clicking on "N" again will take you back to motif 1; the motif sections and summary section are thus linked in a circle by the "N" and "P" buttons.)

    Even though the "Zero or one per sequence" distribution was specified, the summary of motifs shows all the (non-overlapping) positions that match the motif models MEME discovers. Thus, motif 1 appears three times in sequence recn, and twice in lexA.

  11. Return to the top of the MEME results document.

  12. Click on the "MAST" button. Fill in the MAST form and click on "Start search". This will search the upstream regions of all E. coli genes using only motif 1 (since motif 2 has an E-value above 0.001).

  13. Use your email reader to save the MAST results to file lex.anr.mast.html.

  14. Use your web browser to open the file lex.anr.mast.html and view your MAST results.

    The search identifies upstream regions of several genes not in the training set. Some of them have non-significant E-values, but have multiple copies of the binding site close to the start of transcription, suggesting that LexA may be involved in their regulation as well.

Alternate Protocol 2: Finding repeated motifs in DNA sequences with MEME

The MEME Algorithm