Gribskov & Smith

 BIMM 141 Laboratory

Spring 2001

Introduction to Bioinformatics

 

 

Exercise 4


Homologs from Sequence Databases: FAST and BLAST


Exercise 4 focuses on the use of sequence databases, or libraries, to find sequences similar to a nucleic acid or protein query. This is usually the first type of analysis one performs with a new DNA sequence. The algorithms build on the "dot plot", "sliding window", and "distance matrix" concepts discussed in lectures 4-8 of BIMM 140, and on dynamic programming: the Needleman-Wunsch algorithm for global alignments and the Smith-Waterman modification for local alignments. The popular FAST and BLAST programs add algorithmic changes that increase the speed of comparisons with minimal loss of sensitivity (finding all relevant similarities, i.e. all true positives) and of specificity / selectivity (reporting as few "false" hits, i.e. false positives, as possible).

You used some of the FAST programs in the last exercise, Exercise 3.

The Main objectives in Exercise 4 are:

We will use one of the types of gene sequences, DNA and protein, used in the Altschul et al paper on Gapped-BLAST and Psi-BLAST (Nucl. Acids Res 25: 3389-3402; 1997) in the BIMM 140 Course Reader, page 37, and listed in Tables 2 and 3, to permit comparison of your results with those of Altschul et al.
We recommend that you use one of the specific sequences listed by the SwissProt Accession Number for the encoded protein. This will permit the most precise comparison of your results with those given in Table 3.

You will also use the MTG8 protein sequence with BLASTP and FILTER options, to permit comparison with the data and results in Figure 2 of the Altschul et al article, "Issues in Searching Molecular Sequence Databases", in the Course Reader, page 26.

At this point in the course, you have had sufficient experience with both the Web and the GCG programs that we believe the Exercises need not be as verbose as the initial Exercises.

If you have any problems understanding what is asked or executing a task, please send email with questions to Michael Gribskov, Doug Smith, or Hiren Patel, TA for BIMM 140 / 141.

 

Relevant Articles from the BIMM 140 Course Reader for Exercise 4:

Baxevanis-Ouellette, 2nd Edition, Textbook Relevant Chapters:





 

Main Specific Tasks to Perform in Exercise 4:

A. Searching with BLAST
1. Find sequences to use
2. Peruse the BLAST site at NCBI
3. Use the DNA sequence with BLASTN and the DNA nr database
4. Use the DNA sequence with BLASTX and BLOSUM62 matrix and Protein nr database
5. Use the protein sequence with BLASTP and protein nr database
6. Do BLASTP runs using matrices PAM30, PAM70, BLOSUM45, and BLOSUM80
B. Filtering Low Complexity Sequences in BLAST Searches
1. Retrieve the MTG8 protein sequence corresponding to Genbank accession number D14820
2. Do a BLASTP run using the BLOSUM62 matrix and no FILTER options
3. Do a BLASTP run using the BLOSUM62 matrix and filter with the FILTER option
C. Comparison with dynamic programming alignments for proteins
1. Use the saved matching protein sequence from the BLASTP search in Part A5
2. Align this sequence to your query using the GCG BESTFIT program
3. Evaluate the significance of the alignment using the randomization option of BESTFIT
D. Psi-BLAST
1. Use the Protein Sequence from A for First Iteration of Psi-BLAST and protein nr database
2. Continue the Psi-BLAST run by doing the Second Iteration
3. Continue to do Psi-BLAST Iterations until Stabilization of Hits
E. Questions

 

 

{A. Searching with BLAST}

{1. Find sequences to use. Record the URLs and residue numbers for the sequences in your Notebook.}
Use one of the sequences, or a similar sequence, both DNA and encoded Protein, used in the Gapped-BLAST / Psi-BLAST paper in Tables 2 and 3. The SwissProt Accession Numbers for the specific Protein sequences used are:
P00762, P01008, P01111, P02232, P03435, P05013, P07327, P10318, P10635, P14942, and P20705.
If you have trouble finding the cognate DNA sequence, or if you think the sequence is not correct for the type stated, choose another sequence or find a homologous protein-DNA sequence pair.

Obtain both DNA and Protein cognate pairs. Choose a DNA sequence that is not too long: one that encodes only the gene involved, or one that is a cDNA from the expressed mRNA. Do NOT use a fragment sequence (one that contains only part of the gene, or is short) and do NOT use a very long sequence, e.g. a genomic sequence that covers an entire cosmid, BAC, or chromosome.

You should be able to use the search facilities at either NCBI or ExPASy to obtain these sequences. It is sometimes convenient to store these sequences on your computer in FASTA format. Alternatively, you can record the URL for each sequence entry at NCBI or ExPASy and use it to retrieve the entry as needed. For this exercise, record the URLs in your Notebook file.
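If you do store sequences locally, it helps to be able to read them back. The following is a minimal sketch only: the function name read_fasta and the sample record are hypothetical, and the sample text simply reuses the two MTG8 fragments quoted in Question 19 rather than a real database entry.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header line -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical record used only to show the file layout.
example = """>demo_fragment illustrative record, not a database entry
MPDRTEKHSTMPDSPVDVKTQSRL
TPPTMPPPPTT
"""
sequences = read_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```

The residue count printed for each record is the kind of number you are asked to note in your Notebook alongside the URL.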

{2. Peruse the BLAST site at NCBI.}
Examine the BLAST facilities available at NCBI. Note that the Web page design and the ways in which the different BLAST programs are used are new as of 29 January 2001. Note the links available, particularly the "BLAST overview", "FAQ", "BLAST course", and "BLAST tutorial" links. The "BLAST tutorial" options may be of particular value. The "BLAST course" page, written by Altschul, may be useful for concept clarification.

{3. Use the DNA sequence with BLASTN and the DNA nr database. Include representative parts of the BLASTN output in your Notebook.}
For this part of the exercise, use the "Standard nucleotide-nucleotide BLAST [blastn]" option from the BLAST info page. Use the DNA sequence you obtained in Part A1 above for a BLASTN run, using the nr (NonRedundant) DNA database.

Note the choices of databases available ... and note the links to help on each of the options and windows.

When you run one of the BLAST programs at NCBI, a "formatting BLAST" window comes up. This window gives you a run ID number, permitting you to come back later and look again at the output. The window also permits you to reformat the output, to show more or fewer "one-liner" description lines, alignments, etc.

When the run is completed and you click on the "Format!" button, the BLAST results appear in a new window. Compare this output with the output described in class.

When the BLAST output comes back, use COPY-PASTE to include representative parts of the BLASTN output in your Notebook.

Examine the information about the BLAST run found at the very end of the BLAST output, and include relevant comments in your Notebook.

{4. Use the DNA sequence with BLASTX and BLOSUM62 matrix and Protein nr database}
Repeat the above BLASTN run but now using BLASTX and the Protein nr database. Again, include representative parts of the BLASTX output in your Notebook and comments about information found at the end of the BLAST output.

{5. Use the protein sequence with BLASTP and Protein nr database. Choose 100 descriptions and 100 alignments in the Format options}

{Save the "One-liner" description and the Alignment for the WORST hit found ... this will be used in Part C below}
Now use the cognate Protein Sequence for a BLAST run using the BLASTP program and the same nr database. Again, include representative parts of the BLASTP output in your Notebook.

Examine the information about the BLAST run found at the very end of the BLAST output, and include relevant comments in your Notebook.

Compare the output from these three BLAST runs by looking mainly at the "one-liner" description lines.
Which search seems to be the most useful? Why?

How do the results found in your BLAST runs compare with those in your FASTA runs? Which is most useful? Why?

{6. Do BLASTP runs using the scoring matrices PAM30, PAM70, BLOSUM45, and BLOSUM80.}
Use your protein sequence to do runs using the BLASTP program against the Protein nr database, but using the scoring matrices PAM30, PAM70, BLOSUM45, and BLOSUM80.

Compare the Hits found by looking at the description lines, and include the 4 sets of Description lines (the main hits, at least) in your Notebook.

Compare also the hits found with the PAM matrices to those found with the BLOSUM62 matrix (the NCBI default protein scoring matrix) used in Part A5 above.

Which scoring matrix works best to detect homologous sequences?


{B. BLASTP run, using FILTER and the MTG8 protein sequence}

In this part, you will use the MTG8 Protein Sequence and the BLASTP program, comparing the effects of the FILTER operation, to "mask" low information sequences and direct repeats. The MTG8 sequence is the one used by Altschul et al in their article "Issues in searching molecular sequence databases", in Figure 2, permitting you to directly compare your results with those of the authors.

{1. Retrieve the URL for the MTG8 protein sequence corresponding to Genbank accession number D14820}
This should be easy for you to do at NCBI Entrez using the Accession Number given.

{2. Do a BLASTP run using the BLOSUM62 matrix and no FILTER options.}
Include representative data from your BLASTP output in your notebook.

{3. Do a BLASTP run using the BLOSUM62 matrix and FILTER}
Do your BLASTP run with the SEG filter option selected (choose Filter=default on the pulldown menu). Compare your BLASTP output visually with that obtained with no FILTER options, and include data from your BLASTP output that is particularly relevant to differences found with and without use of the FILTER option. What differences do you find? Why are these sequences filtered out?

 

{C. Comparison to dynamic programming alignments for proteins}

FASTA and BLAST were originally conceived as fast approximations to complete dynamic programming alignments. It is interesting to see how your BLAST results compare to those of these much slower methods.
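To make "complete dynamic programming" concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. It is a teaching simplification, not what BESTFIT or BLAST actually run: the function name and the match / mismatch / gap values are assumptions chosen for illustration, whereas real programs score with a matrix such as BLOSUM62 and usually charge separate gap-opening and gap-extension penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Assumes simple identity scoring and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

Every cell of the table is filled, so the work grows with the product of the two sequence lengths. That cost, repeated for every database entry, is what the word-based heuristics in FASTA and BLAST are designed to avoid.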

{1. Use the saved matching protein sequence from the BLASTP search in Part A5 above}
Use the sequence that matched your query poorly and that you saved in Part A5 above. This sequence is near the lower end of the sequences that seem to match (i.e. appear to be homologous to) your query. A sequence that is nearly identical to your query will not show you anything.

Provide some data in your notebook regarding your choice for this new sequence.

{2. Align this protein sequence to your query using the GCG BESTFIT program.}
Try to adjust the gap penalties to make this alignment resemble the alignments produced in the BLAST database searches. How does this alignment differ from the alignments you found for your BLAST runs?

{3. Evaluate the significance of the alignment using the randomization option of BESTFIT.}
Alignments are often considered significant if the score is more than 6 standard deviations above the mean score for random sequences (i.e., Z >= 6). Is your alignment significant by this criterion? Which seems more reliable: the comparison to unrelated sequences in the database search, or the randomization strategy used in the appropriate option with BESTFIT?
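The Z-score criterion above can be sketched as a shuffle test. This is an illustration of the general idea only and is not BESTFIT's own implementation: the function names, the simple identity scoring, the number of shuffles, and the fixed random seed are all assumptions made for the example.

```python
import random
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with identity scoring and linear gaps."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_z_score(query, subject, n_shuffles=100, seed=0):
    """Z = (observed score - mean of shuffled scores) / standard deviation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = local_score(query, subject)
    residues = list(subject)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps composition, destroys residue order
        shuffled_scores.append(local_score(query, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / sd


z = shuffle_z_score("MPDRTEKHSTMPDSPVDVKTQSRL", "MPDRTEKHSTMPDSPVDVKTQSRL")
print("significant by Z >= 6:", z >= 6)
```

Shuffling preserves the length and composition of the second sequence, so the random scores reflect what those two properties alone can produce. Keep that in mind when you compare this strategy with the database-derived statistics from BLAST.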

 

{D. Psi-BLAST run}

Psi-BLAST ("Position-Specific Iterative BLAST") combines the fast database searching of the BLAST algorithm with the position-specific scoring matrix of a Profile, in order to find distant homologues in a database in an iterative manner.
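As a rough illustration of what "position-specific" means, the sketch below builds a tiny profile from an ungapped set of aligned sequences. It is a deliberate simplification and not the Psi-BLAST procedure: it assumes a uniform background frequency and add-one pseudocounts, and it omits the sequence weighting and gap handling a real profile needs. The function names and the three short sequences are made up for the example.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(aligned, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must have equal length (ungapped alignment)")
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    profile = []
    for col in range(length):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in ALPHABET}
        )
    return profile


def profile_score(profile, seq):
    """Sum the column-specific scores for an ungapped sequence."""
    return sum(column[aa] for column, aa in zip(profile, seq))


# Made-up three-residue alignment, for illustration only.
toy_profile = build_profile(["MKV", "MKI", "MRV"])
print(profile_score(toy_profile, "MKV"), profile_score(toy_profile, "WWW"))
```

Unlike BLOSUM62, where a given residue pair always receives the same score, here the score for a residue depends on the column in which it falls. Each Psi-BLAST iteration rebuilds a matrix of this general kind from the hits accepted in the previous round.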

{1. Use the Protein Sequence from above with First Iteration of Psi-BLAST and Protein nr database}
Use the Protein Sequence from above and the "PSI- and PHI-BLAST" option at the BLAST site at NCBI. Use the BLOSUM62 matrix.

Include representative parts of the BLAST output in your Notebook, including information from the end of the BLAST output.

Include the number of hits found above the threshold of E = 0.001, and note how many of these are "NEW".
What fraction of the hits found in this first run of Psi-BLAST are new?
Begin to build a Table of total hits found, approximate number of New Hits present in the 'one-liner' descriptions, and the E-value of the WORST recorded hit, versus Iteration Number.

How does this output compare with that found in Part A5 above? How is it similar, and how does it differ?
How does the information at the end of the BLAST output compare with that from the Part A5 output above?

{2. Continue the PSI-BLAST run by doing the Second Iteration}
Click on the "Run PSI-Blast iteration 2" button.

When you do this, you are executing a Second BLAST run.
NOTE that the Format Web page is the same as used for the first run!
... it may be covered on your screen!
As you continue to do these Psi-BLAST iterations, you will "flip between" the Output and Format Web pages ... keep each one exposed to some extent on your monitor screen as you do these Iterations !!!

What was the Query used in this Second BLAST run?

{3. Continue to do Psi-BLAST Iterations until Stabilization of Hits}
Click on the "Run PSI-Blast iteration 3" button. Examine your hits. Record the number of total hits.
Continue to do this until the number of total hits has stabilized.

Record the Number of Iterations you did, and complete your Table of Iteration Number versus Total Hits found, number of New Hits in Description lines, and E-value of last (worst) hit recorded.
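If you prefer to keep the table programmatically, the bookkeeping can be sketched as follows. The function names and the hit identifiers are hypothetical placeholders, not real results. The sketch also assumes that "new" means "not present in the previous iteration", which may not match exactly how the NCBI page marks hits, and it leaves the E-value column for you to fill in by hand.

```python
def iteration_table(hit_sets):
    """Return rows of (iteration number, total hits, new hits)."""
    rows = []
    previous = set()
    for number, hits in enumerate(hit_sets, start=1):
        rows.append((number, len(hits), len(hits - previous)))
        previous = hits
    return rows


def has_stabilized(hit_sets):
    """True when the last two iterations returned the same set of hits."""
    return len(hit_sets) >= 2 and hit_sets[-1] == hit_sets[-2]


# Placeholder identifiers, invented purely to show the layout.
runs = [{"seqA", "seqB"}, {"seqA", "seqB", "seqC"}, {"seqA", "seqB", "seqC"}]
table = iteration_table(runs)
for row in table:
    print("iteration %d: total=%d new=%d" % row)
print("stabilized:", has_stabilized(runs))
```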

For your last BLAST run, include some representative information from the output file in your Notebook.
Compare the description lines from your first Psi-BLAST run with those from your last iteration. Were some of the initial sequences removed or lost? Were new sequences gained? How do the Scores compare?

 

{E. Questions}

  1. Did you have any problems finding cognate protein and DNA sequences starting with the SwissProt Accession Numbers given in Part A1 of this Exercise? If so, what were they?
  2. What are the primary functions of the separate BLAST format Web page?
  3. What are the four primary elements of the BLAST output file?
  4. Answer the questions given in Exercise 4, Parts A, B, C, and D, and put your answers in the appropriate place above.
  5. In the BLAST programs, what are the meanings of the following terms: w, T, HSP, S ?
  6. How are w and T determined? How does this determination differ between DNA and protein searches?
  7. What is the EXPECT parameter? How is this related to S? How is this related to "sensitivity" and "specificity" issues?
  8. Speed is a major reason for the development of BLAST 2.0 (Gapped-BLAST). Why is there a speed problem with BLAST 1.0?
  9. BLAST 2.0 runs overall about 3-fold faster than BLAST. What factors, positive and negative, contribute to this overall 3-fold increase in speed?
  10. What are the two major innovations in Gapped-BLAST as compared with BLAST?
  11. What is meant by "filtering"? What types of sequences does one often wish to "filter"? What types of sequences can the BLAST server at NCBI filter? What is the basis of the SEG algorithm, and what types of sequences does it filter? What is the basis of the XNU algorithm, and what types of sequences does it filter?
  12. What is a bit of information? How many bits of information are there in a given amino acid? What does this mean? Calculate how many bits of information there are in the number 100.
  13. The PIR protein database now contains over 75 million amino acids. To search this database with a protein sequence of length 250 aa, a score of how many bits is needed to distinguish an HSP with this Score from chance? Why are higher HSP scores needed in a search of a database than in a comparison between two sequences to distinguish the HSP from chance?
  14. In the table from your Psi-BLAST iterations, what happened to Total Hits, Number of New Hits in Description lines, and the E-value as the number of iterations increased? How do you know when to stop executing further iterations of Psi-BLAST?
  15. What is a Profile?
  16. What happens to the Profile during execution of a Psi-BLAST run? What does the resulting Profile describe?
  17. Is this the best Profile that can be achieved? Why or why not?
  18. How do your results, for both the BLASTP run and the Psi-BLAST complete run, compare with those of the Altschul et al Psi-BLAST paper, Table 3? How can you account for the differences?
  19. For the MTG8 protein sequence, when you did your FILTER run you should have found that the first part of the sequence, a sequence like the following:

MPDRTEKHSTMPDSPVDVKTQSRL

was NOT filtered, whereas the adjacent sequence:

TPPTMPPPPTT

WAS filtered. For each of these two total sequences, calculate the Complexity Kw. Calculate also the Complexity of the "most complex" and of the "least complex" protein sequences for each of these two sequence lengths. How does the Complexity of each of these MTG8 sequences compare with the comparable "most" and "least" complex Complexity values?
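For Questions 12 and 19, the arithmetic can be checked with a short script. This sketch assumes that the complexity intended is the compositional complexity used by SEG, K = (1/L) log_N (L! / (n_1! n_2! ... n_N!)), where L is the segment length, n_i is the count of residue type i, and N = 20 for proteins. Verify that definition against the Course Reader before relying on it. The function names are hypothetical.

```python
import math
from collections import Counter


def complexity(seq, alphabet_size=20):
    """Compositional complexity K of a sequence segment (assumed SEG definition)."""
    length = len(seq)
    # log(L! / prod(n_i!)) computed with lgamma to avoid huge factorials
    log_multinomial = math.lgamma(length + 1) - sum(
        math.lgamma(n + 1) for n in Counter(seq).values()
    )
    return log_multinomial / (length * math.log(alphabet_size))


def bits(n_outcomes):
    """Bits needed to single out one of n equally likely outcomes."""
    return math.log2(n_outcomes)


unfiltered = "MPDRTEKHSTMPDSPVDVKTQSRL"
filtered = "TPPTMPPPPTT"
print(complexity(unfiltered), complexity(filtered))
print(bits(8))
```

Under this definition, a segment made of a single repeated residue has K = 0, and K is largest when the residue counts are as even as the segment length allows. Those two cases give you the "least complex" and "most complex" reference values for each length.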

 





Latest modification: 21 April, 2001
