Gribskov & Smith
BIMM 141 Laboratory
Spring 2001
Introduction to Bioinformatics
Exercise 4 focuses on searching sequence databases (libraries) to find similarities to a nucleic acid or protein sequence. This is usually the first type of analysis one performs with a new DNA sequence. The algorithms are largely based on the concepts underlying "dot plots", "sliding window" comparisons, and "distance matrices" discussed in lectures 4-8 of BIMM 140, as well as on dynamic programming: the Needleman-Wunsch algorithm for global alignments and the Smith-Waterman modification for local alignments. The popular FASTA and BLAST programs add algorithmic changes that increase the speed of comparisons with minimal loss of sensitivity (finding all relevant similarities, i.e. all true positives) and specificity / selectivity (reporting few "false" hits, i.e. few false positives).
You used some of the FASTA programs in the last exercise, Exercise 3.
The main objectives in Exercise 4 are:
We will use one of the types of gene sequences, both DNA and protein,
used in the Altschul et al. paper on Gapped-BLAST and Psi-BLAST
(Nucl. Acids Res. 25: 3389-3402; 1997), in the BIMM 140 Course
Reader, page 37, and listed in Tables 2 and 3 of that paper, so that
you can compare your results with those of Altschul et al.
In fact, we recommend that you use one of the specific sequences
listed by SwissProt Accession Number for the encoded protein.
This will permit the most precise comparison of your results with
some of those given in Table 3.
You will also use the MTG8 protein sequence with BLASTP and FILTER options, to permit comparison with the data and results in Figure 2 of the Altschul et al article, "Issues in Searching Molecular Sequence Databases", in the Course Reader, page 26.
At this point in the course, you have had sufficient experience with both the Web and the GCG programs that we believe the Exercises need not be as verbose as the initial Exercises.
If you have any problems understanding what is asked or carrying out a task, please email your questions to Michael Gribskov, Doug Smith, or Hiren Patel, TA for BIMM 140 / 141.
Baxevanis-Ouellette, 2nd Edition, Textbook Relevant Chapters:
{1. Find sequences to use. Record the
URLs and residue numbers for the sequences in your Notebook.}
Use one of the sequences, or a similar sequence, both DNA and
encoded Protein, used in the Gapped-BLAST / Psi-BLAST paper in
Tables 2 and 3. The SwissProt Accession Numbers for the specific
Protein sequences used are:
P00762, P01008, P01111, P02232, P03435, P05013, P07327, P10318,
P10635, P14942, and P20705.
If you have trouble finding the cognate DNA sequence, or if you
think the sequence is not correct for the type stated, choose
another sequence or find a homologous protein-DNA sequence pair.
Obtain both DNA and Protein cognate pairs. Choose a DNA sequence that is not too long: one that encodes only the gene involved, or one that is a cDNA from the expressed mRNA. Do NOT use a fragment sequence (one that contains only part of the gene, or is short) and do NOT use a very long sequence, e.g. a genomic sequence that covers an entire cosmid, BAC, or chromosome.
You should be able to use the search facilities at either NCBI or ExPASy to obtain these sequences. It is sometimes convenient to store these sequences on your computer in FASTA format. Alternatively, you can just record the URL for each sequence entry at NCBI or ExPASy, for example, in your Notebook file, to use to retrieve a given sequence entry as needed. You are to do this here.
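If you do store sequences locally, the FASTA format mentioned above is simple enough to read and write by hand. The following is a minimal sketch, not a full parser: the function names and the record header are hypothetical, and the short peptide is only a placeholder, not a complete database entry.

```python
def parse_fasta(text):
    """Return a dict mapping each header line (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def format_fasta(header, sequence, width=60):
    """Write one record as FASTA text, wrapping the sequence at `width` residues."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# Hypothetical record: the header text is invented for illustration.
example = format_fasta("example_protein placeholder record",
                       "MPDRTEKHSTMPDSPVDVKTQSRL", width=10)
print(example)
print(parse_fasta(example))
```

Keeping the accession number in the header line makes it easy to match a stored file to the URL recorded in your Notebook.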
{2. Peruse the BLAST site at NCBI.}
Examine the BLAST facilities available at NCBI. Note that
the Web page design and the ways in which the different BLAST programs
are used are new as of 29 January 2001. Note the links available,
particularly the "BLAST overview", "FAQ",
"BLAST course", and "BLAST tutorial" links.
The "BLAST tutorial" options may be of particular value.
The "BLAST course" page, written by Altschul, may be
useful for concept clarification.
{3. Use the DNA sequence with BLASTN
and the DNA nr database. Include representative parts of the BLASTN
output in your Notebook.}
For this part of the exercise, use the "Standard nucleotide-nucleotide
BLAST [blastn]" option from the BLAST info page. Use the DNA sequence you
obtained in Part A above for a BLASTN run, using the nr (NonRedundant)
DNA database.
Note the choices of databases available ... and note the links to help on each of the options and windows.
When you run one of the BLAST programs at NCBI, a "formatting BLAST" window comes up. This window gives you a run ID number, permitting you to come back later and look again at the output. The window also permits you to reformat the output, to show more or fewer "one-liner" description lines, alignments, etc.
When the run is completed and you click on the "Format!" button, the BLAST results appear in a new window. Compare this output with the output described in class.
When the BLAST output comes back, use COPY-PASTE to include representative parts of the BLASTN output in your Notebook.
Examine the information about the BLAST run found at the very end of the BLAST output, and include relevant comments in your Notebook.
{4. Use the DNA sequence with BLASTX
and BLOSUM62 matrix and Protein nr database}
Repeat the above BLASTN run but now using BLASTX and the Protein
nr database. Again, include representative parts of the BLASTX
output in your Notebook and comments about information found at
the end of the BLAST output.
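BLASTX compares the conceptual translation of a DNA query in all six reading frames against a protein database. The sketch below shows what "six reading frames" means in practice. It assumes the standard genetic code and is not the NCBI implementation; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

The frame labels (+1 to +3, -1 to -3) follow the convention used in BLASTX alignments, so you can check which frame of your DNA query produced each hit.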
{5. Use the protein sequence with BLASTP and Protein nr database. Choose 100 descriptions and 100 alignments in the Format options}
{Save the "One-liner" description and the Alignment
for the WORST hit found ... this will be used in Part C below}
Now use the cognate Protein Sequence for a BLAST run using the
BLASTP program and the same nr database. Again, include representative
parts of the BLASTP output in your Notebook.
Examine the information about the BLAST run found at the very end of the BLAST output, and include relevant comments in your Notebook.
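The end of a BLAST report lists the statistical parameters lambda and K and the effective search space. In Karlin-Altschul statistics these connect a raw score S to a bit score S' = (lambda*S - ln K) / ln 2 and to an expectation E = K * (search space) * exp(-lambda*S). The sketch below applies these two formulas. The numerical values are illustrative placeholders only; substitute the values printed in your own output.

```python
import math


def bit_score(raw_score, lam, k):
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, lam, k, search_space):
    """Expected number of chance hits, E = K * search_space * exp(-lambda * S)."""
    return k * search_space * math.exp(-lam * raw_score)


# Placeholder values for illustration: copy lambda, K and the effective
# search space from the end of your own BLAST report instead.
LAMBDA = 0.267
K = 0.041
SEARCH_SPACE = 1.0e10
raw = 100

bits = bit_score(raw, LAMBDA, K)
e_value = expect_value(raw, LAMBDA, K, SEARCH_SPACE)
print(f"bit score = {bits:.1f}, E = {e_value:.2e}")
```

Because E scales linearly with the search space, the same alignment score gives a larger E-value against a larger database, which is worth keeping in mind when you compare runs.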
Compare the output from these three BLAST runs by looking mainly
at the "one-liner" description lines.
Which search seems to be the most useful? Why?
How do the results found in your BLAST runs compare with those in your FASTA runs? Which is most useful? Why?
{6. Do BLASTP runs using the distance
matrices PAM30, PAM70, BLOSUM45, and BLOSUM80.}
Use your protein sequence to do runs using the BLASTP program
against the Protein nr database, but using the Distance Matrices
PAM30, PAM70, BLOSUM45, and BLOSUM80.
Compare the Hits found by looking at the description lines, and include the 4 sets of Description lines (the main hits, at least) in your Notebook.
Compare also the hits found with the PAM matrices to those found with the BLOSUM62 matrix (the NCBI default protein distance matrix) used in Part A3 above.
Which distance matrix works best to detect homologous sequences?
In this part, you will use the MTG8 Protein Sequence and the BLASTP program, comparing the effects of the FILTER operation, to "mask" low information sequences and direct repeats. The MTG8 sequence is the one used by Altschul et al in their article "Issues in searching molecular sequence databases", in Figure 2, permitting you to directly compare your results with those of the authors.
{1. Retrieve the URL for the MTG8 protein
sequence corresponding to Genbank accession number D14820}
This should be easy for you to do at NCBI Entrez using the Accession
Number given.
{2. Do a BLASTP run using the BLOSUM62
matrix and no FILTER options.}
Include representative data from your BLASTP output in your notebook.
{3. Do a BLASTP run using the BLOSUM62
matrix and FILTER}
Do your BLASTP run with the SEG filter option selected (choose
Filter=default on the pulldown menu). Compare your BLASTP output
visually with that obtained with no FILTER options, and include
data from your BLASTP output that is particularly relevant to
differences found with and without use of the FILTER option. What
differences do you find? Why are these sequences filtered out?
FASTA and BLAST were originally conceived as fast approximations to complete dynamic programming alignments. It is interesting to see how your BLAST results compare to these, much slower, methods.
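For reference, a complete dynamic programming local alignment can be sketched in a few lines. This is a simplified Smith-Waterman scorer that assumes a single match/mismatch score and a linear gap penalty; BESTFIT and BLAST use substitution matrices and separate gap creation and extension penalties, so the numbers will not match theirs.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Fills the dynamic programming matrix one row at a time; every cell
    is floored at zero, which is what makes the alignment local.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell of the matrix is filled, so the running time grows with the product of the two sequence lengths. That cost is what the word-based heuristics in FASTA and BLAST are designed to avoid.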
{1. Use the saved matching protein sequence
from the BLAST protein sequence searches in Part A4 above}
Use the sequence that matched poorly to your query and that you
saved in Part A4 above. This sequence is near the lower end of
the sequences that seem to match (i.e. are homologous) to your
query. A sequence that is nearly identical to your query will
not show you anything.
Provide some data in your notebook regarding your choice for this new sequence.
{2. Align this protein sequence to your
query using the GCG BESTFIT program.}
Try to adjust the gap penalties to make this alignment resemble
the alignments produced in the BLAST database searches. How does
this alignment differ from the alignments you found for your BLAST
runs?
{3. Evaluate the significance of the
alignment using the randomization option of BESTFIT.}
Alignments are often considered significant if the score is more
than 6 standard deviations above the mean score obtained with
randomized sequences (i.e., Z >= 6). Is your alignment significant by this criterion?
Which seems more reliable, the comparison to unrelated sequences
in the database search, or the randomization strategy used in
the appropriate option with BESTFIT?
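The Z criterion above can be sketched as follows. This is an illustration of the idea only: BESTFIT's own randomization procedure may differ in detail, the function names are hypothetical, and the identity-counting scorer is a stand-in for a real alignment score.

```python
import random
import statistics


def shuffled_scores(query, target, score_fn, trials=100, seed=0):
    """Score the query against shuffled copies of the target sequence."""
    rng = random.Random(seed)
    residues = list(target)
    scores = []
    for _ in range(trials):
        rng.shuffle(residues)
        scores.append(score_fn(query, "".join(residues)))
    return scores


def z_score(observed, random_scores):
    """Z = (observed score - mean of random scores) / standard deviation."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("random scores have zero variance")
    return (observed - mean) / sd


def identity_count(a, b):
    """Toy scorer (assumption): count identical residues at equal positions."""
    return sum(x == y for x, y in zip(a, b))


query = "MPDRTEKHSTMPDSPVDVKTQSRL"
target = "MPDRTEKHSTMPDSPVDVKTQSRL"
randoms = shuffled_scores(query, target, identity_count)
print(z_score(identity_count(query, target), randoms))
```

Shuffling preserves the composition and length of the sequence, so the random scores reflect what those two properties alone can produce.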
Psi-BLAST stands for "Position Specific Iterative BLAST". It combines the fast database searching of the BLAST algorithm with the position-specific scoring matrix of a Profile, and applies them iteratively to find distant homologues in a database.
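To make the idea of a position-specific matrix concrete, the sketch below builds log-odds scores column by column from a small ungapped alignment. It is a deliberately simplified illustration: it assumes uniform background frequencies and a fixed pseudocount, whereas PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts. The function names and the three toy sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_against_pssm(pssm, seq):
    """Sum the column scores for an ungapped sequence of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


toy_alignment = ["MKV", "MRV", "MKI"]  # invented sequences for illustration
pssm = build_pssm(toy_alignment)
print(score_against_pssm(pssm, "MKV"), score_against_pssm(pssm, "WWW"))
```

Unlike BLOSUM62, which gives the same score to a given residue pair everywhere, each column here has its own scores, so conserved positions weigh more heavily in later iterations.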
{1. Use the Protein Sequence from above
with First Iteration of Psi-BLAST and Protein nr database}
Use the Protein Sequence from above and the "PSI- and PHI-BLAST"
option at the BLAST
site at NCBI. Use the BLOSUM62 matrix.
Include representative parts of the BLAST output in your Notebook, including information from the end of the BLAST output.
Include the number of hits found above the threshold of E =
0.001, and note how many of these are "NEW".
What fraction of the hits found in this first run of Psi-BLAST
are new?
Begin to build a Table of total hits found, approximate number
of New Hits present in the 'one-liner' descriptions, and the E-value
of the WORST recorded hit, versus Iteration Number.
How does this output compare with that found in Part A4 above?
How is it similar, and how does it differ?
How does the information at the end of the BLAST output compare
with that from the Part A4 output above?
{2. Continue the PSI-BLAST run by doing
the Second Iteration}
Click on the "Run PSI-Blast iteration 2" button.
When you do this, you are executing a Second BLAST run.
NOTE that the Format Web page is the same one used for the first
run, and it may be hidden behind other windows on your screen.
As you continue to do these Psi-BLAST iterations, you will
"flip between" the Output and Format Web pages, so keep
each one at least partly visible on your monitor screen as you
do these Iterations.
What was the Query used in this Second BLAST run?
{3. Continue to do Psi-BLAST Iterations
until Stabilization of Hits}
Click on the "Run PSI-Blast iteration 3" button. Examine
your hits. Record the number of total hits.
Continue to do this until the number of total hits has stabilized.
Record the Number of Iterations you did, and complete your Table of Iteration Number versus Total Hits found, number of New Hits in Description lines, and E-value of last (worst) hit recorded.
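If you prefer to keep this Table electronically, the bookkeeping can be sketched as below. The hit identifiers and E-values are invented for illustration, and "new" is assumed here to mean a hit that was absent from the previous iteration.

```python
def iteration_table(hits_per_iteration):
    """Build rows of (iteration, total hits, new hits, worst E-value).

    hits_per_iteration: list of dicts, one per iteration, mapping a hit
    identifier to its E-value.
    """
    rows = []
    previous = set()
    for number, hits in enumerate(hits_per_iteration, start=1):
        current = set(hits)
        worst = max(hits.values()) if hits else None
        rows.append((number, len(current), len(current - previous), worst))
        previous = current
    return rows


def has_stabilized(rows):
    """True when the last iteration added no new hits and kept the same total."""
    return (len(rows) >= 2
            and rows[-1][2] == 0
            and rows[-1][1] == rows[-2][1])


# Invented example data, not real search results.
runs = [
    {"hitA": 1e-50, "hitB": 1e-5},
    {"hitA": 1e-60, "hitB": 1e-20, "hitC": 5e-4},
    {"hitA": 1e-62, "hitB": 1e-22, "hitC": 1e-6},
]
for row in iteration_table(runs):
    print(row)
print("stabilized:", has_stabilized(iteration_table(runs)))
```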
For your last BLAST run, include some representative information
from the output file in your Notebook.
Compare the description lines from your first Psi-BLAST run
with those from your last Iteration BLAST run. Were some of the initial
sequences removed or lost? Were new sequences gained? How do the Scores
compare?
MPDRTEKHSTMPDSPVDVKTQSRL
was NOT filtered, whereas the adjacent sequence:
TPPTMPPPPTT
WAS filtered. For each of these two total sequences, calculate the Complexity Kw. Calculate also the Complexity of the "most complex" and of the "least complex" protein sequences for each of these two sequence lengths. How does the Complexity of each of these MTG8 sequences compare with the comparable "most" and "least" complex Complexity values?
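The calculation can be sketched as follows. This handout does not define Kw, so the code ASSUMES the compositional complexity K = (1/L) * log_N( L! / (n_1! * n_2! * ... * n_N!) ), where L is the segment length, n_i is the count of residue type i, and N = 20 for proteins. Check this against the definition given in the article before relying on the numbers. The function names are hypothetical.

```python
import math
from collections import Counter


def log_arrangements(counts):
    """Natural log of L! / (n_1! * n_2! * ...) for the given residue counts."""
    length = sum(counts)
    return math.lgamma(length + 1) - sum(math.lgamma(n + 1) for n in counts)


def complexity(seq, alphabet_size=20):
    """Assumed definition: K = (1/L) * log_N of the number of arrangements."""
    counts = list(Counter(seq).values())
    return log_arrangements(counts) / (len(seq) * math.log(alphabet_size))


def max_complexity(length, alphabet_size=20):
    """Complexity of the most even composition possible for this length."""
    base, extra = divmod(length, alphabet_size)
    counts = [base + 1] * extra + [base] * (alphabet_size - extra)
    return log_arrangements(counts) / (length * math.log(alphabet_size))


def min_complexity(length, alphabet_size=20):
    """Complexity of a homopolymer: one arrangement only, so K = 0."""
    return log_arrangements([length]) / (length * math.log(alphabet_size))


for segment in ("MPDRTEKHSTMPDSPVDVKTQSRL", "TPPTMPPPPTT"):
    n = len(segment)
    print(segment, n, complexity(segment), min_complexity(n), max_complexity(n))
```

Under this definition the "least complex" sequence of any length is a single repeated residue, and the "most complex" one uses the 20 residue types as evenly as the length allows.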
Latest modification: 21 April, 2001
If you have problems or questions, send email to Michael Gribskov or Doug Smith or Hiren Patel