Exercise 4
{ ... student name ??? ... Account bi141s?? ...}
{A. Searching with BLAST}
{1. Find sequences to use. Record the URLs and residue
numbers for the sequences in your Notebook.}
{2. Peruse the BLAST site at NCBI.}
{3. Use the DNA sequence with BLASTN and the DNA nr database.
Include representative parts of the BLASTN output in your Notebook.}
{4. Use the DNA sequence with BLASTX and BLOSUM62 matrix
and Protein nr database}
{5. Use the protein sequence with BLASTP and Protein
nr database. Choose 100 descriptions and 100 alignments in the
Format options}
{Save the "One-liner" description and the Alignment
for the WORST hit found ... this will be used in Part C below}
{6. Do BLASTP runs using the distance matrices PAM30,
PAM70, BLOSUM45, and BLOSUM80.}
{B. BLASTP run, using FILTER and the MTG8 protein sequence}
{1. Retrieve the URL for the MTG8 protein sequence corresponding
to Genbank accession number D14820}
{2. Do a BLASTP run using the BLOSUM62 matrix and no
FILTER options.}
{3. Do a BLASTP run using the BLOSUM62 matrix and FILTER}
{C. Comparison to dynamic programming alignments for
proteins}
{1. Use the saved matching protein sequence (the WORST hit) from the
BLASTP protein sequence search in Part A5 above}
{2. Align this protein sequence to your query using the
GCG BESTFIT program.}
{3. Evaluate the significance of the alignment using
the randomization option of BESTFIT.}
{D. Psi-BLAST run}
{1. Use the Protein Sequence from above with First Iteration
of Psi-BLAST and Protein nr database}
{2. Continue the Psi-BLAST run by doing the Second Iteration}
{3. Continue to do Psi-BLAST Iterations until Stabilization
of Hits}
{E. Questions}
- 1. Did you have any problems finding cognate protein and
DNA sequences starting with the SwissProt Accession Numbers given
in Part A1 of this Exercise? If so, what were they?
- 2. What are the primary functions of the separate BLAST format
Web page?
- 3. What are the four primary elements of the BLAST output
file?
- 4. Answer the questions given in Exercise 4, Parts A, B,
C, and D, and put your answers in the appropriate place above.
- 5. In the BLAST programs, what are the meanings of the following
terms: w, T, HSP, S ?
- 6. How are w and T determined? How does this determination
differ between DNA and protein searches?
- 7. What is the EXPECT parameter? How is this related to S?
How is this related to "sensitivity" and "specificity"
issues?
- 8. Speed is a major reason for the development of BLAST 2.0
(Gapped-BLAST). Why is there a speed problem with BLAST 1.0?
- 9. BLAST 2.0 runs overall about 3-fold faster than BLAST 1.0.
What factors, positive and negative, contribute to this overall
3-fold increase in speed?
- 10. What are the two major innovations in Gapped-BLAST as
compared with BLAST 1.0?
- 11. What is meant by "filtering"? What types of
sequences does one often wish to "filter"? What types
of sequences can the BLAST server at NCBI filter? What is the
basis of the SEG algorithm, and what types of sequences does
it filter? What is the basis of the XNU algorithm, and what types
of sequences does it filter?
- 12. What is a bit of information? How many bits of information
are there in a given amino acid? What does this mean? Calculate
how many bits of information there are in the number 100.
- 13. The PIR protein database now contains over 75 million
amino acids. When searching this database with a protein sequence
of length 250 aa, how many bits must an HSP score to be
distinguished from chance? Why are higher HSP scores
needed in a search of a database than in a comparison between
two sequences to distinguish the HSP from chance?
- 14. In the table from your Psi-BLAST iterations, what happened
to Total Hits, Number of New Hits in Description lines, and the
E-value as the number of iterations increased? How do you know
when to stop executing further iterations of Psi-BLAST?
- 15. What is a Profile?
- 16. What happens to the Profile during execution of a Psi-BLAST
run? What does the resulting Profile describe?
- 17. Is this the best Profile that can be achieved? Why or
why not?
- 18. How do your results, for both the BLASTP run and the
complete Psi-BLAST run, compare with those in Table 3 of the
Altschul et al. Psi-BLAST paper? How can you account for the differences?
- 19. For the MTG8 protein sequence, your FILTER run should have
shown that the first part of the sequence, a stretch like the
following:
MPDRTEKHSTMPDSPVDVKTQSRL
was NOT filtered, whereas the adjacent stretch:
TPPTMPPPPTT
WAS filtered. For each of these two sequences, calculate
the Complexity Kw. Also calculate the Complexity of the "most
complex" and of the "least complex" protein sequences
of each of these two lengths. How does the Complexity
of each MTG8 sequence compare with the corresponding "most"
and "least" complex values?
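
The quantitative questions above (12, 13, and 19) can be set up with a few lines of Python. The sketch below is only one way to arrange the arithmetic, and it rests on assumptions that are not stated in this exercise: that all 20 amino acids are equally likely when counting bits per residue; that the bit score needed in a database search follows the normalized relation E = m * n * 2^(-S'), ignoring edge corrections; and that "Complexity Kw" means the Wootton and Federhen measure K = (1/L) log_N(L! / prod n_i!) with N = 20. All function and variable names are hypothetical. If the course materials define Kw differently, substitute that definition.

```python
from collections import Counter
from math import lgamma, log, log2

# The two MTG8 segments quoted in Question 19.
MTG8_UNFILTERED = "MPDRTEKHSTMPDSPVDVKTQSRL"
MTG8_FILTERED = "TPPTMPPPPTT"


def bits_to_specify(n_alternatives):
    """Bits needed to single out one of n equally likely alternatives."""
    return log2(n_alternatives)


def bits_to_beat_chance(query_len, db_len, expect=1.0):
    """Bit score S' at which E = m * n * 2**(-S') falls to `expect`.

    Assumption: normalized bit scores, no edge-effect correction.
    """
    return log2(query_len * db_len / expect)


def complexity(seq, alphabet_size=20):
    """Assumed definition: K = (1/L) * log_N( L! / prod(n_i!) ).

    lgamma(n + 1) equals ln(n!), which avoids huge factorials.
    """
    length = len(seq)
    ln_omega = lgamma(length + 1) - sum(
        lgamma(count + 1) for count in Counter(seq).values()
    )
    return ln_omega / (length * log(alphabet_size))


def max_complexity(length, alphabet_size=20):
    """Complexity of a sequence whose residues are spread as evenly as possible."""
    base, extra = divmod(length, alphabet_size)
    # `extra` residue types occur base + 1 times; the rest occur base times.
    ln_omega = lgamma(length + 1) - (
        extra * lgamma(base + 2) + (alphabet_size - extra) * lgamma(base + 1)
    )
    return ln_omega / (length * log(alphabet_size))


if __name__ == "__main__":
    print("bits in the number 100:", bits_to_specify(100))
    print("bits per amino acid:", bits_to_specify(20))
    print("bits needed, 250 aa vs 75e6 aa:", bits_to_beat_chance(250, 75_000_000))
    for segment in (MTG8_UNFILTERED, MTG8_FILTERED):
        n = len(segment)
        print(segment, "K =", complexity(segment),
              "min =", complexity("A" * n), "max =", max_complexity(n))
```

Under this definition the "least complex" sequence of any length is a homopolymer, which is why the minimum is computed from a run of a single residue. The maximum comes from distributing the residues over the 20 amino acid types as evenly as the length allows.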