Exercise 4
{ ... student name ??? ... Account bi141s?? ...}

 

{A. Searching with BLAST}

{1. Find sequences to use. Record the URLs and residue numbers for the sequences in your Notebook.}

{2. Peruse the BLAST site at NCBI.}

{3. Use the DNA sequence with BLASTN and the DNA nr database. Include representative parts of the BLASTN output in your Notebook.}

{4. Use the DNA sequence with BLASTX, the BLOSUM62 matrix, and the Protein nr database.}

{5. Use the protein sequence with BLASTP and the Protein nr database. Choose 100 descriptions and 100 alignments in the Format options.}

{Save the "One-liner" description and the Alignment for the WORST hit found ... this will be used in Part C below}

{6. Do BLASTP runs using the distance matrices PAM30, PAM70, BLOSUM45, and BLOSUM80.}

 

{B. BLASTP run, using FILTER and the MTG8 protein sequence}

{1. Retrieve the URL for the MTG8 protein sequence corresponding to Genbank accession number D14820}

{2. Do a BLASTP run using the BLOSUM62 matrix and no FILTER options.}

{3. Do a BLASTP run using the BLOSUM62 matrix and FILTER}

 

{C. Comparison to dynamic programming alignments for proteins}

{1. Use the saved matching protein sequence (the WORST hit) from the BLASTP protein sequence search in Part A5 above.}

{2. Align this protein sequence to your query using the GCG BESTFIT program.}

{3. Evaluate the significance of the alignment using the randomization option of BESTFIT.}
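The idea behind a randomization test can be sketched in a few lines of Python. This is only an illustration of the principle, not a reimplementation of BESTFIT: the scoring scheme (match +2, mismatch -1, linear gap -2) is an assumption made to keep the example short, whereas BESTFIT uses a substitution matrix and affine gap penalties. The function names `local_score` and `shuffle_significance` are hypothetical. The second sequence is shuffled many times, each shuffle is realigned to the first, and the real score is expressed in standard deviations above the mean shuffled score.

```python
import random
from statistics import mean, stdev


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gaps).

    The scoring values are illustrative assumptions, not BESTFIT defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def shuffle_significance(a, b, shuffles=100, seed=0):
    """Return (real score, mean shuffled score, SD, z) for a versus b.

    Sequence b is shuffled, which preserves its composition and length.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    real = local_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(local_score(a, "".join(letters)))
    avg = mean(shuffled_scores)
    sd = stdev(shuffled_scores)
    z = (real - avg) / sd if sd > 0 else float("inf")
    return real, avg, sd, z


if __name__ == "__main__":
    query = "MPDRTEKHSTMPDSPVDVKTQSRL"  # MTG8 fragment quoted in Part E
    real, avg, sd, z = shuffle_significance(query, query)
    print(f"real={real} mean_shuffled={avg:.2f} sd={sd:.2f} z={z:.1f}")
```

A real score many standard deviations above the shuffled mean suggests the alignment is unlikely to arise from composition alone; a score within a few standard deviations does not.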

 

{D. Psi-BLAST run}

{1. Use the Protein Sequence from above with First Iteration of Psi-BLAST and Protein nr database}

{2. Continue the PSI-BLAST run by doing the Second Iteration}

{3. Continue to do Psi-BLAST Iterations until Stabilization of Hits}
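One way to keep the per-iteration table for these steps is sketched below in Python. It assumes you have copied the hit identifiers from each iteration into a set by hand; the identifiers `hitA`, `hitB`, and `hitC` are made-up placeholders, and the function names are hypothetical. The run is treated as stabilized at the first iteration after the first that contributes no new hits.

```python
def track_new_hits(iterations):
    """Return (iteration number, total hits, new hits) for each iteration."""
    seen = set()
    table = []
    for number, hits in enumerate(iterations, start=1):
        hits = set(hits)
        new = hits - seen  # hits not reported by any earlier iteration
        table.append((number, len(hits), len(new)))
        seen |= hits
    return table


def first_stable_iteration(iterations):
    """Return the first iteration (after the first) with no new hits, or None."""
    for number, _total, new in track_new_hits(iterations):
        if number > 1 and new == 0:
            return number
    return None


if __name__ == "__main__":
    # Placeholder identifiers, not real database entries.
    runs = [
        {"hitA", "hitB"},
        {"hitA", "hitB", "hitC"},
        {"hitA", "hitB", "hitC"},
    ]
    for row in track_new_hits(runs):
        print(row)
    print("stable at iteration", first_stable_iteration(runs))
```

With the placeholder data the table rows are (1, 2, 2), (2, 3, 1), (3, 3, 0), and the run is reported as stable at iteration 3.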

 

{E. Questions}

1. Did you have any problems finding cognate protein and DNA sequences starting with the SwissProt Accession Numbers given in Part A1 of this Exercise? If so, what were they?
2. What are the primary functions of the separate BLAST format Web page?
3. What are the four primary elements of the BLAST output file?
4. Answer the questions given in Exercise 4, Parts A, B, C, and D, and put your answers in the appropriate place above.
5. In the BLAST programs, what are the meanings of the following terms: w, T, HSP, S ?
6. How are w and T determined? How does this determination differ between DNA and protein searches?
7. What is the EXPECT parameter? How is this related to S? How is this related to "sensitivity" and "specificity" issues?
8. Speed is a major reason for the development of BLAST 2.0 (Gapped-BLAST). Why is there a speed problem with BLAST 1.0?
9. BLAST 2.0 runs overall about 3-fold faster than BLAST 1.0. What factors, positive and negative, contribute to this overall 3-fold increase in speed?
10. What are the two major innovations in Gapped-BLAST as compared with BLAST?
11. What is meant by "filtering"? What types of sequences does one often wish to "filter"? What types of sequences can the BLAST server at NCBI filter? What is the basis of the SEG algorithm, and what types of sequences does it filter? What is the basis of the XNU algorithm, and what types of sequences does it filter?
12. What is a bit of information? How many bits of information are there in a given amino acid? What does this mean? Calculate how many bits of information there are in the number 100.
13. The PIR protein database now contains over 75 million amino acids. When searching this database with a protein sequence of length 250 aa, how many bits must an HSP score for it to be distinguished from chance? Why are higher HSP scores needed to distinguish an HSP from chance in a database search than in a comparison between two sequences?
14. In the table from your Psi-BLAST iterations, what happened to Total Hits, Number of New Hits in Description lines, and the E-value as the number of iterations increased? How do you know when to stop executing further iterations of Psi-BLAST?
15. What is a Profile?
16. What happens to the Profile during execution of a Psi-BLAST run? What does the resulting Profile describe?
17. Is this the best Profile that can be achieved? Why or why not?
18. How do your results, for both the BLASTP run and the complete Psi-BLAST run, compare with those in Table 3 of the Altschul et al. Psi-BLAST paper? How can you account for the differences?
19. For the MTG8 protein sequence, when you did your FILTER run you should have found that the first part of the sequence, a sequence like the following:

MPDRTEKHSTMPDSPVDVKTQSRL

was NOT filtered, whereas the adjacent sequence:

TPPTMPPPPTT

WAS filtered. For each of these two sequences, calculate the Complexity Kw. Also calculate the Complexity of the "most complex" and of the "least complex" protein sequences for each of these two sequence lengths. How does the Complexity of each of these MTG8 sequences compare with the corresponding "most complex" and "least complex" values?
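The arithmetic for Question 19 can be checked with a short Python sketch. It assumes the complexity is the Wootton and Federhen measure used by SEG, K = (1/L) log_N (L! / (n_1! n_2! ... n_N!)), with N = 20 amino acids, L the sequence length, and n_i the count of each residue; if your course notes define Kw differently, adjust the formula. The function names are hypothetical. Under this definition the "least complex" sequence of any length is a homopolymer (K = 0), and the "most complex" one spreads its residues as evenly as possible over the alphabet.

```python
from collections import Counter
from math import lgamma, log


def complexity(seq, alphabet_size=20):
    """K = (1/L) * log_N( L! / prod(n_i!) ), assumed Wootton-Federhen form."""
    length = len(seq)
    counts = Counter(seq)
    # lgamma(n + 1) equals ln(n!), which avoids very large factorials
    ln_omega = lgamma(length + 1) - sum(lgamma(n + 1) for n in counts.values())
    return ln_omega / (length * log(alphabet_size))


def max_complexity(length, alphabet_size=20):
    """Complexity of the most even composition possible at this length."""
    q, r = divmod(length, alphabet_size)
    # r residue types occur q + 1 times; the remaining types occur q times
    ln_omega = (lgamma(length + 1)
                - r * lgamma(q + 2)
                - (alphabet_size - r) * lgamma(q + 1))
    return ln_omega / (length * log(alphabet_size))


if __name__ == "__main__":
    for seq in ("MPDRTEKHSTMPDSPVDVKTQSRL", "TPPTMPPPPTT"):
        print(seq,
              f"K={complexity(seq):.3f}",
              f"max={max_complexity(len(seq)):.3f}",
              "min=0.000")
```

Working in logarithms of factorials keeps the calculation stable for longer windows, and dividing by L times ln(N) converts the result to the base-N, per-residue scale of the formula above.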