Key - Exercise 4 - BIMM 141 - Spring, 2001
{ ... student name ??? ... Account bi141s?? ...}

Max Score: 234 pts

Part A: 35 pts
Part B: 15 pts
Part C: 15 pts
Part D: 25 pts
Part E: 144 pts

 

 

{A. Searching with BLAST}

{1. Find sequences to use. Record the URLs and residue numbers for the sequences in your Notebook.}
Sequences [5 pts]

{2. Peruse the BLAST site at NCBI.}

{3. Use the DNA sequence with BLASTN and the DNA nr database. Include representative parts of the BLASTN output in your Notebook.}
[5 pts] notebook includes

{4. Use the DNA sequence with BLASTX and BLOSUM62 matrix and Protein nr database}
[5 pts] notebook includes

{5. Use the protein sequence with BLASTP and Protein nr database. Choose 100 descriptions and 100 alignments in the Format options}

{Save the "One-liner" description and the Alignment for the WORST hit found ... this will be used in Part C below}
[5 pts] notebook includes
[5 pts] questions

{6. Do BLASTP runs using the distance matrices PAM30, PAM70, BLOSUM45, and BLOSUM80.}
[10 pts] comparisons and questions

 

{B. BLASTP run, using FILTER and the MTG8 protein sequence}

{1. Retrieve the URL for the MTG8 protein sequence corresponding to Genbank accession number D14820}

{2. Do a BLASTP run using the BLOSUM62 matrix and no FILTER options.}
[5 pts] notebook includes
[5 pts] questions

{3. Do a BLASTP run using the BLOSUM62 matrix and FILTER}
[5 pts] comparisons and questions

 

{C. Comparison to dynamic programming alignments for proteins}

{1. Use the saved matching protein sequence from the BLAST protein sequence searches in Part A4 above}
[5 pts] notebook includes

{2. Align this protein sequence to your query using the GCG BESTFIT program.}
[5 pts] notebook includes and questions

{3. Evaluate the significance of the alignment using the randomization option of BESTFIT.}
[5 pts] notebook includes and questions

 

{D. Psi-BLAST run}

{1. Use the Protein Sequence from above with First Iteration of Psi-BLAST and Protein nr database}
[5 pts] notebook includes
[5 pts] questions

{2. Continue the PSI-BLAST run by doing the Second Iteration}
[5 pts] notebook includes and questions

{3. Continue to do Psi-BLAST Iterations until Stabilization of Hits}
[5 pts] notebook includes
[5 pts] questions

 

{E. Questions}

1. Did you have any problems finding cognate protein and DNA sequences starting with the SwissProt Accession Numbers given in Part A1 of this Exercise? If so, what were they? [5 pts]
Generally no problems ... on occasion, SwissProt lists too many GenBank AccNums, making it difficult to determine which is the "correct" one.
 
2. What are the primary functions of the separate BLAST format Web page? [5 pts]
Separate BLAST format Web pages now permit formatting according to the needs of the individual BLAST programs.
 
3. What are the four primary elements of the BLAST output file? [5 pts]
Query information and graphical score distribution.
One-liner description.
Alignment information.
Statistics and parameters.
 
4. Answer the questions given in Exercise 4, Parts A, B, C, and D, and put your answers in the appropriate place above. [pts given above]
ok
 
5. In the BLAST programs, what are the meanings of the following terms: w, T, HSP, S ? [8 pts]
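w: the length of the words (short subsequences) that BLAST uses to seed comparisons
T: the threshold score that a word must reach against a query word to be included in the neighborhood of the query
HSP: high-scoring segment pair - a pair of ungapped aligned segments, grown by extending a word hit, whose score cannot be improved by further extension or trimming; HSPs are BLAST's approximation to the maximal segment pairs (MSPs)
S: the cutoff score; segment pairs that score above the predetermined score S are reported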
6. How are w and T determined? How does this determination differ between DNA and protein searches? [10 pts]
T and w offset each other: at a given w, increasing T increases selectivity and decreases sensitivity. Appropriate values for w were decided empirically by the authors of the program. T is determined from the EXPECT value such that BLAST is unlikely to miss an MSP that would score above S (see below). DNA searches use proportionally larger settings than protein searches; because each protein residue carries more information than a base, smaller w and T settings still find many homologs.
 
7. What is the EXPECT parameter? How is this related to S? How is this related to "sensitivity" and "specificity" issues? [8 pts]

EXPECT: The statistical significance threshold for reporting matches against database sequences; the default value is 10, such that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Fractional values are acceptable. (BLAST help page)

S is calculated from EXPECT using the Karlin-Altschul equation. Lower EXPECT values are more selective but less sensitive. You may note that I found an EXPECT value of 100 more useful, in general, than the default value of 10.

"Sensitivity" is a measure of the attempt to report ALL True Positives. "Specificity" is a measure of the attempt to report ONLY True Positives. As one lowers the EXPECT, one reports fewer hits. This will improve the SPECIFICITY (report ONLY True Positives), but at a sacrifice of SENSITIVITY: more True Positives, particularly those that are distantly related, will be missed. The reverse is true for raising the EXPECT value.

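The relation between EXPECT and S can be sketched numerically with the Karlin-Altschul equation, E = K m n e^(-lambda S). The sketch below is only an illustration: the values of lambda and K are assumed, ballpark figures for ungapped BLOSUM62 scoring (a real BLAST run reports its own values in the statistics section), and the function names are hypothetical.

```python
import math

# Assumed, illustrative Karlin-Altschul parameters (not taken from this key).
LAMBDA = 0.318
K = 0.134

def expect_from_score(score, m, n, lam=LAMBDA, k=K):
    """Expected number of chance segment pairs with raw score >= score."""
    return k * m * n * math.exp(-lam * score)

def score_from_expect(expect, m, n, lam=LAMBDA, k=K):
    """Raw cutoff score S at which `expect` chance matches are expected."""
    return math.log(k * m * n / expect) / lam

m = 250            # query length
n = 75_000_000     # database length in residues
for expect in (0.1, 10, 100):
    cutoff = score_from_expect(expect, m, n)
    print(f"EXPECT = {expect:>5}: report segment pairs scoring >= {cutoff:.1f}")
```

Lowering EXPECT raises the cutoff S, so fewer chance hits are reported (better specificity) while weak but real hits may fall below the cutoff (worse sensitivity).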
8. Speed is a major reason for the development of BLAST 2.0 (Gapped-BLAST). Why is there a speed problem with BLAST 1.0? [4 pts]
BLAST 1.0 spends nearly all of its time in extending word pairs to find HSPs. In Gapped-BLAST, although T is lowered providing more word pairs, only word pairs where at least two are found on the same diagonal are extended (2-hit method). This provides for a great savings of time with Gapped-BLAST.
 
9. BLAST 2.0 runs overall about 3-fold faster than BLAST. What factors, positive and negative, contribute to this overall 3-fold increase in speed? [9 pts]
From the Gapped-BLAST NAR paper: first, the '2-hit method' is overall approximately twice as fast as the original 'one-hit method' for extending word pairs to find HSPs. Second, using the Dynamic Programming step to find gapped alignments permits raising the T value for finding word pairs, since not all ungapped alignments need to be found in the earlier BLAST steps (the Dynamic Programming step will find the important ones). Third, to improve the speed of the Dynamic Programming step, gapped extension begins from a Seed Pair and proceeds in both directions, stopping when the alignment score falls below the best alignment score yet found by more than XG units. This permits searching the alignment space outside the heretofore prescribed diagonals, as used in early versions of FASTA, yet increases speed. The final result of these second and third steps is to reduce the time for the gapped extension stage by about two-fold, resulting in an overall speed-up with Gapped-BLAST by about a factor of 3.
 
10. What are the two major innovations in Gapped-BLAST as compared with BLAST? [8 pts]
1. The Two-Hit Method: requires two word pairs on the same diagonal before the ungapped extension is done, to search for HSPs.
2. The Dynamic Programming gapped extension step.
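
The two-hit rule can be illustrated with a small sketch. It assumes word hits are given as (query position, subject position) pairs, that a hit's diagonal is the difference of the two positions, and that a second hit must fall within a window of 40 residues of the previous non-overlapping hit on that diagonal. The function name and the toy hit list are hypothetical, and this is not the NCBI implementation.

```python
def two_hit_seeds(word_hits, window=40, word_size=3):
    """Return the word hits that would trigger an ungapped extension.

    A hit triggers extension only if an earlier, non-overlapping hit
    lies on the same diagonal no more than `window` residues before it.
    """
    last_on_diagonal = {}   # diagonal -> subject position of most recent hit
    seeds = []
    for q, s in sorted(word_hits, key=lambda hit: hit[1]):
        diagonal = s - q
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = s - previous
            if distance < word_size:
                continue    # overlaps the previous hit: ignore it
            if distance <= window:
                seeds.append((q, s))
        last_on_diagonal[diagonal] = s
    return seeds

# Toy example: three hits on diagonal 10, one overlapping hit, one isolated hit.
hits = [(0, 10), (1, 11), (20, 30), (50, 60), (5, 100)]
print(two_hit_seeds(hits))
```

Isolated hits such as (5, 100) are never extended, which is where the time is saved relative to extending every word pair.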
 
11. What is meant by "filtering"? What types of sequences does one often wish to "filter"? What types of sequences can the BLAST server at NCBI filter? What is the basis of the SEG algorithm, and what types of sequences does it filter? What is the basis of the XNU algorithm, and what types of sequences does it filter? [16 pts]

Filtering masks segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993), or segments consisting of short-periodicity internal repeats, as determined by the XNU program of Claverie & States (Computers and Chemistry, 1993), or for BLASTN, by the DUST program of Tatusov and Lipman (in preparation). Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. (BLAST help page)
In addition to low complexity and repetitive sequences, one might also want to filter longer repeated elements such as SINES, LINES, ALU repeats, etc. from DNA sequences.

SEG: removes low-complexity sequences (e.g. AAAAAAA), as measured by the complexity K = (log W) / L. SEG is insensitive to the order of the residues or bases, so it mainly removes sequences with very biased composition.
XNU: removes short-periodicity internal repeats which would appear as dense patches along the main diagonal of a dotplot (e.g. GTCGTCGTC)

12. What is a bit of information? How many bits of information are there in a given amino acid? What does this mean? Calculate how many bits of information there are in the number 100. [5 pts]
A bit of information is the amount of information needed to distinguish between two possibilities. There are log2 20 = 4.3 bits of information in an amino acid residue (ignoring the fact that the residues have differing frequencies). This means that it would require, on average, 4.3 yes/no questions to guess which amino acid residue was at a sequence position. The number 100 has log2 100 = 6.6 bits of information.
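
These values can be checked with a few lines of Python. The second function is an addition for illustration only: it shows how the answer changes when the residues are not equally frequent, using a made-up skewed distribution rather than real amino acid frequencies.

```python
import math

def bits_for_equal_outcomes(n_outcomes):
    """Bits needed to pick one of n equally likely possibilities."""
    return math.log2(n_outcomes)

def shannon_entropy(frequencies):
    """Average bits per symbol when the symbols have unequal frequencies."""
    return -sum(p * math.log2(p) for p in frequencies if p > 0)

print(f"{bits_for_equal_outcomes(20):.1f}")    # 4.3 bits per amino acid
print(f"{bits_for_equal_outcomes(100):.1f}")   # 6.6 bits for the number 100

uniform = [1 / 20] * 20
skewed = [0.5] + [0.5 / 19] * 19   # hypothetical: one residue makes up half
print(shannon_entropy(skewed) < shannon_entropy(uniform))
```

Unequal frequencies always lower the information per residue below the log2 20 maximum.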
 
13. The PIR protein database now contains over 75 million amino acids. To search this database with a protein sequence of length 250 aa, a score of how many bits is needed to distinguish an HSP with this Score from chance? Why are higher HSP scores needed in a search of a database than in a comparison between two sequences to distinguish the HSP from chance? [10 pts]
log2(75 million x 250) = 34.1 bits are needed for the database search, whereas only
log2(250 x 250) = 15.9 bits are needed when comparing two sequences. The search space (query length x target length) is far larger for the database, so more matches are expected by chance; a score more than twice as high (34.1/15.9 = 2.1) is therefore needed to distinguish the HSP in the database search at the same level of significance.
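
The search-space arithmetic can be sketched as follows, using the relation E = m n 2^(-S') between a normalized bit score S' and the expected number of chance hits. The function names are hypothetical.

```python
import math

def bits_to_beat_chance(query_len, target_len):
    """log2 of the search space: the bit score at which about one
    chance hit is expected."""
    return math.log2(query_len * target_len)

def expected_chance_hits(bit_score, query_len, target_len):
    """E = m * n * 2**(-S') for a normalized bit score S'."""
    return query_len * target_len * 2.0 ** (-bit_score)

database = bits_to_beat_chance(250, 75_000_000)
pairwise = bits_to_beat_chance(250, 250)
print(f"database search:     {database:.1f} bits")
print(f"pairwise comparison: {pairwise:.1f} bits")
print(f"ratio:               {database / pairwise:.2f}")
```

Every doubling of the database size adds one bit to the score needed for the same significance.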
14. In the table from your Psi-BLAST iterations, what happened to Total Hits, Number of New Hits in Description lines, and the E-value as the number of iterations increased? How do you know when to stop executing further iterations of Psi-BLAST? [8 pts]
The total hits vacillated and then stabilized at a number.  The number of new hits decreased rapidly with more iterations.  The E-values grouped closer together and were more similar to each other.  You stop when the hits converge and don't change (drop or add).
15. What is a Profile? [5 pts]
A profile is a summary of the conserved residues in a multiple sequence alignment.  Each position in the alignment is evaluated for the frequency of residues, and a position-based scoring matrix can be constructed.
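
As a rough illustration of how per-position frequencies become position-specific scores, here is a minimal sketch. It is deliberately simplified: it assumes a uniform background frequency of 1/20 and a crude fixed pseudocount, whereas PSI-BLAST itself uses sequence weighting and data-dependent pseudocounts. The toy alignment and function names are hypothetical.

```python
import math
from collections import Counter

def column_frequencies(alignment):
    """Residue frequencies for each column of equal-length aligned
    sequences; gap characters ('-') are skipped."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile

def position_score(profile, position, residue, background=1 / 20, pseudocount=0.01):
    """Log-odds score (in bits) of seeing `residue` at `position`."""
    freq = profile[position].get(residue, 0.0)
    return math.log2((freq + pseudocount) / background)

toy_alignment = ["MKVLA", "MRVLA", "MKILS", "MKV-A"]
profile = column_frequencies(toy_alignment)
print(profile[1])                        # K in 3 of 4 sequences, R in 1 of 4
print(position_score(profile, 0, "M"))   # conserved residue: positive score
print(position_score(profile, 0, "K"))   # never observed here: negative score
```

Unlike a fixed matrix such as BLOSUM62, the same residue can score differently at different positions.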
 
16. What happens to the Profile during execution of a Psi-BLAST run? What does the resulting Profile describe? [8 pts]
The profile built from the hits of one iteration becomes the scoring matrix for the next iteration of Psi-BLAST, and it is rebuilt after each iteration as hits are added.  The resulting Profile describes the family of proteins which were found to be homologues of each other.
 
17. Is this the best Profile that can be achieved? Why or why not? [5 pts]
This is not the best profile possible due to length considerations and position-specific gap penalties.
 
18. How do your results, for both the BLASTP run and the Psi-BLAST complete run, compare with those of the Altschul et al Psi-BLAST paper, Table 3? How can you account for the differences? [5 pts]
The difference in numbers is due to the increase in the databases being searched.  The more sequences, the greater the probability of getting a hit.
 
19. For the MTG8 protein sequence, when you did your FILTER run you should have found that the first part of the sequence, a sequence like the following:

MPDRTEKHSTMPDSPVDVKTQSRL

was NOT filtered, whereas the adjacent sequence:

TPPTMPPPPTT

WAS filtered. For each of these two total sequences, calculate the Complexity Kw. Calculate also the Complexity of the "most complex" and of the "least complex" protein sequences for each of these two sequence lengths. How does the Complexity of each of these MTG8 sequences compare with the comparable "most" and "least" complex Complexity values? [20 pts]

Complexity K1 = (log_20 W) / L
L = window size (here, the length of the sequence)
W = L! / (n_1! n_2! ... n_20!), where n_i is the number of times residue i occurs in the window (0! = 1)

MPDRTEKHSTMPDSPVDVKTQSRL: count vector, ignoring 0, is (3,3,3,3,2,2,2,2,1,1,1,1)
W = 24! / (3! 3! 3! 3! 2! 2! 2! 2! 1! 1! 1! 1!) = 2.99 x 10^19
K1 = (log_20 W) / L = (log_20 2.99 x 10^19) / 24 = 0.624

TPPTMPPPPTT: count vector is (6,4,1)
W = 11! / (6! 4! 1!) = 2310
K1 = (log_20 W) / L = (log_20 2310) / 11 = 0.235

The complexity of the least complex word is always zero, since a word made of a single repeated residue has W = 1. The complexity of the most complex word depends on its length. Note that for a length of 24, the most complex protein sequence would have a count vector (2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1), i.e. four residues used twice and the other sixteen used once; for a length of 11, all eleven residues would be different. For a length of 24, the most complex word would therefore have K1 = 0.723, and for a length of 11, K1 = 0.531. The first sequence is at about 86% of the maximum complexity and the second is at about 44% of the maximum complexity.
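
The calculation can be reproduced with a short script. This is a sketch under the definitions used in this answer (K1 = (log_20 W) / L, with W the multinomial coefficient of the count vector); the function names are hypothetical and this is not the SEG program itself.

```python
import math
from collections import Counter

def complexity(seq, alphabet_size=20):
    """K1 = log_N(W) / L, with W = L! / (n_1! n_2! ... n_N!)."""
    length = len(seq)
    w = math.factorial(length) // math.prod(
        math.factorial(n) for n in Counter(seq).values()
    )
    return math.log(w, alphabet_size) / length

def max_complexity(length, alphabet_size=20):
    """K1 of a word whose residues are spread as evenly as possible."""
    base, extra = divmod(length, alphabet_size)
    # `extra` residue types occur base+1 times, the rest occur base times.
    w = math.factorial(length) // (
        math.factorial(base + 1) ** extra
        * math.factorial(base) ** (alphabet_size - extra)
    )
    return math.log(w, alphabet_size) / length

for seq in ("MPDRTEKHSTMPDSPVDVKTQSRL", "TPPTMPPPPTT"):
    k1 = complexity(seq)
    k1_max = max_complexity(len(seq))
    print(f"{seq}: K1 = {k1:.3f}, max = {k1_max:.3f}, ratio = {k1 / k1_max:.0%}")
```

Exact integer arithmetic is used for W, so the factorials do not overflow or lose precision before the logarithm is taken.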