Key - Exercise 4 - BIMM 141 - Spring, 2001
{ ... student name ??? ... Account bi141s?? ...}
Max Score: 234 pts
Part A: 35 pts
Part B: 15 pts
Part C: 15 pts
Part D: 25 pts
Part E: 144 pts
{A. Searching with BLAST}
{1. Find sequences to use. Record the URLs and residue
numbers for the sequences in your Notebook.}
Sequences [5 pts]
{2. Peruse the BLAST site at NCBI.}
{3. Use the DNA sequence with BLASTN and the DNA nr database.
Include representative parts of the BLASTN output in your Notebook.}
[5 pts] notebook includes
{4. Use the DNA sequence with BLASTX and BLOSUM62 matrix
and Protein nr database}
[5 pts] notebook includes
{5. Use the protein sequence with BLASTP and Protein
nr database. Choose 100 descriptions and 100 alignments in the
Format options}
{Save the "One-liner" description and the Alignment
for the WORST hit found ... this will be used in Part C below}
[5 pts] notebook includes
[5 pts] questions
{6. Do BLASTP runs using the distance matrices PAM30,
PAM70, BLOSUM45, and BLOSUM80.}
[10 pts] comparisons and questions
{B. BLASTP run, using FILTER and the MTG8 protein sequence}
{1. Retrieve the URL for the MTG8 protein sequence corresponding
to Genbank accession number D14820}
{2. Do a BLASTP run using the BLOSUM62 matrix and no
FILTER options.}
[5 pts] notebook includes
[5 pts] questions
{3. Do a BLASTP run using the BLOSUM62 matrix and FILTER}
[5 pts] comparisons and questions
{C. Comparison to dynamic programming alignments for
proteins}
{1. Use the saved matching protein sequence from the
BLAST protein sequence searches in Part A4 above}
[5 pts] notebook includes
{2. Align this protein sequence to your query using the
GCG BESTFIT program.}
[5 pts] notebook includes and questions
{3. Evaluate the significance of the alignment using
the randomization option of BESTFIT.}
[5 pts] notebook includes and questions
{D. Psi-BLAST run}
{1. Use the Protein Sequence from above with First Iteration
of Psi-BLAST and Protein nr database}
[5 pts] notebook includes
[5 pts] questions
{2. Continue the PSI-BLAST run by doing the Second Iteration}
[5 pts] notebook includes and questions
{3. Continue to do Psi-BLAST Iterations until Stabilization
of Hits}
[5 pts] notebook includes
[5 pts] questions
{E. Questions}
- 1. Did you have any problems finding
cognate protein and DNA sequences starting with the SwissProt
Accession Numbers given in Part A1 of this Exercise? If so, what
were they? [5 pts]
- Generally no problems ... on occasion,
SwissProt lists too many GenBank AccNums, making it difficult
to determine which is the "correct" one.
-
- 2. What are the primary functions of
the separate BLAST format Web page? [5
pts]
- Separate BLAST format Web pages now
permit formatting according to the needs of the individual BLAST
programs.
-
- 3. What are the four primary elements
of the BLAST output file? [5
pts]
- Query information and graphical score
distribution
- One-liner description.
- Alignment information.
- Statistics and parameters.
-
- 4. Answer the questions given in Exercise
4, Parts A, B, C, and D, and put your answers in the appropriate
place above. [pts given above]
- ok
-
- 5. In the BLAST programs, what are
the meanings of the following terms: w, T, HSP, S ? [8 pts]
- w: the length of the word BLAST used
to make comparisons
T: the threshold above which the word must score to be in the
neighborhood of the query
HSP: high-scoring segment pair - a locally maximal segment pair obtained
by extending an initial matching word pair; the best-scoring one is the MSP
S: MSPs which score above some predetermined score S are reported
- 6. How are w and T determined? How
does this determination differ between DNA and protein searches?
[10 pts]
- T and w offset each other; at a given
w, increasing T increases selectivity and decreases sensitivity.
Appropriate values for w have been decided empirically by the
authors of the program. T is determined from the EXPECT value
such that BLAST is unlikely to miss an MSP that would score above
S (see below). DNA searches use proportionally larger settings than
protein searches; because protein sequences provide more information
per residue, smaller w and T settings still find many homologs.
-
- 7. What is the EXPECT parameter? How
is this related to S? How is this related to "sensitivity"
and "specificity" issues? [8
pts]
EXPECT: The statistical significance
threshold for reporting matches against database sequences; the
default value is 10, such that 10 matches are expected to be
found merely by chance, according to the stochastic model of
Karlin and Altschul (1990). If the statistical significance
ascribed to a match is greater than the EXPECT threshold, the
match will not be reported. Lower EXPECT thresholds are more
stringent, leading to fewer chance matches being reported. Fractional
values are acceptable. (BLAST help page)
S is calculated from EXPECT using the
Karlin-Altschul equation. Lower EXPECT values are more selective
but less sensitive. You may note that I found an EXPECT value
of 100 more useful, in general, than the default value of 10.
"Sensitivity" is a measure
of the attempt to report ALL True Positives. "Specificity"
is a measure of the attempt to report ONLY True Positives. As
one lowers the EXPECT, one reports fewer hits. This will improve
the SPECIFICITY (report ONLY True Positives), but at a sacrifice
of SENSITIVITY. More True Positives, particularly those that
are distantly related, will be missed. The reverse is true for
raising the EXPECT value.
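The relation between EXPECT and S can be sketched numerically. The following is a minimal sketch, assuming the usual form of the Karlin-Altschul equation, E = K m n exp(-lambda S). The K and lambda values below are illustrative assumptions (roughly the ungapped BLOSUM62 parameters), not values taken from this exercise, and the function names are hypothetical.

```python
import math

def expected_hits(score, m, n, K, lam):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)

def score_cutoff(expect, m, n, K, lam):
    """Invert the equation: the raw score S at which E equals `expect`."""
    return math.log(K * m * n / expect) / lam

# Assumed, illustrative parameters (not from the exercise).
K, LAM = 0.134, 0.318
m, n = 250, 75_000_000  # query length and database size in residues

for expect in (100, 10, 0.01):
    s = score_cutoff(expect, m, n, K, LAM)
    print(f"EXPECT = {expect:>6}: report segment pairs scoring above S = {s:.1f}")
```

Lowering EXPECT raises the cutoff S, so fewer (and more reliable) hits are reported; raising it does the opposite.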
- 8. Speed is a major reason for the
development of BLAST 2.0 (Gapped-BLAST). Why is there a speed
problem with BLAST 1.0? [4 pts]
- BLAST 1.0 spends nearly all of its
time in extending word pairs to find HSPs. In Gapped-BLAST,
although T is lowered providing more word pairs, only word pairs
where at least two are found on the same diagonal are extended
(2-hit method). This provides for a great savings of time with
Gapped-BLAST.
-
- 9. BLAST 2.0 runs overall about 3-fold
faster than BLAST. What factors, positive and negative, contribute
to this overall 3-fold increase in speed? [9 pts]
- From the Gapped-BLAST NAR paper, first,
the '2-hit method' is overall approximately twice as fast as
the original 'one-hit method' for extension of word pairs to
find HSPs. Second, use of the Dynamic Programming step to find
gapped alignments permits raising the T value for finding of
word pairs, since not all ungapped alignments need to be found
in the earlier BLAST steps (the Dynamic Programming step will
find the important ones). Third, to improve the speed of the
Dynamic Programming step, gapped extension begins from a Seed
Pair, and proceeds in both directions, stopping when the alignment
score falls below the best alignment score yet found by more
than XG units. This permits searching the alignment space outside
the heretofore prescribed diagonals, as used in early versions
of FASTA, yet increases speed. The final result of these second
and third steps is to reduce the time for the gapped extension
stage by about two-fold, resulting in an overall speed-up with
Gapped-BLAST by about a factor of 3.
-
- 10. What are the two major innovations
in Gapped-BLAST as compared with BLAST? [8
pts]
- 1. The Two-Hit Method: requires two
word pairs on the same diagonals before the ungapped extension
is done, to search for HSPs.
2. The Dynamic Programming gapped extension step.
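The two-hit rule can be illustrated with a short sketch. This is a simplified illustration, not the NCBI implementation: it assumes word hits are given as (query position, subject position) pairs, that a hit overlapping the previous hit on its diagonal is ignored, and that the window of 40 and word size of 3 are reasonable protein defaults. The function name is hypothetical.

```python
def two_hit_seeds(hits, word_size=3, window=40):
    """Return the word hits that would trigger an ungapped extension.

    A hit triggers extension only if an earlier, non-overlapping hit lies
    on the same diagonal (subject_pos - query_pos) within `window` residues.
    """
    last_on_diag = {}  # diagonal -> query position of the most recent hit
    seeds = []
    for q, s in sorted(hits):
        diag = s - q
        prev = last_on_diag.get(diag)
        if prev is not None:
            gap = q - prev
            if gap < word_size:
                continue  # overlaps the previous hit: ignore it
            if gap <= window:
                seeds.append((q, s))
        last_on_diag[diag] = q
    return seeds

# Made-up hits: three on diagonal 10 close together, one far away, one on another diagonal.
hits = [(0, 10), (1, 11), (5, 15), (3, 50), (100, 110)]
print(two_hit_seeds(hits))  # [(5, 15)]
```

Only one of the five hits is extended, which is where the time saving comes from: most isolated word hits are never extended at all.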
-
- 11. What is meant by "filtering"?
What types of sequences does one often wish to "filter"?
What types of sequences can the BLAST server at NCBI filter?
What is the basis of the SEG algorithm, and what types of sequences
does it filter? What is the basis of the XNU algorithm, and what
types of sequences does it filter? [16
pts]
Filtering masks segments of the query
sequence that have low compositional complexity, as determined
by the SEG program of Wootton & Federhen (Computers and Chemistry,
1993), or segments consisting of short-periodicity internal repeats,
as determined by the XNU program of Claverie & States (Computers
and Chemistry, 1993), or for BLASTN, by the DUST program of Tatusov
and Lipman (in preparation). Filtering can eliminate statistically
significant but biologically uninteresting reports from the BLAST
output (e.g., hits against common acidic-, basic- or proline-rich
regions), leaving the more biologically interesting regions of
the query sequence available for specific matching against database
sequences. (BLAST help page)
In addition to low complexity and repetitive sequences, one might
also want to filter longer repeated elements such as SINES, LINES,
ALU repeats, etc. from DNA sequences.
SEG: removes low complexity sequences
(determined by K = log W / L) (e.g. AAAAAAA). SEG is insensitive to
the order of the residues or bases, so it mainly removes sequences
with very biased composition.
XNU: removes short-periodicity internal repeats which would appear
as dense patches along the main diagonal of a dotplot (e.g. GTCGTCGTC)
- 12. What is a bit of information? How
many bits of information are there in a given amino acid? What
does this mean? Calculate how many bits of information there
are in the number 100. [5 pts]
- A bit of information is the amount
of information needed to distinguish between two possibilities.
log2 20 = 4.3 bits of information in an amino acid residue (ignoring
the fact that they have differing frequencies). This means that
it would require, on average, 4.3 yes/no questions to guess
which amino acid residue was at a sequence position. The number
100 has log2 100 = 6.6 bits of information.
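The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the entropy function and the skewed composition are illustrative assumptions added to show what "ignoring differing frequencies" leaves out.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# 20 equally likely amino acids: log2(20) bits per residue.
uniform = [1 / 20] * 20
print(round(entropy_bits(uniform), 2))  # 4.32

# Distinguishing one of 100 equally likely values.
print(round(math.log2(100), 2))  # 6.64

# Made-up skewed composition: one residue takes half the probability.
skewed = [0.5] + [0.5 / 19] * 19
print(entropy_bits(skewed) < entropy_bits(uniform))  # True
```

With unequal frequencies the information per residue drops below 4.3 bits, which is why the uniform figure is an upper bound.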
-
- 13. The PIR protein database now contains
over 75 million amino acids. To search this database with a protein
sequence of length 250 aa, a score of how many bits is needed
to distinguish an HSP with this Score from chance? Why are higher
HSP scores needed in a search of a database than in a comparison
between two sequences to distinguish the HSP from chance? [10 pts]
- log2(75 million x 250) = 34.1 bits
are needed for the database search, whereas only
log2(250 x 250) = 15.9 bits are needed when comparing 2 sequences.
A score more than twice as high (34.1/15.9) is therefore needed
to distinguish the sequence in the database search at the same level
of significance.
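The search-space calculation can be reproduced as follows. This is a minimal sketch using the sizes given in the question; the function name is hypothetical.

```python
import math

def bits_needed(query_len, target_len):
    """Bits needed to distinguish an HSP from chance: log2 of the search space."""
    return math.log2(query_len * target_len)

database = bits_needed(250, 75_000_000)  # query vs. whole database
pairwise = bits_needed(250, 250)         # query vs. one sequence of equal length

print(f"database search: {database:.1f} bits")    # 34.1
print(f"pairwise:        {pairwise:.1f} bits")    # 15.9
print(f"ratio:           {database / pairwise:.2f}")  # 2.14
```

The larger the search space, the more chances for a high score to arise by chance, so the required score grows with the logarithm of the database size.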
- 14. In the table from your Psi-BLAST
iterations, what happened to Total Hits, Number of New Hits in
Description lines, and the E-value as the number of iterations
increased? How do you know when to stop executing further iterations
of Psi-BLAST? [8 pts]
- The total hits vacillated and then
stabilized at a number. The number of new hits decreased
rapidly with more iterations. The E-values grouped closer
together and were more similar to each other. You stop
when the hits converge and don't change (drop or add).
- 15. What is a Profile? [5 pts]
- A profile summarizes the residues conserved
at each position of a multiple sequence alignment. Each position
in the alignment is evaluated for the frequency of residues, and
a position-based scoring matrix can be constructed.
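How column frequencies become a position-based scoring matrix can be sketched as below. This is a deliberately simplified illustration, not the Psi-BLAST procedure: it assumes a uniform background frequency of 0.05, a flat pseudocount of 1, and no sequence weighting, and the toy alignment is made up.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def profile_from_alignment(aligned_seqs, background=0.05, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score in bits."""
    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(res for res in column if res in ALPHABET)  # gaps are skipped
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile

toy_alignment = ["MKVL", "MRVL", "MKIL", "MKVI"]  # made-up example
pssm = profile_from_alignment(toy_alignment)

# The fully conserved M in column 1 scores well above an unobserved residue.
print(round(pssm[0]["M"], 2), round(pssm[0]["A"], 2))
```

Each column gets its own scores, so a conserved position rewards the conserved residue more strongly than a general matrix such as BLOSUM62 would.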
-
- 16. What happens to the Profile during
execution of a Psi-BLAST run? What does the resulting Profile
describe? [8 pts]
- The profile becomes the scoring matrix
for further iterations of Psi-BLAST. The resulting Profile
describes the family of proteins which were found to be
homologues of each other.
-
- 17. Is this the best Profile that can
be achieved? Why or why not? [5
pts]
- This is not the best profile possible
due to length considerations and position-specific gap penalties.
-
- 18. How do your results, for both the
BLASTP run and the Psi-BLAST complete run, compare with those
of the Altschul et al Psi-BLAST paper, Table 3? How can you account
for the differences? [5 pts]
- The difference in numbers is due to
the increase in the databases being searched. The more
sequences, the greater the probability of getting a hit.
-
- 19. For the MTG8 protein sequence,
when you did your FILTER run you should have found that the first
part of the sequence, a sequence like the following:
MPDRTEKHSTMPDSPVDVKTQSRL
was NOT filtered, whereas the adjacent
sequence:
TPPTMPPPPTT
WAS filtered. For each of these two
total sequences, calculate the Complexity Kw. Calculate also
the Complexity of the "most complex" and of the "least
complex" protein sequences for each of these two sequence
lengths. How does the Complexity of each of these MTG8 sequences
compare with the comparable "most" and "least"
complex Complexity values? [20
pts]
Complexity K1 = (log W) / L, with logarithms taken to base 20
L = window size (here, the length of the sequence)
W = L! / (n1! n2! ... nk!), where the ni are the residue counts
MPDRTEKHSTMPDSPVDVKTQSRL: count vector,
ignoring 0, is (3,3,3,3,2,2,2,2,1,1,1,1)
W = 24! / (3! 3! 3! 3! 2! 2! 2! 2! 1! 1! 1! 1!) = 2.99 x 10^19
K1 = (log20 W) / L = (log20 2.99 x 10^19) / 24 = 0.624
TPPTMPPPPTT: count vector is (6,4,1)
W = 11! / (6! 4! 1!) = 2310
K1 = (log20 W) / L = (log20 2310) / 11 = 0.235
The complexity of the least complex
word is always zero since W = 1. The complexity of the most
complex word depends on its length. For a length of
24, the most complex protein sequence would have the count vector
(2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1): four residue types occur
twice and the other sixteen occur once, giving K1 = 0.723. For a length
of 11, every residue is different and K1 = 0.531. The first sequence
is about 86% of the maximum complexity and the second is about 44%
when compared to the maximum possible values.
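The complexity values can be recomputed with a short script. This is a minimal sketch of the K1 formula as defined in this answer (base-20 logarithm of the multinomial W, divided by L); the function names are hypothetical and this is not the SEG program itself.

```python
import math
from collections import Counter

def complexity(seq, alphabet_size=20):
    """K1 = log_N(W) / L, where W = L! / (n1! n2! ... nk!)."""
    L = len(seq)
    W = math.factorial(L)
    for n in Counter(seq).values():
        W //= math.factorial(n)  # exact: each partial quotient is an integer
    return math.log(W, alphabet_size) / L

def max_complexity(L, alphabet_size=20):
    """K1 of the most even composition possible for a window of length L."""
    base, extra = divmod(L, alphabet_size)
    # `extra` residue types occur base+1 times, the rest occur `base` times.
    denom = math.factorial(base + 1) ** extra * math.factorial(base) ** (alphabet_size - extra)
    return math.log(math.factorial(L) // denom, alphabet_size) / L

for seq in ("MPDRTEKHSTMPDSPVDVKTQSRL", "TPPTMPPPPTT"):
    k, kmax = complexity(seq), max_complexity(len(seq))
    print(f"{seq}: K1 = {k:.3f}, max = {kmax:.3f}, ratio = {k / kmax:.0%}")
```

A homopolymer such as `"AAAA"` gives W = 1 and therefore K1 = 0, the minimum for any length.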