Pharm 207/Bio 207

Using Internet Resources in Molecular Biology - Lecture 5

Protein Secondary Structure Prediction

 

Lecturer: Phil Bourne
Last Update 10/02/01

Lecture Outline

Protein secondary structure prediction is the identification of regions of secondary structure in a protein, at the level of alpha helix, beta sheet, and random coil, from information present in the primary sequence. No comparable methodology exists for distinguishing subtler classes of secondary structure, for example alpha helix versus 3₁₀ helix. This lecture briefly reviews the history of the available methods and their reported success rates. Most of the time will be spent reviewing sites where protein secondary structure can be calculated, and then assessing the success rate of these sites using your pet protein.


| TOC & Lecture Outline | Introduction | Reading Materials | Lecture Goals and Assignment | Examples |


Introduction

For those needing a refresher in the secondary structure of proteins please refer to Section 8 of the Principles of Protein Structure Course.

One way of introducing protein secondary structure prediction is to look at it chronologically. The following introduction follows the paper by Eisenhaber, Persson and Argos (1995), which is required reading for this lecture.

1974 Chou and Fasman propose a statistical method based on the propensity of each amino acid to adopt a given secondary structure, derived from the observed locations of the amino acids in 15 protein structures determined by X-ray diffraction. Clearly these statistics derive from the particular stereochemical and physicochemical properties of the amino acids; see, for example, glycine and proline. The statistics have been refined over the years by a number of authors (including Chou and Fasman themselves) using larger sets of proteins. Rather than analyzing each position in isolation, the propensity at a position is calculated as an average over the 5 or 6 residues surrounding it. On a larger set of 62 proteins the base method reports a success rate of 50%.
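
The window averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the published Chou-Fasman procedure: the function name `windowed_propensity`, the use of a symmetric odd-length window, and the truncation of the window at the chain ends are assumptions of the sketch, and the two-residue table in the usage lines holds placeholder numbers rather than published parameters.

```python
def windowed_propensity(sequence, propensity, window=5):
    """Average a per-residue propensity over a window centred on each position.

    `propensity` maps one-letter amino acid codes to a number. Near the chain
    ends the window is truncated (an assumption of this sketch).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    scores = []
    for i in range(len(sequence)):
        start = max(0, i - half)
        end = min(len(sequence), i + half + 1)
        values = [propensity[aa] for aa in sequence[start:end]]
        scores.append(sum(values) / len(values))
    return scores


# Placeholder table for illustration only; these are not published parameters.
toy_helix_propensity = {"A": 1.4, "G": 0.6}
toy_scores = windowed_propensity("AAGAA", toy_helix_propensity, window=3)
```

In a scheme of this kind, positions whose averaged score is high for one structure class would be candidates for that class; the choice of threshold is left open here.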

1978 Garnier improved the method by using statistically significant pair-wise interactions between residues as a determinant of the prediction. This improved the success rate to 62%.

1993 Levin improved the prediction level by using multiple sequence alignments. The reasoning is as follows. Conserved regions in a multiple sequence alignment provide a strong evolutionary indicator of a role in the function of the protein. Those regions are also likely to have conserved structure, including secondary structure, and so strengthen the prediction through their joint propensities. This improved the success rate to 69%.
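
One simple way to picture "joint propensities" is to pool a per-residue propensity over each column of an alignment. The sketch below is an illustration of that idea only and is not Levin's method: the function name `column_propensity`, the gap character, and the toy table and alignment are all assumptions made for the example.

```python
def column_propensity(alignment, propensity, gap="-"):
    """Mean propensity per alignment column, ignoring gap characters.

    `alignment` is a list of equal-length aligned sequences.
    A column that contains only gaps yields None.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must all have the same length")
    means = []
    for col in range(length):
        values = [propensity[row[col]] for row in alignment if row[col] != gap]
        means.append(sum(values) / len(values) if values else None)
    return means


# Placeholder values and a made-up two-sequence alignment, for illustration.
toy_propensity = {"A": 1.4, "G": 0.6}
toy_columns = column_propensity(["AG-A", "AAGA"], toy_propensity)
```

Columns in which every aligned sequence favours the same structure class keep a high pooled score, while columns with mixed residues are pulled toward the middle.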

1994 Rost and Sander combined neural networks with multiple sequence alignments. The idea of a neural net is to create a network of interconnected nodes, where the signal passed from one node to the next is determined by a weighted function whose weights have been derived by training the net on data with known results, in this case protein sequences with known secondary structures. The reported success rate is 72%.
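
The "weighted function" at a single node can be sketched as follows. This is a generic illustration and not the architecture used by Rost and Sander: the one-hot encoding of residues, the sigmoid activation, and the names `one_hot_window` and `node_output` are assumptions, and the weights are untrained placeholders.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_window(window):
    """Encode a string of residues as a flat list of 0/1 inputs, 20 per residue."""
    inputs = []
    for aa in window:
        inputs.extend(1.0 if aa == code else 0.0 for code in AMINO_ACIDS)
    return inputs


def node_output(inputs, weights, bias):
    """Weighted sum of the inputs, passed through a sigmoid."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))


inputs = one_hot_window("AGA")
# Untrained placeholder weights: all zero, so the node output is 0.5.
untrained = node_output(inputs, [0.0] * len(inputs), 0.0)
```

Training adjusts the weights so that, for sequence windows with known secondary structure, the output nodes move toward the known answer.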

Better predictions are not expected without introducing additional input parameters that have a demonstrated dependency on primary sequence. This implies including observations from longer-range interactions and combinations of other properties.




Reading Materials

Required Reading

F. Eisenhaber, B. Persson and P. Argos (1995) Critical Reviews in Biochemistry and Molecular Biology 30(1), 1-94. The relevant section will be available as a handout.

Optional Reading

P.Y. Chou and G.D. Fasman (1974) Biochemistry 13, 211-222.

J. Garnier, D.J. Osguthorpe and B. Robson (1978) J. Mol. Biol. 120, 97-120.

B. Rost and C. Sander (1994) Proteins 19, 55-72.




Lecture Goals & Assignment

The lecture assignment is to use several Internet resources that implement the methodologies introduced above and to compare their prediction results for the sequence of your pet protein. The resources are drawn from a more complete list found in the CMS resource.




Examples

Here is an example approach.

  1. Obtain the sequence of your pet protein. This is best obtained in single-letter code in FASTA format, since this is the format most commonly accepted by prediction methods.
    1. Go to the Structure Search page of Entrez
    2. Enter the PDB code of your pet protein
    3. Follow the protein link
    4. Request a FASTA report format
    5. Cut and paste this to a file for future use.
  2. Select a Garnier-Osguthorpe-Robson secondary structure prediction server.
    1. Paste your FASTA sequence and run the prediction
    2. Paste the results to a file for later comparison
  3. Select the Rost/Sander PHD server (which combines multiple sequence alignment with a neural net) at Columbia University.
    1. Fill in the appropriate fields, selecting "secondary structure only", "HSSP format", and "additionally in column format"
    2. Paste the sequence as one-letter codes
    3. Select e-mail or interactive response (my interactive response was automatically changed to e-mail after several minutes)
    4. The actual prediction is about three quarters of the way through a long output.
  4. You now have two predictions, Garnier and Rost/Sander. We will compare these to results from the Kabsch/Sander algorithm (to be covered in lecture 9), which uses hydrogen bonding potentials derived from the 3-D coordinates. Many X-ray structures in PDB files report secondary structure from this algorithm; that is, crystallographers consider it as accurate as what they can determine by eye, and more consistent than each crystallographer making his or her own assignment. The precalculated set of secondary structure determinations is called DSSP (Definition of the Secondary Structure of Proteins). The values can be retrieved from the PDB site:
    1. Enter the PDB id of your protein
    2. From the Protein Explorer page enter "Sequence Details"
    3. Review the secondary structure specified on this page
  5. You now have an assignment based on structure. What should go on your web page is a discussion of the success rate of Garnier vs Rost/Sander (PHD) when compared to DSSP. Determine what percentage of residues is predicted correctly (against DSSP) by each of the two methods. What does this tell you?
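
The bookkeeping in steps 1 and 5 can be sketched in Python. This is a minimal illustration under stated assumptions, not a prescribed procedure: the function names are hypothetical, the reduction of DSSP codes to three states (H, G, I as helix; E, B as strand; everything else as coil) is one common convention among several, and the strings in the usage lines are made up.

```python
def parse_fasta(text):
    """Return the sequence of the first record in FASTA-formatted text."""
    residues = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # stop at the start of a second record
            seen_header = True
            continue
        residues.append(line)
    return "".join(residues)


# Assumed three-state reduction of DSSP codes; other groupings are in use.
DSSP_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def to_three_state(dssp_string):
    """Map each DSSP code to H (helix), E (strand), or C (everything else)."""
    return "".join(DSSP_TO_THREE.get(code, "C") for code in dssp_string)


def percent_correct(predicted, observed):
    """Percentage of positions where the predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must be the same length")
    if not predicted:
        raise ValueError("nothing to compare")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(predicted)


# Made-up example data, for illustration only.
sequence = parse_fasta(">toy example\nAAGA\nAG\n")
observed = to_three_state("HHHT E")
predicted = "HHCCCE"
score = percent_correct(predicted, observed)
```

Before comparing, check that the prediction and the DSSP assignment cover the same residues, since residues missing from the structure can make the strings differ in length, and map each server's output letters onto the same three states.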
