
Combinatorial Extension Addresses Challenges of Structural Genomics

SDSC RESEARCH
Philip E. Bourne
Ilya N. Shindyalov

The full DNA sequences (genomes) of some 50 living species, including Homo sapiens, have now been determined, and rapid sequencing techniques will add many more genomes in the next few years. The genes of an organism are the portions of its genome that code for proteins, and structural genomics researchers work to determine the structures of these proteins and how they function. Ilya Shindyalov of SDSC works with Philip E. Bourne of SDSC and the Department of Pharmacology at UC San Diego on ways to obtain rapid and accurate answers to structural genomics questions. They have developed an analysis technique called Combinatorial Extension (CE) that compares the 3-D structures of proteins. Such comparisons are important for determining both protein function and evolutionary relationships among proteins.



Figure 1. Top Six Substructures
The Combinatorial Extension procedure was used to find a representative set of polypeptide chains, of which these are the highest-scoring. About 75 common substructures have been identified in this way.
"We can infer functional information for new proteins on the basis of their structural similarity to proteins of known function," Shindyalov said, "and we can also classify similarities in various ways without regard to function. Both processes help us understand the relationships among sequence, structure, and function in the proteins of an organism or group of organisms." But, he cautions, there are no pat answers.

One of the first uses of CE was to detect a calcium-binding EF hand motif in acetylcholinesterase that was not known previously (Ref. 1). Shindyalov and Bourne also found clusters of similar substructures of certain protein-folding topologies. "Such discoveries may lead to more natural classifications of fold space than at present, recognizing an underlying continuity in the evolution of substructures," Shindyalov said.


The CE algorithm builds an alignment between two protein structures in which the alignment path is defined by Alignment Fragment Pairs (AFPs). As the name suggests, AFPs are pairs of fragments, one from each protein, evaluated for their structural similarity. Combinations of AFPs that represent possible continuous alignment paths are selectively extended or discarded, leading to a single optimal alignment.
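As a rough illustration of the AFP idea, the Python sketch below slides a fixed-length window along two chains and keeps every fragment pair whose internal geometry is similar. It is a minimal sketch under stated assumptions, not the CE implementation: the function names, the scoring rule (mean absolute difference between the two fragments' internal distance matrices), the threshold, and the default fragment length are all chosen here for illustration, and each chain is assumed to be a list of (x, y, z) coordinates, one per residue.

```python
import math

def pairwise_distances(coords):
    """Matrix of Euclidean distances between all points in one fragment."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def afp_distance(frag_a, frag_b):
    """Mean absolute difference between the internal distance matrices of
    two equal-length fragments (lower means more similar).
    Illustrative score only; an assumption, not the published CE measure."""
    if len(frag_a) != len(frag_b):
        raise ValueError("fragments must have the same length")
    m = len(frag_a)
    da = pairwise_distances(frag_a)
    db = pairwise_distances(frag_b)
    total = sum(abs(da[i][j] - db[i][j]) for i in range(m) for j in range(m))
    return total / (m * m)

def find_afps(chain_a, chain_b, m=8, threshold=1.0):
    """Return (i, j, score) for every pair of length-m fragments, starting at
    residue i of chain A and residue j of chain B, scoring below threshold.
    The values of m and threshold are hypothetical defaults."""
    afps = []
    for i in range(len(chain_a) - m + 1):
        for j in range(len(chain_b) - m + 1):
            score = afp_distance(chain_a[i:i + m], chain_b[j:j + m])
            if score < threshold:
                afps.append((i, j, score))
    return afps
```

Because the score uses only distances within each fragment, it does not change when either protein is rotated or translated, so no superposition is needed at this stage.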

"Empirically, we found AFP lengths of eight amino acids best," Shindyalov said, "and those are the basis for the CE database results." Shindyalov and Bourne have tested the method against other automated procedures (Dali and VAST), "but CE's view of protein fold space is fundamentally different, and it found many similarities not detected by the other methods," Shindyalov said. "Investigators may get most benefit from using all methods."

Because it finds stretches that are similar but not identical, CE tolerates and incorporates short gaps in AFP sequence-to-sequence matches. The maximum gap is set by the user, but computation time rises with gap size. "In our all-against-all comparisons of known proteins, we found a gap length of about 30 amino acids most useful," Shindyalov said.
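One way to picture the gap limit and its cost is to count how many positions the next AFP could occupy. The Python sketch below is an assumption-laden simplification: it represents an AFP by its start indices (i, j) in the two proteins, allows a gap of up to `max_gap` residues in each protein independently, and uses hypothetical function names. The real procedure may restrict gaps differently.

```python
def can_follow(prev, nxt, m=8, max_gap=30):
    """True if AFP nxt may extend a path ending in AFP prev.
    Each AFP is (i, j): fragment start indices in proteins A and B.
    The next fragment must begin at or after the end of the previous
    one, and no more than max_gap residues later, in both proteins."""
    gap_a = nxt[0] - (prev[0] + m)
    gap_b = nxt[1] - (prev[1] + m)
    return 0 <= gap_a <= max_gap and 0 <= gap_b <= max_gap

def candidate_successors(prev, len_a, len_b, m=8, max_gap=30):
    """All AFP start positions that could follow prev under the gap limit,
    for proteins of len_a and len_b residues."""
    last_i = min(prev[0] + m + max_gap, len_a - m)
    last_j = min(prev[1] + m + max_gap, len_b - m)
    return [(i, j)
            for i in range(prev[0] + m, last_i + 1)
            for j in range(prev[1] + m, last_j + 1)]
```

Under these assumptions the number of candidates at each extension step grows roughly with the square of the permitted gap, which is consistent with computation time rising as the gap is enlarged.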

A CE user may choose among various strategies for calculating the extension of an alignment from one AFP to the next. Measures of the distances between residues or backbone carbons, for example, are one basis on which CE can decide whether to accept the next AFP as extending the alignment path. All AFPs within a protein are tried against those within the comparison protein until no further improvement can be made and the longest stretches of similar structure are delineated. The procedure has produced an initial gallery of common substructures (Figure 1).
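A distance-based acceptance test of this kind can be sketched in Python as follows. This is a deliberately simplified, greedy stand-in rather than the combinatorial search CE performs: it compares each candidate only with the last AFP on the path, keeps a single path, and does not re-check the candidate fragment's own internal similarity. The function names, the threshold, and the default parameters are assumptions made for illustration; chains are lists of (x, y, z) coordinates, one per residue.

```python
import math

def inter_fragment_score(chain_a, chain_b, afp1, afp2, m):
    """Mean |d_A - d_B| over all residue pairs taking one residue from each
    AFP, where d_A and d_B are the corresponding distances in proteins A
    and B. Low values mean the two fragments sit in the same relative
    position in both structures."""
    (i1, j1), (i2, j2) = afp1, afp2
    total = 0.0
    for k in range(m):
        for l in range(m):
            d_a = math.dist(chain_a[i1 + k], chain_a[i2 + l])
            d_b = math.dist(chain_b[j1 + k], chain_b[j2 + l])
            total += abs(d_a - d_b)
    return total / (m * m)

def extend_path(chain_a, chain_b, start, m=8, max_gap=30, threshold=1.0):
    """Greedily extend an alignment path from the AFP `start` (i, j).
    At each step the best-scoring candidate within the gap limit is
    appended; extension stops when no candidate scores below threshold."""
    path = [start]
    while True:
        i0, j0 = path[-1]
        best = None
        for i in range(i0 + m, min(i0 + m + max_gap, len(chain_a) - m) + 1):
            for j in range(j0 + m, min(j0 + m + max_gap, len(chain_b) - m) + 1):
                s = inter_fragment_score(chain_a, chain_b, path[-1], (i, j), m)
                if s < threshold and (best is None or s < best[0]):
                    best = (s, (i, j))
        if best is None:
            return path
        path.append(best[1])
```

A fuller treatment would keep several candidate paths alive at once and score each new AFP against the whole path, which is closer to the selective extend-or-discard behavior described above.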



1. I. Tsigelny, I.N. Shindyalov, P.E. Bourne, T. Sudhof, P. Taylor (2000): Common EF-hand motifs in cholinesterases and neuroligins suggest a role for Ca2+ binding in cell surface associations. Protein Science 8, 1-6.

2. I.N. Shindyalov and P.E. Bourne (1998): Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 9, 739-747.

3. I.N. Shindyalov and P.E. Bourne (2000): An alternative view of protein fold space. Proteins: Structure, Function, and Genetics 38, 247-260.


In addition to designating the comparison algorithm, CE is also the name of the database of results from applying the algorithm pairwise to all of the protein structures--now numbering more than 13,000--in the Protein Data Bank (PDB). The CE database is maintained at SDSC by the National Biomedical Computation Resource (NBCR), which is sponsored by the NIH National Center for Research Resources.

Shindyalov and Bourne are also members of the Research Collaboratory for Structural Bioinformatics, a collaboration among Rutgers University, the National Institute for Standards and Technology, and SDSC that operates the PDB, the world repository for 3-D protein structures. Scientists using the PDB can access CE results directly.

Queries to the CE resource number 5,000 per month at present, and more than 300 copies of the software have been downloaded. The initial creation of the database required 24,000 processor-hours on the Cray T3E at SDSC. Updates are now made regularly using 64 processors of SDSC's Sun HPC 10000.

"Our hope," Bourne said, "is that procedures like CE will stimulate imaginative uses of likeness and difference in protein structure." --MM *
