BIOLOGY

RCSB to Apply Bioinformatics Expertise to Extending Capabilities of Protein Data Bank

Helen Berman, Professor, Department of Chemistry, Rutgers University
Philip Bourne, Senior Staff Scientist, SDSC; Associate Adjunct Professor, Department of Pharmacology, UC San Diego; Adjunct Professor, Burnham Institute
Gary Gilliland, Chief, Biotechnology Division, Research Chemist, CARB Fellow, Center for Advanced Research in Biotechnology, NIST; Adjunct Professor, University of Maryland Biotechnology Institute

Since the first protein structure was determined in 1957, technology has allowed biologists to view the smallest components of living things. From experiment after experiment and with the growing power of computers, biologists have collected large amounts of data on the structure of proteins, DNA, and other so-called biological macromolecules. Such biological structure data are a critical tool for unlocking the secrets of living organisms in pharmaceutical and medical research. To make such data easily accessible to biologists around the world, the Research Collaboratory for Structural Bioinformatics (RCSB) was recently awarded a $10 million, five-year grant to operate and significantly extend the capabilities of the Protein Data Bank (PDB).

The marriage of biology and computers has created a new subdiscipline called bioinformatics, and the RCSB--a consortium of Rutgers University, SDSC, and the Center for Advanced Research in Biotechnology (CARB) of the National Institute of Standards and Technology (NIST)--was formed to combine the three sites' expertise in understanding 3-D biological molecular structure.

"Our vision is that the PDB will enable scientists worldwide to better understand structure-function relationships in living things," said Helen Berman, Rutgers professor of chemistry and principal investigator for the PDB award. "We can do this because of the unique infrastructure the RCSB offers in personnel, hardware, software and networking." The PDB is funded by the NSF, the Department of Energy, and two units of the National Institutes of Health: the National Institute of General Medical Sciences and the National Library of Medicine.

Figure 1: Comparing Protein Structures (RCSB screen shot of Java applet)

The RCSB will enhance the Protein Data Bank with Web-based query and analysis tools such as this Java applet for comparing 3-D protein structures.



The RCSB's Protein Data Bank, in addition to being a repository of data, will provide ways for researchers to understand biological function through investigation of sequence and molecular structure. The PDB was previously maintained by Brookhaven National Laboratory. Its change in management will be transparent and seamless as it moves to the RCSB, which will add new search capabilities and improve the consistency and content of existing and future depositions (Figure 1).

The transfer of the PDB will result in several improvements: higher throughput; more query capabilities, including more complex and more accurate queries; a uniform archive; dynamic cross-links to other databases; structure validation; and the identification of structure and sequence neighbors.

New PDB features will enhance the data collection and query processes. Typically, an experimental biologist collects structural data using either x-ray crystallography or nuclear magnetic resonance (NMR) methods and submits the results to the PDB via a software tool that ensures the data are in the proper format.

To have greater impact, the data will then be annotated and linked to related information in the database. This step consists of checking standard derived features, adding information about the biological unit, and assessing the structural and functional classifications. Initially, a human curator will be responsible for this step, but soon the annotations will be added automatically.

Next, a validation step checks the quality of the deposited atomic models with several software packages from RCSB members and other groups. Once a structure becomes part of the archive, the PDB provides, and regularly updates, links to other structural entries and databases. By 2003, the PDB will hold an estimated 50,000 entries, and the RCSB is working toward automating the linking and integration process.
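The deposition, annotation, validation, and linking steps described above can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions: the `Entry` fields, the function names, the duplicate-coordinate check, and the "same method" linking rule are all hypothetical and are not taken from RCSB software.

```python
from dataclasses import dataclass, field

# All names here are hypothetical; they illustrate the workflow only.

@dataclass
class Entry:
    entry_id: str
    method: str                      # e.g. "x-ray" or "NMR"
    atoms: list                      # list of (element, x, y, z) tuples
    annotations: dict = field(default_factory=dict)
    issues: list = field(default_factory=list)
    links: list = field(default_factory=list)

def check_format(entry):
    """Deposition step: reject entries that are not in the expected shape."""
    if entry.method not in {"x-ray", "NMR"}:
        raise ValueError(f"unknown method: {entry.method}")
    if not entry.atoms:
        raise ValueError("entry has no atoms")
    return entry

def annotate(entry):
    """Annotation step: add derived features (here, only an atom count)."""
    entry.annotations["atom_count"] = len(entry.atoms)
    return entry

def validate(entry):
    """Validation step: record quality problems instead of failing."""
    seen = set()
    for _element, x, y, z in entry.atoms:
        if (x, y, z) in seen:
            entry.issues.append(f"duplicate coordinates at {(x, y, z)}")
        seen.add((x, y, z))
    return entry

def link(entry, archive):
    """Linking step: a toy rule that cross-references entries by method."""
    entry.links = sorted(e.entry_id for e in archive if e.method == entry.method)
    return entry

def process(entry, archive):
    """Run the steps in order, then add the entry to the archive."""
    for step in (check_format, annotate, validate):
        entry = step(entry)
    entry = link(entry, archive)
    archive.append(entry)
    return entry
```

Keeping each step as a separate function mirrors the article's point that a step first done by a human curator can later be swapped for an automated one without changing the rest of the flow.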

A dynamic Web interface will provide customizable access to the underlying PDB. By default, the interface is a simple query tool, but advanced users have the ability to add search fields, tailoring it to their needs. Behind the scenes, the interface determines which information is accessed to respond to a query. RCSB has already developed tools, demonstrated on the group's Web site, that query several databases at the same time.
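As a rough illustration of a query form that starts simple, lets advanced users enable extra search fields, and runs one query against several databases, consider the sketch below. The field names, record layout, and functions are assumptions made for this example; the article does not describe the actual RCSB interface at this level of detail.

```python
# Hypothetical sketch: field names and record layout are illustrative only.

DEFAULT_FIELDS = ("keyword",)

def build_query(values, extra_fields=()):
    """Accept only the fields the form currently exposes."""
    allowed = set(DEFAULT_FIELDS) | set(extra_fields)
    unknown = set(values) - allowed
    if unknown:
        raise KeyError(f"fields not enabled: {sorted(unknown)}")
    return dict(values)

def matches(record, query):
    """A record matches when every query field is satisfied."""
    for name, wanted in query.items():
        if name == "keyword":
            # Keyword search is a case-insensitive substring match on the title.
            if wanted.lower() not in record.get("title", "").lower():
                return False
        elif record.get(name) != wanted:
            return False
    return True

def search_all(databases, query):
    """Run one query against every database and tag hits with their source."""
    hits = []
    for db_name, records in databases.items():
        for record in records:
            if matches(record, query):
                hits.append((db_name, record["id"]))
    return hits
```

A default user would call `build_query({"keyword": "kinase"})`, while an advanced user could pass `extra_fields=("method",)` to narrow the same search by experimental method.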

"The basic data processing system has been in production since 1992 at the Nucleic Acid Database," Berman said. "The integrated query system is an example of the whole being much greater than the sum of its parts. Each component of the system has been used and tested by the RCSB members." The work of RCSB members is supported in part by NPACI's Molecular Science thrust area.

The three institutions have divided their responsibilities according to their expertise in data deposition and processing, database query and integration, and database uniformity. The PDB data will be stored and mirrored at all three RCSB sites, in addition to being mirrored at key sites worldwide, notably in Europe and the Pacific Rim.

Figure 2: HIV gp120 Complex (VRML model of gp120 complex)

The data and tools from the RCSB's Protein Data Bank will help biologists unlock the secrets of biological systems with 3-D structure information on proteins such as gp120--the protein HIV uses to gain access to human immune system cells--here shown bound to CD4.



"The RCSB proposal was evaluated using a standard merit review procedure that included a site visit and advisory panel," said Gerald Selzer, program director in the NSF Division of Biological Infrastructure. "Experts in x-ray crystallography, other areas of biology, computer science, and database technology participated in the evaluation of the proposal. Reviewers and agency staff alike were impressed with the plans for operating the database, the management scheme across the three sites, and the expertise that the RCSB brings to this important task."

The expertise of the RCSB members in structure data processing and analysis covers data validation, data modeling, database development, query languages and visualization tool development. The group has developed and currently maintains 11 publicly available structural biology databases.

Much of the framework for the RCSB's PDB is based on the Nucleic Acid Database (NDB), established at Rutgers University in 1990 and directed by Berman. The NDB provides tools and information for studying nucleic acids, including nucleic acids bound to proteins. The Rutgers team has developed automated tools for depositing and validating entries in the database as well as querying and generating reports.

The NDB tools have been used to launch related macromolecular structure databases at Rutgers. Proteins Plus mirrors the PDB and adds the query and report generation capabilities of the NDB. Databases have also been created for proteins that bind to DNA, for nucleic acid structures determined by NMR, and for macromolecule components and ligands to which they bind.

Over the past several years, Phil Bourne, SDSC senior staff scientist, has led a team of computational biologists at SDSC in developing database models for bioinformatics. The result of these efforts, the Property Object Model (POM), is a non-proprietary database model designed to store biological structure data and enable fast queries.

"Speed is important for searching for patterns of structural properties within fast-growing biological databases," Bourne said. "Being non-proprietary is important for making databases and query tools available to the community at large for both public and unpublished data."
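The article does not describe POM's internals, so the following is not the Property Object Model. It is a generic sketch of one common way to make searches over structural properties fast: an inverted index from (property, value) pairs to entry identifiers. The class name, method names, and example properties are all hypothetical.

```python
from collections import defaultdict

class PropertyIndex:
    """Toy inverted index; NOT the POM, just a generic illustration."""

    def __init__(self):
        # Maps (property name, value) -> set of entry ids with that value.
        self._index = defaultdict(set)

    def add(self, entry_id, properties):
        """Register every property of an entry in the index."""
        for name, value in properties.items():
            self._index[(name, value)].add(entry_id)

    def find(self, **criteria):
        """Return ids of entries matching all of the given property values."""
        if not criteria:
            return set()
        sets = [self._index.get((n, v), set()) for n, v in criteria.items()]
        return set.intersection(*sets)
```

With such an index, a multi-property search intersects a few precomputed sets instead of scanning every entry, which matters more as the number of entries grows.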

SDSC has built and now maintains four POM-based resources. The Protein Kinase Resource is a repository of information about a diverse family of enzymes that play a major role in communication between cells. SDSC also maintains the PDB Obsolete Structures Database, which provides a chronology of versions for PDB structures; the WPDB software, for searching PDB data on PC platforms; and MOOSE, which queries the native and derived features of macromolecular structure from PDB data.

Through CARB, NIST maintains several bioinformatics databases, including the Biological Macromolecule Crystallization Database (BMCD). Led by Gary Gilliland, CARB's Biotechnology Division chief and an adjunct professor at the University of Maryland Biotechnology Institute, the BMCD was established to help crystallographers develop strategies to find suitable crystals for determining 3-D structure. The BMCD contains crystal data and the crystallization conditions, compiled from the literature, from which researchers have obtained diffraction-quality crystals.


"The mission of the RCSB is to integrate the common activities and structural database resources of the three groups," Bourne said. "Each group has extensive experience in providing and maintaining bioinformatics databases for diverse user communities. Each has unique strengths that complement those of the others."

With the combined RCSB resources, the PDB is poised to advance the state of bioinformatics technology, accelerating the process by which depositors make submissions and enabling new science with more rapid diffusion of information and biological knowledge. The PDB will also provide views of the database suitable for audiences ranging from crystallographers and molecular biologists to computer scientists to educators and students (Figure 2).

"The goal of the PDB is to provide a continuous flow of macromolecular structure data that can be used reliably as a foundation for research worldwide," Berman said. "Providing high-quality data is the key to enabling new discoveries in structural biology." --DH