
NBCR: Macromolecular Pattern Recognition and On-line Access to Molecular Biology Tools

SDSC RESEARCH
Michael Gribskov

Timothy Bailey
Fariba Fana
Stella Veretnik
Jane Park

Lynn Fink
Ethan Bier
Lawrence T. Reiter
UC San Diego

President Bill Clinton announced in June 2000 that the international Human Genome Project and the Celera Genomics Corporation had together completed an initial "rough draft" sequencing of the human genome--the genetic blueprint for human beings. As this sequence information becomes available, identifying protein families and functionally important patterns in protein sequences, or motifs, within the genome is increasingly important. SDSC computational biologist Michael Gribskov is leading a team of researchers in developing Web-based molecular pattern recognition tools that allow biologists to generate maps between a molecule's genetic sequence and its function, based on comparisons to known molecules. Such efforts in the field of bioinformatics are critical as researchers move from mapping the human genome to applying that knowledge to better understand the molecular basis of human disease.



Figure 1. On-line Sequence Alignment
These rows of letters are aligned sequences of the calcium/calmodulin-dependent (CaMK) subfamily of protein kinases, displayed using the Alignment Viewer tool of SDSC's Protein Kinase Resource. This tool was developed by SDSC computational biologist Michael Gribskov. CaMK regulates cellular metabolic processes, such as muscle fiber contraction, by modifying the activity of calcium-sensitive enzymes. Each letter represents an amino-acid residue in the sequence, and the residues are color-coded based on the physicochemical properties of their side chains, such as charge or hydrophobicity. Each line shows a sequence from a different organism, such as maize or soybean, but alignments can also be used to compare sequences from a single organism.

A key challenge for modern biology is to keep track of large, complex sets of data, such as the human genome, and to relate these large data sets both to other public data and to locally developed information resources. As a project of the National Biomedical Computation Resource (NBCR), Gribskov's group at SDSC and the Biology Department at UC San Diego are developing applications for recognizing motifs and tools for federated database queries, all accessible through the Web.
Pattern recognition tools let biologists generate maps between a molecule's genetic sequence and its structure and function, based on comparisons to known molecules. A simple interface for access to a large number of resources, otherwise known as transparent supercomputing, makes these computationally demanding comparisons possible without the steep learning curve of computing on unfamiliar systems.
The NBCR is funded by the National Center for Research Resources at the National Institutes of Health and is headquartered at SDSC. One of NBCR's earliest efforts, begun in 1995, is Multiple EM (expectation maximization) for Motif Elicitation (MEME), a tool for discovering motifs--sequence patterns that occur repeatedly--in a group of related DNA or protein sequences. Complementing MEME, the Motif Alignment and Search Tool (MAST) allows users to search biological sequence databases for sequences that contain one or more of a group of known motifs.
In conjunction with MEME and MAST, NBCR's MotifWeb (formerly known as SeqWeb) temporarily stores data, formats it, analyzes it, and shows results on a Web page or via e-mail. Using the Compugen Bioccelerator (Bio-XLP) computer--a parallel system for sequence comparisons, equivalent to about 100 nodes of a parallel supercomputer such as Blue Horizon--MotifWeb also provides access to locally available databases at SDSC via the MotifWeb server, and to tools for custom display of alignments.
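The idea of searching a sequence for a known motif, as described above for MAST, can be illustrated with a small sketch. This is not MAST's actual scoring method; it is a simplified position weight matrix scan that assumes a uniform background frequency, a DNA alphabet, and a handful of made-up motif instances. The function names `build_log_odds` and `scan` are hypothetical.

```python
import math


def build_log_odds(instances, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds matrix from equal-length motif instances.

    Assumption: every letter is equally likely in the background.
    """
    width = len(instances[0])
    background = 1.0 / len(alphabet)
    matrix = []
    for pos in range(width):
        # Start from a pseudocount so unseen letters get a finite score.
        counts = {letter: pseudocount for letter in alphabet}
        for inst in instances:
            counts[inst[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {
                letter: math.log2((counts[letter] / total) / background)
                for letter in alphabet
            }
        )
    return matrix


def scan(sequence, matrix):
    """Score every window of the sequence; return (best_score, best_start).

    Returns None when the sequence is shorter than the motif.
    """
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][ch] for i, ch in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Toy data, invented for illustration only.
toy_instances = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
toy_matrix = build_log_odds(toy_instances)
best_hit = scan("GGCGTATAATGCC", toy_matrix)
```

Here `best_hit` holds the score and start position of the window that best matches the toy motif. A production tool would also estimate the statistical significance of each match, which this sketch omits.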


"The goal is to identify the genes in the genomic DNA, and more importantly, to figure out their function," Gribskov said. In the quest to decipher the meaning behind the DNA code, one approach is to find the parts of the molecules that are under functional constraints. To do this, one needs to find sequences in the amino-acid code that are conserved. Some sequences change from gene to gene, and some do not. The ones that do not change are "conserved."

"Conserved sequences, or motifs, indicate a common structure or function when they are found in many distinct proteins," Gribskov said. "So, we can compare the genes to find the conserved sequences and make models of them, at which point we can make inferences about the structure and function of the protein molecules."

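A minimal sketch can make the notion of a conserved position concrete. It assumes the sequences have already been aligned to the same length and simply reports the columns in which every sequence carries the same residue. The sequences below are toy data, and the function names are hypothetical; this is not the method used by the tools described in this article.

```python
from collections import Counter


def column_conservation(aligned):
    """For each alignment column, return (most common residue, its fraction)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        residue, n = counts.most_common(1)[0]
        result.append((residue, n / len(aligned)))
    return result


def conserved_positions(aligned, threshold=1.0):
    """Return column indices whose top residue meets the threshold.

    Columns dominated by the gap character '-' are ignored.
    """
    return [
        i
        for i, (residue, fraction) in enumerate(column_conservation(aligned))
        if residue != "-" and fraction >= threshold
    ]


# Toy aligned protein fragments, invented for illustration only.
toy_alignment = ["HRDLKPEN", "HRDIKPSN", "HRDLKPEN", "YRDLKAEN"]
fully_conserved = conserved_positions(toy_alignment)
```

Lowering `threshold` (for example to 0.75) would also admit columns where most, but not all, sequences agree.
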
New technology for determining gene expression levels has created similar questions in the area of gene regulation. "Gene chip" technology can simultaneously determine the expression level of thousands of genes. One then wants to find groups of genes with similar expression patterns, because these genes are likely to share common regulatory processes. Gribskov's latest project in this area is the development of a Web-based genetic-data analysis system called 2HAPI.

Using this server, researchers can analyze gene chip data, in which probes for each gene in an organism's genome are used to determine simultaneously the expression levels for all of the genes. Until now, the experiments have focused on yeast and the nematode C. elegans, but with 2HAPI, researchers can use sequences from the human genome for these queries.
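Finding groups of genes with similar expression patterns, as described above, can be sketched with a correlation measure. The example below is not how 2HAPI works; it is a simple greedy grouping by Pearson correlation, with invented gene names, invented expression values, and an arbitrary threshold of 0.9.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a flat profile has no defined correlation
    return cov / (norm_x * norm_y)


def group_by_profile(profiles, threshold=0.9):
    """Greedy grouping: a gene joins the first group whose seed gene
    (the group's first member) it correlates with at or above threshold."""
    groups = []
    for gene, values in profiles.items():
        for group in groups:
            if pearson(values, profiles[group[0]]) >= threshold:
                group.append(gene)
                break
        else:
            groups.append([gene])  # no match: start a new group
    return groups


# Toy expression levels across four conditions, invented for illustration.
toy_profiles = {
    "geneA": [1, 2, 4, 8],
    "geneB": [2, 4, 8, 16],
    "geneC": [8, 4, 2, 1],
    "geneD": [9, 5, 2, 1],
}
toy_groups = group_by_profile(toy_profiles)
```

With these toy values, the two rising profiles end up in one group and the two falling profiles in another. Real analyses typically use more robust clustering methods and normalize the data first.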

"We want to find the genes that respond to disease or other stresses in the same way," Gribskov said. "And there is a huge need for software such as 2HAPI to perform the analyses." 2HAPI development is sponsored by NIH, the Veteran's Medical Research Foundation, the UC San Diego Genomics Core Facility, the UC San Diego AIDS Research Program, and SDSC.



Gribskov is involved with several database efforts that are helping to provide access to genomic sequence analyses, including Functional Genomics of Plant Phosphorylation (PlantsP), Homophila, the Protein Kinase Resource (PKR), and the Molecular Information Agent (MIA).

PlantsP--formerly under NBCR, but now under separate funding--compares protein kinases and protein phosphatases (both are types of enzymes) in Arabidopsis (a small flowering plant of the mustard family) and other plants. Led by the University of Missouri and funded by the plant genome program of the National Science Foundation, PlantsP (see p. 4) focuses on protein phosphorylation, a process that controls most developmental events and environmental responses in the cell.

Homophila is an intergenomic database that relates human disease genes to genes in Drosophila (fruit fly). Its purpose is to use sequence information from the On-line Mendelian Inheritance in Man (OMIM) database to determine whether sequence homologs of these genes exist in the current Drosophila sequence database (FlyBase). About 48% of human disease-associated sequences in OMIM have strong matches to one or more sequences in the Drosophila sequence database.

"The advantage is that one can grow millions of Drosophila and perform genetic experiments on them, whereas one can't do this with humans," Gribskov said. "This is useful, for example, in figuring out how and why a particular gene causes disease. Extremely powerful genetic approaches can be used in Drosophila to study the functions and interactions of genes. If a researcher finds a lethal gene, its expression can be limited to a specific tissue, such as the wing. So we can figure out how a gene works in Drosophila, and then use that information to figure out how it works in people."

Another NBCR-supported project is PKR, a Web-accessible compendium of information on the protein kinase family of enzymes that play a major role in communication between cells (Figure 1). This resource includes tools for structural and computational analyses, as well as links to related information maintained by other organizations. PKR is a vertically integrated database covering a specific group of proteins, the protein kinases, in greater depth than databases that target all proteins or all nucleic-acid sequences.

"This SDSC project is creating a new kind of scientific database integrating structural, genetic, and molecular biological data," Gribskov said. "Our goal is to build a system that is narrowly focused on kinases, phosphatases, and related molecules. In contrast to major databases that are publicly available today--such as GenBank and the Protein Data Bank--we are producing a system with information that runs deeply, rather than broadly, to provide biomedical researchers with powerful tools for scientific discovery."

For data federation, the main tool is MIA, a prime component of NBCR. "A problem that molecular biologists face is that there are many electronic databases, and it's hard to keep track of the resources available," Gribskov said. "Furthermore, the process of following all the links is frustrating because one might be missing information from the links that one didn't follow. And following all the links is very tedious."

MIA is a Web server that searches biological databases to find the existing information about a macromolecule. It supports searching by molecule identification number, gene symbol, sequence, and keyword. MIA queries about 75 databases and focuses on the databases appropriate to the particular search. The results of each database query are parsed, and additional keywords are identified. Specific information can be extracted as a synopsis of the result. The search ends when all template queries have been attempted and no further keywords can be extracted from the search results. "In essence, it tries to find all the data relevant to your initial piece of information without your having to do it all yourself," Gribskov said.
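The search loop described above (query, parse, extract new keywords, repeat until nothing new turns up) can be sketched as a simple worklist. This is only an illustration of the general idea, not MIA's implementation: the database names, records, and keyword extractor below are all hypothetical stand-ins.

```python
from collections import deque


def federated_search(start_term, databases, extract_keywords):
    """Follow keywords across databases until no new ones appear.

    databases: dict mapping a name to a callable(term) -> list of records.
    extract_keywords: callable(record) -> iterable of keyword strings.
    """
    seen = {start_term}
    queue = deque([start_term])
    findings = []
    while queue:
        term = queue.popleft()
        for name, query in databases.items():
            for record in query(term):
                findings.append((name, term, record))
                for keyword in extract_keywords(record):
                    if keyword not in seen:  # never re-query a term
                        seen.add(keyword)
                        queue.append(keyword)
    return findings


# Hypothetical toy "databases", invented for illustration only.
toy_genes = {"GENE1": ["GENE1 encodes PROT1"]}
toy_proteins = {"PROT1": ["PROT1 binds calcium"]}
toy_databases = {
    "genes": lambda term: toy_genes.get(term, []),
    "proteins": lambda term: toy_proteins.get(term, []),
}


def toy_extract(record):
    """Treat upper-case tokens as identifiers worth following."""
    return [word for word in record.split() if word.isupper()]


toy_findings = federated_search("GENE1", toy_databases, toy_extract)
```

The `seen` set is what guarantees termination: each keyword is queried at most once, so the loop stops once the results yield no keywords that have not already been tried.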

"The challenge for bioinformatics is not just to develop the tools for the biomedical community, but to integrate them into information resources, and then to integrate that into a network of knowledge-based molecular biology," he said. To this end, NBCR and its motif-recognition tools are helping to accomplish what just recently was impossible: to decode and make sense of the human genome sequences and to truly understand the parts that cause diseases. --AV *
