SECRETS
OF HIGH-THROUGHPUT GENOMICS
EXPRESSION, PURIFICATION,
CRYSTALLIZATION
AN ENORMOUS TASK
DNA sequences are much the same for all individuals within a species,
while differences are more pronounced from one species to another.
The human DNA sequence is very much like that of the chimpanzee
(estimated to be 98 percent alike) or the mouse (estimated to
be 95 percent alike), and mammalian DNA sequences are more like
one another than they are like those of, for example, plants.
The DNA sequence determines corresponding sequences of genes,
the active lengths of the string that encode the amino-acid sequences
of proteins. "While all living things share a great deal of DNA,
it is the little differences that distinguish the proteins of
one species or individual from those of another," said John Wooley,
UCSD Assistant Vice Chancellor for Research. "Now that scientists
have deciphered the complete genomes of several dozen species,
it's time to go after the structure and function of all the proteins."
The project is called the Joint Center for Structural Genomics
(JCSG), and it began in October 2000 with a five-year, $24 million
grant from the National Institutes of Health (NIH).

Figure 1. Information Flow in the JCSG
High-throughput structural genomics requires automatic methods to
generate, analyze, and validate experimental data. Data gathered at
each stage will be part of a centralized bioinformatic system.

JCSG comprises three interactive core projects
(see diagram, Figure 1). One of them, Bioinformatics, is centered
at SDSC. "Our part of the JCSG is dominated by issues of data
access. We need to keep track of our progress in the context of
all available current research, so we can rationally improve our
drive toward determining the structure of large numbers of proteins,"
said Bioinformatics core leader Adam Godzik of UCSD. He has developed
new methods for identifying protein motifs (characteristic bits
of structure). He works with Mark Miller of SDSC, a crystallographer
and structural biologist who is the project coordinator.

"We're also working closely with other SDSC
and UCSD scientists: Lynn Ten Eyck, director of the Computational
Center for Macromolecular Structure; Shankar Subramaniam, the
developer of the Biology WorkBench; Susan S. Taylor, whose group
solved the first protein kinase structure; Philip Bourne (who
directs the SDSC portion of the Protein Data Bank) and Ilya Shindyalov,
structural bioinformatics leaders at SDSC; and Michael Gribskov,
developer of the Molecular Information Agent and other innovative
data mining applications," Miller said.

SECRETS OF HIGH-THROUGHPUT GENOMICS

Figure 2. Ribosome
Ribbon representation of a ribosome. These complex, multiunit
assemblies of RNA and protein are the synthesis sites for
new proteins manufactured within the cell.

"The JCSG is taking a systems engineering
approach to determining protein structure and function," Ten Eyck
explained. "There are five major steps in high-throughput protein
structure determination. We begin with genomic sequences of interest,
for example, the genome of the multicellular worm Caenorhabditis
elegans, and select target protein sequences from those encoded
by the genes."

"What we're looking for are proteins that
have, say, cellular signaling functions, but which differ significantly
from similar proteins whose structure has already been solved,"
Miller continued. "The bioinformatics approach allows us to sort
through genomic sequences and target proteins on the basis of
criteria like these, automatically."

The worm genome is simpler
than the human, yet the fact that the organism is a metazoan (multicellular)
means there will be proteins having the functions JCSG scientists
want to study. Such studies will quickly be extended to similar
proteins found in more complicated organisms, including mammals.
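
The kind of automatic target filtering Miller describes can be
illustrated with a minimal sketch. It is not the JCSG's actual software:
the `Candidate` record, its field names, the example entries, and the
30 percent identity cutoff are all assumptions made for illustration.
The sketch keeps candidates annotated with the function of interest that
are dissimilar to every protein of known structure, and it lists soluble
proteins ahead of the rest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one protein sequence found in a genome."""
    name: str
    predicted_function: str        # e.g. "signaling" (assumed annotation)
    max_identity_to_solved: float  # best identity (0-1) to any solved structure
    is_soluble: bool


def select_targets(candidates, function="signaling", identity_cutoff=0.30):
    """Return candidates worth solving, soluble and most novel first.

    The cutoff is an illustrative assumption, not a JCSG parameter.
    """
    kept = [
        c for c in candidates
        if c.predicted_function == function
        and c.max_identity_to_solved < identity_cutoff
    ]
    # False sorts before True, so soluble proteins come first;
    # ties are broken by lowest similarity to known structures.
    return sorted(kept, key=lambda c: (not c.is_soluble, c.max_identity_to_solved))


# Made-up example entries, for illustration only.
example_candidates = [
    Candidate("worm_A", "signaling", 0.22, True),
    Candidate("worm_B", "signaling", 0.85, True),    # too close to a solved structure
    Candidate("worm_C", "metabolism", 0.10, True),   # not the function of interest
    Candidate("worm_D", "signaling", 0.15, False),   # not soluble, so listed later
]

for target in select_targets(example_candidates):
    print(target.name)
```

Because the filter only needs a function annotation and a similarity
score, the same sketch would apply unchanged to sequences from any other
genome.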
"The targeting procedures can be applied to
any source of sequences," Ten Eyck noted, "with appropriate filtering--and
it is the filtering methods that we have been developing at SDSC."
Soluble proteins thought to be involved in intra- or extracellular
signaling processes will be first to be solved. The group will
then target a number of transmembrane proteins that facilitate
signaling across cell boundaries.

EXPRESSION, PURIFICATION, CRYSTALLIZATION

When target proteins have been selected, the
next steps are expression, purification, and crystallization
of the proteins, carried out in the Crystallomics core project.
Ray Stevens, Peter Schultz, and colleagues at The Scripps Research
Institute (TSRI) and the Genomics
Institute of the Novartis Research Foundation (GNF) have been
developing breakthrough robotic technologies for expressing and
obtaining large amounts of purified protein. The expression systems
include Escherichia coli bacteria and yeasts, into which genomic
sequences are inserted that code for the proteins of interest.

"Crystallization has been a hit-or-miss business
for far too long," Miller said. A mass of folklore has grown up
around various procedures that work (but not always) to produce
good crystals. By keeping track of all attempts to crystallize
the proteins of interest, he noted, JCSG's Bioinformatics core
will have a record, "both positive and negative," of what works
and what doesn't work. The GNF automated system can make crystals
with as little as 2 nanoliters of protein, so many trials can
be made under varying conditions.

Successful crystals will be sent to the Stanford
Synchrotron Radiation Laboratory (SSRL), where the Structure Determination
core scientists begin by X-raying the crystals using the high-power
beamline. Also to be automated is a procedure to make crystals
with heavier atoms inserted at appropriate spots (e.g., substituting
selenium for sulfur in methionine residues) to enable scientists
to determine the "phasing" of the crystals by multiwavelength anomalous
diffraction and other means.

"We'll be keeping data on the full set of X-ray
reflections for each crystal studied," Miller said, "and again,
this information will be a guide to future procedures with more
complex proteins." (While individual investigators are encouraged
by PDB to submit their raw crystallographic data, it is not required.)

From the X-ray diffraction data, JCSG scientists
will build electron-density maps that, combined with the original
sequence information, will enable complete models to be made of
each protein's three-dimensional crystal structure. "All of these
procedures, which used to take many months, if not years, are
to be automated and routinized," Miller said.

"We will also be using new methods developed
by PDB scientists to keep track of the various classes of 'folds'
to be found in the structural data," he said. There appear to
be tens of thousands of variations in folding motifs. The Combinatorial
Extension algorithm developed by Ilya Shindyalov and Philip Bourne
has been used for the classification of all folds found in the
structures in the PDB, and the same methods can be extended to
encompass new folds. For proteins that act together in a complex,
like ribosomes (see Figure 2), the orientation and connection of
the subunits are yet another classification problem.

AN ENORMOUS TASK

The JCSG is one of seven pilot projects initiated
last year by the NIH National Institute of General Medical Sciences,
in hopes of determining the structures of thousands of proteins.
"An engineering approach here is going to drive down the cost
of obtaining protein structure solutions," Ten Eyck noted, "and
thus the Bioinformatics core of the JCSG is truly the key to success."
JCSG's objective is to solve some 50 structures in the first year,
accelerating to 1000 solutions by the fifth year, for a total
of some 2000 completely new protein solutions.

"The fifth-year figure works out to about
three new structures every day," Miller said, "and similar goals
have been postulated by the other six structural genomics consortia.
We're talking about a tremendous amount of data, comparable to
the total depositions in PDB from all investigators for the past
year alone. Storing and mining it will be a challenge for us and,
for the PDB also, a very ambitious task."

"Moreover," he continued, "high-throughput
structural genomics requires automatic methods to validate structures
against experimental data." The diversity of experimental data
requires a deep understanding of the refinement and validation
process. "We expect to improve greatly in both data acquisition
and analysis," Miller said, "and we will be depending on strong
links to be forged with, for example, the PDB, other large database
efforts, and the new Alliance for Cellular Signaling, led by Alfred
Gilman at the University of Texas, for which SDSC is also developing
new bioinformatic techniques."

JCSG is currently on the lookout for experienced
bioinformaticists and database programmers to help the project
get up to speed. Interested persons should contact Miller via
e-mail: mmiller@sdsc.edu.
-MM

JCSG
www.jcsg.org

Principal Investigator
Ian Wilson, The Scripps Research Institute (TSRI)

Co-Principal Investigators
John Wooley, UCSD
Keith Hodgson, Stanford Synchrotron Radiation Laboratory (SSRL)

Core Leaders
Bioinformatics: Adam Godzik, SDSC
Structure Determination: Peter Kuhn, SSRL
Crystallomics: Raymond Stevens, TSRI

Bioinformatics Core Participants
Philip Bourne, Michael Gribskov, Mark Miller, Ilya Shindyalov,
Shankar Subramaniam, Susan S. Taylor, Lynn Ten Eyck
SDSC/UCSD