By Paul Tooby
To predict what types of
wheat will grow best in various areas of Afghanistan,
a researcher could use SEEK to identify suitable
data sets containing observations of where wheat
has been grown in the past, and data on environmental
conditions such as temperature, rainfall, and
soil type.
More than 20 years of war, coupled with the worst drought
in decades, have devastated Afghanistan's agricultural
production and changed large parts of the country's environment,
leaving the nation heavily dependent on donated food aid.
To help rebuild Afghanistan's agriculture, it is necessary
to quickly identify which crops will grow best in the various
regions of the country. This will provide crucial guidance
in avoiding the use of untested, and possibly harmful, invasive
seeds, thus helping to restore local agriculture and protecting
the genetic resources of Afghanistan.
Researchers working on the National Science Foundation (NSF)-sponsored
Science Environment for Ecological Knowledge (SEEK)
are building a powerful information infrastructure that will
offer a unique capability to jump-start this process. "Through
SEEK, a researcher will be able to use ecological niche modeling,
a tool that identifies a crop's preferred habitat, or
combination of temperature, rainfall, altitude, and other
factors," said David Vieglais, a research scientist at
the Natural History Museum and Biodiversity Research Center
at the University of Kansas (KUNHM).
"This gives predictions or educated guesses
about which crops will grow best in a given area."
This same set of powerful yet flexible tools can help address
a wide array of environmental issues, from basic ecology research
to practical questions such as alerting planners to the likely
spread of mosquitoes that carry the West Nile virus.
A Race for Understanding
Project Leaders
Jim Beach
University of Kansas Biodiversity Research Center
Matt Jones, Mark Schildhauer
National Center for Ecological Analysis and Synthesis
Bertram Ludäscher
San Diego Supercomputer Center
William Michener
Long-Term Ecological Research Network/University of New Mexico
Participants
Corinna Gries, Peter McCartney, Robin Schoeninger
Arizona State University
Paula Huddleston
Integrated Taxonomic Information System
David Vieglais, Ricardo Pereira, Scott Downie, Susan Gauch, Town Peterson, Aimee Stewart
University of Kansas
Jessie Kennedy
Napier University
Chad Berkley, Dan Higgins, O.J. Reichman, Jing Tao, Rich Williams
National Center for Ecological Analysis and Synthesis
Arcot Rajasekar, David Stockwell, Liying Sui, Paul Tooby, Jenny Wang, Bing Zhu
San Diego Supercomputer Center
Joseph Goguen, Victor Vianu, Yang Yu
University of California, San Diego
James Brunt, Deana Pennington, Kristin Vanderbilt, Robert Waide
University of New Mexico
Ferdinando Villa
University of Vermont
Robert Peet
University of North Carolina
"We're in a race," says Bill Michener, a SEEK
principal investigator and staff scientist at the NSF Long-Term
Ecological Research (LTER)
Network Office at the University
of New Mexico. "On the positive side, emerging information
technologies can make SEEK a powerful system that provides
scientists and policy makers with the science-based answers
they need. But today's complex environmental challenges
are multiplying so fast that we urgently need this advanced
infrastructure now to support both basic research and sound
environmental management."
SEEK is an ambitious five-year NSF Information
Technology Research project. "It's a lot of
work to build a comprehensive system like SEEK, but it's
the key to being able to achieve an overall understanding
of ecological systems," said Michener. "Even the
general public now understands that all the parts of an ecosystem
are connected, so that to understand them you need to encompass
all the components in your model."
Moreover, ecological questions range over extremes of spatial
and temporal scales, and investigations involve all of the
physical and life sciences. Thus, a comprehensive system requires
the kind of large-scale infrastructure being built in the
SEEK project, which will be capable of scaling up and integrating
data of all kinds related to the ecosystem. In addition, the
SEEK tools will provide researchers with analysis and visualization
capabilities that free them from needing specialized IT knowledge,
and give them a powerful platform to do science much more
rapidly and on a larger scale than possible before.
The SEEK initiative is an outgrowth of ecological and biodiversity
informatics research and includes computer scientists, ecologists,
and technologists. The lead organizations involved are part
of the Partnership
for Biodiversity Informatics, a consortium made up of
the National Center for Ecological Analysis and Synthesis
(NCEAS)
at UC Santa Barbara; the
San Diego Supercomputer Center (SDSC)
and UC San Diego; the
Natural History Museum and Biodiversity Research Center at
the University of Kansas (KU);
and the LTER Network Office at the University of New Mexico.
Additional partnering institutions are Arizona
State University, the University
of North Carolina, the University
of Vermont, and Napier
University in Scotland.
A Quick Tour of SEEK
In our example, the researcher needs to predict what types
of wheat will grow best in various areas of Afghanistan. First
the researcher uses SEEK to identify suitable data sets containing
observations of where wheat has been grown in the past, and
data on environmental conditions such as temperature, rainfall,
and soil type. With a common interface to different data sources,
SEEK then pulls in specimen databases that have locality information.
Initially, SEEK will include sources such as the Species
Analyst at KU, which accesses museum databases, the MetaCat
catalog from the Knowledge
Network for Biocomplexity (NCEAS, LTER, etc.), and other
ecological data sources.
The other type of data needed is environmental coverage layers,
which may range in scale from global to county. "The
trick is that environmental data layers and species distribution
data layers have to be integrated using the same cell size
and spatial extent, taking into consideration the effects
that scaling and other transformations might have on an analysis,"
said Matt Jones, SEEK project manager and a researcher at
NCEAS. "SEEK will then transform them so that they're
all at the same scale." Once the data is available in
SEEK, it is pushed into the ecological niche model, which
may be running elsewhere on a different machine. SEEK finds
the computational resources and produces the niche model that
identifies the preferred habitat for each crop of interest.
The results are then overlaid onto a map of Afghanistan to
highlight which crops are predicted to grow best in each
area, vital information for boosting agricultural
production.
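The two steps described above, bringing layers to a common cell size and then deriving a crop's preferred habitat, can be illustrated with a minimal Python sketch. It is not SEEK code: the article does not say which niche-modeling algorithm SEEK runs, so the sketch assumes a simple "environmental envelope" (the range of each variable over cells where the crop has been observed), and the function names and grid values are invented for illustration.

```python
from statistics import mean


def aggregate(grid, factor):
    """Coarsen a 2-D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    return [
        [
            mean(grid[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            for c in range(0, cols, factor)
        ]
        for r in range(0, rows, factor)
    ]


def fit_envelope(layers, occurrences):
    """Return {layer name: (min, max)} over the cells where the crop was observed."""
    envelope = {}
    for name, grid in layers.items():
        values = [grid[r][c] for r, c in occurrences]
        envelope[name] = (min(values), max(values))
    return envelope


def predict(layers, envelope):
    """Mark a cell suitable when every layer value falls inside the envelope."""
    first = next(iter(layers.values()))
    rows, cols = len(first), len(first[0])
    return [
        [
            all(lo <= layers[name][r][c] <= hi for name, (lo, hi) in envelope.items())
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Invented illustration data: a fine 4x4 temperature grid (deg C)
# and a coarser 2x2 rainfall grid (mm) covering the same extent.
temperature_fine = [
    [10, 12, 20, 22],
    [10, 12, 20, 22],
    [14, 16, 28, 30],
    [14, 16, 28, 30],
]
rainfall = [[300, 450], [500, 150]]

# Bring both layers to the same 2x2 cell size before modeling.
layers = {"temperature": aggregate(temperature_fine, 2), "rainfall": rainfall}
occurrences = [(0, 0), (1, 0)]  # (row, col) cells where the crop has been recorded
suitability = predict(layers, fit_envelope(layers, occurrences))
```

Block averaging is only one possible transformation. As Jones notes, the choice of rescaling method can itself affect the analysis, so a real system would need to record which one was applied.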
"An important feature of SEEK is that it goes beyond
providing data integration and analysis services," said
Jones. "We're building SEEK to also capture the
scientific workflows, the steps actually carried out as scientists
use SEEK." For example, the output from the ecological
niche model for Afghanistan may turn out to be useful not
only in the researcher's future work but in other apparently
unrelated processes. "The whole environment can spiral
up. It can sort of self-assemble and become smarter,"
said Vieglais. This acceleration can lead to what,
in effect, are vastly expanded "virtual collaborations"
that could eventually become community-wide.
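To make the idea of a captured workflow concrete, here is a small Python sketch of a pipeline that remembers its steps and logs every run so the same analysis can be reapplied to other data. The `Workflow` class and the step names are hypothetical and are not drawn from SEEK itself.

```python
class Workflow:
    """Linear pipeline that remembers its steps and logs each run."""

    def __init__(self):
        self.steps = []  # (name, function) pairs, in execution order
        self.log = []    # one record per executed step

    def add_step(self, name, func):
        self.steps.append((name, func))
        return self  # allow chained calls

    def run(self, data):
        for name, func in self.steps:
            result = func(data)
            self.log.append({"step": name, "input": data, "output": result})
            data = result
        return data


def drop_missing(values):
    """Remove missing observations, represented here as None."""
    return [v for v in values if v is not None]


def average(values):
    return sum(values) / len(values)


workflow = Workflow()
workflow.add_step("drop_missing", drop_missing).add_step("average", average)

first_result = workflow.run([20, None, 30])
# The recorded steps can be reused unchanged on a different data set.
second_result = workflow.run([10, 14, None])
```

Because the steps are stored as data rather than buried in a one-off script, another researcher could inspect the log or rerun the same sequence, which is the kind of reuse the article describes.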
A Rose By Any Other Name
In studying the remarkable diversity of life on the Earth,
systematists have developed correspondingly rich formal systems
for applying names to organisms, which promote the reuse of
scientific names for new species concepts as biological classifications
evolve. Since the formal rules have emphasized retaining historical
priority of names, however, this means that the same scientific
name is sometimes used in different senses to describe distinct
species.
The dates of the fieldwork are indications of which species
the data refer to. But in many cases, even knowing the dates
will not be enough, and a researcher will have to dig deeper
into the semantics and broader context of name usage to determine
which species a data set refers to.
There is also the converse situation, where a number of different
scientific names refer to what is now only one or two species.
This is true of the aster, a common genus of garden herbs
in the daisy family. Similar naming or semantic issues exist
in many scientific disciplines, and both cases (one name
for multiple species, or concepts, and multiple names for
one concept) create significant obstacles for automating
the search for compatible data sets.
To make the problem even more interesting, ecological surveys
and inventories go back hundreds of years, leaving a complex
historical trail of multiple classifications for the same
group of organisms, as well as additional issues of inconsistent
field identification and application of names.
As part of a broad trend to integrate scientific disciplines
and data, pressure is growing to make taxonomic information
available and interoperable. To do this, SEEK researchers
are building concept-based taxonomies, and in the future the
community will need to learn to associate all references to
organisms not just to a name but to a concept as well, by
using adequate descriptive metadata. "We're basically
disambiguating 250 years of nomenclatural fuzziness
caused by the many-to-many relationship between scientific
names and biological species concepts," said Beach. "While
formal nomenclature rules have worked well for experts using
familiar names in current use, the system needs to move forward
and gain the capability to unambiguously identify and map
among biological species in order to support large-scale computer-based
data integration and all the benefits it will bring."
In the SEEK architecture, concepts will be mapped to each
other and to names, so that Web queries looking for data sets
about Pinus ponderosa will know about the different uses of
that name, and about the distinctions among the underlying
species concepts, allowing the user of the cyberinfrastructure
to make an informed choice of which species and concepts to
use.
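The many-to-many relationship between names and concepts can be sketched as a small data structure in Python. This is an illustrative assumption about how such a mapping might look, not the SEEK design. The concept identifiers, the "Example Flora" sources, and the name "Pinus exampleana" are invented placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    concept_id: str    # hypothetical identifier
    name: str          # scientific name as written in the source
    according_to: str  # publication that defines this usage of the name


class ConceptRegistry:
    def __init__(self):
        self._by_id = {}
        self._by_name = defaultdict(list)
        self._same_as = defaultdict(set)

    def add(self, concept):
        self._by_id[concept.concept_id] = concept
        self._by_name[concept.name].append(concept)

    def link(self, id_a, id_b):
        """Record that two concepts describe the same group of organisms."""
        self._same_as[id_a].add(id_b)
        self._same_as[id_b].add(id_a)

    def concepts_for(self, name):
        """One name may stand for several concepts."""
        return list(self._by_name.get(name, []))

    def names_for(self, concept_id):
        """One concept may be known under several names (direct links only)."""
        ids = {concept_id} | self._same_as.get(concept_id, set())
        return sorted({self._by_id[i].name for i in ids})


registry = ConceptRegistry()
registry.add(Concept("c1", "Pinus ponderosa", "Example Flora A"))
registry.add(Concept("c2", "Pinus ponderosa", "Example Flora B"))
registry.add(Concept("c3", "Pinus exampleana", "Example Flora C"))
registry.link("c2", "c3")

senses = registry.concepts_for("Pinus ponderosa")  # two distinct usages
synonyms = registry.names_for("c2")                # both names for one concept
```

A query by name returns every sense in which the name has been used, along with the source that defines each one, so the user rather than the software makes the final choice of concept.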
Under the Hood
The SEEK architecture consists of three layers: a top
one called the Analysis and Modeling System, a bottom layer
or EcoGrid that connects to data sources and computational
resources, and an intermediate layer consisting of a Semantic
Mediation System that "translates" for and between
the other layers.
These layers are tightly integrated, with each system using
parts of the others. The EcoGrid infrastructure combines features
of a data grid for ecological research and a computational
grid for analysis and modeling services. "EcoGrid will
include the SDSC Storage Resource Broker (SRB),
a mature middleware for data integration with extensive metadata
capabilities that will help in data discovery," said
Arcot Rajasekar, director of the Data
Grids Technologies group at SDSC. "Another factor
that makes the SRB very helpful in collaborative science is
that it can finely tune data sharing according to the needs
of individual researchers in a collaboration."
The top layer, the Analysis and Modeling System (AMS), is driving
SEEK from a scientific perspective. "We want scientists
to be able to comfortably interact with the AMS layer,"
said Mark Schildhauer, director of computing at NCEAS. "It's
where they ask scientific questions and view the answers,
so it has to be easy to use."
The AMS supports the scientist in designing a valid end-to-end
analysis, or "pipeline," that can successfully work
with the other two layers to identify and integrate data sets
that are appropriate for the intended analysis, and follow
the steps through to results and visualization. The AMS and
Semantic Mediation System develop the scientist's question
into queries that can reach data through the EcoGrid. As EcoGrid
tries to integrate data sets, whether they can be used together
depends on constraints such as resolution and the purpose
for which they will be used.
The AMS layer uses the middle layer, the Semantic Mediation
System or SMS, to compose the steps of the analysis pipeline
and decide on any transformations needed to integrate data
and analysis steps. "The SMS provides a mediation component
between the analytical pipeline and the data and metadata
sources available in EcoGrid," said Bertram Ludäscher,
director of the Knowledge-Based Information Systems lab at
SDSC.
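As a rough illustration of the mediation step, the Python sketch below picks, for each variable an analysis needs, a data set whose resolution is fine enough and decides whether a transformation is required. The `DataSet` fields, the catalog entries, and the selection rule are assumptions made for this example, not a description of how the SMS or EcoGrid work internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    name: str            # hypothetical catalog entry
    variable: str        # what the data set measures
    cell_size_km: float  # spatial resolution recorded in its metadata


def plan_integration(catalog, variables, target_cell_km):
    """Choose one data set per variable and the transformation it needs."""
    plan = []
    for variable in variables:
        # Data coarser than the target cannot be sharpened, so rule it out.
        candidates = [
            d for d in catalog
            if d.variable == variable and d.cell_size_km <= target_cell_km
        ]
        if not candidates:
            raise LookupError(f"no usable data set for {variable!r}")
        # Prefer the resolution closest to the target to limit resampling.
        best = max(candidates, key=lambda d: d.cell_size_km)
        action = "use as is" if best.cell_size_km == target_cell_km else "aggregate"
        plan.append((best.name, action))
    return plan


catalog = [
    DataSet("rain_1km", "rainfall", 1.0),
    DataSet("rain_50km", "rainfall", 50.0),
    DataSet("temp_10km", "temperature", 10.0),
]

plan = plan_integration(catalog, ["rainfall", "temperature"], target_cell_km=10.0)
```

Here the 50 km rainfall layer is rejected as too coarse for a 10 km analysis, the 1 km layer is scheduled for aggregation, and the temperature layer is used directly. This is the sort of resolution constraint the article says determines whether data sets can be used together.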
When everything is prepared, steps that are computationally
intensive are executed on computational nodes controlled by
EcoGrid in a distributed computational system, and SEEK will
also incorporate advanced scientific visualization capabilities
from SDSC.
Community Building
SEEK researchers are putting a major effort into Education,
Outreach, and Training because the adoption of these powerful
new technologies involves a significant learning curve and
fundamental changes in how ecology is done. "Such major
changes will take some adjustment," said Michener, "and
by hosting community workshops and other outreach activities,
we hope to initiate growing numbers of ecologists into the
benefits of doing ecology in this new way."
SEEK outreach includes a Web portal, informatics training,
and an innovative annual symposium and training program that
focuses on information technology transfer to young investigators
and students, particularly those from underrepresented groups.
"SEEK is pioneering more powerful ways of doing ecology,"
said Michener. "This collaboration is preparing the ground
for new ways of doing science to take root and flourish in
the future."
Paul Tooby is a senior science writer
at the San Diego Supercomputer Center.