
A Science Environment for Ecological Knowledge



By Paul Tooby


To predict what types of wheat will grow best in various areas of Afghanistan, a researcher could use SEEK to identify suitable data sets containing observations of where wheat has been grown in the past, and data on environmental conditions such as temperature, rainfall, and soil type.

More than 20 years of war, coupled with the worst drought in decades, have devastated Afghanistan’s agricultural production and changed large parts of the country’s environment, leaving the nation heavily dependent on donated food aid. To help rebuild Afghanistan’s agriculture, it is necessary to quickly identify which crops will grow best in the various regions of the country. This will provide crucial guidance in avoiding the use of untested, and possibly harmful, invasive seeds, thus helping to restore local agriculture and protecting the genetic resources of Afghanistan.

Researchers working on the National Science Foundation (NSF)-sponsored Science Environment for Ecological Knowledge (SEEK) are building a powerful information infrastructure that will offer a unique capability to jump-start this process. "Through SEEK, a researcher will be able to use ecological niche modeling, a tool that identifies a crop’s preferred habitat, or combination of temperature, rainfall, altitude, and other factors," said David Vieglais, a research scientist at the Natural History Museum and Biodiversity Research Center at the University of Kansas (KUNHM). "This gives predictions or ‘educated guesses’ about which crops will grow best in a given area."
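To make the idea of a "preferred habitat" concrete, the following is a minimal sketch of one very simple style of niche model, a rectangular environmental envelope: it records the range of each environmental variable at sites where a crop has been observed, then treats a new site as suitable if every variable falls inside those ranges. This is an illustration only, not the modeling method used in SEEK; the function names, variable names, and numbers are all hypothetical.

```python
def fit_envelope(occurrences):
    """Return {variable: (min, max)} over the sites where the crop was observed."""
    variables = occurrences[0].keys()
    return {
        v: (min(site[v] for site in occurrences),
            max(site[v] for site in occurrences))
        for v in variables
    }


def is_suitable(envelope, site):
    """A site is predicted suitable if every variable lies inside the envelope."""
    return all(low <= site[v] <= high for v, (low, high) in envelope.items())


# Hypothetical observations: conditions at places where a crop has grown before.
observed = [
    {"temp_c": 12.0, "rain_mm": 300.0, "altitude_m": 900.0},
    {"temp_c": 16.0, "rain_mm": 450.0, "altitude_m": 1500.0},
    {"temp_c": 14.0, "rain_mm": 380.0, "altitude_m": 1200.0},
]
envelope = fit_envelope(observed)

# Two hypothetical candidate sites to evaluate.
candidate_a = {"temp_c": 13.0, "rain_mm": 400.0, "altitude_m": 1000.0}
candidate_b = {"temp_c": 22.0, "rain_mm": 150.0, "altitude_m": 600.0}

print(is_suitable(envelope, candidate_a))  # inside every observed range
print(is_suitable(envelope, candidate_b))  # too warm, too dry, too low
```

A real niche model would weigh variables against each other and handle sparse or biased observations; the envelope only shows the basic step of turning past observations into a prediction for a new area.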

This same set of powerful yet flexible tools can help address a wide array of environmental issues, from basic ecology research to practical questions such as alerting planners to the likely spread of mosquitoes that carry the West Nile virus.

A Race for Understanding

Project Leaders
Jim Beach
University of Kansas Biodiversity Research Center
Matt Jones, Mark Schildhauer
National Center for Ecological Analysis and Synthesis
Bertram Ludäscher
San Diego Supercomputer Center
William Michener
Long-Term Ecological Research Network/University of New Mexico
Participants
Corinna Gries, Peter McCartney, Robin Schoeninger
Arizona State University
Paula Huddleston
Integrated Taxonomic Information System
David Vieglais, Ricardo Pereira, Scott Downie, Susan Gauch, Town Peterson, Aimee Stewart
University of Kansas
Jessie Kennedy
Napier University
Chad Berkley, Dan Higgins, O.J. Reichman, Jing Tao, Rich Williams
National Center for Ecological Analysis and Synthesis
Arcot Rajasekar, David Stockwell, Liying Sui, Paul Tooby, Jenny Wang, Bing Zhu
San Diego Supercomputer Center
Joseph Goguen, Victor Vianu, Yang Yu
University of California, San Diego
James Brunt, Deana Pennington, Kristin Vanderbilt, Robert Waide
University of New Mexico
Ferdinando Villa
University of Vermont
Robert Peet
University of North Carolina

"We’re in a race," says Bill Michener, a SEEK principal investigator and staff scientist at the NSF Long-Term Ecological Research (LTER) Network Office at the University of New Mexico. "On the positive side, emerging information technologies can make SEEK a powerful system that provides scientists and policy makers with the science-based answers they need. But today’s complex environmental challenges are multiplying so fast that we urgently need this advanced infrastructure now to support both basic research and sound environmental management."

SEEK is an ambitious five-year NSF Information Technology Research project. "It’s a lot of work to build a comprehensive system like SEEK, but it’s the key to being able to achieve an overall understanding of ecological systems," said Michener. "Even the general public now understands that all the parts of an ecosystem are connected, so that to understand them you need to encompass all the components in your model."

Moreover, ecological questions range over extremes of spatial and temporal scales, and investigations involve all of the physical and life sciences. Thus, a comprehensive system requires the kind of large-scale infrastructure being built in the SEEK project, which will be capable of scaling up and integrating data of all kinds related to the ecosystem. In addition, the SEEK tools will provide researchers with analysis and visualization capabilities that free them from needing specialized IT knowledge, and give them a powerful platform to do science much more rapidly and on a larger scale than possible before.

The SEEK initiative is an outgrowth of ecological and biodiversity informatics research and includes computer scientists, ecologists, and technologists. The lead organizations involved are part of the Partnership for Biodiversity Informatics, a consortium made up of the National Center for Ecological Analysis and Synthesis (NCEAS) at UC Santa Barbara; the San Diego Supercomputer Center (SDSC) and UC San Diego; the Natural History Museum and Biodiversity Research Center at the University of Kansas (KU); and the LTER Network Office at the University of New Mexico. Additional partnering institutions are Arizona State University, the University of North Carolina, the University of Vermont, and Napier University in Scotland.

A Quick Tour of SEEK

In our example, the researcher needs to predict what types of wheat will grow best in various areas of Afghanistan. First the researcher uses SEEK to identify suitable data sets containing observations of where wheat has been grown in the past, and data on environmental conditions such as temperature, rainfall, and soil type. With a common interface to different data sources, SEEK then pulls in specimen databases that have locality information. Initially, SEEK will include sources such as the Species Analyst at KU, which accesses museum databases, the MetaCat catalog from the Knowledge Network for Biocomplexity (NCEAS, LTER, etc.), and other ecological data sources.

The other type of data needed is environmental coverage layers, which may range in scale from global to county. "The trick is that environmental data layers and species distribution data layers have to be integrated using the same cell size and spatial extent, taking into consideration the effects that scaling and other transformations might have on an analysis," said Matt Jones, SEEK project manager and a researcher at NCEAS. "SEEK will then transform them so that they’re all at the same scale." Once the data is available in SEEK, it is pushed into the ecological niche model, which may be running on a different machine. SEEK finds the computational resources and produces the niche model that identifies the preferred habitat for each crop of interest. The results are then overlaid onto a map of Afghanistan to highlight which crops are predicted to grow best in each area, vital information for boosting agricultural production.
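As a rough illustration of what putting layers on "the same cell size and spatial extent" involves, the sketch below resamples two gridded layers onto one shared target grid using nearest-neighbor lookup. It is a simplified assumption about the general technique, not SEEK's implementation: the layer format, the function name `resample_nearest`, and the values are hypothetical, and nearest-neighbor sampling ignores exactly the scaling effects the quote warns about.

```python
def resample_nearest(layer, target_extent, target_cell):
    """Resample a raster onto a target grid using nearest-neighbor lookup.

    layer: dict with "extent" (xmin, ymin, xmax, ymax), "cell" (cell size),
           and "values" (list of rows; the first row is the northern edge).
    Cells of the target grid that fall outside the source layer become None.
    """
    xmin, ymin, xmax, ymax = target_extent
    n_cols = round((xmax - xmin) / target_cell)
    n_rows = round((ymax - ymin) / target_cell)
    src_xmin, _src_ymin, _src_xmax, src_ymax = layer["extent"]
    src_rows = len(layer["values"])
    src_cols = len(layer["values"][0])

    out = []
    for r in range(n_rows):
        y = ymax - (r + 0.5) * target_cell          # center of target row
        src_r = int((src_ymax - y) // layer["cell"])
        row = []
        for c in range(n_cols):
            x = xmin + (c + 0.5) * target_cell      # center of target column
            src_c = int((x - src_xmin) // layer["cell"])
            inside = 0 <= src_r < src_rows and 0 <= src_c < src_cols
            row.append(layer["values"][src_r][src_c] if inside else None)
        out.append(row)
    return out


# Hypothetical layers covering the same area at different resolutions.
rain = {"extent": (0, 0, 4, 4), "cell": 2,
        "values": [[100, 200],
                   [300, 400]]}
temp = {"extent": (0, 0, 4, 4), "cell": 1,
        "values": [[10, 11, 12, 13],
                   [14, 15, 16, 17],
                   [18, 19, 20, 21],
                   [22, 23, 24, 25]]}

# Shared target grid: a 2 x 2 window in the middle, with cell size 1.
target_extent, target_cell = (1, 1, 3, 3), 1
rain_aligned = resample_nearest(rain, target_extent, target_cell)
temp_aligned = resample_nearest(temp, target_extent, target_cell)
print(rain_aligned)
print(temp_aligned)
```

Once both layers share a grid, cell `[r][c]` in one layer describes the same patch of ground as cell `[r][c]` in the other, which is what a niche model needs in order to combine them.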

"An important feature of SEEK is that it goes beyond providing data integration and analysis services," said Jones. "We’re building SEEK to also capture the scientific workflows, the steps actually carried out as scientists use SEEK." For example, the output from the ecological niche model for Afghanistan may turn out to be useful not only in the researcher’s future work but in other apparently unrelated processes. "The whole environment can spiral up. It can sort of self-assemble and become ‘smarter,’ " said Vieglais. This acceleration can lead to what, in effect, are vastly expanded "virtual collaborations" that could eventually become community-wide.
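One way to picture capturing a scientific workflow is a small recorder that remembers each step as it is run, so the same sequence can later be inspected or re-applied to new data. The sketch below is a hypothetical illustration of that idea, not SEEK's workflow system; the class `RecordedWorkflow` and the example steps are invented for this purpose.

```python
class RecordedWorkflow:
    """Runs analysis steps one at a time and keeps a record of each."""

    def __init__(self):
        self.steps = []    # (name, function) pairs, in the order they were run
        self.history = []  # one record per executed step

    def run_step(self, name, func, data):
        """Apply one step to the data and record what happened."""
        result = func(data)
        self.steps.append((name, func))
        self.history.append({"step": name, "input": data, "output": result})
        return result

    def replay(self, data):
        """Re-apply the recorded steps, in order, to a new input."""
        for _name, func in self.steps:
            data = func(data)
        return data


def drop_missing(values):
    """Remove missing observations (represented here as None)."""
    return [v for v in values if v is not None]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


workflow = RecordedWorkflow()
cleaned = workflow.run_step("drop_missing", drop_missing, [10.0, None, 20.0])
average = workflow.run_step("mean", mean, cleaned)
print(average)

# A second researcher can reuse the captured steps on different data.
print(workflow.replay([1.0, 2.0, None, 3.0]))
```

Because the record holds both the steps and their inputs and outputs, it can serve as documentation of an analysis as well as a reusable recipe.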

A Rose By Any Other Name

In studying the remarkable diversity of life on the Earth, systematists have developed correspondingly rich formal systems for applying names to organisms, which promote the reuse of scientific names for new species concepts as biological classifications evolve. Because the formal rules emphasize retaining the historical priority of names, however, the same scientific name is sometimes used in different senses to describe distinct species.

The dates of the fieldwork can indicate which sense of a name, and therefore which species, a data set refers to. But in many cases, even knowing the dates will not be enough, and a researcher will have to dig deeper into the semantics and broader context of name usage to determine which species a data set refers to.

There is also the converse situation, where a number of different scientific names refer to what is now only one or two species. This is true of the aster, a common genus of garden herbs in the daisy family. Similar naming or semantic issues exist in many scientific disciplines, and both cases–one name for multiple species, or concepts, and multiple names for one concept–create significant obstacles for automating the search for compatible data sets.

To make the problem even more interesting, ecological surveys and inventories go back hundreds of years, leaving a complex historical trail of multiple classifications for the same group of organisms, as well as additional issues of inconsistent field identification and application of names.

As part of a broad trend to integrate scientific disciplines and data, pressure is growing to make taxonomic information available and interoperable. To do this, SEEK researchers are building concept-based taxonomies, and in the future the community will need to learn to associate all references to organisms not just to a name but to a concept as well, by using adequate descriptive metadata. "We’re basically ‘disambiguating’ 250 years of nomenclatural fuzziness caused by the many-to-many relationship between scientific names and biological species concepts," said Jim Beach of the University of Kansas Biodiversity Research Center. "While formal nomenclature rules have worked well for experts using familiar names in current use, the system needs to move forward and gain the capability to unambiguously identify and map among biological species in order to support large-scale computer-based data integration and all the benefits it will bring."

In the SEEK architecture, concepts will be mapped to each other and to names, so that Web queries looking for data sets about Pinus ponderosa will know about the different uses of that name, and about the distinctions among the underlying species concepts, allowing the user of the cyberinfrastructure to make an informed choice of which species and concepts to use.
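The many-to-many relationship between names and concepts can be sketched as a small registry in which a name may point to several concepts and a concept may carry several names. Everything in this example is hypothetical: the class `ConceptRegistry`, the concept identifiers, the placeholder synonym names, and the data set labels are invented for illustration and do not describe SEEK's actual data model or any real taxonomic treatment.

```python
from collections import defaultdict


class ConceptRegistry:
    """Maps scientific names to species concepts and concepts to data sets."""

    def __init__(self):
        self._concepts_by_name = defaultdict(set)
        self._names_by_concept = defaultdict(set)
        self._datasets_by_concept = defaultdict(set)

    def link(self, name, concept_id):
        """Record that a name has been used for a concept."""
        self._concepts_by_name[name].add(concept_id)
        self._names_by_concept[concept_id].add(name)

    def tag_dataset(self, dataset_id, concept_id):
        """Record which concept a data set's observations refer to."""
        self._datasets_by_concept[concept_id].add(dataset_id)

    def concepts_for(self, name):
        return sorted(self._concepts_by_name.get(name, ()))

    def names_for(self, concept_id):
        return sorted(self._names_by_concept.get(concept_id, ()))

    def datasets_for_name(self, name):
        """Group matching data sets by concept so the user can choose."""
        return {
            concept: sorted(self._datasets_by_concept.get(concept, ()))
            for concept in self.concepts_for(name)
        }


registry = ConceptRegistry()
# One name used in two senses (concept IDs are placeholders).
registry.link("Pinus ponderosa", "concept-A")
registry.link("Pinus ponderosa", "concept-B")
# Two placeholder names that refer to a single concept.
registry.link("Name X", "concept-C")
registry.link("Name Y", "concept-C")

registry.tag_dataset("survey-1", "concept-A")
registry.tag_dataset("survey-2", "concept-B")
registry.tag_dataset("survey-3", "concept-B")

print(registry.datasets_for_name("Pinus ponderosa"))
print(registry.names_for("concept-C"))
```

A query by name returns its results grouped by concept rather than as one merged list, which is what lets the user make an informed choice about which sense of the name applies.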

Under the Hood

The SEEK architecture consists of three layers–a top one called the Analysis and Modeling System, a bottom layer or EcoGrid that connects to data sources and computational resources, and an intermediate layer consisting of a Semantic Mediation System that "translates" for and between the other layers.

These layers are tightly integrated, with each system using parts of the others. The EcoGrid infrastructure combines features of a data grid for ecological research and a computational grid for analysis and modeling services. "EcoGrid will include the SDSC Storage Resource Broker (SRB), a mature middleware for data integration with extensive metadata capabilities that will help in data discovery," said Arcot Rajasekar, director of the Data Grids Technologies group at SDSC. "Another factor that makes the SRB very helpful in collaborative science is that it can finely tune data sharing according to the needs of individual researchers in a collaboration."

The top-layer Analysis and Modeling System (AMS) is driving SEEK from a scientific perspective. "We want scientists to be able to comfortably interact with the AMS layer," said Mark Schildhauer, director of computing at NCEAS. "It’s where they ask scientific questions and view the answers, so it has to be easy to use."

The AMS supports the scientist in designing a valid end-to-end analysis, or "pipeline," that can successfully work with the other two layers to identify and integrate data sets that are appropriate for the intended analysis, and follow the steps through to results and visualization. The AMS and Semantic Mediation System develop the scientist’s question into queries that can reach data through the EcoGrid. As EcoGrid tries to integrate data sets, whether they can be used together depends on constraints such as resolution and the purpose for which they will be used.

The AMS layer uses the middle layer, the Semantic Mediation System or SMS, to compose the steps of the analysis pipeline and decide on any transformations needed to integrate data and analysis steps. "The SMS provides a mediation component between the analytical pipeline and the data and metadata sources available in EcoGrid," said Bertram Ludäscher, director of the Knowledge-Based Information Systems lab at SDSC.
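A toy version of such mediation might compare what an analysis step expects with what a data set's metadata declares, and insert a conversion when the two differ. The sketch below assumes a very simple metadata format (a quantity name and a unit) and a small hand-written conversion table; these structures and the function `mediate` are hypothetical and far simpler than an ontology-based mediator.

```python
# Known unit conversions, keyed by (from_unit, to_unit). Illustrative only.
CONVERSIONS = {
    ("fahrenheit", "celsius"): lambda v: (v - 32.0) * 5.0 / 9.0,
    ("inch", "mm"): lambda v: v * 25.4,
}


def mediate(dataset, required):
    """Return the data set's values in the form an analysis step expects.

    dataset:  dict with "quantity", "unit", and "values".
    required: dict with "quantity" and "unit" declared by the analysis step.
    Raises ValueError when the data cannot be made to fit the step.
    """
    if dataset["quantity"] != required["quantity"]:
        raise ValueError(
            f"step needs {required['quantity']}, data set holds {dataset['quantity']}"
        )
    if dataset["unit"] == required["unit"]:
        return list(dataset["values"])  # already compatible, no transformation
    convert = CONVERSIONS.get((dataset["unit"], required["unit"]))
    if convert is None:
        raise ValueError(
            f"no known conversion from {dataset['unit']} to {required['unit']}"
        )
    return [convert(v) for v in dataset["values"]]


# Hypothetical data set metadata and analysis-step requirement.
rainfall = {"quantity": "rainfall", "unit": "inch", "values": [10.0, 20.0]}
step_input = {"quantity": "rainfall", "unit": "mm"}

print(mediate(rainfall, step_input))
```

The point of the design is that the scientist states what the analysis needs and the mediator, not the scientist, works out whether a given data set can supply it and which transformation bridges the gap.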

When everything is prepared, steps that are computationally intensive are executed on computational nodes controlled by EcoGrid in a distributed computational system, and SEEK will also incorporate advanced scientific visualization capabilities from SDSC.

Community Building

SEEK researchers are putting a major effort into Education, Outreach, and Training because the adoption of these powerful new technologies involves a significant learning curve and fundamental changes in how ecology is done. "Such major changes will take some adjustment," said Michener, "and by hosting community workshops and other outreach activities, we hope to initiate growing numbers of ecologists into the benefits of doing ecology in this new way."

SEEK outreach includes a Web portal, informatics training, and an innovative annual symposium and training program that focuses on information technology transfer to young investigators and students, particularly those from underrepresented groups.

"SEEK is pioneering more powerful ways of doing ecology," said Michener. "This collaboration is preparing the ground for new ways of doing science to take root and flourish in the future."

Paul Tooby is a senior science writer at the San Diego Supercomputer Center.