EnVision, V.19, No. 3, San Diego Supercomputer Center
Discovering Knowledge in Massive Data Collections

by Paul Tooby


As an airplane flies high over the New Mexico landscape, a NASA hyperspectral imaging instrument on board records the light reflected from the ground below in more than 200 different wavelength bands. This "hyperspectral" data contains important clues that scientists can use in automated, cost-effective methods for distinguishing the type of land cover below, classifying it as trees or grass, productive or dormant, wet or dry.

Having access to accurate land cover classification is vital to a range of users. Emergency planners rely on this information to develop estimates of forest fire risk. For ecologists studying the biodiversity of plant species in an ecosystem, automated land cover classification is the only practical way to get this information over wide areas. And climate researchers use this broad-scale classification of land cover in conjunction with models of future climate change.

But modern sensors such as this hyperspectral imaging instrument produce large volumes of information that are so complex that they can overwhelm traditional methods of analysis. "When you have over 200 wavelengths of radiation readings for three quarters of a million points on the ground, many of which represent mixtures of land covers, and all their combinations to sort through, it's very difficult to know which frequency bands will best predict a given set of land cover characteristics," said Deana Pennington, research assistant professor at the NSF Long Term Ecological Research (LTER) Network at the University of New Mexico. "So we needed powerful data mining tools that could sift through all these bands to tell us which ones are important in characterizing the subtle differences between land cover types." With 24 sites, the LTER Network is a national-scale effort to investigate ecological processes across diverse ecosystems.

Working with LTER ecologists, researcher Tony Fountain of the San Diego Supercomputer Center (SDSC) at the University of California, San Diego (UCSD) and his team in SDSC's Knowledge and Information Discovery Lab (SKIDL) have developed powerful, user-friendly data mining tools to meet this need.

"The SKIDL group's data mining projects underscore SDSC's leadership in providing advanced data technologies," said Chaitan Baru, co-director of SDSC's Data and Knowledge Systems (DAKS) program. "SKIDL data mining tools play an important role in the large-scale data mining required for today's massive data collections and will exploit the next generation of data-centric computing power within NPACI."

Hallmarks of the Information Age, massive data collections of growing complexity are being generated in all disciplines of science and beyond. Both improved experimental sensors and larger simulations on supercomputers like the National Partnership for Advanced Computational Infrastructure's (NPACI) Blue Horizon produce complicated data collections that contain billions or even trillions of data points.

In addition to benefiting environmental research, SDSC data mining is playing a role in medical research to develop more accurate diagnosis and prognosis of prostate cancer, and in a structural engineering project to deploy an intelligent system of distributed sensors that will be able to identify in real time both gradual age-related and sudden earthquake damage to California's bridges, potentially saving lives and billions of dollars. In developing the SKIDL data mining tools, Fountain's team has worked in close collaboration with researchers in multiple areas. "This ensures that the tools are versatile," said Fountain. "And it demonstrates the maturity of data mining, with the same set of methods able to yield valuable results in very distinct fields."

Monitoring Bridge Safety

Fountain and SKIDL researchers Neil Cotofana and Peter Shin are participating in an NSF Information Technology Research (ITR) project in collaboration with the Department of Structural Engineering in UCSD's Jacobs School of Engineering (JSOE). In this project, structural engineering expertise and advanced information technologies, including SKIDL data mining, are being integrated into what the researchers envision as a "smart" research and health monitoring system for California's bridges and other civil infrastructure.

Currently, highway bridges and other infrastructure are designed, built, and maintained as separate structures that are monitored visually through difficult and time-consuming on-site inspections. "But we're aiming to transform that," says Ahmed Elgamal, professor and chair of JSOE's Department of Structural Engineering. "By integrating sensors and data mining into an 'intelligent' information technology framework from the beginning, we believe that we can monitor the safety of bridges in real time, creating a system that can actively respond to events such as earthquakes, and learn from experience." The system could execute preprogrammed responses to events it detects, for example, increasing sensor sampling rates to better investigate possible damage, and issuing a warning if it deems a bridge unsafe.
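
A preprogrammed response of the kind described here can be pictured as a small set of rules applied to each sensor reading. The sketch below is only an illustration under stated assumptions: the names `SensorState` and `respond`, the two strain thresholds, and the sampling rates are all hypothetical and do not come from the project.

```python
from dataclasses import dataclass


@dataclass
class SensorState:
    sampling_hz: float     # current sampling rate of the sensor
    warning: bool = False  # True once the bridge has been flagged unsafe


# Hypothetical thresholds (in microstrain); real values would come from
# structural engineers, not from this sketch.
EVENT_LEVEL = 100.0
UNSAFE_LEVEL = 500.0


def respond(state: SensorState, strain: float, fast_hz: float = 1000.0) -> SensorState:
    """Apply a preprogrammed response to a single strain reading."""
    magnitude = abs(strain)  # compression and tension are treated alike
    if magnitude >= UNSAFE_LEVEL:
        # Severe reading: sample fast and raise a warning.
        return SensorState(sampling_hz=fast_hz, warning=True)
    if magnitude >= EVENT_LEVEL:
        # Possible event: sample faster to investigate, keep any earlier warning.
        return SensorState(sampling_hz=max(state.sampling_hz, fast_hz),
                           warning=state.warning)
    return state  # quiet reading: nothing changes


state = SensorState(sampling_hz=10.0)
for reading in [20.0, -150.0, 600.0]:
    state = respond(state, reading)
```

In a deployed system the decision would presumably rest on predictive models rather than fixed thresholds; the rule table simply makes the "detect, then respond" loop concrete.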

Although he recognizes that it will take time to develop and deploy these technologies, which will involve sweeping changes in the practice of structural engineering and infrastructure maintenance, Elgamal is excited about the benefits that this smart system will bring. The old adage "a stitch in time saves nine" holds just as true for bridge repair as it does for clothes, and "the new framework with data mining will allow detection of gradual deterioration so that it can be repaired inexpensively, before it reaches a serious, and much more expensive, stage," said Elgamal. This can significantly reduce the current multibillion-dollar annual costs to California's taxpayers for expensive infrastructure repairs and early replacement.

"In addition to the immediate uses of these advanced information technologies in monitoring bridge safety, the most important long-term benefit of this collaboration will be to make these powerful IT tools a normal part of structural engineering," said Elgamal. "I'm sure that in the future they'll be used for a great many things we haven't even thought of yet."

Along with aging, civil infrastructure is also subject to sudden events such as earthquakes. The California Division of Mines and Geology has estimated that seismic risk in California costs more than $4 billion annually. Following an earthquake, it is vital to assess the integrity of bridges and other civil infrastructure immediately, to be able to intelligently reroute traffic and direct emergency and repair vehicles. The flexible and scalable software framework, which will initially be deployed with sensor networks in a number of testbed bridges, will help make this a reality.

The information that the researchers are collecting from each bridge includes strain gauge data, which tells engineers whether a given point on the bridge is being compressed or stretched. This is supplemented by high-resolution video, which can identify the cause of a detected event (a large vehicle crossing the bridge, for example) and also measure bridge response.

Within the project's information framework, SKIDL data mining tools will be used in several steps along the processing pipeline for the sensor and video data. "We find that rather than functioning as a stand-alone tool or afterthought, data mining can make the biggest contribution when we collaborate closely with the structural engineers to help them integrate these advanced information technologies appropriately right from the beginning," said Fountain.

The researchers are building SKIDL's data mining into the sensor network "upstream" to help decide in real time which data to keep, and which can be discarded. As a second phase, data mining will support predictive models that help decide in real time if a bridge has been damaged, either gradually or suddenly, and how badly and where. A third phase of data mining will be research oriented, with investigators using the interactive portal to analyze both streaming and archived data for new insights into the detection, causes, and solutions to long-term deterioration and sudden damage, and to improve physics-based models for more accurate bridge simulations used in both research and design.
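
The first, "upstream" phase can be illustrated with a simple filter that keeps only the readings around a detected event and lets the quiet stretches be discarded. This is a minimal sketch, not the SKIDL method, which the article does not describe: the function name `keep_mask`, the running-mean baseline, and the `window`, `threshold`, and `context` parameters are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean


def keep_mask(samples, window=5, threshold=50.0, context=1):
    """Mark which strain samples to keep.

    A sample counts as an event when it differs from the mean of the
    previous quiet samples (up to `window` of them) by more than
    `threshold`. Each event is kept along with `context` neighbors on
    either side; everything else can be discarded.
    """
    recent = deque(maxlen=window)  # rolling baseline of quiet samples
    events = []
    for x in samples:
        baseline = mean(recent) if recent else x
        is_event = abs(x - baseline) > threshold
        events.append(is_event)
        if not is_event:
            recent.append(x)  # events do not contaminate the baseline

    keep = [False] * len(samples)
    for i, is_event in enumerate(events):
        if is_event:
            lo = max(0, i - context)
            hi = min(len(samples), i + context + 1)
            for j in range(lo, hi):
                keep[j] = True
    return keep


# Hypothetical strain trace: a vehicle crossing produces the two large peaks.
trace = [0, 1, 0, 2, 120, 130, 1, 0, 1]
kept = [x for x, k in zip(trace, keep_mask(trace)) if k]
```

Here only the two peaks and their immediate neighbors survive, so four of the nine readings are stored. A real sensor network would work on a live stream rather than a list, but the keep-or-discard decision has the same shape.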

Data from multiple bridges and other structures (each with thousands of sensors, along with video streams and archives of analysis results, all recorded continuously for years) will result in multi-terabyte data collections that will be managed, analyzed, and stored with SDSC's assistance. And the computationally intensive real-time data mining, in which users apply algorithms to large data sets, run simulations, and compare the results with real data, is large-scale research that will require the resources of the NSF TeraGrid and other data-intensive computing resources.

Earth Science Insights

As described above, Fountain and SKIDL researcher Peter Shin are collaborating in the NPACI Earth Systems Science thrust with Pennington and other LTER colleagues, as well as researchers at the Northwest Alliance for Computational Science and Engineering at Oregon State University, to develop land cover classification methods.

In hyperspectral data, the intensity of each of the more than 200 wavelength bands is recorded for each pixel in the image, forming a "spectral signature" of that location on the Earth running from infrared to ultraviolet. To derive the land cover classification, the spectral signatures are fed into computer classification methods. But using the entire 200-plus frequency spectrum is costly in computer time and other ways, and typically only a smaller number of wavelength bands is needed for effective classification. The SKIDL data mining tools help the researchers eliminate the unnecessary wavelengths and efficiently identify the frequency bands that best predict each of the many types of land cover, facilitating analysis and improving the stability of land cover classification models.
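
One common way to pick out informative bands is to score each band by how well it separates known land cover classes and keep only the top scorers. The sketch below uses a Fisher-style ratio (squared gap between class means divided by the within-class variance) for just two classes. The article does not say which selection method the SKIDL tools use, so the scoring rule, the function name `rank_bands`, and the toy data are all assumptions.

```python
from statistics import mean, pvariance


def rank_bands(spectra, labels, top_k=2):
    """Return the indices of the `top_k` most discriminating bands.

    spectra: one list of band intensities per pixel (the spectral signature).
    labels:  0 or 1 for each pixel, e.g. 0 = grass, 1 = trees.
    """
    n_bands = len(spectra[0])
    scores = []
    for b in range(n_bands):
        class0 = [s[b] for s, y in zip(spectra, labels) if y == 0]
        class1 = [s[b] for s, y in zip(spectra, labels) if y == 1]
        gap = (mean(class0) - mean(class1)) ** 2
        spread = pvariance(class0) + pvariance(class1)
        scores.append(gap / (spread + 1e-12))  # small term avoids dividing by zero
    order = sorted(range(n_bands), key=lambda b: scores[b], reverse=True)
    return order[:top_k]


# Toy example with 3 bands and 4 labeled pixels (values are made up).
pixels = [
    [0.1, 0.5, 0.3],
    [0.2, 0.5, 0.5],
    [0.9, 0.5, 0.5],
    [0.8, 0.5, 0.7],
]
cover = [0, 0, 1, 1]
best = rank_bands(pixels, cover)
```

In this toy data the first band separates the two classes cleanly, the second is identical for every pixel and carries no information, and the third overlaps. Real hyperspectral work involves many cover types and mixed pixels, so a per-band score like this is only a starting point.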

"Not only are these data mining methods far faster, helping us efficiently derive land cover classifications from large hyperspectral remote sensing data collections," said Pennington, "but with standardized workflows we're able to compare and integrate the land cover classifications with other data products, which makes them more broadly useful." In this way, the researchers are building a system for environmental data management that is both comprehensive and integrated. SDSC provides additional data management support by storing the growing 188-gigabyte hyperspectral data collection, with typical image size around 600 megabytes, in the High Performance Storage System (HPSS).

Improving Cancer Diagnosis and Prognosis

Fountain and SKIDL researcher Hector Jasso are also working with a consortium studying the molecular biology of prostate cancer with the goal of improving strategies for diagnosis and prognosis of cancer. The multi-institution consortium consists of investigators at the Rebecca and John Moores UCSD Cancer Center, the Sidney Kimmel Cancer Center, the Ludwig Institute for Cancer Research, and the Veterans Medical Research Foundation of San Diego.

Prostate cancer is the most common type of cancer and second leading cause of cancer-related deaths among U.S. men. Each year about 185,000 men are diagnosed with this cancer, and by age 75, half to three-quarters of men will have some cancerous changes in their prostate glands. But not all prostate cancers are the same, and with current tests, doctors treating patients often cannot predict which tumors will grow aggressively despite today's therapies, and which will grow so slowly that, depending on a patient's age, the tumors will not progress to a life-threatening stage even if left untreated. Thus, improved strategies for diagnosis, prognosis of the course of the disease, and prediction of treatment outcomes can give physicians reliable guidance in deciding appropriate therapies and avoiding needless treatments.

As a step toward solving these problems, the researchers are investigating how data mining of genetic information can improve prostate cancer diagnosis. In contrast to earlier genetic methods, which could examine only around a dozen genes at once, in this study the researchers are using data from microarray gene expression experiments, a powerful technique that enables the rapid analysis of thousands of genes at the same time. Having data on so many genes, combined with SDSC-supplied data mining tools capable of analyzing this wealth of complex information, allows the researchers to discover which specific genes or combinations of genes can best classify prostate tissue as tumor or not tumor.

While the genes identified this way don't necessarily play a functional role in the cancer itself, they can be very useful as markers that help the investigators improve tumor diagnosis, and in the future lead to strategies for more accurate prognosis of the course of a cancer. But microarray data are very complex, simultaneously assaying more than 10,000 genes for expression in a relatively small number of samples, typically 50 to 200 patients. Thus, investigators are faced with a large number of genes, or variables, for only a small number of samples.

To enable the researchers to make sense of all this information, Jasso applied advanced SKIDL data mining tools to the data, and found that these machine learning methods (which get better at solving problems with experience) are significantly better than previous methods at identifying marker genes that can help classify prostate tissue as either tumor or not tumor. "We develop the classification model by giving it samples of data in which we already know whether the person has cancer," said Fountain. "This 'trains' the model, which can then be applied to making more accurate diagnoses for patients where the result is unknown."
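
The train-then-apply pattern Fountain describes can be shown with one of the simplest classifiers, a nearest-centroid model: average the expression profiles of the known tumor and non-tumor samples, then assign a new sample to whichever average it lies closer to. This is a sketch of the general idea only. The article does not name the machine learning methods SKIDL used, and the function names, labels, and expression values below are invented for illustration.

```python
from math import dist
from statistics import mean


def train(samples, labels):
    """Compute one average expression profile (centroid) per label."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, y in zip(samples, labels) if y == label]
        centroids[label] = [mean(column) for column in zip(*rows)]
    return centroids


def predict(centroids, sample):
    """Return the label whose centroid is closest to the sample."""
    return min(centroids, key=lambda label: dist(centroids[label], sample))


# Made-up expression levels for two marker genes in four labeled samples.
known = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 0.9]]
status = ["tumor", "tumor", "not tumor", "not tumor"]

model = train(known, status)           # "training" on samples with known status
result = predict(model, [0.95, 0.15])  # applying the model to a new sample
```

With more than 10,000 genes and only 50 to 200 samples, a model like this would first need the gene list cut down to a handful of markers, which is the selection problem the paragraphs above describe.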

While the multi-megabyte microarray data sets are not unusually large in themselves, the data mining analysis is very computationally demanding because of the enormous number of parameters that must be explored in relation to the 10,000 potential genetic markers. Thus, this application will benefit greatly from the power of SDSC's planned data-intensive supercomputer, DataStar, which will be designed to support both processor-intensive and data-intensive computation.

As the investigators use data mining to improve cancer diagnosis, they plan to build on this knowledge. "Our long-term goal is to apply what we're learning in improving methods for the diagnosis of prostate cancer to developing better strategies for determining prognosis of the disease, when we have sufficient clinical data showing disease recurrence in the patient population previously treated for early-stage disease," said William Wachsman of the Veterans Medical Research Foundation of San Diego, one of the investigators in the prostate study. "This will be significant in helping to manage the treatment of patients with prostate cancer."

In addition to improving cancer diagnosis, the data mining results can also be of use to cancer researchers as they explore the causes of cancer. "The genes found to be diagnostic markers may or may not be directly involved in the cancer," said Wachsman, "but identifying these 'genes of interest' gives us valuable clues as we try to unravel the complex biological mechanisms underlying the cancer process."

In their data mining research, the SKIDL researchers are also incorporating Web services using Open Grid Services Architecture (OGSA) and Globus tools to enable these data mining tools to work as grid applications. The SKIDL team anticipates that all of these projects will have value beyond the immediate applications. "Working with the researchers, we're designing and testing a general data mining approach and architecture that will be useful not only in structural engineering, cancer, and environmental science, but in the many other disciplines that are struggling with data collections of growing size and complexity," said Fountain. "In the Information Age, knowledge is key, and we're helping researchers find the knowledge in their data."




Thousands of strain gauge sensors will be deployed on testbed California bridges, sending back data to be analyzed in real time for both research and health monitoring of bridges and other civil infrastructure. Peaks on the graph indicate a detected event, which time-synched video shows is caused by a van passing over the sensors.


A multi-wavelength, or hyperspectral, image used for land cover classification. Showing the Sevilleta LTER site in central New Mexico, a National Wildlife Refuge since 1973, this false color composite uses near infrared, red, and blue wavelengths. The dark, circular feature is Black Butte, a volcanic remnant. The east-west line along the south edge of Black Butte is the boundary of the Sevilleta, with grazing and development occurring to the north, but not to the south. Black and white lines are paved and dirt roads, respectively.


Airborne hyperspectral image data were compared with surveys of vegetation on the ground to “train” a classification model, which can then automatically generate land cover classifications such as this one from hyperspectral image data.




Photomicrograph of intermediate grade prostate cancer, magnified 80 times, used in research on applying data mining technologies to improve diagnosis of prostate cancer.


Project Leaders
Tony Fountain,

Bridge monitoring:
Ahmed Elgamal
and Joel P. Conte
Structural Engineering Department,

Mohan Trivedi

Environmental research:
Cherri Pancake

Deana Pennington

Cancer research:
Webster Cavenee
Ludwig Institute for Cancer

Daniel Mercola
Sidney Kimmel Cancer Center

David Tarin

William Wachsman
San Diego VA

Neil Cotofana,
Longjiang Ding,
Hector Jasso,
Peter Shin,
Jenny Wang

John Vande Castle

Dylan Keon

Robert Stuart
San Diego VA and Moores UCSD Cancer Center

Igor Klacanksy
Moores UCSD Cancer Center

Michael Fraser,
James L. Yan,
Daniel Xianfei He,
Tarak Ghandi,
Remy Chang,
Minh Phan,
Dung Nguyen