As an airplane flies high over the New Mexico landscape,
a NASA hyperspectral imaging instrument on board
records the light reflected from the ground below in
more than 200 different wavelength bands. This "hyperspectral"
data contains important clues that scientists can use
in automated, cost-effective methods for distinguishing
the type of land cover below, classifying it as
trees or grass, productive or dormant, wet or dry.
Having access to accurate land cover classification
is vital to a range of users. Emergency planners rely
on this information to develop estimates of forest fire
risk. For ecologists studying the biodiversity of plant
species in an ecosystem, automated land cover classification
is the only practical way to get this information over
wide areas. And climate researchers use this broad-scale
classification of land cover in conjunction with models
of future climate change.
But modern sensors such as this hyperspectral imaging
instrument produce large volumes of information that
are so complex that they can overwhelm traditional methods
of analysis. "When you have over 200 wavelengths
of radiation readings for three quarters of a million
points on the ground, many of which represent mixtures
of land covers, and all their combinations to sort
through, it's very difficult to know which frequency
bands will best predict a given set of land cover characteristics,"
said Deana Pennington, research assistant professor
at the NSF Long Term Ecological Research (LTER) Network
at the University of New Mexico. "So we needed
powerful data mining tools that could sift through all
these bands to tell us which ones are important in characterizing
the subtle differences between land cover types."
With 24 sites, the LTER Network is a national-scale
effort to investigate ecological processes across diverse ecosystems.
Working with LTER ecologists, researcher Tony Fountain
of the San Diego Supercomputer Center (SDSC) at the
University of California, San Diego (UCSD) and his team
in SDSC's Knowledge and Information Discovery Lab (SKIDL)
have developed powerful, user-friendly data mining
tools to meet this need.
"The SKIDL group's data mining projects underscore
SDSC's leadership in providing advanced data technologies,"
said Chaitan Baru, co-director of SDSC's Data and
Knowledge Systems (DAKS) program. "SKIDL data mining
tools play an important role in the large-scale
data mining required for today's massive data collections
and will exploit the next generation of data-centric
computing power within NPACI."
Hallmarks of the Information Age, massive data collections
of growing complexity are being generated in all disciplines
of science and beyond. Both improved experimental sensors
and larger simulations on supercomputers like the National
Partnership for Advanced Computational Infrastructure's
(NPACI) Blue Horizon produce complicated data collections
that contain billions or even trillions of data points.
In addition to benefitting environmental research,
SDSC data mining is playing a role in medical research
to develop more accurate diagnosis and prognosis of
prostate cancer, and in a structural engineering project
to deploy an intelligent system of distributed sensors
that will be able to identify in real time both gradual
age-related and sudden earthquake damage to California's
bridges, potentially saving lives and billions of dollars.
In developing the SKIDL data mining tools, Fountain's
team has worked in close collaboration with researchers
in multiple areas. "This ensures that the tools
are versatile," said Fountain. "And it demonstrates
the maturity of data mining, with the same set of methods
able to yield valuable results in very distinct fields."
Monitoring Bridge Safety
Fountain and SKIDL researchers Neil Cotofana and Peter
Shin are participating in an NSF Information Technology
Research (ITR) project in collaboration with the Department
of Structural Engineering in UCSD's Jacobs School of
Engineering (JSOE). In this project, structural engineering
expertise and advanced information technologies, including
SKIDL data mining, are being integrated into what the
researchers envision as a "smart" research
and health monitoring system for California's bridges
and other civil infrastructure.
Currently, highway bridges and other infrastructure
are designed, built, and maintained as separate structures
that are monitored visually through difficult and time-consuming
on-site inspections. "But we're aiming to
transform that," says Ahmed Elgamal, professor
and chair of JSOE's Department of Structural Engineering.
"By integrating sensors and data mining into an
'intelligent' information technology framework
from the beginning, we believe that we can monitor the
safety of bridges in real time, creating a system that
can actively respond to events such as earthquakes,
and learn from experience." The system could execute
preprogrammed responses to events it detects, for example,
increasing sensor sampling rates to better investigate
possible damage, and issuing a warning if it deems a structure unsafe.
Although he recognizes that it will take time to develop
and deploy these technologies, which will involve sweeping
changes in the practice of structural engineering and
infrastructure maintenance, Elgamal is excited about
the benefits that this smart system will bring. The
old adage "a stitch in time saves nine" holds
just as true for bridge repair as it does for clothes,
and "the new framework with data mining will allow
detection of gradual deterioration so that it can be
repaired inexpensively, before it reaches a serious (and
much more expensive) stage," said Elgamal.
This can significantly reduce the current multibillion-dollar
annual costs to California's taxpayers for expensive
infrastructure repairs and early replacement.
"In addition to the immediate uses of these advanced
information technologies in monitoring bridge safety,
the most important long-term benefit of this collaboration
will be to make these powerful IT tools a normal part
of structural engineering," said Elgamal. "I'm
sure that in the future they'll be used for a great
many things we haven't even thought of yet."
Along with aging, civil infrastructure is also subject
to sudden events such as earthquakes. The California
Division of Mines and Geology has estimated that seismic
risk in California costs more than $4 billion annually.
Following an earthquake, it is vital to assess the integrity
of bridges and other civil infrastructure immediately,
to be able to intelligently reroute traffic and direct
emergency and repair vehicles. The flexible and scalable
software framework, which will initially be deployed
with sensor networks in a number of testbed bridges,
will help make this a reality.
The information that the researchers are collecting
from each bridge includes strain gauge data, which tells
engineers whether a given point on the bridge is being
compressed or stretched. This is supplemented by high-resolution
video, which can identify the cause of a detected event (a
large vehicle crossing the bridge, for example) and
also measure bridge response.
Within the project's information framework, SKIDL
data mining tools will be used in several steps along
the processing pipeline for the sensor and video data.
"We find that rather than functioning as a standalone
tool or afterthought, data mining can make the biggest
contribution when we collaborate closely with the structural
engineers to help them integrate these advanced information
technologies appropriately right from the beginning," said Fountain.
The researchers are building SKIDL's data mining into
the sensor network "upstream" to help decide
in real time which data to keep, and which can be discarded.
As a second phase, data mining will support predictive
models that help decide in real time if a bridge has
been damaged, either gradually or suddenly, and
how badly and where. A third phase of data mining will
be research oriented, with investigators using the interactive
portal to analyze both streaming and archived data for
new insights into the detection, causes, and solutions
to long-term deterioration and sudden damage, and to
improve physics-based models for more accurate
bridge simulations used in both research and design.
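The "upstream" filtering idea, deciding in real time which sensor readings to keep, can be sketched in miniature. The article does not describe the actual SKIDL method, so the moving-average threshold detector below is purely a hypothetical illustration of retaining only readings that deviate from a recent baseline.

```python
# Hypothetical sketch of upstream filtering of a strain-gauge stream:
# keep a reading only when it deviates sharply from the recent baseline.
from collections import deque

def filter_stream(readings, window=10, threshold=3.0):
    """Return readings that deviate from the moving-average baseline."""
    recent = deque(maxlen=window)
    kept = []
    for r in readings:
        if len(recent) == window:
            baseline = sum(recent) / window
            if abs(r - baseline) > threshold:
                kept.append(r)  # candidate event: retain for analysis
        recent.append(r)
    return kept

# A quiet signal with one spike: only the spike survives the filter.
data = [1.0] * 20 + [9.0] + [1.0] * 20
print(filter_stream(data))  # [9.0]
```

In a deployed system the threshold and window would be tuned per sensor, and flagged readings would trigger follow-up actions such as raising the sampling rate.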
Data from multiple bridges and other structures, each
with thousands of sensors, along with video streams
and archives of analysis results, all recorded continuously
for years, will result in multi-terabyte data collections
that will be managed, analyzed, and stored with SDSC's
assistance. And the computationally intensive real-time
data mining, in which users apply algorithms to large
data sets, run simulations, and compare the results
with real data, is large-scale research that will
require the resources of the NSF TeraGrid and other
data-intensive computing resources.
Earth Science Insights
As described above, Fountain and SKIDL researcher
Peter Shin are collaborating in the NPACI Earth Systems
Science thrust with Pennington and other LTER colleagues,
as well as researchers at the Northwest Alliance for
Computational Science and Engineering at Oregon State
University to develop land cover classification methods.
In hyperspectral data, the intensity of each of the
more than 200 wavelength bands is recorded for each
pixel in the image, forming a spectral signature
of that location on the Earth running from infrared
to ultraviolet. To derive the land cover classification,
the spectral signatures are fed into computer classification
methods. But using the entire 200-plus frequency
spectrum is costly in computer time and other ways,
and typically only a smaller number of wavelength bands
is needed for effective classification. The SKIDL data
mining tools help the researchers eliminate the unnecessary
wavelengths and efficiently identify the frequency bands
that best predict each of the many types of land cover,
facilitating analysis and improving the stability of
land cover classification models.
"Not only are these data mining methods far faster,
helping us efficiently derive land cover classifications
from large hyperspectral remote sensing data collections,"
said Pennington, "but with standardized workflows
we're able to compare and integrate the land cover classifications
with other data products, which makes them more broadly
useful." In this way, the researchers are building
a system for environmental data management that is both
comprehensive and integrated. SDSC provides additional
data management support by storing the growing 188-gigabyte
hyperspectral data collection, with typical image size
around 600 megabytes, in the High Performance Storage System (HPSS).
Improving Cancer Diagnosis and Prognosis
Fountain and SKIDL researcher Hector Jasso are also
working with a consortium studying the molecular biology
of prostate cancer with the goal of improving strategies
for diagnosis and prognosis of cancer. The multi-institution
consortium consists of investigators at the Rebecca
and John Moores UCSD Cancer Center, the Sidney Kimmel
Cancer Center, the Ludwig Institute for Cancer Research,
and the Veterans Medical Research Foundation of San Diego.
Prostate cancer is the most common type of cancer
and second leading cause of cancer-related deaths
among U.S. men. Each year about 185,000 men are diagnosed
with this cancer, and by age 75, half to three-quarters
of men will have some cancerous changes in their prostate
glands. But not all prostate cancers are the same, and
with current tests, doctors treating patients often
cannot predict which tumors will grow aggressively despite
today's therapies, and which will grow so slowly that,
depending on a patient's age, the tumors will not progress
to a lifethreatening stage even if left untreated.
Thus, improved strategies for diagnosis, prognosis of
the course of the disease, and prediction of treatment
outcomes can give physicians reliable guidance in deciding
appropriate therapies and avoiding needless treatments.
As a step toward solving these problems, the researchers
are investigating how data mining of genetic information
can improve prostate cancer diagnosis. In contrast to
earlier genetic methods, which could examine only around
a dozen genes at once, in this study the researchers
are using data from microarray gene expression experiments,
a powerful technique that enables the rapid analysis
of thousands of genes at the same time. Having data
on so many genes, combined with SDSC-supplied data
mining tools capable of analyzing this wealth of complex
information, allows the researchers to discover which
specific genes or combinations of genes can best classify
prostate tissue as tumor or not tumor.
While the genes identified this way don't necessarily
play a functional role in the cancer itself, they can
be very useful as markers that help the investigators
improve tumor diagnosis, and in the future lead to strategies
for more accurate prognosis of the course of a cancer.
But microarray data are very complex, simultaneously
assaying more than 10,000 genes for expression in a
relatively small number of samples, typically 50 to
200 patients. Thus, investigators are faced with a large
number of genes, or variables, for only a small number of samples.
To enable the researchers to make sense of all this
information, Jasso applied advanced SKIDL data mining
tools to the data, and found that these machine learning
methods (which get better at solving problems with experience)
are significantly better than previous methods at identifying
marker genes that can help classify prostate tissue
as either tumor or not tumor. "We develop the classification
model by giving it samples of data in which we already
know whether the person has cancer," said Fountain.
"This 'trains' the model, which can then be applied
to making more accurate diagnoses for patients where
the result is unknown."
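The train-then-classify workflow Fountain describes can be sketched with a minimal supervised model. The actual SKIDL machine-learning methods are not specified in the article, so the example below substitutes a simple nearest-centroid classifier on labeled expression data, purely to illustrate the idea of training on known cases and applying the model to unknowns.

```python
# Illustrative train-then-classify workflow (not SKIDL's actual method):
# learn one mean expression profile per class, then assign new samples
# to the nearest class centroid.
import numpy as np

def train(X, y):
    """Compute one mean expression profile (centroid) per class label."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(model, sample):
    """Assign the new sample to the class with the nearest centroid."""
    return min(model, key=lambda c: np.linalg.norm(sample - model[c]))

# Toy expression data: two marker "genes"; labels 0 = normal, 1 = tumor.
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.2]])
y = np.array([0, 0, 1, 1])
model = train(X, y)
print(classify(model, np.array([5.1, 4.9])))  # 1 (classified as tumor)
```

Real microarray studies face the harder version of this problem: more than 10,000 genes but only 50 to 200 samples, which is why the gene (feature) selection step described above matters so much.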
While the multi-megabyte microarray data sets
are not unusually large in themselves, the data mining
analysis is very computationally demanding because of
the enormous number of parameters that must be explored
in relation to the 10,000 potential genetic markers.
Thus, this application will benefit greatly from the
power of SDSC's planned data-intensive supercomputer,
DataStar, which will be designed to support both processor-intensive
and data-intensive computation.
As the investigators use data mining to improve cancer
diagnosis, they plan to build on this knowledge. "Our
long-term goal is to apply what we're learning in improving
methods for the diagnosis of prostate cancer to developing
better strategies for determining prognosis of the disease,
when we have sufficient clinical data showing disease
recurrence in the patient population previously treated
for early stage disease," said William Wachsman
of the Veterans Medical Research Foundation of San Diego,
one of the investigators in the prostate study. "This
will be significant in helping to manage the treatment
of patients with prostate cancer."
In addition to improving cancer diagnosis, the data
mining results can also be of use to cancer researchers
as they explore the causes of cancer. "The genes
found to be diagnostic markers may or may not be directly
involved in the cancer," said Wachsman, "but
identifying these 'genes of interest' gives us valuable
clues as we try to unravel the complex biological mechanisms
underlying the cancer process."
In their data mining research, the SKIDL researchers
are also incorporating Web services using Open Grid
Services Architecture (OGSA) and Globus tools to enable
these data mining tools to work as grid applications.
The SKIDL team anticipates that all of these projects
will have value beyond the immediate applications. "Working
with the researchers, we're designing and testing a
general data mining approach and architecture that will
be useful not only in structural engineering, cancer,
and environmental science, but in the many other disciplines
that are struggling with data collections of growing
size and complexity," said Fountain. "In the
Information Age, knowledge is key, and we're helping
researchers find the knowledge in their data."
Thousands of strain gauge sensors will be
deployed on testbed California bridges, sending
back data to be analyzed in real time for both
research and health monitoring of bridges and
other civil infrastructure (healthmonitoring.ucsd.edu).
Peaks on graph indicate a detected event, which
time-synched video shows is caused by a van
passing over the sensors.
A multi-wavelength, or hyperspectral, image
used for land cover classification. Showing
the Sevilleta LTER site in central New Mexico,
a National Wildlife Refuge since 1973, this
false color composite uses near infrared, red,
and blue wavelengths. The dark, circular feature
is Black Butte, a volcanic remnant. The east-west
line along the south edge of Black Butte is
the boundary of the Sevilleta, with grazing
and development occurring to the north, but
not to the south. Black and white lines are
paved and dirt roads, respectively.
Airborne hyperspectral image data were compared
with surveys of vegetation on the ground to
train a classification model, which
can then automatically generate land cover classifications
such as this one from hyperspectral image data.
Photomicrograph of intermediate grade prostate
cancer, magnified 80 times, used in research
on applying data mining technologies to improve
diagnosis of prostate cancer.
Project participants:
Joel P. Conte, Structural Engineering Department
Ludwig Institute for Cancer Research
Sidney Kimmel Cancer Center
San Diego VA
John Vande Castle
San Diego VA and Moores UCSD Cancer Center
Moores UCSD Cancer Center
James L. Yan
Daniel Xianfei He