Skip to content

EnVision V19.No.3 SDSC Homepage San Diego Supercomputer Center Contact EnVision NPACI: A Leading Edge Site San Diego Supercomputer Center
 
 
Features
   
  Building a National Grid
from the Bottom Up
   
 

Building the NPACI Grid:
Integrating the
Human Infrastructure

   
  Discovering Knowledge in Massive Data Collections
   
  Pacific Rim Group Evolves
into International Model
of Collaboration
   
 
News
   
Tree of Life and Virtual Grid Development among Four ITR Awards to SDSC
   
NSF Middleware Projects Receive $9 Million
   
NSF Awards $1.2 Million to Extend PRAGMA Program
   
Teachers Bring Technology They Developed at SDSC to Their Classrooms
   
Texas Installs Lonestar Cluster
   
SDSC Education Department Efforts Recognized in SIGKids Awards
 
Finding Needles of Knowledge in Data Haystacks
 

By Paul Tooby

 

Data mining, which encompasses a broad range of approaches and methods, is used in two principal ways. One is to scour large databases for hidden patterns. For example, in a database of magazine subscribers, data mining might find a pattern that subscribers over 50 years of age tend to buy multi-year subscriptions, while those under 25 tend to buy one-year subscriptions. This insight can be used to develop mailings that better match the interests of different subscribers.

A second major form of data mining is to use existing data to develop a model that can predict the characteristics of new data. For example, data mining on the results of a small test mailing to subscribers announcing a new magazine can be used to develop a model that will predict those subscribers in the full database most likely to subscribe to the new magazine, providing guidance for a full-scale mailing. In scientific applications, this article describes the use of classification models in cancer diagnosis, land cover classification, and monitoring the integrity of highway bridges.

Because automated data mining methods, combined with today’s powerful computers, can examine all of the information in massive data collections, such methods can answer questions that traditionally have been too time-consuming to tackle, and zero in on novel information that even experts may miss because it lies outside their expectations.

As a multidisciplinary endeavor, data mining draws on the fields of artificial intelligence, machine learning, statistics, database theory, as well as mathematics, pattern recognition, and others. A data-driven rather than theory-based approach to discovering knowledge, data mining tools search through databases for patterns that reflect local information about the data, and develop models that give global insights into the full data collection. These results can be useful in themselves, as well as guiding researchers in targeted analyses that explore more precise causal relationships.

Although data mining tools are computer-based automated methods, an important step in all data mining applications is problem formulation and data preparation, which must take into account the specific characteristics of the discipline the data is from, along with issues of data quality, cleaning, and other procedures.

Typical data mining problems are characterized by a relatively small number of features or variables that potentially govern a phenomenon, accompanied by a large number of observations. For example, a business data mining problem might involve a file with 10,000 customers, or “observations,” but only half a dozen features such as name, address, age, date, and purchases, which a company might query to try to predict the response to a targeted mailing.

In modern scientific research, however, a new class of more complex problems known as “high dimensional” problems is emerging that is the reverse: the number of features, or potentially governing variables, is very large, combined with a relative scarcity of observations. The three applications described in this article—in environmental, medical, and engineering fields—are all examples of high dimensional data, with hundreds or thousands of variables to explore accompanied by relatively scarce observational data.

Since existing data mining or machine learning tools don’t handle this kind of problem well, the SKIDL team has developed tools suited for mining such high dimensional data sets, to help scientists efficiently sift through the forest of variables and zero in on the ones most likely to be of interest for further detailed analysis.