|
Data mining, which encompasses a broad range of approaches
and methods, is used in two principal ways. One is to
scour large databases for hidden patterns. For example,
in a database of magazine subscribers, data mining might
find a pattern that subscribers over 50 years of age
tend to buy multi-year subscriptions, while those under
25 tend to buy one-year subscriptions. This insight
can be used to develop mailings that better match the
interests of different subscribers.
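The subscriber example above can be sketched in a few lines of Python. This is only an illustration: the records, the three age bands, and the function names are hypothetical, and a real data mining tool would search for such groupings itself rather than be handed them.

```python
from collections import defaultdict

# Hypothetical toy records: (age, subscription length in years). Not real data.
SUBSCRIBERS = [
    (62, 3), (55, 2), (71, 3), (58, 1),
    (22, 1), (19, 1), (24, 2), (21, 1),
    (35, 1), (41, 2), (47, 1), (30, 2),
]


def age_band(age):
    """Map an age to one of three illustrative bands (an assumption)."""
    if age > 50:
        return "over 50"
    if age < 25:
        return "under 25"
    return "25 to 50"


def multi_year_share(records):
    """Return the fraction of multi-year subscriptions in each age band."""
    totals = defaultdict(int)
    multi = defaultdict(int)
    for age, years in records:
        band = age_band(age)
        totals[band] += 1
        if years > 1:
            multi[band] += 1
    return {band: multi[band] / totals[band] for band in totals}


shares = multi_year_share(SUBSCRIBERS)
for band, share in sorted(shares.items()):
    print(f"{band}: {share:.0%} multi-year")
```

In this toy data the over-50 band shows a much higher share of multi-year subscriptions than the under-25 band, which is the kind of pattern the text describes.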
A second major form of data mining is to use existing
data to develop a model that can predict the characteristics
of new data. For example, data mining on the results
of a small test mailing to subscribers announcing a
new magazine can be used to develop a model that will
predict those subscribers in the full database most
likely to subscribe to the new magazine, providing guidance
for a full-scale mailing. This article describes scientific
applications of classification models in cancer diagnosis,
land cover classification, and the monitoring of highway
bridge integrity.
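The test-mailing example above can be illustrated with a deliberately simple stand-in for a predictive model: estimate a response rate for each subscriber group from the test mailing, then rank the full database by that rate. The data, the age-band feature, and the function names are all hypothetical, and practical models use many more features and more capable methods.

```python
from collections import defaultdict

# Hypothetical test-mailing results: (age band, responded?). Not real data.
TEST_MAILING = [
    ("over 50", True), ("over 50", True), ("over 50", False), ("over 50", True),
    ("under 25", False), ("under 25", False), ("under 25", True), ("under 25", False),
]

# Hypothetical full database: (subscriber id, age band).
FULL_DATABASE = [(101, "under 25"), (102, "over 50"), (103, "over 50"), (104, "under 25")]


def fit_response_rates(results):
    """'Train' the model: observed response rate per age band."""
    sent = defaultdict(int)
    responded = defaultdict(int)
    for band, did_respond in results:
        sent[band] += 1
        responded[band] += int(did_respond)
    return {band: responded[band] / sent[band] for band in sent}


def rank_subscribers(database, rates):
    """Order subscriber ids from most to least likely to respond."""
    # Bands never seen in the test mailing get a score of 0.0 (an assumption).
    scored = [(rates.get(band, 0.0), sid) for sid, band in database]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sid for _, sid in scored]


rates = fit_response_rates(TEST_MAILING)
ranking = rank_subscribers(FULL_DATABASE, rates)
print(rates)
print(ranking)
```

The ranking puts the subscribers the small test suggests are most likely to respond at the top of the list for the full-scale mailing.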
Because automated data mining methods, combined with
today's powerful computers, can examine all of
the information in massive data collections, such methods
can answer questions that traditionally have been too
time-consuming to tackle, and zero in on novel information
that even experts may miss because it lies outside their
expectations.
As a multidisciplinary endeavor, data mining draws
on artificial intelligence, machine learning, statistics,
database theory, mathematics, pattern recognition, and
other fields. Data mining is a data-driven rather than
theory-based approach to discovering knowledge: its
tools search through databases for patterns that reflect
local information about the data, and develop models
that give global insights into the full data collection.
These results can be useful in themselves, and they
can also guide researchers toward targeted analyses
that explore more precise causal relationships.
Although data mining tools are automated, computer-based
methods, an important step in every data mining application
is problem formulation and data preparation, which must
take into account the specific characteristics of the
discipline the data comes from, along with data quality,
data cleaning, and related issues.
Typical data mining problems are characterized by a
relatively small number of features or variables that
potentially govern a phenomenon, accompanied by a large
number of observations. For example, a business data
mining problem might involve a file with 10,000 customers,
or observations, but only half a dozen features
such as name, address, age, date, and purchases, which
a company might query to try to predict the response
to a targeted mailing.
In modern scientific research, however, a new class
of more complex problems known as high dimensional
problems is emerging that is the reverse: the number
of features, or potentially governing variables, is
very large, combined with a relative scarcity of observations.
The three applications described in this article, in
environmental, medical, and engineering fields, are
all examples of high dimensional data, with hundreds
or thousands of variables to explore accompanied by
relatively scarce observational data.
Since existing data mining or machine learning tools
do not handle this kind of problem well, the SKIDL
team has developed tools suited for mining such high
dimensional data sets, to help scientists efficiently
sift through the forest of variables and zero in on
the ones most likely to be of interest for further detailed
analysis.
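To make the idea of sifting through many variables concrete, here is a minimal sketch of one generic technique, univariate correlation screening, applied to synthetic data with far more features than observations. It is not the SKIDL team's method, which this article does not detail; the data, the planted feature, and all names are assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def screen_features(rows, outcomes, keep):
    """Return indices of the `keep` features most correlated with the outcome."""
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, outcomes)), j))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:keep]]


# Synthetic high dimensional data: 10 observations, 200 features.
rng = random.Random(0)
N_OBS, N_FEATURES, PLANTED = 10, 200, 17
outcomes = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
rows = [[rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)] for _ in range(N_OBS)]
for row, y in zip(rows, outcomes):
    row[PLANTED] = 2.0 * y + 1.0  # one feature deliberately tied to the outcome

top = screen_features(rows, outcomes, keep=5)
print(top)
```

The planted feature ranks first, but with so few observations some pure-noise features can also score well by chance, which is one reason high dimensional problems call for specialized tools and follow-up analysis.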
|