Data Mining 1
Overview
Data mining, the extraction of hidden predictive information from large databases, is a
powerful new technology with great potential to help companies focus on the most
important information in their data warehouses. Data mining tools predict future trends
and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The
automated, prospective analyses offered by data mining move beyond the analyses of past
events provided by retrospective tools typical of decision support systems. Data mining
tools can answer business questions that traditionally were too time-consuming to
resolve. They scour databases for hidden patterns, finding predictive information that
experts may miss because it lies outside their expectations.
Most companies already collect and refine massive quantities of data. Data mining
techniques can be implemented rapidly on existing software and hardware platforms to
enhance the value of existing information resources, and can be integrated with new
products and systems as they are brought on-line. When implemented on high
performance client/server or parallel processing computers, data mining tools can analyze
massive databases to deliver answers to questions such as, "Which clients are most likely
to respond to my next promotional mailing, and why?"
This white paper provides an introduction to the basic technologies of data mining.
Examples of profitable applications illustrate its relevance to today’s business
environment, and a basic description shows how data warehouse architectures can
evolve to deliver the value of data mining to end users.
Commercial databases are growing at unprecedented rates. A recent META Group survey
of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte
level, while 59% expect to be there by the second quarter of 1996 [1]. In some industries, such
as retail, these numbers can be much larger. The accompanying need for improved
computational engines can now be met in a cost-effective manner with parallel
multiprocessor computer technology. Data mining algorithms embody techniques that
have existed for at least 10 years, but have only recently been implemented as mature,
reliable, understandable tools that consistently outperform older statistical methods.
In the evolution from business data to business information, each new step has built upon
the previous one. For example, dynamic data access is critical for drill-through in data
navigation applications, and the ability to store large databases is critical to data mining.
From the user’s point of view, the four steps listed in Table 1 were revolutionary because
they allowed new business questions to be answered accurately and quickly.
Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics
Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery
Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery
Table 1. Steps in the Evolution of Data Mining.
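The difference between the retrospective questions and the prospective one in Table 1 can be illustrated with a minimal sketch. The records, the function names, and the use of a least-squares trend line are all assumptions made for illustration; the paper does not prescribe a particular forecasting method, and a real data mining tool would use far richer models.

```python
from collections import defaultdict

# Hypothetical records: (region, city, month_index, units). Values are made up.
SALES = [
    ("New England", "Boston", 1, 100),
    ("New England", "Boston", 2, 110),
    ("New England", "Boston", 3, 120),
    ("New England", "Hartford", 1, 50),
    ("New England", "Hartford", 2, 50),
    ("New England", "Hartford", 3, 50),
]

def unit_sales(records, region, month):
    """Data access: total units for one region and month."""
    return sum(u for r, _, m, u in records if r == region and m == month)

def drill_down(records, region, month):
    """Decision support: the same total broken out by city."""
    by_city = defaultdict(int)
    for r, city, m, u in records:
        if r == region and m == month:
            by_city[city] += u
    return dict(by_city)

def forecast_next(records, city):
    """Prospective step: extrapolate a least-squares line one month ahead.

    Assumes the city has records for at least two distinct months.
    """
    points = sorted((m, u) for _, c, m, u in records if c == city)
    n = len(points)
    mean_x = sum(m for m, _ in points) / n
    mean_y = sum(u for _, u in points) / n
    sxx = sum((m - mean_x) ** 2 for m, _ in points)
    sxy = sum((m - mean_x) * (u - mean_y) for m, u in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    next_month = points[-1][0] + 1
    return intercept + slope * next_month
```

The first two functions only summarize what has already happened; the third produces an estimate for a month that is not in the data, which is the shift the last row of the table describes.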
The core components of data mining technology have been under development for
decades, in research areas such as statistics, artificial intelligence, and machine learning.
Today, the maturity of these techniques, coupled with high-performance relational
database engines and broad data integration efforts, makes these technologies practical for
current data warehouse environments.
• More columns. Analysts must often limit the number of variables they examine
when doing hands-on analysis due to time constraints. Yet variables that are
discarded because they seem unimportant may carry information about unknown
patterns. High performance data mining allows users to explore the full depth of a
database, without preselecting a subset of variables.
• More rows. Larger samples yield lower estimation errors and variance, and allow
users to make inferences about small but important segments of a population.
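The effect of more rows on estimation error can be sketched numerically. The example below is illustrative only: the synthetic population, the sample sizes, and the function names are assumptions. It uses the standard result that the standard error of a sample mean is roughly s / sqrt(n) for independent draws, so 100 times more rows gives an estimate about 10 times tighter.

```python
import math
import random
import statistics

def standard_error(sample):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

def demo(sizes=(100, 10_000), seed=0):
    """Draw samples of different sizes from one synthetic population."""
    rng = random.Random(seed)
    # Hypothetical population: 100,000 values with mean 50 and spread 10.
    population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]
    return {n: standard_error(rng.sample(population, n)) for n in sizes}

if __name__ == "__main__":
    for n, se in demo().items():
        print(f"n={n:>6}  standard error of the mean = {se:.3f}")
```

The same arithmetic explains the point about small segments: a segment that makes up 1% of the population contributes only about 100 rows to a 10,000-row sample, so estimates for that segment are only as precise as a 100-row study.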
A recent Gartner Group Advanced Technology Research Note listed data mining and
artificial intelligence at the top of the five key technology areas that "will clearly have a
major impact across a wide range of industries within the next 3 to 5 years." [2] Gartner
also listed parallel architectures and data mining as two of the top 10 new technologies in
which companies will invest during the next 5 years. According to a recent Gartner HPC
Research Note, "With the rapid advance in data capture, transmission and storage, large-
systems users will increasingly need to implement new and innovative ways to mine the
after-market value of their vast stores of detail data, employing MPP [massively parallel
processing] systems to create new sources of business advantage (0.9 probability)." [3]
The most commonly used techniques in data mining are:
• Rule induction: The extraction of useful if-then rules from data based on
statistical significance.
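A minimal sketch of the idea behind rule induction follows. It is a simplification under stated assumptions: the records and attribute names are hypothetical, only single-condition rules are considered, and rules are filtered by support and confidence thresholds as a stand-in for a formal statistical significance test.

```python
from collections import Counter

# Hypothetical customer records; attribute names and values are illustrative.
ROWS = [
    {"region": "east", "card": "gold", "responded": "yes"},
    {"region": "east", "card": "gold", "responded": "yes"},
    {"region": "east", "card": "basic", "responded": "no"},
    {"region": "west", "card": "gold", "responded": "yes"},
    {"region": "west", "card": "basic", "responded": "no"},
    {"region": "west", "card": "basic", "responded": "no"},
]

def induce_rules(rows, target, min_support=2, min_confidence=0.9):
    """Return sorted (attribute, value, outcome, support, confidence) tuples.

    Each tuple reads: IF attribute == value THEN target == outcome.
    support is the number of rows matching both sides of the rule;
    confidence is support divided by the rows matching the IF side.
    """
    condition_counts = Counter()
    joint_counts = Counter()
    for row in rows:
        for attr, value in row.items():
            if attr == target:
                continue
            condition_counts[(attr, value)] += 1
            joint_counts[(attr, value, row[target])] += 1
    rules = []
    for (attr, value, outcome), support in joint_counts.items():
        confidence = support / condition_counts[(attr, value)]
        if support >= min_support and confidence >= min_confidence:
            rules.append((attr, value, outcome, support, confidence))
    return sorted(rules)

if __name__ == "__main__":
    for attr, value, outcome, support, confidence in induce_rules(ROWS, "responded"):
        print(f"IF {attr} = {value} THEN responded = {outcome} "
              f"(support {support}, confidence {confidence:.2f})")
```

On this toy data the card type predicts the response perfectly while the region does not, so only the two card rules survive the thresholds. A production tool would also search multi-condition rules and test whether a rule's confidence could plausibly arise by chance.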
Many of these technologies have been in use for more than a decade in specialized
analysis tools that work with relatively small volumes of data. These capabilities are now
evolving to integrate directly with industry-standard data warehouse and OLAP
platforms. The appendix to this white paper provides a glossary of data mining terms.
Profitable Applications
A wide range of companies have deployed successful applications of data mining. While
early adopters of this technology have tended to be in information-intensive industries
such as financial services and direct mail marketing, the technology is applicable to any
company looking to leverage a large data warehouse to better manage its customer
relationships. Two critical factors for success with data mining are: a large, well-
integrated data warehouse and a well-defined understanding of the business process
within which data mining is to be applied (such as customer prospecting, retention,
campaign management, and so on).
Some successful application areas include:
• A pharmaceutical company can analyze its recent sales force activity and its
results to improve targeting of high-value physicians and determine which
marketing activities will have the greatest impact in the next few months. The
data needs to include competitor market activity as well as information about the
local health care systems. The results can be distributed to the sales force via a
wide-area network that enables the representatives to review the
recommendations from the perspective of the key attributes in the decision
process. The ongoing, dynamic analysis of the data warehouse allows best
practices from throughout the organization to be applied in specific sales
situations.
• A credit card company can leverage its vast warehouse of customer transaction
data to identify customers most likely to be interested in a new credit product.
Using a small test mailing, the attributes of customers with an affinity for the
product can be identified. Recent projects have indicated more than a 20-fold
decrease in costs for targeted mailing campaigns over conventional approaches.
• A diversified transportation company with a large direct sales force can apply data
mining to identify the best prospects for its services. Using data mining to analyze
its own customer experience, this company can build a unique segmentation
identifying the attributes of high-value prospects. Applying this segmentation to a
general business database such as those provided by Dun & Bradstreet can yield a
prioritized list of prospects by region.
• A large consumer package goods company can apply data mining to improve its
sales process to retailers. Data from consumer panels, shipments, and competitor
activity can be applied to understand the reasons for brand and store switching.
Through this analysis, the manufacturer can select promotional strategies that best
reach its target customer segments.
Each of these examples has a clear common ground. They leverage the knowledge
about customers implicit in a data warehouse to reduce costs and improve the value of
customer relationships. These organizations can now focus their efforts on the most
important (profitable) customers and prospects, and design targeted marketing strategies
to best reach them.
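The test-mailing approach from the credit card example can be sketched as follows. Everything here is hypothetical: the segment labels, the response data, the cost figure, and the function names are assumptions for illustration, and the sketch makes no attempt to reproduce the cost reductions reported above.

```python
from collections import defaultdict

def response_rates(test_results):
    """Response rate per segment from (segment, responded) test-mailing pairs."""
    mailed = defaultdict(int)
    responses = defaultdict(int)
    for segment, responded in test_results:
        mailed[segment] += 1
        responses[segment] += int(responded)
    return {seg: responses[seg] / mailed[seg] for seg in mailed}

def cost_per_response(rates, audience, cost_per_piece, min_rate=0.0):
    """Expected cost per response when mailing only segments with rate >= min_rate.

    audience maps each segment to the number of customers in it.
    Segments absent from the test mailing are treated as rate 0.
    """
    pieces = 0
    expected_responses = 0.0
    for segment, size in audience.items():
        rate = rates.get(segment, 0.0)
        if rate >= min_rate:
            pieces += size
            expected_responses += size * rate
    if expected_responses == 0:
        return float("inf")
    return pieces * cost_per_piece / expected_responses

if __name__ == "__main__":
    # Hypothetical test mailing: segment A responds at 10%, segment B at 1%.
    test = ([("A", True)] * 10 + [("A", False)] * 90
            + [("B", True)] * 1 + [("B", False)] * 99)
    rates = response_rates(test)
    audience = {"A": 1_000, "B": 9_000}
    print("mail everyone:   ", cost_per_response(rates, audience, 1.0))
    print("mail segment A:  ", cost_per_response(rates, audience, 1.0, min_rate=0.05))
```

The design choice is deliberately simple: segments are given in advance and scored by observed response rate. In practice the segmentation itself would come out of the data mining step, for example from induced rules or a decision tree.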
Conclusion
Comprehensive data warehouses that integrate operational data with customer, supplier,
and market information have resulted in an explosion of information. Competition
requires timely and sophisticated analysis on an integrated view of the data. However,
there is a growing gap between more powerful storage and retrieval systems and the
users’ ability to effectively analyze and act on the information they contain. Both
relational and OLAP technologies have tremendous capabilities for navigating massive
data warehouses, but brute force navigation of data is not enough. A new technological
leap is needed to structure and prioritize information for specific end-user problems.
Data mining tools can make this leap. Quantifiable business benefits have been proven
through the integration of data mining with current information systems, and new
products are on the horizon that will bring this integration to an even wider audience of
users.
[1] META Group Application Development Strategies: "Data Mining for Data Warehouses: