Implementing Data Mining
Algorithms in
Microsoft® SQL Server™
WIT eLibrary
Home of the Transactions of the Wessex Institute, the WIT electronic-library provides the
international scientific community with immediate and permanent access to individual
papers presented at WIT conferences. Visit the WIT eLibrary at www.witpress.com
Advances in Management Information Series
Information Retrieval
Intelligent Agents
Data Mining
Data Warehouse
Text Mining
Competitive Intelligence
Customer Relationship Management
Information Management
Knowledge Management
Series Editor
A. Zanasi
TEMIS Text Mining Solutions S.A.
Italy
Associate Editors
O. Ciftcioglu P. Giudici
Delft University of Technology Universita di Pavia
The Netherlands Italy
M. Costantino A. Gualtierotti
London IDHEAP
UK Switzerland
A. De Montis J. Lourenco
Universita di Cagliari Universidade do Minho
Italy Portugal
G. Deplano G. Loo
Universita di Cagliari The University of Auckland
Italy New Zealand
D. Malerba F. Rossi
Università degli Studi DATAMAT
UK Germany
N. Milic-Frayling D. Sitnikov
Microsoft Research Ltd Kharkov Academy of Culture
UK Ukraine
G. Nakhaeizadeh R. Turra
DaimlerChrysler CINECA Interuniversity Computing
Germany Centre
Italy
P. Pan
National Kaohsiung University of D. Van den Poel
Applied Science Ghent University
Taiwan Belgium
J. Rao J. Yoon
Case Western Reserve University Old Dominion University
USA USA
D. Riaño N. Zhong
Universiteit Ghent Maebashi Institute of Technology
Belgium Japan
J. Roddick
Flinders University
Australia
Implementing Data Mining
Algorithms in
Microsoft® SQL Server™
C.L. Curotto
CESEC/UFPR
Federal University of Paraná, Brazil
N.F.F. Ebecken
COPPE/UFRJ
Federal University of Rio de Janeiro, Brazil
Implementing Data Mining
Algorithms in
Microsoft® SQL Server™
Series: Advances in Management Information, Vol. 3
C.L. Curotto and N.F.F. Ebecken
Published by
WIT Press
Ashurst Lodge, Ashurst, Southampton, SO40 7AA, UK
Tel: 44 (0) 238 029 3223; Fax: 44 (0) 238 029 2853
E-Mail: witpress@witpress.com
http://www.witpress.com
WIT Press
25 Bridge Street, Billerica, MA 01821, USA
Tel: 978 667 5841; Fax: 978 667 7582
E-Mail: infousa@witpress.com
http://www.witpress.com
ISBN: 1-84564-037-3
ISSN: 1742-0172
No responsibility is assumed by the Publisher, the Editors and Authors for any injury
and/or damage to persons or property as a matter of products liability, negligence or
otherwise, or from any use or operation of any methods, products, instructions or ideas
contained in the material herein.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the Publisher.
Contents
Foreword xi
Preface xiii
Chapter 2: Tools 9
2.1 Hardware 9
2.2 Software 9
2.2.1 Operating system 9
2.2.2 DBMS 9
2.2.3 Compilers 10
2.2.3.1 Microsoft Visual Studio 6.0 SP5 10
2.2.3.2 Microsoft Visual Studio .NET 2003 12
2.2.4 Syntax parser 12
2.2.5 Utilities 12
2.2.6 DM sample provider 14
2.2.7 IDMA CD 14
2.2.7.1 Assembling DMSample 14
2.2.7.2 Assembling DMclcMine 15
2.2.7.3 Creating DMiBrowser web site 16
References 65
Foreword
Data mining technology is growing rapidly and, in many industries, has become
commonplace. The main reason for this is the advance in computer technology
that makes it possible to acquire, store and retrieve enormous amounts of data
about everything and from everywhere. Uncovering patterns in these data allows
a better understanding of a particular business, which, in turn, helps optimize
business processes, create new business opportunities, and so on. Some data
mining applications include target marketing, cross-selling, risk analysis, fraud
detection, text classification, spam filtering and anomaly detection. Data mining
involves virtually all industrial sectors where data are collected and stored
electronically.
In 2000, Microsoft introduced a data mining feature in its Microsoft® SQL
Server™ 2000 Analysis Services. The SQL Server 2000 data mining component
has made numerous features available, becoming an enterprise-level platform for
data mining services. Among others, it implements a standard data mining
language, called DMX (Data Mining eXtensions), which was proposed as part of
the object linking and embedding database (OLE DB) for Data Mining specification.
DMX is an extension of SQL (Structured Query Language), with the same basic
philosophy that the SQL community had in mind when SQL was invented decades
ago, i.e. of providing application programmers with a clear separation between
business logic and data management details. With DMX, application programmers
can focus on aspects of data mining modeling combined with business logic
rather than on algorithm-specific interfaces and data transformations. Customer
feedback indicates that this is one of the most significant innovations in data
mining technology.
Furthermore, Microsoft has introduced a new feature, third-party data mining
provider aggregation, in the SQL Server™ 2000 Service Pack 1 (SP1), which
allows third parties to integrate their own OLE DB for Data Mining providers into
SQL Server 2000 Analysis Services. In this way, third-party data mining providers
can supply their customers with all the enterprise-level platform functionalities of
Analysis Services, as well as their data mining algorithms. This is particularly
important in the current data mining industry, where no single vendor offers data
mining algorithms to suit all customers' needs. New algorithms for new needs are
being introduced as we speak.
To promote SQL Server 2000 Analysis Services among third-party data mining
providers, Microsoft has made a sample code (DMSample) available to the
public. DMSample contains all the functionality of a full-fledged OLE DB for
Data Mining provider. It includes OLE DB interfaces (connection, session,
command, rowset, etc.), a full implementation of the DMX language and a sample
data mining algorithm, which is expected to be replaced by the provider. Since the
sample code was placed on our web site, there have been several thousand
downloads from all over the world. Traffic to the MSN Data Mining group (http://
groups.msn.com/AnalysisServicesDataMining) and the newsgroup (microsoft.
public.sqlserver.datamining) has increased significantly. Several vendors are
offering their products through the aggregation and we expect substantial growth
in this area.
In view of these developments, I am delighted to see Claudio make his experience
in the integration available to others. He has been meticulous in getting his
data mining algorithm integrated into SQL Server 2000 through the aggregation
functionality using the DMSample code. Throughout the process, he was always
in close communication with the SQL Server data mining team and did an excellent
job on the integration.
This book covers all the practicalities required to integrate a third-party data
mining algorithm into SQL Server 2000.
Preface
Acknowledgements
Kris Ganjam, Claude Seidman, Raman Iyer, Peter Kim, Jamie MacLennan and
ZhaoHui Tang provided valuable help in clarifying the questions that arose during
the computational implementation and the execution of the performance
experiments.
Gary Perlman supplied the source code of the |STAT system, used in the com-
putational experiments.
Manoel Rodrigues Justino Filho extensively revised the text.
Chapter 1
Data mining technology overview
1.1 The importance of data mining technology
Transactions within the business community generate data, which are the founda-
tion of the business itself. These business sectors include retail, petroleum,
telecommunications, utilities, manufacturing, transportation, credit cards, insur-
ance, banking, etc. For years, data have been collected, collated, stored and used
through large databases.
Within the scientific community, data are much more dispersed, since the goal
in modeling scientific problems is to find and formulate governing laws in the
form of accurate mathematical terms. It has long been recognized, however, that
such perfect descriptions are not always possible. Incomplete and inaccurate
knowledge, observations that are often of a qualitative nature, the complete
heterogeneity of the surrounding world, boundaries and initial conditions that are
incompletely known: all these generate the search for data models.
To build a model that does not need complex mathematical equations, one
needs sufficient and good data. The possibility of substituting a huge amount of
data by a compact set of rules, for example, is an improbable accomplishment.
The possibility of extracting explicit and comprehensive knowledge, hidden in
a large database, can generate precious information. It is also possible to complement
these data with the addition of expert knowledge.
Data Mining (DM) technology has opened a whole new field of opportunities,
completely changing problem-solving strategies.
DM is a multidisciplinary domain, covering several techniques. It deals with
discovering hidden, unexpected knowledge in large databases. The
methodologies of traditional decision-making support systems do not scale up to
the degree where they can show the hidden structures of large databases and data
warehouses within the time limits imposed by current scientific and business
environments.
Usually, they require a lot of computing power, but they are solid, in the sense that
if a solution exists, a genetic algorithm can probably find it.
Fuzzy systems are universal approximators, that is, they are capable of approx-
imating general nonlinear functions to any desired degree of accuracy, similar to
feedforward networks. DM has played a central role in the development of fuzzy
models because fuzzy interpretations of data structures are a very natural and intu-
itively plausible way of formulating and solving various problems.
The mixture of fuzzy logic, neural networks and genetic algorithms provides a
powerful framework for solving difficult real-world problems.
1.9 Applications
At present, DM is a fully developed and very powerful tool, ready for use. DM
finds applications in many areas, the most popular being:
• DM on Government: Detection and prevention of fraud and money
laundering, criminal patterns, health care transactions, etc.
• DM for Competitive Intelligence: New product ideas, retail marketing and
sales patterns, competitive decisions, future trends and competitive
opportunities, etc.
• DM on Finance: Consumer credit policy, portfolio management,
bankruptcy prediction, foreign exchange forecasting, derivatives pricing,
risk management, price prediction, forecasting macroeconomic data, time
series modeling, etc.
• Building Models from Data: Applications of DM in science, engineering,
medicine, global climate change modeling, ecological modeling, etc.
2.1 Hardware
The implementation was made using an IBM PC compatible microcomputer with
an Intel Pentium III 500 MHz processor, 512 MB of RAM, a 30 GB hard disk,
and virtual memory of 768 MB minimum and 2048 MB maximum.
Several programs are used in the debugging and running tasks. This config-
uration exceeds the hardware required by each program running alone. However,
in the debugging task for the DM provider implementation, a large amount of mem-
ory is necessary, since MSSQL with the Analysis Services component, Microsoft®
Visual C++® (MSVC) and Microsoft® Visual Basic® (MSVB) are usually running
together.
2.2 Software
2.2.1 Operating system
The operating system used was the Microsoft® Windows® 2000 Advanced Server
SP4 (Service Pack 4)
(http://www.microsoft.com/windows2000/downloads/servicepacks/sp4). You can
try using the trial version of Windows® 2003
(http://www.microsoft.com/windowsserver2003/evaluation/trial/evalkit.mspx).
To use DMiBrowser, a browser developed to display trees produced by OLE
DB for DM classifiers, you must first install the Internet Information Services
(IIS) component.
2.2.2 DBMS
The DBMS used was MSSQL 2000 Enterprise SP3A. MSSQL and the
Analysis Services component with Service Pack 3A must be installed
(http://www.microsoft.com/sql/downloads/2000/sp3.asp). A trial version of this
software is available (http://www.microsoft.com/sql/evaluation/trial) and can be
used without any problems. For the debugging and developing tasks of the DM
2.2.3.1 Microsoft® Visual Studio 6.0 SP5 The following instructions must be
performed for installing MSVS 6.0 SP5:
1. Install MSVS 6.0 (Visual C++®, Visual Basic® and Visual J++® compilers).
2. Install the debugging tools used to implement the DMiBrowser:
A. Running MSVS 6.0 setup:
a. Select Server Applications and Tools in the first dialog;
b. In the next dialog select Launch BackOffice Installation
Wizard, and press Install;
c. In the BackOffice Business Solutions dialog, select Custom
and press Next;
d. In the dialog BackOffice Programs and Their Components
select only the installation of the components: Remote Machine
Debugging and Visual InterDev Server.
B. Finish the setup and execute the instruction included in Visual
InterDev / Using Visual InterDev / Building Integrated
Solutions / Integration Tasks / Debugging Remotely.
3. Install MSVS SP 5
(http://msdn.microsoft.com/vstudio/downloads/updates/sp/vs6/sp5).
4. Download Microsoft® Platform SDK February 2003 Edition from:
(http://www.microsoft.com/msdownload/platformsdk/sdkupdate) and
install the following minimum set of components:
• Core SDK (Windows Server 2003) – Build Environment;
• Common Setup Files;
• Internet Development SDK (Version 6.0) – Build Environment;
• Microsoft® Data Access Components (Version 2.7) – Build
Environment and Sample and source code.
5. Verify whether Microsoft® Platform SDK include and lib directories are on
the top of Tools/Options/Directories/Include files and Tools/
Options/Directories/Library files lists, as shown in figs 2.1 and 2.2.
6. If MSVS .NET 2003 is not installed, then install Microsoft® .NET Framework
Version 1.1 Redistributable Package (http://www.microsoft.com/
downloads/details.aspx?FamilyId=262D25E3-F589-4842-8157-
034D1E7CF3A3&displaylang=en). This package will be used by the
Assemble utility (see Section 2.2.7.2) to build the DMclcMine provider.
2.2.3.2 Microsoft® Visual Studio .NET 2003 The following instructions must
be performed for installing MSVS .NET 2003:
1. Install MSVS .NET (Visual C++® and Visual Basic®). The trial version of
MSVS .NET 2003 can be used:
(http://msdn.microsoft.com/vstudio/productinfo/trial/default.aspx);
2. Install October 2003 MSVS .NET Documentation Update (http://
www.microsoft.com/downloads/details.aspx?
familyid=a3334aed-4803-4495-8817-c8637ae902dc&displaylang=en);
3. Verify whether Microsoft® Platform SDK include and lib directories
included in MSVS .NET are on the top of Tools/Options/Projects/
VC++ Directories/Include files and Tools/Options/Projects/
VC++ Directories/Library files lists as shown in figs 2.3 and
2.4.
2.2.5 Utilities
RowsetViewer is a sample tool included in the Microsoft® Data Access
Components of the Microsoft® Platform SDK February 2003 Edition. It provides a
simple way of viewing and manipulating OLE DB rowsets, with the additional ability
of calling and manipulating other OLE DB methods from the data source, session,
command, rowset, transaction and notification objects supported by any OLE DB
provider (see Section 3.1.1). The source code of RowsetViewer can be installed
by selecting the option Sample and source code from the Microsoft® Data
Access Components (Version 2.7) group of the Microsoft® Platform SDK. The
instructions to install this tool are described in step 5 of Section 2.2.3.1.
2.2.6 DM sample provider
DMSample [5] provides a template, with complete source code, to implement a
DM provider that can be run aggregated with MSSQL. The following steps pro-
vide the instructions to install this tool:
1. Download SampleProvider.zip file of DMSample from: (http://www.
microsoft.com/downloads/details.aspx?FamilyID=d4402893-8952-4c9d-
b9bc-0d60c70d619d&DisplayLang=en).
2. Extract all directories and files from SampleProvider.zip to any tempo-
rary directory or root drive. Be sure that the Use folder names option is
selected. A directory named SampleProvider will be created with all
files of this provider.
2.2.7 IDMA CD
The IDMA CD that accompanies this book contains all the source code of the
DMclcMine implementation, as well as the data sets and utilities used in the
computational experiments. To implement DMclcMine, we created 54 files
(including the 13 files created to support MSVS .NET) and modified 95 files
(including the 77 files modified by the Assemble utility of DMSample, out of 460
original files). That is, 149 files were created or modified from a total of 514 files.
Due to copyright restrictions on this tool, we cannot distribute all of these files
already fixed. To circumvent this problem, we have made a Visual Basic® utility
which assembles all files of DMclcMine, starting from the DMSample original
source files.
Create the idma directory by copying all files from the IDMA CD to the root
directory of your hard disk’s chosen drive and execute the instructions described
in the following sections.
5. Run idma/DMclcMine/DMProv/release/vs6/RegDMclcMine.bat.
This batch file will register the release MSVS 6.0 version of DMclcMine
on your computer. Start the MSSQL OLAP Service again.
If you prefer, this task also can be done manually by executing the instructions in
Appendix B.
2.2.7.3 Creating DMiBrowser web site Create the DMiBrowser web site on
your computer by performing the following steps:
1. Run IIS.
2. Execute Action/New/Web Site.
3. Execute the Web Site Creation Wizard providing:
a. A Description for the new web site;
b. An IP address, for example 127.0.0.1;
c. The Path of the new web site: idma/DMiBrowser/web;
d. Allow the following permissions: Read and Run scripts (such as
ASP).
4. Execute Action/Properties. Press tab Documents and Add
index.htm as the default document.
To browse the DMiBrowser web site, use the http://127.0.0.1 link.
Chapter 3
OLE DB for DM technology
The OLE DB for DM technology [6] defines a standard open API for the
development of DM providers.
This standardization allows the complete portability of these providers,
formed as API coupled systems, which can take advantage of the internal
resources of DBMSs, as well as of any data source compatible with this API.
By using this technology, a DBMS manages distinct applications, developed
by distinct developers, for each task of the KDD process, e.g. data prepara-
tion, DM, visualization and evaluation.
Integration among the different steps of the KDD process is made possible by
the storage of the DM model (DMM), which is available for use by any applica-
tion that complies with this technology.
Seeking portability and free information exchange, the storage of the model is
carried out by using the Predictive Model Markup Language (PMML) specifica-
tion [7], a markup language for statistics and DMMs, based on XML [8].
In addition, this technology allows a DM provider to be fully portable,
running, coupled with standalone applications, without the support of any DBMS.
A single data connection with a data source, such as a flat text file, is sufficient.
From the DM researcher’s point of view, this technology is highly promising, since
it creates the possibility, with minimum effort, of porting or developing his/her own DM
algorithm, DMM viewer or any other application, using a non-proprietary program-
ming language to assemble a DM provider ready for use, integrated with a DBMS.
However, it is essential that implementation of DM algorithms proceeds effi-
ciently and with care to optimize use of available DBMS internal functions [9].
Physically, a DM provider is a Dynamic Link Library (DLL) file that can be
used either by MSSQL or by any application through run-time function calls.
As an extension of OLE DB technology, OLE DB for DM can be explained
within the UDA Architecture context, which is described in Section 3.1.
An OLE DB provider can be simple or complex. To carry out queries involving mul-
tiple data sources, a provider can use other providers.
The seven basic components of the OLE DB provider object model are
described in table 3.1 [10].
OLE DB components can be classified into three categories: data providers,
data consumers and service components. The key characteristic of a data provider
is that it holds the data it exposes to the outside world. While each provider han-
dles implementation details independently, all providers expose their data in a
tabular format through virtual tables. A data consumer is any component,
whether it is system or application code, that needs to access data from OLE
DB providers. Development tools, programming languages and many sophisti-
cated applications fall into this category. Finally, a service component is a logi-
cal object that encapsulates a piece of DBMS functionality. One of OLE DB’s
design goals was to implement ser vice components (such as query processors,
cursor engines, or transaction managers) as standalone products that can be
plugged in when needed.
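To illustrate the consumer side of this architecture, the following is a minimal C++
sketch of an OLE DB consumer written with the ATL consumer templates; it is not
code from DMSample or DMclcMine, and the provider ProgID, connection string and
model name are illustrative assumptions rather than the values used by the book.

#include <atldbcli.h>
#include <iostream>

int main()
{
    ::CoInitialize(NULL);
    {
        CDataSource ds;
        CSession session;
        CCommand<CDynamicAccessor> cmd;

        // The provider ProgID and connection string below are assumptions for the example.
        HRESULT hr = ds.OpenFromInitializationString(
            L"Provider=DMclcMine;Data Source=LocalAnalysisServer");
        if (SUCCEEDED(hr)) hr = session.Open(ds);

        // A DMX query against a trained mining model (the model name follows Chapter 3).
        if (SUCCEEDED(hr)) hr = cmd.Open(session, L"SELECT * FROM [AllElet_SNB].CONTENT");

        // Each row of the CONTENT rowset describes one node of the mining model.
        long rows = 0;
        while (SUCCEEDED(hr) && cmd.MoveNext() == S_OK) ++rows;
        std::cout << rows << " content rows retrieved" << std::endl;

        cmd.Close();
        session.Close();
        ds.Close();
    }
    ::CoUninitialize();
    return 0;
}

The same data source/session/command/rowset sequence is what tools such as
RowsetViewer and DMSamp drive interactively against a DM provider.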
Listing 3.2: AllElet – XML file of the populated and trained DMM.
tions, nodes and other information specific to each DM algorithm. So a suitable
way to view DMM contents is by directed graphs or trees (sets of nodes with con-
necting edges). The decision tree technique fits very well with this method.
Figure 3.10 shows the DMSamp screen with part of the results of the
query that retrieves the entire contents of the AllElet DMM (selecting and run-
ning the fifth query):
SELECT * FROM [AllElet_SNB].CONTENT
Selecting the AllElet_SNB DMM in the Mining Models list, right-clicking
and pressing Browse, you can see the prediction tree of the
AllElet_SNB DMM (fig. 3.11). Figure 3.12 shows the complete prediction tree
of the BuysComputer attribute.
Figure 3.13 shows the UDA architecture configuration for the scenario of
using the DMSamp and DMclcMine providers. DMSamp is a simple VB applica-
tion, using SQLOLEDB, the Microsoft® OLE DB Provider for MSSQL, and
the DMclcMine provider.
Figure 3.14 shows the AllElet BuysComputer prediction tree displayed by
Microsoft® Internet Explorer (MSIE) through DMiBrowser, an ASP utility [15],
modified by Curotto [16] to provide support for algorithms other than MSDT.
You can see this screen in MSIE by performing the following steps:
1. Run MSIE and open the address http://127.0.0.1 (see Section 2.2.7.3).
2. In the right upper frame, a list of Analysis Server Databases will
appear. Select AllElet database.
3. In this frame, a list of Models of AllElet database will appear. Select
AllElet_SNB.
4. In this frame, a list of Prediction trees of AllElet_SNB model will
appear. Select Student.
5. You can scroll over the tree and select one node to see the Node
Description and Node Distribution. You can also view a specific node
by selecting it in the left upper frame.
Figure 3.15 shows the UDA architecture configuration for this last scenario.
Chapter 4
Implementation of DMclcMine
4.1 The SNBi classifier
The SNBi classifier, implemented to illustrate the algorithm implementation, is a
Naïve Bayesian classifier supporting an incremental update of the training data
set, continuous data attributes and multiple discrete prediction attributes. In spite
of its simplicity (due to the premise of attribute independence), this algorithm
shows better results, in many situations, than more refined algorithms.
P(A|B) = P(B|A) P(A) / P(B)    (4.1)
where:
P(A) and P(B) are prior probabilities because they are independent of any prior
knowledge. In contrast, P(A|B) and P(B|A) are posterior probabilities because
they depend on prior knowledge (based on more information).
The Naïve Bayesian classifier can be formulated as follows [12].
Suppose a training data set for med by:
(4.2)
Class cj with the greatest probability P(cj|Y) is named the maximum posterior
hypothesis. With Bayes' theorem, this probability is computed by:
P(cj|Y) = P(Y|cj) P(cj) / P(Y)    (4.3)
where:
P(cj|Y) probability of event of class cj for a case Y;
P(Y|cj) probability of event of case Y for a class cj;
P(cj) probability of event of class cj;
P(Y) probability of event of case Y (P(Y) ≠ 0).
Since P(Y) is constant for all classes, only the numerator of the above equation
needs to be computed. The class prior probabilities can be computed by:
P(cj) = sj / v    (4.4)
where:
sj number of training cases of class cj;
v total number of training cases.
Assuming class conditional independence between the attributes (the origin of the
name Naïve Bayesian classifier), the probability P(Y|cj) can be computed by:
P(Y|cj) = ∏i P(yi|cj)    (4.5)
where:
P(Y|cj) probability of event of case Y for a class cj;
P(yi|cj) probability of event of value yi for an attribute ai of an unknown
case Y for a class cj.
Using the training data, the probabilities P(yi|cj) are computed by equations
depending on the type of attribute ai. If ai is discrete, then:
P(yi|cj) = rhji / sj    (4.6)
where:
rhji number of training cases of class cj, value yi, order h, attribute ai;
sj number of training cases of class cj.
In another way, if ai is continuous, then:
P(yi|cj) = g(yi, μji, σji) = (1 / (σji √(2π))) exp(−(yi − μji)² / (2σji²))    (4.7)
where:
g(yi, μji, σji) Gaussian (normal) density function for attribute ai;
yi value yi for attribute ai of unknown case Y;
e Naperian number;
μji, σji mean and standard deviation, respectively, of attribute
values xi for attribute ai of training cases of class cj.
The mean is computed by:
μji = zji / r1ji = (Σk xijk) / r1ji    (4.8)
where:
zji sum of values xi for attribute ai for training cases of class cj;
r1ji number of training cases of class cj of any value xi, order h = 1 (h = 1
is used for existing values of continuous attributes and h = 0 is
used for missing values of continuous and discrete attributes) for
attribute ai;
xijk values xi for attribute ai for training cases of class cj.
The original standard deviation equation is:
(4.9)
where:
xijk value xi for attribute ai for training cases of class cj;
μji mean of values xi for attribute ai for training cases of class cj;
r1ji number of training cases of class cj of any value xi, order h = 1 (h = 1
is used for existing values of continuous attributes and h = 0 is
used for missing values of continuous and discrete attributes) for
attribute ai.
Substituting the mean values in the above equation and developing the square
operation, the standard deviation equation will be:
(4.10)
Simplifying:
(4.11)
(4.12)
(4.13)
(4.14)
where:
qji sum of the squares of the values xi for attribute ai for training cases of
class cj;
zji sum of the values xi for attribute ai for training cases of class cj;
r1ji number of training cases of class cj of any value xi, order h = 1 (h = 1
is used for existing values of continuous attributes and h = 0 is used
for missing values of continuous and discrete attributes) for
attribute ai.
(4.15)
Here P(cjt|Yt) (the probability of event of class cjt for a case Yt) for the prediction
attribute t is computed by the following equation (see eqns. (4.4) and (4.5)):
P(cjt|Yt) = (sjt / v) ∏i P(yit|cjt)    (4.16)
where:
sjt number of training cases of class cjt;
v total number of training cases;
P(yit|cjt) probability of event of value yit for an attribute ait of an unknown
case Yt for a class cjt.
The probabilities P(yit|cjt), defined by eqns. (4.5)–(4.7), for prediction attribute t,
are computed from the training data set by the following equations, depending on
the type of attribute ait. If ait is discrete, then (see eqn. (4.6)):
P(yit|cjt) = rthji / sjt    (4.17)
where:
rthji number of training cases of class cjt with the value yit, order h, for
attribute ait;
sjt number of training cases of class cjt.
In another way, if ait is continuous, then the following equation will be used
(see eqn. (4.7)):
P(yit|cjt) = g(yit, μtji, σtji) = (1 / (σtji √(2π))) exp(−(yit − μtji)² / (2σtji²))    (4.18)
where:
yit value yit for attribute ait of unknown case Yt;
e Naperian number;
μtji, σtji mean and standard deviation, respectively, for attribute val-
ues xit for attribute ait of training cases of class cjt, computed
by the following equations.
The mean, for prediction attribute t, is computed by (see eqn. (4.8)):
μtji = ztji / rt1ji    (4.19)
where:
ztji sum of values xit for attribute ait for training cases of class cjt;
rt1ji number of training cases of class cjt of any value xit, order h = 1
(h = 1 is used for existing values of continuous attributes and h = 0
is used for missing values of continuous and discrete attributes), for
attribute ait;
xijtk values xit for attribute ait for training cases of class cjt.
The standard deviation, for prediction attribute t, is computed by (see eqn. (4.14)):
(4.20)
where:
qtji sum of the squares of the values xit for attribute ait for training cases
of class cjt;
ztji sum of the values xit for attribute ait for training cases of
class cjt;
rt1ji number of training cases of class cjt of any value xit, order h = 1 (h = 1
is used for existing values of continuous attributes and h = 0 is
used for missing values of continuous and discrete attributes), for
attribute ait.
The procedure to select the maximum class probability, shown in listing 4.2, can
result in colossal processing times for situations with many classes for the
prediction attribute. An interesting solution for this problem has been shown by
Leung and Sze [17]. They used the Branch-and-Bound algorithm in a Chinese
character recognition problem, which has a training data set with 120 000
cases and 5515 classes.
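To make the statistics above concrete, the following is a small self-contained C++
sketch, not the book's DMM_SNaiveBayes implementation, of a Naïve Bayesian
classifier that keeps the per-class counts rhji for discrete attributes and the running
sums z and q (sum and sum of squares) and counts r for continuous attributes, so
that training is incremental and prediction selects the class with the greatest
P(cj) ∏ P(yi|cj). All type and function names are illustrative, and the sample
variance computed from z, q and r is an assumption about the exact formula.

#include <cmath>
#include <map>
#include <vector>

const double kPi = 3.14159265358979323846;

struct ContinuousStats {
    double z = 0.0;   // sum of attribute values for this class (see eqn (4.8))
    double q = 0.0;   // sum of squared attribute values for this class
    long   r = 0;     // number of existing (non-missing) values for this class
    void add(double x) { z += x; q += x * x; ++r; }            // incremental update
    double mean() const { return r > 0 ? z / r : 0.0; }
    double variance() const {                                   // assumed sample variance from z, q, r
        return r > 1 ? (q - z * z / r) / (r - 1) : 1.0;
    }
    double gaussian(double y) const {                           // normal density g(y, mean, sd)
        double v = variance(), m = mean();
        return std::exp(-(y - m) * (y - m) / (2.0 * v)) / std::sqrt(2.0 * kPi * v);
    }
};

struct SimpleNaiveBayes {
    std::map<int, long> classCount;                                  // s_j
    std::map<int, std::vector<std::map<int, long> > > discreteCount; // r_hji per class/attribute/value
    std::map<int, std::vector<ContinuousStats> > contStats;          // per class/attribute statistics
    long total = 0;                                                  // v

    // Incremental training with one case: class label, discrete values, continuous values.
    void train(int cls, const std::vector<int>& disc, const std::vector<double>& cont) {
        ++classCount[cls];
        ++total;
        auto& dc = discreteCount[cls];
        auto& cs = contStats[cls];
        dc.resize(disc.size());
        cs.resize(cont.size());
        for (std::size_t i = 0; i < disc.size(); ++i) ++dc[i][disc[i]];
        for (std::size_t i = 0; i < cont.size(); ++i) cs[i].add(cont[i]);
    }

    // Prediction: the class with the greatest P(c_j) * prod_i P(y_i | c_j).
    int predict(const std::vector<int>& disc, const std::vector<double>& cont) const {
        int best = -1;
        double bestScore = -1.0;
        for (const auto& entry : classCount) {
            const int cls = entry.first;
            const double sj = static_cast<double>(entry.second);
            double score = sj / static_cast<double>(total);            // prior s_j / v (eqn (4.4))
            const auto& dc = discreteCount.at(cls);
            for (std::size_t i = 0; i < disc.size(); ++i) {
                auto it = dc[i].find(disc[i]);
                double rhji = (it == dc[i].end()) ? 0.0 : static_cast<double>(it->second);
                score *= rhji / sj;                                     // eqn (4.6)
            }
            const auto& cs = contStats.at(cls);
            for (std::size_t i = 0; i < cont.size(); ++i)
                score *= cs[i].gaussian(cont[i]);                       // eqn (4.7)
            if (score > bestScore) { bestScore = score; best = cls; }
        }
        return best;
    }
};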
especially the DLL file name, and all constant values that identify the provider to
external applications, such as those using globally unique identifiers (GUIDs). The
detailed instructions to do this task, creating the DMclcMine provider, are shown in
Appendix B.3.
With this new DM provider, you must rename the MSNB classifier (see
Appendix B.4), or remove it to avoid conflict with the MSNB classifier supported
by DMSample.
Finally, the new provider can accept new algorithms. In the following section,
the implementation will be described in the context of the key operations to be
supported by a DM provider algorithm on DMMs, shown in Section 3.2.
Once the DMM has been populated, it has contents, which will be saved
and loaded by, respectively, the DMM_SNaiveBayes::Save and DMM_
SNaiveBayes::Load functions, also included in the DMM/SNaiveBayes.cpp
file.
is built using the training data set and the accuracy is computed using this model
to predict results using the testing data set.
The second method estimates the accuracy by cross-validation and is more robust.
In this method, the whole data set is divided into N blocks, each with as uniform a
number of cases and class distribution as possible. N different DMMs are then
built, each using as training data the whole data set excluding one block, and the
omitted block is used to evaluate the DMM. In this way, each case appears in exactly
one testing set. Usually, N = 10, and the average accuracy rate over the N testing data
sets is a good predictor of the accuracy of a model built from the whole data. Indeed,
this result slightly underestimates the accuracy, since each of the N models is
constructed from a subset of the whole data. All experiments carried out here used N = 10.
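As a sketch of the splitting step only (the actual experiments were driven by the
programs and utilities described in Appendix C), the following hypothetical C++
function builds N stratified folds by dealing the cases of each class round-robin
into the blocks, which keeps block sizes and class distributions as uniform as
possible; the function name and interface are assumptions for illustration.

#include <map>
#include <vector>

std::vector<std::vector<int> > makeFolds(const std::vector<int>& classOfCase, int N)
{
    // Group case indices by class label.
    std::map<int, std::vector<int> > byClass;
    for (int i = 0; i < static_cast<int>(classOfCase.size()); ++i)
        byClass[classOfCase[i]].push_back(i);

    // Deal the cases of each class round-robin into the N blocks (stratified folds).
    std::vector<std::vector<int> > folds(N);
    int next = 0;
    for (const auto& entry : byClass)
        for (int idx : entry.second)
            folds[next++ % N].push_back(idx);
    return folds;
}
// Each block is then used once as the testing set while the other N - 1 blocks train the model.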
In the tables that show the experiment results, the best values are highlighted in
bold and the columns of the SNBi classifier have a light gray background.
It should be noted that the times shown in these tables are elapsed times
and not the actual processing times for the considered task. During the experiments,
the computer was used as a standalone server, without any connected client.
Appendix C shows all details required for processing the experiments, such as
programs and utilities.
The results of the MSDT1 and MSDT2 classifiers identified by the question
mark (?) mean that these classifiers failed and the process did not reach a success-
ful termination. This error was reported to Microsoft and was qualified as a bug
to be fixed in a future release of MSSQL Analysis Services. For customers with
this problem, Microsoft has been suggesting a workaround, which involves
discretizing the continuous attributes, and is creating a Knowledge Base (KB)
article for this. For this reason, the tables also show the results of the classifier
MSDT2d, which uses discretized input attributes instead of continuous input
attributes. The null values of the train prediction times of the C4.5 classifier are
due to the fact that these values are very small and, when rounded to the unit,
become null. The data structure of this classifier made these short times possible.
The training times of all classifiers, with the exception of the C4.5 classifier,
demonstrated scalability, growing linearly with the increment of the number of
cases. This behavior can be better observed in the graph in fig. 5.1. SNBi is the
fastest of all considering the training times, while its prediction times were
among the worst, with the exception of WEKA. The poor performance of the
WEKA classifier is caused by the use of the Java virtual machine.
The non-linear behavior of the C4.5 classifier happened due to the great amount
of memory demanded by this classifier for the training of 1 million cases, exceed-
ing the efficient capacity of the equipment used. The intensive use of virtual
memory introduces the hard disk access time variable, causing this behavior.
Table 5.3 shows that the SNBi and WEKA results are identical, as expected, since
they have the same formulation. The accuracy results of the SNBi classifier were
among the worst; however, its testing results were the best, though not better than
the training results of the C4.5 classifier. In addition, if, in group d, the product
time × accuracy had been considered, the C4.5 classifier combines speed with
competitive accuracy.
The cross-validation time results of SNBi and MSDT2 are shown in tables 5.4
and 5.5. The SNBi training times remained among the lowest. The SNBi accuracy
results also showed the same tendency: the training results were the worst and the
testing results were the best.
In this experiment, the excellent uniform results shown by the SNBi classifier
are due to the fact that the data are artificially generated, using a normal distribu-
tion and without attribute dependencies.
aircraft using this airport. These data, collected daily, hour by hour, were formed
by 36 input attributes with 87 600 cases. Faults in the data collection caused some
missing attribute values in several cases.
Costa [22] carried out intensive data preparation that resulted in a data set
formed by 26 482 cases, 18 input attributes and one prediction attribute with seven
classes. This data set was split into two: one for training and another for testing.
This split was achieved using a uniform random distribution, maintaining the
same class distribution. The training data set is formed of 18 264 cases (68.97%
of the whole data) and the testing data set is formed of 8218 cases (31.03% of
the whole data). The parameters of the classifiers are shown in table 5.6.
As tables 5.7 and 5.8 show, the results of the SNBi2 classifier, with some discretized
input attributes, are better than those of the SNBi1 classifier, with all continuous input
attributes. This is due to the fact that the SNBi classifier uses a normal distribution
for continuous attributes. When only a few values exist for an attribute, the model will
certainly produce distortions. An experiment was done with all discretized
Table 5.5: Waveform - Cross-validation - Accuracy (%).
attributes and the obtained result was still better. However, some attributes, by
their own characteristics, cannot be modeled as discrete.
The elapsed processing times of all classifiers were comparable, except for
the C4.5 classifier, which showed the best performance. The accuracy results of
the SNBi classifier were among the worst, unlike in the previous experiment.
The cross-validation results of SNBi2 and MSDT1 are shown in tables 5.9 and
5.10. These results did not show significant differences, relative to previous results.
These worse results are probably due to dependence among the input
attributes, which is not considered by this classifier.
Results reported by Coelho and Ebecken [23] for the ROC, an academic Bayesian
classifier, were very close to those obtained by the SNBi2 classifier.
The source data are multi-relational, formed by 10 tables. For use in this exper-
iment, an SQL query over these tables generated a single one. The parameters of
the classifiers used in this experiment are listed in table 5.11. Some input
attributes are discrete and all classifiers use the same attribute types.
The prepared data set has 130 143 cases with 63 input attributes and one pre-
diction attribute with two classes (see Appendix C.5). This data set was split into
two: training and testing data sets. This split was achieved using a uniform ran-
dom distribution, maintaining the same class distribution. The training data set is
formed by 89 543 cases (68.80% of the whole data) and the testing data set is
formed by 40 600 cases (31.20% of the whole data). Tables 5.12 and 5.13 show
the results of elapsed processing times and accuracy.
In spite of showing the lowest training time among the compared classifiers,
the SNBi classifier again presented the worst result for accuracy. On the other
hand, the C4.5 classifier showed an excellent performance in all results. It should
be noted that the high elapsed processing time spent by the MSDT2 classifier, as
well as its accuracy results, are very close to those obtained by the C4.5 classifier.
This is probably due to the fact that both classifiers use the entropy method to con-
trol decision tree growth.
The lowest accuracy, presented by the SNBi classifier, is probably due to
dependence among the input attributes, which is not considered by this classifier.
The cross-validation results of SNBi and MSDT1 are shown in tables 5.14 and
5.15. These results did not show significant differences relative to the previous results.
Again, the results reported by Coelho and Ebecken [23] for the ROC, an academic
Bayesian classifier, were very close to those obtained by the SNBi classifier.
The same experiment was carried out with the original (unprepared) data [26].
In general, all of the results, for accuracy and elapsed processing times, were
worse than those obtained for the prepared data, justifying the need for this prepa-
ration. The ratios among the results of the classifiers were practically the same.
attributes and number of prediction attributes. The results of the experiments were
compared to those produced by the MSDT classifier.
The data sets used by these experiments are generated artificially by a com-
puter program [28]. Details of this generation are shown in Appendix C.6.
In each experiment, the training time is measured by varying one variable.
Table 5.16 shows the parameters and description of the classifiers used in these
experiments. All input and prediction attributes are discrete.
25 000, 50 000, 75 000, 100 000, 1 million, 2 million and 10 million. The number
of input attributes was fixed at 20 with 25 different states. The prediction attribute
had five classes. The graph in fig. 5.3 displays the training elapsed processing times.
Again, the SNBi classifier showed absolute linear behavior and a better per-
formance than the MSDT classifier.
The graph in fig. 5.4 displays the elapsed processing times used to predict the
same data sets. This experiment made possible the evaluation of the processing
capacity of the equipment used. When processing the prediction of 1 million cases
with the SNBi classifier, there was a lack of memory after 2 h 16 min of elapsed
time. At that time, the installed virtual memory was only 1024 MB. Only after the
virtual memory was increased to 2048 MB could this task be completed.
However, the task exceeds the efficient capacity of the equipment due to the
intensive use of virtual memory, which causes the non-linear behavior in the
results shown in fig. 5.4.
Another problem occurred with the prediction task of 1 million cases using the
MSDT classifier. DM providers retrieve the cases using a simple query of a table
with 10 million records stored in Microsoft® SQL Server™. However, a non-con-
figurable time limit of 30 s exists to carry out query operations in this server. As
queries of this size can exceed this limit, this task is impossible. This is a bug
reported to Microsoft that should be fixed in future releases of the server. To
bypass this problem, tables can be used instead of queries. If the use of queries is
required, this can be done through Microsoft® SQL Query Analyzer, which allows
the configuration of this limit. Surprisingly, this problem was also solved by the
increase in virtual memory.
The graph in fig. 5.4 shows that the linear behavior of the SNBi classifier, for the
prediction task, occurred only until the number of cases reached 100 000. From
this value on, non-linear behavior occurred due to the high demand on memory
The results of the non-incremental SNBi classifier for this experiment correspond
to the accumulated values of the previous results, represented in the graph by the
line labeled SNBi a. The results of the incremental SNBi classifier are represented
in the graph by the line labeled SNBii. Obviously, the elapsed processing times of
the incremental method are lower than those obtained by the non-incremental
method. However, the comparison of the SNBi and SNBii lines shows that the
incremental method produced slightly better results (~2.5%) than those obtained
by the non-incremental method.
The graph in fig. 5.7 shows the linear behavior of the SNBi classifier, which
displayed a slight increase in training time when the number of prediction
attributes increased. Again, the MSDT classifier has shown better linear behavior
than that reported by Soni et al. [27].
5.5 Conclusions
The OLE DB for DM technology has proven to be a very useful tool in implement-
ing DM algorithms, achieving complete database querying and mining integration.
It has been demonstrated that the use of microcomputers is possible for
medium-level DM task solutions.
As predicted, due to its statistical formulation combined with the data struc-
ture of the implementation, the SNBi classifier has shown high scalability, within
the limits of the equipment used in the experiments.
Since an optimum algorithm for all problems does not exist (only an optimum
algorithm for each problem), the good results shown in the first experiment and
the excellent elapsed processing times (considering large data sets) identify the
SNBi classifier as an excellent choice, primarily for data exploration, classification
and prediction. Using incremental training, associated with prediction-by-parts
tasks, problems of virtually any size can be solved.
The non-linear behavior of the classifier for large data sets signals the need
for additional research using higher capacity equipment, including the use of
federations of database servers, as well as multiprocessor computers.
References
[1] Fayyad, U.M., Piatetsky-Shapiro, G. & Smyth, P., From Data Mining to
Knowledge Discovery in Databases, AI Magazine, Vol. 17, No. 3, pp. 37–54,
1996. http://kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf.
[2] Silver, D.L., Knowledge Discovery and Data Mining, MBA course notes of
Dalhousie University, Nova Scotia, Canada, 1998. http://plato.acadiau.ca/
courses/comp/dsilver.
[3] Kim, P. & Carroll, M., Making OLE DB for Data Mining queries against a
DM provider, Visual Basic utility application hosted by Analysis Services –
Data Mining Group web site, 2002. http://communities.msn.com/
AnalysisServicesDataMining.
[4] Witten, I.H. & Frank, E., Data Mining: Practical Machine Learning Tools
with Java Implementations, 1st ed., San Francisco, California, USA,
Morgan Kaufmann Publishers, 2000. http://www.cs.waikato.ac.nz/~ml.
[5] Microsoft Corporation, OLE DB for Data Mining Sample Provider,
Microsoft Corporation, Redmond, Washington, USA, 2002. http://www.
microsoft.com/downloads/details.aspx?FamilyID=d4402893-8952-4c9d-
b9bc-0d60c70d619d&DisplayLang=en.
[6] Microsoft Corporation, OLE DB for Data Mining Specification Version 1.0,
Microsoft Corporation, Redmond, Washington, USA, 2000. http://www.
microsoft.com/downloads/details.aspx?displaylang=en&familyid=01005f9
2-dba1-4fa4-8ba0-af6a19d30217.
[7] DMG – Data Mining Group, PMML – Predictive Model Markup Language
Version 2.0 Specification, Data Mining Group, 2002. http://www.dmg.org.
[8] W3C – World Wide Web Consortium, Extensible Markup Language (XML)
1.0 (Second Edition), Bray, T. et al. (eds.), W3C – World Wide Web
Consortium, 2000. http://www.w3.org/TR/REC-xml.
[9] Chaudhuri, S., Data Mining and Database Systems: Where is the
Intersection?, IEEE Data Engineering Bulletin, Vol. 21, No. 1, pp. 4–8,
1998. ftp://ftp.research.microsoft.com/pub/debull/98MAR-CD.pdf.
[10] Skonnard, A., Say UDA for All Your Data Access Needs, Microsoft
Interactive Developer, April, 1998. http://www.microsoft.com/mind/0498/uda/
uda.asp.
[11] Netz, A. et al., Integration of Data Mining and Relational Databases.
Proc. of the VLDB 2000, 26th Int’l Conf. on Very Large DataBases, Cairo,
Egypt, pp. 719–722, 2000. ftp://ftp.research.microsoft.com/users/
AutoAdmin/vldb00DM.pdf.
66 References
[12] Han, J. & Kamber, M., Data Mining: Concepts and Techniques, 1st ed., San
Francisco, California, USA, Morgan Kaufmann Publishers, 2001.
http://www.mkp.com.
[13] Microsoft Corporation, Microsoft® SQL Server™ 2000, Microsoft
Corporation, Redmond, Washington, USA, 2000. http://www.microsoft.
com/sql.
[14] Seidman, C., Data Mining with Microsoft® SQL Server™ 2000 – Technical
Reference, 1st ed., Redmond, Washington, USA, Microsoft Press, 2001.
http://www.microsoft.com/mspress.
[15] Tang, Z., Thin Client Decision Viewer Sample, ASP utility application
hosted by Analysis Services – Data Mining Group web site, 2002.
http://communities.msn.com/AnalysisServicesDataMining.
[16] Curotto, C.L., Thin Client Decision Viewer Sample Modif ied, ASP utility
application, 2002. http://curotto.com/doc/english/dmibrowser.
[17] Leung, C.H. & Sze, L., A Method to Speed Up the Bayes Classifier,
Engineering Applications of Artificial Intelligence, Vol. 11, No. 3, pp.
419–424, 1998. http://www.elsevier.com.
[18] Breiman, L. et al., Classification And Regression Trees, 1st ed., Boca Raton,
Florida, USA, Chapman & Hall/CRC, 1984. http://www.crcpress.com.
[19] Chickering, D.M., Geiger, D. & Heckerman, D., Learning Bayesian Networks:
The Combination of Knowledge and Statistical Data, Technical
Report MSR-TR-94-09, Microsoft Research, Microsoft Corporation,
Redmond, Washington, USA, 1994. ftp://ftp.research.microsoft.com/pub/tr/
tr-94-09.ps.
[20] Quinlan, J.R., C4.5: Programs for Machine Learning, 1st ed., San Mateo,
California, USA, Morgan Kaufmann Publishers, 1993. http://www.cse.
unsw.edu.au/~quinlan.
[21] Blake, C.L. & Merz, C.J., UCI Repository of Machine Learning Databases,
University of California, Department of Information and Computer Science,
Irvine, CA, USA, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[22] Costa, M.C.A., Data Mining on High Performance Computing Using
Neural Networks, D.Sc. Thesis (in Portuguese), COPPE/UFRJ – Civil
Engineering Graduate Program – Computational Systems, Rio de Janeiro,
RJ, Brazil, 1999. http://www.coc.ufrj.br.
[23] Coelho, P.S.S. & Ebecken, N.F.F., A Comparison of some Classification
Techniques, In: Data Mining III (Proc. of the 3rd Int'l Conf. on Data
Mining), Bologna, Italy, Zanasi, A. et al. (eds.), WIT Press, pp. 573–582,
2002. http://www.witpress.com.
[24] Staudt, M., Kietz, J.U. & Reimer, U., A Data Mining Support Environment
and its Application on Insurance Data. Proc. of the KDD'98, 4th Int'l Conf.
on Knowledge Discovery and Data Mining , New York City, New York,
USA, Agrawal, R., Stolorz, P.E. & Piatetsky-Shapiro, G. (eds.), AAAI
Press, pp. 105–111, 1998. http://kietz.ch/kdd98.ps.gz.
[25] Kietz, J.U. & Staudt, M., KDD Sisyphus I – Dataset, Information Systems
Research, CH/IFUE, Swiss Life, Zurich, Switzerland, 1998.
http://www.cs.wisc.edu/~lists/archive/dbworld/0426.html.
If you are using MSVS 6.0 to perform these steps, DMSample will be built
and registered.
However, if you are using MSVS .NET, you need to do more, because you
must port the source code to this new version of MSVS. The instructions for this
task can be seen in the idma/DMSampleModScript.txt script file.
After you start the MSSQL OLAP Service again, the new algorithm will
appear in the Mining Algorithm list of Basic Properties of the Relational
Mining Model Editor of the MSSQL Analysis Services Manager as Sample DM
Algorithm.
#ifndef __dmstandalone_hpp__
#define __dmstandalone_hpp__
#ifndef __cplusplus
#error dmstandalone.hpp requires C++ compilation
#endif
SP_NO_INTERFACES(IDMASecurityNotifyDummy, IDMASecurityNotify);
HRESULT STDMETHODCALLTYPE PreCreateMiningModel(
    /* [string][in] */ wchar_t __RPC_FAR *in_strMiningModelName)
{ return S_OK; };
HRESULT STDMETHODCALLTYPE PreRemoveMiningModel(
    /* [string][in] */ wchar_t __RPC_FAR *in_strMiningModelName)
{ return S_OK; };
HRESULT STDMETHODCALLTYPE PostCreateMiningModel(
    /* [string][in] */ wchar_t __RPC_FAR *in_strMiningModelName,
    /* [in] */ HRESULT in_hResult)
{ return S_OK; };
HRESULT STDMETHODCALLTYPE PostRemoveMiningModel(
    /* [string][in] */ wchar_t __RPC_FAR *in_strMiningModelName,
    /* [in] */ HRESULT in_hResult)
{ return S_OK; };
};
#endif // __dmstandalone_hpp__
2. Open the DMProv/StdAfx.h file:
• Insert the following statement in the DMProv header files:
#include "dmstandalone.hpp"
3. Open the DMProv/Session.hpp file:
• Insert the following private variables in the Session class:
bool IsAdviseSecurityCalled;
bool IsSecurityNotifyDummyInstantiated;
• Replace the GetDMASecurityNotify function by:
HRESULT GetDMASecurityNotify(IDMASecurityNotify ** out_ppSecurityNotify)
{
    if (IsAdviseSecurityCalled)
    {
        if (!m_spSecurityNotify) return E_FAIL;
        return m_spSecurityNotify.CopyTo(out_ppSecurityNotify);
    }
    else
    {
        if (!IsSecurityNotifyDummyInstantiated)
        {
            SPObject<IDMASecurityNotifyDummy>::CreateInstance(
                (IDMASecurityNotifyDummy **) &m_spSecurityNotify);
            if (!m_spSecurityNotify) return E_FAIL;
            IsSecurityNotifyDummyInstantiated = true;
        }
        return m_spSecurityNotify.CopyTo(out_ppSecurityNotify);
    }
}
• Replace the IDMASecurityAdvise methods by:
STDMETHOD (AdviseSecurity)(IDMASecurityNotify * in_pSecurityNotify)
{
    m_spSecurityNotify = in_pSecurityNotify;
    IsAdviseSecurityCalled = true;
    return S_OK;
}
STDMETHOD (UnadviseSecurity)(IDMASecurityNotify * in_pSecurityNotify)
{
    if (m_spSecurityNotify != in_pSecurityNotify) return E_INVALIDARG;
    m_spSecurityNotify.Release();
    IsAdviseSecurityCalled = false;
    return S_OK;
}
`[Mm][Ss]_[Nn][Aa][Ii][Vv][Ee]_[Bb][Aa][Yy][Ee][Ss]` ("MS_Naive_Bayes")
• Replace T_SampleAlgorithm by T_MSNaiveBayes;
• Replace PSampleAlg by PMSNaiveBayesAlg;
• Execute Debug/Compile;
• Execute Debug/Generate Files. You will receive a message from
Visual Parse++: Auto-merge failed, your original file is in
./dmreduce.bak00x... You must remove dmreduce.cpp because it is
corrupted! The file dmreduce.bak00x must be renamed to
dmreduce.cpp.
2. Open the DMParse/dmreduce.cpp file and replace SP_PSampleAlg by
SP_PMSNaiveBayesAlg.
3. Open the DMProv/SchemaRowsets.h and DMProv/SchemaRowsets.cpp
files and replace:
• SERVICE_GUID_Sample_Algorithm by
SERVICE_GUID_MS_Naive_Bayes;
• SERVICE_NAME_SAMPLE_ALGORITHM by
SERVICE_NAME_MS_NAIVE_BAYES.
4. Open the DMProv/SchemaRowsets.h file and replace the value of the
SERVICE_NAME_MS_NAIVE_BAYES constant by "MS_Naive_Bayes"
(it must be exactly the same value as the T_MSNaiveBayes token
definition). This data is used by the Relational and OLAP Mining Model
editors to build the DMSQL queries.
5. Open the DMProv/DMSProv.rc, DMProv/SchemaRowsets.cpp and
DMBase/dmresource.h files and replace:
• IDS_SERVICE_SAMPLE_ALGORITHM_DESCRIPTION by
IDS_SERVICE_MS_NAIVE_BAYES_DESCRIPTION;
• IDS_SERVICE_SAMPLE_ALGORITHM_DISPLAY_NAME by
IDS_SERVICE_MS_NAIVE_BAYES_DISPLAY_NAME.
6. Open the DMProv/DMSProv.rc file and replace the values of the
IDS_SERVICE_MS_NAIVE_BAYES_DISPLAY_NAME and
IDS_SERVICE_MS_NAIVE_BAYES_DESCRIPTION strings. These
strings are used in the Mining Algorithm list of the Relational and OLAP
Mining Model editors. They can be different from the token definition.
7. Open the DMCore/dmmodel.h, DMCore/dmmodel.cpp,
DMParse/dmreduce.cpp, DMProv/dmxbind.cpp and
DMProv/dmxmlpersist.cpp files and replace
DM_ALGORITHMIC_METHOD_SAMPLE_ALGORITHM by
DM_ALGORITHMIC_METHOD_MS_NAIVE_BAYES.
12. Insert new error codes associated with the new parameter using the
instructions in Section B.8.
DM_DEFINE_ERROR_HRESULT(SPE_DMM_ALG_CANT_PREDICT_CONTINUOUS_ATTRIBUTE)
4. Open the DMM/DMM.idl file and insert the return error code definition
according to the syntax DMME_<error code>. For this example, the
following statement must be inserted:
DMM_DEFINE_ERROR_HRESULT(DMME_ALG_CANT_PREDICT_CONTINUOUS_ATTRIBUTE, 0x000D)
5. Open the DMProv/DMErrors.cpp file and insert the appropriate
statements regarding the new error in the DMMapDMMErrorCode
function.
Other files that deal with error codes are the following:
DMBase/DMErrors.hpp, DMBase/DMErrors.inl and
DMBase/dmbsglob.cpp.
B.10 Debugging
The DM provider can be debugged either running integrated with the Analysis
Services server or as a standalone server.
The MSSQL OLAP Service must be stopped before starting debugging. Note
that the DLLs of DM providers will not be loaded until their DM functionality is
requested, unless you ask the debugger to load their symbols initially.
At the start of debugging in MSVS, a console window will appear with the mes-
sage: The Analysis server started successfully.
Start MSSQL Analysis Services Manager and select the Mining Models
folder of any database.
Each registered DM provider DLL will be called and the following functions
will be processed:
• DllMain (DMProv/DMSProv.cpp);
• DllGetClassObject (DMProv/DMSProv.cpp);
• SPDataSource::Initialize (DMProv/DataSrc.cpp);
• SPModel::Load (DMProv/dmxmlpersist.cpp), called by the previous
function in the statement:
SP_CHECK_ERROR(pDataManager->Init(pQC));
This last function retrieves all existing DMMs, reading the following segments:
• statements – DMM creation statements;
• data-dictionary – Data dictionary definition;
• global-statistics – Global statistics data;
• Simple-naive-bayes-model – DMM contents.
Reading the data-dictionary segment involves the following functions:
• SPModel::LoadColumn (DMProv/dmxmlpersist.cpp);
• SPQueryManager::DoTrainAttributes (DMQM/DMQueryManager.
cpp), called for each attribute to compute all values of discrete attributes.
When a specific DMM is selected and Action/Process is activated, the fol-
lowing functions will be processed:
• SPQueryManager::DoCreate (DMQM/DMQueryManager.cpp);
• SPDataManager::CreateModel (DMParse/dmtools.cpp);
• SPQueryManager::DoTrainModel (DMQM/DMQueryManager.cpp).
The functions SPModel::Load and SPQueryManager::DoTrainModel
call the DMM-specific functions.
B.12 Bugs
A few bugs w ere detected w hile using DM pro viders and MSSQL Analysis
Services:
9 When you use an in valid algorithm name, the er ror message repor ted by
Analysis Server is incor rect. This bug w as not f ixed yet and ma y be in:
__DMSetLastErrorExHelper function of DMBase/dmbsglob.cpp
file; PFError::SetLastError (pnerror.cpp file of Putl project) or
PNWriteDebug. (pnglobal.cpp file of Putl project);
• Opening a long-running view from MSSQL Enterprise Manager times out.
The query time-out is 30 s, and this value is not configurable.
To circumvent this bug, you can use Microsoft® SQL Query Analyzer to
run the long-running view. If necessary, you can increase the query
time-out of this tool through Tools/Options/Connections/Query time-out,
setting it to 9999 seconds, for example. Using a zero value to allow queries
without a time-out does not seem to work.
• Running the MSDT classifier with very large training data sets and continuous
input attributes causes MSDMine to crash. This error was reported to
Microsoft and was qualified as a bug to be fixed in a future release of
MSSQL Analysis Services. For customers with this problem, Microsoft
has suggested a workaround, i.e. discretizing the continuous attributes,
and will create a Knowledge Base (KB) article for it.
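To illustrate this workaround, a minimal sketch of a DMSQL statement that declares a
continuous input as discretized instead; the model and column names are hypothetical, and
the sketch assumes the provider supports the DISCRETIZED content flag of OLE DB for DM:
CREATE MINING MODEL [LargeTrainDT] (
    [Id]    LONG   KEY,
    [Att01] DOUBLE DISCRETIZED(EQUAL_AREAS, 10), -- declared discretized instead of CONTINUOUS
    [Class] TEXT   DISCRETE PREDICT
) USING Microsoft_Decision_Trees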
C.1.1 COMPLEXITY_PENALTY
A floating point number in the range between 0 and 1 (exclusive) that acts as a
penalty for growing a tree. The parameter is applied at each additional split. A
value of 0 applies no penalty, and a value close to but not equal to 1 (1.0 is outside
the range) applies a high penalty. Applying a penalty limits the depth and
complexity of the learned trees, which avoids over-fitting. However, too high a
penalty may adversely affect the predictive ability of the learned model. The
effect of this mining parameter depends on the mining model itself; some
experimentation and observation may be required to tune the DM model
accurately. The default value is based on the number of attributes for a given model:
• For 1–9 attributes, the value is 0.5;
• For 10–99 attributes, the value is 0.9;
• For 100 or more attributes, the value is 0.99.
C.1.2 MINIMUM_LEAF_CASES
A non-negative integer within the range 0 to 2,147,483,647. It determines the
minimum number of leaf cases required to generate a split in the decision tree.
A low value causes more splits in the decision tree but can increase the
likelihood of over-fitting. A high value reduces the number of splits in the
decision tree but can inhibit the growth of the decision tree. The default value
is 10.
C.1.3 SCORE_METHOD
This identifies the algorithm used to control the growth of a decision tree. This
algorithm selects the attributes that constitute the tree, the order in which the attributes
are used, the way in which the attribute values should be split up and the point at which
the tree should stop growing. Valid values are 1, 2, 3 and 4, with the following meanings:
1. Entropy: based on the entropy gain of the classifier.
2. Orthogonal: a home-grown method based on the orthogonality of the
classifier state distribution. This scoring method yields binary splits
only, which may end up with too-deep trees.
3. Bayesian with K2: based on the Bayesian score with a K2 prior.
4. Bayesian Dirichlet Equivalent with Uniform prior: the default scoring
method [19].
C.1.4 SPLIT_METHOD
This describes the ways in which SCORE_METHOD should consider splitting up
attribute values. For example, if an attribute has five potential values, the values
could be split into binary branches (for example, 3 and 1,2,4,5), the values
could be split into five separate branches, or some other combination may be
considered. A value of 1 results in decision trees that have only binary branches; a
value of 2 results in decision trees with multiple (n-ary) branches; a value of 3
(the default) allows the algorithm to use binary or multiple branches as needed.
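For illustration, a minimal sketch of a DMSQL statement that sets these four parameters
when creating a decision-tree DMM; the model and column names are hypothetical, and the
sketch assumes the provider accepts algorithm parameters in the USING clause:
CREATE MINING MODEL [WaveDT] (
    [Id]    LONG   KEY,
    [Att01] DOUBLE CONTINUOUS,
    ...
    [Class] TEXT   DISCRETE PREDICT
) USING Microsoft_Decision_Trees (
    COMPLEXITY_PENALTY = 0.9,  -- stronger penalty against deep trees
    MINIMUM_LEAF_CASES = 10,   -- the default minimum number of cases per leaf
    SCORE_METHOD = 4,          -- Bayesian Dirichlet Equivalent with Uniform prior
    SPLIT_METHOD = 3           -- binary or n-ary splits, as needed
)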
It should be noted that two batch command file tasks, c45exe.bat and wekaNaive.bat,
correspond, respectively, to the processing of the C4.5 and WEKA classifiers.
The data import workflow of this experiment (idma/data/waveform/waveform_Import.dts)
is shown in fig. C.9, where two distinct data files should be
noted, one for training and another for testing. These data are accessed by two
Text File (Source) data connection objects, named Train Data, for reading the
training data set, and Test Data, for reading the testing data set. This figure also
shows a double SQLOLEDB database connection object, named Waveform,
which allows connecting to the data in MSSQL.
To begin, all existing data are removed from the Waveform MSSQL database by
an SQL statement, included in an Execute SQL Task named Remove data:
DELETE FROM Waveform
Next, an ActiveX Script Task with Visual Basic® script commands
initializes the record counter, as shown in listing C.1.
In the next step, the training data of the text file are archived in MSSQL
by a Transform Data Task, defined by another Visual Basic® script, as
shown in listing C.2. A similar task transfers the testing data set.
Fig. C.10 shows the processing workflow of the training and prediction
tasks for the MSDT1 classifier (idma/data/waveform/waveform_MSDT1_Train&Predict.dts).
The running time control tasks (Init TrainPredTime,
Init TestPredTime and End TestPredTime), carried out by stored procedures
included in the idma/data/waveform/waveform.sql script file, should be
noted; a sketch of such a procedure is given below. The processing workflow of
the other classifiers is similar to this.
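As a minimal sketch of what such a running time control procedure could look like; the table
layout and procedure names below are assumptions made for illustration only, since the actual
procedures are those defined in idma/data/waveform/waveform.sql:
CREATE TABLE RunTime (Task VARCHAR(32), StartTime DATETIME, EndTime DATETIME)
GO
CREATE PROCEDURE Init_TrainPredTime AS
    -- record when the training/prediction phase starts
    INSERT INTO RunTime (Task, StartTime) VALUES ('TrainPred', GETDATE())
GO
CREATE PROCEDURE End_TestPredTime AS
    -- record when the test prediction phase ends
    -- (Init_TestPredTime would insert the 'TestPred' row in the same way)
    UPDATE RunTime SET EndTime = GETDATE() WHERE Task = 'TestPred'
GO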
The processing workflow of the cross-validation task (idma/data/
waveform/waveform_Cross_Validation.dts) is shown in fig. C.11.
Figure C.8: Waveform – Full processing workflow of the first part of the
experiment.
On the other hand, fig. C.12 shows the processing workflow of the block data import
(idma/data/waveform/waveform_Import_Blocks.dts). Data transfer is
carried out by a Visual Basic® script similar to that shown in listing C.2, while
the stored procedure that creates the blocks, named Create_Blocks and called by the
Create Blocks Execute SQL Task, is included in the idma/data/waveform/
waveform.sql script file.
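As an illustration only, a sketch of what a block creation procedure of this kind could look
like; it assumes the Waveform table carries an integer Id column and a Block column, which
may differ from the actual Create_Blocks procedure shipped in waveform.sql:
CREATE PROCEDURE Create_Blocks @BlockSize INT AS
    -- assign consecutive records to blocks of @BlockSize cases each
    UPDATE Waveform SET Block = ((Id - 1) / @BlockSize) + 1
GO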
the other classifiers are similar to this. Extensive repetitions in these statements
are replaced by three consecutive dots (...).
dry fog or smoke (2), sand or dust (3), moist fog or thick fog (4), sprinkle (5), rain
(6), snow (7), thunderstorm or lightning (8), and hail (9). After data preparation,
the value 9 (occurring only once) and the values 3 and 7 (never occurring) were
excluded; thus, in the end, this attribute had seven values.
Fig. C.13 shows the full processing workflow of this part of the experiment
(idma/data/meteo/meteo.dts) and fig. C.14 illustrates the data import workflow
(idma/data/meteo/meteo_Import.dts). There is a double SQLOLEDB
database connection object, named Meteo, which allows connecting to the data in
MSSQL.
There are three Text File (Source) data connection objects, named All
Data, to read the whole data set; Train Data, to write the training data in text
format; and Test Data, to write the testing data in text format. These
text-formatted data are used by the C4.5 classifier.
To begin, all existing data are removed from the Meteo MSSQL database by an
SQL statement included in an Execute SQL Task named Remove data:
DELETE FROM Meteo
Finally, two simple data transfer tasks archive the training and the testing
data sets in text files. These tasks use two distinct SQL statements to select
the desired data, as can be seen in the two following lines:
SELECT * FROM Meteo WHERE Train=1 -- training data
SELECT * FROM Meteo WHERE Train=0 -- testing data
The processing workflow of the training and prediction tasks is similar to that
shown in fig. C.10 of the Waveform experiment. Also, the cross-validation
procedures are similar to those shown in the Waveform experiment and will not be
described here.
Figure C.13: Meteo – Full processing workflow of the first part of the
experiment.
living in can be found in table hhold. Each partner can play roles in certain
insurance policies (table vvert), realized by table parrol. If a partner is the
insured person of the contract, then tariff role records (table tfrol) specify
certain further properties. An insurance contract can have several components (e.g.
the main contract part plus a component insuring the case that the insured
person becomes invalid), each of which (recorded in table tfkomp) is related
to a tariff role of the respective partner. Finally, each policy concerns a certain
product (table prod) and tariff components are bound to dedicated insurance
tariffs (table lvtarf).
The prod and lvtarf tables were not distributed. The taska and taskb tables
contain, respectively, classes assigned to partners and to households.
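For orientation, a sketch of how these tables might be joined to assemble one classification
case for the partner task; the partner table name (part) and all key column names below are
hypothetical, since the actual keys are not listed here:
SELECT p.*, h.*, t.class
FROM part   p                               -- partner data
JOIN hhold  h ON h.hhold_id  = p.hhold_id   -- household the partner lives in
JOIN parrol r ON r.part_id   = p.part_id    -- role the partner plays in a policy
JOIN vvert  v ON v.policy_id = r.policy_id  -- the insurance policy itself
JOIN taska  t ON t.part_id   = p.part_id    -- class assigned to the partner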
Listing C.6: Meteo – SNBi1 classifier DMSQL statements to predict the testing
data set.
Table C.5: Insurance – prepared data – DMSQL statements to create the DMM.
Table C.6: Insurance – original data – DMSQL statements to create the DMM.
Table C.9 shows the DMSQL statements to create the DMM of the
VarNumCases experiment and listing C.12 shows the DMSQL statement to pre-
dict the testing data set for the SNBi classifier of this same experiment.
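For reference, a minimal sketch of the general shape of such a DMSQL prediction statement;
the model, table and column names are hypothetical and the connection string is omitted:
SELECT T.[Id], M.[Class], PredictProbability(M.[Class]) AS Prob
FROM [VarNumCases_SNBi] AS M
PREDICTION JOIN
    OPENROWSET('SQLOLEDB', '...', 'SELECT * FROM TestData') AS T
ON M.[Att01] = T.[Att01] AND
   ...
   M.[Att12] = T.[Att12]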
Table C.10 shows the DMSQL statements to create the DMM of the
VarStates experiment, while table C.11 shows the DMSQL statements to create the
DMM with 32 prediction attributes of the VarPredAtt experiment. The
statements for other numbers of attributes are similar to these.
These last two experiments do not have any prediction task.
Table C.8: VarInpAtt – DMSQL statements to create the DMM with 12 input
attributes.