WEKA Manual For Version 3-6-5
$$\prod_{u \in U} p(u \mid pa(u)).$$
Below, a Bayesian network is shown for the variables in the iris data set.
Note that the links between the nodes class, petallength and petalwidth do not
form a directed cycle, so the graph is a proper DAG.
This picture just shows the network structure of the Bayes net, but for each
of the nodes a probability distribution for the node given its parents is specified
as well. For example, in the Bayes net above there is a conditional distribution
for petallength given the value of class. Since class has no parents, there is an
unconditional distribution for class.
CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS
Basic assumptions
The classification task consists of classifying a variable $y = x_0$, called the class
variable, given a set of variables $x = x_1 \ldots x_n$, called attribute variables. A
classifier $h : x \to y$ is a function that maps an instance of $x$ to a value of $y$.
The classifier is learned from a dataset $D$ consisting of samples over $(x, y)$. The
learning task consists of finding an appropriate Bayesian network given a data
set $D$ over $U$.
All Bayes network algorithms implemented in Weka assume the following for
the data set:
• all variables are discrete finite variables. If you have a data set with
  continuous variables, you can use the following filter to discretize them:
  weka.filters.unsupervised.attribute.Discretize
• no instances have missing values. If there are missing values in the data
  set, values are filled in using the following filter:
  weka.filters.unsupervised.attribute.ReplaceMissingValues
The first step performed by buildClassifier is checking if the data set
fulfills those assumptions. If those assumptions are not met, the data set is
automatically filtered and a warning is written to STDERR (see footnote 1).
Inference algorithm
To use a Bayesian network as a classifier, one simply calculates $\arg\max_y P(y \mid x)$
using the distribution $P(U)$ represented by the Bayesian network. Now note
that
$$P(y \mid x) = P(U)/P(x) \propto P(U) = \prod_{u \in U} p(u \mid pa(u)) \qquad (8.1)$$
And since all variables in $x$ are known, we do not need complicated inference
algorithms, but just calculate (8.1) for all class values.
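The calculation can be illustrated with a minimal sketch. The toy network, its variable names and its probability values are invented for illustration and do not come from Weka or the iris example; the code only shows how (8.1) is evaluated once per class value and then normalized.

```python
from typing import Dict, List, Tuple

# Hypothetical toy network (not from the manual): class -> a and class -> b.
# parents[u] lists pa(u); cpt[u][(value of u, values of pa(u))] = p(u | pa(u)).
parents: Dict[str, List[str]] = {"class": [], "a": ["class"], "b": ["class"]}
cpt: Dict[str, Dict[Tuple[str, Tuple[str, ...]], float]] = {
    "class": {("yes", ()): 0.6, ("no", ()): 0.4},
    "a": {("t", ("yes",)): 0.9, ("f", ("yes",)): 0.1,
          ("t", ("no",)): 0.2, ("f", ("no",)): 0.8},
    "b": {("t", ("yes",)): 0.7, ("f", ("yes",)): 0.3,
          ("t", ("no",)): 0.5, ("f", ("no",)): 0.5},
}


def joint(assignment: Dict[str, str]) -> float:
    """P(U) as the product over all nodes u of p(u | pa(u)), as in (8.1)."""
    p = 1.0
    for u, pa in parents.items():
        p *= cpt[u][(assignment[u], tuple(assignment[v] for v in pa))]
    return p


def classify(x: Dict[str, str], class_values: List[str]) -> Tuple[str, Dict[str, float]]:
    """Return argmax_y P(y | x) together with the normalized posterior over y."""
    scores = {y: joint({**x, "class": y}) for y in class_values}
    total = sum(scores.values())  # this sum equals P(x)
    posterior = {y: s / total for y, s in scores.items()}
    return max(scores, key=scores.get), posterior


label, posterior = classify({"a": "t", "b": "f"}, ["yes", "no"])
```

Because P(x) is the same for every class value, the argmax can be taken over the unnormalized products; the division is only needed when the class distribution itself is wanted.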
Learning algorithms
The dual nature of a Bayesian network makes it natural to divide learning into
a two-stage process: first learn a network structure, then learn
the probability tables.
Footnote 1: If there are missing values in the test data, but not in the training data, the values are
filled in in the test data with a ReplaceMissingValues filter based on the training data.
There are various approaches to structure learning and in Weka, the following
areas are distinguished:
• local score metrics: Learning a network structure $B_S$ can be considered
  an optimization problem where a quality measure of a network structure
  given the training data $Q(B_S \mid D)$ needs to be maximized. The quality measure
  can be based on a Bayesian approach, minimum description length,
  information and other criteria. Those metrics have the practical property
  that the score of the whole network can be decomposed as the sum (or
  product) of the score of the individual nodes. This allows for local scoring
  and thus local search methods.
• conditional independence tests: These methods mainly stem from the goal
  of uncovering causal structure. The assumption is that there is a network
  structure that exactly represents the independencies in the distribution
  that generated the data. It then follows that if a (conditional) independency
  can be identified in the data between two variables, there is no
  arrow between those two variables. Once locations of edges are identified,
  the direction of the edges is assigned such that conditional independencies
  in the data are properly represented.
• global score metrics: A natural way to measure how well a Bayesian network
  performs on a given data set is to predict its future performance
  by estimating expected utilities, such as classification accuracy. Cross-validation
  provides an out-of-sample evaluation method to facilitate this
  by repeatedly splitting the data into training and validation sets. A Bayesian
  network structure can be evaluated by estimating the network's parameters
  from the training set and determining the resulting Bayesian network's
  performance against the validation set. The average performance
  of the Bayesian network over the validation sets provides a metric for the
  quality of the network.
  Cross-validation differs from local scoring metrics in that the quality of a
  network structure often cannot be decomposed into the scores of the individual
  nodes. So, the whole network needs to be considered in order to
  determine the score.
• fixed structure: Finally, there are a few methods so that a structure can
  be fixed, for example, by reading it from an XML BIF file (see footnote 2).
For each of these areas, different search algorithms are implemented in Weka,
such as hill climbing, simulated annealing and tabu search.
Once a good network structure is identified, the conditional probability tables
for each of the variables can be estimated.
You can select a Bayes net classifier by clicking the classifier Choose button
in the Weka explorer, experimenter or knowledge flow and finding BayesNet under
the weka.classifiers.bayes package (see below).
Footnote 2: See http://www-2.cs.cmu.edu/
$$\prod_{x_j \in pa(x_i)} r_j.$$
Note $pa(x_i) = \emptyset$ implies $q_i = 1$. We use $N_{ij}$ ($1 \le i \le n$, $1 \le j \le q_i$) to denote
the number of records in $D$ for which $pa(x_i)$ takes its $j$th value. We use $N_{ijk}$
($1 \le i \le n$, $1 \le j \le q_i$, $1 \le k \le r_i$) to denote the number of records in $D$
for which $pa(x_i)$ takes its $j$th value and for which $x_i$ takes its $k$th value. So,
$N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$. We use $N$ to denote the number of records in $D$.
Let the entropy metric $H(B_S, D)$ of a network structure and database be
defined as
$$H(B_S, D) = -N \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \frac{N_{ijk}}{N} \log \frac{N_{ijk}}{N_{ij}} \qquad (8.2)$$
and the number of parameters $K$ as
$$K = \sum_{i=1}^{n} (r_i - 1) \cdot q_i \qquad (8.3)$$
AIC metric The AIC metric $Q_{AIC}(B_S, D)$ of a Bayesian network structure
$B_S$ for a database $D$ is
$$Q_{AIC}(B_S, D) = H(B_S, D) + K \qquad (8.4)$$
A term $P(B_S)$ can be added [15] representing prior information over network
structures, but it is ignored for simplicity in the Weka implementation.
MDL metric The minimum description length metric $Q_{MDL}(B_S, D)$ of a
Bayesian network structure $B_S$ for a database $D$ is defined as
$$Q_{MDL}(B_S, D) = H(B_S, D) + \frac{K}{2} \log N \qquad (8.5)$$
Bayesian metric The Bayesian metric of a Bayesian network structure $B_S$
for a database $D$ is
$$Q_{Bayes}(B_S, D) = P(B_S) \prod_{i=0}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}$$
where $P(B_S)$ is the prior on the network structure (taken to be constant hence
ignored in the Weka implementation) and $\Gamma(.)$ the gamma-function. $N'_{ij}$ and
$N'_{ijk}$ represent choices of priors on counts restricted by $N'_{ij} = \sum_{k=1}^{r_i} N'_{ijk}$. With
$N'_{ijk} = 1$ (and thus $N'_{ij} = r_i$), we obtain the K2 metric [19]
$$Q_{K2}(B_S, D) = P(B_S) \prod_{i=0}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(r_i - 1 + N_{ij})!} \prod_{k=1}^{r_i} N_{ijk}!$$
With $N'_{ijk} = 1/(r_i \cdot q_i)$ (and thus $N'_{ij} = 1/q_i$), we obtain the BDe metric [22].
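To make the entropy, AIC and MDL definitions concrete, here is a small sketch that evaluates (8.2) to (8.5) from a table of counts. The nested-list layout, the function names and the toy counts are assumptions made for this illustration and are not Weka's data structures; the natural logarithm is also an assumption, since the text does not state the base.

```python
import math
from typing import List

# counts[i][j][k] holds N_ijk: node i, parent configuration j, value k.
Counts = List[List[List[int]]]


def entropy_metric(counts: Counts) -> float:
    """H(B_S, D) from (8.2). The factor N cancels against the 1/N inside the sum.
    Cells with N_ijk = 0 are skipped, using the convention 0 log 0 = 0."""
    total = 0.0
    for node in counts:
        for row in node:
            n_ij = sum(row)
            total += sum(n * math.log(n / n_ij) for n in row if n > 0)
    return -total


def num_parameters(counts: Counts) -> int:
    """K from (8.3): the sum over nodes of (r_i - 1) * q_i."""
    return sum((len(node[0]) - 1) * len(node) for node in counts)


def aic_metric(counts: Counts) -> float:
    """Q_AIC from (8.4)."""
    return entropy_metric(counts) + num_parameters(counts)


def mdl_metric(counts: Counts, n_records: int) -> float:
    """Q_MDL from (8.5)."""
    return entropy_metric(counts) + num_parameters(counts) / 2 * math.log(n_records)


# Hypothetical data: a binary class node without parents and one binary
# attribute whose only parent is the class, over N = 10 records.
toy: Counts = [[[6, 4]], [[5, 1], [1, 3]]]
h = entropy_metric(toy)
k = num_parameters(toy)
```

Each node contributes its own term to both sums, which is the decomposability property that local search methods rely on.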
8.2.2 Search algorithms
The following search algorithms are implemented for local score metrics:
• K2 [19]: hill climbing by adding arcs with a fixed ordering of variables.
  Specific option: randomOrder if true a random ordering of the nodes is
  made at the beginning of the search. If false (default) the ordering in the
  data set is used. The only exception in both cases is that if the initial
  network is a naive Bayes network (initAsNaiveBayes set true) the class
  variable is made first in the ordering.
• Hill Climbing [16]: hill climbing adding and deleting arcs with no fixed
  ordering of variables.
  useArcReversal if true, arc reversals are also considered when determining
  the next step to make.
• Repeated Hill Climber starts with a randomly generated network and then
  applies hill climber to reach a local optimum. The best network found is
  returned.
  useArcReversal option as for Hill Climber.
• LAGD Hill Climbing does hill climbing with look ahead on a limited set
  of best scoring steps, implemented by Manuel Neubach. The number
  of look ahead steps and number of steps considered for look ahead are
  configurable.
• TAN [17, 21]: Tree Augmented Naive Bayes where the tree is formed
  by calculating the maximum weight spanning tree using the Chow and Liu
  algorithm [18].
  No specific options.
• Simulated annealing [15]: using adding and deleting arrows.
  The algorithm randomly generates a candidate network $B'_S$ close to the
  current network $B_S$. It accepts the network if it is better than the current,
  i.e., $Q(B'_S, D) > Q(B_S, D)$. Otherwise, it accepts the candidate with
  probability
  $$e^{t_i \cdot (Q(B'_S, D) - Q(B_S, D))}$$
  where $t_i$ is the temperature at iteration $i$. The temperature starts at $t_0$
  and slowly decreases with each iteration.
  Specific options:
  TStart start temperature $t_0$.
  delta is the factor $\delta$ used to update the temperature, so $t_{i+1} = t_i \cdot \delta$.
  runs number of iterations used to traverse the search space.
  seed is the initialization value for the random number generator.
• Tabu search [15]: using adding and deleting arrows.
  Tabu search performs hill climbing until it hits a local optimum. Then it
  steps to the least bad candidate in the neighborhood. However, it does
  not consider points in the neighborhood it just visited in the last tl steps.
  These steps are stored in a so-called tabu-list.
  Specific options:
  runs is the number of iterations used to traverse the search space.
  tabuList is the length tl of the tabu list.
• Genetic search: applies a simple implementation of a genetic search algorithm
  to network structure learning. A Bayes net structure is represented
  by an array of $n \cdot n$ ($n$ = number of nodes) bits where bit $i \cdot n + j$ represents
  whether there is an arrow from node $j \to i$.
  Specific options:
  populationSize is the size of the population selected in each generation.
  descendantPopulationSize is the number of offspring generated in each
  generation.
  runs is the number of generations to generate.
  seed is the initialization value for the random number generator.
  useMutation flag to indicate whether mutation should be used. Mutation
  is applied by randomly adding or deleting a single arc.
  useCrossOver flag to indicate whether cross-over should be used. Cross-over
  is applied by randomly picking an index k in the bit representation
  and selecting the first k bits from one and the remainder from another
  network structure in the population. At least one of useMutation and
  useCrossOver should be set to true.
  useTournamentSelection when false, the best performing networks are
  selected from the descendant population to form the population of the
  next generation. When true, tournament selection is used. Tournament
  selection randomly chooses two individuals from the descendant population
  and selects the one that performs best.
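The bit representation and the cross-over step described for genetic search can be sketched as follows. The function names are hypothetical and this is not Weka's implementation; in particular, the sketch does not check that a decoded structure is acyclic, which a real search would have to handle.

```python
from typing import List, Set, Tuple

Arc = Tuple[int, int]  # (parent j, child i), i.e. an arrow j -> i


def encode(arcs: Set[Arc], n: int) -> List[bool]:
    """Bit i * n + j is set when there is an arrow from node j to node i."""
    bits = [False] * (n * n)
    for j, i in arcs:
        bits[i * n + j] = True
    return bits


def decode(bits: List[bool], n: int) -> Set[Arc]:
    """Inverse of encode: recover the set of arrows j -> i from the bit array."""
    return {(idx % n, idx // n) for idx, bit in enumerate(bits) if bit}


def crossover(first: List[bool], second: List[bool], k: int) -> List[bool]:
    """Take the first k bits from one structure and the remainder from another."""
    return first[:k] + second[k:]


n = 3
a = encode({(0, 1)}, n)   # arrow 0 -> 1 sets bit 1 * 3 + 0 = 3
b = encode({(1, 2)}, n)   # arrow 1 -> 2 sets bit 2 * 3 + 1 = 7
child = decode(crossover(a, b, 4), n)
```

With the cut at k = 4 the child keeps bit 3 from the first parent and bit 7 from the second, so it contains both arrows.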
8.3 Conditional independence test based structure learning
Conditional independence tests in Weka are slightly different from the standard
tests described in the literature. To test whether variables $x$ and $y$ are conditionally
independent given a set of variables $Z$, a network structure with arrows
$\forall_{z \in Z}\, z \to y$ is compared with one with arrows $\{x \to y\} \cup \forall_{z \in Z}\, z \to y$. A test is
performed by using any of the score metrics described in Section 8.2.1.
At the moment, only the ICS [25] and CI algorithms are implemented.
The ICS algorithm makes two steps: first find a skeleton (the undirected
graph with edges iff there is an arrow in the network structure) and second direct
all the edges in the skeleton to get a DAG.
Starting with a complete undirected graph, we try to find conditional independencies
$\langle x, y \mid Z \rangle$ in the data. For each pair of nodes $x$, $y$, we consider sets
$Z$ starting with cardinality 0, then 1 up to a user defined maximum. Furthermore,
the set $Z$ is a subset of nodes that are neighbors of both $x$ and $y$. If
an independency is identified, the edge between $x$ and $y$ is removed from the
skeleton.
The first step in directing arrows is to check for every configuration x--z--y
where $x$ and $y$ are not connected in the skeleton whether $z$ is in the set $Z$ of
variables that justified removing the link between $x$ and $y$ (cached in the first
step). If $z$ is not in $Z$, we can assign direction $x \to z \leftarrow y$.
Finally, a set of graphical rules is applied [25] to direct the remaining arrows.
Rule 1: i->j--k & i-/-k => j->k
Rule 2: i->j->k & i--k => i->k
Rule 3            m
                 /|\
                i | k    => m->j
    i->j<-k      \|/
                  j
Rule 4            m
                 / \
                i---k    => i->m & k->m
    i->j         \ /
                  j
Rule 5: if no edges are directed then take a random one (first we can find)
The ICS algorithm comes with the following options.
Since the ICS algorithm is focused on recovering causal structure, instead
of finding the optimal classifier, the Markov blanket correction can be made
afterwards.
Specific options:
The maxCardinality option determines the largest subset of Z to be considered
in conditional independence tests $\langle x, y \mid Z \rangle$.
The scoreType option is used to select the scoring metric.
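The skeleton search and the first arrow-directing step of ICS can be sketched as below. This is an illustrative reading of the description, not Weka's code: the independence test is passed in as a hypothetical callable (Weka performs it by comparing scores of two network structures), and the graphical rules 1 to 5 are left out.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

IndepTest = Callable[[str, str, FrozenSet[str]], bool]


def find_skeleton(nodes: List[str], independent: IndepTest, max_cardinality: int):
    """Start from a complete undirected graph and remove the edge x--y as soon
    as some set Z of common neighbours makes x and y independent."""
    adj: Dict[str, Set[str]] = {v: set(nodes) - {v} for v in nodes}
    sepset: Dict[FrozenSet[str], FrozenSet[str]] = {}
    for card in range(max_cardinality + 1):
        for x, y in combinations(nodes, 2):
            if y not in adj[x]:
                continue  # edge already removed
            common = sorted(adj[x] & adj[y])
            for z in combinations(common, card):
                if independent(x, y, frozenset(z)):
                    adj[x].discard(y)
                    adj[y].discard(x)
                    sepset[frozenset((x, y))] = frozenset(z)  # cached for step two
                    break
    return adj, sepset


def find_colliders(nodes: List[str], adj, sepset) -> List[Tuple[str, str, str]]:
    """For every x--z--y with x and y not connected: if z is not in the cached
    separating set of x and y, the arrows can be directed as x -> z <- y."""
    found = []
    for z in nodes:
        for x, y in combinations(sorted(adj[z]), 2):
            if y not in adj[x] and z not in sepset.get(frozenset((x, y)), frozenset()):
                found.append((x, z, y))
    return found


# Hypothetical oracle for the structure a -> c <- b: a and b are independent
# only when nothing is conditioned on.
def oracle(x: str, y: str, z: FrozenSet[str]) -> bool:
    return {x, y} == {"a", "b"} and len(z) == 0


adj, sepset = find_skeleton(["a", "b", "c"], oracle, max_cardinality=2)
colliders = find_colliders(["a", "b", "c"], adj, sepset)
```

In this toy run the edge a--b is removed with an empty separating set, so c is not in that set and the pair of arrows is directed towards c.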
8.4 Global score metric based structure learning
Common options for cross-validation based algorithms are:
initAsNaiveBayes, markovBlanketClassifier and maxNrOfParents (see Section 8.2
for description).
Further, for each of the cross-validation based algorithms the CVType can be
chosen out of the following:
• Leave one out cross-validation (loo-cv) selects $m = N$ training sets simply
  by taking the data set $D$ and removing the $i$th record for training set $D^t_i$.
  The validation set consists of just the $i$th single record. Loo-cv does not
  always produce accurate performance estimates.
• K-fold cross-validation (k-fold cv) splits the data $D$ into $m$ approximately
  equal parts $D_1, \ldots, D_m$. Training set $D^t_i$ is obtained by removing part
  $D_i$ from $D$. Typical values for $m$ are 5, 10 and 20. With $m = N$, k-fold
  cross-validation becomes loo-cv.
• Cumulative cross-validation (cumulative cv) starts with an empty data set
  and adds instances item by item from $D$. Each time an item is added,
  the next item to be added is classified using the then current state of the
  Bayes network.
Finally, the useProb flag indicates whether the accuracy of the classifier
should be estimated using the zero-one loss (if set to false) or using the estimated
probability of the class.
The following search algorithms are implemented: K2, HillClimbing, RepeatedHillClimber,
TAN, Tabu Search, Simulated Annealing and Genetic Search.
See Section 8.2 for a description of the specific options for those algorithms.
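The three CVType choices differ only in how training and validation records are paired. The sketch below expresses them as lists of index pairs; the function names are hypothetical, the contiguous fold boundaries are an assumption, and the cumulative variant follows a literal reading of the description (the first record is only added, never classified).

```python
from typing import List, Tuple

Split = Tuple[List[int], List[int]]  # (training indices, validation indices)


def k_fold(n: int, m: int) -> List[Split]:
    """Split records 0..n-1 into m approximately equal contiguous parts;
    the training set for fold f is everything outside part f."""
    bounds = [n * f // m for f in range(m + 1)]
    splits = []
    for f in range(m):
        lo, hi = bounds[f], bounds[f + 1]
        validation = list(range(lo, hi))
        training = [i for i in range(n) if not lo <= i < hi]
        splits.append((training, validation))
    return splits


def leave_one_out(n: int) -> List[Split]:
    """Loo-cv is k-fold cv with m = N."""
    return k_fold(n, n)


def cumulative(n: int) -> List[Split]:
    """Each record is classified using all records added before it."""
    return [(list(range(i)), [i]) for i in range(1, n)]


folds = k_fold(10, 5)
loo = leave_one_out(4)
cum = cumulative(3)
```

Leave-one-out and k-fold train on nearly all of the data for every split, while cumulative cv evaluates the early records with very little training data.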
8.5 Fixed structure learning
The structure learning step can be skipped by selecting a fixed network structure.
There are two methods of getting a fixed structure: just make it a naive
Bayes network, or read it from a file in XML BIF format.
8.6 Distribution learning
Once the network structure is learned, you can choose how to learn the probability
tables by selecting a class in the weka.classifiers.bayes.net.estimate
package.
The SimpleEstimator class produces direct estimates of the conditional
probabilities, that is,
$$P(x_i = k \mid pa(x_i) = j) = \frac{N_{ijk} + N'_{ijk}}{N_{ij} + N'_{ij}}$$
where $N'_{ijk}$ is the alpha parameter that can be set and is 0.5 by default. With
alpha = 0, we get maximum likelihood estimates.
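A short worked sketch of this estimate for a single parent configuration is given below. It assumes that the prior count for the configuration is the sum of the per-value prior counts, so that N'_ij equals alpha times the number of values of the node; the function name is hypothetical and the code is not Weka's SimpleEstimator.

```python
from typing import List


def simple_estimate(counts_row: List[int], alpha: float = 0.5) -> List[float]:
    """Estimate P(x_i = k | pa(x_i) = j) for all k from the counts N_ijk of one
    parent configuration j, using N'_ijk = alpha for every k."""
    n_ij = sum(counts_row)
    prior_ij = alpha * len(counts_row)  # assumed: N'_ij = sum over k of N'_ijk
    return [(n_ijk + alpha) / (n_ij + prior_ij) for n_ijk in counts_row]


smoothed = simple_estimate([3, 1])               # (3 + 0.5) / (4 + 1) and (1 + 0.5) / (4 + 1)
max_likelihood = simple_estimate([3, 1], alpha=0.0)
```

With a positive alpha no value receives probability zero, even when it never occurs together with a given parent configuration in the training data.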
With the BMAEstimator, we get estimates for the conditional probability
tables based on Bayes model averaging of all network structures that are substructures
of the network structure learned [15]. This is achieved by estimating
the conditional probability table of a node $x_i$ given its parents $pa(x_i)$ as
a weighted average of all conditional probability tables of $x_i$ given subsets of
$pa(x_i)$. The weight of a distribution $P(x_i \mid S)$ with $S \subseteq pa(x_i)$ used is proportional
to the contribution of network structure $\forall_{y \in S}\, y \to x_i$ to either the BDe
metric or K2 metric depending on the setting of the useK2Prior option (false
and true respectively).
8.7 Running from the command line
These are the command line options of BayesNet.
General options:
-t <name of training file>
Sets training file.
-T <name of test file>
Sets test file. If missing, a cross-validation will be performed on the
training data.
-c <class index>
Sets index of class attribute (default: last).
-x <number of folds>
Sets number of folds for cross-validation (default: 10).
-no-cv
Do not perform any cross validation.
-split-percentage <percentage>
Sets the percentage for the train/test set split, e.g., 66.
-preserve-order
Preserves the order in the percentage split.
-s <random number seed>
Sets random number seed for cross-validation or percentage split
(default: 1).
-m <name of file with cost matrix>
Sets file with cost matrix.
-l <name of input file>
Sets model input file. In case the filename ends with .xml,
the options are loaded from the XML file.
-d <name of output file>
Sets model output file. In case the filename ends with .xml,
only the options are saved to the XML file, not the model.
-v
Outputs no statistics for training data.
-o
Outputs statistics only, not the classifier.
-i
Outputs detailed information-retrieval statistics for each class.
-k
Outputs information-theoretic statistics.
-p <attribute range>
Only outputs predictions for test instances (or the train
instances if no test instances provided), along with attributes
(0 for none).
-distribution
Outputs the distribution instead of only the prediction
in conjunction with the -p option (only nominal classes).
-r
Only outputs cumulative margin distribution.
-g
Only outputs the graph representation of the classifier.
-xml filename | xml-string
Retrieves the options from the XML-data instead of the command line.
Options specific to weka.classifiers.bayes.BayesNet:
-D
Do not use ADTree data structure
-B <BIF file>
BIF file to compare with
-Q weka.classifiers.bayes.net.search.SearchAlgorithm
Search algorithm
-E weka.classifiers.bayes.net.estimate.SimpleEstimator
Estimator algorithm
The search algorithm option -Q and the estimator option -E are mandatory.
Note that it is important that the -E option is used after the -Q
option. Extra options can be passed to the search algorithm and the estimator
after the class name, following --.
For example:
java weka.classifiers.bayes.BayesNet -t iris.arff -D \
-Q weka.classifiers.bayes.net.search.local.K2 -- -P 2 -S ENTROPY \
-E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 1.0
Overview of options for search algorithms
weka.classifiers.bayes.net.search.local.GeneticSearch
-L <integer>
Population size
-A <integer>
Descendant population size
-U <integer>
Number of runs
-M
Use mutation.
(default true)
-C
Use cross-over.
(default true)
-O
Use tournament selection (true) or maximum subpopulatin (false).
(default false)
-R <seed>
Random number seed
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score type (BAYES, BDeu, MDL, ENTROPY and AIC)
weka.classifiers.bayes.net.search.local.HillClimber
-P <nr of parents>
Maximum number of parents
-R
Use arc reversal operation.
(default false)
-N
Initial structure is empty (instead of Naive Bayes)
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score type (BAYES, BDeu, MDL, ENTROPY and AIC)
weka.classifiers.bayes.net.search.local.K2
-N
Initial structure is empty (instead of Naive Bayes)
-P <nr of parents>
Maximum number of parents
-R
Random order.
(default false)
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score type (BAYES, BDeu, MDL, ENTROPY and AIC)
weka.classifiers.bayes.net.search.local.LAGDHillClimber
-L <nr of look ahead steps>
Look Ahead Depth
-G <nr of good operations>
Nr of Good Operations
-P <nr of parents>
Maximum number of parents
-R
Use arc reversal operation.
(default false)
-N
Initial structure is empty (instead of Naive Bayes)
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score type (BAYES, BDeu, MDL, ENTROPY and AIC)
weka.classifiers.bayes.net.search.local.RepeatedHillClimber
-U <integer>
Number of runs
-A <seed>
Random number seed
-P <nr of parents>
Maximum number of parents
-R
Use arc reversal operation.
(default false)
-N
Initial structure is empty (instead of Naive Bayes)
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score type (BAYES, BDeu, MDL, ENTROPY and AIC)
weka.classifiers.bayes.net.search.local.SimulatedAnnealing
-A <float>
Start temperature
-U <integer>
Number of runs
-D <float>
Delta temperature
-R <seed>
Random number seed
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score type (BAYES, BDeu, MDL, ENTROPY and AIC)
weka.classifiers.bayes.net.search.local.TabuSearch
-L <integer>
Tabu list length
-U <integer>
Number of runs
-P <nr of parents>
Maximum number of parents
-R
Use arc reversal operation.
(default false)
-N
Initial structure is empty (instead of Naive Bayes)
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score type (BAYES, BDeu, MDL, ENTROPY and AIC)
weka.classifiers.bayes.net.search.local.TAN
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score type (BAYES, BDeu, MDL, ENTROPY and AIC)
weka.classifiers.bayes.net.search.ci.CISearchAlgorithm
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score type (BAYES, BDeu, MDL, ENTROPY and AIC)
weka.classifiers.bayes.net.search.ci.ICSSearchAlgorithm
-cardinality <num>
When determining whether an edge exists a search is performed
for a set Z that separates the nodes. MaxCardinality determines
the maximum size of the set Z. This greatly influences the
length of the search. (default 2)
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score type (BAYES, BDeu, MDL, ENTROPY and AIC)
weka.classifiers.bayes.net.search.global.GeneticSearch
-L <integer>
Population size
-A <integer>
Descendant population size
-U <integer>
Number of runs
-M
Use mutation.
(default true)
-C
Use cross-over.
(default true)
-O
Use tournament selection (true) or maximum subpopulatin (false).
(default false)
-R <seed>
Random number seed
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
Score type (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use probabilistic or 0/1 scoring.
(default probabilistic scoring)
weka.classifiers.bayes.net.search.global.HillClimber
-P <nr of parents>
Maximum number of parents
-R
Use arc reversal operation.
(default false)
-N
Initial structure is empty (instead of Naive Bayes)
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
Score type (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use probabilistic or 0/1 scoring.
(default probabilistic scoring)
weka.classifiers.bayes.net.search.global.K2
-N
Initial structure is empty (instead of Naive Bayes)
-P <nr of parents>
Maximum number of parents
-R
Random order.
(default false)
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
Score type (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use probabilistic or 0/1 scoring.
(default probabilistic scoring)
weka.classifiers.bayes.net.search.global.RepeatedHillClimber
-U <integer>
Number of runs
-A <seed>
Random number seed
-P <nr of parents>
Maximum number of parents
-R
Use arc reversal operation.
(default false)
-N
Initial structure is empty (instead of Naive Bayes)
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
Score type (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use probabilistic or 0/1 scoring.
(default probabilistic scoring)
weka.classifiers.bayes.net.search.global.SimulatedAnnealing
-A <float>
Start temperature
-U <integer>
Number of runs
-D <float>
Delta temperature
-R <seed>
Random number seed
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
Score type (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use probabilistic or 0/1 scoring.
(default probabilistic scoring)
weka.classifiers.bayes.net.search.global.TabuSearch
-L <integer>
Tabu list length
-U <integer>
Number of runs
-P <nr of parents>
Maximum number of parents
-R
Use arc reversal operation.
(default false)
-N
Initial structure is empty (instead of Naive Bayes)
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
Score type (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use probabilistic or 0/1 scoring.
(default probabilistic scoring)
weka.classifiers.bayes.net.search.global.TAN
-mbc
Applies a Markov Blanket correction to the network structure,
after a network structure is learned. This ensures that all
nodes in the network are part of the Markov blanket of the
classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
Score type (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use probabilistic or 0/1 scoring.
(default probabilistic scoring)
weka.classifiers.bayes.net.search.fixed.FromFile
-B <BIF File>
Name of file containing network structure in BIF format
weka.classifiers.bayes.net.search.fixed.NaiveBayes
No options.
Overview of options for estimators
weka.classifiers.bayes.net.estimate.BayesNetEstimator
-A <alpha>
Initial count (alpha)
weka.classifiers.bayes.net.estimate.BMAEstimator
-k2
Whether to use K2 prior.
-A <alpha>
Initial count (alpha)
weka.classifiers.bayes.net.estimate.MultiNomialBMAEstimator
-k2
Whether to use K2 prior.
-A <alpha>
Initial count (alpha)
weka.classifiers.bayes.net.estimate.SimpleEstimator
-A <alpha>
Initial count (alpha)
Generating random networks and artificial data sets
You can generate random Bayes nets and data sets using
weka.classifiers.bayes.net.BayesNetGenerator
The options are:
-B
Generate network (instead of instances)
-N <integer>
Nr of nodes
-A <integer>
Nr of arcs
-M <integer>
Nr of instances
-C <integer>
Cardinality of the variables
-S <integer>
Seed for random number generator
-F <file>
The BIF file to obtain the structure from.
The network structure is generated by first generating a tree so that we can
ensure that we have a connected graph. If any more arrows are specified they
are randomly added.
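The tree-first idea can be sketched as follows. This is only one way to realize the description and is not taken from BayesNetGenerator: the function name, the use of a random node ordering to keep the graph acyclic, and the argument checks are all assumptions.

```python
import random
from typing import List, Set, Tuple


def random_structure(n_nodes: int, n_arcs: int, seed: int) -> Set[Tuple[int, int]]:
    """Return arcs (parent, child) of a random connected DAG."""
    max_arcs = n_nodes * (n_nodes - 1) // 2
    if not n_nodes - 1 <= n_arcs <= max_arcs:
        raise ValueError("need n_nodes - 1 <= n_arcs <= n_nodes * (n_nodes - 1) / 2")
    rng = random.Random(seed)
    order: List[int] = list(range(n_nodes))
    rng.shuffle(order)
    arcs: Set[Tuple[int, int]] = set()
    # Step 1: a tree. Every node after the first gets one parent among the
    # nodes that precede it in the ordering, so the graph is connected.
    for pos in range(1, n_nodes):
        arcs.add((order[rng.randrange(pos)], order[pos]))
    # Step 2: extra arcs, added at random. They always point from an earlier
    # to a later node in the ordering, so no directed cycle can appear.
    candidates = [(order[a], order[b])
                  for a in range(n_nodes) for b in range(a + 1, n_nodes)
                  if (order[a], order[b]) not in arcs]
    rng.shuffle(candidates)
    arcs.update(candidates[: n_arcs - len(arcs)])
    return arcs


arcs = random_structure(n_nodes=6, n_arcs=8, seed=1)
```

Building the tree first guarantees connectivity with the minimum number of arcs; any further arcs only add dependencies.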
8.8 Inspecting Bayesian networks
You can inspect some of the properties of Bayesian networks that you learned
in the Explorer in text format and also in graphical format.
Bayesian networks in text
Below, you find output typical for a 10-fold cross-validation run in the Weka
Explorer with comments where the output is specific for Bayesian nets.
=== Run information ===
Scheme: weka.classifiers.bayes.BayesNet -D -B iris.xml -Q weka.classifiers.bayes.net.
Options for BayesNet include the class names for the structure learner and for
the distribution estimator.
Relation: iris-weka.filters.unsupervised.attribute.Discretize-B2-M-1.0-Rfirst-last
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Bayes Network Classifier
not using ADTree
Indication whether the ADTree algorithm [24] for calculating counts in the data
set was used.
#attributes=5 #classindex=4
This line lists the number of attributes and the index of the class variable for
which the classifier was trained.
Network structure (nodes followed by parents)
sepallength(2): class
sepalwidth(2): class
petallength(2): class sepallength
petalwidth(2): class petallength
class(3):
This list specifies the network structure. Each of the variables is followed by a
list of parents, so the petallength variable has parents sepallength and class,
while class has no parents. The number in parentheses is the cardinality of the
variable. It shows that in the iris dataset the class variable has three values. All
other variables are made binary by running them through a discretization filter.
LogScore Bayes: -374.9942769685747
LogScore BDeu: -351.85811477631626
LogScore MDL: -416.86897021246466
LogScore ENTROPY: -366.76261727150217
LogScore AIC: -386.76261727150217
These lines list the logarithmic score of the network structure for various methods
of scoring.
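As a consistency check, the reported numbers can be related to the definitions of K (8.3), the AIC metric (8.4) and the MDL metric (8.5). The sketch below reads the cardinalities and parents from the structure listing above and copies the scores from the output; that the reported log scores are the negatives of the metrics, and that the logarithm is natural, are inferences from these numbers rather than statements from the manual.

```python
import math

# (cardinality r_i, cardinalities of the parents) read from the structure listing.
structure = {
    "sepallength": (2, [3]),
    "sepalwidth": (2, [3]),
    "petallength": (2, [3, 2]),   # parents: class, sepallength
    "petalwidth": (2, [3, 2]),    # parents: class, petallength
    "class": (3, []),
}

# K = sum over nodes of (r_i - 1) * q_i, with q_i the product of parent cardinalities.
k = sum((r - 1) * math.prod(parent_cards) for r, parent_cards in structure.values())

n = 150                                # number of instances
log_entropy = -366.76261727150217      # LogScore ENTROPY
log_aic = -386.76261727150217          # LogScore AIC
log_mdl = -416.86897021246466          # LogScore MDL

aic_gap = log_entropy - log_aic        # matches K
mdl_gap = log_entropy - log_mdl        # matches K / 2 * log N
```

The gap between the entropy and AIC scores equals K, and the gap between the entropy and MDL scores equals (K/2) log N with N = 150, which agrees with (8.4) and (8.5) up to sign.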
If a BIF file was specified, the following two lines will be produced (if no
such file was specified, no information is printed).
Missing: 0 Extra: 2 Reversed: 0
Divergence: -0.0719759699700729
In this case the network that was learned was compared with a le iris.xml
which contained the naive Bayes network structure. The number after Missing
is the number of arcs that was in the network in le that is not recovered by
the structure learner. Note that a reversed arc is not counted as missing. The
number after Extra is the number of arcs in the learned network that are not
in the network on le. The number of reversed arcs is listed as well.
Finally, the divergence between the network distribution on file and the one
learned is reported. This number is calculated by enumerating all possible
instantiations of all variables, so it may take some time to calculate the divergence
for large networks.
The remainder of the output is standard output for all classifiers.
Time taken to build model: 0.01 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 116 77.3333 %
Incorrectly Classified Instances 34 22.6667 %
etc...
Bayesian networks in GUI
To show the graphical structure, right click the appropriate BayesNet in the result
list of the Explorer. A menu pops up, in which you select Visualize graph.
The Bayes network is automatically laid out and drawn thanks to a graph
drawing algorithm implemented by Ashraf Kibriya.
When you hover the mouse over a node, the node lights up and all its children
are highlighted as well, so that it is easy to identify the relation between nodes
in crowded graphs.
Saving Bayes nets You can save the Bayes network to file in the graph
visualizer. You have the choice to save as XML BIF format or as dot format.
Select the floppy button and a file save dialog pops up that allows you to select
the file name and file format.
Zoom The graph visualizer has two buttons to zoom in and out. Also, the
exact zoom desired can be entered in the zoom percentage entry. Hit enter to
redraw at the desired zoom level.
Graph drawing options Hit the extra controls button to show extra
options that control the graph layout settings.
The Layout Type determines the algorithm applied to place the nodes.
The Layout Method determines in which direction nodes are considered.
The Edge Concentration toggle allows edges to be partially merged.
The Custom Node Size can be used to override the automatically deter-
mined node size.
When you click a node in the Bayesian net, a window with the probability
table of the node clicked pops up. The left side shows the parent attributes and
lists the values of the parents, the right side shows the probability of the node
clicked conditioned on the values of the parents listed on the left.
So, the graph visualizer allows you to inspect both network structure and
probability tables.
8.9 Bayes Network GUI
The Bayesian network editor is a stand alone application with the following
features
Edit Bayesian network completely by hand, with unlimited undo/redo stack,
cut/copy/paste and layout support.
Learn Bayesian network from data using learning algorithms in Weka.
Edit structure by hand and learn conditional probability tables (CPTs) using
learning algorithms in Weka.
Generate dataset from Bayesian network.
Inference (using junction tree method) of evidence through the network, in-
teractively changing values of nodes.
Viewing cliques in junction tree.
Accelerator key support for most common operations.
The Bayes network GUI is started as
java weka.classifiers.bayes.net.GUI <bif file>
The following window pops up when an XML BIF file is specified (if none is
specified an empty graph is shown).
Moving a node
Click a node with the left mouse button and drag the node to the desired
position.
Selecting groups of nodes
Drag the left mouse button in the graph panel. A rectangle is shown and all
nodes intersecting with the rectangle are selected when the mouse is released.
Selected nodes are made visible with four little black squares at the corners (see
screenshot above).
The selection can be extended by keeping the shift key pressed while selecting
another set of nodes.
The selection can be toggled by keeping the ctrl key pressed. All selected
nodes that intersect with the rectangle are de-selected, while the ones not in the
selection but intersecting with the rectangle are added to the selection.
Groups of nodes can be moved by keeping the left mouse pressed on one of
the selected nodes and dragging the group to the desired position.
File menu
The New, Save, Save As, and Exit menus provide functionality as expected.
The file format used is XML BIF [20].
There are two file formats supported for opening
.xml for XML BIF files. The Bayesian network is reconstructed from the
information in the file. Node width information is not stored so the nodes are
shown with the default width. This can be changed by laying out the graph
(menu Tools/Layout).
.arff Weka data files. When an arff file is selected, a new empty Bayesian
network is created with nodes for each of the attributes in the arff file. Continuous
variables are discretized using the weka.filters.supervised.attribute.Discretize
filter (see note at end of this section for more details). The network structure
can be specified and the CPTs learned using the Tools/Learn CPT menu.
The Print menu works (sometimes) as expected.
The Export menu allows for writing the graph panel to image (currently
supported are bmp, jpg, png and eps formats). This can also be activated using
the Alt-Shift-Left Click action in the graph panel.
Edit menu
Unlimited undo/redo support. Most edit operations on the Bayesian network
are undoable. A notable exception is learning of network and CPTs.
Cut/copy/paste support. When a set of nodes is selected these can be placed
on a clipboard (internal, so no interaction with other applications yet) and a
paste action will add the nodes. Nodes are renamed by adding Copy of before
the name and adding numbers if necessary to ensure uniqueness of name. Only
the arrows to parents are copied, not those of the children.
The Add Node menu brings up a dialog (see below) that allows you to specify
the name of the new node and the cardinality of the new node. Node values are
assigned the names Value1, Value2 etc. These values can be renamed (right
click the node in the graph panel and select Rename Value). Another option is
to copy/paste a node with values that are already properly named and rename
the node.
The Add Arc menu brings up a dialog to choose a child node first.
Then a dialog is shown to select a parent. Descendants of the child node,
parents of the child node and the node itself are not listed, since these cannot
be selected as parent node: they would introduce cycles or already have an
arc in the network.
The Delete Arc menu brings up a dialog with a list of all arcs that can be
deleted.
The list of eight items at the bottom is active only when a group of at least
two nodes is selected.
Align Left/Right/Top/Bottom moves the nodes in the selection such that all
nodes align to the utmost left, right, top or bottom node in the selection re-
spectively.
Center Horizontal/Vertical moves nodes in the selection halfway between left
and right most (or top and bottom most respectively).
Space Horizontal/Vertical spaces out nodes in the selection evenly between
left and right most (or top and bottom most respectively). The order in which
the nodes are selected impacts the place the node is moved to.
Tools menu
The Generate Network menu allows generation of a complete random Bayesian
network. It brings up a dialog to specify the number of nodes, number of arcs,
cardinality and a random seed to generate a network.
The Generate Data menu allows for generating a data set from the Bayesian
network in the editor. A dialog is shown to specify the number of instances to
be generated, a random seed and the file to save the data set into. The file
format is arff. When no file is selected (field left blank) no file is written and
only the internal data set is set.
The Set Data menu sets the current data set. From this data set a new
Bayesian network can be learned, or the CPTs of a network can be estimated.
A file chooser pops up to select the arff file containing the data.
The Learn Network and Learn CPT menus are only active when a data set
is specified either through
Tools/Set Data menu, or
Tools/Generate Data menu, or
File/Open menu when an arff file is selected.
The Learn Network action learns the whole Bayesian network from the data
set. The learning algorithms can be selected from the set available in Weka by
selecting the Options button in the dialog below. Learning a network clears the
undo stack.
The Learn CPT menu does not change the structure of the Bayesian network,
only the probability tables. Learning the CPTs clears the undo stack.
The Layout menu runs a graph layout algorithm on the network and tries
to make the graph a bit more readable. When the menu item is selected, the
node size can be specified, or left for the algorithm to calculate based on the size
of the labels by deselecting the custom node size check box.
The Show Margins menu item makes marginal distributions visible. These
are calculated using the junction tree algorithm [23]. Marginal probabilities for
nodes are shown in green next to the node. The value of a node can be set
(right click node, set evidence, select a value) and the color is changed to red to
indicate evidence is set for the node. Rounding errors may occur in the marginal
probabilities.
The Show Cliques menu item makes the cliques visible that are used by the
junction tree algorithm. Cliques are visualized using colored undirected edges.
Both margins and cliques can be shown at the same time, but that makes for
rather crowded graphs.
View menu
The view menu allows for zooming in and out of the graph panel. Also, it allows
for hiding or showing the status and toolbars.
Help menu
The help menu points to this document.
Toolbar
The toolbar allows a shortcut to many functions. Just hover the mouse
over the toolbar buttons and a tooltiptext pops up that tells which function is
activated. The toolbar can be shown or hidden with the View/View Toolbar
menu.
Statusbar
At the bottom of the screen the statusbar shows messages. This can be helpful
when an undo/redo action is performed that does not have any visible effects,
such as edit actions on a CPT. The statusbar can be shown or hidden with the
View/View Statusbar menu.
Click right mouse button
Clicking the right mouse button in the graph panel outside a node brings up
the following popup menu. It allows you to add a node at the location that was
clicked, or to select a parent to add to all nodes in the selection. If no node is
selected, or no node can be added as parent, this function is disabled.
Clicking the right mouse button on a node brings up a popup menu.
The popup menu shows a list of values that can be set as evidence for the selected
node. This is only visible when margins are shown (menu Tools/Show margins).
By selecting Clear, the value of the node is removed and the margins are calculated
based on the CPTs again.
A node can be renamed by right clicking it and selecting Rename in the popup menu.
The following dialog appears that allows entering a new node name.
The CPT of a node can be edited manually by selecting a node, right
click/Edit CPT. A dialog is shown with a table representing the CPT. When a
value is edited, the values of the remainder of the table are updated in order to
ensure that the probabilities add up to 1. It attempts to adjust the last column
first, then goes backward from there.
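One way to read the rebalancing rule just described is sketched below for a single CPT row. This is an interpretation of the described behaviour, not the editor's actual code. The class CptRowEdit and the method setAndRebalance are hypothetical names, and clamping each cell to the range [0, 1] is an assumption.

```java
import java.util.Arrays;

final class CptRowEdit {
    // Sets row[edited] to value, then restores a row sum of 1 by adjusting the other
    // cells, starting with the last column and moving backward.
    static double[] setAndRebalance(double[] row, int edited, double value) {
        double[] result = row.clone();
        result[edited] = value;
        double sum = 0.0;
        for (double p : result) {
            sum += p;
        }
        double excess = sum - 1.0; // positive: mass must be removed; negative: mass must be added
        for (int i = result.length - 1; i >= 0 && Math.abs(excess) > 1e-12; i--) {
            if (i == edited) {
                continue; // never touch the cell the user just edited
            }
            // Assumption: a cell absorbs as much of the excess as it can within [0, 1].
            double adjusted = Math.min(1.0, Math.max(0.0, result[i] - excess));
            excess -= result[i] - adjusted;
            result[i] = adjusted;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] row = {0.2, 0.3, 0.5};
        System.out.println(Arrays.toString(setAndRebalance(row, 0, 0.5)));
        System.out.println(Arrays.toString(setAndRebalance(row, 0, 0.9)));
    }
}
```

In the first call the last column alone absorbs the change (roughly 0.5, 0.3, 0.2). In the second call the last column drops to zero and the remaining excess is taken from the middle column (roughly 0.9, 0.1, 0.0).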
The whole table can be filled with randomly generated distributions by selecting
the Randomize button.
The popup menu shows a list of parents that can be added to the selected node.
The CPT for the node is updated by making copies for each value of the new parent.
The popup menu shows a list of parents that can be deleted from the selected
node. The CPT of the node keeps only the one conditioned on the first value of the
parent node.
The popup menu shows a list of children that can be deleted from the selected
node. The CPT of the child node keeps only the one conditioned on the first value
of the parent node.
Selecting Add Value from the popup menu brings up this dialog, in which
the name of the new value for the node can be specified. The distribution for
the node assigns zero probability to the value. Child node CPTs are updated by
copying distributions conditioned on the new value.
The popup menu shows a list of values that can be renamed for the selected node.
Selecting a value brings up the following dialog in which a new name can be
specified.
The popup menu shows a list of values that can be deleted from the selected node.
This is only active when there are more than two values for the node (single
valued nodes do not make much sense). By selecting the value the CPT of the
node is updated in order to ensure that the CPT adds up to unity. The CPTs
of children are updated by dropping the distributions conditioned on the value.
A note on CPT learning
Continuous variables are discretized by the Bayes network class. The
discretization algorithm chooses its values based on the information in the data set.
However, these values are not stored anywhere. So, reading an arff file with
continuous variables using the File/Open menu allows one to specify a network,
then learn the CPTs from it since the discretization bounds are still known.
However, opening an arff file, specifying a structure, then closing the
application, reopening and trying to learn the network from another file containing
continuous variables may not give the desired result since the discretization
algorithm is re-applied and new boundaries may have been found. Unexpected
behavior may be the result.
Learning from a dataset that contains more attributes than there are nodes
in the network is ok. The extra attributes are just ignored.
Learning from a dataset with differently ordered attributes is ok. Attributes
are matched to nodes based on name. However, attribute values are matched
with node values based on the order of the values.
The attributes in the dataset should have the same number of values as the
corresponding nodes in the network (see above for continuous variables).
8.10 Bayesian nets in the experimenter
Bayesian networks generate extra measures that can be examined in the exper-
imenter. The experimenter can then be used to calculate mean and variance for
those measures.
The following metrics are generated:
measureExtraArcs: extra arcs compared to reference network. The net-
work must be provided as BIFFile to the BayesNet class. If no such
network is provided, this value is zero.
measureMissingArcs: missing arcs compared to reference network or zero
if not provided.
measureReversedArcs: reversed arcs compared to reference network or
zero if not provided.
measureDivergence: divergence of network learned compared to reference
network or zero if not provided.
measureBayesScore: log of the K2 score of the network structure.
measureBDeuScore: log of the BDeu score of the network structure.
measureMDLScore: log of the MDL score.
measureAICScore: log of the AIC score.
measureEntropyScore: log of the entropy.
8.11 Adding your own Bayesian network learners
You can add your own structure learners and estimators.
Adding a new structure learner
Here is the quick guide for adding a structure learner:
1. Create a class that derives from weka.classifiers.bayes.net.search.SearchAlgorithm.
If your searcher is score based, conditional independence based or
cross-validation based, you probably want to derive from ScoreSearchAlgorithm,
CISearchAlgorithm or CVSearchAlgorithm instead of deriving from SearchAlgorithm
directly. Let's say it is called
weka.classifiers.bayes.net.search.local.MySearcher derived from
ScoreSearchAlgorithm.
2. Implement the method
public void buildStructure(BayesNet bayesNet, Instances instances).
Essentially, you are responsible for setting the parent sets in bayesNet.
You can access the parentsets using bayesNet.getParentSet(iAttribute)
where iAttribute is the number of the node/variable.
To add a parent iParent to node iAttribute, use
bayesNet.getParentSet(iAttribute).addParent(iParent, instances)
where instances needs to be passed for the parent set to derive properties
of the attribute.
Alternatively, implement public void search(BayesNet bayesNet, Instances
instances). The implementation of buildStructure in the base class
SearchAlgorithm will call search after initializing the parent sets; if the
initAsNaiveBayes flag is set, it will start with a naive Bayes network
structure. After calling search in your custom class, it will add arrows if
the markovBlanketClassifier flag is set to ensure all attributes are in the
Markov blanket of the class node.
3. If the structure learner has options that are not default options, you
want to implement public Enumeration listOptions(), public void
setOptions(String[] options), public String[] getOptions() and
the get and set methods for the properties you want to be able to set.
NB 1. do not use the -E option since that is reserved for the BayesNet
class to distinguish the extra options for the SearchAlgorithm class and
the Estimator class. If the -E option is used, it will not be passed to your
SearchAlgorithm (and probably causes problems in the BayesNet class).
NB 2. make sure to process options of the parent class if any in the
get/setOptions methods.
Adding a new estimator
This is the quick guide for adding a new estimator:
1. Create a class that derives from
weka.classifiers.bayes.net.estimate.BayesNetEstimator. Let's say
it is called
weka.classifiers.bayes.net.estimate.MyEstimator.
2. Implement the methods
public void initCPTs(BayesNet bayesNet)
public void estimateCPTs(BayesNet bayesNet)
public void updateClassifier(BayesNet bayesNet, Instance instance),
and
public double[] distributionForInstance(BayesNet bayesNet, Instance
instance).
3. If the estimator has options that are not default options, you
want to implement public Enumeration listOptions(), public void
setOptions(String[] options), public String[] getOptions() and
the get and set methods for the properties you want to be able to set.
NB do not use the -E option since that is reserved for the BayesNet class
to distinguish the extra options for the SearchAlgorithm class and the
Estimator class. If the -E option is used and no extra arguments are
passed to the SearchAlgorithm, the extra options to your Estimator will
be passed to the SearchAlgorithm instead. In short, do not use the -E
option.
8.12 FAQ
How do I use a data set with continuous variables with the
BayesNet classes?
Use the class weka.filters.unsupervised.attribute.Discretize to discretize
them. From the command line, you can use
java weka.filters.unsupervised.attribute.Discretize -B 3 -i infile.arff
-o outfile.arff
where the -B option determines the cardinality of the discretized variables.
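To illustrate what a fixed number of bins means for a numeric attribute, here is a minimal sketch of equal-width binning. It assumes that the unsupervised filter, with default settings, splits the observed range of each attribute into -B intervals of equal width. The exact treatment of values on a bin boundary in Weka may differ, and the class EqualWidthBins is a hypothetical name.

```java
import java.util.Arrays;

final class EqualWidthBins {
    // Returns the bin index (0 .. bins-1) of each value, using `bins` intervals of
    // equal width between the smallest and the largest observed value.
    static int[] discretize(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / bins;
        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant attribute (width 0) is placed entirely in bin 0.
            int bin = width == 0.0 ? 0 : (int) ((values[i] - min) / width);
            result[i] = Math.min(bin, bins - 1); // the maximum falls into the last bin
        }
        return result;
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0};
        System.out.println(Arrays.toString(discretize(values, 3)));
    }
}
```

With three bins over the range 1 to 7 each bin is 2 wide, so the example prints [0, 0, 1, 1, 2, 2, 2].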
How do I use a data set with missing values with the
BayesNet classes?
You would have to delete the entries with missing values or fill in dummy values.
How do I create a random Bayes net structure?
Running from the command line
java weka.classifiers.bayes.net.BayesNetGenerator -B -N 10 -A 9 -C 2
will print a Bayes net with 10 nodes, 9 arcs and binary variables in XML BIF
format to standard output.
How do I create an artificial data set using a random Bayes
net?
Running
java weka.classifiers.bayes.net.BayesNetGenerator -N 15 -A 20 -C 3
-M 300
will generate a data set in arff format with 300 instances from a random network
with 15 ternary variables and 20 arrows.
How do I create an artificial data set using a Bayes net I
have on file?
Running
java weka.classifiers.bayes.net.BayesNetGenerator -F alarm.xml -M 1000
will generate a data set with 1000 instances from the network stored in the file
alarm.xml.
How do I save a Bayes net in BIF format?
GUI: In the Explorer
learn the network structure,
right click the relevant run in the result list,
choose Visualize graph in the pop up menu,
click the floppy button in the Graph Visualizer window.
a file save as dialog pops up that allows you to select the file name
to save to.
Java: Create a BayesNet and call BayesNet.toXMLBIF03() which returns
the Bayes network in BIF format as a String.
Command line: use the -g option and redirect the output on stdout
into a file.
How do I compare a network I learned with one in BIF
format?
Specify the -B <bif-file> option to BayesNet. Calling toString() will produce
a summary of extra, missing and reversed arrows. Also the divergence between
the network learned and the one on file is reported.
How do I use the network I learned for general inference?
There is no general purpose inference in Weka, but you can export the network as
an XML BIF file (see above) and import it in other packages, for example JavaBayes
available under GPL from http://www.cs.cmu.edu/~javabayes.
8.13 Future development
If you would like to add to the current Bayes network facilities in Weka, you
might consider one of the following possibilities.
Implement more search algorithms, in particular,
general purpose search algorithms (such as an improved implemen-
tation of genetic search).
structure search based on equivalent model classes.
implement those algorithms both for local and global metric based
search algorithms.
implement more conditional independence based search algorithms.
Implement score metrics that can handle sparse instances in order to allow
for processing large datasets.
Implement traditional conditional independence tests for conditional in-
dependence based structure learning algorithms.
Currently, all search algorithms assume that all variables are discrete.
Search algorithms that can handle continuous variables would be interest-
ing.
A limitation of the current classes is that they assume that there are no
missing values. This limitation can be undone by implementing score
metrics that can handle missing values. The classes used for estimating
the conditional probabilities need to be updated as well.
Only leave-one-out, k-fold and cumulative cross-validation are implemented.
These implementations can be made more efficient and other cross-validation
methods can be implemented, such as Monte Carlo cross-validation and
bootstrap cross validation.
Implement methods that can handle incremental extensions of the data
set for updating network structures.
And for the more ambitious people, there are the following challenges.
A GUI for manipulating Bayesian networks to allow user intervention for
adding and deleting arcs and updating the probability tables.
General purpose inference algorithms built into the GUI to allow user
defined queries.
Allow learning of other graphical models, such as chain graphs, undirected
graphs and variants of causal graphs.
Allow learning of networks with latent variables.
Allow learning of dynamic Bayesian networks so that time series data can
be handled.
Part III
Data
Chapter 9
ARFF
An ARFF (= Attribute-Relation File Format) file is an ASCII text file that
describes a list of instances sharing a set of attributes.
9.1 Overview
ARFF files have two distinct sections. The first section is the Header
information, which is followed by the Data information.
The Header of the ARFF file contains the name of the relation, a list of
the attributes (the columns in the data), and their types. An example header
on the standard IRIS dataset looks like this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and
@DATA declarations are case insensitive.
9.2 Examples
Several well-known machine learning datasets are distributed with Weka in the
$WEKAHOME/data directory as ARFF files.
9.2.1 The ARFF Header Section
The ARFF Header section of the file contains the relation declaration and
attribute declarations.
The @relation Declaration
The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name
includes spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of @attribute
statements. Each attribute in the data set has its own @attribute statement
which uniquely defines the name of that attribute and its data type. The order
in which the attributes are declared indicates the column position in the data section
of the file. For example, if an attribute is the third one declared then Weka
expects that all of that attribute's values will be found in the third comma delimited
column.
The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If
spaces are to be included in the name then the entire name must be quoted.
The <datatype> can be any of the following types supported by Weka:
numeric
integer is treated as numeric
real is treated as numeric
<nominal-specification>
string
date [<date-format>]
relational for multi-instance data (for future use)
where <nominal-specification> and <date-format> are defined below. The
keywords numeric, real, integer, string and date are case insensitive.
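The declaration format above can be illustrated with a small parser that splits one @attribute line into a name and a kind of datatype. This is a simplified sketch, not Weka's ARFF reader: the class AttributeDeclaration is a hypothetical name, escaped quotes inside names are not handled, and a datatype starting with a curly brace is assumed to be a nominal specification.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttributeDeclaration {
    // The name is either quoted ("..." or '...') or a run of non-space characters;
    // the keyword @attribute is matched case-insensitively.
    private static final Pattern DECLARATION = Pattern.compile(
            "^@attribute\\s+(\"[^\"]*\"|'[^']*'|\\S+)\\s+(.+)$", Pattern.CASE_INSENSITIVE);

    // Returns {name, kind} where kind is numeric, nominal, string, date or relational.
    static String[] parse(String line) {
        Matcher m = DECLARATION.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an @attribute line: " + line);
        }
        String name = m.group(1);
        if (name.length() >= 2 && (name.startsWith("\"") || name.startsWith("'"))) {
            name = name.substring(1, name.length() - 1); // strip the surrounding quotes
        }
        String type = m.group(2).trim();
        String keyword = type.split("\\s+")[0].toLowerCase(Locale.ROOT);
        String kind;
        if (type.startsWith("{")) {
            kind = "nominal";
        } else if (keyword.equals("numeric") || keyword.equals("integer") || keyword.equals("real")) {
            kind = "numeric"; // integer and real are treated as numeric
        } else if (keyword.equals("string") || keyword.equals("date") || keyword.equals("relational")) {
            kind = keyword;
        } else {
            throw new IllegalArgumentException("Unknown datatype: " + type);
        }
        return new String[] {name, kind};
    }

    public static void main(String[] args) {
        String[] a = parse("@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}");
        System.out.println(a[0] + " is " + a[1]);
        String[] b = parse("@attribute 'petal length' REAL");
        System.out.println(b[0] + " is " + b[1]);
    }
}
```

The example prints "class is nominal" and "petal length is numeric".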
Numeric attributes
Numeric attributes can be real or integer numbers.
Nominal attributes
Nominal values are defined by providing a <nominal-specification> listing the
possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>,
...}
For example, the class value of the Iris dataset can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Values that contain spaces must be quoted.
String attributes
String attributes allow us to create attributes containing arbitrary textual
values. This is very useful in text-mining applications, as we can create datasets
with string attributes, then write Weka Filters to manipulate strings (like
StringToWordVector). String attributes are declared as follows:
@ATTRIBUTE LCC string
Date attributes
Date attribute declarations take the form:
@attribute <name> date [<date-format>]
where <name> is the name for the attribute and <date-format> is an
optional string specifying how date values should be parsed and printed (this is the
same format used by SimpleDateFormat). The default format string accepts
the ISO-8601 combined date and time format: yyyy-MM-dd'T'HH:mm:ss.
Dates must be specified in the data section as the corresponding string
representations of the date/time (see example below).
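Since the date format is the one used by java.text.SimpleDateFormat, a format string can be tried out directly in Java before it is put into an ARFF header. The sketch below only demonstrates SimpleDateFormat itself. The class DateAttributeDemo is a hypothetical name, and fixing the time zone to UTC is a choice made here to get reproducible results, not something the ARFF format prescribes.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class DateAttributeDemo {
    static SimpleDateFormat formatFor(String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // fixed zone for reproducible results
        format.setLenient(false);                         // reject impossible dates
        return format;
    }

    // Parses a date string with the given pattern into milliseconds since the epoch.
    static long toMillis(String pattern, String value) {
        try {
            return formatFor(pattern).parse(value).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Value does not match " + pattern + ": " + value, e);
        }
    }

    // Prints milliseconds since the epoch using the given pattern.
    static String fromMillis(String pattern, long millis) {
        return formatFor(pattern).format(new Date(millis));
    }

    public static void main(String[] args) {
        String custom = "yyyy-MM-dd HH:mm:ss";
        String iso = "yyyy-MM-dd'T'HH:mm:ss";
        long millis = toMillis(custom, "2001-04-03 12:12:12");
        System.out.println(fromMillis(custom, millis));
        System.out.println(fromMillis(iso, millis));
    }
}
```

The example prints the same instant once in the custom format (2001-04-03 12:12:12) and once in the default ISO-8601 style (2001-04-03T12:12:12).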
Relational attributes
Relational attribute declarations take the form:
@attribute <name> relational
<further attribute definitions>
@end <name>
For the multi-instance dataset MUSK1 the definition would look like this (...
denotes an omission):
@attribute molecule_name {MUSK-jf78,...,NON-MUSK-199}
@attribute bag relational
@attribute f1 numeric
...
@attribute f166 numeric
@end bag
@attribute class {0,1}
...
9.2.2 The ARFF Data Section
The ARFF Data section of the file contains the data declaration line and the
actual instance lines.
The @data Declaration
The @data declaration is a single line denoting the start of the data segment in
the le. The format is:
@data
The instance data
Each instance is represented on a single line, with carriage returns denoting the
end of the instance. A percent sign (%) introduces a comment, which continues
to the end of the line.
Attribute values for each instance are delimited by commas. They must
appear in the order that they were declared in the header section (i.e. the data
corresponding to the nth @attribute declaration is always the nth field of the
instance).
Missing values are represented by a single question mark, as in:
@data
4.4,?,1.5,?,Iris-setosa
Values of string and nominal attributes are case sensitive, and any that contain
a space or the comment-delimiter character % must be quoted. (The code suggests
that double-quotes are acceptable and that a backslash will escape individual
characters.) An example follows:
@relation LCCvsLCSH
@attribute LCC string
@attribute LCSH string
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'
Dates must be specified in the data section using the string representation
specified in the attribute declaration. For example:
@RELATION Timestamps
@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"
Relational data must be enclosed within double quotes. For example an
instance of the MUSK1 dataset (... denotes an omission):
MUSK-188,"42,...,30",1
9.3 Sparse ARFF files
Sparse ARFF files are very similar to ARFF files, but data with value 0 are not
explicitly represented.
Sparse ARFF files have the same header (i.e. @relation and @attribute
tags) but the data section is different. Instead of representing each value in
order, like this:
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
the non-zero attributes are explicitly identified by attribute number and their
value stated, like this:
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
Each instance is surrounded by curly braces, and the format for each entry is:
<index> <space> <value> where index is the attribute index (starting from
0).
Note that the omitted values in a sparse instance are 0, they are not missing
values! If a value is unknown, you must explicitly represent it with a question
mark (?).
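The conversion from a dense instance to the sparse notation can be sketched as follows. This is a textual illustration only and does not use Weka's SparseInstance class. The class SparseArffLine is a hypothetical name, the fields are assumed to be already split, and only the literal value 0 is treated as zero.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SparseArffLine {
    // Converts one dense instance, given as its fields in declaration order,
    // into the sparse notation {<index> <value>, ...} with indices starting at 0.
    static String toSparse(List<String> fields) {
        List<String> entries = new ArrayList<>();
        for (int i = 0; i < fields.size(); i++) {
            String value = fields.get(i).trim();
            // Only a literal 0 is omitted; "?" is kept, because unknown is not the same as 0.
            if (!value.equals("0")) {
                entries.add(i + " " + value);
            }
        }
        return "{" + String.join(", ", entries) + "}";
    }

    public static void main(String[] args) {
        System.out.println(toSparse(Arrays.asList("0", "X", "0", "Y", "\"class A\"")));
        System.out.println(toSparse(Arrays.asList("0", "0", "W", "0", "\"class B\"")));
    }
}
```

Applied to the two dense instances shown earlier, this prints {1 X, 3 Y, 4 "class A"} and {2 W, 4 "class B"}.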
Warning: There is a known problem saving SparseInstance objects from
datasets that have string attributes. In Weka, string and nominal data values
are stored as numbers; these numbers act as indexes into an array of possible
attribute values (this is very efficient). However, the first string value is
assigned index 0: this means that, internally, this value is stored as a 0. When a
SparseInstance is written, string instances with internal value 0 are not
output, so their string value is lost (and when the arff file is read again, the default
value 0 is the index of a different string value, so the attribute value appears
to change). To get around this problem, add a dummy string value at index 0
that is never used whenever you declare string attributes that are likely to be
used in SparseInstance objects and saved as Sparse ARFF files.
9.4 Instance weights in ARFF files
A weight can be associated with an instance in a standard ARFF file by
appending it to the end of the line for that instance and enclosing the value in
curly braces. E.g.:
@data
0, X, 0, Y, "class A", {5}
For a sparse instance, this example would look like:
@data
{1 X, 3 Y, 4 "class A"}, {5}
Note that any instance without a weight value specified is assumed to have a
weight of 1 for backwards compatibility.
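Reading the optional weight from a data line can be sketched with a regular expression. This is a simplified illustration rather than Weka's parser: the class InstanceWeight is a hypothetical name, and the weight is assumed to be the last item on the line, preceded by a comma.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InstanceWeight {
    // A trailing ", {w}" after the values (dense) or after the closing brace (sparse).
    private static final Pattern WEIGHT =
            Pattern.compile(",\\s*\\{\\s*([^{}\\s,]+)\\s*\\}\\s*$");

    // Returns the weight given on the line, or 1.0 when no weight is specified.
    static double weightOf(String dataLine) {
        Matcher m = WEIGHT.matcher(dataLine);
        return m.find() ? Double.parseDouble(m.group(1)) : 1.0;
    }

    public static void main(String[] args) {
        System.out.println(weightOf("0, X, 0, Y, \"class A\", {5}"));
        System.out.println(weightOf("{1 X, 3 Y, 4 \"class A\"}, {5}"));
        System.out.println(weightOf("{1 X, 3 Y, 4 \"class A\"}"));
    }
}
```

The first two lines carry an explicit weight of 5, while the last one has none and therefore falls back to the default weight of 1.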
Chapter 10
XRFF
The XRFF (XML attribute Relation File Format) is an XML-based representation of the data
that can store comments, attribute weights and instance weights.
10.1 File extensions
The following file extensions are recognized as XRFF files:
.xrff
the default extension of XRFF files
.xrff.gz
the extension for gzip compressed XRFF files (see Compression section
for more details)
10.2 Comparison
10.2.1 ARFF
The following is a snippet of the UCI dataset iris in ARFF format:
@relation iris
@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3,1.4,0.2,Iris-setosa
...
10.2.2 XRFF
And the same dataset represented as an XRFF file:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE dataset
[
<!ELEMENT dataset (header,body)>
<!ATTLIST dataset name CDATA #REQUIRED>
<!ATTLIST dataset version CDATA "3.5.4">
<!ELEMENT header (notes?,attributes)>
<!ELEMENT body (instances)>
<!ELEMENT notes ANY>
<!ELEMENT attributes (attribute+)>
<!ELEMENT attribute (labels?,metadata?,attributes?)>
<!ATTLIST attribute name CDATA #REQUIRED>
<!ATTLIST attribute type (numeric|date|nominal|string|relational) #REQUIRED>
<!ATTLIST attribute format CDATA #IMPLIED>
<!ATTLIST attribute class (yes|no) "no">
<!ELEMENT labels (label*)>
<!ELEMENT label ANY>
<!ELEMENT metadata (property*)>
<!ELEMENT property ANY>
<!ATTLIST property name CDATA #REQUIRED>
<!ELEMENT instances (instance*)>
<!ELEMENT instance (value*)>
<!ATTLIST instance type (normal|sparse) "normal">
<!ATTLIST instance weight CDATA #IMPLIED>
<!ELEMENT value (#PCDATA|instances)*>
<!ATTLIST value index CDATA #IMPLIED>
<!ATTLIST value missing (yes|no) "no">
]
>
<dataset name="iris" version="3.5.3">
<header>
<attributes>
<attribute name="sepallength" type="numeric"/>
<attribute name="sepalwidth" type="numeric"/>
<attribute name="petallength" type="numeric"/>
<attribute name="petalwidth" type="numeric"/>
<attribute class="yes" name="class" type="nominal">
<labels>
<label>Iris-setosa</label>
<label>Iris-versicolor</label>
<label>Iris-virginica</label>
</labels>
</attribute>
</attributes>
</header>
<body>
<instances>
<instance>
<value>5.1</value>
<value>3.5</value>
<value>1.4</value>
<value>0.2</value>
<value>Iris-setosa</value>
</instance>
<instance>
<value>4.9</value>
<value>3</value>
<value>1.4</value>
<value>0.2</value>
<value>Iris-setosa</value>
</instance>
...
</instances>
</body>
</dataset>
10.3 Sparse format
The XRFF format also supports a sparse data representation. Even though the
iris dataset does not contain sparse data, the above example will be used here
to illustrate the sparse format:
...
<instances>
<instance type="sparse">
<value index="1">5.1</value>
<value index="2">3.5</value>
<value index="3">1.4</value>
<value index="4">0.2</value>
<value index="5">Iris-setosa</value>
</instance>
<instance type="sparse">
<value index="1">4.9</value>
<value index="2">3</value>
<value index="3">1.4</value>
<value index="4">0.2</value>
<value index="5">Iris-setosa</value>
</instance>
...
</instances>
...
In contrast to the normal data format, each sparse instance tag contains a type
attribute with the value sparse:
<instance type="sparse">
And each value tag needs to specify the index attribute, which contains the
1-based index of this value.
<value index="1">5.1</value>
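How the 1-based index attribute maps onto array positions can be sketched with the JDK's DOM parser. The class SparseXrffSketch is hypothetical and is not Weka's XRFF loader; entries that are omitted in the sparse instance are simply left as null here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Expands a sparse XRFF instance element into a dense array of strings. */
final class SparseXrffSketch {

    static String[] toDense(String instanceXml, int numAttributes) {
        try {
            Element instance = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(instanceXml)))
                    .getDocumentElement();
            String[] dense = new String[numAttributes]; // omitted entries stay null
            NodeList values = instance.getElementsByTagName("value");
            for (int i = 0; i < values.getLength(); i++) {
                Element value = (Element) values.item(i);
                // The index attribute is 1-based; Java arrays are 0-based.
                int index = Integer.parseInt(value.getAttribute("index"));
                dense[index - 1] = value.getTextContent();
            }
            return dense;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse instance", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<instance type=\"sparse\">"
                + "<value index=\"1\">5.1</value>"
                + "<value index=\"5\">Iris-setosa</value>"
                + "</instance>";
        System.out.println(java.util.Arrays.toString(toDense(xml, 5)));
    }
}
```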
10.4 Compression
Since the XML representation takes up considerably more space than the rather
compact ARFF format, one can also compress the data via gzip. Weka
automatically recognizes a file as gzip compressed if the file's extension is .xrff.gz
instead of .xrff.
The Weka Explorer, Experimenter and command-line allow one to load/save
compressed and uncompressed XRFF files (this applies also to ARFF files).
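Compressing XRFF content outside of Weka can be done with the JDK's gzip streams. The following is a minimal sketch; the class XrffGzipSketch and its helper names are hypothetical, and the extension check only mirrors the .xrff.gz rule stated above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Gzip round trip for XRFF content held in memory. */
final class XrffGzipSketch {

    static byte[] gzip(String xml) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
                out.write(xml.getBytes(StandardCharsets.UTF_8));
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String gunzip(byte[] compressed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Mirrors the extension rule: .xrff.gz means gzip compressed. */
    static boolean looksCompressed(String fileName) {
        return fileName.endsWith(".xrff.gz");
    }

    public static void main(String[] args) {
        String xml = "<dataset name=\"iris\"/>";
        System.out.println(gunzip(gzip(xml)).equals(xml));
        System.out.println(looksCompressed("iris.xrff.gz"));
    }
}
```

To write an actual file, the same GZIPOutputStream can wrap a FileOutputStream whose name ends in .xrff.gz.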
10.5 Useful features
In addition to all the features of the ARFF format, the XRFF format contains
the following additional features:
class attribute specification
attribute weights
10.5.1 Class attribute specification
Via the class="yes" attribute in the attribute specification in the header, one
can define which attribute should act as class attribute. This feature can
be used on the command line as well as in the Experimenter, which can now
also load other data formats, and it removes the limitation of the class attribute
always having to be the last one.
Snippet from the iris dataset:
<attribute class="yes" name="class" type="nominal">
10.5.2 Attribute weights
Attribute weights are stored in an attribute's meta-data tag (in the header
section). Here is an example of the petalwidth attribute with a weight of 0.9:
<attribute name="petalwidth" type="numeric">
<metadata>
<property name="weight">0.9</property>
</metadata>
</attribute>
10.5.3 Instance weights
Instance weights are defined via the weight attribute in each instance tag. By
default, the weight is 1. Here is an example:
<instance weight="0.75">
<value>5.1</value>
<value>3.5</value>
<value>1.4</value>
<value>0.2</value>
<value>Iris-setosa</value>
</instance>
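Reading the weight attribute with its default of 1 can be sketched with the JDK's DOM parser. The class XrffWeightSketch is hypothetical and not part of Weka; it relies on Element.getAttribute returning an empty string when the attribute is absent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Reads the optional weight attribute of a single XRFF instance element. */
final class XrffWeightSketch {

    static double weightOf(String instanceXml) {
        try {
            Element instance = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(instanceXml)))
                    .getDocumentElement();
            String weight = instance.getAttribute("weight"); // "" when not present
            return weight.isEmpty() ? 1.0 : Double.parseDouble(weight);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse instance", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(weightOf("<instance weight=\"0.75\"><value>5.1</value></instance>"));
        System.out.println(weightOf("<instance><value>5.1</value></instance>"));
    }
}
```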
Chapter 11
Converters
11.1 Introduction
Weka offers conversion utilities for several formats, in order to allow import from
different sorts of datasources. These utilities, called converters, are all located
in the following package:
weka.core.converters
For a certain kind of converter you will find two classes:
one for loading (classname ends with Loader) and
one for saving (classname ends with Saver).
Weka contains converters for the following data sources:
ARFF files (ArffLoader, ArffSaver)
C4.5 files (C45Loader, C45Saver)
CSV files (CSVLoader, CSVSaver)
files containing serialized instances (SerializedInstancesLoader,
SerializedInstancesSaver)
JDBC databases (DatabaseLoader, DatabaseSaver)
libsvm files (LibSVMLoader, LibSVMSaver)
XRFF files (XRFFLoader, XRFFSaver)
text directories for text mining (TextDirectoryLoader)
11.2 Usage
11.2.1 File converters
File converters can be used as follows:
Loader
They take one argument, which is the file that should be converted, and
print the result to stdout. You can also redirect the output into a file:
java <classname> <input-file> > <output-file>
Here's an example for loading the CSV file iris.csv and saving it as
iris.arff:
java weka.core.converters.CSVLoader iris.csv > iris.arff
Saver
For a Saver you specify the ARFF input file via -i and the output file in
the specific format with -o:
java <classname> -i <input> -o <output>
Here's an example for saving an ARFF file to CSV:
java weka.core.converters.CSVSaver -i iris.arff -o iris.csv
A few notes:
Using the ArffSaver from commandline doesn't make much sense, since
this Saver takes an ARFF file as input and output. The ArffSaver is
normally used from Java for saving an object of weka.core.Instances
to a file.
The C45Loader takes either the .names file or the .data file as input; it
automatically looks for the other one.
For the C45Saver one specifies as output file a filename without any
extension, since two output files will be generated; .names and .data are
automatically appended.
11.2.2 Database converters
The database converters are a bit more complex, since they also rely on
additional configuration files, besides the parameters on the commandline. The
setup for the database connection is stored in the following props file:
DatabaseUtils.props
The default file can be found here:
weka/experiment/DatabaseUtils.props
Loader
You have to specify at least a SQL query with the -Q option (there are
additional options for incremental loading)
java weka.core.converters.DatabaseLoader -Q "select * from employee"
Saver
The Saver takes an ARFF file as input like any other Saver, but additionally
requires the table to save the data to, specified via -T:
java weka.core.converters.DatabaseSaver -i iris.arff -T iris
Chapter 12
Stemmers
12.1 Introduction
Weka now supports stemming algorithms. The stemming algorithms are located
in the following package:
weka.core.stemmers
Currently, the Lovins Stemmer (+ iterated version) and support for the Snow-
ball stemmers are included.
12.2 Snowball stemmers
Weka contains a wrapper class for the Snowball (homepage: http://snowball.tartarus.org/)
stemmers (containing the Porter stemmer and several other stemmers for dif-
ferent languages). The relevant class is weka.core.stemmers.Snowball.
The Snowball classes are not included; they only have to be present in the
classpath. The reason for this is that the Weka team then doesn't have to watch out
for new versions of the stemmers and update them.
There are two ways of getting hold of the Snowball stemmers:
1. You can add the following pre-compiled jar archive to your classpath and
you're set (based on source code from 2005-10-19, compiled 2005-10-22).
http://www.cs.waikato.ac.nz/ml/weka/stemmers/snowball.jar
2. You can compile the stemmers yourself with the newest sources. Just
download the following ZIP file, unpack it and follow the instructions in
the README file (the zip contains an ANT (http://ant.apache.org/)
build script for generating the jar archive).
http://www.cs.waikato.ac.nz/ml/weka/stemmers/snowball.zip
Note: the patch target is specific to the source code from 2005-10-19.
12.3 Using stemmers
The stemmers can be used either
from commandline
within the StringToWordVector (package weka.filters.unsupervised.attribute)
12.3.1 Commandline
All stemmers support the following options:
-h
for displaying a brief help
-i <input-file>
The file to process
-o <output-file>
The file to output the processed data to (default stdout)
-l
Uses lowercase strings, i.e., the input is automatically converted to lower
case
12.3.2 StringToWordVector
Just use the GenericObjectEditor to choose the right stemmer and the desired
options (if the stemmer offers additional options).
12.4 Adding new stemmers
You can easily add new stemmers, if you follow these guidelines (for use in the
GenericObjectEditor):
they should be located in the weka.core.stemmers package (if not, then
the GenericObjectEditor.props/GenericPropertiesCreator.props file
needs to be updated) and
they must implement the interface weka.core.stemmers.Stemmer.
Chapter 13
Databases
13.1 Configuration files
Thanks to JDBC it is easy to connect to databases that provide a JDBC
driver. Responsible for the setup is the following properties file, located in
the weka.experiment package:
DatabaseUtils.props
You can get this properties file from the weka.jar or weka-src.jar jar-archive,
both part of a normal Weka release. If you open up one of those files, you'll find
the properties file in the sub-folder weka/experiment.
Weka comes with example files for a wide range of databases:
DatabaseUtils.props.hsql - HSQLDB (>=3.4.1)
DatabaseUtils.props.msaccess - MS Access (>3.4.14, >3.5.8, >3.6.0)
see the Windows databases chapter for more information.
DatabaseUtils.props.mssqlserver - MS SQL Server 2000 (>=3.4.9, >=3.5.4)
DatabaseUtils.props.mssqlserver2005 - MS SQL Server 2005 (>=3.4.11, >=3.5.6)
DatabaseUtils.props.mysql - MySQL (>=3.4.9, >=3.5.4)
DatabaseUtils.props.odbc - ODBC access via Sun's ODBC/JDBC bridge,
e.g., for MS Sql Server (>=3.4.9, >=3.5.4)
see the Windows databases chapter for more information.
DatabaseUtils.props.oracle - Oracle 10g (>=3.4.9, >=3.5.4)
DatabaseUtils.props.postgresql - PostgreSQL 7.4 (>=3.4.9, >=3.5.4)
DatabaseUtils.props.sqlite3 - sqlite 3.x (>3.4.12, >3.5.7)
The easiest way is just to place the extracted properties file into your HOME
directory. For more information on how property files are processed, check out
the following URL:
http://weka.wikispaces.com/Properties+File
Note: Weka only looks for the DatabaseUtils.props file. If you take one of
the example files listed above, you need to rename it first.
13.2 Setup
Under normal circumstances you only have to edit the following two properties:
jdbcDriver
jdbcURL
Driver
jdbcDriver is the classname of the JDBC driver, necessary to connect to your
database, e.g.:
HSQLDB
org.hsqldb.jdbcDriver
MS SQL Server 2000 (Desktop Edition)
com.microsoft.jdbc.sqlserver.SQLServerDriver
MS SQL Server 2005
com.microsoft.sqlserver.jdbc.SQLServerDriver
MySQL
org.gjt.mm.mysql.Driver (or com.mysql.jdbc.Driver)
ODBC - part of Sun's JDKs/JREs, no external driver necessary
sun.jdbc.odbc.JdbcOdbcDriver
Oracle
oracle.jdbc.driver.OracleDriver
PostgreSQL
org.postgresql.Driver
sqlite 3.x
org.sqlite.JDBC
URL
jdbcURL specifies the JDBC URL pointing to your database (can still be changed
in the Experimenter/Explorer), e.g. for the database MyDatabase on the server
server.my.domain:
HSQLDB
jdbc:hsqldb:hsql://server.my.domain/MyDatabase
MS SQL Server 2000 (Desktop Edition)
jdbc:microsoft:sqlserver://server.my.domain:1433
(Note: if you add ;databasename=db-name you can connect to a different
database than the default one, e.g., MyDatabase)
MS SQL Server 2005
jdbc:sqlserver://server.my.domain:1433
MySQL
jdbc:mysql://server.my.domain:3306/MyDatabase
ODBC
jdbc:odbc:DSN_name (replace DSN_name with the DSN that you want to
use)
Oracle (thin driver)
jdbc:oracle:thin:@server.my.domain:1526:orcl
(Note: @machineName:port:SID)
for the Express Edition you can use
jdbc:oracle:thin:@server.my.domain:1521:XE
PostgreSQL
jdbc:postgresql://server.my.domain:5432/MyDatabase
You can also specify user and password directly in the URL:
jdbc:postgresql://server.my.domain:5432/MyDatabase?user=<...>&password=<...>
where you have to replace the <...> with the correct values
sqlite 3.x
jdbc:sqlite:/path/to/database.db
(you can access only local files)
13.3 Missing Datatypes
Sometimes (e.g. with MySQL) it can happen that a column type cannot be
interpreted. In that case it is necessary to map the name of the column type
to the Java type it should be interpreted as. E.g. the MySQL type TEXT is
returned as BLOB from the JDBC driver and has to be mapped to String (0
represents String - the mappings can be found in the comments of the properties
file):
Java type   Java method    Identifier   Weka attribute type
String      getString()    0            nominal
boolean     getBoolean()   1            nominal
double      getDouble()    2            numeric
byte        getByte()      3            numeric
short       getShort()     4            numeric
int         getInt()       5            numeric
long        getLong()      6            numeric
float       getFloat()     7            numeric
date        getDate()      8            date
text        getString()    9            string
time        getTime()      10           date
In the props file one now lists the type names that the database returns and
what Java type each represents (via the identifier), e.g.:
CHAR=0
VARCHAR=0
CHAR and VARCHAR are both String types, hence they are interpreted as String
(identifier 0)
Note: in case database types have blanks, one needs to replace those blanks
with an underscore, e.g., DOUBLE PRECISION must be listed like this:
DOUBLE_PRECISION=2
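The naming rule for type mappings can be tried out with java.util.Properties. The following is only a sketch: the class TypeMappingSketch and the return value -1 for unmapped types are illustrative assumptions, and it is not how Weka itself performs the lookup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Looks up type identifiers in props-style text, replacing blanks by underscores. */
final class TypeMappingSketch {

    static Properties parse(String propsText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the identifier for a database type name, or -1 if it is not mapped. */
    static int identifierFor(Properties props, String dbTypeName) {
        String key = dbTypeName.trim().replace(' ', '_'); // DOUBLE PRECISION -> DOUBLE_PRECISION
        String id = props.getProperty(key);
        return id == null ? -1 : Integer.parseInt(id.trim());
    }

    public static void main(String[] args) {
        Properties props = parse("CHAR=0\nVARCHAR=0\nTEXT=0\nDOUBLE_PRECISION=2\n");
        System.out.println(identifierFor(props, "DOUBLE PRECISION"));
        System.out.println(identifierFor(props, "GEOMETRY"));
    }
}
```

A result of -1 in this sketch corresponds to the situation described above, where a column type cannot be interpreted until a mapping is added.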
13.4 Stored Procedures
Let's say you're tired of typing the same query over and over again. A good
way to shorten that is to create a stored procedure.
PostgreSQL 7.4.x
The following example creates a procedure called employee_name that returns
the names of all the employees in table employee. Even though it doesn't make
much sense to create a stored procedure for this query, nonetheless, it shows
how to create and call stored procedures in PostgreSQL.
Create
CREATE OR REPLACE FUNCTION public.employee_name()
RETURNS SETOF text AS 'select name from employee'
LANGUAGE sql VOLATILE;
SQL statement to call procedure
SELECT * FROM employee_name()
Retrieve data via InstanceQuery
java weka.experiment.InstanceQuery
-Q "SELECT * FROM employee_name()"
-U <user> -P <password>
13.5 Troubleshooting
In case you're experiencing problems connecting to your database, check
out the WEKA Mailing List (see Weka homepage for more information).
It is possible that somebody else encountered the same problem as you
and you'll find a post containing the solution to your problem.
Specific MS SQL Server 2000 Troubleshooting
Error Establishing Socket with JDBC Driver
Add TCP/IP to the list of protocols as stated in the following article:
http://support.microsoft.com/default.aspx?scid=kb;en-us;313178
Login failed for user 'sa'. Reason: Not associated with a trusted SQL
Server connection.
For changing the authentication to mixed mode see the following
article:
http://support.microsoft.com/kb/319930/en-us
MS SQL Server 2005: TCP/IP is not enabled for SQL Server, or the
server or port number specified is incorrect. Verify that SQL Server is
listening with TCP/IP on the specified server and port. This might be
reported with an exception similar to: "The login has failed. The TCP/IP
connection to the host has failed." This indicates one of the following:
SQL Server is installed but TCP/IP has not been installed as a
network protocol for SQL Server by using the SQL Server Network
Utility for SQL Server 2000, or the SQL Server Configuration Manager
for SQL Server 2005
TCP/IP is installed as a SQL Server protocol, but it is not listening
on the port specified in the JDBC connection URL. The default port
is 1433.
The port that is used by the server has not been opened in the firewall
The "Added driver: ..." output on the commandline does not mean that
the actual class was found, but only that Weka will attempt to load the
class later on in order to establish a database connection.
The error message "No suitable driver" can be caused by the following:
The JDBC driver you are attempting to load is not in the CLASSPATH
(Note: using -jar in the java commandline overwrites the
CLASSPATH environment variable!). Open the SimpleCLI, run the
command java weka.core.SystemInfo and check whether the property
java.class.path lists your database jar. If not, correct your
CLASSPATH or the Java call you start Weka with.
The JDBC driver class is misspelled in the jdbcDriver property or
you have multiple entries of jdbcDriver (properties files need unique
keys!)
The jdbcURL property has a spelling error and tries to use a
non-existing protocol or you listed it multiple times, which doesn't work
either (remember, properties files need unique keys!)
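One plausible reason why repeated keys cause trouble is the behaviour of java.util.Properties, which keeps only the last value for a key. The sketch below demonstrates this with the JDK alone; the class DuplicateKeySketch and its helper names are hypothetical, and whether Weka reads the file in exactly this way is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Shows that a repeated key in props-style text silently replaces the earlier value. */
final class DuplicateKeySketch {

    /** Returns the value java.util.Properties ends up with for the given key. */
    static String effectiveValue(String propsText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    /** Counts the lines that assign the given key, to spot accidental repeats. */
    static int occurrences(String propsText, String key) {
        int count = 0;
        for (String line : propsText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(key + "=") || trimmed.startsWith(key + " =")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "jdbcDriver=org.postgresql.Driver\n"
                + "jdbcURL=jdbc:postgresql://server.my.domain:5432/MyDatabase\n"
                + "jdbcDriver=org.sqlite.JDBC\n";
        System.out.println(effectiveValue(text, "jdbcDriver")); // the later entry wins
        System.out.println(occurrences(text, "jdbcDriver"));
    }
}
```

In this example the PostgreSQL driver entry is discarded in favour of the later sqlite entry, so the PostgreSQL URL no longer has a matching driver.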
Chapter 14
Windows databases
A common query we get from our users is how to open a Windows database in
the Weka Explorer. This page is intended as a guide to help you achieve this. It
is a complicated process and we cannot guarantee that it will work for you. The
process described makes use of the JDBC-ODBC bridge that is part of Sun's
JRE/JDK 1.3 (and higher).
The following instructions are for Windows 2000. Under other Windows
versions there may be slight differences.
Step 1: Create a User DSN
1. Go to the Control Panel
2. Choose Administrative Tools
3. Choose Data Sources (ODBC)
4. At the User DSN tab, choose Add...
5. Choose database
Microsoft Access
(a) Note: Make sure your database is not open in another application
before following the steps below.
(b) Choose the Microsoft Access driver and click Finish
(c) Give the source a name by typing it into the Data Source
Name field
(d) In the Database section, choose Select...
(e) Browse to find your database file, select it and click OK
(f) Click OK to finalize your DSN
Microsoft SQL Server 2000 (Desktop Engine)
(a) Choose the SQL Server driver and click Finish
(b) Give the source a name by typing it into the Name field
(c) Add a description for this source in the Description field
(d) Select the server you're connecting to from the Server combobox
(e) For the verification of the authenticity of the login ID choose
With SQL Server...
(f) Check Connect to SQL Server to obtain default settings...
and supply the user ID and password with which you installed
the Desktop Engine
(g) Just click on Next until it changes into Finish and click this,
too
(h) For testing purposes, click on Test Data Source... - the result
should be TESTS COMPLETED SUCCESSFULLY!
(i) Click on OK
MySQL
(a) Choose the MySQL ODBC driver and click Finish
(b) Give the source a name by typing it into the Data Source
Name field
(c) Add a description for this source in the Description field
(d) Specify the server you're connecting to in Server
(e) Fill in the user to use for connecting to the database in the User
field, the same for the password
(f) Choose the database for this DSN from the Database combobox
(g) Click on OK
6. Your DSN should now be listed in the User Data Sources list
Step 2: Set up the DatabaseUtils.props file
You will need to configure a file called DatabaseUtils.props. This file already
exists under the path weka/experiment/ in the weka.jar file (which is just a
ZIP file) that is part of the Weka download. In this directory you will also find a
sample file for ODBC connectivity, called DatabaseUtils.props.odbc, and one
specifically for MS Access, called DatabaseUtils.props.msaccess, also using
ODBC. You should use one of the sample files as basis for your setup, since they
already contain default values specific to ODBC access.
This file needs to be recognized when the Explorer starts. You can achieve
this by making sure it is in the working directory or the home directory (if you
are unsure what the terms working directory and home directory mean, see the
Notes section). The easiest is probably the second alternative, as the setup will
apply to all the Weka instances on your machine.
Just make sure that the file contains the following lines at least:
jdbcDriver=sun.jdbc.odbc.JdbcOdbcDriver
jdbcURL=jdbc:odbc:dbname
where dbname is the name you gave the user DSN. (This can also be changed
once the Explorer is running.)
Step 3: Open the database
1. Start up the Weka Explorer.
2. Choose Open DB...
3. The URL should read jdbc:odbc:dbname where dbname is the name
you gave the user DSN.
4. Click Connect
5. Enter a Query, e.g., select * from tablename where tablename is
the name of the database table you want to read. Or you could put a
more complicated SQL query here instead.
6. Click Execute
7. When you're satisfied with the returned data, click OK to load the data
into the Preprocess panel.
Notes
Working directory
The directory a process is started from. When you start Weka from the
Windows Start Menu, then this directory would be Weka's installation
directory (the java process is started from that directory).
Home directory
The directory that contains all the user's data. The exact location depends
on the operating system and the version of the operating system. It is
stored in the following environment variable:
Unix/Linux
$HOME
Windows
%USERPROFILE%
Cygwin
$USERPROFILE
You should be able to output the value in a command prompt/terminal with
the echo command. E.g., for Windows this would be:
echo %USERPROFILE%
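From within Java, the two directories can be inspected via the standard system properties user.dir and user.home. The class PropsLocationSketch below is a hypothetical helper, not part of Weka; on most systems user.home corresponds to the environment variables listed above, but this is not guaranteed.

```java
import java.io.File;

/** Builds the two candidate locations for a props file discussed in the Notes. */
final class PropsLocationSketch {

    /** Location of the file in the home directory of the current user. */
    static File inHome(String fileName) {
        return new File(System.getProperty("user.home"), fileName);
    }

    /** Location of the file in the directory the java process was started from. */
    static File inWorkingDir(String fileName) {
        return new File(System.getProperty("user.dir"), fileName);
    }

    public static void main(String[] args) {
        for (File candidate : new File[] {
                inWorkingDir("DatabaseUtils.props"), inHome("DatabaseUtils.props")}) {
            System.out.println(candidate + " exists: " + candidate.exists());
        }
    }
}
```

Running it from the directory Weka is started from shows which of the two locations currently contains a DatabaseUtils.props file.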
Part IV
Appendix
Chapter 15
Research
15.1 Citing Weka
If you want to refer to Weka in a publication, please cite the following SIGKDD
Explorations¹ paper. The full citation is:
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter
Reutemann, Ian H. Witten (2009); The WEKA Data Mining
Software: An Update; SIGKDD Explorations, Volume 11, Issue 1.
15.2 Paper references
Due to the introduction of the weka.core.TechnicalInformationHandler
interface it is now easy to extract all the paper references via
weka.core.ClassDiscovery and weka.core.TechnicalInformation.
The script listed at the end extracts all the paper references from Weka
based on a given jar file and dumps them to stdout. One can either generate simple
plain text output (option -p) or BibTeX compliant output (option -b).
Typical use (after an ant exejar) for BibTeX:
get_wekatechinfo.sh -d ../ -w ../dist/weka.jar -b > ../tech.txt
(command is issued from the same directory the Weka build.xml is located in)
¹ http://www.kdd.org/explorations/issues/11-1-2009-07/p2V11n1.pdf
Bash shell script get_wekatechinfo.sh
#!/bin/bash
#
# This script prints the information stored in TechnicalInformationHandlers
# to stdout.
#
# FracPete, $Revision: 4582 $
# the usage of this script
function usage()
{
echo
echo "${0##*/} -d <dir> [-w <jar>] [-p|-b] [-h]"
echo
echo "Prints the information stored in TechnicalInformationHandlers to stdout."
echo
echo " -h this help"
echo " -d <dir>"
echo " the directory to look for packages, must be the one just above"
echo " the weka package, default: $DIR"
echo " -w <jar>"
echo " the weka jar to use, if not in CLASSPATH"
echo " -p prints the information in plaintext format"
echo " -b prints the information in BibTeX format"
echo
}
# generates a filename out of the classname TMP and returns it in TMP
# uses the directory in DIR
function class_to_filename()
{
TMP=$DIR"/"$(echo $TMP | sed s/"\."/"\/"/g)".java"
}
# variables
DIR="."
PLAINTEXT="no"
BIBTEX="no"
WEKA=""
TECHINFOHANDLER="weka.core.TechnicalInformationHandler"
TECHINFO="weka.core.TechnicalInformation"
CLASSDISCOVERY="weka.core.ClassDiscovery"
# interpret parameters
while getopts ":hpbw:d:" flag
do
case $flag in
p) PLAINTEXT="yes"
;;
b) BIBTEX="yes"
;;
d) DIR=$OPTARG
;;
w) WEKA=$OPTARG
;;
h) usage
exit 0
;;
*) usage
exit 1
;;
esac
done
# either plaintext or bibtex
if [ "$PLAINTEXT" = "$BIBTEX" ]
then
echo
echo "ERROR: either -p or -b has to be given!"
echo
usage
exit 2
fi
# do we have everything?
if [ "$DIR" = "" ] || [ ! -d "$DIR" ]
then
echo
echo "ERROR: no directory or non-existing one provided!"
echo
usage
exit 3
fi
# generate Java call
if [ "$WEKA" = "" ]
then
JAVA="java"
else
JAVA="java -classpath $WEKA"
fi
if [ "$PLAINTEXT" = "yes" ]
then
CMD="$JAVA $TECHINFO -plaintext"
elif [ "$BIBTEX" = "yes" ]
then
CMD="$JAVA $TECHINFO -bibtex"
fi
# find packages
TMP=$(find $DIR -mindepth 1 -type d | grep -v CVS | sed s/".*weka"/"weka"/g | sed s/"\/"/./g)
PACKAGES=$(echo $TMP | sed s/" "/,/g)
# get technicalinformationhandlers
TECHINFOHANDLERS=$($JAVA weka.core.ClassDiscovery $TECHINFOHANDLER $PACKAGES | grep "\. weka" | sed s/".*weka"/weka/g)
# output information
echo
for i in $TECHINFOHANDLERS
do
TMP=$i;class_to_filename
# exclude internal classes
if [ ! -f $TMP ]
then
continue
fi
$CMD -W $i
echo
done
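The class_to_filename step of the script, which maps a class name to the source file that is expected to exist, can be mirrored in Java as follows. The class ClassToFilenameSketch is a hypothetical illustration and not part of Weka.

```java
/** Mirrors the script's class_to_filename function: dots become path separators. */
final class ClassToFilenameSketch {

    /** Builds <dir>/<package path>/<Class>.java for a fully qualified class name. */
    static String classToFilename(String dir, String className) {
        return dir + "/" + className.replace('.', '/') + ".java";
    }

    public static void main(String[] args) {
        System.out.println(classToFilename(".", "weka.core.Instances"));
        // Names of nested classes contain a '$' and have no source file of their own,
        // which is why the script skips classes whose file does not exist.
        System.out.println(classToFilename(".", "weka.core.Outer$Inner"));
    }
}
```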
Chapter 16
Using the API
Using the graphical tools, like the Explorer, or just the command-line is in most
cases sufficient for the normal user. But WEKA's clearly defined API
(application programming interface) makes it very easy to embed it in other
projects. This chapter covers the basics of how to achieve the following common
tasks from source code:
Setting options
Creating datasets in memory
Loading and saving data
Filtering
Classifying
Clustering
Selecting attributes
Visualization
Serialization
Even though most of the code examples are for the Linux platform, using
forward slashes in the paths and file names, they do work on the MS Windows
platform as well. To make the examples work under MS Windows, one only needs
to adapt the paths, changing the forward slashes to backslashes and adding a
drive letter where necessary.
Note
WEKA is released under the GNU General Public License version 2 (GPLv2)¹,
i.e., derived code or code that uses WEKA needs to be released under the
GPLv2 as well. If one is just using WEKA for a personal project that does
not get released publicly then one is not affected. But as soon as one makes
the project publicly available (e.g., for download), then one needs to make the
source code available under the GPLv2 as well, alongside the binaries.
¹ http://www.gnu.org/licenses/gpl-2.0.html
16.1 Option handling
Configuring an object, e.g., a classifier, can either be done using the
appropriate get/set-methods for the property that one wishes to change, like the
Explorer does. Or, if the class implements the weka.core.OptionHandler
interface, one can just use the object's ability to parse command-line options
via the setOptions(String[]) method (the counterpart of this method is
getOptions(), which returns a String[] array). The difference between the
two approaches is that the setOptions(String[]) method cannot be used
to set the options incrementally. Default values are used for all options that
haven't been explicitly specified in the options array.
The most basic approach is to assemble the String array by hand. The
following example creates an array with a single option ("-R") that takes an
argument ("1") and initializes the Remove filter with this option:
import weka.filters.unsupervised.attribute.Remove;
...
String[] options = new String[2];
options[0] = "-R";
options[1] = "1";
Remove rm = new Remove();
rm.setOptions(options);
Since the setOptions(String[]) method expects a fully parsed and correctly
split up array (which is done by the console/command prompt), some common
pitfalls with this approach are:
Combination of option and argument - Using "-R 1" as an element of
the String array will fail, prompting WEKA to output an error message
stating that the option "R 1" is unknown.
Trailing blanks - Using "-R " will fail as well, since no trailing blanks are
removed and therefore option "R " will not be recognized.
The easiest way to avoid these problems is to provide a String array that
has been generated automatically from a single command-line string using the
splitOptions(String) method of the weka.core.Utils class. Here is an
example:
import weka.core.Utils;
...
String[] options = Utils.splitOptions("-R 1");
As this method ignores surrounding whitespace, using " -R 1" or "-R 1 " will
return the same result as "-R 1".
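Why a correctly split array matters can be seen in a JDK-only sketch. The class SplitSketch below is hypothetical and is not Weka's splitOptions(String) implementation; in particular, it has no support for quoted or escaped arguments. It only shows that trimming and splitting on whitespace yields the same two elements for all three spellings.

```java
import java.util.Arrays;

/** Naive whitespace splitting of a command-line string (no quoting support). */
final class SplitSketch {

    /** Trims the input and splits it on runs of whitespace. */
    static String[] split(String commandLine) {
        String trimmed = commandLine.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("-R 1")));
        System.out.println(Arrays.toString(split(" -R 1")));
        System.out.println(Arrays.toString(split("-R 1 ")));
    }
}
```

Each call prints [-R, 1], whereas a hand-built array containing the single element "-R 1" would pass option and argument as one string.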
Complicated command-lines with lots of nested options, e.g., options for the
support-vector machine classifier SMO (package weka.classifiers.functions)
including a kernel setup, are a bit tricky, since Java requires one to escape
double quotes and backslashes inside a String. The Wiki[2] article "Use Weka
in your Java code" references the Java class OptionsToCode, which turns any
command-line into appropriate Java source code. This example class is also
available from the Weka Examples collection[3]: weka.core.OptionsToCode.
Instead of using the Remove filter's setOptions(String[]) method, the
following code snippet uses the actual set-method for this property:
import weka.filters.unsupervised.attribute.Remove;
...
Remove rm = new Remove();
rm.setAttributeIndices("1");
In order to find out which option belongs to which property, i.e., get/set-
method, it is best to have a look at the setOptions(String[]) and getOptions()
methods. In case these methods use the member variables directly, one just has
to look for the methods making this particular member variable accessible to
the outside.
Using the set-methods, one will most likely come across ones that
require a weka.core.SelectedTag as parameter. An example for this is the
setEvaluation method of the meta-classifier GridSearch (located in package
weka.classifiers.meta). The SelectedTag class is used in the GUI for
displaying drop-down lists, enabling the user to choose from a predefined list of
values. GridSearch allows the user to choose the statistical measure to base the
evaluation on (accuracy, correlation coefficient, etc.).
A SelectedTag gets constructed using the array of all possible weka.core.Tag
elements that can be chosen and the integer or string ID of the Tag. For
instance, GridSearch's setOptions(String[]) method uses the supplied string
ID to set the evaluation type (e.g., "ACC" for accuracy), or, if the evaluation
option is missing, the default integer ID EVALUATION_CC. In both cases, the
array TAGS_EVALUATION is used, which defines all possible options:
import weka.core.SelectedTag;
import weka.core.Utils;
...
// read the value of the -E option; an empty string means it was not supplied
String tmpStr = Utils.getOption('E', options);
if (tmpStr.length() != 0)
  setEvaluation(new SelectedTag(tmpStr, TAGS_EVALUATION));
else
  setEvaluation(new SelectedTag(EVALUATION_CC, TAGS_EVALUATION));
16.2 Loading data
Before any filter, classifier or clusterer can be applied, data needs to be present.
WEKA enables one to load data from files (in various file formats) and also from
databases. In the latter case, it is assumed that the database connection is
set up and working. See chapter 13 for more details on how to configure WEKA
correctly and also more information on JDBC (Java Database Connectivity)
URLs.
Example classes, making use of the functionality covered in this section, can
be found in the wekaexamples.core.converters package of the Weka
Examples collection[3].
The following classes are used to store data in memory:
• weka.core.Instances - holds a complete dataset. This data structure
  is row-based; single rows can be accessed via the instance(int) method
  using a 0-based index. Information about the columns can be accessed via
  the attribute(int) method. This method returns weka.core.Attribute
  objects (see below).
• weka.core.Instance - encapsulates a single row. It is basically a wrapper
  around an array of double primitives. Since this class contains no
  information about the type of the columns, it always needs access to a
  weka.core.Instances object (see methods dataset and setDataset).
  The class weka.core.SparseInstance is used in case of sparse data.
• weka.core.Attribute - holds the type information about a single column
  in the dataset. It stores the type of the attribute, as well as the labels
  for nominal attributes, the possible values for string attributes or the
  datasets for relational attributes (these are just weka.core.Instances
  objects again).
16.2.1 Loading data from files
When loading data from files, one can either let WEKA choose the appropriate
loader (the available loaders can be found in the weka.core.converters pack-
age) based on the file's extension or one can use the correct loader directly. The
latter case is necessary if the files do not have the correct extension.
The DataSource class (inner class of the weka.core.converters.ConverterUtils
class) can be used to read data from files that have the appropriate file extension.
Here are some examples:
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.Instances;
...
Instances data1 = DataSource.read("/some/where/dataset.arff");
Instances data2 = DataSource.read("/some/where/dataset.csv");
Instances data3 = DataSource.read("/some/where/dataset.xrff");
In case the file has a different file extension than is normally associated
with the loader, one has to use a loader directly. The following example loads
a CSV (comma-separated values) file:
import weka.core.converters.CSVLoader;
import weka.core.Instances;
import java.io.File;
...
CSVLoader loader = new CSVLoader();
loader.setSource(new File("/some/where/some.data"));
Instances data = loader.getDataSet();
NB: Not all file formats can store information about the class attribute
(e.g., ARFF stores no information about the class attribute, but XRFF does). If a
class attribute is required further down the road, e.g., when using a classifier,
it can be set with the setClassIndex(int) method:
// uses the first attribute as class attribute
if (data.classIndex() == -1)
  data.setClassIndex(0);
...
// uses the last attribute as class attribute
if (data.classIndex() == -1)
  data.setClassIndex(data.numAttributes() - 1);
16.2.2 Loading data from databases
For loading data from databases, one of the following two classes can be used:
• weka.experiment.InstanceQuery
• weka.core.converters.DatabaseLoader
The differences between them are that the InstanceQuery class allows one to
retrieve sparse data and the DatabaseLoader can retrieve the data incremen-
tally.
Here is an example of using the InstanceQuery class:
import weka.core.Instances;
import weka.experiment.InstanceQuery;
...
InstanceQuery query = new InstanceQuery();
query.setDatabaseURL("jdbc_url");
query.setUsername("the_user");
query.setPassword("the_password");
query.setQuery("select * from whatsoever");
// if your data is sparse, then you can say so, too:
// query.setSparseData(true);
Instances data = query.retrieveInstances();
And an example using the DatabaseLoader class in batch retrieval:
import weka.core.Instances;
import weka.core.converters.DatabaseLoader;
...
DatabaseLoader loader = new DatabaseLoader();
loader.setSource("jdbc_url", "the_user", "the_password");
loader.setQuery("select * from whatsoever");
Instances data = loader.getDataSet();
The DatabaseLoader is used in incremental mode as follows:
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.DatabaseLoader;
...
DatabaseLoader loader = new DatabaseLoader();
loader.setSource("jdbc_url", "the_user", "the_password");
loader.setQuery("select * from whatsoever");
Instances structure = loader.getStructure();
Instances data = new Instances(structure);
Instance inst;
while ((inst = loader.getNextInstance(structure)) != null)
  data.add(inst);
Notes:
• Not all database systems allow incremental retrieval.
• Not all queries have a unique key to retrieve rows incrementally. In that
  case, one can supply the necessary columns with the setKeys(String)
  method (comma-separated list of columns).
• If the data cannot be retrieved in an incremental fashion, it is first fully
  loaded into memory and then provided row-by-row (pseudo-incremental).
16.3 Creating datasets in memory
Loading datasets from disk or from a database is not the only way of obtaining
data in WEKA: datasets can also be created in memory or on-the-fly. Generating a
dataset memory structure (i.e., a weka.core.Instances object) is a two-stage
process:
1. Defining the format of the data by setting up the attributes.
2. Adding the actual data, row by row.
The class wekaexamples.core.CreateInstances of the Weka Examples collection[3]
generates an Instances object containing all attribute types WEKA can handle
at the moment.
16.3.1 Defining the format
There are currently five different types of attributes available in WEKA:
• numeric - continuous variables
• date - date variables
• nominal - predefined labels
• string - textual data
• relational - contains other relations, e.g., the bags in case of multi-
  instance data
For all of the different attribute types, WEKA uses the same class, weka.core.Attribute,
but with different constructors. In the following, these different constructors are
explained.
• numeric - The easiest attribute type to create, as it requires only the
  name of the attribute:
Attribute numeric = new Attribute("name_of_attr");
• date - Date attributes are handled internally as numeric attributes, but
  in order to parse and present the date value correctly, the format of the
  date needs to be specified. The date and time patterns are explained in
  detail in the Javadoc of the java.text.SimpleDateFormat class. The
  following is an example of how to create a date attribute using a date format
  of 4-digit year, 2-digit month and 2-digit day, separated by hyphens:
Attribute date = new Attribute("name_of_attr", "yyyy-MM-dd");
• nominal - Since nominal attributes contain predefined labels, one needs
  to supply these, stored in the form of a weka.core.FastVector object:
FastVector labels = new FastVector();
labels.addElement("label_a");
labels.addElement("label_b");
labels.addElement("label_c");
labels.addElement("label_d");
Attribute nominal = new Attribute("name_of_attr", labels);
• string - In contrast to nominal attributes, this type does not store a
  predefined list of labels. It is normally used to store textual data, e.g., the
  content of documents for text categorization. The same constructor as for the
  nominal attribute is used, but a null value is provided instead of an
  instance of FastVector:
Attribute string = new Attribute("name_of_attr", (FastVector) null);
• relational - This attribute just takes another weka.core.Instances
  object for defining the relational structure in the constructor. The follow-
  ing code snippet generates a relational attribute that contains a relation
  with two attributes, a numeric and a nominal attribute:
FastVector atts = new FastVector();
atts.addElement(new Attribute("rel.num"));
FastVector values = new FastVector();
values.addElement("val_A");
values.addElement("val_B");
values.addElement("val_C");
atts.addElement(new Attribute("rel.nom", values));
Instances rel_struct = new Instances("rel", atts, 0);
Attribute relational = new Attribute("name_of_attr", rel_struct);
A weka.core.Instances object is then created by supplying a FastVector
object containing all the attribute objects. The following example creates a
dataset with two numeric attributes and a nominal class attribute with two
labels "no" and "yes":
Attribute num1 = new Attribute("num1");
Attribute num2 = new Attribute("num2");
FastVector labels = new FastVector();
labels.addElement("no");
labels.addElement("yes");
Attribute cls = new Attribute("class", labels);
FastVector attributes = new FastVector();
attributes.addElement(num1);
attributes.addElement(num2);
attributes.addElement(cls);
Instances dataset = new Instances("Test-dataset", attributes, 0);
The final argument in the Instances constructor above tells WEKA how much
memory to reserve for upcoming weka.core.Instance objects. If one knows
how many rows will be added to the dataset, then it should be specified as it
saves costly operations for expanding the internal storage. It doesn't matter if
one aims too high with the number of rows to be added; it is always possible to
trim the dataset again, using the compactify() method.
16.3.2 Adding data
After the structure of the dataset has been defined, one can add the actual data
to it, row by row. There are basically two constructors of the weka.core.Instance
class that one can use for this purpose:
• Instance(double weight, double[] attValues) - generates an Instance
  object with the specified weight and the given double values. WEKA's
  internal format uses doubles for all attribute types. For nominal, string
  and relational attributes this is just an index of the stored values.
• Instance(int numAttributes) - generates a new Instance object with
  weight 1.0 and all missing values.
The second constructor may be easier to use, but setting values via the Instance
class methods is a bit costly, especially if one is adding a lot of rows. There-
fore, the following code examples cover the first constructor. For simplicity, an
Instances object "data" based on the code snippets for the different attribute
types introduced above is used, as it contains all possible attribute types.
For each instance, the first step is to create a new double array to hold
the attribute values. It is important not to reuse this array, but always create
a new one, since WEKA only references it and does not create a copy of it
when instantiating the Instance object. Reusing the array means changing the
previously generated Instance object:
double[] values = new double[data.numAttributes()];
After that, the double array is filled with the actual values:
• numeric - just sets the numeric value:
values[0] = 1.23;
• date - turns the date string into a double value:
values[1] = data.attribute(1).parseDate("2001-11-09");
• nominal - determines the index of the label:
values[2] = data.attribute(2).indexOf("label_b");
• string - determines the index of the string, using the addStringValue
  method (internally, a hashtable holds all the string values):
values[3] = data.attribute(3).addStringValue("This is a string");
• relational - first, a new Instances object based on the attribute's rela-
  tional definition has to be created, before the index of it can be determined,
  using the addRelation method:
Instances dataRel = new Instances(data.attribute(4).relation(), 0);
double[] valuesRel = new double[dataRel.numAttributes()];
valuesRel[0] = 2.34;
valuesRel[1] = dataRel.attribute(1).indexOf("val_C");
dataRel.add(new Instance(1.0, valuesRel));
values[4] = data.attribute(4).addRelation(dataRel);
Finally, an Instance object is generated with the initialized double array and
added to the dataset:
Instance inst = new Instance(1.0, values);
data.add(inst);
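The date conversion used above can be illustrated with the JDK alone. The sketch below assumes that a date value is stored as the number of milliseconds since 1970-01-01, converted to a double. The class DateValueSketch and its methods are hypothetical and not part of WEKA, and the fixed UTC time zone is an assumption made only to keep the result independent of the local machine.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper, not part of WEKA: mimics turning a date string
// into a double by way of epoch milliseconds.
final class DateValueSketch {

  // Builds a strict formatter for the pattern; UTC is an assumption made
  // here so that the result does not depend on the local time zone.
  private static SimpleDateFormat formatter(String pattern) {
    SimpleDateFormat format = new SimpleDateFormat(pattern);
    format.setTimeZone(TimeZone.getTimeZone("UTC"));
    format.setLenient(false);
    return format;
  }

  // Parses the text and returns milliseconds since 1970-01-01 as a double.
  static double parseDate(String pattern, String text) {
    try {
      return (double) formatter(pattern).parse(text).getTime();
    } catch (ParseException e) {
      throw new IllegalArgumentException("Not a valid date: " + text, e);
    }
  }

  // Converts the stored double back into its string representation.
  static String formatDate(String pattern, double value) {
    return formatter(pattern).format(new Date((long) value));
  }

  public static void main(String[] args) {
    double value = parseDate("yyyy-MM-dd", "2001-11-09");
    System.out.println(value);
    System.out.println(formatDate("yyyy-MM-dd", value));
  }
}
```

Storing the date as a plain double is what allows a date column to share the same double array as the numeric, nominal and string columns.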
16.4 Randomizing data
Since learning algorithms can be sensitive to the order the data arrives in, random-
izing (also called "shuffling") the data is a common approach to alleviate this
problem. Especially repeated randomizations, e.g., as during cross-validation,
help to generate more realistic statistics.
WEKA offers two possibilities for randomizing a dataset:
• Using the randomize(Random) method of the weka.core.Instances ob-
  ject containing the data itself. This method requires an instance of the
  java.util.Random class. How to correctly instantiate such an object is
  explained below.
• Using the Randomize filter (package weka.filters.unsupervised.instance).
  For more information on how to use filters, see section 16.5.
A very important aspect of Machine Learning experiments is that experiments
have to be repeatable. Subsequent runs of the same experiment setup have
to yield the exact same results. It may seem weird, but randomization is still
possible in this scenario. Random number generators never return a completely
random sequence of numbers anyway, only a pseudo-random one. In order to
achieve repeatable pseudo-random sequences, seeded generators are used. Using
the same seed value will always result in the same sequence.
The default constructor of the java.util.Random random number generator
class should never be used, as objects created this way will most likely generate
different sequences. The constructor Random(long), using a specified seed value,
is the recommended one to use.
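The effect of a fixed seed can be checked with the JDK alone. In the following sketch, Collections.shuffle merely stands in for the randomize(Random) method, and the class SeededShuffleSketch is a hypothetical helper that shuffles row indices rather than an actual dataset.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of WEKA: shuffles the row indices
// 0..n-1 with a seeded random number generator.
final class SeededShuffleSketch {

  static List<Integer> shuffledIndices(int n, long seed) {
    List<Integer> indices = new ArrayList<Integer>();
    for (int i = 0; i < n; i++) {
      indices.add(i);
    }
    // the seed fully determines the resulting order
    Collections.shuffle(indices, new Random(seed));
    return indices;
  }

  public static void main(String[] args) {
    System.out.println(shuffledIndices(10, 42L));
    System.out.println(shuffledIndices(10, 42L)); // same order as the line above
    System.out.println(shuffledIndices(10, 43L)); // another seed, for comparison
  }
}
```

Recording the seed alongside the results of an experiment is then enough to reproduce the exact order of the data later on.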
In order to get a more dataset-dependent randomization of the data, the
getRandomNumberGenerator(int) method of the weka.core.Instances class
can be used. This method returns a java.util.Random object that was seeded
with the sum of the supplied seed and the hashcode of the string representation
of a randomly chosen weka.core.Instance of the Instances object (using a
random number generator seeded with the seed supplied to this method).
16.5 Filtering
In WEKA, filters are used to preprocess the data. They can be found below
package weka.filters. Each filter falls into one of the following two categories:
• supervised - The filter requires a class attribute to be set.
• unsupervised - A class attribute is not required to be present.
And into one of the two sub-categories:
• attribute-based - Columns are processed, e.g., added or removed.
• instance-based - Rows are processed, e.g., added or deleted.
These categories should make it clear what the difference between the two
Discretize filters in WEKA is. The supervised one takes the class attribute
and its distribution over the dataset into account, in order to determine the
optimal number and size of bins, whereas the unsupervised one relies on a user-
specified number of bins.
Apart from this classification, filters are either stream- or batch-based. Stream
filters can process the data straight away and make it immediately available for
collection again. Batch filters, on the other hand, need a batch of data to set up
their internal data structures. The Add filter (this filter can be found in the
weka.filters.unsupervised.attribute package) is an example of a stream
filter. Adding a new attribute with only missing values does not require any
sophisticated setup. However, the ReplaceMissingValues filter (same package
as the Add filter) needs a batch of data in order to determine the means and
modes for each of the attributes. Otherwise, the filter will not be able to replace
the missing values with meaningful values. But as soon as a batch filter has been
initialized with the first batch of data, it can also process data on a row-by-row
basis, just like a stream filter.
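The two phases of a batch filter can be sketched without WEKA. The class MeanImputerSketch below is a hypothetical stand-in and not the ReplaceMissingValues filter: it handles numeric columns only (means, no modes), assumes a non-empty first batch and encodes missing cells as Double.NaN.

```java
import java.util.Arrays;

// Hypothetical stand-in for a batch filter; it is not WEKA's
// ReplaceMissingValues filter and handles numeric columns only.
final class MeanImputerSketch {

  private double[] means; // stays null until the first batch has been seen

  // First batch: derive one mean per column, skipping missing (NaN) cells.
  void initialize(double[][] batch) {
    int numColumns = batch[0].length;
    double[] sums = new double[numColumns];
    int[] counts = new int[numColumns];
    for (double[] row : batch) {
      for (int c = 0; c < numColumns; c++) {
        if (!Double.isNaN(row[c])) {
          sums[c] += row[c];
          counts[c]++;
        }
      }
    }
    means = new double[numColumns];
    for (int c = 0; c < numColumns; c++) {
      // a column without any known value keeps NaN, i.e., stays missing
      means[c] = counts[c] > 0 ? sums[c] / counts[c] : Double.NaN;
    }
  }

  // Afterwards: single rows can be processed without recomputing anything.
  double[] process(double[] row) {
    if (means == null) {
      throw new IllegalStateException("no batch of data seen yet");
    }
    double[] result = row.clone();
    for (int c = 0; c < result.length; c++) {
      if (Double.isNaN(result[c])) {
        result[c] = means[c];
      }
    }
    return result;
  }

  public static void main(String[] args) {
    MeanImputerSketch filter = new MeanImputerSketch();
    filter.initialize(new double[][] {{1.0, Double.NaN}, {3.0, 4.0}});
    double[] row = {Double.NaN, Double.NaN};
    System.out.println(Arrays.toString(filter.process(row)));
  }
}
```

Calling process before initialize fails on purpose, mirroring the fact that a batch filter has nothing meaningful to substitute until it has seen a first batch.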
Instance-based filters are a bit special in the way they handle data. As
mentioned earlier, all filters can process data on a row-by-row basis after the
first batch of data has been passed through. Of course, if a filter adds or removes
rows from a batch of data, this no longer works when working in single-row
processing mode. This makes sense, if one thinks of a scenario involving the
FilteredClassifier meta-classifier: after the training phase (= first batch of
data), the classifier will get evaluated against a test set, one instance at a time.
If the filter now removes the only instance or adds instances, it can no longer be
evaluated correctly, as the evaluation expects to get only a single result back.
This is the reason why instance-based filters only pass through any subsequent
batch of data without processing it. The Resample filters, for instance, act like
this.
One can find example classes for filtering in the wekaexamples.filters
package of the Weka Examples collection[3].
The following example uses the Remove filter (the filter is located in package
weka.filters.unsupervised.attribute) to remove the first attribute from a
dataset. For setting the options, the setOptions(String[]) method is used.
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
...
String[] options = new String[2];
options[0] = "-R"; // "range"
options[1] = "1"; // first attribute
Remove remove = new Remove(); // new instance of filter
remove.setOptions(options); // set options
remove.setInputFormat(data); // inform filter about dataset
// **AFTER** setting options
Instances newData = Filter.useFilter(data, remove); // apply filter
A common trap to fall into is setting options after the setInputFormat(Instances)
has been called. Since this method is (normally) used to determine the output
format of the data, all the options have to be set before calling it. Otherwise,
all options set afterwards will be ignored.
16.5.1 Batch filtering
Batch filtering is necessary if two or more datasets need to be processed accord-
ing to the same filter initialization. If batch filtering is not used, for instance
when generating a training and a test set using the StringToWordVector fil-
ter (package weka.filters.unsupervised.attribute), then these two filter
runs are completely independent and will most likely create two incompatible
datasets. Running the StringToWordVector filter on two different datasets
results in two different word dictionaries and therefore different attributes being
generated.
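The incompatibility described above can be reproduced in a few lines of plain Java. The class WordDictionarySketch is a hypothetical and heavily simplified stand-in for a word-vector filter: it splits on whitespace and records word presence only, which is an assumption of this sketch and not how StringToWordVector is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical, heavily simplified stand-in for a word-vector filter.
final class WordDictionarySketch {

  // Collects the distinct lower-cased words of all documents, sorted.
  static List<String> buildDictionary(String[] documents) {
    TreeSet<String> words = new TreeSet<String>();
    for (String document : documents) {
      for (String token : document.toLowerCase().split("\\s+")) {
        if (!token.isEmpty()) {
          words.add(token);
        }
      }
    }
    return new ArrayList<String>(words);
  }

  // Maps a document to 0/1 word-presence values, one per dictionary entry.
  static int[] toVector(String document, List<String> dictionary) {
    List<String> tokens = Arrays.asList(document.toLowerCase().split("\\s+"));
    int[] vector = new int[dictionary.size()];
    for (int i = 0; i < vector.length; i++) {
      vector[i] = tokens.contains(dictionary.get(i)) ? 1 : 0;
    }
    return vector;
  }

  public static void main(String[] args) {
    String[] train = {"good movie", "bad movie"};
    String[] test = {"great movie"};
    // independent runs: two different lists of "attributes"
    System.out.println(buildDictionary(train)); // [bad, good, movie]
    System.out.println(buildDictionary(test));  // [great, movie]
    // batch-style: the training dictionary is reused for the test document
    System.out.println(Arrays.toString(toVector(test[0], buildDictionary(train)))); // [0, 0, 1]
  }
}
```

With the dictionary built once from the training documents, every test document is mapped onto the same three columns, which is the property that batch filtering provides.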
The following code example shows how to standardize a training and a test set,
i.e., transform all numeric attributes to have zero mean and unit variance,
with the Standardize filter (package weka.filters.unsupervised.attribute):
Instances train = ... // from somewhere
Instances test = ... // from somewhere
Standardize filter = new Standardize();
// initializing the filter once with training set
filter.setInputFormat(train);
// configures the Filter based on train instances and returns
// filtered instances
Instances newTrain = Filter.useFilter(train, filter);
// create new test set
Instances newTest = Filter.useFilter(test, filter);
16.5.2 Filtering on-the-fly
Even though using the API gives one full control over the data and makes it eas-
ier to juggle several datasets at the same time, filtering data on-the-fly makes
life even easier. This handy feature is available through meta schemes in WEKA,
like FilteredClassifier (package weka.classifiers.meta), FilteredClusterer
(package weka.clusterers), FilteredAssociator (package weka.associations)
and FilteredAttributeEval/FilteredSubsetEval (in weka.attributeSelection).
Instead of filtering the data beforehand, one just sets up a meta-scheme and lets
the meta-scheme do the filtering for one.
The following example uses the FilteredClassifier in conjunction with
the Remove filter to remove the first attribute (which happens to be an ID
attribute) from the dataset and J48 (J48 is WEKA's implementation of C4.5;
package weka.classifiers.trees) as base-classifier. First the classifier is built
with a training set and then evaluated with a separate test set. The actual and
predicted class values are printed in the console. For more information on
classification, see section 16.6.
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.Remove;
...
Instances train = ... // from somewhere
Instances test = ... // from somewhere
// filter
Remove rm = new Remove();
rm.setAttributeIndices("1"); // remove 1st attribute
// classifier
J48 j48 = new J48();
j48.setUnpruned(true); // using an unpruned J48
// meta-classifier
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(rm);
fc.setClassifier(j48);
// train and output model
fc.buildClassifier(train);
System.out.println(fc);
for (int i = 0; i < test.numInstances(); i++) {
  double pred = fc.classifyInstance(test.instance(i));
  double actual = test.instance(i).classValue();
  System.out.print("ID: "
    + test.instance(i).value(0));
  System.out.print(", actual: "
    + test.classAttribute().value((int) actual));
  System.out.println(", predicted: "
    + test.classAttribute().value((int) pred));
}
16.6 Classification
Classification and regression algorithms in WEKA are called "classifiers" and are
located below the weka.classifiers package. This section covers the following
topics:
• Building a classifier - batch and incremental learning.
• Evaluating a classifier - various evaluation techniques and how to obtain
  the generated statistics.
• Classifying instances - obtaining classifications for unknown data.
The Weka Examples collection[3] contains example classes covering classification
in the wekaexamples.classifiers package.
16.6.1 Building a classifier
By design, all classifiers in WEKA are batch-trainable, i.e., they get trained on
the whole dataset at once. This is fine, if the training data fits into memory.
But there are also algorithms available that can update their internal model
on-the-go. These classifiers are called incremental. The following two sections
cover the batch and the incremental classifiers.
Batch classifiers
A batch classifier is really simple to build:
• set options - either using the setOptions(String[]) method or the ac-
  tual set-methods.
• train it - calling the buildClassifier(Instances) method with the
  training set. By definition, the buildClassifier(Instances) method
  resets the internal model completely, in order to ensure that subsequent
  calls of this method with the same data result in the same model ("re-
  peatable experiments").
The following code snippet builds an unpruned J48 on a dataset:
import weka.core.Instances;
import weka.classifiers.trees.J48;
...
Instances data = ... // from somewhere
String[] options = new String[1];
options[0] = "-U"; // unpruned tree
J48 tree = new J48(); // new instance of tree
tree.setOptions(options); // set the options
tree.buildClassifier(data); // build classifier
Incremental classifiers
All incremental classifiers in WEKA implement the interface UpdateableClassifier
(located in package weka.classifiers). Bringing up the Javadoc for this par-
ticular interface tells one what classifiers implement this interface. These classi-
fiers can be used to process large amounts of data with a small memory footprint,
as the training data does not have to fit in memory. ARFF files, for instance,
can be read incrementally (see section 16.2).
Training an incremental classifier happens in two stages:
1. initialize the model by calling the buildClassifier(Instances) method.
   One can either use a weka.core.Instances object with no actual data or
   one with an initial set of data.
2. update the model row-by-row, by calling the updateClassifier(Instance)
   method.
The following example shows how to load an ARFF file incrementally using the
ArffLoader class and train the NaiveBayesUpdateable classifier with one row
at a time:
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import java.io.File;
...
// load data
ArffLoader loader = new ArffLoader();
loader.setFile(new File("/some/where/data.arff"));
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);
// train NaiveBayes
NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
nb.buildClassifier(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null)
  nb.updateClassifier(current);
16.6.2 Evaluating a classifier
Building a classifier is only one part of the equation, evaluating how well it
performs is another important part. WEKA supports two types of evaluation:
• Cross-validation - If one only has a single dataset and wants to get a
  reasonably realistic evaluation. Setting the number of folds equal to the
  number of rows in the dataset will give one leave-one-out cross-validation
  (LOOCV).
• Dedicated test set - The test set is solely used to evaluate the built clas-
  sifier. It is important to have a test set that incorporates the same (or
  similar) concepts as the training set, otherwise one will always end up
  with poor performance.
The evaluation step, including collection of statistics, is performed by the Evaluation
class (package weka.classifiers).
Cross-validation
The crossValidateModel method of the Evaluation class is used to perform
cross-validation with an untrained classifier and a single dataset. Supplying an
untrained classifier ensures that no information leaks into the actual evaluation.
Even though it is an implementation requirement that the buildClassifier
method resets the classifier, it cannot be guaranteed that this is indeed the case
("leaky" implementation). Using an untrained classifier avoids unwanted side-
effects, as for each train/test set pair, a copy of the originally supplied classifier
is used.
Before cross-validation is performed, the data gets randomized using the
supplied random number generator (java.util.Random). It is recommended
that this number generator is seeded with a specified seed value. Otherwise,
subsequent runs of cross-validation on the same dataset will not yield the same
results, due to different randomization of the data (see section 16.4 for more
information on randomization).
The code snippet below performs 10-fold cross-validation with a J48 decision
tree algorithm on a dataset newData, with a random number generator that is
seeded with 1. The summary of the collected statistics is output to stdout.
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import java.util.Random;
...
Instances newData = ... // from somewhere
Evaluation eval = new Evaluation(newData);
J48 tree = new J48();
eval.crossValidateModel(tree, newData, 10, new Random(1));
System.out.println(eval.toSummaryString("\nResults\n\n", false));
The Evaluation object in this example is initialized with the dataset used in
the evaluation process. This is done in order to inform the evaluation about the
type of data that is being evaluated, ensuring that all internal data structures
are set up correctly.
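How the rows end up in folds can be sketched with plain Java. The class KFoldSketch below is a hypothetical, unstratified simplification; the way WEKA itself assigns rows to folds may differ, so the sketch only illustrates the roles of the number of folds and of the seeded generator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper; an unstratified simplification, not WEKA's code.
final class KFoldSketch {

  // Returns, for each fold, the row indices that form its test set.
  static List<List<Integer>> testFolds(int numRows, int numFolds, long seed) {
    if (numFolds < 2 || numFolds > numRows) {
      throw new IllegalArgumentException("need 2 <= numFolds <= numRows");
    }
    List<Integer> rows = new ArrayList<Integer>();
    for (int i = 0; i < numRows; i++) {
      rows.add(i);
    }
    Collections.shuffle(rows, new Random(seed)); // seeded, hence repeatable

    List<List<Integer>> folds = new ArrayList<List<Integer>>();
    for (int f = 0; f < numFolds; f++) {
      folds.add(new ArrayList<Integer>());
    }
    for (int i = 0; i < numRows; i++) {
      folds.get(i % numFolds).add(rows.get(i)); // deal rows out round-robin
    }
    return folds;
  }

  public static void main(String[] args) {
    System.out.println(testFolds(10, 5, 1L)); // five folds of two rows each
    System.out.println(testFolds(4, 4, 1L));  // leave-one-out: one row per fold
  }
}
```

For each fold, the rows listed serve as the test set and all remaining rows as the training set; with as many folds as rows, every test set consists of a single row.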
Train/test set
Using a dedicated test set to evaluate a classifier is just as easy as cross-
validation. But instead of providing an untrained classifier, a trained classifier
has to be provided now. Once again, the weka.classifiers.Evaluation class
is used to perform the evaluation, this time using the evaluateModel method.
The code snippet below trains a J48 with default options on a training set
and evaluates it on a test set before outputting the summary of the collected
statistics:
import weka.core.Instances;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
...
Instances train = ... // from somewhere
Instances test = ... // from somewhere
// train classifier
Classifier cls = new J48();
cls.buildClassifier(train);
// evaluate classifier and print some statistics
Evaluation eval = new Evaluation(train);
eval.evaluateModel(cls, test);
System.out.println(eval.toSummaryString("\nResults\n\n", false));
Statistics
In the previous sections, the toSummaryString method of the Evaluation class was
already used in the code examples. But there are other summary methods for
nominal class attributes available as well:
• toMatrixString - outputs the confusion matrix.
• toClassDetailsString - outputs TP/FP rates, precision, recall, F-measure,
  AUC (per class).
• toCumulativeMarginDistributionString - outputs the cumulative mar-
  gins distribution.
If one does not want to use these summary methods, it is possible to access
the individual statistical measures directly. Below, a few common measures are
listed:
• nominal class attribute
  - correct() - The number of correctly classified instances. The in-
    correctly classified ones are available through incorrect().
  - pctCorrect() - The percentage of correctly classified instances (ac-
    curacy). pctIncorrect() returns the percentage of misclassified ones.
  - areaUnderROC(int) - The AUC for the specified class label index
    (0-based index).
• numeric class attribute
  - correlationCoefficient() - The correlation coefficient.
• general
  - meanAbsoluteError() - The mean absolute error.
  - rootMeanSquaredError() - The root mean squared error.
  - numInstances() - The number of instances with a class value.
  - unclassified() - The number of unclassified instances.
  - pctUnclassified() - The percentage of unclassified instances.
For a complete overview, see the Javadoc page of the Evaluation class. By
looking up the source code of the summary methods mentioned above, one can
easily determine what methods are used for which particular output.
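To make the relationship between the confusion matrix and some of these measures concrete, the following sketch derives accuracy, precision and recall from raw counts. The class ConfusionMatrixSketch is hypothetical and its method names merely echo the measures listed above; the row = actual class, column = predicted class layout is an assumption of the sketch.

```java
// Hypothetical helper; the method names merely echo the measures above.
final class ConfusionMatrixSketch {

  // matrix[actual][predicted] holds instance counts.
  static double pctCorrect(int[][] matrix) {
    int correct = 0;
    int total = 0;
    for (int a = 0; a < matrix.length; a++) {
      for (int p = 0; p < matrix[a].length; p++) {
        total += matrix[a][p];
        if (a == p) {
          correct += matrix[a][p]; // the diagonal holds the correct ones
        }
      }
    }
    return 100.0 * correct / total;
  }

  // Of all instances predicted as cls, the fraction that really is cls.
  static double precision(int[][] matrix, int cls) {
    int predicted = 0;
    for (int a = 0; a < matrix.length; a++) {
      predicted += matrix[a][cls]; // column sum
    }
    return (double) matrix[cls][cls] / predicted;
  }

  // Of all instances that really are cls, the fraction predicted as cls.
  static double recall(int[][] matrix, int cls) {
    int actual = 0;
    for (int p = 0; p < matrix[cls].length; p++) {
      actual += matrix[cls][p]; // row sum
    }
    return (double) matrix[cls][cls] / actual;
  }

  public static void main(String[] args) {
    int[][] matrix = {{8, 2}, {1, 9}};
    System.out.println(pctCorrect(matrix));   // 17 of 20 instances correct
    System.out.println(precision(matrix, 0)); // 8 of 9 predictions of class 0
    System.out.println(recall(matrix, 0));    // 8 of 10 instances of class 0
  }
}
```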
16.6.3 Classifying instances
After a classifier setup has been evaluated and proven to be useful, a built classi-
fier can be used to make predictions and label previously unlabeled data. Section
16.5.2 already provided a glimpse of how to use a classifier's classifyInstance
method. This section elaborates a bit more on this.
The following example uses a trained classifier tree to label all the instances
in an unlabeled dataset that gets loaded from disk. After all the instances have
been labeled, the newly labeled dataset gets written back to disk to a new file.
// load unlabeled data and set class attribute
Instances unlabeled = DataSource.read("/some/where/unlabeled.arff");
unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
// create copy
Instances labeled = new Instances(unlabeled);
// label instances
for (int i = 0; i < unlabeled.numInstances(); i++) {
  double clsLabel = tree.classifyInstance(unlabeled.instance(i));
  labeled.instance(i).setClassValue(clsLabel);
}
// save newly labeled data
DataSink.write("/some/where/labeled.arff", labeled);
The above example works for classification and regression problems alike, as
long as the classifier can handle numeric classes, of course. Why is that? The
classifyInstance(Instance) method returns for numeric classes the regres-
sion value and for nominal classes the 0-based index in the list of available class
labels.
If one is interested in the class distribution instead, then one can use the
distributionForInstance(Instance) method (this array sums up to 1). Of
course, using this method only makes sense for classification problems. The
code snippet below outputs the class distribution, the actual and predicted
label side-by-side in the console:
// load data
Instances train = DataSource.read(args[0]);
train.setClassIndex(train.numAttributes() - 1);
Instances test = DataSource.read(args[1]);
test.setClassIndex(test.numAttributes() - 1);
// train classifier
J48 cls = new J48();
cls.buildClassifier(train);
// output predictions
System.out.println("# - actual - predicted - distribution");
for (int i = 0; i < test.numInstances(); i++) {
  double pred = cls.classifyInstance(test.instance(i));
  double[] dist = cls.distributionForInstance(test.instance(i));
  System.out.print((i+1) + " - ");
  System.out.print(test.instance(i).toString(test.classIndex()) + " - ");
  System.out.print(test.classAttribute().value((int) pred) + " - ");
  System.out.println(Utils.arrayToString(dist));
}
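The link between a class distribution and a predicted label can be sketched as follows. The sketch assumes that the prediction is simply the most probable class, which is a common convention but not something every scheme is guaranteed to follow; ArgMaxSketch and the label array are hypothetical.

```java
// Hypothetical helper, not part of WEKA: picks the most probable class.
final class ArgMaxSketch {

  // Index of the largest probability; the first one wins in case of ties.
  static int argMax(double[] distribution) {
    int best = 0;
    for (int i = 1; i < distribution.length; i++) {
      if (distribution[i] > distribution[best]) {
        best = i;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    String[] labels = {"no", "yes"};      // hypothetical class labels
    double[] distribution = {0.25, 0.75}; // sums up to 1
    System.out.println(labels[argMax(distribution)]); // yes
  }
}
```

The returned index plays the same role as the 0-based label index discussed above: it has to be looked up in the list of class labels to obtain the label itself.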
16.7 Clustering
Clustering is an unsupervised Machine Learning technique of finding patterns
in the data, i.e., these algorithms work without class attributes. Classifiers, on
the other hand, are supervised and need a class attribute. This section, similar
to the one about classifiers, covers the following topics:
• Building a clusterer - batch and incremental learning.
• Evaluating a clusterer - how to evaluate a built clusterer.
• Clustering instances - determining what clusters unknown instances be-
  long to.
Fully functional example classes are located in the wekaexamples.clusterers
package of the Weka Examples collection[3].
16.7.1 Building a clusterer
Clusterers, just like classifiers, are by design batch-trainable as well. They all
can be built on data that is completely stored in memory. But a small subset of
the cluster algorithms can also update the internal representation incrementally.
The following two sections cover both types of clusterers.
Batch clusterers
Building a batch clusterer, just like a classifier, happens in two stages:
• set options - either calling the setOptions(String[]) method or the
  appropriate set-methods of the properties.
• build the model with training data - calling the buildClusterer(Instances)
  method. By definition, subsequent calls of this method must result in
  the same model ("repeatable experiments"). In other words, calling this
  method must completely reset the model.
Below is an example of building the EM clusterer with a maximum of 100 itera-
tions. The options are set using the setOptions(String[]) method:
import weka.clusterers.EM;
import weka.core.Instances;
...
Instances data = ... // from somewhere
String[] options = new String[2];
options[0] = "-I"; // max. iterations
options[1] = "100";
EM clusterer = new EM(); // new instance of clusterer
clusterer.setOptions(options); // set the options
clusterer.buildClusterer(data); // build the clusterer
Incremental clusterers
Incremental clusterers in WEKA implement the interface UpdateableClusterer
(package weka.clusterers). Training an incremental clusterer happens in
three stages, similar to incremental classifiers:
1. initialize the model by calling the buildClusterer(Instances) method.
Once again, one can either use an empty weka.core.Instances object or
one with an initial set of data.
2. update the model row-by-row by calling the updateClusterer(Instance)
method.
3. finish the training by calling the updateFinished() method, in case the
cluster algorithm needs to perform computationally expensive post-processing
or clean-up operations.
An ArffLoader is used in the following example to build the Cobweb clusterer
incrementally:
import weka.clusterers.Cobweb;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import java.io.File;
...
// load data
ArffLoader loader = new ArffLoader();
loader.setFile(new File("/some/where/data.arff"));
Instances structure = loader.getStructure();
// train Cobweb
Cobweb cw = new Cobweb();
cw.buildClusterer(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null)
cw.updateClusterer(current);
cw.updateFinished();
16.7.2 Evaluating a clusterer
Evaluation of clusterers is not as comprehensive as the evaluation of classifiers.
Since clustering is unsupervised, it is also much harder to determine
how good a model is. The class used for evaluating cluster algorithms is
ClusterEvaluation (package weka.clusterers).
In order to generate the same output as the Explorer or the command-line,
one can use the evaluateClusterer method, as shown below:
import weka.clusterers.EM;
import weka.clusterers.ClusterEvaluation;
...
String[] options = new String[2];
options[0] = "-t";
options[1] = "/some/where/somefile.arff";
System.out.println(ClusterEvaluation.evaluateClusterer(new EM(), options));
Or, if the dataset is already present in memory, one can use the following ap-
proach:
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
...
Instances data = ... // from somewhere
EM cl = new EM();
cl.buildClusterer(data);
ClusterEvaluation eval = new ClusterEvaluation();
eval.setClusterer(cl);
eval.evaluateClusterer(new Instances(data));
System.out.println(eval.clusterResultsToString());
Density-based clusterers, i.e., algorithms that implement the interface named
DensityBasedClusterer (package weka.clusterers), can be cross-validated
and the log-likelihood obtained. Using the MakeDensityBasedClusterer meta-
clusterer, any non-density-based clusterer can be turned into a density-based
one. Here is an example of cross-validating a density-based clusterer and
obtaining the log-likelihood:
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.DensityBasedClusterer;
import weka.core.Instances;
import java.util.Random;
...
Instances data = ... // from somewhere
DensityBasedClusterer clusterer = new ... // the clusterer to evaluate
double logLikelihood =
ClusterEvaluation.crossValidateModel( // cross-validate
clusterer, data, 10, // with 10 folds
new Random(1)); // and random number generator
// with seed 1
Classes to clusters
Datasets for supervised algorithms, like classifiers, can be used to evaluate a
clusterer as well. This evaluation is called classes-to-clusters, as the clusters are
mapped back onto the classes.
This type of evaluation is performed as follows:
1. create a copy of the dataset containing the class attribute and remove the
class attribute, using the Remove filter (this filter is located in package
weka.filters.unsupervised.attribute).
2. build the clusterer with this new data.
3. evaluate the clusterer now with the original data.
And here are the steps translated into code, using EM as the clusterer being
evaluated:
1. create a copy of data without class attribute
Instances data = ... // from somewhere
Remove filter = new Remove();
filter.setAttributeIndices("" + (data.classIndex() + 1));
filter.setInputFormat(data);
Instances dataClusterer = Filter.useFilter(data, filter);
2. build the clusterer
EM clusterer = new EM();
// set further options for EM, if necessary...
clusterer.buildClusterer(dataClusterer);
3. evaluate the clusterer
ClusterEvaluation eval = new ClusterEvaluation();
eval.setClusterer(clusterer);
eval.evaluateClusterer(data);
// print results
System.out.println(eval.clusterResultsToString());
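To make the idea of mapping clusters back onto classes more tangible, here is a plain-Java sketch that assigns each cluster the class label occurring most often in it and counts the instances that disagree with that label. This majority vote is a simplification chosen for illustration; ClusterEvaluation derives its own assignment, which the sketch does not attempt to reproduce. The class name ClassesToClustersSketch and all data are made up:

```java
import java.util.Arrays;

// Illustration only: a simplified majority-vote mapping, not WEKA's own procedure.
final class ClassesToClustersSketch {

  /** counts[c][k] = number of instances in cluster c that carry class label k. */
  static int[][] contingency(int[] clusters, int[] classes,
                             int numClusters, int numClasses) {
    int[][] counts = new int[numClusters][numClasses];
    for (int i = 0; i < clusters.length; i++) {
      counts[clusters[i]][classes[i]]++;
    }
    return counts;
  }

  /** Maps every cluster to the class label that occurs most often in it. */
  static int[] majorityClass(int[][] counts) {
    int[] mapping = new int[counts.length];
    for (int c = 0; c < counts.length; c++) {
      for (int k = 1; k < counts[c].length; k++) {
        if (counts[c][k] > counts[c][mapping[c]]) {
          mapping[c] = k;
        }
      }
    }
    return mapping;
  }

  /** Counts instances whose class differs from the class mapped to their cluster. */
  static int errors(int[] clusters, int[] classes, int[] mapping) {
    int wrong = 0;
    for (int i = 0; i < clusters.length; i++) {
      if (mapping[clusters[i]] != classes[i]) {
        wrong++;
      }
    }
    return wrong;
  }

  public static void main(String[] args) {
    int[] clusters = {0, 0, 0, 1, 1, 1}; // made-up cluster assignments
    int[] classes  = {1, 1, 0, 0, 0, 0}; // made-up class label indices
    int[] mapping = majorityClass(contingency(clusters, classes, 2, 2));
    System.out.println("mapping=" + Arrays.toString(mapping)
        + " errors=" + errors(clusters, classes, mapping));
  }
}
```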
16.7.3 Clustering instances
Clustering of instances is very similar to classifying unknown instances when
using classifiers. The following methods are involved:
• clusterInstance(Instance) - determines the cluster the Instance would
belong to.
• distributionForInstance(Instance) - predicts the cluster membership
for this Instance. The values in the returned array sum to 1.
The code fragment outlined below trains an EM clusterer on one dataset and
outputs for a second dataset the predicted clusters and cluster memberships of
the individual instances:
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.Utils;
...
Instances dataset1 = ... // from somewhere
Instances dataset2 = ... // from somewhere
// build clusterer
EM clusterer = new EM();
clusterer.buildClusterer(dataset1);
// output predictions
System.out.println("# - cluster - distribution");
for (int i = 0; i < dataset2.numInstances(); i++) {
int cluster = clusterer.clusterInstance(dataset2.instance(i));
double[] dist = clusterer.distributionForInstance(dataset2.instance(i));
System.out.print((i+1));
System.out.print(" - ");
System.out.print(cluster);
System.out.print(" - ");
System.out.print(Utils.arrayToString(dist));
System.out.println();
}
16.8 Selecting attributes
Preparing one's data properly is a very important step for getting the best
results. Reducing the number of attributes can not only help speed up the runtime
of algorithms (the runtime of some algorithms is quadratic in the number
of attributes), but also help avoid "burying" the algorithm in a mass of
attributes, when only a few are essential for building a good model.
There are three different types of evaluators in WEKA at the moment:
• single attribute evaluators - perform evaluations on single attributes. These
classes implement the weka.attributeSelection.AttributeEvaluator
interface. The Ranker search algorithm is usually used in conjunction with
these algorithms.
• attribute subset evaluators - work on subsets of all the attributes in the
dataset. The weka.attributeSelection.SubsetEvaluator interface is
implemented by these evaluators.
• attribute set evaluators - evaluate sets of attributes. Not to be confused
with the subset evaluators, as these classes are derived from the
weka.attributeSelection.AttributeSetEvaluator superclass.
Most of the attribute selection schemes currently implemented are supervised,
i.e., they require a dataset with a class attribute. Unsupervised evaluation
algorithms are derived from one of the following superclasses:
• weka.attributeSelection.UnsupervisedAttributeEvaluator
  e.g., LatentSemanticAnalysis, PrincipalComponents
• weka.attributeSelection.UnsupervisedSubsetEvaluator
  none at the moment
Attribute selection offers filtering on-the-fly, like classifiers and clusterers, as
well:
• weka.attributeSelection.FilteredAttributeEval - filter for evaluators
that evaluate attributes individually.
• weka.attributeSelection.FilteredSubsetEval - for filtering evaluators
that evaluate subsets of attributes.
So much about the differences among the various attribute selection algorithms
and back to how to actually perform attribute selection. WEKA offers three
different approaches:
• Using a meta-classifier - for performing attribute selection on-the-fly
(similar to FilteredClassifier's filtering on-the-fly).
• Using a filter - for preprocessing the data.
• Low-level API usage - instead of using the meta-schemes (classifier or
filter), one can use the attribute selection API directly as well.
The following sections cover each of these approaches, accompanied by a code
example. For clarity, the same evaluator and search algorithm are used in all of
these examples.
Feel free to check out the example classes of the Weka Examples collection[3],
located in the wekaexamples.attributeSelection package.
16.8.1 Using the meta-classifier
The meta-classifier AttributeSelectedClassifier (this classifier is located in
package weka.classifiers.meta), is similar to the FilteredClassifier. But
instead of taking a base-classifier and a filter as parameters to perform the
filtering, the AttributeSelectedClassifier uses an evaluator (derived
from weka.attributeSelection.ASEvaluation) and a search algorithm (superclass is
weka.attributeSelection.ASSearch) to perform the attribute selection, and
a base-classifier to train on the reduced data.
This example here uses J48 as base-classifier, CfsSubsetEval as evaluator
and a backwards operating GreedyStepwise as search method:
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import java.util.Random;
...
Instances data = ... // from somewhere
// setup meta-classifier
AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
J48 base = new J48();
classifier.setClassifier(base);
classifier.setEvaluator(eval);
classifier.setSearch(search);
// cross-validate classifier
Evaluation evaluation = new Evaluation(data);
evaluation.crossValidateModel(classifier, data, 10, new Random(1));
System.out.println(evaluation.toSummaryString());
16.8.2 Using the filter
In case the data only needs to be reduced in dimensionality, but not used for
training a classifier, then the filter approach is the right one. The AttributeSelection
filter (package weka.filters.supervised.attribute) takes an evaluator and
a search algorithm as parameters.
The code snippet below uses once again CfsSubsetEval as evaluator and a
backwards operating GreedyStepwise as search algorithm. It just outputs the
reduced data to stdout after the filtering step:
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
...
Instances data = ... // from somewhere
// setup filter
AttributeSelection filter = new AttributeSelection();
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
filter.setEvaluator(eval);
filter.setSearch(search);
filter.setInputFormat(data);
// filter data
Instances newData = Filter.useFilter(data, filter);
System.out.println(newData);
16.8.3 Using the API directly
Using the meta-classifier or the filter approach makes attribute selection fairly
easy. But it might not satisfy everybody's needs, for instance, if one wants to
obtain the ordering of the attributes (using Ranker) or retrieve the indices of
the selected attributes instead of the reduced data.
Just like the other examples, the one shown here uses the CfsSubsetEval
evaluator and the GreedyStepwise search algorithm (in backwards mode). But
instead of outputting the reduced data, only the selected indices are printed in
the console:
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.Utils;
...
Instances data = ... // from somewhere
// setup attribute selection
AttributeSelection attsel = new AttributeSelection();
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
attsel.setEvaluator(eval);
attsel.setSearch(search);
// perform attribute selection
attsel.SelectAttributes(data);
int[] indices = attsel.selectedAttributes();
System.out.println(
"selected attribute indices (starting with 0):\n"
+ Utils.arrayToString(indices));
16.9 Saving data
Saving weka.core.Instances objects is as easy as reading the data in the first
place, though the process of storing the data again is far less common than that of
reading the data into memory. The following two sections cover how to save the
data in files and in databases.
Just like with loading the data in chapter 16.2, example classes for saving
data can be found in the wekaexamples.core.converters package of the Weka
Examples collection[3].
16.9.1 Saving data to files
Once again, one can either let WEKA choose the appropriate converter for saving
the data or use an explicit converter (all savers are located in the weka.core.converters
package). The latter approach is necessary if the file name under which the data
will be stored does not have an extension that WEKA recognizes.
Use the DataSink class (inner class of weka.core.converters.ConverterUtils),
if the extensions are not a problem. Here are a few examples:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
...
// data structure to save
Instances data = ...
// save as ARFF
DataSink.write("/some/where/data.arff", data);
// save as CSV
DataSink.write("/some/where/data.csv", data);
And here is an example of using the CSVSaver converter explicitly:
import weka.core.Instances;
import weka.core.converters.CSVSaver;
import java.io.File;
...
// data structure to save
Instances data = ...
// save as CSV
CSVSaver saver = new CSVSaver();
saver.setInstances(data);
saver.setFile(new File("/some/where/data.csv"));
saver.writeBatch();
16.9.2 Saving data to databases
Apart from the KnowledgeFlow, saving to databases is not very obvious in
WEKA, unless one knows about the DatabaseSaver converter. Just like the
DatabaseLoader, the saver counterpart can store the data either in batch mode
or incrementally as well.
The first example shows how to save the data in batch mode, which is the
easier way of doing it:
import weka.core.Instances;
import weka.core.converters.DatabaseSaver;
...
// data structure to save
Instances data = ...
// store data in database
DatabaseSaver saver = new DatabaseSaver();
saver.setDestination("jdbc_url", "the_user", "the_password");
// we explicitly specify the table name here:
saver.setTableName("whatsoever2");
saver.setRelationForTableName(false);
// or we could just update the name of the dataset:
// saver.setRelationForTableName(true);
// data.setRelationName("whatsoever2");
saver.setInstances(data);
saver.writeBatch();
Saving the data incrementally requires a bit more work, as one has to specify
that writing the data is done incrementally (using the setRetrieval method),
as well as notify the saver when all the data has been saved:
import weka.core.Instances;
import weka.core.converters.DatabaseSaver;
...
// data structure to save
Instances data = ...
// store data in database
DatabaseSaver saver = new DatabaseSaver();
saver.setDestination("jdbc_url", "the_user", "the_password");
// we explicitly specify the table name here:
saver.setTableName("whatsoever2");
saver.setRelationForTableName(false);
// or we could just update the name of the dataset:
// saver.setRelationForTableName(true);
// data.setRelationName("whatsoever2");
saver.setRetrieval(DatabaseSaver.INCREMENTAL);
saver.setStructure(data);
for (int i = 0; i < data.numInstances(); i++) {
saver.writeIncremental(data.instance(i));
}
// notify saver that we're finished
saver.writeIncremental(null);
16.10 Visualization
The concepts covered in this chapter are also available through the example
classes of the Weka Examples collection[3]. See the following packages:
wekaexamples.gui.graphvisualizer
wekaexamples.gui.treevisualizer
wekaexamples.gui.visualize
16.10.1 ROC curves
WEKA can generate "Receiver operating characteristic" (ROC) curves, based
on the collected predictions during an evaluation of a classifier. In order to
display a ROC curve, one needs to perform the following steps:
1. Generate the plotable data based on the Evaluation's collected predictions,
using the ThresholdCurve class (package weka.classifiers.evaluation).
2. Put the plotable data into a plot container, an instance of the PlotData2D
class (package weka.gui.visualize).
3. Add the plot container to a visualization panel for displaying the data, an
instance of the ThresholdVisualizePanel class (package weka.gui.visualize).
4. Add the visualization panel to a JFrame (package javax.swing) and display
it.
And now, the four steps translated into actual code:
1. Generate the plotable data
Evaluation eval = ... // from somewhere
ThresholdCurve tc = new ThresholdCurve();
int classIndex = 0; // ROC for the 1st class label
Instances curve = tc.getCurve(eval.predictions(), classIndex);
2. Put the plotable into a plot container
PlotData2D plotdata = new PlotData2D(curve);
plotdata.setPlotName(curve.relationName());
plotdata.addInstanceNumberAttribute();
3. Add the plot container to a visualization panel
ThresholdVisualizePanel tvp = new ThresholdVisualizePanel();
tvp.setROCString("(Area under ROC = " +
Utils.doubleToString(ThresholdCurve.getROCArea(curve),4)+")");
tvp.setName(curve.relationName());
tvp.addPlot(plotdata);
4. Add the visualization panel to a JFrame
final JFrame jf = new JFrame("WEKA ROC: " + tvp.getName());
jf.setSize(500,400);
jf.getContentPane().setLayout(new BorderLayout());
jf.getContentPane().add(tvp, BorderLayout.CENTER);
jf.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
jf.setVisible(true);
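The number shown in the panel title is the area under the ROC curve. As a rough illustration of what that number measures, the following plain-Java sketch sweeps a threshold over a handful of made-up scores and sums the trapezoids under the resulting curve. It is not the code behind ThresholdCurve; the class name RocSketch and its method are hypothetical, and the sketch assumes that at least one positive and one negative example are present:

```java
import java.util.Arrays;

// Illustration only: RocSketch is a made-up class, not the code behind ThresholdCurve.
final class RocSketch {

  /**
   * Area under the ROC curve for the given positive-class scores.
   * Assumes at least one positive and one negative example.
   */
  static double auc(final double[] scores, boolean[] positive) {
    Integer[] order = new Integer[scores.length];
    for (int n = 0; n < order.length; n++) {
      order[n] = n;
    }
    // visit the predictions from the highest to the lowest score
    Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

    int totalPos = 0;
    for (boolean p : positive) {
      if (p) {
        totalPos++;
      }
    }
    int totalNeg = positive.length - totalPos;

    double area = 0.0;
    int tp = 0;
    int fp = 0;
    int i = 0;
    while (i < order.length) {
      int prevTp = tp;
      int prevFp = fp;
      double threshold = scores[order[i]];
      // predictions with identical scores cross the threshold together
      while (i < order.length && scores[order[i]] == threshold) {
        if (positive[order[i]]) {
          tp++;
        } else {
          fp++;
        }
        i++;
      }
      // trapezoid between the previous and the current ROC point
      double width = (fp - prevFp) / (double) totalNeg;
      double meanHeight = (tp + prevTp) / (2.0 * totalPos);
      area += width * meanHeight;
    }
    return area;
  }

  public static void main(String[] args) {
    double[] scores = {0.9, 0.8, 0.7, 0.6};          // made-up scores
    boolean[] positive = {true, false, true, false}; // made-up actual labels
    System.out.println("Area under ROC = " + auc(scores, positive));
  }
}
```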
16.10.2 Graphs
Classes implementing the weka.core.Drawable interface can generate graphs
of their internal models which can be displayed. There are two different types of
graphs available at the moment, which are explained in the subsequent sections:
• Tree - decision trees.
• BayesNet - bayesian net graph structures.
16.10.2.1 Tree
It is quite easy to display the internal tree structure of classifiers like J48
or M5P (package weka.classifiers.trees). The following example builds
a J48 classifier on a dataset and displays the generated tree visually using
the TreeVisualizer class (package weka.gui.treevisualizer). This visualization
class can be used to view trees (or digraphs) in GraphViz's DOT
language[26].
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.gui.treevisualizer.PlaceNode2;
import weka.gui.treevisualizer.TreeVisualizer;
import java.awt.BorderLayout;
import javax.swing.JFrame;
...
Instances data = ... // from somewhere
// train classifier
J48 cls = new J48();
cls.buildClassifier(data);
// display tree
TreeVisualizer tv = new TreeVisualizer(
null, cls.graph(), new PlaceNode2());
JFrame jf = new JFrame("Weka Classifier Tree Visualizer: J48");
jf.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
jf.setSize(800, 600);
jf.getContentPane().setLayout(new BorderLayout());
jf.getContentPane().add(tv, BorderLayout.CENTER);
jf.setVisible(true);
// adjust tree
tv.fitToScreen();
16.10.2.2 BayesNet
The graphs that the BayesNet classifier (package weka.classifiers.bayes)
generates can be displayed using the GraphVisualizer class (located in package
weka.gui.graphvisualizer). The GraphVisualizer can display graphs that
are either in GraphViz's DOT language[26] or in XML BIF[20] format. For
displaying DOT format, one needs to use the method readDOT, and for the BIF
format the method readBIF.
The following code snippet trains a BayesNet classifier on some data and
then displays the graph generated from this data in a frame:
import weka.classifiers.bayes.BayesNet;
import weka.core.Instances;
import weka.gui.graphvisualizer.GraphVisualizer;
import java.awt.BorderLayout;
import javax.swing.JFrame;
...
Instances data = ... // from somewhere
// train classifier
BayesNet cls = new BayesNet();
cls.buildClassifier(data);
// display graph
GraphVisualizer gv = new GraphVisualizer();
gv.readBIF(cls.graph());
JFrame jf = new JFrame("BayesNet graph");
jf.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
jf.setSize(800, 600);
jf.getContentPane().setLayout(new BorderLayout());
jf.getContentPane().add(gv, BorderLayout.CENTER);
jf.setVisible(true);
// layout graph
gv.layoutGraph();
16.11 Serialization
Serialization² is the process of saving an object in a persistent form, e.g., on
the harddisk as a bytestream. Deserialization is the process in the opposite
direction, creating an object from a persistently saved data structure. In Java,
an object can be serialized if its class implements the java.io.Serializable interface.
Members of an object that are not supposed to be serialized need to be declared
with the keyword transient.
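The effect of the transient keyword can be seen in a small round trip through Java's standard object streams. The sketch below uses only java.io classes and a made-up class named TinyModel (it is not a WEKA scheme), and it writes to a byte array instead of a file to stay self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustration only: TinyModel is a made-up class, not a WEKA scheme.
final class TinyModel implements Serializable {
  private static final long serialVersionUID = 1L;

  double threshold = 0.75;          // saved with the object
  transient String scratch = "tmp"; // skipped during serialization

  /** Serializes the model into a byte array and reads it back again. */
  static TinyModel roundTrip(TinyModel model) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
        oos.writeObject(model);
      }
      try (ObjectInputStream ois = new ObjectInputStream(
          new ByteArrayInputStream(bos.toByteArray()))) {
        return (TinyModel) ois.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("round trip failed", e);
    }
  }

  public static void main(String[] args) {
    TinyModel copy = roundTrip(new TinyModel());
    // the transient member is not restored and comes back as null
    System.out.println(copy.threshold + " " + copy.scratch);
  }
}
```

Since transient members are not restored, a scheme that relies on such a member has to be able to recreate it after deserialization.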
In the following are some Java code snippets for serializing and deserializing a
J48 classifier. Of course, serialization is not limited to classifiers. Most schemes
in WEKA, like clusterers and filters, are also serializable.
Serializing a classifier
The weka.core.SerializationHelper class makes it easy to serialize an ob-
ject. For saving, one can use one of the write methods:
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.SerializationHelper;
...
// load data
Instances inst = DataSource.read("/some/where/data.arff");
inst.setClassIndex(inst.numAttributes() - 1);
// train J48
Classifier cls = new J48();
cls.buildClassifier(inst);
// serialize model
SerializationHelper.write("/some/where/j48.model", cls);
Deserializing a classifier
Deserializing an object can be achieved by using one of the read methods:
import weka.classifiers.Classifier;
import weka.core.SerializationHelper;
...
// deserialize model
Classifier cls = (Classifier) SerializationHelper.read(
"/some/where/j48.model");
² http://en.wikipedia.org/wiki/Serialization
Deserializing a classifier saved from the Explorer
The Explorer does not only save the built classifier in the model file, but also the
header information of the dataset the classifier was built with. By storing the
dataset information as well, one can easily check whether a serialized classifier
can be applied on the current dataset. The readAll method returns an array
with all objects that are contained in the model file.
import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
...
// the current data to use with classifier
Instances current = ... // from somewhere
// deserialize model
Object o[] = SerializationHelper.readAll("/some/where/j48.model");
Classifier cls = (Classifier) o[0];
Instances data = (Instances) o[1];
// is the data compatible?
if (!data.equalHeaders(current))
throw new Exception("Incompatible data!");
Serializing a classifier for the Explorer
If one wants to serialize the dataset header information alongside the classifier,
just like the Explorer does, then one can use one of the writeAll methods:
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.SerializationHelper;
...
// load data
Instances inst = DataSource.read("/some/where/data.arff");
inst.setClassIndex(inst.numAttributes() - 1);
// train J48
Classifier cls = new J48();
cls.buildClassifier(inst);
// serialize classifier and header information
Instances header = new Instances(inst, 0);
SerializationHelper.writeAll(
"/some/where/j48.model", new Object[]{cls, header});
Chapter 17
Extending WEKA
For most users, the existing WEKA framework will be sufficient to perform
the task at hand, offering a wide range of filters, classifiers, clusterers, etc.
Researchers, on the other hand, might want to add new algorithms and compare
them against existing ones. The framework with its existing algorithms is not
set in stone, but basically one big plugin framework. With WEKA's automatic
discovery of classes on the classpath, adding new classifiers, filters, etc. to the
existing framework is very easy.
Though algorithms like clusterers, associators, data generators and attribute
selection are not covered in this chapter, their implementation is very similar to
that of a classifier. You basically choose a superclass to derive
your new algorithm from and then implement additional interfaces, if necessary.
Just check out the other algorithms that are already implemented.
The section covering the GenericObjectEditor (see chapter 18.4) shows you
how to tell WEKA where to find your class(es), thereby making it/them
available in the GUI (Explorer/Experimenter) via the GenericObjectEditor.
17.1 Writing a new Classifier
17.1.1 Choosing the base class
The ancestor of all classifiers in WEKA is weka.classifiers.Classifier, an
abstract class. Your new classifier must be derived from this class at least to
be visible through the GenericObjectEditor. But in order to make implementations
of new classifiers even easier, WEKA comes already with a range of
other abstract classes derived from weka.classifiers.Classifier. In the
following you will find an overview that will help you decide what base class to
use for your classifier. For better readability, the weka.classifiers prefix was
dropped from the class names:
• simple classifier
    • Classifier - not randomizable
    • RandomizableClassifier - randomizable
• meta classifier
    • single base classifier
        • SingleClassifierEnhancer - not randomizable, not iterated
        • RandomizableSingleClassifierEnhancer - randomizable, not iterated
        • IteratedSingleClassifierEnhancer - not randomizable, iterated
        • RandomizableIteratedSingleClassifierEnhancer - randomizable, iterated
    • multiple base classifiers
        • MultipleClassifiersCombiner - not randomizable
        • RandomizableMultipleClassifiersCombiner - randomizable
If you are still unsure about what superclass to choose, then check out the
Javadoc of those superclasses. In the Javadoc you will find all the classifiers
that are derived from it, which should give you a better idea whether this
particular superclass is suited for your needs.
17.1.2 Additional interfaces
The abstract classes listed above basically just implement various combinations
of the following two interfaces:
• weka.core.Randomizable - to allow (seeded) randomization taking place
• weka.classifiers.IterativeClassifier - to make the classifier an
iterated one
But these interfaces are not the only ones that can be implemented by a classifier.
Here is a list of further interfaces:
• weka.core.AdditionalMeasureProducer - the classifier returns additional
information, e.g., J48 returns the tree size with this method.
• weka.core.WeightedInstancesHandler - denotes that the classifier can
make use of weighted Instance objects (the default weight of an Instance
is 1.0).
• weka.core.TechnicalInformationHandler - for returning paper references
and publications this classifier is based on.
• weka.classifiers.Sourcable - classifiers implementing this interface
can return Java code of a built model, which can be used elsewhere.
• weka.classifiers.UpdateableClassifier - for classifiers that can be
trained incrementally, i.e., row by row like NaiveBayesUpdateable.
17.1.3 Packages
A few comments about the different sub-packages in the weka.classifiers
package:
• bayes - contains bayesian classifiers, e.g., NaiveBayes
• evaluation - classes related to evaluation, e.g., confusion matrix, threshold
curve (= ROC)
• functions - e.g., Support Vector Machines, regression algorithms, neural
nets
• lazy - "learning" is performed at prediction time, e.g., k-nearest neighbor
(k-NN)
• meta - meta-classifiers that use one or more base classifiers as input,
e.g., boosting, bagging or stacking
• mi - classifiers that handle multi-instance data
• misc - various classifiers that don't fit in any other category
• rules - rule-based classifiers, e.g., ZeroR
• trees - tree classifiers, like decision trees with J48 a very common one
17.1.4 Implementation
In the following you will find information on what methods need to be implemented
and other coding guidelines for methods, option handling and documentation
of the source code.
17.1.4.1 Methods
This section explains what methods need to be implemented in general and
more specialized ones in case of meta-classifiers (either with single or multiple
base-classifiers).
General
Here is an overview of methods that your new classifier needs to implement in
order to integrate nicely into the WEKA framework:
globalInfo()
returns a short description that is displayed in the GUI, like the Explorer or
Experimenter. How long this description will be is really up to you, but it
should be sufficient to understand the classifier's underlying algorithm. If the
classifier implements the weka.core.TechnicalInformationHandler interface
then you could refer to the publication(s) by extending the returned string by
getTechnicalInformation().toString().
listOptions()
returns a java.util.Enumeration of weka.core.Option objects. This enu-
meration is used to display the help on the command-line, hence it needs to
return the Option objects of the superclass as well.
setOptions(String[])
parses the options that the classifier would receive from a command-line invocation.
A parameter and argument are always two elements in the string array. A
common mistake is to use a single cell in the string array for both of them, e.g.,
"-S 1" instead of "-S","1". You can use the methods getOption and getFlag
of the weka.core.Utils class to retrieve the values of an option or to ascertain
whether a flag is present. But note that these calls remove the option and, if
applicable, the argument from the string array ("destructive"). The last call in
the setOptions method should always be the super.setOptions(String[])
one, in order to pass on any other arguments still present in the array to the
superclass. The following code snippet just parses the only option "alpha" that
an imaginary classifier defines:
import weka.core.Utils;
...
public void setOptions(String[] options) throws Exception {
String tmpStr = Utils.getOption("alpha", options);
if (tmpStr.length() == 0) {
setAlpha(0.75);
}
else {
setAlpha(Double.parseDouble(tmpStr));
}
super.setOptions(options);
}
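The difference between "-S","1" and "-S 1" becomes obvious once the option lookup is written out. The plain-Java sketch below imitates the destructive behaviour described above with a hypothetical helper named takeOption; it is not the code of weka.core.Utils, and blanking the consumed cells with empty strings is an assumption made for this illustration:

```java
// Illustration only: OptionParsingSketch imitates the behaviour described above;
// it is not the code of weka.core.Utils.
final class OptionParsingSketch {

  /**
   * Looks for "-name" in the array and returns the element that follows it.
   * Both cells are blanked afterwards, so the lookup is destructive.
   * Returns an empty string if the option is absent.
   */
  static String takeOption(String name, String[] options) {
    for (int i = 0; i < options.length - 1; i++) {
      if (options[i].equals("-" + name)) {
        String value = options[i + 1];
        options[i] = "";     // assumption: consumed cells are blanked
        options[i + 1] = "";
        return value;
      }
    }
    return "";
  }

  public static void main(String[] args) {
    // flag and value in two cells: the option is found
    String[] good = {"-alpha", "0.5", "-S", "1"};
    System.out.println("alpha=" + takeOption("alpha", good));
    // the cells that are left would be passed on to the superclass
    System.out.println("left=" + String.join(",", good));

    // flag and value in one cell: the common mistake, nothing is found
    String[] bad = {"-S 1"};
    System.out.println("S=[" + takeOption("S", bad) + "]");
  }
}
```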
getOptions()
returns a string array of command-line options that resemble the current classifier
setup. Supplying this array to the setOptions(String[]) method must
result in the same configuration. This method will get called in the GUI when
copying a classifier setup to the clipboard. Since handling of arrays is a bit
cumbersome in Java (due to fixed length), using an instance of java.util.Vector
is a lot easier for creating the array that needs to be returned. The following
code snippet just adds the only option "alpha" that the classifier defines to the
array that is being returned, including the options of the superclass:
import java.util.Arrays;
import java.util.Vector;
...
public String[] getOptions() {
  Vector<String> result = new Vector<String>();
  result.add("-alpha");
  result.add("" + getAlpha());
  result.addAll(Arrays.asList(super.getOptions())); // superclass
  return result.toArray(new String[result.size()]);
}
Note that the getOptions() method requires you to add the leading dash for an
option, as opposed to the getOption/getFlag calls in the setOptions method,
which take the option name without the dash.
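The requirement that getOptions() and setOptions(String[]) mirror each other can be checked with a round trip. The following is a minimal sketch under stated assumptions: AlphaOptionsSketch is a hypothetical JDK-only class with a single alpha option and a default of 0.75, and it does not extend any WEKA class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone class (no WEKA types) illustrating the round-trip
// contract: feeding getOptions() back into setOptions() restores the setup.
class AlphaOptionsSketch {
  private double alpha = 0.75;

  double getAlpha() { return alpha; }

  void setAlpha(double value) { alpha = value; }

  /** Builds the option array; note the leading dash on the option name. */
  String[] getOptions() {
    List<String> result = new ArrayList<>();
    result.add("-alpha");
    result.add(String.valueOf(getAlpha()));
    return result.toArray(new String[0]);
  }

  /** Reads "-alpha <value>"; falls back to the default if it is absent. */
  void setOptions(String[] options) {
    setAlpha(0.75);
    for (int i = 0; i < options.length - 1; i++) {
      if (options[i].equals("-alpha")) {
        setAlpha(Double.parseDouble(options[i + 1]));
      }
    }
  }
}
```

Configuring one object, passing its getOptions() result to the setOptions method of a second object, and comparing the two getOptions() arrays is a quick manual check of the contract.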
getCapabilities()
returns meta-information on what type of data the classifier can handle, with
regard to attributes and class attributes. See section "Capabilities" on page
242 for more information.
buildClassifier(Instances)
builds the model from scratch with the provided dataset. Each subsequent call of
this method must result in the same model being built. The buildClassifier
method also tests whether the supplied data can be handled at all by the
classifier, utilizing the capabilities returned by the getCapabilities() method:
public void buildClassifier(Instances data) throws Exception {
  // test data against capabilities
  getCapabilities().testWithFail(data);
  // remove instances with missing class value,
  // but don't modify original data
  data = new Instances(data);
  data.deleteWithMissingClass();
  // actual model generation
  ...
}
toString()
is used for outputting the built model. This is not required, but it is useful
for the user to see properties of the model. Decision trees normally output the
tree, support vector machines the support vectors and rule-based classifiers the
generated rules.
distributionForInstance(Instance)
returns the class probabilities array of the prediction for the given
weka.core.Instance object. If your classifier handles nominal class attributes,
then you need to override this method.
classifyInstance(Instance)
returns the classification or regression value for the given weka.core.Instance
object. In case of a nominal class attribute, this method returns the index of
the class label that got predicted. You do not need to override this method in
this case, as the weka.classifiers.Classifier superclass already determines the
class label index based on the probabilities array that the
distributionForInstance(Instance) method returns (it returns the index in the
array with the highest probability; in case of ties, the first one). For numeric
class attributes, you need to override this method, as it has to return the
regression value predicted by the model.
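The rule "highest probability, first index on ties" can be written down in a few lines. This is an illustrative JDK-only helper, not the WEKA source; the class name ArgMaxSketch and the method firstMaxIndex are made up for this sketch, and the array is assumed to be non-empty.

```java
// Illustrative helper, not the WEKA source: picks the index of the largest
// probability, keeping the first index when several entries tie.
class ArgMaxSketch {

  static int firstMaxIndex(double[] distribution) {
    int best = 0;
    for (int i = 1; i < distribution.length; i++) {
      // strict '>' means a later, equal value never replaces an earlier one
      if (distribution[i] > distribution[best]) {
        best = i;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    System.out.println(firstMaxIndex(new double[] {0.2, 0.5, 0.3}));  // prints 1
    System.out.println(firstMaxIndex(new double[] {0.4, 0.4, 0.2}));  // prints 0
  }
}
```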
main(String[])
executes the classifier from the command-line. If your new algorithm is called
FunkyClassifier, then use the following code as your main method:
/**
 * Main method for executing this classifier.
 *
 * @param args the options, use "-h" to display options
 */
public static void main(String[] args) {
  runClassifier(new FunkyClassifier(), args);
}
Note: the static method runClassifier (defined in the abstract superclass
weka.classifiers.Classifier) handles all the appropriate calls and catches
and processes any exceptions as well.
Meta-classifiers
Meta-classifiers define a range of other methods that you might want to
override. Normally, this should not be necessary. But if your classifier
requires the base-classifier(s) to be of a certain type, you can override the
specific set-method and add additional checks.
SingleClassifierEnhancer
The following methods are used for handling the single base-classifier of this
meta-classifier.
defaultClassifierString()
returns the class name of the classifier that is used as the default one for
this meta-classifier.
setClassifier(Classifier)
sets the classifier object. Override this method if you require further checks,
e.g., that the classifier needs to be of a certain class. This is necessary if
you still want to allow the user to parametrize the base-classifier, but not
choose another classifier with the GenericObjectEditor. Be aware that this
method does not create a copy of the provided classifier.
getClassifier()
returns the currently set classifier object. Note that this method returns the
internal object and not a copy.
MultipleClassifiersCombiner
This meta-classifier handles its multiple base-classifiers with the following
methods:
setClassifiers(Classifier[])
sets the array of classifiers to use as base-classifiers. If you require the
base-classifiers to implement a certain interface or be of a certain class, then
override this method and add the necessary checks. Note that this method does
not create a copy of the array, but just uses this reference internally.
getClassifiers()
returns the array of classifiers that is in use. Careful: this method returns
the internal array and not a copy of it.
getClassifier(int)
returns the classifier from the internal classifier array specified by the given
index. Once again, this method does not return a copy of the classifier, but the
actual object used by this classifier.
17.1.4.2 Guidelines
WEKA's code base requires you to follow a few rules. The following sections
can be used as guidelines in writing your code.
Parameters
There are two different ways of setting/obtaining parameters of an algorithm.
Unfortunately, the two are completely independent of each other, which makes
option handling prone to errors. Here are the two:
1. command-line options, using the setOptions/getOptions methods
2. using the properties through the GenericObjectEditor in the GUI
Each command-line option must have a corresponding GUI property and vice
versa. In case of GUI properties, the get- and set-method for a property must
comply with Java Beans style in order to show up in the GUI. You need to
supply three methods for each property:
• public void set<PropertyName>(<Type>): checks whether the supplied value
  is valid and only then updates the corresponding member variable. In any
  other case it should ignore the value and output a warning in the console
  or throw an IllegalArgumentException.
• public <Type> get<PropertyName>(): performs any necessary conversions of
  the internal value and returns it.
• public String <propertyName>TipText(): returns the help text that is
  available through the GUI. It should be the same as on the command-line.
  Note: everything after the first period "." gets truncated from the tool
  tip that pops up in the GUI when hovering with the mouse cursor over the
  field in the GenericObjectEditor.
With a property called alpha of type double, we get the following method
signatures:
public void setAlpha(double)
public double getAlpha()
public String alphaTipText()
These get- and set-methods should be used in the getOptions and setOptions
methods as well, to impose the same checks when getting/setting parameters.
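The three methods for the alpha property might look as follows. This is a sketch under assumptions: the class AlphaPropertySketch, the valid range (0, 1] and the tip text wording are invented for illustration; only the method names follow the signatures listed above.

```java
// Hypothetical class showing the three bean-style methods for a property
// called "alpha"; the valid range (0, 1] is an assumption for illustration.
class AlphaPropertySketch {
  private double m_Alpha = 0.75;

  public void setAlpha(double value) {
    // written as a negated range check so that NaN is rejected as well
    if (!(value > 0.0 && value <= 1.0)) {
      throw new IllegalArgumentException("alpha must be in (0, 1], got " + value);
    }
    m_Alpha = value;  // only reached for valid values
  }

  public double getAlpha() {
    return m_Alpha;
  }

  public String alphaTipText() {
    // the essential information sits before the first period
    return "The alpha parameter of the algorithm. Must lie in the interval (0, 1].";
  }
}
```

Calling setAlpha from within setOptions, rather than assigning the member variable directly, means that a value supplied on the command-line passes through the same validation as one entered in the GUI.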
Randomization
In order to get repeatable experiments, one is not allowed to use unseeded
random number generators like Math.random(). Instead, one has to instantiate
a java.util.Random object in the buildClassifier(Instances) method with
a specific seed value. The seed value can, of course, be supplied by the user,
which all the Randomizable... abstract classifiers already implement.
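A short JDK-only sketch shows why a seeded generator makes runs repeatable: two java.util.Random objects created with the same seed produce the same sequence. The class SeededRandomSketch and its draw method are hypothetical names used only for this illustration.

```java
import java.util.Random;

// Minimal sketch: generators created with the same seed yield the same
// sequence, which is what makes an experiment repeatable.
class SeededRandomSketch {

  /** Draws n integers in [0, 100) from a generator seeded with the given value. */
  static int[] draw(long seed, int n) {
    Random random = new Random(seed);  // seeded, unlike Math.random()
    int[] values = new int[n];
    for (int i = 0; i < n; i++) {
      values[i] = random.nextInt(100);
    }
    return values;
  }
}
```

Calling draw(1, 5) twice returns identical arrays, whereas values obtained from Math.random() would differ from run to run.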
Capabilities
By default, the weka.classifiers.Classifier superclass returns an object
that denotes that the classifier can handle any type of data. This is useful for
rapid prototyping of new algorithms, but also very dangerous. If you do not
specifically define what type of data can be handled by your classifier, you can
end up with meaningless models or errors. This can happen if you devise a
new classifier which is supposed to handle only numeric attributes. By using
the value(int/Attribute) method of a weka.core.Instance to obtain the
numeric value of an attribute, you also obtain the internal format of nominal,
string and relational attributes. Of course, treating these attribute types as
numeric ones does not make any sense. Hence it is highly recommended (and
required for contributions) to override the getCapabilities() method in your
own classifier.
There are three different types of capabilities that you can define:
1. attribute related, e.g., nominal, numeric, date, missing values, ...
2. class attribute related, e.g., no-class, nominal, numeric, missing class
   values, ...
3. miscellaneous, e.g., only multi-instance data, minimum number of
   instances in the training data
There are some special cases:
• incremental classifiers: they need to set the minimum number of instances
  in the training data to 0, since the default is 1:
    setMinimumNumberInstances(0)
• multi-instance classifiers: in order to signal that the special
  multi-instance format (bag-id, bag-data, class) is used, they need to enable
  the following capability:
    enable(Capability.ONLY_MULTIINSTANCE)
  These classifiers also need to implement the interface specific to
  multi-instance, weka.core.MultiInstanceCapabilitiesHandler, which returns
  the capabilities for the bag-data.
• cluster algorithms: since clusterers are unsupervised algorithms, they
  cannot process data with the class attribute set. The capability that
  denotes that an algorithm can handle data without a class attribute is
    Capability.NO_CLASS
A note on enabling/disabling nominal attributes or nominal class attributes:
these operations automatically enable/disable the binary, unary and empty
nominal capabilities as well. The following sections list a few examples of how
to configure the capabilities.
Simple classifier
A classifier that handles only numeric classes and numeric and nominal
attributes, but no missing values at all, would configure the Capabilities
object like this:
public Capabilities getCapabilities() {
  Capabilities result = new Capabilities(this);
  // attributes
  result.enable(Capability.NOMINAL_ATTRIBUTES);
  result.enable(Capability.NUMERIC_ATTRIBUTES);
  // class
  result.enable(Capability.NUMERIC_CLASS);
  return result;
}
Another classifier, one that handles only binary classes and only nominal
attributes, but does accept missing values, would implement the
getCapabilities() method as follows:
public Capabilities getCapabilities() {
  Capabilities result = new Capabilities(this);
  // attributes
  result.enable(Capability.NOMINAL_ATTRIBUTES);
  result.enable(Capability.MISSING_VALUES);
  // class
  result.enable(Capability.BINARY_CLASS);
  result.disable(Capability.UNARY_CLASS);
  result.enable(Capability.MISSING_CLASS_VALUES);
  return result;
}
Meta-classifier
Meta-classifiers, by default, just return the capabilities of their base
classifiers. In case of descendants of
weka.classifiers.MultipleClassifiersCombiner, an AND over all the Capabilities
of the base classifiers is returned.
Due to this behavior, the capabilities normally depend only on the currently
configured base classifier(s). To soften filtering for certain behavior,
meta-classifiers also define so-called Dependencies on a per-Capability basis.
These dependencies tell the filter that even though a certain capability is not
supported right now, it is possible that it will be supported with a different
base classifier. By default, all capabilities are initialized as Dependencies.
weka.classifiers.meta.LogitBoost, e.g., is restricted to nominal classes.
For that reason it disables the Dependencies for the class:
result.disableAllClasses(); // disable all class types
result.disableAllClassDependencies(); // no dependencies!
result.enable(Capability.NOMINAL_CLASS); // only nominal classes allowed
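The AND over the capabilities of several base classifiers, mentioned at the start of this section, can be pictured as a set intersection. The following is a conceptual sketch only: it uses a hypothetical enum Cap and java.util.EnumSet rather than WEKA's own Capabilities class, and the constant names are chosen merely to resemble the ones used in the text.

```java
import java.util.EnumSet;

// Conceptual sketch with a hypothetical enum; WEKA's Capabilities class is
// not used here. It shows an AND (intersection) over base-classifier sets.
class CapabilityAndSketch {

  enum Cap {
    NOMINAL_ATTRIBUTES, NUMERIC_ATTRIBUTES, MISSING_VALUES,
    NOMINAL_CLASS, NUMERIC_CLASS
  }

  @SafeVarargs
  static EnumSet<Cap> and(EnumSet<Cap>... baseCapabilities) {
    EnumSet<Cap> result = EnumSet.allOf(Cap.class);
    for (EnumSet<Cap> caps : baseCapabilities) {
      result.retainAll(caps);  // keep only what every base classifier supports
    }
    return result;
  }
}
```

With one base set containing nominal and numeric attributes plus a nominal class, and another containing nominal attributes, missing values and a nominal class, only nominal attributes and the nominal class survive the intersection.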
Javadoc
In order to keep code-quality high and maintenance low, source code needs to
be well documented. This includes the following Javadoc requirements:
• class
  - description of the classifier
  - listing of command-line parameters
  - publication(s), if applicable
  - @author and @version tag
• methods (all, not just public)
  - each parameter is documented
  - return value, if applicable, is documented
  - exception(s) are documented
  - the setOptions(String[]) method also lists the command-line parameters
Most of the class-related and the setOptions Javadoc is already available
through the source code:
• description of the classifier: globalInfo()
• listing of command-line parameters: listOptions()
• publication(s), if applicable: getTechnicalInformation()
In order to avoid manual syncing between Javadoc and source code, WEKA
comes with some tools for updating the Javadoc automatically. The following
tools take a concrete class and update its source code (the source code directory
needs to be supplied as well, of course):
• weka.core.AllJavadoc: executes all Javadoc-producing classes (this is
  the tool you would normally use)
• weka.core.GlobalInfoJavadoc: updates the globalinfo tags
• weka.core.OptionHandlerJavadoc: updates the option tags
• weka.core.TechnicalInformationHandlerJavadoc: updates the technical
  tags (plain text and BibTeX)
These tools look for specific comment tags in the source code and replace
everything in between the start and end tag with the documentation obtained
from the actual class.
• description of the classifier
    <!-- globalinfo-start -->
    will be automatically replaced
    <!-- globalinfo-end -->
• listing of command-line parameters
    <!-- options-start -->
    will be automatically replaced
    <!-- options-end -->
• publication(s), if applicable
    <!-- technical-bibtex-start -->
    will be automatically replaced
    <!-- technical-bibtex-end -->
  for a shortened, plain-text version use the following:
    <!-- technical-plaintext-start -->
    will be automatically replaced
    <!-- technical-plaintext-end -->
Here is a template of a Javadoc class block for an imaginary classifier that
also implements the weka.core.TechnicalInformationHandler interface:
/**
<!-- globalinfo-start -->
<!-- globalinfo-end -->
*
<!-- technical-bibtex-start -->
<!-- technical-bibtex-end -->
*
<!-- options-start -->
<!-- options-end -->
*
* @author John Doe (john dot doe at no dot where dot com)
* @version $Revision: 6192 $
*/
The template for any classifier's setOptions(String[]) method is as follows:
/**
* Parses a given list of options.
*
<!-- options-start -->
<!-- options-end -->
*
* @param options the list of options as an array of strings
* @throws Exception if an option is not supported
*/
Running the weka.core.AllJavadoc tool over this code will output code with
the comments filled out accordingly.
Revisions
Classifiers implement the weka.core.RevisionHandler interface. This provides
the functionality of obtaining the Subversion revision from within Java.
Classifiers that are not part of the official WEKA distribution do not have to
implement the method getRevision(), as the weka.classifiers.Classifier
class already implements this method. Contributions, on the other hand, need
to implement it as follows, in order to obtain the revision of this particular
source file:
/**
 * Returns the revision string.
 *
 * @return the revision
 */
public String getRevision() {
  return RevisionUtils.extract("$Revision: 6192 $");
}
Note, a commit into Subversion will replace the revision number above with the
actual revision number.
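To picture what extracting a revision from the keyword string amounts to, here is a JDK-only stand-in. It rests on an assumption: that the helper simply strips the "$Revision:" keyword, the dollar signs and surrounding blanks. The class RevisionSketch is hypothetical and is not the WEKA RevisionUtils source.

```java
// Assumption: extracting the revision means stripping the "$Revision:"
// keyword, the dollar signs and surrounding blanks. Hypothetical stand-in,
// not the WEKA RevisionUtils source.
class RevisionSketch {

  static String extract(String keyword) {
    return keyword.replace("$Revision:", "").replace("$", "").trim();
  }

  public static void main(String[] args) {
    System.out.println(extract("$Revision: 6192 $"));  // prints 6192
  }
}
```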
Testing
WEKA already provides a test framework to ensure correct basic functionality
of a classifier. It is essential for the classifier to pass these tests.
Option handling
You can check the option handling of your classifier with the following tool
from the command-line:
weka.core.CheckOptionHandler -W classname [-- additional parameters]
All tests need to return yes.
GenericObjectEditor
The CheckGOE class checks whether all the properties available in the GUI have a
tooltip accompanying them and whether the globalInfo() method is declared:
weka.core.CheckGOE -W classname [-- additional parameters]
All tests, once again, need to return yes.
Source code
Classifiers that implement the weka.classifiers.Sourcable interface can output
Java code of the built model. In order to check the generated code, one
should not only compile the code, but also test it with the following test class:
weka.classifiers.CheckSource
This class takes the original WEKA classifier, the generated code and the
dataset used for generating the model (and an optional class index) as
parameters. It builds the WEKA classifier on the dataset and checks whether the
output of the WEKA classifier and the output of the generated source code are
the same.
Here is an example call for weka.classifiers.trees.J48 and the generated
class weka.classifiers.WEKAWrapper (it wraps the actual generated code in a
pseudo-classifier):
java weka.classifiers.CheckSource \
-W weka.classifiers.trees.J48 \
-S weka.classifiers.WEKAWrapper \
-t data.arff
It needs to return "Tests OK!".
Unit tests
In order to make sure that your classifier conforms to the WEKA criteria, you
should add your classifier to the JUnit unit test framework, i.e., by creating
a Test class. The superclass for classifier unit tests is
weka.classifiers.AbstractClassifierTest.
17.2 Writing a new Filter
The workhorses of preprocessing in WEKA are filters. They perform many
tasks, from resampling data to deleting and standardizing attributes. The
following sections cover two different approaches that explain in detail how to
implement a new filter:
• default: this is how filters had to be implemented in the past.
• simple: since there are mainly two types of filters, batch or stream,
  additional abstract classes were introduced to speed up the implementation
  process.
17.2.1 Default approach
The default approach is the most flexible, but also the most complicated one
for writing a new filter. This approach has to be used if the filter cannot be
written using the simple approach described further below.
17.2.1.1 Implementation
The following methods are of importance for the implementation of a filter and
are explained in detail further down. It is also a good idea to study the
Javadoc of these methods as declared in the weka.filters.Filter class:
getCapabilities()
setInputFormat(Instances)
getInputFormat()
setOutputFormat(Instances)
getOutputFormat()
input(Instance)
bufferInput(Instance)
push(Instance)
output()
batchFinished()
flushInput()
getRevision()
But only the following ones normally need to be modified:
getCapabilities()
setInputFormat(Instances)
input(Instance)
batchFinished()
getRevision()
For more information on Capabilities, see section 17.2.3. Please note that the
weka.filters.Filter superclass does not implement the weka.core.OptionHandler
interface. See section "Option handling" on page 249.
setInputFormat(Instances)
With this call, the user tells the filter what structure, i.e., attributes, the
input data has. This method also tests whether the filter can actually process
this data, according to the capabilities specified in the getCapabilities()
method.
If the output format of the filter, i.e., the new Instances header, can be
determined based on this information alone, then the method should set the
output format via setOutputFormat(Instances) and return true; otherwise
it has to return false.
getInputFormat()
This method returns an Instances object containing all currently buffered
Instance objects from the input queue.
setOutputFormat(Instances)
setOutputFormat(Instances) defines the new Instances header for the output
data. For filters that work on a row basis, there should not be any changes
between the input and output format. But filters that work on attributes, e.g.,
removing, adding or modifying them, will affect this format. This method must
be called with the appropriate Instances object as parameter, since all
Instance objects being processed will rely on the output format (they use it as
the dataset that they belong to).
getOutputFormat()
This method returns the currently set Instances object that defines the output
format. In case setOutputFormat(Instances) has not been called yet, this
method will return null.
input(Instance)
returns true if the given Instance can be processed straight away and can be
collected immediately via the output() method (after adding it to the output
queue via push(Instance), of course). This is also the case if the first batch
of data has been processed and the Instance belongs to the second batch. Via
isFirstBatchDone() one can query whether this Instance is still part of the
first batch or of the second.
If the Instance cannot be processed immediately, e.g., because the filter needs
to collect all the data first before doing some calculations, then it needs to
be buffered with bufferInput(Instance) until batchFinished() is called. In
this case, the method needs to return false.
bufferInput(Instance)
In case an Instance cannot be processed immediately, one can use this method
to buffer it in the input queue. All buffered Instance objects are available
via the getInputFormat() method.
push(Instance)
adds the given Instance to the output queue.
output()
Returns the next Instance object from the output queue and removes it from
there. In case there is no Instance available this method returns null.
batchFinished()
signals the end of a dataset being pushed through the filter. In case of a
filter that could not process the data of the first batch immediately, this is
the place to determine what the output format will be (and set it via
setOutputFormat(Instances)) and finally process the input data. The currently
available data can be retrieved with the getInputFormat() method. After
processing the data, one needs to call flushInput() to remove all the pending
input data.
flushInput()
flushInput() removes all buffered Instance objects from the input queue.
This method must be called after all the Instance objects have been processed
in the batchFinished() method.
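The interplay of input, buffering, batchFinished and output can be simulated without WEKA. The sketch below is a stand-alone illustration under assumptions: it uses plain double[] rows instead of Instance objects, and the class BatchProtocolSketch with its fields is hypothetical; only the order of calls mirrors the methods described above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Stand-alone simulation of the buffering protocol described in the text,
// using plain double[] rows instead of WEKA Instance objects. All names are
// hypothetical; only the call order mirrors the text.
class BatchProtocolSketch {
  private final List<double[]> inputQueue = new ArrayList<>();
  private final Queue<double[]> outputQueue = new ArrayDeque<>();

  /** Buffers the row; returns false because no output is available yet. */
  boolean input(double[] row) {
    inputQueue.add(row);  // corresponds to bufferInput(Instance)
    return false;
  }

  /** Processes all buffered rows, appending the row index as a new value. */
  boolean batchFinished() {
    for (int i = 0; i < inputQueue.size(); i++) {
      double[] oldValues = inputQueue.get(i);
      double[] newValues = new double[oldValues.length + 1];
      System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
      newValues[newValues.length - 1] = i;
      outputQueue.add(newValues);  // corresponds to push(Instance)
    }
    inputQueue.clear();            // corresponds to flushInput()
    return !outputQueue.isEmpty();
  }

  /** Returns and removes the next processed row, or null if none is left. */
  double[] output() {
    return outputQueue.poll();
  }
}
```

A caller feeds every row through input, calls batchFinished once, and then drains output until it returns null.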
Option handling
If the filter should be able to handle command-line options, then the
interface weka.core.OptionHandler needs to be implemented. In addition to that,
the following code should be added at the end of the setOptions(String[])
method:
if (getInputFormat() != null) {
setInputFormat(getInputFormat());
}
This will inform the filter about changes in the options and therefore reset it.
17.2.1.2 Examples
The following examples, covering batch and stream filters, illustrate the filter
framework and how to use it.
Unseeded random number generators like Math.random() should never be
used, since they will produce different results in each run, and repeatable
experiments are essential in machine learning.
BatchFilter
This simple batch filter adds a new attribute called blah at the end of the
dataset. The rows of this attribute contain only the row's index in the data.
Since the batch filter does not have to see all the data before creating the
output format, setInputFormat(Instances) sets the output format and returns
true (indicating that the output format can be queried immediately). The
batchFinished() method performs the processing of all the data.
import weka.core.*;
import weka.core.Capabilities.*;
import weka.filters.*;

public class BatchFilter extends Filter {

  public String globalInfo() {
    return "A batch filter that adds an additional attribute blah at the end "
      + "containing the index of the processed instance. The output format "
      + "can be collected immediately.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS); // filter doesn't need class to be set
    return result;
  }

  public boolean setInputFormat(Instances instanceInfo) throws Exception {
    super.setInputFormat(instanceInfo);
    Instances outFormat = new Instances(instanceInfo, 0);
    outFormat.insertAttributeAt(new Attribute("blah"),
      outFormat.numAttributes());
    setOutputFormat(outFormat);
    return true; // output format is immediately available
  }

  public boolean batchFinished() throws Exception {
    if (getInputFormat() == null)
      throw new NullPointerException("No input instance format defined");
    Instances inst = getInputFormat();
    Instances outFormat = getOutputFormat();
    for (int i = 0; i < inst.numInstances(); i++) {
      // copy the old values and append the row index as the new attribute
      double[] newValues = new double[outFormat.numAttributes()];
      double[] oldValues = inst.instance(i).toDoubleArray();
      System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
      newValues[newValues.length - 1] = i;
      push(new Instance(1.0, newValues));
    }
    flushInput();
    m_NewBatch = true;
    m_FirstBatchDone = true;
    return (numPendingOutput() != 0);
  }

  public static void main(String[] args) {
    runFilter(new BatchFilter(), args);
  }
}
BatchFilter2
In contrast to the first batch filter, this one cannot determine the output
format immediately (the number of instances in the first batch is now part of
the attribute name). This is done in the batchFinished() method.
import weka.core.*;
import weka.core.Capabilities.*;
import weka.filters.*;

public class BatchFilter2 extends Filter {

  public String globalInfo() {
    return "A batch filter that adds an additional attribute blah at the end "
      + "containing the index of the processed instance. The output format "
      + "cannot be collected immediately.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS); // filter doesn't need class to be set
    return result;
  }

  public boolean batchFinished() throws Exception {
    if (getInputFormat() == null)
      throw new NullPointerException("No input instance format defined");
    // output format still needs to be set (depends on first batch of data)
    if (!isFirstBatchDone()) {
      Instances outFormat = new Instances(getInputFormat(), 0);
      outFormat.insertAttributeAt(new Attribute(
        "blah-" + getInputFormat().numInstances()), outFormat.numAttributes());
      setOutputFormat(outFormat);
    }
    Instances inst = getInputFormat();
    Instances outFormat = getOutputFormat();
    for (int i = 0; i < inst.numInstances(); i++) {
      // copy the old values and append the row index as the new attribute
      double[] newValues = new double[outFormat.numAttributes()];
      double[] oldValues = inst.instance(i).toDoubleArray();
      System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
      newValues[newValues.length - 1] = i;
      push(new Instance(1.0, newValues));
    }
    flushInput();
    m_NewBatch = true;
    m_FirstBatchDone = true;
    return (numPendingOutput() != 0);
  }

  public static void main(String[] args) {
    runFilter(new BatchFilter2(), args);
  }
}
BatchFilter3
As soon as this batch filter's first batch is done, it can process Instance
objects immediately in the input(Instance) method. It adds a new attribute which
contains just a random number, but the random number generator being used
is seeded with the number of instances from the first batch.
import weka.core.*;
import weka.core.Capabilities.*;
import weka.filters.*;
import java.util.Random;

public class BatchFilter3 extends Filter {

  protected int m_Seed;
  protected Random m_Random;

  public String globalInfo() {
    return "A batch filter that adds an attribute blah at the end "
      + "containing a random number. The output format cannot be collected "
      + "immediately.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS); // filter doesn't need class to be set
    return result;
  }

  public boolean input(Instance instance) throws Exception {
    if (getInputFormat() == null)
      throw new NullPointerException("No input instance format defined");
    if (isNewBatch()) {
      resetQueue();
      m_NewBatch = false;
    }
    // after the first batch, instances can be converted right away
    if (isFirstBatchDone())
      convertInstance(instance);
    else
      bufferInput(instance);
    return isFirstBatchDone();
  }

  public boolean batchFinished() throws Exception {
    if (getInputFormat() == null)
      throw new NullPointerException("No input instance format defined");
    // output format still needs to be set (random number generator is seeded
    // with number of instances of first batch)
    if (!isFirstBatchDone()) {
      m_Seed = getInputFormat().numInstances();
      Instances outFormat = new Instances(getInputFormat(), 0);
      outFormat.insertAttributeAt(new Attribute(
        "blah-" + getInputFormat().numInstances()), outFormat.numAttributes());
      setOutputFormat(outFormat);
    }
    Instances inst = getInputFormat();
    for (int i = 0; i < inst.numInstances(); i++) {
      convertInstance(inst.instance(i));
    }
    flushInput();
    m_NewBatch = true;
    m_FirstBatchDone = true;
    m_Random = null;
    return (numPendingOutput() != 0);
  }

  protected void convertInstance(Instance instance) {
    if (m_Random == null)
      m_Random = new Random(m_Seed);
    double[] newValues = new double[instance.numAttributes() + 1];
    double[] oldValues = instance.toDoubleArray();
    newValues[newValues.length - 1] = m_Random.nextInt();
    System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
    push(new Instance(1.0, newValues));
  }

  public static void main(String[] args) {
    runFilter(new BatchFilter3(), args);
  }
}
StreamFilter
This stream filter adds a random number (the seed value is hard-coded) at the
end of each Instance of the input data. Since this does not rely on having
access to the full data of the first batch, the output format is accessible
immediately after using setInputFormat(Instances). All the Instance objects are
immediately processed in input(Instance) via the convertInstance(Instance)
method, which pushes them immediately to the output queue.
import weka.core.*;
import weka.core.Capabilities.*;
import weka.filters.*;
import java.util.Random;

public class StreamFilter extends Filter {

  protected Random m_Random;

  public String globalInfo() {
    return "A stream filter that adds an attribute blah at the end "
      + "containing a random number. The output format can be collected "
      + "immediately.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS); // filter doesn't need class to be set
    return result;
  }

  public boolean setInputFormat(Instances instanceInfo) throws Exception {
    super.setInputFormat(instanceInfo);
    Instances outFormat = new Instances(instanceInfo, 0);
    outFormat.insertAttributeAt(new Attribute("blah"),
      outFormat.numAttributes());
    setOutputFormat(outFormat);
    m_Random = new Random(1); // hard-coded seed for repeatable output
    return true; // output format is immediately available
  }

  public boolean input(Instance instance) throws Exception {
    if (getInputFormat() == null)
      throw new NullPointerException("No input instance format defined");
    if (isNewBatch()) {
      resetQueue();
      m_NewBatch = false;
    }
    convertInstance(instance);
    return true; // can be immediately collected via output()
  }

  protected void convertInstance(Instance instance) {
    double[] newValues = new double[instance.numAttributes() + 1];
    double[] oldValues = instance.toDoubleArray();
    newValues[newValues.length - 1] = m_Random.nextInt();
    System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
    push(new Instance(1.0, newValues));
  }

  public static void main(String[] args) {
    runFilter(new StreamFilter(), args);
  }
}
17.2.2 Simple approach
The base filters and interfaces are all located in the following package:
weka.filters
One can roughly divide filters into two different kinds:
• batch filters: they need to see the whole dataset before they can start
  processing it, which they do in one go
• stream filters: they can start producing output right away and the data
  just passes through while being modified
You can subclass one of the following abstract filters, depending on the kind of
filter you want to implement:
weka.filters.SimpleBatchFilter
weka.filters.SimpleStreamFilter
These filters simplify the rather general and complex framework introduced by
the abstract superclass weka.filters.Filter. One only needs to implement
a couple of abstract methods that will process the actual data and override, if
necessary, a few existing methods for option handling.
17.2.2.1 SimpleBatchFilter
Only the following abstract methods need to be implemented:
• globalInfo(): returns a short description of what the filter does; will
  be displayed in the GUI
• determineOutputFormat(Instances): generates the new format, based
  on the input data
• process(Instances): processes the whole dataset in one go
• getRevision(): returns the Subversion revision information, see section
  "Revisions" on page 258
If more options are necessary, then the following methods need to be overridden:
• listOptions(): returns an enumeration of the available options; these
  are printed if one calls the filter with the -h option
• setOptions(String[]): parses the given option array that was passed
  from the command-line
• getOptions(): returns an array of options, resembling the current setup
  of the filter
See section "Methods" on page 237 and section "Parameters" on page 241 for
more information.
The following example implementation adds an additional attribute at the end,
containing the index of the processed instance:
import weka.core.*;
import weka.core.Capabilities.*;
import weka.filters.*;

public class SimpleBatch extends SimpleBatchFilter {

  public String globalInfo() {
    return "A simple batch filter that adds an additional attribute blah at the end "
      + "containing the index of the processed instance.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS);  // filter doesn't need class to be set
    return result;
  }

  protected Instances determineOutputFormat(Instances inputFormat) {
    Instances result = new Instances(inputFormat, 0);
    result.insertAttributeAt(new Attribute("blah"), result.numAttributes());
    return result;
  }

  protected Instances process(Instances inst) {
    Instances result = new Instances(determineOutputFormat(inst), 0);
    for (int i = 0; i < inst.numInstances(); i++) {
      double[] values = new double[result.numAttributes()];
      for (int n = 0; n < inst.numAttributes(); n++)
        values[n] = inst.instance(i).value(n);
      values[values.length - 1] = i;
      result.add(new Instance(1, values));
    }
    return result;
  }

  public static void main(String[] args) {
    runFilter(new SimpleBatch(), args);
  }
}
17.2.2.2 SimpleStreamFilter
Only the following abstract methods need to be implemented for a stream filter:
• globalInfo() - returns a short description of what the filter does; will
  be displayed in the GUI
• determineOutputFormat(Instances) - generates the new format, based
  on the input data
• process(Instance) - processes a single instance and turns it from the
  old format into the new one
• getRevision() - returns the Subversion revision information, see section
  "Revisions" on page 258
If more options are necessary, then the following methods need to be overridden:
• listOptions() - returns an enumeration of the available options; these
  are printed if one calls the filter with the -h option
• setOptions(String[]) - parses the given option array that was passed
  from the command line
• getOptions() - returns an array of options, resembling the current setup
  of the filter
See also section 17.1.4.1, covering "Methods" for classifiers.
The following example implementation of a stream filter adds an extra
attribute at the end, which is filled with random numbers. The reset()
method is only overridden in this example because the random number generator
needs to be re-initialized in order to obtain repeatable results.
import weka.core.*;
import weka.core.Capabilities.*;
import weka.filters.*;
import java.util.Random;

public class SimpleStream extends SimpleStreamFilter {

  protected Random m_Random;

  public String globalInfo() {
    return "A simple stream filter that adds an attribute blah at the end "
      + "containing a random number.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS);  // filter doesn't need class to be set
    return result;
  }

  protected void reset() {
    super.reset();
    m_Random = new Random(1);
  }

  protected Instances determineOutputFormat(Instances inputFormat) {
    Instances result = new Instances(inputFormat, 0);
    result.insertAttributeAt(new Attribute("blah"), result.numAttributes());
    return result;
  }

  protected Instance process(Instance inst) {
    double[] values = new double[inst.numAttributes() + 1];
    for (int n = 0; n < inst.numAttributes(); n++)
      values[n] = inst.value(n);
    values[values.length - 1] = m_Random.nextInt();
    Instance result = new Instance(1, values);
    return result;
  }

  public static void main(String[] args) {
    runFilter(new SimpleStream(), args);
  }
}
A real-world implementation of a stream filter is the MultiFilter class
(package weka.filters), which passes the data through all the filters it contains.
Depending on whether all the used filters are streamable or not, it acts either
as a stream filter or as a batch filter.
17.2.2.3 Internals
Some useful methods of the filter classes:
• isNewBatch() - returns true if an instance of the filter was just
  instantiated or a new batch was started via the batchFinished() method.
• isFirstBatchDone() - returns true as soon as the first batch was finished
  via the batchFinished() method. Useful for supervised filters, which
  should not be altered after being trained with the first batch of instances.
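How the two flags change over time can be shown with a small toy model. This
is a hypothetical class (BatchStateSketch), not WEKA's Filter implementation,
and it assumes that feeding a row via input() clears the "new batch" flag,
which the description above does not state explicitly.

```java
// Toy model of the two flags described above; a hypothetical class,
// not WEKA's actual Filter implementation.
class BatchStateSketch {
  private boolean newBatch = true;        // true right after instantiation
  private boolean firstBatchDone = false; // no batch has been finished yet

  // Called for every incoming row; the first row starts a batch
  // (assumption made for this sketch).
  void input() {
    newBatch = false;
  }

  // Signals the end of the current batch.
  void batchFinished() {
    newBatch = true;
    firstBatchDone = true;
  }

  boolean isNewBatch() { return newBatch; }
  boolean isFirstBatchDone() { return firstBatchDone; }
}
```

A supervised filter would typically collect its statistics only while
isFirstBatchDone() still returns false and reuse them for all later batches.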
17.2.3 Capabilities
Filters implement the weka.core.CapabilitiesHandler interface like the
classifiers. Its getCapabilities() method returns what kind of data the filter
is able to process. It needs to be adapted for each individual filter, since the
default implementation allows the processing of all kinds of attributes and
classes. Otherwise correct functioning of the filter cannot be guaranteed. See
section "Capabilities" on page 242 for more information.
17.2.4 Packages
A few comments about the different filter sub-packages:
• supervised - contains supervised filters, i.e., filters that take class
  distributions into account. Must implement the weka.filters.SupervisedFilter
  interface.
  - attribute - filters that work column-wise.
  - instance - filters that work row-wise.
• unsupervised - contains unsupervised filters, i.e., they work without
  taking any class distributions into account. The filter must implement the
  weka.filters.UnsupervisedFilter interface.
  - attribute - filters that work column-wise.
  - instance - filters that work row-wise.
Javadoc
The Javadoc generation works the same as with classifiers. See section "Javadoc"
on page 244 for more information.
17.2.5 Revisions
Filters, like classifiers, implement the weka.core.RevisionHandler interface.
This provides the functionality of obtaining the Subversion revision from within
Java. Filters that are not part of the official WEKA distribution do not have
to implement the method getRevision(), as the weka.filters.Filter class
already implements this method. Contributions, on the other hand, need to
implement it in order to obtain the revision of this particular source file. See
section "Revisions" on page 245.
17.2.6 Testing
WEKA already provides a test framework to ensure correct basic functionality
of a filter. It is essential for the filter to pass these tests.
17.2.6.1 Option handling
You can check the option handling of your filter with the following tool from
the command line:
weka.core.CheckOptionHandler -W classname [-- additional parameters]
All tests need to return yes.
17.2.6.2 GenericObjectEditor
The CheckGOE class checks whether all the properties available in the GUI have a
tooltip accompanying them and whether the globalInfo() method is declared:
weka.core.CheckGOE -W classname [-- additional parameters]
All tests, once again, need to return yes.
17.2.6.3 Source code
Filters that implement the weka.filters.Sourcable interface can output Java
code of their internal representation. In order to check the generated code, one
should not only compile the code, but also test it with the following test class:
weka.filters.CheckSource
This class takes the original WEKA filter, the generated code and the dataset
used for generating the source code (and an optional class index) as parameters.
It builds the WEKA filter on the dataset and checks whether the output of the
WEKA filter and the output of the generated source code are the same.
Here is an example call for weka.filters.unsupervised.attribute.ReplaceMissingValues
and the generated class weka.filters.WEKAWrapper (it wraps the actual
generated code in a pseudo-filter):
java weka.filters.CheckSource \
-W weka.filters.unsupervised.attribute.ReplaceMissingValues \
-S weka.filters.WEKAWrapper \
-t data.arff
It needs to return Tests OK!.
17.2.6.4 Unit tests
In order to make sure that your filter complies with the WEKA criteria, you
should add your filter to the junit unit test framework, i.e., by creating a
Test class. The superclass for filter unit tests is
weka.filters.AbstractFilterTest.
17.3 Writing other algorithms
The previous sections covered how to implement classifiers and filters. In the
following you will find some information on how to implement clusterers,
associators and attribute selection algorithms. The various algorithms are only
covered briefly, since other important components (capabilities, option handling,
revisions) have already been discussed in the other chapters.
17.3.1 Clusterers
Superclasses and interfaces
All clusterers implement the interface weka.clusterers.Clusterer, but most
algorithms will most likely be derived (directly or further up in the class
hierarchy) from the abstract superclass weka.clusterers.AbstractClusterer.
weka.clusterers.SingleClustererEnhancer is used for meta-clusterers,
like the FilteredClusterer that filters the data on-the-fly for the base-clusterer.
Here are some common interfaces that can be implemented:
• weka.clusterers.DensityBasedClusterer - for clusterers that can estimate
  the density for a given instance. AbstractDensityBasedClusterer
  already implements this interface.
• weka.clusterers.UpdateableClusterer - clusterers that can generate
  their model incrementally implement this interface, like Cobweb.
• NumberOfClustersRequestable - is for clusterers that allow the user to
  specify the number of clusters to generate, like SimpleKMeans.
• weka.core.Randomizable - for clusterers that support randomization in
  one way or another. RandomizableClusterer, RandomizableDensityBasedClusterer
  and RandomizableSingleClustererEnhancer all implement this interface
  already.
Methods
The following is a short description of the methods that are common to all
cluster algorithms; see also the Javadoc for the Clusterer interface.
buildClusterer(Instances)
Like the buildClassifier(Instances) method, this method completely rebuilds
the model. Subsequent calls of this method with the same dataset must
result in exactly the same model being built. This method also tests the training
data against the capabilities of this clusterer:
public void buildClusterer(Instances data) throws Exception {
  // test data against capabilities
  getCapabilities().testWithFail(data);
  // actual model generation
  ...
}
clusterInstance(Instance)
returns the index of the cluster the provided Instance belongs to.
distributionForInstance(Instance)
returns the cluster membership for this Instance object. The membership is a
double array containing the probabilities for each cluster.
numberOfClusters()
returns the number of clusters that the model contains, after the model has
been generated with the buildClusterer(Instances) method.
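The contract of these four methods can be illustrated with a self-contained
toy example. The class below (ToyClusterer) is hypothetical and not derived
from any WEKA class; it works on plain double values with two clusters, and
its inverse-distance membership is just one possible choice. It is a sketch
of the rules described above: the build method discards old state and
re-creates its random number generator, so the same data always gives the
same model, and the membership values form a probability distribution.

```java
import java.util.Random;

// Hypothetical toy clusterer for 1-D data; not derived from any WEKA class.
class ToyClusterer {
  private final int seed;
  private double[] centers; // the "model"

  ToyClusterer(int seed) { this.seed = seed; }

  // Completely rebuilds the model: old state is discarded and the random
  // number generator is re-created from the seed, so that the same data
  // always yields the same centers.
  void buildClusterer(double[] data) {
    if (data.length < 2)
      throw new IllegalArgumentException("need at least two values");
    Random rnd = new Random(seed);
    centers = new double[] {
      data[rnd.nextInt(data.length)], data[rnd.nextInt(data.length)]};
    for (int iter = 0; iter < 10; iter++) {
      double[] sum = new double[2];
      int[] count = new int[2];
      for (double x : data) {
        int c = clusterInstance(x);
        sum[c] += x;
        count[c]++;
      }
      for (int c = 0; c < 2; c++)
        if (count[c] > 0)
          centers[c] = sum[c] / count[c];
    }
  }

  // Membership probabilities, one per cluster; they sum to 1.
  double[] distributionForInstance(double x) {
    double[] w = new double[centers.length];
    double total = 0;
    for (int c = 0; c < w.length; c++) {
      w[c] = 1.0 / (Math.abs(x - centers[c]) + 1e-9);
      total += w[c];
    }
    for (int c = 0; c < w.length; c++)
      w[c] /= total;
    return w;
  }

  // Index of the cluster with the highest membership.
  int clusterInstance(double x) {
    double[] dist = distributionForInstance(x);
    int best = 0;
    for (int c = 1; c < dist.length; c++)
      if (dist[c] > dist[best])
        best = c;
    return best;
  }

  int numberOfClusters() { return centers.length; }

  double[] centers() { return centers.clone(); }
}
```

Calling buildClusterer twice on the same array leaves centers() unchanged,
which is the repeatability requirement stated for buildClusterer(Instances).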
getCapabilities()
see section "Capabilities" on page 242 for more information.
toString()
should output some information on the generated model. Even though this is
not required, it is rather useful for the user to get some feedback on the built
model.
main(String[])
executes the clusterer from command-line. If your new algorithm is called
FunkyClusterer, then use the following code as your main method:
/**
 * Main method for executing this clusterer.
 *
 * @param args the options, use "-h" to display options
 */
public static void main(String[] args) {
  AbstractClusterer.runClusterer(new FunkyClusterer(), args);
}
Testing
For some basic tests from the command-line, you can use the following test
class:
weka.clusterers.CheckClusterer -W classname [further options]
For junit tests, you can subclass the weka.clusterers.AbstractClustererTest
class and add additional tests.
17.3.2 Attribute selection
Attribute selection basically consists of two different types of classes:
• evaluator - determines the merit of single attributes or subsets of
  attributes
• search algorithm - the search heuristic
Each of them will be discussed separately in the following sections.
Evaluator
The evaluator algorithm is responsible for determining the merit of the current
attribute selection.
Superclasses and interfaces
The ancestor for all evaluators is the weka.attributeSelection.ASEvaluation
class.
Here are some interfaces that are commonly implemented by evaluators:
• AttributeEvaluator - evaluates only single attributes
• SubsetEvaluator - evaluates subsets of attributes
• AttributeTransformer - evaluators that transform the input data
Methods
The following is a brief description of the main methods of an evaluator.
buildEvaluator(Instances)
Generates the attribute evaluator. Subsequent calls of this method with the
same data (and the same search algorithm) must result in the same attributes
being selected. This method also checks the data against the capabilities:
public void buildEvaluator(Instances data) throws Exception {
  // can evaluator handle data?
  getCapabilities().testWithFail(data);
  // actual initialization of evaluator
  ...
}
postProcess(int[])
can be used for optional post-processing of the selected attributes, e.g., for
ranking purposes.
main(String[])
executes the evaluator from command-line. If your new algorithm is called
FunkyEvaluator, then use the following code as your main method:
/**
 * Main method for executing this evaluator.
 *
 * @param args the options, use "-h" to display options
 */
public static void main(String[] args) {
  ASEvaluation.runEvaluator(new FunkyEvaluator(), args);
}
Search
The search algorithm defines the search heuristic, e.g., exhaustive search,
greedy or genetic.
Superclasses and interfaces
The ancestor for all search algorithms is the weka.attributeSelection.ASSearch
class.
Interfaces that can be implemented, if applicable, by a search algorithm:
• RankedOutputSearch - for search algorithms that produce ranked lists of
  attributes
• StartSetHandler - search algorithms that can make use of a start set of
  attributes implement this interface
Methods
Search algorithms are rather basic classes with regard to the methods that
need to be implemented. Only the following method needs to be implemented:
search(ASEvaluation,Instances)
uses the provided evaluator to guide the search.
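The division of labour between evaluator and search can be sketched in plain
Java. The interface SubsetScorer and the class GreedyForwardSketch below are
hypothetical stand-ins, not the WEKA ASEvaluation and ASSearch classes; the
sketch assumes that a higher merit is better and uses a simple greedy forward
selection as the heuristic.

```java
import java.util.BitSet;

// Hypothetical stand-ins for an evaluator and a search algorithm;
// these are not the WEKA ASEvaluation/ASSearch classes.
interface SubsetScorer {
  double score(BitSet subset); // merit of a subset of attribute indices
}

class GreedyForwardSketch {
  // Starts with the empty set and repeatedly adds the single attribute
  // that improves the merit most; stops when no addition improves it.
  static BitSet search(SubsetScorer scorer, int numAttributes) {
    BitSet current = new BitSet(numAttributes);
    double best = scorer.score(current);
    boolean improved = true;
    while (improved) {
      improved = false;
      int bestAttr = -1;
      for (int a = 0; a < numAttributes; a++) {
        if (current.get(a)) continue;
        BitSet candidate = (BitSet) current.clone();
        candidate.set(a);
        double merit = scorer.score(candidate);
        if (merit > best) {
          best = merit;
          bestAttr = a;
          improved = true;
        }
      }
      if (improved) current.set(bestAttr);
    }
    return current;
  }
}
```

The search never looks at the data itself; it only asks the scorer for the
merit of candidate subsets, which mirrors the role of the evaluator passed to
search(ASEvaluation,Instances).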
Testing
For some basic tests from the command-line, you can use the following test
class:
weka.attributeSelection.CheckAttributeSelection
-eval classname -search classname [further options]
For junit tests, you can subclass the weka.attributeSelection.AbstractEvaluatorTest
or weka.attributeSelection.AbstractSearchTest class and add additional
tests.
17.3.3 Associators
Superclasses and interfaces
The interface weka.associations.Associator is common to all associator
algorithms. But most algorithms will be derived from AbstractAssociator, an
abstract class implementing this interface. As with classifiers and clusterers, you
can also implement a meta-associator, derived from SingleAssociatorEnhancer.
An example for this is the FilteredAssociator, which filters the training data
on-the-fly for the base-associator.
The only other interface that is used by some association algorithms is
weka.associations.CARuleMiner. Associators that learn class association
rules implement this interface, like Apriori.
Methods
The associators are very basic algorithms and only support building of the
model.
buildAssociations(Instances)
Like the buildClassifier(Instances) method, this method completely re-
builds the model. Subsequent calls of this method with the same dataset must
result in exactly the same model being built. This method also tests the training
data against the capabilities:
public void buildAssociations(Instances data) throws Exception {
  // other necessary setups
  ...
  // test data against capabilities
  getCapabilities().testWithFail(data);
  // actual model generation
  ...
}
getCapabilities()
see section "Capabilities" on page 242 for more information.
toString()
should output some information on the generated model. Even though this is
not required, it is rather useful for the user to get some feedback on the built
model.
main(String[])
executes the associator from command-line. If your new algorithm is called
FunkyAssociator, then use the following code as your main method:
/**
 * Main method for executing this associator.
 *
 * @param args the options, use "-h" to display options
 */
public static void main(String[] args) {
  AbstractAssociator.runAssociator(new FunkyAssociator(), args);
}
Testing
For some basic tests from the command-line, you can use the following test
class:
weka.associations.CheckAssociator -W classname [further options]
For junit tests, you can subclass the weka.associations.AbstractAssociatorTest
class and add additional tests.
17.4 Extending the Explorer
The plugin architecture of the Explorer allows you to add new functionality
easily without having to dig into the code of the Explorer itself. In the following
you will find information on how to add new tabs, like the Classify tab, and
new visualization plugins for the Classify tab.
17.4.1 Adding tabs
The Explorer is a handy tool for initial exploration of your data; for proper
statistical evaluation, the Experimenter should be used instead. But if the
available functionality is not enough, you can always add your own custom-made
tabs to the Explorer.
17.4.1.1 Requirements
Here is roughly what is required in order to add a new tab (the examples below
go into more detail):
• your class must be derived from javax.swing.JPanel
• the interface weka.gui.explorer.Explorer.ExplorerPanel must be
  implemented by your class
• optional interfaces
  - weka.gui.explorer.Explorer.LogHandler - in case you want to
    take advantage of the logging in the Explorer
  - weka.gui.explorer.Explorer.CapabilitiesFilterChangeListener -
    in case your class needs to be notified of changes in the Capabilities,
    e.g., if new data is loaded into the Explorer
• adding the classname of your class to the Tabs property in the Explorer.props
  file
17.4.1.2 Examples
The following examples demonstrate the plugin architecture. Only the necessary
details are discussed, as the full source code is available from the WEKA
Examples [3] (package wekaexamples.gui.explorer).
SQL worksheet
Purpose
Displaying the SqlViewer as a tab in the Explorer instead of using it either
via the Open DB... button or as a standalone application. Uses the existing
components already available in WEKA and just assembles them in a JPanel.
Since this tab does not rely on a dataset being loaded into the Explorer, it will
be used as a standalone one.
Useful for people who work a lot with databases and would like to
have an SQL worksheet available all the time instead of clicking on a button
every time to open up a database dialog.
Implementation
• class is derived from javax.swing.JPanel and implements the interface
  weka.gui.explorer.Explorer.ExplorerPanel (the full source code also imports
  the weka.gui.explorer.Explorer.LogHandler interface, but that is only
  additional functionality):
  public class SqlPanel
    extends JPanel
    implements ExplorerPanel {
• some basic members that we need to have:
  /** the parent frame */
  protected Explorer m_Explorer = null;
  /** sends notifications when the set of working instances gets changed */
  protected PropertyChangeSupport m_Support = new PropertyChangeSupport(this);
• methods we need to implement due to the used interfaces:
  /** Sets the Explorer to use as parent frame */
  public void setExplorer(Explorer parent) {
    m_Explorer = parent;
  }
  /** returns the parent Explorer frame */
  public Explorer getExplorer() {
    return m_Explorer;
  }
  /** Returns the title for the tab in the Explorer */
  public String getTabTitle() {
    return "SQL"; // what's displayed as tab-title, e.g., Classify
  }
  /** Returns the tooltip for the tab in the Explorer */
  public String getTabTitleToolTip() {
    return "Retrieving data from databases"; // the tooltip of the tab
  }
  /** ignored, since we "generate" data and not receive it */
  public void setInstances(Instances inst) {
  }
  /** PropertyChangeListener which will be notified of value changes. */
  public void addPropertyChangeListener(PropertyChangeListener l) {
    m_Support.addPropertyChangeListener(l);
  }
  /** Removes a PropertyChangeListener. */
  public void removePropertyChangeListener(PropertyChangeListener l) {
    m_Support.removePropertyChangeListener(l);
  }
• additional GUI elements:
  /** the actual SQL worksheet */
  protected SqlViewer m_Viewer;
  /** the panel for the buttons */
  protected JPanel m_PanelButtons;
  /** the Load button - makes the data available in the Explorer */
  protected JButton m_ButtonLoad = new JButton("Load data");
  /** displays the current query */
  protected JLabel m_LabelQuery = new JLabel("");
• loading the data into the Explorer by clicking on the Load button will fire
  a propertyChange event:
  m_ButtonLoad.addActionListener(new ActionListener() {
    public void actionPerformed(ActionEvent evt) {
      m_Support.firePropertyChange("", null, null);
    }
  });
• the propertyChange event will perform the actual loading of the data,
  hence we add an anonymous property change listener to our panel:
  addPropertyChangeListener(new PropertyChangeListener() {
    public void propertyChange(PropertyChangeEvent e) {
      try {
        // load data
        InstanceQuery query = new InstanceQuery();
        query.setDatabaseURL(m_Viewer.getURL());
        query.setUsername(m_Viewer.getUser());
        query.setPassword(m_Viewer.getPassword());
        Instances data = query.retrieveInstances(m_Viewer.getQuery());
        // set data in preprocess panel (also notifies of capabilities changes)
        getExplorer().getPreprocessPanel().setInstances(data);
      }
      catch (Exception ex) {
        ex.printStackTrace();
      }
    }
  });
• In order to add our SqlPanel to the list of tabs displayed in the
  Explorer, we need to modify the Explorer.props file (just extract it from
  the weka.jar and place it in your home directory). The Tabs property
  must look like this:
Tabs=weka.gui.explorer.SqlPanel,\
weka.gui.explorer.ClassifierPanel,\
weka.gui.explorer.ClustererPanel,\
weka.gui.explorer.AssociationsPanel,\
weka.gui.explorer.AttributeSelectionPanel,\
weka.gui.explorer.VisualizePanel
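The event mechanics used by the Load button can be tried out in isolation with
the standard java.beans classes. The class below (FireEventSketch) is a
hypothetical stand-in for the panel and contains no Swing or WEKA code. It
shows a detail of PropertyChangeSupport that this pattern relies on: an event
with null old and new values is always delivered, whereas an event whose old
and new values are equal and non-null is suppressed.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;

// Hypothetical stand-in for a panel; shows only the event mechanics.
class FireEventSketch {
  private final PropertyChangeSupport support = new PropertyChangeSupport(this);
  private int notifications = 0;

  FireEventSketch() {
    // plays the role of the anonymous listener that loads the data
    PropertyChangeListener l = evt -> notifications++;
    support.addPropertyChangeListener(l);
  }

  // plays the role of the button's ActionListener
  void click() {
    // null/null values: the event is delivered on every call
    support.firePropertyChange("", null, null);
  }

  // equal, non-null values: PropertyChangeSupport suppresses the event
  void clickWithEqualValues() {
    support.firePropertyChange("", "same", "same");
  }

  int notifications() { return notifications; }
}
```

Two calls of click() notify the listener twice, while clickWithEqualValues()
does not notify it at all.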
Screenshot
Artificial data generation
Purpose
Instead of only having a Generate... button in the PreprocessPanel or using it
from the command line, this example creates a new panel to be displayed as an
extra tab in the Explorer. This tab will be available regardless of whether a
dataset is already loaded or not (= standalone).
Implementation
• class is derived from javax.swing.JPanel and implements the interface
  weka.gui.explorer.Explorer.ExplorerPanel (the full source code also imports
  the weka.gui.explorer.Explorer.LogHandler interface, but that is only
  additional functionality):
  public class GeneratorPanel
    extends JPanel
    implements ExplorerPanel {
• some basic members that we need to have (the same as for the SqlPanel
  class):
  /** the parent frame */
  protected Explorer m_Explorer = null;
  /** sends notifications when the set of working instances gets changed */
  protected PropertyChangeSupport m_Support = new PropertyChangeSupport(this);
• methods we need to implement due to the used interfaces (almost identical
  to SqlPanel):
  /** Sets the Explorer to use as parent frame */
  public void setExplorer(Explorer parent) {
    m_Explorer = parent;
  }
  /** returns the parent Explorer frame */
  public Explorer getExplorer() {
    return m_Explorer;
  }
  /** Returns the title for the tab in the Explorer */
  public String getTabTitle() {
    return "DataGeneration"; // what's displayed as tab-title, e.g., Classify
  }
  /** Returns the tooltip for the tab in the Explorer */
  public String getTabTitleToolTip() {
    return "Generating artificial datasets"; // the tooltip of the tab
  }
  /** ignored, since we "generate" data and not receive it */
  public void setInstances(Instances inst) {
  }
  /** PropertyChangeListener which will be notified of value changes. */
  public void addPropertyChangeListener(PropertyChangeListener l) {
    m_Support.addPropertyChangeListener(l);
  }
  /** Removes a PropertyChangeListener. */
  public void removePropertyChangeListener(PropertyChangeListener l) {
    m_Support.removePropertyChangeListener(l);
  }
• additional GUI elements:
  /** the GOE for the generators */
  protected GenericObjectEditor m_GeneratorEditor = new GenericObjectEditor();
  /** the text area for the output of the generated data */
  protected JTextArea m_Output = new JTextArea();
  /** the Generate button */
  protected JButton m_ButtonGenerate = new JButton("Generate");
  /** the Use button */
  protected JButton m_ButtonUse = new JButton("Use");
• the Generate button does not load the generated data directly into the
  Explorer, but only outputs it in the JTextArea (the Use button loads the
  data - see further down):
  m_ButtonGenerate.addActionListener(new ActionListener() {
    public void actionPerformed(ActionEvent evt) {
      DataGenerator generator = (DataGenerator) m_GeneratorEditor.getValue();
      String relName = generator.getRelationName();
      String cname = generator.getClass().getName().replaceAll(".*\\.", "");
      String cmd = generator.getClass().getName();
      if (generator instanceof OptionHandler)
        cmd += " " + Utils.joinOptions(((OptionHandler) generator).getOptions());
      try {
        // generate data
        StringWriter output = new StringWriter();
        generator.setOutput(new PrintWriter(output));
        DataGenerator.makeData(generator, generator.getOptions());
        m_Output.setText(output.toString());
      }
      catch (Exception ex) {
        ex.printStackTrace();
        JOptionPane.showMessageDialog(
          getExplorer(), "Error generating data:\n" + ex.getMessage(),
          "Error", JOptionPane.ERROR_MESSAGE);
      }
      generator.setRelationName(relName);
    }
  });
• the Use button finally fires a propertyChange event that will load the data
  into the Explorer:
  m_ButtonUse.addActionListener(new ActionListener() {
    public void actionPerformed(ActionEvent evt) {
      m_Support.firePropertyChange("", null, null);
    }
  });
• the propertyChange event will perform the actual loading of the data,
  hence we add an anonymous property change listener to our panel:
  addPropertyChangeListener(new PropertyChangeListener() {
    public void propertyChange(PropertyChangeEvent e) {
      try {
        Instances data = new Instances(new StringReader(m_Output.getText()));
        // set data in preprocess panel (also notifies of capabilities changes)
        getExplorer().getPreprocessPanel().setInstances(data);
      }
      catch (Exception ex) {
        ex.printStackTrace();
        JOptionPane.showMessageDialog(
          getExplorer(), "Error generating data:\n" + ex.getMessage(),
          "Error", JOptionPane.ERROR_MESSAGE);
      }
    }
  });
• In order to add our GeneratorPanel to the list of tabs displayed in the
  Explorer, we need to modify the Explorer.props file (just extract it from
  the weka.jar and place it in your home directory). The Tabs property
  must look like this:
Tabs=weka.gui.explorer.GeneratorPanel:standalone,\
weka.gui.explorer.ClassifierPanel,\
weka.gui.explorer.ClustererPanel,\
weka.gui.explorer.AssociationsPanel,\
weka.gui.explorer.AttributeSelectionPanel,\
weka.gui.explorer.VisualizePanel
Note: the standalone option is used to make the tab available without
requiring the preprocess panel to load a dataset first.
Screenshot
Experimenter light
Purpose
By default the Classify panel only performs 1 run of 10-fold cross-validation.
Since most classifiers are rather sensitive to the order of the data being
presented to them, those results can be too optimistic or pessimistic. Averaging
the results over 10 runs with differently randomized train/test pairs returns
more reliable results. And this is where this plugin comes in: it can be used
to obtain statistically sound results for a specific classifier/dataset
combination, without having to set up a whole experiment in the Experimenter.
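The averaging step itself is simple and can be sketched independently of WEKA.
The helper below (RepeatedRunsSketch) is hypothetical; it assumes that seeds
1 to runs are used for the randomization, and evaluateRun stands for whatever
a single run does: randomizing the data with the given seed, performing the
cross-validation and returning the statistic of interest.

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical helper; the real plugin uses the WEKA Experiment API.
class RepeatedRunsSketch {
  // Calls evaluateRun once per seed (1..runs) and averages the results.
  // evaluateRun stands for "randomize the data with this seed, run the
  // cross-validation and return the statistic of interest".
  static double average(int runs, IntToDoubleFunction evaluateRun) {
    if (runs <= 0)
      throw new IllegalArgumentException("runs must be positive");
    double sum = 0;
    for (int seed = 1; seed <= runs; seed++)
      sum += evaluateRun.applyAsDouble(seed);
    return sum / runs;
  }
}
```

Because every run gets its own seed, each run sees the data in a different
order, and the average is less dependent on one particular ordering.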
Implementation
• Since this plugin is rather bulky, we omit the implementation details, but
  the following can be said:
  - based on the weka.gui.explorer.ClassifierPanel
  - the actual code doing the work follows the example in the "Using the
    Experiment API" wiki article [2]
• In order to add our ExperimentPanel to the list of tabs displayed in the
  Explorer, we need to modify the Explorer.props file (just extract it from
  the weka.jar and place it in your home directory). The Tabs property
  must look like this:
Tabs=weka.gui.explorer.ClassifierPanel,\
weka.gui.explorer.ExperimentPanel,\
weka.gui.explorer.ClustererPanel,\
weka.gui.explorer.AssociationsPanel,\
weka.gui.explorer.AttributeSelectionPanel,\
weka.gui.explorer.VisualizePanel
Screenshot
17.4.2 Adding visualization plugins
Introduction
As of WEKA version 3.5.3 you can easily add visualization plugins in the
Explorer (Classify panel). This makes it easy to implement custom visualizations,
if the ones WEKA offers are not sufficient. The following examples can be found in
the Examples collection [3] (package wekaexamples.gui.visualize.plugins).
Requirements
• custom visualization class must implement the following interface
  weka.gui.visualize.plugins.VisualizePlugin
• the class must either reside in the following package (visualization classes
  are automatically discovered during run-time)
  weka.gui.visualize.plugins
• or you must list the package this class belongs to in the properties file
  weka/gui/GenericPropertiesCreator.props (or the equivalent in your
  home directory) under the key weka.gui.visualize.plugins.VisualizePlugin.
Implementation
The visualization interface contains the following four methods:
• getMinVersion - This method returns the minimum version (inclusive)
  of WEKA that is necessary to execute the plugin, e.g., 3.5.0.
• getMaxVersion - This method returns the maximum version (exclusive)
  of WEKA that is necessary to execute the plugin, e.g., 3.6.0.
• getDesignVersion - Returns the actual version of WEKA this plugin was
  designed for, e.g., 3.5.1
• getVisualizeMenuItem - The JMenuItem that is returned via this method
  will be added to the plugins menu in the popup in the Explorer. The
  ActionListener for clicking the menu item will most likely open a new
  frame containing the visualized data.
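The meaning of "inclusive" and "exclusive" for the two version bounds can be
made concrete with a small helper. The class below (VersionRangeSketch) is
hypothetical; WEKA performs its own check internally, and this sketch only
assumes that versions are purely numeric and dot-separated.

```java
// Hypothetical helper; WEKA performs its own check internally.
class VersionRangeSketch {

  // Compares dotted version strings such as "3.5.10" numerically,
  // part by part; missing parts count as 0.
  static int compare(String a, String b) {
    String[] pa = a.split("\\.");
    String[] pb = b.split("\\.");
    int n = Math.max(pa.length, pb.length);
    for (int i = 0; i < n; i++) {
      int x = i < pa.length ? Integer.parseInt(pa[i]) : 0;
      int y = i < pb.length ? Integer.parseInt(pb[i]) : 0;
      if (x != y)
        return Integer.compare(x, y);
    }
    return 0;
  }

  // min is inclusive, max is exclusive
  static boolean isCompatible(String current, String min, String max) {
    return compare(current, min) >= 0 && compare(current, max) < 0;
  }
}
```

With a minimum of 3.5.0 and a maximum of 3.6.0, a plugin would be accepted
for 3.5.0 and 3.5.10, but not for 3.6.0 itself.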
Examples
Table with predictions
The PredictionTable.java example simply displays the actual class label and
the one predicted by the classifier. In addition to that, it lists whether it was
an incorrect prediction and the class probability for the correct class label.
Bar plot with probabilities
The PredictionError.java example uses the JMathTools library (needs the
jmathplot.jar [27] in the CLASSPATH) to display a simple bar plot of the
predictions. The correct predictions are displayed in blue, the incorrect ones
in red. In both cases the class probability that the classifier returned for the
correct class label is displayed on the y axis. The x axis is simply the index of
the prediction starting with 0.
Chapter 18
Technical documentation
18.1 ANT
What is ANT? This is how the ANT homepage (http://ant.apache.org/)
defines its tool:
Apache Ant is a Java-based build tool. In theory, it is kind of like Make, but
without Make's wrinkles.
18.1.1 Basics
• the ANT build file is based on XML (http://www.w3.org/XML/)
• the usual name for the build file is:
  build.xml
• invocation - the usual build file need not be specified explicitly if it's in
  the current directory; if no target is specified, the default one is used
  ant [-f <build-file>] [<target>]
• displaying all the available targets of a build file
  ant [-f <build-file>] -projecthelp
18.1.2 Weka and ANT
• a build file for Weka is available from subversion
• some targets of interest
  - clean - Removes the build, dist and reports directories; also any class
    files in the source tree
  - compile - Compile weka and deposit class files in
    ${path_modifier}/build/classes
  - docs - Make javadocs into ${path_modifier}/doc
  - exejar - Create an executable jar file in ${path_modifier}/dist
18.2 CLASSPATH
The CLASSPATH environment variable tells Java where to look for classes.
Since Java does the search in a first-come-first-serve kind of manner, you'll have
to take care where and what to put in your CLASSPATH. I, personally, never
use the environment variable, since I'm often working on a project in different
versions in parallel. The CLASSPATH would just mess up things, if you're not
careful (or just forget to remove an entry). ANT (http://ant.apache.org/)
offers a nice way for building (and separating source code and class files) Java
projects. But still, if you're only working on totally separate projects, it might
be easiest for you to use the environment variable.
18.2.1 Setting the CLASSPATH
In the following we add the mysql-connector-java-5.1.7-bin.jar to our
CLASSPATH variable (this works for any other jar archive) to make it possible to
access MySQL databases via JDBC.
Win32 (2k and XP)
We assume that the mysql-connector-java-5.1.7-bin.jar archive is located
in the following directory:
C:\Program Files\Weka-3-7
In the Control Panel click on System (or right click on My Computer and select
Properties) and then go to the Advanced tab. There you'll find a button called
Environment Variables; click it. Depending on whether you're the only person
using this computer or it's a lab computer shared by many, you can either create
a new system-wide (you're the only user) environment variable or a user-dependent
one (recommended for multi-user machines). Enter the following name for
the variable:
CLASSPATH
and add this value
C:\Program Files\Weka-3-7\mysql-connector-java-5.1.7-bin.jar
If you want to add additional jars, you will have to separate them with the path
separator, the semicolon ; (no spaces!).
Unix/Linux
We make the assumption that the mysql jar is located in the following directory:
/home/johndoe/jars/
Open a shell and execute the following command, depending on the shell you're
using:
bash
export CLASSPATH=$CLASSPATH:/home/johndoe/jars/mysql-connector-java-5.1.7-bin.jar
c shell
setenv CLASSPATH $CLASSPATH:/home/johndoe/jars/mysql-connector-java-5.1.7-bin.jar
Cygwin
The process is the same as on Unix/Linux systems, but since the host system is Win32
and therefore the Java installation is also a Win32 application, you'll have to use
the semicolon ; as separator for several jars.
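To check what actually ended up on the class path, the entries can be listed from within Java. The following is a minimal sketch that relies only on the standard java.class.path system property and File.pathSeparator (; on Win32, : on Unix/Linux); the class name ClassPathInspector is made up for this illustration and is not part of Weka.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ClassPathInspector {

    /** Splits a class path string on the given separator, skipping empty entries. */
    static List<String> split(String classPath, String separator) {
        List<String> entries = new ArrayList<>();
        for (String entry : classPath.split(Pattern.quote(separator))) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // java.class.path holds the class path the JVM was started with
        String classPath = System.getProperty("java.class.path", "");
        for (String entry : split(classPath, File.pathSeparator)) {
            // entries that do not exist are a common cause of NoClassDefFoundError
            String status = new File(entry).exists() ? "ok     " : "MISSING";
            System.out.println(status + " " + entry);
        }
    }
}
```

Running this class with the same CLASSPATH as Weka shows whether the MySQL jar is listed and whether its path exists.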
18.2.2 RunWeka.bat
From version 3.5.4, Weka is launched differently under Win32. The simple batch
file got replaced by a central launcher class (= RunWeka.class) in combination
with an INI-file (= RunWeka.ini). The RunWeka.bat only calls this launcher
class now with the appropriate parameters. With this launcher approach it is
possible to define different launch scenarios, but with the advantage of having
placeholders, e.g., for the max heap size, which enables one to change the
memory for all setups easily.
The key of a command in the INI-file is prefixed with cmd_, all other keys
are considered placeholders:
cmd_blah=java ... command blah
bloerk= ... placeholder bloerk
In a command, a placeholder is surrounded by #:
cmd_blah=java #bloerk#
Note: The key wekajar is determined by the -w parameter with which the
launcher class is called.
By default, the following commands are predefined:
default
The default Weka start, without a terminal window.
console
For debugging purposes. Useful as Weka gets started from a terminal
window.
explorer
The command that's executed if one double-clicks on an ARFF or XRFF
file.
In order to change the maximum heap size for all those commands, one only
has to modify the maxheap placeholder.
For more information check out the comments in the INI-file.
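The placeholder mechanism described in this section can be illustrated with a small, self-contained sketch. It is not the code of the RunWeka launcher class; the class name IniPlaceholderSketch, the example command and the example values are assumptions made for this illustration. It only shows the idea of replacing every #key# in a command with the value of the corresponding placeholder.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IniPlaceholderSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("#([A-Za-z0-9_]+)#");

    /** Replaces every #key# with the value of that key; unknown keys are left untouched. */
    static String expand(String command, Map<String, String> placeholders) {
        Matcher m = PLACEHOLDER.matcher(command);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = placeholders.get(m.group(1));
            String replacement = (value != null) ? value : m.group(0);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> placeholders = new LinkedHashMap<>();
        placeholders.put("maxheap", "256m");     // hypothetical value
        placeholders.put("wekajar", "weka.jar"); // normally determined by the -w parameter
        // hypothetical command; the real ones are defined in RunWeka.ini
        String cmd = "java -Xmx#maxheap# -classpath #wekajar# weka.gui.GUIChooser";
        System.out.println(expand(cmd, placeholders));
    }
}
```

Because all commands refer to the same placeholder, changing the single maxheap value changes the heap size of every launch scenario.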
18.2.3 java -jar
When you're using the Java interpreter with the -jar option, be aware of the
fact that it overrides your CLASSPATH rather than augmenting it. Out of
convenience, people often only use the -jar option to skip the declaration of the main
class to start. But as soon as you need more jars, e.g., for database access, you
need to use the -classpath option and specify the main class.
Here's once again how you start the Weka Main-GUI with your current CLASSPATH
variable (and 128MB for the JVM):
Linux
java -Xmx128m -classpath $CLASSPATH:weka.jar weka.gui.Main
Win32
java -Xmx128m -classpath "%CLASSPATH%;weka.jar" weka.gui.Main
18.3 Subversion
18.3.1 General
The Weka Subversion repository is accessible and browseable via the following
URL:
https://svn.scms.waikato.ac.nz/svn/weka/
A Subversion repository usually has the following layout:
root
|
+- trunk
|
+- tags
|
+- branches
Where trunk contains the main trunk of the development, tags contains snapshots in
time of the repository (e.g., when a new version got released) and branches contains
development branches that forked off the main trunk at some stage (e.g., legacy
versions that still get bugfixed).
18.3.2 Source code
The latest version of the Weka source code can be obtained with this URL:
https://svn.scms.waikato.ac.nz/svn/weka/trunk/weka
If you want to obtain the source code of the book version, use this URL:
https://svn.scms.waikato.ac.nz/svn/weka/branches/book2ndEd-branch/weka
18.3.3 JUnit
The latest version of Weka's JUnit tests can be obtained with this URL:
https://svn.scms.waikato.ac.nz/svn/weka/trunk/tests
And if you want to obtain the JUnit tests of the book version, use this URL:
https://svn.scms.waikato.ac.nz/svn/weka/branches/book2ndEd-branch/tests
18.3.4 Specific version
Whenever a release of Weka is generated, the repository gets tagged:
dev-X-Y-Z
the tag for a release of the developer version, e.g., dev-3-7-0 for Weka 3.7.0
https://svn.scms.waikato.ac.nz/svn/weka/tags/dev-3-7-0
stable-X-Y-Z
the tag for a release of the book version, e.g., stable-3-4-15 for Weka 3.4.15
https://svn.scms.waikato.ac.nz/svn/weka/tags/stable-3-4-15
18.3.5 Clients
Commandline
Modern Linux distributions already come with Subversion either pre-installed
or easily installed via the package manager of the distribution. If that shouldn't
be the case, or if you are using Windows, you have to download the appropriate
client from the Subversion homepage (http://subversion.tigris.org/).
A checkout of the current developer version of Weka looks like this:
svn co https://svn.scms.waikato.ac.nz/svn/weka/trunk/weka
SmartSVN
SmartSVN (http://smartsvn.com/) is a Java-based, graphical, cross-platform
client for Subversion. Though it is not open-source/free software, the foundation
version is free of charge.
TortoiseSVN
Under Windows, TortoiseCVS was a CVS client, neatly integrated into the
Windows Explorer. TortoiseSVN (http://tortoisesvn.tigris.org/) is the
equivalent for Subversion.
18.4 GenericObjectEditor
18.4.1 Introduction
As of version 3.4.4 it is possible for WEKA to dynamically discover classes at
runtime (rather than using only those specified in the GenericObjectEditor.props
(GOE) file). In some versions (3.5.8, 3.6.0) this facility was not enabled by
default, as it is a bit slower than the GOE file approach and, furthermore, does
not function in environments that do not have a CLASSPATH (e.g., application
servers). Later versions (3.6.1, 3.7.0) enabled the dynamic discovery again, as
WEKA can now distinguish between being a standalone Java application or
being run in a non-CLASSPATH environment.
If you wish to enable or disable dynamic class discovery, the relevant file
to edit is GenericPropertiesCreator.props (GPC). You can obtain this file
either from the weka.jar or weka-src.jar archive. Open one of these files
with an archive manager that can handle ZIP files (Windows users can use
7-Zip (http://7-zip.org/) for this) and navigate to the weka/gui
directory, where the GPC file is located. All that is required is to change the
UseDynamic property in this file from false to true (to enable it) or the
other way round (to disable it). After changing the file, you just place it in
your home directory. In order to find out the location of your home directory,
do the following:
Linux/Unix
Open a terminal
run the following command:
echo $HOME
Windows
Open a command prompt
run the following command:
echo %USERPROFILE%
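The home directory can also be queried from within Java. The sketch below assumes that the standard user.home system property points to the same directory as $HOME or %USERPROFILE%, which is usually, but not necessarily always, the case; the class name HomeDirectorySketch is made up for this illustration.

```java
import java.io.File;

final class HomeDirectorySketch {

    /** Returns the location a properties file would have in the user's home directory. */
    static File propsFileInHome(String fileName) {
        return new File(System.getProperty("user.home"), fileName);
    }

    public static void main(String[] args) {
        File gpc = propsFileInHome("GenericPropertiesCreator.props");
        // reports whether a customized GPC file has already been placed there
        System.out.println(gpc.getAbsolutePath() + (gpc.exists() ? " (found)" : " (not found)"));
    }
}
```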
If dynamic class discovery is too slow, e.g., due to an enormous CLASSPATH,
you can generate a new GenericObjectEditor.props file and then turn dynamic
class discovery off again. It is assumed that you already placed the GPC
file in your home directory (see steps above) and that the weka.jar jar archive
with the WEKA classes is in your CLASSPATH (otherwise you have to add it
to the java call using the -classpath option).
For generating the GOE file, execute the following steps:
generate a new GenericObjectEditor.props file using the following command:
Linux/Unix
java weka.gui.GenericPropertiesCreator \
$HOME/GenericPropertiesCreator.props \
$HOME/GenericObjectEditor.props
Windows (command must be in one line)
java weka.gui.GenericPropertiesCreator
%USERPROFILE%\GenericPropertiesCreator.props
%USERPROFILE%\GenericObjectEditor.props
edit the GenericPropertiesCreator.props file in your home directory
and set UseDynamic to false.
A limitation of the GOE prior to 3.4.4 was that additional classifiers, filters,
etc., had to fit into the same package structure as the already existing ones,
i.e., all had to be located below weka. WEKA can now display multiple class
hierarchies in the GUI, which makes adding new functionality quite easy, as we
will see later in an example (it is not restricted to classifiers only, but also works
with all the other entries in the GPC file).
18.4.2 File Structure
The structure of the GOE file is a list of key-value pairs, each separated by an
equals sign. The value is a comma-separated list of classes that are all derived
from the superclass/superinterface key. The GPC is slightly different: instead of
declaring all the classes/interfaces, one need only specify all the packages that
descendants are located in (only non-abstract ones are then listed). E.g., the
weka.classifiers.Classifier entry in the GOE file looks like this:
weka.classifiers.Classifier=\
weka.classifiers.bayes.AODE,\
weka.classifiers.bayes.BayesNet,\
weka.classifiers.bayes.ComplementNaiveBayes,\
weka.classifiers.bayes.NaiveBayes,\
weka.classifiers.bayes.NaiveBayesMultinomial,\
weka.classifiers.bayes.NaiveBayesSimple,\
weka.classifiers.bayes.NaiveBayesUpdateable,\
weka.classifiers.functions.LeastMedSq,\
weka.classifiers.functions.LinearRegression,\
weka.classifiers.functions.Logistic,\
...
The entry producing the same output for the classifiers in the GPC looks like
this (7 lines instead of over 70 for WEKA 3.4.4):
weka.classifiers.Classifier=\
weka.classifiers.bayes,\
weka.classifiers.functions,\
weka.classifiers.lazy,\
weka.classifiers.meta,\
weka.classifiers.trees,\
weka.classifiers.rules
18.4.3 Exclusion
It may not always be desirable to list all the classes that can be found along the
CLASSPATH. Sometimes, classes cannot be declared abstract but still shouldn't
be listed in the GOE. For that reason one can list classes, interfaces and superclasses
for certain packages to be excluded from display. This exclusion is done with
the following file:
weka/gui/GenericPropertiesCreator.excludes
The format of this properties file is fairly simple:
<key>=<prefix>:<class>[,<prefix>:<class>]
Where the <key> corresponds to a key in the GenericPropertiesCreator.props
file and the <prefix> can be one of the following:
S Superclass
any class derived from this will be excluded
I Interface
any class implementing this interface will be excluded
C Class
exactly this class will be excluded
Here are a few examples:
# exclude all ResultListeners that also implement the ResultProducer interface
# (all ResultProducers do that!)
weka.experiment.ResultListener=\
I:weka.experiment.ResultProducer
# exclude J48 and all SingleClassifierEnhancers
weka.classifiers.Classifier=\
C:weka.classifiers.trees.J48,\
S:weka.classifiers.SingleClassifierEnhancer
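How such an excludes value could be interpreted is sketched below. This is not WEKA's implementation; the class ExcludesSketch and its methods are made up for this illustration. It is assumed here that the S and I prefixes both mean "assignable to the named type" (which also matches the named type itself), and that an unknown class name excludes nothing. The example uses JDK classes so that it runs without weka.jar.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ExcludesSketch {

    /** One <prefix>:<class> pair; prefix is 'S', 'I' or 'C'. */
    static final class Rule {
        final char prefix;
        final String className;

        Rule(char prefix, String className) {
            this.prefix = prefix;
            this.className = className;
        }
    }

    /** Parses a value of the form <prefix>:<class>[,<prefix>:<class>]. */
    static List<Rule> parse(String value) {
        List<Rule> rules = new ArrayList<>();
        for (String part : value.split(",")) {
            String item = part.trim();
            if (item.isEmpty()) {
                continue;
            }
            if (item.indexOf(':') != 1 || "SIC".indexOf(item.charAt(0)) < 0) {
                throw new IllegalArgumentException("Not of the form <prefix>:<class>: " + item);
            }
            rules.add(new Rule(item.charAt(0), item.substring(2)));
        }
        return rules;
    }

    /** Returns true if the rule excludes the candidate class. */
    static boolean excludes(Rule rule, Class<?> candidate) {
        Class<?> named;
        try {
            named = Class.forName(rule.className);
        } catch (ClassNotFoundException e) {
            return false; // assumption: unknown classes exclude nothing
        }
        if (rule.prefix == 'C') {
            return candidate.equals(named); // exactly this class
        }
        return named.isAssignableFrom(candidate); // 'S' and 'I'
    }

    public static void main(String[] args) {
        for (Rule r : parse("C:java.util.ArrayList,I:java.util.Queue")) {
            System.out.println(r.prefix + ":" + r.className
                    + " excludes LinkedList: " + excludes(r, LinkedList.class));
        }
    }
}
```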
18.4.4 Class Discovery
Unlike the Class.forName(String) method that grabs the first class it can
find in the CLASSPATH, and therefore fixes the location of the package it found
the class in, the dynamic discovery examines the complete CLASSPATH you are
starting the Java Virtual Machine (= JVM) with. This means that you can
have several parallel directories with the same WEKA package structure, e.g.,
the standard release of WEKA in one directory (/distribution/weka.jar)
and another one with your own classes (/development/weka/...), and display
all of the classifiers in the GUI. In case of a name conflict, i.e., two directories
contain the same class, the first one that can be found is used. In a nutshell,
your java call of the GUIChooser can look like this:
java -classpath "/development:/distribution/weka.jar" weka.gui.GUIChooser
Note: Windows users have to replace the : with ; and the forward slashes
with backslashes.
18.4.5 Multiple Class Hierarchies
If you are developing your own framework but still want to use your classifiers
within WEKA, this was not possible with WEKA prior to 3.4.4. Starting
with release 3.4.4 it is possible to have multiple class hierarchies
displayed in the GUI. If you have developed a modified version of NaiveBayes, let
us call it DummyBayes, and it is located in the package dummy.classifiers,
then you will have to add this package to the classifiers list in the GPC file like
this:
weka.classifiers.Classifier=\
weka.classifiers.bayes,\
weka.classifiers.functions,\
weka.classifiers.lazy,\
weka.classifiers.meta,\
weka.classifiers.trees,\
weka.classifiers.rules,\
dummy.classifiers
Your java call for the GUIChooser might look like this:
java -classpath "weka.jar:dummy.jar" weka.gui.GUIChooser
Starting up the GUI you will now have another root node in the tree view of the
classifiers, called root, and below it the weka and the dummy package hierarchy
as you can see here:
18.4.6 Capabilities
Version 3.5.3 of Weka introduced the notion of Capabilities. Capabilities basically
list what kind of data a certain object can handle, e.g., one classifier can
handle numeric classes, but another cannot. In case a class supports capabilities,
the additional buttons Filter... and Remove filter will be available in the
GOE. The Filter... button pops up a dialog which lists all available Capabilities:
One can then choose those capabilities an object, e.g., a classifier, should have.
If one is looking for a scheme for a classification problem, then the Nominal class
Capability can be selected. On the other hand, if one needs a regression scheme, then the
Capability Numeric class can be selected. This filtering mechanism makes the
search for an appropriate learning scheme easier. After applying that filter, the
tree with the objects will be displayed again and lists all objects that can handle
all the selected Capabilities in black, the ones that cannot in grey and the ones that
might be able to handle them in blue (e.g., meta classifiers which depend on their
base classifier(s)).
18.5 Properties
A properties file is a simple text file with this structure:
<key>=<value>
Comments start with the hash sign #.
To make a rather long property line more readable, one can use a backslash to
continue on the next line. The Filter property, e.g., looks like this:
weka.filters.Filter= \
weka.filters.supervised.attribute, \
weka.filters.supervised.instance, \
weka.filters.unsupervised.attribute, \
weka.filters.unsupervised.instance
18.5.1 Precedence
The Weka property files (extension .props) are searched for in the following
order:
current directory
the user's home directory (*nix $HOME, Windows %USERPROFILE%)
the class path (normally the weka.jar file)
If Weka encounters those files it only supplements the properties, never overrides
them. In other words, a property in the property file of the current directory
has a higher precedence than the one in the user's home directory.
Note: Under Cygwin (http://cygwin.com/), the home directory is still the
Windows one, since the Java installation will still be one for Windows.
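The "supplement, never override" rule can be sketched with plain java.util.Properties. This is an illustration of the precedence described above, not WEKA's own loading code; the class PropsPrecedenceSketch, the key Extra and the value some.other.LookAndFeel are hypothetical. In practice each Properties object would be filled with Properties.load from the corresponding location.

```java
import java.util.Properties;

final class PropsPrecedenceSketch {

    /**
     * Merges property sets given in order of decreasing precedence:
     * a key is taken from the first set that defines it, later sets
     * only supplement keys that are still missing.
     */
    static Properties merge(Properties... byPrecedence) {
        Properties result = new Properties();
        for (Properties props : byPrecedence) {
            for (String key : props.stringPropertyNames()) {
                if (result.getProperty(key) == null) {
                    result.setProperty(key, props.getProperty(key));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Properties currentDir = new Properties();
        currentDir.setProperty("Theme", "javax.swing.plaf.metal.MetalLookAndFeel");

        Properties home = new Properties();
        home.setProperty("Theme", "some.other.LookAndFeel"); // hypothetical value
        home.setProperty("Extra", "only defined in home");   // hypothetical key

        Properties merged = merge(currentDir, home);
        System.out.println(merged.getProperty("Theme")); // value from the current directory
        System.out.println(merged.getProperty("Extra")); // supplemented from the home directory
    }
}
```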
18.5.2 Examples
weka/gui/LookAndFeel.props
weka/gui/GenericPropertiesCreator.props
weka/gui/beans/Beans.props
18.6 XML
Weka now supports XML (http://www.w3c.org/XML/) (eXtensible Markup
Language) in several places.
18.6.1 Command Line
WEKA now allows Classifiers and Experiments to be started using an -xml
option followed by a filename to retrieve the command line options from the
XML file instead of the command line.
For simple classifiers such as J48 this looks like overkill, but as soon
as one uses Meta-Classifiers or Meta-Meta-Classifiers the handling gets tricky
and one spends a lot of time looking for missing quotes. With the hierarchical
structure of XML files it is simple to plug in other classifiers by just exchanging
tags.
The DTD for the XML options is quite simple:
<!DOCTYPE options
[
<!ELEMENT options (option)*>
<!ATTLIST options type CDATA "classifier">
<!ATTLIST options value CDATA "">
<!ELEMENT option (#PCDATA | options)*>
<!ATTLIST option name CDATA #REQUIRED>
<!ATTLIST option type (flag | single | hyphens | quotes) "single">
]
>
The type attribute of the option tag needs some explanation. There are currently
four different types of options in WEKA:
flag
The simplest option that takes no arguments, e.g., the -V flag for
inverting a selection.
<option name="V" type="flag"/>
single
The option takes exactly one parameter, directly following the option,
e.g., for specifying the training file with -t somefile.arff. Here
the parameter value is just put between the opening and closing tag. Since
single is the default value for the type attribute we don't need to specify it
explicitly.
<option name="t">somefile.arff</option>
hyphens
Meta-Classifiers like AdaBoostM1 take another classifier as option with
the -W option, where the options for the base classifier follow after the
--. And here is where the fun starts: where to put parameters for the
base classifier if the Meta-Classifier itself is a base classifier for another
Meta-Classifier?
E.g., -W weka.classifiers.trees.J48 -- -C 0.001 becomes this:
<option name="W" type="hyphens">
  <options type="classifier" value="weka.classifiers.trees.J48">
    <option name="C">0.001</option>
  </options>
</option>
Internally, all the options enclosed by the options tag are pushed to the
end after the -- if one transforms the XML into a command line string.
quotes
A Meta-Classifier like Stacking can take several -B options, where each
single one encloses other options in quotes (this itself can contain a
Meta-Classifier!). From -B "weka.classifiers.trees.J48" we then get
this XML:
<option name="B" type="quotes">
  <options type="classifier" value="weka.classifiers.trees.J48"/>
</option>
With the XML representation one doesn't have to worry anymore about
the level of quotes one is using and therefore doesn't have to care about
the correct escaping (i.e., " ... \" ... \" ... ") since this is done
automatically.
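The four option types can be made concrete with a small converter from options XML back to a command line string. The following is a sketch under stated assumptions, not the code of weka.core.xml.XMLOptions: the class XmlOptionsSketch and its methods are made up for this illustration, only the JDK's DOM parser is used, a missing type attribute is treated as single, and the exact quoting that WEKA itself produces may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class XmlOptionsSketch {

    /** Parses the XML and returns the contained options as one command line string. */
    static String toCommandLine(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            return render(root);
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse options XML", e);
        }
    }

    /** Renders the option children of one options element. */
    private static String render(Element options) {
        StringBuilder head = new StringBuilder(); // options of this level
        StringBuilder tail = new StringBuilder(); // options pushed behind "--"
        for (Node n = options.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element) || !n.getNodeName().equals("option")) {
                continue;
            }
            Element option = (Element) n;
            String type = option.getAttribute("type");
            if (type.isEmpty()) {
                type = "single"; // default according to the DTD
            }
            append(head, "-" + option.getAttribute("name"));
            Element nested = firstChildElement(option);
            if (type.equals("single")) {
                append(head, option.getTextContent().trim());
            } else if (type.equals("hyphens") && nested != null) {
                append(head, nested.getAttribute("value"));
                append(tail, render(nested)); // goes to the end, after "--"
            } else if (type.equals("quotes") && nested != null) {
                String inner = nested.getAttribute("value");
                String innerOptions = render(nested);
                if (!innerOptions.isEmpty()) {
                    inner += " " + innerOptions;
                }
                // escape backslashes and quotes of the nested level
                append(head, "\"" + inner.replace("\\", "\\\\").replace("\"", "\\\"") + "\"");
            }
            // type "flag": nothing follows the option name
        }
        if (tail.length() > 0) {
            append(head, "-- " + tail);
        }
        return head.toString();
    }

    private static void append(StringBuilder sb, String s) {
        if (s.isEmpty()) {
            return;
        }
        if (sb.length() > 0) {
            sb.append(' ');
        }
        sb.append(s);
    }

    private static Element firstChildElement(Element parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<options>"
                + "<option name=\"W\" type=\"hyphens\">"
                + "<options type=\"classifier\" value=\"weka.classifiers.trees.J48\">"
                + "<option name=\"C\">0.001</option>"
                + "</options></option></options>";
        System.out.println(toCommandLine(xml)); // -W weka.classifiers.trees.J48 -- -C 0.001
    }
}
```

The recursion mirrors the nesting of the options tags: each level renders its own options first and appends the options of a hyphens-type base classifier after the --.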
And if we now put it all together, a more complicated command line (java and
the CLASSPATH omitted) is represented by the following XML:
<options type="class" value="weka.classifiers.meta.Stacking">
  <option name="B" type="quotes">
    <options type="classifier" value="weka.classifiers.meta.AdaBoostM1">
      <option name="W" type="hyphens">
        <options type="classifier" value="weka.classifiers.trees.J48">
          <option name="C">0.001</option>
        </options>
      </option>
    </options>
  </option>
  <option name="B" type="quotes">
    <options type="classifier" value="weka.classifiers.meta.Bagging">
      <option name="W" type="hyphens">
        <options type="classifier" value="weka.classifiers.meta.AdaBoostM1">
          <option name="W" type="hyphens">
            <options type="classifier" value="weka.classifiers.trees.J48"/>
          </option>
        </options>
      </option>
    </options>
  </option>
  <option name="B" type="quotes">
    <options type="classifier" value="weka.classifiers.meta.Stacking">
      <option name="B" type="quotes">
        <options type="classifier" value="weka.classifiers.trees.J48"/>
      </option>
    </options>
  </option>
  <option name="t">test/datasets/hepatitis.arff</option>
</options>
Note: The type and value attributes of the outermost options tag are not used
while reading the parameters. They are merely for documentation purposes, so that
one knows which class was actually started from the command line.
Responsible Class(es):
weka.core.xml.XMLOptions
18.6.2 Serialization of Experiments
It is now possible to serialize the Experiments from the WEKA Experimenter
not only in the proprietary binary format Java offers with serialization (with
this you run into problems trying to read old experiments with a newer WEKA
version, due to different SerialUIDs), but also in XML. There are currently two
different ways to do this:
built-in
The built-in serialization captures only the necessary information of an
experiment and doesn't serialize anything else. Its sole purpose is to save
the setup of a specific experiment and it can therefore not store any built
models. Thanks to this limitation we'll never run into problems with
mismatching SerialUIDs.
This kind of serialization is always available and can be selected via a
Filter (*.xml) in the Save/Open-Dialog of the Experimenter.
The DTD is very simple and looks like this (for version 3.4.5):
<!DOCTYPE object[
<!ELEMENT object (#PCDATA | object)*>
<!ATTLIST object name CDATA #REQUIRED>
<!ATTLIST object class CDATA #REQUIRED>
<!ATTLIST object primitive CDATA "no">
<!ATTLIST object array CDATA "no">
<!ATTLIST object null CDATA "no">
<!ATTLIST object version CDATA "3.4.5">
]>
Prior to versions 3.4.5 and 3.5.0 it looked like this:
<!DOCTYPE object
[
<!ELEMENT object (#PCDATA | object)*>
<!ATTLIST object name CDATA #REQUIRED>
<!ATTLIST object class CDATA #REQUIRED>
<!ATTLIST object primitive CDATA "yes">
<!ATTLIST object array CDATA "no">
]
>
Responsible Class(es):
weka.experiment.xml.XMLExperiment
for general Serialization:
weka.core.xml.XMLSerialization
weka.core.xml.XMLBasicSerialization
KOML (http://old.koalateam.com/xml/serialization/)
The Koala Object Markup Language (KOML) is published under the
LGPL (http://www.gnu.org/copyleft/lgpl.html) and is an alternative
way of serializing and deserializing Java Objects in an XML file.
Like the normal serialization it serializes everything into XML via an
ObjectOutputStream, including the SerialUID of each class. Even though we
have the same problems with mismatching SerialUIDs it is at least possible
to edit the XML files by hand and replace the offending IDs with the
new ones.
In order to use KOML one only has to ensure that the KOML classes
are in the CLASSPATH with which the Experimenter is launched. As
soon as KOML is present another Filter (*.koml) will show up in the
Save/Open-Dialog.
The DTD for KOML can be found at http://old.koalateam.com/xml/koml12.dtd
Responsible Class(es):
weka.core.xml.KOML
The experiment class can of course read those XML files if passed as input or output
file (see options of weka.experiment.Experiment and weka.experiment.RemoteExperiment).
18.6.3 Serialization of Classifiers
The options for models of a classifier, -l for the input model and -d for the output
model, now also support XML serialized files. Here we have to differentiate
between two different formats:
built-in
The built-in serialization captures only the options of a classifier but not
the built model. With the -l option one still has to provide a training file, since
we only retrieve the options from the XML file. It is possible to add more
options on the command line, but no check is performed whether they
collide with the ones stored in the XML file.
The file is expected to end with .xml.
KOML
Since the KOML serialization captures everything of a Java Object we can
use it just like the normal Java serialization.
The file is expected to end with .koml.
The built-in serialization can be used in the Experimenter for loading/saving
options from algorithms that have been added to a Simple Experiment. Unfortunately
it is not possible to create a hierarchical structure like the one mentioned
in Section 18.6.1. This is because of the loss of information caused by the
getOptions() method of classifiers: it returns only a flat String-Array and not
a tree structure.
Responsible Class(es):
weka.core.xml.KOML
weka.classifiers.xml.XMLClassifier
18.6.4 Bayesian Networks
The GraphVisualizer (weka.gui.graphvisualizer.GraphVisualizer) can save
graphs into the Interchange Format
(http://www-2.cs.cmu.edu/~fgcozman/Research/InterchangeFormat/) for
Bayesian Networks (BIF). If started from command line with an XML filename
as first parameter and not from the Explorer it can display the given file directly.
The DTD for BIF is this:
<!DOCTYPE BIF [
<!ELEMENT BIF ( NETWORK )*>
<!ATTLIST BIF VERSION CDATA #REQUIRED>
<!ELEMENT NETWORK ( NAME, ( PROPERTY | VARIABLE | DEFINITION )* )>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT VARIABLE ( NAME, ( OUTCOME | PROPERTY )* ) >
<!ATTLIST VARIABLE TYPE (nature|decision|utility) "nature">
<!ELEMENT OUTCOME (#PCDATA)>
<!ELEMENT DEFINITION ( FOR | GIVEN | TABLE | PROPERTY )* >
<!ELEMENT FOR (#PCDATA)>
<!ELEMENT GIVEN (#PCDATA)>
<!ELEMENT TABLE (#PCDATA)>
<!ELEMENT PROPERTY (#PCDATA)>
]>
Responsible Class(es):
weka.classifiers.bayes.BayesNet#toXMLBIF03()
weka.classifiers.bayes.net.BIFReader
weka.gui.graphvisualizer.BIFParser
18.6.5 XRFF files
With Weka 3.5.4 a new, more feature-rich, XML-based data format got introduced:
XRFF. For more information, please see Chapter 10.
Chapter 19
Other resources
19.1 Mailing list
The WEKA Mailing list can be found here:
http://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
for subscribing to/unsubscribing from the list
https://list.scms.waikato.ac.nz/pipermail/wekalist/
(Mirrors: http://news.gmane.org/gmane.comp.ai.weka,
http://www.nabble.com/WEKA-f435.html)
for searching previously posted messages
Before posting, please read the Mailing List Etiquette:
http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
19.2 Troubleshooting
Here are a few things that are useful to know when you are having trouble
installing or running Weka successfully on your machine.
NB these java commands refer to ones executed in a shell (bash, command
prompt, etc.) and NOT to commands executed in the SimpleCLI.
19.2.1 Weka download problems
When you download Weka, make sure that the resulting file size is the same
as on our webpage. Otherwise things won't work properly. Apparently some
web browsers have trouble downloading Weka.
19.2.2 OutOfMemoryException
Most Java virtual machines only allocate a certain maximum amount of memory
to run Java programs. Usually this is much less than the amount of RAM in
your computer. However, you can extend the memory available for the virtual
machine by setting appropriate options. With Sun's JDK, for example, you can
use
java -Xmx100m ...
to set the maximum Java heap size to 100MB. For more information about these
options see http://java.sun.com/docs/hotspot/VMOptions.html.
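To verify that a changed heap setting has actually taken effect, the limit can be printed from within the JVM. The sketch below uses only the standard Runtime.maxMemory() call; the class name HeapInfoSketch is made up for this illustration. The reported value is typically close to, but not exactly, the value passed with -Xmx.

```java
final class HeapInfoSketch {

    /** Returns the maximum amount of memory the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // start with e.g. "java -Xmx100m HeapInfoSketch" to see the effect of the option
        System.out.println("Maximum heap: " + maxHeapMegabytes() + " MB");
    }
}
```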
19.2.2.1 Windows
Book version
You have to modify the JVM invocation in the RunWeka.bat batch file in your
installation directory.
Developer version
up to Weka 3.5.2
just like the book version.
Weka 3.5.3
You have to modify the link in the Windows Start menu, if you're starting
the console-less Weka (only the link with console in its name executes the
RunWeka.bat batch file)
Weka 3.5.4 and higher
Due to the new launching scheme, you no longer
modify the batch file, but the RunWeka.ini file. In that particular file,
you'll have to change the maxheap placeholder. See section 18.2.2.
19.2.3 Mac OSX
In your Weka installation directory (weka-3-x-y.app) locate the Contents
subdirectory and edit the Info.plist file. Near the bottom of the file you should
see some text like:
<key>VMOptions</key>
<string>-Xmx256M</string>
Alter the 256M to something higher.
19.2.4 StackOverflowError
Try increasing the stack of your virtual machine. With Sun's JDK you can use
this command to increase the stack size:
java -Xss512k ...
to set the maximum Java stack size to 512KB. If that is still not sufficient, slowly
increase it.
19.2.5 just-in-time (JIT) compiler
For maximum enjoyment, use a virtual machine that incorporates a just-in-time
compiler. This can speed things up quite significantly. Note also that
there can be large differences in execution time between different virtual
machines.
19.2.6 CSV file conversion
Either load the CSV file in the Explorer or use the CSV converter on the
commandline as follows:
java weka.core.converters.CSVLoader filename.csv > filename.arff
19.2.7 ARFF file doesn't load
One way to figure out why ARFF files are failing to load is to give them to the
Instances class. At the command line type the following:
java weka.core.Instances filename.arff
where you substitute filename for the actual name of your file. This should
return an error if there is a problem reading the file, or show some statistics if
the file is ok. The error message you get should give some indication of what is
wrong.
19.2.8 Spaces in labels of ARFF files
A common problem people have with ARFF files is that labels can only have
spaces if they are enclosed in quotes, i.e. a label such as:
some value
should be written either 'some value' or "some value" in the file.
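When ARFF files are generated by a program, the quoting can be applied automatically. The helper below is a deliberately minimal sketch (the class ArffLabelSketch is made up for this illustration): it only handles the space case discussed above and does not escape labels that themselves contain quote characters or other special characters.

```java
final class ArffLabelSketch {

    /** Wraps the label in single quotes if it contains a space, otherwise returns it unchanged. */
    static String quoteIfNeeded(String label) {
        if (label.indexOf(' ') < 0) {
            return label;
        }
        return "'" + label + "'";
    }

    public static void main(String[] args) {
        System.out.println(quoteIfNeeded("some value")); // 'some value'
        System.out.println(quoteIfNeeded("yes"));        // yes
    }
}
```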
19.2.9 CLASSPATH problems
Having problems getting Weka to run from a DOS/UNIX command prompt?
Getting java.lang.NoClassDefFoundError exceptions? Most likely your CLASSPATH
environment variable is not set correctly - it needs to point to the
weka.jar file that you downloaded with Weka (or the parent of the Weka directory
if you have extracted the jar). Under DOS this can be achieved with:
set CLASSPATH=c:\weka-3-4\weka.jar;%CLASSPATH%
Under UNIX/Linux something like:
export CLASSPATH=/home/weka/weka.jar:$CLASSPATH
An easy way to avoid setting the variable is to specify the CLASSPATH
when calling Java. For example, if the jar file is located at c:\weka-3-4\weka.jar
you can use:
java -cp c:\weka-3-4\weka.jar weka.classifiers... etc.
See also Section 18.2.
19.2.10 Instance ID
People often want to tag their instances with identifiers, so they can keep
track of them and the predictions made on them.
19.2.10.1 Adding the ID
Adding a new ID attribute is really easy: one only needs to run the AddID filter
over the dataset and it's done. Here's an example (at a DOS/Unix command
prompt):
java weka.filters.unsupervised.attribute.AddID
-i data_without_id.arff
-o data_with_id.arff
(all on a single line).
Note: the AddID filter adds a numeric attribute, not a String attribute
to the dataset. If you want to remove this ID attribute for the classifier in a
FilteredClassifier environment again, use the Remove filter instead of the
RemoveType filter (same package).
19.2.10.2 Removing the ID
If you run from the command line you can use the -p option to output predictions
plus any other attributes you are interested in. So it is possible to have a
string attribute in your data that acts as an identifier. A problem is that most
classifiers don't like String attributes, but you can get around this by using the
RemoveType filter (this removes String attributes by default).
Here's an example. Let's say you have a training file named train.arff,
a testing file named test.arff, and they have an identifier String attribute
as their 5th attribute. You can get the predictions from J48 along with the
identifier strings by issuing the following command (at a DOS/Unix command
prompt):
java weka.classifiers.meta.FilteredClassifier
-F weka.filters.unsupervised.attribute.RemoveType
-W weka.classifiers.trees.J48
-t train.arff -T test.arff -p 5
(all on a single line).
If you want, you can redirect the output to a file by adding > output.txt
to the end of the line.
In the Explorer GUI you could try a similar trick of using the String attribute
identifiers here as well. Choose the FilteredClassifier, with RemoveType as
the filter, and whatever classifier you prefer. When you visualize the results you
will need to click through each instance to see the identifier listed for each.
19.2.11 Visualization
Access to visualization from the ClassifierPanel, ClusterPanel and AttributeSelection
panel is available from a popup menu. Click the right mouse button
over an entry in the Result list to bring up the menu. You will be presented with
options for viewing or saving the text output and, depending on the scheme,
further options for visualizing errors, clusters, trees etc.
19.2.12 Memory consumption and Garbage collector
The Explorer and Experimenter can print how much memory is available and
run the garbage collector. Just right-click the Status area in the
Explorer/Experimenter.
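The same kind of information can be obtained from any Java program through
the standard java.lang.Runtime class. The following is a minimal sketch, not
Weka's own implementation: the class name MemoryReport and its method names are
hypothetical, and the output format is an assumption chosen for illustration.

```java
import java.util.Locale;

/** Illustrative helper (hypothetical name): reports JVM heap usage via java.lang.Runtime. */
class MemoryReport {

    /** Bytes currently in use: heap reserved by the JVM minus the unused part of it. */
    static long usedBytes(Runtime rt) {
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Formats the used heap and the maximum heap the JVM may grow to, in megabytes. */
    static String describe(Runtime rt) {
        double mb = 1024.0 * 1024.0;
        return String.format(Locale.ROOT, "Memory (used/max): %.1f MB / %.1f MB",
                usedBytes(rt) / mb, rt.maxMemory() / mb);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println(describe(rt));
        // System.gc() is only a request; the JVM is free to ignore it.
        System.gc();
        System.out.println(describe(rt));
    }
}
```

The maximum value reflects the heap limit the JVM was started with (for
example via the -Xmx option), so it is a quick way to check whether a larger
heap setting actually took effect.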
19.2.13 GUIChooser starts but not Experimenter or Explorer
The GUIChooser starts, but Explorer and Experimenter don't start and output
an Exception like this in the terminal:
/usr/share/themes/Mist/gtk-2.0/gtkrc:48: Engine "mist" is unsupported, ignoring
---Registering Weka Editors---
java.lang.NullPointerException
at weka.gui.explorer.PreprocessPanel.addPropertyChangeListener(PreprocessPanel.java:519)
at javax.swing.plaf.synth.SynthPanelUI.installListeners(SynthPanelUI.java:49)
at javax.swing.plaf.synth.SynthPanelUI.installUI(SynthPanelUI.java:38)
at javax.swing.JComponent.setUI(JComponent.java:652)
at javax.swing.JPanel.setUI(JPanel.java:131)
...
This behavior happens only under Java 1.5 with Gnome/Linux; KDE doesn't
produce this error. The reason is that Weka tries to look more native
and therefore sets a platform-specific Swing theme. Unfortunately, this doesn't
seem to work correctly in Java 1.5 together with Gnome. A workaround
is to set the cross-platform Metal theme.
In order to use another theme one only has to create the following properties
file in one's home directory:
LookAndFeel.props
With this content:
Theme=javax.swing.plaf.metal.MetalLookAndFeel
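If you prefer not to create the file by hand, it can also be written
programmatically. The following is a minimal sketch using only the standard
Java I/O classes; the class name LookAndFeelProps and its write method are
hypothetical helpers introduced for illustration and are not part of Weka. The
file name and the Theme key are the ones given above.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

/** Illustrative helper (hypothetical name): writes LookAndFeel.props with a single Theme entry. */
class LookAndFeelProps {

    /** Cross-platform Metal look and feel, as recommended in the text. */
    static final String METAL = "javax.swing.plaf.metal.MetalLookAndFeel";

    /** Writes "Theme=<themeClass>" to LookAndFeel.props inside dir and returns the file. */
    static File write(File dir, String themeClass) throws IOException {
        File file = new File(dir, "LookAndFeel.props");
        Writer out = new FileWriter(file); // overwrites an existing file
        try {
            out.write("Theme=" + themeClass + System.getProperty("line.separator"));
        } finally {
            out.close();
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        // user.home is the standard system property for the user's home directory.
        File home = new File(System.getProperty("user.home"));
        System.out.println("Wrote " + write(home, METAL).getAbsolutePath());
    }
}
```

Running the class once and then restarting Weka should have the same effect
as creating the file manually; deleting the file restores the default
behavior.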
19.2.14 KnowledgeFlow toolbars are empty
In the terminal, you will most likely see this output as well:
Failed to instantiate: weka.gui.beans.Loader
This behavior can happen under Gnome with Java 1.5, see Section 19.2.13 for
a solution.
19.2.15 Links
Java VM options (http://java.sun.com/docs/hotspot/VMOptions.html)
Bibliography
[1] Witten, I.H. and Frank, E. (2005) Data Mining: Practical machine
learning tools and techniques. 2nd edition, Morgan Kaufmann, San Francisco.
[2] WekaWiki http://weka.wikispaces.com/
[3] Weka Examples A collection of example classes, as part of
an ANT project, included in the WEKA snapshots (available
for download on the homepage) or directly from subversion
https://svn.scms.waikato.ac.nz/svn/weka/branches/stable-3-6/wekaexamples/
[4] J. Platt (1998): Fast Training of Support Vector Machines using Sequential
Minimal Optimization. In B. Schoelkopf and C. Burges and A. Smola, editors,
Advances in Kernel Methods - Support Vector Learning.
[5] Drummond, C. and Holte, R. (2000) Explicitly representing expected
cost: An alternative to ROC representation. Proceedings of the Sixth
ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining. Publishers, San Mateo, CA.
[6] Extensions for Weka's main GUI on WekaWiki
http://weka.wikispaces.com/Extensions+for+Weka%27s+main+GUI
[7] Adding tabs in the Explorer on WekaWiki
http://weka.wikispaces.com/Adding+tabs+in+the+Explorer
[8] Explorer visualization plugins on WekaWiki
http://weka.wikispaces.com/Explorer+visualization+plugins
[9] Bengio, Y. and Nadeau, C. (1999) Inference for the Generalization Error.
[10] Ross Quinlan (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann
Publishers, San Mateo, CA.
[11] Subversion http://weka.wikispaces.com/Subversion
[12] HSQLDB http://hsqldb.sourceforge.net/
[13] MySQL http://www.mysql.com/
[14] Plotting multiple ROC curves on WekaWiki
http://weka.wikispaces.com/Plotting+multiple+ROC+curves
[15] R.R. Bouckaert. Bayesian Belief Networks: from Construction to Inference.
Ph.D. thesis, University of Utrecht, 1995.
[16] W.L. Buntine. A guide to the literature on learning probabilistic networks from
data. IEEE Transactions on Knowledge and Data Engineering, 8:195-210, 1996.
[17] J. Cheng, R. Greiner. Comparing Bayesian network classifiers. Proceedings UAI,
101-107, 1999.
[18] C.K. Chow, C.N. Liu. Approximating discrete probability distributions with
dependence trees. IEEE Trans. on Info. Theory, IT-14: 426-467, 1968.
[19] G. Cooper, E. Herskovits. A Bayesian method for the induction of probabilistic
networks from data. Machine Learning, 9: 309-347, 1992.
[20] Cozman. See http://www-2.cs.cmu.edu/~fgcozman/Research/InterchangeFormat/
for details on XML BIF.
[21] N. Friedman, D. Geiger, M. Goldszmidt. Bayesian Network Classifiers. Machine
Learning, 29: 131-163, 1997.
[22] D. Heckerman, D. Geiger, D. M. Chickering. Learning Bayesian networks: the
combination of knowledge and statistical data. Machine Learning, 20(3):
197-243, 1995.
[23] S.L. Lauritzen and D.J. Spiegelhalter. Local Computations with Probabilities on
graphical structures and their applications to expert systems (with discussion).
Journal of the Royal Statistical Society B. 1988, 50, 157-224
[24] Moore, A. and Lee, M.S. Cached Sufficient Statistics for Efficient Machine
Learning with Large Datasets, JAIR, Volume 8, pages 67-91, 1998.
[25] Verma, T. and Pearl, J.: An algorithm for deciding if a set of observed
independencies has a causal explanation. Proc. of the Eighth Conference on
Uncertainty in Artificial Intelligence, 323-330, 1992.
[26] GraphViz. See http://www.graphviz.org/doc/info/lang.html for more
information on the DOT language.
[27] JMathPlot. See http://code.google.com/p/jmathplot/ for more information
on the project.