DM Manual-Min
ARFF format
Procedure:
Open MS Excel and create a new worksheet with the required headings and data.
Click the Open file button and browse to the file to open it.
Click Save; in the dialog box that opens, save the file with the .arff extension.
Input: bank_data
Procedure:
If the data has been entered without any errors, the file details will be displayed on the
Preprocess screen.
Enter the data fields in Notepad as shown and save the file with the .csv extension, choosing
"All Files" as the file type.
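The same CSV-to-ARFF conversion can also be scripted with WEKA's Java converters instead of Excel/Notepad. This is only a minimal sketch; the file names bank_data.csv and bank_data.arff are assumptions for illustration.
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the comma-separated file created in Excel/Notepad
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("bank_data.csv"));   // assumed file name
        Instances data = loader.getDataSet();
        // Write the same data out in ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("bank_data.arff"));     // assumed output name
        saver.writeBatch();
    }
}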
Input: weather.arff
Procedure:
We can remove an attribute directly by selecting it and clicking the REMOVE button, as shown
below.
(Or)
Click on the Save button and store the modified dataset with a new name, such as weather.arff.
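The same attribute removal can be done programmatically with WEKA's Remove filter. A minimal sketch, assuming weather.arff is in the working directory and that the first attribute is the one to drop:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
public class RemoveAttribute {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");  // assumed path
        Remove remove = new Remove();
        remove.setAttributeIndices("1");      // 1-based index of the attribute to drop
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);
        System.out.println(reduced);          // the modified dataset can then be saved
    }
}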
Components Used:
After browsing to the ARFF file in the data folder, click OK.
Output:
The result can be viewed by double-clicking the resulting CSV file, which opens as an MS Excel
worksheet.
Procedure:
Choose the Classify tab in the WEKA Explorer window. Under the Classify tab, click on the Choose
button and select J48 under trees, as shown in the following.
Now select "Use training set" under the Test options located at the left of the WEKA Explorer
window and click on the Start button.
The output is presented in the Classifier Output window of the WEKA Explorer.
We can also view the output in a separate window by right-clicking on the entry in the Result list
and clicking "View in separate window".
Under the Result list, right-click on the item to get the options as shown and select the
"Visualize tree" option.
After selecting the "Visualize tree" option, the output is represented as a tree in a separate window,
as shown.
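For reference, the same J48 run ("Use training set" evaluation plus the textual tree behind "Visualize tree") can be reproduced with the WEKA Java API. A minimal sketch, assuming weather.arff with the class as the last attribute:
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
public class J48OnTrainingSet {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");   // assumed path
        data.setClassIndex(data.numAttributes() - 1);        // last attribute is the class
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);                            // textual form of the decision tree
        // "Use training set" evaluation, as selected in the Explorer
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());           // confusion matrix
    }
}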
Load the input file into the Explorer to perform the classification, as shown in the figure below.
After loading the input file named canb.csv as shown in the figure, choose the Classify tab in the
WEKA Explorer window. Under the Classify tab, click on the Choose button and select NaiveBayes
under bayes, as shown.
The output is presented in the Classifier Output window of the WEKA Explorer.
We can also view the output in a separate window by right-clicking on the entry in the Result list
and clicking "View in separate window".
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: canb
Instances: 14
Attributes: 6
cid
age
income
student
credit rating
class by computer
Test mode: evaluate on training data
Class
Attribute no yes
(0.38) (0.63)
==============================
cid
mean 6.2 8.2222
std. dev. 4.6648 3.4247
weight sum 5 9
precision 1 1
age
youth 4.0 3.0
middle 1.0 5.0
senior 3.0 4.0
[total] 8.0 12.0
income
high 3.0 3.0
medium 3.0 5.0
low 2.0 4.0
[total] 8.0 12.0
student
no 5.0 4.0
yes 2.0 7.0
[total] 7.0 11.0
credit rating
fair 3.0 7.0
excellent 4.0 4.0
[total] 7.0 11.0
a b <-- classified as
3 2 | a = no
1 8 | b = yes
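The NaiveBayes output above can be reproduced outside the Explorer with a few API calls. A minimal sketch, assuming the canb data has been saved as canb.arff (a .csv file could be read with CSVLoader instead) and that the last attribute is the class:
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
public class NaiveBayesOnTrainingSet {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("canb.arff");      // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb);                              // per-class attribute statistics
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(nb, data);                        // evaluate on training data
        System.out.println(eval.toMatrixString());           // a/b confusion matrix as above
    }
}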
Input: weather.arff
Procedure:
The arrangement and linking of icons for "training set and test set" is shown below.
Scheme: IBk
Options: -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A
\"weka.core.EuclideanDistance -R first-last\""
Relation: weather-weka.filters.supervised.attribute.AttributeSelection-
Eweka.attributeSelection.CfsSubsetEval-Sweka.attributeSelection.BestFirst -D 1 -N 5
a b <-- classified as
8 1 | a = yes
1 4 | b = no
Scheme: IBk
Relation: iris-weka.filters.supervised.attribute.AttributeSelection-
Eweka.attributeSelection.CfsSubsetEval-Sweka.attributeSelection.BestFirst -D 1 -N 5
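The Relation lines above show that the data was first passed through the supervised AttributeSelection filter (CfsSubsetEval with BestFirst search) before IBk was applied. A minimal API sketch of that pipeline, assuming weather.arff with the class as the last attribute:
import java.util.Random;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
public class IBkAfterAttributeSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");   // assumed path
        data.setClassIndex(data.numAttributes() - 1);
        // CfsSubsetEval + BestFirst, as in the Relation string above
        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new CfsSubsetEval());
        filter.setSearch(new BestFirst());
        filter.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, filter);
        reduced.setClassIndex(reduced.numAttributes() - 1);  // class stays last after filtering
        IBk knn = new IBk();                                  // -K 1 by default
        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(knn, reduced, 10, new Random(1));
        System.out.println(eval.toMatrixString());
    }
}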
Input: weather.arff
Procedure:
Click on the Start button and view the clustering result in the output window.
Input: weather.arff
Procedure:
Click on the Start button and view the clustering result in the output window.
Input: weather.arff
Procedure:
Go to the Explorer environment.
Input: assrulegen.arff
Load the input file into the Explorer to perform association, as shown below.
After loading, choose the Associate tab in the WEKA Explorer window.
Select "Use training set" under the test options located at the left of the WEKA Explorer
window; the output is presented as shown below.
Select any attribute in the Attributes section and click on the Remove button.
Now click on the Choose button in the Filter panel, expand the "unsupervised" option and select
the "Discretize" option.
In the GenericObjectEditor, change the bins value to 2, 3 or as desired, set the
"useEqualFrequency" option to "True" and click on OK.
We can observe the change in the result in the Visualize panel, as shown in the figure, using the
Edit option.
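The same Discretize settings (number of bins and useEqualFrequency) can be applied through the API before association mining. A minimal sketch, assuming assrulegen.arff and 3 equal-frequency bins:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;
public class DiscretizeForAssociation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("assrulegen.arff"); // assumed path
        Discretize disc = new Discretize();
        disc.setBins(3);                       // 2, 3 or as desired
        disc.setUseEqualFrequency(true);       // the "useEqualFrequency = True" option
        disc.setInputFormat(data);
        Instances nominal = Filter.useFilter(data, disc);
        System.out.println(nominal);           // all numeric attributes are now binned
    }
}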
Click on New experiment. Click on Results destination and select ARFF file.
Under Experiment type, choose cross-validation with the default number of folds of 10, and tick
the Use relative paths check box.
Click the Add new button in the Algorithms panel and choose a classifier; by default the ZeroR
classifier is selected. We can add more classifiers using the Add new button, for example the J48
classifier.
After selecting the classifier parameters, click on OK to add it to the list of algorithms.
With the Load and Save options we can load and save the setup of a selected classifier from and to XML.
Run the current experiment by clicking on the Run tab of the Experiment Environment window,
then click Start to run the experiment. If the experiment was designed correctly, there will be three
messages in the Log panel and no errors.
Tick "Use relative paths" in the Datasets panel of the Setup tab and click on Add new to open a
dialog window.
The dataset name is displayed in the Datasets panel of the Setup tab.
Select Save at the top of the Setup tab; type the dataset name with the extension .exp for a binary
file, or choose "Experiment configuration files" for the XML file type.
To run the current experiment, click the Run tab at the top of the Experiment Environment window.
If the experiment was defined correctly, the three messages shown above will be displayed in the log
panel.
OUTPUT:
@relation InstanceResultListener
@attribute Key_Dataset {iris}
@attribute Key_Run {1,2,3,4,5,6,7,8,9,10}
@attribute Key_Scheme {weka.classifiers.rules.ZeroR}
@attribute Key_Scheme_options {''}
@attribute Key_Scheme_version_ID {48055541465867954}
@attribute Date_time numeric
@attribute Number_of_training_instances numeric
@attribute Number_of_testing_instances numeric
@attribute Number_correct numeric
@attribute Number_incorrect numeric
@attribute Number_unclassified numeric
@attribute Percent_correct numeric
@attribute Percent_incorrect numeric
@attribute Percent_unclassified numeric
@attribute Kappa_statistic numeric
@attribute Mean_absolute_error numeric
@attribute Root_mean_squared_error numeric
iris,1,weka.classifiers.rules.ZeroR,'',48055541465867954,20140716.0732,99,51,17,34,0,33.333333
,66.666667,0,0,0.444444,0.471405,100,100,80.833088,80.833088,0,1.584963,1.584963,0,0,0,0,1,
17,1,34,0,0,0,0,0.333333,1,0.5,0.5,0.333333,0.333333,0.666667,0.666667,0.111111,0.333333,0.1
66667,0.5,0.166667,0.333333,0,0,0,0,1157,8434,5122,100,100,?
iris,2,weka.classifiers.rules.ZeroR,'',48055541465867954,20140716.0732,99,51,17,34,0,33.333333
,66.666667,0,0,0.444444,0.471405,100,100,80.833088,80.833088,0,1.584963,1.584963,0,0,0,0,1,
17,1,34,0,0,0,0,0.333333,1,0.5,0.5,0.333333,0.333333,0.666667,0.666667,0.111111,0.333333,0.1
66667,0.5,0.166667,0.333333,0,0,0,0,1157,8434,5122,100,100,?
iris,3,weka.classifiers.rules.ZeroR,'',48055541465867954,20140716.0732,99,51,17,34,0,33.333333
,66.666667,0,0,0.444444,0.471405,100,100,80.833088,80.833088,0,1.584963,1.584963,0,0,0,0,1,
17,1,34,0,0,0,0,0.333333,1,0.5,0.5,0.333333,0.333333,0.666667,0.666667,0.111111,0.333333,0.1
66667,0.5,0.166667,0.333333,0,0,0,0,1157,8434,5122,100,100,?
iris,4,weka.classifiers.rules.ZeroR,'',48055541465867954,20140716.0732,99,51,17,34,0,33.333333
,66.666667,0,0,0.444444,0.471405,100,100,80.833088,80.833088,0,1.584963,1.584963,0,0,0,0,1,
17,1,34,0,0,0,0,0.333333,1,0.5,0.5,0.333333,0.333333,0.666667,0.666667,0.111111,0.333333,0.1
66667,0.5,0.166667,0.333333,0,0,0,0,1157,8434,5122,100,100,?
iris,5,weka.classifiers.rules.ZeroR,'',48055541465867954,20140716.0732,99,51,17,34,0,33.333333
,66.666667,0,0,0.444444,0.471405,100,100,80.833088,80.833088,0,1.584963,1.584963,0,0,0,0,1,
17,1,34,0,0,0,0,0.333333,1,0.5,0.5,0.333333,0.333333,0.666667,0.666667,0.111111,0.333333,0.1
66667,0.5,0.166667,0.333333,0,0,0,0,1157,8434,5122,100,100,?
iris,6,weka.classifiers.rules.ZeroR,'',48055541465867954,20140716.0732,99,51,17,34,0,33.333333
,66.666667,0,0,0.444444,0.471405,100,100,80.833088,80.833088,0,1.584963,1.584963,0,0,0,0,1,
17,1,34,0,0,0,0,0.333333,1,0.5,0.5,0.333333,0.333333,0.666667,0.666667,0.111111,0.333333,0.1
66667,0.5,0.166667,0.333333,0,0,0,0,1157,8434,5122,100,100,?
iris,7,weka.classifiers.rules.ZeroR,'',48055541465867954,20140716.0732,99,51,17,34,0,33.333333
,66.666667,0,0,0.444444,0.471405,100,100,80.833088,80.833088,0,1.584963,1.584963,0,0,0,0,1,
17,1,34,0,0,0,0,0.333333,1,0.5,0.5,0.333333,0.333333,0.666667,0.666667,0.111111,0.333333,0.1
66667,0.5,0.166667,0.333333,0,0,0,0,1157,8434,5122,100,100,?
iris,8,weka.classifiers.rules.ZeroR,'',48055541465867954,20140716.0732,99,51,17,34,0,33.333333
,66.666667,0,0,0.444444,0.471405,100,100,80.833088,80.833088,0,1.584963,1.584963,0,0,0,0,1,
17,1,34,0,0,0,0,0.333333,1,0.5,0.5,0.333333,0.333333,0.666667,0.666667,0.111111,0.333333,0.1
66667,0.5,0.166667,0.333333,0,0,0,0,1157,8434,5122,100,100,?
iris,9,weka.classifiers.rules.ZeroR,'',48055541465867954,20140716.0732,99,51,17,34,0,33.333333
,66.666667,0,0,0.444444,0.471405,100,100,80.833088,80.833088,0,1.584963,1.584963,0,0,0,0,1,
17,1,34,0,0,0,0,0.333333,1,0.5,0.5,0.333333,0.333333,0.666667,0.666667,0.111111,0.333333,0.1
66667,0.5,0.166667,0.333333,0,0,0,0,1157,8434,5122,100,100,?
iris,10,weka.classifiers.rules.ZeroR,'',48055541465867954,20140716.0732,99,51,17,34,0,33.33333
3,66.666667,0,0,0.444444,0.471405,100,100,80.833088,80.833088,0,1.584963,1.584963,0,0,0,0,
1,17,1,34,0,0,0,0,0.333333,1,0.5,0.5,0.333333,0.333333,0.666667,0.666667,0.111111,0.333333,0
.166667,0.5,0.166667,0.333333,0,0,0,0,1157,8434,5122,100,100,?
Setting up a flow to load an ARFF file (batch mode) and perform a cross-validation using J48.
• Click on the DataSources tab and choose ArffLoader from the toolbar (the mouse pointer will
change to a cross hairs).
• Next place the ArffLoader component on the layout area by clicking somewhere on the layout (a
copy of the ArffLoader icon will appear on the layout area).
• Next specify an ARFF file to load by first right clicking the mouse over the ArffLoader icon on the
layout. A pop-up menu will appear. Select Configure under Edit in the list from this menu and
browse to the location of your ARFF file.
• Next click the Evaluation tab at the top of the window and choose the ClassAssigner (allows you to
choose which column to be the class) component from the toolbar. Place this on the layout.
• Now connect the ArffLoader to the ClassAssigner: first right click over the ArffLoader and select
the dataSet under Connections in the menu. A rubber band line will appear. Move the mouse over
the ClassAssigner component and left click - a red line labeled dataSet will connect the two
components.
• Next right click over the ClassAssigner and choose Configure from the menu. This will pop up a
window from which you can specify which column is the class in your data (last is the default).
• Next grab a CrossValidationFoldMaker component from the Evaluation toolbar and place it on the
layout. Connect the ClassAssigner to the CrossValidationFoldMaker by right clicking over
ClassAssigner and selecting dataSet from under Connections in the menu.
• Next click on the Classifiers tab at the top of the window and scroll along the toolbar until you
reach the J48 component in the trees section. Place a J48 component on the layout.
Now you will see that EM becomes the default clusterer and gets added to the list of schemes. You
can now add/delete other clusterers.
You can analyse results in the Analyse panel. In the Comparison field you will need to scroll down
and select "humidity".
The KnowledgeFlow can draw multiple ROC curves in the same plot window, something that the
Explorer cannot do. In this example we use J48 and RandomForest as classifiers.
• Click on the DataSources tab and choose ArffLoader from the toolbar (the mouse pointer will
change to a cross hairs).
• Next place the ArffLoader component on the layout area by clicking somewhere on the layout (a
copy of the ArffLoader icon will appear on the layout area).
• Next specify an ARFF file to load by first right clicking the mouse over the ArffLoader icon on the
layout. A pop-up menu will appear. Select Configure under Edit in the list from this menu and
browse to the location of your ARFF file.
• Next click the Evaluation tab at the top of the window and choose the ClassAssigner (allows you to
choose which column to be the class) component from the toolbar. Place this on the layout.
• Now connect the ArffLoader to the ClassAssigner: first right click over the ArffLoader and select
the dataSet under Connections in the menu. A rubber band line will appear. Move the mouse over
the ClassAssigner component and left click - a red line labeled dataSet will connect the two
components.
• Next right click over the ClassAssigner and choose Configure from the menu. This will pop up a
window from which you can specify which column is the class in your data (last is the default).
• Next choose the ClassValuePicker (allows you to choose which class label to be evaluated in the
ROC) component from the toolbar. Place this on the layout and right click over ClassAssigner and
select dataSet from under Connections in the menu and connect it with the ClassValuePicker.
• Next grab a CrossValidationFoldMaker component from the Evaluation toolbar and place it on the
layout. Connect the ClassAssigner to the CrossValidationFoldMaker by right clicking over
ClassAssigner and selecting dataSet from under Connections in the menu.
• Next click on the Classifiers tab at the top of the window and scroll along the toolbar until you
reach the J48 component in the trees section. Place a J48 component on the layout.
• Connect the CrossValidationFoldMaker to J48 TWICE by first choosing trainingSet and then
testSet from the pop-up menu for the CrossValidationFoldMaker.
• Next go back to the Evaluation tab and place a ClassifierPerformanceEvaluator component on the
layout. Connect J48 to this component by selecting the batchClassifier entry from the pop-up menu
for J48. Add another ClassifierPerformanceEvaluator for RandomForest and connect them via
batchClassifier as well.
• Next go to the Visualization toolbar and place a ModelPerformanceChart component on the layout.
Connect both ClassifierPerformanceEvaluators to the ModelPerformanceChart by selecting the
thresholdData entry from the pop-up menu for ClassifierPerformanceEvaluator.
• Now start the flow executing by selecting Start loading from the pop-up menu for ArffLoader.
Depending on how big the data set is and how long cross-validation takes, you will see some
animation from some of the icons in the layout.
• Select Show plot from the popup-menu of the ModelPerformanceChart under the Actions section.
Some classifiers, clusterers and filters in Weka can handle data incrementally in a streaming
fashion. Here is an example of training and testing naive Bayes incrementally. The results are sent
to a TextViewer and predictions are plotted by a StripChart component.
Click on the DataSources tab and choose ArffLoader from the toolbar (the mouse pointer will
change to a cross hairs).
• Next place the ArffLoader component on the layout area by clicking some-where on the layout (a
copy of the ArffLoader icon will appear on the layout area).
• Next specify an ARFF file to load by first right clicking the mouse over the ArffLoader icon on the
layout. A pop-up menu will appear. Select Configure under Edit in the list from this menu and
browse to the location of your ARFF file.
• Next click the Evaluation tab at the top of the window and choose the ClassAssigner (allows you to
choose which column to be the class) component from the toolbar. Place this on the layout.
• Now connect the ArffLoader to the ClassAssigner: first right click over the ArffLoader and select
the dataSet under Connections in the menu. A rubber band line will appear. Move the mouse over
the ClassAssigner component and left click - a red line labeled dataSet will connect the two
components.
• Next right click over the ClassAssigner and choose Configure from the menu. This will pop up a
window from which you can specify which column is the class in your data (last is the default).
• Now grab a NaiveBayesUpdateable component from the bayes section of the Classifiers panel and
place it on the layout.
• Next place an IncrementalClassifierEvaluator from the Evaluation panel onto the layout and
connect NaiveBayesUpdateable to it using an incrementalClassifier connection.
• Next place a TextViewer component from the Visualization panel on the Layout. Connect the
IncrementalClassifierEvaluator to it using a text connection.
• Next place a StripChart component from the Visualization panel on the layout and connect
IncrementalClassifierEvaluator to it using a chart connection.
• Display the StripChart’s chart by right-clicking over it and choosing Show chart from the pop-up
menu. Note: the StripChart can be configured with options that control how often data points and
labels are displayed.
• Finally, start the flow by right-clicking over the ArffLoader and selecting Start loading from the
pop-up menu.
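The incremental flow above (ArffLoader feeding NaiveBayesUpdateable instance by instance) corresponds roughly to the following API sketch, which reads one instance at a time instead of loading the whole file; the file name weather.arff is an assumption.
import java.io.File;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
public class IncrementalNaiveBayes {
    public static void main(String[] args) throws Exception {
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("weather.arff"));             // assumed file name
        Instances structure = loader.getStructure();           // header only, no data yet
        structure.setClassIndex(structure.numAttributes() - 1);
        NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
        nb.buildClassifier(structure);                          // initialise with the structure
        Instance current;
        while ((current = loader.getNextInstance(structure)) != null) {
            nb.updateClassifier(current);                       // train one instance at a time
        }
        System.out.println(nb);
    }
}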
You will need to configure a file called DatabaseUtils.props. This file already exists under the path
weka/experiment/ in the weka.jar file (which is just a ZIP file) that is part of the Weka download. In
this directory you will also find a sample file for ODBC connectivity, called
DatabaseUtils.props.odbc, and one specifically for MS Access, called DatabaseUtils.props.msaccess
(>3.4.14, >3.5.8, >3.6.0), also using ODBC. You should use one of the sample files as basis for your
setup, since they already contain default values specific to ODBC access.
This file needs to be recognized when the Explorer starts. You can achieve this by making sure it is
in the working directory or the home directory (if you are unsure what the terms working directory
and home directory mean, see the Notes section). The second alternative is probably the easiest,
as the setup will then apply to all the Weka instances on your machine.
Just make sure that the file contains the following lines at least:
jdbcDriver=sun.jdbc.odbc.JdbcOdbcDriver
jdbcURL=jdbc:odbc:dbname
where dbname is the name you gave the user DSN. (This can also be changed once the Explorer is
running.)
Start up the Weka Explorer.
Choose Open DB...
The URL should read "jdbc:odbc:dbname" where dbname is the name you gave the user DSN.
Click Connect
Enter a Query, e.g., "select * from tablename" where tablename is the name of the database table
you want to read. Or you could put a more complicated SQL query here instead.
Click Execute
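Once DatabaseUtils.props is in place, the same query can also be issued from code through weka.experiment.InstanceQuery. A minimal sketch; the DSN name, empty credentials and table name are placeholders, exactly as in the steps above.
import weka.core.Instances;
import weka.experiment.InstanceQuery;
public class ReadFromDatabase {
    public static void main(String[] args) throws Exception {
        InstanceQuery query = new InstanceQuery();
        query.setDatabaseURL("jdbc:odbc:dbname");      // dbname = the user DSN, as above
        query.setUsername("");                          // placeholder credentials
        query.setPassword("");
        query.setQuery("select * from tablename");      // tablename is a placeholder
        Instances data = query.retrieveInstances();
        System.out.println(data.toSummaryString());
    }
}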
I. Use the KnowledgeFlow canvas and develop a directed graph for C4.5 execution
Goal: Setting up a flow to load an arff file (batch mode) and perform a cross validation using J48
(Weka's C4.5 implementation).
Steps to be done:
1. The Weka GUI Chooser window is used to launch Weka's graphical environments. Select the
button labeled "KnowledgeFlow" to start the KnowledgeFlow. Alternatively, you can launch the
KnowledgeFlow from a terminal window by typing "java weka.gui.beans.KnowledgeFlow".
2. First start the KnowledgeFlow.
3. Next click on the DataSources tab and choose "ArffLoader" from the toolbar (the mouse
pointer will change to a "cross hairs").
4. Next place the ArffLoader component on the layout area by clicking somewhere on the layout
(A copy of the ArffLoader icon will appear on the layout area).
5. Next specify an arff file to load by first right clicking the mouse over the ArffLoader icon on
the layout. A pop-up menu will appear. Select "Configure" under "Edit" in the list from this menu
and browse to the location of your arff file.
6. Next click the "Evaluation" tab at the top of the window and choose the "ClassAssigner"
(allows you to choose which column to be the class) component from the toolbar. Place this on the
layout.
1) Weather.nominal.arff
What are the values that the attribute temperature can have?
Load a new dataset. Click the Open file button and select the file iris.arff. How many instances
does this dataset have? How many attributes? What is the range of possible values of the attribute
petallength?
2) Weather.nominal.arff
What is the function of the first column in the Viewer window? What is the class value of instance
number 8 in the weather data?
Load the iris data and open it in the editor. How many numeric and how many nominal attributes
does this dataset have?
4) Load the iris data using the Preprocess panel. Evaluate C4.5 on this data using (a) the training
set and (b) cross-validation. What is the estimated percentage of correct classifications for (a) and
(b)? Which estimate is more realistic? Use the Visualize classifier errors function to find the wrongly
classified test instances for the cross-validation performed in previous Exercise. What can you say
about the location of the errors?
5) Glass.arff
How many attributes are there in the dataset? What are their names? What is the class attribute?
Run the classification algorithm IBk (weka.classifiers.lazy.IBk). Use cross-validation to test its
performance, leaving the number of folds at the default value of 10. Recall that you can examine the
classifier options in the Generic Object Editor window that pops up when you click the text beside
the Choose button. The default value of the KNN field is 1: This sets the number of neighboring
instances to use when classifying.
6) Glass.arff
What is the accuracy of IBk (given in the Classifier Output box)? Run IBk again, but increase the
number of neighboring instances to k = 5 by entering this value in the KNN field. Here and
throughout this section, continue to use cross-validation as the evaluation method.
For J48, compare cross-validated accuracy and the size of the trees generated for (1) the raw data,
(2) data discretized by the unsupervised discretization method in default mode, and (3) data
discretized by the same method with binary attributes.
8) Apply the ranking technique to the labor negotiations data in labor.arff to determine the four
most important attributes based on information gain. On the same data, run CfsSubsetEval for
correlation-based selection, using the BestFirst search. Then run the wrapper method with J48 as
the base learner, again using the BestFirst search. Examine the attribute subsets that are output.
Which attributes are selected by both methods? How do they relate to the output generated by
ranking using information gain?
9) Run Apriori on the weather data with each of the four rule-ranking metrics, and default
settings otherwise. What is the top-ranked rule that is output for each metric?
• We use a subset of the "Iris Plants Database" dataset (i.e., provided by WEKA, contained in the
"iris.arff" file).
• Each plant record (i.e., example) is represented by the following 5 attributes.
- SepalLength – the plant’s sepal length in cm.
- SepalWidth – the plant’s sepal width in cm.
- PetalLength – the plant’s petal length in cm.
- PetalWidth – the plant’s petal width in cm.
- Class – the classification attribute, with the possible values {Iris-setosa, Iris-versicolor, Iris-
virginica}.
• We want to predict, for each of the following users, if s/he will buy a computer or not.
- User #21. A young student with medium income and a fair credit rating.
- User #22. A young non-student with low income and a fair credit rating.
- User #23. A middle-aged student with high income and an excellent credit rating.
- User #24. A senior non-student with high income and an excellent credit rating.
Use the WEKA tool
• Convert the dataset containing 20 examples (i.e., Users #1-20) into the ARFF format (supported
by WEKA), and save it in the “buy_comp.arff” file.
• For each user in the set of Users #21-24, set the values of the Buy_Computer attribute by the
predictions computed manually in Part I. Convert the data of these four users into the ARFF format,
and save it in the “buy_comp_extra.arff” file.
• Launch the WEKA tool, and then activate the “Explorer” environment.
• Open the “buy_comp” dataset (i.e., saved in the “buy_comp.arff” file).
- For each attribute and for each of its possible values, how many instances in each class
have the feature value (i.e., the class distribution of the feature values)?
• Go to the “Classify” tab. Select the Id3 classifier. Choose “Percentage split” (66% for training) test
mode. Run the classifier and observe the results shown in the “Classifier output” window.
- How many instances are used for training? How many for the test?
- Does the test set currently used include the four instances of Users #21-24?
- How many instances are incorrectly classified?
- What is the MAE (mean absolute error) made by the learned DT?
g. It is also notable that an important figure is the square of the correlation coefficient (R2). In
statistical regression analysis, where this prediction method originated, the most commonly used
success measures are R and R2. The latter represents the percentage of variation in the target figure
accounted for by the model. For example, if we want to predict a sales volume based on three
factors such as the advertising budget, the number of plays on the radio per week, and the
attractiveness of the band, and if we get a correlation coefficient R of 0.8, then we learn from the
model that R2 = 64% of the variability in the outcome (the sales volume) is accounted for by the
three factors. How much of the variability of num can be predicted by the other attributes?
h. Are these results compatible with the results of assignment #1, which used classification to
predict num?
Now compare these figures with the other classifiers provided under functions and fill in the following
table (except the last line); a code sketch for running such comparisons follows the table:
LinearRegression
SMOreg
MultilayerPerceptron
MultilayerPerceptron (optimized)
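To fill in the table, each of these function-based schemes can be evaluated in the same way. The sketch below compares LinearRegression and SMOreg by cross-validated correlation coefficient and errors; the file name heart-numeric.arff is a placeholder for whatever dataset (with a numeric class as the last attribute) you are using, and MultilayerPerceptron can be swapped in the same way.
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.functions.SMOreg;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
public class CompareRegressors {
    public static void main(String[] args) throws Exception {
        // Placeholder file name: any dataset with a numeric class in the last column
        Instances data = DataSource.read("heart-numeric.arff");
        data.setClassIndex(data.numAttributes() - 1);
        Classifier[] schemes = { new LinearRegression(), new SMOreg() };
        for (Classifier c : schemes) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.println(c.getClass().getSimpleName()
                    + "  R = " + eval.correlationCoefficient()
                    + "  MAE = " + eval.meanAbsoluteError()
                    + "  RMSE = " + eval.rootMeanSquaredError());
        }
    }
}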
The goal of this case study is to investigate how to preprocess data using Weka data mining
tool.
This assignment will be using the Weka data mining tool. Weka is an open source Java development
environment for data mining from the University of Waikato in New Zealand. It can be
downloaded freely from http://www.cs.waikato.ac.nz/ml/weka/. Weka is a real asset for
learning data mining because it is freely available, students can study how the different data mining
models are implemented, and they can develop customized Java data mining applications. Moreover,
data mining results from Weka can be published in the most respected journals and conferences,
which makes it a de facto development environment of choice for research in data mining, where
researchers often need to develop new data mining methods.
The dataset studied is the heart disease dataset from UCI repository (datasets-UCI.jar). Two
different datasets are provided: heart-h.arff (Hungarian data), and heart-c.arff (Cleveland data).
These datasets describe factors of heart disease. They can be downloaded from:
http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html.
The goal of this machine learning project is to better understand the risk factors for heart disease, as
represented in the 14th attribute, num (<50 means no disease, and values >50_1 to >50_4 represent
increasing levels of heart disease).
The question on which this machine learning study concentrates is whether it is possible to predict
heart disease from the other known data about a patient. The data mining task of choice to answer
1. Fire up the Weka software, launch the Explorer window and select the "Preprocess" tab. Open the
weather.nominal dataset ("weather.nominal.arff"; this should be in the ./data/ directory of the
Weka install).
2. Often we are in search of discovering association rules showing attribute-value conditions that
occur frequently together in a given set of data, such as: buys(X, "computer") & buys(X, "scanner") =>
buys(X, "printer") [support = 2%, confidence = 60%], where confidence and support are measures of
rule interestingness. A support of 2% means that 2% of all transactions under analysis show that
computer, scanner and printer are purchased together. A confidence of 60% means that 60% of the
customers who purchased a computer and a scanner also bought a printer. We are interested in
association rules that apply to a reasonably large number of instances and have a reasonably high
accuracy on the instances to which they apply.
Weka has three built-in association rule learners: "Apriori", "Predictive Apriori" and
"Tertius"; however, they are not capable of handling numeric data. Therefore in this exercise we use
the weather data.
Briefly inspect the output produced by each Associator and try to interpret its meaning.
(b) In association rule mining the number of possible association rules can be very large even with
tiny datasets, hence it is in our best interest to reduce the count of rules found to only the most
interesting ones. This is usually achieved by setting minimum thresholds on support and confidence
values. Still in the "Associate" view, select the "Apriori" algorithm again, click on the textbox next
to the "Choose" button and try, in turn, different values for the following parameters:
"lowerBoundMinSupport" (min threshold for support) and "minMetric" (min threshold for
confidence). As you change these parameter values, what do you notice about the rules that are
found by the associator? Note that the parameter "numRules" limits the maximum number of rules
that the associator looks for; you can try changing this value.
(c) This time run the Apriori algorithm with the "outputItemSets" parameter set to true. You will
notice that the algorithm now also outputs a list of "Generated sets of large itemsets:" at different
levels. If you have the module's Data Mining book by Witten & Frank with you, then you can
compare and contrast the Apriori associator's output with the association rules on pages 114-116. I
also strongly recommend reading through chapter 4.5 in your own time, while playing with the
weather data in Weka; this chapter gives a nice and easy introduction to association rules. Notice in
particular how the item sets and association rules compare between Weka and tables 4.10-4.11 in the
book.
(d) Compare the association rules output from Apriori and Tertius (you can do this by navigating
through the already built associator models in the "Result list" on the right side of the screen).
Make sure that the Apriori algorithm shows at least 20 rules. Think about how the association rules
generated by the two different methods compare to each other.
Something to always remember with association rules is that they should not be used for
prediction directly, that is, without further analysis or domain knowledge, as they do not necessarily
indicate causality.
They are however a very helpful starting point for further exploration and for building a better
understanding of our data.
As you should certainly know by this point, in order to identify associations between parameters, a
correlation matrix and a scatter plot matrix can be very useful.
The dataset studied is the weather dataset from Weka’s data folder
The goal of this data mining study is to find strong association rules in the weather.nominal
dataset. Answer the following questions:
b. Load the data in Weka Explorer. Select the Associate tab. How many different association
rule mining algorithms are available?
c. Choose the Apriori algorithm with the following parameters (which you can set by clicking on
the chosen algorithm): support threshold = 15% (lowerBoundMinSupport = 0.15), confidence
threshold = 90% (metricType = confidence, minMetric = 0.9), number of rules = 50 (numRules = 50).
After starting the algorithm, how many rules do you find? Could you use the regular weather
dataset to get the results? Explain why. (A minimal API sketch of this configuration is given after
question f below.)
d. Paste a screenshot of the Explorer window showing at least the first 20 rules.
e. Define the concepts of support, confidence, and lift for a rule. Write here the first rule
discovered. What is its support? Its confidence? Interpret the meaning of these terms and this rule
in this particular example.
f. Apriori algorithm generates association rules from frequent itemsets. How many itemsets of
size 4 were found? Which rule(s) have been generated from itemset of size 4 (temperature=mild,
windy=false, play=yes, outlook=rainy)? List their numbers in the list of rules.
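The Apriori configuration from question c (lowerBoundMinSupport = 0.15, minMetric = 0.9, numRules = 50) can also be run from the API. A minimal sketch, assuming weather.nominal.arff from Weka's data folder:
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
public class AprioriWeather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");  // assumed path
        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.15);   // support threshold = 15%
        apriori.setMinMetric(0.9);               // confidence threshold = 90%
        apriori.setNumRules(50);                 // report up to 50 rules
        apriori.buildAssociations(data);
        System.out.println(apriori);             // itemsets and ranked rules
    }
}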
Linear regression can be very useful in association analysis of numerical values; in fact, regression
analysis is a powerful approach to modeling the relationship between a dependent variable and
independent variables. Simple regression is when we predict from one independent variable, and
multiple regression is when we predict from more than one independent variable. The model we
attempt to fit is a linear one, which is, very simply, drawing a line through the data. Of all the lines
that can possibly be drawn through the data, we are looking for the one that best fits the data. In
fact, we look to find a line that best satisfies
y = β0 + β1x + ε
So the most accurate model is the one that yields the best-fit line to the data in question; we are
looking for the minimal sum of squared deviations between actual and fitted values. This is called
the method of least squares.
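In symbols, with the notation above, the least-squares criterion chooses β0 and β1 to minimize the sum of squared residuals over the n training points:
SSE(β0, β1) = Σ i=1..n ( yi − (β0 + β1·xi) )²
which is exactly the "minimal sum of squared deviations between actual and fitted values" described above.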
Exercise 1
(a) In Weka go back to the "Preprocess" tab. Open the iris dataset ("iris.arff"; this should be in the
./data/ directory of the Weka install).
(b) In the "Attributes" section (bottom left of the screen) select the "class" feature and click
"Remove". We need to do this, as simple linear regression cannot deal with non-numeric values.
(c) Next select the "Classify" tab to get into the classification perspective of Weka, and choose
"LinearRegression" (under "functions").
(d) Clicking on the textbox next to the "Choose" button brings up the parameter editor window.
Click on the "More" button to get information about the parameters. Make sure that
"attributeSelectionMethod" is set to "No attribute selection" and "eliminateColinearAttributes" is
set to "False".
(e) Finally make sure that you select the parameter "petalwidth" in the dropdown box just under
the "Test options". Hit Start to run the regression.
Inspect the results, in particular pay attention to the Linear Regression Model formula returned,
and the coefficients and intercept of the straight line equation. As this is a numeric
prediction/regression problem, accuracy is measured with Root Mean Squared Error, Mean
Absolute Error and the like. As most of you will have clearly noticed, you can repeat this process
for regressing the other features in turn, and compare how well the different features can be
predicted.
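The same regression run (class attribute removed, no attribute selection, collinear-attribute elimination off, petalwidth as the target) can be scripted. A minimal sketch, assuming iris.arff from Weka's data folder:
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
public class RegressPetalWidth {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");        // assumed path
        // Remove the nominal "class" attribute (last column), as in step (b)
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances numericOnly = Filter.useFilter(data, remove);
        // Predict petalwidth, now the last remaining attribute
        numericOnly.setClassIndex(numericOnly.numAttributes() - 1);
        LinearRegression lr = new LinearRegression();
        lr.setAttributeSelectionMethod(new SelectedTag(
                LinearRegression.SELECTION_NONE, LinearRegression.TAGS_SELECTION));
        lr.setEliminateColinearAttributes(false);
        lr.buildClassifier(numericOnly);
        System.out.println(lr);                                // the regression formula
        Evaluation eval = new Evaluation(numericOnly);
        eval.evaluateModel(lr, numericOnly);
        System.out.println("MAE  = " + eval.meanAbsoluteError());
        System.out.println("RMSE = " + eval.rootMeanSquaredError());
    }
}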
Exercise 2
• Launch the WEKA tool, and then activate the “Explorer” environment.
- For each attribute and for each of its possible values, how many instances in each class have the
feature value (i.e., the class distribution of the feature values)?
• Go to the “Classify” tab. Select the SimpleLinearRegression learner. Choose “Percentage split”
(66% for training) test mode. Run the classifier and observe the results shown in the “Classifier
output” window.
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 2 48 | c = Iris-virginica
[ This shows for each class, how instances from that class received the various classifications. E.g.
for class "b", 49 instances were correctly classified but 1 was put into class "c". ]
a b c <-- classified as
49 1 0 | a = Iris-setosa
0 47 3 | b = Iris-versicolor
0 2 48 | c = Iris-virginica
[ This is the confusion matrix for the 10-fold cross-validation, showing what classification the
instances from each class received when it was used as testing data. E.g. for class "a" 49 instances
were correctly classified and 1 instance was assigned to class "b". ]
Classification Exercises
Exercise 1.
1. Fire up the Weka (Waikato Environment for Knowledge Analysis) software, launch the Explorer
window and select the "Preprocess" tab. Open the iris dataset ("iris.arff";
this should be in the ./data/ directory of the Weka install).
2. Select the "Classify" tab. Under the "Test options" section you have four different testing options.
How does each of these options select the training/testing data (we cannot use the "Supplied test set"
option as we have no applicable file)? Which testing mode do you think will perform best? (The
"ExplorerGuide.pdf", in the ./ directory of the Weka install, may help.)
3. Under "Classifier" select "MultilayerPerceptron". What type of classifier is this? How does this
classifier work? What main parameters can be specified for this classifier?
4. Under "Test options" select "Use training set" and under "More options" check "Output
predictions". Now click "Start" to start training the model. You should see a stream of output
appear in the window named "Classifier output". What does each of the following sections tell you
about the model?
(a) "Predictions on ..."
(b) "Summary"
(c) "Detailed accuracy by class"
(d) "Confusion matrix"
Get to the Weka Explorer environment and load the training file using the Preprocess mode. Try
first with weather.arff. Get to the Cluster mode (by clicking on the Cluster tab) and select a
clustering algorithm, for example SimpleKMeans. Then click on Start and you get the clustering
result in the output window. The actual clustering for this algorithm is shown as one instance for
each cluster representing the cluster centroid.
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 32.31367650540387
Cluster centroids:
Cluster 0
Mean/Mode: rainy 75.625 86 FALSE yes
Std Devs: N/A 6.5014 7.5593 N/A N/A
Cluster 1
Mean/Mode: sunny 70.8333 75.8333 TRUE yes
Std Devs: N/A 6.1128 11.143 N/A N/A
Clustered Instances
0 8 ( 57%)
1 6 ( 43%)
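The SimpleKMeans run above can be reproduced with the API as well. A minimal sketch, assuming weather.arff and the -N 2 -S 10 settings shown in the Scheme line further below:
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
public class KMeansWeather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");    // assumed path
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);                                 // -N 2
        km.setSeed(10);                                       // -S 10
        km.buildClusterer(data);
        System.out.println(km);                               // centroids and squared error
        // "Use training set" style evaluation
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());    // clustered-instances breakdown
    }
}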
Evaluation
The way Weka evaluates the clusterings depends on the cluster mode you select. Four different
cluster modes are available (as buttons in the Cluster mode panel):
1. Use training set (default). After generating the clustering Weka classifies the training
instances into clusters according to the cluster representation and computes the percentage of
instances falling in each cluster.
2. In Supplied test set or Percentage split Weka can evaluate clusterings on separate
test data if the cluster representation is probabilistic (e.g. for EM).
3. Classes to clusters evaluation. In this mode Weka first ignores the class attribute
and generates the clustering. Then during the test phase it assigns classes to the clusters,
based on the majority value of the class attribute within each cluster. Then it computes the
classification error, based on this assignment and also shows the corresponding confusion
matrix. An example of this for k-means is shown below.
Scheme: weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
Ignored:
play
Test mode: Classes to clusters evaluation on training data
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 11.156838252701938
Cluster centroids:
Cluster 0
Mean/Mode: rainy 75.625 86 FALSE
Std Devs: N/A 6.5014 7.5593 N/A
Cluster 1
Mean/Mode: sunny 70.8333 75.8333 TRUE
Std Devs: N/A 6.1128 11.143 N/A
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 22.31367650540387
Cluster centroids:
Cluster 0
Mean/Mode: rainy 75.625 86 FALSE
Std Devs: N/A 6.5014 7.5593 N/A
EM
The EM clustering scheme generates probabilistic descriptions of the clusters in terms of mean and
standard deviation for the numeric attributes and value counts (incremented by 1 and modified
with a small value to avoid zero probabilities) - for the nominal ones. In "Classes to clusters"
evaluation mode this algorithm also outputs the log-likelihood, assigns classes to the clusters and
prints the confusion matrix and the error rate, as shown in the example below.
Clustered Instances
0 4 ( 29%)
1 10 ( 71%)
Cobweb
Cobweb generates hierarchical clustering, where clusters are described probabilistically. Below is
an example clustering of the weather data (weather.arff). The class attribute (play) is ignored (using
the ignore attributes panel) in order to allow later classes to clusters evaluation. Doing this
automatically through the "Classes to clusters" option does not make much sense for hierarchical
clustering, because of the large number of clusters. Sometimes we need to evaluate particular
clusters or levels in the clustering hierarchy. We shall discuss here an approach to this.
• -A 1.0 -C 0.234 in the command line specifies the Cobweb parameters Acuity and Cutoff (see
the text, page 215). They can be specified through the pop-up window that appears by clicking on
area left to the Choose button.
• The clustering tree structure is shown as a horizontal tree, where subclusters are aligned at
the same column. For example, cluster 1 (referred to in node 1) has three subclusters 2 (leaf 2), 3
(leaf 3) and 4 (leaf 4).
• The root cluster is 0. Each line with node 0 defines a subcluster of the root.
• The number in square brackets after node N represents the number of instances in the
parent cluster N.
• For example, in the above structure cluster 1 has 8 instances and its subclusters 2, 3 and 4
have 2, 3 and 3 instances correspondingly.
• To view the clustering tree right click on the last line in the result list window and then
select Visualize tree.
To evaluate the Cobweb clustering using the classes to clusters approach we need to know the
class values of the instances, belonging to the clusters. We can get this information from Weka in
the following way: After Weka finishes (with the class attribute ignored), right click on the last line
in the result list window. Then choose Visualize cluster assignments - you get the Weka cluster
visualize window. Here you can view the clusters, for example by putting Instance_number on X
and Cluster on Y. Then click on Save and choose a file name (*.arff). Weka saves the cluster
assignments in an ARFF file. Below is shown the file corresponding to the above Cobweb
clustering.
@relation weather_clustered
@attribute Instance_number numeric
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}
@attribute Cluster {cluster0,cluster1,cluster2,cluster3,cluster4,cluster5}
@data
0,sunny,85,85,FALSE,no,cluster3
1,sunny,80,90,TRUE,no,cluster5
To represent the cluster assignments Weka adds a new attribute Cluster and includes its
corresponding values at the end of each data line. Note that all other attributes are shown,
including the ignored ones (play, in this case). Also, only the leaf clusters are shown.
Now, to compute the classes to clusters error in, say, cluster 3 we look at the corresponding data
rows in the ARFF file and get the distribution of the class variable: {no, no, yes}. This means that
the majority class is no and the error is 1/3.
If we want to compute the error not only for leaf clusters, we need to look at the clustering
structure (the Visualize tree option helps here) and determine how the leaf clusters are combined in
other clusters at higher levels of the hierarchy. For example, at the top level we have two clusters -
1 and 5. We can get the class distribution of 5 directly from the data (because 5 is a leaf) - 3 yes's
and 3 no's. While for cluster 1 we need its subclusters - 2, 3 and 4. Summing up the class values
we get 6 yes's and 2 no's. Finally, the majority in cluster 1 is yes and in cluster 5 is no (could be
yes too) and the error (for the top level partitioning in two clusters) is 5/14.
Weka provides another approach to see the instances belonging to each cluster. When you visualize
the clustering tree, you can click on a node and then see the visualization of the instances falling
into the corresponding cluster (i.e. into the leafs of the subtree). This is a very useful feature,
however if you ignore an attribute (as we did with "play" in the experiments above) it does not show
in the visualization.
For each step, open the indicated file in the "Preprocess" window. Then, go to the "Attribute
Selection" window and set the "Attribute selection mode" to "Use full training set". For the case
mentioned below, perform attribute ranking using the following attribute selection methods with
default parameters:
a) InfoGainAttributeEval; and
b) GainRatioAttributeEval.
These attribute selection methods should consider only non-class dimensions (for each set, the
class attribute is indicated above the "Start" button). Record the output of each run in a text file.
a). Perform attribute ranking on the “contact-lenses.arff” data set using the two attribute ranking
methods with default parameters.
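Both rankers can also be run programmatically so that their outputs can be compared side by side. A minimal sketch, assuming contact-lenses.arff with the class as the last attribute:
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("contact-lenses.arff");  // assumed path
        data.setClassIndex(data.numAttributes() - 1);
        ASEvaluation[] evaluators = { new InfoGainAttributeEval(), new GainRatioAttributeEval() };
        for (ASEvaluation evaluator : evaluators) {
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(evaluator);
            selector.setSearch(new Ranker());          // full ranking on the full training set
            selector.SelectAttributes(data);
            System.out.println(evaluator.getClass().getSimpleName());
            System.out.println(selector.toResultsString());
        }
    }
}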
Evaluation
Once you have performed the experiments, you should spend some time evaluating your results. In
particular, try to answer at least the following questions: Why would one need attribute relevance
ranking? Do these attribute-ranking methods often agree or disagree? On which data set(s), if any,
these methods disagree? Does discretization and its method affect the results of attribute ranking?
Do missing values affect the results of attribute ranking? Record these and any other observations
in a Word file called “Observations.doc”.
Exercise 2
1. Fire up the Weka (Waikato Environment for Knowledge Analysis) software, launch the Explorer
window and select the "Preprocess" tab.
2. Open the iris dataset ("iris.arff"; this should be in the ./data/ directory of the Weka install).
What information do you have about the data set (e.g. number of instances, attributes and classes)?
What type of attributes does this dataset contain (nominal or numeric)? What are the classes in
this dataset? Which attribute has the greatest standard deviation? What does this tell you about
that attribute? (You might also find it useful to open "iris.arff" in a text editor.)
3. Under "Filter" choose the "Standardize" filter and apply it to all attributes. What does it do? How
does it affect the attributes' statistics? Click "Undo" to un-standardize the data and now apply the
"Normalize" filter to all the attributes. What does it do? How does it affect the
attributes' statistics? How does it differ from "Standardize"? Click "Undo" again to return the data
to its original state.
4. At the bottom right of the window there should be a graph which visualizes the dataset; making
sure "Class: class (Nom)" is selected in the drop-down box, click "Visualize All". What can you
interpret from these graphs? Which attribute(s) discriminate best between the classes in the
dataset? How do the "Standardize" and "Normalize" filters affect these graphs?
5. Under "Filter" choose the "AttributeSelection" filter. What does it do? Are the attributes it selects
the same as the ones you chose as discriminatory above? How does its behavior change as you alter
its parameters?
6. Select the "Visualize" tab. This shows you 2D scatter plots of each attribute against each other
attribute (similar to the F1 vs F2 plots from tutorial 1). Make sure the drop-down box at the bottom
says "Color: class (Nom)". Pay close attention to the plots between attributes you think discriminate
best between classes, and the plots between attributes selected by the "AttributeSelection" filter.
Can you verify from these plots whether your thoughts and the "AttributeSelection" filter are
correct? Which attributes are correlated?
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances
sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the
Department of Computer Science of the University of Waikato for use with the Weka machine
learning software.
Overview
ARFF files have two distinct sections. The first section is the Header information, which is followed
by the Data information.
The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns
in the data), and their types. An example header on the standard IRIS dataset looks like this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and @DATA declarations
are case insensitive.
The ARFF Header Section
The ARFF Header section of the file contains the relation declaration and attribute declarations.
The @relation Declaration
The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute
in the data set has its own @attribute statement which uniquely defines the name of that attribute
and its data type. The order in which the attributes are declared indicates the column position in the
data section of the file. For example, if an attribute is the third one declared, then Weka expects that
all of that attribute's values will be found in the third comma-delimited column.
The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If spaces are to be included in
the name then the entire name must be quoted.
The <datatype> can be any of the four types currently (version 3.2.1) supported by Weka:
• numeric
• <nominal-specification>
• string
• date [<date-format>]
where <nominal-specification> and <date-format> are defined below. The keywords numeric, string
and date are case insensitive.
Numeric attributes
Numeric attributes are declared with the keyword numeric.
Nominal attributes
Nominal values are defined by providing a <nominal-specification> listing the possible values:
{<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
For example, the class value of the Iris dataset can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
String attributes
String attributes allow us to create attributes containing arbitrary textual values. This is very useful
in text-mining applications, as we can create datasets with string attributes, then write Weka Filters
to manipulate strings (like StringToWordVectorFilter). String attributes are declared as follows:
@ATTRIBUTE LCC string
Date attributes
Date attributes are declared as follows:
@attribute <name> date [<date-format>]
where <name> is the name for the attribute and <date-format> is an optional string specifying how
date values should be parsed and printed (this is the same format used by SimpleDateFormat). The
default format string accepts the ISO-8601 combined date and time format: "yyyy-MM-
dd'T'HH:mm:ss".
The ARFF Data section of the file contains the data declaration line and the actual instance lines.
The @data declaration is a single line denoting the start of the data segment in the file. The format
is:
@data
Each instance is represented on a single line, with carriage returns denoting the end of the
instance.
Attribute values for each instance are delimited by commas. They must appear in the order that
they were declared in the header section (i.e. the data corresponding to the nth @attribute
declaration is always the nth field of the attribute).
Missing values are represented by a single question mark, as in:
@data
4.4,?,1.5,?,Iris-setosa
Values of string and nominal attributes are case sensitive, and any that contain space must be
quoted, as follows:
@relation LCCvsLCSH
@attribute LCC string
@attribute LCSH string
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'
Dates must be specified in the data section using the string representation specified in the
attribute declaration. For example:
@RELATION Timestamps
@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"
Sparse ARFF files are very similar to ARFF files, but data with value 0 are not explicitly
represented. Instead of representing each value in order, like this:
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
the non-zero attributes are explicitly identified by attribute number and their value stated, like this:
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
Each instance is surrounded by curly braces, and the format for each entry is: <index> <space>
<value> where index is the attribute index (starting from 0).
Note that the omitted values in a sparse instance are 0, they are not "missing" values! If a value is
unknown, you must explicitly represent it with a question mark (?).
Warning: There is a known problem saving SparseInstance objects from datasets that have string
attributes. In Weka, string and nominal data values are stored as numbers; these numbers act as
indexes into an array of possible attribute values (this is very efficient). However, the first string
value is assigned index 0: this means that, internally, this value is stored as a 0. When a
SparseInstance is written, string instances with internal value 0 are not output, so their string
value is lost (and when the arff file is read again, the default value 0 is the index of a different string
value, so the attribute value appears to change). To get around this problem, add a dummy string
value at index 0 that is never used whenever you declare string attributes that are likely to be used
in SparseInstance objects and saved as Sparse ARFF files.