An Introduction To The WEKA Data Mining System
Zdravko Markov
Central Connecticut State University
markovz@ccsu.edu
Ingrid Russell
University of Hartford
irussell@hartford.edu
Agenda
• Data Mining
• Weka Project
DBMS, OLAP and Data Mining: a comparison

Task
• DBMS: extraction of detailed and summary data
• OLAP: summaries, trends and forecasts
• Data Mining: knowledge discovery of hidden patterns and insights

Type of result
• DBMS: information
• OLAP: analysis
• Data Mining: insight and prediction

Method
• DBMS: deduction (ask the question, verify with data)
• OLAP: multidimensional data modeling, aggregation, statistics
• Data Mining: induction (build the model, apply it to new data, get the result)

Example question
• DBMS: Who purchased mutual funds in the last 3 years?
• OLAP: What is the average income of mutual fund buyers, by region, by year?
• Data Mining: Who will buy a mutual fund in the next 6 months, and why?
Example of DBMS, OLAP and Data Mining: Weather data
Assume we have recorded the weather conditions over a two-week period, along with a
tennis player's decision whether or not to play tennis on each particular day. Thus we
have generated tuples (also called examples or instances) consisting of values of four
independent variables (outlook, temperature, humidity, windy) and one dependent
variable (play).
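In ARFF format (the data format used by Weka) this is the standard weather dataset, weather.arff, distributed with Weka; the queries below can be checked against it:

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no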
Database Management System (DBMS)
• What was the temperature on the sunny days? {85, 80, 72, 69, 75}
• On which days was the humidity less than 75? {6, 7, 9, 11}
• On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
• On which days was the temperature greater than 70 and the humidity less than 75?
The intersection of the two sets above: {11}
OLAP: Multidimensional Model (Data Cube)
Dimensions:
• Time: Week 1={1, 2, 3, 4, 5, 6, 7}, Week 2={8, 9, 10, 11, 12, 13, 14}
• Outlook: {sunny, rainy, overcast}
Unit: play (yes/no)
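Filling the cube with the play decisions from the dataset above gives, per cell, the count of days played out of the total days in that cell:

           sunny    overcast   rainy
Week 1      0/2       2/2       2/3
Week 2      2/3       2/2       1/2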
The SIGKDD Service Award is the highest service award in the field of data mining and knowledge discovery. It is given
to one individual or one group who has performed significant service to the data mining and knowledge discovery
field, including professional volunteer services in disseminating technical information to the field, education, and
research funding.
The 2005 ACM SIGKDD Service Award is presented to the Weka team for their development of the freely available
Weka Data Mining Software, including the accompanying book Data Mining: Practical Machine Learning Tools and
Techniques (now in its second edition) and much other documentation.
The Weka team includes Ian H. Witten and Eibe Frank, and the following major contributors (in alphabetical order of
last names): Remco R. Bouckaert, John G. Cleary, Sally Jo Cunningham, Andrew Donkin, Dale Fletcher, Steve
Garner, Mark A. Hall, Geoffrey Holmes, Matt Humphrey, Lyn Hunt, Stuart Inglis, Ashraf M. Kibriya, Richard
Kirkby, Brent Martin, Bob McQueen, Craig G. Nevill-Manning, Bernhard Pfahringer, Peter Reutemann, Gabi
Schmidberger, Lloyd A. Smith, Tony C. Smith, Kai Ming Ting, Leonard E. Trigg, Yong Wang, Malcolm Ware, and
Xin Xu.
The Weka team has put a tremendous amount of effort into continuously developing and maintaining the system since
1994. The development of Weka was funded by a grant from the New Zealand Government's Foundation for
Research, Science and Technology.
There are 15 well-documented substantial projects that incorporate, wrap or extend Weka reported on SourceForge,
and no doubt many more that have not been reported.
Ian H. Witten and Eibe Frank also wrote the very popular book "Data Mining: Practical Machine Learning
Tools and Techniques" (now in its second edition), which seamlessly integrates the Weka system into the
teaching of data mining and machine learning. In addition, they provide excellent teaching material on the
book's website.
This book became one of the most popular textbooks for data mining and machine learning, and is very
frequently cited in scientific publications.
Weka is a landmark system in the history of the data mining and machine learning research communities,
because it is the only toolkit that has gained such widespread adoption and survived for an extended period
of time (the first version of Weka was released 11 years ago). Other data mining and machine learning
systems that have achieved this are individual systems, such as C4.5, not toolkits.
Since Weka is freely available for download and offers many powerful features (sometimes not found in
commercial data mining software), it has become one of the most widely used data mining systems. Weka
also became one of the favorite vehicles for data mining research and helped to advance it by making many
powerful features available to all.
In sum, the Weka team has made an outstanding contribution to the data mining field.
Now …
Machine Learning, Data and Web Mining
by Example
(“learning by doing” approach)
Client data:
• unemployed clients: s3, s10, s12
• loan is to buy a personal computer: s1, s2, s3, s4, s5, s6, s7, s8, s9, s10
• loan is to buy a car: s11, s12, s13, s14, s15, s16, s17, s18, s19, s20
• male clients: s6, s7, s8, s9, s10, s16, s17, s18, s19, s20
• not married: s1, s2, s5, s6, s7, s11, s13, s14, s16, s18
• live in problematic area: s3, s5
• age: s1=18, s2=20, s3=25, s4=40, s5=50, s6=18, s7=22, s8=28, s9=40, s10=50, s11=18, s12=20,
s13=25, s14=38, s15=50, s16=19, s17=21, s18=25, s19=38, s20=50
• money in a bank (x10000 yen): s1=20, s2=10, s3=5, s4=5, s5=5, s6=10, s7=10, s8=15, s9=20, s10=5,
s11=50, s12=50, s13=50, s14=150, s15=50, s16=50, s17=150, s18=150, s19=100, s20=50
• monthly pay (x10000 yen): s1=2, s2=2, s3=4, s4=7, s5=4, s6=5, s7=3, s8=4, s9=2, s10=4, s11=8,
s12=10, s13=5, s14=10, s15=15, s16=7, s17=3, s18=10, s19=10, s20=10
• months for the loan: s1=15, s2=20, s3=12, s4=12, s5=12, s6=8, s7=8, s8=10, s9=20, s10=12, s11=20,
s12=20, s13=20, s14=20, s15=20, s16=20, s17=20, s18=20, s19=20, s20=30
• years with the last employer: s1=1, s2=2, s3=0, s4=2, s5=25, s6=1, s7=4, s8=5, s9=15, s10=0, s11=1,
s12=2, s13=5, s14=15, s15=8, s16=2, s17=3, s18=2, s19=15, s20=2
Data preprocessing and visualization
Relations, attributes, tuples (instances)
http://www.cs.ccsu.edu/~markov/MDLclustering/data.zip
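As a concrete example, the client data above can be written as an ARFF relation. The sketch below is illustrative (the attribute names are not necessarily those used in the distributed files); only instance s1 is shown:

@relation clients
% bank_money and monthly_pay are in units of 10000 yen
@attribute unemployed {yes, no}
@attribute purpose {computer, car}
@attribute sex {male, female}
@attribute married {yes, no}
@attribute problematic_area {yes, no}
@attribute age numeric
@attribute bank_money numeric
@attribute monthly_pay numeric
@attribute loan_months numeric
@attribute years_employed numeric
@data
no,computer,female,no,no,18,20,2,15,1
% s2 through s20 follow the same pattern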
Data preprocessing and visualization
Department data: Create ARFF file
http://www.cs.ccsu.edu/~markov/MDLclustering/MDL.jar
Data preprocessing and visualization
Department data: Create ARFF file in string format (using SimpleCLI)
1. Create file deptA from the files in folder data/departments/A, with class label A:
java ARFFstring data/departments/A A deptA
2. Create file deptB from the files in folder data/departments/B, with class label B:
java ARFFstring data/departments/B B deptB
• Preprocess.html
• Visualization.html
Attribute Selection
Finding a minimal set of attributes that preserves the class distribution
Attribute relevance with respect to the class – an example of an irrelevant attribute (accounting)
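Attribute selection can also be run programmatically. A minimal sketch using Weka's attribute selection API, with the CFS subset evaluator and best-first search (the file name is a placeholder):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributes {
    public static void main(String[] args) throws Exception {
        // load a dataset (placeholder file name); the class is the last attribute
        Instances data = DataSource.read("departments.arff");
        data.setClassIndex(data.numAttributes() - 1);
        // evaluate attribute subsets by their correlation with the class (CFS),
        // searching the space of subsets with best-first search
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new BestFirst());
        selector.SelectAttributes(data);
        // print the indices of the selected attributes
        System.out.println(java.util.Arrays.toString(selector.selectedAttributes()));
    }
}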
• Attribute Selection.html
Classification – creating models (hypotheses)
Mapping (independent attributes -> class)
Classification – creating models (hypotheses)
Inferring one-attribute rules - OneR
ID3, C4.5, J48 (Weka): Select the attribute that minimizes the class
entropy in the split.
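Spelled out: if S is the set of instances at a node and p_c is the proportion of class c in S, the class entropy is H(S) = − Σ_c p_c log2(p_c). Splitting on an attribute A with values v partitions S into subsets S_v, and the attribute chosen is the one minimizing the weighted entropy Σ_v (|S_v|/|S|) H(S_v), i.e. maximizing the information gain H(S) − Σ_v (|S_v|/|S|) H(S_v).

The same tree built in the Explorer can be built from code. A minimal sketch using Weka's Java API (J48 is Weka's implementation of C4.5; weather.arff is assumed to be in the working directory):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildTree {
    public static void main(String[] args) throws Exception {
        // load the weather data; the class is the last attribute (play)
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1);
        // build a C4.5-style decision tree
        J48 tree = new J48();
        tree.buildClassifier(data);
        // print the induced tree in text form
        System.out.println(tree);
    }
}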
Classification – numeric attributes
weather.arff
Classification – predicting class
Click on Set… Click on Open file…
Classification – predicting class
Right click on the highlighted line in Result list and choose Visualize classifier errors
Click on the square
Classification – predicting class
Click on Save
Classification
Student Projects
• Classification.html
Prediction (no model, lazy learning)
• K-nearest neighbor (IBk): take the class of the nearest neighbor, or the majority class among the K nearest neighbors
test instance (nominal version of the weather data): (sunny, cool, high, TRUE, ?)
K=1 -> no
K=3 -> no
K=5 -> yes
K=14 -> yes (Majority predictor, ZeroR)
X                  2    8    9    11   12   …   10
Distance(test,X)   1    2    2    2    2    …   4
play               no   no   yes  yes  yes  …   yes
• Distance is calculated as the number of different attribute values
• Euclidean distance for numeric attributes
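A minimal sketch of the same prediction with Weka's IBk, assuming a recent Weka release (3.7+, where DenseInstance replaced the older Instance constructor) and the nominal weather data file:

import weka.classifiers.lazy.IBk;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnPredict {
    public static void main(String[] args) throws Exception {
        // load the nominal weather data; the class is the last attribute (play)
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);
        // k-nearest-neighbor classifier with K = 3
        IBk knn = new IBk(3);
        knn.buildClassifier(data);
        // build the test instance (sunny, cool, high, TRUE, ?); class left missing
        Instance test = new DenseInstance(data.numAttributes());
        test.setDataset(data);
        test.setValue(data.attribute("outlook"), "sunny");
        test.setValue(data.attribute("temperature"), "cool");
        test.setValue(data.attribute("humidity"), "high");
        test.setValue(data.attribute("windy"), "TRUE");
        // predict and print the class label
        double label = knn.classifyInstance(test);
        System.out.println(data.classAttribute().value((int) label));
    }
}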
Prediction (no model, lazy learning)
Prediction
Student Projects
• Prediction.html
Model evaluation – holdout (percentage split)
Model evaluation – cross validation
Model evaluation – leave one out cross validation
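All three evaluation schemes are also available programmatically. A minimal sketch of 10-fold cross-validation with Weka's Evaluation class (leave-one-out is the special case where the number of folds equals the number of instances):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1);
        // 10-fold cross-validation of a J48 tree;
        // pass data.numInstances() as the fold count for leave-one-out
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}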
Model evaluation – confusion (contingency) matrix
                 predicted
                 yes   no
actual   yes      3     1
         no       1     0

                 predicted
                 yes   no
actual   yes     TP    FN
         no      FP    TN
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
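For the matrix above, TP = 3, FN = 1, FP = 1 and TN = 0, giving Precision = 3/(3+1) = 0.75 and Recall = 3/(3+1) = 0.75.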
Model evaluation
Student Projects
• Evaluation.html
Clustering – k-means
Click on Ignore attributes
Clustering – classes to clusters evaluation
Right click on Result list, select Visualize cluster assignments
Click on Save
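The same k-means clustering can be run from Java. A minimal sketch that first removes the class attribute (the programmatic counterpart of Ignore attributes) and then builds a 2-cluster model:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class Cluster {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        // drop the class attribute (play, the last one), as with Ignore attributes
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances input = Filter.useFilter(data, remove);
        // k-means with 2 clusters
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);
        km.buildClusterer(input);
        // print cluster centroids and sizes
        System.out.println(km);
    }
}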
Clustering
Student Projects
• Clustering.html
Association Rules (A => B)
• Confidence (accuracy): P(B|A) = (# of tuples containing both A and B) / (# of tuples containing A)
• Support (coverage): P(A,B) = (# of tuples containing both A and B) / (total # of tuples)
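A minimal sketch of mining association rules with Weka's Apriori implementation (all-nominal data such as weather.nominal.arff is assumed; default minimum support and confidence thresholds):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MineRules {
    public static void main(String[] args) throws Exception {
        // association rule mining needs nominal attributes and no class
        Instances data = DataSource.read("weather.nominal.arff");
        Apriori apriori = new Apriori();
        apriori.buildAssociations(data);
        // prints the best rules with their support and confidence
        System.out.println(apriori);
    }
}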
Association Rules
Student Projects
• Association.html
Document classification and clustering
Predict the class of the Theatre document
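To classify text documents such as these, the string attribute must first be turned into word features. A minimal sketch with Weka's StringToWordVector filter (deptStrings.arff is a placeholder name for the string-format ARFF file produced by the ARFFstring steps above):

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ClassifyDocs {
    public static void main(String[] args) throws Exception {
        // placeholder name for the ARFF file built from the department documents;
        // it has one string attribute (the text) and one nominal class attribute
        Instances raw = DataSource.read("deptStrings.arff");
        raw.setClassIndex(raw.numAttributes() - 1);
        // convert the string attribute into word-based attributes
        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, s2wv);
        // train a classifier on the word vectors
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(vectors);
        System.out.println(nb);
    }
}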