DWM Exp6 C49

LAB Manual

PART A
(PART A: TO BE REFERRED BY STUDENTS)

Experiment No.06
A.1 Aim:
Perform pre-processing on data and implement the Decision Tree algorithm using the R tool or WEKA.

A.2 Prerequisite:
Familiarity with the WEKA tool.

A.3 Outcome:
After successful completion of this experiment, students will be able to use classification and clustering algorithms of data mining.

A.4 Theory:

Preprocessing:

Data have quality if they satisfy the requirements of the intended use.
There are many factors comprising data quality, including accuracy,
completeness, consistency, timeliness, believability, and interpretability.
Major Tasks in Data Preprocessing:
In this section, we look at the major steps involved in data preprocessing,
namely, data cleaning, data integration, data reduction, and data
transformation.
Data cleaning routines work to “clean” the data by filling in missing
values, smoothing noisy data, identifying or removing outliers, and
resolving inconsistencies. If users believe the data are dirty, they are
unlikely to trust the results of any data mining that has been applied.
Furthermore, dirty data can cause confusion for the mining procedure,
resulting in unreliable output. Although most mining routines have some
procedures for dealing with incomplete or noisy data, they are not always
robust. Instead, they may concentrate on avoiding overfitting the data to
the function being modeled. Therefore, a useful preprocessing step is to
run your data through some data cleaning routines.
Data reduction obtains a reduced representation of the data set that is
much smaller in volume, yet produces the same (or almost the same)
analytical results. Data reduction strategies include dimensionality
reduction and numerosity reduction.
In dimensionality reduction, data encoding schemes are applied so as
to obtain a reduced or “compressed” representation of the original data.
Examples include data compression techniques (e.g., wavelet transforms
and principal components analysis), attribute subset selection (e.g.,
removing irrelevant attributes), and attribute construction (e.g., where a
small set of more useful attributes is derived from the original set).
In numerosity reduction, the data are replaced by alternative, smaller
representations using parametric models (e.g., regression or log-linear
models) or nonparametric models (e.g., histograms, clusters, sampling, or
data aggregation). Discretization and concept hierarchy generation are
powerful tools for data mining in that they allow data mining at multiple
abstraction levels. Normalization, data discretization, and concept
hierarchy generation are forms of data transformation.
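As a small illustration of one such data transformation, the sketch below applies min-max normalization to a single numeric attribute. The salary values and the target range [0, 1] are assumptions made only for this example.

// Minimal sketch of min-max normalization to [0, 1] for one numeric attribute.
// The sample salary values are illustrative, not taken from the lab data set.
public class MinMaxNormalize {
    public static void main(String[] args) {
        double[] salary = {10000, 15000, 17000, 20000, 25000, 30000, 32000, 35000};
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : salary) { min = Math.min(min, v); max = Math.max(max, v); }
        for (double v : salary) {
            // v' = (v - min) / (max - min) maps the attribute onto [0, 1]
            double normalized = (v - min) / (max - min);
            System.out.printf("%.0f -> %.3f%n", v, normalized);
        }
    }
}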
Data Cleaning:
Real-world data tend to be incomplete, noisy, and inconsistent. Data
cleaning (or data cleansing) routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct inconsistencies in
the data.
Dealing with Missing Values:
1. Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not very
effective, unless the tuple contains several attributes with missing values.
It is especially poor when the percentage of missing values per attribute
varies considerably. By ignoring the tuple, we do not make use of the
remaining attributes’ values in the tuple. Such data could have been
useful to the task at hand.
2. Fill in the missing value manually: In general, this approach is time
consuming and may not be feasible given a large data set with many
missing values.
3. Use a global constant to fill in the missing value: Replace all
missing attribute values by the same constant such as a label like
“Unknown” or -∞. If missing values are replaced by, say, “Unknown,” then
the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common—that of “Unknown.”
Hence, although this method is simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: Measures of central tendency indicate the "middle" value of a data distribution. For normal (symmetric) data distributions the mean can be used, while skewed data distributions should employ the median (see the code sketch after this list).
5. Use the attribute mean or median for all samples belonging to
the same class as the given tuple: For example, if classifying
customers according to credit risk, we may replace the missing value with
the mean income value for customers in the same credit risk category as
that of the given tuple. If the data distribution for a given class is skewed,
the median value is a better choice.
6. Use the most probable value to fill in the missing value: This
may be determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction.
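A minimal sketch of method 4 above, assuming missing numeric values are marked with Double.NaN and using illustrative income figures (substitute the median for the mean when the distribution is skewed):

import java.util.Arrays;

// Minimal sketch of mean imputation: fill missing values of a numeric
// attribute with the mean computed over the observed values.
// Double.NaN stands in for a missing value; the numbers are illustrative.
public class MeanImputation {
    public static void main(String[] args) {
        double[] income = {25000, Double.NaN, 31000, 28000, Double.NaN, 40000};

        // Compute the mean over the non-missing values only.
        double sum = 0; int count = 0;
        for (double v : income) {
            if (!Double.isNaN(v)) { sum += v; count++; }
        }
        double mean = sum / count;

        // Replace every missing value with that mean.
        for (int i = 0; i < income.length; i++) {
            if (Double.isNaN(income[i])) income[i] = mean;
        }
        System.out.println("Mean used: " + mean);
        System.out.println(Arrays.toString(income));
    }
}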
Dealing with Noise:
Noise is a random error or variance in a measured variable.
Binning: Binning methods smooth a sorted data value by consulting its
“neighbourhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning methods
consult the neighbourhood of values, they perform local smoothing. In
smoothing by bin means, each value in a bin is replaced by the mean
value of the bin. Similarly, smoothing by bin medians can be employed, in
which each bin value is replaced by the bin median. In smoothing by bin
boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the
closest boundary value.
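A minimal sketch of smoothing by bin means, assuming equal-frequency bins of size 4 and illustrative price values (in the spirit of the classic textbook example):

import java.util.Arrays;

// Minimal sketch of smoothing by bin means: sort the values, split them into
// equal-frequency (equal-depth) bins, and replace each value by its bin mean.
public class BinMeanSmoothing {
    public static void main(String[] args) {
        double[] prices = {8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34};
        Arrays.sort(prices);          // binning works on sorted values
        int binSize = 4;              // assumed equal-frequency bin size

        for (int start = 0; start < prices.length; start += binSize) {
            int end = Math.min(start + binSize, prices.length);
            double mean = 0;
            for (int i = start; i < end; i++) mean += prices[i];
            mean /= (end - start);
            for (int i = start; i < end; i++) prices[i] = mean;  // local smoothing
        }
        System.out.println(Arrays.toString(prices));
    }
}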
Regression: Data smoothing can also be done by regression, a technique
that conforms data values to a function. Linear regression involves finding
the “best” line to fit two attributes (or variables) so that one attribute can
be used to predict the other. Multiple linear regression is an extension of
linear regression, where more than two attributes are involved and the
data are fit to a multidimensional surface.
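A minimal sketch of smoothing by simple linear regression: fit y = a + b*x by least squares and replace each observed value with its fitted value. The x/y pairs are illustrative only.

// Minimal sketch of data smoothing by simple linear regression.
public class RegressionSmoothing {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6};
        double[] y = {2.1, 2.9, 4.2, 3.8, 5.1, 6.3};

        // Least-squares estimates: b = sum((x-mx)(y-my)) / sum((x-mx)^2), a = my - b*mx
        double mx = 0, my = 0;
        for (int i = 0; i < x.length; i++) { mx += x[i]; my += y[i]; }
        mx /= x.length; my /= y.length;

        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - mx) * (y[i] - my);
            den += (x[i] - mx) * (x[i] - mx);
        }
        double b = num / den, a = my - b * mx;

        // Replace each y with the value predicted by the fitted line.
        for (int i = 0; i < x.length; i++) {
            System.out.printf("x=%.0f  y=%.1f  smoothed=%.2f%n", x[i], y[i], a + b * x[i]);
        }
    }
}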
Outlier analysis: Outliers may be detected by clustering, for example,
where similar values are organized into groups, or “clusters.” Intuitively,
values that fall outside of the set of clusters may be considered outliers.

Decision tree:
A decision tree is a tree in which each branch (internal) node represents a choice among a number of alternatives and each leaf node represents a decision. Decision trees are commonly used to support decision-making: a decision tree is a tree-structured plan of a set of attributes to test in order to predict the output. To decide which attribute should be tested first, we simply find the one with the highest information gain. Decision trees can produce human-readable descriptions of trends in the underlying relationships of a dataset and can be used for both classification and prediction tasks. Common decision tree algorithms include ID3, C4.5, and CART.

Advantages of decision tree:

1. Simple to understand and interpret.

2. Requires little data preparation.

3. Able to handle both numerical and categorical data.

4. Performs well on large data sets in a short time.

ID3 algorithm:

ID3 is a simple decision tree learning algorithm developed by Ross Quinlan (1983). The basic idea of the ID3 algorithm is to construct the decision tree by employing a top-down, greedy search through the given sets, testing each attribute at every tree node. In order to select the attribute that is most useful for classifying a given set, we introduce a metric: information gain. The measure from information theory that ID3 uses in decision tree construction is entropy. Informally, the entropy of a dataset can be thought of as how disordered it is. Entropy is related to information in the sense that the higher the entropy, or uncertainty, of some data, the more information is required in order to completely describe that data. In building a decision tree, we aim to decrease the entropy of the dataset until we reach leaf nodes, at which point the subset we are left with is pure (has zero entropy) and represents instances all of one class (all instances have the same value for the target attribute). We measure the entropy of a dataset S with respect to one attribute, in this case the target attribute, with the following calculation:

Entropy(S) = - Σi pi log2(pi)

where pi is the proportion of instances in the dataset that take the ith value of the target attribute.

Information gain measures the expected reduction in entropy caused by partitioning the dataset on an attribute A; the higher the information gain, the greater the expected reduction in entropy:

Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) Entropy(Sv)

where v ranges over the values of attribute A, Sv is the subset of instances of S for which A takes the value v, and |Sv| and |S| are the numbers of instances in Sv and S respectively.
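The following sketch computes both quantities for a small illustrative split: a dataset of 14 instances (9 of one class, 5 of the other) and a hypothetical attribute A that partitions it into three subsets. The counts are assumptions chosen only to make the arithmetic concrete.

import java.util.Arrays;

// Computes Entropy(S) and Gain(S, A) from class counts, following the
// formulas above. The counts are illustrative, not taken from the lab data.
public class InfoGainDemo {

    // Entropy(S) = -sum_i p_i * log2(p_i), computed from class counts.
    static double entropy(int... classCounts) {
        int total = Arrays.stream(classCounts).sum();
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;              // 0 * log2(0) is taken as 0
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        // Whole dataset S: 9 positive and 5 negative instances.
        double entropyS = entropy(9, 5);

        // Hypothetical attribute A with three values, splitting S into subsets
        // with the class counts below (sizes 5, 4 and 5).
        int[][] subsets = {{2, 3}, {4, 0}, {3, 2}};
        int totalInstances = 14;
        double expected = 0.0;
        for (int[] s : subsets) {
            int size = s[0] + s[1];
            expected += ((double) size / totalInstances) * entropy(s);
        }

        // Gain(S, A) = Entropy(S) - sum_v (|Sv| / |S|) * Entropy(Sv)
        System.out.printf("Entropy(S) = %.3f%n", entropyS);
        System.out.printf("Gain(S, A) = %.3f%n", entropyS - expected);
    }
}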
Steps of the ID3(Examples, Target, Attributes) algorithm (a code sketch of these steps follows the list):

Create a root node.

1. If all Examples have the same Target value, give the root this label.

2. Else if Attributes is empty, label the root with the most common Target value in Examples.

3. Else begin

3.1 Calculate the information gain for each attribute, according to the average entropy formula.

3.2 Select the attribute, A, with the lowest average entropy (highest information gain) and make this the attribute tested at the root.

3.3 For each possible value, v, of this attribute:

3.3.1 Add a new branch below the root, corresponding to A = v.

3.3.2 Let Examples(v) be those examples with A = v.

3.3.3 If Examples(v) is empty, make the new branch a leaf node labelled with the most common Target value among Examples.

3.3.4 Else let the new branch be the tree created by ID3(Examples(v), Target, Attributes - {A}).

4. End.
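The sketch below follows these steps for nominal attributes only: each example is a map from attribute name to value, the attribute with the lowest average entropy (highest gain) is chosen at every node, and the recursion stops at pure or attribute-exhausted subsets. It performs no pruning, handles no numeric attributes, and only branches on values actually present in the data (so the empty-branch case of step 3.3.3 does not arise); it is an illustration of the algorithm, not a substitute for Weka's J48. The tiny data set in main is assumed for demonstration.

import java.util.*;

// Compact, illustrative ID3 recursion over nominal attributes.
public class ID3Sketch {

    static double entropy(List<Map<String, String>> examples, String target) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, String> e : examples) counts.merge(e.get(target), 1, Integer::sum);
        double h = 0.0, n = examples.size();
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    static String majorityValue(List<Map<String, String>> examples, String target) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, String> e : examples) counts.merge(e.get(target), 1, Integer::sum);
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    static Map<String, List<Map<String, String>>> partition(
            List<Map<String, String>> examples, String attr) {
        Map<String, List<Map<String, String>>> parts = new HashMap<>();
        for (Map<String, String> e : examples)
            parts.computeIfAbsent(e.get(attr), k -> new ArrayList<>()).add(e);
        return parts;
    }

    // Returns a textual tree: leaf labels or nested "attribute = value" branches.
    static String id3(List<Map<String, String>> examples, String target,
                      Set<String> attributes, String indent) {
        // Step 1: all examples share one target value -> leaf.
        if (entropy(examples, target) == 0.0)
            return indent + "leaf: " + examples.get(0).get(target) + "\n";
        // Step 2: no attributes left -> leaf with the most common target value.
        if (attributes.isEmpty())
            return indent + "leaf: " + majorityValue(examples, target) + "\n";

        // Steps 3.1-3.2: choose the attribute with the lowest average entropy.
        String best = null;
        double bestAvg = Double.MAX_VALUE;
        for (String a : attributes) {
            double avg = 0.0;
            for (List<Map<String, String>> sub : partition(examples, a).values())
                avg += (sub.size() / (double) examples.size()) * entropy(sub, target);
            if (avg < bestAvg) { bestAvg = avg; best = a; }
        }

        // Step 3.3: one branch per observed value of the chosen attribute, then recurse.
        Set<String> remaining = new HashSet<>(attributes);
        remaining.remove(best);
        StringBuilder tree = new StringBuilder();
        for (Map.Entry<String, List<Map<String, String>>> branch : partition(examples, best).entrySet()) {
            tree.append(indent).append(best).append(" = ").append(branch.getKey()).append("\n");
            tree.append(id3(branch.getValue(), target, remaining, indent + "  "));
        }
        return tree.toString();
    }

    public static void main(String[] args) {
        // Tiny data set in the spirit of employee.arff (values assumed for illustration).
        String[][] rows = {{"25", "10k", "poor"}, {"27", "15k", "poor"}, {"29", "20k", "avg"},
                           {"30", "25k", "avg"}, {"35", "32k", "good"}, {"48", "32k", "good"}};
        List<Map<String, String>> data = new ArrayList<>();
        for (String[] r : rows) {
            Map<String, String> e = new HashMap<>();
            e.put("age", r[0]); e.put("salary", r[1]); e.put("performance", r[2]);
            data.add(e);
        }
        System.out.print(id3(data, "performance", new HashSet<>(Arrays.asList("age", "salary")), ""));
    }
}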

Advantages of ID3:

1. Can be used to predict (classify) new data.

2. Understandable prediction rules are created from the training data.

3. Builds a short tree relatively fast.

4. Only needs to test enough attributes until all the data are classified.

5. Finding leaf nodes enables test data to be pruned, reducing the number of tests.

Disadvantages of ID3:

1. Data may be over-fitted or over-classified if a small sample is tested.

2. Only one attribute at a time is tested for making a decision.

3. Numerous trees may be needed when dealing with continuous data.

J48 classifier in Weka:

This experiment illustrates the use of the J48 classifier in Weka. The sample data set used in this experiment is the "employee" data, available in ARFF format. This document assumes that appropriate data preprocessing has been performed.

Steps involved in this experiment (a sketch of the same workflow through Weka's Java API follows the steps):

Step 1: We begin the experiment by loading the data (employee.arff) into Weka.
Step 2: Next, we select the "Classify" tab and click the "Choose" button to select the "J48" classifier.
Step 3: Now we specify the various parameters. These can be set by clicking in the text box to the right of the Choose button. In this example, we accept the default values. The default configuration does perform some pruning but does not perform reduced-error pruning.
Step 4: Under "Test options" in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable estimate of the accuracy of the generated model.
Step 5: We now click "Start" to generate the model. The text (ASCII) version of the tree as well as the evaluation statistics will appear in the right panel when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed (either in preprocessing or in selecting better parameters for the classification).
Step 7: Weka also lets us view a graphical version of the classification tree. This can be done by right-clicking the last entry in the result list and selecting "Visualize tree" from the pop-up menu.
Step 8: We can now use our model to classify new instances.
Step 9: In the main panel, under "Test options", click the "Supplied test set" radio button and then click the "Set..." button. This opens a window that allows you to choose the file containing the test instances.
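The same workflow can also be run programmatically through Weka's Java API (with weka.jar on the classpath). The sketch below loads the ARFF file, builds a J48 tree with its default options, and evaluates it with 10-fold cross-validation; the file name employee.arff is assumed to be in the working directory.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch of the GUI steps above using Weka's Java API: load the ARFF file,
// build a J48 tree with default options, and run 10-fold cross-validation.
public class J48Experiment {
    public static void main(String[] args) throws Exception {
        DataSource source = new DataSource("employee.arff");    // assumed file name/path
        Instances data = source.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);            // last attribute is the class

        J48 tree = new J48();                                    // default pruning settings
        tree.buildClassifier(data);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));  // 10-fold cross-validation
        System.out.println(tree);                                // text version of the tree
        System.out.println(eval.toSummaryString());              // accuracy and error statistics
    }
}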

Data set employee.arff:


@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary {10k, 15k, 17k, 20k, 25k, 30k, 32k, 34k, 35k}
@attribute performance {good, avg, poor}
@data
%
25, 10k, poor
27, 15k, poor
27, 17k, poor
28, 17k, poor
29, 20k, avg
30, 25k, avg
29, 25k, avg
30, 20k, avg
35, 32k, good
48, 34k, good
48, 32k,good
%
The following screenshot shows the classification rules that were generated when the J48 algorithm is applied to the given dataset.
PART B
(PART B: TO BE COMPLETED BY STUDENTS)

(Students must submit the soft copy as per the following segments within two hours of the practical. The soft copy must be uploaded to Blackboard or emailed to the concerned lab in-charge faculty at the end of the practical in case there is no Blackboard access available.)

B.1 Software Code written by student:


(Paste your problem statement related to your case study completed during
the 2 hours of practical in the lab here)
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
%
%
%
B.2 Input and Output:
(Paste the classifier output and the decision tree visualization obtained for your case study in the following format.)
B.3 Observations and learning:
(Students are expected to comment on the output obtained with clear
observations and learning for each task/ sub part assigned)
In this experiment, we used the J48 classifier in the Weka tool. The data is already available in ARFF format, and appropriate data preprocessing has been performed.
B.4 Conclusion:
(Students must write the conclusion as per the attainment of individual
outcome listed above and learning/observation noted in section B.3)
After completing this experiment, we are able to use the J48 classifier in the Weka tool to build and evaluate a decision tree.
B.5 Question of Curiosity
(To be answered by student based on the practical performed and
learning/observations)
Q1: Draw the tree according to the classifier output and answer the following questions:

1. What is the depth of the tree?
Ans: 2

2. How many leaf nodes are there in the tree?
Ans: 4

3. How many tree nodes are there?
Ans: 5
