DWM Exp6 C49
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No.06
A.1 Aim:
Perform pre-processing on data and implement the Decision Tree
algorithm using R-tool or WEKA.
A.2 Prerequisite:
Familiarity with the WEKA tool.
A.3 Outcome:
After successful completion of this experiment, students will be able to
use classification and clustering algorithms of data mining.
A.4 Theory:
Preprocessing:
Data have quality if they satisfy the requirements of the intended use.
There are many factors comprising data quality, including accuracy,
completeness, consistency, timeliness, believability, and interpretability.
Major Tasks in Data Preprocessing:
In this section, we look at the major steps involved in data preprocessing,
namely, data cleaning, data integration, data reduction, and data
transformation.
Data cleaning routines work to “clean” the data by filling in missing
values, smoothing noisy data, identifying or removing outliers, and
resolving inconsistencies. If users believe the data are dirty, they are
unlikely to trust the results of any data mining that has been applied.
Furthermore, dirty data can cause confusion for the mining procedure,
resulting in unreliable output. Although most mining routines have some
procedures for dealing with incomplete or noisy data, they are not always
robust. Instead, they may concentrate on avoiding overfitting the data to
the function being modeled. Therefore, a useful preprocessing step is to
run your data through some data cleaning routines.
Data reduction obtains a reduced representation of the data set that is
much smaller in volume, yet produces the same (or almost the same)
analytical results. Data reduction strategies include dimensionality
reduction and numerosity reduction.
In dimensionality reduction, data encoding schemes are applied so as
to obtain a reduced or “compressed” representation of the original data.
Examples include data compression techniques (e.g., wavelet transforms
and principal components analysis), attribute subset selection (e.g.,
removing irrelevant attributes), and attribute construction (e.g., where a
small set of more useful attributes is derived from the original set).
In numerosity reduction, the data are replaced by alternative, smaller
representations using parametric models (e.g., regression or log-linear
models) or nonparametric models (e.g., histograms, clusters, sampling, or
data aggregation). Discretization and concept hierarchy generation are
powerful tools for data mining in that they allow data mining at multiple
abstraction levels. Normalization, data discretization, and concept
hierarchy generation are forms of data transformation.
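As a simple illustration of these transformations, the following R sketch
(assuming an arbitrary numeric attribute and R's built-in iris data set)
shows min-max normalization, equal-width discretization, and dimensionality
reduction with principal components analysis:

x <- c(200, 300, 400, 600, 1000)           # example numeric attribute

# Min-max normalization to the range [0, 1]
x_norm <- (x - min(x)) / (max(x) - min(x))

# Discretization into 3 equal-width intervals (a simple concept hierarchy)
x_disc <- cut(x, breaks = 3, labels = c("low", "medium", "high"))

# Dimensionality reduction with principal components analysis
pca <- prcomp(iris[, 1:4], scale. = TRUE)  # iris ships with base R
summary(pca)                               # variance explained per component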
Data Cleaning:
Real-world data tend to be incomplete, noisy, and inconsistent. Data
cleaning (or data cleansing) routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct inconsistencies in
the data.
Dealing with Missing Values
1. Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not very
effective, unless the tuple contains several attributes with missing values.
It is especially poor when the percentage of missing values per attribute
varies considerably. By ignoring the tuple, we do not make use of the
remaining attributes’ values in the tuple. Such data could have been
useful to the task at hand.
2. Fill in the missing value manually: In general, this approach is time
consuming and may not be feasible given a large data set with many
missing values.
3. Use a global constant to fill in the missing value: Replace all
missing attribute values by the same constant such as a label like
“Unknown” or -∞. If missing values are replaced by, say, “Unknown,” then
the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common—that of “Unknown.”
Hence, although this method is simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value: Measures of central
tendency indicate the “middle” value of a data distribution. For normal
(symmetric) data distributions, the mean can be used, while skewed data
distributions should employ the median (see the R sketch after this list).
5. Use the attribute mean or median for all samples belonging to
the same class as the given tuple: For example, if classifying
customers according to credit risk, we may replace the missing value with
the mean income value for customers in the same credit risk category as
that of the given tuple. If the data distribution for a given class is skewed,
the median value is a better choice.
6. Use the most probable value to fill in the missing value: This
may be determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction.
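The following R sketch illustrates strategies 4 and 5 on a small
hypothetical income/credit-risk table (the column names are made up for
the example):

df <- data.frame(income = c(30, 45, NA, 60, NA, 80),
                 risk   = c("low", "low", "low", "high", "high", "high"))

# (4) Fill with the attribute mean (use the median for skewed distributions)
df$income_mean <- ifelse(is.na(df$income),
                         mean(df$income, na.rm = TRUE), df$income)

# (5) Fill with the mean of samples in the same class (credit-risk category)
class_means <- ave(df$income, df$risk,
                   FUN = function(v) mean(v, na.rm = TRUE))
df$income_by_class <- ifelse(is.na(df$income), class_means, df$income)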
Dealing with Noise:
Noise is a random error or variance in a measured variable.
Binning: Binning methods smooth a sorted data value by consulting its
“neighbourhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning methods
consult the neighbourhood of values, they perform local smoothing. In
smoothing by bin means, each value in a bin is replaced by the mean
value of the bin. Similarly, smoothing by bin medians can be employed, in
which each bin value is replaced by the bin median. In smoothing by bin
boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the
closest boundary value.
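A short R sketch of equal-depth (equal-frequency) binning with smoothing
by bin means, using an illustrative sorted list of prices:

price <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))

bins <- split(price, rep(1:3, each = 3))            # 3 bins of depth 3
smoothed <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
smoothed
# Smoothing by bin boundaries would instead replace each value by the
# nearer of min(b) and max(b).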
Regression: Data smoothing can also be done by regression, a technique
that conforms data values to a function. Linear regression involves finding
the “best” line to fit two attributes (or variables) so that one attribute can
be used to predict the other. Multiple linear regression is an extension of
linear regression, where more than two attributes are involved and the
data are fit to a multidimensional surface.
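A short R sketch of smoothing by simple linear regression (the variable
names and noise level are illustrative):

x <- 1:20
y <- 2 * x + 5 + rnorm(20, sd = 3)    # attribute y with random noise

fit <- lm(y ~ x)                      # fit the "best" line of y on x
y_smoothed <- fitted(fit)             # replace y by values on the line
# Multiple linear regression simply adds more predictors: lm(y ~ x1 + x2)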
Outlier analysis: Outliers may be detected by clustering, for example,
where similar values are organized into groups, or “clusters.” Intuitively,
values that fall outside of the set of clusters may be considered outliers.
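A short R sketch of flagging outliers with k-means clustering; the
distance cutoff used here is an arbitrary illustrative choice:

set.seed(1)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 10), 30)   # 30 is an outlier
km <- kmeans(matrix(x, ncol = 1), centers = 2)

# Distance of each value from the centre of its own cluster
dist_to_centre <- abs(x - km$centers[km$cluster])
outliers <- x[dist_to_centre > 3 * sd(dist_to_centre)]
outliers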
Decision tree:
A decision tree is a tree in which each branch node represents a
choice between a number of alternatives, and each leaf node represents a
decision. Decision trees are commonly used for gaining information for the
purpose of decision-making. A Decision Tree is a tree-structured plan of a
set of attributes to test in order to predict the output. To decide which
attribute should be tested first, simply find the one with the highest
information gain. They are able to produce human-readable descriptions
of trends in the underlying relationships of a dataset and can be used for
classification and prediction tasks. Common decision tree algorithms
include ID3, C4.5, and CART.
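As a quick illustration in R, a decision tree can be trained on the
built-in iris data set with the rpart package (which builds CART-style
trees); in WEKA the equivalent step is running the J48 (C4.5) classifier
from the Explorer, or through the RWeka package in R:

library(rpart)                        # install.packages("rpart") if needed

model <- rpart(Species ~ ., data = iris, method = "class")
print(model)                          # human-readable tree

pred <- predict(model, iris, type = "class")
table(Predicted = pred, Actual = iris$Species)   # confusion matrix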
ID3 algorithm:
ID3 builds the tree top-down, choosing as the splitting attribute at each
node the attribute with the highest information gain, computed from the
entropy of the data set:
Entropy(S) = - Σ Pi log2(Pi)
where Pi is the proportion of instances in the dataset that take the ith
value of the target attribute.
1. If all instances have the same value of the target attribute, label the
root with that value and stop.
2. Else if Attributes is empty, label the root according to the most common
value of the target attribute.
3. Else begin: select the attribute A with the highest information gain as
the decision attribute for the root; for each value v of A, add a branch
and recursively build a subtree from the instances with A = v, using the
remaining attributes.
4. End.
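The entropy and information-gain computations that drive step 3 can be
written in a few lines of R (the column names in the example data frame
are illustrative):

entropy <- function(target) {
  p <- table(target) / length(target)         # proportions Pi
  -sum(p * log2(p))
}

info_gain <- function(data, attr, target) {
  h <- entropy(data[[target]])                # Entropy(S)
  split_h <- sapply(split(data[[target]], data[[attr]]), entropy)
  weights <- table(data[[attr]]) / nrow(data) # |Sv| / |S|
  h - sum(weights * split_h)                  # Gain(S, A)
}

df <- data.frame(outlook = c("sunny", "sunny", "overcast", "rain", "rain"),
                 play    = c("no", "no", "yes", "yes", "no"))
info_gain(df, "outlook", "play")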
Advantages of ID3:
1. Produces understandable, human-readable prediction rules from the
training data.
2. The search is biased towards shorter trees, so trees are usually small
and quick to build.
3. The whole training set is used at each step, so the chosen splits are
well supported by the data.
Disadvantages of ID3:
1. Small training samples can lead to overfitted or over-classified trees.
2. Only one attribute at a time is tested for making a decision.
3. Continuous (numeric) attributes and missing values are not handled
directly; they must be discretized or cleaned during preprocessing.