Introduction To Data Mining Techniques: Dr. Rajni Jain
explaining the data and which are also capable of making predictions from it.
The data takes the form of a set of examples, and the output takes the form of
predictions on new examples.
2 Issues in Data Mining
Data mining has evolved into an important and active area of research
because of the theoretical challenges and practical applications associated with the
problem of discovering interesting and previously unknown knowledge from
real-world databases. The main challenges in data mining, and the corresponding
considerations in designing algorithms, are as follows:
1. Massive datasets and high dimensionality.
2. Overfitting and assessing the statistical significance.
3. Understandability of patterns.
4. Non-standard, incomplete data and data integration.
5. Mixed, changing and redundant data.
3 Tasks of Data Mining
Data mining is a term used for a specific set of six activities or tasks, as follows:
1. Classification
2. Estimation
3. Prediction
4. Affinity grouping or association rules
5. Clustering
6. Description and visualization
The first three tasks (classification, estimation and prediction) are all
examples of directed data mining, or supervised learning. In directed data mining,
the goal is to use the available data to build a model that describes one or more
particular attributes of interest (target attributes or class attributes) in terms of
the rest of the available attributes. The next three tasks (association rules,
clustering and description) are examples of undirected data mining, i.e. no attribute
is singled out as the target; the goal is to establish some relationship among all the
attributes.
[Figure residue: Web-based database systems (1990-present): XML-based database systems, Web mining.]
[Figure: data mining as a step in knowledge discovery. Labels: Databases, Flat Files, Cleaning and Integration, Data Warehouse, Selection and Transformation, Data Mining, Evaluation and Presentation, Knowledge.]
3.1 Classification
Classification consists of examining the features of a newly presented object and
assigning it to a predefined class. The classification task is characterized by
well-defined classes and a training set consisting of preclassified examples. The
task is to build a model that can be applied to unclassified data in order to classify
it. Examples of classification tasks include:
Determination of which home telephone lines are used for internet access
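
The idea of building a model from preclassified examples and applying it to unclassified data can be illustrated with a minimal sketch. It assumes a k-nearest-neighbour approach, which is only one of many possible classification techniques and is not one named in these notes. The features (average call length in minutes, calls per day), the class labels and all data values are hypothetical.

```python
import math
from collections import Counter

# Hypothetical preclassified training set:
# (average call minutes, calls per day) -> predefined class
TRAINING_SET = [
    ((45.0, 3.0), "internet"),
    ((60.0, 2.0), "internet"),
    ((50.0, 4.0), "internet"),
    ((3.0, 10.0), "voice"),
    ((5.0, 8.0), "voice"),
    ((4.0, 12.0), "voice"),
]

def classify(features, training_set=TRAINING_SET, k=3):
    """Assign the majority class among the k nearest training examples."""
    # Sort the preclassified examples by Euclidean distance to the new object.
    by_distance = sorted(training_set, key=lambda ex: math.dist(features, ex[0]))
    # Let the k closest examples vote on the class.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Classify two new, unclassified phone lines.
line_a = classify((55.0, 3.0))
line_b = classify((4.0, 9.0))
```

Here the "model" is simply the stored training set plus the voting rule; other techniques would summarize the training set into rules or a tree instead.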
3.2 Estimation
Estimation deals with continuously valued outcomes. Given some input data, we
use estimation to come up with a value for some unknown continuous variable
such as income, height or credit card balance. Some examples of estimation tasks
include:
Estimating the number of children in a family from data on the mother's
education
Estimating the total household income of a family from data on the vehicles in the
family
Estimating the value of a piece of real estate from data on its proximity
to a major business centre of the city.
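
As a rough illustration of estimation, the sketch below fits a straight line by ordinary least squares and uses it to estimate a continuous value. Least squares is an assumed technique chosen for simplicity, not one prescribed by these notes, and the vehicle counts and income figures are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: vehicles owned -> household income (thousands).
vehicles = [0, 1, 2, 3]
income = [20.0, 35.0, 50.0, 65.0]

slope, intercept = fit_line(vehicles, income)

# Estimate the income of a household with 4 vehicles.
estimated_income = slope * 4 + intercept
```

Unlike classification, the output is a number on a continuous scale rather than one of a fixed set of classes.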
3.3 Prediction
Any prediction can be thought of as classification or estimation. The difference is
one of emphasis. When data mining is used to classify a phone line as primarily
used for internet access or a credit card transaction as fraudulent, we do not expect
to be able to go back later to see if the classification was correct. Our
classification may be correct or incorrect, but the uncertainty is due to incomplete
knowledge only: out in the real world, the relevant actions have already taken
place. The phone is or is not used primarily to dial the local ISP. The credit card
transaction is or is not fraudulent. With enough effort, it is possible to check.
Predictive tasks feel different because the records are classified according to some
predicted future behaviour or estimated future value. With prediction, the only
way to check the accuracy of the classification is to wait and see. Examples of
prediction tasks include:
Predicting the size of the balance that will be transferred if a credit card
prospect accepts a balance transfer offer
Classification and estimation techniques can be adapted for use in prediction by
using training examples where the value of the variable to be predicted is already
known, along with historical data for those examples. The historical data is used
to build a model that explains the current observed behaviour. When this model is
applied to current inputs, the result is a prediction of future behaviour.
3.4 Association Rules
An association rule is a rule that implies certain association relationships among
a set of objects in a database (such as "occur together" or "one implies the other").
Given a set of transactions, where each transaction is a set of literals (called
items), an association rule is an expression of the form X ⇒ Y, where X and Y are
sets of items. The intuitive meaning of such a rule is that transactions of the
database which contain X tend to contain Y. An example of an association rule is:
30% of farmers that grow wheat also grow pulses; 2% of all farmers grow both
of these items. Here 30% is called the confidence of the rule, and 2% the support
of the rule. The problem is to find all association rules that satisfy user-specified
minimum support and minimum confidence constraints.
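
The definitions of support and confidence above can be sketched in a few lines. The function names and the ten farmer transactions below are hypothetical, chosen only to make the arithmetic easy to follow; this checks a single candidate rule and is not a full rule-mining algorithm.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, x, y):
    """Fraction of transactions containing X that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

def satisfies(transactions, x, y, min_support, min_confidence):
    """Check a rule X => Y against user-specified thresholds."""
    return (support(transactions, set(x) | set(y)) >= min_support
            and confidence(transactions, x, y) >= min_confidence)

# Hypothetical data: the crops grown by each of ten farmers.
farmers = [
    {"wheat", "pulses"}, {"wheat", "pulses"}, {"wheat"}, {"wheat", "rice"},
    {"wheat"}, {"rice"}, {"rice", "pulses"}, {"maize"},
    {"maize", "rice"}, {"pulses"},
]

# Rule: {wheat} => {pulses}
rule_support = support(farmers, {"wheat", "pulses"})      # 2 of 10 farmers grow both
rule_confidence = confidence(farmers, {"wheat"}, {"pulses"})  # 2 of 5 wheat growers
```

In this toy data the rule has 20% support and 40% confidence, so it would pass a minimum support of 10% and a minimum confidence of 30%, but fail a minimum confidence of 50%.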
3.5 Clustering
Clustering is the task of segmenting a diverse group into a number of similar
subgroups or clusters. What distinguishes clustering from classification is that
clustering does not rely on predefined classes. The records are grouped together
on the basis of self-similarity. Clustering is often done as a prelude to some
other form of data mining
or modelling. For example, clustering might be the first step in a market
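
Grouping records by self-similarity, without predefined classes, can be sketched as follows. The sketch assumes the k-means procedure with a naive initialization (the first k points), which is one common clustering technique and not necessarily the one intended in these notes. The two-dimensional points are hypothetical.

```python
import math

def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly reassigning and re-averaging."""
    # Naive initialization (an assumption of this sketch): use the first k points.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical records described by two numeric attributes.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
```

No class labels are supplied anywhere; the two groups emerge only from the distances between the records.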
Example  Hair    Height   Weight   Lotion  Sunburn
X1       blonde  average  light    no      yes
X2       blonde  tall     average  yes     no
X3       brown   short    average  yes     no
X4       blonde  short    average  no      yes
X5       red     average  heavy    no      yes
X6       brown   tall     heavy    no      no
X7       brown   average  heavy    no      no
X8       blonde  short    light    yes     no
[Figure: decision tree with a Hair node (branches: blonde, red, brown) and a Lotion node (branches: yes, no).]