Introduction
There is a huge amount of data available in the information industry. This data is of no use until it is
converted into useful information. It is therefore necessary to analyse this huge amount of data and
extract useful information from it.
The extraction of information is not the only process we need to perform; data mining also involves other
processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern
Evaluation and Data Presentation. Once all these processes are over, we are in a position to use
this information in many applications such as Fraud Detection, Market Analysis, Production Control,
Science Exploration, etc.
Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is mining knowledge from data. This information can be used for any
of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Market Analysis and Management
Listed below are the various fields of market analysis where data mining is used:
Customer Profiling - Data Mining helps to determine what kind of people buy what kind of
products.
Identifying Customer Requirements - Data Mining helps in identifying the best products
for different customers. It uses prediction to find the factors that may attract new
customers.
Target Marketing - Data Mining helps to find clusters of model customers who share the
same characteristics such as interests, spending habits, income, etc.
Determining Customer Purchasing Patterns - Data mining helps in determining customers'
purchasing patterns.
Fraud Detection
Data Mining is also used in the fields of credit card services and telecommunication to detect fraud. In
fraudulent telephone call detection, it helps to find the destination of the call, the duration of the call, and the
time of day or week. It also analyses patterns that deviate from expected norms.
Other Applications
Data Mining is also used in other fields such as sports, astrology and Internet Web Surf-Aid.
Data Mining - Tasks
Introduction
Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of data to be
mined, there are two kinds of functions involved in data mining, listed below:
Descriptive
Classification and Prediction
Descriptive
The descriptive function deals with general properties of data in the database. Here is the list of
descriptive functions:
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept Description
Class/Concepts refers the data to be associated with classes or concepts. For example, in a
company classes of items for sale include computer and printers, and concepts of customers
include big spenders and budget spenders.Such descriptions of a class or a concept are called
class/concept descriptions. These descriptions can be derived by following two ways:
Data Characterization - This refers to summarizing data of class under study. This class
under study is called as Target Class.
Data Discrimination - It refers to mapping or classification of a class with some
predefined group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here is the list of
kinds of frequent patterns:
Frequent Item Set - This refers to a set of items that frequently appear together, for example,
milk and bread.
Frequent Subsequence - A sequence of patterns that occur frequently, such as purchasing
a camera being followed by purchasing a memory card.
Mining of Associations
Associations are used in retail sales to identify items that are frequently purchased together. This
process refers to uncovering the relationships among data and determining association
rules.
For example, a retailer generates an association rule showing that 70% of the time milk is sold with bread
and only 30% of the time biscuits are sold with bread (a small sketch of computing such percentages follows).
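As a hedged illustration (not from the original tutorial), the sketch below counts the support and confidence of a rule such as bread => milk over a handful of made-up transactions; the item names and numbers are assumptions.

# Minimal sketch: computing support and confidence for an association rule
# over a tiny, made-up list of market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "biscuits"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: customers who buy bread also buy milk.
print(confidence({"bread"}, {"milk"}, transactions))   # 0.75 in this toy data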
Mining of Correlations
Correlation mining is an additional analysis, usually performed after association mining, to uncover
interesting statistical correlations between associated attribute-value pairs or between two item sets.
Mining of Clusters
A cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming groups of
objects that are very similar to each other but are highly different from the objects in other clusters.
Classification and Prediction
Classification is the process of finding a model that describes the data classes or concepts. The
purpose is to be able to use this model to predict the class of objects whose class label is unknown.
This derived model is based on the analysis of a set of training data. The derived model can be
presented in the following forms:
Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks
Here is the list of functions involved in these processes:
Classification - It predicts the class of objects whose class label is unknown. Its objective
is to find a derived model that describes and distinguishes data classes or concepts. The
derived model is based on the analysis of a set of training data, i.e. data objects whose class
labels are well known.
Prediction - It is used to predict missing or unavailable numerical data values rather than
class labels. Regression analysis is generally used for prediction. Prediction can also be
used to identify distribution trends based on the available data (a small sketch follows this list).
Outlier Analysis - Outliers may be defined as data objects that do not comply with the
general behaviour or model of the available data.
Evolution Analysis - Evolution analysis refers to the description and modelling of regularities or
trends for objects whose behaviour changes over time.
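As a hedged illustration of the contrast between classification and prediction (not part of the original text), the sketch below fits a decision tree classifier for a categorical label and a regression model for a numeric value using scikit-learn; all data values are invented.

# Hypothetical sketch: classification predicts a class label,
# prediction (regression) predicts a continuous numeric value.
# Assumes scikit-learn is installed; the data below is made up.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: label loan applications as 'safe' or 'risky'.
X_cls = [[25, 20000], [40, 80000], [35, 60000], [22, 15000]]   # [age, income]
y_cls = ["risky", "safe", "safe", "risky"]
classifier = DecisionTreeClassifier().fit(X_cls, y_cls)
print(classifier.predict([[30, 70000]]))        # e.g. ['safe']

# Prediction: estimate expenditure in dollars from income.
X_reg = [[20000], [40000], [60000], [80000]]
y_reg = [500, 900, 1400, 1800]                  # spending on computer equipment
predictor = LinearRegression().fit(X_reg, y_reg)
print(predictor.predict([[50000]]))             # a continuous-valued estimate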
Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query, which is defined in terms of the following task primitives.
Set of task-relevant data to be mined
This is the portion of the database in which the user is interested. This portion includes the following:
Database Attributes
Data Warehouse dimensions of interest
Kind of knowledge to be mined
This refers to the kind of functions to be performed, such as:
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Background knowledge to be used in the discovery process
The background knowledge allows data to be mined at multiple levels of abstraction. For example,
concept hierarchies are one kind of background knowledge that allows data to be mined at multiple
levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
These are used to evaluate the patterns discovered by the process of knowledge discovery.
There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which the discovered patterns are to be displayed. These representations
may include the following:
Rules
Tables
Charts
Graphs
Decision Trees
Cubes
Data Mining - Issues
Introduction
Data mining is not an easy task. The algorithms used are very complex, and the data is not always
available in one place; it needs to be integrated from various heterogeneous data sources. These
factors also create some issues. Here in this tutorial we will discuss the major issues regarding:
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
The following diagram describes the major issues.
Mining Methodology and User Interaction Issues
Performance Issues
Data Warehouse
Data Warehousing
Data Warehousing is the process of constructing and using the data warehouse. The data
warehouse is constructed by integrating the data from multiple heterogeneous sources. This data
warehouse supports analytical reporting, structured and/or ad hoc queries and decision making.
Data Warehousing involves data cleaning, data integration and data consolidation.
Integrating Heterogeneous Databases
To integrate heterogeneous databases, we have the following two approaches:
Query Driven Approach
Update Driven Approach
Query Driven Approach
When a query is issued to a client side, a metadata dictionary translates the query into the
queries appropriate for the individual heterogeneous sites involved.
These queries are then mapped and sent to the local query processors.
The results from the heterogeneous sites are integrated into a global answer set.
Disadvantages
Advantages
Online Analytical Mining (OLAM) integrates Online Analytical Processing (OLAP) with data mining,
mining knowledge in multidimensional databases. Here is the diagram that shows the integration of
OLAP and OLAM:
Importance of OLAM:
Data Mining
Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is mining knowledge from data. This information can be used for any
of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Data Mining Engine
The data mining engine is very essential to the data mining system. It consists of a set of functional
modules that perform the following tasks:
Characterization
Association and Correlation Analysis
Classification
Prediction
Cluster analysis
Outlier analysis
Evolution analysis
Knowledge Base
This is the domain knowledge. This knowledge is used to guide the search or evaluate the
interestingness of resulting patterns.
Knowledge Discovery
Some people treat data mining the same as knowledge discovery, while others view data mining
as an essential step in the process of knowledge discovery. Here is the list of steps involved in the
knowledge discovery process:
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data Mining
Pattern Evaluation
Knowledge Presentation
User interface
User interface is the module of data mining system that helps communication between users and
the data mining system. User Interface allows the following functionalities:
Interact with the system by specifying a data mining query task.
Providing information to help focus the search.
Mining based on the intermediate data mining results.
Browse database and data warehouse schemas or data structures.
Evaluate mined patterns.
Visualize the patterns in different forms.
Data Integration
Data Integration is a data preprocessing technique that merges data from multiple heterogeneous
data sources into a coherent data store. Data integration may involve inconsistent data and therefore
needs data cleaning.
Data Cleaning
Data cleaning is a technique that is applied to remove the noisy data and correct the
inconsistencies in data. Data cleaning involves transformations to correct the wrong data. Data
cleaning is performed as data preprocessing step while preparing the data for a data warehouse.
Data Selection
Data Selection is the process where data relevant to the analysis task are retrieved from the
database. Sometimes data transformation and consolidation are performed before data selection
process.
Clusters
A cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming groups of
objects that are very similar to each other but are highly different from the objects in other clusters.
Data Transformation
In this step, data is transformed or consolidated into forms appropriate for mining by performing
summary or aggregation operations.
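For instance, here is a minimal sketch (using pandas, with invented sales records, not taken from the original text) of the kind of summary/aggregation performed during data transformation.

# Minimal sketch of data transformation by aggregation: daily sales records
# are consolidated into monthly totals per branch. Data values are made up.
import pandas as pd

sales = pd.DataFrame({
    "branch": ["A", "A", "B", "B", "A"],
    "month":  ["Jan", "Jan", "Jan", "Feb", "Feb"],
    "amount": [120.0, 80.0, 200.0, 150.0, 90.0],
})

# Aggregate (consolidate) the detailed records into a summary form suitable for mining.
monthly_totals = sales.groupby(["branch", "month"], as_index=False)["amount"].sum()
print(monthly_totals)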
Introduction
There is a large variety of data mining systems available. A data mining system may integrate
techniques from the following:
Spatial Data Analysis
Information Retrieval
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics
The data mining system can be classified according to the following criteria:
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases mined. Database systems
can be classified according to different criteria such as data models, types of data, etc., and the data
mining system can be classified accordingly. For example, if we classify a database according to the
data model, then we may have a relational, transactional, object-relational, or data warehouse
mining system.
Classification Based on the Kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge mined. This means the data
mining system is classified on the basis of functionalities such as:
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications adapted. These applications are
as follows:
Finance
Telecommunications
DNA
Stock Markets
Integration with a Database or Data Warehouse System
The data mining system needs to be integrated with a database or a data warehouse system. If the
data mining system is not integrated with any database or data warehouse system, then there will
be no system to communicate with. This scheme is known as the non-coupling scheme. In this scheme,
the main focus is on data mining design and on developing efficient and effective algorithms for
mining the available data sets.
Here is the list of Integration Schemes:
No Coupling - In this scheme, the data mining system does not utilize any of the database
or data warehouse functions. It fetches the data from a particular source, processes
that data using some data mining algorithms, and stores the result in another file.
Loose Coupling - In this scheme, the data mining system may use some of the functions
of the database and data warehouse system. It fetches the data from a data repository
managed by these systems and performs data mining on that data. It then stores the mining
result either in a file or in a designated place in a database or data warehouse.
Semi-tight Coupling - In this scheme, the data mining system is linked with a database or
data warehouse system and, in addition, efficient implementations of a few data mining
primitives can be provided in the database or data warehouse system.
Tight Coupling - In this coupling scheme, the data mining system is smoothly integrated into the
database or data warehouse system. The data mining subsystem is treated as one
functional component of an information system.
Data Mining - Query Language
Introduction
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the DBMiner data
mining system. The Data Mining Query Language is actually based on the Structured Query Language
(SQL). Data mining query languages can be designed to support ad hoc and interactive data
mining. DMQL provides commands for specifying primitives. DMQL can work with
databases and data warehouses as well. DMQL can be used to define data
mining tasks. In particular, we examine how data warehouses and data marts are defined in DMQL.
Here is the syntax of DMQL for specifying the task relevant data:
Characterization
Discrimination
For example, a user may define bigSpenders as customers who purchase items that cost $100 or
more on average, and budgetSpenders as customers who purchase items at less than $100 on
average. The mining of discriminant descriptions for customers from each of these categories can
be specified in DMQL as:
Association
For Example:
Classification
For example, to mine patterns classifying customer credit rating, where the classes are determined
by the attribute credit_rating:
mine classification as classifyCustomerCreditRating
analyze credit_rating
Prediction
- schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]
- set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level3: {40, ..., 59} < level1: middle_aged
level4: {60, ..., 89} < level1: senior
- operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
- rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost) < $50
level_1: medium_profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) ≤ $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
Interestingness measures and thresholds can be specified by the user with the statement:
For Example:
We have syntax which allows users to specify the display of discovered patterns in one or more
forms.
display as <result_form>
For Example :
display as table
Standardizing the Data Mining Languages will serve the following purposes:
Systematic Development of Data Mining Solutions.
Improve interoperability among multiple data mining systems and functions.
Promote education in data mining.
Promote use of data mining systems in industry and society.
Data Mining - Classification & Prediction
Introduction
There are two forms of data analysis that can be used to extract models describing important
classes or to predict future data trends. These two forms are as follows:
Classification
Prediction
These data analyses help us provide a better understanding of large data. Classification predicts
categorical class labels, while prediction models predict continuous-valued functions. For example, we can build
a classification model to categorize bank loan applications as either safe or risky, or a prediction
model to predict the expenditures in dollars of potential customers on computer equipment, given
their income and occupation.
What is classification?
Following are examples of cases where the data analysis task is classification:
A bank loan officer wants to analyse the data in order to know which customers (loan
applicants) are risky and which are safe.
What is prediction?
Following are examples of cases where the data analysis task is prediction:
Suppose the marketing manager needs to predict how much a given customer will spend during a
sale at his company. In this example, we need to predict a numeric value; therefore the data
analysis task is an example of numeric prediction. In this case, a model or predictor will be constructed
that predicts a continuous-valued function or ordered value.
Note: Regression analysis is a statistical methodology that is most often used for numeric
prediction.
Building the Classifier or Model
This is the learning step, in which the classification algorithm builds the classifier.
The classifier is built from a training set made up of database tuples and their associated
class labels.
Each tuple that constitutes the training set belongs to a predefined class, determined by the
class label attribute. These tuples can also be referred to as samples, objects or data points.
Using the Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate the
accuracy of the classification rules. The classification rules can be applied to the new data tuples if the
accuracy is considered acceptable.
Classification and Prediction Issues
The major issue is preparing the data for classification and prediction. Preparing the data involves
the following activities:
Data Cleaning - Data cleaning involves removing the noise and treating missing
values. The noise is removed by applying smoothing techniques, and the problem of
missing values is solved by replacing a missing value with the most commonly occurring value
for that attribute.
Relevance Analysis - The database may also have irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.
Data Transformation and Reduction - The data can be transformed by any of the
following methods.
Normalization - The data is transformed using normalization. Normalization
involves scaling all values for a given attribute so that they fall within a
small specified range. Normalization is used when, in the learning step, neural
networks or methods involving distance measurements are used (see the sketch after the note below).
Generalization - The data can also be transformed by generalizing it to a higher-level
concept. For this purpose we can use concept hierarchies.
Note: Data can also be reduced by some other methods such as wavelet transformation, binning,
histogram analysis, and clustering.
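A small sketch of min-max normalization, one common way to scale attribute values into a range such as [0.0, 1.0]; the income values below are invented.

# Minimal sketch: min-max normalization scales values of an attribute
# into a small specified range, here [0.0, 1.0]. Values are made up.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

incomes = [12000, 54000, 98000, 73000]
print(min_max_normalize(incomes))   # all values now lie between 0.0 and 1.0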
Comparison of Classification and Prediction Methods
Here are the criteria for comparing the methods of classification and prediction:
Speed - This refers to the computational cost in generating and using the classifier or
predictor.
Robustness - This refers to the ability of the classifier or predictor to make correct predictions
from given noisy data.
Scalability - This refers to the ability to construct the classifier or predictor efficiently,
given a large amount of data.
Interpretability - This refers to the extent to which the classifier or predictor can be understood.
Data Mining - Decision Tree Induction
Introduction
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node
denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a
class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer; it indicates whether a customer at a
company is likely to buy a computer or not. Each internal node represents a test on an attribute.
Each leaf node represents a class.
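As a rough illustration (not part of the original tutorial), the following sketch induces a decision tree on a tiny, made-up buy_computer-style data set with scikit-learn; the feature names and values are assumptions.

# Hypothetical illustration: a tiny decision tree for a buy_computer-style data set.
# Assumes scikit-learn and pandas are installed; all data values are made up.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up training tuples: age group, income, student status -> buys_computer
data = pd.DataFrame({
    "age":     ["youth", "youth", "middle_aged", "senior", "senior"],
    "income":  ["high",  "high",  "high",        "medium", "low"],
    "student": ["no",    "no",    "no",          "no",     "yes"],
    "buys_computer": ["no", "no", "yes", "yes", "yes"],
})

# Decision trees in scikit-learn need numeric features, so one-hot encode the categories.
X = pd.get_dummies(data[["age", "income", "student"]])
y = data["buys_computer"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # textual view of the induced tree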
Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or a splitting subset.
Output:
A Decision Tree
Method
create a node N;

if tuples in D are all of the same class C then
   return N as a leaf node labeled with class C;

if attribute_list is empty then
   return N as a leaf node labeled
   with the majority class in D;   // majority voting

apply attribute_selection_method(D, attribute_list)
   to find the best splitting_criterion;
label node N with splitting_criterion;

if splitting_attribute is discrete-valued and
   multiway splits allowed then   // not restricted to binary trees
   attribute_list = attribute_list - splitting_attribute;   // remove splitting attribute

for each outcome j of splitting_criterion
   // partition the tuples and grow subtrees for each partition
   let Dj be the set of data tuples in D satisfying outcome j;   // a partition
   if Dj is empty then
      attach a leaf labeled with the majority
      class in D to node N;
   else
      attach the node returned by
      Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
Tree Pruning
Tree Pruning is performed in order to remove anomalies in training data due to noise or outliers.
The pruned trees are smaller and less complex.
Cost Complexity
The cost complexity is measured by the following two parameters:
Number of leaves in the tree, and
Error rate of the tree.
Data Mining - Bayesian Classification
Introduction
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical
classifiers. Bayesian classifiers can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probability involved:
Posterior Probability [P(H/X)]
Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis. According to Bayes' Theorem,
P(H/X) = P(X/H) P(H) / P(X)
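As a worked illustration with invented numbers (not from the original text), the posterior probability can be computed directly from the prior and the likelihood:

# Hypothetical worked example of Bayes' Theorem: P(H/X) = P(X/H) * P(H) / P(X).
# H: "customer buys a computer"; X: "customer is a student". Numbers are made up.
p_h = 0.4            # prior probability P(H)
p_x_given_h = 0.6    # likelihood P(X/H)
p_x = 0.3            # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x   # posterior probability P(H/X)
print(p_h_given_x)                      # 0.8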
Bayesian Belief Network
Bayesian belief networks specify joint conditional probability distributions. They are also known as:
Belief Networks
Bayesian Networks
Probabilistic Networks
There are two components that define a Bayesian Belief Network:
Directed acyclic graph
A set of conditional probability tables
The arcs in the diagram allow the representation of causal knowledge. For example, lung cancer is
influenced by a person's family history of lung cancer, as well as by whether or not the person is a
smoker. It is worth noting that the variable PositiveXRay is independent of whether the patient has a
family history of lung cancer or is a smoker, given that we know the patient has lung cancer.
Data Mining - Rule Based Classification
IF-THEN Rules
Rule-based classifiers make use of a set of IF-THEN rules for classification. We can express a rule
in the following form:
IF condition THEN conclusion
Points to remember:
The IF part of the rule is called the rule antecedent or precondition.
The THEN part of the rule is called the rule consequent.
The antecedent part of the rule consists of one or more attribute tests that are logically ANDed.
The consequent part consists of the class prediction.
Note:
We can also write a rule such as R1 in an equivalent, abbreviated form.
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision
tree. Points to remember in order to extract rules from a decision tree:
One rule is created for each path from the root to a leaf node.
To form the rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
Rule Induction Using the Sequential Covering Algorithm
The sequential covering algorithm can be used to extract IF-THEN rules from the training data. We do
not require generating a decision tree first. In this algorithm, each rule for a given class covers
many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general
strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the
rule are removed and the process continues for the rest of the tuples.
Note: Decision tree induction can be considered as learning a set of rules simultaneously, because the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning strategy, in which rules are learned for one class at a time.
When learning a rule for a class Ci, we want the rule to cover all the tuples from class Ci only and
no tuples from any other class.
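The algorithm itself is not reproduced here; the following is a rough Python sketch of the general sequential covering strategy (learn one rule for the class, remove the tuples it covers, repeat). The helper learn_one_rule is a deliberately simple stand-in, not the tutorial's exact procedure, and all data values are invented.

# Rough sketch of sequential covering: learn one rule at a time for class c,
# remove the tuples that the rule covers, and repeat until no acceptable rule is found.
def learn_one_rule(tuples, c):
    """Stand-in rule learner: pick the single attribute-value test with the best accuracy for c."""
    best = None   # (accuracy, coverage, (attribute, value))
    for attr in tuples[0]["attrs"]:
        for value in {t["attrs"][attr] for t in tuples}:
            covered = [t for t in tuples if t["attrs"][attr] == value]
            acc = sum(t["label"] == c for t in covered) / len(covered)
            cand = (acc, len(covered), (attr, value))
            if best is None or cand > best:
                best = cand
    return best

def sequential_covering(tuples, c, min_accuracy=0.8):
    rules, remaining = [], list(tuples)
    while remaining:
        rule = learn_one_rule(remaining, c)
        if rule is None or rule[0] < min_accuracy:
            break
        attr, value = rule[2]
        rules.append(f"IF {attr} = {value} THEN class = {c}")
        remaining = [t for t in remaining if t["attrs"][attr] != value]  # remove covered tuples
    return rules

# Tiny made-up data set.
data = [
    {"attrs": {"student": "yes", "credit": "fair"},      "label": "buys"},
    {"attrs": {"student": "yes", "credit": "excellent"}, "label": "buys"},
    {"attrs": {"student": "no",  "credit": "fair"},      "label": "no_buy"},
    {"attrs": {"student": "no",  "credit": "excellent"}, "label": "no_buy"},
]
print(sequential_covering(data, "buys"))   # e.g. ['IF student = yes THEN class = buys']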
Rule Pruning
The rule is pruned by removing a conjunct. Rule R is pruned if the pruned version of R has
greater quality, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos - neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
In this section we will discuss other classification methods such as Genetic
Algorithms, the Rough Set Approach and the Fuzzy Set Approach.
Genetic Algorithms
The idea of the Genetic Algorithm is derived from natural evolution. In a Genetic Algorithm, an
initial population is created first. This initial population consists of randomly generated rules. We can
represent each rule by a string of bits.
For example, suppose that in a given training set the samples are described by two Boolean
attributes, A1 and A2, and that the training set contains two classes, C1 and C2.
We can encode the rule IF A1 AND NOT A2 THEN C2 into the bit string 100. In this bit representation,
the two leftmost bits represent the attributes A1 and A2, respectively.
Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
Note: If an attribute has K values where K > 2, then K bits can be used to encode the attribute's
values. Classes are encoded in the same manner.
Points to remember:
Based on the notion of survival of the fittest, a new population is formed that consists of the
fittest rules in the current population, together with offspring of these rules.
The fitness of a rule is assessed by its classification accuracy on a set of training
samples.
Genetic operators such as crossover and mutation are applied to create offspring.
In crossover, substrings from a pair of rules are swapped to form a new pair of rules.
In mutation, randomly selected bits in a rule's string are inverted (a toy sketch of these operators follows this list).
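Here is a toy sketch of the crossover and mutation operators acting on bit-string encoded rules; the crossover point, mutation rate and printed results are illustrative assumptions, not the tutorial's procedure.

# Toy sketch of genetic operators on bit-string encoded rules.
# The encodings follow the example above (e.g. "100" for IF A1 AND NOT A2 THEN C2);
# the crossover point and mutation behaviour here are illustrative assumptions.
import random

def crossover(rule1, rule2, point=1):
    """Swap the substrings after the given point to form a new pair of rules."""
    return rule1[:point] + rule2[point:], rule2[:point] + rule1[point:]

def mutate(rule, rate=0.1):
    """Invert each bit of the rule's string with a small probability."""
    return "".join(b if random.random() > rate else ("1" if b == "0" else "0") for b in rule)

parent1, parent2 = "100", "001"
child1, child2 = crossover(parent1, parent2)
print(child1, child2)          # "101" and "000" with the crossover point above
print(mutate(child1))          # occasionally flips a randomly selected bit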
Rough Set Approach
There are some classes in the given real-world data which cannot be distinguished in terms of the
available attributes. We can use rough sets to roughly define such classes.
For a given class C, the rough set definition is approximated by two sets as follows:
Lower Approximation of C - The lower approximation of C consists of all the data tuples
that, based on the knowledge of the attributes, are certain to belong to class C.
Upper Approximation of C - The upper approximation of C consists of all the tuples that,
based on the knowledge of the attributes, cannot be described as not belonging to C.
The following diagram shows the upper and lower approximations of class C:
Fuzzy Set Approaches
For example, the income value $49,000 belongs to both the medium and high fuzzy sets, but to
differing degrees. Fuzzy set notation for this income value is as follows:
where m is the membership function, operating here on the fuzzy sets medium_income and high_income
respectively. This notation can be shown diagrammatically as follows:
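As a rough illustration only (the income breakpoints below are invented, not the tutorial's actual membership curves), piecewise-linear membership functions for medium_income and high_income could look like this:

# Hypothetical sketch of fuzzy membership: an income can belong to both the
# medium_income and high_income fuzzy sets to differing degrees.
# The breakpoints (30k-50k-70k and 40k-80k) are invented for illustration.
def membership_medium_income(income):
    """Triangular membership: rises from 30k to 50k, falls from 50k to 70k."""
    if income <= 30000 or income >= 70000:
        return 0.0
    if income <= 50000:
        return (income - 30000) / 20000
    return (70000 - income) / 20000

def membership_high_income(income):
    """Ramp membership: rises from 40k to 80k, 1.0 above 80k."""
    if income <= 40000:
        return 0.0
    if income >= 80000:
        return 1.0
    return (income - 40000) / 40000

income = 49000
print(membership_medium_income(income))   # about 0.95: largely "medium"
print(membership_high_income(income))     # about 0.23: partly "high" as well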
Data Mining - Cluster Analysis
What is Cluster?
A cluster is a group of objects that belong to the same class. In other words, similar objects are
grouped in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of grouping abstract objects into classes of similar objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign labels to the groups.
The main advantage of clustering over classification is that it is adaptable to changes and
helps single out useful features that distinguish different groups.
Clustering can also help marketers discover distinct groups in their customer base, and
they can characterize their customer groups based on purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionality and gain insight into structures inherent in populations.
Clustering also helps in the identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to house
type, value and geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as the detection of credit card
fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data and to observe the characteristics of each cluster.
Clustering Methods
The clustering methods can be classified into following categories:
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of n objects; the partitioning method constructs k partitions of the data,
where each partition represents a cluster and k ≤ n. It means that it classifies the data into k groups,
which satisfy the following requirements:
Each group contains at least one object.
Each object belongs to exactly one group.
Points to remember:
For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving objects
from one group to another, as illustrated in the sketch below.
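For instance, a minimal sketch of the partitioning approach using k-means, one well-known partitioning algorithm; scikit-learn and the sample points are assumptions, not part of the original text.

# Minimal sketch of the partitioning approach using k-means: the n objects are
# split into k clusters and iteratively relocated to improve the partitioning.
# Assumes scikit-learn and numpy are installed; the points below are made up.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                   [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster assignment of each object
print(kmeans.cluster_centers_)   # representative centre of each cluster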
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can classify
hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two approaches:
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this approach, we start with each object forming a
separate group. It keeps on merging the objects or groups that are close to one another. It keeps on
doing so until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this approach, we start with all of the objects in the
same cluster. In continuous iterations, a cluster is split into smaller clusters. This is done until
each object is in one cluster or the termination condition holds.
Disadvantage
This method is rigid, i.e. once a merging or splitting is done, it can never be undone.
Here are the two approaches that are used to improve the quality of hierarchical clustering:
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to
group objects into microclusters, and then performing macroclustering on the
microclusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a given
cluster as long as the density in the neighbourhood exceeds some threshold, i.e. for each data point
within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of
points.
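A minimal sketch of density-based clustering using DBSCAN, where the eps radius and min_samples threshold play the role of the radius and minimum number of points described above; scikit-learn and the sample points are assumptions.

# Minimal sketch of density-based clustering with DBSCAN: a cluster keeps growing
# while the eps-neighbourhood of its points contains at least min_samples points.
# Assumes scikit-learn and numpy are installed; the points below are made up.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [25.0, 30.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(points)
print(db.labels_)   # two dense clusters; the isolated point is labelled -1 (noise)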
Grid-based Method
In this method, the objects together form a grid. The object space is quantized into a finite number of cells that
form a grid structure.
Advantage
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of the data to the given
model. This method locates the clusters by clustering the density function, which reflects the spatial
distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on standard
statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by incorporating user- or application-oriented constraints.
A constraint refers to the user's expectations or the properties of the desired clustering results.
Data Mining - Mining Text Data
Introduction
Text databases consist largely of huge collections of documents. They collect this information
from several sources such as news articles, books, digital libraries, e-mail messages, web
pages, etc. Due to the increasing amount of information, text databases are growing rapidly. In many
text databases, the data is semi-structured.
For example, a document may contain a few structured fields, such as title, author, publishing_date,
etc. But along with the structured data, the document also contains unstructured text components,
such as the abstract and contents. Without knowing what could be in the documents, it is difficult to
formulate effective queries for analyzing and extracting useful information from the data. Users need
tools to compare documents and rank their importance and relevance. Therefore, text mining has
become a popular and essential theme in data mining.
Information Retrieval
Information retrieval deals with the retrieval of information from a large number of text-based
documents. Information retrieval systems differ from traditional database systems because each
handles a different kind of data. Examples of information retrieval systems include online library
catalogue systems, online document management systems, and web search systems.
In this kind of search problem, the user takes the initiative to pull relevant information out of the
collection. This is appropriate when the user has an ad hoc (short-term) information need. If
the user has a long-term information need, the retrieval system can also take the initiative to push
any newly arrived information items to the user.
This kind of access to information is called information filtering, and the corresponding systems
are known as filtering systems or recommender systems.
There are three fundamental measures for assessing the quality of text retrieval:
Precision
Recall
F-score
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query. Precision
can be defined as:
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall
Recall is the percentage of documents that are relevant to the query and were in fact retrieved.
Recall is defined as:
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
F-score
F-score is the commonly used trade-off measure. An information retrieval system often needs to trade
recall for precision or vice versa. F-score is defined as the harmonic mean of recall and precision:
F-score = (2 × Precision × Recall) / (Precision + Recall)
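A small sketch (with a made-up set of relevant and retrieved documents) that computes the three measures defined above:

# Minimal sketch: precision, recall and F-score for a made-up query result.
relevant  = {"d1", "d2", "d3", "d4"}        # documents truly relevant to the query
retrieved = {"d2", "d3", "d5"}              # documents returned by the system

hits = relevant & retrieved                 # relevant documents that were retrieved
precision = len(hits) / len(retrieved)      # 2/3
recall = len(hits) / len(relevant)          # 2/4
f_score = 2 * precision * recall / (precision + recall)

print(precision, recall, f_score)           # 0.666..., 0.5, ~0.571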
Data Mining - Mining the World Wide Web
Introduction
The World Wide Web contains huge amounts of information, such as hyperlink information and web page
access information, that provide a rich source for data mining.
Vision-Based Page Segmentation (VIPS)
The purpose of VIPS is to extract the semantic structure of a web page based on its visual
presentation.
Such a semantic structure corresponds to a tree structure. In this tree, each node
corresponds to a block.
A value is assigned to each node. This value is called the Degree of Coherence. It is
assigned to indicate how coherent the content in the block is, based on visual perception.
The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree. After
that, it finds the separators between these blocks.
The separators refer to the horizontal or vertical lines in a web page that visually cross
no blocks.
The semantics of the web page are constructed on the basis of these blocks.
The following figure shows the procedure of VIPS algorithm:
Data Mining - Applications & Trends
Introduction
Data mining is widely used in diverse areas. There are a number of commercial data mining systems
available today, yet there are many challenges in this field. In this tutorial, we will discuss the applications
and trends of data mining.
Data Mining Applications
Here is the list of areas where data mining is widely used:
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection
Financial Data Analysis
Typical cases of data mining in the banking and financial industry include the following:
Design and construction of data warehouses for multidimensional data analysis and data
mining.
Loan payment prediction and customer credit policy analysis.
Retail Industry
Data mining has great application in the retail industry because it collects large amounts of data
on sales, customer purchasing history, goods transportation, consumption and services. It is natural
that the quantity of data collected will continue to expand rapidly because of the increasing ease,
availability and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends, which
leads to improved quality of customer service and good customer retention and satisfaction. Here is
the list of examples of data mining in the retail industry:
Design and Construction of data warehouses based on benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Telecommunication Industry
Today the telecommunication industry is one of the most emerging industries, providing various
services such as fax, pager, cellular phone, Internet messenger, images, e-mail, web data
transmission, etc. Due to the development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is the reason why data mining has become
very important in helping to understand the business.
Biological Data Analysis
Nowadays we see vast growth in the field of biology, such as genomics, proteomics,
functional genomics and biomedical research. Biological data mining is a very important part of
bioinformatics. Following are the aspects in which data mining contributes to biological data
analysis:
Semantic integration of heterogeneous, distributed genomic and proteomic databases.
Alignment, indexing, similarity search and comparative analysis of multiple nucleotide
sequences.
Discovery of structural patterns and analysis of genetic networks and protein pathways.
Association and path analysis.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets, for
which statistical techniques are appropriate. Huge amounts of data have been collected from
scientific domains such as geosciences, astronomy, etc. Large data sets are also being
generated because of fast numerical simulations in various fields such as climate and
ecosystem modeling, chemical engineering, fluid dynamics, etc. Following are the applications of
data mining in the field of scientific applications:
Intrusion Detection
An intrusion refers to any kind of action that threatens the integrity, confidentiality or availability of network
resources. In this world of connectivity, security has become a major issue. The increased usage
of the internet and the availability of tools and tricks for intruding and attacking networks have prompted
intrusion detection to become a critical component of network administration. Here is the list of areas in
which data mining technology may be applied for intrusion detection:
Development of data mining algorithms for intrusion detection.
Association and correlation analysis, and aggregation to help select and build discriminating
attributes.
Analysis of Stream data.
Distributed data mining.
Visualization and query tools.
Which data mining system to choose will depend on the following features of the data mining system:
Data Types - The data mining system may handle formatted text, record-based data and
relational data. The data could also be in ASCII text, relational database data or data
warehouse data. Therefore we should check what exact format, the data mining system
can handle.
System Issues - We must consider the compatibility of a data mining system with different
operating systems. One data mining system may run on only one operating system or
on several. There are also data mining systems that provide web-based user interfaces
and allow XML data as input.
Data Sources - Data sources refer to the data formats on which the data mining system will
operate. Some data mining systems may work only on ASCII text files, while others work on
multiple relational sources. A data mining system should also support ODBC connections or
OLE DB for ODBC connections.
Data Mining functions and methodologies - There are some data mining systems that
provide only one data mining function, such as classification, while some provide multiple
data mining functions such as concept description, discovery-driven OLAP analysis,
association mining, linkage analysis, statistical analysis, classification, prediction,
clustering, outlier analysis, similarity search, etc.
Coupling data mining with databases or data warehouse systems - The data mining
system needs to be coupled with a database or data warehouse system. The coupled
components are integrated into a uniform information processing environment. Here are the
types of coupling listed below:
No coupling
Loose Coupling
Semi tight Coupling
Tight Coupling
Scalability - There are two scalability issues in data mining:
Row (database size) scalability
Column (dimension) scalability
Theoretical Foundations of Data Mining
The various theories that form the basis of data mining include the following:
Data Reduction - The basic idea of this theory is to reduce the data representation, trading
accuracy for speed in response to the need to obtain quick approximate answers to
queries on very large databases. Some of the data reduction techniques are as follows:
Singular value Decomposition
Wavelets
Regression
Log-linear models
Histograms
Clustering
Sampling
Construction of Index Trees
Data Compression - The basic idea of this theory is to compress the given data by
encoding in terms of the following:
Bits
Association Rules
Decision Trees
Clusters
Pattern Discovery - The basic idea of this theory is to discover patterns occurring in the
database. Following are the areas that contribute to this theory:
Machine Learning
Neural Network
Association Mining
Sequential Pattern Matching
Clustering
Probability Theory - This theory is based on statistical theory. The basic idea behind it
is to discover joint probability distributions of random variables.
Microeconomic View - As per this theory, data mining is about finding patterns that are
interesting only to the extent that they can be used in the decision-making process of some
enterprise.
Inductive Databases - As per the perception of this theory, the database schema consists
of data and patterns that are stored in the database. According to this theory, data mining
is the task of performing induction on databases.
Statistical Data Mining
Apart from the database-oriented techniques, there are statistical techniques also available
for data analysis. These techniques can be applied to scientific data and data from the economic
and social sciences as well. Some of the statistical data mining techniques are described below.
Regression - Regression methods are used to predict the value of a response variable from
one or more predictor variables. The kinds of regression include:
Polynomial
Nonparametric
Robust
Generalized Linear Models - Generalized Linear Model includes:
Logistic Regression
Poisson Regression
The model's generalization allows a categorical response variable to be related to a set of
predictor variables, in a manner similar to the modelling of a numeric response variable using
linear regression.
Mixed-effect Models - These models are used for analyzing the grouped data. These
models describe the relationship between a response variable and some covariates in data
grouped according to one or more factors.
Factor Analysis - The factor analysis method is used to predict a categorical response
variable. This method assumes that the independent variables follow a multivariate normal
distribution.
Time Series Analysis - Following are the methods for analyzing time-series data:
Autoregression Methods
Visual Data Mining
Visual data mining uses data and/or knowledge visualization techniques to discover implicit
knowledge from large data sets. Data in a database or a data warehouse can be viewed in several
visual forms, such as:
Boxplots
3-D Cubes
Data distribution charts
Curves
Surfaces
Link graphs etc.
Data Mining result Visualization - Data Mining Result Visualization is the presentation of
the results of data mining in visual forms. These visual forms could be scatter plots and
boxplots etc.
Data Mining Process Visualization - Data mining process visualization presents the
several processes of data mining. It allows users to see how the data is extracted and
from which database or data warehouse it is cleaned, integrated, preprocessed, and mined.
Audio Data Mining
Audio data mining makes use of audio signals to indicate the patterns of data or the features of data
mining results. By transforming patterns into sound and music instead of watching pictures, we can
listen to pitches and tunes in order to identify anything interesting.
Data Mining and Collaborative Filtering
Today, consumers are faced with a large variety of goods and services while shopping. During live
customer transactions, a recommender system helps the consumer by making product
recommendations. The collaborative filtering approach is generally used for recommending
products to customers. These recommendations are based on the opinions of other customers.
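As a rough, simplified sketch (not the tutorial's method), user-based collaborative filtering can recommend items liked by the customers whose ratings are most similar to the target customer's; the names and ratings below are invented.

# Toy sketch of user-based collaborative filtering: recommend items that
# similar customers rated highly. Ratings and names are made up.
from math import sqrt

ratings = {
    "alice": {"camera": 5, "laptop": 4, "phone": 1},
    "bob":   {"camera": 4, "laptop": 5, "printer": 4},
    "carol": {"phone": 5, "printer": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    return dot / (sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values())))

def recommend(user):
    """Suggest unseen items from the most similar other user."""
    others = [(cosine_similarity(ratings[user], ratings[o]), o) for o in ratings if o != user]
    _, nearest = max(others)
    return [item for item in ratings[nearest] if item not in ratings[user]]

print(recommend("alice"))   # e.g. ['printer'], borrowed from the most similar customer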