UNIT III INTRODUCTION TO DATA MINING

Data mining:

Data mining is the process of sorting through large data sets to identify patterns and
relationships that can help solve business problems through data analysis. Data mining
techniques and tools enable enterprises to predict future trends and make more-informed
business decisions.

Data mining is a key part of data analytics overall and one of the core disciplines in data
science, which uses advanced analytics techniques to find useful information in data sets.
At a more granular level, data mining is a step in the knowledge discovery in databases
(KDD) process, a data science methodology for gathering, processing and analyzing data.
Data mining and KDD are sometimes referred to interchangeably, but they’re more
commonly seen as distinct things.

Why is data mining important?

Data mining is a crucial component of successful analytics initiatives in organizations. The information it generates can be used in business intelligence (BI) and advanced analytics applications that involve analysis of historical data, as well as real-time analytics applications that examine streaming data as it's created or collected.

Effective data mining aids in various aspects of planning business strategies and managing
operations. That includes customer-facing functions such as marketing, advertising, sales
and customer support, plus manufacturing, supply chain management, finance and HR.
Data mining supports fraud detection, risk management, cybersecurity planning and many
other critical business use cases. It also plays an important role in healthcare, government,
scientific research, mathematics, sports and more.

KDD vs Data Mining

KDD (Knowledge Discovery in Databases) is a field of computer science that provides the tools and theories to help humans extract useful and previously unknown information (i.e., knowledge) from large collections of digitized data. KDD consists of several steps, and Data Mining is one of them. Data Mining is the application of a specific algorithm to extract patterns from data. Nonetheless, KDD and Data Mining are often used interchangeably.

Although the two terms KDD and Data Mining are heavily used interchangeably, they refer
to two related yet slightly different concepts.

KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process that deals with identifying patterns in data. Data Mining is the application of a specific algorithm chosen according to the overall goal of the KDD process.

KDD is an iterative process where evaluation measures can be enhanced, mining can be
refined, and new data can be integrated and transformed to get different and more
appropriate results.

Data Mining vs KDD

Basic Definition:

DM-Data mining is the process of identifying patterns and extracting details about big data
sets using intelligent methods.

KDD-The KDD method is a complex and iterative approach to knowledge extraction from
big data.

Goal:

DM-To extract patterns from datasets.

KDD-To discover knowledge from datasets.

Scope:

DM-In the KDD method, the fourth phase is called “data mining.”

KDD-KDD is a broad method that includes data mining as one of its steps.

Used Techniques:

DM-Classification

Clustering

Decision Trees

Dimensionality Reduction
Neural Networks

Regression

KDD-Data cleaning

Data Integration

Data selection

Data transformation

Data mining

Pattern evaluation

Knowledge Presentation

Example:

DM-Clustering groups of data elements based on how similar they are.

KDD-End-to-end data analysis to find patterns and links and turn them into usable knowledge.

4 stages to follow in your data mining process

1. Data cleaning and preprocessing

Data cleaning and preprocessing is an essential step of the data mining process as it
makes the data ready for analysis. Data cleaning includes deleting any unnecessary
features or attributes, identifying and correcting outliers, filling in missing values, and
converting categorical variables to numerical ones. This involves removing or correcting
erroneous, incomplete, or inconsistent data, as well as formatting the data into a usable
format for analysis. Preprocessing also includes normalizing the data, reducing its
dimensionality, and performing feature selection to identify important features.

Many companies include these steps as part of their broader data governance initiatives.
After cleaning and preprocessing is complete, the data is ready for exploration and
visualization.
2. Data modeling and evaluation

Data modeling and evaluation is the process of training machine learning models with the
data and then evaluating their performance. This involves selecting an appropriate
algorithm for the task, tuning its hyperparameters to optimize its performance, and using
measures such as accuracy or precision to evaluate its results. After a model is trained and
evaluated, it can be deployed for real-world applications. In addition, data mining can also
be used to detect anomalies or outliers in the data. This is especially useful for fraud
detection and cybersecurity applications. After identifying any anomalies or outliers,
analysts can then investigate further to gain more insight into the problem.

3. Data exploration and visualization

Data exploration and visualization is the process of exploring, analyzing, and visualizing
data to gain insights and identify patterns. This involves summarizing the data using
descriptive statistics, such as measuring its central tendency, dispersion, and correlation
between features; plotting distributions of data points; and performing clustering or
classification algorithms to group similar data points together. Through these methods,
data professionals, including data analysts, data scientists, and analytics and data
engineers, can gain insight into the underlying structure of the data and identify
relationships between features.

Data visualization tools, such as heatmaps, histograms, bar charts, and scatter plots, can
also be used to easily communicate and see how different datasets relate, correlate, and
diverge. Additionally, dimensionality reduction techniques such as principal component
analysis (PCA) can help reduce the complexity of datasets by representing them in fewer
dimensions. After exploring and visualizing the data, analysts can decide which machine
learning algorithms would be most suitable for their project.

4. Deployment and maintenance

In the final stage of data mining, the trained models are deployed in a production
environment. This requires configuring the model for real-time execution and setting up any
necessary monitoring mechanisms to ensure its performance. Additionally, any changes
made to the model or dataset may require re-training the model and redeploying it to
production. Finally, maintenance is also necessary to ensure the performance of the model
and keep it up-to-date with any changes to the data or environment. By keeping track of
these factors, businesses can ensure that their data mining models remain accurate and
can give reliable results in production.

Data Mining Task Primitives

A data mining task can be specified in the form of a data mining query, which is input to the
data mining system. A data mining query is defined in terms of data mining task primitives.
These primitives allow the user to interactively communicate with the data mining system
during discovery to direct the mining process or examine the findings from different angles
or depths. The data mining primitives specify the following:

Set of task-relevant data to be mined.

Kind of knowledge to be mined.

Background knowledge to be used in the discovery process.

Interestingness measures and thresholds for pattern evaluation.

Representation for visualizing the discovered patterns.

A data mining query language can be designed to incorporate these primitives, allowing
users to interact with data mining systems flexibly. Having a data mining query language
provides a foundation on which user-friendly graphical interfaces can be built.

Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to evolution analysis. Each task has different requirements. The design of an effective data mining query language requires a deep understanding of the power, limitations, and underlying mechanisms of the various kinds of data mining tasks. Such a language also facilitates a data mining system's communication with other information systems and its integration with the overall information processing environment.
List of Data Mining Task Primitives


A data mining query is defined in terms of the following primitives, such as:

1. The set of task-relevant data to be mined

This specifies the portions of the database or the set of data in which the user is interested.
This includes the database attributes or data warehouse dimensions of interest (the
relevant attributes or dimensions).

In a relational database, the set of task-relevant data can be collected via a relational query
involving operations like selection, projection, join, and aggregation.

The data collection process results in a new data relation, called the initial data relation. The initial data relation can be ordered or grouped according to the conditions specified in the query. This data retrieval can be thought of as a subtask of the data mining task.

This initial relation may or may not correspond to a physical relation in the database. Since virtual relations are called views in the field of databases, the set of task-relevant data for data mining is called a minable view.

2. The kind of knowledge to be mined

This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.

3. The background knowledge to be used in the discovery process

This knowledge about the domain to be mined is useful for guiding the knowledge
discovery process and evaluating the patterns found. Concept hierarchies are a popular
form of background knowledge, which allows data to be mined at multiple levels of
abstraction.

A concept hierarchy defines a sequence of mappings from low-level concepts to higher-level, more general concepts.

Rolling Up – Generalization of data: Allows data to be viewed at higher, more meaningful levels of abstraction and makes it easier to understand. It also compresses the data, so mining requires fewer input/output operations.

Drilling Down – Specialization of data: Higher-level concept values are replaced by lower-level, more specific concepts.

Based on different user viewpoints, there may be more than one concept hierarchy for a given attribute or dimension. For example, a concept hierarchy for the attribute (or dimension) age might map raw ages into ranges such as youth, middle-aged, and senior. User beliefs regarding relationships in the data are another form of background knowledge.

4. The interestingness measures and thresholds for pattern evaluation

Different kinds of knowledge may have different interestingness measures. They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.

Simplicity: A factor contributing to the interestingness of a pattern is the pattern's overall simplicity for human comprehension. For example, the more complex the structure of a rule is, the more difficult it is to interpret, and hence, the less interesting it is likely to be. Objective measures of pattern simplicity can be viewed as functions of the pattern structure, defined in terms of the pattern size in bits or the number of attributes or operators appearing in the pattern.
Certainty (Confidence): Each discovered pattern should have a measure of certainty
associated with it that assesses the validity or “trustworthiness” of the pattern. A certainty
measure for association rules of the form “A =>B” where A and B are sets of items is
confidence. Confidence is a certainty measure. Given a set of task-relevant data tuples,
the confidence of “A => B” is defined as

Confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)

Utility (Support): The potential usefulness of a pattern is a factor defining its interestingness. It can be estimated by a utility function, such as support. The support of an association pattern refers to the percentage of task-relevant data tuples (or transactions) for which the pattern is true.

Support(A => B) = (# tuples containing both A and B) / (total # of tuples)
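As a minimal Python sketch of these two measures, the following snippet computes support and confidence over a small hypothetical transaction list (the item names and values are invented for illustration):

# Minimal sketch: support and confidence for a rule A => B
# over a toy transaction list (items and numbers are hypothetical).
transactions = [
    {"coffee", "cookies"},
    {"coffee", "milk"},
    {"coffee", "cookies", "sugar"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in itemset.
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support(A and B) / support(A) for the rule A => B.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"coffee", "cookies"}, transactions))      # 0.5
print(confidence({"coffee"}, {"cookies"}, transactions))  # ~0.67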

Novelty: Novel patterns are those that contribute new information or increased performance to the given pattern set; a data exception is one example. Another strategy for detecting novelty is to remove redundant patterns.

5. The expected representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed, which may include
rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual
representations.

Users must be able to specify the forms of presentation to be used for displaying the
discovered patterns. Some representation forms may be better suited than others for
particular kinds of knowledge.

For example, generalized relations and their corresponding cross tabs or pie/bar charts are
good for presenting characteristic descriptions, whereas decision trees are common for
classification.

DATA MINING TECHNIQUES


Every data science application demands a different data mining technique. Two popular and well-known applications of data mining are pattern recognition and anomaly detection; both employ a combination of techniques to mine data.

Let’s look at some of the fundamental data mining techniques commonly used across
industry verticals.

1. Association rule

The association rule refers to if-then statements that establish correlations and relationships between two or more data items. The correlations are evaluated using support and confidence metrics, where support measures how frequently the data items occur within the dataset and confidence reflects how often the if-then statement holds true.

For example, while tracking a customer’s behavior when purchasing online items, an
observation is made that the customer generally buys cookies when purchasing a coffee
pack. In such a case, the association rule establishes the relation between two items of
cookies and coffee packs, thereby forecasting future buys whenever the customer adds the
coffee pack to the shopping cart.

2. Classification

The classification data mining technique classifies data items within a dataset into
different categories. For example, we can classify vehicles into different categories, such
as sedan, hatchback, petrol, diesel, electric vehicle, etc., based on attributes such as the
vehicle’s shape, wheel type, or even number of seats. When a new vehicle arrives, we can
categorize it into various classes depending on the identified vehicle attributes. One can
apply the same classification strategy to classify customers based on their age, address,
purchase history, and social group.

Some examples of classification methods include decision trees, Naive Bayes classifiers, logistic regression, and so on.
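As a hedged illustration (not part of the original text), the following minimal Python sketch trains one such classifier, a decision tree, with scikit-learn on its built-in iris dataset; the parameters are illustrative choices:

# Minimal classification sketch: decision tree on the iris dataset (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)                            # learn class boundaries from labeled data
print(accuracy_score(y_test, clf.predict(X_test)))   # evaluate on unseen samples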

3. Clustering

Clustering data mining techniques group data elements into clusters that share common
characteristics. We can cluster data pieces into categories by simply identifying one or
more attributes. Some of the well-known clustering techniques are k-means clustering,
hierarchical clustering, and Gaussian mixture models.
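As a minimal illustration, the sketch below groups a few hypothetical two-attribute points into two clusters with scikit-learn's k-means; the data and the choice of two clusters are assumptions made only for the example:

# Minimal clustering sketch: grouping points with k-means (data is hypothetical).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)    # cluster index assigned to each point
print(labels)                     # e.g. [1 1 1 0 0 0] (numbering may differ)
print(kmeans.cluster_centers_)    # centroids of the two discovered groups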

4. Regression

Regression is a statistical modeling technique that uses previous observations to predict new data values. In other words, it is a method of determining relationships between data elements based on the predicted data values for a set of defined variables. This category's classifier is called the 'Continuous Value Classifier'. Linear regression, multivariate regression, and decision trees are key examples of this type.
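As a minimal illustration, the following Python sketch fits a linear regression to a few hypothetical past observations and predicts a new value; the variable names and numbers are invented for the example:

# Minimal regression sketch: fit a linear model to past observations (hypothetical data).
import numpy as np
from sklearn.linear_model import LinearRegression

ad_spend = np.array([[1], [2], [3], [4], [5]])   # independent variable
sales = np.array([12, 19, 29, 37, 45])           # observed values

model = LinearRegression().fit(ad_spend, sales)
print(model.coef_, model.intercept_)   # learned slope and intercept
print(model.predict([[6]]))            # predicted value for a new observation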

5. Sequence & path analysis

One can also mine sequential data to determine patterns, wherein specific events or data
values lead to other events in the future. This technique is applied for long-term data as
sequential analysis is key to identifying trends or regular occurrences of certain events. For
example, when a customer buys a grocery item, you can use a sequential pattern to
suggest or add another item to the basket based on the customer’s purchase pattern.

6. Neural networks

Neural networks technically refer to algorithms that mimic the human brain and try to replicate its activity to accomplish a desired goal or task. They are used for several pattern recognition applications that typically involve deep learning techniques. Neural networks are a consequence of advanced machine learning research.
7. Prediction

The prediction data mining technique is typically used for predicting the occurrence of an
event, such as the failure of machinery or a fault in an industrial component, a fraudulent
event, or company profits crossing a certain threshold. Prediction techniques can help
analyze trends, establish correlations, and do pattern matching when combined with other
mining methods. Using such a mining technique, data miners can analyze past instances
to forecast future events.

KNOWLEDGE REPRESENTATION

Knowledge representation is the presentation of knowledge to the user for visualization in terms of trees, tables, rules, graphs, charts, matrices, etc.

For Example: Histograms

Histograms

A histogram provides a representation of the distribution of values of a single attribute.

It consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data.

Example: A histogram of the electricity bill generated over 4 months, as shown in the diagram below.

Electricity bill histogram
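As a minimal sketch of how such a chart could be produced, the following Python snippet uses matplotlib; the four monthly bill amounts are hypothetical values chosen only for illustration:

# Minimal sketch of the electricity-bill histogram described above (values are hypothetical).
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
bill_amount = [1200, 950, 1100, 1300]   # assumed values for illustration

plt.bar(months, bill_amount)            # one rectangle per class (month)
plt.xlabel("Month")
plt.ylabel("Electricity bill")
plt.title("Electricity bill histogram")
plt.show()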

Data Visualization

It deals with the representation of data in a graphical or pictorial format.

Patterns in the data can be identified easily by using data visualization techniques.

Some of the vital data visualization techniques are:

1. Pixel- oriented visualization technique

In pixel-oriented visualization techniques, each attribute value is represented by one colored pixel, and there is a separate sub-window for each attribute.

This maximizes the amount of information represented at one time without any overlap. A tuple with 'm' attributes is represented by 'm' colored pixels, one in each attribute's sub-window.

The color mapping of the pixels is decided on the basis of the data characteristics and the visualization task.

Pixel visualization

2. Geometric projection visualization technique

Techniques used for geometric projection include:

i. Scatter-plot matrices

A scatter-plot matrix consists of scatter plots of all possible pairs of variables in a dataset.

ii. Hyper slice

Hyper slice is an extension of scatter-plot matrices. It represents a multi-dimensional function as a matrix of orthogonal two-dimensional slices.

iii. Parallel co-ordinates

Separated parallel vertical lines define the axes, one per dimension. A point in Cartesian coordinates corresponds to a polyline in parallel coordinates.
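As a minimal illustration of the parallel coordinates idea, the following Python sketch uses the pandas plotting helper; the small DataFrame and the 'group' class column are hypothetical:

# Minimal parallel-coordinates sketch with pandas/matplotlib (data is hypothetical).
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "height": [5.1, 6.0, 5.5, 5.8],
    "weight": [55, 80, 62, 75],
    "age":    [23, 35, 29, 41],
    "group":  ["A", "B", "A", "B"],   # class column used to color the polylines
})

parallel_coordinates(df, class_column="group")  # each row becomes one polyline
plt.show()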

3. Icon-based visualization techniques

Icon-based visualization techniques are also known as iconic display techniques.

Each multidimensional data item is mapped to an icon.

This technique allows visualization of large amounts of data.

The most commonly used technique is Chernoff faces.

Chernoff faces

This concept was introduced by Herman Chernoff in 1973.

The faces in Chernoff faces are based on the facial expressions or features of human beings, which makes it easy to identify differences between data items at a glance.

Different data dimensions are mapped to different facial features, for example the width of the face, the length of the mouth, and the length of the nose, as shown in the following diagram.

Chernoff faces

4. Hierarchical visualization techniques

Hierarchical visualization techniques are used to partition all dimensions into subsets.

These subsets are visualized in a hierarchical manner.

Some of the visualization techniques are:

i. Dimensional stacking

In dimensional stacking, an n-dimensional attribute space is partitioned into 2-dimensional subspaces, which are embedded within one another.

Attribute values are partitioned into various classes.

Each element ends up in a two-dimensional space in the form of an xy plot.

The most important attributes are usually placed on the outer levels.

ii. Mosaic plot

A mosaic plot gives a graphical representation of successive decompositions.

Rectangles are used to represent the counts of categorical data; at every stage of the decomposition, the rectangles are split parallel to one of the axes.

iii. Worlds within worlds

Worlds within worlds is useful for generating an interactive hierarchy of displays.

The innermost world must have a function and the two most important parameters; the remaining parameters are fixed at constant values.

Through this, N-Vision-style exploration of the data is possible with devices such as data gloves and stereo displays, supporting rotation, scaling (inner), and translation (inner/outer).

Using queries, static interaction is also possible.

iv. Tree maps

Tree map visualization techniques are well suited for displaying large amounts of hierarchically structured data.

The visualization space is divided into multiple rectangles that are sized and ordered according to a quantitative variable.

The levels in the hierarchy are shown as rectangles containing other rectangles.

Each set of rectangles on the same level in the hierarchy represents a category, a column, or an expression in a data set.

v. Visualizing complex data and relations

This technique is used to visualize non-numeric data, for example text, pictures, blog entries, and product reviews.

A tag cloud is a visualization method that summarizes user-generated tags.

The tags can be arranged alphabetically or according to user preferences, with different font sizes and colors indicating their importance or frequency.

DATA MINING QUERY LANGUAGE

Data mining is a process in which useful data is extracted and processed from a heap of unprocessed raw data. By aggregating these datasets into a summarized format, many problems arising in finance, marketing, and many other fields can be solved. In the modern world with its enormous data, data mining is one of the growing fields of technology that acts as an application in many industries we depend on in our lives. Many developments and much research have taken place in this field, and many systems have also been proposed. Since there are numerous processes and functions to be performed in data mining, a well-developed user interface is needed. Even though there are many well-developed user interfaces for relational systems, Han, Fu, Wang, et al. proposed the Data Mining Query Language (DMQL) to further build more advanced systems and enable new kinds of research in this field. Though DMQL cannot be considered a standard language, it is a derived language that stands as a general query language for performing data mining tasks. DMQL is implemented in the DBMiner system for collecting data from several layers of databases.

Ideas in designing DMQL:

DMQL is designed based on Structured Query Language (SQL), which in turn is a relational query language.

Data mining request: For a given data mining task, the corresponding datasets must be defined in the form of a data mining request. For example, since the user can request any specific part of a dataset in the database, the data miner can use a database query to retrieve the suitable datasets before the data mining process begins. If that specific data cannot be retrieved directly, the data miner collects the supersets from which the required data can be derived. This shows the need for a query language in data mining, where data retrieval acts as a subtask. Since the extraction of relevant data from huge datasets cannot be performed manually, many development methods exist within data mining; even so, the task of collecting exactly the data requested by the user may sometimes fail. DMQL provides a command to retrieve specific datasets or data from the database, giving the desired result to the user and a comprehensible experience that fulfils the user's expectations.

Background knowledge: Prior knowledge of the datasets and their relationships in a database helps in mining the data. Knowing the relationships or any other useful information can ease the process of extraction and aggregation. For instance, a concept hierarchy over a number of datasets can increase the efficiency and accuracy of the process by making the desired data easier to collect. Once the hierarchy is known, the data can be generalized with ease.

Generalization: When the data in the datasets of a data warehouse is not generalized, it is often in the form of unprocessed primitive values with integrity constraints, loosely associated multi-valued datasets, and their dependencies. Applying the generalization concept through the query language helps process the raw data into a precise abstraction. It also supports multi-level collection of data with quality aggregation. When larger databases come into the picture, generalization plays a major role in producing desirable results at a conceptual level of data collection.

Flexibility and interaction: To avoid collecting less desirable or unwanted data from databases, suitable threshold values must be specified so that data mining remains flexible and interactive, which keeps the user experience engaging. Such threshold values can be supplied with the data mining queries.

Basic syntax in DMQL:

DMQL has a syntax similar to the relational query language SQL. It is designed with the help of Backus-Naur Form (BNF) notation/grammar. In this notation, "[ ]" denotes an optional part and "{ }" denotes zero or more occurrences.

To retrieve relevant dataset:

Syntax:

use database (database_name)

{use hierarchy (hierarchy_name) for (attribute)}

(rule_specified)

related to (attribute_or_aggregate_list)

from (relation(s)) [where (condition)]

[order by (order_list)]

{with [(type_of)] threshold = (threshold_value) [for (attribute(s))]}

In the above data-mining query, the first line retrieves the required database
(database_name). The second line uses the hierarchy one has chosen(hierarchy_name)
with the given attribute. (rule_specified) denotes the types of rules to be specified. To find
out the various specified rules, one must find the related set based on the attribute or
aggregation which helps in generalization. The from and where clauses make sure of the
given condition being satisfied. Then they are ordered using “order by” for a designated
threshold value with respect to attributes.

For the rules in DMQL:

Syntax:

Generalization:
Generalize data [into (relation_name)]

Association:

Find association rules [as (rule_name)]

Classification:

Find classification rules [as (rule_name) ] according to [(attribute)]

Characterization:

Find characteristic rules [as (rule_name)]

Discrimination:

Find discriminant rules [as (rule_name)]

For (class_1) with (condition_1)

From (relation(s)_1)

In contrast to (class_2) with (condition_2)

From (relation(s)_2)

{ in contrast to (class_i) with (condition_i)

From (relation(s)_i)}

Integrating Data Mining With Database/Data Warehouse Systems

With the exponential growth of data, data mining systems must be efficient and highly performant in order to build complex machine learning models, and a good variety of data mining systems is expected to be designed and developed. Comprehensive information processing and data analysis infrastructures will be continuously and systematically built around databases and data warehouses.

Data Mining System Architecture

A critical question in design is whether we should integrate data mining systems with
database systems.

Data mining systems can be integrated with databases and data warehouses using the following coupling schemes:

No Coupling

Loose Coupling

Semi-Tight Coupling

Tight Coupling

No Coupling

No coupling means that a DM system will not utilize any function of a DB or DW system.

It may fetch data from a particular source (such as a file system), process data using some
data mining algorithms, and then store the mining results in another file.

Drawbacks:

First, a Database/Data Warehouse system provides a great deal of flexibility and efficiency
at storing, organizing, accessing, and processing data.

Without using a Database/Data Warehouse system, a Data Mining system may spend a
substantial amount of time finding, collecting, cleaning, and transforming data.
Second, there are many tested, scalable algorithms and data structures implemented in Database and Data Warehouse systems that a no-coupling design cannot take advantage of.

Loose Coupling

Loose coupling means that a Data Mining system will use some facilities of a Database or
Data warehouse system, fetching data from a data repository managed by these systems,
performing data mining, and then storing the mining results either in a file or in a
designated place in a Database or Data Warehouse.

Loose coupling is better than no coupling because it can fetch any portion of data stored in
Databases or Data Warehouses by using query processing, indexing, and other system
facilities.

Drawbacks

It’s difficult for loose coupling to achieve high scalability and good performance with large
data sets.

Semi-Tight Coupling – Enhanced Data Mining Performance

The semi-tight coupling means that besides linking a Data Mining system to a
Database/Data Warehouse system, efficient implementations of a few essential data
mining primitives (identified by the analysis of frequently encountered data mining
functions) can be provided in the Database/Data Warehouse system.

These primitives can include sorting, indexing, aggregation, histogram analysis, multi-way
join, and pre-computation of some essential statistical measures, such as sum, count,
max, min, standard deviation.

This design will enhance the performance of Data Mining systems.

Tight Coupling – A Uniform Information Processing Environment


Tight coupling means that a Data Mining system is smoothly integrated into the
Database/Data Warehouse system.

The data mining subsystem is treated as one functional component of the information
system.

Data mining queries and functions are optimized based on mining query analysis, data
structures, indexing schemes, and query processing methods of a Database or Data
Warehouse system.

Data Preprocessing In Data Mining

Data preprocessing is an important process of data mining. In this process, raw data is
converted into an understandable format and made ready for further analysis. The motive
is to improve data quality and make it up to mark for specific tasks.

Tasks in Data Preprocessing

Data cleaning

Data cleaning helps us remove inaccurate, incomplete, and incorrect data from the dataset. Some techniques used in data cleaning are:

Handling missing values

This type of scenario occurs when some data is missing.

Standard values can be used to fill up the missing values in a manual way but only for a
small dataset.
The attribute's mean (for normally distributed data) or median (for skewed data) can be used to replace the missing values; see the sketch after this list.

Tuples can be ignored if the dataset is quite large and many values are missing within a
tuple.

The most probable value, predicted using regression or decision tree algorithms, can also be used to fill in missing values.
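As referenced above, here is a minimal pandas sketch of replacing missing values with the attribute mean; the column names and values are hypothetical:

# Minimal sketch: fill missing values with the column mean (data is hypothetical).
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 32, 40, np.nan],
                   "salary": [30000, 42000, np.nan, 52000, 61000]})

df_filled = df.fillna(df.mean(numeric_only=True))  # replace NaN with each column's mean
print(df_filled)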

Noisy Data

Noisy data is data that cannot be interpreted by machines because it contains unnecessary or faulty values. Some ways to handle it are:

Binning – This method smooths noisy data. The sorted data is divided into equal-sized bins, and a smoothing method is then applied to each bin: smoothing by bin means (bin values are replaced by the bin mean), smoothing by bin medians (bin values are replaced by the bin median), or smoothing by bin boundaries (each value is replaced by the closest of the bin's minimum and maximum values). A sketch of smoothing by bin means is shown after this list.

Regression – Regression functions are used to smoothen the data. Regression can be
linear(consists of one independent variable) or multiple(consists of multiple independent
variables).

Clustering – It is used for grouping the similar data in clusters and is used for finding
outliers.
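As referenced in the binning item above, here is a minimal Python sketch of smoothing by bin means using equal-frequency bins; the twelve sorted values are a hypothetical example:

# Minimal sketch: smoothing noisy values by equal-frequency binning with bin means.
import numpy as np

values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(values, 4)   # 4 equal-frequency bins of 3 values each

smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)   # every value replaced by the mean of its bin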

Data integration

The process of combining data from multiple sources (databases, spreadsheets, text files) into a single dataset, creating a single, consistent view of the data. Major problems during data integration are schema integration (integrating metadata collected from various sources), entity identification (identifying the same entities across different databases), and detecting and resolving data value conflicts.

Data transformation

In this step, the format or structure of the data is changed to make it suitable for the mining process. Methods for data transformation are:

Normalization – Method of scaling data to represent it in a specific smaller range, such as -1.0 to 1.0 or 0.0 to 1.0; a min-max scaling sketch is shown after this list.

Discretization – It helps reduce the data size by dividing continuous data into intervals.

Attribute Selection – To help the mining process, new attributes are derived from the given
attributes.

Concept Hierarchy Generation – In this, the attributes are changed from lower level to
higher level in hierarchy.

Aggregation – A summary of the data is stored; the quality and quantity of the available data determine how useful the aggregated result is.
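As referenced in the normalization item above, here is a minimal Python sketch of min-max scaling to the range 0 to 1 (the same idea rescales to -1 to 1); the input values are hypothetical:

# Minimal sketch: min-max normalization to the range 0..1 (values are hypothetical).
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])
x_norm = (x - x.min()) / (x.max() - x.min())   # min maps to 0, max maps to 1
print(x_norm)   # [0.    0.125 0.25  0.5   1.   ]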

Data reduction

Data reduction increases storage efficiency and reduces the volume of data to make analysis easier while producing almost the same analytical results. Analysis becomes harder when working with huge amounts of data, so reduction is used to avoid that.

Steps of data reduction are –


Data Compression

Data is compressed to make analysis more efficient. Lossless compression means no data is lost during compression; lossy compression removes unnecessary information during compression.

Numerosity Reduction

The volume of data is reduced by storing a model of the data instead of the whole data, which provides a smaller representation with little or no loss of information.

Dimensionality reduction

Here, the number of attributes or random variables is reduced to lower the dimensionality of the data set. Attributes are combined without losing the data's essential characteristics.


What is Data Cleaning in Data Mining?

Data cleaning, also known as data cleansing, is the process of identifying and correcting or
removing inaccurate, incomplete, irrelevant, or inconsistent data in a dataset. Data
cleaning is a critical step in data mining as it ensures that the data is accurate, complete,
and consistent, improving the quality of analysis and insights obtained from the data. Data
cleaning may involve tasks such as removing duplicates, filling in missing values, handling
outliers, correcting spelling errors, resolving inconsistencies in the data, etc. Data cleaning
helps to minimize the impact of data errors on the results of data mining analysis.

Data Cleaning Characteristics

Some key characteristics of data cleaning are –

Iterative process – Data cleaning in data mining is an iterative process that involves
multiple iterations of identifying, assessing, and addressing data quality issues. It is often
an ongoing activity throughout the data mining process, as new insights and patterns may
prompt the need for further data cleaning.

Time-consuming – Data cleaning in data mining can be a time-consuming task, especially when dealing with large and complex datasets. It requires careful examination of the data, identifying errors or inconsistencies, and implementing appropriate corrections or treatments. The time required for data cleaning can vary based on the complexity of the dataset and the extent of the data quality issues.

Domain expertise – Data cleaning in data mining often requires domain expertise, as
understanding the context and characteristics of the data is crucial for effective cleaning.
Domain experts possess the necessary knowledge about the data and can make informed
decisions about handling missing values, outliers, or inconsistencies based on their
understanding of the subject matter.

Impact on analysis – Data cleaning in data mining directly impacts the quality and reliability
of the analysis and results obtained from data mining. Neglecting data cleaning can lead to
biased or inaccurate outcomes, misleading patterns, and unreliable insights. By
performing thorough data cleaning, analysts can ensure that the data used for analysis is
accurate, consistent, and representative of the real-world scenario.

Steps of Data Cleaning

The steps involved in the process of data cleaning in data mining can vary depending on the
specific dataset and the requirements of the analysis, but some common steps are –

Data profiling – Data profiling involves examining the dataset to gain an understanding of its
structure, contents, and quality. It helps identify data types, distributions, missing values,
outliers, and potential issues that need to be addressed during the cleaning process.

Handling missing data – Missing data refers to instances where values are not recorded or
are incomplete. Data cleaning involves deciding how to handle missing data, including
imputing missing values using statistical methods, removing instances with missing
values, or using specialized techniques based on domain knowledge.

Handling duplicates – Duplicate records occur when the dataset has identical or very
similar instances. Data cleaning involves identifying and removing duplicate records to
ensure data integrity and prevent bias in the analysis.

Handling outliers – Outliers are data points that are significantly different from other data
points in the dataset. Therefore, identifying and handling outliers can be important for
maintaining the integrity of the analysis. Outliers can be detected using statistical methods
such as Z-score or box plot analysis and removed or adjusted as necessary.

Standardization – Standardization aims to ensure a consistent and uniform representation of data attributes. It involves addressing inconsistencies in data formatting, units of measurement, or categorical values. Data cleaning includes standardizing attributes to a common format to facilitate accurate analysis and interpretation.

Resolving inconsistencies – Inconsistencies can arise from data entry errors, variations in
naming conventions, or conflicting information. Data cleaning involves identifying and
resolving such inconsistencies by cross-validating data from different sources, performing
data validation checks, and leveraging domain expertise to determine the most accurate
values or resolve conflicts.

Quality assurance – Quality assurance is the final step in data cleaning. It involves
performing checks to ensure the accuracy, completeness, and reliability of the cleaned
dataset. This includes validating the cleaned data against predefined criteria, verifying the
effectiveness of data cleaning techniques, and conducting quality control measures to
ensure the data is suitable for analysis.

Data Cleaning Tools in Data Mining

Many data cleaning tools are available for data mining, and the choice of tool depends on
the type of data being cleaned and the user’s specific requirements. Some popular data
cleaning tools used in data mining include –

OpenRefine – OpenRefine is a free and open-source data cleaning tool that can be used for
data exploration, cleaning, and transformation. It supports various data formats, including
CSV, Excel, and JSON.

Trifacta Wrangler – Trifacta is a data cleaning tool that uses machine learning algorithms to
identify and clean data errors, inconsistencies, and missing values. It is designed for large-
scale data cleaning and can handle various data formats.

Talend – Talend is an open-source data integration and cleaning tool that can be used for
data profiling, cleaning, and transformation. It supports various data formats and can be
integrated with other tools and platforms.

TIBCO Clarity – TIBCO Clarity is a data quality management tool that provides a unified
view of an organization’s data assets. It includes features such as data profiling, data
cleaning, and data matching to ensure data quality across the organization.
Cloudingo – Cloudingo is a data cleansing tool specifically designed for Salesforce data. It
includes features like duplicate detection and merging, data standardization, and data
enrichment to ensure high-quality data within Salesforce.

IBM Infosphere Quality Stage – IBM Infosphere Quality Stage is a data quality management
tool that includes features such as data profiling, data cleansing, and data matching. It
also includes advanced features such as survivorship and data lineage to ensure data
quality and governance across the organization.

DATA TRANSFORMATION

Data transformation in data mining refers to the process of converting raw data into a
format that is suitable for analysis and modeling. The goal of data transformation is to
prepare the data for data mining so that it can be used to extract useful insights and
knowledge. Data transformation typically involves several steps, including:

Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the
data.

Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.

Data normalization: Scaling the data to a common range of values, such as between 0 and
1, to facilitate comparison and analysis.

Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant
features or attributes.

Data discretization: Converting continuous data into discrete categories or bins.

Data aggregation: Combining data at different levels of granularity, such as by summing or averaging, to create new features or attributes.

Data transformation is an important step in the data mining process as it helps to ensure
that the data is in a format that is suitable for analysis and modeling, and that it is free of
errors and inconsistencies. Data transformation can also help to improve the performance
of data mining algorithms, by reducing the dimensionality of the data, and by scaling the
data to a common range of values.
FEATURE SELECTION

Feature selection is critical to building a good model for several reasons. One is that
feature selection implies some degree of cardinality reduction, to impose a cutoff on the
number of attributes that can be considered when building a model. Data almost always
contains more information than is needed to build the model, or the wrong kind of
information. For example, you might have a dataset with 500 columns that describe the
characteristics of customers; however, if the data in some of the columns is very sparse
you would gain very little benefit from adding them to the model, and if some of the
columns duplicate each other, using both columns could affect the model.

Not only does feature selection improve the quality of the model, it also makes the process
of modeling more efficient. If you use unneeded columns while building a model, more
CPU and memory are required during the training process, and more storage space is
required for the completed model. Even if resources were not an issue, you would still want
to perform feature selection and identify the best columns, because unneeded columns
can degrade the quality of the model in several ways:

Noisy or redundant data makes it more difficult to discover meaningful patterns.

If the data set is high-dimensional, most data mining algorithms require a much larger
training data set.

During the process of feature selection, either the analyst or the modeling tool or algorithm
actively selects or discards attributes based on their usefulness for analysis. The analyst
might perform feature engineering to add features, and remove or modify existing data,
while the machine learning algorithm typically scores columns and validates their
usefulness in the model.

In short, feature selection helps solve two problems: having too much data that is of little
value, or having too little data that is of high value. Your goal in feature selection should be
to identify the minimum number of columns from the data source that are significant in
building a model.
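As a hedged illustration of scoring and keeping only the most useful columns, the following minimal sketch uses scikit-learn's univariate feature selection; the dataset and the choice of k=2 are assumptions made for the example:

# Minimal feature-selection sketch: keep the 2 highest-scoring columns (illustrative only).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)  # score columns, keep the top 2
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)   # usefulness score assigned to each column
print(X_reduced.shape)    # (150, 2): only the selected features remain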
Dimensionality reduction

Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. High-dimensional data reduction, as part of a data pre-processing step, is extremely important in many real-world applications and has emerged as one of the significant tasks in data mining. For example, you may have a dataset with hundreds of features (columns in your database). Dimensionality reduction means reducing those features or attributes by combining or merging them in such a way that not much of the significant character of the original dataset is lost. One of the major problems that occurs with high-dimensional data is widely known as the "Curse of Dimensionality", which pushes us to reduce the dimensions of our data before using it for analysis.
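As a minimal illustration of dimensionality reduction, the sketch below applies PCA with scikit-learn to project a four-feature dataset onto two components; the dataset and component count are illustrative choices:

# Minimal dimensionality-reduction sketch with PCA (illustrative only).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 4 original features

pca = PCA(n_components=2)              # combine features into 2 components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # how much information each component keeps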

Discretization in data mining

Data discretization refers to a method of converting a huge number of data values into smaller ones so that the evaluation and management of the data become easy. In other words, data discretization is a method of converting the values of continuous attributes into a finite set of intervals with minimum data loss. There are two forms of data discretization: supervised discretization and unsupervised discretization. Supervised discretization refers to a method in which the class data is used. Unsupervised discretization does not use class information and is characterized by how the operation proceeds, that is, by a top-down splitting strategy or a bottom-up merging strategy.
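As a minimal illustration, the following pandas sketch discretizes a continuous attribute (age) into intervals, once with explicit bin edges and once with equal-frequency bins; the ages, edges, and labels are hypothetical:

# Minimal discretization sketch with pandas (values and bins are hypothetical).
import pandas as pd

ages = pd.Series([3, 17, 25, 34, 48, 61, 77])

# Binning with explicit interval edges and labels.
age_groups = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                    labels=["child", "young", "middle_aged", "senior"])
print(age_groups)

# Equal-frequency binning: each interval gets roughly the same number of values.
print(pd.qcut(ages, q=3))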

Concept Hierarchy in Data Mining

In data mining, the concept of a concept hierarchy refers to the organization of data into a
tree-like structure, where each level of the hierarchy represents a concept that is more
general than the level below it. This hierarchical organization of data allows for more
efficient and effective data analysis, as well as the ability to drill down to more specific
levels of detail when needed. The concept of hierarchy is used to organize and classify data
in a way that makes it more understandable and easier to analyze. The main idea behind
the concept of hierarchy is that the same data can have different levels of granularity or
levels of detail and that by organizing the data in a hierarchical fashion, it is easier to
understand and perform analysis.

Mining frequent patterns

Finding recurrent patterns or item sets in huge datasets is the goal of frequent pattern
mining, a crucial data mining approach. It looks for groups of objects that regularly appear
together in order to expose underlying relationships and interdependence. Market basket
analysis, web usage mining, and bioinformatics are a few areas where this method is
important.

It helps organizations comprehend client preferences, optimize cross-selling tactics, and improve recommendation systems by revealing patterns of consumer behavior. By examining user navigational habits and customizing the browsing experience, web usage mining aids in enhancing website performance. We'll examine frequent pattern mining in data mining in this piece. Let's begin.

Basic Concepts in Frequent Pattern Mining

The technique of frequent pattern mining is built upon a number of fundamental ideas. The
analysis is based on transaction databases, which include records or transactions that
represent collections of objects. Items inside these transactions are grouped together as
itemsets.

The importance of patterns is greatly influenced by the support and confidence measurements. Support quantifies how frequently an itemset appears in the database, whereas confidence quantifies how likely it is that a rule generated from the itemset is accurate.

The Apriori algorithm, a popular method for finding recurrent patterns, takes a methodical approach. It generates candidate itemsets, prunes the infrequent ones, and then progressively grows the size of the itemsets until no more frequent itemsets can be found. The patterns that satisfy the required support criteria are successfully identified through this iterative approach.

Techniques for Frequent Pattern Mining

Apriori Algorithm

One of the most popular methods, the Apriori algorithm, uses a step-by-step procedure to find frequent itemsets. It starts by creating candidate itemsets of length 1, determining their support, and eliminating any that fall below the predetermined cutoff. The method then joins the frequent itemsets from the previous phase to produce bigger itemsets repeatedly.

The procedure is repeated until no more frequent itemsets can be located. The Apriori approach is commonly used because of its efficiency and simplicity, but because it requires numerous database scans for big datasets, it can be computationally expensive.
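As a minimal pure-Python sketch of the Apriori idea described above (count candidate itemsets, keep those meeting the support threshold, then join the survivors into larger candidates), the transactions and the 0.6 threshold below are hypothetical, and the classic subset-based pruning step is omitted for brevity:

# Minimal Apriori-style sketch: frequent itemsets from a toy transaction list.
from itertools import combinations

transactions = [{"bread", "milk"},
                {"bread", "diaper", "beer", "eggs"},
                {"milk", "diaper", "beer", "cola"},
                {"bread", "milk", "diaper", "beer"},
                {"bread", "milk", "diaper", "cola"}]
min_support = 0.6   # assumed threshold for the example

def frequent_itemsets(transactions, min_support):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, k_sets = {}, [frozenset([i]) for i in items]
    while k_sets:
        # Count support of the current candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in k_sets}
        current = {c: v / n for c, v in counts.items() if v / n >= min_support}
        frequent.update(current)
        # Join step: build (k+1)-candidates from the surviving k-itemsets.
        keys = list(current)
        k_sets = list({a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1})
    return frequent

for itemset, sup in frequent_itemsets(transactions, min_support).items():
    print(set(itemset), round(sup, 2))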

FP-growth Algorithm

A different strategy for frequent pattern mining is provided by the FP-growth algorithm. It creates a compact data structure known as the FP-tree that effectively describes the dataset without creating candidate itemsets. The FP-growth algorithm constructs the FP-tree recursively and then mines frequent itemsets directly from it.

FP-growth can be much quicker than Apriori by skipping the construction of candidate itemsets, which lowers the number of passes over the dataset. It is very helpful for sparse and huge datasets.

Eclat Algorithm

Eclat stands for Equivalence Class Clustering and bottom-up Lattice Traversal, a well-liked frequent pattern mining method. It explores the itemset lattice using a depth-first search approach, concentrating on a vertical data format representation.

Transaction identifiers (TIDs) are used effectively by Eclat to compute intersections between itemsets. This technique is renowned for its ease of use and low memory requirements, making it appropriate for mining frequent itemsets in vertical databases.

Applications of Frequent Pattern Mining

Market Basket Analysis


Market basket analysis uses frequent pattern mining to comprehend consumer buying patterns. Businesses gain knowledge about product associations by recognizing itemsets that commonly appear together in transactions. This knowledge enables companies to improve recommendation systems and cross-sell efforts. Retailers can use it to make data-driven decisions that enhance customer happiness and boost sales.

Web usage mining

Web usage mining involves examining user navigation patterns to learn more about how people use websites. Frequent pattern mining makes it possible to identify recurrent navigation patterns and session patterns in order to personalize websites and enhance their performance. By studying how consumers interact with a website, businesses can change content, layout, and navigation to improve user experience and boost engagement.

Bioinformatics

In the field of bioinformatics, frequent pattern mining makes it possible to identify relevant DNA patterns. Researchers can gain insights into genetic variants, illness connections, and drug development by examining big genomic databases for recurrent patterns. Frequent pattern mining algorithms help uncover important DNA sequences and patterns used to diagnose diseases, practice personalized medicine, and create innovative therapeutic strategies.

What is Association?

Association is a technique used in data mining to identify the relationships or co-occurrences between items in a dataset. It involves analyzing large datasets to discover patterns or associations between items, such as products purchased together in a supermarket or web pages frequently visited together on a website. Association analysis is based on the idea of finding the most frequent patterns or itemsets in a dataset, where an itemset is a collection of one or more items.

Association analysis can provide valuable insights into consumer behaviour and
preferences. It can help retailers identify the items that are frequently purchased together,
which can be used to optimize product placement and promotions. Similarly, it can help e-
commerce websites recommend related products to customers based on their purchase
history.

Types of Associations

Here are the most common types of associations used in data mining:

Itemset Associations: Itemset association is the most common type of association analysis, used to discover relationships between items in a dataset. In this type of association, a collection of one or more items that frequently co-occur together is called an itemset. For example, in a supermarket dataset, itemset association can be used to identify items that are frequently purchased together, such as bread and butter.

Sequential Associations: Sequential association is used to identify patterns that occur in a specific sequence or order. This type of association analysis is commonly used in applications such as analyzing customer behaviour on e-commerce websites or studying weblogs. For example, in a weblog dataset, sequential association can be used to identify the sequence of pages that users visit before making a purchase.

Graph-based Associations: Graph-based association is a type of association analysis that involves representing the relationships between items in a dataset as a graph. In this type of association, each item is represented as a node in the graph, and the edges between nodes represent the co-occurrence or relationship between items. Graph-based association is used in various applications, such as social network analysis, recommendation systems, and fraud detection. For example, in a social network dataset, it can be used to identify groups of users with similar interests or behaviours.

Association Rule Mining

Here are the most commonly used algorithms to implement association rule mining in data
mining:

Apriori Algorithm – Apriori is one of the most widely used algorithms for association rule
mining. It generates frequent item sets from a given dataset by pruning infrequent item sets
iteratively. The Apriori algorithm is based on the concept that if an item set is frequent, then
all of its subsets must also be frequent. The algorithm first identifies the frequent items in
the dataset, then generates candidate itemsets of length two from the frequent items, and
so on until no more frequent itemsets can be generated. The Apriori algorithm is
computationally expensive, especially for large datasets with many items.
FP-Growth Algorithm – FP-Growth is another popular algorithm for association rule mining
that is based on the concept of frequent pattern growth. It is faster than the Apriori
algorithm, especially for large datasets. The FP-Growth algorithm builds a compact
representation of the dataset called a frequent pattern tree (FP-tree), which is used to mine
frequent item sets. The algorithm scans the dataset only twice, first to build the FP-tree and
then to mine the frequent itemsets. The FP-Growth algorithm can handle datasets with
both discrete and continuous attributes.

Eclat Algorithm – Eclat (Equivalence Class Clustering and Bottom-up Lattice Traversal) is a
frequent itemset mining algorithm based on the vertical data format. The algorithm first
converts the dataset into a vertical data format, where each item and the transaction ID in
which it appears are stored. Eclat then performs a depth-first search on a tree-like
structure, representing the dataset’s frequent itemsets. The algorithm is efficient regarding
both memory usage and runtime, especially for sparse datasets.

Correlation Analysis in Data Mining

Correlation Analysis is a data mining technique used to identify the degree to which two or
more variables are related or associated with each other. Correlation refers to the
statistical relationship between two or more variables, where the variation in one variable
is associated with the variation in another variable. In other words, it measures how
changes in one variable are related to changes in another variable. Correlation can be
positive, negative, or zero, depending on the direction and strength of the relationship
between the variables.

For example, suppose we are studying the relationship between the hours of study and the grades obtained by students. If we find that as the number of hours of study increases, the grades obtained also increase, then there is a positive correlation between the two variables. On the other hand, if we find that as the number of hours of study increases, the grades obtained decrease, then there is a negative correlation between the two variables. If there is no relationship between the two variables, we would say that there is zero correlation.
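As a minimal Python sketch of this example, the snippet below computes the Pearson correlation between hypothetical study hours and grades:

# Minimal correlation sketch: Pearson correlation between two hypothetical variables.
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6])
grades = np.array([52, 58, 63, 70, 74, 81])

r = np.corrcoef(hours, grades)[0, 1]   # value between -1 and +1
print(round(r, 3))   # close to +1 here, i.e. a strong positive correlation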

Why is Correlation Analysis Important?

Correlation analysis is important because it allows us to measure the strength and direction of the relationship between two or more variables. This information can help identify patterns and trends in the data, make predictions, and select relevant variables for analysis. By understanding the relationships between different variables, we can gain valuable insights into complex systems and make informed decisions based on data-driven analysis.
