UNIT-III
Data mining:
Data mining is the process of sorting through large data sets to identify patterns and
relationships that can help solve business problems through data analysis. Data mining
techniques and tools enable enterprises to predict future trends and make more-informed
business decisions.
Data mining is a key part of data analytics overall and one of the core disciplines in data
science, which uses advanced analytics techniques to find useful information in data sets.
At a more granular level, data mining is a step in the knowledge discovery in databases
(KDD) process, a data science methodology for gathering, processing and analyzing data.
Data mining and KDD are sometimes referred to interchangeably, but they’re more
commonly seen as distinct things.
Effective data mining aids in various aspects of planning business strategies and managing
operations. That includes customer-facing functions such as marketing, advertising, sales
and customer support, plus manufacturing, supply chain management, finance and HR.
Data mining supports fraud detection, risk management, cybersecurity planning and many
other critical business use cases. It also plays an important role in healthcare, government,
scientific research, mathematics, sports and more.
Although the two terms KDD and Data Mining are heavily used interchangeably, they refer
to two related yet slightly different concepts.
KDD is the overall process of extracting knowledge from data, while Data Mining is a step
inside the KDD process, which deals with identifying patterns in data.
Data mining itself is only the application of a specific algorithm in service of the overall goal of the KDD process.
KDD is an iterative process where evaluation measures can be enhanced, mining can be
refined, and new data can be integrated and transformed to get different and more
appropriate results.
Data Mining vs. KDD:
Basic definition:
DM – Data mining is the process of identifying patterns and extracting details about big data sets using intelligent methods.
KDD – The KDD method is a complex and iterative approach to knowledge extraction from big data.
Scope:
DM – In the KDD method, data mining is one phase (often listed as the fourth step).
KDD – KDD is a broad method that includes data mining as one of its steps.
Techniques used:
DM – Classification, clustering, decision trees, dimensionality reduction, neural networks, regression.
KDD – Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, knowledge presentation.
The data mining process
1. Data cleaning and preprocessing
Data cleaning and preprocessing is an essential step of the data mining process as it
makes the data ready for analysis. Data cleaning includes deleting any unnecessary
features or attributes, identifying and correcting outliers, filling in missing values, and
converting categorical variables to numerical ones. This involves removing or correcting
erroneous, incomplete, or inconsistent data, as well as formatting the data into a usable
format for analysis. Preprocessing also includes normalizing the data, reducing its
dimensionality, and performing feature selection to identify important features.
Many companies include these steps as part of their broader data governance initiatives.
After cleaning and preprocessing is complete, the data is ready for exploration and
visualization.
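As an illustration of these cleaning and preprocessing steps, the following is a minimal sketch using pandas and scikit-learn. The columns and values are made up, and the specific choices (mean imputation, capping implausible ages, min-max scaling) are only examples of what a real project might do.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical customer data with a missing value, an implausible age, and a
# categorical column.
df = pd.DataFrame({
    "age":    [25, 32, None, 41, 29, 120],
    "income": [40000, 52000, 61000, None, 45000, 47000],
    "gender": ["M", "F", "F", "M", None, "F"],
})

# Treat implausible ages as missing so they do not distort the statistics.
df.loc[df["age"] > 100, "age"] = None

# Fill missing numeric values with the column mean, categorical with the mode.
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].mean())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

# Convert the categorical variable into numeric dummy columns.
df = pd.get_dummies(df, columns=["gender"])

# Normalize the numeric columns to a common 0-1 range.
scaler = MinMaxScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])

print(df)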
2. Data modeling and evaluation
Data modeling and evaluation is the process of training machine learning models with the
data and then evaluating their performance. This involves selecting an appropriate
algorithm for the task, tuning its hyperparameters to optimize its performance, and using
measures such as accuracy or precision to evaluate its results. After a model is trained and
evaluated, it can be deployed for real-world applications. In addition, data mining can also
be used to detect anomalies or outliers in the data. This is especially useful for fraud
detection and cybersecurity applications. After identifying any anomalies or outliers,
analysts can then investigate further to gain more insight into the problem.
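A minimal sketch of this modeling-and-evaluation step is shown below, using scikit-learn and its built-in Iris dataset so that it is self-contained; the choice of a random forest and of accuracy/precision as metrics is purely illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score

# Load a small sample dataset and hold out a test set for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train a model; n_estimators is a hyperparameter one would normally tune.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out data with accuracy and macro-averaged precision.
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="macro"))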
3. Data exploration and visualization
Data exploration and visualization is the process of exploring, analyzing, and visualizing
data to gain insights and identify patterns. This involves summarizing the data using
descriptive statistics, such as measuring its central tendency, dispersion, and correlation
between features; plotting distributions of data points; and performing clustering or
classification algorithms to group similar data points together. Through these methods,
data professionals, including data analysts, data scientists, and analytics and data
engineers, can gain insight into the underlying structure of the data and identify
relationships between features.
Data visualization tools, such as heatmaps, histograms, bar charts, and scatter plots, can
also be used to easily communicate and see how different datasets relate, correlate, and
diverge. Additionally, dimensionality reduction techniques such as principal component
analysis (PCA) can help reduce the complexity of datasets by representing them in fewer
dimensions. After exploring and visualizing the data, analysts can decide which machine
learning algorithms would be most suitable for their project.
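The sketch below walks through exploration steps of this kind (descriptive statistics, correlations, and a PCA projection to two dimensions for plotting), again using the built-in Iris data purely as a stand-in for a real dataset.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris(as_frame=True)
df = iris.frame

# Descriptive statistics and pairwise correlations between the columns.
print(df.describe())
print(df.corr())

# Reduce the four feature columns to two principal components for plotting.
pca = PCA(n_components=2)
components = pca.fit_transform(df[iris.feature_names])

plt.scatter(components[:, 0], components[:, 1], c=df["target"])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Iris data projected onto two principal components")
plt.show()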
4. Model deployment and maintenance
In the final stage of data mining, the trained models are deployed in a production
environment. This requires configuring the model for real-time execution and setting up any
necessary monitoring mechanisms to ensure its performance. Additionally, any changes
made to the model or dataset may require re-training the model and redeploying it to
production. Finally, maintenance is also necessary to ensure the performance of the model
and keep it up-to-date with any changes to the data or environment. By keeping track of
these factors, businesses can ensure that their data mining models remain accurate and
can give reliable results in production.
A data mining task can be specified in the form of a data mining query, which is input to the
data mining system. A data mining query is defined in terms of data mining task primitives.
These primitives allow the user to interactively communicate with the data mining system
during discovery to direct the mining process or examine the findings from different angles
or depths. The data mining primitives specify the following: the set of task-relevant data to be mined, the kind of knowledge to be mined, the background knowledge to be used in the discovery process, the interestingness measures and thresholds for pattern evaluation, and the expected representation for visualizing the discovered patterns.
A data mining query language can be designed to incorporate these primitives, allowing
users to interact with data mining systems flexibly. Having a data mining query language
provides a foundation on which user-friendly graphical interfaces can be built.
A data mining query is defined in terms of the following primitives, such as:
Task-relevant data: This specifies the portions of the database or the set of data in which the user is interested.
This includes the database attributes or data warehouse dimensions of interest (the
relevant attributes or dimensions).
In a relational database, the set of task-relevant data can be collected via a relational query
involving operations like selection, projection, join, and aggregation.
The data collection process results in a new data relation called the initial data relation.
The initial data relation can be ordered or grouped according to the conditions specified in
the query. This data retrieval can be thought of as a subtask of the data mining task.
This initial relation may or may not correspond to a physical relation in the database. Since
virtual relations are called Views in the field of databases, the set of task-relevant data for
data mining is called a minable view.
Background knowledge: This knowledge about the domain to be mined is useful for guiding the knowledge
discovery process and evaluating the patterns found. Concept hierarchies are a popular
form of background knowledge, which allows data to be mined at multiple levels of
abstraction.
Rolling up (generalization of data) allows data to be viewed at more meaningful and explicit levels of abstraction and makes it easier to understand. It also compresses the data, so fewer input/output operations are required.
For example, a concept hierarchy for the attribute (or dimension) age might generalize raw age values into ranges such as youth, middle-aged, and senior.
User beliefs regarding relationships in the data are another form of background knowledge.
Interestingness measures: Different kinds of knowledge may have different interestingness measures. These may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
Novelty: Novel patterns are those that contribute new information or increased performance to the given pattern set, for example, a data exception. Another strategy for detecting novelty is to remove redundant patterns.
Presentation and visualization of discovered patterns: This refers to the form in which discovered patterns are to be displayed, which may include
rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual
representations.
Users must be able to specify the forms of presentation to be used for displaying the
discovered patterns. Some representation forms may be better suited than others for
particular kinds of knowledge.
For example, generalized relations and their corresponding cross tabs or pie/bar charts are
good for presenting characteristic descriptions, whereas decision trees are common for
classification.
Let’s look at some of the fundamental data mining techniques commonly used across
industry verticals.
1. Association rule
The association rule refers to the if-then statements that establish correlations and
relationships between two or more data items. The correlations are evaluated using
support and confidence metrics, wherein support measures the frequency of occurrence of the data items within the dataset, while confidence relates to the accuracy of the if-then statements.
For example, while tracking a customer’s behavior when purchasing online items, an
observation is made that the customer generally buys cookies when purchasing a coffee
pack. In such a case, the association rule establishes the relation between two items of
cookies and coffee packs, thereby forecasting future buys whenever the customer adds the
coffee pack to the shopping cart.
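The support and confidence for a rule such as coffee → cookies can be computed directly from a list of transactions. A small sketch with made-up shopping baskets:

# Hypothetical shopping baskets used only to illustrate the two metrics.
transactions = [
    {"coffee", "cookies", "milk"},
    {"coffee", "cookies"},
    {"coffee", "bread"},
    {"tea", "cookies"},
    {"coffee", "cookies", "sugar"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"coffee", "cookies"} <= t)
coffee = sum(1 for t in transactions if "coffee" in t)

support = both / n          # how often coffee and cookies appear together
confidence = both / coffee  # how often cookies appear when coffee is bought

print(f"support(coffee -> cookies)    = {support:.2f}")     # 3/5 = 0.60
print(f"confidence(coffee -> cookies) = {confidence:.2f}")  # 3/4 = 0.75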
2. Classification
The classification data mining technique classifies data items within a dataset into
different categories. For example, we can classify vehicles into different categories, such
as sedan, hatchback, petrol, diesel, electric vehicle, etc., based on attributes such as the
vehicle’s shape, wheel type, or even number of seats. When a new vehicle arrives, we can
categorize it into various classes depending on the identified vehicle attributes. One can
apply the same classification strategy to classify customers based on their age, address,
purchase history, and social group.
Commonly used classification algorithms include decision trees, Naive Bayes classifiers, logistic regression, and so on.
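As a brief sketch of classification with one of these algorithms (a decision tree), trained on a tiny, made-up vehicle table; the attributes and labels are purely illustrative.

from sklearn.tree import DecisionTreeClassifier

# Each row: [number of seats, weight in kg]; labels are the vehicle category.
X = [[2, 900], [4, 1200], [5, 1400], [7, 2000], [2, 800], [5, 1500]]
y = ["hatchback", "sedan", "sedan", "suv", "hatchback", "suv"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Categorize a newly arrived vehicle based on its attributes.
print(clf.predict([[4, 1300]]))  # e.g. ['sedan']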
3. Clustering
Clustering data mining techniques group data elements into clusters that share common
characteristics. We can cluster data pieces into categories by simply identifying one or
more attributes. Some of the well-known clustering techniques are k-means clustering,
hierarchical clustering, and Gaussian mixture models.
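A minimal k-means sketch on synthetic two-dimensional points, assuming scikit-learn is available:

import numpy as np
from sklearn.cluster import KMeans

# Two loose groups of synthetic 2-D points.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(20, 2)),
])

# Group the points into two clusters that share common characteristics.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels :", kmeans.labels_)
print("cluster centers:", kmeans.cluster_centers_)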
4. Regression
The regression data mining technique models the relationship between a dependent (target) variable and one or more independent variables, typically to predict continuous numeric values. Linear regression, multivariate regression, and decision trees are key examples of this type.
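A brief linear regression sketch on made-up numbers (advertising spend versus sales), assuming scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical advertising spend (X) and the resulting sales (y).
X = np.array([[10], [20], [30], [40], [50]])
y = np.array([25, 45, 62, 85, 105])

reg = LinearRegression().fit(X, y)
print("slope    :", reg.coef_[0])
print("intercept:", reg.intercept_)
print("predicted sales for a spend of 60:", reg.predict([[60]])[0])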
5. Sequential patterns
One can also mine sequential data to determine patterns, wherein specific events or data
values lead to other events in the future. This technique is applied for long-term data as
sequential analysis is key to identifying trends or regular occurrences of certain events. For
example, when a customer buys a grocery item, you can use a sequential pattern to
suggest or add another item to the basket based on the customer’s purchase pattern.
6. Neural networks
Neural networks refer to algorithms that mimic the human brain and try to replicate its activity
to accomplish a desired goal or task. These are used for several pattern recognition
applications that typically involve deep learning techniques. Neural networks are a
consequence of advanced machine learning research.
7. Prediction
The prediction data mining technique is typically used for predicting the occurrence of an
event, such as the failure of machinery or a fault in an industrial component, a fraudulent
event, or company profits crossing a certain threshold. Prediction techniques can help
analyze trends, establish correlations, and do pattern matching when combined with other
mining methods. Using such a mining technique, data miners can analyze past instances
to forecast future events.
KNOWLEDGE REPRESENTATION
Histograms
A histogram consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data.
Example: a histogram of monthly electricity bills generated over 4 months.
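A quick matplotlib sketch of such a histogram; the monthly bill amounts below are invented for illustration.

import matplotlib.pyplot as plt

# Hypothetical monthly electricity bill amounts (in currency units).
bills = [320, 410, 385, 450, 300, 510, 475, 390, 365, 430, 480, 405]

# Group the values into 4 equal-width bins and show the frequency of each class.
plt.hist(bills, bins=4, edgecolor="black")
plt.xlabel("Bill amount")
plt.ylabel("Number of months")
plt.title("Histogram of monthly electricity bills")
plt.show()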
Data Visualization
Patterns in the data can be spotted easily by using data visualization techniques.
In pixel-based visualization techniques, there is a separate sub-window for each attribute, and each attribute value is represented by one colored pixel.
This maximizes the amount of information represented at one time without any overlap.
A tuple with m variables is represented by m colored pixels, one per variable, and each variable has its own sub-window.
The color mapping of the pixels is decided on the basis of the data characteristics and the visualization task.
Scatter-plot matrices and parallel coordinates
A scatter-plot matrix shows pairwise scatter plots of the attributes arranged in a grid, while in a parallel coordinates plot, equally spaced parallel vertical lines define the axes, one for each attribute.
Chernoff faces
It includes the mapping of different data dimensions with different facial features.
For example, the face width, the length of the mouth, and the length of the nose can each encode a different data dimension.
Dimensional stacking
Dimensional stacking partitions the visualization space recursively, with the most important attributes used on the outer levels.
Rectangles are used to represent the counts of categorical data, and at every stage the rectangles are split in parallel.
The two most important attributes are typically assigned to the outermost level of the splitting.
Tree maps
Tree map visualization techniques are well suited for displaying large amounts of hierarchically structured data.
The visualization space is divided into multiple rectangles that are ordered according to a quantitative variable.
The levels in the hierarchy are shown as rectangles containing other rectangles.
Each set of rectangles on the same level of the hierarchy represents a category, a column, or an expression in a data set.
Tag clouds
A tag cloud is a visualization method that helps in understanding the information carried by user-generated tags.
Tags can be arranged alphabetically or according to user preferences, with different font sizes and colors.
DATA MINING QUERY LANGUAGE (DMQL)
Data mining is a process in which useful data are extracted and processed from a heap of unprocessed raw data. By aggregating these datasets into a summarized format, many problems arising in finance, marketing, and many other fields can be solved. In the modern world with enormous data, data mining is one of the growing fields of technology that acts as an application in many industries we depend on in our lives. Many developments and much research have taken place in this field, and many systems have also been proposed. Since there are numerous processes and functions to be performed in data mining, a well-developed user interface is needed. Even though there are many well-developed user interfaces for relational systems, Han, Fu, Wang, et al. proposed the Data Mining Query Language (DMQL) to further build more advanced systems and encourage research in this field. DMQL cannot be considered a standard language; it is a derived language that stands as a general query language for performing data mining techniques. DMQL is executed in the DBMiner system for collecting data from several layers of databases.
Data mining request: For a given data mining task, the corresponding datasets must be defined in the form of a data mining request. Let us see this with an example. As the user can request any specific part of a dataset in the database, the data miner can use a database query to retrieve the suitable datasets before the data mining process begins. If the aggregation of that specific data is not possible, the data miner collects supersets from which the required data can be derived. This shows the need for a query language in data mining, where data retrieval acts as a subtask. Since the extraction of relevant data from huge datasets cannot be performed manually, many development methods are present in data mining, but sometimes the task of collecting the relevant data requested by the user may still fail. Using DMQL, a command can be issued to retrieve specific datasets or data from the database, which gives the desired result to the user and provides a better experience in fulfilling the user's expectations.
Generalization: When the data in the datasets of a data warehouse is not generalized, the data is often in the form of unprocessed primitive integrity constraints and loosely associated multi-valued datasets and their dependencies. Using the generalization concept in a query language can help in processing the raw data into a precise abstraction. It also supports multi-level collection of data with quality aggregation. When larger databases come into the scene, generalization plays a major role in giving desirable results at a conceptual level of data collection.
Flexibility and interaction: To avoid collecting less desirable or unwanted data from databases, suitable exposure values or thresholds must be specified so that data mining is flexible and the interaction remains engaging for the user. Such threshold values can be provided with data mining queries.
DMQL adopts a syntax similar to that of the relational query language SQL. It is designed with the help of Backus-Naur Form (BNF) notation/grammar. In this notation, "[ ]" represents zero or one occurrence, and "{ }" represents zero or more occurrences.
Syntax:
Use database (database_name)
{Use hierarchy (hierarchy_name) for (attribute)}
(rule_specified)
Related to (attribute_or_aggregate_list)
From (relation(s)) [where (condition)]
[order by (order_list)]
{with [(kinds_of)] threshold = (threshold_value) [for (attribute(s))]}
In the above data mining query, the first line retrieves the required database (database_name). The second line uses the chosen hierarchy (hierarchy_name) with the given attribute. (rule_specified) denotes the type of rules to be specified. To find the specified rules, the related set is determined based on the attribute or aggregation list, which helps in generalization. The from and where clauses make sure the given condition is satisfied. The results are then ordered using "order by", and a designated threshold value can be set with respect to the attributes.
Syntax for specifying the kind of rules to be mined (the rule_specified part):
Generalization:
Generalize data [into (relation_name)]
Association:
Find association rules [as (rule_name)]
Classification:
Find classification rules [as (rule_name)] according to (attribute(s))
Characterization:
Find characteristic rules [as (rule_name)]
Discrimination:
Find discriminant rules [as (rule_name)]
For (class_1) with (condition_1) from (relation(s)_1)
In contrast to (class_2) with (condition_2) from (relation(s)_2)
{In contrast to (class_i) with (condition_i) from (relation(s)_i)}
With the exponential growth of data, data mining systems should be efficient and high-performing enough to build complex machine learning models, and it is expected that a good variety of data mining systems will be designed and developed.
Comprehensive information processing and data analysis infrastructures will be continuously and systematically built around databases and data warehouses.
A critical question in design is whether we should integrate data mining systems with
database systems.
A data mining system can be integrated with a database or data warehouse system using one of the following coupling schemes:
No Coupling
Loose Coupling
Semi-Tight Coupling
Tight Coupling
No Coupling
No coupling means that a DM system will not utilize any function of a DB or DW system.
It may fetch data from a particular source (such as a file system), process data using some
data mining algorithms, and then store the mining results in another file.
Drawbacks:
First, a Database/Data Warehouse system provides a great deal of flexibility and efficiency at storing, organizing, accessing, and processing data; without one, a Data Mining system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.
Second, there are many tested, scalable algorithms and data structures implemented in Database and Data Warehouse systems that a no-coupling design cannot take advantage of.
Loose Coupling
Loose coupling means that a Data Mining system will use some facilities of a Database or
Data warehouse system, fetching data from a data repository managed by these systems,
performing data mining, and then storing the mining results either in a file or in a
designated place in a Database or Data Warehouse.
Loose coupling is better than no coupling because it can fetch any portion of data stored in
Databases or Data Warehouses by using query processing, indexing, and other system
facilities.
Drawbacks
It’s difficult for loose coupling to achieve high scalability and good performance with large
data sets.
Semi-Tight Coupling
Semi-tight coupling means that besides linking a Data Mining system to a
Database/Data Warehouse system, efficient implementations of a few essential data
mining primitives (identified by the analysis of frequently encountered data mining
functions) can be provided in the Database/Data Warehouse system.
These primitives can include sorting, indexing, aggregation, histogram analysis, multi-way
join, and pre-computation of some essential statistical measures, such as sum, count,
max, min, standard deviation.
Tight Coupling
Tight coupling means that the data mining subsystem is treated as one functional component of the information system.
Data mining queries and functions are optimized based on mining query analysis, data
structures, indexing schemes, and query processing methods of a Database or Data
Warehouse system.
Data preprocessing is an important process of data mining. In this process, raw data is
converted into an understandable format and made ready for further analysis. The motive is to improve data quality and make it suitable for the specific mining task.
Data cleaning
Data cleaning helps us remove inaccurate, incomplete, and incorrect data from the dataset. Some techniques used to handle missing values are listed below (a small imputation sketch follows the list):
Standard values can be used to fill in the missing values manually, but only for a small dataset.
The attribute's mean or median value can be used to replace missing values in normally and non-normally distributed data, respectively.
Tuples can be ignored if the dataset is quite large and many values are missing within a tuple.
The most appropriate value can be estimated using regression or decision tree algorithms.
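A small pandas sketch of two of the strategies listed above (dropping sparse tuples, and mean/median replacement); the table values are made up.

import pandas as pd

df = pd.DataFrame({
    "age":    [22, 35, None, 41, None],
    "salary": [30000, None, 42000, 55000, None],
    "city":   ["Pune", "Delhi", "Mumbai", None, None],
})

# Option 1: ignore tuples in which most values are missing
# (keep only rows with at least 2 non-missing values).
dropped = df.dropna(thresh=2)

# Option 2: replace missing values with the attribute mean or median
# (mean for roughly normal data, median for skewed data).
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["salary"] = filled["salary"].fillna(filled["salary"].median())

print(dropped)
print(filled)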
Noisy Data
Noisy data are data that cannot be interpreted by a machine and that contain unnecessary, faulty values. Some ways to handle them are:
Binning – This method handles noisy data to make it smooth. The data is divided equally and stored in the form of bins, and then smoothing methods are applied within each bin (a small sketch follows this list). The methods are smoothing by bin means (bin values are replaced by the bin's mean value), smoothing by bin medians (bin values are replaced by the bin's median value), and smoothing by bin boundaries (the minimum and maximum bin values are taken, and each value is replaced by the closest boundary value).
Regression – Regression functions are used to smooth the data. Regression can be linear (one independent variable) or multiple (several independent variables).
Clustering – It is used for grouping similar data into clusters and for finding outliers.
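A sketch of smoothing by bin means and bin boundaries on a small sorted list of noisy values, split into three equal-frequency bins; the numbers are illustrative.

import numpy as np

# Sorted noisy values, split into 3 equal-frequency bins of 3 values each.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 28])
bins = values.reshape(3, 3)

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed_by_mean = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: each value is replaced by the closer of the
# bin's minimum or maximum value.
low = bins.min(axis=1, keepdims=True)
high = bins.max(axis=1, keepdims=True)
smoothed_by_boundary = np.where(bins - low <= high - bins, low, high)

print(smoothed_by_mean)
print(smoothed_by_boundary.ravel())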
Data integration
This is the process of combining data from multiple sources (databases, spreadsheets, text files) into a single dataset. A single, consistent view of the data is created in this process. Major problems during data integration are schema integration (integrating sets of data collected from various sources), entity identification (identifying the same entities across different databases), and detecting and resolving data value conflicts.
Data transformation
In this step, the format or structure of the data is changed in order to make it suitable for the mining process. Methods for data transformation are:
Discretization – It helps reduce the data size by dividing continuous data into intervals.
Attribute selection – To help the mining process, new attributes are derived from the given attributes.
Concept hierarchy generation – In this, attributes are generalized from a lower level to a higher level in the hierarchy.
Aggregation – In this, a summary of the data is stored, which depends upon the quality and quantity of the data, to make the result more optimal.
Data reduction
Data reduction helps in increasing storage efficiency and reducing the volume of stored data so that analysis becomes easier while producing almost the same results. Analysis is harder when working with huge amounts of data, so reduction is used to alleviate that.
Numerosity Reduction
The volume of data is reduced, i.e., only a model of the data is stored instead of the whole data, which provides a smaller representation of the data without significant loss of information.
Dimensionality reduction
In this, the number of attributes or random variables is reduced so as to lower the dimensionality of the data set. Attributes are combined without losing their essential characteristics.
DATA CLEANING
Data cleaning, also known as data cleansing, is the process of identifying and correcting or
removing inaccurate, incomplete, irrelevant, or inconsistent data in a dataset. Data
cleaning is a critical step in data mining as it ensures that the data is accurate, complete,
and consistent, improving the quality of analysis and insights obtained from the data. Data
cleaning may involve tasks such as removing duplicates, filling in missing values, handling
outliers, correcting spelling errors, resolving inconsistencies in the data, etc. Data cleaning
helps to minimize the impact of data errors on the results of data mining analysis.
Iterative process – Data cleaning in data mining is an iterative process that involves
multiple iterations of identifying, assessing, and addressing data quality issues. It is often
an ongoing activity throughout the data mining process, as new insights and patterns may
prompt the need for further data cleaning.
Domain expertise – Data cleaning in data mining often requires domain expertise, as
understanding the context and characteristics of the data is crucial for effective cleaning.
Domain experts possess the necessary knowledge about the data and can make informed
decisions about handling missing values, outliers, or inconsistencies based on their
understanding of the subject matter.
Impact on analysis – Data cleaning in data mining directly impacts the quality and reliability
of the analysis and results obtained from data mining. Neglecting data cleaning can lead to
biased or inaccurate outcomes, misleading patterns, and unreliable insights. By
performing thorough data cleaning, analysts can ensure that the data used for analysis is
accurate, consistent, and representative of the real-world scenario.
The steps involved in the process of data cleaning in data mining can vary depending on the
specific dataset and the requirements of the analysis, but some common steps are –
Data profiling – Data profiling involves examining the dataset to gain an understanding of its
structure, contents, and quality. It helps identify data types, distributions, missing values,
outliers, and potential issues that need to be addressed during the cleaning process.
Handling missing data – Missing data refers to instances where values are not recorded or
are incomplete. Data cleaning involves deciding how to handle missing data, including
imputing missing values using statistical methods, removing instances with missing
values, or using specialized techniques based on domain knowledge.
Handling duplicates – Duplicate records occur when the dataset has identical or very
similar instances. Data cleaning involves identifying and removing duplicate records to
ensure data integrity and prevent bias in the analysis.
Handling outliers – Outliers are data points that are significantly different from other data
points in the dataset. Therefore, identifying and handling outliers can be important for
maintaining the integrity of the analysis. Outliers can be detected using statistical methods
such as Z-score or box plot analysis and removed or adjusted as necessary.
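A brief sketch of Z-score-based outlier detection, flagging values whose absolute Z-score exceeds a chosen threshold (2 here; 3 is also common); the measurements are made up.

import numpy as np

values = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 10.3, 55.0, 10.0])

# Z-score: how many standard deviations each value lies from the mean.
z = (values - values.mean()) / values.std()

# Flag values whose absolute Z-score exceeds the threshold.
threshold = 2.0
outliers = values[np.abs(z) > threshold]
print("outliers:", outliers)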
Resolving inconsistencies – Inconsistencies can arise from data entry errors, variations in
naming conventions, or conflicting information. Data cleaning involves identifying and
resolving such inconsistencies by cross-validating data from different sources, performing
data validation checks, and leveraging domain expertise to determine the most accurate
values or resolve conflicts.
Quality assurance – Quality assurance is the final step in data cleaning. It involves
performing checks to ensure the accuracy, completeness, and reliability of the cleaned
dataset. This includes validating the cleaned data against predefined criteria, verifying the
effectiveness of data cleaning techniques, and conducting quality control measures to
ensure the data is suitable for analysis.
Many data cleaning tools are available for data mining, and the choice of tool depends on
the type of data being cleaned and the user’s specific requirements. Some popular data
cleaning tools used in data mining include –
OpenRefine – OpenRefine is a free and open-source data cleaning tool that can be used for
data exploration, cleaning, and transformation. It supports various data formats, including
CSV, Excel, and JSON.
Trifacta Wrangler – Trifacta is a data cleaning tool that uses machine learning algorithms to
identify and clean data errors, inconsistencies, and missing values. It is designed for large-
scale data cleaning and can handle various data formats.
Talend – Talend is an open-source data integration and cleaning tool that can be used for
data profiling, cleaning, and transformation. It supports various data formats and can be
integrated with other tools and platforms.
TIBCO Clarity – TIBCO Clarity is a data quality management tool that provides a unified
view of an organization’s data assets. It includes features such as data profiling, data
cleaning, and data matching to ensure data quality across the organization.
Cloudingo – Cloudingo is a data cleansing tool specifically designed for Salesforce data. It
includes features like duplicate detection and merging, data standardization, and data
enrichment to ensure high-quality data within Salesforce.
IBM InfoSphere QualityStage – IBM InfoSphere QualityStage is a data quality management tool that includes features such as data profiling, data cleansing, and data matching. It
also includes advanced features such as survivorship and data lineage to ensure data
quality and governance across the organization.
DATA TRANSFORMATION
Data transformation in data mining refers to the process of converting raw data into a
format that is suitable for analysis and modeling. The goal of data transformation is to
prepare the data for data mining so that it can be used to extract useful insights and
knowledge. Data transformation typically involves several steps, including:
Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the
data.
Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
Data normalization: Scaling the data to a common range of values, such as between 0 and
1, to facilitate comparison and analysis.
Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant
features or attributes.
Data transformation is an important step in the data mining process as it helps to ensure
that the data is in a format that is suitable for analysis and modeling, and that it is free of
errors and inconsistencies. Data transformation can also help to improve the performance
of data mining algorithms, by reducing the dimensionality of the data, and by scaling the
data to a common range of values.
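A short sketch of the min-max normalization to the 0-1 range mentioned above; the income values are arbitrary.

import numpy as np

incomes = np.array([20000, 35000, 50000, 80000, 120000], dtype=float)

# Min-max normalization: v' = (v - min) / (max - min), mapping values to [0, 1].
normalized = (incomes - incomes.min()) / (incomes.max() - incomes.min())
print(normalized)  # 20000 -> 0.0, 120000 -> 1.0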
FEATURE SELECTION
Feature selection is critical to building a good model for several reasons. One is that
feature selection implies some degree of cardinality reduction, to impose a cutoff on the
number of attributes that can be considered when building a model. Data almost always
contains more information than is needed to build the model, or the wrong kind of
information. For example, you might have a dataset with 500 columns that describe the
characteristics of customers; however, if the data in some of the columns is very sparse
you would gain very little benefit from adding them to the model, and if some of the
columns duplicate each other, using both columns could affect the model.
Not only does feature selection improve the quality of the model, it also makes the process
of modeling more efficient. If you use unneeded columns while building a model, more
CPU and memory are required during the training process, and more storage space is
required for the completed model. Even if resources were not an issue, you would still want
to perform feature selection and identify the best columns, because unneeded columns
can degrade the quality of the model in several ways: noisy or redundant columns make it more difficult to discover meaningful patterns, and if the data set is high-dimensional, most data mining algorithms require a much larger training data set.
During the process of feature selection, either the analyst or the modeling tool or algorithm
actively selects or discards attributes based on their usefulness for analysis. The analyst
might perform feature engineering to add features, and remove or modify existing data,
while the machine learning algorithm typically scores columns and validates their
usefulness in the model.
In short, feature selection helps solve two problems: having too much data that is of little
value, or having too little data that is of high value. Your goal in feature selection should be
to identify the minimum number of columns from the data source that are significant in
building a model.
Dimensionality reduction
Dimensionality reduction reduces the number of attributes or random variables under consideration, for example through principal component analysis or feature selection, so that the dimensionality of the data set is lowered while most of its information is preserved.
Data discretization
Data discretization refers to a method of converting a huge number of data values into
smaller ones so that the evaluation and management of data become easy. In other words,
data discretization is a method of converting attributes values of continuous data into a
finite set of intervals with minimum data loss. There are two forms of data discretization
first is supervised discretization, and the second is unsupervised discretization.
Supervised discretization refers to a method in which the class information is used. Unsupervised discretization refers to a method that depends on the way the operation proceeds, i.e., whether it uses a top-down splitting strategy or a bottom-up merging strategy.
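A sketch of simple unsupervised discretization of a continuous attribute using pandas: equal-width binning with pd.cut and equal-frequency binning with pd.qcut. The age values and interval labels are illustrative.

import pandas as pd

ages = pd.Series([5, 17, 23, 31, 38, 44, 52, 61, 70, 85])

# Equal-width binning: split the age range into 3 intervals of equal width.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle-aged", "senior"])

# Equal-frequency binning: each interval holds roughly the same number of values.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages,
                    "equal_width": equal_width,
                    "equal_freq": equal_freq}))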
In data mining, a concept hierarchy refers to the organization of data into a
tree-like structure, where each level of the hierarchy represents a concept that is more
general than the level below it. This hierarchical organization of data allows for more
efficient and effective data analysis, as well as the ability to drill down to more specific
levels of detail when needed. The concept of hierarchy is used to organize and classify data
in a way that makes it more understandable and easier to analyze. The main idea behind
the concept of hierarchy is that the same data can have different levels of granularity or
levels of detail and that by organizing the data in a hierarchical fashion, it is easier to
understand and perform analysis.
FREQUENT PATTERN MINING
Finding recurrent patterns or item sets in huge datasets is the goal of frequent pattern
mining, a crucial data mining approach. It looks for groups of objects that regularly appear
together in order to expose underlying relationships and interdependence. Market basket
analysis, web usage mining, and bioinformatics are a few areas where this method is
important.
The technique of frequent pattern mining is built upon a number of fundamental ideas. The
analysis is based on transaction databases, which include records or transactions that
represent collections of objects. Items inside these transactions are grouped together as
itemsets.
The Apriori algorithm, a popular method for finding recurrent patterns, takes a methodical approach. It generates candidate itemsets, prunes the infrequent ones, and then progressively grows the size of the itemsets until no more frequent itemsets can be found. The
patterns that fulfill the required support criteria are successfully identified through this
iterative approach.
Apriori Algorithm
One of the most popular methods, the Apriori algorithm, uses a step-by-step procedure to find frequent item sets. It starts by creating candidate itemsets of length 1, determining their support, and eliminating any that fall below the predetermined cutoff. The method then joins the frequent itemsets from the previous phase to produce bigger itemsets repeatedly.
The procedure is repeated until no more frequent item sets can be located. The Apriori approach is commonly used because of its efficiency and simplicity, but because it requires numerous database scans for big datasets, it can be computationally expensive.
FP-growth Algorithm
A different strategy for frequent pattern mining is provided by the FP-growth algorithm. It creates a compact data structure known as the FP-tree that effectively describes the dataset without creating candidate itemsets. The FP-growth algorithm constructs the FP-tree recursively and then directly mines frequent item sets from it.
FP-growth can be much quicker than Apriori by skipping the construction of candidate itemsets, which lowers the number of passes over the dataset. It is very helpful for sparse and huge datasets.
Eclat Algorithm
The Eclat algorithm, whose name stands for Equivalence Class Clustering and bottom-up Lattice Traversal, is a well-liked frequent pattern mining method. It explores the itemset lattice using a depth-first search approach, concentrating on a vertical data format representation.
Transaction identifiers (TIDs) are used effectively by Eclat to compute intersections between item sets. This technique is renowned for its ease of use and small memory requirements, making it appropriate for mining frequent itemsets in vertical databases.
Web Usage Mining
Web usage mining examines user navigation patterns to learn more about how people
use websites. In order to personalize websites and enhance their performance, frequent
pattern mining makes it possible to identify recurrent navigation patterns and session
patterns. Businesses can change content, layout, and navigation to improve user
experience and boost engagement by studying how consumers interact with a website.
Bioinformatics
The identification of relevant DNA patterns in the field of bioinformatics is made possible
by often occurring pattern mining. Researchers can get insights into genetic variants,
illness connections, and drug development by examining big genomic databases for
recurrent patterns. In order to diagnose diseases, practice personalized medicine, and
create innovative therapeutic strategies, frequent pattern mining algorithms help uncover
important DNA sequences and patterns.
What is Association?
Association analysis is the task of discovering relationships, in the form of association rules, between items that frequently occur together in a dataset. It can provide valuable insights into consumer behaviour and
preferences. It can help retailers identify the items that are frequently purchased together,
which can be used to optimize product placement and promotions. Similarly, it can help e-
commerce websites recommend related products to customers based on their purchase
history.
Types of Associations
Here are the most common types of associations used in data mining:
Here are the most commonly used algorithms to implement association rule mining in data
mining:
Apriori Algorithm – Apriori is one of the most widely used algorithms for association rule
mining. It generates frequent item sets from a given dataset by pruning infrequent item sets
iteratively. The Apriori algorithm is based on the concept that if an item set is frequent, then
all of its subsets must also be frequent. The algorithm first identifies the frequent items in
the dataset, then generates candidate itemsets of length two from the frequent items, and
so on until no more frequent itemsets can be generated. The Apriori algorithm is
computationally expensive, especially for large datasets with many items.
FP-Growth Algorithm – FP-Growth is another popular algorithm for association rule mining
that is based on the concept of frequent pattern growth. It is faster than the Apriori
algorithm, especially for large datasets. The FP-Growth algorithm builds a compact
representation of the dataset called a frequent pattern tree (FP-tree), which is used to mine
frequent item sets. The algorithm scans the dataset only twice, first to build the FP-tree and
then to mine the frequent itemsets. The FP-Growth algorithm can handle datasets with
both discrete and continuous attributes.
Eclat Algorithm – Eclat (Equivalence Class Clustering and Bottom-up Lattice Traversal) is a
frequent itemset mining algorithm based on the vertical data format. The algorithm first
converts the dataset into a vertical data format, where each item and the transaction ID in
which it appears are stored. Eclat then performs a depth-first search on a tree-like
structure, representing the dataset’s frequent itemsets. The algorithm is efficient regarding
both memory usage and runtime, especially for sparse datasets.
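Assuming the mlxtend library is installed, the sketch below one-hot encodes a few made-up baskets and mines their frequent itemsets with Apriori; mlxtend's fpgrowth function can be swapped in with the same interface, and its association_rules helper can then derive rules with confidence thresholds.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Made-up shopping baskets.
transactions = [
    ["coffee", "cookies", "milk"],
    ["coffee", "cookies"],
    ["coffee", "bread"],
    ["tea", "cookies"],
    ["coffee", "cookies", "sugar"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Mine all itemsets whose support is at least 0.4 (2 of the 5 baskets).
frequent_itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
print(frequent_itemsets)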
Correlation Analysis is a data mining technique used to identify the degree to which two or
more variables are related or associated with each other. Correlation refers to the
statistical relationship between two or more variables, where the variation in one variable
is associated with the variation in another variable. In other words, it measures how
changes in one variable are related to changes in another variable. Correlation can be
positive, negative, or zero, depending on the direction and strength of the relationship
between the variables.
For example, suppose we are studying the relationship between the hours of study and the grades
obtained by students. If we find that as the number of hours of study increases, the grades
obtained also increase, then there is a positive correlation between the two variables. On
the other hand, if we find that as the number of hours of study increases, the grades
obtained decrease, then there is a negative correlation between the two variables. If there
is no relationship between the two variables, we would say that there is zero correlation.
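A small sketch computing the Pearson correlation coefficient for the study-hours example; the grade values are invented.

import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
grades = np.array([52, 55, 61, 64, 70, 74, 81, 85])

# Pearson correlation coefficient between the two variables (-1 to +1).
r = np.corrcoef(hours, grades)[0, 1]
print(f"correlation between study hours and grades: {r:.2f}")  # close to +1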