Unit 2 (DWDM)


UNIT 2

DATA MINING: INTRODUCTION


Data mining is one of the most useful techniques that help entrepreneurs, researchers, and
individuals extract valuable information from huge sets of data. Data mining is also
called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes
Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern
evaluation, and Knowledge presentation.

What is Data Mining?

The process of extracting information from huge sets of data to identify patterns, trends, and useful data that
allow a business to make data-driven decisions is called Data Mining.

Data Mining is the process of investigating data from various perspectives to uncover hidden patterns and
categorize it into useful information, which is collected and assembled in particular areas such as data
warehouses. It supports efficient analysis and decision making and ultimately helps in cutting costs and
generating revenue.

Data mining is the act of automatically searching large stores of data to find trends
and patterns that go beyond simple analysis procedures. Data mining utilizes complex
mathematical algorithms to segment the data and evaluate the probability of future events. Data
Mining is also called Knowledge Discovery of Data (KDD).

Types of Data Mining

Data mining can be performed on the following types of data:

Relational Database:

A relational database is a collection of multiple data sets formally organized by tables, records,
and columns, from which data can be accessed in various ways without having to reorganize the
database tables. Tables convey and share information, which facilitates data searchability,
reporting, and organization.

Data warehouses:

A Data Warehouse is the technology that collects the data from various sources within the
organization to provide meaningful business insights. The huge amount of data comes from
multiple places such as Marketing and Finance. The extracted data is utilized for analytical
purposes and helps in decision-making for a business organization. The data warehouse is
designed for the analysis of data rather than transaction processing.

Data Repositories:

The Data Repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an IT
structure, for example, a group of databases where an organization has kept various kinds of
information.

Object-Relational Database:

A combination of an object-oriented database model and a relational database model is called an
object-relational model. It supports Classes, Objects, Inheritance, etc.

One of the primary objectives of the Object-relational data model is to close the gap between the
Relational database and the object-oriented model practices frequently utilized in many
programming languages, for example, C++, Java, C#, and so on.

Transactional Database:

A transactional database refers to a database management system (DBMS) that can undo (roll back)
a database transaction if it is not completed appropriately. Although this was a unique
capability long ago, today most relational database systems support transactional
database activities.

Advantages of Data Mining

o The Data Mining technique enables organizations to obtain knowledge-based data.


o Data mining enables organizations to make lucrative modifications in operation and
production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data Mining helps the decision-making process of an organization.
o It can be implemented in new systems as well as existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts of
data in a short time.

Disadvantages of Data Mining

o There is a probability that organizations may sell useful customer data to other
organizations for money. According to reports, American Express has sold the credit card
purchases of its customers to other organizations.
o Much data mining analytics software is difficult to operate and requires advanced training to
work with.
o Different data mining tools operate in distinct ways due to the different algorithms
used in their design. Therefore, selecting the right data mining tool is a very
challenging task.

Data Mining Applications

Data Mining is primarily used by organizations with intense consumer demands, such as retail,
communication, financial, and marketing companies, to determine prices, consumer preferences,
product positioning, and the impact on sales, customer satisfaction, and corporate profits. Data mining
enables a retailer to use point-of-sale records of customer purchases to develop products and
promotions that help the organization attract customers.

These are the following areas where data mining is widely used:

Data Mining in Healthcare:

Data mining in healthcare has excellent potential to improve the health system. It uses data and
analytics for better insights and to identify best practices that will enhance health care services
and reduce costs. Analysts use data mining approaches such as machine learning, multi-dimensional
databases, data visualization, soft computing, and statistics. Data mining can be used to forecast
the number of patients in each category. The procedures ensure that the patients get intensive
care at the right place and at the right time. Data mining also enables healthcare insurers to
recognize fraud and abuse.

Data Mining in Market Basket Analysis:

Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group
of products, then you are more likely to buy another group of products. This technique may
enable the retailer to understand the purchase behavior of a buyer. This data may assist the
retailer in understanding the requirements of the buyer and altering the store's layout accordingly.
It also allows an analytical comparison of results between various stores and between customers in
different demographic groups.
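As a rough illustration of the idea (not tied to any particular retailer's data), the sketch below counts how often pairs of products appear together in a handful of made-up transactions, which is the core of market basket analysis:

```python
from itertools import combinations
from collections import Counter

# Hypothetical point-of-sale transactions (each is a set of items bought together)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"tea", "sugar"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
]

# Count how often each pair of products is bought together
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pairs suggest products that could be placed near each other
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```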

Data mining in Education:

Educational data mining is a newly emerging field concerned with developing techniques that
discover knowledge from data generated in educational environments. Its objectives include
predicting students' future learning behavior, studying the impact of educational
support, and advancing learning science. An organization can use data mining to make precise
decisions and also to predict student results. With these results, the institution can
concentrate on what to teach and how to teach.

Data Mining in Manufacturing Engineering:

Knowledge is the best asset possessed by a manufacturing company. Data mining tools can be
beneficial to find patterns in a complex manufacturing process. Data mining can be used in
system-level designing to obtain the relationships between product architecture, product
portfolio, and the data needs of customers. It can also be used to forecast the product
development period, cost, and expectations, among other tasks.

Data Mining in Fraud detection:

Billions of dollars are lost to fraud. Traditional methods of fraud detection are somewhat
time-consuming and complex. Data mining provides meaningful patterns and helps turn data into
information. An ideal fraud detection system should protect the data of all users. Supervised
methods use a collection of sample records that are classified as fraudulent or non-fraudulent.
A model is constructed using this data, and the technique is then used to identify whether a new
record is fraudulent or not.

Data Mining Financial Banking:

The digitalization of the banking system generates an enormous amount of data with every new
transaction. Data mining techniques can help bankers solve business-related problems in banking
and finance by identifying trends, causalities, and correlations in business information and
market costs that are not immediately evident to managers or executives, because the data volume
is too large or is produced too rapidly for experts to process. Managers can use this information
for better targeting, acquiring, retaining, segmenting, and maintaining profitable customers.

Data Mining Challenges


Introduction

Data today is what keeps businesses up and running. Most business owners manage to get a good
night’s sleep if they can track the data regarding their organization’s performance. Even though
data mining is powerful, it faces numerous difficulties during its use. The difficulties can relate
to the techniques and methods used, the data itself, performance, and so on. The data mining
process becomes fruitful when these difficulties or issues are identified accurately and resolved
appropriately.

Data Mining challenges

These days, data mining and knowledge discovery are developing into critical technologies for
researchers and businesses in numerous domains. Although data mining is maturing into an established
and trusted discipline, the following outstanding data mining challenges still must be tackled.

Some of the Data mining challenges are given as under:

1. Security and Social Challenges


2. Noisy and Incomplete Data
3. Distributed Data
4. Complex Data
5. Performance
6. Scalability and Efficiency of the Algorithms
7. Improvement of Mining Algorithms
8. Incorporation of Background Knowledge
9. Data Visualization
10. Data Privacy and Security
11. User Interface
12. Mining dependent on Level of Abstraction
13. Integration of Background Knowledge
14. Mining Methodology Challenges

1. Security and Social Challenges


Decision-making strategies are built on data collection and sharing, which require considerable
security. Private information about people and sensitive information is gathered to build user
profiles and to understand user behavior patterns. Illegal access to information and the
confidential nature of information therefore become significant issues.

2. Noisy and Incomplete Data


Data Mining is a way to obtain information from huge volumes of data. Real-world data is noisy,
incomplete, and heterogeneous. Data in huge amounts is often unreliable or inaccurate. These issues
can be caused by human mistakes and blunders, or by errors in the instruments that measure the data.

3. Distributed Data
Real-world data is normally stored on various platforms in distributed computing environments. It may
be on the internet, on individual systems, or in databases. It is practically hard to bring all the
data to a unified data repository, mainly for technical and organizational reasons.

4. Complex Data
Real-world data is heterogeneous, and it may be multimedia data, including natural language text, time
series, spatial data, temporal data, audio, video, images, etc. It is quite hard to
deal with these various types of data and extract the required information. More often
than not, new tools and systems have to be developed to extract the relevant
information.

5. Performance
The performance of a data mining system essentially depends on the efficiency of the
techniques and algorithms used. If the techniques and algorithms designed
are not adequate, the performance of the data mining process will be affected
adversely.

6. Scalability and Efficiency of the Algorithms


Data mining algorithms should be scalable and efficient in order to extract information from
tremendous amounts of data in the data set.

7. Improvement of Mining Algorithms


Factors such as the complexity of data mining approaches, the enormous size of databases,
and the entire data flow motivate the creation and distribution of parallel data mining algorithms.

8. Incorporation of Background Knowledge


If background knowledge can be incorporated, more accurate and reliable data
mining results can be found. Predictive tasks can make more accurate predictions, while
descriptive tasks can produce more useful findings. However, gathering and
incorporating background knowledge is complex.

9. Data Visualization
Data visualization is a vital process in data mining, since it is the main interface that presents
the output to the user in a presentable way. The extracted information should convey the
meaning it intends to convey. However, it is often very hard to present the
information accurately and straightforwardly to the end user. Since the input data and the output
information can be very complex, highly effective data visualization techniques must be applied to
make the process successful.

10. Data Privacy and Security


Data mining typically raises significant governance, privacy, and data security issues. For
instance, when a retailer analyzes purchase details, it reveals information about the
buying habits and choices of customers without their authorization.

11. User Interface


The knowledge discovered using data mining tools is useful only if it is interesting and, above
all, understandable to the user. Good visualization eases the interpretation of data mining
results and helps users better understand their requirements. Much research has been done on how
to manipulate and display mined knowledge from enormous data sets so that it can be perceived
well.

12. Mining dependent on Level of Abstraction


The data mining process should be interactive, because this allows users to focus the search for
patterns, and to present and refine data mining requests based on the returned results, at
different levels of abstraction.

13. Integration of Background Knowledge


Background knowledge may be used to express the discovered patterns and to
guide the exploration process.

14. Mining Methodology Challenges


These challenges relate to data mining methods and their limitations. Issues include the handling
of noise in data, the dimensionality of the domain, the diversity of available data, the
versatility of the mining method, and so on.

The History of Data Mining (or) Origin of Data Mining


Data mining is the process of finding useful new correlations, patterns, and trends by
sifting through large amounts of data stored in repositories, using pattern recognition
technologies including statistical and mathematical techniques. It is the analysis of factual
data sets to discover unsuspected relationships and to summarize the records in novel ways
that are both understandable and useful to the data owner.
It is the procedure of selecting, exploring, and modeling large quantities of data to
find regularities or relations that are at first unknown, in order to obtain clear and useful
results for the owner of the database.

Data mining builds on methods such as Bayes' Theorem (1700s) and regression analysis (1800s),
which were mostly concerned with identifying patterns in data. Rather than starting from the very
beginning, the rest of this section focuses on the recent history and major milestones of data
mining.

Data mining is the process of analyzing large data sets (Big Data) from different perspectives and
uncovering correlations and patterns to summarize them into useful information. Nowadays it is
blended with many techniques such as artificial intelligence, statistics, data science, database
theory and machine learning.

Recent history: The increasing power of technology and the complexity of data sets have led data
mining to evolve from static data delivery to more dynamic and proactive information delivery, and
from tapes and disks to advanced algorithms and massive databases. In the late
1980s, the term "data mining" began to be known and used within the research community by
statisticians, data analysts, and the management information systems (MIS) communities.

By the early 1990s, data mining was recognized as a sub-process, or a step, within a larger process
called Knowledge Discovery in Databases (KDD), which is what made it so widely
popular. The most commonly used definition of KDD is “The nontrivial process of
identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
(Fayyad, 1996).

The sub-processes that form part of the KDD process are:

1. Understanding of the application and identifying the goal of the KDD process

2. Creating a target data set

3. Data cleaning and pre-processing

4. Matching the goals of the KDD process (step 1) to a particular data-mining method.

5. Research analysis and hypothesis selection

6. Data mining: Searching for patterns of interest in a particular form, including classification
rules, regression, and clustering

7. Interpreting mined patterns

8. Acting on the discovered analysis

The popularity of data mining escalated notably in the 1990s, with the help of dedicated
conferences, in addition to the fast increase in technology, data storage capabilities, and
computers' processing speeds. It also became possible for organizations to keep data in
computer-readable form, and processing large volumes of data on desktop machines was not far from
reality.

By the end of the 1990s, data mining was already a well-known technique used by organizations,
particularly after the introduction of customer loyalty cards. This opened a big door by allowing
organizations to record customer purchases and data, and the resulting data could be mined to
identify customer purchasing patterns. The popularity of data mining has continued to grow rapidly
over the last decade.

The evolution of data mining applications

The main focus of data mining was tabular data; however, with evolving technology and
different needs, new sources of data emerged to be mined!

 Text Mining: Still a popular data mining activity, it categorizes or clusters large document
collections such as news articles or web pages. Another application is opinion mining where
the techniques are applied to obtain useful information from the questionnaire style data.

 Image Mining: In image mining, mining techniques are applied to images (2D and 3D)

 Graph Mining: It grew out of frequent pattern mining and focuses on frequently
occurring sub-graphs. A popular extension of graph mining is social network mining.

Data mining has become very popular over the last two decades as a discipline in its own right. Data
mining applications are used in business, government, and science, to name just a few fields.
Starting from text mining, it has evolved a lot, and it will be very interesting to watch how it
develops with the use of different kinds of data (e.g., spatial data and different sources of
multimedia data) in the future.

Types of Data Mining


Each of the following data mining techniques serves several different business problems and
provides a different insight into each of them. However, understanding the type of business
problem you need to solve will also help in knowing which technique will be best to use, which
will yield the best results. The Data Mining types can be divided into two basic parts that are as
follows:

1. Predictive Data Mining Analysis


2. Descriptive Data Mining Analysis

1. Predictive Data Mining

As the name signifies, Predictive Data-Mining analysis works on the data that may help to know
what may happen later (or in the future) in business. Predictive Data-Mining can also be further
divided into four types that are listed below:

o Classification Analysis
o Regression Analysis
o Time Series Analysis
o Prediction Analysis

2. Descriptive Data Mining

The main goal of the Descriptive Data Mining tasks is to summarize or turn given data into
relevant information. The Descriptive Data-Mining Tasks can also be further divided into four
types that are as follows:

o Clustering Analysis
o Summarization Analysis
o Association Rules Analysis
o Sequence Discovery Analysis

1. CLASSIFICATION ANALYSIS

This type of data mining technique is generally used to fetch or retrieve important and
relevant information about data and metadata. It is also used to categorize different
types of data into different classes. Classification and clustering are similar data mining
techniques, since clustering also divides data records into segments known as classes.
However, unlike clustering, in classification the data analyst already knows the different classes
or clusters. Therefore, in classification analysis you apply or implement algorithms
to decide how new data should be categorized or classified. A classic example of
classification analysis is Outlook email, which uses certain algorithms to
characterize an email as legitimate or spam.

This technique is usually very helpful for retailers, who can use it to study the buying habits of
their different customers. Retailers can also study past sales data and then look for products that
customers usually buy together. They can then place those products near each other in their retail
stores to help customers save time as well as to increase sales.
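A minimal classification sketch in Python, assuming scikit-learn is available; the email features (number of links, number of capitalised words), the tiny training set, and the labels are made up purely for illustration:

```python
# Classify emails as spam (1) or legitimate (0) from two made-up numeric features.
from sklearn.tree import DecisionTreeClassifier

# Each row: [links_in_email, capitalised_words]
X = [[8, 12], [7, 9], [6, 11], [0, 1], [1, 0], [2, 2]]
y = [1, 1, 1, 0, 0, 0]        # 1 = spam, 0 = legitimate

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Decide the class of a new, unseen email
print(clf.predict([[5, 10]]))  # likely classified as spam (1) for this toy data
```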

2. REGRESSION ANALYSIS

In statistical terms, regression analysis is a process used to identify and analyze the
relationship among variables: one variable is dependent on another, but not vice versa.
It is generally used for prediction and forecasting purposes. It can also help you
understand how the value of the dependent variable changes when any of the independent
variables is varied.
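A minimal regression sketch with scikit-learn; the variables (advertising spend as the independent variable, sales as the dependent one) and all the numbers are assumptions for illustration only:

```python
# Fit a line relating an independent variable to a dependent variable.
import numpy as np
from sklearn.linear_model import LinearRegression

ad_spend = np.array([[10], [20], [30], [40], [50]])   # independent variable
sales = np.array([25, 45, 62, 80, 103])               # dependent variable

model = LinearRegression().fit(ad_spend, sales)
print(model.coef_, model.intercept_)   # how sales change per unit of spend
print(model.predict([[60]]))           # forecast sales for a new spend level
```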

3. Time Series Analysis

A time series is a sequence of data points recorded at specific points in time, most often at
regular intervals (seconds, hours, days, months, etc.). Almost every organization generates a high volume of data every day, such as sales figures,
etc.). Almost every organization generates a high volume of data every day, such as sales figures,
revenue, traffic, or operating cost. Time series data mining can help in generating valuable
information for long-term business decisions, yet they are underutilized in most organizations.
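A minimal time-series sketch with pandas, using synthetic daily sales data as an assumption; it aggregates daily figures into monthly totals and a smoothed weekly trend:

```python
# Aggregate daily sales into monthly totals and a rolling 7-day average.
import numpy as np
import pandas as pd

days = pd.date_range("2023-01-01", periods=90, freq="D")
daily_sales = pd.Series(np.random.default_rng(0).integers(50, 150, len(days)),
                        index=days)

monthly_sales = daily_sales.resample("M").sum()        # month-end totals
rolling_trend = daily_sales.rolling(window=7).mean()   # smooth out daily noise

print(monthly_sales)
```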

4. Prediction Analysis

This technique is generally used to predict the relationship that exists between the
independent and dependent variables, as well as among the independent variables alone. It can also
be used to predict the profit that can be achieved in the future depending on sales. Let us imagine
that profit and sale are the dependent and independent variables, respectively. Then, on the basis
of what the past sales data says, we can make a profit prediction for the future using a regression curve.

5. Clustering Analysis

In Data Mining, this technique is used to create meaningful clusters of objects that share the same
characteristics. Most people confuse it with classification, but they won't have any
issues once they properly understand how both techniques actually work. Unlike
classification, which places objects into predefined classes, clustering places objects into classes
that are defined by the data itself. To understand it in more detail, consider the following
example:

Example

Suppose you are in a library that is full of books on different topics. Now the real challenge for
you is to organize those books so that readers don't face any problem finding out books on any
particular topic. So here, we can use clustering to keep books with similarities in one particular
shelf and then give those shelves a meaningful name or class. Therefore, a reader
looking for books on a particular topic can go straight to that shelf and won't be required to
roam the entire library to find the book they want to read.
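A minimal clustering sketch with scikit-learn's k-means; the two-dimensional points are made up, but they show how objects are grouped into classes that were not predefined:

```python
# Group similar points into clusters without any predefined classes.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0],      # one natural group
     [10, 2], [10, 4], [10, 0]]   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assigned to each point
print(kmeans.cluster_centers_)   # "shelf" each group is placed on
```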

6. SUMMARIZATION ANALYSIS

The Summarization analysis is used to store a group (or a set) of data in a more compact and
easier-to-understand form. We can easily understand it with the help of an example:

Example

You might have used Summarization to create graphs or calculate averages from a given set (or
group) of data. This is one of the most familiar and accessible forms of data mining.

7. ASSOCIATION RULE LEARNING

It can be considered a method that can help us identify some interesting relations (dependency
modeling) between different variables in large databases. This technique can also help us to
unpack some hidden patterns in the data, which can be used to identify the variables within the
data. It also helps in detecting the concurrence of different variables that appear very frequently
in the dataset. Association rules are generally used for examining and forecasting the behavior of
the customer. It is also highly recommended for retail industry analysis. This technique is also
used for shopping basket data analysis, catalogue design, product clustering, and store
layout. In IT, programmers also use association rules to create programs capable of machine
learning. In short, this data mining technique helps to find the association
between two or more items and discovers hidden patterns in the data set.
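A minimal association-rule sketch, assuming the optional mlxtend library is installed (pip install mlxtend); the transactions are made up for illustration:

```python
# Mine frequent itemsets and association rules from a few toy transactions.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "milk"],
                ["bread", "butter", "milk"],
                ["tea", "sugar"],
                ["bread", "butter", "milk"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```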

8. Sequence Discovery Analysis

The primary goal of sequence discovery analysis is to discover interesting patterns in data on the
basis of some subjective or objective measure of how interesting it is. Usually, this task involves
discovering frequent sequential patterns with respect to a frequency support measure. Some
people often confuse it with time series analysis, as both sequence discovery analysis and time
series analysis contain adjacent observations that are order dependent. However, looking at both of
them in a little more depth, the confusion can easily be avoided: time series
analysis deals with numerical data, whereas sequence discovery analysis deals with
discrete values or data.

Data quality
Data quality is a measure of the condition of data based on factors such as accuracy,
completeness, consistency, reliability and whether it's up to date. Measuring data quality levels
can help organizations identify data errors. The emphasis on data quality in enterprise systems has
increased as data processing has become more intricately linked with business operations and
organizations increasingly use data analytics to help drive business decisions. Data quality
management is a core component of the overall data management process, and data quality
improvement efforts are often closely tied to data governance programs that aim to ensure data is
formatted and used consistently throughout an organization.
What is good data quality?
Data accuracy is a key attribute of high-quality data. To avoid transaction processing problems in
operational systems and faulty results in analytics applications, the data that's used must be
correct. Inaccurate data needs to be identified, documented and fixed to ensure that business
executives, data analysts and other end users are working with good information.

Dimensions that are important elements of good data quality include the following (a short sketch of checking some of these follows the list):

 completeness, with data sets containing all of the data elements they should;
 consistency, where there are no conflicts between the same data values in different systems
or data sets;
 uniqueness, indicating a lack of duplicate data records in databases and data warehouses;
 timeliness or currency, meaning that data has been updated to keep it current and is available
to use when it's needed;
 validity, confirming that data contains the values it should and is structured properly; and
 conformity to the standard data formats created by an organization.
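A minimal sketch of checking a few of these dimensions with pandas on a hypothetical customer table (the column names and validity rules are assumptions):

```python
# Quick data-quality checks: completeness, uniqueness, and a simple validity rule.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "not-an-email"],
    "age": [34, 41, 41, -5],
})

completeness = 1 - df.isna().mean()                     # share of non-missing values per column
duplicates = df.duplicated(subset="customer_id").sum()  # uniqueness of the key
valid_age = df["age"].between(0, 120).mean()            # validity against a simple rule

print(completeness, duplicates, valid_age, sep="\n")
```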

Data Preprocessing
Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it suitable for analysis.

Steps in data preprocessing include:


 Data cleaning: this step involves identifying and removing missing, inconsistent, or
irrelevant data. This can include removing duplicate records, filling in missing values, and
handling outliers.
 Data integration: this step involves combining data from multiple sources, such as
databases, spreadsheets, and text files. The goal of integration is to create a single,
consistent view of the data.
 Data transformation: this step involves converting the data into a format that is more
suitable for the data mining task. This can include normalizing numerical data, creating
dummy variables, and encoding categorical data.
 Data reduction: this step is used to select a subset of the data that is relevant to the data
mining task. This can include feature selection (selecting a subset of the variables) or
feature extraction (extracting new variables from the data).
 Data discretization: this step is used to convert continuous numerical data into categorical
data, which can be used for decision tree and other categorical data mining techniques.
By performing these steps, the data mining process becomes more efficient and the results
become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to transform the raw data into a
useful and efficient format.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is
done. It involves handling of missing data, noisy data etc.

 (a). Missing Data:


This situation arises when some values are missing in the data. It can be handled in various
ways.
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the missing values
manually, by attribute mean or the most probable value.

 (b). Noisy Data:


Noisy data is meaningless data that cannot be interpreted by machines. It can be generated
due to faulty data collection, data entry errors, etc. It can be handled in the following ways (a short sketch combining missing-value filling and binning follows this list):
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size and then various methods are performed to complete the task.
Each segment is handled separately. One can replace all data in a segment by its
mean, or boundary values can be used to complete the task.

2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple
independent variables).

3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they
will fall outside the clusters.
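A minimal data-cleaning sketch with pandas combining the steps above: missing values are filled with the attribute mean, and a noisy numeric column is smoothed by replacing each value with the mean of its equal-width bin (the column name and bin count are assumptions):

```python
# Handle missing data and smooth noisy data by bin means.
import pandas as pd

df = pd.DataFrame({"income": [2500, None, 3100, 4800, None, 5200, 60000]})

# (a) Missing data: fill missing values with the attribute mean
df["income"] = df["income"].fillna(df["income"].mean())

# (b) Noisy data: divide values into 3 equal-width bins, then replace each
#     value by the mean of its bin (smoothing by bin means)
df["bin"] = pd.cut(df["income"], bins=3)
df["income_smoothed"] = df.groupby("bin", observed=True)["income"].transform("mean")
print(df)
```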

2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining
process (a short normalization sketch follows this list). This involves the following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.

3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual
levels.

4. Concept Hierarchy Generation:


Here attributes are converted from a lower level to a higher level in the hierarchy. For example,
the attribute "city" can be converted to "country".
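A minimal normalization sketch with scikit-learn, rescaling a numeric attribute into the 0.0 to 1.0 range mentioned above:

```python
# Min-max normalization of a numeric attribute into [0.0, 1.0].
from sklearn.preprocessing import MinMaxScaler

ages = [[5], [18], [31], [46], [70]]

scaler = MinMaxScaler(feature_range=(0.0, 1.0))
print(scaler.fit_transform(ages))   # each value rescaled to the target range
```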

3. Data Reduction:
Data mining is a technique used to handle huge amounts of data. While working
with such huge volumes of data, analysis becomes harder. To get rid of this problem, we
use data reduction techniques, which aim to increase storage efficiency and reduce data storage
and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.

2. Attribute Subset Selection:


The highly relevant attributes should be used; the rest can be discarded. For performing
attribute selection, one can use the level of significance and the p-value of the attribute:
attributes having a p-value greater than the significance level can be discarded.

3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example, regression
models.

4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or lossless. If the
original data can be retrieved after reconstruction from the compressed data, such reduction is
called lossless reduction; otherwise it is called lossy reduction. Two effective methods of
dimensionality reduction are wavelet transforms and PCA (Principal Component
Analysis).
Aggregation in Data Mining
Aggregation in data mining is the process of finding, collecting, and presenting the data in a
summarized format to perform statistical analysis of business schemes or analysis of human
patterns. When data is collected from numerous datasets, it is crucial to gather accurate
data to provide significant results. Data aggregation can help in taking prudent decisions in
marketing, finance, product pricing, etc.
Examples of aggregate data:

 Finding the average age of customers buying a particular product, which can help in finding
the target age group for that product. Instead of dealing with individual
customers, the average age of the customers is calculated (a short sketch of this follows).
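A minimal aggregation sketch with pandas that computes exactly this kind of summary, the average buyer age per product, over a small made-up purchase table:

```python
# Aggregate individual purchases into an average buyer age per product.
import pandas as pd

purchases = pd.DataFrame({
    "product": ["phone", "phone", "laptop", "laptop", "laptop"],
    "age":     [21, 25, 38, 42, 40],
})

avg_age = purchases.groupby("product")["age"].mean()
print(avg_age)   # e.g. phones skew toward younger buyers in this toy data
```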

Data aggregators: Data aggregators are systems in data mining that collect data from
numerous sources, then process the data and repackage it into useful data packages.
They play a major role in improving the customer's data by acting as an agent. They help in
the query and delivery process, where a customer requests data instances about a certain
product and the aggregator provides the customer with matched records of the product.

Working of Data aggregators: The working of data aggregators can be performed in three stages:

o Collection of data
o Processing of data
o Presentation of data

These three stages work as follows:


 Collection of data: Collecting data from different datasets from the enormous database.
The data can be extracted using IoT(internet of things) such as
 Communications in social media
 Speech recognition like call centers
 News headlines
 Browsing history and other personal data of devices.
 Processing of data: After collecting data, the data aggregator finds the atomic data and
aggregates it. In the processing technique, aggregators use various algorithms from the
field of Artificial Intelligence or Machine Learning. They also incorporate
statistical methods, like predictive analysis, to process the data. In this way, various useful insights
can be extracted from the raw data.
 Presentation of data: After the processing step, the data will be in a summarized format
which can provide a desirable statistical result with detailed and accurate data.

Types of Data Aggregation:


 Time aggregation: It provides the data points for a single resource over a defined time period.
 Spatial aggregation: It provides the data points for a group of resources over a defined time
period.

Time intervals for data aggregation process:

 Reporting period: The period in which the data is collected for presentation. It can either
be a data point aggregated process or simply raw data. E.g. The data is collected and
processed into a summarized format in a period of one day from a network device. Hence
the reporting period will be one day.
 Granularity: The period in which data is collected for aggregation. E.g. To find the sum of
data points for a specific resource collected over a period of 10 mins. Here the granularity
would be 10 mins. The value of granularity can vary from minute to month depending upon
the reporting period.
 Polling period: The frequency at which resources are sampled for data. E.g., if a group of
resources is polled every 7 minutes, data points for each resource are
generated every 7 minutes. The polling period and granularity come under spatial
aggregation.

Applications of Data Aggregation:

 Data aggregation is used in many fields where a large number of datasets are involved. It
helps in making fruitful decisions in marketing or finance management. It helps in the
planning and pricing of products.
 Efficient use of data aggregation can help in the creation of marketing schemes. E.g. If the
company is performing ad campaigns on a particular platform, they must deeply analyze
the data to raise sales. The aggregation can help in analyzing the execution over a
respective time period of campaigns or a particular cohort or a particular channel/platform.
This can be done in three steps namely Extraction, Transform, Visualize.

Workflow of Data Analysis in SaaS Applications.

 Data aggregation plays a major role in retail and e-commerce industries by monitoring
competitive prices. In this field, keeping track of competitors is a must: a
company should collect details of pricing, offers, etc., of other companies to know what its
competitors are up to. This can be done by aggregating data from a single resource
like a competitor's website.
 Data aggregation plays an impactful role in the travel industry. It comprises research
about competitors, gaining marketing intelligence to reach people, and capturing images
from their travel websites. It also includes customer sentiment analysis, which helps to gauge
emotions and satisfaction based on linguistic analysis. Failed data aggregation in this
field can lead to declining growth of a travel company.
 For the business analysis purpose, the data can be aggregated into summary formats which
can help the head of the firm to take correct decisions for satisfying the customers. It helps
in inspecting groups of people.

Sampling in Data Mining


Data sampling is a statistical analysis technique used to select, manipulate and analyze a
representative subset of data points to identify patterns and trends in the larger data set being
examined. (OR)
Sampling is the process of selecting a subset of data from a larger dataset. There are two main types of
sampling: probability sampling and non-probability sampling. The main difference between the two types
of sampling is how the sample is selected from the population.

Five Basic Sampling Methods


 Simple Random.
 Convenience.
 Systematic.
 Cluster.
 Stratified.

Ultimately, every sampling type comes under two broad categories:

 Probability sampling - Random selection techniques are used to select the


sample.
 Non-probability sampling - Non-random selection techniques based on
certain criteria are used to select the sample.

Types of Sampling Techniques in Data Analytics


Probability Sampling Techniques
Probability Sampling Techniques are one of the important types of sampling techniques.
Probability sampling allows every member of the population a chance to get selected. It is
mainly used in quantitative research when you want to produce results representative of the
whole population.

1. Simple Random Sampling

In simple random sampling, the researcher selects the participants randomly. Data analytics
tools like random number generators and random number tables are used, which are
based entirely on chance.

Example: The researcher assigns every member in a company database a number from 1 to 1000
(depending on the size of the company) and then uses a random number generator to select 100
members.

2. Systematic Sampling

In systematic sampling, every member of the population is given a number as well, as in simple
random sampling. However, instead of randomly generating numbers, the samples are chosen at regular
intervals.

Example: The researcher assigns every member in the company database a number. Instead of
randomly generating numbers, a random starting point (say 5) is selected. From that number
onwards, the researcher selects every, say, 10th person on the list (5, 15, 25, and so on) until the
sample is obtained.
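A minimal sketch of both simple random and systematic sampling with pandas, using a hypothetical table of 1000 members as in the examples above:

```python
# Simple random sampling vs. systematic sampling over 1000 members.
import pandas as pd

employees = pd.DataFrame({"member_id": range(1, 1001)})

# Simple random sampling: 100 members chosen entirely by chance
random_sample = employees.sample(n=100, random_state=42)

# Systematic sampling: start at member 5, then take every 10th member
systematic_sample = employees.iloc[4::10]

print(len(random_sample), len(systematic_sample))
```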

3. Stratified Sampling

In stratified sampling, the population is subdivided into subgroups, called strata, based on some
characteristics (age, gender, income, etc.). After forming a subgroup, you can then use random or
systematic sampling to select a sample for each subgroup. This method allows you to draw more
precise conclusions because it ensures that every subgroup is properly represented.

Example: If a company has 500 male employees and 100 female employees, the researcher
wants to ensure that the sample reflects the gender as well. So the population is divided into two
subgroups based on gender.
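A minimal stratified-sampling sketch with pandas: the same fraction is sampled from each gender subgroup so the sample keeps the 500:100 ratio of the example (the 10% fraction is an assumption):

```python
# Stratified sampling: sample the same fraction from each subgroup (stratum).
import pandas as pd

employees = pd.DataFrame({"gender": ["M"] * 500 + ["F"] * 100})

strat_sample = (employees
                .groupby("gender", group_keys=False)
                .sample(frac=0.1, random_state=0))

print(strat_sample["gender"].value_counts())   # roughly 50 M and 10 F
```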

4. Cluster Sampling

In cluster sampling, the population is divided into subgroups, but each subgroup has similar
characteristics to the whole sample. Instead of selecting a sample from each subgroup, you
randomly select an entire subgroup. This method is helpful when dealing with large and diverse
populations.

Example: A company has over a hundred offices in ten cities across the world, each of which has
roughly the same number of employees in similar job roles. The researcher randomly selects 2 to 3
offices and uses them as the sample.

Non-Probability Sampling Techniques


Non-Probability Sampling Techniques are another important type of sampling technique. In
non-probability sampling, not every individual has a chance of being included in the sample.
This sampling method is easier and cheaper but also has high risks of sampling bias. It is often
used in exploratory and qualitative research with the aim to develop an initial understanding of
the population.

1. Convenience Sampling

In this sampling method, the researcher simply selects the individuals who are most easily
accessible to them. This is an easy way to gather data, but there is no way to tell whether the
sample is representative of the entire population. The only criterion involved is that people are
available and willing to participate.

Example: The researcher stands outside a company and asks the employees coming in to answer
questions or complete a survey.

2. Voluntary Response Sampling

Voluntary response sampling is similar to convenience sampling, in the sense that the only
criterion is people are willing to participate. However, instead of the researcher choosing the
participants, the participants volunteer themselves.

Dimensionality reduction
The number of input features, variables, or columns present in a given dataset is known as
dimensionality, and the process to reduce these features is called dimensionality reduction.

In many cases, a dataset contains a huge number of input features, which makes the predictive
modeling task more complicated. Because it is very difficult to visualize or make predictions for
a training dataset with a high number of features, dimensionality reduction
techniques are required in such cases.

Dimensionality reduction technique can be defined as, "It is a way of converting the higher
dimensions dataset into lesser dimensions dataset ensuring that it provides similar
information." These techniques are widely used in machine learning for obtaining a better fit
predictive model while solving the classification and regression problems.

It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data visualization,
noise reduction, cluster analysis, etc.

The Curse of Dimensionality

Handling the high-dimensional data is very difficult in practice, commonly known as the curse of
dimensionality. As the dimensionality of the input dataset increases, any machine learning
algorithm and model becomes more complex. As the number of features increases, the number of
samples needed also increases proportionally, and the chance of overfitting also increases. If a
machine learning model is trained on high-dimensional data, it becomes overfitted and results in
poor performance.

Benefits of applying Dimensionality Reduction

o By reducing the dimensions of the features, the space required to store the dataset also
gets reduced.
o Less Computation training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.

Disadvantages of dimensionality Reduction

o Some data may be lost due to dimensionality reduction.


o In the PCA dimensionality reduction technique, the number of principal components that
should be retained is sometimes not known.

Approaches of Dimension Reduction

There are two ways to apply the dimension reduction technique, which are given below:

Feature Selection

Feature selection is the process of selecting the subset of the relevant features and leaving out the
irrelevant features present in a dataset to build a model of high accuracy. In other words, it is a
way of selecting the optimal features from the input dataset.

Three methods are used for the feature selection:

1. Filters Methods

In this method, the dataset is filtered, and a subset that contains only the relevant features is
taken. Some common techniques of filters method are:

o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.

2. Wrappers Methods

The wrapper method has the same goal as the filter method, but it takes a machine learning
model for its evaluation. In this method, some features are fed to the ML model and the
performance is evaluated. The performance decides whether to add or remove those features to
increase the accuracy of the model. This method is more accurate than the filter method but more
complex to work with. Some common techniques of wrapper methods are:

o Forward Selection
o Backward Selection
o Bi-directional Elimination

3. Embedded Methods: Embedded methods check the different training iterations of the
machine learning model and evaluate the importance of each feature. Some common techniques
of Embedded methods are:

o LASSO
o Elastic Net
o Ridge Regression, etc.

Feature Extraction:

Feature extraction is the process of transforming the space containing many dimensions into
space with fewer dimensions. This approach is useful when we want to keep the whole
information but use fewer resources while processing the information.

Some common feature extraction techniques are:

a. Principal Component Analysis


b. Linear Discriminant Analysis
c. Kernel PCA
d. Quadratic Discriminant Analysis

Common techniques of Dimensionality Reduction


a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder

Principal Component Analysis (PCA)

Principal Component Analysis is a statistical process that converts the observations of correlated
features into a set of linearly uncorrelated features with the help of orthogonal transformation.
These new transformed features are called the Principal Components. It is one of the popular
tools that is used for exploratory data analysis and predictive modeling.

PCA works by considering the variance of each attribute, because high variance indicates a
good split between the classes, and hence it reduces the dimensionality. Some real-world
applications of PCA are image processing, movie recommendation systems, and optimizing the
power allocation in various communication channels.
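A minimal PCA sketch with scikit-learn; the built-in iris data set is used only as a convenient 4-dimensional example:

```python
# Project 4-dimensional data onto its first 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 samples x 4 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)      # variance kept by each component
```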

Backward Feature Elimination

The backward feature elimination technique is mainly used while developing Linear Regression
or Logistic Regression models. The following steps are performed in this technique to reduce the
dimensionality or perform feature selection:

o In this technique, firstly, all the n variables of the given dataset are taken to train the
model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n-1 features for n
times, and will compute the performance of the model.

o We will check the variable that has made the smallest or no change in the performance of
the model, and then we will drop that variable or features; after that, we will be left with
n-1 features.
o Repeat the complete process until no feature can be dropped.

In this technique, by selecting the optimum performance of the model and the maximum tolerable
error rate, we can define the optimal number of features required for the machine learning
algorithm.

Forward Feature Selection

Forward feature selection follows the inverse process of the backward elimination process. It
means, in this technique, we don't eliminate the feature; instead, we will find the best features
that can produce the highest increase in the performance of the model. The following steps are
performed in this technique (a short scikit-learn sketch covering both forward and backward selection follows the list):

o We start with a single feature only, and progressively we will add each feature at a time.
o Here we will train the model on each feature separately.
o The feature with the best performance is selected.
o The process will be repeated until we get a significant increase in the performance of the
model.
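A minimal sketch of both directions using scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions); the estimator, the data set, and the number of features to keep are assumptions:

```python
# Forward selection and backward elimination with a sequential selector.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

forward = SequentialFeatureSelector(model, n_features_to_select=2,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(model, n_features_to_select=2,
                                     direction="backward").fit(X, y)

print(forward.get_support())    # mask of features kept by forward selection
print(backward.get_support())   # mask of features kept by backward elimination
```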

Missing Value Ratio

If a dataset has too many missing values, then we drop those variables as they do not carry much
useful information. To perform this, we can set a threshold level, and if a variable has missing
values more than that threshold, we will drop that variable. The higher the threshold value, the
more efficient the reduction.
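A minimal missing-value-ratio sketch with pandas; the 40% threshold and the column names are assumptions:

```python
# Drop columns whose share of missing values exceeds a chosen threshold.
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1, None, None, None, 5],
    "c": [1, None, 3, 4, 5],
})

threshold = 0.4
missing_ratio = df.isna().mean()                  # fraction missing per column
df_reduced = df.loc[:, missing_ratio <= threshold]
print(df_reduced.columns.tolist())                # column "b" is dropped
```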

Low Variance Filter

Similar to the missing value ratio technique, data columns with little variation in the data carry less
information. Therefore, we need to calculate the variance of each variable, and all data columns
with variance lower than a given threshold are dropped, because low-variance features will not
affect the target variable.
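A minimal low-variance-filter sketch using scikit-learn's VarianceThreshold on a small made-up matrix whose first column is constant:

```python
# Remove features whose variance does not exceed the threshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2.0, 10],
              [0, 2.1, 20],
              [0, 1.9, 30],
              [0, 2.0, 40]])   # first column never changes

selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)          # the constant column is removed
```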

High Correlation Filter

High Correlation refers to the case when two variables carry approximately similar information.
Due to this factor, the performance of the model can be degraded. The correlation between
independent numerical variables gives the calculated value of the correlation coefficient. If this
value is higher than the threshold value, we can remove one of the variables from the dataset. We
prefer to keep those variables or features that show a high correlation with the target variable.
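A minimal high-correlation-filter sketch with pandas; the columns and the 0.9 threshold are assumptions:

```python
# Drop one variable from every pair whose absolute correlation exceeds 0.9.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],   # almost a copy of height_cm
    "weight_kg": [55, 62, 90, 75, 80],
})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

print(to_drop)                   # ['height_in'] would be removed
```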

Random Forest

Random Forest is a popular and very useful feature selection algorithm in machine learning. This
algorithm contains an in-built feature importance package, so we do not need to program it
separately. In this technique, we need to generate a large set of trees against the target variable,
and with the help of usage statistics of each attribute, we need to find the subset of features.

The random forest algorithm takes only numerical variables, so we need to convert the input data
into numeric data using one-hot encoding.

Factor Analysis

Factor analysis is a technique in which each variable is kept within a group according to its
correlation with other variables. This means variables within a group can have a high correlation
among themselves, but a low correlation with variables of other groups.

For example, income and spending are two variables with a high correlation: people with a high
income spend more, and vice versa. So, such variables are put into a group, and that group is known
as a factor. The number of these factors will be reduced as compared to the original dimensionality
of the dataset.

Auto-encoders

One of the popular methods of dimensionality reduction is auto-encoder, which is a type of ANN
or artificial neural network, and its main aim is to copy the inputs to the outputs. In this, the
input is compressed into a latent-space representation, and the output is reconstructed from this
representation. It has mainly two parts:

o Encoder: The function of the encoder is to compress the input to form the latent-space
representation.
o Decoder: The function of the decoder is to recreate the output from the latent-space
representation.

Discretization and binarization


Data discretization is a method of converting attribute values of continuous data into a finite set
of intervals with minimum data loss. Discretization is the process of transferring continuous
functions, models, variables, and equations into discrete counterparts. This process is
usually carried out as a first step toward making them suitable for numerical evaluation and
implementation on digital computers.
Data binarization is used to transform the continuous and discrete attributes into binary
attributes.

Data discretization refers to a method of converting a huge number of data values into smaller
ones so that the evaluation and management of data become easy. In other words, data
discretization is a method of converting attribute values of continuous data into a finite set of
intervals with minimum data loss. There are two forms of data discretization: the first is supervised
discretization, and the second is unsupervised discretization. Supervised discretization refers to a
method in which the class data is used. Unsupervised discretization is characterized by the way the
operation proceeds: it works on a top-down splitting strategy or a bottom-up merging strategy.

Now, we can understand this concept with the help of an example

Suppose we have an attribute of Age with the given values

Age 1,5,9,4,7,11,14,17,13,18, 19,31,33,36,42,44,46,70,74,78,77

Table: Age attribute before and after discretization

Attribute               Age           Age                   Age                   Age
Before Discretization   1,5,4,9,7     11,14,17,13,18,19     31,33,36,42,44,46     70,74,77,78
After Discretization    Child         Young                 Mature                Old
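A minimal sketch reproducing the table above with pandas: the continuous ages are cut into four labelled intervals (the exact bin edges are an assumption):

```python
# Discretize a continuous Age attribute into labelled intervals.
import pandas as pd

ages = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                  31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

labels = ["Child", "Young", "Mature", "Old"]
discretized = pd.cut(ages, bins=[0, 10, 30, 60, 120], labels=labels)
print(discretized.value_counts())
```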

Another example is web analytics, where we gather statistics about website visitors. For example,
all visitors who visit the site from an IP address in India are shown under the country level.

Some Famous techniques of data discretization

Histogram analysis

Histogram refers to a plot used to represent the underlying frequency distribution of a continuous
data set. A histogram assists in inspecting the data distribution, for example for outliers,
skewness, or whether the data follows a normal distribution.

Binning

Binning refers to a data smoothing technique that helps to group a huge number of continuous
values into a smaller number of bins. This technique can also be used for data discretization and
the development of concept hierarchies.
Cluster Analysis

Cluster analysis is a form of data discretization. A clustering algorithm is executed by dividing
the values of an attribute x into clusters, and these clusters are then used to form the
discretized intervals of x.

Data discretization using decision tree analysis

Data discretization refers to a decision tree analysis in which a top-down slicing technique is
used. It is done through a supervised procedure. In a numeric attribute discretization, first, you
need to select the attribute that has the least entropy, and then you need to run it with the help of
a recursive process. The recursive process divides it into various discretized disjoint intervals,
from top to bottom, using the same splitting criterion.

Data discretization using correlation analysis

When discretizing data by a linear regression technique, you can get the best neighboring
intervals, and then the large intervals are combined to develop larger overlaps and form the final
overlapping intervals. It is a supervised procedure.

Data discretization and concept hierarchy generation

The term hierarchy represents an organizational structure or mapping in which items are ranked
according to their levels of importance. In other words, we can say that a hierarchy concept
refers to a sequence of mappings with a set of more general concepts to complex concepts. It
means mapping is done from low-level concepts to high-level concepts. For example, in
computer science, there are different types of hierarchical systems. A document placed in a
folder in Windows, at a specific place in the tree structure, is the best example of a computer
hierarchical tree model. There are two types of hierarchy: top-down mapping and bottom-up
mapping.

Top-down mapping

Top-down mapping generally starts with the top with some general information and ends with
the bottom to the specialized information.

Bottom-up mapping

Bottom-up mapping generally starts with the bottom with some specialized information and ends
with the top to the generalized information.

Data discretization and binarization in data mining

Data discretization is a method of converting attribute values of continuous data into a finite set
of intervals with minimum data loss. In contrast, data binarization is used to transform the
continuous and discrete attributes into binary attributes.
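
A brief sketch contrasting the two operations; the bin edges, labels, and column names are assumptions:

    import pandas as pd

    df = pd.DataFrame({"age": [5, 17, 36, 74]})

    # Discretization: continuous ages -> a finite set of labelled intervals.
    df["age_group"] = pd.cut(df["age"], bins=[0, 10, 30, 60, 100],
                             labels=["Child", "Young", "Mature", "Old"])

    # Binarization: the discrete attribute -> one binary (0/1) column per category.
    binary = pd.get_dummies(df["age_group"], prefix="is")
    print(pd.concat([df, binary], axis=1))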

Variable Transformation
A variable transformation is a transformation applied to all the values of a variable. In other
words, for every object, the transformation is applied to the value of the variable for that
object. For instance, if only the magnitude of a variable matters, then the values of the variable
can be replaced by their absolute values.
There are two important types of variable transformations: simple functional transformations and
normalization.
Simple Functions
For this type of transformation, a simple mathematical function is applied to each value
independently. If x is a variable, examples of such transformations include x^k, log x, e^x, √x,
1/x, sin x, and |x|. In statistics, variable transformations, particularly √x, log x, and 1/x, are
often used to transform data that does not have a Gaussian (normal) distribution into data that
does. While this can be important, other considerations often take precedence in data mining.
Suppose the variable of interest is the number of data bytes in a session, and the number of bytes
ranges from 1 to 1 billion. This is a huge range, and it may be advantageous to compress it by
using a log10 transformation. In this case, sessions that transferred 10^8 and 10^9 bytes become
more similar to each other than sessions that transferred 10 and 1000 bytes (9 - 8 = 1 versus
3 - 1 = 2).
Variable transformations should be applied with caution, since they change the nature of the data.
There can be problems if the properties of the transformation are not fully appreciated. For
example, the transformation 1/x reduces the magnitude of values that are 1 or larger but increases
the magnitude of values between 0 and 1.
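
The log compression described above can be checked directly; a minimal sketch with NumPy:

    import numpy as np

    session_bytes = np.array([10, 1_000, 1e8, 1e9])
    log_bytes = np.log10(session_bytes)
    print(log_bytes)   # [1. 3. 8. 9.]

    # After the transformation, 10**8 and 10**9 differ by 1 unit,
    # while 10 and 1000 differ by 2 units, as described in the text.
    print(log_bytes[3] - log_bytes[2], log_bytes[1] - log_bytes[0])   # 1.0 2.0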

Data Transformation Process

The entire process for transforming data is known as ETL (Extract, Transform, and Load). Through
the ETL process, analysts can convert data to its desired format. Here are the steps involved in
the data transformation process (a minimal code sketch follows the list):

1. Data Discovery: During the first stage, analysts work to understand and identify data in
its source format. To do this, they will use data profiling tools. This step helps analysts
decide what they need to do to get data into its desired format.
2. Data Mapping: During this phase, analysts perform data mapping to determine how
individual fields are modified, mapped, filtered, joined, and aggregated. Data mapping is
essential to many data processes, and one misstep can lead to incorrect analysis and
ripple through your entire organization.
3. Data Extraction: During this phase, analysts extract the data from its original source.
These may include structured sources such as databases or streaming sources such as
customer log files from web applications.
4. Code Generation and Execution: Once the data has been extracted, analysts need to write
code to complete the transformation. Often, this code is generated with the help of data
transformation platforms or tools.
5. Review: After transforming the data, analysts need to check it to ensure everything has
been formatted correctly.
6. Sending: The final step involves sending the data to its target destination. The target
might be a data warehouse or a database that handles both structured and unstructured
data.
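
The ETL steps above can be sketched in a few lines of Python; the file names, field names, and transformations are hypothetical, so treat this as an illustration rather than a production pipeline:

    import csv

    def extract(path):
        """Extract: read raw rows from a CSV source (path is a hypothetical file)."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        """Transform: map and clean fields into the desired target format."""
        out = []
        for row in rows:
            out.append({
                "customer": row["name"].strip().title(),
                "revenue_usd": round(float(row["revenue"]), 2),
            })
        return out

    def load(rows, target):
        """Load: send the transformed rows to the target (here, another CSV file)."""
        with open(target, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["customer", "revenue_usd"])
            writer.writeheader()
            writer.writerows(rows)

    # load(transform(extract("sales_raw.csv")), "sales_clean.csv")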

Advantages of Data Transformation

Transforming data can help businesses in a variety of ways. Here are some of the essential
advantages of data transformation, such as:

o Better Organization: Transformed data is easier for both humans and computers to use.
o Improved Data Quality: There are many risks and costs associated with bad data.
Data transformation can help your organization eliminate quality issues such as missing
values and other inconsistencies.
o Perform Faster Queries: You can quickly and easily retrieve transformed data thanks to
it being stored and standardized in a source location.
o Better Data Management: Businesses are constantly generating data from more and
more sources. If there are inconsistencies in the metadata, it can be challenging to
organize and understand it. Data transformation refines your metadata, so it's easier to
organize and understand.
o More Use Out of Data: While businesses may be collecting data constantly, a lot of that
data sits around unanalyzed. Transformation makes it easier to get the most out of your
data by standardizing it and making it more usable.

Disadvantages of Data Transformation


o Data transformation can be expensive. The cost is dependent on the specific
infrastructure, software, and tools used to process data. Expenses may include licensing,
computing resources, and hiring necessary personnel.
o Data transformation processes can be resource-intensive. Performing transformations in
an on-premises data warehouse after loading or transforming data before feeding it into
applications can create a computational burden that slows down other operations. If you
use a cloud-based data warehouse, you can do the transformations after loading because
the platform can scale up to meet demand.
o Lack of expertise and carelessness can introduce problems during transformation. Data
analysts without appropriate subject matter expertise are less likely to notice incorrect
data because they are less familiar with the range of accurate and permissible values.
o Enterprises can perform transformations that don't suit their needs. A business might
change information to a specific format for one application only to then revert the
information to its prior format for a different application.

Ways of Data Transformation

o Scripting: Data transformation through scripting involves using Python or SQL to write
the code that extracts and transforms data. Python and SQL are scripting languages that
allow you to automate certain tasks in a program and to extract information from data
sets. Scripting languages generally require less code than traditional programming
languages, so this approach is less labor-intensive (see the brief sketch after this list).
o On-Premises ETL Tools: ETL tools remove the need to hand-script the data
transformation by automating the process. On-premises ETL tools are hosted on company
servers. While these tools can help save you time, using them often requires extensive
expertise and significant infrastructure costs.
o Cloud-Based ETL Tools: As the name suggests, cloud-based ETL tools are hosted in
the cloud. These tools are often the easiest for non-technical users to utilize. They allow
you to collect data from any cloud source and load it into your data warehouse. With
cloud-based ETL tools, you can decide how often you want to pull data from your
source, and you can monitor your usage.
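
As a brief scripting sketch, the snippet below uses Python's built-in sqlite3 module so that SQL performs the transformation (an aggregation) during extraction; the table, columns, and values are hypothetical:

    import sqlite3

    # A hypothetical in-memory database stands in for the real source system.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("alice", 120.0), ("bob", 80.0), ("alice", 45.5)])

    # SQL performs the transformation (aggregation) while the data is extracted.
    rows = conn.execute(
        "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
    ).fetchall()

    for customer, total in rows:
        print(customer, total)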

KDD in data mining:


Knowledge Discovery in Databases (KDD) is the process of automatically discovering previously
unknown patterns, rules, and other regularities implicitly present in large volumes of data. Data
Mining (DM) denotes the discovery of patterns in a data set that has been prepared in a specific
way.

The essential steps of KDD (Knowledge Discovery in Databases) are:


 1 – Understanding the data set
 2 – Data selection
 3 – Cleaning and pre-processing
 4 – Data transformation
 5 – Selecting the appropriate data mining task
 6 – Choice of data mining algorithms
 7 – Application of data mining algorithms
 8 – Evaluation

Data Mining – Knowledge Discovery in Databases(KDD).
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of
useful, previously unknown, and potentially valuable information from large datasets.
The KDD process in data mining typically involves the following steps:
1. Selection: Select a relevant subset of the data for analysis.
2. Pre-processing: Clean and transform the data to make it ready for analysis. This
may include tasks such as data normalization, missing value handling, and data
integration.
3. Transformation: Transform the data into a format suitable for data mining, such as
a matrix or a graph.
4. Data Mining: Apply data mining techniques and algorithms to the data to extract
useful information and insights. This may include tasks such as clustering,
classification, association rule mining, and anomaly detection.
5. Interpretation: Interpret the results and extract knowledge from the data. This may
include tasks such as visualizing the results, evaluating the quality of the discovered
patterns and identifying relationships and associations among the data.
6. Evaluation: Evaluate the results to ensure that the extracted knowledge is useful,
accurate, and meaningful.
7. Deployment: Use the discovered knowledge to solve the business problem and
make decisions.
The KDD process is iterative: the above steps are typically repeated several times to refine the
knowledge extracted from the data. A compressed code sketch of the core steps is given below.
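
To make the flow concrete, here is a compressed, hedged sketch of steps 1 to 6 on a small synthetic data set (scikit-learn is assumed to be available; the data and the choice of a decision tree are illustrative):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)

    # 1-2. Selection and pre-processing: pick two numeric attributes and scale them.
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)        # synthetic target
    X = StandardScaler().fit_transform(X)

    # 3. Transformation: the data is already a numeric matrix suitable for mining.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 4. Data mining: fit a classification model.
    model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

    # 5-6. Interpretation and evaluation: check whether the discovered pattern is useful.
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))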
Steps Involved in KDD Process:

1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the
collection (a brief code sketch of cleaning and transformation appears after this list).
 Cleaning in case of missing values.
 Cleaning noisy data, where noise is a random or variance error.
 Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from multiple
sources into a common store (a data warehouse).
 Data integration using data migration tools.
 Data integration using data synchronization tools.
 Data integration using the ETL (Extract-Transform-Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to the analysis
is decided and retrieved from the data collection.
 Data selection using Neural network.
 Data selection using Decision Trees.
 Data selection using Naive bayes.
 Data selection using Clustering, Regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming data
into the appropriate form required by the mining procedure.
Data transformation is a two-step process:
 Data Mapping: Assigning elements from the source base to the destination to capture
transformations.
 Code Generation: Creation of the actual transformation program.
5. Data Mining: Data mining is defined as the application of intelligent techniques to extract
potentially useful patterns.
 Transforms task-relevant data into patterns.
 Decides the purpose of the model, using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying interesting patterns
representing knowledge, based on given interestingness measures.
 Find the interestingness score of each pattern.
 Use summarization and visualization to make the data understandable by the user.
7. Knowledge Representation: Knowledge representation is defined as the technique that utilizes
visualization tools to represent data mining results.
 Generate reports.
 Generate tables.
 Generate discriminant rules, classification rules, characterization rules, etc.
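
The cleaning and transformation steps above can be sketched briefly with pandas and scikit-learn; the column name, values, and the choice of mean imputation with min-max scaling are assumptions:

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MinMaxScaler

    df = pd.DataFrame({"income": [42_000, None, 58_000, 61_000, None, 75_000]})

    # Data cleaning: fill in missing values (mean imputation here; other strategies exist).
    df["income"] = SimpleImputer(strategy="mean").fit_transform(df[["income"]]).ravel()

    # Data transformation: min-max normalization into [0, 1] for the mining step.
    df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
    print(df)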

ADVANTAGES AND DISADVANTAGES OF KDD:
Advantages of KDD:

1. Improves decision-making: KDD provides valuable insights and knowledge that can help
organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the
data ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their
customers’ needs and preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and
anomalies in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can forecast future
trends and patterns.

Disadvantages of KDD:

1. Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing
large amounts of data, which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge
to implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or
discrimination, if the data or models are not properly understood or used.
4. Data Quality: The KDD process depends heavily on the quality of the data; if the data is not
accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in
hardware, software, and personnel.
6. Overfitting: KDD process can lead to overfitting, which is a common problem in machine
learning where a model learns the detail and noise in the training data to the extent that it
negatively impacts the performance of the model on new unseen data.

Difference Between KDD and Data Mining

Definition
  KDD: KDD refers to a process of identifying valid, novel, potentially useful, and ultimately
  understandable patterns and relationships in data.
  Data Mining: Data Mining refers to a process of extracting useful and valuable information or
  patterns from large data sets.

Objective
  KDD: To find useful knowledge from data.
  Data Mining: To extract useful information from data.

Techniques Used
  KDD: Data cleaning, data integration, data selection, data transformation, data mining, pattern
  evaluation, and knowledge representation and visualization.
  Data Mining: Association rules, classification, clustering, regression, decision trees, neural
  networks, and dimensionality reduction.

Output
  KDD: Structured information, such as rules and models, that can be used to make decisions or
  predictions.
  Data Mining: Patterns, associations, or insights that can be used to improve decision-making or
  understanding.

Focus
  KDD: Focus is on the discovery of useful knowledge, rather than simply finding patterns in data.
  Data Mining: Focus is on the discovery of patterns or relationships in data.

Role of Domain Expertise
  KDD: Domain expertise is important in KDD, as it helps in defining the goals of the process,
  choosing appropriate data, and interpreting the results.
  Data Mining: Domain expertise is less critical in data mining, as the algorithms are designed to
  identify patterns without relying on prior knowledge.

