Unit II Data Mining


- Prof. S. S. Lamkane

What is Data Mining?

Data mining is the process of extracting information from huge sets of data to identify
patterns, trends, and other useful data that allow a business to make data-driven decisions.

In other words, data mining is the process of investigating hidden patterns of information
from various perspectives and categorizing them into useful data. This data is collected and
assembled in particular areas such as data warehouses, where efficient analysis and data
mining algorithms support decision making and other data requirements, eventually cutting
costs and generating revenue.

Data mining is the act of automatically searching large stores of information to find
trends and patterns that go beyond simple analysis procedures. Data mining uses complex
mathematical algorithms to segment the data and evaluate the probability of future
events. Data mining is also called Knowledge Discovery in Databases (KDD).

Data Mining is a process used by organizations to extract specific data from huge databases
to solve business problems. It primarily turns raw data into useful information.

Data mining is similar to data science in that it is carried out by a person, in a specific
situation, on a particular data set, with an objective. The process includes various types of
services such as text mining, web mining, audio and video mining, pictorial data mining, and
social media mining. It is done through software that may be simple or highly specific. By
outsourcing data mining, the work can be done faster and with lower operating costs.
Specialized firms can also use new technologies to collect data that is impossible to locate
manually. There are tonnes of information available on various platforms, but very little
knowledge is accessible. The biggest challenge is to analyze the data and extract important
information that can be used to solve a problem or to develop the company. There are many
powerful instruments and techniques available to mine data and gain better insight from it.

Data Mining Issues:-


Data mining is not an easy task, as the algorithms used can get very complex & the data is
not always available in one place. It needs to be integrated from various heterogeneous data
sources. These factors create some issues, which fall into three groups:

1 Mining methodology & user interaction

2 Performance issues

3 Diverse data types issues.


1 Mining Methodology & User Interaction Issues:-
• Mining Different Kinds of Knowledge in Databases- Different users may be interested in
different kinds of knowledge. Therefore it is necessary for data mining to cover a broad
range of knowledge discovery tasks.

• Interactive Mining of Knowledge at Multiple Levels of Abstraction- The data mining
process needs to be interactive so that users can focus the search for patterns,
providing & refining data mining requests based on the returned results.

• Presentation & Visualization of Data Mining Results- Once the patterns are discovered,
they need to be expressed in high-level languages & visual representations.

• Handling Noisy or Incomplete Data- Data cleaning methods are required to handle
noise & incomplete objects while mining the data regularities. If data cleaning methods
are not applied, the accuracy of the discovered patterns will be poor.

2 Performance Issues:-

• Efficiency & scalability of data mining algorithms.

• Parallel, distributed & incremental mining algorithms.

3 Diverse Data Type Issues:-

• Handling of relational data & complex type of data.

• Mining information from heterogeneous databases & global information system.


Data Mining Applications

Data mining is primarily used by organizations with intense consumer demands, such as
retail, communication, financial, and marketing companies, to determine prices, consumer
preferences, product positioning, and the impact on sales, customer satisfaction, and
corporate profits. Data mining enables a retailer to use point-of-sale records of customer
purchases to develop products and promotions that help the organization attract customers.

The following are areas where data mining is widely used:

Data Mining in Healthcare:

Data mining in healthcare has excellent potential to improve the health system. It uses data
and analytics for better insights and to identify best practices that will enhance health care
services and reduce costs. Analysts use data mining approaches such as machine learning,
multi-dimensional databases, data visualization, soft computing, and statistics. Data mining
can be used to forecast the number of patients in each category. Such procedures help ensure
that patients get intensive care at the right place and at the right time. Data mining also
enables healthcare insurers to recognize fraud and abuse.

Data Mining in Market Basket Analysis:

Market basket analysis is a modeling method based on the hypothesis that if you buy a
specific group of products, you are more likely to buy another group of products. This
technique may enable the retailer to understand the purchase behavior of a buyer. This data
may assist the retailer in understanding the requirements of the buyer and altering the
store's layout accordingly. Results can also be compared analytically between different
stores or between customers in different demographic groups.

Data mining in Education:

Educational data mining (EDM) is a newly emerging field concerned with developing techniques
that discover knowledge from the data generated in educational environments. The objectives
of EDM are recognized as predicting students' future learning behavior, studying the
impact of educational support, and advancing learning science. An institution can use
data mining to make precise decisions and also to predict students' results. With
the results, the institution can concentrate on what to teach and how to teach.

Data Mining in Manufacturing Engineering:

Knowledge is the best asset a manufacturing company possesses. Data mining tools can
be beneficial for finding patterns in a complex manufacturing process. Data mining can be used
in system-level design to obtain the relationships between product architecture, product
portfolio, and the data needs of customers. It can also be used to forecast the product
development period, cost, and expectations, among other tasks.

Data Mining in CRM (Customer Relationship Management):

Customer Relationship Management (CRM) is all about acquiring and retaining customers,
enhancing customer loyalty, and implementing customer-oriented strategies. To maintain a
good relationship with its customers, a business organization needs to collect data and
analyze it. With data mining technologies, the collected data can be used for
analytics.

Data Mining in Fraud detection:

Billions of dollars are lost to fraud. Traditional methods of fraud detection are
time consuming and complicated. Data mining finds meaningful patterns and turns data into
information. An ideal fraud detection system should protect the data of all
the users. Supervised methods start from a collection of sample records that are
classified as fraudulent or non-fraudulent. A model is constructed using this data, and
the model is then used to identify whether a new record is fraudulent or not.
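
The supervised approach described above can be sketched in a few lines. This is only an
illustration: the sample records, their two features (amount and number of transactions in
the last hour), and the choice of a nearest-centroid classifier are all assumptions made for
the example, not a method prescribed by these notes.

```python
from math import dist

# Hypothetical labeled sample records: (amount, transactions in the last hour).
labeled_records = [
    ((20.0, 1.0), "non-fraudulent"),
    ((35.0, 2.0), "non-fraudulent"),
    ((15.0, 1.0), "non-fraudulent"),
    ((900.0, 9.0), "fraudulent"),
    ((1200.0, 12.0), "fraudulent"),
]


def train_centroids(records):
    """Build the model: one mean point (centroid) per class label."""
    sums, counts = {}, {}
    for features, label in records:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def classify(model, features):
    """Assign the label whose centroid is closest to the new record."""
    return min(model, key=lambda label: dist(model[label], features))


model = train_centroids(labeled_records)
print(classify(model, (1000.0, 10.0)))
print(classify(model, (25.0, 1.0)))
```

A real system would use many more features and a stronger model, but the two phases are the
same: construct a model from labeled records, then apply it to new ones.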

Data Mining in Lie Detection:

Apprehending a criminal is comparatively easy, but getting the truth out of them is a very
challenging task. Law enforcement may use data mining techniques to investigate offenses,
monitor suspected terrorist communications, etc. This includes text mining, which seeks
meaningful patterns in data that is usually unstructured text. The information
collected from previous investigations is compared, and a model for lie detection is
constructed.

Data Mining in Financial Banking:

The digitalization of the banking system generates an enormous amount of
data with every new transaction. Data mining techniques can help bankers solve
business-related problems in banking and finance by identifying trends, causalities, and
correlations in business information and market costs that are not instantly evident to
managers or executives because the data volume is too large or is produced too rapidly
for experts to screen. Managers may use this data for better targeting, acquiring,
retaining, segmenting, and maintaining profitable customers.

Data Mining Versus Knowledge Discovery in Databases (KDD):-

Data mining is sometimes referred to as knowledge discovery in databases, or KDD, which
means finding knowledge in large amounts of data. Strictly speaking, data mining is one step
of the KDD process.

The KDD process consists of the following steps:

• Data Cleaning: - This step removes noise & inconsistent data.

• Data Integration: - In this step multiple data sources are combined together.

• Data Selection: - In this step data relevant to the analysis task are retrieved from the
database.

• Data Transformation: - In this step data are transformed from one form into another form
that is appropriate for mining.

• Data Mining: - In this step information is extracted from a large amount of data.

• Pattern Evaluation: - This step identifies the truly interesting patterns representing
knowledge, based on interestingness measures.

• Knowledge Presentation: - In this step visualization & knowledge representation
techniques are used to present the mined knowledge to users.
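
The seven steps above can be read as a pipeline in which each step feeds the next. The
sketch below is a toy illustration under stated assumptions: the two sources, the record
fields, and every function name are hypothetical, and each step is reduced to the simplest
operation that fits its description.

```python
from collections import Counter

# Two hypothetical data sources of sales records (None marks a missing value).
store_a = [{"customer": "c1", "item": "Milk "}, {"customer": "c2", "item": None}]
store_b = [
    {"customer": "c3", "item": "milk"},
    {"customer": "c4", "item": "bread"},
    {"customer": "c5", "item": "MILK"},
]


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine multiple sources into one list."""
    return [r for source in sources for r in source]


def select(records, field):
    """Data selection: keep only the attribute relevant to the analysis."""
    return [r[field] for r in records]


def transform(values):
    """Data transformation: bring every value into one consistent form."""
    return [v.strip().lower() for v in values]


def mine(values):
    """Data mining: extract a simple pattern, here item frequencies."""
    return Counter(values)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that pass an interestingness threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}


def present(patterns):
    """Knowledge presentation: show the result to the user."""
    for item, n in sorted(patterns.items()):
        print(f"{item}: sold {n} times")


records = integrate(clean(store_a), clean(store_b))
knowledge = evaluate(mine(transform(select(records, "item"))), min_count=2)
present(knowledge)
```

In practice each step is far more involved, and the process is usually iterative rather than
a single pass.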

Data Preprocessing:-

The following steps are involved in data preprocessing:

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation

1. Data Cleaning:-

Data cleaning methods clean the data by filling in missing values, smoothing noisy data,
identifying or removing outliers, & resolving inconsistencies.

If users believe the data are dirty, they are unlikely to trust the results of any data
mining that has been applied. Furthermore, dirty data can confuse the mining
procedure, resulting in unreliable output. Data cleaning provides procedures for dealing with
incomplete or noisy data.
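
As a rough illustration of two of these procedures, the sketch below removes outliers and
fills in missing values. The interquartile-range rule and the mean as fill value are
assumptions chosen for the example; the notes do not prescribe a particular technique, and
the sample ages are made up.

```python
from statistics import mean, quantiles


def clean(values):
    """Remove IQR outliers, then fill missing values with the mean of the rest."""
    known = [v for v in values if v is not None]
    q1, _, q3 = quantiles(known, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    inliers = [v for v in known if low <= v <= high]
    fill = mean(inliers)
    cleaned = []
    for v in values:
        if v is None:
            cleaned.append(fill)   # missing value: use the mean of the inliers
        elif low <= v <= high:
            cleaned.append(v)      # ordinary value: keep as is
        # values outside [low, high] are treated as outliers and dropped
    return cleaned


ages = [22, 25, None, 24, 23, 250, 26]  # hypothetical data: one gap, one outlier
print(clean(ages))
```

Whether an outlier should be dropped, capped, or kept depends on the application; dropping
is used here only because it is the simplest to show.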

2. Data Integration:-

Data integration is the process of combining data from multiple sources. Careful
integration can help reduce & avoid redundancies & inconsistencies in the resulting data set.
Data integration can help improve the accuracy & speed of the subsequent data mining
process.

There are a number of issues to consider during data integration:

1) Entity Identification problem

2) Redundancy & Correlation Analysis

3) Tuple Duplication

4) Data Value Conflict Detection & Resolution
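
Three of these issues can be shown on a toy example: entity identification (the same
attribute stored under different names), tuple duplication, and data value conflicts.
The sources, attribute names, and the function below are hypothetical; redundancy &
correlation analysis is not covered by this sketch.

```python
# Hypothetical sources: the same customers stored under different attribute names.
source_a = [{"customer_id": 1, "city": "Pune"}, {"customer_id": 2, "city": "Mumbai"}]
source_b = [
    {"cust_number": 2, "city": "Mumbai"},  # duplicate of a tuple in source_a
    {"cust_number": 3, "city": "Nagpur"},
    {"cust_number": 1, "city": "Delhi"},   # conflicts with source_a
]


def integrate(a_rows, b_rows):
    """Merge two sources, dropping duplicate tuples and reporting value conflicts."""
    merged, conflicts = {}, []
    # Entity identification: map both key names onto one shared key.
    unified = [(r["customer_id"], r["city"]) for r in a_rows]
    unified += [(r["cust_number"], r["city"]) for r in b_rows]
    for key, city in unified:
        if key not in merged:
            merged[key] = city
        elif merged[key] != city:
            # Data value conflict: same entity, different values. Only detected here.
            conflicts.append((key, merged[key], city))
        # Otherwise the tuple is an exact duplicate and is skipped.
    return merged, conflicts


merged, conflicts = integrate(source_a, source_b)
print(merged)
print(conflicts)
```

The sketch only detects conflicts; resolving them (for example by trusting one source over
another) is a policy decision that depends on the data.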

3. Data Reduction:-

If the data warehouse contains a huge amount of data, complex data analysis & mining on it
can take a long time, making such analysis impractical or infeasible.

Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume, yet closely maintains the integrity of the original data. That
is, mining on the reduced data set should be more efficient yet produce the same (or almost
the same) analytical results. The following are the techniques for data reduction:

1) Dimensionality Reduction

2) Numerosity Reduction

3) Data Compression
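
As one possible illustration of numerosity reduction, the sketch below replaces a list of
raw values with an equal-width histogram, so only one count per bucket is stored. The
function name, the bucket width, and the price list are assumptions for the example, and
the code assumes non-negative integer values.

```python
from collections import Counter


def equal_width_histogram(values, width):
    """Numerosity reduction: store one count per bucket instead of every value."""
    counts = Counter((v // width) * width for v in values)
    return {(start, start + width - 1): n for start, n in sorted(counts.items())}


# Hypothetical item prices (integers).
prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 18, 20, 21, 25, 28, 30]
histogram = equal_width_histogram(prices, width=10)

for (low, high), count in histogram.items():
    print(f"{low}-{high}: {count}")
```

Nineteen values are reduced to four bucket counts. The exact values are lost, which is why
the reduced data gives almost the same, rather than identical, analytical results.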

4. Data Transformation:-

In this preprocessing step, the data are transformed or consolidated so that the resulting
mining process may be more efficient, & the patterns found may be easier to understand.
Strategies for data transformation include the following:

1) Smoothing- Works to remove noise from the data. Techniques include binning,
regression, & clustering.

2) Attribute Construction- New attributes are constructed from the given set of
attributes & added to help the mining process.

3) Aggregation- Summary or aggregation operations are applied to the data.

4) Normalization- The attribute data are scaled so as to fall within a smaller range, such
as -1.0 to 1.0 or 0.0 to 1.0.
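
Two of these strategies are easy to show in code. The sketch below assumes min-max scaling
as the normalization method and equal-size bins with bin means as the smoothing method;
both are common choices, but the notes do not name a specific one. The function names and
sample values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into [new_min, new_max].

    v' = new_min + (v - min) * (new_max - new_min) / (max - min)
    """
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        return [new_min for _ in values]  # all values equal: nothing to scale
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-size bins, and replace each value
    by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_values = ordered[i:i + bin_size]
        smoothed.extend([sum(bin_values) / len(bin_values)] * len(bin_values))
    return smoothed


scores = [10, 20, 30, 50]
print(min_max_normalize(scores))             # range 0.0 to 1.0
print(min_max_normalize(scores, -1.0, 1.0))  # range -1.0 to 1.0

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, bin_size=3))
```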

Fig. Forms of Data Preprocessing

Data Mining tasks:-

Data mining deals with the kind of patterns that can be mined. On the basis of the kind
of data to be mined, there are two categories of functions involved in data mining:

• Descriptive

• Classification & Prediction

Descriptive Functions:-

The descriptive function deals with the general properties of data in the database. Here
is the list of descriptive functions:

1) Class/ Concept Description

2) Mining of Frequent Patterns

3) Mining of Associations

4) Mining of Correlations

5) Mining of Clusters

1) Class/ Concept Description:-

Class/ concept description refers to data being associated with classes or concepts.

For example, in a company the classes of items for sale include computers & printers, &
concepts of customers include big spenders & budget spenders. Such descriptions of a class or
a concept are called class/ concept descriptions. These descriptions can be derived in the
following two ways:

• Data Characterization - This refers to summarizing the data of the class under study. The class
under study is called the target class.

• Data Discrimination - This refers to comparing the target class with one or more
predefined contrasting groups or classes.

2) Mining of Frequent Patterns:-

Frequent patterns are those patterns that occur frequently in transactional data. Here is
the list of kinds of frequent patterns:

• Frequent Item Set - A set of items that frequently appear together, e.g., milk &
bread.

• Frequent Subsequence - A sequence of events that occurs frequently, such as the purchase
of a camera followed by the purchase of a memory card.

• Frequent Sub Structure - Sub structure refers to different structural forms, such as graphs,
trees or lattices, which may be combined with item sets or subsequences.
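
To make the idea of a frequent item set concrete, the sketch below counts how often each
pair of items appears together and keeps the pairs whose support (the fraction of
transactions containing the pair) reaches a threshold. The transactions, the threshold of
0.6, and the brute-force counting are assumptions for illustration; practical algorithms
such as Apriori prune the search instead of enumerating every combination.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "biscuits"},
    {"milk", "bread", "biscuits"},
    {"milk", "eggs"},
]


def frequent_itemsets(baskets, size, min_support):
    """Return item sets of the given size whose support is at least min_support."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes each item set appear in one canonical order.
        for combo in combinations(sorted(basket), size):
            counts[combo] += 1
    total = len(baskets)
    return {combo: n / total for combo, n in counts.items() if n / total >= min_support}


frequent_pairs = frequent_itemsets(transactions, size=2, min_support=0.6)
print(frequent_pairs)
```

Here only bread & milk qualifies, appearing together in 3 of the 5 baskets.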

3) Mining of Association:-

Associations are used in retail sales to identify items that are frequently purchased
together. Association mining is the process of uncovering relationships among data &
determining association rules.

For example, a retailer generates association rules showing that milk is sold with bread
70% of the time & biscuits are sold with bread only 30% of the time.
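
Percentages like these are usually reported as the confidence of a rule: of the
transactions that contain the antecedent, the fraction that also contain the consequent.
The sketch below assumes that reading of the example and uses made-up transactions
constructed to reproduce the 70% and 30% figures.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain
    the consequent. Both arguments are sets of items."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    both = [t for t in with_antecedent if consequent <= t]
    return len(both) / len(with_antecedent)


# Hypothetical data: ten bread purchases, seven with milk and three with biscuits.
transactions = [{"bread", "milk"}] * 7 + [{"bread", "biscuits"}] * 3

milk_rule = confidence(transactions, {"bread"}, {"milk"})
biscuit_rule = confidence(transactions, {"bread"}, {"biscuits"})
print(f"bread -> milk: {milk_rule:.0%}")
print(f"bread -> biscuits: {biscuit_rule:.0%}")
```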

4) Mining of Correlation:-

It is a kind of additional analysis performed to uncover interesting statistical correlations
between associated attribute-value pairs or between two item sets, to determine whether they
have a positive, negative or no effect on each other.
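
One common measure for this is lift, assumed here as an example since the notes do not
name a specific measure: lift(A, B) = P(A and B) / (P(A) * P(B)). A value above 1 suggests
a positive correlation, below 1 a negative one, and exactly 1 no correlation. The baskets
and function names below are hypothetical.

```python
def lift(transactions, a, b):
    """lift(A, B) = P(A and B) / (P(A) * P(B)) for two single items."""
    n = len(transactions)
    p_a = sum(1 for t in transactions if a in t) / n
    p_b = sum(1 for t in transactions if b in t) / n
    p_ab = sum(1 for t in transactions if a in t and b in t) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("both items must occur in at least one transaction")
    return p_ab / (p_a * p_b)


def describe(value, tolerance=1e-9):
    """Translate a lift value into the three cases named in the text."""
    if abs(value - 1.0) <= tolerance:
        return "no correlation"
    return "positive" if value > 1.0 else "negative"


# Hypothetical baskets.
baskets = [{"milk", "bread", "eggs"}, {"milk", "bread"}, {"tea", "eggs"}, {"tea"}]

for pair in [("milk", "bread"), ("milk", "tea"), ("milk", "eggs")]:
    value = lift(baskets, *pair)
    print(pair, value, describe(value))
```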

5) Mining of Cluster:-

A cluster is a group of similar kinds of objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from the objects in
other clusters.
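
A minimal sketch of cluster analysis, assuming k-means on one-dimensional data as the
technique (the notes do not name one). The spending figures and starting centroids are
made up; the two resulting groups correspond loosely to the budget spenders & big spenders
mentioned earlier.

```python
def k_means_1d(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid to the
    mean of its cluster. Repeat for a fixed number of iterations."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical monthly spend per customer.
spend = [10, 12, 11, 95, 100, 98]
clusters, centroids = k_means_1d(spend, centroids=[0.0, 50.0])
print(clusters)
print(centroids)
```

The result depends on the starting centroids and on the number of clusters chosen, both of
which are inputs the analyst has to supply.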
