Unit II Data Mining
- Prof. S. S. Lamkane
Data Mining is the process of extracting information from huge sets of data to identify
patterns, trends, and useful data that allow a business to make data-driven decisions.
In other words, Data Mining is the process of investigating hidden patterns of information
from various perspectives and categorizing them into useful data. This data is collected and
assembled in particular areas such as data warehouses, where efficient analysis and data
mining algorithms help decision making and meet other data requirements, eventually cutting
costs and generating revenue.
Data mining is the act of automatically searching large stores of information to find
trends and patterns that go beyond simple analysis procedures. Data mining uses
complex mathematical algorithms to segment the data and evaluate the probability of future
events. Data Mining is also called Knowledge Discovery from Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases
to solve business problems. It primarily turns raw data into useful information.
Data Mining is similar to Data Science carried out by a person, in a specific situation, on a
particular data set, with an objective. This process includes various types of services such as
text mining, web mining, audio and video mining, pictorial data mining, and social media
mining. It is done through software that is simple or highly specific. By outsourcing data
mining, all the work can be done faster with low operation costs. Specialized firms can also
use new technologies to collect data that is impossible to locate manually. There are tonnes
of information available on various platforms, but very little knowledge is accessible. The
biggest challenge is to analyze the data to extract important information that can be used
to solve a problem or for company development. There are many powerful instruments and
techniques available to mine data and find better insight from it.
2 Performance issues
• Presentation & Visualization of Data Mining Results- Once patterns are discovered,
they need to be expressed in high-level languages & visual representations.
• Handling Noisy or Incomplete Data- Data cleaning methods are required to handle
noise & incomplete objects while mining the data regularities. If data cleaning methods
are not applied, the accuracy of the discovered patterns will be poor.
Data Mining is primarily used by organizations with intense consumer demands, such as
retail, communication, financial, and marketing companies, to determine prices, consumer
preferences, product positioning, and the impact on sales, customer satisfaction, and
corporate profits. Data mining enables a retailer to use point-of-sale records of customer
purchases to develop products and promotions that help the organization attract customers.
The following are areas where data mining is widely used:
Data mining in healthcare has excellent potential to improve the health system. It uses data
and analytics for better insights and to identify best practices that will enhance health care
services and reduce costs. Analysts use data mining approaches such as Machine learning,
Multi-dimensional database, Data visualization, Soft computing, and statistics. Data Mining
can be used to forecast the number of patients in each category. The procedures ensure that
the patients
get intensive care at the right place and at the right time. Data mining also enables
healthcare insurers to recognize fraud and abuse.
Market basket analysis is a modeling method based on a hypothesis: if you buy a specific
group of products, then you are more likely to buy another group of products. This
technique may enable the retailer to understand the purchase behavior of a buyer. This data
may assist the retailer in understanding the requirements of the buyer and altering the
store's layout accordingly. Analytical comparisons of results can also be made between
various stores and between customers in different demographic groups.
Educational data mining (EDM) is a newly emerging field, concerned with developing
techniques that explore knowledge from the data generated in educational environments. EDM
objectives are recognized as predicting students' future learning behavior, studying the
impact of educational support, and promoting learning science. An organization can use
data mining to make precise decisions and also to predict the results of the student. With
the results, the institution can concentrate on what to teach and how to teach.
Knowledge is the best asset possessed by a manufacturing company. Data mining tools can
be beneficial to find patterns in a complex manufacturing process. Data mining can be used
in system-level designing to obtain the relationships between product architecture, product
portfolio, and data needs of the customers. It can also be used to forecast the product
development period, cost, and expectations among the other tasks.
Customer Relationship Management (CRM) is all about obtaining and retaining customers,
as well as enhancing customer loyalty and implementing customer-oriented strategies. To
maintain a good relationship with the customer, a business organization needs to collect
data and analyze it. With data mining technologies, the collected data can be used for
analytics.
Billions of dollars are lost to fraud. Traditional methods of fraud detection are
time consuming and complicated. Data mining provides meaningful patterns and
turns data into information. An ideal fraud detection system should protect the data of all
the users. Supervised methods start from a collection of sample records, and these records
are classified as fraudulent or non-fraudulent. A model is constructed using this data, and
the model is then used to identify whether a new record is fraudulent or not.
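The supervised approach described above can be sketched in a few lines. This is only an illustration, assuming each record has two made-up numeric features (transaction amount in thousands, transactions per day) and using a simple nearest-centroid classifier; the function names, features, and sample values are hypothetical and not taken from any real fraud detection system.

```python
import math

def train_centroids(records, labels):
    """Build the model: the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for features, label in zip(records, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(model, features):
    """Assign a new record to the class whose centroid is nearest."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Labeled sample records (illustrative values only).
records = [[0.2, 1], [0.5, 2], [0.3, 1], [9.0, 12], [7.5, 15]]
labels = ["non-fraudulent", "non-fraudulent", "non-fraudulent",
          "fraudulent", "fraudulent"]

model = train_centroids(records, labels)
suspicious = classify(model, [8.0, 10.0])
ordinary = classify(model, [0.4, 2.0])
print(suspicious, ordinary)
```

Real systems would use many more features and a stronger model, but the two stages are the same: learn from labeled records, then classify new ones.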
Apprehending a criminal is often easier than bringing out the truth from them, which is a
very challenging task. Law enforcement may use data mining techniques to investigate
offenses, monitor suspected terrorist communications, etc. This technique also includes
text mining, which seeks meaningful patterns in data that is usually unstructured text. The
information collected from previous investigations is compared, and a model for lie
detection is constructed.
• Data Cleaning: - This step removes noise & inconsistent data.
• Data Integration: - In this step multiple data sources are combined together.
• Data Selection: - In this step data relevant to the analysis task are retrieved from the
database.
• Data Transformation: - In this step data are transformed from one form to another
form.
• Data Mining: - In this step information is extracted from a large amount of data.
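The steps listed above can be chained into a small pipeline. The sketch below is only an illustration under simple assumptions: two hypothetical store data sets held as lists of dictionaries, cleaning that drops rows with missing values, and a "mining" step that merely totals the quantity sold per item. All names and values are made up.

```python
from collections import Counter

# Two hypothetical data sources (illustrative values only).
store_a = [{"item": "milk", "qty": "2", "cashier": "A1"},
           {"item": None, "qty": "1", "cashier": "A2"}]
store_b = [{"item": "bread", "qty": "1", "cashier": "B1"},
           {"item": "milk", "qty": "3", "cashier": "B1"}]

def clean(rows):
    """Data cleaning: drop rows that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [row for source in sources for row in source]

def select(rows, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in rows]

def transform(rows):
    """Data transformation: convert quantities from text to integers."""
    return [{**r, "qty": int(r["qty"])} for r in rows]

def mine(rows):
    """Data mining (toy version): total quantity sold per item."""
    totals = Counter()
    for r in rows:
        totals[r["item"]] += r["qty"]
    return totals

rows = transform(select(integrate(clean(store_a), clean(store_b)), ["item", "qty"]))
totals = mine(rows)
print(totals)
```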
Data Preprocessing:-
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
1. Data Cleaning:-
Data cleaning method cleans the data by filling in missing values, smoothing noisy data,
identifying or removing outliers, & resolving inconsistencies.
If users believe the data are dirty, they are unlikely to trust the results of any data
mining that has been applied. Furthermore, dirty data can cause confusion for the mining
procedure, resulting in unreliable output. Data cleaning has some procedures for dealing with
incomplete or noisy data.
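Two of the cleaning operations mentioned above, filling in missing values and smoothing noisy data, can be sketched as follows. This is a minimal illustration, assuming missing values are marked with `None`, that the attribute mean is an acceptable fill value, and that smoothing is done by equal-frequency bin means; the function names and sample prices are chosen for the example.

```python
def fill_missing_with_mean(values):
    """Replace each missing value (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value in a bin by that bin's mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

filled = fill_missing_with_mean([10, None, 20])
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)
print(filled)
print(smoothed)
```

With bins of three values, the sorted prices fall into the bins (4, 8, 15), (21, 21, 24), and (25, 28, 34), so each value is replaced by 9, 22, or 29 respectively.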
2. Data Integration:-
Data integration is the process of combining data from multiple sources. Careful
integration can help reduce & avoid redundancies & inconsistencies in the resulting data set.
Data integration can help improve the accuracy & speed of the subsequent data mining
process.
3) Tuple Duplication
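Tuple duplication can be illustrated with a short sketch. It assumes two hypothetical sources that share a `customer_id` attribute, and that a tuple counts as a duplicate when its identifier has already been seen; real integration usually needs more careful matching rules.

```python
def integrate_without_duplicates(*sources, key):
    """Merge several sources, keeping only the first tuple seen for each key value."""
    seen = set()
    merged = []
    for source in sources:
        for row in source:
            if row[key] not in seen:
                seen.add(row[key])
                merged.append(row)
    return merged

# Hypothetical sources (illustrative values only).
crm = [{"customer_id": 1, "city": "Pune"}, {"customer_id": 2, "city": "Mumbai"}]
billing = [{"customer_id": 2, "city": "Mumbai"}, {"customer_id": 3, "city": "Nagpur"}]

merged = integrate_without_duplicates(crm, billing, key="customer_id")
print(merged)
```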
3. Data Reduction:-
If the data warehouse contains a huge amount of data, complex data analysis & mining
can take a long time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume, yet closely maintains the integrity of the original data. That
is, mining on the reduced data set should be more efficient yet produce the same (or almost
the same) analytical results. The following are the techniques for data reduction:
1) Dimensionality Reduction
2) Numerosity Reduction
3) Data Compression
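As one small illustration of numerosity reduction, the raw values of an attribute can be replaced by a histogram, that is, by a few (bucket, count) pairs. The sketch below assumes integer prices and equal-width buckets; the function name, bucket width, and sample values are chosen only for the example.

```python
from collections import Counter

def equal_width_histogram(values, width):
    """Reduce a list of integers to sorted (bucket_start, count) pairs."""
    buckets = Counter((v // width) * width for v in values)
    return sorted(buckets.items())

prices = [1, 1, 5, 5, 5, 8, 10, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30]
histogram = equal_width_histogram(prices, width=10)
print(histogram)
```

Here 17 raw values are reduced to 4 pairs, while the overall shape of the distribution is still visible.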
4. Data Transformation:-
In this preprocessing step, the data are transformed or consolidated so that the resulting
mining process may be more efficient, & the patterns found may be easier to understand.
Strategies for data transformation include the following
1) Smoothing- Works to remove noise from the data. Techniques include binning,
regression, & clustering.
2) Attribute Construction- New attributes are constructed & added from the given set of
attributes to help the mining process.
4) Normalization- The attribute data are scaled so as to fall within a smaller range such
as -1.0 to 1.0 or 0.0 to 1.0.
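Normalization to a target range can be sketched with min-max scaling, which maps a value v to (v - min) / (max - min) * (new_max - new_min) + new_min. Min-max scaling is only one possible normalization method, and the function name and sample values below are chosen for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; map them to the lower bound to avoid dividing by zero.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [10, 20, 30]
unit_range = min_max_normalize(incomes)                  # range 0.0 to 1.0
signed_range = min_max_normalize(incomes, -1.0, 1.0)     # range -1.0 to 1.0
print(unit_range, signed_range)
```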
Data mining deals with the kind of patterns that can be mined. On the basis of the kind
of data to be mined, there are two categories of functions involved in data mining:
• Descriptive
Descriptive Functions:-
The descriptive function deals with the general properties of data in the database. Here
is the list of descriptive functions:
3) Mining of Associations
4) Mining of Correlations
5) Mining of Clusters
Class/Concept refers to the data associated with classes or concepts.
For example, in a company, the classes of items for sale include computers & printers, &
concepts of customers include big spenders & budget spenders. Such descriptions of a class or
a concept are called class/concept descriptions. These descriptions can be derived in the
following two ways:
• Data Characterization - This refers to summarizing the data of a class under study. This class
under study is called the target class.
Frequent patterns are those patterns that occur frequently in transactional data. Here is
the list of kinds of frequent patterns:
• Frequent Item Set - It refers to a set of items that frequently appear together, e.g., milk &
bread.
• Frequent Sub Structure - Sub structure refers to different structural forms, such as graphs,
trees or lattices, which may be combined with item sets or subsequences.
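Finding frequent item sets can be sketched by brute force: count every item set up to a given size and keep those that appear in at least a minimum number of transactions. This is a simple illustration for small data, not an efficient algorithm such as Apriori; the function name, the threshold, and the sample transactions are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return item sets (as sorted tuples) found in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Illustrative transactions only.
transactions = [
    ["milk", "bread"],
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread", "biscuits"],
    ["milk"],
]
frequent = frequent_itemsets(transactions, min_support=3)
print(frequent)
```

With a minimum support of 3 transactions, only milk, bread, and the pair (bread, milk) qualify as frequent.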
3) Mining of Association:-
Associations are used in retail sales to identify items that are frequently purchased
together. Association mining refers to the process of uncovering the relationships among
data & determining association rules.
For example, a retailer generates an association rule that shows that 70% of the time milk is
sold with bread & only 30% of the time biscuits are sold with bread.
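The percentages in such a rule correspond to the confidence of the rule, which can be computed from support counts. The sketch below uses made-up transactions constructed to reproduce the 70% and 30% figures of the example; the function names are chosen for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent and consequent together) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Ten illustrative transactions that all contain bread.
transactions = [["bread", "milk"]] * 7 + [["bread", "biscuits"]] * 3

bread_to_milk = confidence(transactions, ["bread"], ["milk"])
bread_to_biscuits = confidence(transactions, ["bread"], ["biscuits"])
print(bread_to_milk, bread_to_biscuits)
```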
4) Mining of Correlation:-
5) Mining of Cluster:-
A cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from the objects in
other clusters.
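Cluster analysis can be illustrated with a small one-dimensional k-means sketch. k-means is only one of many clustering methods; the starting centroids, the sample points, and the function name below are assumptions made for the example.

```python
def k_means_1d(points, centroids, iterations=10):
    """Group numbers around the nearest centroid, then move each centroid
    to the mean of its group, repeating until nothing changes."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [1, 2, 3, 10, 11, 12]
centroids, clusters = k_means_1d(points, centroids=[1, 12])
print(centroids, clusters)
```

The points within each resulting group are close to each other, while the two groups are far apart, which matches the description of a cluster given above.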