Data-Mining-OVERVIEW (1)
Data mining is defined as extracting information from huge sets of data. In other
words, we can say that data mining is mining knowledge from data. The
tutorial starts off with a basic overview and the terminologies involved in data mining
and then gradually moves on to cover topics such as knowledge discovery, query
language, classification and prediction, decision tree induction, cluster analysis, and
how to mine the Web.
Data mining, also known as Knowledge Discovery in Data (KDD), is the process of
uncovering patterns and other valuable information from large data sets. Over the last
few decades, the development of data warehousing technology and the growth of big
data have rapidly accelerated the adoption of data mining techniques, helping
companies transform their raw data into useful information. However, even though the
technology continuously evolves to handle data at a large scale, leaders still face
challenges with scalability and automation.
Data mining enables organizations to make better decisions through intelligent data
analyses. The data mining techniques that underlie these analyses serve two main
purposes: they can describe the target data set, or they can predict outcomes using
machine learning algorithms. These methods are used to organize and filter data,
surfacing the most interesting information, from fraud detection and user behavior to
bottlenecks and even security failures.
Combined with data analytics and visualization tools such as Apache Spark, data mining
has never been easier to get into, and extracting relevant insights has never been
faster. Advances in artificial intelligence continue to expedite adoption across
industries. This data mining tutorial explains the basics of data mining and then
moves on to its advanced concepts.
The data mining process consists of different phases to be executed step by step.
Understand the business
Identify what type of data is needed to solve the issue, i.e., begin preliminary
analysis of the data
Collect the data from authentic sources, obtain access rights, and prepare a data
description report
Clean the data: handle missing data, data errors, default values, and data
corrections.
Integrate the data: combine two disparate data sets to get the final target data
set.
Format the data: convert data types or configure data for the specific mining
technology being used.
Prepare the data in a format
Evaluation
Deployment
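The clean, integrate, and format steps listed above can be sketched in a few lines of Python. This is only an illustration using the standard library: the records, the field names (`id`, `age`, `amount`), and the decision to keep unknown ages as `None` are assumptions made for the example, not part of the process itself.

```python
# Hypothetical raw records from two sources; all names and values are illustrative.
customers = [
    {"id": "1", "name": "Ann", "age": "34"},
    {"id": "2", "name": "Bob", "age": ""},     # missing age
    {"id": "3", "name": "Cy", "age": "n/a"},   # data error
]
purchases = [
    {"customer_id": "1", "amount": "19.99"},
    {"customer_id": "1", "amount": "5.00"},
    {"customer_id": "3", "amount": "42.50"},
]

DEFAULT_AGE = None  # assumption: unknown ages are kept as None rather than guessed


def clean(record):
    """Clean: replace a missing or malformed age with the default value."""
    age = record["age"].strip()
    return {**record, "age": int(age) if age.isdigit() else DEFAULT_AGE}


def integrate(customers, purchases):
    """Integrate: attach each customer's purchase amounts to the customer record."""
    by_customer = {}
    for p in purchases:
        by_customer.setdefault(p["customer_id"], []).append(p["amount"])
    return [{**c, "amounts": by_customer.get(c["id"], [])} for c in customers]


def format_record(record):
    """Format: convert string fields to the types a mining tool would expect."""
    return {
        "id": int(record["id"]),
        "name": record["name"],
        "age": record["age"],
        "total_spent": round(sum(float(a) for a in record["amounts"]), 2),
    }


cleaned = [clean(c) for c in customers]
target = [format_record(r) for r in integrate(cleaned, purchases)]
for row in target:
    print(row)
```

Keeping each phase in its own function mirrors the step-by-step process: any one step can be replaced (for example, a different rule for missing values) without touching the others.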
There is a huge amount of data available in the Information Industry. This data is of no
use until it is converted into useful information. It is necessary to analyze this huge
amount of data and extract useful information from it.
Extraction of information is not the only process we need to perform; data mining also
involves other processes such as Data Cleaning, Data Integration, Data Transformation,
Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are
over, we would be able to use this information in many applications such as Fraud
Detection, Market Analysis, Production Control, Science Exploration, etc.
Data Mining is defined as extracting information from huge sets of data. In other words,
we can say that data mining is the procedure of mining knowledge from data. The
information or knowledge so extracted can be used for any of the following applications −
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Data Mining Applications
Apart from these, data mining can also be used in the areas of production control,
customer retention, science exploration, sports, astrology, and Internet Web Surf-Aid.
Listed below are the various fields of market where data mining is used −
Customer Profiling − Data mining helps determine what kind of people buy
what kind of products.
Identifying Customer Requirements − Data mining helps in identifying the best
products for different customers. It uses prediction to find the factors that may
attract new customers.
Cross Market Analysis − Data mining finds associations/correlations
between product sales.
Target Marketing − Data mining helps to find clusters of model customers who
share the same characteristics such as interests, spending habits, income, etc.
Determining Customer Purchasing Patterns − Data mining helps in determining
customer purchasing patterns.
Providing Summary Information − Data mining provides us with various
multidimensional summary reports.
Finance Planning and Asset Evaluation − It involves cash flow analysis and
prediction, and contingent claim analysis to evaluate assets.
Resource Planning − It involves summarizing and comparing the resources and
spending.
Competition − It involves monitoring competitors and market directions.
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunication to
detect fraud. For fraudulent telephone calls, it helps to find the destination of the
call, the duration of the call, the time of the day or week, etc. It also analyzes
patterns that deviate from expected norms.
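One simple way to flag values that deviate from expected norms is a z-score test: measure how many standard deviations each value lies from the mean. The sketch below assumes a hypothetical list of call durations and an arbitrary threshold of 2.0; real fraud detection systems typically combine many signals and more robust statistics.

```python
from statistics import mean, stdev

# Hypothetical call durations in minutes for one account; values are illustrative.
call_minutes = [3, 4, 5, 4, 3, 5, 4, 60]

Z_THRESHOLD = 2.0  # assumption: anything beyond 2 standard deviations is suspicious


def find_outliers(values, threshold):
    """Return the values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mu) / sigma > threshold]


outliers = find_outliers(call_minutes, Z_THRESHOLD)
print(outliers)
```

With so few data points a single extreme value inflates the standard deviation, which is why the threshold here is kept low.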
Data mining deals with the kind of patterns that can be mined. On the basis of the kind
of data to be mined, there are two categories of functions involved in Data Mining −
Descriptive
Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the database. Here
is the list of descriptive functions −
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept Description
Data can be associated with classes or concepts. For example, in a company, the
classes of items for sale include computers and printers, and concepts of customers
include big spenders and budget spenders. Such
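descriptions of a class or a concept are called class/concept descriptions. These
descriptions can be derived in two ways: by data characterization, which summarizes
the data of the class under study, and by data discrimination, which compares that
class with one or more contrasting classes.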
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here is
the list of the kinds of frequent patterns −
Frequent Item Set − It refers to a set of items that frequently appear together, for
example, milk and bread.
Frequent Subsequence − A sequence of patterns that occurs frequently, such as
the purchase of a camera followed by the purchase of a memory card.
Frequent Sub Structure − Substructure refers to different structural forms, such
as graphs, trees, or lattices, which may be combined with item-sets or
subsequences.
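Frequent item sets can be found, in the simplest case, by counting how often each combination of items appears across transactions and keeping those above a minimum support. The sketch below is a brute-force count, not an optimized algorithm such as Apriori; the transactions and the `MIN_SUPPORT` value are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions; items and values are illustrative.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "biscuits"},
    {"milk", "bread", "biscuits"},
    {"milk", "eggs"},
]

MIN_SUPPORT = 0.6  # assumption: an item set is "frequent" if in at least 60% of baskets


def frequent_itemsets(transactions, size, min_support):
    """Count every item set of the given size and keep the frequent ones."""
    counts = Counter()
    for basket in transactions:
        # sorted() gives each item set one canonical tuple form
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    total = len(transactions)
    return {
        itemset: count / total
        for itemset, count in counts.items()
        if count / total >= min_support
    }


singles = frequent_itemsets(transactions, 1, MIN_SUPPORT)
pairs = frequent_itemsets(transactions, 2, MIN_SUPPORT)
print(singles)
print(pairs)
```

In this toy data only milk and bread are frequent on their own, and the pair (bread, milk) is the only frequent pair, matching the milk-and-bread example above.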
Mining of Associations
Associations are used in retail sales to identify items that are frequently purchased
together. Association mining is the process of uncovering relationships among data
and determining association rules.
For example, a retailer generates an association rule showing that milk is sold with
bread 70% of the time, while biscuits are sold with bread only 30% of the time.
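The percentages in such a rule correspond to what is usually called the confidence of the rule: among transactions containing the antecedent (bread), the fraction that also contain the consequent (milk). The sketch below uses made-up transactions constructed to reproduce the 70% and 30% figures; the function names are illustrative.

```python
# Hypothetical transactions built so that 7 of 10 bread baskets contain milk
# and 3 of 10 contain biscuits; the data is illustrative only.
transactions = (
    [{"bread", "milk"}] * 7
    + [{"bread", "biscuits"}] * 3
    + [{"milk"}] * 2
)


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)


bread_milk = confidence({"bread"}, {"milk"}, transactions)
bread_biscuits = confidence({"bread"}, {"biscuits"}, transactions)
print(f"bread -> milk: {bread_milk:.0%}")
print(f"bread -> biscuits: {bread_biscuits:.0%}")
```

Note that confidence is directional: the rule milk -> bread has a different value here, because two milk baskets contain no bread.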
Mining of Correlations
Mining of Clusters
A cluster is a group of similar objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from the
objects in other clusters.
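As one possible illustration of cluster analysis, the sketch below runs a basic k-means procedure on one-dimensional data. K-means is only one of many clustering methods and is chosen here for brevity; the spending values, the starting centroids, and the fixed iteration count are assumptions.

```python
def k_means_1d(values, centroids, iterations=10):
    """Basic k-means on numbers: assign to nearest centroid, then recompute means."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spending per customer; values are illustrative.
spending = [12, 15, 14, 90, 95, 88]

centroids, clusters = k_means_1d(spending, centroids=[0.0, 100.0])
print(centroids)
print(clusters)
```

The two resulting groups contain values close to each other and far from the other group, which is the property the definition above describes.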
Classification and Prediction
Classification is the process of finding a model that describes the data classes or
concepts. The purpose is to be able to use this model to predict the class of objects
whose class label is unknown. This derived model is based on the analysis of sets of
training data. The derived model can be presented in the following forms −
Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks
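To illustrate the IF-THEN rule form, the sketch below derives a single rule from labeled training data by choosing the cut point on one attribute that misclassifies the fewest rows. This is a deliberately tiny learner, not a full rule-induction or decision tree algorithm; the training rows, the labels, and the assumption that higher spending means "big" are hypothetical.

```python
# Hypothetical training data: (monthly_spend, class label); values are illustrative.
training = [
    (20, "budget"), (35, "budget"), (50, "budget"),
    (120, "big"), (150, "big"), (200, "big"),
]


def learn_threshold(training):
    """Pick the midpoint between sorted values that misclassifies the fewest rows."""
    values = sorted(value for value, _ in training)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(cut):
        # A row is wrong when "value > cut" disagrees with its label being "big"
        return sum((label == "big") != (value > cut) for value, label in training)

    return min(candidates, key=errors)


cut = learn_threshold(training)


def classify(value):
    """Derived rule: IF spend > cut THEN 'big' ELSE 'budget'."""
    return "big" if value > cut else "budget"


print(cut)
print(classify(90), classify(60))
```

The derived model is then used on objects whose class label is unknown, such as the two spending values passed to `classify`.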
Classification − It predicts the class of objects whose class label is unknown. Its
objective is to find a derived model that describes and distinguishes data classes
or concepts. The derived model is based on the analysis of a set of training data, i.e.,
data objects whose class labels are known.
Prediction − It is used to predict missing or unavailable numerical data values
rather than class labels. Regression Analysis is generally used for prediction.
Prediction can also be used for identification of distribution trends based on
available data.
Outlier Analysis − Outliers may be defined as the data objects that do not
comply with the general behavior or model of the data available.
Evolution Analysis − Evolution analysis refers to describing and modeling
regularities or trends for objects whose behavior changes over time.
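Prediction by regression analysis can be sketched with simple least-squares linear regression: fit a line to the known values and use it to estimate a missing one. The data points below are hypothetical and lie exactly on a line so the result is easy to check by hand; real data would be noisy.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations; the value for month 6 is the one that is unavailable.
months = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(months, sales)
predicted_month_6 = slope * 6 + intercept
print(slope, intercept, predicted_month_6)
```

Unlike classification, the output here is a numerical value rather than a class label.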
We can specify a data mining task in the form of a data mining query.
This query is input to the system.
A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
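The five primitives can be pictured as the fields of a query object. The sketch below is purely illustrative: `MiningQuery` and its field names are hypothetical and do not correspond to any real data mining query language or library.

```python
from dataclasses import dataclass, field

# Allowed values taken from the lists in this overview, lowercased for the example.
KNOWLEDGE_KINDS = {
    "characterization", "discrimination", "association and correlation analysis",
    "classification", "prediction", "clustering", "outlier analysis",
    "evolution analysis",
}
PRESENTATIONS = {"rules", "tables", "charts", "graphs", "decision trees", "cubes"}


@dataclass
class MiningQuery:
    """Hypothetical container for the five data mining task primitives."""
    task_relevant_data: list                                   # attributes or dimensions
    knowledge_kind: str                                        # kind of knowledge to mine
    background_knowledge: dict = field(default_factory=dict)   # e.g. concept hierarchies
    interestingness: dict = field(default_factory=dict)        # measure name -> threshold
    presentation: str = "rules"                                # how to display patterns

    def __post_init__(self):
        if self.knowledge_kind not in KNOWLEDGE_KINDS:
            raise ValueError(f"unknown kind of knowledge: {self.knowledge_kind}")
        if self.presentation not in PRESENTATIONS:
            raise ValueError(f"unknown presentation: {self.presentation}")


query = MiningQuery(
    task_relevant_data=["age", "income", "purchases"],
    knowledge_kind="classification",
    interestingness={"min_accuracy": 0.8},
    presentation="decision trees",
)
print(query)
```

Validating the fields when the query is built reflects the interactive use described above: a malformed request is rejected before any mining starts.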
Set of task relevant data to be mined
This is the portion of the database in which the user is interested. This portion
includes the following −
Database Attributes
Data Warehouse dimensions of interest
Kind of knowledge to be mined
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Background knowledge
Interestingness measures and thresholds for pattern evaluation
These are used to evaluate the patterns that are discovered by the process of knowledge
discovery. There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These
representations may include the following −
Rules
Tables
Charts
Graphs
Decision Trees
Cubes