Data Mining
The following are the steps involved in data mining when viewed as a knowledge
discovery (KDD) process.
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly
formatted, duplicate, or incomplete data within a dataset. When combining multiple data
sources, there are many opportunities for data to be duplicated or mislabeled.
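As a minimal sketch (with hypothetical toy records, not any particular tool), cleaning can mean dropping incomplete rows, normalizing inconsistent formatting, and removing the duplicates that appear when sources are combined:

```python
def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        # Discard incomplete rows (missing name or age).
        if rec.get("name") is None or rec.get("age") is None:
            continue
        # Normalize inconsistent formatting.
        name = rec["name"].strip().title()
        key = (name, rec["age"])
        # Drop exact duplicates that arise when sources are combined.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": rec["age"]})
    return cleaned

raw = [
    {"name": "  alice ", "age": 30},
    {"name": "Alice", "age": 30},   # duplicate after normalization
    {"name": "bob", "age": None},   # incomplete
    {"name": "carol", "age": 25},
]
print(clean(raw))
```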
Data Integration
Data integration brings together data gathered from different systems and makes it
more valuable for your business. It helps your people work better with each other and do
more for your customers. Without a data integration platform, you have no way of
accessing the data gathered in one system in another.
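A rough sketch of the idea, assuming two toy systems (a CRM and a billing system, with hypothetical field names): records are joined on a shared customer key so that each customer's data from both systems becomes accessible in one place.

```python
crm = {101: {"name": "Alice"}, 102: {"name": "Bob"}}
billing = {101: {"balance": 250.0}, 103: {"balance": 40.0}}

integrated = {}
# Union of keys: keep customers that appear in either system.
for cust_id in crm.keys() | billing.keys():
    merged = {}
    merged.update(crm.get(cust_id, {}))
    merged.update(billing.get(cust_id, {}))
    integrated[cust_id] = merged

print(integrated[101])  # fields from both systems
```

A real integration platform also handles schema matching, conflicting values, and identity resolution; this shows only the basic join.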
Data Selection
In this step, data relevant to the analysis task are retrieved from the
database. Only the attributes and records needed for the mining task are
selected, which reduces the volume of data to be processed and focuses the
analysis on what matters for the task at hand.
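In the knowledge-discovery sense, data selection means retrieving only the task-relevant rows and attributes. A minimal sketch with hypothetical retail records (selection of rows plus projection of columns):

```python
transactions = [
    {"id": 1, "region": "north", "amount": 120.0, "clerk": "a"},
    {"id": 2, "region": "south", "amount": 80.0,  "clerk": "b"},
    {"id": 3, "region": "north", "amount": 45.0,  "clerk": "c"},
]

# Keep only northern-region rows (selection) and only the attributes
# the mining task needs (projection).
selected = [{"id": t["id"], "amount": t["amount"]}
            for t in transactions if t["region"] == "north"]
print(selected)
```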
Data Transformation
Data transformation is the process of changing the format, structure, or values of
data. For data analytics projects, data may be transformed at two stages of the data
pipeline. Processes such as data integration, data migration, data warehousing and data
wrangling all may involve data transformation.
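One common transformation is min-max normalization, which rescales numeric values into the range [0, 1] so that attributes measured on different scales become comparable before mining. A minimal sketch:

```python
def min_max(values):
    # Rescale each value into [0, 1] relative to the observed range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([10, 20, 30, 50]))  # -> [0.0, 0.25, 0.5, 1.0]
```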
Data Mining
An essential process in which intelligent and efficient methods are applied
to extract patterns. Data mining also involves analyzing and examining the
database and extracting useful data from it.
Pattern Evaluation
A process that identifies the truly interesting patterns representing knowledge
based on some interestingness measures.
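Two standard interestingness measures for association patterns are support (how often a pattern occurs) and confidence (how reliably one itemset implies another). A minimal sketch over toy market-basket data:

```python
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]

def support(itemset):
    # Fraction of baskets containing every item in the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    # How often the consequent appears given the antecedent.
    return support(antecedent | consequent) / support(antecedent)

# Evaluate the rule {milk} -> {bread}.
print(support({"milk", "bread"}))
print(confidence({"milk"}, {"bread"}))
```

Patterns whose support and confidence fall below chosen thresholds would be discarded as uninteresting.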
Knowledge Presentation
Knowledge presentation is the presentation of mined knowledge to the user for
visualization in terms of trees, tables, rules, graphs, charts, matrices, etc.
DATABASE
A database is a collection of information that is organized so that it can be easily
accessed, managed and updated. Data is organized into rows, columns and tables, and it
is indexed to make it easier to find relevant information. Data is updated,
expanded, and deleted as new information is added. Database workloads create
and update this content, query the data it contains, and run applications
against it.
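The row/column/table/index organization described above can be demonstrated with Python's built-in sqlite3 module and an in-memory database (table and field names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_city ON customers(city)")  # index speeds up lookups by city
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Alice", "Oslo"), (2, "Bob", "Paris")])
conn.execute("UPDATE customers SET city = 'Lyon' WHERE id = 2")  # data gets updated
rows = conn.execute("SELECT name, city FROM customers ORDER BY id").fetchall()
print(rows)  # [('Alice', 'Oslo'), ('Bob', 'Lyon')]
```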
DATA WAREHOUSE
A data warehouse is a federated repository for all the data that an enterprise's
various business systems collect. The repository may be physical or logical. Data
warehousing emphasizes the capture of data from diverse sources for useful analysis and
access, but it does not generally start from the point of view of the end user,
who may need access to specialized, sometimes local databases. The latter idea
is known as a data mart.
Characterization
Data characterization is a summarization of the general characteristics or
features of a target class of data. Without understanding the contents of the
data, the data is practically unusable, despite having a lot of potential.
Data mining builds on the results of characterization.
Examples include pie charts, bar charts, curves, multidimensional data cubes, and
multidimensional tables, including crosstabs. The resulting descriptions can also be
presented as generalized relations or in rule form (called characteristic rules).
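As a minimal sketch (toy customer records with hypothetical fields), characterizing a target class can be as simple as summarizing the general features of the objects belonging to it:

```python
from statistics import mean

customers = [
    {"tier": "premium", "age": 42, "spend": 900},
    {"tier": "premium", "age": 38, "spend": 1100},
    {"tier": "basic",   "age": 25, "spend": 150},
]

# Characterize the target class: "premium" customers.
target = [c for c in customers if c["tier"] == "premium"]
summary = {
    "count": len(target),
    "avg_age": mean(c["age"] for c in target),
    "avg_spend": mean(c["spend"] for c in target),
}
print(summary)
```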
Discrimination
Data discrimination compares the general features of a target class of data
objects with the general features of objects from one or more contrasting
classes. Although data discrimination is closely related to characterization,
the objectives are very different: characterization summarizes a single class,
while discrimination contrasts classes against each other. In general,
classifier tools can be used to perform the discrimination task.
For example, big data analytics may exclude someone from receiving marketing
for a prime-rate credit card simply because of non-traditional analytic
predictors, such as the person's zip code, relationship status, or even social
media use.
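A minimal sketch of discrimination in the mining sense, contrasting two classes by comparing their feature averages (toy data with a single hypothetical attribute):

```python
from statistics import mean

def profile(rows):
    # Summarize the general features of one class.
    return {"avg_spend": mean(r["spend"] for r in rows)}

# Target class vs. contrasting class.
graduates  = [{"spend": 900}, {"spend": 1100}]
undergrads = [{"spend": 200}, {"spend": 300}]

print(profile(graduates))
print(profile(undergrads))
```

Comparing the two profiles shows which features best separate the target class from the contrasting one.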
Classification
Classification is a data mining function that assigns items in a collection to target
categories or classes. The goal of classification is to accurately predict the target class for
each case in the data. For example, a classification model could be used to identify loan
applicants as low, medium, or high credit risks.
A common example of classification is detecting spam emails. To write a
program that filters out spam, a programmer can train a machine learning
algorithm with a set of spam-like emails labelled as spam and regular emails
labelled as not spam.
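A toy classifier in the spirit of the spam example: it counts which words appear more often in spam than in non-spam training messages and scores new text by those words. A real filter would use a proper learning algorithm; this is only a sketch.

```python
from collections import Counter

training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch tomorrow?", "ham"),
]

# Count word frequencies per class from the labelled examples.
spam_words, ham_words = Counter(), Counter()
for text, label in training:
    (spam_words if label == "spam" else ham_words).update(text.split())

def classify(text):
    # Positive score: words seen more in spam; otherwise ham.
    score = sum(spam_words[w] - ham_words[w] for w in text.split())
    return "spam" if score > 0 else "ham"

print(classify("free money"))        # "money" occurs only in spam training text
print(classify("agenda for lunch"))  # "agenda"/"lunch" occur only in ham
```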
Regression
Regression is a data mining technique used to predict a range of numeric values
(also called continuous values), given a particular dataset. For example, regression might
be used to predict the cost of a product or service, given other variables.
For example, researchers might administer various dosages of a certain drug to
patients and observe how their blood pressure responds. They might fit a simple
linear regression model using dosage as the predictor variable and blood
pressure as the response variable.
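The dosage example can be sketched with an ordinary least-squares fit, written out with the standard formulas (the dosage and blood-pressure numbers below are invented for illustration):

```python
from statistics import mean

def fit(xs, ys):
    # Least-squares slope and intercept for simple linear regression.
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, intercept

dosage   = [1, 2, 3, 4]          # hypothetical dosages (mg)
pressure = [120, 115, 110, 105]  # hypothetical readings (mmHg)
slope, intercept = fit(dosage, pressure)
print(slope, intercept)          # here: -5.0 125.0, i.e. -5 mmHg per mg

predicted = slope * 5 + intercept  # predict the response for a 5 mg dose
print(predicted)                   # 100.0
```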
Clustering
Clustering in data mining is the grouping of a particular set of objects based
on their characteristics, aggregating them according to their similarities. It
is the process of organizing abstract objects into classes of similar objects,
so that a cluster of data objects can be treated as one group. In cluster
analysis, we first partition the set of data into groups based on data
similarity and then assign labels to the groups.
For example, a streaming service may collect the following data about
individuals:
Minutes watched per day
Total viewing sessions per week
Number of unique shows viewed per month
Using these metrics, a streaming service can perform cluster analysis to identify high
usage and low usage users so that they can know who they should spend most of their
advertising dollars on.
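The high-usage/low-usage split above can be sketched with a tiny k-means clustering (k = 2) on a single metric, minutes watched per day; the data points and starting centers are invented for illustration:

```python
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's group.
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        # Update step: move each center to its group's mean.
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

minutes = [5, 10, 12, 120, 150, 140]   # minutes watched per day
print(kmeans_1d(minutes, centers=[5, 150]))
```

The two final centers land near the low-usage and high-usage groups, giving the service its two audience segments.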
Outlier analysis
An outlier is an object that deviates significantly from the rest of the
objects. Outliers can be caused by measurement or execution errors, but an
outlier should not automatically be dismissed as noise or error, because it
may carry genuinely useful information. The analysis of outlier data is
referred to as outlier analysis or outlier mining. Described in very simple
terms, outlier analysis tries to find unusual patterns in a dataset.
For example, a temperature reading of 40°C may behave as an outlier in the
context of a “winter season” but like a normal data point in the context of a
“summer season”. Likewise, a low temperature value in June is a contextual
outlier, while the same value in December is not.
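A simple global-outlier sketch: flag any value more than two standard deviations from the mean (the z-score rule; the readings and threshold below are illustrative choices, and contextual outliers would additionally need the season as context):

```python
from statistics import mean, stdev

def outliers(values, threshold=2.0):
    # Flag values whose z-score magnitude exceeds the threshold.
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [21, 22, 20, 23, 21, 22, 40]  # 40 deviates sharply from the rest
print(outliers(readings))
```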