Data Mining
The following are the steps involved in data mining when viewed as a knowledge
discovery (KDD) process.
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly
formatted, duplicate, or incomplete data within a dataset. When combining multiple data
sources, there are many opportunities for data to be duplicated or mislabeled.
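As a minimal sketch (with hypothetical toy records, not any particular tool), cleaning can mean dropping incomplete rows, normalizing inconsistent formatting, and removing the duplicates that appear when sources are combined:

```python
def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        # Discard incomplete rows (missing name or age).
        if rec.get("name") is None or rec.get("age") is None:
            continue
        # Normalize inconsistent formatting.
        name = rec["name"].strip().title()
        key = (name, rec["age"])
        # Drop exact duplicates that arise when sources are combined.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": rec["age"]})
    return cleaned

raw = [
    {"name": "  alice ", "age": 30},
    {"name": "Alice", "age": 30},   # duplicate after normalization
    {"name": "bob", "age": None},   # incomplete
    {"name": "carol", "age": 25},
]
print(clean(raw))
```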
Data Integration
Data integration brings together data gathered from different systems and makes it
more valuable for your business. It helps your people work better with each other and do
more for your customers. Without a data integration platform, you have no way of
accessing the data gathered in one system in another.
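A rough sketch of the idea, assuming two toy systems (a CRM and a billing system, with hypothetical field names): records are joined on a shared customer key so that each customer's data from both systems becomes accessible in one place.

```python
crm = {101: {"name": "Alice"}, 102: {"name": "Bob"}}
billing = {101: {"balance": 250.0}, 103: {"balance": 40.0}}

integrated = {}
# Union of keys: keep customers that appear in either system.
for cust_id in crm.keys() | billing.keys():
    merged = {}
    merged.update(crm.get(cust_id, {}))
    merged.update(billing.get(cust_id, {}))
    integrated[cust_id] = merged

print(integrated[101])  # fields from both systems
```

A real integration platform also handles schema matching, conflicting values, and identity resolution; this shows only the basic join.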
Data Selection
In this step, data relevant to the analysis task are retrieved from the
database. Only the attributes and records needed for the mining task are
selected, which reduces the volume of data to be processed and focuses the
analysis on what matters for the task at hand.
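In the knowledge-discovery sense, data selection means retrieving only the task-relevant rows and attributes. A minimal sketch with hypothetical retail records (selection of rows plus projection of columns):

```python
transactions = [
    {"id": 1, "region": "north", "amount": 120.0, "clerk": "a"},
    {"id": 2, "region": "south", "amount": 80.0,  "clerk": "b"},
    {"id": 3, "region": "north", "amount": 45.0,  "clerk": "c"},
]

# Keep only northern-region rows (selection) and only the attributes
# the mining task needs (projection).
selected = [{"id": t["id"], "amount": t["amount"]}
            for t in transactions if t["region"] == "north"]
print(selected)
```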
Data Transformation
Data transformation is the process of changing the format, structure, or values of
data. For data analytics projects, data may be transformed at two stages of the data
pipeline. Processes such as data integration, data migration, data warehousing and data
wrangling all may involve data transformation.
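One common transformation is min-max normalization, which rescales numeric values into the range [0, 1] so that attributes measured on different scales become comparable before mining. A minimal sketch:

```python
def min_max(values):
    # Rescale each value into [0, 1] relative to the observed range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([10, 20, 30, 50]))  # -> [0.0, 0.25, 0.5, 1.0]
```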
Data Mining
An essential process in which intelligent and efficient methods are applied
to extract patterns. Data mining also involves analyzing and examining the
database and extracting useful data from it.
Pattern Evaluation
A process that identifies the truly interesting patterns representing knowledge
based on some interestingness measures.
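Two standard interestingness measures for association patterns are support (how often a pattern occurs) and confidence (how reliably one itemset implies another). A minimal sketch over toy market-basket data:

```python
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]

def support(itemset):
    # Fraction of baskets containing every item in the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    # How often the consequent appears given the antecedent.
    return support(antecedent | consequent) / support(antecedent)

# Evaluate the rule {milk} -> {bread}.
print(support({"milk", "bread"}))
print(confidence({"milk"}, {"bread"}))
```

Patterns whose support and confidence fall below chosen thresholds would be discarded as uninteresting.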
Knowledge Presentation
Knowledge presentation is the presentation of mined knowledge to the user for
visualization in terms of trees, tables, rules, graphs, charts, matrices, etc.
DATABASE
A database is a collection of information that is organized so that it can be easily
accessed, managed and updated. Data is organized into rows, columns and tables, and it
is indexed to make it easier to find relevant information. Data is updated,
expanded, and deleted as new information is added. Database workloads create
and update this content, query the data it contains, and run applications
against it.
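The row/column/table/index organization described above can be demonstrated with Python's built-in sqlite3 module and an in-memory database (table and field names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_city ON customers(city)")  # index speeds up lookups by city
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Alice", "Oslo"), (2, "Bob", "Paris")])
conn.execute("UPDATE customers SET city = 'Lyon' WHERE id = 2")  # data gets updated
rows = conn.execute("SELECT name, city FROM customers ORDER BY id").fetchall()
print(rows)  # [('Alice', 'Oslo'), ('Bob', 'Lyon')]
```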
DATA WAREHOUSE
A data warehouse is a federated repository for all the data that an enterprise's
various business systems collect. The repository may be physical or logical. Data
warehousing emphasizes the capture of data from diverse sources for useful analysis and
access, but it does not generally start from the point of view of the end user,
who may need access to specialized, sometimes local databases. The latter idea
is known as a data mart.
Characterization
Data characterization is a summarization of the general characteristics or
features of a target class of data. Without understanding the contents of the
data, the data is practically unusable, despite having a lot of potential.
Data mining builds on the results of characterization.
Examples include pie charts, bar charts, curves, multidimensional data cubes, and
multidimensional tables, including crosstabs. The resulting descriptions can also be
presented as generalized relations or in rule form (called characteristic rules).
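As a minimal sketch (toy customer records with hypothetical fields), characterizing a target class can be as simple as summarizing the general features of the objects belonging to it:

```python
from statistics import mean

customers = [
    {"tier": "premium", "age": 42, "spend": 900},
    {"tier": "premium", "age": 38, "spend": 1100},
    {"tier": "basic",   "age": 25, "spend": 150},
]

# Characterize the target class: "premium" customers.
target = [c for c in customers if c["tier"] == "premium"]
summary = {
    "count": len(target),
    "avg_age": mean(c["age"] for c in target),
    "avg_spend": mean(c["spend"] for c in target),
}
print(summary)
```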
Discrimination
Data discrimination compares the general features of a target class of data
objects with the general features of objects from one or more contrasting
classes. Although data discrimination is closely related to characterization,
the objectives are very different: characterization summarizes a single class,
while discrimination contrasts classes against each other. In general,
classifier tools can be used to perform the discrimination task.
For example, big data analytics may exclude someone from receiving marketing
for a prime-rate credit card simply because of non-traditional analytic
predictors, such as the person's zip code, relationship status, or even social
media use.
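A minimal sketch of discrimination in the mining sense, contrasting two classes by comparing their feature averages (toy data with a single hypothetical attribute):

```python
from statistics import mean

def profile(rows):
    # Summarize the general features of one class.
    return {"avg_spend": mean(r["spend"] for r in rows)}

# Target class vs. contrasting class.
graduates  = [{"spend": 900}, {"spend": 1100}]
undergrads = [{"spend": 200}, {"spend": 300}]

print(profile(graduates))
print(profile(undergrads))
```

Comparing the two profiles shows which features best separate the target class from the contrasting one.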
Classification
Classification is a data mining function that assigns items in a collection to target
categories or classes. The goal of classification is to accurately predict the target class for
each case in the data. For example, a classification model could be used to identify loan
applicants as low, medium, or high credit risks.
A common example of classification is detecting spam emails. To write a
program that filters out spam, a programmer can train a machine learning
algorithm with a set of spam-like emails labelled as spam and regular emails
labelled as not spam.
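A toy classifier in the spirit of the spam example: it counts which words appear more often in spam than in non-spam training messages and scores new text by those words. A real filter would use a proper learning algorithm; this is only a sketch.

```python
from collections import Counter

training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch tomorrow?", "ham"),
]

# Count word frequencies per class from the labelled examples.
spam_words, ham_words = Counter(), Counter()
for text, label in training:
    (spam_words if label == "spam" else ham_words).update(text.split())

def classify(text):
    # Positive score: words seen more in spam; otherwise ham.
    score = sum(spam_words[w] - ham_words[w] for w in text.split())
    return "spam" if score > 0 else "ham"

print(classify("free money"))        # "money" occurs only in spam training text
print(classify("agenda for lunch"))  # "agenda"/"lunch" occur only in ham
```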
Regression
Regression is a data mining technique used to predict a range of numeric values
(also called continuous values), given a particular dataset. For example, regression might
be used to predict the cost of a product or service, given other variables.
For example, researchers might administer various dosages of a certain drug to
patients and observe how their blood pressure responds. They might fit a simple
linear regression model using dosage as the predictor variable and blood
pressure as the response variable.
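The dosage example can be sketched with an ordinary least-squares fit, written out with the standard formulas (the dosage and blood-pressure numbers below are invented for illustration):

```python
from statistics import mean

def fit(xs, ys):
    # Least-squares slope and intercept for simple linear regression.
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, intercept

dosage   = [1, 2, 3, 4]          # hypothetical dosages (mg)
pressure = [120, 115, 110, 105]  # hypothetical readings (mmHg)
slope, intercept = fit(dosage, pressure)
print(slope, intercept)          # here: -5.0 125.0, i.e. -5 mmHg per mg

predicted = slope * 5 + intercept  # predict the response for a 5 mg dose
print(predicted)                   # 100.0
```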
Clustering
Clustering in data mining is the grouping of a particular set of objects based
on their characteristics, aggregating them according to their similarities. It
is the process of organizing abstract objects into classes of similar objects,
so that a cluster of data objects can be treated as one group. In cluster
analysis, we first partition the set of data into groups based on data
similarity and then assign labels to the groups.
For example, a streaming service may collect the following data about
individuals:
Minutes watched per day
Total viewing sessions per week
Number of unique shows viewed per month
Using these metrics, a streaming service can perform cluster analysis to identify high
usage and low usage users so that they can know who they should spend most of their
advertising dollars on.
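The high-usage/low-usage split above can be sketched with a tiny k-means clustering (k = 2) on a single metric, minutes watched per day; the data points and starting centers are invented for illustration:

```python
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's group.
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        # Update step: move each center to its group's mean.
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

minutes = [5, 10, 12, 120, 150, 140]   # minutes watched per day
print(kmeans_1d(minutes, centers=[5, 150]))
```

The two final centers land near the low-usage and high-usage groups, giving the service its two audience segments.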
Outlier analysis
An outlier is an object that deviates significantly from the rest of the
objects. Outliers can be caused by measurement or execution errors, but an
outlier should not automatically be dismissed as noise or error, because it
may carry genuinely useful information. The analysis of outlier data is
referred to as outlier analysis or outlier mining. Described in very simple
terms, outlier analysis tries to find unusual patterns in a dataset.
For example, a temperature reading of 40°C may behave as an outlier in the
context of a “winter season” but like a normal data point in the context of a
“summer season”. Likewise, a low temperature value in June is a contextual
outlier, while the same value in December is not.
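A simple global-outlier sketch: flag any value more than two standard deviations from the mean (the z-score rule; the readings and threshold below are illustrative choices, and contextual outliers would additionally need the season as context):

```python
from statistics import mean, stdev

def outliers(values, threshold=2.0):
    # Flag values whose z-score magnitude exceeds the threshold.
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [21, 22, 20, 23, 21, 22, 40]  # 40 deviates sharply from the rest
print(outliers(readings))
```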