
UNIT-1

Introduction to Data Mining

B. Swathi
Assistant Professor
SR University
Fundamentals of Data Mining
➢ "Data mining" is the process of extracting information from huge sets of data to identify patterns, trends, and useful data that allow a business to make data-driven decisions.
Fundamentals of Data Mining
➢ Data mining is one of the most useful techniques for helping entrepreneurs, researchers, and individuals extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD).

➢ The knowledge discovery process includes Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern evaluation, and Knowledge presentation.
Fundamentals of Data Mining
➢ Data Mining is a process used by organizations to extract specific data from huge databases to solve business problems. It primarily turns raw data into useful information.

➢ The biggest challenge is to analyze the data to extract important information that can be used to solve a problem or support company development. Many powerful tools and techniques are available to mine data and derive better insights from it.
Data Mining Process

Types of Data Mining:

➢ Relational Database: A relational database is a collection of multiple data sets formally organized into tables, records, and columns, from which data can be accessed in various ways.

➢ Data Warehouses: A huge amount of data comes from multiple places, such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization.

➢ Data Repositories: A data repository generally refers to a destination for data storage.

➢ For example, a group of databases where an organization has kept various kinds of information.
Types of Data Mining
➢ Object-Relational Database: A combination of an object-oriented database model and a relational database model is called an object-relational model. It supports Classes, Objects, Inheritance, etc.
➢ For example, C++, Java, C#, and so on.

➢ Transactional Database: A transactional database refers to a database management system (DBMS) that has the ability to undo a database transaction if it is not performed appropriately.
Advantages of Data Mining

➢ Marketing / Retail
• Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns, such as direct mail, online marketing campaigns, etc.
➢ Finance / Banking
• Data mining gives financial institutions information about loans and credit reporting.
➢ Manufacturing
• By applying data mining to operational engineering data, manufacturers can detect faulty equipment and determine optimal control parameters.
➢ Governments
• Data mining helps government agencies by digging into and analyzing records of financial transactions to build patterns that can detect money laundering or criminal activities.
Disadvantages of Data Mining

➢ Privacy Issues
• Concerns about personal privacy have been increasing enormously, especially as the internet booms with social networks, e-commerce, forums, blogs, etc.
➢ Security Issues
• Security is a big issue. Businesses own information about their employees and customers, including social security numbers, birthdays, payroll, etc.
➢ Misuse of Information / Inaccurate Information
• Information collected through data mining for ethical purposes can be misused. This information may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.
The KDD Process
Steps in KDD
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection. It handles missing values,
• cleans noisy data, where noise is a random or variance error, and
• uses data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common source (Data Warehouse). Data integration can use:
• Data Migration tools,
• Data Synchronization tools, and
• the ETL (Extract-Transform-Load) process.
Steps in KDD
3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided upon and retrieved from the data collection. Data selection can use:
• Neural networks,
• Decision Trees,
• Naive Bayes, and
• Clustering, Regression, etc.

4. Data Transformation: Data transformation is defined as the process of transforming data into the appropriate form required by the mining procedure. Data transformation is a two-step process:
• Data Mapping: assigning elements from the source base to the destination to capture transformations.
• Code Generation: creation of the actual transformation program.
Steps in KDD
5. Data Mining: Data mining is defined as the application of clever techniques to extract potentially useful patterns. It transforms task-relevant data into patterns and
• decides the purpose of the model, e.g., classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying interesting patterns representing knowledge, based on given measures. It finds an interestingness score for each pattern and
• uses summarization and visualization to make the data understandable to the user.
7. Knowledge Representation: Knowledge representation is defined as a technique that utilizes visualization tools to represent data mining results. It generates reports,
• generates tables, and
• generates discriminant rules, classification rules, characterization rules, etc.
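The steps above can be made concrete with a minimal pandas sketch. The data, column names, and the simple group-by "mining" step are all hypothetical, chosen only to illustrate each KDD stage:

```python
import pandas as pd

# 1-2. Cleaning + integration: load two hypothetical sources and combine them.
sales = pd.DataFrame({"customer": ["a", "b", "b", "c"], "amount": [10.0, None, 25.0, 40.0]})
regions = pd.DataFrame({"customer": ["a", "b", "c"], "region": ["north", "south", "north"]})
data = sales.merge(regions, on="customer")                     # integration
data["amount"] = data["amount"].fillna(data["amount"].mean())  # cleaning: fill missing value

# 3-4. Selection + transformation: keep relevant columns, scale amount to [0, 1].
data = data[["region", "amount"]]
data["amount"] = (data["amount"] - data["amount"].min()) / (data["amount"].max() - data["amount"].min())

# 5-7. Mining + evaluation + presentation: a simple per-region summary "pattern".
pattern = data.groupby("region")["amount"].mean()
print(pattern)  # knowledge presentation as a small report
```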
Data Mining Architecture

• A data warehouse is a heterogeneous collection of different data sources organized under a unified schema.
• There are two approaches for constructing a data warehouse:
• Top-down approach
• Bottom-up approach
Data Mining Architecture
• External Sources –
An external source is a source from which data is collected, irrespective of the type of data. Data can be structured, semi-structured, or unstructured.
• Staging Area –
Since the data extracted from the external sources does not follow a particular format, it needs to be validated before being loaded into the data warehouse. For this purpose, it is recommended to use an ETL tool.
• E (Extract): Data is extracted from the external data source.
• T (Transform): Data is transformed into the standard format.
• L (Load): Data is loaded into the data warehouse after transforming it into the standard format.
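A small illustrative sketch of the E-T-L steps in Python, assuming a hypothetical CSV source file and a local SQLite table standing in for the warehouse:

```python
import csv
import sqlite3

# E: extract rows from a hypothetical source file (path is illustrative).
with open("external_source.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# T: transform into a standard format (uppercase region codes, float amounts).
cleaned = [(r["region"].strip().upper(), float(r["amount"])) for r in rows]

# L: load into a warehouse table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
conn.commit()
conn.close()
```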
• Data Marts –
A data mart is also part of the storage component. It stores the information of a particular function of an organization, handled by a single authority. There can be as many data marts in an organization as there are functions. We can also say that a data mart contains a subset of the data stored in the data warehouse.
Data Mining Functionalities
• Association Analysis − It analyses the sets of items that frequently occur together in a transactional dataset.
• Classification
• Prediction
• Clustering
• Outlier Analysis − Outliers are data elements that cannot be grouped into a given class or cluster.
• Evolution Analysis − It describes the trends of objects whose behaviour changes over time.
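As a taste of association analysis, here is a minimal sketch that counts item pairs co-occurring across hypothetical transactions (the support-counting core that algorithms such as Apriori build on):

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactional dataset.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk", "butter"},
    {"bread", "butter"},
]

# Count how often each unordered pair of items occurs together.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# Keep pairs meeting a minimum support of 2 transactions.
frequent = {p: c for p, c in pair_counts.items() if c >= 2}
print(frequent)  # each of the three pairs co-occurs in 2 transactions
```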
Data Mining Task Primitives
• The use of data mining task primitives provides a modular and reusable approach, which can improve the performance, efficiency, and understandability of the data mining process.
Data Mining Task Primitives
• The set of task-relevant data to be mined: It refers to the specific data that is relevant and necessary for a particular task or analysis being conducted using data mining techniques.
• Example: Extracting the database name, database tables, and relevant required attributes from the provided input database.
Data Mining Task Primitives
• Kind of knowledge to be mined: It refers to the type of information or insights that are being sought through the use of data mining techniques.
• Example: It determines the task to be performed on the relevant data in order to mine useful information, such as classification, clustering, prediction, discrimination, outlier detection, or correlation analysis.
Data Mining Task Primitives
• Background knowledge to be used in the discovery process: It refers to any prior information or understanding that is used to guide the data mining process.
• Example: The use of background knowledge such as concept hierarchies and user beliefs about relationships in the data, in order to evaluate patterns and mine more efficiently.
Data Mining Task Primitives
• Interestingness measures and thresholds for pattern evaluation: It refers to the methods and criteria used to evaluate the quality and relevance of the patterns or insights discovered through data mining.
• Example: Evaluating patterns with interestingness measures such as utility, certainty, and novelty, and setting an appropriate threshold value for pattern evaluation.
Data Mining Task Primitives
• Representation for visualizing the discovered patterns: It refers to the methods used to represent the patterns. Visualization techniques such as charts, graphs, and maps are commonly used to represent the data.

• Example: Presentation and visualization of the discovered pattern data using various visualization techniques such as bar plots, charts, graphs, tables, etc.
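Taken together, the five primitives form the specification a user hands to a mining system. A sketch of what such a specification might look like as a plain Python dictionary (all names and values here are hypothetical):

```python
# Hypothetical data mining task specification built from the five primitives.
mining_task = {
    "task_relevant_data": {
        "database": "sales_db",                # hypothetical database name
        "tables": ["customers", "orders"],
        "attributes": ["age", "income", "region"],
    },
    "kind_of_knowledge": "classification",      # e.g., clustering, prediction, ...
    "background_knowledge": {
        "concept_hierarchy": {"city": "state", "state": "country"},
    },
    "interestingness": {"measure": "certainty", "threshold": 0.7},
    "presentation": ["table", "bar_plot"],
}

print(mining_task["kind_of_knowledge"])
```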
Major issues in Data Mining
Data Preprocessing in Data Mining
• Data Preprocessing
• Data preprocessing is the process of transforming raw data into an understandable format.
• It is an important step in data mining, as we cannot work with raw data.
• Why is data preprocessing important?
• Preprocessing is mainly about checking data quality. Quality can be checked along the following dimensions:
• Accuracy: whether the data entered is correct or not.
• Completeness: whether all required data is recorded and available.
• Consistency: whether the same data is kept consistently in all the places where it is stored.
• Timeliness: whether the data is updated on time.
• Believability: whether the data is trustworthy.
• Interpretability: how understandable the data is.
Major Tasks in Data Preprocessing
• Data cleaning
• Data integration
• Data reduction
• Data transformation
Data Cleaning in Data Mining
• Data cleaning is the process of preparing raw data for analysis by removing bad data, organizing the raw data, and filling in the null values. Ultimately, cleaning the data prepares it for the data mining step, where the most valuable information can be pulled from the data set.
1. Missing Data
a) Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
Missing Data
b) Fill in the missing values:
There are various ways to do this task. You can choose to fill in the missing values manually, for example by following the evident pattern in each attribute, as in the table below (missing cells shown as ?):

Before filling                         After filling
Age  Experience  Salary  Purchased     Age  Experience  Salary  Purchased
25   ?           50      0             25   1           50      0
27   3           ?       1             27   3           80      1
29   5           110     1             29   5           110     1
31   7           140     0             31   7           140     0
33   9           170     1             33   9           170     1
?    11          200     0             35   11          200     0
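The same fill step as a brief pandas sketch; the hand-picked fill values follow each attribute's evident pattern, and the column mean is noted as a common automatic alternative:

```python
import pandas as pd

df = pd.DataFrame({
    "Age":        [25, 27, 29, 31, 33, None],
    "Experience": [None, 3, 5, 7, 9, 11],
    "Salary":     [50, None, 110, 140, 170, 200],
    "Purchased":  [0, 1, 1, 0, 1, 0],
})

# Fill each missing cell with a hand-picked value following the attribute's
# pattern (as in the table above); a common automatic alternative is the
# column mean: df.fillna(df.mean()).
filled = df.fillna({"Age": 35, "Experience": 1, "Salary": 80})
print(filled)
```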
Noisy Data
• Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc.
• Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are applied to each segment, such as replacing values with the bin mean or with the bin boundaries.
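A minimal sketch of smoothing by bin means, assuming equal-size bins of three values over a small sorted list:

```python
# Sorted noisy values (hypothetical).
data = [4, 8, 9, 15, 21, 21, 24, 25, 26]

bin_size = 3
smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    # Smoothing by bin means: every value in the bin becomes the bin mean.
    smoothed.extend([round(mean, 1)] * len(bin_values))

print(smoothed)  # [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]
```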
Regression

Here, data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
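A short sketch of smoothing by linear regression with NumPy: the noisy y-values are replaced by values predicted from the fitted line (the data are hypothetical):

```python
import numpy as np

# Hypothetical noisy observations.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.3])

# Fit a degree-1 polynomial (a line) and replace y with the fitted values.
slope, intercept = np.polyfit(x, y, 1)
y_smooth = slope * x + intercept
print(np.round(y_smooth, 2))
```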
Clustering:

This approach groups similar data into clusters. Outliers either go undetected or fall outside the clusters.
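A hedged sketch (assuming scikit-learn is available): after k-means clustering, points that end up alone in a tiny cluster can be flagged as outliers. The data and cluster count are illustrative:

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

# Hypothetical 2-D points; the last one is far from everything else.
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1], [8, 8], [8.2, 7.9], [20, 1]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
sizes = Counter(kmeans.labels_)

# Points falling into a singleton cluster are flagged as outliers.
outliers = X[[sizes[lbl] == 1 for lbl in kmeans.labels_]]
print(outliers)  # expected: [[20.  1.]]
```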
Data Transformation in Data Mining

Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0.
It is generally useful for classification algorithms. Attributes on very different scales may lead to poor data models while performing data mining operations, so they are normalized to bring all the attributes onto the same scale.
Methods of Data Normalization

1) Decimal Scaling
2) Min-Max Normalization
3) Z-Score Normalization (zero-mean normalization)
Decimal Scaling Method
It normalizes by moving the decimal point of the values of the data: each value is divided by a power of ten, 10^j, chosen from the largest absolute value in the data. A data value v_i is normalized to v_i' using the formula:

v_i' = v_i / 10^j

where j is the smallest integer such that max(|v_i'|) < 1.

Example –
Let the input data be: -10, 201, 301, -401, 501, 601, 701
To normalize the above data:
Step 1: The maximum absolute value in the given data is 701.
Step 2: Divide the given data by 1000 (i.e., j = 3, since 701 / 10^3 = 0.701 < 1).
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701
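The same computation as a small Python sketch:

```python
import math

data = [-10, 201, 301, -401, 501, 601, 701]

# Smallest j such that max(|v / 10**j|) < 1: count the digits of the
# largest absolute value.
j = math.ceil(math.log10(max(abs(v) for v in data) + 1))
normalized = [v / 10**j for v in data]
print(j, normalized)  # 3 [-0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701]
```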
Min-Max Normalization

Min-max normalization performs a linear transformation on the original data.
Suppose that minA and maxA are the minimum and maximum values of an attribute A.
Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA] by computing:

v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
Example: Min-Max Normalization

• Suppose the minimum and maximum values for the attribute income are $12,000 and $98,000, and we want to map income to the range [0.0, 1.0].
• By min-max normalization, a value of $73,600 for income is transformed to
(73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 = 0.716
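Checking the arithmetic in Python:

```python
def min_max(v, v_min, v_max, new_min=0.0, new_max=1.0):
    # Linear rescaling of v from [v_min, v_max] to [new_min, new_max].
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
```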
Z-Score Normalization
• In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on the mean of A and its standard deviation.
• A value v of attribute A is normalized to v' by computing:

v' = (v - mean_A) / std_A
Example: Z-Score Normalization

Suppose the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively.
With z-score normalization, a value of $73,600 for income is transformed to
(73,600 - 54,000) / 16,000 = 1.225
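And the same check for z-score normalization:

```python
def z_score(v, mean, std):
    # Number of standard deviations v lies from the mean.
    return (v - mean) / std

print(z_score(73_600, 54_000, 16_000))  # 1.225
```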
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
Discretization:
This is done to replace the raw values of a numeric attribute with interval labels or conceptual labels.
Concept Hierarchy Generation:
Here, attributes are converted from a lower level to a higher level in a hierarchy. For example, the attribute "city" can be generalized to "country".
Data Reduction in Data Mining

• Data reduction techniques aim to increase storage efficiency and reduce data storage and analysis costs.
• Data Cube Aggregation:
Aggregation operations are applied to the data to construct a data cube.

• Attribute Subset Selection: Attribute subset Selection is a
technique which is used for data reduction in data mining
process. Data reduction reduces the size of data so that it can be
used for analysis purposes more efficiently.
• Methods of Attribute Subset Selection-
1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward
Elimination.
4. Decision Tree Induction
• Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set; at each step, the best remaining attribute is added.
• Initial attribute set: {X1, X2, X3, X4, X5, X6}; initial reduced attribute set: { }
• Step 1: {X1}
• Step 2: {X1, X2}
• Step 3: {X1, X2, X5}
• Final reduced attribute set: {X1, X2, X5}
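A sketch of the greedy loop behind stepwise forward selection; the scoring function here is a stand-in (real implementations score a model trained on each candidate subset):

```python
def forward_selection(attributes, score, n_select):
    """Greedy stepwise forward selection.

    attributes: list of attribute names, e.g. ["X1", ..., "X6"]
    score: maps a list of attributes to a quality value (higher is better);
           in practice, a model evaluated on that subset.
    """
    selected = []
    while len(selected) < n_select:
        remaining = [a for a in attributes if a not in selected]
        # Add the attribute whose inclusion gives the best score.
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
    return selected

# Illustrative scoring: pretend we already know each attribute's usefulness.
toy_scores = {"X1": 5, "X2": 3, "X5": 2, "X3": 1, "X4": 1, "X6": 0}
result = forward_selection(
    ["X1", "X2", "X3", "X4", "X5", "X6"],
    score=lambda subset: sum(toy_scores[a] for a in subset),
    n_select=3,
)
print(result)  # ['X1', 'X2', 'X5'], matching the example above
```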
• Stepwise Backward Elimination: Here, all the attributes are included in the initial set, and the worst attribute is removed at each step.
• Initial attribute set: {X1, X2, X3, X4, X5, X6}; initial reduced attribute set: {X1, X2, X3, X4, X5, X6}
• Step 1: {X1, X2, X3, X4, X5}
• Step 2: {X1, X2, X3, X5}
• Step 3: {X1, X2, X5}
• Final reduced attribute set: {X1, X2, X5}
• Combination of Forward Selection and Backward Elimination: Stepwise forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most common technique for attribute selection.
Decision Tree Induction
• Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flowchart-like structure with nodes denoting tests on attributes; attributes that do not appear in the tree are considered irrelevant.
Numerosity Reduction
• Numerosity reduction is a data reduction technique that replaces the original data with a smaller form of data representation, e.g., a fitted model or a sample of the data.
Dimensionality Reduction:
• Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. It is a very important stage of data preprocessing.
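For example, a hedged sketch of dimensionality reduction with PCA from scikit-learn (assuming it is installed); the 3-attribute data points are hypothetical and are projected onto 2 principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 5 samples with 3 correlated attributes.
X = np.array([
    [2.5, 2.4, 4.9],
    [0.5, 0.7, 1.2],
    [2.2, 2.9, 5.1],
    [1.9, 2.2, 4.1],
    [3.1, 3.0, 6.1],
])

# Project onto the 2 directions of greatest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (5, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept
```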
Discretization in Data Mining

• Data discretization refers to a method of converting a huge number of data values into smaller ones, so that the evaluation and management of the data become easy.
• There are two forms of data discretization:
• the first is supervised discretization, and
• the second is unsupervised discretization.
• Supervised discretization refers to a method in which the class data is used.
• Unsupervised discretization refers to a method that depends on the way in which the operation proceeds:
• it works either with a top-down splitting strategy or a bottom-up merging strategy.
Example

• Suppose we have an attribute Age with the given values. Discretization can replace the raw ages with interval labels such as "young", "middle-aged", and "senior".
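A brief sketch with pandas; the age values and bin edges are hypothetical:

```python
import pandas as pd

ages = pd.Series([13, 18, 25, 33, 47, 52, 66, 70])

# Discretize raw ages into three conceptual labels.
labels = pd.cut(ages, bins=[0, 21, 55, 120], labels=["young", "middle-aged", "senior"])
print(labels.tolist())
# ['young', 'young', 'middle-aged', 'middle-aged', 'middle-aged',
#  'middle-aged', 'senior', 'senior']
```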
Feature Extraction
The process of feature extraction is useful when you need to reduce the number of resources needed for processing without losing important or relevant information. Feature extraction aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features). This new, reduced set of features should be able to summarize most of the information contained in the original set of features.

Example:
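A minimal sketch: a hypothetical raw sensor signal of twelve readings is reduced to three summary features, and the raw values are then discarded:

```python
import statistics

# Hypothetical raw signal: 12 readings reduced to 3 summary features.
signal = [0.2, 0.4, 0.35, 0.9, 1.1, 0.95, 0.3, 0.25, 0.5, 0.45, 1.0, 0.8]

features = {
    "mean": statistics.mean(signal),
    "stdev": statistics.stdev(signal),
    "peak": max(signal),
}
print(features)  # the 3 extracted features stand in for the 12 raw readings
```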
Feature Selection
• Feature selection refers to the process of reducing the inputs for processing and analysis, or of finding the most meaningful inputs. A related term, feature engineering (or feature extraction), refers to the process of extracting useful information or features from existing data.
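A hedged sketch of feature selection with scikit-learn (assuming it is installed): keep the k features most associated with the target according to a univariate score. The toy data are illustrative:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: 6 samples, 4 features; only some features relate to the label y.
X = np.array([
    [1.0, 5.2, 0.1, 9.0],
    [1.1, 4.8, 7.3, 9.1],
    [0.9, 5.1, 0.2, 8.9],
    [3.0, 5.0, 7.1, 9.2],
    [3.2, 4.9, 0.3, 9.0],
    [2.9, 5.3, 7.4, 9.1],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Select the 2 features with the highest ANOVA F-score against y.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the kept features
print(X_selected.shape)        # (6, 2)
```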
Feature Construction
Feature construction involves transforming a given set of input features to generate a new set of more powerful features, which are then used for prediction. This may be done either to compress the dataset by reducing the number of features or to improve prediction performance.