UNIT-III-PART-1
DATA STORAGE FOR BUSINESS INTELLIGENCE
Data Mining Concepts and Applications
• Data mining is a term used to describe discovering or "mining" knowledge from large amounts of data.
• Many other names that are associated with data mining include knowledge extraction, pattern analysis, data archaeology, information harvesting, pattern searching, and data dredging.
• Technical Definition:
– Data mining is a process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge (or patterns) from large sets of data.
• Data mining has also been defined as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases."
• Process implies that data mining comprises many iterative steps.
• Nontrivial means that some experimentation-type search or inference is involved; that is, it is not as straightforward as a computation of predefined quantities.
• Valid means that the discovered patterns should hold true on new data with a sufficient degree of certainty.
• Novel means that the patterns are not previously known to the user within the context of the system being analyzed.
• Potentially useful means that the discovered patterns should lead to some benefit to the user or task.
• Ultimately understandable means that the pattern should make business sense that leads the user to make decisions.
Characteristics and Objectives of Data Mining
• Data are often buried deep within very large databases, which sometimes contain data from several years. In many cases, the data are cleansed and consolidated into a data warehouse.
• The data mining environment is usually a client/server architecture or a Web-based information systems architecture.
• The miner is often an end user, empowered by power query tools to ask ad hoc questions and obtain answers quickly, with little or no programming skill.
• Data mining tools are readily combined with spreadsheets and other software development tools. Thus, the mined data can be analyzed and deployed quickly and easily.
• Because of the large amounts of data and massive search efforts, it is sometimes necessary to use parallel processing for data mining.
• Sophisticated new tools, including advanced visualization tools, help to remove the information "ore" buried in corporate files.
Taxonomy of Data in Data Mining
Blend of Multiple Disciplines
Attribute Types
• Nominal: values have no meaningful order; they are categories, states, or "names of things"
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– Marital status, occupation, ID numbers, zip codes
• Binary: a nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes are equally important (e.g., gender)
– Asymmetric binary: outcomes are not equally important (e.g., a medical test: positive vs. negative)
– Convention: assign 1 to the most important outcome (e.g., HIV positive)
• Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
– Size = {small, medium, large}, grades, army rankings
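To make these attribute types concrete, here is a small illustrative sketch in Python (the column names and values are invented for this example, and pandas is assumed to be available):

```python
import pandas as pd

# Invented records illustrating the attribute types described above
df = pd.DataFrame({
    "hair_color": ["black", "blond", "brown"],   # nominal: no meaningful order
    "hiv_test":   [1, 0, 0],                     # asymmetric binary: 1 = most important outcome
    "size":       ["small", "large", "medium"],  # ordinal: ranked, magnitude unknown
})

# Nominal attribute: unordered categories
df["hair_color"] = pd.Categorical(df["hair_color"])

# Ordinal attribute: the meaningful order is declared explicitly
df["size"] = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True
)

print(df.dtypes)
print(df["size"].min())  # "small" -- comparisons respect the declared order
```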
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
– Measured on a scale of equal-sized units
– Allows us to compare and quantify the difference between values
– E.g., temperature in C˚ or F˚, calendar dates
• Ratio
– A measurement is ratio-scaled if a value can be expressed as a multiple (or ratio) of another value
– E.g., temperature in Kelvin, length, counts, monetary quantities
Discrete vs. Continuous Attributes
• Discrete attribute
– Has only a finite or countably infinite set of values (e.g., zip codes, profession, or the set of words in a collection of documents)
– Sometimes represented as integer variables
– Note: binary attributes are a special case of discrete attributes
• Continuous attribute
– Has real numbers as attribute values (e.g., temperature, height, or weight)
– Practically, real values can only be measured and represented using a finite number of digits
– Typically represented as floating-point variables
How Data Mining Works?
• Data mining builds models to identify patterns among the attributes presented in the data set.
• Models are the mathematical representations (simple linear relationships and/or complex highly nonlinear relationships) that identify the patterns among the attributes of the objects (e.g., customers) described in the data set.
Patterns
• Associations find the commonly co-occurring groupings of things, such as bread and jam going together in market-basket analysis.
• Predictions tell the nature of future occurrences of certain events based on what has happened in the past.
• Clusters identify natural groupings of things based on their known characteristics, such as assigning customers to different segments based on their demographics and past purchase behaviors.
• Sequential relationships discover time-ordered events (where the values are delivered in a sequence).
Taxonomy for Data Mining Tasks
Data Mining Versus Statistics
• Statistics starts with a well-defined proposition and hypothesis, while data mining starts with a loosely defined discovery statement.
• Statistics collects sample data to test the hypothesis, while data mining and analytics use all of the existing data to discover novel patterns and relationships.
• Data mining looks for data sets that are as "big" as possible, while statistics looks for the right size of data.
VISUALIZATION AND TIME-SERIES FORECASTING
• Visualization can be used in conjunction with other data mining techniques to gain a clearer understanding of underlying relationships.
• In time-series forecasting, the data consist of values of the same variable captured and stored over time at regular intervals.
DATA MINING APPLICATIONS
• Customer relationship management
• Banking
• Retailing and logistics
• Manufacturing and production
• Insurance
• Computer hardware and software
• Travel industry
• Entertainment industry
• Sports, and so on
DATA MINING PROCESS
CRISP-DM
• Business Understanding: The key element of any data mining study is to know what the study is for.
• Data Understanding: The most relevant data can be identified (quantitative and qualitative).
• Data Preprocessing: an extraction and transformation (ET) process.
• Model Building: The assessment and comparative analysis of the various models built (because there is no universally known best method or algorithm for a data mining task). Depending on the business need, the data mining task can be of a prediction (either classification or regression), an association, or a clustering type.
Data Preparation
• Testing and Evaluation: The developed models are assessed and evaluated for their accuracy and generality. This step assesses the degree to which the selected model meets the business objectives and, if so, to what extent.
• Deployment: The deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. The deployment step may also include maintenance activities.
SEMMA (Others)
Classification (Data Mining Methods)
• Classification learns patterns from past data in order to place new instances (with unknown labels) into their respective groups or classes (e.g., good or bad credit risk; yes or no; "sunny," "rainy," or "cloudy").
• Then, what is Regression? Classification predicts class labels, whereas regression predicts numerical values.
• The most common two-step methodology of classification-type prediction involves model development/training and model testing/deployment.
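As a sketch of this two-step methodology (the data set, classifier, and split ratio here are illustrative choices, assuming scikit-learn is available):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # a stand-in for past, labeled data

# Step 1: model development/training on a portion of the labeled data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Step 2: model testing -- place held-out instances into predicted classes
y_pred = model.predict(X_test)
print("Predictive accuracy:", accuracy_score(y_test, y_pred))
```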
• Several factors are considered in assessing the model, including the following:
– Predictive accuracy
– Speed
– Robustness
– Scalability
– Interpretability
Confusion Matrix for Tabulation of Two-Class Classification
Estimating the True Accuracy of Classification Models
Estimation Methodologies
SIMPLE SPLIT
k-FOLD CROSS-VALIDATION
• Also called rotation estimation: the complete data set is randomly split into k mutually exclusive subsets of approximately equal size.
• The classification model is trained and tested k times: each time it is trained on all but one fold and then tested on the remaining single fold.
• The cross-validation estimate of the overall accuracy of a model is calculated by simply averaging the k individual accuracy measures.
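A minimal sketch of k-fold cross-validation (k = 10, the data set, and the classifier are illustrative choices, assuming scikit-learn is available):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Train and test the model k times: each run trains on k-1 folds
# and tests on the single remaining fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

# The cross-validation estimate is the average of the k accuracy measures
print("Fold accuracies:", scores)
print("Cross-validation accuracy estimate:", scores.mean())
```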
Decision Tree
Information Gain
• Information gain is the splitting mechanism used in ID3. The basic idea behind ID3 (and its variants) is to use a concept called entropy.
• Entropy measures the extent of uncertainty or randomness in a data set. If all the data in a subset belong to just one class, there is no uncertainty or randomness in that data set, so the entropy is zero.
• The objective of this approach is to build subtrees so that the entropy of each final subset is zero (or close to zero).
Attribute Selection Measures
• Information Gain:
– The expected information needed to classify a tuple in D (the data set) is given by
Info(D) = −Σᵢ pᵢ log₂(pᵢ), where pᵢ is the probability that a tuple in D belongs to class Cᵢ.
– How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by
Info_A(D) = Σⱼ (|Dⱼ| / |D|) × Info(Dⱼ), summed over the v subsets produced by partitioning D on attribute A.
– The information gain from partitioning on A is then Gain(A) = Info(D) − Info_A(D).
• Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits.
• Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
Gini Index
• The Gini index has been used in economics to measure the diversity of a population. The same concept can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute or variable.
• The best split is the one that increases the purity of the sets resulting from a proposed split.
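To make these measures concrete, here is a small from-scratch sketch (the class counts below are invented for illustration and mirror a classic 14-tuple, two-class example split by a hypothetical attribute):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)); zero when the subset is pure."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index = 1 - sum(p_i^2); also zero for a pure subset."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, partitions):
    """Gain(A) = Info(D) - sum(|Dj|/|D| * Info(Dj)) over the partitions."""
    n = len(parent)
    return entropy(parent) - sum(len(p) / n * entropy(p) for p in partitions)

# Invented data set D: 14 tuples, 9 "yes" and 5 "no"
D = ["yes"] * 9 + ["no"] * 5
# Hypothetical attribute splitting D into three partitions
partitions = [
    ["yes"] * 2 + ["no"] * 3,
    ["yes"] * 4,
    ["yes"] * 3 + ["no"] * 2,
]

print("Info(D):", round(entropy(D), 3))                      # ~0.940 bits
print("Gain(attribute):", round(information_gain(D, partitions), 3))
print("Gini(D):", round(gini(D), 3))
```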
Cluster Analysis for Data Mining
• Cluster analysis is an essential data mining method for classifying items, events, or concepts into common groupings called clusters.
• Cluster analysis has been used extensively for fraud detection (both credit card and e-commerce fraud).
DETERMINING THE OPTIMAL NUMBER OF CLUSTERS
• Look at the percentage of variance explained as a function of the number of clusters; that is, choose a number of clusters so that adding another cluster would not give much better modeling of the data.
• Set the number of clusters to (n/2)^0.5, where n is the number of data points.
• Use the Akaike Information Criterion (AIC), a measure of goodness of fit (based on the concept of entropy), to determine the number of clusters.
• Use the Bayesian Information Criterion (BIC), a model-selection criterion (based on maximum likelihood estimation), to determine the number of clusters.
ANALYSIS METHODS
• Cluster analysis may be based on one or more of the following general methods:
– Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
– Neural networks
– Fuzzy logic (e.g., the fuzzy c-means algorithm)
– Genetic algorithms
• Each of these methods generally works with one of two general method classes:
– Divisive: all items start in one cluster and are broken apart.
– Agglomerative: all items start in individual clusters, and the clusters are joined together.
• Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items.
K-MEANS CLUSTERING ALGORITHM
Step 1: Choose the number of clusters (i.e., the value of k).
Step 2: Randomly generate k points as initial cluster centers.
Step 3: Assign each point to the nearest cluster center.
Step 4: Recompute the new cluster centers.
Repetition step: Repeat Steps 3 and 4 until some convergence criterion is met.
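A compact from-scratch sketch of these steps (a NumPy illustration with invented 2-D data and k = 3, not any particular library's implementation):

```python
import numpy as np

rng = np.random.default_rng(seed=7)
points = rng.random((100, 2))  # invented two-dimensional data

k = 3  # Step 1: choose the number of clusters
# Step 2: pick k random points as initial cluster centers
centers = points[rng.choice(len(points), size=k, replace=False)]

while True:
    # Step 3: assign each point to the nearest cluster center
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Step 4: recompute the new cluster centers (mean of each cluster)
    new_centers = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])

    # Repetition step: stop when the centers no longer move
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print("Final cluster centers:\n", centers)
```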
Association Rule Mining
• Association rule mining aims to find interesting relationships (affinities) between variables (items) in large databases.
• The input to market-basket analysis is simple point-of-sale transaction data, where a number of products and/or services purchased together are tabulated under a single transaction instance.
• The outcome of the analysis is invaluable information that can be used to better understand customer-purchase behavior in order to maximize the profit from business transactions.
• "Are all association rules interesting and useful?"
• Association rule mining uses two common metrics: support and confidence. A rule is expressed as
X ⇒ Y [Supp(%), Conf(%)]
• The support (S) of a collection of products is the measure of how often these products and/or services appear together in the same transaction:
Support(X) = (number of transactions containing X) / (total number of transactions),
where X is the itemset (e.g., a combination of items).
• Confidence assesses the degree of certainty of the detected association:
Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X),
where:
– X is the antecedent (the item(s) on the left side of the rule)
– Y is the consequent (the item(s) on the right side of the rule)
– Support(X ∪ Y) is the support of both X and Y occurring together.
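A small worked sketch of both metrics (the five transactions are invented for illustration):

```python
# Invented point-of-sale transactions, one set of items per transaction
transactions = [
    {"bread", "jam", "milk"},
    {"bread", "jam"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "jam", "eggs"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / N

X, Y = {"bread"}, {"jam"}
supp_xy = support(X | Y)        # Support(X U Y): 3 of 5 transactions
conf = supp_xy / support(X)     # Confidence(X => Y): 3 of the 4 "bread" baskets

print(f"bread => jam [Supp = {supp_xy:.0%}, Conf = {conf:.0%}]")  # 60%, 75%
```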