
UNIT-III-PART-1

DATA STORAGE
FOR
BUSINESS INTELLIGENCE
Data Mining Concepts and Applications

• Data mining is a term used to describe discovering or "mining" knowledge from large amounts of data.
• Many other names that are associated with data mining include knowledge extraction, pattern analysis, data archaeology, information harvesting, pattern searching, and data dredging.
• Technical Definition:
– Data mining is a process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge (or patterns) from large sets of data.

• Data mining is defined as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases."
• Process implies that data mining comprises many iterative steps.

• Nontrivial means that some experimentation-type search or inference is involved; that is, it is not as straightforward as a computation of predefined quantities.

• Valid means that the discovered patterns should hold true on new data with a sufficient degree of certainty.
• Novel means that the patterns are not previously
known to the user within the context of the system
being analyzed.

• Potentially useful means that the discovered patterns should lead to some benefit to the user or task.

• Ultimately understandable means that the pattern should make business sense that leads the user to make decisions.
Characteristics and Objectives of
Data Mining
• Data are often buried deep within very large
databases, which sometimes contain data from several
years. In many cases, the data are cleansed and
consolidated into a data warehouse.

• The data mining environment is usually a client/server architecture or a Web-based information systems architecture.

• The miner is often an end user, empowered by power query tools to ask ad hoc questions and obtain answers quickly, with little or no programming skill.
• Data mining tools are readily combined with spreadsheets and other software development tools. Thus, the mined data can be analyzed and deployed quickly and easily.
• Because of the large amounts of data and
massive search efforts, it is sometimes
necessary to use parallel processing for data
mining.
• Sophisticated new tools, including advanced visualization tools, help to extract the information "ore" buried in corporate files.
Blend of Multiple Disciplines
Taxonomy of Data in Data Mining
Attribute Types
• Nominal: Values have no meaningful order: categories, states, or “names
of things”
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes
• Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV
positive)
• Ordinal
– Values have a meaningful order (ranking) but magnitude between
successive values is not known.
– Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types

• Quantity (integer or real-valued)
• Interval
– Measured on a scale of equal-sized units
– Allows us to compare and quantify the difference between values
– E.g., temperature in C˚ or F˚, calendar dates
• Ratio
– A measurement is ratio-scaled if a value can be expressed as a multiple (or ratio) of another value
– E.g., temperature in Kelvin, length, counts, monetary quantities
Discrete vs. Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a
collection of documents
– Sometimes, represented as integer variables
– Note: Binary attributes are a special case of discrete
attributes
• Continuous Attribute
– Has real numbers as attribute values
• E.g., temperature, height, or weight
– Practically, real values can only be measured and represented
using a finite number of digits
– Continuous attributes are typically represented as
floating-point variables
How Data Mining Works
• Data mining builds models to identify
patterns among the attributes presented in
the data set.
• Models are the mathematical
representations (simple linear relationships
and/ or complex highly nonlinear
relationships) that identify the patterns
among the attributes of the objects (e.g.,
customers) described in the data set.
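To make the idea of a model concrete, here is a minimal sketch (not from the slides) that fits a simple linear relationship between two hypothetical customer attributes using scikit-learn; the attribute names and numbers are invented for illustration.

```python
# A minimal sketch of a "model" as a mathematical representation of a
# pattern: a simple linear relationship fitted with scikit-learn.
# The attributes (income, spending) and values are hypothetical.
from sklearn.linear_model import LinearRegression
import numpy as np

# Each row describes one object (e.g., a customer) by its attributes.
X = np.array([[30], [45], [60], [75]])   # annual income (thousands)
y = np.array([12, 20, 27, 34])           # annual spending (hundreds)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)     # the learned linear pattern
print(model.predict([[50]]))             # apply the pattern to a new object
```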
Patterns

• Associations find the commonly co-occurring groupings of things, such as bread and jam going together in market-basket analysis.
• Predictions tell the nature of future occurrences of certain events based on what has happened in the past.
Patterns
• Clusters identify natural groupings of things
based on their known characteristics, such as
assigning customers in different segments
based on their demographics and past
purchase behaviors.

• Sequential relationships discover time-ordered events (where the values are delivered in a sequence).
Taxonomy for Data Mining Tasks
Data Mining Versus Statistics
• Statistics starts with a well-defined proposition
and hypothesis while data mining starts with a
loosely defined discovery statement.

• Statistics collects sample data to test the hypothesis, while data mining and analytics use all of the existing data to discover novel patterns and relationships.
• Data mining looks for data sets that are as "big" as possible, while statistics looks for the right size of data.
VISUALIZATION AND TIME-SERIES
FORECASTING
• Visualization can be used in conjunction with
other data mining techniques to gain a
clearer understanding of underlying
relationships
• In time-series forecasting, the data consist of values of the same variable that are captured and stored over time at regular intervals.
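As a minimal illustration (not from the slides), the sketch below stores such a series and produces a naive moving-average forecast; the sales figures are invented, and real forecasting would use dedicated methods such as exponential smoothing or ARIMA.

```python
# A minimal sketch of time-series data and a naive moving-average forecast.
# The monthly sales figures are made up for illustration.
monthly_sales = [112, 118, 132, 129, 121, 135, 148, 140]  # same variable, regular intervals

def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    return sum(series[-window:]) / window

print(moving_average_forecast(monthly_sales))  # forecast for the next month
```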
DATA MINING APPLICATIONS
• Customer relationship management.
• Banking
• Retailing and logistics.
• Manufacturing and production
• Insurance
• Computer hardware and software
• Travel industry
• Entertainment industry
• Sports and so on….
DATA MINING PROCESS
CRISP-DM
• Business Understanding: The key element of any data
mining study is to know what the study is for.

• Data Understanding: The most relevant data can be identified (quantitative and qualitative).
• Data Preprocessing: The ETL (extract, transform, load) process.
• Model Building: The assessment and comparative analysis of the various models built (because there is no universally known best method or algorithm for a data mining task). Depending on the business need, the data mining task can be of a prediction (either classification or regression), an association, or a clustering type.
• Testing and Evaluation: The developed models are assessed and evaluated for their accuracy and generality. This step assesses the degree to which the selected model meets the business objectives and, if so, to what extent.
• Deployment: The deployment phase can be as
simple as generating a report or as complex as
implementing a repeatable data mining process
across the enterprise.
The deployment step may also include maintenance activities.
SEMMA (Others)
Classification
(Data Mining Methods)

• Classification learns patterns from past data in order to place new instances (with unknown labels) into their respective groups or classes (e.g., good or bad credit risk; yes or no; "sunny," "rainy," or "cloudy").
• Then, what is regression? Classification predicts class labels, whereas regression predicts numerical values.
• The most common two-step methodology of
classification-type prediction involves model
development/ training and model
testing/deployment.

• Several factors are considered in assessing the model, including the following:
– Predictive accuracy
– Speed
– Robustness
– Scalability
– Interpretability
Confusion Matrix for Tabulation of
Two-Class Classification
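The matrix itself appeared as a figure on the slide; as a hedged illustration, the sketch below lays out the four cells of a two-class confusion matrix (with made-up counts) and the accuracy measures commonly derived from them.

```python
# A minimal sketch of the two-class confusion matrix; the counts are invented.
TP, FN = 85, 15   # actual positives classified correctly / incorrectly
FP, TN = 10, 90   # actual negatives classified incorrectly / correctly

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)          # how many predicted positives are real
recall    = TP / (TP + FN)          # how many real positives were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```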
Estimating the True Accuracy of
Classification Models
Estimation Methodologies
SIMPLE SPLIT:
• The data is partitioned into two mutually exclusive subsets: a training set used to build the model and a test (holdout) set used to measure its accuracy.
k-FOLD CROSS-VALIDATION:
• Also called rotation estimation; the complete data set is randomly split into k mutually exclusive subsets (folds) of approximately equal size.
• The classification model is trained and tested k times: each time it is trained on all but one fold and then tested on the remaining single fold.
• The cross-validation estimate of the overall accuracy of a model is calculated by simply averaging the k individual accuracy measures.
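A minimal sketch of k-fold cross-validation, assuming scikit-learn is available; the iris data set and the decision-tree classifier are placeholders chosen for illustration.

```python
# A minimal k-fold cross-validation sketch using scikit-learn (assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Train and test k = 10 times: each run holds out one fold for testing
# and trains on the remaining k - 1 folds.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)

# The cross-validation estimate is the average of the k accuracy measures.
print(scores.mean())
```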
Decision Tree
Information gain
• Information gain is the splitting mechanism used in ID3. The basic idea behind ID3 (and its variants) is to use a concept called entropy.
• Entropy measures the extent of uncertainty or
randomness in a data set. If all the data in a
subset belong to just one class, there is no
uncertainty or randomness in that data set, so
the entropy is zero.
• The objective of this approach is to build
subtrees so that the entropy of each final subset
is zero (or close to zero).
Attribute Selection Measures
• Information Gain:
– The expected information needed to classify a tuple in D (the dataset) is given by

Info(D) = -Σ_{i=1..m} p_i log2(p_i)

– How much more information would we still need (after partitioning D on attribute A into v subsets) to arrive at an exact classification? This amount is measured by

Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

– The information gain from branching on A is Gain(A) = Info(D) - Info_A(D).
• Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits.
• Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
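As an illustrative sketch in plain Python (the rows and class labels are hypothetical), the code below implements the Info(D) and Gain(A) calculations defined above.

```python
# A minimal sketch of the entropy and information-gain calculations above.
from math import log2
from collections import Counter

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute A."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    info_a = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a

# Toy data: one attribute (index 0) with values "young"/"old", labels yes/no.
rows = [("young",), ("young",), ("old",), ("old",)]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # 1.0 bit: a perfect split
```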
Gini index
• The Gini index has been used in economics to
measure the diversity of a population. The
same concept can be used to determine the
purity of a specific class as a result of a
decision to branch along a particular
attribute or variable.
• The best split is the one that increases the
purity of the sets resulting from a proposed
split.
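A minimal sketch of the Gini idea in plain Python; the label lists and the proposed split are hypothetical.

```python
# A minimal sketch of the Gini index as a purity measure for a split.
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 means the subset is pure."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

parent = ["good", "good", "bad", "bad"]
left, right = ["good", "good"], ["bad", "bad"]   # a proposed split

# The best split is the one that most increases purity (decreases Gini).
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(gini(parent), "->", weighted)   # 0.5 -> 0.0: a perfectly pure split
```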
Cluster Analysis for Data Mining

• Cluster analysis is an essential data mining method for classifying items, events, or concepts into common groupings called clusters.
• Cluster analysis has been used extensively
for fraud detection (both credit card and
e-commerce fraud).
DETERMINING THE OPTIMAL NUMBER OF
CLUSTERS
• Look at the percentage of variance explained as a function of the number of clusters; that is, choose a number of clusters so that adding another cluster would not give much better modeling of the data (the "elbow" heuristic; see the sketch after this list).
• Set the number of clusters to (n/2)^0.5, where n is the number of data points.
• Use the Akaike Information Criterion (AIC), which is a measure of the goodness of fit (based on the concept of entropy), to determine the number of clusters.
• Use the Bayesian Information Criterion (BIC), which is a model-selection criterion (based on maximum likelihood estimation), to determine the number of clusters.
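Here is a minimal sketch of the first heuristic (the variance-explained "elbow"), assuming scikit-learn is available; the data are synthetic blobs generated for illustration.

```python
# A minimal "elbow" sketch: fit k-means for a range of k and watch the
# within-cluster variance stop improving. Data is synthetic.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ = total within-cluster sum of squares; look for the "elbow"
    # where adding another cluster stops helping much.
    print(k, round(km.inertia_, 1))
```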
ANALYSIS METHODS
• Cluster analysis may be based on one or more of the following general
methods:
– Statistical methods (including both hierarchical and
nonhierarchical), such as k-means, k-modes, and so on
– Neural networks
– Fuzzy logic (e.g., fuzzy c-means algorithm)
– Genetic algorithms
• Each of these methods generally works with one of two general method
classes:
– Divisive. With divisive classes, all items start in one cluster and are broken
apart.
– Agglomerative. With agglomerative classes, all items start in individual
clusters, and the clusters are joined together.
Most cluster analysis methods involve the use of a distance measure to
calculate the closeness between pairs of items.
K-MEANS CLUSTERING ALGORITHM

Step 1: Choose the number of clusters (i.e., the value of k).
Step 2: Randomly generate k points as the initial cluster centers.
Step 3: Assign each point to the nearest cluster center.
Step 4: Recompute the new cluster centers.
Repetition step: Repeat steps 3 and 4 until some convergence criterion is met (e.g., the assignments no longer change).
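A minimal pure-Python sketch of these four steps for one-dimensional points; production code would normally use a library implementation such as scikit-learn's KMeans.

```python
# A minimal pure-Python sketch of the k-means steps above, for 1-D points.
import random

def kmeans(points, k, max_iters=100):
    centers = random.sample(points, k)              # Step 2: initial centers
    clusters = []
    for _ in range(max_iters):                      # Repetition step
        clusters = [[] for _ in range(k)]
        for p in points:                            # Step 3: nearest center
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]  # Step 4: recompute
        if new_centers == centers:                  # convergence criterion
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], k=2)  # Step 1: k
print(centers)
```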
Association Rule Mining
• Association rule mining aims to find interesting
relationships (affinities) between variables (items)
in large databases.

• The input to market-basket analysis is simple point-of-sale transaction data, where a number of products and/or services purchased together are tabulated under a single transaction instance.
• The outcome of the analysis is invaluable information that can be used to better understand customer-purchase behavior in order to maximize the profit from business transactions.
• "Are all association rules interesting and useful?

• Association rule mining uses two common


metrics: Support and Confidence.

X=> Y[Supp(%), Conf(%)]


• The support (S) of a collection of products is
the measure of how often these products and/
or services appear together in the same
transaction.

• Confidence, which assesses the degree of


certainty of the detected association.
Where, X is the itemset (e.g., a combination of items).

Where:
X is the antecedent (the item(s) on the left side of the rule).

Y is the consequent (the item(s) on the right side of the rule).

Support(X∪Y) is the support of both X and Y occurring together.
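As a minimal sketch (with invented transactions), the code below computes the support and confidence of a candidate rule exactly as defined above.

```python
# A minimal sketch of support and confidence for a rule X => Y.
# The transactions are made up for illustration.
transactions = [
    {"bread", "jam", "milk"},
    {"bread", "jam"},
    {"bread", "milk"},
    {"jam", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"jam"}
supp = support(X | Y)                 # Support(X U Y)
conf = support(X | Y) / support(X)    # Confidence(X => Y)
print(f"bread => jam [Supp={supp:.0%}, Conf={conf:.0%}]")
```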


APRIORI ALGORITHM
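The slide gives no detail here; as a hedged sketch of the level-wise idea behind Apriori (keep only itemsets meeting a minimum support, then extend the survivors by one item), here is a simplified, unoptimized Python version with invented transactions.

```python
# A simplified level-wise Apriori sketch; transactions are invented.
from itertools import combinations

transactions = [{"bread", "jam", "milk"}, {"bread", "jam"},
                {"bread", "milk"}, {"jam", "milk"}]
min_support = 0.5
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

# Level 1: frequent single items.
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
k = 2
while frequent:
    print(k - 1, [set(s) for s in frequent])
    # Candidate generation: unions of frequent (k-1)-itemsets that have
    # size k. (Full Apriori also prunes candidates containing any
    # infrequent subset; this sketch just re-counts support directly.)
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    k += 1
```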
