UNIT-III-PART-1
DATA STORAGE FOR BUSINESS INTELLIGENCE
Data Mining Concepts and Applications
• Data mining is a term used to describe discovering or "mining" knowledge from large amounts of data.
• Many other names that are associated with data mining include knowledge extraction, pattern analysis, data archaeology, information harvesting, pattern searching, and data dredging.
• Technical Definition:
– Data mining is a process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge (or patterns) from large sets of data.
• Data mining has also been defined as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases."
• Process implies that data mining comprises many iterative steps.
• Nontrivial means that some experimentation-type search or inference is involved; that is, it is not as straightforward as a computation of predefined quantities.
• Valid means that the discovered patterns should hold true on new data with a sufficient degree of certainty.
• Novel means that the patterns are not previously known to the user within the context of the system being analyzed.
• Potentially useful means that the discovered patterns should lead to some benefit to the user or task.
• Ultimately understandable means that the pattern should make business sense that leads the user to make decisions.
Characteristics and Objectives of Data Mining
• Data are often buried deep within very large databases, which sometimes contain data from several years. In many cases, the data are cleansed and consolidated into a data warehouse.
• The data mining environment is usually a client/server architecture or a Web-based information systems architecture.
• The miner is often an end user, empowered by power query tools to ask ad hoc questions and obtain answers quickly, with little or no programming skill.
• Data mining tools are readily combined with spreadsheets and other software development tools. Thus, the mined data can be analyzed and deployed quickly and easily.
• Because of the large amounts of data and massive search efforts, it is sometimes necessary to use parallel processing for data mining.
• Sophisticated new tools, including advanced visualization tools, help to remove the information "ore" buried in corporate files.
Taxonomy of Data in Data Mining
Blend of Multiple Disciplines
Attribute Types
• Nominal: values have no meaningful order; they are categories, states, or "names of things"
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– Marital status, occupation, ID numbers, zip codes
• Binary: a nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes are equally important (e.g., gender)
– Asymmetric binary: outcomes are not equally important (e.g., a medical test: positive vs. negative)
– Convention: assign 1 to the most important outcome (e.g., HIV positive)
• Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
– Size = {small, medium, large}, grades, army rankings
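To make these attribute types concrete, here is a small illustrative sketch in Python (the column names and values are invented for this example, and pandas is assumed to be available):

```python
import pandas as pd

# Invented records illustrating the attribute types described above
df = pd.DataFrame({
    "hair_color": ["black", "blond", "brown"],   # nominal: no meaningful order
    "hiv_test":   [1, 0, 0],                     # asymmetric binary: 1 = most important outcome
    "size":       ["small", "large", "medium"],  # ordinal: ranked, magnitude unknown
})

# Nominal attribute: unordered categories
df["hair_color"] = pd.Categorical(df["hair_color"])

# Ordinal attribute: the meaningful order is declared explicitly
df["size"] = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True
)

print(df.dtypes)
print(df["size"].min())  # "small" -- comparisons respect the declared order
```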
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
– Measured on a scale of equal-sized units
– Allows us to compare and quantify the difference between values
– E.g., temperature in C˚ or F˚, calendar dates
• Ratio
– A measurement is ratio-scaled if a value can be expressed as a multiple (or ratio) of another value
– E.g., temperature in Kelvin, length, counts, monetary quantities
Discrete vs. Continuous Attributes
• Discrete attribute
– Has only a finite or countably infinite set of values (e.g., zip codes, profession, or the set of words in a collection of documents)
– Sometimes represented as integer variables
– Note: binary attributes are a special case of discrete attributes
• Continuous attribute
– Has real numbers as attribute values (e.g., temperature, height, or weight)
– Practically, real values can only be measured and represented using a finite number of digits
– Typically represented as floating-point variables
How Data Mining Works?
• Data mining builds models to identify patterns among the attributes presented in the data set.
• Models are the mathematical representations (simple linear relationships and/or complex highly nonlinear relationships) that identify the patterns among the attributes of the objects (e.g., customers) described in the data set.
Patterns
• Associations find the commonly co-occurring groupings of things, such as bread and jam going together in market-basket analysis.
• Predictions tell the nature of future occurrences of certain events based on what has happened in the past.
• Clusters identify natural groupings of things based on their known characteristics, such as assigning customers to different segments based on their demographics and past purchase behaviors.
• Sequential relationships discover time-ordered events (where the values are delivered in a sequence).
Taxonomy for Data Mining Tasks
Data Mining Versus Statistics
• Statistics starts with a well-defined proposition and hypothesis, while data mining starts with a loosely defined discovery statement.
• Statistics collects sample data to test the hypothesis, while data mining and analytics use all of the existing data to discover novel patterns and relationships.
• Data mining looks for data sets that are as "big" as possible, while statistics looks for the right size of data.
VISUALIZATION AND TIME-SERIES FORECASTING
• Visualization can be used in conjunction with other data mining techniques to gain a clearer understanding of underlying relationships.
• In time-series forecasting, the data consist of values of the same variable captured and stored over time at regular intervals.
DATA MINING APPLICATIONS
• Customer relationship management
• Banking
• Retailing and logistics
• Manufacturing and production
• Insurance
• Computer hardware and software
• Travel industry
• Entertainment industry
• Sports, and so on
DATA MINING PROCESS
CRISP-DM
• Business Understanding: The key element of any data mining study is to know what the study is for.
• Data Understanding: The most relevant data can be identified (quantitative and qualitative).
• Data Preprocessing: an extraction and transformation (ET) process.
• Model Building: The assessment and comparative analysis of the various models built (because there is no universally known best method or algorithm for a data mining task). Depending on the business need, the data mining task can be of a prediction (either classification or regression), an association, or a clustering type.
Data Preparation
• Testing and Evaluation: The developed models are assessed and evaluated for their accuracy and generality. This step assesses the degree to which the selected model meets the business objectives and, if so, to what extent.
• Deployment: The deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. The deployment step may also include maintenance activities.
SEMMA (Others)
Classification (Data Mining Methods)
• Classification learns patterns from past data in order to place new instances (with unknown labels) into their respective groups or classes (e.g., good or bad credit risk; yes or no; "sunny," "rainy," or "cloudy").
• Then, what is Regression? Classification predicts class labels, whereas regression predicts numerical values.
• The most common two-step methodology of classification-type prediction involves model development/training and model testing/deployment.
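As a sketch of this two-step methodology (the data set, classifier, and split ratio here are illustrative choices, assuming scikit-learn is available):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # a stand-in for past, labeled data

# Step 1: model development/training on a portion of the labeled data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Step 2: model testing -- place held-out instances into predicted classes
y_pred = model.predict(X_test)
print("Predictive accuracy:", accuracy_score(y_test, y_pred))
```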
• Several factors are considered in assessing the model, including the following:
– Predictive accuracy
– Speed
– Robustness
– Scalability
– Interpretability
Confusion Matrix for Tabulation of Two-Class Classification
Estimating the True Accuracy of Classification Models
Estimation Methodologies
SIMPLE SPLIT
k-FOLD CROSS-VALIDATION
• Also called rotation estimation: the complete data set is randomly split into k mutually exclusive subsets of approximately equal size.
• The classification model is trained and tested k times: each time it is trained on all but one fold and then tested on the remaining single fold.
• The cross-validation estimate of the overall accuracy of a model is calculated by simply averaging the k individual accuracy measures.
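A minimal sketch of k-fold cross-validation (k = 10, the data set, and the classifier are illustrative choices, assuming scikit-learn is available):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Train and test the model k times: each run trains on k-1 folds
# and tests on the single remaining fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

# The cross-validation estimate is the average of the k accuracy measures
print("Fold accuracies:", scores)
print("Cross-validation accuracy estimate:", scores.mean())
```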
Decision Tree
Information Gain
• Information gain is the splitting mechanism used in ID3. The basic idea behind ID3 (and its variants) is to use a concept called entropy.
• Entropy measures the extent of uncertainty or randomness in a data set. If all the data in a subset belong to just one class, there is no uncertainty or randomness in that data set, so the entropy is zero.
• The objective of this approach is to build subtrees so that the entropy of each final subset is zero (or close to zero).
Attribute Selection Measures
• Information Gain:
– The expected information needed to classify a tuple in D (the data set) is given by
Info(D) = −Σᵢ pᵢ log₂(pᵢ), where pᵢ is the probability that a tuple in D belongs to class Cᵢ.
– How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by
Info_A(D) = Σⱼ (|Dⱼ| / |D|) × Info(Dⱼ), summed over the v subsets produced by partitioning D on attribute A.
– The information gain from partitioning on A is then Gain(A) = Info(D) − Info_A(D).
• Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits.
• Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
Gini Index
• The Gini index has been used in economics to measure the diversity of a population. The same concept can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute or variable.
• The best split is the one that increases the purity of the sets resulting from a proposed split.
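To make these measures concrete, here is a small from-scratch sketch (the class counts below are invented for illustration and mirror a classic 14-tuple, two-class example split by a hypothetical attribute):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)); zero when the subset is pure."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index = 1 - sum(p_i^2); also zero for a pure subset."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, partitions):
    """Gain(A) = Info(D) - sum(|Dj|/|D| * Info(Dj)) over the partitions."""
    n = len(parent)
    return entropy(parent) - sum(len(p) / n * entropy(p) for p in partitions)

# Invented data set D: 14 tuples, 9 "yes" and 5 "no"
D = ["yes"] * 9 + ["no"] * 5
# Hypothetical attribute splitting D into three partitions
partitions = [
    ["yes"] * 2 + ["no"] * 3,
    ["yes"] * 4,
    ["yes"] * 3 + ["no"] * 2,
]

print("Info(D):", round(entropy(D), 3))                      # ~0.940 bits
print("Gain(attribute):", round(information_gain(D, partitions), 3))
print("Gini(D):", round(gini(D), 3))
```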
Cluster Analysis for Data Mining
• Cluster analysis is an essential data mining method for classifying items, events, or concepts into common groupings called clusters.
• Cluster analysis has been used extensively for fraud detection (both credit card and e-commerce fraud).
DETERMINING THE OPTIMAL NUMBER OF CLUSTERS
• Look at the percentage of variance explained as a function of the number of clusters; that is, choose a number of clusters so that adding another cluster would not give much better modeling of the data.
• Set the number of clusters to (n/2)^0.5, where n is the number of data points.
• Use the Akaike Information Criterion (AIC), a measure of goodness of fit (based on the concept of entropy), to determine the number of clusters.
• Use the Bayesian Information Criterion (BIC), a model-selection criterion (based on maximum likelihood estimation), to determine the number of clusters.
ANALYSIS METHODS
• Cluster analysis may be based on one or more of the following general methods:
– Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
– Neural networks
– Fuzzy logic (e.g., the fuzzy c-means algorithm)
– Genetic algorithms
• Each of these methods generally works with one of two general method classes:
– Divisive: all items start in one cluster and are broken apart.
– Agglomerative: all items start in individual clusters, and the clusters are joined together.
• Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items.
K-MEANS CLUSTERING ALGORITHM
Step 1: Choose the number of clusters (i.e., the value of k).
Step 2: Randomly generate k points as initial cluster centers.
Step 3: Assign each point to the nearest cluster center.
Step 4: Recompute the new cluster centers.
Repetition step: Repeat Steps 3 and 4 until some convergence criterion is met.
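A compact from-scratch sketch of these steps (a NumPy illustration with invented 2-D data and k = 3, not any particular library's implementation):

```python
import numpy as np

rng = np.random.default_rng(seed=7)
points = rng.random((100, 2))  # invented two-dimensional data

k = 3  # Step 1: choose the number of clusters
# Step 2: pick k random points as initial cluster centers
centers = points[rng.choice(len(points), size=k, replace=False)]

while True:
    # Step 3: assign each point to the nearest cluster center
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Step 4: recompute the new cluster centers (mean of each cluster)
    new_centers = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])

    # Repetition step: stop when the centers no longer move
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print("Final cluster centers:\n", centers)
```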
Association Rule Mining
• Association rule mining aims to find interesting relationships (affinities) between variables (items) in large databases.
• The input to market-basket analysis is simple point-of-sale transaction data, where a number of products and/or services purchased together are tabulated under a single transaction instance.
• The outcome of the analysis is invaluable information that can be used to better understand customer-purchase behavior in order to maximize the profit from business transactions.
• "Are all association rules interesting and useful?"
• Association rule mining uses two common metrics: support and confidence. A rule is expressed as
X ⇒ Y [Supp(%), Conf(%)]
• The support (S) of a collection of products is the measure of how often these products and/or services appear together in the same transaction:
Support(X) = (number of transactions containing X) / (total number of transactions),
where X is the itemset (e.g., a combination of items).
• Confidence assesses the degree of certainty of the detected association:
Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X),
where:
– X is the antecedent (the item(s) on the left side of the rule)
– Y is the consequent (the item(s) on the right side of the rule)
– Support(X ∪ Y) is the support of both X and Y occurring together.
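A small worked sketch of both metrics (the five transactions are invented for illustration):

```python
# Invented point-of-sale transactions, one set of items per transaction
transactions = [
    {"bread", "jam", "milk"},
    {"bread", "jam"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "jam", "eggs"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / N

X, Y = {"bread"}, {"jam"}
supp_xy = support(X | Y)        # Support(X U Y): 3 of 5 transactions
conf = supp_xy / support(X)     # Confidence(X => Y): 3 of the 4 "bread" baskets

print(f"bread => jam [Supp = {supp_xy:.0%}, Conf = {conf:.0%}]")  # 60%, 75%
```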