Notes_Data_Warehouse
The term "Data Warehouse" was first coined by Bill Inmon in 1990. He said that a data
warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of
data. This data helps in supporting the decision-making process of analysts in an organization.
The operational database undergoes day-to-day transactions, which cause frequent
changes to the data on a daily basis. But if in future a business executive wants to analyse
previous data about a product, a supplier, or a consumer, the analyst will have no
data available to analyse, because the previous data has been updated due to transactions.
A data warehouse provides generalized and consolidated data in a multidimensional
view. Along with this generalized and consolidated view of data, a data warehouse
also provides Online Analytical Processing (OLAP) tools. These tools help in the
interactive and effective analysis of data in a multidimensional space. This analysis results
in data generalization and data mining.
Data mining functions such as association, clustering, classification and prediction can be
integrated with OLAP operations to enhance interactive mining of knowledge at multiple
levels of abstraction. That is why the data warehouse has now become an important
platform for data analysis and online analytical processing.
Understanding Data Warehouse
A data warehouse is a database that is kept separate from the
organization's operational database.
There is no frequent updating done in a data warehouse.
A data warehouse possesses consolidated historical data, which helps the
organization to analyse its business.
A data warehouse helps executives to organize, understand and use their
data to make strategic decisions.
Data warehouse systems help in the integration of a diversity of
application systems.
A data warehouse system allows the analysis of consolidated historical
data.
A data warehouse is a subject-oriented, integrated, time-variant and
nonvolatile collection of data that supports management's decision-making
process.
In Figure 1-1, the metadata and raw data of a traditional OLTP system is present, as is an
additional type of data, summary data. Summaries are a mechanism to pre-compute
common expensive, long-running operations for sub-second data retrieval. For example,
a typical data warehouse query is to retrieve something such as August sales. A summary
in an Oracle database is called a materialized view.
The consolidated storage of the raw data as the center of your data warehousing
architecture is often referred to as an Enterprise Data Warehouse (EDW). An EDW
provides a 360-degree view into the business of an organization by holding all relevant
business information in the most detailed format.
Data Mart
A data mart is focused on a single functional area of an organization and contains a
subset of data stored in a Data Warehouse.
A data mart is a condensed version of a data warehouse and is designed for use by a
specific department, unit or set of users in an organization, e.g., marketing, sales, HR or
finance. It is often controlled by a single department in an organization.
A data mart usually draws data from only a few sources, compared to a data warehouse.
Data marts are smaller in size and more flexible than a data warehouse.
Implementing a Data Mart is a rewarding but complex procedure. Here are the detailed
steps to implement a Data Mart:
Designing
Designing is the first phase of Data Mart implementation. It covers all the tasks from
initiating the request for a data mart through gathering information about the requirements.
Finally, we create the logical and physical design of the data mart.
Constructing
This is the second phase of implementation. It involves creating the physical database and
the logical structures.
Storage management: An RDBMS stores and manages the data to create, add, and delete
data.
Fast data access: With a SQL query you can easily access data based on certain
conditions/filters (a short sketch follows this list).
Data protection: The RDBMS also offers a way to recover from system failures
such as power failures. It also allows data to be restored from backups in case
the disk fails.
Multiuser support: The data management system offers concurrent access, the ability for
multiple users to access and modify data without interfering or overwriting changes made
by another user.
Security: The RDBMS also provides a way to regulate access by users to objects
and certain types of operations.
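As a small illustration of the points above (in particular the fast data access one), here is a
sketch that stores a few data mart rows in an RDBMS and filters them with a SQL query.
The sales table, its columns and the sample rows are invented for illustration, and SQLite
simply stands in for whichever RDBMS the data mart actually uses.

import sqlite3

# Throwaway in-memory database standing in for the data mart's RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "bread", 120.0), ("South", "milk", 80.0), ("North", "milk", 65.0)],
)

# Fast data access: retrieve rows based on a condition/filter.
for row in conn.execute("SELECT product, amount FROM sales WHERE region = ?", ("North",)):
    print(row)
conn.close()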
Populating:
Populating is the third step. It covers sourcing the data, cleansing and transforming it into
the required format, and loading it into the data mart.
Accessing
Accessing is the fourth step, which involves putting the data to use: querying the data,
creating reports and charts, and publishing them. End users submit queries to the database
and the results of the queries are displayed.
Managing
This is the last step of the Data Mart implementation process. This step covers management
tasks such as:
Ongoing user access management.
System optimizations and fine-tuning to achieve enhanced performance.
Adding and managing fresh data in the data mart.
Planning recovery scenarios and ensuring system availability in case the system
fails.
Data mining can be viewed as a result of the natural evolution of information technology.
The database and data management industry evolved in the development of
several critical functionalities: data collection and database creation, data management
(including data storage and retrieval and database transaction processing), and advanced
data analysis (involving data warehousing and data mining). The early development of
data collection and database creation mechanisms served as a prerequisite for the later
development of effective mechanisms for data storage and retrieval, as well as query and
transaction processing. Nowadays numerous database systems offer query and transaction
processing as common practice. Advanced data analysis has naturally become the next
step. Since the 1960s, database and information technology has evolved systematically
from primitive file processing systems to sophisticated and powerful database systems.
The research and development in database systems since the 1970s progressed from early
hierarchical and network database systems to relational database systems (where data are
stored in relational table structures; see Section 1.3.1), data modeling tools, and indexing
and accessing methods. In addition, users gained convenient and flexible data access
through query languages, user interfaces, query optimization, and transaction
management. Efficient methods for online transaction processing (OLTP), where a query
is viewed as a read-only transaction, contributed substantially to the evolution and wide
acceptance of relational technology as a major tool for efficient storage, retrieval, and
management of large amounts of data.
After the establishment of database management systems, database technology moved
toward the development of advanced database systems, data warehousing, and data
mining for advanced data analysis and web-based databases. Advanced database systems,
for example, resulted from an upsurge of research from the mid-1980s onward. These
systems incorporate new and powerful data models such as extended-relational, object-
oriented, object-relational, and deductive models. Application-oriented database systems
have flourished, including spatial, temporal, multimedia, active, stream and sensor,
scientific and engineering databases, knowledge bases, and office information bases.
Issues related to the distribution, diversification, and sharing of data have been studied
extensively.
Advanced data analysis sprang up from the late 1980s onward. The steady and dazzling
progress of computer hardware technology in the past three decades led to large supplies
of powerful and affordable computers, data collection equipment, and storage media.
This technology provides a great boost to the database and information industry, and it
enables a huge number of databases and information repositories to be available for
transaction management, information retrieval, and data analysis. Data can now be stored
in many different kinds of databases and information repositories.
One emerging data repository architecture is the data warehouse (Section 1.3.2). This is
a repository of multiple heterogeneous data sources organized under a unified schema at
a single site to facilitate management decision making. Data warehouse technology
includes data cleaning, data integration, and online analytical processing (OLAP)—that
is, analysis techniques with functionalities such as summarization, consolidation, and
aggregation, as well as the ability to view information from different angles. Although
OLAP tools support multidimensional analysis and decision making, additional data
analysis tools are required for in-depth analysis—for example, data mining tools that
provide data classification, clustering, outlier/anomaly detection, and the characterization
of changes in data over time.
Huge volumes of data have been accumulated beyond databases and data warehouses.
During the 1990s, the World Wide Web and web-based databases (e.g., XML databases)
began to appear. Internet-based global information bases, such as the WWW and various
kinds of interconnected, heterogeneous databases, have emerged and play a vital role in
the information industry. The effective and efficient analysis of data from such different
forms of data by integration of information retrieval, data mining, and information
network analysis technologies is a challenging task. In summary, the abundance of data,
coupled with the need for powerful data analysis tools, has been described as a data rich
but information poor situation (Figure 1.2). The fast-growing, tremendous amount of
data, collected and stored in large and numerous data repositories, has far exceeded our
human ability for comprehension without powerful tools. As a result, data collected in
large data repositories become “data tombs”—data archives that are seldom visited.
Consequently, important decisions are often made based not on the information-rich data
stored in data repositories but rather on a decision maker’s intuition, simply because the
decision maker does not have the tools to extract the valuable knowledge embedded in
the vast amounts of data. Efforts have been made to develop expert system and
knowledge-based technologies, which typically rely on users or domain experts to
manually input knowledge into knowledge bases. Unfortunately, however, the manual
knowledge input procedure is prone to biases and errors and is extremely costly and time
consuming. The widening gap between data and information calls for the systematic
development of data mining tools that can turn data tombs into “golden nuggets” of
knowledge.
What is Data Mining?
In simple words, data mining is defined as a process used to extract usable data from a
larger set of raw data. It implies analysing data patterns in large batches of data using
one or more software tools. Data mining has applications in multiple fields, like science and
research. As an application of data mining, businesses can learn more about their
customers and develop more effective strategies related to various business functions and
in turn leverage resources in a more optimal and insightful manner. This helps businesses
be closer to their objective and make better decisions. Data mining involves effective data
collection and warehousing as well as computer processing. For segmenting the data and
evaluating the probability of future events, data mining uses sophisticated mathematical
algorithms. Data mining is also known as Knowledge Discovery in Data (KDD).
• Clustering: based on finding and visually documenting groups of facts not previously
known.
If a data mining system is not integrated with a database or a data warehouse system, then
there will be no system to communicate with. This scheme is known as the non-coupling
scheme. In this scheme, the main focus is on data mining design and on developing
efficient and effective algorithms for mining the available data sets.
The list of Integration Schemes is as follows −
No Coupling − In this scheme, the data mining system does not utilize any of the
database or data warehouse functions. It fetches the data from a particular source and
processes that data using some data mining algorithms. The data mining result is stored in
another file.
Loose Coupling − In this scheme, the data mining system may use some of the functions
of the database and data warehouse system. It fetches the data from the data repository
managed by these systems and performs data mining on that data. It then stores the
mining result either in a file or in a designated place in a database or in a data warehouse.
Semi−tight Coupling − In this scheme, the data mining system is linked with a database
or a data warehouse system and in addition to that, efficient implementations of a few
data mining primitives can be provided in the database.
Tight coupling − In this coupling scheme, the data mining system is smoothly integrated
into the database or data warehouse system. The data mining subsystem is treated as one
functional component of an information system.
Data mining is not an easy task, as the algorithms used can get very complex and data is
not always available at one place. It needs to be integrated from various heterogeneous
data sources. These factors also create some issues. The major issues are −
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
The following diagram describes the major issues.
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
Data mining query languages and ad hoc data mining − A data mining query
language that allows the user to describe ad hoc mining tasks should be
integrated with a data warehouse query language and optimized for efficient and
flexible data mining.
Presentation and visualization of data mining results − Once the patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
Handling noisy or incomplete data − Data cleaning methods are required to
handle noise and incomplete objects while mining the data regularities. If
data cleaning methods are not available, the accuracy of the discovered patterns
will be poor.
Performance Issues
There can be performance-related issues, such as the following −
Data Source:
The actual source of data is the Database, data warehouse, World Wide Web
(WWW), text files, and other documents. You need a huge amount of historical
data for data mining to be successful. Organizations typically store data in
databases or data warehouses. Data warehouses may comprise one or more
databases, text files, spreadsheets, or other repositories of data. Sometimes, even
plain text files or spreadsheets may contain information. Another primary source
of data is the World Wide Web or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data must be
cleaned, integrated, and selected. As the information comes from various sources
and in different formats, it can't be used directly for the data mining procedure
because the data may not be complete and accurate. So, the data first needs to
be cleaned and unified. More information than needed will be collected from
various data sources, and only the data of interest will have to be selected and
passed to the server. These procedures are not as easy as we think. Several
methods may be performed on the data as part of selection, integration, and
cleaning.
Database or Data Warehouse Server:
The database or data warehouse server consists of the original data that is ready to
be processed. Hence, the server is responsible for retrieving the relevant data
based on the user's data mining request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It
contains several modules for operating data mining tasks, including association,
characterization, classification, clustering, prediction, time-series analysis, etc.
In other words, we can say the data mining engine is the core of the data mining architecture.
It comprises instruments and software used to obtain insights and knowledge from
data collected from various data sources and stored within the data warehouse.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring how
interesting a discovered pattern is, by using a threshold value. It collaborates with the
data mining engine to focus the search on interesting patterns.
This segment commonly employs interestingness measures that cooperate with the data
mining modules to focus the search towards interesting patterns. It might utilize an
interestingness threshold to filter out discovered patterns. On the other hand, the pattern
evaluation module might be integrated with the mining module, depending on
the implementation of the data mining techniques used. For efficient data mining,
it is highly recommended to push the evaluation of pattern interestingness as deep as
possible into the mining procedure so as to confine the search to only interesting
patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data
mining system and the user. This module helps the user to easily and efficiently
use the system without knowing the complexity of the process. This module
cooperates with the data mining system when the user specifies a query or a task
and displays the results.
Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It might be
helpful to guide the search or evaluate the interestingness of the resulting patterns. The
knowledge base may even contain user views and data from user experiences that
might be helpful in the data mining process. The data mining engine may receive
inputs from the knowledge base to make the result more accurate and reliable.
The pattern assessment module regularly interacts with the knowledge base to get
inputs, and also update it.
A concept hierarchy for a given numeric attribute defines a discretization of the
attribute. Concept hierarchies can be used to reduce the data by collecting and replacing
low-level concepts (such as numeric values for the attribute age) with higher-level concepts
(such as young, middle-aged, or senior). Although detail is lost by such generalization, the
generalized data becomes more meaningful and easier to interpret.
Manual definition of concept hierarchies can be a tedious and time-consuming task for the
user or domain expert. Fortunately, many hierarchies are implicit within the database
schema and can be defined at schema definition level. Concept hierarchies often can be
generated automatically or dynamically refined based on statistical analysis of the data
distribution.
Concept hierarchies for numeric attributes can be constructed automatically based on data
distribution analysis. Five methods for concept hierarchy generation are defined below:
binning, histogram analysis, cluster analysis, entropy-based discretization, and data
segmentation by “natural partitioning”.
Binning:
Attribute values can be discretized by distributing the values into bins and replacing each
bin value by the bin mean or bin median. This technique can be applied
recursively to the resulting partitions in order to generate concept hierarchies.
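A minimal sketch of the binning idea in Python, assuming equal-frequency bins and
smoothing by bin means; the nine sample values are invented for illustration.

import numpy as np

# Sort the values and split them into equal-frequency (equi-depth) bins.
values = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))
bins = np.array_split(values, 3)

# Replace every value in a bin by the bin mean (smoothing by bin means).
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
for b in bins:
    print("bin", b, "-> mean", b.mean())
print("smoothed:", smoothed)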
Histogram Analysis:
Histograms can also be used for discretization. Partitioning rules can be applied to define
ranges of values. The histogram analysis algorithm can be applied recursively to each
partition in order to automatically generate a multilevel concept hierarchy, with the
procedure terminating once a prespecified number of concept levels has been reached.
A minimum interval size can also be used per level to control the recursive procedure. This
specifies the minimum width of a partition, or the minimum number of partitions at
each level.
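A rough sketch of this recursive histogram idea, assuming equal-width bins; the bin count,
recursion depth and random sample data are illustrative choices, not fixed by the text.

import numpy as np

def histogram_hierarchy(values, bins=4, levels=2):
    """Return nested (low, high) intervals built by recursive histogramming."""
    counts, edges = np.histogram(values, bins=bins)
    nodes = []
    for lo, hi, cnt in zip(edges[:-1], edges[1:], counts):
        children = []
        if levels > 1 and cnt > 1:
            inside = values[(values >= lo) & (values < hi)]
            if len(inside) > 1:
                children = histogram_hierarchy(inside, bins, levels - 1)
        nodes.append(((round(float(lo), 2), round(float(hi), 2)), children))
    return nodes

data = np.random.default_rng(0).uniform(0, 100, size=200)
for interval, children in histogram_hierarchy(data):
    print(interval, "->", [c[0] for c in children])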
Cluster Analysis:
A clustering algorithm can be applied to partition data into clusters or groups. Each
cluster forms a node of a concept hierarchy, where all nodes are at the same conceptual
level. Each cluster may be further decomposed into sub-clusters, forming a lower level in
the hierarchy. Clusters may also be grouped together to form a higher level of the concept
hierarchy.
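A short sketch of cluster-based discretization, using k-means from scikit-learn on a
one-dimensional attribute; the salary values and the choice of three clusters are invented
for illustration.

import numpy as np
from sklearn.cluster import KMeans

salaries = np.array([31, 33, 35, 52, 54, 55, 57, 88, 90, 95], dtype=float)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(salaries.reshape(-1, 1))

# Each cluster's value range becomes one node of the concept hierarchy.
for cluster in np.unique(labels):
    members = salaries[labels == cluster]
    print(f"node: [{members.min()}, {members.max()}] with {len(members)} values")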
Segmentation by Natural Partitioning:
Breaking up annual salaries into uniform ranges such as ($50,000-$100,000) is often
more desirable than ranges such as ($51,263.89-$60,765.30) arrived at by cluster analysis.
The 3-4-5 rule can be used to segment numeric data into relatively uniform “natural”
intervals. In general, the rule partitions a given range of data into 3, 4, or 5 relatively
equal-width intervals, recursively and level by level, based on the value range at the most
significant digit. The rule can be recursively applied to each interval, creating a concept
hierarchy for the given numeric attribute. Attributes with tight semantic connections can
be pinned together.
The first limitation of class characterization for multidimensional data analysis in data
warehouses and OLAP tools is the handling of complex objects. The second limitation
is the lack of an automated generalization process: the user must explicitly tell the
system which dimensions should be included in the class characterization and to how high
a level each dimension should be generalized. Actually, the user must specify each step
of generalization or specialization on any dimension.
Usually, it is not difficult for a user to instruct a data mining system regarding how high
a level each dimension should be generalized. For example, users can set
attribute generalization thresholds for this, or specify which level a given dimension
should reach, such as with the command “generalize dimension location to the country
level”. Even without explicit user instruction, a default value such as 2 to 8 can be set by
the data mining system, which would allow each dimension to be generalized to a level
that contains only 2 to 8 distinct values. If the user is not satisfied with the current level
of generalization, she can specify dimensions on which drill-down or roll-up operations
should be applied.
When mining class comparisons, both the target class and the contrasting classes are
explicitly given in the mining query. The relevance analysis should be performed by
comparison of these classes, as we shall see below. However, when mining class
characteristics, there is only one class to be characterized; that is, no contrasting class is
specified. It is therefore not obvious what the contrasting class should be. In this case, the
contrasting class is typically taken to be the set of comparable data in the database that
excludes the set of data to be characterized. For example, to characterize graduate
students, the contrasting class can be composed of the set of undergraduate students.
“How does the information gain calculation work?” Let S be a set of training samples,
where the class label of each sample is known. Each sample is in fact a tuple. One
attribute is used to determine the class of the training samples. For instance, the attribute
status can be used to define the class label of each sample as either “graduate” or
“undergraduate”. Suppose that there are m classes. Let S contain si samples of class Ci,
for i = 1, ..., m. An arbitrary sample belongs to class Ci with probability si/s, where s is
the total number of samples in set S. The expected information needed to classify a given
sample is
I(s1, s2, ..., sm) = − ∑ (si/s) log2(si/s), where the sum is taken over i = 1, ..., m.
An attribute A with values {a1, a2, ..., av} can be used to partition S into the
subsets {S1, S2, ..., Sv}, where Sj contains those samples in S that have value aj of A.
Let Sj contain sij samples of class Ci. The expected information based on this
partitioning by A is known as the entropy of A. It is the weighted average:
E(A) = ∑ ((s1j + s2j + ... + smj)/s) I(s1j, s2j, ..., smj), summed over j = 1, ..., v.
The information gain obtained by partitioning on A is then
Gain(A) = I(s1, s2, ..., sm) − E(A).
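A small sketch of the calculation above in Python, using the status attribute
(graduate / undergraduate) as the class label and an invented attribute A for the split; the
six sample tuples are made up for illustration.

from collections import Counter
from math import log2

def info(class_counts):
    """Expected information I(s1, ..., sm) for the given class counts."""
    s = sum(class_counts)
    return -sum((si / s) * log2(si / s) for si in class_counts if si > 0)

def info_gain(samples, attribute, class_label):
    """Gain(A) = I(s1, ..., sm) - E(A) over a list of dict-shaped samples."""
    total = Counter(row[class_label] for row in samples)
    i_all = info(list(total.values()))
    s = len(samples)
    entropy = 0.0
    for value in {row[attribute] for row in samples}:
        subset = [row for row in samples if row[attribute] == value]
        counts = Counter(row[class_label] for row in subset)
        entropy += (len(subset) / s) * info(list(counts.values()))
    return i_all - entropy

data = [
    {"A": "x", "status": "graduate"}, {"A": "x", "status": "graduate"},
    {"A": "x", "status": "undergraduate"}, {"A": "y", "status": "undergraduate"},
    {"A": "y", "status": "undergraduate"}, {"A": "y", "status": "graduate"},
]
print(round(info_gain(data, "A", "status"), 4))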
Data Collection: Collect data for both the target class and the contrasting class by query
processing. For class comparison, the user in the data-mining query provides both the
target class and the contrasting class. For class characterization, the target class is the
class to be characterized, whereas the contrasting class is the set of comparable data that
are not in the target class.
Preliminary relevance analysis using conservative AOI: This step identifies a set of
dimensions and attributes on which the selected relevance measure is to be applied.
Since different levels of a dimension may have dramatically different relevance with
respect to a given class, each attribute defining the conceptual levels of the dimension
should, in principle, be included in the relevance analysis. Attribute-oriented induction
(AOI) can be used to perform some preliminary relevance analysis on the data by
removing or generalizing attributes having a very large number of distinct values (such as
name and phone#). Such attributes are unlikely to be found useful for concept
description. To be conservative, the AOI performed here should employ attribute
generalization thresholds that are set reasonably large so as to allow more (but not
all) attributes to be considered in further relevance analysis by the selected measure (Step
3 below). The relation obtained by such an application of AOI is called the candidate
relation of the mining task.
Remove irrelevant and weakly relevant attributes using the selected relevance analysis
measure: Evaluate each attribute in the candidate relation using the selected relevance
analysis measure. The relevance measure used in this step may be built into the data
mining system or provided by the user. For example, the information gain measure
described above may be used. The attributes are then sorted (i.e., ranked) according to
their computed relevance to the data mining task. Attributes that are not relevant or are
weakly relevant to the task are then removed. A threshold may be set to define “weakly
relevant.” This step results in an initial target class working relation and an initial
contrasting class working relation.
Generate the concept description using AOI: Perform AOI using a less conservative
set of attribute generalization thresholds. If the descriptive mining task is class
characterization, only the initial target class working relation is included here. If the
descriptive mining task is class comparison, both the initial target class working relation
and the initial contrasting class working relation are included. Note that in this procedure
the induction process is performed twice, that is, in the preliminary relevance
analysis (Step 2) and on the initial working relation (Step 4). The statistics used in attribute
relevance analysis with the selected measure (Step 3) may be collected during the
scanning of the database in Step 2.
Association Rules
In Association Rule Mining, as the name suggests, association rules are simple if/then
statements that help discover relationships between seemingly unrelated data in relational
databases or other data repositories.
Most machine learning algorithms work with numeric datasets and hence tend to be
mathematical. However, association rule mining is suitable for non-numeric, categorical
data and requires just a little bit more than simple counting.
Association rule mining is a procedure which aims to observe frequently occurring
patterns, correlations, or associations from datasets found in various kinds of databases
such as relational databases, transactional databases, and other forms of repositories.
An association rule has 2 parts:
an antecedent (if) and
a consequent (then)
An antecedent is something that’s found in data, and a consequent is an item that is found
in combination with the antecedent. Have a look at this rule for instance:
“If a customer buys bread, he is 70% likely to also buy milk.”
In the above association rule, bread is the antecedent and milk is the consequent. Simply
put, it can be understood as a retail store’s association rule to target their customers
better. If the above rule is a result of a thorough analysis of some data sets, it can be used
to not only improve customer service but also improve the company’s revenue.
Association rules are created by thoroughly analyzing data and looking for frequent
if/then patterns. Then, depending on the following two parameters, the important
relationships are observed:
1. Support: Support indicates how frequently the if/then relationship appears in the
database.
2. Confidence: Confidence tells about the number of times these relationships have been
found to be true.
So, in a given transaction with multiple items, Association Rule Mining primarily tries to
find the rules that govern how or why such products/items are often bought together. For
example, peanut butter and jelly are frequently purchased together because a lot of people
like to make PB&J sandwiches.
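As a quick illustration of support and confidence for the bread/milk rule above, the
following sketch counts them over a tiny invented transaction list.

transactions = [
    {"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread", "butter"},
    {"milk", "eggs"}, {"bread", "milk", "butter"}, {"eggs"},
]

n = len(transactions)
bread = sum(1 for t in transactions if "bread" in t)
bread_and_milk = sum(1 for t in transactions if {"bread", "milk"} <= t)

support = bread_and_milk / n          # how often the if/then pair appears at all
confidence = bread_and_milk / bread   # how often "then" holds when "if" holds
print(f"support = {support:.2f}, confidence = {confidence:.2f}")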
Association Rule Mining is sometimes referred to as “Market Basket Analysis”, as it was
the first application area of association mining. The aim is to discover associations of
items occurring together more often than you’d expect from randomly sampling all the
possibilities. The classic anecdote of Beer and Diaper will help in understanding this
better.
The story goes like this: young American men who go to the stores on Fridays to buy
diapers have a predisposition to grab a bottle of beer too. However unrelated and vague
that may sound to us laymen, association rule mining shows us how and why!
Let’s do a little analytics ourselves, shall we?
Suppose an X store’s retail transactions database includes the following data:
Total number of transactions: 600,000
Transactions containing diapers: 7,500 (1.25 percent)
Transactions containing beer: 60,000 (10 percent)
Transactions containing both beer and diapers: 6,000 (1.0 percent)
From the above figures, we can conclude that if there was no relation between beer and
diapers (that is, they were statistically independent), then we would expect only 10% of
diaper purchasers to also buy beer.
However, as surprising as it may seem, the figures tell us that 80% (=6000/7500) of the
people who buy diapers also buy beer.
This is a significant jump of 8 over what was the expected probability. This factor of
increase is known as Lift – which is the ratio of the observed frequency of co-occurrence
of our items and the expected frequency.
How did we determine the lift?
Simply by calculating the transactions in the database and performing simple
mathematical operations.
So, for our example, one plausible association rule can state that the people who buy
diapers will also purchase beer with a Lift factor of 8. If we talk mathematically, the lift
can be calculated as the ratio of the joint probability of two items x and y, divided by the
product of their probabilities.
Lift = P(x,y)/[P(x)P(y)]
However, if the two items are statistically independent, then the joint probability of the
two items will be the same as the product of their probabilities. Or, in other words,
P(x,y)=P(x)P(y),
which makes the Lift factor = 1. An interesting point worth mentioning here is that anti-
correlation can even yield Lift values less than 1 – which corresponds to mutually
exclusive items that rarely occur together.
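The lift arithmetic above can be reproduced in a few lines; the figures are exactly the ones
from the X store example (600,000 transactions; 7,500 with diapers; 60,000 with beer;
6,000 with both).

total = 600_000
diapers = 7_500
beer = 60_000
both = 6_000

p_diapers = diapers / total      # 0.0125
p_beer = beer / total            # 0.10
p_both = both / total            # 0.01

confidence = both / diapers                  # P(beer | diapers) = 0.80
lift = p_both / (p_diapers * p_beer)         # observed / expected co-occurrence
print(f"confidence = {confidence:.2f}, lift = {lift:.1f}")   # 0.80, 8.0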
Association Rule Mining has helped data scientists find out patterns they never knew
existed.
Let’s look at some areas where Association Rule Mining has helped quite a lot:
1. Market Basket Analysis:
This is the most typical example of association mining. Data is collected using barcode
scanners in most supermarkets. This database, known as the “market basket” database,
consists of a large number of records on past transactions. A single record lists all the
items bought by a customer in one sale. Knowing which groups are inclined towards
which set of items gives these shops the freedom to adjust the store layout and the store
catalog to place products optimally with respect to one another.
2. Medical Diagnosis:
Association rules in medical diagnosis can be useful for assisting physicians for curing
patients. Diagnosis is not an easy process and has a scope of errors which may result in
unreliable end-results. Using relational association rule mining, we can identify the
probability of the occurrence of illness concerning various factors and symptoms.
Further, using learning techniques, this interface can be extended by adding new
symptoms and defining relationships between the new signs and the corresponding
diseases.
3. Census Data:
Every government has tonnes of census data. This data can be used to plan efficient
public services (education, health, transport) as well as help public businesses (for setting
up new factories, shopping malls, and even marketing particular products). This
application of association rule mining and data mining has immense potential in
supporting sound public policy and bringing forth an efficient functioning of a
democratic society.
4. Protein Sequence:
Proteins are sequences made up of twenty types of amino acids. Each protein bears a
unique 3D structure which depends on the sequence of these amino acids. A slight
change in the sequence can cause a change in structure which might change the
functioning of the protein. This dependency of the protein functioning on its amino acid
sequence has been a subject of great research. Earlier it was thought that these sequences
are random, but now it’s believed that they aren’t.
Apriori uses breadth-first search and a hash tree structure to count candidate item sets
efficiently. It generates candidate item sets of length k from item sets of length k − 1.
Then it prunes the candidates which have an infrequent sub-pattern. According to
the downward closure lemma, the candidate set contains all frequent k-length item
sets. After that, it scans the transaction database to determine frequent item sets among
the candidates.
Apriori, while historically significant, suffers from a number of inefficiencies or
trade-offs, which have spawned other algorithms. Candidate generation generates large
numbers of subsets (the algorithm attempts to load up the candidate set with as many as
possible before each scan). Bottom-up subset exploration (essentially a breadth-first
traversal of the subset lattice) finds any maximal subset S only after all of its proper
subsets.
Algorithm Pseudocode
The pseudocode for the algorithm is given below for a transaction database T and a
support threshold of ε. Usual set-theoretic notation is employed, though note that T is a
multiset. Ck is the candidate set for level k. The generate() step is assumed to generate
the candidate sets from the large itemsets of the preceding level, heeding the downward
closure lemma. count[c] accesses a field of the data structure that represents candidate
set c, which is initially assumed to be zero. Many details are omitted below; usually the
most important part of the implementation is the data structure used for storing the
candidate sets and counting their frequencies.

Apriori(T, ε)
    L1 ← {large 1-itemsets}
    k ← 2
    while L(k−1) ≠ ∅
        Ck ← generate(L(k−1))
        for transactions t ∈ T
            Ct ← {c ∈ Ck : c ⊆ t}
            for candidates c ∈ Ct
                count[c] ← count[c] + 1
        Lk ← {c ∈ Ck : count[c] ≥ ε}
        k ← k + 1
    return ⋃k Lk
Example
A large supermarket tracks sales data by stock-keeping unit (SKU) for each item, and
thus is able to know what items are typically purchased together. Apriori is a moderately
efficient way to build a list of frequently purchased item pairs from this data. Let the
database of transactions consist of the sets {1,2,3,4}, {1,2}, {2,3,4}, {2,3}, {1,2,4},
{3,4}, and {2,4}. Each number corresponds to a product such as “butter” or “bread”. The
first step of Apriori is to count up the frequencies, called the support, of each member
item separately:
The table below shows the support of each item:
Item   Support
1      3/7
2      6/7
3      4/7
4      5/7
We can define a minimum support level to qualify as “frequent,” which depends on the
context. For this case, let min support = 3/7. Therefore, all are frequent. The next step is
to generate a list of all pairs of the frequent items. Had any of the above items not been
frequent, they wouldn’t have been included as a possible member of possible pairs. In
this way, Apriori prunes the tree of all possible sets. In next step we again select only
these items (now pairs are items) which are frequent:
Item   Support
{1,2}  3/7
{1,3}  1/7
{1,4}  2/7
{2,3}  3/7
{2,4}  4/7
{3,4}  3/7
The pairs {1,2}, {2,3}, {2,4}, and {3,4} all meet or exceed the minimum support of 3/7.
The pairs {1,3} and {1,4} do not. When we move onto generating the list of all triplets,
we will not consider any triplets that contain {1,3} or {1,4}:
Item     Support
{2,3,4}  2/7
In the example, there are no frequent triplets — {2,3,4} has support of 2/7, which is
below our minimum, and we do not consider any other triplet because they all contain
either {1,3} or {1,4}, which were discarded after we calculated frequent pairs in the
second table.
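To tie the pseudocode and the example together, here is a minimal, illustrative Apriori
sketch in Python; the names apriori, min_support and db are my own, the hash-tree
optimization mentioned earlier is omitted, and running it on the seven transactions above
reproduces the frequent itemsets in the tables.

from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    # Level 1 candidates: all single items.
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}

    while current:
        # One scan of the database to count each candidate's support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)

        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate with an infrequent k-subset (downward closure).
        prev = list(level)
        k = len(prev[0]) + 1 if prev else 0
        current = set()
        for a, b in combinations(prev, 2):
            cand = a | b
            if len(cand) == k and all(frozenset(s) in level
                                      for s in combinations(cand, k - 1)):
                current.add(cand)
    return frequent

db = [{1, 2, 3, 4}, {1, 2}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {3, 4}, {2, 4}]
for itemset, support in sorted(apriori(db, 3 / 7).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(support, 3))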
Multi Dimensional Association Rules
Multidimensional association rules involve two or more dimensions or predicates, for
example buys(X, “computer”) AND age(X, “20...29”) => buys(X, “software”), in contrast
to single-dimensional rules, which repeat a single predicate such as buys.
What is classification?
Following are examples of cases where the data analysis task is classification −
A bank loan officer wants to analyze the data in order to know which customers
(loan applicants) are risky and which are safe.
A marketing manager at a company needs to predict whether a customer with a given
profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for the loan application data and yes or no
for the marketing data.
What is prediction?
Following are examples of cases where the data analysis task is prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we are interested in predicting a numeric
value. Therefore the data analysis task is an example of numeric prediction. In this case, a
model or a predictor will be constructed that predicts a continuous-valued function or
ordered value.
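A brief sketch contrasting the two tasks with scikit-learn: a classifier predicts the
categorical risky/safe label, while a regressor predicts a continuous spend value. The tiny
feature matrix and targets are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Loan-applicant features: [income in $1000s, years employed] (made up).
X = [[25, 1], [80, 10], [40, 3], [95, 12], [30, 0], [70, 8]]
risk = ["risky", "safe", "risky", "safe", "risky", "safe"]   # categorical labels
spend = [120.0, 900.0, 300.0, 1100.0, 150.0, 750.0]          # numeric target

classifier = DecisionTreeClassifier().fit(X, risk)   # classification model
predictor = DecisionTreeRegressor().fit(X, spend)    # numeric prediction model

new_customer = [[60, 5]]
print(classifier.predict(new_customer))  # a class label such as "safe"
print(predictor.predict(new_customer))   # a continuous (ordered) value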
Unit 4
There are many clustering algorithms in the literature. It is difficult to provide a crisp
categorization of clustering methods because these categories may overlap so that a
method may have features from several categories. Nevertheless, it is useful to present a
relatively organized picture of clustering methods. In general, the major fundamental
clustering methods can be classified into the following categories, which are discussed in
the rest of this chapter.
Density-based methods: Most partitioning methods cluster objects based on the distance
between objects. Such methods can find only spherical-shaped clusters and
encounter difficulty in discovering clusters of arbitrary shapes. Other clustering methods
have been developed based on the notion of density. Their general idea is to continue
growing a given cluster as long as the density (number of objects or data points) in the
“neighborhood” exceeds some threshold. For example, for each data point within a given
cluster, the neighborhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise or outliers and discover clusters of
arbitrary shape.
Density-based methods can divide a set of objects into multiple exclusive clusters, or a
hierarchy of clusters. Typically, density-based methods consider exclusive clusters only,
and do not consider fuzzy clusters. Moreover, density-based methods can be extended
from full space to subspace clustering.
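A small scikit-learn illustration of the density-based idea: DBSCAN grows clusters
wherever the neighbourhood of radius eps contains at least min_samples points, so it
recovers two crescent-shaped clusters and flags isolated points as noise. The data set and
parameter values are invented for illustration.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that purely distance-based partitioning handles poorly.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = np.vstack([X, [[3.0, 3.0], [-3.0, -3.0]]])   # add two isolated outliers

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points  :", int(np.sum(labels == -1)))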
Grid-based methods: Grid-based methods quantize the object space into a finite number
of cells that form a grid structure. All the clustering operations are performed on the
grid structure (i.e., on the quantized space). The main advantage of this approach is its
fast processing time, which is typically independent of the number of data objects and
dependent only on the number of cells in each dimension in the quantized space.
Using grids is often an efficient approach to many spatial data mining problems,
including clustering. Therefore, grid-based methods can be integrated with other
clustering methods such as density-based methods and hierarchical methods.
Some clustering algorithms integrate the ideas of several clustering methods, so that it is
sometimes difficult to classify a given algorithm as uniquely belonging to only one
clustering method category. Furthermore, some applications may have clustering criteria
that require the integration of several clustering techniques.
In general, the notation used is as follows. Let D be a data set of n objects to be clustered.
An object is described by d variables, where each variable is also called an attribute or a
dimension, and therefore may also be referred to as a point in a d-dimensional object
space.
Suppose that the hobby of a person is a set-valued attribute containing the set of values
{tennis, hockey, soccer, violin, ...}. This set can be generalized to a set of high-level
concepts, such as {sports, music, computer games}.
“Can we construct a spatial data warehouse?” Yes, as with relational data, we can
integrate spatial data to construct a data warehouse that facilitates spatial data mining. A
spatial data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of both spatial and nonspatial data in support of spatial data
mining and spatial-data-related decision-making processes.
An example spatial association rule is is_a(X, “school”) ∧ close_to(X, “sports_center”) ⇒
close_to(X, “park”) [0.5%, 80%]. This rule states that 80% of schools that are close to
sports centers are also close to parks, and 0.5% of the data belongs to such a case.
In a content-based image retrieval system, there are often two kinds of queries:
image-sample-based queries and image feature specification queries.
Several approaches have been proposed and studied for similarity-based retrieval in
image databases, based on image signatures.
“Can we construct a data cube for multimedia data analysis?” To facilitate the
multidimensional analysis of large multimedia databases, multimedia data cubes can be
designed and constructed in a manner similar to that for traditional data cubes from
relational data. A multimedia data cube can contain additional dimensions and measures
for multimedia information, such as color, texture, and shape.
Recall: This is the percentage of documents that are relevant to the query and were,
in fact, retrieved. It is formally defined as
recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|
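A tiny helper matching the definition above; the two example sets of document IDs are
invented for illustration.

def recall(relevant: set, retrieved: set) -> float:
    """Fraction of the relevant documents that were actually retrieved."""
    return len(relevant & retrieved) / len(relevant)

print(recall(relevant={1, 2, 3, 4, 5}, retrieved={2, 3, 5, 9}))   # 0.6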
How is mining the World Wide Web done?
The World Wide Web serves as a huge, widely distributed, global information service
center for news, advertisements, consumer information, financial management, education,
government, e-commerce, and many other information services. The Web also contains
a rich and dynamic collection of hyperlink information and Web page access and usage
information, providing rich sources for data mining.
1. The Web seems to be too huge for effective data warehousing and data mining.
The size of the Web is in the order of hundreds of terabytes and is still growing
rapidly. Many organizations and societies place most of their public-accessible
information on the Web. It is barely possible to set up a data warehouse to
replicate, store, or integrate all of the data on the Web.
2. The complexity of Web pages is far greater than that of any traditional text
document collection. Web pages lack a unifying structure.
3. The Web is a highly dynamic information source. Not only does the Web grow
rapidly, but its information is also constantly updated.
4. The Web serves a broad diversity of user communities. The Internet currently
connects more than 100 million workstations, and its user community is still
rapidly expanding.
These challenges have promoted research into efficient and effective discovery and use of
resources on the Internet.
A major social concern of data mining is the issue of privacy and data security,
particularly as the amount of data collected on individuals continues to grow. Fair
information practices were established for privacy and data protection and cover aspects
regarding the collection and use of personal data. Data mining for counterterrorism can
benefit homeland security and save lives, yet raises additional concerns for privacy due to
the possible access of personal data. Efforts towards
ensuring privacy and data security include the development of privacy-preserving data
mining (which deals with obtaining valid data mining results without learning the
underlying data values) and data security–enhancing techniques (such as encryption).
Trends in data mining include further efforts toward the exploration of new application
areas, improved scalable and interactive methods (including constraint-based mining), the
integration of data mining with data warehousing and database systems, the
standardization of data mining languages, visualization methods, and new methods for
handling complex data types. Other trends include biological data mining, mining
software bugs, Web mining, distributed and real-time mining, graph mining, social
network analysis, multirelational and multidatabase data mining, data privacy
protection, and data security.