Notes_Data_Warehouse
The term "Data Warehouse" was first coined by Bill Inmon in 1990. He said that a data
warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of
data. This data helps in supporting the decision-making process of analysts in an organization.
The operational database undergoes day-to-day transactions, which cause frequent
changes to the data on a daily basis. But if in future a business executive wants to analyse
previous data about a product, a supplier, or a consumer, the analyst will have no
data available to analyse, because the previous data has been updated due to transactions.
A data warehouse provides generalized and consolidated data in a multidimensional
view. Along with this generalized and consolidated view of data, a data warehouse
also provides Online Analytical Processing (OLAP) tools. These tools help in the
interactive and effective analysis of data in a multidimensional space. This analysis results
in data generalization and data mining.
Data mining functions such as association, clustering, classification and prediction can be
integrated with OLAP operations to enhance interactive mining of knowledge at multiple
levels of abstraction. That is why the data warehouse has now become an important
platform for data analysis and online analytical processing.
Understanding Data Warehouse
A data warehouse is a database that is kept separate from the
organization's operational database.
There is no frequent updating done in a data warehouse.
A data warehouse possesses consolidated historical data, which helps the
organization to analyse its business.
A data warehouse helps executives to organize, understand and use their
data to make strategic decisions.
Data warehouse systems help in the integration of a diversity of
application systems.
A data warehouse system allows the analysis of consolidated historical
data.
A data warehouse is a subject-oriented, integrated, time-variant and
nonvolatile collection of data that supports management's decision-making
process.
In Figure 1-1, the metadata and raw data of a traditional OLTP system is present, as is an
additional type of data, summary data. Summaries are a mechanism to pre-compute
common expensive, long-running operations for sub-second data retrieval. For example,
a typical data warehouse query is to retrieve something such as August sales. A summary
in an Oracle database is called a materialized view.
The consolidated storage of the raw data as the center of your data warehousing
architecture is often referred to as an Enterprise Data Warehouse (EDW). An EDW
provides a 360-degree view into the business of an organization by holding all relevant
business information in the most detailed format.
Data Mart
A data mart is focused on a single functional area of an organization and contains a
subset of data stored in a Data Warehouse.
A data mart is a condensed version of a data warehouse and is designed for use by a
specific department, unit or set of users in an organization, e.g., marketing, sales, HR or
finance. It is often controlled by a single department in an organization.
A data mart usually draws data from only a few sources, compared to a data warehouse.
Data marts are smaller in size and more flexible than a data warehouse.
Implementing a Data Mart is a rewarding but complex procedure. Here are the detailed
steps to implement a Data Mart:
Designing
Designing is the first phase of Data Mart implementation. It covers all the tasks from
initiating the request for a data mart through gathering information about the requirements.
Finally, we create the logical and physical design of the data mart.
Constructing
This is the second phase of implementation. It involves creating the physical database and
the logical structures.
Storage management: An RDBMS stores and manages the data to create, add, and delete
data.
Fast data access: With a SQL query you can easily access data based on certain
conditions/filters (a short sketch follows this list).
Data protection: The RDBMS also offers a way to recover from system failures
such as power failures. It also allows data to be restored from backups in case
the disk fails.
Multiuser support: The data management system offers concurrent access, the ability for
multiple users to access and modify data without interfering or overwriting changes made
by another user.
Security: The RDBMS also provides a way to regulate access by users to objects
and certain types of operations.
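As a small illustration of the points above (in particular the fast data access one), here is a
sketch that stores a few data mart rows in an RDBMS and filters them with a SQL query.
The sales table, its columns and the sample rows are invented for illustration, and SQLite
simply stands in for whichever RDBMS the data mart actually uses.

import sqlite3

# Throwaway in-memory database standing in for the data mart's RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "bread", 120.0), ("South", "milk", 80.0), ("North", "milk", 65.0)],
)

# Fast data access: retrieve rows based on a condition/filter.
for row in conn.execute("SELECT product, amount FROM sales WHERE region = ?", ("North",)):
    print(row)
conn.close()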
Populating:
Populating is the third step. It covers sourcing the data, cleansing and transforming it into
the required format, and loading it into the data mart.
Accessing
Accessing is the fourth step, which involves putting the data to use: querying the data,
creating reports and charts, and publishing them. End users submit queries to the database
and the results of the queries are displayed.
Managing
This is the last step of the Data Mart implementation process. This step covers management
tasks such as:
Ongoing user access management.
System optimizations and fine-tuning to achieve enhanced performance.
Adding and managing fresh data in the data mart.
Planning recovery scenarios and ensuring system availability in case the system
fails.
Data mining can be viewed as a result of the natural evolution of information technology.
The database and data management industry evolved in the development of
several critical functionalities: data collection and database creation, data management
(including data storage and retrieval and database transaction processing), and advanced
data analysis (involving data warehousing and data mining). The early development of
data collection and database creation mechanisms served as a prerequisite for the later
development of effective mechanisms for data storage and retrieval, as well as query and
transaction processing. Nowadays numerous database systems offer query and transaction
processing as common practice. Advanced data analysis has naturally become the next
step. Since the 1960s, database and information technology has evolved systematically
from primitive file processing systems to sophisticated and powerful database systems.
The research and development in database systems since the 1970s progressed from early
hierarchical and network database systems to relational database systems (where data are
stored in relational table structures; see Section 1.3.1), data modeling tools, and indexing
and accessing methods. In addition, users gained convenient and flexible data access
through query languages, user interfaces, query optimization, and transaction
management. Efficient methods for online transaction processing (OLTP), where a query
is viewed as a read-only transaction, contributed substantially to the evolution and wide
acceptance of relational technology as a major tool for efficient storage, retrieval, and
management of large amounts of data.
After the establishment of database management systems, database technology moved
toward the development of advanced database systems, data warehousing, and data
mining for advanced data analysis and web-based databases. Advanced database systems,
for example, resulted from an upsurge of research from the mid-1980s onward. These
systems incorporate new and powerful data models such as extended-relational, object-
oriented, object-relational, and deductive models. Application-oriented database systems
have flourished, including spatial, temporal, multimedia, active, stream and sensor,
scientific and engineering databases, knowledge bases, and office information bases.
Issues related to the distribution, diversification, and sharing of data have been studied
extensively.
Advanced data analysis sprang up from the late 1980s onward. The steady and dazzling
progress of computer hardware technology in the past three decades led to large supplies
of powerful and affordable computers, data collection equipment, and storage media.
This technology provides a great boost to the database and information industry, and it
enables a huge number of databases and information repositories to be available for
transaction management, information retrieval, and data analysis. Data can now be stored
in many different kinds of databases and information repositories.
One emerging data repository architecture is the data warehouse (Section 1.3.2). This is
a repository of multiple heterogeneous data sources organized under a unified schema at
a single site to facilitate management decision making. Data warehouse technology
includes data cleaning, data integration, and online analytical processing (OLAP)—that
is, analysis techniques with functionalities such as summarization, consolidation, and
aggregation, as well as the ability to view information from different angles. Although
OLAP tools support multidimensional analysis and decision making, additional data
analysis tools are required for in-depth analysis—for example, data mining tools that
provide data classification, clustering, outlier/anomaly detection, and the characterization
of changes in data over time.
Huge volumes of data have been accumulated beyond databases and data warehouses.
During the 1990s, the World Wide Web and web-based databases (e.g., XML databases)
began to appear. Internet-based global information bases, such as the WWW and various
kinds of interconnected, heterogeneous databases, have emerged and play a vital role in
the information industry. The effective and efficient analysis of data from such different
forms of data by integration of information retrieval, data mining, and information
network analysis technologies is a challenging task. In summary, the abundance of data,
coupled with the need for powerful data analysis tools, has been described as a data rich
but information poor situation (Figure 1.2). The fast-growing, tremendous amount of
data, collected and stored in large and numerous data repositories, has far exceeded our
human ability for comprehension without powerful tools. As a result, data collected in
large data repositories become “data tombs”—data archives that are seldom visited.
Consequently, important decisions are often made based not on the information-rich data
stored in data repositories but rather on a decision maker’s intuition, simply because the
decision maker does not have the tools to extract the valuable knowledge embedded in
the vast amounts of data. Efforts have been made to develop expert system and
knowledge-based technologies, which typically rely on users or domain experts to
manually input knowledge into knowledge bases. Unfortunately, however, the manual
knowledge input procedure is prone to biases and errors and is extremely costly and time
consuming. The widening gap between data and information calls for the systematic
development of data mining tools that can turn data tombs into “golden nuggets” of
knowledge.
What is Data Mining?
In simple words, data mining is defined as a process used to extract usable data from a
larger set of raw data. It implies analysing data patterns in large batches of data using
one or more software tools. Data mining has applications in multiple fields, like science and
research. As an application of data mining, businesses can learn more about their
customers and develop more effective strategies related to various business functions and
in turn leverage resources in a more optimal and insightful manner. This helps businesses
be closer to their objective and make better decisions. Data mining involves effective data
collection and warehousing as well as computer processing. For segmenting the data and
evaluating the probability of future events, data mining uses sophisticated mathematical
algorithms. Data mining is also known as Knowledge Discovery in Data (KDD).
• Clustering: based on finding and visually documenting groups of facts not previously
known.
If a data mining system is not integrated with a database or a data warehouse system, then
there will be no system to communicate with. This scheme is known as the non-coupling
scheme. In this scheme, the main focus is on data mining design and on developing
efficient and effective algorithms for mining the available data sets.
The list of Integration Schemes is as follows −
No Coupling − In this scheme, the data mining system does not utilize any of the
database or data warehouse functions. It fetches the data from a particular source and
processes that data using some data mining algorithms. The data mining result is stored in
another file.
Loose Coupling − In this scheme, the data mining system may use some of the functions
of the database and data warehouse system. It fetches the data from the data repository
managed by these systems and performs data mining on that data. It then stores the
mining result either in a file or in a designated place in a database or in a data warehouse.
Semi−tight Coupling − In this scheme, the data mining system is linked with a database
or a data warehouse system and in addition to that, efficient implementations of a few
data mining primitives can be provided in the database.
Tight coupling − In this coupling scheme, the data mining system is smoothly integrated
into the database or data warehouse system. The data mining subsystem is treated as one
functional component of an information system.
Data mining is not an easy task, as the algorithms used can get very complex and data is
not always available at one place. It needs to be integrated from various heterogeneous
data sources. These factors also create some issues. The major issues are −
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
The following diagram describes the major issues.
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
Data mining query languages and ad hoc data mining − A data mining query
language that allows the user to describe ad hoc mining tasks should be
integrated with a data warehouse query language and optimized for efficient and
flexible data mining.
Presentation and visualization of data mining results − Once the patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
Handling noisy or incomplete data − Data cleaning methods are required to
handle noise and incomplete objects while mining the data regularities. If
data cleaning methods are not available, the accuracy of the discovered patterns
will be poor.
Performance Issues
There can be performance-related issues, such as the following −
Data Source:
The actual source of data is the Database, data warehouse, World Wide Web
(WWW), text files, and other documents. You need a huge amount of historical
data for data mining to be successful. Organizations typically store data in
databases or data warehouses. Data warehouses may comprise one or more
databases, text files, spreadsheets, or other repositories of data. Sometimes, even
plain text files or spreadsheets may contain information. Another primary source
of data is the World Wide Web or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data must be
cleaned, integrated, and selected. As the information comes from various sources
and in different formats, it can't be used directly for the data mining procedure
because the data may not be complete and accurate. So, the data first needs to
be cleaned and unified. More information than needed will be collected from
various data sources, and only the data of interest will have to be selected and
passed to the server. These procedures are not as easy as we think. Several
methods may be performed on the data as part of selection, integration, and
cleaning.
Database or Data Warehouse Server:
The database or data warehouse server consists of the original data that is ready to
be processed. Hence, the server is responsible for retrieving the relevant data
based on the user's data mining request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It
contains several modules for operating data mining tasks, including association,
characterization, classification, clustering, prediction, time-series analysis, etc.
In other words, we can say the data mining engine is the core of the data mining architecture.
It comprises instruments and software used to obtain insights and knowledge from
data collected from various data sources and stored within the data warehouse.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring how
interesting a discovered pattern is, by using a threshold value. It collaborates with the
data mining engine to focus the search on interesting patterns.
This segment commonly employs interestingness measures that cooperate with the data
mining modules to focus the search towards interesting patterns. It might utilize an
interestingness threshold to filter out discovered patterns. On the other hand, the pattern
evaluation module might be integrated with the mining module, depending on
the implementation of the data mining techniques used. For efficient data mining,
it is highly recommended to push the evaluation of pattern interestingness as deep as
possible into the mining procedure so as to confine the search to only interesting
patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data
mining system and the user. This module helps the user to easily and efficiently
use the system without knowing the complexity of the process. This module
cooperates with the data mining system when the user specifies a query or a task
and displays the results.
Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It might be
helpful to guide the search or evaluate the interestingness of the resulting patterns. The
knowledge base may even contain user views and data from user experiences that
might be helpful in the data mining process. The data mining engine may receive
inputs from the knowledge base to make the result more accurate and reliable.
The pattern assessment module regularly interacts with the knowledge base to get
inputs, and also update it.
A concept hierarchy for a given numeric attribute defines a discretization of the
attribute. Concept hierarchies can be used to reduce the data by collecting and replacing
low-level concepts (such as numeric values for the attribute age) with higher-level concepts
(such as young, middle-aged, or senior). Although detail is lost by such generalization, the
generalized data becomes more meaningful and easier to interpret.
Manual definition of concept hierarchies can be a tedious and time-consuming task for the
user or domain expert. Fortunately, many hierarchies are implicit within the database
schema and can be defined at schema definition level. Concept hierarchies often can be
generated automatically or dynamically refined based on statistical analysis of the data
distribution.
Concept hierarchies for numeric attributes can be constructed automatically based on data
distribution analysis. Five methods for concept hierarchy generation are defined below:
binning, histogram analysis, cluster analysis, entropy-based discretization, and data
segmentation by “natural partitioning”.
Binning:
Attribute values can be discretized by distributing the values into bins and replacing each
bin value by the bin mean or bin median. This technique can be applied
recursively to the resulting partitions in order to generate concept hierarchies.
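A minimal sketch of the binning idea in Python, assuming equal-frequency bins and
smoothing by bin means; the nine sample values are invented for illustration.

import numpy as np

# Sort the values and split them into equal-frequency (equi-depth) bins.
values = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))
bins = np.array_split(values, 3)

# Replace every value in a bin by the bin mean (smoothing by bin means).
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
for b in bins:
    print("bin", b, "-> mean", b.mean())
print("smoothed:", smoothed)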
Histogram Analysis:
Histograms can also be used for discretization. Partitioning rules can be applied to define
ranges of values. The histogram analysis algorithm can be applied recursively to each
partition in order to automatically generate a multilevel concept hierarchy, with the
procedure terminating once a prespecified number of concept levels has been reached.
A minimum interval size can also be used per level to control the recursive procedure. This
specifies the minimum width of a partition, or the minimum number of partitions at
each level.
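A rough sketch of this recursive histogram idea, assuming equal-width bins; the bin count,
recursion depth and random sample data are illustrative choices, not fixed by the text.

import numpy as np

def histogram_hierarchy(values, bins=4, levels=2):
    """Return nested (low, high) intervals built by recursive histogramming."""
    counts, edges = np.histogram(values, bins=bins)
    nodes = []
    for lo, hi, cnt in zip(edges[:-1], edges[1:], counts):
        children = []
        if levels > 1 and cnt > 1:
            inside = values[(values >= lo) & (values < hi)]
            if len(inside) > 1:
                children = histogram_hierarchy(inside, bins, levels - 1)
        nodes.append(((round(float(lo), 2), round(float(hi), 2)), children))
    return nodes

data = np.random.default_rng(0).uniform(0, 100, size=200)
for interval, children in histogram_hierarchy(data):
    print(interval, "->", [c[0] for c in children])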
Cluster Analysis:
A clustering algorithm can be applied to partition data into clusters or groups. Each
cluster forms a node of a concept hierarchy, where all nodes are at the same conceptual
level. Each cluster may be further decomposed into sub-clusters, forming a lower level in
the hierarchy. Clusters may also be grouped together to form a higher level of the concept
hierarchy.
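A short sketch of cluster-based discretization, using k-means from scikit-learn on a
one-dimensional attribute; the salary values and the choice of three clusters are invented
for illustration.

import numpy as np
from sklearn.cluster import KMeans

salaries = np.array([31, 33, 35, 52, 54, 55, 57, 88, 90, 95], dtype=float)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(salaries.reshape(-1, 1))

# Each cluster's value range becomes one node of the concept hierarchy.
for cluster in np.unique(labels):
    members = salaries[labels == cluster]
    print(f"node: [{members.min()}, {members.max()}] with {len(members)} values")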
Segmentation by Natural Partitioning:
Breaking up annual salaries into uniform ranges such as ($50,000-$100,000) is often
more desirable than ranges such as ($51,263.89-$60,765.30) arrived at by cluster analysis.
The 3-4-5 rule can be used to segment numeric data into relatively uniform “natural”
intervals. In general, the rule partitions a given range of data into 3, 4, or 5 relatively
equal-width intervals, recursively and level by level, based on the value range at the most
significant digit. The rule can be recursively applied to each interval, creating a concept
hierarchy for the given numeric attribute. Attributes with tight semantic connections can
be pinned together.
The first limitation of class characterization for multidimensional data analysis in data
warehouses and OLAP tools is the handling of complex objects. The second limitation
is the lack of an automated generalization process: the user must explicitly tell the
system which dimensions should be included in the class characterization and to how high
a level each dimension should be generalized. Actually, the user must specify each step
of generalization or specialization on any dimension.
Usually, it is not difficult for a user to instruct a data mining system regarding how high
a level each dimension should be generalized. For example, users can set
attribute generalization thresholds for this, or specify which level a given dimension
should reach, such as with the command “generalize dimension location to the country
level”. Even without explicit user instruction, a default value such as 2 to 8 can be set by
the data mining system, which would allow each dimension to be generalized to a level
that contains only 2 to 8 distinct values. If the user is not satisfied with the current level
of generalization, she can specify dimensions on which drill-down or roll-up operations
should be applied.
When mining class comparisons, both the target class and the contrasting classes are
explicitly given in the mining query. The relevance analysis should be performed by
comparison of these classes, as we shall see below. However, when mining class
characteristics, there is only one class to be characterized; that is, no contrasting class is
specified. It is therefore not obvious what the contrasting class should be. In this case, the
contrasting class is typically taken to be the set of comparable data in the database that
excludes the set of data to be characterized. For example, to characterize graduate
students, the contrasting class can be composed of the set of undergraduate students.
“How does the information gain calculation work?” Let S be a set of training samples,
where the class label of each sample is known. Each sample is in fact a tuple. One
attribute is used to determine the class of the training samples. For instance, the attribute
status can be used to define the class label of each sample as either “graduate” or
“undergraduate”. Suppose that there are m classes. Let S contain si samples of class Ci,
for i = 1, ..., m. An arbitrary sample belongs to class Ci with probability si/s, where s is
the total number of samples in set S. The expected information needed to classify a given
sample is
I(s1, s2, ..., sm) = − ∑ (si/s) log2(si/s), where the sum is taken over i = 1, ..., m.
An attribute A with values {a1, a2, ..., av} can be used to partition S into the
subsets {S1, S2, ..., Sv}, where Sj contains those samples in S that have value aj of A.
Let Sj contain sij samples of class Ci. The expected information based on this
partitioning by A is known as the entropy of A. It is the weighted average:
E(A) = ∑ ((s1j + s2j + ... + smj)/s) I(s1j, s2j, ..., smj), summed over j = 1, ..., v.
The information gain obtained by partitioning on A is then
Gain(A) = I(s1, s2, ..., sm) − E(A).
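A small sketch of the calculation above in Python, using the status attribute
(graduate / undergraduate) as the class label and an invented attribute A for the split; the
six sample tuples are made up for illustration.

from collections import Counter
from math import log2

def info(class_counts):
    """Expected information I(s1, ..., sm) for the given class counts."""
    s = sum(class_counts)
    return -sum((si / s) * log2(si / s) for si in class_counts if si > 0)

def info_gain(samples, attribute, class_label):
    """Gain(A) = I(s1, ..., sm) - E(A) over a list of dict-shaped samples."""
    total = Counter(row[class_label] for row in samples)
    i_all = info(list(total.values()))
    s = len(samples)
    entropy = 0.0
    for value in {row[attribute] for row in samples}:
        subset = [row for row in samples if row[attribute] == value]
        counts = Counter(row[class_label] for row in subset)
        entropy += (len(subset) / s) * info(list(counts.values()))
    return i_all - entropy

data = [
    {"A": "x", "status": "graduate"}, {"A": "x", "status": "graduate"},
    {"A": "x", "status": "undergraduate"}, {"A": "y", "status": "undergraduate"},
    {"A": "y", "status": "undergraduate"}, {"A": "y", "status": "graduate"},
]
print(round(info_gain(data, "A", "status"), 4))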
Data Collection: Collect data for both the target class and the contrasting class by query
processing. For class comparison, the user in the data-mining query provides both the
target class and the contrasting class. For class characterization, the target class is the
class to be characterized, whereas the contrasting class is the set of comparable data that
are not in the target class.
Preliminary relevance analysis using conservative AOI: This step identifies a set of
dimensions and attributes on which the selected relevance measure is to be applied.
Since different levels of a dimension may have dramatically different relevance with
respect to a given class, each attribute defining the conceptual levels of the dimension
should, in principle, be included in the relevance analysis. Attribute-oriented induction
(AOI) can be used to perform some preliminary relevance analysis on the data by
removing or generalizing attributes having a very large number of distinct values (such as
name and phone#). Such attributes are unlikely to be found useful for concept
description. To be conservative, the AOI performed here should employ attribute
generalization thresholds that are set reasonably large so as to allow more (but not
all) attributes to be considered in further relevance analysis by the selected measure (Step
3 below). The relation obtained by such an application of AOI is called the candidate
relation of the mining task.
Remove irrelevant and weakly relevant attributes using the selected relevance analysis
measure: Evaluate each attribute in the candidate relation using the selected relevance
analysis measure. The relevance measure used in this step may be built into the data
mining system or provided by the user. For example, the information gain measure
described above may be used. The attributes are then sorted (i.e., ranked) according to
their computed relevance to the data mining task. Attributes that are not relevant or are
weakly relevant to the task are then removed. A threshold may be set to define “weakly
relevant.” This step results in an initial target class working relation and an initial
contrasting class working relation.
Generate the concept description using AOI: Perform AOI using a less conservative
set of attribute generalization thresholds. If the descriptive mining task is class
characterization, only the initial target class working relation is included here. If the
descriptive mining task is class comparison, both the initial target class working relation
and the initial contrasting class working relation are included. Note that in this procedure
the induction process is performed twice, that is, in the preliminary relevance
analysis (Step 2) and on the initial working relation (Step 4). The statistics used in attribute
relevance analysis with the selected measure (Step 3) may be collected during the
scanning of the database in Step 2.
Association Rules
In Association Rule Mining, as the name suggests, association rules are simple if/then
statements that help discover relationships between seemingly unrelated data in relational
databases or other data repositories.
Most machine learning algorithms work with numeric datasets and hence tend to be
mathematical. However, association rule mining is suitable for non-numeric, categorical
data and requires just a little bit more than simple counting.
Association rule mining is a procedure which aims to observe frequently occurring
patterns, correlations, or associations from datasets found in various kinds of databases
such as relational databases, transactional databases, and other forms of repositories.
An association rule has 2 parts:
an antecedent (if) and
a consequent (then)
An antecedent is something that’s found in data, and a consequent is an item that is found
in combination with the antecedent. Have a look at this rule for instance:
“If a customer buys bread, he is 70% likely to also buy milk.”
In the above association rule, bread is the antecedent and milk is the consequent. Simply
put, it can be understood as a retail store’s association rule to target their customers
better. If the above rule is a result of a thorough analysis of some data sets, it can be used
to not only improve customer service but also improve the company’s revenue.
Association rules are created by thoroughly analyzing data and looking for frequent
if/then patterns. Then, depending on the following two parameters, the important
relationships are observed:
1. Support: Support indicates how frequently the if/then relationship appears in the
database.
2. Confidence: Confidence tells about the number of times these relationships have been
found to be true.
So, in a given transaction with multiple items, Association Rule Mining primarily tries to
find the rules that govern how or why such products/items are often bought together. For
example, peanut butter and jelly are frequently purchased together because a lot of people
like to make PB&J sandwiches.
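As a quick illustration of support and confidence for the bread/milk rule above, the
following sketch counts them over a tiny invented transaction list.

transactions = [
    {"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread", "butter"},
    {"milk", "eggs"}, {"bread", "milk", "butter"}, {"eggs"},
]

n = len(transactions)
bread = sum(1 for t in transactions if "bread" in t)
bread_and_milk = sum(1 for t in transactions if {"bread", "milk"} <= t)

support = bread_and_milk / n          # how often the if/then pair appears at all
confidence = bread_and_milk / bread   # how often "then" holds when "if" holds
print(f"support = {support:.2f}, confidence = {confidence:.2f}")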
Association Rule Mining is sometimes referred to as “Market Basket Analysis”, as it was
the first application area of association mining. The aim is to discover associations of
items occurring together more often than you’d expect from randomly sampling all the
possibilities. The classic anecdote of Beer and Diaper will help in understanding this
better.
The story goes like this: young American men who go to the stores on Fridays to buy
diapers have a predisposition to grab a bottle of beer too. However unrelated and vague
that may sound to us laymen, association rule mining shows us how and why!
Let’s do a little analytics ourselves, shall we?
Suppose an X store’s retail transactions database includes the following data:
Total number of transactions: 600,000
Transactions containing diapers: 7,500 (1.25 percent)
Transactions containing beer: 60,000 (10 percent)
Transactions containing both beer and diapers: 6,000 (1.0 percent)
From the above figures, we can conclude that if there was no relation between beer and
diapers (that is, they were statistically independent), then we would expect only 10% of
diaper purchasers to also buy beer.
However, as surprising as it may seem, the figures tell us that 80% (=6000/7500) of the
people who buy diapers also buy beer.
This is a significant jump of 8 over what was the expected probability. This factor of
increase is known as Lift – which is the ratio of the observed frequency of co-occurrence
of our items and the expected frequency.
How did we determine the lift?
Simply by calculating the transactions in the database and performing simple
mathematical operations.
So, for our example, one plausible association rule can state that the people who buy
diapers will also purchase beer with a Lift factor of 8. If we talk mathematically, the lift
can be calculated as the ratio of the joint probability of two items x and y, divided by the
product of their probabilities.
Lift = P(x,y)/[P(x)P(y)]
However, if the two items are statistically independent, then the joint probability of the
two items will be the same as the product of their probabilities. Or, in other words,
P(x,y)=P(x)P(y),
which makes the Lift factor = 1. An interesting point worth mentioning here is that anti-
correlation can even yield Lift values less than 1 – which corresponds to mutually
exclusive items that rarely occur together.
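The lift arithmetic above can be reproduced in a few lines; the figures are exactly the ones
from the X store example (600,000 transactions; 7,500 with diapers; 60,000 with beer;
6,000 with both).

total = 600_000
diapers = 7_500
beer = 60_000
both = 6_000

p_diapers = diapers / total      # 0.0125
p_beer = beer / total            # 0.10
p_both = both / total            # 0.01

confidence = both / diapers                  # P(beer | diapers) = 0.80
lift = p_both / (p_diapers * p_beer)         # observed / expected co-occurrence
print(f"confidence = {confidence:.2f}, lift = {lift:.1f}")   # 0.80, 8.0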
Association Rule Mining has helped data scientists find out patterns they never knew
existed.
Let’s look at some areas where Association Rule Mining has helped quite a lot:
1. Market Basket Analysis:
This is the most typical example of association mining. Data is collected using barcode
scanners in most supermarkets. This database, known as the “market basket” database,
consists of a large number of records on past transactions. A single record lists all the
items bought by a customer in one sale. Knowing which groups are inclined towards
which set of items gives these shops the freedom to adjust the store layout and the store
catalog to place products optimally with respect to one another.
2. Medical Diagnosis:
Association rules in medical diagnosis can be useful for assisting physicians for curing
patients. Diagnosis is not an easy process and has a scope of errors which may result in
unreliable end-results. Using relational association rule mining, we can identify the
probability of the occurrence of illness concerning various factors and symptoms.
Further, using learning techniques, this interface can be extended by adding new
symptoms and defining relationships between the new signs and the corresponding
diseases.
3. Census Data:
Every government has tonnes of census data. This data can be used to plan efficient
public services (education, health, transport) as well as help public businesses (for setting
up new factories, shopping malls, and even marketing particular products). This
application of association rule mining and data mining has immense potential in
supporting sound public policy and bringing forth an efficient functioning of a
democratic society.
4. Protein Sequence:
Proteins are sequences made up of twenty types of amino acids. Each protein bears a
unique 3D structure which depends on the sequence of these amino acids. A slight
change in the sequence can cause a change in structure which might change the
functioning of the protein. This dependency of the protein functioning on its amino acid
sequence has been a subject of great research. Earlier it was thought that these sequences
are random, but now it’s believed that they aren’t.
Apriori uses breadth-first search and a hash tree structure to count candidate item sets
efficiently. It generates candidate item sets of length k from item sets of length k − 1.
Then it prunes the candidates which have an infrequent sub-pattern. According to
the downward closure lemma, the candidate set contains all frequent k-length item
sets. After that, it scans the transaction database to determine frequent item sets among
the candidates.
Apriori, while historically significant, suffers from a number of inefficiencies or
trade-offs, which have spawned other algorithms. Candidate generation generates large
numbers of subsets (the algorithm attempts to load up the candidate set with as many as
possible before each scan). Bottom-up subset exploration (essentially a breadth-first
traversal of the subset lattice) finds any maximal subset S only after all of its proper
subsets.
Algorithm Pseudocode
The pseudocode for the algorithm is given below for a transaction database T and a
support threshold of ε. Usual set-theoretic notation is employed, though note that T is a
multiset. Ck is the candidate set for level k. The generate() step is assumed to generate
the candidate sets from the large itemsets of the preceding level, heeding the downward
closure lemma. count[c] accesses a field of the data structure that represents candidate
set c, which is initially assumed to be zero. Many details are omitted below; usually the
most important part of the implementation is the data structure used for storing the
candidate sets and counting their frequencies.

Apriori(T, ε)
    L1 ← {large 1-itemsets}
    k ← 2
    while L(k−1) ≠ ∅
        Ck ← generate(L(k−1))
        for transactions t ∈ T
            Ct ← {c ∈ Ck : c ⊆ t}
            for candidates c ∈ Ct
                count[c] ← count[c] + 1
        Lk ← {c ∈ Ck : count[c] ≥ ε}
        k ← k + 1
    return ⋃k Lk
Example
A large supermarket tracks sales data by stock-keeping unit (SKU) for each item, and
thus is able to know what items are typically purchased together. Apriori is a moderately
efficient way to build a list of frequently purchased item pairs from this data. Let the
database of transactions consist of the sets {1,2,3,4}, {1,2}, {2,3,4}, {2,3}, {1,2,4},
{3,4}, and {2,4}. Each number corresponds to a product such as “butter” or “bread”. The
first step of Apriori is to count up the frequencies, called the support, of each member
item separately:
The table below shows the support of each item:
Item   Support
1      3/7
2      6/7
3      4/7
4      5/7
We can define a minimum support level to qualify as “frequent,” which depends on the
context. For this case, let min support = 3/7. Therefore, all are frequent. The next step is
to generate a list of all pairs of the frequent items. Had any of the above items not been
frequent, they wouldn’t have been included as a possible member of possible pairs. In
this way, Apriori prunes the tree of all possible sets. In next step we again select only
these items (now pairs are items) which are frequent:
Item   Support
{1,2}  3/7
{1,3}  1/7
{1,4}  2/7
{2,3}  3/7
{2,4}  4/7
{3,4}  3/7
The pairs {1,2}, {2,3}, {2,4}, and {3,4} all meet or exceed the minimum support of 3/7.
The pairs {1,3} and {1,4} do not. When we move onto generating the list of all triplets,
we will not consider any triplets that contain {1,3} or {1,4}:
Item     Support
{2,3,4}  2/7
In the example, there are no frequent triplets — {2,3,4} has support of 2/7, which is
below our minimum, and we do not consider any other triplet because they all contain
either {1,3} or {1,4}, which were discarded after we calculated frequent pairs in the
second table.
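To tie the pseudocode and the example together, here is a minimal, illustrative Apriori
sketch in Python; the names apriori, min_support and db are my own, the hash-tree
optimization mentioned earlier is omitted, and running it on the seven transactions above
reproduces the frequent itemsets in the tables.

from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    # Level 1 candidates: all single items.
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}

    while current:
        # One scan of the database to count each candidate's support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)

        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate with an infrequent k-subset (downward closure).
        prev = list(level)
        k = len(prev[0]) + 1 if prev else 0
        current = set()
        for a, b in combinations(prev, 2):
            cand = a | b
            if len(cand) == k and all(frozenset(s) in level
                                      for s in combinations(cand, k - 1)):
                current.add(cand)
    return frequent

db = [{1, 2, 3, 4}, {1, 2}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {3, 4}, {2, 4}]
for itemset, support in sorted(apriori(db, 3 / 7).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(support, 3))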
Multi Dimensional Association Rules
Multidimensional association rules involve two or more dimensions or predicates, for
example buys(X, “computer”) AND age(X, “20...29”) => buys(X, “software”), in contrast
to single-dimensional rules, which repeat a single predicate such as buys.
What is classification?
Following are examples of cases where the data analysis task is classification −
A bank loan officer wants to analyze the data in order to know which customers
(loan applicants) are risky and which are safe.
A marketing manager at a company needs to predict whether a customer with a given
profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for the loan application data and yes or no
for the marketing data.
What is prediction?
Following are examples of cases where the data analysis task is prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we are interested in predicting a numeric
value. Therefore the data analysis task is an example of numeric prediction. In this case, a
model or a predictor will be constructed that predicts a continuous-valued function or
ordered value.
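A brief sketch contrasting the two tasks with scikit-learn: a classifier predicts the
categorical risky/safe label, while a regressor predicts a continuous spend value. The tiny
feature matrix and targets are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Loan-applicant features: [income in $1000s, years employed] (made up).
X = [[25, 1], [80, 10], [40, 3], [95, 12], [30, 0], [70, 8]]
risk = ["risky", "safe", "risky", "safe", "risky", "safe"]   # categorical labels
spend = [120.0, 900.0, 300.0, 1100.0, 150.0, 750.0]          # numeric target

classifier = DecisionTreeClassifier().fit(X, risk)   # classification model
predictor = DecisionTreeRegressor().fit(X, spend)    # numeric prediction model

new_customer = [[60, 5]]
print(classifier.predict(new_customer))  # a class label such as "safe"
print(predictor.predict(new_customer))   # a continuous (ordered) value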
Unit 4
There are many clustering algorithms in the literature. It is difficult to provide a crisp
categorization of clustering methods because these categories may overlap so that a
method may have features from several categories. Nevertheless, it is useful to present a
relatively organized picture of clustering methods. In general, the major fundamental
clustering methods can be classified into the following categories, which are discussed in
the rest of this chapter.
Density-based methods: Most partitioning methods cluster objects based on the distance
between objects. Such methods can find only spherical-shaped clusters and
encounter difficulty in discovering clusters of arbitrary shapes. Other clustering methods
have been developed based on the notion of density. Their general idea is to continue
growing a given cluster as long as the density (number of objects or data points) in the
“neighborhood” exceeds some threshold. For example, for each data point within a given
cluster, the neighborhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise or outliers and discover clusters of
arbitrary shape.
Density-based methods can divide a set of objects into multiple exclusive clusters, or a
hierarchy of clusters. Typically, density-based methods consider exclusive clusters only,
and do not consider fuzzy clusters. Moreover, density-based methods can be extended
from full space to subspace clustering.
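A small scikit-learn illustration of the density-based idea: DBSCAN grows clusters
wherever the neighbourhood of radius eps contains at least min_samples points, so it
recovers two crescent-shaped clusters and flags isolated points as noise. The data set and
parameter values are invented for illustration.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that purely distance-based partitioning handles poorly.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = np.vstack([X, [[3.0, 3.0], [-3.0, -3.0]]])   # add two isolated outliers

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points  :", int(np.sum(labels == -1)))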
Grid-based methods: Grid-based methods quantize the object space into a finite number
of cells that form a grid structure. All the clustering operations are performed on the
grid structure (i.e., on the quantized space). The main advantage of this approach is its
fast processing time, which is typically independent of the number of data objects and
dependent only on the number of cells in each dimension in the quantized space.
Using grids is often an efficient approach to many spatial data mining problems,
including clustering. Therefore, grid-based methods can be integrated with other
clustering methods such as density-based methods and hierarchical methods.
Some clustering algorithms integrate the ideas of several clustering methods, so that it is
sometimes difficult to classify a given algorithm as uniquely belonging to only one
clustering method category. Furthermore, some applications may have clustering criteria
that require the integration of several clustering techniques.
In general, the notation used is as follows. Let D be a data set of n objects to be clustered.
An object is described by d variables, where each variable is also called an attribute or a
dimension, and therefore may also be referred to as a point in a d-dimensional object
space.
Suppose that the hobby of a person is a set-valued attribute containing the set of values
{tennis, hockey, soccer, violin, ...}. This set can be generalized to a set of high-level
concepts, such as {sports, music, computer games}.
“Can we construct a spatial data warehouse?” Yes, as with relational data, we can
integrate spatial data to construct a data warehouse that facilitates spatial data mining. A
spatial data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of both spatial and nonspatial data in support of spatial data
mining and spatial-data-related decision-making processes.
An example spatial association rule is is_a(X, “school”) ∧ close_to(X, “sports_center”) ⇒
close_to(X, “park”) [0.5%, 80%]. This rule states that 80% of schools that are close to
sports centers are also close to parks, and 0.5% of the data belongs to such a case.
In a content-based image retrieval system, there are often two kinds of queries:
image-sample-based queries and image feature specification queries.
Several approaches have been proposed and studied for similarity-based retrieval in
image databases, based on image signatures.
“Can we construct a data cube for multimedia data analysis?” To facilitate the
multidimensional analysis of large multimedia databases, multimedia data cubes can be
designed and constructed in a manner similar to that for traditional data cubes from
relational data. A multimedia data cube can contain additional dimensions and measures
for multimedia information, such as color, texture, and shape.
Recall: This is the percentage of documents that are relevant to the query and were,
in fact, retrieved. It is formally defined as
recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|
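A tiny helper matching the definition above; the two example sets of document IDs are
invented for illustration.

def recall(relevant: set, retrieved: set) -> float:
    """Fraction of the relevant documents that were actually retrieved."""
    return len(relevant & retrieved) / len(relevant)

print(recall(relevant={1, 2, 3, 4, 5}, retrieved={2, 3, 5, 9}))   # 0.6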
How is mining the World Wide Web done?
The World Wide Web serves as a huge, widely distributed, global information service
center for news, advertisements, consumer information, financial management, education,
government, e-commerce, and many other information services. The Web also contains
a rich and dynamic collection of hyperlink information and Web page access and usage
information, providing rich sources for data mining.
1. The Web seems to be too huge for effective data warehousing and data mining.
The size of the Web is in the order of hundreds of terabytes and is still growing
rapidly. Many organizations and societies place most of their public-accessible
information on the Web. It is barely possible to set up a data warehouse to
replicate, store, or integrate all of the data on the Web.
2. The complexity of Web pages is far greater than that of any traditional text
document collection. Web pages lack a unifying structure.
3. The Web is a highly dynamic information source. Not only does the Web grow
rapidly, but its information is also constantly updated.
4. The Web serves a broad diversity of user communities. The Internet currently
connects more than 100 million workstations, and its user community is still
rapidly expanding.
These challenges have promoted research into efficient and effective discovery and use of
resources on the Internet.
A major social concern of data mining is the issue of privacy and data security,
particularly as the amount of data collected on individuals continues to grow. Fair
information practices were established for privacy and data protection and cover aspects
regarding the collection and use of personal data. Data mining for counterterrorism can
benefit homeland security and save lives, yet raises additional concerns for privacy due to
the possible access of personal data. Efforts towards
ensuring privacy and data security include the development of privacy-preserving data
mining (which deals with obtaining valid data mining results without learning the
underlying data values) and data security–enhancing techniques (such as encryption).
Trends in data mining include further efforts toward the exploration of new application
areas, improved scalable and interactive methods (including constraint-based mining), the
integration of data mining with data warehousing and database systems, the
standardization of data mining languages, visualization methods, and new methods for
handling complex data types. Other trends include biological data mining, mining
software bugs, Web mining, distributed and real-time mining, graph mining, social
network analysis, multirelational and multidatabase data mining, data privacy
protection, and data security.