DATA WAREHOUSING AND DATA MINING (DWDM)
UNIT-1
Introduction to Data Warehouse. Why Data Mining? What Is Data Mining? What Kinds of Data Can Be
Mined? What Kinds of Patterns Can Be Mined? Which Technologies Are Used? Which Kinds of Applications
Are Targeted? Major Issues in Data Mining. Data Objects and Attribute Types, Basic Statistical Descriptions
of Data, Data Visualization, Measuring Data Similarity and Dissimilarity.
Data is extracted from operational data stores or other sources, and it may undergo transformations before it is loaded into the DW system for information processing.
A Data Warehouse is used for reporting and analysis of information and stores both historical and current data. The data in a DW system is used for analytical reporting, which is later used by business analysts, sales managers, or knowledge workers for decision-making.
Data flows from multiple heterogeneous data sources into a Data Warehouse. Common data sources for a data warehouse include −
Operational databases
3. Improved data quality – Data warehouses are designed to ensure that data
is consistent and accurate. This means that companies can trust the
information they are using to make decisions.
3. Data silos – Because data warehouses are designed to store specific types
of data, it can be challenging to integrate data from different sources. This
can lead to data silos, where different teams or departments have their own
sets of data that are not shared with others.
Difference between Database and Data Warehouse:
Building Blocks/Components of Data Warehouse:
External Data: For data gathering, most executives and data analysts rely on external sources for a large amount of the information they use, such as statistics about their organization's industry produced by external agencies and departments.
Flat files: A flat file is nothing but a text database that stores data in a plain
text format. Flat files generally are text files that have all data processing
and structure markup removed. A flat file contains a table with a single
record per line.
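As an illustration, here is a minimal Python sketch of reading such a flat file, assuming a hypothetical file customers.txt with one comma-separated record per line (the file name and fields are invented):

    # Read a flat file: plain text, one record per line, no structure markup.
    import csv

    with open("customers.txt", newline="") as f:
        for record in csv.reader(f):   # each line is one record
            print(record)              # e.g. ['C001', 'Smith', 'Chennai']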
2. Data Staging:
After the data is extracted from various sources, the data files must be prepared for storing in the data warehouse. The extracted data collected from the various sources must be transformed into a format that is suitable to be saved in the data warehouse for querying and analysis. Data staging consists of three primary functions:
Data Extraction: This function handles the various data sources. Data analysts should employ suitable techniques for every data source.
Data Loading: This function moves high volumes of data into the warehouse, consuming a considerable amount of time.
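The following is a minimal Python sketch of the staging functions under stated assumptions: a hypothetical source file sales_source.csv with columns order_id, amount, and city, loaded into a local SQLite file standing in for the warehouse store. All names are illustrative, not part of any real system described in this text.

    # Extract -> transform -> load: a toy staging pipeline.
    import csv
    import sqlite3

    def extract(path):
        # Data extraction: read raw records from one source file.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Data transformation: unify formats (cast amounts, normalize city names).
        return [(r["order_id"], float(r["amount"]), r["city"].strip().upper())
                for r in rows]

    def load(rows, db="warehouse.db"):
        # Data loading: move the prepared rows into the warehouse store.
        con = sqlite3.connect(db)
        con.execute("CREATE TABLE IF NOT EXISTS sales"
                    "(order_id TEXT, amount REAL, city TEXT)")
        con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        con.commit()
        con.close()

    load(transform(extract("sales_source.csv")))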
Metadata: Metadata means data about data; it summarizes basic details about the data, making it easier to find and work with particular instances of the data. Metadata can be created manually or generated automatically and can contain basic information about the data.
Raw Data: Raw data is a set of data and information that has not yet been processed, by either machine or human, after being delivered from its source to the data supplier. Such data is often gathered from online sources to deliver deep insight into users' online behavior.
4. Data Marts:
Data marts are also part of the storage component of a data warehouse. A data mart stores the information of a specific function of an organization that is handled by a single authority. There may be any number of data marts in a particular organization, depending upon the functions. In short, data marts contain subsets of the data stored in data warehouses.
Now, the users and analysts can use data for various applications like reporting, analyzing,
mining, etc. The data is made available to them whenever required.
A Three Tier Data Warehouse Architecture:
Tier-1: The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (such as customer profile information
provided by
external consultants). These tools and utilities perform data extraction, cleaning, and transformation
(e.g., to merge similar data from different sources into a unified format), as well as load and refresh
functions to update the data warehouse. Example gateways include ODBC (Open Database
Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC
(Java Database Connectivity). This tier also contains a metadata repository, which stores information
about the data warehouse and its contents.
Tier-2: The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model. A ROLAP model is an extended relational DBMS that maps operations on multidimensional data to standard relational operations. A MOLAP model is a special-purpose server that directly implements multidimensional data and operations.
Tier-3: The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
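To make the ROLAP idea concrete, here is a minimal Python sketch: a multidimensional request ("total sales by city and quarter") is mapped to a standard relational GROUP BY. The table sales_facts(city, quarter, amount) is a hypothetical example schema, not one defined in this text.

    import sqlite3

    con = sqlite3.connect("warehouse.db")
    # The middle tier translates a cube-style query into relational operations.
    cur = con.execute("""
        SELECT city, quarter, SUM(amount) AS total_sales
        FROM sales_facts
        GROUP BY city, quarter
        ORDER BY city, quarter
    """)
    for city, quarter, total in cur:
        print(city, quarter, total)
    con.close()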
Data Mining
INTRODUCTION: Data mining is the discovery of knowledge from data stored in large databases. The term mining is borrowed from, for example, the mining of gold from rocks or sand, which is called gold mining.
1.1. Why Data Mining?
The major reason that data mining has attracted a great deal of attention in the information
industry in recent years is due to the wide availability of huge amounts of data and need for
turning such data into useful information and knowledge.
The information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis, to engineering design and science
exploration.
Data mining can be viewed as a result of the natural evolution of information technology: it provides a path to extract the knowledge an industry requires from the data accumulated in its warehouses, reflecting the industry's growing base of knowledge.
This evolution includes data collection, database creation, data management (i.e., data storage and retrieval, and database transaction processing), and data analysis and understanding (involving data warehousing and data mining).
1.1.1. Evolution of data mining and data warehousing: To understand the development of data mining, we should know the evolution of database technology. This includes:
Data collection and database creation: In the 1960s, database and information technology began with file processing systems. These were powerful for their time but led to data inconsistency, since users had to maintain duplicate copies of an organization's data.
Database Management Systems: Between 1970 and the early 1980s, the progress of databases included:
Hierarchical and network database systems were developed.
Relational database systems were developed.
Data modeling tools were developed in the early 1980s (such as the E-R model).
Indexing and data organization techniques were developed (such as B+ trees and hashing).
Query languages were developed (such as SQL and PL/SQL).
User interfaces, forms and reports, and query processing were developed.
On-line transaction processing (OLTP) emerged.
Advanced Database Systems: From the mid 1980s to date, advanced data models were developed (such as extended relational, object-oriented, object-relational, spatial, temporal, multimedia, and scientific databases).
Data Warehousing and Data Mining: From the late 1980s to date,
Data warehouse and OLAP technology were developed.
Data mining and knowledge discovery were introduced.
Web-based Database Systems: From the 1990s to date, XML-based database systems and web mining were developed.
New Generation of Integrated Information Systems: From 2000 onwards, integrated information systems have been developed.
What is Data Mining: The term Data Mining refers to extracting or "mining" knowledge from large amounts of data. The term mining is actually a misnomer: mining gold from rocks or sand is called gold mining, not rock or sand mining, so a more appropriate name would be "knowledge mining from data."
Data mining is the process of discovering meaningful new trends and patterns by sifting through the large amounts of data stored in repositories. It uses pattern recognition techniques as well as statistical techniques.
1.3.1. Database Data: A database system is also called a database management system (DBMS). It consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. The software programs provide mechanisms for defining database structures and data storage. They also provide data consistency and security, concurrency control, and shared or distributed data access.
A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and a set of tuples (records or rows). Each tuple is identified by a unique key and is described by a set of attribute values. E-R models are typically constructed for relational databases. For example, the AllElectronics company can be illustrated with tables for customer, item, employee, and branch.
[Figure: example customer table and item table.]
AllElectronics sells its products (such as computers and printers) to customers. A relation between the customer table and the item table identifies which types of products each customer has purchased.
1.3.2. Data Warehouses: A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data transformation, data integration, data loading, and periodic data refreshing.
[Figure: data sources in Chennai, Bombay, Hyderabad, and Bangalore are cleaned, transformed, integrated, and loaded into the Data Warehouse, which clients access through query and analysis tools.]
A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as sales amount. The physical structure of a data warehouse may be a relational data store or a multidimensional data cube. It provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
A data cube for summarized sales data of AllElectronics is presented in the figure. The cube has three dimensions: address (Chennai, Bombay, Hyderabad, Bangalore), time (Q1, Q2, Q3, Q4), and item (home needs, computer, phone, security). The aggregate value is stored in each cell of the cube.
By providing multidimensional data views, a data warehouse supports OLAP operations such as drill-down and roll-up.
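A minimal pandas sketch of these ideas, using hypothetical sales rows over the three dimensions named above (the numbers are illustrative only):

    import pandas as pd

    sales = pd.DataFrame({
        "address": ["Chennai", "Chennai", "Bombay", "Hyderabad"],
        "time":    ["Q1", "Q2", "Q1", "Q1"],
        "item":    ["computer", "phone", "computer", "security"],
        "amount":  [400, 150, 350, 120],
    })

    # Cube cells: one aggregate value per (address, time, item) combination.
    cube = sales.groupby(["address", "time", "item"])["amount"].sum()

    # Roll-up: climb to a coarser level by dropping the item dimension.
    rollup = sales.groupby(["address", "time"])["amount"].sum()

    # Drill-down: descend to finer detail, e.g. one city's per-item figures.
    drill = cube.loc["Chennai"]
    print(cube, rollup, drill, sep="\n\n")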
1.3.3. Transactional Databases: A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number, the date of the transaction, the customer ID number, the ID number of the salesperson, and so on. AllElectronics transactions can be stored in a table with one record per transaction, as shown below.
Transaction_id   List of items     Transaction date
T100             I1, I3, I8, I16   18-12-2018
T200             I2, I8            18-12-2018
1.4 What Kinds of Patterns Can Be Mined? (or) Data Mining Functionalities:
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
Data mining tasks are classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
1.4.1. Concept/Class Description: Descriptions of individual classes or concepts in summarized, concise, and precise terms are called class or concept descriptions. These descriptions can be divided into 1. Data Characterization and 2. Data Discrimination.
Data Characterization:
It is the summarization of the general characteristics of a target class of data.
The data corresponding to the user-specified class are collected by a database query.
The output of data characterization can be presented in various forms, like pie charts, bar charts, curves, multidimensional cubes, and multidimensional tables. The resulting descriptions can also be presented as generalized relations, called characteristic rules.
Data Discrimination: Data discrimination is a comparison of the target class with one or a set of contrasting (distinct) classes. The target and contrasting classes can be specified by the user, and the corresponding data objects are retrieved through database queries.
For example, a user may compare products whose sales increased by 10% in the last year with those whose sales decreased by 30% during the same period. This is called data discrimination.
1.4.2. Mining Frequent Patterns, Associations and Correlations:
1.4.2.1. Frequent Patterns: A frequent itemset typically refers to a set of items that often appear together in transactional data. For example, milk and bread are frequently purchased together by many customers. In the AllElectronics database, some products are frequently purchased by customers; generally, home needs are purchased by the most customers.
1.4.2.2. Association Analysis: "What is association analysis?"
Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. It is widely used for transaction data analysis. An association rule has the form X ==> Y.
For example, in the AllElectronics relational database, a data mining system may find association rules like
buys(X, "computer") ==> buys(X, "software")
meaning that customers who buy a computer also tend to buy software.
age(X, "20..29") AND income(X, "20K..29K") ==> buys(X, "CD player")
This association rule indicates that customers of AllElectronics who are between 20 and 29 years old and earn between $20,000 and $29,000 tend to purchase a CD player at AllElectronics.
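A minimal sketch of how the two measures behind such rules, support and confidence, can be computed; the transactions below are hypothetical:

    # Support and confidence for the rule: computer ==> software.
    transactions = [
        {"computer", "software", "printer"},
        {"computer", "software"},
        {"computer", "monitor"},
        {"printer", "paper"},
    ]

    n = len(transactions)
    both = sum(1 for t in transactions if {"computer", "software"} <= t)
    antecedent = sum(1 for t in transactions if "computer" in t)

    support = both / n              # fraction of all transactions with both items
    confidence = both / antecedent  # fraction of computer-buyers who buy software
    print(f"support={support:.2f}, confidence={confidence:.2f}")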
1.4.2.3. Classification and Regression for Prediction:
Classification is the process of finding a set of models that describe and distinguish data classes or concepts.
The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.
A decision tree is a flowchart-like tree structure; decision trees can easily be converted to classification rules. A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units.
Regression for prediction is used to predict missing or unavailable numerical data values rather than class labels. Although prediction may refer to both data value prediction and class label prediction, the predicted values are usually numerical, so the term prediction often refers to numeric prediction.
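As a minimal sketch, a decision tree classifier can be trained with scikit-learn on hypothetical (age, income) records labeled with whether the customer buys a product; the data are invented for illustration:

    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[25, 22000], [47, 64000], [52, 110000], [33, 30000], [21, 20000]]
    y = [1, 0, 0, 1, 1]   # class labels: 1 = buys, 0 = does not buy

    model = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(model, feature_names=["age", "income"]))  # IF-THEN view
    print(model.predict([[28, 25000]]))  # predict the class of a new object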
1.4.2.4. Cluster Analysis: ("What is cluster analysis?")
Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. The objectives of clustering are:
To uncover natural groupings.
To initiate hypotheses about the data.
To find a consistent and valid organization of the data.
For example, cluster analysis can be performed on AllElectronics customers to identify homogeneous groups of customers. These clusters may represent individual target groups for marketing, helping to increase sales.
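A minimal sketch of clustering with k-means (scikit-learn), grouping hypothetical customers by age and annual spend; note that no class labels are supplied:

    from sklearn.cluster import KMeans

    customers = [[22, 500], [25, 600], [47, 3000], [52, 3200], [30, 800]]
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)

    print(km.labels_)           # cluster assignment of each customer
    print(km.cluster_centers_)  # profile of each homogeneous group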
1.4.2.5. Outlier Analysis: A database may contain data objects that do not comply with the general behavior or model of the data; such objects are called outliers. Most data mining methods discard outliers as noise or exceptions, but in applications such as fraud detection the rare events are the interesting ones. The analysis of outlier data is referred to as outlier mining.
For example, outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of unusually large amounts compared with a customer's regular purchases.
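A minimal sketch of outlier detection using the common 1.5 x IQR rule on hypothetical purchase amounts:

    import numpy as np

    amounts = np.array([35, 40, 42, 48, 50, 55, 60, 900])  # 900 is suspicious
    q1, q3 = np.percentile(amounts, [25, 75])
    iqr = q3 - q1
    outliers = amounts[(amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)]
    print(outliers)  # flags the unusually large purchase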
1.5. Which Technologies Are Used? (or) Classification of Data Mining Systems:
Data mining adopts techniques from many domains, such as statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval, visualization, algorithms, high-performance computing, and many application domains (shown in the figure). Data mining systems can therefore be categorized according to various criteria.
Statistics: A statistical model is a set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions. Statistical models are widely used to model data and data classes. For example, in data mining tasks like data characterization and classification, statistical models of target classes can be built.
Machine Learning: Machine learning investigates how computers can learn (or improve their performance) based on data. A main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on data. Machine learning is a fast-growing discipline.
Supervised learning is basically a synonym for classification. The supervision in the learning
comes from the labeled examples in the training data set. For example, in the postal code
recognition problem, a set of handwritten postal code images and their corresponding
machine-readable translations are used as the training examples, which supervise the learning
of the classification model.
Unsupervised learning is essentially a synonym for clustering. The learning process is
unsupervised since the input examples are not class labeled. For example, an unsupervised
learning method can take, as input, a set of images of handwritten digits. Suppose that it finds
10 clusters of data. These clusters may correspond to the 10 distinct digits of 0 to 9,
respectively.
Semi-supervised learning is a class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model. For a two-class problem, the labeled examples of one class serve as positive examples and those of the other class as negative examples, while the unlabeled examples help refine the model.
Active learning is a machine learning approach that lets users play an active role in the
learning process.
Database Systems and Data Warehouses:
Database systems research focuses on the creation, maintenance, and use of databases for organizations and end-users. In particular, it has established principles in data models, query languages, query processing and optimization methods, data storage, and indexing and access methods. Many data mining tasks need to handle large data sets or even real-time, fast streaming data. Recent database systems have built systematic data analysis capabilities on database data using data warehousing and data mining facilities.
A data warehouse integrates data from multiple sources and various timeframes. It provides OLAP facilities in multidimensional databases to promote multidimensional data mining, and it maintains both recent and historical data.
Information Retrieval:
Information retrieval (IR) is the science of searching for documents or for information in documents. Typical approaches in information retrieval adopt probabilistic models. For example, a text document can be viewed as a bag of words, that is, a multiset of the words appearing in the document.
Pattern recognition is the process of recognizing patterns by using machine learning algorithms. Pattern recognition can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation. One of the important aspects of pattern recognition is its application potential. Examples: speech recognition, speaker identification, multimedia document recognition (MDR), and automatic medical diagnosis.
Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context. Patterns, trends, and correlations that might go undetected in text-based data can be exposed and recognized more easily with data visualization software.
An algorithm in data mining (or machine learning) is a set of heuristics and calculations that creates
a model from data. To create a model, the algorithm first analyzes the data you provide, looking for
specific types of patterns or trends.
High-performance computing (HPC) frameworks abstract the increased complexity of current computing systems and at the same time provide performance benefits by exploiting multiple forms of parallelism in data mining algorithms.
Data Mining Applications: The list of areas where data mining is widely used − Financial Data
Analysis, Retail Industry, Telecommunication Industry, Biological Data Analysis, Other Scientific
Applications, Intrusion Detection.
o Web search engines are essentially very large data mining applications. Various data mining techniques are used in all aspects of search engines, ranging from crawling (e.g., deciding which pages should be crawled and the crawling frequencies) and indexing (e.g., selecting pages to be indexed and deciding to which extent the index should be constructed) to searching (e.g., deciding how pages should be ranked, which advertisements should be added, and how the search results can be personalized or made "context aware").
1.7. Major issues in Data Mining: Data mining is a dynamic and fast-expanding field with great strengths. Major issues in data mining research can be partitioned into five groups: mining methodology, user interaction, efficiency and scalability, diversity of data types, and data mining and society.
Mining methodology: Researchers continue to develop new mining methodologies, addressing issues such as:
Mining various and new kinds of knowledge.
Mining knowledge in multidimensional space.
Data mining—an interdisciplinary effort.
Boosting the power of discovery in a networked environment.
Handling uncertainty, noise, or incompleteness of data.
Pattern evaluation and pattern- or constraint-guided mining.
User Interaction: Interesting areas of research include how to interact with a data mining system,
how to incorporate a user’s background knowledge in mining, and how to visualize and comprehend
data mining results.
Interactive mining.
Incorporation of background knowledge.
Ad hoc data mining and data mining query language.
Presentation and visualization of data mining results.
Efficiency and Scalability:
Efficiency and scalability of data mining algorithms.
Parallel, distributed, and incremental mining algorithms.
Cloud computing and cluster computing.
Diversity of Database Types:
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data Mining and Society:
Social impacts of data mining.
Privacy-preserving data mining.
Invisible data mining.
1.8. Data Objects and Attribute Types:
A data object represents an entity.
In a sales database, the objects may be customers, store items, and sales;
in a medical database, the objects may be patients;
in a university database, the objects may be students, professors, and courses.
Data objects are typically described by attributes. Data objects can also be referred to as samples, examples, instances, data points, or objects. When data objects are stored in a database, they are called data tuples: the rows of a database correspond to the data objects, and the columns correspond to the attributes.
1.8.1. What Is an Attribute?
An attribute is a data field, representing a characteristic or feature of a data object.
The terms attribute, dimension, feature, and variable are often used interchangeably. The term dimension is commonly used in data warehousing, the term feature is commonly used in machine learning, and statisticians prefer the term variable; data mining and database professionals commonly use the term attribute.
For example, the attributes describing a customer object can include customer ID, name, and address.
Types of Attribute: The type of an attribute is determined by the set of possible values. They are
nominal, binary, ordinal, or numeric.
Nominal Attributes: Nominal means “relating to names.” The values of a nominal attribute
are symbols or names of things. Each value represents some kind of category, code, or state,
and so nominal attributes are also referred to as categorical. The values are also known as
enumerations.
o Example: hair color and marital status are two attributes describing person objects. In
our application, possible values for hair color are black, brown, blond, red, auburn,
gray, and white. The attribute marital status can take on the values single, married,
divorced, and widowed.
Ordinal Attributes: An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between successive values is not
known.
o Example: drink size corresponds to the size of drinks available at a fast-food restaurant. This ordinal attribute has three possible values: small, medium, and large. The values have a meaningful sequence (which corresponds to increasing drink size).
o Other examples of ordinal attributes include grade (e.g., A+, A, A-, B+, and so on).
o Professional ranks can be enumerated in a sequential order: for example, assistant, associate, and full professors.
Binary: Nominal attribute with only 2 states (0 and 1). Eg: true or false, yes or no.
Symmetric binary: both outcomes equally important e.g., gender
Asymmetric binary: outcomes not equally important. e.g., medical test
(positive vs. negative).
Numeric Attributes: A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
o Interval-Scaled Attributes: Interval-scaled attributes are measured on a scale of equal-
size units.
Example: A temperature attribute is interval-scaled. Suppose that we have the outdoor temperature value for a number of different days, where each day is an object. A temperature of 20°C is five degrees higher than a temperature of 15°C. Calendar dates are another example; for instance, the years 2002 and 2010 are eight years apart.
Ratio-Scaled Attributes:
o A ratio-scaled attribute is a numeric attribute with an inherent zero-point. e.g.,
temperature in Kelvin, length, counts.
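A minimal pandas sketch placing the attribute types side by side on a hypothetical person table; the column names and values are invented for illustration:

    import pandas as pd

    people = pd.DataFrame({
        "hair_color": ["black", "blond", "brown"],   # nominal: names of things
        "smoker":     [0, 1, 0],                     # binary: two states
        "drink_size": pd.Categorical(
            ["small", "large", "medium"],
            categories=["small", "medium", "large"],
            ordered=True),                           # ordinal: ranked values
        "temp_c":     [20.0, 15.0, 22.5],            # numeric, interval-scaled
        "salary":     [30000, 47000, 52000],         # numeric, ratio-scaled
    })
    print(people.dtypes)
    print(people["drink_size"].min())  # ordering is meaningful: "small"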
1.9. Basic Statistical Descriptions of Data: Basic statistical descriptions can be used
to identify properties of the data and highlight which data values should be treated as noise or outliers.
1.9.1. Measures of central tendency: It means, measure the location of the middle or center of
a data distribution. It includes mean, median, mode, and midrange.
o Mean: The most common measure of the "center" of a set of data is the (arithmetic) mean. Let $x_1, x_2, \ldots, x_N$ be a set of $N$ values or observations of some numeric attribute $X$, like salary. The mean of this set of values is

$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_N}{N} = \dfrac{1}{N}\sum_{i=1}^{N} x_i$
Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. The mean is $(30 + 36 + \cdots + 110)/12 = 696/12 = 58$, that is, $58,000.
o Median: Let's find the median of the data from the above example. The data are already sorted in increasing order. The median can be any value within the two middlemost values of 52 and 56; by convention, we take their average, $(52 + 56)/2 = 54$. Thus, the median is $54,000.
o Mode: The number that occurs most frequently in a data set is called the mode. In the example data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110, the numbers 52 and 70 each appear twice, so the modes of the data set are 52 and 70. The data are therefore bimodal: the two modes are $52,000 and $70,000.
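These three measures can be checked with Python's standard statistics module on the same salary data (statistics.multimode requires Python 3.8 or later):

    import statistics

    salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
    print(statistics.mean(salaries))       # 58.0  -> $58,000
    print(statistics.median(salaries))     # 54.0  -> average of 52 and 56
    print(statistics.multimode(salaries))  # [52, 70] -> the data are bimodal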
Symmetric data distribution: In a unimodal frequency curve with perfect symmetric data distribution, the mean, median, and mode are all at the same center value, as shown in Figure (a). Data in most real applications are not symmetric. They may instead be either positively skewed, where the mode occurs at a value that is smaller than the median (Figure b), or negatively skewed, where the mode occurs at a value greater than the median (Figure c).
1.9.2. Measuring the Dispersion of Data: The measures include the range, quartiles, percentiles, and the inter-quartile range; the five-number summary, which can be displayed as a boxplot; and the variance and standard deviation, which also indicate the spread of a data distribution and can help identify outliers.
The range of the set is the difference between the largest (max()) and smallest (min())
values.
Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal size consecutive sets.
The distance between the first and third quartiles is a simple measure of spread that gives the
range covered by the middle half of the data. This distance is called the inter-quartile range
(IQR) and is defined as IQR = Q3 - Q1.
o For example, the quartiles are the three values that split the sorted data set into four equal parts. The data of the above example contain 12 values, already sorted in increasing order. Thus, the quartiles for this data are the third, sixth, and ninth values, respectively, in the sorted list. Therefore, Q1 is $47,000 and Q3 is $63,000. Thus, the inter-quartile range is IQR = 63 - 47 = $16,000.
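A minimal NumPy check of these quartiles; method="lower" (NumPy 1.22 or later) picks actual data points, matching the text's convention of taking the third and ninth sorted values, whereas the default linear interpolation would give slightly different numbers:

    import numpy as np

    salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
    q1, q3 = np.percentile(salaries, [25, 75], method="lower")
    print(q1, q3, q3 - q1)  # 47 63 16 -> IQR = $16,000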
1.9.2.1. Boxplot: Boxplot incorporates the five-number
summary as follows:
o Data is represented with a box
o The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
o The median is marked by a line within the box
o Two lines (whiskers) outside the box extend to the minimum and maximum observations.
The figure shows Boxplot for the unit price data for items sold at
four branches of AllElectronics during a given time period. For
branch 1, the median price of items sold is $80, Q1 is $60, and Q3
is $100. Notice that two outlying observations for this branch
were plotted individually, as their values of 175 and 202 are more
than 1.5 times the IQR here of 40.
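A minimal matplotlib sketch of such a boxplot on hypothetical unit prices; matplotlib draws the box from Q1 to Q3, the median line, the whiskers, and plots points beyond 1.5 x IQR (here 175 and 202) individually:

    import matplotlib.pyplot as plt

    prices = [45, 60, 62, 70, 75, 80, 85, 90, 95, 100, 110, 175, 202]
    plt.boxplot(prices)
    plt.ylabel("unit price ($)")
    plt.show()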
1.9.3. Variance and Standard Deviation: Variance and standard deviation are measures of
data dispersion. They indicate how spread out a data distribution is. A low standard deviation means
that the data observations tend to be very close to the mean, while a high standard deviation indicates
that the data are spread out over a large range of values.
The variance of $N$ observations $x_1, x_2, \ldots, x_N$ for a numeric attribute $X$ is

$\sigma^2 = \dfrac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$

where $\bar{x}$ is the mean value of the observations, as defined in the mean formula above. The standard deviation, $\sigma$, of the observations is the square root of the variance, $\sigma^2$.
For the example data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110, using the mean value $\bar{x} = 58$ computed earlier, the variance is $\sigma^2 = \frac{1}{12}\left[(30-58)^2 + (36-58)^2 + \cdots + (110-58)^2\right] \approx 379.17$, so $\sigma \approx 19.47$ (in thousands of dollars).
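A quick NumPy verification of these numbers:

    import numpy as np

    salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
    print(np.mean(salaries))  # 58.0
    print(np.var(salaries))   # ~379.17 (population variance, divides by N)
    print(np.std(salaries))   # ~19.47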
3. Histograms: "Histos" means pole or mast, and "gram" means chart, so a histogram is a chart of poles. The figure below shows a histogram for the data set of Table 2.1, where buckets (or bins) are defined by equal-width ranges representing $20 increments and the frequency is the count of items sold.
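A minimal matplotlib sketch of such a histogram; since Table 2.1 is not reproduced here, the price data below are hypothetical:

    import matplotlib.pyplot as plt

    prices = [43, 47, 55, 62, 68, 74, 78, 85, 88, 95, 103, 110, 118, 125]
    plt.hist(prices, bins=range(40, 141, 20), edgecolor="black")  # $20-wide bins
    plt.xlabel("unit price ($)")
    plt.ylabel("count of items sold")
    plt.show()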
4. Scatter Plots and Data Correlation: A scatter plot is one of the most effective graphical
methods for determining a relationship, pattern, or trend between two numeric attributes. To construct
a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted
as points in the plane. Figure shows a scatter plot for the set of data in Table 2.1.
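A minimal sketch of a scatter plot over two hypothetical numeric attributes, with the Pearson correlation coefficient as a numeric summary of the trend:

    import numpy as np
    import matplotlib.pyplot as plt

    price = [40, 55, 60, 70, 85, 90, 100, 110]
    items_sold = [500, 450, 420, 380, 300, 280, 240, 210]

    plt.scatter(price, items_sold)
    plt.xlabel("unit price ($)")
    plt.ylabel("items sold")
    plt.show()

    print(np.corrcoef(price, items_sold)[0, 1])  # near -1: negative correlation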
2. Geometric Projection Visualization Techniques:
Geometric projection techniques help users find interesting projections of multidimensional data sets. The central challenge geometric projection techniques try to address is how to visualize a high-dimensional space on a 2-D display.
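One common example of such a projection is principal component analysis (PCA); the sketch below projects a hypothetical 4-dimensional data set onto 2 dimensions for display:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))  # 100 objects, 4 numeric attributes

    X2 = PCA(n_components=2).fit_transform(X)  # best 2-D linear projection
    plt.scatter(X2[:, 0], X2[:, 1])
    plt.xlabel("first principal component")
    plt.ylabel("second principal component")
    plt.show()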