DWDM_UNIT-2
Why Pre-process the Data:- Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data because of their huge size and heterogeneous sources. Data preprocessing is done to improve the quality of the data. Preprocessed data improve the efficiency and ease of the mining process. There are a number of data preprocessing techniques:
1. Data cleaning can be applied to remove noise and correct inconsistencies in
the data.
2. Data integration merges data from multiple sources into a single data store,
such as a data warehouse or a data cube.
3. Data transformations, such as normalization, may be applied. Normalization
may improve the accuracy and efficiency of mining algorithms.
4. Data reduction can reduce the data size by aggregating, eliminating redundant
features, or clustering.
Mining Descriptive Statistical Measures in Large Databases:- For many data
mining tasks, users would like to learn more data characteristics regarding both
central tendency and data dispersion. Measures of central tendency include mean,
median, mode, and midrange, while measures of data dispersion include quartiles,
outliers, variance, and other statistical measures. These descriptive statistics are of
great help in understanding the distribution of the data.
1. Measuring the central tendency:- The most common and most effective numerical measure of the "center" of a set of data is the (arithmetic) mean. Let x1, x2, …, xn be a set of n values or observations. The mean of this set of values is

mean = (x1 + x2 + … + xn) / n
For grouped (binned) data, the median can be approximated by interpolation:

median ≈ L1 + ( (n/2 − (Σf)l) / fmedian ) × c
where L1 is the lower class boundary of (i.e., lowest value for) the class containing the median, n is the number of values in the data, (Σf)l is the sum of the frequencies of all of the classes that are lower than the median class, fmedian is the frequency of the median class, and c is the size of the median class interval.
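As a small illustration of this interpolation formula, the following Python sketch computes an approximate median from grouped data; the function name and the example class frequencies are hypothetical, not from the notes.

```python
def grouped_median(classes):
    """Approximate the median from grouped (binned) data using the
    interpolation formula above.

    `classes` is a list of (lower_boundary, upper_boundary, frequency)
    tuples, sorted by boundary.
    """
    n = sum(f for _, _, f in classes)
    cum = 0
    for low, high, f in classes:
        if cum + f >= n / 2:            # this class contains the median
            L1 = low                    # lower boundary of the median class
            c = high - low              # width of the median class interval
            freq_below = cum            # (Σf)l: frequency below the median class
            return L1 + ((n / 2 - freq_below) / f) * c
        cum += f

# Example: ages grouped into 10-year classes (made-up frequencies).
print(grouped_median([(0, 10, 5), (10, 20, 12), (20, 30, 20), (30, 40, 8)]))
```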
Another measure of central tendency is the mode. The mode for a set of data is the
value that occurs most frequently in the set. It is possible for the greatest frequency
to correspond to several different values, which results in more than one mode. Data
sets with one, two, or three modes are respectively called unimodal, bimodal, and
trimodal. If a data set has more than three modes, it is multimodal. At the other
extreme, if each data value occurs only once, then there is no mode. For unimodal
frequency curves that are moderately skewed (asymmetrical), we have the following
empirical relation
mode ≈ 3 × median − 2 × mean
The midrange, that is, the average of the largest and smallest values in a data set, can also be used to measure the central tendency of the set of data. It is trivial to compute the midrange using the SQL aggregate functions max() and min().
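These measures are simple to compute; a minimal Python sketch follows (the data values are illustrative only, not from the notes).

```python
from statistics import mean, median, multimode

# Illustrative data set (not from the notes).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean     :", mean(values))
print("median   :", median(values))
print("mode(s)  :", multimode(values))            # a data set may have several modes
print("midrange :", (max(values) + min(values)) / 2)
```

In SQL, the midrange is simply (MAX(col) + MIN(col)) / 2.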
A fuller summary of the shape of a distribution can be obtained by providing the quartiles together with the median and the highest and lowest data values as well. This is known as the five-number summary. The five-number summary of a distribution consists of the median M, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order: Minimum, Q1, M, Q3, Maximum.
A popularly used visual representation of a distribution is the boxplot. In a boxplot:
1. The ends of the box are at the quartiles, so that the box length is the interquartile
range, IQR.
2. The median is marked by a line within the box.
3. Two lines (called whiskers) outside the box extend to the smallest (Minimum) and
largest (Maximum) observations.
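A minimal Python sketch that computes the five-number summary and the IQR that defines the box of a boxplot (the data values are illustrative only):

```python
import numpy as np

# Illustrative data (not from the notes).
data = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, median, q3 = np.percentile(data, [25, 50, 75])
five_number = (data.min(), q1, median, q3, data.max())
print("Minimum, Q1, M, Q3, Maximum:", five_number)
print("IQR (box length in a boxplot):", q3 - q1)
```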
Figure 5.5 shows a histogram for the data set of Table 5.11, where classes are
defined by equi-width ranges representing $10 increments. Histograms are at least a
century old, and are a widely used univariate graphical method. However, they may
not be as effective as the quantile plot, Q-Q plot and boxplot methods for comparing
groups of univariate observations.
A quantile plot is a simple and effective way to have a first look at data distribution.
First, it displays all of the data (allowing the user to assess both the overall behavior
and unusual occurrences). Second, it plots quantile information. The mechanism
used in this step is slightly different from the percentile computation.
A quantile-quantile (Q-Q) plot graphs the quantiles of one univariate distribution against the corresponding quantiles of another, allowing the user to see whether there is a shift in going from one distribution to the other.
Fig: A Q-Q plot for the above dataset
A scatter plot is one of the most effective graphical methods for determining if
there appears to be a relationship, pattern, or trend between two quantitative
variables. To construct a scatter plot, each pair of values is treated as a pair of
coordinates in an algebraic sense, and plotted as points in the plane. The scatter plot
is a useful exploratory method for providing a first look at bivariate data to see how
they are distributed throughout the plane, for example, and to see clusters of points,
outliers, and so forth. Figure 5.8 shows a scatter plot for the set of data in Table
A loess curve (local regression) adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence between the two variables.
Fig: A loess curve for the above dataset
Data integration:- Data integration combines data from multiple sources into a coherent data store, such as a large database or data warehouse. Major issues to be considered during data integration are:
Entity identification problem:- Sometimes customer_id in one database and cust_number in another refer to the same entity. A data analyst or the computer can decide whether they both refer to the same entity by examining the metadata of the data warehouse. Metadata is data about the data. Such metadata can be used to help avoid errors in schema integration.
Redundancy:- Redundancy is another important issue. An attribute may be
redundant if it can be “derived" from another table, such as annual revenue. Some
redundancies can be detected by correlation analysis. The correlation between
attributes A and B can be measured by
rA,B = Σ (A − mean(A)) (B − mean(B)) / ((n − 1) σA σB)
where mean(A) and mean(B) are the mean values of A and B, and σA and σB are their standard deviations. If the correlation coefficient rA,B is greater than 0, then A and B are positively correlated. The higher the value, the more each attribute implies the other; hence, a high value may indicate that A (or B) can be removed as a redundancy. If the resulting value is equal to 0, then A and B are independent and there is no correlation between them. If the resulting value is less than 0, then A and B are negatively correlated.
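A minimal Python sketch of this redundancy check (the attribute values are illustrative; numpy's built-in corrcoef is used only to verify the hand computation):

```python
import numpy as np

# Illustrative attribute values for A and B (not from the notes).
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

n = len(A)
# Correlation coefficient r_{A,B} as defined above (sample standard deviations).
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
print("r_{A,B} =", round(r, 4))             # close to +1 -> strong positive correlation
print("check   :", np.corrcoef(A, B)[0, 1]) # numpy's built-in gives the same value
```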
Detection and resolution of data value conflicts:- A third important issue in data integration is the detection and resolution of data value conflicts. For example, for the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding; for instance, a weight attribute may be stored in metric units in one system and in British imperial units in another, and the price of hotel rooms may be quoted in different currencies.
Careful integration of the data from multiple sources can help reduce and
avoid redundancies and inconsistencies in the resulting data set. This can help
improve the accuracy and speed of the subsequent mining process.
Data transformation:- In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve smoothing, aggregation, generalization, normalization, and attribute construction.
In generalization, low-level (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country. Similarly, values for numeric attributes, like age, may be mapped to higher-level concepts, like young, middle-aged, and senior.
In normalization, an attribute is scaled so that its values fall within a small specified range, such as 0.0 to 1.0. There are three main methods for data normalization. They are
min-max normalization,
z-score normalization, and
normalization by decimal scaling.
Min-max normalization performs a linear transformation on the original data.
Suppose that minA and maxA are the minimum and maximum values of an attribute
A. Min-max normalization maps a value v of A to v′ by computing

v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Example:- Suppose that the maximum and minimum values for the attribute income are $98,000 and $12,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to

v′ = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
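A minimal Python sketch of min-max normalization, reproducing the income example (the helper name is hypothetical):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of a single value v of attribute A."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The income example above: $73,600 with min $12,000 and max $98,000.
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))   # 0.716
```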
In z-score normalization (or zero-mean normalization), the values for an attribute A
are normalized based on the mean and standard deviation of A. A value v of A is normalized to v′ by computing

v′ = (v − mean(A)) / σA

where mean(A) and σA stand for the mean and standard deviation of A. This method of
normalization is useful when the actual minimum and maximum of attribute A are
unknown, or when there are outliers which dominate the min-max normalization.
Example :- Suppose that the mean and standard deviation of the values for the
attribute income are $54,000 and $16,000, respectively. With z-score normalization,
a value of $73,600 for income is transformed to

v′ = (73,600 − 54,000) / 16,000 = 1.225
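The same example as a short Python sketch (helper name hypothetical):

```python
def z_score_normalize(v, mean_a, std_a):
    """Z-score (zero-mean) normalization of a single value v of attribute A."""
    return (v - mean_a) / std_a

# The income example above: $73,600 with mean $54,000 and std. dev. $16,000.
print(round(z_score_normalize(73_600, 54_000, 16_000), 3))   # 1.225
```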
Normalization by decimal scaling normalizes by moving the decimal point of values
of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v′ by computing

v′ = v / 10^j

where j is the smallest integer such that max(|v′|) < 1.
Example:- Suppose that the recorded values of A range from -986 to 917. The
maximum absolute value of A is 986. To normalize by decimal scaling, divide each
value by 1,000 (i.e., j = 3), so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
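A minimal Python sketch of decimal scaling, reproducing this example (helper name hypothetical):

```python
def decimal_scale(values):
    """Normalization by decimal scaling: divide by 10**j, where j is the
    smallest integer such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values], j

# The example above: recorded values ranging from -986 to 917.
scaled, j = decimal_scale([-986, 917])
print(j, scaled)          # 3, [-0.986, 0.917]
```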
Data reduction:- Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. Data reduction strategies include the following.
1. Data cube aggregation:- Data cubes store multidimensional aggregated information. A data cube consists of many cells, and each cell holds an aggregate data value; the cube as a whole summarizes the data at multiple levels of abstraction. Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining.
The cube created at the lowest level of abstraction is referred to as the base
cuboid. A cube for the highest level of abstraction is the apex cuboid. For the sales
data represented in the cube , the apex cuboid would give one total i.e the total sales
for all three years, for all item types, and for all branches. Data cubes created for
varying levels of abstraction are sometimes referred to as cuboids, so that a data
cube may instead refer to a lattice of cuboids. Each higher level of abstraction
further reduces the resulting data size.
2. Dimension reduction:- In Dimension reduction, irrelevant, weakly relevant, or
redundant attributes or dimensions may be detected and removed.
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. For example, when analyzing customers' music interests, attributes such as the customer's telephone number are likely to be irrelevant, whereas attributes such as age or music taste are relevant.
The 'best' (and 'worst') attributes are typically selected using greedy methods. Some
of the methods of attribute subset selection are
1. Step-wise forward selection:- The procedure starts with an empty set of attributes.
The best of the original attributes is determined and added to the set. At each
subsequent iteration or step, the best of the remaining original attributes is added to
the set.
2. Step-wise backward elimination:- The procedure starts with the full set of
attributes. At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination:- The step-wise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction:- Decision tree induction constructs a flow-chart-like
structure where each internal (non-leaf) node denotes a test on an attribute, each
branch corresponds to an outcome of the test, and each external (leaf) node denotes a
class prediction. At each node, the algorithm chooses the “best” attribute to partition
the data into individual classes. When decision tree induction is used for attribute
subset selection, a tree is constructed from the given data. All attributes that do not
appear in the tree are assumed to be irrelevant. The set of attributes appearing in the
tree form the reduced subset of attributes.
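A minimal Python sketch of step-wise forward selection, assuming scikit-learn is available; the data set (iris), the scoring choice (cross-validated accuracy of a small decision tree), and the stopping rule are illustrative assumptions, not part of the notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_attrs = X.shape[1]

def score(attr_subset):
    # Cross-validated accuracy of a small decision tree on the chosen attributes.
    return cross_val_score(DecisionTreeClassifier(max_depth=3),
                           X[:, attr_subset], y, cv=5).mean()

# Step-wise forward selection: start empty, greedily add the best attribute.
selected, remaining = [], list(range(n_attrs))
while remaining:
    best = max(remaining, key=lambda a: score(selected + [a]))
    if selected and score(selected + [best]) <= score(selected):
        break                      # no remaining attribute improves the score
    selected.append(best)
    remaining.remove(best)

print("selected attribute indices:", selected)
```

Backward elimination is the mirror image: start with all attributes and repeatedly drop the one whose removal hurts the score least.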
3. Data compression:- In data compression, encoding mechanisms are used to obtain a reduced or "compressed" representation of the original data. If the original data can
be reconstructed from the compressed data without any loss of information, the data
compression technique used is called lossless. If, instead, we can reconstruct only an
approximation of the original data, then the data compression technique is called
lossy. Two popular and effective methods of lossy data compression are
1. Wavelet transforms and 2. Principal components analysis.
1. Wavelet transforms:- The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector D, transforms it to a numerically different vector, D′, of wavelet coefficients. Although the transformed vector has the same length as the original data, the technique is useful for data reduction because the wavelet coefficients can be truncated: a compressed approximation of the data is obtained by retaining only a small fraction of the strongest coefficients.
The DWT is closely related to the discrete Fourier transform (DFT), a signal
processing technique involving sines and cosines. In general the DWT achieves
better lossy compression. That is, if the same number of coefficients are retained for
a DWT and a DFT of a given data vector, the DWT version will provide a more
accurate approximation of the original data.
Popular wavelet transforms include the Daubechies-4 and the Daubechies-6
transforms. Wavelet transforms can be applied to multidimensional data, such as a
data cube. Wavelet transforms give good results on sparse or skewed data, and data
with ordered attributes.
There is only one DFT, yet there are several DWTs. The general algorithm for a
discrete wavelet transform is
as follows.
1. The length, L, of the input data vector must be an integer power of two. This
condition can be met by padding the data vector with zeros, as necessary.
2. Each transform involves applying two functions. The first applies some data
smoothing, such as a sum or weighted average. The second performs a weighted
difference.
3. The two functions are applied to pairs of the input data, resulting in two sets of data of length L/2. In general, these respectively represent a smoothed version of the input data and its high-frequency content.
4. The two functions are recursively applied to the sets of data obtained in the
previous loop, until the resulting data sets obtained are of desired length.
5. A selection of values from the data sets obtained in the above iterations are
designated the wavelet coefficients of the transformed data.
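A minimal Python sketch of this general algorithm using the Haar wavelet, the simplest DWT; the function name and the sample signal are illustrative, not from the notes.

```python
import numpy as np

def haar_dwt(data):
    """One full Haar wavelet decomposition of a 1-D signal.

    The input is padded with zeros up to a power-of-two length,
    mirroring step 1 of the algorithm above.
    """
    x = np.asarray(data, dtype=float)
    n = len(x)
    size = 1 << (n - 1).bit_length()      # next power of two (step 1)
    x = np.pad(x, (0, size - n))

    coeffs = []
    while len(x) > 1:
        # Steps 2-3: pairwise weighted averages (smoothing) and differences.
        smooth = (x[0::2] + x[1::2]) / np.sqrt(2)
        detail = (x[0::2] - x[1::2]) / np.sqrt(2)
        coeffs.append(detail)             # high-frequency content
        x = smooth                        # step 4: recurse on the smoothed half
    coeffs.append(x)                      # final smoothed value
    return coeffs[::-1]                   # coarsest coefficients first (step 5)

# Data reduction then keeps only the strongest coefficients and zeroes the rest.
signal = [2, 2, 0, 2, 3, 5, 4, 4]
print(haar_dwt(signal))
```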
Numerosity reduction techniques reduce the data volume by choosing alternative, smaller forms of data representation. These techniques may be parametric or non-parametric. Parametric methods use a model to estimate the data, so that typically only the model parameters need to be stored instead of the actual data; log-linear models, which estimate discrete multidimensional probability distributions, are an example. Non-parametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
Let's have a look at each of these numerosity reduction techniques.
4. Regression and log-linear models
Regression and log-linear models can be used to approximate the given data. In linear regression, the data are modeled to fit a straight line. For example, a random variable, Y (called a response variable), can be modeled as a linear function of another random variable, X (called a predictor variable), with the equation

Y = mX + c

where m (the slope) and c (the intercept) are the regression coefficients, which can be solved for by the method of least squares.
Multiple linear regression is an extension of linear regression allowing a response
variable Y to be modeled as a linear function of a multidimensional feature vector.
Log-linear models approximate discrete multidimensional probability distributions.
The method can be used to
estimate the probability of each cell in a base cuboid for a set of discretized
attributes, based on the smaller cuboids making up the data cube lattice. This allows
higher order data cubes to be constructed from lower order ones.
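A minimal Python sketch of fitting Y = mX + c by least squares with numpy; the sample points are made up for illustration.

```python
import numpy as np

# Hypothetical sample: x = predictor values, y = response values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit Y = m*X + c by least squares, as in the linear-regression model above.
m, c = np.polyfit(x, y, deg=1)
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")

# The fitted (m, c) pair can stand in for the raw data points, which is the
# numerosity-reduction idea: store the model parameters, not the data.
print("predicted y at x = 6:", m * 6 + c)
```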
5. Histograms
Histograms use binning to approximate data distributions and are a popular form of
data reduction. A histogram
for an attribute A partitions the data distribution of A into disjoint subsets, or
buckets. The buckets are displayed on a horizontal axis, while the height (and area)
of a bucket typically represents the average frequency of the values represented by
the bucket. If each bucket represents only a single attribute-value/frequency pair, the
buckets are called singleton buckets. Often, buckets instead represent continuous
ranges for the given attribute.
Example The following data are a list of prices of commonly sold items at
AllElectronics (rounded to the nearest dollar). The numbers have been sorted.
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
How are the buckets determined and the attribute values partitioned? There are
several partitioning rules,including the following.
1. Equi-width: In an equi-width histogram, the width of each bucket range is
constant (such as the width of $10 for the buckets in Figure 3.8).
2. Equi-depth (or equi-height): In an equi-depth histogram, the buckets are created
so that, roughly, the frequency of each bucket is constant (that is, each bucket
contains roughly the same number of contiguous data samples).
3. V-Optimal: If we consider all of the possible histograms for a given number of
buckets, the V-optimal histogram is the one with the least variance. Histogram
variance is a weighted sum of the original values that each bucket represents, where
bucket weight is equal to the number of values in the bucket.
4. MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair of adjacent values having one of the β − 1 largest differences, where β is the user-specified number of buckets.
V-Optimal and MaxDiff histograms tend to be the most accurate and practical.
Histograms are highly effective
at approximating both sparse and dense data, as well as highly skewed, and uniform
data.
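A minimal Python sketch contrasting equi-width and equi-depth bucket boundaries on the price list from the example above (three buckets in each case):

```python
import numpy as np

# The AllElectronics price list from the example above.
prices = np.array([1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
                   15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
                   20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
                   28, 28, 30, 30, 30])

# Equi-width: three buckets of constant width over the range of the data.
counts, edges = np.histogram(prices, bins=3)
print("equi-width edges :", edges, "counts:", counts)

# Equi-depth: boundaries chosen so that each bucket holds roughly the same
# number of values (here, the 1/3 and 2/3 quantiles).
depth_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
print("equi-depth edges :", depth_edges)
```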
6. Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters. Similarity is commonly defined in terms of how "close" the objects are in space, based on a distance function. The "quality" of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster. In data reduction, the cluster representations of the data are stored in place of the actual data.
7. Sampling:
Sampling can be used as a data reduction technique since it allows a large data set to
be represented by a much
smaller random sample (or subset) of the data. Suppose that a large data set, D,
contains N tuples. Let's have a look at some possible samples for D.
1. Simple random sample without replacement (SRSWOR) of size n: This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.
2. Simple random sample with replacement (SRSWR) of size n: This is similar to
SRSWOR, except that each time a tuple is drawn from D, it is recorded and then
replaced. That is, after a tuple is drawn, it is placed back in D so that it may be
drawn again.
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint
“clusters", then a SRS of m clusters can be obtained, where m < M. For example,
tuples in a database are usually retrieved a page at a time, so
that each page can be considered a cluster. A reduced data representation can be
obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of
the tuples.
4. Stratified sample: If D is divided into mutually disjoint parts called "strata", a stratified sample of D is generated by obtaining an SRS at each stratum. This helps to ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers is sure to be represented.
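A minimal Python sketch of SRSWOR, SRSWR, and stratified sampling on a toy data set of N = 100 tuples (the stratification rule is illustrative only):

```python
import random
from collections import defaultdict

random.seed(42)
D = list(range(1, 101))            # a toy data set of N = 100 "tuples"

# 1. SRSWOR of size n = 10: every tuple drawn with probability 1/N, no repeats.
srswor = random.sample(D, k=10)

# 2. SRSWR of size n = 10: each draw is replaced, so duplicates are possible.
srswr = random.choices(D, k=10)

# 4. Stratified sample: group tuples into strata, then take an SRS per stratum.
strata = defaultdict(list)
for t in D:
    strata["low" if t <= 50 else "high"].append(t)
stratified = [t for group in strata.values() for t in random.sample(group, k=5)]

print(srswor, srswr, stratified, sep="\n")
```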
Data Discretization and Concept Hierarchy Generation:-
Data discretization techniques can be used to reduce the number of values for a
given continuous attribute by dividing the range of the attribute into intervals.
Interval labels can then be used to replace actual data values.
Discretization and Concept Hierarchy Generation for Numerical Data
i) Binning
Binning is a top-down splitting technique based on a specified number of bins. For
example, attribute values can be discretized by applying equal-width or equal-
frequency binning, and then replacing each bin value by the bin mean or median, as
in smoothing by bin means or smoothing by bin medians, respectively. These
techniques can be applied recursively to the resulting partitions in order to generate
concept hierarchies.
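A minimal Python sketch of equal-frequency binning followed by smoothing by bin means (the attribute values are illustrative, not from the notes):

```python
import numpy as np

# Illustrative sorted attribute values (not from the notes).
values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Equal-frequency binning into 3 bins, then smoothing by bin means.
bins = np.array_split(values, 3)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)   # each value is replaced by the mean of its bin
```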
ii) Histogram Analysis
Histograms partition the values for an attribute, A, into disjoint ranges called
buckets.
In an equal-width histogram, for example, the values are partitioned into equal-sized
partitions or ranges.
With an equal-frequency histogram, the values are partitioned so that, ideally, each
partition contains the same number of data tuples.
iii) Entropy-Based Discretization
Entropy-based discretization is a supervised, top-down splitting technique. It uses class distribution information to select the split point that minimizes the entropy of the resulting partitions, and recursively partitions the attribute's range until a stopping criterion is met.
Concept Hierarchy Generation for Categorical Data
Specification of a partial ordering of attributes explicitly at the schema level by users or experts
A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level. For example, a hierarchy for the dimension location can be defined by specifying the total ordering among these attributes at the schema level, such as
street < city < province_or_state < country
Specification of a portion of a hierarchy by explicit data grouping
We can easily specify explicit groupings for a small portion of intermediate-level data. For example, after specifying that city and state form a hierarchy at the schema level, a user could define some intermediate levels manually, such as
{Urbana, Champaign, Chicago} < Illinois
Specification of a set of attributes, but not of their partial ordering
A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.
Example: Suppose a user selects a set of location-oriented attributes, street, country, province_or_state, and city, from the AllElectronics database, but does not specify the hierarchical ordering among the attributes.
The system can automatically generate a schema concept hierarchy based on the number of distinct values of each attribute: the attribute with the most distinct values is placed at the lowest level of the hierarchy, and attributes with fewer distinct values are placed at higher levels (e.g., country, with 15 distinct values, is placed above province_or_state, with 365 distinct values). There are exceptions to this heuristic; for example, among the time attributes weekday, month, quarter, and year, weekday has only 7 distinct values but should not be placed at the top of the time hierarchy.
Data Warehousing and OLAP Technology: An Overview
When constructing a data warehouse, data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, and attribute measures across the source records.
[Fig: a lattice of cuboids for the dimensions time, item, location, and supplier, ranging up through the 3-D cuboids (time,item,location; time,item,supplier; time,location,supplier; item,location,supplier) to the 4-D (base) cuboid on time, item, location, supplier.]
[Fig: a snowflake schema of a sales data warehouse. The central sales fact table holds keys (e.g., branch_key, location_key) and the measures units_sold, dollars_sold, and avg_sales; the branch dimension table holds branch_key, branch_name, and branch_type; the location dimension table holds location_key, street, and city_key, with city further normalized into its own table holding city_key, city, state_or_province, and country. In a fact constellation, a shared dimension can be referenced as <dimension_name_first_time> in cube <cube_name_first_time>.]
Specification of hierarchies:
A schema hierarchy defines a total or partial order among attributes, e.g., day < {month < quarter; week} < year.
A set-grouping hierarchy groups values into higher-level concepts, e.g., {1..10} < inexpensive.
A Sample Data Cube
[Fig: a sample 3-D sales data cube with dimensions date (quarters 1Qtr to 4Qtr), product (TV, PC, VCR), and country (U.S.A., Canada, Mexico), including sum cells such as the total annual sales of TVs in the U.S.A. The corresponding lattice of cuboids ranges from the 0-D (apex) cuboid, through the 1-D cuboids (product, date, country) and the 2-D cuboids, to the 3-D (base) cuboid on product, date, country.]
Typical OLAP Operations
Roll-up (drill-up): summarizes data by climbing up a concept hierarchy or by dimension reduction.
Drill-down (roll-down): the reverse of roll-up; navigates from a higher-level summary to a lower-level summary or to detailed data, or introduces new dimensions.
Slice and dice: slice performs a selection on one dimension of the cube; dice defines a sub-cube by performing a selection on two or more dimensions.
Pivot (rotate): reorients the cube for visualization, e.g., rotating a 3-D cube into a series of 2-D planes.
Other operations:
Drill-across: involves (crosses) more than one fact table.
Drill-through: uses SQL to drill through the bottom level of the cube to its back-end relational tables.
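These operations can be imitated on a small scale with pandas, as a rough sketch only; the sales figures below are made up, and a real OLAP server operates on precomputed cubes rather than a flat table.

```python
import pandas as pd

# Toy fact table (illustrative values only).
sales = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "country":  ["Canada",  "Canada",  "Canada",    "Canada"],
    "quarter":  ["Q1",      "Q2",      "Q1",        "Q2"],
    "item":     ["computer", "phone",  "computer",  "phone"],
    "dollars_sold": [1200, 90, 605, 14],
})

# Roll-up on location: climb the hierarchy from city to country.
rollup = sales.groupby(["country", "quarter"])["dollars_sold"].sum()

# Slice: select a single value on one dimension (time = "Q1").
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = sales[sales["location"].isin(["Toronto", "Vancouver"])
             & sales["quarter"].isin(["Q1", "Q2"])
             & sales["item"].isin(["computer"])]

# Pivot (rotate): reorient item vs. location as a 2-D cross-tabulation.
pivoted = sales.pivot_table(index="item", columns="location",
                            values="dollars_sold", aggfunc="sum")

print(rollup, slice_q1, dice, pivoted, sep="\n\n")
```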
Fig. 3.10 Typical OLAP operations on multidimensional data: roll-up on location (from cities such as Chicago, New York, Toronto, and Vancouver to countries such as USA and Canada); drill-down on time (from quarters to months); slice for time = "Q1"; dice for (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer"); and pivot, which rotates the item and location axes.
A Star-Net Query Model
[Fig: a star-net query model in which each dimension radiates from a central point and each abstraction level along a radial line is called a footprint. For example, the Time line has footprints DAILY, QTRLY, and ANNUALLY; the Product line has PRODUCT ITEM, PRODUCT GROUP, and PRODUCT LINE; the Location line has CITY, DISTRICT, REGION, and COUNTRY; other lines represent Customer Orders (CUSTOMER, ORDER, CONTRACTS), Shipping Method (TRUCK, AIR-EXPRESS), Promotion, and Organization (SALES PERSON, DIVISION).]
[Fig: a multi-tiered data warehouse architecture. Data from operational databases and other external sources are extracted, transformed, loaded, and refreshed into the data warehouse and data marts; a monitor and integrator component, together with metadata, manages this bottom tier. OLAP servers form the middle tier, and front-end tools for analysis, query/reporting, and data mining form the top tier. Some views may be kept materialized in the data marts.]
Data Warehouse Development: A Recommended Approach
[Fig: a recommended incremental approach: data marts and an enterprise data warehouse are built and refined in parallel, evolving into a multi-tier data warehouse with distributed data marts.]
Data transformation: convert data from legacy or host format to warehouse format.
Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions.
Business data (part of the metadata repository): business terms and definitions, ownership of the data, and charging policies.
OLAP Server Architectures
Relational OLAP (ROLAP): uses a relational or extended-relational DBMS to store and manage warehouse data, together with OLAP middleware. ROLAP includes optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services, and it offers greater scalability.
Multidimensional OLAP (MOLAP): uses a sparse, array-based multidimensional storage engine with fast indexing to pre-computed summarized data.
Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server): combines both for flexibility, e.g., keeping low-level detail data in relational form and high-level aggregations in array form.
Specialized SQL servers (e.g., Redbrick): provide specialized support for SQL queries over star/snowflake schemas.