U5@-Data Reduction

The document provides an overview of data reduction strategies, including techniques such as Principal Components Analysis (PCA), Attribute Subset Selection, and various methods for numerosity reduction. It discusses the importance of maintaining data integrity while reducing volume for improved efficiency in data mining processes. Additionally, it covers data visualization techniques and the applications of these data reduction methods in fields like image processing and finance.

UNIT - V
Data Reduction: Overview of Data Reduction Strategies, Wavelet Transforms, Principal Components Analysis, Attribute Subset Selection, Regression and Log-Linear Models: Parametric Data Reduction, Histograms, Clustering, Sampling, Data Cube Aggregation.
Data Visualization: Pixel-Oriented Visualization Techniques, Geometric Projection Visualization Techniques, Icon-Based Visualization Techniques, Hierarchical Visualization Techniques, Visualizing Complex Data and Relations.

Data Reduction: Overview of Data Reduction Strategies

 Data reduction is a process that reduces the volume of the original data and represents it in a much smaller form.
 Data reduction techniques are used to obtain a reduced representation of the dataset that is much smaller in volume while maintaining the integrity of the original data.
 By reducing the data, the efficiency of the data mining process is improved, and the analysis still produces the same (or almost the same) analytical results.
 Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms.
 The reduction of the data may be in terms of the number of rows (records) or the number of columns (dimensions).

1. Dimensionality Reduction

 Dimensionality reduction eliminates attributes from the data set under consideration, thereby reducing the volume of the original data.
 It reduces the data size by eliminating irrelevant or redundant features. Common dimensionality reduction methods include wavelet transforms, principal component analysis, and attribute subset selection; the latter two are described below.

Principal Component Analysis


 Principal Component Analysis is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning. It is a statistical process that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components.
 It is a popular tool for exploratory data analysis and predictive modeling. It draws out the strong patterns in a dataset by keeping the directions of highest variance and discarding the rest.
 PCA works by considering the variance of each attribute, because high variance indicates a good separation between observations, and it reduces the dimensionality by keeping only the directions that carry most of that variance. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing power allocation in communication channels. It is a feature extraction technique, so it retains the important derived variables and drops the least important ones.
 The PCA algorithm is based on mathematical concepts such as:

o Variance and Covariance


o Eigenvalues and Eigenvectors

Some common terms used in PCA algorithm:

o Dimensionality: The number of features or variables present in the given dataset; more simply, the number of columns in the dataset.
o Correlation: It signifies how strongly two variables are related to each other, i.e., if one changes, the other variable also changes. The correlation value ranges from -1 to +1: -1 occurs when the variables are inversely proportional to each other, and +1 indicates that the variables are directly proportional to each other.
o Orthogonal: It means that the variables are not correlated with each other, so the correlation between a pair of variables is zero.
o Eigenvectors: For a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariances between each pair of variables. (Covariance measures the direction of the relationship between two variables: a positive covariance means that both variables tend to be high or low at the same time, while a negative covariance means that when one variable is high, the other tends to be low.)
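
To make these terms concrete, here is a minimal NumPy sketch (the toy numbers are made up for illustration) that computes the covariance matrix of a small two-attribute dataset and then its eigenvalues and eigenvectors:

import numpy as np

# Toy dataset: 5 observations (rows) of two correlated attributes (columns)
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Covariance matrix of the attributes
cov = np.cov(X, rowvar=False)
print("Covariance matrix:\n", cov)

# Eigenvalues and eigenvectors of the symmetric covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)
print("Eigenvalues:", eig_vals)
print("Eigenvectors (as columns):\n", eig_vecs)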

Principal Components in PCA

 As described above, the transformed new features or the


output of PCA are the Principal Components.
 The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:

o Each principal component must be a linear combination of the original features. (A linear combination is an expression constructed from a set of terms by multiplying each term by a constant and adding the results, e.g. a linear combination of x and y is any expression of the form ax + by, where a and b are constants.)
o These components are orthogonal, i.e., the correlation between
a pair of variables is zero.

Steps for PCA algorithm


 1. Getting the dataset
Firstly, we take the input dataset and divide it into two subparts X and Y, where X is the training set and Y is the validation set.
 2. Representing the data in a structure
Next, we represent the dataset in a structure: a two-dimensional matrix of the independent variables X, where each row corresponds to a data item and each column corresponds to a feature. The number of columns is the dimensionality of the dataset.
 3. Standardizing the data
In this step, we standardize the dataset. Otherwise, in a particular column, features with high variance would dominate features with lower variance; if the importance of a feature is independent of its variance, we divide each data item in a column by the standard deviation of that column. We name the resulting matrix Z.
 4. Calculating the covariance of Z
To calculate the covariance of Z, we take the matrix Z and transpose it. After transposing, we multiply it by Z. The output matrix is the covariance matrix of Z.

 5. Calculating the eigenvalues and eigenvectors
Now we calculate the eigenvalues and eigenvectors of the resulting covariance matrix of Z. The eigenvectors of the covariance matrix are the directions of the axes that carry the most information, and the corresponding eigenvalues measure how much variance lies along each of those directions.

 6. Sorting the eigenvectors
In this step, we take all the eigenvalues and sort them in decreasing order, from largest to smallest, and simultaneously sort the eigenvectors accordingly into a matrix P of eigenvectors. The resulting matrix is named P*.

 7. Calculating the new features, or principal components
Here we calculate the new features. To do this, we multiply the standardized matrix Z by P*. In the resulting matrix Z*, each observation is a linear combination of the original features, and the columns of Z* are independent of each other.

 8. Removing less important features from the new dataset
Once the new feature set is obtained, we decide what to keep and what to remove: we keep only the relevant or important features (the components with the largest eigenvalues) in the new dataset, and the unimportant features are removed.
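
A compact NumPy sketch of the above steps is shown below. It is only an illustration of the procedure, not a production implementation; the names Z, P_star and Z_star simply mirror the notation used in the steps.

import numpy as np

def pca(X, n_components):
    # Steps 1-2: X is a 2-D array, rows = records, columns = features
    # Step 3: standardize each column (zero mean, unit standard deviation)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 4: covariance matrix of the standardized data
    cov = np.cov(Z, rowvar=False)
    # Step 5: eigenvalues and eigenvectors of the covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # Step 6: sort eigenvectors by decreasing eigenvalue into P*
    order = np.argsort(eig_vals)[::-1]
    P_star = eig_vecs[:, order]
    # Steps 7-8: project Z onto the top n_components eigenvectors to get Z*
    Z_star = Z @ P_star[:, :n_components]
    return Z_star

# Example: reduce 4 features to 2 principal components
X = np.random.default_rng(0).normal(size=(100, 4))
print(pca(X, 2).shape)   # (100, 2)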

Applications of Principal Component Analysis


o PCA is mainly used as a dimensionality reduction technique in various AI applications such as computer vision, image compression, etc.
o It can also be used for finding hidden patterns when the data has high dimensionality. Some fields where PCA is used are finance, data mining, psychology, etc.

Applications:
1. Image compression: a process applied to a graphics file to minimize its size in bytes without degrading image quality below an acceptable threshold.
2. Face recognition: the main idea of using PCA for face recognition is to express the large 1-D vector of pixels constructed from a 2-D facial image as a compact set of principal components in the feature space.
3. Medical data: PCA has been used on medical data to show the correlation of cholesterol with low-density lipoprotein.

Attribute Subset Selection
 Attribute subset selection is a technique used for data reduction in the data mining process. Data reduction reduces the size of the data so that it can be used for analysis more efficiently.

 Need for Attribute Subset Selection

The data set may have a large number of attributes, but some of those attributes can be irrelevant or redundant. The goal of attribute subset selection is to find a minimum set of attributes such that dropping the irrelevant attributes does not noticeably affect the utility of the data, while the cost of data analysis is reduced. Mining on a reduced data set also makes the discovered patterns easier to understand.
Process of Attribute Subset Selection
The brute-force approach, in which every subset of the n attributes (2^n possible subsets) is analyzed, can be very expensive.

A practical alternative is to use tests of statistical significance so that the best (or worst) attributes can be recognized. Such tests assume that the attributes are independent of one another.
 This is a greedy approach: a significance level is chosen (a commonly used value is 5%), and models are tested repeatedly until the p-value (probability value) of every retained attribute is less than or equal to the selected significance level. Attributes whose p-value is higher than the significance level are discarded.
 The procedure is repeated until every attribute remaining in the data set has a p-value less than or equal to the significance level. This gives us a reduced data set with no irrelevant attributes.
Methods of Attribute Subset Selection-
1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction
All the above methods are greedy approaches for attribute subset
selection.

1. Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attribute (the one with the minimum p-value) is chosen and added to the minimal set; in each iteration, one more attribute is added to the reduced set.
2. Stepwise Backward Elimination: Here all the attributes are considered in the initial set. In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set.
3. Combination of Forward Selection and Backward Elimination: Stepwise forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most commonly used technique for attribute selection.
4. Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flow-chart-like structure with internal nodes denoting tests on attributes, branches corresponding to the outcomes of the tests, and leaf nodes denoting a class prediction. The attributes that do not appear in the tree are considered irrelevant and are discarded.
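
The sketch below illustrates stepwise forward selection with p-values, assuming a numeric pandas DataFrame X and a target series y; it uses statsmodels to obtain the p-values, and the helper name forward_select and the 5% level are illustrative choices, not part of the original text.

import pandas as pd
import statsmodels.api as sm

def forward_select(X, y, alpha=0.05):
    """Greedily add the attribute with the smallest p-value while it stays <= alpha."""
    selected = []
    remaining = list(X.columns)
    while remaining:
        pvals = {}
        for col in remaining:
            # Fit a model on the already-selected attributes plus one candidate
            model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha:      # no remaining attribute is significant
            break
        selected.append(best)
        remaining.remove(best)
    return selected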

Numerosity Reduction:

Numerosity reduction is a data reduction technique that replaces the original data with a smaller form of data representation. There are two families of techniques for numerosity reduction:
Parametric and Non-Parametric methods.
Parametric Methods –
 For parametric methods, data is represented using some model. (Parametric modeling means creating a model from some known facts about a population; these "facts" are called parameters.)
 The model is used to estimate the data, so that only the parameters of the model need to be stored instead of the actual data. Regression and log-linear methods are used for creating such models.
 Regression:
o Terms:
o Dependent Variable: The main factor in Regression analysis
which we want to predict or understand is called the
dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also known as predictors.

 Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
 More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
Regression can be simple linear regression or multiple linear regression. When there is only a single independent attribute, the model is called simple linear regression; if there are multiple independent attributes, it is called multiple linear regression.

 In multiple linear regression, y is modeled as a linear function of two or more predictor (independent) variables; a short code sketch is given below, after the log-linear model.

 Log-Linear Model:
Log-linear model can be used to estimate the probability of
each data point in a multidimensional space for a set of
discretized attributes, based on a smaller subset of
dimensional combinations.

 This allows a higher-dimensional data space to be constructed from lower-dimensional attributes.
Regression and log-linear models can both be used on sparse data, although their application may be limited.
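
As a sketch of the regression case, the example below fits a multiple linear regression with scikit-learn on a synthetic dataset; after fitting, only the coefficients and intercept need to be stored in place of the raw records. The data and numbers are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))                              # two predictor attributes
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

model = LinearRegression().fit(X, y)
# 1000 records are summarized by just three parameters:
print("coefficients:", model.coef_, "intercept:", model.intercept_)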
Types of Numerosity Reduction

These methods use alternative, smaller forms of data representation, thus reducing the data volume. There are two types of numerosity reduction:


1. Parametric
 This method assumes a model into which the data fits. Data model
parameters are estimated, and only those parameters are stored, and the
rest of the data is discarded. Regression and Log-Linear methods are
used for creating such models. For example, a regression model can be
used to achieve parametric reduction if the data fits the Linear
Regression model.

o Log-Linear Model: Log-linear model can be used to estimate the


probability of each data point in a multidimensional space for a set of
discretized attributes based on a smaller subset of dimensional
combinations.
o The Log-Linear model discovers the relationship between two or more
discrete attributes. Assume we have a set of tuples in n-dimensional
space; the log-linear model helps derive each tuple's probability in this n-
dimensional space.
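
A tiny sketch of the idea, assuming the simplest (independence) log-linear model on a two-way contingency table: the joint cell probabilities are estimated from the lower-dimensional marginal distributions, so only the marginals need to be kept.

import numpy as np

# Observed counts for two discretized attributes (rows x columns); invented numbers
counts = np.array([[30, 10],
                   [20, 40]])
total = counts.sum()

# Store only the marginal probabilities (the lower-dimensional combinations)
p_row = counts.sum(axis=1) / total
p_col = counts.sum(axis=0) / total

# Independence log-linear model: log p_ij = log p_i + log p_j
estimated = np.outer(p_row, p_col)
print("Estimated cell probabilities:\n", estimated)
print("Observed cell probabilities:\n", counts / total)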

2. Non-Parametric

i. A non-parametric numerosity reduction technique does not assume any model. Non-parametric techniques give a more uniform reduction, irrespective of data size, but they may not achieve as high a volume of data reduction as the parametric techniques.
ii. Common non-parametric data reduction techniques are histograms, clustering, sampling, data cube aggregation, and data compression.
o Histogram:
o A histogram is used to summarize discrete or continuous data. In
other words, it provides a visual interpretation of numerical data
by showing the number of data points that fall within a specified
range of values (called “bins”). It is similar to a vertical bar graph.
However, a histogram, unlike a vertical bar graph, shows no gaps
between the bars.

Parts of a Histogram

1. The title: The title describes the information included in the histogram.
2. X-axis: The X-axis shows the intervals into which the scale of measured values is divided.
3. Y-axis: The Y-axis shows the number of times that values occurred within the intervals set by the X-axis.
4. The bars: The height of the bar shows the number of times that the
values occurred within the interval, while the width of the bar shows the
interval that is covered. For a histogram with equal bins, the width
should be the same across all bars.

Note: The bar graph is the graphical representation of categorical


data. A histogram is the graphical representation of quantitative data.

Importance of a Histogram

i. Creating a histogram provides a visual representation of data


distribution. Histograms can display a large amount of data and
the frequency of the data values.


Distributions of a Histogram

A normal distribution: In a normal distribution, points on one side of the average are as likely to occur as on the other side of the average.
A bimodal distribution: In a bimodal distribution, there are two peaks; the data should be separated and analyzed as separate normal distributions.

 Uncle Bruno owns a garden with 30 black cherry trees. Each tree is of a
different height. The height of the trees (in inches): 61, 63, 64, 66, 68,
69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5, 76, 76.2, 76.5, 77, 77.5, 78,
78.5, 79, 79.2, 80, 81, 82, 83, 84, 85, 87. We can group the data as
follows in a frequency distribution table by setting a range:
Height Range (inches)    Number of Trees (Frequency)

61 - 65                  3

66 - 70                  3

71 - 75                  8

76 - 80                  10

81 - 85                  5

86 - 90                  1

This data can be now shown using a histogram. We need to make sure that
while plotting a histogram, there shouldn’t be any gaps between the bars.
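
For instance, the heights above can be plotted with matplotlib; the bin edges are chosen to match the ranges in the frequency table, and there are no gaps between the bars.

import matplotlib.pyplot as plt

heights = [61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5,
           76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83,
           84, 85, 87]

# Bin edges matching the ranges 61-65, 66-70, ..., 86-90
bins = [60.5, 65.5, 70.5, 75.5, 80.5, 85.5, 90.5]

plt.hist(heights, bins=bins, edgecolor="black")
plt.title("Heights of black cherry trees")
plt.xlabel("Height (inches)")
plt.ylabel("Number of trees")
plt.show()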

2. Clustering:
 Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other and dissimilar to the data points in other groups. It is basically a grouping of objects on the basis of the similarity and dissimilarity between them.
 For example, data points that lie close together in a scatter plot can be treated as a single group; in such a plot the clusters can often be distinguished by eye, e.g. three clearly separated clusters. A brief code sketch follows below.
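
As a sketch of clustering used for numerosity reduction, the example below groups synthetic 2-D points into three clusters with scikit-learn's KMeans; the three cluster centres (plus per-point labels) can then represent the full set of points. The data is generated purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated groups of 2-D points
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
# 150 points are summarized by 3 cluster centres
print("Cluster centres:\n", kmeans.cluster_centers_)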
Applications of Clustering in different fields
 Marketing: It can be used to characterize & discover customer
segments for marketing purposes.
 Biology: It can be used for classification among different species
of plants and animals.
 Libraries: It is used in clustering different books on the basis of
topics and information.
 Insurance: It is used to group customers and their policies and to identify fraud.

3. Sampling
o Sampling: One of the methods used for data reduction is
sampling, as it can reduce the large data set into a much smaller
data sample. Below we will discuss the different methods in which
we can sample a large data set D containing N tuples:
a. Simple random sample without replacement (SRSWOR) of size s: Here s tuples are drawn from the N tuples of data set D (s < N). The probability of drawing any tuple from data set D is 1/N, which means all tuples have an equal probability of being sampled.
b. Simple random sample with replacement (SRSWR) of size
s: It is similar to the SRSWOR, but the tuple is drawn from
data set D, is recorded, and then replaced into the data set D
so that it can be drawn again.

c. Cluster sample: The tuples in data set D are grouped into M mutually disjoint clusters. Data reduction can then be applied by taking a simple random sample (SRSWOR) of s of these clusters, where s < M.
d. Stratified sample: The large data set D is partitioned into
mutually disjoint sets called 'strata'. A simple random
sample is taken from each stratum to get stratified data.
This method is effective for skewed data.
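
The sketch below shows how these sampling schemes can be expressed with pandas/NumPy on a synthetic data set; the column names and sample sizes are illustrative assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
D = pd.DataFrame({
    "value": rng.normal(size=1000),
    "stratum": rng.choice(["A", "B", "C"], size=1000, p=[0.7, 0.2, 0.1]),
})

s = 100
srswor = D.sample(n=s, replace=False, random_state=0)   # SRSWOR of size s
srswr = D.sample(n=s, replace=True, random_state=0)     # SRSWR of size s

# Stratified sample: a simple random sample of 10% from each stratum
stratified = D.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0))

print(len(srswor), len(srswr), len(stratified))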

4. Data Cube Aggregation

 This technique is used to aggregate data in a simpler form. Data Cube


Aggregation is a multidimensional aggregation that uses aggregation at
various levels of a data cube to represent the original data set, thus
achieving data reduction.
 For example, suppose you have the data of All Electronics sales per
quarter for the year 2018 to the year 2022. If you want to get the annual
sale per year, you just have to aggregate the sales per quarter for each
year. In this way, aggregation provides you with the required data, which
is much smaller in size, and thereby we achieve data reduction even
without losing any data.
Data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. The data cube stores precomputed and summarized data, which gives data mining fast access to it.
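
A minimal pandas sketch of this roll-up from the quarter level to the year level is shown below; the table and figures are made up for illustration.

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 310, 402, 390, 620],
})

# Aggregate quarterly sales up to annual sales: one row per year instead of four
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)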
