U5@-Data Reduction

The document provides an overview of data reduction strategies, including techniques such as Principal Components Analysis (PCA), Attribute Subset Selection, and various methods for numerosity reduction. It discusses the importance of maintaining data integrity while reducing volume for improved efficiency in data mining processes. Additionally, it covers data visualization techniques and the applications of these data reduction methods in fields like image processing and finance.

UNIT - V
Data Reduction: Overview of Data Reduction Strategies, Wavelet Transforms, Principal Components Analysis, Attribute Subset Selection, Regression and Log-Linear Models: Parametric Data Reduction, Histograms, Clustering, Sampling, Data Cube Aggregation.
Data Visualization: Pixel-Oriented Visualization Techniques, Geometric Projection Visualization Techniques, Icon-Based Visualization Techniques, Hierarchical Visualization Techniques, Visualizing Complex Data and Relations.

Data Reduction: Overview of Data Reduction Strategies

 Data reduction is a process that reduces the volume of the original data and represents it in a much smaller form.
 Data reduction techniques are used to obtain a reduced representation of the dataset that is much smaller in volume while maintaining the integrity of the original data.
 By reducing the data, the efficiency of the data mining process is improved, and the analysis still produces the same (or almost the same) analytical results.
 Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms.
 The reduction of the data may be in terms of the number of rows (records) or the number of columns (dimensions).

1. Dimensionality Reduction

 Dimensionality reduction eliminates attributes from the data set under consideration, thereby reducing the volume of the original data.
 It reduces the data size by eliminating irrelevant or redundant features. Common dimensionality reduction methods include wavelet transforms, principal component analysis, and attribute subset selection; the latter two are described below.

Principal Component Analysis


 Principal Component Analysis is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning. It is a statistical process that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components.
 It is a popular tool for exploratory data analysis and predictive modeling. It draws out the strong patterns in a dataset by keeping the directions of highest variance and discarding the rest.
 PCA works by considering the variance of each attribute, because high variance indicates a good separation between observations, and it reduces the dimensionality by keeping only the directions that carry most of that variance. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing power allocation in communication channels. It is a feature extraction technique, so it retains the important derived variables and drops the least important ones.
 The PCA algorithm is based on mathematical concepts such as:

o Variance and Covariance


o Eigenvalues and Eigenvectors

Some common terms used in PCA algorithm:

o Dimensionality: The number of features or variables present in the given dataset; more simply, the number of columns in the dataset.
o Correlation: It signifies how strongly two variables are related to each other, i.e., if one changes, the other variable also changes. The correlation value ranges from -1 to +1: -1 occurs when the variables are inversely proportional to each other, and +1 indicates that the variables are directly proportional to each other.
o Orthogonal: It means that the variables are not correlated with each other, so the correlation between a pair of variables is zero.
o Eigenvectors: For a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariances between each pair of variables. (Covariance measures the direction of the relationship between two variables: a positive covariance means that both variables tend to be high or low at the same time, while a negative covariance means that when one variable is high, the other tends to be low.)
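
To make these terms concrete, here is a minimal NumPy sketch (the toy numbers are made up for illustration) that computes the covariance matrix of a small two-attribute dataset and then its eigenvalues and eigenvectors:

import numpy as np

# Toy dataset: 5 observations (rows) of two correlated attributes (columns)
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Covariance matrix of the attributes
cov = np.cov(X, rowvar=False)
print("Covariance matrix:\n", cov)

# Eigenvalues and eigenvectors of the symmetric covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)
print("Eigenvalues:", eig_vals)
print("Eigenvectors (as columns):\n", eig_vecs)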

Principal Components in PCA

 As described above, the transformed new features or the


output of PCA are the Principal Components.
 The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:

o Each principal component must be a linear combination of the original features. (A linear combination is an expression constructed from a set of terms by multiplying each term by a constant and adding the results, e.g. a linear combination of x and y is any expression of the form ax + by, where a and b are constants.)
o These components are orthogonal, i.e., the correlation between
a pair of variables is zero.

Steps for PCA algorithm


 1. Getting the dataset
Firstly, we take the input dataset and divide it into two subparts X and Y, where X is the training set and Y is the validation set.
 2. Representing the data in a structure
Next, we represent the dataset in a structure: a two-dimensional matrix of the independent variables X, where each row corresponds to a data item and each column corresponds to a feature. The number of columns is the dimensionality of the dataset.
 3. Standardizing the data
In this step, we standardize the dataset. Otherwise, in a particular column, features with high variance would dominate features with lower variance; if the importance of a feature is independent of its variance, we divide each data item in a column by the standard deviation of that column. We name the resulting matrix Z.
 4. Calculating the covariance of Z
To calculate the covariance of Z, we take the matrix Z and transpose it. After transposing, we multiply it by Z. The output matrix is the covariance matrix of Z.

 5. Calculating the eigenvalues and eigenvectors
Now we calculate the eigenvalues and eigenvectors of the resulting covariance matrix of Z. The eigenvectors of the covariance matrix are the directions of the axes that carry the most information, and the corresponding eigenvalues measure how much variance lies along each of those directions.

 6. Sorting the eigenvectors
In this step, we take all the eigenvalues and sort them in decreasing order, from largest to smallest, and simultaneously sort the eigenvectors accordingly into a matrix P of eigenvectors. The resulting matrix is named P*.

 7. Calculating the new features, or principal components
Here we calculate the new features. To do this, we multiply the standardized matrix Z by P*. In the resulting matrix Z*, each observation is a linear combination of the original features, and the columns of Z* are independent of each other.

 8. Removing less important features from the new dataset
Once the new feature set is obtained, we decide what to keep and what to remove: we keep only the relevant or important features (the components with the largest eigenvalues) in the new dataset, and the unimportant features are removed.
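
A compact NumPy sketch of the above steps is shown below. It is only an illustration of the procedure, not a production implementation; the names Z, P_star and Z_star simply mirror the notation used in the steps.

import numpy as np

def pca(X, n_components):
    # Steps 1-2: X is a 2-D array, rows = records, columns = features
    # Step 3: standardize each column (zero mean, unit standard deviation)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 4: covariance matrix of the standardized data
    cov = np.cov(Z, rowvar=False)
    # Step 5: eigenvalues and eigenvectors of the covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # Step 6: sort eigenvectors by decreasing eigenvalue into P*
    order = np.argsort(eig_vals)[::-1]
    P_star = eig_vecs[:, order]
    # Steps 7-8: project Z onto the top n_components eigenvectors to get Z*
    Z_star = Z @ P_star[:, :n_components]
    return Z_star

# Example: reduce 4 features to 2 principal components
X = np.random.default_rng(0).normal(size=(100, 4))
print(pca(X, 2).shape)   # (100, 2)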

Applications of Principal Component Analysis


o PCA is mainly used as a dimensionality reduction technique in various AI applications such as computer vision, image compression, etc.
o It can also be used for finding hidden patterns when the data has high dimensionality. Some fields where PCA is used are finance, data mining, psychology, etc.

Applications:
1. Image compression: a process applied to a graphics file to minimize its size in bytes without degrading image quality below an acceptable threshold.
2. Face recognition: the main idea of using PCA for face recognition is to express the large 1-D vector of pixels constructed from a 2-D facial image as a compact set of principal components in the feature space.
3. Medical data: PCA has been used on medical data to show the correlation of cholesterol with low-density lipoprotein.

Attribute Subset Selection
 Attribute subset selection is a technique used for data reduction in the data mining process. Data reduction reduces the size of the data so that it can be used for analysis more efficiently.

 Need for Attribute Subset Selection

The data set may have a large number of attributes, but some of those attributes can be irrelevant or redundant. The goal of attribute subset selection is to find a minimum set of attributes such that dropping the irrelevant attributes does not noticeably affect the utility of the data, while the cost of data analysis is reduced. Mining on a reduced data set also makes the discovered patterns easier to understand.
Process of Attribute Subset Selection
The brute-force approach, in which every subset of the n attributes (2^n possible subsets) is analyzed, can be very expensive.

A practical alternative is to use tests of statistical significance so that the best (or worst) attributes can be recognized. Such tests assume that the attributes are independent of one another.
 This is a greedy approach: a significance level is chosen (a commonly used value is 5%), and models are tested repeatedly until the p-value (probability value) of every retained attribute is less than or equal to the selected significance level. Attributes whose p-value is higher than the significance level are discarded.
 The procedure is repeated until every attribute remaining in the data set has a p-value less than or equal to the significance level. This gives us a reduced data set with no irrelevant attributes.
Methods of Attribute Subset Selection-
1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction
All the above methods are greedy approaches for attribute subset
selection.

1. Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attribute (the one with the minimum p-value) is chosen and added to the minimal set; in each iteration, one more attribute is added to the reduced set.
2. Stepwise Backward Elimination: Here all the attributes are considered in the initial set. In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set.
3. Combination of Forward Selection and Backward Elimination: Stepwise forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most commonly used technique for attribute selection.
4. Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flow-chart-like structure with internal nodes denoting tests on attributes, branches corresponding to the outcomes of the tests, and leaf nodes denoting a class prediction. The attributes that do not appear in the tree are considered irrelevant and are discarded.
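
The sketch below illustrates stepwise forward selection with p-values, assuming a numeric pandas DataFrame X and a target series y; it uses statsmodels to obtain the p-values, and the helper name forward_select and the 5% level are illustrative choices, not part of the original text.

import pandas as pd
import statsmodels.api as sm

def forward_select(X, y, alpha=0.05):
    """Greedily add the attribute with the smallest p-value while it stays <= alpha."""
    selected = []
    remaining = list(X.columns)
    while remaining:
        pvals = {}
        for col in remaining:
            # Fit a model on the already-selected attributes plus one candidate
            model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha:      # no remaining attribute is significant
            break
        selected.append(best)
        remaining.remove(best)
    return selected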

Numerosity Reduction:

Numerosity reduction is a data reduction technique that replaces the original data with a smaller form of data representation. There are two families of techniques for numerosity reduction:
Parametric and Non-Parametric methods.
Parametric Methods –
 For parametric methods, data is represented using some model. (Parametric modeling means creating a model from some known facts about a population; these "facts" are called parameters.)
 The model is used to estimate the data, so that only the parameters of the model need to be stored instead of the actual data. Regression and log-linear methods are used for creating such models.
 Regression:
o Terms:
o Dependent Variable: The main factor in Regression analysis
which we want to predict or understand is called the
dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also known as predictors.

 Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
 More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
Regression can be simple linear regression or multiple linear regression. When there is only a single independent attribute, the model is called simple linear regression; if there are multiple independent attributes, it is called multiple linear regression.

 In multiple linear regression, y is modeled as a linear function of two or more predictor (independent) variables; a short code sketch is given below, after the log-linear model.

 Log-Linear Model:
Log-linear model can be used to estimate the probability of
each data point in a multidimensional space for a set of
discretized attributes, based on a smaller subset of
dimensional combinations.

 This allows a higher-dimensional data space to be constructed from lower-dimensional attributes.
Regression and log-linear models can both be used on sparse data, although their application may be limited.
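
As a sketch of the regression case, the example below fits a multiple linear regression with scikit-learn on a synthetic dataset; after fitting, only the coefficients and intercept need to be stored in place of the raw records. The data and numbers are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))                              # two predictor attributes
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

model = LinearRegression().fit(X, y)
# 1000 records are summarized by just three parameters:
print("coefficients:", model.coef_, "intercept:", model.intercept_)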
Types of Numerosity Reduction

These methods use alternative, smaller forms of data representation, thus reducing the data volume. There are two types of numerosity reduction:


1. Parametric
 This method assumes a model into which the data fits. Data model
parameters are estimated, and only those parameters are stored, and the
rest of the data is discarded. Regression and Log-Linear methods are
used for creating such models. For example, a regression model can be
used to achieve parametric reduction if the data fits the Linear
Regression model.

o Log-Linear Model: Log-linear model can be used to estimate the


probability of each data point in a multidimensional space for a set of
discretized attributes based on a smaller subset of dimensional
combinations.
o The Log-Linear model discovers the relationship between two or more
discrete attributes. Assume we have a set of tuples in n-dimensional
space; the log-linear model helps derive each tuple's probability in this n-
dimensional space.
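
A tiny sketch of the idea, assuming the simplest (independence) log-linear model on a two-way contingency table: the joint cell probabilities are estimated from the lower-dimensional marginal distributions, so only the marginals need to be kept.

import numpy as np

# Observed counts for two discretized attributes (rows x columns); invented numbers
counts = np.array([[30, 10],
                   [20, 40]])
total = counts.sum()

# Store only the marginal probabilities (the lower-dimensional combinations)
p_row = counts.sum(axis=1) / total
p_col = counts.sum(axis=0) / total

# Independence log-linear model: log p_ij = log p_i + log p_j
estimated = np.outer(p_row, p_col)
print("Estimated cell probabilities:\n", estimated)
print("Observed cell probabilities:\n", counts / total)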

2. Non-Parametric

i. A non-parametric numerosity reduction technique does not assume any model. Non-parametric techniques give a more uniform reduction, irrespective of data size, but they may not achieve as high a volume of data reduction as the parametric techniques.
ii. Common non-parametric data reduction techniques are histograms, clustering, sampling, data cube aggregation, and data compression.
o Histogram:
o A histogram is used to summarize discrete or continuous data. In
other words, it provides a visual interpretation of numerical data
by showing the number of data points that fall within a specified
range of values (called “bins”). It is similar to a vertical bar graph.
However, a histogram, unlike a vertical bar graph, shows no gaps
between the bars.

Parts of a Histogram

1. The title: The title describes the information included in the histogram.
2. X-axis: The X-axis shows the intervals into which the scale of measured values is divided.
3. Y-axis: The Y-axis shows the number of times that values occurred within the intervals set by the X-axis.
4. The bars: The height of the bar shows the number of times that the
values occurred within the interval, while the width of the bar shows the
interval that is covered. For a histogram with equal bins, the width
should be the same across all bars.

Note: The bar graph is the graphical representation of categorical


data. A histogram is the graphical representation of quantitative data.

Importance of a Histogram

i. Creating a histogram provides a visual representation of data


distribution. Histograms can display a large amount of data and
the frequency of the data values.


Distributions of a Histogram

A normal distribution: In a normal distribution, points on one side of the average are as likely to occur as on the other side of the average.
A bimodal distribution: In a bimodal distribution, there are two peaks; the data should be separated and analyzed as separate normal distributions.

 Uncle Bruno owns a garden with 30 black cherry trees. Each tree is of a
different height. The height of the trees (in inches): 61, 63, 64, 66, 68,
69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5, 76, 76.2, 76.5, 77, 77.5, 78,
78.5, 79, 79.2, 80, 81, 82, 83, 84, 85, 87. We can group the data as
follows in a frequency distribution table by setting a range:
Height Range (inches)    Number of Trees (Frequency)

61 - 65                  3

66 - 70                  3

71 - 75                  8

76 - 80                  10

81 - 85                  5

86 - 90                  1

This data can be now shown using a histogram. We need to make sure that
while plotting a histogram, there shouldn’t be any gaps between the bars.
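
For instance, the heights above can be plotted with matplotlib; the bin edges are chosen to match the ranges in the frequency table, and there are no gaps between the bars.

import matplotlib.pyplot as plt

heights = [61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5,
           76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83,
           84, 85, 87]

# Bin edges matching the ranges 61-65, 66-70, ..., 86-90
bins = [60.5, 65.5, 70.5, 75.5, 80.5, 85.5, 90.5]

plt.hist(heights, bins=bins, edgecolor="black")
plt.title("Heights of black cherry trees")
plt.xlabel("Height (inches)")
plt.ylabel("Number of trees")
plt.show()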

2. Clustering:
 Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other and dissimilar to the data points in other groups. It is basically a grouping of objects on the basis of the similarity and dissimilarity between them.
 For example, data points that lie close together in a scatter plot can be treated as a single group; in such a plot the clusters can often be distinguished by eye, e.g. three clearly separated clusters. A brief code sketch follows below.
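
As a sketch of clustering used for numerosity reduction, the example below groups synthetic 2-D points into three clusters with scikit-learn's KMeans; the three cluster centres (plus per-point labels) can then represent the full set of points. The data is generated purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated groups of 2-D points
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
# 150 points are summarized by 3 cluster centres
print("Cluster centres:\n", kmeans.cluster_centers_)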
Applications of Clustering in different fields
 Marketing: It can be used to characterize & discover customer
segments for marketing purposes.
 Biology: It can be used for classification among different species
of plants and animals.
 Libraries: It is used in clustering different books on the basis of
topics and information.
 Insurance: It is used to group customers and their policies and to identify fraud.

3. Sampling
o Sampling: One of the methods used for data reduction is
sampling, as it can reduce the large data set into a much smaller
data sample. Below we will discuss the different methods in which
we can sample a large data set D containing N tuples:
a. Simple random sample without replacement (SRSWOR) of size s: Here s tuples are drawn from the N tuples of data set D (s < N). The probability of drawing any tuple from data set D is 1/N, which means all tuples have an equal probability of being sampled.
b. Simple random sample with replacement (SRSWR) of size
s: It is similar to the SRSWOR, but the tuple is drawn from
data set D, is recorded, and then replaced into the data set D
so that it can be drawn again.

c. Cluster sample: The tuples in data set D are grouped into M mutually disjoint clusters. Data reduction can then be applied by taking a simple random sample (SRSWOR) of s of these clusters, where s < M.
d. Stratified sample: The large data set D is partitioned into
mutually disjoint sets called 'strata'. A simple random
sample is taken from each stratum to get stratified data.
This method is effective for skewed data.
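
The sketch below shows how these sampling schemes can be expressed with pandas/NumPy on a synthetic data set; the column names and sample sizes are illustrative assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
D = pd.DataFrame({
    "value": rng.normal(size=1000),
    "stratum": rng.choice(["A", "B", "C"], size=1000, p=[0.7, 0.2, 0.1]),
})

s = 100
srswor = D.sample(n=s, replace=False, random_state=0)   # SRSWOR of size s
srswr = D.sample(n=s, replace=True, random_state=0)     # SRSWR of size s

# Stratified sample: a simple random sample of 10% from each stratum
stratified = D.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0))

print(len(srswor), len(srswr), len(stratified))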

4. Data Cube Aggregation

 This technique is used to aggregate data in a simpler form. Data Cube


Aggregation is a multidimensional aggregation that uses aggregation at
various levels of a data cube to represent the original data set, thus
achieving data reduction.
 For example, suppose you have the data of All Electronics sales per
quarter for the year 2018 to the year 2022. If you want to get the annual
sale per year, you just have to aggregate the sales per quarter for each
year. In this way, aggregation provides you with the required data, which
is much smaller in size, and thereby we achieve data reduction even
without losing any data.
Data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. The data cube stores precomputed and summarized data, which gives data mining fast access to it.
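
A minimal pandas sketch of this roll-up from the quarter level to the year level is shown below; the table and figures are made up for illustration.

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 310, 402, 390, 620],
})

# Aggregate quarterly sales up to annual sales: one row per year instead of four
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)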
