U5 - Data Reduction
Data Reduction Strategies
1. Dimensionality Reduction
Dimensionality reduction reduces the number of attributes (dimensions) under consideration so that the reduced representation still conveys most of the information in the original data.
Applications:
1. Image compression is a process applied to a graphics file to minimize its size in bytes without degrading image quality below an acceptable threshold.
2. The main idea of using PCA for face recognition is to express the large 1-D vector of pixels constructed from a 2-D facial image as a compact set of principal components of the feature space (a short sketch follows this list).
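As a rough illustration, the sketch below (Python with NumPy and scikit-learn assumed available; the image size, number of images, and component count are made-up values, and random data stands in for real face images) shows how PCA compresses flattened image vectors into a small number of principal components:

# Sketch: PCA reduces flattened "face" vectors to a few principal components.
# 64x64 images, 200 samples, and 50 components are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
faces = rng.random((200, 64 * 64))   # 200 fake images, each flattened to a 1-D vector

pca = PCA(n_components=50)           # keep 50 principal components
compact = pca.fit_transform(faces)   # shape (200, 50): the reduced representation

# Approximate reconstruction of the original data from the stored components
approx = pca.inverse_transform(compact)
print(compact.shape, approx.shape)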
2. Numerosity Reduction
Numerosity reduction replaces the original data volume with a smaller form of data representation. There are two techniques for numerosity reduction: Parametric and Non-Parametric methods.
Parametric Methods –
In parametric methods (parametric modeling is creating a model from some known facts about a population; these “facts” are called parameters), the data is represented using a model. The model is used to estimate the data, so that only the parameters of the model need to be stored instead of the actual data. Regression and Log-Linear methods are used for creating such models.
Regression:
o Terms:
o Dependent Variable: The main factor in regression analysis which we want to predict or understand is called the dependent variable. It is also called the target variable.
o Independent Variable: The factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also called predictors.
Log-Linear Model:
A log-linear model can be used to estimate the probability of each data point in a multidimensional space, for a set of discretized attributes, based on a smaller subset of dimensional combinations.
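A minimal sketch of the simplest log-linear model, the independence model, is given below (Python with NumPy assumed; the contingency table is made up). It estimates each joint cell probability from lower-dimensional marginals, so only the marginals need to be stored:

# Sketch: independence log-linear model for two discretized attributes.
# log p(a, b) = log p(a) + log p(b), so the joint is estimated from marginals.
import numpy as np

counts = np.array([[30, 10, 5],      # contingency table: rows = attribute A,
                   [20, 25, 10]])    # columns = attribute B (illustrative data)

p_joint = counts / counts.sum()      # observed joint probabilities
p_a = p_joint.sum(axis=1)            # marginal distribution of A
p_b = p_joint.sum(axis=0)            # marginal distribution of B

p_est = np.outer(p_a, p_b)           # estimated joint under the independence model
print(np.round(p_est, 3))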
1. Parametric
This method assumes a model into which the data fits. The model parameters are estimated, only those parameters are stored, and the rest of the data is discarded. Regression and Log-Linear methods are used for creating such models. For example, a regression model can be used to achieve parametric reduction if the data fits a linear regression model (see the sketch below).
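A minimal sketch of this idea, assuming Python with NumPy and synthetic data, is shown below: only the two fitted parameters are kept, and the data values are re-estimated from the model when needed.

# Sketch: parametric reduction with linear regression. If the data fits a
# line y = w*x + b well, only (w, b) need to be stored; the original points
# can be discarded and estimated from the model on demand. Data is synthetic.
import numpy as np

x = np.arange(100, dtype=float)
y = 3.0 * x + 7.0 + np.random.default_rng(1).normal(0, 2, size=100)

w, b = np.polyfit(x, y, deg=1)   # fit the line and keep only two numbers
y_est = w * x + b                # reconstruct (approximate) values when needed
print(round(w, 2), round(b, 2))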
Non-Parametric
These methods do not assume any model; instead, they store a reduced representation of the data using techniques such as histograms, clustering, and sampling.
1. Histogram:
A histogram partitions the data of an attribute into disjoint buckets (bins) and stores the count (frequency) of values in each bucket.
Parts of a Histogram
1. The title: The title describes the information included in the histogram.
2. X-axis: The X-axis shows the intervals of values into which the measurements fall.
3. Y-axis: The Y-axis shows the number of times that the values occurred
within the intervals set by the X-axis.
4. The bars: The height of the bar shows the number of times that the
values occurred within the interval, while the width of the bar shows the
interval that is covered. For a histogram with equal bins, the width
should be the same across all bars.
Importance of a Histogram
A histogram summarizes a large amount of data graphically and makes the shape of its distribution easy to see.
Distributions of a Histogram
A histogram can reveal the shape of the data's distribution, for example a normal distribution.
Uncle Bruno owns a garden with 30 black cherry trees. Each tree is of a
different height. The height of the trees (in inches): 61, 63, 64, 66, 68,
69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5, 76, 76.2, 76.5, 77, 77.5, 78,
78.5, 79, 79.2, 80, 81, 82, 83, 84, 85, 87. We can group the data as
follows in a frequency distribution table by setting a range:
Height Range (in)    Number of Trees (Frequency)
60 - 65              3
66 - 70              3
71 - 75              8
76 - 80              10
81 - 85              5
86 - 90              1
This data can now be shown using a histogram. While plotting a histogram, we need to make sure that there are no gaps between the bars.
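The sketch below (Python with matplotlib assumed available) plots the tree heights from the example using equal 5-inch bins that match the frequency table; plt.hist draws the bars with no gaps between them:

# Sketch: histogram of the tree-height example with equal 5-inch bins.
import matplotlib.pyplot as plt

heights = [61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5,
           76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83, 84, 85, 87]
edges = [60.5, 65.5, 70.5, 75.5, 80.5, 85.5, 90.5]  # equal-width bins covering the table ranges

plt.hist(heights, bins=edges, edgecolor="black")
plt.title("Heights of Black Cherry Trees")
plt.xlabel("Height (in)")
plt.ylabel("Number of Trees")
plt.show()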
2. Clustering:
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. It is basically a grouping of objects on the basis of similarity and dissimilarity between them.
For example, data points that lie close together in a scatter plot can be classified into one single group; in such a plot we might distinguish three separate clusters (a small clustering sketch follows below).
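A small clustering sketch, assuming Python with scikit-learn and synthetic 2-D points, is shown below; it groups the points into three clusters and exposes the cluster centroids, which can stand in for the original points in a reduced representation:

# Sketch: k-means clustering of synthetic 2-D points into three groups.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

points, _ = make_blobs(n_samples=150, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(points)          # cluster label for each point

# For data reduction, each point can be represented by its cluster centroid
# instead of its actual coordinates.
print(kmeans.cluster_centers_)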
Applications of Clustering in different fields
Marketing: It can be used to characterize & discover customer
segments for marketing purposes.
Biology: It can be used for classification among different species
of plants and animals.
Libraries: It is used in clustering different books on the basis of
topics and information.
Insurance: It is used to group customers according to their policies and to identify fraud.
3. Sampling
o Sampling: One of the methods used for data reduction is sampling, as it can reduce a large data set to a much smaller data sample. Below we discuss the different methods in which we can sample a large data set D containing N tuples:
a. Simple random sample without replacement (SRSWOR) of size s: Here s tuples (s < N) are drawn from the N tuples in data set D, and a tuple cannot be drawn more than once. The probability of drawing any tuple from the data set D is 1/N, so all tuples have an equal probability of being sampled.
b. Simple random sample with replacement (SRSWR) of size s: It is similar to SRSWOR, except that each tuple drawn from data set D is recorded and then placed back into D, so that it can be drawn again (a brief sketch follows this list).
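A minimal sketch of both sampling schemes, assuming Python's standard random module and a toy data set, is given below:

# Sketch: SRSWOR and SRSWR on a toy data set D of N tuples.
# N = 100 and s = 10 are illustrative values.
import random

D = list(range(1, 101))   # data set D with N = 100 tuples
s = 10                    # desired sample size, s < N

srswor = random.sample(D, s)     # without replacement: no tuple appears twice
srswr = random.choices(D, k=s)   # with replacement: a tuple may appear more than once

print(srswor)
print(srswr)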