
UNIT-5

Introduction to Dimensionality Reduction


Technique
What is Dimensionality Reduction?
The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction.

In many cases a dataset contains a huge number of input features, which makes the
predictive modeling task more complicated. Since it is very difficult to visualize or make
predictions for a training dataset with a high number of features, dimensionality reduction
techniques are required in such cases.

A dimensionality reduction technique can be defined as "a way of converting a
higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides
similar information." These techniques are widely used in machine learning for
obtaining a better-fitting predictive model when solving classification and regression
problems.

It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
The Curse of Dimensionality
Handling high-dimensional data is very difficult in practice; this is commonly known as
the curse of dimensionality. As the dimensionality of the input dataset increases, any
machine learning model becomes more complex. As the number of features
increases, the number of samples needed to generalize well also increases, and the chance of
overfitting increases. A machine learning model trained on high-dimensional
data is therefore prone to overfitting, which results in poor performance.

Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.

Benefits of applying Dimensionality Reduction


Some benefits of applying dimensionality reduction technique to the given dataset are
given below:

o By reducing the dimensions of the features, the space required to store the
dataset is also reduced.
o Less computation and training time is required with a reduced set of features.
o Reduced feature dimensions of the dataset help in visualizing the data
quickly.
o It removes redundant features (if present) by taking care of multicollinearity.

Disadvantages of dimensionality Reduction


There are also some disadvantages of applying the dimensionality reduction, which are
given below:

o Some data may be lost due to dimensionality reduction.


o In the PCA dimensionality reduction technique, the number of principal
components to retain is sometimes not known in advance.

Approaches of Dimension Reduction


There are two ways to apply the dimension reduction technique, which are given below:

Feature Selection
Feature selection is the process of selecting the subset of the relevant features and
leaving out the irrelevant features present in a dataset to build a model of high accuracy.
In other words, it is a way of selecting the optimal features from the input dataset.

Three methods are used for feature selection:

1. Filter Methods

In this method, the dataset is filtered, and a subset that contains only the relevant features
is taken. Some common techniques of the filter method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Wrapper Methods

The wrapper method has the same goal as the filter method, but it uses a machine
learning model for its evaluation. In this method, some features are fed to the ML model
and its performance is evaluated. The performance decides whether to add or remove those
features in order to increase the accuracy of the model. This method is more accurate than the
filter method but more complex to work with. Some common techniques of wrapper methods are:

o Forward Selection
o Backward Selection
o Bi-directional Elimination
3. Embedded Methods: Embedded methods perform feature selection as part of the training of the
machine learning model itself, evaluating the importance of each feature across the training
iterations (see the sketch after this list). Some common techniques of embedded methods are:

o LASSO
o Elastic Net
o Ridge Regression, etc.
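As an illustration of an embedded method, the following is a minimal sketch using scikit-learn's Lasso together with SelectFromModel; the synthetic dataset and the alpha value are assumptions chosen only for demonstration.

```python
# Embedded feature selection sketch: LASSO zeroes out the weights of
# unhelpful features, and SelectFromModel keeps only the non-zero ones.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 100 samples, 10 features, only 3 of them informative.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)

lasso = Lasso(alpha=1.0)                     # alpha chosen arbitrarily for the sketch
selector = SelectFromModel(lasso).fit(X, y)  # fits Lasso and keeps features with non-zero weight

print("Selected feature indices:", np.where(selector.get_support())[0])
X_reduced = selector.transform(X)            # dataset with only the selected features
print("Reduced shape:", X_reduced.shape)
```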

Feature Extraction:
Feature extraction is the process of transforming a space with many dimensions
into a space with fewer dimensions. This approach is useful when we want to retain the whole
of the information but use fewer resources while processing it.

Some common feature extraction techniques are:

1. Principal Component Analysis


2. Linear Discriminant Analysis
3. Kernel PCA
4. Quadratic Discriminant Analysis

Common techniques of Dimensionality Reduction


1. Principal Component Analysis
2. Backward Elimination
3. Forward Selection
4. Score comparison
5. Missing Value Ratio
6. Low Variance Filter
7. High Correlation Filter
8. Random Forest
9. Factor Analysis
10. Auto-Encoder
Principal Component Analysis
Principal component analysis (PCA) is a dimensionality reduction and machine learning
method used to simplify a large data set into a smaller set while still maintaining significant
patterns and trends.
Principal component analysis can be broken down into five steps. I’ll go through each step,
providing logical explanations of what PCA is doing and simplifying mathematical concepts
such as standardization, covariance, eigenvectors and eigenvalues without focusing on how to
compute them.

How Do You Do a Principal Component Analysis?

1. Standardize the range of continuous initial variables


2. Compute the covariance matrix to identify correlations
3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
principal components
4. Create a feature vector to decide which principal components to keep
5. Recast the data along the principal components axes
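Before going through these steps in detail, the whole pipeline can be sketched in a few lines with scikit-learn; the Iris dataset and the choice of two components here are assumptions used only for illustration.

```python
# High-level PCA sketch: standardize the features, then keep the top 2 components.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                          # 150 samples x 4 features

X_std = StandardScaler().fit_transform(X)     # Step 1: standardization
pca = PCA(n_components=2)                     # Steps 2-4 are handled internally
X_pca = pca.fit_transform(X_std)              # Step 5: recast data onto the PC axes

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Projected shape:", X_pca.shape)        # (150, 2)
```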

What Is Principal Component Analysis?

Principal component analysis, or PCA, is a dimensionality reduction method that is often


used to reduce the dimensionality of large data sets, by transforming a large set of variables
into a smaller one that still contains most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of accuracy,
but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Smaller
data sets are easier to explore and visualize, and they make analyzing data points
much easier and faster for machine learning algorithms, because there are no extraneous
variables to process.

So, to sum up, the idea of PCA is simple: reduce the number of variables of a data set,
while preserving as much information as possible.

What Are Principal Components?

Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables. These combinations are done in such a way that the new
variables (i.e., principal components) are uncorrelated and most of the information within the
initial variables is squeezed or compressed into the first components. So, the idea is that 10-
dimensional data gives you 10 principal components, but PCA tries to put the maximum possible
information in the first component, then the maximum remaining information in the second, and
so on, until the later components carry progressively less information (this is the pattern that a
scree plot of the component variances shows).
Organizing information in principal components this way will allow you to reduce
dimensionality without losing much information, and this by discarding the components with
low information and considering the remaining components as your new variables.

An important thing to realize here is that the principal components are less interpretable and
don’t have any real meaning since they are constructed as linear combinations of the initial
variables.

Geometrically speaking, principal components represent the directions of the data that
explain a maximal amount of variance, that is to say, the lines that capture most
information of the data. The relationship between variance and information here, is that, the
larger the variance carried by a line, the larger the dispersion of the data points along it, and
the larger the dispersion along a line, the more information it has. To put all this simply, just
think of principal components as new axes that provide the best angle to see and evaluate the
data, so that the differences between the observations are better visible.

How PCA Constructs the Principal Components

As there are as many principal components as there are variables in the data, principal
components are constructed in such a manner that the first principal component accounts for
the largest possible variance in the data set. For example, let’s assume that the scatter plot of
our data set is as shown below. Can we guess the first principal component? Yes, it's
approximately the line that matches the purple marks because it goes through the origin and
it’s the line in which the projection of the points (red dots) is the most spread out. Or
mathematically speaking, it’s the line that maximizes the variance (the average of the squared
distances from the projected points (red dots) to the origin).
The second principal component is calculated in the same way, with the condition that it is
uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for
the next highest variance.

This continues until a total of p principal components have been calculated, equal to the
original number of variables.

Step-by-Step Explanation of PCA

Step 1: Standardization

The aim of this step is to standardize the range of the continuous initial variables so that each
one of them contributes equally to the analysis.

More specifically, the reason why it is critical to perform standardization prior to PCA, is that
the latter is quite sensitive regarding the variances of the initial variables. That is, if there are
large differences between the ranges of initial variables, those variables with larger ranges
will dominate over those with small ranges (for example, a variable that ranges between 0
and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased
results. So, transforming the data to comparable scales can prevent this problem.

Mathematically, this can be done by subtracting the mean and dividing by the standard
deviation for each value of each variable.

Once the standardization is done, all the variables will be transformed to the same scale.
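Concretely, each value is replaced by its z-score,

z = (value − mean) / standard deviation,

so that every standardized variable has mean 0 and standard deviation 1.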

Step 2: Covariance Matrix Computation

The aim of this step is to understand how the variables of the input data set are varying from
the mean with respect to each other, or in other words, to see if there is any relationship
between them. Because sometimes, variables are highly correlated in such a way that they
contain redundant information. So, in order to identify these correlations, we compute
the covariance matrix.

The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions)


that has as entries the covariances associated with all possible pairs of the initial variables.
For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is
a 3×3 matrix of this form (Covariance Matrix for 3-Dimensional Data):

Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)


Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main
diagonal (Top left to bottom right) we actually have the variances of each initial variable.
And since the covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance
matrix are symmetric with respect to the main diagonal, which means that the upper and the
lower triangular portions are equal.
What do the covariances that we have as entries of the matrix tell us about the
correlations between the variables?

It’s actually the sign of the covariance that matters:

 If positive then: the two variables increase or decrease together (correlated)


 If negative then: one increases when the other decreases (Inversely correlated)

Now that we know that the covariance matrix is no more than a table that summarizes the
correlations between all the possible pairs of variables, let's move to the next step.

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify
the principal components

Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from
the covariance matrix in order to determine the principal components of the data.

What you first need to know about eigenvectors and eigenvalues is that they always come in
pairs, so that every eigenvector has an eigenvalue. Also, their number is equal to the number
of dimensions of the data. For example, for a 3-dimensional data set, there are 3 variables,
therefore there are 3 eigenvectors with 3 corresponding eigenvalues.

It is the eigenvectors and eigenvalues that are behind all the magic of principal components,
because the eigenvectors of the covariance matrix are actually the directions of the axes
along which there is the most variance (most information), and these are what we call principal
components. The eigenvalues are simply the coefficients attached to the eigenvectors, and they
give the amount of variance carried by each principal component.

By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the
principal components in order of significance.

Principal Component Analysis Example:

Let’s suppose that our data set is 2-dimensional with 2 variables x,y and that the eigenvectors
and eigenvalues of the covariance matrix are as follows:

If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the
eigenvector that corresponds to the first principal component (PC1) is v1 and the one that
corresponds to the second principal component (PC2) is v2.

After having the principal components, to compute the percentage of variance (information)
accounted for by each component, we divide the eigenvalue of each component by the sum of
eigenvalues. If we apply this on the example above, we find that PC1 and PC2 carry
respectively 96 percent and 4 percent of the variance of the data.
Step 4: Create a Feature Vector

As we saw in the previous step, computing the eigenvectors and ordering them by their
eigenvalues in descending order allows us to find the principal components in order of
significance. In this step, what we do is choose whether to keep all these components or
discard those of lesser significance (those with low eigenvalues), and form, with the remaining ones, a
matrix of vectors that we call the feature vector.

So, the feature vector is simply a matrix that has as columns the eigenvectors of the
components that we decide to keep. This makes it the first step towards dimensionality
reduction, because if we choose to keep only k eigenvectors (components) out of the original p,
the final data set will have only k dimensions.

Principal Component Analysis Example:

Continuing with the example from the previous step, we can either form a feature vector with
both of the eigenvectors v1 and v2:

Or discard the eigenvector v2, which is the one of lesser significance, and form a feature
vector with v1 only:

Discarding the eigenvector v2 will reduce the dimensionality by 1 and will consequently cause a
loss of information in the final data set. But given that v2 was carrying only 4 percent of the
information, the loss will therefore not be important, and we will still have the 96 percent of the
information that is carried by v1.

So, as we saw in the example, it's up to you to choose whether to keep all the components or
discard the ones of lesser significance, depending on what you are looking for. If
you just want to describe your data in terms of new, uncorrelated variables (principal components)
without seeking to reduce dimensionality, there is no need to leave out the less significant
components.

Step 5: Recast the Data Along the Principal Components Axes

In the previous steps, apart from standardization, you do not make any changes to the data;
you just select the principal components and form the feature vector, but the input data set
always remains in terms of the original axes (i.e., in terms of the initial variables).

In this step, which is the last one, the aim is to use the feature vector formed using the
eigenvectors of the covariance matrix, to reorient the data from the original axes to the ones
represented by the principal components (hence the name Principal Components Analysis).
This can be done by multiplying the transpose of the original data set by the transpose of the
feature vector.
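A minimal NumPy sketch of all five steps, assuming a data matrix X with samples in rows and variables in columns and an arbitrary choice of k components, might look like this:

```python
# Step-by-step PCA sketch with NumPy (samples in rows, variables in columns).
import numpy as np

X = np.random.rand(100, 5)                      # toy data: 100 samples, 5 variables
k = 2                                           # number of components to keep (arbitrary)

# Step 1: standardization
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix (p x p)
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]               # sort by eigenvalue, highest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: feature vector = eigenvectors of the components we keep
feature_vector = eigvecs[:, :k]                 # p x k matrix

# Step 5: recast the data along the principal component axes
X_pca = X_std @ feature_vector                  # equivalently (feature_vector.T @ X_std.T).T
print("Explained variance %:", 100 * eigvals[:k] / eigvals.sum())
```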
Singular Value Decomposition (SVD)
Singular Value Decomposition, or SVD for short, is a mathematical technique used in machine
learning to make sense of huge and complicated data.

Let me explain SVD in layman's terms:

Imagine you have many different toys that you want to organize. Some are big, some are
small, some are red, some are blue, and so on. It can take some work to figure out how to group
them together!

But what if you could break down each toy into its most essential parts?

For example, you could take a big red ball and break it down into a big part, a red part, and a
round part.

That would make it much easier to compare and group together with other toys that have those
same basic parts.

It takes a big, complicated piece of data and breaks it into its most essential parts. Then we can
use those parts to find patterns and similarities in the data.

For example:

let's say you have many pictures of different animals. SVD could break down each picture into
its most essential parts, like lines and curves. Then we could use those parts to find patterns,
like which animals have similar shapes.

How to Use SVD in Machine Learning

Here are a few examples of the types of problems that SVD can help solve:

Dimensionality Reduction

One of the main applications of SVD is to reduce the dimensionality of a dataset. By finding
the basic patterns in the data and discarding the less important ones, SVD can help simplify the
data and make it easier to work with.

Data Compression

SVD can also compress large datasets without losing too much information. We can represent
the data using fewer features by keeping only the most important singular values and associated
singular vectors.
Matrix Approximation

Another application of SVD is to approximate a large, complex matrix using a smaller, simpler
one. This can be useful when working with large datasets that are difficult to handle directly.

Collaborative Filtering

SVD can be used to predict user preferences in recommender systems by modeling the
relationships between users and items in a large matrix.

Well, sometimes data can be massive and complicated, and it's hard to make sense of it all.
SVD helps us simplify the data and find the most essential parts to understand it better.

What is Singular Value Decomposition?

Singular Value Decomposition is a way to factor a matrix A into three matrices, as follows:

A = U * S * V^T

Where U and V are orthogonal matrices, and S is a diagonal matrix containing the
singular values of A.

Note:

 A matrix is considered an orthogonal matrix if the product of the matrix and its
transpose gives the identity matrix.
 A matrix is diagonal if it has non-zero elements only on the diagonal, running from the
upper left to the lower right corner of the matrix.

Here, U and V represent the left and right singular vectors of A, respectively,
and S represents the singular values of A.

The algorithm for computing the SVD of matrix A can be summarized in the following steps:

1. Compute the eigendecomposition of the symmetric matrix A^T A. This can be done using
any standard eigendecomposition algorithm.
2. Compute the singular values of A as the square root of the eigenvalues of A^T A. Sort
the singular values in descending order.
3. Compute the left and right singular vectors of A as follows:
1. For each singular value, find the corresponding eigenvector of A^T A.
2. Normalize each eigenvector to have a unit length.
3. The left singular vectors of A are the eigenvectors of A A^T corresponding to the
nonzero singular values of A.
4. The right singular vectors of A are the normalized eigenvectors of A^T A.

Assemble the SVD of A as follows:

 The diagonal entries of S are the singular values of A, sorted in descending order.
 The columns of U are the corresponding left singular vectors of A.
 The columns of V are the corresponding right singular vectors of A.

Once the SVD of matrix A has been computed, it can be used for various tasks in machine
learning, such as

 Dimensionality reduction,
 Data compression,
 Feature extraction.
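In practice the decomposition is usually obtained with a library routine rather than by carrying out the eigendecomposition steps by hand; a minimal NumPy sketch on an arbitrary matrix might look like this:

```python
# SVD sketch with NumPy: A = U * S * V^T
import numpy as np

A = np.random.rand(6, 4)                        # arbitrary 6x4 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)                                  # s is a 1-D array of singular values,
                                                # already sorted in descending order

A_rebuilt = U @ S @ Vt                          # reassembling U * S * V^T recovers A
print("Reconstruction error:", np.linalg.norm(A - A_rebuilt))   # ~0
print("Singular values:", s)
```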
APPLICATIONS OF SVD
Dimensionality Reduction

The first and most important application is to reduce the dimensionality of data; the SVD
is more or less the standard tool for this, and PCA is essentially the SVD applied to the
mean-centered data matrix. You may want to reduce the dimensionality of your data because:
a) You want to visualize your data in 2d or 3d
b) The algorithm you are going to use works better in the new dimensional space
c) Performance reasons: your algorithm is faster if you reduce dimensions.

In many machine learning problems, using the SVD before an ML algorithm helps, so it is
always worth a try.

Multi-Dimensional Scaling

The SVD is part of the algorithm to compute multidimensional-scaling.

Pseudo-Inverse

The SVD is also used to compute the pseudo-inverse of a matrix.

Image Compression

The SVD can be used to compress images, but there are some better algorithms of
course.

Example: consider this 127x350 picture as a 127x350 matrix.

We decompose it using the SVD, keeping the top 50 singular values, so we get a 127x50 "U"
matrix, the 50 singular values, and a 50x350 "V^T" matrix. Computing U*S*V^T we get:
And if we use 20 instead of 50 singular values:

While you wouldn't use the SVD to compress images because there are more efficient
algorithms (jpg) the example shows how good the SVD is to find a low-rank
approximation to a given matrix.
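A minimal NumPy sketch of this low-rank approximation, using a random matrix as a stand-in for the actual picture and k = 50 as in the example, might look like this:

```python
# Low-rank approximation sketch: keep only the top-k singular values.
import numpy as np

image = np.random.rand(127, 350)                # stand-in for the 127x350 picture
k = 50                                          # number of singular values to keep

U, s, Vt = np.linalg.svd(image, full_matrices=False)
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # 127x50 * 50x50 * 50x350

# Storage drops from 127*350 values to 127*50 + 50 + 50*350 values.
print("Approximation error:", np.linalg.norm(image - approx))
```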

EigenFaces

Let's say we have a set of images like this:

We can represent each image as a vector and store the dataset as an m x d matrix, where
m is the number of images we have and d depends on the resolution of the images. If
we have 400 images of 64x64 pixels each, then our matrix is 400x4096.
We can reduce this to rank 16 using the SVD. Then the "V^T" matrix in U*S*V^T is 16x4096,
and we can interpret this matrix as a collection of 16 faces; these are our "eigenfaces".
Now we can represent each image in our dataset as a vector of 16 components (its coordinates
with respect to, or distances to, the 16 eigenfaces). Then instead of 4096 dimensions to represent our
faces we have 16. This can then be used for quick face recognition: given a face, we
compute its 16-component representation and then compare this 16-element vector to the vectors in our
database to find the closest match. This has many applications.

Latent Semantic Indexing

When we process a corpus of text, we can represent it as a term x document matrix. This
matrix can be decomposed using the SVD, and the reduced-rank approximation captures (in
some way) the semantics in our text. This usually has several advantages, like filtering
noise, handling the problem of synonyms, etc. It has many applications in Information
Retrieval.
Matrix Factorization

Matrix factorization is one of the most sought-after machine learning recommendation
models. It acts as a catalyst, enabling the system to gauge the customer's exact purpose for the
purchase, scan numerous pages, shortlist and rank the right product or service, and
recommend multiple available options. Once the output matches the requirement, the lead
translates into a transaction and the deal is closed.

What is Matrix Factorization?

This mathematical model helps the system split an entity into multiple smaller entries,
through an ordered rectangular array of numbers or functions, to discover the features or
information underlying the interactions between users and items.

Where is Matrix Factorization used?

Once an individual raises a query on a search engine, the machine uses matrix
factorization to generate an output in the form of recommendations. The system uses two
approaches, content-based filtering and collaborative filtering, to make recommendations.

Content-Based Filtering

This approach recommends items based on user preferences. It matches the requirement,
considering the past actions of the user, patterns detected, or any explicit feedback provided
by the user, and accordingly makes a recommendation.

Example: If you prefer the chocolate flavor and purchase a chocolate ice cream, then the next time
you raise a query, the system will scan for options related to chocolate and recommend that
you try a chocolate cake.


How does the System make recommendations?

Let us take an example. To purchase a car, in addition to the brand name, people check for
the features available in the car, the most common ones being safety, mileage, or aesthetic value.
A few buyers consider the automatic gearbox, while others opt for a combination of two or
more features. To understand this concept, let us consider a two-dimensional vector with the
features of safety and mileage.

1. In the above graph, on the left-hand side, we have cited individual preferences,
wherein 4 individuals have been asked to provide a rating on safety and mileage. If
an individual likes a feature, we assign the value 1, and if they do not like that
particular feature, we assign 0. We can see that Persons A and C prefer safety,
Person B chooses mileage, and Person D opts for both safety and mileage.

2. Cars are rated based on the features (items) they offer. A rating of 4
implies strong features, and 1 depicts weak features.

• A ranks high on safety and low on mileage
• B is rated high on mileage and low on safety
• C has an average rating and offers both safety and mileage
• D is low on both mileage and safety

3. The blue-colored ? mark is a sparse value, wherein either the person does not know
about the car, or the car is not part of the consideration list for buying, or the person has
forgotten to rate it.

4. Let's understand how the matrix at the center is arrived at. This matrix represents the
overall rating of all 4 cars, given by the individuals. Person A has given an overall
rating of 4 to Car A, 1 to Cars B and D, and 2 to Car C. This is arrived at
through the following calculations:

 For Car A = (1 of safety preference x 4 of safety features) + (0 of mileage preference x
1 of mileage feature) = 4;

 For Car B = (1 of safety preference x 1 of safety features) + (0 of mileage preference x
4 of mileage feature) = 1;

 For Car C = (1 of safety preference x 2 of safety features) + (0 of mileage preference x
2 of mileage feature) = 2;

 For Car D = (1 of safety preference x 1 of safety features) + (0 of mileage preference x
2 of mileage feature) = 1

On the basis of the above calculations, we can predict the overall rating for each person
and all the cars.
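This calculation is simply a matrix product of the user-preference matrix and the item-feature matrix. A small NumPy sketch is shown below; the preference and feature values are taken from the numbers quoted in the example above (the entries not spelled out in the text are assumptions):

```python
# Content-based rating prediction sketch: user preferences x item features.
import numpy as np

# Rows: Persons A-D; columns: (safety preference, mileage preference).
preferences = np.array([
    [1, 0],   # Person A prefers safety
    [0, 1],   # Person B prefers mileage
    [1, 0],   # Person C prefers safety
    [1, 1],   # Person D prefers both
])

# Rows: Cars A-D; columns: (safety rating, mileage rating) as in the example.
features = np.array([
    [4, 1],   # Car A
    [1, 4],   # Car B
    [2, 2],   # Car C
    [1, 2],   # Car D
])

ratings = preferences @ features.T     # 4 persons x 4 cars
print(ratings[0])                      # Person A's predicted ratings: [4 1 2 1]
```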

Advantages of content-based filtering

a) The model does not require any data about other users, since the recommendations are
specific to one user. This makes it easier to scale to a large number of users.

b) The model can capture the specific interests of a user and can recommend niche items that
very few other users are interested in.

Disadvantages of content-based filtering

a) Since the feature representation of the items is hand-engineered to some extent, this
technique requires a lot of domain knowledge. Therefore, the model can only be as good as
the hand-engineered features.

b) The model can only make recommendations based on the existing interests of the user. In
other words, the model has limited ability to expand on the users' existing interests.

Collaborative Filtering

This approach uses similarities between users and items simultaneously to provide
recommendations. It is the idea of recommending an item or making a prediction depending
on other like-minded individuals. It could comprise a set of users, items, opinions about an
item, ratings, reviews, or purchases.

Example: Suppose Persons A and B both like the chocolate flavor and both of them have tried
the ice cream and the cake. If Person A then buys chocolate biscuits, the system will recommend
chocolate biscuits to Person B.

Types of collaborative filtering

Collaborative filtering is classified under the memory-based and model-based approaches:


How does system make recommendations in collaborative filtering?

In collaborative filtering, we do not have the ratings of individual preferences and car
features. We only have the overall rating given by the individuals for each car. As usual,
the data is sparse, implying that the person either does not know about the car, or the car is not
on the consideration list for buying, or the person has forgotten to give a rating.

The task at hand is to predict the rating that Person C might assign to Car C (the ? marked in
yellow) based on the similarities in the ratings given by the other individuals. There are three
steps involved in arriving at a collaborative filtering recommendation.

Step-1

Normalization: Usually, while assigning ratings, individuals tend to give either high ratings
or low ratings across all parameters. Normalization helps in balancing and evening out
such measures. This is done by taking the average of the available ratings and subtracting it
from each individual rating (x − x̄).

a) In the case of Person A, x̄ = (4+1+2+1)/4 = 2; in the case of Person B, x̄ = (1+4+2)/3 = 2.3.
Similarly, we can do it for Persons C and D.

b) Then we subtract the average from each individual rating.
In the case of Person A this gives 4−2, 1−2, 2−2, 1−2; in the case of Person B, 1−2.3, 4−2.3, 2−2.3.
Similarly, we can do it for Persons C and D.

You get the below table for all the individuals.

If you add all the numbers in each row, they add up to zero; in effect, we have
centered each individual's overall rating at zero. Starting from zero, if an individual's rating for a
car is positive, it means they like the car, and if it is negative, it means they do not like the car.

Step-2

Similarity measure: As discussed in content-based filtering, we find the similarity between
two vectors a and b as the ratio of their dot product to the product of their
magnitudes (the cosine similarity).

Step-3

Neighborhood selection: Here, in the Car C column, we find the maximum similarities between the
ratings assigned by Persons A and D. We use these similarities to arrive at the prediction.
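A minimal NumPy sketch of these three steps (mean-centering each user's ratings, computing the cosine similarity between users, and predicting a missing rating from the most similar neighbors) is shown below; the rating matrix (the entries for Persons C and D in particular) and the use of NaN for missing entries are assumptions for illustration.

```python
# User-based collaborative filtering sketch: normalize, measure similarity, predict.
import numpy as np

# Rows: Persons A-D; columns: Cars A-D; NaN marks a missing (sparse) rating.
R = np.array([
    [4.0, 1.0, 2.0, 1.0],
    [1.0, 4.0, 2.0, np.nan],
    [np.nan, 1.0, np.nan, 4.0],
    [4.0, 1.0, np.nan, 3.0],
])

# Step 1: normalization - subtract each user's mean rating (ignoring missing values).
means = np.nanmean(R, axis=1, keepdims=True)
centered = np.nan_to_num(R - means)            # missing entries treated as 0 after centering

def cosine(a, b):
    # Step 2: similarity = dot product / product of magnitudes
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Step 3: predict Person C's rating (row 2) for Car C (column 2) from similar users.
target_user, target_item = 2, 2
sims = np.array([cosine(centered[target_user], centered[u]) for u in range(len(R))])
sims[target_user] = 0.0                        # exclude the user themselves
neighbors = [u for u in np.argsort(sims)[::-1] if not np.isnan(R[u, target_item])][:2]

pred = means[target_user, 0] + (
    sum(sims[u] * centered[u, target_item] for u in neighbors) /
    (sum(abs(sims[u]) for u in neighbors) + 1e-12)
)
print("Predicted rating for Person C, Car C:", round(float(pred), 2))
```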

Advantages of collaborative filtering

 There is no dependence on domain knowledge, as the embeddings are learned
automatically.

 The model can help users discover new interests. In isolation, the ML system may not
know the user is interested in a given item, but the model might still recommend it
because similar users are interested in that item.

 To some extent, the system needs only the feedback matrix to train a matrix
factorization model. In particular, the system does not require contextual features.
Disadvantages of collaborative filtering

 The matrix cannot handle fresh items; for instance, if a new car is added to the matrix,
it may have limited user interaction and thus will rarely occur as a recommendation.

 The output of the recommendation could be biased based on popularity; that is, if
most user interaction is towards a particular car, then the recommendations will focus
on that popular car only.


Collaborative Filtering in Machine Learning
If you are watching a horror video on YouTube, then the next time you open the app you will
automatically see some more horror videos in your feed. Have you ever thought about how this
works, i.e., how an application is able to learn about your choices and likes? This
is exactly what is popularly known as a Recommendation System.

What is a Recommendation system?


There are a lot of applications where websites collect data from their users and use that data
to predict the likes and dislikes of their users. This allows them to recommend the content
that they like. Recommender systems are a way of suggesting similar items and ideas to a
user’s specific way of thinking.
There are basically two types of recommender Systems:
 Collaborative Filtering: Collaborative Filtering recommends items based on similarity
measures between users and/or items. The basic assumption behind the algorithm is that
users with similar interests have common preferences.
 Content-Based Recommendation: It is supervised machine learning used to induce a
classifier to discriminate between interesting and uninteresting items for the user.

What is Collaborative Filtering?


In Collaborative Filtering, we tend to find similar users and recommend what similar users
like. In this type of recommendation system, we don’t use the features of the item to
recommend it, rather we classify the users into clusters of similar types and recommend each
user according to the preference of its cluster.
There are basically four types of algorithms, or techniques, to build collaborative filtering
recommender systems:
 Memory-Based
 Model-Based
 Hybrid
 Deep Learning
Advantages of Collaborative Filtering-Based Recommender Systems
As we know, there are two types of recommender systems; content-based recommender systems
have limited use cases and higher time complexity, and they rely on a limited amount of
content, which is not the case with collaborative filtering-based
algorithms. One of the main advantages of these recommender systems is that they are
highly efficient in providing personalized content and are also able to adapt to changing user
preferences.

Measuring Similarity
A simple example of a movie recommendation system will help explain this:

In this scenario, we can see that User 1 and User 2 give nearly similar ratings to the
movies, so we can conclude that Movie 3 is also going to be moderately liked by User 1, while
Movie 4 will be a good recommendation for User 2. We can also see that there are
users who have different tastes; for example, User 1 and User 3 are opposite to each other. One can
see that User 3 and User 4 have a common interest in the movies, and on that basis we can say that
Movie 4 is also going to be disliked by User 4. This is collaborative filtering: we
recommend to users the items which are liked by users with similar interests.

Cosine Similarity
We can also use the cosine similarity between users to find users with similar
interests: a larger cosine implies a smaller angle between two users, and hence
more similar interests. We can apply the cosine distance between two users in the utility
matrix, and we can assign the value zero to all the unfilled cells to make the calculation
easier. If we get a smaller cosine, there is a larger angle (distance) between the users; if the
cosine is larger, we have a small angle between the users, and we can recommend them
similar things.
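As a small sketch of this idea, assuming a toy utility matrix with zeros in the unfilled cells, scikit-learn's cosine_similarity can be used to compare users:

```python
# User-user cosine similarity sketch on a utility (user x movie) matrix.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows: Users 1-4; columns: Movies 1-4; 0 stands for an unfilled rating.
utility = np.array([
    [5, 4, 0, 1],
    [5, 5, 3, 0],
    [1, 0, 4, 5],
    [0, 1, 4, 0],
])

sim = cosine_similarity(utility)     # 4x4 matrix of user-user similarities
print(np.round(sim, 2))              # larger value => smaller angle => more similar users
```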
Rounding the Data
In collaborative filtering, we can round off the data to compare it more easily; for example,
we can map ratings below 3 to 0 and ratings of 3 and above to 1. This helps us compare the
data more easily. For example:

Taking the previous example and applying this rounding-off process, you can see how much
more readable the data has become. After performing this process, we can see that
User 1 and User 2 are more similar, and User 3 and User 4 are more alike.

Normalizing Rating
In the process of normalizing, we take the average rating of a user and subtract it from all the
given ratings, so we get either positive or negative values as ratings, which can then simply be
classified into similar groups. By normalizing the data we can make clusters of the users
that give similar ratings to similar items, and then we can use these clusters to recommend
items to the users.

What are some of the Challenges to be Faced while using Collaborative Filtering?
As we know, every algorithm has its pros and cons, and that is also the case with collaborative
filtering algorithms. Collaborative filtering algorithms are very dynamic and can change as
well as adapt to changes in user preferences over time. But one of the main issues
faced by recommender systems is scalability: as the user base increases, the computation and
the data storage requirements increase manifold, which can lead to slow and inaccurate
results.
MapReduce
MapReduce is a programming model for writing applications that can process Big Data in
parallel on multiple nodes. MapReduce provides analytical capabilities for analyzing huge
volumes of complex data.

What is Big Data?

Big Data is a collection of large datasets that cannot be processed using traditional computing
techniques. For example, the volume of data that Facebook or YouTube needs to collect
and manage on a daily basis can fall under the category of Big Data. However, Big Data is
not only about scale and volume; it also involves one or more of the following aspects:
Velocity, Variety, Volume, and Complexity.

Why MapReduce?

Traditional enterprise systems normally have a centralized server to store and process data.
The following illustration depicts a schematic view of a traditional enterprise system. The
traditional model is certainly not suitable for processing huge volumes of scalable data, which
cannot be accommodated by standard database servers. Moreover, the centralized system
creates too much of a bottleneck while processing multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce
divides a task into small parts and assigns them to many computers. Later, the results are
collected at one place and integrated to form the result dataset.
How MapReduce Works?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.

The reduce task is always performed after the map job.
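Conceptually, the two tasks can be sketched in a few lines of Python for the classic word-count problem; this is only an in-memory illustration of the model, not Hadoop code:

```python
# Conceptual MapReduce sketch: word count over a tiny in-memory "dataset".
from collections import defaultdict

documents = ["deer bear river", "car car river", "deer car bear"]

# Map: break each record into (key, value) tuples, here (word, 1).
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort: group the values of equivalent keys together.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine each key's values into a smaller set of tuples, here (word, total).
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)   # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```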

Let us now take a close look at each of the phases and try to understand their significance.

 Input Phase − Here we have a Record Reader that translates each record in an input
file and sends the parsed data to the mapper in the form of key-value pairs.
 Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
 Combiner − A combiner is a type of local Reducer that groups similar data from the
map phase into identifiable sets. It takes the intermediate keys from the mapper as input
and applies a user-defined code to aggregate the values in a small scope of one mapper.
It is not a part of the main MapReduce algorithm; it is optional.
 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads
the grouped key-value pairs onto the local machine, where the Reducer is running. The
individual key-value pairs are sorted by key into a larger data list. The data list groups
the equivalent keys together so that their values can be iterated easily in the Reducer
task.
 Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered, and
combined in a number of ways, and it requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.
 Output Phase − In the output phase, we have an output formatter that translates the
final key-value pairs from the Reducer function and writes them onto a file using a
record writer.

Let us try to understand the two tasks Map & Reduce with the help of a small diagram −
MapReduce-Example

Let us take a real-world example to comprehend the power of MapReduce. Twitter receives
around 500 million tweets per day, which is nearly 6,000 tweets per second. The following
illustration shows how Twitter manages its tweets with the help of MapReduce.

As shown in the illustration, the MapReduce algorithm performs the following actions −

 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value
pairs.
 Filter − Filters unwanted words from the maps of tokens and writes the filtered maps
as key-value pairs.
 Count − Generates a token counter per word.
 Aggregate Counters − Prepares an aggregate of similar counter values into small
manageable units.

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The map task is done by means of Mapper Class


 The reduce task is done by means of Reducer Class.

Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is
used as input by Reducer class, which in turn searches matching pairs and reduces them.
MapReduce implements various mathematical algorithms to divide a task into small parts and
assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending
the Map & Reduce tasks to appropriate servers in a cluster.

These mathematical algorithms may include the following −

 Sorting
 Searching
 Indexing
 TF-IDF

Sorting

Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements sorting algorithm to automatically sort the output key-value pairs from the
mapper by their keys.

Searching

Searching plays an important role in MapReduce algorithm. It helps in the combiner phase
(optional) and in the Reducer phase.

Indexing

Normally indexing is used to point to a particular data and its address. It performs batch
indexing on the input files for a particular Mapper.

The indexing technique that is normally used in MapReduce is known as an inverted
index. Search engines like Google and Bing use the inverted indexing technique.

TF-IDF

TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse
Document Frequency. It is one of the common web analysis algorithms. Here, the term
'frequency' refers to the number of times a term appears in a document.
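For a term t in a document d drawn from a corpus of N documents, the score is typically computed as

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the number of times t appears in d and df(t) is the number of documents that contain t.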
Hadoop Streaming
In the world of big data, processing vast amounts of data efficiently is a crucial task.
Hadoop, an open-source framework, has been a cornerstone in managing and processing
large data sets across distributed computing environments. Among its various components,
Hadoop Streaming stands out as a versatile tool, enabling users to process data using non-
Java programming languages. This article delves into the purpose of Hadoop Streaming, its
usage scenarios, implementation details, and provides a comprehensive understanding of
this powerful tool.

Hadoop Streaming is a utility that allows users to create and run MapReduce jobs using
any executable or script as the mapper and/or reducer, instead of Java. It enables the use
of various programming languages like Python, Ruby, and Perl for processing large
datasets. This flexibility makes it easier for non-Java developers to leverage Hadoop’s
distributed computing power for tasks such as log analysis, text processing, and data
transformation.

Definition and Purpose of Hadoop Streaming


Hadoop Streaming is a utility that allows users to create and run MapReduce jobs with any
executable or script as the mapper and/or reducer. Traditionally, Hadoop MapReduce jobs
are written in Java, but Hadoop Streaming provides the flexibility to use other
programming languages like Python, Ruby, Perl, and more. The primary purpose of
Hadoop Streaming is to lower the barrier of entry for developers who are not proficient in
Java but need to process large data sets using the Hadoop framework.
Key Features:
 Language Flexibility: Allows the use of various programming languages for
MapReduce jobs.
 Ease of Use: Simplifies the process of writing MapReduce jobs by allowing the use of
standard input and output for communication between Hadoop and the scripts.
 Versatility: Enables the integration of a wide range of scripts and executables, making it
a versatile tool for data processing.
Usage Scenarios
Hadoop Streaming is particularly useful in scenarios where:
 Non-Java Expertise: The development team is more proficient in languages other than
Java, such as Python or R.
 Legacy Code Integration: There is a need to integrate existing scripts and tools into the
Hadoop ecosystem without rewriting them in Java.
 Rapid Prototyping: Quick development and testing of data processing pipelines are
required.
 Specialized Processing: Custom processing logic that is more easily implemented in a
specific language.

Common Use Cases:


 Log Analysis: Processing server logs using scripts to filter, aggregate, and analyze log
data.
 Text Processing: Analyzing large text corpora with Python or Perl scripts.
 Data Transformation: Using shell scripts to transform and clean data before loading it
into a data warehouse.
 Machine Learning: Running Python-based machine learning algorithms on large
datasets stored in Hadoop.

Implementation and Example


Implementing Hadoop Streaming involves setting up a Hadoop cluster and running
MapReduce jobs using custom scripts. Here’s a step-by-step example using Python for
word count, a classic MapReduce task.
Step 1: Setting Up Hadoop
Ensure Hadoop is installed and configured on your cluster. You can use Hadoop in pseudo-
distributed mode for testing or a fully distributed mode for production.

Step 2: Writing the Mapper and Reducer Scripts


Create two Python scripts, one for the mapper and one for the reducer
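The original scripts are not reproduced here; a minimal word-count pair, assuming whitespace-delimited text on standard input, might look like the following (Hadoop Streaming feeds data to the scripts via stdin and reads their stdout):

```python
# mapper.py - emits one "word<TAB>1" line per word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py - sums the counts for each word; Hadoop sorts the mapper output by key,
# so all lines for the same word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```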

Step 3: Running the Hadoop Streaming Job


Upload the input data to the Hadoop Distributed File System (HDFS):
Step 4: Retrieving the Results
After the job completes, retrieve the output from HDFS:
This output file contains the word count results generated by the MapReduce job.

Conclusion
Hadoop Streaming is an invaluable tool for developers who need to leverage the power of
Hadoop without diving deep into Java. Its ability to integrate various programming
languages and tools makes it a flexible and powerful option for processing large datasets.
Whether you’re analyzing logs, processing text data, or running machine learning
algorithms, Hadoop Streaming simplifies the process and opens up new possibilities for big
data processing.
Implementing PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM
Although a support vector machine model (binary classifier) is more commonly
built by solving a quadratic programming problem in the dual space, it can also be built
quickly by solving the primal optimization problem. In this article a Support
Vector Machine implementation is described that solves the primal
optimization problem with a sub-gradient solver using stochastic gradient descent.

First the vanilla version and then the kernelized version of
the Pegasos algorithm are described, along with some applications on
some datasets.
 Next the hinge-loss function for the SVM is going to be replaced by the log-
loss function for the Logistic Regression and the primal SVM problem is
going to be converted to regularized logistic regression.
 Finally document sentiment classification will be done by first training a
Perceptron, SVM (with Pegasos) and a Logistic Regression classifier on a
corpus and then testing it on an unseen part of the corpus.
 The time to train the classifiers along with accuracy obtained on a held-out
dataset will be computed.

1. SVM implementation by minimizing the primal


objective with hinge-loss using SGD with PEGASOS
The soft-SVM primal objective can be represented as follows:

min over w, b:  (1/2)·||w||^2 + C · Σ_i max(0, 1 − y_i·(w·x_i + b))

or as the following if the explicit bias term is discarded:

min over w:  (1/2)·||w||^2 + C · Σ_i max(0, 1 − y_i·(w·x_i))

where the 0-1 loss is approximated by the hinge-loss.


 Changing the regularization constant to λ, it can be equivalently expressed
using the hinge-loss as f(w) = (λ/2)·||w||^2 + (1/m)·Σ_i max(0, 1 − y_i·(w·x_i)), as in
the Pegasos paper.
 The next figure also describes the Pegasos algorithm, which performs
an SGD on the primal objective (Lagrangian) with carefully chosen steps.
 Since the hinge-loss is not differentiable everywhere, the sub-gradient of the objective is
used instead of the gradient for a single update step with
SGD.
 The learning rate η is gradually decreased with iteration.
A simplified version of the algorithm is sketched below:
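The original figure is not reproduced here; the following is a minimal NumPy sketch of that simplified loop (no bias term, no optional projection step), with the toy data, λ, and the number of iterations chosen arbitrarily for illustration:

```python
# Simplified Pegasos sketch: SGD on the primal hinge-loss objective.
import numpy as np

def pegasos(X, Y, lam=0.01, T=1000, seed=0):
    """X: n x d feature matrix, Y: labels coded as +1 / -1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                  # pick a random training example
        eta = 1.0 / (lam * t)                # variable step length eta = 1 / (lambda * t)
        if Y[i] * (w @ X[i]) < 1:            # hinge loss active: sub-gradient has a data term
            w = (1 - eta * lam) * w + eta * Y[i] * X[i]
        else:                                # hinge loss zero: only the regularizer contributes
            w = (1 - eta * lam) * w
    return w

# Tiny linearly separable toy problem.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
Y = np.array([1, 1, -1, -1])
w = pegasos(X, Y)
print("Predictions:", np.sign(X @ w))        # should recover [ 1.  1. -1. -1.]
```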

Some Notes
 The optional projection step has been left out (the line in square brackets in
the paper).
 As usual, the outputs (in the list Y) are coded as +1 for positive examples
and -1 for negative examples.
 The number η is the step length in gradient descent.
 The gradient descent algorithm may have problems finding the minimum if
the step length η is not set properly. To avoid this difficulty, Pegasos uses a
variable step length: η = 1 / (λ · t).
 Since we compute the step length by dividing by t, it will gradually become
smaller and smaller. The purpose of this is to avoid the “bounce
around” problem as it gets close to the optimum.
 Although the bias variable b in the objective function is discarded in this
implementation, the paper proposes several ways to learn a bias term (non-
regularized) too, the fastest implementation is probably with the binary
search on a real interval after the PEGASOS algorithm returns an
optimum w.
