ML(UNIT_5)
In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." These techniques are widely used in machine learning to obtain a better-fitting predictive model when solving classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
The Curse of Dimensionality
Handling high-dimensional data is very difficult in practice; this is commonly known as the curse of dimensionality. If the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples needed to cover the feature space grows rapidly, and the chance of overfitting also increases. If a machine learning model is trained on high-dimensional data, it tends to become overfitted and to perform poorly.
Hence, it is often required to reduce the number of features, which can be done with dimensionality reduction. Some benefits of applying dimensionality reduction to a dataset are:
o By reducing the dimensions of the features, the space required to store the
dataset also gets reduced.
o Less computation/training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data
quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and
leaving out the irrelevant features present in a dataset to build a model of high accuracy.
In other words, it is a way of selecting the optimal features from the input dataset.
1. Filter Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some common techniques of the filter method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Wrapper Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and its performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more complex to work with. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
3. Embedded Methods: Embedded methods check the different training iterations of the
machine learning model and evaluate the importance of each feature. Some common
techniques of Embedded methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.
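As an illustration of an embedded method, the sketch below fits an L1-regularized (LASSO) model and keeps only the features whose coefficients are non-zero. The dataset, the alpha value, and the scikit-learn usage are illustrative assumptions, not part of the original notes.

```python
# Hedged sketch: embedded feature selection via L1 (LASSO) regularization.
# The dataset and the alpha value are illustrative.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # put all features on a comparable scale

lasso = Lasso(alpha=0.1).fit(X, y)      # the L1 penalty drives weak coefficients to zero
selected = np.flatnonzero(lasso.coef_)  # indices of the features the model kept
print("Selected feature indices:", selected)
```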
Feature Extraction:
Feature extraction is the process of transforming a space with many dimensions into a space with fewer dimensions. This approach is useful when we want to keep as much of the information as possible while using fewer resources to process it.
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze the data points much more easily and quickly without extraneous variables to process.
So, to sum up, the idea of Principal Component Analysis (PCA) is simple: reduce the number of variables of a data set, while preserving as much information as possible.
Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables. These combinations are done in such a way that the new
variables (i.e., principal components) are uncorrelated and most of the information within the
initial variables is squeezed or compressed into the first components. So, the idea is 10-
dimensional data gives you 10 principal components, but PCA tries to put maximum possible
information in the first component, then maximum remaining information in the second and
so on, until having something like what is shown in the scree plot below.
[Scree plot: the percentage of variance (information) carried by each principal component.]
Organizing information in principal components this way will allow you to reduce
dimensionality without losing much information, and this by discarding the components with
low information and considering the remaining components as your new variables.
An important thing to realize here is that the principal components are less interpretable and
don’t have any real meaning since they are constructed as linear combinations of the initial
variables.
Geometrically speaking, principal components represent the directions of the data that
explain a maximal amount of variance, that is to say, the lines that capture most
information of the data. The relationship between variance and information here, is that, the
larger the variance carried by a line, the larger the dispersion of the data points along it, and
the larger the dispersion along a line, the more information it has. To put all this simply, just
think of principal components as new axes that provide the best angle to see and evaluate the
data, so that the differences between the observations are better visible.
As there are as many principal components as there are variables in the data, principal
components are constructed in such a manner that the first principal component accounts for
the largest possible variance in the data set. For example, suppose the scatter plot of our data set is as shown below: can we guess the first principal component? Yes, it is approximately the line that matches the purple marks, because it goes through the origin and it is the line along which the projection of the points (the red dots) is the most spread out. Mathematically speaking, it is the line that maximizes the variance, that is, the average of the squared distances from the projected points (the red dots) to the origin.
The second principal component is calculated in the same way, with the condition that it is
uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for
the next highest variance.
This continues until a total of p principal components have been calculated, equal to the
original number of variables.
Step 1: Standardization
The aim of this step is to standardize the range of the continuous initial variables so that each
one of them contributes equally to the analysis.
More specifically, the reason why it is critical to perform standardization prior to PCA, is that
the latter is quite sensitive regarding the variances of the initial variables. That is, if there are
large differences between the ranges of initial variables, those variables with larger ranges
will dominate over those with small ranges (for example, a variable that ranges between 0
and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased
results. So, transforming the data to comparable scales can prevent this problem.
Mathematically, this can be done by subtracting the mean and dividing by the standard
deviation for each value of each variable.
Once the standardization is done, all the variables will be transformed to the same scale.
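A minimal sketch of this step, assuming a NumPy array X whose rows are observations and whose columns are the initial variables (the values are illustrative):

```python
# Step 1 sketch: standardization, z = (value - mean) / standard deviation, per variable.
import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])                 # illustrative data with 2 variables

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # every column now has mean 0 and std 1
```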
Step 2: Covariance Matrix Computation
The aim of this step is to understand how the variables of the input data set are varying from
the mean with respect to each other, or in other words, to see if there is any relationship
between them. Because sometimes, variables are highly correlated in such a way that they
contain redundant information. So, in order to identify these correlations, we compute
the covariance matrix.
Now that we know that the covariance matrix is no more than a table that summarizes the covariances (and hence the correlations) between all the possible pairs of variables, let's move to the next step.
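Continuing the sketch from Step 1, the covariance matrix can be computed as follows (the data is the same illustrative array):

```python
# Step 2 sketch: covariance matrix of the standardized data.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

cov_matrix = np.cov(Z, rowvar=False)   # 2 x 2 table; entry (i, j) estimates Cov(var_i, var_j)
print(cov_matrix)
```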
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify
the principal components
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from
the covariance matrix in order to determine the principal components of the data.
What you first need to know about eigenvectors and eigenvalues is that they always come in
pairs, so that every eigenvector has an eigenvalue. Also, their number is equal to the number
of dimensions of the data. For example, for a 3-dimensional data set, there are 3 variables,
therefore there are 3 eigenvectors with 3 corresponding eigenvalues.
It is the eigenvectors and eigenvalues that are behind all the magic of principal components, because the eigenvectors of the covariance matrix are actually the directions of the axes where there is the most variance (most information), and these are what we call the principal components. Eigenvalues are simply the coefficients attached to the eigenvectors, and they give the amount of variance carried by each principal component.
By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the
principal components in order of significance.
Let's suppose that our data set is 2-dimensional with 2 variables x and y, and that the covariance matrix has eigenvalues λ1 and λ2 with corresponding eigenvectors v1 and v2.
If we rank the eigenvalues in descending order, we get λ1 > λ2, which means that the eigenvector that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the second principal component (PC2) is v2.
After having the principal components, to compute the percentage of variance (information)
accounted for by each component, we divide the eigenvalue of each component by the sum of
eigenvalues. If we apply this on the example above, we find that PC1 and PC2 carry
respectively 96 percent and 4 percent of the variance of the data.
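A small sketch of this step on the illustrative data from the earlier sketches; the eigenvalue ranking and the variance percentages are computed exactly as described above:

```python
# Step 3 sketch: eigenvectors/eigenvalues of the covariance matrix and the
# fraction of variance carried by each principal component.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(Z, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)   # eigh is for symmetric matrices
order = np.argsort(eigenvalues)[::-1]                    # rank by eigenvalue, highest first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()              # fraction of variance per component
print(explained)                                         # PC1 first, PC2 second
```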
Step 4: Create a Feature Vector
As we saw in the previous step, computing the eigenvectors and ordering them by their
eigenvalues in descending order, allow us to find the principal components in order of
significance. In this step, what we do is, to choose whether to keep all these components or
discard those of lesser significance (of low eigenvalues), and form with the remaining ones a
matrix of vectors that we call Feature vector.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the
components that we decide to keep. This makes it the first step towards dimensionality
reduction, because if we choose to keep only p eigenvectors (components) out of n, the final
data set will have only p dimensions.
Continuing with the example from the previous step, we can either form a feature vector with both of the eigenvectors v1 and v2, or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector with v1 only.
Discarding the eigenvector v2 will reduce dimensionality by 1, and will consequently cause a
loss of information in the final data set. But given that v2 was carrying only 4 percent of the
information, the loss will be therefore not important and we will still have 96 percent of the
information that is carried by v1.
So, as we saw in the example, it’s up to you to choose whether to keep all the components or
discard the ones of lesser significance, depending on what you are looking for. Because if
you just want to describe your data in terms of new variables (principal components) that are
uncorrelated without seeking to reduce dimensionality, leaving out lesser significant
components is not needed.
Step 5: Recast the Data Along the Principal Component Axes
In the previous steps, apart from standardization, you do not make any changes to the data;
you just select the principal components and form the feature vector, but the input data set
remains always in terms of the original axes (i.e, in terms of the initial variables).
In this step, which is the last one, the aim is to use the feature vector formed using the
eigenvectors of the covariance matrix, to reorient the data from the original axes to the ones
represented by the principal components (hence the name Principal Components Analysis).
This can be done by multiplying the transpose of the feature vector by the transpose of the standardized original data set (FinalDataSet = FeatureVector^T * StandardizedOriginalDataSet^T).
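Putting the steps together on the illustrative data used above, keeping only the first principal component:

```python
# Step 5 sketch: project the standardized data onto the kept principal component(s).
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order][:, :1]   # keep only PC1 (columns = kept eigenvectors)

# FinalDataSet = FeatureVector^T * StandardizedDataSet^T, transposed back to rows-as-samples
final_data = (feature_vector.T @ Z.T).T          # shape: (5 samples, 1 dimension)
print(final_data)
```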
Singular Value Decomposition (SVD)
Singular Value Decomposition, or SVD for short, is a mathematical technique used in machine
learning to make sense of huge and complicated data.
Imagine you have many different toys that you want to organize. Some are big, some are small, some are red, some are blue, and so on. It can take some work to figure out how to group them together!
But what if you could break down each toy into its most essential parts?
For example, you could take a big red ball and break it down into a big part, a red part, and a
round part.
That would make it much easier to compare it with, and group it together with, other toys that have those same basic parts.
SVD works in the same way: it takes a big, complicated piece of data and breaks it into its most essential parts. Then we can use those parts to find patterns and similarities in the data.
For example:
let's say you have many pictures of different animals. SVD could break down each picture into
its most essential parts, like lines and curves. Then we could use those parts to find patterns,
like which animals have similar shapes.
Here are a few examples of the types of problems that SVD can help solve:
Dimensionality Reduction
One of the main applications of SVD is to reduce the dimensionality of a dataset. By finding
the basic patterns in the data and discarding the less important ones, SVD can help simplify the
data and make it easier to work with.
Data Compression
SVD can also compress large datasets without losing too much information. We can represent
the data using fewer features by keeping only the most important singular values and associated
singular vectors.
Matrix Approximation
Another application of SVD is to approximate a large, complex matrix using a smaller, simpler
one. This can be useful when working with large datasets that are difficult to handle directly.
Collaborative Filtering
SVD can be used to predict user preferences in recommender systems by modeling the
relationships between users and items in a large matrix.
In short, data can sometimes be massive and complicated, and it is hard to make sense of it all; SVD helps us simplify the data and find its most essential parts so that we can understand it better.
Singular Value Decomposition is a way to factor a matrix A into three matrices, as follows:
A = U * S * V^T
Where U and V are orthogonal matrices, and S is a diagonal matrix containing the
singular values of A.
Note:
A matrix is considered an orthogonal matrix if the product of the matrix and its transpose gives the identity matrix.
A matrix is diagonal if it has non-zero elements only in the diagonal, running from the
upper left to the lower right corner of the matrix.
Here, U and V represent the left and right singular vectors of A, respectively,
and S represents the singular values of A.
The algorithm for computing the SVD of matrix A can be summarized in the following steps:
1. Compute the eigendecomposition of the symmetric matrix A^T A. This can be done using
any standard eigendecomposition algorithm.
2. Compute the singular values of A as the square root of the eigenvalues of A^T A. Sort
the singular values in descending order.
3. Compute the left and right singular vectors of A as follows:
1. For each singular value, find the corresponding eigenvector of A^T A.
2. Normalize each eigenvector to have a unit length.
3. The left singular vectors of A are the eigenvectors of A A^T corresponding to the
nonzero singular values of A.
4. The right singular vectors of A are the normalized eigenvectors of A^T A.
The diagonal entries of S are the singular values of A, sorted in descending order.
The columns of U are the corresponding left singular vectors of A.
The columns of V are the corresponding right singular vectors of A.
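A minimal NumPy sketch of this procedure, building U, S, and V from the eigendecomposition of A^T A and checking the result against NumPy's built-in SVD (the matrix A is illustrative):

```python
# Hedged sketch: SVD of A via the eigendecomposition of A^T A.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])                      # illustrative 3 x 2 matrix

eigvals, V = np.linalg.eigh(A.T @ A)            # step 1: eigendecomposition of A^T A
order = np.argsort(eigvals)[::-1]               # step 2: sort in descending order
eigvals, V = eigvals[order], V[:, order]
singular_values = np.sqrt(np.clip(eigvals, 0, None))   # sigma_i = sqrt(lambda_i)

# step 3: right singular vectors are the columns of V; left singular vectors can be
# obtained as u_i = A v_i / sigma_i (equivalently, eigenvectors of A A^T).
U = (A @ V) / singular_values
S = np.diag(singular_values)

print(np.allclose(U @ S @ V.T, A))              # True: A = U * S * V^T
print(np.linalg.svd(A, full_matrices=False))    # NumPy's SVD for comparison (signs may differ)
```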
Once the SVD of matrix A has been computed, it can be used for various tasks in machine
learning, such as
Dimensionality reduction,
Data compression,
Feature extraction.
APPLICATIONS OF SVD
Dimensionality Reduction
The first and most important application is to reduce the dimensionality of data; the SVD is more or less the standard tool for this, and PCA is essentially the SVD applied to the mean-centered data matrix. You may want to reduce the dimensionality of your data because:
a) You want to visualize your data in 2d or 3d
b) The algorithm you are going to use works better in the new dimensional space
c) Performance reasons, your algorithm is faster if you reduce dimensions.
In many machine learning problems, using the SVD before an ML algorithm helps, so it is always worth a try.
Multi-Dimensional Scaling
Classical multi-dimensional scaling obtains a low-dimensional embedding of the data from the eigendecomposition (equivalently, the SVD) of the double-centered matrix of squared pairwise distances.
Pseudo-Inverse
The Moore-Penrose pseudo-inverse of a matrix A can be computed directly from its SVD as A+ = V * S+ * U^T, where S+ inverts the non-zero singular values.
Image Compression
The SVD can be used to compress images, but there are some better algorithms of
course.
For a 127x350 grayscale image, we decompose using the SVD and keep the top 50 singular values, so we get a 127x50 "U" matrix, the 50 singular values, and a 50x350 "V" matrix. Computing U*S*V then gives a close approximation of the original image, and using 20 instead of 50 singular values gives a rougher but still recognizable approximation.
While you wouldn't use the SVD to compress images because there are more efficient
algorithms (jpg) the example shows how good the SVD is to find a low-rank
approximation to a given matrix.
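A small sketch of this kind of low-rank approximation; the "image" here is a random stand-in with the 127 x 350 shape mentioned above, and k = 50 is the illustrative number of singular values kept:

```python
# Hedged sketch: rank-k approximation of an image matrix with the SVD.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((127, 350))                      # stand-in for a grayscale image

U, s, Vt = np.linalg.svd(image, full_matrices=False)
k = 50
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # 127x50 * 50x50 * 50x350

original_numbers = image.size                       # 127 * 350 stored values
compressed_numbers = U[:, :k].size + k + Vt[:k, :].size
print(original_numbers, compressed_numbers)         # storage before vs. after truncation
```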
EigenFaces
We can represent each image as a vector and the dataset as an m x d matrix, where m is the number of images we have and d depends on the resolution of the images. If we have 400 images of 64x64 pixels each, then our matrix is 400x4096.
We can reduce this to rank 16 using the SVD. The "V" matrix in U*S*V is then 16x4096, and we can interpret this matrix as a collection of 16 faces; these are our "eigenfaces".
Now we can represent each image in our dataset as a vector of 16 components, given by projecting it onto each of the 16 eigenfaces. Then, instead of 4096 dimensions to represent our faces, we have only 16. This can be used for quick face recognition: given a face, we compute its 16 components and compare this 16-element vector to the vectors in our database to find the closest match. This has many applications.
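A hedged sketch of the eigenfaces idea on random stand-in data (400 images of 64x64 pixels); it uses the common variant in which each face is represented by its 16 projection coefficients onto the eigenfaces:

```python
# Hedged eigenfaces sketch on random stand-in data.
import numpy as np

rng = np.random.default_rng(0)
faces = rng.random((400, 4096))                      # 400 flattened 64x64 images
centered = faces - faces.mean(axis=0)

U, s, Vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = Vt[:16]                                 # 16 x 4096, one "face" per row

codes = centered @ eigenfaces.T                      # 400 x 16 compact representation

# Recognition sketch: encode a query face the same way and find the nearest stored code.
query = codes[0]
nearest = int(np.argmin(np.linalg.norm(codes - query, axis=1)))
```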
Latent Semantic Analysis
When we process a corpus of text, we can represent it as a term x document matrix. This matrix can be decomposed using the SVD, and the reduced-rank approximation captures (in some way) the semantics of our text. This usually has several advantages, like filtering noise, handling the problem of synonyms, etc., and it has many applications in Information Retrieval.
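A tiny illustrative sketch of this idea: decompose a small, made-up term x document count matrix and keep a rank-2 approximation as the latent "semantic" space:

```python
# Hedged latent semantic analysis sketch on a made-up term x document matrix.
import numpy as np

term_doc = np.array([[2, 0, 1, 0],        # rows = terms, columns = documents
                     [1, 3, 0, 0],
                     [0, 1, 0, 2],
                     [0, 0, 2, 1]], dtype=float)

U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
k = 2
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]  # each column: a document in the 2-D latent space
print(doc_vectors)
```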
Matrix Factorization
Matrix factorization is one of the most sought-after machine learning recommendation models. It acts as a catalyst, enabling the system to gauge the customer's exact purpose of the purchase, scan numerous pages, shortlist and rank the right product or service, and recommend multiple options available. Once the output matches the requirement, the lead is converted into a transaction.
This mathematical model helps the system split an entity into multiple smaller entries, through an ordered rectangular array of numbers or functions, to discover the features or information underlying the interactions between users and items.
Once an individual raises a query on a search engine, the machine deploys matrix factorization to generate an output in the form of recommendations. The system uses two approaches to do this: content-based filtering and collaborative filtering.
Content-Based Filtering
This approach recommends items based on user preferences. It matches the requirement, considering the past actions of the user, patterns detected, or any explicit feedback provided by the user.
Example: If you prefer the chocolate flavor and purchase a chocolate ice cream, the next time you raise a query, the system will scan for options related to chocolate and then recommend other chocolate-flavored products.
Let us take an example. To purchase a car, in addition to the brand name, people check for the features available in the car, the most common ones being safety, mileage, or aesthetic value. A few buyers consider the automatic gearbox, while others opt for a combination of two or more features. To understand this concept, let us consider a two-dimensional example with the features safety and mileage.
1. In the above graph, on the left-hand side, we have cited individual preferences,
wherein 4 individuals have been asked to provide a rating on safety and mileage. If
the individuals like a feature, then we assign the value 1, and if they do not like that
particular feature, we assign 0. Now we can see that Persons A and C prefer safety,
Person B chooses mileage and Person D opts for both Safety and Mileage.
2. Cars are rated based on how strongly they offer each feature (item); a rating of 4 indicates that a feature is strongly present, while a rating of 1 indicates that it is barely present.
3. The blue-colored ? mark is a sparse value: either the person does not know about the car, or the car is not part of the consideration list for buying, or the person has forgotten to rate it.
4. Let's understand how the matrix at the center has been arrived at. This matrix represents the overall rating of all 4 cars given by the individuals, and it is the product of the individual-preference matrix and the car-feature matrix. Person A has given an overall rating of 4 for Car A, 1 to Cars B and D, and 2 to Car C. These values are arrived at by multiplying each preference by the corresponding car feature and adding the results; for Person A (safety preference 1, mileage preference 0):
Car A: (1 x 4 of safety feature) + (0 x 1 of mileage feature) = 4;
Car B: (1 x 1 of safety feature) + (0 x 4 of mileage feature) = 1;
Car C: (1 x 2 of safety feature) + (0 x 2 of mileage feature) = 2;
Car D: (1 x 1 of safety feature) + (0 x 2 of mileage feature) = 1
Now, on the basis of the above calculations, we can predict the overall rating that each person would give to each car, including the cars they have not rated.
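The calculation above is just a matrix multiplication: the individual-preference matrix times the car-feature matrix gives the overall-rating matrix. A minimal sketch, assuming the feature values reconstructed above (the safety row is inferred from Person A's ratings; the mileage row follows the numbers quoted in the text):

```python
# Hedged sketch: overall ratings as (preferences) x (car features).
import numpy as np

# rows = Persons A-D, columns = (safety, mileage) preferences
preferences = np.array([[1, 0],
                        [0, 1],
                        [1, 0],
                        [1, 1]])

# rows = (safety, mileage), columns = Cars A-D feature values (safety row is inferred)
car_features = np.array([[4, 1, 2, 1],
                         [1, 4, 2, 2]])

ratings = preferences @ car_features   # Person A's row comes out as [4, 1, 2, 1]
print(ratings)
```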
Advantages of content-based filtering
a) The model does not require any data about other users, since the recommendations are specific to one user. This makes it easier to scale to a large number of users.
b) The model can capture the specific interests of a user and can recommend niche items that very few other users are interested in.
Disadvantages of content-based filtering
a) Since the feature representation of the items is hand-engineered to some extent, this technique requires a lot of domain knowledge. Therefore, the model can only be as good as the hand-engineered features.
b) The model can only make recommendations based on the existing interests of the user. In other words, the model has limited ability to expand on the users' existing interests.
Collaborative Filtering
This approach uses similarities between users and items simultaneously to provide recommendations. The idea is to recommend an item, or make a prediction, based on other like-minded individuals. It could comprise a set of users, items, and opinions about an item.
Example: Suppose Persons A and B both like the chocolate flavor and both of them have tried the ice cream and the cake. Then, if Person A buys chocolate biscuits, the system will recommend chocolate biscuits to Person B as well.
In collaborative filtering, we do not have the rating of individual preferences and car
preferences. We only have the overall rating, given by the individuals for each car. As usual,
the data is sparse, implying that the person either does not know about the car or is not under
the consideration list for buying the car, or has forgotten to give the rating.
The task at hand is to predict the rating that Person C might assign to Car C (the ? marked in yellow) on the basis of the similarities in the ratings given by the other individuals. There are three steps involved, explained below.
Step-1
Normalization: Usually, while assigning a rating, individuals tend to give either a high rating or a low rating across all parameters. Normalization helps in balancing and evening out such measures. This is done by taking the average of the available ratings and subtracting it from each individual rating.
In the case of Person A this is 4-2, 1-2, 2-2, 1-2; in the case of Person B it is 1-2.3, 4-2.3, 2-2.3.
If you add all the numbers in each row, they add up to (approximately) zero: we have centered each individual's ratings at zero. Starting from zero, if the normalized rating for a car is positive, it means the person likes the car, and if it is negative, it means the person does not like the car.
Step-2
Similarity: We measure how alike two individuals' (normalized) rating vectors are using cosine similarity, which is defined for two vectors a and b as the ratio between their dot product and the product of their magnitudes, cos(a, b) = (a · b) / (|a| |b|).
Step-3
Prediction: The missing rating is predicted from the individuals whose ratings are most similar to Person C's; in this example, these are the ratings assigned by Persons A and D. We use these similarities to arrive at the prediction.
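A purely illustrative sketch of the three steps; only Person A's ratings (4, 1, 2, 1) and Person B's ratings (1, 4, 2) are quoted from the text, the rest of the matrix is made up, and np.nan marks a missing rating:

```python
# Hedged sketch: predict Person C's rating of Car C with user-based collaborative filtering.
import numpy as np

#                    Car A  Car B   Car C   Car D
ratings = np.array([[4.0,   1.0,    2.0,    1.0],     # Person A (from the text)
                    [1.0,   4.0,    np.nan, 2.0],     # Person B (1, 4, 2 from the text)
                    [4.0,   np.nan, np.nan, 1.0],     # Person C - Car C is to be predicted
                    [4.0,   2.0,    3.0,    1.0]])    # Person D (illustrative)

# Step 1: normalization - center each person's ratings on their own average.
centered = np.nan_to_num(ratings - np.nanmean(ratings, axis=1, keepdims=True))

# Step 2: cosine similarity between Person C (row 2) and every other person.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = {i: cosine(centered[2], centered[i]) for i in (0, 1, 3)}

# Step 3: prediction - similarity-weighted average of the most similar persons'
# centered ratings of Car C (here Persons A and D), shifted back by Person C's mean.
top = [0, 3]
weighted = sum(similarities[i] * centered[i, 2] for i in top) / sum(abs(similarities[i]) for i in top)
prediction = np.nanmean(ratings[2]) + weighted
print(round(float(prediction), 2))
```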
Advantages of collaborative filtering
No domain knowledge is necessary, because the factors (embeddings) are learned automatically.
The model can help users discover new interests. In isolation, the ML system may not know the user is interested in a given item, but the model might still recommend it because similar users are interested in it.
To some extent, the system needs only the feedback matrix to train a matrix factorization model. In particular, the system does not require contextual features.
Disadvantages of collaborative filtering
The model cannot handle fresh items: for instance, if a new car is added to the matrix, it may have limited user interaction and thus will rarely occur as a recommendation.
The output of the recommendation could be biased based on popularity; that is, if most user interaction is towards a particular car, then the recommendations will focus on that popular car, and less popular cars may be overlooked.
Measuring Similarity
A simple example of a movie recommendation system will help explain this. Consider a small utility matrix of ratings given by a few users to a few movies.
In this scenario, User 1 and User 2 give nearly similar ratings to the movies they have both seen, so we can conclude that Movie 3 is also going to be liked reasonably well by User 1 and that Movie 4 will be a good recommendation for User 2. We can also see that some users have opposite tastes, such as User 1 and User 3. User 3 and User 4 have a common interest in the movies, and on that basis we can say that Movie 4 is also going to be disliked by User 4. This is collaborative filtering: we recommend to users the items that are liked by users with similar interests.
Cosine Similarity
We can also use the cosine similarity between users to find users with similar interests: a larger cosine implies a smaller angle between the two users' rating vectors, and hence more similar interests. We can apply the cosine distance between two users in the utility matrix, giving the value zero to all the unfilled entries to make the calculation easy. If the cosine is smaller, there is a larger distance (angle) between the users; if the cosine is larger, the angle between the users is small, and we can recommend them similar things.
Rounding the Data
In collaborative filtering, we can round off the data to compare it more easily; for example, we can assign ratings below 3 the value 0 and ratings of 3 and above the value 1. This helps us compare the data more easily.
If we take the previous example and apply this rounding-off process, the data becomes much more readable, and we can see that User 1 and User 2 are more similar, while User 3 and User 4 are more alike.
Normalizing Rating
In the process of normalizing, we take the average rating of a user and subtract it from each of that user's ratings, so we get either positive or negative values as ratings, which can then be used to classify users into similar groups. By normalizing the data, we can make clusters of the users that give similar ratings to similar items, and then use these clusters to recommend items to the users.
What are some of the Challenges to be Faced while using Collaborative Filtering?
As we know, every algorithm has its pros and cons, and this is also the case with collaborative filtering algorithms. Collaborative filtering algorithms are very dynamic and can change and adapt to changes in user preferences over time. But one of the main issues faced by recommender systems is scalability: as the user base increases, the computation required and the data storage space increase manifold, which can lead to slow and inaccurate results.
MapReduce
MapReduce is a programming model for writing applications that can process Big Data in
parallel on multiple nodes. MapReduce provides analytical capabilities for analyzing huge
volumes of complex data.
Big Data is a collection of large datasets that cannot be processed using traditional computing
techniques. For example, the volume of data that Facebook or YouTube needs to collect and manage on a daily basis can fall under the category of Big Data. However, Big Data is not only about scale and volume; it also involves one or more of the following aspects: Velocity, Variety, Volume, and Complexity.
Why MapReduce?
Traditional Enterprise Systems normally have a centralized server to store and process data.
The following illustration depicts a schematic view of a traditional enterprise system.
The traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.
Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce
divides a task into small parts and assigns them to many computers. Later, the results are
collected at one place and integrated to form the result dataset.
How MapReduce Works?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.
Let us now take a close look at each of the phases and try to understand their significance.
Input Phase − Here we have a Record Reader that translates each record in an input
file and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the
map phase into identifiable sets. It takes the intermediate keys from the mapper as input
and applies a user-defined code to aggregate the values in a small scope of one mapper.
It is not a part of the main MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads
the grouped key-value pairs onto the local machine, where the Reducer is running. The
individual key-value pairs are sorted by key into a larger data list. The data list groups
the equivalent keys together so that their values can be iterated easily in the Reducer
task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered, and
combined in a number of ways, and it requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the
final key-value pairs from the Reducer function and writes them onto a file using a
record writer.
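Before moving on to the Twitter example, here is a hedged, single-machine Python sketch of these phases for a simple word count; it only mimics the Map, Shuffle and Sort, and Reduce steps described above and is not actual Hadoop code:

```python
# Illustrative word count that mirrors the MapReduce phases on one machine.
from collections import defaultdict

records = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: each record is turned into intermediate (key, value) pairs - here (word, 1).
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle and Sort: group the intermediate pairs by key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce: combine the values of each key into a smaller set of pairs - here (word, count).
reduced = {key: sum(values) for key, values in groups.items()}
print(reduced)   # {'brown': 1, 'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}
```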
Let us now try to understand the two tasks, Map and Reduce, with the help of an example.
MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 6,000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.
As shown in the illustration, the MapReduce algorithm performs the following actions −
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value
pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps
as key-value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values into small
manageable units.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is
used as input by Reducer class, which in turn searches matching pairs and reduces them.
MapReduce implements various mathematical algorithms to divide a task into small parts and
assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending
the Map & Reduce tasks to appropriate servers in a cluster.
Sorting
Searching
Indexing
TF-IDF
Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements sorting algorithm to automatically sort the output key-value pairs from the
mapper by their keys.
Searching
Searching plays an important role in MapReduce algorithm. It helps in the combiner phase
(optional) and in the Reducer phase.
Indexing
Normally indexing is used to point to a particular data and its address. It performs batch
indexing on the input files for a particular Mapper.
TF-IDF
TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse
Document Frequency. It is one of the common web analysis algorithms. Here, the term
'frequency' refers to the number of times a term appears in a document.
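One common form of the score is TF-IDF(t, d) = TF(t, d) x log(N / DF(t)), where N is the number of documents and DF(t) is the number of documents containing the term t; exact smoothing conventions vary between implementations. A small illustrative sketch:

```python
# Hedged TF-IDF sketch (one common variant; library implementations differ in smoothing).
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))   # documents containing each term

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)   # term frequency within the document
    idf = math.log(N / df[term])      # inverse document frequency
    return tf * idf

print(tf_idf("cat", docs[0]))         # "cat" appears in 2 of the 3 documents
```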
Hadoop Streaming
In the world of big data, processing vast amounts of data efficiently is a crucial task.
Hadoop, an open-source framework, has been a cornerstone in managing and processing
large data sets across distributed computing environments. Among its various components,
Hadoop Streaming stands out as a versatile tool, enabling users to process data using non-
Java programming languages. This article delves into the purpose of Hadoop Streaming, its
usage scenarios, implementation details, and provides a comprehensive understanding of
this powerful tool.
Hadoop Streaming is a utility that allows users to create and run MapReduce jobs using
any executable or script as the mapper and/or reducer, instead of Java. It enables the use
of various programming languages like Python, Ruby, and Perl for processing large
datasets. This flexibility makes it easier for non-Java developers to leverage Hadoop’s
distributed computing power for tasks such as log analysis, text processing, and data
transformation.
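As a hedged illustration (not taken from the article), a classic Hadoop Streaming job is a Python word count: the mapper reads raw lines from standard input and emits tab-separated key-value pairs, and the reducer receives those pairs sorted by key, so equal words arrive adjacent and can be summed in one pass.

```python
# mapper.py - word-count mapper for Hadoop Streaming (illustrative sketch).
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py - word-count reducer for Hadoop Streaming (illustrative sketch).
# Input lines arrive sorted by key, so counts for the same word can be accumulated in order.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A typical invocation passes these scripts to the streaming jar, for example: hadoop jar hadoop-streaming.jar -input /input -output /output -mapper mapper.py -reducer reducer.py -files mapper.py,reducer.py (the exact path to the streaming jar depends on the Hadoop distribution).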
Conclusion
Hadoop Streaming is an invaluable tool for developers who need to leverage the power of
Hadoop without diving deep into Java. Its ability to integrate various programming
languages and tools makes it a flexible and powerful option for processing large datasets.
Whether you’re analyzing logs, processing text data, or running machine learning
algorithms, Hadoop Streaming simplifies the process and opens up new possibilities for big
data processing.
Implementing PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM
Although a support vector machine model (a binary classifier) is more commonly built by solving a quadratic programming problem in the dual space, it can also be built quickly by solving the primal optimization problem. In this article, a support vector machine implementation is described that solves the primal optimization problem with a sub-gradient solver using stochastic gradient descent. First the vanilla version and then the kernelized version of the Pegasos algorithm is described, along with some applications on some datasets.
Next, the hinge-loss function of the SVM is replaced by the log-loss function of Logistic Regression, and the primal SVM problem is converted to regularized logistic regression.
Finally, document sentiment classification will be done by first training a Perceptron, an SVM (with Pegasos), and a Logistic Regression classifier on a corpus and then testing them on an unseen part of the corpus.
The time to train the classifiers along with accuracy obtained on a held-out
dataset will be computed.
Some Notes
The optional projection step has been left out (the line in square brackets in
the paper).
As usual, the outputs (in the list Y) are coded as +1 for positive examples
and -1 for negative examples.
The number η is the step length in gradient descent.
The gradient descent algorithm may have problems finding the minimum if
the step length η is not set properly. To avoid this difficulty, Pegasos uses a
variable step length: η = 1 / (λ · t).
Since we compute the step length by dividing by t, it will gradually become
smaller and smaller. The purpose of this is to avoid the “bounce
around” problem as it gets close to the optimum.
Although the bias variable b in the objective function is discarded in this implementation, the paper proposes several ways to learn a (non-regularized) bias term too; the fastest implementation is probably a binary search on a real interval after the PEGASOS algorithm returns an optimum w.
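A minimal Python sketch of the vanilla (projection-free) Pegasos update described in these notes; the data, the regularization constant λ, and the iteration count are illustrative, labels are +1/-1, and the step length is the variable η = 1 / (λ · t):

```python
# Hedged sketch of the vanilla Pegasos solver (no projection step, no bias term).
import numpy as np

def pegasos(X, y, lam=0.01, iterations=100_000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                              # weight vector; bias is discarded as noted above
    for t in range(1, iterations + 1):
        i = rng.integers(n)                      # pick one training example at random
        eta = 1.0 / (lam * t)                    # variable step length eta = 1 / (lambda * t)
        if y[i] * (w @ X[i]) < 1:                # hinge loss active: margin violated
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                                    # hinge loss inactive: only shrink w
            w = (1 - eta * lam) * w
    return w

# Usage sketch: predictions for a new example x are sign(w @ x).
```

Replacing the hinge-loss condition with the gradient of the log-loss turns the same loop into the regularized logistic regression variant mentioned above.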