Music Genre Classification Project Report
1. Introduction
Ever wanted a playlist that includes songs you already like while also introducing you to
similar songs you may never have heard before? This is a million-dollar question faced by music
streaming services such as Apple Music, Pandora, Spotify, and YouTube, to name a few.
As engineers, we view a song as roughly 1.3 million data points, depending on its length. To
classify songs as similar we need to compare these data points, and doing so on the raw data
becomes increasingly difficult, especially when new songs are released every day. Moreover,
similarity between songs is difficult to define: the parameters used to describe how similar two
songs are remain inherently subjective and cannot easily be translated into an algorithm.
Our aim in this project is to implement an algorithm that reduces the number of data points
we need to work with and to build a classifier using several ingredients: machine learning
algorithms, the Music Analysis (MA) toolbox developed by Elias Pampalk [4] which was given to
us, concepts such as the Johnson-Lindenstrauss transform which were taught in class, and the
provided set of 729 music samples (320 classical, 114 electronic, 26 jazz/blues, 45 metal/punk,
122 world, and the remaining 102 rock/pop). The classifier should be able to categorize a new
music track into the right genre based on how similar its extracted features are to those of the
music samples used to construct the classifier. To achieve the reduction in dimension and the
classification of songs by genre we developed two approaches, Process I and Process II, described
below.
2. Theory
This section gives an overview of the theory used to reduce the dimensionality of the songs as
well as of the methods used to classify them.
In our project we have used the function ma_mfcc.m from the Music Analysis (MA) toolbox, which
computes Mel Frequency Cepstral Coefficients (MFCCs) from the raw data, with a Discrete Cosine
Transform (DCT) applied as the final step. MFCCs are an effective tool for extracting information
from an audio signal. To compute MFCCs we use some very simple filters and transformations that
roughly model characteristics of the human auditory system. The important
characteristics are:
1. The non-linear frequency resolution using the Mel frequency scale.
2. The non-linear perception of loudness using the decibel scale.
3. Spectral masking effects using DCT.
2.1.3 Decibel
The Mel scale matrix M is converted into the decibel scale matrix M_dB as:
M_dB = 10*log10(M)
The DCT, applied to the Mel-scaled decibel spectrum, roughly models the spectral masking in the
human auditory system. The resulting MFCCs are a compressed representation of the original data:
from 512 samples per 23 ms down to 20 values per 12 ms.
Fig.: MFCCs of a song after applying the DCT.
2.1.5 Parameters
We used the following parameters while running ma_mfcc.m:
1. If the piece is longer than 2 minutes, then only the 2 minutes from the center of the piece
are used for further processing (see the sketch after this list).
2. The number of Mel filters is 36, the FFT size is 256, the hop size is 128, and the number of
DCT coefficients is 19.
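For illustration, the central-2-minute trimming in parameter 1 can be sketched in MATLAB as
follows; wav and fs are assumed placeholder names for the mono audio signal and its sampling
rate, and the remaining parameters are then passed on to ma_mfcc.m.

    % Sketch: keep only the central 2 minutes of long pieces before feature extraction.
    % Assumes wav is a mono signal and fs its sampling rate.
    len = 2*60*fs;                                % two minutes in samples
    if numel(wav) > len
        first = floor((numel(wav) - len)/2) + 1;  % start of the central window
        wav   = wav(first : first + len - 1);     % trimmed signal passed to ma_mfcc.m
    end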
2.2 Process – I
2.2.1 Dimensionality Reduction
Dimensionality reduction is an algorithmic technique for mapping high-dimensional data into a
space of much lower dimension. This is useful because many data-centric applications suffer an
exponential blowup as the underlying dimension grows, a phenomenon referred to as the curse of
dimensionality. The curse of dimensionality can be avoided if the input data is mapped into a
space of logarithmic dimension. For example, an algorithm running in time proportional to 2^d in
dimension d will run in linear time if the dimension can be brought down to log(d). Among the
most important beneficiaries of this technique are algorithms like k-nearest neighbors. The idea
is to preserve the properties of the data set, most importantly the pairwise distances between
the data points. This is not possible exactly, so we relax the requirement and tolerate errors,
as long as the error can be made arbitrarily small.
There are two common approaches to dimensionality reduction. One is data-aware techniques, which
take advantage of prior information about the input. The other is data-oblivious techniques,
which assume no prior information about the data; examples of such techniques are random linear
projections and locality-sensitive hashing.
The size of the distortion as well as the failure probability are user-specified parameters that
determine the target dimension. A careful quantitative calculation reveals that if we only care
about the distances between pairs of vectors and the angles between them, then a random linear
mapping into a space of dimension logarithmic in the size of the data is sufficient. This is the
central result of the seminal work of Johnson and Lindenstrauss (JL). The JL lemma says that
projecting the n points onto a random low-dimensional subspace preserves pairwise distances up
to a distortion of 1±ε. Newer algorithms realizing the JL lemma have since appeared. They are
based on a central concept from harmonic analysis known as the Heisenberg uncertainty principle:
a signal and its spectrum cannot both be concentrated. The Fast Johnson-Lindenstrauss Transform
(FJLT) [1] follows the same direction.
The FJLT is applied as the product of three matrices, Φ = P·H·D, defined as follows.

P is a k×d matrix whose elements are independently distributed: with probability 1−q we set
P_ij = 0, and otherwise we draw P_ij from a normal distribution with expectation 0 and variance
1/q. The sparsity constant q is given as

    q = min{ Θ( ε^(p−2) · (log n)^p / d ), 1 }.

H is a d×d normalized Walsh-Hadamard matrix with entries

    H_ij = d^(−1/2) · (−1)^⟨i−1, j−1⟩,

where ⟨i, j⟩ is the dot product of the m-bit vectors i and j expressed in binary.

D is a d×d diagonal matrix, where each D_ii is drawn independently from {−1, 1} with probability
½.
Using the above construction, we implemented the FJLT algorithm in MATLAB. We made use of
MATLAB's ‘fwht’ function to compute the Walsh-Hadamard transform.
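A minimal sketch of this construction is given below, for illustration only (it is not our exact
implementation): it assumes d is a power of two so that MATLAB's hadamard(d) is defined, builds
the normalized Hadamard matrix explicitly instead of calling fwht, targets the l2 case p = 2,
and takes the constant hidden in Θ(·) to be 1.

    function Y = fjlt_sketch(X, k, epsilon)
    % Sketch of the FJLT projection Phi = P*H*D applied to a d-by-n data matrix X.
    % Assumes d is a power of two; epsilon only enters the sparsity constant for p ~= 2.
        [d, n] = size(X);
        p = 2;                                          % we embed into l2
        q = min(epsilon^(p-2) * log(n)^p / d, 1);       % sparsity constant, Theta-constant taken as 1
        D = diag(2*(rand(d,1) > 0.5) - 1);              % random +/-1 diagonal matrix
        H = hadamard(d) / sqrt(d);                      % normalized Walsh-Hadamard matrix
        P = (randn(k,d) / sqrt(q)) .* (rand(k,d) < q);  % sparse Gaussian projection, variance 1/q
        Y = P * (H * (D * X));                          % k-by-n reduced representation
    end

In practice, H*(D*X) is applied with the fast Walsh-Hadamard transform (fwht) in O(d log d) time
per vector rather than by forming H explicitly.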
Random projection is computationally very simple: forming the random matrix and projecting the
d×n data matrix into k dimensions is of order O(dkn).
The idea of random projections is based on the Johnson-Lindenstrauss lemma. The value of ε is
the key parameter of the JL lemma, and we experimented with various ε values. Below are the
results for the various ε values:
Fig. 5: Dimensionality Reduction vs. Epsilon
With an increase in ε we obtain a greatly reduced dimension to work with, but the training
accuracy of the classifier decreases. With a decrease in ε we obtain better accuracy but have to
work with a higher dimension. We therefore tried to find an optimal ε value that gives decent
accuracy while keeping the dimension low. The optimal ε value obtained is as follows:
A linear classifier has the form f(x) = wᵀx + b. In 2D the discriminant is a line: w, also known
as the weight vector, is the normal to the line, and b is the bias. In 3D the discriminant is a
plane, and in n dimensions it is a hyperplane. A k-NN classifier must carry the training data
with it, whereas for a linear SVM classifier the training data is used to learn w and is then
discarded; only w (and b) is needed to classify new data. Learning the SVM can be formulated as
a constrained optimization problem.
If we select a hyperplane that is close to the data points of one class, it might not generalize
well. We therefore define the concept of a margin. Given a particular hyperplane, we can compute
the distance between the hyperplane and the closest data point; doubling this value gives what is
called the margin. The margin is essentially a no man's land: no data point ever lies inside it.
The optimal separating hyperplane is the one that maximizes the margin on the training data. SVMs
are inherently two-class classifiers. The standard technique for multiclass classification is to
build one-versus-rest classifiers and to choose the class whose classifier scores the test data
with the greatest margin.
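We trained the SVMs with an SMO-based open-source package (see Section 3); purely as an
illustration of the one-versus-rest idea, a MATLAB sketch using fitcsvm from the Statistics and
Machine Learning Toolbox could look as follows, with Xtrain, ytrain, and Xtest as placeholder
names for the training features, the training genre labels (a cell array of strings), and the
test features.

    % Illustrative one-vs-rest scheme (not the SMO package we actually used).
    genres = unique(ytrain);
    models = cell(numel(genres), 1);
    for g = 1:numel(genres)
        models{g} = fitcsvm(Xtrain, strcmp(ytrain, genres{g}));  % genre g vs. rest
    end
    scores = zeros(size(Xtest,1), numel(genres));
    for g = 1:numel(genres)
        [~, s] = predict(models{g}, Xtest);
        scores(:, g) = s(:, 2);            % signed score (margin) for the positive class
    end
    [~, best] = max(scores, [], 2);        % pick the class with the greatest margin
    ypred = genres(best);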
In our music genre classification project, we have used John Platt's sequential minimal
optimization (SMO) algorithm for training a support vector classifier. Training the SVM requires
the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this
large QP problem into a series of smallest possible QP problems. These small QP problems are
solved analytically, which avoids using a time-consuming numerical QP optimization as an inner
loop. The amount of memory required by SMO is linear in the training set size.
The QP problem is solved by the SMO algorithm. A point is an optimal point of the QP problem’s
equation if and only if the Karush-Kuhn-Tucker (KKT) conditions are fulfilled. The KKT
conditions can be evaluated one example at a time, which is useful in the construction of the
SMO algorithm. Due to the immense size of the QP problems that arise from SVMs, they cannot
easily be solved via standard QP techniques. The quadratic form in the problem involves a matrix
that has a number of elements equal to the square of the number of training examples. There is
an algorithmic technique, namely chunking, which uses the fact that the value of the quadratic
form is the same if we remove the rows and columns that correspond to zero Lagrange
multipliers. Therefore, the large QP problem can be broken down into a series of smaller QP
problems, whose ultimate goal is to identify all the non-zero Lagrange multipliers and discard
all of the zero Lagrange multipliers. Chunking greatly reduces the size of the matrix from the
number of training examples squared to approximately the number of non-zero Lagrange multipliers
squared. However, chunking still may not handle large-scale training problems, as even the
reduced matrix may not fit into memory. Osuna showed that the large QP problem can be broken down
into a series of smaller QP sub-problems. As long as at least one example that violates the KKT
conditions is added to the examples of the previous sub-problem, each step reduces the overall
objective function and maintains a feasible point that obeys all of the constraints. Therefore,
a sequence of QP sub-problems that always adds at least one violator will asymptotically
converge. Osuna et al. suggest keeping a constant-size matrix for every QP
sub-problem, which implies adding and deleting the same number of examples at every step.
Sequential minimal optimization is a simple algorithm that quickly solves the SVM QP problem
without any extra matrix storage and without invoking an iterative numerical routine for each
sub-problem. SMO decomposes the overall QP problem into QP sub-problems similar to Osuna’s
method.
Unlike the previous methods, SMO solves the smallest possible optimization problem at every step;
this smallest problem involves two Lagrange multipliers, because the Lagrange multipliers must
obey a linear equality constraint. At each step, SMO chooses two Lagrange multipliers to jointly
optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new
optimal values.
The advantage of SMO lies in the fact that solving for two Lagrange multipliers can be done
analytically, so an inner loop of numerical QP optimization is avoided entirely. There are thus
three components to SMO: an analytic method to solve for the two Lagrange multipliers, a
heuristic for choosing which multipliers to optimize, and a method for computing the threshold
of the SVM. After each step, this threshold is re-computed so that the KKT conditions are
fulfilled for both optimized examples.
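As an illustration, the analytic step for one chosen pair of multipliers (i, j) can be sketched
as follows; this is a simplified sketch that assumes a positive curvature eta, with K the kernel
matrix, y the labels in {-1,+1}, C the box constraint, and E the current prediction errors
E(t) = f(x_t) - y(t).

    % Simplified sketch of SMO's analytic update for the pair (i, j).
    if y(i) ~= y(j)
        L = max(0, alpha(j) - alpha(i));      H = min(C, C + alpha(j) - alpha(i));
    else
        L = max(0, alpha(i) + alpha(j) - C);  H = min(C, alpha(i) + alpha(j));
    end
    eta = K(i,i) + K(j,j) - 2*K(i,j);            % curvature along the constraint line
    aj  = alpha(j) + y(j)*(E(i) - E(j))/eta;     % unconstrained optimum for alpha_j
    aj  = min(max(aj, L), H);                    % clip to the feasible segment [L, H]
    ai  = alpha(i) + y(i)*y(j)*(alpha(j) - aj);  % keep the equality constraint satisfied
    alpha(i) = ai;  alpha(j) = aj;               % the threshold b is then re-computed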
The accuracy of a random forest depends on the strength of the individual tree classifiers and
on a measure of dependence between them. Given an ensemble of classifiers, with the training set
drawn at random from the distribution of the underlying random vector, a margin function can be
defined. The margin measures the extent to which the average number of votes at the random
vector for the right class exceeds the average vote for any other class; the larger the margin,
the more confidence we have in the classification. The result that random forests do not over-fit
as more trees are added, but rather that the generalization error converges to a limiting value,
follows from the strong law of large numbers. An upper bound on the generalization error can
thus be determined.
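For our experiments we used open-source software (see Section 3); for illustration only, an
equivalent MATLAB sketch with TreeBagger from the Statistics and Machine Learning Toolbox, using
placeholder variables Xtrain, ytrain, and Xtest, would be:

    % Illustrative random forest (not the open-source package we actually used).
    forest = TreeBagger(200, Xtrain, ytrain, 'OOBPrediction', 'on');
    oobErr = oobError(forest);       % out-of-bag estimate of the generalization error
    ypred  = predict(forest, Xtest); % majority vote over the 200 trees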
There are multiple ways of building random forests, but according to Breiman et al. several of
those reported in the literature do not perform as well as Adaboost or other algorithms that
work by adaptive reweighting of the training set. The reason why Adaboost works so well is that
at each step it tries to decouple the next classifier from the current one, i.e., the Adaboost
algorithm aims at keeping the covariance between classifiers small.
We start with a collection of k different sets of non-negative weights on the training set that
sum to one; corresponding to these weights are probabilities that also sum to one. Adaboost is a
deterministic algorithm that selects the weights on the training set for input to the next
classifier based on the misclassifications of the previous classifiers.
2.3 Process – II
2.3.1 Fluctuation Patterns
With Mel Frequency Cepstral Coefficients (MFCCs), the dimensionality of each song was reduced
from about a million data points to several hundred thousand, but this was still not enough to
proceed with computing feature vectors for classification. To further reduce the dimension of
each song, we used fluctuation patterns (FP) in combination with the MFCCs. Fluctuation patterns
describe how strongly the loudness in the individual frequency bands is modulated over time, and
they can also be used as a similarity measure.
The loudness modulation has different effects on our sensation depending on the frequency.
The sensation of fluctuation strength is most intense around 4Hz and gradually decreases up to
a modulation frequency of 15Hz. At 15Hz, the sensation of roughness starts to increase, reaches
its maximum at about 70Hz, and starts to decrease at about 150Hz. Above 150Hz the sensation
of hearing three separately audible tones increases.
The resulting FP is a matrix with rows corresponding to frequency bands and columns
corresponding to modulation frequencies (in the range of 0-10Hz). The elements of this matrix
describe the fluctuation strength. To summarize all FPs representing the different segments of a
piece, the median of all FPs is computed; finally, one FP matrix represents an entire piece. The
distance between two pieces is computed by interpreting each FP matrix as a high-dimensional
vector and computing the Euclidean distance.
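A minimal sketch of this summarization and distance computation (FPs1 and FPs2 are assumed cell
arrays holding the per-segment FP matrices of two pieces):

    % Summarize per-segment fluctuation patterns and compare two pieces.
    fp1  = median(cat(3, FPs1{:}), 3);   % element-wise median over all segments of piece 1
    fp2  = median(cat(3, FPs2{:}), 3);   % element-wise median over all segments of piece 2
    d_fp = norm(fp1(:) - fp2(:));        % Euclidean distance between the two pieces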
A general definition of the distance between two d-dimensional feature vectors q and x is

    d_A(q, x) = sqrt( (q − x)' · A · (q − x) ),

where we call the matrix A the distance matrix; A can be any positive definite d×d matrix. The
effect of A is to scale the distance along each feature axis.
Given a set of d-dimensional feature vectors X = {x_i} and a query vector q, the objective is to
find the subset C ⊂ X containing the M greatest values of the conditional probability density
P(x_i|q), where P(x_i|q) is a retrieval probability function. If we estimate P(x_i|q) as a
Gaussian distribution whose mean is the query vector q (with A playing the role of the inverse
covariance), then

    P(x_i|q) ∝ exp( −½ · (x_i − q)' · A · (x_i − q) ).

Hence, retrieving the M most probable vectors under P(x_i|q) is equivalent to retrieving the M
nearest neighbours of q under the distance d_A defined above.
For the spectral distance, each piece is summarized by a single Gaussian fitted to its MFCC
frames. The symmetric Kullback-Leibler (KL) divergence between two Gaussians N(m1, S1) and
N(m2, S2) can be computed as

    d_skl = ½ · [ tr(S2⁻¹·S1) + tr(S1⁻¹·S2) + (m1 − m2)'·(S1⁻¹ + S2⁻¹)·(m1 − m2) − 2d ].

We labelled this distance d_g1 (single-Gaussian distance metric). The distance is rescaled to
improve results when combining the spectral distance with other information. In particular, the
rescaled distance is computed from d_g1 using a rescaling factor fact; using fact = 450 gave the
best results. In this computation, eps denotes the smallest difference that MATLAB can recognize
between two numbers.
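As a sketch, the symmetric KL divergence above can be computed in MATLAB as follows (m1, m2 are
the mean vectors and S1, S2 the covariance matrices; depending on the convention used, the value
reported by the toolbox may differ by a constant factor):

    function dist = skl_gauss(m1, S1, m2, S2)
    % Symmetric Kullback-Leibler divergence between N(m1,S1) and N(m2,S2).
        dim  = numel(m1);
        dm   = m1(:) - m2(:);
        dist = 0.5 * ( trace(S2\S1) + trace(S1\S2) ...
                     + dm' * ((S1\dm) + (S2\dm)) - 2*dim );
    end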
k-Nearest Neighbor (k-NN) Algorithm: k-NN is one of those algorithms that are simple, yet very
powerful. There are a few aspects of k-NN that make it a good choice for similarity search.
a) It is non-parametric, i.e., it does not make any assumptions about the underlying data
distribution.
b) It is an instance-based or lazy learning algorithm, i.e., it does not use the training data
points to do any generalization; it makes decisions based on the entire training data set. In
other words, there is no explicit training phase, or it is very minimal.
Before looking at how k-NN works, there are a few assumptions that k-NN makes which need
attention.
a) k-NN assumes that the data is in a feature space. More exactly, the data points are in a
metric space. The data can be scalars or possibly even multidimensional vectors. Since
the points are in feature space, they have a notion of distance. This need not necessarily
be Euclidean distance although it is the one commonly used.
b) Each training example consists of a vector and the class label associated with that vector.
In the simplest case the label is either + or − (for the positive or negative class), but k-NN
can work equally well with an arbitrary number of classes.
c) Finally, the ‘k’ in k-NN decides how many neighbors (where neighbors are defined by the
distance metric) influence the classification. This is usually an odd number. If k = 1, the
algorithm is simply called the nearest-neighbor algorithm.
It works by computing the k nearest neighbors of a query point and taking a majority vote to
decide to which cluster the query point should be assigned. For our project we had six different
genres, and hence we had to build six clusters (each cluster identifying a genre) to assign the
songs to. We used a weighted combination of the four distance metrics to uniquely identify one
genre from the rest, and with this weighted distance metric we built the six clusters. However,
due to a large overlap between the genres, the clusters overlapped too, and this led to a lot of
false positives (a detailed analysis of the results can be found in Section 3). We obtained an
accuracy of 60% with the k-NN approach.
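For illustration, the vote for one query song can be sketched as follows, where D1..D4 are
assumed vectors holding the four distances from the query to all training songs, w the
corresponding weights, labels the training genre labels, and k the number of neighbours:

    % Weighted distance combination and k-NN majority vote (sketch).
    Dcomb = w(1)*D1 + w(2)*D2 + w(3)*D3 + w(4)*D4;   % combined distance to all training songs
    [~, order] = sort(Dcomb, 'ascend');
    neighbours = labels(order(1:k));                 % genres of the k nearest training songs
    [candidates, ~, idx] = unique(neighbours);       % majority vote among the neighbours
    votes = accumarray(idx, 1);
    [~, winner] = max(votes);
    genre = candidates{winner};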
In our original approach, the distance of the test song from each island (d_c, d_e, d_j, d_m,
d_r, d_w) is calculated from the test data, and the song is classified into the genre whose
island it is closest to. However, due to the non-proportional split of songs among the different
genres, this may result in the classifier developing a bias towards a certain genre, or failing
to recognize a particular genre altogether.
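This amounts to a simple minimum-distance decision, sketched below (the six distances are assumed
to have been computed as described above):

    % Original island approach: assign the song to the genre of the closest island.
    dists  = [d_c, d_e, d_j, d_m, d_r, d_w];
    genres = {'Classical','Electronic','Jazz/Blues','Metal/Punk','Rock/Pop','World'};
    [~, nearest] = min(dists);
    genre = genres{nearest};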
Fig. 7: Visualization of Island Approach
3. Experimental Analysis
In this section we look at the results obtained from different levels of cross-validation. In
each case we obtain a confusion matrix, whose rows signify the actual genre while the columns
give the genre into which the songs were classified. In each confusion matrix we therefore look
at the diagonal elements to read off the accuracy of the method. For the SVM and random forests
classifiers we used open-source software, which reported the confusion matrix along with the
accuracy.
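For reference, given a confusion matrix C of raw counts, the quantities reported below can be
computed as in this sketch:

    % Accuracy measures from a confusion matrix C of raw counts
    % (rows = actual genre, columns = predicted genre).
    accuracy = sum(diag(C)) / sum(C(:));   % overall fraction of correctly classified songs
    recall   = diag(C) ./ sum(C, 2);       % per-genre accuracy (diagonal of the row-normalized matrix)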
Fig. 9: Classification of Classical vs. Non-Classical for different values of k
As can be seen from Fig. 9, the best results were for k = 5. We then did a 5-fold cross validation,
calculated the average of all our runs and obtained the following confusion matrix for classical
vs. non-classical.
The overall accuracy obtained for this method was 57.65%. The graph below shows the accuracy for
different values of k.
              Classical   Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
Classical     0.28125     0.00000     0.03125     0.67188     0.01612   0.00000
Electronic    0.00000     0.86957     0.00000     0.04348     0.04348   0.04348
Jazz/Blues    0.60000     0.00000     0.40000     0.00000     0.00000   0.00000
Metal/Punk    0.44444     0.00000     0.00000     0.55556     0.00000   0.00000
Rock/Pop      0.45000     0.00000     0.20000     0.05000     0.10000   0.20000
World         0.25000     0.12500     0.16667     0.25000     0.04167   0.16667
Table 4: Confusion Matrix for Fluctuation Pattern & Original Island Approach (rows: actual genre; columns: predicted genre)
3-Fold Cross Validation: We did a 3-fold cross-validation on the data, calculated the average of
all our runs, and obtained the following confusion matrices.
As can be seen from the table above, our classifier struggles when it comes to rock. To combat
this, we manually label as Rock/Pop every song that is not classified into any of the other
genres. The final confusion matrix is given below.
5-Fold Cross Validation: We did a 5-fold cross validation on the data, calculated the average
of all our runs and obtained the following confusion matrices.
We can see that with 5-fold cross-validation there is an improvement of nearly 0.1 in both
electronic and jazz, while classical and metal decrease by 0.02. The problem with rock remains,
so we employ the same solution as above to classify that genre.
3.6 Fluctuation Patterns with Random Forests
We applied the random forests classifier with the reduced feature set we obtained from
fluctuation patterns. We then did a 5-fold cross validation, and obtained the following confusion
matrix.
4. Results
In this section we applied all the methods mentioned above to the 35 test songs provided to us
and obtained the following results.
23  Classical   Classical   World       Classical   Classical   Classical
24  Classical   Classical   Classical   Classical   Classical   Classical
25  Electronic  Classical   Classical   Classical   Classical   Classical
26  Rock/Pop    Classical   Metal/Punk  Classical   Classical   Classical
27  Electronic  Electronic  Electronic  Electronic  Rock/Pop    Electronic
28  Metal/Punk  Rock/Pop    Classical   World       Rock/Pop    World
29  Rock/Pop    Rock/Pop    Metal/Punk  Classical   Classical   Classical
30  Electronic  Classical   Classical   Classical   Classical   Classical
31  Classical   Classical   Classical   Classical   Classical   World
32  World       Classical   Metal/Punk  World       World       World
33  Electronic  Rock/Pop    Electronic  Rock/Pop    Rock/Pop    Rock/Pop
34  Metal/Punk  Rock/Pop    Electronic  Rock/Pop    Rock/Pop    Rock/Pop
35  Classical   Classical   Classical   Classical   Classical   Classical
Table 11: Results for the Test Data
5. Bibliography
[1] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-
Lindenstrauss transform. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of
Computing, pages 557–563. ACM, 2006.
[2] Luís Barreira, Sofia Cavaco, and Joaquim Ferreira da Silva. Unsupervised Music Genre
Classification with a Model-Based Approach, pages 268–281. Springer Berlin Heidelberg, Berlin,
Heidelberg, 2011.
[3] Beth Logan. Music recommendation from song sets. In ISMIR. Citeseer, 2004.
[4] Elias Pampalk. Computational models of music similarity and their application in music
information retrieval, 2006.
[5] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions
on Speech and Audio Processing, 10(5):293–302, Jul 2002.