Music Genre Classification Project Report
1. Introduction
Ever wanted a playlist that includes songs you already like while also introducing you to
similar songs you may never have heard before? This is a million-dollar question faced by music
streaming services such as Apple Music, Pandora, Spotify, and YouTube, to name a few.
As engineers, we view a song as roughly 1.3 million data points, depending on its length. To
classify songs as similar we need to compare these data points, and doing so on the raw data
becomes increasingly difficult, especially when new songs are released every day. Moreover,
similarity between songs is difficult to define: the parameters used to describe how similar two
songs are remain inherently subjective and cannot easily be translated into an algorithm.
Our aim in this project is to implement an algorithm that reduces the number of data points
we need to work with and to build a classifier using several ingredients: machine learning
algorithms, the Music Analysis (MA) toolbox developed by Elias Pampalk [4] which was given to
us, concepts such as the Johnson-Lindenstrauss transform which were taught in class, and the
provided set of 729 music samples (320 classical, 114 electronic, 26 jazz/blues, 45 metal/punk,
122 world, and the remaining 102 rock/pop). The classifier should be able to categorize a new
music track into the right genre based on how similar its extracted features are to those of the
music samples used to construct the classifier. To achieve the reduction in dimension and the
classification of songs by genre we developed two approaches, Process I and Process II, described
below.
2. Theory
This section gives an overview of the theory used to reduce the dimensionality of the songs as
well as of the methods used to classify them.
In our project we have used the function ma_mfcc.m from the Music Analysis (MA) toolbox, which
computes Mel Frequency Cepstral Coefficients (MFCCs) from the raw data, with a Discrete Cosine
Transform (DCT) applied as the final step. MFCCs are an effective tool for extracting information
from an audio signal. To compute MFCCs we use some very simple filters and transformations that
roughly model characteristics of the human auditory system. The important
characteristics are:
1. The non-linear frequency resolution using the Mel frequency scale.
2. The non-linear perception of loudness using the decibel scale.
3. Spectral masking effects using DCT.
2.1.3 Decibel
The Mel scale matrix M is converted into the decibel scale matrix M_dB as:
M_dB = 10*log10(M)
The DCT, applied to the Mel-scaled decibel spectrum, roughly models the spectral masking in the
human auditory system. The resulting MFCCs are a compressed representation of the original data:
from 512 samples per 23 ms down to 20 values per 12 ms.
Fig.: MFCCs of a song after applying the DCT.
2.1.5 Parameters
We used the following parameters while running ma_mfcc.m:
1. If the piece is longer than 2 minutes, then only the 2 minutes from the center of the piece
are used for further processing (see the sketch after this list).
2. The number of Mel filters is 36, the FFT size is 256, the hop size is 128, and the number of
DCT coefficients is 19.
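For illustration, the central-2-minute trimming in parameter 1 can be sketched in MATLAB as
follows; wav and fs are assumed placeholder names for the mono audio signal and its sampling
rate, and the remaining parameters are then passed on to ma_mfcc.m.

    % Sketch: keep only the central 2 minutes of long pieces before feature extraction.
    % Assumes wav is a mono signal and fs its sampling rate.
    len = 2*60*fs;                                % two minutes in samples
    if numel(wav) > len
        first = floor((numel(wav) - len)/2) + 1;  % start of the central window
        wav   = wav(first : first + len - 1);     % trimmed signal passed to ma_mfcc.m
    end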
2.2 Process – I
2.2.1 Dimensionality Reduction
Dimensionality reduction is an algorithmic technique for mapping high-dimensional data into a
space of much lower dimension. This is useful because many data-centric applications suffer an
exponential blowup as the underlying dimension grows, a phenomenon referred to as the curse of
dimensionality. The curse of dimensionality can be avoided if the input data is mapped into a
space of logarithmic dimension. For example, an algorithm running in time proportional to 2^d in
dimension d will run in linear time if the dimension can be brought down to log(d). Among the
most important beneficiaries of this technique are algorithms like k-nearest neighbors. The idea
is to preserve the properties of the data set, most importantly the pairwise distances between
the data points. This is not possible exactly, so we relax the requirement and tolerate errors,
as long as the error can be made arbitrarily small.
There are two common approaches to dimensionality reduction. One is data-aware techniques, which
take advantage of prior information about the input. The other is data-oblivious techniques,
which assume no prior information about the data; examples of such techniques are random linear
projections and locality-sensitive hashing.
The size of the distortion as well as the failure probability are user-specified parameters that
determine the target dimension. A careful quantitative calculation reveals that if we only care
about the distances between pairs of vectors and the angles between them, then a random linear
mapping into a space of dimension logarithmic in the size of the data is sufficient. This is the
central result of the seminal work of Johnson and Lindenstrauss (JL). The JL lemma says that
projecting the n points onto a random low-dimensional subspace preserves pairwise distances up
to a distortion of 1±ε. Newer algorithms realizing the JL lemma have since appeared. They are
based on a central concept from harmonic analysis known as the Heisenberg uncertainty principle:
a signal and its spectrum cannot both be concentrated. The Fast Johnson-Lindenstrauss Transform
(FJLT) [1] follows the same direction.
The FJLT is applied as the product of three matrices, Φ = P·H·D, defined as follows.

P is a k×d matrix whose elements are independently distributed: with probability 1−q we set
P_ij = 0, and otherwise we draw P_ij from a normal distribution with expectation 0 and variance
1/q. The sparsity constant q is given as

    q = min{ Θ( ε^(p−2) · (log n)^p / d ), 1 }.

H is a d×d normalized Walsh-Hadamard matrix with entries

    H_ij = d^(−1/2) · (−1)^⟨i−1, j−1⟩,

where ⟨i, j⟩ is the dot product of the m-bit vectors i and j expressed in binary.

D is a d×d diagonal matrix, where each D_ii is drawn independently from {−1, 1} with probability
½.
Using the above construction, we implemented the FJLT algorithm in MATLAB. We made use of
MATLAB's ‘fwht’ function to compute the Walsh-Hadamard transform.
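A minimal sketch of this construction is given below, for illustration only (it is not our exact
implementation): it assumes d is a power of two so that MATLAB's hadamard(d) is defined, builds
the normalized Hadamard matrix explicitly instead of calling fwht, targets the l2 case p = 2,
and takes the constant hidden in Θ(·) to be 1.

    function Y = fjlt_sketch(X, k, epsilon)
    % Sketch of the FJLT projection Phi = P*H*D applied to a d-by-n data matrix X.
    % Assumes d is a power of two; epsilon only enters the sparsity constant for p ~= 2.
        [d, n] = size(X);
        p = 2;                                          % we embed into l2
        q = min(epsilon^(p-2) * log(n)^p / d, 1);       % sparsity constant, Theta-constant taken as 1
        D = diag(2*(rand(d,1) > 0.5) - 1);              % random +/-1 diagonal matrix
        H = hadamard(d) / sqrt(d);                      % normalized Walsh-Hadamard matrix
        P = (randn(k,d) / sqrt(q)) .* (rand(k,d) < q);  % sparse Gaussian projection, variance 1/q
        Y = P * (H * (D * X));                          % k-by-n reduced representation
    end

In practice, H*(D*X) is applied with the fast Walsh-Hadamard transform (fwht) in O(d log d) time
per vector rather than by forming H explicitly.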
Random projection is computationally very simple: forming the random matrix and projecting the
d×n data matrix into k dimensions is of order O(dkn).
The idea of random projections is based on the Johnson-Lindenstrauss lemma. The value of ε is
the key parameter of the JL lemma, and we experimented with various ε values. Below are the
results for the various ε values:
Fig. 5: Dimensionality Reduction vs. Epsilon
With an increase in ε we obtain a greatly reduced dimension to work with, but the training
accuracy of the classifier decreases. With a decrease in ε we obtain better accuracy but have to
work with a higher dimension. We therefore tried to find an optimal ε value that gives decent
accuracy while keeping the dimension low. The optimal ε value obtained is as follows:
A linear classifier has the form f(x) = wᵀx + b. In 2D the discriminant is a line: w, also known
as the weight vector, is the normal to the line, and b is the bias. In 3D the discriminant is a
plane, and in n dimensions it is a hyperplane. A k-NN classifier must carry the training data
with it, whereas for a linear SVM classifier the training data is used to learn w and is then
discarded; only w (and b) is needed to classify new data. Learning the SVM can be formulated as
a constrained optimization problem.
If we select a hyperplane that is close to the data points of one class, it might not generalize
well. We therefore define the concept of a margin. Given a particular hyperplane, we can compute
the distance between the hyperplane and the closest data point; doubling this value gives what is
called the margin. The margin is essentially a no man's land: no data point ever lies inside it.
The optimal separating hyperplane is the one that maximizes the margin on the training data. SVMs
are inherently two-class classifiers. The standard technique for multiclass classification is to
build one-versus-rest classifiers and to choose the class whose classifier scores the test data
with the greatest margin.
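We trained the SVMs with an SMO-based open-source package (see Section 3); purely as an
illustration of the one-versus-rest idea, a MATLAB sketch using fitcsvm from the Statistics and
Machine Learning Toolbox could look as follows, with Xtrain, ytrain, and Xtest as placeholder
names for the training features, the training genre labels (a cell array of strings), and the
test features.

    % Illustrative one-vs-rest scheme (not the SMO package we actually used).
    genres = unique(ytrain);
    models = cell(numel(genres), 1);
    for g = 1:numel(genres)
        models{g} = fitcsvm(Xtrain, strcmp(ytrain, genres{g}));  % genre g vs. rest
    end
    scores = zeros(size(Xtest,1), numel(genres));
    for g = 1:numel(genres)
        [~, s] = predict(models{g}, Xtest);
        scores(:, g) = s(:, 2);            % signed score (margin) for the positive class
    end
    [~, best] = max(scores, [], 2);        % pick the class with the greatest margin
    ypred = genres(best);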
In our music genre classification project, we have used John Platt's sequential minimal
optimization (SMO) algorithm for training a support vector classifier. Training the SVM requires
the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this
large QP problem into a series of smallest possible QP problems. These small QP problems are
solved analytically, which avoids using a time-consuming numerical QP optimization as an inner
loop. The amount of memory required by SMO is linear in the training set size.
The QP problem is solved by the SMO algorithm. A point is an optimal point of the QP problem’s
equation if and only if the Karush-Kuhn-Tucker (KKT) conditions are fulfilled. The KKT
conditions can be evaluated one example at a time, which is useful in the construction of the
SMO algorithm. Due to the immense size of the QP problems that arise from SVMs, they cannot
easily be solved via standard QP techniques. The quadratic form in the problem involves a matrix
that has a number of elements equal to the square of the number of training examples. There is
an algorithmic technique, namely chunking, which uses the fact that the value of the quadratic
form is the same if we remove the rows and columns that correspond to zero Lagrange
multipliers. Therefore, the large QP problem can be broken down into a series of smaller QP
problems, whose ultimate goal is to identify all the non-zero Lagrange multipliers and discard
all of the zero Lagrange multipliers. Chunking greatly reduces the size of the matrix from the
number of training examples squared to approximately the number of non-zero Lagrange multipliers
squared. However, chunking still may not handle large-scale training problems, as even the
reduced matrix may not fit into memory. Osuna showed that the large QP problem can be broken down
into a series of smaller QP sub-problems. As long as at least one example that violates the KKT
conditions is added to the examples of the previous sub-problem, each step reduces the overall
objective function and maintains a feasible point that obeys all of the constraints. Therefore,
a sequence of QP sub-problems that always adds at least one violator will asymptotically
converge. Osuna et al. suggest keeping a constant-size matrix for every QP
sub-problem, which implies adding and deleting the same number of examples at every step.
Sequential minimal optimization is a simple algorithm that quickly solves the SVM QP problem
without any extra matrix storage and without invoking an iterative numerical routine for each
sub-problem. SMO decomposes the overall QP problem into QP sub-problems similar to Osuna’s
method.
Unlike the previous methods, SMO solves the smallest possible optimization problem at every step;
this smallest problem involves two Lagrange multipliers, because the Lagrange multipliers must
obey a linear equality constraint. At each step, SMO chooses two Lagrange multipliers to jointly
optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new
optimal values.
The advantage of SMO lies in the fact that solving for two Lagrange multipliers can be done
analytically, so an inner loop of numerical QP optimization is avoided entirely. There are thus
three components to SMO: an analytic method to solve for the two Lagrange multipliers, a
heuristic for choosing which multipliers to optimize, and a method for computing the threshold
of the SVM. After each step, this threshold is re-computed so that the KKT conditions are
fulfilled for both optimized examples.
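As an illustration, the analytic step for one chosen pair of multipliers (i, j) can be sketched
as follows; this is a simplified sketch that assumes a positive curvature eta, with K the kernel
matrix, y the labels in {-1,+1}, C the box constraint, and E the current prediction errors
E(t) = f(x_t) - y(t).

    % Simplified sketch of SMO's analytic update for the pair (i, j).
    if y(i) ~= y(j)
        L = max(0, alpha(j) - alpha(i));      H = min(C, C + alpha(j) - alpha(i));
    else
        L = max(0, alpha(i) + alpha(j) - C);  H = min(C, alpha(i) + alpha(j));
    end
    eta = K(i,i) + K(j,j) - 2*K(i,j);            % curvature along the constraint line
    aj  = alpha(j) + y(j)*(E(i) - E(j))/eta;     % unconstrained optimum for alpha_j
    aj  = min(max(aj, L), H);                    % clip to the feasible segment [L, H]
    ai  = alpha(i) + y(i)*y(j)*(alpha(j) - aj);  % keep the equality constraint satisfied
    alpha(i) = ai;  alpha(j) = aj;               % the threshold b is then re-computed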
The accuracy of a random forest depends on the strength of the individual tree classifiers and
on a measure of dependence between them. Given an ensemble of classifiers, with the training set
drawn at random from the distribution of the underlying random vector, a margin function can be
defined. The margin measures the extent to which the average number of votes at the random
vector for the right class exceeds the average vote for any other class; the larger the margin,
the more confidence we have in the classification. The result that random forests do not over-fit
as more trees are added, but rather that the generalization error converges to a limiting value,
follows from the strong law of large numbers. An upper bound on the generalization error can
thus be determined.
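For our experiments we used open-source software (see Section 3); for illustration only, an
equivalent MATLAB sketch with TreeBagger from the Statistics and Machine Learning Toolbox, using
placeholder variables Xtrain, ytrain, and Xtest, would be:

    % Illustrative random forest (not the open-source package we actually used).
    forest = TreeBagger(200, Xtrain, ytrain, 'OOBPrediction', 'on');
    oobErr = oobError(forest);       % out-of-bag estimate of the generalization error
    ypred  = predict(forest, Xtest); % majority vote over the 200 trees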
There are multiple ways of building random forests, but according to Breiman et al. several of
those reported in the literature do not perform as well as Adaboost or other algorithms that
work by adaptive reweighting of the training set. The reason why Adaboost works so well is that
at each step it tries to decouple the next classifier from the current one, i.e., the Adaboost
algorithm aims at keeping the covariance between classifiers small.
We start with a collection of k different sets of non-negative weights on the training set that
sum to one; corresponding to these weights are probabilities that also sum to one. Adaboost is a
deterministic algorithm that selects the weights on the training set for input to the next
classifier based on the misclassifications of the previous classifiers.
2.3 Process – II
2.3.1 Fluctuation Patterns
With Mel Frequency Cepstral Coefficients (MFCCs), the dimensionality of each song was reduced
from about a million data points to several hundred thousand, but this was still not enough to
proceed with computing feature vectors for classification. To further reduce the dimension of
each song, we used fluctuation patterns (FP) in combination with the MFCCs. Fluctuation patterns
describe how strongly the loudness in the individual frequency bands is modulated over time, and
they can also be used as a similarity measure.
The loudness modulation has different effects on our sensation depending on the frequency.
The sensation of fluctuation strength is most intense around 4Hz and gradually decreases up to
a modulation frequency of 15Hz. At 15Hz, the sensation of roughness starts to increase, reaches
its maximum at about 70Hz, and starts to decrease at about 150Hz. Above 150Hz the sensation
of hearing three separately audible tones increases.
The resulting FP is a matrix with rows corresponding to frequency bands and columns
corresponding to modulation frequencies (in the range of 0-10Hz). The elements of this matrix
describe the fluctuation strength. To summarize all FPs representing the different segments of a
piece, the median of all FPs is computed; finally, one FP matrix represents an entire piece. The
distance between two pieces is computed by interpreting each FP matrix as a high-dimensional
vector and computing the Euclidean distance.
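A minimal sketch of this summarization and distance computation (FPs1 and FPs2 are assumed cell
arrays holding the per-segment FP matrices of two pieces):

    % Summarize per-segment fluctuation patterns and compare two pieces.
    fp1  = median(cat(3, FPs1{:}), 3);   % element-wise median over all segments of piece 1
    fp2  = median(cat(3, FPs2{:}), 3);   % element-wise median over all segments of piece 2
    d_fp = norm(fp1(:) - fp2(:));        % Euclidean distance between the two pieces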
A general definition of the distance between two d-dimensional feature vectors q and x is

    d_A(q, x) = sqrt( (q − x)' · A · (q − x) ),

where we call the matrix A the distance matrix; A can be any positive definite d×d matrix. The
effect of A is to scale the distance along each feature axis.
Given a set of d-dimensional feature vectors X = {x_i} and a query vector q, the objective is to
find the subset C ⊂ X containing the M greatest values of the conditional probability density
P(x_i|q), where P(x_i|q) is a retrieval probability function. If we estimate P(x_i|q) as a
Gaussian distribution whose mean is the query vector q (with A playing the role of the inverse
covariance), then

    P(x_i|q) ∝ exp( −½ · (x_i − q)' · A · (x_i − q) ).

Hence, retrieving the M most probable vectors under P(x_i|q) is equivalent to retrieving the M
nearest neighbours of q under the distance d_A defined above.
For the spectral distance, each piece is summarized by a single Gaussian fitted to its MFCC
frames. The symmetric Kullback-Leibler (KL) divergence between two Gaussians N(m1, S1) and
N(m2, S2) can be computed as

    d_skl = ½ · [ tr(S2⁻¹·S1) + tr(S1⁻¹·S2) + (m1 − m2)'·(S1⁻¹ + S2⁻¹)·(m1 − m2) − 2d ].

We labelled this distance d_g1 (single-Gaussian distance metric). The distance is rescaled to
improve results when combining the spectral distance with other information. In particular, the
rescaled distance is computed from d_g1 using a rescaling factor fact; using fact = 450 gave the
best results. In this computation, eps denotes the smallest difference that MATLAB can recognize
between two numbers.
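As a sketch, the symmetric KL divergence above can be computed in MATLAB as follows (m1, m2 are
the mean vectors and S1, S2 the covariance matrices; depending on the convention used, the value
reported by the toolbox may differ by a constant factor):

    function dist = skl_gauss(m1, S1, m2, S2)
    % Symmetric Kullback-Leibler divergence between N(m1,S1) and N(m2,S2).
        dim  = numel(m1);
        dm   = m1(:) - m2(:);
        dist = 0.5 * ( trace(S2\S1) + trace(S1\S2) ...
                     + dm' * ((S1\dm) + (S2\dm)) - 2*dim );
    end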
k-Nearest Neighbor (k-NN) Algorithm: k-NN is one of those algorithms that are simple, yet very
powerful. There are a few aspects of k-NN that make it a good choice for similarity search.
a) It is non-parametric, i.e., it does not make any assumptions about the underlying data
distribution.
b) It is an instance-based or lazy learning algorithm, i.e., it does not use the training data
points to do any generalization; it makes decisions based on the entire training data set. In
other words, there is no explicit training phase, or it is very minimal.
Before looking at how k-NN works, there are a few assumptions that k-NN makes which need
attention.
a) k-NN assumes that the data is in a feature space. More exactly, the data points are in a
metric space. The data can be scalars or possibly even multidimensional vectors. Since
the points are in feature space, they have a notion of distance. This need not necessarily
be Euclidean distance although it is the one commonly used.
b) Each training example consists of a vector and the class label associated with that vector.
In the simplest case the label is either + or − (for the positive or negative class), but k-NN
can work equally well with an arbitrary number of classes.
c) Finally, the ‘k’ in k-NN decides how many neighbors (where neighbors are defined by the
distance metric) influence the classification. This is usually an odd number. If k = 1, the
algorithm is simply called the nearest-neighbor algorithm.
It works by computing the k nearest neighbors of a query point and taking a majority vote to
decide to which cluster the query point should be assigned. For our project we had six different
genres, and hence we had to build six clusters (each cluster identifying a genre) to assign the
songs to. We used a weighted combination of the four distance metrics to uniquely identify one
genre from the rest, and with this weighted distance metric we built the six clusters. However,
due to a large overlap between the genres, the clusters overlapped too, and this led to a lot of
false positives (a detailed analysis of the results can be found in Section 3). We obtained an
accuracy of 60% with the k-NN approach.
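For illustration, the vote for one query song can be sketched as follows, where D1..D4 are
assumed vectors holding the four distances from the query to all training songs, w the
corresponding weights, labels the training genre labels, and k the number of neighbours:

    % Weighted distance combination and k-NN majority vote (sketch).
    Dcomb = w(1)*D1 + w(2)*D2 + w(3)*D3 + w(4)*D4;   % combined distance to all training songs
    [~, order] = sort(Dcomb, 'ascend');
    neighbours = labels(order(1:k));                 % genres of the k nearest training songs
    [candidates, ~, idx] = unique(neighbours);       % majority vote among the neighbours
    votes = accumarray(idx, 1);
    [~, winner] = max(votes);
    genre = candidates{winner};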
In our original approach, the distance of the test song from each island (d_c, d_e, d_j, d_m,
d_r, d_w) is calculated from the test data, and the song is classified into the genre whose
island it is closest to. However, due to the non-proportional split of songs among the different
genres, this may result in the classifier developing a bias towards a certain genre, or failing
to recognize a particular genre altogether.
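This amounts to a simple minimum-distance decision, sketched below (the six distances are assumed
to have been computed as described above):

    % Original island approach: assign the song to the genre of the closest island.
    dists  = [d_c, d_e, d_j, d_m, d_r, d_w];
    genres = {'Classical','Electronic','Jazz/Blues','Metal/Punk','Rock/Pop','World'};
    [~, nearest] = min(dists);
    genre = genres{nearest};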
Fig. 7: Visualization of Island Approach
3. Experimental Analysis
In this section we look at the results obtained from different levels of cross-validation. In
each case we obtain a confusion matrix, whose rows signify the actual genre while the columns
give the genre into which the songs were classified. In each confusion matrix we therefore look
at the diagonal elements to read off the accuracy of the method. For the SVM and random forests
classifiers we used open-source software, which reported the confusion matrix along with the
accuracy.
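For reference, given a confusion matrix C of raw counts, the quantities reported below can be
computed as in this sketch:

    % Accuracy measures from a confusion matrix C of raw counts
    % (rows = actual genre, columns = predicted genre).
    accuracy = sum(diag(C)) / sum(C(:));   % overall fraction of correctly classified songs
    recall   = diag(C) ./ sum(C, 2);       % per-genre accuracy (diagonal of the row-normalized matrix)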
Fig. 9: Classification of Classical vs. Non-Classical for different values of k
As can be seen from Fig. 9, the best results were for k = 5. We then did a 5-fold cross validation,
calculated the average of all our runs and obtained the following confusion matrix for classical
vs. non-classical.
The overall accuracy obtained for this method was 57.65%. The graph below shows the accuracy for
different values of k.
              Classical   Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
Classical     0.28125     0.00000     0.03125     0.67188     0.01612   0.00000
Electronic    0.00000     0.86957     0.00000     0.04348     0.04348   0.04348
Jazz/Blues    0.60000     0.00000     0.40000     0.00000     0.00000   0.00000
Metal/Punk    0.44444     0.00000     0.00000     0.55556     0.00000   0.00000
Rock/Pop      0.45000     0.00000     0.20000     0.05000     0.10000   0.20000
World         0.25000     0.12500     0.16667     0.25000     0.04167   0.16667
Table 4: Confusion Matrix for Fluctuation Pattern & Original Island Approach (rows: actual genre; columns: predicted genre)
3-Fold Cross Validation: We did a 3-fold cross-validation on the data, calculated the average of
all our runs, and obtained the following confusion matrices.
As can be seen from the table above, our classifier struggles when it comes to rock. To combat
this, we manually label as Rock/Pop every song that is not classified into any of the other
genres. The final confusion matrix is given below.
5-Fold Cross Validation: We did a 5-fold cross validation on the data, calculated the average
of all our runs and obtained the following confusion matrices.
We can see that with 5-fold cross-validation there is an improvement of nearly 0.1 in both
electronic and jazz, while classical and metal decrease by 0.02. The problem with rock remains,
so we employ the same solution as above to classify that genre.
3.6 Fluctuation Patterns with Random Forests
We applied the random forests classifier with the reduced feature set we obtained from
fluctuation patterns. We then did a 5-fold cross validation, and obtained the following confusion
matrix.
4. Results
In this section we applied all the methods mentioned above to the 35 test songs provided to us
and obtained the following results.
23  Classical   Classical   World       Classical   Classical   Classical
24  Classical   Classical   Classical   Classical   Classical   Classical
25  Electronic  Classical   Classical   Classical   Classical   Classical
26  Rock/Pop    Classical   Metal/Punk  Classical   Classical   Classical
27  Electronic  Electronic  Electronic  Electronic  Rock/Pop    Electronic
28  Metal/Punk  Rock/Pop    Classical   World       Rock/Pop    World
29  Rock/Pop    Rock/Pop    Metal/Punk  Classical   Classical   Classical
30  Electronic  Classical   Classical   Classical   Classical   Classical
31  Classical   Classical   Classical   Classical   Classical   World
32  World       Classical   Metal/Punk  World       World       World
33  Electronic  Rock/Pop    Electronic  Rock/Pop    Rock/Pop    Rock/Pop
34  Metal/Punk  Rock/Pop    Electronic  Rock/Pop    Rock/Pop    Rock/Pop
35  Classical   Classical   Classical   Classical   Classical   Classical
Table 11: Results for the Test Data
5. Bibliography
[1] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-
Lindenstrauss transform. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of
Computing, pages 557–563. ACM, 2006.
[2] Luís Barreira, Sofia Cavaco, and Joaquim Ferreira da Silva. Unsupervised Music Genre
Classification with a Model-Based Approach, pages 268–281. Springer Berlin Heidelberg, Berlin,
Heidelberg, 2011.
[3] Beth Logan. Music recommendation from song sets. In ISMIR. Citeseer, 2004.
[4] Elias Pampalk. Computational models of music similarity and their application in music
information retrieval, 2006.
[5] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions
on Speech and Audio Processing, 10(5):293–302, Jul 2002.