CMPS242 Machine Learning Final Project Report: 1. Problem Statement
To make combining the distances of different features easier, we also attempt to normalize the output of each distance to ideally lie in the range 0 to 1.

3.1. L1 (Manhattan) Distance

The Manhattan distance of two vectors x and y is calculated by:

d = \sum_{i=1}^{n} |x_i - y_i|

The Dice distance of two sets A and B is calculated by:

d_d(A, B) = 1 - \frac{2|A \cap B|}{|A| + |B|}

3.5. Normalizations

Because different distance metrics work with different output ranges, we need to normalize the distance metrics to something consistent. For example, the range of the Jaccard distance is between 0 and 1, while the range of the Manhattan distance is 0 to infinity.

To figure out what works best, we will try three different normalization methods:

1. Raw - No normalization is applied.

2. Logarithmic - Use a logarithm to squash the value down closer to zero. Since we did not want to let values less than 1 grow, we put a hinge at 1 and just let all values less than 1 go to zero:

distance(d) = \ln(\max(1, d))

3. Logistic - Use the sigmoid function to squash the distance down to a 0-1 range. Since all our distance metrics return non-negative values, we can ensure that the following returns a value in the range of 0 to 1:

distance(d) = \left(\frac{1}{1 + \exp(-d)} - 0.5\right) \cdot 2

4. Features

Our set of features can be divided into three types of features:

• Descriptive Features
The dataset also contains some textual data such as attributes, categories, review text, etc., which describe the businesses. We construct features using these texts, but to improve the efficiency of our model we encode the string data as sets of unique integers by building maps over all possible values that these features can take. So, our output feature is a set of identifiers of the words present for the business.

– Attribute Features - The dataset contains information about the attributes of businesses, which describe the operations of the business. They exist as key-value pairs in the data, for example (WiFi, no), (Delivery, false), (Attire, casual). We squash each key-value pair together and collect these to create the set of attributes that a business has, which we then use as a feature.

– Category Features - The dataset also contains some categorical information about the business, for example, whether the business is a restaurant, cafe, food place, burger place, etc. We construct a feature which is the set of categories that the business has assigned to it.
– Key Words - These are words that Yelp has defined to help users filter the businesses that appear in search results. They are words that delineate businesses, as they are mostly categorical words such as restaurant, cafe, etc. We look for occurrences of these key words in the reviews of the businesses and return the set of key words that the reviews contain.

– Top Words - This set contains the most frequently occurring words in the reviews of the business, after removing stop words. We used a general English-language stop word list containing 562 stop words.

• Temporal Features
We have two time-related features pertaining to the operating hours of businesses:

– Total Hours - The total number of hours the business is open during the week.

– Open Hours - This feature encodes information about the operating hours of the business over the week. We divided the hours in a day in the following way to help us attribute operating times to a business:

∗ Open between 6AM - 12PM: the restaurant functions in the morning, or serves breakfast.
∗ Open between 12PM - 3PM: the restaurant functions in the afternoon, or serves lunch.
∗ Open between 5PM - 9PM: the restaurant functions in the evening, or serves dinner.
∗ Open between 9PM - 2AM: the restaurant functions post dinner, or late at night.

However, to make our feature more robust, we specify that for a business to be encoded as functioning during one of the time-spans above, it must be open during those hours for at least 4 days a week. We again return the set of time-spans during which a business operates.

5. Evaluation

In order to evaluate the correctness of the generated clusters, we create sets of restaurants that are similar. Using these sets as the "gold standard" clusters, we report the Rand Index.

5.1. Rand Index

The Rand Index is used in data clustering to measure the similarity between two cluster assignments. Given a set of elements S and two partitions of S, X = \{X_1, X_2, \ldots, X_n\} and Y = \{Y_1, Y_2, \ldots, Y_m\}, we compute the following:

• a - the number of pairs that are assigned to the same cluster in both X and Y

• b - the number of pairs that are assigned to different clusters in both X and Y

• c - the number of pairs that are assigned to the same cluster in X but to different clusters in Y

• d - the number of pairs that are assigned to different clusters in X but to the same cluster in Y

The Rand Index is then given by R = \frac{a + b}{a + b + c + d}.

5.2. Generating Gold Standard Clusters

Since we do not have gold standard clusters, we cluster a subset of the data and evaluate the algorithm on these data points.

We look at various restaurant chains such as "Taco Bell", "Starbucks", etc. that are present in the data and assign all the stores belonging to the same chain to a new cluster. We extracted the top 15 restaurant chains; their details are given in Table 1. We only generate pairs within a restaurant chain: since it is not clear whether a McDonald's and a Burger King should be in the same or different clusters, we do not look at pairs across chains.

However, if we only have positive pairings, then the Rand Index can be trivially maximized by assigning all data points to the same cluster. To counteract this, we also need pairs of restaurants that should not be in the same cluster. We collected a list of 285 "fine-dining" restaurants (restaurants with the highest price range) and create pairs such that the first restaurant comes from the "fast food" list (e.g., "McDonald's" or "Taco Bell") and the second from the fine-dining list. These pairs should not be in the same cluster.

6. Experiments

We use all 3069 restaurants present in the gold standard dataset for our experiments.

There are four different parameters that we can tune in our algorithm:

1. D - the set of features

2. K - the number of clusters

3. F - the function used to normalize the various distance metrics

4. S - the distance measure used to measure set similarity

We run multiple experiments, keeping a few of these parameters fixed and altering the others, to better understand the sensitivity of each parameter.
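As an illustrative sketch (function and variable names are ours, not from the report), the pair counts a, b, c, d and the Rand Index R = (a + b) / (a + b + c + d) used to score the experiments can be computed as:

```python
from itertools import combinations

def rand_index(labels_x, labels_y):
    """Pair-counting Rand Index between two cluster assignments.

    labels_x[i] and labels_y[i] give the cluster of element i under
    partitions X and Y respectively.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1  # same cluster in both X and Y
        elif not same_x and not same_y:
            b += 1  # different clusters in both X and Y
        elif same_x:
            c += 1  # same cluster in X, different clusters in Y
        else:
            d += 1  # different clusters in X, same cluster in Y
    return (a + b) / (a + b + c + d)
```

Identical partitions score 1.0 (every pair is counted in a or b), while the degenerate "all points in one cluster" assignment is penalized by the negative (fine-dining vs. fast-food) pairs described in Section 5.2.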
Feature Set  K   Scalar Normalization  Set Distance  Rand Index
NAW          18  None                  Jaccard       0.669458
AW           10  N/A                   Dice          0.929555
AW           10  N/A                   Jaccard       0.855210
AW           12  N/A                   Dice          0.870459
AW           12  N/A                   Jaccard       0.864035
AW           14  N/A                   Dice          0.851381
AW           14  N/A                   Jaccard       0.842278
AW           16  N/A                   Dice          0.850255
AW           16  N/A                   Jaccard       0.844351
AW           18  N/A                   Dice          0.848773
AW           18  N/A                   Jaccard       0.851001
NA           8   Log                   Dice          0.659055
NA           8   Log                   Jaccard       0.658976
NA           8   Logistic              Dice          0.742567
NA           8   Logistic              Jaccard       0.770920
NA           10  Log                   Dice          0.670728
NA           10  Log                   Jaccard       0.671137
NA           10  Logistic              Dice          0.726959
NA           10  Logistic              Jaccard       0.737731
NA           10  None                  Dice          0.667858
NA           10  None                  Jaccard       0.667858
NA           12  Log                   Dice          0.671808
NA           12  Log                   Jaccard       0.673073
NA           12  Logistic              Dice          0.723300
NA           12  Logistic              Jaccard       0.729915
NA           12  None                  Dice          0.669829
NA           12  None                  Jaccard       0.669829
NA           14  Log                   Dice          0.667990
NA           14  Log                   Jaccard       0.669195
NA           14  Logistic              Dice          0.714399
NA           14  Logistic              Jaccard       0.722061
NA           14  None                  Dice          0.670182
NA           14  None                  Jaccard       0.670232
NA           16  Log                   Dice          0.678210
NA           16  Log                   Jaccard       0.676306
NA           16  Logistic              Dice          0.712950
NA           16  Logistic              Jaccard       0.726035
NA           16  None                  Dice          0.670561
NA           16  None                  Jaccard       0.670614
NA           18  Log                   Dice          0.678099
NA           18  Log                   Jaccard       0.676272
NA           18  Logistic              Dice          0.708987
NA           18  Logistic              Jaccard       0.719783
NA           18  None                  Dice          0.669363
NA           18  None                  Jaccard       0.669419
NW           10  Log                   Dice          0.669719
NW           10  Log                   Jaccard       0.684106
NW           10  Logistic              Dice          0.694825
NW           10  Logistic              Jaccard       0.669269
NW           10  None                  Dice          0.667752
NW           10  None                  Jaccard       0.667750
NW           12  Log                   Dice          0.671185
NW           12  Log                   Jaccard       0.684292
NW           12  Logistic              Dice          0.694171
NW           12  Logistic              Jaccard       0.674548
NW           12  None                  Dice          0.669814
NW           12  None                  Jaccard       0.669810
NW           14  Log                   Dice          0.670793
NW           14  Log                   Jaccard       0.677972
NW           14  Logistic              Dice          0.694405
NW           14  Logistic              Jaccard       0.675936
NW           14  None                  Dice          0.670697
NW           14  None                  Jaccard       0.670578
NW           16  Log                   Dice          0.672472
NW           16  Log                   Jaccard       0.676489
NW           16  Logistic              Dice          0.692137
NW           16  Logistic              Jaccard       0.681405
NW           16  None                  Dice          0.670888
NW           16  None                  Jaccard       0.670706
NW           18  Logistic              Dice          0.698928
NW           18  Logistic              Jaccard       0.681717
A            10  N/A                   Dice          0.907089
A            10  N/A                   Jaccard       0.969376
A            12  N/A                   Dice          0.898305
A            12  N/A                   Jaccard       0.910638
A            14  N/A                   Dice          0.869441
A            14  N/A                   Jaccard       0.911598
A            16  N/A                   Dice          0.826885
A            16  N/A                   Jaccard       0.907550
N            8   Log                   N/A           0.657650
N            8   Logistic              N/A           0.644067
N            8   None                  N/A           0.659098
N            10  Log                   N/A           0.657301
N            10  Logistic              N/A           0.649856
N            10  None                  N/A           0.667856
N            12  Log                   N/A           0.669908
N            12  Logistic              N/A           0.665688
N            12  None                  N/A           0.669810
N            14  Log                   N/A           0.673658
N            14  Logistic              N/A           0.661820
N            14  None                  N/A           0.670462
N            16  Log                   N/A           0.669040
N            16  Logistic              N/A           0.666306
N            16  None                  N/A           0.670721
N            18  Log                   N/A           0.669534
N            18  Logistic              N/A           0.661045
N            18  None                  N/A           0.669127
W            10  N/A                   Dice          0.852379
W            10  N/A                   Jaccard       0.816539
W            12  N/A                   Dice          0.860230
W            12  N/A                   Jaccard       0.815057
W            14  N/A                   Dice          0.859987
W            14  N/A                   Jaccard       0.814819
W            16  N/A                   Dice          0.819333
W            16  N/A                   Jaccard       0.798225
W            18  N/A                   Dice          0.773908
W            18  N/A                   Jaccard       0.769316
continued...