CMPS242 Machine Learning Final Project Report: 1. Problem Statement
To make combining the distances of different features easier, we also attempt to normalize the output of each distance to ideally lie in the range 0 to 1.

3.1. L1 (Manhattan) Distance

The Manhattan distance of two vectors x and y is calculated by:

d = \sum_{i=1}^{n} |x_i - y_i|

The Dice distance of two sets A and B is calculated by:

d_d(A, B) = 1 - \frac{2|A \cap B|}{|A| + |B|}

3.5. Normalizations

Because different distance metrics work with different output ranges, we need to normalize the distance metrics to something consistent. For example, the range of the Jaccard distance is between 0 and 1, while the range of the Manhattan distance is 0 to infinity.

To figure out what works best, we will try three different normalization methods:

1. Raw - No normalization is applied.

2. Logarithmic - Use a logarithm to squash the value down closer to zero. Since we did not want to let values less than 1 grow, we put a hinge at 1 and just let all values less than 1 go to zero:

distance(d) = \ln(\max(1, d))

3. Logistic - Use the sigmoid function to squash the distance down to a 0-1 range. Since all our distance metrics return non-negative values, we can ensure that the following returns a value in the range of 0 to 1:

distance(d) = \left(\frac{1}{1 + \exp(-d)} - 0.5\right) \cdot 2

4. Features

Our set of features can be divided into three types of features:

• Descriptive Features
The dataset also contains some textual data such as attributes, categories, review text, etc., which describe the businesses. We construct features using these texts, but to improve the efficiency of our model we encode the string data as sets of unique integers by building maps over all possible values that these features can take. So, our output feature is a set of identifiers of the words present for the business.

– Attribute Features - The dataset contains information about the attributes of businesses, which describe the operations of the business. They exist as key-value pairs in the data, for example (WiFi, no), (Delivery, false), (Attire, casual). We squash each key-value pair together and collect these to create the set of attributes that a business has, which we then use as a feature.

– Category Features - The dataset also contains some categorical information about the business, for example, whether the business is a restaurant, cafe, food place, burger place, etc. We construct a feature which is the set of categories that the business has assigned to it.
– Key Words - These are words that Yelp has defined to help users filter the businesses that appear in search results. They are words that delineate businesses, as they are mostly categorical words such as restaurant, cafe, etc. We look for occurrences of these key words in the reviews of the businesses and return the set of key words that the reviews contain.

– Top Words - This set contains the most frequently occurring words in the reviews of the business, after removing stop words. We used a general English-language stop word list containing 562 stop words.

• Temporal Features
We have two time-related features pertaining to the operating hours of businesses:

– Total Hours - The total number of hours the business is open during the week.

– Open Hours - This feature encodes information about the operating hours of the business over the week. We divided the hours in a day in the following way to help us attribute operating times to a business:

∗ Open between 6AM - 12PM: the restaurant functions in the morning, or serves breakfast.
∗ Open between 12PM - 3PM: the restaurant functions in the afternoon, or serves lunch.
∗ Open between 5PM - 9PM: the restaurant functions in the evening, or serves dinner.
∗ Open between 9PM - 2AM: the restaurant functions post dinner, or late at night.

However, to make our feature more robust, we specify that for a business to be encoded as functioning during one of the time-spans above, it must be open during those hours for at least 4 days a week. We again return the set of time-spans during which a business operates.

5. Evaluation

In order to evaluate the correctness of the generated clusters, we create sets of restaurants that are similar. Using these sets as the "gold standard" clusters, we report the Rand Index.

5.1. Rand Index

The Rand Index is used in data clustering to measure the similarity between two cluster assignments. Given a set of elements S and two partitions of S, X = \{X_1, X_2, \ldots, X_n\} and Y = \{Y_1, Y_2, \ldots, Y_m\}, we compute the following:

• a - the number of pairs that are assigned to the same cluster in both X and Y

• b - the number of pairs that are assigned to different clusters in both X and Y

• c - the number of pairs that are assigned to the same cluster in X but to different clusters in Y

• d - the number of pairs that are assigned to different clusters in X but to the same cluster in Y

The Rand Index is then given by R = \frac{a + b}{a + b + c + d}.

5.2. Generating Gold Standard Clusters

Since we do not have gold standard clusters, we cluster a subset of the data and evaluate the algorithm on these data points.

We look at various restaurant chains such as "Taco Bell", "Starbucks", etc. that are present in the data and assign all the stores belonging to the same chain to a new cluster. We extracted the top 15 restaurant chains; their details are given in Table 1. We only generate pairs within a restaurant chain: since it is not clear whether a McDonald's and a Burger King should be in the same or different clusters, we do not look at pairs across chains.

However, if we only have positive pairings, then the Rand Index can be trivially maximized by assigning all data points to the same cluster. To counteract this, we also need pairs of restaurants that should not be in the same cluster. We collected a list of 285 "fine-dining" restaurants (restaurants with the highest price range) and create pairs such that the first restaurant comes from the "fast food" list (e.g., "McDonald's" or "Taco Bell") and the second from the fine-dining list. These pairs should not be in the same cluster.

6. Experiments

We use all 3069 restaurants present in the gold standard dataset for our experiments.

There are four different parameters that we can tune in our algorithm:

1. D - the set of features

2. K - the number of clusters

3. F - the function used to normalize the various distance metrics

4. S - the distance measure used to measure set similarity

We run multiple experiments, keeping a few of these parameters fixed and altering the others, to better understand the sensitivity of each parameter.
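As an illustrative sketch (function and variable names are ours, not from the report), the pair counts a, b, c, d and the Rand Index R = (a + b) / (a + b + c + d) used to score the experiments can be computed as:

```python
from itertools import combinations

def rand_index(labels_x, labels_y):
    """Pair-counting Rand Index between two cluster assignments.

    labels_x[i] and labels_y[i] give the cluster of element i under
    partitions X and Y respectively.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1  # same cluster in both X and Y
        elif not same_x and not same_y:
            b += 1  # different clusters in both X and Y
        elif same_x:
            c += 1  # same cluster in X, different clusters in Y
        else:
            d += 1  # different clusters in X, same cluster in Y
    return (a + b) / (a + b + c + d)
```

Identical partitions score 1.0 (every pair is counted in a or b), while the degenerate "all points in one cluster" assignment is penalized by the negative (fine-dining vs. fast-food) pairs described in Section 5.2.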
Feature Set  K   Scalar Normalization  Set Distance  Rand Index
NAW          18  None                  Jaccard       0.669458
AW           10  N/A                   Dice          0.929555
AW           10  N/A                   Jaccard       0.855210
AW           12  N/A                   Dice          0.870459
AW           12  N/A                   Jaccard       0.864035
AW           14  N/A                   Dice          0.851381
AW           14  N/A                   Jaccard       0.842278
AW           16  N/A                   Dice          0.850255
AW           16  N/A                   Jaccard       0.844351
AW           18  N/A                   Dice          0.848773
AW           18  N/A                   Jaccard       0.851001
NA           8   Log                   Dice          0.659055
NA           8   Log                   Jaccard       0.658976
NA           8   Logistic              Dice          0.742567
NA           8   Logistic              Jaccard       0.770920
NA           10  Log                   Dice          0.670728
NA           10  Log                   Jaccard       0.671137
NA           10  Logistic              Dice          0.726959
NA           10  Logistic              Jaccard       0.737731
NA           10  None                  Dice          0.667858
NA           10  None                  Jaccard       0.667858
NA           12  Log                   Dice          0.671808
NA           12  Log                   Jaccard       0.673073
NA           12  Logistic              Dice          0.723300
NA           12  Logistic              Jaccard       0.729915
NA           12  None                  Dice          0.669829
NA           12  None                  Jaccard       0.669829
NA           14  Log                   Dice          0.667990
NA           14  Log                   Jaccard       0.669195
NA           14  Logistic              Dice          0.714399
NA           14  Logistic              Jaccard       0.722061
NA           14  None                  Dice          0.670182
NA           14  None                  Jaccard       0.670232
NA           16  Log                   Dice          0.678210
NA           16  Log                   Jaccard       0.676306
NA           16  Logistic              Dice          0.712950
NA           16  Logistic              Jaccard       0.726035
NA           16  None                  Dice          0.670561
NA           16  None                  Jaccard       0.670614
NA           18  Log                   Dice          0.678099
NA           18  Log                   Jaccard       0.676272
NA           18  Logistic              Dice          0.708987
NA           18  Logistic              Jaccard       0.719783
NA           18  None                  Dice          0.669363
NA           18  None                  Jaccard       0.669419
NW           10  Log                   Dice          0.669719
NW           10  Log                   Jaccard       0.684106
NW           10  Logistic              Dice          0.694825
NW           10  Logistic              Jaccard       0.669269
NW           10  None                  Dice          0.667752
NW           10  None                  Jaccard       0.667750
NW           12  Log                   Dice          0.671185
NW           12  Log                   Jaccard       0.684292
NW           12  Logistic              Dice          0.694171
NW           12  Logistic              Jaccard       0.674548
NW           12  None                  Dice          0.669814
NW           12  None                  Jaccard       0.669810
NW           14  Log                   Dice          0.670793
NW           14  Log                   Jaccard       0.677972
NW           14  Logistic              Dice          0.694405
NW           14  Logistic              Jaccard       0.675936
NW           14  None                  Dice          0.670697
NW           14  None                  Jaccard       0.670578
NW           16  Log                   Dice          0.672472
NW           16  Log                   Jaccard       0.676489
NW           16  Logistic              Dice          0.692137
NW           16  Logistic              Jaccard       0.681405
NW           16  None                  Dice          0.670888
NW           16  None                  Jaccard       0.670706
NW           18  Logistic              Dice          0.698928
NW           18  Logistic              Jaccard       0.681717
A            10  N/A                   Dice          0.907089
A            10  N/A                   Jaccard       0.969376
A            12  N/A                   Dice          0.898305
A            12  N/A                   Jaccard       0.910638
A            14  N/A                   Dice          0.869441
A            14  N/A                   Jaccard       0.911598
A            16  N/A                   Dice          0.826885
A            16  N/A                   Jaccard       0.907550
N            8   Log                   N/A           0.657650
N            8   Logistic              N/A           0.644067
N            8   None                  N/A           0.659098
N            10  Log                   N/A           0.657301
N            10  Logistic              N/A           0.649856
N            10  None                  N/A           0.667856
N            12  Log                   N/A           0.669908
N            12  Logistic              N/A           0.665688
N            12  None                  N/A           0.669810
N            14  Log                   N/A           0.673658
N            14  Logistic              N/A           0.661820
N            14  None                  N/A           0.670462
N            16  Log                   N/A           0.669040
N            16  Logistic              N/A           0.666306
N            16  None                  N/A           0.670721
N            18  Log                   N/A           0.669534
N            18  Logistic              N/A           0.661045
N            18  None                  N/A           0.669127
W            10  N/A                   Dice          0.852379
W            10  N/A                   Jaccard       0.816539
W            12  N/A                   Dice          0.860230
W            12  N/A                   Jaccard       0.815057
W            14  N/A                   Dice          0.859987
W            14  N/A                   Jaccard       0.814819
W            16  N/A                   Dice          0.819333
W            16  N/A                   Jaccard       0.798225
W            18  N/A                   Dice          0.773908
W            18  N/A                   Jaccard       0.769316
continued...