SSRN Id4012594
Distance-based Features
Zann Koh (a,∗), Yuren Zhou (a), Billy Pik Lik Lau (a), Ran Liu (a), Chau Yuen (a) and Keng Hua Chong (a)
(a) Engineering and Product Development, Singapore University of Technology and Design, 8 Somapah Road, Singapore 487372, Singapore
(b) Architecture and Sustainable Design, Singapore University of Technology and Design, 8 Somapah Road, Singapore 487372, Singapore
1. Introduction
The Global Positioning System (GPS) has been around for many years and is increasingly being
used in mobility studies. It has been found to be widely usable for the collection of spatio-temporal
data on different scales [1]. GPS mobility data has been used in many different fields and applications, such
as finding efficient routes [2], understanding the progression of infectious diseases [3], and prediction or
Many studies analyze GPS data in conjunction with other data, such as demographic data [6, 7],
supplementary survey data [1], Wi-Fi and geolocation data [8], or even sound and light data [9]. With
increasing privacy concerns in recent years, it has become more difficult to obtain such data for large numbers
of volunteers. Additionally, large volumes of human movement data are created without such supplementary
data. To be eventually able to make use of this, we want to explore ways in which we can analyze GPS data
∗ Corresponding author
zann_koh@mymail.sutd.edu.sg (Z. Koh)
ORCID (s): 0000-0002-4259-261X (Z. Koh)
role can be predicted with high accuracy using long-term GPS data, which supports the idea that Working
and Nonworking users may have different mobility patterns. In addition to this, Nahmias-Biran et al [11]
found several distinct clusters of activity-travel patterns in their GPS-enriched travel survey dataset, which
included distinct temporal patterns of different Out-of-Work activities as well as different Leisure activities.
This leads us to examine the mobility patterns of Workdays and Offdays separately.
Although many works have used GPS data in a variety of applications, there is a lack of
studies that compare the mobility patterns of Working and Nonworking users, with a focus on non-Home,
non-Work locations, on Workdays as compared to Offdays. Some challenges faced in this research are ensuring a
fair comparison between users who live and work at different locations, as well as a fair comparison between
Workdays and Offdays. To this end, we propose a new mobility metric that excludes the effects of Home and
Work locations and uses the user’s Home location as a reference point. We decided to use an unsupervised
machine learning method - clustering, which finds groups in data without the need for labels or ground
truth. We use the above-mentioned metric for each user in conjunction with other features as an input for the
clustering algorithm.
• We propose a new mobility metric, Daily Characteristic Distance (DCD), and demonstrate its relation
• We use the DCD to extract features from users and use these features in conjunction with Origin-
Destination (OD) matrix features to cluster users using 𝑘-means clustering on a real-world dataset
collected in Singapore.
• Finally, we analyze the cluster results using two other analysis metrics that we have proposed - User
The structure of the remaining sections will be as follows: Section II lists some related works in the field
of GPS tracking and clustering. The dataset and preprocessing procedures are presented in Section III, while
the proposed methodology is presented in Section IV. Section V shows the results and analysis of performing
our proposed methodology on Workday data, while Section VI does the same for Offday data. Lastly, Section
[Figure 1 flowchart. Section III: Home/Work location identification; split into Workday and Offday data. Section IV: SSE plots (determine optimal number of clusters); K-means clustering (cluster users). Section V (Workday) and Section VI (Offday): User Commonality (percentage of users within that cluster who visited each POI category); Average Frequency (average percentage of trips made by each user within each cluster falling into each POI category).]
Figure 1: Flowchart depicting the data collection, processing, clustering, and analysis framework proposed by
this paper.
2. Related Works
This section will be split into three parts. The first part addresses past works on the analysis of human
mobility via the usage of GPS data. The second part addresses the selection of clustering algorithms used
for this paper. Lastly, the third part deals with mobility metrics.
There have been several works focusing on the use of GPS data in mobility studies. These include the
works by Sila-Nowicka et al [6], who performed an analysis of significant places identified from their GPS
data in conjunction with additional social demographic data, as well as by Zheng et al [12], who proposed an
approach based on supervised learning to infer people’s motion modes from their GPS logs, and Marakkalage
et al [13] who used a fusion of GPS data and Wi-Fi data to derive insights on neighbourhood activity and
micro mobility. Van der Spek et al [1] used GPS to collect data in three European city centers, as well as
track the activity data of 13 families in Almere (a city in The Netherlands) for one week. Alessandretti
et al [14] presented an extensive characterization of the statistical properties of GPS trajectories using a
dataset collected from around 850 people lasting around 25 months. Wang et al [15] proposed a hierarchical
clustering method using an improved edit distance algorithm to use clustering to detect anomalous taxi
trajectories between selected pairs of origins and destinations. Solmaz et al [16] used GPS traces to study
pedestrian mobility in disaster areas. Long and Reuschke [7] even use detailed GPS data to analyze the
even extending to other aspects of demographics. However, as our focus is more on unsupervised machine
learning as compared to prediction, we turn our focus to the application of clustering methods in the analysis
of GPS data.
Some authors have performed clustering on trajectories to find common routes or popular locations.
An example would be Cesario et al [17], who proposed their own algorithm, trajectory pattern mining
(TPM), and used it to discover dense regions and popular sequential patterns within their dataset. Kumar
et al [18] also proposed their own novel algorithm, with the aim of discovering clusters of taxi routes. Tang
et al [19] used the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [20] algorithm
to cluster the locations of pick-up and drop-off points in their taxi GPS dataset; however, their main aim
was to describe the taxi trips using statistical models. Dodge et al [21] proposed their own framework to
assess movement similarity using symbolic representation of movement parameters and used it to cluster
Another form of clustering in the analysis of GPS data is the grouping of users based on their travel
patterns. Amichi et al [22] used Gaussian mixture models [23] to identify three different groups of people
based on their travel patterns - scouters, regulars, and routineers. However, this was mostly based on how
often each user traveled to new locations as compared to returning to previously visited ones. In the long
term, the number of recorded “visited” places will keep increasing while “new” locations become
few and far between, so this approach may not be applicable to long-term datasets.
Scherrer et al [24] went through a rigorous selection process for parameters and algorithms before
performing their clustering. Out of a total of four clustering algorithms, 𝑘-means clustering [25] appeared in at least
the top two combinations in their overall ranking, and they eventually used it to cluster users based
on the large amount of data they gathered from a mobile application. They were able to draw conclusions
from data that was gathered as a byproduct and hence without specific experimental aim or ground truth. As
this is similar to our use case, we eventually decided to use 𝑘-means clustering.
We aim to find some simple, understandable features within our data that will result in meaningful
interpretation of the clustering results. The review paper by Solmaz et al [26] classifies mobility metrics
into three different types - movement-based, link-based, and network-based metrics. As our focus is on how
the users travel, we place an emphasis on movement-based and link-based metrics such as visit frequency
and mean squared distance, rather than on the network-based metrics such as transmission count and energy
consumption.
Movement-based metrics include visit frequency and mean squared distance. A commonly used metric
that combines these is radius of gyration [27], which has been used in many papers [28, 29, 30]. It gives
the characteristic distance traveled by a user within a specified time period and is calculated as the root mean
squared distance of the user’s visited locations from the center of mass of those locations. We are interested in
non-Home and non-Work locations, thus we adapted this formula based on what we have in our dataset to
For a link-based metric, those mentioned by Solmaz et al such as node density and intercontact time
were difficult to apply in our dataset. We then considered Origin-Destination (OD) matrices, which have
been commonly used in literature for analyzing flows between locations. For example, they have been used
by Zhou et al [31] and Koh et al [32] to illustrate the probability of human flows between different fixed
nodes. However, we believe that they can be used to describe an individual’s probability of motion between
different locations as well, much like the links of a Markov chain model, which has been shown to have
relatively high prediction accuracy [33] for trajectories. Based on this, we propose a method of extracting
The overall data flow of this paper is summarized in Fig. 1. This section deals with the Data Collection
Timestamped GPS data was collected voluntarily through a user-installed smartphone application that
runs in the background and collects GPS data adaptively. When moving, data is recorded about once every
minute, whereas when the device is still or not moving, the data is recorded about once every five minutes.
Data collection was carried out over a range of different periods of time for different users. Usable data was
Table 1
Labels for the different POI types considered in the dataset.
Label Description
selected with the criteria of at least one month of valid data accounting for at least 50% of the recording
duration for each user, resulting in data from a total of 73 users being selected for use. The
sample size is relatively small because of the strict criteria applied to ensure the quality of the data used. Additionally, for
this paper, we focus mainly on the framework, which can be extended to other datasets with larger sample
Each detected point of the data consists of a latitude, longitude, start time, and end time. For each user,
individual points at similar coordinates were clustered together using a validation based stay point detection
algorithm [34] to identify points of interest (POIs). Home and Work locations for each user are then detected
from this set of POIs using frequency and stay duration given the time of day. Taking into consideration that
there may be differences in mobility patterns on Workdays and Offdays for the Working population, the POI
data was then separated into Workday data and Offday data. Workdays are taken to be days when the user
was detected at their Work location. Some of the users do not go to work full-time. For these users, all of
Next, each POI is labeled by its proximity to the nearest location with a specific type out of ten different
categories. If it is more than 400m away from any location with known POI types, it is left unlabeled. The
areas called subzones, which are small sections of planning areas delineated by the Urban Redevelopment
Authority of Singapore for statistical purposes[35]. These subzones were used in the extraction of clustering
This section will go into details of the feature extraction and clustering processes. These processes differ
slightly between the Workday and the Offday datasets. The aim of clustering these users is to find common
types of users based on their mobility patterns, and thus possibly derive insights into common mobility
patterns.
For the purposes of clustering, we extract two main types of features from each user. One is derived
from a proposed metric, Daily Characteristic Distance (DCD), while the other is derived from the Origin-
We are interested in mobility patterns for different users in the dataset. As different users have different
Home and Work coordinates, it is imperative to find a mobility metric that enables us to compare different
users despite this spatial restriction. One such metric in the literature to quantify the mobility of individuals
is the radius of gyration, which considers distances from the center of mass of a trajectory and is thus user-
dependent. The radius of gyration 𝑟𝑔 of a user 𝑎 from the start of their dataset up to a certain time 𝑡 was
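\[
r_g^a(t) = \sqrt{\frac{1}{n_c^a(t)} \sum_{i=1}^{n_c^a} \left(\vec{r}_i^{\,a} - \vec{r}_{cm}^{\,a}\right)^2} \tag{1}
\]
where $\vec{r}_i^{\,a}$ represents the $i = 1, ..., n_c^a(t)$ positions recorded for user $a$ and $\vec{r}_{cm}^{\,a} = \frac{1}{n_c^a(t)} \sum_{i=1}^{n_c^a} \vec{r}_i^{\,a}$ is the center of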
For the purpose of comparison between different users in our dataset, the time 𝑡 in the above equation is
taken to be the entire duration of each user’s dataset, as the duration varies between users. Thus, the value of
Figure 2: (a) Illustration of radius of gyration. The shaded gray circle is centered at the computed center of
mass, with a radius equal to the computed radius of gyration. (b) Comparison between radius of gyration (single
value, red line) and proposed Daily Characteristic Distance (set of values, bar plot) over the same time period.
Figure 3: Violinplots illustrating the DCD distributions of users on (a) Workdays, consisting of only Working
users, and (b) Offdays, consisting of both Working users on Offdays and Nonworking users on all days. (a) shows
a moderately high correlation between DCD peaks and Home-Work distance of each user, while (b) shows a
higher density of Working users with higher median DCD.
$n_c^a(t)$ becomes the total number of locations $N^a$ visited by user $a$ and the time dependency is removed. The
\[
r_g^a = \sqrt{\frac{1}{N^a} \sum_{i=1}^{N^a} \left(\vec{r}_i^{\,a} - \vec{r}_{cm}^{\,a}\right)^2} \tag{2}
\]
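As a rough illustration of the time-independent radius of gyration in Equation (2), the sketch below computes it for a handful of positions. It assumes, purely for simplicity, that positions are already projected to planar (x, y) coordinates in kilometers and that repeated visits appear as repeated entries; the function name and the sample data are hypothetical and not taken from the dataset.

```python
import math


def radius_of_gyration(points):
    """Radius of gyration of a list of (x, y) positions in km.

    Assumption: planar coordinates; a location visited several times
    appears several times in `points`.
    """
    n = len(points)
    if n == 0:
        raise ValueError("need at least one position")
    # Center of mass: the mean of all recorded positions.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Root of the mean squared distance to the center of mass.
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return math.sqrt(mean_sq)


# Hypothetical user: Home at (0, 0) and Work at (6, 8), visited equally often.
positions = [(0.0, 0.0), (6.0, 8.0), (0.0, 0.0), (6.0, 8.0)]
rg = radius_of_gyration(positions)
```

In this toy case the center of mass falls midway between the two locations, which mirrors the observation that frequently visited Home and Work locations dominate the metric.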
An illustration for this metric is shown in Fig. 2(a) for a user with a 40-day dataset. As expected, the center
of mass lies between the Home and Work location, as those locations are visited with a higher frequency
dataset has labels for Home locations of each user, we can add more meaning to the metric by using the
Home location of each user as the reference point for distance calculations instead of the computed center of
mass of the user’s visited locations. This will allow us to know the characteristic distance that the user travels
from their Home, which may have more physical meaning than a computed center of mass of their trajectory.
Secondly, the radius of gyration metric currently produces one value per user. We propose to break down
the dataset into days and compute a value for each individual day, thus obtaining a distribution of the daily
distances traveled by the user. Each day’s characteristic distance may be affected by whether or not the user
went to work on that day, which is part of what we want to investigate. Lastly, we want to investigate the
locations that the users visit outside of Home and Work. Therefore, we manually negate the contribution of
the Work and Home locations in the calculations by setting their distances to zero and removing the count
of Home and Work visits from the total value of 𝑁 𝑎 . As our proposed new metric computes a characteristic
distance for each day of the dataset, we call it Daily Characteristic Distance (DCD).
\[
DCD_d = \sqrt{\frac{1}{n_d} \sum_{i=1}^{n_d} f_{id} \times \left(\vec{r}_{id} - \vec{r}_{home}\right)^2} \tag{3}
\]
where 𝑛𝑑 is the number of unique POIs that the user traveled to on that day, 𝑓𝑖𝑑 is the number of times the
user traveled to location i where 𝑖 = 1, ..., 𝑛𝑑 on that day, and 𝑟⃗ℎ𝑜𝑚𝑒 is given by the mean coordinates of the
Home location of the user. Fig. 2(b) illustrates the difference between radius of gyration (one value per user)
and DCD (a set of values per user). The days without bars have a value of zero, indicating that on those days
the user traveled directly between Home and Work without visiting any other location.
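A minimal sketch of the DCD computation for a single day is given here, following Equation (3). It assumes that the distance from Home to each POI has already been computed (for example, as a geodesic distance in kilometers) and that Home and Work have already been dropped from the day's visits; the function name, the input layout, and the sample values are illustrative assumptions only.

```python
import math


def daily_characteristic_distance(visits):
    """DCD for one day.

    `visits` maps each unique non-Home, non-Work POI visited that day to a
    pair (f, dist_km): the number of visits and the distance from Home.
    Assumption: Home and Work are already excluded from `visits`.
    """
    n_d = len(visits)
    if n_d == 0:
        # Only Home and/or Work were visited, so the DCD is zero.
        return 0.0
    weighted_sq = sum(f * dist_km ** 2 for f, dist_km in visits.values())
    return math.sqrt(weighted_sq / n_d)


# Hypothetical days: one spent only at Home/Work, one with two other POIs.
dcd_home_work_only = daily_characteristic_distance({})
dcd_two_pois = daily_characteristic_distance(
    {"mall": (1, 3.0), "park": (1, 4.0)}
)
```

Computing one value per day in this way yields the per-user distribution of DCD values rather than a single number per user.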
The obtained DCD distributions of all users are plotted in Fig. 3. The Workday data of Working users
is plotted in Fig. 3(a), while the Offday data (consisting of Working users during Offdays and Nonworking
users on all days) is plotted in Fig. 3(b). In Fig. 3(a), we have sorted the distributions in ascending order
of Home-Work distance of each user. From this, we can see that there generally seems to be a relationship
between the Home-Work distance of the users and the location of the peaks of their DCD distributions. We
calculated the Pearson’s R-value between each user’s Home-Work distance and the median of their DCD
distribution and found that there was a moderately high R-value of 0.746, with a p-value of 2.90e−5.
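For readers who want to reproduce this kind of check, the sketch below computes Pearson's R between two lists of per-user values. The numbers are made up for illustration and are not the study's data; the p-value is not computed here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-user values (km): Home-Work distance and median DCD.
home_work_km = [2.0, 5.0, 9.0, 14.0, 21.0]
median_dcd_km = [1.5, 6.0, 7.5, 15.0, 19.0]
r = pearson_r(home_work_km, median_dcd_km)
```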
could not apply this method. Instead, we have sorted the distributions in ascending order of their median
point and colored the distributions according to whether the user is a Working or Nonworking user. From
Fig. 3(b), we can see that there is a higher concentration of Working users (blue) at the side with higher
median DCD. This may indicate that Working users tend to visit locations at further distances from their
To use this new metric DCD as a clustering feature, we first break down all the data for all users into
Workday and Offday data. We then separately compute the DCD values for each day and plot separate
histograms. The histograms are plotted in Fig. 4. From these histograms, we obtain the four thresholds of
Figure 4: Histograms of the number of days within the whole dataset of (a) DCD value on Workdays, and (b)
DCD value on Offdays. Note that (b) has been cropped vertically to show greater detail - the leftmost bar has
an actual value of 3312, of which 3137 have a value of 0.
Each user’s data is then broken down into Workday (if applicable) and Offday data. For each type of
data, we calculate the percentage of DCD values that fall within each of the determined thresholds. This
gives us four features for each type of data that add up to 1.0. An example of the features for one user, User
2, is shown in Table 2.
These four features will be used in conjunction with the 16 features derived from the Origin-Destination
Table 2
Example of the four DCD features for Workday and Offday data.
Day Type   Home/Work   0-5km   5-15km   >15km
Workday    0.60        0.23    0.15     0.02
Offday     0.38        0.16    0.41     0.05
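The four DCD features can be sketched as a simple binning step using the column headings of Table 2. The code assumes that a DCD of exactly zero corresponds to the Home/Work column and that the upper edge of each distance bin is inclusive; both choices, along with the function name and sample values, are assumptions made for illustration.

```python
def dcd_features(dcd_values_km):
    """Fractions of days in each bin: Home/Work, 0-5km, 5-15km, >15km.

    Assumption: DCD == 0 means only Home/Work were visited, and bin upper
    edges (5 km, 15 km) are inclusive.
    """
    if not dcd_values_km:
        raise ValueError("need at least one day of data")
    counts = [0, 0, 0, 0]
    for dcd in dcd_values_km:
        if dcd == 0:
            counts[0] += 1
        elif dcd <= 5:
            counts[1] += 1
        elif dcd <= 15:
            counts[2] += 1
        else:
            counts[3] += 1
    total = len(dcd_values_km)
    return [count / total for count in counts]


# Hypothetical Workday DCD values (km) for ten days of one user.
workday_dcd = [0, 0, 0, 0, 0, 0, 2.5, 3.1, 7.0, 16.2]
features = dcd_features(workday_dcd)
```

Because each day falls into exactly one bin, the four fractions add up to 1.0, as in the rows of Table 2.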
From each user’s trajectory, we can extract distances from each user’s Home (and Work location if
applicable) to the other POIs that the user visits. For Offdays, we can simply use the distance from Home to
that POI as there is no other location that is common to all users. However, there is an additional important
location for Workdays, which is the Work location of the user. Therefore, on Workdays, the distance value of
each POI, referred to as minimum distance, is taken as the minimum of the distances between the POI and
the user’s Home location and between the POI and the user’s Work location. This is to detect any specific
locations that users may go to that are near neither their Home location nor their Work location and are hence “out
of the way” from the user’s point of view. After extracting these distances separately from the Workday and
Offday data, the corresponding histograms are plotted. These can be seen in Fig. 5. We obtain the thresholds
visually by selecting suitable valleys in the histograms. The thresholds for Workdays are 0km (direct trips
between Home and Work), 0-2km, 2-8km, and >8km, while the thresholds for Offdays are 0-1km, 1-5km,
After getting these thresholds, the trips made by a user are now categorized based on these thresholds. We
are interested in the combinations of movements that users make from threshold to threshold, and whether
these will be a significant distinguishing factor between different users’ mobility patterns. Taking an example
of a user with Workday data, a trip consists of going from location A to location B, where threshold A is
on the row of the matrix and threshold B is on the column of the matrix. If A is located within 0-2km and
B is located within 2-8km, the number corresponding to the “0-2km" row and the “2-8km" column will be
increased by 1. After the trips are all counted for a user, the matrix is normalized by the total number of
trips counted for that user, such that all 16 elements of this matrix add up to 1.0. This is to make the data
comparable between different users. An example of the resulting matrix using the Workday thresholds can
be seen in Table 3. The Offday matrix and features are computed similarly.
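The counting and normalization steps for the Workday OD matrix can be sketched as follows. The sketch assumes each trip is given as a pair of minimum distances (origin, destination) in kilometers, that Home and Work both have a minimum distance of zero, and that bin upper edges are inclusive; all function names and the sample day are hypothetical.

```python
def minimum_distance(dist_home_km, dist_work_km):
    """Workday distance value of a POI: the smaller of its distances to Home and Work."""
    return min(dist_home_km, dist_work_km)


def workday_bin(min_dist_km):
    """Index of the Workday threshold: 0 = Home/Work, 1 = 0-2km, 2 = 2-8km, 3 = >8km."""
    if min_dist_km == 0:
        return 0
    if min_dist_km <= 2:
        return 1
    if min_dist_km <= 8:
        return 2
    return 3


def od_matrix_features(trips):
    """Normalized 4x4 OD matrix from (origin_km, destination_km) pairs."""
    if not trips:
        raise ValueError("need at least one trip")
    matrix = [[0] * 4 for _ in range(4)]
    for origin_km, dest_km in trips:
        # Row = threshold of the origin, column = threshold of the destination.
        matrix[workday_bin(origin_km)][workday_bin(dest_km)] += 1
    total = len(trips)
    return [[count / total for count in row] for row in matrix]


# Hypothetical Workday: Home -> cafe -> Work -> mall -> Home.
cafe = minimum_distance(1.5, 9.0)   # 1.5 km from Home, so in the 0-2km bin
mall = minimum_distance(12.0, 5.0)  # 5.0 km from Work, so in the 2-8km bin
trips = [(0, cafe), (cafe, 0), (0, mall), (mall, 0)]
od = od_matrix_features(trips)
features = [value for row in od for value in row]  # 16 clustering features
```

Flattening the matrix row by row gives the 16 values per user; the Offday version would differ only in the bin edges and in using the distance from Home alone.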
Figure 5: Histograms of the number of POIs visited over the whole dataset of (a) minimum distance between
Home and Work to that location on Workdays, and (b) distance from home to that location on Offdays.
Table 3
Example of the 16 OD matrix features for Workday and Offday data.
(a) Workday features
Threshold   Home/Work   0-2km   2-8km   >8km
Home/Work   0.67        0.08    0.04    0.04
0-2km       0.08        0.00    0.00    0.00
2-8km       0.04        0.00    0.00    0.00
>8km        0.03        0.00    0.00    0.01
(b) Offday features
Threshold   0-1km   1-5km   5-15km   >15km
0-1km       0.00    0.08    0.23     0.03
1-5km       0.08    0.03    0.00     0.00
5-15km      0.19    0.04    0.29     0.00
>15km       0.02    0.00    0.01     0.00
We do not count trips occurring on different calendar dates (i.e. from the last POI on one day to the first
POI the next morning), and we also do not count trips that occur within the same subzone (e.g. Home to
Home).
These 16 features are then used in conjunction with the four features from the above DCD calculations
to form the 20 features used in clustering, which will be discussed in detail in the next subsection.
After extracting the features for each user as in the above subsection, we used the sum-of-squared
errors (SSE) plot to decide on the number of clusters, 𝑘, to be used as the input parameter for 𝑘-means
clustering[25]. The SSE plot measures the sum of all squared errors from the clustered points to their
respective cluster centers after being grouped using each value of 𝑘. As the value of 𝑘 increases, the SSE
naturally decreases, but a good value for 𝑘 would be one located at the ‘elbow’ of the plot, just before the
Figure 6: SSE plots used to derive (a) the optimal number of clusters for Workdays and (b) the optimal number
of clusters for Offdays. Both plots indicate 3 as a suitable value for 𝑘, the number of clusters.
decrease in SSE becomes less than proportionate to the increase in 𝑘. The SSE plots for our dataset can be
seen in Fig. 6, where Fig. 6(a) shows the plot using the data from Workdays, while Fig. 6(b) shows the plot
using the data from Offdays. From both SSE plots, the ‘elbow’ of the plot indicates that a good value of 𝑘
to use would be 𝑘 = 3. The detailed results are plotted in the following sections, with the Workday results
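To make the SSE-versus-𝑘 idea concrete, the following sketch runs a plain Lloyd-style 𝑘-means on a small synthetic 2-D dataset and records the SSE for several values of 𝑘. It is not the implementation used in this paper: the deterministic farthest-point initialization, the toy data, and all function names are assumptions chosen so that the example is reproducible with the standard library alone.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sse(points, k, iters=100):
    """Run Lloyd's k-means and return the final sum of squared errors."""
    # Assumption: deterministic farthest-point initialization, used here
    # only for reproducibility; random restarts are more common in practice.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Synthetic data: three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 9), (5, 10), (6, 9), (6, 10),
]
sse = {k: kmeans_sse(points, k) for k in range(1, 6)}
```

On data like this the SSE falls steeply up to 𝑘 = 3 and only marginally afterwards, which is the 'elbow' behavior used to choose 𝑘.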
Figure 7: Centroid values of the three clusters obtained from clustering Workday data. Cluster W1 has the highest
percentage of trips directly between Home and Work, as well as the highest percentage of days spent only at
Home or Work. The other two clusters W2 and W3 are in descending order of percentage of trips directly between
Home and Work.
This section focuses on the results obtained from clustering the Workday data of working users. The
eventual aim of this clustering is to separate users into different clusters based on their workday data. Further
analysis is then performed on the identified clusters, which consists of the DCD violinplots for each user in
each cluster, as well as user commonality and average frequency heatmaps, which are explained in detail
later on. The same process will be repeated for the Offday data in the next section.
The centroid values of each cluster are shown in Fig. 7. The clusters are named W1, W2, and W3
respectively (W stands for Workday). Visually, it seems that the clusters are separated mainly based on the
percentage of Home/Work trips out of the total number of trips taken by the user, with cluster W1 having the
highest average percentage of direct Home-Work trips at 77% followed by W2 with 41% and W3 with 21%.
The DCD features below each OD matrix show that there are similar average percentages of Home/Work
Users from Cluster W1 have a large majority - on average 72% - of their Workdays where they do not
visit any other locations. The average percentage of their days spent with DCD at each distance threshold
Looking at Cluster W2, the DCD features are roughly evenly spread across the first three distance
thresholds. Since there is a higher value of DCD being within 5-15km as compared to 0-5km, while the
percentage of trips from the OD matrix indicate a higher emphasis on minimum distance between 0-2km, it
is likely that some of the locations, which are 5-15km from their Home, are actually within 0-2km of their
For Cluster W3, the DCD features have a surprisingly high average value of 55% in the 5-15km threshold
as compared to 12% and 31% in the other two clusters. It also has a much lower average value of 8% in the 0-5km
threshold, as compared to 15% and 27% in the other clusters. As the average percentage of Home/Work direct
trips from the OD matrix is also quite low at 21%, it can be interpreted that the users in this cluster usually
Overall, the clusters can be described as mainly Home/Work Only (W1), frequent short trips in terms of
Figure 8: Violinplots illustrating each user's DCD distribution within each cluster on Workdays.
Figure 9: Heatmaps for each of the three Workday clusters (W1, W2, W3) showing (a) User Commonality and (b) Average
Frequency. The colormap scales for (b) are narrowed to 0.25 to better show the contrast between the different
squares.
Fig. 8 shows the DCD violinplot for each individual user in each cluster, sorted in ascending order of
their Home-Work distance. These violinplots do not show the percentage of days spent only between Work
and Home, as we are interested in the POIs that are not Home and not Work. From this figure, we observe
from the yellow highlighted portion that most of the users in cluster W1 have a low Home-Work distance,
below 5km. This may be a factor in these users having the highest percentage of direct trips between Home
and Work out of the three clusters on Workdays. The users in cluster W2 have Home-Work distances in
the middle range, and usually the peaks of their DCD distributions are located close to the Home-Work
distances. This is also reflected in their OD matrix, in which this cluster has the highest percentage of trips
four users have a large Home-Work distance of over 20km. Three out of the four users have DCD peaks near
their Home-Work distance, but those are not reflected in the centroid OD matrix, perhaps because they travel
to other places that are the same distance from their Home as well as their Work location. These DCD plots
are in agreement with the DCD features for Cluster W3, as the bulk of their DCD distributions are located
The next two parts of cluster analysis, what we will call User Commonality and Average Frequency, are
shown in Fig. 9 (a) and (b) respectively. Both of these types of analysis use the same distance thresholds for
minimum distance that were used for the OD matrix features, broken down into each of the ten different POI
categories that were labeled in the data. To represent User Commonality, each square in Fig. 9(a) shows the
percentage of users within the cluster who fulfilled each minimum distance and POI label combination at
least once in their trajectory. The aim of this is to see whether there is any minimum distance and POI label
combination that is favored by the users in each cluster. The value of each heatmap square 𝑢𝑗𝑘 , in row 𝑗 and
\[
u_{jk} = \frac{n_{jk}}{n_c} \tag{4}
\]
where 𝑛𝑗𝑘 is the number of users within the cluster who visited a POI at distance threshold 𝑗 with label 𝑘,
Meanwhile, Fig. 9(b) shows the Average Frequency, taken as a percentage of the user’s total trips and
averaged over all users in the cluster, of each minimum distance and POI label combination. The value of
each heatmap square 𝑓𝑗𝑘 , in row 𝑗 and column 𝑘, is given by Equations 5 and 6:
\[
P_{ijk} = \frac{p_{ijk}}{p_i} \tag{5}
\]
\[
f_{jk} = \frac{\sum_{i=1}^{n_c} P_{ijk}}{n_c} \tag{6}
\]
where $p_{ijk}$ is the number of POIs that user 𝑖 visited at distance threshold 𝑗 with label 𝑘, 𝑝𝑖 is the total number of labeled
POIs visited by user 𝑖, and 𝑛𝑐 is the total number of users within the cluster.
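The two heatmap quantities in Equations 4 to 6 can be sketched as below. The sketch assumes each user in a cluster is represented by a dictionary mapping a (distance threshold, POI label) pair to the number of labeled POIs visited in that combination; this data layout, the function names, and the sample counts are illustrative assumptions.

```python
def user_commonality(cluster_counts, combos):
    """Share of users in the cluster who visited each combination at least once."""
    n_c = len(cluster_counts)
    return {
        combo: sum(1 for counts in cluster_counts if counts.get(combo, 0) > 0) / n_c
        for combo in combos
    }


def average_frequency(cluster_counts, combos):
    """Per-user share of labeled POI visits in each combination, averaged over users."""
    n_c = len(cluster_counts)
    result = {}
    for combo in combos:
        total = 0.0
        for counts in cluster_counts:
            p_i = sum(counts.values())  # total labeled POIs visited by this user
            if p_i > 0:
                total += counts.get(combo, 0) / p_i
        result[combo] = total / n_c
    return result


# Hypothetical cluster with two users.
mall = ("0-2km", "Shopping Mall")
park = (">8km", "Park")
cluster_counts = [
    {mall: 3, park: 1},
    {mall: 1},
]
u = user_commonality(cluster_counts, [mall, park])
f = average_frequency(cluster_counts, [mall, park])
```

In this toy cluster both users visit a nearby mall, but only one visits a distant park and does so rarely, so the two measures diverge for that combination.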
From Fig. 9(a), it can be seen that there is no single distance threshold and POI label combination
that is visited by 100% of the users in each cluster, except for Shopping Malls at a minimum distance of
>8km for Cluster W3. However, quite a high percentage of users in the other two clusters visit this distance
threshold/POI label combination as well. Other common combinations that appear in all three clusters are:
Neighborhood Center, Shopping Mall, and Residential, all within the 0-2km threshold. The distinguishing
features of the clusters can be summarized as follows. Cluster W1 has a visible percentage at Recreational
at >8km minimum distance as compared to the others. More of the users in cluster W2 visit Attractions
at a minimum distance of more than 8km as compared to the other clusters. Cluster W3 has a much higher
percentage of total users that go to Residential POIs at the 2-8km threshold as compared to the rest.
From Fig. 9(b), it can be seen that the POI label with common frequency among the three clusters is
Neighborhood Center at 0-2km. Cluster W1 has highest Average Frequency at Residential POIs within 0-
2km and Shopping Malls at >8km. Cluster W2 has a visible frequency at Healthcare at 0-2km, something
which is not seen in the other two clusters. Cluster W3 has a visible frequency at the Park and Recreational
Comparing the two parts of Fig. 9, we can see that although there are some distance and label
combinations that have more users in each cluster that visit them, it does not necessarily mean that they
visit them frequently. The label/distance combinations that are visited frequently are a subset of those that
This section describes the results obtained from clustering the Offday data of all users. The process is
the same as the one used for the Workday data in Section V. The three clusters here are labeled O1, O2, and
The values of each cluster’s centroids are plotted in Fig. 10. We can observe that these clusters show
similar trends to the Workday clusters in that there are those that stay mostly at Home Only (O1), those that
Figure 10: Centroid values of the three clusters obtained from clustering Offday data. Cluster (a) has the highest
percentage of days spent at Home only, while cluster (b) has the highest percentage of days with DCD between
0 to 5 km, meaning they went to at least one other non-Home location. Cluster (c) has the highest percentage
of days with DCD in the 5 to 15km range, indicating that they generally travel the furthest on Offdays.
Figure 11: Violinplots illustrating each user's DCD distribution within each cluster (O1, O2, O3) on Offdays.
make mostly short trips (O2), and those that make mostly longer trips (O3). The users in Cluster O1 spent
71% of their Offdays only at their Home location. Cluster O2 users spent on average 21% of their days at their
Home location, and 58% of their days have a DCD value of 0-5km. The average percentages for Cluster O3
are more evenly split between the Home Only and the first two distance categories, with the highest being
When observed together with each cluster’s corresponding OD matrix, we see that Cluster O2 actually
has the highest average percentage of trips within 0-1km at 46% compared to 32% for Cluster O1.
Additionally, we see that the percentages of trips going between the 0-1km threshold and further thresholds
in O1 stay at home only for more days than those in O2, they tend to travel further when they do go out,
whereas those in O2 could go out on more days but stay within 0-1km for most of their trips. The users in
cluster O3 seem to have more of a balance between staying at home and going out to near or further places.
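To illustrate how such trip percentages can be read off an OD matrix, the following is a minimal sketch rather than the exact procedure used in this study. It assumes that each stay point is described by its distance from Home, that the Offday distance thresholds are 1, 5, and 15 km, and that percentages are taken over all trips of a user. The names `distance_bin` and `od_percentages` are hypothetical.

```python
import numpy as np

# Assumed distance categories relative to Home; index 0 is Home itself.
BIN_LABELS = ["Home", "0-1km", "1-5km", "5-15km", ">15km"]
EDGES_KM = [1.0, 5.0, 15.0]  # assumed Offday thresholds


def distance_bin(dist_km):
    """Map a distance from Home (km) to a category index; 0.0 means Home."""
    if dist_km == 0.0:
        return 0
    return 1 + int(np.searchsorted(EDGES_KM, dist_km, side="left"))


def od_percentages(stay_distances_km):
    """Share of trips between each pair of distance categories.

    stay_distances_km is the chronological sequence of stay points,
    each given as its distance from Home in km.
    """
    bins = [distance_bin(d) for d in stay_distances_km]
    counts = np.zeros((len(BIN_LABELS), len(BIN_LABELS)))
    # Each consecutive pair of stay points is one origin-destination trip.
    for origin, dest in zip(bins[:-1], bins[1:]):
        counts[origin, dest] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


# Toy sequence: Home, two nearby stops, Home, one farther stop, Home.
od = od_percentages([0.0, 0.4, 0.8, 0.0, 3.0, 0.0])
within_0_1km = od[1, 1]  # share of trips that start and end within 0-1km
```

Averaging the `od[1, 1]` entry over the users of a cluster would give a figure comparable in spirit to the within-0-1km percentages discussed above, under the stated assumptions.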
Figure 12: Heatmaps for each of the three Offday clusters (O1, O2, O3) showing (a) User Commonality and (b) Average
Frequency. The colormap scales for (b) are narrowed to 0.17 to better show the contrast between the different
squares.
The violin plots representing the DCD distribution of each user within each cluster have been plotted in
Fig. 11. Similarly to before, the Home Only days are not reflected on this plot as we are more interested in
Cluster O1 and Cluster O2 both contain predominantly Nonworking users, while the bulk of the Working
users are in Cluster O3. Qualitatively speaking, Cluster O1 seems to lie in the middle of Clusters O2 and
O3. The median DCDs of the users in Cluster O2 are limited to the 0-5km range, which agrees with the
DCD features observed in Fig. 10 and further emphasizes that this group of users makes mostly short trips.
Although the median values of Cluster O3 are not always higher than those in O2, the bulk of the DCD
distributions for Cluster O3 lies above 5km, which is the distance threshold for longer trips in this case.
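As a rough illustration of the per-user quantities discussed here, the sketch below computes the share of days in each DCD category and the median DCD over non-Home-Only days. It is a sketch under stated assumptions: the input format (one DCD value per day, with `None` marking a Home Only day), the `>15km` category, and the name `dcd_features` are hypothetical and chosen for this example only.

```python
import numpy as np

# Assumed categories; the last one is an assumption for this example.
CATEGORIES = ["Home Only", "0-5km", "5-15km", ">15km"]


def dcd_features(daily_dcd_km):
    """Return (share of days per category, median DCD of days spent out).

    daily_dcd_km holds one entry per day: the DCD in km, or None for a
    day spent at Home only.
    """
    n_days = len(daily_dcd_km)
    out = np.array([d for d in daily_dcd_km if d is not None], dtype=float)
    counts = np.array([
        n_days - out.size,                      # Home Only days
        np.sum(out <= 5.0),                     # short trips
        np.sum((out > 5.0) & (out <= 15.0)),    # longer trips
        np.sum(out > 15.0),                     # beyond the last threshold
    ], dtype=float)
    # Home Only days are excluded from the median, as in the violin plots.
    median = float(np.median(out)) if out.size else float("nan")
    return counts / n_days, median


# Toy user with ten Offdays, four of them spent at Home only.
shares, median_dcd = dcd_features(
    [None, None, 2.0, 4.0, 7.0, None, 1.0, 20.0, 3.0, None]
)
```

For this toy user, the shares are 0.4, 0.4, 0.1, and 0.1 across the four categories, and the median DCD over the six days spent out is 3.5 km.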
User Commonality and Average Frequency of each cluster are obtained as described earlier in Section
V-C, and plotted in Fig. 12. From Fig. 12(a), we observe that the same three main POI labels
are commonly visited by users, namely Neighborhood Center, Shopping Mall, and Residential.
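The two metrics can be sketched as follows. This is a hedged illustration, not the definition given in Section V-C: it assumes that User Commonality is the fraction of users in a cluster who visited a label/distance combination at least once, and that Average Frequency is the mean, over users in the cluster, of the share of each user's visits that went to that combination. The function name and input format are hypothetical.

```python
from collections import Counter


def commonality_and_frequency(cluster_visits):
    """Compute both metrics for one cluster.

    cluster_visits maps a user id to that user's list of visits, each
    visit being a (poi_label, distance_category) tuple.
    """
    n_users = len(cluster_visits)
    commonality = Counter()
    frequency = Counter()
    for visits in cluster_visits.values():
        if not visits:
            continue  # a user with no visits contributes zero everywhere
        counts = Counter(visits)
        for combo, n in counts.items():
            # Each user counts once towards commonality ...
            commonality[combo] += 1 / n_users
            # ... and contributes their own visit share towards frequency.
            frequency[combo] += n / len(visits) / n_users
    return dict(commonality), dict(frequency)


# Toy cluster with two users.
toy_cluster = {
    "u1": [("Shopping Mall", "0-1km")] * 3 + [("Park", "1-5km")],
    "u2": [("Shopping Mall", "0-1km"), ("Residential", "0-1km")],
}
commonality, frequency = commonality_and_frequency(toy_cluster)
```

In this toy cluster, the Park combination is visited by half of the users but accounts for only a small share of visits, which mirrors the point that a commonly visited combination is not necessarily a frequently visited one.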
A higher percentage of users in Cluster O3 visit Park and Recreational areas within
1-15km, as well as Attractions in the 5-15km range from their homes.
Looking at Fig. 12(b), we see that for all three clusters, the three highest frequency labels are the
same as the highest commonality labels. However, for Shopping Malls, the highest frequency distance
threshold differs for each cluster. For Cluster O1, the frequency is higher for Shopping Malls in the 1-5km
range. For Cluster O2, the frequency is concentrated at the 0-1km range for Shopping Malls and similarly
for Neighborhood Center and Residential areas. We can infer that the frequent short trips for this cluster
are mainly for the purpose of visiting locations with those three labels. For Cluster O3, the frequency is
prominently concentrated at Shopping Malls in the 5-15km range, and the frequency for Residential areas is
7. Conclusion
In this paper, we investigated the differences between the GPS trajectory patterns of Workday and Offday
data, as well as Working and Nonworking users. To do so, we proposed a new mobility metric based on radius
of gyration, named Daily Characteristic Distance (DCD), to zoom in on the locations outside of Home (and
Work, if applicable) that the users visited. We discover that Working users’ median DCD on Workdays is
highly correlated with the distance between their Home and Work locations, and that Working users generally
We then used features derived from DCD in conjunction with those derived from the users’ Origin-
Destination matrices to cluster the users in our dataset. We find that we can group users’ mobility into three
types for both Workdays and Offdays: those that mainly stick to Home (and Work
if applicable), those that make frequent short trips, and those that make longer trips. We also propose two
new metrics for cluster analysis, namely User Commonality and Average Frequency, to better assess
the labels and distances of different locations that are favored by the users in different clusters. We discover
that three main POI labels are favored regardless of cluster - Neighborhood Centers, Shopping Malls, and
as well as the presence or absence of some other labels such as Attraction, Parks, and Recreational areas.
Urban planners could use this framework on their own target datasets as a case study to discover the types
of places that would be beneficial to locate near their intended residential environment. It is important to
note that, while our proposed framework is general, the results that we have obtained depend on the
data that we gathered in Singapore, and thus results may differ widely if our framework is used on data
Currently, our work examines the users’ data and clusters them separately for Workdays and Offdays.
There could be more insights to be drawn from linking both the Workday and Offday mobility features of
the individual Working users together and examining the resulting features for new correlations. This could
Acknowledgements
This research is supported by the Singapore Ministry of National Development and the National Research
Foundation, Prime Minister’s Office under the Land and Liveability National Innovation Challenge (L2 NIC)
Research Programme (L2 NIC Award No. L2NICTDF1-2017-4). Any opinion, findings, and conclusions or
recommendations expressed in this material are those of the author(s) and do not reflect the views of the
Singapore Ministry of National Development and National Research Foundation, Prime Minister’s Office,
Singapore.
Zann Koh: Methodology, Formal analysis, Writing - Original draft preparation, Writing - Review and
editing. Yuren Zhou: Investigation, Writing - Review and editing. Billy Pik Lik Lau: Data curation,
Writing - Review and editing. Ran Liu: Conceptualization of this study, Data curation. Chau Yuen: Project
administration, Supervision, Writing - Review and editing. Keng Hua Chong: Conceptualization of this
References
[1] S. Van der Spek, J. Van Schaick, P. De Bois, R. De Haan, Sensing human activity: GPS tracking, Sensors 9 (4) (2009) 3033–3055.
[3] M. Hast, K. M. Searle, M. Chaponda, J. Lupiya, J. Lubinda, J. Sikalima, T. Kobayashi, T. Shields, M. Mulenga, J. Lessler, et al., The use
of GPS data loggers to describe the impact of spatio-temporal movement patterns on malaria control in a high-transmission area of northern
[4] A. Solomon, A. Bar, C. Yanai, B. Shapira, L. Rokach, Predict demographic information using word2vec on spatial trajectories, in: Proceedings
of the 26th Conference on User Modeling, Adaptation and Personalization, 2018, pp. 331–339.
[5] L. Wu, L. Yang, Z. Huang, Y. Wang, Y. Chai, X. Peng, Y. Liu, Inferring demographics from human trajectories and geographical context,
[6] K. Siła-Nowicka, J. Vandrol, T. Oshan, J. A. Long, U. Demšar, A. S. Fotheringham, Analysis of human mobility patterns from GPS trajectories
and contextual information, International Journal of Geographical Information Science 30 (5) (2016) 881–906.
[7] J. Long, D. Reuschke, Daily mobility patterns of small business owners and homeworkers in post-industrial cities, Computers, Environment
[8] N. Brouwers, M. Woehrle, Dwelling in the canyons: Dwelling detection in urban environments using GPS, Wi-Fi, and geolocation, Pervasive
[9] S. H. Marakkalage, S. Sarica, B. P. L. Lau, S. K. Viswanath, T. Balasubramaniam, C. Yuen, B. Yuen, J. Luo, R. Nayak, Understanding the
lifestyle of older population: Mobile crowdsensing approach, IEEE Transactions on Computational Social Systems 6 (1) (2018) 82–95.
[10] L. Zhu, J. Gonder, L. Lin, Prediction of individual social-demographic role based on travel behavior variability using long-term GPS data,
[11] B.-h. Nahmias-Biran, Y. Han, S. Bekhor, F. Zhao, C. Zegras, M. Ben-Akiva, Enriching activity-based models using smartphone-based travel
[12] Y. Zheng, Q. Li, Y. Chen, X. Xie, W.-Y. Ma, Understanding mobility based on GPS data, in: Proceedings of the 10th international conference
[13] S. H. Marakkalage, B. P. L. Lau, Y. Zhou, R. Liu, C. Yuen, W. Q. Yow, K. H. Chong, Wifi fingerprint clustering for urban mobility analysis,
[14] L. Alessandretti, P. Sapiezynski, S. Lehmann, A. Baronchelli, Multi-scale spatio-temporal analysis of human mobility, PLoS ONE 12 (2) (2017)
e0171686.
[15] Y. Wang, K. Qin, Y. Chen, P. Zhao, Detecting anomalous trajectories and behavior patterns using hierarchical clustering from taxi GPS data,
[16] G. Solmaz, D. Turgut, Modeling pedestrian mobility in disaster areas, Pervasive and Mobile Computing 40 (2017) 104–122.
[17] E. Cesario, C. Comito, D. Talia, An approach for the discovery and validation of urban mobility patterns, Pervasive and Mobile Computing
42 (2017) 77–92.
[18] D. Kumar, H. Wu, S. Rajasegarar, C. Leckie, S. Krishnaswamy, M. Palaniswami, Fast and scalable big data trajectory clustering for
understanding urban mobility, IEEE Transactions on Intelligent Transportation Systems 19 (11) (2018) 3709–3722.
[19] J. Tang, F. Liu, Y. Wang, H. Wang, Uncovering urban human mobility from large scale taxi GPS data, Physica A: Statistical Mechanics and
[20] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in:
[22] L. Amichi, A. C. Viana, M. Crovella, A. F. Loureiro, Mobility profiling: Identifying scouters in the crowd, in: Proceedings of the 15th
International Conference on emerging Networking EXperiments and Technologies, 2019, pp. 9–11.
[24] L. Scherrer, M. Tomko, P. Ranacher, R. Weibel, Travelers or locals? Identifying meaningful sub-populations from human movement data in
the absence of ground truth, EPJ Data Science 7 (1) (2018) 19.
[25] S. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory 28 (2) (1982) 129–137.
[26] G. Solmaz, D. Turgut, A survey of human mobility models, IEEE Access 7 (2019) 125711–125731.
[27] M. C. Gonzalez, C. A. Hidalgo, A.-L. Barabasi, Understanding individual human mobility patterns, Nature 453 (7196) (2008) 779.
[28] L. Pappalardo, F. Simini, S. Rinzivillo, D. Pedreschi, F. Giannotti, A.-L. Barabási, Returners and explorers dichotomy in human mobility,
[29] E. Pepe, P. Bajardi, L. Gauvin, F. Privitera, B. Lake, C. Cattuto, M. Tizzoni, Covid-19 outbreak response: a first assessment of mobility changes
[30] Y. Xu, A. Belyi, I. Bojic, C. Ratti, Human mobility and socioeconomic status: Analysis of Singapore and Boston, Computers, Environment
[31] Y. Zhou, B. P. L. Lau, Z. Koh, C. Yuen, B. K. K. Ng, Understanding crowd behaviors in a social event by passive wifi sensing and data mining,
[32] Z. Koh, Y. Zhou, B. P. L. Lau, C. Yuen, B. Tuncer, K. H. Chong, Multiple-perspective clustering of passive wi-fi sensing trajectory data, IEEE
[33] X. Lu, E. Wetter, N. Bharti, A. J. Tatem, L. Bengtsson, Approaching the limit of predictability in human mobility, Scientific Reports 3 (1)
(2013) 1–9.
[34] B. P. L. Lau, M. S. Hasala, V. S. Kadaba, B. Thirunavukarasu, C. Yuen, B. Yuen, R. Nayak, Extracting point of interest and classifying
environment for low sampling crowd sensing smartphone sensor data, in: 2017 IEEE International Conference on Pervasive Computing and
[35] Urban Redevelopment Authority, Master Plan, boundary data retrieved from URA Master Plan 2019, https://data.gov.sg/dataset/
master-plan-2019-subzone-boundary-no-sea (2019).
Zann Koh received the B.Eng degree in Engineering and Product Development from the Singapore University of Technology
and Design, Singapore, in 2017. She is currently pursuing the Ph.D. degree with the Singapore University of Technology
and Design, Singapore, under Dr. Chau Yuen’s supervision. Her current research interests include big data analysis, data
Yuren Zhou received the B.Eng. degree in Electrical Engineering from Harbin Institute of Technology, Harbin, China in
2014, and the Ph.D. degree from Singapore University of Technology and Design, Singapore in 2019, with a focus on data
mining and smart city applications. He is currently a postdoctoral research fellow at Singapore University of Technology
and Design. His current research interests include big data analytics and its application in urban human mobility, building
Billy Pik Lik Lau received the B.Sc. and M.Phil. degrees in Computer Science from Curtin University, Perth, WA, Australia in
2010 and 2014, respectively. He also received the Ph.D. degree from the Singapore University of Technology and Design (SUTD) in 2021
and is currently working as a postdoctoral fellow with Dr. Chau Yuen. During his master's study, he studied the cooperation
rate between agents in multi-agent systems, whereas during his PhD study, he investigated the application of machine
learning in different geographical scales to study human activities. His current research includes algorithm design, multi-
agent architecture design, urban science, data analysis and machine learning.
Ran Liu received the B.S. degree from the Southwest University of Science and Technology, China, in 2007, and the Ph.D.
degree from the University of Tuebingen, Germany, in 2014, under the supervision of Prof. Andreas Zell. Since 2014, he
has been a Research Fellow under the supervision of Prof. Chau Yuen with the MIT International Design Center, Singapore
University of Technology and Design, Singapore. His research interests include robotics, indoor positioning, and SLAM.
Chau Yuen is currently an Associate Professor at Singapore University of Technology and Design. He received the B.Eng.
and Ph.D. degrees from Nanyang Technological University, Singapore, in 2000 and 2004, respectively. He was a Postdoctoral
Fellow at Lucent Technologies Bell Labs, Murray Hill, NJ, USA, in 2005. He was a Visiting Assistant Professor at The Hong
Kong Polytechnic University in 2008. From 2006 to 2010, he was a Senior Research Engineer at the Institute for Infocomm
Research (I2R, Singapore), where he was involved in an industrial project on developing an 802.11n Wireless LAN system,
and participated actively in 3GPP Long Term Evolution (LTE) and LTE-Advanced (LTE-A) Standardization. He has been
with the Singapore University of Technology and Design since 2010. He is a recipient of the Lee Kuan Yew Gold Medal,
the Institution of Electrical Engineers Book Prize, the Institute of Engineering of Singapore Gold Medal, the Merck Sharp and Dohme Gold Medal,
and twice the recipient of the Hewlett Packard Prize. He received the IEEE Asia-Pacific Outstanding Young Researcher Award in 2012. He serves
as an Editor for the IEEE Transactions on Communications and the IEEE Transactions on Vehicular Technology and was awarded the Top Associate
Keng Hua Chong is Associate Professor of Architecture and Sustainable Design at the Singapore University of Technology
and Design (SUTD), where he directs the Social Urban Research Groupe (SURGe) and co-leads the Opportunity Lab
(O-Lab). His research on social architecture particularly in the areas of ageing population, liveable place and data-driven
collaborative design has led to several key publications and projects, including Creative Ageing Cities, Second Beginnings,