2020 Shan
Marketplace Vendors
Sylvester Shan
u6049249
Supervised by
Ramesh Sankaranarayana
June 2020
© Sylvester Shan 2020
Except where otherwise indicated, this thesis is my own original work.
Sylvester Shan
June 12, 2020
I am dedicating this thesis to my family and dear friends. Without the support and
help from everyone, I would not have made it this far.
Thank you
Acknowledgments
I want to thank my parents and sister for the support that I received from them.
I would like to acknowledge my supervisor Ramesh, and thank him for his patience,
guidance and support throughout my research.
And thanks to all my friends who supported me.
Abstract
The number of darknet users and their usage have increased rapidly in recent years. A key
reason is that the darknet allows users to be fully anonymous while browsing.
Though such privacy is needed by some users, others abuse
darknets by selling or buying illicit goods on darknet marketplaces without being
arrested or punished. Despite the hidden nature of darknet marketplaces, they often
shut down due to reasons such as law enforcement activities or exit scams. As a
result, the average life span of a darknet marketplace tends to be around 8 months.
This leads to an important question: If a vendor has built up a good reputation before a
darknet marketplace was shut down, will they start over again from scratch? Not likely. A
vendor would most likely use their username as a brand, in order to be recognizable
on a different darknet marketplace when another shuts down.
This thesis states and explores the hypothesis: Accounts that belong to the same
individual are likely to have similar usernames, which are being used as a "Brand" by the
vendor. To verify this hypothesis, we first devise a method to correlate the accounts in
a darknet marketplace data set using their PGP keys, thus linking multiple accounts
to a single user. We then devise a method for determining username similarity, and
check if the correlated accounts have a username similarity above a certain threshold.
These experiments are done both internally within the datasets for the Evolution
marketplace and the SilkRoad2 marketplace, and also between the two datasets.
From the experiments, four behaviors were identified and they were used to ver-
ify and strengthen the hypothesis. Most importantly, we find that two accounts that
belong to the same user are likely to have similar usernames if the accounts belong
to different marketplaces, but not if the accounts belong to the same marketplace.
We thus conclude a modified version of our initial hypothesis: Accounts that belong
to the same individual, but are on different marketplaces, are likely to have similar usernames,
which are being used as a "Brand" by the vendor.
Contents
Acknowledgments
Abstract
1 Introduction
1.1 The Darknet
1.2 The Tor Browser
1.3 The Darknet Marketplace
1.3.1 Drug Dealers and Buyers Going Online
1.3.2 Communication: PGP Encryption
1.3.3 Cryptocurrencies
1.3.4 Marketplace Income
1.3.5 Marketplace Rules
1.3.6 The End of a Marketplace
1.4 Feedback Is Everything
1.5 Returning buyer looking for vendor
1.6 Motivation
1.7 Thesis Statement
1.8 Key Assumptions
1.9 Chapter Summary
1.10 Thesis Outline
2 Related Work
2.1 Drug dealers to Vendors
2.2 Correlation Problem
2.3 Correlation Methods
2.4 Summary
4 Methodology
4.1 Problem Statement
4.2 PGP Correlation
4.3 Username Feature Vector
4.3.1 Reusing TF-IDF: GF-IUF
4.3.2 GF-IUF feature vector
4.3.3 Basic Feature Vector
4.3.4 Updating the data points
4.3.5 Computing Similarity
4.4 Procedure to Verify Hypothesis
4.5 Summary
6 Conclusion
6.1 Future Work
Appendices
B Study Contract
C Description of Software
D Readme
D.1 Download the data sets
D.2 Files to modify
D.3 Running the programs
D.3.0.1 Optional
E Assumptions
List of Figures
3.1 Timeline of when Evolution and Silk Road 2 were active and when the data for both markets were scraped
3.2 Hyperlinks found in an HTML file for an account from Evolution
3.3 Ordering the IDs in increasing order
3.4 Counting the differences between IDs
List of Tables
Chapter 1
Introduction
Section 1.1 The Darknet: Introduces the clear web, the deep web, the darknet, the anonymity
property of the darknet and how the darknet is used.
Section 1.2 The Tor Browser: Introduces the Tor Browser software, which is required
to access the darknet.
Section 1.3 The Darknet Marketplace: Introduces what the darknet marketplace is,
how vendors and buyers operate on the marketplace and how the marketplace operates.
Section 1.4 Feedback is Everything: Addresses the importance of feedback to vendors
and buyers.
Section 1.5 Returning buyer looking for vendor: Introduces why it is
likely that vendors use their username as a "brand" on different marketplaces.
Section 1.6 Motivation: Provides the inspiration behind this thesis.
Section 1.7 Thesis Statement: The hypothesis this thesis verifies.
Section 1.8 Key Assumptions: Addresses important assumptions that were made
and used in this thesis.
Section 1.9 Chapter Summary: Summarizes this chapter and introduces the next
chapter.
Section 1.10 Thesis Outline: Outline of the rest of the thesis.
An average internet user who uses the internet on a daily basis is also likely to
use the deep web. The deep web refers to web pages or websites that are intentionally
designed so that they cannot be indexed by a standard search engine, for
example the content of an individual's Google Drive, email or social media accounts, or data
that organizations store in databases. Various methods exist to prevent
web pages from being indexed, for example requiring a username and
password, or making a website accessible only through certain software such as Tor.
The darknet is a subset of the deep web. Many illegal activities take place on
the darknet, such as malware services, hacker forums, and marketplaces selling illicit goods
such as drugs, weapons and stolen information. It is unlikely for an
average internet user to accidentally access content on the darknet, as certain software,
such as Tor or I2P, is required to access it. Such software allows
users to access content on the darknet anonymously. In most countries it is not
illegal to access the darknet, and because of the anonymity property of the darknet,
it is near impossible for a user's identity to be uncovered by the police, government or
legal agencies. This is the major reason why the darknet is rife with illegal activities,
as users' identities are hidden by the darknet.
The darknet is not criminogenic, as it is a tool which can be used in both good
and bad ways [Mirea et al., 2018]. The anonymity property of the darknet brings convenience
for users who abuse it for illegal activities, but it also protects the privacy
of other users without such intentions. For example, whistleblowers use the darknet
to share sensitive information from companies or organizations without revealing
their identity, and journalists use the darknet to collect sensitive information and avoid
censorship. The dark side of the darknet often attracts the media's attention, while
the bright side is often forgotten [Mirea et al., 2018].
In this thesis, we will be focusing on the dark side of the darknet: the market-
places.
The Tor Browser is the most commonly used software to access the darknet. To ensure the
anonymity of a user when browsing the darknet, many servers called Tor nodes are
needed, which are distributed around the world. When accessing a website on the
darknet, the Tor Browser connects the user to a random Tor node, which then routes the traffic
through other random Tor nodes. After exiting the final Tor node, the traffic reaches
the website. The path of the traffic is random and encrypted, making
it near impossible to trace the traffic back to identify a user. This process repeats every time
the user wishes to access a different web page.
Therefore, it is near impossible to shut down the darknet, as this would require
shutting down all the Tor nodes and hidden servers that host websites at the same time
across the world.
1.3.3 Cryptocurrencies
Real world currencies cannot be used on darknet marketplaces, as transactions leave
records behind, making them traceable [Buxton and Bingham, 2015]. Cryptocurrencies
with anonymity properties are used instead, with Bitcoin being the most commonly
used cryptocurrency across different darknet marketplaces [Afilipoaie and Shortis,
2015a].
ketplace operators stealing the escrowed bitcoins and shutting down the marketplace
without any notice.
1.6 Motivation
In real life, vendors often market themselves using a brand, which is recognizable
and trusted by consumers. We do not know if this behavior exists among vendors
on darknet marketplaces. We therefore explore whether this behavior exists by
conducting experiments and analysis.
At the same time, we also profile other emergent behaviors of darknet vendors.
We believe these behaviors could potentially give us insight, which could be used
to make informed assumptions when building data sets.
For darknet marketplace data sets, we do not have the ground truth due to the
anonymity property of the darknet marketplace, which makes such data sets hard to
study.
It is still possible to create data sets that are close to the ground truth. For example,
Zhang et al. [2019] constructed a data set by splitting a piece of text written by
a vendor into two parts and matching the parts to each other to create a positive data point,
then matching one part to a piece of text written by a different vendor to create a negative
data point.
We propose a vendor correlation method in which we compare all the PGP keys used
by two accounts from darknet marketplaces. If there is a match, we can
nearly guarantee that the two accounts belong to the same individual, as PGP keys are long
randomly generated strings. Hence we obtain a data set that is very close to
ground truth.
Instead of using this data set to train machine learning models to correlate vendors
or for other tasks, we decided to use this data set to put ourselves in the vendors'
shoes, so that we could make observations of their behaviors on the darknet marketplace.
We believe these behaviors could be used as assumptions to create larger
data sets. For example, if we concluded that darknet marketplace
vendors use the same username for their accounts on different darknet marketplaces,
then we could correlate vendors just by measuring the similarity of their usernames,
making it easier to create a data set that is closer to ground truth and can be used for other
purposes such as machine learning.
Thus we proceed in two steps: first we correlate the vendors' accounts based on
attributes other than the username. Then we compute the username similarity of each pair of
correlated accounts, which is then used to calculate the percentage of correlated
accounts with a username similarity greater than a threshold γ. While conducting
the experiments, we also attempt to identify other interesting darknet vendor behaviors.
A5: If two PGP keys match, then they belong to the same individual.
Assumption A3 was made to formalize the problem. We initially treat each account
as belonging to one single individual, and classify two or more accounts as belonging
to the same individual based on evidence. For example, we classify two accounts
as belonging to the same vendor when their PGP keys match. Assumption A4 was made
to simplify the problem, since we cannot exclude the possibility of several individuals
working together on the darknet marketplace using different accounts. Assumption
A5 was made as it is near impossible to generate two identical PGP keys. For a returning
buyer, it would be hard to find the vendor they bought from if the vendor used
exactly the same username as other vendors. Therefore we made A6.
Based on the assumptions above, in this thesis we attempt to figure out which
accounts belong to the same individual, and use the correlated vendors to summarize
their different behaviors.
All assumptions made in this thesis can be found in Appendix E.
Chapter 6 is the conclusion of this thesis, addressing what this thesis achieved, the
limitations and future work.
Chapter 2
Related Work
This chapter introduces various research papers that have studied the darknet, darknet
marketplaces, and darknet vendors and buyers.
Section 2.1 Drug dealers to Vendors: Discusses the papers that address how the
darknet operates and how vendors and buyers sell or purchase goods.
Section 2.2 Correlation Problem: Discusses the common correlation problem that
many papers had.
Section 2.3 Correlation Methods: Discusses the different correlation methods
that have been proposed.
Much research of different kinds has been done on the darknet and darknet
marketplace. The i
Numerous quantitative and qualitative studies have been done to find the characteristics
of darknet marketplaces and darknet vendors. Christin [2012] was able to mine many
interesting patterns from Silk Road, but the patterns cannot be applied to every
other marketplace. Dolliver and Kenney [2016] were able to find a few facts about
the characteristics of vendors on the marketplaces Evolution and Agora, but the
facts and patterns that were found could not be applied to other marketplaces.
This was largely due to poor vendor correlation methods. Many chose to do a
1:1 comparison between the usernames of two accounts: if they match, the two
accounts are classified as belonging to the same person. Many papers have difficulty
correlating vendors due to the anonymity property of the darknet, which prevents
them from obtaining a ground truth data set for such correlation.
A huge variety of correlation methods has been explored. The most common correlation
method is string comparison of usernames. Christen [2006] introduced a wide
variety of methods to match real names. Though the majority of these techniques might not work
well with usernames, the paper introduces techniques to quickly filter out names that do not
match. Wang et al. [2016] computed username similarity by constructing
a username feature vector, then applying the main idea of IDF to each entry of the vector
to create a self-information vector. The username similarity is then the cosine
similarity of two self-information vectors. Though the approach
is interesting, the method is not suitable if the data set consists of usernames
from a darknet marketplace, as we cannot know how well the method performs without
knowing the ground truth. The most interesting work was done by Zhang
et al. [2019], who constructed an attributed heterogeneous information network, which
correlates two vendors via meta paths. Many attributes were used to identify a
vendor, such as the writing style, the photography style and the drugs a vendor sells.
2.4 Summary
This chapter explored papers that describe how the darknet and darknet marketplaces
work, the vendor correlation problem, and solutions to the vendor correlation
problem.
Chapter 3
Data Set Preprocessing
This chapter mainly focuses on how to preprocess raw data sets that were collected by
a web-crawler. The nature and characteristics of the raw data sets used are addressed
first. The chapter then addresses the importance of studying the raw data sets and choosing the unique ID
for each data point before constructing the data set. Finally, it introduces the attributes
of the integrated data points, which will be used later in Chapter 4.
Section 3.1 Raw Data Sets: Provides justification for the chosen raw data sets and
describes their nature.
Section 3.2 Information Scraper: States the preparations needed before implementing an
information scraper for a raw data set, and the attributes the information scraper should
collect.
Section 3.3 Data Integration: Introduces the motivation for data integration and the
attributes of a data point after integration.
Section 3.4 Summary: The summary of this chapter.
16 Data Set Preprocessing
did not include information such as the PGP key used by a vendor, item description
(which also could include the PGP key) or vendor's username. We therefore decided
to create our own data sets using the raw data sets for the marketplaces Evolution and
Silk Road 2. Evolution was active between 14th January 2014 [Afilipoaie and
Shortis, 2015b] and mid March 2015, and Silk Road 2 was active between 6th November
2013 and 6th November 2014 [Greenberg, 2014]. To visualize the timeline, refer to Figure
3.1 below.
The dates when the two data sets were scraped overlap (see Figure 3.1). Evolution was
scraped between 21st January 2014 and 17th March 2015. Silk Road 2 was
scraped between 20th December 2013 and 6th November 2014.
No data was collected afterwards because both marketplaces shut down. Silk Road 2 was shut
down by a law enforcement operation called Operation Onymous [Afilipoaie
and Shortis, 2015b]. Evolution was shut down due to an exit scam, where the marketplace
operators stole all the escrowed bitcoins and shut down the marketplace without any
notice [Greenberg, 2017]. After Silk Road 2 shut down, it is likely that vendors
moved to Evolution, which was one of the most popular marketplaces on the darknet
[Afilipoaie and Shortis, 2015b].
Figure 3.1: Timeline of when Evolution and Silk Road 2 were active and when the data
for both markets were scraped
Both raw data sets consist of a directory of folders, where each folder name is the
date it was scraped. Each date folder contains different subfolders, such as profile, which
contains the HTML files of vendors on the marketplace, and items, which contains the
HTML files for all items that were listed on the marketplace. The layout of each
raw data set is different, and was studied before implementing the information
scraper.
§3.2 Information Scraper: Collecting the Attributes 17
After scraping the attributes from the raw data sets for Evolution and Silk Road
2, we have two data sets. We denote the new data sets for Evolution and Silk
Road 2 as D_evo and D_sr2 respectively.
Figure 3.2: Hyperlinks found in an HTML file for an account from Evolution
For Evolution, an ID number in each account was used as the unique ID. When
observing an HTML file of an account for Evolution, multiple hyperlinks were found,
which can be seen in Figure 3.2. As can be observed, all the hyperlinks include
the number 61. We hypothesized that the number in the hyperlinks for each
account is a unique number assigned by the marketplace. For clarity, we refer to
these numbers as "IDs". This was verified by first selecting a random date in the raw
data set of Evolution, then collecting the IDs in the hyperlinks for each account and
putting them in a list L. When we convert L to a set S, the number of elements in L equals
the number of elements in S, indicating there are no duplicate IDs. If we sort the
IDs in increasing order, we get the results in Figure 3.3. By observation,
the smallest ID number is 1 and the IDs increase in irregular steps.
We created a list L_d that consists of the differences between consecutive IDs
that are greater than 1, then counted the number of times each value
appeared in the list. For example, the difference between the IDs "2" and "5" is 3, and we found that 3
appeared 139 times in L_d. From these counts we obtained the plot in Figure 3.4.
From Figure 3.4, we can observe that the distribution is concentrated on the left (it is
right-skewed), meaning the gaps between consecutive ID numbers are mostly small. This indicates the ID numbers are
not generated randomly, but are most likely assigned to new accounts in increasing
order. The accounts with missing ID numbers were most likely banned or
removed by the marketplace for violating the market rules. As discussed in Section
1.3.5, many marketplaces have their own set of rules to deter scammers and to keep a
low profile so as not to attract law enforcement's attention. For example, many forbid
providing services that harm other individuals, selling weapons of mass destruction,
selling goods related to CEM, etc.
Therefore, we treat the numbers in the hyperlinks for each account as a unique
ID that was assigned by the marketplace.
For Silk Road 2, the username was used as the unique ID. From assumption A1,
we assumed that there are no scammers in the marketplace, meaning
there are no accounts that pretend to be other vendors on the marketplace. From
assumption A6, we assumed that vendors intentionally use different usernames,
so usernames are distinct. By these two assumptions,
we can assume that usernames are unique and can be used as a unique
ID.
• Date: The date when the web-crawler scraped the account’s HTML file.
• PGP keys: As discussed in Section 1.3.2, PGP encryption is used so that
vendors and buyers can communicate without revealing their true identity.
It has been observed that even if a marketplace offers a specific section for
the PGP key, vendors might ignore it and put the key in the item description. Therefore,
when scraping the PGP key, we also check if the PGP key is in the item
description.
• Ship to: Locations where the vendors are able to ship to.
• Market: The marketplace where this data point was created from.
The following attributes are exclusive to a data point created from Silk Road 2:
• pgp_history: A list of PGP keys, each associated with the list of dates when it was
used. From this attribute, we can obtain information such as how often and
when a vendor changes their PGP key.
• listing_history: A list of item names that were listed by the vendor, associated
with all versions of the description for each item and the dates when the
item was listed. From this attribute we can obtain information such
as how long an item is listed and how often the description of an item
changes.
• ID: The ID number for the data point, which was used as the unique ID.
• Uname_history: A list of usernames that have been used by the account, each associated
with a list of dates when it was used. From this attribute we can observe
how often and when a username is modified.
• dates: A list of dates on which the account was observed by the web-crawler.
Note that the integrated data points in Evolution also contain this information,
except that it is combined with the usernames.
In this thesis, not all the attributes were used. The common attributes pgp_history,
index and market were used. From DR_evo, the attributes ID and Uname_history were
used. From DR_sr2, the attributes username, vendor_description and dates were
used. These attributes were enough to verify the hypothesis.
3.4 Summary
This chapter introduced the two raw data sets that were used in this thesis: Evolution
and Silk Road 2. It then introduced the two core steps before implementing an
information scraper: studying the HTML files to find the appropriate attributes
to collect, and deciding on the unique ID used when constructing a data point. Finally, this
chapter introduced the attributes of the integrated data points for each data set, which
will be used to correlate accounts in Chapter 4 and further analyzed in Chapter 5.
Chapter 4
Methodology
This chapter introduces the methods that were used to verify the hypothesis stated
in the thesis statement (Section 1.7). Two problems need to be solved in order to verify the
hypothesis. The first problem is the vendor correlation problem, which is how to
decide whether two accounts belong to the same vendor. The second problem is how to
compute the username similarity of two given usernames. Finally in this chapter,
we introduce the procedure to verify the hypothesis after solving the two problems.
Section 4.1 Problem Statement: Detailed description of the two problems that need
to be solved in order to verify the hypothesis.
Section 4.2 PGP Correlation: Introduces the method that was used to correlate accounts.
Section 4.3 Username Feature Vector: Introduces how a username feature vector is
created for the username of an account.
Section 4.4 Procedure to Verify Hypothesis: Introduces the procedure for verifying
the hypothesis.
Section 4.5 Summary: The summary of the chapter.
In Section 1.7, we stated the hypothesis: Accounts that belong to the same individual are
likely to have similar usernames, which are being used as a "Brand" by the vendor. To verify
this hypothesis, two problems needed to be solved: the vendor correlation problem
and the username similarity problem.
Given two accounts acc1 and acc2, where each account has n attributes associated
with it, we want to determine if acc1 and acc2 belong to the same individual by
We propose a new method called PGP correlation to address this problem. Details
of PGP correlation are discussed below in Section 4.2.
\[
\mathrm{Corr}(acc_1, acc_2) =
\begin{cases}
(m_1, m_2, r_1, r_2) & \exists\, p_1 \in PGP_1,\ \exists\, p_2 \in PGP_2 \text{ such that } p_1 \equiv p_2 \\
0 & \text{otherwise}
\end{cases}
\tag{4.1}
\]
Here m1 and m2 indicate which data sets acc1 and acc2 came from, and r1 and r2 refer
to the locations (row indices) of acc1 and acc2 in data sets m1 and m2 respectively.
Note that we do not combine DR_evo and DR_sr2 at any time, as the structures of
the two data sets are slightly different and keeping them separate makes it easier to
access a specific row of a specific data set.
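A minimal sketch of Equation (4.1) is given below. It is an illustration under assumptions, not the thesis's implementation: accounts are represented as dictionaries with the hypothetical keys `market`, `row` and `pgp_keys`, and keys are compared after removing whitespace, which is an assumed normalization step.

```python
from itertools import combinations


def normalise_key(key):
    """Remove all whitespace so formatting differences do not hide a match (assumption)."""
    return "".join(key.split())


def corr(acc1, acc2):
    """Return (m1, m2, r1, r2) if the accounts share a PGP key, otherwise 0."""
    keys1 = {normalise_key(k) for k in acc1["pgp_keys"]}
    keys2 = {normalise_key(k) for k in acc2["pgp_keys"]}
    if keys1 & keys2:
        return (acc1["market"], acc2["market"], acc1["row"], acc2["row"])
    return 0


# Hypothetical accounts; the key strings are placeholders, not real PGP keys
accounts = [
    {"market": "evo", "row": 0, "pgp_keys": ["KEY-A", "KEY-B"]},
    {"market": "sr2", "row": 0, "pgp_keys": [" KEY-B\n"]},
    {"market": "sr2", "row": 1, "pgp_keys": ["KEY-C"]},
]

matches = []
for a, b in combinations(accounts, 2):
    result = corr(a, b)
    if result != 0:
        matches.append(result)

print(matches)  # [('evo', 'sr2', 0, 0)]
```

Comparing every pair of accounts is quadratic in the number of accounts; for larger data sets, an index from key to accounts would avoid the pairwise loop.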
\[
\text{tf-idf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D) = f_{t,d} \cdot \log \frac{|D|}{|\{d \in D : t \in d\}|} \tag{4.2}
\]
Here t is a term, d is a document and D is the corpus. tf(t, d) counts the number
of times term t appears in document d. The value returned by idf(t, D) indicates whether the
term is common or rare among all the documents in D. If idf(t, D)
returns a high value (at most log |D|, reached when the term appears in only one
document), the term is very rare in the corpus.
If the value returned is close to 0, the term is very common. Hence
the value returned by tf-idf(t, d, D) indicates how important the term t is to the
document d with respect to the corpus D.
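As a small worked example of Equation (4.2), the sketch below computes tf, idf and tf-idf on a toy corpus. The corpus and function names are made up for illustration, and the natural logarithm is assumed since the base of the logarithm is not stated.

```python
import math


def tf(term, doc):
    """Number of times the term appears in the document (a list of tokens)."""
    return doc.count(term)


def idf(term, corpus):
    """log(|D| / number of documents containing the term); assumes the term occurs somewhere."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


# Toy corpus of three tokenized documents
corpus = [
    ["fast", "shipping", "fast", "stealth"],
    ["fast", "refund"],
    ["stealth", "shipping"],
]

print(tf("fast", corpus[0]))              # 2
print(idf("fast", corpus))                # log(3/2), a fairly common term
print(idf("refund", corpus))              # log(3), the rarest possible term here
print(tf_idf("fast", corpus[0], corpus))  # 2 * log(3/2)
```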
With respect to Equation 4.2, instead of terms t, a document d and a corpus
D, we use grams g, a username u and the set U of all usernames from both data sets DR_evo
and DR_sr2.
\[
\text{gf-iuf}(g, u, U) = \text{tf}(g, u) \cdot \text{idf}(g, U) \tag{4.3}
\]
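To make Equation (4.3) concrete, the following sketch computes a gf-iuf vector for one username. It rests on assumptions that are not taken from the thesis: grams are taken to be character bigrams (n = 2), the vocabulary is the sorted set of all bigrams in the toy username list, and the natural logarithm is used.

```python
import math
from collections import Counter


def grams(username, n=2):
    """Character n-grams of a username (n = 2 is an assumption for this sketch)."""
    return [username[i:i + n] for i in range(len(username) - n + 1)]


def gf_iuf_vector(username, usernames, vocabulary, n=2):
    """One gf-iuf weight per gram in the vocabulary, following Equation (4.3)."""
    counts = Counter(grams(username, n))
    vector = []
    for g in vocabulary:
        # number of usernames in which gram g occurs
        uf = sum(1 for u in usernames if g in grams(u, n))
        iuf = math.log(len(usernames) / uf) if uf else 0.0
        vector.append(counts[g] * iuf)
    return vector


# Hypothetical, already preprocessed usernames
usernames = ["drugking", "drugking2", "topshop"]
vocabulary = sorted({g for u in usernames for g in grams(u)})
vec = gf_iuf_vector("drugking", usernames, vocabulary)

print(len(vec) == len(vocabulary))   # True
print(vec[vocabulary.index("dr")])   # log(3/2): "dr" occurs in two of three usernames
print(vec[vocabulary.index("op")])   # 0.0: "op" does not occur in "drugking"
```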
Preprocessing Usernames
Let U denote all usernames in all data sets. Before constructing the GF-IUF feature
vector for each username u in U, we preprocess each username u to obtain U_p.
The preprocessing was done by simply removing all characters that are not
letters or digits, then converting all uppercase letters to lowercase. On the clear net, most
websites' usernames are not case sensitive and limit the use of special characters such
as "$" and "%". In fact, the characters used in usernames that are neither
letters nor digits are "-" and "_". These two characters often act as delimiters in the
username [Wang et al., 2016].
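The preprocessing step can be sketched in a few lines. The function name `preprocess` is hypothetical, and the sketch assumes that "letter or digit" means the ASCII ranges a-z, A-Z and 0-9.

```python
import re


def preprocess(username):
    """Remove every character that is not an ASCII letter or digit, then lowercase."""
    return re.sub(r"[^A-Za-z0-9]", "", username).lower()


print(preprocess("Dr_Feel-Good$99"))  # drfeelgood99
```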
Therefore, with respect to the list "lis" in the code above, we construct the GF-IUF
feature vector G_vec_u for the preprocessed username u_p:
Where:
sum_dig : The sum of the digits in username u. This was used because, if a username
includes numbers, whether they are leet encoding or
represent a date, a higher sum of digits makes the username more likely
to be unique, as can be observed in Figure 4.1 below. The larger the
sum of digits, the rarer the username, making it stand out from the many
usernames whose digit sum is low.
Figure 4.1: Number of usernames with different sum of digits in username (number
of accounts with sum of 0 not included)
After creating the B vector for each username, we want to normalize each element
in vector B to the range between 0 and 1. To do this, we find the maximum and minimum
value of each of the features above and get: length_max, length_min, letter_max, letter_min,
dig_max, dig_min, sum_dig_max, sum_dig_min, special_max, special_min, upper_max, upper_min,
lower_max and lower_min.
Then we are able to construct the basic vector B_vec_u for username u in U by
normalizing each element with the corresponding minimum and maximum values.
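As an illustration of this normalization, the sketch below builds a raw basic vector and rescales each feature with min-max scaling. Apart from sum_dig, the feature definitions are inferred from the feature names (length, letter, dig, special, upper, lower) and are assumptions, as is the choice to map a feature with identical minimum and maximum to 0.

```python
DIGITS = "0123456789"


def basic_features(username):
    """Raw basic vector: [length, letter, dig, sum_dig, special, upper, lower]."""
    digits = [int(ch) for ch in username if ch in DIGITS]
    letters = sum(ch.isalpha() for ch in username)
    upper = sum(ch.isupper() for ch in username)
    lower = sum(ch.islower() for ch in username)
    special = len(username) - letters - len(digits)
    return [len(username), letters, len(digits), sum(digits), special, upper, lower]


def min_max_normalise(vectors):
    """Scale every feature column to [0, 1]; constant columns are mapped to 0."""
    columns = list(zip(*vectors))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(v, mins, maxs)]
        for v in vectors
    ]


# Hypothetical usernames (before lowercasing, so upper/lower counts are meaningful)
names = ["DrFeelGood_99", "topshop", "x420"]
raw = [basic_features(u) for u in names]
scaled = min_max_normalise(raw)

print(raw[0])        # [13, 10, 2, 18, 1, 3, 7]
print(scaled[0][0])  # 1.0, the longest username
print(scaled[2][0])  # 0.0, the shortest username
```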
For username u in U, after constructing the GF-IUF feature vector G_vec_u and basic
vector B_vec_u, we can construct the username feature vector F_vec_u by concatenating
the vectors G_vec_u and B_vec_u.
Note that for username u in U, we know which data point (account) u is from
and where the data point is located. Hence we can update the data point which
username u is from. Let us assume that u is from data point acc. Then we update
the data point simply by creating a new attribute, which holds the username feature
vector F_vec_u:
\[
acc =
\begin{bmatrix}
\text{Username} \\
\text{UsernameFeaturevector} = F\_vec_u \\
\text{Dates} \\
\text{PGPKeys} \\
\vdots
\end{bmatrix}
\]
We then update all the data points with their corresponding username feature
vector.
\[
\mathrm{Sim}(acc_1, acc_2) = \frac{F\_vec_1 \cdot F\_vec_2}{\lVert F\_vec_1 \rVert \times \lVert F\_vec_2 \rVert} \tag{4.4}
\]
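Equation (4.4) is the cosine similarity of the two username feature vectors. A minimal sketch is shown below; the vectors are made-up values, and returning 0 when one vector has zero length is an assumption made to avoid division by zero.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: similarity with a zero vector is defined as 0
    return dot / (norm1 * norm2)


# Hypothetical GF-IUF and basic vectors, concatenated into a feature vector
g_vec = [0.4, 0.0, 1.1]
b_vec = [0.5, 1.0]
f_vec = g_vec + b_vec

print(len(f_vec))                                   # 5
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))    # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # 0.0
```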
Here Match_Accs is a list of n tuples, where each tuple is of the form (m, m′, r, r′).
m and m′ indicate which marketplace data sets accounts acc1 and acc2 belong to, and r
and r′ indicate where acc1 and acc2 are located in marketplace data sets m and m′.
Note that we filter out tuples where m = m′ and r = r′, since such a tuple points to the
same data point in marketplace m.
Succinctly, we can use m and r to extract the corresponding account acc1 (data
point) from data set m. The same applies to the second account acc2 using m′ and
r′. By doing so, we can update the tuples in Match_Accs by replacing m and r
By using Equation (4.4), we can calculate the similarity score for each tuple in
Match_Accs. We can then update each tuple in Match_Accs by simply adding
the username similarity score to the tuple. Hence we get:
Match_Accs = [( acc1n , acc2n , Sim( acc1n , acc2n )) | acc1n 6= acc2n , n ∈ [1, . . . , n]]
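As a rough illustration of the scoring step, the following Python sketch computes the cosine similarity of equation (4.4) and turns index tuples (m, m′, r, r′) into scored tuples. The function names, the dictionary-of-lists layout of the data sets, and the toy feature vectors are assumptions made for this example, not the thesis implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Equation (4.4): dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # zero vectors get score 0 (assumption)


def score_matches(
    match_accs: List[Tuple[str, str, int, int]],
    datasets: Dict[str, List[dict]],
) -> List[Tuple[dict, dict, float]]:
    """Replace (m, m', r, r') by (acc1, acc2, Sim(acc1, acc2))."""
    scored = []
    for m, m2, r, r2 in match_accs:
        if m == m2 and r == r2:
            continue  # same data point in the same marketplace
        acc1, acc2 = datasets[m][r], datasets[m2][r2]
        sim = cosine_similarity(
            acc1["UsernameFeatureVector"], acc2["UsernameFeatureVector"]
        )
        scored.append((acc1, acc2, sim))
    return scored


# Toy data with hypothetical feature vectors.
datasets = {
    "evo": [{"Username": "nodnow", "UsernameFeatureVector": [1.0, 0.0]}],
    "sr2": [{"Username": "nodnow", "UsernameFeatureVector": [1.0, 0.0]}],
}
match_accs = [("evo", "sr2", 0, 0), ("evo", "evo", 0, 0)]
scored = score_matches(match_accs, datasets)
```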
4.5 Summary
In this chapter we described the procedure for verifying the hypothesis from
Section 1.7. To verify the hypothesis, we first need to decide whether two accounts
belong to the same vendor by using the PGP correlation method. After correlating two
accounts, we introduced the methods to construct the username feature vector for each
username, which is used to measure the username similarity of the two accounts. Lastly,
by using the two methods, we are able to compute the username similarity of two
correlated accounts, which is later used to calculate the percentage of correlated
accounts with a username similarity score greater than or equal to a defined threshold γ.
In the next chapter, we conduct three experiments to verify the hypothesis.
First, we correlate the accounts within Evolution, then calculate the percentage
of correlated accounts that have a username similarity score at or above several different
thresholds. A similar experiment is done for Silk Road 2. The last experiment is
conducted by correlating the accounts of Evolution and Silk Road 2 and then calculating
the same percentage.
Chapter 5
Section 5.1 Before Experiment: States the details of what needs to be done before
conducting any experiment.
Section 5.2 Vendors use usernames as "brands": Three experiments were conducted,
and their results were used to conclude three vendor behaviors, which are later used to
verify and strengthen the hypothesis.
Section 5.3 Username Modification: An experiment conducted on data set DR_evo to
see how often vendors change their username.
Section 5.4 Summary: The summary of this chapter.
32 Vendor Behaviors and Observation
5.2.1 Experiments
Note that it is possible for a vendor to create multiple accounts on the same
marketplace, hence it is interesting to observe whether the hypothesis holds true for
correlated accounts that are within the same marketplace.
By using the proposed method in Section 4.4, we conducted three experiments.
For the first experiment, we correlated the accounts within Evolution (Evo & Evo).
Then we calculated the percentage of correlated accounts with a similarity score greater
than or equal to different values of γ. This experiment was repeated with matched
accounts within Silk Road 2 (SR2 & SR2) and matched accounts between Evolution and
Silk Road 2 (Evo & SR2).
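The percentage computed in each experiment can be sketched as below. This is a minimal example assuming the similarity scores of the correlated pairs are already available as a list of floats; the function name, the small tolerance `tol`, and the sample scores are hypothetical and are not thesis data.

```python
from typing import Dict, List


def percentage_at_or_above(scores: List[float], gamma: float, tol: float = 1e-9) -> float:
    """Percentage of correlated pairs whose similarity score is >= gamma.

    The tolerance guards against floating-point rounding, for example a
    cosine similarity of 0.9999999999 for two identical vectors.
    """
    if not scores:
        return 0.0
    hits = sum(1 for s in scores if s >= gamma - tol)
    return 100.0 * hits / len(scores)


thresholds = [1.00, 0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60]
example_scores = [1.0, 1.0, 0.9, 0.7, 0.5]  # illustrative values only

table: Dict[float, float] = {
    gamma: percentage_at_or_above(example_scores, gamma) for gamma in thresholds
}
```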
5.2.2 Results
By correlating the accounts within Evolution, within Silk Road 2, and between
Evolution and Silk Road 2, we obtained Table 5.1.
After computing the percentage of matched vendors from the three experiments
with different thresholds γ, we got the results presented in Table 5.2 below:
Threshold γ    Evo & Evo    SR2 & SR2    Evo & SR2
1.00           6.5%         0.0%         74.58%
0.95           6.5%         0.0%         75.7%
0.90           8.13%        0.0%         78.21%
0.85           12.2%        12.5%        80.17%
0.80           15.45%       12.5%        82.4%
0.75           16.26%       12.5%        82.68%
0.70           16.26%       12.5%        84.64%
0.65           17.07%       12.5%        87.15%
0.60           17.07%       25.0%        87.43%
Table 5.2: Percentage of matched accounts with a similarity score greater than or
equal to each threshold γ
5.2.3 Discussion
Behavior 1: Vendors who have accounts on two different marketplaces are likely
to use the exact same username.
From Table 5.1 and 5.2, we can see that 74.58% of vendors who have accounts on both
Silk Road 2 and Evolution use the exact same username. Note that the way humans
judge whether two usernames are the same differs from how a computer judges it.
Table 5.3 consists of some usernames of correlated accounts from Evolution and
Silk Road 2, together with their similarity scores. For humans, with the knowledge
that these usernames were created by the same individual, we can judge whether
they are the "same" by looking at the structure of the two usernames and the semantics
of the letters in each username, and we can detect abbreviations easily. For example, in
the fourth row of Table 5.3, the computer gave a similarity score of 0.527, which is
fairly low. For humans, knowing that these two usernames were created by the same
individual, we can tell that the "C" in "CDreams" has a high chance of representing
"california", allowing us to conclude that these usernames are the "same". As another
example, in the fifth row of Table 5.3, we can tell that "Chem" in "ChemBrothersAU" is
short for "chemical". The "AU" in "ChemBrothersAU" probably refers to Australia, which
can be disregarded. Hence we say "ChemBrothers" and "chemicalbrothers" are the same.
Table 5.3: Usernames that are the "same" to humans but not to computers
Judging from the similarity scores in Table 5.3, it is fair to say that
if two correlated accounts have a username similarity score greater than or equal to
0.80, then humans would consider the usernames of both accounts to be the "same".
Hence, with respect to the γ = 0.80 row of Table 5.2, we can conclude that 82.4% of
vendors who have accounts on both Evolution and Silk Road 2 use the same username
for both accounts. Therefore we can conclude Behavior 1.
Behavior 2: Vendors are more likely to create new accounts on different
marketplaces than to create another account on the same marketplace.
This behavior can be concluded directly from Table 5.1. In Evolution and Silk Road 2,
only 123 and 8 vendors respectively have multiple accounts in the same marketplace,
while 358 vendors have accounts in both marketplaces. 358 is not a large number
compared to 5593, the total number of unique accounts in both marketplaces, but
note that other darknet marketplaces existed at the same time as
these data sets were collected, meaning a vendor could have an account in Evolution
and Agora, but not in Silk Road 2. Selecting the markets in which
to create an account is completely up to the vendors themselves, as each darknet
marketplace has different policies and offers. By comparing the number of correlated
vendors between Evolution and Silk Road 2 to the number of correlated vendors
within Evolution and within Silk Road 2, we can conclude Behavior 2.
Out of the 123 correlated accounts within Evolution itself, 0.065 · 123 ≈ 8 vendors
decided to use the exact same username. Out of the 8 correlated accounts within
Silk Road 2, none used the exact same username. Therefore we can conclude that
if a vendor has multiple accounts in the same marketplace, it is likely that their
usernames will have a low similarity score.
As a side note, many of the matched accounts from the same marketplace have
usernames where one is a substring of the other. For example, for the usernames
"Nodnowinaus" and "nodnow", the similarity score is 0.667, but the latter username is a
substring of the former. Hence we added an extra step before computing the
username similarity: if the shorter username is a substring of the longer username,
then the similarity score is 1. Otherwise we continue with the original method and compute
the cosine similarity of the two username feature vectors. Hence we get:
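\[ Sim(acc_1, acc_2) = \begin{cases} 1 & u_1 \subseteq u_2 \lor u_2 \subseteq u_1 \\ \dfrac{F\_vec_1 \cdot F\_vec_2}{\lVert F\_vec_1 \rVert \times \lVert F\_vec_2 \rVert} & \text{otherwise} \end{cases} \]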
where u1 and u2 are the usernames of accounts acc1 and acc2. By adding this extra
step, we get Table 5.4:
Threshold γ    Evo & Evo    SR2 & SR2    Evo & SR2
1.00           13.82%       25.0%        83.24%
0.95           13.82%       25.0%        83.24%
0.90           13.82%       25.0%        83.52%
0.85           14.63%       25.0%        84.64%
0.80           16.26%       25.0%        85.75%
0.75           17.07%       25.0%        85.75%
0.70           17.07%       25.0%        86.87%
0.65           17.07%       25.0%        88.55%
0.60           17.07%       37.5%        88.83%
Table 5.4: Percentage of matched accounts with a similarity score greater than or
equal to each threshold γ, with the extra substring step in the username similarity
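The substring rule used for Table 5.4 can be sketched as follows. This is an illustrative example, not the thesis code: the function names are hypothetical, and the comparison is assumed to be case-insensitive, since "nodnow" is a substring of "Nodnowinaus" only when letter case is ignored.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def adjusted_similarity(
    u1: str,
    u2: str,
    vec1: Sequence[float],
    vec2: Sequence[float],
    case_sensitive: bool = False,  # assumption: ignore case by default
) -> float:
    """Return 1 if one username contains the other, else the cosine similarity."""
    a, b = (u1, u2) if case_sensitive else (u1.lower(), u2.lower())
    if a in b or b in a:
        return 1.0
    return cosine_similarity(vec1, vec2)


# Hypothetical feature vectors; only the usernames come from the text.
score = adjusted_similarity("Nodnowinaus", "nodnow", [1.0, 0.0], [0.0, 1.0])
```

Note that an empty username is a substring of every string, so a real implementation would probably want to reject empty usernames before applying this rule.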
Although the percentage increased in each column, the number of vendors
with multiple accounts in the same marketplace and a similarity score of 1 is still low.
To be specific, 0.1382 · 123 ≈ 17 vendors from Evolution and 0.25 · 8 = 2 from Silk
Road 2. Therefore Behavior 3 remains valid.
5.3.1 Experiment
The experiment is done simply by collecting all the data points that have more than
one username in Uname_history. We extract the usernames and compute their cosine
similarity.
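A minimal sketch of this procedure is given below. It assumes that Uname_history is stored as a chronologically ordered list of usernames on each data point and that a similarity function is supplied by the caller; both the layout and the names used here are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple


def username_changes(
    accounts: List[Dict[str, List[str]]],
    similarity: Callable[[str, str], float],
) -> List[Tuple[str, str, float]]:
    """Collect (old, new, score) for every username change in Uname_history."""
    changes = []
    for acc in accounts:
        history = acc["Uname_history"]
        # Compare each username with the one that replaced it.
        for old, new in zip(history, history[1:]):
            if old != new:
                changes.append((old, new, similarity(old, new)))
    return changes


def share_changed(accounts: List[Dict[str, List[str]]]) -> float:
    """Fraction of accounts that have more than one distinct username."""
    if not accounts:
        return 0.0
    changed = sum(1 for acc in accounts if len(set(acc["Uname_history"])) > 1)
    return changed / len(accounts)


# Hypothetical accounts; the similarity function here is a stand-in.
toy_accounts = [
    {"Uname_history": ["alice"]},
    {"Uname_history": ["bob", "NOTbob"]},
]
toy_changes = username_changes(toy_accounts, lambda a, b: 0.5)
```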
5.3.2 Results
After finishing the procedure above, we obtained Table 5.5.
Note that among all the accounts in data set DR_evo, only 6 accounts modified their
username, and each of them modified it only once.
5.3.3 Discussion
From Table 5.1, we know that DR_evo has 4367 unique accounts, which means only
6 ÷ 4367 ≈ 0.14% of users in Evolution changed their usernames. With respect to the
timeline in Figure 3.1, we can see that the raw data set covers nearly the entire
history of Evolution. Since only 0.14% (6) of vendors changed their account's
username, and each only once, we can conclude Behavior 4: Darknet
marketplace vendors are unlikely to change their account's username.
Observing the 6 data points revealed nothing that stood out or was useful. For the
accounts in rows 2 and 3, the account owner might have been a scammer who pretended to
be someone else. This might later have been reported to the marketplace operator, causing
the marketplace to add "NOT" to the front of their username.
5.4 Summary
In this chapter we conducted multiple experiments and were able to conclude four
behaviors of vendors from darknet marketplaces. By using Behavior 1, we showed
that the initial hypothesis holds true. We then used Behavior 2 to strengthen our
hypothesis and obtained: Accounts that belong to the same individual, but are on different
marketplaces, are likely to have similar usernames, which are being used as a "Brand" by the
vendor.
Chapter 6
Conclusion
In this thesis, we wanted to verify if the behavioral hypothesis holds true: Accounts
that belong to the same individual are likely to have similar usernames, which are being used
as a "Brand" by the vendor.
First we introduced the two core concepts for constructing a useful data set for our
purposes from the raw data of the darknet marketplace. These were 1. Understand
the structure of the data set and the available attributes so that information can be
extracted, and 2. Decide the unique ID attributes for an account that do not change
over time, so that the information of a single account over the lifetime of the data set
can be stored as a single data point within the data set.
Then we proposed the PGP correlation method to create our ground truth labels
(which accounts belong to the same user), the method to construct username feature
vectors (which determine the similarity score), and the experiment procedure to
verify the hypothesis. In total, four experiments were conducted and four vendor
behaviors were concluded:
Behavior 1 : Vendors who have accounts on two different marketplaces are likely
to use the exact same username.
Behavior 2 : Vendors are more likely to create new accounts on different
marketplaces than to create another account on the same marketplace.
From the four behaviors, we are able to use Behavior 1 to show that our hypothesis
holds true. However, Behavior 3 shows that the hypothesis does not always hold,
although this is the minority case according to Behavior 2. From these findings,
we refine our hypothesis and obtain: Accounts that belong to the same individual, but
are on different marketplaces, are likely to have similar usernames, which are being used as a
"Brand" by the vendor. Finally, Behavior 4 shows us that the first three behaviors are
useful findings, as usernames rarely get changed.
By using the refined hypothesis statement, it is now possible to predict which
accounts on different marketplaces belong to the same vendor. Given a darknet
marketplace data set, we would construct the username feature vector for all
usernames and classify two accounts as belonging to the same vendor if their similarity
score is greater than a defined threshold γ, where in this thesis we decided upon γ = 0.8.
We hope such findings will be useful for law enforcement agencies and for further research
on darknet marketplace behavior.
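The prediction step described above could look roughly like the following Python sketch. It assumes every account already carries its username feature vector; the function names, the account layout, and the toy vectors are hypothetical, and the exhaustive pairwise comparison is chosen only for simplicity.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_same_vendor(
    market_a: List[dict],
    market_b: List[dict],
    gamma: float = 0.8,  # threshold chosen in the thesis
) -> List[Tuple[str, str, float]]:
    """Pair accounts across two marketplaces whose similarity exceeds gamma."""
    pairs = []
    for acc_a in market_a:
        for acc_b in market_b:
            sim = cosine_similarity(
                acc_a["UsernameFeatureVector"], acc_b["UsernameFeatureVector"]
            )
            if sim > gamma:
                pairs.append((acc_a["Username"], acc_b["Username"], sim))
    return pairs


# Toy accounts with made-up feature vectors.
market_a = [{"Username": "x", "UsernameFeatureVector": [1.0, 0.0]}]
market_b = [
    {"Username": "y", "UsernameFeatureVector": [1.0, 0.0]},
    {"Username": "z", "UsernameFeatureVector": [0.0, 1.0]},
]
predicted = predict_same_vendor(market_a, market_b)
```

Comparing every pair costs time proportional to the product of the two marketplace sizes, which is acceptable for data sets of a few thousand accounts but would need indexing for much larger ones.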
Create data sets : By using the raw data sets of Evolution and Silk Road 2 from the
Darknet Marketplace Archives (Branwen et al. [2015]), create two new data
sets which contain the attributes collected from the raw data sets.
Vendor Correlation : Implement correlation methods using the attributes from the
two new data sets, which return all the matched vendors in the two marketplaces.
Username feature vectors : Implement and create feature vectors for usernames,
which will be used to compute username similarity.
Verify hypothesis : Using the correlated accounts found, compute the similarity
between the usernames of the correlated accounts. Then see how many correlated
accounts have a username similarity above a certain threshold. For example, if we
consider usernames with a similarity above 0.85 as the same, then what is the
percentage of correlated accounts that have a similarity score above 0.85?
40 Final Project Description
Appendix B
Study Contract
UniID: u6049249
FORMAL SUPERVISOR (if different, must be an RSSCS academic ): Prof Weifa Liang
LEARNING OBJECTIVES:
1. Gain knowledge in developing methods to solve real world problem with real world data.
2. Identify potential points of correlation between users across multiple darknet marketplaces
3. Understand and enhance data collection & analysis methods in the field of darknet markets.
4. Design & implement practices for classifying drug products based on known & unknown colloquialisms
DESCRIPTION:
The ANU Cybercrime Observatory collects data on the behaviours of buying & selling illicit products on darknet marketplaces.
While hypothesised that vendors may operate over multiple marketplaces, it has been difficult to confirm this using existing
practices. This project aims to research and develop a methodology to determine to a degree of confidence if a vendor from
one darknet market is the same on other darknet marketplaces based on natural language analysis techniques.
The challenge of this research project will be that the dataset is not well formed, and no training set can be provided.
As each darknet marketplace is unique, the available data regarding a vendor will differ from each website. Different methods
will be developed and be used to solve such problem using the provided dataset, such as the similarity
between vendors name across different darknet market, the similarity of grammar errors in the description of the drug and
the similarity of the products different vendor sells across different market.
As this problem is a natural language problem, it is possible to also further understand the use of colloquialisms on
darknet marketplaces to describe drug products. This will involve classifying the existing products into abstract clusters
based on the type of drug it may be and perform predictions on unknown colloquialisms based on other product data.
A tool such as this not only benefits the understanding of a vendor’s breadth and depth of the understanding of the language,
but assists in all data analysis methods pertaining to the use of data relating to drug products sold.
Summary: Develop methods according to existing techniques to correlate vendors across different darknet marketplaces using
data provided by ANU Cybercrime Observatory.
Report: 60
Artefact: 30
Presentation: 10
………………………………………………….. ………………………..
Signature Date
SECTION B (Supervisor):
I am willing to supervise and support this proposal. I have checked the student's academic record
and believe the student can fulfil this contract. If I have nominated an examiner above, I have
obtained their consent (via signature below or attached email)
16/08/2019
………………..………………..……………….. ………………………..
Signature Date
………………………………………………….. ………………………..
Signature Date
Description of Software
With respect to the artefact layout, the folder named "4-Code" contains all the code
that was used for this project. All the code in "4-Code" was written by the student
from scratch (except for imported modules).
It is hard to test the correctness of the entire code base, as the code was implemented
to verify a hypothesis and generate statistics that were later analyzed. No test
program covers all of the code, but test code is included between cells or
commented out. For many functions, one or more small tests are commented out at the
end of the cell. The majority of functions are documented, with each function
indicating what data types it takes in, etc.
• RAM: 32 GB
Following are the software and important packages that were used:
• Python 3.7.7
• Pandas 0.24.2
• tqdm 4.31.1
• pandarallel 1.4.6
• numpy 1.16.2
• conda 4.8.3
Readme
D.3.0.1 Optional
Assumptions
A5 : If two PGP keys match, then they belong to the same individual. (1.8)
A6 : Vendors are likely to use usernames that differ from other usernames. (1.8)
Appendix F
This appendix includes snippets of HTML files that were scraped from different
darknet marketplaces. For readability, the corresponding text in the HTML file is
shown below each image.
50 Snippets of HTML files
users who are actually interested in making the service the best it can be,
by providing feedback.
Referral links
To refer another user, use your referral link which you can find on your
Profile page, once registered. A referral link looks like this:
http://agorahooawayyfoe.onion/register/RFZ5gSM902
Fees
The base fee (from which the referral percentages are calculated) is cur-
rently 4%.
The fee is taken from the amount which is received by the vendor. The
buyer always pays the actual amount that is displayed for every product.
The vendor receives that amount minus the fee.
Market rules
Anonymity is sacrosanct here. You are to respect the anonymity of all
Agora users to the greatest extent possible.
Vendors may not threaten buyers in any way, shape, or form.
You must have a valid vendor account to sell anything on Agora.
Forbidden products and services
Afilipoaie, A. and Shortis, P., 2015a. From dealer to doorstep – how drugs are
sold on the dark net. (06 2015). (cited on pages 3, 4, 5, 11, and 12)
Afilipoaie, A. and Shortis, P., 2015b. Operation onymous: International law en-
forcement agencies target the dark net in november 2014. (Jan 2015). (cited on
page 16)
Ball, M.; Broadhurst, R.; Niven, A.; and Trivedi, H., 2019. Data capture and
analysis of darknet markets. (03 2019). (cited on page 4)
Branwen, G.; Christin, N.; Décary-Hétu, D.; Andersen, R. M.; StExo; Presidente,
E.; Anonymous; Lau, D.; Sohhlz, D. K.; Cakic, V.; Buskirk, V.; Whom;
McKenna, M.; and Goode, S., 2015. Dark net market archives, 2011-2015.
https://www.gwern.net/DNM-archives. Accessed: 2019-12-10. (cited on pages 4,
15, and 39)
Broadhurst, R., 2019. Child sex abuse images and exploitation materials. (10 2019),
310–336. doi:10.4324/9780429460593-14. (cited on page 4)
Buxton, J. and Bingham, T., 2015. The rise and challenge of dark net drug markets.
(01 2015). (cited on pages 3, 4, 5, and 11)
Christin, N., 2012. Traveling the silk road: A measurement analysis of a large
anonymous online marketplace. Proceedings of the 22nd International Conference on
World Wide Web, (07 2012). (cited on pages 4 and 12)
Decary-Hetu, D.; Paquet-Clouston, M.; and Aldridge, J., 2016. Going interna-
tional? risk taking by cryptomarket drug vendors. International Journal of Drug
Policy, 35 (01 2016), 69–76. (cited on page 3)
56 BIBLIOGRAPHY
Europol, 2017. Drugs and the darknet. perspectives for enforcement, research and
policy. Technical report. (cited on pages 3, 4, and 6)
Greenberg, A., 2014. ’silk road 2.0’ launches, promising a resurrected black market
for the dark web. (cited on page 16)
Greenberg, A., 2017. The dark web’s top drug market, evolution, just vanished.
https://www.wired.com/2015/03/evolution-disappeared-bitcoin-scam-dark-web/.
(cited on page 16)
Manning, C. D.; Raghavan, P.; and Schütze, H., 2008. Introduction to Information
Retrieval. Cambridge University Press, USA. ISBN 0521865719. (cited on page 25)
Meland, P. H.; Bayoumy, Y.; and Sindre, G., 2020. The ransomware-as-a-service
economy within the darknet. Computers and Security, 92 (02 2020), 101762. doi:
10.1016/j.cose.2020.101762. (cited on page 4)
Mirea, M.; Wang, V.; and Jung, J., 2018. The not so dark side of the darknet - a
qualitative study. Security Journal, (07 2018). doi:10.1057/s41284-018-0150-5. (cited
on page 2)
Redman, J., 2016. Dark net market vendors reveal their day-to-day lives on reddit.
https://news.bitcoin.com/dnm-vendors-reveal-lives/. (cited on page 5)
Wang, Y.; Liu, T.; Tan, Q.; Shi, J.; and Guo, L., 2016. Identifying users across
different sites using usernames. Procedia Computer Science, 80 (12 2016), 376–385.
doi:10.1016/j.procs.2016.05.336. (cited on pages 12 and 25)
Zhang, Y.; Fan, Y.; Song, W.; Hou, S.; Ye, Y.; Li, X.; Zhao, L.; Shi, C.; Wang,
J.; and Xiong, Q., 2019. Your style your identity: Leveraging writing and
photography styles for drug trafficker identification in darknet markets over
attributed heterogeneous information network. In The World Wide Web Conference,
WWW ’19 (San Francisco, CA, USA, 2019), 3448–3454. Association
for Computing Machinery, New York, NY, USA. doi:10.1145/3308558.3313537.
(cited on pages 7 and 12)