AFRAID: Fraud Detection Via Active Inference in Time-Evolving Social Networks
...
Fig. 2: A summary graph $S_t$ at time $t$ contains all nodes and edges observed between time $t-s$ and time $t$.
time $t$, can we assume that the node is still legitimate at time $t+1$? We introduce how to temporarily integrate an inspector's decision in the network model, decreasing the value of the decision over time.

• We propose a combination of simple and fast probing strategies to identify nodes that might distort the results of a collective inference approach, and apply these strategies to a large real-life fraud graph. We evaluate probing decisions made by (1) a committee of local classifiers, and (2) by insights provided by inspectors. (1) A committee of local classifiers collectively votes for the most uncertain nodes without relying on domain expertise. (2) Inspectors use their intuition to formalize which nodes might distort the collective inference techniques.

• We investigate the benefits of investing $k$ of the total budget $b$ in learning a better model, and find that active inference boosts the performance of the classifier in terms of precision and recall.

The remainder of the paper is organized as follows: background (§ II), network definition (§ III), problem definition and active inference (§ IV), results (§ V), related work (§ VI) and conclusion (§ VII).

II. BACKGROUND

The data used in this study is obtained from the Belgian Social Security Institution, a federal governmental service that collects and manages employer and employee social contributions. Those contributions are used to finance various branches of social security, including the allowance of unemployment funds, health insurance, family allowance funds, etc. Although the contributions concern both employees and employers (i.e., companies), the taxes are levied at the employer level. That means that the employer is responsible for transferring the taxes to the Social Security Institution.

We say that a company is fraudulent if the company is part of an illegal setup to avoid paying these taxes. Recent developments have shown that fraudulent companies do not operate by themselves, but rely on other associate companies [4], [5]. They often use an interconnected network, the so-called spider constructions, to perpetrate tax avoidance. Figure 1 illustrates the fraud process. A company that cannot fulfill its tax contributions to the government files for bankruptcy. If the company is part of an illegal setup, all its resources (e.g., address, machinery, employees, suppliers, buyers, etc.) are transferred to other companies within the setup. While the economic structure of the company is disbanded by means of bankruptcy, the technical structure is not changed, as all resources are re-allocated to other companies and continue their current activities. Network analysis is thus a logical enrichment of traditional fraud detection techniques.

Companies can be related to each other by means of common resources they use(d). Although we cannot specify the exact type of resources, the reader can understand resources in terms of shared addresses, employees, suppliers, buyers, etc. The data we have at our disposal contains 10M records of which resources belong(ed) to which companies for which time period. The data consists of 390k companies and 5.6M resources. Remark that resources can be associated with more than one company at the same time. Although resource sharing (or transferring) might indicate a spider construction, non-fraudulent companies also exchange resources (e.g., in an acquisition or merger between companies, all resources of one company are allocated to the other; employees changing jobs create a relationship between two companies; etc.). Also, fraudulent setups use innocent resources to cover up their tracks. Given a set of companies, resources and the relations between them at time $t$, our objective is to identify those companies that have a high likelihood to perpetrate fraud at time $t+1$.

III. NETWORK DEFINITION

In this section, we elaborate on how to use the temporal-relational data to create time-evolving fraud networks. Given relational data at time $t$, the corresponding graph is defined as $G_t = (V_t, E_t)$, with $V_t$ the set of nodes (or points, or vertices) and $E_t$ the set of edges (or lines, or links) observed at time $t$. Graph $G_t$ describes the static network at time $t$.

Besides current relationships, dynamic graphs keep track of the evolution of past information, e.g., nodes that are added to or removed from the network, edges that appear and disappear, edge weights that vary over time, etc. In order to include a time aspect in the network, we define the summary graph $S_t$ at time $t$ as all the nodes and edges observed between time $t-s$ and $t$. Figure 2 depicts how a summary graph is created. For our problem setting, we include all historical information available ($s = t$), as fraud is often subtle and takes a while before the relational structure is exhibited. Although historical links hold important information about the possible spread of fraud, their impact differs from more recent links. Based on the work of [6], [7], we exponentially decay the edge weight over time.
II. BACKGROUND
III. N ETWORK DEFINITION
The data used in this study is obtained from the Belgian
Social Security Institution, a federal governmental service that In this section, we will elaborate on how to use the
collects and manages employer and employee social contribu- temporal-relational data to create time-evolving fraud net-
tions. Those contributions are used to finance various branches works. Given relational data at time t, the corresponding graph
of social security including the allowance of unemployment is defined as Gt = (Vt , Et ), with Vt the set of nodes (or points,
funds, health insurance, family allowance funds, etc. Although or vertices) and Et the set of edges (or lines, or links) observed
the contributions both concern employees and employers (i.e., at time t. Graph Gt describes the static network at time t.
companies), the taxes are levied at employer level. That means
that the employer is responsible for transferring the taxes to Besides current relationships, dynamic graphs keep track of
the Social Security Institution. the evolution of past information, e.g. nodes that are added to
or removed from the network, edges that appear and disappear,
We say that a company is fraudulent if the company is edge weights that vary over time, etc. In order to include a time
part of an illegal set up to avoid paying these taxes. Recent aspect in the network, we define the summary graph St at time
developments have shown that fraudulent companies do not op- t as all the nodes and edges observed between time t − s and
erate by themselves, but rely on other associate companies [4], t. Figure 2 depicts how a summary graph is created. For our
[5]. They often use an interconnected network, the so-called problem setting, we include all historical information available
spider constructions, to perpetrate tax avoidance. Figure 1 (s = t), as fraud is often subtle and takes a while before
illustrates the fraud process. A company that cannot fulfill the relational structure is exhibited. Although historical links
its tax contributions to the government files for bankruptcy. hold important information about possible spread of fraud,
If the company is part of an illegal setup, all its resources their impact differs from more recent links. Based on work
(e.g., address, machinery, employees, suppliers, buyers, etc.) of [6], [7], we exponentially decay the edge weight over time
are transferred to other companies within the setup. While the as follows
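To make the construction of $S_t$ concrete, the sketch below builds a time-weighted summary graph from company–resource records. It is a minimal illustration rather than the authors' implementation: the record layout, the `alpha` parameter and the decay form $e^{-\alpha \cdot \text{age}}$ are assumptions, chosen only to be consistent with the time-decayed weights used later in the paper (cf. Equations 5 and 15).

```python
import math
import networkx as nx

def build_summary_graph(records, t, s, alpha=0.05):
    """Build the summary graph S_t from (company, resource, start, end) records.

    Every company-resource relation observed in the window [t - s, t] becomes an
    edge; its weight decays exponentially with the time since the relation was
    last active (assumed decay form, not the paper's exact formula).
    """
    S = nx.Graph()
    for company, resource, start, end in records:
        last_seen = min(end, t)
        if last_seen < t - s or start > t:
            continue                      # relation falls outside the window
        age = t - last_seen               # time since the relation was last active
        weight = math.exp(-alpha * age)   # recent relations keep a weight close to 1
        S.add_edge(("company", company), ("resource", resource), weight=weight)
    return S

# Toy usage with full history (s = t), as in the paper's setting.
records = [("A", "addr1", 0, 4), ("B", "addr1", 2, 9), ("B", "emp7", 6, 9)]
S = build_summary_graph(records, t=9, s=9)
print(S.edges(data=True))
```

Typing the nodes as ("company", ·) and ("resource", ·) keeps the graph bipartite, which the transformation of Equation 4 below relies on.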
Fig. 3: Time-weighted collective inference algorithm. (a) At time $t$, two companies in the subgraph are fraudulent. The intensity of the color refers to the recency of the fraudulence. (b) Propagation of fraud through the network by the wRWR′ algorithm. (c) Cutting the incoming edges after probing node '?' and confirming its non-fraudulent label.
• Weighted Mode: binary indicator for whether the neighborhood is mainly fraudulent or non-fraudulent.

If – due to probing companies – the label of one of the neighbors changes, the local neighborhood is directly impacted. After each iteration of Algorithm 1, the local neighborhood features are locally updated.
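As a small illustration of such a local feature and of the local update after probing, the sketch below recomputes a weighted-mode indicator for the neighbors of a re-labeled node. Reading "weighted" as using the (time-decayed) edge weights of the summary graph as votes, and the dict-based graph layout, are assumptions made for this example only.

```python
def weighted_mode(graph, labels, node):
    """1 if the weighted majority of `node`'s neighbors is fraudulent, else 0.

    `graph[node]` maps each neighbor to an edge weight; `labels[n]` is 1 for a
    confirmed fraudulent neighbor and 0 (or missing) otherwise.
    """
    fraud = sum(w for n, w in graph[node].items() if labels.get(n) == 1)
    clean = sum(w for n, w in graph[node].items() if labels.get(n) != 1)
    return 1 if fraud > clean else 0

def update_after_probe(graph, labels, features, probed_node, true_label):
    """Locally refresh the neighbors' feature after an inspector's decision."""
    labels[probed_node] = true_label
    for neighbor in graph[probed_node]:
        features[neighbor] = weighted_mode(graph, labels, neighbor)
```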
The global neighborhood feature is inferred using a variant of Random Walk with Restarts (RWR) [11]. RWR computes the relevance score between any two nodes in the network. Given a time-varying fraudulent network, RWR allows us to compute the extent to which each company is exposed to a fraudulent company at time $t$ using the following equation:

$$\vec{\xi}_t = c \cdot A_t \cdot \vec{\xi}_t + (1 - c) \cdot \vec{e}_t \qquad (3)$$

with $\vec{\xi}_t$ a vector containing the exposure (or relevance) scores at time $t$, $A_t$ the adjacency matrix, $c$ the restart probability (here: 0.85) and $\vec{e}_t$ the restart vector. The exposure score of a node in the network depends with a probability $c$ on its neighboring nodes, and with a probability of $(1-c)$ on a personalized vector $\vec{e}_t$. Considering the problem-specific characteristics that concur with fraud, Equation 3 is modified such that it satisfies (1) the bipartite structure defined by our problem setting, (2) the temporal effect of confirmed fraudulent companies on the network and (3) the fact that fraud should equally affect each resource, regardless of whether the resource is assigned to a large or small company.

(1) Given our bipartite network of companies and resources, we only know the label of fraudulent companies and need to decide on how each (non-fraudulent) company is currently exposed by the set of fraudulent companies. Therefore, the adjacency matrix $B_t$ of a bipartite summary graph $S_t$, with $n$ nodes of type 1 (here: companies) and $m$ nodes of type 2 (here: resources), is transformed to a network with an equal number of rows and columns according to [12], and

$$M_{(n+m)\times(n+m)} = \begin{bmatrix} 0_{n\times n} & B_{n\times m} \\ B^{\top}_{m\times n} & 0_{m\times m} \end{bmatrix} \qquad (4)$$

Remark that matrix $M$ represents an undirected graph where $m(i,j) = m(j,i)$. The row-normalized adjacency matrix is then denoted as $M_{norm}$, where all rows sum up to 1.

(2) As we want to determine how fraud affects the other nodes, we initialize the restart vector with fraud. The restart vector $\vec{e}$ is constructed element-wise as follows

$$e_i = \begin{cases} e^{-\beta h} & \text{if } i \text{ is a fraudulent company} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

with $\beta$ the decay value, and $h$ the time passed at time $t$ since fraud was detected at that company. Equation 5 weighs fraud in time and assigns a higher weight to more recent fraud.

(3) Finally, fraudulent companies with many resources will have a smaller effect on their neighbors than companies with only few resources. In order to avoid emphasizing low-degree companies, we modify the starting vector with the degree:

$$\vec{e}\,' = \vec{e} \times \vec{d} \qquad (6)$$

where $\vec{e}\,'$ is the element-wise product of the time-weighted restart vector $\vec{e}$ and the degree vector $\vec{d}$. The normalized starting vector $\vec{e}\,'_{norm}$ defines the starting vector where all elements sum to 1.

Equation 3 (referred to as wRWR′) is then re-written as

$$\vec{\xi}_t = c \cdot M_{norm,t} \cdot \vec{\xi}_t + (1 - c) \cdot \vec{e}\,'_{norm,t} \qquad (7)$$

In order to compute the exposure scores, Equation 7 requires a matrix inversion. As this is often infeasible to compute in practice, we use the power-iteration method, iterating Equation 7 until convergence [11]. Convergence is reached when the change in exposure scores is marginal or after a predefined number of iterations. The modified RWR algorithm as described above is illustrated in Figures 3a and 3b.
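A compact sketch of the wRWR′ computation is given below, using a dense NumPy representation for readability. Only the structure of Equations 4–7 and the power iteration are taken from the text; the tolerance, the default decay value and the helper name are illustrative choices.

```python
import numpy as np

def wrwr_exposure(B, fraud_age, beta=0.1, c=0.85, max_iter=100, tol=1e-8):
    """Exposure scores via the modified random walk with restarts (Eqs. 4-7).

    B         : (n_companies x m_resources) biadjacency matrix of the summary graph
    fraud_age : length-n array, time since fraud was confirmed (np.inf if not fraudulent)
    """
    n, m = B.shape
    # Eq. 4: square, symmetric adjacency matrix of the bipartite graph.
    M = np.block([[np.zeros((n, n)), B],
                  [B.T, np.zeros((m, m))]])
    row_sums = M.sum(axis=1, keepdims=True)
    M_norm = np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)

    # Eq. 5: time-weighted restart mass on fraudulent companies only.
    e = np.zeros(n + m)
    is_fraud = np.isfinite(fraud_age)
    e[:n][is_fraud] = np.exp(-beta * fraud_age[is_fraud])

    # Eq. 6: damp the emphasis on low-degree companies, then normalize to sum to 1.
    degree = M.sum(axis=1)
    e_prime = e * degree
    if e_prime.sum() > 0:
        e_prime = e_prime / e_prime.sum()

    # Eq. 7 by power iteration instead of a matrix inversion.
    xi = np.full(n + m, 1.0 / (n + m))
    for _ in range(max_iter):
        xi_new = c * (M_norm @ xi) + (1 - c) * e_prime
        if np.abs(xi_new - xi).sum() < tol:
            xi = xi_new
            break
        xi = xi_new
    return xi[:n]   # exposure scores of the companies
```

At the scale of 390k companies and 5.6M resources a sparse representation would be the natural choice; the dense version is only meant to expose the algebra.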
B. Probing strategies

Given a set of observations with an estimated label by a local classifier LC, which observation should be probed (i.e., checked for its true label) such that the predicted labels of the other observations are maximally improved? Recall that the feature set of each observation from which the LC estimates the label depends on the neighborhood of that observation. Any change made in the label of one node has a direct impact on the feature set of the neighbors. We define five probing strategies: committee-based, entropy-based, density-based, combined and random.
1) Committee-based strategy: Rather than relying on the decision of one LC, many LCs decide on which node to pick in a committee-based strategy. An often-used approach is uncertainty sampling, that is, sample the observation about which all the members of the committee are the most uncertain. Our committee is composed of the set of local classifiers $\vec{LC}$. Each local classifier $LC_m$ expresses how confident it is in the estimated label of each observation by means of a probability. Sharma and Bilgic [13] distinguish between two types of uncertainty: most-surely and least-surely uncertainty. The most-surely uncertain node is that node for which the estimated probabilities of the local classifiers provide equally strong evidence for each class. For example, when half of the committee members vote for fraud, and the other half vote for non-fraud, we say that the committee is most-surely uncertain about the node's label. Least-surely uncertainty refers to that node for which the estimated probabilities do not have significant evidence for either class. The committee is least-surely uncertain about a node's label if the probability of the node to belong to a class is close to 0.5 for many classifiers.

Based on [13], we combine positive (i.e., belonging to class fraud) and negative (i.e., belonging to class non-fraud) evidence learned from the set of models. Each local classifier $LC_m$ assigns a fraud estimate to each node $x$. A model is in favor of a positive label of node $x$ when $P_x(+\mid LC_m) > P_x(-\mid LC_m)$; then $LC_m \in P$ for node $x$, otherwise $LC_m \in N$. Evidence in favor of node $x$ being fraudulent is

$$E^{+}(x) = \prod_{LC_m \in P} \frac{P_x(+\mid LC_m)}{P_x(-\mid LC_m)} \qquad (8)$$

Evidence in favor of node $x$ being non-fraudulent is

$$E^{-}(x) = \prod_{LC_m \in N} \frac{P_x(-\mid LC_m)}{P_x(+\mid LC_m)} \qquad (9)$$

The most-surely uncertain node (MSU) in the set of unlabeled nodes $U$ is the node which has the maximal combined evidence.

$$x^{*} = \arg\max_{x \in U} E(x) = E^{+}(x) \times E^{-}(x) \qquad (10)$$

The least-surely uncertain node (LSU) is the node which has the minimal combined evidence.

$$x^{*} = \arg\min_{x \in U} E(x) = E^{+}(x) \times E^{-}(x) \qquad (11)$$

We define four types of committee-based strategies to sample nodes: (1) most-surely uncertain (MSU), (2) least-surely uncertain (LSU), (3) most-surely uncertain using the best performing local classifiers (MSU+) and (4) least-surely uncertain using the best performing local classifiers (LSU+). We implemented sampling strategies (3) and (4) because we found that some poorly performing classifiers fail to appropriately weigh the feature set and distort the results of the uncertainty sampling. Therefore, in MSU+ and LSU+, we allowed only well-performing committee members (i.e., with above-average precision among all local classifiers) to vote on the node to be probed.
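The evidence combination of Equations 8–11 can be sketched as follows. The per-classifier probabilities are assumed to come from scikit-learn-style `predict_proba` outputs, and the epsilon guard is added only for numerical safety; neither is part of the paper's formulation.

```python
import numpy as np

def committee_evidence(proba, eps=1e-12):
    """Combined evidence E(x) = E+(x) * E-(x) for every node (Eqs. 8-11).

    proba : (n_classifiers x n_nodes) array holding P_x(+ | LC_m), the estimated
            fraud probability of each node under each committee member.
    """
    p_pos = np.clip(proba, eps, 1 - eps)
    ratio = p_pos / (1 - p_pos)                              # P(+|LC_m) / P(-|LC_m)
    in_P = p_pos > 0.5                                       # members voting for fraud
    e_pos = np.where(in_P, ratio, 1.0).prod(axis=0)          # Eq. 8
    e_neg = np.where(~in_P, 1.0 / ratio, 1.0).prod(axis=0)   # Eq. 9
    return e_pos * e_neg

def most_surely_uncertain(proba, unlabeled):
    """MSU (Eq. 10): the unlabeled node with maximal combined evidence."""
    evidence = committee_evidence(proba)
    return max(unlabeled, key=lambda x: evidence[x])

def least_surely_uncertain(proba, unlabeled):
    """LSU (Eq. 11): the unlabeled node with minimal combined evidence."""
    evidence = committee_evidence(proba)
    return min(unlabeled, key=lambda x: evidence[x])
```

MSU+ and LSU+ correspond to calling the same functions on only those rows of `proba` that belong to the committee members with above-average precision.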
2) Entropy-based strategy: Fraud is highly imbalanced, with only a limited set of confirmed fraudulent nodes available. However, our network exhibits statistically significant signs of homophily (p-value < 0.02), which indicates that fraudulent nodes tend to cluster together. Some non-fraudulent nodes lie on the boundary between a cluster of fraudulent and non-fraudulent nodes. The entropy-based strategy measures the impurity of the neighbors' labels and identifies those nodes that are associated with a similar number of fraudulent and non-fraudulent nodes, and

$$x^{*} = \arg\max_{x \in U} \text{Entropy}(x) = -d^{(2)}_{rel,x}\log\!\left(d^{(2)}_{rel,x}\right) - \left(1 - d^{(2)}_{rel,x}\right)\log\!\left(1 - d^{(2)}_{rel,x}\right) \qquad (12)$$

with $d^{(2)}_{rel,x}$ the fraction of fraudulent nodes associated with node $x$ in the second-order neighborhood (i.e., the companies) at time $t$.

3) Density-based strategy: Spider constructions are subgraphs in the network that are more densely connected than other subgraphs. The density-based strategy aims to find those nodes of which the neighborhood is highly interconnected.

$$x^{*} = \arg\max_{x \in U} \frac{\#\text{ of observed edges}}{\#\text{ of all possible edges}} \qquad (13)$$

4) Combined strategy: Based on experts' expertise, the combined strategy searches for companies that are located in (1) a dense neighborhood (= high density), and (2) an impure neighborhood (= high entropy). Evidence is aggregated by multiplication [13]. The node with the maximum value for the combined strategy is selected for probing, and

$$x^{*} = \arg\max_{x \in U} \text{Combined}(x) = \text{Entropy}(x) \times \text{Density}(x) \qquad (14)$$

5) Random strategy: The random probing strategy randomly picks a node in the network for probing.

Probing strategy (1) does not rely on domain expertise, while (2)–(4) are guided by experts' insights. Strategy (5) is employed as a baseline.
C. Temporal weighing of label acquisition

Based on the previous selection technique, the probed node is sent to inspectors for further investigation. Inspectors will confirm the true label of the node. Recall that in our setting only companies can be directly attributed to fraud; resources cannot be passed to the inspectors for investigation. Inspectors will thus only label company nodes. At label acquisition, two scenarios can occur for each node that is probed:

1) Classified as fraudulent: In this case, the node is added to the bag of fraudulent nodes, and affects (1) the local neighborhood features of the neighbors, and (2) the global neighborhood feature of all nodes. (1) Up until now, the sampled node was considered to be non-fraudulent. Hence, we locally update the feature set of the company's neighbors. (2) The starting vector of the wRWR′ algorithm (see IV-A) is re-created, treating the node as a fraudulent one. The global neighborhood feature for each node is then updated.
2) Classified as non-fraudulent: In this case, the label acquisition does not change the local neighborhood features, as the node was considered non-fraudulent before. It only temporarily affects the exposure scores computed by the wRWR′ algorithm. If we know for certain that node $i$ is legitimate at time $t$ – based on, e.g., inspectors' labeling – the node should block any fraudulent influence passing through. By temporarily cutting all the incoming edges to node $i$, node $i$ will not receive any fraudulent influences, and as a result cannot pass fraudulent influences to its neighbors. The edge weight in the adjacency matrix $M$ is changed as follows:

$$\forall j \in N_i : \quad w(j, i) = \left(1 - e^{-\beta d}\right) e^{-\alpha h} \qquad (15)$$

with $\alpha$ and $\beta$ decay values, $d$ the time passed since the decision that $i$ is non-fraudulent ($d = 0$ if it is a current decision), and $h$ the time passed since a relation between $i$ and $j$ occurred. Remark that only incoming edges are cut from the non-fraudulent node. The outgoing edges are still intact. This mitigates the effect of fraud on its neighbors. This is illustrated in Figure 3c.
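The two label-acquisition scenarios and the temporal edge cut of Equation 15 could be integrated as in the sketch below. The data structures and function names are hypothetical; only the update logic follows Section IV-C.

```python
import math

def cut_incoming_edges(weights, edge_age, node_i, decision_age, alpha=0.05, beta=0.5):
    """Eq. 15: temporally cut the incoming edges of a node confirmed as non-fraudulent.

    weights  : dict mapping directed edges (j, i) to their current weight
    edge_age : dict mapping (j, i) to h, the time since the relation occurred
    """
    for (j, i) in list(weights):
        if i == node_i:
            h = edge_age[(j, i)]
            # decision_age = 0 removes the edge; its weight is gradually restored over time.
            weights[(j, i)] = (1 - math.exp(-beta * decision_age)) * math.exp(-alpha * h)

def acquire_label(probed, true_label, fraud_detection_time, weights, edge_age, t):
    """Integrate one inspector decision, following IV-C."""
    if true_label == 1:
        # Scenario 1: add the node to the bag of confirmed frauds (feeds Eq. 5).
        fraud_detection_time[probed] = t
    else:
        # Scenario 2: block incoming fraudulent influence through this node.
        cut_incoming_edges(weights, edge_age, probed, decision_age=0)
    # (the caller then re-runs wRWR' and refreshes the local neighborhood features)
```

After either update, the exposure scores of Equation 7 and the local neighborhood features are recomputed, so the inspector's decision immediately propagates through the network.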
V. RESULTS

We applied our proposed approach for active inference in time-evolving graphs to a real-life data set obtained from the Belgian Social Security Institution. We use historical data for evaluation, allowing us to appropriately interpret the results and the value of active inference for our application domain. We trained five local classifiers (i.e., Logistic Regression, Random Forest, Naive Bayes, SVM and Decision Tree) for two timestamps $t_1$ and $t_2$. Due to confidentiality issues, we will not specify the exact timestamps of analysis. The local classifiers $\vec{LC}$ of time $t_1$ are learned using data features of time $t_0$ and their corresponding label at time $t_1$. The model is tested on data features of time $t_1$, aiming to predict the corresponding label at time $t_2$. Because inspection is time-consuming, the number of companies that are passed on for further inspection is limited. In our problem setting, we focus on the top 100 most probable fraudulent companies, out of more than 200k active companies, and evalu-
Fig. 4: Model performance of active inference on time $t$ for probing strategy MSU+.

Figure 4 shows the F1-measure of the local classifiers obtained when investigating the top 100 most likely fraudulent companies, as a function of the percentage $k$ of the budget $b$ used to label companies. Precision and recall follow a similar pattern, as the total number of companies that committed fraud between $t_1$ and $t_2$ reaches approximately 200 (< 1%). The probing strategy used is MSU+. While Naive Bayes, SVM and Decision Tree are not significantly impacted, the probing strategy is able to identify nodes that change the top 100 most probable frauds for Logistic Regression and Random Forest. Although the benefits for Logistic Regression are not pronounced, the precision achieved by Random Forest increases from 3% up to 15%. Figure 5 depicts the precision achieved by the probing strategies themselves. On average, more than 50% of the probed nodes are labeled by the inspectors as fraudulent. Considering that there are only 200 out of 200k companies that commit fraud during the next time period, this translates into an increase of approximately 25% recall (probing, for instance, 100 nodes at 50% precision already uncovers 50 of the roughly 200 frauds). These results indicate that the probing strategy on its own is a powerful approach to detect many frauds.

Remark that the curves in Figure 4 vary a lot. This is mainly due to the shift in the top 100 companies, depending on which node is probed. Figure 6 illustrates how the changes in precision (black curve) can be explained by changes in the top 100 most suspicious companies (gray curve, in %). We distinguish three scenarios, as indicated in the figure: (A) The sampled node causes an increase in precision. The sampled node is labeled as non-fraudulent, hereby correctly blocking fraudulent influence to the rest of its neighborhood,

Fig. 5: Precision achieved by the probing strategies.
Figures 7 and 8 compare the different probing strategies. We distinguish between committee-based strategies (Figure 7) and strategies using experts' experience (Figure 8). In general, MSU+, the Entropy-based and the Combined strategy achieve approximately the same precision. Consistent with the results of [13], the probing strategies LSU and LSU+ do not contribute to learning, and neither do the Density and Random strategies. Surprisingly, we observed that the MSU strategy, which uses all classifiers, does not perform well. When we apply a committee-based strategy composed of the best members or advanced experts' strategies (i.e., Entropy-based and Combined), we achieve the best performance. We can conclude that a committee of local classifiers can mimic experts' insights, which is often preferred in order to make unbiased inspection decisions.

Fig. 7: Precision of the committee-based probing strategies.

Finally, we evaluate how model performance is affected by cutting the edges and gradually re-integrating their influence in time. Figure 9 shows the precision at time $t_2$ with and without integrating the edge cuts of time $t_1$. Especially when the probing budget is limited, the precision is positively impacted. When more budget is invested in probing, the effects of
VIII. ACKNOWLEDGEMENTS
This material is based upon work supported by FWO Grant
No. G055115N, the ARO Young Investigator Program under