AFRAID: Fraud Detection via Active Inference in Time-Evolving Social Networks

2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
Véronique Van Vlasselaer∗ Tina Eliassi-Rad† Leman Akoglu‡ Monique Snoeck∗ Bart Baesens∗§
∗ Department of Decision Sciences and Information Management, KU Leuven, Naamsestraat 69, B-3000 Leuven, Belgium
{Veronique.VanVlasselaer,Monique.Snoeck,Bart.Baesens}@kuleuven.be
† Department of Computer Science, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854-8019, US
eliassi@cs.rutgers.edu
‡ Department of Computer Science, Stony Brook University, 1425 Computer Science, Stony Brook, NY 11794-4400, US
leman@cs.stonybrook.edu
§ School of Management, University of Southampton, Highfield Southampton, SO17 1BJ, United Kingdom

Abstract—Fraud is a social process that occurs over time. We introduce a new approach, called AFRAID, which utilizes active inference to better detect fraud in time-varying social networks, that is, to classify nodes as fraudulent vs. non-fraudulent. In active inference on social networks, a set of unlabeled nodes is given to an oracle (in our case one or more fraud inspectors) to label. These labels are used to seed the inference process on previously trained classifier(s). The challenge in active inference is to select a small set of unlabeled nodes that would lead to the highest classification performance. Since fraud is highly adaptive and dynamic, selecting such nodes is even more challenging than in other settings. We apply our approach to a real-life fraud data set obtained from the Belgian Social Security Institution to detect social security fraud. In this setting, fraud is defined as the intentional failure of companies to pay tax contributions to the government. Thus, the social network is composed of companies, and the links between companies indicate shared resources. Our approach, AFRAID, outperforms the approaches that do not utilize active inference by up to 15% in terms of precision.

Fig. 1: Fraud process: a fraudulent company files for bankruptcy in order to avoid paying taxes and transfers its resources to other companies that are part of the illegal setup, also known as a spider construction.

I. INTRODUCTION

Data mining techniques offer a good solution to find patterns in vast amounts of data. Human interaction is often an indispensable part of data mining in many critical application domains [1], [2]. Especially in fraud detection, inspectors are guided by the results of data mining models to obtain a primary indication of where fraudulent behavior might be situated. However, manual inspection is time-consuming, and efficient techniques that dynamically adapt to a fast-changing environment are essential. Due to the limited resources of fraud inspectors, fraud detection models are required to output highly precise results, i.e., the hit rate of truly identified fraudsters should be maximal. In this work, we investigate how active inference fosters the fraud detection process for business applications over time. Active inference is a subdomain of active learning where a network-based algorithm (e.g., collective inference) iteratively learns the labels of a set of unknown nodes in the network in order to improve classification performance. Given a graph at time t with few known fraudulent nodes, which k nodes should be probed – that is, inspected to confirm the true label – such that the misclassification cost of the collective inference (CI) algorithm is minimal? In this work, we consider across-network and across-time learning, as opposed to within-network learning [3]. We combine the results of CI with local-only features in order to learn a model at time t and predict which entities (i.e., nodes) are likely to commit fraud at time t + 1.

Each time period, fraud inspectors have a limited budget b at their disposal to investigate suspicious instances. This budget might refer to time, money, or the number of instances to be inspected. If we invest k of budget b to ask inspectors for the true labels of a set of instances selected based on a selection criterion, will the total budget b be better spent? That is, do we achieve more precise results by investing a part of the budget (k) in learning an improved algorithm while the remaining budget l = b − k is used to investigate the re-evaluated results, rather than by using the complete budget b to inspect the initial results without learning?

We propose AFRAID (short for: Active Fraud Investigation and Detection) and apply our developed approach to social security fraud. In social security fraud, companies set up illegal constructions in order to avoid paying tax contributions. While detection models can rapidly generate a list of suspicious companies, which k companies should be inspected such that the expected labels of all other companies minimize the tax loss due to fraud?

Our contributions are the following:

• Fraud is dynamic and evolves over time. We propose a new approach for active inference in a timely manner by (1) using time-evolving graphs, and (2) weighing inspectors' decisions according to recency. (1) The influence that nodes exercise on each other varies over time. We capture the extent of influence in time-varying edge weights of the graph. Additionally, we attach greater importance to recent fraud. (2) Given that an inspector labels a specific node as legitimate at time t, can we assume that the node is still legitimate at time t + 1? We introduce how to temporarily integrate an inspector's decision in the network model, decreasing the value of the decision over time.

• We propose a combination of simple and fast probing strategies to identify nodes that might possibly distort the results of a collective inference approach, and apply these strategies to a large real-life fraud graph. We evaluate probing decisions made by (1) a committee of local classifiers, and (2) insights provided by inspectors. (1) A committee of local classifiers collectively votes for the most uncertain nodes without relying on domain expertise. (2) Inspectors use their intuition to formalize which nodes might distort the collective inference techniques.

• We investigate the benefits of investing k of the total budget b in learning a better model, and find that active inference boosts the performance of the classifier in terms of precision and recall.

The remainder of the paper is organized as follows: background (§ II), network definition (§ III), problem definition and active inference (§ IV), results (§ V), related work (§ VI) and conclusion (§ VII).

ASONAM '15, August 25-28, 2015, Paris, France
© 2015 ACM. ISBN 978-1-4503-3854-7/15/08 $15.00
DOI: http://dx.doi.org/10.1145/2808797.2810058

II. BACKGROUND

The data used in this study is obtained from the Belgian Social Security Institution, a federal governmental service that collects and manages employer and employee social contributions. Those contributions are used to finance various branches of social security, including the allowance of unemployment funds, health insurance, family allowance funds, etc. Although the contributions concern both employees and employers (i.e., companies), the taxes are levied at the employer level. That means that the employer is responsible for transferring the taxes to the Social Security Institution.

We say that a company is fraudulent if the company is part of an illegal setup to avoid paying these taxes. Recent developments have shown that fraudulent companies do not operate by themselves, but rely on other associated companies [4], [5]. They often use an interconnected network, the so-called spider construction, to perpetrate tax avoidance. Figure 1 illustrates the fraud process. A company that cannot fulfill its tax contributions to the government files for bankruptcy. If the company is part of an illegal setup, all its resources (e.g., address, machinery, employees, suppliers, buyers, etc.) are transferred to other companies within the setup. While the economic structure of the company is disbanded by means of bankruptcy, the technical structure is not changed, as all resources are re-allocated to other companies and continue their current activities. Network analysis is thus a logical enrichment of traditional fraud detection techniques.

Companies can be related to each other by means of common resources they use(d). Although we cannot specify the exact type of resources, the reader can understand resources in terms of shared addresses, employees, suppliers, buyers, etc. The data we have at our disposal contains 10M records of which resources belong(ed) to which companies for which time period. The data consists of 390k companies and 5.6M resources. Remark that resources can be associated with more than one company at the same time. Although resource sharing (or transferring) might indicate a spider construction, non-fraudulent companies also exchange resources (e.g., in an acquisition or merger between companies, all resources of one company are allocated to the other; employees changing jobs create relationships between companies; etc.). Also, fraudulent setups use innocent resources to cover up their tracks. Given a set of companies, resources and the relations between them at time t, our objective is to identify those companies that have a high likelihood to perpetrate fraud at time t + 1.

III. NETWORK DEFINITION

In this section, we elaborate on how to use the temporal-relational data to create time-evolving fraud networks. Given relational data at time t, the corresponding graph is defined as G_t = (V_t, E_t), with V_t the set of nodes (or points, or vertices) and E_t the set of edges (or lines, or links) observed at time t. Graph G_t describes the static network at time t.

Besides current relationships, dynamic graphs keep track of the evolution of past information, e.g., nodes that are added to or removed from the network, edges that appear and disappear, edge weights that vary over time, etc. In order to include a time aspect in the network, we define the summary graph S_t at time t as all the nodes and edges observed between time t − s and t. Figure 2 depicts how a summary graph is created.

Fig. 2: A summary graph S_t at time t contains all nodes and edges observed between time t and time t − s.

For our problem setting, we include all historical information available (s = t), as fraud is often subtle and it takes a while before the relational structure is exhibited. Although historical links hold important information about the possible spread of fraud, their impact differs from that of more recent links. Based on the work of [6], [7], we exponentially decay the edge weight over time as follows
w(i, j) = e^(−αh)     (1)

with α the decay value (here: α = 0.02) and h the time passed since the relationship between nodes i and j occurred, where h = 0 depicts a current relationship. Mathematically, a network is represented by an adjacency matrix A of size n × n where

a_{i,j} = w(i, j) if i and j are connected,
a_{i,j} = 0 otherwise.     (2)

Since companies are explicitly connected to the resources they use, our fraud graph has a dual structure: every edge in the network connects a company to a resource. The network composed of n companies and m resources is called a bipartite network, and is of size n × m. The corresponding adjacency matrix is B_{n×m}. As we know when a resource was assigned to a company, the edge weight corresponds to the recency of their relationship, exponentially decayed over time. In case multiple relationships exist between a company and a resource, we only include the most recent one. An edge weight with maximum value 1 refers to a current assignment.

IV. ACTIVE INFERENCE

Collective inference is a network analysis technique where the label of a node in the network is said to depend on the labels of the neighboring nodes. In social network analysis, this is often referred to as homophily [8], where one tends to adopt the same behavior as one's associates (e.g., committing fraud if all your friends are fraudsters). A change in the label of one node might cause the labels of the neighboring nodes to change, which in turn can affect the labels of their neighbors, and so on. As a consequence, a wrong expectation for one node strongly affects the estimated labels of the other nodes. Active inference is analogous to active learning. It selects an observation to be labeled in order to improve classification performance. While active learning iteratively re-learns and updates a classifier with the newly acquired label, active inference re-evaluates the labels of the neighboring nodes using an existing model. For a profound literature survey of active learning, we refer the reader to [9].

In this work, we train a set of out-of-time local classifiers LC at time t, where each observation i is composed of a set of features x_i derived at time t − 1 and the corresponding label L_i = {fraud, non-fraud} observed at time t. The set of features consists of (1) intrinsic features a_i, and (2) neighborhood features (see IV-A). Intrinsic features are features that occur in isolation and do not depend on the neighborhood. The intrinsic features that describe the companies in our analysis include age, sector, financial statements, legal seat, etc. The neighborhood features are derived by a collective inference technique. We apply each classifier LC_m to observations from time t in order to predict which observations are likely to commit fraud at time t + 1. In active inference, we ask inspectors for the most probable label at time t + 1 and already integrate this label in the current network setting to infer a new expectation of the neighbors' labels. We say that we learn across-time and across-network. Recall that inspectors have a total budget b at their disposal each timestamp, and are able to invest k < b budget in improving the current collective inference algorithm. Using the updated feature set, the LC re-learns a new estimate of each node's fraud probability. However, as inspectors' decisions are only temporarily valid, we temporally weigh the belief in a decision by decreasing its value in time. Algorithm 1 details the procedure for active inference in time-varying fraudulent networks, and is discussed in the remainder of this section.

Algorithm 1: Active inference for time-varying fraud graphs.
  input : Summary graphs S_{t−1} and S_t where S_t = (V_{s,t}, E_{s,t}), time-weighted collective inference algorithm wRWR′, budget k, sets of labeled fraudulent nodes L_{t−1} and L_t.
  output: Labeled nodes L_{t+1}.
  # Initialize LC_t
  ξ_{t−1} ← wRWR′(S_{t−1}, L_{t−1});                      (IV-A)
  LC_t ← LC(x_{t−1}[a_{t−1}, aggr(N_{t−1}), ξ_{t−1}], L_t);
  # Active inference
  ℓ ← 0
  while ℓ < k do
      ξ_t ← wRWR′(S_t, L_t);
      L_{t+1} ← LC_t(x_t[a_t, aggr(N_t), ξ_t], L_t);
      Select node v_i to probe;                            (IV-B)
      if y(v_i) = fraudulent then
          L_t(v_i) ← (fraud, t);                           (IV-C1)
      else if y(v_i) = non-fraudulent then
          ∀v_j ∈ N_i : w(j, i) = 0;                        (IV-C2)
      end
      ℓ ← ℓ + 1
  end

A. Collective Inference Technique

Many collective inference algorithms have been proposed in the literature (see [10] for an overview). We employ a set of local classifiers that evaluate the classification decision on both intrinsic and neighborhood features. For the neighborhood features, we make a distinction between (1) local neighborhood features and (2) a global neighborhood feature. The local neighborhood features are based on the labels of the direct neighbors. Recall that in our bipartite graph only the labels of the companies are known, and that the first-order neighborhood of each company is composed of its resources. We define the direct neighborhood of a company as the company's resources and their associations. As the number of neighbors differs for each node, the neighborhood labels are aggregated in a fixed-length feature vector [10] (here: length = 3). The following aggregated features aggr(N_i) are derived from the network for each company i.

• Weighted Sum: the number of fraudulent companies associated through a similar resource, weighted by the edges.

• Weighted Proportion: the fraction of fraudulent companies associated through a similar resource, weighted by the edges.

• Weighted Mode: binary indicator for whether the neighborhood is mainly fraudulent or non-fraudulent.
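The decayed edge weights of Equation (1) and the three aggregated neighborhood features can be sketched as follows. This is a minimal illustration under assumed data structures (plain dicts for edges and labels); the helper names `edge_weight` and `aggregate_neighborhood` are ours, not from the paper:

```python
import math

def edge_weight(h, alpha=0.02):
    """Eq. (1): time-decayed edge weight w(i, j) = e^(-alpha * h),
    with h the time elapsed since the relationship occurred
    (h = 0 depicts a current relationship)."""
    return math.exp(-alpha * h)

def aggregate_neighborhood(company, edges, labels):
    """Aggregate the fraud labels of the companies reachable through
    shared resources into the fixed-length vector
    [weighted sum, weighted proportion, weighted mode].
    `edges` maps a company to {neighbor company: edge weight};
    `labels` maps a company to True (fraud) / False (non-fraud)."""
    neighbors = edges.get(company, {})
    total = sum(neighbors.values())
    fraud_mass = sum(w for nb, w in neighbors.items() if labels.get(nb, False))
    weighted_sum = fraud_mass
    weighted_prop = fraud_mass / total if total > 0 else 0.0
    weighted_mode = 1 if weighted_prop > 0.5 else 0
    return [weighted_sum, weighted_prop, weighted_mode]
```

After each probe, only the feature vectors of the probed node's neighbors need to be recomputed, which keeps the per-iteration update in Algorithm 1 cheap.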

Fig. 3: Time-weighted collective inference algorithm. (a) At time t, two companies in the subgraph are fraudulent. The intensity of the color refers to the recency of the fraudulence. (b) Propagation of fraud through the network by the wRWR′ algorithm. (c) Cutting the incoming edges after probing node '?' and confirming its non-fraudulent label.

If – due to probing companies – the label of one of the neighbors changes, the local neighborhood is directly impacted. After each iteration of Algorithm 1, the local neighborhood features are locally updated.

The global neighborhood feature is inferred using a variant of Random Walk with Restarts (RWR) [11]. RWR computes the relevance score between any two nodes in the network. Given a time-varying fraudulent network, RWR allows us to compute the extent to which each company is exposed to a fraudulent company at time t using the following equation:

ξ_t = c · A_t · ξ_t + (1 − c) · e_t     (3)

with ξ_t a vector containing the exposure (or relevance) scores at time t, A_t the adjacency matrix, c the restart probability (here: 0.85) and e_t the restart vector. The exposure score of a node in the network depends with probability c on its neighboring nodes, and with probability (1 − c) on a personalized vector e_t. Considering the problem-specific characteristics that concur with fraud, Equation 3 is modified such that it satisfies (1) the bipartite structure defined by our problem setting, (2) the temporal effect of confirmed fraudulent companies on the network, and (3) the fact that fraud should equally affect each resource, regardless of whether the resource is assigned to a large or small company.

(1) Given our bipartite network of companies and resources, we only know the labels of fraudulent companies and need to decide how each (non-fraudulent) company is currently exposed to the set of fraudulent companies. Therefore, the adjacency matrix B_t of a bipartite summary graph S_t with n nodes of type 1 (here: companies) and m nodes of type 2 (here: resources) is transformed to a network with an equal number of rows and columns according to [12]:

M_{(n+m)×(n+m)} = [ 0_{n×n}   B_{n×m} ]
                  [ B′_{m×n}  0_{m×m} ]     (4)

Remark that matrix M represents an undirected graph where m(i, j) = m(j, i). The row-normalized adjacency matrix is then denoted as M_norm, where all rows sum to 1.

(2) As we want to determine how fraud affects the other nodes, we initialize the restart vector with fraud. The restart vector e_i is constructed as follows:

e_i = e^(−βh) if i is a fraudulent company,
e_i = 0 otherwise.     (5)

with β the decay value, and h the time passed at time t since fraud was detected at that company. Equation 5 weighs fraud in time and assigns a higher weight to more recent fraud.

(3) Finally, fraudulent companies with many resources will have a smaller effect on their neighbors than companies with only a few resources. In order to avoid emphasizing low-degree companies, we modify the starting vector with the degree:

e′ = e × d     (6)

where e′ is the element-wise product of the time-weighted restart vector e and the degree vector d. The normalized starting vector e′_norm defines the starting vector where all elements sum to 1.

Equation 3 (referred to as wRWR′) is then re-written as

ξ_t = c · M_{norm,t} · ξ_t + (1 − c) · e′_{norm,t}     (7)

In order to compute the exposure scores, Equation 7 requires a matrix inversion. As this is often unfeasible to compute in practice, we use the power-iteration method, iterating Equation 7 until convergence [11]. Convergence is reached when the change in exposure scores is marginal or after a predefined number of iterations. The modified RWR algorithm as described above is illustrated in Figures 3a and 3b.
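The construction of M (Eq. 4), the time-decayed, degree-adjusted restart vector (Eqs. 5-6), and the power iteration for Eq. 7 can be sketched as below. This is an illustrative NumPy implementation under our own assumptions (dense matrices, a `fraud_age` array with `np.inf` marking non-fraudulent companies, and weighted degrees for the vector d); the paper does not publish its implementation:

```python
import numpy as np

def wrwr(B, fraud_age, c=0.85, beta=0.02, n_iter=100, tol=1e-9):
    """Sketch of the modified RWR (wRWR') of Eqs. (4)-(7).
    B: (n x m) bipartite company-resource adjacency with time-decayed weights.
    fraud_age: length-n array, time since fraud was detected per company
    (np.inf for non-fraudulent companies)."""
    n, m = B.shape
    # Eq. (4): symmetric (n+m) x (n+m) adjacency of the bipartite graph.
    M = np.block([[np.zeros((n, n)), B],
                  [B.T, np.zeros((m, m))]])
    # Row-normalize so that every non-empty row sums to 1.
    rows = M.sum(axis=1, keepdims=True)
    M_norm = np.divide(M, rows, out=np.zeros_like(M), where=rows > 0)
    # Eq. (5): time-decayed restart mass on fraudulent companies only.
    e = np.zeros(n + m)
    e[:n] = np.where(np.isfinite(fraud_age), np.exp(-beta * fraud_age), 0.0)
    # Eq. (6): scale by (weighted) degree, then normalize to sum to 1.
    e *= M.sum(axis=1)
    e /= e.sum()
    # Eq. (7) solved by power iteration instead of matrix inversion.
    xi = e.copy()
    for _ in range(n_iter):
        xi_new = c * M_norm @ xi + (1 - c) * e
        if np.abs(xi_new - xi).sum() < tol:
            xi = xi_new
            break
        xi = xi_new
    return xi[:n], xi[n:]  # exposure scores for companies, resources
```

The update follows Eq. 7 literally (row-normalized M, restart on recent fraud); in practice a sparse-matrix representation would be needed at the paper's scale (390k companies, 5.6M resources).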

B. Probing strategies

Given a set of observations with a label estimated by a local classifier LC, which observation should be probed (i.e., checked for its true label) such that the predicted labels of the other observations are maximally improved? Recall that the feature set of each observation from which the LC estimates the label depends on the neighborhood of that observation. Any change made in the label of one node has a direct impact on the feature set of its neighbors. We define 5 probing strategies: committee-based, entropy-based, density-based, combined and random strategy.

1) Committee-based strategy: Rather than relying on the decision of one LC, in a committee-based strategy many LCs decide on which node to pick. An often used approach is uncertainty sampling, that is, sample the observation about which all the members of the committee are the most uncertain. Our committee is composed of the set of local classifiers LC. Each local classifier LC_m expresses how confident it is in the estimated label of each observation by means of a probability. Sharma and Bilgic [13] distinguish between two types of uncertainty: most-surely and least-surely uncertainty. The most-surely uncertain node is the node for which the estimated probabilities of the local classifiers provide equally strong evidence for each class. For example, when half of the committee members vote for fraud, and the other half vote for non-fraud, we say that the committee is most-surely uncertain about the node's label. Least-surely uncertainty refers to the node for which the estimated probabilities do not have significant evidence for either class. The committee is least-surely uncertain about a node's label if the probability of the node belonging to a class is close to 0.5 for many classifiers. Based on [13], we combine positive (i.e., belonging to class fraud) and negative (i.e., belonging to class non-fraud) evidence learned from the set of models. Each local classifier LC_m assigns a fraud estimate to each node x. A model is in favor of a positive label for node x when P_x(+|LC_m) > P_x(−|LC_m); then LC_m ∈ P for node x, otherwise LC_m ∈ N. Evidence in favor of node x being fraudulent is

E+(x) = ∏_{LC_m ∈ P} P_x(+|LC_m) / P_x(−|LC_m)     (8)

Evidence in favor of node x being non-fraudulent is

E−(x) = ∏_{LC_m ∈ N} P_x(−|LC_m) / P_x(+|LC_m)     (9)

The most-surely uncertain node (MSU) in the set of unlabeled nodes U is the node which has the maximal combined evidence:

x* = argmax_{x ∈ U} E(x) = E+(x) × E−(x)     (10)

The least-surely uncertain node (LSU) is the node which has the minimal combined evidence:

x* = argmin_{x ∈ U} E(x) = E+(x) × E−(x)     (11)

We define four types of committee-based strategies to sample nodes: (1) most-surely uncertain (MSU), (2) least-surely uncertain (LSU), (3) most-surely uncertain using the best performing local classifiers (MSU+), and (4) least-surely uncertain using the best performing local classifiers (LSU+). We implemented sampling strategies (3) and (4) because we found that some poorly performing classifiers fail to appropriately weigh the feature set and distort the results of the uncertainty sampling. Therefore, in MSU+ and LSU+, we allow only well-performing committee members (i.e., those above the average precision of all local classifiers) to vote on the node to be probed.

2) Entropy-based strategy: Fraud is highly imbalanced, with only a limited set of confirmed fraudulent nodes available. However, our network exhibits statistically significant signs of homophily (p-value < 0.02), which indicates that fraudulent nodes tend to cluster together. Some non-fraudulent nodes lie on the boundary between a cluster of fraudulent and non-fraudulent nodes. The entropy-based strategy measures the impurity of the neighbors' labels and identifies those nodes that are associated with a similar amount of fraudulent and non-fraudulent nodes:

x* = argmax_{x ∈ U} Entropy(x) = −d(2)_{rel,x} log(d(2)_{rel,x}) − (1 − d(2)_{rel,x}) log(1 − d(2)_{rel,x})     (12)

with d(2)_{rel,x} the fraction of fraudulent nodes associated with node x in the second-order neighborhood (i.e., the companies) at time t.

3) Density-based strategy: Spider constructions are subgraphs in the network that are more densely connected than other subgraphs. The density-based strategy aims to find those nodes whose neighborhood is highly interconnected:

x* = argmax_{x ∈ U} (# of observed edges) / (# of all possible edges)     (13)

4) Combined strategy: Based on experts' expertise, the combined strategy searches for companies that are located in (1) a dense neighborhood (= high density), and (2) an impure neighborhood (= high entropy). Evidence is aggregated by multiplication [13]. The node with the maximum value for the combined strategy is selected for probing:

x* = argmax_{x ∈ U} Combined(x) = Entropy(x) × Density(x)     (14)

5) Random strategy: The random probing strategy randomly picks a node in the network for probing.

Probing strategy (1) does not rely on domain expertise, while (2)-(4) are guided by experts' insights. Strategy (5) is employed as a baseline.
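As an illustration of the scoring behind the committee-based and entropy-based strategies (Eqs. 8-12), consider the following sketch. The function names and the clamping of degenerate probability estimates are our additions:

```python
import math

def evidence(probs_fraud):
    """Combined evidence E+(x) * E-(x) of Eqs. (8)-(10).
    `probs_fraud` holds each committee member's estimated fraud
    probability for one node."""
    e_pos, e_neg = 1.0, 1.0
    for p in probs_fraud:
        p = min(max(p, 1e-6), 1 - 1e-6)  # guard against 0/1 estimates
        if p > 0.5:                      # member votes fraud: LC_m in P
            e_pos *= p / (1.0 - p)
        else:                            # member votes non-fraud: LC_m in N
            e_neg *= (1.0 - p) / p
    return e_pos * e_neg

def pick_msu(candidates):
    """Most-surely uncertain node: maximal combined evidence (Eq. 10).
    `candidates` maps node id -> list of committee fraud probabilities."""
    return max(candidates, key=lambda x: evidence(candidates[x]))

def entropy_score(frac_fraud):
    """Neighborhood impurity of Eq. (12); maximal at a 50/50 split."""
    d = frac_fraud
    if d in (0.0, 1.0):
        return 0.0
    return -d * math.log(d) - (1 - d) * math.log(1 - d)
```

Replacing `max` with `min` in `pick_msu` yields the least-surely uncertain (LSU) variant; restricting `candidates` to the votes of above-average-precision members gives MSU+/LSU+.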

C. Temporal weighing of label acquisition

Based on the previous selection techniques, the probed node is sent to inspectors for further investigation. Inspectors will confirm the true label of the node. Recall that in our setting only companies can be directly attributed to fraud; resources cannot be passed to the inspectors for investigation. Inspectors will thus only label company nodes. At label acquisition, two scenarios can occur for each node that is probed:

1) Classified as fraudulent: In this case, the node is added to the bag of fraudulent nodes, and affects (1) the local neighborhood features of its neighbors, and (2) the global neighborhood feature of all nodes. (1) Up until now, the sampled node was considered to be non-fraudulent. Hence, we locally update the feature set of the company's neighbors. (2) The starting vector of the wRWR′ algorithm (see IV-A) is re-created, treating the node as a fraudulent one. The global neighborhood feature of each node is then updated.

2) Classified as non-fraudulent: Inspectors do not find any evidence that this node will be involved in fraud at time t + 1. However, this does not imply that the node will always be non-fraudulent. The inspectors' decision is only valid for a limited time period. This decision does not impact the local neighborhood features, as the node was treated as non-fraudulent before. It only temporarily affects the exposure scores computed by the wRWR′ algorithm. If we know for certain that node i is legitimate at time t – based on, e.g., inspectors' labeling – the node should block any fraudulent influence passing through. By temporarily cutting all the incoming edges to node i, node i will not receive any fraudulent influences and, as a result, cannot pass fraudulent influences on to its neighbors. The edge weight in the adjacency matrix M is changed as follows:

∀j ∈ N_i : w(j, i) = (1 − e^(−βd)) e^(−αh)     (15)

with α and β decay values, d the time passed since the decision that i is non-fraudulent (d = 0 depicts a current decision), and h the time passed since a relation between i and j occurred. Remark that only the incoming edges of the non-fraudulent node are cut; the outgoing edges are still intact. This mitigates the effect of fraud on its neighbors, as illustrated in Figure 3c.
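The edge-cut of Equation (15) and its gradual re-integration can be illustrated with a small helper. The function name and the choice β = 0.02 are our assumptions; the paper reports α = 0.02 but does not state its value for β:

```python
import math

def cut_and_reintegrate(h, d, alpha=0.02, beta=0.02):
    """Eq. (15): incoming edge weight to a node confirmed non-fraudulent,
    w(j, i) = (1 - e^(-beta * d)) * e^(-alpha * h),
    with d the time since the inspector's decision (d = 0: current
    decision, edge fully cut) and h the time since the relation between
    i and j occurred. As d grows, the weight recovers toward the
    ordinary decayed weight e^(-alpha * h) of Eq. (1)."""
    return (1.0 - math.exp(-beta * d)) * math.exp(-alpha * h)
```

At d = 0 the factor (1 − e^0) zeroes the edge; the decision then fades, so an old "non-fraudulent" verdict no longer shields the node.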
V. RESULTS

We applied our proposed approach for active inference in time-evolving graphs to a real-life data set obtained from the Belgian Social Security Institution. We use historical data for evaluation, allowing us to appropriately interpret the results and the value of active inference for our application domain. We trained five local classifiers (i.e., Logistic Regression, Random Forest, Naive Bayes, SVM and Decision Tree) for two timestamps t1 and t2. Due to confidentiality issues, we will not specify the exact timestamps of analysis. The local classifiers LC of time t1 are learned using data features of time t0 and their corresponding labels at time t1. The model is tested on data features of time t1, aiming to predict the corresponding labels at time t2. Because inspection is time-consuming, the number of companies that are passed on for further inspection is limited. In our problem setting, we focus on the top 100 most probable fraudulent companies, out of more than 200k active companies, and evaluate model performance on precision, recall and F1-measure.

Fig. 4: Model performance of active inference on time t for probing strategy MSU+.

Figure 4 shows the F1-measure of the local classifiers obtained when investigating the top 100 most likely fraudulent companies as a function of the percentage of the budget b spent on labeling companies. Precision and recall follow a similar pattern, as the total number of companies that committed fraud between t1 and t2 reaches approximately 200 (< 1%). The probing strategy used is MSU+. While Naive Bayes, SVM and Decision Tree are not significantly impacted, the probing strategy is able to identify nodes that change the top 100 most probable frauds for Logistic Regression and Random Forest. Although the benefits for Logistic Regression are not pronounced, the precision achieved by Random Forest increases from 3% up to 15%.

Fig. 5: Precision achieved by the probing strategies.

Figure 5 depicts the precision achieved by the probing strategies themselves. On average, more than 50% of the probed nodes are labeled by the inspectors as fraudulent. Considering that there are only 200 out of 200k companies that commit fraud during the next time period, this translates into an increase of approximately 25% recall. These results indicate that the probing strategy on its own is a powerful approach to detect many frauds.

Remark that the curves in Figure 4 vary a lot. This is mainly due to the shift in the top 100 companies, depending on which node is probed. Figure 6 illustrates how the changes in precision (black curve) can be explained by changes in the top 100 most suspicious companies (gray curve, in %). We distinguish three scenarios, as indicated in the figure: (A) The sampled node causes an increase in precision. The sampled node is labeled as non-fraudulent, hereby correctly blocking fraudulent influence to the rest of its neighborhood,
2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

or the sampled node is labeled as fraudulent intensifying the


spread of fraud towards its neighborhood. (B) The sampled 0.15
node deludes the CI technique. This can be explained by the
innocent resources often attached to illegal setups. (C) The
sampled node does not have any influence on the top 100, and

of the top 100


can be seen as a lost effort.

Figures 7 and 8 compare the different probing strategies. We distinguish between committee-based strategies (Figure 7) and strategies using experts' experience (Figure 8). In general, MSU+, the Entropy-based and the Combined strategy achieve approximately the same precision. Consistent with the results of [13], the probing strategies LSU and LSU+ do not contribute to learning, nor do the Density and Random strategies. Surprisingly, we observed that the MSU strategy, which uses all classifiers, does not perform well. When we apply a committee-based strategy composed of the best members, or advanced experts' strategies (i.e., Entropy-based and Combined), we achieve the best performance. We can conclude that a committee of local classifiers can mimic experts' insights, which is often preferred in order to make unbiased inspection decisions.

Fig. 7: Precision of the committee-based probing strategies (precision of the top 100 vs. k in %, for MSU, MSU+, LSU and LSU+).
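A committee-based strategy of this kind can be sketched as follows; measuring disagreement as the spread of the members' fraud probabilities is our own simplifying assumption, and the helper names and toy predictions are illustrative only.

```python
from statistics import pstdev

def committee_disagreement(member_scores):
    """Disagreement of the committee on one node, measured here as the
    population standard deviation of the members' fraud probabilities."""
    return pstdev(member_scores)

def probe_by_committee(predictions, budget):
    """predictions: node -> list of fraud probabilities, one per committee
    member (e.g., the best-performing local classifiers). Returns the
    `budget` nodes the committee disagrees on most."""
    ranked = sorted(predictions,
                    key=lambda n: committee_disagreement(predictions[n]),
                    reverse=True)
    return ranked[:budget]

predictions = {
    "A": [0.90, 0.92, 0.88],  # consensus: likely fraudulent
    "B": [0.10, 0.85, 0.40],  # strong disagreement -> worth probing
    "C": [0.05, 0.10, 0.02],  # consensus: likely legitimate
}
print(probe_by_committee(predictions, 1))  # -> ['B']
```

Nodes on which the members agree, whether fraudulent or legitimate, are left alone; inspection effort goes to the cases where the committee splits.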
Finally, we evaluate how model performance is affected by cutting the edges, and gradually re-integrating their influence over time. Figure 9 shows the precision at time t2 with and without integrating the edge cuts of time t1. Especially when the probing budget is limited, the precision is positively impacted. When more budget is invested in probing, the effects of transferring decisions of the previous timestamps are similar to the results achieved when no previous edge cuts are taken into account.
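Gradual re-integration of a cut edge can be sketched as a weight that grows back over elapsed time; the exponential half-life form and the parameter values below are illustrative assumptions, not the exact weighting scheme used in the paper.

```python
def edge_weight(base_weight, t_now, t_cut, half_life=2.0):
    """Weight of an edge that was cut at time `t_cut` because it touched a
    confirmed fraudulent node. Right after the cut the edge carries no
    influence; its weight then grows back towards `base_weight`, regaining
    half of the lost influence every `half_life` time periods. Edges that
    were never cut (t_cut is None) keep their full weight."""
    if t_cut is None:
        return base_weight
    elapsed = max(t_now - t_cut, 0.0)
    restored = 1.0 - 0.5 ** (elapsed / half_life)
    return base_weight * restored

print(edge_weight(1.0, t_now=3.0, t_cut=3.0))  # just cut -> 0.0
print(edge_weight(1.0, t_now=5.0, t_cut=3.0))  # one half-life later -> 0.5
```

With such a scheme, a fraud decision at t1 dampens the propagation of guilt through the affected edges at t2, without discarding those relationships forever.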

Fig. 8: Precision of the experts-based probing strategies (precision of the top 100 vs. k in %, for Density, Entropy, Combined and Random).

Fig. 6: Changes in precision for Random Forests compared to changes in the evaluation set when using probing strategy MSU+ (precision of the top 100 and change in top 100, in %, vs. k in %; labels A, B and C mark the sampled-node outcomes discussed in the text).

VI. RELATED WORK

Active learning iteratively learns a classifier by selecting unlabeled observations to be labeled, and updates the classifier accordingly. The label is assigned by an "oracle", which often refers to human interaction present in the learning process. Although active learning is widely explored in the literature (see [9] for an overview), it has only recently been applied to networked data. As network-based features often rely on the neighbors, an update in the neighborhood causes some features to change. This is collective classification, which has proven to be useful for fraud detection in [14], [15]. Active inference refers to the process of iteratively sampling nodes such that the collective classification prediction of all other nodes is optimized. Most studies focus on within-network classification, in which a subset of the nodes is labeled and the labels of the other nodes need to be decided. The goal is to select the most informative (set of) nodes to sample. Rattigan et al. [16] suggest selecting for probing those nodes that lie central in the network and impact other nodes more significantly. Macskassy [17] uses the Empirical Risk Minimization (ERM) measure such that the expected classification error is reduced. The Reflect and Correct (RAC) strategy [18], [19] tries to find misclassified islands of nodes by learning the likelihood of each node belonging to such an island. In [20], the authors propose ALFNET, combining content-only (or intrinsic) features with features derived from the network. They use local disagreement between a content-only and a combined classifier


to decide which node to probe in a cluster. As opposed to within-network learning, Kuwadekar and Neville [3] applied active inference to across-network learning. That is, their Relational Active Learning (RAL) algorithm is bootstrapped on a fully-labeled network and then applied to a new unlabeled network. Samples are chosen based on a utility score that expresses the disagreement within an ensemble classifier. To the best of the authors' knowledge, active inference has not been applied to time-evolving graphs so far.

Fig. 9: Precision on graphs when decision is weighted in time (precision of the top 100 vs. k in %, with and without cutting edges).

VII. CONCLUSION

In this work, we discussed how active inference can foster classification in time-varying networks. We applied AFRAID, a new active inference approach for time-evolving graphs, to a real-life data set obtained from the Belgian Social Security Institution with the goal of detecting companies that are likely to commit fraud in the next time period. Fraudulent companies are defined as those that intentionally do not pay their taxes. Given a time-varying network, we extracted (1) intrinsic features and (2) neighborhood features. A change in the label of one node might impact the feature set of its neighbors; this is collective classification. We investigated the effect on the overall performance of a set of classifiers when we are able to select a limited set of nodes to be labeled. Although the domain requirements are rather strict (i.e., only 100 out of >200k companies can be investigated each time period), Random Forests benefit the most from active inference, achieving an increase in precision of up to 15%. We investigated different probing strategies to select the most informative nodes in the network and evaluated (1) committee-based and (2) expert-based strategies. We find that committee-based strategies using high-performing classifiers result in slightly better classification performance than expert-based strategies, which is often preferred in order to obtain an unbiased set of companies for investigation. We see that the probing strategies on their own are able to identify those companies with the most uncertainty, resulting in a total precision of up to 45%.

VIII. ACKNOWLEDGEMENTS

This material is based upon work supported by FWO Grant No. G055115N, the ARO Young Investigator Program under Contract No. W911NF-14-1-0029, NSF CAREER 1452425, NSF IIP1069147, IIS 1408287 and IIP1069147, a Facebook Faculty Gift, an R&D grant from Northrop Grumman Aerospace Systems, and the Stony Brook University Office of the Vice President for Research. Any conclusions expressed in this material are the authors' and do not necessarily reflect the views, either expressed or implied, of the funding parties.

REFERENCES

[1] B. Baesens, Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications. John Wiley & Sons, 2014.
[2] B. Baesens, V. Van Vlasselaer, and W. Verbeke, Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. John Wiley & Sons, forthcoming.
[3] A. Kuwadekar and J. Neville, "Relational active learning for joint collective classification models," in ICML, 2011, pp. 385–392.
[4] V. Van Vlasselaer, J. Meskens, D. Van Dromme, and B. Baesens, "Using social network knowledge for detecting spider constructions in social security fraud," in ASONAM. IEEE, 2013, pp. 813–820.
[5] V. Van Vlasselaer, L. Akoglu, T. Eliassi-Rad, M. Snoeck, and B. Baesens, "Guilt-by-constellation: Fraud detection by suspicious clique memberships," in HICSS, 2015.
[6] R. Rossi and J. Neville, "Time-evolving relational classification and ensemble methods," in Advances in Knowledge Discovery and Data Mining. Springer, 2012, pp. 1–13.
[7] U. Sharan and J. Neville, "Exploiting time-varying relationships in statistical relational models," in WebKDD. ACM, 2007, pp. 9–15.
[8] M. McPherson, L. Smith-Lovin, and J. M. Cook, "Birds of a feather: Homophily in social networks," Annual Review of Sociology, pp. 415–444, 2001.
[9] B. Settles, "Active learning literature survey," University of Wisconsin-Madison, Computer Sciences Technical Report 1648, 2009.
[10] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, "Collective classification in network data," AI Magazine, vol. 29, no. 3, p. 93, 2008.
[11] H. Tong, C. Faloutsos, and J.-Y. Pan, "Fast random walk with restart and its applications," in ICDM. IEEE, 2006, pp. 613–622.
[12] H. Tong, S. Papadimitriou, S. Y. Philip, and C. Faloutsos, "Proximity tracking on time-evolving bipartite graphs," in SDM, vol. 8. SIAM, 2008, pp. 704–715.
[13] M. Sharma and M. Bilgic, "Most-surely vs. least-surely uncertain," in ICDM. IEEE, 2013, pp. 667–676.
[14] S. Pandit, D. H. Chau, S. Wang, and C. Faloutsos, "Netprobe: A fast and scalable system for fraud detection in online auction networks," in WWW. ACM, 2007, pp. 201–210.
[15] L. Akoglu, R. Chandy, and C. Faloutsos, "Opinion fraud detection in online reviews by network effects," in ICWSM, vol. 13, 2013, pp. 2–11.
[16] M. J. Rattigan, M. Maier, and D. Jensen, "Exploiting network structure for active inference in collective classification," in ICDM. IEEE, 2007, pp. 429–434.
[17] S. A. Macskassy, "Using graph-based metrics with empirical risk minimization to speed up active learning on networked data," in SIGKDD. ACM, 2009, pp. 597–606.
[18] M. Bilgic and L. Getoor, "Effective label acquisition for collective classification," in SIGKDD. ACM, 2008, pp. 43–51.
[19] M. Bilgic and L. Getoor, "Reflect and correct: A misclassification prediction approach to active inference," ACM Transactions on Knowledge Discovery from Data, vol. 3, no. 4, pp. 1–32, 2009.
[20] M. Bilgic, L. Mihalkova, and L. Getoor, "Active learning for networked data," in ICML, 2010, pp. 79–86.