Claims Fraud Predictive Model


International Journal of Pure and Applied Mathematics

Volume 114 No. 7 2017, 755-767


ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu
Special Issue

A Survey on Fraud Analytics Using Predictive Model in Insurance Claims
K. Ulaga Priya (1) and S. Pushpa (2)
(1) Dept of Computer Science and Engineering,
St.Peters University.
ulagapriya@gmail.com
(2) Dept of Computer Science and Engineering,
St.Peters University.
pushpasangar96@gmail.com

Abstract
The insurance industry is growing rapidly and generates large amounts of
data. The most critical issue in the insurance industry is fraudulent
claims. Fraud is a wrongful or criminal deception intended to result in
financial or personal gain. As the volume of data increases, traditional
approaches no longer scale and identifying fraudulent claims becomes a
tedious job. Moreover, new types of claims keep emerging, which makes
fraudulent claims harder to predict. This paper presents an overview of
fraud analytics, prediction, and data science algorithms for prediction
in the insurance industry.


1. Introduction
Fraud analytics is a type of data analytics in which the analysis targets
fraudulent behaviour. Fraud occurs in several domains, such as credit card
fraud, telecommunication fraud, insurance fraud, healthcare fraud, and tax
evasion. Credit card fraud is one of the most widely surveyed fraud types in
the domain of fraud detection [34], [35], [36]. Because credit cards are a
popular mode of payment, both online and offline, the fraud associated with
them is also increasing. Techniques for detecting credit card fraud include
Neural Networks [10-11], the Group Method of Data Handling [4-5], and
Bagging [6]. Other popular models for credit card fraud detection are the
Hidden Markov Model [2-3], Bayesian learning [7-9], and K-means clustering [1].
Credit card fraud has been categorised [35] into two categories, namely
behavioural fraud and application fraud. Application fraud happens whenever
fraudsters acquire new cards by providing false data to issuing companies [33].
Behavioural fraud includes types such as mail theft, fake cards, and
stolen/lost cards. Several algorithms for credit card fraud prediction were
compared in [43], which concluded that the Bagging ensemble classifier is the
best method.

Telecommunication fraud is increasing rapidly with the growth of recent
technology and global communication, and it results in considerable business
losses. There are two categories of telecommunication fraud: subscription
fraud and superimposed fraud. Subscription fraud means claiming a false
identity to obtain service and evade payment. Superimposed fraud happens
whenever the service is used without the relevant rights, and it is usually
detected by the appearance of 'phantom' calls on a bill. Techniques used in
telecommunication fraud detection [12] include Neural Networks, Visualization
Methods and Rule-based Approaches.

Insurance fraud is defined [37] as knowingly creating a fabricated claim,
inflating a claim or adding further items to a claim, or being in any way
deceitful with the intention of gaining more than the legitimate entitlement.
Insurance fraud types include exaggerated claims, fabricated medical history,
post-dated policies, faked damage etc. The different types of fraud in the
health insurance sector are described in [30], and there are different
techniques for health insurance fraud detection [22]. This paper concentrates
on insurance fraud and its data analytics. The National Healthcare Anti-Fraud
Association (NHCAA) evaluated health care claims and announced that 10 percent
of them contain some element of fraud [38][39]. Insurance protects the
customer from monetary loss. An insurance policy is a legal agreement between
the policyholder and the insurance company [23] which specifies the claim
amount payable to the policyholder. An insurance claim is a request by the
policyholder for the claim amount from the insurance company, based on the
insurance policy. The insurance domain can be categorised as (i) health
insurance, (ii) travel insurance, (iii) auto insurance and (iv) life insurance.


Section 2 describes Big Data analytics for identifying fraudulent claims.
Section 3 describes one type of Big Data analytics, namely predictive
analytics, and discusses various machine learning algorithms. Section 4
discusses the merits and demerits of the algorithms and explains the fraud
analytics process model. Section 5 discusses the performance benchmark for
different types of fraud. Section 6 presents the conclusion and Section 7
lists the references.

2. Big Data Analytics


Fraud detection is an area of insurance where big data can play a major
role. However, many insurers remain unaware of the power of data analytics.
According to a survey, almost 80% of insurers are unaware of the power of
Big Data analytics. The following data analytics models can help insurers
strengthen their fraud detection capabilities.

i) Descriptive – Analysing data on what has already happened. Generally,
reports are generated from past data and analysis is done on that data. For
example, identifying the sales distribution of the previous year.

ii) Diagnostic – Based on past data, analysis is done on why something has
happened. Identifying and analysing the reasons for poor sales in the
previous year is an example of diagnostic data analysis.

iii) Predictive – This type of analysis suggests [27] what will happen in the
future. It predicts future scenarios based on historical data. For example,
identifying the area that is likely to achieve better sales in the current
year based on past data.

iv) Prescriptive – This type of analysis suggests what action should be taken,
that is, how a desired outcome can be made to happen. It gives recommendations
on what needs to be done. For example, how to achieve the best outcome in
sales, and a strategy to retain key customers.

3. Predictive Analytics
This paper discusses predictive analytics and the techniques used for
prediction. Supervised learning and unsupervised learning are the techniques
[28] used for predictive analytics. Supervised learning has a target variable,
which is the output predicted from other relevant features. Unsupervised
learning does not have a target variable. The following supervised learning
techniques are used for predicting fraudulent claims, since the problem has a
target variable.
• Decision tree
• Random Forest
• Support Vector Machines
• Neural Network
• XGBoost

These techniques are used in solving data analytics problems.


Decision Tree

A decision tree gives a visual representation in the form of a graph. The
sample set is divided recursively into subsets, which represent choices and
their results. Each internal node of the tree represents a choice on an
attribute and each edge represents an outcome of that choice. The sample
dataset is split into a training dataset and a test dataset. A model is built
on the training dataset and then applied to the test dataset to validate the
accuracy of its predictions. Given the predictor variables of a record, the
model decides the category of the record (Yes/No, spam/not spam). Decision
tree methods such as C4.5 can deal with continuous data directly, whereas the
earlier ID3 method requires continuous attributes to be discretised first.

Decision trees are used in various fraud detection and prediction
applications, including credit card fraud and energy fraud. Credit card fraud
detection [40] uses a cost-sensitive decision tree approach. Decision trees
are also used effectively in energy fraud detection: the M5P decision tree, a
modified version of Quinlan's [13] M5 algorithm, is used for this purpose. The
technique is widely used for classification and regression. The general
algorithm is as follows.
Input: Training dataset
Output: A decision tree
Step 1: Identify the best attribute of the dataset and place it at the root
of the tree.
Step 2: Divide the training set into subsets, one per value of that attribute,
so that each subset contains the records with the same value.
Step 3: Repeat Step 1 and Step 2 on each subset, in all branches of the tree,
until leaf nodes are reached.
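The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in any of the cited papers: it assumes categorical attributes, uses information gain as the "best attribute" criterion (the steps above do not name one), and the toy claim records and attribute names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Step 1: pick the attribute with the highest information gain (assumed criterion)."""
    base = entropy(labels)
    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return base - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Steps 2-3: split on the best attribute and recurse until leaves are reached."""
    if len(set(labels)) == 1:              # pure subset: leaf node
        return labels[0]
    if not attributes:                     # nothing left to split on: majority leaf
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in set(row[attr] for row in rows):
        sub_rows = [r for r in rows if r[attr] == value]
        sub_labels = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        branches[value] = build_tree(sub_rows, sub_labels, remaining)
    return {"attribute": attr, "branches": branches}

def predict(tree, row, default="no"):
    """Walk from the root to a leaf; unseen attribute values fall back to `default`."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], default)
    return tree

# Hypothetical toy claims; the label says whether the claim was fraudulent.
train_rows = [
    {"amount": "high", "prior_claims": "many", "police_report": "no"},
    {"amount": "high", "prior_claims": "few",  "police_report": "yes"},
    {"amount": "low",  "prior_claims": "many", "police_report": "no"},
    {"amount": "low",  "prior_claims": "few",  "police_report": "yes"},
    {"amount": "high", "prior_claims": "many", "police_report": "yes"},
    {"amount": "low",  "prior_claims": "few",  "police_report": "no"},
]
train_labels = ["yes", "no", "yes", "no", "yes", "no"]

tree = build_tree(train_rows, train_labels, ["amount", "prior_claims", "police_report"])

# Held-out test records, used to validate the model built on the training set.
test_rows = [
    {"amount": "low",  "prior_claims": "many", "police_report": "yes"},
    {"amount": "high", "prior_claims": "few",  "police_report": "no"},
]
test_labels = ["yes", "no"]
accuracy = sum(predict(tree, r) == lab for r, lab in zip(test_rows, test_labels)) / len(test_rows)
```

In this toy data the number of prior claims separates the two classes completely, so the tree consists of a single split on that attribute.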


Random Forest

In the random forest technique, multiple decision trees are created. Each
decision tree is built from a random subset of the training data [16]. A new
sample is fed into all the trees, and the majority vote over the trees is
taken as the final classification. The random forest model also copes with
missing values and outliers.

A predictive algorithm that uses this technique tries to model the
relationship between the input and output variables. The algorithm provides
excellent accuracy and runs efficiently on large datasets. It is widely used
when the number of input variables is large [14]. Moreover, it has methods
for balancing unbalanced datasets.

It has been found that random forest gives better results than Naïve Bayes
for the aggregated model, whereas Naïve Bayes gives better results for the
personalized models. In online shopping [15], the announcement of a large
number of discounts paves the way for unusual activity in the purchase of
products and services; that work uses the random forest algorithm, implemented
in the R language, to detect such cases. The random forest technique can also
be used to predict a customer's preference among insurance policy options [12].
The algorithm is as follows:
Input: Training dataset
Output: A forest of “n” trees
Step 1: Randomly select “k” features from the total “m” features, where k << m.
Step 2: Among the “k” features, calculate the node “d” using the best split point.
Step 3: Split the node into daughter nodes using the best split.
Step 4: Repeat Steps 1 to 3 until “l” nodes have been reached.
Step 5: Build the forest by repeating Steps 1 to 4 “n” times to create “n” trees.
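The steps above, together with the bootstrap sampling and majority vote described earlier, can be sketched as follows. This is a simplified illustration under stated assumptions: features are numeric, Gini impurity is the split criterion, and a maximum depth stands in for the “l” nodes stopping rule. The function names and the toy claim features are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels, features):
    """Step 2: best (feature, threshold) among the candidate features, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in sorted(set(r[f] for r in rows)):
            left = [lab for r, lab in zip(rows, labels) if r[f] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow(rows, labels, k, depth, rng):
    """Steps 1-4: grow one tree, drawing k random features at every node."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), k))
    if split is None:
        return majority
    f, t = split
    left = [(r, lab) for r, lab in zip(rows, labels) if r[f] <= t]
    right = [(r, lab) for r, lab in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow([r for r, _ in left], [lab for _, lab in left], k, depth - 1, rng),
            grow([r for r, _ in right], [lab for _, lab in right], k, depth - 1, rng))

def build_forest(rows, labels, n_trees, k, depth=3, seed=0):
    """Step 5: repeat tree growing n times, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow([rows[i] for i in idx], [labels[i] for i in idx], k, depth, rng))
    return forest

def predict_tree(node, row):
    """Internal nodes are (feature, threshold, left, right) tuples; leaves are labels."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

# Hypothetical claims: (amount in thousands, prior claims, items claimed); 1 = fraud.
rows = [(1, 0, 1), (2, 1, 0), (3, 0, 2), (2, 1, 1),
        (9, 4, 6), (8, 5, 7), (10, 6, 5), (9, 5, 8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = build_forest(rows, labels, n_trees=25, k=2)
```

Calling `predict_forest(forest, (12, 7, 9))` then returns the majority vote of the 25 trees for a new claim.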
Neural Networks

The fundamental unit of computation in a neural network is the neuron, which
is also called a node or unit. It receives input from other nodes and
computes an output; in other words, it converts input from multiple sources
into a single output. The human brain, by contrast, has the distinct feature
of creating transient states through the neurons that lie between the sensory
organs and the brain, which is the decision-making unit.

A Multilayer Perceptron (MLP) Artificial Neural Network model [17] was
proposed to detect and predict the risk of fraudulent financial reporting.
Weatherford suggests artificial immune systems, recurrent neural networks and
back-propagation neural networks for fraud detection. A neural network
approach has also been used to detect management fraud [18], using the
Adaptive Logic Network and a generalized adaptive neural network. A
three-layer, feed-forward Radial Basis Function (RBF) neural network [42] was
used to produce a fraud score every two hours for new credit card
transactions. Fuzzy neural networks on parallel machines have also been
proposed to speed up rule production for customer-specific credit card fraud
detection. Neural networks gave better prediction results than Logistic
Regression and Decision Trees [18]. A case study with 5 strategies for
auditing auto insurance claims was also carried out [9].
Input: Training dataset
Output: A trained data model
Step 1: Assign random weights to all the linkages to start the algorithm.
Step 2: Using the inputs and the (input-hidden node) linkages, find the
activation rate of the hidden nodes.
Step 3: Using the activation rate of the hidden nodes and the linkages to the
output, find the activation rate of the output nodes.
Step 4: Find the error rate at the output node and recalibrate all linkages
between the hidden nodes and the output nodes.
Step 5: Using the weights and the error found at the output node, cascade the
error down to the hidden nodes.
Step 6: Recalibrate the weights between the hidden nodes and the input nodes,
and repeat the process until the convergence criterion is met.
Step 7: Using the final linkage weights, score the activation rate of the
output nodes.
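The seven steps correspond to ordinary back-propagation, which can be sketched for a network with one hidden layer and a single output node. This is a minimal illustration under stated assumptions: sigmoid activations, squared error, per-record updates, and a fixed number of epochs standing in for the convergence criterion. The two binary "red flag" inputs are hypothetical, with fraud labelled only when both are present.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_weights(n_in, n_hidden, seed=1):
    """Step 1: random initial weights; the last weight of each node is its bias."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return w_hidden, w_out

def forward(x, w_hidden, w_out):
    """Steps 2-3: activations of the hidden nodes, then of the output node."""
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0]))) for ws in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output

def mean_squared_error(data, w_hidden, w_out):
    return sum((t - forward(x, w_hidden, w_out)[1]) ** 2 for x, t in data) / len(data)

def train(data, w_hidden, w_out, rate=0.5, epochs=2000):
    """Steps 4-6, repeated for a fixed number of epochs; weights are updated in place."""
    for _ in range(epochs):
        for x, target in data:
            hidden, output = forward(x, w_hidden, w_out)
            # Step 4: error signal at the output node.
            delta_out = (target - output) * output * (1.0 - output)
            # Step 5: cascade the error down to the hidden nodes (old output weights).
            delta_hidden = [delta_out * w_out[j] * h * (1.0 - h)
                            for j, h in enumerate(hidden)]
            # Step 4 (continued): recalibrate hidden-to-output weights.
            for j, h in enumerate(hidden + [1.0]):
                w_out[j] += rate * delta_out * h
            # Step 6: recalibrate input-to-hidden weights.
            for j, d in enumerate(delta_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] += rate * d * v

# Hypothetical records: two binary red flags; fraud (1.0) only when both are set.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]

w_hidden, w_out = init_weights(n_in=2, n_hidden=2)
loss_before = mean_squared_error(data, w_hidden, w_out)
train(data, w_hidden, w_out)
loss_after = mean_squared_error(data, w_hidden, w_out)

# Step 7: score a record with the final weights.
score = forward([1.0, 1.0], w_hidden, w_out)[1]
```

Comparing `loss_before` with `loss_after` shows how far training reduced the error on this toy set.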
XGBoost

XGBoost is short for Extreme Gradient Boosting. Boosting is a sequential
process: multiple trees are created, and the information from each tree is fed
as input to the next so that prediction improves in subsequent iterations. It
is an additive tree model that adds new trees to complement the ones already
built. XGBoost handles missing values, and it works only with numeric data.
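The additive idea can be illustrated with plain gradient boosting on squared error. The sketch below is not XGBoost itself, which adds regularisation, second-order gradient information and missing-value handling. It assumes a single numeric feature, one-split trees (stumps), and a shrinkage rate; all names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares stump: one threshold, one constant prediction per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def boost(xs, ys, n_rounds=20, rate=0.3):
    """Each new stump is fitted to the residuals left by the stumps before it."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, history = [], [mse(ys, preds)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
        history.append(mse(ys, preds))
    return base, stumps, history

def predict(base, stumps, x, rate=0.3):
    """Additive model: base value plus the shrunken contribution of every stump."""
    return base + sum(rate * stump_predict(s, x) for s in stumps)

# Hypothetical data: claim amount (thousands) against a 0/1 fraud indicator.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 0, 1, 1, 1]

base, stumps, history = boost(xs, ys)
```

The `history` list holds the training error before the first round and after every round, so it shows each added tree complementing the ones already built.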
Support Vector Machine

Support Vector Machine (SVM) is also a supervised learning algorithm, used
for regression and classification problems. In general, it creates a
hyperplane in n-dimensional space to classify the data according to the target
class. The SVM separates the data into different classes using one or more
hyperplanes, since it is sometimes difficult to separate the data points with
a single hyperplane. The distance between the hyperplane and the nearest data
points is the margin.

This enables SVM to perform both classification and regression. With these
features, SVM is a promising technique for prediction [25]. SVM works on the
principle that data points are segregated by hyperplanes: the hyperplane is
constructed with the help of support vectors so that its distance from the
nearest data points is maximized. In one study, the database of a Turkish
insurance company [19] was analysed with the SVM technique. There, SVM acts as
a classification technique that labels each record as anomalous or normal:
every record is checked against the margin and, based on that, treated as
normal or anomalous. SVM is a kernel-based algorithm [19] in which the kernel
maps the input data points to a high-dimensional space where the problem can
be solved [25]. Several applications detect fraud using SVM; for example, SVM
has been used to detect top management fraud in fraudulent financial
statements [20].
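The hyperplane and margin can be made concrete with a linear SVM trained by sub-gradient descent on the regularised hinge loss. This is a sketch under stated assumptions: no kernel is used, the two features are assumed to be standardised, labels are +1 (anomalous) and -1 (normal), and the learning rate, regularisation strength and data are hypothetical. It is not the method of the cited studies.

```python
import math

def decision(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(points, labels, rate=0.1, reg=0.01, epochs=20):
    """Sub-gradient descent on the regularised hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            if y * decision(w, b, x) < 1:
                # Inside the margin or misclassified: move the hyperplane.
                w = [wi - rate * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += rate * y
            else:
                # Outside the margin: only the regularisation term applies.
                w = [wi - rate * reg * wi for wi in w]
    return w, b

def classify(w, b, x):
    """+1 (anomalous) on one side of the hyperplane, -1 (normal) on the other."""
    return 1 if decision(w, b, x) >= 0 else -1

def geometric_margin(w, b, points, labels):
    """Distance from the hyperplane to the closest training point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision(w, b, x) / norm for x, y in zip(points, labels))

# Hypothetical standardised claim records.
points = [(2, 2), (-2, -2), (3, 3), (-3, -3), (2, 3), (-2, -3)]
labels = [1, -1, 1, -1, 1, -1]

w, b = train_linear_svm(points, labels)
margin = geometric_margin(w, b, points, labels)
```

A new record is then labelled with `classify(w, b, record)`, and `margin` is the distance from the learned hyperplane to its closest training points.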

4. Discussion
A comparative study was done on the supervised techniques. Each technique has
its own merits and demerits. A technique can be chosen based on the
application area and the data, and analytics can then be carried out with it.
The merits and demerits are discussed below:

Fraud Analytics Process Model

As a first step, the business problem must be clearly identified. The next
step is to identify the data sources, which is a very important task in the
data analysis model [29]. All the data is then gathered in a single place,
which could be a data mart or a data warehouse. Next the data is cleaned:
inconsistent, missing and duplicate values are removed. Additional data
transformation, such as data type conversion, is then carried out. In the
analytics phase, a data model is built and the data is analysed with the newly
created model. Once the analytics is done, the results are examined by
functional experts [30]. During the analytics phase, the need for additional
data may be identified, which triggers another round of data cleaning and
transformation. The pre-processing phase is the most time-consuming [31].
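The cleaning and transformation step can be sketched as a small function. The record layout (`claim_id`, `amount`) and the rule that a negative amount is inconsistent are assumptions made for illustration only.

```python
def clean_claims(raw_claims):
    """Drop missing, duplicate and inconsistent records; convert amounts to float."""
    seen_ids = set()
    cleaned = []
    for record in raw_claims:
        claim_id = record.get("claim_id")
        amount = record.get("amount")
        if claim_id is None or amount in (None, ""):   # missing value
            continue
        if claim_id in seen_ids:                       # duplicate record
            continue
        try:
            amount = float(amount)                     # data type conversion
        except (TypeError, ValueError):
            continue                                   # inconsistent: not a number
        if amount < 0:                                 # inconsistent: negative amount
            continue
        seen_ids.add(claim_id)
        cleaned.append({"claim_id": claim_id, "amount": amount})
    return cleaned

# Hypothetical raw extract from a claims data mart.
raw_claims = [
    {"claim_id": "C1", "amount": "1200.50"},
    {"claim_id": "C1", "amount": "1200.50"},   # duplicate
    {"claim_id": "C2", "amount": ""},          # missing
    {"claim_id": "C3", "amount": "-40"},       # inconsistent
    {"claim_id": "C4", "amount": "n/a"},       # inconsistent
    {"claim_id": "C5", "amount": 300},
]

cleaned = clean_claims(raw_claims)
```

If the analytics phase later asks for additional fields, the same function can be extended and rerun, which matches the repeated cleaning round described above.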

5. Performance Benchmark for Different Types of Fraud
The following scatter plot shows [39] unique fraud types which were discussed
and published in various fraud detection papers. These are some of the common
fraud types highlighted in the scatter plot.

[Figure: scatter plot, Different Types of Fraud]

The following table lists references together with the performance metric
reported for different fraud types. For better comparison across fraud types,
only the area under the Receiver Operating Characteristic (ROC) curve (AUC) is
included.
Reference                              Fraud Type                   Dataset Size Used                  Class Distribution   Performance Metric (Area under the curve)
Ortega, Figueroa et al. (2006)         Medical insurance            8,819                              5%                   AUC 74%
Subelj, Furlan et al. (2011)           Automobile insurance fraud   3,451                              1.3%                 AUC 71%-92%
Bhattacharyya, Jha et al. (2011)       Credit card fraud            50 million transactions on about   0.005%               AUC 90.8%-95.3%
                                                                    1 million credit cards from a
                                                                    single country
Whitrow, Hand et al. (2009)            Credit card fraud            33,000-36,000 activity records     0.1%                 Gini 85% (~AUC = 92.5%)
Van Vlasselaer, Bravo et al. (2015)    Credit card fraud            3.3 million transactions           <1%                  AUC 98.6%
Dongshan and Girolami (2017)           Telecommunication fraud      809,395 calls from 1,067 accounts  0.024                AUC 99.5%
Van Vlasselaer, Meskens et al. (2013)  Social security fraud        2,000 observations                 1%                   AUC 80%-85%
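The table mixes AUC with one Gini value. The two are related by Gini = 2 x AUC - 1, which is how a Gini of 85% corresponds to an AUC of about 92.5%. The sketch below shows that conversion, plus AUC computed directly as the fraction of (fraud, legitimate) pairs in which the fraud case receives the higher score. The scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Fraction of (fraud, legitimate) pairs ranked correctly; ties count as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini_from_auc(auc_value):
    return 2.0 * auc_value - 1.0

def auc_from_gini(gini_value):
    return (gini_value + 1.0) / 2.0

# Hypothetical model scores for six claims; label 1 marks a fraudulent claim.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

model_auc = auc(scores, labels)           # 8 of the 9 pairs are ranked correctly
whitrow_auc = auc_from_gini(0.85)         # the Gini 85% row of the table
```

The pairwise definition is quadratic in the number of records, so it is suitable for illustration only; rank-based formulas give the same value on large datasets.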


6. Conclusion
Besides insurance fraud, there are several other kinds of fraudulent
behaviour, such as intrusion, credit card fraud and telecommunication fraud.
Health insurance fraud [21] stands out because it causes heavy losses overall.
By integrating big data technology, fraudulent claims can be predicted across
large volumes of data as well as a wide variety of data.

References
[1] Srivastava A., Kundu A., Sural S., Majumdar A., Credit Card
Fraud Detection Using Hidden Markov Model, IEEE Transactions
On Dependable And Secure Computing 5(1) (2008), 37-48.
[2] Bhusari V., Patil S., Study of Hidden Markov Model in Credit
Card Fraudulent Detection, International Journal of Computer
Applications 20(5) (2011).
[3] Ivakhnenko A.G., The group method of data handling in
prediction problems, Sov Autom Control 9(6) (1976), 21–30.
[4] Mueller J.A., Lemke F., Self-organising data mining: an intelligent
approach to extract knowledge from data, Script Software
International, Berlin (2009).
[5] Singh S.P., Shukla S.S.P., Rakesh N., Tyagi V., Problem
Reduction In Online Payment System Using Hybrid Model,
International Journal of Managing Information Technology 3(3)
(2011).
[6] Zareapoor M., Shamsolmoali P., Application of Credit Card Fraud
Detection: Based on Bagging Ensemble Classifier, International
Conference on Computer, Communication and Convergence
(2015).
[7] Benson Edwin Raj S., Annie Portia A., Analysis on Credit Card
Fraud Detection Methods, International Conference on
Computer, Communication and Electrical Technology (2011).
[8] Panigrahi S., Kundu A., Sural S., Majumdar A.K., Credit card
fraud detection: A fusion approach using Dempster-Shafer theory
and Bayesian learning, Special Issue on Information Fusion in
Computer Security 10(4) (2009), 354-363
[9] Chang R.I., Lai L.B., Su W.D., Wang J.C., Kouh, J.S., Intrusion
Detection by Backpropagation Neural Networks with Sample-
Query and Attribute-Query, Research India Publications (2006).
[10] Patidar R., Sharma L., Credit Card Fraud Detection Using Neural
Network, International Journal of Soft Computing and
Engineering 1 (2011).


[11] Guo T., Li G.Y., Neural Data Mining For Credit Card Fraud
detection, Proceedings of the Seventh International Conference
on Machine Learning and Cybernetics (2006).
[12] Lata L.N., Koushika I.A., Hasan S.S., A Comprehensive Survey
of Fraud Detection Techniques, International Journal of Applied
Information Systems 10(2) (2015).
[13] Quinlan J., Learning with continuous classes, 5th Australian joint
conference on artificial intelligence 92 (1992).
[14] Alshamsi A.S., Predicting car insurance policies using random
forest, 10th International Conference on Innovations in
Information Technology (2014), 128-132.
[15] Viaene S., Auto claim fraud detection using Bayesian learning
neural networks, Elsevier (2005).
[16] Eesha Goel, Abhilasha, Ankit Agarwal, Fraud Detection Using
Random Forest Algorithm, International Journal of Computer
Science Engineering 5(05) (2016).
[17] Salama A.S., Omar A.A., A Back Propagation Artificial Neural
Network based Model for Detecting and Predicting Fraudulent
Financial Reporting, International Journal of Computer
Applications 106(2) (2014).
[18] Fanning K., Cogger K.O., Srivastava R., Detection of
management fraud: A neural network approach. Intelligent
Systems in Accounting, Finance and Management 4(2) (1995),
113-126.
[19] Kirlidog M., Asuk C., A fraud detection approach with data mining
in health insurance, Procedia-Social and Behavioral Sciences 62
(2012), 989-994.
[20] Pai P.F., A support vector machine-based model for detecting
top management fraud, Knowledge-Based Systems 24 (2011),
314–321.
[21] Rawte V., Anuradha G., Fraud Detection in Health Insurance
using Data Mining Techniques, Communication, Information &
Computing Technology (2015).
[22] Peng Y., Kou G., Sabatka A., Chen Z., Khazanchi D., Shi Y.,
Application of clustering methods to health insurance fraud
detection, International Conference on Service Systems and
Service Management 1 (2006), 116-120.
[23] Thornton D., Mueller R.M., Schoutsen P., van Hillegersberg J.,
Predicting healthcare fraud in medicaid: a multidimensional data
model and analysis techniques for fraud detection, Procedia
technology 9 (2013), 1252-1264.


[24] Lin F., Yeh C.C., Lee M.Y., The use of hybrid manifold learning
and support vector machines in the prediction of business failure,
Knowl. based Syst. (2010), 95–101.
[25] Tang X., Zhuang L., Cai J., Li C., Multi-fault classification based
on support vector machine trained by chaos particle swarm
optimization, Knowl. based Syst. 23(5) (2010), 486–490.
[26] Wan S., Lei, T.C., A knowledge-based decision support system
to analyze the debris-flow problems at Chen-Yu-Lan River,
Taiwan, Knowledge-Based Systems 22(8) (2009), 580-588.
[27] Hafiz K.T., Aghili S., Zavarsky P., The use of predictive analytics
technology to detect credit card fraud in Canada, 11th Iberian
Conference on Information Systems and Technologies (2016),
1-6.
[28] Alfred R., The rise of machine learning for big data analytics, 2nd
International Conference on Science in Information Technology
(2016).
[29] Banarescu A., Detecting and Preventing Fraud with Data
Analytics, Elsevier (2015).
[30] Thornton D., Brinkhuis M., Amrit C., Aly R., Categorizing and
Describing the Types of Fraud in Healthcare, Procedia Computer
Science 64 (2015), 713-720.
[31] Lata L.N., Koushika I.A., Hasan S.S., A Comprehensive Survey
of Fraud Detection Techniques, International Journal of Applied
Information Systems (2015).
[32] Dal Pozzolo A., Caelen O., Le Borgne Y.A., Waterschoot S.,
Bontempi G., Learned lessons in credit card fraud detection from
a practitioner perspective, Expert systems with applications
41(10) (2014), 4915-4928.
[33] Mahmoudi N., Duman E., Detecting credit card fraud by Modified
Fisher Discriminant Analysis, Expert Systems with Applications
42(5) (2014), 2510-2516.
[34] Chan P.K., Fan W., Prodromidis A.L., Stolfo S.J., Distributed
data mining in credit card fraud detection, IEEE Intelligent
Systems and Their Applications 14(6) (1999), 67-74.
[35] Bolton R., Hand D., Unsupervised Profiling Methods for Fraud
Detection, Credit Scoring and Credit Control VII (2001).
[36] Brause R.W., Langsdorf T.S., Hepp H.M., Credit card fraud
detection by adaptive neural data mining, Internal Report 7/99 (J.
W. Goethe-University, Computer Science Department, Frankfurt,
Germany) (1999).


[37] Gill K.M., Woolley, K.A., Gill M., Insurance fraud: The business
as a victim. In M. Gill (Ed.), Crime at work, Leicester: Perpetuity
Press (1994).
[38] Frieden J., Fraud Squads Target Suspect Claims, Business &
Health 9(4) (1991), 21-33.
[39] Guzzi R., Furious About Fraud, Best's Review-Life/Health
Insurance Edition (1989).
[40] Sahin Y., Bulkan S., Duman E., A cost-sensitive decision tree
approach for fraud detection, Expert Systems with Applications
40(15) (2013), 5916-5923.
[41] Vapnik V.N., Estimation of Dependences Based on Empirical
Data, Addendum 1, New York: Springer-Verlag (1982).
[42] Reilly D.L., Cooper L.N., Elbaum C., A neural model for category
learning, Biological Cybernetics 45(1) (1982), 35-41.
[43] Zareapoor M., Application of Credit Card Fraud Detection: Based
on Bagging classifier, Elsevier (2015).
