Fraud Detection in Auto Insurance
Aditya Bhattar
Problem Statement
Determine whether a given auto insurance claim is fraudulent, based on a given set of features, and
develop a predictive model to do so.
Potential Business Problems
Identifying fraudulent claims is important for any insurance company, as it would not want to pay out on
claims that are not legitimate.
For auto insurance, fraudulent claims may arise when someone claims damages for an accident that
never took place.
Traditionally, insurance companies have a claim investigation team to investigate claims and determine
whether or not they are fraudulent.
However, with the growth in computing power and advanced analytics, insurance companies are trying
to build automated solutions that determine whether a claim is fraudulent, as manual investigation is
prone to errors.
Why Solve this problem
1000 Datapoints
38 features:
13 numerical features
25 categorical features
One Target Variable: Fraud_Reported
Yes – 247 Count
No – 753 Count
The target variable shows class imbalance (about 25% Yes, 75% No), which must be accounted for during modelling.
The dataset was split into train and test sets for modelling: 800 datapoints in the train set and 200 in
the test set.
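As a minimal sketch of an 800/200 split that preserves the class ratio, the example below uses scikit-learn's train_test_split with stratification. Stratifying is an assumption here (the slides do not say how the split was made), and the labels and placeholder feature are synthetic, built only to match the reported 247/753 counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels matching the reported class counts (1 = fraud, 0 = not fraud).
y = np.array([1] * 247 + [0] * 753)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not real data

# stratify=y keeps the fraud rate roughly equal in both splits (an assumption,
# not confirmed as the method used in this project).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_fraud_rate = y_train.mean()
test_fraud_rate = y_test.mean()
print(len(y_train), len(y_test), train_fraud_rate, test_fraud_rate)
```

Without stratification, a 200-row test set could end up with noticeably more or fewer fraud cases than the full dataset, which makes recall estimates less stable.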
Evaluation Metric
The evaluation metrics for this project are precision, recall, and ROC AUC score.
For the business use case, it is important to catch as many fraudulent claims as possible, i.e. to keep
false negatives low, so recall is given more importance.
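To make the three metrics concrete, here is a small dependency-free sketch of their definitions. The labels and scores are toy values invented for illustration, not results from this project, and the function names are hypothetical. ROC AUC is computed as the fraction of (fraud, non-fraud) pairs in which the fraud case gets the higher score.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def roc_auc(y_true, y_score):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 1 = fraud, 0 = not fraud (illustrative values only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, y_score)
print(precision, recall, auc)
```

In this toy case one fraudulent claim (score 0.3) falls below the 0.5 threshold and is missed; lowering the threshold would raise recall at the cost of precision.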
Data Cleaning and Preprocessing
Changing dtypes and Dropping columns with Unique Values
The dtype of certain columns was changed from numerical to categorical, as they are categorical by nature:
Number of Vehicles Involved
Number of Bodily Injuries
Number of Witnesses
Five columns were dropped because they had too many unique values and gave no useful insights:
Policy Number
Policy Bind Date
Insured Zip
Incident Date
Incident Location
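The two preprocessing steps above can be sketched in pandas as follows. The snake_case column names and the sample rows are assumptions made for illustration; the actual column names in the dataset may differ.

```python
import pandas as pd

# Tiny made-up frame; column names are assumed, not taken from the dataset.
df = pd.DataFrame({
    "policy_number": [1001, 1002, 1003],
    "policy_bind_date": ["2010-01-01", "2012-05-17", "2014-09-30"],
    "insured_zip": [430001, 430002, 430003],
    "incident_date": ["2015-01-10", "2015-02-11", "2015-03-12"],
    "incident_location": ["1 Main St", "2 Oak Ave", "3 Elm Rd"],
    "number_of_vehicles_involved": [1, 3, 1],
    "bodily_injuries": [0, 2, 1],
    "witnesses": [2, 0, 3],
    "total_claim_amount": [5000, 62000, 7100],
})

# Small integer counts are treated as categories rather than magnitudes.
count_cols = ["number_of_vehicles_involved", "bodily_injuries", "witnesses"]
df = df.astype({col: "category" for col in count_cols})

# Near-unique identifier, date, and location columns are dropped.
drop_cols = [
    "policy_number",
    "policy_bind_date",
    "insured_zip",
    "incident_date",
    "incident_location",
]
df = df.drop(columns=drop_cols)

print(df.dtypes)
```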
Imputation of Missing Values
Categorical features were label encoded before being fed into the model.
One-hot encoding was not used because the models are tree-based.
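One simple way to get integer label codes is pandas' categorical codes, sketched below. This is an illustrative stand-in, since the slides do not say which encoder was used; the column names and values are assumed. Codes are assigned in sorted order of the category values.

```python
import pandas as pd

# Made-up sample rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "incident_severity": ["Major Damage", "Minor Damage", "Total Loss", "Major Damage"],
    "insured_hobbies": ["chess", "cross-fit", "reading", "chess"],
    "total_claim_amount": [62000, 5000, 71000, 58000],
})

categorical_cols = ["incident_severity", "insured_hobbies"]

encoded = df.copy()
for col in categorical_cols:
    # Each distinct value gets an integer code (sorted order: 0, 1, 2, ...).
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Tree-based models split on thresholds, so the arbitrary ordering of these codes is generally tolerable for them, whereas a linear model would read the codes as magnitudes.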
Exploratory Data Analysis
Number Of Witnesses gives some information
After removing correlated features and features with unique values, we have 30 features in the dataset:
9 Numerical Features
21 Categorical Features
The predictive models used are RandomForestClassifier and XGBoost's XGBClassifier.
GradientBoostingClassifier was also tried, but the two models above performed better.
RandomForestClassifier gave the best performance, so its results are shown.
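A minimal sketch of fitting a RandomForestClassifier on imbalanced data is shown below. It runs on synthetic data from make_classification, not the insurance dataset, and the hyperparameters and class_weight="balanced" are assumptions: the slides do not state how the model was configured or how the imbalance was handled.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 30 features, roughly 25% positive class.
X, y = make_classification(
    n_samples=1000,
    n_features=30,
    n_informative=8,
    weights=[0.753, 0.247],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency
# (one possible way to address imbalance; assumed, not confirmed).
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

test_recall = recall_score(y_test, y_pred)
test_auc = roc_auc_score(y_test, y_score)

# Impurity-based importances, one value per feature, summing to 1.
importances = model.feature_importances_
print(test_recall, test_auc, importances.argmax())
```

The feature_importances_ attribute is the usual starting point for ranking features of a fitted random forest.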
Base model performance
Claims to be targeted:
Having at least one witness
Insured lists Chess or CrossFit as a hobby
Insureds listing their occupation as Exec-Managers
Insureds reporting claims for major damage
Claim amount and insured education level are also important features as evidenced by model feature importance
There is also a need to increase the number of police reports filed on claims, as they can serve as an
initial screen for fraudulent claims
An interactive dashboard was also prepared using Dash and Plotly, showing count plots for the categorical
features
Future Steps