
Fraud Detection in Auto Insurance

Aditya Bhattar
Problem Statement

 Determine whether a given auto insurance claim is fraudulent based on a given set of features, and develop a predictive model to do so.
Potential Business Problems

 Identifying fraudulent claims is important for any insurance company, as insurers do not want to pay out on claims that are not legitimate.
 In auto insurance, a fraudulent claim may arise when someone claims damages for an accident that never actually took place.
 Traditionally, insurance companies have relied on a claim investigation team to investigate claims and determine whether or not they are fraudulent.
 However, with the increase in computing power and advanced analytics, insurance companies are trying to build automated solutions to determine whether a claim is fraudulent, as the traditional methods are prone to error.
Why Solve this problem

 Identify the circumstances that lead to fraudulent claims
 Reduce reliance on human investigators to determine whether a claim is fraudulent
 Correctly identify legitimate claims to ensure payout to rightful policyholders
 May also help determine what type of policyholders the company should avoid
Features of the Dataset

 1000 Datapoints
 38 features:
 13 numerical features
 25 categorical features
 One Target Variable: Fraud_Reported
 Yes – 247 Count
 No – 753 Count
 The target variable shows high class imbalance, which is important to address during modelling.
 The dataset was split into a train and a test set for modelling: 800 datapoints for training and 200 for testing.
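A minimal sketch of this split, assuming the data sits in a pandas DataFrame df with an assumed target column name fraud_reported (whether the original split was stratified is also an assumption):

from sklearn.model_selection import train_test_split

X = df.drop(columns=["fraud_reported"])   # assumed column name
y = df["fraud_reported"]

# stratify=y keeps the 247/753 class ratio the same in both splits
# (an assumption; the slides only state the 800/200 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)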
Evaluation Metric

 The evaluation metrics for this project are precision, recall, and ROC AUC score.
 For the business use case, failing to catch a fraudulent claim (a false negative) is more costly than investigating a legitimate one, so recall on the fraudulent class is given more importance.
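A sketch of how these metrics could be computed with scikit-learn, assuming a fitted classifier clf, the train/test split above, and fraud encoded as the positive class 1:

from sklearn.metrics import classification_report, roc_auc_score

y_pred = clf.predict(X_test)                     # hard labels for precision/recall
y_proba = clf.predict_proba(X_test)[:, 1]        # probability of the fraud class

print(classification_report(y_test, y_pred))     # per-class precision, recall, F1
print("ROC AUC:", roc_auc_score(y_test, y_proba))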
Data Cleaning and Preprocessing
Changing dtypes and Dropping columns with Unique Values

 Needed to change dtype of certain columns from numerical to categorical as they were categorical features by nature:
 Number of Vehicles Involved
 Number of Bodily Injuries
 Number of Witnesses
 Decided to drop 5 columns because they had too many unique values and did not give any useful insight:
 Policy Number
 Policy Bind Date
 Insured Zip
 Incident Date
 Incident Location
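A sketch of these two steps in pandas; the snake_case column names are assumptions based on the slide text:

# columns that are categorical by nature but stored as numbers in the raw data
categorical_by_nature = ["number_of_vehicles_involved", "bodily_injuries", "witnesses"]
df[categorical_by_nature] = df[categorical_by_nature].astype("category")

# columns with (almost) all-unique values that carry no generalisable signal
high_cardinality = ["policy_number", "policy_bind_date", "insured_zip",
                    "incident_date", "incident_location"]
df = df.drop(columns=high_cardinality)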
Imputation of Missing Values

 Three columns had ‘?’ entries (missing values):
 Collision Type
 Property Damage
 Police Report Available
 Imputed the missing values using a KNNImputer (k = 5)
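A sketch of this imputation; KNNImputer only accepts numeric input, so the assumption here is that ‘?’ is first mapped to NaN, the three columns are coded as integers, and every other column is already numeric:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

cols = ["collision_type", "property_damage", "police_report_available"]  # assumed names
df[cols] = df[cols].replace("?", np.nan)

for c in cols:
    codes = df[c].astype("category").cat.codes   # NaN becomes -1
    df[c] = codes.replace(-1, np.nan)            # put NaN back so the imputer sees it

imputer = KNNImputer(n_neighbors=5)              # K = 5, as on the slide
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
df[cols] = df[cols].round()                      # snap imputed values to valid codes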
Label Encoding of Categorical Features

 Categorical features were label encoded before being fed to the model.
 Did not one-hot encode because tree-based models were used.
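A sketch of this step, assuming the split from earlier and that the test set contains no categories unseen in training:

from sklearn.preprocessing import LabelEncoder

cat_cols = X_train.select_dtypes(include=["object", "category"]).columns
for c in cat_cols:
    le = LabelEncoder()
    X_train[c] = le.fit_transform(X_train[c].astype(str))
    X_test[c] = le.transform(X_test[c].astype(str))   # reuse the train-time mapping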
Exploratory Data Analysis
Number Of Witnesses gives some information

 Even the presence of a single witness increases the chance that a claim is fraudulent.
 Out of 247 fraudulent claims, 197 had at least one witness present.
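A sketch of the tabulation behind this slide (column names and the "Y" label value are assumptions):

import pandas as pd

# fraud rate by number of witnesses
print(pd.crosstab(df["witnesses"], df["fraud_reported"], normalize="index"))

# how many of the fraudulent claims had at least one witness
fraud = df[df["fraud_reported"] == "Y"]
print((fraud["witnesses"].astype(int) > 0).sum(), "of", len(fraud), "fraudulent claims had a witness")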
Police Report Is Important

 The availability of a police report reduces the chance of a claim being fraudulent, although fewer cases actually have a police report available.
 A larger number of datapoints would help establish a clearer relationship between fraudulent claims and the availability of police reports.
Occupations Reveal an Interesting Fact

 People who listed their occupation as Exec-Manager have the highest likelihood of making a fraudulent claim.
 37% of claims made by this category were fraudulent.
Hobbies seem to be an important question to ask policyholders

 People whose hobby is chess or cross-fit are actually more likely to make a fraudulent claim than a rightful one.
 For both hobbies, people are more than 4 times as likely to report a fraudulent claim.
Incident may not be so severe

 Major Damage is reported fraudulently more often than the other severity levels.
 This is what would be expected: people whose actual damage is minor tend to add previous damage that is not part of the incident being claimed, inflating the reported severity.
Feature Selection
Correlation Amongst Numerical Features

 The correlation heatmap shows high correlation amongst the following columns:
 Age and Months as Customer
 Total Claim Amount and its individual constituents (Injury, Property, Vehicle)
 Age and the individual constituents were dropped from the dataset.
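A sketch of the correlation check and the resulting drops (column names assumed):

import seaborn as sns
import matplotlib.pyplot as plt

numeric = df.select_dtypes(include="number")
sns.heatmap(numeric.corr(), annot=True, cmap="coolwarm")
plt.show()

# age tracks months_as_customer; the three claim components sum to total_claim_amount
df = df.drop(columns=["age", "injury_claim", "property_claim", "vehicle_claim"])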
Feature Importance as per RFC

 As per the RFC, incident severity and insured hobbies are the most important features, as evidenced during EDA, followed by total claim amount.
 However, choosing just three features does not allow the model to separate the classes well.
 Decided to drop features with a feature importance of less than 1%. This leaves 19 features.
Feature Importance as per XGBC

 As per the XGBC, insured education level and policy_csl are the most important features, followed by incident hour of the day.
 However, choosing just three features does not allow the model to separate the classes well.
 Decided to drop features with a feature importance of less than 1%. This leaves 19 features.
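A sketch of the 1% cut-off applied to both models; rfc here is assumed to be a fitted RandomForestClassifier, and the same pattern works for the fitted XGBoost model:

import pandas as pd

importances = pd.Series(rfc.feature_importances_, index=X_train.columns)
selected = importances[importances >= 0.01].index   # keep features with >= 1% importance
print(len(selected), "features kept")

X_train_sel, X_test_sel = X_train[selected], X_test[selected]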
Models and Approaches
Models Used

 After removing correlated features and features with unique values, we have 30 features in the dataset:
 9 Numerical Features
 21 Categorical Features
 Predictive models used are RandomForestClassifier and XGBoost's XGBClassifier
 Also tried GradientBoostingClassifier; however, the models above gave superior performance
 RandomForestClassifier gave the best performance, so its results are shown
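A sketch of the baseline comparison, assuming the target is encoded 0/1 with fraud = 1 (XGBoost's scikit-learn wrapper is XGBClassifier):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

for name, model in [("RandomForest", RandomForestClassifier(random_state=42)),
                    ("XGBoost", XGBClassifier(random_state=42))]:
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))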
Base model performance

 The base model performed fairly well on rightful claims, but not so well on fraudulent claims.
 This shows the need to tune the model's hyperparameters, giving more weight to the underrepresented class. It could also be a case of overfitting on the training dataset.
Model Tuning

 Model performance has improved greatly as a result of hyperparameter tuning.
 Recall for both classes is above 80%.
 From a business point of view, recall on the fraudulent class should ideally be one, so there is still room for improvement.
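The exact search space is not given on the slide, so the grid below and the use of GridSearchCV with class weighting are assumptions; it illustrates how the minority (fraud) class can be up-weighted during tuning:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 3, 5],
    "class_weight": ["balanced", {0: 1, 1: 3}],   # give more weight to the fraud class
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="recall",    # recall prioritised, as discussed under Evaluation Metric
    cv=5,
    n_jobs=-1,
)
search.fit(X_train_sel, y_train)
best_rfc = search.best_estimator_
print(search.best_params_)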
Model Performance with important features

 Model performance with selection of only the important features has improved slightly more.
 The F1-score has increased from 0.84 to 0.85.
ROC AUC Curve

 The ROC curve shows how well the model has been able to separate the classes.
 A ROC AUC score of 0.82 shows the model has done a good job, but there is room for improvement.
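A sketch of how the curve and score can be produced, assuming best_rfc and the selected-feature test set from the earlier sketches:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_proba = best_rfc.predict_proba(X_test_sel)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)

plt.plot(fpr, tpr, label=f"ROC AUC = {roc_auc_score(y_test, y_proba):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()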
Insights and Business Decisions

 Claims to be targeted:
 Having at least one witness
 Insured has a hobby of chess or cross-fit
 Insured lists their occupation as Exec-Manager
 Insured reports a claim for major damage
 Claim amount and insured education level are also important features, as evidenced by the models' feature importances
 There is also a need to obtain police reports for more claims, as their availability can serve as an initial screen for fraudulent claims
 Also prepared an interactive dashboard using Dash and Plotly showing count plots for the categorical features; a minimal sketch follows below
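A minimal sketch of that kind of dashboard: a dropdown that renders a grouped count plot of any categorical feature split by fraud_reported. The layout and callbacks of the actual project may differ.

from dash import Dash, dcc, html, Input, Output
import plotly.express as px

app = Dash(__name__)
cat_cols = list(df.select_dtypes(include=["object", "category"]).columns)

app.layout = html.Div([
    dcc.Dropdown(id="feature", options=cat_cols, value=cat_cols[0]),
    dcc.Graph(id="countplot"),
])

@app.callback(Output("countplot", "figure"), Input("feature", "value"))
def update_countplot(feature):
    # grouped bar chart of category counts, split by the target
    return px.histogram(df, x=feature, color="fraud_reported", barmode="group")

if __name__ == "__main__":
    app.run(debug=True)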
Future Steps

 Time permitting, the following additional actions could have been taken:
 Use of different imputers for missing values, such as IterativeImputer, to see if any gives superior performance
 Use of different models, such as LogisticRegression and SVM, to see how the dataset responds to them
 Application of sampling techniques such as SMOTE or random oversampling to handle the class imbalance
 Identification of a set of features that consistently leads to fraudulent claims
 Use of a pipeline to streamline the modelling process (a sketch combining SMOTE and a pipeline appears after this list)
 Adding features to the dashboard and making it more user-friendly
 There is a need to capture more data, as not many useful business insights can be drawn from 1000 datapoints. More data, better results.
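A sketch combining two of these future steps, SMOTE and a pipeline, using the imbalanced-learn package (not part of the original project) so that oversampling is applied only to the training folds during cross-validation:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),                 # oversample the fraud class
    ("rfc", RandomForestClassifier(random_state=42)),
])
scores = cross_val_score(pipe, X_train, y_train, scoring="recall", cv=5)
print("Mean CV recall:", scores.mean())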
