Credit Card Fraud Detection: Allam Prathyusha Reddy (Urk18Cs114)


CREDIT CARD FRAUD DETECTION

A mini project report submitted by

ALLAM PRATHYUSHA REDDY (URK18CS114)

in partial fulfillment for the award of the degree


of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
under the supervision of
DR. ESTHER DANIEL, Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


KARUNYA INSTITUTE OF TECHNOLOGY AND SCIENCES
(Declared as Deemed-to-be University under Sec. 3 of the UGC Act, 1956)
Karunya Nagar, Coimbatore - 641 114. INDIA

March 2021

1 | 23P a g e Mini Project 2020-2021


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

BONAFIDE CERTIFICATE

This is to certify that the project report entitled, “Credit Card Fraud Detection” is a
bonafide record of Mini Project work done during the even semester of the academic year
2020-2021 by

ALLAM PRATHYUSHA REDDY (Reg. No: URK18CS114)

in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology
in Computer Science and Engineering of Karunya Institute of Technology and Sciences.

Submitted for the Viva Voce held on 29-03-2021

Project Coordinator Signature of the Guide



ACKNOWLEDGEMENT

First and foremost, I praise and thank ALMIGHTY GOD whose blessings have bestowed

in me the will power and confidence to carry out my project.

I am grateful to our beloved founders Late. Dr. D.G.S. Dhinakaran, C.A.I.I.B, Ph.D

and Dr. Paul Dhinakaran, M.B.A, Ph.D, for their love and always remembering us in their

prayers.

I extend my thanks to our Vice Chancellor Dr. P. Mannar Jawahar, Ph.D and our

Registrar Dr. Elijah Blessing, M.E., Ph.D, for giving me this opportunity to do the project.

I would like to thank Dr. Prince Arulraj, M.E., Ph.D., Dean, School of Engineering and

Technology for his direction and invaluable support to complete the same.

I would like to place my heart-felt thanks and gratitude to Dr. J. Immanuel John Raja,

M.E., Ph.D., Head of the Department, Computer Science and Engineering for his

encouragement and guidance.

I feel it is a pleasure to be indebted to Mr. J. Andrew, M.E, (Ph.D.), Assistant Professor,

Department of Computer Science and Engineering, and Dr. Esther Daniel for their invaluable

support, advice and encouragement.

I also thank all the staff members of the Department for extending their helping hands to

make this project a successful one.



I would also like to thank all my friends and my parents who have prayed and helped me

during the project work.

CONTENTS

Acknowledgement
Abstract
1. Introduction
1.1. Introduction
1.2. Methodology
1.3. Flow diagram
2. Data Analysis
2.1. About the dataset
2.2. Data preparation
2.3. Exploratory data analysis
3. Implementation
3.1. Libraries used
3.2. The data set
3.3. Print the shape of the data
3.4. Histograms of each parameter
3.5. Determine the number of fraud cases
3.6. Correlation matrix
3.7. Get all columns from the DataFrame
4. Test results
4.1. Result
4.2. Unsupervised outlier detection
5. Conclusions and further scope

References



ABSTRACT

MACHINE LEARNING:
The use and development of computer systems that are able to learn and adapt without following
explicit instructions, by using algorithms and statistical models to analyse and draw inferences
from patterns in data.
Approach:
It is vital that credit card companies are able to identify fraudulent credit card transactions so
that customers are not charged for items that they did not purchase.
Such problems can be tackled with machine learning. This project intends to illustrate the
modeling of a data set using machine learning, with credit card fraud detection as the example.
The credit card fraud detection problem includes modeling past credit card transactions with the
data of the ones that turned out to be fraud. This model is then used to recognize whether a new
transaction is fraudulent or not.
Our objective here is to detect 100% of the fraudulent transactions while minimizing the
incorrect fraud classifications. Credit card fraud detection is a typical example of a
classification problem.
In this process, we have focused on analyzing and pre-processing data sets as well as the
deployment of multiple anomaly detection algorithms, such as Local Outlier Factor and Isolation
Forest, on the PCA-transformed credit card transaction data.



Machine learning is the field of study that gives computers the capability to learn without being
explicitly programmed. It is one of the most exciting technologies that one could come across.
As is evident from the name, it gives the computer the ability to learn, which makes it more
similar to humans.
In this project we perform credit card fraud detection using several anomaly detection methods
from the scikit-learn package. We use the Local Outlier Factor to calculate anomaly scores, as
well as the Isolation Forest algorithm.
These algorithms comb through our data set of credit card transactions and predict which
ones are fraudulent.

To detect credit card fraud effectively, we need to understand the various
technologies, algorithms and types involved in detecting credit card frauds.
An algorithm can differentiate between fraudulent and non-fraudulent transactions. To find fraud,
it needs to be given a dataset of past transactions and knowledge of which ones were fraudulent.
It then analyzes the dataset and classifies all transactions. Fraud detection involves monitoring the
activities of populations of users to estimate, perceive or avoid objectionable behavior, which
consists of fraud, intrusion, and defaulting.
Machine learning algorithms are employed to analyze all the authorized transactions and report
the suspicious ones.
These reports are investigated by professionals who contact the cardholders to confirm whether the
transaction was genuine or fraudulent.
The investigators provide feedback to the automated system, which is used to train and update
the algorithm and eventually improve the fraud-detection performance over time.
The main challenges involved in credit card fraud detection are:
• Enormous data is processed every day, and the model built must be fast enough to respond to
the scam in time.
• Imbalanced data: most of the transactions (99.8%) are not fraudulent, which makes it really
hard to detect the fraudulent ones.
• Data availability, as the data is mostly private.
• Misclassified data can be another major issue, as not every fraudulent transaction is caught
and reported.
• Adaptive techniques used against the model by the scammers.



We use Jupyter Notebook to develop the Python application.
The project offers some valuable lessons about pre-processing data sets as well as the deployment
of multiple anomaly detection algorithms, namely the Local Outlier Factor and the Isolation Forest
algorithm.
In our daily lives, there are many situations in which the detection of credit card fraud is
required.
We perform the credit card fraud detection by using several different anomaly detection
methods.

1. INTRODUCTION

1.1. INTRODUCTION

Credit Card Fraud can be defined as a case where a person uses someone else’s credit card for
personal reasons while the owner and the card issuing authorities are unaware of the fact that the
card is being used.

Due to the rise and acceleration of e-commerce, there has been a tremendous increase in the use of
credit cards for online shopping, which has led to a high number of frauds related to credit cards.
In the era of digitalization, it is necessary to identify credit card frauds.
Fraud detection involves monitoring and analyzing the behavior of various users in order to
estimate, detect or avoid undesirable behavior.
In order to detect credit card fraud effectively, we need to understand the various
technologies, algorithms and types involved in detecting credit card frauds. An algorithm can
differentiate between fraudulent and non-fraudulent transactions. To find fraud, it needs to be
given a dataset of past transactions and knowledge of which ones were fraudulent.
It then analyzes the dataset and classifies all transactions. Fraud detection involves monitoring the
activities of populations of users in order to estimate, perceive or avoid objectionable behavior,
which consists of fraud, intrusion, and defaulting.



Machine learning algorithms are employed to analyze all the authorized transactions and report
the suspicious ones.
These reports are investigated by professionals who contact the cardholders to confirm whether the
transaction was genuine or fraudulent.
The investigators provide feedback to the automated system, which is used to train and update
the algorithm and eventually improve the fraud-detection performance over time.

1.2. METHODOLOGY

The approach of this project uses recent machine learning algorithms to detect anomalous
activities, called outliers.
The basic rough architecture diagram can be represented with the following figure:
When looked at in detail on a larger scale along with real life elements, the full architecture
diagram can be represented as follows:

First of all, we obtained our dataset from Kaggle, a data analysis website which provides
datasets. This dataset has 31 columns, of which 28 are named V1-V28 to protect
sensitive data.
The other columns represent Time, Amount and Class. Time shows the time elapsed between the
first transaction in the dataset and the current one. Amount is the amount of money transacted.
Class 0 represents a valid transaction and 1 represents a fraudulent one.
We plot different graphs to check for inconsistencies in the dataset and to visually comprehend
it:
This graph shows that the number of fraudulent transactions is much lower than the number of
legitimate ones.
This graph shows the times at which transactions were made within the two days. It can be seen that
the fewest transactions were made during the night and the most during the day.
This graph represents the amounts that were transacted.
A majority of transactions are relatively small and only a handful of them come close to the
maximum transacted amount.
After checking this dataset, we plot a histogram for every column. This is done to get a graphical
representation of the dataset, which can be used to verify that there are no missing values in
the dataset.
This ensures that we do not require any missing value imputation and that the machine
learning algorithms can process the dataset smoothly.
After this analysis, we plot a heatmap to get a colored representation of the data and to study the
correlation between our predictor variables and the class variable. This heatmap is shown
below:
The dataset is now formatted and processed. The Time and Amount columns are standardized and
the Class column is removed to ensure fairness of evaluation.
The data is processed by a set of algorithms from modules. The following module diagram
explains how these algorithms work together:
This data is fit into a model and the following outlier detection modules are applied on it:
• Local Outlier Factor

• Isolation Forest Algorithm

These algorithms are a part of sklearn. The ensemble module in the sklearn package includes
ensemble-based methods and functions for classification, regression and outlier detection.
This free and open-source Python library is built using the NumPy, SciPy and matplotlib modules
and provides many simple and efficient tools which can be used for data analysis
and machine learning. It features various classification, clustering and regression algorithms and
is designed to interoperate with the numerical and scientific libraries.
We used the Jupyter Notebook platform to write a program in Python to demonstrate the approach
that this paper suggests. This program can also be executed in the cloud using the Google Colab
platform, which supports all Python notebook files.
Detailed explanations of the modules, with pseudocode for their algorithms and output
graphs, are given as follows:

1. Local Outlier Factor

It is an unsupervised outlier detection algorithm. ‘Local Outlier Factor’ refers to the anomaly
score of each sample. It measures the local deviation of the density of a sample with respect to
its neighbors.
On plotting the results of the Local Outlier Factor algorithm, we get the following figure:
By comparing the local density of a sample to that of its neighbors, one can identify samples whose
density is substantially lower than that of their neighbors.
These samples are anomalous and are considered outliers.
As the dataset is very large, we used only a fraction of it in our tests to reduce processing time.
The final result with the complete dataset processed is also determined and is given in the results
section of this paper.
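
As a minimal sketch of this idea, the snippet below runs scikit-learn's LocalOutlierFactor on a small synthetic data set rather than on the credit card data. The synthetic points, n_neighbors=20 and the contamination value are assumptions chosen for illustration; they are not the settings used in this project.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)

# Synthetic data (an assumption): 200 "normal" points near the origin
# plus 5 points placed far away from them.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X = np.vstack([normal, far])

# contamination is the expected share of outliers in the data.
lof = LocalOutlierFactor(n_neighbors=20, contamination=5 / len(X))
labels = lof.fit_predict(X)              # -1 = outlier, 1 = inlier
scores = lof.negative_outlier_factor_    # more negative = more anomalous

n_flagged = int((labels == -1).sum())
print("samples flagged as outliers:", n_flagged)
print("most anomalous sample index:", int(np.argmin(scores)))
```

The anomaly score is available for every sample, so a different threshold can be applied afterwards without refitting the model.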

2. Isolation Forest Algorithm



The Isolation Forest isolates observations by randomly selecting a feature and then randomly
selecting a split value between the maximum and minimum values of that feature.
Since recursive partitioning can be represented by a tree, the number of splits required to isolate
a sample is equivalent to the path length from the root node to the terminating node.
The average of this path length over the trees gives a measure of normality and the decision
function which we use.
The pseudocode for this algorithm can be written as:
On plotting the results of the Isolation Forest algorithm, we get the following figure:
Random partitioning produces shorter paths for anomalies.
When a forest of random trees collectively produces shorter path lengths for specific samples, they
are very likely to be anomalies.
Once the anomalies are detected, the system can be used to report them to the concerned
authorities.
For testing purposes, we compare the outputs of these algorithms to determine their
accuracy and precision.
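
The behaviour described above can be illustrated with a small sketch that uses scikit-learn's IsolationForest on synthetic data. The data, the number of trees and the contamination value are assumptions made for this example only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic data (an assumption): 300 "normal" points and 6 distant points.
normal = rng.normal(0.0, 1.0, size=(300, 2))
far = rng.uniform(6.0, 8.0, size=(6, 2))
X = np.vstack([normal, far])

forest = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
forest.fit(X)

labels = forest.predict(X)             # -1 = anomaly, 1 = normal
scores = forest.decision_function(X)   # lower = shorter average path = more anomalous

mean_far = scores[-6:].mean()
mean_normal = scores[:300].mean()
print("mean score of distant points:", round(float(mean_far), 3))
print("mean score of normal points: ", round(float(mean_normal), 3))
```

The distant points receive lower decision-function values because they are isolated after fewer random splits.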

1.3. FLOW DIAGRAM

2. DATA ANALYSIS

2.1. ABOUT DATASET


The dataset contains transactions made by credit cards in September 2013 by European
cardholders. This dataset presents transactions that occurred in two days, where we have 492
frauds out of 284,807 transactions.
The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all
transactions.
It contains only numerical input variables which are the result of a PCA transformation.
Unfortunately, due to confidentiality issues, we cannot provide the original features and more
background information about the data.
Features V1, V2, ... V28 are the principal components obtained with PCA. The only features
which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the
seconds elapsed between each transaction and the first transaction in the dataset. The feature
'Amount' is the transaction amount; this feature can be used for example-dependent
cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of
fraud and 0 otherwise.

2.2. DATA PREPARATION

Before continuing with our analysis, it is important not to forget that while the anonymized
features have been scaled and seem to be centered around zero, our time and amount features
have not. Leaving them unscaled would cause machine learning algorithms that give
weights to features (logistic regression) or rely on a distance measure (KNN) to perform much
worse. To avoid this issue, I standardized both the time and amount columns. Luckily, there are no
missing values and we, therefore, do not need to worry about missing value imputation.
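
A minimal sketch of this standardization step is given below. It assumes the use of scikit-learn's StandardScaler and a tiny made-up DataFrame in place of the real data; the values are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the real data (values are made up).
df = pd.DataFrame({
    "Time": [0.0, 3600.0, 86400.0, 100800.0, 172000.0],
    "Amount": [2.69, 149.62, 378.66, 1.00, 69.99],
    "V1": [-1.36, 1.19, -0.97, 0.45, 0.10],
})

# Check for missing values before scaling.
n_missing = int(df.isnull().sum().sum())

# Standardize only the two columns that were not already scaled.
scaler = StandardScaler()
df[["Time", "Amount"]] = scaler.fit_transform(df[["Time", "Amount"]])

print("missing values:", n_missing)
print(df[["Time", "Amount"]].mean().round(6))
print(df[["Time", "Amount"]].std(ddof=0).round(6))
```

After the transformation both columns have zero mean and unit variance, which puts them on a scale comparable to the PCA components.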

Machine Learning-Based Approaches

Below is a brief overview of popular machine learning-based techniques for anomaly detection.
a. Density-Based Anomaly Detection
Density-based anomaly detection is based on the k-nearest neighbors algorithm.



Assumption: Normal data points occur in a dense neighborhood and abnormalities are far
away.
The nearest set of data points is evaluated using a score, which could be the Euclidean distance
or a similar measure dependent on the type of the data (categorical or numerical). These methods
can be broadly classified into two algorithms:
K-nearest neighbor:

k-NN is a simple, non-parametric lazy learning technique used to classify data based on
similarities in distance metrics such as Euclidean, Manhattan, Minkowski, or Hamming distance.
Relative density of data:
This is better known as local outlier factor (LOF). This concept is based on a distance metric
called reachability distance.
b. Clustering-Based Anomaly Detection
Clustering is one of the most popular concepts in the domain of unsupervised learning.
Assumption: Data points that are similar tend to belong to similar groups or clusters, as
determined by their distance from local centroids.
K-means is a widely used clustering algorithm. It creates 'k' similar clusters of data points. Data
instances that fall outside of these groups could potentially be marked as anomalies.
c. Support Vector Machine-Based Anomaly Detection
A support vector machine is another effective technique for detecting anomalies.
An SVM is typically associated with supervised learning, but there are extensions
(OneClassSVM, for instance) that can be used to identify anomalies as an unsupervised problem
(in which training data are not labeled).
• The algorithm learns a soft boundary in order to cluster the normal data instances using
the training set, and then, using the testing instance, it tunes itself to identify the
abnormalities that fall outside the learned region.
Depending on the use case, the output of an anomaly detector could be numeric scalar values for
filtering on domain-specific thresholds or textual labels (such as binary/multi labels).
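
To make the density-based idea in (a) concrete, and to show the difference between a numeric anomaly score and a label obtained from a threshold, here is a small sketch. The function name knn_distance_scores, the toy points and the threshold of 1.0 are hypothetical choices for illustration.

```python
import numpy as np

def knn_distance_scores(X, k=3):
    """Mean Euclidean distance from each row of X to its k nearest neighbours."""
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbour
    nearest = np.sort(dist, axis=1)[:, :k]  # k smallest distances per row
    return nearest.mean(axis=1)

# Four points in a dense group and one point far away (toy data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

scores = knn_distance_scores(X, k=3)     # numeric scalar output
labels = (scores > 1.0).astype(int)      # binary label via an assumed threshold
most_anomalous = int(np.argmax(scores))

print(scores.round(3))
print(labels)
```

Points in the dense group have small scores, while the isolated point has a large one and is the only sample labeled 1.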
In this Jupyter notebook (or Google Colab) we take credit card fraud detection as
the case study for understanding this concept in detail, using the following anomaly detection
techniques:
• Isolation Forest Anomaly Detection Algorithm
• Density-Based Anomaly Detection (Local Outlier Factor) Algorithm
• Support Vector Machine Anomaly Detection Algorithm

2.3. Exploratory Data Analysis (EDA)



The time is recorded as the number of seconds since the first transaction in the data set.
Therefore, we can conclude that this data set includes all transactions recorded over the course of
two days. As opposed to the distribution of the monetary value of the transactions, the
distribution of transaction times is bimodal.
This indicates that approximately 28 hours after the first transaction there was a significant drop
in the volume of transactions. While the time of the first transaction is not provided, it would be
reasonable to assume that the drop in volume occurred during the night.
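
One way to inspect this pattern is to group the elapsed seconds into hourly bins and count the transactions in each bin. The sketch below does this with NumPy; the helper name transactions_per_hour and the sample time values are hypothetical.

```python
import numpy as np

def transactions_per_hour(time_seconds, n_hours=48):
    """Count transactions in each elapsed hour since the first transaction."""
    hours = (np.asarray(time_seconds) // 3600).astype(int)
    return np.bincount(hours, minlength=n_hours)[:n_hours]

# Made-up 'Time' values in seconds; 100800 s corresponds to hour 28.
times = [0, 10, 3599, 3600, 7300, 100800, 100900]
counts = transactions_per_hour(times)

print("hour 0:", counts[0])
print("hour 28:", counts[28])
```

Applied to the real 'Time' column, a dip in these hourly counts would mark the low-volume period described above.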

Dimensionality Reduction With t-SNE for Visualization


Visualizing our classes would prove to be quite interesting and show us if they are clearly
separable. However, it is not possible to produce a 30-dimensional plot using all of our
predictors. Instead, using a dimensionality reduction technique such as t-SNE, we are able to
project these higher-dimensional distributions into lower-dimensional visualizations. For this
project, I decided to use t-SNE, an algorithm that I had not worked with before. More details on
how this algorithm works can be found in van der Maaten and Hinton [2].
Projecting our data set into a two-dimensional space, we are able to produce a scatter plot
showing the clusters of fraudulent and non-fraudulent transactions:
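
A minimal sketch of such a projection, assuming scikit-learn's TSNE and a small synthetic 30-dimensional data set in place of the real transactions, could look as follows. The perplexity value and the two synthetic groups are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)

# Synthetic stand-in: 30 "valid" and 10 "fraud" samples with 30 features each.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 30)),
    rng.normal(4.0, 1.0, size=(10, 30)),
])
y = np.array([0] * 30 + [1] * 10)

# Project to two dimensions; perplexity must be smaller than the sample count.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)

print(embedding.shape)
```

The two columns of embedding can then be drawn as a scatter plot, colored by y, to see whether the classes separate.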



Classification Algorithms
Now to the part you have probably been waiting for: training machine learning
algorithms. To be able to test the performance of our algorithms, I first performed an 80/20
train-test split, splitting our balanced data set into two pieces. To avoid overfitting, I used the
very common resampling technique of k-fold cross-validation. This simply means that you separate
your training data into k parts (folds) and then fit your model on k-1 folds before making
predictions for the kth hold-out fold. You then repeat this process for every single fold and
average the resulting predictions.
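
The fold construction described above can be sketched in plain Python. The helper k_fold_indices is a hypothetical name, and it uses contiguous folds without shuffling to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "hold-out:", test_idx)
```

Each sample lands in the hold-out fold exactly once, so every prediction is made by a model that did not see that sample during fitting.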
To get a better feeling of which algorithm would perform best on our data, let’s quickly spot-
check some of the most popular classification algorithms:

• Logistic Regression

• Linear Discriminant Analysis

• K Nearest Neighbors (KNN)

• Classification Trees

• Support Vector Classifier



3. IMPLEMENTATION

3.1. LIBRARIES USED


To start, let's print out the version numbers of all the libraries we will be using in this project.
This serves two purposes: it ensures that we have installed the libraries correctly and that
this project will be reproducible.
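
A minimal sketch of such a version check is shown below. It uses the standard importlib.metadata module so that a missing package is reported instead of raising an error; the helper name library_versions and the exact list of packages are assumptions based on the libraries named in this report.

```python
import sys
from importlib import metadata

def library_versions(names):
    """Return a dict mapping each package name to its installed version."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

print("Python:", sys.version.split()[0])
packages = ["numpy", "pandas", "matplotlib", "scipy", "scikit-learn"]
for name, version in library_versions(packages).items():
    print(f"{name}: {version}")
```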

3.2. The Data Set


In the following cells, we will import our dataset from a .csv file as a Pandas DataFrame.
Furthermore, we will begin exploring the dataset to gain an understanding of the type, quantity,
and distribution of data in our dataset. For this purpose, we will use Pandas' built-in describe
feature, as well as parameter histograms and a correlation matrix.
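
The loading and first inspection steps can be sketched as follows. To keep the example self-contained, a few made-up rows are read from an in-memory string; with the real data one would pass the path of the downloaded file to pd.read_csv instead (the file name in the comment is an assumption).

```python
import io
import pandas as pd

# Stand-in for the real file, e.g. pd.read_csv("creditcard.csv") (assumed name).
csv_text = """Time,V1,V2,Amount,Class
0,-1.36,-0.07,149.62,0
1,1.19,0.27,2.69,0
2,-1.36,-1.34,378.66,0
406,-2.31,1.95,0.00,1
"""
data = pd.read_csv(io.StringIO(csv_text))

print(data.columns.tolist())   # column names
print(data.shape)              # (number of rows, number of columns)
print(data.describe())         # count, mean, std, min, quartiles, max per column
```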



3.3. Print the shape of the data

3.4. Histograms of each parameter

3.5. Determine the number of fraud cases in the data
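
A minimal sketch of this step, using a small made-up DataFrame with the same 'Class' convention (1 = fraud, 0 = valid), is given below. The variable name outlier_fraction is an assumption; such a ratio is commonly passed to the outlier detectors as their expected contamination.

```python
import pandas as pd

# Made-up data: seven valid transactions and one fraudulent one.
data = pd.DataFrame({
    "Amount": [10.0, 25.5, 3.2, 99.0, 1.0, 42.0, 7.5, 300.0],
    "Class":  [0, 0, 0, 0, 0, 0, 0, 1],
})

fraud = data[data["Class"] == 1]
valid = data[data["Class"] == 0]
outlier_fraction = len(fraud) / len(valid)

print("Fraud cases:", len(fraud))
print("Valid cases:", len(valid))
print("Outlier fraction:", round(outlier_fraction, 4))
```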

3.6. Correlation matrix
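
The correlation matrix itself can be computed with the pandas corr method, as in the sketch below. The data is synthetic, and the column AmountTimesTwo is a hypothetical feature added only to show what a perfectly correlated pair looks like.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
amount = rng.exponential(50.0, size=200)

data = pd.DataFrame({
    "V1": rng.normal(size=200),
    "Amount": amount,
    "AmountTimesTwo": 2.0 * amount,   # perfectly correlated with Amount by construction
})

# Pairwise Pearson correlation between all numeric columns.
corrmat = data.corr()
print(corrmat.round(2))
```

The resulting square matrix is what a heatmap visualizes: values near 1 or -1 indicate strongly related columns, values near 0 indicate little linear relationship.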



3.7. Get all columns from the DataFrame
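
A short sketch of this step is shown below: the column names are collected, the 'Class' column is separated as the target, and the remaining columns form the input features. The tiny DataFrame and the variable names X and Y are assumptions for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "Time": [0, 1, 2, 406],
    "V1": [-1.36, 1.19, -1.36, -2.31],
    "Amount": [149.62, 2.69, 378.66, 0.0],
    "Class": [0, 0, 0, 1],
})

# Keep every column except the label as an input feature.
columns = [c for c in data.columns.tolist() if c != "Class"]
target = "Class"

X = data[columns]   # features passed to the outlier detectors
Y = data[target]    # true labels, used only for evaluation

print(X.shape)
print(Y.shape)
```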



4. Test Results
4.1. Result
The goal is to identify fraudulent credit card transactions.
Given the class imbalance ratio, we recommend measuring performance using the Area Under
the Precision-Recall Curve (AUPRC). Accuracy computed from the confusion matrix is not
meaningful for unbalanced classification.
The code prints out the number of false positives it detected and compares it with the actual
values. This is used to calculate the accuracy score and precision of the algorithms.
The fraction of data we used for faster testing is 10% of the entire dataset. The complete dataset
is also used at the end and both results are printed.
These results, along with the classification report for each algorithm, are given in the output as
follows, where class 0 means the transaction was determined to be valid and 1 means it was
determined to be a fraudulent transaction.
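
The following sketch illustrates why plain accuracy is misleading on such unbalanced data and how an AUPRC-style metric can be computed with scikit-learn. The labels and scores are made up, and average_precision_score is used here as a common estimate of the area under the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

# Made-up ground truth: 998 valid transactions (0) and 2 frauds (1).
y_true = np.array([0] * 998 + [1] * 2)

# A useless detector that labels every transaction as valid.
y_pred_all_valid = np.zeros_like(y_true)

n_errors = int((y_pred_all_valid != y_true).sum())
accuracy = accuracy_score(y_true, y_pred_all_valid)
recall = recall_score(y_true, y_pred_all_valid)

# Hypothetical anomaly scores (higher = more suspicious): one fraud is ranked
# first, one valid transaction second, and the other fraud third.
scores = np.zeros(len(y_true))
scores[998] = 0.9
scores[0] = 0.5
scores[999] = 0.1
auprc = average_precision_score(y_true, scores)

print("errors:", n_errors)
print("accuracy:", accuracy)
print("recall on frauds:", recall)
print("average precision:", round(float(auprc), 4))
```

The detector that never flags fraud still reaches an accuracy close to 1 while catching no fraud at all, which is why a precision-recall based measure is preferred.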

4.2 Unsupervised Outlier Detection

Now that we have processed our data, we can begin deploying our machine learning algorithms.
We will use the following techniques:

Local Outlier Factor (LOF)

The anomaly score of each sample is called Local Outlier Factor. It measures the local deviation
of density of a given sample with respect to its neighbors. It is local in that the anomaly score
depends on how isolated the object is with respect to the surrounding neighborhood.

Isolation Forest Algorithm

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly
selecting a split value between the maximum and minimum values of the selected feature.

Since recursive partitioning can be represented by a tree structure, the number of splittings
required to isolate a sample is equivalent to the path length from the root node to the terminating
node.

This path length, averaged over a forest of such random trees, is a measure of normality and our
decision function.



Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of
random trees collectively produces shorter path lengths for particular samples, they are highly
likely to be anomalies.

• When looking at the results of the Local Outlier Factor and Isolation Forest algorithms, it is
clear from the above table that the Isolation Forest performs better, with an accuracy
of 97 percent in online transactions.



5. Conclusions and Further Scope
5.1. Conclusion

Fraud detection is a complex issue that requires a substantial amount of planning before throwing
machine learning algorithms at it. Nonetheless, it is also an application of data science and
machine learning for the good, which makes sure that the customer’s money is safe and not
easily tampered with.

Future work will include a comprehensive tuning of the Random Forest algorithm I talked about
earlier. Having a data set with non-anonymized features would make this particularly interesting
as outputting the feature importance would enable one to see what specific factors are most
important for detecting fraudulent transactions.

It is essential for credit card businesses to be able to recognize fraudulent credit card
transactions so that consumers are not charged for things they have not purchased. With the growing
use of credit cards for purchases, the risk of credit card fraud is rising significantly. In this
project, an analysis of credit card fraud identification was described on a publicly available
dataset utilizing machine learning techniques such as Local Outlier Factor and Isolation Forest.
The proposed framework is implemented in Python. When analyzing the data set, Isolation Forest
provided a higher precision than the Local Outlier Factor algorithm.

REFERENCES:

1. John Richard D. Kho and Larry A. Vea, Credit Card Fraud Detection Based on Transaction
Behavior (2017), Proc. of the 2017 IEEE Region 10 Conference (TENCON), Malaysia,
November 5-8, 2017

2. L.J.P. van der Maaten and G.E. Hinton, Visualizing High-Dimensional Data
Using t-SNE (2014), Journal of Machine Learning Research

3. Machine Learning Group - ULB, Credit Card Fraud Detection (2018), Kaggle

4. Nathalie Japkowicz, Learning from Imbalanced Data Sets: A Comparison of
Various Strategies (2000), AAAI Technical Report WS-00-05