Credit Card Fraud Detection: Allam Prathyusha Reddy (Urk18Cs114)
March 2021
BONAFIDE CERTIFICATE
This is to certify that the project report entitled, “Police Quarters Management System” is a
bonafide record of Mini Project work done during the even semester of the academic year
2020-2021 by
in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology
in Computer Science and Engineering of Karunya Institute of Technology and Sciences.
First and foremost, I praise and thank ALMIGHTY GOD whose blessings have bestowed
I am grateful to our beloved founders Late. Dr. D.G.S. Dhinakaran, C.A.I.I.B, Ph.D
and Dr. Paul Dhinakaran, M.B.A, Ph.D, for their love and always remembering us in their
prayers.
I extend my thanks to our Vice Chancellor Dr. P. Mannar Jawahar, Ph.D and our
Registrar Dr. Elijah Blessing, M.E., Ph.D, for giving me this opportunity to do the project.
I would like to thank Dr. Prince Arulraj, M.E., Ph.D., Dean, School of Engineering and
Technology for his direction and invaluable support to complete the same.
I would like to place my heart-felt thanks and gratitude to Dr. J. Immanuel John Raja,
M.E., Ph.D., Head of the Department, Computer Science and Engineering for his
I feel it is a pleasure to be indebted to Mr. J. Andrew, M.E, (Ph.D.), Assistant Professor,
Department of Computer Science and Engineering and Dr. Esther Daniel for their invaluable
I also thank all the staff members of the Department for extending their helping hands to
CONTENTS
Acknowledgments
Abstract
1. Introduction
1.1. Introduction
1.2. Methodology
1.3. Flow Diagram
2. Data Analysis
2.1. Data Preparation
2.2. Exploratory Data Analysis
3. Implementation
3.1. Libraries Used
3.2. The Data Set
3.3. Print the Shape of the Data
3.4. Histograms of Each Parameter
3.5. Determine the Number of Frauds
3.6. Correlation Matrix
4. Test Results
4.1. Result
5. Conclusion and Further Scope
References
MACHINE LEARNING:
The use and development of computer systems that are able to learn and adapt without following
explicit instructions, by using algorithms and statistical models to analyse and draw inferences
from patterns in data.
Approach:
It is vital that credit card companies are able to identify fraudulent credit card transactions so
that customers are not charged for items that they did not purchase.
Such a problem can be tackled with machine learning. This project intends to illustrate the
modeling of a data set using machine learning, with credit card fraud detection as the problem.
The credit card fraud detection problem includes modeling past credit card transactions with the
data of the ones that turned out to be fraud. This model is then used to recognize whether a new
transaction is fraudulent or not.
Our objective here is to detect 100% of the fraudulent transactions while minimizing the
incorrect fraud classifications. Credit card fraud detection is a typical example of a
classification problem.
In this process, we have focused on analyzing and pre-processing data sets as well as the
deployment of multiple anomaly detection algorithms, such as the Local Outlier Factor and
Isolation Forest algorithms, on the PCA-transformed credit card transaction data.
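The objective stated above (catch as many frauds as possible while keeping incorrect fraud classifications low) corresponds to the recall and precision of the fraud class. The following is a minimal sketch of how these two figures could be computed with scikit-learn; the label vectors are made-up values for illustration and are not results from this project.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for ten transactions: 1 = fraud, 0 = valid.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels the flattened confusion matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall: share of the real frauds that were detected.
recall = recall_score(y_true, y_pred)
# Precision: share of the fraud alerts that were real frauds.
precision = precision_score(y_true, y_pred)

print(f"detected frauds: {tp}, missed frauds: {fn}, false alarms: {fp}")
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

Detecting 100% of the frauds means a recall of 1.0; minimizing incorrect fraud classifications means keeping the number of false alarms low, i.e. a high precision.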
To detect credit card fraud effectively, we need to understand the various
technologies, algorithms and types involved in detecting credit card frauds.
An algorithm can differentiate between fraudulent and non-fraudulent transactions. To find
fraud, it needs to be given a dataset and knowledge of the fraudulent transactions.
It then analyzes the dataset and classifies all transactions. Fraud detection involves monitoring the
activities of populations of users to estimate, perceive or avoid objectionable behavior, which
consists of fraud, intrusion, and defaulting.
Machine learning algorithms are employed to analyze all the authorized transactions and report
the suspicious ones.
These reports are investigated by professionals who contact the cardholders to confirm if the
transaction was genuine or fraudulent.
The investigators provide feedback to the automated system which is used to train and update
the algorithm to eventually improve the fraud-detection performance over time.
The main challenges involved in credit card fraud detection are:
• Enormous data is processed every day, and the model built must be fast enough to respond to
the scam in time.
• Imbalanced data, i.e. most of the transactions (99.8%) are not fraudulent, which makes it really
hard to detect the fraudulent ones.
• Data availability, as the data is mostly private.
• Misclassified data can be another major issue, as not every fraudulent transaction is caught
and reported.
• Adaptive techniques used against the model by the scammers.
1. INTRODUCTION
1.1. INTRODUCTION
Credit card fraud can be defined as a case where a person uses someone else’s credit card for
personal reasons while the owner and the card-issuing authorities are unaware of the fact that the
card is being used.
Due to the rise and acceleration of e-commerce, there has been a tremendous increase in the use
of credit cards for online shopping, which has led to a high number of frauds related to credit
cards. In the era of digitalization, it is necessary to identify credit card frauds.
Fraud detection involves monitoring and analyzing the behavior of various users in order to
estimate, detect or avoid undesirable behavior.
In order to detect credit card fraud effectively, we need to understand the various
technologies, algorithms and types involved in detecting credit card frauds. An algorithm can
differentiate between fraudulent and non-fraudulent transactions. To find fraud, it needs to be
given a dataset and knowledge of the fraudulent transactions.
It then analyzes the dataset and classifies all transactions. Fraud detection involves monitoring the
activities of populations of users in order to estimate, perceive or avoid objectionable behavior,
which consists of fraud, intrusion, and defaulting.
1.2. METHODOLOGY
The approach of this project uses the latest machine learning algorithms to detect anomalous
activities, called outliers.
The basic rough architecture diagram can be represented with the following figure:
When looked at in detail on a larger scale along with real life elements, the full architecture
diagram can be represented as follows:
First of all, we obtained our dataset from Kaggle, a data analysis website which provides
datasets. This dataset has 31 columns, of which 28 are named V1-V28 to protect
sensitive data.
The other columns represent Time, Amount and Class. Time shows the time elapsed between the
first transaction in the dataset and the given one. Amount is the amount of money transacted.
Class 0 represents a valid transaction and 1 represents a fraudulent one.
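The column layout described above can be inspected with pandas. The sketch below is illustrative only: `make_toy_transactions` is a hypothetical helper that builds random stand-in data with the same 31-column layout so that the example runs without the Kaggle file, and the file name in the comment is an assumption about where the real data would be stored.

```python
import numpy as np
import pandas as pd

def make_toy_transactions(n_rows=1000, n_fraud=5, seed=0):
    """Hypothetical stand-in with the same columns as the dataset (random values)."""
    rng = np.random.default_rng(seed)
    # Time: seconds elapsed since the first transaction, over two days.
    data = {"Time": np.sort(rng.uniform(0, 172800, n_rows))}
    # V1..V28: anonymized features.
    for i in range(1, 29):
        data[f"V{i}"] = rng.normal(size=n_rows)
    data["Amount"] = rng.exponential(scale=80.0, size=n_rows).round(2)
    labels = np.zeros(n_rows, dtype=int)
    labels[rng.choice(n_rows, size=n_fraud, replace=False)] = 1
    data["Class"] = labels  # 0 = valid, 1 = fraudulent
    return pd.DataFrame(data)

# With the real data this would be something like: df = pd.read_csv("creditcard.csv")
df = make_toy_transactions()

print(df.shape)             # (rows, 31)
print(df.columns.tolist())  # Time, V1..V28, Amount, Class

# Split by class to count the frauds and measure the imbalance.
fraud = df[df["Class"] == 1]
valid = df[df["Class"] == 0]
outlier_fraction = len(fraud) / len(valid)
print(f"fraud cases: {len(fraud)}, valid cases: {len(valid)}")
print(f"outlier fraction: {outlier_fraction:.4f}")
```

The outlier fraction computed this way is a natural candidate for the contamination setting of the outlier detection algorithms used later.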
We plot different graphs to check for inconsistencies in the dataset and to visually comprehend
it:
This graph shows that the number of fraudulent transactions is much lower than the legitimate
ones.
This graph shows the times at which transactions were made within two days. It can be seen that
the fewest transactions were made during the night and the most during the day.
This graph represents the amount that was transacted.
A majority of transactions are relatively small and only a handful of them come close to the
maximum transacted amount.
After checking this dataset, we plot a histogram for every column. This is done to get a graphical
representation of the dataset, which can be used to verify that there are no missing values in
the dataset.
This ensures that we don’t require any missing value imputation and that the machine
learning algorithms can process the dataset smoothly.
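Alongside the histograms, missing values can also be counted explicitly. The sketch below uses a small random stand-in frame (an assumption, not the real dataset) and computes the per-column missing-value counts together with the bin counts that a histogram of each column would display.

```python
import numpy as np
import pandas as pd

# Small random stand-in for the transaction data (illustrative values only).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Time": rng.uniform(0, 172800, 500),
    "V1": rng.normal(size=500),
    "Amount": rng.exponential(80.0, 500),
    "Class": rng.integers(0, 2, 500),
})

# Number of missing entries in every column; all zeros means no imputation is needed.
missing_per_column = df.isnull().sum()
needs_imputation = bool(missing_per_column.any())
print(missing_per_column)
print("imputation required:", needs_imputation)

# Bin counts behind each column's histogram (20 bins per column).
# df.hist(figsize=(20, 20)) would draw the same histograms with matplotlib.
bin_counts = {col: np.histogram(df[col], bins=20)[0] for col in df.columns}
print(bin_counts["Amount"])
```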
After this analysis, we plot a heatmap to get a colored representation of the data and to study the
correlation between our predicting variables and the class variable. This heatmap is shown
below:
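The numbers behind such a heatmap are the pairwise correlations of the columns. The following sketch computes them with pandas on synthetic data in which one feature (called V1 here purely for illustration) is deliberately shifted for the fraud class; the values are assumptions and do not reproduce the correlations of the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 2000
labels = (rng.random(n) < 0.05).astype(int)  # roughly 5% synthetic frauds

df = pd.DataFrame({
    "V1": rng.normal(size=n) - 3.0 * labels,  # shifted downwards for frauds
    "V2": rng.normal(size=n),                 # unrelated to the class
    "Amount": rng.exponential(80.0, n),
    "Class": labels,
})

# Pearson correlation matrix of all columns; this is what the heatmap colors.
corr = df.corr()

# Correlation of every predictor with the class variable, most negative first.
class_corr = corr["Class"].drop("Class").sort_values()
print(class_corr)

# seaborn.heatmap(corr) or matplotlib's imshow(corr) would render the matrix as colors.
```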
The dataset is now formatted and processed. The Time and Amount columns are standardized and
the Class column is removed to ensure fairness of evaluation.
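A minimal sketch of this preprocessing step is given below, assuming scikit-learn's StandardScaler is used for the standardization (the report does not name the exact function). The frame is a small random stand-in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small random stand-in for the transaction data (illustrative values only).
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "Time": rng.uniform(0, 172800, 300),
    "V1": rng.normal(size=300),
    "Amount": rng.exponential(80.0, 300),
    "Class": (rng.random(300) < 0.05).astype(int),
})

# Standardize Time and Amount to zero mean and unit variance.
df[["Time", "Amount"]] = StandardScaler().fit_transform(df[["Time", "Amount"]])

# Remove the Class column from the features; keep it aside for evaluation only.
X = df.drop(columns="Class")
y = df["Class"]

print(X[["Time", "Amount"]].describe().loc[["mean", "std"]])
```

Keeping the labels in a separate variable means the outlier detectors never see them during fitting, and they remain available for scoring the predictions afterwards.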
The data is processed by a set of algorithms from modules. The following module diagram
explains how these algorithms work together:
This data is fit into a model and the following outlier detection modules are applied on it:
• Local Outlier Factor
• Isolation Forest Algorithm
These algorithms are a part of sklearn. The ensemble module in the sklearn package includes
ensemble-based methods and functions for classification, regression and outlier detection.
This free and open-source Python library is built using the NumPy, SciPy and matplotlib modules
and provides a lot of simple and efficient tools which can be used for data analysis
and machine learning. It features various classification, clustering and regression algorithms and
is designed to interoperate with the numerical and scientific libraries.
We have used the Jupyter Notebook platform to write a program in Python to demonstrate the
approach that this paper suggests. This program can also be executed on the cloud using the
Google Colab platform, which supports all Python notebook files.
Detailed explanations about the modules with pseudocodes for their algorithms and output
graphs are given as follows:
It is an Unsupervised Outlier Detection algorithm. ‘Local Outlier Factor’ refers to the anomaly
score of each sample. It measures the local deviation of the sample data with respect to its
neighbors.
On plotting the results of Local Outlier Factor algorithm, we get the following figure:
By comparing the local density of a sample to that of its neighbors, one can identify samples that
have a substantially lower density than their neighbors.
These samples are anomalous and are considered outliers.
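The idea can be sketched with scikit-learn's LocalOutlierFactor. The data below is synthetic (a dense cloud plus three planted distant points), and the n_neighbors and contamination values are assumptions chosen for the illustration, not the settings used in this project.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))        # dense cloud
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])   # planted far away
X = np.vstack([inliers, outliers])  # planted points have indices 200, 201, 202

# contamination is the assumed share of outliers in the data.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)             # -1 = outlier, 1 = inlier
scores = lof.negative_outlier_factor_   # more negative = lower local density

flagged = np.where(labels == -1)[0]
print("flagged indices:", flagged)
print("scores of planted points:", scores[200:])
```

The planted points sit in regions of much lower density than their nearest neighbors, so they receive the most negative scores and are labeled -1.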
As the dataset is very large, we used only a fraction of it in our tests to reduce processing times.
The final result with the complete dataset processed is also determined and is given in the results
section of this paper.
1.3. FLOW DIAGRAM
Before continuing with our analysis, it is important not to forget that while the anonymized
features have been scaled and seem to be centered around zero, our time and amount features
have not. Not scaling them as well would result in certain machine learning algorithms that give
weights to features (logistic regression) or rely on a distance measure (KNN) performing much
worse. To avoid this issue, I standardized both the time and amount columns. Luckily, there are no
missing values and we, therefore, do not need to worry about missing value imputation.
Below is a brief overview of popular machine learning-based techniques for anomaly detection.
a. Density-Based Anomaly Detection
Density-based anomaly detection is based on the k-nearest neighbors algorithm.
k-NN is a simple, non-parametric lazy learning technique used to classify data based on
similarities in distance metrics such as Euclidean, Manhattan, Minkowski, or Hamming distance.
Relative density of data:
This is better known as local outlier factor (LOF). This concept is based on a distance metric
called reachability distance.
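For reference, the standard definitions behind the local outlier factor can be written as follows. The notation is an assumption introduced here and does not appear in the report: d(A, B) is the chosen distance metric, N_k(A) is the set of the k nearest neighbors of A, and k-distance(B) is the distance from B to its k-th nearest neighbor.

```latex
\begin{align*}
\text{reach-dist}_k(A, B) &= \max\bigl\{\, k\text{-distance}(B),\; d(A, B) \,\bigr\} \\[4pt]
\text{lrd}_k(A) &= \left( \frac{1}{|N_k(A)|} \sum_{B \in N_k(A)} \text{reach-dist}_k(A, B) \right)^{-1} \\[4pt]
\text{LOF}_k(A) &= \frac{1}{|N_k(A)|} \sum_{B \in N_k(A)} \frac{\text{lrd}_k(B)}{\text{lrd}_k(A)}
\end{align*}
```

A value of LOF close to 1 means the sample is about as dense as its neighbors, while a value well above 1 means its local density is much lower than theirs, which marks it as an outlier.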
b. Clustering-Based Anomaly Detection
Clustering is one of the most popular concepts in the domain of unsupervised learning.
Assumption: Data points that are similar tend to belong to similar groups or clusters, as
determined by their distance from local centroids.
K-means is a widely used clustering algorithm. It creates 'k' similar clusters of data points. Data
instances that fall outside of these groups could potentially be marked as anomalies.
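One simple way to turn this into code is to measure how far each point lies from its nearest centroid and flag the largest distances. The sketch below does this with scikit-learn's KMeans on synthetic data; the 99th-percentile cut-off is an assumed threshold for the illustration, and this technique was not part of the experiments in this report.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
cluster_b = rng.normal(loc=[6.0, 6.0], scale=0.5, size=(100, 2))
stray = np.array([[3.0, -4.0], [-5.0, 6.0]])   # planted far from both groups
X = np.vstack([cluster_a, cluster_b, stray])   # stray points have indices 200, 201

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() returns the distance of each point to every centroid;
# the minimum over the centroids is the distance to the nearest one.
dist_to_centroid = kmeans.transform(X).min(axis=1)

# Assumed cut-off: points beyond the 99th percentile of distances are anomalies.
threshold = np.percentile(dist_to_centroid, 99)
anomalies = np.where(dist_to_centroid > threshold)[0]
print("anomalous indices:", anomalies)
```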
c. Support Vector Machine-Based Anomaly Detection
A support vector machine is another effective technique for detecting anomalies.
An SVM is typically associated with supervised learning, but there are extensions
(OneClassSVM, for instance) that can be used to identify anomalies as an unsupervised problem
(in which training data are not labeled).
• The algorithm learns a soft boundary in order to cluster the normal data instances using
the training set, and then, using the testing instance, it tunes itself to identify the
abnormalities that fall outside the learned region.
Depending on the use case, the output of an anomaly detector could be numeric scalar values for
filtering on domain-specific thresholds or textual labels (such as binary/multi labels).
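Both kinds of output can be seen with scikit-learn's OneClassSVM, which learns a boundary around unlabeled training data. The sketch below uses synthetic data, and the kernel, gamma and nu values are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X_train = rng.normal(size=(300, 2))             # unlabeled, assumed mostly normal
X_test = np.array([[0.1, -0.2], [7.0, 7.0]])    # a typical point and a distant point

# nu is an upper bound on the fraction of training points left outside the boundary.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

# Numeric output: signed distance to the learned boundary (negative = outside).
scores = model.decision_function(X_test)
# Label output: 1 for points inside the learned region, -1 for anomalies.
labels = model.predict(X_test)

print("scores:", scores)
print("labels:", labels)
```

The numeric scores can be filtered against a domain-specific threshold, while the labels give the binary decision directly.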
In this Jupyter Notebook (or Google Colab) we take credit card fraud detection as
the case study for understanding this concept in detail, using the following anomaly detection
techniques:
• Density-Based Anomaly Detection (Local Outlier Factor) Algorithm
• Support Vector Machine Anomaly Detection Algorithm
• Isolation Forest Anomaly Detection Algorithm
• Logistic Regression
• Classification Trees
Now that we have processed our data, we can begin deploying our machine learning algorithms.
We will use the following techniques:
The anomaly score of each sample is called Local Outlier Factor. It measures the local deviation
of density of a given sample with respect to its neighbors. It is local in that the anomaly score
depends on how isolated the object is with respect to the surrounding neighborhood.
The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly
selecting a split value between the maximum and minimum values of the selected feature.
Since recursive partitioning can be represented by a tree structure, the number of splittings
required to isolate a sample is equivalent to the path length from the root node to the terminating
node.
This path length, averaged over a forest of such random trees, is a measure of normality and our
decision function.
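A minimal sketch of this procedure with scikit-learn's IsolationForest is shown below. The data is synthetic (a dense cloud plus two planted distant points), and the n_estimators and contamination values are assumptions for the illustration rather than the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
inliers = rng.normal(size=(500, 2))                 # dense cloud
outliers = np.array([[6.0, 6.0], [-7.0, 5.0]])      # planted far away
X = np.vstack([inliers, outliers])                  # planted points: indices 500, 501

# contamination is the assumed share of anomalies in the data.
forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X)

labels = forest.predict(X)             # -1 = anomaly, 1 = normal
scores = forest.decision_function(X)   # lower = shorter average path = more anomalous

flagged = np.where(labels == -1)[0]
print("flagged indices:", flagged)
print("scores of planted points:", scores[500:])
```

Points far from the bulk of the data are separated after only a few random splits, so their average path length is short and their score falls below zero.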
Fraud detection is a complex issue that requires a substantial amount of planning before
throwing machine learning algorithms at it. Nonetheless, it is also an application of data
science and machine learning for the good, which makes sure that the customer’s money is safe
and not easily tampered with.
Future work will include a comprehensive tuning of the Random Forest algorithm I talked about
earlier. Having a data set with non-anonymized features would make this particularly interesting
as outputting the feature importance would enable one to see what specific factors are most
important for detecting fraudulent transactions.
It is essential for credit card businesses to be able to recognize fraudulent credit card transactions
so that consumers are not charged for things they have not purchased. With the growing use of
credit cards for purchases, the risk of credit card fraud is rising significantly. In this project, an
analysis of credit card fraud identification was carried out on a publicly available dataset using
machine learning techniques such as Local Outlier Factor and Isolation Forest. The proposed
framework is implemented in Python. When analyzing the data set, Isolation Forest provided a
higher precision than the Local Outlier Factor algorithm.
REFERENCES:
1. John Richard D. Kho and Larry A. Vea, Credit Card Fraud Detection Based on Transaction
Behavior (2017), Proc. of the 2017 IEEE Region 10 Conference (TENCON), Malaysia,
November 5-8, 2017
2. L.J.P. van der Maaten and G.E. Hinton, Visualizing High-Dimensional Data
Using t-SNE (2014), Journal of Machine Learning Research
3. Machine Learning Group — ULB, Credit Card Fraud Detection (2018), Kaggle