
ABSTRACT

In the banking sector many people apply for loans, but a bank has limited assets that it can grant to only a limited number of applicants, so identifying the applicants to whom a loan can safely be granted is a difficult process. In this project we try to reduce this risk factor in selecting safe applicants, so as to save considerable bank effort and assets. This is done by mining the records of people to whom loans were granted previously; on the basis of these records/experiences, the machine is trained using the machine learning model that gives the most accurate results. The main objective of this project is to predict whether granting a loan to a particular person will be safe or not. The project is divided into four sections: (i) data collection, (ii) comparison of machine learning models on the collected data, (iii) training of the system on the most promising model, and (iv) testing.
1.INTRODUCTION
Distribution of loans is the core business of almost every bank, and the main portion of a bank's assets comes directly from the profit earned on the loans it distributes. In the banking environment, the prime objective is to invest assets where they are in safe hands. Many banks and large financial companies today approve loans only after a rigorous verification and validation process, but there is still no guarantee that the selected applicant is the deserving one out of all applicants. This method helps us determine whether an applicant is suitable or not. Loan prediction is very helpful for both the bank employee and the applicant. The goal of this paper is to provide a simple, fast and easy way to choose the right applicants, which can give the bank particular benefits. The applicant can learn within a fixed time limit whether his or her loan will be sanctioned, and the Loan Prediction System makes it possible to jump to a specific application so that it can be checked on a priority basis. This paper is intended specifically for the management authority of the bank or finance organization; no outside investor can alter the processing, and the entire prediction process is carried out privately. Reports can be sent to different departments of the bank against a particular Loan ID so that they can take appropriate action on the request, and this helps the other departments complete their formalities. Because of their high accuracy and their ability to express a statistical model in simple language, decision trees are widely used in the banking industry. Because government organizations closely monitor lending practices in many countries, executives need to be able to explain why one applicant was rejected for a loan while another was accepted. Such information is also useful to consumers trying to understand why their credit rating is unsatisfactory. Automated credit scoring systems are likely to be used to automatically accept or reject telephone and internet credit requests.

We introduce an effective prediction technique that helps the banker predict the credit risk of customers who have applied for a loan. The paper describes a prototype that organizations can use to make the correct decision to approve or reject a customer's loan request. We will also see how the model's results can be adjusted to minimize the errors that lead to financial loss for the institution. Banks play a vital role in the market economy, and the success or failure of such an organization largely depends on its ability to evaluate credit risk. Before granting a loan, the bank must decide whether the customer is a good or a bad risk. Predicting the customer's future status, i.e. whether the borrower will turn out to be good or bad, is a challenging task for any organization or bank. Loan defaulter prediction is basically a binary classification problem: the task is to classify a customer as good or bad with respect to creditworthiness for receiving a loan. However, developing such a model is very challenging because of the increasing demand for loans.
2.LITERATURE SURVEY

The purpose of a literature review is to provide a foundation of knowledge on the topic, to identify areas of prior scholarship so as to prevent duplication and give credit to other researchers, and to identify inconsistencies, gaps in research, conflicts in previous studies, and open questions left by other work. One author introduces a framework to effectively identify the Probability of Default of a bank loan applicant; the metrics derived from the predictions reveal the high accuracy and precision of the built model. Another proposed model is an effective prediction model for identifying the credible customers who have applied for a bank loan; a decision tree is applied to predict the attributes relevant to credibility, and this prototype can be used to decide whether or not to sanction a customer's loan request. A further model has been built using data from the banking sector to predict the status of loans. It uses three classification algorithms, namely J48, Bayes Net and Naïve Bayes, and is implemented and verified using Weka; the best algorithm, J48, was selected based on accuracy. An improved multi-dimensional risk prediction clustering algorithm has been implemented to determine bad loan applicants; in that work, primary and secondary levels of risk assessment are used, and association rules are integrated to avoid redundancy. In another study a decision tree model was used as the classifier and a genetic algorithm was used for feature selection, and the model was tested using Weka. Further work developed two data mining models for credit scoring that help in loan decision making for banks in Jordan; considering the rate of accuracy, the regression model was found to perform better than the radial basis function model. Some of the techniques used by the authors are listed below:

1. An exploratory data analysis for loan prediction based on the nature of the clients

This paper's main purpose is to classify and analyse the nature of loan applicants. The paper classifies the customers depending on certain parameters, and the classification is carried out using exploratory data analysis, a technique for analysing data sets that summarizes their main features with visual methods.

2. Loan prediction using Ensemble Technique

Here the author explains an ensemble method for loan prediction. This paper describes a prototype of a model that an organization can use to make the right decision to approve or reject customer loans. It provides an ensemble model for loan forecasting under various training algorithms using several parameters. The main purpose of this paper is to test model accuracy and develop a new model, called an ensemble model, that combines the outputs of three different models to predict applicants' loans.

3. Loan approval prediction based on machine learning approach

Here the author uses six machine learning classification models to predict applications: decision trees, random forest, support vector machine, linear systems, neural networks and AdaBoost. The main purpose of this paper is to provide a simple, immediate and quick way to select eligible applicants. The paper considers the banking scenario in which many people apply for loans but the bank has only limited funds, so loans can be granted to only a limited number of applicants.

4. Exploratory analysis on prediction of loan privilege for customers using random forest

Here the author discusses credit risk and loan prediction; the paper covers credit prediction and credit risk in detail. Bank success depends mainly on credit risk analysis, which plays a vital role in the banking domain. The paper uses the random forest method.

5. Prediction system for bank loan creditability

Here the author constructed an ensemble model. This work creates an ensemble model by combining three separate machine learning models. It describes a prototype of a model that organizations can use to decide correctly whether to approve or reject a customer's loan application. The application can help banks predict the future status of a loan and, depending on that status, take action in the initial days of the loan. Using this application, banks can reduce the number of bad loans and the resulting losses.

6. Loan sanctioning prediction system

Here the author uses a Naïve Bayes model to predict the sanctioning of loans. This paper proposes a loan sanction system based on certain attributes to determine whether or not a loan should be granted to a consumer. The system proposed for bankers helps them predict the credible customers who have applied for loans, improving the chances that their loans will be repaid on time.
3.SYSTEM ANALYSIS
3.1 EXISTING SYSTEM:

Machine learning implementation is a very complex part of data analytics. Working on data that deals with prediction, and writing the code to predict a customer's future outcome, is a challenging task.

Others have carried out loan prediction analysis using the Naïve Bayes classifier and the KNN classifier.

The Naïve Bayes classifier is a classification technique based on an assumption of independence among predictors. In simple terms, a Naïve Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

The main drawbacks of the Naïve Bayes classifier:

 The probabilities must be calculated explicitly.
 There is an error rate in the classification decision.
 It is very sensitive to the input data.
 It relies on the attribute-independence assumption, which rarely holds in practice.
 There is a chance of loss of accuracy.
 Data scarcity.

The KNN (K-Nearest Neighbours) classifier is one of the simplest algorithms used in machine learning for regression and classification problems. KNN algorithms store the training data and classify new data points based on similarity measures.

The main drawbacks of the KNN classifier:

 Doesn't work well with large data sets.
 Doesn't work well in high dimensions and needs feature scaling.
 Sensitive to noisy data and missing values.
 The KNN algorithm is a lazy learner.
 The test stage is expensive.
 There is no training stage; all the work is done during the test stage.

These are the main drawbacks of the algorithms in the existing system. In the proposed system we use other algorithms to overcome these problems.

Disadvantages of Existing System

 Complexity in analysing the data.
 Prediction is a challenging task when working with the model.
 The code is complex because multiple methods must be maintained.
 Library support is limited and unfamiliar.

3.2 PROPOSED SYSTEM
The proposed model focuses on predicting the creditworthiness of customers for loan repayment by analysing their behaviour. The input to the model is the collected customer behaviour data. Based on the output of the classifier, a decision can be made on whether to approve or reject the customer's request. Using different data analytics tools, loan prediction and its severity can be forecast. In this process it is required to train the data using different algorithms and then compare the user data with the trained data to predict the nature of the loan.

Python is well suited to data analytics and helps us analyse the data with better data science models. The Python libraries make the prediction for the loan data and report the results, considering all the properties of the customer.

In the proposed system, to overcome the problems of the existing system, we use Logistic Regression, Random Forest, Decision Tree and the XGBoost algorithm, which help us improve the accuracy rate.
Proposed Work

Fig 3.2: Proposed Work

Data Selection
The data collected for the mining process may contain missing values, noise or inconsistencies. A data mining process with high-quality data will produce efficient data mining results.

Data Pre-Processing

This is the most time-consuming phase of a data mining process. It deals with the preparation and transformation of the initial data set into the final data set. A short illustrative sketch is given below.
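
As a rough, hedged illustration of this stage, the following sketch assumes a hypothetical loan data set loaded with pandas; the file name and the choice of filling missing values with medians/modes and one-hot encoding categoricals are illustrative assumptions, not the project's actual pipeline.

# Minimal pre-processing sketch (assumed file name "loan_data.csv").
import pandas as pd

df = pd.read_csv("loan_data.csv")

# Fill missing numeric values with the column median (illustrative choice).
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())

# Fill missing categorical values with the most frequent value.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna(df[col].mode()[0])

# Convert categorical columns into numeric dummy variables.
df = pd.get_dummies(df, drop_first=True)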

Feature Selection and Building the Classification Model

The objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are known.

Prediction

The model is tested on the test data set by using the predict function. Prediction is used to estimate missing or unavailable numerical data values rather than class labels.

Evaluation

In the final stage, the designed system is tested with the test set and its performance is assessed; a small sketch of these two steps follows.
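
The sketch below is a minimal, hedged example of the prediction and evaluation steps with scikit-learn; X and y stand for the pre-processed feature matrix and the loan-status label produced earlier, whatever the real column names are.

# Split the data, fit a classifier, call predict() on the held-out test set
# and report accuracy. X and y are assumed to come from the pre-processing stage.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)      # the predict function mentioned above
print("Accuracy:", accuracy_score(y_test, y_pred))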

Proposed Algorithms

Logistic regression is very fast to implement and scales well with the number of predictors and data points. It directly models the probability of the outcome. Logistic regression is a parametric model, whereas KNN is non-parametric, and it supports only linear decision boundaries, unlike KNN.

A decision tree is a discriminative model, whereas Naïve Bayes is a generative model. It is more flexible and easy to use, and it works better with large amounts of data than the other algorithms. A decision tree is an eager learner that builds an explicit model during training, whereas KNN is a lazy learner that defers all computation to prediction time, so the tree can classify new records quickly once it is built. A comparison sketch of the proposed algorithms is given below.
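
One hedged way to compare the four proposed algorithms on the same data is cross-validation, sketched below; this assumes the xgboost package is installed alongside scikit-learn and that X and y are the prepared features and labels from the earlier stages.

# Illustrative comparison of the proposed classifiers via 5-fold cross-validation.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, "mean accuracy:", round(scores.mean(), 3))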

Advantages of Proposed System

 Libraries help to analyse the data.
 Statistical analysis and prediction are much easier compared with the existing technologies.
 Results are more accurate compared with the other methodologies.
3.3 SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS:

• System : Pentium IV 2.4 GHz.
• Hard Disk : 40 GB.
• Floppy Drive : 1.44 MB.
• Monitor : 15" VGA Colour.
• Mouse : Logitech.
• RAM : 512 MB.

SOFTWARE REQUIREMENTS:

• Operating System: Windows

• Coding Language: Python 3.7


3.4 SYSTEM STUDY

FEASIBILITY STUDY

The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are

 ECONOMICAL FEASIBILITY
 TECHNICAL FEASIBILITY
 SOCIAL FEASIBILITY

ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on the organization. The amount of funding that the company can pour into the research and development of the system is limited, and the expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available; only the customized products had to be purchased.

TECHNICAL FEASIBILITY

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, since this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required to implement this system.

SOCIAL FEASIBILITY

This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users depends solely on the methods employed to educate the user about the system and to make the user familiar with it. The user's level of confidence must be raised so that he or she is also able to offer constructive criticism, which is welcomed, as he or she is the final user of the system.
4.SYSTEM DESIGN

The System Design Document describes the system requirements, operating


environment, system and subsystem architecture, files and database design, input
formats, output layouts, human-machine interfaces, detailed design, processing logic,
and external interfaces.

This section describes the system in narrative form using non-technical terms. It
should provide a high-level system architecture diagram showing a subsystem breakout
of the system, if applicable. The high-level system architecture or subsystem diagrams
should, if applicable, show interfaces to external systems. Supply a high- level context
diagram for the system and subsystems, if applicable.

This section describes any constraints in the system design (referencing any trade-off analyses conducted, such as resource use versus productivity, or conflicts with other systems) and includes any assumptions made by the project team in developing the system design.

It also records the organization code and title of the key points of contact (and alternates if appropriate) for the information system development effort. These points of contact should include the Project Manager, System Proponent, User Organization, Quality Assurance (QA) Manager, Security Manager, and Configuration Manager, as appropriate.
SYSTEM ARCHITECTURE
System architecture is the conceptual model that defines the structure, behaviour and other views of a system. An architecture description is a formal description and representation of a system, organized in a way that supports reasoning about the structure and behaviour of the system. It consists of the system components and the sub-systems developed, which will work together to implement the overall system.

Fig 4.1 System Architecture

A system architecture diagram is used to show the relationship between different components. Usually such diagrams are created for systems that include both hardware and software, and these are represented in the diagram to show the interaction between them. It is a response to the conceptual and practical difficulties of describing and designing complex systems.
UML DIAGRAMS
UML stands for Unified Modelling Language. UML is a standardized general
purpose modelling language in the field of object-oriented software engineering. The
standard is managed, and was created by the Object Management Group.

UML (Unified Modelling Language) is a standard language for specifying, visualizing, constructing and documenting the artifacts of software systems. UML is a pictorial language used to make software blueprints. It can also be used to model non-software systems, such as the process flow in a manufacturing unit.

UML is not a programming language, but tools can be used to generate code in various languages from UML diagrams. UML has a direct relationship with object-oriented analysis and design, and it plays a fundamental role in describing the different viewpoints of a system. The UML is also a standard language for business modelling and other non-software systems, and it mostly uses graphical notations to express the design of software projects.

Diagrams play a very important role in UML. The kinds of modelling diagrams are as follows:

 Use Case Diagram
 Class Diagram
 Sequence Diagram
 Activity Diagram
 Component Diagram
 Collaboration Diagram
 State Chart Diagram
Use case Diagram
The use case diagram models the behaviour of the system. It contains the set of use cases, the participating actors and their relationships. The main purpose of a use case diagram is to show which system functions are performed for which actor, and the roles of the actors in the system can be depicted. These internal and external agents are known as actors. A use case diagram consists of actors, use cases and their relationships. The diagram is used to model the system/subsystem of an application, and a single use case diagram captures a particular functionality of a system.

Use case Diagram for Loan Prediction

Fig 4.2.1: Use Case Diagram

In the above diagram, the actors are the customer, the Loan Officer and the Admin. The customer passes the data to the system, which separates the data into blocks and gives the data to Python. Python then performs data cleaning, which consists of data correlation and data repair, and the results are stored. These results can be viewed using Python and can be stored on the server for future use.
Class Diagram
In software engineering, a class diagram in the Unified Modeling Language (UML)
is a type of static structure diagram that describes the structure of a system by showing
the system's classes, their attributes, operations (or methods), and the relationships
among objects.

The class diagram is the most commonly drawn diagram in UML. It addresses the static design view of the system and consolidates the set of classes, interfaces, collaborations and their relationships.

Class Diagram for Loan Prediction

Fig 4.2.2: Class Diagram

In the above class diagram, the dependencies between each of the classes are sketched out; in addition, the operations performed in each class are also shown.
Sequence Diagram
A sequence diagram simply depicts interaction between objects in a sequential
order i.e. the order in which these interactions take place. We can also use the terms
event diagrams or event scenarios to refer to a sequence diagram. Sequence diagrams
describe how and in what order the objects in a system function.

This is an interaction diagram that addresses the time ordering of messages. It includes a set of objects and the messages sent and received by instances of those objects, and it is used to represent the dynamic view of the system.

Sequence Diagram for Loan Prediction

Fig 4.2.3: Sequence Diagram

The sequence diagram above shows object interactions arranged in time sequence. Each object has a vertical dashed line, the lifeline, which represents the existence of the object over a period of time. The diagram also has tall, thin rectangles, called the focus of control, which indicate the period during which an object is performing an activity, either directly or through a subordinate operation.
Collaboration Diagram
Collaboration diagrams (known as Communication Diagrams in UML 2.x) are used to show how objects interact to perform the behaviour of a particular use case, or a part of a use case.

This is an interaction diagram that addresses the structural organization of the objects that send and receive messages. It includes a set of objects, the connectors that link those objects, and the messages sent and received by them. This diagram is used to represent the dynamic view of the framework.

Collaboration Diagram for Loan Prediction

Fig 4.2.4: Collaboration Diagram

The collaboration diagram above contains objects, links and sequence numbers. In the diagram there are, specifically, the customer, the Loan Officer and the Admin; these objects are connected to each other using links, and a sequence number shows the time ordering of a message.
State Chart Diagram
State chart diagram is one of the five UML diagrams used to model the dynamic
nature of a system. They define different states of an object during its lifetime and these
states are changed by events. State chart diagram describes the flow of control from one
state to another state.

State Chart Diagram for Loan Prediction

Fig 4.2.5: State Chart Diagram

The state chart diagram above contains two components, called states and transitions. States represent situations during the life of an object, and a state can easily be drawn in Smart Draw using a rectangle with rounded corners.
Component Diagram
A component diagram, also known as a UML component diagram, describes the
organization and wiring of the physical components in a system. In the first version of
UML, components included in these diagrams were physical documents, database table,
files, and executables, all physical elements with a location.

The important element of a component diagram is the component. This diagram shows the internal parts, connectors and ports that realize the component. When the component is instantiated, copies of its internal parts are also instantiated.

Component Diagram for Loan Prediction

Fig 4.2.6: Component Diagram

A component diagram is represented using components. A component is a physical building block of the system and is represented as a rectangle with tabs. The component diagram describes the internal processing of the project.
Deployment Diagram
A UML deployment diagram is a diagram that shows the configuration of run time
processing nodes and the components that live on them. A deployment diagram is a kind
of structure diagram used in modelling the physical aspects of an object-oriented system.

The fundamental element in a deployment diagram is the node. The set of nodes and their relationships with one another is represented using the deployment diagram. The deployment diagram is related to the component diagram: a node in a deployment diagram typically contains one or more components. This diagram is likewise important for representing the static view of the system.

Deployment Diagram for Loan Prediction

Fig 4.2.7: Deployment Diagram


DATA FLOW DIAGRAMS
A data flow diagram (DFD) is a graphical representation of the "flow" of data through an information system, modelling its process aspects. A DFD is often used as a preliminary step to create an overview of the system, which can later be elaborated. DFDs can also be used for the visualization of data processing. A DFD shows what kind of information will be input to and output from the system, where the data will come from and go to, and where the data will be stored. It does not show information about the timing of processes or about whether the processes will operate in sequence or in parallel.

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output data generated by the system. The data flow diagram (DFD) is one of the most important modelling tools. It is used to model the system components: the system processes, the data used by the processes, the external entities that interact with the system and the information flows in the system.

A DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts the information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. DFDs may be partitioned into levels that represent increasing information flow and functional detail.
Data Flow Diagram

Level 0: System Input/ Output Level

A level 0 DFD describes the system-wide boundaries, showing the input to and output from the system and the major processes. A level 0 DFD is also called a context diagram. It is a basic overview of the whole system or process being analysed or modelled, intended as an at-a-glance view that shows the system as a single high-level process together with its relationships to external entities.

In the above diagram, the customer exports the data to the loan officer. The loan officer maintains the data and stores the data on the server.

Level 1: Sub System Level Data Flow

The level 1 DFD shows the next level of detail, with the data flow between subsystems. It shows how the system is divided into sub-systems (processes), each of which handles one or more of the data flows to or from an external agent, and which together provide all of the functionality of the system as a whole.

In the above diagram, the customer exports the data to the loan officer. The loan officer maintains the data and stores the data on the server. The server installs the required libraries to perform the actions.
Level 2: File Level Detail Data Flow

Feasibility and risk analysis are related here in many ways. The level 2 DFD explains the fundamental level of detail about the system's working.

In the above diagram, the customer exports the data to the loan officer. The loan officer maintains the data and stores the data on the server. The server installs the required libraries in the system to perform the actions. The admin performs the analysis on the data.

Level 3:

Fig 4.3.1 Data Flow Diagram


In the above diagram, the customer exports the data to the loan officer. The loan officer maintains the data and stores the data on the server. The server installs the required libraries in the system to perform the actions. The admin performs the analysis on the data, generates the results and maintains the results on the server. The result is sent to the loan officer, and the officer sends a message to the customer stating whether the loan is approved or not.
4.4 IMPLEMENTATION:

Implementation is the stage of the project when the theoretical design is turned into a working system. It can thus be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective.

The implementation stage involves careful planning, investigation of the existing system and its constraints on implementation, design of methods to achieve the changeover, and evaluation of the changeover methods.

LIST OF MODULES
 Client Module
 Admin Module
 Data Analyst

Client Module
A client module is a network module that supports and
implements the client side of a Network Programming Interface (NPI).
A client module registers itself with the Network Module Registrar as a
Client of the NPI that it supports. A client module can register itself as
a client of more than one NPI.

The client module takes the data from the customer and sends the data to the Admin. The client gets the result of the analysis, which uses different types of algorithms to obtain an accuracy value, and sends that result to the customer over the network.

Admin Module
The Admin gets the data from the client, performs the analysis on the data with the given algorithms and generates the results. It stores the data for future reference and updates the data.

The Admin grants permissions to the user and controls the user permissions. The Admin provides security for the data so that there are no issues.

Data Analyst
The analyst requests the data from the admin, and the admin sends the appropriate data to the analyst. The analyst performs the analysis on the data and produces the results, which are sent back to the admin. The admin stores the data and results for future reference. The analyst's role includes interpreting data, analysing results using statistical techniques, and developing and implementing data analysis, data collection systems and other strategies that optimize statistical efficiency and quality.

Algorithm
Logistic Regression
Logistic Regression is a machine learning algorithm used for classification problems; it is a predictive analysis algorithm based on the concept of probability. Logistic regression can be thought of as a linear regression model with a more complex cost function: instead of a linear function it uses the 'sigmoid function', also known as the 'logistic function'.
Sigmoid Function
In order to map predicted values to probabilities, we use the Sigmoid function.
The function maps any real value into another value between 0 and 1. In machine learning,
we use sigmoid to map predictions to probabilities.
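
In symbols, the sigmoid is σ(z) = 1 / (1 + e^(−z)). A tiny, hedged Python sketch of the function is shown below.

# Sketch of the sigmoid (logistic) function: maps any real value into (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5 - the decision boundary
print(sigmoid(4))    # close to 1
print(sigmoid(-4))   # close to 0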

Pros and Cons of Logistic Regression:

Pros:

 Logistic Regression performs well when the dataset is linearly separable.
 Logistic regression is less prone to overfitting, but it can overfit in high-dimensional datasets. You should consider regularization (L1 and L2) techniques to avoid overfitting in these scenarios.
 Logistic Regression not only gives a measure of how relevant a predictor is (coefficient size), but also its direction of association (positive or negative).
 Logistic regression is easy to implement and interpret and very efficient to train.

Cons:

 The main limitation of Logistic Regression is the assumption of linearity between the dependent variable and the independent variables.
 If the number of observations is smaller than the number of features, Logistic Regression should not be used; otherwise it may lead to overfitting.
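
A minimal usage sketch with scikit-learn is shown below, assuming prepared X_train, y_train and X_test arrays from the earlier split; the L2 penalty and C value are illustrative defaults tying in with the regularization point above, not tuned settings.

# Hedged example: logistic regression with L2 regularization (scikit-learn's default).
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X_train, y_train)

# Coefficient size indicates relevance; its sign gives the direction of association.
print(clf.coef_)
print(clf.predict_proba(X_test)[:5])   # class probabilities via the sigmoid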

Decision Tree
Decision trees are a type of supervised machine learning (that is, the training data specifies what the input is and what the corresponding output should be) in which the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or final outcomes, and the decision nodes are where the data is split. They can be used to solve both regression and classification problems.

Decision trees use multiple algorithms to decide whether to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes; in other words, the purity of the node increases with respect to the target variable.

The decision tree algorithm falls under the category of supervised learning. It can be used to solve both regression and classification problems. A decision tree uses a tree representation to solve the problem, in which each leaf node corresponds to a class label and attributes are represented on the internal nodes of the tree.

Pros and Cons of Decision Tree:

Pros:

 Compared to other algorithms, decision trees require less effort for data preparation during pre-processing.
 A decision tree does not require normalization of the data.
 A decision tree does not require scaling of the data either.
 Missing values in the data do not affect the process of building a decision tree to any considerable extent.

Cons:

 A small change in the data can cause a large change in the structure of the decision tree, causing instability.
 For a decision tree, calculations can sometimes be far more complex than for other algorithms.
 Decision trees often take more time to train the model.
 Decision tree training is relatively expensive, as the complexity and time taken are greater.
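
A short, hedged scikit-learn sketch follows; the criterion and max_depth values are illustrative choices, not tuned for the project's data, and X_train/X_test are assumed from the earlier split.

# Illustrative decision tree: splits nodes so that purity with respect to the target increases.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))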
Random Forest
Random Forest is a supervised learning algorithm that is used for both classification and regression; however, it is mainly used for classification problems. Just as a forest is made up of trees, and more trees mean a more robust forest, the random forest algorithm creates decision trees on data samples, gets a prediction from each of them and finally selects the best solution by means of voting. It is an ensemble method that is better than a single decision tree because it reduces overfitting by averaging the results.

Pros and Cons of Random Forest:

Pros:

 It overcomes the problem of overfitting by averaging or combining the results of different decision trees.
 Random forests work well for a larger range of data items than a single decision tree does.
 A random forest has less variance than a single decision tree.
 Random forests are very flexible and possess very high accuracy.
 Scaling of the data is not required in the random forest algorithm; it maintains good accuracy even when given data without scaling.
 A random forest maintains good accuracy even when a large proportion of the data is missing.

Cons:

 Complexity is the main disadvantage of random forest algorithms.
 Construction of random forests is much harder and more time-consuming than of decision trees.
 More computational resources are required to implement the random forest algorithm.
 It is less intuitive when we have a large collection of decision trees.
 The prediction process using random forests is very time-consuming in comparison with other algorithms.
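
A brief, hedged usage sketch with scikit-learn; the number of trees is an illustrative value, and X_train/X_test are assumed from the earlier split.

# Illustrative random forest: an ensemble of decision trees whose votes are combined.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
print(forest.feature_importances_)   # relative importance of each feature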

Extreme Gradient Boost


XGBoost is an implementation of gradient-boosted decision trees. The library is written in C++. It is a software library designed primarily to improve speed and model performance, and it has recently been dominant in applied machine learning: XGBoost models feature prominently in many Kaggle competitions. Boosting fits a sequence of weak learners (predictors with only modest accuracy) on samples of the data, with each new learner concentrating on the errors of the earlier ones.

Pros and Cons of XGBoost:

Pros:

 Extremely fast (parallel computation).
 Highly efficient.
 Versatile (can be used for classification, regression or ranking).
 Can be used to extract variable importance.
 Does not require extensive feature engineering (missing-value imputation, scaling and normalization).

Cons:

 Works only with numeric features.
 Can lead to overfitting if the hyperparameters are not tuned properly.
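
A hedged usage sketch follows; it assumes the xgboost package is installed, and the hyperparameters shown are common starting points rather than tuned values, so they should be validated to avoid the overfitting noted above.

# Illustrative XGBoost classifier on the prepared training data.
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4,
                    eval_metric="logloss")
xgb.fit(X_train, y_train)
print("Test accuracy:", xgb.score(X_test, y_test))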
5.SOFTWARE ENVIRONMENT

What is Python :-
Below are some facts about Python.

Python is currently the most widely used multi-purpose, high-level programming language.

Python allows programming in object-oriented and procedural paradigms. Python programs are generally smaller than those written in other programming languages like Java. Programmers have to type relatively less, and the indentation requirements of the language keep the code readable.

Python is used by almost all tech-giant companies such as Google, Amazon, Facebook, Instagram, Dropbox, Uber, etc.

The biggest strength of Python is its huge collection of standard libraries, which can be used for the following –

 Machine Learning
 GUI Applications (like Kivy, Tkinter, PyQt etc. )
 Web frameworks like Django (used by YouTube, Instagram, Dropbox)
 Image processing (like Opencv, Pillow)
 Web scraping (like Scrapy, BeautifulSoup, Selenium)
 Test frameworks
 Multimedia

Advantages of Python :-

Let’s see how Python dominates over other languages.

1. Extensive Libraries

Python ships with an extensive library containing code for various purposes like regular expressions, documentation generation, unit testing, web browsers, threading, databases, CGI, email, image manipulation, and more. So we don't have to write the complete code for those tasks manually.

2. Extensible

As we have seen earlier, Python can be extended to other languages. You can write some
of your code in languages like C++ or C. This comes in handy, especially in projects.

3. Embeddable

Complimentary to extensibility, Python is embeddable as well. You can put your Python
code in your source code of a different language, like C++. This lets us add scripting
capabilities to our code in the other language.

4. Improved Productivity

The language's simplicity and extensive libraries make programmers more productive than languages like Java and C++ do. Also, you need to write less to get more done.

5. IOT Opportunities

Since Python forms the basis of new platforms like Raspberry Pi, it finds the future bright
for the Internet Of Things. This is a way to connect the language with the real world.

6. Simple and Easy

When working with Java, you may have to create a class to print ‘Hello World’. But in
Python, just a print statement will do. It is also quite easy to learn, understand, and code.
This is why when people pick up Python, they have a hard time adjusting to other more
verbose languages like Java.

7. Readable

Because it is not such a verbose language, reading Python is much like reading English.
This is the reason why it is so easy to learn, understand, and code. It also does not need
curly braces to define blocks, and indentation is mandatory. This further aids the
readability of the code.
8. Object-Oriented

This language supports both the procedural and object-oriented programming paradigms.


While functions help us with code reusability, classes and objects let us model the real
world. A class allows the encapsulation of data and functions into one.

9. Free and Open-Source

Like we said earlier, Python is freely available. But not only can you download
Python for free, but you can also download its source code, make changes to it, and even
distribute it. It downloads with an extensive collection of libraries to help you with your
tasks.

10. Portable

When you code your project in a language like C++, you may need to make some changes
to it if you want to run it on another platform. But it isn’t the same with Python. Here, you
need to code only once, and you can run it anywhere. This is called Write Once Run
Anywhere (WORA). However, you need to be careful enough not to include any system-
dependent features.

11. Interpreted

Lastly, we will say that it is an interpreted language. Since statements are executed one by
one, debugging is easier than in compiled languages.
Any doubts till now in the advantages of Python? Mention in the comment section.

Advantages of Python Over Other Languages

1. Less Coding

Almost every task done in Python requires less code than the same task done in other languages. Python also has awesome standard library support, so you don't have to search for third-party libraries to get your job done. This is the reason many people suggest that beginners learn Python.
2. Affordable

Python is free, so individuals, small companies and big organizations can leverage the freely available resources to build applications. Python is popular and widely used, so it gives you better community support.

The 2019 Github annual survey showed us that Python has overtaken Java in the most
popular programming language category.

3. Python is for Everyone

Python code can run on any machine whether it is Linux, Mac or Windows. Programmers
need to learn different languages for different jobs but with Python, you can professionally
build web apps, perform data analysis and machine learning, automate things, do web
scraping and also build games and powerful visualizations. It is an all-rounder programming
language.

Disadvantages of Python

So far, we’ve seen why Python is a great choice for your project. But if you choose it, you
should be aware of its consequences as well. Let’s now see the downsides of choosing
Python over another language.

1. Speed Limitations

We have seen that Python code is executed line by line. But since Python is interpreted, it
often results in slow execution. This, however, isn’t a problem unless speed is a focal point
for the project. In other words, unless high speed is a requirement, the benefits offered by
Python are enough to distract us from its speed limitations.

2. Weak in Mobile Computing and Browsers

While it serves as an excellent server-side language, Python is rarely seen on the client side. Besides that, it is rarely used to implement smartphone-based applications; one such application is called Carbonnelle. The reason it is not popular despite the existence of Brython is that it isn't very secure.
3. Design Restrictions

As you know, Python is dynamically-typed. This means that you don’t need to declare the
type of variable while writing the code. It uses duck-typing. But wait, what’s that? Well, it
just means that if it looks like a duck, it must be a duck. While this is easy on the
programmers during coding, it can raise run-time errors.

4. Underdeveloped Database Access Layers

Compared to more widely used technologies like JDBC (Java DataBase


Connectivity) and ODBC (Open DataBase Connectivity), Python’s database access layers
are a bit underdeveloped. Consequently, it is less often applied in huge enterprises.

5. Simple

No, we’re not kidding. Python’s simplicity can indeed be a problem. Take my example. I
don’t do Java, I’m more of a Python person. To me, its syntax is so simple that the verbosity
of Java code seems unnecessary.

This was all about the Advantages and Disadvantages of Python Programming Language.

History of Python : -

What do the alphabet and the programming language Python have in common? Right, both
start with ABC. If we are talking about ABC in the Python context, it's clear that the
programming language ABC is meant. ABC is a general-purpose programming language
and programming environment, which had been developed in the Netherlands, Amsterdam,
at the CWI (Centrum Wiskunde & Informatica). The greatest achievement of ABC was to influence the design of Python. Python was conceptualized in the late 1980s, when Guido van Rossum was working at the CWI on a project called Amoeba, a distributed operating system. In an interview with Bill Venners, Guido van Rossum said: "In the early 1980s, I worked as an implementer on a team building a language called ABC at Centrum voor Wiskunde en Informatica (CWI). I don't know how well people know ABC's influence on Python. I try to mention ABC's influence because I'm indebted to everything I learned during that project and to the people who worked on it." Later on in the same interview,
Guido van Rossum continued: "I remembered all my experience and some of my frustration
with ABC. I decided to try to design a simple scripting language that possessed some of
ABC's better properties, but without its problems. So I started typing. I created a simple
virtual machine, a simple parser, and a simple runtime. I made my own version of the
various ABC parts that I liked. I created a basic syntax, used indentation for statement
grouping instead of curly braces or begin-end blocks, and developed a small number of
powerful data types: a hash table (or dictionary, as we call it), a list, strings, and numbers."

What is Machine Learning : -

Before we take a look at the details of various machine learning methods, let's start by
looking at what machine learning is, and what it isn't. Machine learning is often categorized
as a subfield of artificial intelligence, but I find that categorization can often be misleading
at first brush. The study of machine learning certainly arose from research in this context,
but in the data science application of machine learning methods, it's more helpful to think of
machine learning as a means of building models of data.

Fundamentally, machine learning involves building mathematical models to help understand


data. "Learning" enters the fray when we give these models tunable parameters that can be
adapted to observed data; in this way the program can be considered to be "learning" from
the data. Once these models have been fit to previously seen data, they can be used to predict
and understand aspects of newly observed data. I'll leave to the reader the more
philosophical digression regarding the extent to which this type of mathematical, model-
based "learning" is similar to the "learning" exhibited by the human brain.Understanding the
problem setting in machine learning is essential to using these tools effectively, and so we
will start with some broad categorizations of the types of approaches we'll discuss here.

Categories Of Machine Leaning :-

At the most fundamental level, machine learning can be categorized into two main types:
supervised learning and unsupervised learning.

Supervised learning involves somehow modeling the relationship between measured


features of data and some label associated with the data; once this model is determined, it
can be used to apply labels to new, unknown data. This is further subdivided
into classification tasks and regression tasks: in classification, the labels are discrete
categories, while in regression, the labels are continuous quantities. We will see examples of
both types of supervised learning in the following section.

Unsupervised learning involves modeling the features of a dataset without reference to any


label, and is often described as "letting the dataset speak for itself." These models include
tasks such as clustering and dimensionality reduction. Clustering algorithms identify distinct
groups of data, while dimensionality reduction algorithms search for more succinct
representations of the data. We will see examples of both types of unsupervised learning in
the following section.
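
As a tiny, hedged illustration of the two categories, the sketch below fits a supervised classifier and an unsupervised clustering model on the toy iris data set bundled with scikit-learn; it is only meant to show the contrast, not anything specific to loan data.

# Toy contrast between supervised and unsupervised learning.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: learn the mapping from features to the known labels.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))

# Unsupervised: group the same features without ever seeing the labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:3])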

Need for Machine Learning

Human beings are, at this moment, the most intelligent and advanced species on earth because they can think, evaluate and solve complex problems. AI, on the other hand, is still in its initial stage and has not surpassed human intelligence in many aspects. The question, then, is why we need to make machines learn. The most suitable reason for doing this is "to make decisions, based on data, with efficiency and scale".

Lately, organizations have been investing heavily in newer technologies like Artificial Intelligence, Machine Learning and Deep Learning to extract the key information from data in order to perform several real-world tasks and solve problems. We can call these data-driven decisions taken by machines, particularly to automate processes. These data-driven decisions can be used, instead of programmed logic, in problems that cannot be programmed inherently. The fact is that we cannot do without human intelligence, but the other aspect is that we all need to solve real-world problems with efficiency at a huge scale. That is why the need for machine learning arises.

Challenges in Machines Learning :-

While Machine Learning is rapidly evolving, making significant strides with cybersecurity and autonomous cars, this segment of AI as a whole still has a long way to go. The reason is that ML has not yet been able to overcome a number of challenges. The challenges that ML currently faces are −

Quality of data − Having good-quality data for ML algorithms is one of the biggest challenges. The use of low-quality data leads to problems related to data preprocessing and feature extraction.

Time-consuming tasks − Another challenge faced by ML models is the time consumed, especially for data acquisition, feature extraction and retrieval.

Lack of specialists − As ML technology is still in its infancy, the availability of expert resources is limited.

No clear objective for formulating business problems − Having no clear objective and well-defined goal for business problems is another key challenge for ML, because this technology is not that mature yet.

Issue of overfitting and underfitting − If the model is overfitting or underfitting, it cannot represent the problem well.

Curse of dimensionality − Another challenge ML models face is having too many features in the data points. This can be a real hindrance.

Difficulty in deployment − The complexity of an ML model can make it quite difficult to deploy in real life.

Applications of Machines Learning :-

Machine Learning is the most rapidly growing technology, and according to researchers we are in the golden years of AI and ML. It is used to solve many complex real-world problems which cannot be solved with a traditional approach. The following are some real-world applications of ML −

 Emotion analysis

 Sentiment analysis

 Error detection and prevention

 Weather forecasting and prediction

 Stock market analysis and forecasting

 Speech synthesis
 Speech recognition

 Customer segmentation

 Object recognition

 Fraud detection

 Fraud prevention

 Recommendation of products to customer in online shopping

How to Start Learning Machine Learning?

Arthur Samuel coined the term “Machine Learning” in 1959 and defined it as a “Field of
study that gives computers the capability to learn without being explicitly
programmed”.
And that was the beginning of Machine Learning! In modern times, Machine Learning is one
of the most popular (if not the most!) career choices. According to Indeed, Machine Learning
Engineer Is The Best Job of 2019 with a 344% growth and an average base salary
of $146,085 per year.
But there is still a lot of doubt about what exactly Machine Learning is and how to start learning it. So this article deals with the basics of Machine Learning and also the path you can follow to eventually become a full-fledged Machine Learning Engineer. Now let's get started!

How to start learning ML?

This is a rough roadmap you can follow on your way to becoming an insanely talented
Machine Learning Engineer. Of course, you can always modify the steps according to your
needs to reach your desired end-goal!

Step 1 – Understand the Prerequisites

In case you are a genius, you could start ML directly but normally, there are some
prerequisites that you need to know which include Linear Algebra, Multivariate Calculus,
Statistics, and Python. And if you don’t know these, never fear! You don’t need a Ph.D.
degree in these topics to get started but you do need a basic understanding.

(a) Learn Linear Algebra and Multivariate Calculus

Both Linear Algebra and Multivariate Calculus are important in Machine Learning. However,
the extent to which you need them depends on your role as a data scientist. If you are more
focused on application heavy machine learning, then you will not be that heavily focused on
maths as there are many common libraries available. But if you want to focus on R&D in
Machine Learning, then mastery of Linear Algebra and Multivariate Calculus is very
important as you will have to implement many ML algorithms from scratch.

(b) Learn Statistics

Data plays a huge role in Machine Learning. In fact, around 80% of your time as an ML
expert will be spent collecting and cleaning data. And statistics is a field that handles the
collection, analysis, and presentation of data. So it is no surprise that you need to learn it!!!
Some of the key concepts in statistics that are important are Statistical Significance, Probability Distributions, Hypothesis Testing, Regression, etc. Bayesian Thinking is also a very important part of ML, dealing with various concepts like Conditional Probability, Priors and Posteriors, Maximum Likelihood, etc.

(c) Learn Python

Some people prefer to skip Linear Algebra, Multivariate Calculus and Statistics and learn
them as they go along with trial and error. But the one thing that you absolutely cannot skip
is Python! While there are other languages you can use for Machine Learning, like R, Scala, etc., Python is currently the most popular language for ML. In fact, there are many Python libraries that are specifically useful for Artificial Intelligence and Machine Learning, such as Keras, TensorFlow, Scikit-learn, etc.
So if you want to learn ML, it’s best if you learn Python! You can do that using various
online resources and courses such as Fork Python available Free on GeeksforGeeks.
Step 2 – Learn Various ML Concepts

Now that you are done with the prerequisites, you can move on to actually learning ML
(Which is the fun part!!!) It’s best to start with the basics and then move on to the more
complicated stuff. Some of the basic concepts in ML are:

(a) Terminologies of Machine Learning

 Model – A model is a specific representation learned from data by applying some machine
learning algorithm. A model is also called a hypothesis.
 Feature – A feature is an individual measurable property of the data. A set of numeric
features can be conveniently described by a feature vector. Feature vectors are fed as input to
the model. For example, in order to predict a fruit, there may be features like color, smell,
taste, etc.
 Target (Label) – A target variable or label is the value to be predicted by our model. For the
fruit example discussed in the feature section, the label with each set of input would be the
name of the fruit like apple, orange, banana, etc.
 Training – The idea is to give a set of inputs (features) and their expected outputs (labels), so
after training, we will have a model (hypothesis) that will then map new data to one of the
categories it was trained on.
 Prediction – Once our model is ready, it can be fed a set of inputs to which it will provide a
predicted output(label).
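To make these terms concrete, here is a minimal, purely illustrative sketch in Python: each row of X is a feature vector, y holds the labels (targets), calling fit() is training, and the fitted estimator is the model used for prediction. The feature names and values are invented for the example.

# Feature vectors (X), labels (y), training (fit) and prediction (predict).
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature vectors: [applicant_income, loan_amount, credit_history]
X = [[5000, 130, 1],
     [2300, 150, 0],
     [6100, 100, 1],
     [1800, 200, 0]]
y = ["approve", "reject", "approve", "reject"]   # labels (targets)

model = DecisionTreeClassifier()         # the hypothesis / model to be learned
model.fit(X, y)                          # training
print(model.predict([[4200, 120, 1]]))   # prediction on new, unseen data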

(b) Types of Machine Learning

 Supervised Learning – This involves learning from a training dataset with labeled data using
classification and regression models. This learning process continues until the required level
of performance is achieved.
 Unsupervised Learning – This involves using unlabelled data and then finding the
underlying structure in the data in order to learn more and more about the data itself using
factor and cluster analysis models.
 Semi-supervised Learning – This involves using unlabelled data like Unsupervised Learning
with a small amount of labeled data. Using labeled data vastly increases the learning accuracy
and is also more cost-effective than Supervised Learning.
 Reinforcement Learning – This involves learning optimal actions through trial and error. So
the next action is decided by learning behaviors that are based on the current state and that will
maximize the reward in the future.
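The supervised case was sketched above with labelled data; as a contrast, the short snippet below is a purely illustrative example of unsupervised learning, where KMeans discovers two groups of applicants from unlabelled income and loan-amount values.

# Unsupervised learning: KMeans finds structure in unlabelled data.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2500, 100], [2700, 110], [8000, 300],
              [8200, 320], [2600, 105], [7900, 310]])   # no labels at all

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment discovered from the data
print(kmeans.cluster_centers_)   # the underlying structure (two income groups)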
Advantages of Machine learning :-

1. Easily identifies trends and patterns -

Machine Learning can review large volumes of data and discover specific trends and patterns
that would not be apparent to humans. For instance, for an e-commerce website like Amazon, it
serves to understand the browsing behaviors and purchase histories of its users to help cater to
the right products, deals, and reminders relevant to them. It uses the results to reveal relevant
advertisements to them.

2. No human intervention needed (automation)

With ML, you don’t need to babysit your project every step of the way. Since it means giving
machines the ability to learn, it lets them make predictions and also improve the algorithms on
their own. A common example of this is anti-virus software; it learns to filter new threats as
they are recognized. ML is also good at recognizing spam.

3. Continuous Improvement

As ML algorithms gain experience, they keep improving in accuracy and efficiency. This lets
them make better decisions. Say you need to make a weather forecast model. As the amount of
data you have keeps growing, your algorithms learn to make more accurate predictions faster.

4. Handling multi-dimensional and multi-variety data

Machine Learning algorithms are good at handling data that are multi-dimensional and multi-
variety, and they can do this in dynamic or uncertain environments.
5. Wide Applications

You could be an e-tailer or a healthcare provider and make ML work for you. Where it does
apply, it holds the capability to help deliver a much more personal experience to customers
while also targeting the right customers.

Disadvantages of Machine Learning :-

1. Data Acquisition

Machine Learning requires massive data sets to train on, and these should be
inclusive/unbiased, and of good quality. There can also be times where they must wait for new
data to be generated.

2. Time and Resources

ML needs enough time to let the algorithms learn and develop enough to fulfill their purpose
with a considerable amount of accuracy and relevancy. It also needs massive resources to
function. This can mean additional requirements of computer power for you.

3. Interpretation of Results

Another major challenge is the ability to accurately interpret results generated by the
algorithms. You must also carefully choose the algorithms for your purpose.

4. High error-susceptibility

Machine Learning is autonomous but highly susceptible to errors. Suppose you train an
algorithm with data sets small enough to not be inclusive. You end up with biased predictions
coming from a biased training set. This leads to irrelevant advertisements being displayed to
customers. In the case of ML, such blunders can set off a chain of errors that can go undetected
for long periods of time. And when they do get noticed, it takes quite some time to recognize
the source of the issue, and even longer to correct it.

Python Development Steps : -


Guido van Rossum published the first version of Python code (version 0.9.0) at alt.sources in
February 1991. This release already included exception handling, functions, and the core data
types list, dict, str and others. It was also object-oriented and had a module system.
Python version 1.0 was released in January 1994. The major new features included in this
release were the functional programming tools lambda, map, filter and reduce, which Guido
van Rossum never liked. Six and a half years later, in October 2000, Python 2.0 was
introduced. This release included list comprehensions, a full garbage collector and support
for Unicode. Python flourished for another 8 years in the versions 2.x before the next
major release, Python 3.0 (also known as "Python 3000" and "Py3K"), came out. Python
3 is not backwards compatible with Python 2.x. The emphasis in Python 3 has been on the
removal of duplicate programming constructs and modules, thus fulfilling or coming close to
fulfilling the 13th law of the Zen of Python: "There should be one -- and preferably only one --
obvious way to do it." Some changes in Python 3.0:

 Print is now a function


 Views and iterators instead of lists
 The rules for ordering comparisons have been simplified. E.g. a heterogeneous list cannot be
sorted, because all the elements of a list must be comparable to each other.
 There is only one integer type left, i.e. int. long is int as well.
 The division of two integers returns a float instead of an integer. "//" can be used to have the
"old" behaviour.
 Text Vs. Data Instead Of Unicode Vs. 8-bit
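A couple of these changes can be seen directly in a short snippet:

# Two of the Python 3 changes listed above, shown directly.
print("print is now a function")   # print(...) rather than a print statement

print(7 / 2)            # 3.5 -> dividing two integers returns a float
print(7 // 2)           # 3   -> "//" keeps the old floor-division behaviour
print(type(2 ** 100))   # <class 'int'> -> only one integer type remains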


Python

Python is an interpreted high-level programming language for general-purpose


programming. Created by Guido van Rossum and first released in 1991, Python has a
design philosophy that emphasizes code readability, notably using significant whitespace.
Python features a dynamic type system and automatic memory management. It supports
multiple programming paradigms, including object-oriented, imperative, functional and
procedural, and has a large and comprehensive standard library.

 Python is Interpreted − Python is processed at runtime by the interpreter. You do not need to
compile your program before executing it. This is similar to PERL and PHP.
 Python is Interactive − you can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.
Python also acknowledges that speed of development is important. Readable and terse
code is part of this, and so is access to powerful constructs that avoid tedious repetition of
code. Maintainability also ties into this: code size may be an all but useless metric, but it
does say something about how much code you have to scan, read and/or understand to
troubleshoot problems or tweak behaviors. This speed of development, the ease with
which a programmer of other languages can pick up basic Python skills, and the huge
standard library are key to another area where Python excels. All its tools have been quick to
implement, have saved a lot of time, and several of them have later been patched and updated
by people with no Python background, without breaking.

Modules Used in Project :-

Tensorflow

TensorFlow is a free and open-source software library for dataflow and differentiable


programming across a range of tasks. It is a symbolic math library, and is also used
for machine learning applications such as neural networks. It is used for both research and
production at Google.‍

TensorFlow was developed by the Google Brain team for internal Google use. It was
released under the Apache 2.0 open-source license on November 9, 2015.
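The snippet below is a minimal tf.keras sketch (illustrative only; it is not the model used in this project) showing how a small neural network for binary classification on tabular data is defined, compiled and trained.

# Minimal tf.keras sketch: a small binary classifier on toy tabular data.
import numpy as np
import tensorflow as tf

X = np.random.rand(100, 5).astype("float32")       # 100 samples, 5 features
y = (X[:, 0] + X[:, 1] > 1.0).astype("float32")    # toy binary target

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(5,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
print(model.evaluate(X, y, verbose=0))              # [loss, accuracy]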

Numpy

Numpy is a general-purpose array-processing package. It provides a high-performance


multidimensional array object, and tools for working with these arrays.
It is the fundamental package for scientific computing with Python. It contains various
features including these important ones:

 A powerful N-dimensional array object


 Sophisticated (broadcasting) functions
 Tools for integrating C/C++ and Fortran code
 Useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, Numpy can also be used as an efficient multi-
dimensional container of generic data. Arbitrary data-types can be defined using Numpy
which allows Numpy to seamlessly and speedily integrate with a wide variety of databases.
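The short example below illustrates the features listed above: the N-dimensional array object, broadcasting, linear algebra routines and random number generation.

# N-dimensional arrays, broadcasting, linear algebra, random numbers.
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])           # 2-D array object
b = np.array([10.0, 20.0])

print(a + b)                                     # broadcasting across rows
print(a @ a)                                     # matrix multiplication
print(np.linalg.inv(a))                          # linear algebra: inverse
print(np.random.default_rng(0).normal(size=3))   # random number capabilities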

Pandas

Pandas is an open-source Python Library providing high-performance data manipulation and


analysis tool using its powerful data structures. Prior to Pandas, Python was majorly used for
data munging and preparation; it had very little contribution towards data analysis. Pandas
solved this problem. Using Pandas, we can accomplish five typical steps in the processing and
analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and
analyze. Python with Pandas is used in a wide range of academic and commercial domains,
including finance, economics, statistics, analytics, etc.
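A brief sketch of those steps on loan-style data is shown below; the file name and column names are assumptions made for illustration and may differ from the actual data set.

# Load / prepare / manipulate / analyze with Pandas (names are illustrative).
import pandas as pd

df = pd.read_csv("train.csv")                                           # load
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())   # prepare
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]     # manipulate
print(df.groupby("Loan_Status")["TotalIncome"].mean())                  # analyze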

Matplotlib

Matplotlib is a Python 2D plotting library which produces publication quality figures in a


variety of hardcopy formats and interactive environments across platforms. Matplotlib can
be used in Python scripts, the Python and IPython shells, the Jupyter Notebook, web
application servers, and four graphical user interface toolkits. Matplotlib tries to make easy
things easy and hard things possible. You can generate plots, histograms, power spectra, bar
charts, error charts, scatter plots, etc., with just a few lines of code. For examples, see
the sample plots and thumbnail gallery.

For simple plotting the pyplot module provides a MATLAB-like interface, particularly when


combined with IPython. For the power user, you have full control of line styles, font
properties, axes properties, etc, via an object oriented interface or via a set of functions
familiar to MATLAB users.
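As a small illustration of the pyplot interface, the sketch below draws a histogram and a bar chart in a few lines; the data values are synthetic and used only for demonstration.

# A few lines of pyplot: a histogram and a bar chart (synthetic values only).
import numpy as np
import matplotlib.pyplot as plt

income = np.random.default_rng(0).lognormal(8.5, 0.5, 500)  # synthetic incomes

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(income, bins=30)
ax1.set_title("Applicant income (synthetic)")
ax2.bar(["Approved", "Rejected"], [420, 190])               # illustrative counts
ax2.set_title("Loan status counts (illustrative)")
plt.tight_layout()
plt.show()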

Scikit – learn
Scikit-learn provides a range of supervised and unsupervised learning algorithms via a
consistent interface in Python. It is licensed under a permissive simplified BSD license and is
distributed under many Linux distributions, encouraging academic and commercial use.
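The sketch below illustrates this consistent interface on a synthetic data set: every estimator exposes fit() and predict(), so one model can be swapped for another with a single line.

# Every scikit-learn estimator follows the same fit/predict pattern.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # any estimator works here
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))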

Install Python Step-by-Step in Windows and Mac :

Python, a versatile programming language, doesn't come pre-installed on your computer.
Python was first released in 1991 and is still a very popular high-level programming
language today. Its design philosophy emphasizes code readability with its notable use of
significant whitespace.
The object-oriented approach and language constructs provided by Python enable
programmers to write both clear and logical code for projects. This software does not come
pre-packaged with Windows.

How to Install Python on Windows and Mac :

There have been several updates to Python over the years. The question is: how do you
install Python? It might be confusing for a beginner who is willing to start learning Python,
but this tutorial will solve your query. At the time of writing, the latest version of Python is
3.7.4, in other words, Python 3.
Note: Python version 3.7.4 cannot be used on Windows XP or earlier devices.

Before you start with the installation process of Python, you first need to know your
system requirements. Based on your system type, i.e. operating system and processor, you
must download the appropriate Python version. My system type is a Windows 64-bit
operating system, so the steps below are to install Python version 3.7.4 on a Windows 7 device,
or in other words to install Python 3. Download the Python Cheatsheet here. The steps on how
to install Python on Windows 10, 8 and 7 are divided into 4 parts to help understand better.

Download the Correct version into the system

Step 1: Go to the official site to download and install python using Google Chrome or any
other web browser. OR Click on the following link: https://www.python.org
Now, check for the latest and the correct version for your operating system.

Step 2: Click on the Download Tab.

Step 3: You can either select the yellow Download Python 3.7.4 for Windows button, or you
can scroll further down and click on the download link corresponding to your version. Here,
we are downloading the most recent Python version for Windows, 3.7.4.
Step 4: Scroll down the page until you find the Files option.

Step 5: Here you see a different version of python along with the operating system.

• To download Windows 32-bit python, you can select any one from the three options:
Windows x86 embeddable zip file, Windows x86 executable installer or Windows x86 web-
based installer.
•To download Windows 64-bit python, you can select any one from the three options:
Windows x86-64 embeddable zip file, Windows x86-64 executable installer or Windows x86-
64 web-based installer.
Here we will install the Windows x86-64 web-based installer. With this, the first part, regarding
which version of Python is to be downloaded, is completed. Now we move ahead with the
second part of installing Python, i.e. the installation itself.
Note: To know the changes or updates that are made in the version you can click on the
Release Note Option.
Installation of Python
Step 1: Go to Download and Open the downloaded python version to carry out the installation
process.

Step 2: Before you click on Install Now, make sure to put a tick on Add Python 3.7 to PATH.
Step 3: Click on Install Now. After the installation is successful, click on Close.

With the above three steps of Python installation, you have successfully and correctly
installed Python. Now it is time to verify the installation.
Note: The installation process might take a couple of minutes.

Verify the Python Installation


Step 1: Click on Start
Step 2: In the Windows Run Command, type “cmd”.
Step 3: Open the Command prompt option.
Step 4: Let us test whether Python is correctly installed. Type python -V and press Enter.

Step 5: You will get the answer as Python 3.7.4


Note: If you have any of the earlier versions of Python already installed, you must first
uninstall the earlier version and then install the new one.

Check how the Python IDLE works


Step 1: Click on Start
Step 2: In the Windows Run command, type “python idle”.
Step 3: Click on IDLE (Python 3.7 64-bit) and launch the program
Step 4: To go ahead with working in IDLE you must first save the file. Click on File > Click
on Save

Step 5: Name the file and save as type should be Python files. Click on SAVE. Here I have
named the files as Hey World.
Step 6: Now, for example, enter a print statement and run it to confirm that IDLE works.
6.SYSTEM TEST

The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality of
components, sub-assemblies, assemblies and/or a finished product. It is the process of exercising
software with the intent of ensuring that the software system meets its requirements and user
expectations and does not fail in an unacceptable manner. There are various types of tests. Each
test type addresses a specific testing requirement.

TYPES OF TESTS

Unit testing
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly, and that program inputs produce valid outputs. All
decision branches and internal code flow should be validated. It is the testing of individual
software units of the application; it is done after the completion of an individual unit, before
integration. This is structural testing that relies on knowledge of the unit's construction and is
invasive. Unit tests perform basic tests at component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path of a business
process performs accurately to the documented specifications and contains clearly defined inputs
and expected results.
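As a small illustration (the helper function total_income is hypothetical and not taken from the project code), a unit test written with Python's unittest module could look like this:

# An illustrative unit test; total_income is a hypothetical preprocessing helper.
import unittest

def total_income(applicant_income, coapplicant_income):
    """Hypothetical helper that combines the two income fields."""
    return applicant_income + coapplicant_income

class TestTotalIncome(unittest.TestCase):
    def test_valid_inputs_produce_valid_output(self):
        self.assertEqual(total_income(5000, 1500), 6500)

    def test_zero_coapplicant_income(self):
        self.assertEqual(total_income(3200, 0), 3200)

if __name__ == "__main__":
    unittest.main()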
Integration testing
Integration tests are designed to test integrated software components to
determine if they actually run as one program. Testing is event driven and is more concerned
with the basic outcome of screens or fields. Integration tests demonstrate that although the
components were individually satisfactory, as shown by successful unit testing, the
combination of components is correct and consistent. Integration testing is specifically aimed at
exposing the problems that arise from the combination of components.

Functional test
Functional tests provide systematic demonstrations that functions tested are
available as specified by the business and technical requirements, system documentation, and
user manuals.
Functional testing is centered on the following items:

Valid Input : identified classes of valid input must be accepted.

Invalid Input : identified classes of invalid input must be rejected.

Functions : identified functions must be exercised.

Output : identified classes of application outputs must be exercised.

Systems/Procedures : interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements, key
functions, or special test cases. In addition, systematic coverage pertaining to identifying business
process flows, data fields, predefined processes, and successive processes must be considered
for testing. Before functional testing is complete, additional tests are identified and the effective
value of current tests is determined.

System Test
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results. An example of
system testing is the configuration oriented system integration test. System testing is based on
process descriptions and flows, emphasizing pre-driven process links and integration points.
White Box Testing
White Box Testing is a testing in which the software tester has
knowledge of the inner workings, structure and language of the software, or at least its purpose.
It is used to test areas that cannot be reached from a black-box level.

Black Box Testing


Black Box Testing is testing the software without any knowledge of the inner
workings, structure or language of the module being tested. Black box tests, like most other kinds
of tests, must be written from a definitive source document, such as a specification or
requirements document. It is a testing in which the software under test is treated as a black box:
you cannot "see" into it. The test provides inputs and responds to outputs without considering
how the software works.
Unit Testing

Unit testing is usually conducted as part of a combined code and unit test phase
of the software lifecycle, although it is not uncommon for coding and unit testing to be
conducted as two distinct phases.

Test strategy and approach

Field testing will be performed manually and functional tests will be written in
detail.
Test objectives
 All field entries must work properly.
 Pages must be activated from the identified link.
 The entry screen, messages and responses must not be delayed.

Features to be tested
 Verify that the entries are of the correct format
 No duplicate entries should be allowed
 All links should take the user to the correct page.
Integration Testing
Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to produce failures caused by interface
defects.

The task of the integration test is to check that components or software applications, e.g.
components in a software system or, one step up, software applications at the company level,
interact without error.

Test Results: All the test cases mentioned above passed successfully. No defects encountered.

Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation
by the end user. It also ensures that the system meets the functional requirements.

Test Results: All the test cases mentioned above passed successfully. No defects encountered.
TEST CASES
Test cases include a set of steps, conditions and inputs that can be used while performing
testing tasks. The main intention of this activity is to determine whether a product passes or
fails in terms of functionality and other aspects. The process of creating test cases can also
help find problems in the requirements or design of an application. A test case acts as the
starting point for test execution; after applying a set of input values, the application has a
definitive outcome and leaves the system at some end point, also known as the execution
post-condition.
7.SCREENSHOTS

1. INDEPENDENT VARIABLE

Independent Variable

In the above figure, independent variables are the input for the process that is
being analysed. The independent variables are Gender, Marital Status, Self-Employed, Credit
History and Education.
2. BIVARIATE ANALYSIS

Bivariate Analysis

In the above figure, having looked at every variable individually in the univariate analysis, we
now explore them again with respect to the target variable in the bivariate analysis.
3. CO-APPLICANT STATUS

Co-Applicant Income Status

In the above figure, the co-applicant income and loan amount variables are analysed in a
similar manner.
4. CONFUSION MATRIX STATUS

Fig 6.4: Confusion Matrix Status

In the above figure, the model is evaluated with a confusion matrix. The
confusion matrix shows how many loan approvals were predicted and how many of those
predictions were correct.
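A sketch of how such a confusion matrix can be computed with scikit-learn is shown below; the y_test and predictions values are placeholders, since the actual arrays come from the trained model.

# Sketch: confusion matrix for loan-status predictions (placeholder values).
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_test      = ["Y", "Y", "N", "Y", "N", "N", "Y", "Y"]   # placeholder true labels
predictions = ["Y", "Y", "N", "N", "N", "Y", "Y", "Y"]   # placeholder predictions

cm = confusion_matrix(y_test, predictions, labels=["Y", "N"])
ConfusionMatrixDisplay(cm, display_labels=["Approved", "Rejected"]).plot()
plt.show()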
5. CREDIT HISTORY STATUS

Credit History Status

In the above figure, the customers' credit history is analysed.


6. DEPENDENT STATUS

Dependent Status

In the above figure, the customers' number of dependents is analysed, along with whether
the customer pays on time or not.
7. EDUCATION STATUS

Fig 6.7: Education Status

In the above figure, it shows how many customers are graduates and how many are
non-graduates.

8. APPLICANT INCOME STATUS

Applicant Income Status


In the above figure, it evaluates whether the customer income is low, average, high, or very
high.
9. INDEPENDENT VARIABLE (NUMERICAL)

Fig 6.9: Independent Variable (Numerical)

In the above figure, it shows whether the applicant income is normally distributed
or not.

10. INDEPENDENT VARIABLE (ORDINAL)

Independent Variable (Ordinal)

In the above figure, it shows the distribution of the ordinal independent variables.
11. MARRIED STATUS

Married Status

In the above figure, it shows how many customers are married and how many are single.
12. MATRIX STATUS

Matrix Status

In the above figure, it shows the relationships among the overall customer income, loan
amount, credit history and loan status.
13. PROPERTY AREA STATUS

Property Area Status

In the above figure, it shows whether the customer's property area is rural, semi-urban
or urban.
14. SELF EMPLOYED STATUS

Self Employed Status

In the above figure, it shows whether the customer is self-employed or not.
15. TOTAL INCOME

Total Income (Test, Train)

In the above figure, it shows the overall customer income in the test data set and the
train data set.
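A hedged sketch of how such a combined income feature might be derived for both data sets is shown below; the column names are assumptions based on the variables described above, not confirmed by the project code.

# Hedged sketch: build a combined income feature in both train and test sets.
# File and column names are assumptions made for illustration.
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

for df in (train, test):
    df["Total_Income"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
    df["Total_Income_log"] = np.log1p(df["Total_Income"])  # reduce right skew

train["Total_Income_log"].hist(bins=30)  # distribution per data set, as in the figure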
16. FEATURE IMPORTANCE STATUS

Feature Importance Status

In the above figure, it shows the relative importance of each feature used by the model.
17. LOGISTIC REGRESSIONS

Accuracy of Logistic Regression

In the above figure, it shows the loan prediction accuracy obtained using logistic
regression.
18. DECISION TREE

Accuracy of Decision Tree

In the above figure, it shows the loan prediction accuracy obtained using the decision tree.
19. RANDOM FOREST

Accuracy of Random Forest

In the above figure, it shows the loan prediction accuracy obtained using random forest.
20. XGBOOST

Accuracy of XG BOOST

In the above figure, it shows the loan prediction accuracy obtained using XGBoost.
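A sketch of how the four accuracies shown in the figures above could be computed is given below. It uses a synthetic data set as a stand-in; in the project, X and y would be the encoded loan features and the loan-status target.

# Illustrative comparison of the four models named above. A synthetic data set
# stands in for the encoded loan features and loan-status target.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "XGBoost": XGBClassifier(n_estimators=100, eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(x_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(x_test)))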
8.CONCLUSION
The main purpose of the project is to classify and analyze the nature of the loan applicants.
From a proper analysis of the data set and the constraints of the banking sector, seven different
graphs were generated and visualized. From the graphs, many conclusions were drawn and
information was inferred, such as that short-term loans were preferred by the majority of the
loan applicants and that clients mainly apply for loans for debt consolidation. This work can be
extended to a higher level in the future, with a predictive model for loans that uses machine
learning algorithms, where the results from each graph can be taken as individual criteria for
the machine learning algorithm.

9 FUTURE ENHANCEMENTS
In future enhancements, we can analyze the data by using various other types of
algorithms. This project work can be extended to a higher level in the future, with a predictive
model for loans that uses machine learning algorithms, where the results from each graph can
be taken as individual criteria for the machine learning algorithm. In the upcoming years, such
models can be used in building knowledge management platforms for customer service that
improve first-call resolution, average handling time, and customer satisfaction rates, and, in
finance, in the prediction of future outcomes and the assignment of probabilities to those
results. This will definitely help open up efficient delivery channels for the banking industry.
It is also important to implement other techniques that outperform popular data mining models
and to test them on this domain.

