PFDA-sample

The document outlines a group assignment for a Programming for Data Analysis course at Asia Pacific University, focusing on analyzing credit risk variables using a dataset of 6000 observations. The assignment includes data preparation, analysis of relationships between various credit factors, and individual contributions from group members on specific objectives. The overall goal is to provide actionable insights for stakeholders regarding credit risk management through advanced data analytics techniques using R programming.


CT127-3-2-PFDA Programming for Data Analysis Asia Pacific University

GROUP ASSIGNMENT
TECHNOLOGY PARK MALAYSIA
CT127-3-2-PFDA
INTAKE CODE: APU2F2409CS(AI) / APD2F2409CS (AI)
MODULE NAME: PROGRAMMING FOR DATA ANALYSIS
LECTURER NAME: DR. KULOTHUNKAN A/L PALASUNDRAM
HAND-OUT DATE: 30th SEPTEMBER 2024
HAND-IN DATE: 1st DECEMBER 2024

STUDENT DETAILS:

Student Name TP Number


Lim Wen Yi (Group Leader) TP067930
Keith Lo Ze Hui TP067653
Muhammad Hadi TP077049
Jaeden Loong Deng Ze TP068347

Table of Contents
1.0 – Introduction
1.1 – Data Description
2.0 – Data Preparation
2.1 – Data import
2.2 – Data Cleaning
2.3 – Data Validation
3.0 – Data Analysis (Individual)
3.1 – Objective 1: To analyze the relationship between credit risk variables such as installment commitment, credit amount, age, loan duration, and credit classification (Muhammad Hadi – TP077049)
3.1.1 – Descriptive Analysis
3.1.2 – Exploratory Data Analysis (Charts Summary)
3.1.3 – Literature Review
3.1.4 – Hypothesis
3.1.5 – Conclusion
3.2 – Objective 2: To Assess the Effect of Loan Amount and Installment Commitment on Credit Class (Jaeden Loong Deng Ze - TP068347)
3.2.1 – Descriptive Analysis
3.2.2 – Exploratory Data Analysis (Charts Summary)
3.2.3 – Literature Review
3.2.4 – Hypothesis
3.2.5 – Hypothesis Testing (Logistics Regression Analysis)
3.3 – Objective 3: To Investigate the effects of different credit histories on a person’s credit classification (Keith Lo Ze Hui - TP067653)
3.3.1 – Analysis 1
3.3.2 – Exploratory Data Analysis (Charts Summary)
3.3.3 – Literature Review
3.3.4 – Hypothesis
3.3.5 – Analysis 2 (Logistics Regression)
3.3.6 – Conclusion
3.4 – Objective 4: Assess the Effect of Higher Savings on Credit Class (Lim Wen Yi - TP067930)
3.4.1 – Exploratory Data Analysis (Charts Summary)
3.4.2 – Literature Review
3.4.3 – Hypothesis
3.4.4 – Analysis 1: Chi-square Test of Independence
3.4.5 – Analysis 2: Ordinal Logistic Regression
4.0 – Group Hypothesis
4.1 – State Your Complex Group Hypothesis Here
4.2 – Test Your Hypothesis
4.2.1 – Load the needed library and read the dataset
4.2.2 – Convert the classes for grouping in terms of factors
4.2.3 – First, we must plot the Credit Amount
4.2.4 – Next, we have a plot for the Duration
4.2.5 – Moreover, we would then plot the Credit Amount by Classification
4.2.6 – Furthermore, we must plot the Duration according to Classification
4.2.7 – Besides, we should also plot the Instalment Commitment
4.2.8 – After, we must plot for Credit Amount, specifically via the threshold highlighted
4.2.9 – We Must Plot a Graph for the Savings Status
4.2.10 – Lastly, We Must Plot a Graph for the Credit History
4.3 – Interpret the Result
4.3.1 – QQ Plot of Credit Amount
4.3.2 – Scatter Plot of Duration
4.3.3 – Box Plot of Credit Amount by Class
4.3.4 – Distribution of Installment Commitment
4.3.5 – Distribution of Credit Amount by Credit Classification
4.3.6 – Savings Status Categories by Credit Classification
4.3.7 – Credit History Categories by Credit Classification
4.4 – Conclusion
5.0 – Overall Conclusion
5.1 – Overall Discussion on the Findings from All Objectives
5.2 – Recommendation
5.3 – Limitations and Future Direction
5.3.1 – Limitations
5.3.2 – Future Direction
5.4 – Word Count
6.0 – Workload Matrix
7.0 – References

1.0 – Introduction
Managing credit risk is one of the most important factors in the stability of financial
institutions, particularly in banking and finance. As more customer data becomes available,
including customers' financial behavior and demographics, data analytics has become a vital
tool for assessing credit risk. This assignment applies several data analytics techniques to
the provided dataset, which contains customer information such as employment, gender, loan
duration, marital status, loan purpose, and others. By examining these data points, the main
objective is to identify the crucial factors that distinguish credit risk classes and to provide
actionable insights that support stakeholders' decisions.

Moreover, data manipulation, exploration, transformation, and visualization are used
throughout the assignment, all carried out in R and RStudio. The dataset is first pre-processed
and filtered to ensure accuracy. We then apply several data analytics techniques to improve the
effectiveness of the models used to classify credit risk. By examining the relationships
between the dependent and independent variables, the assignment aims to uncover patterns
that can guide borrowing strategies and reduce potential risks. The insights gained from the
analysis support recommendations to stakeholders for better decision-making and management
of credit risk.

1.1 – Data Description

The dataset used in this study consists of 6000 observations on 22 variables related to the
credit information of bank clients. Examples include age, occupation, type of property, and
credit characteristics such as credit length, credit amount, and payment frequency.

Key variables in the dataset are described below:

• checking_status: Status of the customer’s checking account.
• duration: The duration of loans in months.
• credit_history: Type of credit history, such as timely payments, delays, or critical accounts.
• purpose: The purpose for taking credit.
• credit_amount: The total amount of credit taken by the customer.
• savings_status: Status of savings account and bond based on the amount of savings.
• employment: Duration of employment (in years).
• installment_commitment: The installment rate as a percentage of disposable income.
• personal_status: Customer's marital status and gender.
• existing_credits: The number of existing credits the customer has.
• age: Age of the customer.
• class: The final credit classification, which can either be "good" or "bad."

2.0 – Data Preparation

2.1 – Data import

The dataset contains 6000 entries covering 22 variables that describe the characteristics and
credit behavior of bank clients. Age, loan duration, installment commitment, credit amount, and
credit class are among the variables included.

Data is imported using the built-in read.csv() function. It imports numeric columns as numeric
by default, converts string values to factors when stringsAsFactors = TRUE is set, and treats
the first column as row names with row.names = 1.

The read_csv() function from the readr package (part of the tidyverse) might be faster, at the
cost of these features, but the dataset is too small for any noticeable difference.
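
The import call described above can be sketched as follows. This is a minimal illustration, not the group's actual script: the temporary file and its three sample rows are hypothetical stand-ins for the assignment's CSV, and only the read.csv() arguments named in the text are used.

```r
# Hypothetical sample file standing in for the assignment's CSV.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"","checking_status","duration","credit_amount","class"',
  '"1","<0",6,1169,"good"',
  '"2","0<=X<200",48,5951,"bad"',
  '"3","no checking",12,2096,"good"'
), csv_path)

credit <- read.csv(
  csv_path,
  stringsAsFactors = TRUE,  # read text columns as factors
  row.names = 1             # use the first column as row names
)

# Inspect the resulting column types.
str(credit)
```

Numeric columns need no extra argument here; read.csv() detects them on its own.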

2.2 – Data Cleaning

We cleaned the data using the following steps:

First, we check for duplicate rows using duplicated(), then remove them using distinct(). There
are 4600 duplicate rows in this dataset.

Next, we check for columns with missing data. The only column with missing data is
other_payment_plans, with only 3.75% of its values missing. As this is below the recommended
threshold of 5%, we felt confident removing the rows with missing data instead of imputing
them, since removal would not affect the final analysis much. We then trimmed the whitespace
in the data as a precaution, although it turned out not to be needed.
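
The three cleaning steps above could look roughly like this. The data frame is synthetic and for illustration only. The sketch uses base R so that it runs without extra packages, so `!duplicated()` stands in for the dplyr distinct() call mentioned in the text.

```r
# Synthetic rows for illustration only; not the assignment's data.
credit <- data.frame(
  duration            = c(6, 6, 48, 12, 24),
  purpose             = c(" radio/tv", " radio/tv", "education ", "furniture", "new car"),
  other_payment_plans = c("none", "none", "bank", NA, "stores"),
  stringsAsFactors    = FALSE
)

# 1. Count and drop exact duplicate rows
#    (dplyr::distinct(credit) would keep the same rows).
n_dupes <- sum(duplicated(credit))
credit  <- credit[!duplicated(credit), ]

# 2. Share of missing values per column, then drop incomplete rows.
missing_share <- colMeans(is.na(credit))
credit <- credit[complete.cases(credit), ]

# 3. Trim leading and trailing whitespace in character columns.
is_chr <- vapply(credit, is.character, logical(1))
credit[is_chr] <- lapply(credit[is_chr], trimws)
```

Computing `missing_share` before dropping rows makes it possible to compare each column against a threshold such as the 5% mentioned above.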

Finally, we converted ranged values to ordered factors. Columns such as checking_status
contain data in ranges, like "<100". Importing columns as the correct data types is extremely
important. Two examples illustrate this:

• If the data is imported through read_csv(), installment_commitment is imported as
character and has to be converted manually to numeric. A numeric type is required for
quantitative calculations, including correlations, regressions, and more.

• class, which records whether a customer is in the "good" or "bad" credit class, was
converted to a factor. This preserves the categorical data for classification analyses such
as logistic regression.
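
A short sketch of these type conversions follows. The range labels and their order are hypothetical (the dataset's actual labels may differ), and the variable names `savings_raw`, `savings_levels`, and `credit_class` are introduced only for this illustration.

```r
# Hypothetical range labels, listed from lowest to highest.
savings_raw    <- c("<100", "500<=X<1000", "100<=X<500", ">=1000", "<100")
savings_levels <- c("<100", "100<=X<500", "500<=X<1000", ">=1000")

# Ranged values become an ordered factor with an explicit level order.
savings_status <- factor(savings_raw, levels = savings_levels, ordered = TRUE)

# Ordered factors support comparisons that follow the level order.
at_least_100 <- savings_status >= "100<=X<500"

# A column read as text must be converted before numeric work.
installment_commitment <- as.numeric(c("4", "2", "3", "1"))

# The outcome is kept as a plain (unordered) factor for classification.
credit_class <- factor(c("good", "bad", "good", "good"), levels = c("good", "bad"))
```

Stating the levels explicitly matters because R otherwise sorts labels alphabetically, which rarely matches the numeric order of the ranges.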

2.3 – Data Validation

We identified and treated outliers using an IQR-based approach. We chose the 10th and 90th
percentiles as cut-offs because the values in the data vary widely. Even then, most flagged
values in some columns appeared to be false positives, so we manually selected only the
duration and credit_amount columns to be capped.
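
The exact capping rule is not shown in the text, so the following is only a sketch under one assumption: values below the 10th percentile or above the 90th percentile are replaced by those percentile values. The helper name `cap_percentiles` and the data are hypothetical.

```r
# Cap a numeric vector at its lower and upper percentiles (assumed rule).
cap_percentiles <- function(x, lower = 0.10, upper = 0.90) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, bounds[1]), bounds[2])
}

# Synthetic values for illustration only.
credit <- data.frame(
  duration      = c(6, 12, 12, 18, 24, 24, 36, 48, 60, 72),
  credit_amount = c(250, 900, 1200, 1500, 2100, 2800, 3500, 5200, 9000, 18000)
)

# Only the manually selected columns are capped.
cols_to_cap <- c("duration", "credit_amount")
credit[cols_to_cap] <- lapply(credit[cols_to_cap], cap_percentiles)
```

Capping keeps every row, unlike deleting outliers, which is useful after the duplicate and missing-value steps have already reduced the row count.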

Subsequently, we manually inspected the structure of the imported data to verify that all
necessary columns were present. The str() and head() functions were used to check the
attribute names and the data types of the variables, and to preview the data. This step rules
out potential issues during later processing or analysis, such as fields holding values of the
wrong type or missing columns.
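
The manual inspection can be complemented by a few programmatic checks. The sketch below is not part of the original workflow: the data frame is synthetic, and the `required_cols` list is a hypothetical subset of the dataset's columns.

```r
# Synthetic stand-in for the cleaned dataset.
credit <- data.frame(
  duration      = c(6, 48, 12),
  credit_amount = c(1169, 5951, 2096),
  class         = factor(c("good", "bad", "good"))
)

str(credit)    # column names and types
head(credit)   # first rows

# Programmatic versions of the same checks; stopifnot() raises an
# error if any condition is not TRUE.
required_cols <- c("duration", "credit_amount", "class")
missing_cols  <- setdiff(required_cols, names(credit))
stopifnot(
  length(missing_cols) == 0,
  is.numeric(credit$duration),
  is.numeric(credit$credit_amount),
  is.factor(credit$class)
)
```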

3.0 – Data Analysis (Individual)

3.1 – Objective 1: To analyze the relationship between credit risk variables
such as installment commitment, credit amount, age, loan duration, and credit
classification (Muhammad Hadi – TP077049)

3.1.1 – Descriptive Analysis

Analysis 1.1; Correlation Between Installment Commitment and Credit Amount

• Question: What relationship exists between a customer's credit amount and their
installment commitment?

This analysis sought to establish the connection between a customer’s installment commitment,
which is the proportion of disposable income used to repay the loan, and the amount of credit
taken. We were interested in the strength and direction of the relationship between these two
variables, so we estimated it. The findings were conclusive: the higher a customer's
installment commitment, the greater the credit amount requested. This might be expected,
since customers with larger loans usually have larger installment obligations because their
repayments claim a higher share of their earnings.

Analysis 1.2; Logistic Regression for Credit Classification:

• Question: What is the impact of installment commitment on credit classification?

In the model, credit class showed a positive association with installment commitment.
Installment commitment was found to be significant, with higher installment commitments
associated with the "bad" credit classification. This makes sense intuitively: customers are
more likely to default if they spend a significant share of their income on loan repayments, as
this indicates poor financial planning.

Analysis 1.3; Linear Regression for Installment Commitment and Credit Amount:

• Question: What is the relationship between installment commitment and other
characteristics like age, loan term, and current credits?

The examination of the data revealed that customers tend to be classified as having good
credit if their installment commitments are low, and as having bad credit if their commitments
are high. This pattern is consistent with the Customer Profile theory, which assumes that
borrowers who commit a comparatively smaller percentage of their income to their loans are
more creditworthy when it comes to repayment.

Analysis 1.4; Distribution of Credit Class by Installment Commitment:

• Question: What is the distribution of credit classification among the various levels of
installment commitment?

The histogram gave an effective view of how customers are distributed across the two credit
classes. It showed that customers could be divided into two groups based on their credit:
those with good credit were more likely to have smaller installment commitments, while those
with bad credit were more likely to have higher installment commitments. The next analysis
focuses on installment commitment and customer age: we wanted to see whether older
customers had smaller or larger payment obligations than younger customers.

Extra Feature Analysis 2.1; Correlation Between Age and Installment Commitment:

Given that the relationship between age and installment commitment was weakly negative, it
may be assumed that older customers have somewhat smaller installment commitments than
younger ones. This might be because older clients have fewer installments relative to their
income, since they do not have significant debts such as mortgages.

A line plot was used to illustrate this relationship. In the plot, installment commitment
tended to be somewhat lower as customer age increased. Although the correlation is not
significant, it suggests that older customers have comparatively lower installment
obligations.

Extra Feature Analysis 2.2; Loan Duration vs Installment Commitment:

In this analysis, we examined the relationship between installment commitment and loan
term, that is, the number of months needed to repay the loan. The aim was to determine
whether clients with longer loan terms committed to higher installments.

The results indicated a somewhat positive correlation between the length of the loan and the
installment commitment, implying that borrowers with longer loan terms had higher
installment commitments. This could be the case if longer-term loans carry higher monthly
payments, which means a bigger portion of income must be set aside for loan payments.

To depict this relationship, a line plot was created. The plot showed a progressive upward
tendency, with installment commitment rising faster as the loan length increases. Considering
the loan conditions is therefore crucial when evaluating credit risk, since the findings
indicate that loan length has a significant impact on a customer's financial obligations.

Extra Feature Analysis 2.3; Existing Credits vs Installment Commitment:

This analysis considered installment commitment in relation to the number of credits a client
already holds. The aim was to show how a customer's other obligations affect the installments
on the present loan.

According to the analysis, installment commitment increases with the number of existing
credits. Individuals with more existing credits have larger installment payments, most likely
because they must pay back several loans that take up a large portion of their income.

A line plot was used to show the relationship between existing credits and installment
commitment. The line rose, suggesting that customers with more existing credits commit a
larger share of their income to installments. This bears on a customer's capacity to take on
new credit while servicing multiple loans, and therefore on credit accessibility.

3.1.2 – Exploratory Data Analysis (Charts Summary)

Analysis 1.1:

Correlation between Installment Commitment and Credit Amount:

The first chart is a scatter plot with “Installment Commitment” on the x-axis and “Credit
Amount” on the y-axis. The blue regression line has a weak negative slope; in other words, as
“Installment Commitment” increases, “Credit Amount” tends to decrease. However, the points
are widely dispersed, so the connection between the two variables can be considered
relatively weak.

Analysis 1.2:

Logistic regression of credit class based on installment commitment:

The second chart is a curve that shares the x-axis with the first chart: it shows “Installment
Commitment” on the x-axis and the “Probability of Bad Credit” on the y-axis. The red logistic
regression line indicates that as “Installment Commitment” rises, the probability of “Bad
Credit” falls slightly. This means there is a very small negative association between the two.
The small coefficient suggests that “Installment Commitment” is not a strong driver of “Bad
Credit” on its own.

Analysis 1.3:

Regression of Credit amount regarding Installment commitment:

The third graph is a scatter plot showing “Credit Amount” on the y-axis against “Installment
Commitment” on the x-axis, with the regression line drawn in green. The slope of the
regression line is negative, meaning that as “Installment Commitment” rises, “Credit Amount”
declines slightly. However, the wide spread of the points suggests large variability, implying
that other factors may also affect “Credit Amount”.

Analysis 1.4:

Installment Commitment by Credit Class broken down by good and bad credit
classification:

The histogram shows distinct patterns in the link between installment commitment and credit
class. Installment commitments between 0 and 2.5 are, as a group, more often classified as
“bad” credit, so financial pressure or lower financial capability may be the reason for this
categorization. On the other hand, installment commitments greater than 5.0 are more often
associated with a “good” credit rating, suggesting that a better credit rating could be linked
to higher financial obligations. The chart also shows that the “bad” credit class is more
concentrated in the middle of the range, around 2.5 to 5.0, while the “good” credit class is
more concentrated in the higher ranges.

This distribution offers useful information on installment payments and their effect on credit
behavior. More work is required to establish whether low commitments are due to financial
constraints. It may also be valuable to test other variables, using income, loan, and payment
data, to strengthen our understanding of these trends. Finally, comparing summary measures
such as the mean, median, and variance of installment commitment for the “good” and “bad”
credit categories could give a much clearer picture of how installment commitment differs
between the two categories.
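
The group-wise comparison of summary measures suggested above could be computed along these lines. The values are synthetic and chosen only to make the example small, so they say nothing about the actual dataset.

```r
# Synthetic values for illustration only.
credit <- data.frame(
  installment_commitment = c(1, 2, 4, 4, 3, 2, 4, 1),
  class = factor(c("good", "good", "bad", "bad", "good", "bad", "good", "good"))
)

# Mean, median, and variance of installment commitment per credit class.
summary_by_class <- aggregate(
  installment_commitment ~ class,
  data = credit,
  FUN  = function(x) c(mean = mean(x), median = median(x), variance = var(x))
)
summary_by_class
```

The result has one row per credit class, with the three measures side by side, which makes differences between the "good" and "bad" groups easy to read off.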

3.1.3 – Literature Review

(Goh, M., & Lee, P. 2020)

Link to study:
https://www.researchgate.net/publication/383198419_Effective_Credit_Risk_Prediction_using_Ensemble_Classifiers_with_Model_Explanation

3.1.4 – Hypothesis

Hypothesis 1; Relationship between “installment commitment” and “loan volume”:

Null Hypothesis (H₀): There is no significant relationship between installment commitment
and the volume of credit demanded. This is the initial hypothesis, which the analysis will
attempt to reject statistically.

Alternative Hypothesis (H₁): Greater installment commitments (a greater percentage of
income devoted to loan repayments) are related to higher credit volume.

Test Hypothesis:

To determine how strongly "Installment Commitment" and "Credit Amount" are related, we
use the Pearson correlation coefficient. This helps determine the strength and direction of
the linear relationship between the two variables. Furthermore, regression analysis is used to
assess the slope and the significance of the relationship. To determine whether the
association is statistically significant, the p-value of the regression coefficient is computed.
The association between "installment commitment" and "loan amount" is statistically
significant if the p-value is less than 0.05. This implies that changes in "installment
commitment" are associated with changes in "loan amount."
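
A minimal sketch of this test procedure is shown below. The data are simulated with arbitrary parameters, so the printed numbers are not the assignment's results. Only the base R functions cor.test() and lm() are used, and the object names `ct`, `fit`, `slope_p`, and `reject_h0` are chosen for this illustration.

```r
set.seed(1)  # synthetic data, for illustration only
n <- 200
installment_commitment <- sample(1:4, n, replace = TRUE)
credit_amount <- 4000 - 300 * installment_commitment + rnorm(n, sd = 2500)

# Pearson correlation: strength and direction of the linear relationship.
ct <- cor.test(installment_commitment, credit_amount, method = "pearson")
ct$estimate   # correlation coefficient r
ct$p.value    # p-value for H0: r = 0

# Simple linear regression: slope and its p-value.
fit <- lm(credit_amount ~ installment_commitment)
slope_p <- summary(fit)$coefficients["installment_commitment", "Pr(>|t|)"]

# Decision rule at the 5% significance level.
alpha <- 0.05
reject_h0 <- slope_p < alpha
```

With a single predictor, the p-value of the slope equals the p-value of the Pearson correlation test, so the two checks always lead to the same decision.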

Interpret the Results:

Correlation Analysis;

A negative correlation between "installment commitment" and "loan amount" is indicated by the
chart's gentle downward trend. The wide spread of data points raises the possibility that other
variables affect the "loan amount."

If the p-value of the correlation analysis is higher than 0.05, we cannot reject the null
hypothesis, which indicates that there is no statistically significant association between the
variables.

Hypothesis 2; The relationship between “installment commitment” and the possibility of
bad debt:

Null Hypothesis (H₀): There is no relationship between installment commitment and credit
classification. The significance level of the logistic regression determines whether this
hypothesis is rejected.

Alternative Hypothesis (H₁): Installment commitment is a predictor of credit classification:
as the level of installment commitment grows, a person is more likely to be classified as “bad
credit”.

Test Hypothesis:

The "probability of bad debt" and "installment commitment" are analysed using a logistic
regression model. This analysis determines whether changes in "installment commitment" are
linked to the probability of being classified as bad debt. The strength and significance of
the link are evaluated by looking at the p-value and confidence interval of the regression
coefficient. There is a statistically significant association between "installment
commitment" and "probability of bad debt" if the p-value of the logistic regression
coefficient is less than 0.05. This suggests that "installment commitment" influences the
likelihood of bad loans.
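
The logistic regression test described above can be sketched as follows. The data are simulated with arbitrary parameters and do not reproduce the assignment's results, and the 0/1 outcome `bad` is a hypothetical encoding of the "bad" credit class. The interval shown is a Wald interval from confint.default(), which is one of several possible choices.

```r
set.seed(2)  # synthetic data, for illustration only
n <- 300
installment_commitment <- sample(1:4, n, replace = TRUE)
bad <- rbinom(n, size = 1, prob = plogis(-1 + 0.1 * installment_commitment))

# Logistic regression of the bad-credit indicator on installment commitment.
fit <- glm(bad ~ installment_commitment, family = binomial)

coefs    <- summary(fit)$coefficients
estimate <- coefs["installment_commitment", "Estimate"]
p_value  <- coefs["installment_commitment", "Pr(>|z|)"]

# Wald 95% confidence interval for the coefficient (log-odds scale).
ci <- confint.default(fit)["installment_commitment", ]

# Odds ratio per one-unit increase in installment commitment.
odds_ratio <- exp(estimate)

# Decision rule at the 5% significance level.
reject_h0 <- p_value < 0.05
```

Reporting the odds ratio alongside the p-value shows the size of the effect, not only whether it is distinguishable from zero.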

Interpret the Results:

A slight inverse association between "installment commitment" and "likelihood of bad debt" is
revealed by the logistic regression analysis, shown by the red logistic regression line. This
demonstrates that, albeit with relatively little effect, the probability of bad debt decreases
somewhat as "installment commitment" rises. If the p-value of the regression coefficient is
higher than 0.05, "installment commitment" is not a significant contributor to "bad debt."
This means that other factors are crucial in predicting the likelihood of bad credit.

Hypothesis 3; Relationship between “Installment Commitment” and “Credit Amount”:

Null Hypothesis (H₀): There is no linear relationship between installment commitment and
credit amount. The significance level of the linear regression determines whether this
hypothesis is rejected.

Alternative Hypothesis (H₁): Installment commitment is a predictor of credit amount, so that
changes in installment commitment are associated with changes in the credit amount.

Test Hypothesis:

The hypothesis that "installment commitment" has a statistically significant impact on "credit
amount" is tested using a linear regression analysis. This analysis determines whether changes in
"installment commitment" affect the "credit amount". The p-value of the regression coefficient of
"installment commitment" is examined to determine the significance of this association. If the
p-value is less than 0.05, there is a significant association between "credit amount" and
"installment commitment", signifying that changes in "installment commitment" have a significant
impact on the loan amount extended.
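
The linear regression test described above could be sketched in R as follows. This is an illustration on synthetic data, not the report's own code: the data frame `credit` and the simulated relationship are assumptions, while `credit_amount` and `installment_commitment` are the dataset's column names.

```r
# Synthetic stand-in data: NOT the assignment's dataset.
set.seed(2)
n <- 500
credit <- data.frame(
  installment_commitment = sample(1:4, n, replace = TRUE)
)

# Hypothetical right-skewed credit amounts with a weak negative trend.
credit$credit_amount <- rlnorm(
  n,
  meanlog = 8 - 0.05 * credit$installment_commitment,
  sdlog = 0.7
)

# Simple linear regression of credit amount on installment commitment.
fit <- lm(credit_amount ~ installment_commitment, data = credit)

coefs <- summary(fit)$coefficients
slope <- coefs["installment_commitment", "Estimate"]
p_value <- coefs["installment_commitment", "Pr(>|t|)"]
r_squared <- summary(fit)$r.squared

cat("Slope:", slope, "\n")
cat("p-value:", p_value, "\n")
cat("R-squared:", r_squared, "\n")
cat("Reject H0 at the 0.05 level:", p_value < 0.05, "\n")
```

Reporting R-squared alongside the p-value helps show how much of the spread in credit amount the single predictor explains.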

Interpret the Results:

According to the linear regression analysis, the minor slope of the regression line indicates a
weak negative linear association between "credit amount" and "installment commitment". However, the
variability and spread of the data points indicate that other factors may be crucial in determining
the "credit amount". If the regression coefficient's p-value is higher than 0.05, the observed
association is not strong enough to be statistically significant, indicating that "credit amount"
is not significantly affected by "installment commitment" in a linear relationship.

Hypothesis 4: Differences in “Installment Commitment” between good and bad loan groups:

Null Hypothesis (H₀): Installment commitment, which is the proportion of income that goes to paying
off loans, does not influence the likelihood of being classified as “bad credit”.

Alternative Hypothesis (H₁): Higher installment commitments translate to a higher propensity of
being labeled as bad credit. This is intuitive because people who dedicate a large share of their
salary to meeting their loan obligations are more likely to default.

Test Hypothesis:

A t-test or a Mann-Whitney U test can be used to determine whether the means or medians of
installment commitment differ significantly between the "good" and "bad" credit classes. The
t-test, which compares group means, is the first approach and should be used if the data are
normally distributed. The Mann-Whitney U test, a non-parametric alternative, is the second approach
and should be used if the data are not normally distributed. The test's p-value determines whether
there is a significant difference between the two groups. When the resulting p-value is less than
0.05, the null hypothesis should be rejected since there is a statistically significant difference
between the installment commitments of the "good" and "bad" credit classes.
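
A minimal sketch of this choice between the two tests is given here in R, on synthetic stand-in data. Using the Shapiro-Wilk test as the normality check is an assumption, since the report does not say how normality was assessed; the data frame `credit` and its simulated values are also illustrative only.

```r
# Synthetic stand-in data: NOT the assignment's dataset.
set.seed(3)
credit <- data.frame(
  class = factor(rep(c("good", "bad"), times = c(350, 150))),
  installment_commitment = sample(1:4, 500, replace = TRUE)
)

# Normality check within each class (assumed here: Shapiro-Wilk test).
shapiro_p <- tapply(
  credit$installment_commitment, credit$class,
  function(x) shapiro.test(x)$p.value
)
normal_ok <- all(shapiro_p > 0.05)

# t-test if both groups look normal, Mann-Whitney U test otherwise.
if (normal_ok) {
  result <- t.test(installment_commitment ~ class, data = credit)
} else {
  result <- wilcox.test(installment_commitment ~ class, data = credit,
                        exact = FALSE)
}

print(result$method)
cat("p-value:", result$p.value, "\n")
cat("Reject H0 at the 0.05 level:", result$p.value < 0.05, "\n")
```

Because installment commitment only takes the values 1 to 4, the normality check fails on data of this kind and the Mann-Whitney branch is taken.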

Interpret the Results:

The plot suggests that individuals in the "good" credit class tend to have larger installment
commitments, while those in the "bad" credit class tend to have lower ones, so high financial
responsibilities may be linked to better credit quality. However, the null hypothesis cannot be
rejected because the p-value is greater than 0.05, indicating that there is no statistically
significant difference in installment commitments between the "good" and "bad" credit classes.
Had the computed p-value been less than 0.05, it would have indicated a significant difference
between the two credit classes in terms of installment obligations.

3.1.5 – Conclusion

The Exploratory Data Analysis performed on the credit risk categorization dataset identified
predictors of consumer credit risk. According to the data, installment commitment was strongly and
positively associated with credit risk: customers with larger installment commitments were notably
more likely to be classified as having "bad credit". This implies that borrowers who allocate a
larger portion of their earnings to loan repayment may be more vulnerable to financial difficulties
and, as a result, more likely to default on their debt. The study also found that the customer's
repayment burden and credit risk classification are affected by additional loan factors, such as
the loan duration and credit amount.

Additionally, the results show that the consumers' level of installment commitment is influenced by
demographic characteristics such as age, with a lower average age observed among them. This
suggests that age-related behaviors, in part those connected to cash use, affect credit safety. In
summary, the findings support the claim that a customer's credit risk is influenced by several of
their financial and demographic attributes. Banks and other financial institutions looking to
manage higher-risk loans more effectively may find value in these conclusions.

3.2 – Objective 2: To Assess the Effect of Loan Amount and Installment
Commitment on Credit Class (Jaeden Loong Deng Ze - TP068347)

3.2.1 – Descriptive Analysis

A descriptive analysis is conducted to obtain the average values of the variables “credit_amount”
and “installment_commitment” for the good and bad classes. Based on the descriptive analysis, the
bad class has a higher average credit amount and installment commitment.
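
One way such class averages could be computed in base R is sketched here. The data are synthetic stand-ins generated only so that the example runs on its own, so the printed averages say nothing about the real dataset; the data frame name `credit` is an assumption.

```r
# Synthetic stand-in data: NOT the assignment's dataset.
set.seed(4)
n <- 500
credit <- data.frame(
  class = factor(sample(c("good", "bad"), n, replace = TRUE,
                        prob = c(0.7, 0.3))),
  credit_amount = round(rlnorm(n, meanlog = 7.8, sdlog = 0.8)),
  installment_commitment = sample(1:4, n, replace = TRUE)
)

# Mean of both numeric variables within each credit class.
class_means <- aggregate(
  cbind(credit_amount, installment_commitment) ~ class,
  data = credit,
  FUN = mean
)

print(class_means)
```

The result is a small data frame with one row per credit class, which makes the two averages easy to compare side by side.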

Code:

Output:

3.2.2 – Exploratory Data Analysis (Charts Summary)

A scatter plot is used to visualize the impact of credit amount and installment commitment on a
2-dimensional plane. The observations obtained from this analysis are as follows:

- Most of the loan amounts are less than 5000.
- “Good” credit classes tend to have lower loan amounts compared to “Bad” credit classes.
- “Good” credit classes have a higher presence in lower installment commitments
compared to “Bad” credit classes.

Code:

Output:

3.2.3 – Literature Review

(Prashanta & Behera, 2017)

Link to study: https://www.iosrjournals.org/iosr-jef/papers/Vol8-Issue2/Version-2/J0802026981.pdf

3.2.4 – Hypothesis

Null Hypothesis: “credit_amount” (<5000) and “installment_commitment” have no effect on the
likelihood of a credit class being “bad” or “good”.

Alternative Hypothesis: “credit_amount” (<5000) and “installment_commitment” have a
significant effect on the likelihood of a credit class being “bad” or “good”.

3.2.5 – Hypothesis Testing (Logistic Regression Analysis)

A logistic regression analysis is conducted to determine whether the null hypothesis is to be
rejected. The variables “credit_amount” and “installment_commitment” are fitted into the
logistic regression model to obtain the p-values.

The p-values retrieved for both variables are 2e-16 (less than 0.05). This indicates that both
variables have a significant effect on credit class.

Logistic Regression Model (Code):

Output:

A Chi-square test is then conducted to determine the overall significance of the model, which
further supports the significance of the variables. The observation obtained is as follows:

- The Chi-square statistic has a value of 222.54. This means that the residual deviance is
lower than the null deviance, indicating a better fit of the model.

Chi-square test (Code):

Output:

Based on these observations, the following statements can be made:

- The variables “credit_amount” and “installment_commitment” have a significant effect
on the credit class.
- Therefore, the null hypothesis is rejected in favor of the alternative hypothesis.
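
The two steps above, fitting the logistic regression and then testing the overall model with a Chi-square statistic, might look like the following in R. This is a sketch on synthetic stand-in data, so it will not reproduce the value 222.54; the data frame `credit` and the simulated coefficients are assumptions, and computing the statistic as the drop from null to residual deviance is one common way to carry out this test, not necessarily the exact code used here.

```r
# Synthetic stand-in data: NOT the assignment's dataset.
set.seed(5)
n <- 1000
credit <- data.frame(
  credit_amount = round(rlnorm(n, meanlog = 7.8, sdlog = 0.8)),
  installment_commitment = sample(1:4, n, replace = TRUE)
)

# Hypothetical generating process for the probability of a "bad" class.
p_bad <- plogis(-2 + 0.0002 * credit$credit_amount +
                  0.2 * credit$installment_commitment)
credit$class <- factor(
  ifelse(rbinom(n, 1, p_bad) == 1, "bad", "good"),
  levels = c("good", "bad")  # glm() models the second level, "bad"
)

# Logistic regression with both predictors.
full <- glm(class ~ credit_amount + installment_commitment,
            data = credit, family = binomial)
print(summary(full)$coefficients)

# Likelihood-ratio Chi-square: null deviance minus residual deviance.
chi_sq <- full$null.deviance - full$deviance
df <- full$df.null - full$df.residual
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

cat("Chi-square:", chi_sq, "on", df, "df, p-value:", p_value, "\n")
```

A large Chi-square value relative to its degrees of freedom means the two predictors together reduce the deviance far more than chance would.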

3.3 – Objective 3: To Investigate the effects of different credit histories on a
person’s credit classification (Keith Lo Ze Hui - TP067653)

3.3.1 – Analysis 1

For this part, my objective is to investigate the relationship between credit history and credit
classification, and whether one influences the other. In this case, the question is whether credit
history affects credit classification.

3.3.2 – Exploratory Data Analysis (Charts Summary)

By cross-referencing the “class” column and the “credit_history” column, we can learn the
relationship between the two columns. The tests and bar plot charts are used mainly to confirm or
deny the effect of the “credit_history” column on the “class” column.

Given here is the list of outputs from the dataset, with each column name representing its own
respective variable from the dataset that was loaded into RStudio.

3.3.3 – Literature Review

(Noriega et al., 2023)

Link to study: https://www.mdpi.com/2306-5729/8/11/169

3.3.4 – Hypothesis

Null Hypothesis: “credit_history” does not affect “class”; a “good” credit history would not
necessarily give a good “class”.

Alternative Hypothesis: “credit_history” affects “class”; a “good” credit history would give a
good “class”.

3.3.5 – Analysis 2 (Logistic Regression)

The bar chart was used because it displays the data in two dimensions. Moreover, the chi-squared
test and the contingency table are used to test for and find the relationship between the two
categorical variables.
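
The contingency table and chi-squared test mentioned above could be sketched in R as below. The data are synthetic and are generated with no real link between the two columns, so the example only shows the mechanics; the data frame `credit` and the category proportions are assumptions, while the category labels are those used in this report.

```r
# Synthetic stand-in data: NOT the assignment's dataset.
set.seed(6)
n <- 1000
histories <- c("all paid", "critical/order existing credit",
               "delayed previously", "existing paid", "no credit/all paid")
credit <- data.frame(
  credit_history = factor(sample(histories, n, replace = TRUE,
                                 prob = c(0.05, 0.29, 0.09, 0.53, 0.04))),
  class = factor(sample(c("good", "bad"), n, replace = TRUE,
                        prob = c(0.7, 0.3)))
)

# Contingency table: rows are credit histories, columns are classes.
tab <- table(credit$credit_history, credit$class)
print(tab)

# Chi-squared test of independence on the table.
test <- chisq.test(tab)
cat("Chi-squared:", test$statistic, "df:", test$parameter,
    "p-value:", test$p.value, "\n")
```

With five history categories and two classes the test has (5 - 1) x (2 - 1) = 4 degrees of freedom.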

3.3.6 – Conclusion

In conclusion, based on the analysis that was performed, the outcome is clear: credit history
affects a person’s credit classification. Using bar charts, chi-squared testing and regression
testing, the results have shown that having either a “good” or “bad” credit history has a
significant effect on one’s classification. For example, people with an “all paid” history are
likely to get a “good” classification, whereas people with “critical/order existing credit” are
highly likely to get the “bad” classification. Based on the findings of the analysis above, we can
confidently say that a person’s credit history affects their credit class, accepting the
Alternative Hypothesis and rejecting the Null Hypothesis.

3.4 – Objective 4: Assess the Effect of Higher Savings on Credit Class (Lim
Wen Yi - TP067930)

3.4.1 – Exploratory Data Analysis (Charts Summary)

A stacked bar plot is used to visualize the ratio between the credit classes. Interestingly, people
with no known savings have the highest ratio of good credit class. This may indicate that the data
is incomplete, with many people with high savings included in the “no known savings” category.

As expected, people with less than RM100 of savings have the highest chance of having the bad
credit class. While the rest of the data seems to follow the expected general trend that higher
savings go with a higher credit class, people with savings between RM100 and RM500 are an
exception and tend to have a higher credit class. Overall, the trends seem to be minuscule
compared to people with no known savings.

A regular bar plot allows you to visualize the data directly instead of just the ratios. From the
graph it’s obvious that the data is very uneven, with most people having savings lower than
RM100.

3.4.2 – Literature Review

As every bank is different, it’s impossible to find exact existing studies with the credit
class/score variable. There is also a lack of studies on the direct relationship between the two
variables. However, literature can still provide important insights when looking at papers
studying different variables, like bank credit.

The literature provides mixed evidence on the effect of higher savings on factors like bank credit.
Research on Nigerian commercial banks from 1990-2019 revealed that total savings had a
positive effect on credit to the private sector before bank consolidation, but a negative effect after
consolidation (Adedamola et al., 2021). The study also found an interaction effect between total
savings and the number of bank branches, indicating a complex relationship between savings and
lending.

An analysis of Regional Development Banks in Indonesia from 2019-2022 showed that current
accounts, savings, and time deposits significantly influenced bank credit (“The Effect of Current
Accounts, Savings and Time Deposits on Banking Credit: A Case Study of Regional
Development Banks (RDB) in Indonesia,” 2024). The study found that these factors could
explain 99.44% of the variation in bank credit, suggesting a strong positive relationship between
savings and lending.

In conclusion, while some studies suggest a positive relationship between savings and bank credit,
others highlight the importance of considering additional factors such as economic uncertainty,
institutional objectives, and regulatory environments. The mixed findings indicate that the effect
of higher savings on bank credit is context-dependent and may vary across different banking
systems and economic conditions.

3.4.3 – Hypothesis

Null Hypothesis: there is no significant relationship between savings status and credit class.

Alternative Hypothesis: people with savings less than 100 are more likely to have a bad credit
class.

3.4.4 – Analysis 1: Chi-square Test of Independence

The chi-square test is used to test if two categorical variables are independent. The low p-value
indicates that there is a relationship between the variables, and the Null Hypothesis can be
rejected.

3.4.5 – Analysis 2: Ordinal Logistic Regression

As the variables are an ordered factor and a logical, Ordinal Logistic Regression is used.

The summary is printed out. For all the coefficients, the p-values are low, much lower than the
common threshold of 0.05. This indicates there is a significant relationship between the saving
status and credit class, and the Null Hypothesis can be rejected.
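
A minimal sketch of an ordinal logistic regression in R is given below, using `polr()` from the MASS package on synthetic stand-in data. Several choices are assumptions rather than details from this report: that `savings_status` is the ordered response and the logical class indicator (`bad`) is the predictor, that the “no known savings” category is left out of the ordering, and that p-values are obtained from the t values by a normal approximation, since `summary()` of a `polr` fit does not print them.

```r
library(MASS)  # provides polr(); a recommended package shipped with R

# Synthetic stand-in data: NOT the assignment's dataset.
set.seed(7)
n <- 1000
savings_levels <- c("<100", "100<=X<500", "500<=X<1000", ">=1000")
credit <- data.frame(bad = rbinom(n, 1, 0.3) == 1)

# Hypothetical: "bad" customers are more concentrated in the lowest band.
probs_good <- c(0.55, 0.20, 0.12, 0.13)
probs_bad  <- c(0.75, 0.12, 0.07, 0.06)
credit$savings_status <- factor(
  ifelse(credit$bad,
         sample(savings_levels, n, replace = TRUE, prob = probs_bad),
         sample(savings_levels, n, replace = TRUE, prob = probs_good)),
  levels = savings_levels, ordered = TRUE
)

# Proportional-odds (ordinal logistic) model.
fit <- polr(savings_status ~ bad, data = credit, Hess = TRUE)

# Coefficient table plus approximate two-sided p-values.
ctable <- coef(summary(fit))
p_values <- 2 * pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)
ctable <- cbind(ctable, "p value" = p_values)
print(ctable)
```

The first row of the table is the effect of the class indicator, and the remaining rows are the cut-points between adjacent savings bands.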

The predicted probabilities are plotted in a graph. Their pattern is similar to that of the stacked
bar plot. Although the pattern is not linear as one might expect, the graph indicates that it isn’t
random. The <100 savings status has the lowest amount of variation, suggesting the strongest
correlation. This could be because of its large sample size, which allows for more accurate
predictions.

4.0 – Group Hypothesis

4.1 – State Your Complex Group Hypothesis Here

Null Hypothesis: There is no significant difference in the likelihood of being classified as
"good" or "bad" credit class based on instalment commitment, credit amount, credit history, and
savings.

Alternative Hypothesis: Individuals with higher instalment commitments, credit amounts above
RM5000, good credit history, and savings of RM1000 or more are significantly more likely to be
classified as "good" credit class compared to individuals with lower instalment commitments,
credit amounts below RM5000, bad credit history, and savings below RM100.

This hypothesis is tested by exploring the relationships between the following variables:

- installment commitment
- credit amount
- credit history
- savings

Subsequent analysis can consist of determining the degree to which different financial attributes,
such as credit amount, checking status, and installment commitment, explain credit class, for
example by using classification models.

To examine this hypothesis, we can begin with Exploratory Data Analysis (EDA) and
subsequently choose suitable models to test this hypothesis.

4.2 – Test Your Hypothesis

To test the hypothesis, we use RStudio.

4.2.1 – Load the needed library and read the dataset

4.2.2 – Convert the classes for grouping in terms of factors
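
As an illustration of these two steps, the sketch below reads a CSV file and converts its categorical columns to factors using base R only. To stay self-contained it first writes a tiny file of made-up rows to a temporary path; the rows, the path, and the data frame name `credit` are hypothetical, and the actual code may load additional libraries and a different file.

```r
# Write a tiny CSV of made-up rows so the sketch runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "credit_amount,installment_commitment,credit_history,savings_status,class",
  "1200,4,existing paid,<100,good",
  "5600,2,delayed previously,100<=X<500,bad",
  "2300,3,critical/order existing credit,no known savings,good",
  "7800,4,all paid,>=1000,bad"
), csv_path)

# Step 1: read the dataset.
credit <- read.csv(csv_path, stringsAsFactors = FALSE)

# Step 2: convert the categorical columns to factors for grouping.
categorical <- c("credit_history", "savings_status", "class")
credit[categorical] <- lapply(credit[categorical], factor)

str(credit)
print(levels(credit$class))
```

Converting only the named columns keeps `credit_amount` and `installment_commitment` numeric for the later plots and models.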

4.2.3 – First, we must plot the Credit Amount

This is the graph for the credit amount.

We have decided to use a QQ plot for the “credit_amount” variable. This is because the reference
line through the middle of the plot shows how closely the data follow a normal distribution, which
in turn makes it easier for us to draw important insights about “credit_amount”.
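
A QQ plot of this kind can be drawn with base R graphics, as sketched here. The credit amounts are simulated from a right-skewed distribution purely for illustration and are not the dataset's values; the plot title and line colour are arbitrary choices.

```r
# Synthetic stand-in values: NOT the assignment's dataset.
set.seed(8)
credit_amount <- round(rlnorm(1000, meanlog = 7.8, sdlog = 0.8))

# Sample quantiles against theoretical normal quantiles.
qq <- qqnorm(credit_amount, main = "QQ Plot of Credit Amount")

# Reference line through the first and third quartiles.
qqline(credit_amount, col = "red")

# A mean above the median is a simple numeric hint of positive skew.
cat("Mean:", mean(credit_amount), "Median:", median(credit_amount), "\n")
```

Points that bend away from the red line at the upper end indicate a heavier right tail than a normal distribution would have.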

4.2.4 – Next, we have a plot for the Duration

For Duration, we have used a scatter plot to visualize the data.

Some of the main reasons why the scatter plot was chosen are:

- A scatter plot lets us easily explore the relationship between “Duration” and the “Index”.
- A scatter plot also helps us to identify outliers and observe their spread.
- A scatter plot allows us to compare Duration against Index, and a third variable such as
“class” can easily be added to the study.

4.2.5 – Moreover, we would then plot the Credit Amount by Classification

We have used a box plot to visualize the data.

The reasons we chose a box plot to examine Credit Amount by class are:

- It shows the median: the box plot helps to visualize the median value of “credit_amount” for
each “class”.
- The box plot marks anomalies, which makes it easy to spot extreme values of “credit_amount”
that differ significantly from the others.
- For box plot, it is used because it can help.

4.2.6 – Furthermore, we must plot the Duration according to Classification.

To plot the Duration according to Classification, we have used a violin plot.

The violin plot was used because:

- The violin plot combines the characteristics of a kernel density plot and a box plot, and it
helps to visualize the distribution of the data in the dataset.
- The violin plot makes it easy to compare the distribution of duration across classes. For
example, we can see whether a class has both short and long credit durations.
- Lastly, compared to the box plot, the violin plot provides more information about the density
of the data, especially when a group has more than one mode or abnormalities in its distribution.

4.2.7 – Besides, we should also plot the Instalment Commitment

We have chosen a histogram to plot the Installment Commitment. This variable
(installment_commitment) represents the amount of installment payments each individual has made.
The histogram given above was plotted with the dataset in mind: the minimum value found is 1.0,
whereas the maximum value found is 4.0. The histogram allows us to visualize the proportions more
easily.

4.2.8 – After, we must plot for Credit Amount, specifically via the threshold
highlighted

A histogram was also chosen for the credit amount, specifically with the threshold highlighted.
The variable (credit_amount) in the dataset shows the total amount of credit. The histogram above
shows the amounts and how they are spread in an easily comprehensible way. With it we can show
quite precisely how many people have credit amounts above and below the critical threshold.

4.2.9 – We Must Plot a Graph for the Savings Status

A bar chart was used to represent the Saving Status. This is because the variable
(“savings_status”) includes several different categories: >=1000, 500<=X<1000, 100<=X<500, no
known savings and <100. By doing this we can easily identify the number of individuals in each
category, which in turn makes it easier for us to see how the threshold aligns with the dataset.

4.2.10 – Lastly, We Must Plot a Graph for the Credit History

We have also chosen a bar chart to plot the variable “credit_history”. This is because it takes
categorical values such as “critical/order existing credit”, “existing paid” and “delayed
previously”. With the help of the bar chart, we can easily understand the distribution of each
category, and by studying the distribution across all the categories we can better understand
“credit_history” in the dataset.

4.3 – Interpret the Result

4.3.1 – QQ Plot of Credit Amount

The QQ graph compares the distribution of credit amount against a theoretical normal
distribution. Based on the graph, the following observations can be made:

- Significant deviation against the normal distribution at the far left and far right due to a
higher number of outliers.
- Higher credit amounts are the norm and are more frequent than predicted.

This suggests that the distribution is positively skewed and does not follow the normal
distribution. Further exploration might be required for better accuracy in subsequent analysis.

4.3.2 – Scatter Plot of Duration

The scatter plot is a visualization of the distribution of loan duration divided by “bad” and
“good” credit classes. Based on the graph, the following observations can be made:

- “good” credit classes have a higher frequency in lower durations.
- “bad” credit classes have a higher frequency in higher durations.
- “bad” credit classes have an overall more even distribution across different durations.

Based on the observations, we can draw several potential implications:

- Individuals under “good” credit classes might be borrowing less cash and are therefore
less likely to default compared to those under "bad” credit classes.

- Individuals under “good” credit classes might have more money at their disposal and are
therefore less likely to default compared to those under "bad” credit classes.
- Individuals under “good” credit classes might be more prudent and are better at managing
their financial situation and are therefore less likely to default compared to those under
"bad” credit classes.

4.3.3 – Box Plot of Credit Amount by Class

The box plot is a visualization of the distribution of credit amounts between the bad and good
credit classes. Based on the plot, the following observations can be made:

- “bad” credit class has a higher median value than “good” credit class.
- “bad” credit class has a larger interquartile range than “good” credit class.
- “good” credit class has more extreme outliers than “bad” credit class.

Based on the observations, we can draw several potential implications:

- Individuals under “bad” credit class might be borrowing more cash which makes them
more likely to default.
- Individuals under “bad” credit class might have less money at their disposal which
makes them more likely to default.
- Individuals under “good” credit class who borrow more money might be only doing so as they
have sufficient cash at their disposal or can manage their finances.

4.3.4 – Distribution of Installment Commitment

The histogram is a visualization of the distribution of instalment commitment between the “bad”
and “good” credit classes. Based on the graph, the following observations can be made:

- Individuals under “bad” credit classes are predominantly concentrated in instalment
category 4.
- In instalment category 2, the ratio of “good” to “bad” credit classes is most in favor of
“good” credit class compared to other instalment categories.
- “good” credit classes are more prevalent in instalment categories 1 to 3.

Based on the observations, we can draw several potential implications:

- Individuals under “bad” credit class will be more likely to default due to having a higher
instalment commitment which puts them under greater financial strain.
- Individuals under “good” credit class will be less likely to default due to the reduced
financial strain as they have a lower instalment commitment.
- Individuals under “bad” credit class might be charged more due to having a riskier track
record.
- Individuals under "bad" credit class might have borrowed more than individuals under
“good” credit class and therefore must commit to a higher instalment.

4.3.5 – Distribution of Credit Amount by Credit Classification

The histogram is a visualization of the distribution of credit amount between the “bad” and
“good” credit classes. Based on the graph, the following observations can be made:

- The majority of individuals under the “good” credit class have a credit amount of 5000 or
lower.
- As credit amount increases, the frequency of individuals under the “good” credit class
decreases.
- Credit amounts above 5000 show a higher frequency of “bad” credit class individuals.

Based on the observations, we can draw several potential implications:

- Individuals who borrow less than 5000 have a lower chance to default as they might be
borrowing responsibly.
- Individuals who borrow more than 5000 have a higher chance to default as they are under
higher financial strain and might find it much harder to repay.
- Individuals who borrow less than 5000 and are under the “bad” credit class are likely in that
class due to other variables.

4.3.6 – Savings Status Categories by Credit Classification

The double bar chart is a visualization of the distribution of “bad” and “good” credit classes
across different savings status categories. Based on the chart, the following observations can be
made:

- Most individuals have a saving status of less than 100. The ratio of the credit classes in
this category is in favor of the “bad” credit class.
- The presence of “good” credit class is significantly higher in the “no known savings”
category.
- In the “100<=X<500” category, the ratio of credit classes is slightly in favor of “good”
credit class.
- In the “500<=X<1000” category, the ratio of credit classes is slightly in favor of “bad”
credit class.
- In the “>=1000" category, the ratio of credit classes is very similar although there are
slightly more individuals from “bad” credit class.

Based on the observations, we can draw several potential implications:

- Individuals who have less than 100 in their savings are more likely to default due to their
poor financial situation which makes it harder for them to pay up.
- The “no known savings” category conflicts with the other categories, as a higher likelihood
of default would be expected there. This could be due to this category having many outliers, or
to individuals with no savings who don’t request loans, meaning that there is no record to
reclassify them.

4.3.7 – Credit History Categories by Credit Classification

The double bar chart is a visualization of the distribution of “bad” and “good” credit classes
across different credit history categories. Based on the chart, the following observations can be
made:

- Most individuals fall under the “existing paid” category, with a very similar amount of
“good” and “bad” classes.
- In the “critical/order existing credit” category, there is a significantly higher number of
individuals from the “good” credit class.
- In categories “delayed previously", “all paid”, and “no credit/all paid”, the number of
individuals from the “bad” credit class is higher.

Based on the observations, we can draw several potential implications:


- Individuals in the “delayed previously” category tend to fall under “bad” credit class due
to the riskier track record.
- Categories “critical/order existing credit”, “all paid”, and “no credit/all paid” seem to
have a high number of outliers. “critical/order existing credit” implies riskier individuals,
but the statistics show the inverse. “all paid” and “no credit/all paid” should indicate
lower risk individuals however there is a higher number of “bad” credit individuals. This
suggests that their credit histories have less impact on their likelihood of default.

4.4 – Conclusion

In this analysis, the following financial characteristics were examined in relation to the
classification of credit as either good or bad: installment commitment, credit amount, credit
history and savings status. In this EDA we applied QQ plots, scatter plots, box plots, violin
plots, histograms and bar charts to identify important trends and observations.

The analysis of the results suggests the presence of interactions between the variables and the
credit classification. The intercept and X1 coefficient estimates were significant, though not very
large; notably, individuals classified as “good” credit holders scored lower on credit amount
and installment commitments, showing that they are less likely to borrow recklessly. On the
other hand, people in the “bad” credit class borrow higher credit amounts and have more
installment obligations, which can worsen their condition and thereby increase their tendency to
default.

Our study also reveals the considerable significance of savings status and credit history for
credit classification. The results also support previous findings indicating that the level of
savings is directly related to creditworthiness, whereby those in the “bad” credit class are
likely to have lower average savings. Furthermore, the evaluation of credit history showed that
past payment behavior affects the current credit rating, and that payment delays lead to a “bad
credit” rating.

On balance, the evidence lends credibility to the alternative hypothesis that greater installment
obligations, higher credit amounts, weak credit histories, and low savings are associated with
“bad” credit. The results obtained for the problem statements can help the financial institutions
concerned to improve credit assessment and risk management. Further research could examine
these relationships with predictive models, both to improve the credit classification system and
to provide targeted interventions for clients who are at risk of default.
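
As one possible starting point for the predictive work suggested above, a logistic regression can relate the probability of a “bad” classification to credit amount and installment commitment. The sketch below is illustrative only: the data are simulated, and the column names is_bad, credit_amount and installment_commitment, as well as the coefficients used in the simulation, are assumptions rather than values from the assignment dataset.

```r
# Simulated stand-in data; nothing here is taken from the assignment dataset.
set.seed(1)
n <- 300
credit_amount <- round(rlnorm(n, meanlog = 7.8, sdlog = 0.8))
installment_commitment <- sample(1:4, n, replace = TRUE)

# Assumed data-generating process, chosen only so the example has some signal.
linear_part <- -9 + 1.0 * log(credit_amount) + 0.3 * installment_commitment
is_bad <- rbinom(n, size = 1, prob = plogis(linear_part))

sim <- data.frame(is_bad, credit_amount, installment_commitment)

# Logistic regression: log-odds of a "bad" class as a linear function
# of log credit amount and installment commitment.
fit <- glm(is_bad ~ log(credit_amount) + installment_commitment,
           data = sim, family = binomial())

# Coefficient table (estimates, standard errors, z values, p-values).
print(summary(fit)$coefficients)

# Fitted probability of "bad" for each simulated applicant.
p_bad <- predict(fit, type = "response")
```

With the real data, the same call would use the dataset's own column names, and the fitted probabilities could then be compared against the recorded credit class.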

5.0 – Overall Conclusion

5.1 – Overall Discussion on the Findings from All Objectives

In most of the analyses, the null hypothesis is rejected and the alternative hypothesis is
accepted. This suggests that credit class is complex and depends on many variables, which is
also reflected in the existing literature.

5.2 – Recommendation

Future work should focus on building models that take a holistic view of the variable
“credit_risks” by using multiple factors such as “history”, “savings” and loan amount (Kaya,
Agca, Adiguzel, & Cetin, 2018). We also recommend that the data collection process be improved
(Kaya, Agca, Adiguzel, & Cetin, 2018). This would address the many inconsistencies in the
dataset and save analysis time that would otherwise be spent “cleaning” it (Kaya, Agca,
Adiguzel, & Cetin, 2018).

5.3 – Limitations and Future Direction

5.3.1 – Limitations

In this analysis, there are several limitations that may hinder accuracy:

- Presence of outliers – In the QQ plot of credit amount (refer to 4.3.1), the amount
distribution appears to be positively skewed and does not conform to the normal
distribution. This indicates that there are high-value outliers, which the box plot also
shows.
- Presence of anomalies – The double bar chart visualizing the impact of credit history
categories displays unexpected trends. For example, the “critical/other existing credit”
category has a higher prevalence of “Good” credit classes despite being a riskier class.
This could be due to misclassification or a lack of contextual data.
- Limited scope – Not every available variable was included in this analysis. Including
variables such as purpose, employment, personal status, and job might result in a more
accurate analysis.
- Correlation without causation – The observations and conclusions were based on
correlations and do not establish causation, which makes it difficult to explain the
trends displayed in the dataset.
- Static analysis – This analysis does not consider the changes in trend over time which
might provide better insight.
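
The outlier limitation above can be checked numerically as well as visually. The sketch below flags values with the 1.5 × IQR rule that a box plot uses for its whiskers and repeats the check on a log scale. The credit_amount vector is simulated and the helper flag_outliers is a hypothetical name introduced for illustration; neither comes from the assignment's code.

```r
# Simulated, right-skewed stand-in for the credit amount column.
set.seed(42)
credit_amount <- round(rlnorm(200, meanlog = 7.8, sdlog = 0.8))

# Flag values more than k * IQR below the first or above the third quartile.
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  x < q[1] - k * spread | x > q[2] + k * spread
}

is_outlier <- flag_outliers(credit_amount)
n_outliers <- sum(is_outlier)

# A mean well above the median is a simple sign of positive skew.
skew_gap <- mean(credit_amount) - median(credit_amount)

# Repeating the check on the log scale shows how many flags remain
# once the long right tail is compressed.
n_outliers_log <- sum(flag_outliers(log(credit_amount)))

print(c(raw = n_outliers, log_scale = n_outliers_log, mean_minus_median = skew_gap))
```

Comparing the two counts gives a rough indication of whether the flagged values are genuine anomalies or simply the tail of a skewed distribution.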

5.3.2 – Future Direction

For future analysis, several changes in procedure should be considered to yield better findings:

- Manage outliers/anomalies – Further investigate outliers and anomalies to identify if
external factors alter the results or if there is a hidden pattern across the other variables.
- More variables – Including more variables will give more accurate findings as it helps to
provide more context to draw conclusions from.
- Temporal analysis – Performing a temporal analysis can determine how much the data
changes over time and potentially identify whether the current dataset is typical. This can
also help to identify the direct causes of the trends.
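
One way to begin investigating an anomaly such as the credit history pattern is to compare class shares within each category rather than raw counts, since categories of different sizes can make a bar chart misleading. The sketch below uses a handful of made-up records; the column names credit_history and class and the category labels are assumptions for illustration, not rows from the dataset.

```r
# Made-up illustrative records; not taken from the assignment dataset.
credit <- data.frame(
  credit_history = c("existing paid", "existing paid", "existing paid",
                     "critical/other existing credit",
                     "critical/other existing credit",
                     "delayed previously", "delayed previously",
                     "delayed previously"),
  class = c("good", "good", "bad",
            "good", "bad",
            "bad", "bad", "good")
)

# Counts of each credit class within each credit history category.
counts <- table(credit$credit_history, credit$class)

# Row-wise proportions: the share of good and bad within each category.
row_share <- prop.table(counts, margin = 1)

print(counts)
print(round(row_share, 2))
```

If a supposedly risky category still shows a high share of “good” classes after this normalization, the next step would be to split the same table by further variables to look for a hidden pattern.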

5.4 – Word Count

8808 Words

6.0 – Workload Matrix

Name:                             Jaeden Loong Deng Ze  Keith Lo Ze Hui  Lim Wen Yi  Muhammad Hadi
TP number:                        TP068347              TP067653         TP067930    TP077049
Introduction                      25%                   25%              25%         25%
Data Preparation                  -                     -                50%         50%
Data Analysis                     25%                   25%              25%         25%
Conclusion                        25%                   25%              25%         25%
Group Hypothesis
State Your Group Hypothesis       -                     -                100%        -
Test Your Hypothesis              100%                  -                -           -
Interpret the Result              100%                  -                -           -
Conclusion for Hypothesis         -                     -                -           100%
Overall Conclusion
Overall Discussion                -                     -                100%        -
Recommendation                    -                     100%             -           -
Limitations and Future Direction  100%                  -                -           -
Word Count                        -                     -                -           100%

7.0 – References
Kaya, E., Agca, M., Adiguzel, F., & Cetin, M. (2019). Spatial data analysis with R programming
for environment. Human and Ecological Risk Assessment: An International Journal, 25(6),
1521-1530. From https://www.tandfonline.com/doi/abs/10.1080/10807039.2018.1470896

Wickham, H. (2019). Advanced R. Chapman and Hall/CRC. From
https://www.taylorfrancis.com/books/mono/10.1201/9781351201315/advanced-second-edition-hadley-wickham

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science. O'Reilly
Media, Inc. From
https://books.google.com/books?hl=en&lr=&id=TiLEEAAAQBAJ&oi=fnd&pg=PT9&dq=R+programming&ots=ZJq4efqMwR&sig=9Q7cttVYrsGEjuo_Wtav9AEKRDw

Jockers, M. L., & Thalken, R. (2020). Text analysis with R. Springer International Publishing.
From https://link.springer.com/content/pdf/10.1007/978-3-030-39643-5.pdf

Noriega, J. P., Rivera, L. A., & Herrera, J. A. (2023). Machine Learning for Credit Risk
Prediction: A Systematic Literature Review. Data, 8(11), 169–169. From
https://doi.org/10.3390/data8110169

Prashanta, M., & Behera, K. (2017). Credit Risk Analysis & Modeling: A Case Study. IOSR
Journal of Economics and Finance, 8(2), 69–81.
From https://doi.org/10.9790/5933-0802026981

Adedamola, S. L., Obafemi, D. S., & Oluwakemi, A. A. (2021). Investigating the Factors
Influencing Commercial Bank Lending in Nigeria: A Consolidation and Interaction Effect. The
International Journal of Humanities & Social Studies, 9(5). From
https://doi.org/10.24940/theijhss/2021/v9/i5/hs2105-058

The Effect of Current Accounts, Savings and Time Deposits on Banking Credit: A Case Study of
Regional Development Banks (RDB) in Indonesia. (2024). Journal of International Business,
Economics and Entrepreneurship, 9(1), 24–37. From https://doi.org/10.24191//jibe.v9i1.900

Kaya, E., Agca, M., Adiguzel, F., & Cetin, M. (2018). Spatial data analysis with R programming
for environment. Human and Ecological Risk Assessment: An International Journal, 25(6),
1521-1530. From https://doi.org/10.1080/10807039.2018.1470896
