=Paper=
{{Paper
|id=Vol-3194/paper64
|storemode=property
|title=Credit Score Prediction Relying on Machine Learning
|pdfUrl=https://ceur-ws.org/Vol-3194/paper64.pdf
|volume=Vol-3194
|authors=Flora Amato,Antonino Ferraro,Antonio Galli,Francesco Moscato,Vincenzo Moscato,Giancarlo Sperlì
|dblpUrl=https://dblp.org/rec/conf/sebd/AmatoFG0MS22
|wikidataid=Q117344887
}}
==Credit Score Prediction Relying on Machine Learning==
<pdf width="1500px">https://ceur-ws.org/Vol-3194/paper64.pdf</pdf>
<pre>
Credit Score Prediction Relying on Machine Learning

Flora Amato1, Antonino Ferraro1, Antonio Galli1, Francesco Moscato3,
Vincenzo Moscato1,2 and Giancarlo Sperlí1,2

1 Department of Electrical Engineering and Information Technology (DIETI), Naples, Italy
2 CINI - ITEM National Lab, Complesso Universitario Monte S. Angelo, Naples, Italy
3 DIEM, University of Salerno, Fisciano, Italy

Abstract
Financial institutions use a variety of methodologies to define their commercial and strategic policies, and a significant role is played by credit risk assessment. In recent years, several credit risk assessment services have arisen, providing Social Lending platforms that connect lenders and borrowers directly, without the involvement of financial institutions. Despite the advantages of these platforms in supporting the fundraising process, several risks stem from multiple factors, including the lack of experience of lenders and missing or uncertain information about the borrower's credit history. To handle these problems, credit risk assessment of financial transactions is usually modeled as a binary problem based on debt repayment, to which Machine Learning (ML) techniques are applied. This paper is an extended abstract of a recent work in which some of the authors benchmarked the most widely used credit risk assessment ML models for predicting whether a loan will be repaid on a P2P platform. The experimental analysis is based on a real Social Lending dataset (Lending Club) and evaluates several metrics, including AUC, sensitivity, specificity, and the explainability of the models.

Keywords
Credit Score Prediction, Machine Learning, eXplainable Artificial Intelligence

1. Introduction
The recent development of digital financial services has led researchers to pay attention to the management of credit risk, proposing models that reduce such risk while still obtaining profits from the investment. Banking risks can arise from different factors, including operational, market, and credit risk; the last accounts for about 60% of banks' problems [1].
A major driver of credit risk is the spread of Social Lending (SL) platforms, also known as Peer-to-Peer (P2P) lending. These platforms allow lenders and borrowers to be interconnected without involving financial institutions; they support borrowers in the fundraising process and allow lending entities to participate. One challenge that needs to be addressed in this context is credit risk analysis, due to the possible non-repayment of loans by borrowers; the risk assessment is computed through credit scoring.
The credit risk assessment of financial transactions on SL platforms is performed through a binary classification problem based on debt repayment [2, 3].

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
flora.amato@unina.it (F. Amato); antonino.ferraro@unina.it (A. Ferraro); antonio.galli@unina.it (A. Galli); fmoscato@unisa.it (F. Moscato); vincenzo.moscato@unina.it (V. Moscato); giancarlo.sperli@unina.it (G. Sperlí)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
Additionally, P2P platforms produce large amounts of unlabeled data, so additional analysis is required to support real-time decisions [4]. A further critical issue with these platforms is the risk of default, which is higher than with standard methods: a lender may not always be able to effectively assess the risk level of borrowers [5], mainly because borrowers often lack a credit history.
Predictive models for credit scoring can be classified into two broad categories [6]: statistical approaches and artificial intelligence methods. Statistical approaches have been proposed, but they suffer from coverage problems inherent in the nonlinear effects among the variables involved. Credit risk assessment is characterized by dependence, complexity, and interconnectedness [7]; credit scoring estimation is therefore very complex, as it depends on many parameters. Several methodologies rely on rule generation to evaluate credit risks [8, 9]; however, these approaches may be limited when generating rules over large amounts of data. Another problem is the lack of lender experience or the uncertainty of borrower history information, factors that greatly increase credit risk. Some platforms incorporate borrower status prediction models, particularly logistic regression [10] and Random Forest-based classification [11].
However, developing credit risk prediction models is difficult due to several factors, including high data dimensionality, class imbalance, and a high number of missing values. For these reasons, additional approaches have been proposed, such as a semi-supervised approach based on Support Vector Machines (SVM) [12], while [13] introduced an ensemble Decision Tree model for credit risk assessment on 138 Chinese companies with loss-making corporate earnings. Another ensemble method was developed by Feng et al. (2018) [14], in which classifiers are selected based on their performance on credit scoring, while [15] designed a hybrid model that relies on a transductive support vector machine (TSVM) and Dempster-Shafer theory to predict social loan defaults. Finally, [16] described a combination of different classifiers using a linear weight ensemble to predict SL default, whereas Song et al. (2020) [17] proposed an ensemble of classifiers based on a distance-to-model learning method and adaptive multi-view clustering (DM-ACME).
In this paper, which is an extended abstract of our previous work [18], we propose a benchmark for credit risk scoring using the most advanced machine learning (ML) techniques in the literature, to understand whether a loan will be repaid on a P2P platform. Performance was evaluated using different scoring metrics, such as Sensitivity, AUC, and Specificity. In addition, eXplainable Artificial Intelligence (XAI) approaches were used to obtain a high degree of explainability of the models. The goal is both to evaluate the accuracy of the classifiers and to provide results that are understandable by domain experts, ensuring transparency in decisions, which is particularly required for credit risk assessment.

2. Proposed benchmark architecture
The proposed benchmark architecture in Fig. 1 stems from the need to support the risk prediction problem, so that an investor can evaluate potential borrowers within social lending platforms. The main challenge to address is that credit risk assessment is a multidimensional and unbalanced problem, because it is based on huge amounts of historical data, including credit history (obtained by filling out a comprehensive application), bank account status, employment status, etc.
In addition, using all these features increases coverage but decreases accuracy, so it is essential to apply a feature selection approach. In particular, the proposed architecture is based on three macro-modules:
1. Ingestion,
2. Classification,
3. Explanation.
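The three macro-modules above can be sketched as a minimal pipeline skeleton. The function names and the toy repayment rule below are purely illustrative assumptions, not the authors' implementation:

```python
# Hypothetical skeleton of the three macro-modules (Ingestion, Classification,
# Explanation); all names and the threshold rule are illustrative only.

def ingest(raw_rows):
    """Clean rows: drop records with a missing 'amount' and cast to float."""
    return [{"amount": float(r["amount"]), "repaid": r["repaid"]}
            for r in raw_rows if r.get("amount") is not None]

def classify(rows, threshold=10000.0):
    """Toy stand-in for a trained classifier: predict repayment by amount."""
    return [r["amount"] < threshold for r in rows]

def explain(rows, preds):
    """Attach a minimal human-readable reason to each prediction."""
    return [f"amount={r['amount']:.0f} -> {'repaid' if p else 'default'}"
            for r, p in zip(rows, preds)]

raw = [{"amount": "5000", "repaid": 1}, {"amount": None, "repaid": 0},
       {"amount": "20000", "repaid": 0}]
rows = ingest(raw)
preds = classify(rows)
notes = explain(rows, preds)
```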
The ingestion phase crawls data from the social lending platforms, cleans and filters the obtained data, and performs feature selection based on the chosen classifier. In detail, the data are cleaned by removing features with many missing or null values and attributes with zero variance. After cleaning, several transformations are applied, such as converting categorical features to numeric ones and changing date attributes to numeric values.
The second macro-block performs credit prediction. Here we have to deal with a class-imbalance problem, because users of P2P platforms usually have a high number of rejected loans compared to those requested. The classifiers chosen in our architecture are Logistic Regression, Random Forest, and Multi-Layer Perceptron, being the most suitable ones for credit prediction [19, 6] and the most used in this context [11, 20, 13]. To handle the imbalance problem, the following techniques are used: random under-sampling, random over-sampling, and SMOTE. Specifically, random over-sampling merely replicates minority-class samples, whereas the Synthetic Minority Oversampling Technique (SMOTE) creates new synthetic minority samples using k-nearest neighbors, and random under-sampling randomly eliminates majority-class samples.
Finally, the third macro-block explains the results of each prediction, i.e., the decisions made by the classifiers, in order to obtain information about the financial domain being analyzed. In particular, five XAI tools are used: LIME, Anchors, SHAP, BEEF, and LORE. LIME [21] is a post-hoc, model-agnostic method that provides a local explanation of the prediction; Anchors [22] is of the same type, a post-hoc, model-agnostic method that provides a local explanation, but it uses rules that sufficiently "anchor" the predictor locally. SHapley Additive exPlanations (SHAP) [23] is a method for explaining individual predictions based on the game-theoretically optimal Shapley values, analyzing how each feature influences the prediction. Balanced English Explanations of Forecasts (BEEF) [24] exploits global information, retrieved by a clustering algorithm run on the entire dataset, to generate a local explanation. Finally, Local Rule-Based Explanations (LORE), proposed by Guidotti et al. (2018) [25], first learns an interpretable local predictor and then derives the explanation as a decision rule.
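The resampling strategies mentioned above can be sketched as follows. This is illustrative only: real experiments would typically use a dedicated library such as imbalanced-learn, and real SMOTE interpolates towards one of the k nearest minority neighbours rather than a random minority sample:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X_maj = rng.normal(0, 1, size=(100, 3))   # majority class (e.g. fully paid)
X_min = rng.normal(2, 1, size=(10, 3))    # minority class (e.g. charged off)

# Random under-sampling: shrink the majority class to the minority size.
X_maj_rus = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)

# Random over-sampling: replicate minority samples with replacement.
X_min_ros = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

# Simplified SMOTE-style synthesis: interpolate between a minority sample and
# another randomly chosen minority sample (a stand-in for k-NN interpolation).
def smote_like(X, n_new, rng):
    idx = rng.randint(0, len(X), size=n_new)
    nbr = rng.randint(0, len(X), size=n_new)
    gap = rng.rand(n_new, 1)
    return X[idx] + gap * (X[nbr] - X[idx])

X_min_syn = np.vstack([X_min, smote_like(X_min, len(X_maj) - len(X_min), rng)])
```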

3. Experimental evaluation
The purpose of our experimentation is to compare different classification models, evaluating them according to several metrics (for more details see Section 3.1). The dataset is provided by Lending Club (https://www.lendingclub.com/), a P2P lending platform. In particular, we focused on loans disbursed between 2016 and 2017; the dataset consists of 877,956 samples and 151 features, among which the most important are loan_amount and term.
Figure 1: Architecture

According to [11] and [6], we considered loan_status as the target class for our problem. Only the labels "Fully Paid" and "Charged Off" were considered, since we cast the problem as binary: whether the loan will be repaid or not. This leads to unbalanced data: 77% of the samples are fully paid, and the remaining 23% are unpaid. A 10-fold cross-validation was performed, in which the dataset was divided into a training set and a test set with a ratio of 75/25.
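A minimal sketch of this labeling and splitting step, on a toy frame standing in for the Lending Club data (column names follow the paper; the values are invented):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Lending Club data.
df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Fully Paid",
                    "Current", "Charged Off", "Fully Paid", "Fully Paid"],
    "loan_amount": [5000, 12000, 8000, 3000, 7000, 15000, 4000, 9000],
})

# Keep only the two labels of the binary problem and map them to {1, 0}.
df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
df["target"] = (df["loan_status"] == "Fully Paid").astype(int)

# Stratified 75/25 split preserves the class imbalance in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    df[["loan_amount"]], df["target"],
    test_size=0.25, stratify=df["target"], random_state=42)
```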
Finally, the results obtained were compared against those presented in Namvar et al. (2018) [6] and Song et al. (2020) [17]. The benchmark was run on Google Colab (https://colab.research.google.com/), with a single-core hyper-threaded Xeon processor @ 2.3 GHz, 12 GB RAM, and an NVIDIA Tesla K80 with 2496 CUDA cores and 12 GB GDDR5 VRAM, using Python 3.6 with scikit-learn 0.23.1 (https://scikit-learn.org/).

3.1. Evaluation metrics
The following metrics were used to evaluate and compare the effectiveness of the considered models: Sensitivity (TPR), Specificity (TNR), G-mean, Precision, FP-Rate, and Area Under the Curve (AUC). Accuracy (ACC) was not used as an evaluation metric because it does not account for the fact that false positives are more important than false negatives, and would thus yield an inaccurate evaluation. TPR and TNR, instead, are suitable because they assess the accuracy on positive and negative samples, respectively, while G-mean is an appropriate metric for assessing the balance of classification performance between the majority and minority classes. In turn, Precision and FP-Rate are useful for understanding how well the model predicts the positive and negative classes. Finally, AUC measures the area under the ROC curve and is used to assess the trade-off between the rates of true positives and true negatives in the evaluated model.

Classifier             AUC    TPR    TNR    FP-Rate  G-Mean  ACC
RF - RUS               0.717  0.630  0.680  0.320    0.6560  0.640
LR - ROS               0.710  0.659  0.642  0.360    0.6503  0.650
LR - SMOTE-Tomek       0.710  0.660  0.640  0.360    0.6500  0.656
Logistic Regression    0.685  0.983  0.069  0.960    0.2600  0.770
Random Forest          0.720  0.983  0.084  0.920    0.2870  0.773
MLP                    0.704  0.990  0.040  0.945    0.2060  0.771
Table 1: Our best classification results.
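These metrics can be computed directly from the confusion matrix; a minimal sketch using scikit-learn (the toy labels and scores are invented):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55])
y_pred  = (y_score >= 0.5).astype(int)

# confusion_matrix returns rows = true class, columns = predicted class,
# so ravel() yields (tn, fp, fn, tp) for binary labels {0, 1}.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr       = tp / (tp + fn)          # Sensitivity
tnr       = tn / (tn + fp)          # Specificity
fp_rate   = fp / (fp + tn)          # = 1 - Specificity
precision = tp / (tp + fp)
g_mean    = np.sqrt(tpr * tnr)      # balance between the two classes
auc       = roc_auc_score(y_true, y_score)
```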

3.2. Feature Engineering
This section explains the criteria used to improve the data through cleaning and feature selection. Specifically, all features with more than 55% missing values were removed, as well as those with a high standard deviation. The remaining missing values were then replaced with the median of the corresponding feature, and the nominal features were converted to binary data (more details are reported in [18]).
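A minimal sketch of this cleaning procedure on a toy frame (the column names and values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame with the kinds of defects described above.
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],   # >55% missing
    "income":         [50.0, np.nan, 70.0, 60.0, 80.0],
    "grade":          ["A", "B", "A", "C", "B"],               # nominal
})

# 1) Drop features whose missing-value ratio exceeds 55%.
keep = df.columns[df.isna().mean() <= 0.55]
df = df[keep].copy()

# 2) Replace remaining missing values with the feature median.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 3) Convert nominal features to binary indicator columns.
df = pd.get_dummies(df, columns=["grade"])
```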

3.3. Experimental results
The classifiers used are Random Forest (RF), Logistic Regression (LR), and Multi-Layer Perceptron (MLP), evaluated under different sampling strategies: under-sampling (RUS, IHT), over-sampling (ROS, SMOTE, ADASYN), and hybrid methods (SMOTE-Tomek, SMOTE-ENN). In Table 1 we report the best combinations of classifiers and sampling strategies, comparing them also against the performance obtained by the classifiers without any sampling strategy; this last comparison highlights the effect of the sampling techniques on the prediction performance. The experiments show that RF-RUS is the best method for predicting a borrower's status in a social lending market.
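A simplified sketch of one cell of such a benchmark, pairing random under-sampling with LR and RF on synthetic data (illustrative only, not the paper's actual configuration or hyperparameters):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced synthetic data standing in for the Lending Club features.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# Random under-sampling of the majority class, on the training split only.
maj, mino = X_tr[y_tr == 0], X_tr[y_tr == 1]
maj_rus = resample(maj, replace=False, n_samples=len(mino), random_state=0)
X_bal = np.vstack([maj_rus, mino])
y_bal = np.array([0] * len(maj_rus) + [1] * len(mino))

results = {}
for name, clf in [("LR", LogisticRegression(max_iter=1000)),
                  ("RF", RandomForestClassifier(n_estimators=100,
                                                random_state=0))]:
    clf.fit(X_bal, y_bal)
    results[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```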

3.3.1. Comparison with state-of-the-art results
We compared our results against the best results of Namvar et al. (2018) [6] and Song et al. (2020) [17]. As Table 1 shows, our best combination (RF-RUS) has the lowest accuracy, while its AUC and Specificity are higher than the best of [6]. This is important in our context, because reducing false positives avoids the serious economic damage of misclassification, i.e., the loss of a user's loan. In addition, Table 2 shows that [17] achieves higher Specificity than our results, even though its Sensitivity is much lower than our model's.
Method                                            AUC     TPR     TNR     G-Mean  Accuracy
RF - RUS                                          0.717   0.630   0.680   0.6560  0.640
Song et al. [17]                                  0.6697  0.4607  0.7678  0.6009  0.7231
Namvar et al. [6]:
  Linear discriminant analysis - SMOTE            0.7000  0.630   0.650   0.643   0.6400
  LR - SMOTE-Tomek                                0.7000  0.638   0.648   0.643   0.6400
  Logistic regression                             0.7030  0.988   0.048   0.218   0.8173
  Random forest                                   0.6960  0.996   0.015   0.12    0.8176
Over-sampling:
  GBDT                                            0.6207  0.6168  0.6246  0.6207  0.6235
  Random forest                                   0.5795  0.3107  0.8423  0.5134  0.7701
  AdaBoost                                        0.5224  0.1925  0.8523  0.4050  0.7562
  Decision tree                                   0.5231  0.1934  0.8527  0.4060  0.7568
  Logistic regression                             0.5600  0.5558  0.5642  0.5597  0.5630
  Multilayer perceptron                           0.4892  0.1572  0.8211  0.3593  0.7245
Under-sampling:
  GBDT                                            0.6140  0.6292  0.5989  0.6138  0.6033
  Random forest                                   0.6207  0.6623  0.5791  0.6193  0.5912
  AdaBoost                                        0.5408  0.5577  0.5238  0.5404  0.5288
  Decision tree                                   0.5421  0.5558  0.5283  0.5418  0.5323
  Logistic regression                             0.5615  0.5437  0.5794  0.5609  0.5742
  Multilayer perceptron                           0.4892  0.1572  0.8211  0.3593  0.7245
Table 2: Comparison with state-of-the-art results.

3.3.2. Explanation results
In this last part of the evaluation, we compare the performance of several XAI tools: LIME, Anchors, SHAP, BEEF, and LORE. In particular, the metrics are based on the Precision measure, following the protocol described in Ribeiro et al. (2016) [21], evaluated on the three best classifiers: Random Forest with random under-sampling, Logistic Regression with random over-sampling, and Logistic Regression with SMOTE-Tomek. Several explanations were generated, using different sets of instances drawn with different random samplings (10 runs) from the dataset. Analyzing the results in Table 3, LORE is the best, because it combines local predictions with counterfactuals for explanation generation, while LIME achieves good results for all three classifiers, because the prediction is modeled as a weighted sum, which makes the generation of the prediction easy to interpret. SHAP, on the other hand, being based on feature importance, offers statistically more significant results than LIME; this stems from the use of Shapley values, whose computational complexity, even if mitigated by various heuristics, can affect the efficiency of the explanation. Finally, BEEF and Anchors can be limited in the expressiveness of the explanation, as observed for Logistic Regression, since they are based on axis-aligned hyper-rectangles and specific rules (called anchors), respectively.
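None of the five tools is reproduced here; as a generic, model-agnostic stand-in for this kind of analysis, permutation importance measures how much shuffling each feature degrades the model's score. A sketch on synthetic data, using scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time and measure the drop in score: features whose
# permutation hurts the model most are the ones the prediction relies on.
imp = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
```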

                 RF + Random Under-Sampling  LR + Random Over-Sampling  LR + SMOTE-Tomek
                 (Precision)                 (Precision)                (Precision)
Anchors          0.907                       0.547                      0.747
LIME             0.872                       0.918                      0.676
SHAP             0.891                       0.924                      0.752
BEEF             0.881                       0.741                      0.725
LORE             0.913                       0.878                      0.781
Table 3: Comparison between Anchors, LIME, SHAP, BEEF and LORE in terms of Precision.

4. Conclusion
Determining the risk prediction score is one of the biggest challenges in finance. The aim of the proposed approach is to support people in their investments, proposing a reference model based on Machine Learning approaches for the prediction of credit risk on social lending platforms, while managing the major criticalities of P2P platforms: the high dimensionality of the data to be analyzed and class imbalance. The evaluation on a real dataset demonstrates the effectiveness of the proposed approach, as well as its ability to provide an explanation for the prediction obtained, which is very significant in the financial field in order to motivate a positive or negative judgment on providing a loan. Future work may consider different P2P lending platforms and additional classification approaches, such as Deep Learning or ensemble learning techniques, in order to achieve better performance.

References
[1] K. Buehler, A. Freeman, R. Hulme, The new arsenal of risk management, Harvard Business Review 86 (2008) 93–100.
[2] A. B. Hens, M. K. Tiwari, Computational time reduction for credit scoring: An integrated approach based on support vector machine and stratified sampling method, Expert Systems with Applications 39 (2012) 6774–6781.
[3] T. Verbraken, C. Bravo, R. Weber, B. Baesens, Development and application of consumer credit scoring models using profit-based classification measures, European Journal of Operational Research 238 (2014) 505–513.
[4] A. Kim, S.-B. Cho, Dempster-Shafer fusion of semi-supervised learning methods for predicting defaults in social lending, in: International Conference on Neural Information Processing, Springer, 2017, pp. 854–862.
[5] Y. Guo, W. Zhou, C. Luo, C. Liu, H. Xiong, Instance-based credit risk assessment for investment decisions in P2P lending, European Journal of Operational Research 249 (2016) 417–426.
[6] A. Namvar, M. Siami, F. Rabhi, M. Naderpour, Credit risk prediction in an imbalanced social lending environment, arXiv preprint arXiv:1805.00801 (2018).
[7] D. D. Wu, S.-H. Chen, D. L. Olson, Business intelligence in risk management: Some recent progresses, Information Sciences 256 (2014) 1–7.
[8] Y. Hayashi, Application of a rule extraction algorithm family based on the Re-RX algorithm to financial credit risk assessment from a Pareto optimal perspective, Operations Research Perspectives 3 (2016) 32–42.
[9] M. Soui, I. Gasmi, S. Smiti, K. Ghédira, Rule-based credit risk assessment model using multi-objective evolutionary algorithms, Expert Systems with Applications 126 (2019) 144–157.
[10] R. Emekter, Y. Tu, B. Jirasakuldech, M. Lu, Evaluating credit risk and loan performance in online peer-to-peer (P2P) lending, Applied Economics 47 (2015) 54–70.
[11] M. Malekipirbazari, V. Aksakalli, Risk assessment in social lending via random forests, Expert Systems with Applications 42 (2015) 4621–4631.
[12] Z. Li, Y. Tian, K. Li, F. Zhou, W. Yang, Reject inference in credit scoring using semi-supervised support vector machines, Expert Systems with Applications 74 (2017) 105–114.
[13] J. Sun, J. Lang, H. Fujita, H. Li, Imbalanced enterprise credit evaluation with DTE-SBD: Decision tree ensemble based on SMOTE and bagging with differentiated sampling rates, Information Sciences 425 (2018) 76–91.
[14] X. Feng, Z. Xiao, B. Zhong, J. Qiu, Y. Dong, Dynamic ensemble classification for credit scoring using soft probability, Applied Soft Computing 65 (2018) 139–151.
[15] A. Kim, S.-B. Cho, An ensemble semi-supervised learning method for predicting defaults in social lending, Engineering Applications of Artificial Intelligence 81 (2019) 193–199.
[16] W. Li, S. Ding, H. Wang, Y. Chen, S. Yang, Heterogeneous ensemble learning with feature engineering for default prediction in peer-to-peer lending in China, World Wide Web 23 (2020) 23–45.
[17] Y. Song, Y. Wang, X. Ye, D. Wang, Y. Yin, Y. Wang, Multi-view ensemble learning based on distance-to-model and adaptive clustering for imbalanced credit risk assessment in P2P lending, Information Sciences 525 (2020) 182–204.
[18] V. Moscato, A. Picariello, G. Sperlí, A benchmark of machine learning approaches for credit score prediction, Expert Systems with Applications 165 (2021) 113986.
[19] V. García, A. Marqués, J. S. Sánchez, On the use of data filtering techniques for credit risk prediction with instance-based models, Expert Systems with Applications 39 (2012) 13267–13276.
[20] A. Namvar, M. Naderpour, Handling uncertainty in social lending credit risk prediction with a Choquet fuzzy integral model, in: 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, 2018, pp. 1–8.
[21] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[22] M. T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[23] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances in Neural Information Processing Systems 30 (2017).
[24] S. Grover, C. Pulice, G. I. Simari, V. Subrahmanian, BEEF: Balanced English explanations of forecasts, IEEE Transactions on Computational Social Systems 6 (2019) 350–364.
[25] R. Guidotti, A. Monreale, S. Ruggieri, D. Pedreschi, F. Turini, F. Giannotti, Local rule-based explanations of black box decision systems, arXiv preprint arXiv:1805.10820 (2018).
</pre>
Latest revision as of 17:53, 30 March 2023
Paper
Paper | |
---|---|
edit | |
description | |
id | Vol-3194/paper64 |
wikidataid | Q117344887→Q117344887 |
title | Credit Score Prediction Relying on Machine Learning |
pdfUrl | https://ceur-ws.org/Vol-3194/paper64.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/AmatoFG0MS22 |
volume | Vol-3194→Vol-3194 |
session | → |
Credit Score Prediction Relying on Machine Learning
Credit Score Prediction Relying on Machine Learning Flora Amato1 , Antonino Ferraro1 , Antonio Galli1 , Francesco Moscato3 , Vincenzo Moscato1,2 and Giancarlo Sperlí1,2 1 Department of Electrical Engineering and Information Technology (DIETI), Naples, Italy 2 CINI - ITEM National Lab, Complesso Universitario Monte S.Angelo, Naples, Italy 3 DIEM, University of Salerno, Fisciano, Italy Abstract Financial institutions use a variety of methodologies to define their commercial and strategic policies, and a significant role is played by credit risk assessment. In recent years, different credit risk assessment services arose, providing Social Lending platforms to connect lenders and borrowers in a direct way without assisting of financial institutions. Despite the pros of these platforms in supporting fundraising process, there are different stems from multiple factors including lack of experience of lenders, missing or uncertain information about the borrower’s credit history. In order to handle these problems, credit risk assessments of financial transactions are usually modeled as a binary problem based on debt repayment, going to apply Machine Learning (ML) techniques. The paper represents an extended abstract of a recent work, where some of the authors performed a benchmarking among the most used credit risk assessment ML models in the field of predicting whether a loan will be repaid in a P2P platform. The experimental analysis is based on a real dataset of Social Lending (Lending Club), going to evaluate several evaluation metrics including AUC, sensitivity, specificity and explainability of the models. Keywords Credit Score Prediction, Machine Learning, eXplainable Artificial Intelligence, 1. Introduction The recent development of digital financial services has led researchers to pay attention to the management of credit risk, proposing useful models to reduce such a risk but also to obtain profits from the investment. 
Banking risks can arise from different factors including: operational risks, market, credit, and the last one represents 60% of problems for banks [1]. The main cause of credit risk is the spread of Social Lending (SL) platforms, known as Peer-to- Peer (P2P) lending. These platforms allow lenders and borrowers to be interconnected without involving financial institutions; they support borrowers in the fundraising process and allow lending entities to participate. One challenge that needs to be addressed in this context is the credit risk analysis, due to possible non-repayment of loans by borrowers, where risk assessment is calculated through credit scoring. The credit risk assessment of financial transactions on SL platforms is performed through a binary classification problem, based on debt repayment [2, 3]. SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy $ flora.amato@unina.it (F. Amato); antonino.ferraro@unina.it (A. Ferraro); antonio.galli@unina.it (A. Galli); fmoscato@unisa.it (F. Moscato); vincenzo.moscato@unina.it (V. Moscato); giancarlo.sperli@unina.it (G. Sperlí) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) � Additionally, it is important to note that P2P platforms produce large amounts of unlabeled data so additional analysis is required to support real-time decisions [4]. An additional critical issue with these platforms is the risk of default, which is higher than standard methods, this is due to the fact that a lender may not always be able to effectively assess the risk level of borrowers [5], thus the main issue is due to a lack of credit history of borrowers. Predictive models of credit scoring can be classified into two broad categories [6]: statistical approaches and artificial intelligence methods. 
Regarding statistical approaches, they have been proposed, but suffer from coverage problems inherent in nonlinear effects among the variables involved. Credit risk assessment is characterized by the following properties: dependence, complexity, and interconnectedness [7], thus credit scoring estimation is very complex as it is dependent on different parameters. Several methodologies have been proposed that rely on rule generation to evaluate credit risks [8, 9], however, these approaches may be limited in the process of generating rules on large amounts of data. Another problem is the lack of lender experience or the uncertainty of borrower history information, these factors greatly increase credit risk. Some platforms incorporate borrower status prediction models, particularly using logistic regression [10] and Random Forest-based classification [11]. However, the development of credit risk prediction models is difficult due to different factors, including high data size and imbalance and high number of missing values. For these reasons, additional approaches have then been proposed, such as Support Vector Machine (SVM) based semi-supervised approach [12], while [13] has introduced an ensemble Decision Tree model for credit risk assessment on 138 Chinese companies with loss-making corporate earnings. Another ensemble method was developed by Feng et al. (2018)[14], in which classifiers are selected based on performance related to credit scoring. While [15] designed a hybrid model that relies on transductive support vector machine (TSVM) and Dempster-Shafer theory to predict social loan defaults. Finally, [16] has described a combination of different classifiers using linear weight ensemble to predict SL default, instead Song et al. (2020)[17] an ensemble of classifiers based on distance-model learning method and adaptive multi-view clustering (DM-ACME). 
In this paper, which is an extended abstract of our previous work [18], we propose a benchmark for credit risk scoring using the most advanced machine learning (ML) techniques in the literature, in order to understand whether a loan on a P2P platform will be repaid. Performance was evaluated using different scoring metrics, such as Sensitivity, AUC, and Specificity. In addition, eXplainable Artificial Intelligence (XAI) approaches were used to obtain a high degree of model explainability. The goal is to evaluate the classifiers both in terms of accuracy and in terms of providing results understandable by domain experts, ensuring transparency in decisions; this is particularly required for credit risk assessment.

2. Proposed benchmark architecture
The proposed benchmark architecture (Fig. 1) stems from the need to support the risk prediction problem, so that an investor can evaluate potential borrowers within social lending platforms. The main challenge to address is that credit risk assessment is a multidimensional and unbalanced problem, because it is based on huge amounts of historical data, including credit history (obtained by filling out a comprehensive application), bank account status, employment status, etc. In addition, using all these features increases coverage but decreases accuracy, so it is essential to apply a feature selection approach. The proposed architecture is based on three macro-modules: 1. Ingestion, 2. Classification, 3. Explanation. The ingestion phase crawls data from the social lending platforms, cleans and filters the obtained data, and performs feature selection based on the chosen classifier. In detail, the data are cleaned by removing features with many missing or null values and attributes with zero variance from the dataset.
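A minimal sketch of the ingestion-phase cleaning just described (the 55% missing-value threshold and median imputation are stated later, in Section 3.2). The DataFrame and its column names are hypothetical, chosen only to make the steps concrete:

```python
import numpy as np
import pandas as pd

# Hypothetical raw loan table; column names are illustrative.
df = pd.DataFrame({
    "loan_amount": [5000, 12000, 8000, 20000],
    "term": ["36 months", "60 months", "36 months", "36 months"],
    "issue_date": ["2016-01-01", "2016-06-15", "2017-03-10", "2017-11-30"],
    "annual_income": [45000.0, np.nan, 62000.0, 80000.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "constant_col": [1, 1, 1, 1],
})

# 1. Drop features where more than 55% of the values are missing.
df = df.loc[:, df.isna().mean() <= 0.55]

# 2. Drop zero-variance features (a single distinct value).
df = df.loc[:, df.nunique() > 1]

# 3. Convert categorical features to numeric codes.
df["term"] = df["term"].astype("category").cat.codes

# 4. Convert date attributes to numeric values (days since epoch).
df["issue_date"] = (pd.to_datetime(df["issue_date"])
                    - pd.Timestamp("1970-01-01")).dt.days

# 5. Replace remaining missing values with the feature median.
df["annual_income"] = df["annual_income"].fillna(df["annual_income"].median())

print(df.dtypes)
```

After these steps every column is numeric and complete, which is the precondition for the classification macro-block.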
After cleaning, several transformations are applied, such as converting categorical features and date attributes to numeric values. The second macro-block performs credit prediction; here we have to deal with an imbalance problem, because a user of P2P platforms usually has a high number of rejected loans compared to those requested. The classifiers chosen in our architecture are Logistic Regression, Random Forest, and Multi-Layer Perceptron, being the most suitable ones for credit prediction [19, 6] and the most used in this context [11, 20, 13]. To handle the imbalance problem, the following techniques are used: random undersampling, random oversampling, and SMOTE. Specifically, random oversampling simply replicates minority-class samples, whereas the Synthetic Minority Oversampling Technique (SMOTE) synthesizes new minority-class samples using k-nearest neighbors; random undersampling, in turn, randomly removes majority-class samples. Finally, the third macro-block is concerned with explaining the results of each prediction, i.e., the decisions made by the classifiers, to obtain information about the financial domain being analyzed. In particular, five XAI tools are used: LIME, Anchors, SHAP, BEEF, and LORE. LIME [21] is a post-hoc, model-agnostic method that provides a local explanation of the prediction; Anchors [22] is also a post-hoc, model-agnostic method providing a local explanation, but it uses rules that sufficiently "anchor" the predictor locally. SHapley Additive exPlanations (SHAP) [23] explains individual predictions based on the game-theoretically optimal Shapley values, analyzing how each feature influences the prediction. Balanced English Explanations of Forecasts (BEEF) [24] exploits global information, retrieved by a clustering algorithm on the entire dataset, to generate a local explanation. Finally, Local Rule-Based Explanations (LORE), proposed by Guidotti et al.
(2018) [25], first learns an interpretable local predictor and then derives the explanation as a decision rule.

3. Experimental evaluation
The purpose of our experimentation is to compare different classification models, evaluating them according to several metrics (for more details see Section 3.1). The dataset is provided by Lending Club (https://www.lendingclub.com/), a P2P lending platform. In particular, we focused on loans disbursed between 2016 and 2017; the dataset consists of 877,956 samples and 151 features, among which the most important are loan_amount and term.

Figure 1: Architecture

Following [11] and [6], we considered loan_status as the target class for our problem. Only the labels "Fully Paid" and "Charged Off" were considered, since we formulated the problem as binary classification (whether the loan will be repaid or not); this leads to unbalanced data, with 77% of the samples fully paid and the remaining 23% unpaid. A 10-fold cross-validation was performed, in which the dataset was divided into a training set and a test set with a 75/25 ratio. Finally, the results obtained were compared against those presented in Namvar et al. (2018) [6] and Song et al. (2020) [17]. The benchmark was run on Google Colab (https://colab.research.google.com/), with a single-core hyper-threaded Xeon processor @ 2.3 GHz, 12 GB RAM, and an NVIDIA Tesla K80 with 2,496 CUDA cores and 12 GB GDDR5 VRAM, using Python 3.6 with scikit-learn 0.23.1 (https://scikit-learn.org/).

3.1. Evaluation metrics
The following metrics were used to evaluate and compare the effectiveness of the considered models: Sensitivity (TPR), Specificity (TNR), G-mean, Precision, FP-Rate, and Area Under Curve (AUC). Accuracy (ACC) was not used as an evaluation metric because it does not consider that false positives are more important than false negatives, so it results in an inaccurate evaluation. Instead, TPR and TNR are suitable because they assess the accuracy of positive and negative samples, respectively.
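The metrics listed above can be computed directly from a confusion matrix and the predicted scores; a small sketch with scikit-learn follows. The labels and scores are made-up values, used only to exercise the formulas.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up ground truth and scores for a binary loan-repayment task
# (1 = fully paid, 0 = charged off).
y_true  = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.95, 0.2, 0.85, 0.45])
y_pred  = (y_score >= 0.5).astype(int)

# sklearn's confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)              # TPR
specificity = tn / (tn + fp)              # TNR
fp_rate     = fp / (fp + tn)
g_mean      = np.sqrt(sensitivity * specificity)
auc         = roc_auc_score(y_true, y_score)

print(f"TPR={sensitivity:.3f} TNR={specificity:.3f} "
      f"FP-Rate={fp_rate:.3f} G-mean={g_mean:.3f} AUC={auc:.3f}")
```

Note that G-mean collapses toward zero whenever either class is predicted poorly, which is why it is preferred over plain accuracy on unbalanced data.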
G-mean, meanwhile, is an appropriate metric for assessing the balance of classification performance between the majority and minority classes. Precision and FP-Rate, in turn, are useful for understanding how well the model predicts the positive and negative classes. Finally, AUC measures the area under the ROC curve and is therefore used to assess the trade-off between the rates of true positives and true negatives in the evaluated model.

Classifier             AUC    TPR    TNR    FP-Rate  G-Mean  ACC
RF - RUS               0.717  0.630  0.680  0.320    0.6560  0.640
LR - ROS               0.710  0.659  0.642  0.360    0.6503  0.650
LR - SmoteTomek        0.710  0.660  0.640  0.360    0.6500  0.656
Logistic Regression    0.685  0.983  0.069  0.960    0.2600  0.770
Random Forest          0.720  0.983  0.084  0.920    0.2870  0.773
MLP                    0.704  0.990  0.040  0.945    0.2060  0.771
Table 1: Our best classification results.

3.2. Feature Engineering
The aim of this section is to explain the criteria used to improve the data through cleaning and feature selection. Specifically, all features with more than 55% missing values were removed, as well as those with high standard deviation. Finally, the remaining missing values were replaced with the median of the corresponding feature, and the nominal features were converted to binary data (more details are reported in [18]).

3.3. Experimental results
The classifiers used are Random Forest (RF), Logistic Regression (LR), and Multi-Layer Perceptron (MLP); they have been evaluated with different sampling strategies: under-sampling (RUS, IHT), over-sampling (ROS, SMOTE, ADASYN), and hybrid methods (SMOTE-Tomek, SMOTE-ENN). In Table 1 we report the best combinations of classifiers and sampling strategies, comparing them also against the performance obtained by the classifiers without any sampling strategy; this last comparison highlights the effectiveness of the sampling techniques on prediction performance.
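As a rough sketch of one such combination, RF-RUS (random undersampling followed by a Random Forest), the snippet below uses synthetic data and a simplified hand-rolled undersampler in place of a library implementation such as imbalanced-learn's RandomUnderSampler; it is an illustration of the technique, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def random_undersample(X, y, seed=0):
    """Randomly drop majority-class samples until all classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

# Synthetic imbalanced data: roughly 90% class 1 (fully paid), 10% class 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.9).astype(int)

X_bal, y_bal = random_undersample(X, y)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
# After undersampling, both classes contribute equally to training.
print(np.bincount(y_bal))
```

Undersampling discards information from the majority class, which is why the benchmark also evaluates oversampling and hybrid strategies before concluding which combination works best.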
The experiments show that RF-RUS is the best method for predicting a borrower's status in a social lending market.

3.3.1. Comparison with state-of-the-art results
We compared our results against the best results of Namvar et al. (2018) [6] and Song et al. (2020) [17]. Our best combination, RF-RUS (see Table 1), has the lowest accuracy, while its AUC and Specificity are higher than the best results of [6]. This matters in our context because reducing false positives avoids the serious economic damage of misclassification, i.e., the loss of a user's loan. In addition, Table 2 shows that Song et al. [17] achieve higher Specificity than our model, even though their Sensitivity is much lower.

3.3.2. Explanation results
In this last part of the evaluation, we compare the performance of several XAI tools: LIME, Anchors, SHAP, BEEF, and LORE. In particular, the metrics are based on the Precision measure, following the protocol described in Ribeiro et al.
(2016) [21], evaluated on the three best classifier configurations: Random Forest with Random Under-Sampling, Logistic Regression with Random Over-Sampling, and Logistic Regression with SMOTE-Tomek. Several explanations were generated, using different sets of instances drawn with different random samplings (10 runs) from the dataset.

Method                                   AUC     TPR     TNR     G-Mean  Accuracy
RF - RUS (ours)                          0.717   0.630   0.680   0.6560  0.640
Song et al. [17]                         0.6697  0.4607  0.7678  0.6009  0.7231
Namvar et al. [6]:
  Linear discriminant analysis - SMOTE   0.7000  0.630   0.650   0.643   0.6400
  LR - SmoteTomek                        0.7000  0.638   0.648   0.643   0.6400
  Logistic regression                    0.7030  0.988   0.048   0.218   0.8173
  Random forest                          0.6960  0.996   0.015   0.12    0.8176
Over-sampling:
  GBDT                                   0.6207  0.6168  0.6246  0.6207  0.6235
  Random forest                          0.5795  0.3107  0.8423  0.5134  0.7701
  AdaBoost                               0.5224  0.1925  0.8523  0.4050  0.7562
  Decision tree                          0.5231  0.1934  0.8527  0.4060  0.7568
  Logistic regression                    0.5600  0.5558  0.5642  0.5597  0.5630
  Multilayer perceptron                  0.4892  0.1572  0.8211  0.3593  0.7245
Under-sampling:
  GBDT                                   0.6140  0.6292  0.5989  0.6138  0.6033
  Random forest                          0.6207  0.6623  0.5791  0.6193  0.5912
  AdaBoost                               0.5408  0.5577  0.5238  0.5404  0.5288
  Decision tree                          0.5421  0.5558  0.5283  0.5418  0.5323
  Logistic regression                    0.5615  0.5437  0.5794  0.5609  0.5742
  Multilayer perceptron                  0.4892  0.1572  0.8211  0.3593  0.7245
Table 2: Comparison with state-of-the-art results.

Analyzing the results in Table 3, LORE performs best because it combines local predictions with the use of counterfactuals for explanation generation, while LIME achieves good results for all three classifiers: since the prediction is modeled as a weighted sum, its generation is easy to interpret. SHAP, on the other hand, being based on feature importance, offers statistically more significant results than LIME; this comes from the use of Shapley values, whose computational complexity, even if mitigated by different heuristics, can affect the efficiency of the explanation. Finally, BEEF and Anchors can be limited in the expressiveness of the explanation, as observed for Logistic Regression, since they are based on axis-aligned hyper-rectangles and specific rules (called anchors), respectively.

          Random Forest           Logistic Regression    Logistic Regression
          Random Under-Sampling   Random Over-Sampling   SMOTE-Tomek
          (Precision Value)       (Precision Value)      (Precision Value)
Anchors   0.907                   0.547                  0.747
LIME      0.872                   0.918                  0.676
SHAP      0.891                   0.924                  0.752
BEEF      0.881                   0.741                  0.725
LORE      0.913                   0.878                  0.781
Table 3: Comparison between Anchors, LIME, SHAP, BEEF, and LORE in terms of the Precision measure.

4. Conclusion
Determining the risk prediction score is one of the biggest challenges in finance. The aim of the proposed approach is to support people in their investments, proposing a reference model based on Machine Learning approaches for the prediction of credit risk in social lending platforms, managing the major criticalities of P2P platforms: the high dimensionality of the data to be analyzed and class imbalance.
The evaluation on a real dataset demonstrates the quality of the proposed approach, as well as its ability to provide an explanation for each prediction, which is very significant in the financial field, where a positive or negative judgment on granting a loan must be motivated. Future work may consider different P2P lending platforms and additional classification approaches, such as Deep Learning or ensemble learning techniques, in order to achieve better performance.

References
[1] K. Buehler, A. Freeman, R. Hulme, The new arsenal of risk management, Harvard Business Review 86 (2008) 93–100.
[2] A. B. Hens, M. K. Tiwari, Computational time reduction for credit scoring: An integrated approach based on support vector machine and stratified sampling method, Expert Systems with Applications 39 (2012) 6774–6781.
[3] T. Verbraken, C. Bravo, R. Weber, B. Baesens, Development and application of consumer credit scoring models using profit-based classification measures, European Journal of Operational Research 238 (2014) 505–513.
[4] A. Kim, S.-B. Cho, Dempster-Shafer fusion of semi-supervised learning methods for predicting defaults in social lending, in: International Conference on Neural Information Processing, Springer, 2017, pp. 854–862.
[5] Y. Guo, W. Zhou, C. Luo, C. Liu, H. Xiong, Instance-based credit risk assessment for investment decisions in P2P lending, European Journal of Operational Research 249 (2016) 417–426.
[6] A. Namvar, M. Siami, F. Rabhi, M. Naderpour, Credit risk prediction in an imbalanced social lending environment, arXiv preprint arXiv:1805.00801 (2018).
[7] D. D. Wu, S.-H. Chen, D. L. Olson, Business intelligence in risk management: Some recent progresses, Information Sciences 256 (2014) 1–7.
[8] Y.
Hayashi, Application of a rule extraction algorithm family based on the Re-RX algorithm to financial credit risk assessment from a Pareto optimal perspective, Operations Research Perspectives 3 (2016) 32–42.
[9] M. Soui, I. Gasmi, S. Smiti, K. Ghédira, Rule-based credit risk assessment model using multi-objective evolutionary algorithms, Expert Systems with Applications 126 (2019) 144–157.
[10] R. Emekter, Y. Tu, B. Jirasakuldech, M. Lu, Evaluating credit risk and loan performance in online peer-to-peer (P2P) lending, Applied Economics 47 (2015) 54–70.
[11] M. Malekipirbazari, V. Aksakalli, Risk assessment in social lending via random forests, Expert Systems with Applications 42 (2015) 4621–4631.
[12] Z. Li, Y. Tian, K. Li, F. Zhou, W. Yang, Reject inference in credit scoring using semi-supervised support vector machines, Expert Systems with Applications 74 (2017) 105–114.
[13] J. Sun, J. Lang, H. Fujita, H. Li, Imbalanced enterprise credit evaluation with DTE-SBD: Decision tree ensemble based on SMOTE and bagging with differentiated sampling rates, Information Sciences 425 (2018) 76–91.
[14] X. Feng, Z. Xiao, B. Zhong, J. Qiu, Y. Dong, Dynamic ensemble classification for credit scoring using soft probability, Applied Soft Computing 65 (2018) 139–151.
[15] A. Kim, S.-B. Cho, An ensemble semi-supervised learning method for predicting defaults in social lending, Engineering Applications of Artificial Intelligence 81 (2019) 193–199.
[16] W. Li, S. Ding, H. Wang, Y. Chen, S. Yang, Heterogeneous ensemble learning with feature engineering for default prediction in peer-to-peer lending in China, World Wide Web 23 (2020) 23–45.
[17] Y. Song, Y. Wang, X. Ye, D. Wang, Y. Yin, Y. Wang, Multi-view ensemble learning based on distance-to-model and adaptive clustering for imbalanced credit risk assessment in P2P lending, Information Sciences 525 (2020) 182–204.
[18] V. Moscato, A. Picariello, G.
Sperlí, A benchmark of machine learning approaches for credit score prediction, Expert Systems with Applications 165 (2021) 113986.
[19] V. García, A. Marqués, J. S. Sánchez, On the use of data filtering techniques for credit risk prediction with instance-based models, Expert Systems with Applications 39 (2012) 13267–13276.
[20] A. Namvar, M. Naderpour, Handling uncertainty in social lending credit risk prediction with a Choquet fuzzy integral model, in: 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, 2018, pp. 1–8.
[21] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[22] M. T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[23] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances in Neural Information Processing Systems 30 (2017).
[24] S. Grover, C. Pulice, G. I. Simari, V. Subrahmanian, BEEF: Balanced English explanations of forecasts, IEEE Transactions on Computational Social Systems 6 (2019) 350–364.
[25] R. Guidotti, A. Monreale, S. Ruggieri, D. Pedreschi, F. Turini, F. Giannotti, Local rule-based explanations of black box decision systems, arXiv preprint arXiv:1805.10820 (2018).