Vol-3194/paper64


Paper

description  
id  Vol-3194/paper64
wikidataid  Q117344887
title  Credit Score Prediction Relying on Machine Learning
pdfUrl  https://ceur-ws.org/Vol-3194/paper64.pdf
dblpUrl  https://dblp.org/rec/conf/sebd/AmatoFG0MS22
volume  Vol-3194
session  →

Credit Score Prediction Relying on Machine Learning
Flora Amato1, Antonino Ferraro1, Antonio Galli1, Francesco Moscato3,
Vincenzo Moscato1,2 and Giancarlo Sperlí1,2
1 Department of Electrical Engineering and Information Technology (DIETI), Naples, Italy
2 CINI - ITEM National Lab, Complesso Universitario Monte S.Angelo, Naples, Italy
3 DIEM, University of Salerno, Fisciano, Italy


Abstract
Financial institutions use a variety of methodologies to define their commercial and strategic policies,
and a significant role is played by credit risk assessment. In recent years, different credit risk assessment
services have arisen, providing Social Lending platforms that connect lenders and borrowers directly,
without the assistance of financial institutions. Despite the advantages of these platforms in supporting the
fundraising process, they carry risks that stem from multiple factors, including lenders' lack of experience
and missing or uncertain information about the borrower's credit history. In order to handle these problems,
credit risk assessments of financial transactions are usually modeled as a binary problem based on debt
repayment, to which Machine Learning (ML) techniques are applied. The paper represents an extended abstract
of a recent work, in which some of the authors benchmarked the most used credit risk assessment ML models
for predicting whether a loan will be repaid on a P2P platform. The experimental analysis is based on a real
Social Lending dataset (Lending Club) and considers several evaluation metrics, including AUC, sensitivity
and specificity, as well as the explainability of the models.

Keywords
Credit Score Prediction, Machine Learning, eXplainable Artificial Intelligence




1. Introduction
The recent development of digital financial services has led researchers to pay attention to the
management of credit risk, proposing useful models to reduce such a risk but also to obtain
profits from the investment. Banking risks can arise from different factors, including operational,
market, and credit risks; the last of these accounts for 60% of problems for banks [1].
   The main cause of credit risk is the spread of Social Lending (SL) platforms, known as Peer-to-
Peer (P2P) lending. These platforms allow lenders and borrowers to be interconnected without
involving financial institutions; they support borrowers in the fundraising process and allow
lending entities to participate. One challenge that needs to be addressed in this context is the
credit risk analysis, due to possible non-repayment of loans by borrowers, where risk assessment
is calculated through credit scoring.
   The credit risk assessment of financial transactions on SL platforms is performed through a
binary classification problem, based on debt repayment [2, 3].


SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
flora.amato@unina.it (F. Amato); antonino.ferraro@unina.it (A. Ferraro); antonio.galli@unina.it (A. Galli);
fmoscato@unisa.it (F. Moscato); vincenzo.moscato@unina.it (V. Moscato); giancarlo.sperli@unina.it (G. Sperlí)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073

   Additionally, it is important to note that P2P platforms produce large amounts of unlabeled
data, so additional analysis is required to support real-time decisions [4]. A further critical
issue with these platforms is the risk of default, which is higher than with standard methods.
This is because a lender may not always be able to effectively assess the risk level of
borrowers [5], mainly owing to the borrowers' lack of credit history.
   Predictive models of credit scoring can be classified into two broad categories [6]: statistical
approaches and artificial intelligence methods. Statistical approaches have been proposed, but
they suffer from coverage problems caused by nonlinear effects among the variables
involved. Credit risk assessment is characterized by the following properties: dependence,
complexity, and interconnectedness [7]. Credit scoring estimation is therefore very complex, as it
depends on different parameters. Several methodologies have been proposed that rely on
rule generation to evaluate credit risks [8, 9]; however, these approaches may be limited when
generating rules on large amounts of data. Another problem is the lack of lender
experience or the uncertainty of borrower history information; these factors greatly increase
credit risk. Some platforms incorporate borrower status prediction models, particularly using
logistic regression [10] and Random Forest-based classification [11].
   However, the development of credit risk prediction models is difficult due to different factors,
including the large size and imbalance of the data and the high number of missing values. For these
reasons, additional approaches have been proposed, such as a semi-supervised approach based on
Support Vector Machines (SVM) [12], while [13] introduced an ensemble Decision Tree model for
credit risk assessment on 138 Chinese companies with loss-making corporate earnings. Another
ensemble method was developed by Feng et al. (2018) [14], in which classifiers are selected based
on performance related to credit scoring, while [15] designed a hybrid model that relies on a
transductive support vector machine (TSVM) and Dempster-Shafer theory to predict social loan
defaults. Finally, [16] described a combination of different classifiers using a linear weight
ensemble to predict SL default, whereas Song et al. (2020) [17] proposed an ensemble of classifiers
based on a distance-to-model learning method and adaptive multi-view clustering (DM-ACME).
   In this paper, which represents an extended abstract of our previous work [18], we propose a
benchmark for credit risk scoring using the most advanced machine learning (ML) techniques
used in the literature, to understand whether a loan will be repaid on a P2P platform. The
performance was evaluated using different scoring metrics, such as Sensitivity, AUC and Specificity.
In addition, eXplainable Artificial Intelligence (XAI) approaches were used to obtain a high
degree of explainability of the models. The goal is both to evaluate the accuracy
of the classifiers and to provide results understandable by domain experts,
ensuring transparency in decisions, which is particularly required for credit risk assessment.


2. Proposed benchmark architecture
The proposed benchmark architecture in Fig. 1 stems from the need to support the risk
prediction problem, so that an investor can evaluate potential borrowers within
social lending platforms. The main challenge to address is that credit risk assessment is a
multidimensional and unbalanced problem, because it is based on huge amounts of historical data,
including credit history (obtained by filling out a comprehensive application), bank account
status, employment status, etc.
   In addition, using all these features increases coverage but decreases accuracy, thus it is
essential to apply a feature selection approach. In particular, the proposed architecture is
based on three macro-modules:
   1. Ingestion,
   2. Classification,
   3. Explanation.
The ingestion phase crawls the data from the social lending platforms, cleans
and filters the obtained data, and performs feature selection based on the chosen classifier.
In detail, the data are cleaned by removing features with many missing or null values and
attributes with zero variance from the dataset. After cleaning, several transformations are
applied, such as converting categorical features to numeric ones and changing date attributes to
numeric values.
   The second macro-block performs credit prediction. Here we have to deal with
an imbalance problem, because a user of P2P platforms usually has a high number of rejected
loans compared to those requested. The classifiers chosen in our architecture are Logistic
Regression, Random Forest and Multi-Layer Perceptron, being the most suitable ones for credit
prediction [19, 6] and the most used in this context [11, 20, 13]. To handle the imbalance problem,
the following techniques are used: random subsampling, random oversampling, and smoothing.
Specifically, oversampling merely creates new minority class samples, and the Synthetic Minority
Oversampling Technique (SMOTE) is based on oversampling using k-nearest neighbors, while
subsampling randomly eliminates majority class samples.
   Finally, the third macro-block is concerned with explaining the results of each prediction, i.e.,
the decisions made by the classifiers, to obtain information about the financial domain being
analyzed. In particular, five XAI tools are used: LIME, Anchors, SHAP, BEEF and LORE. LIME [21]
is a post-hoc, model-agnostic method that provides a local explanation of the prediction. Anchors
[22] is of the same type, a post-hoc, model-agnostic method that provides a local explanation, but
it uses rules that sufficiently "anchor" the prediction locally. SHapley Additive exPlanations (SHAP)
[23] is a method for explaining individual predictions based on the game-theoretically optimal
Shapley values, in order to analyze how each feature influences the prediction. Balanced English
Explanations of Forecasts (BEEF) [24] exploits global information, retrieved by a clustering
algorithm on the entire dataset, in order to generate a local explanation. Finally, Local Rule-
Based Explanations (LORE), proposed by Guidotti et al. (2018) [25], first learns an
interpretable local predictor and then derives the explanation as a decision rule.
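
To make the idea behind Shapley values more concrete, the following is a minimal sketch that computes them exactly for a tiny model by enumerating all feature coalitions. It is not the SHAP library and not the implementation used in the benchmark: the function names, the toy linear score and the all-zero baseline are assumptions introduced only for illustration, and exact enumeration is feasible only for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of one prediction.

    Features outside a coalition take the baseline value (an assumption
    made here; other ways of "removing" a feature are possible).
    """
    n = len(instance)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


def toy_score(x):
    """Hypothetical linear scoring model on three made-up features."""
    return 0.5 * x[0] - 0.2 * x[1] + 0.1 * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_score, instance, baseline)
print(phi)
```

For a linear model such as this toy score, each value reduces to the feature's weight times its distance from the baseline, and the values sum to the difference between the prediction for the instance and the prediction for the baseline.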


3. Experimental evaluation
The purpose of our experimentation is to compare different classification models, evaluating
them according to several metrics (for more details see Section 3.1). The dataset is provided by
Lending Club1, a P2P lending platform. In particular, we focused on loans disbursed between
2016 and 2017: the dataset consists of 877,956 samples and 151 features, of which the most
important are loan_amount and term.
   1 https://www.lendingclub.com/

Figure 1: Architecture


   According to [11] and [6], we considered loan_status as the target class for our problem.
   Only the labels "FullyPaid" and "Charged off" were considered, since we treated the problem
as binary, i.e., whether the loan will be repaid or not. This leads to unbalanced data:
77% of the samples are fully paid and the remaining 23% are unpaid. A 10-fold cross-validation was
performed in which the dataset was divided into a training set and a test set with a ratio of
75/25.
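
The paper does not detail how the 75/25 split was implemented. One possible reading, sketched below in plain Python, is a stratified split that preserves the class proportions in both partitions and can be repeated with different seeds. The function name, the seed handling and the synthetic labels are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.25, seed=0):
    """Return (train_idx, test_idx) keeping the class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels with a 770/230 class imbalance (illustrative only).
labels = ["FullyPaid"] * 770 + ["Charged off"] * 230
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Calling the function with ten different seeds would give ten independent 75/25 partitions, which is one way to obtain repeated estimates of each metric.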
   Finally, the results obtained were compared against those presented in Namvar et al. (2018)[6]
and Song et al. (2020) [17]. The benchmark was run on Google Colab2, with a single-core
hyper-threaded Xeon processor @ 2.3 GHz, 12 GB RAM, an NVIDIA Tesla K80 with 2496 CUDA cores and
12 GB GDDR5 VRAM, using Python 3.6 with scikit-learn 0.23.1.33

3.1. Evaluation metrics
The following metrics were used to evaluate and compare the effectiveness of the considered
models: Sensitivity (TPR), Specificity (TNR), G-mean, Precision, FP-Rate, Area Under Curve (AUC).
Accuracy (ACC) was not used as an evaluation metric because it does not consider that false
positives are more important than false negatives, and thus results in an inaccurate evaluation.
Instead, TPR and TNR are suitable because they assess the accuracy on positive and negative
samples, respectively, while G-mean is an appropriate metric for assessing the balance of
classification performance between the majority and minority classes.
   In turn, Precision and FP-Rate are useful for understanding how well the model predicts
positive and negative classes. Finally, AUC is the area under the ROC curve, and thus
it is used to assess the trade-off between the true positive rate and the false positive rate
of the evaluated model.

   2 https://colab.research.google.com/
   3 https://scikit-learn.org/

Classifier             AUC     TPR     TNR     FP-Rate   G-Mean    ACC
RF - RUS               0.717   0.630   0.680   0.320     0.6560    0.640
LR - ROS               0.710   0.659   0.642   0.360     0.6503    0.650
LR - SmoteToken        0.710   0.660   0.640   0.360     0.6500    0.656
Logistic Regression    0.685   0.983   0.069   0.960     0.2600    0.770
Random Forest          0.720   0.983   0.084   0.920     0.2870    0.773
MLP                    0.704   0.990   0.040   0.945     0.2060    0.771
Table 1
Our best classification results.
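
As a minimal sketch of how the metrics of this section relate to a confusion matrix, the following plain-Python helpers compute them from raw counts and compute AUC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The function names, the counts and the scores are hypothetical and are not taken from the benchmark.

```python
from math import sqrt


def confusion_metrics(tp, fn, tn, fp):
    """Threshold-based metrics from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)  # Sensitivity
    tnr = tn / (tn + fp)  # Specificity
    return {
        "TPR": tpr,
        "TNR": tnr,
        "FP-Rate": fp / (fp + tn),  # equals 1 - TNR
        "Precision": tp / (tp + fp),
        "G-mean": sqrt(tpr * tnr),
        "ACC": (tp + tn) / (tp + fn + tn + fp),
    }


def auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts for 100 positive and 100 negative samples.
metrics = confusion_metrics(tp=63, fn=37, tn=68, fp=32)
# Hypothetical classifier scores with their true labels.
area = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(metrics, area)
```

The pairwise AUC is quadratic in the number of samples, so it is meant only to illustrate the definition; a rank-based computation would be preferable on a dataset of realistic size.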

3.2. Feature Engineering
The aim of this section is to explain how the data were improved through cleaning
and feature selection. Specifically, all features with more than 55% missing values, as well as
those with a high standard deviation, were removed. The remaining missing values were then replaced
with the median of the corresponding feature, and the nominal features were converted to binary
data (more details are reported in [18]).
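
The cleaning steps can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' pipeline: it combines the 55% missing-value threshold and median imputation from this section with the zero-variance filter mentioned in Section 2, it omits the standard-deviation filter, and the column names and values are made up.

```python
from statistics import median


def clean_columns(columns, max_missing=0.55):
    """Drop sparse or constant numeric columns, then impute missing values with the median.

    `columns` maps a feature name to a non-empty list of values, with None for missing.
    """
    cleaned = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_ratio = 1 - len(present) / len(values)
        # Skip features that are mostly missing or carry no variance.
        if missing_ratio > max_missing or len(set(present)) <= 1:
            continue
        fill = median(present)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned


def one_hot(values):
    """Convert a nominal feature into one binary column per category."""
    categories = sorted(set(values))
    return {c: [int(v == c) for v in values] for c in categories}


# Hypothetical toy data; only the feature names echo the dataset description.
raw = {
    "loan_amount": [1000, None, 3000, 5000],
    "mostly_missing": [None, None, None, 1],
    "constant": [1, 1, 1, 1],
}
cleaned = clean_columns(raw)
term = one_hot(["36 months", "60 months", "36 months"])
print(cleaned, term)
```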

3.3. Experimental results
The classifiers used are Random Forest (RF), Logistic Regression (LR) and Multi-Layer Perceptron
(MLP); they have been evaluated according to different sampling strategies: Under-sampling (RUS, IHT),
Over-sampling (ROS, SMOTE, ADASYN), Hybrid-Method (SMOTE-TOKEN, SMOTE-EN). In
Table 1 we report the best combinations of classifiers and sampling strategies,
comparing them also against the performance obtained by the classifiers without any strategy.
This last comparison highlights the effectiveness of the sampling techniques on the prediction
performance. The experiments show that RF-RUS turns out to be the best method for predicting
a borrower’s status in a social lending market.
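
The two simplest sampling strategies, random under-sampling (RUS) and random over-sampling (ROS), can be sketched in plain Python as below. This is a hedged illustration of the general techniques rather than the implementation used in the experiments; the function names and the toy data are assumptions.

```python
import random
from collections import Counter


def random_under_sample(rows, labels, seed=0):
    """Randomly drop samples until every class has the size of the smallest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    kept = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        kept.extend(rng.sample(idx, target))
    kept.sort()
    return [rows[i] for i in kept], [labels[i] for i in kept]


def random_over_sample(rows, labels, seed=0):
    """Randomly duplicate samples until every class has the size of the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    chosen = list(range(len(labels)))
    for cls, count in counts.items():
        idx = [i for i, y in enumerate(labels) if y == cls]
        chosen.extend(rng.choices(idx, k=target - count))
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


# Toy data: 8 majority samples (label 1) and 2 minority samples (label 0).
rows = list(range(10))
labels = [1] * 8 + [0] * 2
rus_rows, rus_labels = random_under_sample(rows, labels)
ros_rows, ros_labels = random_over_sample(rows, labels)
print(Counter(rus_labels), Counter(ros_labels))
```

Resampling would normally be applied to the training partition only, so that the test set keeps the original class distribution.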

3.3.1. Comparison with state-of-art results
We compared our results against the best results of Namvar et al. (2018) [6] and Song et
al. (2020) [17]. It can be seen that our best combination (RF-RUS, see Table 1) has
lower accuracy, while its AUC and Specificity are higher than the best of [6]. This is
important in our context because reducing false positives avoids the serious economic damage
of misclassification, i.e., the loss of a user’s loan. In addition, Table 2 shows that the method
of Song et al. [17] reaches a higher Specificity than our model, even though its sensitivity is much lower.

Group              Method                                     AUC      TPR      TNR      G-Mean   Accuracy
-                  RF - RUS                                   0.717    0.630    0.680    0.6560   0.640
-                  Song et al. [17]                           0.6697   0.4607   0.7678   0.6009   0.7231
Namvar et al. [6]  Linear discrimination analysis - SMOTE     0.7000   0.630    0.650    0.643    0.6400
Namvar et al. [6]  LR - SmoteToken                            0.7000   0.638    0.648    0.643    0.6400
Namvar et al. [6]  Logistic regression                        0.7030   0.988    0.048    0.218    0.8173
Namvar et al. [6]  Random forest                              0.6960   0.996    0.015    0.12     0.8176
Over-sampling      GBDT                                       0.6207   0.6168   0.6246   0.6207   0.6235
Over-sampling      Random forest                              0.5795   0.3107   0.8423   0.5134   0.7701
Over-sampling      AdaBoost                                   0.5224   0.1925   0.8523   0.4050   0.7562
Over-sampling      Decision tree                              0.5231   0.1934   0.8527   0.4060   0.7568
Over-sampling      Logistic regression                        0.5600   0.5558   0.5642   0.5597   0.5630
Over-sampling      Multilayer perceptron                      0.4892   0.1572   0.8211   0.3593   0.7245
Under-sampling     GBDT                                       0.6140   0.6292   0.5989   0.6138   0.6033
Under-sampling     Random forest                              0.6207   0.6623   0.5791   0.6193   0.5912
Under-sampling     AdaBoost                                   0.5408   0.5577   0.5238   0.5404   0.5288
Under-sampling     Decision tree                              0.5421   0.5558   0.5283   0.5418   0.5323
Under-sampling     Logistic regression                        0.5615   0.5437   0.5794   0.5609   0.5742
Under-sampling     Multilayer perceptron                      0.4892   0.1572   0.8211   0.3593   0.7245
Table 2
Results.


3.3.2. Explanation results
In this last part of the evaluation, we compare the performance of several XAI tools: LIME,
Anchors, SHAP, BEEF, and LORE. In particular, the metrics are based on the Accuracy measure,
according to the protocol described in Ribeiro et al. (2016) [21], evaluated on the three best
classifiers: Random Forest & Random Subsampling, Logistic Regression & Random Oversampling,
and Logistic Regression & Smote-Token. Several explanations were generated, using
different sets of instances obtained by random sampling (10 runs) from the dataset.
Analyzing the results in Table 3, LORE is the best because it combines local predictions with the
use of counterfactuals for explanation generation. LIME achieves good results for all three
classifiers, because the prediction is modeled as a weighted sum, which makes it easy
to interpret how the prediction is generated. SHAP, on the other hand, is based on the importance of
features and offers statistically more significant results than LIME. This is due to the use of Shapley
values, whose computational complexity, even if dampened by different heuristics, can affect
the efficiency of the explanation. Finally, BEEF and Anchors can be limited in
the expressiveness of the explanation, as noted for Logistic Regression, since they are based on
axis-aligned hyper-rectangles and specific rules (called anchors), respectively.


            Random Forest            Logistic Regression      Logistic Regression
            Random Under-Sampling    Random Over-Sampling     Smote-Token
            (Precision Value)        (Precision Value)        (Precision Value)
Anchors     0.907                    0.547                    0.747
LIME        0.872                    0.918                    0.676
SHAP        0.891                    0.924                    0.752
BEEF        0.881                    0.741                    0.725
LORE        0.913                    0.878                    0.781
Table 3
Comparison between Anchors, LIME, SHAP, BEEF and LORE in terms of Precision measure.


4. Conclusion
Determining the risk prediction score is one of the biggest challenges in finance. The aim of the
proposed approach is to support people in their investments by proposing a reference model based
on Machine Learning approaches for the prediction of credit risk in social lending platforms,
addressing the major critical issues of P2P platforms: the high dimensionality of the data
to be analyzed and data imbalance. The evaluation on a real dataset demonstrates the
soundness of the proposed approach, as well as its ability to provide an explanation
for each prediction, which is very important in the financial field in order to justify
a positive or negative decision on granting a loan. Future work may
consider different P2P lending platforms and additional classification approaches, such as
Deep Learning or ensemble learning techniques, in order to achieve better performance.


References
 [1] K. Buehler, A. Freeman, R. Hulme, The new arsenal of risk management, Harvard Business
     Review 86 (2008) 93–100.
 [2] A. B. Hens, M. K. Tiwari, Computational time reduction for credit scoring: An integrated
     approach based on support vector machine and stratified sampling method, Expert Systems
     with Applications 39 (2012) 6774–6781.
 [3] T. Verbraken, C. Bravo, R. Weber, B. Baesens, Development and application of consumer
     credit scoring models using profit-based classification measures, European Journal of
     Operational Research 238 (2014) 505–513.
 [4] A. Kim, S.-B. Cho, Dempster-Shafer fusion of semi-supervised learning methods for
     predicting defaults in social lending, in: International Conference on Neural Information
     Processing, Springer, 2017, pp. 854–862.
 [5] Y. Guo, W. Zhou, C. Luo, C. Liu, H. Xiong, Instance-based credit risk assessment for
     investment decisions in P2P lending, European Journal of Operational Research 249 (2016)
     417–426.
 [6] A. Namvar, M. Siami, F. Rabhi, M. Naderpour, Credit risk prediction in an imbalanced
     social lending environment, arXiv preprint arXiv:1805.00801 (2018).
 [7] D. D. Wu, S.-H. Chen, D. L. Olson, Business intelligence in risk management: Some recent
     progresses, Information Sciences 256 (2014) 1–7.
 [8] Y. Hayashi, Application of a rule extraction algorithm family based on the Re-RX algorithm
     to financial credit risk assessment from a Pareto optimal perspective, Operations Research
     Perspectives 3 (2016) 32–42.
 [9] M. Soui, I. Gasmi, S. Smiti, K. Ghédira, Rule-based credit risk assessment model using
     multi-objective evolutionary algorithms, Expert Systems with Applications 126 (2019)
     144–157.
[10] R. Emekter, Y. Tu, B. Jirasakuldech, M. Lu, Evaluating credit risk and loan performance in
     online peer-to-peer (P2P) lending, Applied Economics 47 (2015) 54–70.
[11] M. Malekipirbazari, V. Aksakalli, Risk assessment in social lending via random forests,
     Expert Systems with Applications 42 (2015) 4621–4631.
[12] Z. Li, Y. Tian, K. Li, F. Zhou, W. Yang, Reject inference in credit scoring using semi-
     supervised support vector machines, Expert Systems with Applications 74 (2017) 105–114.
[13] J. Sun, J. Lang, H. Fujita, H. Li, Imbalanced enterprise credit evaluation with DTE-SBD:
     Decision tree ensemble based on SMOTE and bagging with differentiated sampling rates,
     Information Sciences 425 (2018) 76–91.
[14] X. Feng, Z. Xiao, B. Zhong, J. Qiu, Y. Dong, Dynamic ensemble classification for credit
     scoring using soft probability, Applied Soft Computing 65 (2018) 139–151.
[15] A. Kim, S.-B. Cho, An ensemble semi-supervised learning method for predicting defaults
     in social lending, Engineering Applications of Artificial Intelligence 81 (2019) 193–199.
[16] W. Li, S. Ding, H. Wang, Y. Chen, S. Yang, Heterogeneous ensemble learning with feature
     engineering for default prediction in peer-to-peer lending in China, World Wide Web 23
     (2020) 23–45.
[17] Y. Song, Y. Wang, X. Ye, D. Wang, Y. Yin, Y. Wang, Multi-view ensemble learning based on
     distance-to-model and adaptive clustering for imbalanced credit risk assessment in P2P
     lending, Information Sciences 525 (2020) 182–204.
[18] V. Moscato, A. Picariello, G. Sperlí, A benchmark of machine learning approaches for
     credit score prediction, Expert Systems with Applications 165 (2021) 113986.
[19] V. García, A. Marqués, J. S. Sánchez, On the use of data filtering techniques for credit
     risk prediction with instance-based models, Expert Systems with Applications 39 (2012)
     13267–13276.
[20] A. Namvar, M. Naderpour, Handling uncertainty in social lending credit risk prediction
     with a Choquet fuzzy integral model, in: 2018 IEEE International Conference on Fuzzy
     Systems (FUZZ-IEEE), IEEE, 2018, pp. 1–8.
[21] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions
     of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on
     Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[22] M. T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations,
     in: Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
[23] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances
     in neural information processing systems 30 (2017).
[24] S. Grover, C. Pulice, G. I. Simari, V. Subrahmanian, BEEF: Balanced English explanations of
     forecasts, IEEE Transactions on Computational Social Systems 6 (2019) 350–364.
[25] R. Guidotti, A. Monreale, S. Ruggieri, D. Pedreschi, F. Turini, F. Giannotti, Local rule-based
     explanations of black box decision systems, arXiv preprint arXiv:1805.10820 (2018).