⊤ transposition
french.dat file name
grdotd quantlet name
rbfnet parameter name
Abbreviations
ANN artificial neural network
Basel II the new Basel Capital Accord
CART classification and regression trees
ECOA Equal Credit Opportunity Act
GLM generalized linear model
GPLM generalized partially linear model
LDA linear discriminant analysis
MSE mean squared error
RBF radial basis function
VaR value at risk
1 Introduction
One of the basic tasks which any financial institution must deal with is to minimize
its credit risk. Scoring methods traditionally estimate the creditworthiness of a credit
card applicant. They predict the probability that an applicant or existing borrower
will default or become delinquent. Credit scoring studies the creditworthiness of:
any of the many forms of commerce under which an individual obtains
money or goods or services on condition of a promise to repay the money
or to pay for the goods or services, along with a fee (the interest), at some
specific future date or dates
Lewis (1994)
The statistical methods we will present are based on a particular amount of historical
data. Computers allow large data sets to be treated and a decision to be reached more
quickly, more cheaply and more realistically than with the primary (historic) judgment
made by credit experts or loan officers. If used correctly, these methods give an even
better prediction.
Bank credit card issuers in the U.S. lose about $1 billion each year in fraud, mostly
from stolen cards. They also lose another $3 billion in fraudulent bankruptcy filings.
However, merchants absorb more than $10 billion each year in credit card related
fraud.¹
The founders of credit scoring, Bill Fair and Earl Isaac, designed a complete billing
system for one of the first credit cards, Carte Blanche, in 1957 and built the first credit
scoring system for American Investments one year later. Since then credit scoring
has become a broad system for predicting consumer behaviour and has spread into many
other areas, e.g. consumer lending (especially on credit cards), mortgage lending, small
and medium business lending, direct marketing and advertising. Back et al. (1996) used
the same procedures to predict the failure of companies.
Methods used for credit scoring include various statistical procedures, such as the most
commonly used logistic regression and its alternative probit regression, (linear)
discriminant analysis, fashionable artificial neural networks or genetic algorithms,
and further linear programming, nonparametric classification trees and semiparametric
regression.
The aim of this thesis is to give an overview of credit scoring and to compare
various statistical methods by applying them to a data set from a French bank. We want
to compare their ability to distinguish between two subgroups of a highly complicated
sample, and to assess the performance and computational demands of these procedures.
¹http://www.cardweb.com/cardlearn/stat.html
[Two panels: Credit Cards (per 1000 inhabitants) and Transactions (per 1000 inhabitants), plotted over the years 1995-1999; images not reproduced.]
Figure 1.1: Number of credit cards and number of credit card transactions
in the European Union between 1995 and 1999 (http://www.euronas.org).
Credit scoring gains new importance in the light of the New Basel Capital
Accord. The so-called Basel II replaces the current 1988 Capital Accord and focuses
on techniques that allow banks and supervisors to evaluate properly the various risks
that banks face. Thus credit scoring may contribute to the internal assessment process
of an institution, which is desirable.
This master thesis is organized as follows: First, we give a short overview of credit
scoring, recall some of its history, discuss accompanying problems and mention
the direction credit scoring may take in the future. The data set for the analysis is
described in the third section. Section 4 explains the statistical methods used (in
particular logistic regression, semiparametric regression, the multilayer perceptron
neural network and the radial basis function neural network) and applies them to the
data set. Other methods used for credit scoring are presented in Section 5. A summary
of our analyses with the conclusion is given in Section 6. Section 7 remarks on some
possible extensions to this topic. The radial basis function neural network method,
which we programmed in the C language for the purpose of this thesis, is described
thoroughly in Appendix A. Appendix B gives some suggestions for improvements in
XploRe that we discovered while working on this thesis. Finally, Appendix C lists the
files stored on the compact disc enclosed with this thesis.
2 Credit Scoring in Overview
Risk forecasting is the number one topic in modern finance. Apart from portfolio
management, the pricing of options (and other financial instruments) or bond pricing,
credit scoring represents another important set of procedures for estimating and
reducing credit risk. It involves techniques that help financial organizations decide
whether or not to grant credit to applicants. Basically, credit scoring tries to
distinguish two different subgroups in the data sample. The aim is to choose a method
which is computable in real time and predicts sufficiently precisely.
There is a vast number of articles treating credit scoring in recent issues of trade
publications in the credit and banking area. An exhaustive overview of the credit
scoring literature can be found in Thomas (2000), Mester (1997) or in Kaiser and
Szczesny (2000a). Professor D. J. Hand, head of the statistics section in the
department of mathematics at Imperial College, has also published many books on this
topic.
2.1 History
The statistical techniques used for credit scoring are based on the idea of discrimination
between several groups in a data sample. These procedures originate in the thirties and
forties of the previous century (Fisher, 1936; Durand, 1941). At that time some of the
finance houses and mail order firms were having difficulties with their credit
management. Decisions whether to give loans or send merchandise to the applicants were
made judgmentally by credit analysts. The decision procedure was nonuniform, subjective
and opaque; it depended on the rules of each financial house and on the personal and
empirical knowledge of each single clerk. With the rising number of people applying
for a credit card in the late 1960s it was impossible to rely on credit analysts alone;
an automated system was necessary. The first consultancy was formed in San Francisco
by Bill Fair and Earl Isaac in the late 1950s. Their system spread fast as the financial
institutions found out that using credit scoring was cheaper, faster, more objective,
and above all much more predictive than any judgmental scheme. It is estimated that
default rates dropped by 50% (Thomas, 2000) after the implementation of credit
scoring. Another advantage of credit scoring is that it allows lenders to
underwrite and monitor loans without actually meeting the borrower.
The success of credit scoring in credit card issuing was a significant sign for the
banks to apply scoring methods to other products like personal loans, mortgage loans,
small business loans etc. However, commercial lending is more heterogeneous, its
documentation is not standardized within or across institutions, and thus the results
are not so clear. In the 1990s the growth of direct marketing led to the use of
scorecards to improve the response rate to advertising campaigns.
2.2 Problems
Most of the problems one must face when using credit scoring are of a technical rather
than a theoretical nature. First of all, one should think of the data necessary to
implement the scoring. It should include as many relevant factors as possible. There is
a trade-off between expensive data and low accuracy due to insufficient information.
Banks collect the data from their internal sources (the applicant's previous credit
history), from external sources (questionnaires, interviews with the applicants) and
from third parties. From the applicant's background the following information is
usually collected: age, gender, marital status, nationality, education, number of
children, job, income, lease rental charges, etc. The following questions from the
applicant's credit history are especially interesting: Does the applicant already have
a credit? How much did he borrow? Has the applicant ever delayed his payments? Does he
ask for another credit as well? By third parties we understand special houses specializing in
[Bar chart: Number of Credit Cards (per 1000 inhabitants), values 200-800, for Belgium, France, Germany, Italy, Sweden and the U.K.; image not reproduced.]
Figure 2.1: Number of credit cards per 1000 persons in the years 1994 (thin
line) and 1999 (thick line) (http://www.euronas.org).
collecting credit information about potential clients. The variables entering the credit
scoring procedures should be chosen carefully, as the amount of data may be vast
indeed and thus computationally problematic. For instance, the German central bank
(Deutsche Bundesbank) lists about 325,000 units. Since most of the attributes in credit
scoring are categorical, imposing dummy variables gives a matrix with several million
elements (Enache (1998) mentions 180 variables in his analysis).
Müller et al. (2002) treat a very important feature of credit scoring data: there is
usually no information on the performance of rejected customers. This causes bias in
the sample. Hand and Henley (1993) concluded that it cannot be overcome unless one
can assume a particular relationship between the distributions of the "good" and "bad"
clients which holds for both the accepted and the rejected applicants. This problem may
be solved by some organizations if they accept everybody for a short time. Afterwards
they can build a scorecard based on the unbiased data sample. However, this is possible
only for retailers, mail order firms or advertising companies, not for banks and
financial institutions.
American banks have another problem. The law does not allow the use of information
about race, nationality, religion, gender or marital status to build a scorecard.
This is stated in the Equal Credit Opportunity Act (ECOA) and in the Consumer Credit
Protection Act. Moreover, the attribute age plays a special role: it may be used only
as long as people older than 62 years are not discriminated against. Legal fundamentals
for the consumer credit business in Germany are given in Schnurr (1997).
2.3 Credit Scoring Today
As mentioned above, credit scoring methods are widely used to estimate and to minimize
credit risk. Mail order companies, advertising companies, banks and other financial
institutions use these methods to score their clients, applicants and potential
customers. There is an effort to make all procedures used to estimate and decrease
credit risk more precise. Both the U.S. Federal Home Loan Mortgage Corporation and the
U.S. Federal National Mortgage Corporation have encouraged mortgage lenders to use
credit scoring, which should provide consistency across underwriters.
International banking supervision also calls for more precise internal assessments
by banks: The Basel Committee on Banking Supervision is an international organization
which formulates broad supervisory standards and guidelines for banks. It encourages
convergence toward common approaches and common standards. The Committee's
members come from Belgium, Canada, France, Germany, Italy, Japan, Luxembourg,
the Netherlands, Spain, Sweden, Switzerland, the United Kingdom and the United States.
In 1988, the Committee decided to introduce a capital measurement system (the Basel
Capital Accord). This framework has been progressively introduced not only in member
countries but also in other countries with active international banks. In June 1999,
the Committee issued a proposal for a New Capital Adequacy Framework to replace
the 1988 Accord (http://www.bis.org). The proposed capital framework consists of
three pillars:
1. minimum capital requirements,
2. supervisory review of internal assessment process and capital adequacy,
3. effective use of disclosure to strengthen market discipline.
The New Basel Capital Accord is to be implemented by 2004. Consequently, Basel
II (The New Capital Accord) puts more emphasis on banks' own internal methodologies.
Therefore credit scoring and its methods may become a subject of banks' extensive
interest, as they will try to make their internal assessments as precise and correct as
possible.
3 Data Set Description
In this section we briefly describe the data set used in our analysis and show some
of its basic characteristics to give a better insight into the sample. The data set
analyzed in this thesis stems from a French bank. However, the source is confidential
and therefore the names of all variables have been removed, categorical values have
been changed to meaningless symbols and metric variables have been standardized to mean
0 and variance 1. The same data set was used in Müller and Rönz (1999)
and Härdle et al. (2001). The original file, french.dat, contains 8830 observations
with one response variable, 8 metric and 15 categorical predictor variables. We have
[Eight panels X1-X8; images not reproduced.]
Figure 3.1: Density dot plots for the original metric variables X1, . . . , X8.
Variable X5
Value      Frequency   Percent   Cumulative
−0.537     3815        0.617     0.617
 0.203      934        0.151     0.768
 0.943      907        0.147     0.915
 1.682      431        0.070     0.985
 2.422       93        0.015     1.000
total      6180        1.000

Table 3.1: Frequency table of the metric variable X5.
removed observations with the response classified as 9, a class originally used for
testing. Since we do not know the real classification, we cannot use this class for the
purpose of our analysis. The remaining data contain 6672 observations. In addition, we
have changed the classes of the independent categorical variables from 1, . . . , K to
0, . . . , K − 1 and ordered them in accordance with the number of categories. Let Y
denote the response variable, X1, . . . , X8, in sequence, the metric variables and
X9, . . . , X23 the categorical variables.
The density dot plots² in Figure 3.1 show estimated densities of the metric variables
and indicate some suspicious outlying values. The problem is that the usual outlier
tests assume a normal distribution and that testing of normality is affected by these
outliers (Rönz, 1998). Note that the last observations (number 6662 and higher) take
the lower extremes in almost all metric variables. Since the metric variables were
already standardized, we decided to restrict them to the range [−3, 3] in order to get
rid of the outliers. Thereby we get a new subsample containing only 6180 cases. Density
dot plots of the metric variables for this data set are shown in Figure 3.2, and we can
see that the shape of the variables' densities improved. Table 3.1 shows the frequencies
of the variable X5, which is obviously discrete.
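The restriction of the standardized metric variables to [−3, 3] can be sketched as follows (an illustrative numpy sketch; the function and variable names are ours, not those of the original quantlets):

```python
import numpy as np

def in_range_mask(X_metric, lo=-3.0, hi=3.0):
    """Boolean mask of observations whose standardized metric
    variables all lie within [lo, hi]."""
    return np.all((X_metric >= lo) & (X_metric <= hi), axis=1)

# toy data: the second and third rows contain out-of-range values
X_metric = np.array([[0.1, 2.9],
                     [3.5, 0.0],
                     [-3.2, 1.0]])
kept = X_metric[in_range_mask(X_metric)]   # keeps only the first row
```

Applied to all 8 metric columns of the 6672 remaining observations, a filter of this kind yields the 6180-case subsample used in the sequel.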
For the purpose of our analysis we have randomly divided the data sample into
two subsamples. About two thirds of the whole data set (4135 observations) forms
the first subsample, the TRAIN sample. It will be used for model estimation. The
second subsample, TEST, with 2045 observations, will be used to obtain an overall
procedure for comparing the particular methods and for checking the predictive power
of the models used. This will be based on misclassification rates. The subsamples are
stored in the files datatrain.dat and datatest.dat.

²For the density dot plots in this section we used the quantlet myGrdotd.xpl, which is
a modified version of the faulty original grdotd. For more information see Appendix B.1.

[Eight panels X1-X8; images not reproduced.]
Figure 3.2: Density dot plots for all metric variables as used in the analysis.
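The random division above can be sketched as follows (a numpy sketch under the assumption of a simple uniform random split; the seed is an illustrative choice, the thesis does not report the exact mechanism):

```python
import numpy as np

def split_train_test(data, n_train, seed=0):
    """Randomly split the rows of `data` into a TRAIN part of size
    n_train and a TEST part containing the remaining rows."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    return data[idx[:n_train]], data[idx[n_train:]]

# 6180 cases -> 4135 TRAIN / 2045 TEST, as in the thesis
data = np.arange(6180 * 3, dtype=float).reshape(6180, 3)
train, test = split_train_test(data, 4135)
```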
At this stage it is worth recalling that we do not know anything about the economic
interpretation of the predictors. We also do not know what the response variable means
(is it a credit card, loan or mortgage application?). And even the coding is at first
unclear: does 1 stand for "the client is creditworthy" or for "the client has some
problems with repaying the debt"? The frequencies of the response variable, summarized
in Table 3.2, tell us more. Since only about 6% of the data set is classified as 1, we
will call this class: clients that have some problems with repaying their liability.
Our sample is in this
TRAIN sample TEST sample total
0 3888 (94.0%) 1920 (93.9%) 5808 (94.0%)
1 247 (6.0%) 125 (6.1%) 372 (6.0%)
total 4135 2045 6180
Table 3.2: Frequencies of the two response outcomes.
sense unbiased, because the percentage of faulty loans in consumer credit commonly
varies between 1% and 7% (Arminger et al., 1997). Note that many credit scoring
analyses use data sets with an overrepresented rate of bad loans (West, 2000).
However, Desai et al. (1996) use 3 data sets from 3 credit unions in the Southeastern
U.S. which consist of 81.58%, 74.02% and 78.85% of good loans, respectively. The
data set of Fahrmeir et al. (1984) contains 70% of good credits. Enache (1998)
analyzed 38,000 applications, 16.8% of which were rejected. Cardweb, the U.S. payment
card information network (http://www.cardweb.com), mentions that about 78% of U.S.
households are considered creditworthy.
Let us now examine the variables in detail. All descriptive statistics and graphics
used in this section are computed by the quantlet fr descr.data.xpl. We start with the
metric variables. Tables 3.3 and 3.4 show their basic characteristics for the TRAIN
and TEST data sets: minimum, first quartile, median, third quartile, maximum, mean
and standard error. These statistics express in numbers our first finding from the
density dot plots, namely that the data are extremely right-skewed. Figures 3.3 and 3.4
show box plots for the TRAIN and TEST data sets. The left box plot in each display
stands for the "good" clients and the right one for the "bad" clients. From the box
plots we cannot see any substantial differences between these two groups.
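The characteristics reported in Tables 3.3 and 3.4 can be reproduced per variable along these lines (a numpy sketch; the thesis computed them with an XploRe quantlet, and computing "Std.Err." as the sample standard deviation is our assumption):

```python
import numpy as np

def describe(col):
    """Minimum, quartiles, maximum, mean and dispersion of one
    metric variable -- the columns of Tables 3.3 and 3.4."""
    q1, med, q3 = np.percentile(col, [25, 50, 75])
    return (col.min(), q1, med, q3, col.max(),
            col.mean(), col.std(ddof=1))

stats = describe(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
```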
Next, we examine the categorical variables. Frequencies for the dichotomous categorical
variables (with two outcomes only) are shown in Table 3.5. Figures 3.5-3.13
show bar charts for the variables X15-X23. The upper displays correspond to the
TRAIN data set, the lower displays to the TEST data set. The displays on the left
show the outcomes when the response is 0 and the displays on the right when
Y = 1. Remarkable is the change in variable X23: while among the creditworthy clients
the third category is the most plentiful and the ninth category has only about half as
many observations, among the non-creditworthy clients the ninth category increases its
relative number and the third category falls to about two thirds of the ninth category.
Characteristics for the variables X15-X23 are summarized in Tables 3.6-3.14.
[Eight panels X1-X8; images not reproduced.]
Figure 3.3: Box plots of metric variables in the TRAIN subsample.
Min. 25% Q. Median 75% Q. Max. Mean Std.Err.
X1 −1.519 −0.766 −0.349 0.403 2.994 0.119 0.892
X2 −1.188 −0.810 −0.307 0.323 2.968 0.122 0.847
X3 −0.830 −0.695 −0.426 0.113 2.940 0.083 0.851
X4 −0.825 −0.694 −0.432 0.223 2.973 0.113 0.816
X5 −0.537 −0.537 −0.537 0.203 2.422 −0.019 0.772
X6 −0.626 −0.363 −0.167 0.117 2.962 0.069 0.492
X7 −0.302 −0.302 −0.302 0.138 2.924 0.030 0.408
X8 −0.346 −0.211 −0.211 0.211 2.835 0.106 0.340
Table 3.3: Minimum, 1st quartile, median, 3rd quartile, maximum, mean and
the standard error of metric variables from the TRAIN subsample.
[Eight panels X1-X8; images not reproduced.]
Figure 3.4: Box plots of metric variables in the TEST subsample.
Min. 25% Q. Median 75% Q. Max. Mean Std.Err.
X1 −1.519 −0.766 −0.182 0.487 2.994 0.062 0.892
X2 −1.188 −0.810 −0.307 0.449 2.968 0.072 0.880
X3 −0.830 −0.695 −0.291 0.247 2.940 0.037 0.889
X4 −0.825 −0.694 −0.432 0.223 2.973 0.086 0.832
X5 −0.537 −0.537 −0.537 0.203 2.422 0.012 0.780
X6 −0.626 −0.375 −0.180 0.109 2.858 0.083 0.473
X7 −0.302 −0.302 −0.302 0.138 2.631 0.029 0.404
X8 −0.211 −0.211 −0.211 0.211 2.631 0.099 0.367
Table 3.4: Minimum, 1st quartile, median, 3rd quartile, maximum, mean and
the standard error of metric variables from the TEST subsample.
TRAIN sample TEST sample total
X9: 0 1145 (27.7%) 620 (30.3%) 1765 (28.6%)
X9: 1 2990 (72.3%) 1425 (69.7%) 4415 (71.4%)
X10: 0 708 (17.1%) 361 (17.7%) 1069 (17.3%)
X10: 1 3427 (82.9%) 1684 (82.3%) 5111 (82.7%)
X11: 0 569 (13.8%) 308 (15.1%) 877 (14.2%)
X11: 1 3566 (86.2%) 1737 (84.9%) 5303 (85.8%)
X12: 0 2521 (61.0%) 1286 (62.9%) 3807 (61.6%)
X12: 1 1614 (39.0%) 759 (37.1%) 2373 (38.4%)
X13: 0 217 (5.2%) 100 (4.9%) 317 (5.1%)
X13: 1 3918 (94.8%) 1945 (95.1%) 5863 (94.9%)
X14: 0 3965 (95.9%) 1959 (95.8%) 5924 (95.9%)
X14: 1 170 (4.1%) 86 (4.2%) 256 (4.1%)
Table 3.5: Outcome frequencies of the dichotomous categorical variables X9-X14.
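Outcome frequencies of the kind shown in Tables 3.5-3.14 can be tabulated as follows (a small numpy sketch; the helper name is ours):

```python
import numpy as np

def outcome_frequencies(col):
    """Count and relative frequency per category of one
    categorical variable."""
    cats, counts = np.unique(col, return_counts=True)
    return {int(c): (int(n), round(float(n) / len(col), 3))
            for c, n in zip(cats, counts)}

# toy column with outcomes 0/1, as for the dichotomous X9-X14
freq = outcome_frequencies(np.array([0, 1, 1, 1, 0, 1, 1, 1]))
```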
[Four panels: TRAIN and TEST samples for Y = 0 and Y = 1; images not reproduced.]
Figure 3.5: Bar charts of the categorical variable X15.
TRAIN sample TEST sample total
0 1229 (29.7%) 590 (28.9%) 1819 (29.4%)
1 245 (5.9%) 131 (6.4%) 376 (6.1%)
2 2661 (64.4%) 1324 (64.7%) 3985 (64.5%)
Table 3.6: Outcome frequencies of the categorical variable X15.
[Four panels: TRAIN and TEST samples for Y = 0 and Y = 1; images not reproduced.]
Figure 3.6: Bar charts of the categorical variable X16.
TRAIN sample TEST sample total
0 3071 (74.3%) 1513 (74.0%) 4584 (74.2%)
1 535 (12.9%) 293 (14.3%) 828 (13.4%)
2 529 (12.8%) 239 (11.7%) 768 (12.4%)
Table 3.7: Outcome frequencies of the categorical variable X16.
[Four panels: TRAIN and TEST samples for Y = 0 and Y = 1; images not reproduced.]
Figure 3.7: Bar charts of the categorical variable X17.
TRAIN sample TEST sample total
0 1440 (34.8%) 728 (35.6%) 2168 (35.1%)
1 223 (5.4%) 118 (5.8%) 341 (5.5%)
2 639 (15.5%) 311 (15.2%) 950 (15.4%)
3 1833 (44.3%) 888 (43.4%) 2721 (44.0%)
Table 3.8: Outcome frequencies of the categorical variable X17.
[Four panels: TRAIN and TEST samples for Y = 0 and Y = 1; images not reproduced.]
Figure 3.8: Bar charts of the categorical variable X18.
TRAIN sample TEST sample total
0 874 (21.1%) 441 (21.6%) 1315 (21.3%)
1 809 (19.6%) 378 (18.5%) 1187 (19.2%)
2 595 (14.4%) 319 (15.6%) 914 (14.8%)
3 232 (5.6%) 125 (6.1%) 357 (5.8%)
4 120 (2.9%) 77 (3.8%) 197 (3.2%)
5 1505 (36.4%) 705 (34.5%) 2210 (35.8%)
Table 3.9: Outcome frequencies of the categorical variable X18.
[Four panels: TRAIN and TEST samples for Y = 0 and Y = 1; images not reproduced.]
Figure 3.9: Bar charts of the categorical variable X19.
TRAIN sample TEST sample total
0 369 (8.9%) 209 (10.2%) 578 (9.4%)
1 634 (15.3%) 295 (14.4%) 929 (15.0%)
2 1093 (26.4%) 548 (26.8%) 1641 (26.6%)
3 255 (6.2%) 132 (6.5%) 387 (6.3%)
4 1145 (27.7%) 552 (27.0%) 1697 (27.5%)
5 639 (15.5%) 309 (15.1%) 948 (15.3%)
Table 3.10: Outcome frequencies of the categorical variable X19.
[Four panels: TRAIN and TEST samples for Y = 0 and Y = 1; images not reproduced.]
Figure 3.10: Bar charts of the categorical variable X20.
TRAIN sample TEST sample total
0 710 (17.2%) 370 (18.1%) 1080 (17.5%)
1 267 (6.5%) 151 (7.4%) 418 (6.8%)
2 372 (9.0%) 198 (9.7%) 570 (9.2%)
3 255 (6.2%) 121 (5.9%) 376 (6.1%)
4 60 (1.5%) 25 (1.2%) 85 (1.4%)
5 2471 (59.8%) 1180 (57.7%) 3651 (59.1%)
Table 3.11: Outcome frequencies of the categorical variable X20.
[Four panels: TRAIN and TEST samples for Y = 0 and Y = 1; images not reproduced.]
Figure 3.11: Bar charts of the categorical variable X21.
TRAIN sample TEST sample total
0 737 (17.8%) 387 (18.9%) 1124 (18.2%)
1 412 (10.0%) 183 (8.9%) 595 (9.6%)
2 708 (17.1%) 318 (15.6%) 1026 (16.6%)
3 461 (11.1%) 214 (10.5%) 675 (10.9%)
4 1069 (25.9%) 554 (27.1%) 1623 (26.3%)
5 473 (11.4%) 237 (11.6%) 710 (11.5%)
6 275 (6.7%) 152 (7.4%) 427 (6.9%)
Table 3.12: Outcome frequencies of the categorical variable X21.
[Four panels: TRAIN and TEST samples for Y = 0 and Y = 1; images not reproduced.]
Figure 3.12: Bar charts of the categorical variable X22.
TRAIN sample TEST sample total
0 383 (9.3%) 198 (9.7%) 581 (9.4%)
1 420 (10.2%) 228 (11.1%) 648 (10.5%)
2 1189 (28.8%) 572 (28.0%) 1761 (28.5%)
3 202 (4.9%) 78 (3.8%) 280 (4.5%)
4 210 (5.1%) 95 (4.6%) 305 (4.9%)
5 317 (7.7%) 175 (8.6%) 492 (8.0%)
6 218 (5.3%) 97 (4.7%) 315 (5.1%)
7 227 (5.5%) 112 (5.5%) 339 (5.5%)
8 143 (3.5%) 69 (3.4%) 212 (3.4%)
9 826 (20.0%) 421 (20.6%) 1247 (20.2%)
Table 3.13: Outcome frequencies of the categorical variable X22.
[Four panels: TRAIN and TEST samples for Y = 0 and Y = 1; images not reproduced.]
Figure 3.13: Bar charts of the categorical variable X23.
TRAIN sample TEST sample total
0 279 (6.7%) 145 (7.1%) 424 (6.9%)
1 298 (7.2%) 167 (8.2%) 465 (7.5%)
2 1025 (24.8%) 510 (24.9%) 1535 (24.8%)
3 375 (9.1%) 180 (8.8%) 555 (9.0%)
4 226 (5.5%) 99 (4.8%) 325 (5.3%)
5 104 (2.5%) 59 (2.9%) 163 (2.6%)
6 269 (6.5%) 132 (6.5%) 401 (6.5%)
7 389 (9.4%) 194 (9.5%) 583 (9.4%)
8 561 (13.6%) 263 (12.9%) 824 (13.3%)
9 65 (1.6%) 25 (1.2%) 90 (1.5%)
10 544 (13.2%) 271 (13.3%) 815 (13.2%)
Table 3.14: Outcome frequencies of the categorical variable X23.
4 Credit Scoring Methods

We suppose that we are given a set $\{x_i\}_{i=1}^{N}$ of $N$ observations of a random
vector $X$ in $\mathbb{R}^p$. That is, there are $p$ independent variables (predictors)
$X_1, \ldots, X_p$ and one dependent variable (response) $Y$. Each observation is thus
a row vector $(x_i, y_i)$, $i = 1, 2, \ldots, N$, and the whole data set can be written
in matrix form:
$$
X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,p} \\ \vdots & & \vdots \\ x_{N,1} & \cdots & x_{N,p} \end{pmatrix},
\qquad
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}.
$$
In our particular case we have two data sets, with $N_{\mathrm{TRAIN}} = 4135$ and
$N_{\mathrm{TEST}} = 2045$. The dimension of the predictor space is either $p = 23$,
or $p = 61$ after introducing dummy variables.
4.1 Logistic Regression

Logistic regression stems from a wider class of models: it is a special case of the
generalized linear model (GLM). The logistic regression model assumes that the
conditional expectation $E[Y \mid X = x]$ of $Y$ is given in the following way:
$$
E[Y \mid x] = G(\beta_0 + x^\top \beta) = \frac{e^{\beta_0 + x^\top \beta}}{1 + e^{\beta_0 + x^\top \beta}},
$$
where $\beta_0$ is the regression constant and $\beta = (\beta_1, \ldots, \beta_p)^\top$
is the vector of regression coefficients. Writing $\pi_i = G(\beta_0 + x_i^\top \beta)$,
the parameters are estimated by maximizing the likelihood
$$
L(\beta_0, \beta) = \prod_{i=1}^{N} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}
= \prod_{i=1}^{N} G(\beta_0 + x_i^\top \beta)^{y_i} \left(1 - G(\beta_0 + x_i^\top \beta)\right)^{1 - y_i}.
$$
The estimation is carried out by the Newton-Raphson algorithm. Computational
details are given in Härdle et al. (2000b).
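The Newton-Raphson iteration for the logit likelihood can be sketched as follows (a self-contained numpy sketch of the standard algorithm, not the XploRe implementation used in the thesis):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit(X, y, n_iter=25, tol=1e-8):
    """Maximum-likelihood logit fit via Newton-Raphson."""
    Xd = np.column_stack([np.ones(len(y)), X])   # prepend constant beta_0
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xd @ beta)        # pi_i = G(beta_0 + x_i' beta)
        grad = Xd.T @ (y - p)         # score vector
        W = p * (1.0 - p)
        H = Xd.T @ (Xd * W[:, None])  # observed information matrix
        step = np.linalg.solve(H, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# check on synthetic data with known coefficients (0.5, -1, 2)
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 2))
prob = sigmoid(0.5 - X[:, 0] + 2.0 * X[:, 1])
y = (rng.random(5000) < prob).astype(float)
beta_hat = fit_logit(X, y)
```

The iteration converges in a handful of steps for well-behaved data; it is the same algorithm a GLM routine applies internally.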
Variable β̂ Std.Err. t-value Variable β̂ Std.Err. t-value
Const. 2.747 0.066 41.811 Const. 2.757 0.066 41.976
X1 0.191 0.069 2.771 X5 0.093 0.082 1.135
Const. 2.823 0.071 39.873 Const. 2.754 0.066 41.861
X2 0.313 0.088 3.539 X6 0.035 0.131 0.267
Const. 2.767 0.067 41.518 Const. 2.830 0.071 39.635
X3 0.097 0.082 1.189 X7 0.872 0.217 4.013
Const. 2.752 0.066 41.721 Const. 2.948 0.099 29.807
X4 0.047 0.078 0.596 X8 1.306 0.431 3.032
Table 4.1: Logistic regression coefficients for each metric variable separately.
Bold parameter values are significant at 5%.
Results
The estimations are based on the TRAIN data set. Since logistic regression cannot
deal with categorical variables directly, we must recode each of the variables X9-X23
with a set of new contrast variables (the first category, 0, is taken as the reference
category for each variable). In this manner we get a larger data set with one response,
8 metric and 53 dummy variables (stored in the files fr train C.dat and fr test C.dat).
Let Xa(b) denote the b-th contrast variable for the original variable Xa; e.g. X23 is
then recoded into the contrast variables X23(1), X23(2), . . . , X23(10). First we tried
to fit the response on each variable separately. Table 4.1 shows the estimation results:
the parameters, their standard errors and t-values for the metric variables. X1, X2,
X7 and X8 are significant at 5% (and thus emphasized in bold), while X3-X6 are
insignificant.
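The recoding into contrast variables can be sketched as follows (a numpy sketch; the helper name is hypothetical, not the quantlet that produced fr train C.dat):

```python
import numpy as np

def contrast_dummies(col):
    """Recode a categorical column with classes 0..K-1 into K-1 dummy
    columns Xa(1)..Xa(K-1), taking class 0 as the reference category."""
    cats = np.unique(col)
    # one column per non-reference class: Xa(b) = 1 where col == b
    return np.column_stack([(col == c).astype(float) for c in cats[1:]])

# a toy column with classes 0..3 -> three contrast columns
x = np.array([0, 3, 1, 2, 3, 0])
D = contrast_dummies(x)
```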
Results for the categorical variables estimated separately are given in Table 4.2.
Parameters significant at 5% are again emphasized in bold. The significant binary
variables are X9, X10 and X14. No variable with more than two categories has
significant coefficients for all of its dummy variables. All coefficients for X16 and
X17 are insignificant.
Among the significant variables, the categorical variable X10 has the lowest deviance
(1802.6). The sequence of the other significant variables sorted with respect to the
deviance (in increasing order) is: X10, X20, X22, X23, X21, and the corresponding R²
values decrease from 3.7% to 1.5%. This is really a very poor performance. Hence we use
Variable β̂ Std.Err. t-value Variable β̂ Std.Err. t-value
Const. 2.403 0.107 22.426 Const. 2.121 0.121 17.475
X9(1) 0.524 0.136 3.864 X20(1) 0.392 0.262 1.496
Const. 1.864 0.110 16.910 X20(2) 1.048 0.290 3.613
X10(1) 1.206 0.138 8.737 X20(3) 0.343 0.263 1.304
Const. 2.609 0.166 15.727 X20(4) 0.932 0.328 2.836
X11(1) 0.172 0.181 0.954 X20(5) 1.024 0.158 6.481
Const. 2.841 0.087 32.563 Const. 3.160 0.186 16.951
X12(1) 0.206 0.132 1.557 X21(1) 0.049 0.316 0.155
Const. 2.344 0.240 9.759 X21(2) 0.142 0.258 0.549
X13(1) 0.440 0.250 1.763 X21(3) 1.008 0.241 4.184
Const. 2.792 0.068 41.015 X21(4) 0.561 0.222 2.528
X14(1) 0.659 0.258 2.549 X21(5) 0.274 0.277 0.987
Const. 3.140 0.143 21.952 X21(6) 0.667 0.294 2.271
X15(1) 0.174 0.329 0.528 Const. 3.133 0.255 12.267
X15(2) 0.540 0.162 3.329 X22(1) 0.096 0.361 0.266
Const. 2.771 0.077 36.160 X22(2) 0.678 0.277 2.445
X16(1) 0.255 0.181 1.405 X22(3) 0.354 0.487 0.726
X16(2) 0.192 0.215 0.892 X22(4) 0.638 0.365 1.749
Const. 2.833 0.115 24.628 X22(5) 0.262 0.357 0.735
X17(1) 0.034 0.318 0.106 X22(6) 0.894 0.343 2.604
X17(2) 0.077 0.213 0.363 X22(7) 1.590 0.755 2.107
X17(3) 0.192 0.148 1.297 X22(8) 0.989 0.374 2.645
Const. 3.147 0.170 18.492 X22(9) 0.255 0.299 0.854
X18(1) 0.568 0.219 2.596 Const. 2.561 0.232 11.035
X18(2) 0.246 0.251 0.982 X23(1) 0.308 0.346 0.890
X18(3) 0.891 0.281 3.168 X23(2) 0.811 0.290 2.794
X18(4) 0.635 0.386 1.645 X23(3) 0.264 0.323 0.816
X18(5) 0.416 0.201 2.065 X23(4) 0.622 0.412 1.509
Const. 3.030 0.248 12.204 X23(5) 0.425 0.514 0.826
X19(1) 0.636 0.287 2.217 X23(6) 0.092 0.325 0.284
X19(2) 0.253 0.280 0.904 X23(7) 0.161 0.313 0.513
X19(3) 0.670 0.334 2.009 X23(8) 0.342 0.272 1.257
X19(4) 0.025 0.285 0.086 X23(9) 1.598 1.034 1.545
X19(5) 0.241 0.301 0.802 X23(10) 0.054 0.283 0.191
Table 4.2: Logistic regression coefficients for each categorical variable separately.
Bold parameter values are significant at 5%.
[Scatter plot: GLM logistic fit, n = 4135; x-axis: index η, y-axis: link μ and responses y. Image not reproduced.]
Figure 4.1: Predicted probabilities.
the quantlet glmforward³ to find a more appropriate model. The quantlet starts with a
null model (containing the intercept only) and adds particular variables successively.
The measure of goodness of the model is Akaike's information criterion. In this way we
can identify subsets of independent variables that are good predictors of the response.
Note that this quantlet must evaluate up to p(p+1)/2 models, which means in our case
62·63/2 = 1953 models, and therefore the computation takes a while. The forward
stepwise procedure suggests including the following variables in the model: X1, X2,
X5, X7-X10, X12-X14, X16(2), X18(3), X18(5), X19(1), X19(3), X20(2), X20(4), X20(5),
X21(3), X21(4), X21(6), X22(2), X22(4), X22(6)-X22(9), X23(2), X23(8), X23(9). The
Akaike information criterion of this model is 1643.6. It is remarkable that the
variable X23(3) is insignificant, even though it seemed to be quite predictive from
Figure 3.13. Table 4.3 shows the estimated parameters, their standard errors and the
corresponding t-values in this model. Since by using dummy variables one considers all
possible effects, the modelling of the categorical variables cannot be further
improved, but the influence of the metric variables may be investigated better by
using a semiparametric model. Figure 4.1 shows a plot of the predicted probabilities
$G(\hat\beta_0 + x^\top \hat\beta)$.
3
Files ologit*.* contain the complete computer output related to this section.
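The forward stepwise selection just described can be sketched as follows. The criterion function here is a stand-in, since the internals of glmforward are not shown in the text; in practice aic(subset) would fit a logit model on the chosen columns and return its AIC.

```python
# Sketch of forward stepwise selection by Akaike's information criterion (AIC).
# The criterion callable is a stand-in: a real version would fit a logit model
# on the selected variables and return its AIC.

def forward_stepwise(candidates, aic):
    """Start from the null model and greedily add the variable that lowers
    the AIC most, stopping when no addition improves the criterion."""
    selected = []
    best = aic(selected)                      # AIC of the intercept-only model
    while True:
        trials = [(aic(selected + [v]), v) for v in candidates if v not in selected]
        if not trials:
            break
        new_best, new_var = min(trials)
        if new_best >= best:                  # no remaining variable helps
            break
        selected.append(new_var)
        best = new_best
    return selected, best

# Toy criterion: pretend "X2" and "X7" are the only useful predictors.
def toy_aic(subset):
    useful = {"X2": 400.0, "X7": 250.0}
    fit = sum(useful.get(v, 0.0) for v in subset)
    penalty = 2.0 * (len(subset) + 1)         # AIC penalises each extra parameter
    return 2000.0 - fit + penalty

vars_, crit = forward_stepwise(["X1", "X2", "X5", "X7"], toy_aic)
print(sorted(vars_), crit)   # -> ['X2', 'X7'] 1356.0
```

The greedy loop evaluates at most p(p+1)/2 candidate models, which is the count quoted above for p = 62 variables.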
Variable Coeff. Std.Err. t-value
Const. 2.153 0.3653 5.8936
X1 0.1952 0.1126 1.7338
X2 0.3533 0.0967 3.6518
X5 0.1618 0.1047 1.546
X7 0.8900 0.2198 4.0488
X8 0.958 0.4080 2.3482
X9(1) 0.3269 0.1706 1.9162
X10(1) 0.9780 0.1519 6.4356
X12(1) 0.2551 0.1598 1.5961
X13(1) 0.511 0.2678 1.9091
X14(1) 0.6736 0.2850 2.3636
X16(2) 0.3144 0.2351 1.3372
X18(3) 0.4531 0.265 1.7056
X18(5) 0.5652 0.1940 2.913
X19(1) 0.4681 0.1768 2.6472
X19(3) 0.5394 0.2546 2.1181
X20(2) 0.8286 0.2982 2.7789
X20(4) 1.262 0.362 3.4868
X20(5) 0.8635 0.1506 5.7339
X21(3) 0.7086 0.1949 3.6355
X21(4) 0.2385 0.1687 1.4136
X21(6) 0.5004 0.2640 1.8953
X22(2) 0.6722 0.1839 3.6541
X22(4) 0.7068 0.3161 2.236
X22(6) 0.9538 0.2967 3.2146
X22(7) 1.341 0.6702 2.0007
X22(8) 0.880 0.4261 2.0666
X22(9) 0.350 0.2182 1.6064
X23(2) 0.6462 0.2001 3.2293
X23(8) 0.3820 0.1766 2.163
X23(9) 1.552 0.9915 1.566
Table 4.3: Logistic regression coefficients of the model suggested by glmforward. Bold parameter values are significant at 5%.
Prediction
We use the model described in the previous paragraph (suggested by the glmforward quantlet) to estimate the outcomes of the TEST data set and to compute the misclassification rate.
Firstly we check the prediction on the TRAIN data set which built the model. Table 4.4 shows the results when observations with probability higher than 0.5 are assigned to be non-creditworthy clients. For comparison we show the results using prior probabilities in Table 4.5. At first sight, using prior probabilities gives worse results: the overall misclassification rate of 8.95% is higher than the rate when using the 0.5 threshold (6.0%). But in fact, the latter misclassifies 93.9% of the bad clients, while the former only 74.9%! The threshold 0.5 achieves a better overall misclassification rate due to the low number of misclassified good clients (0.41%), whereas using prior probabilities misclassifies 4.76% of them. From the bank's point of view it is worse to grant a loan to a non-creditworthy applicant than to reject a creditworthy applicant, and thus we will further use prior probabilities to decide on the creditworthiness of a client.
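The two cut-off rules compared above can be sketched as follows. The predicted probabilities are made-up illustrations, not model output from the study; the prior of a bad client, 247/4135, is taken from the TRAIN sample counts in Table 4.4.

```python
# Sketch of the two classification rules: a fixed 0.5 cut-off versus a cut-off
# equal to the prior probability of a bad client. Probabilities are invented
# for illustration; the prior 247/4135 is the TRAIN-sample share of bad clients.

def classify(p_bad, threshold):
    """Assign 1 (non-creditworthy) when the predicted default probability
    exceeds the threshold, else 0 (creditworthy)."""
    return [1 if p > threshold else 0 for p in p_bad]

def misclassification_rate(predicted, observed):
    wrong = sum(1 for p, o in zip(predicted, observed) if p != o)
    return wrong / len(observed)

p_bad    = [0.02, 0.08, 0.15, 0.55, 0.30, 0.04]   # hypothetical model outputs
observed = [0,    0,    1,    1,    1,    0]       # 1 = bad client

prior_bad = 247 / 4135              # share of bad clients in the TRAIN sample

fixed = classify(p_bad, 0.5)        # flags only the client with p = 0.55
prior = classify(p_bad, prior_bad)  # flags every client with p > 0.0597...
print(misclassification_rate(fixed, observed),
      misclassification_rate(prior, observed))
```

The lower prior-based cut-off catches more bad clients at the cost of rejecting more good ones, which is exactly the trade-off seen between Tables 4.4 and 4.5.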
Then we use the model on the testing data, stored in the file datatest C.dat. Table 4.6 shows the results. In the TEST data set 98 of the 1920 good applicants were denoted as bad, 101 of the 125 bad clients were assigned as good, and thus 199 applicants of the entire TEST sample were misclassified, which results in an overall misclassification rate of about 9.7%.
observed predicted misclass. overall
0 1 misclass.
0 3872 16 0.41% 248
1 232 15 93.93% (6.00%)
Table 4.4: Misclassification rates of the logit model for the TRAIN data set.
observed predicted misclass. overall
0 1 misclass.
0 3703 185 4.76% 370
1 185 62 74.90% (8.95%)
Table 4.5: Misclassification rates of the logit model for the TRAIN data set (prior probabilities).
observed predicted misclass. overall
0 1 misclass.
0 1822 98 5.10% 199
1 101 24 80.80% (9.73%)
Table 4.6: Misclassification rates of the logit model for the TEST data set (prior probabilities).
Discussion
Since the logit model belongs to the traditional techniques, one may find logistic regression in almost every paper treating credit scoring, at least as a reference method for comparison with other models. In general, logistic regression is easy to fit and works well in practice. However, binary output models (like logit or probit) describe best the output category which occurs most frequently. Therefore each output category should have at least 5% of the observations. If an output class has too few observations, one should use so-called rare event models (Kaiser and Szczesny, 2000a).
4.2 Multilayer Perceptron
The multilayer perceptron (MLP) is a simple feedforward neural network with an input layer, several hidden layers and one output layer. This means that information can only flow forward, from the input units to the hidden layer and then to the output unit(s). The MLP network is the most often used architecture of neural networks, and there is a great deal of literature concerning MLP networks; see for example Bishop (1995). For the purpose of credit scoring an MLP with one hidden layer and only one or two output units is sufficient. Its basic structure is illustrated in Figure 4.2. The value of the output unit can be expressed as:
f(x) = F_2\left( w_0^{(2)} + \sum_{j=1}^{r} w_j^{(2)} F_1\left( w_{j0}^{(1)} + \sum_{i=1}^{p} w_{ji}^{(1)} x_i \right) \right),
where x_i are the input units, w_{ji}^{(1)} and w_j^{(2)} are the weights of the hidden and the output layer respectively, and F_1 and F_2 are the transfer functions from the input to the hidden layer and from the hidden to the output layer respectively. The transfer function is usually a sigmoid, e.g. the logistic function. The parameters of the network are determined iteratively, commonly via the backpropagation procedure.
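The output formula above can be sketched directly; the weights below are arbitrary illustration values, not trained parameters.

```python
# Minimal sketch of the MLP output f(x) with one hidden layer and logistic
# transfer functions F1 and F2. Weights are arbitrary illustration values.
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def mlp_output(x, w1, w2):
    """w1: list of r rows [w_j0, w_j1, ..., w_jp] for the hidden layer;
    w2: list [w_0, w_1, ..., w_r] for the output layer."""
    hidden = [logistic(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
              for row in w1]
    return logistic(w2[0] + sum(w * z for w, z in zip(w2[1:], hidden)))

# A 2-2-1 network: two inputs, two hidden units, one output unit.
w1 = [[0.1, 0.5, -0.4],
      [-0.3, 0.8, 0.2]]
w2 = [0.2, 1.0, -1.5]
y = mlp_output([1.0, 2.0], w1, w2)
print(y)   # a value strictly between 0 and 1
```

Training would adjust w1 and w2 iteratively, e.g. by backpropagation; only the forward pass is shown here.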
Figure 4.2: A multilayer perceptron network with one hidden layer and one output unit (inputs x_1, ..., x_p; hidden units z_1, ..., z_r; weights w^{(1)} and w^{(2)}; output y = f(x)).
Results
For probabilistic models the softmax transfer function is intended for the outputs. However, we found a faulty usage of this function in XploRe; it is commented on in Appendix B.2. Therefore we used an MLP network with a logistic transfer function for the output, a quadratic least squares error function and no skip connections.
We split the TRAIN sample into two subsamples stored in the files datann1.dat (with 2779 observations) and datann2.dat (with 1356 observations) respectively. The first subsample is used for the actual network training; it looks for an MLP network with the minimal mean squared error (MSE). The latter subsample is used for validation and
observed predicted misclass. overall
0 1 misclass.
0 2579 50 1.9% 110
1 60 90 40.0% (4.0%)
Table 4.7: Misclassification rates of the 23-12-1 MLP network for the small training sample.
observed predicted misclass. overall
0 1 misclass.
0 1197 62 4.9% 148
1 86 11 88.7 % (10.9%)
Table 4.8: Misclassification rates of the 23-12-1 MLP network for the small validating sample.
observed predicted misclass. overall
0 1 misclass.
0 2569 60 2.3% 120
1 60 90 40.0% ( 4.3%)
Table 4.9: Misclassification rates of the 17-13-1 MLP network for the small training sample.
observed predicted misclass. overall
0 1 misclass.
0 1205 54 4.3% 135
1 81 16 83.5% (10.0%)
Table 4.10: Misclassification rates of the 17-13-1 MLP network for the small validating sample.
it should prevent overfitting, so that the model is not built for one particular sample only.
We computed many models and looked at the MSE and the misclassification rates of the validation data set. The misclassification rates are computed using prior probabilities obtained from the small training sample. Finally we chose an MLP network with 12 units in the hidden layer, which reached an MSE of 243.58. The final 23-12-1 MLP network⁴ is saved in the files mlp1.*.
Tables 4.7 and 4.8 show the misclassification rates which the final MLP network reaches in the small training and the validating sample respectively. The resulting MLP network predicts more than 98% of the good clients correctly. Out of the bad

⁴ Here 23-12-1 denotes 23 input units, 12 units in the hidden layer and 1 output unit.
clients, 40% are misclassified. Altogether there are 4% misclassified clients. The results get slightly worse when using the MLP network on the small validation set: almost 5% of the good clients and more than 88% of the bad clients are misclassified!
A problem with neural networks is the lack of a procedure for choosing significant input units. We therefore took inspiration from the logistic regression and restricted the input layer to 17 units only, as suggested in Section 4.1 by the quantlet glmforward. That is, we built the network only on the knowledge of the variables X1, X2, X5, X7–X10, X12–X14, X16 and X18–X23. The resulting network has 13 hidden units and an MSE of 310.35. This network is stored in the files mlp2.*. Table 4.9 shows its misclassification rates. This restricted MLP network has a higher misclassification rate for the good clients from the small training sample (2.3%), but its performance in the small validating sample is a little better: 4.3% of the good clients and only 83.5% of the bad clients are misclassified (Table 4.10).
Prediction
Before we compute the misclassification rates of the TEST data set, we check the prediction on the whole TRAIN sample which built the model. The results for the 23-12-1 MLP and the 17-13-1 MLP network are shown in Table 4.11 and Table 4.12 respectively. The restricted network has a better prediction of the bad
observed predicted misclass. overall
0 1 misclass.
0 3771 117 3.0% 263
1 146 101 59.1% (6.4%)
Table 4.11: Misclassification rates of the 23-12-1 MLP network for the TRAIN data set.
observed predicted misclass. overall
0 1 misclass.
0 3751 137 3.5% 275
1 138 109 55.9% (6.7%)
Table 4.12: Misclassification rates of the 17-13-1 MLP network for the TRAIN data set.
observed predicted misclass. overall
0 1 misclass.
0 1815 105 5.5% 215
1 110 15 88.0% (10.5%)
Table 4.13: Misclassification rates of the 23-12-1 MLP network for the TEST data set.
observed predicted misclass. overall
0 1 misclass.
0 1813 107 5.6% 217
1 110 15 88.0% (10.6%)
Table 4.14: Misclassification rates of the 17-13-1 MLP network for the TEST data set.
clients: it misclassified 8 clients (3.2%) fewer than the full network; on the other hand, for the good clients it misclassifies 20 clients (0.5%) more than the full network. Altogether the restricted 17-13-1 MLP network describes the TRAIN data set slightly worse than the full 23-12-1 MLP network: it misclassifies 12 clients more (0.3%).
Note that the number of misclassified clients in the small training sample plus the number of misclassified clients in the small validation sample is not exactly equal to the number of misclassified clients in the TRAIN data set (no matter which network: 110 + 148 ≠ 263 and 120 + 135 ≠ 275). This is due to the prior probabilities we used. We denote a client as bad if its neural network output is greater than the rate of good clients in the corresponding training sample, that is 94.6% in the small training sample and 94.0% in the TRAIN data set.
Table 4.13 and Table 4.14 show the final results of prediction on the TEST data set. There is no difference in predicting the bad clients. The restricted 17-13-1 MLP network misclassifies only 2 clients more than the full 23-12-1 MLP network. Thus these two neural networks seem to give almost the same results.
Discussion
Neural networks represent very flexible models with good performance. Although there are various architectures of neural networks, more than 50% of applications are
using the multilayer perceptron (MLP) network, which is both simple and well known. Choosing the number of units in the hidden layer is problematic. West (2000) uses an analogy of the forward stepwise procedure in logistic regression: the so-called cascade learning starts with one neuron in the hidden layer and adds further neurons as long as the performance improves. As one can see, logistic regression may be classified as a simple MLP with one processing unit in one hidden layer and the logistic function as the sigmoid activation function.
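The cascade-learning idea can be sketched as a simple growth loop. The training-and-scoring callable is a stand-in, since West's exact procedure is not reproduced in the text; a real version would train an MLP with r hidden units and return its validation misclassification rate.

```python
# Skeleton of cascade learning: grow the hidden layer one unit at a time and
# stop when validation performance stops improving. train_and_score is a
# stand-in for fitting an MLP with r hidden units and returning its
# validation misclassification rate.

def cascade_learning(train_and_score, max_units=20):
    best_units, best_error = 1, train_and_score(1)
    for r in range(2, max_units + 1):
        error = train_and_score(r)
        if error >= best_error:        # adding a neuron no longer helps
            break
        best_units, best_error = r, error
    return best_units, best_error

# Toy stand-in: validation error improves up to 4 hidden units, then worsens.
errors = {1: 0.20, 2: 0.15, 3: 0.12, 4: 0.10, 5: 0.11, 6: 0.13}
units, err = cascade_learning(lambda r: errors.get(r, 0.5))
print(units, err)   # -> 4 0.1
```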
4.3 Radial Basis Function Neural Network
The radial basis function (RBF) network is another architecture of feedforward neural networks, which has in principle only one hidden layer. The hidden units are also called clusters, as the observations are clustered to one of the hidden units, which are represented by radially symmetric functions. The weights of the hidden layer then represent the centers of these clusters (the means of the radial functions). At the first stage these centers, as well as the deviances of the radial basis functions, are to be found (via so-called unsupervised learning). Next, the weights of the output layer are determined via supervised learning (similarly to the MLP networks). RBF neural networks are explained comprehensively in Appendix A.
observed predicted misclass. overall
0 1 misclass.
0 2505 124 4.7% 248
1 124 26 82.7% (8.9%)
Table 4.15: Misclassification rates of the 23-100-1 RBF network for the small training sample.
observed predicted misclass. overall
0 1 misclass.
0 1200 59 4.7% 142
1 83 14 85.6% (10.5%)
Table 4.16: Misclassification rates of the 23-100-1 RBF network for the small validation sample.
Results
The RBF neural network is built in the same way as the MLP network. We trained the network on the small training sample with 2779 observations (datann1.dat) in order to minimize the mean squared error, and at the same time checked for overfitting by evaluating the model on the small validating sample (datann2.dat).
In the same manner as for the MLP network, we tried many different networks with various learning parameters and different numbers of hidden units until we got optimal results. There are again two models, one with 23 input units (the full model) and another one with only 17 inputs, as suggested in Section 4.1.
The first model uses 100 clusters; we denote it the 23-100-1 RBF neural network and save it in rbf1.rbf. The latter contains only 80 clusters and we denote it the 17-80-1 RBF network (stored in rbf2.rbf). Misclassification rates for the 23-100-1 RBF network are given in Table 4.15 for the small training sample and in Table 4.16 for the validating sample. The 17-80-1 RBF network's misclassification rates are shown in Tables 4.17 and 4.18. In comparison with the MLP networks, both the 23-100-1 and the 17-80-1 RBF networks misclassify almost twice as much in the small training sample as the MLP does. However, the misclassification rates in the validation sample are reasonable: 10.5% and 10.8% of overall misclassified observations.
observed predicted misclass. overall
0 1 misclass.
0 2510 119 4.5% 238
1 119 31 79.3% (8.6%)
Table 4.17: Misclassification rates of the 17-80-1 RBF network for the small training sample.
observed predicted misclass. overall
0 1 misclass.
0 1198 61 4.8% 146
1 85 12 87.6% (10.8%)
Table 4.18: Misclassification rates of the 17-80-1 RBF network for the small validation sample.
Prediction
Table 4.19 shows the results of the 23-100-1 RBF neural network. It misclassifies exactly the same number of observations as the 23-12-1 MLP network: 10.5%. However, the RBF network predicts the bad applicants better: there are 109 misclassifications (87.2%). Out of the good applicants, 106 (5.5%) were misclassified. Results for the 17-80-1 RBF network are shown in Table 4.20. It performs even a little better: 106 bad and 103 good applicants were misclassified, that is altogether 209 (10.2%) misclassified applicants.
observed predicted misclass. overall
0 1 misclass.
0 1814 106 5.5% 215
1 109 16 87.2% (10.5%)
Table 4.19: Misclassification rates of the 23-100-1 RBF network for the TEST data set.
observed predicted misclass. overall
0 1 misclass.
0 1817 103 5.4% 209
1 106 19 84.8% (10.2%)
Table 4.20: Misclassification rates of the 17-80-1 RBF network for the TEST data set.
Discussion
Radial basis function neural networks are supposed to give better predictions than MLP networks. That is indeed true; however, in our case their performance is only slightly better (8 misclassified observations fewer than the MLP). Due to the unsupervised learning, the computation of RBF networks proceeds very fast. One may also notice the misclassification rates of the small training sample: while MLP networks tend to overfit, the results from the RBF networks are more robust; the misclassification rates between the small training and the validation sample differ by only about 2 percentage points.
5 Other Methods
In the vast amount of publications treating credit scoring there are many other techniques which may also be used. This section gives a short overview of some of these methods and refers to further literature.
5.1 Probit Regression
Sometimes an alternative to the logit model is used. The probit model is another variant of generalized linear models, derived by letting the link function be the standard normal distribution function. Since the logistic link function closely approximates that of a normal random variable, the results are similar.
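The closeness of the two link functions can be checked numerically. The scaling factor 1.702 is a classical approximation constant, not a value from the text; with it, the logistic curve differs from the normal distribution function by less than 0.01 everywhere.

```python
# Numerical check that the logistic function closely approximates the standard
# normal distribution function. The factor 1.702 is the classical scaling
# constant (an assumption outside the text) that makes the curves nearly match.
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_cdf(x, scale=1.702):
    return 1.0 / (1.0 + math.exp(-scale * x))

# Maximum absolute difference on a fine grid over [-5, 5].
max_diff = max(abs(normal_cdf(x / 100.0) - logistic_cdf(x / 100.0))
               for x in range(-500, 501))
print(max_diff)   # below 0.01
```

This is why probit and logit fits on the same data yield very similar classifications, differing mainly in the scale of the estimated coefficients.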
5.2 Semiparametric Regression
In order to give more attention to the metric variables, one may estimate them nonparametrically. Härdle et al. (2000a) show that semiparametric methods perform better than the logistic regression. The resulting model then consists of a linear and a nonlinear part:

E[Y | X] = E[Y | (V, W)] = G(V^\top \beta + m(W)),

where \beta = (\beta_1, \ldots, \beta_v)^\top,
so that the scalar product of the vector of observations x_i and the weight vector w is above this cut-off point for the non-creditworthy clients and below this point for the creditworthy clients. The sum of the errors e_i is minimized with respect to the unknown parameters:

\min \; e_1 + e_2 + \ldots + e_{n_G + n_B}

subject to

w_1 x_{i1} + w_2 x_{i2} + \ldots + w_p x_{ip} \ge c - e_i, \quad 1 \le i \le n_B
w_1 x_{i1} + w_2 x_{i2} + \ldots + w_p x_{ip} \le c + e_i, \quad n_B + 1 \le i \le n_G + n_B = N
e_i \ge 0, \quad 1 \le i \le n_G + n_B

where n_B is the number of bad and n_G the number of good clients. Linear programming for credit scoring is described in Thomas (2000).
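The programme above can be assembled in the standard "A_ub x <= b_ub" inequality form accepted by common LP solvers (for instance scipy.optimize.linprog). Fixing the cut-off c in advance is our assumption, made here to avoid the trivial all-zero solution; the text leaves the choice of c open.

```python
# Sketch of the linear programme: stack the weights w_1..w_p and the errors
# e_1..e_N into one decision vector and express both constraint families as
# A_ub x <= b_ub rows. The cut-off c is fixed in advance (an assumption).

def build_lp(bad, good, c=1.0):
    """bad/good: lists of p-dimensional observation vectors.
    Minimises e_1 + ... + e_N subject to w.x_i >= c - e_i for bad clients
    and w.x_i <= c + e_i for good ones; e_i >= 0 is left to variable bounds."""
    p = len(bad[0])
    n = len(bad) + len(good)
    objective = [0.0] * p + [1.0] * n          # minimise the sum of errors
    A_ub, b_ub = [], []
    for i, x in enumerate(bad + good):
        sign = -1.0 if i < len(bad) else 1.0   # flip bad-client rows into <= form
        row = [sign * xj for xj in x] + [0.0] * n
        row[p + i] = -1.0                      # the -e_i term of constraint i
        A_ub.append(row)
        b_ub.append(sign * c)
    return objective, A_ub, b_ub

obj, A, b = build_lp(bad=[[2.0, 1.0]], good=[[0.5, 0.2], [0.1, 0.9]])
print(len(obj), len(A), len(b))   # -> 5 3 3
```

With one bad and two good two-dimensional clients, the decision vector has p + N = 2 + 3 = 5 entries and there is one inequality row per client.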
5.9 Treed Logits
Chipman et al. (2001) studied the prediction problem in direct marketing. They generalized and combined logistic regression and the CART technique. The space of all observations is partitioned through a binary tree. In each bottom node a different logit model is fitted (instead of simply observed frequencies of response). Treed logits are interpretable, small and prevent overfitting.
6 Summary and Conclusion
Credit scoring represents a set of common techniques used to decide whether a bank should grant a loan (issue a credit card) to an applicant or not. We have presented several methods and shown their usage in XploRe. The results of these methods are summarized in Table 6.1.
Artificial neural networks are very flexible models; however, they provide slightly worse performance than the traditional logistic regression, which misclassifies 10 observations (0.5%) fewer than the best of the neural network models. Logit misclassifies 98 good clients, denoting them as bad; 101 of the bad clients are granted the loan, as they are supposed to be creditworthy. Additionally, logistic regression provides statistical tests to identify how important each of the predictor variables is. Our analysis thus showed that the neural network models did not manage to beat logit in prediction. However, due to the flat-maximum effect (Lovie and Lovie, 1986) one is unlikely to achieve a great deal of improvement from better statistical modelling on the same set of matching variables.
misclass. logit MLP MLP RBF RBF
(23-12-1) (17-13-1) (23-100-1) (17-80-1)
good 98 105 107 106 103
bad 101 110 110 109 106
overall 199 215 217 215 209
overall in % 9.7% 10.5% 10.6% 10.5% 10.2%
Table 6.1: Misclassification rates of the methods tested.
7 Extensions
The objective of most credit scoring models is to minimize the misclassification rate or the expected default rate. However, one should pay more attention to the term misclassification rate itself. Usually, the overall misclassification rates are compared to decide about the prediction of various models. However, there are two types of misclassification. Firstly, one may denote a non-creditworthy client as creditworthy; the other way round, it is possible to denote a creditworthy client as non-creditworthy. The latter is a loss of profit, but it is not as bad as the former mistake, which means a direct loss for the bank. Therefore the bank is not trying to minimize the misclassification rate, but to maximize its profit. One possible solution would be the implementation of a cost matrix. The preferred method then minimizes the term L = r_1 w_1 + r_2 w_2, where r_1 and r_2 are the numbers of bad applicants classified as creditworthy and of good applicants classified as not creditworthy respectively, and w_1, w_2 are weights mirroring the loss and the lost profit. These weights are to be estimated. Unfortunately, there are not many papers on this topic yet. Tam and Kiang (1992) show that incorporating the cost matrix into neural networks and discriminant analysis is possible.
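The cost-based criterion L = r_1 w_1 + r_2 w_2 can be sketched as a threshold search. The probabilities, candidate thresholds and weights below are illustrative assumptions, not estimates from the study.

```python
# Sketch of choosing a cut-off by minimising the weighted loss
# L = r1*w1 + r2*w2 instead of the raw misclassification rate.
# All numbers below are illustrative assumptions.

def weighted_loss(p_bad, observed, threshold, w1, w2):
    """r1: bad clients accepted as creditworthy; r2: good clients rejected."""
    r1 = sum(1 for p, o in zip(p_bad, observed) if o == 1 and p <= threshold)
    r2 = sum(1 for p, o in zip(p_bad, observed) if o == 0 and p > threshold)
    return r1 * w1 + r2 * w2

p_bad    = [0.02, 0.08, 0.15, 0.55, 0.30, 0.04]
observed = [0,    0,    1,    1,    1,    0]
w1, w2   = 5.0, 1.0      # a default costs five times a rejected good client

best = min((weighted_loss(p_bad, observed, t, w1, w2), t)
           for t in [0.05, 0.1, 0.25, 0.5])
print(best)   # -> (0.0, 0.1): the cut-off 0.1 minimises the weighted loss
```

With asymmetric weights the optimal cut-off drops well below 0.5, mirroring the bank's preference for rejecting doubtful applicants.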
At present the emphasis is shifting from trying to minimize the chance that a client will default on any particular product to looking at how the firm can maximize the profit it can make from that client. As an example we can mention insurance sold on loans (Stanghellini, 1999). Thus a non-creditworthy client may be profitable if he buys the insurance on his loan and becomes delinquent relatively late.
Credit scoring performs better than decisions made by loan officers. However, one bad property of scoring methods is that they are static and estimated at a particular time, so other tools should be used as well. Jacobson and Roszbach (1998) showed that banks using credit scoring grant loans inconsistently with default risk minimization. They suggest value at risk (VaR) as a more adequate measure of losses than default risk. Future research should therefore concentrate more on such topics. We will study methods of credit scoring again in the near future, and in more detail, in a diploma thesis at the Charles University in Prague.
Appendix
A Radial Basis Function Neural Networks
This section describes radial basis function (RBF) neural networks and explains their implementation in XploRe. In general, artificial neural networks (ANN) appeared in the 1940s, but their progress and wider usage were enabled only in the last decade, especially with the development of personal computers. Nowadays they are very widely used in many research and commercial fields and may be successfully applied in credit scoring as well. The fashion of neural networks, originating in the study of brain nerve cells, gave rise to a new terminology, although the roots of neural networks stretch back to much older techniques. Table A.1 shows the basic terminology, with different names for the same subject in the artificial neural network and statistical frameworks:
statistics neural networks
model network
estimation learning
regression supervised learning
interpolation generalization
observations training set
parameters weights
independent variables inputs
dependent variables outputs
Table A.1: Comparison of the neural network and statistical terminology.
The simplest example of an ANN is the perceptron. Multilayer perceptrons were described in Section 4.2; for a thorough explanation see Bishop (1995). Radial basis function neural networks are, like the MLPs, feedforward networks. This means that the signal in the network is passed forward only. But unlike the MLPs, RBF networks stand for a class of neural network models in which the hidden units are activated according to the distance between the input units. RBF networks combine two different types of learning: supervised and unsupervised. First, at the hidden layer training, the input vectors are joined into several clusters (unsupervised learning) and afterward, at the output
Figure A.1: RBF scheme for p = 4, q = 1, r = 3. Dashed lines show various weights of the connections.
layer training, the output of the RBF network is determined by supervised learning. While for the supervised learning we have both the independent variables and the response variable(s), the unsupervised learning must work without the knowledge of the response variable. This concept will become clearer in the next subsection.
Training an RBF network can be essentially faster than the methods used to train MLP networks. Further, a multilayer feedforward network trained with backpropagation does not yield the approximating capabilities of RBF networks. Therefore the theory of RBF neural networks is still the subject of extensive ongoing research (Orr, 1999). One remarkable feature of RBF neural networks is:
RBF networks possess the property of best approximation. An ap
proximation scheme has this property if, in the set of approximating
functions (i.e. the set of functions corresponding to all possible choices
of the adjustable parameters) there is one function which has minimum
approximating error for any given function to be approximated. This
property is not shared by MLPs.
Bishop (1995)
A.1 The Model
Radial basis function neural networks have only one hidden layer. Each of the hidden units (the so-called clusters) implements a radial function. The output units are weighted sums of the clusters' outputs. This is illustrated in Figure A.1.
We suppose that we are given a set \{x_i\}_{i=1}^{N} of N observations from a p-dimensional space. That is, we are given p input variables X_1, \ldots, X_p and, in general, q output variables Y_1, \ldots, Y_q. Let f : R^p \to R^q denote the function we want to approximate via the RBF neural network. The output of an RBF neural network with p input units, r clusters and one output unit is:
f(x) = \Phi\left( w_0 + \sum_{j=1}^{r} w_j \phi_j(x) \right),
where w_0 is the weight of the bias, w_j, j = 1, \ldots, r, are the output weights, \phi_j is a radially symmetric function with two parameters c_j, \sigma_j, and \Phi is the output transfer function. During training the observation points are first joined into r clusters, for instance by a K-means clustering algorithm. Each cluster is represented by a radial function. A radially symmetric function \phi(\cdot) is required to fulfill the condition that if \|x_i\| = \|x_j\| then \phi(x_i) = \phi(x_j). The norm \|\cdot\| is usually taken to be the L_2 norm. The best known radially symmetric function is the p-variate Gaussian function:
\phi_j(x) = \exp\left( -\frac{\|x - c_j\|^2}{2\sigma_j^2} \right), \quad \sigma_j > 0.
Another popular activation function is the generalized inverse multiquadric function:

\phi_j(x) = \left( \|x - c_j\|^2 + \sigma_j^2 \right)^{-\beta}, \quad \sigma_j > 0, \; \beta > 0.
Both of them have the property that \phi \to 0 as \|x\| \to \infty. Other possible choices of radially symmetric functions are the thin-plate spline function:

\phi_j(x) = \|x - c_j\|^2 \ln \|x - c_j\|,
or the generalized multiquadric function:

\phi_j(x) = \left( \|x - c_j\|^2 + \sigma_j^2 \right)^{\beta}, \quad \sigma_j > 0, \; 1 > \beta > 0.

The last two functions have the property that \phi \to \infty as \|x\| \to \infty. The above-mentioned radially symmetric functions are plotted in Figure A.2. However,
Figure A.2: Radially symmetric activation functions (Gaussian, thin-plate spline, generalized inverse multiquadric, generalized multiquadric).
both theoretical and empirical studies show that estimation results are relatively insensitive to the exact form of the radially symmetric activation function. The most commonly used radial function is the p-variate Gaussian function. It has not only the attractive property of product separability (it can be rewritten as a product of p univariate functions), but also other useful analytical properties.
We must estimate the centers (c_j) and widths (\sigma_j) of the clusters. The cluster weights are actually the coordinates of the cluster centers. To get the proper number of clusters one may grow the network to a suitable size starting from one cluster, or alternatively start with as many clusters as observations and trim the clusters as desired. However, the number of clusters (the number of units in the hidden layer) is typically much smaller than N. After the initial training, when the clusters are already known, one may apply supervised learning to estimate the weights of the output units. The determination of the cluster weights (i.e. the centers of the clusters) and widths at the first stage may be seen as a parametric approach.
The second stage of training may be seen as a nonparametric approach, and therefore RBF networks amount to a semiparametric procedure.
Finally, let us mention that the radially symmetric constraint on the activation function is sometimes violated in order to decrease the number of units in the hidden layer or to improve the performance of the network. For instance, the multivariate Gaussian function may be generalized to an elliptically symmetric function by replacing the L_2 norm by the Mahalanobis distance (Härdle and Simar, 2002).
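The forward pass of the model in Section A.1 can be sketched as follows; the centers, widths and weights are arbitrary illustration values rather than trained parameters, and the logistic output transfer function matches the default binary activation described in Section A.2.

```python
# Minimal sketch of the RBF output formula: Gaussian hidden units phi_j,
# a weighted sum with bias w_0, and a logistic output transfer function Phi.
# All parameter values are arbitrary illustrations, not trained weights.
import math

def gaussian_rbf(x, center, sigma):
    dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist2 / (2.0 * sigma ** 2))

def rbf_output(x, centers, sigmas, w0, w):
    s = w0 + sum(wj * gaussian_rbf(x, cj, sj)
                 for wj, cj, sj in zip(w, centers, sigmas))
    return 1.0 / (1.0 + math.exp(-s))          # output transfer function Phi

# A 2-2-1 RBF network: two inputs, two clusters, one output unit.
centers = [[0.0, 0.0], [1.0, 1.0]]
sigmas  = [0.5, 0.5]
y = rbf_output([0.0, 0.0], centers, sigmas, w0=-1.0, w=[2.0, 2.0])
print(y)   # a value strictly between 0 and 1
```

In a full implementation the centers and widths would come from the unsupervised clustering stage and the output weights from the subsequent supervised stage.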
A.2 RBF Neural Networks in XploRe
In this section we briefly describe how to run an RBF neural network in XploRe.
{inp,net,err} = rbftrain(x,y,clust,learn,epochs,mMSE,activ)
trains a radial basis function neural network (slow)
{inp,net,err} = rbftrain2(x,y,clust,learn,epochs,mMSE,activ)
trains a radial basis function neural network (fast)
The quantlets rbftrain and rbftrain2 build a radial basis function neural network. They use the same algorithm; the only difference is that the former is written directly in XploRe while the latter uses a dynamically linked library written in the C programming language. On account of this fact, rbftrain works slowly, but allows the user to change the source code directly, to add new features and to see what exactly is happening. On the other hand, rbftrain2 is a closed product which cannot be changed, but runs fast.
The input parameters x and y are the input and output variables respectively. We assume that x and y have dimensions N×p and N×q respectively. The number of units in the hidden layer is given by the parameter clust; it must be determined by the user, usually clust << N. The parameter learn gives the learning rates. It is a vector with three rows. The first row contains the minimum learning rate and the second row the maximum learning rate for training the clusters. In the algorithm the unsupervised learning starts with the maximum learning rate, which is decreased in each learning epoch until it reaches the minimum learning rate in the last training epoch. The last row of the parameter learn is the learning rate
for training the output weights. Each of these learning rates must be from the range (0, 1). The vector epochs has two rows. The first row is the number of training epochs for the hidden layer and the second row contains the number of epochs to train the output layer. The training is stopped either when the output units have already been trained epochs[2] times or when the mean squared error reaches the value given by the parameter mMSE. The optional input parameter activ determines whether the bipolar sigmoid function (activ = 1):

\frac{1 - e^{-W}}{1 + e^{-W}},
should be used instead of the default binary activation function (activ = 0):
1
1 + e
W
for training the output layer. Here W denotes the sum of the weighted outputs of the hidden units (an exact explanation is given in Section A.4). rbftrain has three output parameters, each of them a composed object. inp contains parameters which came as input and characterize the architecture of this RBF network: inp.inputs, inp.clusters and inp.outputs are the numbers of input units, hidden units and output units respectively. inp.samples is the number of observations (all of these four parameters are scalars). inp.learn, inp.epochs, inp.mMSE and inp.activ are identical to the corresponding input parameters. The output parameter net contains clustersWeights, the estimated weights of the hidden layer; trFonDev, the deviances of the radial functions; outputsWeights, the estimated weights of the output layer; and bias, the weight of the bias for each output unit. Some error functions are stored in the parameter err for each epoch of training the output layer: MSE contains the mean squared error for the whole net in each training epoch; maxDiff and meanDiff contain the maximum and mean difference for each output unit in each training epoch.
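The two activation functions, together with the derivatives that the training loop uses later (Section A.4), can be sketched in Python. This is only an illustration of the formulas, not part of the XploRe library; all function names are ours.

```python
import math

def binary_sigmoid(w):
    """Binary (logistic) sigmoid: maps the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-w))

def bipolar_sigmoid(w):
    """Bipolar sigmoid: maps the real line to (-1, 1)."""
    return (1.0 - math.exp(-w)) / (1.0 + math.exp(-w))

def binary_sigmoid_deriv(w):
    """Derivative expressed through the function value: f * (1 - f)."""
    f = binary_sigmoid(w)
    return f * (1.0 - f)

def bipolar_sigmoid_deriv(w):
    """Derivative expressed through the function value: 0.5 * (1 + f) * (1 - f)."""
    f = bipolar_sigmoid(w)
    return 0.5 * (1.0 + f) * (1.0 - f)
```

Note that the bipolar sigmoid is simply 2 * binary_sigmoid(w) - 1, i.e. a rescaled logistic function.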
netOut = rbfpredict(x,rbfnet,miny,maxy) predicts the output of a given
RBF neural network
When the RBF neural network is computed, one usually wants to use it on
another data set. rbfpredict returns the predicted values of the given rbfnet
network for a given dataset x. The parameters miny and maxy determine the range of
the estimated output netOut.
{netOut,AED,MSE} = rbftest(x,y,rbfnet) tests the given rbfnet neural
network
The quantlet rbftest allows one to test the quality of a radial basis function network
applied to some input data x and output data y. It returns the predicted output
of the rbfnet neural network netOut, the absolute error difference AED = |y - netOut|
and the mean squared error MSE.
rbfinfo(rbfnet) shows information about the given network rbfnet
The quantlet rbfinfo writes the known information about the given RBF neural
network rbfnet into the output window. It could print, for example, the following
output:
[ 1,] " An 2 - 2 - 1 RBF network"
[ 2,] " training epochs: 5 - 5"
[ 3,] " clusters learning rates:"
[ 4,] " 0.200"
[ 5,] " 0.175"
[ 6,] " 0.150"
[ 7,] " 0.125"
[ 8,] " 0.100"
[ 9,] " outputs learning rate: 0.100"
[10,] " minimum mean squared error: 0.050"
[11,] " BINARY sigmoid activation function"
[12,] ""
[13,] " Input  Cluster  Weight "
[14,] " 1 1 0.784"
[15,] " 1 2 0.332"
[16,] " 2 1 0.740"
[17,] " 2 2 0.367"
[18,] ""
[19,] " Cluster  Output  Weight "
[20,] " 1 1 0.428"
[21,] " 2 1 0.755"
The abbreviation 2 - 2 - 1 means 2 input units, 2 units in the hidden layer and
one output unit. The clustering was trained 5 times, as were the output
units. Training of the hidden layer started with a learning rate of 0.2, which was
decreased in each training epoch until the learning rate of 0.1 was reached in the
last epoch. The other terms show the other parameters specified. Finally, the weights
between the input and the hidden layer, as well as the weights between the hidden
and the output layer, are written.
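The linearly decreasing schedule of the cluster learning rate can be reproduced with a short Python sketch (the function name is ours):

```python
def cluster_learning_rates(lr_min, lr_max, epochs):
    """Learning rate per epoch: lr_max in the first epoch,
    decreased linearly down to lr_min in the last epoch."""
    if epochs == 1:
        return [lr_max]
    step = (lr_max - lr_min) / (epochs - 1)
    return [lr_max - step * e for e in range(epochs)]
```

For lr_min = 0.1, lr_max = 0.2 and 5 epochs this yields 0.2, 0.175, 0.15, 0.125, 0.1, i.e. exactly the rates printed by rbfinfo above.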
rbfsave(rbfnet,name) saves the given radial basis function network into the
given file
rbfnet = rbfload(name) loads a saved RBF network from a file
Since the result of an RBF neural network fitting is a composed object, two handy
functions for saving and loading radial basis function neural networks are provided.
The network rbfnet can be stored into a selected file using the quantlet
rbfsave. This quantlet folds the output parameters of the RBF network into one
matrix and saves it into a file with the extension .rbf. The way these five
parameters are folded into one matrix is illustrated in Figure A.3. The error functions
are stored in a file with the extension .err. An RBF network, once saved, can
be decoded and reloaded via the quantlet rbfload.
Figure A.3: Scheme of the matrix which stores the RBF network data. The
left panel shows the case inputs > 10, the right panel the case inputs < 10.
A.3 An Example
In this section we run a simple RBF neural network on data consisting of two
clusters. The examples presented here can be followed using the quantlets
rbfexample1 and rbfexample2 respectively. Note that the quantlet rbftrain
internally uses random numbers and thus your results will differ from ours. As
usual, we load the necessary libraries:
library("plot")
library("nn")
The next step is to generate the training set. Our clusters are a mixture of two
normal distributions. Each of them consists of 100 observations.
randomize(10)
n = 100
xt = normal(n,2)+#(-1,+1) | normal(n,2)+#(+1,-1)
yt = (matrix(n)-1) | matrix(n)
The response variable is 0 for the observations from the first cluster and 1 for the
observations from the second cluster respectively. The clusters have the same
variance (identity matrix) and their means are moved to (-1,+1) and (+1,-1)
respectively. To get some idea about the structure of this data, the following
code generates Figure A.4:
col = string("blue",1:n) | string("magenta",1:n)
symb = string("rhomb",1:n) | string("circle",1:n)
xt = setmask(xt, col, symb)
plot(xt)
setgopt(plotdisplay,1,1,"title","Training Data Set")
setgopt(plotdisplay,1,1,"xlabel","x1","ylabel","x2","border",0)
Observations from the first cluster are labeled by blue rhombs and the second
cluster's observations are denoted as magenta circles. Now we are able to build
the RBF neural network. After some trials, we chose the following model:
clusters = 7
learn = 0.01|0.02|0.01
epochs = 2|15
mMSE = 0.05
activ = 0
rbfnet = rbftrain(xt,yt,clusters,learn,epochs,mMSE,activ)
; get the information
rbfinfo(rbfnet)
Figure A.4: The generated training data set with two clusters.
Now the following appears in the XploRe output window:
[ 1,] " An 2 - 7 - 1 RBF network"
[ 2,] " training epochs: 2 - 15"
[ 3,] " clusters learning rates:"
[ 4,] " 0.020"
[ 5,] " 0.010"
[ 6,] " outputs learning rate: 0.010"
[ 7,] " minimum mean squared error: 0.050"
[ 8,] " BINARY sigmoid activation function"
[ 9,] ""
[10,] " Input  Cluster  Weight "
[11,] " 1 1 0.841"
[12,] " 1 2 0.333"
[13,] " 1 3 0.169"
[14,] " 1 4 0.877"
[15,] " 1 5 0.521"
[16,] " 1 6 0.502"
[17,] " 1 7 0.736"
[18,] " 2 1 0.458"
[19,] " 2 2 0.447"
[20,] " 2 3 0.704"
[21,] " 2 4 0.907"
[22,] " 2 5 0.159"
[23,] " 2 6 0.643"
[24,] " 2 7 0.317"
[25,] ""
[26,] " Cluster  Output  Weight "
[27,] " 1 1 0.002"
[28,] " 2 1 0.764"
[29,] " 3 1 1.346"
[30,] " 4 1 0.122"
[31,] " 5 1 0.598"
[32,] " 6 1 0.323"
[33,] " 7 1 0.612"
That is, the network has two input units (xt ∈ R^2), seven hidden units and one
output unit. While the weights of the hidden layer were determined within two
training epochs, the weights of the output units were trained 15 times. These weights
are given above.
Now we generate the testing data set x, with the same distribution as xt, and
predict the output of the network. Afterward we compare the predicted outcome
with the true response; finally we save the RBF neural network:
randomize(100)
x = normal(n,2)+#(-1,+1) | normal(n,2)+#(+1,-1)
y = (matrix(n)-1) | matrix(n)
; predict the output
yp = rbfpredict(x,rbfnet,0,1) > 0.5 ; class either 0 or 1
; test the predicted values
test = rbftest(x,y,rbfnet)
test.MSE
rbfsave(rbfnet,"rbf_net")
Now we determine the misclassied observations and show the results of the
prediction in the left panel of Figure A.5.
misc = paf(1:2*n,y!=yp) ; misclassified
crct = paf(1:2*n,y==yp) ; correctly classified
rm = rows(misc)
symb2 = string("fill",1:rm)+symb[misc]
xm = setmask(x[misc],col[misc],symb2,"large")
(Panels: "RBF network (misclassified 10.00%)" and "LDA (misclassified 10.00%)"; axes x1 and x2.)
Figure A.5: Comparison of the discriminatory power of the RBF network and
linear discriminant analysis (LDA).
rbfexample1.xpl
xc = setmask(x[crct],col[crct],symb[crct])
pM = rm/2/n *100 ; percentage of misclassified
spM = string("%1.2f",pM)+"%)"
; create display
setsize(600,320)
rbf = createdisplay(1,2)
show(rbf,1,1,xc,xm)
setgopt(rbf,1,1,"title","RBF network (misclassified "+spM)
setgopt(rbf,1,1,"xlabel","x1","ylabel","x2","border",0)
To see the predictive power of the RBF neural network, we compare it with linear
discriminant analysis. The misclassified observations are labeled with large filled
symbols.
mu1 = mean(xt[1:n])
mu2 = mean(xt[n+1:2*n])
mu = (mu1+mu2)/2
lin = inv(cov(xt))*(mu1-mu2)
;
y = (matrix(n)-1) | matrix(n) ; true
yp = (x-mu)*lin <= 0 ; predicted
misc = paf(1:2*n,y!=yp) ; misclassified
crct = paf(1:2*n,y==yp) ; correctly classified
rm = rows(misc)
symb3 = string("fill",1:rm)+symb[misc]
xm = setmask(x[misc],col[misc],symb3,"large")
xc = setmask(x[crct],col[crct],symb[crct])
x = setmask(x, col, symb)
;
pM = rm/2/n *100 ; percentage of misclassified
spM = string("%1.2f",pM)+"%)"
show(rbf,1,2,xc,xm)
setgopt(rbf,1,2,"title","LDA (misclassified "+spM)
setgopt(rbf,1,2,"xlabel","x1","ylabel","x2","border",0)
In Figure A.5 we may compare the performance of linear discriminant analysis
and our RBF neural network. They misclassify different observations but finally
reach the same misclassification rate of 10%.
The second example applies the same methods to a slightly more complicated data set,
which is displayed in Figure A.6. The estimated RBF network has the following
properties:
Figure A.6: The second generated training data set with two clusters.
(Panels: "RBF network (misclassified 13.00%)" and "LDA (misclassified 17.50%)"; axes x1 and x2.)
Figure A.7: Comparison of the discriminatory power of the RBF network and
linear discriminant analysis for the second data set.
rbfexample2.xpl
[ 1,] " An 2 - 7 - 1 RBF network"
[ 2,] " training epochs: 3 - 10"
[ 3,] " clusters learning rates:"
[ 4,] " 0.045"
[ 5,] " 0.033"
[ 6,] " 0.020"
[ 7,] " outputs learning rate: 0.005"
[ 8,] " minimum mean squared error: 0.050"
[ 9,] " BINARY sigmoid activation function"
And the final comparison in Figure A.7: 13% misclassified by the RBF network
against 17.5% by linear discriminant analysis.
A.4 Detailed Description of the Algorithm
In this section we describe the algorithm used in the quantlet rbftrain.
The procedure can be divided into several stages: first the validity of the input
parameters is checked, then the arrays for the unsupervised learning are declared
and initialized and the hidden layer is trained. Afterward the arrays for the
supervised learning are declared and initialized and the output layer is
trained. Finally, the parameters for the output of the quantlet are prepared.
The quantlet gets six obligatory parameters. The matrix of predictor variables
x is assumed to have dimension N x p, the response variables are stored in
the N x q matrix y; the other input variables were already commented on in
Section A.2. After initial checks of the validity of all input parameters, one
declares the arrays for the unsupervised learning.
clustersOutput = matrix(clusters)
clustersWeights = uniform(clusters,inputs)
trFonDev = matrix(clusters) * 0
Here inputs is the number of input variables (p). clustersOutput will store the
output signal from the cluster nodes; clustersWeights and trFonDev are the
future output parameters storing the centers of the clusters and the deviances of the
transfer functions respectively. Before one can start with the training procedure,
the input data must be normalized:
minx = min(x)
maxx = max(x)
miny = min(y)
maxy = max(y)
x = (x - minx) ./ (maxx - minx)
y = (y - miny) ./ (maxy - miny)
Now the unsupervised training may begin:
e1 = 1
while(e1<=epochs[1])
i = 1
while(i<=samples)
; find the nearest cluster
X = matrix(clusters) * x[i,]
; eucl. distance: ith point from cluster centers:
clustersOutput = sum((clustersWeights - X).*(clustersWeights - X),2)
clustersOutput = sqrt(clustersOutput)
; choose the nearest cluster:
tmp = (1:clusters) .* (clustersOutput == min(clustersOutput))
tmp = paf(tmp, tmp > 0)
clusterChamp = tmp[1] ; if there are multiple minima
; update weights of the nearest cluster
up = interimLR * (x[i,]-clustersWeights[clusterChamp,])
clustersWeights[clusterChamp,]=clustersWeights[clusterChamp,]+up
i = i+1
endo ; while(i<=samples)
tmp = (learn[2] - (learn[2]-learn[1])/max(1|(epochs[1]-1))*(e1-1))
interimLR = tmp * interimLR
if(interimLR == 0)
e1 = epochs[1] + 1
endif
e1 = e1+1
endo ; while(e1<=epochs[1])
The computation proceeds in two loops: the outer loop goes through the training
epochs and the inner loop goes observation by observation. In each epoch every
observation is assigned to the nearest cluster (at the beginning the cluster centers are
chosen randomly). The center of this nearest cluster is updated by the distance
between the cluster and the actual observation, multiplied by the
learning rate interimLR (being 1 at the beginning). At the end of each epoch,
the learning rate interimLR is updated using the input parameters learn[1] and
learn[2].
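The two nested loops can be condensed into a few lines of Python. This sketch only illustrates the competitive-learning idea (the nearest center is pulled towards each observation while the learning rate decays linearly); it is not the XploRe source and all names are ours.

```python
import math

def train_clusters(x, centers, lr_min, lr_max, epochs):
    """Unsupervised phase of an RBF network: move the nearest
    center towards each observation, epoch by epoch."""
    for e in range(epochs):
        # linearly decaying learning rate, lr_max -> lr_min
        lr = lr_max - (lr_max - lr_min) * e / max(1, epochs - 1)
        for obs in x:
            # index of the nearest center (Euclidean distance)
            j = min(range(len(centers)),
                    key=lambda c: math.dist(obs, centers[c]))
            # pull the winning center towards the observation
            centers[j] = [w + lr * (o - w)
                          for w, o in zip(centers[j], obs)]
    return centers
```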
When the centers of the clusters are determined, one must compute the cluster
deviances. Each cluster is assigned the deviance corresponding to the standard
deviation of the distances between its center and the other cluster centers:
c = 1
while(c<=clusters)
W = matrix(clusters) * clustersWeights[c,]
; sum of squared distances of the center c from the other centers
trFonDev[c] = sum(sum((W-clustersWeights).*(W-clustersWeights),2))
trFonDev[c] = sqrt(trFonDev[c] / (clusters-1))
c = c+1
endo
Now one may declare necessary arrays for the supervised learning:
gaussOut = matrix(clusters)
outputsWeights = 2*uniform(outputs,clusters) - 1
outWeightsCorrect = matrix(outputs,clusters)
bias = 2*uniform(outputs) - 1
maxDiff = -1 * matrix(epochs[2],outputs)
meanDiff = 0 * matrix(epochs[2],outputs)
MSE = 0 * matrix(epochs[2])
aMSE = +Inf
Here outputs denote the number of output units. And nally, one is ready to
train the output layer. The training goes in the outer loop through the epochs
and in the inner loop observation by observation:
e2 = 1
while(e2<=epochs[2])
i = 1
while(i<=samples)
outError = 0 ; error of this output unit
; calculate the Gaussian output of clusters
X = matrix(clusters) * x[i,]
; exponential
clustersOutput = sum((clustersWeights - X).*(clustersWeights - X),2)
clustersOutput = sqrt(clustersOutput)
gaussOut = exp( -(clustersOutput ./ trFonDev)^2 )
; calculate output signals and their derivative for each cluster
G = matrix(outputs) * gaussOut
; lin. comb. of the inputs and weights of the output layer
sumOfWeightedInputs = sum(outputsWeights .* G,2) + bias
if(!activ) ; binary sigmoid function
outputSignal = 1 ./ ( 1+exp(-sumOfWeightedInputs) )
; first derivative of the output signal
outSigDeriv = outputSignal .* (1 - outputSignal)
else ; bipolar sigmoid function
outputSignal = 2 ./ ( 1+exp(-sumOfWeightedInputs) ) - 1
; first derivative of the output signal
outSigDeriv = 0.5 * (1 + outputSignal) .* (1 - outputSignal)
endif
;
; calculate output errors
errInfoTerm = (y[i,] - outputSignal) .* outSigDeriv
realErrDiff = abs(y[i,] - outputSignal) .* (maxy - miny)
outError = sum(realErrDiff^2)
meanDiff[e2,] = meanDiff[e2,] + realErrDiff / samples
tmp = realErrDiff > maxDiff[e2,]
if(sum(tmp))
tmp = (1:outputs) .* tmp
ind = paf(tmp,tmp>0)
maxDiff[e2,ind] = realErrDiff[ind]
endif
MSE[e2] = MSE[e2] + outError / samples
; correct weights and bias
outWeightsCorrect = learn[3] * errInfoTerm * gaussOut
outputsWeights = outputsWeights + outWeightsCorrect
bias = bias + learn[3] * errInfoTerm
i = i+1
endo ; while(i<=samples)
if(MSE[e2] < aMSE)
aMSE = MSE[e2]
endif
if(aMSE <= mMSE)
e2 = epochs[2] + 1
endif
e2=e2+1
endo ; while(e2<=epochs[2])
Firstly the output of the clusters (i.e. of the radially symmetric functions) must
be determined, that is,

    exp( -( ||x_i - c_j|| / sigma_j )^2 ),

where x_i is the i-th observation, c_j is the center of the j-th cluster and sigma_j
is its standard deviance. The output of the network is determined as the scalar
product of the clusters' outputs and the output weights plus the bias:

    sum_{j=1}^{r} w_j gaussOut_j + bias,
on which an output activation function is applied (either the binary or the bipolar
sigmoid function). Afterward the error functions may be computed: the MSE in each
epoch and the maximum difference, as well as the mean difference, between the
response and the network output of each unit in each epoch. Finally, the weights of the
output units and the bias weights are updated. If the MSE is smaller than the
desired mMSE after any of the epochs, the supervised learning is finished. At the
end, the basic information about the network is written into the XploRe output
window and all of the output parameters are combined into three lists.
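The supervised phase can likewise be condensed into a single-observation update step. The following Python sketch (ours, binary sigmoid case only) illustrates the delta-rule logic of the loop above:

```python
import math

def output_layer_step(gauss_out, weights, bias, target, lr):
    """One delta-rule update of the output weights for a single
    observation and a single output unit (binary sigmoid)."""
    s = sum(w * g for w, g in zip(weights, gauss_out)) + bias
    out = 1.0 / (1.0 + math.exp(-s))               # binary sigmoid
    err_info = (target - out) * out * (1.0 - out)  # error information term
    weights = [w + lr * err_info * g for w, g in zip(weights, gauss_out)]
    bias = bias + lr * err_info
    return weights, bias, out
```

Repeated application of this step drives the output towards the target, which is exactly what the epoch loop does over all observations.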
B Suggestions for Improvements in XploRe
B.1 grdotd
Our large data set helped us to find a small bug. See lines 40-45 of the
quantlet grdotd:
d=(max(x)-min(x))./100
n=rows(x)
{xb,yb}=bindata(x,d,0) ; bin data
sigma=sqrt(cov(x))
h=2.7772*sigma*n^(-0.2) ; determine h by rule of thumb
g=grid(0,d/h,h/d)
There are too many observations (n), hence h is too small relative to d in
the case of the variable X5 and thus the function grid gets wrong parameters. Our
suggestion is to replace lines 42-44 with the following code:
sigma=sqrt(cov(x))
h=2.7772*sigma*n^(-0.2) ; determine h by rule of thumb
while(d>=h) ; correction if d>=h
d = d/2
endo
{xb,yb}=bindata(x,d,0) ; bin data
We used this procedure in the quantlet myGrdotd.xpl and it was sufficient.
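The effect of the suggested correction is easy to see in a small Python sketch (function name ours): the binwidth d is halved until it drops below the bandwidth h, so grid always receives valid parameters.

```python
def correct_binwidth(d, h):
    """Halve the binwidth d until it is smaller than the bandwidth h."""
    while d >= h:
        d = d / 2
    return d
```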
B.2 nnrpredict
We wanted to use the softmax transfer function, which is intended for the categorical
response of our probabilistic model. The softmax function is sometimes called the
multinomial logit:

    f_k(x_i) = exp[f_k(x_i)] / sum_{l=1}^{r} exp[f_l(x_i)],    k = 1, ..., r.
This means that each output unit is a ratio, with the corresponding hidden layer's
output exponentiated in the numerator and the sum of all exponentiated hidden layer
outputs in the denominator. Therefore the resulting output must
lie in the range (0, 1]. However, when trying to use the quantlet nnrpredict we
got weird results. One such result is shown in Table B.1. The corresponding
network can be reproduced through the quantlet error1.xpl. When increasing the number
of units in the hidden layer, one may get results higher than 2. The problem is
that the quantlet nnrpredict calls the quantlet nnrtest with an inappropriate second
argument. It is supposed to be a matrix of inputs and outputs with dimension
N x (p + q). However, nnrpredict cannot assume anything about the results
and thus sends only the inputs to nnrtest. This obviously leads to wrong results.
We could not repair it ourselves, as the testing is done via a DLL
and the exact procedure is unknown.
[ 1,] " "
[ 2,] "=================================================="
[ 3,] " Five number summary"
[ 4,] ""
[ 5,] " Minimum -0.66562481"
[ 6,] " 25% Quartile -0.64135541"
[ 7,] " Median -0.60248665"
[ 8,] " 75% Quartile -0.59112639"
[ 9,] " Maximum -0.28996666"
[10,] "=================================================="
[11,] " "
Table B.1: Five number summary of nnrpredict result, with softmax transfer
function.
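That genuine softmax outputs always lie in (0, 1] (and sum to one) can be verified with a small Python sketch (ours, not the XploRe implementation), which makes clear that values such as those in Table B.1 cannot come from a correctly applied softmax:

```python
import math

def softmax(f):
    """Softmax (multinomial logit): exponentiate each unit's output
    and normalize by the sum over all units."""
    e = [math.exp(v) for v in f]
    s = sum(e)
    return [v / s for v in e]
```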
B.3 nnrnet
The current help is quite incomprehensible; we suggest extending the help
page as follows:
proc(net)=nnrnet (x, y, weights, size, param, wts)
; 
; See_also nnrpredict nnrinfo nnrsave nnrload ann
; 
; Description trains a one hidden layer perceptron feed forward network.
; 
; Usage net=nnrnet (x, y, weights, size{, param {, wts}})
; Input
; Parameter x
; Definition n x p matrix; input variables
; Parameter y
; Definition n x q matrix; output variables
; Parameter weights
; Definition n x 1 vector; weights of observations (for ties in the data)
; Parameter size
; Definition scalar; number of hidden units. p + q + size <= 100
; Parameter param
; Definition 8 x 1 vector; param can have up to 8 rows which mean in
; sequence:
; linear output (0 for no (default) or nonzero for yes),
; entropy error function (0 for no (default) or nonzero for yes),
; softmax (0 for no (default) or nonzero for yes),
; skip connections (0 for no (default) or nonzero for yes),
; maximum value for the starting weights (default is 0.7),
; weight decay (default is 0),
; number of maximal iterations (default is 100) and
; text summarizing output (0 for no or nonzero for yes (default)).
; Parameter wts
; Definition vector; predefined weights (must have as many rows as net.conn)
; Output
; Parameter net.n
; Definition 3 x 1 vector; number of input, hidden and output units
; Parameter net.nunits
; Definition scalar; number of processing units (1+cols(x)+size+cols(y))
; Parameter net.nsunits
; Definition scalar; number of units in case of linear output
; Parameter net.nconn
; Definition vector; internal information about the network topology
; (used e.g. in nnrinfo)
; Parameter net.conn
; Definition vector; internal information about the network topology
; (used e.g. in nnrinfo)
; Parameter net.decay
; Definition scalar; weight decay parameter (=param[6])
; Parameter net.entropy
; Definition scalar; value of the entropy
; Parameter net.softmax
; Definition scalar; softmax indicator (=param[3])
; Parameter net.value
; Definition scalar; value of error function
; Parameter net.wts
; Definition vector; final weights (equal to input parameter wts, if given)
; Parameter net.yh.result
; Definition n x q matrix; the estimated outputs
; Parameter net.yh.hess
; Definition (net.conn x net.conn) matrix; the Hessian matrix
; 
; Note To improve the optimization algorithm it is strongly
; suggested to standardize the data, either by normalizing
; them, z = (x-mean(x))/sqrt(var(x)), or by uniformizing them,
; z = (x-min(x))/(max(x)-min(x))
;
; softmax is used for log probability models. If param[3] == 1,
; then linear output and no entropy error function
; are set automatically
; 
; Link http://... Tutorial Neural Networks
; 
C CD contents
This section reviews the folders on the enclosed CD-ROM and annotates the files.
analysis contains the files used to analyze the data.
error1.xpl is the example producing the result of Section B.2.
fig *.xpl are the files drawing the figures used in the thesis.
fr descr.data.xpl provides all the descriptive statistics and figures.
fr prepare.data.xpl creates the TRAIN and TEST subsamples.
fr prepare.datannet.xpl splits the TRAIN set into 2 small subsamples.
fr*.xpl are the particular files using the methods and testing the results.
myGlmout.xpl makes Figure 4.1.
myGrdotd.xpl is the corrected version of the faulty quantlet grdotd.
myNeuronal.xpl and myRBF.xpl are auxiliary quantlets to run neural networks
several times.
myFrequency.xpl and omyFreq like.tex are auxiliary quantlets, which
produce output in a form that is easier to use in LaTeX.
scr is a script file to construct Tables 3.5-3.14.
dat contains all the data sets used.
datann1.dat and datann2.dat contain the small training and the
small validation data set respectively.
datatest.dat and datatrain.dat represent the TEST and TRAIN data
set respectively.
* C.dat files contain the corresponding data with dummy variables.
french.dat is the full original data set.
mt contains all graphics, code and style files for the thesis.
outputs contains various output information from the analysis.
papers contains some of the papers listed in the bibliography. Data used to create
Figures 1.1 and 2.1 are in the file TV  Varsovie enTempo.pdf.
rbf contains all quantlets handling RBF neural networks, as well as the dynamically
linked library with its source.
References
Anders, U. (1997). Statistische neuronale Netze. Vahlen, München.
Arminger, G., Enache, D. and Bonne, T. (1997). Analyzing Credit Risk Data:
A Comparison of Logistic Discrimination, Classification Tree Analysis, and
Feedforward Networks, Computational Statistics 12: 293-310.
Back, B., Laitinen, T., Sere, K. and van Wezel, M. (1996). Choosing Bankruptcy
Predictors Using Discriminant Analysis, Logit Analysis, and Genetic Algorithms,
Proceedings of the 1st International Meeting on Artificial Intelligence
in Accounting, Finance and Tax, pp. 337-356.
Bishop, Ch.M. (1995). Neural Networks for Pattern Recognition. Oxford University
Press.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1983). Classification
and Regression Trees. Wadsworth Publishers.
Chipman, H., George, E., McCulloch, R. and Rossi, P. (2001). CART Models
with Logit Endpoints, http://gsbwww.uchicago.edu/fac/robert.mcculloch/
research/talks/waterloo1.pdf.
Desai, V.S., Crook, J.N. and Overstreet, G.A. Jr. (1996). A Comparison of Neural
Networks and Linear Scoring Models in the Credit Union Environment,
European Journal of Operational Research 95: 24-37.
Enache, D. (1998). Künstliche neuronale Netze zur Kreditwürdigkeitsüberprüfung
von Konsumentenkrediten. Josef Eul Verlag, Lohmar.
Fahrmeir, L. and Hamerle, A. (1984). Multivariate statistische Verfahren, Walter
de Gruyter, Berlin.
Feess, E. and Schieble, M. (1998). Credit Scoring and Incentives for Loan Officers
in a Principal Agent Model, Working Paper Series: Finance and Accounting,
Goethe University Frankfurt am Main.
Franke, J., Härdle, W. and Hafner, Ch. (2001). Einführung in die Statistik der
Finanzmärkte, Springer Verlag, Heidelberg.
Hand, D.J. and Henley, W.E. (1993). Can reject inference ever work?, IMA
Journal of Mathematics Applied in Business and Industry 5: 45-55.
Härdle, W., Hlávka, Z. and Klinke, S. (2000a). XploRe Application Guide,
Springer Verlag, Heidelberg.
Härdle, W., Müller, M. and Rönz, B. (2001). Credit Scoring, unpublished, Humboldt
University, Berlin.
Härdle, W., Müller, M., Sperlich, S. and Werwatz, A. (2000b). Non- and Semiparametric
Modelling, Humboldt-Universität, Berlin.
Härdle, W. and Simar, L. (2002). Applied Multivariate Statistical Analysis,
Humboldt-Universität Berlin, unpublished.
Jacobson, T. and Roszbach, K. (1998). Bank lending policy, credit scoring and
Value at Risk, Journal of Banking and Finance, SWoPEc No. 260.
Kaiser, U. and Szczesny, A. (2000a). Einfache ökonometrische Verfahren für
die Kreditrisikomessung: Logit- und Probit-Modelle, Working Paper Series:
Finance and Accounting No. 61, Goethe Universität Frankfurt am Main.
Kaiser, U. and Szczesny, A. (2000b). Einfache ökonometrische Verfahren für die
Kreditrisikomessung: Verweildauermodelle, Working Paper Series: Finance
and Accounting No. 62, Goethe Universität Frankfurt am Main.
Klinke, S. and Grassmann, J. (1996). Visualization and Implementation of Feedforward
Neural Networks, Statistical Computing 96.
Krause, C. (1993). Kreditwürdigkeitsprüfung mit Neuronalen Netzen. IDW
Verlag, Düsseldorf.
Lewis, E.M. (1994). An Introduction to Credit Scoring. The Athena, San Raphael,
California.
Lovie, A.D. and Lovie, P. (1986). The Flat Maximum Effect and Linear Scoring
Models for Prediction, Journal of Forecasting 5: 159-168.
Mester, L.J. (1997). What's the Point of Credit Scoring?, Business Review, Federal
Reserve Bank of Philadelphia, September/October 1997.
Müller, M., Kraft, H. and Kroisandt, G. (2002). Assessing Discriminatory Power
of Credit Rating, Discussion Paper No. 67, Sonderforschungsbereich 373,
Humboldt-University Berlin.
Müller, M. and Rönz, B. (1999). Credit Scoring using Semiparametric Methods,
Proceedings of Measuring Risk in Complex Statistical Systems, Humboldt
University Berlin.
Orr, M.J.L. (1996). Introduction to RBF Networks, Technical report, University
of Edinburgh.
Orr, M.J.L. (1999). Recent Advances in Radial Basis Function Networks, Technical
report, University of Edinburgh.
Rönz, B. (1998). Computergestützte Statistik. Humboldt-Universität, Berlin.
Schnurr, Ch. (1997). Kreditwürdigkeitsprüfung mit Künstlichen Neuronalen Netzen.
Deutscher Universitäts-Verlag, Wiesbaden.
Stanghellini, E. (1999). The use of graphical models in consumer credit scoring,
Proceedings of the 52nd International Statistical Institute, pp. 279-282.
Tam, K.Y. and Kiang, Y.M. (1992). Managerial Applications of Neural Networks:
The Case of Bank Failure Predictions, Management Science, pp. 926-947.
Thomas, L.C. (2000). A survey of credit and behavioral scoring: forecasting
financial risk of lending to consumers, International Journal of Forecasting
16: 149-172.
West, D. (2000). Neural Network Credit Scoring Models, Computers & Operations
Research 27: 1131-1152.