Credit Risk Modelling

October 28, 2017
In [79]: #Libraries
library(gmodels)
library(rpart)
library(rpart.plot)
library(pROC)
Type 'citation("pROC")' for a citation.
Attaching package: pROC
The following object is masked from package:gmodels:
ci
The following objects are masked from package:stats:
cov, smooth, var
In [40]: #Reading the data

loan <- readRDS("Loandata.rds")
head(loan)
loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc age
0 5000 10.65 B 10 RENT 24000 33
0 2400 NA C 25 RENT 12252 31
0 10000 13.49 C 13 RENT 49200 24
0 5000 NA A 3 RENT 36000 39
0 3000 NA E 9 RENT 48000 24
0 12000 12.69 B 11 OWN 75000 28
In [41]: names(loan)
dim(loan)
str(loan)
summary(loan)
1. 'loan_status' 2. 'loan_amnt' 3. 'int_rate' 4. 'grade' 5. 'emp_length' 6. 'home_ownership' 7. 'annual_inc' 8. 'age'
1. 29092 2. 8
'data.frame': 29092 obs. of 8 variables:
$ loan_status : int 0 0 0 0 0 0 1 0 1 0 ...
$ loan_amnt : int 5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
$ int_rate : num 10.7 NA 13.5 NA NA ...
$ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
$ emp_length : int 10 25 13 3 9 11 0 3 3 0 ...
$ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
$ annual_inc : num 24000 12252 49200 36000 48000 ...
$ age : int 33 31 24 39 24 28 22 22 28 22 ...
loan_status loan_amnt int_rate grade emp_length

Min. :0.0000 Min. : 500 Min. : 5.42 A:9649 Min. : 0.000
1st Qu.:0.0000 1st Qu.: 5000 1st Qu.: 7.90 B:9329 1st Qu.: 2.000
Median :0.0000 Median : 8000 Median :10.99 C:5748 Median : 4.000
Mean :0.1109 Mean : 9594 Mean :11.00 D:3231 Mean : 6.145
3rd Qu.:0.0000 3rd Qu.:12250 3rd Qu.:13.47 E: 868 3rd Qu.: 8.000
Max. :1.0000 Max. :35000 Max. :23.22 F: 211 Max. :62.000
NA's :2776 G: 56 NA's :809
home_ownership annual_inc age
MORTGAGE:12002 Min. : 4000 Min. : 20.0
OTHER : 97 1st Qu.: 40000 1st Qu.: 23.0
OWN : 2301 Median : 56424 Median : 26.0
RENT :14692 Mean : 67169 Mean : 27.7
3rd Qu.: 80000 3rd Qu.: 30.0
Max. :6000000 Max. :144.0
In [42]: CrossTable(loan$loan_status) #11% are loan defaulters
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 29092
| 0 | 1 |
|-----------|-----------|
| 25865 | 3227 |
| 0.889 | 0.111 |
|-----------|-----------|
In [43]: CrossTable(loan$grade, loan$loan_status, prop.r = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
Total Observations in Table: 29092
| loan$loan_status
loan$grade | 0 | 1 | Row Total |
-------------|-----------|-----------|-----------|
A | 9084 | 565 | 9649 |
| 0.941 | 0.059 | 0.332 |
-------------|-----------|-----------|-----------|
B | 8344 | 985 | 9329 |
| 0.894 | 0.106 | 0.321 |
-------------|-----------|-----------|-----------|
C | 4904 | 844 | 5748 |
| 0.853 | 0.147 | 0.198 |
-------------|-----------|-----------|-----------|
D | 2651 | 580 | 3231 |
| 0.820 | 0.180 | 0.111 |
-------------|-----------|-----------|-----------|
E | 692 | 176 | 868 |
| 0.797 | 0.203 | 0.030 |
-------------|-----------|-----------|-----------|
F | 155 | 56 | 211 |
| 0.735 | 0.265 | 0.007 |
-------------|-----------|-----------|-----------|
G | 35 | 21 | 56 |
| 0.625 | 0.375 | 0.002 |
-------------|-----------|-----------|-----------|
Column Total | 25865 | 3227 | 29092 |
-------------|-----------|-----------|-----------|
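The row proportions above can also be recomputed with base R. The sketch below re-enters the counts printed in the table by hand (the names `grade_counts` and `default_rate` are illustrative and not from the notebook) and derives the default rate per grade with `prop.table()`.

```r
# Counts copied from the CrossTable output above (rows: grade, cols: loan_status)
grade_counts <- matrix(
  c(9084, 565,
    8344, 985,
    4904, 844,
    2651, 580,
     692, 176,
     155,  56,
      35,  21),
  ncol = 2, byrow = TRUE,
  dimnames = list(grade = LETTERS[1:7], loan_status = c("0", "1"))
)

# Row-wise proportions; column "1" is the default rate within each grade
default_rate <- prop.table(grade_counts, margin = 1)[, "1"]
round(default_rate, 3)

# TRUE if the default rate increases with every step from grade A to grade G
all(diff(default_rate) > 0)
```

On the full data set, `prop.table(table(loan$grade, loan$loan_status), 1)` should give the same proportions without retyping the counts.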
In [44]: #Spotting outliers using histogram and scatterplots
hist_1 <- hist(loan$loan_amnt)
In [45]: hist_2 <- hist(loan$loan_amnt, breaks = 200, xlab = "Loan Amount",
main = "Finer analysis of Loan amounts")
In [46]: plot(loan$age, ylab = "Age")
In [47]: outlier_index <- which(loan$age > 122)
loan2 <- loan[-outlier_index,]
plot(loan2$age)
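One caveat with the `-which(...)` pattern used above: if no row matches the condition, `which()` returns an empty vector, and negative indexing with an empty vector drops every row rather than none. A minimal sketch on a made-up data frame (`toy` and its values are purely illustrative), with logical indexing as an alternative, assuming the filtered column has no missing values:

```r
# Made-up data for illustration only
toy <- data.frame(age = c(25, 31, 144, 40),
                  annual_inc = c(30000, 52000, 6e6, 41000))

# Pattern from the notebook: fine when at least one row matches
idx <- which(toy$age > 122)
toy_a <- toy[-idx, ]

# Same pattern when nothing matches: the empty index selects zero rows
idx_none <- which(toy$age > 500)
toy_b <- toy[-idx_none, ]

# Logical indexing keeps rows where the condition is FALSE, match or no match
toy_c <- toy[!(toy$age > 122), ]
toy_d <- toy[!(toy$age > 500), ]

c(nrow(toy_a), nrow(toy_b), nrow(toy_c), nrow(toy_d))  # 3 0 3 4
```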
In [48]: plot(loan$age, loan$annual_inc, xlab = "Age", ylab = "Annual Income")
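In [49]: #Interest Rate
summary(loan$int_rate)
na_index <- which(is.na(loan$int_rate))
#Strategy 1: delete rows with a missing interest rate
#(loan2 is rebuilt from loan here, so the age filter from In [47] is not carried over)
loan2 <- loan[-na_index, ]
#Strategy 2: replace missing values with the median (applied to loan, not loan2)
median_ir <- median(loan$int_rate, na.rm = TRUE)
loan$int_rate[na_index] <- median_ir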
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

5.42 7.90 10.99 11.00 13.47 23.22 2776
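The cell above applies two strategies to different objects: rows with a missing `int_rate` are deleted in `loan2`, while the median is imputed in `loan`. Since the later cells work with `loan2`, the models are fitted on the deletion variant. A minimal sketch of keeping the two variants apart, using a made-up data frame and hypothetical names (`toy`, `toy_deleted`, `toy_imputed`):

```r
# Made-up data for illustration only
toy <- data.frame(
  loan_status = c(0, 0, 1, 0, 1),
  int_rate    = c(10.65, NA, 13.49, NA, 12.69)
)

missing_rate <- is.na(toy$int_rate)

# Strategy 1: delete rows with a missing interest rate
toy_deleted <- toy[!missing_rate, ]

# Strategy 2: keep all rows and replace missing rates with the median
toy_imputed <- toy
toy_imputed$int_rate[missing_rate] <- median(toy$int_rate, na.rm = TRUE)

nrow(toy_deleted)                 # rows left after deletion
sum(is.na(toy_imputed$int_rate))  # missing values left after imputation
```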
In [51]: head(loan2)
loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc age
1 0 5000 10.65 B 10 RENT 24000 33
3 0 10000 13.49 C 13 RENT 49200 24
6 0 12000 12.69 B 11 OWN 75000 28
7 1 9000 13.49 C 0 RENT 30000 22
8 0 3000 9.91 B 3 RENT 15000 22
9 1 10000 10.65 B 3 RENT 100000 28
In [52]: outlier_index_ai <- which(loan2$annual_inc == 6000000)

loan2 <- loan2[-outlier_index_ai, ]
plot(loan2$age, loan2$annual_inc)
1 Logistic Regression Method:
In [56]: #Data Splitting into train and test data
set.seed(567)
#Row numbers for training set

index_train <- sample(1:nrow(loan2), 2/3 * nrow(loan2)) #2/3 of dataset
loan_train <- loan2[index_train, ]

loan_test <- loan2[-index_train, ]
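With only about 11% defaults, a purely random split can leave slightly different default rates in the training and test sets. As an alternative, which is not what the notebook does, here is a sketch of a split stratified on the outcome. It runs on simulated data, and `toy`, `split_stratified` and all values are hypothetical:

```r
set.seed(567)

# Simulated data for illustration only, with a minority of defaults
toy <- data.frame(id = 1:900, loan_status = rbinom(900, 1, 0.11))

# Sample a fraction of the row indices within each outcome class
split_stratified <- function(y, frac = 2/3) {
  idx_by_class <- split(seq_along(y), y)
  unlist(lapply(idx_by_class, function(idx) {
    # sample.int() on positions avoids sample()'s special case for length-1 input
    idx[sample.int(length(idx), floor(frac * length(idx)))]
  }), use.names = FALSE)
}

index_train <- split_stratified(toy$loan_status)
toy_train <- toy[index_train, ]
toy_test  <- toy[-index_train, ]

# Default rates in the two sets should be close to each other
c(train = mean(toy_train$loan_status), test = mean(toy_test$loan_status))
```

Note that `sample()` in the cell above receives a non-integer size (`2/3 * nrow(loan2)`), which R truncates; the sketch makes that rounding explicit with `floor()`.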
In [57]: #Logistic Regression Model
lr_loan <- glm(loan_status ~ age + int_rate + grade + loan_amnt + annual_inc,
family = "binomial", data = loan_train)
summary(lr_loan)
Call:
glm(formula = loan_status ~ age + int_rate + grade + loan_amnt +
annual_inc, family = "binomial", data = loan_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0521 -0.5316 -0.4247 -0.3470 3.5260
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.694e+00 2.172e-01 -12.402 < 2e-16 ***
age -3.674e-03 4.053e-03 -0.906 0.364745
int_rate 6.157e-02 2.391e-02 2.575 0.010011 *
gradeB 3.058e-01 1.124e-01 2.720 0.006530 **
gradeC 6.463e-01 1.636e-01 3.952 7.76e-05 ***
gradeD 7.568e-01 2.075e-01 3.647 0.000265 ***
gradeE 9.111e-01 2.610e-01 3.492 0.000480 ***
gradeF 9.929e-01 3.478e-01 2.855 0.004308 **
gradeG 1.356e+00 4.878e-01 2.781 0.005423 **
loan_amnt -2.604e-06 4.411e-06 -0.590 0.554984
annual_inc -6.039e-06 8.054e-07 -7.498 6.49e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 12213 on 17542 degrees of freedom

Residual deviance: 11715 on 17532 degrees of freedom
AIC: 11737
Number of Fisher Scoring iterations: 5
In [58]: loan_predict <- predict(lr_loan, loan_test, type = "response")

range(loan_predict)
1. 5.06856420627188e-06 2. 0.417383156786665
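With `type = "response"`, `predict()` returns probabilities obtained by applying the inverse logit to the linear predictor. The sketch below does this by hand, using the rounded coefficients from the `summary(lr_loan)` output for a hypothetical applicant (the applicant's values are made up for illustration, and grade C is chosen arbitrarily):

```r
# Coefficients copied (rounded) from the summary(lr_loan) output above
beta <- c(intercept = -2.694, age = -3.674e-03, int_rate = 6.157e-02,
          gradeC = 6.463e-01, loan_amnt = -2.604e-06, annual_inc = -6.039e-06)

# Hypothetical applicant: age 30, 12% rate, grade C, loan of 10000, income of 50000
x <- c(intercept = 1, age = 30, int_rate = 12,
       gradeC = 1, loan_amnt = 10000, annual_inc = 50000)

eta <- sum(beta * x)   # linear predictor (log-odds of default)
p   <- plogis(eta)     # inverse logit: 1 / (1 + exp(-eta))
c(log_odds = eta, probability = p)
```

Grade A is the reference level, so an applicant with grade A would simply have no grade term in the sum.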
In [59]: #Logistic Regression Model with all variables
lr_loan_all <- glm(loan_status ~ ., family = "binomial", data = loan_train)
loan_predict_all <- predict(lr_loan_all, loan_test, type ="response")

#emp_length contains NAs, so some predictions can be NA; na.rm drops them
range(loan_predict_all, na.rm = TRUE)
1. 7.47036992994058e-06 2. 0.408887370208341
In [65]: #Cut off of 15%
lr_cutoff <- ifelse(loan_predict > 0.15, 1, 0)
#confusion Matrix
tab_cm <- table(loan_test$loan_status, lr_cutoff)
tab_cm
lr_cutoff
0 1
0 5937 1844
1 592 399
In [66]: # Computing Accuracy

acc_logit <- sum(diag(tab_cm)) / nrow(loan_test)
acc_logit
0.722298221614227
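Accuracy alone can mislead when only about 11% of loans default. The sketch below re-enters the confusion matrix printed above (the object name `cm` is illustrative) and adds sensitivity, specificity, and the accuracy of a trivial classifier that always predicts "no default":

```r
# Confusion matrix copied from the output above (rows: actual, cols: predicted)
cm <- matrix(c(5937, 1844,
                592,  399),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # share of defaults correctly flagged
specificity <- cm["0", "0"] / sum(cm["0", ])  # share of non-defaults correctly passed

# Accuracy of always predicting "no default" on the same test set
baseline <- sum(cm["0", ]) / sum(cm)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, baseline = baseline), 3)
```

At the 15% cut-off the model is less accurate than the trivial baseline, but it flags roughly 40% of the actual defaults, which the baseline never does. Which trade-off is acceptable depends on the relative cost of the two error types.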
In [80]: #ROC curve for Logistic Regression

roc_logit <- roc(loan_test$loan_status, loan_predict)
plot(roc_logit)
auc(roc_logit)
0.650806104704583
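The AUC can be read as the probability that a randomly chosen default receives a higher predicted probability than a randomly chosen non-default, with ties counted as one half. A minimal base-R sketch of that interpretation via the rank-sum identity, on made-up labels and scores (`auc_rank` is a hypothetical helper, not part of pROC):

```r
# AUC via the rank-sum (Mann-Whitney) identity; tied scores get average ranks
auc_rank <- function(labels, scores) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up labels and predicted probabilities for illustration only
labels <- c(0, 0, 1, 0, 1, 1, 0, 0)
scores <- c(0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.08, 0.15)

auc_rank(labels, scores)  # 13 of the 15 default/non-default pairs are ordered correctly
```

Read this way, the value of about 0.65 reported above says that the model ranks a default above a non-default in roughly 65% of such pairs.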
2 Decision Tree Model:
In [81]: #Decision tree on the full (unbalanced) training set; no undersampling is applied here
tree_undersample <- rpart(loan_status ~ age + int_rate + grade + loan_amnt + annual_inc,
method = "class",
data = loan_train,
control = rpart.control(cp = 0.001))
plot(tree_undersample, uniform = T)
Error in plot.rpart(tree_undersample, uniform = T): fit is not a tree, just a root
Traceback:
1. plot(tree_undersample, uniform = T)
2. plot.rpart(tree_undersample, uniform = T)
3. stop("fit is not a tree, just a root")
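The error means `rpart()` made no split. A likely cause is class imbalance: with roughly 11% defaults, predicting "no default" everywhere is hard to beat on misclassification error. One common remedy is to undersample the non-defaults before fitting. Despite its name, `tree_undersample` above is fitted on the full `loan_train`. A minimal sketch of undersampling on simulated data follows; all names and values are hypothetical, and the 2:1 ratio is an arbitrary choice:

```r
library(rpart)
set.seed(567)

# Simulated, imbalanced data for illustration only; defaults loosely tied to int_rate
n <- 3000
toy <- data.frame(int_rate = runif(n, 5, 23), annual_inc = rlnorm(n, 11, 0.5))
toy$loan_status <- rbinom(n, 1, plogis(-4.3 + 0.15 * toy$int_rate))

# Keep every default and sample twice as many non-defaults
idx_default <- which(toy$loan_status == 1)
idx_nondef  <- which(toy$loan_status == 0)
idx_keep    <- c(idx_default, sample(idx_nondef, 2 * length(idx_default)))
toy_under   <- toy[idx_keep, ]

mean(toy$loan_status)        # default rate before undersampling
mean(toy_under$loan_status)  # one third by construction

# Same call shape as the notebook's, fitted on the undersampled set
tree_toy <- rpart(factor(loan_status) ~ int_rate + annual_inc,
                  method = "class", data = toy_under,
                  control = rpart.control(cp = 0.001))
nrow(tree_toy$frame)  # number of nodes; 1 would mean a root-only tree
```

Other options documented for `rpart()` are changing the class priors or supplying a loss matrix through the `parms` argument. A tree fitted on undersampled data should still be evaluated on the original, unbalanced test set.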