
Credit Risk Modelling

October 28, 2017

In [79]: #Libraries
library(gmodels)
library(rpart)
library(rpart.plot)
library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: pROC

The following object is masked from package:gmodels:

ci

The following objects are masked from package:stats:

cov, smooth, var

In [40]: #Reading the data


loan <- readRDS("Loandata.rds")
head(loan)
loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc age
          0      5000    10.65 B             10 RENT                24000  33
          0      2400       NA C             25 RENT                12252  31
          0     10000    13.49 C             13 RENT                49200  24
          0      5000       NA A              3 RENT                36000  39
          0      3000       NA E              9 RENT                48000  24
          0     12000    12.69 B             11 OWN                 75000  28
In [41]: names(loan)
dim(loan)
str(loan)
summary(loan)
1. ’loan_status’ 2. ’loan_amnt’ 3. ’int_rate’ 4. ’grade’ 5. ’emp_length’ 6. ’home_ownership’
7. ’annual_inc’ 8. ’age’
1. 29092 2. 8

'data.frame': 29092 obs. of 8 variables:
$ loan_status : int 0 0 0 0 0 0 1 0 1 0 ...
$ loan_amnt : int 5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
$ int_rate : num 10.7 NA 13.5 NA NA ...
$ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
$ emp_length : int 10 25 13 3 9 11 0 3 3 0 ...
$ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
$ annual_inc : num 24000 12252 49200 36000 48000 ...
$ age : int 33 31 24 39 24 28 22 22 28 22 ...

loan_status loan_amnt int_rate grade emp_length


Min. :0.0000 Min. : 500 Min. : 5.42 A:9649 Min. : 0.000
1st Qu.:0.0000 1st Qu.: 5000 1st Qu.: 7.90 B:9329 1st Qu.: 2.000
Median :0.0000 Median : 8000 Median :10.99 C:5748 Median : 4.000
Mean :0.1109 Mean : 9594 Mean :11.00 D:3231 Mean : 6.145
3rd Qu.:0.0000 3rd Qu.:12250 3rd Qu.:13.47 E: 868 3rd Qu.: 8.000
Max. :1.0000 Max. :35000 Max. :23.22 F: 211 Max. :62.000
NA's :2776 G: 56 NA's :809
home_ownership annual_inc age
MORTGAGE:12002 Min. : 4000 Min. : 20.0
OTHER : 97 1st Qu.: 40000 1st Qu.: 23.0
OWN : 2301 Median : 56424 Median : 26.0
RENT :14692 Mean : 67169 Mean : 27.7
3rd Qu.: 80000 3rd Qu.: 30.0
Max. :6000000 Max. :144.0

In [42]: CrossTable(loan$loan_status) #11% are loan defaulters

Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|

Total Observations in Table: 29092

| 0 | 1 |
|-----------|-----------|
| 25865 | 3227 |
| 0.889 | 0.111 |
|-----------|-----------|

In [43]: CrossTable(loan$grade, loan$loan_status, prop.r = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)

Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|

Total Observations in Table: 29092

| loan$loan_status
loan$grade | 0 | 1 | Row Total |
-------------|-----------|-----------|-----------|
A | 9084 | 565 | 9649 |
| 0.941 | 0.059 | 0.332 |
-------------|-----------|-----------|-----------|
B | 8344 | 985 | 9329 |
| 0.894 | 0.106 | 0.321 |
-------------|-----------|-----------|-----------|
C | 4904 | 844 | 5748 |
| 0.853 | 0.147 | 0.198 |
-------------|-----------|-----------|-----------|
D | 2651 | 580 | 3231 |
| 0.820 | 0.180 | 0.111 |
-------------|-----------|-----------|-----------|
E | 692 | 176 | 868 |
| 0.797 | 0.203 | 0.030 |
-------------|-----------|-----------|-----------|
F | 155 | 56 | 211 |
| 0.735 | 0.265 | 0.007 |
-------------|-----------|-----------|-----------|
G | 35 | 21 | 56 |
| 0.625 | 0.375 | 0.002 |
-------------|-----------|-----------|-----------|
Column Total | 25865 | 3227 | 29092 |
-------------|-----------|-----------|-----------|
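The row proportions in the table above can be reproduced from the raw counts with base R. The following is a minimal sketch that re-enters the printed counts by hand instead of reading Loandata.rds; the object names `counts` and `default_rate` are made up for this illustration.

```r
# Counts copied from the CrossTable output above (rows = grade, cols = loan_status)
counts <- matrix(
  c(9084, 565,
    8344, 985,
    4904, 844,
    2651, 580,
     692, 176,
     155,  56,
      35,  21),
  ncol = 2, byrow = TRUE,
  dimnames = list(grade = LETTERS[1:7], loan_status = c("0", "1"))
)

# Row-wise proportions: share of non-defaults and defaults within each grade
default_rate <- round(prop.table(counts, margin = 1), 3)
default_rate
```

The share of defaults in column "1" rises steadily from grade A to grade G, which is the pattern the cross table is meant to show.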

In [44]: #Spotting outliers using histogram and scatterplots
hist_1 <- hist(loan$loan_amnt)

In [45]: hist_2 <- hist(loan$loan_amnt, breaks = 200, xlab = "Loan Amount",
                        main = "Finer analysis of Loan amounts")

In [46]: plot(loan$age, ylab = "Age")

In [47]: outlier_index <- which(loan$age > 122)
loan2 <- loan[-outlier_index,]
plot(loan2$age)
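One caveat with the `loan[-which(...), ]` idiom used above: if no row meets the condition, `which()` returns an empty vector, and negative indexing with an empty vector drops every row. A hedged alternative is logical indexing. The toy data frame `toy` below is invented for illustration; only the `age > 122` threshold comes from the cell above.

```r
# Toy data standing in for the loan data (values are made up)
toy <- data.frame(age = c(33, 31, 144, 24),
                  annual_inc = c(24000, 12252, 6000000, 49200))

# Keep rows whose age is plausible; this also works when nothing exceeds the threshold
toy_clean <- toy[toy$age <= 122, ]

# Pitfall: with no matches, -which() selects zero rows
no_match <- which(toy_clean$age > 122)      # integer(0)
nrow(toy_clean[-no_match, ])                # 0 rows, not 3
nrow(toy_clean[toy_clean$age <= 122, ])     # 3 rows, as intended
```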

In [48]: plot(loan$age, loan$annual_inc, xlab = "Age", ylab = "Annual Income")

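In [49]: #Interest Rate
summary(loan$int_rate)
na_index <- which(is.na(loan$int_rate))
#Option 1: delete rows with a missing interest rate
#(loan2 is rebuilt from loan here, so the age filter from In [47] is not carried over)
loan2 <- loan[-na_index, ]
#Option 2: replace missing values with the median (applied to loan; later cells use loan2)
median_ir <- median(loan$int_rate, na.rm = TRUE)
loan$int_rate[na_index] <- median_ir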

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   5.42    7.90   10.99   11.00   13.47   23.22    2776
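As a small illustration of the median-imputation step, here it is applied to the six `int_rate` values printed by `head(loan)` earlier. This is a sketch on a hand-typed vector (the names `int_rate` and `int_rate_imputed` are only for this example), not a rerun on the full data.

```r
# int_rate values of the first six rows printed by head(loan)
int_rate <- c(10.65, NA, 13.49, NA, NA, 12.69)

# Median of the observed values, ignoring NAs
median_ir <- median(int_rate, na.rm = TRUE)

# Replace the missing entries with that median, leave the rest unchanged
int_rate_imputed <- ifelse(is.na(int_rate), median_ir, int_rate)
int_rate_imputed
```

Imputation keeps all rows at the cost of shrinking the variance of `int_rate`, while deletion keeps the variable untouched but loses observations.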

In [51]: head(loan2)

  loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc age
1           0      5000    10.65 B             10 RENT                24000  33
3           0     10000    13.49 C             13 RENT                49200  24
6           0     12000    12.69 B             11 OWN                 75000  28
7           1      9000    13.49 C              0 RENT                30000  22
8           0      3000     9.91 B              3 RENT                15000  22
9           1     10000    10.65 B              3 RENT               100000  28

In [52]: outlier_index_ai <- which(loan2$annual_inc == 6000000)
loan2 <- loan2[-outlier_index_ai, ]
plot(loan2$age, loan2$annual_inc)

1 Logistic Regression Method:
In [56]: #Data Splitting into train and test data

set.seed(567)

#Row numbers for training set
index_train <- sample(1:nrow(loan2), 2/3 * nrow(loan2)) #2/3 of dataset

loan_train <- loan2[index_train, ]
loan_test <- loan2[-index_train, ]
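A quick check of the split arithmetic: `2/3 * nrow(loan2)` is generally not a whole number, and `sample()` truncates a fractional size. The sketch below uses a stand-in row count `n = 26315`, which is inferred from the outputs in this notebook (29092 rows, minus 2776 with a missing `int_rate`, minus 1 income outlier) and is not printed anywhere in the original.

```r
n <- 26315                            # assumed row count of loan2 after cleaning

set.seed(567)
index_train <- sample(1:n, 2/3 * n)   # requested size 17543.33 is truncated

length(index_train)                   # number of training rows
n - length(index_train)               # number of test rows
anyDuplicated(index_train)            # 0 means no row was drawn twice
```

Under that assumption the sizes agree with the 17542 null-deviance degrees of freedom and with the 8772 test cases in the confusion matrix.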

In [57]: #Logistic Regression Model

lr_loan <- glm(loan_status ~ age + int_rate + grade + loan_amnt + annual_inc,
               family = "binomial", data = loan_train)

summary(lr_loan)

Call:
glm(formula = loan_status ~ age + int_rate + grade + loan_amnt +
annual_inc, family = "binomial", data = loan_train)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.0521 -0.5316 -0.4247 -0.3470 3.5260

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.694e+00 2.172e-01 -12.402 < 2e-16 ***
age -3.674e-03 4.053e-03 -0.906 0.364745
int_rate 6.157e-02 2.391e-02 2.575 0.010011 *
gradeB 3.058e-01 1.124e-01 2.720 0.006530 **
gradeC 6.463e-01 1.636e-01 3.952 7.76e-05 ***
gradeD 7.568e-01 2.075e-01 3.647 0.000265 ***
gradeE 9.111e-01 2.610e-01 3.492 0.000480 ***
gradeF 9.929e-01 3.478e-01 2.855 0.004308 **
gradeG 1.356e+00 4.878e-01 2.781 0.005423 **
loan_amnt -2.604e-06 4.411e-06 -0.590 0.554984
annual_inc -6.039e-06 8.054e-07 -7.498 6.49e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 12213  on 17542  degrees of freedom
Residual deviance: 11715  on 17532  degrees of freedom
AIC: 11737

Number of Fisher Scoring iterations: 5
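To read the coefficient table: each estimate is a change in log-odds, so `exp(estimate)` is an odds ratio, and the linear predictor is turned into a probability with the logistic function. The sketch below hard-codes a few of the estimates printed above and scores one hypothetical applicant (age 30, grade C, 12% interest, loan of 10000, income of 50000). The applicant and the names `b`, `x`, `or_gradeC` and `pd` are invented for illustration.

```r
# Estimates copied from summary(lr_loan) above
b <- c(intercept = -2.694, age = -3.674e-03, int_rate = 6.157e-02,
       gradeC = 6.463e-01, loan_amnt = -2.604e-06, annual_inc = -6.039e-06)

# Odds ratio for grade C versus the reference grade A
or_gradeC <- unname(exp(b["gradeC"]))

# Hypothetical applicant (made-up values), in the same order as b
x <- c(intercept = 1, age = 30, int_rate = 12, gradeC = 1,
       loan_amnt = 10000, annual_inc = 50000)

# Linear predictor (log-odds) and predicted probability of default
log_odds <- sum(b * x)
pd <- plogis(log_odds)          # same as 1 / (1 + exp(-log_odds))
c(odds_ratio_gradeC = or_gradeC, pd = pd)
```

For this made-up applicant the odds of default in grade C are about 1.9 times those in grade A, other things equal, and the predicted probability of default is roughly 0.15.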

In [58]: loan_predict <- predict(lr_loan, loan_test, type = "response")


range(loan_predict)

1. 5.06856420627188e-06 2. 0.417383156786665

In [59]: #Logistic Regression Model with all variables

lr_loan_all <- glm(loan_status ~ ., family = "binomial", data = loan_train)

loan_predict_all <- predict(lr_loan_all, loan_test, type = "response")

#Test rows with a missing emp_length get NA predictions, hence na.rm
range(loan_predict_all, na.rm = TRUE)

1. 7.47036992994058e-06 2. 0.408887370208341

In [65]: #Cut off of 15%

lr_cutoff <- ifelse(loan_predict > 0.15, 1, 0)

#confusion Matrix
tab_cm <- table(loan_test$loan_status, lr_cutoff)
tab_cm

lr_cutoff
0 1
0 5937 1844
1 592 399

In [66]: # Computing Accuracy


acc_logit <- sum(diag(tab_cm)) / nrow(loan_test)
acc_logit

0.722298221614227
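Accuracy alone hides how the errors are distributed between the two classes. As a sketch, the confusion matrix printed above can be re-entered by hand to get sensitivity and specificity at the 15% cut-off. The object `cm` is a manual copy of `tab_cm`, not computed from the data.

```r
# Manual copy of tab_cm: rows = actual loan_status, columns = predicted class
cm <- matrix(c(5937, 1844,
                592,  399),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # correct predictions / all cases
sensitivity <- cm["1", "1"] / sum(cm["1", ])    # share of defaults correctly flagged
specificity <- cm["0", "0"] / sum(cm["0", ])    # share of non-defaults correctly passed

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

At this cut-off the model flags about 40% of the actual defaults while passing about 76% of the non-defaults, so the 72% accuracy is driven mostly by the majority class.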

In [80]: #ROC curve for Logistic Regression


roc_logit <- roc(loan_test$loan_status, loan_predict)

plot(roc_logit)

auc(roc_logit)

0.650806104704583
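The AUC can be read as the probability that a randomly chosen defaulter receives a higher predicted probability than a randomly chosen non-defaulter (ties counted as one half). Below is a minimal base-R sketch of that rank-based calculation on a tiny invented example; the function name `auc_rank` and both vectors are hypothetical and it is not how pROC computes the value internally.

```r
# Rank-based (Mann-Whitney) AUC; tied scores receive average ranks
auc_rank <- function(labels, scores) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)          # number of positives (defaults)
  n0 <- sum(labels == 0)          # number of negatives (non-defaults)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

labels <- c(0, 0, 1, 0, 1, 1)                       # made-up outcomes
scores <- c(0.05, 0.10, 0.12, 0.20, 0.30, 0.40)     # made-up predicted probabilities
auc_rank(labels, scores)
```

In this toy example 8 of the 9 defaulter/non-defaulter pairs are ordered correctly, giving 8/9. An AUC of 0.65, as obtained above, means about 65% of such pairs are ordered correctly on the test set.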

2 Decision Tree Model:
In [81]: #Decision tree on the training set with a low complexity parameter (cp = 0.001)
#Note: despite the object name, loan_train is not undersampled here

tree_undersample <- rpart(loan_status ~ age + int_rate + grade + loan_amnt + annual_inc,
                          method = "class",
                          data = loan_train,
                          control = rpart.control(cp = 0.001))

plot(tree_undersample, uniform = T)

Error in plot.rpart(tree_undersample, uniform = T): fit is not a tree, just a root
Traceback:

1. plot(tree_undersample, uniform = T)

2. plot.rpart(tree_undersample, uniform = T)

3. stop("fit is not a tree, just a root")
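The error means `rpart` kept no split and returned a root-only tree, which is a likely outcome when the positive class is rare (about 11% defaults here) and the tree is grown on the unbalanced `loan_train`. One common remedy is to rebalance the training data before fitting. The sketch below runs on simulated data (all names, coefficients and sizes are invented) and undersamples the majority class to a 1:2 ratio. It illustrates the idea under those assumptions and is not the original author's method.

```r
library(rpart)

set.seed(567)
n <- 3000
# Simulated stand-in for loan_train: default risk increases with int_rate
sim <- data.frame(int_rate = runif(n, 5, 23))
sim$loan_status <- rbinom(n, 1, plogis(-7 + 0.35 * sim$int_rate))

# Undersample: keep all defaults, draw twice as many non-defaults at random
idx_def    <- which(sim$loan_status == 1)
idx_nondef <- sample(which(sim$loan_status == 0), 2 * length(idx_def))
under      <- sim[c(idx_def, idx_nondef), ]

# Same rpart settings as in the cell above, fitted on the rebalanced data
tree_under <- rpart(loan_status ~ int_rate, method = "class",
                    data = under, control = rpart.control(cp = 0.001))

table(under$loan_status)
nrow(tree_under$frame)   # number of nodes; more than 1 means the tree has splits
```

Probabilities predicted by a tree fitted on rebalanced data reflect the altered class mix, so they should be evaluated on the untouched test set. Alternatives to undersampling include changing the class priors or supplying a loss matrix through the `parms` argument of `rpart`.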

