In [79]: #Libraries
library(gmodels)
library(rpart)
library(rpart.plot)
library(pROC)
Type 'citation("pROC")' for a citation.
'data.frame': 29092 obs. of 8 variables:
$ loan_status : int 0 0 0 0 0 0 1 0 1 0 ...
$ loan_amnt : int 5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
$ int_rate : num 10.7 NA 13.5 NA NA ...
$ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
$ emp_length : int 10 25 13 3 9 11 0 3 3 0 ...
$ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
$ annual_inc : num 24000 12252 49200 36000 48000 ...
$ age : int 33 31 24 39 24 28 22 22 28 22 ...
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
| 0 | 1 |
|-----------|-----------|
| 25865 | 3227 |
| 0.889 | 0.111 |
|-----------|-----------|
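The proportions in the table above can be reproduced with base R alone. A minimal sketch, using the two printed counts as literal values (the name `status_counts` is made up for illustration and is not part of the notebook):

```r
# Counts of loan_status 0 and 1, copied from the table above
status_counts <- c(`0` = 25865, `1` = 3227)

# Share of each class in the full data set
status_share <- prop.table(status_counts)
round(status_share, 3)
```

Roughly one loan in nine has loan_status 1, so the classes are clearly imbalanced.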
In [43]: CrossTable(loan$grade, loan$loan_status, prop.r = TRUE, prop.c = F, prop.t = F, prop.c
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
| loan$loan_status
loan$grade | 0 | 1 | Row Total |
-------------|-----------|-----------|-----------|
A | 9084 | 565 | 9649 |
| 0.941 | 0.059 | 0.332 |
-------------|-----------|-----------|-----------|
B | 8344 | 985 | 9329 |
| 0.894 | 0.106 | 0.321 |
-------------|-----------|-----------|-----------|
C | 4904 | 844 | 5748 |
| 0.853 | 0.147 | 0.198 |
-------------|-----------|-----------|-----------|
D | 2651 | 580 | 3231 |
| 0.820 | 0.180 | 0.111 |
-------------|-----------|-----------|-----------|
E | 692 | 176 | 868 |
| 0.797 | 0.203 | 0.030 |
-------------|-----------|-----------|-----------|
F | 155 | 56 | 211 |
| 0.735 | 0.265 | 0.007 |
-------------|-----------|-----------|-----------|
G | 35 | 21 | 56 |
| 0.625 | 0.375 | 0.002 |
-------------|-----------|-----------|-----------|
Column Total | 25865 | 3227 | 29092 |
-------------|-----------|-----------|-----------|
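The row proportions above can be checked without gmodels. A small sketch that rebuilds the counts from the printed table as a matrix (the names `grade_counts` and `row_share` are illustrative) and applies `prop.table()` by row:

```r
# Counts per grade copied from the cross table above
# (columns: loan_status 0 and 1)
grade_counts <- matrix(
  c(9084, 565,
    8344, 985,
    4904, 844,
    2651, 580,
     692, 176,
     155,  56,
      35,  21),
  ncol = 2, byrow = TRUE,
  dimnames = list(grade = LETTERS[1:7], loan_status = c("0", "1"))
)

# Row proportions: share of each loan_status within a grade
row_share <- prop.table(grade_counts, margin = 1)
round(row_share, 3)

# Check whether the share of loan_status == 1 rises from each grade to the next
all(diff(row_share[, "1"]) > 0)
```

On the full data the same table would come from `prop.table(table(loan$grade, loan$loan_status), 1)`.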
In [44]: #Spotting outliers using histogram and scatterplots
hist_1 <- hist(loan$loan_amnt)
In [46]: plot(loan$age, ylab = "Age")
In [47]: #Dropping rows with an implausible age (above 122), then re-plotting
outlier_index <- which(loan$age > 122)
loan2 <- loan[-outlier_index,]
plot(loan2$age, ylab = "Age")
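One caveat with negative indexing: when `which()` finds no match it returns an empty vector, and indexing with `-integer(0)` selects zero rows rather than all of them. A short sketch on a made-up data frame (`toy` and its values are illustrative, not the loan data) shows the problem and a logical filter that avoids it:

```r
# Toy data, not the loan data: no age exceeds 122
toy <- data.frame(age = c(22, 35, 58), annual_inc = c(30000, 52000, 41000))

# Negative indexing with an empty index silently drops every row
empty_index <- which(toy$age > 122)
nrow(toy[-empty_index, ])

# A logical condition keeps all rows when nothing matches
# (rows with a missing age would need separate handling)
toy_clean <- toy[toy$age <= 122, ]
nrow(toy_clean)
```

In the notebook the index is not empty, so the original cell works as intended; the filter form is simply the more defensive choice.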
In [48]: plot(loan$age, loan$annual_inc, xlab = "Age", ylab = "Annual Income")
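In [49]: #Interest Rate
summary(loan$int_rate)
na_index <- which(is.na(loan$int_rate))
#Option 1: deleting the rows with a missing interest rate
#(built from loan, so this overwrites the loan2 created in In [47])
loan2 <- loan[-na_index, ]
#Option 2: replacing the missing values with the median
median_ir <- median(loan$int_rate, na.rm = TRUE)
loan$int_rate[na_index] <- median_ir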
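The cell above applies two different treatments of missing values to two different objects. A minimal sketch on a toy column (the values and the names `toy_deleted` and `toy_imputed` are illustrative) keeps the two results side by side so they are easier to compare:

```r
# Toy interest rates with missing values (illustrative, not the loan data)
toy <- data.frame(int_rate = c(10.65, NA, 13.49, NA, 12.69))

na_index <- which(is.na(toy$int_rate))

# Strategy 1: drop the incomplete rows
toy_deleted <- toy[-na_index, , drop = FALSE]

# Strategy 2: keep every row and fill the gaps with the median
toy_imputed <- toy
toy_imputed$int_rate[na_index] <- median(toy$int_rate, na.rm = TRUE)

nrow(toy_deleted)      # rows left after deletion
toy_imputed$int_rate   # no NA remains
```

Deletion shrinks the sample, while imputation keeps all rows at the cost of pulling the filled values toward the middle of the distribution.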
In [51]: head(loan2)
loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc age
1 0 5000 10.65 B 10 RENT 24000 33
3 0 10000 13.49 C 13 RENT 49200 24
6 0 12000 12.69 B 11 OWN 75000 28
7 1 9000 13.49 C 0 RENT 30000 22
8 0 3000 9.91 B 3 RENT 15000 22
9 1 10000 10.65 B 3 RENT 100000 28
1 Logistic Regression Method:
In [56]: #Data Splitting into train and test data
set.seed(567)
summary(lr_loan)
Call:
glm(formula = loan_status ~ age + int_rate + grade + loan_amnt +
annual_inc, family = "binomial", data = loan_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0521 -0.5316 -0.4247 -0.3470 3.5260
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.694e+00 2.172e-01 -12.402 < 2e-16 ***
age -3.674e-03 4.053e-03 -0.906 0.364745
int_rate 6.157e-02 2.391e-02 2.575 0.010011 *
gradeB 3.058e-01 1.124e-01 2.720 0.006530 **
gradeC 6.463e-01 1.636e-01 3.952 7.76e-05 ***
gradeD 7.568e-01 2.075e-01 3.647 0.000265 ***
gradeE 9.111e-01 2.610e-01 3.492 0.000480 ***
gradeF 9.929e-01 3.478e-01 2.855 0.004308 **
gradeG 1.356e+00 4.878e-01 2.781 0.005423 **
loan_amnt -2.604e-06 4.411e-06 -0.590 0.554984
annual_inc -6.039e-06 8.054e-07 -7.498 6.49e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC: 11737
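The cell above shows only `set.seed(567)` and `summary(lr_loan)`; the split and the model fit themselves are not visible. The following is a sketch of how such a step might look, not the notebook's actual code. It runs on synthetic data, and the 70/30 split, the coefficients used to simulate the outcome, and every name ending in `_toy` or starting with `toy` are assumptions made for illustration:

```r
set.seed(567)

# Synthetic stand-in for the loan data (values are made up)
n <- 1000
toy <- data.frame(
  int_rate   = runif(n, 5, 20),
  annual_inc = runif(n, 20000, 120000)
)
# In this toy setup the chance of loan_status == 1 rises with int_rate
toy$loan_status <- rbinom(n, 1, plogis(-4 + 0.15 * toy$int_rate))

# Assumed split: 70% of the rows for training, the rest for testing
index_train <- sample(seq_len(n), size = round(0.7 * n))
toy_train <- toy[index_train, ]
toy_test  <- toy[-index_train, ]

# Logistic regression with the same family as in the summary above
lr_toy <- glm(loan_status ~ int_rate + annual_inc,
              family = "binomial", data = toy_train)

# Predicted probabilities on the held-out rows
pred_toy <- predict(lr_toy, newdata = toy_test, type = "response")
range(pred_toy)
```

Using `type = "response"` returns probabilities between 0 and 1, which is what a cutoff is later applied to.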
1. 5.06856420627188e-06 2. 0.417383156786665
1. 7.47036992994058e-06 2. 0.408887370208341
#Confusion matrix (rows: actual loan_status, columns: predicted class)
tab_cm <- table(loan_test$loan_status, lr_cutoff)
tab_cm
lr_cutoff
0 1
0 5937 1844
1 592 399
0.722298221614227
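The value printed after the table matches the accuracy of this confusion matrix, (5937 + 399) / 8772. A short sketch that rebuilds the matrix from the printed counts and derives accuracy together with sensitivity and specificity (the last two are not shown in the notebook and are added here only as a worked example):

```r
# Confusion matrix copied from the output above
# (rows: actual loan_status, columns: predicted class)
tab_cm <- matrix(c(5937, 1844,
                    592,  399),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(actual = c("0", "1"),
                                 predicted = c("0", "1")))

accuracy    <- sum(diag(tab_cm)) / sum(tab_cm)        # correct / all
sensitivity <- tab_cm["1", "1"] / sum(tab_cm["1", ])  # actual 1s predicted as 1
specificity <- tab_cm["0", "0"] / sum(tab_cm["0", ])  # actual 0s predicted as 0

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
```

With imbalanced classes, accuracy alone can look acceptable while most of the actual 1s are still missed, which is why the per-class rates are worth printing next to it.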
plot(roc_logit)
auc(roc_logit)
0.650806104704583
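The area under the ROC curve can be read as the probability that a randomly chosen case with loan_status 1 receives a higher score than a randomly chosen case with loan_status 0, with ties counted as one half. A base R sketch of that rank-based calculation on toy labels and scores (the helper `auc_rank` and the data are made up for illustration and are not part of pROC):

```r
# Toy labels and scores (illustrative, not the model's predictions)
labels <- c(0, 0, 1, 0, 1, 1, 0, 1)
scores <- c(0.10, 0.35, 0.40, 0.20, 0.80, 0.30, 0.50, 0.70)

auc_rank <- function(labels, scores) {
  r <- rank(scores)                      # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # Mann-Whitney U statistic for the positives, scaled to [0, 1]
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(labels, scores)
```

A value of 0.5 corresponds to random ranking and 1 to perfect separation, so the 0.65 reported above indicates only modest discrimination.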
2 Decision Tree Model:
In [81]: #Undersampling the training set with rpart.control
plot(tree_undersample, uniform = T)
Error in plot.rpart(tree_undersample, uniform = T): fit is not a tree, just a root
Traceback:
1. plot(tree_undersample, uniform = T)
2. plot.rpart(tree_undersample, uniform = T)
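The error means rpart found no split worth keeping, so the fitted object contains only the root node and there is nothing to draw. Two usual suspects are a training set that is still heavily imbalanced and a complexity parameter `cp` that is too strict (the rpart.control default is 0.01). The sketch below is one possible approach under stated assumptions, not the notebook's code: it uses synthetic data, balances the classes by undersampling, lowers `cp` to 0.001 (an arbitrary illustrative value), and checks the tree before plotting. All `toy` names are hypothetical.

```r
library(rpart)

set.seed(567)

# Synthetic, imbalanced stand-in for the training set (values are made up)
n <- 3000
toy <- data.frame(int_rate = runif(n, 5, 20))
toy$loan_status <- rbinom(n, 1, plogis(-5 + 0.2 * toy$int_rate))

# Undersample: keep every 1 and an equally sized random draw of 0s
idx_1 <- which(toy$loan_status == 1)
idx_0 <- sample(which(toy$loan_status == 0), length(idx_1))
toy_under <- toy[c(idx_1, idx_0), ]

# A lower complexity parameter lets rpart accept weaker splits
tree_toy <- rpart(loan_status ~ int_rate, method = "class",
                  data = toy_under,
                  control = rpart.control(cp = 0.001))

# The frame has one row per node; a single row means "just a root"
if (nrow(tree_toy$frame) > 1) {
  plot(tree_toy, uniform = TRUE)
  text(tree_toy)
} else {
  message("No split found; try a smaller cp or a different sample")
}
```

Checking `nrow(tree$frame)` before calling `plot()` turns the hard error into a readable message while the settings are being tuned.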