PROJECT REPORT
ON
GETTING A LOAN APPROVAL
Prepared by :
Bohui Qi
Jianping Du
Jin Guo
Lanlan Wang
Yi Yu
Ying Zhang
CONTENTS
1 Data Mining Tool
1.1 How to Select Data Mining Tool?
1.2 Which Tool do we Select?
2 Application Example Chosen
2.1 Project Description
2.2 Project Implementation
3 Preparing Data
3.1 Select Appropriate Data for Mining
3.2 Perform Data Preprocessing
3.3 Perform Data Reduction and Projection
3.4 Data List
4 Mining Experiment
4.1 Conversion of Input Data
4.2 Algorithm for Data Mining
4.3 Procedures
5 Mining Results
6 Additional Input
6.1 Data Generalization
6.2 Model Build and Estimated with Holdout Method
6.3 Model Build and Estimated with Cross-Validation
7 Additional Mining Results
7.1 Test Decision Tree by New Data
7.2 Tree Pruning
8 Mining Technique Used in the Tool
9 Mining Technique Details and Information
10 Critical Evaluation of the Mining Technique Used
11 Visualization Technique Used in the Tool
12 Visualization Technique Details and Information
13 Setting Up Environment for the Tool -- Weka
13.1 Where to Download?
13.2 How to Set Up?
13.3 How to Use?
13.4 System Environment
13.5 Evaluation
13.6 Attachment
14 Conclusions
Appendix 1
Appendix 2
Appendix 3
Appendix 4
Bohui Qi, Jianping Du, Jin Guo, Lanlan Wang, Yi Yu, Ying Zhang
Department of Computer Science
Towson University
3. Preparing Data
3.1 Select Appropriate Data for Mining
Because the data set is large, it is more effective to select only meaningful data for mining.
After several group discussions, we chose credit card application approval as our data
mining topic.
3.2 Perform Data Preprocessing
For this project, we took the raw data from the website
(ftp://ftp.ics.uci.edu/pub/machine-databases/credit-screening), which was introduced to us by
our instructor, Dr. Karne. It is a credit card application approval database in the UCI machine
learning repository.
Data preprocessing is an important step in the data mining process. It is necessary
to resolve several types of problems that frequently occur in large data sets,
including noisy data, redundant data, and missing data values.
Preprocessing consists of data cleaning and missing value resolution. Database records
often contain fields with bad or useless information. We cleaned the data by discarding
meaningless attributes and recoding some attribute values as clear numeric variables.
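As a minimal sketch of the cleaning steps described above, the following Python snippet drops an attribute judged meaningless and fills missing nominal values (written `?`, as in ARFF files) with the most frequent known value of the same column. The column names, the sample rows, and the choice of most-frequent-value filling are assumptions made for illustration; the report does not state which resolution strategy was actually applied.

```python
from collections import Counter

MISSING = "?"  # ARFF convention for a missing value


def drop_columns(rows, header, unwanted):
    """Return (new_header, new_rows) without the columns named in `unwanted`."""
    keep = [i for i, name in enumerate(header) if name not in unwanted]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


def fill_missing_with_mode(rows):
    """Replace MISSING in each column with that column's most frequent known value."""
    n_cols = len(rows[0])
    modes = []
    for c in range(n_cols):
        known = [row[c] for row in rows if row[c] != MISSING]
        modes.append(Counter(known).most_common(1)[0][0] if known else MISSING)
    return [[modes[c] if v == MISSING else v for c, v in enumerate(row)]
            for row in rows]


# Hypothetical records: an applicant id (useless for mining) plus three attributes.
header = ["id", "history", "house_owner", "loan"]
rows = [
    ["a01", "good", "yes", "10000"],
    ["a02", "?", "yes", "10000"],
    ["a03", "good", "no", "10000"],
    ["a04", "bad", "?", "0"],
]

clean_header, clean_rows = drop_columns(rows, header, {"id"})
clean_rows = fill_missing_with_mode(clean_rows)
for row in clean_rows:
    print(",".join(row))
```

Filling with the most frequent value keeps every record, at the cost of slightly favoring the majority category; discarding incomplete records would be the obvious alternative.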
3.3 Perform Data Reduction and Projection
Determining the useful features in the dataset may further reduce the size of the selected dataset.
Large databases often contain huge amounts of duplicated values, which are not what
we are interested in and which slow down the mining process. So we reduce some
4. Mining Experiment
4.1 Conversion of Input Data
Before mining the data, we had to convert the data file to ARFF format, since
Weka expects data in that format. The data file we found was in Excel format, so
we followed the directions for converting data stored in Excel to ARFF format and
completed the conversion successfully.
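The conversion can be illustrated with a short Python sketch. It assumes the Excel sheet has already been exported as headerless comma-separated text. The attribute names `age`, `income`, `history`, and `house_owner` are taken from the decision tree output in this report; the class attribute name `loan` and the helper name `csv_to_arff` are hypothetical, and declaring the loan amount as a nominal attribute is also an assumption.

```python
import csv
import io


def csv_to_arff(csv_text, relation, attributes):
    """Convert headerless CSV text to an ARFF string.

    `attributes` is a list of (name, type) pairs, where type is either the
    string "numeric" or a list of the allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in csv.reader(io.StringIO(csv_text)):
        if row:  # skip blank lines
            lines.append(",".join(field.strip() for field in row))
    return "\n".join(lines) + "\n"


# Two records copied from the raw data listing in this report.
sample = "18,22000,good,yes,10000\n19,29650,bad,yes,0\n"

attributes = [
    ("age", "numeric"),
    ("income", "numeric"),
    ("history", ["good", "none", "bad"]),
    ("house_owner", ["yes", "no"]),
    ("loan", ["0", "5000", "10000", "20000", "50000"]),  # hypothetical class name
]

arff = csv_to_arff(sample, "credit", attributes)
print(arff)
```

The header section is what distinguishes an ARFF file from plain comma-separated data: each `@attribute` line declares a name and either a numeric type or the set of permitted nominal values.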
4.2 Algorithm for Data Mining
Before starting the experiment, we needed to specify the knowledge we wanted to extract,
because the kind of knowledge sought determines which mining function should be chosen.
5. Mining Results
We examined our results and observed several facts.
The first part of Appendix 5 is a decision tree in textual form. There are seven
levels in the tree. The first level is split on the history attribute, and the second split
6. Additional Input
In this step, we used two methods to build the model and estimate its accuracy on the
generalized data. First, we chose the holdout method with its default split, so the input data
were randomly partitioned into two independent sets: 66% of the data were allocated to the
training set to derive the classifier, and the remaining 34% were used as test data to
estimate its accuracy. The result of this method is shown in Appendix 6. This estimate is
considered pessimistic, since only part of the initial data is used to build the model. So we used
10-fold cross-validation as the second method for our project. In this method, the algorithm
partitioned the data into 10 mutually exclusive folds of approximately equal size.
Training and testing were performed 10 times. In iteration i, the subset S_i was held out as
test data, and the remaining 9 subsets were used as training data for the classifier, so the accuracy
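The two partitioning schemes described in this section can be sketched in Python as follows. This is only an illustration of the idea, not Weka's implementation: the function names, the fixed random seed, and the use of integers as stand-ins for records are assumptions, and the sketch does not attempt to balance the classes across the partitions.

```python
import random


def holdout_split(data, train_fraction=0.66, seed=1):
    """Randomly partition `data` into (training set, test set)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(data, k=10, seed=1):
    """Yield (training set, test set) pairs; each fold is the test set exactly once."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


instances = list(range(96))  # integer stand-ins for 96 records

train, test = holdout_split(instances)
print(len(train), len(test))

splits = list(k_fold_splits(instances))
print([len(test_fold) for _, test_fold in splits])
```

With cross-validation every record is used for testing exactly once and for training nine times, which is why its accuracy estimate relies on more of the data than a single holdout split does.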
Among those data, we found that 14 new samples fit the rules, but the instance (52, 36540, good,
yes, 0) does not. The accuracy on the new data is therefore 93% (14 of 15 samples).
7.2 Tree Pruning
Figure 4
The number of instances that belong to each type of the class (x-axis: Type, r1 to r5; y-axis: Instances)
Figure 5
The number of the last step decisions (leaf node) decided by house owner level (categories: yes, no)
Figure 6
The number of the last step decisions (leaf node) decided by age level (chart title: Decided by Age; categories: age1 to age4)
[Chart title: Decided by Income; categories: income1 to income4]
> help
Number of Leaves : 5
14. Conclusions
This project shows that a data mining and machine learning tool can be used to discover useful
knowledge, such as credit line granting rules for credit card applicants. Data mining can also
address the question of how best to use historical data to discover general regularities and
improve the decision-making process.
During the implementation of this project, we learned everything that was included
in our project proposal. The project was very interesting, although it was hard work to finish.
It was real teamwork: each group member understood the project and contributed
to it.
We thank Dr. Karne for giving us this opportunity to practice and for many valuable ideas and
directions.
@relation credit
@data
18,22000,good,yes,10000
19,21000,good,no,10000
18,18000,none,yes,10000
19,28000,none,no,5000
19,29650,bad,yes,0
18,28500,bad,no,0
17,32600,good,yes,10000
18,38620,good,no,10000
19,45600,none,yes,10000
19,39520,none,no,5000
19,59300,bad,yes,0
18,54280,bad,no,0
18,68420,good,yes,10000
18,70510,good,no,10000
17,89630,none,yes,10000
16,78560,none,no,10000
19,79465,bad,yes,0
19,88240,bad,no,0
18,96300,good,yes,10000
19,99860,good,no,10000
19,95680,none,yes,10000
19,100045,none,no,5000
19,112480,bad,yes,5000
19,426900,bad,no,0
20,22000,good,yes,10000
30,21000,good,no,10000
32,26580,none,yes,5000
23,28000,none,no,5000
@relation credit
@data
age1,income1,good,yes,recommend3
age1,income1,good,no,recommend3
age1,income1,none,yes,recommend3
age1,income1,none,no,recommend2
age1,income1,bad,yes,recommend1
age1,income1,bad,no,recommend1
age1,?,good,yes,recommend3
age1,income2,?,no,recommend3
age1,income2,none,yes,recommend3
age1,income2,none,no,recommend2
age1,income2,bad,yes,recommend1
age1,income2,bad,no,recommend1
age1,income3,?,yes,recommend3
age1,income3,good,no,recommend3
age1,income3,none,yes,recommend3
age1,income3,none,no,recommend3
age1,income3,bad,yes,recommend1
age1,income3,bad,no,recommend1
age1,income4,good,yes,recommend3
age1,income4,good,no,recommend3
?,income4,none,yes,recommend3
age1,income4,none,no,recommend2
age1,income4,bad,yes,recommend2
age1,income4,bad,no,recommend1
age2,income1,good,yes,recommend3
age2,income1,good,no,recommend3
Decision rules:
J48 pruned tree
------------------
history = good
| income <= 59580: 10000 (16.0/1.0)
| income > 59580
| | age <= 23: 10000 (4.0)
| | age > 23
| | | income <= 79465: 20000 (6.0/1.0)
| | | income > 79465
| | | | age <= 39: 50000 (2.0)
| | | | age > 39: 20000 (4.0/1.0)
history = none
| house_owner = yes
| | income <= 64280
| | | age <= 30: 10000 (3.0)
| | | age > 30: 5000 (5.0/1.0)
| | income > 64280: 10000 (8.0/2.0)
| house_owner = no: 5000 (16.0/4.0)
history = bad
| house_owner = yes
| | income <= 95680
| | | age <= 20: 0 (3.0)
| | | age > 20
| | | | age <= 56: 5000 (5.0)
| | | | age > 56: 0 (4.0/1.0)
| | income > 95680: 5000 (4.0/1.0)
| house_owner = no: 0 (16.0)
Number of Leaves : 14
Size of the tree : 26
=== Error on training data ===
Correctly Classified Instances 84 87.5 %
Incorrectly Classified Instances 12 12.5 %
Mean absolute error 0.0777
Root mean squared error 0.1971
Total Number of Instances 96
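To make the printed tree easier to check, the sketch below transcribes it into a Python function and recomputes the training summary from the leaf annotations. In J48 output, a leaf annotation (n/e) means that n training instances reach the leaf and e of them are misclassified there. The function name `recommend_loan` and the constant `LEAVES` are hypothetical; the thresholds and leaf counts are copied from the output above.

```python
def recommend_loan(age, income, history, house_owner):
    """Return the loan amount predicted by the pruned J48 tree printed above."""
    if history == "good":
        if income <= 59580:
            return 10000
        if age <= 23:
            return 10000
        if income <= 79465:
            return 20000
        return 50000 if age <= 39 else 20000
    if history == "none":
        if house_owner == "no":
            return 5000
        if income <= 64280:
            return 10000 if age <= 30 else 5000
        return 10000
    if history == "bad":
        if house_owner == "no":
            return 0
        if income <= 95680:
            if age <= 20:
                return 0
            return 5000 if age <= 56 else 0
        return 5000
    raise ValueError(f"unknown credit history: {history!r}")


# (instances reaching the leaf, misclassified at the leaf), in printed order.
LEAVES = [
    (16, 1), (4, 0), (6, 1), (2, 0), (4, 1),   # history = good
    (3, 0), (5, 1), (8, 2), (16, 4),           # history = none
    (3, 0), (5, 0), (4, 1), (4, 1), (16, 0),   # history = bad
]

covered = sum(n for n, _ in LEAVES)
errors = sum(e for _, e in LEAVES)
training_accuracy = (covered - errors) / covered

print(len(LEAVES), covered, errors, training_accuracy)
print(recommend_loan(18, 22000, "good", "yes"))
print(recommend_loan(52, 36540, "good", "yes"))
```

The leaf counts reproduce the summary: 14 leaves, 96 instances, 12 errors, and 84/96 = 87.5% correct. The applicant (52, 36540, good, yes) follows the branch history = good, income <= 59580 and receives 10000, so a record of that applicant with an approved amount of 0 disagrees with the tree.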
=== Confusion Matrix ===
a b c d e <-- classified as
58 3 1 0 0 | a = recommend1
0 67 1 0 0 | b = recommend2
0 4 82 1 0 | c = recommend3
0 0 2 31 0 | d = recommend4
0 3 0 0 9 | e = recommend5
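The confusion matrix can be summarized with a few lines of Python. In Weka's layout each row is an actual class and each column is the class the model assigned, so correct classifications lie on the diagonal. The variable names are hypothetical; the counts are copied from the matrix above. Note that this matrix sums to 262 instances, so it belongs to a different run than the 96-instance training summary.

```python
LABELS = ["recommend1", "recommend2", "recommend3", "recommend4", "recommend5"]

# Rows: actual class. Columns: class assigned by the model.
MATRIX = [
    [58, 3, 1, 0, 0],
    [0, 67, 1, 0, 0],
    [0, 4, 82, 1, 0],
    [0, 0, 2, 31, 0],
    [0, 3, 0, 0, 9],
]

size = len(MATRIX)
total = sum(sum(row) for row in MATRIX)
correct = sum(MATRIX[i][i] for i in range(size))
accuracy = correct / total

# Recall: share of each actual class that was classified correctly.
recall = {LABELS[i]: MATRIX[i][i] / sum(MATRIX[i]) for i in range(size)}

# Precision: share of each assigned class that was actually that class.
precision = {
    LABELS[j]: MATRIX[j][j] / sum(MATRIX[i][j] for i in range(size))
    for j in range(size)
}

print(f"{correct} of {total} correct, accuracy {accuracy:.3f}")
for label in LABELS:
    print(f"{label}: recall {recall[label]:.3f}, precision {precision[label]:.3f}")
```

Per-class figures show what the overall accuracy hides: recommend5 is the weakest class by recall, with 3 of its 12 instances assigned to recommend2.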
Bohui Qi, Yi Yu, Jianping Du, Lanlan Wang, Ying Zhang, Jin Guo
Department of Computer Science
Towson University
Project Objective
The objective of this project is to use data mining tools to determine which factors affect
the approval of a person's application for a loan of a certain amount, and how. The detailed
objectives are listed below:
Knowing how to choose the best-suited data mining method among the different kinds for
a certain application.
Learning how to prepare data for our mining tool and translate the input data into its
required format.
Understanding and analyzing the new knowledge mined from the
application.
Project Description
By using data mining tools, we determine which factors affect the
approval of a person's application for a loan of a certain amount, and how. We organize the
database, then evaluate several data mining tools and choose one that is suitable for
mining this application. Finally, after observing and analyzing the new
knowledge, we will validate the findings.
Object Identification
The goal of this project is to display patterns in the amount of loan approved across different
groups of age, income, credit history, home ownership, etc. We will identify the
critical factors and the extent to which they affect the amount of the loan approved.
We will generate a set of data with the above attributes. These data sets will be prepared
in an Excel spreadsheet and in Word in ARFF format. For age, we will set four
intervals: <20, 20-40, 40-60, and >60 years old. For income, the intervals are
<$30,000, $30,000-60,000, $60,000-90,000, and >$90,000. For credit history, the categories are good,
bad, and none. The house-owner categories are yes and no. The loan amounts approved are
$0K, $5K, $10K, $20K, and $50K. We will evaluate the nature and the structure of the
database in order to determine the appropriate tools.
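The interval scheme above can be sketched as a small generalization routine in Python. The labels `age1` to `age4`, `income1` to `income4`, and `recommend1` to `recommend5` appear in the generalized data of this report, but their mapping to intervals and loan amounts is inferred here by lining up the raw and generalized listings, so treat it as an assumption. Placing a boundary value such as age 20 in the upper interval is likewise inferred from those listings. The function names are hypothetical.

```python
import bisect

AGE_CUTS = [20, 40, 60]              # <20, 20-40, 40-60, >60
INCOME_CUTS = [30000, 60000, 90000]  # <30K, 30-60K, 60-90K, >90K

# Assumed mapping from approved amount to class label.
LOAN_LABELS = {
    0: "recommend1",
    5000: "recommend2",
    10000: "recommend3",
    20000: "recommend4",
    50000: "recommend5",
}


def discretize(value, cuts, prefix):
    """Map a number to an interval label such as 'age2'.

    bisect_right places a value equal to a cut point in the upper interval.
    """
    return f"{prefix}{bisect.bisect_right(cuts, value) + 1}"


def generalize(age, income, history, house_owner, loan):
    """Turn one raw record into its generalized (all-nominal) form."""
    return [
        discretize(age, AGE_CUTS, "age"),
        discretize(income, INCOME_CUTS, "income"),
        history,
        house_owner,
        LOAN_LABELS[loan],
    ]


print(",".join(generalize(18, 22000, "good", "yes", 10000)))
print(",".join(generalize(20, 22000, "good", "yes", 10000)))
print(",".join(generalize(19, 112480, "bad", "yes", 5000)))
```

Replacing numeric values with a handful of interval labels shrinks the number of distinct values the mining algorithm has to consider, at the cost of losing the exact thresholds that a tree built on the raw numbers can find.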
Tools Selection
Based on the objectives and the data structure, we will select an appropriate data mining tool.
For this project, we will use the Weka data mining tool set.
Solution Formation
The format of the solution is determined by the data audit, the business objective and the
mining tool. In this project, the report will consist of the amount of loan approved as a
function of different intervals in each of the 4 categories.
Expected Output
By analyzing this project, we expect to obtain the following association rules:
If a person's credit history is bad and he/she is not a house-owner, the application
will be denied.
If a person has no previous credit history, a loan of $5K will be granted for the
first-time application.
If a person's credit history is bad, but he/she is a house-owner and his/her annual
income is more than $90K, or his/her annual income is more than $60K and
his/her age is between 40 and 60, a loan of $5K will be approved.
Different loan amounts will be approved based on annual income,
age, etc.
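The rules listed above can be written as a short Python function, which makes their conditions explicit. This is one possible reading only: the third rule is interpreted as "bad history, house-owner, and either income above $90K or income above $60K with age from 40 to 60 inclusive", and the last rule names no concrete amounts, so the function returns `None` wherever the listed rules are silent. The function name is hypothetical.

```python
def expected_loan(age, income, history, house_owner):
    """Return the amount implied by the expected rules, or None if none applies."""
    if history == "bad":
        if house_owner == "no":
            return 0  # first rule: application denied
        if income > 90000 or (income > 60000 and 40 <= age <= 60):
            return 5000  # third rule
        return None
    if history == "none":
        return 5000  # second rule: first-time application
    return None  # last rule gives no concrete amount


print(expected_loan(30, 20000, "bad", "no"))
print(expected_loan(50, 70000, "bad", "yes"))
print(expected_loan(25, 40000, "none", "yes"))
```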
Model Construction
We will discuss the results of the analysis with some experts to ensure that the findings are
correct and appropriate for the business objectives. Then a final report will be delivered with
documentation of the entire data mining process, including data preparation, the tools used,
the mining techniques used in the tools and their detailed information, test results, the visualization
techniques used and their detailed information, source code, and rules.
Project Schedule