FS Training - Acquisition
Data Quality Report
Table
Field Name | Data Type | No. of Records | Unique Count | Data Available | Available Percent | Missing | Missing Percent | Minimum | Maximum | Mean | Comments
Age | NUM | 150,000 | 86 | 150,000 | 100% | 0 | 0% | 0 | 109 | 52.30 | There is one single row with '0' age. 109 is too high
Gender | CHAR | 150,000 | 2 | 150,000 | 100% | 0 | 0% | | | |
Region | CHAR | 150,000 | 5 | 150,000 | 100% | 0 | 0% | | | |
Rented_OwnHouse | CHAR | 150,000 | 2 | 150,000 | 100% | 0 | 0% | | | |
Occupation | CHAR | 150,000 | 5 | 150,000 | 100% | 0 | 0% | | | |
Education | CHAR | 150,000 | 5 | 150,000 | 100% | 0 | 0% | | | |
NumberOfTime30-59DaysPastDueNotWorse | NUM | 150,000 | 16 | 150,000 | 100% | 0 | 0% | 0 | 13 | 0.25 | The value 96='Others', 98='Refused to Say'. Code this as NA
NumberOfTime60-89DaysPastDueNotWorse | NUM | 150,000 | 13 | 150,000 | 100% | 0 | 0% | 0 | 11 | 0.06 | The value 96='Others', 98='Refused to Say'. Code this as NA
NumberOfTimes90DaysLate | NUM | 150,000 | 19 | 150,000 | 100% | 0 | 0% | 0 | 17 | 0.09 | The value 96='Others', 98='Refused to Say'. Code this as NA
NumberOfOpenCreditLinesAndLoans | NUM | 150,000 | 58 | 150,000 | 100% | 0 | 0% | 0 | 58 | 8.45 | Max Too High
NumberRealEstateLoansOrLines | NUM | 150,000 | 28 | 150,000 | 100% | 0 | 0% | 0 | 54 | 1.02 | Max Too High
NumberOfDependents | NUM | 150,000 | 14 | 146,076 | 97% | 3,924 | 3% | 0 | 20 | 0.76 | The value NA=Missing values
RevolvingUtilizationOfUnsecuredLines | NUM | 150,000 | 125,728 | 150,000 | 100% | 0 | 0% | 0 | 50,708 | 6.05 | Should not be more than 1
DebtRatio | NUM | 150,000 | 114,194 | 150,000 | 100% | 0 | 0% | 0 | 329,664 | 353.01 | Should not be more than 1
MonthlyIncome | NUM | 150,000 | 13,595 | 120,269 | 80% | 29,731 | 20% | 0 | 3,008,750 | 6670.22 | The value NA=Missing values
Good_Bad | CHAR | 150,000 | 2 | 150,000 | 100% | 0 | 0% | | | |
Add the 1%, 5%, 10%, 25%, 50%, 75%, 90%, 95% and 99% percentile values for each numeric
variable to the above table. Also add the % of zeros.
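The extra columns requested above can be sketched as follows. This is only an illustration: the function name `profile_numeric`, the toy `ages` list, and the choice to compute the % of zeros over non-missing values are assumptions, not part of the original report.

```python
import statistics

def profile_numeric(values, percentiles=(1, 5, 10, 25, 50, 75, 90, 95, 99)):
    """Return the requested percentiles and the % of zeros, ignoring missing (None) entries."""
    present = [v for v in values if v is not None]
    # With n=100, statistics.quantiles returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(present, n=100, method="inclusive")
    summary = {f"p{p}": cuts[p - 1] for p in percentiles}
    # Assumption: zeros are counted as a share of the available (non-missing) values.
    summary["pct_zero"] = 100.0 * sum(1 for v in present if v == 0) / len(present)
    return summary

# Hypothetical toy column, not taken from the training data set.
ages = [0, 21, 25, 30, 35, 40, 45, 52, 60, 75, None]
age_profile = profile_numeric(ages)
print(age_profile)
```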
UNIVARIATES
Data Quality Report
Treatment
• Missing and default values should be coded as NA and
included in the analysis.
• This data has outliers. Keep them as is, since the decision tree
handles them without special treatment.
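As a minimal sketch of the first treatment rule, the default codes named in the Data Quality Report comments (96 and 98) can be recoded to NA before analysis. The helper name `code_defaults_as_na` and the sample values are hypothetical, and `None` stands in for NA.

```python
# Default codes listed in the Data Quality Report comments.
DEFAULT_CODES = {96, 98}  # 96 = 'Others', 98 = 'Refused to Say'

def code_defaults_as_na(values, default_codes=DEFAULT_CODES):
    """Replace default codes with None (used here as the NA marker)."""
    return [None if v in default_codes else v for v in values]

# Hypothetical values for a past-due counter such as NumberOfTimes90DaysLate.
raw = [0, 1, 98, 0, 96, 3]
clean = code_defaults_as_na(raw)
print(clean)
```

The NA values stay in the data, so they can still form their own group in the univariate analysis.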
Univariates – Full File
To be Done for each variable – Age
SAMPLE OUTPUT
Univariates
To be Done for each variable – Income
Income | Bad Rate | # Obs
<=5320 | 8.62% | 59323
>5320 to <=6643 | 6.66% | 16128
>6643 | 4.90% | 44818
Missing | 5.53% | 29731
[Chart: Bad Rate (0% to 10%) by Income band: <=5320, >5320 to <=6643, >6643, Missing]
Note: Forced split for <=33 category.
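A table like the Income univariate can be produced along these lines. The band cutoffs (5320 and 6643) come from the table above; the function names and the toy records are assumptions made for illustration.

```python
from collections import defaultdict

def income_band(income):
    """Assign an income value to one of the bands used in the Income univariate."""
    if income is None:
        return "Missing"
    if income <= 5320:
        return "<=5320"
    if income <= 6643:
        return ">5320 to <=6643"
    return ">6643"

def bad_rate_by_band(records):
    """records: iterable of (monthly_income, is_bad) pairs; returns {band: (bad_rate, n_obs)}."""
    counts = defaultdict(lambda: [0, 0])  # band -> [bads, observations]
    for income, is_bad in records:
        band = income_band(income)
        counts[band][0] += int(is_bad)
        counts[band][1] += 1
    return {band: (bads / obs, obs) for band, (bads, obs) in counts.items()}

# Hypothetical toy records: (MonthlyIncome, 1 if bad else 0).
sample = [(3000, 1), (5000, 0), (6000, 0), (7000, 0), (None, 1), (None, 0)]
result = bad_rate_by_band(sample)
print(result)
```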
Univariates – Full File
Grouping To be Done for each Variable – Education
SAMPLE OUTPUT
Univariates
Grouping To be Done for each Variable – Education
SAMPLE OUTPUT
Education | No. of Bads | Total Accounts | Bad %
Matric | 1463 | 11207 | 13%
Graduate | 1607 | 27917 | 6%
Post-Grad | 1704 | 26026 | 7%
PhD | 497 | 4376 | 11%
Professional | 1740 | 35623 | 5%
Bivariate Risk Segmentation
Age and Income
AGE and INCOME ONLY
• Max of 5 groups each
• Populate each cell with
– Sample Size
– Bad Rate
– No. of Bads
SAMPLE OUTPUT
Segment in such a way that the lowest age and lowest income group has the highest
bad rate, and the highest age and highest income group has the lowest bad rate.
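One way to populate the age-by-income grid is sketched below. The age cutoffs in `AGE_EDGES` are hypothetical, the income cutoffs reuse the bands from the Income univariate, and the sketch assumes missing incomes have already been imputed or filtered out.

```python
from bisect import bisect_left
from collections import defaultdict

AGE_EDGES = [33, 45, 60]        # hypothetical cutoffs: <=33, 34-45, 46-60, >60
INCOME_EDGES = [5320, 6643]     # cutoffs from the Income univariate

def group_index(value, edges):
    """Group 0 is value <= edges[0]; group i is edges[i-1] < value <= edges[i]."""
    return bisect_left(edges, value)

def bivariate_table(records):
    """records: iterable of (age, income, is_bad); returns stats per (age group, income group)."""
    cells = defaultdict(lambda: {"n": 0, "bads": 0})
    for age, income, is_bad in records:
        key = (group_index(age, AGE_EDGES), group_index(income, INCOME_EDGES))
        cells[key]["n"] += 1
        cells[key]["bads"] += int(is_bad)
    for cell in cells.values():
        cell["bad_rate"] = cell["bads"] / cell["n"]
    return dict(cells)

# Hypothetical toy records: (Age, MonthlyIncome, 1 if bad else 0).
sample = [(25, 3000, 1), (30, 4000, 0), (50, 7000, 0), (70, 9000, 0)]
table = bivariate_table(sample)
print(table)
```

Cell (0, 0) is the lowest-age, lowest-income segment, which the slide expects to show the highest bad rate.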
Univariates – Decile Binning
Use this in case the decision tree does not produce results.
Debt Ratio - Decile Binning
Bins | Bad Rate | N | Bads
[0,0.0309] | 5.4% | 15000 | 807
(0.0309,0.134] | 6.8% | 15000 | 1023
(0.134,0.214] | 6.0% | 15000 | 905
(0.214,0.287] | 5.4% | 15000 | 811
(0.287,0.367] | 5.6% | 15000 | 842
[Chart: Debt Ratio - Deciles, bad rate by bin]
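Decile binning can be sketched as a rank-based split into ten near-equal groups, which is consistent with every bin above holding 15000 records. The function name and toy data are assumptions, and tied values may straddle a bin boundary with this approach.

```python
def decile_table(values, bads, n_bins=10):
    """Sort by value, split into n_bins near-equal groups, and report the bad rate per bin."""
    pairs = sorted(zip(values, bads))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer records than bins
        n_bad = sum(flag for _, flag in chunk)
        rows.append({
            "low": chunk[0][0],
            "high": chunk[-1][0],
            "n": len(chunk),
            "bads": n_bad,
            "bad_rate": n_bad / len(chunk),
        })
    return rows

# Hypothetical toy data: 20 debt ratios, every fourth account is bad.
ratios = [i / 20 for i in range(20)]
flags = [1 if i % 4 == 0 else 0 for i in range(20)]
deciles = decile_table(ratios, flags)
for row in deciles:
    print(row)
```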
Predictors for Risk Model
List of Variables
• Identify the predictors that will be used for logistic regression.
Select predictors that have a rank order and a trend (a positive or
negative relationship with bad rate) that makes business sense.
• For numeric variables where there is no rank order, try using them
as dummy variables.
• From the bivariate of age and income, create an interaction dummy for
the highest risk segment.
• For numeric variables:
– Use the variable as is (unbinned)
– Impute missing values using the DQ Report or a group with a similar bad rate
– Cap the maximum value at the 95th or 99th percentile for outliers
– Cap the minimum value at the 5th or 1st percentile for outliers
• Create dummy variables for categorical variables.
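The numeric and categorical preparation steps listed above can be sketched as three small helpers. All names are hypothetical, the percentile method is one of several valid definitions, and the fill value used for imputation must be chosen by the analyst (for example from a band with a similar bad rate).

```python
import statistics

def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Cap values at the given lower and upper percentiles; None (missing) passes through."""
    present = [v for v in values if v is not None]
    cuts = statistics.quantiles(present, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [None if v is None else min(max(v, lo), hi) for v in values]

def impute_missing(values, fill_value):
    """Replace None with an analyst-chosen fill value."""
    return [fill_value if v is None else v for v in values]

def make_dummies(values, baseline):
    """One 0/1 column per category, leaving out the baseline level."""
    levels = sorted(set(values) - {baseline})
    return {f"is_{level}": [int(v == level) for v in values] for level in levels}

# Hypothetical toy inputs.
capped = cap_outliers(list(range(1, 102)))           # values 1..101
filled = impute_missing([5000, None, 7000], 6000)     # 6000 is an illustrative fill value
dummies = make_dummies(["Matric", "PhD", "Graduate", "PhD"], baseline="Graduate")
print(capped[0], capped[-1], filled, dummies)
```

Leaving out a baseline level avoids perfectly collinear dummy columns in the regression.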
RISK MODEL
Data
Development and Validation
• Create a random 60% or 70% sample file for development of the decision
tree
• Use the remaining 30% or 40% as a random file for validation of the results
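A random development and validation split might look like the sketch below. The function name, the 70% default, and the fixed seed are assumptions chosen for reproducibility.

```python
import random

def split_dev_val(records, dev_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into development and validation files."""
    rng = random.Random(seed)       # fixed seed so the split can be reproduced
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy data: 100 record ids.
dev, val = split_dev_val(range(100))
print(len(dev), len(val))
```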
Risk Model Logistic Regression
Results - Equation
Variable Definition | Coefficient Value | Chi Square | P Value | Any Comment (Optional)
Intercept | | | |
Variable 1 | | | |
Variable N | | | |
Sorted by Descending Order of Importance or Chi Square Value
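The results table can be filled and sorted as sketched below, assuming the chi-square reported is the Wald chi-square, (coefficient / standard error)^2 with one degree of freedom. That assumption, the variable names, and the coefficient values are all illustrative; they are not results from the training data.

```python
import math

def wald_row(name, coefficient, std_error):
    """Wald chi-square (1 degree of freedom) and its p-value for one fitted coefficient."""
    chi_square = (coefficient / std_error) ** 2
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return {"variable": name, "coefficient": coefficient,
            "chi_square": chi_square, "p_value": p_value}

# Hypothetical fitted values: (variable, coefficient, standard error).
fitted = [("Age", -0.03, 0.005), ("DebtRatioCapped", 0.40, 0.20), ("Dummy_Matric", 0.80, 0.10)]
rows = sorted((wald_row(*f) for f in fitted), key=lambda r: r["chi_square"], reverse=True)
for row in rows:
    print(row)
```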
Gains Chart
[Chart: Cume_Pct_Total_Resp (0% to 100%) against Cum_Pct_N, by decile 1 to 10]
Logistic Regression
Results – Gains Table - Validation
Report GINI
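A gains table and Gini on the validation file can be sketched as follows. Here Gini is taken as 2 x AUC - 1, computed from concordant and discordant bad/good pairs. The function names and toy scores are assumptions, and the pairwise loop is only practical for small samples.

```python
def gains_table(scores, bads, n_bins=10):
    """Rank by descending score, split into n_bins groups, and accumulate population and bads."""
    ranked = sorted(zip(scores, bads), key=lambda p: p[0], reverse=True)
    n, total_bads = len(ranked), sum(bads)
    rows, cum_n, cum_bads = [], 0, 0
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        cum_n += len(chunk)
        cum_bads += sum(flag for _, flag in chunk)
        rows.append({"decile": b + 1,
                     "cum_pct_n": 100.0 * cum_n / n,
                     "cum_pct_bads": 100.0 * cum_bads / total_bads})
    return rows

def gini(scores, bads):
    """Gini = (concordant - discordant) / (bads x goods), equal to 2 * AUC - 1."""
    bad_scores = [s for s, b in zip(scores, bads) if b == 1]
    good_scores = [s for s, b in zip(scores, bads) if b == 0]
    concordant = sum(1 for x in bad_scores for y in good_scores if x > y)
    discordant = sum(1 for x in bad_scores for y in good_scores if x < y)
    return (concordant - discordant) / (len(bad_scores) * len(good_scores))

# Hypothetical validation scores (predicted probability of bad) and outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
bads = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
gains = gains_table(scores, bads)
gini_value = gini(scores, bads)
print(gains[:4], gini_value)
```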
Risk Segmentation - Example
Response Segment | N | # of Responders | Mean Response Rate | % of Sample | % of Responders | Lift
V High | 2918 | 511 | 17.5% | 10.0% | 31.4% | 3.1
High | 2918 | 268 | 9.2% | 10.0% | 16.5% | 1.7
Medium | 11674 | 610 | 5.2% | 40.0% | 37.5% | 0.9
Low | 11674 | 237 | 2.0% | 40.0% | 14.6% | 0.4
Total | 29184 | 1626 | 5.6% | 100.0% | 100.0% |
• The V High response segment has about 3x the overall sample response rate,
based on the lift measure.
• V High and High have a collective response rate of 13.4% and comprise 20%
of the sample, but contribute almost 50% (47.9%) of the total responders.
Create the same segmentation from the gains table. The number of risk segments can range
from 3 to 5.
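The derived columns of the segmentation table follow directly from the segment counts. The sketch below uses the N and responder counts from the example table; the variable and function names are assumptions. Lift is taken as the segment response rate divided by the overall response rate.

```python
# Segment counts from the example table: name -> (N, number of responders).
segments = {
    "V High": (2918, 511),
    "High": (2918, 268),
    "Medium": (11674, 610),
    "Low": (11674, 237),
}

total_n = sum(n for n, _ in segments.values())
total_resp = sum(r for _, r in segments.values())
overall_rate = total_resp / total_n

def segment_stats(n, responders):
    """Response rate, share of sample, share of responders, and lift for one segment."""
    rate = responders / n
    return {
        "response_rate": rate,
        "pct_sample": n / total_n,
        "pct_responders": responders / total_resp,
        "lift": rate / overall_rate,
    }

stats = {name: segment_stats(*counts) for name, counts in segments.items()}
for name, row in stats.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```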