
Beyond GLMs

Colin Priest - Director, Customer Success, Asia, DataRobot


Xavier Conort - Chief Data Scientist, DataRobot

Agenda
1. GLMs and Actuaries
2. Extensions to GLMs
3. Automating GLM model building
4. Best practice predictive modelling
5. Conclusion

1) GLMs and Actuaries
Linear models for statistical distributions that aren't Normal
Taught in the actuarial education process
Widely used by actuaries for:
Pricing
Understanding lapse rates
Marketing
Claims reserving

GLMs are old
Developed in 1972, in an era of small data and no PCs
Used by actuaries for decades
There are newer techniques to choose from

2) Extensions to GLMs

GAMs
GAMLSS
GLMMs
Regularized GLMs

GAMs
Generalized additive models
Automatically fits nonlinear relationships
Works a bit like rolling averages

GAMs
Good For
Designing data transformations for GLMs

Unsuitable For
Pricing, or whenever a formula is required
Data containing only categorical variables
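A minimal GAM sketch in R, using the mgcv package (the slide does not name a package, so that choice is an assumption); the `policies` data frame and its columns are hypothetical.

```r
# Hypothetical motor frequency data: smooth effects of age and vehicle value.
library(mgcv)

fit <- gam(claim_count ~ s(driver_age) + s(vehicle_value) + offset(log(exposure)),
           family = poisson(), data = policies)

plot(fit, pages = 1)   # inspect the fitted smooths, e.g. to design GLM transformations
```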

GAMLSS
Generalized additive models for location, scale and shape
Allows you to model how the variance and skewness vary

GAMLSS
Good For
Understanding variability
Simulating risk, e.g. internal models
Data exhibiting heteroskedasticity

Unsuitable For
When you only want the prediction
Data exhibiting homoskedasticity
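A minimal GAMLSS sketch using the gamlss package in R; the `claims` data frame and its columns are hypothetical.

```r
library(gamlss)

# Model both the mean and the dispersion of claim severity as smooth functions of age,
# so the variance is allowed to change with driver_age (heteroskedasticity).
fit <- gamlss(claim_size ~ pb(driver_age),
              sigma.formula = ~ pb(driver_age),
              family = GA,          # gamma response
              data = claims)

summary(fit)
```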

GLMMs
Generalized linear mixed models
Allows for automatic credibility weighting

GLMMs
Good For
Categorical variables, and interactions between categorical and numeric variables
Sparse data
Hierarchical relationships, e.g. vehicle make and model

Unsuitable For
Large amounts of highly credible data
Data without categorical variables
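A minimal GLMM sketch using the lme4 package in R (the slide does not name a package, so that choice is an assumption); the `policies` data frame and its columns are hypothetical.

```r
library(lme4)

# Random intercepts for vehicle make, and model nested within make, behave like
# credibility-weighted estimates: sparse levels are shrunk towards the overall mean.
fit <- glmer(claim_count ~ driver_age + (1 | vehicle_make / vehicle_model),
             family = poisson(), offset = log(exposure), data = policies)

ranef(fit)   # the shrunken (credibility-weighted) level effects
```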

Regularized GLMs
Regularisation is any method that penalises overfitting or complexity in models
Automatically chooses predictors
Automatically allows for credibility

Regularized GLMs
Good For
Collinearity in data
Making GLMs more reliable
Lots of input variables
Sparse data

Unsuitable For
Complex interactions between variables
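A minimal regularized GLM sketch using the glmnet package in R; the `policies` data frame and its columns are hypothetical.

```r
library(glmnet)

# Build a numeric design matrix from the rating factors (drop the intercept column).
x <- model.matrix(~ . - claim_count - exposure, data = policies)[, -1]

# alpha = 1 gives the lasso, which shrinks coefficients and sets some exactly to zero,
# so predictor selection and credibility-style shrinkage happen automatically.
cvfit <- cv.glmnet(x, y = policies$claim_count, family = "poisson", alpha = 1)

coef(cvfit, s = "lambda.1se")   # non-zero rows are the retained predictors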

3) Automating GLM Building

Why automate?
Variable selection (feature selection)
Linearising (feature engineering)
Dimensionality reduction


Why Automate?
Building GLMs requires time and people, and both of these are expensive!
Most of the resource-intensive work is rule-driven, without much complex judgement required
It's just like when books were copied by hand!

Variable Selection
Variable Importance using machine learning

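One way to rank candidate predictors by machine-learning variable importance; the use of a random forest here and the `policies` data frame are illustrative assumptions, not from the slide.

```r
library(randomForest)

fit <- randomForest(claim_cost ~ ., data = policies, ntree = 500, importance = TRUE)

importance(fit)   # permutation and impurity-based importance for each variable
varImpPlot(fit)   # variables with negligible importance are candidates to drop from the GLM
```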

Variable Selection
Using lasso regularized GLMs


Variable Selection
Using genetic algorithms


Linearising
Using GAMs


Linearising
Using GBMs

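A sketch of using a GBM's partial dependence to suggest the shape of a GLM transformation; the gbm call, tuning values and the `policies` data frame are illustrative assumptions.

```r
library(gbm)

fit <- gbm(claim_cost ~ driver_age + vehicle_value + vehicle_age,
           data = policies, distribution = "gaussian",
           n.trees = 2000, shrinkage = 0.01, interaction.depth = 3)

summary(fit)                      # relative influence of each feature
plot(fit, i.var = "driver_age")   # partial dependence: the shape to reproduce in the GLM
```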

Dimensionality Reduction
Variable importance for reducing the number of categories

Dimensionality Reduction
Text mining for grouping categories together and reducing the number of categories
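A very simple alternative sketch: collapse rare levels of a high-cardinality factor into a single group. The `policies` data frame, column names and the threshold of 100 are illustrative assumptions.

```r
# Count policies per vehicle model and lump infrequent models into "OTHER"
counts <- table(policies$vehicle_model)
rare   <- names(counts)[counts < 100]

policies$vehicle_model_grp <- as.character(policies$vehicle_model)
policies$vehicle_model_grp[policies$vehicle_model_grp %in% rare] <- "OTHER"
policies$vehicle_model_grp <- factor(policies$vehicle_model_grp)
```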

4) Best practices observed on Kaggle

What is Kaggle?
A social fight club for data geeks
In 2010, Anthony Goldbloom took the SIGKDD and Netflix's model
And attracted 371,397 data geeks as of Sept 17, 2015!
Kaggle has worked with more than 20 Fortune 500 companies, including 3 leading insurance companies + 1 Australian insurer represented by Deloitte

Why do geeks like to fight?
My motivation has been to learn new things.

Key takeaways from competing in Kaggle
The Machine works much faster and harder than me
Feature engineering is key to success. Actuaries are good at this; they can, however, learn more by being exposed to problems and datasets outside the insurance industry
Top Kagglers use actuarial tricks such as credibility estimates
The most popular and powerful machine learning algorithms used by the data science community are open-source algorithms

Machine Learning works for insurance too!
Won by Xavier and his colleague Owen Zhang!

Typical learning curve of a Kaggler
Previously...
now!

Lessons learnt
The Machine seems much smarter than I am at capturing complexity in the data, even for simple datasets!
Humans can help the Machine too! But don't oversimplify and discard any data.
Don't be impatient. My best GBM had 24,500 trees with learning rate = 0.01!
SVM and feature selection matter too!

Lessons learnt
Word n-grams and character n-grams can make a big difference

Parallel processing and big servers can help with complex feature
engineering!

Glmnet can do a great job!

Sklearn in Python is cool too!


Machine Learning algos to know
... to automatically capture complexity in the data (a minimal R example follows the package list)
Gradient Boosting Machine packages
1. R gbm
2. R xgboost
3. Sklearn GradientBoostingClassifier and GradientBoostingRegressor
Forest packages
1. R randomForest
2. Sklearn RandomForestClassifier and RandomForestRegressor
3. R extraTrees
4. Sklearn ExtraTreesClassifier and ExtraTreesRegressor
Support Vector Machine packages
1. R e1071
2. Sklearn svc and svr
3. Sklearn Nystroem
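A minimal sketch with the xgboost package listed above; the `policies` data frame and its columns are hypothetical, and the call follows the classic xgboost R interface.

```r
library(xgboost)
library(Matrix)

# Sparse design matrix built from the rating factors (no intercept column)
x <- sparse.model.matrix(claim_count ~ . - 1, data = policies)

fit <- xgboost(data = x, label = policies$claim_count,
               objective = "count:poisson",
               nrounds = 500, eta = 0.01, max_depth = 4)

pred <- predict(fit, newdata = x)
```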

Machine Learning algos to know
... to take advantage of high-cardinality categorical features or text data (a minimal R example follows the package list)
Regularized generalized linear models
1. R glmnet
2. Sklearn Ridge
3. Sklearn LogisticRegression
Feature Extraction for categorical features or text data
1. R Matrix
2. Sklearn OneHotEncoder and DictVectorizer
3. R tau
4. Sklearn TfidfVectorizer

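A minimal sketch of the sparse one-hot encoding route listed above (R Matrix feeding a regularized GLM in glmnet); the `policies` data frame and its columns are hypothetical.

```r
library(Matrix)
library(glmnet)

# High-cardinality factors become a wide but sparse design matrix
x <- sparse.model.matrix(~ vehicle_model + postcode - 1, data = policies)

# Elastic-net logistic regression copes well with the many resulting columns
fit <- cv.glmnet(x, y = policies$lapsed, family = "binomial", alpha = 0.5)
```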

Tools to know
... to make your code efficient (a minimal R example follows the list)
Faster data manipulation
1. R data.table
2. Python pandas
Parallel computing
1. R foreach / doMC
2. Python joblib

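Minimal sketches of the tools listed above; the `policies` data, the `fold` column and the choice of 4 cores are illustrative assumptions (doMC is for Unix-like systems).

```r
library(data.table)
library(foreach)
library(doMC)

dt <- as.data.table(policies)

# Fast grouped aggregation with data.table
freq <- dt[, .(claims = sum(claim_count), exposure = sum(exposure)), by = vehicle_make]

# Parallel loop with foreach / doMC, e.g. one model per (assumed) cross-validation fold
registerDoMC(cores = 4)
fits <- foreach(k = 1:4) %dopar%
  glm(claim_count ~ driver_age, family = poisson(), data = dt[fold == k])
```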

Adapt feature engineering to the ML algo

Machine Learning (ML) algo                | Categorical variables support | Features subsampling | Sparse support | Insensitive to scale & uniform transf. | Automated non-linear & interactions modelling | Handles missing values
R Random Forest                           | Yes, up to 32 levels          | Yes                  | No             | Yes                                    | Yes                                           | No
Sklearn Random Forest                     | No                            | Yes                  | Yes, but slow  | Yes                                    | Yes                                           | No
R Gradient Boosting Machine               | Yes, up to 1024 levels        | No                   | No             | Yes                                    | Yes                                           | Yes
Sklearn Gradient Boosted Regression Trees | No                            | Yes                  | Yes, but slow  | Yes                                    | Yes                                           | No
eXtreme Gradient Boosting                 | No                            | Yes                  | Yes            | Yes                                    | Yes                                           | Yes
Regularized GLMs                          | No                            | No                   | Yes            | No                                     | No                                            | No
Support Vector Machine                    | No                            | No                   | Yes            | No                                     | Yes                                           | No

Other algorithms popular on Kaggle

Most important:
Don't forget to use your actuarial intuition to help the Machine!
Always consider simple feature engineering that makes sense for your business, such as differences / ratios of features (a small R sketch follows below)
Be creative; feature engineering is often key to success.
Don't trust features that are too good:
They can make the Machine lazy! An example: GE Flight Quest
Or they are likely to be caused by a bug or a leak!
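A tiny sketch of the kind of ratio / difference features meant above; all column names are hypothetical.

```r
# Simple business-driven features built from existing columns
policies$value_per_tonne <- policies$vehicle_value / policies$vehicle_weight
policies$years_licensed  <- policies$driver_age - policies$age_licence_obtained
```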

5) Conclusion
It's time to become actuaries of the 5th kind

Actuaries of the First Kind - 17th century: Life insurance, deterministic methods
Actuaries of the Second Kind - Early 20th century: General insurance, probabilistic methods
Actuaries of the Third Kind - 1980s: Assets/derivatives, contingencies, stochastic processes
Actuaries of the Fourth Kind - Early 21st century: ERM
Actuaries of the Fifth Kind - Second decade of the 21st century: Big Data

(Sources: Hans Buhlmann 1987; Paul Embrechts 2005; Big Data Working Party)

Conclusion
So that we aren't replaced by robots (or data scientists)

Thank You
