Agenda
1. GLMs
2. Extensions to GLMs
3. Why automate?
4. What is Kaggle?
5. Conclusion
1) GLMs
Linear models for statistical distributions that aren't Normal
Taught in the actuarial education process
Widely used by actuaries for:
- Pricing
- Understanding lapse rates
- Marketing
- Claims reserving
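A minimal sketch in base R of the kind of GLM actuaries fit for pricing: claim counts with a Poisson error, a log link and an exposure offset. The data is simulated and the column names are invented.

```r
# Claim-frequency GLM: Poisson error, log link, exposure offset.
set.seed(1)
policies <- data.frame(
  driver_age = sample(18:80, 1000, replace = TRUE),
  region     = factor(sample(c("north", "south"), 1000, replace = TRUE)),
  exposure   = runif(1000, 0.5, 1)
)
policies$claims <- rpois(1000, lambda = 0.1 * policies$exposure)

freq_model <- glm(claims ~ driver_age + region + offset(log(exposure)),
                  family = poisson(link = "log"), data = policies)
summary(freq_model)
```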
2) Extensions to GLMs
- GAMs
- GAMLSS
- GLMMs
- Regularized GLMs
GAMs
Generalized additive models
- Automatically fit nonlinear relationships
- Work a bit like rolling averages

Good for:
- Designing data transformations for GLMs

Unsuitable for:
- Pricing, or whenever a formula is required
- Data containing only categorical variables
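A minimal sketch, assuming the mgcv package (a standard GAM implementation; the slides do not name one). The s() term lets the model find the nonlinear shape of the effect itself.

```r
# GAM on simulated data: the smooth s(driver_age) is fitted automatically.
library(mgcv)
set.seed(1)
dat <- data.frame(driver_age = runif(500, 18, 80))
dat$y <- sin(dat$driver_age / 10) + rnorm(500, sd = 0.3)

gam_fit <- gam(y ~ s(driver_age), data = dat)
plot(gam_fit)  # inspect the fitted smooth, e.g. to design a GLM transformation
```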
GAMLSS
Generalized additive models for location, scale and shape
- Allow you to model how the variance and skewness vary

Good for:
- Understanding variability
- Simulating risk, e.g. internal models
- Data exhibiting heteroskedasticity

Unsuitable for:
- When you only want the prediction
- Data exhibiting homoskedasticity
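A minimal sketch with the gamlss package on simulated heteroskedastic data: the mean (mu) and the standard deviation (sigma) each get their own formula.

```r
# GAMLSS: model the variance as well as the mean.
library(gamlss)
set.seed(1)
dat <- data.frame(x = runif(500))
dat$y <- rnorm(500, mean = 2 * dat$x, sd = 0.2 + dat$x)  # sd grows with x

fit <- gamlss(y ~ x, sigma.formula = ~ x, family = NO(), data = dat)
summary(fit)  # coefficients for both the mean and the variance model
```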
GLMMs
Generalized linear mixed models
- Allow for automatic credibility weighting

Good for:
- Categorical variables, and interactions between categorical and numeric variables
- Sparse data
- Hierarchical relationships, e.g. vehicle make and model

Unsuitable for:
- Large amounts of highly credible data
- Data without categorical variables
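A minimal sketch, assuming the lme4 package (one common GLMM implementation). The random intercept (1 | make) shrinks sparse vehicle-make effects towards the overall mean, which is the automatic credibility weighting described above.

```r
# GLMM on simulated data: random intercept per vehicle make.
library(lme4)
set.seed(1)
dat <- data.frame(
  make       = factor(sample(paste0("make_", 1:50), 2000, replace = TRUE)),
  driver_age = runif(2000, 18, 80)
)
dat$claims <- rpois(2000, lambda = exp(-2 + 0.01 * dat$driver_age))

glmm_fit <- glmer(claims ~ driver_age + (1 | make),
                  family = poisson, data = dat)
head(ranef(glmm_fit)$make)  # credibility-weighted make effects
```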
Regularized GLMs
Regularisation is any method that penalises overfitting or complexity in models
- Automatically chooses predictors
- Automatically allows for credibility

Good for:
- Collinearity in data
- Making GLMs more reliable
- Lots of input variables
- Sparse data

Unsuitable for:
- Complex interactions between variables
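A minimal sketch, assuming glmnet (named later in these slides). With alpha = 1 the penalty is the lasso, which both shrinks coefficients (credibility) and drops weak predictors entirely (predictor choice).

```r
# Lasso-regularized GLM on simulated data.
library(glmnet)
set.seed(1)
X <- matrix(rnorm(500 * 20), ncol = 20)  # 20 candidate predictors
y <- X[, 1] - 2 * X[, 2] + rnorm(500)    # only two actually matter

cv_fit <- cv.glmnet(X, y, alpha = 1)     # cross-validation picks the penalty
coef(cv_fit, s = "lambda.1se")           # most coefficients shrink to zero
```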
3) Why automate?
- Variable selection (feature selection)
- Linearising (feature engineering)
- Dimensionality reduction
Why automate?
- Building GLMs requires time and people, and both of these are expensive!
- Most of the resource-intensive work is rule-driven, without much complex judgement required
- It's just like when books were copied by hand!
Variable Selection
Variable Importance using machine learning
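A minimal sketch, assuming randomForest as the machine learning model (the slides do not prescribe one): rank candidate predictors by importance before building the GLM.

```r
# Variable importance from a random forest on simulated data.
library(randomForest)
set.seed(1)
X <- data.frame(matrix(rnorm(500 * 10), ncol = 10))
y <- X$X1 - X$X3 + rnorm(500)

rf <- randomForest(X, y, importance = TRUE)
importance(rf)  # high %IncMSE suggests a variable worth keeping in the GLM
```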
Variable Selection
Using lasso regularized GLMs
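A minimal sketch with glmnet on simulated data: the selected variable set can be read straight off a cross-validated lasso fit.

```r
# Lasso as an automatic variable selector.
library(glmnet)
set.seed(1)
X <- matrix(rnorm(400 * 15), ncol = 15)
y <- rbinom(400, 1, plogis(X[, 2] - X[, 5]))

cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coefs <- as.numeric(coef(cv_fit, s = "lambda.min"))
which(coefs[-1] != 0)  # indices of the predictors the lasso keeps
```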
Variable Selection
Using genetic algorithms
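A minimal sketch; the GA package and the AIC-based fitness function are assumptions for illustration, as the slides name no implementation. Each bit flags whether a predictor enters the GLM.

```r
# Genetic algorithm over predictor subsets, scored by GLM AIC.
library(GA)
set.seed(1)
dat <- data.frame(matrix(rnorm(300 * 8), ncol = 8))
dat$y <- dat$X1 - dat$X4 + rnorm(300)

fitness <- function(bits) {
  if (sum(bits) == 0) return(-1e10)      # penalise the empty model
  vars <- paste0("X", 1:8)[bits == 1]
  -AIC(glm(reformulate(vars, response = "y"), data = dat))  # GA maximises
}
res <- ga(type = "binary", fitness = fitness, nBits = 8, maxiter = 30)
summary(res)  # the best bit string is the selected variable subset
```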
Linearising
Using GAMs
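A minimal sketch of the workflow, assuming mgcv: let a GAM reveal the shape of an effect, then hard-code that shape in an ordinary GLM so the result is still a formula.

```r
# Use a GAM smooth to choose a GLM transformation.
library(mgcv)
set.seed(1)
dat <- data.frame(age = runif(1000, 18, 80))
dat$y <- log(dat$age) + rnorm(1000, sd = 0.2)

gam_fit <- gam(y ~ s(age), data = dat)
plot(gam_fit)                             # the smooth looks logarithmic...
glm_fit <- glm(y ~ log(age), data = dat)  # ...so use log(age) in the GLM
```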
Linearising
Using GBMs
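A minimal sketch, assuming the gbm package: its partial-dependence plots show the shape of each effect, which a GLM term can then approximate.

```r
# Partial dependence from a GBM as a linearising guide.
library(gbm)
set.seed(1)
dat <- data.frame(age = runif(1000, 18, 80))
dat$y <- sqrt(dat$age) + rnorm(1000, sd = 0.2)

fit <- gbm(y ~ age, data = dat, distribution = "gaussian",
           n.trees = 500, interaction.depth = 2, shrinkage = 0.05)
plot(fit, i.var = "age", n.trees = 500)  # partial dependence of y on age
```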
Dimensionality Reduction
Variable importance for reducing the number of categories
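A minimal sketch, again assuming random forest importance: one-hot encode a high-cardinality factor, rank the level dummies, and keep only the levels that matter.

```r
# Rank factor levels by importance to decide which to merge away.
library(randomForest)
set.seed(1)
dat <- data.frame(region = factor(sample(paste0("r", 1:30), 1000, replace = TRUE)))
dat$y <- rnorm(1000) + 2 * (dat$region == "r1")

dummies <- model.matrix(~ region - 1, dat)
rf <- randomForest(dummies, dat$y, importance = TRUE)
head(sort(importance(rf)[, "%IncMSE"], decreasing = TRUE))  # levels to keep
```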
Dimensionality Reduction
Text mining for grouping categories together and reducing the number of categories
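A minimal sketch using only base R string distances (one possible text-mining approach; the slides do not specify one): cluster category labels by edit distance so near-duplicate spellings collapse into one group.

```r
# Group similar category labels with hierarchical clustering.
labels <- c("Ford Focus", "FORD FOCUS 1.6", "Ford Fiesta", "FIESTA 1.2")
d <- adist(tolower(labels))                 # pairwise edit distances
groups <- cutree(hclust(as.dist(d)), k = 2)
split(labels, groups)                       # candidate merged categories
```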
4) What is Kaggle?
- A social fight club for data geeks
- In 2010, Anthony Goldbloom took the SIGKDD and Netflix Prize competition model...
- ...and attracted 371,397 data geeks as of Sept 17, 2015!
- Kaggle has worked with more than 20 Fortune 500 companies
Lessons learnt
- The Machine seems much smarter than I am at capturing complexity in the data, even for simple datasets!
- Humans can help the Machine too! But don't oversimplify or discard data.
- Don't be impatient. My best GBM had 24,500 trees with learning rate = 0.01!
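A minimal sketch of the "be patient" lesson with R gbm, scaled down so it runs quickly (the talk's best model used 24,500 trees): a small learning rate needs many trees, and cross-validation picks how many to actually use.

```r
# Small shrinkage + many trees + CV to choose the stopping point.
library(gbm)
set.seed(1)
train <- data.frame(x1 = rnorm(2000), x2 = rnorm(2000))
train$y <- train$x1 * train$x2 + rnorm(2000)

fit <- gbm(y ~ ., data = train, distribution = "gaussian",
           n.trees = 5000, shrinkage = 0.01,
           interaction.depth = 3, cv.folds = 5)
gbm.perf(fit, method = "cv")  # optimal tree count under cross-validation
```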
Lessons learnt
- Word n-grams and character n-grams can make a big difference
- Parallel processing and big servers can help with complex feature engineering!
R & Python machine learning algos to know: to automatically capture complexity in the data
Gradient Boosting Machine packages
1. R gbm
2. R xgboost
3. Sklearn GradientBoostingClassifier and GradientBoostingRegressor
Forest packages
1. R randomForest
2. Sklearn RandomForestClassifier and RandomForestRegressor
3. R extraTrees
4. Sklearn ExtraTreesClassifier and ExtraTreesRegressor
Support Vector Machine packages
1. R e1071
2. Sklearn SVC and SVR
3. Sklearn Nystroem
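As a minimal sketch of one entry from this list, R xgboost on simulated data:

```r
# Gradient boosting with xgboost on a nonlinear target.
library(xgboost)
set.seed(1)
X <- matrix(rnorm(1000 * 5), ncol = 5)
y <- X[, 1]^2 + rnorm(1000)

fit <- xgboost(data = X, label = y, nrounds = 100,
               objective = "reg:squarederror", verbose = 0)
head(predict(fit, X))
```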
R & Python machine learning algos to know: to take advantage of high-cardinality categorical features or text data
Regularized generalized linear models
1. R glmnet
2. Sklearn Ridge
3. Sklearn LogisticRegression
Feature Extraction for categorical features or text data
1. R Matrix
2. Sklearn OneHotEncoder and DictVectorizer
3. R tau
4. Sklearn TfidfVectorizer
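A minimal sketch of feature extraction with R Matrix from the list above: a sparse one-hot design matrix for a high-cardinality categorical feature.

```r
# Sparse one-hot encoding: 10,000 rows x 500 levels, mostly zeros.
library(Matrix)
set.seed(1)
dat <- data.frame(make = factor(sample(paste0("make_", 1:500),
                                       10000, replace = TRUE)))
X <- sparse.model.matrix(~ make - 1, dat)
dim(X)
```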
R & Python tools to know: to make your code efficient
Faster data manipulation
1. R data.table
2. Python pandas
Parallel computing
1. R foreach / doMC
2. Python joblib
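A minimal sketch of both R tools above (doMC is Unix-only; doParallel is the usual Windows substitute):

```r
# Fast grouped aggregation with data.table.
library(data.table)
dt <- data.table(grp = sample(letters, 1e6, replace = TRUE), x = rnorm(1e6))
dt[, .(mean_x = mean(x)), by = grp]   # grouped means, fast on large data

# Spread a loop across cores with foreach/doMC.
library(foreach)
library(doMC)
registerDoMC(cores = 4)
res <- foreach(i = 1:8, .combine = c) %dopar% i^2
```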
[Table: the packages above compared on categorical variable support (R Random Forest: yes, up to 32 levels; one other entry supports up to 1,024 levels), feature subsampling, sparse support, insensitivity to scale and uniform transformations, and missing-value handling; regularized GLMs appear as a row. The remaining cell values were lost in extraction.]
Most important:
Don't forget to use your actuarial intuition to help the Machine!
- Always consider simple feature engineering that makes sense for your business, such as differences / ratios of features
- Be creative: feature engineering is often key to success
- Don't trust features that are too good:
  - They can make the Machine lazy! An example: GE Flight Quest
  - Or they are likely to be caused by a bug or a leak!
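A minimal sketch of that advice, with invented column names: simple differences and ratios that make business sense.

```r
# Hand-crafted ratio and difference features (columns are illustrative).
policies <- data.frame(premium = c(400, 900), sum_insured = c(2e5, 3e5),
                       driver_age = c(30, 55), licence_years = c(10, 30))

policies$rate_on_line   <- policies$premium / policies$sum_insured
policies$age_at_licence <- policies$driver_age - policies$licence_years
```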
5) Conclusion
It's time to become actuaries of the fifth kind
- Actuaries of the First Kind: 17th century; life insurance, deterministic methods
- Actuaries of the Second Kind: contingencies, stochastic processes
- Actuaries of the Third Kind (Hans Buhlmann, 1987): 1980s; assets/derivatives
- Actuaries of the Fourth Kind (Paul Embrechts, 2005)
- Actuaries of the Fifth Kind: the Big Data Working Party
Conclusion
So that we aren't replaced by robots (or data scientists)
Thank You