Practical Machine Learning Pipelines with MLlib: Joseph K. Bradley

Practical Machine Learning
Pipelines with MLlib

Joseph K. Bradley
March 18, 2015
Spark Summit East 2015
About Spark MLlib
Started in UC Berkeley AMPLab
Shipped with Spark 0.8
Currently (Spark 1.3)
Contributions from 50+ orgs, 100+ individuals
Good coverage of algorithms
Classification, regression, clustering, recommendation
Feature extraction and selection, statistics, linear algebra, frequent itemsets
MLlib's Mission
MLlib's mission is to make practical machine learning easy and scalable.
Capable of learning from large-scale datasets
Easy to build machine learning applications
How can we move beyond this list of algorithms and help users develop real ML workflows?
Outline
ML workflows
Pipelines
Roadmap
Example: Text Classification
Goal: Given a text document, predict its topic.
Dataset: 20 Newsgroups, from the UCI KDD Archive
Features:
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is.
McQuires will do something...
Label:
1: about science
0: not about science
Other feature types: text, image, vector, ...
Other label types: CTR, inches of rainfall, ...

Training & Testing
Training
Given labeled data: RDD of (features, label)
Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help... -> Label 0
Subject: RIPEM FAQ. RIPEM is a program which performs Privacy Enhanced... -> Label 1
...
Learn a model.
Testing/Production
Given new unlabeled data: RDD of features
Subject: Apollo Training. The Apollo astronauts also trained at (in) Meteor... -> Label 1
Subject: A demo of Nonsense. How can you lie about something that no one... -> Label 0
Use model to make predictions.

Example ML Workflow
Training
Load data -> labels + plain text
Extract features -> labels + feature vectors
Train model -> labels + predictions
Evaluate
Pain point: Create many RDDs
    val labels: RDD[Double] = data.map(_.label)
    val features: RDD[Vector]
    val predictions: RDD[Double]
Explicitly unzip & zip RDDs
    labels.zip(predictions).map {
      case (label, prediction) => if (label == prediction) ...
    }
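The zip-and-compare evaluation step can be sketched in plain Python, with lists standing in for RDDs. This is only an illustration: the label and prediction values are made up, and no Spark API is used.

```python
# Plain-Python stand-in for the RDD-based evaluation step (no Spark needed).
labels = [0.0, 1.0, 1.0, 0.0]        # hypothetical true labels
predictions = [0.0, 1.0, 0.0, 0.0]   # hypothetical model output

# Zip the two parallel collections and count the pairs that agree.
matches = sum(1 for label, pred in zip(labels, predictions) if label == pred)

# Accuracy is the fraction of predictions that equal their label.
accuracy = matches / len(labels)
print(accuracy)
```

The two collections must stay aligned element by element, which is exactly the bookkeeping the slide calls out as a pain point.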
Example ML Workflow
Training
Load data -> labels + plain text
Extract features -> labels + feature vectors
Train model -> labels + predictions
Evaluate
Pain point: Write as a script
Not modular
Difficult to re-use workflow
Example ML Workflow
Training                        Testing/Production
Load data                       Load new data
  labels + plain text             plain text
Extract features                Extract features
  labels + feature vectors        feature vectors
Train model                     Predict using model
  labels + predictions            predictions
Evaluate                        Act on predictions
Almost identical workflow
Example ML Workflow
Training
Load data -> labels + plain text
Extract features -> labels + feature vectors
Train model -> labels + predictions
Evaluate
Pain point: Parameter tuning
Key part of ML
Involves training many models
For different splits of the data
For different sets of parameters
Pain Points
Create & handle many RDDs and data types
Write as a script
Tune parameters
Enter...
Pipelines! in Spark 1.2 & 1.3

Outline
ML workflows
Pipelines
Roadmap
Key Concepts
DataFrame: The ML Dataset

Abstractions: Transformers, Estimators, & Evaluators
Parameters: API & tuning
DataFrame: RDD + schema + DSL
Named columns with types
label: Double
text: String
words: Seq[String]
features: Vector
prediction: Double

label  text         words            features
0      This is ...  [This, is, ...]  [0.5, 1.2, ...]
0      When we ...  [When, ...]      [1.9, -0.8, ...]

DataFrame: RDD + schema + DSL
Named columns with types
Domain-Specific Language
    # Select science articles
    sciDocs = data.filter(data["label"] == 1)
    # Scale labels
    data["label"] * 0.5
DataFrame: RDD + schema + DSL
Named columns with types
Domain-Specific Language
BIG data
Shipped with Spark 1.3
APIs for Python, Java & Scala (+R in dev)
Integration with Spark SQL
Data import/export
Internal optimizations
Pain point: Create & handle many RDDs and data types
Abstractions
Training
Load data
Extract features
Train model
Evaluate

Abstraction: Transformer
    def transform(DataFrame): DataFrame
Training step: Extract features
Input columns: label: Double, text: String
Output columns: label: Double, text: String, features: Vector
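A minimal sketch of the transformer idea in plain Python, assuming a list of dicts as a stand-in for a DataFrame. The class `WordCountFeatures` and its single word-count feature are hypothetical and are not part of MLlib; the point is only that `transform` takes a dataset and returns a new dataset with one more column.

```python
class WordCountFeatures:
    """Hypothetical toy transformer: adds a 'features' column to every row."""

    def transform(self, rows):
        # Build new rows instead of mutating the input, mirroring the
        # DataFrame-in, DataFrame-out contract on the slide.
        return [
            dict(row, features=[float(len(row["text"].split()))])
            for row in rows
        ]


# One example row with the 'label' and 'text' columns from the slide.
data = [{"label": 1.0, "text": "It will help somewhat"}]
with_features = WordCountFeatures().transform(data)
print(with_features[0]["features"])
```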

Abstraction: Estimator
    def fit(DataFrame): Model
Training step: Train model (LogisticRegression)
Input columns: label: Double, text: String, features: Vector
Output: Model

Abstraction: Evaluator
    def evaluate(DataFrame): Double
Training step: Evaluate
Input columns: label: Double, text: String, features: Vector, prediction: Double
Output: Metric (accuracy, AUC, MSE, ...)
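As a rough illustration of the evaluator contract (dataset in, one number out), the sketch below computes mean squared error, one of the metrics listed on the slide. It uses a list of dicts in place of a DataFrame; the class name `MeanSquaredErrorEvaluator` and the sample rows are assumptions made for this example, not MLlib code.

```python
class MeanSquaredErrorEvaluator:
    """Hypothetical toy evaluator: reduces a dataset to a single metric."""

    def evaluate(self, rows):
        # Average the squared difference between label and prediction.
        squared_errors = [
            (row["label"] - row["prediction"]) ** 2 for row in rows
        ]
        return sum(squared_errors) / len(squared_errors)


scored = [
    {"label": 1.0, "prediction": 0.5},
    {"label": 0.0, "prediction": 0.5},
]
mse = MeanSquaredErrorEvaluator().evaluate(scored)
print(mse)
```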

Abstraction: Model
Model is a type of Transformer
Testing/Production step: Predict using model
Input columns: text: String, features: Vector
Output columns: text: String, features: Vector, prediction: Double
Then: Act on predictions
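The estimator/model relationship can be sketched in plain Python: `fit` learns something from labeled rows and returns a model, and the model is itself a transformer that adds a `prediction` column. The majority-label rule and both class names are hypothetical stand-ins chosen for brevity; they are not how LogisticRegression works.

```python
from collections import Counter


class MajorityLabelModel:
    """Hypothetical toy model: a transformer that adds a 'prediction' column."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(row, prediction=self.label) for row in rows]


class MajorityLabelEstimator:
    """Hypothetical toy estimator: fit() returns a model."""

    def fit(self, rows):
        # "Training" here just finds the most frequent label.
        counts = Counter(row["label"] for row in rows)
        most_common_label, _ = counts.most_common(1)[0]
        return MajorityLabelModel(most_common_label)


train = [{"label": 0.0}, {"label": 0.0}, {"label": 1.0}]
model = MajorityLabelEstimator().fit(train)

# New, unlabeled data goes through the same transform() interface.
scored = model.transform([{"text": "Apollo Training"}])
print(scored[0]["prediction"])
```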

(Recall) Abstraction: Estimator
Training step: Train model (LogisticRegression)
Input columns: label: Double, text: String, features: Vector
Output: Model

Abstraction: Pipeline
Pipeline is a type of Estimator
Training steps: Load data, Extract features, Train model, Evaluate
Input columns: label: Double, text: String
Output: PipelineModel

Abstraction: PipelineModel
PipelineModel is a type of Transformer
Testing/Production steps: Load data, Extract features, Predict using model
Input columns: text: String
Output columns: text: String, features: Vector, prediction: Double
Then: Act on predictions
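The sketch below shows, under simplifying assumptions, how a pipeline can be an estimator whose fitted result is a transformer. It is plain Python over lists of dicts, not the spark.ml implementation; the `Tokenizer`, `KeywordEstimator`, and `KeywordModel` stages are hypothetical toys, and stages are told apart simply by whether they have a `fit` method.

```python
class Tokenizer:
    """Toy transformer: adds a 'words' column."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class KeywordModel:
    """Toy model: predicts 1.0 if a row contains any learned keyword."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            dict(r, prediction=1.0 if self.keywords & set(r["words"]) else 0.0)
            for r in rows
        ]


class KeywordEstimator:
    """Toy estimator: learns words seen only in rows labeled 1.0."""

    def fit(self, rows):
        pos = {w for r in rows if r["label"] == 1.0 for w in r["words"]}
        neg = {w for r in rows if r["label"] != 1.0 for w in r["words"]}
        return KeywordModel(pos - neg)


class PipelineModel:
    """A fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """An estimator made of stages; fit() returns a PipelineModel."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator becomes a model
            rows = stage.transform(rows)  # feed the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"label": 1.0, "text": "Apollo astronauts trained"},
    {"label": 0.0, "text": "plastic polish scratches"},
]
model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(train)
new_docs = [{"text": "The Apollo mission"}, {"text": "deep scratches"}]
predictions = [r["prediction"] for r in model.transform(new_docs)]
print(predictions)
```

The same object graph serves training and testing, which is how the pipeline abstraction addresses the "almost identical workflow" problem.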

Abstractions: Summary
             Training          Testing
DataFrame    Load data         Load data
Transformer  Extract features  Extract features
Estimator    Train model       Predict using model
Evaluator    Evaluate          Evaluate

Demo
Training                                     Current data schema
DataFrame    Load data                       label: Double, text: String
Transformer  Tokenizer                       words: Seq[String]
Transformer  HashingTF                       features: Vector
Estimator    LogisticRegression              prediction: Double
Evaluator    BinaryClassificationEvaluator
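To make the HashingTF step in the demo more concrete, here is a minimal sketch of the feature-hashing idea in plain Python: each word is hashed to a bucket index and the bucket counts form the feature vector. It uses `zlib.crc32` only because it is deterministic across runs; that choice and the helper name `hashing_tf` are assumptions for illustration, and Spark's HashingTF uses its own hash function.

```python
import zlib


def hashing_tf(words, num_features):
    """Toy term-frequency hashing: count words per hash bucket."""
    vector = [0.0] * num_features
    for word in words:
        # Map the word to a bucket; distinct words may collide.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


words = ["this", "is", "is"]
features = hashing_tf(words, 8)
print(features)
```

A larger `num_features` lowers the chance of collisions at the cost of a longer vector, which is why it is a natural parameter to tune.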
Demo
Training
DataFrame    Load data
Transformer  Tokenizer
Transformer  HashingTF
Estimator    LogisticRegression
Evaluator    BinaryClassificationEvaluator
Pain point: Write as a script
Parameters
Standard API
Typed
Defaults
Built-in doc
Autocomplete
    > hashingTF.numFeatures
    org.apache.spark.ml.param.IntParam =
    numFeatures: number of features (default: 262144)
    > hashingTF.setNumFeatures(1000)
    > hashingTF.getNumFeatures

Parameter Tuning
Given:
Estimator
Parameter grid
Evaluator
Find best parameters
CrossValidator over: Tokenizer, HashingTF, LogisticRegression, BinaryClassificationEvaluator
hashingTF.numFeatures: {100, 1000, 10000}
lr.regParam: {0.01, 0.1, 0.5}
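The parameter grid above expands into every combination of the listed values. A small plain-Python sketch of that expansion follows; it only enumerates the candidates and does not train anything, and the dictionary layout is an assumption made for this example rather than the spark.ml grid builder.

```python
from itertools import product

# Parameter names and candidate values taken from the slide.
param_grid = {
    "hashingTF.numFeatures": [100, 1000, 10000],
    "lr.regParam": [0.01, 0.1, 0.5],
}

# Fix an order for the names, then take the Cartesian product of the values.
names = sorted(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

print(len(combinations))
print(combinations[0])
```

A cross-validator would fit and evaluate the estimator once per combination and per data split, which is why tuning involves training many models.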
Parameter Tuning
Given:
Estimator
Parameter grid
Evaluator
Find best parameters
CrossValidator over: Tokenizer, HashingTF, LogisticRegression, BinaryClassificationEvaluator
Pain point: Tune parameters
Pipelines: Recap
Feature and the pain point it addresses:
DataFrame: Create & handle many RDDs and data types
Abstractions: Write as a script
Parameter API: Tune parameters
Also
Python, Scala, Java APIs
Schema validation
User-Defined Types*
Feature metadata*
Multi-model training optimizations*
Inspirations
scikit-learn + Spark DataFrame, Param API
MLBase (Berkeley AMPLab)
Ongoing collaborations
* Groundwork done; full support WIP.

Outline
ML workflows
Pipelines
Roadmap
Roadmap
spark.mllib: Primary ML package

spark.ml: High-level Pipelines API for algorithms in spark.mllib

(experimental in Spark 1.2-1.3)
Near future
Feature attributes
Feature transformers
More algorithms under Pipeline API

Farther ahead
Ideas from AMPLab MLBase (auto-tuning models)
SparkR integration
Outline
ML workflows
Pipelines
DataFrame
Abstractions
Parameter tuning
Roadmap

Spark documentation
http://spark.apache.org/
Pipelines blog post
https://databricks.com/blog/2015/01/07
Thank you!
