
Practical Machine Learning

Pipelines with MLlib


Joseph K. Bradley
March 18, 2015
Spark Summit East 2015
About Spark MLlib
Started in UC Berkeley AMPLab
Shipped with Spark 0.8
Currently (Spark 1.3)
Contributions from 50+ orgs, 100+ individuals
Good coverage of algorithms

classification
regression
clustering
recommendation
feature extraction, selection
statistics
linear algebra
frequent itemsets
MLlib's Mission
MLlib's mission is to make practical
machine learning easy and scalable.
Capable of learning from large-scale datasets
Easy to build machine learning applications

How can we move beyond this list of algorithms
and help users develop real ML workflows?
Outline

ML workflows
Pipelines
Roadmap
Example: Text Classification
Goal: Given a text document, predict its topic.
Dataset: "20 Newsgroups"
From UCI KDD Archive

Features (text, image, vector, ...)
  Subject: Re: Lexan Polish?
  Suggest McQuires #1 plastic polish. It will help somewhat
  but nothing will remove deep scratches without making it
  worse than it already is.
  McQuires will do something...

Label (CTR, inches of rainfall, ...)
  1: about science
  0: not about science



Training & Testing
Training
Given labeled data: RDD of (features, label)
  Subject: Re: Lexan Polish?
  Suggest McQuires #1 plastic polish. It will help...
  Label 0
  Subject: RIPEM FAQ
  RIPEM is a program which performs Privacy Enhanced...
  Label 1
  ...
Learn a model.

Testing/Production
Given new unlabeled data: RDD of features
  Subject: Apollo Training
  The Apollo astronauts also trained at (in) Meteor...
  Label 1
  Subject: A demo of Nonsense
  How can you lie about something that no one...
  Label 0
  ...
Use model to make predictions.



Example ML Workflow
Training
Load data
  labels + plain text
Extract features
  labels + feature vectors
Train model
  labels + predictions
Evaluate

Pain point: Create many RDDs
val labels: RDD[Double] =
  data.map(_.label)
val features: RDD[Vector]
val predictions: RDD[Double]

Explicitly unzip & zip RDDs
labels.zip(predictions).map { case (label, prediction) =>
  if (label == prediction) ...
}
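
The unzip-and-zip bookkeeping on this slide can be illustrated without Spark. The sketch below is plain Python, not MLlib code: ordinary lists stand in for RDDs, and the label and prediction values are made up for the example. It pairs each label with its prediction and computes accuracy.

```python
# Plain-Python stand-in for the RDD bookkeeping: lists play the role of RDDs.
# The values are illustrative only, not taken from the 20 Newsgroups data.
labels = [1.0, 0.0, 1.0, 0.0]
predictions = [1.0, 0.0, 0.0, 0.0]

# Zip labels with predictions and count the positions where they agree.
correct = sum(
    1 for label, prediction in zip(labels, predictions) if label == prediction
)
accuracy = correct / len(labels)
```

Even in this tiny form, the two collections must be kept in the same order by hand, which is the pain point the slide names.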
Example ML Workflow
Training
Load data
  labels + plain text
Extract features
  labels + feature vectors
Train model
  labels + predictions
Evaluate

Pain point: Write as a script
Not modular
Difficult to re-use workflow
Example ML Workflow
Training                      Testing/Production
Load data                     Load new data
  labels + plain text           plain text
Extract features              Extract features
  labels + feature vectors      feature vectors
Train model                   Predict using model
  labels + predictions          predictions
Evaluate                      Act on predictions

Almost identical workflow
Example ML Workflow
Training
Load data
  labels + plain text
Extract features
  labels + feature vectors
Train model
  labels + predictions
Evaluate

Pain point: Parameter tuning
Key part of ML
Involves training many models
  For different splits of the data
  For different sets of parameters
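
To make "many models" concrete, here is a minimal plain-Python sketch of how splits multiply the work. The helper name k_fold_splits and the count of parameter settings are hypothetical, chosen only for illustration; this is not how MLlib partitions data internally.

```python
def k_fold_splits(num_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(num_rows))
    for fold in range(num_folds):
        # Hold out every num_folds-th row, starting at position `fold`.
        validation = indices[fold::num_folds]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


splits = list(k_fold_splits(num_rows=6, num_folds=3))

# Each candidate parameter setting is trained once per split.
num_param_settings = 4  # hypothetical count, for illustration only
models_to_train = len(splits) * num_param_settings
```

With 3 splits and 4 parameter settings, 12 models are trained, which is why doing this by hand in a script gets tedious quickly.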
Pain Points
Create & handle many RDDs and data types
Write as a script
Tune parameters

Enter...

Pipelines! in Spark 1.2 & 1.3


Outline

ML workflows
Pipelines
Roadmap
Key Concepts

DataFrame: The ML Dataset


Abstractions: Transformers, Estimators, & Evaluators
Parameters: API & tuning
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL

Named columns with types


label: Double
text: String
words: Seq[String]
features: Vector
prediction: Double

label  text         words            features
0      This is ...  [This, is, ...]  [0.5, 1.2, ...]
0      When we ...  [When, ...]      [1.9, -0.8, ...]


DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL

Named columns with types
Domain-Specific Language

# Select science articles
sciDocs = data.filter(data["label"] == 1)

# Scale labels
data["label"] * 0.5
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
BIG data

Named columns with types
Domain-Specific Language

Shipped with Spark 1.3
APIs for Python, Java & Scala (+R in dev)
Integration with Spark SQL
  Data import/export
  Internal optimizations

Pain point: Create & handle many RDDs and data types
Abstractions
Training
Load data

Extract features

Train model

Evaluate



Abstraction: Transformer
def transform(DataFrame): DataFrame

Training workflow: Extract features, Train model, Evaluate
Step: Extract features
Input columns:
  label: Double
  text: String
Output columns:
  label: Double
  text: String
  features: Vector
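
The Transformer contract (dataset in, dataset with an extra column out) can be sketched in a few lines. This is a toy in plain Python, not the spark.ml API: a list of dicts stands in for a DataFrame, and the class name ToyFeatureExtractor and its two features (word count and character count) are assumptions made up for the example.

```python
class ToyFeatureExtractor:
    """Toy stand-in for a Transformer: appends a 'features' column."""

    def transform(self, rows):
        # Build new rows so the input dataset is left unchanged.
        return [
            dict(
                row,
                features=[float(len(row["text"].split())), float(len(row["text"]))],
            )
            for row in rows
        ]


rows = [{"label": 1.0, "text": "Apollo Training"}]
out = ToyFeatureExtractor().transform(rows)
```

The existing columns (label, text) pass through untouched and only the new column is added, mirroring the input and output schemas on the slide.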



Abstraction: Estimator
def fit(DataFrame): Model

Training workflow: Extract features, Train model, Evaluate
Step: Train model
Example: LogisticRegression
Input columns:
  label: Double
  text: String
  features: Vector
Output: Model
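
The Estimator contract (fit on a dataset, get back a Model) can be sketched the same way. The toy below is plain Python and deliberately not logistic regression: the hypothetical ToyMajorityEstimator just learns the most common label, and the model it returns predicts that label for every row. Lists of dicts again stand in for DataFrames.

```python
from collections import Counter


class ToyMajorityModel:
    """Result of fitting: adds a 'prediction' column to any dataset."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(row, prediction=self.label) for row in rows]


class ToyMajorityEstimator:
    """Toy stand-in for an Estimator: fit() returns a model."""

    def fit(self, rows):
        counts = Counter(row["label"] for row in rows)
        most_common_label, _ = counts.most_common(1)[0]
        return ToyMajorityModel(most_common_label)


training = [
    {"label": 0.0, "features": [1.0]},
    {"label": 0.0, "features": [2.0]},
    {"label": 1.0, "features": [3.0]},
]
model = ToyMajorityEstimator().fit(training)
scored = model.transform([{"features": [4.0]}])
```

Note that the fitted model only exposes transform, so at prediction time it is used like any other feature-adding step.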



Abstraction: Evaluator
def evaluate(DataFrame): Double

Training workflow: Extract features, Train model, Evaluate
Step: Evaluate
Input columns:
  label: Double
  text: String
  features: Vector
  prediction: Double
Output: Metric (accuracy, AUC, MSE, ...)



Abstraction: Model
Model is a type of Transformer
def transform(DataFrame): DataFrame

Testing/Production workflow: Extract features, Predict using model, Act on predictions
Step: Predict using model
Input columns:
  text: String
  features: Vector
Output columns:
  text: String
  features: Vector
  prediction: Double



(Recall) Abstraction: Estimator
def fit(DataFrame): Model

Training workflow: Load data, Extract features, Train model, Evaluate
Step: Train model
Example: LogisticRegression
Input columns:
  label: Double
  text: String
  features: Vector
Output: Model



Abstraction: Pipeline
Pipeline is a type of Estimator
def fit(DataFrame): Model

Training workflow: Load data, Extract features, Train model, Evaluate
Input columns:
  label: Double
  text: String
Output: PipelineModel



Abstraction: PipelineModel
PipelineModel is a type of Transformer
def transform(DataFrame): DataFrame

Testing/Production workflow: Load data, Extract features, Predict using model, Act on predictions
Input columns:
  text: String
Output columns:
  text: String
  features: Vector
  prediction: Double
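
How a Pipeline turns into a PipelineModel can be sketched as follows. This is a Spark-free toy in plain Python, assuming a convention where estimators expose fit and transformers expose only transform; all class names are hypothetical, and the toy estimator ignores labels and simply thresholds on the mean word count.

```python
class AddWordCount:
    """Transformer: adds a one-element 'features' column."""

    def transform(self, rows):
        return [dict(r, features=[float(len(r["text"].split()))]) for r in rows]


class LongTextModel:
    """Model: predicts 1.0 when the word count exceeds the learned threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [
            dict(r, prediction=1.0 if r["features"][0] > self.threshold else 0.0)
            for r in rows
        ]


class MeanLengthEstimator:
    """Estimator: learns the mean word count (labels are ignored in this toy)."""

    def fit(self, rows):
        mean = sum(r["features"][0] for r in rows) / len(rows)
        return LongTextModel(mean)


class ToyPipelineModel:
    """Fitted pipeline: every stage is now a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Pipeline: fits estimator stages in order, passing data along."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator becomes a model
            rows = stage.transform(rows)  # feed output to the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


train = [
    {"label": 0.0, "text": "short"},
    {"label": 1.0, "text": "a much longer piece of text"},
]
pipeline_model = ToyPipeline([AddWordCount(), MeanLengthEstimator()]).fit(train)
out = pipeline_model.transform([{"text": "one two three four five"}])
```

The same stage list serves both training and prediction, which is the point of the abstraction: the test-time workflow is derived from the training one instead of being rewritten.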



Abstractions: Summary
Abstraction   Training          Testing
DataFrame     Load data         Load data
Transformer   Extract features  Extract features
Estimator     Train model       Predict using model
Evaluator     Evaluate          Evaluate



Demo
Training

Abstraction   Stage                          Current data schema
DataFrame     Load data                      label: Double, text: String
Transformer   Tokenizer                      words: Seq[String]
Transformer   HashingTF                      features: Vector
Estimator     LogisticRegression             prediction: Double
Evaluator     BinaryClassificationEvaluator
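
The first two demo stages, tokenizing and term-frequency hashing, can be sketched in plain Python. This illustrates the hashing trick only; it is not MLlib's implementation. The function names are hypothetical, and zlib.crc32 is an assumed stand-in for whatever hash function the real HashingTF uses.

```python
import zlib


def tokenize(text):
    """Lowercase the text and split on whitespace."""
    return text.lower().split()


def hashing_tf(words, num_features):
    """Map words to a fixed-length term-frequency vector via hashing."""
    vector = [0.0] * num_features
    for word in words:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


words = tokenize("RIPEM is a program which performs")
features = hashing_tf(words, num_features=16)
```

The vector length is fixed by num_features regardless of vocabulary size, at the cost of occasional collisions between different words.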
Demo
Training

Abstraction   Stage
DataFrame     Load data
Transformer   Tokenizer
Transformer   HashingTF
Estimator     LogisticRegression
Evaluator     BinaryClassificationEvaluator

Pain point: Write as a script
Parameters

Standard API
  Typed
  Defaults
  Built-in doc
  Autocomplete

> hashingTF.numFeatures
org.apache.spark.ml.param.IntParam =
  numFeatures: number of features
  (default: 262144)
> hashingTF.setNumFeatures(1000)
> hashingTF.getNumFeatures



Parameter Tuning
Given:
  Estimator
  Parameter grid
  Evaluator
Find best parameters

CrossValidator
  Estimator: Tokenizer, HashingTF, LogisticRegression
  Parameter grid:
    hashingTF.numFeatures {100, 1000, 10000}
    lr.regParam {0.01, 0.1, 0.5}
  Evaluator: BinaryClassificationEvaluator
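
The grid on this slide expands to every combination of the two parameters. The sketch below shows that expansion in plain Python; it is not the spark.ml tuning API. The helper select_best is hypothetical, and the scoring function passed to it is a placeholder, since a real score would come from fitting and evaluating a model for each parameter map.

```python
from itertools import product

# Parameter grid as shown on the slide.
param_grid = {
    "hashingTF.numFeatures": [100, 1000, 10000],
    "lr.regParam": [0.01, 0.1, 0.5],
}

# Expand the grid into one parameter map per combination.
names = sorted(param_grid)
param_maps = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]


def select_best(candidates, score):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)


# Placeholder score for illustration only: prefers smaller regParam.
# A real score would be a cross-validated evaluation metric.
best = select_best(param_maps, score=lambda params: -params["lr.regParam"])
```

Three values for each of two parameters give 9 parameter maps, and each one is trained once per data split during cross-validation.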
Parameter Tuning
Given:
  Estimator
  Parameter grid
  Evaluator
Find best parameters

CrossValidator
  Tokenizer
  HashingTF
  LogisticRegression
  BinaryClassificationEvaluator

Pain point: Tune parameters
Pipelines: Recap
DataFrame       Create & handle many RDDs and data types
Abstractions    Write as a script
Parameter API   Tune parameters

Also
Python, Scala, Java APIs
Schema validation
User-Defined Types*
Feature metadata*
Multi-model training optimizations*

Inspirations
scikit-learn
  + Spark DataFrame, Param API
MLBase (Berkeley AMPLab)
  Ongoing collaborations

* Groundwork done; full support WIP.


Outline

ML workflows
Pipelines
Roadmap
Roadmap
spark.mllib: Primary ML package

spark.ml: High-level Pipelines API for algorithms in spark.mllib
(experimental in Spark 1.2-1.3)

Near future
Feature attributes
Feature transformers
More algorithms under Pipeline API

Farther ahead
Ideas from AMPLab MLBase (auto-tuning models)
SparkR integration
Outline
ML workflows
Pipelines
  DataFrame
  Abstractions
  Parameter tuning
Roadmap

Spark documentation
http://spark.apache.org/

Pipelines blog post
https://databricks.com/blog/2015/01/07

Thank you!
