Outline
ML workflows
Pipelines
Roadmap
Example: Text Classification
Goal: Given a text document, predict its topic.
Dataset: 20 Newsgroups, from the UCI KDD Archive
Features
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something...
...
Label
1: about science
0: not about science
CTR, inches of rainfall, ...
Example ML Workflow
Training: Learn a model. Train model, then Evaluate on labels + predictions.
Testing/Production: Use the model to make predictions.
Enter...
Key Concepts
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
BIG data
# Scale labels
data("label") * 0.5
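The "RDD + schema + DSL" idea can be sketched in a few lines of plain Python. This is a toy illustration only, not Spark's DataFrame API: the names ToyDataFrame, ToyColumn, and with_column are hypothetical, and the rows are an in-memory list rather than a distributed RDD. It shows a schema that is checked up front and a column expression that mirrors the slide's label-scaling example.

```python
class ToyColumn:
    """A deferred per-row expression (the 'DSL' part of the sketch)."""

    def __init__(self, fn):
        self.fn = fn

    def __mul__(self, factor):
        # Build a new expression; nothing is computed until rows are visited.
        return ToyColumn(lambda row: self.fn(row) * factor)


class ToyDataFrame:
    """Rows plus a schema mapping column name -> Python type (hypothetical)."""

    def __init__(self, rows, schema):
        # Schema validation: every row must match the declared column types.
        for row in rows:
            for name, dtype in schema.items():
                if not isinstance(row[name], dtype):
                    raise TypeError(f"column {name!r} expects {dtype.__name__}")
        self.rows = rows
        self.schema = schema

    def __call__(self, name):
        # Mirrors the slide's data("label") lookup; unknown columns fail early.
        if name not in self.schema:
            raise KeyError(name)
        return ToyColumn(lambda row: row[name])

    def with_column(self, name, column, dtype):
        rows = [{**row, name: column.fn(row)} for row in self.rows]
        return ToyDataFrame(rows, {**self.schema, name: dtype})


data = ToyDataFrame(
    [{"label": 1.0, "text": "about science"},
     {"label": 0.0, "text": "not about science"}],
    {"label": float, "text": str},
)

# Scale labels
scaled = data.with_column("scaled_label", data("label") * 0.5, float)
print([row["scaled_label"] for row in scaled.rows])  # [0.5, 0.0]
```

Because the schema travels with the data, a misspelled column name or a wrongly typed value is caught when the expression is built or the frame is constructed, rather than deep inside a later stage.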
Workflow stages: Extract features, Train model, Evaluate, Act on predictions
Input columns: label: Double, text: String
After feature extraction: label: Double, text: String, features: Vector
PipelineModel: the fitted pipeline, applied to data with columns label: Double, text: String
Transformer: Tokenizer, adds words: Seq[String]
Transformer: HashingTF, adds features: Vector
Estimator: LogisticRegression, adds prediction: Double
Evaluator: BinaryClassificationEvaluator
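The Transformer and Estimator roles listed above can be sketched in plain Python. This is a toy under stated assumptions, not Spark's implementation: every class name below (ToyTokenizer, ToyHashingTF, ToyLogisticRegression, ToyPipeline, and their models) is hypothetical, rows are plain dicts, the hashing uses zlib.crc32, and the classifier is a small gradient-descent logistic regression without an intercept. The point is the contract: a Transformer adds a column, an Estimator's fit() returns a Transformer, and a pipeline chains both.

```python
import math
import zlib
from collections import Counter


class ToyTokenizer:
    """Transformer: adds a 'words' column derived from 'text'."""

    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]


class ToyHashingTF:
    """Transformer: hashes each word into a bucket and counts occurrences."""

    def __init__(self, num_features=64):
        self.num_features = num_features

    def transform(self, rows):
        out = []
        for r in rows:
            counts = Counter(
                zlib.crc32(w.encode("utf-8")) % self.num_features for w in r["words"]
            )
            features = [float(counts.get(i, 0)) for i in range(self.num_features)]
            out.append({**r, "features": features})
        return out


class ToyLogisticRegressionModel:
    """Transformer produced by fitting: adds a 'prediction' column."""

    def __init__(self, weights):
        self.weights = weights

    def transform(self, rows):
        out = []
        for r in rows:
            z = sum(w * x for w, x in zip(self.weights, r["features"]))
            out.append({**r, "prediction": 1.0 if z > 0 else 0.0})
        return out


class ToyLogisticRegression:
    """Estimator: fit() learns weights by L2-regularized gradient descent."""

    def __init__(self, reg_param=0.01, steps=200, step_size=0.5):
        self.reg_param, self.steps, self.step_size = reg_param, steps, step_size

    def fit(self, rows):
        w = [0.0] * len(rows[0]["features"])
        for _ in range(self.steps):
            grad = [self.reg_param * wi for wi in w]
            for r in rows:
                z = sum(wi * xi for wi, xi in zip(w, r["features"]))
                p = 1.0 / (1.0 + math.exp(-z))
                for i, xi in enumerate(r["features"]):
                    grad[i] += (p - r["label"]) * xi / len(rows)
            w = [wi - self.step_size * gi for wi, gi in zip(w, grad)]
        return ToyLogisticRegressionModel(w)


class ToyPipelineModel:
    """Transformer: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Estimator: fits Estimator stages in order, passing data through each."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> fitted Transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return ToyPipelineModel(fitted)


train = [
    {"label": 1.0, "text": "space orbit physics"},
    {"label": 1.0, "text": "physics experiment orbit"},
    {"label": 0.0, "text": "hockey game score"},
    {"label": 0.0, "text": "game season hockey"},
]
pipeline = ToyPipeline([ToyTokenizer(), ToyHashingTF(64), ToyLogisticRegression(0.01)])
model = pipeline.fit(train)
predictions = model.transform([{"text": "orbit physics experiment"}])
print(sorted(predictions[0]))  # ['features', 'prediction', 'text', 'words']
```

The fitted ToyPipelineModel is itself a Transformer, so the same object that was trained can be applied to new, unlabeled rows in testing or production without re-implementing feature extraction.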
Demo
Training
DataFrame: Load data
Transformer: Tokenizer
Transformer: HashingTF
Estimator: LogisticRegression
Evaluator: BinaryClassificationEvaluator
Pain point: Write as a script
Parameter Tuning
Given:
Estimator
Parameter grid
Evaluator
Find best parameters
Pain point: Tune parameters
CrossValidator
Estimator: Tokenizer, HashingTF, LogisticRegression
Parameters: lr.regParam {0.01, 0.1, 0.5}
Evaluator: BinaryClassificationEvaluator
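The "Estimator + parameter grid + Evaluator" recipe above can be sketched as a small k-fold grid search in plain Python. This is an illustration under stated assumptions, not Spark's CrossValidator: the helper names (k_fold_splits, param_grid, cross_validate) are hypothetical, and to keep the example short and deterministic the estimator is a one-dimensional ridge regression (ToyRidge) scored by negative mean squared error, swapped in for logistic regression and a binary classification metric. The grid reuses the slide's regularization values.

```python
from itertools import product


class ToyRidgeModel:
    """Fitted model: predicts w * x."""

    def __init__(self, w):
        self.w = w

    def transform(self, rows):
        return [{**r, "prediction": self.w * r["x"]} for r in rows]


class ToyRidge:
    """Estimator: 1-D ridge regression through the origin (closed form)."""

    def __init__(self, reg_param):
        self.reg_param = reg_param

    def fit(self, rows):
        sxy = sum(r["x"] * r["label"] for r in rows)
        sxx = sum(r["x"] ** 2 for r in rows)
        return ToyRidgeModel(sxy / (sxx + self.reg_param))


def neg_mse(rows):
    """Evaluator: negative mean squared error, so higher is better."""
    return -sum((r["prediction"] - r["label"]) ** 2 for r in rows) / len(rows)


def param_grid(**values):
    """Expand named lists of values into a list of parameter dicts."""
    names = sorted(values)
    return [dict(zip(names, combo)) for combo in product(*(values[n] for n in names))]


def k_fold_splits(rows, k):
    """Yield (train, validation) pairs; fold i holds out rows i, i+k, i+2k, ..."""
    for i in range(k):
        validation = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, validation


def cross_validate(make_estimator, evaluate, grid, rows, k=3):
    """Return (best_params, best_mean_score) over the grid."""
    best_params, best_score = None, float("-inf")
    for params in grid:
        scores = []
        for train, validation in k_fold_splits(rows, k):
            model = make_estimator(**params).fit(train)
            scores.append(evaluate(model.transform(validation)))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Noise-free data with label = 2 * x, so less regularization fits better.
data = [{"x": float(x), "label": 2.0 * x} for x in range(1, 7)]
grid = param_grid(reg_param=[0.01, 0.1, 0.5])
best_params, best_score = cross_validate(ToyRidge, neg_mse, grid, data, k=3)
print(best_params)  # {'reg_param': 0.01}
```

Keeping the estimator, the grid, and the evaluator as three separate inputs is what lets the same search loop tune any stage, including a whole pipeline, without changes to the loop itself.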
Pipelines: Recap
Pain point: Create & handle many RDDs and data types. Addressed by: DataFrame
Pain point: Write as a script. Addressed by: Abstractions
Pain point: Tune parameters. Addressed by: Parameter API
Also
Python, Scala, Java APIs
Schema validation
User-Defined Types*
Feature metadata*
Multi-model training optimizations*
Inspirations
scikit-learn + Spark DataFrame, Param API
MLBase (Berkeley AMPLab): Ongoing collaborations
Roadmap
spark.mllib: Primary ML package
Near future
Feature attributes
Feature transformers
More algorithms under Pipeline API
Farther ahead
Ideas from AMPLab MLBase (auto-tuning models)
SparkR integration
Outline
ML workflows
Pipelines
  DataFrame
  Abstractions
  Parameter tuning
Roadmap
Spark documentation: http://spark.apache.org/
Pipelines blog post: https://databricks.com/blog/2015/01/07
Thank you!