KeystoneML
[Diagram: software stack showing KeystoneML, Velox Model Server, GraphX, MLlib, and ml-matrix on Spark]
Velox: http://amplab.github.io/velox-modelserver
WHAT'S A MACHINE
LEARNING PIPELINE?
[Diagram, simple view: Data → Train Classifier → Model]
[Diagram, fuller view: Data → Feature Extraction → Train Linear Classifier → Model → Predictions]
Pipeline: Tokenize → Bigrams → Top Features → Naive Bayes → Max Classifier
.fit( 20 Newsgroups )
Fitted pipeline: Tokenize → Bigrams → Top Features Transformer → Naive Bayes Model → Max Classifier
Then evaluate and ship to Velox when you're ready to apply to real-time data.
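The text pipeline above can be sketched in plain Python to show how fitting turns estimator stages into fixed transformers. This is only an illustration under stated assumptions: it is not the KeystoneML API (which is Scala on Spark), every function name here is hypothetical, and the four training sentences are toy stand-ins for 20 Newsgroups.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace.
    return text.lower().split()

def bigrams(tokens):
    # Adjacent token pairs.
    return list(zip(tokens, tokens[1:]))

def featurize(text):
    return bigrams(tokenize(text))

def fit_top_features(docs, k):
    # Estimator step: learn the k most frequent bigrams,
    # then return a transformer that keeps only those.
    counts = Counter(f for doc in docs for f in doc)
    vocab = {f for f, _ in counts.most_common(k)}
    return lambda feats: [f for f in feats if f in vocab]

def fit_naive_bayes(docs, labels, alpha=1.0):
    # Estimator step: multinomial naive Bayes with Laplace smoothing.
    # Returns a transformer mapping features to per-class log scores.
    classes = sorted(set(labels))
    vocab = {f for doc in docs for f in doc}
    prior = Counter(labels)
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    totals = {c: sum(counts[c].values()) for c in classes}
    n, v = len(labels), len(vocab)

    def score(feats):
        return {
            c: math.log(prior[c] / n) + sum(
                math.log((counts[c][f] + alpha) / (totals[c] + alpha * v))
                for f in feats if f in vocab)
            for c in classes
        }
    return score

def max_classifier(scores):
    # Pick the class with the highest score.
    return max(scores, key=scores.get)

def fit_pipeline(texts, labels, k=100):
    feats = [featurize(t) for t in texts]
    top = fit_top_features(feats, k)
    model = fit_naive_bayes([top(f) for f in feats], labels)
    # The fitted pipeline is itself a function from text to label.
    return lambda text: max_classifier(model(top(featurize(text))))

# Toy data (hypothetical), standing in for 20 Newsgroups.
train_texts = [
    "the rocket launch was a success",
    "the rocket engine failed",
    "the hockey team won the game",
    "the hockey game went to overtime",
]
train_labels = ["sci.space", "sci.space",
                "rec.sport.hockey", "rec.sport.hockey"]

predict = fit_pipeline(train_texts, train_labels)
print(predict("the rocket launch"))
```

Once fitted, `predict` holds no reference to the training data beyond the learned counts, which is what makes the fitted pipeline something you can evaluate and then deploy.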
NOT SO SIMPLE EXAMPLE:
IMAGE CLASSIFICATION
Pipeline: Resize → Grayscale → SIFT → PCA → Fisher Vector → Linear Regression → MaxClassifier
.fit( Images (VOC2007) )
Fitted pipeline: Resize → Grayscale → SIFT → PCA Map → Fisher Encoder → Linear Model → MaxClassifier
5,000 examples, 40,000 features, 20 classes.
Embarrassingly parallel featurization and evaluation.
15 minutes on a modest cluster.
Achieves performance of Chatfield et al., 2011.
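Featurization is embarrassingly parallel because each image is transformed independently of every other image. The sketch below illustrates that property in plain Python, with a thread pool standing in for a cluster. The function names, the toy images, and the BT.601 grayscale weights are all assumptions made for illustration, not details taken from KeystoneML.

```python
from concurrent.futures import ThreadPoolExecutor

def grayscale(image):
    # image: list of (r, g, b) pixels.
    # Weights are the ITU-R BT.601 luma coefficients (an assumption here).
    return [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in image]

def mean_intensity(gray):
    # A deliberately trivial stand-in for a real descriptor such as SIFT.
    return sum(gray) / len(gray)

def featurize(image):
    # Per-image work only: no shared state, so any image
    # can be processed on any worker in any order.
    return mean_intensity(grayscale(image))

# Toy images (hypothetical data).
images = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(10, 20, 30)],
]

# Parallel map over images; pool.map preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel_features = list(pool.map(featurize, images))

# The sequential result is identical, which is what makes
# the parallel version safe.
sequential_features = [featurize(img) for img in images]
print(parallel_features == sequential_features)
```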
EVEN LESS SIMPLE: IMAGENET
<200 SLOC
1000 class classification.
1,200,000 examples
64,000 features.
90 minutes on 100 nodes.
[Diagram: Resize feeds parallel feature branches labeled Color, Edges, Texture, and More Stuff I'd Never Heard Of (Grayscale, LCS, SIFT); branches continue through PCA and Fisher Vector into Block Linear Solver, shown swapped for Weighted Block Linear Solver, then Top 5 Classifier.]
And Shivaram doesn't have a heart attack when Prof. Recht tells us he wants to try a new solver.
Or add 100,000 more texture features.
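The final stage of the ImageNet pipeline returns the five highest-scoring classes instead of a single best class. A minimal sketch of the difference between a max classifier and a top-k classifier, in plain Python with hypothetical function names and made-up scores:

```python
import heapq

def max_classifier(scores):
    # Index of the single highest-scoring class.
    return max(range(len(scores)), key=scores.__getitem__)

def top_k_classifier(scores, k):
    # Indices of the k highest-scoring classes, best first.
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

def top_k_hit(scores, true_label, k):
    # A prediction counts as correct if the true class is among the top k.
    return true_label in top_k_classifier(scores, k)

# Made-up per-class scores for one example (7 classes).
scores = [0.1, 2.3, -0.5, 1.7, 0.9, 3.2, 0.0]

best = max_classifier(scores)
top5 = top_k_classifier(scores, 5)
print(best, top5)
```

With many classes, top-5 evaluation is more forgiving than top-1: in this example a true label of class 3 is a miss for the max classifier but a hit for the top-5 classifier.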
SOFTWARE FEATURES
Data Loaders
CSV, CIFAR, ImageNet, VOC, TIMIT, 20 Newsgroups
Transformers
NLP - Tokenization, n-grams, term frequency, NER*, parsing*
Images - Convolution, Grayscaling, LCS, SIFT*, FisherVector*, Pooling, Windowing, HOG, Daisy
Speech - MFCCs*
Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT
Utility/misc - Caching, Top-K classifier, indicator label mapping, sparse/dense encoding transformers.
Estimators
Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM*
Example Pipelines
NLP - 20 Newsgroups, Wikipedia Language model
Images - MNIST, CIFAR, VOC, ImageNet
Speech - TIMIT
Evaluation Metrics
Binary Classification
Multiclass Classification
Multilabel Classification
Just 5k Lines of Code, 1.5k of which are TESTS + 1.5k lines of JavaDoc
RDD[Input] → Estimator → .fit() → Transformer
String → featurizer → Vector → Linear Map (from .fit(data, labels)) → Prediction
=
String → pipeline → Prediction
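The relationship in the diagram (an estimator's fit produces a transformer, and chained transformers form a single pipeline from String to Prediction) can be sketched in plain Python. This is a hedged illustration, not the KeystoneML API: the class names, the `then` method, the word-count featurizer, and the one-feature least-squares fit are all assumptions chosen to keep the example small.

```python
class Transformer:
    """Maps one input to one output; chainable with then()."""
    def apply(self, x):
        raise NotImplementedError

    def then(self, nxt):
        return Chained(self, nxt)

class Chained(Transformer):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def apply(self, x):
        return self.second.apply(self.first.apply(x))

class WordCountFeaturizer(Transformer):
    # String -> Vector (a one-element list in this toy example).
    def apply(self, text):
        return [float(len(text.split()))]

class LinearMap(Transformer):
    # Vector -> Prediction.
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def apply(self, vec):
        return sum(w * v for w, v in zip(self.weights, vec)) + self.bias

class LeastSquaresEstimator:
    # Estimator: fit(data, labels) returns a Transformer.
    # Closed-form simple linear regression on a single feature.
    def fit(self, data, labels):
        xs = [v[0] for v in data]
        n = len(xs)
        mx, my = sum(xs) / n, sum(labels) / n
        var = sum((x - mx) ** 2 for x in xs)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, labels))
        w = cov / var
        return LinearMap([w], my - w * mx)

# Toy training data (hypothetical): label is twice the word count.
texts = ["a", "a b", "a b c", "a b c d"]
labels = [2.0, 4.0, 6.0, 8.0]

featurizer = WordCountFeaturizer()
linear_map = LeastSquaresEstimator().fit(
    [featurizer.apply(t) for t in texts], labels)

# String -> featurizer -> Vector -> Linear Map -> Prediction
pipeline = featurizer.then(linear_map)
prediction = pipeline.apply("one two three four five")
print(prediction)
```

Because the chained object is itself a transformer, the caller only sees String in and Prediction out, which is the equality the diagram expresses.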
Can be defined as a class.
Or as a unary function.
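Assuming these captions describe two ways to define a transformer, the contrast might look like the following plain Python sketch. The names `Transformer`, `Lowercase`, and `transformer_from_function` are hypothetical and do not come from KeystoneML.

```python
class Transformer:
    def apply(self, x):
        raise NotImplementedError

# Option 1: define the transformer as a class.
class Lowercase(Transformer):
    def apply(self, text):
        return text.lower()

# Option 2: wrap any one-argument (unary) function.
def transformer_from_function(fn):
    class _FunctionTransformer(Transformer):
        def apply(self, x):
            return fn(x)
    return _FunctionTransformer()

as_class = Lowercase()
as_function = transformer_from_function(str.lower)

# Both forms expose the same interface to the rest of a pipeline.
print(as_class.apply("KeystoneML"), as_function.apply("KeystoneML"))
```

The class form suits transformers that carry parameters or state; the function form keeps simple, stateless steps to a single line.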
Loads a shared library.
Native code is shared across the cluster.
javah generates a header for the shared library.
RESULTS
TIMIT Speech Dataset:
ImageNet:
Code
http://github.com/amplab/keystone
Contributors: