DS 200 Study Guide

Exam Sections
Data Acquisition
Data Evaluation
Data Transformation
Machine Learning Basics
Clustering
Classification
Collaborative Filtering
Model/Feature Selection
Probability
Visualization
Optimization
Data Acquisition
Objectives
Access and load data from a variety of sources into a Hadoop cluster, including from databases and systems such as OLTP and OLAP as well as log files and documents.
Deploy a variety of acquisition techniques for acquiring data, including database integration and working with APIs
Use command line tools such as wget and curl
Use Hadoop tools such as Sqoop and Flume
Study Resources
Apache Sqoop
Cloudera's blogs on Apache Sqoop
Aaron Kimball on Sqoop
Apache Flume
Cloudera's blogs on Apache Flume
Cloudera's blogs on data collection
HDFS File System Shell Guide
Hadoop: The Definitive Guide, 3rd Edition: Chapter 15
Hadoop In Practice: Chapter 2
Data Evaluation
Objectives
Knowledge of the file types commonly used for input and output and the advantages and disadvantages of each
Methods for working with various file formats including binary files, JSON, XML, and .csv
Tools, techniques, and utilities for evaluating data from the command line and at scale
An understanding of sampling and filtering techniques
A familiarity with Hadoop SequenceFiles and serialization using Avro
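The sampling and filtering objective above can be practiced with a small sketch like the following. It assumes comma-separated records and uses reservoir sampling, one common way to draw a fixed-size uniform sample from a stream of unknown length; the function names are hypothetical and this is not taken from the exam material.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace a slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def filter_records(lines, n_fields, sep=","):
    """Keep only lines that split into exactly n_fields fields."""
    return [line for line in lines if len(line.split(sep)) == n_fields]
```

Because the sample is built in one pass with constant memory, the same idea works when the input is piped in from the command line.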
Study Resources
Hadoop: The Definitive Guide, 3rd Edition: Chapter 4
Apache Avro
Cloudera's blogs on Apache Avro
Data Transformation
Objectives
Write a map-only Hadoop Streaming job
Write a script that receives records on stdin and writes them to stdout
Invoke Unix tools to convert file formats
Join data sets
Write scripts to anonymize data sets
Write a Mapper using Python and invoke via Hadoop streaming
Write a custom subclass of FileOutputFormat
Write records into a new format such as AvroOutputFormat or SequenceFileOutputFormat
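Several of the objectives above (a stdin-to-stdout script, a Python mapper for Hadoop Streaming, and anonymizing a data set) can be combined in one minimal sketch. It assumes tab-separated records with an identifier in the first column; the salt value, column position, and function names are hypothetical. Replacing a value with a truncated salted hash is pseudonymization rather than strong anonymization, so treat it only as a starting point.

```python
import hashlib
import sys

SALT = "example-salt"  # hypothetical value; keep a real salt secret


def anonymize(value, salt=SALT):
    """Replace a value with a short, repeatable salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def map_line(line, id_column=0, sep="\t"):
    """Anonymize one column of a record and return the rewritten record."""
    fields = line.rstrip("\n").split(sep)
    if id_column < len(fields):
        fields[id_column] = anonymize(fields[id_column])
    return sep.join(fields)


def run(stdin, stdout):
    """Read records from stdin and write transformed records to stdout."""
    for line in stdin:
        if line.strip():  # skip blank lines
            stdout.write(map_line(line) + "\n")


# As a Hadoop Streaming mapper script, finish with: run(sys.stdin, sys.stdout)
```

With Hadoop Streaming, a map-only job is typically requested by setting the number of reducers to zero (for example with the `-numReduceTasks 0` option), so the mapper output is written directly as the job output.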
Study Resources
Hadoop Streaming
Hadoop Streaming wiki
Apache Hive
Hive tutorial
Hive language manual
Hive joins documentation
Apache Pig
Pig's relational operators
Cloudera blog on Python frameworks for Hadoop
Hadoop: The Definitive Guide, 3rd Edition: Chapters 7, 12
Hadoop In Practice: Chapters 8, 10
Machine Learning Basics
Objectives
Understand how to use Mappers and Reducers to create predictive models
Understand the different kinds of machine learning, including supervised and unsupervised learning
Recognize appropriate uses of the following: parametric/non-parametric algorithms, support vector machines, kernels, neural networks, clustering, dimensionality reduction, and recommender systems
Study Resources
Apache Mahout
Apache Mahout wiki
Cloudera's blogs on Apache Mahout
Hadoop: The Definitive Guide, 3rd Edition: Chapter 16
Algorithms of the Intelligent Web: Chapter 7
A Programmer's Guide to Data Mining
Clustering
Objectives
Define clustering and identify appropriate use cases
Identify appropriate uses of various models including centroid, distribution, density, group, and graph
Describe the value and use of similarity metrics including Pearson correlation, Euclidean distance, and block distance
Identify the algorithms applicable to each model (k-means, SVD/PCA, etc.)
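The three similarity metrics named above can be written out in a few lines. This is a plain-Python sketch for equal-length numeric vectors; the function names are hypothetical, and returning 0.0 for a constant vector in the Pearson case is a convention chosen here, not a standard.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def block_distance(a, b):
    """Block (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_correlation(a, b):
    """Pearson correlation coefficient, in the range [-1, 1]."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_a * var_b)
```

The two distances shrink as items become more alike, while Pearson correlation grows toward 1 and ignores differences in scale, which is why it is often preferred when users rate on different ranges.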
Study Resources
Programming Collective Intelligence: Chapter 3
Mahout In Action: Part 2
Classification
Objectives
Describe the steps for training a set of data in order to identify new data based on known data
Identify the use cases for logistic regression and Bayes' theorem
Define classification techniques and formulas
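As a worked illustration of Bayes' theorem in a classification setting, the sketch below computes the probability that a message is spam given that a filter flagged it. The function name and all of the rates are hypothetical numbers chosen for the example.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(class | evidence) via Bayes' theorem for a two-class problem.

    prior               -- P(class)
    likelihood          -- P(evidence | class)
    false_positive_rate -- P(evidence | not class)
    """
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical rates: 1% of messages are spam, the filter flags 90% of spam
# and wrongly flags 5% of legitimate messages.
p_spam_given_flag = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
```

With these assumed rates the posterior is 0.009 / 0.0585, roughly 0.15, which shows how a low prior keeps the posterior small even when the filter is fairly accurate.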
Study Resources
Programming Collective Intelligence: Chapters 6, 7, 8, 9, 12
Algorithms of the Intelligent Web: Chapters 5, 6
Collaborative Filtering
Objectives
Identify the use of user-based and item-based collaborative filtering techniques
Describe the limitations and strengths of collaborative filtering techniques
Given a scenario, determine the appropriate collaborative filtering implementation
Given a scenario, determine the metrics one should use to evaluate the accuracy of a recommender system
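Two metrics commonly used to evaluate predicted ratings are mean absolute error (MAE) and root mean squared error (RMSE). The sketch below assumes two equal-length lists of actual and predicted ratings for held-out items; the function names are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))
```

Both are zero for perfect predictions. RMSE is the stricter choice when occasional large errors matter more than many small ones.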
Study Resources
Recommendation engines with Apache Mahout
Model/Feature Selection
Objectives
Describe the role and function of feature selection
Analyze a scenario and determine the appropriate features and attributes to select
Analyze a scenario and determine the methods to deploy for optimal feature selection
Study Resources
Pattern Recognition and Machine Learning: Chapter 1.3
Probability
Objectives
Analyze a scenario and determine the likelihood of a particular outcome
Determine sample percentiles
Determine a range of items based on a sample probability density function
Summarize a distribution of sample numbers
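Percentiles and summary statistics for a sample can be computed with the standard library. The sketch below uses linear interpolation between the two nearest ranks, which is one of several accepted percentile definitions; the function names are hypothetical.

```python
import math
import statistics


def percentile(values, p):
    """p-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sample is undefined")
    rank = (p / 100.0) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def summarize(values):
    """Summarize a sample with its center, spread, and quartiles."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "p25": percentile(values, 25),
        "p75": percentile(values, 75),
    }
```

Other percentile definitions (for example nearest-rank) can give slightly different answers on small samples, so it is worth checking which one a question assumes.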
Study Resources
Pattern Recognition and Machine Learning: Chapter 2
BetterExplained.com on Probability, Statistics, Bayes Theorem
Visualization
Objectives
Determine the most effective visualization for a given problem
Analyze a data visualization and interpret its meaning
Study Resources
Data Visualization: modern approaches
Data Visualization basics
Sample Visualizations
DataVisualization.ch
Data Visualization for human perception
Optimization
Objectives
Understand optimization methods
Identify 1st order and 2nd order optimization techniques
Determine the learning rate for a particular algorithm
Determine the sources of errors in a model
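The difference between first-order and second-order methods, and the role of the learning rate, can be seen on a toy problem. The sketch below minimizes the hypothetical function f(x) = (x - 3)^2 with gradient descent (first order, uses only the gradient) and with a single Newton step (second order, also uses the second derivative).

```python
def gradient(x):
    """First derivative of f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


def second_derivative(x):
    """Second derivative of f(x) = (x - 3)^2, constant for a quadratic."""
    return 2.0


def gradient_descent(x0, learning_rate, steps=50):
    """First-order method: repeatedly step against the gradient."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


def newton_step(x):
    """Second-order method: scale the step by the curvature."""
    return x - gradient(x) / second_derivative(x)


converged = gradient_descent(0.0, learning_rate=0.1)  # approaches 3
diverged = gradient_descent(0.0, learning_rate=1.1)   # overshoots further each step
exact = newton_step(0.0)                              # reaches the minimum of a quadratic in one step
```

For this function, each gradient descent step multiplies the distance to the minimum by (1 - 2 * learning_rate), so any rate above 1.0 makes the iterates diverge.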
Study Resources
Leon Bottou on stochastic learning from Advanced Lectures on Machine Learning
Leon Bottou on online algorithms and stochastic approximations
Data-Intensive Text Processing with MapReduce: Chapter 6
