
Embulk

An open-source plugin-based parallel bulk data loader


that makes painful data integration work relaxed.
Sharing our knowledge on RubyGems to manage arbitrary files.

Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, inc.

A little about me...

> Sadayuki Furuhashi
> Founder & Software Architect at Treasure Data, Inc.
> github/twitter: @frsyuki

Open-source hacker

> MessagePack - Efficient object serializer
> Fluentd - A unified data collection tool
> Prestogres - PostgreSQL protocol gateway for Presto
> Embulk - A plugin-based parallel bulk data loader
> ServerEngine - A Ruby framework to build multiprocess servers
> LS4 - A distributed object storage with cross-region replication
> kumofs - A distributed, strongly consistent key-value data store

Today's talk

> What's Embulk?
> How Embulk works
> The architecture
> Writing Embulk plugins
> Roadmap & Development
> Q&A + Discussion

What's Embulk?

> An open-source parallel bulk data loader
> using plugins
> to make data integration relaxed.

What's Embulk?

> An open-source parallel bulk data loader
  > loads records from A to B
  > for various kinds of A and B: storage, RDBMS, NoSQL, cloud services, etc.
> using plugins
> to make data integration relaxed
  > which has been very painful: broken records, transactions (idempotency), performance, ...
The pains of bulk data loading

Example: load a 10GB CSV file to PostgreSQL

> 1. First attempt fails
> 2. Write a script to clean the records
  > Convert "20150127T190500Z" → "2015-01-27 19:05:00 UTC"
  > Convert "\N" → ...
  > ... many cleanings ...
> 3. Second attempt fails with another error
  > Convert "Inf" → "Infinity"
> 4. Fix the script, retry, retry, retry...
> 5. Oh, some data got loaded twice!?

The pains of bulk data loading

Example: load a 10GB CSV file to PostgreSQL

> 6. OK, the script worked.
> 7. Register it to cron to sync data every day.
> 8. One day it fails with yet another error
  > Convert invalid UTF-8 byte sequences to U+FFFD
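Each of those conversions typically ends up in a hand-rolled cleaning script. A minimal sketch in plain Ruby of what such a script grows into (the three-column layout is hypothetical; this is not part of Embulk):

require 'time'

# A minimal sketch of a hand-rolled cleaning script (hypothetical 3-column
# layout; not part of Embulk). Each rule was added after one more failed load.
def clean_row(time, value, comment)
  # '20150127T190500Z' -> '2015-01-27 19:05:00 UTC'
  time = Time.strptime(time, '%Y%m%dT%H%M%S%z').utc.strftime('%Y-%m-%d %H:%M:%S UTC')

  # 'Inf' -> 'Infinity', a spelling PostgreSQL's float parser accepts
  value = 'Infinity' if value == 'Inf'

  # replace invalid UTF-8 byte sequences with U+FFFD
  comment = comment.scrub("\uFFFD")

  [time, value, comment]
end

p clean_row('20150127T190500Z', 'Inf', "caf\xC3 latte")

Every new input file tends to add another rule, which is exactly the pain the following slides describe.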

The pains of bulk data loading

Example: load 10GB CSV × 720 files

> Most scripts are slow.
  > People have little time to optimize bulk load scripts.
> One file takes 1 hour → 720 files take 720 hours ≈ 1 month (!?)

A lot of integration effort for each storage:

> XML, JSON, Apache log format (+ some custom formats), ...
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile, ...
> MongoDB, Elasticsearch, Redshift, Salesforce, ...

The problems:

> Data cleaning (normalization)
  > How to normalize broken records?
> Error handling
  > How to remove broken records?
> Idempotent retrying
  > How to retry without duplicated loading?
> Performance optimization
  > How to optimize the code or parallelize it?
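Idempotent retrying is the subtlest of these. One common approach, sketched minimally here assuming PostgreSQL via the pg gem and a hypothetical events table keyed by a batch id (an illustration, not how Embulk implements it):

require 'pg'

# A minimal sketch of idempotent retrying (hypothetical 'events' table,
# PostgreSQL via the 'pg' gem). The whole batch loads inside one transaction
# keyed by a deterministic batch_id, so retrying after a failure never leaves
# half a batch behind or loads the same file twice.
def load_batch(conn, batch_id, rows)
  conn.transaction do |c|
    # remove any rows a previously failed attempt may have left
    c.exec_params('DELETE FROM events WHERE batch_id = $1', [batch_id])
    rows.each do |time, value|
      c.exec_params('INSERT INTO events (batch_id, time, value) VALUES ($1, $2, $3)',
                    [batch_id, time, value])
    end
  end
end

conn = PG.connect(dbname: 'mydb')
load_batch(conn, 'sample_001.csv.gz', [['2015-01-27 19:05:00 UTC', 1]])

Using the source file name as the batch id makes a retry of the same file a safe no-op plus reload.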

The problems at Treasure Data

Treasure Data Service: fast, powerful SQL access to big data from connected
applications and products, with no new infrastructure or special skills required.

> Customers want to try Treasure Data, but
  > SEs write scripts to bulk load their data. Hard work :(
> Customers want to migrate their big data, but
  > Hard work :(
> Fluentd solved streaming data collection, but
  > bulk data loading is another problem.

A solution:

> Package the efforts as a plugin.
> Share & reuse the plugin.
  > don't repeat the pains!
> Keep improving the plugin code (data cleaning, error handling, retrying)
  > rather than throwing away the efforts every time
  > using OSS-style pull-reqs & frequent releases.

Embulk

Embulk is an open-source, plugin-based parallel bulk data loader
that makes data integration work relaxed.

[Diagram: Embulk bulk-loads records between many kinds of storage: CSV files,
Amazon S3, SequenceFile, HDFS, MySQL, Cassandra, Salesforce.com, Hive,
Elasticsearch, Redis, ... Plugins handle each storage; the Embulk core provides
parallel execution, data validation, error recovery, deterministic behavior,
and idempotent retrying.]

How Embulk works

Installing Embulk

# install
$ wget 'https://bintray.com/artifact/download/embulk/maven/embulk-0.2.0.jar' -O embulk.jar
$ chmod 755 embulk.jar

[Diagram: wget downloads embulk.jar from Bintray. Embulk releases are published on Bintray.]

Guess format & schema

# install
$ wget 'https://bintray.com/artifact/download/embulk/maven/embulk-0.2.0.jar' -O embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml -o config.yml

partial-config.yml, written by hand:

in:
  type: file
  paths: [data/examples/]
out:
  type: example

config.yml, filled in by guess plugins:

in:
  type: file
  paths: [data/examples/]
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
    - name: time
      type: timestamp
      format: '%Y-%m-%d %H:%M:%S'
    - name: account
      type: long
    - name: purchase
      type: timestamp
      format: '%Y%m%d'
    - name: comment
      type: string
out:
  type: example

Preview & fix config

# install
$ wget 'https://bintray.com/artifact/download/embulk/maven/embulk-0.2.0.jar' -O embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml -o config.yml

# preview
$ ./embulk preview config.yml
$ vi config.yml  # if necessary

+-------------------------+----------+-------------+
| time:timestamp          | uid:long | word:string |
+-------------------------+----------+-------------+
| 2015-01-27 19:23:49 UTC |   32,864 | embulk      |
| 2015-01-27 19:01:23 UTC |   14,824 | jruby       |
| 2015-01-28 02:20:02 UTC |   27,559 | plugin      |
| 2015-01-29 11:54:36 UTC |   11,270 | fluentd     |
+-------------------------+----------+-------------+

Deterministic run

# install
$ wget 'https://bintray.com/artifact/download/embulk/maven/embulk-0.2.0.jar' -O embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml -o config.yml

# preview
$ ./embulk preview config.yml
$ vi config.yml  # if necessary

# run
$ ./embulk run config.yml -o config.yml

config.yml after the run, with a new last_paths entry written back by -o:

in:
  type: file
  paths: [data/examples/]
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
    - name: time
      type: timestamp
      format: '%Y-%m-%d %H:%M:%S'
    - name: account
      type: long
    - name: purchase
      type: timestamp
      format: '%Y%m%d'
    - name: comment
      type: string
  last_paths: [data/examples/sample_001.csv.gz]
out:
  type: example

Repeat

# install
$ wget 'https://bintray.com/artifact/download/embulk/maven/embulk-0.2.0.jar' -O embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml -o config.yml

# preview
$ ./embulk preview config.yml
$ vi config.yml  # if necessary

# run
$ ./embulk run config.yml -o config.yml

# repeat
$ ./embulk run config.yml -o config.yml
$ ./embulk run config.yml -o config.yml

config.yml after the repeated runs; last_paths has advanced:

in:
  type: file
  paths: [data/examples/]
  decoders:
  - {type: gzip}
  parser:
    # ... same as above ...
  last_paths: [data/examples/sample_002.csv.gz]
out:
  type: example
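Because runs are deterministic, repeating the same command is safe: last_paths records the files already loaded, so re-running the same config loads nothing twice, and later runs pick up only newly arrived files (sample_002 here, after sample_001).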

The architecture

[Diagram: the basic pipeline. An InputPlugin reads records (MySQL, Cassandra,
HBase, Elasticsearch, Treasure Data, ...), records flow through Embulk, and an
OutputPlugin writes them; an executor plugin drives both.]

[Diagram: the file-based pipeline. On the input side, a FileInputPlugin reads
files (HDFS, S3, Riak CS, ...), a DecoderPlugin decompresses them (gzip, bzip2,
3des, ...), and a ParserPlugin parses the files into records (CSV, JSON,
RCFile, ...). The output side mirrors this: a FormatterPlugin formats records
into files, an EncoderPlugin compresses them, and a FileOutputPlugin writes the
files. Buffers connect the stages, and an executor plugin drives the whole
pipeline inside Embulk.]
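To make the composition concrete, here is a conceptual sketch in plain Ruby; the class and method names are hypothetical, not Embulk's actual API:

# Conceptual sketch of the file-input side (hypothetical names, not the real
# Embulk API): buffers of bytes flow file -> decoder -> parser -> records.
class FileInputPipeline
  def initialize(file_input, decoder, parser)
    @file_input = file_input   # e.g. reads files from HDFS or S3
    @decoder    = decoder      # e.g. gunzip
    @parser     = parser       # e.g. CSV bytes -> records
  end

  def each_record(&block)
    @file_input.each_buffer do |raw|
      @parser.each_record(@decoder.decode(raw), &block)
    end
  end
end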

Writing Embulk plugins

InputPlugin

module Embulk
  class InputExample < InputPlugin
    Plugin.register_input('example', self)

    def self.transaction(config, &control)
      # read config
      task = {
        'message' => config.param('message', :string, default: nil)
      }
      threads = config.param('threads', :int, default: 2)
      columns = [
        Column.new(0, 'col0', :long),
        Column.new(1, 'col1', :double),
        Column.new(2, 'col2', :string),
      ]
      # BEGIN here
      commit_reports = yield(task, columns, threads)
      # COMMIT here
      puts "Example input finished"
      return {}
    end

    def run(task, schema, index, page_builder)
      puts "Example input thread #{index}"
      10.times do |i|
        page_builder.add([i, 10.0, "example"])
      end
      page_builder.finish
      commit_report = {}
      return commit_report
    end
  end
end

OutputPlugin

module Embulk
  class OutputExample < OutputPlugin
    Plugin.register_output('example', self)

    def self.transaction(config, schema, processor_count, &control)
      # read config
      task = {
        'message' => config.param('message', :string, default: "record")
      }
      puts "Example output started."
      commit_reports = yield(task)
      puts "Example output finished. Commit reports = #{commit_reports.to_json}"
      return {}
    end

    def initialize(task, schema, index)
      puts "Example output thread #{index}..."
      super
      @message = task.prop('message', :string)
      @records = 0
    end

    def add(page)
      page.each do |record|
        hash = Hash[schema.names.zip(record)]
        puts "#{@message}: #{hash.to_json}"
        @records += 1
      end
    end

    def finish
    end

    def abort
    end

    def commit
      commit_report = {
        "records" => @records
      }
      return commit_report
    end
  end
end
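The output methods above follow a fixed lifecycle: add pages of records, then finish and commit, or abort on failure. A minimal sketch of a driver honoring that lifecycle (hypothetical code, not Embulk's real executor):

# Hypothetical driver, not Embulk's real executor: shows the lifecycle the
# OutputExample above expects for each thread.
def drive(output, pages)
  pages.each { |page| output.add(page) }   # feed pages of records
  output.finish
  output.commit                            # => commit report, e.g. {"records" => 40}
rescue StandardError
  output.abort                             # roll back on any failure
  raise
end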

GuessPlugin

# guess_gzip.rb
module Embulk
  class GzipGuess < GuessPlugin
    Plugin.register_guess('gzip', self)

    GZIP_HEADER = "\x1f\x8b".force_encoding('ASCII-8BIT').freeze

    def guess(config, sample_buffer)
      if sample_buffer[0,2] == GZIP_HEADER
        return {"decoders" => [{"type" => "gzip"}]}
      end
      return {}
    end
  end
end

# guess_newline.rb
module Embulk
  class GuessNewline < TextGuessPlugin
    Plugin.register_guess('newline', self)

    def guess_text(config, sample_text)
      cr_count = sample_text.count("\r")
      lf_count = sample_text.count("\n")
      crlf_count = sample_text.scan(/\r\n/).length
      if crlf_count > cr_count / 2 && crlf_count > lf_count / 2
        return {"parser" => {"newline" => "CRLF"}}
      elsif cr_count > lf_count / 2
        return {"parser" => {"newline" => "CR"}}
      else
        return {"parser" => {"newline" => "LF"}}
      end
    end
  end
end
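The gzip guess can be sanity-checked in plain Ruby without the Embulk runtime; the sample path below is the one used in the configs earlier:

# Standalone check of the gzip header comparison used in GzipGuess above;
# plain Ruby, no Embulk runtime required.
GZIP_HEADER = "\x1f\x8b".force_encoding('ASCII-8BIT')
sample = File.binread('data/examples/sample_001.csv.gz', 2)  # first 2 bytes
puts sample == GZIP_HEADER   # => true for a gzip file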

Releasing to RubyGems

Examples:

> embulk-plugin-postgres-json.gem
  https://github.com/frsyuki/embulk-plugin-postgres-json
> embulk-plugin-redis.gem
  https://github.com/komamitsu/embulk-plugin-redis
> embulk-plugin-input-sfdc-event-log-files.gem
  https://github.com/nahi/embulk-plugin-input-sfdc-eventlog-files
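Plugins are packaged and published like any other gem. A minimal gemspec sketch (the gem name and metadata are placeholders):

# embulk-plugin-example.gemspec -- a minimal sketch; name and metadata are
# placeholders, following ordinary RubyGems conventions.
Gem::Specification.new do |spec|
  spec.name    = "embulk-plugin-example"
  spec.version = "0.1.0"
  spec.summary = "An example Embulk plugin"
  spec.authors = ["Your Name"]
  spec.license = "Apache-2.0"
  spec.files   = Dir["lib/**/*.rb"]   # plugin code loaded by Embulk
end

Build with gem build and publish with gem push, as with any gem.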

Roadmap & Development

Roadmap

> Add missing JRuby Plugin APIs
  > ParserPlugin, FormatterPlugin
  > DecoderPlugin, EncoderPlugin
> Add Executor plugin SPI
> Add ssh distributed executor
  > embulk run command → ssh %host embulk run %task
> Add MapReduce executor
> Add support for nested records (?)

Contributing to the Embulk project

> Pull-requests & issues on GitHub
> Posting blogs
  > "I tried Embulk. Here is how it worked."
  > "I read Embulk code. Here is how it's written."
  > "Embulk is good because... but bad because..."
> Talking on Twitter with the word "embulk"
> Writing & releasing plugins
> Windows support
> Integration with other software
  > ETL tools, Fluentd, Hadoop, Presto, ...

Q&A + Discussion

Embulk committers:

> Hiroshi Nakamura (@nahi)
> Muga Nishizawa (@muga_nishizawa)
> Sadayuki Furuhashi (@frsyuki)

Treasure Data: cloud service for the entire data pipeline.
We're hiring! https://jobs.lever.co/treasure-data
