
knowerce|consulting

Datacamp ETL
Documentation
November 2009

info@knowerce.sk
www.knowerce.sk

Document information

Creator: Knowerce, s.r.o.
         Vavilovova 16
         851 01 Bratislava
         info@knowerce.sk
         www.knowerce.sk

Author: Štefan Urbánek, stefan@knowerce.sk

Date of creation: 12.11.2009

Document revision: 1

Document Restrictions
Copyright (C) 2009 Knowerce, s.r.o., Stefan Urbanek
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU
Free Documentation License, Version 1.3 or any later version published by the Free Software
Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
license is included in the section entitled "GNU Free Documentation License".


Contents

Introduction
Overview
    System Context
    Objects and classes
Installation
    Software Requirements
    Preparation
    Database initialisation
    Configuration
Running ETL Jobs
    Launching
        Manual Launching
        Scheduled using cron
        Running Programmatically
    What jobs will be run
    Job Status
Job Management
    Scheduling
    Forced run
Creating a Job Bundle
    Example: Public Procurement Extraction ETL job
    Job Utility Methods
    Errors and Failing a Job
Defaults
    ETL System Defaults
    Using defaults in jobs
Appendix: ETL Tables
    etl_jobs
    etl_job_status
    etl_defaults
    etl_batch
Cron Example


Introduction

This document describes the architecture, structures and processes of the Datacamp Extraction, Transformation and Loading (ETL) framework. The purpose of the framework is to perform automated, scheduled data processing, usually in the background. Main features:

■ scheduled or manual launching of ETL jobs
■ job management and configuration through a database
■ logging
■ ETL job plug-in API

ETL tools provided:

■ parallel URL downloader
■ record transformation functions
■ table comparisons
■ table mappings


Project Page and Sources

Project page with sources can be found at:
http://github.com/Stiivi/Datacamp-ETL

Wiki documentation:
http://wiki.github.com/Stiivi/Datacamp-ETL/

Related project Datacamp:
http://github.com/Stiivi/datacamp

Support

General discussion mailing list:
http://groups.google.com/group/datacamp

Development mailing list (recommended for the Datacamp-ETL project):
http://groups.google.com/group/datacamp-dev


Overview

System Context

The Datacamp ETL framework has a plug-in based architecture and runs on top of a database server.

(Diagram: ETL job module bundles are stored in one or more ETL module directories. The framework runs against a DB server hosting the ETL staging database, and uses a directory for extracted and temporary files.)

Objects and classes

The core of the ETL framework consists of the Job Manager and Job objects. There are two categories of classes: job management classes and utility classes that are not necessary for data processing.

(Diagram: job management classes – Job Manager, ETL Defaults, Batch, Job Status, Job and Job Info; utility classes – Download Manager and Download Batch. Job is subclassed as Extraction, Transformation and Loading.)

Class             Description and provided functionality

Batch             Information about data processed by the ETL
Download Batch    List of files and additional information for automated parallel downloading and processing
Download Manager  Performs parallel download of a large number of URLs
ETL Defaults      Stores configuration variables in a key-value dictionary
Job               Abstract class for ETL jobs; provides utilities for running, logging and error handling
Job Info          Information about a job: name, type, scheduling, …
Job Manager       Configures and launches jobs, handles errors
Job Status        Information about a job run: when it was run, what the result was and the reason for a failure


Installation

Software Requirements

■ database server (currently only MySQL works, as there are a couple of MySQL-specific code residues; this will change in the future)
■ ruby
■ rails
■ gems: sequel

Preparation

I. Create a directory where working files, such as dumps and ETL files, will be stored, for example: /var/lib/datacamp
II. Create a database. For use with the Datacamp web application, create two schemas:
■ data schema, for example: datacamp_data
■ staging schema (for ETL), for example: datacamp_staging
III. Create a database user that has full access (SELECT, INSERT, UPDATE, CREATE TABLE, …) to the Datacamp ETL schemas. An example is shown below.
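
The following MySQL statements sketch steps II and III. This is only an illustration: the user name datacamp and the password are placeholders, and you may want a narrower privilege list than ALL PRIVILEGES.

CREATE DATABASE datacamp_data CHARACTER SET utf8;
CREATE DATABASE datacamp_staging CHARACTER SET utf8;

-- 'datacamp' and 'change-me' are placeholders
CREATE USER 'datacamp'@'localhost' IDENTIFIED BY 'change-me';
GRANT ALL PRIVILEGES ON datacamp_data.* TO 'datacamp'@'localhost';
GRANT ALL PRIVILEGES ON datacamp_staging.* TO 'datacamp'@'localhost';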

Check: at this point you should have:

■ the sources
■ a working directory
■ one or two database schemas
■ a database user with the appropriate permissions

Database initialisation

To initialise the ETL database schema, run the appropriate SQL script from the install directory, for example:

mysql -u root -p datacamp_staging < install/etl_tables.mysql.sql
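
To verify the initialisation, you can list the ETL tables the script created (the table names are described in the appendix):

mysql -u root -p datacamp_staging -e "SHOW TABLES LIKE 'etl%'"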



Configuration

Create config.yml in the ETL directory. You can use config.yml.example as a template. The configuration variables are:

Variable           Description

etl_files_path     Path for working files – downloaded, extracted and temporary files
dataset_dump_path  Datacamp application specific: where Datacamp datasets are dumped (dumps are shared by the ETL and the application)
log_file           File where logs are written. If not set, standard error output (stderr) is used
job_search_path    Path where ETL job bundles are stored
connection         Database connection

# ETL Configuration
#

###########################################################
# Paths

# Where temporary ETL files are stored (such as files downloaded from the web)
etl_files_path: /var/lib/datacamp-etl

# Path to dump datasets. This path should be accessible by the application
# to provide the dump API
dataset_dump_path: /var/lib/datacamp-etl/dumps

# Path to the log file
log_file: /var/lib/datacamp-etl/etl.log

# Path to ETL jobs
# All jobs are searched for here. The directory should contain subdirectories
# similar to OS X bundles in the form job_name.job_type. Example: foo.loading
job_search_path: /usr/lib/datacamp-etl/jobs

###########################################################
# Database Connection

connection:
  host: localhost
  username: root
  password:
  charset: utf8

staging_schema: datacamp_staging
dataset_schema: datacamp_data
app_schema: datacamp_app


Running ETL Jobs

Launching

Manual Launching

Jobs are run by simply launching the etl.rb script:

ruby etl.rb

The script looks for config.yml in the current directory. You can pass another configuration file:

ruby etl.rb --config another_config.yml

Scheduled using cron

Usually you will want to run the ETL automatically and periodically. To do so, configure a cron job for the Datacamp ETL by creating a cron script. There is an example in install/etl_cron_job, where you have to change the ETL_PATH, CONFIG and probably RUBY variables. See the appendix where the example file is listed.

Running Programmatically

Alternatively, configure a JobManager manually and run all jobs with:

job_manager = JobManager.new
… # configure job_manager here
job_manager.run_scheduled_jobs

The log is written to the preconfigured file or to standard error output. See the Installation instructions for how to configure the log file.

What jobs will be run

By default, only jobs that are enabled, are scheduled for the current day and have not already been run successfully. If all jobs succeed, then any subsequent launch of the ETL should not run any jobs. All unsuccessful jobs are retried. Disabled jobs are never run. For more information see Job Management.
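
To see which jobs are candidates for a run, you can inspect the etl_jobs table directly. An illustrative query using the columns described in the appendix:

SELECT name, job_type, schedule, last_run_status
FROM etl_jobs
WHERE is_enabled = 1
ORDER BY run_order;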


Job Status

Each job leaves a footprint of its run in the etl_job_status table. The table contains the following information:

Column      Description

job_name    task which was run
job_id      identifier of the job
status      current status of the job: ok, running, failed
phase       if the job has more phases, this column identifies which phase the job is in
message     error message when the job fails
start_date  when the job started
end_date    when the job finished, or NULL if the job is still running

Possible job statuses are:

■ running – the job is still running (or the ETL crashed and did not reset the job status)
■ ok – the job finished correctly
■ failed – the job did not finish correctly; see phase and message for more information
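
To check the results of past runs, you can query the status table directly. An illustrative query over the columns described above:

SELECT job_name, status, phase, message, start_date, end_date
FROM etl_job_status
ORDER BY start_date DESC;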

(The original document shows two screenshots of the etl_job_status table: one with only successful runs – the state you want to achieve – and one with mixed statuses, including failed ones.)


Job Management

Jobs are managed through the etl_jobs table, where you specify:

Column      Description

job_name    name of a job (see below)
job_type    type of a job: extraction, transformation, loading, …
is_enabled  set to 1 when the task is enabled
run_order   number which specifies the order in which jobs are run. Jobs are run from the lowest number to the highest. If the number is the same for several jobs, the behaviour is undefined
schedule    when the job is run
force_run   run despite the scheduling rule

(The original document shows a screenshot with example rows of the etl_jobs table.)

To add a new job, insert a row into the table and set the job information; to remove a job, just delete its row. An illustrative statement is shown below.
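
For example, registering the public procurement extraction job from the Creating a Job Bundle chapter could look like this (a sketch; the column names follow the etl_jobs table in the appendix, the values are illustrative):

INSERT INTO etl_jobs (name, job_type, is_enabled, run_order, schedule)
VALUES ('public_procurement', 'extraction', 1, 1, 'daily');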

Scheduling

Jobs can currently be scheduled on a daily basis:

■ daily – run each day
■ monday, tuesday, wednesday, thursday, friday, saturday, sunday – run on the particular weekday

Once the job has been run successfully by the scheduler, the job manager does not run it again unless this is explicitly requested with the force_run flag.

Forced run

Jobs can be run out of schedule by setting the force_run flag (see the example below). This allows data managers to re-run an ETL job remotely, without requiring access to the system where the ETL processes are hosted. The job will be run the next time the scheduler runs. For example: if the ETL is scheduled in cron to run hourly, the job is re-run within the next hour; if it is scheduled to run daily, it will be run the next day.
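
Setting the flag is a single update of the job's row (an illustrative statement; the column is described in the appendix):

UPDATE etl_jobs SET force_run = 1 WHERE name = 'public_procurement';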
The flag is reset to 0 after each run to prevent the job from running again. The reason for this behaviour is to prevent unintentionally running lengthy, time- and CPU-consuming jobs, and to protect already processed data from possible inconsistencies introduced by running jobs at unexpected times.


This behaviour can be modified using the ETL system defaults (see the illustrative statement below):

■ force_run_all – run all enabled jobs, regardless of their scheduling time
■ reset_force_run_flag – controls whether the force_run flag is cleared after a forced run; set this to 0 so that jobs are re-run each time the ETL script is launched, which is useful for development and testing
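
These defaults live in the reserved etl domain of the etl_defaults table (see the Defaults chapter). An illustrative statement – note that how boolean values are encoded in the varchar value column (TRUE/FALSE or 1/0) is an assumption here:

INSERT INTO etl_defaults (domain, default_key, value)
VALUES ('etl', 'force_run_all', 'TRUE');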


Creating a Job Bundle

Jobs are implemented as “bundles”, in other words directories containing all the code and information necessary for the job. The only requirements for a bundle are that it follows a certain naming convention and contains a Ruby script with the job class:

■ the bundle directory should be named job_name.job_type
■ the bundle should contain a Ruby file named job_name_job_type.rb
■ the Ruby file should contain a class with the camelized job name and job type, JobNameJobType, which should be a subclass of the appropriate job class (Extraction, Transformation or Loading)

The class should implement a run method containing the main job code.

Example: Public Procurement Extraction ETL job

I. create a job bundle directory: mkdir public_procurement.extraction
II. create a Ruby file: public_procurement.extraction/public_procurement_extraction.rb
III. implement a class named PublicProcurementExtraction:

class PublicProcurementExtraction < Extraction

  def run
    # … job code goes here …
  end

end

Job Utility Methods

There are several utility methods available to job writers:

■ files_directory – directory where working, extracted, downloaded and temporary files are stored. The directory is job specific – each job has its own directory by default
■ logger – object for writing into the ETL manager log
■ message, phase – set job status information

Each job also has access to a defaults dictionary. See the chapter about Defaults for more information.

Errors and Failing a Job

It is recommended to raise an exception on error. The exception will be handled by the job manager and the job will be closed properly, with the appropriate status and message set:

raise "unable to connect to data source"

This will result in a failed job with the same message as the exception.


Defaults

Defaults is a configurable key-value dictionary used by ETL jobs as well as by the ETL system itself. The key-value pairs are grouped by domains. A domain usually corresponds to a job name; for example, the invoices loading job and the invoices transformation job share the common domain invoices. The domain etl is reserved for the ETL system configuration. The purpose of defaults is to make it possible to configure ETL jobs remotely and in a more convenient way.

Defaults are stored in the etl_defaults table, which contains: domain, default_key and value.
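
For example, a download URL for the invoices jobs could be stored like this (an illustrative statement; the key download_url appears in the job examples below, the URL is a placeholder):

INSERT INTO etl_defaults (domain, default_key, value)
VALUES ('invoices', 'download_url', 'http://example.com/invoices.csv');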

ETL System Defaults

Key                   Description                                                      Default value (if the key does not exist)

force_run_all         on the next ETL run, all enabled jobs are launched,              FALSE
                      regardless of their scheduling (see Running ETL Jobs)
reset_force_run_flag  after running a forced job (see Running ETL Jobs), clear its     TRUE
                      flag so that it will not be run again

Using defaults in jobs

A job has access to the defaults domain based on the job name. To retrieve a value from the defaults:

url = defaults[:download_url]
count = defaults[:count].to_i

To retrieve a value, or set it to a default value if it is not found:

batch_size = defaults.value(:batch_size, 200).to_i

This looks for the batch_size key; if it does not exist, the key is created and assigned the value 200.

To store a default value:

defaults[:count] = count

Values are committed when the job finishes.

Example:


@batch_size = defaults.value(:batch_size, 200).to_i
@download_threads = defaults.value(:download_threads, 10).to_i
@download_fail_threshold = defaults.value(:download_fail_threshold, 10).to_i


Appendix: ETL Tables

etl_jobs

Column           Type      Description

id               int       object identifier
name             varchar   job name
job_type         varchar   job type
is_enabled       int       flag specifying whether the job is run or not
run_order        int       order in which the jobs are run; if more jobs have the same order number, the behaviour is undefined
last_run_date    datetime  date and time when the job was last run
last_run_status  varchar   status of the last run
schedule         varchar   how the job is scheduled
force_run        int       force the job to be run the next time the ETL runs

etl_job_status

Column      Type      Description

id          int       object identifier
job_name    varchar   job name
job_id      int       job identifier
status      varchar   current or last run status
phase       varchar   phase the job is in while running, or was in when it finished
message     varchar   status message provided by the job object, or an exception message
start_date  datetime  when the job was run
end_date    datetime  when the job finished


etl_defaults

Column       Type     Description

id           int      association id
domain       varchar  domain name (usually corresponds to a job name)
default_key  varchar  key
value        varchar  value for the key

etl_batch

Column            Type

id                int
batch_type        varchar
batch_source      varchar
data_source_name  varchar
data_source_url   varchar
valid_due_date    date
batch_date        date
username          varchar
created_at        datetime
updated_at        datetime


Cron Example

#!/bin/bash
#
# ETL cron job script
#
# Ubuntu/Debian: Put this script in /etc/cron.daily
# Other Unix systems: schedule appropriately in /etc/crontab

#####################################################################
# ETL Configuration

# Path to your ETL installation
ETL_PATH=/usr/lib/datacamp-etl

# Configuration file (database connection and other paths)
CONFIG=$ETL_PATH/config.yml

# Ruby interpreter path
RUBY=/usr/bin/ruby

#####################################################################

ETL_TOOL=etl.rb

$RUBY -I $ETL_PATH $ETL_PATH/$ETL_TOOL --config $CONFIG
