
knowerce|consulting

Datacamp ETL
Documentation
November 2009

info@knowerce.sk
www.knowerce.sk

Document information

Creator: Knowerce, s.r.o.
         Vavilovova 16
         851 01 Bratislava
         info@knowerce.sk
         www.knowerce.sk

Author: Štefan Urbánek, stefan@knowerce.sk

Date of creation: 12.11.2009

Document revision: 1

Document Restrictions
Copyright (C) 2009 Knowerce, s.r.o., Stefan Urbanek
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU
Free Documentation License, Version 1.3 or any later version published by the Free Software
Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
license is included in the section entitled "GNU Free Documentation License".


Contents

Introduction
Overview
    System Context
    Objects and classes
Installation
    Software Requirements
    Preparation
    Database initialisation
    Configuration
Running ETL Jobs
    Launching
        Manual Launching
        Scheduled using cron
        Running Programmatically
    What jobs will be run
    Job Status
Job Management
    Scheduling
    Forced run
Creating a Job Bundle
    Example: Public Procurement Extraction ETL job
    Job Utility Methods
    Errors and Failing a Job
Defaults
    ETL System Defaults
    Using defaults in jobs
Appendix: ETL Tables
    etl_jobs
    etl_job_status
    etl_defaults
    etl_batch
Cron Example


Introduction

This document describes the architecture, structures and processes of the Datacamp Extraction, Transformation and Loading (ETL) framework. The purpose of the framework is to perform automated, scheduled data processing, usually in the background. Main features:

■ scheduled or manual launching of ETL jobs
■ job management and configuration through a database
■ logging
■ ETL job plug-in API

ETL tools provided:

■ parallel URL downloader
■ record transformation functions
■ table comparisons
■ table mappings


Project Page and Sources

Project page with sources can be found at:
http://github.com/Stiivi/Datacamp-ETL

Wiki documentation:
http://wiki.github.com/Stiivi/Datacamp-ETL/

Related project Datacamp:
http://github.com/Stiivi/datacamp

Support

General discussion mailing list:
http://groups.google.com/group/datacamp

Development mailing list (recommended for the Datacamp-ETL project):
http://groups.google.com/group/datacamp-dev


Overview

System Context

The Datacamp ETL framework has a plug-in based architecture and runs on top of a database server.

(Diagram: ETL job module bundles are stored in one or more ETL module directories. The framework runs against a DB server hosting the ETL staging database, and uses a directory for extracted and temporary files.)

Objects and classes

The core of the ETL framework consists of the Job Manager and Job objects. There are two categories of classes: job management classes and utility classes that are not necessary for data processing.

(Diagram: job management classes – Job Manager, ETL Defaults, Batch, Job Status, Job and Job Info; utility classes – Download Manager and Download Batch. Job is subclassed as Extraction, Transformation and Loading.)

Class             Description and provided functionality

Batch             Information about data processed by the ETL
Download Batch    List of files and additional information for automated parallel downloading and processing
Download Manager  Performs parallel download of a large number of URLs
ETL Defaults      Stores configuration variables in a key-value dictionary
Job               Abstract class for ETL jobs; provides utilities for running, logging and error handling
Job Info          Information about a job: name, type, scheduling, …
Job Manager       Configures and launches jobs, handles errors
Job Status        Information about a job run: when it was run, what the result was and the reason for a failure


Installation

Software Requirements

■ database server (currently only MySQL works, as there are a couple of MySQL-specific code residues; this will change in the future)
■ ruby
■ rails
■ gems: sequel

Preparation

I. Create a directory where working files, such as dumps and ETL files, will be stored, for example: /var/lib/datacamp
II. Create a database. For use with the Datacamp web application, create two schemas:
■ data schema, for example: datacamp_data
■ staging schema (for ETL), for example: datacamp_staging
III. Create a database user that has full access (SELECT, INSERT, UPDATE, CREATE TABLE, …) to the Datacamp ETL schemas. An example is shown below.
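
The following MySQL statements sketch steps II and III. This is only an illustration: the user name datacamp and the password are placeholders, and you may want a narrower privilege list than ALL PRIVILEGES.

CREATE DATABASE datacamp_data CHARACTER SET utf8;
CREATE DATABASE datacamp_staging CHARACTER SET utf8;

-- 'datacamp' and 'change-me' are placeholders
CREATE USER 'datacamp'@'localhost' IDENTIFIED BY 'change-me';
GRANT ALL PRIVILEGES ON datacamp_data.* TO 'datacamp'@'localhost';
GRANT ALL PRIVILEGES ON datacamp_staging.* TO 'datacamp'@'localhost';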

Check: at this point you should have:

■ the sources
■ a working directory
■ one or two database schemas
■ a database user with the appropriate permissions

Database initialisation

To initialise the ETL database schema, run the appropriate SQL script from the install directory, for example:

mysql -u root -p datacamp_staging < install/etl_tables.mysql.sql
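
To verify the initialisation, you can list the ETL tables the script created (the table names are described in the appendix):

mysql -u root -p datacamp_staging -e "SHOW TABLES LIKE 'etl%'"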



Configuration

Create config.yml in the ETL directory. You can use config.yml.example as a template. The configuration variables are:

Variable           Description

etl_files_path     Path for working files – downloaded, extracted and temporary files
dataset_dump_path  Datacamp application specific: where Datacamp datasets are dumped (dumps are shared by the ETL and the application)
log_file           File where logs are written. If not set, standard error output (stderr) is used
job_search_path    Path where ETL job bundles are stored
connection         Database connection

# ETL Configuration
#

###########################################################
# Paths

# Where temporary ETL files are stored (such as files downloaded from the web)
etl_files_path: /var/lib/datacamp-etl

# Path to dump datasets. This path should be accessible by the application
# to provide the dump API
dataset_dump_path: /var/lib/datacamp-etl/dumps

# Path to the log file
log_file: /var/lib/datacamp-etl/etl.log

# Path to ETL jobs
# All jobs are searched for here. The directory should contain subdirectories
# similar to OS X bundles in the form job_name.job_type. Example: foo.loading
job_search_path: /usr/lib/datacamp-etl/jobs

###########################################################
# Database Connection

connection:
  host: localhost
  username: root
  password:
  charset: utf8

staging_schema: datacamp_staging
dataset_schema: datacamp_data
app_schema: datacamp_app


Running ETL Jobs

Launching

Manual Launching

Jobs are run by simply launching the etl.rb script:

ruby etl.rb

The script looks for config.yml in the current directory. You can pass another configuration file:

ruby etl.rb --config another_config.yml

Scheduled using cron

Usually you will want to run the ETL automatically and periodically. To do so, configure a cron job for the Datacamp ETL by creating a cron script. There is an example in install/etl_cron_job, where you have to change the ETL_PATH, CONFIG and probably RUBY variables. See the appendix where the example file is listed.

Running Programmatically

Alternatively, configure a JobManager manually and run all jobs with:

job_manager = JobManager.new
… # configure job_manager here
job_manager.run_scheduled_jobs

The log is written to the preconfigured file or to standard error output. See the Installation instructions for how to configure the log file.

What jobs will be run

By default, only jobs that are enabled, are scheduled for the current day and have not already been run successfully. If all jobs succeed, then any subsequent launch of the ETL should not run any jobs. All unsuccessful jobs are retried. Disabled jobs are never run. For more information see Job Management.
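
To see which jobs are candidates for a run, you can inspect the etl_jobs table directly. An illustrative query using the columns described in the appendix:

SELECT name, job_type, schedule, last_run_status
FROM etl_jobs
WHERE is_enabled = 1
ORDER BY run_order;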


Job Status

Each job leaves a footprint of its run in the etl_job_status table. The table contains the following information:

Column      Description

job_name    task which was run
job_id      identifier of the job
status      current status of the job: ok, running, failed
phase       if the job has more phases, this column identifies which phase the job is in
message     error message when the job fails
start_date  when the job started
end_date    when the job finished, or NULL if the job is still running

Possible job statuses are:

■ running – the job is still running (or the ETL crashed and did not reset the job status)
■ ok – the job finished correctly
■ failed – the job did not finish correctly; see phase and message for more information
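
To check the results of past runs, you can query the status table directly. An illustrative query over the columns described above:

SELECT job_name, status, phase, message, start_date, end_date
FROM etl_job_status
ORDER BY start_date DESC;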

(The original document shows two screenshots of the etl_job_status table: one with only successful runs – the state you want to achieve – and one with mixed statuses, including failed ones.)


Job Management

Jobs are managed through the etl_jobs table, where you specify:

Column      Description

job_name    name of a job (see below)
job_type    type of a job: extraction, transformation, loading, …
is_enabled  set to 1 when the task is enabled
run_order   number which specifies the order in which jobs are run. Jobs are run from the lowest number to the highest. If the number is the same for several jobs, the behaviour is undefined
schedule    when the job is run
force_run   run despite the scheduling rule

(The original document shows a screenshot with example rows of the etl_jobs table.)

To add a new job, insert a row into the table and set the job information; to remove a job, just delete its row. An illustrative statement is shown below.
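
For example, registering the public procurement extraction job from the Creating a Job Bundle chapter could look like this (a sketch; the column names follow the etl_jobs table in the appendix, the values are illustrative):

INSERT INTO etl_jobs (name, job_type, is_enabled, run_order, schedule)
VALUES ('public_procurement', 'extraction', 1, 1, 'daily');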

Scheduling

Jobs can currently be scheduled on a daily basis:

■ daily – run each day
■ monday, tuesday, wednesday, thursday, friday, saturday, sunday – run on the particular weekday

Once the job has been run successfully by the scheduler, the job manager does not run it again unless this is explicitly requested with the force_run flag.

Forced run

Jobs can be run out of schedule by setting the force_run flag (see the example below). This allows data managers to re-run an ETL job remotely, without requiring access to the system where the ETL processes are hosted. The job will be run the next time the scheduler runs. For example: if the ETL is scheduled in cron to run hourly, the job is re-run within the next hour; if it is scheduled to run daily, it will be run the next day.
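
Setting the flag is a single update of the job's row (an illustrative statement; the column is described in the appendix):

UPDATE etl_jobs SET force_run = 1 WHERE name = 'public_procurement';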
The flag is reset to 0 after each run to prevent the job from running again. The reason for this behaviour is to prevent unintentionally running lengthy, time- and CPU-consuming jobs, and to protect already processed data from possible inconsistencies introduced by running jobs at unexpected times.


This behaviour can be modified using the ETL system defaults (see the illustrative statement below):

■ force_run_all – run all enabled jobs, regardless of their scheduling time
■ reset_force_run_flag – controls whether the force_run flag is cleared after a forced run; set this to 0 so that jobs are re-run each time the ETL script is launched, which is useful for development and testing
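
These defaults live in the reserved etl domain of the etl_defaults table (see the Defaults chapter). An illustrative statement – note that how boolean values are encoded in the varchar value column (TRUE/FALSE or 1/0) is an assumption here:

INSERT INTO etl_defaults (domain, default_key, value)
VALUES ('etl', 'force_run_all', 'TRUE');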


Creating a Job Bundle

Jobs are implemented as “bundles”, in other words directories containing all the code and information necessary for the job. The only requirements for a bundle are that it follows a certain naming convention and contains a Ruby script with the job class:

■ the bundle directory should be named job_name.job_type
■ the bundle should contain a Ruby file named job_name_job_type.rb
■ the Ruby file should contain a class with the camelized job name and job type, JobNameJobType, which should be a subclass of the appropriate job class (Extraction, Transformation or Loading)

The class should implement a run method containing the main job code.

Example: Public Procurement Extraction ETL job

I. create a job bundle directory: mkdir public_procurement.extraction
II. create a Ruby file: public_procurement.extraction/public_procurement_extraction.rb
III. implement a class named PublicProcurementExtraction:

class PublicProcurementExtraction < Extraction

  def run
    # … job code goes here …
  end

end

Job Utility Methods

There are several utility methods available to job writers:

■ files_directory – directory where working, extracted, downloaded and temporary files are stored. The directory is job specific – each job has its own directory by default
■ logger – object for writing into the ETL manager log
■ message, phase – set job status information

Each job also has access to a defaults dictionary. See the chapter about Defaults for more information.

Errors and Failing a Job

It is recommended to raise an exception on error. The exception will be handled by the job manager and the job will be closed properly, with the appropriate status and message set:

raise "unable to connect to data source"

This will result in a failed job with the same message as the exception.


Defaults

Defaults is a configurable key-value dictionary used by ETL jobs as well as by the ETL system itself. The key-value pairs are grouped by domains. A domain usually corresponds to a job name; for example, the invoices loading job and the invoices transformation job share the common domain invoices. The domain etl is reserved for the ETL system configuration. The purpose of defaults is to make it possible to configure ETL jobs remotely and in a more convenient way.

Defaults are stored in the etl_defaults table, which contains: domain, default_key and value.
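
For example, a download URL for the invoices jobs could be stored like this (an illustrative statement; the key download_url appears in the job examples below, the URL is a placeholder):

INSERT INTO etl_defaults (domain, default_key, value)
VALUES ('invoices', 'download_url', 'http://example.com/invoices.csv');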

ETL System Defaults

Key                   Description                                                      Default value (if the key does not exist)

force_run_all         on the next ETL run, all enabled jobs are launched,              FALSE
                      regardless of their scheduling (see Running ETL Jobs)
reset_force_run_flag  after running a forced job (see Running ETL Jobs), clear its     TRUE
                      flag so that it will not be run again

Using defaults in jobs

A job has access to the defaults domain based on the job name. To retrieve a value from the defaults:

url = defaults[:download_url]
count = defaults[:count].to_i

To retrieve a value, or set it to a default value if it is not found:

batch_size = defaults.value(:batch_size, 200).to_i

This looks for the batch_size key; if it does not exist, the key is created and assigned the value 200.

To store a default value:

defaults[:count] = count

Values are committed when the job finishes.

Example:


@batch_size = defaults.value(:batch_size, 200).to_i
@download_threads = defaults.value(:download_threads, 10).to_i
@download_fail_threshold = defaults.value(:download_fail_threshold, 10).to_i


Appendix: ETL Tables

etl_jobs

Column           Type      Description

id               int       object identifier
name             varchar   job name
job_type         varchar   job type
is_enabled       int       flag specifying whether the job is run or not
run_order        int       order in which the jobs are run; if more jobs have the same order number, the behaviour is undefined
last_run_date    datetime  date and time when the job was last run
last_run_status  varchar   status of the last run
schedule         varchar   how the job is scheduled
force_run        int       force the job to be run the next time the ETL runs

etl_job_status

Column      Type      Description

id          int       object identifier
job_name    varchar   job name
job_id      int       job identifier
status      varchar   current or last run status
phase       varchar   phase the job is in while running, or was in when it finished
message     varchar   status message provided by the job object, or an exception message
start_date  datetime  when the job was run
end_date    datetime  when the job finished


etl_defaults

Column       Type     Description

id           int      association id
domain       varchar  domain name (usually corresponds to a job name)
default_key  varchar  key
value        varchar  value for the key

etl_batch

Column            Type

id                int
batch_type        varchar
batch_source      varchar
data_source_name  varchar
data_source_url   varchar
valid_due_date    date
batch_date        date
username          varchar
created_at        datetime
updated_at        datetime


Cron Example

#!/bin/bash
#
# ETL cron job script
#
# Ubuntu/Debian: Put this script in /etc/cron.daily
# Other Unix systems: schedule appropriately in /etc/crontab

#####################################################################
# ETL Configuration

# Path to your ETL installation
ETL_PATH=/usr/lib/datacamp-etl

# Configuration file (database connection and other paths)
CONFIG=$ETL_PATH/config.yml

# Ruby interpreter path
RUBY=/usr/bin/ruby

#####################################################################

ETL_TOOL=etl.rb

$RUBY -I $ETL_PATH $ETL_PATH/$ETL_TOOL --config $CONFIG
