Table of Contents
Objective
Problem Definition
Solution
What is ETL, Extract Transform and Load?
Conclusion
Reference
Objective:
To study ETL (Extract, Transform, Load) tools, especially SQL Server Integration Services (SSIS).
Problem Definition:
Solution:
What is ETL, Extract Transform and Load?
ETL is an abbreviation of the three words Extract, Transform and Load. An ETL process extracts data, mostly from different types of systems, transforms it into a structure that is more appropriate for reporting and analysis, and finally loads it into the database. The figure below displays these ETL steps.
[Figure: ETL architecture and steps]
Today, however, ETL is much more than that. It also covers data profiling, data quality control, monitoring and cleansing, real-time and on-demand data integration in a service-oriented architecture (SOA), and metadata management.
ETL - Extract from source
In this step we extract data from different internal and external sources, structured and/or unstructured. Plain queries are sent to the source systems, using native connections, message queuing, or ODBC or OLE DB middleware. The data is put in a so-called Staging Area (SA), usually with the same structure as the source. In some cases we want only the data that is new or has been changed; the queries then return only the changes. Some ETL tools can do this automatically, providing a change data capture (CDC) mechanism.
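As a concrete illustration, the sketch below pulls only changed rows from a source table into a staging table over ODBC. It is a minimal sketch, not SSIS itself; the connection strings, table names and the LastModified column are assumptions made for the example.

    # Minimal incremental-extract sketch (Python + pyodbc).
    # Connection strings, tables and columns are hypothetical.
    import pyodbc

    SRC = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=srv1;DATABASE=Sales;Trusted_Connection=yes"
    STG = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dw;DATABASE=Staging;Trusted_Connection=yes"

    def extract_changes(last_run):
        """Copy only rows changed since the previous run into the staging area."""
        src = pyodbc.connect(SRC)
        stg = pyodbc.connect(STG)
        rows = src.cursor().execute(
            "SELECT CustomerID, Name, Region, LastModified "
            "FROM dbo.Customer WHERE LastModified > ?", last_run).fetchall()
        if rows:
            # The staging table mirrors the source structure, as described above.
            stg.cursor().executemany(
                "INSERT INTO stg.Customer (CustomerID, Name, Region, LastModified) "
                "VALUES (?, ?, ?, ?)", [tuple(r) for r in rows])
            stg.commit()
        src.close()
        stg.close()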
ETL - Transform the data
Once the data is available in the Staging Area, it is all on one platform and in one database. So we can easily join and union tables, filter and sort the data using specific attributes, pivot to another structure and make business calculations. In this step of the ETL process, we can check data quality and cleanse the data if necessary. After having all the data prepared, we can choose to implement slowly changing dimensions. In that case our analysis and reports keep track of when attributes change over time, for example when a customer moves from one region to another.
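For that customer-moves-region example, a Type 2 slowly changing dimension keeps history by closing the current row and inserting a new one. Below is a minimal sketch of that pattern; the dim.Customer table and its ValidFrom/ValidTo/IsCurrent columns are assumed names, not taken from any specific product.

    def apply_scd2(cur, customer_id, new_region, today):
        """Type 2 slowly changing dimension: expire the current row, insert a new one."""
        # Close the currently active row only if the region actually changed.
        cur.execute(
            "UPDATE dim.Customer SET ValidTo = ?, IsCurrent = 0 "
            "WHERE CustomerID = ? AND IsCurrent = 1 AND Region <> ?",
            today, customer_id, new_region)
        if cur.rowcount:  # a row was expired, so record the new attribute value
            cur.execute(
                "INSERT INTO dim.Customer (CustomerID, Region, ValidFrom, ValidTo, IsCurrent) "
                "VALUES (?, ?, ?, NULL, 1)", customer_id, new_region, today)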
ETL - Load into the data warehouse
Finally, the data is loaded into a data warehouse, usually into fact and dimension tables. From there the data can be combined, aggregated and loaded into data marts or cubes as needed.
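The load step itself is often a set-based aggregation from staging into the fact tables. A minimal sketch, again with hypothetical table names:

    def load_fact_sales(cur):
        """Aggregate staged order lines into a daily sales fact table."""
        cur.execute(
            "INSERT INTO fact.DailySales (DateKey, ProductKey, Quantity, Amount) "
            "SELECT DateKey, ProductKey, SUM(Quantity), SUM(Amount) "
            "FROM stg.OrderLine GROUP BY DateKey, ProductKey")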
ETL Tools:
ETL tools are widely used for extracting, cleaning, transforming and loading data from different systems, often into a data warehouse. The following is a list of tools available for ETL activities.
List of ETL Tools

No.  Vendor                          Version
1.   Oracle                          11gR1
2.   SAP Business Objects            XI 3.0
3.   IBM                             8.1
4.   SAS Institute                   4.2
5.   Informatica                     8.5.1
6.   Elixir                          7.2.2
7.   Information Builders            7.6
8.   Microsoft                       10
9.   Talend                          3.1
10.  Pitney Bowes Business Insight   6.5
11.  Pervasive                       8.12
12.  Open Text                       7.1
13.  ETL Solutions Ltd.              5.2.2
14.  IBM (Cognos)                    8.2
15.  Javlin                          2.5.2
16.  IKAN                            4.2
17.  IBM                             9.1
18.  Pentaho                         3.0
19.  Adeptia                         4.9

Source: http://www.etltool.com/etltoolsranking.htm
The SQL Server Import and Export Wizard offers the simplest method to create an Integration Services package that copies data from a source to a destination.
Integration Services Architecture:
Among the components of the Integration Services architecture, the following are important to using Integration Services successfully:
Integration Services includes command prompt utilities for running and managing Integration
Services packages.
a. dtexec is used to run an existing package at the command prompt.
b. dtutil is used to manage existing packages at the command prompt.
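For illustration, typical invocations look like this (the package path and name are hypothetical):

    REM Run a package stored in the file system.
    dtexec /F "C:\Packages\LoadSales.dtsx"

    REM Copy that package into the SQL Server (msdb) package store.
    dtutil /FILE "C:\Packages\LoadSales.dtsx" /COPY SQL;LoadSales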
A typical use of Integration Services is merging data from heterogeneous data stores. Many organizations archive information in legacy data storage systems. This data may not be important to daily operations, but it may be valuable for trend analysis that requires data collected over a long period of time.
Branches of an organization may use different data storage technologies to store the operational
data. The package may need to extract data from spreadsheets as well as relational databases before it
can merge the data.
Data may be stored in databases that use different schemas for the same data. The package may
need to change the data type of a column or combine data from multiple columns into one column
before it can merge the data.
Integration Services can connect to a wide variety of data sources, including multiple sources in a single
package. A package can connect to relational databases by using .NET and OLE DB providers, and to many
legacy databases by using ODBC drivers. It can also connect to flat files, Excel files, and Analysis Services
projects.
Integration Services includes source components that perform the work of extracting data from flat files,
Excel spreadsheets, XML documents, and tables and views in relational databases from the data source to
which the package connects.
Next, the data is typically transformed by using the transformations that Integration Services includes. After
the data is transformed to compatible formats, it can be merged physically into one dataset.
After the data is merged successfully and transformations are applied, the data is usually loaded into one or more destinations. Integration Services includes destinations for loading data into flat files, raw files, and relational databases. The data can also be loaded into an in-memory recordset and accessed by other package elements.
Another typical use is cleaning and standardizing data. Data may be contributed from multiple branches of an organization, each using different conventions and standards. Before the data can be used, it may need to be formatted differently. For example, you may need to combine the first name and the last name into one column.
Data is rented or purchased. Before it can be used, the data may need to be standardized and
cleaned to meet business standards. For example, an organization wants to verify that all the records
use the same set of state abbreviations or the same set of product names.
Data is locale-specific. For example, the data may use varied date/time and numeric formats. If
data from different locales is merged, it must be converted to one locale before it is loaded to avoid
corruption of data.
Integration Services includes built-in transformations that you can add to packages to clean and standardize
data, change the case of data, convert data to a different type or format, or create new column values based
on expressions. For example, the package could concatenate first and last name columns into a single full
name column, and then change the characters to uppercase.
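Outside a package, that concatenate-and-uppercase example reduces to a one-line transformation; in SSIS it would typically be expressed as a Derived Column transformation. A sketch with hypothetical column values:

    def full_name_upper(first, last):
        """Concatenate first and last name, then convert to uppercase."""
        return f"{first} {last}".upper()

    assert full_name_upper("Ada", "Lovelace") == "ADA LOVELACE"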
An Integration Services package can also clean data by replacing the values in columns with values from a
reference table, using either an exact lookup or fuzzy lookup to locate values in a reference table.
Frequently, a package applies the exact lookup first, and if the lookup fails, it applies the fuzzy lookup. For
example, the package first attempts to look up a product name in the reference table by using the primary
key value of the product. When this search fails to return the product name, the package attempts the search
again, this time using fuzzy matching on the product name.
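The exact-then-fuzzy pattern can be sketched as below, using Python's difflib as a stand-in for the package's fuzzy lookup; the reference data and similarity cutoff are assumptions made for the example.

    import difflib

    # Hypothetical reference table: product name -> product key.
    REFERENCE = {"Mountain Bike": 101, "Road Bike": 102, "Touring Bike": 103}

    def lookup_product(name):
        """Try an exact match first; fall back to fuzzy matching if it fails."""
        if name in REFERENCE:  # exact lookup on the product name
            return REFERENCE[name]
        close = difflib.get_close_matches(name, list(REFERENCE), n=1, cutoff=0.6)
        return REFERENCE[close[0]] if close else None  # fuzzy lookup

    print(lookup_product("Moutain Bike"))  # -> 101 despite the typo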
Another transformation cleans data by grouping values in a dataset that are similar. This is useful for
identifying records that may be duplicates and therefore should not be inserted into your database without
further evaluation. For example, by comparing addresses in customer records you may identify a number of
duplicate customers.
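That grouping idea can be sketched by clustering values whose similarity exceeds a cutoff, with difflib again standing in for the built-in fuzzy grouping; the cutoff value is an assumption.

    import difflib

    def group_similar(values, cutoff=0.85):
        """Cluster strings by similarity; clusters with more than one
        member are likely duplicates that need further evaluation."""
        groups = []
        for v in values:
            for g in groups:
                if difflib.SequenceMatcher(None, v.lower(), g[0].lower()).ratio() >= cutoff:
                    g.append(v)
                    break
            else:
                groups.append([v])
        return groups

    print(group_similar(["12 Main St", "12 Main St.", "7 Oak Ave"]))
    # -> [['12 Main St', '12 Main St.'], ['7 Oak Ave']]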
It is also possible to send a dataset to multiple destinations and then apply different sets of transformations to the same data. For example, one set of transformations can summarize the data, while another set expands the data by looking up values in reference tables and adding data from other sources.
Conclusion:
The software industry has many ETL tools, but Comsoft, being a traditional Microsoft shop, should prefer SSIS (SQL Server Integration Services) as its ETL tool for general operations.
Since we use SQL Server as our backend database, SSIS is already available in our development environment; there is no need to buy an extra tool.
This document provides a technology direction statement and an introduction to ETL implementation at Comsoft.
This document is just for knowledge sharing. As discussed with Sujjet in the last voice call, there is no need to change our current implementation for pushing and pulling data via .NET, because our problem domain is very limited. For future direction and bigger problems, however, we might consider SQL Server Integration Services as our ETL tool.
Reference:
ETL:
http://www.etltools.net
http://www.etltool.com
http://etl-tools.info/
SSIS:
http://www.microsoft.com/sqlserver/2008/en/us/integration.aspx