
Talend Technical Note

DQ Solution Checklist
August 2011


Every organization has issues with the quality of its data. Whether the problem is duplicate records, missing values or improper formats, no system is immune. To address these issues, organizations turn to data quality tools, such as Talend Data Quality, to profile, standardize, enrich, match, and measure or assess the ongoing quality of corporate data. To succeed with data quality, the solution should provide this complete set of functions.

The following checklist provides key functional requirements for implementing and deploying data quality in an enterprise environment.

Included

Description

Connect

Connect to Any Enterprise Data Source: Connect to data stored in relational databases, nonrelational structures like flat files and XML, common packaged applications like SAP and cloud-based applications such as salesforce.com.

Real-time and Batch: The solution should allow for cleansing and matching of one record or multiple records in real-time. It should return results of a single record cleanse and match in subsecond speed.

In-place Profiling and Cleansing: Profile and cleanse data in existing databases without the need to extract, move or create a repository of results. Eliminate dependence on any proprietary format in order to cleanse.

Profile

Column-based Analyses: Provide pre-built analysis patterns for individual attributes/columns/fields. Analysis should include 1) general functions such as min, max, frequency distributions of values and patterns and 2) specific functions for common attributes like e-mail address, phone number, dates, city, postal code and more.

Dependency Analyses: Provide pre-built analyses to identify relationships, patterns, integrity gaps, and duplication between and across multiple attributes/columns/fields.

Custom Analyses: Validate company-specific and domain-specific data with the ability to configure and execute user-defined profiling analyses. For example, SKU or part numbers most often require their own definition.

Trend Analyses: Provide the ability to analyze trends in data quality over time. In practical applications, users should be able to assess the impact of new data management processes on data quality.


Graphical and Tabular Presentation: Provide the ability to present profiling results in charts and graphs as well as a textual report. Provide the ability to export profiling results in a variety of formats including XML, PDF, HTML, etc.

Reports and Dashboards: Provide a dashboard (web-based) reporting system of data quality metrics and provide metadata to business intelligence (BI) systems.

Scheduled / Batch Execution: Schedule profiling or cleansing processes to occur when a file hits a directory and/or at any given interval.
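As an illustration of the column-based analyses listed under Profile (min, max, and frequency distributions of values and patterns), here is a minimal Python sketch. It is not Talend's implementation; the function names `value_pattern` and `profile_column` and the pattern alphabet (`A`, `a`, `9`) are assumptions made for this example.

```python
import re
from collections import Counter


def value_pattern(value):
    """Reduce a value to its shape: uppercase -> 'A', lowercase -> 'a', digit -> '9'."""
    pattern = re.sub(r"[A-Z]", "A", value)
    pattern = re.sub(r"[a-z]", "a", pattern)
    return re.sub(r"[0-9]", "9", pattern)


def profile_column(values):
    """Compute simple column statistics (hypothetical helper, not a Talend API)."""
    present = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "value_frequencies": Counter(present),
        "pattern_frequencies": Counter(value_pattern(v) for v in present),
    }


# Sample postal-code column with a short value, a foreign format and two gaps.
postal_codes = ["94022", "94022", "9402", "K1A 0B1", "", None]
report = profile_column(postal_codes)
```

The pattern frequencies make format problems visible at a glance: a column that should contain only `99999` values immediately shows the outliers `9999` and `A9A 9A9`.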

Standardize

Parse: Ability for the solution to interpret the meaning of text fields based upon matching of character strings against a knowledge base. In addition, the ability to customize that knowledge is a key feature.

Standardize and Cleanse: Transformations such as data-type conversions, string splitting and concatenation operations, etc.

Packaged Rules: Pre-built rules for common standardization and cleansing operations, such as formatting of addresses, telephone numbers, social security / Tax ID numbers, etc.

Open Knowledgebase Support: Incorporate freely available and open reference data for standardization. Examples include geonames.org for address standardization, data.gov sources, census data, et al.

Postal Validation and Geocoding: Make use of third-party postal validation software to validate street addresses according to the local postal authority. The software can append latitude and longitude based on address.
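To make the idea of a packaged standardization rule concrete, the following Python sketch formats telephone numbers. It assumes 10-digit North American numbers and a target format of `(NNN) NNN-NNNN`; both the function name `standardize_us_phone` and the target format are illustrative choices, not rules shipped with any product.

```python
import re


def standardize_us_phone(raw):
    """Return the number as '(NNN) NNN-NNNN', or None if it cannot be standardized."""
    digits = re.sub(r"\D", "", raw)          # strip punctuation, spaces, letters
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                  # drop a leading country code
    if len(digits) != 10:
        return None                          # leave for manual review
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


cleaned = [standardize_us_phone(p) for p in ["415.555.0100", "+1 415 555 0100", "555-0100"]]
```

Returning `None` instead of guessing keeps invalid values visible, so they can be counted as violations rather than silently rewritten.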

Match

Matching Types: Use probabilistic, deterministic and custom algorithms to find relationships and duplicates within the data.

Tune and Weight Matching: Weight, prioritize, and tune matching rules so that one attribute may have more priority over another in any given record.

Merge Records: Customize rules so that duplicate or related records can merge into a single "survivor". The final record can take the best parts of all related records to form an optimal best-of-breed record.

Deduplication: Remove duplicate records based on rules for determining survivorship.

Link: Create logical groups of records by relating those with user-determined properties.
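The weighting and survivorship requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field weights, the use of `difflib.SequenceMatcher` as the similarity measure, and the "keep the longest value" survivorship rule are all choices made for this example, not the algorithms of any particular product.

```python
from difflib import SequenceMatcher

# Assumed weights: name counts more than e-mail, which counts more than city.
WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of per-field similarities."""
    total = sum(weights.values())
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in weights.items()) / total


def merge_survivor(records):
    """Assumed survivorship rule: per field, keep the longest value."""
    return {field: max((r[field] for r in records), key=len) for field in records[0]}


a = {"name": "Jon Smith", "email": "jsmith@example.com", "city": ""}
b = {"name": "Jonathan Smith", "email": "jsmith@example.com", "city": "Paris"}
c = {"name": "Maria Lopez", "email": "mlopez@example.org", "city": "Lima"}

score_ab = match_score(a, b)   # likely duplicates
score_ac = match_score(a, c)   # unrelated records
survivor = merge_survivor([a, b])
```

Tuning then amounts to adjusting `WEIGHTS` and the score threshold above which two records are treated as duplicates.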


Performance Matching: The solution should provide strategies for matching very large data sets. Most commonly, this is accomplished with blocking keys (also known as bucketing or pre-matching), which limit the number of comparisons performed by the advanced matching algorithms. The product should provide tools for setting up and tuning blocking keys.
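A short Python sketch shows why blocking keys reduce the matching workload: records are compared only when they share a key. The key used here (first three letters of the surname plus the first three characters of the postal code) and the function names are assumptions for illustration; a real deployment would tune the key to its own data.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Assumed key: surname prefix plus postal-code prefix."""
    return (record["surname"][:3].upper(), record["postal_code"][:3])


def candidate_pairs(records):
    """Yield index pairs only for records that fall into the same block."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[blocking_key(record)].append(index)
    for members in blocks.values():
        yield from combinations(members, 2)


records = [
    {"surname": "Smith", "postal_code": "94022"},
    {"surname": "Smithe", "postal_code": "94025"},
    {"surname": "Jones", "postal_code": "94022"},
    {"surname": "Jonson", "postal_code": "10001"},
]
pairs = list(candidate_pairs(records))
all_pairs = list(combinations(range(len(records)), 2))
```

With four records, an exhaustive comparison needs six pairs, while blocking leaves a single candidate pair for the expensive matching algorithm. The trade-off is that a key that is too strict can separate true duplicates, which is why tuning tools matter.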

Monitor

Business User Interaction: The solution should provide an easy-to-use environment for business users to follow the key performance indicators for data quality. Adobe Acrobat (PDF) reports and web-based portals are often a requirement, since little training is necessary for productivity.

Develop Monitoring Rules: The solution should provide an easy way to set up profiling rules and deploy them as monitoring rules. The rules should include the ability to set thresholds for acceptable levels of violations and highlight when those thresholds are exceeded.

Violations Alerts: The solution should have the capability to invoke special processes when there are violations of data quality rules. Examples include an e-mail alert, text message, special report, or halting of a data integration process.

Reports: Pre-built and customizable reports that show key data quality metrics over time.
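The combination of monitoring rules, thresholds and alerts described above can be sketched as follows. The `MonitoringRule` class, the `evaluate` function and the `on_breach` callback are hypothetical names introduced for this example; the callback stands in for whatever special process (e-mail, text message, halting a job) a real solution would invoke.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Deliberately simple e-mail check, sufficient for the illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class MonitoringRule:
    name: str
    is_valid: Callable[[dict], bool]   # profiling check reused as a monitor
    max_violation_rate: float          # acceptable share of failing records


def evaluate(rule, records, on_breach):
    """Apply the rule to all records and call on_breach if the threshold is exceeded."""
    violations = sum(1 for record in records if not rule.is_valid(record))
    rate = violations / len(records) if records else 0.0
    result = {
        "rule": rule.name,
        "violations": violations,
        "rate": rate,
        "breached": rate > rule.max_violation_rate,
    }
    if result["breached"]:
        on_breach(result)
    return result


alerts = []  # stand-in for an e-mail or text-message channel
email_rule = MonitoringRule(
    name="valid_email",
    is_valid=lambda r: bool(EMAIL_RE.match(r.get("email") or "")),
    max_violation_rate=0.10,
)
customers = [
    {"email": "ana@example.com"},
    {"email": "bob@example"},
    {"email": "cy@example.org"},
    {"email": "di@example.net"},
]
result = evaluate(email_rule, customers, on_breach=alerts.append)
```

Storing each `result` with a timestamp would also provide the raw material for reports that show data quality metrics over time.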

Enrich

Range of Enrichment Service Types: The solution should have the capability to use enrichment data from a wide variety of sources. Enrichment data might come in various file formats and schemas. It may come from online sources, commercial partners or data providers.

Enrichment Partners: The solution should be tested to support a variety of enrichment partners, including postal validation (QAS, Melissa Data, AddressAnywhere, etc.) or B-to-B partners such as D&B.

Enrichment Delivery Mechanisms: The solution should be able to use data sources or Web Services to enrich data.

Deploy

Single Product: Support for all data quality operations, from profiling through enrichment, via a single product.

Common Metadata: Common metadata across all products and functionality, via a single repository or the ability to seamlessly share and synchronize metadata between data quality, data profiling, data integration and master data management.


Performance and Scalability: Support for all data delivery modes, regardless of run-time architecture type (centralized server engine, cloud, Hadoop, etc.).

Data Quality Services: Ability to deploy all aspects of run-time functionality as services within a service-oriented architecture, so that external tools and applications can dynamically modify and control run-time behavior.

Talend Data Quality


The Talend Data Quality suite is designed for the improvement and corporate management of data quality. The suite provides the foundation tools for data quality, including data profiling, correction, issue mitigation, advanced reporting and an integrated Data Integration tool for quick and easy data transformations. All functionality is completely integrated with Talend Integration Suite, Talend's leading open source enterprise Data Integration solution. Take what you've learned from profiling and use the analysis in your Data Integration or MDM workflow.

© 2011 Talend Inc. All rights reserved. All brand names or products referenced herein are acknowledged to be trademarks or registered trademarks of their respective owners.
