DQ Solution Checklist
August 2011
Every organization has issues with the quality of its data. Whether it is duplicate records, missing values, or improper formats, no system is impervious. To address these issues, organizations turn to data quality tools, such as Talend Data Quality, to profile, standardize, enrich, match, and measure the ongoing quality of corporate data. To be successful with data quality, the solution should provide this complete set of functions.
The following checklist provides key functional requirements for implementing and deploying data quality in an enterprise environment.
Connect
Connect to data stored in relational databases, non-relational structures like flat files and XML, common packaged applications like SAP, and cloud-based applications such as salesforce.com. The solution should allow for cleansing and matching of one record or multiple records in real time, and it should return the results of a single-record cleanse and match at sub-second speed. Profile and cleanse data in existing databases without the need to extract, move, or create a repository of results. Eliminate dependence on any proprietary format in order to cleanse.
Profile
Column-based Analyses: Provide pre-built analysis patterns for individual attributes/columns/fields. Analysis should include 1) general functions such as min, max, and frequency distributions of values and patterns, and 2) specific functions for common attributes like e-mail address, phone number, dates, city, postal code, and more.
Dependency Analyses: Provide pre-built analyses to identify relationships, patterns, integrity gaps, and duplication between and across multiple attributes/columns/fields.
Custom Analyses: Validate company-specific and domain-specific data with the ability to configure and execute user-defined profiling analyses. For example, SKU or part numbers most often require their own definition.
Trend Analyses: Provide the ability to analyze trends in data quality over time. In practical applications, users should be able to assess the impact of new data management processes on data quality.
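As a minimal sketch of what a column-based analysis computes, the snippet below profiles a single column for nulls, min/max, value frequencies, and character-pattern frequencies (digits mapped to 9, letters to A, as many profilers do). The function name and pattern encoding are illustrative, not a Talend API.

```python
from collections import Counter
import re

def profile_column(values):
    """Minimal column profile: counts, min/max, value and pattern frequencies."""
    non_null = [v for v in values if v not in (None, "")]

    def pattern(v):
        # Encode each digit as 9 and each letter as A to expose format patterns
        return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v)))

    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "value_freq": Counter(non_null),
        "pattern_freq": Counter(pattern(v) for v in non_null),
    }

stats = profile_column(["75001", "75002", "7500", None, "75001"])
# stats["pattern_freq"] reveals that one value ("7500") breaks the 99999 format
```

Running such a profile before and after a cleansing process, and storing the results, is what enables the trend analyses described above.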
Provide the ability to present profiling results in charts and graphs as well as in textual reports. Provide the ability to export profiling results in a variety of formats, including XML, PDF, HTML, etc. Provide a web-based dashboard of data quality metrics and provide metadata to business intelligence (BI) systems. Schedule profiling or cleansing processes to run when a file arrives in a directory and/or at any given interval.
Standardize
Parse: Ability for the solution to interpret the meaning of text fields based upon matching of character strings against a knowledge base. In addition, the ability to customize that knowledge base is a key feature.
Transformations such as data-type conversions, string splitting, and concatenation operations, etc.
Packaged Rules: Pre-built rules for common standardization and cleansing operations, such as formatting of addresses, telephone numbers, Social Security / Tax ID numbers, etc.
Incorporate freely available and open reference data for standardization. Examples include geonames.org for address standardization, data.gov sources, census data, et al.
Make use of third-party postal validation software to validate street addresses according to the local postal authority. The software can append latitude and longitude based on the address.
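A packaged standardization rule of the kind described above can be as simple as the following sketch, which normalizes US phone numbers to a single display format. The function name and target format are assumptions for illustration; a real rule set would cover many formats and locales.

```python
import re

def standardize_us_phone(raw):
    """Normalize a US phone number to (NNN) NNN-NNNN; return None if invalid."""
    digits = re.sub(r"\D", "", raw)               # strip punctuation and spaces
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                        # drop leading country code
    if len(digits) != 10:
        return None                                # leave invalid values for review
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

standard = standardize_us_phone("1-415.555 0134")  # "(415) 555-0134"
```

Returning None rather than guessing keeps invalid values visible to downstream monitoring instead of silently "fixing" them.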
Match
Matching Types: Use probabilistic, deterministic, and custom algorithms to find relationships and duplicates within the data. Weight, prioritize, and tune matching rules so that one attribute may have more priority than another in any given record.
Merge Records: Customize rules so that duplicate or related records can merge into a single "survivor". The final record can take the best parts of all related records to form an optimal best-of-breed record.
Deduplication: Remove duplicate records based on rules for determining survivorship.
Link: Create logical groups of records by relating those with user-determined properties.
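The merge-records requirement can be sketched as a per-field survivorship function: for each field, a rule picks the "best" value from all matched records. The records, field names, and rules below are hypothetical examples, not a vendor API.

```python
def merge_survivor(records, rules):
    """Build a best-of-breed survivor record from a group of matched duplicates."""
    survivor = {}
    for field, pick in rules.items():
        candidates = [r[field] for r in records if r.get(field)]
        survivor[field] = pick(candidates) if candidates else None
    return survivor

dupes = [
    {"name": "J. Smith", "email": "js@example.com", "updated": "2010-02-01"},
    {"name": "John Smith", "email": "", "updated": "2011-06-15"},
]
rules = {
    "name": lambda vs: max(vs, key=len),   # longest (most complete) name wins
    "email": lambda vs: vs[0],             # first non-empty e-mail wins
    "updated": max,                        # most recent timestamp wins
}
golden = merge_survivor(dupes, rules)
# golden takes the full name from one record and the e-mail from the other
```

Deduplication then reduces each matched group to its survivor, while linking keeps the group intact and merely relates its members.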
Performance Matching
The solution should provide strategies for matching very large data sets. Most commonly, this is accomplished with blocking keys (a.k.a. bucketing or pre-matching), which limit the number of comparisons performed by the advanced matching algorithms. The product should provide tools for setting up and tuning blocking keys.
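A minimal illustration of how a blocking key cuts down comparisons, using a hypothetical key of the first two letters of the last name plus the postal code (the key choice and data are made up for this sketch):

```python
from collections import defaultdict
from itertools import combinations

def block(records, key):
    """Group records by blocking key; expensive matching runs only within each block."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    return blocks

records = [
    {"last": "Smith", "zip": "75001"},
    {"last": "Smyth", "zip": "75001"},
    {"last": "Jones", "zip": "10001"},
    {"last": "Smith", "zip": "75002"},
]
# Blocking key: first two letters of last name + postal code
blocks = block(records, lambda r: (r["last"][:2].upper(), r["zip"]))
pairs = [p for grp in blocks.values() for p in combinations(grp, 2)]
# 4 records give 6 possible pairs, but blocking leaves only 1 candidate pair
```

Tuning means widening or narrowing the key: a coarser key (e.g. postal code prefix only) catches more true matches at the cost of more comparisons.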
Monitor
Business User Interaction: Solution should provide an easy-to-use environment for business users to follow the key performance indicators for data quality. Adobe Acrobat (PDF) reports and web-based portals are often a requirement, since little training is necessary for productivity. The solution should provide an easy way to set up profiling rules and deploy them as monitoring rules. The rules should include the ability to set thresholds for acceptable levels of violations and to highlight when rules become unacceptable.
Violations Alerts: The solution should have the capability to invoke special processes when there are violations of data quality rules. Examples include an e-mail alert, a text message, a special report, or halting of a data integration process.
Reports: Pre-built and customizable reports that show key data quality metrics over time.
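The threshold-and-alert behavior described above reduces to a simple check, sketched here with an injected alert callback standing in for e-mail, text message, or process-halting actions (names and threshold are hypothetical):

```python
def check_rule(name, violations, total, threshold_pct, alert):
    """Monitoring rule: alert when the violation rate exceeds the acceptable threshold."""
    rate = 100.0 * violations / total
    if rate > threshold_pct:
        alert(f"DQ rule '{name}' failing: {rate:.1f}% violations (limit {threshold_pct}%)")
        return False
    return True

alerts = []
ok = check_rule("valid_email", violations=42, total=500,
                threshold_pct=5.0, alert=alerts.append)
# 8.4% > 5.0%, so the rule fails and one alert message is queued
```

Logging each run's rate over time is what feeds the trend reports in the previous row.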
Enrich
Range of Enrichment Service Types: Solution should have the capability to use enrichment data from a wide variety of sources. Enrichment data might come in various file formats and schemas, and it may come from online sources, commercial partners, or data providers.
Enrichment Partners: Solution should be tested to support a variety of enrichment partners, including postal validation (QAS, Melissa Data, AddressAnywhere, etc.) and B-to-B partners such as D&B. Solution should be able to use data sources or Web Services to enrich data.
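At its simplest, enrichment is a lookup against reference data keyed on some attribute of the record. The sketch below appends city and coordinates by postal code; the table, field names, and the single sample row are hypothetical, and a real deployment would call a partner Web Service instead of a local dictionary.

```python
# Hypothetical reference table keyed on postal code (e.g. loaded from a
# free source such as geonames.org); real services would be remote calls.
POSTAL_REF = {
    "75001": {"city": "Paris", "lat": 48.86, "lon": 2.34},
}

def enrich(record, ref):
    """Append reference fields to a record when its postal code is known."""
    extra = ref.get(record.get("postal_code"), {})
    return {**record, **extra}

enriched = enrich({"name": "ACME", "postal_code": "75001"}, POSTAL_REF)
# enriched now carries city, lat, and lon alongside the original fields
```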
Deploy
Single Product: Support for all data quality operations, from profiling through enrichment, via a single product.
Common Metadata: Common metadata across all products and functionality, via a single repository or the ability to seamlessly share and synchronize metadata between data quality, data profiling, data integration, and master data management.
Support for all data delivery modes, regardless of run-time architecture type (centralized server engine, cloud, Hadoop, etc.). Ability to deploy all aspects of run-time functionality as services within a service-oriented architecture, so that external tools and applications can dynamically modify and control run-time behavior.
2011 Talend Inc. All rights reserved. All brand names or products referenced herein are acknowledged to be trademarks or registered trademarks of their respective owners.