Common data quality problems:
– Buried information
– Data myopia
– Data anomalies
Quality stage Notes: Bhaskar Reddy
Quality Stage
Quality Stage is a tool intended to deliver the high-quality data required for success in a
range of enterprise initiatives, including business intelligence, legacy consolidation, and
master data management. It does this primarily by identifying components of data that
may be in columns or free format, standardizing the values and formats of those data,
using the standardized results and other generated values to determine likely duplicate
records, and building a “best of breed” record out of each set of potential duplicates.
Through its intuitive user interface, Quality Stage substantially reduces the time and cost to
implement Customer Relationship Management (CRM), data warehouse/business
intelligence (BI), data governance, and other strategic IT initiatives, and maximizes their
return on investment by ensuring data quality.
With Quality Stage it is possible, for example, to construct consolidated customer and
household views, enabling more effective cross-selling, up-selling, and customer
retention, and to help improve customer support and service, for example by
identifying a company's most profitable customers. The cleansed data provided by Quality
Stage allows creation of business intelligence on individuals and organizations for research,
fraud detection, and planning.
Out of the box Quality Stage provides for cleansing of name and address data and some
related types of data such as email addresses, tax IDs and so on. However, Quality Stage
is fully customizable to be able to cleanse any kind of classifiable data, such as
infrastructure, inventory, health data, and so on.
The product now called Quality Stage has its origins in a product called INTEGRITY from a
company called Vality. Vality was acquired by Ascential Software in 2002 and the
product renamed to Quality Stage. This first version of Quality Stage reflected its
heritage (for example, it had only batch-mode operation) and, indeed, its mainframe
antecedents (for example, file name components limited to eight characters).
Ascential did not do much with the inner workings of Quality Stage, which was, after all,
already a mature product. Ascential's emphasis was on providing two new modes of
operation for Quality Stage. One was a “plug-in” for Data Stage that allowed data
cleansing/standardization to be performed (by Quality Stage jobs) as part of an ETL data
flow. The other was to provide for Quality Stage to use the parallel execution technology
(Orchestrate) that Ascential had as a result of its acquisition of Torrent Systems in 2001.
IBM acquired Ascential Software in 2005. Since then the main direction has
been to put together a suite of products that share metadata transparently and share a
common set of services for such things as security, metadata delivery, reporting, and so
on. In the particular case of Quality Stage, it now shares a common Designer client with
Data Stage: from version 8.0 onwards, Quality Stage jobs run as, or as part of, Data Stage
jobs, at least in the parallel execution environment.
QualityStage Functionality
Investigation
Features
Investigate methods
Character Investigation
Character investigation examines single-domain fields:
– Entity identifiers, e.g. ZIP codes, SSNs, Canadian postal codes
– Entity clarifiers, e.g. name prefix, gender, and marital status
Multiple-domain fields contain more than one type of data in the one field; these are
examined by word investigation. To come up with a standard format, you need to be aware of what
formats actually exist in the data. The result of a character discrete investigation (which
can also examine just part of a field, for example the first three characters) is a
frequency distribution of values or patterns – the developer determines which.
Word investigation is probably the most important of the three for the entire
QualityStage suite, performing a free-format analysis of the data records. It performs
two different kinds of task: one is to report which words/tokens are already known, in
terms of the currently selected “rule set”; the other is to report how those words are to
be classified, again in terms of the currently selected “rule set”. Word investigation does
not overlap with Information Analyzer (the data profiling tool).
Rule Set :
A rule set includes a set of tables that list the “known” words or tokens. For example,
the GBNAME rule set contains a list of names that are known to be first names in Great
Britain, such as Margaret, Charles, John, Elizabeth, and so on. Another table in the
GBNAME rule set contains a list of name prefixes, such as Mr, Ms, Mrs and so on, that
can not only be recognized as name prefixes (titles, if you prefer) but can in some cases
reveal additional information, such as gender.
When a word investigation reports about classification, it does so by producing a
pattern. This shows how each known word in the data record is classified, and the order
in which each occurs. For example, under the USNAME rule set the name WILLIAM F.
GAINES III would report the pattern FI?G – the F indicates that “William” is a known first
name, the I indicates the “F” is an initial, the ? indicates that “Gaines” is not a known
word in context, and the G indicates that “III” is a “generation” – as would be “Senior”,
“IV” and “fils”. Punctuation may be included or ignored.
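To make patterns concrete, here is a minimal Python sketch (illustrative only, not QualityStage code) that classifies name tokens against tiny made-up tables and assembles a pattern like FI?G:

# Minimal sketch of pattern generation, loosely modelled on the USNAME
# example above. The tables and class codes here are illustrative only.
FIRST_NAMES = {"WILLIAM", "MARGARET", "CHARLES", "JOHN", "ELIZABETH"}   # class F
GENERATIONS = {"III", "IV", "JR", "SR", "SENIOR", "FILS"}               # class G

def classify(token):
    t = token.strip(".,").upper()
    if t in FIRST_NAMES:
        return "F"
    if t in GENERATIONS:
        return "G"
    if len(t) == 1 and t.isalpha():
        return "I"   # single letter treated as an initial
    return "?"       # unknown word in this context

def pattern(record):
    return "".join(classify(tok) for tok in record.split())

print(pattern("WILLIAM F. GAINES III"))   # -> FI?G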
Rule sets also come into play when performing standardization (discussed below).
Classification tables contain not only the words/tokens that are known and classified,
but also contain the standard form of each (for example “William” might be recorded as
the standard form for “Bill”) and may contain an uncertainty threshold (for example
“Felliciity” might still be recognizable as “Felicity” even though it is misspelled in the
original data record). Probabilistic matching is one of the significant strengths of
QualityStage.
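The uncertainty threshold can be approximated with ordinary fuzzy string matching; a sketch using Python's difflib (an assumption for illustration – QualityStage uses its own comparison logic):

import difflib

KNOWN_NAMES = ["FELICITY", "WILLIAM", "MARGARET"]

def standard_form(token, cutoff=0.8):
    # Return the closest known name if it is similar enough, else None.
    matches = difflib.get_close_matches(token.upper(), KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(standard_form("FELLICIITY"))   # -> FELICITY, despite the misspelling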
Investigation might also be performed to review the results of standardization,
particularly to see whether there are any unhandled patterns or text that could be
better handled if the rule set itself were tweaked, either with improved classification
tables or through a mechanism called rule set overrides.
Standardization
It may be that standardization is the desired end result of using Quality Stage. For
example street address components such as “Street” or “Avenue” or “Road” are often
represented differently in data, perhaps differently abbreviated in different records.
Standardization can convert all the non-standard forms into whatever standard format
the organization has decided that it will use.
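As a toy illustration of this idea (a sketch, not QualityStage's actual mechanism), standardizing street-type abbreviations might look like this:

# Illustrative standardization of street-type tokens to an agreed standard form.
STANDARD_FORMS = {
    "ST": "STREET", "STR": "STREET", "STREET": "STREET",
    "AVE": "AVENUE", "AV": "AVENUE", "AVENUE": "AVENUE",
    "RD": "ROAD", "ROAD": "ROAD",
}

def standardize(address):
    out = []
    for tok in address.upper().replace(".", " ").split():
        out.append(STANDARD_FORMS.get(tok, tok))
    return " ".join(out)

print(standardize("123 Main St."))   # -> 123 MAIN STREET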
This kind of Quality Stage job can be set up as a web service. For example, a data entry
application might send in an address to be standardized. The web service would return
the standardized address to the caller.
More commonly standardization is a preliminary step towards performing matching.
More accurate matching can be performed if standard forms of words/tokens are
compared than if the original forms of these data are compared.
Standardization is typically layered through three levels of rule sets, as the following
reconstructed slide examples illustrate:

Country Identifier (COUNTRY) -> Domain Pre-processor (USPREP) -> Domain-Specific rule sets (USNAME, USADDR, USAREA)

Country Identifier (COUNTRY) example:

Input Records:
100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
SITE 66 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
28 GROSVENOR STREET LONDON W1X 9FE
123 MAIN STREET

Output Records (a two-character country code plus a Y/N flag is prepended; Y means the
country was identified from the data, N means the default country was assumed):
USY 100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
CAY SITE 66 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
GBY 28 GROSVENOR STREET LONDON W1X 9FE
USN 123 MAIN STREET

Domain Pre-processor (USPREP) example:

Input Record:
Address Line 1: TINA FISHER
Address Line 2: ATTN IBM
Address Line 3: 211 WASHINGTON DR
Address Line 4: PO BOX 52
Address Line 5: WESTBORO, MA
Address Line 6: 02140

Output Record:
Name Domain: TINA FISHER ATTN IBM
Address Domain: 211 WASHINGTON DR PO BOX 52
Area Domain: WESTBORO, MA 02140

Domain-Specific (USADDR) example:

Input Record:
211 WASHINGTON DR PO BOX 52

Output Record:
House Number: 211
Street Name: WASHINGTON
Street Suffix Type: DR
Box Type: PO BOX
Box Value: 52
Call subroutines for each sub-domain (e.g. country name, post code, province, city)
Rule Sets
Rule Sets are standardization processes used by the Standardize Stage and have
three required components:
– Classification Table
– Dictionary File
– Pattern-Action File
Optional components include:
– User Overrides
– Reference Tables
Standardization Example
Standardize Stage
The standardization process begins by parsing the input data into individual data
elements called tokens
Any character that is in the SEPLIST and not in the STRIPLIST will be used to
separate tokens and will also become a token itself.
Any character that is in both lists will be used to separate tokens but will not
become a token itself.
The best example of this is the space character – one or more spaces are
stripped, but the space indicates where one token ends and another begins.
The parser behaves differently if the locale setting is Chinese, Japanese, or Korean
Spaces are not used to divide tokens so each character, including a space, is
considered a token
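A minimal sketch of this SEPLIST/STRIPLIST parsing rule (illustrative Python, not the actual QualityStage parser):

def tokenize(text, seplist=" ,.", striplist=" ."):
    # Characters in SEPLIST separate tokens; those also in STRIPLIST are
    # discarded, while the rest are kept as tokens in their own right.
    tokens, current = [], ""
    for ch in text:
        if ch in seplist:
            if current:
                tokens.append(current)
                current = ""
            if ch not in striplist:
                tokens.append(ch)   # separator kept as its own token
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(tokenize("P.O. BOX 52, MAIN ST", seplist=" ,.", striplist=" ."))
# -> ['P', 'O', 'BOX', '52', ',', 'MAIN', 'ST']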
Classification
Parsing separated the input data into individual tokens. Classification assigns a
one-character tag (called a class) to each and every parsed token to provide context.
Classification – order
First, key words that can provide special context are classified
– Provided by the standardization rule set classification table
– Since these classes are context specific, they vary across rule sets
Next, default classes are assigned to the remaining tokens
– These default classes are always the same regardless of the rule set used
Finally, lexical patterns are assembled from the classification results
– A lexical pattern is the concatenated string of the classes assigned to the parsed tokens
Classification Example
Default Classes
Class  Description (partial list, reconstructed from the discussion below)
^      Numeric token
+      Alphabetic token (unknown word)
?      One or more unknown words
@      Mixed token (e.g. alphanumeric)
~      Special character omitted from both the SEPLIST and STRIPLIST
0      NULL class (never appears in assembled patterns)
However, if a special character is included in the SEPLIST but omitted from the
STRIPLIST, then the default class for that special character becomes the special
character itself; in this case, the default class does describe an actual special
character value.
It is important to note this can also happen to the “reserved” default classes
(for example: ^ = ^ if ^ is in the SEPLIST but omitted from the STRIPLIST).
Also, if a special character is omitted from both the SEPLIST and STRIPLIST (and it is
surrounded by spaces in the input data), then the “special” default class of ~ (tilde)
is assigned. If not surrounded by spaces, then the appropriate mixed-token default class
would be assigned (for example: P.O. = @ if . is omitted from both lists).
Essentially, the NULL class does to complete tokens what the STRIPLIST does to
individual characters. Therefore, you will never see the NULL class represented in the
assembled lexical patterns.
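A sketch of default-class assignment (illustrative Python; only the class codes discussed above are modelled):

def default_class(token):
    # Assign a default class to a token not found in the classification table.
    if token.isdigit():
        return "^"            # numeric
    if token.isalpha():
        return "+"            # alphabetic (unknown word)
    if token.isalnum():
        return "@"            # mixed alphanumeric
    return "~"                # special character(s)

tokens = ["211", "WASHINGTON", "DR"]
print("".join(default_class(t) for t in tokens))   # -> ^++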
Classification Table
Classification Tables contain three required space-delimited columns:
;-----------------------------------------------------------------------------
; USADDR Classification Table
;-----------------------------------------------------------------------------
; Classification Legend
;-----------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; T - Street Types
; U - Unit Types
;-----------------------------------------------------------------------------
PO "PO BOX" B
BOX "PO BOX" B
POBOX "PO BOX" B
Classification table is intended for key words that provide special context, which
means context essential to the proper processing of the data
Tokens with both a high individual frequency and a low set cardinality
The order that the columns are listed in the dictionary file defines the order the
columns appear in the standardization rule set output
Dictionary file entries are used to automatically generate the column metadata
available for mapping on the Standardize Stage output link
;-----------------------------------------------------------------------------
; USADDR Dictionary File
;-----------------------------------------------------------------------------
; Business Intelligence Fields
;-----------------------------------------------------------------------------
HouseNumber C 10 S HouseNumber
StreetName C 25 S StreetName
StreetSuffixType C 5 S StreetSuffixType
Rule set output columns fall into three categories:
1. Business Intelligence
2. Matching
3. Reporting
The Reporting columns include:
Unhandled Data – the tokens left unhandled (i.e. unprocessed) by the rule set
Input Pattern – the lexical pattern representing the parsed and classified input tokens
Exception Data – placeholder column for storing invalid input data (an alternative to deletion)
User Override Flag – indicates whether or not a user override was applied (default = NO)
Pattern-Action Set
One line containing a pattern, which is tested against the current data
One or more lines of actions, which are executed if the pattern tests true

^ | + | T | D | D | $    ; number, alpha, street type, directional, directional, end-of-data
COPY [1] {HouseNumber}
COPY [2] {StreetName}
COPY_A [3] {StreetSuffixType}
COPY_A [4] temp                       ; temp = "N"
CONCAT_A [5] temp                     ; temp = "NW"
COPY temp {StreetSuffixDirectional}
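In rough Python terms (an illustration only; the input tokens shown are hypothetical), the pattern test and its actions behave like this:

# Rough analogue of the pattern-action set above. Token classes are assumed
# to have been assigned already (see the classification sketches earlier).
def apply_pattern(tokens, classes, out):
    # Pattern ^ | + | T | D | D | $ : number, alpha, street type,
    # directional, directional, end-of-data.
    if classes == ["^", "+", "T", "D", "D"]:
        out["HouseNumber"] = tokens[0]          # COPY [1]
        out["StreetName"] = tokens[1]           # COPY [2]
        out["StreetSuffixType"] = tokens[2]     # COPY_A [3]
        temp = tokens[3]                        # COPY_A [4] temp
        temp += tokens[4]                       # CONCAT_A [5] temp
        out["StreetSuffixDirectional"] = temp   # COPY temp {...}
        return True
    return False

record = {}
apply_pattern(["123", "OAK", "DR", "N", "W"], ["^", "+", "T", "D", "D"], record)
print(record)   # {'HouseNumber': '123', 'StreetName': 'OAK', ...}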
Configuration Options
Parsing Parameters (SEPLIST/STRIPLIST)
Phonetic Coding (NYSIIS and SOUNDEX)
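For a flavour of phonetic coding, here is the classic Soundex algorithm in Python (QualityStage also offers NYSIIS, which is more elaborate):

def soundex(name):
    # Classic Soundex: keep the first letter, map remaining letters to digits,
    # collapse adjacent duplicates; H/W do not break a run, vowels do.
    codes = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue              # H and W do not reset the previous code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163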
Main
– Pattern-Action Sets
– Sequentially processed until the start of the subroutines or an EXIT command is encountered
Subroutines
– Pattern-Action Sets
– Each subroutine starts with a header line (\SUB) and ends with a trailer line (\END_SUB)
– Subroutines can be called by MAIN or by other subroutines
– When called, sequentially processed until a RETURN command or \END_SUB is encountered
Example parsed values:
House Number = 50
State Abbreviation = MA
Validation can verify that the data describes an actual address.
Unhandled Data
Unhandled Pattern
Unhandled data may represent the entire input or a subset of the input.
If there is no unhandled data, it does not necessarily mean the data was processed
correctly.
Some unhandled data does not need to be processed, if it does not belong to that
domain.
User Overrides
User overrides provide the user with the ability to make modifications without
directly editing the classification table or the pattern-action file.
The following pattern/text override objects are called based on logic in the
pattern-action file:
– Input pattern
– Input text
– Unhandled pattern
– Unhandled text
Classification Override
There are two subroutines in each delivered rule set that are specifically for users to
add pattern-action language:
Input Modifications
– This subroutine is called after the Input User Overrides are applied but before any
of the rule set pattern-actions are checked
Unhandled Modifications
– This subroutine is called after all the pattern-actions are checked and the
Unhandled User Overrides are applied
http://pic.dhe.ibm.com/infocenter/iisinfsv/v8r1/index.jsp?topic=/com.ibm.swg.im.iis.qs.patguide.doc/c_qspatact_container_topic.html
What is Matching ?
Matching is the real heart of Quality Stage. Different probabilistic algorithms are
available for different types of data. Using the frequencies developed during
investigation (or subsequently), the information content (or “rarity value”) of each value
in each field can be estimated. The less common a value, the more information it
contributes to the decision. A separate agreement weight or disagreement weight is
calculated for each field in each data record, incorporating both its information content
(the likelihood that a match actually has been found) and the probability that a match has
been found purely at random. These weights are summed across the fields in the record to
come up with an aggregate weight that can be used as the basis for reporting that a
particular pair of records probably is, or probably is not, a duplicate.
There is a third possibility, a “grey area” in the middle, which Quality Stage refers to as
the “clerical review” area – record pairs in this category need to be referred to a human
to make the decision because there is not enough certainty either way. Over time the
algorithms can be tuned with things like improved rule sets, weight overrides, and different
settings of probability levels so that fewer and fewer “clericals” are found.
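QualityStage's weighting follows the general shape of probabilistic (Fellegi–Sunter style) record linkage; a simplified sketch with illustrative m- and u-probabilities:

import math

# m = P(field agrees | records truly match); u = P(field agrees | non-match).
# The u-probability is driven by how common a value is (its "rarity value").
def field_weight(agrees, m, u):
    if agrees:
        return math.log2(m / u)                 # agreement weight (positive)
    return math.log2((1 - m) / (1 - u))         # disagreement weight (negative)

def record_weight(comparisons):
    # comparisons: list of (agrees, m, u) per field; weights are summed.
    return sum(field_weight(a, m, u) for a, m, u in comparisons)

# Example: surname agrees (rare value), birth date disagrees.
w = record_weight([(True, 0.95, 0.01), (False, 0.9, 0.05)])
print(round(w, 2))   # compare against cutoffs: match, clerical review, or non-match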
Matching makes use of a concept called “blocking”, which is an unfortunately-chosen
term that means that potential sets of duplicates form blocks (or groups, or sets) which
can be treated as separate sets of potentially duplicated values. Each block of potential
duplicates is given a unique ID, which can be used by the next phase (survivorship) and
can also be used to set up a table of linkages between the blocks of potential duplicates
and the keys to the original data records that are in those blocks. This is often a
requirement when de-duplication is being performed, for example when combining
records from multiple sources, or generating a list of unique addresses from a customer
file, et cetera.
More than one pass through the data may be required to identify all the potential
duplicates. For example, one customer record may refer to a customer with a street
address but another record for the same customer may include the customer’s post
office box address. Searching for duplicate addresses would not find this customer; an
additional pass based on some other criteria would also be required. Quality Stage does
provide for multiple passes, either fully passing through the data for each pass, or only
examining the unmatched records on subsequent passes (which is usually faster).
Within Information Server, multiple stages offer capability that can be considered matching,
for example:
Lookup
Join
Merge
Unduplicate Match
Reference Match
Lookups, Joins, and Merges typically use key attributes, exact match criteria, or
matches to a range of values or simple formats
The Unduplicate Match Stage and Reference Match Stage offer probabilistic
matching capability
Unduplicate Match locates and groups all similar records within a single input data
source. This process identifies potential duplicate records, which might then be
removed.
Blocking step
Blocking provides a method of limiting the number of pairs to examine. When you
partition data sources into mutually-exclusive and exhaustive subsets and only
search for matches within a subset, the process of matching becomes manageable.
Blocking partitions the sources into subsets that make computation feasible. Block
size is the single most important factor in match performance. Blocks should be as
small as possible without causing block overflows. Smaller blocks are more efficient
than larger blocks during matching.
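A blocking pass can be pictured as grouping on a coarse key before any detailed comparison; an illustrative Python sketch with a made-up block key of ZIP code:

from collections import defaultdict

def block_records(records, block_key):
    # Partition records into mutually exclusive blocks; candidate pairs are
    # only generated within a block, never across blocks.
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    return blocks

records = [
    {"name": "TINA FISHER", "zip": "02140"},
    {"name": "TINA FISCHER", "zip": "02140"},
    {"name": "JOHN SMITH", "zip": "90210"},
]
blocks = block_records(records, lambda r: r["zip"])
for key, members in blocks.items():
    print(key, [m["name"] for m in members])   # only same-ZIP records are compared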
The Reference Match stage identifies relationships among records. It can group the
records being compared in different ways, as follows:
One-to-many matching
Many-to-one matching
One-to-many matching
Identifies all records in one data source that correspond to a record for the same
individual, event, household, or street address in a second data source.
Only one record in the reference source can match one record in the data source,
because the matching applies to individual events. E.g.: finding the same individual
by comparing SSNs in a voter registration list and a department of motor vehicles
list.
Many-to-one matching
Multiple records in the data file can match a single record in the reference file.
E.g.: matching a transaction data source to a master data source allows many
transactions for one person in the master data source.
The Reference Match stage also produces several other outputs:
– Clerical has records that fall in the clerical range for both inputs
– Data Residual contains records that are non-matches from the data input
– Reference Residual contains records that are non-matches from the reference input
Survivorship
As the name suggests survivorship is about what becomes of the data in these blocks of
potential duplicates. The idea is to get the “best of breed” data out of each block, based
on built-in or custom rules such as “most frequently occurring non-missing value”,
“longest string”, “most recently updated” and so on.
The data that fulfill the requirements of these rules can then be handled in a couple of
ways. One technique is to come up with a “master record” – a “single version of the
truth” – that will become the standard for the organization. Another possibility is that
the improved data could be populated back into the source systems whence they were
derived; for example, if one source were missing date of birth, this could be populated
because the date of birth was obtained from another source (or from more than one). If this
is not the requirement (perhaps for legal reasons), then a table containing the linkage
between the source records and the “master record” keys can be created, so that the
original source systems also have the ability to refer to the “single source of truth” and
vice versa.
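A toy illustration of survivorship rules of the kind described above (most frequent non-missing value, longest string):

from collections import Counter

def most_frequent_nonmissing(values):
    counts = Counter(v for v in values if v)       # ignore missing values
    return counts.most_common(1)[0][0] if counts else None

def longest_string(values):
    return max((v for v in values if v), key=len, default=None)

# Three potential duplicates of one customer record.
dupes = [
    {"name": "T FISHER", "dob": "1980-02-14"},
    {"name": "TINA FISHER", "dob": None},
    {"name": "TINA FISHER", "dob": "1980-02-14"},
]
master = {
    "name": longest_string([d["name"] for d in dupes]),
    "dob": most_frequent_nonmissing([d["dob"] for d in dupes]),
}
print(master)   # {'name': 'TINA FISHER', 'dob': '1980-02-14'}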
Quality Stage can do more (than simple matching). Address verification can be
performed; that is, whether or not the address is a valid format can be reported. Out of
the box address verification can be performed down to city level for most countries. For
an extra charge, an additional module for world-wide address verification (WAVES) can
be purchased, which will give address verification down to street level for most
countries.
For some countries, where the postal systems provide appropriate data (for example
CASS in the USA, SERP in Canada, DPID in Australia), address certification can be
performed: in this case, an address is given to Quality Stage and looked up against a
database to report whether or not that particular address actually exists. These
modules carry an additional price, but that includes IBM obtaining regular updates to
the data from the postal authorities and providing them to the Quality Stage licensee.
Summary
IBM is planning to release the next version of its InfoSphere QualityStage Worldwide
Address Verification module, AVI v10:
– Release time frame is Q4 2012
– AVI v10 will have superior functionality and coverage over the current AVI v8.x
module (see slide 4)
– AVI v10 will leverage new address/decoding reference data
– AVI v10 will have broad support for various Information Server versions (see slide 5)
For current AVI v8.x customers only:
– AVI v8.x will have continued support until the end of Dec. 2013
– Address reference data for AVI v8.x has been discontinued by the vendor and will
end in Dec. 2013
– AVI v10 will include a migration utility for automated migration from AVI v8.x to
AVI v10
– For comparison, AVI v10 and AVI v8 can run side-by-side (for development)
Quality Stage provides the most powerful, accurate matching available, based on
probabilistic matching technology; it is easy to set up and maintain, and provides the
highest match rates available in the market.
An easy-to-use graphical user interface (GUI) with an intuitive, point-and-click interface
for specifying automated data quality processes – data investigation, standardization,
matching, and survivorship – reduces the time needed to deploy data cleansing
applications.
Quality Stage offers a thorough data investigation and analysis process for any kind of
free-formatted data. Through its tight integration with Data Stage and other Information
Server products, it also offers fully integrated management of the metadata associated
with those data.
There exists rigorous scientific justification for the probabilistic algorithms used in
Quality Stage; results are easy to audit and validate.
Worldwide address standardization, verification, and enrichment capabilities – including
certification modules for the United States, Canada, and Australia – add to the value of
cleansed address data.
Service-oriented architecture (SOA) enablement with InfoSphere Information Services
Director allows you to leverage data quality logic built using IBM InfoSphere
Information Server and publish it as an "always on, available everywhere" service in a
SOA – in minutes.
The bottom line is that Quality Stage helps to ensure that systems deliver accurate,
complete, trusted information to business users both within and outside the enterprise.