5 Understanding Data Quality Management

Today, more than ever, organizations realize the importance of data quality. By ensuring that quality data is stored in your data warehouse or business intelligence application, you also ensure the quality of information for dependent applications and analytics.

Oracle Warehouse Builder offers a set of features that assist you in creating data systems that provide high quality information to your business users. You can implement a quality process that assesses, designs, transforms, and monitors quality. Within these phases, you will use specific functionality from Warehouse Builder to create improved quality information.

This chapter contains the following topics:

About the Data Quality Management Process

Quality data is crucial to decision-making and planning. The aim of building a data warehouse is to have an integrated, single source of data that can be used to make business decisions. Since the data is usually sourced from a number of disparate systems, it is important to ensure that the data is standardized and cleansed before loading into the data warehouse.

Warehouse Builder provides functionality that enables you to effectively manage data quality by assessing, transforming, and monitoring your data. The benefits of using Warehouse Builder for data management are as follows:

Phases in the Data Quality Lifecycle

Ensuring data quality involves the following phases:

  • Quality Assessment

  • Quality Design

  • Quality Transformation

  • Quality Monitoring

Figure 5-1 shows the phases involved in providing high quality information to your business users.

Figure 5-1 Phases Involved in Providing Quality Information


Quality Assessment

In the quality assessment phase, you determine the quality of the source data. The first step in this phase is to import the source data, which could be stored in different sources, into Warehouse Builder. You can import metadata and data from both Oracle and non-Oracle sources.

After you load the source data, you use data profiling to assess its quality. Data profiling is the process of uncovering data anomalies, inconsistencies, and redundancies by analyzing the content, structure, and relationships within the data. The analysis and data discovery techniques form the basis for data monitoring.

Quality Design

The quality design phase consists of designing your quality processes. You can specify the legal data within a data object or legal relationships between data objects using data rules.

You also correct and augment your data, using the data quality operators that Warehouse Builder provides.

As part of the quality design phase, you also design the transformations that ensure data quality. These transformations could be mappings that are generated by Warehouse Builder as a result of data profiling or mappings you create.

Quality Transformation

The quality transformation phase consists of running the correction mappings you designed to correct the source data.

Quality Monitoring

Data monitoring is the process of examining warehouse data over time and alerting you when the data violates business rules set for the data.

About Data Profiling

Data profiling is the first step for any organization to improve information quality and provide better decisions. It is a robust data analysis method available in Warehouse Builder that you can use to discover and measure defects in your data before you start working with it. Because of its integration with the ETL features in Warehouse Builder and other data quality features, such as data rules and built-in cleansing algorithms, you can also generate data cleansing mappings and schema correction scripts. This enables you to automatically correct any inconsistencies, redundancies, and inaccuracies in both the data and metadata.

Data profiling enables you to discover many important things about your data. Some common findings include the following:

Benefits of Data Profiling

Using the data profiling functionality in Warehouse Builder enables you to:

  • Profile data from any source or combination of sources that Warehouse Builder can access.

  • Explore data profiling results in tabular or graphical format.

  • Drill down into the actual data related to any profiling result.

  • Derive data rules, either manually or automatically, based on the data profiling results.

  • Attach any data rule to a target object and select an action to perform if the rule fails.

  • Create a data auditor from a data rule to continue monitoring the quality of data being loaded into an object.

  • Derive quality indices such as Six Sigma valuations.

  • Profile or test any data rules you want to verify before putting them in place.

Types of Data Profiling

After you select the data objects, determine the aspects of your data that you want to profile and analyze. Data profiling offers three main types of analysis: attribute analysis, functional dependency analysis, and referential analysis. You can also create custom profiling processes using data rules, which enables you to validate custom rules against the actual data and get a score of their accuracy.

Figure 5-2 displays a representation of the types of data profiling and how you can perform each type.

Figure 5-2 Data Profiling Overview


Attribute Analysis

Attribute analysis seeks to discover both general and detailed information about the structure and content of data stored within a given column or attribute. Attribute analysis looks for information about patterns, domains, data types, and unique values.

Pattern analysis attempts to discover patterns and common types of records by analyzing the string of data stored in the attribute. It identifies the percentages of your data that comply with a certain regular expression format pattern found in the attribute. Using these pattern results, you can create data rules and constraints to help clean up current data problems. Some commonly identified patterns include dates, e-mail addresses, phone numbers, and social security numbers.

Table 5-1 shows a sample attribute, Job Code, that could be used for pattern analysis.

Table 5-1 Sample Columns Used for Pattern Analysis

Job ID   Job Code
7        337-A-55
9        740-B-74
10       732-C-04
20       43-D-4


Table 5-2 shows the possible results from pattern analysis, where D represents a digit and X represents a character. After looking at the results and knowing that it is company policy for all job codes to be in the format DDD-X-DD, you can derive a data rule that requires all values in this attribute to conform to this pattern.

Table 5-2 Pattern Analysis Results

Job Code   % Occurred
DDD-X-DD   75%
DD-X-D     25%
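
The same kind of pattern check can be expressed directly in SQL. The following query is only an illustrative sketch, independent of the profiling that Warehouse Builder performs; it assumes a hypothetical JOBS table with a JOB_CODE column and reports the percentage of values that conform to the DDD-X-DD format.

-- Share of job codes that follow the DDD-X-DD pattern (digits-letter-digits).
SELECT ROUND(100 * SUM(CASE
                         WHEN REGEXP_LIKE(job_code, '^[0-9]{3}-[A-Z]-[0-9]{2}$')
                         THEN 1 ELSE 0
                       END) / COUNT(*), 1) AS pct_ddd_x_dd
FROM jobs;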


Domain analysis identifies a domain, or set of commonly used values, within the attribute by capturing the most frequently occurring values. For example, the Status column in the Customers table is profiled and the results reveal that 90% of the values are among the following: "MARRIED", "SINGLE", "DIVORCED". Further analysis and drilling down into the data reveal that the other 10% contains misspelled versions of these words, with a few exceptions. The profiling configuration determines when a set of values qualifies as a domain, so review the configuration before accepting domain values. You can then let Warehouse Builder derive a rule that requires the data stored in this attribute to be one of the three values that qualified as a domain.
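
As an illustration only, a simple frequency query shows how domain candidates surface during profiling; the CUSTOMERS table and STATUS column here are assumed examples.

-- Frequency of each STATUS value as a percentage of all rows;
-- values that cover most of the data are candidates for a domain.
SELECT status,
       COUNT(*)                                         AS occurrences,
       ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct
FROM customers
GROUP BY status
ORDER BY occurrences DESC;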

Data type analysis enables you to discover information about the data types found in the attribute. This type of analysis reveals metrics such as minimum and maximum character length values as well as scale and precision ranges. For example, a database column may be of data type VARCHAR2 even though all the values in the column are numbers; in that case, you may want to ensure that only numbers are loaded. Using data type analysis, you can have Warehouse Builder derive a rule that requires all data stored within an attribute to be of the same data type.

Unique key analysis provides information to assist you in determining whether or not an attribute is a unique key. It does this by looking at the percentages of distinct values that occur in the attribute. You might determine that attributes with a minimum of 70% distinct values should be flagged for unique key analysis. For example, using unique key analysis you could discover that 95% of the values in the EMP_ID column are unique. Further analysis of the other 5% reveals that most of these values are either duplicates or nulls. You could then derive a rule that requires that all entries into the EMP_ID column be unique and not null.
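
The distinctness measurement behind unique key analysis can be pictured with a query like the following sketch, which assumes an EMPLOYEES table with an EMP_ID column.

-- Percentage of distinct EMP_ID values and the number of NULLs;
-- a high distinct percentage flags the column as a unique key candidate.
SELECT ROUND(100 * COUNT(DISTINCT emp_id) / COUNT(*), 1) AS pct_distinct,
       COUNT(*) - COUNT(emp_id)                          AS null_values
FROM employees;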

Functional Dependency

Functional dependency analysis reveals information about column relationships. This enables you to search for things such as one attribute determining another attribute within an object.

Table 5-3 shows the contents of the Employees table in which the attribute Dept. Location is dependent on the attribute Dept. Number. Note that the attribute Dept. Number is not dependent on the attribute Dept. Location.

Table 5-3 Employees Table

ID   Name      Salary   Dept Number   Dept Location
10   Alison    1000     10            SF
20   Rochnik   1000     11            London
30   Meijer    300      12            LA
40   John      500      13            London
50   George    200      13            London
60   Paul      600      13            London
70   Ringo     100      13            London
80   Yoko      600      13            London
90   Jones     1200     10            SF
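
Such a dependency can also be spot-checked with a plain SQL grouping query. The sketch below assumes an EMPLOYEES table with DEPT_NUMBER and DEPT_LOCATION columns; the dependency holds only if the query returns no rows.

-- Department numbers that map to more than one location violate the
-- functional dependency DEPT_NUMBER -> DEPT_LOCATION.
SELECT dept_number,
       COUNT(DISTINCT dept_location) AS location_count
FROM employees
GROUP BY dept_number
HAVING COUNT(DISTINCT dept_location) > 1;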


Referential Analysis

Referential analysis attempts to detect aspects of your data objects that refer to other objects. The purpose behind this type of analysis is to provide insight into how the object you are profiling is related or connected to other objects. Because you are comparing two objects in this type of analysis, one is often referred to as the parent object and the other as the child object. Some of the common things detected include orphans, childless objects, redundant objects, and joins. Orphans are values that are found in the child object, but not found in the parent object. Childless objects are values that are found in the parent object, but not found in the child object. Redundant attributes are values that exist in both the parent and child objects.

Table 5-4 and Table 5-5 show the contents of two tables that are candidates for referential analysis. Table 5-4 is the child object and Table 5-5 is the parent object.

Table 5-4 Employees Table (Child)

ID   Name      Dept. Number   City
10   Alison    17             NY
20   Rochnik   23             SF
30   Meijer    23             SF
40   Jones     15             SD


Table 5-5 Department Table (Parent)

Dept. Number   Location
17             NY
18             London
20             SF
23             SF
55             HK


Referential analysis of these two objects would reveal that Dept. Number 15 from the Employees table is an orphan and Dept. Numbers 18, 20, and 55 from the Department table are childless. It would also reveal a join on the Dept. Number column.

Based on these results, you could derive referential rules that determine the cardinality between the two tables.
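
The same orphan and childless checks can be sketched in SQL; the EMPLOYEES and DEPARTMENT table and column names below mirror the sample tables and are assumptions.

-- Orphans: employee rows whose department number has no parent row.
SELECT e.id, e.name, e.dept_number
FROM employees e
WHERE NOT EXISTS (SELECT 1
                  FROM department d
                  WHERE d.dept_number = e.dept_number);

-- Childless parents: department rows with no matching employee rows.
SELECT d.dept_number, d.location
FROM department d
WHERE NOT EXISTS (SELECT 1
                  FROM employees e
                  WHERE e.dept_number = d.dept_number);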

Data Rule Profiling

In addition to attribute analysis, functional dependency, and referential analysis, Warehouse Builder offers data rule profiling. Data rule profiling enables you to create rules to search for profile parameters within or between objects.

This is powerful because it enables you to validate rules that are assumed to exist and that are defined by business users. By creating a data rule and then profiling with this rule, you can verify whether the data actually complies with the rule, and whether the rule needs amending or the data needs cleansing.

For example, you could create a rule that Income = Salary + Bonus for the Employee table shown in Table 5-6. You can then catch errors such as the one for employee Alison.

Table 5-6 Sample Employee Table

ID   Name      Salary   Bonus   Income
10   Alison    1000     50      1075 X
20   Rochnik   1000     75      1075
30   Meijer    300      35      335
40   Jones     1200     500     1700
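
Expressed as a plain SQL check (a sketch only; the EMPLOYEE table and column names are assumed), the rule flags exactly the kind of row shown for Alison.

-- Rows that violate the data rule Income = Salary + Bonus.
SELECT id, name, salary, bonus, income
FROM employee
WHERE income <> NVL(salary, 0) + NVL(bonus, 0);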


About Six Sigma

Warehouse Builder provides Six Sigma results embedded within the other data profiling results to provide a standardized approach to data quality.

What is Six Sigma?

Six Sigma is a methodology that attempts to standardize the concept of quality in business processes. It achieves this goal by statistically analyzing the performance of business processes. The goal of Six Sigma is to improve the performance of these processes by identifying the defects, understanding them, and eliminating the variables that cause these defects.

Six Sigma metrics give a quantitative measure of the number of defects for each 1,000,000 opportunities. The term "opportunities" can be interpreted as the number of records. The perfect score is 6.0, which is achieved when there are only 3.4 defects for each 1,000,000 opportunities. The score is calculated using the following formulas:

  • Defects Per Million Opportunities (DPMO) = (Total Defects / Total Opportunities) * 1,000,000

  • Defects (%) = (Total Defects / Total Opportunities)* 100%

  • Yield (%) = 100 - %Defects

  • Process Sigma = NORMSINV(1-((Total Defects) / (Total Opportunities))) + 1.5

    where NORMSINV is the inverse of the standard normal cumulative distribution.
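
The first three formulas can be computed with ordinary SQL. The sketch below assumes a CUSTOMERS table in which NULL CUST_EMAIL values count as defects; the Process Sigma step needs an inverse standard normal function (NORMSINV), which is not a built-in SQL function, so compute that step in a client tool or a PL/SQL approximation.

-- DPMO, defect percentage, and yield for one hypothetical quality check.
SELECT total_defects,
       total_opportunities,
       ROUND(1000000 * total_defects / total_opportunities)      AS dpmo,
       ROUND(100 * total_defects / total_opportunities, 2)       AS defects_pct,
       ROUND(100 - 100 * total_defects / total_opportunities, 2) AS yield_pct
FROM (SELECT COUNT(*) - COUNT(cust_email) AS total_defects,
             COUNT(*)                     AS total_opportunities
      FROM customers);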

Six Sigma Metrics for Data Profiling

Six Sigma metrics are also provided for data profiling in Warehouse Builder. When you perform data profiling, the number of defects and anomalies discovered are shown as Six Sigma metrics. For example, if data profiling finds that a table has a row relationship with a second table, the number of records in the first table that do not adhere to this row-relationship can be described using the Six Sigma metric.

Six Sigma metrics are calculated for the following measures in the Data Profile Editor:

  • Aggregation: For each column, the number of null values (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented data type (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented length (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented scale (defects) to the total number of rows in the table (opportunities).

  • Data Types: For each column, the number of values that do not comply with the documented precision (defects) to the total number of rows in the table (opportunities).

  • Patterns: For each column, the number of values that do not comply with the common format (defects) to the total number of rows in the table (opportunities).

  • Domains: For each column, the number of values that do not comply with the documented domain (defects) to the total number of rows in the table (opportunities).

  • Referential: For each relationship, the number of values that do not comply with the documented foreign key (defects) to the total number of rows in the table (opportunities).

  • Referential: For each column, the number of values that are redundant (defects) to the total number of rows in the table (opportunities).

  • Unique Key: For each unique key, the number of values that do not comply with the documented unique key (defects) to the total number of rows in the table (opportunities).

  • Unique Key: For each foreign key, the number of rows that are childless (defects) to the total number of rows in the table (opportunities).

  • Data Rule: For each data rule applied to the data profile, the number of rows that fail the data rule to the number of rows in the table.

About Data Correction and Augmentation

Warehouse Builder enables you to automatically create correction mappings based on the results of data profiling. In addition to these automated corrections, which use the underlying Warehouse Builder data quality architecture, you can create your own data quality mappings.

Warehouse Builder provides functionality that enables you to correct and augment source data. While transforming the source data, you can use the following operators to ensure data quality:

  • The Match-Merge operator

  • The Name and Address operator

About the Match-Merge Operator

Duplicate records can obscure your understanding of who your customers and suppliers really are. Eliminating duplicate records is an important activity in the data correction process. The Match-Merge operator enables you to identify matching records and merge them into a single record. You can define the business rules used by the Match-Merge operator to identify records in a table that refer to the same data. Master data management applications that work across multiple systems can also use this operator to ensure that records are created and matched against a master record.

The Match-Merge operator provides the following:

  • Enables you to use weights to determine matches between records.

  • Uses built-in algorithms, including the Jaro-Winkler and edit distance algorithms, to determine matches.

  • Enables cross referencing of data to track and audit matches.

  • Enables you to create custom rules combining built-in rules for matching and merging.

Example of Matching and Merging Customer Data

Consider how you could utilize the Match-Merge operator to manage a customer mailing list. Use matching to find records that refer to the same person in a table of customer data containing 10,000 rows.

For example, you can define a match rule that screens records that have similar first and last names. Through matching, you may discover that five rows could refer to the same person. You can then merge those records into one new record. For example, you can create a merge rule that retains the values from whichever of the five matched records has the longest address. The newly merged table then contains one record for each customer.

Table 5-7 shows records that refer to the same person prior to using the Match-Merge operator.

Table 5-7 Sample Records

Row   First Name   Last Name   SSN         Address           Unit    Zip
1     Jane         Doe         NULL        123 Main Street   NULL    22222
2     Jane         Doe         111111111   NULL              NULL    22222
3     J.           Doe         NULL        123 Main Street   Apt 4   22222
4     NULL         Smith       111111111   123 Main Street   Apt 4   22222
5     Jane         Smith-Doe   111111111   NULL              NULL    22222


Table 5-8 shows the single record for Jane Doe after using the Match-Merge operator. Notice that the new record retrieves data from different rows in the sample.

Table 5-8 Match-Merge Results

First Name   Last Name   SSN         Address           Unit    Zip
Jane         Doe         111111111   123 Main Street   Apt 4   22222


Restrictions on Using the Match-Merge Operator

  • Because the Match-Merge operator only accepts SQL input, you cannot map the output of the Name and Address operator directly to the Match-Merge operator. You must use a staging table.

  • Because the Match-Merge operator generates only PL/SQL, you cannot map the Merge or XREF output groups of the Match-Merge operator to a SQL-only operator such as a Sort operator or another Match-Merge operator.

Overview of the Matching and Merging Process

Matching determines which records refer to the same logical data. Warehouse Builder provides a variety of match rules to compare records. Match rules range from an exact match to sophisticated algorithms that can discover and correct common data entry errors.

See Also:

"Match Rules" for more information about match rules

Merging consolidates matched records into a single record that is free from duplicate records, omissions, misspellings, and unnecessary variations. You can define merge rules to select the preferred data values for use in the consolidated record.

See Also:

"Merge Rules" for more information about merge rules

Warehouse Builder uses the following in the matching and merging process.

Match Bins 

Match bins are containers for similar records and are used to identify potential matches. The match bin attributes are used to determine how records are grouped into match bins. While performing matching, only records within the same match bin are compared. Match bins limit the number of potential matches in a data set, thus improving performance of the match algorithm.

Match Bin Attributes 

Before performing matching, Warehouse Builder divides the source records into smaller groups of similar records. Match bin attributes are the source attributes used to determine how records are grouped. Records having the same match bin attributes reside in the same match bin. Match bin attributes also limit match bins to manageable sets.

Select match bin attributes carefully to fulfill the following two conflicting needs:

  • Ensure that any records that match reside in the same match bin.

  • Keep the size of the match bin as small as possible.

    A small match bin is desirable for efficiency.
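
To get a feel for how a candidate set of match bin attributes splits the data, a simple grouping query helps. This is only a sketch; the CUSTOMERS table, and the ZIP and truncated LAST_NAME values used as bin attributes, are assumptions.

-- Each row is one prospective match bin; very large bins slow matching
-- because every record in a bin is compared with every other record in it.
SELECT zip,
       UPPER(SUBSTR(last_name, 1, 4)) AS name_prefix,
       COUNT(*)                       AS bin_size
FROM customers
GROUP BY zip, UPPER(SUBSTR(last_name, 1, 4))
ORDER BY bin_size DESC;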

Match Record Sets 

A match record set consists of one or more similar records. After matching records, a match record set is created for each match bin. You can define the match rules that determine if two records are similar.

Merged Records 

A merged record contains data that is merged using multiple records in the match record set. Each match record set generates its own merged record.

Matching and Merging Records

You use the Match-Merge operator to match and merge records. This operator accepts records from an input source, determines the records that are logically the same, and constructs a new merged record from the matched records.

Figure 5-3 describes the matching and merging process. The high-level tasks involved in the process are:

  • Constructing match bins

  • Constructing match record sets

  • Constructing merge records

Figure 5-3 Match-Merge Process


Constructing Match Bins

The match bin is constructed using the match bin attributes. Records with the same match bin attribute values will reside in the same match bin. A small match bin is desirable for efficiency. For more information about match rules, see "Match Rules".

Constructing Match Record Sets

Match rules are applied to all the records in each match bin to generate one or more match record sets. Match rules determine if two records match. A match rule is an n x n algorithm in which every record in the match bin is compared with every other record in the same bin.

An important property of this algorithm is transitive matching. Consider three records A, B, and C. If record A matches record B and record B matches record C, then record A also matches record C.

Constructing Merge Records

A single merge record is constructed from each match record set. You can create specific rules to define merge attributes by using merge rules. For more information about merge rules, see "Merge Rules".

Match Rules

Match rules are used to determine if two records are logically similar. Warehouse Builder enables you to use different types of rules to match source records. You can define match rules using the MatchMerge Wizard or the MatchMerge Editor. Use the editor to edit existing match rules or add new rules.

Match rules can be active or passive. Active rules are generated and executed in the order specified. Passive rules are generated but not executed.

Table 5-9 describes the types of match rules.

Table 5-9 Types of Match Rules

Match Rule Description

All Match

Matches all rows within a match bin.

None Match

Turns off matching. No rows match within the match bin.

Conditional

Matches rows based on the algorithm you set. For more information about Conditional match rules and how to create one, see "Conditional Match Rules".

Weight

Matches rows based on scores that you assign to the attributes. For more information about Weight match rules and how to create one, see "Weight Match Rules".

Person

Matches records based on the names of people. For more information about Person match rules and how to create one, see "Person Match Rules".

Firm

Matches records based on the name of the organization or firm. For more information about Firm match rules and how to create one, see "Firm Match Rules".

Address

Matches records based on postal addresses. For more information about Address match rules and how to create one, see "Address Match Rules".

Custom

Matches records based on a custom comparison algorithm that you define. For more information about Custom match rules and how to create one, see "Custom Match Rules".


Conditional Match Rules

Conditional match rules specify the conditions under which records match.

A conditional match rule allows you to combine multiple attribute comparisons into one composite rule. When more than one attribute is involved in a rule, two records are considered to be a match only if all comparisons are true.

You can specify how attributes are compared using comparison algorithms.

Comparison Algorithms

Each attribute in a conditional match rule is assigned a comparison algorithm, which specifies how the attribute values are compared. Multiple attributes may be compared in one rule with a separate comparison algorithm selected for each.

Table 5-10 describes the types of comparisons.

Table 5-10 Types of Comparison Algorithms for Conditional Match Rules

Algorithm Description

Exact

Attributes match if their values are exactly the same. For example, "Dog" and "dog!" would not match, because the second string is not capitalized and contains an extra character.

For data types other than STRING, this is the only type of comparison allowed.

Standardized Exact

Standardizes the values of the attributes before comparing for an exact match. With standardization, the comparison ignores case, spaces, and non-alphanumeric characters. Using this algorithm, "Dog" and "dog!" would match.

Soundex

Converts the data to a Soundex representation and then compares the text strings. If the Soundex representations match, then the two attribute values are considered matched.

Edit Distance

A "similarity score" in the range 0-100 is entered. If the similarity of the two attributes is equal or greater to the specified value, the attribute values are considered matched.

The similarity algorithm computes the edit distance between two strings. A value of 100 indicates that the two values are identical; a value of zero indicates no similarity whatsoever.

For example, if the string "tootle" is compared with the string "tootles", then the edit distance is 1. The length of the string "tootles" is 7. The similarity value is therefore (6/7)*100 or 85.

Standardized Edit Distance

Standardizes the values of the attribute before using the Similarity algorithm to determine a match. With standardization, the comparison ignores case, spaces, and non-alphanumeric characters.

Partial Name

The values of a string attribute are considered a match if the value of one entire attribute is contained within the other, starting with the first word. For example, "Midtown Power" would match "Midtown Power and Light", but would not match "Northern Midtown Power". The comparison ignores case and non-alphanumeric characters.

Abbreviation

The values of a string attribute are considered a match if one string contains words that are abbreviations of corresponding words in the other. Before attempting to find an abbreviation, this algorithm performs a Std Exact comparison on the entire string. The comparison ignores case and non-alphanumeric characters. For each word, the match rule looks for abbreviations as follows: if the larger of the words being compared contains all of the letters from the shorter word, and the letters appear in the same order as in the shorter word, then the words are considered a match.

For example, "Intl. Business Products" would match "International Bus Prd".

Acronym

The values of a string attribute are considered a match if one string is an acronym for the other. Before attempting to identify an acronym, this algorithm performs a Std Exact comparison on the entire string. If no match is found, then each word of one string is compared to the corresponding word in the other string. If the entire word does not match, each character of the word in one string is compared to the first character of each remaining word in the other string. If the characters are the same, the names are considered a match.

For example, "Chase Manhattan Bank NA" matches "CMB North America". The comparison ignores case and non-alphanumeric characters.

Jaro-Winkler

Matches strings based on their similarity value using an improved comparison system over the Edit Distance algorithm. It accounts for the length of the strings and penalizes more for errors at the beginning. It also recognizes common typographical errors.

The strings match when their similarity value is equal to or greater than the Similarity Score that you specify. A similarity value of 100 indicates that the two strings are identical. A value of zero indicates no similarity whatsoever. Note that the value actually calculated by the algorithm (0.0 to 1.0) is multiplied by 100 to correspond to the Edit Distance scores.

Standardized Jaro-Winkler

Eliminates case, spaces, and non-alphanumeric characters before using the Jaro-Winkler algorithm to determine a match.

Double Metaphone

Matches phonetically similar strings using an improved coding system over the Soundex algorithm. It generates two codes for strings that could be pronounced in multiple ways. If the primary codes match for the two strings, or if the secondary codes match, then the strings match. The Double Metaphone algorithm accounts for alternate pronunciations in Italian, Spanish, French, and Germanic and Slavic languages. Unlike the Soundex algorithm, Double Metaphone encodes the first letter, so that 'Kathy' and 'Cathy' evaluate to the same phonetic code.
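
Several of these comparisons can be tried out directly in Oracle SQL using the built-in SOUNDEX function and the UTL_MATCH package; this is for illustration only, and the operator's own scoring may differ in detail.

-- SOUNDEX keeps the first letter, so 'Kathy' and 'Cathy' get different codes;
-- UTL_MATCH returns 0-100 similarity scores for edit distance and Jaro-Winkler.
SELECT SOUNDEX('Kathy')                                        AS soundex_kathy,
       SOUNDEX('Cathy')                                        AS soundex_cathy,
       UTL_MATCH.EDIT_DISTANCE_SIMILARITY('tootle', 'tootles') AS edit_similarity,
       UTL_MATCH.JARO_WINKLER_SIMILARITY('tootle', 'tootles')  AS jw_similarity
FROM dual;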


Creating Conditional Match Rules

To define a conditional match rule, complete the following steps:

  1. On the top portion of the Match Rules tab or the Match Rules page, select Conditional in the Rule Type column.

    A Details section is displayed.

  2. Click Add to add a new row.

  3. Select an attribute in the Attribute column.

  4. In the Algorithm column, select a comparison algorithm. See Table 5-10 for descriptions.

  5. Specify a similarity score for the Edit Distance, Standardized Edit Distance, Jaro-Winkler, or Standardized Jaro-Winkler algorithms.

  6. Select a method for handling blanks.

Weight Match Rules

A weighted match rule allows you to assign an integer weight to each attribute included in the rule. You must also specify a threshold. For each attribute, the Match-Merge operator multiplies the weight by the similarity score, and sums the scores. If the sum equals or exceeds the threshold, the two records being compared are considered a match.

Weight match rules are most useful when you need to compare a large number of attributes without having a single mismatched attribute cause a non-match, as can happen with conditional rules.

Weight rules implicitly invoke the similarity algorithm to compare two attribute values. This algorithm returns an integer percentage value in the range 0-100, which represents the degree to which two values are alike. A value of 100 indicates that the two values are identical; a value of zero indicates no similarity whatsoever.

Example of Weight Match Rules

Table 5-11 displays the attribute values contained in two separate records that are read in the following order.

Table 5-11 Example of Weight Match Rule

Record Number   First Name   Middle Name   Last Name
Record 1        Robert       Steve         Paul
Record 2        (blank)      Steven        Paul


You define a match rule that uses the Edit Distance similarity algorithm. The Required Score to Match is 120. The attributes for first name and middle name are defined with a Maximum Score of 50 and Score When Blank of 20. The attribute for last name has a Maximum Score of 80 and a Score When Blank of 0.

Consider an example of the comparison of Record 1 and Record 2 using the weight match rule.

  • Since first name is blank for Record 2, the Blank Score = 20.

  • The similarity of the middle names in the two records is 0.83. Since the weight assigned to this attribute is 50, the similarity score for this attribute is approximately 42 (0.83 X 50).

  • Since the last name attributes are the same, the similarity score for the last name is 1. The weighted score is 80 (1 X 80).

The total score for this comparison is 142 (20 + 42 + 80). Since this is more than the value defined for Required Score to Match, the records are considered a match.
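
For reference, the similarity figure used above can be reproduced with Oracle's UTL_MATCH package. This is only a sketch: the operator applies the weights internally, and its rounding may differ slightly.

-- 20 for the blank first name, plus the edit-distance similarity of the middle
-- names (a score in the low 80s, scaled by the weight of 50), plus 80 for the
-- exact last name match.
SELECT 20
       + ROUND(UTL_MATCH.EDIT_DISTANCE_SIMILARITY('Steve', 'Steven') / 100 * 50)
       + 80 AS total_score
FROM dual;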

Creating Weight Match Rules

To use the Weight match rule, complete the following steps:

  1. On the Match Rules tab or the Match Rules page, select Weight as the Rule Type.

    The Details tab is displayed at the bottom of the page.

  2. Select Add at the bottom of the page to add a new row.

  3. For each row, select an attribute to add to the rule using the Attribute column.

  4. In Maximum Score, assign a weight to each attribute. Warehouse Builder compares each attribute using a similarity algorithm that returns a score between 0 and 100 to represent the similarity between the rows.

  5. In Score When Blank, assign a value to be used when the attribute is blank in one of the records.

  6. In Required score to match, assign an overall score for the match.

    For two rows to be considered a match, the total score must equal or exceed the value specified in the Required score to match parameter.

Person Match Rules

Built-in Person rules provide an easy and convenient way for matching names of individuals. Person match rules are most effective when the data has first been corrected using the Name and Address operator.

When you use Person match rules, you must specify which data within the record represents the name of the person. The data can come from multiple columns. Each column must be assigned an input role that specifies what the data represents.

To define a Person match rule, you must define the Person Attributes that are part of the rule. For example, you can create a Person match rule that uses the Person Attributes first name and last name for comparison. For each Person Attribute, you must define the Person Role that the attribute uses. Next you define the rule options used for the comparison. For example, while comparing last names, you can specify that hyphenated last names should be considered a match.

Person Roles

Table 5-12 describes the roles for different parts of a name that are used for matching. On the Match Rules tab or the Match Rules page, use the Roles column on the Person Attributes tab to assign these roles.

Table 5-12 Name Roles for Person Match Rules

Role Description

Prename

Prenames are compared only if the following are true:

  • The Last_name and, if present, the middle name (Middle_name_std, Middle_name_2_std, and Middle_name_3_std roles) in both records match.

  • The "Mrs. Match" option is selected.

  • Either record has a missing First_name_std.

First Name Standardized

Compares the first names. By default, the first names must match exactly, but you can specify other comparison options as well.

First names match if both are blank. A blank first name will not match a non-blank first name unless the Prename role has been assigned and the "Mrs. Match" option is set. If a Last_name role has not been assigned, a role of First_name_std must be assigned.

Middle Name Standardized, Middle Name 2 Standardized, Middle Name 3 Standardized

Compares the middle names. By default, the middle names must match exactly, but other comparison options can be specified. If more than one middle name role is assigned, attributes assigned to the different roles are cross-compared.

For example, values for Middle_name_std will be compared not only against other Middle_name_std values, but also against Middle_name_2_std, if that role is also assigned. Middle names match if either or both are blank. If any of the middle name roles are assigned, the First_name_std role must also be assigned.

Last Name

Compares the last names. By default, the last names must match exactly, but you can specify other comparison options. The last names match if both are blank, but not if only one is blank.

Maturity Post Name

Compares the post name, such as "Jr.", "III," and so on. The post names match if the values are exactly the same, or if either value is blank.


Person Details

Table 5-13 describes the options that determine a match for person match rules. Use the Details tab of the Match Rules tab or the Match Rules page to define person details.

Table 5-13 Options for Person Match Rule

Option Description

Detect switched name order

Detects switched name orders such as matching 'Elmer Fudd' to 'Fudd Elmer'. You can select this option if you selected First Name and Last Name roles for attributes on the Person Attributes tab.

Match on initials

Matches initials to names such as 'R.' and 'Robert'. You can select this option for first name and middle name roles.

Match on substrings

Matches substrings to names such as 'Rob' to 'Robert'. You can select this option for first name and middle name roles.

Similarity Score

Uses a similarity score to determine a match, as calculated by the Edit Distance or Jaro-Winkler algorithms. A value of 100 requires an exact match, and a value of 0 requires no similarity whatsoever. Records are considered a match if the similarity is greater than or equal to the specified score. For example, "Susan" will match "Susen" if the specified score is less than or equal to 80.

Match on Phonetic Codes

Determines a match using either the Soundex or the Double Metaphone algorithms.

Detect compound name

Matches compound names to names such as 'De Anne' to 'Deanne'. You can select this option for the first name role.

"Mrs" Match

Matches prenames to first and last names such as 'Mrs. Washington' to 'George Washington'. You can select this option for the prename role.

Match hyphenated names

Matches hyphenated names to unhyphenated names such as "Reese-Jones" to "Reese". You can select this option for the last name role.

Detect missing hyphen

The operator detects missing hyphens, such as matching "Hillary Rodham Clinton" to "Hillary Rodham-Clinton". You can select this option for the last name role.


Creating Person Match Rules

To define a Person match rule, complete the following steps:

  1. On the Match Rules tab, select Person as the Rule Type.

    The Person Attributes tab and Details tab are displayed at the bottom of the page.

  2. In the left panel of the Person Attributes tab, select the attributes that describe a full name and use the right arrow to move them to Name Roles Attributes.

  3. For each attribute, select the role it plays in a name.

    You must define either the Last Name or First Name Standardized for the match rule to be effective. See Table 5-12 for the types of roles you can assign.

  4. Select the Details tab and select the applicable options as listed in Table 5-13.

Firm Match Rules

Built-in Firm match rules provide an easy and convenient way for matching business names. Firm match rules are most effective when the data has first been corrected using the Name and Address operator. As with Person match rules, you must specify which data within the record represents the name of the firm. The data can come from multiple columns, and each column specified must be assigned an input role that indicates what the data represents.

Note that you need not assign a firm role to every attribute, and not every role needs to be assigned to an attribute. The attributes assigned to firm roles are used in the match rule to compare the records. The attributes are compared based on the role they have been assigned and the other comparison options you have set. For a complete list of firm roles and how each role is treated in a firm match rule, see "Firm Roles".

Firm Roles

Firm roles define the parts of a firm name that are used for matching. The roles you can assign are Firm1 and Firm2. If you select one attribute for the firm name, assign it the Firm1 role. If you select two attributes, designate one of them as Firm1 and the other as Firm2.

  • Firm1: If this role is assigned, the business names represented by Firm1 are compared. Firm1 names are not compared against Firm2 names unless the Cross-match firm 1 and firm 2 box is checked. By default, the firm names must match exactly, but other comparison options can also be specified. Firm1 names do not match if either or both names are blank.

  • Firm2: If this role is assigned, the values of the attribute assigned to Firm2 are compared. Firm2 names are not compared against Firm1 names unless the Cross-match firm 1 and firm 2 box is checked. By default, the firm names must match exactly, but other comparison options can also be specified. Firm2 names do not match if either or both names are blank. If a Firm1 role is not assigned, a Firm2 role must be assigned.

Firm Details

Table 5-14 describes the rule options you can specify for each component of the firm name.

Table 5-14 Options for Firm Rules

Option Description

Strip noise words

Removes the following words from Firm1 and Firm2 before matching: THE, AND, CORP, CORPORATION, CO, COMPANY, INC, INCORPORATED, LTD, TO, OF, and BY.

Cross-match firm 1 and firm 2

When comparing two records for matching, in addition to matching firm1 to firm1 and firm2 to firm2 of the respective records, match firm1 against firm2 for the records.

Match on partial firm name

Uses the Partial Name algorithm to determine a match. For example, match "Midtown Power" to "Midtown Power and Light".

Match on abbreviations

Uses the Abbreviation algorithm to determine a match. For example, match "International Business Machines" to "IBM".

Match on acronyms

Uses the Acronym algorithm to determine a match. For example, match "CMB, North America" to "Chase Manhattan Bank, NA".

Similarity score

Uses a similarity score to determine a match, as calculated by the Edit Distance or Jaro-Winkler algorithms. Enter a value between 0 and 100 as the minimum similarity value required for a match. A value of 100 requires an exact match, and a value of 0 requires no similarity whatsoever.

Two records are considered a match if the similarity is greater than or equal to the value of the similarity score.


Creating Firm Match Rules

To define a Firm match rule, complete the following steps:

  1. On the Match Rules tab or the Match Rules page, select Firm as the Rule Type.

    The Firm Attributes tab and Details tab are displayed at the bottom of the page.

  2. In the left panel of the Firm Attributes tab, select one or two attributes that represent the firm name and click the right shuttle button.

    The attributes are moved to the Firm Roles box.

  3. For each attribute, click Roles. From the list, select Firm 1 for the first attribute, and Firm 2 for the second attribute, if it exists.

  4. On the Details tab, select the applicable options. For more details, see "Firm Details".

Address Match Rules

Address Match rules provide a method of matching records based on postal addresses. Address match rules are most effective when the data has first been corrected using a Name and Address operator.

Address Rules work differently depending on whether the address being processed has been corrected or not. Generally, corrected addresses have already been identified in a postal matching database, and are therefore not only syntactically correct, but are legal and existing addresses according to the Postal Service of the country containing the address. Corrected addresses can be processed more quickly, since the match rule can make certain assumptions about their format.

Uncorrected addresses may be syntactically correct, but they have not been found in a postal matching database. Addresses may not have been found because they are not in the database, or because there is no postal matching database installed for the country containing the address. Address match rules determine whether an address has been corrected based on the Is_found role. If the Is_found role is not assigned, then the match rule performs the comparisons for both corrected and uncorrected addresses.

To create an Address match rule, assign address roles to the various attributes. The attributes assigned to address roles are used in the match rule to compare the records. Attributes are compared depending on which role they have been assigned, and what other comparison options have been set.

Address Roles

Table 5-15 describes the address roles you can select for each part of an address.

Table 5-15 Address Roles

Role Description

Primary Address

Compares the primary addresses. Primary addresses can be, for example, street addresses ("100 Main Street") or PO boxes ("PO Box 100"). By default, the primary addresses must match exactly, but a similarity option can also be specified.

The Primary_address role must be assigned.

Unit Number

Unit numbers (such as suite numbers, floor numbers, or apartment numbers) are compared if the primary addresses match. The unit numbers match if both are blank, but not if one is blank, unless the Match on blank secondary address option is set. If the Allow differing secondary address is set, the unit numbers are ignored.

PO Box

Compares the Post Office Boxes. The PO Box is just the number portion of the PO Box ("100"), and is a subset of the primary address, when the primary address represents a PO Box ("PO Box 100"). If the primary address represents a street address, the PO Box will be blank.

Dual Primary Address

The Dual_primary_address is compared against the other record's Dual_primary_address and Primary_address to determine a match.

Dual Unit Number

Compares the Dual_unit_number address with the Dual_unit_number and Unit_number of the other record. The unit numbers will match if one or both are blank. To assign the Dual_unit_number role, the Dual_primary_address role must also be assigned.

Dual PO Box

Dual_PO_Box address of a record is compared with the Dual_PO_Box and the PO_Box of the other record. To assign the Dual_PO_Box role, the Dual_primary_address role must also be assigned.

City

Compares the cities for uncorrected addresses. For corrected addresses, the cities are only compared if the postal codes do not match. If both City and State roles match, then the address line roles, such as Primary_address, can be compared.

By default, the cities must match exactly. But you may specify a last line similarity option. The cities match if both are blank, but not if only one is blank. If the City role is assigned, then the State role must also be assigned.

State

Assign this role only when also assigning the City role.

The states are compared for uncorrected addresses. For corrected addresses, the states are only compared if the postal codes do not match. If both State and City roles match, then the address line roles, such as Primary_address, can be compared. By default, the states must match exactly, but a last line similarity option may be specified. The states match if both are blank, but not if only one is blank. If the State role is assigned, then the City role must also be assigned.

Postal Code

The postal codes are compared for corrected addresses. For uncorrected addresses, the Postal_code role is not used. To match, the postal codes must be exactly the same. The postal codes are not considered a match if one or both are blank. If the postal codes match, then the address line roles, such as Primary_address, can be compared. If the postal codes do not match, the City and State roles are compared to determine whether the address line roles should be compared.

Is Found

The Is_found_flag attributes are not compared, but instead are used to determine whether an address has been found in a postal matching database, and therefore represents a legal address according to the postal service of the country containing the address. This determination is important because the type of comparison done during matching depends on whether the address has been found in the postal database or not.


Address Details

Table 5-16 describes the options for determining a match for an address rule.

Table 5-16 Options for Address Roles

Option Description

Allow differing secondary address

Allow addresses to match even if the unit numbers are not null and are different.

Match on blank secondary address

Allow addresses to match even if exactly one unit number is null.

Match on either street or post office box

Matches records if either the street address or the post office box matches.

Address line similarity

Match if the address line similarity >= the score. All spaces and non-alphanumeric characters are removed before the similarity is calculated.

Last line similarity

Match if the last line similarity >= the score. The last line consists of the city and state. All spaces and non-alphanumeric characters are removed before the similarity is calculated.


Creating Address Match Rules

To define an Address match rule, complete the following steps:

  1. On the Match Rules tab or the Match Rules page, select Address as the Rule Type.

    The Address Attributes tab and Details tab are displayed at the bottom of the page.

  2. In the left panel of the Address Attributes tab, select the attribute that represents the primary address. Use the right shuttle key to move it to the Address Roles Attributes column.

  3. Click Role Required and designate that attribute as the Primary Address.

    You must designate one attribute as the primary address. If you do not assign the Primary Address role, the match rule is invalid.

  4. Add other attributes and designate their roles as necessary. See Table 5-15 for the types of roles you can assign.

  5. Select the Details tab and select the applicable options as listed in Table 5-16.

Custom Match Rules

Custom match rules enable you to write your own comparison algorithms to match records. You can use any input attributes or match functions within this comparison. You can use an active custom rule to control the execution of passive rules.

Consider the following three passive built-in rules:

  • NAME_MATCH: built-in name rule.

  • ADDRESS_MATCH: built-in address rule.

  • TN_MATCH: built-in conditional rule.

You can create a custom rule to specify that two records can be considered a match if any two of these rules are satisfied. Example 5-1 describes the PL/SQL code used to create the custom match rule that implements this example.

Example 5-1 Creating a Custom Rule Using Existing Passive Rules

BEGIN
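   -- THIS_ and THAT_ are the two records being compared; they match when at
   -- least two of the three passive rules return TRUE.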
   RETURN(
       (NAME_MATCH(THIS_,THAT_) AND ADDRESS_MATCH(THIS_,THAT_))
     OR
       (NAME_MATCH(THIS_,THAT_) AND TN_MATCH(THIS_,THAT_))
     OR
       (ADDRESS_MATCH(THIS_,THAT_) AND TN_MATCH(THIS_,THAT_))
         );
END;

Creating Custom Match Rules

To define a Custom match rule, complete the following steps:

  1. On the Match Rules tab or the Match Rules page, select Custom as the Rule Type.

    A Details field is displayed at the bottom of the page with the skeleton of a PL/SQL program.

  2. Click Edit to open the Custom Match Rules Editor.

    For more information about using the editor, select Help Topic from the Help menu.

  3. To enter PL/SQL code, use any combination of the following:

    • To read in a file, select Open File from the Code menu.

    • To enter text, first position the cursor using the mouse or arrow keys, then begin typing. You can also use the commands on the Edit and Search menus.

    • To reference any function, parameter, or transformation in the navigation tree, first position the cursor, then double-click or drag-and-drop the object onto the Implementation field.

  4. To validate your code, select Validate from the Test menu.

    The validation results appear on the Messages tab.

  5. To save your code, select Save from the Code menu.

  6. To close the Custom Match Rules Editor, select Close from the Code menu.

Merge Rules

Matching produces a set of records that are logically the same. Merging is the process of creating one record from the set of matched records. A Merge rule is applied to attributes in the matched record set to obtain a single value for the attribute in the merged record.

You can define one Merge rule for all the attributes in the Merge record or define a rule for each attribute.

Table 5-17 describes the types of merge rules.

Table 5-17 Merge Rule Types

Merge Rule Description

Any

Uses the first non-blank value.

Match ID

Merges records that have already been output from another Match-Merge operator.

Rank

Ranks the records from the match set. The associated attribute from the highest ranked record will be used to populate the merge attribute value.

Sequence

Specify a database sequence for this rule. The next value of the sequence will be used for the value.

Min Max

Specify an attribute and a relation to choose the record to be used as a source for the merge attribute.

Copy

Choose a value from a different previously merged value.

Custom

Create a PL/SQL package function to select the merge value. The operator provides the signature of this function. You are responsible for the implementation of the rule from "BEGIN" to "END;". The matched records and the merge record are parameters for this function.

Any Record

Identical to the Any rule, except that an Any Record rule applies to multiple attributes.

Rank Record

Identical to the Rank rule, except that a Rank Record rule applies to multiple attributes.

Min Max Record

Identical to the Min Max rule, except that a Min Max Record rule applies to multiple attributes.

Custom Record

Identical to the Custom rule, except that a Custom Record rule applies to multiple attributes.


Using a Match-Merge Operator

The Match-Merge operator has one input group and two output groups, Merge and Xref. The source data is mapped to the input group. The Merge group contains records that have been merged after the matching process is complete. The Xref group provides a record of the merge process. Every record in the input group will have a corresponding record in the Xref group. This record may contain the original attribute values and the merged attributes.

The Match-Merge operator uses an ordered record stream as input. From this stream, it constructs the match bins. From each match bin, matched sets are constructed. From each matched set, a merged record is created. The initial query will contain an ORDER BY clause consisting of the match bin attributes.
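
Conceptually, the source query that feeds the operator looks like the following sketch, in which the CUSTOMERS table and the match bin attributes (ZIP and a truncated last name) are assumptions.

-- Rows arrive ordered by the match bin attributes, so each match bin is a
-- contiguous block in the input stream.
SELECT cust_id, first_name, last_name, address, zip
FROM customers
ORDER BY zip, UPPER(SUBSTR(last_name, 1, 4));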

To match and merge source data using the Match-Merge operator:

  1. Drag and drop the operators representing the source data and the operator representing the merged data onto the Mapping Editor canvas.

    For example, if your source data is stored in a table, and the merged data will be stored in another table, drag and drop two Table operators that are bound to the tables onto the canvas.

  2. Drag and drop a Match-Merge operator onto the mapping editor canvas.

    The MatchMerge wizard is displayed.

  3. On the Name and Address page, the Name field contains a default name for the operator. You can change this name or accept the default name.

    You can enter an optional description for the operator.

  4. On the Groups page, you can rename groups or provide descriptions for them.

    This page contains the following three groups:

    • INGRP1: Contains input attributes.

    • MERGE: Contains the merged records (usually this means fewer records than INGRP1).

    • XREF: Contains the link between the original and merged data sets. This is the tracking mechanism used when a merge is performed.

  5. On the Input Connections page, select the attributes that will be used as input to the Match-Merge operator.

    The Available Attributes section of this page displays nodes for each operator on the canvas. Expand a node to display the attributes contained in the operator, select the attributes, and use the shuttle arrows to move selected attributes to the Mapped Attributes section.

    Note: The Match-Merge operator requires an ordered input data set. If you have source data from more than one operator, use a Set Operation operator to combine the data and obtain an ordered data set.

  6. On the Input Attributes page, review the attribute data types and lengths.

    In general, if you go through the wizard, you need not change any of these values. Warehouse Builder populates them based on the output attributes.

  7. On the Merge Output page, select the attributes to be merged from the input attributes.

    These attributes appear in the Merge output group (the cleansed group). The attributes in this group retain the name and properties of the input attributes.

  8. On the Cross Reference Output page, select attributes for the XREF output group.

    The Source Attributes section contains all the input attributes and the Merge attributes you selected on the Merge Output page. The attributes from the Merge group are prefixed with MM. The other attributes define the unmodified input attribute values. Ensure that you select at least one attribute from the Merge group that will provide a link between the input and Merge groups.

  9. On the Match Bins page, specify the match bin attributes. These attributes are used to group source data into match bins.

    After the first deployment, you can choose whether to match and merge all records or only new records. To match and merge only the new records, select Match New Records Only.

    You must designate a condition that identifies new records. The Match-Merge operator treats new records in the following way:

    • No matching is performed for any records in a match bin unless the match bin contains a new record.

    • Old records will not be compared with each other.

    • A matched record set will not be presented for merge processing unless the matched record set contains a new record.

    • An old record will not be presented to the Xref output unless the record is matched to a new record.

    For more information about match bin attributes and match bins, see "Overview of the Matching and Merging Process".

  10. On the Define Match Rules page, define the match rules that will be used to match the source data.

    Match rules can be active or passive. A passive match rule is generated but not automatically invoked. You must define at least one active match rule.

    For more information about the match rules, the types of match rules you can define, and the steps used to define them, see "Match Rules".

  11. On the Merge Rules page, define the rules that will be used to merge the sets of matched records created from the source data.

    You can define Merge rules for each attribute in a record or for the entire record. Warehouse Builder provides different types of Merge rules.

    For more information about the type of Merge rules and the steps to create Merge rules, see "Merge Rules".

  12. On the Summary page, review your selections. Click Back to modify any selection you made. Click Finish to complete the creation of the Match-Merge operator.

  13. Map the Merge group of the Match-Merge operator to the input group of the operator that stores the merged data.

About the Name and Address Operator

After matching and merging records, you can further validate information about your customers and suppliers, and discover additional errors and inconsistencies. Warehouse Builder parses the names and addresses, and uses methods specific to this type of data, such as matching common nicknames and abbreviations. You can compare the input data to the data libraries supplied by third-party name and address cleansing software vendors, which can augment your records with information such as postal routes and geographic coordinates.

Successful delivery and lower postage rates are not the only reasons to cleanse name and address data. You will get better results from data analysis when the results are not skewed by duplicate records and incomplete information.

Warehouse Builder enables you to perform name and address cleansing on data using the Name and Address operator. The Name and Address operator identifies and corrects errors and inconsistencies in name and address source data by comparing input data to the data libraries supplied by third-party name and address cleansing software vendors. You can purchase the data libraries directly from these vendors.

Note:

The Name and Address operator requires separate licensing and installation of third-party name and address cleansing software. Refer to the Oracle Warehouse Builder Installation and Administration Guide for more information.

The errors and inconsistencies corrected by the Name and Address operator include variations in address formats, use of abbreviations, misspellings, outdated information, inconsistent data, and transposed names. The operator fixes these errors and inconsistencies by:

  • Parsing the name and address input data into individual elements.

  • Standardizing name and address data, using standardized versions of nicknames and business names and standard abbreviations of address components, as approved by the postal service of the appropriate country. Standardized versions of names and addresses facilitate matching and householding, and ultimately help you obtain a single view of your customer.

  • Correcting address information such as street names and city names. Filtering out incorrect or undeliverable addresses can lead to savings on marketing campaigns.

  • Augmenting names and addresses with additional data such as gender, postal code, country code, apartment identification, or business and consumer identification. You can use this and other augmented address information, such as census geocoding, for marketing campaigns that are based on geographical location.

    Augmenting addresses with geographic information facilitates geography-specific marketing initiatives, such as marketing only to customers in large metropolitan areas (for example, within an n-mile radius from large cities); marketing only to customers served by a company's stores (within an x-mile radius from these stores). Oracle Spatial, an option with Oracle Database, and Oracle Locator, packaged with Oracle Database, are two products that you can use with this feature.

The Name and Address operator also enables you to generate postal reports for countries that support address correction and postal matching. Postal reports often qualify you for mailing discounts. For more information, see "About Postal Reporting".

Example: Correcting Address Information

This example follows a record through a mapping using the Name and Address operator. This mapping also uses a Splitter operator to demonstrate a highly recommended data quality error handling technique.

Example Input

In this example, the source data contains a Customer table with the row of data shown in Table 5-18.

Table 5-18 Sample Input to Name and Address Operator

Address Column      Address Component
Name                Joe Smith
Street Address      8500 Normandale Lake Suite 710
City                Bloomington
ZIP Code            55437


The data contains a nickname, a last name, and part of a mailing address, but it lacks the customer's full name, complete street address, and the state in which he lives. The data also lacks geographic information such as latitude and longitude, which can be used to calculate distances for truckload shipping.

Example Steps

This example uses a mapping with a Name and Address operator to cleanse name and address records, followed by a Splitter operator to load the records into separate targets depending on whether they were successfully parsed. This section explains the general steps required to design such a mapping.

To make the listed changes to the sample record:

  1. In the Mapping Editor, begin by adding the following operators to the canvas:

    • A CUSTOMERS table from which you extract the records. This is the data source. It contains the data in Table 5-18.

    • A Name and Address operator. This action starts the Name and Address Wizard. Follow the steps of the wizard.

    • A Splitter operator. For information on using this operator, see "Splitter Operator" in the Warehouse Builder Online Help.

    • Three target operators into which you load the successfully parsed records, the records with parsing errors, and the records whose addresses are parsed but not found in the postal matching software.

  2. Map the attributes from the CUSTOMERS table to the Name and Address operator ingroup. Map the attributes from the Name and Address operator outgroup to the Splitter operator ingroup.

    You are not required to use the Splitter operator, but it provides an important function in separating good records from problematic records.

  3. Define the split conditions for each of the outgroups in the Splitter operator and map the outgroups to the targets.

Figure 5-4 shows a mapping designed for this example. The data is mapped from the source table to the Name and Address operator, and then to the Splitter operator. The Splitter operator separates the successfully parsed records from those that have errors. The output from OUTGRP1 is mapped to the CUSTOMERS_GOOD target. The split condition for OUTGRP2 is set such that records whose Is Parsed flag is False are loaded to the NOT_PARSED target. That is, the split condition for OUTGRP2 is set as INGRP1.ISPARSED='F'. The records in the REMAINING_RECORDS group are successfully parsed, but their addresses are not found by the postal matching software. These records are loaded to the PARSED_NOT_FOUND target. (A sketch of one possible set of split conditions follows the figure.)

Figure 5-4 Name and Address Operator Used with a Splitter Operator in a Mapping

Diagram of Name and Address operator with Splitter operator
Description of "Figure 5-4 Name and Address Operator Used with a Splitter Operator in a Mapping"

Example Output

If you run the mapping designed in this example, the Name and Address operator standardizes, corrects, and completes the address data from the source table. In this example, the target table contains the address data as shown in Table 5-19. Compare it with the input record from Table 5-18.

Table 5-19 Sample Output from Name and Address Operator

Address Column            Address Component
First Name Standardized   JOSEPH
Last Name                 SMITH
Primary Address           8500 NORMANDALE LAKE BLVD
Secondary Address         STE 710
City                      BLOOMINGTON
State                     MN
Postal Code               55437-3813
Latitude                  44.849194
Longitude                 -093.356352
Is Parsed                 True or False. Indicates whether a record can be separated into individual elements.
Is Good Name              True or False. Indicates whether the name was found in a postal database.
Is Good Address           True or False. Indicates whether the address was found in a postal database or was parsed successfully.
Is Found                  True or False. Indicates whether the address was found in a postal database.
Name Warning              True or False. Indicates whether problems occurred in parsing the name.
Street Warning            True or False. Indicates whether problems occurred in parsing the address.
City Warning              True or False. Indicates whether problems occurred in parsing the city name.


In this example, the following changes were made to the input data:

  • Joe Smith was separated into the First_Name_Standardized and Last_Name columns.

  • Joe was standardized into JOSEPH and Suite was standardized into STE.

  • Normandale Lake was corrected to NORMANDALE LAKE BLVD.

  • The first portion of the postal code, 55437, was augmented with the ZIP+4 code to read 55437-3813.

  • Latitude and longitude locations were added.

  • The records were tested in various ways, and the good records were directed to a different target than the records that have problems.

Handling Errors in Name and Address Data

Name and Address parsing, like any other type of parsing, depends on identification of keywords and patterns containing those keywords. Free-form name and address data is difficult to parse because the keyword set is large and never 100% complete. Keyword sets are built by analyzing millions of records, but each new data set is likely to contain some undefined keywords.

Because most free-form name and address records contain common patterns of numbers, single letters, and alphanumeric strings, parsing can often be performed based on just the alphanumeric patterns. However, alphanumeric patterns may be ambiguous or a particular pattern may not be found. Name and Address parsing errors set parsing status codes that you can use to control data mapping.

Since the criteria for quality vary among applications, numerous flags are available to help you determine the quality of a particular record. For countries with postal matching support, use the Is Good Group flag, because it verifies that an address is a valid entry in a postal database. Also use the Is Good Group flag for U.S. Coding Accuracy Support System (CASS) and Canadian Software Evaluation and Recognition Program (SERP) certified mailings.

Unless you specify postal reporting, an address does not have to be found in a postal database to be acceptable. For example, street intersection addresses or building names may not be in a postal database, but they may still be deliverable. If the Is Good Group flag indicates failure, additional error flags can help determine the parsing status.

The Is Parsed flag indicates success or failure of the parsing process. Even if Is Parsed indicates parsing success, you may still want to examine the parser warning flags, which indicate unusual data; consider reviewing such records manually.

If Is Parsed indicates parsing failure, you must preserve the original data to prevent data loss.

Use the Splitter operator to map successful records to one target and failed records to another target.

About Postal Reporting

All address lists used to produce mailings for discounted automation postal rates must be matched by postal report-certified software. Certifications depend on the third-party vendors of name and address software and data. The certifications may include the following:

  • United States Postal Service: Coding Accuracy Support System (CASS)

  • Canada Post: Software Evaluation and Recognition Program (SERP)

  • Australia Post: Address Matching Approval System (AMAS)

United States Postal Service CASS Certification

The Coding Accuracy Support System (CASS) was developed by the United States Postal Service (USPS) in cooperation with the mailing industry. The system provides mailers a common platform to measure the quality of address-matching software, focusing on the accuracy of five-digit ZIP Codes, ZIP+4 Codes, delivery point codes, and carrier route codes applied to all mail. All address lists used to produce mailings for automation rates must be matched by CASS-certified software.

To meet USPS requirements, the mailer must submit a CASS report in its original form to the USPS.

Canada Post SERP Certification

Canada Post developed a testing program called Software Evaluation and Recognition Program (SERP), which evaluates software packages for their ability to validate, or validate and correct, mailing lists to Canada Post requirements. Postal programs that meet SERP requirements are listed on the Canada Post Web site.

Canadian postal customers who use Incentive Lettermail, Addressed Admail, and Publications Mail must meet the Address Accuracy Program requirements. Customers can obtain a Statement of Accuracy by comparing their databases to Canada Post's address data.

Australia Post AMAS Certification

The Address Matching Approval System (AMAS) was developed by Australia Post to improve the quality of addressing. It provides a standard by which to test and measure the ability of address-matching software to:

  • Correct and match addresses against the Postal Address File (PAF)

  • Append a unique Delivery Point Identifier (DPID) to each address record, which is a step toward barcoding mail.

AMAS enables companies to develop address matching software which:

  • Prepares addresses for barcode creation

  • Ensures quality addressing

  • Enables qualification for discounts on PreSort letter lodgements

PreSort Letter Service prices are conditional upon customers using AMAS Approved Software with Delivery Point Identifiers (DPIDs) being current against the latest version of the PAF.

A declaration that the mail was prepared appropriately must be made when using the Presort Lodgement Document, available from post offices.

About Data Rules

Data rules are definitions for valid data values and relationships that can be created in Warehouse Builder. They determine legal data within a table or legal relationships between tables. Data rules help ensure data quality. They can be applied to tables, views, dimensions, cubes, materialized views, and external tables. Data rules are used in many situations including data profiling, data and schema cleansing, and data auditing.

The metadata for a data rule is stored in the workspace. To use a data rule, you apply the data rule to a data object. For example, you create a data rule called gender_rule that specifies that valid values are 'M' and 'F'. You can apply this data rule to the emp_gender column of the Employees table. Applying the data rule ensures that the values stored for the emp_gender column are either 'M' or 'F'. You can view the details of the data rule bindings on the Data Rule tab of the Data Object Editor for the Employees table.
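
Conceptually, the gender_rule describes the same domain that the following SQL check constraint would enforce. This is an illustration only, not a step Warehouse Builder asks you to perform; the data rule is stored as metadata and is enforced through profiling, correction mappings, and data auditors.

  -- Illustration: the domain described by gender_rule, expressed as a check constraint.
  ALTER TABLE employees
    ADD CONSTRAINT emp_gender_chk CHECK (emp_gender IN ('M', 'F'));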

There are two ways to create a data rule. A data rule can be derived from the results of data profiling, or it can be created using the Data Rule Wizard. For more information about data rules, see "Using Data Rules".

About Quality Monitoring

Quality monitoring builds on your initial data profiling and data quality initiatives. It enables you to monitor the quality of your data over time against the business rules to which your data should adhere.

To monitor data using Warehouse Builder, you create data auditors, which verify that your data complies with the business rules you defined. You define these business rules using a feature called data rules.

About Data Auditors

Data auditors are processes that validate data against a set of data rules to determine which records comply and which do not. Data auditors gather statistical metrics on how well the data in a system complies with a rule by auditing the data and recording how many errors occur in it.

Data auditors have thresholds that let you build logic around non-compliance: when too many non-compliant records are found, the process flow can be diverted into an error or notification stream, and the process can choose actions based on the threshold. In addition, the audit results can be captured and stored for analysis purposes.

Data auditors can be deployed and executed ad hoc, but they are typically run to monitor the quality of the data in an operational environment, such as a data warehouse or ERP system, and can therefore be added to a process flow and scheduled.

When executed, the data auditor sets several output values. One of these output values is called the audit result. If the audit result is 0, then there were no errors. If the audit result is 1, at least one error occurred. If the audit result is 2, then at least one data rule failed to meet the specified error threshold. Data auditors also set the actual measured values such as Error Percent and Six Sigma values.

Data auditors are an important tool for ensuring that data quality levels are up to the standards set by the users of the system. They also help detect spikes in bad data, allowing events to be tied to these spikes.

For information about creating and using data auditors, see "Monitoring Data Quality Using Data Auditors".

Performing Data Profiling

Data profiling is, by definition, a resource-intensive process. It analyzes data and columns and performs many iterations to detect defects and anomalies in your data, so it warrants forethought and planning in order to be as effective as possible.

Before you begin profiling data, first reduce the data set by doing a random sampling. Next identify the data objects that you want to target. Instead of profiling everything, choose objects that are deemed crucial. You should not select an entire source system for profiling at the same time. Not only is it a waste of resources, but it is also often unnecessary. Select areas of your data where quality is essential and has the largest fiscal impact.

For example, you have a data source that contains five tables: CUSTOMERS, REGIONS, ORDERS, PRODUCTS, and PROMOTIONS. You decide that the two most important tables with respect to data quality are CUSTOMERS and ORDERS. The CUSTOMERS table is known to contain many duplicate and erroneous entries that cost your company money on wasted marketing efforts. The ORDERS table is known to contain data about orders in an incorrect format. In this case, you would select only these two tables for data profiling.

Steps to Perform Data Profiling

After you have chosen the objects you want to profile, use the following steps to guide you through the profiling process:

  1. Import or Select the Metadata

  2. Create a Data Profile

  3. Profile the Data

  4. View Profile Results

  5. Derive Data Rules

  6. Generate Corrections

  7. Define and Edit Data Rules Manually

  8. Generate, Deploy, and Execute

The data profiling process ends at step 4. Steps 5 to 7 are optional and are performed only if you want to correct the data after profiling it. Step 8 is required when you perform data correction along with data profiling.

Import or Select the Metadata

Data profiling requires the profiled objects to be present in the project in which you are performing data profiling. Ensure that these objects are either imported into this project or created in it. Also ensure that the data is loaded into the objects. Having the data loaded is essential to data profiling.

Also, because data profiling uses mappings to run the profiling, you must ensure that all locations that you are using are registered. Data profiling attempts to register your locations. If, for some reason, data profiling cannot register your locations, you must explicitly register the locations before you begin profiling.

Note:

You can only profile data in the default configuration.

Create a Data Profile

After your system is set up, you can create a data profile using the Design Center. A data profile is a metadata object in the workspace. It includes the set of data objects you want profiled, the settings controlling the profiling operations, the results returned after you profile the data, and correction information (if you decide to use these corrections).

To create a data profile:

  1. From the Project Explorer, expand the project node in which you want to create a data profile.

  2. Right-click Data Profiles and select New.

    The Welcome page of the Create Data Profile Wizard is displayed.

  3. On the Name and Description page, enter a name and an optional description for the data profile. Click Next.

  4. On the Select Objects page, specify the objects that you want to include in the data profile and click Next.

    The Available section displays the objects available for profiling. Select the objects to include in the data profile and use the shuttle buttons to move them to the Selected section. To select multiple objects, hold down the Ctrl key while selecting objects. You can include tables, views, materialized views, external tables, dimensions, and cubes in your data profile.

    When you select tables, views or materialized views that contain attribute sets, the Choose Attribute Set dialog box is displayed. The list at the bottom of this dialog box displays the attribute sets defined on the data object. You can select an attribute set to profile only the columns included in that attribute set. To profile all columns in the data object, select <all columns>.

    When you select a dimensional object in the Available section, a warning is displayed informing you that the relational objects bound to these dimensional objects will also be added to the profile. Click Yes to proceed.

  5. On the Summary page, review the choices you made on the previous wizard pages. Click Back to change any selected values. Click Finish to create the data profile.

    The data profile is added to the Data Profiles node in the navigation tree.

    If this is the first data profile you have created in the current project, the Connection Information dialog box for the selected control center is displayed. Enter the connection information and click OK. The Data Profile Editor is displayed.

Note:

You cannot profile a source table that contains complex data types if the source module and the data profile are located on different database instances.

Profile the Data

Data profiling is achieved by performing deep scans of the selected objects. This can be a time-consuming process, depending on the number of objects and type of profiling you are running. However, profiling is run as an asynchronous job, and the client can be closed during this process. You will see the job running in the job monitor and Warehouse Builder prompts you when the job is complete.

Configuring Data Profiles

You can, and should, configure the profile before running it if there are specific types of analysis you do, or do not, want to run. To configure a data profile, you set its configuration parameters in the Property Inspector panel of the Data Profile Editor.

Configuration of the profile and its objects is possible at the following levels:

  • The entire profile (all the objects it contains)

    Select the data profile in the Profile Objects tab of the Data Profile Editor. In the Property Inspector, set the values of the configuration parameters. These parameters are set for all the objects in the data profile.

  • An individual object in the data profile (for example, a table)

    Select the object in the Profile Objects tab of the Data Profile Editor. In the Property Inspector, set the configuration parameters. These parameters are set for the selected object.

  • An attribute within an object (for example, a column within a table)

    In the Profile Objects tab of the Data Profile Editor, expand the object node to display the attributes it contains. For example, you can expand a table node to display its columns. Select the attribute for which you want to specify configuration parameters. In the Property Inspector, set the configuration parameters.

    For example, if you know you only have one problematic column in a table, and you already know that most of the records should conform to values within a certain domain, then you can focus your profiling resources on domain discovery and analysis. By narrowing down the type of profiling necessary, you use fewer resources and obtain the results faster.

For more information about the configuration parameters you can set for data profiles, see "Configuration Parameters for Data Profiles" in the Warehouse Builder Online Help.

Steps to Profile Data

After you have created a data profile, you can open it in the Data Profile Editor to profile the data or review profile results from a previous run. The objects you selected when creating the profile are displayed in the object tree of the Data Profile Editor. You can add objects to the profile by selecting Profile and then Add.

To profile the data:

  1. Expand the Data Profiles node in the Project Explorer, right-click a data profile, and select Open Editor.

    The Data Profile Editor opens the selected data profile.

  2. From the Profile menu, select Profile.

    If this is the first time you are profiling data, the Data Profile Setup dialog box is displayed. Enter the details of the profiling workspace in this dialog box. For more information about the information to be entered, click Help.

    Warehouse Builder begins preparing metadata for profiling. The progress window containing the name of the object being created to profile the data is displayed. After the metadata preparation is complete, the Profiling Initiated dialog box is displayed informing you that the profiling job has started. Click OK.

    Once the profiling job starts, the data profiling is asynchronous and you can continue working or even close the client. Your profiling process will continue to run until it is completed.

  3. View the status of the profiling job in the Monitor Panel of the Data Profile Editor.

    You can continue to monitor the progress of your profiling job in the Monitor panel. After the profiling job is complete, the status displays as complete.

  4. After the profiling is complete, the Retrieve Profile Results dialog box is displayed and you are prompted to refresh the results.

    You can use this option if you have previously profiled data in the same data profile. It allows you to control when the new profiling results become visible in the Data Profile Editor.

Note:

Data profiling results are overwritten on subsequent profiling executions.

View Profile Results

After the profile operation is complete, you can open the data profile in the Data Profile Editor to view and analyze the results. The profiling results contain a variety of analytical and statistical information about the data profiled. You can immediately drill down into anomalies and view the data that caused them. You can then determine what data must be corrected.

To view the profile results:

  1. Select the data profile in the navigation tree, right-click, and select Open Editor.

    The Data Profile Editor opens and displays the data profile.

  2. If you have previous data profiling results displayed in the Data Profile Editor, refresh the view when prompted so that the latest results are shown.

    The results of the profiling are displayed in the Profile Results Canvas.

  3. Minimize the Data Rule and Monitor panels by clicking on the arrow symbol in the upper left corner of the panel.

    This maximizes your screen space.

  4. Select objects in the Profile Objects tab of the object tree to focus the results on a specific object.

    The results of the selected object are displayed in the Profile Results Canvas. You can switch between objects. The tab that you had selected for the previous object remains selected.

The Profile Results Canvas contains the following tabs that display the results of data profiling:

  • Data Profile

  • Profile Object

  • Aggregation

  • Data Type

  • Pattern

  • Domain

  • Unique Key

  • Functional Dependency

  • Referential

  • Data Rule

For more information about the contents of these tabs, click the arrow on the right of the Profile Results Canvas panel and select Help.

Derive Data Rules

Based on the results of data profiling, you can decide to derive data rules that can be used to clean up your data. A data rule is an expression that determines the set of legal data that can be stored within a data object. Use data rules to ensure that only values compliant with the data rules are allowed within a data object. Data rules will form the basis for correcting or removing data if you decide to cleanse the data. You can also use data rules to report on non-compliant data.

Although you can create data rules and apply them manually to your data profile, derived data rules allow you to move quickly and seamlessly between data profiling and data correction.

For example, suppose you have a table called Employees with the columns Employee_Number, Gender, and Employee_Name. The profiling result shows that 90% of the values in the Employee_Number column are unique, making it a prime candidate for a unique key. The results also show that 85% of the values in the Gender column are either 'M' or 'F', making it a good candidate for a domain. You can then derive these rules directly from the Profile Results Canvas.
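
The following query illustrates the kind of measurements behind these percentages. It is a sketch for clarity only, not the SQL that data profiling actually generates, and it assumes the column names shown above.

  -- Approximate uniqueness of Employee_Number and domain conformance of Gender.
  SELECT ROUND(100 * COUNT(DISTINCT employee_number) / COUNT(*), 1) AS pct_unique_emp_number,
         ROUND(100 * SUM(CASE WHEN gender IN ('M', 'F') THEN 1 ELSE 0 END) / COUNT(*), 1)
           AS pct_gender_in_domain
  FROM   employees;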

To derive a data rule:

  1. Select a data profile in the navigation tree, right-click, and select Open Editor.

    The Data Profile Editor is displayed with the profiling results.

  2. Review the profiling results and determine which findings you want derived into data rules.

    The types of results that warrant data rules vary. Some results commonly derived into data rules include a detected domain, a functional dependency between two attributes, or a unique key.

  3. Select the tab that displays the results from which you want to derive a data rule.

    For example, to create a data rule that enforces a unique key rule for the EMPLOYEE_NUMBER column, navigate to the Unique Key tab.

  4. Select the cell that contains the results you want derived into a data rule and then from the Profile menu select Derive Data Rule. Or click the Derive Data Rule button.

    For example, to create a Unique Key rule on the EMPLOYEE_NUMBER column, select this column and click Derive Data Rule.

    The Derive Data Rule Wizard opens and displays the Welcome page.

  5. Click Next.

    The Name and Description page is displayed.

  6. The Name field displays a default name for the data rule. You can either accept the default name or enter a new name.

  7. Click Next.

    The Define Rule page is displayed.

  8. Provide details about the data rule parameters.

    The Type field that represents the type of data rule is populated based on the tab from which you derived the data rule. You cannot edit the type of data rule.

    Additional fields in the lower portion of this page define the parameters for the data rule. Some of these fields are populated with values based on the result of data profiling. The number and type of fields depends on the type of data rule.

  9. Click Next.

    The Summary page is displayed. Review the options you set in the wizard using this page. Click Back if you want to change any of the selected values.

  10. Click Finish.

    The data rule is created and it appears in the Data Rule panel of the Data Profile Editor. The derived data rule is also added to the Derived_Data_Rules node under the Data Rules node in the Project Explorer. You can reuse this data rule by attaching it to other data objects.

Generate Corrections

After you have derived data rules from the profiling results, you can automate the process of correcting source data based on profiling results. You can create the schema and mapping corrections.

The schema correction creates scripts that you can use to create a corrected set of source data objects with the derived data rules applied. The mapping correction creates new correction mappings that take data from the source objects and load it into the new, corrected objects.

As part of the correction process, the following are created:

  • Corrected tables that adhere to the newly derived data rules

    The correction tables have names that are prefixed with TMP__. For example, when you profile the EMPLOYEES table, the correction table will be called TMP__EMPLOYEES.

  • Correction mappings that you use to cleanse the data

    The correction mappings enforce the data rules. While moving data from the "dirty" source tables into the corrected tables, these mappings correct records that do not comply with the data rules. The name of a correction mapping is the object name prefixed with M_. For example, the correction mapping for the EMPLOYEES table is called M_EMPLOYEES.

Steps to Create Corrections

Use the Data Profile Editor to create corrections based on the profiling results.

To create corrections:

  1. If the Data Profile is not already open, open it by right-clicking the data profile in the Project Explorer and selecting Open Editor.

  2. From the Profile menu, select Create Correction.

    The Create Correction Wizard is displayed. Click Help on any wizard page for more information about the page.

  3. On the Select Target Module page, specify the target module that will contain the corrections. You can create a new module or select an existing module.

    If you choose to create a new target module, the Create Module Wizard is displayed that enables you to create a new module.

    You can remove correction objects created as a result of previous corrections by selecting Remove previous correction objects.

  4. On the Select Objects page, select the objects that you want to correct by moving them to the Selected list.

  5. On the Select Data Rules and Data Types page, specify the corrections that you want to implement for each object. The navigation tree on the left displays the objects. Select each object in the navigation tree and specify corrections for that object on the Data Rules and Data Types tabs.

    The Data Rules tab displays the data rules that will be applied to the corrected object. Select the rules to be applied. For each rule, in the Bindings section at the bottom of the page, select the column to which the rule must be bound.

    The Data Types tab displays the new data type and the documented data type for each column. To use the new data type determined as a result of the data correction actions, select the column by clicking the box to the right of the column. Columns which are not selected will retain their existing data types.

  6. On the Verify and Accept Corrected Tables page, select the objects that you want to correct.

    At the top of this page, the objects selected for correction are listed. Select Create to the right of a table name to generate the corrected object for it.

    The bottom part of this page contains the Columns, Constraints, and Data Rules tabs. These tabs contain the definitions used for the corrected objects. You can modify these definitions, if required.

  7. On the Choose Data Correction Actions page, specify the correction actions to be taken for objects.

    Select an object by clicking Correct to the left of the object, and use the Choose Data Correction Actions section to specify the correction action and cleansing strategy. For more information about correction actions, click Help on this page.

  8. On the Summary page, click Finish to create the correction objects.

At this stage, the correction objects are only defined and their metadata is stored in the workspace. To implement the correction objects in your target schema, you must deploy the correction tables and correction mappings.

Before you deploy a correction mapping, ensure that you do the following:

  • Deploy the correction tables created as a result of data profiling.

  • Grant the SELECT privilege on the source tables to PUBLIC.

    For example, your correction mapping contains the table EMPLOYEES from the HR schema. You can successfully deploy this correction mapping only if the SELECT privilege is granted to PUBLIC on the HR.EMPLOYEES table.
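
For the HR.EMPLOYEES example, the grant is a standard SQL statement that a user with sufficient privileges (such as HR or a DBA) can run:

  GRANT SELECT ON hr.employees TO PUBLIC;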

Viewing the Correction Tables and Mappings

You can review the correction tables in the Data Object Editor to see the data rules and constraints created as part of the design of your table.

To view the correction mappings:

  1. Double-click the table or mapping to open it in its respective editor.

  2. After the mapping is open, select View and then Auto Layout to view the entire mapping.

  3. Select the submapping ATTR_VALUE_1 and click the Visit Child Graph icon from the toolbar to view the submapping.

    The submapping is displayed. The submapping is the element in the mapping that performs the actual correction cleansing you specified in the Create Correction Wizard.

Define and Edit Data Rules Manually

Data rules can be derived or manually created. Before and after you have created the corrections, you can define additional data rules manually.

For more information about defining and editing data rules manually, see "Creating Data Rules".

Generate, Deploy, and Execute

Finally, you can generate, deploy, and execute the correction mappings and data rules. After you run the correction mappings with the data rules, your data is corrected. The derived data rules remain attached to the objects in the corrected schema for optional use in data monitors.

Tuning the Data Profiling Process

Data profiling is a highly processor- and I/O-intensive process, and the execution time for profiling ranges from a few minutes to a few days. You can achieve the best possible data profiling performance by ensuring that the following conditions are satisfied:

Tuning the Data Profile for Better Data Profiling Performance

You can configure a data profile to optimize data profiling results. Use the configuration parameters to configure a data profile. For more information about configuration parameters, see "Configuration Parameters for Data Profiles" in the Warehouse Builder Online Help.

Use the following guidelines to make your data profiling process faster:

  • Perform only the types of analysis that you require

    If you know that certain types of analysis are not required for the objects that you are profiling, use the configuration parameters to turn off these types of data profiling.

  • Analyze a smaller amount of data

    Use the WHERE clause and Sample Rate configuration parameters.

If the source data for profiling is stored in an Oracle Database, it is recommended that the source schema be located on the same database instance as the profile workspace. You can do this by installing the workspace into the same Oracle instance as the source schema location. This avoids using a database link to move data from source to profiling workspace.

Tuning the Oracle Database for Better Data Profiling Performance

To ensure good data profiling performance, the computer that runs the Oracle Database must have certain hardware capabilities. In addition to this, you must optimize the Oracle Database instance on which you are performing data profiling.

For efficient data profiling, the following considerations are applicable:

Multiple Processors

The computer that runs the Oracle Database needs multiple processors. Data profiling has been designed and tuned to take maximum advantage of the parallelism provided by the Oracle Database. While profiling large tables (more than 10 million rows), it is highly recommended to use a multiple processor computer.

Data profiling uses hints in the queries that it generates and picks up the degree of parallelism from the initialization parameter file of the Oracle Database. The default initialization parameter file contains parameters that take advantage of parallelism.

Memory

It is important that you ensure a high memory hit ratio during data profiling. You can do this by allocating a larger System Global Area (SGA). It is recommended that the size of the SGA be at least 500 MB. If possible, configure it to 2 GB or 3 GB.

For advanced database users, it is recommended that you observe the buffer cache hit ratio and the library cache hit ratio. Aim for a buffer cache hit ratio higher than 95% and a library cache hit ratio higher than 99%.
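
Assuming you have access to the dynamic performance views, the following standard queries approximate both ratios. The ALTER SYSTEM example assumes an spfile is in use and that SGA_MAX_SIZE permits the new value; the change takes effect after the instance is restarted.

  -- Approximate buffer cache hit ratio, as a percentage.
  SELECT ROUND(100 * (1 - phy.value / (db.value + cons.value)), 2) AS buffer_cache_hit_pct
  FROM   v$sysstat phy, v$sysstat db, v$sysstat cons
  WHERE  phy.name  = 'physical reads'
  AND    db.name   = 'db block gets'
  AND    cons.name = 'consistent gets';

  -- Library cache hit ratio, as a percentage.
  SELECT ROUND(100 * SUM(pinhits) / SUM(pins), 2) AS library_cache_hit_pct
  FROM   v$librarycache;

  -- Example of enlarging the SGA.
  ALTER SYSTEM SET SGA_TARGET = 2G SCOPE = SPFILE;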

I/O System

The capabilities of the I/O system have a direct impact on the data profiling performance. Data profiling processing frequently performs full table scans and massive joins. Since today's CPUs can easily out-drive the I/O system, you must carefully design and configure the I/O subsystem. Keep in mind the following considerations that aid better I/O performance:

  • You need a large number of disk spindles to support uninterrupted CPU and I/O cooperation. If you have only a few disks, the I/O system is not geared towards a high degree of parallel processing. It is recommended to have a minimum of two disks for each CPU.

  • Configure the disks. It is recommended that you create logical stripe volumes on the existing disks, each striping across all available disks. Use the following formula to calculate the stripe width:

    MAX(1,DB_FILE_MULTIBLOCK_READ_COUNT/number_of_disks) X DB_BLOCK_SIZE

    Here, DB_FILE_MULTIBLOCK_READ_COUNT and DB_BLOCK_SIZE are parameters that you set in your database initialization parameter file. For example, with DB_FILE_MULTIBLOCK_READ_COUNT set to 16, eight disks, and a DB_BLOCK_SIZE of 8 KB, the stripe width is MAX(1, 16/8) x 8 KB = 16 KB. You can also use a stripe width that is a multiple of the value returned by the formula.

    To create and maintain logical volumes, you need volume management software such as Veritas Volume Manager or Sun Storage Manager. If you are using Oracle Database 10g or a later version and you do not have any volume management software, you can use the Automatic Storage Management feature of the Oracle Database to spread the workload across disks.

  • Create different stripe volumes for different tablespaces. It is possible that some of the tablespaces occupy the same set of disks.

    For data profiling, the USERS and the TEMP tablespaces are normally used at the same time. So you can consider placing these tablespaces on separate disks to reduce interference.

Using Data Rules

In addition to deriving data rules based on the results of data profiling, you can define your own data rules. You can bind a data rule to multiple tables within the project in which the data rule is defined. An object can contain any number of data rules.

Use the Design Center to create and edit data rules. Once you create a data rule, you can use it in any of the following scenarios.

Using Data Rules in Data Profiling

When you are using data profiling to analyze tables, you can use data rules to analyze how well data complies with a given rule and to collect statistics. From the results, you can derive a new data rule. For example, if data profiling determines that the majority of records in a particular column have a value of red, white, or blue, a new data rule can be derived that defines the color domain (red, white, and blue). This rule can then be reused to profile other tables, or reused in cleansing and auditing.

Using Data Rules in Data Cleansing and Schema Correction

Data rules can be used in two ways to cleanse data and correct schemas. The first way is to convert a source schema into a new target schema where the structure of the new tables strictly adheres to the data rules. The new tables would then have the right data types, constraints would be enforced, and schemas would be normalized. The second way data rules are used is in a correction mapping that validates the data in a source table against the data rules, to determine which records comply and which do not. The analyzed data set is then corrected (for example, orphan records are removed, domain value inaccuracies are corrected, and so on) and the cleansed data set is loaded into the corrected target schema.
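
As a conceptual illustration of the second approach, the following statement shows what removing orphan records means in SQL terms, using the CUSTOMERS and ORDERS tables from the earlier example (the column names are assumed). A generated correction mapping implements equivalent logic; you do not write this statement yourself.

  -- Remove ORDERS rows that reference no existing customer (orphan records).
  DELETE FROM orders o
  WHERE  NOT EXISTS (SELECT 1 FROM customers c WHERE c.customer_id = o.customer_id);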

Using Data Rules in Data Auditing

Data rules are also used in data auditing. Data auditors are processes that validate data against a set of data rules to determine which records comply and which do not. Data auditors gather statistical metrics on how well the data in a system complies with a rule, and they report defective data into auditing and error tables. In that sense they are like data-rule-based correction mappings, which also offer a report-only option for data that does not comply with the data rules. For more information about data auditors, see "About Data Auditors".

Creating Data Rules

The Data Rules folder in the Project Explorer contains the data rules. Every data rule must belong to a data rule folder. The subfolder DERIVED_DATA_RULES contains the data rules derived as a result of data profiling. You can create additional data rule folders to contain any data rules that you create.

To create a data rule:

  1. Right-click the Data Rule folder in which the data rule should be created and select New.

    The Create Data Rule Wizard is displayed.

  2. On the Name and Description page, enter a name and an optional description for the data rule. Click Next.

  3. On the Define Rule page, specify the type of data rule to create. Also specify any additional information required to create the data rule. Click Next.

    For example, when you create a Domain Range rule, you must specify the values that represent the valid domain values.

    For more information about the types of rules, see "Types of Data Rules" in the Warehouse Builder Online Help.

  4. On the Summary page, review the selections you made in the wizard. Click Back to modify any selected values. Click Finish to create the data rule.

    The data rule is added to the data rule folder under which you created the data rule.

Applying Data Rules to Objects

Applying a data rule to an object binds the definition of the data rule to the object. For example, binding a rule to the table Dept ensures that the rule is implemented for the specified attribute in the table. You apply a data rule using the Data Object Editor. You can also apply a derived data rule from the Data Rule panel of the Data Profile Editor.

The Apply Data Rule Wizard enables you to apply a data rule to a data object. You can apply precreated data rules or any data rule you created to data objects. The types of data objects to which you can apply data rules are tables, views, materialized views, and external tables.

To apply a data rule to a data object:

  1. In the Project Explorer, right-click the object to which the data rule must be applied and select Open Editor.

    The Data Object Editor for the data object is displayed.

  2. Navigate to the Data Rules tab.

    If any data rules are bound to the data object, these are displayed on this tab.

  3. Click Apply Rule.

    The Apply Data Rule wizard is displayed.

  4. On the Select Rule page, select the data rule that you want to apply to the data object. Click Next.

  5. On the Name and Description page, enter a name and an optional description for the applied data rule. Click Next.

  6. On the Bind Rule Parameters page, bind the data rule to the column in the data object to which the data rule must be applied. Click Next.

  7. On the Summary page, review the selections you made on the previous wizard pages. Click Back to modify selected values. Click Finish to apply the data rule.

    The data rule is bound to the data object and is listed on the Data Rules tab.

Monitoring Data Quality Using Data Auditors

Data auditors are objects that you can use to continuously monitor your source schema to ensure that the data adheres to the defined data rules. You can monitor an object only if you have defined data rules for the object. You can create data auditors for tables, views, materialized views, and external tables.

See Also:

"About Data Auditors" for more information about data auditors

To monitor data quality, perform the following steps:

  1. Create a data auditor containing the data objects that you want to monitor.

    See "Creating Data Auditors"

  2. Run the data auditor to identify the records that do not comply with the data rules defined on the data objects. You can either run data auditors manually or schedule them to run at specified times.

    See "Auditing Data Objects Using Data Auditors" for information about running data auditors.

Note:

You cannot import metadata for data auditors in Merge mode. For more information about import mode options, see "Import Option" in the Warehouse Builder Online Help.

Creating Data Auditors

Use the Create Data Auditor Wizard to create data auditors. Data auditors are part of an Oracle module in a project.

To create a data auditor:

  1. Expand the Oracle module in which you want to create the data auditor.

  2. Right-click Data Auditors and select New.

    The Create Data Auditor Wizard is displayed.

  3. On the Name and Description page, enter a name and an optional description for the data auditor. Click Next.

  4. On the Select Objects page, select the data objects that you want to audit. Use the shuttle buttons to move objects to the Selected section and click Next.

    You can select multiple objects by holding down the Ctrl key while selecting objects.

  5. On the Choose Actions page, specify the action to be taken for records that do not comply with data rules bound to the selected objects. Click Next.

    The Choose Actions page contains two sections, Error threshold mode and Data Rules.

    Error threshold mode

    Error threshold mode is used to determine the compliance of data to data rules in the objects. Select one of the following options:

    • Percent: The data auditor will set the audit result based on the percentage of records that do not comply with the data rule. This percentage is specified in the rule's Defect Threshold value.

    • Six Sigma: The data auditor will set the audit result based on the Six Sigma values for the data rules. If the calculated Six Sigma value for any rule is less than the specified Sigma Threshold value, then the data auditor will set the AUDIT_RESULT to 2.

    Data Rules

    The Data Rules section lists the data rules applied to the objects selected on the Select Object page. For each rule, specify the following:

    • Action: The action to be performed if data in the source object does not comply with the data rule. Select Report to ensure that the data rule is audited. Select Ignore if you want the data rule to be ignored.

    • Defect Threshold: The percent of records that should comply with the data rules to ensure successful auditing. Specify a value between 1 and 100. This value is ignored if you select Six Sigma in the Error threshold mode section.

    • Sigma Threshold: The required success rate. Specify a number between 0 and 7. If you set the value to 7, no failures are allowed. This value is ignored if you select Percent in the Error threshold mode section.

  6. On the Summary page, review the selections you made. Click Back to modify any selected values. Click Finish to create the data auditor.

    The created data auditor is added to the Data Auditors node. At this stage, only the metadata for the data auditor is stored in your workspace. To use this data auditor to monitor the quality of data in your data objects, you must run the data auditor.

Auditing Data Objects Using Data Auditors

After you create a data auditor, you can use it to monitor the data in your data objects. This ensures that the data rule violations for the objects are detected. When you run a data auditor, any records that violate the data rules defined on the data objects are written to the error tables.

There are two ways of using data auditors:

Manually Running Data Auditors

To check if the data in the data object adheres to the data rules defined for the object, you must run the data auditor. You can run data auditors from the Design Center or the Control Center Manager. To run a data auditor from the Design Center, right-click the data auditor and select Start. In the Control Center Manager, select the data auditor, and from the File menu, select Start. The results are displayed in the Job Details window as described in "Data Auditor Execution Results".

Automatically Running Data Auditors

You can automate the process of running a data auditor using the following steps:

  1. Create a process flow that contains a Data Auditor Monitor activity.

  2. Schedule this process flow to run at a predefined time.

    For more information about scheduling objects, see "Process for Defining and Using Schedules".

Figure 5-5 displays a process flow that contains a Data Auditor Monitor activity. In this process flow, LOAD_EMP_MAP is a mapping that loads data into the EMP table. If the data load is successful, the data auditor EMP_DATA_AUDIT is run. The data auditor monitors the data in the EMP table based on the data rules defined for the table.

Figure 5-5 Data Auditor Monitor Activity in a Process Flow

Description of Figure 5-5 follows
Description of "Figure 5-5 Data Auditor Monitor Activity in a Process Flow"

Data Auditor Execution Results

After you run a data auditor, the Job Details window displays the details of the execution. The Job Details window contains two tabs: Input Parameters and Execution Results. Note that the Job Details window is displayed only when you set the deployment preference Show Monitor to true. For more information about deployment preferences, see "Deployment Preferences".

Figure 5-6 displays the Execution Results tab of the Job Details window.

Figure 5-6 Data Auditor Execution Results

Description of Figure 5-6 follows
Description of "Figure 5-6 Data Auditor Execution Results"

The Input Parameters tab contains the values of input parameters used to run the data auditor. The Execution Results tab displays the results of running the data auditor. This tab contains two sections: Row Activity and Output Parameters.

The Row Activity section contains details about the inserts into the error table for each step. Note that when more than one data rule is specified, multi-table insert may be used in the data auditor. In this case, the count of the number of rows will not be accurate.

In Figure 5-6, the data rule called E_NOT_NULL inserted one record into the error table.

The Output Parameters section contains the following three parameters:

  • AUDIT_RESULT: Indicates the result of running the data auditor. The possible values for this parameter are as follows:

    0: No data rule violations occurred.

    1: At least one data rule violation occurred, but no data rule failed to meet the minimum quality threshold as defined in the data auditor.

    2: At least one data rule failed to meet the minimum quality threshold.

    For more information about setting the threshold, see the step on choosing actions in "Creating Data Auditors".

  • EO_<data_rule_name>: Represents the calculated error quality for the specified data rule. 0 indicates all errors and 100 indicates no errors.

  • SO_<data_rule_name>: Represents the Six Sigma quality calculated for the specified data rule.