19 Automatic and Embedded Data Preparation

This chapter explains how to use features of Oracle Data Mining to prepare data for mining.

This chapter contains the following sections:

Overview
Automatic Data Preparation
Embedded Data Preparation
Transparency

Overview

The quality of a model depends to a large extent on the quality of the data used to build (train) it. A large proportion of the time spent in any given data mining project is devoted to data preparation. The data must be carefully inspected, cleansed, and transformed, and algorithm-appropriate data preparation methods must be applied.

The process of data preparation is further complicated by the fact that any data to which a model is applied, whether for testing or for scoring, must undergo the same transformations as the data used to train the model.

Oracle Data Mining offers several features that significantly simplify the process of data preparation.

Embedded data preparation — Transformation instructions specified during the model build are embedded in the model, used during the model build, and used again when the model is applied. If you specify transformation instructions, you only have to specify them once.
Automatic Data Preparation (ADP) — Oracle Data Mining has an automated data preparation mode. When ADP is active, Oracle Data Mining automatically performs the data transformations required by the algorithm. The transformation instructions are embedded in the model along with any user-specified transformation instructions.
Tools for custom data preparation — Oracle Data Mining provides transformation routines that you can use to build your own transformation instructions. You can use these transformation instructions along with ADP or instead of ADP. See "Embedded Data Preparation".
Automatic management of missing values and sparse data — Oracle Data Mining uses consistent methodology across mining algorithms to handle sparsity and missing values. See Oracle Data Mining Application Developer's Guide.
Transparency — Oracle Data Mining provides model details, which are a view of the categorical and numerical attributes internal to the model. This insight into the inner details of the model is possible because of reverse transformations, which map the transformed attribute values to a form that can be interpreted by a user. Where possible, attribute values are reversed to the original column values. Reverse transformations are also applied to the target when a supervised model is scored, thus the results of scoring are in the same units as the units of the original target. See "Transparency".

The Case Table

The first step in preparing data for mining is the creation of a case table. If all the data resides in a single table and all the information for each case (record) is included in a single row (single-record case), this process is already taken care of.

If the data resides in several tables, creating the data source involves the creation of a view. For the sake of simplicity, the term "case table" refers to either a table or a view.

When the data source includes transactional data (multi-record case), it must be aggregated to the case level, using nested columns when desired. In transactional data, the information for each case is contained in multiple rows. An example is sales data in a star schema when mining at the product level. Sales is stored in many rows for a single product (the case) since the product is sold in many stores to many customers over a period of time.

Once you have created a case table that contains all the pertinent data, you should cleanse the data of any inconsistent formats within columns. Pay special attention to such items as phone numbers, zip codes, and dates, as described in the following section.

See:

Oracle Data Mining Application Developer's Guide for further details on nested data and other issues involved in creating the case table.

Data Type Conversion

Oracle Data Mining supports a limited number of column data types. Numeric data is interpreted as numerical attributes and character data is interpreted as categorical attributes.

You must convert the data type of a column if its type is not supported by Oracle Data Mining or if its type will cause Oracle Data Mining to interpret it incorrectly. For example, zip codes identify different postal zones; they do not imply order. If the zip codes are stored in a numeric column, it will be interpreted as a numerical attribute. You must convert the data type so that the column data can be used as a categorical attribute by the model. You can do this using the TO_CHAR function to convert the digits 1-9 and the LPAD function to retain the leading 0, if there is one.

LPAD(TO_CHAR(ZIPCODE),5,'0')

Date Data

The Oracle Data Mining APIs do not support DATE and TIMESTAMP data. Date columns must be converted to simple numeric or character data types for data mining.

In most cases, DATE and TIMESTAMP should be converted to NUMBER, but you should evaluate each case individually. A TIMESTAMP column should generally be converted to a number since it represents a unique point in time.

Alternatively, a column of dates in a table of annual sales data might indicate the month when a product was sold. This DATE column would be converted to VARCHAR2 and treated as a categorical. You can use the TO_CHAR function to convert a DATE data type to VARCHAR2.

You can convert dates to numbers by selecting a starting date and subtracting it from each date value. Another approach would be to parse the date and distribute its components over several columns. This approach is used by DBMS_PREDICTIVE_ANALYTICS, which does support DATE and TIMESTAMP data types.

See Also:

Oracle Database SQL Language Reference for information on data type conversion.
Oracle Database PL/SQL Packages and Types Reference for information about date data types supported by DBMS_PREDICTIVE_ANALYTICS.

Text Transformation

You can use Oracle Data Mining to mine text. Columns of text in the case table can be mined once they have undergone the proper transformation.

The text column must be in a table, not a view. The transformation process uses several features of Oracle Text; it treats the text in each row of the table as a separate document. Each document is transformed to a set of text tokens known as terms, which have a numeric value and a text label. The text column is transformed to a nested column of DM_NESTED_NUMERICALS.

See:

Oracle Data Mining Application Developer's Guide for details.

Business and Domain-Sensitive Transformations

Some transformations are dictated by the definition of the business problem. For example, you might want to build a model to predict high-revenue customers. Since your revenue data for current customers is in dollars you need to define what "high-revenue" means. Using some formula that you have developed from past experience, you might recode the revenue attribute into ranges Low, Medium, and High before building the model.

Another common business transformation is the conversion of date information into elapsed time. For example, date of birth might be converted to age.

In some cases, the data might need to be transformed to minimize an unwanted interpretation by the model. An example is logarithmic transformations. You might take the log of a numerical attribute when the values fall within a very wide range. For instance, commissions might range from a few hundred to a million. Converting to the log scale would minimize the skewing effect on the model.

Domain knownledge can be very important in deciding how to prepare the data. For example, some algorithms might produce unreliable results if the data contains values that fall far outside of the normal range. In some cases, these values represent errors or abnormalities. In others, they provide meaningful information. See "Outlier Treatment".

Automatic Data Preparation

Most algorithms require some form of data transformation. During the model training process, Oracle Data Mining can automatically perform the transformations required by the algorithm. You can choose to supplement the automatic transformations with additional transformations of your own, or you can choose to manage all the transformations yourself.

In calculating automatic transformations, Oracle Data Mining uses heuristics that address the common requirements of a given algorithm. This process results in reasonable model quality in most cases.

Enabling Automatic Data Preparation

ADP is a model setting. You can enable ADP by specifying PREP_AUTO in the settings table for the model. By default, ADP is not enabled.

The settings table is a user-created table with two columns: SETTING_NAME and SETTING_VALUE. To enable ADP, set PREP_AUTO to PREP_AUTO_ON; to disable ADP, set PREP_AUTO to PREP_AUTO_OFF.

Example 19-1 enables ADP in a settings table called SETTINGS_TBL.

Example 19-1 Turn on the ADP Setting

BEGIN
  INSERT into settings_tbl(
dbms_data_mining.prep_auto,
dbms_data_mining.prep_auto_on);
  commit;
END;
/

Example 19-2 uses this settings table to enable ADP for a model called CLAS_MODEL.

Example 19-2 Enable ADP for a Model

BEGIN
  dbms_data_mining.create_model(
    model_name          => 'clas_model',
    mining_function     => dbms_data_mining.classification,
    data_table          => 'my_data',
    case_id_column_name => 'case_id',
    target_column_name  => 'class',
    settings_table_name => 'settings_tbl');
END;
/

Note:

By default, ADP is not enabled. To use ADP, you must explicitly set PREP_AUTO in the settings table for the model.

Overview of Algorithm-Specific Transformations

Binning, normalization, and outlier treatment are transformations that are commonly needed by data mining algorithms. These transformation techniques are introduced in this section and described in more detail in "Embedded Data Preparation".

Binning

Binning, also called discretization, is a technique for reducing the cardinality of continuous and discrete data. Binning groups related values together in bins to reduce the number of distinct values.

Binning can improve resource utilization and model build response time dramatically without significant loss in model quality. Binning can improve model quality by strengthening the relationship between attributes.

Note:

Binning is the primary transformation required by Naive-Bayes and Attribute Importance algorithms. In Oracle Data Mining, the Decision Tree algorithm implements its own form of binning (supervised binning).

See "Binning" for information on binning methods supported by DBMS_DATA_MINING_TRANSFORM.

Normalization

Normalization is the most common technique for reducing the range of numerical data. Most normalization methods map the range of a single variable to another range (often 0,1).

Note:

Normalization is the primary transformation required by Support Vector Machine (one-class, classification, and regression), Non-Negative Matrix Factorization, and k-Means algorithms.

See "Normalization" for information on normalization methods supported by DBMS_DATA_MINING_TRANSFORM.

Outlier Treatment

A value is considered an outlier if it deviates significantly from most other values in the column. The presence of outliers can have a skewing effect on the data and can interfere with the effectiveness of transformations such as normalization or binning.

Outlier treatment methods such as trimming or clipping can be implemented to minimize the effect of outliers.

Outliers may represent problematic data, for example a bad reading due to the abnormal condition of an instrument. However, in some cases, especially in the business arena, outliers may be perfectly valid. For example, in census data, the earnings for some of the richest individuals may vary significantly from the general population. This information should not be treated as an outlier, since it is an important part of the data. Domain knowledge is usually needed to determine outlier handling.

See "Outlier Treatment" for information on methods for outlier treatment in DBMS_DATA_MINING_TRANSFORM.

Algorithms and ADP

Table 19-1 shows how ADP prepares the data for each algorithm.

Note:

Many algorithms incorporate some form of data preparation. For example, algorithms that operate natively on numeric attributes explode each non-numeric input column into a set of numerical attributes.

Transformations encapsulated within the algorithm are transparent to the user and occur independently of ADP.

Also, the handling of nested data, sparsity, and missing values is standard across algorithms and occurs independently of ADP. (See Oracle Data Mining Application Developer's Guide.)

Table 19-1 Oracle Data Mining Algorithms With ADP

Algorithm	Mining Function	Treatment by ADP
Naive Bayes	Classification	All attributes are binned with supervised binning.
Decision Tree	Classification	The ADP setting has no effect on Decision Tree. Data preparation is handled by the algorithm.
GLM	Classification and Regression	Numerical attributes are normalized.
SVM	Classification, Anomaly Detection, and Regression	Numerical attributes are normalized.
k-Means	Clustering	Numerical attributes are normalized with outlier-sensitive normalization.
O-Cluster	Clustering	Numerical attributes are binned with a specialized form of equi-width binning, which computes the number of bins per attribute automatically. Numerical columns with all nulls or a single value are removed.
MDL	Attribute Importance	All attributes are binned with supervised binning..
Apriori	Association Rules	The ADP setting has no effect on association rules.
NMF	Feature Extraction	Numerical attributes are normalized.

See Also:

The chapters on the individual algorithms in Part III for more information about algorithm-specific data preparation.

Embedded Data Preparation

Transformations can be embedded in a model automatically by ADP or they can be embedded as a result of user-specified transformation instructions. To specify your own embedded transformations, create a TRANSFORMATION_LIST and pass it to DBMS_DATA_MINING.CREATE_MODEL.

PROCEDURE create_model(
                  model_name           IN VARCHAR2,
                  mining_function      IN VARCHAR2,
                  data_table_name      IN VARCHAR2,
                  case_id_column_name  IN VARCHAR2,
                  target_column_name   IN VARCHAR2 DEFAULT NULL,
                  settings_table_name  IN VARCHAR2 DEFAULT NULL,
                  data_schema_name     IN VARCHAR2 DEFAULT NULL,
                  settings_schema_name IN VARCHAR2 DEFAULT NULL,
                  xform_list           IN TRANSFORM_LIST DEFAULT NULL);

Transformation Lists and ADP

If you enable ADP and you specify a transformation list, the transformation list is embedded with the automatic, system-generated transformations. The transformation list is executed before the automatic transformations.

If you enable ADP and do not specify a transformation list, only the automatic transformations are embedded in the model.

If you disable ADP (accept the default) and you specify a transformation list, your custom transformations are embedded in the model. No automatic transformations are performed.

If you disable ADP (accept the default) and you do not specify a transformation list, no transformations will be embedded in the model. You will have to transform the build, test, and scoring data sets yourself. You must take care to apply the same transformations to each data set. This method of data preparation was required in previous releases of Oracle Data Mining.

Creating a Transformation List

A transformation list consists of a set of attribute transformation expressions. Each one specifies the transformation for a single attribute.

You can use the STACK routines in DBMS_DATA_MINING_TRANSFORM to assemble the attribute transformation expressions into a transformation list. A transformation list can specify transformations for any number of attributes.

An attribute transformation expression has the fields described in Table 19-2.

Table 19-2 Components of an Attribute Transformation Expression

Field Name	Data Type	Description
`attribute_name`	`VARCHAR2(30)`	Name of the column in the build data. If the column is not nested, this is also the complete attribute name. If the column is nested, the full attribute name is: `attribute_name.attribute_subname`
`attribute_subname`	`VARCHAR2(4000)`	Individual attribute within a nested column. If the column is not nested, the attribute subname is null.
`expression`	`EXPRESSION_REC`	A SQL expression that specifies how to transform the attribute. This expression is applied to the attribute when it is used internally by the model.
`reverse_expression`	`EXPRESSION_REC`	A SQL expression that specifies how to reverse the transformation. This expression is applied to the attribute when it is visible to a user: in the model details and in the target of a supervised model. The reverse expression supports model transparency.
`attribute_spec`	`VARCHAR2(4000)`	Either null or "NOPREP". You can set the attribute spec to NOPREP to prevent automatic preparation of this attribute when ADP is on. When ADP is off, NOPREP is ignored. NOPREP cannot be used for an individual subname of a nested attribute. If NOPREP is specified for an individual subname when ADP is on, an error is generated.

The expression and reverse_expression fields can potentially be very long (over 32K).

Example 19-3 shows a transformation expression for an attribute called INCOME. The attribute subname is null, because INCOME is not a nested column. Internally, the model uses a log representation of INCOME, but the user sees the attribute in its original form in model details and in the results of scoring, if INCOME is the target of a supervised model.

Example 19-3 An Attribute Transformation Expression

( 'INCOME',
    NULL,
   '(log(10, income) – 4.3)/0.7',
   'power(10, 0.7*income + 4.3)', NULL)

Transforming a Nested Attribute

You can apply the same transformation expression to all the attributes in a nested column, or you can specify separate transformations for individual nested attributes.

If you separately transform some of the nested attributes, you can provide a default transformation expression for all the remaining attributes in the nested column. The default specification has NULL in the column field and the name of the nested column in the subattribute field.

For example, the following transformation list specifies the transformation expressions for two nested attributes, subname1 and subname2, in the nested column nested_col1. .

{ nested_col1,  subname1,      (VALUE-(-1.5))/20,   VALUE*20+(-1.5),   NULL }
{ nested_col1,  subname2,       NULL,               NULL,              NULL } 
{ NULL,         nested_col1,    VALUE/10,           VALUE*10,          NULL }

The remaining attributes in nested_col1 are divided by 10 for use within the model, then multiplied by 10 for viewing by a user.

Note that the value of the nested attribute in the transformation and reverse transformation expressions is a constant, VALUE.

See Also:

Oracle Data Mining Application Developer's Guide for information about attributes.
Oracle Database PL/SQL Packages and Types Reference for details about the stack interface, transformation expressions, and transformation lists.

Oracle Data Mining Transformation Routines

Oracle Data Mining provides routines that implement various transformation techniques in the DBMS_DATA_MINING_TRANSFORM package. Some of these transformation techniques are summarized in this section.

You can use the routines in DBMS_DATA_MINING_TRANSFORM, or can write your own SQL, or use some combination of the two to create your own transformation lists.

See Also:

Oracle Database PL/SQL Packages and Types Reference for information on DBMS_DATA_MINING_TRANSFORM.

Binning

A number of factors go into deciding a binning strategy. Having fewer values typically leads to a more compact model and one that builds faster, but it can also lead to some loss in accuracy.

Model quality can improve significantly with well-chosen bin boundaries. For example, an appropriate way to bin ages might be to separate them into groups of interest, such as children 0-13, teenagers 13-19, youth 19-24, working adults 24-35, and so on.

Table 19-3 lists the binning techniques provided by Oracle Data Mining.

Table 19-3 Binning Methods in DBMS_DATA_MINING_TRANSFORM

Binning Method	Description
Top-N Most Frequent Items	You can use this technique to bin categorical attributes. You specify the number of bins. The value that occurs most frequently is labeled as the first bin, the value that appears with the next frequency is labeled as the second bin, and so on. All remaining values are in an additional bin.
Supervised Binning	Supervised binning is a form of intelligent binning, where bin boundaries are derived from important characteristics of the data. Supervised binning builds a single-predictor decision tree to find the interesting bin boundaries with respect to a target. It can be used for numerical or categorical attributes.
Equi-Width Binning	You can use equi-width binning for numerical attributes. The range of values is computed by subtracting the minimum value from the maximum value, then the range of values is divided into equal intervals. You can specify the number of bins or it can be calculated automatically. Equi-width binning should usually be used with outlier treatment. (See "Outlier Treatment".)
Quantile Binning	Quantile binning is a numerical binning technique. Quantiles are computed using the SQL analytic function `NTILE`. The bin boundaries are based on the minimum values for each quantile. Bins with equal left and right boundaries are collapsed, possibly resulting in fewer bins than requested.

Normalization

Most normalization methods map the range of a single attribute to another range, typically 0 to 1 or -1 to +1.

Normalization is very sensitive to outliers. Without outlier treatment, most values will be mapped to a tiny range, resulting in a significant loss of information. (See"Outlier Treatment".)

Table 19-4 Normalization Methods in DBMS_DATA_MINING_TRANSFORM

Transformation	Description
Min-Max Normalization	This technique computes the normalization of an attribute using the minimum and maximum values. The shift is the minimum value, and the scale is the difference between the maximum and minimum values.
Scale Normalization	This normalization technique also uses the minimum and maximum values. For scale normalization, shift = 0, and scale = max{abs(max), abs(min)}.
Z-Score Normalization	This technique computes the normalization of an attribute using the mean and the standard deviation. Shift is the mean, and scale is the standard deviation.

Outlier Treatment

Outliers are extreme values, typically several standard deviations from the mean. To minimize the effect of outliers, you can Winsorize or trim the data.

Winsorizing involves setting the tail values of an attribute to some specified value. For example, for a 90% Winsorization, the bottom 5% of values are set equal to the minimum value in the 5th percentile, while the upper 5% of values are set equal to the maximum value in the 95th percentile.

Trimming sets the tail values to NULL. The algorithm treats them as missing values.

Outliers affect the different algorithms in different ways. In general, outliers cause distortion with equi-width binning and min-max normalization.

Table 19-5 Outlier Treatment Methods in DBMS_DATA_MINING_TRANSFORM

Transformation	Description
Trimming	This technique trims the outliers in numeric columns by sorting the non-null values, computing the tail values based on some fraction, and replacing the tail values with nulls.
Windsorizing	This technique trims the outliers in numeric columns by sorting the non-null values, computing the tail values based on some fraction, and replacing the tail values with some specified value.

Transparency

Oracle Data Mining provides a GET_MODEL_DETAILS function for each algorithm. These functions return descriptions of the categorical and numerical attributes used internally by the model.

Model details support transparency. Because of transparency, you can obtain meaningful information about a model and gain insight into the way it works.

Transparency ensures that predictions generated by the model are expressed in the original format; any transformations used internally by the algorithm are reversed when the model is applied. For example, if a numerical target is normalized during model build, the predictions in the scoring data are denormalized.

Model Details and the Build Data

Some of the attributes used by the model correspond to columns in the build data. However, because of logic specific to the algorithm, nested data, and transformations, many attributes do not correspond to columns.

A nested column in the training data is not interpreted as an attribute by the model. During the model build, Oracle Data Mining explodes nested columns, and each row (an attribute name/value pair) becomes an attribute.

Some algorithms, for example SVM and GLM, only operate on numeric attributes. Any non-numeric column in the build data is exploded into binary numerical attributes, one for each distinct value in the column (SVM). GLM does not generate a new attribute for the most frequent value in the original column. These binary attributes are set to one only if the column value for the case is equal to the value associated with the binary attribute.

Algorithms do not necessarily use all the columns in the training data. Some columns might be deemed unnecessary or harmful to the quality of the model. These columns are not used as attributes.

For all these reasons, the attributes listed in the model details might not resemble the columns of data used to train the model. However, attributes that undergo embedded transformations, whether initiated by ADP or by a user-specified transformation list, appear in the model details in their pre-transformed state, as close as possible to the original column values. Although the attributes are transformed when they are used by the model, they are visible in the model details in a form that can be interpreted by a user. This is an important aspect of transparency.

The GET_MODEL_TRANSFORMATIONS function can be used to obtain the embedded transformations associated with a model.

Reverse Transformations

In user-specified embedded transformations, the reverse transformation expression should be provided by the user. When ADP is enabled, the reversal is performed automatically. In many cases, this is a straight-forward process.

SVM and NMF are a bit more complicated than the other algorithms in regards to interpretability of results. They both have a set of coefficients, which are used in combination with the transformed attributes. These coefficients are relevant to the data on the transformed scale, not the original data scale.

Altering the Reverse Transformation Expression

The ALTER_REVERSE_EXPRESSION procedure can be used to change the reverse transformation generated by ADP for an attribute. You can use this function to improve readability of model details, specify labels for clusters generated by clustering models, or label the output of one-class SVM.

See Also:

Oracle Database PL/SQL Packages and Types Reference for information about GET_MODEL_DETAILS, GET_MODEL_TRANSFORMATIONS, and ALTER_REVERSE_EXPRESSION.