Oracle Data Mining Concepts 10g Release 1 (10.1), Part Number B10698-01
Data mining tasks in the ODM PL/SQL interface include model building, model testing, computing lift for a model, and model apply (scoring).
The development methodology for data mining using DBMS_DATA_MINING has two phases: first, analyze the problem and the data; second, develop a mining application.
After you have analyzed the problem and data, use the DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM packages to develop a PL/SQL application that performs the data mining: prepare the data, build a model, test it, and apply it to new data (scoring).
Note that the build, test, and score data sets must be prepared in an identical manner for mining results to be meaningful.
The DBMS_DATA_MINING package creates a mining model for a mining function using a specified mining algorithm that supports the function. The algorithm can be influenced using specific algorithm settings. Model build is synchronous in the PL/SQL interface. After a model is built, there is single-user, multi-session access to the model.
A model is identified by its name. Like tables in the database, a model has storage associated with it. The form, shape, and content of this storage are opaque to the user. However, the user can view the contents of a model (that is, the patterns and rules that constitute a mining model) using algorithm-specific GET_MODEL_DETAILS functions.
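For example, the details of a Naive Bayes classification model could be retrieved with the corresponding table function, as in the following sketch (the model name nb_model is illustrative only):
SELECT * FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_NB('nb_model'));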
The DBMS_DATA_MINING package supports Classification, Regression, Association, Clustering, and Feature Extraction. You specify the mining function as a parameter to the BUILD procedure.
Each mining function can be implemented using one or more algorithms. Table 7-1 provides a list of supported algorithms. There is a default algorithm for each mining function, but you can override this default through an explicit setting in the settings table.
Each algorithm has one or more settings that influence the way it builds the model. There is a default set of settings for each mining algorithm; these defaults are provided through the table function GET_DEFAULT_SETTINGS. To override the defaults, you specify your choice of algorithm and its settings in a settings table that you supply as input to the BUILD procedure.
The settings table is a simple relational table with a fixed schema; you can give it any name you choose. It must have exactly two columns: setting_name (VARCHAR2(30)) and setting_value (VARCHAR2(128)).
The values specified in a settings table override the default values. The values in the setting_name column are constants defined in the DBMS_DATA_MINING package. The values in the setting_value column are either predefined constants or the actual numerical values for the setting. Because the setting_value column is of type VARCHAR2, you must convert numerical inputs to strings using the TO_CHAR() function before inserting them into the settings table.
The following example shows how to create a settings table for an SVM classification model, and edit the individual values using SQL DML.
CREATE TABLE drugstore_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128));
-- override the default for the complexity factor for SVM Classification
INSERT INTO drugstore_settings (setting_name, setting_value)
VALUES (dbms_data_mining.svms_complexity_factor, TO_CHAR(0.081));
COMMIT;
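With the settings table in place, the build itself is performed with the CREATE_MODEL procedure. The following minimal sketch assumes a training table named drugstore_data with a case identifier column cust_id and a target column buyer; those names are used only for illustration.
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'drugstore_model',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'drugstore_data',      -- assumed training table
    case_id_column_name => 'cust_id',             -- assumed case identifier column
    target_column_name  => 'buyer',               -- assumed target column
    settings_table_name => 'drugstore_settings'); -- settings table created above
END;
/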
The table function GET_DEFAULT_SETTINGS returns all the default settings for mining functions and algorithms. If you intend to override all the default settings, you can create a seed settings table from these defaults and edit it using appropriate DML.
CREATE TABLE drug_store_settings AS
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'SVMS_%';
-- update the values using appropriate DML
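For example, one of the copied defaults could then be changed as follows (the new value is illustrative only):
UPDATE drug_store_settings
   SET setting_value = TO_CHAR(0.05)
 WHERE setting_name = 'SVMS_COMPLEXITY_FACTOR';
COMMIT;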
You can also create a settings table based on another model's settings using GET_MODEL_SETTINGS, as shown in the example below.
CREATE TABLE my_new_model_settings AS
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SETTINGS('my_other_model'));
Priors, or prior probabilities, are discussed in Section 3.1.2. You can specify priors in a prior probabilities table as an optional setting when building classification models.
The prior probabilities table has a fixed schema with two columns, target_value and prior_probability. For numerical targets, target_value is of type NUMBER; for categorical targets, it is of type VARCHAR2. In both cases prior_probability is of type NUMBER.
Specify the name of the prior probabilities table as the value of the setting_value column in the settings table, with DBMS_DATA_MINING.clas_priors_table_name as the corresponding setting_name, as shown below:
INSERT INTO drugstore_settings (setting_name, setting_value)
VALUES (DBMS_DATA_MINING.clas_priors_table_name, 'census_priors');
COMMIT;
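For reference, the census_priors table named above might be created and populated as in the following sketch; the column names follow the fixed schema described earlier, and the target values and probabilities are purely illustrative.
CREATE TABLE census_priors (
  target_value      VARCHAR2(32),
  prior_probability NUMBER);
INSERT INTO census_priors VALUES ('buyer', 0.05);
INSERT INTO census_priors VALUES ('nonbuyer', 0.95);
COMMIT;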
Costs are discussed in Section 3.1.1. You specify costs in a cost matrix table. The cost matrix table has a fixed schema with three columns: actual_target_value, predicted_target_value, and cost. For numerical targets, actual_target_value and predicted_target_value are of type NUMBER; for categorical targets, they are of type VARCHAR2. In both cases cost is of type NUMBER.
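As an illustration, a cost matrix table for a binary categorical target might be created as in the following sketch; the table name, target values, and costs are assumptions made only for this example.
CREATE TABLE drugstore_costs (
  actual_target_value    VARCHAR2(32),
  predicted_target_value VARCHAR2(32),
  cost                   NUMBER);
-- misclassifying a buyer as a nonbuyer is assumed to be four times as costly
-- as the reverse error; correct predictions cost nothing
INSERT INTO drugstore_costs VALUES ('buyer',    'nonbuyer', 4);
INSERT INTO drugstore_costs VALUES ('nonbuyer', 'buyer',    1);
INSERT INTO drugstore_costs VALUES ('buyer',    'buyer',    0);
INSERT INTO drugstore_costs VALUES ('nonbuyer', 'nonbuyer', 0);
COMMIT;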
The DBMS_DATA_MINING package enables you to evaluate the cost of predictions from classification models in an iterative manner during the experimental phase of mining, and to eventually apply the optimal cost matrix to predictions on the actual scoring data in a production environment.
The data input to each COMPUTE procedure in the package is the result generated from applying the model to test data. If you provide a cost matrix as an input, the COMPUTE procedure generates test results that take the cost matrix into account. This enables you to experiment with various costs for a given prediction against the same APPLY results, without rebuilding the model and applying it to the same test data for every iteration.
Once you arrive at an optimal cost matrix, you can provide this cost matrix to the RANK_APPLY procedure along with the results of APPLY on your scoring data. RANK_APPLY then returns your new data ranked by cost.
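A minimal sketch of such a RANK_APPLY call follows; the table names are illustrative, and the parameter names are shown as documented in the DBMS_DATA_MINING chapter of PL/SQL Packages and Types Reference, which you should consult for the complete signature.
BEGIN
  DBMS_DATA_MINING.RANK_APPLY(
    apply_result_table_name     => 'drugstore_apply_results',   -- output of APPLY on scoring data
    case_id_column_name         => 'cust_id',
    score_column_name           => 'prediction',
    score_criterion_column_name => 'probability',
    ranked_apply_table_name     => 'drugstore_ranked_results',
    top_N                       => 2,                           -- keep the top 2 predictions per case
    cost_matrix_table_name      => 'drugstore_costs');          -- the optimal cost matrix
END;
/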
There are several sets of mining operations supported by the DBMS_DATA_MINING package. The first set consists of DDL-like operations; the last set consists of utilities; the remaining sets are query-like operations in that they do not modify the model.
In addition to the operations, further capabilities are also provided as part of the Oracle Data Mining installation.
Mining results are either returned as result sets or persisted as fixed schema tables.
The BUILD operation creates a mining model. The GET_MODEL_DETAILS functions for each supported algorithm permit you to view the model. In addition, GET_MODEL_SIGNATURE and GET_MODEL_SETTINGS provide descriptive information about the model.
APPLY creates and populates a fixed schema table with a given name. The schema of this table varies based on the particular mining function, the algorithm, and the target attribute type (numerical or categorical).
RANK_APPLY takes an APPLY result table as input and generates another table with results ranked based on a top-N input and, for classification models, also based on cost. The schema of this table varies based on the particular mining function, the algorithm, and the target attribute type (numerical or categorical).
DBMS_DATA_MINING includes the following procedures for testing classification models: COMPUTE_CONFUSION_MATRIX, COMPUTE_LIFT, and COMPUTE_ROC.
These procedures are described in the DBMS_DATA_MINING chapter of PL/SQL Packages and Types Reference.
The rest of this section describes confusion matrix, lift, and receiver operating characteristics.
ODM supports the calculation of a confusion matrix to assess the accuracy of a classification model. A confusion matrix is a 2-dimensional square matrix. The row indexes of a confusion matrix correspond to actual values observed and used for model testing; the column indexes correspond to predicted values produced by applying the model to the test data. For any pair of actual/predicted indexes, the value indicates the number of records classified in that pairing. For example, a value of 25 for an actual value index of "buyer" and a predicted value index of "nonbuyer" indicates that the model incorrectly classified a "buyer" as a "nonbuyer" 25 times. A value of 516 for an actual/predicted value index of "buyer" indicates that the model correctly classified a "buyer" 516 times.
The predictions were correct 516 + 725 = 1241 times, and incorrect 25 + 10 = 35 times. The sum of the values in the matrix is equal to the number of scored records in the input data table. The number of scored records is the sum of correct and incorrect predictions, which is 1241 + 35 = 1276. The error rate is 35/1276 = 0.0274; the accuracy rate is 1241/1276 = 0.9725.
A confusion matrix provides a quick understanding of model accuracy and the types of errors the model makes when scoring records. It is the result of a test task for classification models.
[Figure: Confusion matrix (confmtrx.gif) showing actual versus predicted counts for the buyer/nonbuyer example.]
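A confusion matrix like the one above can be generated with the COMPUTE_CONFUSION_MATRIX procedure. The following is a minimal sketch; the table and column names are illustrative, and the parameter names are shown as documented in the DBMS_DATA_MINING chapter of PL/SQL Packages and Types Reference, which you should consult for the complete signature.
DECLARE
  v_accuracy NUMBER;
BEGIN
  DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX(
    accuracy                    => v_accuracy,
    apply_result_table_name     => 'drugstore_apply_results',   -- output of APPLY on test data
    target_table_name           => 'drugstore_test_targets',    -- actual target values of the test cases
    case_id_column_name         => 'cust_id',
    target_column_name          => 'buyer',
    confusion_matrix_table_name => 'drugstore_confusion_matrix',
    score_column_name           => 'prediction',
    score_criterion_column_name => 'probability',
    cost_matrix_table_name      => 'drugstore_costs');          -- optional cost matrix
  DBMS_OUTPUT.PUT_LINE('Accuracy: ' || v_accuracy);
END;
/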
ODM supports computing lift for a classification model. Lift can be computed for both binary (2 values) target fields and multiclass (more than 2 values) target fields. Given a designated positive target value (that is, the value of most interest for prediction, such as "buyer" or "has disease"), test cases are sorted according to how confidently they are predicted to be positive cases. Positive cases with the highest confidence come first, followed by positive cases with lower confidence; negative cases with the lowest confidence come next, followed by negative cases with higher confidence. Based on that ordering, the cases are partitioned into quantiles, and lift statistics are calculated for each quantile.
Cumulative targets can be computed from the quantities that are available in the LiftResultElement.
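As a simple illustration of the lift statistic itself: if 2% of all test cases are positive, but 8% of the cases in the first quantile are positive, then the lift for that quantile is 8 / 2 = 4, meaning the model concentrates positive cases in that quantile at four times the overall rate. (The percentages here are assumed for illustration.)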
Another useful method for evaluating classification models is Receiver Operating Characteristics (ROC) analysis. ROC curves are similar to Lift charts in that they provide a means of comparison between individual models and determine thresholds which yield a high proportion of positive hits. Specifically, ROC curves aid users in selecting samples by minimizing error rates. ROC was originally used in signal detection theory to gauge the true hit versus false alarm ratio when sending signals over a noisy channel.
The horizontal axis of an ROC graph measures the false positive rate as a percentage. The vertical axis shows the true positive rate. The top left hand corner is the optimal location in an ROC curve, indicating high TP (true-positive) rate versus low FP (false-positive) rate. The ROC Area Under the Curve is useful as a quantitative measure for the overall performance of models over the entire evaluation data set. The larger this number is for a specific model, the better. However, if the user wants to use a subset of the scored data, the ROC curves help in determining which model will provide the best results at a specific threshold.
In the example graph in Figure 7-2, Model A clearly has a higher ROC Area Under the Curve for the entire data set. However, if the user decides that a false positive rate of 40% is the maximum acceptable, Model B is better suited, since it achieves a higher true positive rate at that false positive rate.
[Figure 7-2: ROC curves for Model A and Model B (roc3.gif).]
Besides model selection, ROC also helps to determine a threshold value that achieves an acceptable trade-off between the hit (true positive) rate and the false alarm (false positive) rate. Selecting a point on the curve for a given model corresponds to selecting such a trade-off. The chosen threshold can then be applied as a post-processing step to achieve the desired performance with respect to the error rates. ODM models use a default threshold of 0.5; the confusion matrix reported by the test in the ODM Java interface is based on this default threshold.
The Oracle Data Mining ROC computation calculates the following statistics: probability threshold, true positives, true negatives, false positives, false negatives, true positive fraction, and false positive fraction.
The most commonly used metrics for regression models are root mean square error and mean absolute error. You can use the SQL queries described in the Oracle Data Mining Application Developer's Guide to compute those metrics.
The regression test results provide measures of model accuracy: root mean square error and mean absolute error of the prediction.
The following query calculates the root mean square error:
SELECT SQRT(AVG((A.prediction - B.target_column_name) *
                (A.prediction - B.target_column_name))) rmse
  FROM apply_results_table A, targets_table B
 WHERE A.case_id_column_name = B.case_id_column_name;
Given a targets table generated from the test data with columns case_id_column_name and target_column_name, an apply results table for regression with columns case_id_column_name and prediction, and an (optional) normalization table with columns attribute_name, scale, and shift, the query for mean absolute error is:
SELECT /*+PARALLEL(T) PARALLEL(A)*/
       AVG(ABS(T.actual_value - (A.prediction * T.scale + T.shift))) mean_absolute_error
  FROM (SELECT B.case_id_column_name,
               (B.target_column_name * N.scale + N.shift) actual_value,
               N.scale,
               N.shift
          FROM targets_table B,
               normalization_table N
         WHERE N.attribute_name = 'target_column_name') T,
       apply_results_table A
 WHERE A.case_id_column_name = T.case_id_column_name;
Replace the placeholder table and column names (targets_table, apply_results_table, normalization_table, case_id_column_name, target_column_name) with the actual names you have chosen. If the data is not normalized, you can eliminate the normalization references from the query.
Oracle supports data mining model export and import between Oracle databases or schemas to provide a way to move models. DBMS_DATA_MINING does not support model export and import via PMML.
Model export/import is supported at different levels, as follows:
When a DBA exports a full database using the expdp utility, all the existing data mining models in the database will be exported. When a DBA imports a database dump using the impdp utility, all the data mining models in the dump will be restored.
When a user or DBA exports a schema using expdp, all the data mining models in the schema will be exported. When the user or DBA imports the schema dump using impdp, all the models in the dump will be imported.
A user can export specified models using DBMS_DATA_MINING.export_model() and import specified models using DBMS_DATA_MINING.import_model().
DBMS_DATA_MINING supports export and import of models based on Oracle Data Pump. When you export a model, the tables that constitute the model and the associated metadata are written to a dump file set that consists of one or more files. When you import a model, the tables and metadata are retrieved from the dump file set and restored in the new database.
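A minimal sketch of a model-level export follows. It assumes a directory object named DM_DUMP_DIR has already been created and is writable by the user; the file name, directory, and model name are illustrative, and the parameter names are shown as documented in the DBMS_DATA_MINING chapter of PL/SQL Packages and Types Reference.
BEGIN
  DBMS_DATA_MINING.EXPORT_MODEL(
    filename     => 'drugstore_model_exp',            -- base name of the dump file set
    directory    => 'DM_DUMP_DIR',                    -- assumed directory object
    model_filter => 'name = ''DRUGSTORE_MODEL''');    -- export only this model
END;
/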
For information about requirements for the application system, see Chapter 9.
For detailed information about model export/import, see the Oracle Data Mining Application Developer's Guide.