Oracle® Database PL/SQL Packages and Types Reference 10g Release 2 (10.2) Part Number B14258-01 |
|
|
View PDF |
The DBMS_DATA_MINING_TRANSFORM
package contains a set of data transformation utilities that prepare data for data mining. Once you have prepared the data, you can use it to build and score models using the DBMS_DATA_MINING
package or the Oracle Data Mining (ODM) Java API. You can also score models using the SQL scoring functions for data mining.
DBMS_DATA_MINING_TRANSFORM
is an open-source PL/SQL package. You can use the routines in this package to prepare your data for data mining, or you can develop your own routines based on the public source code.
The source code, interface definitions, and inline documentation are available in $ORACLE_HOME/rdbms/admin/dbmsdmxf.sql
.
See Also:
|
This chapter contains the following topics:
Using DBMS_DATA_MINING_TRANSFORM
Overview
Types
Transformation Methods
Steps in Defining a Transformation
Sample Transformation
This section contains topics which relate to using the DBMS_DATA_MINING_TRANSFORM
package.
The DBMS_DATA_MINING_TRANSFORM
package serves two purposes:
It is a basic utility package for preprocessing data for data mining.
It is an open-source learning tool that shows how to use SQL to perform common data transformations for data mining. The inputs and outputs for routines in this package are simple views and tables that are not Oracle proprietary. You can study the source code to help you create data transforms that are specific to your own application data. The source code for this package is in dbmsdmxf.sql
in the rdbms/admin
directory under $ORACLE_HOME
.
Note: Use of theDBMS_DATA_MINING_TRANSFORM package is not required by Oracle Data Mining. You can develop your own preprocessing utilities or use third-party tools customized for your application. |
The main principle behind the design of DBMS_DATA_MINING_TRANSFORM
is the fact that SQL has enough power to perform most of the common mining transforms efficiently. For example, binning can be done using CASE
expressions or DECODE
functions, and linear normalization is a simple algebraic expression of the form (x - shift)/scale where x is the data value that is being transformed.
However, the queries that perform the transforms can be rather lengthy. So it is desirable to have some convenience routines that will help in generating queries. Thus, the goal of this package is to provide query generation services for the most common mining transforms, as well as to provide a framework that can be easily extended for implementing other transforms.
Note on Notation: This chapter uses standard interval notation for number sets:
(b is in the set [a,b]; b is not in the set [a,b).) Subscripts do not conform to standard notation; instead "X_N" is used for " XN." |
See Also: Sample data mining programs are available with Oracle Data Mining. These programs include sample data transformations usingDBMS_DATA_MINING_TRANSFORM . Instructions for using the sample programs are provided in the Oracle Data Mining Administrator's Guide. |
Table 26-1 Summary of Data Types
Data Type | Purpose |
---|---|
Column_List |
List of column names representing mining attributes, defined to be VARRAY(1000) of VARCHAR2(32) . |
The DBMS_DATA_MINING_TRANSFORM
package supports binning, normalization, winsorizing and clipping, and missing value transformations.
Binning involves mapping both continuous and discrete values to discrete values of reduced cardinality. For example, the age of persons can be binned into discrete numeric bins: 1-20 to 1, 21-40 to 2, and so on. Popular car manufacturers such as Ford, Chrysler, BMW, Volkswagen can be binned into discrete categorical bins: {Ford, Chrysler} to US_Car_Makers, and {BMW, Volkswagen} to European_Car_Makers.
DBMS_DATA_MINING_TRANSFORM
supports binning for both categorical and numerical attributes. Categorical attributes have VARCHAR2/CHAR
data types; numerical attributes have NUMBER
data types.
The bin definition for each attribute is computed based on the occurrence frequency of values that are computed from the data. The user specifies a particular number of bins, say N. Each of the bins bin_1,..., bin_N corresponds to the values with top frequencies. The bin bin_N+1 corresponds to all remaining values.
The bin definition for each attribute is computed based on the minimum and maximum values that are computed from the data. The user specifies a particular number of bins, say N. Each of the bins bin_1,..., bin_N span ranges of equal width of size inc = (max – min)/N, bin_0 spans (–inf, min) and bin_(N+1) spans (max, + inf). When N is not specified, it can be estimated from the data.
The definition for each relevant column is computed based on the minimum values for each quantile, where quantiles are computed from the data using NTILE
function. Bins bin_1,..., bin_N span the following ranges: bin_1 spans [min_1,min_2]; bin_2,..., bin_i,..., bin_N-1 span (min_i, min_(i+1)] and bin_N spans (min_N, max_N]. Bins with equal left and right boundaries are collapsed.
Normalization involves scaling continuous values down to a specific range, such as [–1.0,1.0] or [0.0,1.0] such that x_new = (x_old-shift)/scale
. Normalization applies only to numerical attributes.
The normalization definition for each attribute is computed based on the minimum and maximum values of the data. The values for shift
and scale
are shift = min
, and scale = (max - min)
respectively.
The normalization definition for each attribute is computed based on the minimum and maximum values of the data. The values for shift
and scale
are shift = 0
and scale = max{abs(max), abs(min)}.
The normalization definition for each attribute is computed based on the values for mean and standard deviation that are computed from the data. The values for shift
and scale
are computed to be shift = mean
, and scale = standard deviation
respectively.
Some computations on attribute values can be significantly affected by extreme values. One approach to achieving a more robust computation is to either Winsorize or trim the data as a preprocessing step.
Winsorizing involves setting the tail values of a particular attribute to some specified value. For example, for a 90% Winsorization, the bottom 5% are set equal to the minimum value in the 6th percentile, while the upper 5% are set equal to the value corresponding to the maximum value in the 95th percentile.
Trimming "removes" the tails in the sense that trimmed values are ignored in further values. This is achieved by setting the tails to NULL
.
Missing Value treatment involves replacing NULL
values in the data. Missing Value treatment is recommended when the fraction of missing values is high compared to the overall attribute value set. If the data contains relatively few missing values, you might choose to simply delete those records for the purpose of data mining.
If you want to replace missing values and you know or suspect what the values should be, you can use that knowledge to replace the NULLs
. If you suspect that the NULLs
may be random omissions, you can determine a meaningful value for them.
DBMS_DATA_MINING_TRANSFORM
INSERT
routines handle missing values by replacing NULL
s in numerical attributes with the mean attribute value, and by replacing NULL
s in categorical attributes with the mode.
DBMS_DATA_MINING_TRANSFORM
provides routines that define CREATE
, INSERT
, and XFORM
operations. To define a data transformation, perform the following steps:
Use a CREATE
routine to create a transformation definition table with a pre-defined set of columns.
Use an INSERT
routine to populate the table with transformation definitions for selected attributes.
Use an XFORM
routine to create a view of the transformation definition table.
Use the following procedures to create transformation definition tables:
CREATE_BIN_NUM
and CREATE_BIN_CAT
to create bin definition tables.
CREATE_NORM_LIN
to create a normalization definition table.
CREATE_CLIP
to create a clipping definition table.
CREATE_MISS_NUM
and CREATE_MISS_CAT
to create missing value treatment definition tables.
Usually, the consistency and integrity of transform definition tables is guaranteed by the creation process. Alternatively, it can be achieved by leveraging an integrity constraints mechanism. This can be done either by altering the tables created with CREATE
routines, or by creating the tables manually with the necessary integrity constraints.
The most common way of defining a transformation (populating the transformation definition tables) for each attribute is based on data inspection using some predefined methods (also known as automatic transform definition). Some of the most popular methods have been captured by the INSERT
routines in DBMS_DATA_MINING_TRANSFORM
. For example, the z-score normalization method estimates mean and standard deviation from the data to be used as a shift and scale parameters of the linear normalization transform.
Use the following procedures to populate the transformation definition tables:
INSERT_BIN_NUM_EQWIDTH
, INSERT_BIN_NUM_QTILE
, and INSERT_AUTOBIN_NUM_EQWIDTH
for binning numerical attributes, and INSERT_BIN_CAT_FREQ
for binning categorical attributes.
INSERT_NORM_LIN_MINMAX
, INSERT_NORM_LIN_SCALE
, or INSERT_NORM_LIN_ZSCORE
for normalizing numerical attributes.
INSERT_CLIP_WINZOR_TAIL
for winsorizing numerical attributes or INSERT_CLIP_TRIM_TAIL
for clipping numerical attributes.
INSERT_MISS_NUM_MEAN
for missing value treatment of numerical attributes, and INSERT_MISS_CAT_MODE
for missing value treatment of categorical attributes.
You can invoke these routines several times to transform all relevant attributes from various data sources until the definition table fully represents all mining attributes for a given problem.
After performing automatic transform definitions, some or all of the definitions can be adjusted by issuing SQL DML statements against the transform definition tables, thus providing virtually infinite flexibility in defining custom transforms.
The INSERT
routines enable flexible transformation definitions in several ways:
The data provided to the INSERT
routines does not necessarily have to be the data used for a particular model creation. It can be any data that contains adequate representation of the mining attributes.
The INSERT
routines can be called any number of times against the same or different dataset until all the attributes have their transformations defined. You can selectively exclude one or more attributes for a particular iteration of the INSERT
. In the most extreme case, each individual attribute can potentially have a unique transformation definition.
You do not have to separately feed in numerical and categorical attributes, since categorical binning automatically skips over NUMBER
columns in your table, and numerical binning, normalization, and clipping routines skip over VARCHAR2/CHAR
columns in your input data.
Query generation is driven by the simple transform-specific definition tables with predefined columns. Query generation routines should be viewed as macros, and transformation definition tables as parameters used in macro expansions. Similar to using #define macros in the C language, the invoker is responsible for ensuring the correctness of the expanded macro, that is, that the result is a valid SQL query.
You can generate the views representing the transformation queries with the following procedures:
XFORM_BIN_NUM
, and XFORM_BIN_CAT
for binning
XFORM_NORM_LIN
for normalization
XFORM_CLIP
for clipping
XFORM_MISS_NUM
and XFORM_MISS_CAT
for missing value treatment
If your data contains a combination of numerical and categorical attributes, you must essentially feed the results of one transformation step to the next step. For example, the results of XFORM_BIN_CAT
can be fed to XFORM_BIN_NUM
or vice versa. The order is irrelevant since numerical and categorical transforms work on disjoint sets of attributes.
Given a dataset for a particular mining problem, any preprocessing and transformations on the mining data must be uniform across all mining operations. In other words, if the build data is preprocessed according to a particular transformation definition, then it follows that the test data and the scoring data must be preprocessed using the same definition.
The general usage of routines in this package can be explained using this example. Assume that your input table contains both numerical and categorical data that requires binning. A possible sequence of operations will be:
Invoke CREATE_BIN_NUM
to generate an empty numerical bin definition table.
Invoke INSERT_BIN_NUM_EQWIDTH
to define the transformations for all numerical attributes in the build data input. (For the sake of simplicity, let us assume that all numerical values are to be binned into 10 bins.) If you are binning for an O-Cluster model, use INSERT_AUTOBIN_NUM_EQWIDTH
.
Next invoke XFORM_BIN_NUM
with the numerical bin table and the build data table as inputs. The resulting object is a view that represents a SQL query against the build data table that performs numerical binning. Assume that you have named this result object build_bin_num_view
.
Since you still have the categorical attributes to be binned, invoke CREATE_BIN_CAT
to create a categorical bin definition table.
Next, invoke INSERT_BIN_CAT_FREQ
to define the transforms for all categorical attributes. (For the sake of simplicity, let us assume that all categorical attributes are to be binned into 10 bins.)
As the final step, invoke XFORM_BIN_CAT
with the categorical bin table and the view name provided by XFORM_BIN_NUM
, namely build_bin_num_view
, as the inputs. This essentially amounts to combining the transformations from both stages.
The object resulting from this operation is a view that represents a SQL query against your build data table, influenced by the contents of the bin definition tables. Provide this view name as the data input to the CREATE_MODEL
procedure in the DBMS_DATA_MINING
package.
If this happens to be a classification model, and you want to APPLY
this model to scoring data, you must prepare the scoring data similar to the build data. You can achieve this in two simple steps:
First, call XFORM_BIN_NUM
with the scoring data table and the numerical bin boundary table as inputs. The resulting object is a view that represents an SQL query against your scoring data table, influenced by the contents of the numerical bin boundary table. Assume that you have named this result object apply_bin_num_view
.
As the next and final step, invoke XFORM_BIN_CAT
with the categorical bin table and the view name provided by XFORM_BIN_NUM
, namely apply_bin_num_view
, as the inputs.
The object resulting from this operation is now a view that represents a SQL query against your scoring data table, influenced by the contents of the bin definition tables. Provide this view name as the data input to the APPLY
procedure in the DBMS_DATA_MINING
package.
Table 26-2 DBMS_DATA_MINING_TRANSFORM Package Subprograms
Subprogram | Purpose |
---|---|
CREATE_BIN_CAT Procedure |
Creates a categorical bin definition table |
CREATE_BIN_NUM Procedure |
Creates a numerical bin definition table |
CREATE_CLIP Procedure |
Creates a clipping definition table |
CREATE_MISS_CAT Procedure |
Creates a categorical missing value treatment definition table |
CREATE_MISS_NUM Procedure |
Creates a numerical missing value treatment definition table |
CREATE_NORM_LIN Procedure |
Creates a normalization definition table |
INSERT_AUTOBIN_NUM_EQWIDTH Procedure | Populates the numerical bin definition table, using the number of bins estimated from the data |
INSERT_BIN_CAT_FREQ Procedure | Populates the categorical bin definition table, applying frequency-based binning to the categorical input data |
INSERT_BIN_NUM_EQWIDTH Procedure |
Populates the numerical bin definition table, applying equi-width binning to the numerical input data |
INSERT_BIN_NUM_QTILE Procedure |
Populates the numerical bin definition table, applying quantile binning to the numerical input data |
INSERT_CLIP_TRIM_TAIL Procedure |
Populates the clipping definition table, applying trimming based on tail fraction to the numerical input data |
INSERT_CLIP_WINSOR_TAIL Procedure |
Populates the clipping definition table, applying Winsorizing based on tail fraction to the numerical input data |
INSERT_MISS_CAT_MODE Procedure |
Populates the categorical missing value treatment definition table, applying the mode to each missing value |
INSERT_MISS_NUM_MEAN Procedure |
Populates the numerical missing value treatment definition table, applying the mean to each missing value |
INSERT_NORM_LIN_MINMAX Procedure |
Populates the normalization definition table, applying min-max normalization to the numerical input data |
INSERT_NORM_LIN_SCALE Procedure | Populates the normalization definition table, applying scale normalization to the numerical input data |
INSERT_NORM_LIN_ZSCORE Procedure |
Populates the normalization definition table applying z-score normalization to the numerical input data |
XFORM_BIN_CAT Procedure |
Creates the view representing the transformed output with binned categorical data |
XFORM_BIN_NUM Procedure |
Creates the view representing the transformed output with binned numerical data |
XFORM_CLIP Procedure |
Creates the view representing the transformed output with clipped numerical data |
XFORM_MISS_CAT Procedure |
Creates the view representing the transformed output with categorical missing value treatment |
XFORM_MISS_NUM Procedure |
Creates the view representing the transformed output with numerical missing value treatment |
XFORM_NORM_LIN Procedure |
Creates the view representing the transformed output with normalized numerical data |
This procedure creates a categorical binning definition table. This table is used as input to the INSERT_BIN_CAT_FREQ
and XFORM_BIN_CAT
procedures.
Syntax
DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_CAT ( bin_table_name IN VARCHAR2, bin_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-3 CREATE_BIN_CAT Procedure Parameters
Parameter | Description |
---|---|
bin_table_name |
Name of the bin definition table |
bin_schema_name |
Name of the schema hosting the bin definition table |
Usage Notes
The generated bin definition table will have the following columns.
Column Name | Data Type |
---|---|
col |
VARCHAR2(30) |
val |
VARCHAR2(4000) |
bin |
VARCHAR2(4000) |
Examples
BEGIN DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_CAT('build_bin_cat_table'); END;
This procedure creates a numerical binning definition table. This table is used as input to the INSERT_BIN_NUM_EQWIDTH
, INSERT_BIN_NUM_QTILE
, INSERT_AUTOBIN_NUM_EQWIDTH
, and XFORM_BIN_NUM
procedures.
Syntax
DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_NUM ( bin_table_name IN VARCHAR2, bin_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-4 CREATE_BIN_NUM Procedure Parameters
Parameter | Description |
---|---|
bin_table_name |
Name of the bin definition table |
bin_schema_name |
Name of the schema hosting the bin definition table |
Usage Notes
The generated bin definition table will have the following columns.
Column Name | Data Type |
---|---|
col |
VARCHAR2(30) |
val |
NUMBER |
bin |
VARCHAR2(4000) |
Examples
BEGIN
DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_NUM('build_bin_num_table');
END;
This procedure creates a clipping definition table. This table is used as input to the INSERT_CLIP_WINSOR_TAIL
, INSERT_CLIP_TRIM_TAIL
, and XFORM_CLIP
procedures.
Syntax
DBMS_DATA_MINING_TRANSFORM.CREATE_CLIP ( clip_table_name IN VARCHAR2, clip_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-5 CREATE_CLIP Procedure Parameters
Parameter | Description |
---|---|
clip_table_name |
Name of the clipping definition table |
clip_schema_name |
Name of the schema hosting the clipping definition table |
Usage Notes
The generated clipping definition table will have the following columns.
Column Name | Data Type |
---|---|
col |
VARCHAR2(30) |
lcut |
NUMBER |
lval |
NUMBER |
rcut |
NUMBER |
rval |
NUMBER |
Examples
BEGIN DBMS_DATA_MINING_TRANSFORM.CREATE_CLIP('build_clip_table'); END;
This procedure creates a categorical missing value treatment definition table. This table is used as input to the INSERT_MISS_CAT_MODE
procedure.
Syntax
DBMS_DATA_MINING_TRANSFORM.CREATE_MISS_CAT ( miss_table_name IN VARCHAR2, miss_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-6 CREATE_MISS_CAT Procedure Parameters
Parameter | Description |
---|---|
miss_table_name |
Name of the categorical missing value treatment definition table |
miss_schema_name |
Name of the schema hosting the categorical missing value treatment definition table |
Usage Notes
The generated categorical missing value treatment definition table will have the following columns.
Column Name | Data Type |
---|---|
col |
VARCHAR2(30) |
val |
VARCHAR2(4000) |
Examples
BEGIN
DBMS_DATA_MINING_TRANSFORM.CREATE_MISS_CAT('build_miss_cat_table');
END;
This procedure creates a numerical missing value treatment definition table. This table is used as input to the INSERT_MISS_NUM_MEAN
procedure.
Syntax
DBMS_DATA_MINING_TRANSFORM.CREATE_MISS_NUM ( miss_table_name IN VARCHAR2, miss_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-7 CREATE_MISS_NUM Procedure Parameters
Parameter | Description |
---|---|
miss_table_name |
Name of the numeric missing value treatment definition table |
miss_schema_name |
Name of the schema hosting the numeric missing value treatment definition table |
Usage Notes
The generated numeric missing value definition table will have the following columns.
Column Name | Data Type |
---|---|
col |
VARCHAR2(30) |
val |
NUMBER |
Example
BEGIN DBMS_DATA_MINING_TRANSFORM.CREATE_MISS_NUM('build_miss_num_table'); END;
This procedure creates a linear normalization definition table. This table is used as input to the INSERT_NORM_LIN_MINMAX
, INSERT_NORM_LIN_SCALE
, INSERT_NORM_LIN_ZSCORE
, and XFORM_NORM_LIN
procedures.
Syntax
DBMS_DATA_MINING_TRANSFORM.CREATE_NORM_LIN ( norm_table_name IN VARCHAR2, norm_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-8 CREATE_NORMALIZE_LIN Procedure Parameters
Parameter | Description |
---|---|
norm_table_name |
Name of the normalization definition table |
norm_schema_name |
Name of the schema hosting the normalization definition table |
Usage Notes
The generated linear normalization definition table will have the following columns.
Column Name | Data Type |
---|---|
col |
VARCHAR2(30) |
shift |
NUMBER |
scale |
NUMBER |
Examples
BEGIN
DBMS_DATA_MINING_TRANSFORM.CREATE_NORM_LIN('build_norm_table');
END;
This procedure finds the numerical binning definition for every numerical column in the data table that is not specified in the exclusion list and inserts the definition into the numerical binning definition table that was created using CREATE_BIN_NUM
. Based on the statistical information it collects on the input data, this procedure calculates the number of bins.
Definition for each relevant column is computed based on the minimum and maximum values that are computed from the data table.
N, the number of bins, is computed for each column separately and is based on the number of non-NULL
values (cnt
), the maximum (max
), the minimum (min
), the standard deviation (dev
) and the constant C=3.49/0.9 as follows:
N=floor(power(cnt,1/3)*(max-min)/(c*dev))
Each of the bin_num
(= N) bins bin_1,..., bin_N span ranges of equal width inc = (max – min) / N where bin_I = I when N > 0 or bin_I = N+1–I when N < 0, and bin_0 = bin_(N+1) = NULL
. The values of the val
column are rounded to round_num
significant digits prior to scoring them in the definition table.
The parameter bin_num
is used to adjust N to be at least bin_num
. No adjustment is done when bin_num
is NULL
or 0. The parameter max_bin_num
is used to adjust N to be at most max_bin_num
. No adjustment is done when bin_num
is NULL
or 0. For columns with all integer values (discrete columns), N is adjusted to be at most the maximum number of distinct values in the observed range max-min+1
.
The parameter sample_size
is used to adjust cnt
to be at most sample_size
. No adjustment is done when sample_size
is NULL
or 0.
Syntax
DBMS_DATA_MINING_TRANSFORM.INSERT_AUTOBIN_NUM_EQWIDTH ( bin_table_name IN VARCHAR2, data_table_name IN VARCHAR2, bin_num IN PLS_INTEGER DEFAULT 3, max_bin_num IN PLS_INTEGER DEFAULT 100, exclude_list IN Column_List DEFAULT NULL, round_num IN PLS_INTEGER DEFAULT 6, sample_size IN PLS_INTEGER DEFAULT 50000, bin_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-9 INSERT_AUTOBIN_EQWIDTH Procedure Parameters
Parameter | Description |
---|---|
bin_table_name |
Name of the categorical bin table generated using CREATE_BIN_NUM procedure |
data_table_name |
Name of the table containing the data |
bin_num |
Minimum number of bins; default number is 3 |
max_bin_num |
Maximum number of bins that sets the upper limit for estimates of bin numbers; default is 100 |
exclude_list |
List of columns (attributes) to be excluded from this iteration of the binning process; categorical attributes are automatically excluded |
round_num |
Number of significant digits; default is 6 |
sample_size |
Size of the data sample; default is 50,000 |
bin_schema_name |
Name of the schema hosting the bin definition table; default is user schema |
data_schema_name |
Name of the schema hosting the table with data; default is user schema |
Usage Notes
For a given input table, you can call this routine several times with different specifications for number of bins for a given input table. For each call, you can selectively exclude attributes (that is, column names) using the exclude_list
parameter for a particular binning specification.
Columns with all NULL
values or only one unique value are ignored. The sign of bin_num
, max_bin_num
, and sample_size
have no effect on the result; absolute values are used. The value adjustment of N is done in the following order: First bin_num
, next, max_bin_num
, and, finally, discrete column adjustment.
Examples
The simplest invocation of this routine populates bin definitions in the num_bin_table
for all the numerical attributes found in build_table
.
BEGIN
DBMS_DATA_MINING_TRANSFORM.INSERT_AUTOBIN_NUM_EQUIWIDTH( 'num_bin_table', 'build_data_table');
END;
/
This procedure finds the categorical binning definition for every VARCHAR2
and CHAR
column in the data table that is not specified in the exclusion list and inserts the definition into the categorical binning definition table created using CREATE_BIN_CAT
.
Definition for each relevant column is computed based on the occurrence frequency of column values that are computed from the data table. Each of the bin_num(N) bins bin_1, ..., bin_N corresponds to the values with top frequencies when N > 0 or bottom frequencies when N < 0, and bin_(N+1) to all remaining values, where bin_I = I. Ordering ties among identical frequencies are broken by ordering on column values (ASC for N > 0 or DESC for N < 0). When the number of distinct values C < N only C+1 bins will be created.
The parameter default_num
(D) is used for pruning based on the number of values that fall into the default bin. When D > 0 only columns that have at least D defaults are kept while others are ignored. When D < 0 only columns that have at most D values are kept. No pruning is done when D is NULL
or D = 0. Parameter bin_support (SUP) is used for restricting bins to frequent (SUP > 0) values frq >= SUP*tot, or infrequent (SUP < 0) ones frq <= –SUP*tot, where frq is a given value count and tot is a sum of all counts as computed from the data. No support filtering is done when SUP is NULL
or when SUP = 0.
Syntax
DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_CAT_FREQ ( bin_table_name IN VARCHAR2, data_table_name IN VARCHAR2, bin_num IN PLS_INTEGER DEFAULT 9, exclude_list IN Column_List DEFAULT NULL, default_num IN PLS_INTEGER DEFAULT 2, bin_support NUMBER DEFAULT NULL, bin_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-10 INSERT_BIN_CAT_FREQ Procedure Parameters
Parameter | Description |
---|---|
bin_table_name |
Name of the categorical bin table generated using CREATE_BIN_CAT procedure |
data_table_name |
Name of the table containing the data |
bin_num |
Number of bins |
exclude_list |
List of columns (attributes) to be excluded from this iteration of the binning process |
default_num |
Number of default values |
bin_support |
Bin support as a fraction |
bin_schema_name |
Name of the schema hosting the bin definition table |
data_schema_name |
Name of the schema hosting the table with data |
Usage Notes
For a given input table, you can iteratively call this routine several times with different specifications for number of bins for a given input table. For each iteration, you can selectively exclude attributes (that is, column names) using the exclude_list
parameter for a particular binning specification.
Columns with all NULL
s are ignored. No bin definitions are populated when bin_num = 0, or bin_num, is NULL
.
Examples
The simplest invocation of this routine populates bin definitions in the cat_bin_table
for all the categorical attributes found in build_table
.
BEGIN
DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM( 'cat_bin_table', 'build_table');
END; /
This procedure finds the numerical binning definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition into the numerical binning definition table that was created using CREATE_BIN_NUM
.
Definition for each relevant column is computed based on the minimum and maximum values that are computed from the data table. Each of the bin_num (= N) bins bin_1,..., bin_N span ranges of equal width inc = (max – min) / N where bin_I = I when N > 0 or bin_I = N+1–I when N < 0, and bin_0 = bin_(N+1) = NULL
.
The values of the val
column in the bin definition table are rounded to round_num
significant digits. For more information, see the Usage Notes.
Syntax
DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_EQWIDTH ( bin_table_name IN VARCHAR2, data_table_name IN VARCHAR2, bin_num IN PLS_INTEGER DEFAULT 10, exclude_list IN Column_List DEFAULT NULL, round_num IN PLS_INTEGER DEFAULT 6, bin_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-11 INSERT_BIN_EQWIDTH Procedure Parameters
Parameter | Description |
---|---|
bin_table_name |
Name of the numerical bin table generated using CREATE_BIN_NUM procedure |
data_table_name |
Name of the table containing the data |
bin_num |
Number of bins |
exclude_list |
List of columns (attributes) to be excluded from this iteration of the binning process |
round_num |
Number of significant digits. See Usage Notes. |
bin_schema_name |
Name of the schema hosting the bin definition table |
data_schema_name |
Name of the schema hosting the table with data |
Usage Notes
For a given input table, you can iteratively call this routine with different specifications for number of bins for a given input table. For each iteration, you can selectively exclude attributes (that is, column names) using the exclude_list
parameter for a particular binning specification.
Columns with all NULL
values or only one unique value are ignored. No bin definitions are populated when bin_num
= 0, or bin_num
is NULL
.
For example, when N=2, col='mycol', min=10, and max = 21, the following three rows are inserted into the definition table (inc = 5.5):
COL VAL BIN ----- ----- ----- mycol 10 NULL mycol 15.5 1 mycol 21 2
The round_num
parameter specifies how to round the number in the VAL
column of the definition table. When round_num
is positive, it specifies the most significant digits to retain. When round_num
is negative, it specifies the least significant digits to remove. In both cases, the result is rounded to the specified number of digits. When round_num
is 0, the value is unchanged.
For example, a value of 308.162 would be rounded as follows.
For a value of 308.162: when round_num = 1 result is 300 when round_num = 2 result is 310 when round_num = 3 result is 308 when round_num = 0 result is 308.162 when round_num = -1 result is 308.16 when round_num = -2 result is 308.2 when round_num = NULL result is NULL
Examples
The simplest invocation of this routine populates bin definitions in the num_bin_table
for all the numerical attributes found in build_table
.
BEGIN
DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM( 'num_bin_table', 'build_table');
END;
/
This procedure finds a numerical binning definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition into the binning definition table that was created using CREATE_BIN_NUM
.
The definition for each relevant column is computed based on the minimum values for each quantile, where quantiles are computed from the data using NTILE
function. Bins bin_1,..., bin_N span the following ranges: bin_1 spans [min_1,min_2]; bin_2,..., bin_i,..., bin_N-1 span (min_i, min_(i+1)] and bin_N spans (min_N, max_N]. Bins with equal left and right boundaries are collapsed.
Syntax
DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_QTILE ( bin_table_name IN VARCHAR2, data_table_name IN VARCHAR2, bin_num IN PLS_INTEGER DEFAULT 10, exclude_list IN Column_List DEFAULT NULL, bin_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-12 INSERT_BIN_NUM_QTILE Procedure Parameters
Parameter | Description |
---|---|
bin_table_name |
Name of the numerical binning definition table generated using the CREATE_BIN_NUM procedure |
data_table_name |
Name of the table containing the data |
bin_num |
Number of bins |
exclude_list |
List of columns (attributes) to be excluded from this iteration of the binning process |
bin_schema_name |
Name of the schema hosting the numerical binning definition table |
data_schema_name |
Name of the schema hosting the table with data |
Usage Notes
For a given input table, you can iteratively call this routine several times with different specifications for bin_num
for a given input table. For each iteration, you can selectively exclude attributes (that is, column names) using the exclude_list
parameter for a particular specification. Columns with all NULL
values are ignored.
Example 1. When N = 4, col='mycol', and data is {1,2,2,2,2,3,4}, the following three rows are inserted into the definition table:
COL VAL BIN ----- ----- ----- mycol 1 NULL mycol 2 1 mycol 4 2
Here quantities are {1,2}, {2,2}, {2,3}, {4} and min(1) = 1, min(2) = 2, min(3) = 2, min(4) = 4, max(4) = 4, and ranges are [1,2], (2,2], (2,4], (4,4]. After collapsing [1,2] and (2,4].
Examples
The simplest invocation of this routine populates numerical binning definitions in the num_bin_table
for all the numerical attributes found in build_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_QTILE( 'num_bin_table', 'build_table'); END;
This procedure finds the trimming definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition into the clipping definition table that was created using CREATE_CLIP
.
The definition for each relevant column is computed based on the non-NULL
values sorted in ascending order such that val(1) < val(2) <... < val(N), where N is a total number of non-NULL
values in a column:
lcut = val(1+floor(N*q)) lval = NULL rcut = val(N–floor(*N*q)) rval = NULL
where q = ABS(NVL(tail_frac,0)). Nothing is done when q >= 0.5.
Syntax
DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_TRIM_TAIL ( clip_table_name IN VARCHAR2, data_table_name IN VARCHAR2, tail_frac IN NUMBER DEFAULT 0.025, exclude_list IN Column_List DEFAULT NULL, clip_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-13 INSERT_CLIP_TRIM_TAIL Procedure Parameters
Parameter | Description |
---|---|
clip_table_name |
Name of the clipping definition table generated using the CREATE_CLIP procedure |
data_table_name |
Name of the table containing the data |
tail_frac |
Tail fraction |
exclude_list |
List of columns (attributes) to be excluded from this iteration of the clipping process |
clip_schema_name |
Name of the schema hosting the clipping definition table |
data_schema_name |
Name of the schema hosting the table with data |
Usage Notes
For a given input table, you can iteratively call this routine several times with different specifications for tail_frac
for a given input table. For each iteration, you can selectively exclude attributes (that is, column names) using the exclude_list
parameter for a particular specification.
Example 1. When q = 0.2, col='mycol', and data is {1,2,2,2,3,4,4}, the following row is inserted into the definition table:
COL LCUT LVAL RCUT RVAL ----- ----- ----- ----- ----- mycol 2 NULL 4 NULL
Here 1 + floor(N*q) = 1 + floor(7*0.2) = 2, lcut = val(2) = 2.
N – floor(N*q) = 7 – floor(7*0.2) = 6, rcut = val(6) = 4.
Examples
The simplest invocation of this routine populates clipping definitions in the clip_table
for all the numerical attributes found in build_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_TRIM_TAIL( 'clip_table', 'build_table'); END;
This procedure finds the Winsorizing definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition into the clipping definition table that was created using CREATE_CLIP
.
Definition for each relevant column is computed based on the non-NULL
values sorted in ascending order such that val(1) < val(2) <... < val(N), where N is a total number of non-NULL
values in a column:
lcut = val(1+floor(N*q)) lval = lcut rcut = val(N–floor(N*q)) rval = rcut
where q = ABS(NVL(tail_fraq,0)). Nothing is done when q >= 0.5.
Syntax
DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_WINSOR_TAIL ( clip_table_name IN VARCHAR2, data_table_name IN VARCHAR2, tail_frac IN NUMBER DEFAULT 0.025, exclude_list IN Column_List DEFAULT NULL, clip_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-14 INSERT_CLIP_WINSOR_TAIL Procedure Parameters
Parameter | Description |
---|---|
clip_table_name |
Name of the clipping definition table generated using CREATE_CLIP procedure |
data_table_name |
Name of the table containing the data |
tail_frac |
Tail fraction |
exclude_list |
List of columns (attributes) to be excluded from this iteration of the clipping process |
clip_schema_name |
Name of the schema hosting the clipping definition table |
data_schema_name |
Name of the schema hosting the table with data |
Usage Notes
For a given input table, you can iteratively call this routine several times with different specifications for tail_frac
for a given input table. For each iteration, you can selectively exclude attribute (that is, column names using the exclude_list
parameter for a particular specification. Columns with all NULL
values are ignored.
Example 1. When q = 0.2, col='mycol', and data is {1,2,2,2,3,4,4}, the following row is inserted into the definition table:
COL LCUT LVAL RCUT RVAL ----- ----- ----- ----- ----- mycol 2 2 4 4
Here 1 + floor(N*q) = 1 + floor(7*0.2) = 2, lcut = val(2) = 2.
N – floor(N*q) = 7 – floor(7*0.2) = 6, rcut = val(6) = 4.
Examples
The simplest invocation of this routine populates clipping definitions in the clip_table
for all the numerical attributes found in build_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_WINSOR_TAIL( 'clip_table', 'build_table'); END;
This procedure finds the categorical missing value treatment definition for every VARCHAR2
and CHAR
column in the data table that is not specified in the exclusion list and inserts the definition into the definition table that was created using CREATE_MISS_CAT
.
The definition for each selected column is computed based on the mode value that is computed from the data table.
Syntax
DBMS_DATA_MINING_TRANSFORM.INSERT_MISS_CAT_MODE ( miss_table_name IN VARCHAR2, data_table_name IN VARCHAR2, exclude_list IN COLUMN_LIST DEFAULT NULL, miss_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-15 INSERT_MISS_CAT_MODE Procedure Parameters
Parameter | Description |
---|---|
miss_table_name |
Name of the categorical missing value treatment definition table generated using CREATE_MISS_CAT |
data_table_name |
Name of the table containing the data |
exclude_list |
List of columns (attributes) to be excluded from this iteration of the missing value treatment. See Table 26-1, "Summary of Data Types" for the definition of COLUMN_LIST . |
miss_schema_name |
Name of the schema hosting the categorical missing value treatment definition table |
data_schema_name |
Name of the schema hosting the table containing the data |
Usage Notes
You can choose the categorical attributes that will receive missing value treatment by using the exclude_list
parameter. NULL
values in all the selected attributes will be replaced with the mode (the most commonly occurring value) for the attribute.
If you wish to replace NULL
s with some other value, you can edit the definition table.
Example
The simplest invocation of this routine populates missing value definitions (the mode) in miss_table
for all categorical attributes found in build_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.INSERT_MISS_CAT_MODE( 'miss_table', 'build_table'); END;
This procedure finds the numerical missing value treatment definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition into the definition table that was created using CREATE_MISS_NUM
.
The definition for each selected column is computed based on the mean value that is computed from the data table. The value of mean is rounded to round_num
significant digits prior to storing it in the definition table.
Syntax
DBMS_DATA_MINING_TRANSFORM.INSERT_MISS_NUM_MEAN ( miss_table_name IN VARCHAR2, data_table_name IN VARCHAR2, exclude_list IN COLUMN_LIST DEFAULT NULL, round_num IN PLS_INTEGER DEFAULT 6, miss_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-16 INSERT_MISS_NUM_MEAN Procedure Parameters
Parameter | Description |
---|---|
miss_table_name |
Name of the categorical missing value treatment definition table generated using CREATE_MISS_CAT |
data_table_name |
Name of the table containing the data |
exclude_list |
List of columns (attributes) to be excluded from this iteration of the miss value treatment. See Table 26-1, "Summary of Data Types" for the definition of COLUMN_LIST . |
round_num |
The number of significant digits |
miss_schema_name |
Name of the schema hosting the numerical missing value treatment definition table |
data_schema_name |
Name of the schema hosting the table containing the data |
Usage Notes
You can choose the numerical attributes that will receive missing value treatment by using the exclude_list
parameter. NULL
values in all the selected attributes will be replaced with the mean (average) value for the attribute.
If you wish to replace NULL
s with some other value, you can edit the definition table.
Example
The simplest invocation of this routine populates missing value definitions (the mode) in miss_table
for all numerical attributes found in build_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.INSERT_MISS_CAT_MODE( 'miss_table', 'build_table'); END;
This procedure finds the normalization definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition based on min-max normalization into the table that was created using CREATE_NORM_LIN
.
Definition for each relevant column is computed based on the mean and standard deviation that are computed from the data table, such that shift = mean and scale = standard deviation. The values of shift and scale are rounded to round_num
significant digits prior to storing them in the definition table.
Syntax
DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_MINMAX ( norm_table_name IN VARCHAR2, data_table_name IN VARCHAR2, exclude_list IN Column_List DEFAULT NULL, round_num IN PLS_INTEGER DEFAULT 6, norm_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-17 INSERT_NORM_LIN_MINMAX Procedure Parameters
Parameter | Description |
---|---|
norm_table_name |
Name of the normalization table generated using CREATE_NORM_LIN procedure |
data_table_name |
Name of the table containing the data |
exclude_list |
List of columns (attributes) to be excluded from this iteration of the normalization process |
round_num |
Number of significant digits |
norm_schema_name |
Name of the schema hosting the normalization definition table |
data_schema_name |
Name of the schema hosting the table with data |
Usage Notes
For a given input table, you can iteratively call this routine several times with selective exclusion of attributes (that is, column names) using the exclude_list
parameter for a particular normalization specification.
Columns with all NULL
values or only one unique value are ignored.
Examples
The simplest invocation of this routine populates normalization definitions in the norm_minmax_table
for all the numerical attributes found in build_table
.
BEGIN
DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_MINMAX( 'norm_minmax_table', 'build_table');
END;
This procedure finds the normalization definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition based on min-max normalization into the table that was created using CREATE_NORM_LIN
.
The normalization definition for each attribute is computed based on the minimum and maximum values of the data. The values for shift
and scale
are shift = 0
and scale = max{abs(max), abs(min)}.
Syntax
DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_SCALE ( norm_table_name IN VARCHAR2, data_table_name IN VARCHAR2, exclude_list IN Column_List DEFAULT NULL, round_num IN PLS_INTEGER DEFAULT 6, norm_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-18 INSERT_NORM_LIN_SCALE Procedure Parameters
Parameter | Description |
---|---|
norm_table_name |
Name of the normalization table generated using CREATE_NORM_LIN procedure |
data_table_name |
Name of the table containing the data |
exclude_list |
List of columns (attributes) to be excluded from this iteration of the normalization process |
round_num |
Number of significant digits |
norm_schema_name |
Name of the schema hosting the normalization definition table |
data_schema_name |
Name of the schema hosting the table with data |
Usage Notes
For a given input table, you can iteratively call this routine several times with selective exclusion of attributes (that is, column names) using the exclude_list parameter for a particular normalization specification.
Columns with all NULL
values or only one unique value are ignored.
Examples
The simplest invocation of this routine populates normalization definitions in the norm_minmax_table
for all the numerical attributes found in build_table
.
BEGIN
DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_SCALE( 'norm_scale_table', 'build_table');
END;
This procedure finds the normalization definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition based on z-score normalization into the table that was created using CREATE_NORM_LIN
.
Definition for each relevant column is computed based on the minimum and maximum values that are computed from the data table, such that shift = min and scale = max – min. The values of shift and scale are rounded to round_num significant digits prior to storing them in the definition table.
Syntax
DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_ZSCORE ( norm_table_name IN VARCHAR2, data_table_name IN VARCHAR2, exclude_list IN Column_List DEFAULT NULL, round_num IN PLS_INTEGER DEFAULT 6, norm_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-19 INSERT_BIN_NORM_LIN_ZSCORE Procedure Parameters
Parameter | Description |
---|---|
norm_table_name |
Name of the normalization table generated using CREATE_NORM_LIN procedure |
data_table_name |
Name of the table containing the data |
exclude_list |
List of columns (attributes) to be excluded from this iteration of the normalization process |
round_num |
Number of significant digits |
norm_schema_name |
Name of the schema hosting the normalization definition table |
data_schema_name |
Name of the schema hosting the table with data |
Usage Notes
For a given input table, you can iteratively call this routine several times with selective exclusion of attributes (that is, column names) using the exclude_list
parameter for a particular binning specification.
Columns with all NULL
values or only one unique value are ignored.
Examples
The simplest invocation of this routine populates normalization definitions in the norm_zscore_table
for all the numerical attributes found in build_table
.
BEGIN
DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_ZSCORE( 'norm_zscore_table', 'build_table');
END;
/
This procedure creates the view that performs categorical binning. Only the columns that are specified in the definition table are transformed; the remaining columns do not change.
Syntax
DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_CAT ( bin_table_name IN VARCHAR2, data_table_name IN VARCHAR2, xform_view_name IN VARCHAR2, literal_flag IN BOOLEAN DEFAULT FALSE, bin_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL, xform_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-20 XFORM_BIN_CAT Procedure Parameters
Parameter | Description |
---|---|
bin_table_name |
Name of the categorized binning definition table generated using CREATE_BIN_CAT procedure |
data_table_name |
Name of the table containing the data |
xform_view_name |
View representing the transformed output |
literal_flag |
Literal flag |
bin_schema_name |
Name of the schema hosting the bin definition table |
data_schema_name |
Name of the schema hosting the data table |
xform_schema_name |
Name of the schema hosting the view representing the transformed output |
Usage Notes
The bin table created by CREATE_BIN_CAT
and populated with bin definitions by INSERT_BIN_CAT_FREQ
is used to guide the query generation process to construct categorical binning expressions of the following form:
DECODE("col", val_1, bin_1, ... val_N, bin_N, NULL, NULL, bin_(N+1)) "col"
This expression maps values val_1,..., val_N
into N bins bin_1,..., bin_N
, and other values into bin_(N+1)
, while NULL
values remain unchanged. bin_(N+1) is optional. If not specified, it defaults to NULL
. To specify bin_(N+1)
provide a row with val set to NULL
.
The literal_flag
parameter indicates whether the values in bin are valid SQL literals. When the flag is set to TRUE, the value of bin is used as is in query generation; otherwise it is converted into a valid text literal (surrounded by quotes and each single quote is replaced by two single quotes). By default, the flag is set to FALSE. One example of when it can be set to TRUE is in cases when all bin are numbers. In that case the transformed column will be numeric as opposed to textual (default behavior).
Set literal_flag
to TRUE when the data is binned for an O-Cluster model build.
The col
parameter is case-sensitive since it generates quoted identifiers. In cases when there are multiple entries with the same col
,val
combination with different bin
, the behavior is undefined — any one of the bin
values might be used.
Examples
Example 1. bin_cat contains four rows with col = 'mycol':
{col = 'mycol', val = 'Waltham', bin = 'MA'} {col = 'mycol', val = 'Burlington', bin = 'MA'} {col = 'mycol', val = 'Redwood Shores', bin = 'CA'} {col = 'mycol', val = NULL, bin = 'OTHER'}
the following expression is generated:
DECODE("mycol", 'Waltham', 'MA', 'Burlington', 'MA', 'Redwood Shores', 'CA', NULL, NULL, 'OTHER') "mycol"
Example 2. bin_cat contains three rows with col = 'mycol':
{col = 'mycol', val = 'Waltham', bin = 'MA'} {col = 'mycol', val = 'Burlington', bin = 'MA'} {col = 'mycol', val = 'Redwood Shores', bin = 'CA'}
the following expression is generated:
DECODE("mycol", 'Waltham', 'MA', 'Burlington', 'MA', 'Redwood Shores', 'CA') "mycol"
Example 3. For the definition:
COL VAL BIN ----- ---------- --- mycol Waltham 1 mycol Burlington 1 mycol Redwood Shores 2
the following expression is generated when the literal flag is set to FALSE:
DECODE ("mycol", 'Waltham', '1', 'Burlington' '1', 'Redwood Shores', '2') "mycol"
and when the flag is set to TRUE:
DECODE("mycol", 'Waltham', 1, 'Burlington', 1, 'Redwood Shores', 2) "mycol"
The simplest invocation of this routine generates a view build_view that represents the transformation query on build_table based on bin definitions in the cat_bin_table.
BEGIN
DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_CAT( 'cat_bin_table', 'build_table', 'build_view');
END; /
This procedure creates the view that performs numerical binning. Only the columns that are specified in the definition table are transformed; the remaining columns do not change.
Syntax
DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_NUM ( bin_table_name IN VARCHAR2, data_table_name IN VARCHAR2, xform_view_name IN VARCHAR2, literal_flag IN BOOLEAN DEFAULT FALSE, bin_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL, xform_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-21 XFORM_BIN_NUM Procedure Parameters
Parameter | Description |
---|---|
bin_table_name |
Name of the numerical binning definition table generated using CREATE_BIN_NUM procedure |
data_table_name |
Name of the table containing the data |
xform_view_name |
View representing the transformed output |
literal_flag |
Literal flag |
bin_schema_name |
Name of the schema hosting the bin definition table |
data_schema_name |
Name of the schema hosting the data table |
xform_schema_name |
Name of the schema hosting the view representing the transformed output |
Usage Notes
The bin table created by CREATE_BIN_NUM
and populated with bin definitions by INSERT_BIN_NUM_EQWIDTH
or INSERT_BIN_NUM_QTILE
is used to guide the query generation process to construct numerical binning expressions of the following form:
CASE WHEN "col" < val_0 THEN 'bin0_0 WHEN "col" <= val_1 THEN 'bin_1' ... WHEN "col" <= val_N THEN 'bin_N' WHEN "col" IS NOT NULL THEN 'bin_(N+1)' END "col"
This expression maps values in the range [val_0;val_N] into N bins bin_1,..., bin_N, values outside of this range into bin_0 or bin_(N+1), such that
(-inf; val_0) -> bin_0 [val_0; val_1) -> bin_1 ... (val_(N-1); val_N] -> bin_N (val_N; +inf) -> bin_(N+1)
NULL
values remain unchanged. bin_(N+1) is optional. If it is not specified, the values ("col" > val_N) are mapped to NULL
. To specify bin_(N+1), provide a row with val set to NULL
. The order of the WHEN... THEN pairs is based on the ascending order of val for a given col.
The literal_flag parameter indicates whether the values in bin are valid SQL literals. When the flag is set to TRUE, the value of bin is used as is in query generation; otherwise it is converted into a valid text literal (surrounded by quotes and each single quote is replaced by two single quotes). By default, the flag is set to FALSE. One example of when it can be set to TRUE is in cases when all bin are numbers. In that case the transformed column will be numeric as opposed to textual (default behavior).
Note that col is case-sensitive since it generates quoted identifiers. In cases where there are multiple entries with the same col,val combination with different bin, the behavior is undefined — any one of the bin values might be used.
Examples
Example 1. bin_num contains four rows with col = 'mycol':
{col = 'mycol', val = 15.5, bin = 'small'} {col = 'mycol', val = 10, bin = 'tiny'} {col = 'mycol', val = 20, bin = 'large'} {col = 'mycol', val = NULL, bin = 'huge'}
the following expression is generated:
CASE WHEN "mycol" < 10 THEN 'tiny' WHEN "mycol" <= 15.5 THEN 'small' WHEN "mycol" <= 20 THEN 'large' WHEN "mycol" IS NOT NULL THEN 'huge' END "mycol"
Example 2. bin_num contains three rows with col = 'mycol':
{col = 'mycol', val = 15.5, bin = NULL} {col = 'mycol', val = 10, bin = 'tiny'} {col = 'mycol', val = 20, bin = 'large'}
the following expression is generated:
CASE WHEN "mycol" < 10 THEN NULL WHEN "mycol" <= 15.5 THEN 'small' WHEN "mycol" <= 20 THEN 'large' END "mycol"
Example 3. For the definition:
COL VAL BIN ----- ---- --- mycol 10 NULL mycol 15.5 1 mycol 21 2
the following expression is generated when the literal flag is set to FALSE:
CASE WHEN "mycol" < 10 THEN NULL WHEN "mycol" <= 15.5 THEN '1' WHEN "mycol" <= 20 THEN '2' END "mycol"
and when the flag is set to TRUE:
CASE WHEN "mycol" < 10 THEN NULL WHEN "mycol" <= 15.5 THEN 1 WHEN "mycol" <= 20 THEN 2 END "mycol"
The simplest invocation of this routine generates a view build_view
that represents the transformation query on build_table
based on transform definitions in bin definitions in the num_bin_table
.
BEGIN
DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_NUM( 'num_bin_table', 'build_table', 'build_view');
END; /
This procedure creates the view that performs clipping. Only the columns that are specified in the transform definition are clipped; the remaining columns do not change.
Syntax
DBMS_DATA_MINING_TRANSFORM.XFORM_CLIP ( clip_table_name IN VARCHAR2, data_table_name IN VARCHAR2, xform_view_name IN VARCHAR2, clip_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2,DEFAULT NULL, xform_schema_name IN VARCHAR2,DEFAULT NULL;
Parameters
Table 26-22 XFORM_CLIP Procedure Parameters
Parameter | Description |
---|---|
clip_table_name |
Name of the clipping definition table generated using CREATE_CLIP |
data_table_name |
Name of the table containing the data |
xform_view_name |
View representing the transformed output |
clip_schema_name |
Name of the schema hosting the clipping definition table |
data_schema_name |
Name of the schema hosting the data table |
xform_schema_name |
Name of the schema hosting the view representing the transformed output |
Usage Notes
The clipping definition table created by CREATE_CLIP
and populated with clipping definitions by INSERT_CLIP_WINSOR_TAIL
or INSERT_CLIP_TRIM_TAIL
is used to guide query generation process to construct clipping expressions of the following form:
CASE WHEN "col" < lcut THEN lval WHEN "col" > rcut THEN rval ELSE "col" END "col"
Note that col is case-sensitive since it generates quoted identifiers. When there are multiple entries in the transform definition table for the same col, the behavior is undefined. Any one of the definitions may be used in query generation. NULL
values remain unchanged.
Example 1 (Winsorizing). When col = 'my_col', lcut = –1.5, lval = –1.5, and rcut = 4.5 and rval = 4.5, the following expression is generated:
CASE WHEN "my_col" < –1.5 THEN -1.5 WHEN "my_col" > 4.5 THEN 4.5 ELSE "my_col" END "my_col"
Examples
The simplest invocation of this routine generates a view object build_view
that represents the transformation query on build_table
based on transform definitions in clipping definitions in the clip_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.XFORM_CLIP( 'clip_table', 'build_table', 'build_view'); END;
This procedure creates a view that performs categorical missing value treatment. Only the columns that are specified in the xform definition are treated; the remaining columns do not change.
Syntax
DBMS_DATA_MINING_TRANSFORM.XFORM_MISS_CAT ( miss_table_name IN VARCHAR2, data_table_name IN VARCHAR2, xform_view_name IN VARCHAR2, miss_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL, xform_schema_name IN VARCHAR2 DEFAULT NULL;
Parameters
Table 26-23 XFORM_MISS_CAT Procedure Parameters
Parameter | Description |
---|---|
miss_table_name |
Name of the categorical missing value treatment definition table generated using CREATE_MISS_CAT |
data_table_name |
Name of the table containing the data |
xform_view_name |
View representing the transformed output |
miss_schema_name |
Name of the schema hosting the categorical missing value treatment definition table |
data_schema_name |
Name of the schema hosting the data table |
xform_schema_name |
Name of the schema hosting the view representing the transformed output |
Usage Notes
The data type of the transformed columns is preserved by putting a CAST
expression around the NVL
function. For example, when col = 'state', val = 'MA' the data type is CHAR(2)
the following expression is generated:
CAST (NVL("state", 'MA') AS CHAR(2)) "state"
Examples
The simplest invocation of this routine generates a view object build_view
that represents the transformation query on build_table
based on transform definitions in missing value definitions in miss_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.XFORM_MISS_CAT( 'miss_table', 'build_table', 'build_view'); END;
This procedure creates a view that performs numerical missing value treatment of the data table. Only the columns that are specified in the xform definition are treated, the remaining columns do not change.
Syntax
DBMS_DATA_MINING_TRANSFORM.XFORM_MISS_NUM ( miss_table_name IN VARCHAR2, data_table_name IN VARCHAR2, xform_view_name IN VARCHAR2, miss_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL, xform_schema_name IN VARCHAR2 DEFAULT NULL;
Parameters
Table 26-24 XFORM_MISS_NUM Procedure Parameters
Parameter | Description |
---|---|
miss_table_name |
Name of the numeric missing value treatment definition table generated using CREATE_MISS_NUM |
data_table_name |
Name of the table containing the data |
xform_view_name |
View representing the transformed output |
miss_schema_name |
Name of the schema hosting the numerical missing value treatment definition table |
data_schema_name |
Name of the schema hosting the data table |
xform_schema_name |
Name of the schema hosting the view representing the transformed output |
Examples
The simplest invocation of this routine generates a view object build_view
that represents the transformation query on build_table
based on transform definitions in missing value definitions in miss_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.XFORM_MISS_NUM( 'miss_table', 'build_table', 'build_view'); END;
This procedure creates the view that performs linear normalization. Only the columns that are specified in the definition table are transformed; the remaining columns do not change.
Syntax
DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN ( norm_table_name IN VARCHAR2, data_table_name IN VARCHAR2, xform_view_name IN VARCHAR2, norm_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL, xform_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 26-25 XFORM_NORM_LIN Procedure Parameters
Parameter | Description |
---|---|
norm_table_name |
Name of the normalization definition table generated using CREATE_NORM_LIN procedure |
data_table_name |
Name of the table containing the data |
xform_view_name |
View representing the transformed output |
norm_schema_name |
Name of the schema hosting the normalization definition table |
data_schema_name |
Name of the schema hosting the data table |
xform_schema_name |
Name of the schema hosting the view representing the transformed output |
Usage Notes
The normalization table created by CREATE_NORM_LIN
is populated with definitions by either INSERT_NORM_LIN_ZSCORE
or INSERT_NORM_LIN_MINMAX
is used to guide the query generation process to construct normalization expressions of the following form:
("col" - shift)/scale "col"
Note that col is case-sensitive since it generates quoted identifiers. When there are multiple entries in the transform definition table for the same col, the behavior is undefined. Any one of the definitions may be used in query generation. NULL
values remain unchanged.
For example, when col = 'my_col', shift = -1.5, and scale = 20. The following expression is generated:
("my_col" - (-1.5))/20 "my_col"
Examples
The simplest invocation of this routine generates a view build_view
that represents the transformation query on build_table
based on normalization definitions in the norm_minmax_table
.
BEGIN
DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN( 'norm_minmax_table', 'build_table', 'build_view');
END;