7 The Data Mining Sample Programs

A number of sample programs are available with Oracle Data Mining. These programs illustrate the many features of the PL/SQL and Java APIs.

The sample programs create a set of models in the database. You can examine the sample source code, which includes numerous comments, to familiarize yourself with the Oracle Data Mining APIs, and you can create your own models by modifying the samples.

Note:

This chapter provides general information about the sample programs and the sample data.

Instructions for installing and running the sample programs are provided in Chapter 1.

This chapter includes the following sections:

PL/SQL Programs
Java Programs
Text Mining Programs
The Sample Data

PL/SQL Programs

The PL/SQL sample programs illustrate the use of the DBMS_DATA_MINING package for creating models and the DBMS_DATA_MINING_TRANSFORM package for performing transformations on the mining data.

PL/SQL Program Summaries

Summary descriptions of the PL/SQL sample programs are provided in Table 7-1. For detailed descriptions of the sample programs, see the comments in the source code.

Table 7-1 Overview of the PL/SQL Sample Programs

Mining Function	Description
Classification	The classification programs demonstrate various preprocessing techniques and perform the following steps: Build a classification model using training data Display model details and settings Test the model by applying the model on the test data Present test metrics, such as confusion matrix, lift, and ROC Apply the model on the scoring data Present apply results Present ranked apply results, influenced by a cost matrix `dmnbdemo.sql` illustrates Naive Bayes. `dmdtdemo.sql` illustrates Decision Tree. `dmsvcdem.sql` illustrates SVM classification. `dmglcdem` illustrates GLM classification (binary logistic regression) The `dmdtxvlddemo.sql` program demonstrates cross-validation techniques for decision tree based-classification. With minor modifications, this program can be used to perform cross validation using other models/algorithms.
Regression	`dmsvrdem.sql` uses different test metrics, but otherwise performs most of the same steps used in the classification programs. Selected attributes of the input data are preprocessed (normalized). NOTE: `dmsvrdem`.`sq`l illustrates the new Automatic Data Preparation feature. `dmglrdem.sql` illustrates GLM regression (multivariate linear regression)
Anomaly Detection	`dmsvodem.sql` illustrates one-class SVM
Association	`dmardemo.sql` builds an association model and presents frequent itemsets and association rules as output.
Clustering	`dmkmdemo.sql` (k-Means) and `dmocdemo.sql` (0-Cluster) build clustering models and present cluster details, such as rules, centroid, and histogram for each cluster as output. The models are scored, and the probabilities associated with each cluster are returned as output. Selected attributes of the input data are preprocessed. NOTE: `dmkmdemo`.`sq`l illustrates the new Automatic Data Preparation feature.
Feature extraction	`dmnmdemo.sql` builds a feature extraction model and presents model details as the output. The model is scored, and each feature ID is associated with a probability. Selected attributes of the input data are preprocessed (normalized).
Attribute importance	`dmaidemo.sql` builds an attribute importance model and presents a list of important attributes as the output of model details. Selected attributes of the input data are preprocessed (binned).

Data Mining SQL Scoring Functions

Most of the PL/SQL sample programs use the Data Mining SQL functions for scoring. The Data Mining scoring functions can be used to apply models created with the PL/SQL API or with the Java API.

Note:

The SQL functions for Data Mining are documented in Oracle Database SQL Language Reference. Information about these functions is also provided in Oracle Data Mining Application Developer's Guide.

The programs that demonstrate the Data Mining functions are listed in Table 7-2.

Table 7-2 Data Mining SQL Functions in the Sample Programs

Program Name	Algorithm	SQL Functions Used
`dmkmdemo.sql`	k-Means	`CLUSTER_ID`, `CLUSTER_PROBABILITY`, `CLUSTER_SET`
`dmocdemo.sql`	O-Cluster	`CLUSTER_ID`
`dmnmdemo.sql`	NMF	`FEATURE_ID`, `FEATURE_SET`, `FEATURE_VALUE`
`dmdtdemo.sql`	Decision Tree	`PREDICTION`, `PREDICTION_COST`, `PREDICTION_DETAILS`, `PREDICTION_SET`
`dmsvcdem.sql`	SVM classification	`PREDICTION`, `PREDICTION_PROBABILITY`, `PREDICTION_SET`
`dmsvodem.sql`	One-Class SVM	`PREDICTION`, `PREDICTION_PROBABILITY`
`dmsvrdem.sql`	SVM regression	`PREDICTION`
`dmtxtsvm.sql`	Text mining	`PREDICTION`, `PREDICTION_PROBABILITY`

Java Programs

The Java demos illustrate the features of the Oracle Data Mining Java API, which implements Oracle-specific extensions to the Java Data Mining (JDM) 1.0.1.1 standard.The Java programs demonstrate data preprocessing and the basic mining functions. Additional Java samples demonstrate predictive analytics, import/export, and text mining.

Java Program Summaries

Summary descriptions of the Java sample programs are provided in Table 7-3. For detailed descriptions, see the comments in the source code.

Table 7-3 Overview of the Java Sample Programs

Mining Function or Task	Description
Classification	The classification programs demonstrate various preprocessing techniques and perform the following steps: Build a classification model using training data Display model details and settings Test the model by applying the model on the test data Present test metrics, such as confusion matrix, lift, and ROC Apply the model on the scoring data Present apply results Present ranked apply results, influenced by a cost matrix The `dmapplydemo.java` program demonstrates several ways of applying a Naive Bayes model. `dmglcdemo.java` illustrates GLM classification (binary logistic regression)
Regression	`dmsvrdemo.java` uses different test metrics, but otherwise performs most of the same steps used in the classification programs. Selected attributes of the input data are preprocessed (normalized). `dmglrdemo.java` illustrates GLM regression (multivariate linear regression)
Association	`dmardemo.java` builds an association model and presents frequent itemsets and association rules as output. Selected attributes of the input data are preprocessed (binned).
Clustering	`dmkmdemo.java` (k-Means) and `dmocdemo.java` (0-Cluster) build clustering models and present cluster details, such as rules, centroid, and histogram for each cluster as output. The models are scored, and the probabilities associated with each cluster are returned as output. Selected attributes of the input data are preprocessed (normalized).
Feature extraction	`dmnmdemo.java` builds a feature extraction model and presents model details as the output. The model is scored, and each feature ID is associated with a probability. Selected attributes of the input data are preprocessed (normalized).
Attribute importance	`dmaidemo.java` builds an attribute importance model and presents a list of important attributes as the output of model details. Selected attributes of the input data are preprocessed (binned).
Data transformations	`dmxfdemo.java` demonstrates binning, clipping, and normalization transformations.
Predictive Analytics	`dmpademo.java` demonstrates `PREDICT`, `EXPLAIN, and PROFILE` functions.
Model import/export	`dmexpimpdemo.java` builds a Naive Bayes model, exports it to a dump file, then imports it from the dump file.

Text Mining Programs

Oracle Data Mining can mine text columns that have undergone pre-processing by Oracle Text routines.

Oracle Text is a technology for building text query and document classification applications. It provides indexing, word and theme searching, and viewing capabilities for text. Oracle Text is included in a general installation of Oracle Database Enterprise Edition, and therefore is already present in a database installed according to the instructions in Chapter 1.

The pre-processing steps for text mining create nested table columns of type DM_NESTED_NUMERICALS from columns of type VARCHAR2 or CLOB. Each row of the nested table specifies an attribute name and a value. The type definition is as follows.

CREATE OR REPLACE TYPE dm_nested_numerical AS OBJECT
  (attribute_name VARCHAR2(4000),
   value          NUMBER)
/
CREATE OR REPLACE TYPE dm_nested_numericals AS TABLE OF dm_nested_numerical

Terms extracted from text documents into nested tables can become generic attributes in training or scoring data.

Sample text mining programs in both PL/SQL and Java illustrate classification and feature extraction of a pre-processed text column.

Text Mining in PL/SQL

Three PL/SQL sample programs illustrate the process of text mining. One program illustrates the pre-processing that is required to prepare the data for mining. The other two programs build models that use the transformed text.

Text Transformation Demo

To prepare a column for text mining using the PL/SQL API, you must use Oracle Text routines to perform the following general steps:

Create a domain index on the column.
Use the index to extract terms from the column to a temporary table.
Populate a column of type DM_NESTED_NUMERICALS with the terms in the temporary table.

The process of term extraction using Oracle Text is illustrated in the sample program dmtxtfe.sql. The source code contains extensive comments that explain the steps involved in transforming text into a set of features that can be mined using Oracle Data Mining.

More details about text transformation are provided in the Oracle Data Mining Application Developer's Guide.

Text Transformation for the PL/SQL Text Mining Sample Programs

The dmsh.sql script performs the text transformation required by the PL/SQL text mining demos. There are two such sample programs: dmtxtnmf.sql, which builds a feature extraction model using Non-Negative Matrix Factorization, and dmtxtsvm.sql, which builds a classification model using Support Vector Machine. Both of these programs use the following tables, which have a nested table column of comment data:

MINING_BUILD_NESTED_TEXT
MINING_TEST_NESTED_TEXT
MINING_APPLY_NESTED_TEXT

The Sample Text Mining Models (PL/SQL)

You can run the PL/SQL text mining demo programs, dmtxtnmf.sql and dmtxtsvm.sql, like the other PL/SQL programs. The models created by these programs are listed in the following example.

SQL> @ %ORACLE_HOME%\rdbms\demo\dmtxtnmf.sql
SQL> @ %ORACLE_HOME%\rdbms\demo\dmtxtsvm.sql
SQL> select MODEL_NAME, MINING_FUNCTION, ALGORITHM  
           from user_mining_models;MODEL_NAME         MINING_FUNCTION    ALGORITHM       ----------------   ------------------ ------------------------  
T_NMF_SAMPLE       FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR T_SVM_CLAS_SAMPLE  CLASSIFICATION     SUPPORT_VECTOR_MACHINES

Text Mining in Java

Two Java sample programs illustrate the process of text mining. One builds a feature extraction model, the other builds a classification model.

Text Transformation for the Sample Java Text Mining Programs

The Oracle Data Mining Java API provides an interface that handles the term extraction process. If you are developing data mining applications in Java, you do not need to use Oracle Text directly. However, you must ensure that Oracle Text is present in the database.

The OraTextTransform interface is used to perform text transformation within the Java text mining demos. There are two such sample programs: dmtxtnmfdemo.java, which builds a feature extraction model using Non-Negative Matrix Factorization, and dmtxtsvmdemo.java, which builds a classification model using Support Vector Machine. Both of these programs create build, test, and apply data sets from the following tables, which have a text column of comment data:

MINING_BUILD_TEXT
MINING_TEST_TEXT
MINING_APPLY_TEXT

The Sample Text Mining Models (Java)

You can run the Java text mining sample programs, dmtxtnmfdemo.java and dmtxtsvmdemo.java, like the other Java programs. The models created by these programs are shown in the following example.

> java dmtxtnmfdemo host:port:SID dmuser3 dmuser3_password
> java dmtxtsvmdemo host:port:SID dmuser3 dmuser3_password
>sqlplus dmuser/dmuser_password
SQL> select MODEL_NAME, MINING_FUNCTION, ALGORITHM  
           from user_mining_models;MODEL_NAME         MINING_FUNCTION     ALGORITHM       ----------------   ------------------  ------------------------  
txtnmfModel_jdm    FEATURE_EXTRACTION  NONNEGATIVE_MATRIX_FACTOR txtsvmModel_jdm    CLASSIFICATION      SUPPORT_VECTOR_MACHINES

The Sample Data

The dmsh.sql script creates views, tables, and indexes in the user's schema. The views define columns of customer data from tables in the SH schema. This data is used by the Data Mining sample programs. The tables reference the same columns in SH, but they include an extra COMMENTS column for text mining. The indexes are used to extract terms from the text in the COMMENTS column and build a nested table column.

Customer Data for Data Mining

Views in the data mining user's schema define columns of data from the CUSTOMERS, SALES, PRODUCTS, COUNTRIES, and SUPPLEMENTARY_DEMOGRAPHICS tables in the SH schema. You can list these views with the following SQL statements.

SQL>connect dmuser
Enter password: dmuser_password
SQL>select view_name from user_views;

The views are listed in Table 7-4.

Table 7-4 Views Used by the Data Mining Sample Programs

View Name	Description
`MINING_DATA_APPLY_STR_V`	Scoring data for o-cluster
`MINING_DATA_BUILD_STR_V`	Training data for o-cluster
`MINING_DATA_APPLY_V`	Scoring data for data mining (not text mining)
`MINING_DATA_BUILD_V`	Training data for data mining (not text mining)
`MINING_DATA_TEST_V`	Test data for data mining (not text mining)
`MARKET_BASKET_V`	Data for association rules
`MINING_DATA_ONE_CLASS_V`	Data for one-class SVM

You can see the references to tables in SH by listing the view definitions. The definition of the view MINING_DATA_BUILD_V is shown as follows.

SQL> set long 1000000
SQL> set longc 100000
SQL> set pagesize 100
SQL> select text from all_views where 
    owner='DMUSER3'and view_name='MINING_DATA_BUILD_V';

      SELECT a.CUST_ID, a.CUST_GENDER, 2003-a.CUST_YEAR_OF_BIRTH AGE, 
             a.CUST_MARITAL_STATUS, c.COUNTRY_NAME, a.CUST_INCOME_LEVEL,
             b.EDUCATION, b.OCCUPATION, b.HOUSEHOLD_SIZE, b.YRS_RESIDENCE,
             b.AFFINITY_CARD, b.BULK_PACK_DISKETTES, b.FLAT_PANEL_MONITOR,
             b.HOME_THEATER_PACKAGE, b.BOOKKEEPING_APPLICATION, 
             b.PRINTER_SUPPLIES, b.Y_BOX_GAMES, b.OS_DOC_SET_KANJI 
       FROM  sh.customers a, 
             sh.supplementary_demographics b, 
             sh.countries c 
       WHERE a.CUST_ID = b.CUST_ID AND a.country_id = c.country_id 
             AND a.cust_id between 101501 and 103000

The views are used to build, test, and score the sample models. Each view has a CUSTOMER_ID column, which is the case ID, and an AFFINITY_CARD column, which is the target used by the predictive models. Most of the views provide data for 1500 customers (1500 rows). The view used by the One-Class SVM model has data for 940 customers.

The columns of training data in the MINING_DATA_BUILD_V view are listed in the following example.

SQL> describe MINING_DATA_BUILD_V

CUST_ID                    NOT NULL            NUMBER
CUST_GENDER                NOT NULL            CHAR(1)
AGE                                            NUMBER
CUST_MARITAL_STATUS                            VARCHAR2(20)
COUNTRY_NAME               NOT NULL            VARCHAR2(40)
CUST_INCOME_LEVEL                              VARCHAR2(30)
EDUCATION                                      VARCHAR2(21)
OCCUPATION                                     VARCHAR2(21)
HOUSEHOLD_SIZE                                 VARCHAR2(21)
YRS_RESIDENCE                                  NUMBER
AFFINITY_CARD                                  NUMBER(10)
BULK_PACK_DISKETTES                            NUMBER(10)
FLAT_PANEL_MONITOR                             NUMBER(10)
HOME_THEATER_PACKAGE                           NUMBER(10)
BOOKKEEPING_APPLICATION                        NUMBER(10)
PRINTER_SUPPLIES                               NUMBER(10)
Y_BOX_GAMES                                    NUMBER(10)
OS_DOC_SET_KANJI                               NUMBER(10)

Market Basket Data for Association Rules

The association demos use the MARKET_BASKET_V data set, which includes columns of products from the PRODUCTS table and the CUSTOMER_ID column from the CUSTOMERS table in SH. The columns of the MARKET_BASKET_V view are listed in the following example.

SQL> describe MARKET_BASKET_V

CUST_ID                     NOT NULL         NUMBER
EXTENSION_CABLE                              NUMBER
FLAT_PANEL_MONITOR                           NUMBER
CD_RW_HIGH_SPEED_5_PACK                      NUMBER
ENVOY_256MB_40GB                             NUMBER
ENVOY_AMBASSADOR                             NUMBER
EXTERNAL_8X_CD_ROM                           NUMBER
KEYBOARD_WRIST_REST                          NUMBER
SM26273_BLACK_INK_CARTRIDGE                  NUMBER
MOUSE_PAD                                    NUMBER
MULTIMEDIA_SPEAKERS_3INCH                    NUMBER
OS_DOC_SET_ENGLISH                           NUMBER
SIMM_16MB_PCMCIAII_CARD                      NUMBER
STANDARD_MOUSE                               NUMBER

Customer Data for Text Mining

The text mining demos use the same customer data from tables in SH, but they include either an extra text column or a collection type column. The collection type is a nested table of type DM_NESTED_NUMERICALS.

You can list these tables with the following SQL statements.

SQL>connect dmuser3
Enter password: dmuser3_password
SQL>select table_name from user_tables where table_name like '%MINING%';

The text mining tables are listed in Table 7-5.

Table 7-5 Tables Used by the Text Mining Sample Programs

Table Name	Description
`MINING_APPLY_NESTED_TEXT`	Apply table with `COMMENTS` column as `DM_NESTED_NUMERICALS`
`MINING_BUILD_NESTED_TEXT`	Build table with `COMMENTS` column as `DM_NESTED_NUMERICALS`
`MINING_TEST_NESTED_TEXT`	Test table with `COMMENTS` column as `DM_NESTED_NUMERICALS`
`MINING_APPLY_TEXT`	Apply table with `COMMENTS` column as `VARCHAR2(4000)`
`MINING_BUILD_TEXT`	Build table with `COMMENTS` column as `VARCHAR2(4000)`
`MINING_TEST_TEXT`	Test table with `COMMENTS` column as `VARCHAR2(4000)`

In the MINING_BUILD_TEXT, MINING_TEST_TEXT, and MINING_APPLY_TEXT tables, the COMMENTS column is of type VARCHAR2(4000).

SQL> describe MINING_BUILD_TEXT
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUST_ID                                   NOT NULL NUMBER
 CUST_GENDER                               NOT NULL CHAR(1)
 AGE                                                NUMBER
 CUST_MARITAL_STATUS                                VARCHAR2(20)
 COUNTRY_NAME                              NOT NULL VARCHAR2(40)
 CUST_INCOME_LEVEL                                  VARCHAR2(30)
 EDUCATION                                          VARCHAR2(21)
 OCCUPATION                                         VARCHAR2(21)
 HOUSEHOLD_SIZE                                     VARCHAR2(21)
 YRS_RESIDENCE                                      NUMBER
 AFFINITY_CARD                                      NUMBER(10)
 BULK_PACK_DISKETTES                                NUMBER(10)
 FLAT_PANEL_MONITOR                                 NUMBER(10)
 HOME_THEATER_PACKAGE                               NUMBER(10)
 BOOKKEEPING_APPLICATION                            NUMBER(10)
 PRINTER_SUPPLIES                                   NUMBER(10)
 Y_BOX_GAMES                                        NUMBER(10)
 OS_DOC_SET_KANJI                                   NUMBER(10)
 COMMENTS                                           VARCHAR2(4000)

In the MINING_*_NESTED_TEXT tables, the COMMENTS column is of type DM_NESTED_NUMERICALS.

SQL> describe MINING_BUILD_NESTED_TEXT
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUST_ID                                   NOT NULL NUMBER
 CUST_GENDER                               NOT NULL CHAR(1)
 AGE                                                NUMBER
 CUST_MARITAL_STATUS                                VARCHAR2(20)
 COUNTRY_NAME                              NOT NULL VARCHAR2(40)
 CUST_INCOME_LEVEL                                  VARCHAR2(30)
 EDUCATION                                          VARCHAR2(21)
 OCCUPATION                                         VARCHAR2(21)
 HOUSEHOLD_SIZE                                     VARCHAR2(21)
 YRS_RESIDENCE                                      NUMBER
 AFFINITY_CARD                                      NUMBER(10)
 BULK_PACK_DISKETTES                                NUMBER(10)
 FLAT_PANEL_MONITOR                                 NUMBER(10)
 HOME_THEATER_PACKAGE                               NUMBER(10)
 BOOKKEEPING_APPLICATION                            NUMBER(10)
 PRINTER_SUPPLIES                                   NUMBER(10)
 Y_BOX_GAMES                                        NUMBER(10)
 OS_DOC_SET_KANJI                                   NUMBER(10)
 COMMENTS                                           DM_NESTED_NUMERICALS

The process of extracting terms from a text column into a nested table column is described in "Text Mining Programs" and in Oracle Data Mining Application Developer's Guide.