Oracle® Data Mining Administrator's Guide 10g Release 2 (10.2) Part Number B14338-01 |
|
|
View PDF |
A number of sample programs are available with Oracle Data Mining. These programs illustrate the many features of the PL/SQL API, the SQL scoring functions, the Java API, and the BLAST table functions.
The sample programs create a set of models in the database. You can experiment with these models using either the APIs or Oracle Data Miner. You can examine the sample source code, which includes numerous comments, to familiarize yourself with the Oracle Data Mining APIs, and you can create your own models by modifying the samples.
This chapter includes the following sections:
The Oracle Data Mining sample programs are provided with Oracle Database Enterprise Edition. The Companion CD installation process copies the Oracle Data Mining sample programs, along with examples and demos of other database features, to the \rdbms\demo
subdirectory under ORACLE_HOME
. For information on installing Oracle Database with the Data Mining option and the sample programs, see Chapter 1.
Note: The Oracle Data Mining sample programs use data in theSH schema. If the sample schemas were not originally installed in the database, you must install SH .
The sample schema files are copied to For more details about installing the sample schemas, see Oracle Database Sample Schemas. |
You can list the Oracle Data Mining SQL demo programs in Windows File Manager by searching for dm*.sql
in \rdbms\demo
in the Oracle home directory.
You can list the Oracle Data Mining Java demo programs in Windows File Manager by searching for dm*.java
in \rdbms\demo
in the Oracle home directory.
The files with a sql
extension are sample programs that illustrate the PL/SQL API and SQL scoring functions, administrative scripts, and files related to the BLAST algorithm.
Table 4-1 lists the PL/SQL programs that illustrate the mining functions and algorithms supported by Oracle Data Mining.
Table 4-1 Mining Functions in PL/SQL Samples
Mining Function | Algorithm | Sample Program File |
---|---|---|
Association Rules |
Apriori |
dmardemo.sql |
Attribute Importance |
Minimum Descriptor Length |
dmaidemo.sql |
Classification |
Adaptive Bayes Network |
dmabdemo.sql |
Classification |
Decision Tree |
dmdtdemo.sql |
Classification |
Decision Tree (cross validation) |
dmdtxvlddemo.sql |
Classification |
Naive Bayes |
dmnbdemo.sql |
Classification |
Support Vector Machines |
dmsvcdem.sql |
Classification |
Support Vector Machines (One Class) |
dmsvodem.sql |
Clustering |
k-Means |
dmkmdemo.sql |
Clustering |
O-Cluster |
dmocdemo.sql |
Feature Extraction |
Non-Negative Matrix Factorization |
dmnmdemo.sql |
Regression |
Support Vector Machines |
dmsvrdem.sql |
Text Mining |
Term extraction using CTX procedures |
dmtxtfe.sql |
Text Mining |
Non-Negative Matrix Factorization |
dmtxtnmf.sql |
Text Mining |
Support Vector Machines (Classification) |
dmtxtsvm.sql |
Table 4-2 lists the administrative scripts related to the data mining sample programs.
Table 4-2 Administrative Scripts
Table 4-3 lists the files related to the BLAST algorithm. Instructions for creating sample data sets and running the BLAST sample program are provided in "Using the BLAST Sample".
Table 4-3 BLAST Sample Files
File Name | Description |
---|---|
dmbldemo.sql |
|
dmblcoli.sql |
Script that creates the Ecoli data set. |
dmblprot.sql |
Script that creates the Swiss Protein data set. |
dmblprot.txt |
Text for SQL*Loader to load into the Swiss Protein data set. |
dmbl.ctl |
Control file for SQL*Loader to use in loading the Swiss Protein data set. |
The files with a java
extension are sample programs that illustrate the Java API. In addition to the programs that illustrate the basic mining functions, there are programs that illustrate predictive analytics and specific mining tasks, such as data transformations and model import/export.
Table 4-4 lists the Java programs that illustrate the mining functions and algorithms supported by Oracle Data Mining.
Table 4-4 Mining Functions in Java Samples
Mining Function | Algorithm | Sample Program File |
---|---|---|
Association Rules |
Apriori |
dmardemo.java |
Attribute Importance |
Minimum Descriptor Length |
dmaidemo.java |
Classification |
Adaptive Bayes Network |
dmabdemo.java |
Classification |
Decision Trees |
dmtreedemo.java |
Classification |
Naive Bayes |
dmnbdemo.java |
Classification |
Support Vector Machines |
dmsvcdemo.java |
Classification |
Support Vector Machines (one class) |
dmsvodemo.java |
Clustering |
K-Means |
dmkmdemo.java |
Clustering |
O-Cluster |
dmocdemo.java |
Feature Extraction |
Non-Negative Matrix Factorization |
dmnmdemo.java |
Regression |
Support Vector Machines |
dmsvrdemo.java |
Text Mining |
Non-Negative Matrix Factorization |
dmtxtnmfdemo.java |
Text Mining |
Support Vector Machines (Classification) |
dmtxtsvmdemo.java |
Table 4-5 lists the Java programs that illustrate special mining tasks. These features are all supported in the PL/SQL API as well, since the Java API is layered on the PL/SQL API.
Table 4-5 Mining Tasks in Java Samples
Mining Task | Description | Sample Program File |
---|---|---|
Data Transformations |
Binning, clipping, and normalization |
dmxfdemo.java |
Predictive Analytics |
Automated predict and explain |
dmpademo.java |
Model Export/Import |
dmexpimpdemo.java |
|
Classification Model Scoring |
Ways of applying an NB model |
dmapplydemo.java |
Before attempting to run the sample programs, verify that the following required installation and setup steps have been completed.
Oracle Database 10g Enterprise Edition has been installed with the sample schemas according to the instructions in "Installation of Oracle Database 10g". If the sample schemas are not installed, follow the instructions in "Obtaining the Sample Programs" to install them.
The Database Companion CD has been installed according to the instructions in "Installing the Data Mining Sample Programs".
A data mining user ID has been created according to the instructions in "Creating a Demo User for Data Mining". You will use this ID to log in to the database to run the sample programs.
The data mining user's schema has been populated with tables and views used by the sample programs according to the instructions in "Creating a Sample Schema for Data Mining".
You have SQL*Plus, as described in "SQL*Plus", for running the PL/SQL programs.
You have Java 1.4.2 for running the Java programs. Your CLASSPATH
must be set according to the instructions in "Using the Java Samples".
You can connect to the database as the data mining user.
Let's assume that you have access to Oracle Database 10g, which has been installed on a Windows host with the Data Mining option, the sample schemas, and the sample programs. You can log in to this database as SYS
using SQL*Plus.
Several Data Mining users (dmuser1
and dmuser2
) have already been created in this database. You want to create your own user ID (dmuser3
) and run the sample programs to create the sample models in your schema. To accomplish this, you would perform the following steps.
Log in to the database as SYS
and create the dmuser3
user.
> sqlplus Enter user-name: sys / as sysdba Enter password: sys_password SQL> create user dmuser3 identified by dmuser3_password default tablespace users temporary tablespace temp quota unlimited on users;
Run dmshgrants.sql
to grant privileges to dmuser3
. These privileges are required for performing data mining tasks and for accessing the mining data in SH
used by the sample programs.
SQL> @ %ORACLE_HOME%\rdbms\demo\dmshgrants SH_password dmuser3
Connect as dmuser3
and run dmsh.sql
to populate the dmuser3
schema with views and tables needed by the sample programs. Save your changes in the database.
SQL> connect dmuser3/dmuser3_password
SQL> @ %ORACLE_HOME%\rdbms\demo\dmsh
SQL> commit;
You can now run any of the PL/SQL data mining samples as described in "Using the PL/SQL Samples". For example, while logged in to SQL*Plus as dmuser3
, you could run the Naive Bayes PL/SQL sample with the following statement.
SQL>@ %ORACLE_HOME%\rdbms\demo\dmnbdemo
You can also run any of the Java data mining samples, if you have Java 1.4.2 or higher and your CLASSPATH
is set as described in "Using the Java Samples". For example, you could run the Naive Bayes Java sample with a command like the following at the operating system prompt.
>java dmnbdemo myserver:1521:orcl dmuser3 dmuser3_password
Note that the BLAST demo uses different data sets and requires a separate setup procedure. See "Using the BLAST Sample".
Note: All the demo programs are re-executable. They start by deleting the results of the previous run before executing the current run. |
The dmsh.sql
script creates views, tables, and indexes in the user's schema. The views define columns of customer data from tables in the SH
schema. This data is used by the sample data mining programs. The tables reference the same columns in SH
, but they include an extra COMMENTS
column for text mining. The indexes are used to extract terms from the text in the COMMENTS
column and build a nested table column.
Views in the data mining user's schema define columns of data from the CUSTOMERS
, SALES
, PRODUCTS
, COUNTRIES
, and SUPPLEMENTARY_DEMOGRAPHICS
tables in the SH
schema. You can list these views with the following SQL statements.
SQL>connect dmuser3/dmuser3_password
SQL>select view_name from user_views;
The views are listed in Table 4-6.
Table 4-6 Views Used by the Data Mining Sample Programs
View Name | Description |
---|---|
|
Scoring data for o-cluster |
|
Build data for o-cluster |
|
Scoring data for data mining (not text mining) |
|
Build data for data mining (not text mining) |
|
Test data for data mining (not text mining) |
|
Data for association rules |
|
Data for one-class SVM |
You can see the references to tables in SH
by listing the view definitions. The definition of the view MINING_DATA_BUILD_V
is shown as follows.
SQL>set long 1000000 SQL>set longc 100000 SQL>select text from all_views where owner='DMUSER3'and view_name='MINING_DATA_BUILD_V'; SELECT a.CUST_ID, a.CUST_GENDER, 2003-a.CUST_YEAR_OF_BIRTH AGE, a.CUST_MARITAL_STATUS, c.COUNTRY_NAME, a.CUST_INCOME_LEVEL, b.EDUCATION, b.OCC UPATION, b.HOUSEHOLD_SIZE, b.YRS_RESIDENCE, b.AFFINITY_CARD, b.BULK_PACK_DISKETTES, b.FLAT_PANEL_MONITOR, b.HOME_THEATER_PACKAGE, b.BOOKKEEPING _APPLICATION, b.PRINTER_SUPPLIES, b.Y_BOX_GAMES, b.OS_DOC_SET_KANJI FROM sh.customers a, sh.supplementary_demographics b, sh.countries c WHERE a.CUST_ID = b.CUST_ID AND a.country_id = c.country_id AND a.cust_id between 101501 and 103000
The views are used to build, test, and score the sample models. Each view has a CUSTOMER_ID
column, which is the case ID, and an AFFINITY_CARD
column, which is the target used by the predictive models. Most of the views provide data for 1500 customers (1500 rows). The view used by the one-class SVM model has data for 940 customers.
The columns of build data in the MINING_DATA_BUILD_V
view are listed in the following example.
SQL> describe MINING_DATA_BUILD_V CUST_ID NOT NULL NUMBER CUST_GENDER NOT NULL CHAR(1) AGE NUMBER CUST_MARITAL_STATUS VARCHAR2(20) COUNTRY_NAME NOT NULL VARCHAR2(40) CUST_INCOME_LEVEL VARCHAR2(30) EDUCATION VARCHAR2(21) OCCUPATION VARCHAR2(21) HOUSEHOLD_SIZE VARCHAR2(21) YRS_RESIDENCE NUMBER AFFINITY_CARD NUMBER(10) BULK_PACK_DISKETTES NUMBER(10) FLAT_PANEL_MONITOR NUMBER(10) HOME_THEATER_PACKAGE NUMBER(10) BOOKKEEPING_APPLICATION NUMBER(10) PRINTER_SUPPLIES NUMBER(10) Y_BOX_GAMES NUMBER(10) OS_DOC_SET_KANJI NUMBER(10)
The association rules samples use the MARKET_BASKET_V
data set, which includes columns of products from the PRODUCTS
table and the CUSTOMER_ID
column from the CUSTOMERS
table in SH
. The columns of the MARKET_BASKET_V
view are listed in the following example.
SQL> describe MARKET_BASKET_V CUST_ID NOT NULL NUMBER EXTENSION_CABLE NUMBER FLAT_PANEL_MONITOR NUMBER CD_RW_HIGH_SPEED_5_PACK NUMBER ENVOY_256MB_40GB NUMBER ENVOY_AMBASSADOR NUMBER EXTERNAL_8X_CD_ROM NUMBER KEYBOARD_WRIST_REST NUMBER SM26273_BLACK_INK_CARTRIDGE NUMBER MOUSE_PAD NUMBER MULTIMEDIA_SPEAKERS_3INCH NUMBER OS_DOC_SET_ENGLISH NUMBER SIMM_16MB_PCMCIAII_CARD NUMBER STANDARD_MOUSE NUMBER
The text mining samples use the same customer data from tables in SH
, but they include either an extra text column or a collection type column. The collection type is a nested table of type DM_NESTED_NUMERICALS
.
You can list these tables with the following SQL statements.
SQL>connect dmuser3/dmuser3_password
SQL>select table_name from user_tables where table_name like '%MINING%';
The tables are listed in Table 4-7.
Table 4-7 Tables Used by the Sample Text Mining Programs
Table Name | Description |
---|---|
|
Apply table with |
|
Build table with |
|
Test table with |
|
Apply table with |
|
Build table with |
|
Test table with |
In the MINING_BUILD_TEXT
, MINING_TEST_TEXT
, and MINING_APPLY_TEXT
tables, the COMMENTS
column is of type VARCHAR2(4000)
.
SQL> describe MINING_BUILD_TEXT Name Null? Type ----------------------------------------- -------- ---------------------------- CUST_ID NOT NULL NUMBER CUST_GENDER NOT NULL CHAR(1) AGE NUMBER CUST_MARITAL_STATUS VARCHAR2(20) COUNTRY_NAME NOT NULL VARCHAR2(40) CUST_INCOME_LEVEL VARCHAR2(30) EDUCATION VARCHAR2(21) OCCUPATION VARCHAR2(21) HOUSEHOLD_SIZE VARCHAR2(21) YRS_RESIDENCE NUMBER AFFINITY_CARD NUMBER(10) BULK_PACK_DISKETTES NUMBER(10) FLAT_PANEL_MONITOR NUMBER(10) HOME_THEATER_PACKAGE NUMBER(10) BOOKKEEPING_APPLICATION NUMBER(10) PRINTER_SUPPLIES NUMBER(10) Y_BOX_GAMES NUMBER(10) OS_DOC_SET_KANJI NUMBER(10) COMMENTS VARCHAR2(4000)
In the MINING_*_NESTED_TEXT
tables, the COMMENTS
column is of type DM_NESTED_NUMERICALS
.
SQL> describe MINING_BUILD_NESTED_TEXT Name Null? Type ----------------------------------------- -------- ---------------------------- CUST_ID NOT NULL NUMBER CUST_GENDER NOT NULL CHAR(1) AGE NUMBER CUST_MARITAL_STATUS VARCHAR2(20) COUNTRY_NAME NOT NULL VARCHAR2(40) CUST_INCOME_LEVEL VARCHAR2(30) EDUCATION VARCHAR2(21) OCCUPATION VARCHAR2(21) HOUSEHOLD_SIZE VARCHAR2(21) YRS_RESIDENCE NUMBER AFFINITY_CARD NUMBER(10) BULK_PACK_DISKETTES NUMBER(10) FLAT_PANEL_MONITOR NUMBER(10) HOME_THEATER_PACKAGE NUMBER(10) BOOKKEEPING_APPLICATION NUMBER(10) PRINTER_SUPPLIES NUMBER(10) Y_BOX_GAMES NUMBER(10) OS_DOC_SET_KANJI NUMBER(10) COMMENTS DMSYS.DM_NESTED_NUMERICALS
The process of extracting terms from a text column into a nested table column is described in "Using the Text Mining Samples" and in the Oracle Data Mining Application Developer's Guide.
The PL/SQL samples illustrate the use of the DBMS_DATA_MINING
package for creating models and the DBMS_DATA_MINING_TRANSFORM
package for performing transformations on the mining data. Several of the models include examples of the SQL scoring functions for applying models within a SQL statement. The PL/SQL sample programs are listed in Table 4-1.
See Also: Oracle Database PL/SQL Packages and Types Reference and Oracle Data Mining Application Developer's Guide for information on the Oracle Data Mining PL/SQL API. See Oracle Database SQL Reference for information on the SQL functions for data mining. |
In SQL*Plus, use commands like the following to execute the sample programs and list the models created by them.
>sqlplus dmuser3/dmuser3_password SQL> set serveroutput on SQL> set echo on SQL> @ %ORACLE_HOME%\rdbms\demo\program_name . . . . SQL> set linesize 200 SQL> set pagesize 100 SQL> select NAME, FUNCTION_NAME, ALGORITHM_NAME, TARGET_ATTRIBUTE from DM_USER_MODELS; NAME FUNCTION_NAME ALGORITHM_NAME TARGET_ATTRIBUTE --------------------- ---------------------- ------------------------------ ----------------- T_NMF_SAMPLE FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR T_SVM_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINES AFFINITY_CARD AR_SH_SAMPLE ASSOCIATION_RULES APRIORI_ASSOCIATION_RULES AI_SH_SAMPLE ATTRIBUTE_IMPORTANCE MINIMUM_DESCRIPTION_LENGTH AFFINITY_CARD ABN_SH_CLAS_SAMPLE CLASSIFICATION ADAPTIVE_BAYES_NETWORK AFFINITY_CARD DT_SH_CLAS_SAMPLE CLASSIFICATION DECISION_TREE AFFINITY_CARD NB_SH_CLAS_SAMPLE CLASSIFICATION NAIVE_BAYES AFFINITY_CARD SVMC_SH_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINES AFFINITY_CARD OC_SH_CLUS_SAMPLE CLUSTERING O_CLUSTER KM_SH_CLUS_SAMPLE CLUSTERING KMEANS NMF_SH_SAMPLE FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR SVMR_SH_REGR_SAMPLE REGRESSION SUPPORT_VECTOR_MACHINES AGE
Summary descriptions of the PL/SQL sample programs are provided in Table 4-8. The programs are listed individual by name and function in Table 4-1.
For detailed descriptions of the sample programs, see the comments in the source code.
Table 4-8 Overview of the PL/SQL Sample Programs
Mining Function | Description |
---|---|
Classification |
The classification programs demonstrate various preprocessing techniques and perform the following steps:
The |
Regression |
|
Association |
|
Clustering |
|
Feature Extraction |
|
Attribute Importance |
|
Some of the PL/SQL sample programs use SQL scoring functions to apply models created with the DBMS_DATA_MINING
package. The SQL scoring functions can also be used to apply models created with the Java API.
The programs that demonstrate the SQL scoring functions are listed in Table 4-9.
Table 4-9 SQL Scoring Functions in the Sample Programs
Program Name | Algorithm | SQL Functions Used |
---|---|---|
|
k-Means |
|
|
O-Cluster |
|
|
NMF |
|
|
Decision Tree |
|
|
SVM Classification |
|
|
SVM One Class |
|
|
SVM Regression |
|
|
Text Mining |
|
The Java samples illustrate the features of the Oracle Data Mining Java API, which implements Oracle-specific extensions to the Java Data Mining (JDM) 1.0 standard. The Java sample programs are listed in Table 4-4.
The Java samples demonstrate data preprocessing and the basic mining functions. Additional Java samples demonstrate predictive analytics, import/export, and text mining. Predictive analytics are implemented in PL/SQL as the DBMS_PREDICTIVE_ANALYTICS
package; predictive analytics can be accessed in a spreadsheet environment with the Oracle Spreadsheet Add-In for Predictive Analytics. Model import/export is implemented in PL/SQL as the IMPORT_MODEL
and EXPORT_MODEL
procedures in the DBMS_DATA_MINING
package.
Text mining in PL/SQL uses Oracle Text routines; the process is summarized in "Text Mining in PL/SQL". Text mining in Java is described in "Text Mining in Java". The Java text mining samples use the data that has not undergone preprocessing and does not contain nested tables. The data used by the text mining samples is described in "Customer Data for Text Mining".
See Also: Oracle Data Mining Java API Reference (javadoc) and the Oracle Data Mining Application Developer's Guide for information on the Java API. |
Before running the Java programs, do the following:
Check that the version of Java you are using is 1.4.2. You can execute the following in a command window to check the version of Java:
>java -version
Ensure that your CLASSPATH
includes the following JAR files:
%ORACLE_HOME%\rdbms\jlib\jdm.jar %ORACLE_HOME%\rdbms\jlib\ojdm_api.jar %ORACLE_HOME%\rdbms\jlib\xdb.jar %ORACLE_HOME%\jdbc\lib\ojdbc14.jar %ORACLE_HOME%\oc4j\j2ee\home\lib\connector.jar %ORACLE_HOME%\jlib\orai18n.jar %ORACLE_HOME%\jlib\orai18n-mapping.jar %ORACLE_HOME%\lib\xmlparserv2.jar
Compile the programs. Use the javac
executable in the \jdk\bin\
directory within Oracle home.
>javac program_name
Use the following syntax to execute the sample Java programs. Use the java
executable in the \jdk\bin\
directory within Oracle home.
>java program_name connect_string user_name user_password
The connection string specifies your Oracle database connection. It identifies the machine hosting the database, the port through which the connection is made, and the name of the database instance (the Oracle system identifier).
host_name:port:SID
For example, the following command executes the Naive Bayes sample dmnbdemo.java
as dmuser3
in the database instance orcl
on host mach05
at port 1521.
>java dmnbdemo mach05:1521:orcl dmuser3 dmuser3_password
You can list the models created by the Java programs with commands like the following in SQL*Plus.
>sqlplus dmuser3/dmuser3_password
SQL> set linesize 200
SQL> set pagesize 100
SQL> select NAME, FUNCTION_NAME, ALGORITHM_NAME, TARGET_ATTRIBUTE
from dm_user_models where NAME like '%JDM';
NAME FUNCTION_NAME ALGORITHM_NAME TARGET_ATTRIBUTE
----------------- --------------------- ---------------------------- ------------------
TXTNMFMODEL_JDM FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR
ARMODEL_JDM ASSOCIATION_RULES APRIORI_ASSOCIATION_RULES
TREEMODEL_JDM CLASSIFICATION DECISION_TREE AFFINITY_CARD
AIMODEL_JDM ATTRIBUTE_IMPORTANCE MINIMUM_DESCRIPTION_LENGTH AFFINITY_CARD
ABNMODEL_JDM CLASSIFICATION ADAPTIVE_BAYES_NETWORK AFFINITY_CARD
NBMODEL_JDM CLASSIFICATION NAIVE_BAYES AFFINITY_CARD
SVMCMODEL_JDM CLASSIFICATION SUPPORT_VECTOR_MACHINES AFFINITY_CARD
SVMOMODEL_JDM CLASSIFICATION SUPPORT_VECTOR_MACHINES
KMMODEL_JDM CLUSTERING KMEANS
OCMODEL_JDM CLUSTERING O_CLUSTER
NMFMODEL_JDM FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR
SVMRMODEL_JDM REGRESSION SUPPORT_VECTOR_MACHINES AGE
TXTSVMMODEL_JDM CLASSIFICATION SUPPORT_VECTOR_MACHINES AFFINITY_CARD
Summary descriptions of the Java sample programs are provided in Table 4-10. The programs are listed individually by name in Table 4-4 and Table 4-5.
For detailed descriptions of the sample programs, see the comments in the source code.
Table 4-10 Overview of the Java Sample Programs
Mining Function or Task | Description |
---|---|
Classification |
The classification programs demonstrate various preprocessing techniques and perform the following steps:
The |
Regression |
|
Association |
|
Clustering |
|
Feature Extraction |
|
Attribute Importance |
|
Data Transformations |
|
Predictive Analytics |
|
Model Import/Export |
|
Oracle Data Mining can mine text columns that have undergone pre-processing by Oracle Text routines.
Oracle Text is a technology for building text query and document classification applications. It provides indexing, word and theme searching, and viewing capabilities for text. Oracle Text is included in a general installation of Oracle Database Enterprise Edition, and therefore is already present in a database installed according to the instructions in Chapter 1.
The pre-processing steps for text mining create nested table columns of type DM_NESTED_NUMERICALS
from columns of type VARCHAR2
or CLOB
. Each row of the nested table specifies an attribute name and a value. The type definition is as follows.
CREATE OR REPLACE TYPE DM_NESTED_NUMERICAL AS OBJECT (attribute_name VARCHAR2(30), value NUMBER) / CREATE OR REPLACE TYPE DM_NESTED_NUMERICALS AS TABLE OF DM_NESTED_NUMERICAL
Terms extracted from text documents into nested tables can become generic attributes in training or scoring data. Classification, clustering, and feature-extraction models can be built using these attributes.
Sample text mining programs in both PL/SQL and Java illustrate classification and feature extraction of a pre-processed text column.
Three PL/SQL sample programs illustrate the process of text mining. One program illustrates the pre-processing that is required to prepare the data for mining. The other two programs build models that use the transformed text.
To prepare a column for text mining using the PL/SQL API, you must use Oracle Text routines to perform the following general steps:
Create a domain index on the column.
Use the index to extract terms from the column to a temporary table.
Populate a column of type DM_NESTED_NUMERICALS
with the terms in the temporary table.
The process of term extraction using Oracle Text is illustrated in the sample program dmtxtfe.sql
. The source code contains extensive comments that explain the steps involved in transforming text into a set of features that can be mined using Oracle Data Mining.
More details about text transformation are provided in the Oracle Data Mining Application Developer's Guide.
The dmsh.sql
script performs the text transformation required by the PL/SQL text mining samples. There are two such samples: dmtxtnmf.sq
l, which builds a feature extraction model using Non-Negative Matrix Factorization, and dmtxtsvm.sql
, which builds a classification model using Support Vector Machines. Both of these programs use the tables that have a nested table column of comment data.
MINING_BUILD_NESTED_TEXT MINING_TEST_NESTED_TEXT MINING_APPLY_NESTED_TEXT
You can run the PL/SQL text mining sample programs, dmtxtnmf.sq
l and dmtxtsvm.sql
, after simply following the instructions in "Preparing to Run the Sample Programs". The models created by these programs are listed in the following example.
SQL> @ %ORACLE_HOME%\rdbms\demo\dmtxtnmf.sql SQL> @ %ORACLE_HOME%\rdbms\demo\dmtxtsvm.sql SQL> select NAME, FUNCTION_NAME, ALGORITHM_NAME, TARGET_ATTRIBUTE from dm_user_models;NAME FUNCTION_NAME ALGORITHM_NAME TARGET_ATTRIBUTE ---------------- ------------------ ------------------------ ----------------T_NMF_SAMPLE FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR T_SVM_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINES AFFINITY_CARD
Two Java sample programs illustrate the process of text mining. One builds a feature extraction model, the other builds a classification model.
The Oracle Data Mining Java API provides an interface that handles the term extraction process. If you are developing data mining applications in Java, you do not need to use Oracle Text directly. However, you must ensure that Oracle Text is present in the database.
The OraTextTransform
interface is used to perform text transformation within the Java text mining samples. There are two such samples: dmtxtnmfdemo.java
, which builds a feature extraction model using Non-Negative Matrix Factorization, and dmtxtsvmdemo.java
, which builds a classification model using Support Vector Machines. Both of these programs create build, test, and apply data sets from the tables that have a text column of comment data.
MINING_BUILD_TEXT MINING_TEST_TEXT MINING_APPLY_TEXT
You can run the Java text mining sample programs, dmtxtnmfdemo.java
and dmtxtsvmdemo.java
, after following the instructions in "Preparing to Run the Sample Programs". The models created by these programs are shown in the following example.
> java dmtxtnmfdemo host:port:SID dmuser3 dmuser3_password > java dmtxtsvmdemo host:port:SID dmuser3 dmuser3_password >sqlplus dmuser3/dmuser3_password SQL> select NAME, FUNCTION_NAME, ALGORITHM_NAME, TARGET_ATTRIBUTE from dm_user_models;NAME FUNCTION_NAME ALGORITHM_NAME TARGET_ATTRIBUTE ---------------- ------------------ ------------------------ ----------------txtnmfModel_jdm FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR txtsvmModel_jdm CLASSIFICATION SUPPORT_VECTOR_MACHINES AFFINITY_CARD
The Oracle implementation of the Basic Local Alignment Search Tool (BLAST) is demonstrated in the sample program, dmbldemo.sql
. This program provides examples of sequence matching queries using the BLAST table functions.
The BLAST sample program and configuration scripts are listed in Table 4-3, "BLAST Sample Files".
See Also: Oracle Data Mining Application Developer's Guide and Oracle Data Mining Concepts for information on BLAST. |
The sample BLAST table functions in dmbldemo.sql
use two data sets: SwissProt
and ecoli10
. To prepare these data sets, log in to SQL*Plus as the data mining user and run the dmblprot
and dmblcoli
scripts as shown in the following example.
SQL>connect dmuser3/dmuser3_password
SQL>@ %ORACLE_HOME%\rdbms\demo\dmblprot.sql
SQL>@ %ORACLE_HOME%\rdbms\demo\dmblcoli.sql
Exit SQL*Plus and use the SQL*Loader utility to load data into the SwissProt
database in the schema of the data mining user. From the command prompt, change to the \rdbms\demo
directory under Oracle home and execute the following command.
>sqlldir dmuser3/dmuser3_password
control=dmblprot.ctl data=dmblprot.txt log=dmblprot.log
The sample program dmbldemo.sql
contains multiple invocations of BLAST table functions. You can run them all at once by running the dmbldemo.sql
script, or you can copy individual table functions to the SQL*Plus command line and execute them individually.
To run the sample program, log in to SQL*Plus as the data mining user and run the dmbldemo.sql
script as shown in the following example.
SQL>connect dmuser3/dmuser3_password
SQL> @ %ORACLE_HOME%\rdbms\demo\dmbldemo.sql
The following example shows how you could execute the sample BLASTP_MATCH
table function at the SQL*Plus command line. BLASTP_MATCH
is run against all the human proteins in the SwissProt
database.
SQL>connect dmuser3/dmuser3_password
SQL>SET serveroutput ON
SQL>SET trimspool ON
SQL>SET pages 10000
SQL>column seq_id format a10
SQL>select T_SEQ_ID AS seq_id, score, EXPECT as evalue
from TABLE(
BLASTP_MATCH (
(select sequence from query_db),
CURSOR(SELECT seq_id, seq_data
FROM swissprot
WHERE organism = 'Homo sapiens (Human)'),
1,
-1,
0,
0,
'BLOSUM62',
10,
0,
0,
0,
0,
0)
)
order by score,seq_id;
The output of this table function is shown as follows.
SEQ_ID SCORE EVALUE ---------- ---------- ---------- P58107 49 7.24297332 P31947 169 8.8130E-14 P27348 198 3.8228E-17 Q04917 198 3.8228E-17 P31946 205 5.8977E-18