Oracle9i Data Mining Concepts, Release 9.2.0.2, Part Number A95961-02
This chapter contains complete examples of using ODM to build a model and then score new data using that model. These examples illustrate the steps that are required in all code that uses ODM. The following two sample programs are discussed in this chapter:
- Sample_NaiveBayesBuild_short.java (Section 3.2)
- Sample_NaiveBayesApply_short.java (Section 3.3)

The complete code for these examples is included in the ODM sample programs that are installed when ODM is installed. For an overview of the ODM sample programs, see Appendix A. For detailed information about compiling and linking these programs, see Section A.6.
The data that the sample programs use is included with the sample programs: Sample_NaiveBayesBuild_short.java uses census_2d_build_unbinned, and Sample_NaiveBayesApply_short.java uses census_2d_apply_unbinned. For more information about these tables, see Section A.4. The data used by the sample programs is installed in the ODM_MTR schema.
This chapter does not include a detailed description of any of the ODM API classes and methods. For detailed information about the ODM API, see the ODM Javadoc in the directory $ORACLE_HOME/dm/doc
on any system where ODM is installed.
The sample programs have a number of steps in common. Common steps are repeated for simplicity of reference.
These short sample programs use data tables that are used by the other ODM sample programs. The short sample program that builds a model uses the table CENSUS_2D_BUILD_UNBINNED; the short sample program that applies the model uses CENSUS_2D_APPLY_UNBINNED. For more information about these tables, see Section A.4.
Note that these short sample programs do not use the property files that the other ODM sample programs use.
The short sample programs must be compiled and then executed in the proper order; you must execute Sample_NaiveBayesBuild_short.java (which builds the model) before you execute Sample_NaiveBayesApply_short.java (which applies the model to data).
You can compile and execute these programs in several ways; these methods are described in Section A.6.
Note that the short sample programs do not use property files.
This section describes the steps that must be performed by any program that builds an ODM model.
The sample program Sample_NaiveBayesBuild_short.java is a complete executable program that illustrates these required steps. The data for the sample program is CENSUS_2D_BUILD_UNBINNED.
Before you build an ODM model, ODM must be installed on your system. You need to know the URL of the database where the ODM Data Mining Server resides, the user name, and the password. (Ask the person who installed the program what the user name and password are.)
Before you execute an ODM program, the ODM Monitor must be running.
Before you build a model, you must identify the data to be used during model building. The data must reside in a table in an Oracle9i database. You should clean the data as necessary; for example, you may want to treat missing values and deal with outliers, that is, extreme values that are either errors or values that may skew the binning. The table that contains the data can be in either transactional or nontransactional form. The ODM sample programs include data tables to use for model building.
Before you build a model, you must also know what data mining function you wish to perform; for example, you may wish to create a classification model. You may specify which algorithm to use or let ODM decide which algorithm to use. The sample programs described in this chapter build and apply a Naive Bayes model.
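Conceptually, a Naive Bayes model predicts the most probable target value by combining the target's prior probability with the conditional probabilities of the observed attribute values. The sketch below is purely illustrative; the class name, probabilities, attribute values, and target labels are invented for this example and are not part of the ODM API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy Naive Bayes scorer (illustrative only; not ODM's in-database
// implementation). A build step would estimate these probabilities
// from counts in the build table.
public class NaiveBayesSketch {

    // score(target) = P(target) * product of P(attribute=value | target)
    public static double posterior(Map<String, Double> prior,
                                   Map<String, Map<String, Double>> cond,
                                   String target, String[] evidence) {
        double score = prior.get(target);
        for (String e : evidence) {
            // Small default guards against unseen attribute values.
            score *= cond.get(target).getOrDefault(e, 1e-6);
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Double> prior = new HashMap<>();
        prior.put("<=50K", 0.75);
        prior.put(">50K", 0.25);

        Map<String, Map<String, Double>> cond = new HashMap<>();
        Map<String, Double> low = new HashMap<>();
        low.put("education=HS", 0.6);
        low.put("age=30s", 0.4);
        Map<String, Double> high = new HashMap<>();
        high.put("education=HS", 0.2);
        high.put("age=30s", 0.5);
        cond.put("<=50K", low);
        cond.put(">50K", high);

        String[] evidence = {"education=HS", "age=30s"};
        double pLow = posterior(prior, cond, "<=50K", evidence);  // 0.75*0.6*0.4 = 0.18
        double pHigh = posterior(prior, cond, ">50K", evidence);  // 0.25*0.2*0.5 = 0.025
        System.out.println(pLow > pHigh ? "<=50K" : ">50K"); // prints "<=50K"
    }
}
```

The "naive" independence assumption is what makes the per-attribute probabilities simply multiply; the singleton and pairwise thresholds discussed later in this chapter control which of these probabilities the algorithm keeps.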
For ODM to build a model, ODM must know where the build data is located, how the data is organized, and what kind of model to build. The following steps provide this information:

- Create a PhysicalDataSpecification object for the build data.
- Create a MiningFunctionSettings object (in this case, a ClassificationFunctionSettings object with no supplemental attributes).

The steps are illustrated below with code for building a Naive Bayes model.
Before building a model, it is necessary to create an instance of DataMiningServer
. This instance is used as a proxy to create connections to a Data Mining Server (DMS). The instance also maintains the connection. The DMS is the server-side, in-database component that performs the actual data mining operations within ODM. The DMS also provides a metadata repository consisting of mining input objects and result objects, along with the namespaces within which these objects are stored and retrieved.
//Create an instance of the DMS server.
//The mining server DB_URL, user_name, and password for your installation
//need to be specified.
dms = new DataMiningServer("DB_URL", "user_name", "password");
//Get the actual connection
dmsConnection = dms.login();
Before ODM can use data to build a model, it must know where the data is and how the data is organized. This is done through a PhysicalDataSpecification
instance where you indicate whether the data is in nontransactional or transactional format and describe the roles the various data columns play.
Before you create a PhysicalDataSpecification
instance, you must provide information about the location of the build data. This is accomplished using a LocationAccessData
object.
//Create a LocationAccessData using the table_name
//(CENSUS_2D_BUILD_UNBINNED) and schema_name for your installation
LocationAccessData lad =
    new LocationAccessData("CENSUS_2D_BUILD_UNBINNED", "schema_name");
Next, create the actual PhysicalDataSpecification
instance.
If the data is in nontransactional format, all the information needed to build a PhysicalDataSpecification is contained in the LocationAccessData object.
//Create the actual PhysicalDataSpecification for a
//NonTransactionalDataSpecification object since the
//data set is nontransactional
PhysicalDataSpecification m_PhysicalDataSpecification =
    new NonTransactionalDataSpecification(lad);
If the data is in transactional format, you must specify the role that the various data columns play.
//Create the actual PhysicalDataSpecification for a transactional
//data case
PhysicalDataSpecification m_PhysicalDataSpecification =
    new TransactionalDataSpecification(
        "CASE_ID",    //column name for sequence id
        "ATTRIBUTES", //column name for attribute name
        "VALUES",     //column name for value
        lad);
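As an illustration of the two layouts (not ODM code), the helper below flattens one nontransactional row, with one column per attribute, into the (case id, attribute name, value) triples that a transactional table with CASE_ID, ATTRIBUTES, and VALUES columns would hold. The column names and values are invented for this example:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: the same case expressed in the two physical
// formats that a PhysicalDataSpecification describes.
public class DataFormats {

    // Nontransactional input: header = {case_id_col, attr1, attr2, ...},
    // row = {case_id, value1, value2, ...}.
    // Transactional output: one {case_id, attribute_name, value} triple
    // per attribute, matching the CASE_ID / ATTRIBUTES / VALUES columns.
    public static List<String[]> toTransactional(String[] header, String[] row) {
        List<String[]> out = new ArrayList<>();
        for (int i = 1; i < header.length; i++) {
            out.add(new String[] { row[0], header[i], row[i] });
        }
        return out;
    }

    public static void main(String[] args) {
        String[] header = { "PERSON_ID", "AGE", "EDUCATION" };
        String[] row = { "101", "34", "HS-grad" };
        for (String[] t : toTransactional(header, row)) {
            System.out.println(t[0] + " | " + t[1] + " | " + t[2]);
        }
        // 101 | AGE | 34
        // 101 | EDUCATION | HS-grad
    }
}
```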
The MiningFunctionSettings
(MFS) object tells the DMS the type of model to build, the function of the model, and the algorithm to use.
ODM supports several mining functions.
The MFS allows a user to specify the type of result desired without having to specify a particular algorithm. If an algorithm is not specified, the underlying DMS selects the algorithm based on user-provided parameters.
To build a model for classification using ODM's default classification algorithm, use a ClassificationFunctionSettings
object with a null MiningAlgorithmSettings
for the MFS. An easy way to create a ClassificationFunctionSettings
object is to use the create
method, as illustrated below. In this case, it is necessary to indicate the name of the target attribute, the type of the target attribute, and whether the data has been prepared (binned) by the user. Unprepared data will automatically be binned by ODM.
//Specify "class" as the target attribute name, categorical for the target
//attribute type, and set the DataPreparationStatus to unprepared.
//Automatic binning will be applied in this case.
ClassificationFunctionSettings m_ClassificationFunctionSettings =
ClassificationFunctionSettings.create(
dmsConnection,
null,
m_PhysicalDataSpecification,
"class",
AttributeType.categorical,
DataPreparationStatus.getInstance("unprepared"));
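To picture what binning does to unprepared data, here is a minimal equi-width discretizer. This is only a sketch of the idea; the actual binning strategy and bin boundaries that ODM chooses, and the range and bin count shown, are not taken from this chapter:

```java
// Illustrative equi-width binning of a numerical attribute.
public class BinningSketch {

    // Map a raw value to a bin index in [0, nBins) over the range [min, max].
    public static int bin(double value, double min, double max, int nBins) {
        if (value <= min) return 0;           // clamp below the range
        if (value >= max) return nBins - 1;   // clamp above the range
        double width = (max - min) / nBins;   // each bin covers an equal interval
        return (int) ((value - min) / width);
    }

    public static void main(String[] args) {
        // e.g. bin an age of 34 into 5 bins over an assumed range 0..100
        System.out.println(bin(34.0, 0.0, 100.0, 5)); // prints 1
    }
}
```

Binned (prepared) data lets the algorithm work with discrete categories; when the DataPreparationStatus is "unprepared", ODM performs a step of this kind automatically.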
If a particular algorithm is to be used, the information about the algorithm is captured in a MiningAlgorithmSettings
instance. For example, if you want to build a model for classification using the Naive Bayes algorithm, first create a NaiveBayesSettings
instance to specify settings for the Naive Bayes algorithm. Two settings are available: singleton threshold
and pairwise threshold
. Then create a ClassificationFunctionSettings
instance for the build operation.
//Create the Naive Bayes algorithm settings by setting the thresholds
//to 0.01.
NaiveBayesSettings algorithmSetting = new NaiveBayesSettings(0.01f, 0.01f);
//Create the actual ClassificationFunctionSettings using
//algorithmSetting for MiningAlgorithmSettings. Specify "class" as
//the target attribute name, "categorical" for the target attribute
//type, and set the DataPreparationStatus to "unprepared".
//Automatic binning will be applied in this case.
ClassificationFunctionSettings m_ClassificationFunctionSettings =
    ClassificationFunctionSettings.create(
        dmsConnection,
        algorithmSetting,
        m_PhysicalDataSpecification,
        "class",
        AttributeType.categorical,
        DataPreparationStatus.getInstance("unprepared"));
Because MiningFunctionSettings
objects are complex objects, it is good practice to validate whether they were correctly created before starting the actual build task. If the MiningFunctionSettings
object is a valid one, it should be persisted in the DMS for later use. This is illustrated below for the ClassificationFunctionSettings
in our example.
//Validate and store the ClassificationFunctionSettings object
//with the name "Sample_NB_MFS".
m_ClassificationFunctionSettings.validate();
m_ClassificationFunctionSettings.store(dmsConnection, "Sample_NB_MFS");
Now that all the required information for building the model has been captured in an instance of PhysicalDataSpecification
and MiningFunctionSettings
, the last step needed is to decide whether the model should be built synchronously or asynchronously.
If you are calling ODM from an application, the design of the calling application may determine whether to build the model synchronously or asynchronously. Also, if the data used to build the model is large, it may take a significant amount of time to build the model; in such a case, you will probably want to build the model asynchronously.
For a synchronous build, use the static MiningModel.build
method. Note that this method is deprecated for ODM release 2.
//Build the model using the MFS named "Sample_NB_MFS" and store the
//model under the name "Sample_NB_Model".
MiningModel.build(
    dmsConn,
    lad,
    m_PhysicalDataSpecification,
    "Sample_NB_MFS",
    "Sample_NB_Model");
For an asynchronous build, create an instance of MiningTask
. A mining task can be persisted in the DMS using the store
method and executed at any time; however, it can be executed only once. Once the task is executing, query the current status information of a task by calling the getCurrentStatus
method. This call returns a MiningTaskStatus
object, which provides more details about the state. You can get the complete status history of a task by calling the getStatusHistory
method.
//Create a Naive Bayes build task and execute it.
//The MiningFunctionSettings name (for example, "Sample_NB_MFS") and
//the model name (for example, "Sample_NB_Model") need to be specified.
MiningBuildTask task =
    new MiningBuildTask(
        m_PhysicalDataSpecification,
        "Sample_NB_MFS",
        "Sample_NB_Model");
//Store the task under the name "Sample_NB_Build_Task"
task.store(dmsConnection, "Sample_NB_Build_Task");
//Execute the task
task.execute(dmsConnection);
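A typical pattern for an asynchronous task is to poll getCurrentStatus until the task reaches a terminal state. The sketch below uses stand-in Task and TaskStatus interfaces; the interface names, method shapes, and state strings are hypothetical, and the real MiningTask and MiningTaskStatus signatures may differ. It only shows the shape of such a polling loop:

```java
// Sketch of polling an asynchronous mining task until it finishes.
public class TaskPollingSketch {

    public interface TaskStatus { String getState(); }      // hypothetical
    public interface Task { TaskStatus getCurrentStatus(); } // hypothetical

    // Poll up to maxPolls times, stopping at a terminal state.
    public static String waitForCompletion(Task task, int maxPolls) {
        String state = "";
        for (int i = 0; i < maxPolls; i++) {
            state = task.getCurrentStatus().getState();
            if (state.equals("COMPLETED") || state.equals("ERROR")) break;
            // In real code you would also sleep between polls, e.g.
            // Thread.sleep(pollIntervalMillis);
        }
        return state;
    }

    public static void main(String[] args) {
        // Simulated task that completes on the third poll.
        final int[] polls = {0};
        Task fake = () -> {
            polls[0]++;
            return () -> polls[0] >= 3 ? "COMPLETED" : "RUNNING";
        };
        System.out.println(waitForCompletion(fake, 10)); // prints "COMPLETED"
    }
}
```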
After the MiningModel.build
or the task.execute
call successfully completes, the model will be stored using the name that you specified (in this case, Sample_NB_Model) in the DMS.
After you've created a model, you can apply it to new data to make predictions; the process is referred to as "scoring data."
ODM can be used to score multiple records specified in a single database table or to score a single record. This section describes scoring multiple records.
The sample program Sample_NaiveBayesApply_short.java
is a complete executable program that illustrates these required steps. The data for this sample program is CENSUS_2D_APPLY_UNBINNED
. Note that this sample program does not use a property file.
Before scoring an ODM model, you must have built an ODM model. This implies that ODM is installed on your system, and that you know the location of the database, the user name, and the password.
Before executing an ODM program, the ODM Monitor must be running.
Before you score data, the data must reside in a table in an Oracle9i database. The data to score must be compatible with the build data that you used when you built the model. You should clean the data to be scored in the same way that you cleaned the build data. If the build data for the model was not binned, the data to score must also be unbinned.
The table that contains the data to score can be in either transactional or nontransactional form.
For ODM to score data using a model, ODM must know where the input data is located, how it is organized, which model to apply, and where to write the output. The following steps provide this information:

- Create a PhysicalDataSpecification object for the input data (the data that you want to score).
- Create LocationAccessData objects for the input and output data.
- Create a MiningApplyOutput object for the output data.

The steps above are illustrated in this section with code for scoring a Naive Bayes model.
Before scoring data, it is necessary to create an instance of DataMiningServer
. This instance is used as a proxy to create connections to a Data Mining Server (DMS). The instance also maintains the connection. The DMS is the server-side, in-database component that performs the actual data mining operations within ODM. The DMS also provides a metadata repository consisting of mining input objects and result objects, along with the namespaces within which these objects are stored and retrieved.
//Create an instance of the DMS server.
//The mining server DB_URL, user_name, and password for your installation
//need to be specified.
dms = new DataMiningServer("DB_URL", "user_name", "password");
//Get the actual connection
dmsConnection = dms.login();
Before ODM can apply a model to data, it must know the physical layout of the data. This is done through a PhysicalDataSpecification
instance where you indicate whether the data is in nontransactional or transactional format and describe the roles the various data columns play.
Before you create a PhysicalDataSpecification
instance, you must provide information about the location of the input data. This is accomplished using a LocationAccessData
object.
//Create a LocationAccessData using the table_name
//(CENSUS_2D_APPLY_UNBINNED) and the schema_name for your installation
LocationAccessData lad =
    new LocationAccessData("CENSUS_2D_APPLY_UNBINNED", "schema_name");
Next, create the PhysicalDataSpecification
instance.
If the data is in nontransactional format, all the information needed to build a PhysicalDataSpecification is contained in the LocationAccessData object.
//Create the actual PhysicalDataSpecification for a
//NonTransactionalDataSpecification object since the
//data set is nontransactional
PhysicalDataSpecification m_PhysicalDataSpecification =
    new NonTransactionalDataSpecification(lad);
If the data is in transactional format, you must specify the role that the various data columns play.
//Create the actual PhysicalDataSpecification for a transactional
//data case
PhysicalDataSpecification m_PhysicalDataSpecification =
    new TransactionalDataSpecification(
        "CASE_ID",    //column name for sequence id
        "ATTRIBUTES", //column name for attribute name
        "VALUES",     //column name for value
        lad);
Before scoring the input data, the DMS needs to know where to store the output of the scoring.
Create a LocationAccessData
object specifying where to store the apply output. The following code specifies writing to the output table CENSUS_NB_APPLY_RESULT.
// LocationAccessData for the output table to store the apply results.
LocationAccessData ladOutput =
    new LocationAccessData("CENSUS_NB_APPLY_RESULT", "output_schema_name");
The DMS also needs to know the content of the scoring output. This information is captured in a MiningApplyOutput
(MAO) object. An instance of MiningApplyOutput
specifies the data (columns) to be included in the apply output table that is created as the result of an apply operation. The columns in the apply output table are described by a combination of ApplyContentItem
objects. These columns can be either from the input table or generated by the scoring task (for example, prediction and probability). The following steps are involved in creating a MiningApplyOutput
object:
- Create an empty MiningApplyOutput object.
- Create an ApplyContentItem object describing which generated columns are to be included in the output, and add it to the MiningApplyOutput object.
- Create ApplyContentItem objects describing columns from the input table to be included in the output, and add them to the MiningApplyOutput object.
- Validate the MiningApplyOutput that you created.

Create an empty MiningApplyOutput object as follows:
// Create an empty MiningApplyOutput object
MiningApplyOutput m_MiningApplyOutput = new MiningApplyOutput();
There are two options for generated columns, described by the following ApplyContentItem subclasses:

- ApplyMultipleScoringItem: used for generating a list of top or bottom n predictions ordered by their associated target value probability
- ApplyTargetProbabilityItem: used for generating a list of probabilities for particular target values

For the current example, let's use an ApplyTargetProbabilityItem instance. Before creating an instance of ApplyTargetProbabilityItem, it is necessary to specify the names and the data types of the prediction, probability, and rank columns for the output. This is done through Attribute objects.
// Create Attribute objects that specify the names and data
// types of the prediction, probability, and rank columns for the
// output.
Attribute predictionAttribute =
    new Attribute("myprediction", DataType.stringType);
Attribute probabilityAttribute =
    new Attribute("myprobability", DataType.stringType);
Attribute rankAttr = new Attribute("myrank", DataType.stringType);
// Create the ApplyTargetProbabilityItem instance
ApplyTargetProbabilityItem aTargetAttrItem =
    new ApplyTargetProbabilityItem(predictionAttribute,
        probabilityAttribute, rankAttr);
An ApplyTargetProbabilityItem
class contains a set of target values whose prediction and probability appear in the apply output table, regardless of their ranks. A target value is represented as a Category, and it must be one of the target values in the target attribute used when building the model to be applied. This step is not necessary for the ApplyMultipleScoringItem
case.
// Create Category objects to represent the target values
// to be included in the apply output table. In this example,
// two target values are specified.
Category target_category =
    new Category("positive_class", "0", DataType.getInstance("int"));
Category target_category1 =
    new Category("positive_class", "1", DataType.getInstance("int"));
// Add the target values to the ApplyTargetProbabilityItem
// instance
aTargetAttrItem.addTarget(target_category);
aTargetAttrItem.addTarget(target_category1);
// Add the ApplyTargetProbabilityItem to the MiningApplyOutput
// object
m_MiningApplyOutput.addItem(aTargetAttrItem);
The input table columns to be included in the apply output are described by ApplySourceAttributeItem
instances. Each instance maps a column in the input table to a column in the output table. These columns are described by a source Attribute and a destination Attribute.
// In this example, attribute "PERSON_ID" from the source table
// will be returned in the column "ID" in the output table.
// This specification is captured by the
// m_ApplySourceAttributeItem object.
MiningAttribute sourceAttribute =
    new MiningAttribute(
        "PERSON_ID",
        DataType.intType,
        AttributeType.notApplicable,
        false,
        false);
Attribute destinationAttribute =
    new Attribute("ID", DataType.intType);
ApplySourceAttributeItem m_ApplySourceAttributeItem =
    new ApplySourceAttributeItem(sourceAttribute, destinationAttribute);
// Add the ApplySourceAttributeItem object
// to the MiningApplyOutput object
m_MiningApplyOutput.addItem(m_ApplySourceAttributeItem);
Note that the source mining attribute could have been taken from the logical data of the model's function settings.
Because MiningApplyOutput
objects are complex objects, it is a good practice to validate that they were correctly created before you do the actual scoring. This is illustrated below for the MiningApplyOutput
in our example.
// Validate the MiningApplyOutput
m_MiningApplyOutput.validate();
Now that all the required information for scoring the model has been captured in instances of PhysicalDataSpecification, LocationAccessData, and MiningApplyOutput, the last step is to decide whether the model should be applied synchronously or asynchronously.
If you are calling ODM from an application, the design of the calling application may determine whether to apply the model synchronously or asynchronously. Also, if the input data has many cases, the apply operation may require a significant amount of time; in such a case, you will probably want to apply the model asynchronously.
For synchronous apply, use the static SupervisedModel.apply method. Note that this method is deprecated for ODM release 2.
// Synchronous Apply
// Score the data using the model named "Sample_NB_Model" and
// store the results in "Sample_NB_APPLY_RESULT"
SupervisedModel.apply(
    dmsConn,
    lad,
    m_PhysicalDataSpecification,
    "Sample_NB_Model",
    m_MiningApplyOutput,
    ladOutput,
    "Sample_NB_APPLY_RESULT");
For asynchronous apply, it is necessary to create an instance of MiningTask
. A mining task can be persisted in the DMS using the store(dmsConn, taskName)
method and executed at any time; such a task can be executed only once. The current status information of a task can be queried by calling the getCurrentStatus(dmsConn, taskName)
method. This returns a MiningTaskStatus
object, which provides more details about the state. You can get the complete status history of a task by calling the getStatusHistory(dmsConn, taskName)
method.
// Asynchronous Apply
// Create a Naive Bayes apply task and execute it.
// The result name (e.g., "Sample_NB_APPLY_RESULT") and the
// model name (e.g., "Sample_NB_Model") need to be specified.
MiningApplyTask task =
    new MiningApplyTask(
        m_PhysicalDataSpecification,
        "Sample_NB_Model",
        m_MiningApplyOutput,
        ladOutput,
        "Sample_NB_APPLY_RESULT");
// Store the task under the name "Sample_NB_APPLY_Task"
task.store(dmsConnection, "Sample_NB_APPLY_Task");
// Execute the task
task.execute(dmsConnection);