Oracle9i Data Mining Concepts Release 9.2.0.2 Part Number A95961-02
A specific technique or procedure for producing a data mining model. An algorithm uses a specific model representation and may support one or more functional areas. Examples of algorithms used by ODM include Naive Bayes and Adaptive Bayes Networks for classification, k-means and O-Cluster for clustering, predictive variance for attribute importance, and Apriori for association rules.
The settings that specify algorithm-specific behavior for model building.
A user specification describing the kind of output desired from applying a model to data. This output may include predicted values, associated probabilities, key values, and other supplementary data.
A data mining function that captures co-occurrence of items among transactions. A typical rule is an implication of the form A -> B, which means that the presence of itemset A implies the presence of itemset B with certain support and confidence. The support of the rule is the ratio of the number of transactions where the itemsets A and B are present to the total number of transactions. The confidence of the rule is the ratio of the number of transactions where the itemsets A and B are present to the number of transactions where itemset A is present. ODM uses the Apriori algorithm for association rules.
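The support and confidence ratios above can be sketched as follows. The transactions and itemsets are hypothetical, chosen only to make the arithmetic concrete:

```python
# Hypothetical market-basket transactions; A and B are itemsets.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]
A, B = {"milk"}, {"bread"}

n_total = len(transactions)
# Transactions containing itemset A.
n_A = sum(1 for t in transactions if A <= t)
# Transactions containing both itemsets A and B.
n_AB = sum(1 for t in transactions if (A | B) <= t)

support = n_AB / n_total    # 2 / 4 = 0.5
confidence = n_AB / n_A     # 2 / 3
```

With these four transactions, the rule milk -> bread has support 0.5 and confidence 2/3.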
An instance of Attribute maps to a column with a name and data type. The attribute corresponds to a column in a database table. When assigned to a column, the column must have a compatible data type; if the data type is not compatible, a runtime exception is likely. Attributes are also called variables, features, data fields, or table columns.
A measure of the importance of an attribute in predicting a specified target. The measure of different attributes of a build data table enables users to select the attributes that are found to be most relevant to a mining model. A smaller set of attributes results in a faster model build; the resulting model could be more accurate. ODM uses the predictive variance algorithm for attribute importance. Also known as feature selection and key fields.
Specifies how a logical attribute is to be used when building a model, for example, marking it active or supplementary, suppressing automatic data preprocessing, or assigning a weight to a particular attribute. See also attribute usage set.
A collection of attribute usage objects that together determine how the logical attributes specified in a logical data object are to be used.
See discretization.
All the data collected about a specific transaction or related set of values.
An attribute where the values correspond to discrete categories. For example, state is a categorical attribute with discrete values (CA, NY, MA, etc.). Categorical attributes are either non-ordered (nominal) like state, gender, etc., or ordered (ordinal) such as high, medium, or low temperatures.
Corresponds to a distinct value of a categorical attribute. Categories may have string or numeric values. String values must not exceed 64 characters in length.
See cluster centroid.
A data mining function for predicting target values for new records using a model built from records with known target values. ODM supports two algorithms for classification, Naive Bayes and Adaptive Bayes Networks.
The cluster centroid is the vector that encodes, for each attribute, either the mean (if the attribute is numerical) or the mode (if the attribute is categorical) of the cases in the build data assigned to a cluster.
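As a minimal sketch of the mean/mode encoding, consider hypothetical cases assigned to one cluster, with two numerical attributes and one categorical attribute (the attribute names and values are illustrative only):

```python
from statistics import mean, mode

# Hypothetical build-data cases assigned to one cluster: (age, income, state).
cases = [
    (25, 50_000, "CA"),
    (35, 60_000, "CA"),
    (30, 70_000, "NY"),
]

# The centroid takes the mean of each numerical attribute
# and the mode of each categorical attribute.
centroid = (
    mean(c[0] for c in cases),   # mean age
    mean(c[1] for c in cases),   # mean income
    mode(c[2] for c in cases),   # most frequent state
)
```

Here the centroid is (30, 60000, "CA").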
A data mining function for finding naturally occurring groupings in data. More precisely, given a set of data points, each having a set of attributes, and a similarity measure among them, clustering is the process of grouping the data points into different clusters such that data points in the same cluster are more similar to one another and data points in different clusters are less similar to one another. ODM supports two algorithms for clustering, k-means and O-Cluster.
Measures the correctness of predictions made by a model from a test task. The row indexes of a confusion matrix correspond to the actual target values observed in the test data; the column indexes correspond to the predicted values produced by applying the model. For any pair of actual/predicted indexes, the entry indicates the number of records classified in that pairing.
When the predicted value equals the actual value, the prediction is correct; all other entries indicate errors (misclassifications).
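A confusion matrix can be tallied directly from paired actual and predicted values. The target values below are hypothetical:

```python
# Hypothetical actual vs. predicted target values from a test task.
actual    = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "no", "yes"]

labels = ["yes", "no"]
# Rows index actual values; columns index predicted values.
matrix = [[sum(1 for a, p in zip(actual, predicted) if a == r and p == c)
           for c in labels] for r in labels]
```

Here matrix is [[2, 1], [0, 2]]: the diagonal entries (2 and 2) are correct predictions, and the off-diagonal entry 1 counts the single misclassified record.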
A two-dimensional, n by n table that defines the cost associated with a prediction versus the actual value. A cost matrix is typically used in classification models, where n is the number of distinct values in the target, and the columns and rows are labeled with target values. The rows are the actual values; the columns are the predicted values.
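One common use of a cost matrix is to weight the entries of a confusion matrix, so that expensive errors count more than cheap ones. The labels, costs, and counts below are hypothetical:

```python
# Hypothetical 2x2 cost matrix: rows = actual values, columns = predicted values.
labels = ["churn", "no_churn"]
cost = [[0, 10],   # failing to predict an actual churner costs 10
        [1,  0]]   # falsely predicting churn costs 1

# Hypothetical confusion matrix with the same row/column labeling.
confusion = [[30, 5],
             [8, 57]]

# Total misclassification cost: each error count weighted by its cost.
total_cost = sum(cost[i][j] * confusion[i][j]
                 for i in range(2) for j in range(2))
```

With these numbers the total cost is 5*10 + 8*1 = 58; correct predictions on the diagonal contribute zero.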
A technique for evaluating the accuracy of a classification or regression model. This technique is used when there are insufficient cases for separate model building and testing. The data table is divided into several parts, with each part in turn being used to evaluate a model built using the remaining parts. Cross-validation occurs automatically for Naive Bayes and Adaptive Bayes Network models.
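The partitioning scheme can be sketched with a hypothetical 3-fold split over a small set of case identifiers:

```python
# Hypothetical 3-fold cross-validation over nine case ids.
cases = list(range(9))
k = 3
folds = [cases[i::k] for i in range(k)]   # three disjoint parts

splits = []
for held_out in range(k):
    test_set = folds[held_out]
    # Each part in turn is held out; the rest are used for building.
    train_set = [c for f in range(k) if f != held_out for c in folds[f]]
    splits.append((train_set, test_set))
```

Each of the three rounds builds on six cases and evaluates on the remaining three, so every case is used for evaluation exactly once.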
The process of discovering hidden, previously unknown, and usable information from a large amount of data. This information is represented in a compact form, often referred to as a model.
The component of the Oracle database that implements the data mining engine and persistent metadata repository.
Discretization groups related values together under a single value (or bin). This reduces the number of distinct values in a column. Fewer bins result in models that build faster. ODM algorithms require that input data be discretized prior to model building, testing, computing lift, and applying (scoring).
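One simple discretization scheme is equal-width binning, sketched below with hypothetical values (ODM supports several binning strategies; this illustrates only the general idea of mapping a numerical column to a small number of bins):

```python
# Hypothetical equal-width binning of a numerical column into 3 bins.
values = [3, 7, 12, 18, 25, 29]
n_bins = 3
lo, hi = min(values), max(values)
width = (hi - lo) / n_bins

def bin_of(v):
    # Map a value to its bin index; clamp the maximum into the last bin.
    return min(int((v - lo) / width), n_bins - 1)

bins = [bin_of(v) for v in values]
```

Here the six distinct values collapse into bin indexes [0, 0, 1, 1, 2, 2], reducing the number of distinct values the algorithm must handle from six to three.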
distance-based (clustering algorithm)
Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. Data points are assigned to the nearest cluster according to the distance metric used.
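The nearest-cluster assignment can be sketched with Euclidean distance as the metric (the centroids and point below are hypothetical):

```python
import math

# Hypothetical cluster centroids and a new data point.
centroids = {"c1": (0.0, 0.0), "c2": (5.0, 5.0)}
point = (1.0, 2.0)

def euclidean(p, q):
    # Euclidean distance as the similarity (distance) metric.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Assign the point to the nearest cluster under the metric.
nearest = min(centroids, key=lambda c: euclidean(point, centroids[c]))
```

The point (1, 2) is distance sqrt(5) from c1 and 5 from c2, so it is assigned to c1.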
See network feature.
A measure of how much better prediction results are using a model than could be obtained by chance. For example, suppose that 2% of the customers mailed a catalog without using the model would make a purchase. However, using the model to select catalog recipients, 10% would make a purchase. Then the lift is 10/2 or 5. Lift may also be used as a measure to compare different data mining models. Since lift is computed using a data table with actual outcomes, lift compares how well a model performs with respect to this data on predicted outcomes. Lift indicates how well the model improved the predictions over a random selection given actual results. Lift allows a user to infer how a model will perform on new data.
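The arithmetic of the catalog example reduces to a ratio of rates:

```python
# Rates from the catalog example above.
base_rate = 0.02    # purchase rate among randomly mailed customers
model_rate = 0.10   # purchase rate among customers the model selects

lift = model_rate / base_rate
```

Here the lift is 10/2 = 5: the model's selections buy at five times the chance rate.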
Specifies the location of data for a mining operation.
A description of a domain of data used as input to mining operations. Logical attributes may be categorical, ordinal, or numerical.
A set of mining attributes used as input to building a mining model.
See minimum description length principle.
Given a sample of data and an effective enumeration of the appropriate alternative theories to explain the data, the best theory is the one that minimizes the sum of the length of the theory and the length of the data when encoded using the theory as a predictor for the data.
See apply output.
ODM supports the following mining functions: classification, association rules, attribute importance, and clustering.
An object that specifies the type of model to build, the function of the model, and the algorithm to use. ODM supports the following mining functions: classification, association rules, attribute importance, and clustering.
The result of building a model from mining function settings. The representation of the model is specific to the algorithm specified by the user or selected by the DMS. A model can be used for direct inspection, e.g., to examine the rules produced from a decision tree or association rules, or to score data.
The end product(s) of a mining operation. For example, a build task produces a mining model; a test task produces a test result.
A data value that is missing because it was not measured, not answered, unknown, or lost (that is, it has a null value). Data mining systems vary in how they treat missing values. Typically, they ignore missing values, omit any records containing missing values, replace missing values with the mode or mean, or infer missing values from existing values. ODM ignores missing values during mining operations.
mixture model
A mixture model is a type of density model that includes several component functions (usually Gaussian) that are combined to provide a multimodal density.
An important function of data mining is the production of a model. A model can be descriptive or predictive. A descriptive model helps in understanding underlying processes or behavior. For example, an association model describes consumer behavior. A predictive model is an equation or set of rules that makes it possible to predict an unseen or unmeasured value (the dependent variable or output) from other, known values (independent variables or input). The form of the equation or rules is suggested by mining data collected from the process under study. Some training or estimation technique is used to estimate the parameters of the equation or rules. See also mining model.
See transactional format.
A network feature is a tree-like multi-attribute structure. From the standpoint of the network, features are conditionally independent components. Features contain at least one attribute (the root attribute). Conditional probabilities are computed for each value of the root predictor. A two-attribute feature will have, in addition to the root predictor conditional probabilities, conditional probabilities computed for each combination of values of the root and the depth 2 predictor. That is, if a root predictor, x, has i values and the depth 2 predictor, y, has j values, a conditional probability is computed for each combination of values {x=a, y=b such that a is in the set {1,..,i} and b is in the set {1,..,j}}. Similarly, a depth 3 predictor, z, would have additional conditional probabilities computed for each combination of values {x=a, y=b, z=c such that a is in the set {1,..,i} and b is in the set {1,..,j} and c is in the set {1,..,k}}. Network features are used in the Adaptive Bayes Network algorithm.
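The combinatorics of the value combinations above can be sketched by enumeration. The predictor cardinalities are hypothetical, and this shows only how many combinations receive a conditional probability, not how the probabilities themselves are estimated:

```python
from itertools import product

# Hypothetical depth-3 feature: root predictor x with i values,
# depth 2 predictor y with j values, depth 3 predictor z with k values.
i, j, k = 2, 3, 2
x_vals, y_vals, z_vals = range(i), range(j), range(k)

# One conditional probability is computed per combination {x=a, y=b, z=c}.
combos = list(product(x_vals, y_vals, z_vals))
```

With these cardinalities there are i*j*k = 12 combinations, so the feature grows multiplicatively with each added depth level.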
Each case in the data is stored as one record (row) in a table. Also known as single-record case. See also transactional format.
An attribute whose values are numbers. The numeric value can be either an integer or a real number. Numerical attribute values can be manipulated as continuous values. See also categorical attribute.
A data value that does not (or is not thought to have) come from the typical population of data; in other words, a data value that falls outside the boundaries that enclose most other data values in the data.
Identifies data to be used as input to data mining. Through the use of attribute assignment, attributes of the physical data are mapped to logical attributes of a model's logical data. The data referenced by a physical data object can be used in model building, model application (scoring), lift computation, statistical analysis, etc.
An object that specifies the characteristics of the physical data used in a mining operation. The physical data specification includes information about the format of the data (transactional or nontransactional) and the roles that the data columns play.
In binary classification problems, you may designate one of the two classes (target values) as positive, the other as negative. When ODM computes a model's lift, it calculates the density of positive target values among a set of test instances for which the model predicts positive values with a given degree of confidence.
A logical attribute used as input to a supervised model or algorithm to build a model.
The set of prior probabilities specifies the distribution of examples of the various classes in data. Also referred to as priors, these could be different from the distribution observed in the data.
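The contrast between the observed distribution and user-supplied priors can be sketched as follows; the target values and counts are hypothetical:

```python
from collections import Counter

# Hypothetical build-data targets: "yes" cases were oversampled to 20%.
targets = ["no"] * 80 + ["yes"] * 20

# Distribution observed in the data.
observed = {t: c / len(targets) for t, c in Counter(targets).items()}

# Priors supplied by the user, reflecting the true population distribution,
# which may differ from what is observed in the (oversampled) build data.
priors = {"no": 0.95, "yes": 0.05}
```

The observed distribution here is {"no": 0.8, "yes": 0.2}, while the stated priors say the positive class is actually much rarer in the population.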
See prior probabilities.
An expression of the general form if X, then Y. An output of certain models, such as association rules models or decision tree models. The predicate X may be a compound predicate.
Scoring data means applying a data mining model to new data to generate predictions. See apply output.
See algorithm settings and mining function settings.
The process of building data mining models using a known dependent variable, also referred to as the target. Classification techniques are supervised. See unsupervised mining (learning).
In supervised learning, the identified logical attribute that is to be predicted. Sometimes called target value or target attribute.
A container within which to specify arguments to data mining operations to be performed by the data mining system.
Each case in the data is stored as multiple records in a table with columns sequenceID, attribute_name, and value. Also known as multi-record case. See also nontransactional format.
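The relationship between the two formats can be sketched with one hypothetical case, pivoting its transactional rows back into a single record:

```python
# Hypothetical single-record (nontransactional) case.
single_record = {"case_id": 1, "age": 30, "state": "CA"}

# The same case in transactional (multi-record) format.
transactional = [
    {"sequenceID": 1, "attribute_name": "age",   "value": 30},
    {"sequenceID": 1, "attribute_name": "state", "value": "CA"},
]

# Pivot the transactional rows back into a single record.
rebuilt = {"case_id": transactional[0]["sequenceID"]}
for row in transactional:
    rebuilt[row["attribute_name"]] = row["value"]
```

Each attribute of the case occupies its own row in transactional format, and the pivot recovers the original single-record case.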
A function applied to data resulting in a new form or representation of the data. For example, discretization and normalization are transformations on data.
The process of building data mining models without the guidance (supervision) of a known, correct result. Clustering and association rules are unsupervised mining functions. See supervised mining (learning).