Oracle9i Data Mining Concepts Release 9.2.0.2 Part Number A95961-02
A specific technique or procedure for producing a data mining model. An algorithm uses a specific model representation and may support one or more functional areas. Examples of algorithms used by ODM include Naive Bayes and Adaptive Bayes Networks for classification, k-means and O-Cluster for clustering, predictive variance for attribute importance, and Apriori for association rules.
The settings that specify algorithm-specific behavior for model building.
A user specification describing the kind of output desired from applying a model to data. This output may include predicted values, associated probabilities, key values, and other supplementary data.
A data mining function that captures co-occurrence of items among transactions. A typical rule is an implication of the form A -> B, which means that the presence of itemset A implies the presence of itemset B with certain support and confidence. The support of the rule is the ratio of the number of transactions where the itemsets A and B are present to the total number of transactions. The confidence of the rule is the ratio of the number of transactions where the itemsets A and B are present to the number of transactions where itemset A is present. ODM uses the Apriori algorithm for association rules.
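The support and confidence ratios above can be sketched as follows. The transactions and itemsets are hypothetical, chosen only to make the arithmetic concrete:

```python
# Hypothetical market-basket transactions; A and B are itemsets.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]
A, B = {"milk"}, {"bread"}

n_total = len(transactions)
# Transactions containing itemset A.
n_A = sum(1 for t in transactions if A <= t)
# Transactions containing both itemsets A and B.
n_AB = sum(1 for t in transactions if (A | B) <= t)

support = n_AB / n_total    # 2 / 4 = 0.5
confidence = n_AB / n_A     # 2 / 3
```

With these four transactions, the rule milk -> bread has support 0.5 and confidence 2/3.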
An instance of Attribute maps to a column with a name and data type. The attribute corresponds to a column in a database table. When assigned to a column, the column must have a compatible data type; if the data type is not compatible, a runtime exception is likely. Attributes are also called variables, features, data fields, or table columns.
A measure of the importance of an attribute in predicting a specified target. The measure of different attributes of a build data table enables users to select the attributes that are found to be most relevant to a mining model. A smaller set of attributes results in a faster model build; the resulting model could be more accurate. ODM uses the predictive variance algorithm for attribute importance. Also known as feature selection and key fields.
Specifies how a logical attribute is to be used when building a model, for example, marking it active or supplementary, suppressing automatic data preprocessing, or assigning a weight to a particular attribute. See also attribute usage set.
A collection of attribute usage objects that together determine how the logical attributes specified in a logical data object are to be used.
See discretization.
All the data collected about a specific transaction or related set of values.
An attribute where the values correspond to discrete categories. For example, state is a categorical attribute with discrete values (CA, NY, MA, etc.). Categorical attributes are either non-ordered (nominal) like state, gender, etc., or ordered (ordinal) such as high, medium, or low temperatures.
Corresponds to a distinct value of a categorical attribute. Categories may have string or numeric values. String values must not exceed 64 characters in length.
See cluster centroid.
A data mining function for predicting target values for new records using a model built from records with known target values. ODM supports two algorithms for classification, Naive Bayes and Adaptive Bayes Networks.
The cluster centroid is the vector that encodes, for each attribute, either the mean (if the attribute is numerical) or the mode (if the attribute is categorical) of the cases in the build data assigned to a cluster.
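As a minimal sketch of the mean/mode encoding, consider hypothetical cases assigned to one cluster, with two numerical attributes and one categorical attribute (the attribute names and values are illustrative only):

```python
from statistics import mean, mode

# Hypothetical build-data cases assigned to one cluster: (age, income, state).
cases = [
    (25, 50_000, "CA"),
    (35, 60_000, "CA"),
    (30, 70_000, "NY"),
]

# The centroid takes the mean of each numerical attribute
# and the mode of each categorical attribute.
centroid = (
    mean(c[0] for c in cases),   # mean age
    mean(c[1] for c in cases),   # mean income
    mode(c[2] for c in cases),   # most frequent state
)
```

Here the centroid is (30, 60000, "CA").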
A data mining function for finding naturally occurring groupings in data. More precisely, given a set of data points, each having a set of attributes, and a similarity measure among them, clustering is the process of grouping the data points into different clusters such that data points in the same cluster are more similar to one another and data points in different clusters are less similar to one another. ODM supports two algorithms for clustering, k-means and O-Cluster.
Measures the correctness of predictions made by a model from a test task. The row indexes of a confusion matrix correspond to the actual target values observed in the test data; the column indexes correspond to the predicted values produced by applying the model. For any pair of actual/predicted indexes, the entry indicates the number of records classified in that pairing.
When the predicted value equals the actual value, the prediction is correct; all other entries indicate errors (misclassifications).
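A confusion matrix can be tallied directly from paired actual and predicted values. The target values below are hypothetical:

```python
# Hypothetical actual vs. predicted target values from a test task.
actual    = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "no", "yes"]

labels = ["yes", "no"]
# Rows index actual values; columns index predicted values.
matrix = [[sum(1 for a, p in zip(actual, predicted) if a == r and p == c)
           for c in labels] for r in labels]
```

Here matrix is [[2, 1], [0, 2]]: the diagonal entries (2 and 2) are correct predictions, and the off-diagonal entry 1 counts the single misclassified record.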
A two-dimensional, n by n table that defines the cost associated with a prediction versus the actual value. A cost matrix is typically used in classification models, where n is the number of distinct values in the target, and the columns and rows are labeled with target values. The rows are the actual values; the columns are the predicted values.
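One common use of a cost matrix is to weight the entries of a confusion matrix, so that expensive errors count more than cheap ones. The labels, costs, and counts below are hypothetical:

```python
# Hypothetical 2x2 cost matrix: rows = actual values, columns = predicted values.
labels = ["churn", "no_churn"]
cost = [[0, 10],   # failing to predict an actual churner costs 10
        [1,  0]]   # falsely predicting churn costs 1

# Hypothetical confusion matrix with the same row/column labeling.
confusion = [[30, 5],
             [8, 57]]

# Total misclassification cost: each error count weighted by its cost.
total_cost = sum(cost[i][j] * confusion[i][j]
                 for i in range(2) for j in range(2))
```

With these numbers the total cost is 5*10 + 8*1 = 58; correct predictions on the diagonal contribute zero.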
A technique for evaluating the accuracy of a classification or regression model. This technique is used when there are insufficient cases for separate model building and testing. The data table is divided into several parts, with each part in turn being used to evaluate a model built using the remaining parts. Cross-validation occurs automatically for Naive Bayes and Adaptive Bayes Network models.
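The partitioning scheme can be sketched with a hypothetical 3-fold split over a small set of case identifiers:

```python
# Hypothetical 3-fold cross-validation over nine case ids.
cases = list(range(9))
k = 3
folds = [cases[i::k] for i in range(k)]   # three disjoint parts

splits = []
for held_out in range(k):
    test_set = folds[held_out]
    # Each part in turn is held out; the rest are used for building.
    train_set = [c for f in range(k) if f != held_out for c in folds[f]]
    splits.append((train_set, test_set))
```

Each of the three rounds builds on six cases and evaluates on the remaining three, so every case is used for evaluation exactly once.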
The process of discovering hidden, previously unknown, and usable information from a large amount of data. This information is represented in a compact form, often referred to as a model.
The component of the Oracle database that implements the data mining engine and persistent metadata repository.
Discretization groups related values together under a single value (or bin). This reduces the number of distinct values in a column. Fewer bins result in models that build faster. ODM algorithms require that input data be discretized prior to model building, testing, computing lift, and applying (scoring).
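One simple discretization scheme is equal-width binning, sketched below with hypothetical values (ODM supports several binning strategies; this illustrates only the general idea of mapping a numerical column to a small number of bins):

```python
# Hypothetical equal-width binning of a numerical column into 3 bins.
values = [3, 7, 12, 18, 25, 29]
n_bins = 3
lo, hi = min(values), max(values)
width = (hi - lo) / n_bins

def bin_of(v):
    # Map a value to its bin index; clamp the maximum into the last bin.
    return min(int((v - lo) / width), n_bins - 1)

bins = [bin_of(v) for v in values]
```

Here the six distinct values collapse into bin indexes [0, 0, 1, 1, 2, 2], reducing the number of distinct values the algorithm must handle from six to three.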
distance-based (clustering algorithm)
Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. Data points are assigned to the nearest cluster according to the distance metric used.
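The nearest-cluster assignment can be sketched with Euclidean distance as the metric (the centroids and point below are hypothetical):

```python
import math

# Hypothetical cluster centroids and a new data point.
centroids = {"c1": (0.0, 0.0), "c2": (5.0, 5.0)}
point = (1.0, 2.0)

def euclidean(p, q):
    # Euclidean distance as the similarity (distance) metric.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Assign the point to the nearest cluster under the metric.
nearest = min(centroids, key=lambda c: euclidean(point, centroids[c]))
```

The point (1, 2) is distance sqrt(5) from c1 and 5 from c2, so it is assigned to c1.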
See network feature.
A measure of how much better prediction results are using a model than could be obtained by chance. For example, suppose that 2% of the customers mailed a catalog without using the model would make a purchase. However, using the model to select catalog recipients, 10% would make a purchase. Then the lift is 10/2 or 5. Lift may also be used as a measure to compare different data mining models. Since lift is computed using a data table with actual outcomes, lift compares how well a model performs with respect to this data on predicted outcomes. Lift indicates how well the model improved the predictions over a random selection given actual results. Lift allows a user to infer how a model will perform on new data.
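The arithmetic of the catalog example reduces to a ratio of rates:

```python
# Rates from the catalog example above.
base_rate = 0.02    # purchase rate among randomly mailed customers
model_rate = 0.10   # purchase rate among customers the model selects

lift = model_rate / base_rate
```

Here the lift is 10/2 = 5: the model's selections buy at five times the chance rate.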
Specifies the location of data for a mining operation.
A description of a domain of data used as input to mining operations. Logical attributes may be categorical, ordinal, or numerical.
A set of mining attributes used as input to building a mining model.
See minimum description length principle.
Given a sample of data and an effective enumeration of the appropriate alternative theories to explain the data, the best theory is the one that minimizes the sum of the length of the theory and the length of the data when encoded using the theory as a predictor for the data.
See apply output.
ODM supports the following mining functions: classification, association rules, attribute importance, and clustering.
An object that specifies the type of model to build, the function of the model, and the algorithm to use. ODM supports the following mining functions: classification, association rules, attribute importance, and clustering.
The result of building a model from mining function settings. The representation of the model is specific to the algorithm specified by the user or selected by the DMS. A model can be used for direct inspection, e.g., to examine the rules produced from a decision tree or association rules, or to score data.
The end product(s) of a mining operation. For example, a build task produces a mining model; a test task produces a test result.
A data value that is missing because it was not measured, not answered, unknown, or lost (that is, it has a null value). Data mining systems vary in how they treat missing values. Typically, they ignore missing values, omit any records containing missing values, replace missing values with the mode or mean, or infer missing values from existing values. ODM ignores missing values during mining operations.
mixture model
A mixture model is a type of density model that includes several component functions (usually Gaussian) that are combined to provide a multimodal density.
An important function of data mining is the production of a model. A model can be descriptive or predictive. A descriptive model helps in understanding underlying processes or behavior. For example, an association model describes consumer behavior. A predictive model is an equation or set of rules that makes it possible to predict an unseen or unmeasured value (the dependent variable or output) from other, known values (independent variables or input). The form of the equation or rules is suggested by mining data collected from the process under study. Some training or estimation technique is used to estimate the parameters of the equation or rules. See also mining model.
See transactional format.
A network feature is a tree-like multi-attribute structure. From the standpoint of the network, features are conditionally independent components. Features contain at least one attribute (the root attribute). Conditional probabilities are computed for each value of the root predictor. A two-attribute feature will have, in addition to the root predictor conditional probabilities, conditional probabilities computed for each combination of values of the root and the depth 2 predictor. That is, if a root predictor, x, has i values and the depth 2 predictor, y, has j values, a conditional probability is computed for each combination of values {x=a, y=b such that a is in the set {1,..,i} and b is in the set {1,..,j}}. Similarly, a depth 3 predictor, z, would have additional conditional probabilities computed for each combination of values {x=a, y=b, z=c such that a is in the set {1,..,i} and b is in the set {1,..,j} and c is in the set {1,..,k}}. Network features are used in the Adaptive Bayes Network algorithm.
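The combinatorics of the value combinations above can be sketched by enumeration. The predictor cardinalities are hypothetical, and this shows only how many combinations receive a conditional probability, not how the probabilities themselves are estimated:

```python
from itertools import product

# Hypothetical depth-3 feature: root predictor x with i values,
# depth 2 predictor y with j values, depth 3 predictor z with k values.
i, j, k = 2, 3, 2
x_vals, y_vals, z_vals = range(i), range(j), range(k)

# One conditional probability is computed per combination {x=a, y=b, z=c}.
combos = list(product(x_vals, y_vals, z_vals))
```

With these cardinalities there are i*j*k = 12 combinations, so the feature grows multiplicatively with each added depth level.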
Each case in the data is stored as one record (row) in a table. Also known as single-record case. See also transactional format.
An attribute whose values are numbers. The numeric value can be either an integer or a real number. Numerical attribute values can be manipulated as continuous values. See also categorical attribute.
A data value that does not (or is not thought to have) come from the typical population of data; in other words, a data value that falls outside the boundaries that enclose most other data values in the data.
Identifies data to be used as input to data mining. Through the use of attribute assignment, attributes of the physical data are mapped to logical attributes of a model's logical data. The data referenced by a physical data object can be used in model building, model application (scoring), lift computation, statistical analysis, etc.
An object that specifies the characteristics of the physical data used in a mining operation. The physical data specification includes information about the format of the data (transactional or nontransactional) and the roles that the data columns play.
In binary classification problems, you may designate one of the two classes (target values) as positive, the other as negative. When ODM computes a model's lift, it calculates the density of positive target values among a set of test instances for which the model predicts positive values with a given degree of confidence.
A logical attribute used as input to a supervised model or algorithm to build a model.
The set of prior probabilities specifies the distribution of examples of the various classes in data. Also referred to as priors, these could be different from the distribution observed in the data.
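The contrast between the observed distribution and user-supplied priors can be sketched as follows; the target values and counts are hypothetical:

```python
from collections import Counter

# Hypothetical build-data targets: "yes" cases were oversampled to 20%.
targets = ["no"] * 80 + ["yes"] * 20

# Distribution observed in the data.
observed = {t: c / len(targets) for t, c in Counter(targets).items()}

# Priors supplied by the user, reflecting the true population distribution,
# which may differ from what is observed in the (oversampled) build data.
priors = {"no": 0.95, "yes": 0.05}
```

The observed distribution here is {"no": 0.8, "yes": 0.2}, while the stated priors say the positive class is actually much rarer in the population.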
See prior probabilities.
An expression of the general form if X, then Y. An output of certain models, such as association rules models or decision tree models. The predicate X may be a compound predicate.
Scoring data means applying a data mining model to new data to generate predictions. See apply output.
See algorithm settings and mining function settings.
The process of building data mining models using a known dependent variable, also referred to as the target. Classification techniques are supervised. See unsupervised mining (learning).
In supervised learning, the identified logical attribute that is to be predicted. Sometimes called target value or target attribute.
A container within which to specify arguments to data mining operations to be performed by the data mining system.
Each case in the data is stored as multiple records in a table with columns sequenceID, attribute_name, and value. Also known as multi-record case. See also nontransactional format.
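The relationship between the two formats can be sketched with one hypothetical case, pivoting its transactional rows back into a single record:

```python
# Hypothetical single-record (nontransactional) case.
single_record = {"case_id": 1, "age": 30, "state": "CA"}

# The same case in transactional (multi-record) format.
transactional = [
    {"sequenceID": 1, "attribute_name": "age",   "value": 30},
    {"sequenceID": 1, "attribute_name": "state", "value": "CA"},
]

# Pivot the transactional rows back into a single record.
rebuilt = {"case_id": transactional[0]["sequenceID"]}
for row in transactional:
    rebuilt[row["attribute_name"]] = row["value"]
```

Each attribute of the case occupies its own row in transactional format, and the pivot recovers the original single-record case.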
A function applied to data resulting in a new form or representation of the data. For example, discretization and normalization are transformations on data.
The process of building data mining models without the guidance (supervision) of a known, correct result. Clustering and association rules are unsupervised mining functions. See supervised mining (learning).