Oracle Text Reference Release 9.2 Part Number A96518-01
This chapter provides reference information for using the CTX_CLS PL/SQL package to generate CTXRULE rules for a set of documents.
Name | Description |
---|---|
TRAIN | Generates rules that define document categories. Output is based on the input training document set. |
Use this procedure to generate query rules that select document categories. You must supply a training set consisting of categorized documents. Each document must belong to one or more categories. This procedure generates the queries that define the categories and then writes the results to a table.
This procedure requires that your document table have an associated populated context index. For best results, the index should be synchronized before running this procedure.
You must also have a document table and a category table. The documents can be in any format supported by Oracle Text.
For example, your document and category tables can be defined as:
create table trainingdoc(
  docid number primary key,
  text varchar2(4000));

create table category (
  docid CONSTRAINT fk_id REFERENCES trainingdoc(docid),
  categoryid number);
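The index requirement described above can be sketched as follows. This is a minimal sketch that assumes the trainingdoc table from the example definition; the index name trainingdocx is hypothetical and is not part of the original example.

```sql
-- Hypothetical index name (trainingdocx) on the example trainingdoc table.
create index trainingdocx on trainingdoc(text)
  indextype is ctxsys.context;

-- If training rows are inserted or updated after the index is created,
-- synchronize the index before calling CTX_CLS.TRAIN.
exec ctx_ddl.sync_index('trainingdocx');
```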
CTX_CLS.TRAIN(
  index_name      in varchar2,
  doc_id          in varchar2,
  cattab          in varchar2,
  catdocid        in varchar2,
  catid           in varchar2,
  restab          in varchar2,
  rescatid        in varchar2,
  resquery        in varchar2,
  resconfid       in varchar2,
  preference_name in varchar2 DEFAULT NULL);
index_name
Specify the name of the context index associated with your document training set.
doc_id
Specify the name of the document ID column in the document table. This column must contain unique document IDs. This column must be a NUMBER.
cattab
Specify the name of the category table. You must have SELECT privilege on this table.
catdocid
Specify the name of the document ID column in the category table. The document IDs in this table must also exist in the document table. This column must be a NUMBER.
catid
Specify the name of the category ID column in the category table. This column must be a NUMBER.
restab
Specify the name of the result table. You must have INSERT privilege on this table.
rescatid
Specify the name of the category ID column in the result table. This column must be a NUMBER.
resquery
Specify the name of the query column in the result table. This column must be VARCHAR2, CHAR, CLOB, NVARCHAR2, or NCHAR.
The queries generated in this column connect terms with AND or NOT operators, such as:
'T1 & T2 ~ T3'
Terms can also be theme tokens and be connected with the ABOUT operator, such as:
'about(T1) & about(T2) ~ about(T3)'
resconfid
Specify the name of the confidence column in the result table. This column contains the probability, estimated from the training data, that a document is relevant if that document satisfies the query.
preference_name
Specify the name of the preference. For attributes, see "Classifier Types" in Chapter 2, "Indexing".
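To make the parameter order concrete, the following sketch calls the procedure with PL/SQL named notation, using the parameter names from the syntax above. It assumes the trainingdoc and category tables defined earlier in this chapter. The index name trainingdocx and the result table train_result, with its column names, are hypothetical and introduced only for illustration.

```sql
-- Hypothetical result table; the column names are illustrative only.
create table train_result (
  cat_id number,
  query  varchar2(4000),
  conf   number);

-- Assumes a populated, synchronized context index named trainingdocx
-- (hypothetical) already exists on trainingdoc(text).
-- preference_name is omitted, so it takes its declared default of NULL.
begin
  ctx_cls.train(
    index_name => 'trainingdocx',
    doc_id     => 'docid',
    cattab     => 'category',
    catdocid   => 'docid',
    catid      => 'categoryid',
    restab     => 'train_result',
    rescatid   => 'cat_id',
    resquery   => 'query',
    resconfid  => 'conf');
end;
/
```

Named notation is used here only so that each argument can be read against its description; the positional form shown in the example section is equivalent.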
The CTX_CLS.TRAIN procedure requires that your document table have an associated context index. For example, your document table can be defined and populated as follows:
set serverout on
exec dbms_output.put_line(TO_CHAR(SYSDATE,'MM-DD-YYYY HH24:MI:SS')||':start');

create table doc (id number primary key, text varchar2(2000));

insert into doc values(1,'In 2002, Europe changed its currency to the EURO');
insert into doc values(2,'The NASDAQ rose today in heavy stock trading.');
insert into doc values(3,'The EURO lost 1 cent today against the US dollar');
insert into doc values(4,'Salt Lake City hosts the winter Olympic games');
insert into doc values(5,'ESPN broadcasts World Cup Soccer games.');
insert into doc values(6,'Soccer champion Diego Maradona retires.');
Create the CONTEXT index:
exec ctx_ddl.drop_preference('my_lexer');
exec ctx_ddl.create_preference('my_lexer','BASIC_LEXER');
exec ctx_ddl.set_attribute('my_lexer','INDEX_THEMES','NO');
exec ctx_ddl.set_attribute('my_lexer','INDEX_TEXT','YES');

CREATE INDEX docx on doc(text) INDEXTYPE IS ctxsys.context
  PARAMETERS('LEXER my_lexer');
You must also create a category table as follows to relate the documents to categories:
create table category (doc_id number, cat_id number, cat_name varchar2(100));

insert into category values (1,1,'Finance');
insert into category values (2,1,'Finance');
insert into category values (3,1,'Finance');
insert into category values (4,2,'Sports');
insert into category values (5,2,'Sports');
insert into category values (6,2,'Sports');
CTX_CLS.TRAIN writes to a result table, which can be defined as follows:
create table restab (cat_id number, query VARCHAR2(400), conf number);
To populate the result table for later CTXRULE indexing, set your RULE_CLASSIFIER preference attributes and call CTX_CLS.TRAIN as follows:
exec ctx_ddl.drop_preference('my_classifier');
exec ctx_ddl.create_preference('my_classifier','RULE_CLASSIFIER');
exec ctx_ddl.set_attribute('my_classifier','MAX_TERMS','20');
exec ctx_ddl.set_attribute('my_classifier','THRESHOLD','40');
exec ctx_ddl.set_attribute('my_classifier','NT_THRESHOLD','0.02');
exec ctx_ddl.set_attribute('my_classifier','MEMORY_SIZE','200');
exec ctx_ddl.set_attribute('my_classifier','TERM_THRESHOLD','20');

exec ctx_output.start_log('mylog');
exec ctx_cls.train('docx','id','category','doc_id','cat_id','restab','cat_id','query','conf','my_classifier');
exec ctx_output.end_log();

create table catname as (select distinct cat_id, cat_name from category);

set termout on
select rpad(id,6) doc_id, rpad(cat_name,8) cat_name, rpad(text,50) text
  from doc, category
  where id=doc_id;

select rpad(a.cat_id,8) cat_id, rpad(cat_name,8) cat_name, rpad(query,30) rule
  from restab a, catname b
  where b.cat_id=a.cat_id;
The training set is:
DOC_ID CAT_NAME TEXT
------ -------- --------------------------------------------------
1      Finance  In 2002, Europe changed its currency to the EURO
2      Finance  The NASDAQ rose today in heavy stock trading.
3      Finance  The EURO lost 1 cent today against the US dollar
4      Sports   Salt Lake City hosts the winter Olympic games
5      Sports   ESPN broadcasts World Cup Soccer games.
6      Sports   Soccer champion Diego Maradona retires.

6 rows selected.
The generated rules for the categories of FINANCE and SPORTS are as follows:
CAT_ID   CAT_NAME RULE
-------- -------- ------------------------------
1        Finance  EURO
1        Finance  TODAY ~ EURO
2        Sports   GAMES
2        Sports   SOCCER ~ GAMES
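The result table is populated for later CTXRULE indexing, and that next step can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the restab table from the example above, while the index name restabx and the sample sentence passed to MATCHES are hypothetical. No particular output is implied.

```sql
-- Hypothetical index name (restabx) on the generated rules in restab.query.
create index restabx on restab(query)
  indextype is ctxsys.ctxrule;

-- Classify a new document: each returned cat_id belongs to a rule
-- that the supplied text satisfies. The sample text is illustrative only.
select cat_id
  from restab
 where matches(query, 'The EURO rose against the US dollar') > 0;
```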
Copyright © 1998, 2002 Oracle Corporation. All Rights Reserved.