Oracle® Text Reference 11g Release 1 (11.1) Part Number B28304-01
This appendix summarizes the main multilingual features of Oracle Text.
For a complete list of Oracle Globalization Support languages and character set support, refer to the Oracle Database Globalization Support Guide.
The following sections describe the multilingual indexing features and how each Oracle Text index type supports them.
See Also:
Lexer Types for a description of available lexers
The CONTEXT index type fully supports multilingual features, including use of the language and character set columns. The following lexers are supported:
AUTO_LEXER
MULTI_LEXER
USER_LEXER
WORLD_LEXER
CONTEXT also supports use of all Chinese, Japanese, and Korean language lexers as follows:
CHINESE_LEXER
CHINESE_VGRAM_LEXER
JAPANESE_LEXER
JAPANESE_VGRAM_LEXER
KOREAN_MORPH_LEXER
CTXCAT supports the multilingual features of the BASIC_LEXER with the exception of indexing themes, and supports the following additional lexers:
AUTO_LEXER
USER_LEXER
WORLD_LEXER
CTXCAT also supports the following lexers:
CHINESE_LEXER
CHINESE_VGRAM_LEXER
JAPANESE_LEXER
JAPANESE_VGRAM_LEXER
KOREAN_MORPH_LEXER
Oracle Text supports the indexing of different languages by enabling you to choose a lexer in the indexing process. The lexer you employ determines the languages you can index. Table D-1 describes the supported lexers.
Table D-1 Oracle Text Lexer Types
Lexer | Supported Languages |
---|---|
AUTO_LEXER | Lexer for indexing columns that contain documents of different languages. |
BASIC_LEXER | English and most western European languages that use whitespace-delimited words. |
MULTI_LEXER | Lexer for indexing tables containing documents of different languages such as English, German, and Japanese. |
CHINESE_VGRAM_LEXER | Lexer for extracting tokens from Chinese text. |
CHINESE_LEXER | Lexer for extracting tokens from Chinese text. This lexer offers benefits over the CHINESE_VGRAM_LEXER. |
JAPANESE_VGRAM_LEXER | Lexer for extracting tokens from Japanese text. |
JAPANESE_LEXER | Lexer for extracting tokens from Japanese text. This lexer offers advantages over the JAPANESE_VGRAM_LEXER. |
KOREAN_MORPH_LEXER | Lexer for extracting tokens from Korean text. |
USER_LEXER | Lexer you create to index a particular language. |
WORLD_LEXER | Lexer for indexing tables containing documents of different languages; autodetects languages in a document. |
AUTO_LEXER automatically detects document language, and performs language identification, word segmentation, and stemming. The AUTO_LEXER also enables customization of these components.
See Also:
"AUTO_LEXER"
The following features are supported with the BASIC_LEXER preference. Enable these features with attributes of the BASIC_LEXER. Features such as alternate spelling, composite, and base letter can be enabled together for better search results.
Theme indexing enables the indexing and subsequent querying of document concepts with the ABOUT operator with CONTEXT index types. These concepts are derived from the Oracle Text knowledge base. This feature is supported for English and French.
This feature is not supported with CTXCAT index types.
The alternate spelling feature enables you to search on alternate spellings of words. For example, with alternate spelling enabled in German, a query on gross returns documents that contain groß and gross.
This feature is supported in German, Danish, and Swedish.
Additionally, German can be indexed according to both traditional and reformed spelling conventions.
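The effect of alternate spelling can be pictured as folding every token, at both index and query time, to one canonical spelling. The following is a minimal Python sketch of that idea, not Oracle Text's implementation. The folding table `GERMAN_ALTERNATES` and the helper names are assumptions made for illustration only.

```python
# Hypothetical folding table for illustration; not Oracle Text's internal mapping.
GERMAN_ALTERNATES = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def fold_alternate_spelling(token):
    """Rewrite a token to one canonical spelling so that variants compare equal."""
    lowered = token.lower()
    return "".join(GERMAN_ALTERNATES.get(ch, ch) for ch in lowered)


def matches(query, documents):
    """Return the documents that contain a word whose folded form equals the folded query."""
    target = fold_alternate_spelling(query)
    return [
        doc
        for doc in documents
        if any(fold_alternate_spelling(word) == target for word in doc.split())
    ]


docs = ["Das Haus ist groß", "Das Haus ist gross", "Das Haus ist klein"]
print(matches("gross", docs))
```

Because the query and the indexed words pass through the same folding step, a query on gross and a query on groß return the same documents in this sketch.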
Base-letter conversion enables you to query words with or without diacritical marks such as tildes, accents, and umlauts. For example, with a Spanish base-letter index, a query of energia matches documents containing both energía and energia.
This feature is supported for English and all other supported whitespace-delimited languages. In English and French, you can use the basic lexer to enable theme indexing.
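As a rough illustration of base-letter matching, the Python sketch below strips diacritical marks with Unicode decomposition before building a tiny in-memory index. This shows the general technique only; how Oracle Text performs the conversion internally is not described here, and the function name `to_base_letters` and the sample documents are hypothetical.

```python
import unicodedata


def to_base_letters(text):
    """Strip combining marks (accents, tildes, umlauts) from each character."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


# Build a toy inverted index keyed by base-letter form (illustrative only).
index = {}
for doc_id, text in enumerate(["La energía solar", "La energia nuclear"]):
    for word in text.split():
        index.setdefault(to_base_letters(word.lower()), set()).add(doc_id)

# The query term goes through the same conversion as the indexed words.
print(sorted(index[to_base_letters("energia")]))
```

Both sample documents are found because energía and energia share the key energia in the index.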
See Also:
"Base-Letter Conversion"
The composite feature enables you to search on words that contain the specified term as a sub-composite. You must use the stem ($) operator. This feature is supported for German and Dutch.
For example, in German, a query of $register finds documents that contain Bruttoregistertonne and Registertonne.
The index stems feature enables you to specify a stemmer for stem indexing. Tokens are stemmed to a single base form at index time, and the base forms are indexed in addition to the normal forms. Specifying index stems improves query performance for stem queries, for example $computed.
This feature is supported for English, Dutch, French, German, Italian, and Spanish.
The MULTI_LEXER lexer enables you to index a column that contains documents of different languages. During indexing, Oracle Text examines the language column and switches in the language-specific lexer to process the document. Define the lexer preferences for each language before indexing.
The multi lexer enables you to set different preferences for languages. For example, you can have composite set to TRUE for German documents and composite set to FALSE for Dutch documents.
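The per-language switching described above can be sketched as a lookup from the value in the language column to a set of lexer preferences. The Python below is a conceptual model only, not the Oracle Text API. The class `LexerPreference`, the preference names, the column name `lang`, and the fallback `DEFAULT_LEXER` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexerPreference:
    """Hypothetical stand-in for a language-specific lexer preference."""
    name: str
    composite: bool


# One preference per language, defined before any document is indexed.
SUB_LEXERS = {
    "german": LexerPreference("german_lexer", composite=True),
    "dutch": LexerPreference("dutch_lexer", composite=False),
}
# Assumed fallback for rows whose language has no dedicated preference.
DEFAULT_LEXER = LexerPreference("default_lexer", composite=False)


def lexer_for_row(row):
    """Pick the lexer preference from the row's language column."""
    language = str(row.get("lang", "")).strip().lower()
    return SUB_LEXERS.get(language, DEFAULT_LEXER)


rows = [
    {"id": 1, "lang": "German", "text": "Bruttoregistertonne"},
    {"id": 2, "lang": "Dutch", "text": "registerton"},
    {"id": 3, "lang": "English", "text": "register ton"},
]
for row in rows:
    pref = lexer_for_row(row)
    print(row["id"], pref.name, pref.composite)
```

Keeping the preferences in one table keyed by language mirrors the requirement to define the lexer preferences for each language before indexing.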
Like MULTI_LEXER, the WORLD_LEXER lexer enables you to index documents that contain different languages. It automatically detects the languages of a document and, therefore, does not require you to create a language column in the base table.
WORLD_LEXER processes all database character sets and supports the Unicode 5.0 standard. For WORLD_LEXER to be effective with documents that use multiple languages, the AL32UTF8 or UTF8 Oracle character set encoding must be specified. This includes supplementary, or "surrogate-pair," characters.
Table D-2 and Table D-3 show the languages supported by WORLD_LEXER. This list may change as the Unicode standard changes, and in any case should not be considered exhaustive. (Languages are grouped by Unicode writing system, not by natural language groupings.)
Table D-2 Languages Supported by the World Lexer (Space-separated)
Language Group | Languages Include |
---|---|
Arabic | Arabic, Farsi, Kurdish, Pashto, Sindhi, Urdu |
Armenian | Armenian |
Bengali | Assamese, Bengali |
Bopomofo | Hakka Chinese, Minnan Chinese |
Cyrillic | Over 50 languages, including Belorussian, Bulgarian, Macedonian, Moldavian, Russian, Serbian, Serbo-Croatian, Ukrainian |
Devanagari | Bhojpuri, Bihari, Hindi, Kashmiri, Marathi, Nepali, Pali, Sanskrit |
Ethiopic | Amharic, Ge'ez, Tigrinya, Tigre |
Georgian | Georgian |
Greek | Greek |
Gujarati | Gujarati, Kacchi |
Gurmukhi | Punjabi |
Hebrew | Hebrew, Ladino, Yiddish |
Kaganga | Redjang |
Kannada | Kanarese, Kannada |
Korean | Korean, Hanja Hangul |
Latin | Afrikaans, Albanian, Basque, Breton, Catalan, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Faeroese, Fijian, Finnish, Flemish, French, Frisian, German, Hawaiian, Hungarian, Icelandic, Indonesian, Irish, Italian, Lappish, Classic Latin, Latvian, Lithuanian, Malay, Maltese, Pinyin Mandarin, Maori, Norwegian, Polish, Portuguese, Provencal, Romanian, Rumanian, Samoan, Scottish Gaelic, Slovak, Slovene, Slovenian, Sorbian, Spanish, Swahili, Swedish, Tagalog, Turkish, Vietnamese, Welsh |
Malayalam | Malayalam |
Mongolian | Mongolian |
Oriya | Oriya |
Sinhalese, Sinhala | Pali, Sinhalese |
Syriac | Aramaic, Syriac |
Tamil | Tamil |
Telugu | Telugu |
Thaana | Dhivehi, Divehi, Maldivian |
Table D-3 Languages Supported by the World Lexer (Non-space-separated)
Language Group | Languages Include |
---|---|
Chinese | Cantonese, Mandarin, Pinyin phonograms |
Japanese | Japanese (Hiragana, Kanji, Katakana) |
Khmer | Cambodian, Khmer |
Lao | Lao |
Myanmar | Burmese |
Thai | Thai |
Tibetan | Dzongkha, Tibetan |
Table D-4 shows languages not supported by the World Lexer.
Table D-4 Languages Not Supported by the World Lexer
Language Group | Languages Include |
---|---|
Buhid | Buhid |
Canadian Syllabics | Blackfoot, Carrier, Cree, Dakhelh, Inuit, Inuktitut, Naskapi, Nunavik, Nunavut, Ojibwe, Sayisi, Slavey |
Cherokee | Cherokee |
Cypriot | Cypriot |
Limbu | Limbu |
Ogham | Ogham |
Runic | Runic |
Tai Le (Tai Lu, Lue, Dai Le) | Tai Le |
Ugaritic | Ugaritic |
Yi | Yi |
Yi Jang Hexagram | Yi Jang |
Oracle Text supports the use of different query operators. Some operators can be set to behave in accordance with your language. This section summarizes the multilingual query features for these operators.
Use the ABOUT operator to query on concepts. The system looks up concept information in the theme component of the index.
This feature is supported for English and French with CONTEXT indexes only.
The fuzzy operator enables you to search for words that have a similar spelling to the specified word. Oracle Text supports fuzzy for English, French, German, Italian, Dutch, Spanish, Portuguese, Japanese, optical character recognition (OCR) text, and automatic language detection.
The stem ($) operator enables you to search for words that have the same root as the specified term. For example, a stem of $sing expands into a query on the words sang, sung, and sing. The Oracle Text stemmer supports the following languages: English, French, Spanish, Italian, German, Japanese, and Dutch.
A stoplist is a list of words that do not get indexed. These are usually common words in a language such as this, that, and can in English.
Oracle Text provides a default stoplist for English, Chinese (traditional and simplified), Danish, Dutch, Finnish, French, German, Italian, Portuguese, Spanish, and Swedish. Appendix E, "Oracle Text Supplied Stoplists", lists the stoplists for various languages.
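To make the effect of a stoplist concrete, the following Python sketch drops stopwords before tokens would reach an index. It is a simplified model under stated assumptions: `ENGLISH_STOPLIST` holds only the three example words mentioned above, not the supplied English stoplist, and `index_tokens` is a hypothetical helper.

```python
# Only the example words from the text above; the supplied stoplists are longer.
ENGLISH_STOPLIST = {"this", "that", "can"}


def index_tokens(text, stoplist):
    """Return the lowercase tokens that would be indexed, skipping stopwords."""
    return [word for word in text.lower().split() if word not in stoplist]


print(index_tokens("This engine can index that document", ENGLISH_STOPLIST))
# prints ['engine', 'index', 'document']
```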
An Oracle Text knowledge base is a hierarchical tree of concepts used for theme indexing, ABOUT queries, and deriving themes for document services.
Oracle Text supplies knowledge bases in English and French only.
The following table summarizes the multilingual features for the supported languages. Note that the Auto Detect column lists languages that can be automatically detected by AUTO_LEXER.
Table D-5 Multilingual Features for Supported Languages
LANGUAGE | AUTO DETECT | ALTERNATE SPELLING | FUZZY MATCHING | LANGUAGE SPECIFIC LEXER | DEFAULT STOP LIST | STEMMING |
---|---|---|---|---|---|---|
ENGLISH | Yes | N/A | Yes | Yes | Yes | Yes |
GERMAN | Yes | Yes | Yes | Yes | Yes | Yes |
JAPANESE | Yes | N/A | Yes | Yes | No | Yes |
FRENCH | Yes | N/A | Yes | Yes | Yes | Yes |
SPANISH | Yes | N/A | Yes | Yes | Yes | Yes |
ITALIAN | Yes | N/A | Yes | Yes | Yes | Yes |
DUTCH | Yes | N/A | Yes | Yes | Yes | Yes |
PORTUGUESE | Yes | N/A | Yes | Yes | Yes | Yes |
KOREAN | Yes | N/A | No | Yes | No | Yes |
SIMPLIFIED CHINESE | Yes | N/A | No | Yes | Yes | Yes |
TRADITIONAL CHINESE | Yes | N/A | No | Yes | Yes | Yes |
DANISH | Yes | Yes | No | Yes | No | Yes |
SWEDISH | Yes | Yes | No | Yes | Yes | Yes |
FINNISH | Yes | N/A | No | Yes | No | Yes |
ARABIC | Yes | N/A | No | Yes | No | Yes |
GREEK | Yes | N/A | No | Yes | No | Yes |
BOKMAL | Yes | N/A | No | Yes | No | Yes |
POLISH | Yes | N/A | No | Yes | No | Yes |
RUSSIAN | Yes | N/A | No | Yes | No | Yes |
SLOVENIAN | Yes | N/A | No | Yes | No | Yes |
THAI | Yes | N/A | No | Yes | No | Yes |
CATALAN | Yes | N/A | No | Yes | No | Yes |
CROATIAN | Yes | N/A | No | Yes | No | Yes |
HEBREW | Yes | N/A | No | Yes | No | Yes |
NYNORSK | Yes | N/A | No | Yes | No | Yes |
SERBIAN | Yes | N/A | No | Yes | No | Yes |
TURKISH | Yes | N/A | No | Yes | No | Yes |
CZECH | Yes | N/A | No | Yes | No | Yes |
HUNGARIAN | Yes | N/A | No | Yes | No | Yes |
PERSIAN | Yes | N/A | No | Yes | No | Yes |
SLOVAK | Yes | N/A | No | Yes | No | Yes |