Multilingual Regular Expression Syntax

Table C-1 lists the full set of operators defined in the POSIX standard Extended Regular Expression (ERE) syntax. Oracle follows the exact syntax and matching semantics for these operators as defined in the POSIX standard for matching ASCII (English language) data. Notes following the table provide more complete descriptions of the operators and their functions, as well as Oracle multilingual enhancements of the operators. Table C-2 summarizes Oracle support for and multilingual enhancement of the POSIX operators.

Table C-1 Regular Expression Operators and Metasymbols

Operator	Description
\ ^a	The backslash character can have four different meanings depending on the context. It can stand for itself, quote the next character, introduce an operator, or do nothing.
*	Matches zero or more occurrences
+	Matches one or more occurrences
?	Matches zero or one occurrence
\|	Alternation operator for specifying alternative matches
^ ^b	Matches the beginning-of-line character
$ ^b	Matches the end-of-line character
. ^c	Matches any character in the supported character set except `NULL`
[ ] ^d	Bracket expression for specifying a matching list that should match any one of the expressions represented in the list. A nonmatching list expression begins with a circumflex (^) and specifies a list that matches any character except for the expressions represented in the list.
( )	Grouping expression, treated as a single subexpression
{m}	Matches exactly m times
{m,}	Matches at least m times
{m,n}	Matches at least m times but no more than n times
\n ^e	The backreference expression (n is a digit between 1 and 9) matches the n^th subexpression enclosed between '(' and ')' preceding the \n
[..] ^f	Specifies one collation element, and can be a multicharacter element (for example, [.ch.] in Spanish)
[: :] ^g	Specifies character classes (for example, [:alpha:]). It matches any character within the character class.
[==] ^h	Specifies equivalence classes. For example, [=a=] matches all characters having base letter 'a'.

Notes on the POSIX operators and Oracle enhancements:

^a '\': The backslash operator can be used to make the character following it normal if it is an operator. For example, '\*' is interpreted as the asterisk string literal.
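The quoting behavior can be sketched outside the database. The example below uses Python's `re` module purely as a stand-in (an assumption made for illustration; it is neither a POSIX ERE engine nor Oracle's implementation), but it treats an escaped asterisk the same way.

```python
import re

# Python's re module is only a stand-in here, not Oracle's regex engine.

# Unescaped, '*' is the "zero or more occurrences" operator.
quantified = re.fullmatch(r"a*", "aaa")

# Escaped, '\*' is the literal asterisk character.
literal = re.search(r"2\*3", "area = 2*3")

# The escaped pattern no longer matches text that lacks a literal '*'.
no_match = re.search(r"2\*3", "2223")

print(quantified.group(0))  # aaa
print(literal.group(0))     # 2*3
print(no_match)             # None
```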

^b '^' and '$': The characters '^' and '$' are the POSIX anchoring operators. By default, they match only the beginning or end of an entire string. Oracle lets you specify '^' and '$' to match the start or end of any line anywhere within the source string. This in turn lets you treat the source string as multiple lines.
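The difference between the two anchoring modes can be illustrated with a short sketch. It uses Python's `re` module and its `re.MULTILINE` flag as an analogy only; Oracle selects the comparable behavior through the match parameter of its regular expression functions, which this section does not cover.

```python
import re

# Analogy only: re.MULTILINE plays the role of Oracle's multiline matching.
source = "first line\nsecond line\nthird line"

# Default mode: '^' anchors only at the start of the entire string.
default_hits = re.findall(r"^\w+", source)

# Multiline mode: '^' also anchors at the start of every line.
multiline_hits = re.findall(r"^\w+", source, flags=re.MULTILINE)

print(default_hits)    # ['first']
print(multiline_hits)  # ['first', 'second', 'third']
```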

^c '.': In the POSIX standard, the "match any character" operator ('.') is defined to match any English character except NULL and the newline character. In the Oracle implementation, the '.' operator can match any character in the database character set, including the newline character.
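As a rough illustration of the two behaviors described above, the sketch below uses Python's `re` module, where '.' excludes the newline character by default and the `re.DOTALL` flag lifts that restriction. This is an analogy under stated assumptions, not a description of how Oracle is configured.

```python
import re

# Analogy only: Python's default '.' excludes newline; re.DOTALL includes it.
source = "a\nb"

# Default: '.' cannot consume the newline between 'a' and 'b'.
default_match = re.search(r"a.b", source)

# With DOTALL: '.' matches any character, including the newline.
dotall_match = re.search(r"a.b", source, flags=re.DOTALL)

print(default_match)                 # None
print(repr(dotall_match.group(0)))   # 'a\nb'
```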

^d '[]': In the POSIX standard, a range in a regular expression includes all collation elements between the start and end points of the range in the linguistic definition of the current locale. Therefore, ranges in regular expressions are linguistic ranges rather than byte-value ranges, and the semantics of the range expression are independent of the character set. Oracle implements this independence by interpreting range expressions according to the linguistic definition determined by the NLS_SORT initialization parameter.
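To see why a byte-value (or code-point) range is not a linguistic range, consider the following sketch. It uses Python's `re` module, which compares code points and has no notion of NLS_SORT, so it shows the behavior that linguistic ranges are meant to avoid rather than Oracle's own behavior.

```python
import re

# Python compares code points; it does not apply a linguistic sort.
# Code point order places 'ä' (U+00E4) after 'z' (U+007A).
by_code_point = sorted(["z", "ä", "a"])

# A code-point range [a-z] therefore excludes 'ä', although a linguistic
# sort such as German would typically place 'ä' close to 'a'.
in_range = re.fullmatch(r"[a-z]", "ä")

print(by_code_point)  # ['a', 'z', 'ä']
print(in_range)       # None
```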

^e '\n': The backreference expression '\n' matches the same string of characters as was matched by the n^th subexpression. The character n must be a digit from 1 to 9, designating the n^th subexpression, numbered from left to right. The expression is invalid if the regular expression contains fewer than n subexpressions preceding the \n. For example, the regular expression ^(.*)\1$ matches a line consisting of two adjacent appearances of the same string. Oracle supports the backreference expression in the regular expression pattern and in the replacement string of the REGEXP_REPLACE function.
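The ^(.*)\1$ example, and the use of a backreference in a replacement string, can be tried in a minimal sketch. Python's `re` module stands in for the database here (an assumption for illustration); `re.sub` is only loosely comparable to REGEXP_REPLACE.

```python
import re

# Stand-in engine: Python's re, not Oracle's REGEXP functions.
doubled = re.compile(r"^(.*)\1$")

# "abcabc" is "abc" followed by the same string again, so it matches.
hit = doubled.match("abcabc")

# "abcabd" is not two adjacent copies of one string, so it does not match.
miss = doubled.match("abcabd")

# In a replacement string, \2 and \1 refer to the captured subexpressions.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(hit.group(1))  # abc
print(miss)          # None
print(swapped)       # world hello
```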

^f '[..]': A collating element is a unit of collation and is equal to one character in most cases, but may comprise two or more characters in some languages. Historically, regular expression syntax did not support ranges containing multicharacter collation elements, such as the range 'a' through 'ch'. The POSIX standard introduces the collation element delimiter '[..]', which lets you delimit multicharacter collation elements such as this one, as follows: '[a-[.ch.]]'. The collation elements supported by Oracle are determined by the setting of the NLS_SORT initialization parameter. The collation element is valid only inside the bracketed expression.

^g '[::]': In English regular expressions, range expressions often indicate a character class. For example, '[a-z]' indicates any lowercase character. This convention is not useful in multilingual environments where the first and last character of a given character class may not be the same in all languages. The POSIX standard introduces a portable character class syntax, '[::]'. Oracle supports the following character classes, based on character class definitions in NLS classification data:

Character Class Syntax	Meaning
`[:alnum:]`	All alphanumeric characters
`[:alpha:]`	All alphabetic characters
`[:blank:]`	All blank space characters
`[:cntrl:]`	All control characters (nonprinting)
`[:digit:]`	All numeric digits
`[:graph:]`	All `[:punct:]`, `[:upper:]`, `[:lower:]`, and `[:digit:]` characters
`[:lower:]`	All lowercase alphabetic characters
`[:print:]`	All printable characters
`[:punct:]`	All punctuation characters
`[:space:]`	All space characters (nonprinting)
`[:upper:]`	All uppercase alphabetic characters
`[:xdigit:]`	All valid hexadecimal characters

This character class syntax lets you make better use of NLS character definitions to write flexible regular expressions. These character classes are valid only inside the bracketed expression.
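The limitation of English-style ranges can be seen in a small sketch. Python's `re` module does not support the POSIX '[::]' syntax, so `str.isalpha` is used below as a loose stand-in for `[:alpha:]`; this substitution is an assumption for illustration and does not reflect Oracle's NLS classification data.

```python
import re

word = "señor"

# An English-style range misses 'ñ' (U+00F1), which lies outside a-z.
ascii_range = re.fullmatch(r"[a-z]+", word)

# Stand-in for [:alpha:]: str.isalpha uses Unicode character properties.
class_like = all(ch.isalpha() for ch in word)

print(ascii_range)  # None
print(class_like)   # True
```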

^h '[==]': Oracle supports the equivalence classes through the POSIX '[==]' syntax. A base letter and all of its accented versions constitute an equivalence class. For example, the equivalence class '[=a=]' matches ä and â. The equivalence classes are valid only inside the bracketed expression.

Restriction on equivalence classes: Composed and decomposed versions of the same equivalence class do not match. For example, 'ä' (a single composed character) does not match 'a' followed by a combining umlaut.
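The composed and decomposed forms mentioned in this restriction are distinct character sequences, which the sketch below makes visible using Python's standard `unicodedata` module. The `base_letter` helper is hypothetical and only approximates the idea of an equivalence class; Oracle derives its classes from its own NLS data.

```python
import unicodedata

composed = "\u00e4"     # 'ä' as a single precomposed code point
decomposed = "a\u0308"  # 'a' followed by U+0308 COMBINING DIAERESIS

# The two forms render alike but are different strings of different length.
same_string = composed == decomposed
lengths = (len(composed), len(decomposed))

def base_letter(ch: str) -> str:
    """Hypothetical helper: first code point of the NFD decomposition."""
    return unicodedata.normalize("NFD", ch)[0]

# Rough approximation of [=a=]: keep characters whose base letter is 'a'.
same_class = [ch for ch in "aäâbé" if base_letter(ch) == "a"]

print(same_string)  # False
print(lengths)      # (1, 2)
print(same_class)   # ['a', 'ä', 'â']
```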