Oracle9i Database Globalization Support Guide Release 2 (9.2) Part Number A96529-01 |
|
This chapter discusses character set conversion and character set migration. It includes the following topics:
Choosing the appropriate character set for your database is an important decision. When you choose the database character set, consider the following factors:
A related topic is choosing a new character set for an existing database. Changing the database character set for an existing database is called character set migration. Migrating from one database character set to another involves additional considerations beyond choosing a character set for a new database. Plan character set migration to minimize data loss from:
When the database is created using byte semantics, the sizes of the CHAR
and VARCHAR2
datatypes are specified in bytes, not characters. For example, the specification CHAR(20)
in a table definition allows 20 bytes for storing character data. This is acceptable when the database character set uses a single-byte character encoding scheme because the number of characters is equivalent to the number of bytes. If the database character set uses a multibyte character encoding scheme, then the number of bytes no longer equals the number of characters because a character can consist of one or more bytes.
During migration to a new character set, it is important to verify the column widths of existing CHAR
and VARCHAR2
columns because they might need to be extended to support an encoding that requires multibyte storage. Truncation of data can occur if conversion causes expansion of data.
Figure 10-1 shows an example of data expansion when single-byte characters become multibyte. For example, ä
(a
with an umlaut) is a single-byte character in WE8MSWIN1252, but it becomes a two-byte character in UTF8. Also, the Euro symbol expands from one byte to three bytes.
The maximum number of bytes for CHAR
and VARCHAR2
datatypes is 2000 and 4000, respectively. If the data in the new character set requires columns that are wider than 2000 and 4000 bytes for CHAR
and VARCHAR2
datatypes, then you need to change your schema.
Data truncation can cause the following problems:
scött
(o
with an umlaut) will change from 5 bytes to 6 bytes. In UTF8, scött
will no longer be able to log in because of the difference in the username. Oracle Corporation recommends that usernames and passwords be based on ASCII characters. If they are not, you must reset the affected usernames and passwords after migrating to a new character set.CHAR
data contains characters that will be expanded after migration to a new character set, space padding will not be removed during database export by default. This means that these rows will be rejected upon import into the database with the new character set. The workaround is to set the BLANK_TRIMMING
initialization parameter to TRUE
before importing the CHAR
data.
See Also:
Oracle9i Database Reference for more information about the |
This section includes the following topics:
The Export and Import utilities can convert character sets from the original database character set to the new database character set. However, character set conversions can sometimes cause data loss or data corruption. For example, if you are migrating from character set A to character set B, the destination character set B should be a superset of character set A. The destination character, B, is a superset if it contains all the characters defined in character set A. Characters that are not available in character set B are converted to replacement characters, which are often specified as ?
or ¿
or a character that is related to the unavailable character. For example, ä
(a
with an umlaut) can be replaced by a
. Replacement characters are defined by the target character set.
Figure 10-2 shows an example of a character set conversion in which the copyright and Euro symbols are converted to ?
and ä
is converted to a
.
To reduce the risk of losing data, choose a destination character set with a similar character repertoire. Migrating to Unicode can be an attractive option because UTF8 contains characters from most legacy character sets.
Another character set migration scenario that can cause the loss of data is migrating a database that contains invalid data. Invalid data usually occurs in a database because the NLS_LANG
parameter is not set properly on the client. The NLS_LANG
value should reflect the client operating system code page. For example, in an English Windows environment, the code page is WE8MSWIN1252. When the NLS_LANG
parameter is set properly, the database can automatically convert incoming data from the client operating system. When the NLS_LANG
parameter is not set properly, then the data coming into the database is not converted properly. For example, suppose that the database character set is UTF8, the client is an English Windows operating system, and the NLS_LANG
setting on the client is UTF8. Data coming into the database is encoded in WE8MSWIN1252 and is not converted to UTF8 data because the NLS_LANG
setting on the client matches the database character set. Thus Oracle assumes that no conversion is necessary, and invalid data is entered into the database.
This can lead to two possible data inconsistency problems. One problem occurs when a database contains data from a character set that is different from the database character set but the same code points exist in both character sets. For example, if the database character set is WE8ISO8859P1 and the NLS_LANG
setting of the Chinese Windows NT client is SIMPLIFIED
CHINESE_CHINA.WE8ISO8859P1
, then all multibyte Chinese data (from the ZHS16GBK character set) is stored as multiples of single-byte WE8ISO8859P1 data. This means that Oracle will treat these characters as single-byte WE8ISO8859P1 characters. Hence all SQL string manipulation functions such as SUBSTR
or LENGTH
will be based on bytes rather than characters. All bytes constituting ZHS16GBK data are legal WE8ISO8859P1 codes. If such a database is migrated to another character set, for example, UTF8, character codes will be converted as if they were in WE8ISO8859P1. This way, each of the two bytes of a ZHS16GBK character will be converted separately, yielding meaningless values in UTF8. Figure 10-3 shows an example of this incorrect character set replacement.
The second possible problem is having data from mixed character sets inside the database. For example, if the data character set is WE8MSWIN1252, and two separate Windows clients using German and Greek are both using the NLS_LANG
character set setting as WE8MSWIN1252, then the database will contain a mixture of German and Greek characters. Figure 10-4 shows how different clients can use different character sets in the same database.
For database character set migration to be successful, both of these cases require manual intervention because Oracle cannot determine the character sets of the data being stored.
Database character set migration has two stages: data scanning and data conversion. Before you change the database character set, you need to identify possible database character set conversion problems and truncation of data. This step is called data scanning.
Data scanning identifies the amount of effort required to migrate data into the new character encoding scheme before changing the database character set. Some examples of what may be found during a data scan are the number of schema objects where the column widths need to be expanded and the extent of the data that does not exist in the target character repertoire. This information helps to determine the best approach for converting the database character set.
There are three approaches to converting data from one database character set to another if the database does not contain any of the inconsistencies described in "Character Set Conversion Issues". A description of methods to migrate databases with such inconsistencies is out of the scope of this documentation. For more information, contact Oracle Consulting Services for assistance.
The approaches are:
See Also:
Chapter 11, "Character Set Scanner" for more information about data scanning |
In most cases, a full export and import is recommended to properly convert all data to a new character set. It is important to be aware of data truncation issues, because columns with character datatypes may need to be extended before the import to handle an increase in size. Existing PL/SQL code should be reviewed to ensure that all byte-based SQL functions such as LENGTHB
, SUBSTRB
, and INSTRB
, and PL/SQL CHAR
and VARCHAR2
declarations are still valid.
See Also:
Oracle9i Database Utilities for more information about the Export and Import utilities |
The ALTER DATABASE CHARACTER SET
statement is the fastest way to migrate a character set, but it can be used only under special circumstances. The ALTER DATABASE CHARACTER SET
statement does not perform any data conversion, so it can be used if and only if the new character set is a strict superset of the current character set.
The new character set is a strict superset of the current character set if:
Another restriction of the ALTER
DATABASE
CHARACTER
SET
statement is that it can be used only when the character set migration is between two single-byte character sets or between two multibyte character sets. If the planned character set migration is from a single-byte character set to a multibyte character set, then use the Export and Import utilities.
This restriction on using the ALTER DATABASE CHARACTER SET
statement arises because of CLOB
data. In Oracle9i, some internal fields in the data dictionary are stored in CLOB
columns. Customers may also store data in CLOB
fields. When the database character set is multibyte, CLOB
data in Oracle9i is stored as UCS-2 data (two-byte, fixed-width Unicode). When the database character set is single-byte, CLOB
data is stored using the database character set. Because the ALTER DATABASE CHARACTER SET
statement does not convert data, CLOB
columns remain in the original database character set encoding when the database character set is migrated from single-byte to multibyte. This introduces data inconsistency in the CLOB
columns.
The syntax of the ALTER DATABASE CHARACTER SET
statement is as follows:
ALTER DATABASE [db_name
]
CHARACTER SETnew_character_set
;
db_name
is optional. The character set name should be specified without quotes. For example:
ALTER DATABASE CHARACTER SET AL32UTF8;
To change the database character set, perform the following steps:
SHUTDOWN IMMEDIATE
or a SHUTDOWN NORMAL
statement.ALTER DATABASE CHARACTER SET
statement cannot be rolled back.
STARTUP MOUNT;
ALTER SYSTEM ENABLE RESTRICTED SESSION;
ALTER SYSTEM SET JOB_QUEUE_PROCESSES=0;
ALTER SYSTEM SET AQ_TM_PROCESSES=0;
ALTER DATABASE OPEN;
ALTER DATABASE CHARACTER SET new_character_set
;
SHUTDOWN IMMEDIATE; -- or SHUTDOWN NORMAL;
STARTUP;
See Also:
|
In a Oracle9i Real Application Clusters environment, ensure that no other Oracle background processes are running, with the exception of the background processes associated with the instance through which a user is connected, before attempting to issue the ALTER
DATABASE
CHARACTER
SET
statement. Use the following SQL statement to verify the environment:
SELECT SID, SERIAL#, PROGRAM FROM V$SESSION;
Set the CLUSTER_DATABASE
initialization parameter to FALSE
to allow the character set change to be completed. This is required in an Oracle9i Real Application Cluster environment; an exclusive startup is not sufficient.
Another approach to migrating character data is to perform an ALTER
DATABASE
CHARACTER
SET
statement followed by selective imports. This method is best suited for a known distribution of convertible data that is stored within a small number of tables. A full export and import is too expensive in this scenario. For example, suppose you have a 100GB database with over 300 tables, but only 3 tables require character set conversions. The rest of the data is of the same encoding as the destination character set. The 3 tables can be exported and imported back to the new database after issuing the ALTER DATABASE CHARACTER SET
statement.
Incorrect data conversion can lead to data corruption, so perform a full backup of the database before attempting to migrate the data to a new character set.
In Oracle9i, data that is stored in columns of the NCHAR
datatypes is stored exclusively in a Unicode encoding regardless of the database character set. This allows users to store Unicode in a database that does not use Unicode as the database character set.
This section includes the following topics:
In release 8.0, the Oracle Server introduced a national character datatype (NCHAR
) that allows a second, alternate character set in addition to the database character set. The NCHAR
datatypes support several fixed-width Asian character sets that were introduced to provide better performance when processing Asian character data.
In Oracle9i, the SQL NCHAR
datatypes are limited to Unicode character set encoding (UTF8 and AL16UTF16). Any other Oracle8 Server character sets that were available for the NCHAR
datatype, including Asian character sets such as JA16SJISFIXED are no longer supported.
The steps for migrating existing NCHAR
, NVARCHAR2
, and NCLOB
columns to Oracle9i NCHAR
datatypes are as follows:
NCHAR
columns from the Oracle8 or Oracle8i database.NCHAR
columns.NCHAR
columns into Oracle9i.The Oracle9i migration utility can also convert Oracle8 and Oracle8i NCHAR
columns to 9i NCHAR
columns. A SQL NCHAR
upgrade script called utlchar.sql
is supplied with the migration utility. Run it at the end of the database migration to convert Oracle8 and Oracle8i NCHAR
columns to the Oracle9i NCHAR
columns. After the script has been executed, the data cannot be downgraded. The only way to move back to Oracle8 or Oracle8i is to drop all NCHAR
columns, downgrade the database, and import the old NCHAR
data from a previous Oracle8 or Oracle8i export file. Ensure that you have a backup (export file) of Oracle8 or Oracle8i NCHAR
data, in case you need to downgrade your database in the future.
See Also:
|
To change the national character set, use the ALTER
DATABASE
NATIONAL
CHARACTER
SET
statement. The syntax of the statement is as follows:
ALTER DATABASE [db_name
]
NATIONAL CHARACTER SETnew_NCHAR_character_set
;
db_name
is optional. The character set name should be specified without quotes.
You can issue the ALTER DATABASE CHARACTER SET
and ALTER DATABASE NATIONAL CHARACTER SET
statements together if desired.
See Also:
Oracle9i SQL Reference for the syntax of the |
You can change a column's datatype definition using the following methods:
The ALTER TABLE MODIFY
statement has the following advantages over online table redefinition:
Online table redefinition has the following advantages over the ALTER TABLE MODIFY
statement:
CLOB
datatype to the NCLOB
datatypeThis section contains the following topics:
The ALTER TABLE MODIFY
statement can be used to change table column definitions from the CHAR
datatypes to NCHAR
datatypes. It also converts all of the data in the column from the database character set to the NCHAR
character set. The syntax of the ALTER TABLE MODIFY
statement is as follows:
ALTER TABLEtable_name
MODIFY(
column_name datatype
);
If indexes have been built on the migrating column, then dropping the indexes can improve the performance of the ALTER TABLE MODIFY
statement because indexes are updated when each row is updated.
The maximum column lengths for NCHAR
and NVARCHAR2
columns are 2000 and 4000 bytes. When the NCHAR
character set is AL16UTF16, the maximum column lengths for NCHAR
and NVARCHAR2
columns are 1000 and 2000 characters, which are 2000 and 4000 bytes. If this size limit is violated during migration, consider changing the column to the NCLOB
datatype instead.
Note:
|
It takes significant time to migrate a large table with a large number of rows to Unicode datatypes. During the migration, the column data is unavailable for both reading and updating. Online table redefinition can significantly reduce migration time. Using online table redefinition also allows the table to be accessible to DML during most of the migration time.
Perform the following tasks to migrate a table to Unicode datatypes using online table redefinition:
DBMS_REDEFINITION.CAN_REDEF_TABLE
PL/SQL procedure to verify that the table can be redefined online. For example, to migrate the scott.emp table, enter the following command:
DBMS_REDEFINITION.CAN_REDEF_TABLE('scott','emp');
CREATE TABLE int_emp(
empno NUMBER(4), ename NVARCHAR2(10), job NVARCHAR2(9), mgr NUMBER(4), hiredate DATE, sal NUMBER(7,2), deptno NUMBER(2), org NVARCHAR2(10));
DBMS_REDEFINITION.START_REDEF_TABLE('scott', 'emp', 'int_emp', 'empno empno, to_nchar(ename) ename, to_nchar(job) job, mgr mgr, hiredate hiredate, sal sal, deptno deptno, to_nchar(org) org');
If you are migrating CLOB
columns to NCLOB
columns, then use the TO_NCLOB
SQL conversion function instead of the TO_NCHAR
SQL function.
DISABLED
mode. Triggers that are defined on the interim table are not executed until the online table redefinition process has been completed.DBMS_REDEFINITION.SYNC_INTERIM_TABLE
procedure. This reduces the time required for the DBMS_REDEFINITION.FINISH_REDEF_TABLE
procedure. Enter a command similar to the following:
DBMS_REDEFINITION.SYNC_INTERIM_TABLE('scott', 'emp', 'int_emp');
DBMS_REDEFINITION.FINISH_REDEF_TABLE
procedure. Enter a command similar to the following:
DBMS_REDEFINITION.RINISH_REDEF_TABLE('scott', 'emp', 'int_emp');
When this procedure has been completed, the following conditions are true:
DROP TABLE int_emp;
The results of the online table redefinition tasks are as follows:
START_REDEF_TABLE
subprogram and before the FINISH_REDEF_TABLE
subprogram are defined for the redefined original table. Referential constraints that apply to the interim table now apply to the redefined original table and are enabled.See Also:
Oracle9i Database Administrator's Guide for more information about online table redefinition |
You may need to perform additional tasks to recover a migrated database schema to its original state. Consider the issues described in Table 10-1.
|
Copyright © 1996, 2002 Oracle Corporation. All Rights Reserved. |
|