F Troubleshooting Oracle Clusterware

This appendix introduces monitoring the Oracle Clusterware environment and explains how you can enable dynamic debugging to troubleshoot Oracle Clusterware processing, and enable debugging and tracing for specific components and specific Oracle Clusterware resources to focus your troubleshooting efforts.

This appendix contains the following topics:

Monitoring Oracle Clusterware
Dynamic Debugging
Component Level Debugging
Oracle Clusterware Shutdown and Startup
Enabling and Disabling Oracle Clusterware Daemons
Determining the Active Versions and Software Versions
Diagnostics Collection Script
Oracle Clusterware Alerts
Resource Debugging
Checking the Health of the Clusterware
Clusterware Log Files and the Unified Log Directory Structure
Troubleshooting the Oracle Cluster Registry
Enabling Additional Tracing for Oracle Clusterware High Availability

Monitoring Oracle Clusterware

You can use Oracle Enterprise Manager to monitor the Oracle Clusterware environment. When you log in to Oracle Enterprise Manager using a client browser, the Cluster Database Home page appears, where you can monitor the status of the Oracle Clusterware environment. Monitoring can include the following:

Notification if there are any VIP relocations
Status of the Oracle Clusterware on each node of the cluster using information obtained through the Cluster Verification Utility (cluvfy)
Notification if node applications (nodeapps) start or stop
Notification of issues in the Oracle Clusterware alert log for the OCR, voting disk issues (if any), and node evictions

The Cluster Database Home page is similar to a single-instance Database Home page. However, on the Cluster Database Home page, Oracle Enterprise Manager displays the system state and availability. This includes a summary about alert messages and job activity, as well as links to all the database and Automatic Storage Management (ASM) instances. For example, you can track problems with services on the cluster including when a service is not running on all of the preferred instances or when a service response time threshold is not being met.

You can use the Oracle Enterprise Manager Interconnects page to monitor the Oracle Clusterware environment. The Interconnects page shows the public and private interfaces on the cluster, the overall throughput on the private interconnect, individual throughput on each of the network interfaces, error rates (if any) and the load contributed by database instances on the interconnect, including:

Overall throughput across the private interconnect
Notification if a database instance is using the public interface due to misconfiguration
Throughput and errors (if any) on the interconnect
Throughput contributed by individual instances on the interconnect

All of this information is also available as collections that have a historic view. This is useful in conjunction with cluster cache coherency, such as when diagnosing problems related to cluster wait events. You can access the Interconnects page by clicking the Interconnect tab on the Cluster Database Home page.

Also, the Oracle Enterprise Manager Cluster Database Performance page provides a quick glimpse of the performance statistics for a database. Statistics are rolled up across all the instances in the cluster database in charts. Using the links next to the charts, you can get more specific information and perform any of the following tasks:

Identify the causes of performance issues.
Decide whether resources need to be added or redistributed.
Tune your SQL plan and schema for better optimization.
Resolve performance issues.

The charts on the Cluster Database Performance page include the following:

Chart for Cluster Host Load Average—The Cluster Host Load Average chart in the Cluster Database Performance page shows potential problems that are outside the database. The chart shows maximum, average, and minimum load values for available nodes in the cluster for the previous hour.
Chart for Global Cache Block Access Latency—Each cluster database instance has its own buffer cache in its System Global Area (SGA). Using Cache Fusion, Oracle RAC environments logically combine each instance's buffer cache to enable the database instances to process data as if the data resided on a logically combined, single cache.
Chart for Average Active Sessions—The Average Active Sessions chart in the Cluster Database Performance page shows potential problems inside the database. Categories, called wait classes, show how much of the database is using a resource, such as CPU or disk I/O. Comparing CPU time to wait time helps to determine how much of the response time is consumed with useful work rather than waiting for resources that are potentially held by other processes.
Chart for Database Throughput—The Database Throughput charts summarize any resource contention that appears in the Average Active Sessions chart, and also show how much work the database is performing on behalf of the users or applications. The Per Second view shows the number of transactions compared to the number of logons, and the amount of physical reads compared to the redo size for each second. The Per Transaction view shows the amount of physical reads compared to the redo size for each transaction. Logons is the number of users that are logged on to the database.

In addition, the Top Activity drilldown menu on the Cluster Database Performance page enables you to see the activity by wait events, services, and instances. You can also see details about SQL statements and sessions at a prior point in time by moving the slider on the chart.

Dynamic Debugging

You can use crsctl commands as the root user to enable dynamic debugging for Oracle Clusterware, the Event Manager (EVM), and the clusterware subcomponents. You can dynamically change debugging levels using crsctl commands. Debugging information remains in the Oracle Cluster Registry (OCR) for use during the next startup. You can also enable debugging for resources.

The crsctl syntax to enable debugging for Oracle Clusterware is:

crsctl debug log crs "CRSRTI:1,CRSCOMM:2"

The crsctl syntax to enable debugging for EVM is:

crsctl debug log evm "EVMCOMM:1"

The crsctl syntax to enable debugging for resources is:

crsctl debug log res "resname:1"

Component Level Debugging

You can use crsctl commands as the root user to enable dynamic debugging for the Oracle Clusterware Cluster Ready Services (CRS), Oracle Cluster Registry (OCR), Cluster Synchronization Services (CSS), and the Event Manager (EVM).

This section contains the following topics:

Enabling Debugging for CRS, OCR, CSS, and EVM Modules
Creating an Initialization File to Contain the Debugging Level

Enabling Debugging for CRS, OCR, CSS, and EVM Modules

You can enable debugging for the CRS, OCR, CSS, and EVM modules and their components by setting environment variables or by issuing crsctl debug commands using the following syntax:

crsctl debug log module_name component:debugging_level

You must issue the crsctl debug command as the root user, and supply the following information:

module_name—The name of the module: CRS, EVM, or CSS.
component—The name of a component for the CRS, OCR, EVM, or CSS module. See Table F-1 for a list of all of the components.
debugging_level—A number from 1 to 5 to indicate the level of detail you want the debug command to return, where 1 is the least amount of debugging output and 5 provides the most detailed debugging output.

You can dynamically change the debugging level in the crsctl command, or you can configure an init file for changing the debugging level as described in "Creating an Initialization File to Contain the Debugging Level".
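As a minimal sketch of the syntax above, the following shell helper checks that each debugging level is in the documented range of 1 to 5 and joins the component:level pairs with commas, which is the form used in the examples in this section. The function name crs_debug_spec is hypothetical and is not part of crsctl. The helper only builds the string and does not run any command.

```shell
# Hypothetical helper: build a "COMPONENT:LEVEL,..." string for crsctl debug log.
# Rejects any pair whose level is outside the documented range of 1 to 5.
crs_debug_spec() {
    spec=""
    for pair in "$@"; do
        level=${pair##*:}          # text after the last colon
        case "$level" in
            [1-5]) ;;
            *) echo "invalid debugging level in '$pair' (expected 1-5)" >&2
               return 1 ;;
        esac
        spec="${spec:+$spec,}$pair"   # add a comma only between pairs
    done
    printf '%s\n' "$spec"
}

# Example use, as the root user:
#   crsctl debug log crs "$(crs_debug_spec CRSRTI:1 CRSCOMM:2)"
```

Validating the level before calling crsctl keeps a typing error from reaching the running daemons.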

The following commands show examples of how to enable debugging for the various modules:

To enable debugging for Oracle Clusterware:

crsctl debug log crs "CRSRTI:1,CRSCOMM:2"

To enable debugging for OCR:

crsctl debug log crs "CRSRTI:1,CRSCOMM:2,OCRSRV:4"

To enable debugging for EVM:

```
crsctl debug log evm "EVMCOMM:1"
```

To enable debugging for resources:

```
crsctl debug log res "resname:1"
```

To list the components that can be used for debugging, issue the crsctl lsmodules command using the following syntax and supply crs, evm, or css for the module_name parameter:

crsctl lsmodules module_name

Note:

You do not have to be the root user to run the crsctl command with the lsmodules option.
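To see the components for every module in one pass, a loop over the three module names accepted by crsctl lsmodules may be convenient. The following is a sketch only. The CRSCTL variable and the function name list_all_modules are assumptions introduced for illustration, so that a full path to crsctl can be substituted if it is not on the PATH.

```shell
# Assumption: CRSCTL may be overridden with a full path to the crsctl binary.
CRSCTL=${CRSCTL:-crsctl}

# Print the debuggable components of each module named in this section.
list_all_modules() {
    for module in crs evm css; do
        echo "== $module =="
        "$CRSCTL" lsmodules "$module"
    done
}
```

Call list_all_modules with no arguments. Root privileges are not needed for this command.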

Table F-1 shows the components for the CRS, OCR, EVM, and CSS modules, respectively. Note that some of the component names are common between the CRS, EVM, and CSS daemons and may be enabled on that specific daemon. For example, COMMNS is the NS layer and, because each daemon uses the NS layer, you can enable this specific module component on any of the daemons to get specific debugging information.

Table F-1 Components for the CRS, OCR, EVM, and CSS Modules

Module	Components
CRS (Footnote 1)	`CRSUI` `CRSCOMM` `CRSRTI` `CRSMAIN` `CRSPLACE` `CRSAPP` `CRSRES` `CRSOCR` `CRSTIMER` `CRSEVT` `CRSD` `CLUCLS` `CSSCLNT` `COMMCRS` `COMMNS`
OCR (Footnote 2)	`OCRAPI` `OCRCLI` `OCRSRV` `OCRMAS` `OCRMSG` `OCRCAC` `OCRRAW` `OCRUTL` `OCROSD`; OCR tools modules: `OCRCONF` `OCRDUMP` `OCRCHECK`
EVM (Footnote 3)	`EVMD` `EVMDMAIN` `EVMCOMM` `EVMEVT` `EVMAPP` `EVMAGENT` `CRSOCR` `CLUCLS` `CSSCLNT` `COMMCRS` `COMMNS`
CSS (Footnote 4)	`CSSD` `COMMCRS` `COMMNS`

Footnote 1: List the CRS component modules using the crsctl lsmodules crs command.

Footnote 2: You cannot list the OCR modules using the crsctl lsmodules command.

Footnote 3: List the EVM component modules using the crsctl lsmodules evm command.

Footnote 4: List the CSS component modules using the crsctl lsmodules css command.

Creating an Initialization File to Contain the Debugging Level

This section describes how to specify the debugging level in an initialization file. This debugging information is stored for use during the next startup.

For each process that you want to debug, you can create an initialization file that contains the debugging level.

The initialization file name includes the name of the process that you are debugging (process_name.ini). The file is located in the Oracle_home/log/hostname/admin/ directory.

For example, ORACLE_HOME/log/hostA/admin/clscfg.ini is the name for the CLSCFG debugging initialization file on hostA.

Oracle Clusterware Shutdown and Startup

You can start or stop Oracle Clusterware by issuing crsctl start and stop commands.

Example 1 Stopping Oracle Clusterware

To stop Oracle Clusterware and its related resources on a specific node, issue the following command:

crsctl stop crs

Example 2 Starting Oracle Clusterware

To start Oracle Clusterware and its related resources on a specific node, issue the following command:

crsctl start crs

Note:

You must run these crsctl commands as the root user.

Enabling and Disabling Oracle Clusterware Daemons

When the Oracle Clusterware daemons are enabled, they start automatically when the node is started. To prevent the daemons from starting, disable them. Use the following crsctl commands to enable and disable the startup of the Oracle Clusterware daemons.

Issue the following command to enable startup for all of the Oracle Clusterware daemons:

crsctl enable crs

Issue the following command to disable the startup of all of the Oracle Clusterware daemons:

crsctl disable crs

Note:

You must run these crsctl commands as the root user.
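The stop and start commands from the preceding section can be combined with the enable and disable commands when a node must stay down across restarts, for example during maintenance. The following is a sketch under that assumption, not a prescribed procedure. The function names and the CRSCTL variable are hypothetical, and both functions must be run as the root user.

```shell
# Assumption: CRSCTL may be overridden with a full path to the crsctl binary.
CRSCTL=${CRSCTL:-crsctl}

# Disable automatic startup, then stop Oracle Clusterware on this node.
crs_begin_maintenance() {
    "$CRSCTL" disable crs && "$CRSCTL" stop crs
}

# Re-enable automatic startup, then start Oracle Clusterware on this node.
crs_end_maintenance() {
    "$CRSCTL" enable crs && "$CRSCTL" start crs
}
```

Disabling before stopping means that an unplanned restart of the node during the maintenance window does not bring the daemons back up.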

Determining the Active Versions and Software Versions

You can determine the active version or the software version on the local node by issuing the crsctl query crs activeversion and crsctl query crs softwareversion commands.

The software version is the binary version of the software on a particular cluster node.
The active version is the lowest software version running in a cluster.

These versions are used while upgrading a cluster.

Example 1 Determining the Active Version

To determine the active version on the local node, issue the following command:

crsctl query crs activeversion

Example 2 Determining the Software Version

To determine the software version on the local node, issue the following command:

crsctl query crs softwareversion
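During an upgrade it can be useful to check whether the two versions agree on a node. The following sketch assumes that each command prints a single line that ends with the version number in square brackets. That output layout is an assumption to verify on your release. The function names and the CRSCTL variable are hypothetical.

```shell
# Assumption: CRSCTL may be overridden with a full path to the crsctl binary.
CRSCTL=${CRSCTL:-crsctl}

# Print the text inside the last [...] pair on each input line.
last_bracketed() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Succeed when the active and software versions match on this node.
# Assumes the version is the last bracketed value in each command's output.
crs_versions_match() {
    active=$("$CRSCTL" query crs activeversion | last_bracketed)
    software=$("$CRSCTL" query crs softwareversion | last_bracketed)
    echo "active=$active software=$software"
    [ -n "$active" ] && [ "$active" = "$software" ]
}
```

A mismatch is expected while a rolling upgrade is in progress, because the active version stays at the lowest software version in the cluster.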

Diagnostics Collection Script

Every time an Oracle Clusterware error occurs, you should run the diagcollection.pl script to collect diagnostic information from Oracle Clusterware in trace files. The diagnostics provide additional information so Oracle Support can resolve problems. Run this script from the following location:

CRS_home/bin/diagcollection.pl

Note:

You must run this script as the root user.

Oracle Clusterware Alerts

Oracle Clusterware posts alert messages when important events occur. The following are examples of alerts from the CRSD process:

[NORMAL] CLSD-1201: CRSD started on host %s
[ERROR] CLSD-1202: CRSD aborted on host %s. Error [%s]. Details in %s.
[ERROR] CLSD-1203: Failover failed for the CRS resource %s. Details in %s.
[NORMAL] CLSD-1204: Recovering CRS resources for host %s
[ERROR] CLSD-1205: Auto-start failed for the CRS resource %s. Details in %s.

On Linux, UNIX, and Windows systems, this alert log is located at the following path, where CRS_home is the location of Oracle Clusterware: CRS_home/log/hostname/alerthostname.log.

The following example shows an EVMD alert:

[NORMAL] CLSD-1401: EVMD started on node %s 
[ERROR] CLSD-1402: EVMD aborted on node %s. Error [%s]. Details in %s.
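To scan an alert log for problems, you can filter on the severity tag. The following sketch assumes that each line in the log starts with a tag such as [ERROR] followed by a CLSD message number, as in the samples in this section. The layout of a real alert log may differ, so treat the pattern as a starting point. The function names are hypothetical.

```shell
# Build the alert log path: CRS_home/log/hostname/alerthostname.log
# Arguments: CRS home directory, host name.
crs_alert_log() {
    printf '%s/log/%s/alert%s.log\n' "$1" "$2" "$2"
}

# Print only the [ERROR] CLSD messages from the given alert log file.
# Assumes lines begin with the severity tag, as in the samples above.
crs_alert_errors() {
    grep -E '^\[ERROR] CLSD-[0-9]+:' "$1"
}
```

For example, crs_alert_errors "$(crs_alert_log /u01/app/crs hostA)" lists the errors for hostA, where /u01/app/crs is a hypothetical CRS home.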

Resource Debugging

You can use the crsctl command to enable resource debugging, as in the following example:

crsctl debug log res "ora.node1.vip:1"

This sets the environment variable USR_ORA_DEBUG to 1 before running the start, stop, or check action scripts for the ora.node1.vip resource.

Note:

You must run this crsctl command as the root user.

Checking the Health of the Clusterware

Use the crsctl check command to determine the health of your clusterware as in the following example:

crsctl check crs

Issue the following command to determine the health of individual daemons where daemon is crsd, cssd or evmd:

crsctl check daemon

Note:

You do not have to be the root user to perform health checks.

Clusterware Log Files and the Unified Log Directory Structure

Oracle uses a unified log directory structure to consolidate the Oracle Clusterware component log files. This consolidated structure simplifies diagnostic information collection and assists during data retrieval and problem analysis.

Oracle retains five files that are 20MB in size for the CSSD process and one file that is 10MB in size for the CRSD and EVMD processes. In addition, Oracle deletes the oldest log file for any log file group when the maximum storage limit for the group's files exceeds 10MB. Alert files are stored in the directory structures shown in Table F-2.

Table F-2 Locations of Oracle Clusterware Component Log Files

Component	Log File Location (Footnote 1)
Cluster Ready Services Daemon (crsd) log files	`CRS_home/log/hostname/crsd`
Oracle Cluster Registry (OCR) records	The OCR tools (OCRDUMP, OCRCHECK, OCRCONFIG) record log information in the following location (Footnote 2): `CRS_home/log/hostname/client`. The OCR server records log information in the following location (Footnote 3): `CRS_home/log/hostname/crsd`
Oracle Process Monitor Daemon (OPROCD)	The following path is specific to Linux (Footnote 4): `/etc/oracle/hostname.oprocd.log`
Cluster Synchronization Services (CSS)	`CRS_home/log/hostname/cssd`
Event Manager (EVM) information generated by `evmd`	`CRS_home/log/hostname/evmd`
Oracle RAC RACG	The Oracle RAC high availability trace files are located in the following two locations: `CRS_home/log/hostname/racg` and `$ORACLE_HOME/log/hostname/racg`. Core files are in subdirectories of the log directory. Each RACG executable has a subdirectory assigned exclusively for that executable. The name of the RACG executable subdirectory is the same as the name of the executable.

Footnote 1: The directory structure is the same for Linux, UNIX, and Windows systems.

Footnote 2: To change the amount of logging, edit the CRS_home/srvm/admin/ocrlog.ini file.

Footnote 3: To change the amount of logging, edit the CRS_home/log/hostname/crsd/crsd.ini file.

Footnote 4: This path is dependent upon the installed Linux or UNIX platform.
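When collecting logs by hand, it can help to print the directories under CRS_home/log/hostname from Table F-2 for a given node. The following is a minimal sketch. The function name clusterware_log_dirs is hypothetical, and the OPROCD log and the $ORACLE_HOME RACG directory are left out because they live outside this tree.

```shell
# Print the per-component log directories under CRS_home/log/hostname.
# Arguments: CRS home directory, host name.
clusterware_log_dirs() {
    for component in crsd cssd evmd client racg; do
        printf '%s/log/%s/%s\n' "$1" "$2" "$component"
    done
}
```

For example, clusterware_log_dirs /u01/app/crs hostA prints five paths, where /u01/app/crs is a hypothetical CRS home. Use the host name exactly as it appears in the log directory on your system.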

Troubleshooting the Oracle Cluster Registry

The following topics in this section explain how to troubleshoot the OCR:

Using the OCRDUMP Utility to View Oracle Cluster Registry Content
Using the OCRCHECK Utility
Oracle Cluster Registry Troubleshooting

Using the OCRDUMP Utility to View Oracle Cluster Registry Content

This section explains how to use the OCRDUMP utility to view OCR content for troubleshooting. The OCRDUMP utility enables you to view the OCR contents by writing OCR content to a file or stdout in a readable format.

You can use a number of options for OCRDUMP. For example, you can limit the output to a key and its descendants. You can also write the contents to an XML file that you can view using a browser. OCRDUMP writes the OCR keys as ASCII strings and values in a datatype format. OCRDUMP retrieves header information on a best-effort basis.

OCRDUMP also creates a log file in CRS_home/log/hostname/client. To change the amount of logging, edit the file CRS_home/srvm/admin/ocrlog.ini.

To change the logging level for a component, edit the comploglvl= entry. For example, to change the logging of the OCRAPI component to 3 and the logging of the OCRRAW component to 5, make the following entry in the ocrlog.ini file:

comploglvl="OCRAPI:3;OCRRAW:5"

Note:

Make sure that you have file creation privileges in the CRS_home directory before using the OCRDUMP utility.

OCRDUMP Utility Syntax and Options

This section describes the OCRDUMP utility command syntax and usage. Run the ocrdump command with the following syntax, where file_name is the name of a target file to which you want Oracle to write the OCR output and keyname is the name of a key from which you want Oracle to write OCR subtree content:

ocrdump [file_name|-stdout] [-backupfile backup_file_name] [-keyname keyname] [-xml] [-noheader]

Table F-3 describes the OCRDUMP utility options and option descriptions.

Table F-3 OCRDUMP Options and Option Descriptions

Options	Description
`file_name`	The name of a file to which you want OCRDUMP to write output. By default, output from the OCRDUMP utility is written to the predefined output file named `OCRDUMPFILE`. The `file_name` option redirects OCRDUMP output to the file that you specify.
`-stdout`	Use this option to redirect the OCRDUMP output to the text terminal that initiated the program. If you do not redirect the output, output from the OCRDUMP utility is written to the predefined output file named `OCRDUMPFILE` by default.
`-keyname`	The name of an OCR key whose subtree is to be dumped.
`-xml`	Writes the output in XML format.
`-noheader`	Does not print the time at which you ran the command and when the OCR configuration occurred.
`-backupfile`	Option to identify a backup file.
`backup_file_name`	The name of the backup file with the content you want to view. You can query the backups using the `ocrconfig -showbackup` command.

OCRDUMP Utility Examples

The following ocrdump utility examples extract various types of OCR information and write it to various targets:

ocrdump

Writes the OCR content to a file called OCRDUMPFILE in the current directory.

ocrdump MYFILE

Writes the OCR content to a file called MYFILE in the current directory.

ocrdump -stdout -keyname SYSTEM

Writes the OCR content from the subtree of the key SYSTEM to stdout.

ocrdump -stdout -xml

Writes the OCR content to stdout in XML format.

Sample OCRDUMP Utility Output

The following OCRDUMP examples show the KEYNAME, VALUE TYPE, VALUE, permission set (user, group, world) and access rights for two sample runs of the ocrdump command. The following shows the output for the SYSTEM.language key that has a text value of AMERICAN_AMERICA.WE8ASCII37.

[SYSTEM.language]
ORATEXT : AMERICAN_AMERICA.WE8ASCII37
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ,
 OTHER_PERMISSION : PROCR_READ, USER_NAME : user, GROUP_NAME : group
}

The following shows the output for the SYSTEM.version key that has an integer value of 3:

[SYSTEM.version]
UB4 (10) : 3
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ,
 OTHER_PERMISSION : PROCR_READ, USER_NAME : user, GROUP_NAME : group
}
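Because the text output follows a regular pattern, a short filter can pull the value of a single key out of a dump. The following sketch assumes that the value line directly follows the bracketed key line and uses the "TYPE : value" layout shown in the samples above. The function name ocr_value is hypothetical.

```shell
# Print the value stored under one key in OCRDUMP text output read from stdin.
# Argument: the key name without brackets, for example SYSTEM.version.
# Assumes the value line directly follows the [KEY] line, as in the samples.
ocr_value() {
    awk -v key="[$1]" '
        $0 == key { found = 1; next }                   # the [KEY] header line
        found { sub(/^[^:]*: /, ""); print; exit }      # strip "TYPE : " prefix
    '
}
```

For example, ocrdump -stdout -keyname SYSTEM.version | ocr_value SYSTEM.version uses only options described in Table F-3.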

Using the OCRCHECK Utility

The OCRCHECK utility displays the version of the OCR's block format, total space available and used space, OCRID, and the OCR locations that you have configured. OCRCHECK performs a block-by-block checksum operation for all of the blocks in all of the OCRs that you have configured. It also returns an individual status for each file as well as a result for the overall OCR integrity check.

The following example shows a sample of the OCRCHECK utility output:

Status of Oracle Cluster Registry is as follows :
        Version                  :          2
        Total space (kbytes)     :     262144
        Used space (kbytes)      :      16256
        Available space (kbytes) :     245888
        ID                       : 1918913332
        Device/File Name         : /dev/raw/raw1
                                   Device/File integrity check succeeded
        Device/File Name         : /dev/raw/raw2
                                   Device/File integrity check succeeded
 
        Cluster registry integrity check succeeded

OCRCHECK creates a log file in the directory CRS_home/log/hostname/client. To change the amount of logging, edit the file CRS_home/srvm/admin/ocrlog.ini.
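The OCRCHECK output lends itself to simple automated checks. The following sketch reads output in the layout of the sample above and reports the used space as a whole percentage, with a second function that tests for the overall integrity message. The function names are hypothetical, and the parsing assumes the labels appear exactly as in the sample.

```shell
# Read OCRCHECK output on stdin and print used space as a whole percentage.
ocr_used_percent() {
    awk -F: '
        /Total space/ { total = $2 + 0 }   # value after the colon, as a number
        /Used space/  { used = $2 + 0 }
        END { if (total > 0) printf "%d\n", used * 100 / total }
    '
}

# Succeed only if the overall integrity check message is present on stdin.
ocr_integrity_ok() {
    grep -q 'Cluster registry integrity check succeeded'
}
```

For the sample output above, the used space is 16256 of 262144 kilobytes, which the first function reports as 6. Each function consumes its input, so save the ocrcheck output to a file if you want to run both.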

Oracle Cluster Registry Troubleshooting

Table F-4 describes common OCR problems with corresponding resolution suggestions.

Table F-4 Common OCR Problems and Solutions

Problem	Solution
Not currently using OCR mirroring and would like to enable it.	Run the `ocrconfig` command with the `-replace` option as described.
An OCR failed and you need to replace it. Error messages in Enterprise Manager or OCR log file.	Run the `ocrconfig` command with the `-replace` option as described.
An OCR has a misconfiguration.	Run the `ocrconfig` command with the `-repair` option as described.
You are experiencing a severe performance effect from OCR processing or you want to remove an OCR for other reasons.	Run the `ocrconfig` command with the `-replace` option as described.
An OCR has failed and, before you can fix it, the node needs to be rebooted with only one OCR.	Run the `ocrconfig -repair` command to remove the bad OCR file. Oracle Clusterware will not start if it cannot find all of the defined OCRs.

Enabling Additional Tracing for Oracle Clusterware High Availability

Oracle Support may ask you to enable tracing to capture additional information. Because the procedures described in this section may affect performance, only perform these activities with the assistance of Oracle Support. This section includes the following topics:

Generating Additional Trace Information for a Running Resource
Verifying Event Manager Daemon Communications

Generating Additional Trace Information for a Running Resource

To generate additional trace information for a running resource, Oracle recommends that you use CRSCTL commands. For example, issue the following command to turn on debugging for resources:

$ crsctl debug log res "resource_name:level"

For example, to set the value of the USR_ORA_DEBUG initialization parameter to 1 for the VIP resource, issue the following command:

$ crsctl debug log res ora.cwclu011.vip:1

Verifying Event Manager Daemon Communications

The event manager daemons (evmd) running on separate nodes communicate through specific ports. To determine whether the evmd for a node can send and receive messages, perform the test described in this section while running session 1 in the background.

On node 1, in session 1, enter:

$ evmwatch -A -t "@timestamp @@"

On node 2, in session 2, enter:

$ evmpost -u "hello" [-h nodename]

Session 1 should show output similar to the following:

$ 21-Jul-2007 08:04:26 hello

Ensure that each node can both send and receive messages by executing this test in several permutations.
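To keep track of which permutations remain to be tested, you can list every ordered pair of nodes, where the first node runs evmwatch and the second runs evmpost. The following is a minimal sketch. The function name evm_test_pairs is hypothetical, and leaving out the pairs in which a node posts to itself is an assumption about which tests matter.

```shell
# Print every ordered "watching_node posting_node" pair for the given nodes.
# Pairs where both nodes are the same are skipped (an assumption).
evm_test_pairs() {
    for watcher in "$@"; do
        for poster in "$@"; do
            [ "$watcher" = "$poster" ] || echo "$watcher $poster"
        done
    done
}
```

For a cluster with nodes node1 and node2, this prints two lines, one for each direction. A three-node cluster yields six pairs.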