APPENDIX B - Software Error Messages


This appendix contains information on Netra CT 820 platform-specific software error messages. Messages are produced by software and firmware running on the distributed management card, and by software and firmware running on the Netra CT 820 system, including the Solaris OS, OpenBoot PROM firmware, the MOH application, and the PMS application.

For Netra CT 820 platform-specific hardware error messages, refer to the Netra CT 820 Server Service Manual.

For additional information on software error messages not specific to the Netra CT 820 system, refer to:

The web site http://docs.sun.com for the Solaris OS, OpenBoot, DHCP, and ChorusOS documentation

The web site http://www.sun.com/products-n-solutions/hardware/docs for Netra High Availability Suite documentation

Third-party board documentation for any third-party node boards you are using

This appendix includes the following sections:

Overview

Messages

Overview

This appendix lists error messages in alphabetical order, with the format:

Message, Cause, Action

Distributed Management Card Messages. Error messages originate from software and firmware on the distributed management card itself, such as ChorusOS, BMC, and the CLI. In addition, messages from other software, such as the PMS application and the OpenBoot PROM firmware, might be displayed on the distributed management card console.

Distributed management card error messages are displayed on the distributed management card console. They are not saved to a log.

Solaris OS Messages. Messages are displayed on the Netra CP2300 cPSB Board console. They are saved to a log in /var/adm/messages.
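Because Solaris OS messages are the only ones described in this overview that are saved to a log, the platform-specific entries can be pulled out of /var/adm/messages after the fact. The following is a minimal sketch in Python, assuming the log has been copied to a workstation where Python is available. The function names are hypothetical, and the prefix list is taken from the SUNW_envmond and SUNW_picl_watchdog messages listed later in this appendix.

```python
# Illustrative helper; not part of the Netra CT 820 software.
# The prefixes come from the SUNW_envmond and SUNW_picl_watchdog
# messages listed in this appendix.
PLATFORM_PREFIXES = ("SUNW_envmond:", "SUNW_picl_watchdog:")


def find_platform_messages(lines, prefixes=PLATFORM_PREFIXES):
    """Return the lines that contain any of the given message prefixes."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(prefix in line for prefix in prefixes)
    ]


def scan_log(path="/var/adm/messages"):
    """Read a saved copy of the log and return the matching entries."""
    with open(path, errors="replace") as log:
        return find_platform_messages(log)
```

The matching is a plain substring test, so it does not depend on the timestamp or host name format at the start of each log line.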

OpenBoot PROM Firmware Messages. Messages from OpenBoot PROM are displayed through a Netra CP2300 cPSB Board console. They can be displayed on the node board console itself or on the distributed management card console if you are logged in remotely using the CLI console command.

OpenBoot PROM error and warning messages are not saved to a log on either the distributed management card or on a node board.

MOH Application Messages. These messages are displayed on a node board console, on the distributed management card console, or on both. They are not saved to a log.

PMS Application Messages. PMS is a high-level application. Thus, faults in various places in the software and hardware underlying this application can result in PMS error messages. For example, a fault could occur on the midplane or on a disk. This situation might make it difficult to isolate where a specific fault is occurring. A solution to many PMS error messages is to reset the distributed management card.

PMS error messages are printed to the console you are using to execute the pmsd CLI command; they are not saved to a log on either the distributed management card or on a node board.

Messages

!!! ALERT !!! Crossing Critical temperature threshold
The current threshold setting is: number degreeC
The current temperature is : number degreeC

Cause: A temperature problem, either in the chassis environment (for example, a fan failure) or as configured on the node board (for example, a user misconfiguration of a temperature setting), causes this message.

Action: (1) Check the fans to make sure they are working properly; replace if necessary. (2) Check the room environment for proper cooling and adjust if necessary. (3) Check the OpenBoot PROM environment variables warning-temperature, critical-temperature, and shutdown-temperature to make sure they are within range of the chassis environment (refer to the Netra CP2300 cPSB Board Programming Guide for more information) and adjust the environment variables as necessary.
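When console output is being captured, the threshold and temperature values in these alerts can be extracted for comparison against the configured OpenBoot PROM settings. The following Python sketch assumes the two value lines appear exactly as shown in this appendix, with a number in place of the number placeholder. The function name is hypothetical, and the console capture itself is not shown.

```python
import re

# Patterns follow the message text shown in this appendix; the spacing
# around the colon is matched loosely because it differs between lines.
_THRESHOLD = re.compile(
    r"The current threshold setting is\s*:\s*(-?\d+(?:\.\d+)?)\s*degreeC")
_CURRENT = re.compile(
    r"The current temperature is\s*:\s*(-?\d+(?:\.\d+)?)\s*degreeC")


def parse_temperature_alert(text):
    """Return (threshold, current) in degrees C, or None if a line is missing."""
    threshold = _THRESHOLD.search(text)
    current = _CURRENT.search(text)
    if threshold is None or current is None:
        return None
    return float(threshold.group(1)), float(current.group(1))
```

The same two value lines follow the "Lower ..." and "Upper ..." threshold messages in this appendix, so the sketch applies to those as well.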

!!! ALERT!!! Crossing Shutdown temperature threshold
The current threshold setting is: number degreeC
The current temperature is : number degreeC

An attempt to start the "protocol" communication server failed, will retry.
The problem could be because of a misconfigured primary network 
interface, or possibly another instance of the agent is running

Cause: A network configuration problem, another MOH agent instance, or another application or process using the MOH port has resulted in the MOH agent's inability to start the RMI server.

Action: If this message occurs on the distributed management card, check the network interfaces (for example, make sure the ifeth0 interface has a valid IP address). If this message occurs on a node board, check the network interfaces; check to see if an MOH agent is already running, with the command pgrep -fl java; try stopping and restarting the agent; check to see which ports are in use, with the command netstat -a.

An attempt to start the SnmpView failed, will retry.
Check the network configuration

Cause: The MOH agent could not start the SNMP view, because of a network configuration problem or because another application or process is using the SNMP port.

Action: This message could occur on the distributed management card or on a node board. (1) Check the network configuration. (2) Check to see which ports are in use, with the command netstat -a.

Board RG0 resource state must be OFFLINE to perform operation on this slot

Cause: Resource Group 0 (RG0), the group of applications on a node board that PMS manages, must be offline before you can run certain commands from the distributed management card.

Action: Change the RG0 state from active to offline. For example, use the command pmsd appoperset -o force_offline.

Can't reset: Standby is not healthy

Cause: You tried to reset the active distributed management card, but the standby distributed management card is not in a healthy state.

Action: Verify the status of the standby distributed management card with the showhealth command. Reset the standby distributed management card if necessary.

CLI: unknown command: use help for valid commands

Cause: (1) You used a distributed management card CLI command that is not a valid command. (2) On the standby distributed management card, you used an active distributed management card CLI command that is not valid on the standby distributed management card.

Action: For a list of valid CLI commands, either use the CLI help command or refer to TABLE 3-1 and TABLE 3-2.

Configuration Download Error: Node
card in slot number failed to poweron

Cause: The midplane FRU ID is corrupted. The distributed management card is unable to communicate with the node board over IPMI.

Action: Contact SunService.

console:  All console sessions busy to slot number

Cause: From the distributed management card, you tried to open a console session to a node board, but the maximum four console sessions for that node board are already open.

Action: Either retry connecting later or free up a session to that node board.

console: failed to connect to console in slot number

Cause: This message on the distributed management card console could indicate an IPMI bus problem or a node board configuration problem after you try to open a console connection to a node board.

Action: Try opening a console connection to a different slot. If this fails, reset the distributed management card and try reconnecting to the same slot.

DM board or switch board slot not managed by PMS Daemon

Cause: Many pmsd CLI commands can generate this message.

Action: PMS does not manage the distributed management card or the switching fabric boards.

DMC is in Standby Mode: pmsd operations not available

Cause: You issued a pmsd CLI command on the standby distributed management card; pmsd operations are not available on the standby distributed management card.

Action: Use the active distributed management card for pmsd commands.

Error Disabling CPU Sensor

Cause: The reset-all OpenBoot PROM command could generate this message. The node board could be in an unknown state or could have a hardware problem.

Action: Hot-swap the node board. If the problem still exists, the board might need to be returned to SunService.

Error Disabling Temperature Sensor

Cause: This message could occur at power on of the node board, after POST has completed, but before the OpenBoot PROM prompt is displayed. The node board could be in an unknown state or could have a hardware problem.

Action: Hot-swap the node board. If the problem still exists at power on, the board might need to be returned to SunService.

Error Disabling the Watchdog

Cause: The following OpenBoot PROM commands could generate this message: reset-all, flash-update, delete-dropin, add-dropin, flat-update, flash-from-rombo. The node board could be in an unknown state or could have a hardware problem.

Action: Hot-swap the node board. If the problem still exists, the board might need to be returned to SunService.

Error Enabling Temperature Sensor

Action: Hot-swap the node board. If the problem still exists at power on, the board might need to be returned to SunService.

Error Setting Temperature Threshold

Action: Hot-swap the node board. If the problem still exists at power on, the board might need to be returned to SunService.

Failover Manager:Partner DMC at state  state, partner is not ready 
to take over ACTIVE role

Cause: This message results from using either the setfailover force command or the reset dmc command on the active distributed management card when the standby distributed management card is not ready to become the active card.

Action: Wait for the standby distributed management card to be ready before using the setfailover force or reset dmc commands.

Failover Manager:Partner DMC is unhealthier, local DMC will
remain ACTIVE

Action: Wait for the standby distributed management card to be healthy before using the setfailover force or reset dmc commands.

Failover Manager:Partner event - A service failure

Cause: A service, such as ntp or MOH, on either distributed management card has failed.

Action: If the setdmcrecovery mode is on, the newly active distributed management card will try to recover the failed distributed management card; if the setdmcrecovery mode is off, try to reset the failed distributed management card from the newly active distributed management card.

Failover Manager:Partner event - DMC Initialization failure

Cause: After a reset, the distributed management card could not come to a "ready" state.

Failover Manager:Partner event - External ethernet interface
down

Cause: This message occurs if either distributed management card's external Ethernet link goes down, and the setetherfailover mode is set to enable.

Action: Check the cable connection on the external Ethernet port.

Failover Manager:Partner event - KCS interface failure

Cause: BMC firmware on either distributed management card is not responding.

Failover Manager:Partner event - SysBus Interface down

Cause: A distributed management card's internal link to the switching fabric board failed.

Action: Check the corresponding switching fabric board state; reset the switching fabric board. If the error still occurs, contact SunService.

Failover Manager:Partner event - Unexpected reset of local BMC

Cause: The BMC on the active distributed management card reset itself.

Failover Manager:Partner failed - Partner boot failure

Cause: This message occurs if either distributed management card doesn't boot after a reset or recovery within the expected boot-up time.

Action: Contact SunService.

Failover Manager:Partner failed - Partner Healthy# down

Cause: The active distributed management card's #HEALTHY signal is down due to a panic, a watchdog timer generated reset, or a hardware fault.

Action: If the setdmcrecovery mode is on, the distributed management card should recover. If it does not, contact SunService.

Failover Manager:Partner failed - Partner Heartbeat down

Cause: The internal heartbeat mechanism between the two distributed management cards detected a heartbeat failure.

Action: Contact SunService.

Failover Manager:Recovering the partner for number time

Cause: This message occurs during attempted recovery of a failed distributed management card.

Action: None needed. If the failed distributed management card does not recover, contact SunService.

Failover Manager:Recovery attempts are exhausted. No more
recovery of the failed partner

Cause: Three successive recovery attempts have failed on the failed distributed management card.

Action: Contact SunService.

Invalid cpu_node number: number

Cause: You entered an invalid node board number for a console connection from the distributed management card.

Action: Enter a valid node number, 3 through 30.
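A script that opens console connections on an operator's behalf can check the node number before sending it. The following Python sketch encodes only the range stated above (3 through 30); the function name is hypothetical.

```python
# Valid node numbers are 3 through 30 inclusive, per the Action above.
VALID_NODE_NUMBERS = range(3, 31)


def is_valid_cpu_node(value):
    """Return True if value is a whole number from 3 through 30."""
    text = str(value).strip()
    # Reject anything that is not plain ASCII digits (signs, decimals, words).
    if not (text.isascii() and text.isdigit()):
        return False
    return int(text) in VALID_NODE_NUMBERS
```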

Invalid IP mode

Cause: You specified an invalid syntax for the CLI command setipmode.

Action: The setipmode usage is: setipmode -b port_num rarp|config|none. Refer to Configuring the Distributed Management Cards' Ethernet Ports for more information.
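If setipmode commands are generated by a script, the mode argument can be checked against the usage line before the command is sent to the CLI. The following Python sketch is an assumption-laden illustration: the function name is hypothetical, and the port number is passed through unchecked because this appendix does not state its valid range.

```python
# Modes accepted by setipmode, taken from the usage line above.
VALID_IP_MODES = ("rarp", "config", "none")


def build_setipmode_command(port_num, mode):
    """Return the setipmode command line, or raise ValueError for a bad mode."""
    if mode not in VALID_IP_MODES:
        raise ValueError(
            "invalid IP mode %r; expected one of %s"
            % (mode, ", ".join(VALID_IP_MODES)))
    # port_num is not validated here; its valid values are not given
    # in this appendix.
    return "setipmode -b %s %s" % (port_num, mode)
```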

Invalid slot number

Cause: You specified an invalid slot number for a CLI command that accepts a slot number option.

Action: Refer to TABLE 3-1 for the correct syntax for that particular command.

IP address for the system management bus interface not found -
For distributed agent functionality
Please check the following interface configuration : interface

Cause: The MOH application needs an IP address for the system management network to be able to communicate between the distributed management card and the node boards. This message displays if the distributed management card or a node board does not have an IP address for the system management network interface, or if either of these interfaces failed to initialize.

Action: Configure the specified interface and restart the MOH application.

Lower Critical - going high
The current threshold setting is: number degreeC
The current temperature is : number degreeC

Lower Critical - going low
The current threshold setting is: number degreeC
The current temperature is : number degreeC

Lower Non-critical - going high
The current threshold setting is: number degreeC
The current temperature is : number degreeC

Lower Non-critical - going low
The current threshold setting is: number degreeC
The current temperature is : number degreeC

Lower Non-recoverable - going high
The current threshold setting is: number degreeC
The current temperature is : number degreeC

Lower Non-recoverable - going low
The current threshold setting is: number degreeC
The current temperature is : number degreeC

NFS Portmap: RPC: Rpcbind failure - RPC: Timed out

Cause: Using the CLI flashupdate command with the NFS option might cause NFS timeouts.

Action: (1) Make sure the NFS path is a shared NFS mount. (2) If the shared NFS server is on a different network, make sure that the gateway is properly configured. (3) Check the distributed management card network configuration.

OS is not up on this CPU node

Cause: From the distributed management card, you tried to reset a node board, which would reboot the node board under the Solaris OS. However, the node board is at the OpenBoot PROM prompt.

Action: Either use the reset -x command to force a hard reboot of the node board or bring up the operating system on the node board and then reboot it.

Permission denied

Cause: You used a distributed management card CLI command for which you do not have the correct user permissions.

Action: For information on CLI command user permissions, either use the CLI help command or refer to CLI Commands.

Recovery of failed active is a must for failover, ignoring 
dmcrecovery flag

Cause: This message occurs if the setfailover mode is on, but the setdmcrecovery mode is off, and a failover event occurs that requires the standby distributed management card to recover the failed, active distributed management card before the standby distributed management card can become active.

Action: If the standby distributed management card cannot recover the failed, active distributed management card, contact SunService.

showfru: failed to get the FRU property

Cause: The CLI showfru command might generate this message. It indicates either that (1) a FRU ID (midplane, node board, or third-party node board) is not programmed, or (2) an IPMI bus problem occurred.

Action: (1) Make sure your hardware has the FRU ID programmed; for example, check to see if you can read a different FRU property. (2) Reset the distributed management card. (3) Power cycle the system. (4) If the error still occurs, contact SunService.

Slot not configured to be managed by PMS Daemon

Cause: Many pmsd CLI commands can generate this message.

Action: Use the pmsd slotaddressset command to set the IP address for the slot.

Slot/powersupply is already in powered off/on state

Cause: You tried to power off or power on a slot or a power supply that is already powered off or powered on.

Action: No action needed.

SMD is re-booting the DMC because of the failure of Service 
service

Cause: A particular service on the distributed management card has failed, and the service monitoring daemon will reset the distributed management card.

Action: Reset the distributed management card. If the error still occurs, contact SunService.

smd:startup:run_ntpdate: Not Valid NTP Server

Cause: The service monitoring daemon on the distributed management card has detected that the NTP server address is either 0.0.0.0 or 255.255.255.255, which are invalid NTP server addresses.

Action: Configure the NTP server using the setntpserver command. Refer to Setting the Date and Time on the Distributed Management Cards for more information.

SUNW_envmond: current temperature (temp) exceeds upper warning 
temperature (temp)

Action: (1) Check the fans to make sure they are working properly; replace if necessary. (2) Check the room environment for proper cooling and adjust if necessary. (3) Check the temperature threshold settings (prtpicl -v -c temperature-sensor) to make sure they are within range of the chassis environment (refer to the Netra CP2300 cPSB Board Programming Guide for more information).

SUNW_envmond: current temperature (temp) exceeds upper critical 
temperature (temp)

SUNW_envmond: current temperature (temp) is below lower warning 
temperature (temp)

SUNW_envmond: current temperature (temp) is below lower critical 
temperature (temp)

SUNW_picl_watchdog: Error in opening SMC drv

Cause: The watchdog timer failed to access the Netra CT system management controller (SMC) driver.

Action: (1) Check whether your watchdog timer application is accessing the watchdog correctly (refer to the Netra CP2300 cPSB Board Programming Guide or to the Netra CT 820 Server Software Developer's Guide for more information). (2) Reboot the node board.

SUNW_picl_watchdog: Error in patting the watchdog

Cause: The watchdog timer failed to access the Netra CT system management controller (SMC) driver.

Action: (1) Check whether your watchdog timer application is accessing the watchdog correctly (refer to the Netra CP2300 cPSB Board Programming Guide and/or to the Netra CT 820 Server Software Developer's Guide for more information). (2) Reboot the node board.

SUNW_picl_watchdog: Error in writing to SMC

Cause: The watchdog timer failed to access the Netra CT system management controller (SMC) driver.

Unable to communicate with CPU board PMS Daemon

Cause: Several pmsd CLI commands can generate this message.

Action: (1) Check network connectivity. (2) Check to see if the ping command works between the distributed management card and the node board. (3) Check the status of the node board with the pmsd slotrndaddressshow command, and modify it if appropriate with the pmsd slotrndaddressadd command.

Unable to communicate with DM board PMS Daemon

Cause: Many pmsd CLI commands can generate this message. The distributed management card CPU might be temporarily overloaded.

Action: (1) Retry the command after waiting 15 seconds or more. (2) Reset the distributed management card.
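For scripted use of pmsd commands, the wait-and-retry step can be automated before falling back to a reset. The following Python sketch assumes the caller supplies a function that issues the command and returns True on success; that function, the names used here, and the default of two attempts are all hypothetical. Only the 15-second delay comes from the Action above.

```python
import time


def retry_after_delay(action, attempts=2, delay_seconds=15, sleep=time.sleep):
    """Call action() until it returns True or the attempts run out.

    Returns True on success and False if every attempt failed. The sleep
    function is a parameter so the delay can be replaced in tests.
    """
    for attempt in range(attempts):
        if action():
            return True
        # Wait only if another attempt will follow.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

If the function returns False, the remaining step in the Action above (resetting the distributed management card) is left to the operator rather than automated.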

Unable to connect to CPU board PMS Daemon

Cause: Several pmsd CLI commands can generate this message.

Unable to connect to DM board COSL

Cause: Many pmsd CLI commands can generate this message. The message usually results from PMS being unable to monitor or control the hardware. PMS cannot get the information it needs from the lower-level common operating system library (COSL) hardware interface.

Action: Reset the distributed management card.

Unable to connect to DM board PMS Daemon

Cause: Many pmsd CLI commands can generate this message. The distributed management card CPU might be temporarily overloaded.

Action: (1) Retry the command after waiting 15 seconds or more. (2) Reset the distributed management card.

Unable to connect to the ctmgx agent

Cause: This message occurs if the ctmgx stop command is issued on a node board, and the MOH agent can't be contacted.

Action: Check to see whether the MOH application is running on the node board using the command pgrep -fl java. If it is running, kill the process with the command kill process_id.

Unable to disconnect from DM board COSL

Action: Reset the distributed management card.

Unable to fetch valid data from DM board COSL

Action: Reset the distributed management card.

Unable to get valid data for this slot

Cause: Many pmsd CLI commands can generate this message. The most probable cause is that PMS is having trouble communicating with the hardware or the node boards.

Action: (1) Check network connectivity. (2) Reset the distributed management card. (3) Reboot the node boards.

Unable to perform operation on empty slot/entry

Cause: Many pmsd CLI commands can generate this message.

Action: If you want to use PMS to control the slot, put a board in the slot.

Unable to perform operation on this slot

Cause: Many pmsd CLI commands can generate this message. For example, if you used the pmsd hardware -o reset command on a slot that was empty or not powered on, this message would display.

Action: Compare the command issued and the state of the slot.

Unable to perform operation on this slot/entry

Cause: Several pmsd CLI commands can generate this message.

Action: (1) Make sure the remote distributed management card and the remote node board are operational. (2) Check network connectivity. (3) Check the status of the node board with the pmsd slotrndaddressshow command, and modify it if appropriate with the pmsd slotrndaddressadd command.

Unable to start CPU board PMS Daemon

Cause: The PMS daemon can't be started on a node board.

Action: (1) Check to see if a PMS daemon is already running on the node board; if there is, stop the daemon and try restarting it. (2) Reboot the node board and try restarting the daemon.

Unable to start DM board PMS Daemon

Cause: This error might occur after the CLI pmsd start command is used. The PMS daemon can't be started on the distributed management card.

Action: (1) Check to see if a PMS daemon is already running; if there is, stop the daemon and try restarting it. (2) Reset the distributed management card.

Unable to stop CPU board PMS Daemon

Cause: The PMS daemon can't be stopped on the node board.

Action: (1) Check to see if a PMS daemon is already running; if there is, stop the daemon with the command kill process number. (2) Reboot the node board.

Unable to stop DM board PMS Daemon

Cause: This error might occur after the CLI pmsd stop command is used. The PMS daemon can't be stopped on the distributed management card.

Action: (1) Check to see if a PMS daemon is already running; if there is, stop the daemon with the Chorus command akill process number. (2) Reset the distributed management card.

Unable to write default data to DM board COSL

Action: Reset the distributed management card.

Unrecognized property name
failed to get the FRU property

Cause: The CLI showfru command might generate this message. You entered an invalid FRU property.

Action: Refer to TABLE 2-2 for valid syntax.

Upper Critical - going low
The current threshold setting is: number degreeC
The current temperature is : number degreeC

Upper Non-critical - going low
The current threshold setting is: number degreeC
The current temperature is : number degreeC

Upper Non-recoverable - going low
The current threshold setting is: number degreeC
The current temperature is : number degreeC

WARNING: Could not check healthy line status!

Cause: This message could occur while the operating system is being halted or while breaking from the operating system to go to the OpenBoot PROM prompt. The node board could be in an unknown state or could have a hardware problem.

Action: Hot-swap the node board. If the problem still exists, the board might need to be returned to SunService.

WARNING: Could not get current execution state!

Action: Hot-swap the node board. If the problem still exists, the board might need to be returned to SunService.

WARNING: Could not set previous execution state!

Action: Hot-swap the node board. If the problem still exists, the board might need to be returned to SunService.

WARNING: Could not set state break!

Action: Hot-swap the node board. If the problem still exists, the board might need to be returned to SunService.

WARNING: Could not set state offline!

Action: Hot-swap the node board. If the problem still exists, the board might need to be returned to SunService.

WARNING: Could not set state online!

Action: Hot-swap the node board. If the problem still exists, the board might need to be returned to SunService.

sysif_xfer_msg: kcs driver xfermsg returns -1
read_evt_buffer: sysif_xfer_msg returns -1
poll_evt_handler: read_evt_buffer returns -1
listner_thread: poll_evt_handler returns -1

Cause: The distributed management card BMC firmware might generate this message during a flash update of the distributed management card or during normal operation. It means that the BMC firmware is unable to respond to communication requests.

Action: If the message occurs during a flash update, the message can be safely ignored. After the flash update is complete, reset the distributed management card and the message will not be repeated. If the message occurs during normal operation, the distributed management card fails over and clears the fault by resetting the BMC.