CHAPTER 2

SMS 1.4.1 Bugs

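This chapter provides information about known SMS 1.4.1 bugs. It covers bugs in the SMS 1.4.1 software, bugs that affect the SMS 1.4.1 software, and SMS 1.4.1 documentation errors.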


Bugs in SMS 1.4.1 Software

This section summarizes the most important bugs and RFEs that affect SMS 1.4.1.

I2C Timeouts Occasionally Reported When Trying to Record an Event in SEEPROM During Hot Swap (BugId 4785961)

Sun Fire high-end systems record events of interest in the SEEPROMs of their IO cards across an I2C bus. Hot-pluggable cards have CBT switches that allow the card to be electrically isolated. During a card-swapping operation, the CBT switches are not 'open', so the SEEPROMs are not accessible.

When hpost runs immediately after a hot-swapping operation, it resets the IO cards, but it does not re-enable the CBT switches until it has finished testing the cards. If the system attempts to record an event in the SEEPROM during this testing period, it cannot connect and reports an I2C timeout error. The system continues to perform normally, but the event is not recorded in the IO card's SEEPROM.

Workaround: Ignore the error message.

hwad Failure Can Cause Domain Panic Stop (BugId 4924523)

On rare occasions, hwad fails to detect that a domain has successfully recovered, so it fails to clear the domain's dstop flag. As a result, dstop runs again. hwad incorrectly assumes that dsmd is already aware of the (prior) dstop, so it does not inform dsmd about it. As a result, the domain remains hung. It eventually fails a secondary status test, and dsmd attempts a recovery through a forced panic.

Workaround: None.

Domain Boot Time Has Increased (BugId 4957596)

There has been an increase of approximately 15% in the time it takes a Sun Fire high-end system to turn on and have its domains display a Solaris prompt.

Workaround: None.

Two-Processor System Boards Display Unknown Status After Domain Reboot (BugId 4970240)

When both processors of a two-processor system board are indicted due to Solaris ECC correctable errors and the domain is rebooted, the "Power State" of the system board changes to UNKNOWN instead of remaining as ON. This will cause showchs to FAIL.

This problem does not occur with four-processor system boards.

Workaround: Power cycle the system board.

Do Not Insert a System Board Into an Expander Board That Is Powered Down (BugId 4970670)

If a system board is inserted into a powered-down expander board, no installation record is written.

Workaround: Remove the system board, power on the expander board, and reinsert the system board.

Domain Does Not Recover If You Power Off an Expander in a Running Domain (BugId 4970726)

If you power off an expander board in a running domain, dsmd will not recover the domain.

Workaround: Do not power off an expander when components in slot 0 or 1 are in use by a running domain.

CHS Error Intermittently Reported During post In Systems Running Parallel setkeyswitch Operations (BugId 4971816)

Systems running parallel setkeyswitch operations may occasionally encounter a CHS error 4 (CHS: not a container) during post. If the resource being queried was faulty, the CHS error 4 will cause that resource to be configured into the domain instead of being excluded.

Workaround:

1. Avoid posting domains in parallel.

2. Power on the boards (or setkeyswitch standby the domain) before running setkeyswitch on.

3. Retry setkeyswitch on if it fails.
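
The retry step in the workaround above can be sketched as a small shell helper. This is only an illustration, not part of SMS: the retry function, the RETRY_DELAY variable, and the choice of three attempts are assumptions made for the example. Running one such command at a time also keeps the operations serial rather than parallel.

```shell
# retry MAX CMD [ARGS...]
# Hypothetical helper: run CMD until it succeeds or MAX attempts are used up.
# RETRY_DELAY (seconds between attempts, default 30) is an assumed setting.
retry() {
    _retry_max=$1
    shift
    _retry_attempt=1
    while :; do
        "$@" && return 0                               # success: stop retrying
        [ "$_retry_attempt" -ge "$_retry_max" ] && return 1
        _retry_attempt=$((_retry_attempt + 1))
        sleep "${RETRY_DELAY:-30}"
    done
}

# Possible use on the SC, one domain at a time (domain A is an example):
#   retry 3 setkeyswitch -d A on
```

The helper only reports success or failure; if every attempt fails, check the platform and domain logs before trying again.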

Cannot Use smsversion to Switch Between SMS 1.4.1 and SMS 1.3 Without Patch (BugId 4974601)

If, after installing SMS 1.4.1 on your system, you try to use smsversion to switch between SMS 1.3 and SMS 1.4.1, SMS 1.4.1 is not offered as a choice in the menu:

# /opt/SUNWSMS/bin/smsversion
smsversion: SMS version 1.3 installed
smsversion: SMS version 1.4.1 installed
Please select from one of the following installed SMS versions:
1) 1.3
3) Exit

If you try to switch by specifying the 1.4.1 release directly, the upgrade fails with this message:

# /opt/SUNWSMS/bin/smsversion 1.4.1
smsversion: Active SMS version < 1.3 >
You have requested SMS Version 1.4.1
 
Is this correct? [y,n] y
smsversion: Upgrading SMS from <1.3> to <1.4.1>.
ERROR: smsversion: SMS1.4.1 is not a consecutive release of SMS
Log file is /var/sadm/system/logs/smsversion.  Exiting.

Workaround: Install patch 115955-03 on SMS 1.3.
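
Before running smsversion, it may help to confirm that the patch is already present. The sketch below assumes that the installed patches are listed by the Solaris showrev -p command in lines that begin with "Patch: <id>-<rev>"; the has_patch function is a hypothetical helper written for this example.

```shell
# has_patch BASE MINREV
# Hypothetical helper: read "showrev -p" style output on standard input and
# succeed if patch BASE is listed at revision MINREV or later.
has_patch() {
    while read -r _tag _id _rest; do
        [ "$_tag" = "Patch:" ] || continue     # only look at patch lines
        [ "${_id%-*}" = "$1" ] || continue     # base ID must match
        _rev=${_id##*-}
        [ "$_rev" -ge "$2" ] && return 0       # revision is new enough
    done
    return 1
}

# Possible use on the SC:
#   showrev -p | has_patch 115955 03 && echo "patch 115955-03 or later found"
```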

Parallel setkeyswitch Operations on Split Expanders Can Encounter SEEPROM/CHS Errors (BugId 4974846)

If multiple domains are configured with split expanders and setkeyswitch is run in parallel on them, a SEEPROM never went ready error could be produced, excluding a good component from the domain. A CHS error 4 could also occur, allowing a component with a faulty CHS result to be configured into the domain.

Workaround:

1. Avoid posting domains in parallel.

2. Power on the boards (or setkeyswitch standby the domain) before running setkeyswitch on.

3. Retry setkeyswitch on if it fails.

Multiple Indictments Used In testemail Can Result In Unsent Emails (BugId 4976195)

The testemail command requires the number of fault classes (the -c parameter list) to be at least as great as the number of indicted components (the -i parameter list). For certain messages, this means that at most one indicted component can be entered, but the user is not informed that the extra components will be ignored.

Workaround: None.
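
Although there is no workaround, the two lists can be compared before testemail is run. The sketch below is only an illustration: it assumes that both lists are written as comma-separated strings, and the function names and sample values are hypothetical.

```shell
# count_items LIST
# Print the number of comma-separated items in LIST (0 for an empty string).
count_items() {
    [ -n "$1" ] || { echo 0; return 0; }
    printf '%s\n' "$1" | tr ',' '\n' | wc -l | tr -d ' '
}

# enough_fault_classes CLASSES INDICTED
# Succeed only if there are at least as many fault classes (the -c list)
# as indicted components (the -i list), as the testemail command requires.
enough_fault_classes() {
    [ "$(count_items "$1")" -ge "$(count_items "$2")" ]
}

# Example with made-up values:
#   enough_fault_classes "class1,class2" "comp1,comp2,comp3" ||
#       echo "extra indicted components would be ignored"
```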

Bad Hardware Can Cause Unnecessary dstop Error Messages (BugId 4983517)

In rare cases, bad hardware can cause a dstop to attempt an xir dump after the dstop dump. Since the domain has already d-stopped, dsmd is unable to obtain a list of active processors, and an error is reported.

Workaround: Ignore the error messages.

dsmd Can Create Unnecessary xir and Hardware Configuration Dumps During Valid DR Operations (BugId 4984234)

dsmd can occasionally create XIR and hardware dumps unnecessarily during DR operations. The DR operation succeeds, but NOTICE messages are displayed.

Workaround: Ignore the NOTICE messages.

System Can Hang During Parallel setkeyswitch Operations in a Split-Expander Configuration (BugId 4984879)

On rare occasions, running parallel setkeyswitch operations on a domain with a split-expander configuration will cause the system to hang during post. The setkeyswitch operations cannot complete, and they cannot be interrupted by Control-C. To prevent this problem:

1. Avoid parallel setkeyswitch operations on multiple domains.

2. Avoid parallel setkeyswitch operations on split-expander domains.

3. Power on the boards in the domain with the SMS poweron command or setkeyswitch standby command before running setkeyswitch on.

Workaround: Stop and restart SMS. See the System Management Services (SMS) 1.4.1 System Administrator Guide.
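
The preventive steps above amount to bringing domains up one at a time, going to standby before on. A minimal sketch follows; the bring_up function and the RUN variable are hypothetical, and the domain letters are examples. By default the function only prints the commands it would run, so the sequence can be reviewed first.

```shell
# bring_up DOMAIN...
# Hypothetical helper: for each domain in turn, issue setkeyswitch standby
# and then setkeyswitch on. Nothing runs in parallel.
# RUN defaults to "echo" (dry run); set RUN to an empty string to execute.
bring_up() {
    for _dom in "$@"; do
        ${RUN-echo} setkeyswitch -d "$_dom" standby || return 1
        ${RUN-echo} setkeyswitch -d "$_dom" on || return 1
    done
}

# Dry run for two example domains:
#   bring_up A B
# Real run (assumes the commands are in PATH on the SC):
#   RUN= bring_up A B
```

Stopping at the first failure keeps a failed standby step from being followed by setkeyswitch on for the same domain.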

setkeyswitch Operation in a Split-Expander Domain Configuration Can Generate Invalid rstop (BugId 4986412)

If you run setkeyswitch off in a domain configured with a split expander card, the other domain can receive an rstop message, even though no error has occurred.

Workaround: Ignore the rstop message.

I2C Timeout Message Unnecessarily Displayed After an MCPU or IO Board Is Inserted Into Domain's IO Slot (BugId 4986413)

After a new board is inserted into a Sun Fire high-end system domain, it takes several seconds for its power to stabilize. The esmd daemon polls for new boards every 30 seconds. If the board is power-stabilizing while the poll is sent out, hwad will detect a timeout error and display an error message. In addition, the amber fault light (wrench light) will be lit for up to a minute.

By the time esmd polls for new boards again in another 30 seconds, the new board will be stabilized and esmd will detect no timeout errors.

Workaround: Ignore the error message.

Error Messages Produced When IO Boards Are Removed (BugId 4986477)

If you remove a board from the IO3 or IO4 slot of a Sun Fire high-end system domain, multiple error messages can be unnecessarily displayed. For example:

sc% showlogs -F -p m
ERR I2cComm.cc 410] I2c read time out - bus: 51, address: 21
ERR SelectPll.cc 292] Reading bus failed in address 0, ecode=1123
...
ERR DetectorS.cc 912] Failed to read state point v1r5, located on HPCI at IO3: ecode=1123
ERR DetectorS.cc 912] Failed to read state point am80a_3v0, located on HPCI at IO3: ecode=1123
...
ERR DetectorS.cc 912] Failed to read state point am80a_5v1, located on HPCI at IO3: ecode=1123
ERR DetectorS.cc 912] Failed to read state point aa30c, located on HPCI at IO3: ecode=1123
WARNING DetectorS.cc 216] A BAD clock status has been detected on input 0 on HPCI at IO3
WARNING DetectorS.cc 246] A BAD clock status has been detected on input 1 on HPCI at IO3
NOTICE Boards.cc 2262] HPCI at IO3 removed

The only messages that should be displayed are "IO3 removed" and "IO4 removed."

This behavior can occur if esmd runs its voltage check after the board has been removed but before the configuration check has completed.

Workaround: Ignore the error messages.

System Board In Use By Another Domain Fails Configuration in New Domain (BugId 4990295)

If you attempt to configure a system board into one domain while it is in use by another domain, the configuration fails unless you power off the board first.

Workaround: Power off the board before attempting to configure it into the domain.

Hardware Failure Can Eventually Hang efhd Daemon (BugId 4991633)

In the event that picld fails and is restarted, efhd will not be able to set the component status of failed FRUs due to a stale handle. You can spot this problem by examining the platform message log:

Feb 1 00:42:00 2004 xc10p13-sc1 frad[14699]: [9912 713967991973909 ERR SeepromInfoPro.cc 483] Bad section header on CDCDIMM at EX12/CDCDIMM0, bad element: tag, expected value: 8, actual value: 0


If you see a message similar to this one, use the ps command to find out whether picld has been restarted:

> ps -ef | grep picld
root 8495 26846 0 11:53:36 pts/25 0:00 grep picld
root 27535    1 0 11:57:20 ?      3:06 /usr/lib/picl/picld  

If the timestamp indicates that picld restarted after efhd was last started, efhd is holding a stale handle and must be restarted.

Workaround: Restart the efhd daemon.
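
The comparison of start times can also be made from elapsed times rather than by reading timestamps. The sketch below assumes that the ps command on the SC accepts -o etime= and prints elapsed time as [[dd-]hh:]mm:ss; the function names are hypothetical.

```shell
# etime_to_seconds ETIME
# Convert a ps elapsed time of the form [[dd-]hh:]mm:ss to seconds.
etime_to_seconds() {
    _t=$(printf '%s' "$1" | tr -d ' ')
    _d=0
    _h=0
    case $_t in
        *-*) _d=${_t%%-*}; _t=${_t#*-} ;;      # split off the day count
    esac
    case $_t in
        *:*:*) _h=${_t%%:*}; _t=${_t#*:} ;;    # split off the hours
    esac
    _m=${_t%%:*}
    _s=${_t#*:}
    # awk treats fields such as 08 as decimal numbers
    echo "$_d $_h $_m $_s" | awk '{ print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

# picld_newer_than_efhd PICLD_ETIME EFHD_ETIME
# Succeed if picld has been running for less time than efhd, which means
# picld was restarted after efhd was started.
picld_newer_than_efhd() {
    [ "$(etime_to_seconds "$1")" -lt "$(etime_to_seconds "$2")" ]
}

# Possible use, with the process IDs taken from ps -ef output:
#   picld_newer_than_efhd "$(ps -o etime= -p <picld_pid>)" \
#                         "$(ps -o etime= -p <efhd_pid>)" &&
#       echo "picld restarted after efhd started; restart efhd"
```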

Unexpected Addition Of New Users Can Cause Upgrade To Fail (BugId 4994106)

If you attempt to add new users to a system during an SMS upgrade before restoring the system configuration, as might happen if you run the upgrade from a JumpStart server, the installation may fail due to password problems introduced by the new users.

Workaround: Do not configure new users until instructed to in the SMS 1.4.1 Installation Guide.

CHS Read/Write Errors Can Occur When System Is Busy (BugId 4999940)

An FRU I/O error 2 can be returned when component health status (CHS) is either read or written if the SC is busy handling other domain recoveries. This problem may cause faulty components to be reconfigured back into a domain (if the CHS is not written when a component is indicted).

Workaround: Run setchs manually on the failed component to set it to a failed state, or place it on the ASR blacklist.

poweron Hangs Intermittently With Global I2C Locking Errors (BugId 5009599)

On occasion, a poweron operation hangs and displays error messages like these:

esmd[17438]: [6175 3316412316413 ERR Boards.cc 713] Error (code = 1215), attempting to lock Global I2C on HPCI at IO2
hwad[17152]: [0 3324411478033 ERR LockManager.cc 970] WARNING!! Resource 113 is not locked, application 17169.11 in EXPLICIT lock mode.
Feb 25 23:03:35 2004 ht92bsc0 poweron[26197]: [6173 3349414612490 ERR EXBPowerControl.cc 147] Failed(1215) to get system lock EXB at EX10
Feb 25 23:03:35 2004 ht92bsc0 poweron[26197]: [6214 3349417208771 ERR poweronApp.cc 1342] Attempt to poweron EXB at EX10 failed

These errors are caused by a locking conflict between the poweron command and the failover mechanism.

Workaround: Turn failover off while running poweron.
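
One way to apply this workaround consistently is to wrap the command so that failover is turned back on afterward, even if the command fails. The sketch below uses the SMS setfailover command with its off and on arguments; the with_failover_off function and the SETFAILOVER override are hypothetical, and the sketch assumes that failover was on to begin with.

```shell
# with_failover_off CMD [ARGS...]
# Hypothetical wrapper: turn failover off, run CMD, then turn failover on
# again and return the exit status of CMD.
# SETFAILOVER can name a substitute command for testing (an assumption).
with_failover_off() {
    ${SETFAILOVER:-setfailover} off || return 1
    _status=0
    "$@" || _status=$?
    ${SETFAILOVER:-setfailover} on || return 1
    return "$_status"
}

# Possible use on the SC, where <location> is the board to power on:
#   with_failover_off poweron <location>
```

If failover was already off or disabled for another reason, re-enabling it this way may not be appropriate; check the state with showfailover first.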

flashupdate Cannot Determine the SC Number on CP2140 Boards (BugId 5012993)

The flashupdate command will occasionally be unable to determine the SC number on a CP2140 board, and will display this error message:

flashupdate -f /opt/SUNWSMS/firmware/oSSCPOST.di SC1/FP1
Unable to determine local SC number. 
Only the local System Control Fproms can be updated. 
Do you wish to continue? (yes/no)? y

Workaround: Answer "y" for yes, to continue with the normal update process.


Bugs That Affect SMS 1.4.1 Software

This section summarizes the most important bugs that can affect the SMS 1.4.1 system. It is not an exhaustive list of every bug that could affect the SMS 1.4.1 system.

After Changing the MAN I1 Network IP Address of an Installed Domain, You Must Reconfigure the MAN Network by Hand (BugId 4484851)

If there are already installed domains and you have changed the MAN I1 network configuration using smsconfig -m, you must configure the MAN network information on the already installed domains by hand.

Workaround: Refer to the information about unconfigured domains in the System Management Services (SMS) 1.4.1 Installation Guide.

Sun Fire 15K/E25K Platform-Specific Begin/Finish Scripts Can Hang on HPCI+-Only Domains (BugId 4797577)

The Solaris 8 update 7 operating environment does not include support for hsPCI+ boards. In domains consisting of only hsPCI+ boards, the installation can hang after the start of the Begin/Finish scripts.

Workaround: Press Ctrl-C to interrupt the Begin/Finish scripts. This will let the rest of the installation continue, resulting in successful installation.

Intermittent I2C Timeouts (1124) for Hpc3130 Cassette Status (BugId 4785961)

Intermittent I2C timeouts are reported by dxs and frad while getting the status for an Hpc3130 hsPCI cassette. The impact is benign and limited to generating error messages in the platform, domain and domain console message logs.

Workaround: None.

Unmapped Response to Non-cacheable Request Corrupts State in AXQ Lock Module (BugId 4761277)

If two domains share an expander and a device driver (or OS extension) on one domain issues a bad address to programmed IO space, both domains could dstop. This only occurs with defective OS extensions which run in privileged mode such as device drivers.

Workaround: Do not share an expander between a production domain and a domain containing untested or problematic privileged mode software such as device drivers.

Sun Fire 15K/E25K Servers Can Fail to Detect Domain Stop Interrupts (BugId 4924523)

If a domain stop (dstop) interrupt is detected by hwad but not by dsmd, dsmd will report a heartbeat failure. Only hardware configuration information is dumped, and neither CPU register nor domain data (dsmd.dump) is saved. The hardware configuration files report the dstop condition.

Workaround: You can re-post the domain at an increased post level to reveal the source of the hardware problem.

At Startup, SunMC Can Display Incorrect System State When Failover Did Not Work (BugId 5010351)

When a Sun Fire system's failover process is in a FAILED state during startup, the PCR System View in the SunMC GUI can incorrectly display the system status as "activating."

Workaround: Use the showfailover CLI command to verify the system's status.


SMS 1.4.1 Documentation Errors

This section summarizes errors in the SMS 1.4.1 manpages and documentation.

poweron Manpage Needs To Be Updated (BugId 5007971)

As part of the fix for RFE 4974025, the behavior of the poweron command has changed. Previously, if SMS determined that there was not enough power for a board, the command would simply fail. Now the command displays a prompt asking the user whether to continue or not.

The -y -q options will automatically answer "no" to this prompt, effectively replicating the previous behavior. The -y option will not automatically answer this question.

Workaround: None.