Netra 240 Server System Administration Guide
This chapter describes the diagnostic tools available for the Netra 240 server.
Overview of Diagnostic Tools
Sun provides a range of diagnostic tools for use with the Netra 240 server, as summarized in the following table.
TABLE 1-1 Summary of Troubleshooting Tools
Diagnostic Tool | Type | Description | Accessibility and Availability | Remote Capability
ALOM | Hardware and software | Monitors environmental conditions, performs basic fault isolation, and provides remote console access. | Can function on standby power and without operating system. | Designed for remote access.
LEDs | Hardware | Indicate status of overall system and particular components. | Accessed from system chassis. Available anytime power is available. | Local, but can be viewed by means of ALOM.
Power-on self-test (POST) | Firmware | Tests core components of system. | Runs automatically on startup. Available when the operating system is not running. | Local, but can be viewed by means of ALOM.
OpenBoot commands | Firmware | Display various kinds of system information. | Available when the operating system is not running. | Local, but can be accessed by means of ALOM.
OpenBoot diagnostics | Firmware | Tests system components, focusing on peripherals and I/O devices. | Runs automatically or interactively. Available when the operating system is not running. | Local, but can be viewed by means of ALOM.
Solaris software commands | Software | Display various kinds of system information. | Requires operating system. | Local, but can be accessed by means of ALOM.
SunVTS software | Software | Exercises and stresses the system, running tests in parallel. | Requires operating system. Optional package. | Viewable and controllable over network.
System Prompts
The following default server prompts are used by the Netra 240 server:
- ok--OpenBoot PROM prompt
- sc>--Advanced Lights Out Manager (ALOM) prompt
- #--Solaris software superuser (Bourne and Korn shell) prompt
FIGURE 1-1 shows the relationship between the three prompts and how to change from one to the other.
FIGURE 1-1 System Prompt Flow
The following commands are in the flow diagram in FIGURE 1-1:
- ALOM commands: console, reset, break
- Escape sequence: #.
- Solaris software commands: shutdown, halt, init 0
- OpenBoot commands: go, boot
Advanced Lights Out Manager
Sun Advanced Lights Out Manager (ALOM) for the Netra 240 server provides a series of LED status indicators. This section details the meaning of their status and how to turn them on and off. For more information on ALOM, see Chapter 3.
FIGURE 1-2 Location of Front Panel Indicators
Server Status Indicators
The server has three LED status indicators. They are located on the front bezel (FIGURE 1-2) and are repeated on the rear panel. A summary of the indicators is provided in TABLE 1-2.
TABLE 1-2 Server Status Indicators (Front and Rear)
Indicator | LED Color | LED State | Meaning
Activity | Green | On | The server is powered on and is running the Solaris OS.
Activity | Green | Off | Either power is not present or the Solaris OS is not running.
Service Required | Yellow | On | The server has detected a problem and requires the attention of service personnel.
Service Required | Yellow | Off | The server has no detected faults.
Locator | White | On | A continuous light turns on and identifies the server from others in a rack, when the setlocator command is used.
You can turn the Locator LED on and off either from the system console or the ALOM command-line interface (CLI).
To Display Locator LED Status
Do one of the following:
- At the ALOM command-line interface, type:
To Turn the Locator LED On
Do one of the following:
- At the ALOM command-line interface, type:
To Turn the Locator LED Off
Do one of the following:
- At the ALOM command-line interface, type:
Alarm Status Indicators
The dry contact alarm card has four LED status indicators that are supported by ALOM. They are located vertically on the front bezel (FIGURE 1-2). Information about the alarm indicators and dry contact alarm states is provided in TABLE 1-3. For more information about alarm indicators, see the Sun Advanced Lights Out Manager Software User's Guide for the Netra 240 Server (part number 817-3174). For more information about an API to control the alarm indicators, see Appendix A.
TABLE 1-3 Alarm Indicators and Dry Contact Alarm States
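Indicator and Relay Labels | Indicator Color | Application or Server State | Condition or Action | System Indicator State | Alarm Indicator State | Relay NC State | Relay NO State | Comments
Critical (Alarm0) | Red | Server state (Power on/off and Solaris OS functional/not functional) | No power input. | Off | Off | Closed | Open | Default state.
Critical (Alarm0) | Red | Server state (Power on/off and Solaris OS functional/not functional) | System power off. | Off | Off | Closed | Open | Input power connected.
Critical (Alarm0) | Red | Server state (Power on/off and Solaris OS functional/not functional) | System power turns on; Solaris OS not fully loaded. | Off | Off [iii] | Closed | Open | Transient state.
Critical (Alarm0) | Red | Server state (Power on/off and Solaris OS functional/not functional) | Solaris OS successfully loaded. | On | Off | Open | Closed | Normal operating state.
Critical (Alarm0) | Red | Server state (Power on/off and Solaris OS functional/not functional) | Watchdog timeout. | Off | On | Closed | Open | Transient state; reboot Solaris OS.
Critical (Alarm0) | Red | Server state (Power on/off and Solaris OS functional/not functional) | Solaris OS shutdown initiated by user. | Off | Off [iii] | Closed | Open | Transient state.
Critical (Alarm0) | Red | Server state (Power on/off and Solaris OS functional/not functional) | Lost input power. | Off | Off | Closed | Open | Default state.
Critical (Alarm0) | Red | Server state (Power on/off and Solaris OS functional/not functional) | System power shutdown initiated by user. | Off | Off [iii] | Closed | Open | Transient state.
Critical (Alarm0) | Red | Application state | User sets Critical alarm on. | -- | On | Closed | Open | Critical fault detected.
Critical (Alarm0) | Red | Application state | User sets Critical alarm off [ii]. | -- | Off | Open | Closed | Critical fault cleared.
Major (Alarm1) | Red | Application state | User sets Major alarm on [ii]. | -- | On | Open | Closed | Major fault detected.
Major (Alarm1) | Red | Application state | User sets Major alarm off [ii]. | -- | Off | Closed | Open | Major fault cleared.
Minor (Alarm2) | Amber | Application state | User sets Minor alarm on [ii]. | -- | On | Open | Closed | Minor fault detected.
Minor (Alarm2) | Amber | Application state | User sets Minor alarm off [ii]. | -- | Off | Closed | Open | Minor fault cleared.
User (Alarm3) | Amber | Application state | User sets User alarm on [ii]. | -- | On | Open | Closed | User fault detected.
User (Alarm3) | Amber | Application state | User sets User alarm off [ii]. | -- | Off | Closed | Open | User fault cleared.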
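The application-state rows of TABLE 1-3 can be read as a lookup from an alarm and its setting to the two relay states. The following Python sketch encodes only those rows. The names RELAY_STATES and relay_state are hypothetical and are not part of ALOM or of the alarm API described in Appendix A.

```python
# Relay states for user-set (application state) alarms, transcribed from
# TABLE 1-3. Each value is (Relay NC state, Relay NO state).
# Hypothetical illustration only; this is not an ALOM interface.
RELAY_STATES = {
    ("critical", True): ("Closed", "Open"),
    ("critical", False): ("Open", "Closed"),
    ("major", True): ("Open", "Closed"),
    ("major", False): ("Closed", "Open"),
    ("minor", True): ("Open", "Closed"),
    ("minor", False): ("Closed", "Open"),
    ("user", True): ("Open", "Closed"),
    ("user", False): ("Closed", "Open"),
}


def relay_state(alarm, is_set):
    """Return the NC and NO relay states for an alarm that the user set or cleared."""
    nc, no = RELAY_STATES[(alarm.lower(), bool(is_set))]
    return {"nc": nc, "no": no}


if __name__ == "__main__":
    for name in ("critical", "major", "minor", "user"):
        print(name, "on:", relay_state(name, True), "off:", relay_state(name, False))
```

In the table, the Critical relay states are the reverse of the Major, Minor, and User relay states for the same setting, so wiring that works for one alarm cannot be assumed to work for another.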
In all cases when the user sets an alarm, a message is displayed on the console. For example, when the critical alarm is set, the following message is displayed on the console:
SC Alert: CRITICAL ALARM is set
Note that in some instances when the critical alarm is set, the associated alarm indicator is not lit. This implementation is subject to change in future releases (see Footnote iii of TABLE 1-3).
Power-On Self-Test Diagnostics
Power-on self-test (POST) is a firmware program that helps determine whether a portion of the system has failed. POST verifies the core functionality of the system, including the CPU module(s), motherboard, memory, and some on-board I/O devices. The software then generates messages that can be useful in determining the nature of a hardware failure. You can run POST even if the system is unable to boot.
POST detects most system faults and is located in the motherboard OpenBoot PROM. You can program the OpenBoot software to run POST at power-on by setting two OpenBoot configuration variables: diag-switch? and diag-level. These two variables are stored on the system configuration card.
POST runs automatically when the system power is applied, or following an automatic system reset, if all of the following conditions apply:
- diag-switch? is set to true (default is false).
- diag-level is set to min, max or menus (default is min).
- post-trigger matches the class of reset (default is power-on-reset).
If diag-level is set to min or max, POST performs an abbreviated or extended test, respectively.
If diag-level is set to menus, a menu of all the tests executed at power up is displayed.
POST diagnostic and error message reports are displayed on a console.
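The three conditions above can be sketched as a small predicate. The following Python function is a minimal model of the documented rules only. The name post_will_run and its parameters are hypothetical, and the firmware's actual decision logic is not described in this chapter.

```python
# Hypothetical helper: models the documented conditions, not the firmware itself.
POST_LEVELS = ("min", "max", "menus")


def post_will_run(diag_switch, diag_level, post_trigger, reset_class):
    """Return True when all three documented conditions hold.

    diag_switch  -- value of diag-switch? as a bool
    diag_level   -- value of diag-level
    post_trigger -- value of post-trigger (keywords separated by spaces)
    reset_class  -- class of the reset that just occurred
    """
    if not diag_switch:
        return False
    if diag_level not in POST_LEVELS:
        return False
    triggers = post_trigger.split()
    if "none" in triggers:
        return False
    # all-resets matches any class of reset; otherwise the class must be listed.
    return "all-resets" in triggers or reset_class in triggers


if __name__ == "__main__":
    print(post_will_run(True, "min", "power-on-reset error-reset", "power-on-reset"))
    print(post_will_run(True, "min", "power-on-reset", "user-reset"))
```

With post-trigger limited to power-on-reset, a user-initiated reset does not satisfy the third condition, which is why the second call returns False.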
Controlling POST Diagnostics
You control POST diagnostics (and other aspects of the boot process) by setting OpenBoot configuration variables. Changes to OpenBoot configuration variables take effect only after the system is restarted. TABLE 1-4 lists the most important and useful of these variables. You can find instructions for changing OpenBoot configuration variables in To View and Set OpenBoot Configuration Variables.
TABLE 1-4 OpenBoot Configuration Variables
OpenBoot Configuration Variable | Description and Keywords
auto-boot? | Determines whether the operating system automatically starts up. Default is true.
- true--Operating system automatically starts once firmware tests have finished running.
- false--System remains at ok prompt until you type boot.
diag-level | Determines the level or type of diagnostics executed. Default is min.
- off--No testing.
- min--Only basic tests are run.
- max--More extensive tests may be run, depending on the device.
- menus--Menu-driven tests at POST levels can be individually run.
diag-script | Determines which devices are tested by OpenBoot diagnostics. Default is none.
- none--No devices are tested.
- normal--On-board (centerplane-based) devices that have self-tests are tested.
- all--All devices that have self-tests are tested.
diag-switch? | Toggles the system in and out of diagnostic mode. Default is false.
- true--Diagnostic mode: POST diagnostics and OpenBoot diagnostics tests are run.
- false--Default mode: Do not run POST or OpenBoot diagnostics tests.
post-trigger, obdiag-trigger | These two variables specify the class of reset event that causes power-on self-tests (or OpenBoot diagnostics tests) to run. These variables can accept single keywords as well as combinations of the first three keywords separated by spaces. For details, see To View and Set OpenBoot Configuration Variables.
- error-reset--A reset caused by certain nonrecoverable hardware error conditions. In general, an error reset occurs when a hardware problem corrupts system state data. Examples include CPU and system watchdog resets, fatal errors, and certain CPU reset events (default).
- power-on-reset--A reset caused by pressing the On/Standby button (default).
- user-reset--A reset initiated by the user or the operating system.
- all-resets--Any kind of system reset.
- none--No power-on self-tests (or OpenBoot diagnostics tests) are run.
input-device | Selects where console input is taken from. Default is ttya.
- ttya--From built-in SERIAL MGT port.
- ttyb--From built-in general purpose serial port (10101).
- keyboard--From attached keyboard that is part of a graphics terminal.
output-device | Selects where diagnostic and other console output is displayed. Default is ttya.
- ttya--To built-in SERIAL MGT port.
- ttyb--To built-in general purpose serial port (10101).
- screen--To attached screen that is part of a graphics terminal.
Note - These variables affect OpenBoot diagnostics tests as well as POST diagnostics.
Once POST diagnostics have finished running, POST reports back the status of each test that was run to the OpenBoot firmware. Control then reverts back to the OpenBoot firmware code.
If POST diagnostics do not uncover a fault, and your server still does not start up, run OpenBoot diagnostics tests.
To Start POST Diagnostics
1. Go to the ok prompt.
2. Type:
ok setenv diag-switch? true
3. Type:
ok setenv diag-level value
Where value is min, max, or menus, depending on the quantity of diagnostic information you want to see.
4. Type:
The system runs POST diagnostics if post-trigger is set to user-reset. Status and error messages are displayed in the console window. If POST detects an error, it displays an error message describing the failure.
5. When you have finished running POST, restore the value of diag-switch? to false by typing:
ok setenv diag-switch? false
Resetting diag-switch? to false minimizes boot time.
OpenBoot Commands
OpenBoot commands are commands you type from the ok prompt. OpenBoot commands that can provide useful diagnostic information are as follows:
- probe-scsi and probe-scsi-all
- probe-ide
- show-devs
probe-scsi and probe-scsi-all Commands
The probe-scsi and probe-scsi-all commands diagnose problems with the SCSI devices.
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, issuing the probe-scsi or probe-scsi-all command can hang the system.
The probe-scsi command communicates with all SCSI devices connected to on-board SCSI controllers. The probe-scsi-all command also accesses devices connected to any host adapters installed in PCI slots.
For any SCSI device that is connected and active, the probe-scsi and probe-scsi-all commands display its loop ID, host adapter, logical unit number, unique world-wide name (WWN), and a device description that includes type and manufacturer.
The following sample output is from the probe-scsi command.
CODE EXAMPLE 1-1 probe-scsi Command Output
{1} ok probe-scsi
Target 0
Unit 0 Disk SEAGATE ST373307LSUN72G 0207
Target 1
Unit 0 Disk SEAGATE ST336607LSUN36G 0207
{1} ok
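When probe-scsi output is captured from the console (for example, through an ALOM console session log), the Target and Unit lines can be turned into records for inventory checks. The following Python sketch assumes the two-line layout shown in CODE EXAMPLE 1-1. The function name parse_probe_scsi is hypothetical, and output from other firmware revisions or device types might use a different layout.

```python
import re

# Sample text copied from CODE EXAMPLE 1-1 (prompt lines omitted).
SAMPLE = """\
Target 0
Unit 0 Disk SEAGATE ST373307LSUN72G 0207
Target 1
Unit 0 Disk SEAGATE ST336607LSUN36G 0207
"""


def parse_probe_scsi(text):
    """Return one dict per Unit line, tagged with the enclosing Target."""
    devices = []
    target = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"Target (\w+)$", line)
        if match:
            target = match.group(1)
            continue
        match = re.match(r"Unit (\w+)\s+(\S+)\s+(.*)$", line)
        if match and target is not None:
            devices.append({
                "target": target,          # kept as text; IDs are not assumed to be decimal
                "unit": match.group(1),
                "type": match.group(2),
                "description": match.group(3),
            })
    return devices


if __name__ == "__main__":
    for device in parse_probe_scsi(SAMPLE):
        print(device)
```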
The following sample output is from the probe-scsi-all command.
CODE EXAMPLE 1-2 probe-scsi-all Command Output
{1} ok probe-scsi-all
/pci@1c,600000/scsi@2,1
/pci@1c,600000/scsi@2
Target 0
Unit 0 Disk SEAGATE ST373307LSUN72G 0207
Target 1
Unit 0 Disk SEAGATE ST336607LSUN36G 0207
{1} ok
probe-ide Command
The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD drive.
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, issuing the probe-ide command can hang the system.
The following sample output is from the probe-ide command.
CODE EXAMPLE 1-3 probe-ide Command Output
{1} ok probe-ide
Device 0 ( Primary Master )
Not Present
Device 1 ( Primary Slave )
Not Present
Device 2 ( Secondary Master )
Not Present
Device 3 ( Secondary Slave )
Not Present
{1} ok
show-devs Command
The show-devs command lists the hardware device paths for each device in the firmware device tree. CODE EXAMPLE 1-4 shows some sample output.
CODE EXAMPLE 1-4 show-devs Command Output
/pci@1d,700000
/pci@1c,600000
/pci@1e,600000
/pci@1f,700000
/memory-controller@1,0
/SUNW,UltraSPARC-IIIi@1,0
/memory-controller@0,0
/SUNW,UltraSPARC-IIIi@0,0
/virtual-memory
/memory@m0,0
/aliases
/options
/openprom
/chosen
/packages
/pci@1d,700000/network@2,1
/pci@1d,700000/network@2
/pci@1c,600000/scsi@2,1
/pci@1c,600000/scsi@2
/pci@1c,600000/scsi@2,1/tape
/pci@1c,600000/scsi@2,1/disk
/pci@1c,600000/scsi@2/tape
/pci@1c,600000/scsi@2/disk
/pci@1e,600000/ide@d
/pci@1e,600000/usb@a
/pci@1e,600000/pmu@6
/pci@1e,600000/isa@7
/pci@1e,600000/ide@d/cdrom
/pci@1e,600000/ide@d/disk
.........
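Because show-devs prints full device paths, the parent and child relationships in the device tree can be recovered from the paths alone. The following Python sketch lists the direct children of a node, using a subset of the paths from CODE EXAMPLE 1-4. The function name children is hypothetical.

```python
# A subset of the paths shown in CODE EXAMPLE 1-4.
PATHS = [
    "/pci@1c,600000",
    "/pci@1e,600000",
    "/pci@1c,600000/scsi@2,1",
    "/pci@1c,600000/scsi@2",
    "/pci@1c,600000/scsi@2,1/tape",
    "/pci@1c,600000/scsi@2,1/disk",
    "/pci@1c,600000/scsi@2/tape",
    "/pci@1c,600000/scsi@2/disk",
    "/pci@1e,600000/ide@d",
    "/pci@1e,600000/ide@d/cdrom",
]


def children(paths, parent):
    """Return the sorted paths that sit exactly one level below parent."""
    prefix = parent.rstrip("/") + "/"
    return sorted(
        path for path in paths
        if path.startswith(prefix) and "/" not in path[len(prefix):]
    )


if __name__ == "__main__":
    print(children(PATHS, "/"))
    print(children(PATHS, "/pci@1c,600000"))
    print(children(PATHS, "/pci@1c,600000/scsi@2"))
```

The trailing slash in the prefix keeps scsi@2 from matching scsi@2,1, which would otherwise happen with a plain string-prefix test.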
To Run OpenBoot Commands
1. Halt the system to reach the ok prompt.
Inform users before you shut down the system.
2. Type the appropriate command at the console prompt.
OpenBoot Diagnostics
Like POST diagnostics, OpenBoot diagnostics code is firmware-based and resides in the Boot PROM.
To Start OpenBoot Diagnostics
1. Type:
ok setenv diag-switch? true
ok setenv auto-boot? false
ok reset-all
2. Type:
ok obdiag
This command displays the OpenBoot diagnostics menu.
_____________________________________________________________________________
| o b d i a g |
|_________________________ __________________________________________________|
| | | |
| 1 i2c@0,320 | 2 ide@d | 3 network@2 |
| 4 network@2,1 | 5 rtc@0,70 | 6 scsi@2 |
| 7 scsi@2,1 | 8 serial@0,2e8 | 9 serial@0,3f8 |
| 10 usb@a | 11 usb@b | 12 flashprom@2,0 |
|_________________________|_________________________|________________________|
| Commands: test test-all except help what printenvs setenv versions exit |
|____________________________________________________________________________|
Note - If you have a PCI card installed inside the server, additional tests appear on the obdiag menu.
3. Type:
Where n represents the number corresponding to the test you want to run.
A summary of the tests is available. At the obdiag> prompt, type:
Controlling OpenBoot Diagnostics Tests
Most of the OpenBoot configuration variables you use to control POST (see TABLE 1-4) also affect OpenBoot diagnostics tests.
- Use the diag-level variable to control the OpenBoot diagnostics testing level.
- Use test-args to customize how the tests run.
By default, test-args is set to contain an empty string. You can modify test-args using one or more of the reserved keywords shown in TABLE 1-5.
TABLE 1-5 Keywords for the test-args OpenBoot Configuration Variable
Keyword | Description
bist | Invokes built-in self-test (BIST) on external and peripheral devices.
debug | Displays all debug messages.
iopath | Verifies bus and interconnect integrity.
loopback | Exercises external loopback path for the device.
media | Verifies external and peripheral device media accessibility.
restore | Attempts to restore original state of the device if the previous execution of the test failed.
silent | Displays only errors rather than the status of each test.
subtests | Displays main test and each subtest that is called.
verbose | Displays detailed status messages for all tests.
callers=n | Displays backtrace of n callers when an error occurs. callers=0--Displays backtrace of all callers before the error.
errors=n | Continues executing the test until n errors are encountered. errors=0--Displays all error reports without terminating testing.
If you want to customize the OpenBoot diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:
ok setenv test-args debug,loopback,media
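If you generate setenv commands from a script, it can help to check the keywords against TABLE 1-5 before sending them to the console. The following Python sketch builds the comma-separated value and rejects anything that is not in the table. The function names build_test_args and setenv_command are hypothetical, and the check reflects only the keywords listed in this chapter.

```python
# Keywords from TABLE 1-5. The flag keywords take no value; callers and
# errors take a non-negative integer after an equals sign.
FLAG_KEYWORDS = {
    "bist", "debug", "iopath", "loopback", "media",
    "restore", "silent", "subtests", "verbose",
}
COUNT_KEYWORDS = {"callers", "errors"}


def build_test_args(*keywords):
    """Validate the keywords and join them into a test-args value."""
    for keyword in keywords:
        name, sep, value = keyword.partition("=")
        if sep:
            if name not in COUNT_KEYWORDS or not value.isdigit():
                raise ValueError("invalid test-args keyword: " + keyword)
        elif name not in FLAG_KEYWORDS:
            raise ValueError("invalid test-args keyword: " + keyword)
    return ",".join(keywords)


def setenv_command(*keywords):
    """Return the command text to type at the ok prompt."""
    return "setenv test-args " + build_test_args(*keywords)


if __name__ == "__main__":
    print(setenv_command("debug", "loopback", "media"))
    print(setenv_command("verbose", "errors=0"))
```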
test and test-all Commands
You can also run OpenBoot diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:
ok test /pci@x,y/SUNW,qlc@2
To customize an individual test, you can use test-args, as follows:
ok test /usb@1,3:test-args={verbose,debug}
This syntax affects only the current test without changing the value of the test-args OpenBoot configuration variable.
You can test all the devices in the device tree with the test-all command:
If you specify a path argument to test-all, only the specified device and its children are tested. The following example shows the command to test the USB bus and all devices with self-tests that are connected to the USB bus:
ok test-all /pci@9,700000/usb@1,3
OpenBoot Diagnostics Error Messages
OpenBoot diagnostics error results are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. CODE EXAMPLE 1-5 displays a sample OpenBoot diagnostics error message.
CODE EXAMPLE 1-5 OpenBoot Diagnostics Error Message
Testing /pci@1e,600000/isa@7/flashprom@2,0
ERROR : FLASHPROM CRC-32 is incorrect
SUMMARY : Obs=0x729f6392 Exp=0x3d6cdf53 XOR=0x4ff3bcc1 Addr=0xfeebbffc
DEVICE : /pci@1e,600000/isa@7/flashprom@2,0
SUBTEST : selftest:crc-subtest
MACHINE : Netra 240
SERIAL# : 52965531
DATE : 03/05/2003 01:33:59 GMT
CONTROLS: diag-level=max test-args=
Error: /pci@1e,600000/isa@7/flashprom@2,0 selftest failed, return code = 1
Selftest at /pci@1e,600000/isa@7/flashprom@2,0 (errors=1) .............
failed
Pass:1 (of 1) Errors:1 (of 1) Tests Failed:1 Elapsed Time: 0:0:0:27
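The labeled fields in an error report are regular enough to parse from a captured console log. The following Python sketch splits the "LABEL : value" lines into a dictionary and checks that the XOR value in the SUMMARY line equals the observed value XORed with the expected value, which holds for the sample in CODE EXAMPLE 1-5. The function names are hypothetical, and the field layout is assumed from that one sample.

```python
import re

# Lines copied from CODE EXAMPLE 1-5.
REPORT = """\
ERROR : FLASHPROM CRC-32 is incorrect
SUMMARY : Obs=0x729f6392 Exp=0x3d6cdf53 XOR=0x4ff3bcc1 Addr=0xfeebbffc
DEVICE : /pci@1e,600000/isa@7/flashprom@2,0
SUBTEST : selftest:crc-subtest
MACHINE : Netra 240
"""


def parse_report(text):
    """Map each label (ERROR, SUMMARY, and so on) to its value."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(" : ")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


def summary_values(summary):
    """Convert the name=0x... pairs of a SUMMARY line to integers."""
    pairs = re.findall(r"(\w+)=(0x[0-9a-fA-F]+)", summary)
    return {name: int(value, 16) for name, value in pairs}


if __name__ == "__main__":
    fields = parse_report(REPORT)
    values = summary_values(fields["SUMMARY"])
    print(fields["DEVICE"], fields["SUBTEST"])
    print("XOR consistent:", values["Obs"] ^ values["Exp"] == values["XOR"])
```

The set bits in the XOR value mark the bit positions where the observed CRC differs from the expected CRC.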
Operating System Diagnostic Tools
When the system passes OpenBoot diagnostics tests, it attempts to boot the Solaris OS. Once the server is running in multiuser mode, you have access to the software-based diagnostic tools and the SunVTS software. These tools enable you to monitor the server, exercise it, and isolate faults.
Note - If you set the auto-boot? OpenBoot configuration variable to false, the operating system does not boot following completion of the firmware-based tests.
In addition to the tools just mentioned, you can refer to error and system message log files and to Solaris software information commands.
Error and System Message Log Files
Error and other system messages are saved in the /var/adm/messages file. Messages are logged to this file from many sources, including the operating system, the environmental control subsystem, and various software applications.
Solaris Software System Information Commands
The following Solaris software system information commands display data that you can use when assessing the condition of a Netra 240 server:
- prtconf
- prtdiag
- prtfru
- psrinfo
- showrev
This section describes the information that these commands give you. For more information about using these commands, refer to the appropriate man page.
prtconf Command
The prtconf command displays the Solaris software device tree. This tree includes all the devices probed by OpenBoot firmware, as well as additional devices, such as individual disks that only the operating system software recognizes. The output of prtconf also includes the total size of system memory. CODE EXAMPLE 1-6 shows an excerpt of prtconf output.
CODE EXAMPLE 1-6 prtconf Command Output
# prtconf
System Configuration: Sun Microsystems sun4u
Memory size: 5120 Megabytes
System Peripherals (Software Nodes):
SUNW,Netra-240
packages (driver not attached)
SUNW,builtin-drivers (driver not attached)
deblocker (driver not attached)
disk-label (driver not attached)
terminal-emulator (driver not attached)
dropins (driver not attached)
kbd-translator (driver not attached)
obp-tftp (driver not attached)
SUNW,i2c-ram-device (driver not attached)
SUNW,fru-device (driver not attached)
ufs-file-system (driver not attached)
chosen (driver not attached)
openprom (driver not attached)
client-services (driver not attached)
options, instance #0
aliases (driver not attached)
memory (driver not attached)
virtual-memory (driver not attached)
SUNW,UltraSPARC-IIIi (driver not attached)
memory-controller, instance #0
SUNW,UltraSPARC-IIIi (driver not attached)
memory-controller, instance #1
pci, instance #0
........
The -p option of the prtconf command produces output similar to that of the OpenBoot show-devs command. This output lists only those devices compiled by the system firmware.
prtdiag Command
The prtdiag command displays a table of diagnostic information that summarizes the status of system components. The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on your system. The following code example is an excerpt of some of the output produced by prtdiag on a functional Netra 240 server running Solaris software.
CODE EXAMPLE 1-7 prtdiag Command Output
# prtdiag
System Configuration: Sun Microsystems sun4u Netra 240
System clock frequency: 160 MHZ
Memory size: 2GB
==================================== CPUs ====================================
E$ CPU CPU Temperature Fan
CPU Freq Size Impl. Mask Die Ambient Speed Unit
--- -------- ---------- ------ ---- -------- -------- ----- ----
MB/P0 1280 MHz 1MB US-IIIi 2.3 - -
MB/P1 1280 MHz 1MB US-IIIi 2.3 - -
================================= IO Devices =================================
Bus Freq
Brd Type MHz Slot Name Model
--- ---- ---- ---------- ---------------------------- --------------------
0 pci 66 2 network-pci14e4,1648.108e.16+
0 pci 66 2 network-pci14e4,1648.108e.16+
0 pci 66 2 scsi-pci1000,21.1000.1000.1 +
0 pci 66 2 scsi-pci1000,21.1000.1000.1 +
0 pci 66 2 network-pci14e4,1648.108e.16+
0 pci 66 2 network-pci14e4,1648.108e.16+
0 pci 33 7 isa/serial-su16550 (serial)
0 pci 33 7 isa/serial-su16550 (serial)
0 pci 33 7 isa/rmc-comm-rmc_comm (seria+
0 pci 33 13 ide-pci10b9,5229.c4 (ide)
============================ Memory Configuration ============================
Segment Table:
-----------------------------------------------------------------------
Base Address Size Interleave Factor Contains
-----------------------------------------------------------------------
0x0 1GB 1 GroupID 0
0x1000000000 1GB 1 GroupID 0
Memory Module Groups:
--------------------------------------------------
ControllerID GroupID Labels
--------------------------------------------------
0 0 MB/P0/B0/D0,MB/P0/B0/D1
Memory Module Groups:
--------------------------------------------------
ControllerID GroupID Labels
--------------------------------------------------
1 0 MB/P1/B0/D0,MB/P1/B0/D1
In addition to the information in CODE EXAMPLE 1-7, prtdiag with the verbose option (-v) also reports on front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures (see CODE EXAMPLE 1-8).
CODE EXAMPLE 1-8 prtdiag Verbose Output
---------------------------------------------------------------
Location Sensor Temperature Lo LoWarn HiWarn Hi Status
---------------------------------------------------------------
MB T_ENC 22C -7C -5C 55C 58C okay
MB/P0 T_CORE 57C - - 110C 115C okay
MB/P1 T_CORE 54C - - 110C 115C okay
PS0 FF_OT - - - - - okay
PS1 FF_OT - - - - - okay
In the event of an overtemperature condition, prtdiag reports an error in the Status column (CODE EXAMPLE 1-9).
CODE EXAMPLE 1-9 prtdiag Overtemperature Indication Output
---------------------------------------------------------------
Location Sensor Temperature Lo LoWarn HiWarn Hi Status
---------------------------------------------------------------
MB T_ENC 22C -7C -5C 55C 58C okay
MB/P0 T_CORE 118C - - 110C 115C failed
MB/P1 T_CORE 112C - - 110C 115C warning
PS0 FF_OT - - - - - okay
PS1 FF_OT - - - - - okay
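The Status column in these examples is consistent with a simple comparison of each reading against its HiWarn and Hi thresholds. The following Python sketch applies that comparison to the rows of CODE EXAMPLE 1-9. The rule used here (above Hi means failed, above HiWarn means warning) is an assumption inferred from the sample output, not a documented algorithm. The sketch ignores the Lo and LoWarn columns, and the function names are hypothetical.

```python
# Rows copied from CODE EXAMPLE 1-9 (header and rule lines omitted).
ROWS = """\
MB T_ENC 22C -7C -5C 55C 58C okay
MB/P0 T_CORE 118C - - 110C 115C failed
MB/P1 T_CORE 112C - - 110C 115C warning
PS0 FF_OT - - - - - okay
"""


def to_celsius(field):
    """Convert a field such as 55C to an int; a lone dash means no value."""
    return None if field == "-" else int(field.rstrip("C"))


def classify(temp, hi_warn, hi):
    """Assumed rule: compare the reading with the two high thresholds."""
    if temp is None:
        return None
    if hi is not None and temp > hi:
        return "failed"
    if hi_warn is not None and temp > hi_warn:
        return "warning"
    return "okay"


def check_rows(text):
    """Return (location, sensor, reported status, derived status) per row."""
    results = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 8:
            continue
        location, sensor, temp, _lo, _lo_warn, hi_warn, hi, status = parts
        derived = classify(to_celsius(temp), to_celsius(hi_warn), to_celsius(hi))
        results.append((location, sensor, status, derived))
    return results


if __name__ == "__main__":
    for row in check_rows(ROWS):
        print(row)
```

Sensors that report no temperature, such as the FF_OT power supply flags, yield no derived status and rely on the reported status alone.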
Similarly, if a particular component fails, prtdiag reports a fault in the appropriate status column (CODE EXAMPLE 1-10).
CODE EXAMPLE 1-10 prtdiag Fault Indication Output
Fan Speeds:
-----------------------------------------
Location Sensor Status Speed
-----------------------------------------
MB/P0/F0 RS failed 0 rpm
MB/P0/F1 RS okay 3994 rpm
F2 RS okay 2896 rpm
PS0 FF_FAN okay
F3 RS okay 2576 rpm
PS1 FF_FAN okay
---------------------------------
prtfru Command
The Netra 240 server maintains a hierarchical list of all field-replaceable units (FRUs) in the system, as well as specific information about various FRUs.
The prtfru command can display this hierarchical list, as well as data contained in the serial electrically-erasable programmable read-only memory (SEEPROM) devices located on many FRUs. CODE EXAMPLE 1-11 shows an excerpt of a hierarchical list of FRUs generated by the prtfru command with the -l option.
CODE EXAMPLE 1-11 prtfru -l Command Output
# prtfru -l
/frutree
/frutree/chassis (fru)
/frutree/chassis/MB?Label=MB
/frutree/chassis/MB?Label=MB/system-board (container)
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC/sc (fru)
/frutree/chassis/MB?Label=MB/system-board/BAT?Label=BAT
/frutree/chassis/MB?Label=MB/system-board/BAT?Label=BAT/battery (fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu (fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F0?Label=F0/fan-unit (fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F1?Label=F1
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F1?Label=F1/fan-unit (fru)
........
CODE EXAMPLE 1-12 shows an excerpt of SEEPROM data generated by the prtfru command with the -c option. This output displays only the containers and their data and does not print the FRU tree hierarchy.
CODE EXAMPLE 1-12 prtfru -c Command Output
# prtfru -c
/frutree/chassis/MB?Label=MB/system-board (container)
SEGMENT: SD
/ManR
/ManR/UNIX_Timestamp32: Mon Dec 2 19:47:38 PST 2002
/ManR/Fru_Description: FRUID,INSTR,M'BD,2X1.28GHZ,CPU
/ManR/Manufacture_Loc: Hsinchu,Taiwan
/ManR/Sun_Part_No: 3753120
/ManR/Sun_Serial_No: 000615
/ManR/Vendor_Name: Mitac International
/ManR/Initial_HW_Dash_Level: 02
/ManR/Initial_HW_Rev_Level: 0E
/ManR/Fru_Shortname: MOTHERBOARD
/SpecPartNo: 885-0076-11
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/B0?Label=B0/bank/D0?Label=D0/mem-module (container)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/B0?Label=B0/bank/D1?Label=D1/mem-module (container)
........
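The /ManR lines in prtfru -c output follow a "name: value" pattern, so the manufacturing record for a container can be collected into a dictionary. The following Python sketch does this for lines taken from CODE EXAMPLE 1-12. The function name parse_manr is hypothetical, and the sketch assumes one record per input; real output contains a record for each container.

```python
# Lines copied from CODE EXAMPLE 1-12.
SAMPLE = """\
/ManR
/ManR/UNIX_Timestamp32: Mon Dec 2 19:47:38 PST 2002
/ManR/Fru_Description: FRUID,INSTR,M'BD,2X1.28GHZ,CPU
/ManR/Manufacture_Loc: Hsinchu,Taiwan
/ManR/Sun_Part_No: 3753120
/ManR/Sun_Serial_No: 000615
/ManR/Vendor_Name: Mitac International
/ManR/Fru_Shortname: MOTHERBOARD
"""


def parse_manr(text):
    """Return the /ManR fields of a single container as a dict of strings."""
    record = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/ManR/") and ": " in line:
            name, _, value = line[len("/ManR/"):].partition(": ")
            record[name] = value
    return record


if __name__ == "__main__":
    record = parse_manr(SAMPLE)
    print(record["Fru_Shortname"], record["Sun_Part_No"], record["Sun_Serial_No"])
```

Values are kept as strings so that leading zeros in serial numbers, such as 000615, are preserved.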
Data displayed by the prtfru command varies depending on the type of FRU. In general, it includes the following:
- FRU description
- Manufacturer name and location
- Part number and serial number
- Hardware revision levels
psrinfo Command
The psrinfo command displays the date and time that each CPU came online. With the verbose (-v) option, the command displays additional information about the CPUs, including their clock speed. CODE EXAMPLE 1-13 shows sample output from the psrinfo command with the -v option.
CODE EXAMPLE 1-13 psrinfo -v Command Output
# psrinfo -v
Status of processor 0 as of: 07/28/2003 14:43:29
Processor has been on-line since 07/21/2003 18:43:37.
The sparcv9 processor operates at 1280 MHz,
and has a sparcv9 floating point processor.
Status of processor 1 as of: 07/28/2003 14:43:29
Processor has been on-line since 07/21/2003 18:43:36.
The sparcv9 processor operates at 1280 MHz,
and has a sparcv9 floating point processor.
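To compare processors across several servers, the psrinfo -v text can be reduced to a clock speed and an online time per processor. The following Python sketch parses the layout shown in CODE EXAMPLE 1-13. The function name parse_psrinfo is hypothetical, and the month/day/year date format is an assumption based on that sample; other locales might print dates differently.

```python
import re
from datetime import datetime

# Text copied from CODE EXAMPLE 1-13.
SAMPLE = """\
Status of processor 0 as of: 07/28/2003 14:43:29
Processor has been on-line since 07/21/2003 18:43:37.
The sparcv9 processor operates at 1280 MHz,
and has a sparcv9 floating point processor.
Status of processor 1 as of: 07/28/2003 14:43:29
Processor has been on-line since 07/21/2003 18:43:36.
The sparcv9 processor operates at 1280 MHz,
and has a sparcv9 floating point processor.
"""

STAMP = r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})"


def parse_psrinfo(text):
    """Return {processor id: {"online_since": datetime, "mhz": int}}."""
    cpus = {}
    current = None
    for line in text.splitlines():
        match = re.search(r"Status of processor (\d+) as of:", line)
        if match:
            current = int(match.group(1))
            cpus[current] = {}
            continue
        if current is None:
            continue
        match = re.search(r"on-line since " + STAMP, line)
        if match:
            cpus[current]["online_since"] = datetime.strptime(
                match.group(1), "%m/%d/%Y %H:%M:%S")
        match = re.search(r"operates at (\d+) MHz", line)
        if match:
            cpus[current]["mhz"] = int(match.group(1))
    return cpus


if __name__ == "__main__":
    for cpu_id, info in sorted(parse_psrinfo(SAMPLE).items()):
        print(cpu_id, info["mhz"], info["online_since"].isoformat())
```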
showrev Command
The showrev command displays revision information for the current hardware and software. CODE EXAMPLE 1-14 shows sample output from the showrev command.
CODE EXAMPLE 1-14 showrev Command Output
# showrev
Hostname: vsp78-36
Hostid: 8328c87b
Release: 5.8
Kernel architecture: sun4u
Application architecture: sparc
Hardware provider: Sun_Microsystems
Domain: vsplab.SFBay.Sun.COM
Kernel version: SunOS 5.8 Generic 108528-18 November 2002
When used with the -p option, the showrev command displays installed patches. CODE EXAMPLE 1-15 shows a partial sample output from the showrev command with the -p option.
CODE EXAMPLE 1-15 showrev -p Command Output
Patch: 109729-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 109783-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 109807-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 109809-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 110905-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 110910-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 110914-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 108964-04 Obsoletes: Requires: Incompatibles: Packages: SUNWcsr
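Each line of showrev -p output carries the same five labels, which makes it practical to check a patch level from a script. The following Python sketch parses one line into its parts. The function name parse_patch is hypothetical, and the handling of comma-separated lists in the Obsoletes, Requires, and Incompatibles fields is an assumption, because those fields are empty in CODE EXAMPLE 1-15.

```python
import re

PATTERN = re.compile(
    r"Patch:\s*(\d+)-(\d+)\s+Obsoletes:(.*?)Requires:(.*?)"
    r"Incompatibles:(.*?)Packages:(.*)$"
)


def split_list(text):
    """Split a field on commas or spaces; an empty field gives an empty list."""
    return text.replace(",", " ").split()


def parse_patch(line):
    """Return a dict for one showrev -p line, or None if the line does not match."""
    match = PATTERN.match(line.strip())
    if not match:
        return None
    base, revision, obsoletes, requires, incompatibles, packages = match.groups()
    return {
        "id": base,
        "revision": int(revision),
        "obsoletes": split_list(obsoletes),
        "requires": split_list(requires),
        "incompatibles": split_list(incompatibles),
        "packages": split_list(packages),
    }


if __name__ == "__main__":
    line = "Patch: 108964-04 Obsoletes: Requires: Incompatibles: Packages: SUNWcsr"
    print(parse_patch(line))
```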
To Run Solaris Platform System Information Commands
At a command prompt, type the command for the kind of system information you want to display.
For more information, see Solaris Software System Information Commands. See TABLE 1-6 for a summary of the commands.
TABLE 1-6 Solaris Platform Information Display Commands
Command | What It Displays | What to Type | Notes
prtconf | System configuration information | /usr/sbin/prtconf | --
prtdiag | Diagnostic and configuration information | /usr/platform/sun4u/sbin/prtdiag | Use the -v option for additional detail.
prtfru | FRU hierarchy and SEEPROM memory contents | /usr/sbin/prtfru | Use the -l option to display hierarchy. Use the -c option to display SEEPROM data.
psrinfo | Date and time each CPU came online; processor clock speed | /usr/sbin/psrinfo | Use the -v option to obtain clock speed and other data.
showrev | Hardware and software revision information | /usr/bin/showrev | Use the -p option to show software patches.
Recent Diagnostic Test Results
Summaries of the results from the most recent power-on self-test (POST) and OpenBoot diagnostics tests are saved across power cycles.
To View Recent Test Results
1. Go to the ok prompt.
2. Do either of the following:
- To see a summary of the most recent POST results, type:
- To see a summary of the most recent OpenBoot diagnostics test results, type:
This command produces a system-dependent list of hardware components, along with an indication of which components passed and which failed POST or OpenBoot diagnostics tests.
OpenBoot Configuration Variables
Switches and diagnostic configuration variables stored in the IDPROM determine how and when POST diagnostics and OpenBoot diagnostics tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, see TABLE 1-4.
Changes to OpenBoot configuration variables take effect at the next reboot.
To View and Set OpenBoot Configuration Variables
Halt the server to display the ok prompt.
- To display the current values of all OpenBoot configuration variables, use the printenv command.
The following example shows a short excerpt of this command's output.
ok printenv
Variable Name Value Default Value
diag-level min min
diag-switch? false false
|
- To set or change the value of an OpenBoot configuration variable, use the setenv command:
ok setenv diag-level max
diag-level = max
|
- To set OpenBoot configuration variables that accept multiple keywords, separate keywords with a space.
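If console output is captured to a file, the printenv listing can be turned into a lookup table for later comparison. The following is a minimal Python sketch, assuming each row holds a name, a value, and a default separated by whitespace, as in the excerpt above. Real output can contain empty or multiword values, which this sketch skips. The name parse_printenv is a hypothetical helper.

```python
def parse_printenv(text):
    """Map each variable name to a (value, default) pair."""
    variables = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only rows with exactly three fields; this drops the
        # prompt line and the five-word column header.
        if len(fields) != 3:
            continue
        name, value, default = fields
        variables[name] = (value, default)
    return variables


excerpt = """ok printenv
Variable Name Value Default Value
diag-level min min
diag-switch? false false
"""

for name, (value, default) in parse_printenv(excerpt).items():
    note = "" if value == default else " (changed from default)"
    print(name, "=", value + note)
```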
Using the watch-net and watch-net-all Commands to Check the Network Connections
The watch-net diagnostics test monitors Ethernet packets on the primary network interface. The watch-net-all diagnostics test monitors Ethernet packets on the primary network interface and on any additional network interfaces connected to the system board. Each good packet received by the system is indicated by a period (.). Errors, such as framing errors and cyclic redundancy check (CRC) errors, are indicated by an X and an associated error description.
To start the watch-net diagnostic test, type the watch-net command at the ok prompt (CODE EXAMPLE 1-16).
CODE EXAMPLE 1-16 watch-net Diagnostic Output Message
{0} ok watch-net
Internal loopback test -- succeeded.
Link is -- up
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.................................
|
To start the watch-net-all diagnostic test, type watch-net-all at the ok prompt (CODE EXAMPLE 1-17).
CODE EXAMPLE 1-17 watch-net-all Diagnostic Output Message
{0} ok watch-net-all
/pci@1f,0/pci@1,1/network@c,1
Internal loopback test -- succeeded.
Link is -- up
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.
|
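Captured watch-net output can be tallied to compare good and bad packet counts. The following is a minimal Python sketch, assuming output from a single interface and assuming that everything after the "Type any key to stop." line consists of packet indicators. Error descriptions that contain a period or a capital X would distort the counts. The names MARKER and count_packets are hypothetical.

```python
MARKER = "Type any key to stop."


def count_packets(output):
    """Return (good, bad) packet counts from captured watch-net output."""
    # Everything after the first marker line is treated as packet indicators.
    _, found, tail = output.partition(MARKER)
    if not found:
        raise ValueError("marker line not found in output")
    return tail.count("."), tail.count("X")


sample = MARKER + "....X..."
good, bad = count_packets(sample)
print("good:", good, "bad:", bad)
```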
Automatic System Recovery
Note - Automatic System Recovery (ASR) is not the same as Automatic Server Restart, which the Netra 240 server also supports. For information about Automatic Server Restart, see Chapter 3.
|
Automatic System Recovery (ASR) consists of self-test features and an auto-configuring capability to detect failed hardware components and unconfigure them. When ASR is enabled, the server can resume operating after certain nonfatal hardware faults or failures have occurred.
If a component is monitored by ASR and the server is capable of operating without it, the server automatically reboots if that component develops a fault or fails. This capability keeps a faulty hardware component from halting the entire system or causing the system to fail repeatedly.
If a fault is detected during the power-on sequence, the faulty component is disabled. If the system remains capable of functioning, the boot sequence continues.
To support this degraded boot capability, the OpenBoot firmware uses the IEEE 1275 Client Interface (by means of the device tree) to mark a device as either failed or disabled, by creating an appropriate status property in the device tree node. The Solaris OS does not activate a driver for any subsystem marked in this way.
As long as a failed component is electrically dormant (not causing random bus errors or signal noise, for example), the system reboots automatically and resumes operation while a service call is made.
Once a failed or disabled device is replaced with a new one, the OpenBoot firmware automatically modifies the status of the device upon reboot.
Note - ASR is not enabled until you activate it (see To Enable ASR).
|
Auto-Boot Options
The auto-boot? setting controls whether the firmware automatically boots the operating system after each reset. The default setting is true.
The auto-boot-on-error? setting controls whether the system attempts a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? settings must be set to true to enable an automatic degraded boot.
To set the switches, type:
ok setenv auto-boot? true
ok setenv auto-boot-on-error? true
|
Note - The default setting for auto-boot-on-error? is false. Therefore, the system does not attempt a degraded boot unless you change this setting to true. In addition, the system does not attempt a degraded boot in response to any fatal nonrecoverable error, even if degraded booting is enabled. For examples of fatal nonrecoverable errors, see Error-Handling Summary.
|
Error-Handling Summary
Error handling during the power-on sequence falls into one of the following three cases:
- If no errors are detected by POST or OpenBoot diagnostics, the system attempts to boot if auto-boot? is true.
- If only nonfatal errors are detected by POST or OpenBoot diagnostics, the system attempts to boot if auto-boot? is true and auto-boot-on-error? is true.
Note - If POST or OpenBoot diagnostics detects a nonfatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.
|
- If a fatal error is detected by POST or OpenBoot diagnostics, the system does not boot, regardless of how auto-boot? and auto-boot-on-error? are set. Fatal nonrecoverable errors include the following:
- Failure of all CPUs
- Failure of all logical memory banks
- Failure of flash RAM cyclic redundancy check (CRC)
- Failure of critical field-replaceable unit (FRU) PROM configuration data
- Failure of critical application-specific integrated circuit (ASIC)
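The three cases above can be written as a small decision function. The following is a minimal Python sketch of the rules as this section states them, not of the firmware itself. The function name boot_decision and the labels "none", "nonfatal", and "fatal" are hypothetical.

```python
def boot_decision(error_level, auto_boot, auto_boot_on_error):
    """Return True if the system would attempt to boot.

    error_level is "none", "nonfatal", or "fatal", describing what
    POST or OpenBoot diagnostics detected. The two flags stand for
    the auto-boot? and auto-boot-on-error? settings.
    """
    if error_level == "fatal":
        # Fatal errors block booting regardless of either setting.
        return False
    if error_level == "nonfatal":
        # A degraded boot needs both settings to be true.
        return auto_boot and auto_boot_on_error
    if error_level == "none":
        return auto_boot
    raise ValueError("unknown error level: %r" % error_level)


# With the default settings (auto-boot? true, auto-boot-on-error? false),
# a nonfatal error leaves the system at the ok prompt.
print(boot_decision("nonfatal", auto_boot=True, auto_boot_on_error=False))
```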
Reset Scenarios
Three OpenBoot configuration variables--diag-switch?, obdiag-trigger, and post-trigger--control how the system runs firmware diagnostics in response to system reset events.
The standard system reset protocol bypasses POST and OpenBoot diagnostics unless diag-switch? is set to true. The default setting for this variable is false. Because ASR relies on firmware diagnostics to detect faulty devices, diag-switch? must be set to true for ASR to run. For instructions, see To Enable ASR.
To control which reset events, if any, automatically initiate firmware diagnostics, use obdiag-trigger and post-trigger. For detailed explanations of these variables and their uses, see Controlling POST Diagnostics and Controlling OpenBoot Diagnostics Tests.
To Enable ASR
1. At the system ok prompt, type:
ok setenv diag-switch? true
ok setenv auto-boot? true
ok setenv auto-boot-on-error? true
|
2. Set the obdiag-trigger variable to power-on-reset, error-reset, or user-reset.
For example, type:
ok setenv obdiag-trigger user-reset
|
3. Type:
The system permanently stores the parameter changes and boots automatically if the OpenBoot variable auto-boot? is set to true (its default value).
Note - To store parameter changes, you can also power-cycle the system by using the front panel On/Standby button.
|
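The settings from this procedure can be checked against a set of variable values, for example values transcribed from printenv output. The following is a minimal Python sketch, assuming the values are supplied as a dictionary of strings. The names REQUIRED_TRUE, RESET_EVENTS, and asr_problems are hypothetical, and the placeholder text in the last suggestion is not a literal value.

```python
# Variables that step 1 of the procedure sets to true.
REQUIRED_TRUE = ("diag-switch?", "auto-boot?", "auto-boot-on-error?")

# Reset events that step 2 allows for obdiag-trigger.
RESET_EVENTS = {"power-on-reset", "error-reset", "user-reset"}


def asr_problems(variables):
    """Return the setenv commands still needed to match the procedure."""
    problems = []
    for name in REQUIRED_TRUE:
        if variables.get(name) != "true":
            problems.append("setenv %s true" % name)
    # Multiple keywords are separated by spaces.
    triggers = set(variables.get("obdiag-trigger", "").split())
    if not triggers & RESET_EVENTS:
        problems.append("setenv obdiag-trigger <reset event>")
    return problems


current = {
    "diag-switch?": "false",
    "auto-boot?": "true",
    "auto-boot-on-error?": "false",
}
for command in asr_problems(current):
    print("ok", command)
```

An empty list means the supplied values already match steps 1 and 2 of the procedure.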
To Disable ASR
1. At the system ok prompt, type:
ok setenv diag-switch? false
|
2. Type:
The system permanently stores the parameter change.
Note - To store parameter changes, you can also power-cycle the system by using the front panel On/Standby button.
|