Netra 440 Server Diagnostics and Troubleshooting Guide
When something goes wrong with the system, diagnostic tools can help you figure out what caused the problem. Indeed, this is the principal use of most diagnostic tools. However, this approach is inherently reactive. It means waiting until a component fails outright.
Some diagnostic tools allow you to be more proactive by monitoring the system while it is still "healthy." Monitoring tools give administrators early warning of imminent failure, thereby allowing planned maintenance and better system availability. Remote monitoring also allows administrators the convenience of checking on the status of many machines from one centralized location.
Sun provides the Advanced Lights Out Manager (ALOM) software that you can use to monitor servers.
In addition to that tool, Sun provides software-based and firmware-based commands that display various kinds of system information. While not strictly monitoring tools, these commands enable you to review at a glance the status of different system aspects and components.
This chapter describes the tasks necessary to use these tools to monitor your Netra 440 server.
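Tasks covered in this chapter include monitoring the system using Sun Advanced Lights Out Manager, using Solaris system information commands, and using OpenBoot information commands.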
If you want background information about the tools, turn to Chapter 2.
Note - Many of the procedures in this chapter assume that you are familiar with the OpenBoot firmware and that you know how to access the ok prompt. For background information, refer to the Netra 440 Server System Administration Guide.
|
Monitoring the System Using Sun Advanced Lights Out Manager
This section explains how to use Advanced Lights Out Manager (ALOM) to monitor a Netra 440 server, and steps you through some of the tool's most important features.
For background information about ALOM, see Chapter 2.
There are several ways to connect to and use the ALOM system controller, depending on how your data center and its network are set up. This procedure assumes that you intend to monitor the Netra 440 system by way of an alphanumeric terminal or terminal server connected to the server's SERIAL MGT port, or by using a telnet connection to the NET MGT port.
The procedure also assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.
To Monitor the System Using Sun Advanced Lights Out Manager
|
1. Log in to the system console and access the ok prompt.
2. If necessary, type the system controller escape sequence.
If you are not already seeing the sc> prompt, type the system controller escape sequence. By default, this sequence is #. (pound-period).
3. If necessary, log in to ALOM.
If you are not logged in to ALOM, you will be prompted to do so:
Please login: admin
Please Enter password: ******
|
Enter the admin account login name and password, or the name and password of a different login account if one has been set up for you. For the purposes of this procedure, your account should have full privileges.
Note - The first time you access ALOM, there is no admin account password. You are instructed to provide one the first time you attempt to execute a privileged command. Note the password you enter and retain it for future use.
|
The sc> prompt appears:
sc>
This prompt indicates that you now have access to the ALOM system controller command-line interface.
4. At the sc> prompt, type the showenvironment command.
This command displays a great deal of useful data, starting with temperature readings from a number of thermal sensors.
CODE EXAMPLE 4-1 ALOM Reports on System Temperatures
=============== Environmental Status ===============
------------------------------------------------------------------------------
System Temperatures (Temperatures in Celsius):
------------------------------------------------------------------------------
Sensor Status Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
------------------------------------------------------------------------------
C0.P0.T_CORE OK 48 -20 -10 0 97 102 120
C1.P0.T_CORE OK 53 -20 -10 0 97 102 120
C2.P0.T_CORE OK 49 -20 -10 0 97 102 120
C3.P0.T_CORE OK 57 -20 -10 0 97 102 120
C0.T_AMB OK 28 -20 -10 0 70 82 87
C1.T_AMB OK 33 -20 -10 0 70 82 87
C2.T_AMB OK 27 -20 -10 0 70 82 87
C3.T_AMB OK 28 -20 -10 0 70 82 87
MB.T_AMB OK 32 -18 -10 0 65 75 85
|
Note - The warning and soft graceful shutdown thresholds noted in CODE EXAMPLE 4-1 are set at the factory and cannot be modified.
|
The sensors labeled T_AMB in CODE EXAMPLE 4-1 measure ambient temperatures at the CPU/memory modules, the motherboard, and the SCSI backplane. The sensors labeled T_CORE measure the internal temperatures of the processor chips themselves.
In the output shown in CODE EXAMPLE 4-1, MB refers to the motherboard, and Cn refers to a particular CPU. For information about identifying CPU modules, see Identifying CPU/Memory Modules.
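The temperature table lends itself to simple scripted checks once the output has been captured to a file or a terminal log. The following is a minimal sketch, not part of ALOM, that parses a few rows copied from CODE EXAMPLE 4-1 and reports which sensor is closest to its HighWarn threshold. The function names (parse_temperatures, headroom) are hypothetical, and the sketch assumes the nine-column layout shown above.

```python
# Rows copied from CODE EXAMPLE 4-1. Columns: Sensor, Status, Temp, LowHard,
# LowSoft, LowWarn, HighWarn, HighSoft, HighHard (all temperatures in Celsius).
SAMPLE_OUTPUT = """\
Sensor Status Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
C0.P0.T_CORE OK 48 -20 -10 0 97 102 120
C3.P0.T_CORE OK 57 -20 -10 0 97 102 120
C1.T_AMB OK 33 -20 -10 0 70 82 87
MB.T_AMB OK 32 -18 -10 0 65 75 85
"""


def parse_temperatures(text):
    """Return one dict per sensor row; header and ruler lines are skipped."""
    readings = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9:
            continue
        try:
            numbers = [int(field) for field in fields[2:]]
        except ValueError:
            continue  # the column-header line is not numeric
        readings.append({
            "sensor": fields[0],
            "status": fields[1],
            "temp": numbers[0],
            "high_warn": numbers[4],  # fifth numeric column is HighWarn
        })
    return readings


def headroom(reading):
    """Degrees Celsius remaining before the HighWarn threshold is reached."""
    return reading["high_warn"] - reading["temp"]


readings = parse_temperatures(SAMPLE_OUTPUT)
tightest = min(readings, key=headroom)
print(tightest["sensor"], headroom(tightest))  # MB.T_AMB 33
```

In this sample the motherboard ambient sensor has the least headroom (33 degrees), even though the CPU core sensors report higher absolute temperatures, because its HighWarn threshold is lower.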
The showenvironment command also gives the position of the system control rotary switch and the condition of the three LEDs on the front panel.
CODE EXAMPLE 4-2 ALOM Reports on Rotary Switch Position and System Status LEDs
--------------------------------------
Front Status Panel:
--------------------------------------
Rotary Switch position: NORMAL
---------------------------------------------------
System Indicator Status:
---------------------------------------------------
SYS.LOCATE SYS.SERVICE SYS.ACT
--------------------------------------------------------
OFF OFF ON
|
The showenvironment command reports the status of system disks and fans.
CODE EXAMPLE 4-3 ALOM Reports on System Disks and Fans
--------------------------------------------
System Disks:
--------------------------------------------
Disk Status Service OK2RM
--------------------------------------------
HDD0 OK OFF OFF
HDD1 OK OFF OFF
HDD2 OK OFF OFF
HDD3 OK OFF OFF
----------------------------------------------------------
Fans (Speeds Revolution Per Minute):
----------------------------------------------------------
Sensor Status Speed Warn Low
----------------------------------------------------------
FT0.F0.TACH OK 3879 2400 750
FT1.F0.TACH OK 3947 2400 750
FT2.F0.TACH OK 4017 2400 750
FT3.F0 OK -- -- --
|
Voltage sensors located on the motherboard monitor important system voltages, and showenvironment reports these.
CODE EXAMPLE 4-4 ALOM Reports on Motherboard Voltages
-----------------------------------------------------------------------------
Voltage sensors (in Volts):
-----------------------------------------------------------------------------
Sensor Status Voltage LowSoft LowWarn HighWarn HighSoft
-----------------------------------------------------------------------------
MB.V_+1V5 OK 1.49 1.20 1.27 1.72 1.80
MB.V_VCCTM OK 2.53 2.00 2.12 2.87 3.00
MB.V_NET0_1V2D OK 1.26 0.96 1.02 1.38 1.44
MB.V_NET1_1V2D OK 1.26 0.96 1.02 1.38 1.44
MB.V_NET0_1V2A OK 1.26 0.96 1.02 1.38 1.44
MB.V_NET1_1V2A OK 1.25 0.96 1.02 1.38 1.44
MB.V_+3V3 OK 3.33 2.64 2.80 3.79 3.96
MB.V_+3V3STBY OK 3.33 2.64 2.80 3.79 3.96
MB.BAT.V_BAT OK 3.07 -- 2.25 -- --
MB.V_SCSI_CORE OK 1.80 1.44 1.53 2.07 2.16
MB.V_+5V OK 5.02 4.00 4.25 5.75 6.00
MB.V_+12V OK 12.00 9.60 10.20 13.80 14.40
MB.V_-12V OK -11.96 -14.40 -13.80 -10.20 -9.60
|
Note - The warning and soft graceful shutdown thresholds noted in CODE EXAMPLE 4-4 are set at the factory and cannot be modified.
|
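Because some voltage sensors have no value for certain thresholds (shown as -- in the output), any script that reads this table has to treat those columns as optional. The following is a minimal sketch under that assumption, using rows copied from CODE EXAMPLE 4-4. The helper names are hypothetical and are not part of ALOM.

```python
# Rows copied from CODE EXAMPLE 4-4. Columns: Sensor, Status, Voltage,
# LowSoft, LowWarn, HighWarn, HighSoft. "--" marks a threshold with no value.
VOLTAGE_OUTPUT = """\
Sensor Status Voltage LowSoft LowWarn HighWarn HighSoft
MB.V_+3V3 OK 3.33 2.64 2.80 3.79 3.96
MB.BAT.V_BAT OK 3.07 -- 2.25 -- --
MB.V_-12V OK -11.96 -14.40 -13.80 -10.20 -9.60
"""


def to_volts(field):
    """Convert a column value to float, mapping "--" to None."""
    return None if field == "--" else float(field)


def parse_voltages(text):
    """Return one dict per sensor row; the header line is skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7 or fields[0] == "Sensor":
            continue
        voltage, _low_soft, low_warn, high_warn, _high_soft = map(to_volts, fields[2:])
        rows.append({
            "sensor": fields[0],
            "status": fields[1],
            "voltage": voltage,
            "low_warn": low_warn,
            "high_warn": high_warn,
        })
    return rows


def inside_warning_band(row):
    """True if the reading lies between LowWarn and HighWarn, where set."""
    low, high, volts = row["low_warn"], row["high_warn"], row["voltage"]
    return (low is None or volts >= low) and (high is None or volts <= high)


voltage_rows = parse_voltages(VOLTAGE_OUTPUT)
outside = [row["sensor"] for row in voltage_rows if not inside_warning_band(row)]
print(outside)  # [] for the sample rows
```

Note that the comparison works unchanged for the negative rail: for MB.V_-12V the "low" thresholds are the more negative values, so -11.96 falls between -13.80 and -10.20.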
The showenvironment command tells you the status of each power supply, and the state of the LEDs located on each supply.
CODE EXAMPLE 4-5 ALOM Reports on Power Supply Status
--------------------------------------------
Power Supply Indicators:
--------------------------------------------
Supply Active Service OK-to-Remove
--------------------------------------------
PS0 ON OFF OFF
PS1 ON OFF OFF
PS2 ON OFF OFF
PS3 ON OFF OFF
------------------------------------------------------------------------------
Power Supplies:
------------------------------------------------------------------------------
Supply Status Underspeed Overtemp Overvolt Undervolt Overcurrent
------------------------------------------------------------------------------
PS0 OK OFF OFF OFF OFF OFF
PS1 OK OFF OFF OFF OFF OFF
PS2 OK OFF OFF OFF OFF OFF
PS3 OK OFF OFF OFF OFF OFF
|
This command reports on the status of motherboard circuit breakers (labeled MB.FF_SCSIx) and CPU module DC-to-DC converters (labeled Cn.P0.FF_POK).
CODE EXAMPLE 4-6 ALOM Reports on Circuit Breakers and DC-to-DC Converters
----------------------
Current sensors:
----------------------
Sensor Status
----------------------
MB.FF_SCSIA OK
MB.FF_SCSIB OK
MB.FF_POK OK
C0.P0.FF_POK OK
C1.P0.FF_POK OK
C2.P0.FF_POK OK
C3.P0.FF_POK OK
|
Finally, this command tells you the status of the system alarms.
CODE EXAMPLE 4-7 ALOM Reports on System Alarms
--------------------------------------------
System Alarms:
--------------------------------------------
Alarm Relay LED
--------------------------------------------
ALARM.CRITICAL OFF OFF
ALARM.MAJOR OFF OFF
ALARM.MINOR OFF OFF
ALARM.USER OFF OFF
|
5. Type the showfru command.
This command, like the Solaris OS command prtfru -c, displays static FRU-ID information, where available, for several system FRUs. The information includes the date and location of manufacture and the Sun part number.
CODE EXAMPLE 4-8 ALOM Reports on FRU Identification Information
FRU_PROM at PS0.SEEPROM
Timestamp: MON SEP 16 16:47:05 2002
Description: PWR SUPPLY, SYSTEM,75%-EFF,H-P
Manufacture Location: DELTA ELECTRONICS CHUNGLI TAIWAN
Sun Part No: 3001501
Sun Serial No: T00065
Vendor JDEC code: 3AD
Initial HW Dash Level: 01
Initial HW Rev Level: 02
Shortname: PS
|
6. Type the showlogs command.
This command shows a history of noteworthy system events, the most recent being listed last.
CODE EXAMPLE 4-9 ALOM Reports on Logged Events
FEB 28 19:45:06 myhost: 0006001a: "SC Host Watchdog Reset Disabled"
FEB 28 19:45:06 myhost: 00060003: "SC System booted."
FEB 28 19:45:43 myhost: 00060000: "SC Login: User admin Logged on."
FEB 28 19:45:51 myhost: 0004000e: "SC Request to Power Off Host Immediately."
FEB 28 19:45:55 myhost: 00040002: "Host System has Reset"
FEB 28 19:45:56 myhost: 00040029: "Host system has shut down."
FEB 28 19:46:16 myhost: 00040001: "SC Request to Power On Host."
FEB 28 19:46:18 myhost: 0004000b: "Host System has read and cleared bootmode."
FEB 28 19:55:17 myhost: 00060000: "SC Login: User admin Logged on."
FEB 28 19:56:59 myhost: 00060000: "SC Login: User admin Logged on."
FEB 28 20:27:06 myhost: 0004004f: "Indicator SYS_FRONT.ACT is now ON"
FEB 28 20:40:47 myhost: 00040002: "Host System has Reset"
|
Note - The ALOM log messages are written into a so-called "circular buffer" of limited length (64 kilobytes). Once the buffer is filled, the oldest messages are overwritten by the newest ones.
|
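Each showlogs line follows the same layout: a timestamp, the host name, an eight-digit hexadecimal code, and a quoted message. The following sketch, which assumes exactly that layout as seen in CODE EXAMPLE 4-9, splits saved log lines into fields so that they can be filtered or counted. The event code is treated as an opaque string, since this chapter does not describe its internal structure. The names LOG_LINE and parse_log are hypothetical.

```python
import re
from collections import Counter

# Lines copied from CODE EXAMPLE 4-9.
LOG_OUTPUT = '''\
FEB 28 19:45:06 myhost: 00060003: "SC System booted."
FEB 28 19:45:43 myhost: 00060000: "SC Login: User admin Logged on."
FEB 28 19:45:55 myhost: 00040002: "Host System has Reset"
FEB 28 19:55:17 myhost: 00060000: "SC Login: User admin Logged on."
'''

# timestamp, host name, 8-hex-digit event code, quoted message
LOG_LINE = re.compile(
    r'^(?P<stamp>[A-Z]{3} +\d{1,2} \d{2}:\d{2}:\d{2}) '
    r'(?P<host>\S+): (?P<code>[0-9a-f]{8}): "(?P<message>.*)"$'
)


def parse_log(text):
    """Return a list of dicts, one per line that matches the layout."""
    events = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


events = parse_log(LOG_OUTPUT)
logins = [event for event in events if event["message"].startswith("SC Login:")]
by_code = Counter(event["code"] for event in events)
print(len(logins), by_code["00060000"])  # 2 2
```

Because the ALOM buffer eventually overwrites its oldest entries, saving the output periodically and parsing it offline in this way is one option for keeping a longer history.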
7. Examine the ALOM run log. Type:
sc> consolehistory run -v
|
This command shows the log containing the most recent system console output from POST, OpenBoot PROM, and Solaris boot messages. In addition, this log records output from the server's operating system.
CODE EXAMPLE 4-10 consolehistory run -v Command Output
May 9 14:48:22 Sun-SFV440-a rmclomv: SC Login: User admin Logged on.
#
# init 0
#
INIT: New run level: 0
The system is coming down. Please wait.
System services are now being stopped.
Print services stopped.
May 9 14:49:18 Sun-SFV440-a last message repeated 1 time
May 9 14:49:38 Sun-SFV440-a syslogd: going down on signal 15
The system is down.
syncing file systems... done
Program terminated
{1} ok boot disk
Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.
Initializing 1MB of memory at addr 123fecc000 -
Initializing 1MB of memory at addr 123fe02000 -
Initializing 14MB of memory at addr 123f002000 -
Initializing 16MB of memory at addr 123e002000 -
Initializing 992MB of memory at addr 1200000000 -
Initializing 1024MB of memory at addr 1000000000 -
Initializing 1024MB of memory at addr 200000000 -
Initializing 1024MB of memory at addr 0 -
Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0 File and args:
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.
Hardware watchdog enabled
Indicator SYS_FRONT.ACT is now ON
configuring IPv4 interfaces: ce0.
Hostname: Sun-SFV440-a
The system is coming up. Please wait.
NIS domainname is Ecd.East.Sun.COM
Starting IPv4 router discovery.
starting rpc services: rpcbind keyserv ypbind done.
Setting netmask of lo0 to 255.0.0.0
Setting netmask of ce0 to 255.255.255.0
Setting default IPv4 interface for multicast: add net 224.0/4: gateway Sun-SFV440-a
syslog service starting.
Print services started.
volume management starting.
The system is ready.
Sun-SFV440-a console login: May 9 14:52:57 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = UNKNOWN
May 9 14:52:57 Sun-SFV440-a rmclomv: Keyswitch Position has changed to Unknown state.
May 9 14:52:58 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = LOCKED
May 9 14:52:58 Sun-SFV440-a rmclomv: KeySwitch Position has changed to Locked State.
May 9 14:53:00 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = NORMAL
May 9 14:53:01 Sun-SFV440-a rmclomv: KeySwitch Position has changed to On State.
sc>
|
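One quick sanity check on a captured run log is to compare the memory size in the system banner with the sum of the "Initializing ... of memory" lines. The sketch below does this for the lines shown in CODE EXAMPLE 4-10. The function names are hypothetical, and whether the two figures agree in every log is an assumption; a mismatch in a partial capture does not by itself indicate a fault.

```python
import re

# Banner and memory-initialization lines copied from CODE EXAMPLE 4-10.
RUN_LOG_EXCERPT = """\
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Initializing 1MB of memory at addr 123fecc000 -
Initializing 1MB of memory at addr 123fe02000 -
Initializing 14MB of memory at addr 123f002000 -
Initializing 16MB of memory at addr 123e002000 -
Initializing 992MB of memory at addr 1200000000 -
Initializing 1024MB of memory at addr 1000000000 -
Initializing 1024MB of memory at addr 200000000 -
Initializing 1024MB of memory at addr 0 -
"""


def installed_mb(text):
    """Memory size reported in the system banner, or None if absent."""
    match = re.search(r"(\d+) MB memory installed", text)
    return int(match.group(1)) if match else None


def initialized_mb(text):
    """Sum of the sizes on all 'Initializing ...MB of memory' lines."""
    sizes = re.findall(r"Initializing (\d+)MB of memory at addr [0-9a-f]+", text)
    return sum(int(size) for size in sizes)


print(installed_mb(RUN_LOG_EXCERPT), initialized_mb(RUN_LOG_EXCERPT))  # 4096 4096
```

For this sample, 1 + 1 + 14 + 16 + 992 + 3 x 1024 = 4096 MB, which matches the banner.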
8. Examine the ALOM boot log. Type:
sc> consolehistory boot -v
|
The ALOM boot log contains boot messages from POST, OpenBoot firmware, and Solaris software from the host server's most recent reset.
The following sample output shows the boot messages from POST.
CODE EXAMPLE 4-11 consolehistory boot -v Command Output (Boot Messages From POST)
Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs
Power-On Reset
Executing Power On SelfTest
0>@(#) Sun Fire[TM] V440 POST 4.10.3 2003/05/04 22:08
/export/work/staff/firmware_re/post/post-build-4.10.3/Fiesta/system/integrated (firmware_re)
0>Hard Powerup RST thru SW
0>CPUs present in system: 0 1
0>OBP->POST Call with %o0=00000000.01012000.
0>Diag level set to MIN.
0>MFG scrpt mode set NORM
0>I/O port set to TTYA.
0>
0>Start selftest...
1>Print Mem Config
1>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
1>Memory interleave set to 0
1> Bank 0 1024MB : 00000010.00000000 -> 00000010.40000000.
1> Bank 2 1024MB : 00000012.00000000 -> 00000012.40000000.
0>Print Mem Config
0>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
0>Memory interleave set to 0
0> Bank 0 1024MB : 00000000.00000000 -> 00000000.40000000.
0> Bank 2 1024MB : 00000002.00000000 -> 00000002.40000000.
0>INFO:
0> POST Passed all devices.
0>
0>POST: Return to OBP.
|
The following sample output shows the initialization of the OpenBoot PROM.
CODE EXAMPLE 4-12 consolehistory boot -v Command Output (OpenBoot PROM Initialization)
Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs
POST Results: Cpu 0000.0000.0000.0000
%o0 = 0000.0000.0000.0000 %o1 = ffff.ffff.f00a.2b73 %o2 = ffff.ffff.ffff.ffff
POST Results: Cpu 0000.0000.0000.0001
%o0 = 0000.0000.0000.0000 %o1 = ffff.ffff.f00a.2b73 %o2 = ffff.ffff.ffff.ffff
Membase: 0000.0000.0000.0000
MemSize: 0000.0000.0004.0000
Init CPU arrays Done
Probing /pci@1d,700000 Device 1 Nothing there
Probing /pci@1d,700000 Device 2 Nothing there
|
The following sample output shows the system banner.
CODE EXAMPLE 4-13 consolehistory boot -v Command Output (System Banner Display)
Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.
|
The following sample output shows OpenBoot Diagnostics testing.
CODE EXAMPLE 4-14 consolehistory boot -v Command Output (OpenBoot Diagnostics Testing)
Running diagnostic script obdiag/normal
Testing /pci@1f,700000/network@1
Testing /pci@1e,600000/ide@d
Testing /pci@1e,600000/isa@7/flashprom@2,0
Testing /pci@1e,600000/isa@7/serial@0,2e8
Testing /pci@1e,600000/isa@7/serial@0,3f8
Testing /pci@1e,600000/isa@7/rtc@0,70
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={gpio@0.42,gpio@0.44,gpio@0.46,gpio@0.48}
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={hardware-monitor@0.5c}
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={temperature-sensor@0.9c}
Testing /pci@1c,600000/network@2
Testing /pci@1f,700000/scsi@2,1
Testing /pci@1f,700000/scsi@2
|
The following sample output shows memory initialization by the OpenBoot PROM.
CODE EXAMPLE 4-15 consolehistory boot -v Command Output (Memory Initialization)
Initializing 1MB of memory at addr 123fe02000 -
Initializing 12MB of memory at addr 123f000000 -
Initializing 1008MB of memory at addr 1200000000 -
Initializing 1024MB of memory at addr 1000000000 -
Initializing 1024MB of memory at addr 200000000 -
Initializing 1024MB of memory at addr 0 -
{1} ok boot disk
|
The following sample output shows the system booting and loading Solaris software.
CODE EXAMPLE 4-16 consolehistory boot -v Command Output (System Booting and Loading Solaris Software)
Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0 File and args:
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.11 97/07/10 16:19:15.
Loading: /platform/SUNW,Sun-Fire-V440/ufsboot
Loading: /platform/sun4u/ufsboot
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.
Hardware watchdog enabled
sc>
|
9. Type the showusers command.
This command displays all the users currently logged in to ALOM.
CODE EXAMPLE 4-17 ALOM Reports on Active User Sessions
username connection login time   client IP addr   console
-------- ---------- ----------   --------------
admin    serial     FEB 28 19:45                  system
admin    net-1      MAR 03 14:43 129.111.111.111
sc>
|
In this case, notice that there are two separate, simultaneous administrative users. The first is logged in through the SERIAL MGT port and has access to the system console. The second is logged in through a telnet connection from another host to the NET MGT port. The second user can view the system console session but cannot enter console commands.
10. Type the showplatform command.
This command displays the status of the operating system, which may be Running, Stopped, Initializing, or in a handful of other states.
CODE EXAMPLE 4-18 ALOM Reports on Operating System Status
SUNW,Netra-440
Domain Status
------ ------
vsp75-202-priv OS Running
|
11. Use ALOM to run POST diagnostics.
Doing this involves several steps.
a. Type:
This command temporarily overrides the server's OpenBoot Diagnostics diag-switch? setting, forcing power-on self-test (POST) diagnostics to run when power is cycled off and on. If the server is not power cycled within 10 minutes, the server reverts to its default settings.
b. Power cycle the system. Type:
sc> poweroff
Are you sure you want to power off the system [y/n]? y
sc> poweron
|
POST diagnostics begin to run as the system reboots. However, you will see no messages until you switch from ALOM to the system console. For details, refer to the Netra 440 Server System Administration Guide.
c. Switch to the system console. Type:
sc> console
Enter #. to return to ALOM.
0>@(#) Sun Fire[TM] V440 POST 4.10.0 2003/04/01 22:28
/export/work/staff/firmware_re/post/post-build-4.10.0/Fiesta/system/integrated (firmware_re)
0>Hard Powerup RST thru SW
0>CPUs present in system: 0 1 2 3
0>OBP->POST Call with %o0=00000000.01008000.
|
You should begin seeing console output and POST messages. The exact text that appears on your screen depends on the state of your Netra 440 server, and on how long you delay between powering on the system and switching to the system console.
Note - Any system console or POST messages you might miss are preserved in the ALOM boot log. To access the boot log, type consolehistory boot -v from the sc> prompt.
|
For more information about ALOM command-line functions, refer to the Advanced Lights Out Manager User's Guide.
For more information about controlling POST diagnostics, see Controlling POST Diagnostics.
For information about interpreting POST error messages, see What POST Error Messages Tell You.
Using Solaris System Information Commands
This section explains how to run Solaris system information commands on a Netra 440 server. To find out what these commands tell you, see Solaris System Information Commands, or see the appropriate man pages.
To Use Solaris System Information Commands
|
1. Decide what kind of system information you want to display.
For more information, see Solaris System Information Commands.
2. Type the appropriate command at a system console prompt. See TABLE 4-1.
TABLE 4-1 Using Solaris System Information Commands
Command | What It Displays | What to Type | Notes
prtconf | System configuration information | /usr/sbin/prtconf | --
prtdiag | Diagnostic and configuration information | /usr/platform/`uname -i`/sbin/prtdiag | Use the -v option for additional detail.
prtfru | FRU hierarchy and SEEPROM memory contents | /usr/sbin/prtfru | Use the -l option to display hierarchy. Use the -c option to display SEEPROM data.
psrinfo | Date and time each CPU came online; processor clock speed | /usr/sbin/psrinfo | Use the -v option to obtain clock speed and other data.
showrev | Hardware and software revision information | /usr/bin/showrev | Use the -p option to show software patches.
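The prtdiag path in TABLE 4-1 is the only entry that depends on the platform, because the shell substitutes the output of uname -i into the directory name. The following sketch shows the same path construction in a script. The function names are hypothetical, and the platform string used in the example call is taken from CODE EXAMPLE 4-18 purely for illustration; the value that uname -i reports on a given server is an assumption to verify on that server.

```python
import subprocess


def prtdiag_command(platform, verbose=False):
    """Build the argument list for prtdiag on the given platform name."""
    command = [f"/usr/platform/{platform}/sbin/prtdiag"]
    if verbose:
        command.append("-v")  # -v requests additional detail (see TABLE 4-1)
    return command


def current_platform():
    """Return the platform name printed by `uname -i` on the running system."""
    result = subprocess.run(
        ["uname", "-i"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Illustrative platform name only; on a live system, pass current_platform().
print(prtdiag_command("SUNW,Netra-440", verbose=True))
# ['/usr/platform/SUNW,Netra-440/sbin/prtdiag', '-v']
```

Passing the resulting list to subprocess.run avoids shell quoting problems with the comma in the platform name.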
Using OpenBoot Information Commands
This section explains how to run OpenBoot commands that display different kinds of system information about a Netra 440 server. To find out what these commands tell you, see Other OpenBoot Commands, or refer to the appropriate man pages.
As long as you can get to the ok prompt, you can use OpenBoot information commands. This means the commands are usually accessible even when your system cannot boot its operating system software.
To Use OpenBoot Information Commands
|
1. If necessary, shut down the system to reach the ok prompt.
How you do this depends on the system's condition. If possible, you should warn users and shut down the system gracefully. For information, refer to the Netra 440 Server System Administration Guide.
2. Decide what kind of system information you want to display.
For more information, see Other OpenBoot Commands.
3. Type the appropriate command at the ok prompt. See TABLE 4-2.
TABLE 4-2 Using OpenBoot Information Commands
Command to Type | What It Displays
printenv | OpenBoot configuration variable defaults and settings
probe-scsi, probe-scsi-all, probe-ide | Target address, unit number, device type, and manufacturer name of active SCSI and IDE devices
show-devs | Hardware device paths of all devices in the system configuration
|
817-3886-10
Copyright © 2004, Sun Microsystems, Inc. All rights reserved.