CHAPTER 4

Monitoring the System

When something goes wrong with the system, diagnostic tools can help you figure out what caused the problem. Indeed, this is the principal use of most diagnostic tools. However, this approach is inherently reactive. It means waiting until a component fails outright.

Some diagnostic tools allow you to be more proactive by monitoring the system while it is still "healthy." Monitoring tools give administrators early warning of imminent failure, thereby allowing planned maintenance and better system availability. Remote monitoring also allows administrators the convenience of checking on the status of many machines from one centralized location.

Sun provides the Advanced Lights Out Manager (ALOM) software that you can use to monitor servers.

In addition to that tool, Sun provides software-based and firmware-based commands that display various kinds of system information. While not strictly monitoring tools, these commands enable you to review at a glance the status of different system aspects and components.

This chapter describes the tasks necessary to use these tools to monitor your Netra 440 server.

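Tasks covered in this chapter include:

- Monitoring the System Using Sun Advanced Lights Out Manager
- Using Solaris System Information Commands
- Using OpenBoot Information Commands
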
If you want background information about the tools, turn to Chapter 2.



Note - Many of the procedures in this chapter assume that you are familiar with the OpenBoot firmware and that you know how to access the ok prompt. For background information, refer to the Netra 440 Server System Administration Guide.




Monitoring the System Using Sun Advanced Lights Out Manager

This section explains how to use Advanced Lights Out Manager (ALOM) to monitor a Netra 440 server, and steps you through some of the tool's most important features.

For background information about ALOM, see Chapter 2.

There are several ways to connect to and use the ALOM system controller, depending on how your data center and its network are set up. This procedure assumes that you intend to monitor the Netra 440 system by way of an alphanumeric terminal or terminal server connected to the server's SERIAL MGT port, or by using a telnet connection to the NET MGT port.

The procedure also assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.


To Monitor the System Using Sun Advanced Lights Out Manager

1. Log in to the system console and access the ok prompt.

2. If necessary, type the system controller escape sequence.

If you are not already seeing the sc> prompt, type the system controller escape sequence. By default, this sequence is #. (pound-period).

ok #.

3. If necessary, log in to ALOM.

If you are not logged in to ALOM, you will be prompted to do so:

Please login: admin
Please Enter password: ******

Enter the admin account login name and password, or the name and password of a different login account if one has been set up for you. For the purposes of this procedure, your account should have full privileges.



Note - The first time you access ALOM, there is no admin account password. You are instructed to provide one the first time you attempt to execute a privileged command. Note the password you enter and retain it for future use.



The sc> prompt appears:

sc>

This prompt indicates that you now have access to the ALOM system controller command-line interface.

4. At the sc> prompt, type the showenvironment command.

sc> showenvironment

This command displays a great deal of useful data, starting with temperature readings from a number of thermal sensors.

CODE EXAMPLE 4-1 ALOM Reports on System Temperatures
=============== Environmental Status ===============
 
 
------------------------------------------------------------------------------
System Temperatures (Temperatures in Celsius):
------------------------------------------------------------------------------
Sensor         Status    Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
------------------------------------------------------------------------------
C0.P0.T_CORE    OK         48    -20     -10       0      97      102      120
C1.P0.T_CORE    OK         53    -20     -10       0      97      102      120
C2.P0.T_CORE    OK         49    -20     -10       0      97      102      120
C3.P0.T_CORE    OK         57    -20     -10       0      97      102      120
C0.T_AMB        OK         28    -20     -10       0      70       82       87
C1.T_AMB        OK         33    -20     -10       0      70       82       87
C2.T_AMB        OK         27    -20     -10       0      70       82       87
C3.T_AMB        OK         28    -20     -10       0      70       82       87
MB.T_AMB        OK         32    -18     -10       0      65       75       85



Note - The warning and soft graceful shutdown thresholds noted in CODE EXAMPLE 4-1 are set at the factory and cannot be modified.



The sensors labeled T_AMB in CODE EXAMPLE 4-1 measure ambient temperatures at the CPU/memory modules, the motherboard, and the SCSI backplane. The sensors labeled T_CORE measure the internal temperatures of the processor chips themselves.

In the output shown in CODE EXAMPLE 4-1, MB refers to the motherboard, and Cn refers to a particular CPU. For information about identifying CPU modules, see Identifying CPU/Memory Modules.
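
As a rough illustration of how the threshold columns relate to each reading, the following Python sketch parses a few rows copied from CODE EXAMPLE 4-1 and computes how far each sensor is below its HighWarn threshold. It is not part of ALOM. The names parse_temperatures, headroom, and TEMP_SAMPLE are hypothetical, and the sketch assumes the nine-column layout shown in the example.

```python
# Minimal sketch: parse showenvironment temperature rows (layout assumed from
# CODE EXAMPLE 4-1) and report the margin below the HighWarn threshold.

TEMP_SAMPLE = """\
C0.P0.T_CORE    OK         48    -20     -10       0      97      102      120
C3.P0.T_CORE    OK         57    -20     -10       0      97      102      120
C1.T_AMB        OK         33    -20     -10       0      70       82       87
MB.T_AMB        OK         32    -18     -10       0      65       75       85
"""

FIELDS = ("temp", "low_hard", "low_soft", "low_warn",
          "high_warn", "high_soft", "high_hard")


def parse_temperatures(text):
    """Return a dict mapping sensor name to its status and integer readings."""
    sensors = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 9:
            continue  # rulers, titles, and blank lines
        name, status, *values = parts
        try:
            numbers = [int(value) for value in values]
        except ValueError:
            continue  # the column header row also has nine tokens
        sensors[name] = {"status": status, **dict(zip(FIELDS, numbers))}
    return sensors


def headroom(sensor):
    """Degrees Celsius remaining before the HighWarn threshold is reached."""
    return sensor["high_warn"] - sensor["temp"]


if __name__ == "__main__":
    for name, data in parse_temperatures(TEMP_SAMPLE).items():
        print(f"{name:14s} {data['temp']:3d} C, {headroom(data):3d} C below HighWarn")
```

A script like this could be fed captured showenvironment output to track which sensors run closest to their warning thresholds over time.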

The showenvironment command also gives the position of the system control rotary switch and the condition of the three LEDs on the front panel.

CODE EXAMPLE 4-2 ALOM Reports on Rotary Switch Position and System Status LEDs
--------------------------------------
Front Status Panel:
--------------------------------------
Rotary Switch position: NORMAL
 
---------------------------------------------------
System Indicator Status:
---------------------------------------------------
SYS.LOCATE           SYS.SERVICE          SYS.ACT             
--------------------------------------------------------
OFF                  OFF                  ON                  

The showenvironment command reports the status of system disks and fans.

CODE EXAMPLE 4-3 ALOM Reports on System Disks and Fans
--------------------------------------------
System Disks:
--------------------------------------------
Disk   Status            Service  OK2RM
--------------------------------------------
HDD0   OK                OFF      OFF
HDD1   OK                OFF      OFF
HDD2   OK                OFF      OFF
HDD3   OK                OFF      OFF
 
----------------------------------------------------------
Fans (Speeds Revolution Per Minute):
----------------------------------------------------------
Sensor           Status           Speed   Warn    Low
----------------------------------------------------------
FT0.F0.TACH      OK                3879   2400    750
FT1.F0.TACH      OK                3947   2400    750
FT2.F0.TACH      OK                4017   2400    750
FT3.F0           OK                  --     --     --
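
The fan table can be read the same way as any other sensor table. The Python sketch below parses the rows from CODE EXAMPLE 4-3 and lists any fan whose speed is below its Warn value. It treats "--" as "no reading" and assumes that a speed lower than Warn is what triggers a warning; that interpretation, and the names parse_fans and slow_fans, are assumptions made for illustration.

```python
# Minimal sketch: parse the fan rows of showenvironment output (layout assumed
# from CODE EXAMPLE 4-3) and list fans running below their Warn speed.

FAN_SAMPLE = """\
FT0.F0.TACH      OK                3879   2400    750
FT1.F0.TACH      OK                3947   2400    750
FT2.F0.TACH      OK                4017   2400    750
FT3.F0           OK                  --     --     --
"""


def to_int(token):
    """Convert a table cell to int, mapping the '--' placeholder to None."""
    return None if token == "--" else int(token)


def parse_fans(text):
    """Return a list of dicts, one per fan row; non-data lines are skipped."""
    fans = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue
        name, status, speed, warn, low = parts
        try:
            fans.append({"name": name, "status": status, "speed": to_int(speed),
                         "warn": to_int(warn), "low": to_int(low)})
        except ValueError:
            continue  # title and header lines can also split into five tokens
    return fans


def slow_fans(fans):
    """Names of fans that report a speed below their Warn threshold."""
    return [fan["name"] for fan in fans
            if fan["speed"] is not None and fan["warn"] is not None
            and fan["speed"] < fan["warn"]]


if __name__ == "__main__":
    print(slow_fans(parse_fans(FAN_SAMPLE)))
```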

Voltage sensors located on the motherboard monitor important system voltages, and showenvironment reports these.

CODE EXAMPLE 4-4 ALOM Reports on Motherboard Voltages
-----------------------------------------------------------------------------
Voltage sensors (in Volts):
-----------------------------------------------------------------------------
Sensor         Status       Voltage LowSoft LowWarn HighWarn HighSoft
-----------------------------------------------------------------------------
MB.V_+1V5      OK             1.49    1.20    1.27    1.72     1.80
MB.V_VCCTM     OK             2.53    2.00    2.12    2.87     3.00
MB.V_NET0_1V2D OK             1.26    0.96    1.02    1.38     1.44
MB.V_NET1_1V2D OK             1.26    0.96    1.02    1.38     1.44
MB.V_NET0_1V2A OK             1.26    0.96    1.02    1.38     1.44
MB.V_NET1_1V2A OK             1.25    0.96    1.02    1.38     1.44
MB.V_+3V3      OK             3.33    2.64    2.80    3.79     3.96
MB.V_+3V3STBY  OK             3.33    2.64    2.80    3.79     3.96
MB.BAT.V_BAT   OK             3.07      --    2.25      --       --
MB.V_SCSI_CORE OK             1.80    1.44    1.53    2.07     2.16
MB.V_+5V       OK             5.02    4.00    4.25    5.75     6.00
MB.V_+12V      OK            12.00    9.60   10.20   13.80    14.40
MB.V_-12V      OK           -11.96  -14.40  -13.80  -10.20    -9.60



Note - The warning and soft graceful shutdown thresholds noted in CODE EXAMPLE 4-4 are set at the factory and cannot be modified.
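
To make the relationship between a voltage reading and its four thresholds concrete, the following Python sketch classifies a reading as ok, warning, or soft-shutdown. It uses rows copied from CODE EXAMPLE 4-4. Whether ALOM treats a reading exactly equal to a threshold as a violation is not stated in this chapter, so the inclusive comparison here is an assumption, as are the names parse_voltages and classify.

```python
# Minimal sketch: classify voltage readings against the LowSoft, LowWarn,
# HighWarn, and HighSoft columns (layout assumed from CODE EXAMPLE 4-4).

VOLT_SAMPLE = """\
MB.V_+3V3      OK             3.33    2.64    2.80    3.79     3.96
MB.BAT.V_BAT   OK             3.07      --    2.25      --       --
MB.V_-12V      OK           -11.96  -14.40  -13.80  -10.20    -9.60
"""


def to_float(token):
    """Convert a table cell to float, mapping the '--' placeholder to None."""
    return None if token == "--" else float(token)


def parse_voltages(text):
    """Map sensor name to (voltage, low_soft, low_warn, high_warn, high_soft)."""
    readings = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 7:
            continue
        try:
            values = tuple(to_float(part) for part in parts[2:])
        except ValueError:
            continue  # header row
        if values[0] is None:
            continue  # no reading available for this sensor
        readings[parts[0]] = values
    return readings


def classify(voltage, low_soft, low_warn, high_warn, high_soft):
    """Return 'soft-shutdown', 'warning', or 'ok' (inclusive limits assumed)."""
    def at_or_below(limit):
        return limit is not None and voltage <= limit

    def at_or_above(limit):
        return limit is not None and voltage >= limit

    if at_or_below(low_soft) or at_or_above(high_soft):
        return "soft-shutdown"
    if at_or_below(low_warn) or at_or_above(high_warn):
        return "warning"
    return "ok"


if __name__ == "__main__":
    for name, values in parse_voltages(VOLT_SAMPLE).items():
        print(name, classify(*values))
```

Note that the comparison works unchanged for the negative rail, because the thresholds for MB.V_-12V are listed in ascending numeric order just like the others.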



The showenvironment command tells you the status of each power supply, and the state of the LEDs located on each supply.

CODE EXAMPLE 4-5 ALOM Reports on Power Supply Status
--------------------------------------------
Power Supply Indicators: 
--------------------------------------------
Supply    Active  Service  OK-to-Remove
--------------------------------------------
PS0       ON      OFF      OFF
PS1       ON      OFF      OFF
PS2       ON      OFF      OFF
PS3       ON      OFF      OFF
 
------------------------------------------------------------------------------
Power Supplies:
------------------------------------------------------------------------------
Supply  Status          Underspeed  Overtemp  Overvolt  Undervolt Overcurrent
------------------------------------------------------------------------------
PS0     OK              OFF         OFF       OFF       OFF        OFF
PS1     OK              OFF         OFF       OFF       OFF        OFF
PS2     OK              OFF         OFF       OFF       OFF        OFF
PS3     OK              OFF         OFF       OFF       OFF        OFF

The showenvironment command also reports on the status of motherboard circuit breakers (labeled MB.FF_SCSIx) and CPU module DC-to-DC converters (labeled Cn.P0.FF_POK).

CODE EXAMPLE 4-6 ALOM Reports on Circuit Breakers and DC-to-DC Converters
----------------------
Current sensors: 
----------------------
Sensor          Status
----------------------
MB.FF_SCSIA      OK
MB.FF_SCSIB      OK
MB.FF_POK        OK
C0.P0.FF_POK     OK
C1.P0.FF_POK     OK
C2.P0.FF_POK     OK
C3.P0.FF_POK     OK

Finally, this command tells you the status of the system alarms.

CODE EXAMPLE 4-7 ALOM Reports on System Alarms
--------------------------------------------
System Alarms:
--------------------------------------------
Alarm                   Relay           LED 
--------------------------------------------
ALARM.CRITICAL          OFF             OFF
ALARM.MAJOR             OFF             OFF
ALARM.MINOR             OFF             OFF
ALARM.USER              OFF             OFF

5. Type the showfru command.

sc> showfru

This command, like the Solaris OS command prtfru -c, displays the static FRU-ID information that is available for several system FRUs. The information provided includes the date and location of manufacture, and the Sun part number.

CODE EXAMPLE 4-8 ALOM Reports on FRU Identification Information
FRU_PROM at PS0.SEEPROM
  Timestamp: MON SEP 16 16:47:05 2002
  Description: PWR SUPPLY, SYSTEM,75%-EFF,H-P
  Manufacture Location: DELTA ELECTRONICS CHUNGLI TAIWAN
  Sun Part No: 3001501
  Sun Serial No: T00065
  Vendor JEDEC code: 3AD
  Initial HW Dash Level: 01
  Initial HW Rev Level: 02
  Shortname: PS

6. Type the showlogs command.

sc> showlogs

This command shows a history of noteworthy system events, the most recent being listed last.

CODE EXAMPLE 4-9 ALOM Reports on Logged Events
FEB 28 19:45:06 myhost: 0006001a: "SC Host Watchdog Reset Disabled"
FEB 28 19:45:06 myhost: 00060003: "SC System booted."
FEB 28 19:45:43 myhost: 00060000: "SC Login: User admin Logged on."
FEB 28 19:45:51 myhost: 0004000e: "SC Request to Power Off Host Immediately."
FEB 28 19:45:55 myhost: 00040002: "Host System has Reset"
FEB 28 19:45:56 myhost: 00040029: "Host system has shut down."
FEB 28 19:46:16 myhost: 00040001: "SC Request to Power On Host."
FEB 28 19:46:18 myhost: 0004000b: "Host System has read and cleared bootmode."
FEB 28 19:55:17 myhost: 00060000: "SC Login: User admin Logged on."
FEB 28 19:56:59 myhost: 00060000: "SC Login: User admin Logged on."
FEB 28 20:27:06 myhost: 0004004f: "Indicator SYS_FRONT.ACT is now ON"
FEB 28 20:40:47 myhost: 00040002: "Host System has Reset"
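
Each logged event follows the same pattern: a timestamp, the host name, an eight-digit hexadecimal event code, and a quoted message. The Python sketch below splits lines of that shape into fields and counts events by code. The regular expression is inferred only from the sample lines in CODE EXAMPLE 4-9, and the names LogEvent and parse_showlogs are hypothetical.

```python
# Minimal sketch: split showlogs lines into fields and count events by code.
# The line format is inferred from CODE EXAMPLE 4-9, not from an ALOM spec.
import re
from collections import Counter
from typing import NamedTuple

LOG_SAMPLE = """\
FEB 28 19:45:06 myhost: 0006001a: "SC Host Watchdog Reset Disabled"
FEB 28 19:45:43 myhost: 00060000: "SC Login: User admin Logged on."
FEB 28 19:45:51 myhost: 0004000e: "SC Request to Power Off Host Immediately."
FEB 28 19:55:17 myhost: 00060000: "SC Login: User admin Logged on."
FEB 28 20:40:47 myhost: 00040002: "Host System has Reset"
"""

LINE_PATTERN = re.compile(
    r'^(?P<month>[A-Z]{3}) +(?P<day>\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}) '
    r'(?P<host>\S+): (?P<code>[0-9a-f]{8}): "(?P<message>.*)"$'
)


class LogEvent(NamedTuple):
    month: str
    day: int
    time: str
    host: str
    code: str
    message: str


def parse_showlogs(text):
    """Return a list of LogEvent tuples; lines that do not match are ignored."""
    events = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            fields = match.groupdict()
            fields["day"] = int(fields["day"])
            events.append(LogEvent(**fields))
    return events


if __name__ == "__main__":
    counts = Counter(event.code for event in parse_showlogs(LOG_SAMPLE))
    for code, count in counts.most_common():
        print(code, count)
```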



Note - The ALOM log messages are written into a so-called "circular buffer" of limited length (64 kilobytes). Once the buffer is filled, the oldest messages are overwritten by the newest ones.
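
The behavior of a size-limited circular buffer can be sketched in a few lines of Python. The class below drops the oldest messages once the total size exceeds a byte capacity. It only illustrates the general idea: how ALOM implements its buffer internally is not described here, and the CircularLog name and the whole-message eviction policy are assumptions.

```python
# Minimal sketch of a byte-limited circular log: once the capacity is
# exceeded, the oldest messages are discarded to make room for new ones.
from collections import deque


class CircularLog:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.size = 0            # total bytes currently stored
        self.messages = deque()  # oldest message at the left

    def append(self, message):
        """Store a message, evicting the oldest ones if the buffer is full."""
        length = len(message.encode("utf-8"))
        if length > self.capacity:
            raise ValueError("message is larger than the whole buffer")
        self.messages.append(message)
        self.size += length
        while self.size > self.capacity:
            oldest = self.messages.popleft()
            self.size -= len(oldest.encode("utf-8"))


if __name__ == "__main__":
    log = CircularLog(64 * 1024)  # 64 kilobytes, as stated in the note above
    log.append("SC System booted.")
    print(list(log.messages))
```

The practical consequence is the one the note describes: on a busy system, old events eventually disappear, so events you need to keep should be copied elsewhere before they are overwritten.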



7. Examine the ALOM run log. Type:

sc> consolehistory run -v

This command shows the run log, which contains the most recent system console output, including messages from POST, the OpenBoot PROM, and the Solaris boot sequence. The log also records output from the server's operating system.

CODE EXAMPLE 4-10 consolehistory run -v Command Output
May  9 14:48:22 Sun-SFV440-a rmclomv: SC Login: User admin Logged on.
 
# 
# init 0
# 
INIT: New run level: 0
The system is coming down.  Please wait.
System services are now being stopped.
Print services stopped.
May  9 14:49:18 Sun-SFV440-a last message repeated 1 time
 
May  9 14:49:38 Sun-SFV440-a syslogd: going down on signal 15
 
The system is down.
syncing file systems... done
Program terminated
{1} ok boot disk
 
Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.
 
Initializing     1MB of memory at addr        123fecc000 -
                                                                      
Initializing     1MB of memory at addr        123fe02000 -
                                                                      
Initializing    14MB of memory at addr        123f002000 -
                                                                      
Initializing    16MB of memory at addr        123e002000 -
                                                                      
Initializing   992MB of memory at addr        1200000000 -
                                                                      
Initializing  1024MB of memory at addr        1000000000 -
                                                                      
Initializing  1024MB of memory at addr         200000000 -
                                                                      
Initializing  1024MB of memory at addr                 0 -
                                                                      
Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0  File and args: 
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc.  All rights reserved.
Hardware watchdog enabled
Indicator SYS_FRONT.ACT is now ON
configuring IPv4 interfaces: ce0.
Hostname: Sun-SFV440-a
The system is coming up.  Please wait.
NIS domainname is Ecd.East.Sun.COM
Starting IPv4 router discovery.
starting rpc services: rpcbind keyserv ypbind done.
Setting netmask of lo0 to 255.0.0.0
Setting netmask of ce0 to 255.255.255.0
Setting default IPv4 interface for multicast: add net 224.0/4: gateway Sun-SFV440-a
syslog service starting.
Print services started.
volume management starting.
The system is ready.
 
Sun-SFV440-a console login: May  9 14:52:57 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = UNKNOWN
 
May  9 14:52:57 Sun-SFV440-a rmclomv: Keyswitch Position has changed to Unknown state.
 
May  9 14:52:58 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = LOCKED
 
May  9 14:52:58 Sun-SFV440-a rmclomv: KeySwitch Position has changed to Locked State.
 
May  9 14:53:00 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = NORMAL
 
May  9 14:53:01 Sun-SFV440-a rmclomv: KeySwitch Position has changed to On State.
 
sc> 

8. Examine the ALOM boot log. Type:

sc> consolehistory boot -v

The ALOM boot log contains boot messages from POST, OpenBoot firmware, and Solaris software from the host server's most recent reset.

The following sample output shows the boot messages from POST.

CODE EXAMPLE 4-11 consolehistory boot -v Command Output (Boot Messages From POST)
Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs 
Power-On Reset
Executing Power On SelfTest
 
0>@(#) Sun Fire[TM] V440 POST 4.10.3 2003/05/04 22:08 
       /export/work/staff/firmware_re/post/post-build-4.10.3/Fiesta/system/integrated  (firmware_re)  
0>Hard Powerup RST thru SW
0>CPUs present in system: 0 1
0>OBP->POST Call with %o0=00000000.01012000.
0>Diag level set to MIN.
0>MFG scrpt mode set NORM 
0>I/O port set to TTYA.
0>
0>Start selftest...
1>Print Mem Config
1>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
1>Memory interleave set to 0
1>      Bank 0 1024MB : 00000010.00000000 -> 00000010.40000000.
1>      Bank 2 1024MB : 00000012.00000000 -> 00000012.40000000.
0>Print Mem Config
0>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
0>Memory interleave set to 0
0>      Bank 0 1024MB : 00000000.00000000 -> 00000000.40000000.
0>      Bank 2 1024MB : 00000002.00000000 -> 00000002.40000000.
0>INFO:
0>      POST Passed all devices.
0>
0>POST: Return to OBP.

The following sample output shows the initialization of the OpenBoot PROM.

CODE EXAMPLE 4-12 consolehistory boot -v Command Output (OpenBoot PROM Initialization)
Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs 
POST Results: Cpu 0000.0000.0000.0000 
  %o0 = 0000.0000.0000.0000 %o1 = ffff.ffff.f00a.2b73 %o2 = ffff.ffff.ffff.ffff 
POST Results: Cpu 0000.0000.0000.0001 
  %o0 = 0000.0000.0000.0000 %o1 = ffff.ffff.f00a.2b73 %o2 = ffff.ffff.ffff.ffff 
Membase: 0000.0000.0000.0000 
MemSize: 0000.0000.0004.0000 
Init CPU arrays Done
Probing /pci@1d,700000 Device 1  Nothing there 
Probing /pci@1d,700000 Device 2  Nothing there 

The following sample output shows the system banner.

CODE EXAMPLE 4-13 consolehistory boot -v Command Output (System Banner Display)
Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.

The following sample output shows OpenBoot Diagnostics testing.

CODE EXAMPLE 4-14 consolehistory boot -v Command Output (OpenBoot Diagnostics Testing)
Running diagnostic script obdiag/normal
 
Testing /pci@1f,700000/network@1
Testing /pci@1e,600000/ide@d
Testing /pci@1e,600000/isa@7/flashprom@2,0
Testing /pci@1e,600000/isa@7/serial@0,2e8
Testing /pci@1e,600000/isa@7/serial@0,3f8
Testing /pci@1e,600000/isa@7/rtc@0,70
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={gpio@0.42,gpio@0.44,gpio@0.46,gpio@0.48}
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={hardware-monitor@0.5c}
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={temperature-sensor@0.9c}
Testing /pci@1c,600000/network@2
Testing /pci@1f,700000/scsi@2,1
Testing /pci@1f,700000/scsi@2

The following sample output shows memory initialization by the OpenBoot PROM.

CODE EXAMPLE 4-15 consolehistory boot -v Command Output (Memory Initialization)
Initializing     1MB of memory at addr        123fe02000 -
                                                                      
Initializing    12MB of memory at addr        123f000000 -
                                                                      
Initializing  1008MB of memory at addr        1200000000 -
                                                                      
Initializing  1024MB of memory at addr        1000000000 -
                                                                      
Initializing  1024MB of memory at addr         200000000 -
                                                                      
Initializing  1024MB of memory at addr                 0 -
 
{1} ok boot disk

The following sample output shows the system booting and loading Solaris software.

CODE EXAMPLE 4-16 consolehistory boot -v Command Output (System Booting and Loading Solaris Software)
Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0  File and args: 
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54. 
FCode UFS Reader 1.11 97/07/10 16:19:15. 
Loading: /platform/SUNW,Sun-Fire-V440/ufsboot
Loading: /platform/sun4u/ufsboot
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc.  All rights reserved.
Hardware watchdog enabled
sc> 

9. Type the showusers command.

sc> showusers

This command displays all the users currently logged in to ALOM.

CODE EXAMPLE 4-17 ALOM Reports on Active User Sessions
username         connection       login time       client IP addr   console
--------         ----------       ----------       --------------   -------
admin            serial           FEB 28 19:45                      system
admin            net-1            MAR 03 14:43     129.111.111.111
sc> 

In this case, two administrative users are logged in at the same time. The first is logged in through the SERIAL MGT port and has access to the system console. The second is logged in through a telnet connection from another host to the NET MGT port. The second user can view the system console session but cannot input console commands.

10. Type the showplatform command.

sc> showplatform

This command displays the status of the operating system, which may be Running, Stopped, Initializing, or in a handful of other states.

CODE EXAMPLE 4-18 ALOM Reports on Operating System Status
SUNW,Netra-440
 
Domain         Status
------         ------
vsp75-202-priv OS Running
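
If you capture showplatform output in a script, the platform name and the per-domain status can be separated as in the following Python sketch. The layout is assumed from CODE EXAMPLE 4-18 alone, and parse_showplatform is a hypothetical helper, not an ALOM or Solaris interface.

```python
# Minimal sketch: split showplatform output into the platform name and a
# mapping of domain name to status (layout assumed from CODE EXAMPLE 4-18).

PLATFORM_SAMPLE = """\
SUNW,Netra-440

Domain         Status
------         ------
vsp75-202-priv OS Running
"""


def parse_showplatform(text):
    """Return (platform, {domain: status}) from showplatform output."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    platform = lines[0] if lines else None
    domains = {}
    in_table = False
    for line in lines[1:]:
        if set(line.replace(" ", "")) == {"-"}:
            in_table = True  # the dashed ruler marks the start of the rows
            continue
        if in_table:
            domain, _, status = line.partition(" ")
            domains[domain] = status.strip()
    return platform, domains


if __name__ == "__main__":
    platform, domains = parse_showplatform(PLATFORM_SAMPLE)
    print(platform, domains)
```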

11. Use ALOM to run POST diagnostics.

Doing this involves several steps.

a. Type:

sc> bootmode diag

This command temporarily overrides the server's OpenBoot Diagnostics diag-switch? setting, forcing power-on self-test (POST) diagnostics to run when power is cycled off and on. If the server is not power cycled within 10 minutes, the override expires and the server reverts to its defaults.

b. Power cycle the system. Type:

sc> poweroff
 
Are you sure you want to power off the system [y/n]? y
 
sc> poweron

POST diagnostics begin to run as the system reboots. However, you will see no messages until you switch from ALOM to the system console. For details, refer to the Netra 440 Server System Administration Guide.

c. Switch to the system console. Type:

sc> console
Enter #. to return to ALOM.
 
0>@(#) Sun Fire[TM] V440 POST 4.10.0 2003/04/01 22:28 
 
/export/work/staff/firmware_re/post/post-build-4.10.0/Fiesta/system/integrated  (firmware_re)  
0>Hard Powerup RST thru SW
0>CPUs present in system: 0 1 2 3
0>OBP->POST Call with %o0=00000000.01008000.

You should begin seeing console output and POST messages. The exact text that appears on your screen depends on the state of your Netra 440 server, and on how long you delay between powering on the system and switching to the system console.



Note - Any system console or POST messages you might miss are preserved in the ALOM boot log. To access the boot log, type consolehistory boot -v from the sc> prompt.



For more information about ALOM command-line functions, refer to the Advanced Lights Out Manager User's Guide.

For more information about controlling POST diagnostics, see Controlling POST Diagnostics.

For information about interpreting POST error messages, see What POST Error Messages Tell You.


Using Solaris System Information Commands

This section explains how to run Solaris system information commands on a Netra 440 server. To find out what these commands tell you, see Solaris System Information Commands, or see the appropriate man pages.


To Use Solaris System Information Commands

1. Decide what kind of system information you want to display.

For more information, see Solaris System Information Commands.

2. Type the appropriate command at a system console prompt. See TABLE 4-1.

TABLE 4-1 Using Solaris System Information Commands

Command  What It Displays                            What to Type                           Notes
-------  ----------------                            ------------                           -----
prtconf  System configuration information            /usr/sbin/prtconf                      --
prtdiag  Diagnostic and configuration information    /usr/platform/`uname -i`/sbin/prtdiag  Use the -v option for additional detail.
prtfru   FRU hierarchy and SEEPROM memory contents   /usr/sbin/prtfru                       Use the -l option to display hierarchy.
                                                                                            Use the -c option to display SEEPROM data.
psrinfo  Date and time each CPU came online;         /usr/sbin/psrinfo                      Use the -v option to obtain clock speed and other data.
         processor clock speed
showrev  Hardware and software revision information  /usr/bin/showrev                       Use the -p option to show software patches.
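
The prtdiag path in TABLE 4-1 depends on the platform name that uname -i reports, which is why the table shows it with backquotes. The Python sketch below shows the same substitution done explicitly. The function names are hypothetical, and the platform string in the usage comment is taken from the showplatform sample earlier in this chapter purely as an illustration.

```python
# Minimal sketch: build the platform-specific prtdiag command line that the
# shell would produce from /usr/platform/`uname -i`/sbin/prtdiag.
import subprocess


def current_platform():
    """Return the platform name reported by `uname -i` on this host."""
    result = subprocess.run(["uname", "-i"], capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()


def prtdiag_command(platform, verbose=False):
    """Return the argument list for prtdiag on the given platform."""
    command = [f"/usr/platform/{platform}/sbin/prtdiag"]
    if verbose:
        command.append("-v")  # -v requests additional detail
    return command


# Example (not executed here):
#   prtdiag_command("SUNW,Netra-440", verbose=True)
#   -> ["/usr/platform/SUNW,Netra-440/sbin/prtdiag", "-v"]
```

On the server itself, passing current_platform() as the platform argument reproduces what the backquoted form does in the shell.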



Using OpenBoot Information Commands

This section explains how to run OpenBoot commands that display different kinds of system information about a Netra 440 server. To find out what these commands tell you, see Other OpenBoot Commands, or refer to the appropriate man pages.

As long as you can get to the ok prompt, you can use OpenBoot information commands. This means the commands are usually accessible even when your system cannot boot its operating system software.


To Use OpenBoot Information Commands

1. If necessary, shut down the system to reach the ok prompt.

How you do this depends on the system's condition. If possible, you should warn users and shut down the system gracefully. For information, refer to the Netra 440 Server System Administration Guide.

2. Decide what kind of system information you want to display.

For more information, see Other OpenBoot Commands.

3. Type the appropriate command at the ok prompt. See TABLE 4-2.

TABLE 4-2 Using OpenBoot Information Commands

Command to Type   What It Displays
---------------   ----------------
printenv          OpenBoot configuration variable defaults and settings
probe-scsi        Target address, unit number, device type, and manufacturer
probe-scsi-all    name of active SCSI and IDE devices
probe-ide
show-devs         Hardware device paths of all devices in the system configuration