Troubleshooting and Diagnostics
Before troubleshooting your specific server problem, collect the following information:
- What events occurred prior to the failure?
- Was any hardware or software modified or installed?
- Was the server recently installed or moved?
- How long has the server exhibited symptoms?
- What is the duration or frequency of the problem?
The guidelines in Preventive Troubleshooting will help you prevent problems from occurring and will make troubleshooting easier.
After you have assessed the problem and noted your current configuration and environment, you can choose from the troubleshooting approaches described in the sections that follow.
3.1 Preventive Troubleshooting
Creating and following procedures can help prevent problems and make troubleshooting easier.
Follow these guidelines for preventive troubleshooting:
- Use uniform naming conventions for your servers, such as names that denote server location. Uniform naming conventions help when you try to remember often-overlooked details that can hold the key to resolving a crisis.
- Use unique IDs or names for your devices. Keeping a list of the assigned IDs or names reduces the risk of components competing for the same resource. Use the server setup utility to check for conflicts.
- Create a backup plan. Schedule backups based on the needs of your server: if data changes frequently, frequent backups are required. Maintain a library of backups based on your restore requirements, and test your backups periodically to be sure that your data is correctly stored.
- Use enterprise-systems management tools to automate the following processes or manually track this information:
- Check hard disk space periodically. It is recommended that hard drives have a minimum of 15 percent of free space. (A sample command-line check is shown after this list.)
- Keep historical data. You will not know that the CPU utilization has increased 50 percent if you do not know what it was initially. If you have problems, you can use the data to compare before and after scenarios. For example, you might want to know about the user, bus and power utilization rates.
- Keep a trend analysis so that you will know what to expect at certain times. For example, if the CPU utilization rate always increases by 50 percent during certain hours, you will know that increase is normal for the server you are tracking.
- Create a problem-resolution notebook. When problems do occur, keep a log of the actions you take to resolve them. This record can help you solve the same problem more quickly later and ensures accuracy, especially when dealing with future part replacement.
- Keep an updated network-topology map in an accessible location. This will help in troubleshooting networking problems.
- Most problems occur when something in the server has changed. When making changes to your server, follow these guidelines:
- Document the system settings. If the system configuration will change, first obtain a record of the current system-configuration settings.
- If possible, make changes one at a time to isolate problems should they occur. This enables you to maintain a controlled environment and reduces the scope of any troubleshooting. Record the results of each change, including any errors or informational messages.
- Check for potential device conflicts before adding a new device. Check for any potential version dependencies, especially with third-party software.
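As a sample of the disk-space guideline earlier in this list, the following one-line check is a minimal sketch for a Linux platform; the 85 percent threshold simply mirrors the 15 percent free-space recommendation, and you should adjust it (and the filesystems checked) to suit your environment:
# df -P | awk 'NR > 1 && $5+0 > 85 {print $6 " is over 85% full (" $5 " used)"}'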
3.2 Visually Inspecting Your System
Improperly set controls and loose or improperly connected cables are common causes of problems with hardware components. When investigating a system problem, first check all the external switches, controls and cable connections. See External Visual Inspection.
If this does not resolve your problem, then visually inspect the system's interior hardware for problems such as a loose card, cable connector or mounting screw. See Internal Visual Inspection.
3.2.1 External Visual Inspection
To visually inspect the external system, follow these steps:
1. Note the state of the system-fault LED on the front of the server.
The system-fault LED blinks when a severe system fault is detected.
- See FIGURE 1-1 for the location of the Sun Fire V20z system-fault LED.
- See FIGURE 1-4 for the location of the Sun Fire V40z system-fault LED.
Several conditions can result in the system-fault LED turning on. See System-Fault LED for a description of these conditions, how to view the cause of the fault and how to reset the LED.
2. Power off the system and any attached peripherals (if applicable).
3. Verify that all power cables are properly connected to the system, the monitor and peripherals, and check their power sources.
4. Inspect connections to any attached devices, including network cables, keyboard, monitor and mouse, as well as any devices attached to the serial port.
3.2.2 Internal Visual Inspection
To visually inspect the internal system, follow these steps:
Note - Before proceeding, read the safety instructions in the document, Important Safety Information About Sun Hardware Systems, which is shipped with your system.
1. Shut down the operating system, if necessary, and turn off the platform power on the front of the server.
2. Turn off the AC power by using one of the following two methods, depending on which server type you have:
- If you have a Sun Fire V20z server, turn off the AC power switch on the rear panel of the server (see FIGURE 1-2). Leave the AC power cord attached to the power supply to maintain system ground.
- If you have a Sun Fire V40z server, unplug the AC power cord(s) from the AC connectors on the server's power supply(s).
Caution - When you unplug the AC power cords from the Sun Fire V40z server power supplies to remove AC power, system ground is also removed. You must maintain an equal voltage potential to the machine to avoid electrostatic discharge damage to the machine.
3. Turn off power to any attached peripherals.
4. Remove the server cover.
For a Sun Fire V20z server, refer to Powering Off the Server and Removing the Cover.
For a Sun Fire V40z server, refer to Powering Off the Server and Removing the Cover.
Caution - Some components, such as the heatsink, can become extremely hot during system operations. Allow these components to cool before handling them.
5. Verify that the components are fully seated in their sockets or connectors and that sockets are clean.
6. Check all cable connectors inside the system to verify that they are firmly attached to their appropriate connectors.
7. Replace the server cover.
8. Reconnect the system and any attached peripherals to their power sources, then power them on.
3.3 Troubleshooting Dump Utility
You can also use the Troubleshooting Dump Utility (TDU), which captures the following information:
- System state table (SST)
- Hardware and software component version numbers
- Machine check register values
- CPU trace buffers
- CPU configuration space registers (CSRs)
- Event log file
- The last good configuration (LGC)
To run the Troubleshooting Dump Utility, type the following command:
# sp get tdulog
The Troubleshooting Dump Utility can take up to 15 minutes to run. The system prompt returns when the utility has completed.
The captured data is gathered and stored on the SP in a compressed tar file. Refer to the Sun Fire V20z and Sun Fire V40z Servers, Server Management Guide, for more information about the command and its options.
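If you manage the server remotely, you can also issue the command over SSH, following the same remote-invocation pattern used elsewhere in this guide; the SP address and user name shown here are placeholders:
# ssh spipaddress -l spusername sp get tdulog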
3.4 Diagnostics
Diagnostics are a set of tests that determine the health of the hardware in your server. These tests verify hardware functionality and indicate device failures. You can run the diagnostics tests to accomplish the following:
- Test and diagnose hardware functionality
- Locate hardware failures
- Isolate hardware and software faults
Before using diagnostics, three setup procedures are necessary:
1. Install the diagnostics by installing the server's Network Share Volume (NSV) software on a networked NFS server. See Installing the NSV and Diagnostics Software.
2. Mount the diagnostics tests onto your Sun Fire V20z or Sun Fire V40z server and update the diagnostics software. See Mounting the Diagnostics Tests.
3. Enable the diagnostics tests. See Enabling the Diagnostics Tests.
Caution - While running diagnostics on your server, do not interact with the Service Processor (SP) through the command-line interface or IPMI.
The sensor commands cannot be used reliably while the diagnostics are running. Issuing sensor commands while diagnostics are loaded may result in false or erroneous critical events being logged in the event log. The values returned by the sensors are not reliable in this case.
Note - When the diagnostics are launched on the platform, the system tries to mount the floppy drive. The following error is returned:
mount : Mounting /dev/fd0 on /mnt/floppy failed. No such device.
You can safely ignore this error message.
3.4.1 Installing the NSV and Diagnostics Software
1. Connect the SP of the Sun Fire V20z or Sun Fire V40z server to the same network as your NFS server.
See the Sun Fire V20z and Sun Fire V40z Servers Installation Guide for the location of the SP connectors and guidelines for connecting servers to management LANs.
2. Insert the Sun Fire V20z and Sun Fire V40z Servers Network Share Volume CD into the NFS server and mount the CD.
3. Copy the file that contains the diagnostics from the CD to the NFS server by typing the following command:
# cp -r /mnt/cdrom/NSV_file /mnt/nsv/
4. Change to the directory on the server that now contains the compressed NSV packages and extract them by typing the following commands:
# cd /mnt/nsv/
# unzip -a *.zip
Note - When unzipping a compressed file on a Linux platform, use the -a switch as shown to force text files to convert to the target operating system's appropriate end-of-line termination.
The extracted packages populate the following directories:
/mnt/nsv/
diags
logs
snmp
spupdate
5. Run the following commands to create the appropriate permissions within the diags directories:
# chmod 777 /mnt/nsv/diags/NSV_version_number/scripts
# chmod -R 755 /mnt/nsv/diags/NSV_version_number/mppc
6. Continue with Mounting the Diagnostics Tests.
3.4.2 Mounting the Diagnostics Tests
Before running the diagnostics tests, you need to mount the NSV software from the NFS server on which it is located.
1. Log in to the Sun Fire V20z or Sun Fire V40z server's SP via SSH by typing the following command at the NFS server's command prompt:
# ssh -l manager_or_higher_login SSH_hostname
Note - Verify that NFS is enabled on the network before going to the next step. On systems running Linux, this must be done manually. Refer to the documentation for the version of Linux you are running for the instructions on enabling NFS.
2. Mount the NSV onto the Sun Fire V20z or Sun Fire V40z server SP by typing the following command:
# sp add mount -r NFS_server_hostname:/directory_with_NSV_files -l /mnt
Note - If you did not set up the SP on a DHCP network, you must use the NFS_server_IP_address, rather than the NFS_server_hostname.
3. Go to the directory that contains the diagnostics files to list the available versions of diagnostics currently installed on the NSV:
# cd /mnt/diags
# ls -l
4. Update the diagnostics software by typing the following command:
# sp update diags -p /mnt/diags/DIAGS_version#
Where DIAGS_version# is the version of diagnostics you want to enable.
For example: V2.0.0.42
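Substituting that example version number for the placeholder, the complete command would look like this:
# sp update diags -p /mnt/diags/V2.0.0.42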
5. Continue with Enabling the Diagnostics Tests.
3.4.3 Enabling the Diagnostics Tests
Whenever a major component in the system does not function properly, you may have a component failure. As long as the microprocessor and the input and output components of the system (the monitor, keyboard and diskette drive) are working, you can run diagnostics.
To enable diagnostics on the SP from the NFS mount, execute one of the following commands:
- When the platform power is off, run the following command to boot the server and enable both platform and SP diagnostics:
# diags start
You can begin running diagnostics on the SP while the platform diagnostics are loading. You can use the diags get state command to determine whether the platform diagnostics are loaded.
- You can optionally enable diagnostics when the platform power is on and the OS is running, without rebooting the platform into diagnostics mode. This allows you to run your OS while simultaneously performing SP diagnostic testing. To do so, run the following command and option:
# diags start --noplatform
Note - If you use the --noplatform option, you cannot run any platform diagnostics, which include diagnostics for memory, NIC cards and storage.
Refer to Appendix C for more information about diags commands.
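As mentioned earlier in this section, you can confirm whether the platform diagnostics have finished loading by querying the diagnostics state from the SP prompt:
# diags get state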
If the NSV is mounted, but the diags command is not recognized, run the sp update diags command to adjust the path to the diagnostics software.
3.4.4 Listing Available Diagnostics Tests and Modules
To list the available tests and modules, type the following command:
# diags get tests
Tests are available for the following modules:
- Fans: Fan tests verify that each fan is rotating and the fan RPM is within the specified ranges.
Note - The power-supply fans are not testable by this diagnostic.
- Memory: Memory tests identify memory errors, address decoding faults and dataline faults.
- Network Controllers: An internal loopback test is available for NIC testing.
- Operator Panel: The operator panel tests verify the memory of the operator panel. The value and location of any errors are indicated.
- Slag: Slag tests are non-interactive tests that verify the correct operation of the LED drive circuitry.
- Storage: Storage tests invoke either a short or long self-test on any installed SCSI drives.
- Temperature: Temperature tests verify that each of the temperature sensors is functional and that the temperature is within the specified ranges.
- Voltage: Voltage tests check the power-supply and bulk voltages (generated by the VRMs associated with the CPUs and memory) to determine whether the voltage sensors are operating within their predefined limits.
- Power (for Sun Fire V40z only): Power diagnostics verify that the power distribution backplane and power supplies are functioning properly.
TABLE 3-1 lists the diagnostics modules and tests that are associated with each module in the original release of the Sun Fire V20z server (chassis part number [PN] 380-0979).
TABLE 3-2 lists the diagnostics modules and tests that were added or deleted in the updated release of the Sun Fire V20z server (chassis PN 380-1168).
Note - To see the current list of diagnostics modules and tests on your Sun Fire V20z server, run the SP command diags get tests. The SP automatically detects the release version of your system and returns the relevant set of tests.
TABLE 3-3 lists the diagnostics modules and tests that are associated with each module in a Sun Fire V40z server.
TABLE 3-1 Sun Fire V20z Server--Diagnostics Modules and Tests (original release of server)
Module | Test | Devices
fan | speed.fan1 | CPU 1 memory fan 1
fan | speed.fan2 | CPU 1 memory fan 2
fan | speed.fan3 | CPU 1 fan 1
fan | speed.fan4 | CPU 1 fan 2
fan | speed.fan5 | CPU 0 fan 1
fan | speed.fan6 | CPU 0 fan 2
memory | adjacency.allDimms | All DIMMs
memory | dataline.allDimms | All DIMMs
memory | pattern.allDimms | All DIMMs
nic | phyLoop.Nic.0 | Ethernet Port 0
nic | phyLoop.Nic.1 | Ethernet Port 1
opPanel | write.opPanel | Operator Panel
slag | toggleLED.CD | CD LED
slag | toggleLED.CPU0 | CPU 0 LED
slag | toggleLED.CPU0-DDR-VRM | CPU 0 DDR VRM
slag | toggleLED.CPU0-DIMM0 | CPU 0 DIMM 0
slag | toggleLED.CPU0-DIMM1 | CPU 0 DIMM 1
slag | toggleLED.CPU0-DIMM2 | CPU 0 DIMM 2
slag | toggleLED.CPU0-DIMM3 | CPU 0 DIMM 3
slag | toggleLED.CPU0-VRM | CPU 0 VRM
slag | toggleLED.CPU1 | CPU 1
slag | toggleLED.CPU1-DDR-VRM | CPU 1 DDR VRM
slag | toggleLED.CPU1-DIMM0 | CPU 1 DIMM 0
slag | toggleLED.CPU1-DIMM1 | CPU 1 DIMM 1
slag | toggleLED.CPU1-DIMM2 | CPU 1 DIMM 2
slag | toggleLED.CPU1-DIMM3 | CPU 1 DIMM 3
slag | toggleLED.CPU1-VRM | CPU 1 VRM
slag | toggleLED.Disk-0 | Disk 0 toggle LED
slag | toggleLED.Disk-1 | Disk 1 toggle LED
slag | toggleLED.Disk-Backplane | Disk backplane toggle LED
slag | toggleLED.Floppy | Floppy toggle LED
slag | toggleLED.LCD-Indicator | LCD indicator toggle LED
slag | toggleLED.Motherboard | Motherboard toggle LED
slag | toggleLED.PCI-0 | PCI 0 toggle LED
slag | toggleLED.PCI-1 | PCI 1 toggle LED
slag | toggleLED.Power-Supply | Power-supply toggle LED
storage | long.ATA0_0 | ATA0 0 drive
storage | long.ATA0_1 | ATA0 1 drive
storage | long.SCSI_0 | SCSI 0 drive
storage | long.SCSI_1 | SCSI 1 drive
storage | short.ATA0_0 | ATA0 0 drive
storage | short.ATA0_1 | ATA0 1 drive
storage | short.SCSI_0 | SCSI 0 drive
storage | short.SCSI_1 | SCSI 1 drive
temp | read.cpu0.dietemp | CPU 0 die
temp | read.cpu0.memtemp | CPU 0 memory
temp | read.cpu0.temp | CPU 0
temp | read.cpu1.dietemp | CPU 1 die
temp | read.cpu1.memtemp | CPU 1 memory
temp | read.cpu1.temp | CPU 1
temp | read.gbeth.temp | GigaBit on Broadcom
temp | read.golem.temp | HyperTransport tunnel on AMD 8131 chip
temp | read.hddbp.temp | Hard disk SCSI backplane
temp | read.sp.temp | Service processor (SP)
temp | read.thor.temp | South Bridge
voltage | limits.VCC_120_S0 | VCC 120 S0
voltage | limits.VCC_50_S0 | VCC 50 S0
voltage | limits.VCC_50_S5 | VCC 50 S5
voltage | limits.VDDA_CPU0_25_S0 | VDDA CPU0 25 S0
voltage | limits.VDD_18_S0 | VDD 18 S0
voltage | limits.VDD_18_S5 | VDD 18 S5
voltage | limits.VDD_25_S0 | VDD 25 S0
voltage | limits.VDD_25_S5 | VDD 25 S5
voltage | limits.VDD_33_S0 | VDD 33 S0
voltage | limits.VDD_33_S3 | VDD 33 S3
voltage | limits.VDD_33_S5 | VDD 33 S5
voltage | limits.VDD_CPU0_25_S3 | VDD CPU0 25 S3
voltage | limits.VDD_CPU0_CORE_S0 | VDD CPU0 CORE S0
voltage | limits.VDD_CPU1_25_S3 | VDD CPU1 25 S3
voltage | limits.VDD_CPU1_CORE_S0 | VDD CPU1 CORE S0
voltage | limits.VLDT_CPU0_LDT1 | VLDT CPU0 LDT1
voltage | limits.VLDT_CPU0_LDT2 | VLDT CPU0 LDT2
voltage | limits.VLDT_G_LDT1 | VLDT G LDT1
voltage | limits.VTT_CPU0_DDR_S3 | VTT CPU0 DDR S3
voltage | limits.VTT_CPU1_DDR_S3 | VTT CPU1 DDR S3
TABLE 3-2 Sun Fire V20z Server--Diagnostics Modules and Tests (updated release of server)
Module | Test | Devices
Modules and Tests Added:
Flash | write.flash | Flash memory
fan | speed.allFans | All fans
temp | read.ambienttemp | Motherboard
Modules and Tests Deleted:
fan | speed.fan1 | CPU 1 memory fan 1
fan | speed.fan2 | CPU 1 memory fan 2
fan | speed.fan3 | CPU 1 fan 1
fan | speed.fan4 | CPU 1 fan 2
fan | speed.fan5 | CPU 0 fan 1
fan | speed.fan6 | CPU 0 fan 2
temp | read.cpu0.temp | CPU 0
temp | read.cpu1.temp | CPU 1
temp | read.golem.temp | HyperTransport tunnel on AMD 8131 chip
temp | read.thor.temp | South Bridge
voltage | limits.VDDA_CPU0_25_S0 | VDDA CPU0 25 S0
voltage | limits.VDD_18_S0 | VDD 18 S0
voltage | limits.VDD_18_S5 | VDD 18 S5
voltage | limits.VDD_25_S0 | VDD 25 S0
voltage | limits.VDD_25_S5 | VDD 25 S5
voltage | limits.VDD_33_S3 | VDD 33 S3
voltage | limits.VDD_CPU0_25_S3 | VDD CPU0 25 S3
voltage | limits.VDD_CPU1_25_S3 | VDD CPU1 25 S3
voltage | limits.VLDT_CPU0_LDT1 | VLDT CPU0 LDT1
voltage | limits.VLDT_G_LDT1 | VLDT G LDT1
voltage | limits.VTT_CPU0_DDR_S3 | VTT CPU0 DDR S3
voltage | limits.VTT_CPU1_DDR_S3 | VTT CPU1 DDR S3
TABLE 3-3 Sun Fire V40z Diagnostics Modules and Tests
Module | Test | Devices
Flash | write.flash |
fan | speed.fan1 | fan1.tach
fan | speed.fan10 | fan.10tach
fan | speed.fan11 | fan.11
fan | speed.fan12 | fan.12
fan | speed.fan2 | fan.2tach
fan | speed.fan3 | fan.3tach
fan | speed.fan4 | fan.4tach
fan | speed.fan5 | fan.5tach
fan | speed.fan6 | fan.6tach
fan | speed.fan7 | fan.7tach
fan | speed.fan8 | fan.8tach
fan | speed.fan9 | fan.9tach
memory | adjacency.allDimms | System memory
memory | dataline.allDimms | System memory
memory | pattern.allDimms | System memory
nic | phyLoop.Nic.0 | Ethernet Port 0
nic | phyLoop.Nic.1 | Ethernet Port 1
opPanel | write.opPanel | Operator Panel
power | read.allPowerSupplies | System power
slag | toggleLED.CD | CD LED
slag | toggleLED.CPU-Board | CPU card
slag | toggleLED.CPU0 | CPU 0 LED
slag | toggleLED.CPU0-DDR-VRM | CPU 0 DDR VRM
slag | toggleLED.CPU0-DIMM0 | CPU 0 DIMM 0
slag | toggleLED.CPU0-DIMM1 | CPU 0 DIMM 1
slag | toggleLED.CPU0-DIMM2 | CPU 0 DIMM 2
slag | toggleLED.CPU0-DIMM3 | CPU 0 DIMM 3
slag | toggleLED.CPU0-VRM | CPU 0 VRM
slag | toggleLED.CPU1 | CPU 1 LED
slag | toggleLED.CPU1-DDR-VRM | CPU 1 DDR VRM
slag | toggleLED.CPU1-DIMM0 | CPU 1 DIMM 0
slag | toggleLED.CPU1-DIMM1 | CPU 1 DIMM 1
slag | toggleLED.CPU1-DIMM2 | CPU 1 DIMM 2
slag | toggleLED.CPU1-DIMM3 | CPU 1 DIMM 3
slag | toggleLED.CPU1-VRM | CPU 1 VRM
slag | toggleLED.CPU2 | CPU 2 LED
slag | toggleLED.CPU2-DDR-VRM | CPU 2 DDR VRM
slag | toggleLED.CPU2-DIMM0 | CPU 2 DIMM 0
slag | toggleLED.CPU2-DIMM1 | CPU 2 DIMM 1
slag | toggleLED.CPU2-DIMM2 | CPU 2 DIMM 2
slag | toggleLED.CPU2-DIMM3 | CPU 2 DIMM 3
slag | toggleLED.CPU2-VRM | CPU 2 VRM
slag | toggleLED.CPU3 | CPU 3 LED
slag | toggleLED.CPU3-DDR-VRM | CPU 3 DDR VRM
slag | toggleLED.CPU3-DIMM0 | CPU 3 DIMM 0
slag | toggleLED.CPU3-DIMM1 | CPU 3 DIMM 1
slag | toggleLED.CPU3-DIMM2 | CPU 3 DIMM 2
slag | toggleLED.CPU3-DIMM3 | CPU 3 DIMM 3
slag | toggleLED.CPU3-VRM | CPU 3 VRM
slag | toggleLED.Fan-Board |
slag | toggleLED.Floppy | Floppy toggle LED
slag | toggleLED.LCD | LCD indicator toggle LED
slag | toggleLED.Motherboard | Motherboard toggle LED
slag | toggleLED.PCI-1 | PCI 1 toggle LED
slag | toggleLED.PCI-2 | PCI 2 toggle LED
slag | toggleLED.PCI-3 | PCI 3 toggle LED
slag | toggleLED.PCI-4 | PCI 4 toggle LED
slag | toggleLED.PCI-5 | PCI 5 toggle LED
slag | toggleLED.PCI-6 | PCI 6 toggle LED
slag | toggleLED.PCI-7 | PCI 7 toggle LED
slag | toggleLED.SCSI-Backplane | Disk backplane toggle LED
slag | toggleLED.SCSI-Fault |
storage | long.SCSI_0 | SCSI 0 drive
storage | long.SCSI_1 | SCSI 1 drive
storage | long.SCSI_2 | SCSI 2 drive
storage | long.SCSI_3 | SCSI 3 drive
storage | long.SCSI_4 | SCSI 4 drive
storage | long.SCSI_5 | SCSI 5 drive
storage | short.SCSI_0 | SCSI 0 drive
storage | short.SCSI_1 | SCSI 1 drive
storage | short.SCSI_2 | SCSI 2 drive
storage | short.SCSI_3 | SCSI 3 drive
storage | short.SCSI_4 | SCSI 4 drive
storage | short.SCSI_5 | SCSI 5 drive
temp | read.ambienttemp | Ambient temperature
temp | read.cpu0.dietemp | CPU 0 die
temp | read.cpu0.inlettemp |
temp | read.cpu0.memtemp | CPU 0 memory
temp | read.cpu1.dietemp | CPU 1 die
temp | read.cpu1.inlettemp |
temp | read.cpu1.memtemp | CPU 1 memory
temp | read.cpu2.dietemp | CPU 2 die
temp | read.cpu2.inlettemp | CPU 2 memory
temp | read.cpu2.temp | CPU 2
temp | read.cpu3.dietemp | CPU 3 die
temp | read.cpu3.inlettemp | CPU 3 memory
temp | read.cpu3.temp | CPU 3
temp | read.gbeth.temp | GigaBit on Broadcom
temp | read.scsibp.temp | Hard disk SCSI backplane
temp | read.sp.temp | Service processor (SP)
voltage | limits.VCC_120_S0.CPU-2 | VCC 120 S0
voltage | limits.VCC_120_S0.CPU-3 | VCC 120 S0
voltage | limits.VCC_120_S0.MB-CPU-0 | VCC 120 S0
voltage | limits.VCC_50_S0.CPU | VCC 50 S0
voltage | limits.VCC_50_S0.MB | VCC 50 S0
voltage | limits.VCC_50_S5.CPU | VCC 50 S5
voltage | limits.VCC_50_S5.MB | VCC 50 S5
voltage | limits.VDDA_CPU0_25_S0 | VDDA CPU0 25 S0
voltage | limits.VDDA_CPU1_25_S0 | VDDA CPU1 25 S0
voltage | limits.VDDA_CPU2_25_S0 | VDDA CPU2 25 S0
voltage | limits.VDDA_CPU3_25_S0 | VDDA CPU3 25 S0
voltage | limits.VDD_18G_S0 | VDD 18 S0
voltage | limits.VDD_18_S0 | VDD 18 S0
voltage | limits.VDD_18_S5 | VDD 18 S5
voltage | limits.VDD_25_S0 | VDD 25 S0
voltage | limits.VDD_25_S0.CPU | VDD 25 S0
voltage | limits.VDD_25_S5 | VDD 25 S5
voltage | limits.VDD_33_S0.CPU | VDD 33 S0
voltage | limits.VDD_33_S0.MB | VDD 33 S0
voltage | limits.VDD_33_S3 | VDD 33 S3
voltage | limits.VDD_33_S5 | VDD 33 S5
voltage | limits.VDD_33_S5.CPU | VDD 33 S5
voltage | limits.VDD_CPU0_25_S3 | VDD CPU0 25 S3
voltage | limits.VDD_CPU0_CORE_S0 | VDD CPU0 CORE S0
voltage | limits.VDD_CPU1_25_S3 | VDD CPU1 25 S3
voltage | limits.VDD_CPU1_CORE_S0 | VDD CPU1 CORE S0
voltage | limits.VDD_CPU2_25_S3 | VDD CPU2 25 S3
voltage | limits.VDD_CPU2_CORE_S0 | VDD CPU2 CORE S0
voltage | limits.VDD_CPU3_25_S3 | VDD CPU3 25 S3
voltage | limits.VDD_CPU3_CORE_S0 | VDD CPU3 CORE S0
voltage | limits.VLDT_CPU0_LDT0 | VLDT CPU0 LDT0
voltage | limits.VLDT_CPU0_LDT2 | VLDT CPU0 LDT2
voltage | limits.VLDT_CPU1_LDT1 | VLDT CPU1 LDT1
voltage | limits.VLDT_G0_LDT1 | VLDT G0 LDT1
voltage | limits.VLDT_G1_LDT1 | VLDT G1 LDT1
voltage | limits.VLDT_REG1 | VLDT REG1
voltage | limits.VLDT_REG2 | VLDT REG2
voltage | limits.VTT_CPU0_DDR_S3 | VTT CPU0 DDR S3
voltage | limits.VTT_CPU1_DDR_S3 | VTT CPU1 DDR S3
voltage | limits.VTT_CPU2_DDR_S3 | VTT CPU2 DDR S3
voltage | limits.VTT_CPU3_DDR_S3 | VTT CPU3 DDR S3
3.4.5 Running Diagnostic Tests
When running tests, you can choose to execute all tests or to run tests only for a specific module. The following options are available:
- Run tests individually or collectively
- Choose the type (by module or name) of tests to run
- Determine the sequence in which the tests are run (using scripts)
- View status messages about the success of the tests
You can run these tests on the machine on which you obtained them. You must have the appropriate permissions to run these commands.
To run the diagnostics tests, type the following command:
# diags run tests option
Where the option is one of the following:
Option | Description
-n test_name | To run one test at a time, replace test_name with the name of the test. You can specify more than one test by listing test names with a space between them.
-m module | To run a batch of tests by module, replace module with the name of the test module.
-a | Use this option to run all available diagnostics tests.
For example, if you suspect that you are having voltage problems, run the voltage module diagnostic tests:
# diags run tests -m voltage
Refer to Appendix C for more information about using these command options.
You can write scripts for additional control over the sequencing and timing of the tests. For example, you could write a shell script to repeat a test a specified number of times.
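For example, the following is a minimal shell-script sketch of that approach; the test name and repeat count are passed as arguments (both placeholders), and the script simply reruns one named test with the diags run tests command described above:
#!/bin/sh
# Repeat a single diagnostics test a specified number of times.
# Usage: repeat_test.sh test_name count
TEST_NAME=$1
COUNT=$2
i=1
while [ "$i" -le "$COUNT" ]; do
    echo "Run $i of $COUNT: $TEST_NAME"
    diags run tests -n "$TEST_NAME"
    i=`expr $i + 1`
done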
3.4.6 Viewing Diagnostic Test Results
After a test executes successfully, its status is returned. When a test encounters an error, it reports the error and continues to run any remaining tests submitted with the command.
The following output is typically generated for all diagnostics tests:
- Submitted Test Name
- Test Handle (a dynamically assigned unique number used by the diagnostics application to identify a running test)
- Test Result (Passed, Failed)
- Details (for example, Failure Details, Tests Details)
Specifying the -v | --verbose option when running the test displays additional data about a test. See Appendix C for more details.
For example, test details may include high, nominal and low values.
The following is an example of two passed test cases and one failed test case:
Results
Submitted Test Name Test Handle Test Result
adjacency.allDimms P1 Passed
dataline.allDimms P2 Passed
pattern.allDimms P3 Failed
Failure Details: FAILED, addr(0xc0000008) CPU 1 - DIMM 3)
Expected [5a5a5a5a5a5a5a5a] Actual [a5a5a5a5a4a5a5a5] Difference [1000000]
Memory Configuration: Total: 3584Mb
CPU0-2048Mb CPU1-1536Mb
CPU 0: Width[128] Addr 0 - 7fffffff
DIMM 0 512MB Addr 0000000000 - 003fffffff Even Quad Word
DIMM 1 512MB Addr 0000000000 - 003fffffff Odd Quad Word
DIMM 2 512MB Addr 0040000000 - 007fffffff Even Quad Word
DIMM 3 512MB Addr 0040000000 - 007fffffff Odd Quad Word
CPU 1: Width[128] Addr 80000000 - dfffffff
DIMM 0 512MB Addr 0080000000 - 00bfffffff Even Quad Word
DIMM 1 512MB Addr 0080000000 - 00bfffffff Odd Quad Word
DIMM 2 256MB Addr 00c0000000 - 00dfffffff Even Quad Word
*DIMM 3 256MB Addr 00c0000000 - 00dfffffff Odd Quad Word
3.4.7 Stopping Diagnostic Tests
To cancel one or more individual tests, run the following command:
# diags cancel tests {-t | --test} test_handle | {-a | --all}
Where test_handle is a dynamically assigned unique number used by the diagnostics application to identify a running test. The test handle is displayed in the output of a test after it has been run.
To terminate all diagnostics tests and end the diagnostics session, run the following command:
# diags terminate
Refer to Appendix C for more information about these commands.
3.5 Analyzing Events
System events often yield important information about problems or potential problems occurring in the system. Administrators can view detailed information about all the currently active system events and perform various actions related to each event.
You can use the sp get events command to return detailed information about all active SP events. The -d parameter specifies to display the history of either one or all events, thereby allowing you to track problems. By default, event ID, last update, component, severity and a message are displayed.
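For example, a minimal invocation from the SP command line looks like the following; the second form adds the -d parameter described above to include event history (whether -d also accepts an event ID to limit the output to a single event is an assumption to confirm in the Server Management Guide):
# sp get events
# sp get events -d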
3.6 System-Fault LED
3.6.1 System-Fault Events
The following events result in the system-fault LED turning on.
- Thermal Trip Events: When your CPU experiences a thermal trip, the system-fault LED begins to blink and an event is issued indicating that the platform has been shut down. For example:
CPU 0 has thermally tripped and shut down. Powering off System.
Causes of this condition include fan failure, an environment that is too hot, or the cover being left off for too long.
To correct this condition, fix the air flow or cooling problem that caused the thermal trip. After the system has cooled off for a period of time, remove all AC power to the system for 30 seconds and then plug the system back in. You should then be able to boot the system normally.
Caution - To remove AC power from a Sun Fire V20z server, turn off the AC power switch on the back panel. To remove power from a Sun Fire V40z server, remove the power cords from all power supplies.
- DIMM Faults: DIMM faults cause the system-fault LED to blink whenever an uncorrectable DIMM fault is detected or when enough correctable faults are detected to exceed the threshold. The system might continue to operate normally, depending on the type of failure, the location of the failure and the robustness of the platform operating system.
- VRM Crowbar Assertions: VRM Crowbar assertions occur when either a CPU VRM or a memory VRM detects either a voltage condition that exceeds the threshold or a temperature condition that exceeds the threshold. During a period in which crowbar is asserted, the system-fault LED blinks and the front-panel platform power button, the platform set power and the platform os state commands are disabled.
The system is forcefully shut down by either the Service Processor or the PRS when this occurs (typically by the PRS, because the crowbar signal normally causes the VRM to deassert the power-good signal). When the condition clears, the system is allowed to resume power.
3.6.1.1 Viewing System-Fault Events and Resetting the LED
To view the critical event that caused a system-fault alert, run the following command:
# ssh spipaddress -l spusername sp get events
To reset the system-fault LED, delete the critical events from the SP event log or clear the event log entirely.
- To clear the entire event log, run the following command:
# ssh spipaddress -l spusername sp delete event -a
- To delete selected events from the log, run the following command:
# ssh spipaddress -l spusername sp delete event event-id-number