Troubleshooting and Diagnostics
Before troubleshooting your specific server problem, collect the following information:
- What events occurred prior to the failure?
- Was any hardware or software modified or installed?
- Was the server recently installed or moved?
- How long has the server exhibited symptoms?
- What is the duration or frequency of the problem?
The guidelines in Preventive Troubleshooting will help you prevent problems from occurring and will make troubleshooting easier.
After you have assessed the problem and noted your current configuration and environment, you can choose from the troubleshooting approaches described in the sections that follow.
3.1 Preventive Troubleshooting
Creating and following procedures can help prevent problems and make troubleshooting easier.
Follow these guidelines for preventive troubleshooting:
- Use uniform naming conventions for your servers, such as names that denote server location. Uniform naming conventions help when you try to remember often-overlooked details that can hold the key to resolving a crisis.
- Use unique IDs or names for your devices. Keeping a list of the assigned IDs or names reduces the risk of components competing for the same resource. Use the server setup utility to check for conflicts.
- Create a backup plan. Schedule backups based on the needs of your server: if data changes frequently, frequent backups are required. Maintain a library of backups based on your restore requirements, and test your backups periodically to be sure that your data is correctly stored.
- Use enterprise-systems management tools to automate the following processes or manually track this information:
- Check hard disk space periodically. It is recommended that hard drives have a minimum of 15 percent of free space. (A sample command-line check is shown after this list.)
- Keep historical data. You will not know that the CPU utilization has increased 50 percent if you do not know what it was initially. If you have problems, you can use the data to compare before and after scenarios. For example, you might want to know about the user, bus and power utilization rates.
- Keep a trend analysis so that you will know what to expect at certain times. For example, if the CPU utilization rate always increases by 50 percent during certain hours, you will know that increase is normal for the server you are tracking.
- Create a problem-resolution notebook. When problems do occur, keep a log of the actions you take to resolve them. This record can help you solve the same problem more quickly later and ensures accuracy, especially when dealing with future part replacement.
- Keep an updated network-topology map in an accessible location. This will help in troubleshooting networking problems.
- Most problems occur when something in the server has changed. When making changes to your server, follow these guidelines:
- Document the system settings. If the system configuration will change, first obtain a record of the current system-configuration settings.
- If possible, make changes one at a time to isolate problems should they occur. This enables you to maintain a controlled environment and reduces the scope of any troubleshooting. Record the results of each change, including any errors or informational messages.
- Check for potential device conflicts before adding a new device. Check for any potential version dependencies, especially with third-party software.
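As a sample of the disk-space guideline earlier in this list, the following one-line check is a minimal sketch for a Linux platform; the 85 percent threshold simply mirrors the 15 percent free-space recommendation, and you should adjust it (and the filesystems checked) to suit your environment:
# df -P | awk 'NR > 1 && $5+0 > 85 {print $6 " is over 85% full (" $5 " used)"}'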
3.2 Visually Inspecting Your System
Improperly set controls and loose or improperly connected cables are common causes of problems with hardware components. When investigating a system problem, first check all the external switches, controls and cable connections. See External Visual Inspection.
If this does not resolve your problem, then visually inspect the system's interior hardware for problems such as a loose card, cable connector or mounting screw. See Internal Visual Inspection.
3.2.1 External Visual Inspection
To visually inspect the external system, follow these steps:
1. Note the state of the system-fault LED on the front of the server.
The system-fault LED blinks when a severe system fault is detected.
- See FIGURE 1-1 for the location of the Sun Fire V20z system-fault LED.
- See FIGURE 1-4 for the location of the Sun Fire V40z system-fault LED.
Several conditions can result in the system-fault LED turning on. See System-Fault LED for a description of these conditions, how to view the cause of the fault and how to reset the LED.
2. Power off the system and any attached peripherals (if applicable).
3. Verify that all power cables are properly connected to the system, the monitor and peripherals, and check their power sources.
4. Inspect connections to any attached devices, including network cables, keyboard, monitor and mouse, as well as any devices attached to the serial port.
3.2.2 Internal Visual Inspection
To visually inspect the internal system, follow these steps:
Note - Before proceeding, read the safety instructions in the document, Important Safety Information About Sun Hardware Systems, which is shipped with your system.
1. Shut down the operating system, if necessary, and turn off the platform power on the front of the server.
2. Turn off the AC power by using one of the following two methods, depending on which server type you have:
- If you have a Sun Fire V20z server, turn off the AC power switch on the rear panel of the server (see FIGURE 1-2). Leave the AC power cord attached to the power supply to maintain system ground.
- If you have a Sun Fire V40z server, unplug the AC power cord(s) from the AC connectors on the server's power supply(s).
Caution - When you unplug the AC power cords from the Sun Fire V40z server power supplies to remove AC power, system ground is also removed. You must maintain an equal voltage potential to the machine to avoid electrostatic discharge damage to the machine.
3. Turn off power to any attached peripherals.
4. Remove the server cover.
For a Sun Fire V20z server, refer to Powering Off the Server and Removing the Cover.
For a Sun Fire V40z server, refer to Powering Off the Server and Removing the Cover.
Caution - Some components, such as the heatsink, can become extremely hot during system operations. Allow these components to cool before handling them.
5. Verify that the components are fully seated in their sockets or connectors and that sockets are clean.
6. Check all cable connectors inside the system to verify that they are firmly attached to their appropriate connectors.
7. Replace the server cover.
8. Reconnect the system and any attached peripherals to their power sources, then power them on.
3.3 Troubleshooting Dump Utility
You can also use the Troubleshooting Dump Utility (TDU), which captures the following information:
- System state table (SST)
- Hardware and software component version numbers
- Machine check register values
- CPU trace buffers
- CPU configuration space registers (CSRs)
- Event log file
- The last good configuration (LGC)
To run the Troubleshooting Dump Utility, type the following command:
# sp get tdulog
The Troubleshooting Dump Utility can take up to 15 minutes to run. The system prompt returns when the utility has completed.
The captured data is gathered and stored on the SP in a compressed tar file. Refer to the Sun Fire V20z and Sun Fire V40z Servers, Server Management Guide, for more information about the command and its options.
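If you manage the server remotely, you can also issue the command over SSH, following the same remote-invocation pattern used elsewhere in this guide; the SP address and user name shown here are placeholders:
# ssh spipaddress -l spusername sp get tdulog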
3.4 Diagnostics
Diagnostics are a set of tests that determine the health of the hardware in your server. These tests verify hardware functionality and indicate device failures. You can run the diagnostics tests to accomplish the following:
- Test and diagnose hardware functionality
- Locate hardware failures
- Isolate hardware and software faults
Before using diagnostics, three setup procedures are necessary:
1. Install the diagnostics by installing the server's Network Share Volume (NSV) software on a networked NFS server. See Installing the NSV and Diagnostics Software.
2. Mount the diagnostics tests onto your Sun Fire V20z or Sun Fire V40z server and update the diagnostics software. See Mounting the Diagnostics Tests.
3. Enable the diagnostics tests. See Enabling the Diagnostics Tests.
Caution - While running diagnostics on your server, do not interact with the Service Processor (SP) through the command-line interface or IPMI.
The sensor commands cannot be used reliably while the diagnostics are running. Issuing sensor commands while diagnostics are loaded may result in false or erroneous critical events being logged in the event log. The values returned by the sensors are not reliable in this case.
Note - When the diagnostics are launched on the platform, the system tries to mount the floppy drive. The following error is returned:
mount : Mounting /dev/fd0 on /mnt/floppy failed. No such device.
You can safely ignore this error message.
3.4.1 Installing the NSV and Diagnostics Software
1. Connect the SP of the Sun Fire V20z or Sun Fire V40z server to the same network as your NFS server.
See the Sun Fire V20z and Sun Fire V40z Servers Installation Guide for the location of the SP connectors and guidelines for connecting servers to management LANs.
2. Insert the Sun Fire V20z and Sun Fire V40z Servers Network Share Volume CD into the NFS server and mount the CD.
3. Copy the file that contains the diagnostics from the CD to the NFS server by typing the following command:
# cp -r /mnt/cdrom/NSV_file /mnt/nsv/
4. Change to the directory on the server that now contains the compressed NSV packages and extract them by typing the following commands:
# cd /mnt/nsv/
# unzip -a *.zip
Note - When unzipping a compressed file on a Linux platform, use the -a switch as shown to force text files to convert to the target operating system's appropriate end-of-line termination.
The extracted packages populate the following directories:
/mnt/nsv/
diags
logs
snmp
spupdate
5. Run the following commands to create the appropriate permissions within the diags directories:
# chmod 777 /mnt/nsv/diags/NSV_version_number/scripts
# chmod -R 755 /mnt/nsv/diags/NSV_version_number/mppc
6. Continue with Mounting the Diagnostics Tests.
3.4.2 Mounting the Diagnostics Tests
Before running the diagnostics tests, you need to mount the NSV software from the NFS server on which it is located.
1. Log in to the Sun Fire V20z or Sun Fire V40z server's SP via SSH by typing the following command at the NFS server's command prompt:
# ssh -l manager_or_higher_login SSH_hostname
Note - Verify that NFS is enabled on the network before going to the next step. On systems running Linux, this must be done manually. Refer to the documentation for the version of Linux you are running for the instructions on enabling NFS.
2. Mount the NSV onto the Sun Fire V20z or Sun Fire V40z server SP by typing the following command:
# sp add mount -r NFS_server_hostname:/directory_with_NSV_files -l /mnt
Note - If you did not set up the SP on a DHCP network, you must use the NFS_server_IP_address, rather than the NFS_server_hostname.
3. Go to the directory that contains the diagnostics files to list the available versions of diagnostics currently installed on the NSV:
# cd /mnt/diags
# ls -l
4. Update the diagnostics software by typing the following command:
# sp update diags -p /mnt/diags/DIAGS_version#
Where DIAGS_version# is the version of diagnostics you want to enable.
For example: V2.0.0.42
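Substituting that example version number for the placeholder, the complete command would look like this:
# sp update diags -p /mnt/diags/V2.0.0.42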
5. Continue with Enabling the Diagnostics Tests.
3.4.3 Enabling the Diagnostics Tests
Whenever a major component in the system does not function properly, you may have a component failure. As long as the microprocessor and the input and output components of the system (the monitor, keyboard and diskette drive) are working, you can run diagnostics.
To enable diagnostics on the SP from the NFS mount, execute one of the following commands:
- When the platform power is off, run the following command to boot the server and enable both platform and SP diagnostics:
# diags start
You can begin running diagnostics on the SP while the platform diagnostics are loading. You can use the diags get state command to determine whether the platform diagnostics are loaded.
- You can optionally enable diagnostics when the platform power is on and the OS is running, without rebooting the platform into diagnostics mode. This allows you to run your OS while simultaneously performing SP diagnostic testing. To do so, run the following command and option:
# diags start --noplatform
Note - If you use the --noplatform option, you cannot run any platform diagnostics, which include diagnostics for memory, NIC cards and storage.
Refer to Appendix C for more information about diags commands.
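As mentioned earlier in this section, you can confirm whether the platform diagnostics have finished loading by querying the diagnostics state from the SP prompt:
# diags get state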
If the NSV is mounted, but the diags command is not recognized, run the sp update diags command to adjust the path to the diagnostics software.
3.4.4 Listing Available Diagnostics Tests and Modules
To list the available tests and modules, type the following command:
# diags get tests
Tests are available for the following modules:
- Fans: Fan tests verify that each fan is rotating and the fan RPM is within the specified ranges.
Note - The power-supply fans are not testable by this diagnostic.
- Memory: Memory tests identify memory errors, address decoding faults and dataline faults.
- Network Controllers: An internal loopback test is available for NIC testing.
- Operator Panel: The operator panel tests verify the memory of the operator panel. The value and location of any errors are indicated.
- Slag: Slag tests are non-interactive tests that verify the correct operation of the LED drive circuitry.
- Storage: Storage tests invoke either a short or long self-test on any installed SCSI drives.
- Temperature: Temperature tests verify that each of the temperature sensors is functional and that the temperature is within the specified ranges.
- Voltage: Voltage tests check the power-supply and bulk voltages (generated by the VRMs associated with the CPUs and memory) to determine whether the voltage sensors are operating within their predefined limits.
- Power (for Sun Fire V40z only): Power diagnostics verify that the power distribution backplane and power supplies are functioning properly.
TABLE 3-1 lists the diagnostics modules and tests that are associated with each module in the original release of the Sun Fire V20z server (chassis part number [PN] 380-0979).
TABLE 3-2 lists the diagnostics modules and tests that were added or deleted in the updated release of the Sun Fire V20z server (chassis PN 380-1168).
Note - To see the current list of diagnostics modules and tests on your Sun Fire V20z server, run the SP command diags get tests. The SP automatically detects the release version of your system and returns the relevant set of tests.
TABLE 3-3 lists the diagnostics modules and tests that are associated with each module in a Sun Fire V40z server.
TABLE 3-1 Sun Fire V20z Server--Diagnostics Modules and Tests (original release of server)
Module | Test | Devices
fan | speed.fan1 | CPU 1 memory fan 1
fan | speed.fan2 | CPU 1 memory fan 2
fan | speed.fan3 | CPU 1 fan 1
fan | speed.fan4 | CPU 1 fan 2
fan | speed.fan5 | CPU 0 fan 1
fan | speed.fan6 | CPU 0 fan 2
memory | adjacency.allDimms | All DIMMs
memory | dataline.allDimms | All DIMMs
memory | pattern.allDimms | All DIMMs
nic | phyLoop.Nic.0 | Ethernet Port 0
nic | phyLoop.Nic.1 | Ethernet Port 1
opPanel | write.opPanel | Operator Panel
slag | toggleLED.CD | CD LED
slag | toggleLED.CPU0 | CPU 0 LED
slag | toggleLED.CPU0-DDR-VRM | CPU 0 DDR VRM
slag | toggleLED.CPU0-DIMM0 | CPU 0 DIMM 0
slag | toggleLED.CPU0-DIMM1 | CPU 0 DIMM 1
slag | toggleLED.CPU0-DIMM2 | CPU 0 DIMM 2
slag | toggleLED.CPU0-DIMM3 | CPU 0 DIMM 3
slag | toggleLED.CPU0-VRM | CPU 0 VRM
slag | toggleLED.CPU1 | CPU 1
slag | toggleLED.CPU1-DDR-VRM | CPU 1 DDR VRM
slag | toggleLED.CPU1-DIMM0 | CPU 1 DIMM 0
slag | toggleLED.CPU1-DIMM1 | CPU 1 DIMM 1
slag | toggleLED.CPU1-DIMM2 | CPU 1 DIMM 2
slag | toggleLED.CPU1-DIMM3 | CPU 1 DIMM 3
slag | toggleLED.CPU1-VRM | CPU 1 VRM
slag | toggleLED.Disk-0 | Disk 0 toggle LED
slag | toggleLED.Disk-1 | Disk 1 toggle LED
slag | toggleLED.Disk-Backplane | Disk backplane toggle LED
slag | toggleLED.Floppy | Floppy toggle LED
slag | toggleLED.LCD-Indicator | LCD indicator toggle LED
slag | toggleLED.Motherboard | Motherboard toggle LED
slag | toggleLED.PCI-0 | PCI 0 toggle LED
slag | toggleLED.PCI-1 | PCI 1 toggle LED
slag | toggleLED.Power-Supply | Power-supply toggle LED
storage | long.ATA0_0 | ATA0 0 drive
storage | long.ATA0_1 | ATA0 1 drive
storage | long.SCSI_0 | SCSI 0 drive
storage | long.SCSI_1 | SCSI 1 drive
storage | short.ATA0_0 | ATA0 0 drive
storage | short.ATA0_1 | ATA0 1 drive
storage | short.SCSI_0 | SCSI 0 drive
storage | short.SCSI_1 | SCSI 1 drive
temp | read.cpu0.dietemp | CPU 0 die
temp | read.cpu0.memtemp | CPU 0 memory
temp | read.cpu0.temp | CPU 0
temp | read.cpu1.dietemp | CPU 1 die
temp | read.cpu1.memtemp | CPU 1 memory
temp | read.cpu1.temp | CPU 1
temp | read.gbeth.temp | GigaBit on Broadcom
temp | read.golem.temp | HyperTransport tunnel on AMD 8131 chip
temp | read.hddbp.temp | Hard disk SCSI backplane
temp | read.sp.temp | Service processor (SP)
temp | read.thor.temp | South Bridge
voltage | limits.VCC_120_S0 | VCC 120 S0
voltage | limits.VCC_50_S0 | VCC 50 S0
voltage | limits.VCC_50_S5 | VCC 50 S5
voltage | limits.VDDA_CPU0_25_S0 | VDDA CPU0 25 S0
voltage | limits.VDD_18_S0 | VDD 18 S0
voltage | limits.VDD_18_S5 | VDD 18 S5
voltage | limits.VDD_25_S0 | VDD 25 S0
voltage | limits.VDD_25_S5 | VDD 25 S5
voltage | limits.VDD_33_S0 | VDD 33 S0
voltage | limits.VDD_33_S3 | VDD 33 S3
voltage | limits.VDD_33_S5 | VDD 33 S5
voltage | limits.VDD_CPU0_25_S3 | VDD CPU0 25 S3
voltage | limits.VDD_CPU0_CORE_S0 | VDD CPU0 CORE S0
voltage | limits.VDD_CPU1_25_S3 | VDD CPU1 25 S3
voltage | limits.VDD_CPU1_CORE_S0 | VDD CPU1 CORE S0
voltage | limits.VLDT_CPU0_LDT1 | VLDT CPU0 LDT1
voltage | limits.VLDT_CPU0_LDT2 | VLDT CPU0 LDT2
voltage | limits.VLDT_G_LDT1 | VLDT G LDT1
voltage | limits.VTT_CPU0_DDR_S3 | VTT CPU0 DDR S3
voltage | limits.VTT_CPU1_DDR_S3 | VTT CPU1 DDR S3
TABLE 3-2 Sun Fire V20z Server--Diagnostics Modules and Tests (updated release of server)
Module | Test | Devices
Modules and Tests Added:
Flash | write.flash | Flash memory
fan | speed.allFans | All fans
temp | read.ambienttemp | Motherboard
Modules and Tests Deleted:
fan | speed.fan1 | CPU 1 memory fan 1
fan | speed.fan2 | CPU 1 memory fan 2
fan | speed.fan3 | CPU 1 fan 1
fan | speed.fan4 | CPU 1 fan 2
fan | speed.fan5 | CPU 0 fan 1
fan | speed.fan6 | CPU 0 fan 2
temp | read.cpu0.temp | CPU 0
temp | read.cpu1.temp | CPU 1
temp | read.golem.temp | HyperTransport tunnel on AMD 8131 chip
temp | read.thor.temp | South Bridge
voltage | limits.VDDA_CPU0_25_S0 | VDDA CPU0 25 S0
voltage | limits.VDD_18_S0 | VDD 18 S0
voltage | limits.VDD_18_S5 | VDD 18 S5
voltage | limits.VDD_25_S0 | VDD 25 S0
voltage | limits.VDD_25_S5 | VDD 25 S5
voltage | limits.VDD_33_S3 | VDD 33 S3
voltage | limits.VDD_CPU0_25_S3 | VDD CPU0 25 S3
voltage | limits.VDD_CPU1_25_S3 | VDD CPU1 25 S3
voltage | limits.VLDT_CPU0_LDT1 | VLDT CPU0 LDT1
voltage | limits.VLDT_G_LDT1 | VLDT G LDT1
voltage | limits.VTT_CPU0_DDR_S3 | VTT CPU0 DDR S3
voltage | limits.VTT_CPU1_DDR_S3 | VTT CPU1 DDR S3
TABLE 3-3 Sun Fire V40z Diagnostics Modules and Tests
Module | Test | Devices
Flash | write.flash |
fan | speed.fan1 | fan1.tach
fan | speed.fan10 | fan.10tach
fan | speed.fan11 | fan.11
fan | speed.fan12 | fan.12
fan | speed.fan2 | fan.2tach
fan | speed.fan3 | fan.3tach
fan | speed.fan4 | fan.4tach
fan | speed.fan5 | fan.5tach
fan | speed.fan6 | fan.6tach
fan | speed.fan7 | fan.7tach
fan | speed.fan8 | fan.8tach
fan | speed.fan9 | fan.9tach
memory | adjacency.allDimms | System memory
memory | dataline.allDimms | System memory
memory | pattern.allDimms | System memory
nic | phyLoop.Nic.0 | Ethernet Port 0
nic | phyLoop.Nic.1 | Ethernet Port 1
opPanel | write.opPanel | Operator Panel
power | read.allPowerSupplies | System power
slag | toggleLED.CD | CD LED
slag | toggleLED.CPU-Board | CPU card
slag | toggleLED.CPU0 | CPU 0 LED
slag | toggleLED.CPU0-DDR-VRM | CPU 0 DDR VRM
slag | toggleLED.CPU0-DIMM0 | CPU 0 DIMM 0
slag | toggleLED.CPU0-DIMM1 | CPU 0 DIMM 1
slag | toggleLED.CPU0-DIMM2 | CPU 0 DIMM 2
slag | toggleLED.CPU0-DIMM3 | CPU 0 DIMM 3
slag | toggleLED.CPU0-VRM | CPU 0 VRM
slag | toggleLED.CPU1 | CPU 1 LED
slag | toggleLED.CPU1-DDR-VRM | CPU 1 DDR VRM
slag | toggleLED.CPU1-DIMM0 | CPU 1 DIMM 0
slag | toggleLED.CPU1-DIMM1 | CPU 1 DIMM 1
slag | toggleLED.CPU1-DIMM2 | CPU 1 DIMM 2
slag | toggleLED.CPU1-DIMM3 | CPU 1 DIMM 3
slag | toggleLED.CPU1-VRM | CPU 1 VRM
slag | toggleLED.CPU2 | CPU 2 LED
slag | toggleLED.CPU2-DDR-VRM | CPU 2 DDR VRM
slag | toggleLED.CPU2-DIMM0 | CPU 2 DIMM 0
slag | toggleLED.CPU2-DIMM1 | CPU 2 DIMM 1
slag | toggleLED.CPU2-DIMM2 | CPU 2 DIMM 2
slag | toggleLED.CPU2-DIMM3 | CPU 2 DIMM 3
slag | toggleLED.CPU2-VRM | CPU 2 VRM
slag | toggleLED.CPU3 | CPU 3 LED
slag | toggleLED.CPU3-DDR-VRM | CPU 3 DDR VRM
slag | toggleLED.CPU3-DIMM0 | CPU 3 DIMM 0
slag | toggleLED.CPU3-DIMM1 | CPU 3 DIMM 1
slag | toggleLED.CPU3-DIMM2 | CPU 3 DIMM 2
slag | toggleLED.CPU3-DIMM3 | CPU 3 DIMM 3
slag | toggleLED.CPU3-VRM | CPU 3 VRM
slag | toggleLED.Fan-Board |
slag | toggleLED.Floppy | Floppy toggle LED
slag | toggleLED.LCD | LCD indicator toggle LED
slag | toggleLED.Motherboard | Motherboard toggle LED
slag | toggleLED.PCI-1 | PCI 1 toggle LED
slag | toggleLED.PCI-2 | PCI 2 toggle LED
slag | toggleLED.PCI-3 | PCI 3 toggle LED
slag | toggleLED.PCI-4 | PCI 4 toggle LED
slag | toggleLED.PCI-5 | PCI 5 toggle LED
slag | toggleLED.PCI-6 | PCI 6 toggle LED
slag | toggleLED.PCI-7 | PCI 7 toggle LED
slag | toggleLED.SCSI-Backplane | Disk backplane toggle LED
slag | toggleLED.SCSI-Fault |
storage | long.SCSI_0 | SCSI 0 drive
storage | long.SCSI_1 | SCSI 1 drive
storage | long.SCSI_2 | SCSI 2 drive
storage | long.SCSI_3 | SCSI 3 drive
storage | long.SCSI_4 | SCSI 4 drive
storage | long.SCSI_5 | SCSI 5 drive
storage | short.SCSI_0 | SCSI 0 drive
storage | short.SCSI_1 | SCSI 1 drive
storage | short.SCSI_2 | SCSI 2 drive
storage | short.SCSI_3 | SCSI 3 drive
storage | short.SCSI_4 | SCSI 4 drive
storage | short.SCSI_5 | SCSI 5 drive
temp | read.ambienttemp | Ambient temperature
temp | read.cpu0.dietemp | CPU 0 die
temp | read.cpu0.inlettemp |
temp | read.cpu0.memtemp | CPU 0 memory
temp | read.cpu1.dietemp | CPU 1 die
temp | read.cpu1.inlettemp |
temp | read.cpu1.memtemp | CPU 1 memory
temp | read.cpu2.dietemp | CPU 2 die
temp | read.cpu2.inlettemp | CPU 2 memory
temp | read.cpu2.temp | CPU 2
temp | read.cpu3.dietemp | CPU 3 die
temp | read.cpu3.inlettemp | CPU 3 memory
temp | read.cpu3.temp | CPU 3
temp | read.gbeth.temp | GigaBit on Broadcom
temp | read.scsibp.temp | Hard disk SCSI backplane
temp | read.sp.temp | Service processor (SP)
voltage | limits.VCC_120_S0.CPU-2 | VCC 120 S0
voltage | limits.VCC_120_S0.CPU-3 | VCC 120 S0
voltage | limits.VCC_120_S0.MB-CPU-0 | VCC 120 S0
voltage | limits.VCC_50_S0.CPU | VCC 50 S0
voltage | limits.VCC_50_S0.MB | VCC 50 S0
voltage | limits.VCC_50_S5.CPU | VCC 50 S5
voltage | limits.VCC_50_S5.MB | VCC 50 S5
voltage | limits.VDDA_CPU0_25_S0 | VDDA CPU0 25 S0
voltage | limits.VDDA_CPU1_25_S0 | VDDA CPU1 25 S0
voltage | limits.VDDA_CPU2_25_S0 | VDDA CPU2 25 S0
voltage | limits.VDDA_CPU3_25_S0 | VDDA CPU3 25 S0
voltage | limits.VDD_18G_S0 | VDD 18 S0
voltage | limits.VDD_18_S0 | VDD 18 S0
voltage | limits.VDD_18_S5 | VDD 18 S5
voltage | limits.VDD_25_S0 | VDD 25 S0
voltage | limits.VDD_25_S0.CPU | VDD 25 S0
voltage | limits.VDD_25_S5 | VDD 25 S5
voltage | limits.VDD_33_S0.CPU | VDD 33 S0
voltage | limits.VDD_33_S0.MB | VDD 33 S0
voltage | limits.VDD_33_S3 | VDD 33 S3
voltage | limits.VDD_33_S5 | VDD 33 S5
voltage | limits.VDD_33_S5.CPU | VDD 33 S5
voltage | limits.VDD_CPU0_25_S3 | VDD CPU0 25 S3
voltage | limits.VDD_CPU0_CORE_S0 | VDD CPU0 CORE S0
voltage | limits.VDD_CPU1_25_S3 | VDD CPU1 25 S3
voltage | limits.VDD_CPU1_CORE_S0 | VDD CPU1 CORE S0
voltage | limits.VDD_CPU2_25_S3 | VDD CPU2 25 S3
voltage | limits.VDD_CPU2_CORE_S0 | VDD CPU2 CORE S0
voltage | limits.VDD_CPU3_25_S3 | VDD CPU3 25 S3
voltage | limits.VDD_CPU3_CORE_S0 | VDD CPU3 CORE S0
voltage | limits.VLDT_CPU0_LDT0 | VLDT CPU0 LDT0
voltage | limits.VLDT_CPU0_LDT2 | VLDT CPU0 LDT2
voltage | limits.VLDT_CPU1_LDT1 | VLDT CPU1 LDT1
voltage | limits.VLDT_G0_LDT1 | VLDT G0 LDT1
voltage | limits.VLDT_G1_LDT1 | VLDT G1 LDT1
voltage | limits.VLDT_REG1 | VLDT REG1
voltage | limits.VLDT_REG2 | VLDT REG2
voltage | limits.VTT_CPU0_DDR_S3 | VTT CPU0 DDR S3
voltage | limits.VTT_CPU1_DDR_S3 | VTT CPU1 DDR S3
voltage | limits.VTT_CPU2_DDR_S3 | VTT CPU2 DDR S3
voltage | limits.VTT_CPU3_DDR_S3 | VTT CPU3 DDR S3
3.4.5 Running Diagnostic Tests
When running tests, you can choose to execute all tests or to run tests only for a specific module. The following options are available:
- Run tests individually or collectively
- Choose the type (by module or name) of tests to run
- Determine the sequence in which the tests are run (using scripts)
- View status messages about the success of the tests
You can run these tests on the machine on which you obtained them. You must have the appropriate permissions to run these commands.
To run the diagnostics tests, type the following command:
# diags run tests option
Where the option is one of the following:
Option | Description
-n test_name | To run one test at a time, replace test_name with the name of the test. You can specify more than one test by listing test names with a space between them.
-m module | To run a batch of tests by module, replace module with the name of the test module.
-a | Use this option to run all available diagnostics tests.
For example, if you suspect that you are having voltage problems, run the voltage module diagnostic tests:
# diags run tests -m voltage
Refer to Appendix C for more information about using these command options.
You can write scripts for additional control over the sequencing and timing of the tests. For example, you could write a shell script to repeat a test a specified number of times.
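For example, the following is a minimal shell-script sketch of that approach; the test name and repeat count are passed as arguments (both placeholders), and the script simply reruns one named test with the diags run tests command described above:
#!/bin/sh
# Repeat a single diagnostics test a specified number of times.
# Usage: repeat_test.sh test_name count
TEST_NAME=$1
COUNT=$2
i=1
while [ "$i" -le "$COUNT" ]; do
    echo "Run $i of $COUNT: $TEST_NAME"
    diags run tests -n "$TEST_NAME"
    i=`expr $i + 1`
done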
3.4.6 Viewing Diagnostic Test Results
After a test executes successfully, its status is returned. When a test encounters an error, it reports the error and continues to run any remaining tests submitted with the command.
The following output is typically generated for all diagnostics tests:
- Submitted Test Name
- Test Handle (a dynamically assigned unique number used by the diagnostics application to identify a running test)
- Test Result (Passed, Failed)
- Details (for example, Failure Details, Tests Details)
Specifying the -v | --verbose option when running the test displays additional data about a test. See Appendix C for more details.
For example, test details may include high, nominal and low values.
The following is an example of two passed test cases and one failed test case:
Results
Submitted Test Name Test Handle Test Result
adjacency.allDimms P1 Passed
dataline.allDimms P2 Passed
pattern.allDimms P3 Failed
Failure Details: FAILED, addr(0xc0000008) CPU 1 - DIMM 3)
Expected [5a5a5a5a5a5a5a5a] Actual [a5a5a5a5a4a5a5a5] Difference [1000000]
Memory Configuration: Total: 3584Mb
CPU0-2048Mb CPU1-1536Mb
CPU 0: Width[128] Addr 0 - 7fffffff
DIMM 0 512MB Addr 0000000000 - 003fffffff Even Quad Word
DIMM 1 512MB Addr 0000000000 - 003fffffff Odd Quad Word
DIMM 2 512MB Addr 0040000000 - 007fffffff Even Quad Word
DIMM 3 512MB Addr 0040000000 - 007fffffff Odd Quad Word
CPU 1: Width[128] Addr 80000000 - dfffffff
DIMM 0 512MB Addr 0080000000 - 00bfffffff Even Quad Word
DIMM 1 512MB Addr 0080000000 - 00bfffffff Odd Quad Word
DIMM 2 256MB Addr 00c0000000 - 00dfffffff Even Quad Word
*DIMM 3 256MB Addr 00c0000000 - 00dfffffff Odd Quad Word
3.4.7 Stopping Diagnostic Tests
To cancel one or more individual tests, run the following command:
# diags cancel tests {-t | --test} test_handle | {-a | --all}
Where test_handle is a dynamically assigned unique number used by the diagnostics application to identify a running test. The test handle is displayed in the output of a test after it has been run.
To terminate all diagnostics tests and end the diagnostics session, run the following command:
# diags terminate
Refer to Appendix C for more information about these commands.
3.5 Analyzing Events
System events often yield important information about problems or potential problems occurring in the system. Administrators can view detailed information about all the currently active system events and perform various actions related to each event.
You can use the sp get events command to return detailed information about all active SP events. The -d parameter specifies to display the history of either one or all events, thereby allowing you to track problems. By default, event ID, last update, component, severity and a message are displayed.
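For example, a minimal invocation from the SP command line looks like the following; the second form adds the -d parameter described above to include event history (whether -d also accepts an event ID to limit the output to a single event is an assumption to confirm in the Server Management Guide):
# sp get events
# sp get events -d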
3.6 System-Fault LED
3.6.1 System-Fault Events
The following events result in the system-fault LED turning on.
- Thermal Trip Events: When your CPU experiences a thermal trip, the system-fault LED begins to blink and an event is issued indicating that the platform has been shut down. For example:
CPU 0 has thermally tripped and shut down. Powering off System.
Causes of this condition include fan failure, an environment that is too hot, or the cover being left off for too long.
To correct this condition, fix the air flow or cooling problem that caused the thermal trip. After the system has cooled off for a period of time, remove all AC power to the system for 30 seconds and then plug the system back in. You should then be able to boot the system normally.
Caution - To remove AC power from a Sun Fire V20z server, turn off the AC power switch on the back panel. To remove power from a Sun Fire V40z server, remove the power cords from all power supplies.
- DIMM Faults: DIMM faults cause the system-fault LED to blink whenever an uncorrectable DIMM fault is detected or when enough correctable faults are detected to exceed the threshold. The system might continue to operate normally, depending on the type of failure, the location of the failure and the robustness of the platform operating system.
- VRM Crowbar Assertions: VRM Crowbar assertions occur when either a CPU VRM or a memory VRM detects either a voltage condition that exceeds the threshold or a temperature condition that exceeds the threshold. During a period in which crowbar is asserted, the system-fault LED blinks and the front-panel platform power button, the platform set power and the platform os state commands are disabled.
The system is forcefully shut down by either the Service Processor or the PRS when this occurs (typically by the PRS, because the crowbar signal normally causes the VRM to deassert the power-good signal). When the condition clears, the system is allowed to resume power.
3.6.1.1 Viewing System-Fault Events and Resetting the LED
To view the critical event that caused a system-fault alert, run the following command:
# ssh spipaddress -l spusername sp get events
To reset the system-fault LED, delete the critical events from the SP event log or clear the event log entirely.
- To clear the entire event log, run the following command:
# ssh spipaddress -l spusername sp delete event -a
- To delete selected events from the log, run the following command:
# ssh spipaddress -l spusername sp delete event event-id-number