CHAPTER 10
Isolating Failed Parts
The most important use of diagnostic tools is to isolate a failed hardware component so that a qualified service technician can quickly remove and replace it. Because servers are complex machines with many failure modes, no single diagnostic tool can isolate all hardware faults under all conditions. However, Sun provides a variety of tools that can help you determine which component needs replacing.
This chapter guides you in choosing the best tools and describes how to use these tools to reveal a failed part in your Sun Fire V490 server. It also explains how to use the Locator LED to isolate a failed system in a large equipment room.
Tasks covered in this chapter include:
Other information in this chapter includes:
If you want background information about the tools, turn to the section:
Note - Many of the procedures in this chapter assume that you are familiar with the OpenBoot firmware and that you know how to enter the OpenBoot environment. For background information, see About the ok Prompt. For instructions, see How to Get to the ok Prompt.
The Locator LED helps you quickly find a specific system among dozens of systems in a room. For background information about system LEDs, see LED Status Indicators.
You can turn the Locator LED on and off from the system console, from the system controller (SC) command-line interface (CLI), or through the RSC software's graphical user interface (GUI).
Note - It is also possible to use Sun Management Center software to turn the Locator LED on and off. Consult Sun Management Center documentation for details.
Either log in as root, or access the RSC software's graphical user interface.
See the illustration under Step 5 in "Invalid Cross-Reference Format". With each click, the LED will change state from off to on, or vice versa.
See the illustration under Step 5 in "Invalid Cross-Reference Format". With each click, the LED will change state from on to off, or vice versa.
In normal mode, firmware-based diagnostic tests can be configured (and even disabled) to expedite the server's startup process. If you have set OpenBoot configuration variables to bypass diagnostic tests, you can always reset those variables to their default values to run tests.
Alternatively, putting the server into service mode according to the following procedure ensures that POST and OpenBoot Diagnostics tests do run during startup.
For a full description of service mode, see:
This document is included on the Sun Fire V490 Documentation CD.
1. Set up a console for viewing diagnostic messages.
Access the system console using an ASCII terminal or tip line. For information on system console options, see About Communicating With the System.
2. Do one of the following, whichever is more convenient:
If either of these switches is set as described, the next reset will cause diagnostic tests to run at Sun-specified coverage, levels, and verbosity.
Should you want to restore the system to normal mode in order to control the depth of diagnostic coverage, the tests run, and the verbosity of the output, see:
If you have set the server to run in service mode, you can follow this procedure to return the system to normal mode. Putting the system in normal mode gives you control over diagnostic testing. For more information, see:
1. Set up a console for viewing diagnostic messages.
Access the system console using an ASCII terminal or tip line. For information on system console options, see About Communicating With the System.
2. Turn the system control switch to the Normal position.
The system will not actually enter normal mode until the next reset.
For detailed descriptions of service and normal modes, see:
This document is included on the Sun Fire V490 Documentation CD.
Although they are not a deep, formal diagnostic tool, the LEDs located on the chassis and on selected system components can serve as front-line indicators of a limited set of hardware failures.
You can view LED status by direct inspection of the system's front or back panels.
Note - Most LEDs available on the front panel are also duplicated on the back panel.
You can also view LED status remotely using RSC and Sun Management Center software, if you set up these tools ahead of time. For details on setting up RSC and Sun Management Center software, see:
There is a group of three LEDs located near the top left corner of the front panel and duplicated on the back panel. Their status can tell you the following.
The Locator and Fault LEDs are powered by the system's 5-volt standby power source and remain lit for any fault condition that results in a system shutdown.
2. Check the power supply LEDs.
Each power supply has a set of four LEDs located on the front panel and duplicated on the back panel. Their status can tell you the following.
There are two LEDs located behind the media door, just under the system control switch. The LED on the left is for Fan Tray 0 (CPU) and the LED on the right is for Fan Tray 1 (PCI). A lit LED indicates that the corresponding fan tray needs reseating or replacement.
There are two sets of three LEDs, one for each disk drive. These are located behind the media door, just to the left of each disk drive. Their status can tell you the following.
Perform software commands to take the disk offline. See the Sun Fire V490 Server Parts Installation and Removal Guide.
5. (Optional) Check the Ethernet LEDs.
There are two LEDs for each Ethernet port, located near the right side of each Ethernet receptacle on the back panel. If the Sun Fire V490 system is connected to an Ethernet network, the status of the Ethernet LEDs can tell you the following.
If lit or blinking, data is being either transmitted or received.
None. The condition of these LEDs can help you narrow down the source of a network problem.
If LEDs do not disclose the source of a suspected problem, try running power-on self-tests (POST). See:
This section explains how to run power-on self-test (POST) diagnostics to isolate faults in a Sun Fire V490 server. For background information about POST diagnostics and the boot process, see Chapter 6.
You must ensure that the system is configured to run diagnostic tests. See:
You must additionally decide whether you want to view POST diagnostic output locally, via a terminal or tip connection to the machine's serial port, or remotely after redirecting system console output to the system controller (SC).
Note - A server can have only one system console at a time, so if you redirect output to the system controller, no information appears at the serial port (ttya).
1. Set up a console for viewing POST messages.
Connect an alphanumeric terminal to the Sun Fire V490 server or establish a tip connection to another Sun system. See:
2. (Optional) Redirect console output to the system controller.
For instructions, see How to Redirect the System Console to the System Controller.
3. Start POST diagnostics. Type:
The system runs the POST diagnostics and displays status and error messages via either the local serial terminal (ttya) or the redirected (system controller) system console.
Each POST error message includes a "best guess" as to which field-replaceable unit (FRU) was the source of failure. In some cases, there may be more than one possible source, and these are listed in order of decreasing likelihood.
Note - Should the POST output contain code names and acronyms with which you are unfamiliar, see TABLE 6-13 in Reference for Terms in Diagnostic Output.
Have a qualified service technician replace the FRU or FRUs indicated by POST error messages, if any. For replacement instructions, see:
If the POST diagnostics did not disclose any problems, but your system does not start, try running the interactive OpenBoot Diagnostics tests.
Because OpenBoot Diagnostics tests require access to some of the same hardware resources used by the operating system, they cannot be run reliably after an operating system halt or Stop-A key sequence. You need to reset the system before running OpenBoot Diagnostics tests, and then reset the system again after testing. Instructions for doing this follow.
This procedure assumes you have established a system console. See:
1. Halt the server to reach the ok prompt.
How you do this depends on the system's condition. If possible, you should warn users and shut down the system gracefully. For information, see About the ok Prompt.
2. Set the auto-boot? diagnostic configuration variable to false. Type:
3. Reset or power cycle the system.
4. Invoke the OpenBoot Diagnostics tests. Type:
The obdiag prompt and test menu appear. The menu is shown in FIGURE 6-4.
5. Type the appropriate command and numbers for the tests you want to run.
For example, to run all available OpenBoot Diagnostics tests, type:
To run a particular test, type:
where # represents the number of the desired test.
For a list of OpenBoot Diagnostics test commands, see Interactive OpenBoot Diagnostics Commands. The numbered menu of tests is shown in FIGURE 6-4.
6. When you are done running OpenBoot Diagnostics tests, exit the test menu. Type:
7. Set the auto-boot? diagnostic configuration variable back to true. Type:
This allows the operating system to resume starting up automatically after future system resets or power cycles.
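The order of operations in Steps 2 through 7 can be sketched as a list of console commands. The following Python sketch only builds that list as text and does not communicate with a server. It assumes the standard OpenBoot command names `setenv`, `reset-all`, `obdiag`, `test-all`, `test`, and `exit`, and an `obdiag>` prompt string; confirm these against your firmware documentation. The function name `obdiag_session` is hypothetical.

```python
def obdiag_session(test_number=None):
    """Return (prompt, command) pairs for one OpenBoot Diagnostics session.

    test_number: menu number of a single test to run, or None to run all tests.
    Command names are assumptions based on standard OpenBoot usage.
    """
    if test_number is None:
        test_command = "test-all"
    else:
        number = int(test_number)
        if number < 1:
            raise ValueError("test numbers start at 1")
        test_command = f"test {number}"

    return [
        ("ok", "setenv auto-boot? false"),  # Step 2: keep the OS from booting
        ("ok", "reset-all"),                # Step 3: reset before testing
        ("ok", "obdiag"),                   # Step 4: open the test menu
        ("obdiag>", test_command),          # Step 5: run the chosen test(s)
        ("obdiag>", "exit"),                # Step 6: leave the test menu
        ("ok", "setenv auto-boot? true"),   # Step 7: restore automatic boot
        ("ok", "reset-all"),                # reset again after testing
    ]


if __name__ == "__main__":
    for prompt, command in obdiag_session():
        print(prompt, command)
```

Keeping the two `auto-boot?` changes at opposite ends of the list makes it harder to forget the final step, which is what lets the operating system start automatically after later resets.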
Have a qualified service technician replace the FRU or FRUs indicated by OpenBoot Diagnostics error messages, if any. For replacement instructions, see:
This document is included on the Sun Fire V490 Documentation CD.
Summaries of the results from the most recent power-on self-test (POST) and OpenBoot Diagnostics tests are saved across power cycles.
You must set up a system console. See:
Then halt the server to reach the ok prompt. See:
To see a summary of the most recent POST results, type:
To see a summary of the most recent OpenBoot Diagnostics test results, type:
You should see a system-dependent list of hardware components, along with an indication of which components passed and which failed POST or OpenBoot Diagnostics tests.
Switches and diagnostic configuration variables stored by the system firmware determine how and when power-on self-test (POST) diagnostics and OpenBoot Diagnostics tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, see TABLE 6-2.
Halt the server to reach the ok prompt. See:
To display the current values of all OpenBoot configuration variables, use the printenv command.
The following example shows a short excerpt of this command's output.
To set or change the value of an OpenBoot configuration variable, use the setenv command:
To set OpenBoot configuration variables that accept multiple keywords, separate keywords with a space:
Note - The test-args variable operates differently from other OpenBoot configuration variables. It requires a single argument consisting of a comma-separated list of keywords. For details, see Controlling OpenBoot Diagnostics Tests.
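The difference between the two keyword styles is easy to get wrong when typing commands by hand. The following Python sketch formats a `setenv` line using spaces for ordinary multi-keyword variables and commas for `test-args`. The helper name `setenv_line` is hypothetical, and the variable and keyword names in the usage example (`post-trigger`, `power-on-reset`, `error-reset`, `debug`, `loopback`, `media`) are illustrative assumptions; check TABLE 6-2 and Controlling OpenBoot Diagnostics Tests for the values your system accepts.

```python
def setenv_line(variable, keywords):
    """Build a setenv command line for an OpenBoot configuration variable.

    test-args takes one comma-separated argument; other multi-keyword
    variables take space-separated keywords.
    """
    keywords = list(keywords)
    if not keywords:
        raise ValueError("at least one keyword is required")
    for keyword in keywords:
        if not keyword or " " in keyword or "," in keyword:
            raise ValueError(f"invalid keyword: {keyword!r}")

    separator = "," if variable == "test-args" else " "
    return f"setenv {variable} {separator.join(keywords)}"


if __name__ == "__main__":
    # Illustrative values only; these names are assumptions.
    print(setenv_line("post-trigger", ["power-on-reset", "error-reset"]))
    print(setenv_line("test-args", ["debug", "loopback", "media"]))
```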
Changes to OpenBoot configuration variables usually take effect upon the next reboot.
This section helps you choose the right tool to isolate a failed part in a Sun Fire V490 system. Consider the following questions when selecting a tool.
Certain system components have built-in LEDs that can alert you when that component requires replacement. For detailed instructions, see How to Isolate Faults Using LEDs.
2. Does the system have main power?
If there is no main power to the system, standby power from the SC card may enable you to check the status of some components. See About Monitoring the System.
4. Do you intend to run the tests remotely?
Both Sun Management Center and RSC software enable you to run tests from a remote computer. In addition, RSC software provides a means of redirecting system console output, allowing you to view and run tests remotely, including tests such as POST diagnostics that usually require physical proximity to the serial port on the system's back panel.
5. Will the tool test the suspected source(s) of the problem?
Perhaps you already have some idea of what the problem is. If so, you want to use a diagnostic tool capable of testing the suspected problem sources.
6. Is the problem intermittent or software-related?
If a problem is not caused by a clearly defective hardware component, then you may want to use a system exerciser tool rather than a fault isolation tool. See Chapter 12 for instructions and About Exercising the System for background information.
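The questions above can be read as a simple checklist. The following Python sketch encodes only the guidance stated in this section and is not a complete selection procedure. The function name `suggest_tools` and its parameter names are hypothetical.

```python
def suggest_tools(led_indicates_fault, has_main_power,
                  run_remotely, intermittent_or_software):
    """Return suggestions drawn from the tool-selection questions.

    Each argument is a boolean answer to one of the questions in this section.
    """
    suggestions = []
    if led_indicates_fault:
        suggestions.append(
            "Inspect component LEDs (see How to Isolate Faults Using LEDs)")
    if not has_main_power:
        suggestions.append(
            "Check component status over SC standby power "
            "(see About Monitoring the System)")
    if run_remotely:
        suggestions.append(
            "Use Sun Management Center or RSC software; RSC can also "
            "redirect system console output")
    if intermittent_or_software:
        suggestions.append(
            "Use a system exerciser instead of a fault isolation tool "
            "(see Chapter 12)")
    return suggestions


if __name__ == "__main__":
    for line in suggest_tools(False, True, True, False):
        print("-", line)
```

The function returns every applicable suggestion and does not pick a single one, because the questions are not mutually exclusive: a remote administrator may also be dealing with an intermittent problem.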
Copyright © 2004, Sun Microsystems, Inc. All rights reserved.