CHAPTER 3
Isolating Failed Parts
The most important use of diagnostic tools is to isolate a failed hardware component so that you can quickly remove and replace it. Because servers are complex machines with many failure modes, no single diagnostic tool can isolate all hardware faults under all conditions. However, Sun provides a variety of tools that can help you determine which component needs replacing.
This chapter guides you in choosing the best tools and describes how to use these tools to reveal a failed part in your Netra 440 server. It also explains how to use the Locator LED to isolate a failed system in a large equipment room.
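Topics covered in this chapter include viewing and setting OpenBoot configuration variables, operating the Locator LED, putting the system in diagnostics mode, bypassing firmware diagnostics, maximizing diagnostic testing, isolating faults using LEDs, POST diagnostics, and interactive OpenBoot Diagnostics tests, viewing diagnostic test results, and choosing a fault isolation tool.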
If you want background information about the tools, turn to the section Isolating Faults in the System.
Switches and OpenBoot configuration variables stored in the system configuration card determine how and when power-on self-test (POST) diagnostics and OpenBoot Diagnostics tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, see TABLE 2-1.
To View and Set OpenBoot Configuration Variables
1. Suspend the server's operating system software to reach the ok prompt.
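2. Enter the following commands: printenv to display the current values of all OpenBoot configuration variables, or setenv followed by a variable name and a value to change a setting.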
The following example shows a short excerpt of this command's output.
Note - The test-args variable operates differently from other OpenBoot configuration variables. It requires a single argument consisting of a comma-separated list of keywords. For details, see Controlling OpenBoot Diagnostics Tests.
Changes to OpenBoot configuration variables usually take effect on the next reboot.
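The command listings for this procedure are missing from this copy. As a minimal sketch, assuming the standard OpenBoot printenv and setenv commands, a session at the ok prompt might look like the following. The variable and value used (diag-level, max) are only an illustration, lines beginning with a backslash are comments, and command output is omitted.

```text
\ Assumed commands; the original listing is missing from this copy.
\ Display every OpenBoot configuration variable:
ok printenv
\ Display a single variable:
ok printenv diag-level
\ Change a variable (illustrative variable and value):
ok setenv diag-level max
```

Typed with no argument, printenv lists each variable with its current and default values. Typed with a variable name, it reports only that variable.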
The Locator LED helps you to quickly find a specific system among numerous systems in a room. For background information about system LEDs, see the Netra 440 Server System Administration Guide (817-3884-xx).
You can turn the Locator LED on and off either from the system console, or by using the Advanced Lights Out Manager (ALOM) command-line interface.
To Operate the Locator LED
1. Access either the system console or the system controller.
For instructions, refer to the Netra 440 Server System Administration Guide.
2. Determine the current state of the Locator LED.
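The commands for checking and switching the Locator LED are missing from this copy. The following is a sketch only, assuming the Solaris locator utility and the ALOM showlocator and setlocator commands. In each group, the first command reports the current state, the second turns the LED on, and the third turns it off. Command output is omitted.

```text
(Assumed commands; the original listings are missing from this copy.)

From the system console, as superuser:
# /usr/sbin/locator
# /usr/sbin/locator -n
# /usr/sbin/locator -f

From the ALOM system controller:
sc> showlocator
sc> setlocator on
sc> setlocator off
```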
Firmware-based diagnostic tests can be bypassed to expedite the server's startup process. The following procedure ensures that POST and OpenBoot Diagnostics tests do run during startup. For background information, see Diagnostics: Reliability versus Availability.
To Put the System in Diagnostics Mode
1. Log in to the system console and access the ok prompt.
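2. Do one of the following, whichever is more convenient: set the server's system control rotary switch to the Diagnostics position, or set the diag-switch? variable to true.
You can do this at the machine's front panel or, if you are running your test session from a remote console, through the ALOM interface.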
3. Set the OpenBoot configuration diag-script variable to normal. Type:
This allows OpenBoot Diagnostics tests to run automatically on all motherboard components.
Note - If you prefer that OpenBoot Diagnostics examine all IEEE 1275-compatible devices (not just those on the motherboard), set the diag-script variable to all.
4. Set OpenBoot configuration variables to trigger diagnostic tests. Type:
ok setenv post-trigger power-on-reset error-reset
ok setenv obdiag-trigger power-on-reset error-reset
5. Set the maximum POST diagnostic test level. Type:
This ensures the most thorough power-on self-test possible. The maximum testing level requires considerably longer to complete than the minimum. Depending on system configuration, you may need to wait an additional 10 to 20 minutes for the server to boot.
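Taken together, the steps above can be sketched as one ok-prompt session. Only the two trigger commands in Step 4 appear in this copy. The other command lines are assumptions inferred from the variable names and values named in each step. Lines beginning with a backslash are comments.

```text
\ Step 2 (assumed): enable diagnostics through the variable
\ rather than the rotary switch.
ok setenv diag-switch? true
\ Step 3 (assumed from the step text).
ok setenv diag-script normal
\ Step 4 (as given in the procedure).
ok setenv post-trigger power-on-reset error-reset
ok setenv obdiag-trigger power-on-reset error-reset
\ Step 5 (assumed from the step text).
ok setenv diag-level max
```

These settings usually take effect on the next reset or power cycle.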
POST and OpenBoot Diagnostics tests can be bypassed to expedite the server's startup process. For background information, see Diagnostics: Reliability versus Availability.
Caution - Bypassing diagnostic tests sacrifices system reliability by allowing a system to attempt to boot when it may have a serious hardware problem.
To Bypass Firmware Diagnostics
1. Log in to the system console and access the ok prompt.
2. Ensure that the server's system control rotary switch is set to the Normal position.
Setting the rotary switch to the Diagnostics position overrides the OpenBoot configuration variable settings and causes diagnostic tests to run.
3. Turn off the diag-switch? and diag-script variables. Type:
4. Set OpenBoot configuration trigger variables to bypass diagnostics. Type:
The Netra 440 server is now configured to minimize the time it takes to reboot. If you change your mind and want to force diagnostic tests to run, see Putting the System in Diagnostics Mode.
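The command lines for Steps 3 and 4 are missing from this copy. A minimal sketch follows, assuming that false and none are the values that turn off the variables named in those steps.

```text
\ Assumed commands; the original listings are missing from this copy.
\ Step 3: turn off diag-switch? and diag-script.
ok setenv diag-switch? false
ok setenv diag-script none
\ Step 4: clear both trigger variables.
ok setenv post-trigger none
ok setenv obdiag-trigger none
```

With these settings, the rotary switch must still be in the Normal position, as described in Step 2, for diagnostics to be skipped.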
The ALOM system controller provides a "back-door" method of skipping diagnostic tests and booting the system. This procedure is only of assistance in those unusual circumstances where:
To Bypass Diagnostics Temporarily
1. Log in to the ALOM system controller and access the sc> prompt.
2. Type the following command:
This command temporarily configures the system to skip its firmware-based diagnostic tests, regardless of how the OpenBoot configuration variables are set.
3. Within 10 minutes, power cycle the system. Type:
You must execute the above commands within 10 minutes of using ALOM to change the boot mode. Ten minutes after you issue the ALOM bootmode command, the system reverts to its default boot mode as governed by the current settings of OpenBoot configuration variables, including diag-switch?, post-trigger, and obdiag-trigger.
For more information about OpenBoot configuration variables and how they affect diagnostics, see Controlling POST Diagnostics.
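The command lines for Steps 2 and 3 of this procedure are missing from this copy. The following sketch assumes the ALOM bootmode command with its skip_diag option, followed by the ALOM poweroff and poweron commands to power cycle the system.

```text
(Assumed commands; the original listings are missing from this copy.)
sc> bootmode skip_diag
sc> poweroff
sc> poweron
```

ALOM may ask you to confirm the poweroff command before it proceeds. Issue poweron within the 10-minute window described above.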
If you suspect an incompatible or corrupted firmware image caused the problems you observed with firmware diagnostics, you should now restore the system firmware to a reliable state.
For more information about restoring the system firmware, contact your authorized service provider.
To maximize system reliability, it is useful to have POST and OpenBoot Diagnostics tests trigger on an operating system panic or any reset, and to run the most comprehensive tests possible automatically. For background information, see Diagnostics: Reliability versus Availability.
To Maximize Diagnostic Testing
1. Log in to the system console and access the ok prompt.
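2. Do one of the following, whichever is more convenient: set the server's system control rotary switch to the Diagnostics position, or set the diag-switch? variable to true.
You can do this at the server's front panel or, if you are running your test session from a remote console, through the ALOM interface.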
3. Set the OpenBoot configuration diag-script variable to all. Type:
This allows OpenBoot Diagnostics tests to run automatically on all motherboard components and IEEE 1275-compatible devices.
Note - If you prefer that OpenBoot Diagnostics examine only motherboard-based devices, set the diag-script variable to normal.
4. Set OpenBoot configuration variables to trigger diagnostic tests. Type:
5. Set the maximum POST diagnostic test level. Type:
This ensures the most thorough testing possible. The maximum testing level requires considerably longer to complete than the minimum. Depending on system configuration, you may need to wait an additional 10 to 20 minutes for the server to boot.
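The command lines for this procedure are missing from this copy. As a sketch under stated assumptions, the steps might be entered as follows. The all-resets trigger value is an assumption chosen to match the goal of testing after any reset. The other values come from the step text.

```text
\ Assumed commands; the original listings are missing from this copy.
\ Step 2 (variable alternative to the rotary switch).
ok setenv diag-switch? true
\ Step 3.
ok setenv diag-script all
\ Step 4 (all-resets is an assumed value).
ok setenv post-trigger all-resets
ok setenv obdiag-trigger all-resets
\ Step 5.
ok setenv diag-level max
```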
While not a comprehensive diagnostic tool, LEDs located on the chassis and on selected system components can serve as front-line indicators of a limited set of hardware failures.
You can view LED status by direct inspection of the system's front and back panels. You can also view the status of certain LEDs from the ALOM system controller command-line interface.
Note - Most LEDs available on the front panel are also duplicated on the back panel.
To Isolate Faults Using LEDs
1. Check the system LEDs.
There is a group of three LEDs located near the top left corner of the front panel and duplicated on the back panel. Their status can tell you the following:
The Locator and Service Required LEDs are powered by the system's 5-volt standby power source and remain lit for any fault condition that results in a system shutdown.
Note - To view the status of system LEDs from ALOM, type showenvironment from the sc> prompt.
2. Check the power supply LEDs.
Each power supply has a set of four LEDs located on the front panel and duplicated on the back panel. Their status can tell you the following:
3. Check the hard drive LEDs.
Hard drive LEDs are located behind the left system door. Just to the right of each hard drive is a set of three LEDs. Their status can tell you the following:
4. Check the DVD-ROM drive LED.
The DVD-ROM drive features a Power/Activity LED that tells you the following:
If this LED is off, and you know the system is receiving power, check the DVD-ROM drive and its cables.
5. Check the Ethernet port LEDs.
Two Ethernet port LEDs are located on the system back panel.
6. If LEDs do not disclose the source of a suspected problem, try putting the affected server in Diagnostics mode.
See Putting the System in Diagnostics Mode.
You can also run power-on self-test (POST) diagnostics. See Isolating Faults Using POST Diagnostics.
This section explains how to run power-on self-test (POST) diagnostics to isolate faults in a Netra 440 server. For background information about POST diagnostics and the boot process, see Chapter 2.
To Isolate Faults Using POST Diagnostics
1. Log in to the system console and access the ok prompt.
This procedure assumes that the system is in diagnostics mode. See Putting the System in Diagnostics Mode.
The procedure also assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.
2. (Optional) Set the OpenBoot configuration variable diag-level to max. Type:
This provides the most extensive diagnostic testing.
Then, from the sc> prompt, type:
The system runs the POST diagnostics and displays status and error messages through the local serial terminal.
Note - You will not see any POST output if you remain at the sc> prompt. You must return to the ok prompt by typing the console command as shown above.
Each POST error message includes a "best guess" as to which field-replaceable unit (FRU) was the source of failure. In some cases, there may be more than one possible source, and these are listed in order of decreasing likelihood.
Note - Should the POST output contain code names and acronyms with which you are unfamiliar, see TABLE 2-13 in Terms in Diagnostic Output.
5. Try replacing the FRU or FRUs indicated by POST error messages, if any.
For replacement instructions, refer to the Netra 440 Server Service Manual.
6. If the POST diagnostics did not turn up any problems, but your system does not start up, try running the interactive OpenBoot Diagnostics tests.
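The command lines for the early steps of this procedure are missing from this copy. The sketch below assumes the default ALOM escape sequence (#.) to reach the sc> prompt and assumes that POST is started by power cycling the system from ALOM. Only the console command is confirmed by the Note in the procedure.

```text
\ Step 2 (optional), at the ok prompt:
ok setenv diag-level max
\ Switch to the system controller (assumed default escape sequence):
ok #.
(Assumed: power cycle from ALOM, then return to the system console.)
sc> poweroff
sc> poweron
sc> console
```

Type console promptly after poweron so that the POST messages appear on the system console.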
Because OpenBoot Diagnostics tests require access to some of the same hardware resources used by the operating system, the tests cannot be run reliably after an operating system halt or L1-A (Stop-A) key sequence. You need to reset the system before running OpenBoot Diagnostics tests, and then reset the system again after testing. Instructions for doing this follow.
To Isolate Faults Using Interactive OpenBoot Diagnostics Tests
1. Log in to the system console and access the ok prompt.
2. Set the auto-boot? OpenBoot configuration variable to false. Type:
3. Reset or power cycle the system.
4. Invoke the OpenBoot Diagnostics tests. Type:
The obdiag> prompt and test menu appear. The menu is shown in FIGURE 2-3.
5. (Optional) Set the desired test level.
You may want to perform the most extensive testing possible by setting the diag-level OpenBoot configuration variable to max:
Note - If diag-level is set to off, OpenBoot firmware returns a passed status for all core tests, but performs no testing.
You can set any OpenBoot configuration variable (see TABLE 2-1) from the obdiag> prompt in the same way.
6. Type the appropriate command and numbers for the tests you want to run.
For example, to run all available OpenBoot Diagnostics tests, type:
To run a particular test, type:
Where # represents the number of the desired test.
For a list of OpenBoot Diagnostics test commands, see Interactive OpenBoot Diagnostics Commands. The menu of numbered tests is shown in FIGURE 2-3.
7. When you are done running OpenBoot Diagnostics tests, exit the test menu. Type:
8. Set the auto-boot? OpenBoot configuration variable back to true. Type:
This allows the operating system to resume starting up automatically after future system resets or power cycles.
9. To reboot the system, type:
The system stores the OpenBoot configuration variable settings and boots automatically when the auto-boot? variable is set to true.
Try replacing the FRU or FRUs indicated by OpenBoot Diagnostics error messages, if any. For FRU replacement instructions, refer to the Netra 440 Server Service Manual.
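The command lines for this procedure are missing from this copy. As a sketch only, the whole session might look like the following, assuming the OpenBoot reset-all, obdiag, test-all, test, and exit commands. The test number 3 is an arbitrary placeholder. Choose a number from the menu that obdiag displays.

```text
\ Assumed commands; the original listings are missing from this copy.
ok setenv auto-boot? false
ok reset-all
ok obdiag
obdiag> setenv diag-level max
obdiag> test-all
obdiag> test 3
obdiag> exit
ok setenv auto-boot? true
ok reset-all
```

The first reset-all gives the tests a freshly reset system, and the final one reboots with auto-boot? restored to true.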
Summaries of the results from the most recent power-on self-test (POST) and OpenBoot Diagnostics tests are saved across power cycles.
To View Diagnostic Test Results
1. Log in to the system console and access the ok prompt.
2. View the desired category of test results.
You should see a system-dependent list of hardware components, along with an indication of which components passed and which failed POST or OpenBoot Diagnostics tests.
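The commands for Step 2 are missing from this copy. A minimal sketch follows, assuming the OpenBoot show-post-results and show-obdiag-results commands. Output is omitted because it depends on the system.

```text
\ Assumed commands; the original listings are missing from this copy.
\ Summary of the most recent POST run:
ok show-post-results
\ Summary of the most recent OpenBoot Diagnostics run:
ok show-obdiag-results
```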
This section helps you choose the right tool to isolate a failed part in a Netra 440 server. Consider the following questions when selecting a tool.
Certain system components have built-in LEDs that can alert you when that component requires replacement. For detailed instructions, see Isolating Faults Using LEDs.
3. Do you intend to run the tests remotely?
The ALOM system controller software enables you to run tests from a remote server. In addition, ALOM provides a means of redirecting system console output, allowing you to remotely view and run tests, such as POST diagnostics, that usually require physical proximity to the serial port on the server's back panel.
SunVTS software, a system exercising tool, also enables you to run tests remotely using either the product's graphical interface or tty mode through a remote login or Telnet session.
4. Will the tool test the suspected sources of the problem?
Perhaps you already have some idea of what the problem is. If so, you want to use a diagnostic tool capable of testing the suspected problem sources.
5. Is the problem intermittent or software related?
If a problem is not caused by a clearly defective hardware component, then you may want to use a system-exerciser tool rather than a fault-isolation tool. See Chapter 2 for instructions and Exercising the System for background information.
Copyright © 2004, Sun Microsystems, Inc. All rights reserved.