CHAPTER 3

Isolating Failed Parts

The most important use of diagnostic tools is to isolate a failed hardware component so that you can quickly remove and replace it. Because servers are complex machines with many failure modes, no single diagnostic tool can isolate all hardware faults under all conditions. However, Sun provides a variety of tools that can help you discern what component needs replacing.

This chapter guides you in choosing the best tools and describes how to use these tools to reveal a failed part in your Netra 440 server. It also explains how to use the Locator LED to isolate a failed system in a large equipment room.

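Topics covered in this chapter include:

Viewing and Setting OpenBoot Configuration Variables
Operating the Locator LED
Putting the System in Diagnostics Mode
Bypassing Firmware Diagnostics
Bypassing Diagnostics Temporarily
Maximizing Diagnostic Testing
Isolating Faults Using LEDs
Isolating Faults Using POST Diagnostics
Isolating Faults Using Interactive OpenBoot Diagnostics Tests
Viewing Diagnostic Test Results After the Fact
Choosing a Fault Isolation Tool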

If you want background information about the tools, turn to the section Isolating Faults in the System.



Note - Many of the procedures in this chapter assume that you are familiar with the OpenBoot firmware and that you know how to access the ok prompt. For background information, refer to the Netra 440 Server System Administration Guide.




Viewing and Setting OpenBoot Configuration Variables

Switches and OpenBoot configuration variables stored in the system configuration card determine how and when power-on self-test (POST) diagnostics and OpenBoot Diagnostics tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, see TABLE 2-1.


To View and Set OpenBoot Configuration Variables

1. Suspend the server's operating system software to reach the ok prompt.

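2. Enter the following commands:

To display the current values of all OpenBoot configuration variables, use the printenv command. The following example shows a short excerpt of this command's output.

ok printenv
Variable Name         Value                          Default Value

diag-level            min                            min
diag-switch?          false                          false

To set or change the value of an OpenBoot configuration variable, use the setenv command. For example:

ok setenv diag-level max
diag-level =     max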



Note - The test-args variable operates differently from other OpenBoot configuration variables. It requires a single argument consisting of a comma-separated list of keywords. For details, see Controlling OpenBoot Diagnostics Tests.



Changes to OpenBoot configuration variables usually take effect on the next reboot.
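
When printenv output is captured in a console log, it can be convenient to turn it into a lookup table and list the variables that differ from their defaults. The following Python sketch assumes the three-column layout shown in the excerpt above (name, value, and default separated by runs of two or more spaces). The function names are hypothetical, and the parser does not attempt to handle variables whose values are empty.

```python
import re

def parse_printenv(text):
    """Return {name: (value, default)} from captured printenv output."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the echoed command, and the column header.
        if not line or line.startswith("ok ") or line.startswith("Variable Name"):
            continue
        # Assumption: columns are separated by two or more spaces.
        fields = re.split(r"\s{2,}", line)
        if len(fields) == 3:
            name, value, default = fields
            variables[name] = (value, default)
    return variables

def changed_from_default(variables):
    """Return the names whose current value differs from the default."""
    return sorted(n for n, (v, d) in variables.items() if v != d)

sample = """ok printenv
Variable Name         Value                          Default Value

diag-level            min                            min
diag-switch?          false                          false
"""
settings = parse_printenv(sample)
print(settings["diag-level"])
print(changed_from_default(settings))
```

With the excerpt shown earlier, both variables are at their defaults, so the list of changed variables is empty.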


Operating the Locator LED

The Locator LED helps you to quickly find a specific system among numerous systems in a room. For background information about system LEDs, see the Netra 440 Server System Administration Guide (817-3884-xx).

You can turn the Locator LED on and off either from the system console, or by using the Advanced Lights Out Manager (ALOM) command-line interface.


To Operate the Locator LED

1. Access either the system console or the system controller.

For instructions, refer to the Netra 440 Server System Administration Guide.

2. Determine the current state of the Locator LED.

Do one of the following:

3. Turn the Locator LED on.

Do one of the following:

4. Turn the Locator LED off.

Do one of the following:


Putting the System in Diagnostics Mode

Firmware-based diagnostic tests can be bypassed to expedite the server's startup process. The following procedure ensures that POST and OpenBoot Diagnostics tests do run during startup. For background information, see Diagnostics: Reliability versus Availability.


To Put the System in Diagnostics Mode

1. Log in to the system console and access the ok prompt.

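2. Do one of the following, whichever is more convenient:

Set the server's system control rotary switch to the Diagnostics position. You can do this at the machine's front panel or, if you are running your test session remotely, through the ALOM interface.

Alternatively, set the diag-switch? variable to true. Type:

ok setenv diag-switch? true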

3. Set the diag-script OpenBoot configuration variable to normal. Type:

ok setenv diag-script normal

This allows OpenBoot Diagnostics tests to run automatically on all motherboard components.



Note - If you prefer that OpenBoot Diagnostics examine all IEEE 1275-compatible devices (not just those on the motherboard), set the diag-script variable to all.



4. Set OpenBoot configuration variables to trigger diagnostic tests. Type:

ok setenv post-trigger power-on-reset error-reset
ok setenv obdiag-trigger power-on-reset error-reset

5. Set the maximum POST diagnostic test level. Type:

ok setenv diag-level max

This ensures the most thorough power-on self-test possible. The maximum testing level requires considerably longer to complete than the minimum. Depending on system configuration, you may need to wait an additional 10 to 20 minutes for the server to boot.


Bypassing Firmware Diagnostics

POST and OpenBoot Diagnostics tests can be bypassed to expedite the server's startup process. For background information, see Diagnostics: Reliability versus Availability.



Caution - Bypassing diagnostic tests sacrifices system reliability by allowing a system to attempt to boot when it may have a serious hardware problem.




To Bypass Firmware Diagnostics

1. Log in to the system console and access the ok prompt.

2. Ensure that the server's system control rotary switch is set to the Normal position.

Setting the rotary switch to the Diagnostics position overrides the OpenBoot configuration variable settings and causes diagnostic tests to run.

3. Turn off the diag-switch? and diag-script variables. Type:

ok setenv diag-switch? false
ok setenv diag-script none

4. Set OpenBoot configuration trigger variables to bypass diagnostics. Type:

ok setenv post-trigger none
ok setenv obdiag-trigger none

The Netra 440 server is now configured to minimize the time it takes to reboot. If you change your mind and want to force diagnostic tests to run, see Putting the System in Diagnostics Mode.


Bypassing Diagnostics Temporarily

The ALOM system controller provides a "back-door" method of skipping diagnostic tests and booting the system. This procedure is only of assistance in those unusual circumstances where:


To Bypass Diagnostics Temporarily

1. Log in to the ALOM system controller and access the sc> prompt.

2. Type the following command:

sc> bootmode skip_diag

This command temporarily configures the system to skip its firmware-based diagnostic tests, regardless of how the OpenBoot configuration variables are set.

3. Within 10 minutes, power cycle the system. Type:

sc> poweroff
Are you sure you want to power off the system [y/n]? y
sc> poweron

You must execute the above commands within 10 minutes of using ALOM to change the boot mode. Ten minutes after you issue the ALOM bootmode command, the system reverts to its default boot mode as governed by the current settings of OpenBoot configuration variables, including diag-switch?, post-trigger, and obdiag-trigger.
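
The timing rule amounts to simple arithmetic. The Python sketch below, with hypothetical function and variable names, checks whether a planned power cycle falls inside the 10-minute window that follows the bootmode command. It only models the rule described here and does not communicate with ALOM. Treating the exact 10-minute mark as already expired is an assumption.

```python
from datetime import datetime, timedelta

# Window stated in the procedure: 10 minutes after the bootmode command.
BOOTMODE_WINDOW = timedelta(minutes=10)

def skip_diag_still_active(bootmode_issued_at, power_cycle_at):
    """Return True if the power cycle happens while skip_diag still applies."""
    elapsed = power_cycle_at - bootmode_issued_at
    # Assumption: the window is half-open, so exactly 10 minutes is too late.
    return timedelta(0) <= elapsed < BOOTMODE_WINDOW

issued = datetime(2024, 1, 1, 9, 0, 0)  # arbitrary example timestamp
print(skip_diag_still_active(issued, issued + timedelta(minutes=4)))
print(skip_diag_still_active(issued, issued + timedelta(minutes=12)))
```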

For more information about OpenBoot configuration variables and how they affect diagnostics, see Controlling POST Diagnostics.

If you suspect an incompatible or corrupted firmware image caused the problems you observed with firmware diagnostics, you should now restore the system firmware to a reliable state.

For more information about restoring the system firmware, contact your authorized service provider.


Maximizing Diagnostic Testing

To maximize system reliability, it is useful to have POST and OpenBoot Diagnostics tests trigger in the event of an operating system panic or any reset, and to have the system automatically run the most comprehensive tests possible. For background information, see Diagnostics: Reliability versus Availability.


To Maximize Diagnostic Testing

1. Log in to the system console and access the ok prompt.

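2. Do one of the following, whichever is more convenient:

Set the server's system control rotary switch to the Diagnostics position. You can do this at the server's front panel or, if you are running your test session remotely, through the ALOM interface.

Alternatively, set the diag-switch? variable to true. Type:

ok setenv diag-switch? true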

3. Set the diag-script OpenBoot configuration variable to all. Type:

ok setenv diag-script all

This allows OpenBoot Diagnostics tests to run automatically on all motherboard components and IEEE 1275-compatible devices.



Note - If you prefer that OpenBoot Diagnostics examine only motherboard-based devices, set the diag-script variable to normal.



4. Set OpenBoot configuration variables to trigger diagnostic tests. Type:

ok setenv post-trigger all-resets
ok setenv obdiag-trigger all-resets

5. Set the maximum POST diagnostic test level. Type:

ok setenv diag-level max

This ensures the most thorough testing possible. The maximum testing level requires considerably longer to complete than the minimum. Depending on system configuration, you may need to wait an additional 10 to 20 minutes for the server to boot.
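
The three procedures above (putting the system in diagnostics mode, bypassing firmware diagnostics, and maximizing diagnostic testing) differ only in the values assigned to a handful of OpenBoot configuration variables. The following Python sketch collects the values given in this chapter into one table and prints the matching setenv lines, which may make the differences easier to compare. The profile names and the helper function are hypothetical, and the rotary switch setting is outside the scope of the sketch.

```python
# Variable values taken from the setenv commands in this chapter.
# The profile names themselves are made up for this illustration.
PROFILES = {
    "diagnostics-mode": {
        "diag-script": "normal",
        "post-trigger": "power-on-reset error-reset",
        "obdiag-trigger": "power-on-reset error-reset",
        "diag-level": "max",
    },
    "bypass": {
        "diag-switch?": "false",
        "diag-script": "none",
        "post-trigger": "none",
        "obdiag-trigger": "none",
    },
    "maximize": {
        "diag-script": "all",
        "post-trigger": "all-resets",
        "obdiag-trigger": "all-resets",
        "diag-level": "max",
    },
}

def setenv_lines(profile):
    """Return the ok-prompt setenv lines for one named profile."""
    return [f"ok setenv {name} {value}" for name, value in PROFILES[profile].items()]

for line in setenv_lines("maximize"):
    print(line)
```

Keeping the settings in one table also makes it easy to see that the diagnostics-mode and maximize profiles differ only in diag-script and in the trigger values.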


Isolating Faults Using LEDs

While not a comprehensive diagnostic tool, LEDs located on the chassis and on selected system components can serve as front-line indicators of a limited set of hardware failures.

You can view LED status by direct inspection of the system's front and back panels. You can also view the status of certain LEDs from the ALOM system controller command-line interface.



Note - Most LEDs available on the front panel are also duplicated on the back panel.




To Isolate Faults Using LEDs

1. Check the system LEDs.

There is a group of three LEDs located near the top left corner of the front panel and duplicated on the back panel. Their status can tell you the following:

| LED Name (location; color) | Indicates | Action |
| Locator (left; white) | A system administrator can turn this on to flag a system that needs attention. | Identify a particular system among many. |
| Service Required (middle; amber) | If lit, hardware or software has detected a problem with the system. | Check other LEDs or run diagnostics to determine the problem source. |
| System Activity (right; green) | If blinking, operating system is in the process of booting. If off, operating system has stopped. | Not applicable. |


The Locator and Service Required LEDs are powered by the system's 5-volt standby power source and remain lit for any fault condition that results in a system shutdown.



Note - To view the status of system LEDs from ALOM, type showenvironment from the sc> prompt.



2. Check the power supply LEDs.

Each power supply has a set of four LEDs located on the front panel and duplicated on the back panel. Their status can tell you the following:

| LED Name (location; color) | Indicates | Action |
| OK-to-Remove (top; blue) | If lit, power supply can safely be removed. | Remove power supply as needed. |
| Service Required (2nd from top; amber) | If lit, there is a problem with the power supply or its internal fan. | Replace the power supply. |
| Power OK (3rd from top; green) | If off, inadequate DC power is being produced by the supply. | Remove and reseat the power supply. If this does not help, replace the supply. |
| Standby Available (bottom; green) | If off, either AC power is not reaching the supply, or the supply is not producing adequate 5V standby power. | Check the power cord and the outlet to which it connects. If necessary, replace the supply. |

Note - Remove a failed power supply only when you are ready to install its replacement. Both power supplies must remain in place to ensure proper air circulation and chassis cooling.


3. Check the hard drive LEDs.

Hard drive LEDs are located behind the left system door. Just to the right of each hard drive is a set of three LEDs. Their status can tell you the following:

| LED Name (location; color) | Indicates | Action |
| OK-to-Remove (top; blue) | If lit, disk can safely be removed. | Remove disk as needed. |
| Service Required (middle; amber) | This LED is reserved for future use. | Not applicable. |
| Activity (bottom; green) | If lit or blinking, disk is operating normally. | Not applicable. |


4. Check the DVD-ROM LED.

The DVD-ROM drive features a Power/Activity LED that tells you the following:

| LED Name (color) | Indicates | Action |
| Power/Activity (green) | If lit or blinking, drive is operating normally. | If this LED is off, and you know the system is receiving power, check the DVD-ROM drive and its cables. |


5. Check the Ethernet port LEDs.

Two Ethernet port LEDs are located on the system back panel.

| LED Name (color) | Indicates | Action |
| Link/Activity (green) | If lit, a link is established. If blinking, there is activity. Both states indicate normal operation. | If this LED is off and you know a link is being attempted, check the Ethernet cables. |
| Speed (amber) | If lit, a Gigabit Ethernet connection is established. If off, a 10/100-Mbps Ethernet connection is established. | |


6. If LEDs do not disclose the source of a suspected problem, try putting the affected server in Diagnostics mode.

See Putting the System in Diagnostics Mode.

You can also run power-on self-test (POST) diagnostics. See Isolating Faults Using POST Diagnostics.
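
The power supply LED table in step 2 can be read as a small decision procedure. The Python sketch below encodes those four rows. The function name, the argument names, and the order in which the LEDs are checked are assumptions made for illustration and are not part of the documented procedure.

```python
def power_supply_action(ok_to_remove, service_required, power_ok, standby_available):
    """Suggest actions from the four power supply LED states (True means lit)."""
    actions = []
    # Assumption: check the standby and DC power indicators first.
    if not standby_available:
        actions.append("Check the power cord and outlet; if necessary, replace the supply.")
    if not power_ok:
        actions.append("Remove and reseat the power supply; if this does not help, replace it.")
    if service_required:
        actions.append("Replace the power supply.")
    if ok_to_remove:
        actions.append("Power supply can safely be removed.")
    return actions or ["No action indicated by the power supply LEDs."]

# Example: only the amber Service Required LED is lit alongside normal power LEDs.
print(power_supply_action(ok_to_remove=False, service_required=True,
                          power_ok=True, standby_available=True))
```

As the table notes, a failed supply should be removed only when its replacement is ready to install, so the output of a helper like this is a prompt for the technician rather than an instruction to act immediately.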


Isolating Faults Using POST Diagnostics

This section explains how to run power-on self-test (POST) diagnostics to isolate faults in a Netra 440 server. For background information about POST diagnostics and the boot process, see Chapter 2.


To Isolate Faults Using POST Diagnostics

1. Log in to the system console and access the ok prompt.

This procedure assumes that the system is in diagnostics mode. See Putting the System in Diagnostics Mode.

The procedure also assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.

2. (Optional) Set the OpenBoot configuration variable diag-level to max. Type:

ok setenv diag-level max
diag-level =     max

This provides the most extensive diagnostic testing.

3. Power on the server.

Do one of the following:

Then, from the sc> prompt, type:

sc> poweron
sc> console
ok

The system runs the POST diagnostics and displays status and error messages through the local serial terminal.



Note - You will not see any POST output if you remain at the sc> prompt. You must return to the ok prompt by typing the console command as shown above.



4. Examine the POST output.

Each POST error message includes a "best guess" as to which field-replaceable unit (FRU) was the source of failure. In some cases, there may be more than one possible source, and these are listed in order of decreasing likelihood.



Note - Should the POST output contain code names and acronyms with which you are unfamiliar, see TABLE 2-13 in Terms in Diagnostic Output.



5. Try replacing the FRU or FRUs indicated by POST error messages, if any.

For replacement instructions, refer to the Netra 440 Server Service Manual.

6. If the POST diagnostics did not turn up any problems, but your system does not start up, try running the interactive OpenBoot Diagnostics tests.

For more information, see Isolating Faults Using Interactive OpenBoot Diagnostics Tests.


Isolating Faults Using Interactive OpenBoot Diagnostics Tests

Because OpenBoot Diagnostics tests require access to some of the same hardware resources used by the operating system, the tests cannot be run reliably after an operating system halt or L1-A (Stop-A) key sequence. You need to reset the system before running OpenBoot Diagnostics tests, and then reset the system again after testing. Instructions for doing this follow.


To Isolate Faults Using Interactive OpenBoot Diagnostics Tests

1. Log in to the system console and access the ok prompt.

2. Set the auto-boot? OpenBoot configuration variable to false. Type:

ok setenv auto-boot? false

3. Reset or power cycle the system.

4. Invoke the OpenBoot Diagnostics tests. Type:

ok obdiag

The obdiag> prompt and test menu appear. The menu is shown in FIGURE 2-3.

5. (Optional) Set the desired test level.

You may want to perform the most extensive testing possible by setting the diag-level OpenBoot configuration variable to max:

obdiag> setenv diag-level max



Note - If diag-level is set to off, OpenBoot firmware returns a passed status for all core tests, but performs no testing.



You can set any OpenBoot configuration variable (see TABLE 2-1) from the obdiag> prompt in the same way.

6. Type the appropriate command and numbers for the tests you want to run.

For example, to run all available OpenBoot Diagnostics tests, type:

obdiag> test-all

To run a particular test, type:

obdiag> test #

where # represents the number of the desired test.

For a list of OpenBoot Diagnostics test commands, see Interactive OpenBoot Diagnostics Commands. The menu of numbered tests is shown in FIGURE 2-3.

7. When you are done running OpenBoot Diagnostics tests, exit the test menu. Type:

obdiag> exit

The ok prompt reappears.

8. Set the auto-boot? OpenBoot configuration variable back to true. Type:

ok setenv auto-boot? true

This allows the operating system to resume starting up automatically after future system resets or power cycles.

9. To reboot the system, type:

ok reset-all

The system stores the OpenBoot configuration variable settings and boots automatically when the auto-boot? variable is set to true.

Try replacing the FRU or FRUs indicated by OpenBoot Diagnostics error messages, if any. For FRU replacement instructions, refer to the Netra 440 Server Service Manual.
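
The order of operations in this procedure matters: automatic booting is disabled before the reset and restored after testing. As a rough illustration, the Python sketch below assembles the commands from steps 2 through 9 for a given list of test numbers. The function name is hypothetical, the test numbers in the example are arbitrary, and the reset in step 3 is left as a comment line because the procedure does not prescribe a single command for it.

```python
def obdiag_session(test_numbers, level=None):
    """Return the console lines for one interactive OpenBoot Diagnostics session."""
    lines = [
        "ok setenv auto-boot? false",
        "# reset or power cycle the system here (step 3)",
        "ok obdiag",
    ]
    if level is not None:
        # Optional step 5: set the test level from the obdiag> prompt.
        lines.append(f"obdiag> setenv diag-level {level}")
    if test_numbers:
        lines += [f"obdiag> test {n}" for n in test_numbers]
    else:
        # With no specific tests requested, run every available test.
        lines.append("obdiag> test-all")
    lines += ["obdiag> exit", "ok setenv auto-boot? true", "ok reset-all"]
    return lines

print("\n".join(obdiag_session([3, 7], level="max")))
```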


Viewing Diagnostic Test Results After the Fact

Summaries of the results from the most recent power-on self-test (POST) and OpenBoot Diagnostics tests are saved across power cycles.


To View Diagnostic Test Results

1. Log in to the system console and access the ok prompt.

2. View the desired category of test results.

You should see a system-dependent list of hardware components, along with an indication of which components passed and which failed POST or OpenBoot Diagnostics tests.


Choosing a Fault Isolation Tool

This section helps you choose the right tool to isolate a failed part in a Netra 440 server. Consider the following questions when selecting a tool.

1. Have you checked the LEDs?

Certain system components have built-in LEDs that can alert you when that component requires replacement. For detailed instructions, see Isolating Faults Using LEDs.

2. Does the system boot?

3. Do you intend to run the tests remotely?

The ALOM system controller software enables you to run tests from a remote server. In addition, ALOM provides a means of redirecting system console output, allowing you to remotely view and run tests, such as POST diagnostics, that usually require physical proximity to the serial port on the server's back panel.

SunVTS software, a system exercising tool, also enables you to run tests remotely, using either the product's graphical interface or tty mode through a remote login or Telnet session.

4. Will the tool test the suspected sources of the problem?

Perhaps you already have some idea of what the problem is. If so, you want to use a diagnostic tool capable of testing the suspected problem sources.

5. Is the problem intermittent or software related?

If a problem is not caused by a clearly defective hardware component, then you may want to use a system-exerciser tool rather than a fault-isolation tool. See Chapter 2 for instructions and Exercising the System for background information.

  FIGURE 3-1 Choosing a Tool to Isolate Hardware Faults

This figure is a flowchart showing in what order and under what conditions one might choose to run various diagnostic tools to isolate a hardware failure.