CHAPTER 2
Diagnostics and the Boot Process
This chapter introduces the tools that let you isolate faults and monitor and exercise systems. It also helps you understand how the various tools fit together.
Topics in this chapter include:
If you only want instructions for using diagnostic tools, skip this chapter and turn to:
You may also find it helpful to turn to Netra 440 Server System Administration Guide for information about the system console.
You have probably had the experience of powering on a Sun system and watching as it goes through its boot process. Perhaps you have watched as your console displays messages that look like the following.
It turns out these messages are not quite so inscrutable as they first appear once you understand the boot process. These kinds of messages are discussed later.
It is possible to bypass firmware-based diagnostic tests in order to minimize how long it takes a server to reboot. However, in the following discussion, assume that the system is attempting to boot in diagnostics mode, during which the firmware-based tests run. See Putting the System in Diagnostics Mode for instructions.
The boot process requires several stages, detailed in these sections:
As soon as you connect the Netra 440 server to an electrical outlet, and before you turn on power to the server, the system controller inside the server begins its self-diagnostic and boot cycle. The system controller is incorporated into the Sun Advanced Lights Out Manager (ALOM) card installed in the Netra 440 server chassis. Running off standby power, the card begins functioning before the server itself comes up.
The system controller provides access to a number of control and monitoring functions through the ALOM command-line interface. For more information about ALOM, see Monitoring the System Using Sun Advanced Lights Out Manager.
Every Netra 440 server includes a chip holding about 2 Mbytes of firmware-based code. This chip is called the boot PROM. After you turn on system power, the first thing the system does is execute code that resides in the boot PROM.
This code, which is referred to as the OpenBoot firmware, is a small-scale operating system unto itself. However, unlike a traditional operating system that can run multiple applications for multiple simultaneous users, OpenBoot firmware runs in single-user mode and is designed solely to configure and boot the system. OpenBoot firmware also initiates firmware-based diagnostics that test the system, thereby ensuring that the hardware is sufficiently "healthy" to run its normal operating environment.
When system power is turned on, the OpenBoot firmware begins running directly out of the boot PROM, since at this stage system memory has not been verified to work properly.
Soon after power is turned on, the system hardware determines that at least one CPU is powered on, and is submitting a bus access request, which indicates that the CPU in question is at least partly functional. This becomes the master CPU, and is responsible for executing OpenBoot firmware instructions.
The OpenBoot firmware's first actions are to check whether to run the power-on self-test (POST) diagnostics and other tests. The POST diagnostics constitute a separate chunk of code stored in a different area of the boot PROM (see FIGURE 2-1).
The extent of these power-on self-tests, and whether they are performed at all, is controlled by configuration variables stored in the removable system configuration card (SCC). These OpenBoot configuration variables are discussed in Controlling POST Diagnostics.
As soon as POST diagnostics can verify that some subset of system memory is functional, tests are loaded into system memory.
The POST diagnostics verify the core functionality of the system. A successful execution of the POST diagnostics does not ensure that there is nothing wrong with the server, but it does ensure that the server can proceed to the next stage of the boot process.
For a Netra 440 server, this means:
It is possible for a system to pass all POST diagnostics and still be unable to boot the operating system. However, you can run POST diagnostics even when a system fails to boot, and these tests are likely to disclose the source of most hardware problems.
POST generally reports errors that are persistent in nature. To catch intermittent problems, consider running a system exercising tool. See Exercising the System.
Each POST diagnostic is a low-level test designed to pinpoint faults in a specific hardware component. For example, individual memory tests called address bitwalk and data bitwalk ensure that binary 0s and 1s can be written on each address and data line. During such a test, the POST may display output similar to this example.
In this example, CPU 1 is the master CPU, as indicated by the prompt 1>, and it is about to test the memory associated with CPU 3, as indicated by the message Slave 3.
The failure of such a test reveals precise information about particular integrated circuits, the memory registers inside them, or the data paths connecting them.
In this case, the DIMM labeled J0602, associated with CPU 3, was found to be faulty. For information about the several ways firmware messages identify memory, see Identifying Memory Modules.
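To make the idea of a data bitwalk concrete, the following Python sketch walks a single 1, and then a single 0, across each data line of one memory word and reports any line that does not hold the value written. It is only an illustration under stated assumptions: the `write` and `read` accessors, the 8-bit width, and the simulated stuck-at-0 fault are hypothetical and are not taken from the POST firmware.

```python
def data_bitwalk(write, read, address, width=8):
    """Walk a single 1 bit, then a single 0 bit, across every data line.

    `write(address, value)` and `read(address)` are caller-supplied
    (hypothetical) accessors. Returns the sorted list of failing bit positions.
    """
    mask = (1 << width) - 1
    failed = set()
    for bit in range(width):
        # First a lone 1 on this line, then a lone 0 on this line.
        for pattern in (1 << bit, ~(1 << bit) & mask):
            write(address, pattern)
            observed = read(address)
            # Only the line under test is judged on this pass.
            if (observed ^ pattern) & (1 << bit):
                failed.add(bit)
    return sorted(failed)


# Simulated memory in which data line 3 is stuck at 0 (an assumption
# made purely for this demonstration).
cells = {}


def faulty_write(address, value):
    cells[address] = value & ~(1 << 3)


def read(address):
    return cells.get(address, 0)


print(data_bitwalk(faulty_write, read, address=0x100))  # [3]
```

An address bitwalk follows the same pattern, except that the walking bit is applied to the address lines rather than to the data written.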
When a specific power-on self-test discloses an error, it reports the following kinds of information about the error:
Here is an excerpt of POST output showing another error message.
An important feature of POST error messages is that the H/W under test line (the second line in CODE EXAMPLE 2-1) indicates which FRU or FRUs may be responsible for the error. Note that in CODE EXAMPLE 2-1, two different FRUs are indicated. Using TABLE 2-13 to decode some of the terms, you can see that this POST error was most likely caused by bad integrated circuits (IO-Bridge) or electrical pathways on the motherboard. However, the error message also indicates that the master CPU, in this case CPU 1, may be at fault. For information on how Netra 440 CPUs are numbered, see Identifying CPU/Memory Modules.
Though beyond the scope of this manual, it is worth noting that POST error messages provide fault isolation capability beyond the FRU level. In the current example, the MSG line located immediately below the H/W under test line specifies the particular integrated circuit (DEVICE NAME: SCSI) most likely at fault. This level of isolation is most useful at the repair depot.
Because each test operates at such a low level, the POST diagnostics are often more definite in reporting the minute details of the error, like the numerical values of expected and observed results, than they are about reporting which FRU is responsible. If this seems counterintuitive, consider the block diagram of one data path within a Netra 440 server, shown in FIGURE 2-2.
The dashed line in FIGURE 2-2 represents a boundary between FRUs. Suppose a POST diagnostic is running in the CPU in the left part of the diagram. This diagnostic attempts to access registers in a PCI device located in the right side of the diagram.
If this access fails, there could be a fault in the PCI device, or, less likely, in one of the data paths or components leading to that PCI device. The POST diagnostic can tell you only that the test failed, but not why. So, though the POST diagnostic may present very precise data about the nature of the test failure, potentially several different FRUs could be implicated.
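The reasoning above can be sketched in a few lines of Python: if a test only knows that an access along a data path failed, every FRU that the path crosses remains a suspect. The hop names and FRU assignments below are hypothetical placeholders chosen for illustration; they are not a transcription of FIGURE 2-2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    component: str  # a component or data path segment
    fru: str        # the FRU that contains it


# Hypothetical data path from a CPU to a PCI device (illustrative only).
DATA_PATH = [
    Hop("CPU", "CPU/memory module"),
    Hop("system bus", "motherboard"),
    Hop("IO-Bridge", "motherboard"),
    Hop("PCI bus", "motherboard"),
    Hop("PCI device", "PCI card"),
]


def implicated_frus(path):
    """Return the distinct FRUs along the path, in first-seen order."""
    seen = []
    for hop in path:
        if hop.fru not in seen:
            seen.append(hop.fru)
    return seen


print(implicated_frus(DATA_PATH))
# ['CPU/memory module', 'motherboard', 'PCI card']
```

A single failed register access therefore yields precise data about the failure but a list of candidate FRUs rather than one culprit.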
You control POST diagnostics (and other aspects of the boot process) by setting OpenBoot configuration variables in the system configuration card. Changes to OpenBoot configuration variables generally take effect only after the server is reset.
TABLE 2-1 lists the most important and useful of these variables, which are more fully documented in the OpenBoot Command Reference Manual. You can find instructions for changing OpenBoot configuration variables in Viewing and Setting OpenBoot Configuration Variables.
Determines whether the operating system automatically starts up. Default is true.
Determines the level or type of diagnostics executed. Default is .
Determines which devices are tested by OpenBoot Diagnostics. Default is none.
false-- , even if post-trigger and obdiag-trigger conditions are satisfied. Causes system to boot using boot-device and boot-file parameters. NOTE: You can put the system in diagnostics mode either by setting this variable to true or by setting the system control rotary switch to the Diagnostics position. For details, see Putting the System in Diagnostics Mode.
Specifies the class of reset event that causes POST diagnostics or OpenBoot Diagnostics tests to run. These variables can accept single keywords as well as combinations of the first three keywords separated by spaces. For details, see Viewing and Setting OpenBoot Configuration Variables.
Selects where system console input is taken from. Default is ttya.
Selects where diagnostic and other system console output is displayed. Default is ttya.
Note - These variables affect OpenBoot Diagnostics tests as well as POST diagnostics.
The OpenBoot configuration variables described in TABLE 2-1 let you control not only how diagnostic tests proceed, but also what triggers them.
Bypassing diagnostic tests can create a situation where a server with faulty hardware gets locked into a cycle of repeated booting and crashing. Depending on the type of problem, the cycle may repeat intermittently. Because diagnostic tests are never invoked, the crashes may occur without leaving behind any log entries or meaningful console messages.
The section Putting the System in Diagnostics Mode provides instructions for ensuring that your server runs diagnostics when starting up. The section Bypassing Firmware Diagnostics explains how to disable firmware diagnostics.
Even if you set up the server to run diagnostic tests automatically on reboot, it is still possible to bypass diagnostic tests for a single boot cycle. This can be useful in cases where you are reconfiguring the server, or on those rare occasions when POST or OpenBoot Diagnostics tests themselves stall or "hang," leaving the server unable to boot and in an unusable state. These "hangs" most commonly result from firmware corruption of some sort, especially from flashing an incompatible firmware image into the server's PROMs.
If you do find yourself needing to skip diagnostic tests for a single boot cycle, the ALOM system controller provides a convenient way to do this. See Bypassing Diagnostics Temporarily for instructions.
By default, diagnostics do not run following a user- or operating system-initiated reset. This means the system does not run diagnostics in the event of an operating system panic. To ensure maximum reliability, especially for automatic system recovery (ASR), you can configure the system to run its firmware-based diagnostic tests following all resets. For instructions, see Maximizing Diagnostic Testing.
Once POST diagnostics have finished running, POST marks the status of any faulty device as "FAILED," and returns control to OpenBoot firmware.
OpenBoot firmware compiles a hierarchical "census" of all devices in the system. This census is called a device tree. Though different for every system configuration, the device tree generally includes both built-in system components and optional PCI bus devices. The device tree does not include any components marked as "FAILED" by POST diagnostics.
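As a rough illustration of this "census," the Python sketch below nests a flat list of device paths into a tree and leaves out anything that POST marked as "FAILED." The device paths are hypothetical, and the choice to also skip children of a failed device is an assumption of this sketch rather than documented firmware behavior.

```python
def build_device_tree(probed_paths, post_status):
    """Nest device paths into a dict-based tree, skipping FAILED devices.

    `post_status` maps a device path to a status string. A device is also
    skipped when one of its ancestors is FAILED (an assumption of this sketch).
    """
    failed = {path for path, status in post_status.items() if status == "FAILED"}
    tree = {}
    for path in sorted(probed_paths):
        parts = path.strip("/").split("/")
        # Every ancestor path of this device, plus the device itself.
        prefixes = ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]
        if any(prefix in failed for prefix in prefixes):
            continue
        node = tree
        for part in parts:
            node = node.setdefault(part, {})
    return tree


# Hypothetical device paths, for illustration only.
probed = ["/bus@0/dev@1", "/bus@0/dev@2", "/cpu@0"]
status = {"/bus@0/dev@2": "FAILED"}
print(build_device_tree(probed, status))
# {'bus@0': {'dev@1': {}}, 'cpu@0': {}}
```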
Following the successful execution of POST diagnostics, the OpenBoot firmware proceeds to run OpenBoot Diagnostics tests. Like the POST diagnostics, OpenBoot Diagnostics code is firmware-based and resides in the boot PROM.
OpenBoot Diagnostics tests focus on system I/O and peripheral devices. Any device in the device tree, regardless of manufacturer, that includes an IEEE 1275-compatible self-test is included in the suite of OpenBoot Diagnostics tests. On a Netra 440 server, OpenBoot Diagnostics examine the following system components:
The OpenBoot Diagnostics tests run automatically through a script when you start up the system in diagnostics mode. However, you can also run OpenBoot Diagnostics tests manually, as explained in the next section.
Like POST diagnostics, OpenBoot Diagnostics tests catch persistent errors. To disclose intermittent problems, consider running a system exercising tool. See Exercising the System.
When you restart the system, you can run OpenBoot Diagnostics tests either interactively from a test menu, or by entering commands directly from the ok prompt.
Most of the same OpenBoot configuration variables you use to control POST (see TABLE 2-1) also affect OpenBoot Diagnostics tests. Notably, you can determine OpenBoot Diagnostics testing level--or suppress testing entirely--by appropriately setting the diag-level variable.
In addition, the OpenBoot Diagnostics tests use a special variable called test-args that enables you to customize how the tests operate. By default, test-args is set to contain an empty string. However, you can set test-args to one or more of the reserved keywords, each of which has a different effect on OpenBoot Diagnostics tests. TABLE 2-2 lists the available keywords.
If you want to make multiple customizations to the OpenBoot Diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:
It is easiest to run OpenBoot Diagnostics tests interactively from a menu. You access the menu by typing obdiag at the ok prompt. See Isolating Faults Using Interactive OpenBoot Diagnostics Tests for full instructions.
The obdiag> prompt and the OpenBoot Diagnostics interactive menu (FIGURE 2-3) appear. Only the devices detected by OpenBoot firmware appear in this menu. For a brief explanation of each OpenBoot Diagnostics test, see TABLE 2-10 in OpenBoot Diagnostics Test Descriptions.
You run individual OpenBoot Diagnostics tests from the obdiag> prompt by typing:
where n represents the number associated with a particular menu item.
There are several other commands available to you from the obdiag> prompt. For descriptions of these commands, see TABLE 2-11 in OpenBoot Diagnostics Test Descriptions.
You can obtain a summary of this same information by typing help at the obdiag> prompt.
You can also run OpenBoot Diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:
Note - Knowing how to construct an appropriate hardware device path requires precise knowledge of the hardware architecture of the Netra 440 server. If you lack this knowledge, it may help to use the OpenBoot show-devs command (see show-devs Command), which displays a list of all configured devices.
To customize an individual test, you can use test-args as follows:
This affects only the current test without changing the value of the test-args OpenBoot configuration variable.
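The relationship between the stored test-args variable and a per-test override can be modeled as follows. This Python sketch is not firmware code; it only mirrors the two behaviors described here: keywords are given as a comma-separated list, and a per-test value leaves the stored variable untouched. The keyword names `keyword-a` and `keyword-b` are hypothetical placeholders, not entries from TABLE 2-2.

```python
def split_test_args(value):
    """Split a comma-separated test-args string into a list of keywords."""
    return [keyword.strip() for keyword in value.split(",") if keyword.strip()]


def effective_test_args(config, override=None):
    """Return the keywords used for one test run.

    `config` stands in for the stored OpenBoot configuration variables and
    is never modified; `override` models a per-test test-args value.
    """
    raw = override if override is not None else config.get("test-args", "")
    return split_test_args(raw)


config = {"test-args": ""}  # the default is an empty string
print(effective_test_args(config))                         # []
print(effective_test_args(config, "keyword-a,keyword-b"))  # ['keyword-a', 'keyword-b']
print(config["test-args"] == "")                           # True
```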
You can test all the devices in the device tree with the test-all command:
If you specify a path argument to test-all, then only the specified device and its children are tested. The following example shows the command to test the USB bus and all devices with self-tests that are connected to the USB bus:
OpenBoot Diagnostics error messages are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. CODE EXAMPLE 2-2 displays a sample OpenBoot Diagnostics error message, one that suggests a failure of the IDE controller.
The i2c@0,320 OpenBoot Diagnostics test examines and reports on environmental monitoring and control devices connected to the Netra 440 server's Inter-Integrated Circuit (I2C) bus.
Error and status messages from the i2c@0,320 OpenBoot Diagnostics test include the hardware addresses of I2C bus devices.
The I2C device address is given at the very end of the hardware path. In this example, the address is 0,b6, which indicates a device located at hexadecimal address b6 on segment 0 of the I2C bus.
To decode this device address, see Decoding I2C Diagnostic Test Messages. Using TABLE 2-12, you can see that dimm-spd@0,b6 corresponds to DIMM 0 on CPU/memory module 0. If the i2c@0,320 test were to report an error against dimm-spd@0,b6, you would need to replace this DIMM.
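If you need to pull the segment and address out of many such messages, the last component of the path can be parsed mechanically. The following Python sketch assumes the node has the form name@segment,address with both numbers in hexadecimal, as in dimm-spd@0,b6; the function name is hypothetical, and the path passed to it below is abbreviated to its last two components.

```python
import re

# name@segment,address with hexadecimal segment and address (assumed form).
_I2C_NODE = re.compile(
    r"^(?P<name>[^@/]+)@(?P<segment>[0-9a-fA-F]+),(?P<address>[0-9a-fA-F]+)$"
)


def decode_i2c_node(hardware_path):
    """Decode the last component of an I2C hardware path.

    Returns (device_name, segment, address), with both numbers parsed
    as hexadecimal.
    """
    last = hardware_path.rstrip("/").split("/")[-1]
    match = _I2C_NODE.match(last)
    if match is None:
        raise ValueError(f"not an I2C device node: {last!r}")
    return match["name"], int(match["segment"], 16), int(match["address"], 16)


name, segment, address = decode_i2c_node("i2c@0,320/dimm-spd@0,b6")
print(name, segment, hex(address))  # dimm-spd 0 0xb6
```

The decoded name and address still have to be looked up in TABLE 2-12 to find the physical device.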
Beyond the formal firmware-based diagnostic tools, there are a few commands you can invoke from the ok prompt. These OpenBoot commands display information that can help you assess the condition of a Netra 440 server. These include the following:
The following sections describe the information these commands give you. For instructions on using these commands, turn to Using OpenBoot Information Commands, or look up the appropriate man page.
The printenv command displays the OpenBoot configuration variables. The display includes the current values for these variables as well as the default values. For details, see Viewing and Setting OpenBoot Configuration Variables.
For a list of some important OpenBoot configuration variables, see TABLE 2-1.
The probe-scsi and probe-scsi-all commands diagnose problems with attached and internal SCSI devices.
Caution - If you used the halt command or the L1-A (Stop-A) key sequence to reach the ok prompt, then issuing the probe-scsi or probe-scsi-all command can hang the system.
The probe-scsi command communicates with all SCSI devices connected to on-board SCSI controllers. The probe-scsi-all command additionally accesses devices connected to any host adapters installed in PCI slots.
For any SCSI device that is connected and active, the probe-scsi and probe-scsi-all commands display its target and unit numbers, and a device description that includes type and manufacturer.
The following is sample output from the probe-scsi command.
The following is sample output from the probe-scsi-all command.
The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD-ROM drive.
Caution - If you used the halt command or the L1-A (Stop-A) key sequence to reach the ok prompt, then issuing the probe-ide command can hang the system.
The following is sample output from the probe-ide command.
ok probe-ide
  Device 0  ( Primary Master )
        Removable ATAPI Model: TOSHIBA DVD-ROM SD-C2512
  Device 1  ( Primary Slave )
        Not Present
The show-devs command lists the hardware device paths for each device in the firmware device tree. CODE EXAMPLE 2-6 shows some sample output (edited for brevity).
If a system passes OpenBoot Diagnostics tests, it normally attempts to boot its multiuser operating environment. For most Sun systems, this means the Solaris OS. Once the server is running in multiuser mode, you have recourse to software-based diagnostic tools, like SunVTS and Sun Management Center software. These tools can help you with more advanced monitoring, exercising, and fault isolating capabilities.
Note - If you set the auto-boot? OpenBoot configuration variable to false, the operating environment does not boot following completion of the firmware-based tests.
In addition to the formal tools that run on top of Solaris OS software, there are other resources that you can use when assessing or monitoring the condition of a Netra 440 server. These resources include the following:
Error and other system messages are saved in the file /var/adm/messages. Messages are logged to this file from many sources, including the operating system, the environmental control subsystem, and various software applications.
In the case of Solaris OS software, the syslogd daemon and its configuration file (/etc/syslog.conf) control how error messages are handled.
For information about /var/adm/messages and other sources of system information, refer to "How to Customize System Message Logging" in the System Administration Guide: Advanced Administration, which is part of the Solaris System Administration Collection.
Some Solaris commands display data that you can use when assessing the condition of a Netra 440 server. These commands include the following:
The following sections describe the information these commands give you. For instructions on using these commands, turn to Using Solaris System Information Commands, or look up the appropriate man page.
The prtconf command displays the Solaris device tree. This tree includes all the devices probed by OpenBoot firmware, as well as additional devices, like individual disks, that only the operating environment software "knows" about. The output of prtconf also includes the total amount of system memory. CODE EXAMPLE 2-7 shows an excerpt of prtconf output (edited for brevity).
The prtconf command's -p option produces output similar to the OpenBoot show-devs command (see show-devs Command). This output lists only those devices compiled by the system firmware.
The prtdiag command displays a table of diagnostic information that summarizes the status of system components.
The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on your system. Following are several excerpts of the output produced by prtdiag on a "healthy" Netra 440 server running Solaris 8 software.
The prtdiag command produces a great deal of output about the system memory configuration. Another excerpt follows.
In addition to the preceding information, prtdiag with the verbose option (-v) also reports on front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures.
In the event of an overtemperature condition, prtdiag reports warning or failed in the Status column.
Similarly, if there is a failure of a particular component, prtdiag reports a fault in the appropriate Status column.
Fan Status:
---------------------------------------
Location       Sensor         Status
---------------------------------------
FT1/F0         F0             failed (0 rpm)
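When prtdiag output is captured by a script, rows like the one above can be picked out automatically. The Python sketch below assumes a three-column table (Location, Sensor, Status) and flags any row whose status begins with warning or failed, the two values this section mentions. The function name is hypothetical, and other prtdiag tables may use different columns.

```python
def flagged_rows(table_text):
    """Return (location, sensor, status) for rows reporting warning or failed.

    Assumes three whitespace-separated columns, where the status column
    may itself contain spaces.
    """
    flagged = []
    for line in table_text.splitlines():
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[2].startswith(("warning", "failed")):
            flagged.append(tuple(fields))
    return flagged


sample = """\
Fan Status:
---------------------------------------
Location       Sensor         Status
---------------------------------------
FT1/F0         F0             failed (0 rpm)
"""
print(flagged_rows(sample))
# [('FT1/F0', 'F0', 'failed (0 rpm)')]
```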
Here is an example of how the prtdiag command displays the status of system LEDs.
The Netra 440 server maintains a hierarchical list of all field-replaceable units (FRUs) in the system, as well as specific information about various FRUs.
The prtfru command can display this hierarchical list, as well as data contained in the serial electrically-erasable programmable read-only memory (SEEPROM) devices located on many FRUs. CODE EXAMPLE 2-14 shows an excerpt of a hierarchical list of FRUs generated by the prtfru command with the -l option.
CODE EXAMPLE 2-15 shows an excerpt of SEEPROM data generated by the prtfru command with the -c option.
The prtfru command displays varied data depending on the type of FRU. In general, this information includes:
Information about the following Netra 440 server FRUs is displayed by the prtfru command:
Similar information is provided by the ALOM system controller showfru command. For more information about showfru and other ALOM commands, see Monitoring the System Using Sun Advanced Lights Out Manager.
The psrinfo command displays the date and time each CPU came online. With the verbose option (-v), the command displays additional information about the CPUs, including their clock speed. The following is sample output from the psrinfo command with the -v option.
The showrev command displays revision information for the current hardware and software. CODE EXAMPLE 2-17 shows sample output of the showrev command.
When used with the -p option, this command displays installed patches. CODE EXAMPLE 2-18 shows a partial sample output from the showrev command with the -p option.
Different diagnostic tools are available to you at different stages of the boot process. TABLE 2-3 summarizes what tools are available to you and when they are available.
When the system is turned off but standby power is available
Each of the tools available for fault isolation discloses faults in different field-replaceable units (FRUs). The row headings along the left of TABLE 2-4 list the FRUs in a Netra 440 server. The available diagnostic tools are shown in column headings across the top. A check mark in this table indicates that a fault in a particular FRU can be isolated by a particular diagnostic.
No coverage. See TABLE 2-5 for fault isolation hints. |
|||||
No coverage. See TABLE 2-5 for fault isolation hints. |
|||||
No coverage. See TABLE 2-5 for fault isolation hints. |
|||||
No coverage. See TABLE 2-5 for fault isolation hints. |
In addition to the FRUs listed in TABLE 2-4, there are several minor replaceable system components--mostly cables--that cannot directly be isolated by any system diagnostic. For the most part, you determine when these components are faulty by eliminating other possibilities. Some of these FRUs are listed in TABLE 2-5, along with hints on how to discern problems with them.
This is difficult to distinguish from other problems with similar symptoms. The firmware generates many error messages about being unable to access OpenBoot configuration variables, for example: Could not read diag-level from NVRAM! ALOM shows the front panel Service Required indicator is lit.
If ALOM is able to read the system rotary switch position, but reports that none of the fans are spinning, you should suspect that this cable is loose or defective.
If OpenBoot Diagnostics tests indicate a problem with the DVD drive, but replacing the drive does not fix the problem, you should suspect (primarily) that this cable is either defective or improperly connected, or (secondarily) that there is a problem with the motherboard.
Though not an exhaustive diagnostic, some SunVTS tests (i2c2test and disktest) exercise certain SCSI backplane paths. You can also monitor the backplane's ambient temperature using the ALOM system controller showenvironment command (see Monitoring the System Using Sun Advanced Lights Out Manager).
This is difficult to distinguish from problems with similar symptoms. The firmware generates many error messages about being unable to access OpenBoot configuration variables, for example: Could not read diag-level from NVRAM! ALOM shows the front panel Service Required indicator is lit.
If the system control rotary switch and On/Standby button appear unresponsive, and if the power supplies are known to be good, you should suspect the SCC reader and its cable. To test these components, access ALOM, issue the resetsc command, log in again to ALOM, and remove the system configuration card. If an alert message appears ("SCC card has been removed"), it means the card reader is functioning and the cable is intact.
If the system control rotary switch appears unresponsive (ALOM cannot read rotary switch position), but the Power button works and the system stays powered on, you should suspect either that this cable is loose or defective, or (less likely) that there is a problem with the system configuration card reader.
Note - Most replacement cables for the Netra 440 server are available only as part of a cable kit, Sun part number F595-7286.
Sun provides the Sun Advanced Lights Out Manager (ALOM) tool that can give you advance warning of difficulties and prevent future downtime.
This monitoring tool lets you specify system criteria that bear watching. For instance, you can enable alerts for system events (such as excessive temperatures, power supply or fan failures, system resets), and be notified if those events occur. Warnings can be reported by icons in the software's graphical user interface, or you can be notified by email whenever a problem occurs.
Advanced Lights Out Manager (ALOM) enables you to monitor and control your server over a serial port or a network interface. The ALOM system controller provides a command-line interface that enables you to administer the server from remote locations. This may be especially useful when servers are geographically distributed or physically inaccessible.
ALOM also lets you remotely access the system console and run diagnostics (like POST) that would otherwise require physical proximity to the server's serial port. ALOM can send email notification of hardware failures or other server events.
The ALOM system controller runs independently, and uses standby power from the server. Therefore, ALOM firmware and software continue to be effective when the server operating system goes offline, or when power to the server itself is turned off.
TABLE 2-6 lists the items that ALOM enables you to monitor on the Netra 440 server.
For instructions on using ALOM to monitor a Netra 440 system, see Monitoring the System Using Sun Advanced Lights Out Manager.
It is relatively easy to detect when a system component fails outright. However, when a system has an intermittent problem or seems to be "behaving strangely," a software tool that stresses or exercises the computer's many subsystems can help disclose the source of the emerging problem and prevent long periods of reduced functionality or system downtime.
Sun provides two tools for exercising Netra 440 servers:
TABLE 2-7 shows the FRUs that each system exercising tool is capable of isolating. Note that individual tools do not necessarily test all the components or paths of a particular FRU.
No coverage. See TABLE 2-5 for fault isolation hints. |
||
No coverage. See TABLE 2-8 for fault isolation hints. |
||
No coverage. See TABLE 2-8 for fault isolation hints. |
||
No coverage. See TABLE 2-5 for fault isolation hints. |
||
Some FRUs are not isolated by any system exercising tool.
See TABLE 2-5. |
|
See TABLE 2-5. |
|
If this FRU fails, ALOM issues an alert message:
|
|
If this FRU fails, ALOM issues an alert message:
|
|
See TABLE 2-5. |
|
See TABLE 2-5. |
The SunVTS software validation test suite performs system and subsystem stress testing. You can view and control a SunVTS session over a network. Using a remote machine, you can view the progress of a testing session, change testing options, and control all testing features of another machine on the network.
You can run SunVTS software in five different test modes:
Since SunVTS software can run many tests in parallel and can consume many system resources, you should take care when using it on a production system. If you are stress-testing a system using SunVTS software's Comprehensive test mode, you should not run anything else on that system at the same time.
The Netra 440 server to be tested must be up and running if you want to use SunVTS software, since it relies on the Solaris OS. Since SunVTS software packages are optional, they may not be installed on your system. Turn to Checking Whether SunVTS Software Is Installed for instructions.
It is important to use the most up-to-date version of SunVTS available, to ensure that you have the latest suite of tests. You can download the most recent SunVTS software from http://www.sun.com/oem/products/vts/.
For instructions on running SunVTS software to exercise the Netra 440 server, see Exercising the System Using SunVTS Software. For more information about the product, refer to:
These documents are available on the Solaris Supplement CD and on the Web at: http://www.sun.com/documentation. You should also consult the SunVTS README file located at /opt/SUNWvts/. This document provides late-breaking information about the installed version of the product.
During SunVTS software installation, you must choose between Basic and Sun Enterprise Authentication Mechanism (SEAM) security. Basic security uses a local security file in the SunVTS installation directory to limit the users, groups, and hosts permitted to use SunVTS software. SEAM security is based on Kerberos--the standard network authentication protocol--and provides secure user authentication, data integrity, and privacy for transactions over networks.
If your site uses SEAM security, you must have the SEAM client and server software installed in your networked environment and configured properly in both Solaris and SunVTS software. If your site does not use SEAM security, do not choose the SEAM option during SunVTS software installation.
If you enable the wrong security scheme during installation, or if you improperly configure the security scheme you chose, you may find yourself unable to run SunVTS tests. For more information, refer to the SunVTS User's Guide and the instructions accompanying the SEAM software.
System firmware, including POST, has multiple ways of referring to memory. In most cases, such as when running tests or displaying configuration information, firmware refers to memory "banks." These are logical and not physical banks (see CODE EXAMPLE 2-19).
However, in POST error output (see CODE EXAMPLE 2-20), the firmware provides a memory slot identifier (B0/D1 J0602). Note that B0/D1 identifies the memory slot and is visible on the circuit board when the DIMM is installed. The label J0602 also identifies the memory slot, but is not visible unless you remove the DIMM from the slot.
1>H/W under test = CPU3 B0/D1 J0602 side 1 (Bank 1), CPU Module C3
Adding to the potential confusion, when configuring system memory, you must also contend with the separate notion of physical memory banks: DIMMs must be installed as pairs of the same capacity and type within each physical bank.
The following sections clarify how memory is identified.
Each CPU/memory module's circuit board contains silk-screened labels that uniquely identify every DIMM on that board. Each label is in this form:
Bx/Dy
Where x indicates the physical bank, and y the DIMM number within the bank.
In addition, a "J" number silk-screened on the circuit board uniquely identifies each DIMM slot. However, this slot number is not readily visible unless the DIMM is removed from the slot.
If you run POST and it finds a memory error, the error message will include the physical ID of the failed DIMM and the "J" number of the failed DIMM's slot, making it easy to determine which parts you need to replace.
Note - To ensure compatibility and maximize system uptime, you should replace DIMMs in pairs. Treat both DIMMs in a physical bank as one FRU.
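The pairing rule can be sketched in a few lines of Python. This is an illustrative helper, not a Sun tool: it assumes labels follow the pattern seen in the POST output (for example, B0/D1) and that each physical bank holds exactly two DIMMs numbered 0 and 1. The function names are hypothetical.

```python
import re

# Assumed label shape, based on the B0/D1 label shown in the POST output:
# "B<physical bank>/D<DIMM number within the bank>".
LABEL_RE = re.compile(r"^B(\d+)/D(\d+)$")


def parse_dimm_label(label):
    """Return (physical_bank, dimm_number) for a label such as 'B0/D1'."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"not a DIMM label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def pair_partner(label):
    """Return the label of the other DIMM in the same physical bank.

    Assumption: a physical bank is a pair of DIMMs numbered 0 and 1.
    """
    bank, dimm = parse_dimm_label(label)
    if dimm not in (0, 1):
        raise ValueError(f"unexpected DIMM number in {label!r}")
    return f"B{bank}/D{1 - dimm}"


print(pair_partner("B0/D1"))  # B0/D0
```

Under these assumptions, a failure reported on B0/D1 means that B0/D0 on the same CPU/memory module is replaced along with it.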
Logical banks reflect the system's internal memory architecture and not the architecture of the system's field-replaceable units. In the Netra 440 server, each logical bank spans two physical DIMMs. Since firmware-generated status messages refer only to logical banks, it is not possible to use these status messages to isolate a memory problem to a single failed DIMM. POST error messages, on the other hand, specify failures to the FRU level.
Note - To isolate faults in the memory subsystem, run POST diagnostics.
TABLE 2-9 shows the logical-to-physical memory bank mapping for the Netra 440 server.
FIGURE 2-4 depicts the same mapping graphically.
Since each CPU/memory module has its own set of DIMMs, you need to determine the CPU/memory module in which a faulty DIMM resides. This information is given in the POST error message:
1>H/W under test = CPU3 B0/D1 J0602 side 1 (Bank 1), CPU Module C3
In this example, the cited module is CPU Module C3.
The processors are numbered according to the slot in which they are installed, and these slots are numbered 0 to 3, left to right, as you look down on the Netra 440 server's chassis from the front (see FIGURE 2-5).
For example, if a Netra 440 server has only two CPU/memory modules installed, and if those are located in the leftmost and rightmost slots, then the firmware will refer to the two system processors as CPU 0 and CPU 3.
The failed DIMM called out by the previous POST error message, then, resides in the rightmost CPU/memory module (C3), and is labeled B0/D1 on that module's circuit board.
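If you collect POST output in a log, the same reading can be done mechanically. The following Python sketch is an assumption-laden illustration: the pattern is modeled only on the single sample line shown in this chapter, other POST messages may be laid out differently, and the function and field names are hypothetical.

```python
import re

# Pattern modeled on the one sample line in this chapter; treat it as a
# starting point, not a complete description of POST output.
POST_HW_RE = re.compile(
    r"H/W under test = CPU(?P<cpu>\d+) "
    r"(?P<dimm_label>B\d+/D\d+) "
    r"(?P<slot_id>J\d+) "
    r"side (?P<side>\d+) \(Bank (?P<bank>\d+)\), "
    r"CPU Module C(?P<module>[0-3])\b"
)

# Module slots are numbered 0 to 3, left to right, viewed from the front.
SLOT_POSITIONS = ("leftmost", "second from left", "second from right", "rightmost")


def parse_post_hw_line(line):
    """Return the fields of a 'H/W under test' line, or None if it does not match."""
    match = POST_HW_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("cpu", "side", "bank", "module"):
        fields[key] = int(fields[key])  # numbers exactly as printed by POST
    fields["module_position"] = SLOT_POSITIONS[fields["module"]]
    return fields


sample = "1>H/W under test = CPU3 B0/D1 J0602 side 1 (Bank 1), CPU Module C3"
info = parse_post_hw_line(sample)
print(info["dimm_label"], info["slot_id"], info["module_position"])
# B0/D1 J0602 rightmost
```

The sketch returns None for lines it does not recognize, so it can be run over a whole console capture without raising on unrelated messages.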
This section describes the OpenBoot Diagnostics tests and commands available to you. For background information about these tests, see OpenBoot Diagnostics Tests.
TABLE 2-11 describes the commands you can type from the obdiag> prompt.
Exits OpenBoot Diagnostics tests and returns to the ok prompt.
Displays a brief description of each OpenBoot Diagnostics command and OpenBoot configuration variable.
Restores the default value of an OpenBoot configuration variable.
Sets the value for an OpenBoot configuration variable (also available from the ok prompt).
Tests all devices displayed in the OpenBoot Diagnostics test menu (also available from the ok prompt).
Tests only the device identified by the menu entry number. (A similar function is available from the ok prompt. See From the ok Prompt: The test and test-all Commands.)
Tests only the devices identified by the menu entry numbers.
Tests all devices in the OpenBoot Diagnostics test menu except those identified by the menu entry numbers.
Displays selected properties of the devices identified by the menu entry numbers. The information provided varies according to device type.
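The difference between testing the listed menu entries and testing everything except the listed entries can be modeled as a simple set selection. The Python sketch below only illustrates that logic. It does not talk to the firmware, and the menu contents and function name are hypothetical.

```python
def select_entries(menu, numbers, exclude=False):
    """Return the menu entry numbers that would be tested, in menu order.

    menu    -- dict mapping menu entry number to a device name (hypothetical data)
    numbers -- the entry numbers typed by the operator
    exclude -- False: test only the listed entries
               True:  test every entry except the listed ones
    """
    chosen = set(numbers)
    unknown = chosen - set(menu)
    if unknown:
        raise ValueError(f"no such menu entries: {sorted(unknown)}")
    return [n for n in sorted(menu) if (n in chosen) != exclude]


# Hypothetical four-entry test menu, for illustration only.
menu = {1: "device-a", 2: "device-b", 3: "device-c", 4: "device-d"}
print(select_entries(menu, [1, 3]))                # [1, 3]
print(select_entries(menu, [1, 3], exclude=True))  # [2, 4]
```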
TABLE 2-12 describes each I2C device in a Netra 440 server, and helps you associate each I2C address with the proper FRU. For more information about I2C tests, see I2C Bus Device Tests.
Indicates disk status and drives fault and Ok-to-Remove indicators
The status and error messages displayed by POST diagnostics and OpenBoot Diagnostics tests occasionally include acronyms or abbreviations for hardware subcomponents. TABLE 2-13 is included to assist you in decoding this terminology and associating the terms with specific FRUs, where appropriate.
Advanced Power Control - A function provided by the Southbridge integrated circuit
A repeater circuit element that forms part of the system bus
Direct Memory Access - In diagnostic output, usually refers to a controller on a PCI card
Inter-Integrated Circuit (also written as I2C) - A bidirectional, two-wire serial data bus. Used mainly for environmental monitoring and control (FRU: various, see TABLE 2-12)
System bus to PCI bridge integrated circuit (same as "Tomatillo")
The system interconnect architecture, that is, the data and address buses
Joint Test Action Group - An IEEE subcommittee standard (1149.1) for scanning system components
Media Access Controller - Hardware address of a device connected to a network
Media Independent Interface - Part of the Ethernet controller
A means for monitoring and altering the content of ASICs and system components, as provided for in the IEEE 1149.1 standard
Integrated circuit that controls the ALOM UART port and more
Universal Asynchronous Receiver Transmitter - Serial port hardware
Update-ended Interrupt Enable - A function provided by the real-time clock
Copyright © 2004, Sun Microsystems, Inc. All rights reserved.