CHAPTER 10
CPU/Memory Board Replacement and Dynamic Reconfiguration (DR)
This chapter describes how to dynamically reconfigure the CPU/Memory boards on the Sun Fire entry-level midrange systems.
DR software is part of the Solaris operating environment. With the DR software you can dynamically reconfigure system boards and safely remove them or install them into a system while the Solaris operating environment is running and with minimum disruption to user processes running on the system. You can use DR to do the following:
The Solaris cfgadm(1M) command provides the command line interface for the administration of DR functionality.
During the unconfigure operation on a system board with permanent memory (OpenBoot PROM or kernel memory), the operating environment is briefly paused, which is known as operating environment quiescence. All operating environment and device activity on the baseplane must cease during a critical phase of the operation.
Note - Quiescence may take several minutes, depending on workload and system configuration.
Before it can achieve quiescence, the operating environment must temporarily suspend all processes, CPUs, and device activities. It may take a few minutes to achieve quiescence depending on system usage and activities currently in progress. If the operating environment cannot achieve quiescence, it displays the reasons, which may include the following:
The conditions that cause processes to fail to suspend are generally temporary. Examine the reasons for the failure. If the operating environment encountered a transient condition, such as a failure to suspend a process, you can try the operation again.
Time-outs occur by default after two minutes. Administrators may need to increase this time-out value to avoid time-outs during a DR-induced operating system quiescence, which may take longer than two minutes. Quiescing a system makes the system and related network services unavailable for a period of time that can exceed two minutes. These changes affect both the client and server machines.
When DR suspends the operating environment, all of the device drivers that are attached to the operating environment must also be suspended. If a driver cannot be suspended (or subsequently resumed), the DR operation fails.
A suspend-safe device does not access memory or interrupt the system while the operating environment is in quiescence. A driver is suspend-safe if it supports operating environment quiescence (suspend/resume). A suspend-safe driver also guarantees that when a suspend request is successfully completed, the device that the driver manages will not attempt to access memory, even if the device is open when the suspend request is made.
A suspend-unsafe device allows a memory access or a system interruption to occur while the operating environment is in quiescence.
An attachment point is a collective term for a board and its slot. DR can display the status of the slot, the board, and the attachment point. The DR definition of a board also includes the devices connected to it, so the term "occupant" refers to the combination of board and attached devices.
There are two formats used when referring to attachment points:
x is a slot number. A slot number can be 0, 2 or 4 for a system board.
There are four main types of DR operation.
If a system board is in use, stop its use and disconnect it from the system before you power it off. After a new or upgraded system board is inserted and powered on, connect its attachment point and configure it for use by the operating environment. The cfgadm(1M) command can connect and configure (or unconfigure and disconnect) in a single command, but if necessary, each operation (connection, configuration, unconfiguration, or disconnection) can be performed separately.
Hot-plug devices have special connectors that supply electrical power to the board or module before the data pins make contact. Boards and devices that have hot-plug connectors can be inserted or removed while the system is running. The devices have control circuits to ensure they have a common reference and power control during the insertion process. The interfaces are not powered on until the board is home and the System Controller instructs them to.
The CPU/Memory boards used in the Sun Fire entry-level midrange systems are hot-plug devices.
A state is the operational status of either a receptacle (slot) or an occupant (board). A condition is the operational status of an attachment point.
Before you attempt to perform any DR operation on a board or component, you must determine its state and condition. Use the cfgadm(1M) command with the -la options to display the type, state, and condition of each component and the state and condition of each board slot in the system. See the section Component Types for a list of the component types.
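As a minimal sketch of that check, the following shell function reads a cfgadm -la style listing on standard input and prints the receptacle state, occupant state, and condition for one attachment point. It assumes the listing has five whitespace-separated columns (Ap_Id, Type, Receptacle, Occupant, Condition). The function name ap_status is hypothetical and is not part of the DR software.

```shell
# ap_status AP_ID
# Reads a `cfgadm -la` listing on stdin and prints
# "receptacle occupant condition" for the matching attachment point.
# Assumed column order: Ap_Id Type Receptacle Occupant Condition.
# Returns non-zero if the attachment point is not listed.
ap_status() {
    awk -v id="$1" '
        $1 == id { print $3, $4, $5; found = 1 }
        END      { exit (found ? 0 : 1) }
    '
}

# Typical use on a live system (as superuser):
#   cfgadm -la | ap_status N0.SB2
```

A board is a candidate for removal only when the reported states match what the intended operation expects, for example connected and configured before an unconfigure.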
This section contains descriptions of the states and conditions of CPU/Memory boards (also known as system slots).
A board can have one of three receptacle states: empty, disconnected, or connected. Whenever you insert a board, the receptacle state changes from empty to disconnected. Whenever you remove a board, the receptacle state changes from disconnected to empty.
A board can have one of two occupant states: configured or unconfigured. The occupant state of a disconnected board is always unconfigured.
A board can be in one of four conditions: unknown, ok, failed, or unusable.
This section contains descriptions of the states and conditions for components.
A component cannot be individually connected or disconnected. Thus, components can have only one state: connected.
A component can have one of two occupant states: configured or unconfigured.
configured - Component is available for use by the Solaris operating environment.
unconfigured - Component is not available for use by the Solaris operating environment.
A component can have one of three conditions: unknown, ok, or failed.
You can use DR to configure or to unconfigure several types of component.
Before you can delete a board, the operating environment must vacate the memory on that board. Vacating a board means flushing its nonpermanent memory to swap space and copying its permanent memory (that is, kernel and OpenBoot PROM memory) to another memory board. To relocate permanent memory, the operating environment must be temporarily suspended, or quiesced. The length of the suspension depends on the system configuration and the running workloads. Detaching a board with permanent memory is the only time when the operating environment is suspended; therefore, you should know where permanent memory resides so that you can avoid significantly impacting the operation of the system. You can display the permanent memory by using the cfgadm(1M) command with the -v option. When permanent memory is on the board, the operating environment must find another memory component of adequate size to receive the permanent memory. If that is not possible, the DR operation fails.
System boards cannot be dynamically reconfigured if system memory is interleaved across multiple CPU/Memory boards.
When a CPU/Memory board containing non-relocatable (permanent) memory is dynamically reconfigured out of the system, a short pause in all domain activity is required, which may delay application response. Typically, this condition applies to one CPU/Memory board in the system. The memory on the board is identified by a non-zero permanent memory size in the status display produced by the cfgadm -av command.
DR supports reconfiguration of permanent memory from one system board to another only if one of the following conditions is met:
The following procedures are discussed in this section:
Note - There is no need to enable dynamic reconfiguration explicitly. DR is enabled by default.
The cfgadm(1M) command provides configuration administration operations on dynamically reconfigurable hardware resources. TABLE 10-8 lists the DR board states.
The cfgadm program displays information about boards and slots. Refer to the cfgadm(1M) man page for options to this command.
Many operations require that you specify the system board names. To obtain these system names, type:
When used without options, cfgadm displays information about all known attachment points, including board slots and SCSI buses. The following display shows a typical output.
For a more detailed status report, use the command cfgadm -av. The -a option lists attachment points and the -v option turns on expanded (verbose) descriptions.
CODE EXAMPLE 10-2 is a partial display produced by the cfgadm -av command. The output appears complicated because the lines wrap around in this display. (This status report is for the same system used in CODE EXAMPLE 10-1.) FIGURE 10-1 provides details of each display item.
FIGURE 10-1 shows details of the display in CODE EXAMPLE 10-2:
The options to the cfgadm -c command are listed in TABLE 10-9.
The options provided by the cfgadm -x command are listed in TABLE 10-10.
The cfgadm_sbd man page provides additional information on the cfgadm -c and cfgadm -x options. The sbd library provides the functionality for hot-plugging system boards of the class sbd, through the cfgadm framework.
Before you can test a CPU/Memory board, it must first be powered on and disconnected. If these conditions are not met, the board test fails.
You can use the Solaris cfgadm command to test CPU/Memory boards. As superuser, type:
To change the level of diagnostics that cfgadm runs, supply a diagnostic level for the cfgadm command as follows:
where level is a diagnostic level, and ap-id is one of the following: N0.SB0, N0.SB2 or N0.SB4.
If you do not supply level, the diagnostic level named default is used. The diagnostic levels are:
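The two forms of the test command can be sketched as a small shell wrapper. The -t option is the standard cfgadm test option; the -o platform=diag=level syntax used below for selecting a diagnostic level is an assumption and should be checked against the cfgadm_sbd(1M) man page on your system. The function name test_board and the CFGADM variable (a hook that lets you substitute echo for a dry run) are hypothetical.

```shell
# CFGADM is a hypothetical override hook; set CFGADM=echo for a dry run.
: "${CFGADM:=cfgadm}"

# test_board AP_ID [LEVEL]
# Tests a CPU/Memory board that is powered on and disconnected.
test_board() {
    case "$1" in
        N0.SB0|N0.SB2|N0.SB4) ;;
        *) echo "test_board: unknown attachment point: $1" >&2
           return 2 ;;
    esac
    if [ -n "${2:-}" ]; then
        # Assumed option syntax for choosing a diagnostic level.
        "$CFGADM" -o "platform=diag=$2" -t "$1"
    else
        # No level supplied: the level named default applies.
        "$CFGADM" -t "$1"
    fi
}
```

For example, with the function loaded in a superuser shell, test_board N0.SB2 runs the test at the default level.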
Caution - Physical board replacement should only be carried out by qualified service personnel.
Note - When replacing boards, you sometimes need filler panels.
If you are unfamiliar with how to insert a board into the system, read the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate, before you begin this procedure.
1. Make sure you are properly grounded with a wrist strap.
2. After locating an empty slot, remove the system board filler panel from the slot.
3. Insert the board into the slot within one minute to prevent the system from overheating.
Refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate, for complete step-by-step board insertion procedures.
4. Power on, test, and configure the board using the cfgadm -c configure command:
where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.
1. Make sure you are properly grounded using a wrist strap.
2. Power off the board with cfgadm.
where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.
This command removes the resources from the Solaris operating environment and the OpenBoot PROM, and powers off the board.
3. Verify the state of the Power and Hotplug OK LEDs.
The green Power LED flashes briefly while the CPU/Memory board is cooling down. To safely remove the board from the system, the green Power LED must be off and the amber Hotplug OK LED must be on.
4. Complete the hardware removal and installation of the board.
For more information refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate.
5. After removing and installing the board, bring the board back into the Solaris operating environment with the Solaris dynamic reconfiguration cfgadm command.
where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.
This command powers the board on, tests it, attaches the board, and brings all of its resources back to the Solaris operating environment.
6. Verify that the green Power LED is lit.
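The two software steps of this procedure (steps 2 and 5) can be sketched as shell helpers. This is an illustrative outline and not a replacement for the procedure: the LED checks and the physical swap still happen between the two calls. The function names and the CFGADM dry-run hook are hypothetical.

```shell
# CFGADM is a hypothetical override hook; set CFGADM=echo for a dry run.
: "${CFGADM:=cfgadm}"

# Accept only the CPU/Memory board attachment points named in this chapter.
valid_ap_id() {
    case "$1" in
        N0.SB0|N0.SB2|N0.SB4) return 0 ;;
        *) echo "unknown attachment point: $1" >&2; return 1 ;;
    esac
}

# Step 2: remove the board's resources from the Solaris operating
# environment and power the board off.
board_power_off() {
    valid_ap_id "$1" && "$CFGADM" -c disconnect "$1"
}

# Step 5: power on, test, attach, and configure the replacement board.
board_bring_back() {
    valid_ap_id "$1" && "$CFGADM" -c configure "$1"
}
```

A session would call board_power_off N0.SB2, wait for the LED states described in step 3, swap the hardware, and then call board_bring_back N0.SB2.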
1. Detach and power off the board from the system by using the cfgadm -c disconnect command.
where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.
2. Remove the board from the system.
Refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate, for complete step-by-step board removal procedures.
3. Insert a system board filler panel into the slot within one minute of removing the board to prevent system overheating.
You can use DR to power down the board and leave it in place. For example, you might want to do this if the board fails and a replacement board or a system board filler panel is not available.
Detach and power off the board using the cfgadm -c disconnect command.
where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.
This section discusses common types of failure:
The following are examples of cfgadm diagnostic messages. (Syntax error messages are not included here.)
See the following man pages for additional error message detail: cfgadm(1M), cfgadm_sbd(1M), and config_admin(3X).
An unconfigure operation for a CPU/Memory board can fail if the system is not in a correct state before you begin the operation.
If you try to unconfigure a system board whose memory is interleaved across system boards, the system displays an error message such as:
cfgadm: Hardware specific failure: unconfigure N0.SB2::memory: Memory is interleaved across boards: /ssm@0,0/memory-controller@b,400000
If you try to unconfigure a CPU to which a process is bound, the system displays an error message such as the following:
cfgadm: Hardware specific failure: unconfigure N0.SB2::cpu3: Failed to off-line: /ssm@0,0/SUNW,UltraSPARC-III
Unbind the process from the CPU and retry the unconfigure operation.
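A minimal sketch of that recovery, assuming the Solaris pbind(1M) command is used to manage processor bindings: pbind -q lists current bindings, and pbind -u removes the binding for a process. The function name and the CFGADM and PBIND dry-run hooks are hypothetical; the component attachment point follows the N0.SB2::cpu3 form shown in the error message.

```shell
# CFGADM and PBIND are hypothetical override hooks for dry runs.
: "${CFGADM:=cfgadm}"
: "${PBIND:=pbind}"

# unbind_and_unconfigure CPU_AP_ID PID...
# Removes the processor binding of each listed process, then retries
# the unconfigure operation on the CPU component.
unbind_and_unconfigure() {
    cpu_ap_id=$1
    shift
    for pid in "$@"; do
        "$PBIND" -u "$pid" || return 1
    done
    "$CFGADM" -c unconfigure "$cpu_ap_id"
}

# To find bound processes first (as superuser):
#   pbind -q
```

For example, unbind_and_unconfigure N0.SB2::cpu3 1234 unbinds process 1234 (an example process ID) and then retries the unconfigure.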
All memory on a system board must be unconfigured before you try to unconfigure a CPU. If you try to unconfigure a CPU before all memory on the board is unconfigured, the system displays an error message such as:
cfgadm: Hardware specific failure: unconfigure N0.SB2::cpu0: Can't unconfig cpu if mem online: /ssm@0,0/memory-controller
Unconfigure all memory on the board and then unconfigure the CPU.
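The required ordering can be sketched as follows: the memory component is unconfigured first, and CPU components only afterwards. The component names follow the forms that appear in the error messages above (N0.SB2::memory, N0.SB2::cpu0); which CPU components exist on a given board is something to confirm from the cfgadm listing. The function name and the CFGADM dry-run hook are hypothetical.

```shell
# CFGADM is a hypothetical override hook; set CFGADM=echo for a dry run.
: "${CFGADM:=cfgadm}"

# unconfigure_components BOARD_AP_ID CPU_NAME...
# Unconfigures the board's memory first, then each named CPU.
# Stops at the first failure so that no CPU is attempted while
# memory is still online.
unconfigure_components() {
    board=$1
    shift
    "$CFGADM" -c unconfigure "${board}::memory" || return 1
    for cpu in "$@"; do
        "$CFGADM" -c unconfigure "${board}::${cpu}" || return 1
    done
}
```

For example, unconfigure_components N0.SB2 cpu0 cpu1 issues the memory unconfigure before touching either CPU.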
To unconfigure the memory on a board that has permanent memory, move the permanent memory pages to another board that has enough available memory to hold them. Such an additional board must be available before the unconfigure operation begins.
If the unconfigure operation fails with a message such as the following, the memory on the board could not be unconfigured:
cfgadm: Hardware specific failure: unconfigure N0.SB0: No available memory target: /ssm@0,0/memory-controller@3,400000
Add to another board enough memory to hold the permanent memory pages, and then retry the unconfigure operation.
To confirm that a memory page cannot be moved, use the verbose option with the cfgadm command and look for the word permanent in the listing:
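One way to scan the verbose listing is sketched below. The function reads cfgadm -av output on standard input and prints every attachment point that reports a non-zero permanent memory size. It assumes the size appears on the component's line in the form "n KBytes permanent" and that the attachment point ID is the first field on that line; both are assumptions about the display format, and the function name permanent_memory is hypothetical.

```shell
# permanent_memory
# Reads `cfgadm -av` output on stdin and prints
# "AP_ID SIZE KBytes" for each line that reports a non-zero
# permanent memory size.
# Assumed format on the component line: "<n> KBytes permanent".
permanent_memory() {
    awk '
        {
            for (i = 3; i <= NF; i++) {
                word = $i
                sub(/,$/, "", word)      # tolerate a trailing comma
                if (word == "permanent" && $(i-1) == "KBytes" && $(i-2) + 0 > 0)
                    print $1, $(i-2), "KBytes"
            }
        }
    '
}

# Typical use on a live system (as superuser):
#   cfgadm -av | permanent_memory
```

Boards that this function lists are the ones whose removal requires the operating environment to be quiesced.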
If the unconfigure fails with one of the messages below, there will not be enough available memory in the system if the board is removed:
Reduce the memory load on the system and try again. If practical, install more memory in another board slot.
If the unconfigure fails with the following message, the memory demand has increased while the unconfigure operation was proceeding:
Reduce the memory load on the system and try again.
CPU unconfiguration is part of the unconfiguration operation for a CPU/Memory board. If the operation fails to take the CPU offline, the following message is logged to the console:
It is possible to unconfigure a board and then discover that it cannot be disconnected. The cfgadm status display lists the board as not detachable. This problem occurs when the board is supplying an essential hardware service that cannot be relocated to an alternate board.
Before you try to configure either CPU0 or CPU1, make sure that the other CPU is unconfigured. Once both CPU0 and CPU1 are unconfigured, it is then possible to configure both of them.
Before configuring memory, all CPUs on the system board must be configured. If you try to configure memory while one or more CPUs are unconfigured, the system displays an error message such as:
cfgadm: Hardware specific failure: configure N0.SB2::memory: Can't config memory if not all cpus are online: /ssm@0,0/memory-controller
Copyright © 2004, Sun Microsystems, Inc. All rights reserved.