CHAPTER 3
Advanced System Management
Advanced System Monitoring (ASM) is an intelligent fault detection system that increases uptime and manageability of the board. The System Management Controller (SMC) module on the Netra CP2000/CP2100 series supports the temperature monitoring functions of ASM. This chapter describes the specific ASM functions of the Netra CP2000/CP2100 series. This chapter includes the following sections:
TABLE 3-1 lists the compatible ASM hardware, OpenBoot PROM, and Solaris operating environment for the Netra CP2000/CP2100 series.
Solaris 8 2/02 operating environment or subsequent compatible versions, with one of the following CD supplements:
FIGURE 3-1 illustrates the Netra CP2000/CP2100 series ASM application block diagram.
FIGURE 3-1 is a typical Netra CP2000/CP2100 series system application block diagram. For locations of the temperature sensors, see FIGURE 3-2, FIGURE 3-3 and FIGURE 3-4.
The Netra CP2000/CP2100 series functions as a system controller board or as a satellite board in a CompactPCI system rack. The Netra CP2000/CP2100 series board monitors its CPU-vicinity temperature and issues warnings at both the OpenBoot PROM and Solaris operating environment levels when these environmental readings are out of limits. At the Solaris operating environment level, the application program monitors and issues warnings for the system controller and the satellite board. In both the host and satellite modes of operation, the OpenBoot PROM monitors the CPU-vicinity temperature if the NVRAM variable env-monitor is enabled.
This section describes a typical ASM cycle from power up to shutdown.
The OpenBoot PROM monitors the CPU-vicinity temperature at a fixed polling rate of once every 10 seconds (the env-mon-interval parameter). It displays a warning message on the default output device whenever the measured temperature exceeds either of two configurable NVRAM variables: the warning temperature (the warning-temperature parameter) or the shutdown temperature (the shutdown-temperature parameter). See OpenBoot PROM Environmental Parameters for information on changing these pre-programmed parameters.
The OpenBoot PROM cannot shut down power to the Netra CP2000/CP2100 series board. The shutdown temperature message is only a warning message to the user that the Netra CP2000/CP2100 series board is overheating and needs to be shut down immediately by external means.
OpenBoot PROM-level protection takes place only when the env-monitor parameter is enabled (it is not the default setting). Disabling env-monitor completely disables ASM protection at the OpenBoot PROM level but does not affect ASM protection at the Solaris operating environment level.
Note - To protect the system at the OpenBoot PROM level, keep env-monitor enabled at all times.
Monitoring changes in the ASM temperatures can be a useful tool for identifying problems with the room where the system is installed, functional problems with the system, or problems on the board. Baseline temperatures established early in deployment and operation can be used to trigger alarms if the temperatures from the sensors increase or decrease dramatically. If all the sensors go to room ambient, power has probably been lost to the host system. If one or more sensors rise in temperature substantially, there may be a system fan malfunction, the system cooling may have been compromised, or room air conditioning may have failed.
When the application program opens the system controller device and pushes the ASM streams module, the ASM module is loaded.
To access the CPU-vicinity temperature measurements at the Solaris operating environment level, use the ioctl system call in an application program. To specify the ASM polling rate, use the sleep system call.
Protection at the operating environment level takes place only when the ASM application program is running, which is initiated by the end user. Failure to run the ASM application program completely disables ASM protection at the Solaris level but does not affect ASM protection at the OpenBoot PROM level. Keep the ASM application program running at all times.
In a typical ASM application program, the software reads the following temperature sensors once every polling cycle:
The program then compares the measured CPU-vicinity temperature with the warning temperature and displays a warning message on the default output device whenever the warning temperature is exceeded.
The program can also issue a shutdown message on the default output device whenever the measured CPU-vicinity temperature exceeds the shutdown temperature. In addition, the ASM application program can be programmed to sync and shut down the Solaris operating environment when conditions warrant.
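The comparison described above can be sketched as a small C helper. This is a minimal illustration only: the names asm_level and asm_classify are hypothetical and are not part of the ASM driver interface, and the thresholds are passed in by the caller rather than read from any ASM parameter.

```c
#include <assert.h>

/* Hypothetical names for illustration; not part of the ASM driver interface. */
enum asm_level { ASM_NORMAL, ASM_WARNING, ASM_SHUTDOWN };

/*
 * Classify one CPU-vicinity reading against the two thresholds.
 * A threshold counts as crossed only when the reading exceeds it.
 */
enum asm_level asm_classify(int temp_c, int warning_c, int shutdown_c)
{
    assert(warning_c <= shutdown_c);    /* thresholds must be ordered */

    if (temp_c > shutdown_c)
        return ASM_SHUTDOWN;
    if (temp_c > warning_c)
        return ASM_WARNING;
    return ASM_NORMAL;
}
```

A monitoring program could call such a helper once per polling cycle and print a warning or shutdown message according to the returned level.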
The use of system calls to access the ASM device driver at the Solaris level enables OEMs to implement their own monitoring, warning, and shutdown policies through a high-level programming language such as the C programming language. An OEM can log and analyze the environmental data for trends (such as drift rate or sudden changes in average readings). Or, an OEM can communicate the occurrence of an unusual condition to a specialized management network using the Netra CP2000/CP2100 series board Ethernet port.
Refer to Sample Application Program for an example of how a simple ASM monitoring program can be implemented.
The power module is controlled by the SMC subsystem (except for automatic controls such as overcurrent shutdown or voltage regulation). The functions controlled are core voltage output level and module on/off state.
The onboard voltage controller is a hardware function that is not controlled by either firmware or software. At the OpenBoot PROM level, there is no mechanism for the OpenBoot PROM to either remove or restore power to the Netra CP2000/CP2100 series board when the CPU-vicinity temperature exceeds its maximum recommended level.
There is no mechanism for the Solaris operating environment to either recover or restore power to the Netra CP2000/CP2100 series board when an unusual condition occurs (for example, if the CPU-vicinity temperature exceeds its maximum recommended level). In either case, the end user must intervene and manually recover the Netra CP2000/CP2100 series board as well as the CompactPCI system through hardware control.
This section summarizes the hardware ASM features on the Netra CP2000/CP2100 series board. TABLE 3-2 lists the ASM functions and shows the location of the ASM hardware on a typical Netra CP2060 board. TABLE 3-3 shows the same information for the Netra CP2160 board.
Note that in TABLE 3-2 and TABLE 3-3 the readings for the SDRAM modules show the sensor readings as currently unavailable because the tables list information of a typical Netra board that does not support memory modules.
SDRAM module#1 Temperature (for Netra boards with memory modules) |
Sensor reading is currently unavailable[1] |
SDRAM module#2 Temperature (for Netra boards with memory modules) |
|
SDRAM module #1 temperature (for Netra boards with memory modules) |
Sensor reading is currently unavailable[2] |
General I/O[3] |
|
Power module[4] |
|
FIGURE 3-2, FIGURE 3-3, FIGURE 3-4 and FIGURE 3-5 show the location of the ASM hardware on the Netra CP2000/CP2100 series boards.
FIGURE 3-6 is a block diagram of the ASM functions.
The Netra CP2040/CP2060/CP2080/CP2140 boards use a MAX1617 temperature sensor located near the CPU underneath its heat sink. The Netra CP2160 board does not have this temperature sensor.
The onboard voltage controller allows power to the rest of the Netra CP2000/CP2100 series board only when the following conditions are met:
The controller requires these conditions to be true for at least 100 milliseconds to help ensure the supply voltages are stable. If any of these conditions become untrue, the voltage monitoring circuit shuts down the power of the board.
The inlet board temperature sensor can be used to ensure that the maximum allowable short-term system-level air inlet temperature is not exceeded. The sensor can also be used to monitor potential issues with the system or installation, because the inlet temperature for the Netra CP2160 board should be kept low to meet installation reliability requirements.
The two exhaust temperature sensors can be used to ensure that the proper airflow across the board is being maintained. The difference between the inlet air temperature and the exhaust temperatures can be monitored to determine if system filters need servicing, if air movers have failed, or if an electrical problem has occurred because components on the board are drawing too much power.
During normal operation of the Netra CP2160 board, any sudden, sustained, or substantial changes in the delta temperature across the board can be used to alert service personnel to a potential system or board service issue.
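One way an application might track the delta temperature is sketched below. The function names, the choice of the hotter exhaust sensor, and the tolerance value are all assumptions made for illustration. The sketch only flags a substantial change from a baseline and does not attempt to detect sudden or sustained changes over time.

```c
#include <assert.h>
#include <stdlib.h>

/* Delta across the board: hotter of the two exhaust readings minus inlet. */
int asm_board_delta(int inlet_c, int exhaust1_c, int exhaust2_c)
{
    int hottest = (exhaust1_c > exhaust2_c) ? exhaust1_c : exhaust2_c;
    return hottest - inlet_c;
}

/*
 * Returns 1 when the current delta differs from the baseline delta by
 * more than tolerance_c degrees, otherwise 0. The baseline and the
 * tolerance are chosen by the application (hypothetical values).
 */
int asm_delta_changed(int delta_c, int baseline_delta_c, int tolerance_c)
{
    assert(tolerance_c >= 0);
    return abs(delta_c - baseline_delta_c) > tolerance_c;
}
```

For example, with an inlet reading of 25, exhaust readings of 38 and 41, and a baseline delta of 15, the current delta is 16 and a tolerance of 5 would not raise an alert.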
The CPU sensor temperature can be used to prevent damage to the board by shutting the board down if this sensor exceeds predetermined limits.
The Netra CP2000/CP2100 board uses the Advanced System Monitoring (ASM) detection system to monitor the temperature of the board. The ASM system will display messages if the board temperature exceeds the set warning and shutdown settings. Because the on-board sensors may report different temperature readings for different system configurations and airflows, you may want to adjust the warning and shutdown temperature parameter settings.
The CP2000/CP2100 board determines the board temperature by retrieving temperature data from sensors located on the board. A board sensor reads the temperature of the immediate area around the sensor. Although the software may appear to report the temperature of a specific hardware component, the software is actually reporting the temperature of the area near the sensor. For example, the CPU heat sink sensor reads the temperature at the location of the sensor and not on the actual CPU heat sink. The board's OpenBoot PROM collects the temperature readings from each board sensor at regular intervals. You can display these temperature readings using the show-sensors OpenBoot PROM command. See show-sensors Command at OpenBoot PROM.
The temperature read by the CPU heat sink sensor will trigger OpenBoot PROM warning and shutdown messages. When the CPU heat sink sensor reads a temperature greater than the warning parameter setting, the OpenBoot PROM will display a warning message. Likewise, when the sensor reads a temperature greater than the shutdown setting, the OpenBoot PROM will display a shutdown message.
Many factors affect the temperature readings of the sensors, including the airflow through the system, the ambient temperature of the room, and the system configuration. These factors may contribute to the sensors reporting different temperature readings than expected.
TABLE 3-5 shows the sensor readings of a typical Netra CP2040 board operating in a Sun server in a room with an ambient temperature of 21°C. The temperature readings were reported using the show-sensors OpenBoot PROM command. Note that the reported temperatures are higher than the ambient room temperature.
Difference Between Reported and Ambient Room Temperature (in Degrees Celsius)
TABLE 3-6 shows the sensor readings of a typical Netra CP2160 board, which has different sensor locations than those on the other Netra CP2000/CP2100 series boards.
Note that the inlet temperature sensor typically does not capture true board inlet temperature due to the heat of nearby components. For typical Netra CP2000/CP2100 series systems, subtract 4°C from the temperature sensor value. Note that the temperature sensor has an accuracy of up to plus or minus 2°C. Users should conduct their own temperature sensor tests to obtain accurate readings.
Difference Between Reported and Ambient Room Temperature (in Degrees Celsius)
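The inlet correction noted above (subtract 4°C, with a sensor accuracy of up to plus or minus 2°C) can be worked through in code. The structure and function names here are hypothetical, and the constants simply restate the typical figures from the text; measured values for a particular system may differ.

```c
/* Hypothetical helper types and names, for illustration only. */
struct inlet_estimate {
    int low_c;      /* lower bound after sensor accuracy   */
    int nominal_c;  /* reported value minus typical offset */
    int high_c;     /* upper bound after sensor accuracy   */
};

#define INLET_OFFSET_C    4  /* typical heating from nearby components */
#define SENSOR_ACCURACY_C 2  /* sensor accuracy, plus or minus         */

/* Estimate the true board inlet temperature from the reported reading. */
struct inlet_estimate asm_estimate_inlet(int reported_c)
{
    struct inlet_estimate e;

    e.nominal_c = reported_c - INLET_OFFSET_C;
    e.low_c     = e.nominal_c - SENSOR_ACCURACY_C;
    e.high_c    = e.nominal_c + SENSOR_ACCURACY_C;
    return e;
}
```

A reported inlet reading of 29°C therefore corresponds to an estimated inlet temperature of 25°C, somewhere in the range 23°C to 27°C.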
Since the temperature reported by the CPU sensor might be different than the actual CPU die temperature, you may want to adjust the settings for both the warning-temperature and shutdown-temperature OpenBoot PROM parameters. The default values of these parameters have been conservatively set at 70°C for the warning temperature and 80°C for the shutdown temperature.
Note - If you have developed an application that uses the ASM software to monitor the temperature sensors, you may want to adjust your application's settings accordingly.
This section describes how to change the OpenBoot PROM environmental monitoring parameters. These global OpenBoot PROM parameters do not apply at the Solaris level. Instead, the ASM application program provides equivalent parameters that do not necessarily have to be set to the same values as their OpenBoot PROM counterparts. Refer to ASM Application Programming for information about using ASM at the Solaris level. The OpenBoot PROM polling rate is at fixed intervals of 10 seconds.
The OpenBoot PROM programs the SMC for temperature monitoring using the sensor commands. TABLE 3-7 lists the default threshold temperature settings for the CP2000/CP2100 series boards.
Default Threshold Temperature Settings for Netra Boards (in Degrees Celsius)
For example, on a Netra CP2160 there are three NVRAM variables that provide different temperature levels. The critical-temperature limit lies between the warning and shutdown thresholds. The default values of these temperature thresholds and the corresponding actions are shown in TABLE 3-8:
Note that there is a lower limit of 50°C on the shutdown-temperature value. If the temperature is set to a value lower than 50°C, the OpenBoot PROM resets it to 50°C in the SMC. However, the OpenBoot PROM does not reset the NVRAM variable shutdown-temperature to 50°C. Therefore, every time the user resets the system, the OpenBoot PROM displays a warning message similar to the message below:
WARNING!!! shutdown-temperature is set too low at 40° C. Setting the threshold at a safer value of 50° C.
This safeguards against a user setting the shutdown-temperature lower than the room temperature and thereby causing the CPU processor and the Netra CP2160 board to be powered off by SMC on the next reset.
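The lower-limit behavior can be expressed as a simple clamp. This is a sketch of the rule as described in the text, not the OpenBoot PROM implementation; the function and macro names are hypothetical.

```c
/* Lower limit on the shutdown threshold, as described in the text. */
#define SHUTDOWN_TEMP_FLOOR_C 50

/*
 * Returns the threshold that would actually be programmed into the SMC
 * for a given NVRAM shutdown-temperature value. The NVRAM value itself
 * is left unchanged, which is why the warning repeats on every reset.
 */
int effective_shutdown_temp(int nvram_value_c)
{
    if (nvram_value_c < SHUTDOWN_TEMP_FLOOR_C)
        return SHUTDOWN_TEMP_FLOOR_C;
    return nvram_value_c;
}
```

With an NVRAM setting of 40, the effective threshold is 50; a setting of 80 is used as is.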
The warning-temperature global OpenBoot PROM parameter determines the temperature at which a warning is displayed. The shutdown-temperature global OpenBoot PROM parameter determines the temperature at which the system is shut down. The temperature monitoring environment variables can be modified at the OpenBoot PROM command level as shown in the examples below:
The critical-temperature variable is a second-level warning temperature with a default value of 75°C. This variable can be modified using the OpenBoot PROM setenv command as shown in the example below:
This section describes the ASM monitoring in the OpenBoot PROM. Please note that the figures in the examples below are for a typical Netra CP2160 board.
The following NVRAM module variables are in OpenBoot PROM for ASM for a typical Netra CP2160 board:
Caution - Exercise caution while setting the above two parameters. Setting these values too high leaves the system unprotected against overheating.
When the CPU-vicinity temperature reaches "warning-temperature," a message similar to the following is displayed at the ok prompt at a regular interval:
Temperature sensor #2 has threshold event of
<<< WARNING!!! Upper Non-critical - going high >>>
The current threshold setting is : 70
The current temperature is : 71
When the CPU-vicinity temperature reaches "critical-temperature," a message similar to the following is displayed at the ok prompt at a regular interval:
Temperature sensor #2 has threshold event of
<<< !!! ALERT!!! Upper Critical - going high >>>
The current threshold setting is : 75
The current temperature is : 76
The show-sensors command at the OpenBoot PROM displays the readings of all the temperature sensors on the board. TABLE 3-9 shows typical sensor readings for a Netra CP2060 board (which would be similar to the Netra CP2040/CP2080/CP2140 boards) and TABLE 3-10 shows typical sensor readings for a Netra CP2160 board.
CPU-vicinity temperature (senses the local temperature of the CPU area) |
||
This sensor reading is not available[6] |
||
The Intelligent Platform Management Interface (IPMI) commands can be used to enable sensor monitoring and subsequent event generation from satellite boards in the Netra CP2000/CP2100 series CompactPCI system.
The IPMI command examples provided in this section are based on the IPMI Specification Version 1.0. Please use the IPMI Specification for additional information on how to implement these IPMI commands.
Note - To execute an IPMI command, at the OpenBoot PROM ok prompt, type the packets in reverse order followed by the relevant information, as shown in Examples of IPMI Command Packets. Change the bytes in the example packet to accommodate different IPMI addresses, different threshold values, or different sensor numbers. See also the IPMI Specification Version 1.0.
1. Set the thresholds for the sensors.
See Set Sensor Threshold. If no threshold is set, the default threshold operates:
2. Follow instructions in Check Whether the IPMI Commands Are Executed Properly to check proper execution of the command.
1. To execute a command to enable events from the sensor, type:
See Set Sensor Event Enable Command and Get Sensor Event Enable.
The following supporting commands, each with a corresponding packet, are available for any sensor: get sensor threshold, get sensor reading, and get sensor event enable.
2. Follow instructions in Check Whether the IPMI Commands Are Executed Properly to check proper execution of the command.
1. Check whether the stack on the ok prompt displays 0 when the command is issued.
A 0 indicates that the command packet sent to the board was successful.
2. Type the execute-smc-cmd (cmd 33) command at the ok prompt as follows:
This command verifies that the target satellite board received and executed the command and sent a response.
3. Check the completion code, which is the seventh byte from the left.
If the completion code is 0, then the target board successfully executed the command. Otherwise the command was not successfully executed by the board.
4. Check that rsSA and rqSA are swapped in the response packet.
The rsSA is the responder slave address and the rqSA is the requestor slave address.
5. (Optional) If the command was not executed correctly, resend the IPMI command.
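The completion-code and address checks in the procedure above could be automated along the following lines. This is a sketch under stated assumptions: the completion code is taken from the seventh byte as described in step 3, while the address positions (first and fourth bytes) follow the IPMB message layout in the IPMI specification and are an assumption about how the packets appear here. The function name and the example byte values are hypothetical, and checksums are not validated.

```c
#include <assert.h>
#include <stddef.h>

#define IPMI_ADDR_FIRST   0  /* rsSA in a request, rqSA in a response (assumed) */
#define IPMI_ADDR_SECOND  3  /* rqSA in a request, rsSA in a response (assumed) */
#define IPMI_COMPLETION   6  /* seventh byte from the left */

/*
 * Returns 1 if the response carries completion code 0 and its two slave
 * addresses are the request's addresses swapped; otherwise returns 0.
 */
int ipmi_response_ok(const unsigned char *req, size_t req_len,
                     const unsigned char *rsp, size_t rsp_len)
{
    assert(req != NULL && rsp != NULL);

    if (req_len <= IPMI_ADDR_SECOND || rsp_len <= IPMI_COMPLETION)
        return 0;                       /* packet too short to check */
    if (rsp[IPMI_COMPLETION] != 0x00)
        return 0;                       /* target reported a failure */

    return rsp[IPMI_ADDR_FIRST] == req[IPMI_ADDR_SECOND] &&
           rsp[IPMI_ADDR_SECOND] == req[IPMI_ADDR_FIRST];
}
```

If the function returns 0, the command can be resent as described in step 5.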
The following packets are IPMI command packets that can be sent from the OpenBoot PROM ok prompt:
A typical example of the sensor command is as follows:
A typical example of the sensor command is as follows:
A typical example of the sensor command is as follows:
A typical example of the sensor command is as follows:
A typical example of the sensor command is as follows:
Note - The NetFn/LUN byte for all sensor IPMI commands is 0x12, which corresponds to netFn = 0x04 and LUN = 0x2.
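The relationship between the combined byte and its two fields can be checked with a short calculation. In the IPMI message format, the network function occupies the upper six bits of the byte and the LUN the lower two, so netFn 0x04 with LUN 0x2 gives (0x04 << 2) | 0x2 = 0x12, reading the 12 in the note as hexadecimal. The helper names below are hypothetical.

```c
#include <assert.h>

/* Pack a 6-bit network function and a 2-bit LUN into one byte. */
unsigned char ipmi_pack_netfn_lun(unsigned char netfn, unsigned char lun)
{
    assert(netfn <= 0x3F && lun <= 0x03);
    return (unsigned char)((netfn << 2) | lun);
}

/* Extract the network function (upper six bits). */
unsigned char ipmi_netfn(unsigned char netfn_lun)
{
    return (unsigned char)(netfn_lun >> 2);
}

/* Extract the LUN (lower two bits). */
unsigned char ipmi_lun(unsigned char netfn_lun)
{
    return (unsigned char)(netfn_lun & 0x03);
}
```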
The following sections describe how to use the ASM functions in an application program.
For the ASM application program to monitor the hardware environment, the following conditions must be met:
The ASM parameter values in the application program apply when the system is running at the Solaris level and do not necessarily have to be the same as the corresponding parameter settings in the OpenBoot PROM.
To change the ASM parameter setting at the OpenBoot PROM level, see OpenBoot PROM Environmental Parameters for the procedure. The OpenBoot PROM ASM parameter values only apply when the system is running at the OpenBoot PROM level.
For most applications, an ASM polling rate of once every 60 seconds is adequate.
To specify a polling rate of once every 60 seconds, use a loop such as the following in the ASM application program:
do {
    ... /* read and process I2C bus devices data */
    sleep(60);    /* sets the ASM polling rate to every 60 seconds */
} while (1);
The ASM application program monitors the CPU-vicinity temperature as follows (see Sample Application Program for C code):
1. Get the CPU-vicinity temperature measurements and other sensor measurements using the ioctl system call.
2. Examine the measurement readings and take the appropriate action.
Note - The warning and shutdown temperatures are set for the CPU processor.
3. Repeat the process for every ASM polling cycle.
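The three steps above can be sketched as a polling routine. To keep the example self-contained, the sensor read is abstracted behind a function pointer and a stand-in reader replays a fixed series of values instead of calling ioctl; asm_read_fn, asm_poll, and demo_read are hypothetical names. A real monitoring program would loop indefinitely, whereas this sketch runs a bounded number of cycles.

```c
#include <unistd.h>

/* Hypothetical sensor reader: stores a reading, returns 0 on success. */
typedef int (*asm_read_fn)(int *temp_c);

/*
 * Runs `cycles` polling cycles, sleeping interval_s seconds after each.
 * Returns the number of readings that exceeded warning_c, or -1 if a
 * read failed.
 */
int asm_poll(asm_read_fn read_temp, int warning_c,
             unsigned int interval_s, int cycles)
{
    int over = 0;
    int i;

    for (i = 0; i < cycles; i++) {
        int temp_c;

        if (read_temp(&temp_c) != 0)    /* step 1: get the measurement */
            return -1;
        if (temp_c > warning_c)         /* step 2: examine and act     */
            over++;
        sleep(interval_s);              /* step 3: wait for next cycle */
    }
    return over;
}

/* Stand-in reader for demonstration: replays a fixed series of values. */
static int demo_read(int *temp_c)
{
    static const int series[] = { 68, 71, 69, 72 };
    static unsigned int next = 0;

    *temp_c = series[next % 4];
    next++;
    return 0;
}
```

Calling asm_poll(demo_read, 70, 60, 4) would take four readings at the 60-second rate mentioned in the text and count the two that exceed 70.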
The ASM driver is a STREAMS module that sits on top of the Solaris system controller driver. The Netra CP2000/CP2100 series ASM driver accepts STREAMS ioctl input, passes it on to the system controller driver as a command, and sends the sensor temperature as the output to the user. Currently, this driver handles only the local I2C bus. On the Netra CP2000 series and the Netra CP2140 board, this driver enables the user to monitor the CPU-vicinity temperature, PMC temperature, memory module heat sink temperature, memory module temperature, SDRAM module 1 temperature, SDRAM module 2 temperature, and the power module temperature. On the Netra CP2160 board, the driver enables the user to monitor the CPU temperature and the Inlet 1, Exhaust 1, Exhaust 2, SDRAM module 1, and power module temperatures.
Note - The local I2C bus is supported by the Solaris driver interface.
Use the streams ioctl call with the I_STR command to get sensor information. The data structure passed as the argument to the streams ioctl is as follows.
When the monitoring is successful, the call returns 0. For any error, it returns -1 and errno is set accordingly. Trying to read a sensor that is not physically present sets errno to ENXIO. For any hardware or firmware failure, errno is EINVAL. For any memory allocation problem, errno is EAGAIN.
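An application might translate these return values into its own status codes as sketched below. The enum and function names are hypothetical; only the mapping of ENXIO, EINVAL, and EAGAIN to their causes comes from the text.

```c
#include <errno.h>

/* Hypothetical status codes for illustration. */
enum asm_failure {
    ASM_OK,              /* call returned 0                       */
    ASM_SENSOR_ABSENT,   /* ENXIO: sensor not physically present  */
    ASM_HW_FW_FAILURE,   /* EINVAL: hardware or firmware failure  */
    ASM_NO_MEMORY,       /* EAGAIN: memory allocation problem     */
    ASM_OTHER            /* any other errno value                 */
};

/* Map the ioctl return value and the saved errno to a status code. */
enum asm_failure asm_classify_result(int rc, int err)
{
    if (rc == 0)
        return ASM_OK;

    switch (err) {
    case ENXIO:
        return ASM_SENSOR_ABSENT;
    case EINVAL:
        return ASM_HW_FW_FAILURE;
    case EAGAIN:
        return ASM_NO_MEMORY;
    default:
        return ASM_OTHER;
    }
}
```

The caller should save errno immediately after the failing call and pass the saved value, since later library calls can overwrite it.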
This section presents a sample ASM application that monitors the CPU-vicinity temperature. Please refer to /usr/platform/sun4u/include/sys/stdasm.h if you want to add support for the other six sensors in the application.
Note - The stdasm.h header file is located in the following directory: /usr/platform/sun4u/include/sys
This section describes the test configuration used to generate the data for the OpenBoot PROM temperature table used by the ASM temperature monitoring function. It should be used as a guideline by OEMs who need to revise the OpenBoot PROM temperature table because of changes to the enclosure, system, or fan configuration.
The system configuration and test equipment used to obtain the ASM temperature data are as follows:
The two thermocouples are positioned as follows:
To Attach and Test Thermocouples
1. Attach the thermocouples on the board.
See the section on Thermocouple Locations above for further details.
2. Install the board in the far left slot (slot #1) of the CompactPCI chassis.
For the location of the thermocouples, see FIGURE 3-2, FIGURE 3-3, FIGURE 3-4, and FIGURE 3-5.
3. Install a dummy 6U CompactPCI board in the next slot to control the air flow.
The front panels of the chassis should be filled.
4. Set up the fan speed to maintain air flow of 320 linear feet per minute (LFM) or greater.
Air flow is measured by securing the air flow sensor approximately 5 mm from the side of CPU heat sink.
5. Place the chassis inside the environmental chamber.
6. Set up the chamber temperature to cycle from 0°C to 60°C in 5°C steps.
7. Run the SunVTS software during the test.
8. Read the thermocouple temperatures after at least one hour.
Wait at each temperature step.
Copyright © 2004, Sun Microsystems, Inc. All Rights Reserved.