CHAPTER 6
hpc.conf Configuration File
This chapter discusses the Sun HPC ClusterTools software configuration file hpc.conf, which defines various attributes of a Sun HPC cluster. A single hpc.conf file is shared by all the nodes in a cluster. It resides in /opt/SUNWhpc/etc.
Note - This configuration file is also used on LSF-based clusters, but it resides in a different location.
The hpc.conf file is organized into functional sections, which are summarized below and illustrated in CODE EXAMPLE 6-1.
Sun HPC ClusterTools software is distributed with an hpc.conf template, which resides by default in /opt/SUNWhpc/examples/rte/hpc.conf.template. If you wish to customize the configuration settings, you should copy this template file to /opt/SUNWhpc/etc/hpc.conf and edit it.
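For example, the copy can be made with a command like the following, run as a user with write permission in /opt/SUNWhpc/etc:

# cp /opt/SUNWhpc/examples/rte/hpc.conf.template /opt/SUNWhpc/etc/hpc.conf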
Each configuration section is bracketed by a Begin/End keyword pair and, when a parameter definition involves multiple fields, the fields are separated by spaces.
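As a purely schematic sketch of that layout (the section and parameter names below are placeholders, not actual hpc.conf entries):

Begin SectionName
parameter-name field1 field2
End SectionName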
Note - When any changes are made to hpc.conf, the system should be in a quiescent state. To ensure that it is safe to edit hpc.conf, shut down the nodal and master Sun CRE daemons as described in Stopping and Restarting Sun CRE. If you change the PMODULES or PM=rsm sections, you must also stop the RSM daemon hpc_rsmd. See RSM Daemon.
The ShmemResource section provides the administrator with two parameters that control allocation of shared memory and swap space: MaxAllocMem and MaxAllocSwap. This special memory allocation control is needed because some Sun HPC ClusterTools software components use shared memory.
CODE EXAMPLE 6-2 shows the ShmemResource template that is in the hpc.conf file that is shipped with Sun HPC ClusterTools software.
#Begin ShmemResource
#MaxAllocMem 0x7fffffffffffffff
#MaxAllocSwap 0x7fffffffffffffff
#End ShmemResource
To set MaxAllocMem and/or MaxAllocSwap limits, remove the comment character (#) from the start of each line and replace the current value, 0x7fffffffffffffff, with the desired limit.
The Sun HPC ClusterTools software internal shared memory allocator permits an application to use swap space, the amount of which is the smaller of:
- The swap space limit specified by MaxAllocSwap
- The node's available swap space
If MaxAllocSwap is not specified, or if zero or a negative value is specified, 90% of a node's available swap is used as the swap limit.
The MaxAllocMem parameter can be used to limit the amount of shared memory that can be allocated. If a smaller shared memory limit is not specified, the shared memory limit is 90% of available physical memory.
The following Sun HPC ClusterTools software components use shared memory:
Note - Shared memory and swap space limits are applied per-job on each node.
If you have set up your system for dedicated use (only one job at a time is allowed), you should leave MaxAllocMem and MaxAllocSwap undefined. This allows jobs to maximize use of swap space and physical memory.
If, however, multiple jobs will share a system, you may want to set MaxAllocMem to some level below 50% of total physical memory. This reduces the risk of having a single application lock up physical memory. How much below 50% you choose to set it depends on how many jobs you expect to be competing for physical memory at any given time.
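For example, on a node with 8 Gbytes of physical memory, an administrator sharing the node among several jobs might cap shared memory allocation at 2 Gbytes. The value below is illustrative only; choose a limit based on how many jobs you expect to compete for memory:

Begin ShmemResource
MaxAllocMem 0x80000000
End ShmemResource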
The MPIOptions section provides a set of options that control MPI communication behavior in ways that are likely to affect message-passing performance. It contains a template showing some general-purpose option settings, plus an example of alternative settings for maximizing performance. These examples are shown in CODE EXAMPLE 6-3.
The options in the general-purpose template are the same as the default settings for the Sun MPI library. In other words, you do not have to uncomment the general-purpose template to have its option values be in effect. This template is provided in the MPIOptions section so that you can see what options are most beneficial when operating in a multiuser mode.
If you want to use the performance settings, remove the comment character (#) from the beginning of each line of the performance example.
The resulting template should appear as follows:
Begin MPIOptions
coscheduling off
spin on
End MPIOptions
TABLE 6-1 provides brief descriptions of the MPI runtime options that can be set in hpc.conf. Each description identifies the default value and describes the effect of each legal value.
Some MPI options not only control a parameter directly, they can also be set to a value that passes control of the parameter to an environment variable. Where an MPI option has an associated environment variable, TABLE 6-1 names the environment variable.
An MPI process often has to wait for a particular event, such as the arrival of data from another process. If the process checks (spins) for this event continuously, it consumes CPU resources that may be deployed more productively for other purposes.
The administrator can direct that the MPI process instead register events associated with shared memory or remote shared memory (RSM) message passing with the spin daemon spind, which can spin on behalf of multiple MPI processes (coscheduling). This frees up multiple CPUs for useful computation. The spind daemon itself runs at a lower priority and backs off its activities with time if no progress is detected.
The SUNWrte package implements the spind daemon, which is not directly user callable.
The cluster administrator can control spin policy in the hpc.conf file. The attribute coscheduling, in the MPIOptions section, can be set to avail, on, or off.
The cluster administrator can also change the setting of the attribute spindtimeout, indicating how long a process waits for spind to return. The default is 1000 milliseconds.
For tips on determining spin policy, see the man page for MPI_COSCHED. In general, the administrator may wish to force use of spind for heavily used development partitions where performance is not a priority. On other partitions, the policy could be set to avail, and users can set MPI_COSCHED=0 for runs where performance is needed.
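For example, to force use of spind on a heavily used development partition and give processes a longer wait for spind to respond, the administrator might use settings like the following (the timeout value is illustrative):

Begin MPIOptions
coscheduling on
spindtimeout 2000
End MPIOptions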
The CREOptions section controls the behavior of Sun CRE in logging system events, handling daemon core files, and authenticating users and programs.
The template hpc.conf file contains the default settings for these behaviors for the current cluster. These settings are shown in CODE EXAMPLE 6-4.
The cluster is specified by appending the tag Server=master-node-name to the Begin CREOptions line:
Begin CREOptions Server=master-node-name
If the node name supplied does not match the name of the current master node, then this section is ignored.
It is possible to have two CREOptions sections. The section without a tag is always processed first. Then the section with a matching master-node-name adds to or overrides the previous settings.
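As a sketch of this layering (the node name node0 and the facility value are illustrative), a cluster whose master node is node0 might carry both an untagged and a tagged section:

Begin CREOptions
syslog_facility daemon
End CREOptions

Begin CREOptions Server=node0
syslog_facility local0
End CREOptions

The untagged section is processed first; if node0 is the current master node, the tagged section then overrides syslog_facility for that cluster.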
By default, Sun CRE uses the syslog facility to log system events, as indicated by the entry syslog_facility daemon. Other possible values are user, local0, local1, ..., local7. See the syslog(3C), syslogd(1M), and syslog.conf(4) man pages for information on the syslog facility.
Note - In rare cases, the Sun CRE daemons may log errors to the default system log. This occurs when an error is generated before the system has read the value of syslog_facility in hpc.conf.
By default, core files are disabled for Sun CRE daemons. The administrator may enable core files by changing enable_core off to enable_core on. The administrator may also specify where daemon core files are saved by supplying a value for corefile_name. See coreadm(1M) for the possible naming patterns. For example:
corefile_name /var/hpc/core.%n.%f.%p
This would cause any core files to be placed in /var/hpc with the name core modified by node name (%n), executable file name (%f), and process ID (%p). (Note that only daemon core files are affected by the CREOptions section; user programs are not affected.)
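Putting these settings together, a CREOptions fragment that enables daemon core files with the naming pattern above might look like this (it assumes the /var/hpc directory exists and is writable by the daemons):

Begin CREOptions
enable_core on
corefile_name /var/hpc/core.%n.%f.%p
End CREOptions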
Authentication may be enabled by changing the auth_opt value from sunhpc_rhosts (the default) to rhosts, des, or krb5. The values des and krb5 select DES and Kerberos Version 5 authentication, respectively. See Authentication and Security for additional steps needed to establish the chosen authentication method.
By default, a single job may publish a maximum of 256 names. To increase or reduce that number, change this line:
max_pub_names maximum-number-of-names
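For example, to raise the per-job limit to 512 published names (an arbitrary illustrative value):

max_pub_names 512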
If you have only one resource manager installed, you can save users the trouble of entering the -x resource-manager option each time they use the mprun command by specifying a default. Enter this line:
default_rm resource-manager-name
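For example, assuming LSF is the only resource manager installed and is registered under the name lsf (check your installation for the exact name to use):

default_rm lsf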
If you need to restrict mprun's ability to launch programs while running in batch mode, use the allow_mprun field. By default, the field is set to:
allow_mprun *
The asterisk indicates that no restrictions have been placed on mprun. The asterisk is equivalent to having no allow_mprun entry in the CREOptions section of hpc.conf.
If you need to set restrictions, you must configure both the hpc.conf file and the sunhpc.allow file. Instructions are provided in How to Configure the hpc.conf File and How to Configure the sunhpc.allow File.
This section is used only in a cluster that is using LSF as its workload manager, not Sun CRE. Sun CRE ignores the HPCNodes section of the hpc.conf file.
The PMODULES section provides the names and locations of the protocol modules (PMs) that the run-time system is to discover and make available for communication in the cluster.
When a Sun CRE-based cluster is being started, an instance of the daemon tm.omd is started on each node. This daemon is responsible for discovering various information about a node, including which PMs are available for that node. The tm.omd daemon looks in the hpc.conf file for a list of PMs that may be available. It then opens each PM and calls an interface discovery function to find out whether the PM has interfaces that are up and running. This information is returned to tm.omd and stored in the cluster database.
The PMODULES section of the hpc.conf file lists each PM by name and gives the location (or default location) where it may be found. The template hpc.conf looks like this:
# PMODULE LIBRARY
Begin PMODULES
shm ()
rsm ()
tcp ()
End PMODULES
Three PMs are shipped with Sun HPC ClusterTools software and included in the template hpc.conf file: shm (shared memory), rsm (remote shared memory), and tcp (TCP/IP).
The template hpc.conf file specifies the location of all three PMs as (), which indicates the default location. The default location is /opt/SUNWhpc/lib for 32-bit daemons, and /opt/SUNWhpc/lib/sparcv9 for 64-bit daemons.
The administrator has the option of putting PM libraries in a location other than the default. This is useful, for instance, when a new user-defined PM is being developed. For PMs located in a directory other than the default, the administrator must put the absolute pathname in the hpc.conf file. For example:
# PMODULE LIBRARY
Begin PMODULES
tcp ()
shm /home/jbuffett/libs
End PMODULES
In this example, the tcp libraries are located in the default location and the shm libraries are located in /home/jbuffett/libs. In a 64-bit environment, sparcv9 is automatically added to the pathname. Thus, this hpc.conf entry indicates that the shm PM libraries would be found in /home/jbuffett/libs/sparcv9.
The hpc.conf file contains a PM section for each available protocol module. The section gives standard information (name of interface and its preference ranking) for the PM, along with additional information for some types of PMs.
The name of the PM being described appears on the Begin line, joined to the keyword PM by an equal sign with no intervening spaces. The following example shows the PM sections provided for the shm and rsm PMs.
# SHM settings
# NAME RANK
Begin PM=shm
shm 5
End PM

# RSM settings
# NAME RANK AVAIL
Begin PM=rsm
wrsm 20 1
End PM
The NAME and RANK columns must be filled in for all PMs. The shm PM requires only these two standard items of information; the rsm PM has an additional field called AVAIL.
The name of the interface indicates the controller type and, optionally, a numbered interface instance. Interface names not ending with a number are wildcards; they specify default settings for all interfaces of that type. The name can be between 1 and 32 characters in length.
If interfaces are specified by name after a wildcard entry, the named entries take precedence.
The rank of an interface is the order in which that interface is preferred over other interfaces, with the lowest-ranked interface the most preferred. That is, if an interface with a rank of 0 is available when a communication operation begins, it will be selected for the operation before interfaces with ranks of 1 or greater. Likewise, an available rank 1 interface will be used before interfaces with a rank of 2 or greater.
Note - Because hpc.conf is a shared, cluster-wide configuration file, the rank specified for a given interface will apply to all nodes in the cluster.
Network ranking decisions are usually influenced by site-specific conditions and requirements. Although interfaces connected to the fastest network in a cluster are often given preferential ranking, raw network bandwidth is only one consideration. For example, an administrator might decide to dedicate one network that offers very low latency, but not the fastest bandwidth, to all communication within a cluster and use a higher-capacity network for connecting the cluster to other systems.
Rank can also be specified for interface instances within a PM section. For example, consider a customized hpc.conf entry like this:
# RSM Settings
# NAME RANK AVAIL
Begin PM=rsm
wrsm 15 1
wrsm0 10 1
wrsm1 20 1
wrsm2 30 1
wrsm3 40 1
End PM=rsm
If controllers wrsm0 and wrsm1 could be used to establish connections to the same process, wrsm0 would always be chosen, since it has the lower ranking number.
The rsm PM section contains an additional column headed AVAIL. This value indicates whether a controller is (1) or is not (0) available for RSM communication.
The AVAIL column can be used to configure out one or more controllers, perhaps for maintenance on the network. If all controllers on the network are going to be unavailable, the administrator can change the availability of the wildcard entry to 0. If only certain controllers will be unavailable, the administrator can leave the wildcard entry set to 1 but add entries set to 0 availability for named instances. (Remember to stop the Sun CRE daemons and the RSM daemon hpc_rsmd when editing the hpc.conf file and restart them afterward.)
# RSM Settings
# NAME RANK AVAIL
Begin PM=rsm
wrsm 15 1
wrsm2 20 0
wrsm3 15 0
End PM=rsm
Note - Alternatively, the administrator can use the option rsm_links in the MPIOptions section to configure out one or more controllers. See Configuring Out Network Controllers.
Another use for the AVAIL column is to enable software-controlled striping of messages over network controllers. When a message is submitted for transmission over the network, the rsm PM distributes the message over as many network interfaces as are available with the same preference ranking, up to the limit of 8 links.
In striped communication, a message is split into smaller packets and transmitted in two or more parallel streams over a set of network controllers that have been logically combined into a stripe-group.
The AVAIL column allows the administrator to include individual network interfaces in a (software) stripe-group pool. Members of this pool are available to be included in logical stripe groups as long as they have the same preference ranking. These stripe groups are formed on an as-needed basis, selecting interfaces from this stripe-group pool.
To include an interface in a stripe-group pool, set its AVAIL value to 1. To exclude an interface from the pool, specify 0.
Stripe-group membership is optional, so you can reserve some network bandwidth for non-striped use (assuming the network has another PM enabled). To do so, simply set AVAIL to 0 on the network interface(s) you wish to reserve in this way.
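For example, the following sketch (interface names and ranks are illustrative) keeps wrsm3 out of the stripe-group pool, reserving it for non-striped use by another PM, while the remaining wrsm interfaces stay available for striping:

# NAME RANK AVAIL
Begin PM=rsm
wrsm 20 1
wrsm3 20 0
End PM=rsm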
The PM section provided for the tcp PM in the template hpc.conf file contains the standard NAME and RANK columns, along with several placeholder columns that are not used at this time. The default TCP settings (and placeholders) are shown in CODE EXAMPLE 6-5.
The template hpc.conf file identifies the network interfaces that are included in the TCP PM section. The networks with the prefix "m" are for Enterprise 10000 alternate pathing support, and should be used in preference to the underlying interface (thus their lower ranking).
Note - Inclusion of any network interface in this file does not imply that Sun Microsystems supports, or intends to support, that network.
Whenever hpc.conf is changed, the Sun CRE database must be updated with the new information. After all required changes to hpc.conf have been made, restart the Sun CRE daemons on all cluster nodes. For example, to start the daemons on cluster nodes node1 and node2 from a central host, enter
# ./ctstartd -n node1,node2 -r connection_method
where connection_method is rsh, ssh, or telnet. Or, you can specify a nodelist file instead of listing the nodes on a command line.