CHAPTER 6

hpc.conf Configuration File

This chapter discusses the Sun HPC ClusterTools software configuration file hpc.conf, which defines various attributes of a Sun HPC cluster. A single hpc.conf file is shared by all the nodes in a cluster. It resides in /opt/SUNWhpc/etc.



Note - This configuration file is also used on LSF-based clusters, but it resides in a different location.



The hpc.conf file is organized into functional sections, which are summarized below and illustrated in CODE EXAMPLE 6-1.

Sun HPC ClusterTools software is distributed with an hpc.conf template, which resides by default in /opt/SUNWhpc/examples/rte/hpc.conf.template. If you wish to customize the configuration settings, you should copy this template file to /opt/SUNWhpc/etc/hpc.conf and edit it.
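
For example, assuming a standard installation (the copy and edit commands shown are illustrative; use whatever editor you prefer):

# cp /opt/SUNWhpc/examples/rte/hpc.conf.template /opt/SUNWhpc/etc/hpc.conf
# vi /opt/SUNWhpc/etc/hpc.conf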

Each configuration section is bracketed by a Begin/End keyword pair and, when a parameter definition involves multiple fields, the fields are separated by spaces.

CODE EXAMPLE 6-1 General Organization of the hpc.conf File
# Begin ShmemResource
# ...
# End ShmemResource
 
# Begin MPIOptions Queue=
# ...
# End MPIOptions
 
# Begin CREOptions Server=
# ...
# End CREOptions
 
# Begin HPCNodes
# ...
# End HPCNodes
 
Begin PMODULES
...
End PMODULES
 
Begin PM=shm
...
End PM
 
Begin PM=rsm
...
End PM
 
Begin PM=tcp
... 
End PM



Note - When any changes are made to hpc.conf, the system should be in a quiescent state. To ensure that it is safe to edit hpc.conf, shut down the nodal and master Sun CRE daemons as described in Stopping and Restarting Sun CRE. If you change the PMODULES or PM=rsm sections, you must also stop the RSM daemon hpc_rsmd. See RSM Daemon.




ShmemResource Section

The ShmemResource section provides the administrator with two parameters that control allocation of shared memory and swap space: MaxAllocMem and MaxAllocSwap. This special memory allocation control is needed because some Sun HPC ClusterTools software components use shared memory.

CODE EXAMPLE 6-2 shows the ShmemResource template that is in the hpc.conf file that is shipped with Sun HPC ClusterTools software.

CODE EXAMPLE 6-2 ShmemResource Section Example
#Begin ShmemResource
#MaxAllocMem  0x7fffffffffffffff
#MaxAllocSwap 0x7fffffffffffffff
#End ShmemResource

To set MaxAllocMem and/or MaxAllocSwap limits, remove the comment character (#) from the start of each line and replace the current value, 0x7fffffffffffffff, with the desired limit.
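
For example, a hypothetical configuration that limits each job to 2 Gbytes of shared memory and 4 Gbytes of swap space per node (the values shown are illustrative only, not recommendations) would look like this:

Begin ShmemResource
MaxAllocMem  0x80000000
MaxAllocSwap 0x100000000
End ShmemResource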

Guidelines for Setting Limits

The Sun HPC ClusterTools software internal shared memory allocator permits an application to use swap space, the amount of which is the smaller of:

  • The value (in bytes) given by the MaxAllocSwap parameter
  • 90% of available swap on a node

If MaxAllocSwap is not specified, or if zero or a negative value is specified, 90% of a node's available swap is used as the swap limit.

The MaxAllocMem parameter can be used to limit the amount of shared memory that can be allocated. If MaxAllocMem is not specified, or does not specify a smaller limit, the shared memory limit is 90% of a node's available physical memory.

The following Sun HPC ClusterTools software components use shared memory:

  • Sun CRE uses shared memory to hold cluster and job table information. Its memory use is based on cluster and job sizes and is not controllable by the user. Shared memory space is allocated for Sun CRE when it starts up and is not affected by MaxAllocMem and MaxAllocSwap settings. This ensures that Sun CRE can start up no matter how low these memory-limit variables have been set.
  • MPI uses shared memory for communication between processes that are on the same node.
  • Sun S3L uses shared memory for storing data. An MPI application can allocate parallel arrays whose subgrids are in shared memory. This is done with the utility S3L_declare_detailed().


Note - Sun S3L supports a special form of shared memory known as Intimate Shared Memory (ISM), which reserves a region in physical memory for shared memory use. What makes ISM space special is that it is not swappable and, therefore, cannot be made available for other use. For this reason, the amount of memory allocated to ISM should be kept to a minimum.





Note - Shared memory and swap space limits are applied per-job on each node.



If you have set up your system for dedicated use (only one job at a time is allowed), you should leave MaxAllocMem and MaxAllocSwap undefined. This allows jobs to maximize use of swap space and physical memory.

If, however, multiple jobs will share a system, you may want to set MaxAllocMem to some level below 50% of total physical memory. This reduces the risk of having a single application lock up physical memory. How much below 50% you choose to set it depends on how many jobs you expect to be competing for physical memory at any given time.
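
For instance, on a hypothetical node with 8 Gbytes of physical memory that is expected to run two jobs concurrently, you might choose a limit of about 3 Gbytes, comfortably below the 50% mark of 4 Gbytes:

Begin ShmemResource
MaxAllocMem  0xC0000000
End ShmemResource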



Note - When users make direct calls to mmap(2) or shmget(2), they are not limited by the MaxAllocMem and MaxAllocSwap variables. These system calls manipulate shared memory independently of the MaxAllocMem and MaxAllocSwap values.




MPIOptions Section

The MPIOptions section provides a set of options that control MPI communication behavior in ways that are likely to affect message-passing performance. It contains a template showing some general-purpose option settings, plus an example of alternative settings for maximizing performance. These examples are shown in CODE EXAMPLE 6-3.

  • General-purpose, multiuser settings - The template in the MPIOptions section is designed for general-purpose use at times when multiple message-passing jobs will be running concurrently.
  • Performance settings - The second example is designed to maximize the performance of message-passing jobs when only one job is allowed to run at a time.


Note - The first line of the template contains the phrase "Queue=hpc." This line indicates a queue in the LSF batch runtime environment, which uses the same hpc.conf file as Sun CRE. For LSF, the settings apply only to the specified queue. For Sun CRE, the settings apply across the cluster.



The options in the general-purpose template are the same as the default settings for the Sun MPI library. In other words, you do not have to uncomment the general-purpose template to have its option values be in effect. This template is provided in the MPIOptions section so that you can see what options are most beneficial when operating in a multiuser mode.

If you want to use the performance settings, do the following:

  • Delete the comment character (#) from the beginning of each line of the performance example, including the Begin MPIOptions and End MPIOptions lines.
  • On Sun CRE-based clusters, delete the "Queue=performance" phrase from the Begin MPIOptions line.

The resulting template should appear as follows:

Begin MPIOptions
coscheduling    off
spin            on
End MPIOptions

CODE EXAMPLE 6-3 MPIOptions Section Example
# The following is an example of the options that affect the run time
# environment of the MPI library.  The listings below are identical to
# the default settings of the library.  The "Queue=hpc" phrase makes
# this an LSF-specific entry, and only for the Queue named hpc.  These
# options are a good choice for a multiuser Queue.  To be recognized
# by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions Queue=hpc
# coscheduling  avail
# pbind	        avail
# spindtimeout   1000
# progressadjust   on
# spin		  off
#
# shm_numpostbox           16
# shm_shortmsgsize        256
# rsm_maxsegsize      1048576
# rsm_numpostbox           15
# rsm_shortmsgsize        401
# rsm_maxstripe	            2
# rsm_links	          wrsm0,1
# maxprocs_limit   2147483647
# maxprocs_default       4096
#
# End MPIOptions
 
# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a Queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling             off
# spin                      on
# End MPIOptions

TABLE 6-1 provides brief descriptions of the MPI runtime options that can be set in hpc.conf. Each description identifies the default value and describes the effect of each legal value.

Some MPI options not only control a parameter directly, they can also be set to a value that passes control of the parameter to an environment variable. Where an MPI option has an associated environment variable, TABLE 6-1 names the environment variable.

TABLE 6-1 MPI Runtime Options

coscheduling
  • avail (default) - Allows spind use to be controlled by the environment variable MPI_COSCHED. If MPI_COSCHED=0 or is not set, spind is not used. If MPI_COSCHED=1, spind must be used.
  • on - Enables coscheduling; spind is used. This value overrides MPI_COSCHED=0.
  • off - Disables coscheduling; spind is not to be used. This value overrides MPI_COSCHED=1.

pbind
  • avail (default) - Allows the processor binding state to be controlled by the environment variable MPI_PROCBIND. If MPI_PROCBIND=0 or is not set, no processes are bound to a processor. If MPI_PROCBIND=1, all processes on a node are bound to a processor.
  • on - All processes are bound to processors. This value overrides MPI_PROCBIND=0.
  • off - No processes on a node are bound to a processor. This value overrides MPI_PROCBIND=1.

spindtimeout
  • 1000 (default) - When polling for messages, a process waits 1000 milliseconds for spind to return. This equals the value to which the environment variable MPI_SPINDTIMEOUT is set.
  • integer - To change the default timeout, enter an integer value specifying the number of milliseconds the timeout should be.

progressadjust
  • on (default) - Allows the user to set the environment variable MPI_SPIN.
  • off - Disables the user's ability to set the environment variable MPI_SPIN.

shm_numpostbox
  • 16 (default) - Sets to 16 the number of postbox entries that are dedicated to a connection endpoint. This equals the value to which the environment variable MPI_SHM_NUMPOSTBOX is set. (See the Sun HPC ClusterTools Software Performance Guide for details.)
  • integer - To change the number of dedicated postbox entries, enter an integer value specifying the desired number.

shm_shortmsgsize
  • 256 (default) - Sets to 256 the maximum number of bytes a short message can contain. This equals the default value to which the environment variable MPI_SHM_SHORTMSGSIZE is set.
  • integer - To change the maximum-size definition of a short message, enter an integer specifying the maximum number of bytes it can contain.

rsm_numpostbox
  • 15 (default) - Sets to 15 the number of postbox entries that are dedicated to a connection endpoint. This equals the value to which the environment variable MPI_RSM_NUMPOSTBOX is set.
  • integer - To change the number of dedicated postbox entries, enter an integer specifying the desired number.

rsm_shortmsgsize
  • 401 (default) - Sets to 401 the maximum number of bytes a short message can contain when sent via RSM without using buffers. This equals the value to which the environment variable MPI_RSM_SHORTMSGSIZE is set.
  • integer - To change the maximum-size definition of a short message, enter an integer specifying the maximum number of bytes it can contain.

rsm_maxstripe
  • 2 (default) - Sets to 2 the maximum number of interfaces per stripe that can be used. (If not set, the default is the number of interfaces in the cluster, with a maximum of 64.) This equals the value to which the environment variable MPI_RSM_MAXSTRIPE is set.
  • integer - To change the maximum number of stripes that can be used, enter an integer specifying the desired limit.

rsm_links
  • wrsm0,1 (default) - Defines the controllers/links that can be used on each node for RSM communication.

rsm_maxsegsize
  • (default) - Sets to 17179869184 the maximum size segment that can be created for communication purposes.
  • integer - To change the maximum size of an RSM segment, enter an integer specifying the desired limit.

maxprocs_default
  • 4096 (default) - Sets to 4096 the number of processes an MPI process may be connected to at any one time. It includes processes in the same MPI job and processes in jobs that are currently connected to the MPI process. This equals the value to which the environment variable MPI_MAXPROCS is set.
  • integer - To change the maximum number of processes an MPI process may be connected to at any one time, enter an integer specifying the desired limit. The value may not exceed the setting for the option maxprocs_limit.

maxprocs_limit
  • integer - The maximum process table size a user may set MPI_MAXPROCS to. If the option maxprocs_default is not set, the user is able to specify a value up to MAX_INT.

spin
  • off (default) - Sets the MPI library spin policy to spin nonaggressively. This equals the value to which the environment variable MPI_SPIN is set.
  • on - Sets the MPI library to spin aggressively.


Setting MPI Spin Policy

An MPI process often has to wait for a particular event, such as the arrival of data from another process. If the process checks (spins) for this event continuously, it consumes CPU resources that may be deployed more productively for other purposes.

The administrator can direct that the MPI process instead register events associated with shared memory or remote shared memory (RSM) message passing with the spin daemon spind, which can spin on behalf of multiple MPI processes (coscheduling). This frees up multiple CPUs for useful computation. The spind daemon itself runs at a lower priority and backs off its activities with time if no progress is detected.

The SUNWrte package implements the spind daemon, which is not directly user callable.

The cluster administrator can control spin policy in the hpc.conf file. The attribute coscheduling, in the MPIOptions section, can be set to avail, on, or off.

  • avail (the default) means that spin policy is determined by the setting of the environment variable MPI_COSCHED. If MPI_COSCHED is set to zero or is not set, spind is not used. If MPI_COSCHED is set to one, spind must be used.
  • on means that spind must be used by MPI processes that wish to block on shared-memory communication. This value overrides MPI_COSCHED=0.
  • off means that spind cannot be used by MPI processes. This value overrides MPI_COSCHED=1.

The cluster administrator can also change the setting of the attribute spindtimeout, indicating how long a process waits for spind to return. The default is 1000 milliseconds.

For tips on determining spin policy, see the man page for MPI_COSCHED. In general, the administrator may wish to force use of spind for heavily used development partitions where performance is not a priority. On other partitions, the policy could be set to avail, and users can set MPI_COSCHED=0 for runs where performance is needed.
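
For example, an MPIOptions section that forces spind use (for Sun CRE the settings apply to the whole cluster, so no Queue tag is used) might look like the following sketch; the 2000-millisecond timeout is an illustrative value, not a recommendation:

Begin MPIOptions
coscheduling   on
spindtimeout   2000
End MPIOptions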


CREOptions Section

The CREOptions section controls the behavior of Sun CRE in logging system events, handling daemon core files, and authenticating users and programs.

The template hpc.conf file contains the default settings for these behaviors for the current cluster. These settings are shown in CODE EXAMPLE 6-4.

CODE EXAMPLE 6-4 CREOptions Section Example
Begin CREOptions
enable_core       off
corefile_name     core
syslog_facility   daemon
auth_opt          sunhpc_rhosts
max_pub_names     256
default_rm        cre
allow_mprun       *
End CREOptions

Specifying the Cluster

The cluster is specified by appending the tag Server=master-node-name to the Begin CREOptions line:

Begin CREOptions Server=master-node-name

If the node name supplied does not match the name of the current master node, then this section is ignored.

It is possible to have two CREOptions sections. The section without a tag is always processed first. Then the section with a matching master-node-name adds to or overrides the previous settings.
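
As a sketch, the following pair of sections applies settings cluster-wide and then overrides them when the current master node is a hypothetical node named node0 (the node name and core-file path are placeholders):

Begin CREOptions
enable_core       off
syslog_facility   daemon
End CREOptions

Begin CREOptions Server=node0
enable_core       on
corefile_name     /var/hpc/core.%n.%f.%p
End CREOptions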

Logging System Events

By default, Sun CRE uses the syslog facility to log system events, as indicated by the entry syslog_facility daemon. Other possible values are user, local0, local1, ..., local7. See the syslog(3C), syslogd(1M), and syslog.conf(4) man pages for information on the syslog facility.
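
For example, to keep Sun CRE messages separate from other daemon traffic, you might (as a sketch) direct them to the local0 facility:

syslog_facility   local0

A matching selector (for example, local0.info followed by a tab and the name of a log file of your choosing) would then be added to syslog.conf so that syslogd writes those messages to the desired destination.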



Note - In rare cases, the Sun CRE daemons may log errors to the default system log. This occurs when an error is generated before the system has read the value of syslog_facility in hpc.conf.



Enabling Core Files

By default, core files are disabled for Sun CRE daemons. The administrator may enable core files by changing enable_core off to enable_core on. The administrator may also specify where daemon core files are saved by supplying a value for corefile_name. See coreadm(1M) for the possible naming patterns. For example:

corefile_name /var/hpc/core.%n.%f.%p

This would cause any core files to be placed in /var/hpc with the name core modified by node name (%n), executable file name (%f), and process ID (%p). (Note that only daemon core files are affected by the CREOptions section; user programs are not affected.)

Enabling Authentication

Authentication may be enabled by changing the auth_opt value from sunhpc_rhosts (the default) to rhosts, des, or krb5. The values des and krb5 select DES authentication and Kerberos Version 5 authentication, respectively. See Authentication and Security for additional steps needed to establish the chosen authentication method.

Changing the Maximum Number of Published Names

By default, a single job may publish a maximum of 256 names. To increase or reduce that limit, set max_pub_names to the desired maximum:

max_pub_names maximum-number-of-names
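
For example, to allow each job to publish up to 512 names (an arbitrary illustrative value), the entry would read:

max_pub_names 512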

Identifying A Default Resource Manager

If you have only one resource manager installed, you can save users the trouble of entering the -x resource-manager option each time they use the mprun command by specifying a default. Enter this line:

default_rm resource-manager-name

Limiting mprun's Ability to Launch Programs in Batch Mode

If you need to restrict mprun's ability to launch programs while running in batch mode, use the allow_mprun field. By default, the field is set to:

allow_mprun *

The asterisk indicates that no restrictions have been placed on mprun. The asterisk is equivalent to having no entry in the CREOptions section of hpc.conf.

If you need to set restrictions, you must:

  • Change the value of allow_mprun
  • Create the file sunhpc.allow to specify the restrictions

Instructions are provided in How to Configure the hpc.conf File, and How to Configure the sunhpc.allow File.


HPCNodes Section

This section is used only in a cluster that is using LSF as its workload manager, not Sun CRE. Sun CRE ignores the HPCNodes section of the hpc.conf file.


PMODULES Section

The PMODULES section provides the names and locations of the protocol modules (PMs) that the runtime system is to discover and make available for communication in the cluster.

When a Sun CRE-based cluster is being started, an instance of the daemon tm.omd is started on each node. This daemon is responsible for discovering various information about a node, including which PMs are available for that node. The tm.omd daemon looks in the hpc.conf file for a list of PMs that may be available. It then opens the PMs, and calls an interface discovery function to find out if the PM has interfaces that are up and running. This information is returned to the tm.omd and stored away in the cluster database.

The PMODULES section of the hpc.conf file lists each PM by name and gives the location (or default location) where it may be found. The template hpc.conf looks like this:

# PMODULE LIBRARY
Begin PMODULES
shm      ()
rsm      ()
tcp      ()
End PMODULES

Three PMs are shipped with Sun HPC ClusterTools software and included in the template hpc.conf file. These are:

  • shm PM, used for on-node communication
  • tcp PM, used (typically) for internode communication on TCP-IP-compatible interconnects
  • rsm PM, used (typically) for internode communication on the Sun Fire Link interconnect

The template hpc.conf file specifies the location of all three PMs as (), which indicates the default location. The default location is /opt/SUNWhpc/lib for 32-bit daemons, and /opt/SUNWhpc/lib/sparcv9 for 64-bit daemons.

The administrator has the option of putting PM libraries in a location other than the default. This is useful, for instance, when a new user-defined PM is being developed. For PMs located in a directory other than the default, the administrator must put the absolute pathname in the hpc.conf file. For example:

# PMODULE LIBRARY
Begin PMODULES
tcp      ()
shm      /home/jbuffett/libs
End PMODULES

In this example, the tcp libraries are located in the default location and the shm libraries are located in /home/jbuffett/libs. In a 64-bit environment, sparcv9 is automatically added to the pathname. Thus, this hpc.conf entry indicates that the shm PM libraries would be found in /home/jbuffett/libs/sparcv9.


PM Section

The hpc.conf file contains a PM section for each available protocol module. The section gives standard information (name of interface and its preference ranking) for the PM, along with additional information for some types of PMs.

The name of the PM being described appears on the same line as the keyword PM with an equal sign and no spaces between them. This example shows the PM sections provided for the shm and rsm PMs.

# SHM settings
# NAME  RANK
Begin PM=shm
shm     5
End PM
# RSM settings
# NAME  RANK  AVAIL
Begin PM=rsm
wrsm    20    1
End PM

The NAME and RANK columns must be filled in for all PMs. The shm PM requires only these two standard items of information; the rsm PM has an additional field called AVAIL.

NAME Column

The name of the interface indicates the controller type and, optionally, a numbered interface instance. Interface names not ending with a number are wildcards; they specify default settings for all interfaces of that type. The name can be between 1 and 32 characters in length.

If interfaces are specified by name after a wildcard entry, the named entries take precedence.

RANK Column

The rank of an interface is the order in which that interface is preferred over other interfaces, with the lowest-ranked interface the most preferred. That is, if an interface with a rank of 0 is available when a communication operation begins, it will be selected for the operation before interfaces with ranks of 1 or greater. Likewise, an available rank 1 interface will be used before interfaces with a rank of 2 or greater.



Note - Because hpc.conf is a shared, cluster-wide configuration file, the rank specified for a given interface will apply to all nodes in the cluster.



Network ranking decisions are usually influenced by site-specific conditions and requirements. Although interfaces connected to the fastest network in a cluster are often given preferential ranking, raw network bandwidth is only one consideration. For example, an administrator might decide to dedicate one network that offers very low latency, but not the fastest bandwidth, to all communication within a cluster and use a higher-capacity network for connecting the cluster to other systems.

Rank can also be specified for interface instances within a PM section. For example, consider a customized hpc.conf entry like this:

# RSM Settings
# NAME     RANK  AVAIL
Begin PM=rsm
wrsm       15    1
wrsm0      10    1
wrsm1      20    1
wrsm2      30    1
wrsm3      40    1
End PM=rsm

If controllers wrsm0 and wrsm1 could be used to establish connections to the same process, wrsm0 would always be chosen, since it has the lower ranking number.



Note - If multiple interfaces have the same ranking, the rsm PM will use the MPI options rsm_links and rsm_maxstripe to reduce the number of interfaces and use the remaining interface(s) for the connection. If more than one interface is resolved for making the connection, the rsm PM will stripe the connection, using all interfaces resolved.



AVAIL Column

The rsm PM section contains an additional column headed AVAIL. This value indicates whether a controller is (1) or is not (0) available for RSM communication.

Configuring Out Controllers

The AVAIL column can be used to configure out one or more controllers, perhaps for maintenance on the network. If all controllers on the network are going to be unavailable, the administrator can change the availability of the wildcard entry to 0. If only certain controllers will be unavailable, the administrator can leave the wildcard entry set to 1 but add entries set to 0 availability for named instances. (Remember to stop the Sun CRE daemons and the RSM daemon hpc_rsmd when editing the hpc.conf file and restart them afterward.)

# RSM Settings
# NAME     RANK  AVAIL
Begin PM=rsm
wrsm       15    1
wrsm2      20    0
wrsm3      15    0
End PM=rsm



Note - Alternatively, the administrator can use the option rsm_links in the MPIOptions section to configure out one or more controllers. See Configuring Out Network Controllers.



Enabling Software Striping

Another use for the AVAIL column is to enable software-controlled striping of messages over network controllers. When a message is submitted for transmission over the network, the rsm PM distributes the message over as many network interfaces as are available with the same preference ranking, up to the limit of 8 links.



Note - Software-controlled message striping is most useful for interconnect technology that does not support hardware-controlled striping. The hardware striping performed by some interconnects is generally preferable to the software-controlled striping described here.



In striped communication, a message is split into smaller packets and transmitted in two or more parallel streams over a set of network controllers that have been logically combined into a stripe-group.

The AVAIL column allows the administrator to include individual network interfaces in a (software) stripe-group pool. Members of this pool are available to be included in logical stripe groups as long as they have the same preference ranking. These stripe groups are formed on an as-needed basis, selecting interfaces from this stripe-group pool.

To include an interface in a stripe-group pool, set its AVAIL value to 1. To exclude an interface from the pool, specify 0.

Stripe-group membership is optional so you can reserve some network bandwidth for non-striped use (assuming the network has another PM enabled). To do so, simply set AVAIL to 0 on the network interface(s) you wish to reserve in this way.
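
As a sketch (the controller names, ranks, and availability values are illustrative), the following rsm section places wrsm0 and wrsm1 in the stripe-group pool at the same rank while excluding wrsm2, reserving it for non-striped use through another PM:

# NAME     RANK  AVAIL
Begin PM=rsm
wrsm0      10    1
wrsm1      10    1
wrsm2      10    0
End PM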

TCP-IP PM Section

The PM section provided for the tcp PM in the template hpc.conf file contains the standard NAME and RANK columns, along with several placeholder columns that are not used at this time. The default TCP settings (and placeholders) are shown in CODE EXAMPLE 6-5.

CODE EXAMPLE 6-5 PM=tcp Section Example
# TCP settings
# NAME   RANK  MTU    STRIPE  LATENCY  BANDWIDTH
Begin PM=tcp
midn     0     16384  0       20       150
idn1     0     16384  0       20       150
wrsmd    25    32768  0       20       150
mscid    30    32768  0       20       150
scid     40    32768  0       20       150
mba      50    8192   0       20       150
ba       60    8192   0       20       150
mfa      70    8192   0       20       150
fa       80    8192   0       20       150
macip    90    8192   0       20       150
acip     100   8192   0       20       150
manfc    110   16384  0       20       150
anfc     120   16384  0       20       150
mbf      130   4096   0       20       150
bf       140   4096   0       20       150
mbe      150   4096   0       20       150
be       160   4096   0       20       150
mqfe     163   4096   0       20       150
qfe      167   4096   0       20       150
mhme     170   4096   0       20       150
hme      180   4096   0       20       150
mle      190   4096   0       20       150
le       200   4096   0       20       150
msmc     210   4096   0       20       150
smc      220   4096   0       20       150
lo       230   4096   0       20       150
End PM

The template hpc.conf file identifies the network interfaces that are included in the TCP PM section. The networks with the prefix "m" are for Enterprise 10000 alternate pathing support, and should be used in preference to the underlying interface (thus their lower ranking).



Note - Inclusion of any network interface in this file does not imply that Sun Microsystems supports, or intends to support, that network.




Propagating hpc.conf Information

Whenever hpc.conf is changed, the Sun CRE database must be updated with the new information. After all required changes to hpc.conf have been made, restart the Sun CRE daemons on all cluster nodes. For example, to start the daemons on cluster nodes node1 and node2 from a central host, enter

# ./ctstartd -n node1,node2 -r connection_method

where connection_method is rsh, ssh, or telnet. Or, you can specify a nodelist file instead of listing the nodes on a command line.

# ./ctstartd -N /tmp/nodelist -r connection_method

where /tmp/nodelist is the absolute path to a file containing the names of the cluster nodes, with each name on a separate line.
where /tmp/nodelist is absolute path to a file containing the names of the cluster nodes, with each name on a separate line.