CHAPTER 4

Cluster Configuration Notes

This chapter examines various issues that may have some bearing on choices you make when configuring your Sun HPC cluster. The discussion is organized into three general topic areas:

  • Nodes
  • Interconnects (including notes on RSM setup)
  • Close integration with batch processing systems


Nodes

Configuring a Sun SMP or cluster of SMPs to use Sun HPC ClusterTools software involves many of the same choices seen when configuring general-purpose compute servers. Common issues include the number of CPUs per machine, the amount of installed memory, and the amount of disk space reserved for swapping.

Because the characteristics of the particular applications to be run on any given Sun HPC cluster have such a large effect on the optimal settings of these parameters, the following discussion is necessarily general in nature.

Number of CPUs

Since Sun MPI programs can run efficiently on a single SMP, it can be advantageous to have at least as many CPUs as there are processes used by the applications running on the cluster. This is not a necessary condition since Sun MPI applications can run across multiple nodes in a cluster, but for applications with very large interprocess communication requirements, running on a single SMP may result in significant performance gains.

Memory

Generally, the amount of installed memory should be proportional to the number of CPUs in the cluster, although the exact amount depends significantly on the particulars of the target application mix.

Computationally intensive Sun HPC ClusterTools software applications that process data with some amount of locality of access often benefit from larger external caches on their processor modules. Large cache capacity allows data to be kept closer to the processor for longer periods of time.

Swap Space

Because Sun HPC ClusterTools software applications are, on average, larger than those typically run on compute servers, the swap space allocated to Sun HPC clusters should be correspondingly larger. The amount of swap should be proportional to the number of CPUs and to the amount of installed memory. Additional swap should also be configured to act as backing store for the shared memory areas that Sun HPC ClusterTools software uses for communication between processes running on the same node.

Sun MPI jobs require large amounts of swap space for shared memory files. The sizes of shared memory files scale in stepwise fashion, rather than linearly. For example, a two-process job (with both processes running within the same SMP) requires shared memory files of approximately 35 Mbytes. A 16-process job (all processes running within the same SMP) requires shared memory files of approximately 85 Mbytes. A 256-process job (all processes running within the same SMP) requires shared memory files of approximately 210 Mbytes.
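
As a rough, hypothetical sizing example based on the figures above: a node that is expected to run two concurrent 16-process Sun MPI jobs would need approximately 2 x 85 Mbytes, or about 170 Mbytes, of swap for shared memory files alone, in addition to the swap configured for installed memory and for the applications' own data. The exact requirements depend on your application mix.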


Interconnects

One of the most fundamental issues to be addressed when configuring a cluster is the question of how to connect the nodes of the cluster. In particular, both the type and the number of networks should be chosen to complement the way in which the cluster is most likely to be used.



Note - For the purposes of this discussion, the term default network refers to the network associated with the standard host name. The term parallel application network refers to an optional second network, operating under the control of Sun CRE.



In a broad sense, a Sun HPC cluster can be viewed as a standard LAN. Operations performed on nodes of the cluster generate the same types of network traffic seen on any LAN. For example, running an executable and accessing directories and files generate NFS traffic, while remote login sessions generate their own interactive traffic. This kind of network traffic is referred to here as administrative traffic.

Administrative traffic has the potential to tax cluster resources. This can result in significant performance losses for some Sun MPI applications, unless these resources are somehow protected from this traffic. Fortunately, Sun CRE provides enough configuration flexibility to allow you to avoid many of these problems.

The following sections discuss some of the factors that you should consider when building a cluster for Sun HPC ClusterTools software applications.

Sun HPC ClusterTools Internode Communication

Several Sun HPC ClusterTools software components generate internode communication. It is important to understand the nature of this communication in order to make informed decisions about network configurations.

Administrative Traffic

As mentioned earlier, a Sun HPC cluster generates the same kind of network traffic as any UNIX-based LAN. Common operations like starting a program can have a significant network impact. The impact of such administrative traffic should be considered when making network configuration decisions.

When a simple serial program is run within a LAN, network traffic typically occurs as the executable is read from an NFS-mounted disk and paged into a single node's memory. In contrast, when a 16- or 32-process parallel program is invoked, the NFS server is likely to experience approximately simultaneous demands from multiple nodes--each pulling pages of the executable into its own memory. Such requests can often result in large amounts of network traffic. How much traffic occurs depends on various factors, such as the number of processes in the parallel job, the size of the executable, and so forth.

Sun CRE-Generated Traffic

Sun CRE uses the cluster's default network interconnect to perform communication between the daemons that perform resource management functions. Sun CRE makes heavy use of this network when Sun MPI jobs are started, with the load being roughly proportional to the number of processes in the parallel jobs. This load is in addition to the start-up load described in the previous section. Sun CRE will generate a similar load during job termination as the Sun CRE database is updated to reflect the expired MPI job.

There is also a small amount of steady traffic generated on this network as Sun CRE continually updates its view of the resources on each cluster node and monitors the status of its components to guard against failures.

Sun MPI Interprocess Traffic

Parallel programs use Sun MPI to move data between processes as the program runs. If the running program is spread across multiple cluster nodes, then the program generates network traffic.

Sun MPI uses the network that Sun CRE instructs it to use, which can be set by the system administrator. In general, Sun CRE instructs Sun MPI to use the fastest network available so that message-passing programs obtain the best possible performance.

If the cluster has only one network, then message-passing traffic shares bandwidth with administrative and Sun CRE traffic. This results in performance degradation for all types of traffic, especially if one of the applications is performing significant amounts of data transfer, as message-passing applications often do. To decide whether the amount and frequency of application-generated traffic warrants a second, dedicated parallel application network, you should understand the communication requirements of the types of applications to be run on the Sun HPC cluster. In general, a second network significantly improves overall performance.

Prism Traffic

The Prism graphical programming environment is used to tune, debug, profile, and visualize data from Sun MPI programs running within the cluster. As the Prism program itself is a parallel program, starting it generates the same sort of Sun CRE traffic that invocation of other applications generates.

Once the Prism environment has been started, two kinds of network traffic are generated during a debugging session. The first, which has been covered in preceding sections, is traffic created by running the Sun MPI code that is being debugged. The second kind of traffic is generated by the Prism program itself and is routed over the default network along with all other administrative traffic. In general, the amount of traffic generated by the Prism program itself is small, although viewing performance analysis data on large programs and visualizing large data arrays can cause transiently heavy use of the default network.

Parallel I/O Traffic

Sun MPI programs can make use of the parallel I/O capabilities of Sun HPC ClusterTools software, but not all such programs do so. To assess the ramifications for network load, you need to understand how the distributed multiprocess applications run on the Sun HPC cluster will use parallel I/O.

Applications can use parallel I/O to read from and write to standard UNIX file systems, generating NFS traffic on the default network, on the network being used by the Sun MPI component, or some combination of the two. The type of traffic that is generated depends on the type of I/O operations being used by the applications. Collective I/O operations generate traffic on the Sun MPI network, while most other types of I/O operations involve only the default network.

Network Characteristics

Bandwidth, latency, and performance under load are all important network characteristics to consider when choosing interconnects for a Sun HPC cluster. These are discussed in this section.

Bandwidth

Bandwidth should be matched to expected load as closely as possible. If the intended message-passing applications have only modest communication requirements and no significant parallel I/O requirements, then a fast, expensive interconnect may be unnecessary. On the other hand, many parallel applications benefit from large pipes (high-bandwidth interconnects). Clusters that are likely to handle such applications should use interconnects with sufficient bandwidth to avoid communication bottlenecks. Significant use of parallel I/O would also increase the importance of having a high-bandwidth interconnect.

It is also a good practice to use a high-bandwidth network to connect large nodes (nodes with many CPUs) so that communication capabilities are in balance with computational capabilities.

An example of a low-bandwidth interconnect is the 10-Mbit/s Ethernet. Examples of higher-bandwidth interconnects include ATM and switched FastEthernet.

Latency

The latency of the network is the sum of all delays a message encounters from its point of departure to its point of arrival. The significance of a network's latency varies according to the communication patterns of the application.

Low latency is particularly important when the message traffic consists mostly of small messages--in such cases, latency accounts for a large proportion of the total time spent transmitting messages. For larger messages, latency accounts for a smaller share of the total transfer time, so a network with higher latency can still be used efficiently.

Parallel I/O operations are less vulnerable to latency delays than small-message traffic because the messages transferred by parallel I/O operations tend to be large (often 32 Kbytes or larger).

Performance Under Load

Generally speaking, better performance is provided by switched network interconnects, such as ATM and Fibre Channel. Network interconnects with collision-based semantics should be avoided in situations where performance under load is important. Unswitched 10-Mbit/s and 100-Mbit/s Ethernet are the two most common examples of this type of network. While 10-Mbit/s Ethernet is almost certainly not adequate for any Sun HPC ClusterTools software application, a switched version of 100-Mbit/s Ethernet may be sufficient for some applications.

Notes on RSM Setup

The hpc_rsmd daemon, which is thread-safe, receives requests from client processes by means of door calls. While running, the hpc_rsmd creates and destroys RSM backing store using System V shared memory. The directory /var/hpc/rsm is reserved specifically for hpc_rsmd usage and should not have any user files stored in it. This directory will contain the hpc_rsmd_door file, which is used by RSM protocol modules (PMs) to contact the daemon.

The hpc_rsmd daemon has a polling thread that it uses to determine the state of known RSM paths and to determine whether unallocated segments should be reaped. This polling thread runs once every 60 seconds.

The RSM PM is activated only by a call into the MPI library made by an application. It has an MT-safe version to be used by the libmpi_mt.so library. Its memory usage is dynamic, dependent on the number of processes it has set up to communicate with and the memory it has exported for use in the communications.

When Sun HPC ClusterTools software is installed, certain system parameters in
/etc/system are set to enable RSM communication. You might wish to change these values to suit other requirements of your site. The default settings established at installation time are the following:

set shmsys:shminfo_shmseg=0x800
set shmsys:shminfo_shmmni=0x1000
set shmsys:shminfo_shmmax=0xffffffffffffffff

If you decrease any of these values, then the system might not be able to run MPI jobs using RSM. Also, you may need to increase these values beyond the defaults if other applications that use System V shared memory will be running in tandem with MPI jobs.
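
A quick way to confirm that the installation defaults are still in place is to inspect /etc/system directly. The following is a minimal sketch using standard Solaris commands; note that changes to /etc/system take effect only after the node is rebooted.

% grep shmsys /etc/system
set shmsys:shminfo_shmseg=0x800
set shmsys:shminfo_shmmni=0x1000
set shmsys:shminfo_shmmax=0xffffffffffffffff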


Close Integration With Batch Processing Systems

As mentioned in Chapter 1, Sun CRE provides close integration with several distributed resource managers. In that integration, Sun CRE retains most of its original functions, but delegates others to the resource manager. This section describes how close integration works, how it is used, and how to enable it for each supported resource manager.

How Close Integration Works

The integration process is similar for all three, with some individual differences. The batch system, whether SGE, LSF, or PBS, launches the job through a script. The script calls mprun, and passes it a host file of the resources that have been allocated for the job, plus the job ID assigned by the batch system.

The Sun CRE environment continues to perform most of its normal parallel-processing actions, but its child processes do not fork any executable programs. Instead, each child process identifies a communications channel (specifically, a listen query socket) through which it can be monitored by the Sun CRE environment while running in the batch system.

[Figure: Communications between the distributed resource manager and Sun CRE]

Before releasing its child processes to the batch system, the Sun CRE environment wraps each one in a wrapper program that identifies its listen query socket. The wrapper keeps the executable programs fully connected to the Sun CRE environment while they are running, and it enables the Sun CRE environment to clean up when a socket closes because a program has stopped running.

A user can also invoke a similar process interactively, without a script. Instructions for script-based and interactive job launching are provided in the Sun HPC ClusterTools Software User's Guide.

How Close Integration Is Used

The exact instructions vary from one resource manager to another, and are affected by Sun CRE's configuration, but they all follow these general guidelines (a sketch of a submission script appears after the list):

1. The job can be launched either interactively or through a script. Instructions for both are provided in the Sun HPC ClusterTools Software User's Guide and the following manpages:

    • lsf_cre(1)
    • pbs_cre(1)
    • sge_cre(1)

2. Users enter the batch processing environment before launching jobs with mprun.

3. Users reserve resources for the parallel job and set other job control parameters from within their resource manager.

4. Users invoke the mprun command with the applicable resource manager flags. Those flags are described in the Sun HPC ClusterTools Software User's Guide and the mprun(1) manpage.
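
To make these guidelines concrete, the following is a minimal sketch of an SGE submission script. The parallel environment name cre, the slot count, and the executable a.out are illustrative assumptions; the exact mprun flags required by your configuration are described in the Sun HPC ClusterTools Software User's Guide and the sge_cre(1) manpage.

#!/bin/sh
#$ -pe cre 4                   # request 4 slots from the cre parallel environment (assumed name)
mprun -x sge -np 4 ./a.out     # -x selects the resource manager; omit it if default_rm is set to sge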

Here is a diagram that summarizes the user interaction:

[Figure: Summary of the user interaction described in the steps above]

Instructions for Enabling Close Integration

To enable close integration with any resource manager, you must:

  • Make sure that both the Sun CRE environment and the resource manager are installed on the same cluster.
  • Set up the hpc.conf file to manage the mprun command in a batch environment.
  • Set up the sunhpc.allow configuration file, if required.
  • Configure each resource manager to work with Sun CRE.

Instructions are provided in the sections that follow.


procedure icon  How to Configure the hpc.conf File

Resource managers affect only the CREOptions section of the hpc.conf file. Here is an example of that section:

Begin CREOptions
enable_core           on
corefile_name         directory and file name of core file
syslog_facility       daemon
auth_opt              sunhpc_rhosts
default_rm            cre
allow_mprun           *
End CREOptions

Only two fields affect integration with resource managers: the default_rm field and the allow_mprun field.

1. To select a default resource manager, change the default_rm field.

By selecting a default resource manager, you can save users the trouble of entering the -x flag each time they invoke the mprun command (as described in the Sun HPC ClusterTools Software User's Guide). Simply change the value of the field to one of these:

default_rm lsf
default_rm pbs
default_rm sge

2. To restrict mprun's ability to run programs in the batch processing environment, add or change the allow_mprun field.

By default, the value of that field is:

allow_mprun *

To place no restrictions on mprun, either set the value to * or remove the field.

To place restrictions on mprun, follow these steps (an illustrative example appears after step b):

a. Create the sunhpc.allow file.

The sunhpc.allow file specifies the restrictions placed on mprun by each resource manager. See How to Configure the sunhpc.allow File.

b. Change the value of the allow_mprun field.

Add the names of the resource managers that do not place restrictions on mprun.
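
For illustration only, and assuming that the allow_mprun field accepts resource manager names in a simple space-separated list, an entry that exempts SGE jobs from the sunhpc.allow restrictions might look like this:

allow_mprun sge

With a setting like this, programs launched by mprun under SGE would be unrestricted, while programs launched under the other resource managers would be limited to the patterns listed in sunhpc.allow.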


procedure icon  How to Configure the sunhpc.allow File

The sunhpc.allow file specifies the restrictions that a batch system will place on mprun's ability to call other programs while running in a batch processing environment.

By default, sunhpc.allow is ignored. It is only examined when the hpc.conf file asks for restrictions (see How to Configure the hpc.conf File).

The sunhpc.allow file is a text file with one filename pattern per line. Each filename pattern specifies the directories whose programs mprun is allowed to launch. For example:

/opt/SUNWhpc/*
/opt/SUNWhpc/bin/*

The example above uses an asterisk to denote all subdirectories or files. For a list of valid pattern-matching symbols, see fnmatch(1).

Before launching a program, mprun examines the sunhpc.allow file and tries to match the program's absolute pathname to the patterns in the file. If the program's absolute pathname matches one of the patterns, the program is launched. If it does not match, the program is not launched, and an error message is displayed to the user, listing the absolute path that had no match in the file.

The sunhpc.allow file must be located on the master node and owned by root. It is read only when the master node daemons start. If you add new entries to it, you must restart the master node daemons.


procedure icon  How to Configure LSF For Close Integration

The requirements for close integration with LSF that applied in previous releases are no longer necessary. However, to maintain compatibility with Sun HPC ClusterTools 4 software, you must modify three LSB queues so that they use the sunhpc utility as the job starter instead of the pam -t utility. The queues are:

  • hpc queue
  • hpc-batch queue
  • hpc-pfs queue

Here is how their definitions in lsb.queues should look:

CODE EXAMPLE 4-1 hpc Queue
Begin Queue
    QUEUE_NAME             = hpc  # Added by SunHPC RTE pkg
    PRIORITY               = 45
    NICE                   = 0
    PREEMPTION             = PREEMPTIVE
    NEW_JOB_SCHEDULE_DELAY = 0
    INTERACTIVE            = ONLY
    JOB_STARTER            = /opt/SUNWhpc/lib/sunhpc
    DESCRIPTION            = Sun HPC ClusterTools software
                             interactive queue (uses sunhpc
                             as a job starter)
End Queue

CODE EXAMPLE 4-2 hpc-batch Queue
Begin Queue
    QUEUE_NAME             = hpc-batch  #Added by SunHPC RTE pkg
    PRIORITY               = 43
    NICE                   = 10 
    PREEMPTION             = PREEMPTIVE
    JOB_ACCEPT_INTERVAL    = 1
    INTERACTIVE            = NO
    JOB_STARTER            = /opt/SUNWhpc/lib/sunhpc
    DESCRIPTION            = Sun HPC ClusterTools software batch
                             queue (uses sunhpc as a job starter)
End Queue

CODE EXAMPLE 4-3 hpc-pfs Queue
Begin Queue
    QUEUE_NAME             = hpc-pfs #Added by SunHPC RTE pkg
    PRIORITY               = -10 
    USERS                  = root
    PREEMPTION             = PREEMPTIVE
    INTERACTIVE            = NO
    JOB_STARTER            = /opt/SUNWhpc/lib/sunhpc
    DESCRIPTION            = Sun HPC ClusterTools queue for 
                             pfs daemons only
End Queue
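
After editing lsb.queues, LSF does not pick up the new queue definitions automatically. In a standard LSF installation, you would typically have the batch daemons reread their configuration files, for example:

% badmin reconfig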


procedure icon  How to Configure PBS For Close Integration

To enable close integration with PBS, follow these procedures.

1. Verify that these directories were created (or linked to) on each node in the cluster.

Sun CRE integration software assumes the existence of these directories.

You can install PBS software in these directories or create links from your installation to these directories.

/opt/OpenPBS # binaries 
/usr/spool/PBS # configuration files

2. Ensure that a nodes file exists on the master server.

The nodes file must contain a list of entries for each node and the number of processors in each node. For example,

hpc-u2-8 np=2  # first node, 2 processors
hpc-u2-9 np=2  # second node, 2 processors

Name the file /usr/spool/PBS/server_priv/nodes.

To allow the nodes to be overloaded (oversubscribed), set np to a larger value. For example:

hpc-u2-8 np=5000  
hpc-u2-9 np=5000 
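
After creating or modifying the nodes file, restart the PBS server so that it rereads the file. In a standard OpenPBS installation, you can then confirm that the nodes are known to the server with:

% pbsnodes -a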


procedure icon  How to Configure SGE For Close Integration

The main task is to make sure that each Sun CRE node is configured as an SGE execution host. To do this, define a parallel environment for each queue in the SGE cluster that will be used as a Sun CRE node. Follow these steps.

1. Create a configuration file with the following contents.

TABLE 4-1 SGE Configuration
pe_name               cre
user_lists            NONE
xuser_lists           NONE
start_proc_args       /bin/true
stop_proc_args        /bin/true
allocation_rule       $round_robin
control_slaves        TRUE
job_is_first_task     FALSE
queue_list            <sge-queue sge-queue ...>
slots                 <sum of all queue slots>

In user_lists, the NONE value enables all users and excludes no users.

The control_slaves value must be TRUE, or SGE will prohibit parallel jobs, causing qrsh to exit with an error message that does not indicate the cause.

The job_is_first_task value must be FALSE, or mprun will count as one of the slots. If that happens, only n-1 processes will start, and the job will fail.

2. Initialize the configuration file.

Use the qconf command. For example, if you named the configuration file pe.q, enter:

% qconf -Ap pe.q

3. Execute the following command for each queue that you named in the queue_list field of the configuration file.

% qconf -mqattr qtype "BATCH INTERACTIVE PARALLEL" queue-name
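
To confirm that the parallel environment was created and attached correctly, you can list and display it with qconf. The following assumes the parallel environment is named cre, as in TABLE 4-1:

% qconf -spl          # list all parallel environments
% qconf -sp cre       # display the cre parallel environment definition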