CHAPTER 3

Overview of Administration Controls

The Sun HPC cluster's default configuration supports execution of MPI applications. In other words, if you have started the Sun CRE daemons on your cluster and created a default partition, as described in Chapter 2, users can begin executing MPI jobs. You may, however, want to customize the cluster's configuration to the specific requirements of your site.

This chapter provides a brief overview of the features that control a cluster's configuration and behavior. These are:

  • The Sun CRE daemons
  • The RSM daemon, hpc_rsmd
  • The mpadmin administration interface
  • The cluster configuration file, hpc.conf
  • Authentication and security

The Sun CRE Daemons

Sun CRE comprises three master daemons and two nodal daemons:

  • Master daemons: tm.rdb, tm.mpmd, and tm.watchd
  • Nodal daemons: tm.omd and tm.spmd

A related daemon, the spin daemon tm.spind, also runs on the compute nodes. This section presents brief descriptions of these daemons. For complete information on the daemons, see their respective man pages.

Master Daemon tm.rdb

tm.rdb is the resource database daemon. It runs on the master node and implements the resource database used by the other parts of Sun CRE. This database represents the state of the cluster and the jobs running on it.

If you make changes to the cluster configuration, for example, if you add a node to a partition, you must restart the tm.rdb daemon to update the Sun CRE resource database to reflect the new condition.

Master Daemon tm.mpmd

tm.mpmd is the master process-management daemon. It runs on the master node and services user (client) requests made via the mprun command. It also interacts with the resource database via calls to tm.rdb and coordinates the operations of the nodal client daemons.

Master Daemon tm.watchd

tm.watchd is the cluster watcher daemon. It runs on the master node and monitors the states of cluster resources and jobs and, as necessary:

Nodal Daemon tm.omd

tm.omd is the object-monitoring daemon. It runs on all the nodes in the cluster, including the master node, and continually updates the Sun CRE resource database with dynamic information concerning the nodes, most notably their load. It also initializes the database with static information about the nodes, such as their host names and network interfaces, when Sun CRE starts up.

The environment variable SUNHPC_CONFIG_DIR specifies the directory in which the Sun CRE resource database files are to be stored. The default is /var/hpc.
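
For example, a minimal sketch (Bourne shell syntax; the /export/hpc_db path is only an illustration) of relocating the database directory before starting Sun CRE:

# SUNHPC_CONFIG_DIR=/export/hpc_db; export SUNHPC_CONFIG_DIR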

Nodal Daemon tm.spmd

tm.spmd is the slave process-management daemon. It runs on all the compute nodes of the cluster and, as necessary:

Spin Daemon tm.spind

tm.spind is the spin daemon. It runs on all the compute nodes of the cluster.

This daemon enables certain processes of a given MPI job on a shared-memory system to be scheduled at approximately the same time as other related processes. This co-scheduling reduces the load on the processors, thus reducing the effect that MPI jobs have on each other.


RSM Daemon

The hpc_rsmd daemon provides access services to Remote Shared Memory (RSM) resources and manages RSM communications paths on behalf of MPI jobs. If your cluster is not RSM-enabled, you can ignore this section.

An instance of hpc_rsmd runs on each RSM-enabled node of a cluster and:

hpc_rsmd is started when the cluster is booted. After boot, the daemon can be stopped and restarted manually by means of the following script (executed by superuser). This script can also be used to clean up files and System V shared memory segments left behind when an hpc_rsmd instance exits abnormally.

# /etc/init.d/sunhpc.hpc_rsmd  [ start | stop | clean ]
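
For example, to recover after an abnormal hpc_rsmd exit, you might run the script's three modes in sequence:

# /etc/init.d/sunhpc.hpc_rsmd stop
# /etc/init.d/sunhpc.hpc_rsmd clean
# /etc/init.d/sunhpc.hpc_rsmd start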


mpadmin: Administration Interface

Sun CRE provides an interactive command interface, mpadmin, which you can use to administer your Sun HPC cluster. It can only be invoked by the superuser.

This section introduces mpadmin and shows how to use it to perform several administrative tasks:

  • Listing the names of the nodes in the cluster
  • Enabling nodes
  • Creating and enabling partitions
  • Customizing cluster attributes
  • Quitting mpadmin

mpadmin offers many more capabilities than are described in this section. See Chapter 6 for a more comprehensive description of mpadmin.

Introduction to mpadmin

The mpadmin command has the following syntax:

# mpadmin [-c command] [-f filename] [-h] [-q] [-s cluster_name] [-V]

When you invoke mpadmin with no options, it goes into interactive mode, displaying an mpadmin prompt. It also goes into interactive mode when invoked with the options -f, -q, or -s. In this mode, you can execute any number of mpadmin subcommands to perform operations on the cluster or on nodes or partitions.

When you invoke mpadmin with the -c, -h, or -V options, it performs the requested operation and returns to the shell level.

The mpadmin command-line options are summarized in TABLE 3-1.

TABLE 3-1 mpadmin Options

Option             Description
-c command         Execute the single specified command.
-f file-name       Take input from the specified file.
-h                 Display help/usage text.
-q                 Suppress the warning message displayed when a non-root
                   user attempts to use the restricted command mode.
-s cluster-name    Connect to the specified Sun HPC cluster.
-V                 Display mpadmin version information.


Commonly Used mpadmin Options

This section describes the mpadmin options -c, -f, and -s.

-c command - Execute a Single Command

Use the -c option when you want to execute a single mpadmin command and return upon completion to the shell prompt. For example, the following use of mpadmin -c changes the location of the Sun CRE log file to /home/wmitty/cre_messages:

# mpadmin -c set logfile="/home/wmitty/cre_messages"

Most commands that are available via the interactive interface can be invoked via the -c option. See Chapter 6 for a description of the mpadmin command set and a list of which commands can be used as arguments to the -c option.

-f file-name - Take Input From a File

Use the -f option to supply input to mpadmin from the file specified by the file-name argument. The source file is expected to consist of one or more mpadmin commands, one command per line.
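
For example, a command file (the path /tmp/mpadmin_cmds is hypothetical) that creates and enables a two-node partition might contain:

partition
create part2
set nodes=node2 node3
set enabled

You would then execute it with:

# mpadmin -f /tmp/mpadmin_cmds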

This option can be particularly useful in the following ways:

-s cluster-name - Connect to Specified Cluster

Use the -s option to connect to the cluster specified by the cluster-name argument. A cluster's name is the host name of the cluster's master node.
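
For example, to administer the cluster whose master node is node0 from another host:

# mpadmin -s node0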

The mpadmin commands apply to a certain cluster, determined as follows:

Understanding Objects, Attributes, and Contexts

To use mpadmin, you need to understand the concepts of object, attribute, and context as they apply to mpadmin.

Objects and Attributes

From the perspective of mpadmin, a Sun HPC cluster consists of a system of objects, which include:

  • The cluster itself
  • The nodes in the cluster
  • The partitions defined on the cluster

Each type of object has a set of attributes, which control various aspects of their respective objects. For example, a node's enabled attribute can be set (the node is available to run MPI jobs) or unset (it is not).

Some attribute values can be operated on via mpadmin commands.



Note - Sun CRE sets many attributes in a cluster to default values each time it starts up. You should not change attribute values, except for the attribute modifications described here and in Chapter 6.



Contexts

mpadmin commands are organized into three contexts, which correspond to the three types of mpadmin objects:

  • Cluster - operations on the cluster as a whole
  • Node - operations on individual nodes
  • Partition - operations on partitions

These contexts are illustrated in FIGURE 3-1.

 FIGURE 3-1 mpadmin Contexts


mpadmin Prompts

In interactive mode, the mpadmin prompt contains one or more fields that indicate the current context. TABLE 3-2 shows the prompt format for each of the possible mpadmin contexts.

TABLE 3-2 mpadmin Prompt Formats

Prompt Format                        Context
[cluster-name]::                     Current context = Cluster
[cluster-name] Node::                Current context = Node, but not a specific node
[cluster-name] N[node-name]::        Current context = a specific node
[cluster-name] Partition::           Current context = Partition, but not a specific partition
[cluster-name] P[partition-name]::   Current context = a specific partition


Performing Sample mpadmin Tasks

To introduce the use of mpadmin, this section steps through some common tasks the administrator may want to perform. These tasks are:

  • Listing the names of the nodes in the cluster
  • Enabling nodes
  • Creating and enabling partitions
  • Customizing cluster attributes
  • Quitting mpadmin

List Names of Nodes

mpadmin provides various ways to display information about the cluster and many kinds of information that can be displayed. However, the first information you are likely to need is a list of the nodes in your cluster.

Use the list command in the Node context to display this list. In the following example, list is executed on node1 in a four-node cluster.

node1# mpadmin
[node0]:: node
[node0] Node:: list
    node0
    node1
    node2
    node3
[node0] Node::

The mpadmin command starts up an mpadmin interactive session in the Cluster context. This is indicated by the [node0]:: prompt, which contains the cluster name, node0, and no other context information.



Note - A cluster's name is assigned by Sun CRE and is always the name of the cluster's master node.



The node command on the example's second line makes Node the current context. The list command displays a list of all the nodes in the cluster.

Once you have this list of nodes, you have the information you need to enable the nodes and to create a partition. However, before moving on to those steps, you might want to try listing information from within the cluster context or the partition context. In either case, you would follow the same general procedure as for listing nodes.

If this is a newly installed cluster and you have not already run the part_initialize script (as described in "Create a Default Partition" on page 8), the cluster contains no partitions. If, however, you did run part_initialize and have thereby created the partition all, you might want to perform the following test.

node1# mpadmin
[node0]:: partition
[node0] Partition:: list
    all
[node0] Partition::

To see what nodes are in partition all, make all the current context and execute the list command. The following example illustrates this; it begins in the Partition context (where the previous example ended).

[node0] Partition:: all
[node0] P[all]:: list
    node0
    node1
    node2
    node3
[node0] P[all]::

Enabling Nodes

A node must be in the enabled state before MPI jobs can run on it.

Note that enabling a partition automatically enables all its member nodes, as described in the next section.

To enable a node manually, make that node the current context and set its enabled attribute. Repeat for each node that you want to be available for running MPI jobs.

The following example illustrates this, using the same four-node cluster used in the previous examples.

node1# mpadmin
[node0]:: node0
[node0] N[node0]:: set enabled
[node0] N[node0]:: node1
[node0] N[node1]:: set enabled
[node0] N[node1]:: node2
[node0] N[node2]:: set enabled
[node0] N[node2]:: node3
[node0] N[node3]:: set enabled
[node0] N[node3]::

Note the use of a shortcut to move directly from the Cluster context to the node0 context without first going to the general Node context. You can explicitly name a particular object as the target context in this way so long as the name of the object is unambiguous--that is, it is not the same as an mpadmin command.

mpadmin accepts multiple commands on the same line. The previous example could be expressed more succinctly as:

node1# mpadmin
[node0]:: node0 set enabled node1 set enabled node2 set enabled node3 set enabled
[node0] N[node3]::

To disable a node, use the unset command in place of the set command.
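
For example, to take node3 out of service from within its context:

[node0] N[node3]:: unset enabled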

Creating and Enabling Partitions

You must create at least one partition and enable it before you can run MPI programs on your Sun HPC cluster. Even if your cluster already has the default partition all in its database, you will probably want to create other partitions with different node configurations to handle particular job requirements.

There are three essential steps involved in creating and enabling a partition, all of which are shown in the examples below:

  • Create the partition and give it a name
  • Set the partition's nodes attribute to its list of member nodes
  • Set the partition's enabled attribute

Once a partition is created and enabled, you can run serial or parallel jobs on it. A serial program runs on a single node of the partition. Parallel programs are distributed to as many nodes as Sun CRE determines appropriate for the job.
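
For example, once the partition part0 created below is enabled, a user might launch a four-process job on it as follows; the mprun options shown (-p to select the partition, -np to set the number of processes) and the program name a.out should be checked against the mprun man page:

% mprun -p part0 -np 4 a.out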



Note - There are no restrictions on the number or size of partitions, so long as no node is a member of more than one enabled partition.



Example: Creating a Two-Node Partition

The following example creates and enables a two-node partition named part0. It then lists the member nodes to verify the success of the creation.

node1# mpadmin
[node0]:: partition
[node0] Partition:: create part0
[node0] P[part0]:: set nodes=node0 node1
[node0] P[part0]:: set enabled
[node0] P[part0]:: list
    node0
    node1
[node0] P[part0]::

Example: Two Partitions Sharing a Node

The next example shows a second partition, part1, being created. One of its nodes, node1, is also a member of part0.

[node0] P[part0]:: up
[node0] Partition:: create part1
[node0] P[part1]:: set nodes=node1 node2 node3
[node0] P[part1]:: list
    node1
    node2
    node3
[node0] P[part1]::

Because node1 is shared with part0, which is already enabled, part1 is not being enabled at this time. This illustrates the rule that a node can be a member of more than one partition, but only one of those partitions can be enabled at a time.

Note the use of the up command. The up command moves the context up one level, in this case, from the context of a particular partition (that is, from part0) to the general Partition context.

Example: Shared versus Dedicated Partitions

Sun CRE can configure a partition to allow multiple MPI jobs to be running on it concurrently. Such partitions are referred to as shared partitions. Sun CRE can also configure a partition to permit only one MPI job to run at a time. These are called dedicated partitions.

In the following example, the partition part0 is configured to be a dedicated partition and part1 is configured to allow shared use by up to four processes per node.

node1# mpadmin
[node0]:: part0
[node0] P[part0]:: set max_total_procs=1
[node0] P[part0]:: part1
[node0] P[part1]:: set max_total_procs=4
[node0] P[part1]::

The max_total_procs attribute defines how many processes can be active on each node in the partition for which it is being set. In this example, it is set to 1 on part0, which means only one process can be running on each of its nodes at a time. It is set to 4 on part1 to allow up to four processes per node to run on that partition.

Note again that the context-changing shortcut (introduced in "Enabling Nodes" on page 20) is used in the second and fourth lines of this example.

Customizing Cluster Attributes

Two cluster attributes that you may wish to modify are logfile and administrator.

Changing the logfile Attribute

The logfile attribute allows you to log Sun CRE messages in a separate file from all other system messages. For example, if you enter

[node0]:: set logfile=/home/wmitty/cre-messages

Sun CRE will output its messages to the file /home/wmitty/cre-messages. If logfile is not set, Sun CRE messages will be passed to syslog, which will store them with other system messages in /var/adm/messages.
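
To revert to the default syslog behavior, a minimal sketch (assuming the unset command applies to cluster attributes as it does to node attributes):

[node0]:: unset logfile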



Note - A full path name must be specified when setting the logfile attribute.



Changing the administrator Attribute

You can set the administrator attribute to specify, say, the email address of the system administrator. To do this:

[node0]:: set administrator="root@example.com"

Note the use of double quotes.

Quitting mpadmin

Use either the quit or exit command to quit an mpadmin interactive session. Either causes mpadmin to terminate and return control to the shell level.

For example:

[node0]:: quit
node1#


Cluster Configuration File hpc.conf

When Sun CRE starts up, it updates portions of the resource database according to the contents of a configuration file named hpc.conf. This file is organized into functional sections, which are illustrated in TABLE 3-3.

You can change any of these aspects of your cluster's configuration by editing the corresponding parts of the hpc.conf file. Default settings are in effect if you make no changes to the hpc.conf file as provided.

To illustrate the process of customizing the hpc.conf file, this section explains how to:

  • Prepare to edit hpc.conf (stop the Sun CRE daemons and copy the template)
  • Specify MPI options
  • Update the Sun CRE database with the new configuration



Note - The hpc.conf file is provided with the Sun HPC ClusterTools software.





Note - You may never need to make any changes to hpc.conf. If you do wish to make changes beyond those described in this section, see Chapter 6 for a fuller description of this file.



TABLE 3-3 General Organization of the hpc.conf File
# Begin ShmemResource
# ...
# End ShmemResource
 
# Begin MPIOptions Queue=
# ...
# End MPIOptions
 
# Begin CREOptions Server=
# ...
# End CREOptions
 
# Begin HPCNodes
# ...
# End HPCNodes
 
Begin PMODULES
...
End PMODULES
 
Begin PM=shm
...
End PM
 
Begin PM=rsm
...
End PM
 
Begin PM=tcp
... 
End PM

Preparing to Edit hpc.conf

Perform the steps described below to stop the Sun CRE daemons and copy the hpc.conf template.

Stop the Sun CRE Daemons

Stop the Sun CRE daemons on all cluster nodes. For example, to stop the Sun CRE daemons on cluster nodes node1 and node2 from a central host, enter

# ./ctstopd -n node1,node2 -r connection_method

where connection_method is rsh, ssh, or telnet. Or, you can specify a nodelist file instead of listing the nodes on the command line.

# ./ctstopd -N /tmp/nodelist -r connection_method

where /tmp/nodelist is the absolute path to a file containing the names of the cluster nodes, with each name on a separate line. Comments and empty lines are allowed. For example, if the cluster contains the nodes node1 and node2, a nodelist file for the cluster could look like the following:

CODE EXAMPLE 3-1
# Sample Node List
 
node1
node2

Copy the hpc.conf Template

The Sun HPC ClusterTools software distribution includes an hpc.conf template, which is stored, by default, in /opt/SUNWhpc/examples/rte/hpc.conf.template.

Copy the template from its installed location to /opt/SUNWhpc/etc/hpc.conf and edit it as described in the rest of this section.
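
For example, assuming the default installed locations:

# cp /opt/SUNWhpc/examples/rte/hpc.conf.template /opt/SUNWhpc/etc/hpc.conf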

When you have finished editing hpc.conf, you need to update the Sun CRE database with the new configuration information. This step is described in Updating the Sun CRE Database.

Specifying MPI Options

The MPIOptions section provides a set of options that control MPI communication behavior in ways that are likely to affect message-passing performance. It contains a template showing some general-purpose option settings, plus an example of alternative settings for maximizing performance. These examples are shown in TABLE 3-4.

  • General-purpose, multiuser settings - The template in the MPIOptions section is designed for general-purpose use at times when multiple message-passing jobs will be running concurrently.
  • Performance settings - The second example is designed to maximize the performance of message-passing jobs when only one job is allowed to run at a time.


Note - The first line of the template contains the phrase "Queue=hpc." This is because the queue-based LSF workload management run-time environment uses the same hpc.conf file as does Sun CRE. For LSF, the settings apply only to the specified queue. For Sun CRE, the settings apply across the cluster.



The options in the general-purpose template are the same as the default settings for the Sun MPI library. In other words, you do not have to uncomment the general-purpose template to have its option values be in effect. This template is provided in the MPIOptions section so that you can see what options are most beneficial when operating in a multiuser mode.

TABLE 3-4 MPIOptions Section Example
# The following is an example of the options that affect the run
# time environment of the MPI library. The listings below are
# identical to the default settings of the library. The "Queue=hpc"
# phrase makes this an LSF-specific entry, and only for the Queue
# named hpc. These options are a good choice for a multiuser queue.
# To be recognized by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions Queue=hpc
# coscheduling            avail
# pbind                   avail
# spindtimeout             1000
# progressadjust             on
# spin                      off
#
# shm_numpostbox             16
# shm_shortmsgsize          256
# rsm_maxsegsize        1048576
# rsm_numpostbox             15
# rsm_shortmsgsize          401
# rsm_maxstripe               2
# rsm_links             wrsm0,1
# maxprocs_limit     2147483647
# maxprocs_default         4096
#
# End MPIOptions
 
# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a Queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling             off
# spin                      on
# End MPIOptions

If you want to use the performance settings, do the following:

  • Delete the comment character (#) from the beginning of each line of the performance example, including the Begin MPIOptions and End MPIOptions lines.
  • On Sun CRE-based clusters, delete the "Queue=performance" phrase from the Begin MPIOptions line.

The resulting template should appear as follows:

Begin MPIOptions
coscheduling    off
spin            on
End MPIOptions

The significance of these options is discussed in Chapter 6.

Updating the Sun CRE Database

When you have finished editing hpc.conf, update the Sun CRE database with the new information by restarting the Sun CRE daemons. For example, to start the daemons on cluster nodes node1 and node2 from a central host, enter

# ./ctstartd -n node1,node2 -r connection_method

where connection_method is rsh, ssh, or telnet. Or, you can specify a nodelist file instead of listing the nodes on a command line.

# ./ctstartd -N /tmp/nodelist -r connection_method

where /tmp/nodelist is the absolute path to a file containing the names of the cluster nodes, with each name on a separate line.


Authentication and Security

Sun CRE provides basic security by means of a cluster password, which is stored in a key file on each node.

In addition, you can set up further methods of guarding the cluster against access by unauthorized users or programs. Sun CRE supports UNIX system authentication (via rhosts), as well as two third-party mechanisms for authentication: Data Encryption Standard (DES) and Kerberos Version 5.

Setting the Sun CRE Cluster Password

Sun CRE uses a root-read-only key file to control access to the cluster. The key file must exist on every node of the cluster, and the contents of all the key files must be identical. In addition, the key file must be placed on any node outside the cluster that might access the cluster (that is, on any node that may execute the command mprun -c cluster_name).

The key resides in /etc/hpc_key.cluster_name on each node.

The installation procedure creates a default key file on each node of the cluster. A strongly recommended step in the post-installation procedure is to customize the key immediately after installing the Sun HPC ClusterTools software. The key should be 10-20 alphanumeric characters.

The administrator can change the key at any time. As superuser, run the set_key script on each node in the cluster and on any nodes outside the cluster that may access the cluster:

# /etc/opt/SUNWhpc/HPC5.0/etc/set_key

It is preferable to stop the Sun CRE daemons before changing the key, since changing the key while jobs are running might cause a current MPI job to fail.

To guarantee that the cluster key is set identically on every node, you should use the Cluster Console Manager tools (described in Appendix A) to update all the key files at once.



Note - The cluster password security feature exists in addition to the "current" authentication method, as specified in the hpc.conf file and described below.



Establishing the Current Authentication Method

Authentication is established in the configuration file hpc.conf in the section CREOptions.

TABLE 3-5 CREOptions Section Example
Begin CREOptions
...
auth_opt    sunhpc_rhosts
End CREOptions

The value of the auth_opt option is one of:

  • sunhpc_rhosts - Contains list of trusted hosts, the default
  • rhosts - Standard UNIX system authentication
  • des - DES-based authentication
  • krb5 - Kerberos 5 authentication

To change authentication methods, stop all Sun CRE daemons, edit the hpc.conf file, and then restart the Sun CRE daemons. See Preparing to Edit hpc.conf.
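
For example, a minimal sketch of the edited section when switching to DES authentication (the other CREOptions entries, elided here as "...", remain unchanged):

Begin CREOptions
...
auth_opt    des
End CREOptions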



Note - Since authentication methods limit the time that can elapse between the initiation of a remote procedure call (RPC) and the system's response, administrators should ensure that the nodes of the Sun HPC cluster and the machines from which users submit jobs are closely synchronized. For example, you can synchronize the machines by setting all system clocks to the same time using the Solaris date command.



Setting Up the Default Authentication

When authentication option rhosts is in use, any Sun CRE operation (such as mpadmin or mprun) attempted by the superuser will be allowed only if the following three items:

  • The requesting host
  • The master Sun CRE host
  • The hosts on which any mprun operation will execute

appear in one of the following files:

  • The /etc/sunhpc_rhosts file, if it has been installed (the default)
  • The default .rhosts file (if the sunhpc_rhosts file is not created at installation, or if it has been deleted)

The sunhpc_rhosts file's contents are visible only to the superuser.

To allow superuser access from hosts outside the cluster, the node name must be added to the /etc/sunhpc_rhosts file.
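
For illustration, a sketch of an /etc/sunhpc_rhosts file for the four-node cluster used earlier in this chapter, assuming one host name per line (verify the expected format against your installation):

node0
node1
node2
node3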

If the /etc/sunhpc_rhosts file is not used (or has been removed), the .rhosts file on each node must be updated to include the name of every node in the cluster. Using .rhosts assumes trusted hosts. For information on trusted hosts, see the man page for hosts.equiv.

Setting Up DES Authentication

In order to use DES authentication with Sun CRE, host keys must exist for each host in the cluster and /etc/.rootkey must exist for each node of the cluster. User keys must exist on all hosts that will be used to communicate with the cluster using Sun CRE commands, as well as on each node of the cluster (including the master), for each user who is to access the cluster. Inconsistent key distribution will prevent correct operation.

To set up DES authentication, you must ensure that all hosts in the system, and all users, have entries in both the publickey and netname databases. Furthermore, the entries in /etc/nsswitch.conf for both publickey and netid databases must point to the correct place. For further information, see the Solaris man pages for publickey(4), nsswitch.conf(4), and netid(4).

After all new keys are in place, you need to restart the DES keyserver keyserv. You must also establish /etc/.rootkey on each node, as described in the man page keylogin(1).
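
A sketch of these steps on each node, assuming a Solaris 8/9-style service layout (verify the script locations on your release):

# /etc/init.d/rpc stop
# /etc/init.d/rpc start    (restarts keyserv with the other RPC services)
# keylogin -r              (as superuser, writes the key to /etc/.rootkey)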

When the DES setup is complete, restart the Sun CRE daemons (see Stopping and Restarting Sun CRE).

It is recommended that you use one of the Cluster Console Manager tools (cconsole, ctelnet, or crlogin) to issue identical commands to all the nodes at the same time. For information about the Cluster Console Manager, see Appendix A.



Note - While DES authentication is in use, users must issue the keylogin command before issuing any commands beginning with mp, such as mprun or mpps.



Setting Up Kerberos Authentication

To set up Kerberos 5 authentication, the administrator registers a host principal (host) and a Sun CRE (sunhpc) principal with an instance for each node that is to be used as a Sun CRE client. In addition, each host must have host and principal entries in the appropriate keytabs.

For example: consider a system consisting of three nodes (node0, node1, and node2), in Kerberos realm example.com. Nodes node0 and node1 will be used as Sun CRE servers and all three nodes will be used as Sun CRE clients. The database should include the following principals as well as principals for any end-users of Sun CRE services, created using the addprinc command in kadmin:

sunhpc/node0@example.com
sunhpc/node1@example.com
sunhpc/node2@example.com
host/node0@example.com
host/node1@example.com
host/node2@example.com

The sunhpc and host principals should have entries in the default keytab (created using the ktadd command in kadmin).
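
For example, a sketch of registering one node's principals using standard MIT Kerberos kadmin commands (adapt the realm and host names to your site):

# kadmin
kadmin: addprinc -randkey host/node0@example.com
kadmin: addprinc -randkey sunhpc/node0@example.com
kadmin: ktadd host/node0@example.com
kadmin: ktadd sunhpc/node0@example.com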

Any user who wishes to use Sun CRE to execute programs must first obtain a ticket granting ticket via kinit.
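
For example, the user wmitty (seen in earlier examples) would enter:

% kinit wmitty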

For further information on Kerberos version 5, see the Kerberos documentation.