APPENDIX A

Troubleshooting

This appendix describes some common problem situations, resulting error messages, and suggestions for fixing the problems. Sun MPI error reporting, including I/O, follows the MPI-2 Standard. By default, errors are reported in the form of standard error classes. These classes and their meanings are listed in TABLE A-1 (for non-I/O MPI) and TABLE A-2 (for MPI I/O), and are also available on the MPI man page.

Three predefined error handlers are available in Sun MPI:


You can make changes to and get information about the error handler using any of the following routines:

MPI Messages

Messages resulting from an MPI program fall into two categories: error messages and warning messages.

Error Messages

Sun MPI error messages use a standard format:

[x y z] Error in function_name: errclass_string:intern(a):description:unixerrstring

where

The process communication identifier is present in every error message.
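When scanning job output, it can help to pull the function name and error class out of each error line. The following Bourne shell filter is a minimal sketch of that idea. It assumes that lines follow the format shown above and that neither the function name nor the error class string contains a colon. The name mpi_error_summary is hypothetical and is not part of Sun MPI.

```shell
# mpi_error_summary (hypothetical name): read job output on standard
# input and print "function_name: errclass_string" for each line that
# matches the error format shown above. Other lines are ignored.
# Assumes the function name and error class contain no colons.
mpi_error_summary() {
    sed -n 's/^\[[^]]*] Error in \([^:]*\): *\([^:]*\):.*/\1: \2/p'
}

# Applied to the format template itself:
echo '[x y z] Error in function_name: errclass_string:intern(a):description:unixerrstring' |
    mpi_error_summary
```

Run against the format template, the filter prints function_name: errclass_string. Warning lines do not match the pattern and are dropped.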

Warning Messages

Sun MPI warning messages also use a standard format:

[x y z] Warning message

where message is a description of the error.

Standard Error Classes

Listed below are the error return classes you may encounter in your MPI programs. Error values may also be found in mpi.h (for C), mpif.h (for Fortran), and mpi++.h (for C++).

TABLE A-1 Sun MPI Standard Error Classes

Error Class           Value  Meaning
MPI_SUCCESS           0      Successful return code.
MPI_ERR_BUFFER        1      Invalid buffer pointer.
MPI_ERR_COUNT         2      Invalid count argument.
MPI_ERR_TYPE          3      Invalid datatype argument.
MPI_ERR_TAG           4      Invalid tag argument.
MPI_ERR_COMM          5      Invalid communicator.
MPI_ERR_RANK          6      Invalid rank.
MPI_ERR_ROOT          7      Invalid root.
MPI_ERR_GROUP         8      Null group passed to function.
MPI_ERR_OP            9      Invalid operation.
MPI_ERR_TOPOLOGY      10     Invalid topology.
MPI_ERR_DIMS          11     Illegal dimension argument.
MPI_ERR_ARG           12     Invalid argument.
MPI_ERR_UNKNOWN       13     Unknown error.
MPI_ERR_TRUNCATE      14     Message truncated on receive.
MPI_ERR_OTHER         15     Other error; use MPI_Error_string().
MPI_ERR_INTERN        16     Internal error code.
MPI_ERR_IN_STATUS     17     Look in status for error value.
MPI_ERR_PENDING       18     Pending request.
MPI_ERR_REQUEST       19     Illegal MPI_Request handle.
MPI_ERR_KEYVAL        36     Illegal key value.
MPI_ERR_INFO          37     Invalid info object.
MPI_ERR_INFO_KEY      38     Illegal info key.
MPI_ERR_INFO_NOKEY    39     No such key.
MPI_ERR_INFO_VALUE    40     Illegal info value.
MPI_ERR_TIMEDOUT      41     Timed out.
MPI_ERR_RESOURCES     42     Out of resources.
MPI_ERR_TRANSPORT     43     Transport layer error.
MPI_ERR_HANDSHAKE     44     Error accepting/connecting.
MPI_ERR_SPAWN         45     Error spawning.
MPI_ERR_WIN           46     Invalid window.
MPI_ERR_BASE          47     Invalid base.
MPI_ERR_SIZE          48     Invalid size.
MPI_ERR_DISP          49     Invalid displacement.
MPI_ERR_LOCKTYPE      50     Invalid locktype.
MPI_ERR_ASSERT        51     Invalid assert.
MPI_ERR_RMA_CONFLICT  52     Conflicting accesses to window.
MPI_ERR_RMA_SYNC      53     Erroneous RMA synchronization.
MPI_ERR_NO_MEM        54     Memory exhausted.
MPI_ERR_LASTCODE      55     Last error code.


MPI I/O error classes are listed separately, in TABLE A-2.


MPI I/O Error Handling

Sun MPI I/O error reporting follows the MPI-2 Standard. By default, errors are reported in the form of standard error codes (found in /opt/SUNWhpc/include/mpi.h). Error classes and their meanings are listed in TABLE A-2. They can also be found in mpif.h (for Fortran) and mpi++.h (for C++).

You can change the default error handler by specifying MPI_FILE_NULL as the file handle with the routine MPI_File_set_errhandler(), even if no file is currently open. You can also use the same routine to change the error handler for a specific file.

TABLE A-2 Sun MPI I/O Error Classes

Error Class                    Value  Meaning
MPI_ERR_FILE                   20     Bad file handle.
MPI_ERR_NOT_SAME               21     Collective argument not identical on all processes.
MPI_ERR_AMODE                  22     Unsupported amode passed to open.
MPI_ERR_UNSUPPORTED_DATAREP    23     Unsupported datarep passed to MPI_File_set_view().
MPI_ERR_UNSUPPORTED_OPERATION  24     Unsupported operation, such as seeking on a file that supports only sequential access.
MPI_ERR_NO_SUCH_FILE           25     File (or directory) does not exist.
MPI_ERR_FILE_EXISTS            26     File exists.
MPI_ERR_BAD_FILE               27     Invalid file name (for example, path name too long).
MPI_ERR_ACCESS                 28     Permission denied.
MPI_ERR_NO_SPACE               29     Not enough space.
MPI_ERR_QUOTA                  30     Quota exceeded.
MPI_ERR_READ_ONLY              31     Read-only file system.
MPI_ERR_FILE_IN_USE            32     File operation could not be completed, as the file is currently open by some process.
MPI_ERR_DUP_DATAREP            33     Conversion functions could not be registered because a data representation identifier that was already defined was passed to MPI_REGISTER_DATAREP.
MPI_ERR_CONVERSION             34     An error occurred in a user-supplied data-conversion function.
MPI_ERR_IO                     35     I/O error.
MPI_ERR_INFO                   37     Invalid info object.
MPI_ERR_INFO_KEY               38     Illegal info key.
MPI_ERR_INFO_NOKEY             39     No such key.
MPI_ERR_INFO_VALUE             40     Illegal info value.
MPI_ERR_LASTCODE               55     Last error code.



Exceeding the File Descriptor Limit

If your application attempts to open a file descriptor when the maximum limit of open file descriptors has been reached, the job will fail and display the following message:

Too many open file descriptors

Should this occur, increase the value of the file descriptor hard limit before starting your job again.

If you are logged in to a C shell as superuser, you can determine the current hard limit value with the limit command, as follows:

# limit -h descriptors

If you are logged in to a Bourne shell as superuser, use the ulimit command:

# ulimit -Hn

Each command reports the file descriptor hard limit currently in effect. Once you know the current hard limit, you can estimate what the new hard limit value should be and set it accordingly.

From a C shell, use the limit command to set the new value in the .login file.

# limit -h descriptors limit

From a Bourne shell, use the ulimit command to set the new value in the .profile file.

# ulimit -Hn limit

In each case, limit is the value of the new hard limit.
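The check described above can be scripted so that a job is not started when the hard limit is too low. The following Bourne shell fragment is a minimal sketch of that idea. The function name fd_limit_sufficient and the REQUIRED_FDS variable are hypothetical, and the value 2048 is only an illustrative estimate of what a job might need.

```shell
# fd_limit_sufficient (hypothetical name): succeeds when the hard
# limit given in $1 can accommodate the $2 descriptors that the job
# is expected to need.
fd_limit_sufficient() {
    current=$1
    required=$2
    # ulimit prints "unlimited" when no hard limit is enforced.
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    [ "$current" -ge "$required" ]
}

# REQUIRED_FDS is an illustrative estimate, not a Sun MPI setting.
REQUIRED_FDS=${REQUIRED_FDS:-2048}
hard_limit=`ulimit -Hn`

if fd_limit_sufficient "$hard_limit" "$REQUIRED_FDS"; then
    echo "hard limit $hard_limit is sufficient for $REQUIRED_FDS descriptors"
else
    echo "hard limit $hard_limit is below $REQUIRED_FDS; raise it before starting the job" >&2
fi
```

Keeping the comparison in its own function separates the decision from the ulimit call, so the same test can be applied to a limit value obtained some other way.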

Alternatively, you can determine whether the file descriptor hard limit is anything other than the default by looking in the /etc/system file to see whether the rlim_fd_max parameter has been set to a nondefault value. If not, the file descriptor hard limit will be 1024. To change the hard limit in a 64-bit Solaris 8 environment, simply add the following line to the /etc/system file:

set rlim_fd_max=limit

Again, limit is the value of the new file descriptor hard limit.
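The /etc/system lookup can also be scripted. The sketch below prints the rlim_fd_max value as it is written in the file, or 1024 (the default stated above) when no such line is found. It assumes that active settings begin with the word set, that comment lines begin with an asterisk, and that the last matching line is the one that takes effect. The function name fd_hard_limit_from_system_file is hypothetical.

```shell
# fd_hard_limit_from_system_file (hypothetical name): print the
# rlim_fd_max value found in the given file (default: /etc/system),
# exactly as written there, or 1024 when no setting is present or the
# file cannot be read. Assumes the last matching line wins.
fd_hard_limit_from_system_file() {
    file=${1:-/etc/system}
    value=`sed -n 's/^[[:space:]]*set[[:space:]][[:space:]]*rlim_fd_max[[:space:]]*=[[:space:]]*\([^[:space:]][^[:space:]]*\).*/\1/p' "$file" 2>/dev/null | tail -n 1`
    echo "${value:-1024}"
}

fd_hard_limit_from_system_file /etc/system
```

The value is printed without interpretation, so a setting written in hexadecimal is reported in hexadecimal.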

You can also increase the file descriptor hard limit in a Solaris 8 32-bit environment. However, this approach is not recommended. See "Maximum Number of File Descriptors" for information about defining the C preprocessor symbol FD_SETSIZE should you choose to make such a change.


Exceeding the TCP Port Limit

If you are running a large (highly parallel), communication-intensive MPI job on a Sun HPC cluster that meets both of the following conditions,

  • TCP/IP as the only interconnect medium
  • A node that has more than 32 CPUs

the number of available TCP ports may be insufficient. If the MPI job attempts to access a TCP port when no more are available, the job will fail and print the following message:

low level communications error: Cannot assign requested address

Most likely, this occurs only when the job is running on the configuration described above and one of the following conditions exists:

  • MPI_FULLCONNINIT is set.
  • MPI_Alltoall is used.
  • The application includes its own all-to-all code.

Other activity on the cluster, such as file I/O or other MPI jobs, will increase the chance of this occurring.

You can avoid exceeding the TCP port limit by taking one or more of the following steps:

  • Configure the node with more than 32 CPUs into two or more domains. From the TCP perspective, each domain will be seen as a separate node with its own supply of TCP ports.
  • Reconfigure the cluster to exclude the node with more than 32 CPUs.
  • Avoid running multiple MPI jobs or other tasks that would compete for available TCP ports.
  • If two large MPI jobs must run on the same cluster, wait a few minutes between the jobs to give the OS time to reclaim the ports used by the previous job.
  • If the application does not include any all-to-all operations, use the default lazy connections mode instead of MPI_FULLCONNINIT.
  • If the application contains any all-to-all operations, either MPI_Alltoall or custom code, use a non-TCP network technology.
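For an application without all-to-all operations, a job script can confirm that MPI_FULLCONNINIT is not set before the job starts. The following Bourne shell fragment is a minimal sketch. It assumes that MPI_FULLCONNINIT is read from the environment of the shell that starts the job and that its presence alone, whatever its value, enables full connection initialization. The function name fullconninit_is_set is hypothetical.

```shell
# fullconninit_is_set (hypothetical name): succeeds when the
# MPI_FULLCONNINIT variable is set in the current shell, whatever
# its value.
fullconninit_is_set() {
    [ -n "${MPI_FULLCONNINIT+set}" ]
}

if fullconninit_is_set; then
    echo "MPI_FULLCONNINIT is set; unsetting it for this job" >&2
    unset MPI_FULLCONNINIT
fi
```

After the fragment runs, the variable is no longer set in that shell, so a job started from it falls back to the default lazy connections mode.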