This appendix describes some common problem situations, resulting error messages, and suggestions for fixing the problems. Sun MPI error reporting, including I/O, follows the MPI-2 Standard. By default, errors are reported in the form of standard error classes. These classes and their meanings are listed in TABLE A-1 (for non-I/O MPI) and TABLE A-2 (for MPI I/O), and are also available on the MPI man page.
Three predefined error handlers are available in Sun MPI:
- MPI_ERRORS_RETURN - The default, returns an error code if an error occurs.
- MPI_ERRORS_ARE_FATAL - I/O errors are fatal, and no error code is returned.
- MPI_THROW_EXCEPTION - A special error handler to be used only with C++.
You can make changes to and get information about the error handler using any of the following routines:
- MPI_Comm_create_errhandler
- MPI_Comm_get_errhandler
- MPI_Comm_set_errhandler
MPI Messages
Messages resulting from an MPI program fall into two categories:
- Error messages - Error messages stem from within MPI. Usually an error message explains why your program cannot complete, and the program aborts.
- Warning messages - Warnings stem from the environment in which you are running your MPI program and are usually sent by MPI_Init(). They are not associated with an aborted program; that is, programs continue to run despite warning messages.
Error Messages
Sun MPI error messages use a standard format:
[x y z] Error in function_name: errclass_string:intern(a):description:unixerrstring
where
- [x y z] is the process communication identifier. It is present in every error message. Its parts are:
  - x is the job ID (or jid).
  - y is the name of the communicator if a name exists; otherwise it is the address of the opaque object.
  - z is the rank of the process.
- function_name is the name of the associated MPI function. It is present in every error message.
- errclass_string is the string associated with the MPI error class. It is present in every error message.
- intern is an internal function. It is optional.
- a is a system call, if one is the cause of the error. It is optional.
- description is a description of the error. It is optional.
- unixerrstring is the UNIX error string that describes system call a. It is optional.
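As a rough illustration of the format above, the following Python sketch splits an error message into the process communication identifier, the function name, and the error class string. The parse_error function, the MpiError type, and the sample message in the usage note are hypothetical and are not part of Sun MPI. The sketch also assumes that neither x nor y contains spaces, and it leaves the optional trailing fields unsplit.

```python
import re
from typing import NamedTuple, Optional

# Matches "[x y z] Error in function_name: rest".
# Assumption: x and y contain no whitespace, and z is a decimal rank.
_ERROR_RE = re.compile(r"^\[(\S+) (\S+) (\d+)\] Error in (\w+): (.*)$")


class MpiError(NamedTuple):
    jid: str                 # x, the job ID
    communicator: str        # y, communicator name or object address
    rank: int                # z, rank of the process
    function_name: str       # MPI function that reported the error
    errclass_string: str     # string for the MPI error class
    detail: Optional[str]    # optional trailing fields, left unsplit


def parse_error(line: str) -> Optional[MpiError]:
    """Parse one error message line; return None if it does not match."""
    m = _ERROR_RE.match(line.strip())
    if m is None:
        return None
    jid, comm, rank, func, rest = m.groups()
    # Everything up to the first colon is treated as the error class string.
    errclass, _sep, detail = rest.partition(":")
    return MpiError(jid, comm, int(rank), func,
                    errclass.strip(), detail.strip() or None)
```

For example, calling parse_error on the made-up line "[42 MPI_COMM_WORLD 3] Error in MPI_Send: Invalid rank" yields rank 3, function name MPI_Send, and error class string "Invalid rank", with no detail.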
Warning Messages
Sun MPI warning messages also use a standard format:
[x y z] Warning message
where message is a description of the condition that triggered the warning.
Standard Error Classes
Listed below are the error return classes you may encounter in your MPI programs. Error values may also be found in mpi.h (for C), mpif.h (for Fortran), and mpi++.h (for C++).
TABLE A-1 Sun MPI Standard Error Classes
Error Class | Value | Meaning
MPI_SUCCESS | 0 | Successful return code.
MPI_ERR_BUFFER | 1 | Invalid buffer pointer.
MPI_ERR_COUNT | 2 | Invalid count argument.
MPI_ERR_TYPE | 3 | Invalid datatype argument.
MPI_ERR_TAG | 4 | Invalid tag argument.
MPI_ERR_COMM | 5 | Invalid communicator.
MPI_ERR_RANK | 6 | Invalid rank.
MPI_ERR_ROOT | 7 | Invalid root.
MPI_ERR_GROUP | 8 | Null group passed to function.
MPI_ERR_OP | 9 | Invalid operation.
MPI_ERR_TOPOLOGY | 10 | Invalid topology.
MPI_ERR_DIMS | 11 | Illegal dimension argument.
MPI_ERR_ARG | 12 | Invalid argument.
MPI_ERR_UNKNOWN | 13 | Unknown error.
MPI_ERR_TRUNCATE | 14 | Message truncated on receive.
MPI_ERR_OTHER | 15 | Other error; use MPI_Error_string.
MPI_ERR_INTERN | 16 | Internal error code.
MPI_ERR_IN_STATUS | 17 | Look in status for error value.
MPI_ERR_PENDING | 18 | Pending request.
MPI_ERR_REQUEST | 19 | Illegal MPI_Request handle.
MPI_ERR_KEYVAL | 36 | Illegal key value.
MPI_ERR_INFO | 37 | Invalid info object.
MPI_ERR_INFO_KEY | 38 | Illegal info key.
MPI_ERR_INFO_NOKEY | 39 | No such key.
MPI_ERR_INFO_VALUE | 40 | Illegal info value.
MPI_ERR_TIMEDOUT | 41 | Timed out.
MPI_ERR_RESOURCES | 42 | Out of resources.
MPI_ERR_TRANSPORT | 43 | Transport layer error.
MPI_ERR_HANDSHAKE | 44 | Error accepting/connecting.
MPI_ERR_SPAWN | 45 | Error spawning.
MPI_ERR_WIN | 46 | Invalid window.
MPI_ERR_BASE | 47 | Invalid base.
MPI_ERR_SIZE | 48 | Invalid size.
MPI_ERR_DISP | 49 | Invalid displacement.
MPI_ERR_LOCKTYPE | 50 | Invalid locktype.
MPI_ERR_ASSERT | 51 | Invalid assert.
MPI_ERR_RMA_CONFLICT | 52 | Conflicting accesses to window.
MPI_ERR_RMA_SYNC | 53 | Erroneous RMA synchronization.
MPI_ERR_NO_MEM | 54 | Memory exhausted.
MPI_ERR_LASTCODE | 55 | Last error code.
MPI I/O error classes are listed separately, in TABLE A-2.
MPI I/O Error Handling
Sun MPI I/O error reporting follows the MPI-2 Standard. By default, errors are reported in the form of standard error codes (found in /opt/SUNWhpc/include/mpi.h). Error classes and their meanings are listed in TABLE A-2. They can also be found in mpif.h (for Fortran) and mpi++.h (for C++).
You can change the default error handler by specifying MPI_FILE_NULL as the file handle with the routine MPI_File_set_errhandler(), even if no file is currently open. Or, you can use the same routine to change a specific file's error handler.
TABLE A-2 Sun MPI I/O Error Classes
Error Class | Value | Meaning
MPI_ERR_FILE | 20 | Bad file handle.
MPI_ERR_NOT_SAME | 21 | Collective argument not identical on all processes.
MPI_ERR_AMODE | 22 | Unsupported amode passed to open.
MPI_ERR_UNSUPPORTED_DATAREP | 23 | Unsupported datarep passed to MPI_File_set_view().
MPI_ERR_UNSUPPORTED_OPERATION | 24 | Unsupported operation, such as seeking on a file that supports only sequential access.
MPI_ERR_NO_SUCH_FILE | 25 | File (or directory) does not exist.
MPI_ERR_FILE_EXISTS | 26 | File exists.
MPI_ERR_BAD_FILE | 27 | Invalid file name (for example, path name too long).
MPI_ERR_ACCESS | 28 | Permission denied.
MPI_ERR_NO_SPACE | 29 | Not enough space.
MPI_ERR_QUOTA | 30 | Quota exceeded.
MPI_ERR_READ_ONLY | 31 | Read-only file system.
MPI_ERR_FILE_IN_USE | 32 | File operation could not be completed, as the file is currently open by some process.
MPI_ERR_DUP_DATAREP | 33 | Conversion functions could not be registered because a data representation identifier that was already defined was passed to MPI_REGISTER_DATAREP.
MPI_ERR_CONVERSION | 34 | An error occurred in a user-supplied data-conversion function.
MPI_ERR_IO | 35 | I/O error.
MPI_ERR_INFO | 37 | Invalid info object.
MPI_ERR_INFO_KEY | 38 | Illegal info key.
MPI_ERR_INFO_NOKEY | 39 | No such key.
MPI_ERR_INFO_VALUE | 40 | Illegal info value.
MPI_ERR_LASTCODE | 55 | Last error code.
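When an error value appears in a log, it can help to see at a glance which table it belongs to. The following Python sketch is only an illustration: it copies a small subset of the values from TABLE A-1 and TABLE A-2, and the names ERROR_CLASSES, table_for, and describe are hypothetical. The authoritative values are those in mpi.h on your installation.

```python
# Subset of TABLE A-1 and TABLE A-2, copied from this appendix.
# Illustration only: check mpi.h for the values on your system.
ERROR_CLASSES = {
    0: ("MPI_SUCCESS", "Successful return code."),
    6: ("MPI_ERR_RANK", "Invalid rank."),
    14: ("MPI_ERR_TRUNCATE", "Message truncated on receive."),
    20: ("MPI_ERR_FILE", "Bad file handle."),
    25: ("MPI_ERR_NO_SUCH_FILE", "File (or directory) does not exist."),
    28: ("MPI_ERR_ACCESS", "Permission denied."),
    35: ("MPI_ERR_IO", "I/O error."),
}

LAST_CODE = 55  # MPI_ERR_LASTCODE in both tables


def table_for(value: int) -> str:
    """Return the table in which an error class value is listed.

    Values 20 through 35 appear only in TABLE A-2. All other values up to
    MPI_ERR_LASTCODE appear in TABLE A-1 (a few are repeated in TABLE A-2).
    """
    if 20 <= value <= 35:
        return "TABLE A-2"
    if 0 <= value <= LAST_CODE:
        return "TABLE A-1"
    raise ValueError(f"{value} is outside the range 0..{LAST_CODE}")


def describe(value: int) -> str:
    """Format one error class value for a log or report."""
    table = table_for(value)
    name, meaning = ERROR_CLASSES.get(value, ("(not in this subset)", ""))
    return f"{value} {name} [{table}] {meaning}".rstrip()
```

For instance, describe(28) reports MPI_ERR_ACCESS as a TABLE A-2 (MPI I/O) class, while describe(6) reports MPI_ERR_RANK from TABLE A-1.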
Exceeding the File Descriptor Limit
If your application attempts to open a file descriptor when the maximum limit of open file descriptors has been reached, the job will fail and display the following message:
Too many open file descriptors
Should this occur, increase the value of the file descriptor hard limit before starting your job again.
If you are logged in to a C shell as superuser, you can determine the current hard limit value with the limit command.
If you are logged in to a Bourne shell as superuser, use the ulimit command.
Each command reports the file descriptor hard limit currently in effect. Once you know the current hard limit, you can estimate what the new hard limit value should be and set it accordingly.
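The limits can also be read from inside a program rather than from a shell. The following is a minimal Python sketch, assuming a UNIX system on which Python's standard resource module is available. The names fd_limits and can_open are illustrative and are not part of Sun MPI, and the values reported apply only to the process that runs the code.

```python
import resource


def fd_limits():
    """Return (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def can_open(needed: int) -> bool:
    """Return True if `needed` descriptors fit under the current soft limit."""
    soft, _hard = fd_limits()
    # RLIM_INFINITY means the limit is not enforced.
    return soft == resource.RLIM_INFINITY or needed <= soft


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
```

A launcher script could call can_open with its own estimate of the descriptors a job needs and warn before the job starts, rather than after it fails.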
From a C shell, use the limit command to set the new value in the .login file.
# limit -h descriptors limit
From a Bourne shell, use the ulimit command to set the new value in the .profile file.
In each case, limit is the value of the new hard limit.
Alternatively, you can determine whether the file descriptor hard limit is anything other than the default by looking in the /etc/system file to see whether the rlim_fd_max parameter has been set to a nondefault value. If not, the file descriptor hard limit will be 1024. To change the hard limit in a 64-bit Solaris 8 environment, add a line to the /etc/system file that sets the rlim_fd_max parameter to limit, where limit is again the value of the new file descriptor hard limit.
You can also increase the file descriptor hard limit in a Solaris 8 32-bit environment. However, this approach is not recommended. See Maximum Number of File Descriptors for information about defining the C pre-processor symbol FD_SETSIZE should you choose to make such a change.
Exceeding the TCP Port Limit
If you are running a large (highly parallel), communication-intensive MPI job on a Sun HPC cluster that meets both of the following conditions, the number of TCP ports may be too limited:
- TCP/IP is the only interconnect medium.
- A node has more than 32 CPUs.
If the MPI job attempts to access a TCP port when no more are available, the job will fail and print the following message:
low level communications error: Cannot assign requested address
Most likely, this occurs only when the job is running on the configuration described above and one of the following conditions exists:
- MPI_FULLCONNINIT is set.
- MPI_Alltoall is used.
- The application includes its own all-to-all code.
Other activity on the cluster, such as file I/O or other MPI jobs, will increase the chance of this occurring.
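To see why a large node combined with full connectivity strains the port supply, a back-of-the-envelope count can help. The Python sketch below is illustrative only. It assumes one TCP connection per pair of processes on different nodes and no TCP connections between processes on the same node; neither assumption is taken from the Sun MPI implementation, and the function name is hypothetical.

```python
def tcp_connections_on_node(local_procs: int, total_procs: int) -> int:
    """Estimate the connections that terminate on one node of a job.

    Assumptions (not taken from Sun MPI): every pair of processes is
    connected, each off-node pair uses one TCP connection, and on-node
    pairs use none.
    """
    if not 0 <= local_procs <= total_procs:
        raise ValueError("local_procs must be between 0 and total_procs")
    # Each local process connects to every process on the other nodes.
    return local_procs * (total_procs - local_procs)
```

Under these assumptions, a node running 64 of a job's 128 processes would terminate 64 x 64 = 4096 connections, whereas a node running 8 of them would terminate 8 x 120 = 960.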
You can avoid exceeding the TCP port limit by taking one or more of the following steps:
- Configure the node with more than 32 CPUs into two or more domains. From the TCP perspective, each domain will be seen as a separate node with its own supply of TCP ports.
- Reconfigure the cluster to exclude the node with more than 32 CPUs.
- Avoid running multiple MPI jobs or other tasks that would compete for available TCP ports.
- If two large MPI jobs must run on the same cluster, wait a few minutes between the jobs to give the OS time to reclaim the ports created for the previous job.
- If the application does not include any all-to-all operations, use the default lazy connections mode instead of MPI_FULLCONNINIT.
- If the application contains any all-to-all operations, either MPI_Alltoall or custom code, use a non-TCP network technology.
Sun HPC ClusterTools 5 Software User's Guide | 817-0084-10
Copyright © 2003, Sun Microsystems, Inc. All rights reserved.