Sun HPC ClusterTools™ 5 Software Performance Guide

817-0090-10



Contents

Preface

1. Quick Reference

Compilation and Linking

MPProf

Analyzer Profiling

Job Launch on a Multinode Cluster

MPI Programming Tips

2. Introduction: The Sun HPC ClusterTools Solution

Sun HPC Hardware

Processors

Nodes

Clusters

Sun HPC ClusterTools Software

Sun MPI

Sun S3L

Prism Environment

Cluster Runtime Environment

3. Choosing Your Programming Model and Hardware

Starting Out

Programming Models

Scalability

Amdahl's Law

Scaling Laws of Algorithms

Characterizing Platforms

Basic Hardware Factors

Other Factors

4. Performance Programming

General Good Programming

Clean Programming

Optimizing Local Computation

Optimizing MPI Communications

Reducing Message Volume

Reducing Serialization

Load Balancing

Synchronization

Buffering

Nonblocking Operations

Polling

Sun MPI Collectives

Contiguous Data Types

Special Considerations for Message Passing Over TCP

MPI Communications Case Study

Algorithms Used

Algorithm 1

Algorithm 2

Algorithm 3

Algorithm 4

Algorithm 5

Making a Complete Program

Timing Experiments With the Algorithms

Baseline Results

Directed Polling

Increasing Sun MPI Internal Buffering

Use of MPI_Testall

5. One-Sided Communication

Introducing One-Sided Communication

Comparing Two-Sided and One-Sided Communication

Basic Sun MPI Performance Advice

Case Study: Matrix Transposition

Test Program A

Test Program B

Test Program C

Test Program D

Utility Routines

Timing

6. Sun S3L Performance Guidelines

Link In the Architecture-Specific Version of Sun Performance Library Software

Legacy Code Containing ScaLAPACK Calls

Array Distribution

When To Use Local Distribution

When To Use Cyclic Distribution

Choosing an Optimal Block Size

Illustration of Load Balancing

Process Grid Shape

Runtime Mapping to Cluster

Use Shared Memory to Lower Communication Costs

Smaller Data Types Imply Less Memory Traffic

Performance Notes for Specific Routines

The S3L_mat_mult Routine

The S3L_matvec_sparse Routine

The S3L_lu_factor Routine

The S3L_fft, S3L_ifft, S3L_rc_fft, S3L_cr_fft, and S3L_fft_detailed Routines

The S3L_gen_band_factor, S3L_gen_trid_factor, S3L_gen_band_solve, and S3L_gen_trid_solve Routines

The S3L_sym_eigen Routine

The S3L_rand_fib and S3L_rand_lcg Routines

The S3L_gen_lsq Routine

The S3L_gen_svd Routine

The S3L_gen_iter_solve Routine

The S3L_acorr, S3L_conv, and S3L_deconv Routines

The Routines S3L_sort, S3L_sort_up, S3L_sort_down, S3L_sort_detailed_up, S3L_sort_detailed_down, S3L_grade_up, S3L_grade_down, S3L_grade_detailed_up, and S3L_grade_detailed_down

The S3L_trans Routine

Sun S3L Toolkit Functions

7. Compilation and Linking

Compiler Version

The mp* Utilities

The -fast Switch

The -xarch Switch

The -xalias Switch

The -g Switch

Other Useful Switches

8. Runtime Considerations and Tuning

Running on a Dedicated System

Setting Sun MPI Environment Variables

Are You Running on a Dedicated System?

Does the Code Use System Buffers Safely?

Are You Willing to Trade Memory for Performance?

Do You Want to Initialize Sun MPI Resources?

Is More Runtime Diagnostic Information Needed?

Launching Jobs on a Multinode Cluster

Minimizing Communication Costs

Load Balancing

Controlling Bisection Bandwidth

Considering the Role of I/O Servers

Running Jobs in the Background

Limiting Core Dumps

Using Line-Buffered Output

Multinode Job Launch Under CRE

Collocal Blocks of Processes

Multithreaded Job

Round-Robin Distribution of Processes

Detailed Mapping

9. Profiling

General Profiling Methodology

Basic Approaches

MPProf Profiling Tool

Sample MPProf Output

Overview

Load Balance

Sun MPI Environment Variables

Breakdown by MPI Routine

Time Dependence

Connections

Multithreaded Programs

The mpdump Utility

Managing Disk Files

Incorporating Environment Variable Suggestions

Performance Analyzer Profiling of Sun MPI Programs

Data Collection

Data Volume

Data Organization

Example

Other Data Collection Issues

Analyzing Profiling Data

Case Study

Overview of Functions

MPI Wait Times

Other Profiling Approaches

Using the MPI Profiling Interface

Inserting MPI Timer Calls

Using the gprof Utility

Using the VAMPIR Performance Analyzer

Sun MPI Features Tested With VAMPIR

Limitations of VAMPIR Support

Notes on Compilation

A. Sun MPI Implementation

Yielding and Descheduling

Progress Engine

Shared-Memory Point-to-Point Message Passing

Full Versus Lazy Connections

RSM Point-to-Point Message Passing

Optimizations for Collective Operations

Network Awareness

Shared-Memory Optimizations

Pipelining

Multiple Algorithms

One-Sided Message Passing Using Remote Process

B. Sun MPI Environment Variables

Yielding and Descheduling

Polling

Shared-Memory Point-to-Point Message Passing

Shared-Memory Collectives

Running Over TCP

RSM Point-to-Point Message Passing

Summary Table Of Environment Variables

Index