
Introduction

In this paper, the parallel performance of a 3D mesh generator is investigated. The parallelization strategy of the mesh generator is based on domain decomposition on two levels -- the model level and the model entity parametric tree level. The discretization of model entities is based on a generalized parametric tree data structure and on the application of templates.

The mesh generator has been implemented on two different massively parallel machines:

The IBM SP2 machine (CTU Prague) is equipped with 19 IBM Power2 processors running at 66.7 MHz, each with 128 MB of memory and at least 2 GB of disk space. The processors are interconnected by standard Ethernet (10 Mbit/s) and by the HPS (High Performance Switch, 40 MB/s). A set of queues, handled by a job scheduler (LoadLeveler), is configured above the processors. The queues differ in priority, CPU limit, and access to particular sets of processors. The processors are organized into two pools. The interactive pool contains 7 processors that can be accessed interactively by users and that work as real multi-user, multi-process systems. Jobs on these processors can be executed from the command line. The remaining 12 processors form the non-interactive pool. Although these are also multi-user, multi-process systems, they are operated (from the user's point of view) as single-user, single-process machines. In other words, the full performance of a processor is devoted to a single application of a single user. Therefore jobs in the non-interactive pool must be spawned from a queue. This organization should enable effective utilization and load balancing of the non-interactive part of the machine with respect to the widely differing demands of individual users.

The Transtech Paramid machine (UWC Cardiff) possesses 48 Intel i860xp vector processors, each with 16 MB of memory. The communication is based on T805 transputers with a typical speed of 1.2 MB/s. The processors are organized into nodes of three mutually interconnected processors; connections to other processors are made at the node level. Jobs are spawned from the host machine, a Sparc 10 workstation, on a first-in, first-out basis. Parallel jobs can be run on different processor topologies -- pipe, grid, or torus. This restricts the number of processors that can be requested for a parallel job, because only specific configurations of a given topology are available. In the presented examples the torus topology has been used, because it provides the most suitable connections between the nodes with respect to the communication requirements of the mesh generator.

Two different message passing libraries have been used to implement the communication:

MPI is the primary message passing library, used for the implementation on the IBM SP2. Since MPI is not available on the Transtech Paramid machine, Parmacs has been chosen as the alternative message passing library.

MPI offers a full range of tools for point-to-point communication, collective operations, process topologies and groups, and communication contexts. Some other tools, e.g. for task management or remote execution (both available in PVM (Parallel Virtual Machine)), are not included in the current standard specification. One important feature of the point-to-point communication is fairness. MPI guarantees fairness if only two single-threaded processes are involved in the point-to-point communication. In that case any two communications between these two processes are ordered and the messages do not overtake each other. This guarantees that the message passing code is deterministic. However, fairness is not guaranteed if more than two processes are involved in the communication. It is then possible that a destination process repeatedly posting a receive which matches a particular send will never receive it, because it is each time overtaken by a message sent from another source. The same situation may arise in a multi-threaded process if the semantics of thread execution does not define the relative order between two send or receive operations executed by two distinct threads.
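
To make the pairwise guarantee concrete, the following minimal sketch (a hypothetical example, not taken from the mesh generator) shows the non-overtaking behaviour: with only two single-threaded processes and matching message signatures, the two receives are guaranteed to match the two sends in order.

    /* Minimal sketch of the MPI non-overtaking guarantee between a
     * single pair of single-threaded processes. Hypothetical example;
     * compile with an MPI compiler wrapper, e.g. mpicc. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, first, second;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int one = 1, two = 2;
            /* Two sends to the same destination with matching
             * signatures: MPI guarantees they cannot overtake
             * each other. */
            MPI_Send(&one, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Send(&two, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* The first receive is guaranteed to match the first
             * send, so 'first' is always 1 and 'second' always 2. */
            MPI_Recv(&first, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Recv(&second, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d then %d\n", first, second);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes (e.g. mpirun -np 2). With more than two processes and a receive posted on MPI_ANY_SOURCE, no such ordering or fairness guarantee would hold, which is exactly the starvation scenario described above.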

The Parmacs message passing library is not nearly as rich as MPI. Most importantly, collective operations are not available. Parmacs only provides the user with the hierarchy of the process spawning tree, and it is up to the user to implement the collective communication on top of it. The point-to-point communication is also limited to the basic modes (synchronous and asynchronous). The most critical aspect of Parmacs is that the fairness of communication is not guaranteed in the asynchronous mode at all. This seriously complicates the implementation of repeated multiple asynchronous communications.
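
As an illustration of what the user must supply, the sketch below implements a broadcast over a binary spawning tree using point-to-point messages only. This is a hypothetical example: the actual Parmacs primitives differ, so MPI-style point-to-point calls are used purely for concreteness.

    /* Hand-rolled broadcast over a binary spawning tree, built from
     * point-to-point messages only -- the kind of collective operation
     * a Parmacs user has to implement manually. Hypothetical sketch;
     * MPI calls stand in for the Parmacs primitives. */
    #include <mpi.h>
    #include <stdio.h>

    /* Broadcast 'count' ints from rank 0 down a binary tree: rank r
     * receives from its parent (r-1)/2 and forwards the data to its
     * children 2r+1 and 2r+2, when they exist. */
    static void tree_bcast(int *buf, int count, int rank, int size)
    {
        if (rank != 0) {
            int parent = (rank - 1) / 2;
            MPI_Recv(buf, count, MPI_INT, parent, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        int left = 2 * rank + 1, right = 2 * rank + 2;
        if (left < size)
            MPI_Send(buf, count, MPI_INT, left, 0, MPI_COMM_WORLD);
        if (right < size)
            MPI_Send(buf, count, MPI_INT, right, 0, MPI_COMM_WORLD);
    }

    int main(int argc, char **argv)
    {
        int rank, size, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) value = 42;      /* root's data */
        tree_bcast(&value, 1, rank, size);
        printf("rank %d got %d\n", rank, value);

        MPI_Finalize();
        return 0;
    }

The tree structure keeps the number of communication steps logarithmic in the number of processes; a naive root-sends-to-all loop would serialize all sends at the root.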




Daniel Rypl
2005-12-03