

Parallel Implementation

The parallel mesh generator has been successfully implemented on two distributed-memory computing platforms: the IBM SP2 and the Transtech Paramid.

The IBM SP2 machine at CTU in Prague is currently equipped with 15 Power2 processors running at 66.7 MHz and 8 P2SC processors running at 120 MHz, each with at least 128 MB of memory and 2 GB of disk space. The processors are interconnected with standard Ethernet (10 Mbit/s) and the HPS (high performance switch, 40 MB/s). A set of queues, managed by a job scheduler, is configured above the processors. The queues differ in priority, CPU limit, and access to a particular set of processors. The processors are organized into several groups (pools) with respect to their accessibility and performance. The first group contains 7 Power2 processors that can be accessed directly by the users. Jobs on these processors can be executed interactively from a command line or spawned from a queue. The remaining processors are collected in 2 groups, each containing 8 processors of the same type. These processors cannot be accessed by ordinary users, and jobs on them must be spawned exclusively from the appropriate queue. Since only a single job (possibly a part of a parallel application) is allowed to execute on any non-interactive processor, it is ensured that the full performance of these processors is devoted to a single application of a single user. This organization should enable effective utilization and load balancing of the machine with respect to the widely different demands of individual users. Note that the above described configuration of the SP2 machine is very recent. During the actual implementation of the parallel mesh generator, the SP2 machine was equipped with processors of a single type (Power2). Therefore the different performance of individual processors has not been taken into account in the domain decomposition strategy.

The Transtech Paramid machine at UWC in Cardiff possesses 48 Intel i860xp vector processors, each with 16 MB of memory. The communication is based on T805 transputers with a typical speed of 1.2 MB/s. The processors are organized into nodes containing three mutually interconnected processors. The connection to other processors is made at the node level. The jobs are spawned from a host machine, a Sparc 10 workstation, on a first-in, first-out basis. The parallel jobs can be run on different topologies of processors: a pipe, grid, or torus. This restricts the number of processors that can be requested for a parallel job, because only specific configurations of a given topology are available. In the present implementation, the torus topology has been used because it provides the most suitable connections between the nodes with respect to the communication requirements of the mesh generator.

MPI (Message Passing Interface) has been chosen as the message passing library for the implementation of interprocessor communication. MPI [101] offers a full range of tools for point-to-point communication (send, receive), collective operations (broadcast, gather, scatter, reduce), process topology, and groups and communication contexts. Some other tools, e.g. for task management or remote execution (both available in PVM (Parallel Virtual Machine)), are not included in the current standard specification. One important feature of the point-to-point communication is fairness. MPI guarantees fairness if only two processes, each single-threaded, are involved in a point-to-point communication. In that case, any two communications between these two processes are ordered and the messages do not overtake each other. This guarantees that the message passing code (between these two processes) is deterministic. However, fairness is not guaranteed if more than two processes are involved in the communication. Then it is possible that a destination process, repeatedly posting a receive which matches a particular send, will never receive that send because it is overtaken each time by another message sent from another source. The same situation may arise in a multi-threaded process if the semantics of the thread execution does not define the relative order of two send or receive operations executed by two distinct threads.
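This ordering guarantee between a single pair of single-threaded processes can be illustrated by the following minimal C sketch; the ranks, tag, and payload values are illustrative assumptions and are not taken from the mesh generator itself. Two messages sent from rank 0 to rank 1 with the same envelope are received in the order in which they were posted.

    /* Minimal sketch of the MPI non-overtaking guarantee between
     * two single-threaded processes (illustrative ranks and tag). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, first = 1, second = 2;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Both sends use the same destination, tag, and communicator. */
            MPI_Send(&first,  1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Send(&second, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int a, b;
            /* The receives match the sends in the order they were posted,
             * so a == 1 and b == 2 is guaranteed for this pair of processes. */
            MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("received %d then %d\n", a, b);
        }

        MPI_Finalize();
        return 0;
    }

If a third process also sent to rank 1 with a matching envelope, no such ordering (and, in the sense used above, no fairness) would be guaranteed between the two sources.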

Since MPI is not available on the Transtech Paramid machine, Parmacs (Parallel Macros) has been chosen as an alternative message passing library. The Parmacs message passing library is not nearly as rich as MPI. Most importantly, collective operations are not available. Parmacs only provides the user with the hierarchy of the spawning tree of individual processes, and it is up to the user to implement the collective communication. Also, the point-to-point communication is limited to the basic modes (synchronous and asynchronous). The most crucial aspect of Parmacs is that the fairness of communication is guaranteed for the synchronous mode only. In the asynchronous mode, there is no guarantee that two messages with identical source and target processes will be received in the same order in which they were sent. Thus the messages may generally overtake each other, which makes the code essentially non-deterministic. This seriously complicates the implementation of a repeated multiple asynchronous communication, which is exactly the type of communication that occurs during the tree compatibility enforcement.
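One conceivable way to restore determinism under such overtaking is to stamp every message with a sequence number at the sender and to buffer out-of-order arrivals at the receiver until the expected message becomes available. The following conceptual C sketch illustrates only this receiver-side reordering; the message structure is an assumption, it does not use the Parmacs API, and it is not the actual implementation of the mesh generator (arrival order is simply simulated in main).

    /* Conceptual sketch: deliver messages in sending order even if
     * they arrive overtaken, by buffering out-of-order arrivals. */
    #include <stdio.h>

    #define MAX_PENDING 16

    typedef struct {
        int seq;      /* sequence number stamped by the sender   */
        int payload;  /* application data carried by the message */
    } message;

    static void deliver_in_order(const message *arrived, int count)
    {
        message pending[MAX_PENDING];
        int npending = 0, expected = 0;

        for (int i = 0; i < count; i++) {
            pending[npending++] = arrived[i];   /* buffer the new arrival */

            int progress = 1;
            while (progress) {                  /* drain all now-deliverable messages */
                progress = 0;
                for (int j = 0; j < npending; j++) {
                    if (pending[j].seq == expected) {
                        printf("delivering seq %d (payload %d)\n",
                               pending[j].seq, pending[j].payload);
                        pending[j] = pending[--npending];
                        expected++;
                        progress = 1;
                        break;
                    }
                }
            }
        }
    }

    int main(void)
    {
        /* Messages sent as 0,1,2,3 but arriving overtaken as 1,0,3,2. */
        message arrived[] = { {1, 10}, {0, 20}, {3, 30}, {2, 40} };
        deliver_in_order(arrived, 4);
        return 0;
    }

The price of such a scheme is the extra buffering and the bookkeeping of sequence numbers on both sides of every repeated asynchronous exchange, which is precisely the complication mentioned above.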



