The presented results were obtained on two different parallel hardware platforms: an IBM SP machine (installed at the CTU computing centre) and a PC cluster (installed at the Department of Structural Mechanics).
The IBM SP machine is equipped with 4 nodes, each with 4 Power3 processors running at 332 MHz and having at least 1 GB of shared system memory and 32 GB of disk space, and with 8 nodes, each with 2 Power3 processors (optimized for floating point operations) running at 200 MHz and having at least 1 GB of shared system memory and 16 GB of disk space. The nodes, running the AIX 4.3 operating system, are interconnected by an SPS (super performance switch). Communication is based on MPI built on top of the native MPL message passing library.
The PC cluster consists of four Dell 610 workstations, each equipped with two processors. Two workstations contain dual PII Xeon processors at 450 MHz with 512 MB of shared system memory, and the other two contain dual PII Xeon processors at 400 MHz with 512 MB of shared memory. The workstations are connected by a 100 Mbit Fast Ethernet network using a 3Com SuperStack II switch. All workstations can run either the Windows NT 4.0 operating system, in which case communication is based on MPI/Pro for Windows NT, which supports both distributed and shared memory communication, or the Linux 6.x operating system with the public domain MPICH message passing library.
Note that both platforms represent heterogeneous parallel computing environments combining shared and distributed memory.
The parallel performance is presented on two uniform meshes of a mechanical part (see Figure 1) comprising 162,300 and 524,979 nodes, respectively. Since the master-slave parallel computing scheme has been adopted, a separate processor must be allocated to the master process. Note, however, that the master processor has not been counted in the evaluation of the speedup and efficiency. This is acceptable only on the IBM SP machine, where the master process can run together with one slave process on the same processor without impacting performance. Similar experiments on the PC cluster, however, revealed considerable performance degradation.

The execution times and speedups for the presented example and for the individual hardware and software platforms are summarized in Figures 2 and 3. Note that the heterogeneity of the PC cluster is not taken into account in the current implementation and that the speedup is always evaluated using the single-processor time obtained on the Xeon PII 450 MHz processor. It is evident that for large enough problems the speedup profiles on the IBM SP and the PC cluster are quite similar, while for smaller problems the speedup degrades faster on the PC cluster due to the slower Ethernet-based communication. The worst speedup profile is always observed on the PC cluster under Linux, clearly due to the lack of shared memory communication support. However, the elapsed times on the PC cluster are shorter than on the IBM SP for smaller numbers of processors and almost identical for larger numbers of processors.