Mesh generation is often considered a bottleneck of parallel computation. It is usually performed sequentially on a single-processor computer and may suffer from a lack of resources (memory, disk space) and/or slow response. It is therefore important to parallelize the discretization, both to decrease the demands on memory and other resources by spreading the meshing over several mutually interconnected computers and to speed up the response by distributing the computation to individual processors. Most parallelization strategies are based on some kind of decomposition in space or in time. This is also true for mesh generation algorithms, with space being the dominant characteristic in the domain decomposition concept. However, domain decomposition for mesh generation is not straightforward, because the only knowledge available at the beginning of the discretization process is the geometric model, the complexity of which has little or no relationship to the work required to generate the mesh. The parallelization of mesh generation techniques is therefore a non-trivial task, and only a few approaches to parallel mesh generation have been developed so far. In this section, one possible approach to parallel mesh generation is presented [1].

The model is described by a boundary representation consisting of free-form entities - vertices, curves, surfaces, and regions. Tensor product polynomial entities (namely rational Bezier entities) have been employed. In this approach, each region is bounded by six surfaces, the boundary of each of which is formed by four curves, each of which is in turn given by two vertices. The advantage of this model representation consists in the existence of a unique mapping between the parametric and real spaces of each of the model entities. This significantly contributes to a unified handling of individual model entities. On the other hand, the restrictions on the model topology result in some reduction of modelling flexibility. To enhance the modelling capability, curves and surfaces (and even regions) are allowed to degenerate into an entity of a lower dimension.
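As an illustration of this unique mapping from the parametric space to the real space, a rational Bezier curve can be evaluated by de Casteljau's algorithm applied to homogeneous control points. The following sketch is illustrative only; the function name and data layout are assumptions, not part of the implementation described here:

```python
def rational_bezier_point(ctrl, weights, t):
    """Evaluate a rational Bezier curve at parameter t (0 <= t <= 1) by
    de Casteljau's algorithm on homogeneous control points, realizing the
    unique parametric-to-real mapping of a curve entity (illustrative sketch).
    """
    # lift control points to homogeneous coordinates (w*x, w*y, ..., w)
    pts = [[w * c for c in p] + [w] for p, w in zip(ctrl, weights)]
    # repeated linear interpolation between consecutive points
    while len(pts) > 1:
        pts = [[(1 - t) * a + t * b for a, b in zip(p, q)]
               for p, q in zip(pts, pts[1:])]
    point = pts[0]
    # project back to the real space by dividing by the weight coordinate
    return [c / point[-1] for c in point[:-1]]
```

With unit weights this reduces to the ordinary polynomial Bezier evaluation; non-unit weights give the rational case needed, e.g., for exact conic sections.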

The parallelization strategy is based on the domain decomposition concept considered on two levels - the model level and the model entity parametric tree level. The decomposition on the model level splits the model into subdomains on the model entity basis. Each model entity which does not form the boundary of another model entity of a higher dimension constitutes an individual subdomain. The advantage of this approach consists in the fact that the decomposition is in agreement with the physical concept of the model (the model boundaries coincide with the interprocessor boundaries). The decomposition on the parametric tree level may be compared, in some sense, with orthogonal recursive bisection in the parametric space of a given model entity. Since the splitting is done in the parametric space, its image in the real space does not generally result in subdomains of a similar size. The next bisection, however, may be applied to the larger (or more complex) of those subdomains, and so on. In this way, an arbitrary number of subdomains, with none exceeding a prescribed complexity, may be created. This is a good starting point for achieving acceptable parallel performance for various numbers of processors and varying model complexity. An important fact is that all the subdomains can be handled uniformly as model entity subdomains because they are described by the same model entity and differ only in the range of parametric coordinates.
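The recursive bisection to a prescribed complexity bound can be sketched as follows. The workload estimator and all names are hypothetical; a real implementation would estimate the meshing work from the real-space size and required element density of the entity:

```python
def bisect(subdomain, estimate_load, max_load):
    """Recursively bisect a parametric range until no piece exceeds max_load.

    `subdomain` is a list of (lo, hi) parametric ranges, one per dimension;
    `estimate_load` maps a subdomain to an estimated meshing workload.
    Both are illustrative assumptions, not the actual estimator.
    """
    if estimate_load(subdomain) <= max_load:
        return [subdomain]
    # split along the widest parametric range (orthogonal bisection)
    axis = max(range(len(subdomain)),
               key=lambda d: subdomain[d][1] - subdomain[d][0])
    lo, hi = subdomain[axis]
    mid = 0.5 * (lo + hi)
    left = list(subdomain)
    left[axis] = (lo, mid)
    right = list(subdomain)
    right[axis] = (mid, hi)
    # further bisection is applied only to pieces that are still too complex
    return (bisect(left, estimate_load, max_load)
            + bisect(right, estimate_load, max_load))
```

Because each piece is again described by the same model entity over a reduced parametric range, the resulting subdomains can all be handled uniformly, as noted above.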

The individual model entities are discretized using a tree-based approach. For each relevant model entity, a parametric tree of an appropriate dimension (a binary tree for curves, a quad tree for surfaces, and an octal tree for regions) is built separately. Thus no problems with the tree data distribution on the model level occur. Since the mapping between the parametric and real spaces of a model entity is generally not affine (stretching and distortion occur), the tree data structure is implemented as a generalized one. This makes it possible to build a relatively isotropic tree (in the real space) even when the mapping induces significant stretching. The distortion of the mapping, on the other hand, is entirely inherited by the tree and may result in a poor quality grid on ``skew'' model entities. The actual discretization of model entities is based on templates (predefined sets of elements entirely filling the cells of the tree), similarly to the classical tree-based algorithm. In order to limit the total number of templates to a reasonable number, the number of midside nodes per edge of the tree structure is enforced to be not greater than one. Note that no all-hexahedral templates are available and that the final mesh is therefore of a mixed nature.
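One way to read the generalized tree is that a cell is split only along those parametric directions whose real-space image is too long, which keeps the leaf cells roughly isotropic in the real space despite the stretching of the mapping. The following is a much simplified two-dimensional (quad tree) sketch under that interpretation, with all names assumed:

```python
import math

def refine(cell, mapping, target):
    """Generalized (anisotropic) refinement of a parametric cell.

    `cell` is a pair of parametric ranges ((u0, u1), (v0, v1)); `mapping`
    maps (u, v) to a real-space point. A cell is subdivided independently
    in each direction whose real-space edge length exceeds `target`.
    Names and the splitting criterion are illustrative assumptions.
    """
    (u0, u1), (v0, v1) = cell
    um, vm = 0.5 * (u0 + u1), 0.5 * (v0 + v1)

    def length(p, q):
        return math.dist(mapping(*p), mapping(*q))

    # real-space lengths of the cell mid-lines in each parametric direction
    split_u = length((u0, vm), (u1, vm)) > target
    split_v = length((um, v0), (um, v1)) > target
    if not (split_u or split_v):
        return [cell]  # leaf: small enough in the real space

    us = [(u0, um), (um, u1)] if split_u else [(u0, u1)]
    vs = [(v0, vm), (vm, v1)] if split_v else [(v0, v1)]
    return [leaf for ur in us for vr in vs
            for leaf in refine((ur, vr), mapping, target)]
```

For a mapping stretched four times in one direction, the sketch splits only along the stretched direction until the real-space cell images are roughly square.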

The compatibility of the mesh inside a subdomain is guaranteed by the use of appropriate templates. On the subdomain (processor) interface, the application of appropriate templates will ensure the compatibility only under the assumption that the tree structures are compatible along that interface. This is, however, not the case, because each parametric tree is built independently. To resolve this incompatibility, the following approach has been proposed. Once the tree structure has been built in a given subdomain, a boundary parametric tree is extracted from the basic tree of that subdomain for each boundary which also forms the boundary of another subdomain. For example, assuming only the model decomposition level, a quad tree is extracted from the basic octal tree of a model region for those of its boundary surfaces which form the boundary of another region. These extracted boundary trees are then sent to the subdomains whose boundary corresponds to the extracted tree. In order to minimize the amount of this interprocessor communication, a compressed form of the boundary tree structures is used. Each subdomain then updates its basic tree structure along the boundary according to the received boundary tree structure. This concept is applied iteratively to those subdomains which recorded a change in their parametric tree in the previous iteration, until the basic parametric trees of all subdomains remain unchanged.
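The iterative exchange can be modelled, in a much simplified serial form, as a fixpoint iteration: boundary refinements are merged across each interface until no subdomain records a change. Representing a subdomain's boundary refinement as a single set of subdivided cells is an illustrative assumption (the actual implementation ships compressed boundary trees between processors), as are all names:

```python
def exchange_boundary_trees(trees, interfaces):
    """Serial fixpoint sketch of the iterative boundary-tree exchange.

    `trees[s]` stands for the set of subdivided boundary cells of
    subdomain `s`; `interfaces` lists pairs of subdomains sharing a
    boundary. Merging makes both sides at least as fine as the other;
    the loop repeats while any subdomain recorded a change, as in the
    iterative scheme described above.
    """
    changed = True
    while changed:
        changed = False
        for s1, s2 in interfaces:
            merged = trees[s1] | trees[s2]  # finer of the two, cellwise
            if merged != trees[s1] or merged != trees[s2]:
                trees[s1] = trees[s2] = merged
                changed = True
    return trees
```

Note that a refinement received on one interface may propagate across a chain of subdomains, which is exactly why the exchange must be iterated rather than performed in a single pass.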

In the current implementation, the master and slaves parallel computing scheme (Fig. 1) has been used. In the initial phase, the input data are read, parsed, and checked by the master processor. The constituted model is then broadcast to all slave processors. For each relevant model entity, its approximate load level is calculated on the first available slave processor. The total load level, gathered from the individual relevant model entities, is then broadcast to the slaves. After that, the relevant model entities are split appropriately into subdomains, again using the first available slave processor. The completed domain decomposition, collected by the master together with the workload of the individual subdomains, is then again broadcast to the slave processors. In the following phase, a dynamic load balancing [2] mechanism is applied. A not yet assigned subdomain with the largest estimated workload is assigned to the first available slave processor, and the parametric tree of that subdomain is built on this processor. After all subdomains have been processed, the complete subdomain-to-processor assignment is broadcast to all slaves. In the next step, the boundary tree exchange process is looped over on all slaves. For each subdomain assigned to a given slave processor, the boundary tree structures are extracted and sent to the appropriate subdomains (slaves), which update their basic tree structure accordingly and which, if necessary, invoke a further boundary tree exchange. The role of the master processor in this process consists in ``listening'' to the exchange messages in order to detect the completion of the process. Once the parametric tree structures have been updated with respect to all boundary tree structures, the mesh generation starts on the slave processors, followed immediately by the mesh smoothing. The master processor is only notified about the numbers of generated elements and nodes in each subdomain and on its boundary.
These numbers are used to set up the final numbering ranges, which are broadcast to all slaves. Each slave processor then performs the renumbering of all subdomains assigned to it. In the final phase, the output data are written on the local device of each processor. The domain decomposition is output on the master processor, while the mesh data are stored on the slave processors.
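The load balancing step above can be sketched as a greedy assignment in which the heaviest unassigned subdomain goes to the least loaded slave, modelling the ``first available'' processor by its accumulated estimated workload. All names are illustrative and the runtime, message-driven character of the actual scheme is abstracted away:

```python
import heapq

def assign_subdomains(workloads, n_slaves):
    """Greedy sketch of the dynamic load balancing: subdomains are taken
    in order of decreasing estimated workload and each goes to the least
    loaded slave. Returns the subdomain-to-slave assignment.
    """
    # min-heap of (accumulated load, slave id) models slave availability
    slaves = [(0.0, s) for s in range(n_slaves)]
    heapq.heapify(slaves)
    assignment = {}
    for sub in sorted(workloads, key=workloads.get, reverse=True):
        load, s = heapq.heappop(slaves)       # least loaded slave
        assignment[sub] = s
        heapq.heappush(slaves, (load + workloads[sub], s))
    return assignment
```

Taking the subdomains in decreasing order of workload keeps the final per-slave loads close to each other, which is the purpose of creating subdomains with a bounded complexity in the first place.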

The mesh generator has been ported to two parallel computing platforms - an IBM SP and a Dell PC cluster. The IBM SP (installed at the CTU computing centre) is a heterogeneous machine equipped with 4 nodes with 4 Power3 processors, running at 332 MHz and having at least 1 GB of shared system memory and 32 GB of disk space, with 8 nodes with 2 Power3 processors (optimized for floating point operations), running at 200 MHz and having at least 1 GB of shared system memory and 16 GB of disk space, and with 2 nodes with 4 Power3 processors (also optimized for floating point operations), running at 222 MHz and having at least 2 GB of shared system memory and 36 GB of disk space. The processors are running under the AIX 4.3 operating system and are interconnected by the SPS (super performance switch). The communication is based on the MPI message passing library built on top of the native MPL message passing library. The PC cluster (installed at the Department of Structural Mechanics) consists of eight workstations, each equipped with two processors. Two workstations contain dual PII Xeon processors at 450 MHz with 512 MB of shared system memory. Another two comprise dual PII Xeon processors at 400 MHz with 512 MB of shared memory. The next two are equipped with dual PIII processors at 450 MHz with 512 MB and 1 GB of shared system memory. The last two contain dual PIII processors at 600 MHz and 1 GHz with 512 MB of shared memory. The workstations are connected by a Fast Ethernet 100 Mbit network using a 3Com SuperStack II switch, model 3300. All workstations are running under the Linux 6.2 operating system with the public domain MPICH message passing library. Note that both the IBM SP and the PC cluster represent a heterogeneous parallel computing platform combining shared and distributed memory. However, the IBM SP can be considered a homogeneous computing environment if processors of the same type are dedicated to the parallel computation.

The parallel performance of the algorithm is presented on an example - a tunnel in a rock mass in a particular phase of the construction (Fig. 2). There are two tubes, one (on the right) fully excavated and fitted out with the lining, the other one (on the left) partially excavated, with the lining missing in the segments directly neighbouring the not yet excavated part. This model was discretized by two graded three-dimensional meshes (with refinement prescribed along the interface between the rock mass and the excavated hole). The smaller mesh contains 533,828 nodes and the larger one 932,077 nodes. A much coarser mesh of the tunnel is depicted in Figure 3. The scalability of the meshing algorithm was evaluated on 12 Power3 200 MHz processors of the IBM SP machine and on 12 slower (400 - 450 MHz) processors of the PC cluster. Since the master and slaves parallel computing paradigm has been adopted, a separate processor must be allocated for the master process. Note, however, that the master processor has not been considered in the evaluation of the speedup. This is acceptable on the PC cluster because it has been verified that the master process can run together with one slave process on the same processor without a significant impact on the performance. This is, however, not possible on the IBM SP machine, because there is only one SPS socket per processor and because the processors are running in dedicated mode (thus it is impossible to place two jobs on one processor). The execution times and the achieved speedups are summarized in Figures 4 and 5, which reveal only negligible differences between the time and speedup profiles. It is evident that the performance of the PC cluster can match the performance of parallel supercomputers while keeping the investment and maintenance costs substantially lower.

*Daniel Rypl
2005-12-03*