Although the performance of today's hardware is steadily increasing, the application of distributed parallel algorithms is often the only way to enable solution of huge realistic problems in terms of both time and storage. Generally, to achieve a high level of parallelism with favourable load balancing, such algorithms should satisfy the following goals: i) most of the algorithm should be performed in parallel (preferably using all available processors), ii) the total amount and number of interprocessor communication should be minimized. From this point of view, most of the sequential algorithms are inherently not suitable for parallelization and it is necessary to compromise between the requirements stated above. This is also the case for classical unstructured mesh generation strategies. Since the cost of computing and communication differs from one hardware platform to another, it is even more difficult to find an acceptable parallelization strategy portable to different platforms.

Most of the parallelization strategies are based on some kind of decomposition in space or in time. This is also true for the mesh generation algorithms with space being the dominant characteristics for the domain decomposition concept. To respect the above mentioned requirements for effective parallelization and load balancing (not only of the mesh generation but also of the later computational analysis), the domain decomposition should fulfill the following conditions: i) individual subdomains should be approximately of the same complexity (taking into account the performance of the processor), ii) the interface between the subdomains should be minimized. However, the domain decomposition for the mesh generation algorithms is not straightforward. Firstly, some background data structure is required for both the subdomain complexity estimate and the domain decomposition itself. And secondly, the application of sophisticated decomposition algorithms, as orthogonal (ORB), inertial (IRB), or spectral (SRB) recursive bisection, may become unsatisfactory when only a rough complexity prediction is available. From this point of view, the first condition is usually weakened (allowing for dynamic load balancing) to: the total amount of workload of subdomains processed on each processor should be approximately the same (taking into account the performance of the processor).

In the octree based algorithms [1,2,3], a single global octree data structure, used also for spatial localization and mesh size control, is utilized as means of the domain decomposition. A set of octants is assigned to each processor using the recursive bisection. The compatibility of the final mesh on the processor interfaces is simply ensured by the use of appropriate templates (predefined sets of elements) in interior octants. Since the octree data structure contains information frequently accessed by any of the subdomains, storing the octree on a single processor would require an excessive number of small grain communication. Therefore the whole octree data structure is stored on each processor, which results in the allocation of a fairly not negligible amount of memory on each processor. The distribution of the tree data structure in terms of its subtrees among the processors is mostly very far from the ideal decomposition and invokes further communication when the octree traversal is performed.

The domain decomposition in the advancing front technique [4,5] is most frequently based on a background grid (e.g., from the previous step of the adaptive analysis). Splitting is done using the greedy algorithm with some surface to volume ratio optimization. The initial front, formed by the appropriate part of the model boundary with respect to a given subdomain, propagates inside the subdomain until it approaches the interprocessor boundary or closes up with another existing front inside the subdomain. From this point of view, the minimization of surface to volume ratio becomes very important. The mesh generation in the remaining space along the interprocessor boundaries is done at the latest stage, usually at a significantly reduced parallelism, and requires a lot of small grain communication or a new background data redistribution. In either case, this last stage of the mesh generation reduces the overall performance significantly.

In the case of Delaunay triangulation [6], the initial Delaunay mesh, built from boundary nodes, is used as a background grid for the domain decomposition. Since the workload prediction, based on the mesh size specification interpolated from the boundary nodes, provides only approximate values, a simple greedy algorithm is employed. The triangulations corresponding to the processor interfaces are enriched to reflect the interpolated mesh size and distributed to both subdomains sharing it. Each subdomain is then discretized using a fully sequential Delaunay triangulator.

This paper presents an innovative approach for the parallel unstructured mesh generation. The basic features of the proposed algorithm are described in the following sections.

*Daniel Rypl
2005-12-03*