Although the performance of today's hardware is stably increasing
the application of distributed parallel algorithms is often the only
way to enable solution of huge realistic problems both in terms of
time and storage. Generally, to achieve high parallelism with
favorable load balancing such algorithms should satisfy the following
goals: i)most of the algorithm should be performed in parallel (preferably
using all available processors), ii)the total amount and number of
interprocessor communication should be minimized.
From this point of view, most of the sequential algorithms are
inherently not suitable for parallelization and it is necessary to
compromise between the requirements stated above. This is also the case
for classical unstructured mesh generation strategies. Since the cost
of computing and communication differs from one hardware platform to
another it is even more difficult to find acceptable
parallelization strategy portable to different platforms.
Most of the parallelization strategies are based on some kind of
decomposition in space or in time. This is also true for the
mesh generation algorithms with space being the dominant
characteristics for domain decomposition concept.
To respect the above mentioned requirements for effective
parallelization and load balancing (not only of the mesh generation
but also of the later computational analysis) the domain decomposition
should fulfill the following conditions: i)
the individual domains should be approximately of the same complexity
(taking into account performance of the processor), ii)
the interface between the domains should be minimized.
However, the domain decomposition for the mesh generation algorithms is not
straightforward. Firstly, some background data structure is required
both for the domain complexity estimate and for the domain
decomposition itself. And secondly, application of sophisticated
decomposition algorithms, as ORB, IRB or SRB, may become
unsatisfactory when only rough complexity prediction is
available. From this point of view the first condition is usually
weakened (allowing for dynamic load balancing) to:
the total amount of work load of domains processed on each processor
should be approximately the same (taking into account performance of
In the octree based algorithms
a single global octree data structure,
used also for spatial localization and mesh size
control, is utilized as means of domain decomposition. A set of
octants is assigned to each processor on the basis of recursive
bisection. The compatibility of the
final mesh on the processor interfaces is simply ensured by use of
appropriate templates (predefined sets of elements filling in the
octant entirely) in interior octants. Since the octree data structure
contains information frequently accessed by any of the domains storing
the octree on a single processor would require excessive number of small
grain communication. Therefore the whole octree data structure is
stored on each processor which results in allocation of fairly not
negligible amount of memory on each processor. The distribution of the
tree data structure in terms of its subtrees among the processors is
mostly very far from the ideal decomposition and invokes further
communication when octree traversal is performed.
The domain decomposition in the advancing front technique
frequently based on background grid (e.g., from the previous step of
the adaptive analysis). Splitting is done using the greedy algorithm
with some surface to volume ratio optimization. The initial front
formed by the appropriate part of the model boundary with respect to a
given domain, propagates inside the domain
until it approaches the interprocessor boundary or closes up with the
existing front inside the domain.
From this point of view, the minimization of surface to volume ratio
becomes very important.
generation in the remaining space along the interprocessor boundaries
is done at the latest stage, usually at significantly reduced
parallelism, and requires a lot of small grain communication or new
background data redistribution. In either case, this last stage of the
mesh generation reduces the overall performance significantly.
In the case of Delaunay triangulation
the initial Delaunay mesh built
from boundary nodes is used as a background grid for domain decomposition.
Since the work load prediction, based on the mesh size specification
interpolated from the boundary nodes, provides only approximate
values the simple greedy algorithm is
employed. The triangulations corresponding to the processor interfaces
are enriched to reflect the interpolated mesh size and distributed to
both domains sharing it. Each domain is then discretized using fully
sequential Delaunay triangulator.
This paper presents an innovative approach for parallel unstructured mesh generation. The basic features of the proposed algorithm are described in the following sections.