Although the performance of today's hardware is steadily increasing, the application of distributed parallel algorithms is often the only way to make the solution of large realistic problems feasible, both in terms of time and storage. Generally, to achieve a high degree of parallelism with favorable load balancing, such algorithms should satisfy the following goals

- most of the algorithm should be performed in parallel (preferably using all available processors),
- the total volume and number of interprocessor communications should be minimized.

Most parallelization strategies are based on some kind of decomposition in space or in time. This also holds for mesh generation algorithms, with space being the dominant characteristic of the domain decomposition concept. To respect the above mentioned requirements for effective parallelization and load balancing (not only of the mesh generation but also of the subsequent computational analysis), the domain decomposition should fulfill the following conditions

- the total workload of the domains processed on each processor should be approximately the same (taking into account the performance of the processor),
- the interface between the domains should be minimized.
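These two criteria can be quantified for any candidate decomposition. A minimal sketch of the load-balance criterion follows; the per-processor loads are hypothetical inputs, assumed to come from some workload-prediction step:

```python
# Evaluate a candidate decomposition against the load-balance criterion:
# 'loads' maps each processor to the predicted workload of its domain(s).

def imbalance(loads):
    """Ratio of the heaviest processor load to the average load (1.0 is ideal)."""
    avg = sum(loads.values()) / len(loads)
    return max(loads.values()) / avg

loads = {0: 1000, 1: 950, 2: 1050, 3: 1000}  # hypothetical predicted loads
print(imbalance(loads))  # → 1.05
```

An analogous measure for the second criterion would count the element faces shared between domains.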

In the octree based algorithms
[1,4,2],
a single global octree data structure,
used also for spatial localization and mesh size
control, serves as the means of domain decomposition. A set of
octants is assigned to each processor on the basis of recursive
bisection. The compatibility of the
final mesh on the processor interfaces is simply ensured by the use of
appropriate templates (predefined sets of elements filling an
octant entirely) in interior octants. Since the octree data structure
contains information frequently accessed by any of the domains, storing
the octree on a single processor would require an excessive number of
small-grain communications. Therefore the whole octree data structure is
replicated on each processor, which results in the allocation of a
non-negligible amount of memory on each processor. The distribution of the
tree data structure, in terms of its subtrees, among the processors is
mostly very far from the ideal decomposition and invokes further
communication when octree traversal is performed.
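The recursive bisection used to assign octants to processors can be sketched as follows. The octant centers and weights are hypothetical inputs; real implementations typically bisect along the longest axis of the current subdomain rather than alternating axes as done here:

```python
def recursive_bisection(octants, nproc, axis=0):
    """Split a list of (center, weight) octants among nproc processors by
    recursively sorting along one coordinate axis and cutting the sequence
    so that the two halves carry roughly proportional total weight.
    center is an (x, y, z) tuple; weight is the predicted octant workload."""
    if nproc == 1:
        return [octants]
    octants = sorted(octants, key=lambda o: o[0][axis])
    total = sum(w for _, w in octants)
    half, i, acc = nproc // 2, 0, 0.0
    # advance the cut until roughly half/nproc of the total weight is on the left
    while i < len(octants) - 1 and acc < total * half / nproc:
        acc += octants[i][1]
        i += 1
    next_axis = (axis + 1) % 3  # alternate the bisection axis
    return (recursive_bisection(octants[:i], half, next_axis) +
            recursive_bisection(octants[i:], nproc - half, next_axis))
```

For uniform weights this degenerates to a median cut; non-uniform weights shift the cut so that each processor receives a comparable predicted workload.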

The domain decomposition in the advancing front technique
[3,5]
is most
frequently based on a background grid (e.g., from the previous step of
the adaptive analysis). Splitting is done using a greedy algorithm
with some surface-to-volume ratio optimization. The initial front,
formed by the appropriate part of the model boundary with respect to a
given domain, propagates inside the domain
until it approaches the interprocessor boundary or closes up with the
existing front inside the domain. From this point of view, the
minimization of the surface-to-volume ratio becomes very important. The mesh
generation in the remaining space along the interprocessor boundaries
is done at the last stage, usually at significantly reduced
parallelism, and requires a lot of small-grain communication or
redistribution of new background data. In either case, this last stage of the
mesh generation reduces the overall performance significantly.
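The greedy splitting of a background grid mentioned above can be sketched as follows. The element adjacency and weights are hypothetical inputs, and the surface-to-volume optimization is reduced here to growing each domain from a seed by breadth-first accretion, which tends to keep domains compact:

```python
from collections import deque

def greedy_partition(adjacency, weights, nparts):
    """Grow nparts domains one after another: each domain accretes adjacent
    unassigned elements (breadth-first) until it reaches its weight quota.
    adjacency: {elem: [neighbour elems]}; weights: {elem: predicted workload}."""
    quota = sum(weights.values()) / nparts
    part = {}
    for p in range(nparts):
        unassigned = [e for e in adjacency if e not in part]
        if not unassigned:
            break
        load, queue = 0.0, deque([unassigned[0]])  # seed with any free element
        while queue and load < quota:
            e = queue.popleft()
            if e in part:
                continue
            part[e] = p
            load += weights[e]
            queue.extend(n for n in adjacency[e] if n not in part)
    for e in adjacency:
        part.setdefault(e, nparts - 1)  # sweep any leftovers into the last domain
    return part
```

A production splitter would additionally score each candidate element by the change in domain surface area, accepting first those that reduce the surface-to-volume ratio.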

In the case of Delaunay triangulation
[6],
the initial Delaunay mesh built
from the boundary nodes is used as a background grid for the domain decomposition.
Since the workload prediction, based on the mesh size specification
interpolated from the boundary nodes, provides only approximate
values, a simple greedy algorithm is
employed. The triangulations corresponding to the processor interfaces
are enriched to reflect the interpolated mesh size and distributed to
both domains sharing them. Each domain is then discretized using a fully
sequential Delaunay triangulator.
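The workload prediction mentioned above can be sketched in two dimensions by integrating the interpolated mesh-size field over the background triangles; the triangle geometry and size field are hypothetical, and the per-triangle estimate is simply the triangle area divided by the area of an ideal equilateral element of the locally interpolated size:

```python
import math

def tri_area(a, b, c):
    """Area of a 2D triangle given its vertex coordinates."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def predicted_load(triangles, coords, sizes):
    """Estimate the number of elements the background triangles will produce.
    triangles: list of vertex index triples; coords: {vertex: (x, y)};
    sizes: {vertex: prescribed mesh size at that boundary node}."""
    total = 0.0
    for (i, j, k) in triangles:
        h = (sizes[i] + sizes[j] + sizes[k]) / 3.0  # interpolated mesh size
        area = tri_area(coords[i], coords[j], coords[k])
        total += area / (math.sqrt(3) / 4.0 * h * h)  # ideal equilateral elements
    return total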

*Daniel Rypl
2005-12-03*