Generation of Meshes in Parallel

In the tree-based algorithms described in [97,108,117,131], a single global tree data structure, used also for the spatial localization and mesh size control, is utilized as means of the domain decomposition with the terminal cell being the unit of complexity for a proper load balancing. The individual subdomains are formed by a set of cells (appropriately classified with respect to the model) assigned to each processor on the basis of inertial recursive bisection. Since the tree data structure contains information frequently accessed by any of the subdomains, storing the tree on a single processor would require an excessive number of small grain communication. Therefore, the whole tree data structure is stored on each processor, which results in allocation of fairly not negligible amount of memory on each processor. The distribution of the tree data structure among the processors (in terms of its subtrees) is mostly very far from the ideal decomposition and invokes an additional communication when the tree traversal is performed. Similarly as in the sequential version, the interior cells are discretized using templates. Virtually no communication is required during this process and the compatibility of the final mesh on processor interfaces is simply ensured by the use of appropriate templates. The boundary regions are discretized using an iterative element removal concept which, in this case, consists of three steps: (i) tree repartitioning (only boundary cells are considered), (ii) element removal and (iii) reclassification of boundary cells. While the role of the first step is to ensure a good load balance via the cell migration between processors, the reclassification in the last step eliminates those cells which no longer cover any portion of the subdomain still to be meshed. The difficulty of the element removal in parallel resides in the fact that a knowledge of the tree neighbourhood, which needs not be stored on the same processor, is required. In that case, the element removal is aborted. The efficiency of the element removal stage on a given processor is defined as the ratio of the number of performed element removals to the number of attempted element removals. The efficiency can decrease significantly after only a few iterations because information required to perform element removals is located almost always on a different processor. As the efficiency falls down the processor set is appropriately reduced in order to avoid a deadlock. It is obvious that the discretization of a boundary region is done at a significantly reduced parallelism and involves a lot of small grain communication required by the tree repartitioning. This considerably decreases the overall performance of the algorithm. Since it is very likely (due to the cell migration) that the final mesh will be scattered across processors with no real structure, it is necessary to repartition the final mesh with the full set of processors.

The algorithm proposed in [76,132] is based on the advancing front technique. The initial domain splitting is done using the greedy algorithm [21] applied to a background grid (possibly from the previous step of an adaptive analysis) defining the desired spatial distribution of the element size and shape. This decomposition is then subjected to a load balancing with surface to volume ratio optimization based on a relative gain concept. In each subdomain, the mesh is generated on the basis of an element removal algorithm. Assuming that the boundary of the domain has already been discretized and distributed with respect to the domain decomposition, the initial front is assembled as appropriate parts of the discretized boundary. In the case that there are no boundary elements in a particular subdomain, the initial front is formed by sides of the first element created ``safely'' inside that subdomain. The front then propagates inside the subdomain until it approaches the interprocessor boundary, or closes up with an existing front inside the subdomain, or becomes empty. The mesh generation in the remaining space along the interprocessor boundaries is done at the latest stage, usually at a considerably reduced parallelism (because of the reduced set of processors), and requires a lot of small grain communication or a new background data redistribution. In either case, this last stage of the mesh generation reduces the overall performance significantly.

Another approach for a parallel triangulation of 2D domains using the advancing front technique with a background mesh has been presented in [64]. In this approach, each element of the triangular background mesh is considered to be a separate subdomain. If there are too many elements to be generated in any subdomain (possibly leading to the disturbance of the load balance), then this subdomain is split into three new subdomains by generating a node in the centroid of the original subdomain and connecting it with the vertices of that subdomain. This splitting is not accomplished if too distorted subdomains would be created, which would consequently spoil the shape of elements generated in these subdomains. Since the total number of subdomains is typically much greater than the number of available processors, the individual subdomains are discretized successively using a set of processors marked as workers. The process of assignment of subdomains to worker processors is controlled by one master processor which is also responsible for the discretization of those subdomains that seem to comprise just one element. The compatibility between adjacent subdomains is ensured by sharing the same mesh size specification of the background mesh by these subdomains and by a specific algorithm employed for the discretization of boundary segments. However, this guarantee seems to be not rigorous if worker processors with different arithmetics would be used or if an extension to 3D would be considered.

The method described in [157] uses an artificial intelligence based technique to partition a background mesh into subdomains. The subdomains are created by a recursive bisection driven by a genetic algorithm. The genetic algorithm is adopted to implement an optimization module with an objective function that attains a maximum value when the numbers of generated elements are equal in both the subdomains and the number of interfacing edges is at its minimum. The advance knowledge concerning the number of elements in the final mesh is obtained by training a neural network to predict the number of elements to be generated per element of the background mesh. The subdomains are then mapped onto individual processors and discretized concurrently. The compatibility of the final mesh on subdomain interfaces is again guaranteed by sharing the same mesh parameters associated with the background mesh and by a specific algorithm employed for the discretization of subdomain boundaries. The disadvantage of this approach consists in a significant computational cost of the genetic algorithm as the size of the background mesh increases. Nevertheless, this can be eliminated by parallelization of the genetic algorithm. The effectiveness of the predictive module is depending on the extent of training undertaken. Thus application of a neural network trained on small or intermediate size meshes to a large mesh is likely to fail.

Also the technique suggested in [159] is achieving parallelism by the discretization of independent partitions in parallel. Various partitioning schemes (the greedy algorithm, orthogonal, inertial, or spectral recursive bisection) are available to accomplish the domain decomposition of a background grid initially generated by an algorithm using a quadtree node distribution. In the next step, subdomain boundaries are formed from the mesh decomposition data. The individual subdomains are then refined in parallel using again the quadtree data structure for the node distribution. Since the original decomposition applied to the refined mesh may be unbalanced, the domain decomposition is optimized using the relative gain concept or simulated annealing based algorithm.

The concept of a parallel mesh generator based on the Delaunay triangulation has been outlined in [135]. The initial Delaunay mesh built from boundary nodes is used as a background grid for the domain decomposition. Since the work load prediction, based on the mesh size specification interpolated from the boundary nodes, provides only approximate values, the domain splitting, using the greedy algorithm [21], is carried out in such a way that the resulting number of subdomains is at least twice as large as the number of available processors. The discretizations corresponding to the subdomain boundaries are then enriched (a non-trivial job in 3D) to reflect the interpolated mesh size and distributed to both subdomains sharing it. The mesh generation is then performed in the framework of a master and slaves parallel paradigm using a fully sequential Delaunay triangulator on the slave processors. The individual subdomains are successively (with respect to the decreasing level of the estimated work load) submitted to available slave processors to perform the actual discretization. In this way, a very good dynamic load balancing is achieved.

The method described in [160] is based on a 2D constrained Delaunay triangulation. The domain partitioning is done manually (the process of an automation is studied) by introducing artificial boundaries usually in a polyline form. These boundaries are treated as constrained segments in the consequent discretization process, which is done in parallel using the classical Bowyer-Watson algorithm [3,4] for a point insertion. However, the cavity, consisting of triangles interacting with the point, insertion of which violates the Delaunay property, is not allowed to expand across interprocessor boundaries as they are constrained. If the point to be inserted is too close to a constrained edge (falling into the circle circumscribed to that edge), the constrained edge is split by introducing a new node in the middle of this edge. To ensure the compatibility of the mesh on the interprocessor boundary, the constrained edge splitting is also sent as a request to the neighbouring processor. Caution should be taken to avoid a multiple splitting of the same edge as this may be an independent decision of adjacent processors. The problem of this approach is how to ensure a favourable load balancing because there are some restrictive conditions on the placement of artificial boundaries in order to avoid generation of poorly shaped elements. Also the extension of this approach into 3D is not straightforward as the constrained Delaunay triangulation does not generalize to three dimensions without further modifications [30].

The Delaunay triangulation is also used in the work reported in [150,162]. The mesh partitioning and distribution is accomplished dynamically during the discretization process. It is based on a modified Cartesian coordinate bisection into blocks, number of which is as close as possible to (but never greater than) the number of available processors. The partition separators are obtained by iteratively solving a linear system of equations derived from a spring analogy in which each block is represented as three linear springs in perpendicular directions with a stiffness proportional to the ratio between the number of elements in the block and block length along the spring direction. However, other incremental partitioning schemes which take the geometry into account can be used under the assumption that they can be parallelized. The actual mesh distribution is done using the mesh migration which is performed in three steps - interprocessor front locking, transfer of mesh entities, and interprocessor front unlocking. The same mechanism of the mesh migration is adopted to retrieve off-processor elements when internal nodes are inserted into the mesh using the Bowyer-Watson algorithm [3,4]. Since the case, when the node being inserted will affect also elements on neighbouring processors, may occur relatively frequently and the requested remote elements may be denied because of interprocessor front locking, a point insertion buffering has been introduced. In this strategy, instead of insisting on an element request and waiting for the response from a remote processor, the element request is stored in an internal queue and the discretization continues by insertion of another point. When the queue becomes full or reaches a prescribed limit the mesh generation is halted until some or all buffered requests are answered.

*Daniel Rypl
2005-12-07*