Next: Parallel Performance Up: Parallelization Strategy Previous: Domain Decomposition


Parallel Computing Scheme

In the current implementation, a master and slaves parallel computing scheme (Fig. 3.10) has been adopted. In the beginning phase, the input data are read, parsed, and checked by the master processor. The constituted model is then broadcast to all slave processors. For each non-boundary model entity, its approximate load level is calculated on the first available slave processor. The load level in terms of the expected number of elements is estimated by integrating the mesh density over the model entities (according to the Eqs (3.10), (3.11), and (3.12)). Since the local mesh size variation is generally expressed in the form of a rational function, a numerical integration over the cells of an auxiliary tree structure is employed. The depth of this auxiliary tree structure in a particular parametric direction is controlled by the size of terminal cells in the parametric space and by the ratio of weights corresponding to corners of terminal cells in that direction. Note that a refinement prescribed along a boundary or prescribed at a vertex fixed to the investigated model entity or its boundary is also taken into account (although only roughly) in order to improve the load level estimate. It should be mentioned, on the other hand, that the above described assessment of the load level exhibits only a low level of accuracy. This is a direct consequence of the reduced flexibility of the mesh size control which resides in the fact that the final element size is predetermined by the size of individual cells of the parametric tree structure consisting always of an integer number of cells. The next source of inaccuracy can be found in the application of the ``1:2 rule'' resulting in a hardly predictable tree refinement. The total load level gathered from individual non-boundary model entities is then broadcast to slaves. After that, non-boundary model entities (subdomains on the model level) with an excessive work load are further split into smaller subdomains using the first available slave processor. Note that this step is not realized in the current implementation as it introduces some additional problems concerning the tree compatibility on model entity interfaces. The completed domain decomposition, together with the work load of individual subdomains, is collected by the master and then again broadcast to slave processors. In the following phase, a dynamic load balancing mechanism is applied. A subdomain (not yet assigned) with the largest estimated work load is assigned to the first available slave processor and the parametric tree of that subdomain is built on this processor. After all subdomains have been processed, the complete subdomain to processor assignment is broadcast to all slaves. In the next step, the boundary tree exchange process is looped on all slaves. For each subdomain, assigned to a slave processor, the boundary tree structures are extracted and sent to appropriate subdomains (possibly on a different processor) which will update their basic tree structure accordingly, and which, if necessary, will invoke a further boundary tree exchange. The role of the master processor in this process consists in a ``listening'' to exchange messages between the processors to detect the completion of the process. Once the parametric tree structures have been updated with respect to all boundary tree structures, the mesh generation (application of templates) starts on the slave processors followed immediately by the mesh smoothing. The master processor is only notified about the numbers of generated elements and nodes in the subdomain and on its boundary. These numbers are used to setup the final numbering ranges which are broadcast to all slaves. Each slave processor then performs the renumbering of subdomains assigned to it. Note that the aim of this renumbering is to get a continuous numbering without any optimization in terms of the bandwidth or front minimization. In the final phase, the output data are written on a local device of each processor. The domain decomposition is output on the master processor while the mesh data, corresponding to the discretization of individual subdomains, are stored on the slave processors.

Note that the concept of subdomain to processor assignment is non-deterministic as it depends not only on the load level, which is deterministic, but also on execution times required to build the parametric tree, which may vary from run to run. Also the tree compatibility enforcement process is typically of a non-deterministic nature because it involves multiple asynchronous communication between the processors. This may lead to differences in the parametric tree structures built in individual subdomains and used for the actual discretization. Therefore, the meshes generated in parallel are generally non-reproducible.



Next: Parallel Performance Up: Parallelization Strategy Previous: Domain Decomposition

Daniel Rypl
2005-12-07