Next: Figures Up: Parallel Mesh Generation Previous: Parallel Implementation


Examples

A set of examples is presented to demonstrate both the sequential and the parallel performance of the algorithm for the parallel generation of unstructured meshes.

The sequential performance has been examined on an SGI Indigo2 workstation with a 195 MHz R10000 processor and 128 MB of memory. For each example (Figs 3.11 - 3.13), a sequence of meshes of varying density has been generated. The resulting relationships between the number of nodes and the generation time are shown for the individual examples in Figures 3.14 - 3.17. Note that the number of nodes rather than the number of elements is used as the independent variable, because the number of elements is slightly misleading due to the mixed nature of the mesh. Also note that the elapsed times do not include the output printing, which is typically of linear computational complexity and is therefore not considered relevant. The achieved results are in excellent agreement with the predicted linear dependence between the number of nodes and the generation time (Section Computational Complexity (Parallel Mesh Generation)).

The parallel performance has been investigated on the IBM SP2 machine (a concise description may be found in Section Parallel Implementation (Parallel Mesh Generation)), which in theory offers up to 15 processors of the same type (Power2). In practice, significantly fewer processors are available because of the actual configuration of the SP2 machine. In the presented work, up to 11 processors have been used for the parallel mesh generation. Since the master and slaves parallel computing scheme has been adopted, a separate processor must be allocated for the master process, which reduces the maximum number of slave processors to 10. Note that the master processor has not been considered in the evaluation of the speedup and efficiency.
This is justified because it has been verified that the master process can run together with one slave process on the same processor without any impact on the performance. However, due to the actual configuration of the SP2 machine, this is possible only for up to 7 slave processes running on "interactive" processors.

For each example, two meshes of different sizes have been considered for the parallel discretization. The size of the smaller one has been chosen with respect to the available memory so that the discretization can be carried out on a single processor, which is important for the evaluation of the speedup. The larger mesh is about three times as big as the smaller one. In this case, the speedup has been calculated only approximately, by estimating the time required to accomplish the discretization on a single processor. This estimate is based on the number of generated nodes, taking into account the linear computational complexity and a small overhead (about 2 seconds).

The achieved speedup is depicted in Figures 3.18 - 3.25. The execution time, speedup, efficiency, and load balance in terms of the execution time, number of nodes, and memory requirements (Section Parallel Performance (Parallel Mesh Generation)) are summarized in Tables 3.5 - 3.12. Note that the timing, measured in whole seconds, does not include the output printing. The estimated speedups and efficiencies in the tables corresponding to the larger meshes are marked by an asterisk.

Although the domain decomposition has been applied only on the model level, a considerable speedup has been achieved in all presented examples. The worst profile has been obtained for a graded 3D mesh of a mechanical joint (Figs 3.22 and 3.23). The reason is a relatively poor domain decomposition, which results in a load imbalance in terms of the number of nodes (and consequently memory requirements), as documented in Tables 3.9 and 3.10.
On the other hand, a superb parallel performance has been observed for a 2D mesh of a chair (Figs 3.18 and 3.19 and Tabs 3.5 and 3.6). Very good results have also been obtained for a 3D mesh of a junction of two pipes (Figs 3.24 and 3.25 and Tabs 3.11 and 3.12).

Note that the load balance based on the execution time of individual processors is equal to 100 % in all presented examples. This is caused primarily by the fact that the considered meshes are relatively small with respect to the time needed to accomplish the actual discretization. When meshing a 3D object, about half of the total generation time is consumed by the parametric tree building, which is synchronized at the end by the tree compatibility enforcement. For 2D meshes, the share of the time consumed by the tree building is significantly smaller (about 15 %). However, the work load is usually so evenly distributed among the processors that almost no difference in the elapsed time is observed, especially when it is measured in whole seconds.

Note that the meshes generated in parallel are generally not reproducible, which is a consequence of the non-deterministic nature of both the subdomain to processor assignment and the process of tree compatibility enforcement. Also note that meshes generated using different numbers of processors need not be identical, because the different domain decomposition imposed by a different number of processors may result in different tree structures and therefore in different meshes. Nevertheless, the differences in the number of generated mesh entities are negligible.
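
The approximate speedup evaluation used for the larger meshes can be illustrated by a minimal Python sketch. It is not the procedure used to produce the tables. It assumes that the single-processor time of the larger mesh is obtained by scaling the measured time of the smaller mesh linearly with the number of nodes, with a fixed overhead of about 2 seconds kept out of the scaling, and that the efficiency is the speedup divided by the number of slave processors (the master is not counted). The function names and all numerical values below are hypothetical.

```python
def estimate_sequential_time(t_small, n_small, n_large, overhead=2.0):
    """Estimate the single-processor time for the larger mesh.

    Assumption: generation time is linear in the number of nodes,
    plus a constant overhead (about 2 seconds) that does not scale.
    """
    return overhead + (t_small - overhead) * n_large / n_small


def speedup_and_efficiency(t_sequential, t_parallel, n_slaves):
    """Return (speedup, efficiency); the master processor is not counted."""
    speedup = t_sequential / t_parallel
    efficiency = speedup / n_slaves
    return speedup, efficiency


# Hypothetical measurements, for illustration only.
t1_small = 62.0      # seconds, smaller mesh on one processor
n_small = 100_000    # nodes in the smaller mesh
n_large = 300_000    # nodes in the larger mesh (about three times as big)
tp_large = 23.0      # seconds, larger mesh on 10 slave processors

t1_large = estimate_sequential_time(t1_small, n_small, n_large)
s, e = speedup_and_efficiency(t1_large, tp_large, n_slaves=10)
print(f"estimated T1 = {t1_large:.0f} s, speedup = {s:.2f}, efficiency = {e:.0%}")
```

Because the timing is carried out in whole seconds, a speedup estimated in this way is only as accurate as the rounded times it is computed from.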





Daniel Rypl
2005-12-07