The realistic description and modelling of material failure is one of the current challenges in structural mechanics. Analyses of failure processes require complex FE discretizations and advanced constitutive models. Such analyses create a demand for large-scale computing, which must be feasible in terms of both time and available resources. Parallelization reduces the computational time and, in some cases, makes very large analyses feasible at all. The available architectures of parallel computers can be classified into three classes: shared memory systems, distributed memory computers, and virtual shared memory computers. Among the many parallel programming models, only message passing is available on all platforms. This broad portability is the main reason why the message passing programming model was selected.

Explicit integration schemes are very popular for solving time-dependent problems. In this study, the central difference time-stepping algorithm is adopted, assuming a lumped mass matrix and damping expressed in Rayleigh form. Domain decomposition, based on the node-cut approach [11], [12], provides a tool for formulating an efficient parallel algorithm. The nonlocal approach is recognized as a powerful localization limiter, which is necessary to capture the localized character of the solution, for example in the tension regime of quasi-brittle materials. Due to their non-local character (the local response depends on the material state in the neighborhood), these models require special data exchange algorithms to be developed in order to efficiently handle the non-local dependency between partitions.
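The central difference scheme with a lumped mass matrix can be sketched as follows. This is a minimal illustration, not the actual code of the study: it assumes a linear internal force f_int = K u, a diagonal (lumped) mass matrix stored as a vector, and Rayleigh damping C = alpha*M + beta*K, so the acceleration update reduces to an element-wise division.

```python
import numpy as np

def central_difference(M, K, f_ext, u0, v0, dt, nsteps, alpha=0.0, beta=0.0):
    """Explicit central-difference integration.

    M      -- vector of lumped (diagonal) masses
    K      -- stiffness matrix (linear internal force assumed in this sketch)
    f_ext  -- constant external load vector
    u0, v0 -- initial displacement and the velocity at the half step -dt/2
    """
    u = u0.copy()
    v = v0.copy()                       # velocity at t - dt/2
    C = alpha * np.diag(M) + beta * K   # Rayleigh damping matrix
    for _ in range(nsteps):
        f_int = K @ u
        # With a diagonal mass matrix no equation system has to be solved:
        # the update is a cheap element-wise division.
        a = (f_ext - f_int - C @ v) / M
        v = v + dt * a                  # velocity at t + dt/2
        u = u + dt * v                  # displacement at t + dt
    return u, v
```

For an undamped single-DOF oscillator with unit mass and stiffness, the computed displacement closely follows the analytical solution cos(t), which provides a quick sanity check of the update formulas.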

The adopted parallelization strategy is based on mesh partitioning.
In general, one can distinguish two dual domain decomposition
approaches: the node-cut and element-cut concepts. The node-cut approach
partitions the mesh into a set of non-overlapping groups of
elements. The nodes at inter-partition boundaries are marked as
*shared nodes* and are assigned local degrees of
freedom. At local partition nodes, the equilibrium
equations can be solved using the standard serial algorithm. At
shared nodes, however, data has to be exchanged between neighboring partitions
to guarantee the correctness of the algorithm. In our case,
the internal nodal force contributions are exchanged for shared
nodes. This mutual exchange of contributions has to be
performed at each time step. A similar process has to be invoked at the
very beginning, when the mass matrix is assembled.
The dual element-cut approach (see [12], [13])
partitions the mesh into sets of non-overlapping
groups of nodes, leading to duplication of finite elements, whereas
the node-cut concept leads to duplication of nodes.
Since the computational demands associated with element computations typically exceed those
associated with nodes, the node-cut approach is computationally more
efficient. For this reason, only the node-cut approach will be
considered.
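The shared-node exchange can be illustrated on a one-dimensional chain of spring elements split into two partitions. The following sketch is purely illustrative (the message passing is emulated by summing the partial force vectors, and global node numbering is kept for simplicity): it shows that adding the partial internal forces assembled independently on each partition reproduces the serial assembly, including the value at the shared node.

```python
import numpy as np

def element_internal_force(k, u_i, u_j):
    """Internal force of a linear two-node spring element of stiffness k."""
    f = k * (u_j - u_i)
    return np.array([-f, f])            # contributions to nodes i and j

def assemble_partition(elements, u, k=1.0, n_nodes=5):
    """Assemble internal nodal forces from one partition's elements only."""
    f = np.zeros(n_nodes)
    for i, j in elements:
        fe = element_internal_force(k, u[i], u[j])
        f[i] += fe[0]
        f[j] += fe[1]
    return f

u = np.array([0.0, 0.1, 0.3, 0.2, 0.5])
part_a = [(0, 1), (1, 2)]   # partition A: elements left of the cut
part_b = [(2, 3), (3, 4)]   # partition B: elements right of the cut; node 2 is shared

f_a = assemble_partition(part_a, u)
f_b = assemble_partition(part_b, u)

# Exchange step: in an MPI code each partition would send its partial force
# at shared node 2 to the neighbouring partition and add the received value.
# Here the communication is emulated by summing the partial vectors.
f_parallel = f_a + f_b
f_serial = assemble_partition(part_a + part_b, u)
assert np.allclose(f_parallel, f_serial)
```

The same exchange pattern applies to the lumped mass assembly performed once at the beginning of the analysis.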
When a non-local constitutive model is considered, its
non-local dependency requires additional
inter-partition communication to compute the non-local contributions
for points near the inter-partition boundary, where the non-local
quantity consists of local as well as remote contributions. To avoid
redundant requests for the same remote values from different local
integration points (leading to an extremely fine communication
pattern, which must be avoided), a band of *remote-copy
elements* is introduced at each partition. After the local
quantities that undergo non-local averaging are computed at every
local element, they are exchanged and stored at the corresponding
remote-copy elements. The remote-copy elements serve only to store
copies of the relevant quantities undergoing non-local averaging.
Using these local copies, the non-local values can be computed easily,
without invoking costly communication.
The typical central difference algorithm, extended with two communication
schemes (the first exchanging shared node
contributions, the second performing the remote-element exchange), is
presented in Table 1.
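The role of the remote-copy band can be illustrated with a one-dimensional set of integration points and a bell-shaped weight function; both are chosen here only for illustration and are not taken from the study. Once copies of the neighbouring partition's local quantities within the interaction radius are stored locally, each partition reproduces the serial non-local average without further communication.

```python
import numpy as np

def bell_weight(r, R):
    """Bell-shaped nonlocal weight, identically zero beyond the radius R."""
    w = (1.0 - (r / R) ** 2) ** 2
    return np.where(r < R, w, 0.0)

def nonlocal_average(x_src, kappa_src, x_targets, R):
    """Nonlocal counterpart of the local quantity kappa at each target point,
    averaged over all source points within the interaction radius R."""
    out = np.empty(len(x_targets))
    for n, xt in enumerate(x_targets):
        w = bell_weight(np.abs(x_src - xt), R)
        out[n] = np.sum(w * kappa_src) / np.sum(w)
    return out

R = 0.25
x = np.linspace(0.0, 1.0, 11)   # integration-point coordinates
kappa = x ** 2                   # some local quantity undergoing averaging

# Split the points at x = 0.5; partition A additionally receives a
# remote-copy band holding the neighbour's points that lie within R of the cut.
local_a = x <= 0.5
band_a = (~local_a) & (x < 0.5 + R)

# Partition A averages its own points using local values plus the copies;
# after the remote-copy exchange the result matches the serial average.
x_src = np.concatenate([x[local_a], x[band_a]])
k_src = np.concatenate([kappa[local_a], kappa[band_a]])
avg_a = nonlocal_average(x_src, k_src, x[local_a], R)
avg_serial = nonlocal_average(x, kappa, x[local_a], R)
assert np.allclose(avg_a, avg_serial)
```

Note that the remote copies only need to be refreshed once per time step, after the local quantities have been updated, which keeps the communication pattern coarse.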

A 3D three-point-bending specimen with a notch has been analyzed. The geometry is shown in Fig. 7. The constitutive model used is a non-local rotating crack model with transition to scalar damage (see [10]). The mesh contains 1964 nodes and 9324 linear tetrahedral elements. The constitutive properties were: density = 2.5, Young's modulus = 20e3 MPa, Poisson's ratio = 0.2, tensile strength = 2.5 MPa, ultimate tensile strain (at which the tensile stress vanishes) = 0.0005, non-local interaction radius = 0.025 m, shear transition coefficient = 0.6, and stress transition coefficient = 0. The partitions were generated prior to the analysis and were kept constant throughout the analysis (static load balancing). The analysis was performed on a workstation cluster running under the Windows NT and Linux operating systems, and on a massively parallel SP2 machine. The cluster consists of six dual-processor DELL 610 workstations, running at 400 and 450 MHz, connected by Fast Ethernet using a 3Com Superstack switch. This represents a heterogeneous cluster combining shared and distributed memory. The MPI/Pro library for Windows NT (MPI Software Technology, Inc.), supporting both distributed and shared memory communication, and the MPICH library for Linux were used. The IBM SP2 (CTU computing center) is a heterogeneous machine equipped with P2SC processors at 120 and 160 MHz, running under AIX 4.1. The processors are connected by an HPS switch, allowing simultaneous bidirectional transfer of 40 MB/sec between any two nodes. The native MPI library was used. The results achieved on the PC cluster and the SP2 machine are presented in Fig. 8. Note that the heterogeneity of the computing platforms has been taken into account neither in the mesh partitioning (all partitions are equally load balanced) nor in the speedup and efficiency evaluation.
Since the single processor computation has always been performed on the most powerful processor, the speedup is slightly underestimated whenever a slower processor has participated in the calculation. The degradation of the speedup profile is also caused by the adopted static load balancing. Since the computational complexity in some regions increases during the analysis (due to the evolving non-linear zone caused by strain softening), the load balance is disturbed, leaving some processors idle. This effect becomes more significant as the number of processors increases. Despite these facts, the achieved speedup and efficiency are significant, considerably reducing the computational time.

*Daniel Rypl
2005-12-03*