Three parallel computers were used for numerical experiments with the algorithms for large-scale problems. The most powerful of them is the AC3 computer located at Cornell University. It comprises two clusters, v1 and vplus. The v1 cluster, with 64 nodes of 4 processors each, was used for our purposes. This cluster is homogeneous: all processors are identical and have the same amount of memory. The second parallel computer is a cluster of PCs at the Czech Technical University with 16 processors. This cluster is heterogeneous because the memory varies from 512 MB to 1 GB and the processor frequency varies from 400 MHz to 633 MHz. The last computer was a massively parallel computer with 8 processors and 4.2 GB of shared memory. These platforms are hereafter denoted AC3, K132 and HP, respectively.

The FETI method was used for the solution of large-scale problems on the AC3 computer. A square domain was used for the numerical experiments because its decomposition into smaller subdomains always leads to a large number of boundary nodes and also to a large number of Lagrange multipliers. A finite element mesh with 480 elements in each direction was considered on the original domain. The total number of unknowns was 461760, and the stiffness matrix would have contained about 440 million entries had skyline storage been used. The total number of unknowns and the total number of finite elements were preserved in each decomposition. In total, 7 different regular decompositions were made. Two different orderings of nodes and unknowns were considered in order to investigate their influence on elapsed time and memory requirements. The first ordering was based on the Schur complement method, where the internal unknowns must be ordered first and the boundary unknowns last. The second ordering was optimal for rectangular domains. The memory requirements of the two orderings are collected in Table 3. The optimal ordering of unknowns is significantly better than the ordering for the Schur complement method.
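The problem size quoted above can be checked with a short calculation. The following is only a sketch under assumptions that the text does not state explicitly: two unknowns per node, four-node quadrilateral elements, one edge of the square fully fixed, row-by-row node numbering, and a constant skyline column height. The function names `unknowns` and `skyline_entries_estimate` are hypothetical.

```python
def unknowns(n, dofs_per_node=2):
    """Unknowns on an n x n element square mesh with one edge fully fixed (assumed)."""
    nodes_per_side = n + 1
    # All nodes of the square minus the nodes on the one fixed edge (assumption).
    free_nodes = nodes_per_side * nodes_per_side - nodes_per_side
    return dofs_per_node * free_nodes


def skyline_entries_estimate(n, dofs_per_node=2):
    """Rough count of stored skyline entries for row-by-row node numbering."""
    # With row-by-row numbering, a node couples at most with the node that lies
    # one row and one column back, i.e. (n + 1) + 1 nodes earlier (assumption),
    # so every column of the skyline is taken to be about this tall.
    column_height = dofs_per_node * (n + 2)
    return unknowns(n, dofs_per_node) * column_height


n = 480
print("unknowns:", unknowns(n))
print("estimated skyline entries:", skyline_entries_estimate(n))
```

For n = 480 the first function gives 461760 unknowns, which matches the figure in the text, and the estimate comes out at roughly 4.45e8 entries, the same order as the "about 440 million" quoted above. The exact skyline count depends on the actual numbering and boundary conditions, so the second number should be read as an approximation only.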

Very good performance is obtained with the FETI method. Computation of the rigid body motions is one of the steps of the FETI method, and it is as demanding as a direct solver. The number of algebraic operations required depends on the bandwidth of the matrix. A finer decomposition of the original domain leads to smaller bandwidths of the subdomain matrices and thus to fewer algebraic operations, which results in very good speedup. When many processors are used, the speedup starts to deteriorate because the processors are not fully exploited and a lot of communication is performed.
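The effect of the bandwidth on the elimination step can be illustrated with the standard operation estimate for a banded factorization, which grows as the number of unknowns times the square of the half-bandwidth. The sketch below assumes a regular p x p decomposition of the 480 x 480 mesh, two unknowns per node, row-by-row numbering on each subdomain, and no constrained unknowns. The function name `subdomain_factorization_cost` is hypothetical, and the decompositions actually used in the experiments may differ.

```python
def subdomain_factorization_cost(n, p, dofs_per_node=2):
    """Estimated operations to factorize one subdomain of a p x p decomposition."""
    m = n // p                                  # elements per subdomain side
    ndof = dofs_per_node * (m + 1) ** 2         # unknowns on one subdomain
    half_bandwidth = dofs_per_node * (m + 2)    # row-by-row numbering (assumed)
    return ndof * half_bandwidth ** 2           # banded factorization ~ ndof * b^2


n = 480
base = subdomain_factorization_cost(n, 1)
for p in (1, 2, 4, 8):
    cost = subdomain_factorization_cost(n, p)
    # Ratio of the undecomposed cost to the cost on one subdomain.
    print(p * p, "subdomains, cost reduction per subdomain:", round(base / cost, 1))
```

Under this estimate the work per subdomain falls faster than the number of subdomains grows, which is consistent with the good speedup reported for the elimination step. The estimate ignores communication and load imbalance, which are the effects that degrade the speedup for many processors.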

The comparison of the different computers was carried out on three problems. The data for these problems are summarized in Table 5, where ND stands for the number of subdomains, NE denotes the number of finite elements on each subdomain, NSKY is the number of stored entries, NDOF stands for the number of unknowns on each subdomain and NM denotes the number of Lagrange multipliers.

The comparison of elapsed times is presented in Table 6, which lists the time spent in elimination (computation of rigid body motions), the time spent in the modified conjugate gradient method and the total time of computation. Table 6 yields the important result that clusters are comparable with massively parallel computers in terms of computational power, yet they are much cheaper. It is also important to mention that the usual MPI library was used for communication among processors, so the shared memory of the massively parallel computer was in fact not used directly.

*Daniel Rypl
2005-12-03*