Code_Aster: General Information on Direct Linear Solvers and Use of MUMPS
Code_Aster default
Title: General information on direct linear solvers [...] — Date: 09/12/2019
Responsable: BOITEAU Olivier — Key: R6.02.03 — Revision: 0ac07fed0038
Summary
In thermomechanical simulations with Code_Aster, the bulk of the computational cost often comes from building and solving linear systems. To carry out these resolutions more efficiently, Code_Aster chose to integrate the direct method implemented in the MUMPS package ('MUltifrontal Massively Parallel sparse direct Solver'; P. R. Amestoy, J.-Y. L'Excellent et al.; CERFACS/CNRS/ENS Lyon/IRIT/INRIA/Université de Bordeaux), in addition to its in-house multifrontal solver (C. Rose) and its other solvers: 'LDLT', 'GCPC' and 'PETSC'.
In distributed parallel and Out-Of-Core mode, the Aster+MUMPS coupling yields CPU gains of about a factor of ten on 32 processors, with a RAM consumption, a functional perimeter and an accuracy of the results at least as good as those of the native multifrontal solver.
On large problems, activating BLR compression can even gain an additional factor of two or three. Through the parallelism it provides and its advanced features (pivoting, pre/postprocessing, solution quality control…), this product greatly facilitates running standard studies. Moreover, it often remains the only viable alternative for exploiting certain modelings/analyses (quasi-incompressible, X-FEM…) or for running very large studies (as an exact direct solver or as a preconditioner, cf. option 'LDLT_SP' of 'GCPC'/'PETSC').
In the first part of the document we summarize the general issues of solving linear systems, then we present the main families of sparse direct solvers and their variants in public-domain libraries: background worth having in mind before examining, in the second part, the MUMPS package through its main and advanced features. We then detail the numerical, software and functional aspects of its integration into Code_Aster. Finally, we conclude with some numerical results.
For more details and advice on the use of linear solvers, one may consult the specific user notes [U4.50.01]/[U2.08.03]. The related issues of improving the performance (RAM/CPU) of a calculation and of using parallelism are also the subject of detailed notes: [U1.03.03] and [U2.08.06].
Warning : The translation process used on this website is a "Machine Translation". It may be imprecise and inaccurate in whole or in part
and is provided as a convenience.
Copyright 2020 EDF R&D - Licensed under the terms of the GNU FDL (http://www.gnu.org/copyleft/fdl.html)
Contents
1 General information on direct solvers........................................................................................... 4
1.1 Linear system and associated resolution methods..................................................................... 4
1.2 Linear algebra libraries................................................................................................................ 6
1.3 Direct methods: the principle....................................................................................................... 7
1.4 Direct methods: various approaches........................................................................................... 9
1.5 Direct methods: main stages..................................................................................................... 10
1.6 Direct methods: difficulties........................................................................................................ 12
2 The MUMPS package...................................................................................................................... 14
2.1 History........................................................................................................................................ 14
2.2 Main features............................................................................................................................. 15
2.3 Zooms on some technical points............................................................................................... 16
2.3.1 Pivoting............................................................................................................................ 16
2.3.2 Iterative refinement.......................................................................................................... 17
2.3.3 Reliability of the calculations........................................................................................... 18
2.3.4 Memory management (In-Core versus Out-Of-Core)...................................................... 20
2.3.5 Management of singular matrices................................................................................... 21
2.3.6 'Block Low-Rank' (BLR) compression............................................................................. 21
3 Implementation in Code_Aster......................................................................................................... 24
3.1 Context/synthesis...................................................................................................................... 24
3.2 Two types of parallelism: centralized and distributed................................................................ 24
3.2.1 Principle........................................................................................................................... 24
3.2.2 The various distribution modes....................................................................................... 25
3.2.3 Load balancing................................................................................................................ 26
3.2.4 Partitioning the Code_Aster objects................................................................................ 26
3.3 Memory management between MUMPS and Code_Aster....................................................... 27
3.4 Special management of the double Lagrange multipliers......................................................... 28
3.5 Perimeter of use........................................................................................................................ 29
3.6 Parameter settings and examples of use.................................................................................. 29
3.6.1 Parameters for using MUMPS via Code_Aster............................................................... 29
3.6.2 Monitoring........................................................................................................................ 30
3.6.3 Examples of use.............................................................................................................. 31
4 Conclusion....................................................................................................................................... 33
5 Bibliography..................................................................................................................................... 34
5.1 Books/articles/proceedings/theses…........................................................................................ 34
5.2 EDF technical reports................................................................................................................ 34
5.3 Internet resources...................................................................................................................... 34
6 Document version history................................................................................................................ 35
Generally speaking, solving this kind of problem calls for broader questioning than may at first appear:
• Does one have access to the matrix itself, or does one only know its action on a vector?
• Is the matrix sparse or dense?
• What are its numerical properties (symmetry, positive definiteness…) and its structural ones (real/complex, banded, block…)?
• Does one have to solve a single system (1.1-1), several simultaneously 2, or several consecutively 3? Or even several different successive systems whose matrices are very close 4?
• In the case of successive resolutions, can one reuse previous results to facilitate the next resolutions (cf. restart techniques, partial factorization)?
• What is the order of magnitude of the size of the problem, of the matrix and of its factor, compared to the processing capacity of the CPU and the associated memories (RAM, disk)?
• Does one want a very accurate solution or just an estimate (cf. nested solvers)?
• Does one have access to linear algebra libraries (and to their prerequisites: MPI, BLAS, LAPACK…) or must one rely on in-house products?
In Code_Aster, the matrix is built explicitly and stored in the MORSE format 5. For most modelings, the matrix is sparse (because of the finite element discretization), potentially ill-conditioned 6 and often real, symmetric and indefinite 7. In nonlinear or modal analyses, or during thermomechanical chaining, one often deals with 'multiple right-hand side' problems. The discrete contact-friction methods benefit from partial factorization capabilities, as does the domain decomposition method. In addition, Code_Aster also uses simultaneous-resolution scenarios (Schur complements for contact and substructuring…).
As for problem sizes, even if they grow year after year, they remain modest compared to CFD: on the order of a million unknowns, but for hundreds of time steps or Newton iterations.
Furthermore, from a 'middleware and hardware' point of view, the code now relies on many optimized and well-maintained libraries (MPI, BLAS, LAPACK, (Par)METIS, (Pt)SCOTCH, PETSc, MUMPS…) and is used mainly on clusters of SMP nodes (fast networks, large RAM and disk capacity). One therefore seeks above all to optimize the use of the linear solvers accordingly.
For 60 years, two types of techniques have disputed supremacy in the field: direct solvers and iterative solvers (cf. [Che05] [Dav03] [Duf06] [Gol96] [Las98] [Liu89] [Meu99] [Saa03]).
The former are robust and finish in a finite number of operations that is (theoretically) known in advance. Their theory is relatively mature and their variants for many types of matrices and software architectures are very complete. In particular, their multilevel algorithmics is well adapted to the memory hierarchies of current machines. However, they require storage capacities that grow quickly with the size of the problem, which limits the scalability of their parallelism 8, even though this parallelism can be split into several independent layers, thereby multiplying performance.
On the other hand, iterative methods are more scalable as the number of processors increases. Their theory abounds in 'open problems', especially in finite-precision arithmetic. In practice, their convergence in a 'reasonable' number of iterations is not always guaranteed: it depends on the structure of the matrix, the starting point, the stopping criterion… This kind of solver has more difficulty breaking through in industrial structural mechanics, where heterogeneities, nonlinearities and model junctions often accumulate and degrade the conditioning of the work operator. Moreover, they are not well suited to solving 'multiple right-hand side' problems efficiently, even though those are very frequent in the algorithmics of mechanical simulations.
[Figure 1.1-1: Two classes of methods for solving a linear system $Ku=f$: the direct ones (factorize $K=LDL^T$, then solve $Lw=f$, $Dv=w$, $L^Tu=v$) and the iterative ones.]
Contrary to their direct counterparts, it is not possible to propose the iterative solver that will solve any linear system: matching a type of algorithm to a class of problems is done on a case-by-case basis. They nevertheless offer other advantages that have historically established them for certain applications. At equivalent memory management, they require less memory than direct solvers, because one only needs the action of the matrix on an arbitrary vector, without truly having to store the matrix. In addition, one is not subject to the 'diktat' of the fill-in phenomenon that degrades the profile of the matrices: one can effectively exploit the sparsity of the operators and control the accuracy of the results 9. In short, the use of direct solvers belongs rather to the realm of technique, whereas choosing the right iterative method/preconditioner pair is rather an art! Despite its biblical simplicity on paper, solving a linear system, even a symmetric positive definite one, is not 'a long quiet river'. Between two evils, fill-in/pivoting and preconditioning, one must choose!
Note:
• A third class of methods tries to draw on the respective advantages of the direct and iterative ones: the Domain Decomposition (DD) methods [R6.01.03].
• The two large families of methods should be seen as complementary rather than competing. One often seeks to mix them: DD methods, preconditioning by incomplete factorization (cf. [R6.01.02] §4.2) or of multigrid type, iterative refinement at the end of a direct solver…
Since the emergence, in the 1970s/80s, of the first public 10 and private/vendor 11 libraries and their user communities, the offer has multiplied. The trend is of course to propose powerful solutions (vector computing, distributed parallelism with centralized memory at first, multilevel parallelism via threads) as well as 'toolkits' for handling linear algebra algorithms and their associated data structures. Let us quote, non-exhaustively: ScaLAPACK (Dongarra & Demmel 1997), SparseKIT (Saad 1988), PETSc (Argonne 1991), HyPre (LLNL 2000), TRILINOS (Sandia 2000)…
Note:
• To structure their use more effectively and to propose 'black box' solutions, macro-libraries have recently appeared. They gather a panel of these products, to which they add in-house solutions: Numerical Platon (CEA-DEN), Arcane (CEA-DAM)…
Concerning more specifically the direct methods for solving linear systems, about fifty packages are available. One distinguishes 'standalone' products from those embedded in a library, public ones from commercial ones, those handling dense problems from those handling sparse ones. Some work only in sequential mode, others support shared- and/or distributed-memory parallelism. Finally, some products are generalists (symmetric, nonsymmetric, SPD, real/complex…) while others are adapted to a very specific need/scenario.
A rather exhaustive list of all these products can be found on the site of one of the founding fathers of LAPACK/BLAS, Jack Dongarra [Don]. The table below (table 1.2-1) is an expurgated version: it retains only the direct solvers of the public domain and omits: CHOLMOD, CSPARSE, DMF, Oblio, PARASPAR, PARDISO, PaStiX (the other French direct solver besides MUMPS), S+, SPRSBLKKT and WSMP. This Internet resource also lists packages implementing iterative solvers, preconditioners and modal solvers, as well as many supporting products (BLAS, LAPACK, ATLAS…).
DIRECT SOLVERS License Support Real Complex F77 C Seq Dist SPD Gen
DENSE
FLAME LGPL yes X X X X X
LAPACK BSD yes X X X X X
LAPACK95 BSD yes X X 95 X
NAPACK BSD yes X X X
PLAPACK ? yes X X X X M
PRISM ? not X X X M
ScaLAPACK BSD yes X X X X M/P
Trilinos/Pliris LGPL yes X X X and C++ M
SPARSE
DSCPACK ? yes X X X M X
HSL ? yes X X X X X X
MFACT ? yes X X X M X
MUMPS PD yes X X X X X M X X
PSPASES ? yes X X X M X
SPARSE ? ? X X X X X X
SPOOLES PD ? X X X X M X
SuperLU Own yes X X X X X M X
TAUCS Own yes X X X X X X
Trilinos/Amesos LGPL yes X X M X X
UMFPACK LGPL yes X X X X X
Y12M ? yes X X X X X
Table 1.2-1. Extract from the Web page of Jack Dongarra [Don] on the free products
implementing a direct method; 'Seq' for sequential, 'Dist' for parallel ('M' OpenMP and 'P' MPI),
'SPD' for symmetric positive definite and 'Gen' for general matrices.
Note:
• A more detailed Internet resource, focused on sparse direct solvers, is maintained by another great name of numerical computing: T. A. Davis [Dav], one of the contributors to Matlab.
Note:
• For example, the symmetric and regular matrix $K$ below decomposes in the following $LDL^T$ form (without needing any permutation here, $P=Id$):

$K := \begin{bmatrix} 10 & & \mathrm{sym} \\ 20 & 45 & \\ 30 & 80 & 171 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 4 & 1 \end{bmatrix}}_{L} \underbrace{\begin{bmatrix} 10 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{D} \underbrace{\begin{bmatrix} 1 & 2 & 3 \\ 0 & 1 & 4 \\ 0 & 0 & 1 \end{bmatrix}}_{L^T} \qquad (1.3-1)$
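A quick check of (1.3-1), as a minimal numpy sketch using only the values of the text:

    # Minimal numpy check of the L·D·L^T example (1.3-1).
    import numpy as np
    K = np.array([[10., 20., 30.],
                  [20., 45., 80.],
                  [30., 80., 171.]])
    L = np.array([[1., 0., 0.],
                  [2., 1., 0.],
                  [3., 4., 1.]])
    D = np.diag([10., 5., 1.])
    assert np.allclose(L @ D @ L.T, K)   # K = L D L^T, no permutation needed (P = Id)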
Once this decomposition is carried out, solving the problem becomes much easier: it reduces to the simplest possible linear resolutions, those involving triangular or diagonal matrices. These are the famous 'forward/backward substitutions'. For example, in the case of an $LU$ factorization, the system (1.1-1) is solved via

$Ku=f,\quad PK=LU \;\Longrightarrow\; \begin{cases} Lv=Pf & \text{(forward substitution)}\\ Uu=v & \text{(backward substitution)}\end{cases} \qquad (1.3-2)$

In the first, lower triangular system (forward substitution) one determines the intermediate solution vector $v$, which then serves as right-hand side of the upper triangular system (backward substitution), whose solution is the vector $u$ of interest.
This phase is inexpensive (in the dense case, about $N^2$ operations versus $N^3$ for the factorization 14, with $N$ the size of the problem) and can thus be repeated many times while keeping the same factor. This is very useful when solving a 'multiple right-hand side' problem or when performing simultaneous resolutions.
In the first scenario, the matrix $K$ is fixed and the right-hand sides $f_i$ change successively, producing as many solutions $u_i$ (the resolutions are interdependent). This makes it possible to pool, and thus amortize, the initial cost of the factorization. This strategy is used abundantly, in particular in Code_Aster: nonlinear loops with periodic updating (or no updating) of the tangent matrix (e.g. the Aster operator STAT_NON_LINE), subspace or inverse power methods (without Rayleigh acceleration) in modal calculation (CALC_MODES), thermomechanical chaining with material characteristics independent of temperature (MECA_STATIQUE)…
In the second scenario, all the $f_i$ are known at the same time and the forward/backward substitutions are organized by blocks, to compute the independent solutions $u_i$ simultaneously. One can thus use more effective high-level linear algebra routines, and even save memory by storing the vectors $f_i$ in sparse form.
14 In the dense case, Coppersmith and Winograd (1982) showed that this algorithmic complexity can be reduced, at best, to $C\,N^{\omega}$ with $\omega=2.49$ and $C$ a constant (for large $N$).
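Both scenarios can be sketched with scipy's sparse LU interface (a stand-in for what Code_Aster+MUMPS do natively; the matrix here is an arbitrary 1D Laplacian, not an Aster operator):

    # Factorize once (expensive), then amortize over many forward/backward solves.
    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    n = 1000
    K = sp.diags([-1., 2., -1.], [-1, 0, 1], shape=(n, n), format='csc')
    lu = splu(K)                    # numerical factorization (stage 3)

    # Second scenario: all right-hand sides known at once -> block substitution.
    F = np.random.rand(n, 5)
    U = lu.solve(F)

    # First scenario: successive right-hand sides (e.g. a nonlinear loop).
    for _ in range(10):
        f_i = np.random.rand(n)
        u_i = lu.solve(f_i)         # cheap: only stage 4 is repeated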
This strategy is (partially) used in Code_Aster, for example, in the construction of the Schur complements of the contact-friction algorithms or for substructuring.
Note:
• The MUMPS product supports both types of strategy and even proposes features to facilitate the construction and resolution of Schur complements.
Let us now examine the factorization process itself. It is already clearly described in other Code_Aster reference documents on the subject [R6.02.01] [R6.02.02], as well as in the bibliographical references already quoted [Duf06] [Gol96] [Las98], so we will not detail it. Let us just specify that it is an iterative process organized schematically around three loops: one 'in $i$' (over the rows of the work matrix), the second 'in $j$' (resp. columns) and the third 'in $k$' (resp. factorization stages). They repeatedly build a new matrix $A^{k+1}$ from some data of the preceding one, $A^{k}$, via the classical factorization formula written formally:

$\text{Loops over } i,j,k:\qquad A^{k+1}_{i,j} := A^{k}_{i,j} - \dfrac{A^{k}_{i,k}\,A^{k}_{k,j}}{A^{k}_{k,k}} \qquad (1.3-3)$

Initially the process is started with $A^0=K$ and, at the last stage, one recovers in the square matrix $A^N$ the triangular parts ($L$ and/or $U$) and even the diagonal ($D$) of interest. For example, in the $LDL^T$ case:

$\text{Loops over } i,j:\qquad \text{if } i>j:\; L_{i,j} = A^{N}_{i,j}/A^{N}_{j,j}; \qquad \text{if } i=j:\; D_{i,j} = A^{N}_{i,j} \qquad (1.3-4)$
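For illustration only, here is a naive dense rendering of the loops (1.3-3)/(1.3-4); real solvers obviously work on sparse, blocked data structures, and the matrix is assumed to factorize without permutation (as in the example (1.3-1)):

    # Naive dense LDL^T via the elimination formula (1.3-3).
    import numpy as np

    def ldlt(K):
        A = K.astype(float).copy()
        n = A.shape[0]
        for k in range(n):                  # factorization stages
            for i in range(k + 1, n):       # rows
                for j in range(k + 1, n):   # columns
                    A[i, j] -= A[i, k] * A[k, j] / A[k, k]      # (1.3-3)
        D = np.diag(np.diag(A))                                 # i = j  (1.3-4)
        L = np.tril(A, -1) / np.diag(A) + np.eye(n)             # i > j  (1.3-4)
        return L, D

    K = np.array([[10., 20., 30.], [20., 45., 80.], [30., 80., 171.]])
    L, D = ldlt(K)
    assert np.allclose(L @ D @ L.T, K)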
Note:
• The formula (1.3-3) contains in germ the problems inherent in the direct methods: in sparse storage, the term $A^{k+1}_{i,j}$ can become nonzero even though $A^{k}_{i,j}$ is zero (fill-in of the factor, hence the need for a renumbering or 'ordering'); and propagation of rounding errors or division by zero via the term $A^{k}_{k,k}$ (hence pivoting and balancing of the matrix terms, or 'scaling').
Note:
• The $LDL^T$ method of Code_Aster (SOLVEUR/METHODE='LDLT') is an '$ijk$' factorization; the multifrontal methods of C. Rose (… ='MULT_FRONT') and of MUMPS (… ='MUMPS') are column-oriented ('$kji$').
• Certain variants bear particular names: Crout's algorithm ('$jki$') and Doolittle's ('$ikj$').
• Papers often use the Anglo-Saxon terminology indicating the orientation of the matrix handling rather than the order of the loops: 'looking forward method', 'looking backward method', 'up-looking', 'left-looking', 'right-looking', 'left-right-looking'…
When sparse systems are treated, the numerical factorization phase (1.3-3) does not apply directly to the initial matrix $K$, but to a work matrix $K_{travail}$ resulting from a pretreatment phase, in order to reduce fill-in, improve the accuracy of the computations and thus optimize the subsequent CPU and memory costs. Roughly, this work matrix can be written as the following matrix product

$K_{travail} := P_o\, D_r\, K\, Q_c\, D_c\, P_o^T \qquad (1.5-1)$

whose various elements we describe hereafter. The operation of a direct solver can thus be broken down into four stages:
1) Pretreatments and symbolic factorization: this stage permutes the columns of the work matrix (via a permutation matrix $Q_c$) in order to avoid divisions by zero by the term $A^{k}_{k,k}$ and to reduce fill-in. Moreover, it rebalances the terms in order to limit rounding errors (via the scaling matrices $D_r$/$D_c$). This phase can be crucial both for algorithmic efficiency (a factor of 10 is sometimes observed) and for the quality of the results (a gain of 4 or 5 decimal digits).
In this phase, one also creates the storage structures of the sparse factor and the auxiliary ones (dynamic pivoting, communication…) required by the following phases. Moreover, one estimates the task dependency tree, the initial distribution of the tasks over the processors and the expected total memory consumption.
2) The renumbering stage: it permutes the unknowns of the matrix (via the permutation matrix $P_o$) in order to reduce the fill-in that the factorization implies. Indeed, formula (1.3-3) shows that the factor can contain a new nonzero term in its profile ($A^{k+1}_{i,j} \neq 0$) even though the initial matrix contained none ($A^{k}_{i,j} = 0$), because the term $\frac{A^{k}_{i,k}\,A^{k}_{k,j}}{A^{k}_{k,k}}$ is not necessarily zero; in particular, it is nonzero when nonzero terms of the initial matrix of the type $A^{k}_{i,l}$ or $A^{k}_{l,j}$ with $l<i$ and $l<j$ can be found. This phenomenon can lead to very significant memory and computation overcosts (the factor can be 100 times larger than the initial sparse matrix!).
Hence the idea of renumbering the unknowns (and thus permuting the rows and columns of $K$) in order to slow down this phenomenon, the true 'Achilles' heel' of the direct methods. To that end, one often calls on external products ((Par)METIS, (Pt)SCOTCH, CHACO, JOSTLE, PARTY…) or on the heuristics shipped with the solvers (AMD, RCMK…). Of course, these products exhibit different performance depending on the matrices treated, the number of processors… Among them, METIS and SCOTCH are very widespread and often 'come out ahead' (gains of up to 50%; see the sketch after this list).
3) The numerical factorization phase: it implements formula (1.3-3) via the methods reviewed in the previous paragraph §1.4. It is by far the most expensive phase, and it explicitly builds the sparse factorizations $LL^T$, $LDL^T$ or $LU$.
4) The solve phase: it carries out the forward/backward substitution (1.3-2) from which (finally!) the solution $u$ 'springs out'. It is not very expensive and possibly amortizes an earlier numerical factorization (multiple right-hand sides, simultaneous resolutions, computation restart…).
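The effect of stage 2 can be observed with scipy's SuperLU interface, which exposes a few column orderings ('NATURAL' keeps the initial numbering, 'COLAMD' is a fill-reducing heuristic); this is only an illustration — MUMPS relies on METIS, SCOTCH, AMD, PORD… instead:

    # Fill-in of the factor with and without a fill-reducing ordering.
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    n = 50
    T = sp.diags([-1., 2., -1.], [-1, 0, 1], shape=(n, n))
    K = sp.kronsum(T, T).tocsc()    # 2D Laplacian: a typical FE-like sparse matrix

    for order in ('NATURAL', 'COLAMD'):
        lu = splu(K, permc_spec=order)
        print(order, 'nnz(L)+nnz(U) =', lu.nnz)   # fill-in of the factor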
Note:
• Stages 1 and 2 require only the knowledge of the connectivity and the graph of the initial matrix — hence, finally, data that can be stored and handled as integers. Only the last two stages work on the real values of the matrix terms; the first stages need them only if the scaling steps are activated (computation of $D_r$/$D_c$).
• Stages 1 and 4 are independent while stages 1, 2 and 3, on the contrary, are interdependent. Depending on the products/algorithmic approaches, they are agglomerated differently: 1 and 2 are coupled in MUMPS, 1 and 3 in SuperLU, and 1, 2 and 3 in UMFPACK… MUMPS makes it possible to carry out stages 1+2, 3 and 4 separately but successively, and even to pool their results to carry out various sequences. For the moment, in Code_Aster, one alternates sequences 1+2+3 and 4, 4, 4… and again 1+2+… (cf. next chapter).
• Certain products propose to test several strategies in one or more stages and choose the most suitable one: SPOOLES and WSMP for stage 1, TAUCS for stage 3, etc.
• The renumbering tools of the first phase rest on very varied concepts: geometric methods, optimization techniques, graph theory, spectral theory, tabu methods, evolutionary algorithms, memetic ones, those based on 'ant colonies', neural networks… Everything is allowed to improve the local optimum in the form of which the renumbering problems are expressed. These tools are also often used to partition/distribute meshes (cf. [R6.01.03] §6). For the moment, Code_Aster uses the renumbering tools METIS/AM/AMD (for METHODE='MULT_FRONT' and 'MUMPS'), AMF/QAMD/PORD (for 'MUMPS') and RCMK 17 (for 'GCPC', 'LDLT' and 'PETSC').
• Whatever the linear solver 18 used, Code_Aster carries out a preliminary phase (stage 0) to describe the unknowns of the problem (link between the physical or late (Lagrange) degrees of freedom and the matrix row numbers via the NUME_DDL data structure) and to plan the ad hoc MORSE storage of the matrix.
17 To minimize fill-in, since incomplete factorization is used as a preconditioner for these iterative solvers.
18 Among 'MULT_FRONT'/'LDLT'/'MUMPS'/'GCPC'/'PETSC'.
19 IC for In-Core (all the data structures are in RAM) and OOC for Out-Of-Core (some are swapped to disk).
20 Often referred to under the Anglo-Saxon terms: 'forward/backward errors'.
[Figure 1.6-1: The 'burden' of sparse direct solvers — size of the factor. Cube test case: $N$=0.21M, $nnz$=8M (×38), $|K^{-1}|$=302M (METIS, ×38); industrial study (RIS pump): $N$=0.8M, $nnz$=28M (×35), $|K^{-1}|$=397M (METIS, ×15).]
This figure shows two examples: a canonical test case (a cube) and an industrial study (the RIS pump), with the following notations: M for millions of terms, $N$ the size of the problem, $nnz$ the number of nonzero terms of the matrix and $|K^{-1}|$ that of its factor renumbered via METIS. The growth factor from one quantity to the next is noted in brackets.
2 The MUMPS package
2.1 History
The MUMPS package implements a 'massively parallel' multifrontal method ('MUltifrontal Massively Parallel sparse direct Solver' [Mum]) developed during the European project PARASOL (1996-1999) by the teams of three laboratories: CERFACS, ENSEEIHT-IRIT and RAL (I. S. Duff, P. R. Amestoy, J. Koster and J.-Y. L'Excellent…).
Since this finalized, public (royalty-free) version (MUMPS 4.0.4, 09/22/99), about forty other versions have been delivered. These developments correct anomalies, extend the perimeter of use, improve ergonomics and, especially, enrich the features. MUMPS is thus a durable product, developed and maintained by about ten people belonging to academic entities distributed between Bordeaux, Lyon and Toulouse: IRIT, CNRS, CERFACS, INRIA/ENS Lyon and University of Bordeaux I.
The product is public and downloadable from its Web site: http://graal.ens-lyon.fr/MUMPS. There are approximately 1000 direct users (including about 1/3 in Europe and 1/3 in the USA), not counting those who use it via the libraries that reference it: PETSc, TRILINOS, Matlab and Scilab… Its site offers documentation (theoretical and usage), links, examples of application, as well as a mailing list (in English) tracing user feedback on the product (bugs, installation problems, advice…).
Each year about ten algorithmic/software works lead to improvements of the package (theses, post-docs, research contracts…). In addition, it is regularly used for industrial studies (EADS, ECA, BOEING, GéoSciences Azur, SAMTECH, Code_Aster…).
Since 2015, a consortium between academic teams, industry and software publishers has been set up around the product: the MUMPS consortium [Mum]. It should make it possible to better ensure its development, its diffusion and its durability.
It is managed by INRIA and gathers, at the end of 2015: EDF, ALTAIR, MICHELIN, LSTC, SIEMENS, ESI, TOTAL (members) and CERFACS, INPT, INRIA, ENS Lyon and Université de Bordeaux (founding members).
Figure 2.1-3: The homepage of the Web site dedicated to the MUMPS consortium [Mum].
[Figure: the three phases of MUMPS — analysis (renumbering, phases 1 and 2), numerical factorization (phase 3) and resolution (phase 4) — with centralized or distributed input of $K$, $f$ and the parameters, reuse of the factor $K^{-1}$ for multiple right-hand sides, and OOC or IC memory management.]
Note:
• In terms of parallelism, MUMPS exploits two levels (cf. [R6.01.03] §2.6.1): an external one related to the concurrent elimination of fronts (via MPI), the other internal, within each front (via threaded BLAS or around the low-rank compression algorithms).
• The native multifrontal method of Code_Aster exploits only the second level, and with shared-memory parallelism (via OpenMP) — thus without overlapping, dynamic regrouping or distribution of the data between the processors. On the other hand, through its tight connections with Code_Aster, it exploits all the facilities of the JEVEUX memory manager (OOC, restart, diagnostics…) and the modeling specificities of the code (structural elements, Lagrange multipliers).
Hence an amplification of the rounding errors of at most $(1+\frac{1}{u})$ at this stage. What matters here is not so much choosing the largest possible term in absolute value ($u=1$) as avoiding choosing the smallest! The inverses of these pivots also intervene during the forward/backward substitution phase, so both sources of error amplification must be tempered by choosing a median $u$. MUMPS, like many packages, proposes $u=0.01$ by default (MUMPS parameter CNTL(1)).
For pivoting, one generally uses scalar diagonal terms but also blocks of terms (2x2 diagonal blocks). In MUMPS, two types of pivoting are implemented, one called 'static' (during the analysis phase), the other called 'numerical' (resp. during the numerical factorization). They are parameterizable and can be activated separately (cf. MUMPS parameters CNTL(1), CNTL(4) and ICNTL(6)). For SPD or diagonally dominant matrices, these pivoting capabilities can be disabled without risk (the computation will gain in speed); in the other cases, on the contrary, they should be activated to manage possible very small or zero pivots. That in general implies additional fill-in of the factor but increases numerical stability.
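The acceptance test behind threshold ('partial') pivoting can be sketched in a few lines; here $u$ plays the role of the MUMPS parameter CNTL(1), and the test itself is a simplified textbook version, not MUMPS' exact logic:

    # Threshold pivoting test: accept the diagonal pivot only if it is not too
    # small relative to the largest remaining entry of its column.
    import numpy as np

    def pivot_ok(A, k, u=0.01):
        """Accept A[k, k] if |A[k, k]| >= u * max_{i>=k} |A[i, k]|."""
        return abs(A[k, k]) >= u * np.max(np.abs(A[k:, k]))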
Note:
• This pivoting functionality makes MUMPS essential for treating certain Code_Aster modelings (quasi-incompressible elements, mixed formulations, X-FEM…), at least as long as no other direct solver including pivoting is available in the code.
• The Aster user does not have direct access to the fine tuning of these pivoting capabilities; they are activated with their default values. He can just partially disable them by setting SOLVEUR/PRETRAITEMENTS='SANS' (default='AUTO'). A command-file sketch is given below.
• The additional fill-in due to numerical pivoting must be planned for as early as possible in MUMPS (as of the analysis phase), by arbitrarily providing a percentage of memory overconsumption compared to the predicted profile. This figure must be indicated, in percent, in the MUMPS parameter ICNTL(14) (accessible to the Aster user via the keyword SOLVEUR/PCENT_PIVOT, initialized by default to 20%). Thereafter, if this evaluation proves insufficient, depending on the type of memory management selected (keyword SOLVEUR/GESTION_MEMOIRE), either the computation stops in ERREUR_FATALE, or the numerical factorization is retried several times, doubling each time the size of this space reserved for pivoting.
• Certain products restrict their perimeter/robustness by proposing no pivoting strategy (SPRSBLKKT, MULT_FRONT_Aster…), others are limited to scalar pivots (CHOLMOD, PaStiX, TAUCS, WSMP…) or propose particular strategies (Bunch-Kaufman perturbation+correction method for PARDISO, Bunch-Parlett for SPOOLES…).
21 This is true when MUMPS works with In-Core memory management and sequentially. On the other hand, when the data are distributed between the processors and between RAM and disk (parallelism and Out-Of-Core activated), this stage can be somewhat expensive.
22 One also speaks of 'iterative improvement' ('iterative refinement').
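A hedged Code_Aster command-file (.comm) sketch gathering the keywords discussed above; model, mater and load are hypothetical objects defined earlier in the command file, other mandatory keywords of the operator are omitted, and defaults may vary between code versions:

    resu = STAT_NON_LINE(
        MODELE=model,
        CHAM_MATER=mater,
        EXCIT=_F(CHARGE=load),
        SOLVEUR=_F(
            METHODE='MUMPS',
            PCENT_PIVOT=20,          # % of memory reserved for pivoting (cf. ICNTL(14))
            PRETRAITEMENTS='AUTO',   # scalings/permutations; 'SANS' partially disables them
            GESTION_MEMOIRE='AUTO',  # let Aster arbitrate between In-Core and Out-Of-Core
        ),
    )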
In MUMPS this process can be activated or not, and is limited by a maximum number of iterations $N_{err}$ (parameter ICNTL(10)). The process (2.3-2) continues as long as the 'scaled residual' $Berr$ is higher than a parameterizable threshold (CNTL(2), fixed by default at $\sqrt{\varepsilon}$ with $\varepsilon$ the machine precision)

$Berr := \max_j \dfrac{|r|_j}{(|K|\,|u|+|f|)_j} > \text{seuil} \qquad (2.3-3)$

or as long as it decreases by a factor of at least 5 per iteration (not parameterizable). In general, one or two iterations suffice. If that is not the case, it often reveals other problems: bad conditioning or a large backward error (cf. next paragraph).
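A minimal sketch of this loop, with lu any factorization object offering solve() (e.g. scipy's splu); the threshold and the stopping logic are simplified versions of the CNTL(2)/ICNTL(10) behavior, and the denominator is assumed nonzero (see (2.3-6) below for the degenerate rows):

    # Iterative refinement: correct u with the residual equation K du = r as long
    # as the componentwise scaled residual Berr (2.3-3) exceeds the threshold.
    import numpy as np

    def refine(K, lu, f, seuil=1e-14, n_err=10):
        u = lu.solve(f)
        for _ in range(n_err):
            r = f - K @ u
            berr = np.max(np.abs(r) / (abs(K) @ np.abs(u) + np.abs(f)))   # (2.3-3)
            if berr <= seuil:
                break
            u = u + lu.solve(r)
        return u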
Note:
• For the Code_Aster user these MUMPS parameters are not directly accessible. The functionality is activated or not via the keyword POSTTRAITEMENTS.
• This functionality is present in many packages: Oblio, PARDISO, UMFPACK, WSMP…
$\dfrac{\|\delta u\|}{\|u\|} \le \underbrace{cond(K,f) \times be(K,f)}_{fe(K,f)} \qquad (2.3-4)$

One can picture these concepts (cf. figure 2.3-2) by expressing the backward error as the gap between 'the initial data and the perturbed data', while the forward error measures the gap between 'the exact solution and the solution actually obtained' (that of the problem perturbed by the rounding errors).
Within the framework of linear systems, the backward error is measured via the scaled residual
$be(K,f) := \max_{j \in J} \dfrac{|f-Ku|_j}{(|K|\,|u|+|f|)_j} \qquad (2.3-5)$

It cannot always be evaluated on all the indices ($J \neq [1,N]$). In particular, when the denominator is very small (and the numerator nonzero), one prefers the formulation (with $J^*$ such that $J \cup J^* = [1,N]$)

$be^*(K,f) := \max_{j \in J^*} \dfrac{|f-Ku|_j}{(|K|\,|u|)_j + \|K_{j\cdot}\|_\infty \|u\|_\infty} \qquad (2.3-6)$

where $K_{j\cdot}$ represents the $j$-th row of the matrix $K$. With these two indicators one associates two estimates of matrix conditioning (one related to the set of retained rows $J$ and the other to its complement $J^*$): $cond(K,f)$ and $cond^*(K,f)$. The theory then provides the following results:
• The approximate solution $\tilde u$ is the exact solution of the perturbed problem

$(K+\delta K)\,\tilde u = f + \delta f \quad \text{with} \quad |\delta K_{ij}| \le \max(be, be^*)\,|K_{ij}| \quad \text{and} \quad |\delta f_i| \le \max\big(be\,|f_i|,\; be^*\,\|K_{i\cdot}\|_\infty \|\tilde u\|_\infty\big) \qquad (2.3-7)$

• One has the following bound (via the forward error $fe(K,f)$) on the relative error of the solution:

$\dfrac{\|\delta u\|}{\|u\|} \le \underbrace{cond \times be + cond^* \times be^*}_{fe(K,f)} \qquad (2.3-8)$
In practice, one monitors especially this last estimate $fe(K,f)$ and its components. Its order of magnitude roughly indicates the number of 'true' decimal digits of the computed solution. For badly conditioned problems, a value of $10^{-3}$ is not rare, but it must be taken seriously because this kind of pathology can seriously disturb a calculation.
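A dense numpy sketch of the two componentwise backward errors (2.3-5)/(2.3-6); the rule splitting the rows between $J$ and $J^*$ is an assumption of this sketch (Arioli, Demmel and Duff give the precise criterion):

    # Componentwise backward errors on the rows J (healthy denominator) and J*.
    import numpy as np

    def backward_errors(K, u, f):
        r = np.abs(f - K @ u)
        den = np.abs(K) @ np.abs(u) + np.abs(f)                    # (2.3-5) denominator
        alt = np.abs(K) @ np.abs(u) + np.max(np.abs(K), axis=1) * np.max(np.abs(u))
        J = den > 1e3 * np.finfo(float).tiny                       # assumed splitting rule
        be = np.max(r[J] / den[J], initial=0.0)                    # (2.3-5)
        be_star = np.max(r[~J] / alt[~J], initial=0.0)             # (2.3-6)
        return be, be_star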
Even within the very precise framework of solving a linear system, there are many ways to define the sensitivity of the problem at hand to rounding errors (i.e. its conditioning). The one retained by MUMPS, which is a reference in the field (cf. Arioli, Demmel and Duff 1989), is inseparable from the 'backward error' of the problem: the definition of the one has no meaning without that of the other. One should therefore not confuse this kind of conditioning with the classical notion of matrix conditioning.
In addition, the conditioning provided by MUMPS takes into account the RIGHT-HAND SIDE of the system as well as the SPARSITY of the matrix. Indeed, it is pointless to take account of possible rounding errors on zero matrix terms, which are thus not provided to the solver! The corresponding degrees of freedom 'do not talk to each other' (from the finite element viewpoint). Thus, this MUMPS conditioning respects the physics of the discretized problem; it does not re-embed the problem in the too-rich space of full matrices.
Consequently, the conditioning figure displayed by MUMPS is much less pessimistic than the standard computation that another product may provide (Matlab, Python…). But let us hammer the point home: it is only its product with the 'backward error', called the 'forward error', that is of interest — and only within the framework of a linear system resolution via MUMPS.
Note:
• This analysis of the quality of the solution is not limited to linear solvers. It also exists, for example, for modal solvers [Hig02].
• In MUMPS, the estimators $fe$, $be$, $be^*$, $cond$ and $cond^*$ are accessible via, respectively, the variables RINFO(7/9/8/10 and 11). These postprocessings are somewhat expensive (up to 10% of the computation time) and can thus be deactivated (via ICNTL(11)).
• For the Code_Aster user these MUMPS parameters are not directly accessible. They are displayed in a specific insert of the message file (labeled 'monitoring MUMPS') if the keyword INFO=2 is given in the operator. In addition, this functionality is activated only if the user chooses to estimate and
test the quality of the solution via the parameter SOLVEUR/RESI_RELA. Depending on the Aster operator, this parameter is by default deactivated (negative value) or set to 10^-6. When it is activated (positive value), one tests whether the forward error $fe(K,f)$ is indeed lower than RESI_RELA. If that is not the case, the computation stops in ERREUR_FATALE, specifying the nature of the problem and the offending values. A hedged command-file excerpt is given below.
• The activation of this functionality is not essential (but often useful) when the computed solution is itself corrected by another algorithmic process (Newton algorithm, Newmark scheme): in short, in the operators THER_LINEAIRE, MECA_STATIQUE, STAT_NON_LINE, DYNA_NON_LINE…
• This kind of functionality seems rather rare in libraries: LAPACK, NAG, HSL…
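A hedged command-file excerpt asking for this quality check and for the monitoring insert; model, mater and load are hypothetical, and the RESI_RELA default depends on the operator:

    resu = MECA_STATIQUE(
        MODELE=model,
        CHAM_MATER=mater,
        EXCIT=_F(CHARGE=load),
        SOLVEUR=_F(
            METHODE='MUMPS',
            RESI_RELA=1.e-6,          # stop in ERREUR_FATALE if fe(K,f) > 1e-6
            POSTTRAITEMENTS='AUTO',   # iterative refinement (cf. §2.3.2)
        ),
        INFO=2,                       # print the "monitoring MUMPS" insert
    )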
[Figure: the two memory management modes — OOC (data unloaded to disk) and IC (everything kept in RAM).]
These two memory management modes work 'without a net': no correction will be applied later in the event of a problem. If one does not know a priori which of these two modes to choose, and if one wants to limit, as far as possible, problems due to lack of memory, one can choose the automatic mode: GESTION_MEMOIRE='AUTO'. A heuristic internal to Code_Aster then manages on its own the memory contingencies of MUMPS according to the computing environment (machine, parallelism) and the numerical difficulties of the problem.
In the same vein, an option of the same keyword, GESTION_MEMOIRE='EVAL', makes it possible to gauge the needs of a calculation by displaying in the message file the memory resources required by the Code_Aster+MUMPS computation:
***************************************************************************
- Size of the linear system: 500000
===> For this calculation, one thus needs a quantity of RAM memory of at least:
- 3500 MB if GESTION_MEMOIRE='IN_CORE',
- 500 MB if GESTION_MEMOIRE='OUT_OF_CORE'.
In case of doubt, use GESTION_MEMOIRE='AUTO'.
******************************************************************************
Note:
• The MUMPS parameters ICNTL(22)/ICNTL(23) manage these memory options. The Aster user activates them indirectly via the keyword SOLVEUR/GESTION_MEMOIRE.
• Unloading to disk is entirely controlled by MUMPS (number of files, unloading/reloading frequency…). One just provides the location: quite naturally, the working directory of the Aster executable specific to each processor (%OOC_TMPDIR='.'). These files are automatically erased by MUMPS when the associated MUMPS instance is destroyed, which avoids clogging the disk when various systems are factorized within the same resolution.
• Other OOC strategies would be possible, or are even already coded in certain packages (PaStiX, Oblio, TAUCS…). One thinks in particular of being able to modulate the perimeter of the unloaded objects (cf. the analysis phase, sometimes expensive in RAM) and of being able to reuse them on disk during another execution (cf. restart (POURSUITE) in the Aster sense, or partial factorization).
These new developments were one of the deliverables of the ANR SOLSTICE project [Sol]. We had asked the MUMPS team (in partnership with the Algo team of CERFACS) to make the product iso-functional with the other direct solvers of Code_Aster.
Note:
• The MUMPS parameters ICNTL(13)/ICNTL(24)/ICNTL(25) and CNTL(3)/CNTL(5) activate these features. They are not modifiable by the user; out of prudence, the functionality is kept activated permanently.
• This functionality can also prove useful in modal calculation (filtering of the rigid body modes).
23 So-called numerical singularities are determined up to a numerical precision, contrary to so-called exact or true singularities.
24 It is indeed a solution of the problem provided the right-hand side $f \perp \ker K^T$, which in our symmetric case amounts to $f$ belonging to the image space of $K$.
25 Strictly speaking, it is the infinity norm of the row of the work matrix containing the pivot.
26 By default it is set to 10^-8 (in double precision) and 10^-4 (in single precision), because these figures represent (empirically) a loss of at least half of the precision level if the factorization is nevertheless continued.
27 This value must be large enough to limit the impact of this modification on the rest of the factorization. In Code_Aster/Code_Carmel3D, it is set to $10^6\,\|K_{travail}\|$.
28 It is the same mechanism as for static pivoting.
is almost complete. One does not have to choose between parallelism, this or that numerical refinement, and these BLR compressions! All these features are compatible and their gains often accumulate 29.
In the manner of the mp3, zip or pdf formats of our domestic and office uses, these compressions make it possible to reduce, with few losses, the expensive stages of MUMPS 30. And this approximation generally does not disturb the accuracy or the behavior of the enclosing mechanical computations.
It is however worthwhile only on problems of large size ($N$ at least $> 2\times10^6$ degrees of freedom), because, as these compressions imply an overcost, one should compress only storage blocks that are sufficiently large and thus likely to quickly compensate for this overcost.
The gains observed on some Code_Aster studies range from 20% to 80% (cf. figures 7.3-5/7.3-6). They increase with the size of the problem and its massive (3D bulky) character.
Figure 7.3-5: Example of gains obtained by low-rank compression on the perf008d benchmark (default parameters, OOC memory management, N=2M, NNZ=80M, Facto_METIS4=7495M, conditioning=10^7). As a function of the number of activated MPI processes, one plots the elapsed time spent in the whole linear system resolution stage in Code_Aster v13.1, its RAM memory peak, as well as the acceleration factor obtained by BLR.
Figure 7.3-6: Example of gains obtained by low-rank compression on the perf009d benchmark (default parameters, OOC memory management, N=5.4M, NNZ=209M, Facto_METIS4=5247M, conditioning=10^8). As a function of the number of activated MPI processes, one plots the elapsed time spent in the whole linear system resolution stage in Code_Aster v13.1, its RAM memory peak, as well as the acceleration factor obtained by BLR.
29 On the other hand, these gains vary with the numerical/software context: renumbering tool, number of MPI processes…
30 For the moment compression concerns only certain stages of the numerical factorization (not the forward/backward substitution). These gains must thus, at minimum, compensate for a slight overcost in the preliminary analysis stage as well as the compression/decompression costs at the beginning and end of the numerical factorization stage. For the moment, they are only time savings (not RAM peak gains).
In a nutshell, this strategy seeks to compress the largest storage blocks handled by the MUMPS multifrontal method: the large dense blocks of its elimination tree. This technique rests on the assumption (often verified empirically) that the variables within these dense blocks can be renumbered 31 so as to reveal a more advantageous matrix structure: decomposing these blocks as the product of two much smaller dense matrices.
The objective is thus to decompose the large dense matrices $A$ (of this elimination tree), for example of size $m \times n$, in the form

$A = U\,V^T + E \qquad (7.3-9)$

with $U$ ($m \times k$) and $V^T$ ($k \times n$) much smaller matrices, $k < \min(m,n)$, and $E$ a 'negligible' $m \times n$ matrix ($\|E\| \le \varepsilon$).
During later manipulations MUMPS then makes the approximation

$A \approx U\,V^T \qquad (7.3-10)$

betting that this will have little impact on the quality of the result (thanks to iterative refinement or to the enclosing nonlinear algorithms) and on the related 'outputs' of the solver (singularity detection, computation of the determinant, Sturm criterion).
And this is verified most of the time, as long as the compression parameter is rather small ($\varepsilon < 10^{-9}$). If it is larger, the approximation can no longer be neglected and MUMPS can no longer be used as an 'exact' direct solver: only as a 'relaxed' direct solver (in nonlinear analyses) or as a preconditioner (for example for the GCPC of Code_Aster or the Krylov solvers of PETSc).
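The compression idea (7.3-9)/(7.3-10) can be illustrated with a truncated SVD — a minimal sketch: MUMPS builds its BLR blocks differently, but the principle of trading a tolerance eps against a small rank k is the same:

    # Low-rank compression of a dense block: A ~ U_k @ V_k.T with error O(eps).
    import numpy as np

    def compress(A, eps=1e-9):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        k = int(np.sum(s > eps * s[0]))          # numerical rank at tolerance eps
        return U[:, :k] * s[:k], Vt[:k, :].T     # (m x k), (n x k)

    A = np.add.outer(np.linspace(0, 1, 60), np.linspace(0, 1, 40))   # smooth block
    U_k, V_k = compress(A)
    print(U_k.shape[1], np.linalg.norm(A - U_k @ V_k.T))   # tiny rank, tiny error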
This functionality, now available in the consortium versions of the product, results from an EDF-INPT partnership (2010-13) around the thesis work of C. Weisbecker [CW13/15], which was awarded the L. Escande prize for 2014. For more details on the technical aspects of this work, one may consult the thesis document itself or the summary provided in appendix 1 of this document.
Note:
• Depending on the MUMPS version, various parameters manage this compression strategy. The Code_Aster user activates them indirectly via the keywords ACCELERATION/LOW_RANK_SEUIL.
3 Implementation in Code_Aster
3.1 Context/synthesis
To improve the performance of the computations carried out, the strategy retained by Code_Aster [Dur08], as by most large general-purpose structural mechanics codes, consists in particular in diversifying its panel of linear solvers 32 in order to better target the needs and constraints of the users: local machine, cluster or computing centre; memory and disk footprint; CPU time; industrial or more exploratory study…
Regarding parallelism and linear solvers, one path is particularly explored 33:
• the 'numerical parallelism' of external solver libraries such as MUMPS and PETSc, possibly supplemented by a 'software parallelism' (internal to the code) for the elementary computations and the matrix/vector assemblies.
We are interested here in the first scenario, through MUMPS. This external solver is 'plugged' into Code_Aster and accessible to users since v8.0.14. It thus enables us to benefit, 'at lower cost', from the feedback of a broad community of users and from the very sharp skills of international teams, all while combining effectiveness, performance, reliability and a broad perimeter of use.
This work was initially completed by exploiting the sequential mode In-Core product. In particular, thanks to its
faculties of swivelling, he does invaluable favours by treating new modelings (quasi-incompressible elements, X-
FEM…) who can prove to be problematic for the other linear solveurs.
Since then, MUMPS has been used daily in studies [GM08][Tar07][GS11]. Our experience feedback has of course matured, and we maintain an active partnership with the MUMPS development team (in particular via the ANR SOLSTICE project [SOL] and a thesis in progress). In addition, its integration in Code_Aster benefits from continuous enrichment: centralized IC parallelism [Des07] (since v9.1.13), distributed IC parallelism [Boi07] (since v9.1.16), then IC and OOC modes [BD08] (since v9.3.14).
In distributed parallel mode, the use of MUMPS yields CPU gains (compared to the default method of the code) of about a dozen on 32 processors of the Aster machine. On very favorable cases this result can be much better and, for 'frontier studies', MUMPS sometimes remains the only viable alternative (cf. the vessel internals study [Boi07]).
As for RAM consumption, we saw in the preceding chapters that it is the principal weakness of direct solvers. Even in parallel mode, where the data are naturally distributed between the processors, this factor can prove handicapping. To overcome this problem it is possible to activate in Code_Aster a recent functionality of MUMPS (developed within the framework of the above-mentioned ANR): the 'Out-Of-Core' (OOC) mode, instead of the default 'In-Core' (IC) mode. It makes it possible to reduce this bottleneck by offloading a good amount of data to disc. Thanks to the OOC, one can thus approach the RAM consumption of the native multifrontal of Code_Aster (even in sequential), or even go below it by combining the efforts of parallelism and this offloading to disc. The first tests show a RAM gain between OOC and IC of at least 50% (even more on favorable cases) for a limited CPU overcost (<10%).
The MUMPS solver thus makes it possible not only to solve numerically difficult problems but, inserted in an Aster computing process that is already partially parallel, it multiplies its performance. It provides the code with a parallel framework that is efficient, robust and accessible to all. It thus facilitates the passage of standard studies (< one million degrees of freedom) and makes the treatment of large cases (several million degrees of freedom) available to the greatest number.
32 This continuous search for performance improvement is obviously not reduced to the linear solvers alone. The code proposes a good number of tools answering the same objectives: distribution of independent calculations, X-FEM, improvement of contact-friction, the ODE/modal/nonlinear solvers, adaptive meshing, structural finite elements…
33 Cf. [R6.01.03] for a detailed vision of the potential parallelism strategies and of those actually implemented in the code.
The first mode ('CENTRALISE') has on its side robustness and a broader perimeter of use; the second family ('GROUP_ELEM'/'MAIL_***' and 'SOUS_DOMAINE') is generic but more effective. Indeed, the phases of a simulation that are often the most expensive in CPU time are: the construction of the linear system (purely Code_Aster, cut into three stations: symbolic factorization, elementary calculations and matrix/vector assemblies) and its resolution (in MUMPS, cf. §1.6: renumbering + analysis, numerical factorization and forward/backward substitution). The first parallelization mode benefits only from the parallelism of stages 2 and 3 of MUMPS, whereas the three others also parallelize the elementary calculations and the assemblies of Code_Aster (cf. figures 2.2-1 and 3.2-1).
Figure 3.2-1. Functional split of the resolution of Ku = F between Code_Aster (construction: elementary calculations, assemblies, symbolic factorization) and MUMPS (analysis, numerical factorization, forward/backward substitution), in centralized and distributed MUMPS modes.
• 'CENTRALISE': the meshes are not distributed (as in sequential). The elementary calculations are not parallelized; parallelism starts only at the MUMPS level. Each processor builds and provides the linear solver with the entirety of the system to be solved. This mode of use is useful for non-regression tests. In all the cases where the elementary calculations represent a weak share of the total time (e.g. in linear elasticity), this option can be sufficient.
• 'GROUP_ELEM': a second type of distribution consists in setting up homogeneous groups of finite elements (by type of finite element), then distributing the elementary calculations between the processors so as to balance the load as well as possible (in terms of the number of elementary calculations of each type). Each processor allocates the whole matrix but performs and assembles only the elementary calculations which were assigned to it.
• 'MAIL_DISPERSE'/'MAIL_CONTIGU':
The meshes of the model are distributed either in packets of contiguous meshes ('MAIL_CONTIGU') or cyclically ('MAIL_DISPERSE'). This distribution is independent of the type of finite element carried by the meshes. For example, with a model comprising 8 meshes and a calculation on 4 processors, 'MAIL_CONTIGU' assigns meshes 1-2, 3-4, 5-6 and 7-8 to processors 0 to 3 respectively, while 'MAIL_DISPERSE' deals them out cyclically (processor 0 gets meshes 1 and 5, processor 1 gets meshes 2 and 6, and so on), as in the sketch below. Each processor allocates the whole matrix but performs and assembles only the elementary calculations which were assigned to it.
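To make the two distributions concrete, here is a minimal plain-Python sketch (hypothetical helper functions, not Code_Aster's actual implementation) reproducing the 8-mesh/4-processor example:

def mail_contigu(n_mesh, n_proc):
    # packets of contiguous meshes (assumes n_mesh divisible by n_proc for simplicity)
    size = n_mesh // n_proc
    return {p: list(range(1 + p * size, 1 + (p + 1) * size)) for p in range(n_proc)}

def mail_disperse(n_mesh, n_proc):
    # cyclic distribution: mesh m goes to processor (m - 1) modulo n_proc
    return {p: [m for m in range(1, n_mesh + 1) if (m - 1) % n_proc == p] for p in range(n_proc)}

print(mail_contigu(8, 4))   # {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8]}
print(mail_disperse(8, 4))  # {0: [1, 5], 1: [2, 6], 2: [3, 7], 3: [4, 8]}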
Note:
• Without the option MATR_DISTRIBUEE (cf. next paragraph), the different strategies are equivalent in terms of memory occupation. The flow of data and instructions is pruned as early as possible: each processor selectively treats blocks of the total problem, which MUMPS then gathers.
• In distributed mode, each processor thus handles only partially filled matrices. On the other hand, in order to avoid introducing too many MPI communications into the code (stopping criteria, residuals…), this scenario was not retained for the right-hand sides. Their construction is indeed parallelized but, at the end of the assembly, the contributions of all the processors are summed and broadcast to all. Thus all the processors know the vectors involved in the calculation in their entirety (see the sketch after this note).
• In the same way, the matrix is for the moment duplicated: in the JEVEUX space (RAM or disc) and in the F90 space of MUMPS (RAM). In the long term, because of the offloading to disc of the factorized matrix, it will become a dimensioning RAM object. It will thus have to be built directly via MUMPS.
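The right-hand-side treatment described in the second remark follows a classic 'assemble locally, then sum and broadcast' pattern; a minimal mpi4py sketch of that pattern (illustrative only, not the code's actual implementation):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
n = 1000                      # global number of degrees of freedom (arbitrary here)
f_local = np.zeros(n)
# ... each rank assembles into f_local the elementary contributions
#     of the meshes it owns (all other entries stay at zero) ...
f_global = np.empty(n)
comm.Allreduce(f_local, f_global, op=MPI.SUM)   # sum over ranks, result known to all
# every rank now holds the complete vector: stopping criteria and residuals
# can be evaluated without any further communication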
In distributed mode, each processor thus sends MUMPS only incomplete flows of data. Parallelism then makes it possible, in addition to savings in computation time, to reduce the memory required by the MUMPS resolution, but not that needed for the construction of the problem in JEVEUX. This is not awkward as long as the RAM space for JEVEUX remains much lower than that needed by MUMPS. Since JEVEUX stores mainly the matrix and MUMPS its factorization (generally tens of times larger), the RAM bottleneck of the calculation theoretically lies with MUMPS. But as soon as a few tens of MPI processors are used and/or the OOC is activated, since MUMPS distributes this factorization by processor and offloads the pieces to disc, 'the ball returns to the camp of JEVEUX'.
Hence the option MATR_DISTRIBUEE, which trims the matrix down to just the nonzero terms for which the processor is responsible. The JEVEUX space required then decreases with the number of processors and falls below the RAM needed by MUMPS. The results of figure 3.2-2 illustrate this gain in parallel on two studies: a RIS pump and a vessel model from the 'Epicure' study.
Figure 3.2-2. Evolution of the RAM consumption (in GB) according to the number of processors, for Code_Aster v11.0 (standard JEVEUX, MATR_DISTRIBUEE='NON', resp. distributed, 'OUI') and MUMPS OOC. Results obtained on a RIS pump and on the vessel of the Epicure study.
Note:
• Only data resulting from an elementary calculation (RESU_ELEM and CHAM_ELEM) or from a matrix assembly (MATR_ASSE) are treated here. Assembled vectors (CHAM_NO) are not distributed because the induced memory gains would be weak and, moreover, as they intervene in the evaluation of many algorithmic criteria, distributing them would imply too many additional communications.
• In MATR_DISTRIBUEE mode, to make the link between the local MATR_ASSE of each processor and the global MATR_ASSE (which is not built), an indirection vector is added in the form of a local NUME_DDL.
Figure 3.3-1. Functional diagram of the Code_Aster/MUMPS coupling (with a sequential renumbering tool) with respect to the main data structures and memory occupation (RAM and disc): on the Aster side, the elementary calculations, assemblies and renumbering produce the JEVEUX objects (CHAM_ELEM, MATR_ELEM/VECT_ELEM, NUME_DDL/CHAM_NO/MATR_ASSE); on the MUMPS side, the analysis, numerical factorization and forward/backward substitution handle K, K⁻¹ and u, F, in centralized or distributed mode and in IC (RAM) or OOC (disc) management.
Because of its dualized treatment of the boundary conditions, Code_Aster does not provide MUMPS with the ideal blocked matrix

  K0 = | K   B^T |   (u)
       | B    0  |   (lagr)

(where B denotes the blocking matrix), but with its doubly dualized form K2.
Since MUMPS has pivoting faculties, this choice of dualization of the boundary conditions can be called into question. By setting the keyword ELIM_LAGR to 'LAGR2', only one of the two Lagrange multipliers is taken into account, the other being a spectator34. Hence a simply dualized working matrix K1,
34 To maintain the coherence of the data structures and to keep a certain legibility/software maintainability, it is preferable to 'trick' the usual process by passing from K2 to K1, rather than to the optimal scenario K0.
  K1 = | K   B^T    0  |   (u)
       | B    0     0  |   (lagr1)
       | 0    0   -Id  |   (lagr2)

which is sparser, because the extra-diagonal terms of the rows and columns associated with these spectator Lagrange multipliers are initialized to zero. Conversely, with the value 'NON', MUMPS receives the usual dualized matrices.
For problems comprising many Lagrange multipliers (up to 20% of the total number of unknowns), the activation of this parameter often pays off (smaller matrix). But when this number explodes (>20%), the process can become counter-productive: the gains made on the matrix are cancelled by the size of the factorization and especially by the number of late pivotings that MUMPS must carry out. Imposing ELIM_LAGR='NON' can then be very interesting (gain of 40% in CPU on the test case mac3c01).
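As a usage sketch (the concept names mo, chmat and ch are hypothetical; keyword values as discussed above), the spectator-Lagrange strategy is selected in the command file as follows:

resu = STAT_NON_LINE(
    MODELE=mo, CHAM_MATER=chmat, EXCIT=_F(CHARGE=ch),
    SOLVEUR=_F(
        METHODE='MUMPS',
        ELIM_LAGR='LAGR2',   # one active Lagrange multiplier, the other a spectator (matrix K1)
        # ELIM_LAGR='NON',   # usual doubly dualized matrix K2: preferable when the
        #                    # proportion of Lagrange multipliers explodes (>20%)
    ),
)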
35 Resolution of a quadratic problem with CALC_MODES + OPTION='SEPARE'/'AJUSTE'. This option requires direct access to the diagonal of the factorized matrix; however MUMPS does not allow, for the moment, the precise terms of the factorization actually obtained to be known. On the other hand, and this is its privileged framework of use, it exploits them effectively and robustly to solve a linear system and/or to compute a postprocessing (Sturm criterion, determinant computation…).
3.6.2 Monitoring
By setting the keyword INFO to 2 and using the MUMPS solver, the user can display in the message file a synthetic monitoring of the various phases of construction and resolution of the linear system: distribution per processor of the number of meshes, of the terms of the matrix and of its factorization, the error analysis (if requested) and an assessment of their possible load imbalance. To this CPU-oriented monitoring is added some information on the RAM consumption of MUMPS: per processor, the estimate (from the analysis phase) of the requirements in IC and OOC RAM, and the value actually used, with a recall of the strategy chosen by the user. The times spent in each stage of the calculation on the various processors can be displayed too. They are managed by a more global mechanism which is not specific to MUMPS (cf. §4.1.2 of [U1.03.03] or the user's documentation of the operator DEBUT/POURSUITE).
************************************************************************
<MONITORING MUMPS>
SIZE OF THE SYSTEM                    803352
CONDITIONING / ALGORITHM ERROR        2.2331D+07   3.3642D-15
ERROR ON THE SOLUTION                 7.5127D-08
RANK    NB. MESHES    NB. TERMS K    LU FACTORS
N 0:    54684         7117247        117787366
N 1:    55483         7152211        90855351
…
IN %: RELATIVE VALUE AND MAX IMBALANCE
:  1.45D+01   2.47D+00   2.38D+00   1.50D+01   4.00D+01   2.57D+01
:  1.40D-01  -1.09D+00  -5.11D+00  -9.00D-02   1.56D+00  -4.16D-01
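Such a monitoring block is obtained simply by raising the verbosity of the operator concerned; a minimal sketch (the concept names mo, chmat and ch are hypothetical):

resu = MECA_STATIQUE(
    MODELE=mo, CHAM_MATER=chmat, EXCIT=_F(CHARGE=ch),
    SOLVEUR=_F(METHODE='MUMPS'),
    INFO=2,   # prints the <MONITORING MUMPS> synthesis in the message file
)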
Figure 3.5-2. Linear mechanical calculation (op. MECA_STATIQUE) on the official test case of the cube (mumps05b): a single linear system is built and solved. Simulation carried out on the centralized Aster machine with Code_Aster v11.0; Aster+MUMPS RAM consumption measured. The plots (speed-up and RAM gain OOC/IC versus number of processors) summarize as follows: N=0.7M unknowns, nnz=27M; parallel fraction centralized/distributed: 96%/98%; theoretical speed-ups cent./dist. <25/50; 32 processors (x1): ~3 min; RAM consumption IC: 4 GB (1 proc)/1.3 GB (16); RAM consumption OOC: 2 GB (1 proc)/1.2 GB (16).
On the nonlinear study of the pump, the expected gains are smaller. Taking into account the sequential analysis phase of MUMPS, only 82% of the calculation is parallel, hence theoretical and effective speed-ups that are appreciable but more modest. From the RAM standpoint, the OOC management of MUMPS yields interesting gains in both cases, but more marked for the pump: in sequential, an IC vs OOC gain of about 85%, against 50% for the cube. As the number of processors increases, the data distribution induced by parallelism gradually erodes this gain. But it remains significant for the pump on 16 processors and almost disappears for the cube.
Figure 3.5-3. Nonlinear mechanical calculation (op. STAT_NON_LINE) on a more industrial geometry (RIS pump): 12 linear systems are built and solved (3 time steps × 4 Newton iterations). Simulation carried out on the centralized Aster machine with Code_Aster v11.0; MUMPS RAM consumption estimated. The plots (speed-up and RAM gain OOC/IC versus number of processors) summarize as follows: N=0.8M unknowns, nnz=28.2M; parallel fraction centralized/distributed: 55%/82%; theoretical speed-ups cent./dist. <3/6; 16 processors: ~15 min; RAM consumption IC: 5.6 GB (1 proc)/0.6 GB (16); RAM consumption OOC: 0.9 GB (1 proc)/0.3 GB (16).
4 Conclusion
Within the framework of thermomechanical simulations with Code_Aster, the main part of the calculation costs often comes from the construction and the resolution of linear systems. For 60 years, two types of techniques have disputed supremacy in the field: direct solvers and iterative ones. Code_Aster, like a good number of general-purpose codes, made the choice of a diversified offer in the field, with however an orientation towards sparse direct solvers. These are adapted to its needs, which can be summarized by the triptych 'robustness / problems with multiple right-hand sides / moderate parallelism'.
The code now resting on many optimized and perennial 'middlewares' (MPI, BLAS, LAPACK, METIS…) and being used mainly on SMP clusters (fast networks, large RAM and disc storage capacities), its offer of linear solvers is optimized accordingly.
Taking into account the necessary technicality36 and a rich international offer37, the question of resorting to an external product to carry out these resolutions effectively is now impossible to circumvent. It makes it possible to acquire, at lower cost, a functionality that is often effective, reliable, powerful and endowed with a broad perimeter of use. One can thus benefit from the experience feedback of a broad community of users and from the (very) sharp competences of international teams.
Thus Code_Aster made the choice to integrate the parallel multifrontal of the MUMPS package, in complement, in particular, of its 'house' multifrontal. While the latter benefits from a long-term adaptation to Aster modelings, it remains less rich in features (pivoting, pre/postprocessing, quality of the solution…) and less powerful in parallel (for a RAM consumption of the same order). To treat certain modelings (quasi-incompressible elements, X-FEM…) or to pass 'frontier studies' (cf. vessel internals), the coupling 'Code_Aster+MUMPS' sometimes becomes the only viable alternative.
Since then, its integration in Code_Aster has benefited from continuous enrichment and MUMPS (SOLVEUR/METHODE='MUMPS') is used daily in studies. Our experience feedback has of course matured and we maintain an active partnership with the MUMPS 'core team' (in particular via the ANR SOLSTICE project and a thesis).
In parallel mode, the use of MUMPS yields CPU gains (compared to the default method of the code, the 'house' multifrontal) of about a dozen on 32 processors of the Aster machine. On more favorable cases, or by exploiting a second level of parallelism or the BLR compressions, this CPU gain can be much better.
The MUMPS solver thus makes it possible not only to solve numerically difficult problems but, inserted in an Aster computing process that is already partially parallel, it multiplies its performance. It provides the code with a parallel framework that is efficient, robust and accessible to all. It thus facilitates the passage of standard studies (< one million degrees of freedom) and makes the treatment of large cases (several million degrees of freedom) available to the greatest number.
36 To give an order of magnitude, the MUMPS package comprises more than 10^5 lines (F90/C).
37 In the public domain alone, one counts tens of packages, libraries, 'macro-libraries'…
5 Bibliography
5.1 Books/articles/proceedings/theses…
[ADD89] M. Arioli, J. Demmel and I.S. Duff. Solving sparse linear systems with sparse backward error. SIAM Journal on Matrix Analysis and Applications, 10, 165:190 (1989).
[ADE00] P.R. Amestoy, I.S. Duff and J.-Y. L'Excellent. Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput. Methods Appl. Mech. Eng., 184, 501:520 (2000).
[ADKE01] P.R. Amestoy, I.S. Duff, J. Koster and J.-Y. L'Excellent. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications, 23, 15:41 (2001).
[AGES06] P.R. Amestoy, A. Guermouche, J.-Y. L'Excellent and S. Pralet. Hybrid scheduling for the parallel solution of linear systems. Parallel Computing, 32, 136:156 (2006).
[Che05] K. Chen. Matrix preconditioning techniques and applications. Ed. Cambridge University Press (2005).
[CW13] C. Weisbecker. Improving multifrontal solvers by means of algebraic block low-rank representations. PhD thesis, Toulouse University (2013). 2013 Léopold Escande thesis award.
[CW15] P. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L'Excellent and C. Weisbecker. Improving multifrontal methods by means of block low-rank representations. SIAM J. Sci. Comput., 37(3), A1451:A1474 (2015).
[Dav06] T.A. Davis. Direct methods for sparse linear systems. Ed. SIAM (2006).
[Duf06] I.S. Duff et al. Direct methods for sparse matrices. Ed. Clarendon Press (2006).
[Gol96] G. Golub and C. Van Loan. Matrix computations. Ed. Johns Hopkins University Press (1996).
[Hig02] N.J. Higham. Accuracy and stability of numerical algorithms. Ed. SIAM (2002).
[Las98] P. Lascaux and R. Théodor. Analyse numérique matricielle appliquée à l'art de l'ingénieur. Ed. Masson (1998).
[Liu89] J.W.H. Liu. Computer solution of large sparse positive definite systems. Ed. Prentice Hall (1981).
[Meu99] G. Meurant. Computer solution of large linear systems. Ed. Elsevier (1999).
[Saa03] Y. Saad. Iterative methods for sparse linear systems. Ed. SIAM (2003).
Appendix 1
The multifrontal method, developed by I.S. Duff and J.K. Reid (1983), consists in particular in using graph theory38 to build, from a sparse matrix, an elimination tree organizing the computations effectively (cf. figure 7.1.1). The black squares materialize the nonzero matrix terms. To handle this kind of data 'through graphs', the following rule is usually imposed: a vertex of the graph represents an unknown, and an edge between two vertices a nonzero matrix term.
Thus in the example below, variables 1 and 2 must be connected by an edge (in the initial numbering). Generally, the initial numbering of the matrix is not optimal. In order to reduce the memory footprint of the factorization39, the number of operations of the later handling, and to try to guarantee a good level of precision of the result, this initial matrix is renumbered. One thus sees that the number of additional terms ('fill-in') created by the factorization drops from 16 to 10.
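To make the fill-in count concrete, here is a small plain-Python sketch of symbolic elimination on the adjacency graph (illustrative only; real renumbering tools such as minimum-degree or nested-dissection heuristics are far more elaborate):

def fill_in(adj, order):
    # Count the additional nonzero terms ('fill-in') created by Gaussian
    # elimination on a symmetric sparse matrix, for a given elimination order.
    # adj: dict mapping each variable to the set of its neighbours.
    adj = {v: set(nb) for v, nb in adj.items()}
    fill = 0
    for v in order:
        alive = [w for w in adj[v] if w in adj and w != v]
        # eliminating v connects all its not-yet-eliminated neighbours
        for i, a in enumerate(alive):
            for b in alive[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 2      # symmetric matrix: one term in L, one in U
        del adj[v]
    return fill

# toy example: a 5-variable 'star' graph under two different orders
star = {1: {2, 3, 4, 5}, 2: {1}, 3: {1}, 4: {1}, 5: {1}}
print(fill_in(star, [1, 2, 3, 4, 5]))  # hub first: 12 fill terms
print(fill_in(star, [2, 3, 4, 5, 1]))  # leaves first: 0, no fill at all

Comparing two orders on the same graph shows, as in figure 7.1.1, how renumbering reduces the fill-in.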
It is from this renumbered matrix that the elimination tree used by the multifrontal method is built. This tree-based vision has the great virtue of concretely organizing the tasks: their dependence or independence is precisely determined (to harness parallelism), as is their precedence (to operate regroupings in order to optimize BLAS performance); certain consumptions (in time and memory) can be forecast, and one can even try to limit the numerical problems.
Thus in the example below, the treatment of variable 1 does not affect variables 2, 4 and 5. On the other hand, it will impact variables 3 and 7, which occupy the higher levels of the tree (they are its ancestors).
Figure 7.1.1. First row, from left to right: an initial sparse matrix, the same renumbered matrix and the elimination tree associated with the latter. Second row, from left to right: the factorizations corresponding to the initial and renumbered matrices.
The second strong idea of the multifrontal method is to gather a maximum of variables (one speaks of amalgamation) in order to constitute dense 'fronts' whose numerical processing will be much more effective (via BLAS-type routines).
It is then necessary to fill the corresponding matrix sub-blocks with true zeros and to handle them as such40. This is for example what is done in the tree of figure 7.1.2 between the amalgamated variables 7, 8 and 9. There exist many amalgamation techniques, based on graph criteria, numerical aspects or software considerations (e.g. distributed parallelism).
Figure 7.1.2. Elimination tree with its matrix blocks and a choice of 'fronts'.
Note:
• For the sake of simplification, this summary does not address the other numerical processings which often intervene in the process: scaling of the terms, permutation of the columns and static/dynamic pivoting. As 'the devil is often in the details', it is however these related treatments which complicated the numerical work and the software developments of Clément. They are essential to the good progress of many of our industrial simulations with Code_Aster.
40 Not counting the induced fill-in. This is a little contradictory with the preceding renumbering stage but, in general, this amalgamation is very beneficial for the whole process.
Within a front, two types of variables are distinguished:
• the 'Fully Summed' (FS) variables which, as their name indicates, are completely treated and will no longer be updated;
• the 'Non Fully Summed' (NFS) variables which still expect contributions from other branches of the tree.
The first type of variables can be completely 'eliminated': the terms of the factorization concerning them (rows of U and columns of L) are computed and stored41 once and for all. The second type produces a block of contributions42 (noted CB, for 'Contribution Block') which will be added, at the higher level of the tree, to the other CBs associated with the same variables (cf. figure 7.2.1).
Figure 7.2.1. General structure of a 'front' before and after its treatment.
In their turn, certain variables associated with these blocks will be eliminated while others, on the contrary, will continue to take an active part in the process by providing new CBs. And so on… up to the root of the tree. At this last stage, only eliminated variables remain and no CB!
Thus, in the fronts associated with the leaves on the left of figure 7.2.1, variables 1 and 2 are FS, while variables 3, 7 and 9 are NFS. The latter provide CBs which will be accumulated in the front of the higher level gathering variables 3, 7, 8 and 9. This front will 'relieve' the algorithmic process of the FS variable 3 (it will be eliminated), while the NFS variables 7, 8 and 9 will continue their ascent in the tree.
The algorithmic complexity of the whole and its memory cost are respectively in n^3 and in n^2. These two figures alone illustrate the impact of this stage on the performance of our simulations (even if these are carried out in 'sparse' mode).
In order to optimize the costs, one generally works on blocks and not on scalars (cf. algorithm 1.3). The data to be gathered in RAM at a given moment are then smaller, and the vector and matrix algebraic operations are much more effective (via optimized BLAS-type routines). The types of operations to be carried out remain unchanged:
• local factorizations of the diagonal blocks ('Factor'),
• forward/backward substitutions (noted 'Solve') to actually build the column/row blocks of the factorization,
• updates ('Update') per block of the submatrix.
In the elimination trees presented previously, it is this kind of operation which is carried out within each front, then between each front and its 'father'. The low-rank compression will aim to reduce their algorithmic complexity as well as their memory footprint (RAM peak and disc consumption).
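A minimal numpy sketch of this blocked Factor/Solve/Update pattern, using a dense right-looking Cholesky factorization for simplicity (illustrative only; MUMPS applies the pattern front by front, with pivoting):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def blocked_cholesky(A, nb):
    # Blocked right-looking Cholesky A = L L^T, panel size nb.
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Factor: dense factorization of the diagonal block
        A[k:e, k:e] = cholesky(A[k:e, k:e], lower=True)
        if e < n:
            # Solve: build the corresponding column block of the factor
            A[e:, k:e] = solve_triangular(A[k:e, k:e], A[e:, k:e].T, lower=True).T
            # Update: Schur-complement update of the trailing submatrix
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)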
Figure 7.4.1. Memory management of the various elements handled by the multifrontal method.
This active memory is managed like a stack: it fluctuates constantly. It grows when a front is loaded, then decreases as the CBs associated with this front are consumed, and finally grows again with the new factors and the new CB resulting from this front.
The new factors are then possibly copied out to disc. The memory footprint of these factors (red zone on the left of the graph) is simpler to analyze: it only swells as the tree is climbed!
In all cases, these two memory zones are not easy to predict a priori. They depend in particular on the renumbering, on the construction of the elimination tree, but also on the numerical pretreatments. It is one of the
tasks carried out by the analysis stage of MUMPS. It is useful both for the multifrontal method itself, to allocate its memory zones effectively, and for the calling code, to manage its own internal objects as well as possible43. The optimized management of these elements is further complicated by the numerical processings carried out dynamically in the course of the factorization: for example, the organization of the pivoting and the distribution of the data in parallel.
It is the peak of this active memory which constitutes the essential challenge of this thesis. It is often much higher than the size of the factorization. Yet the latter is already often around 500N in our current studies (with N the size of the problem). Thus the treatment of a matrix comprising 1 million unknowns requires at least 500×10⁶ factor terms, i.e. about 4 GB of RAM (in real double precision). And this figure tends to increase with the size of the problem (one passes from 500N to 1000N or even more).
One can thus reach RAM peaks that even the parallel distribution of the data combined with the OOC cannot manage effectively45. Any significant reduction of this peak would thus be a very appreciable gain for our large current and future studies.
A dense matrix A, of size m×n, is said to be 'of low rank' ('low-rank') of order k (< min(m, n)) when it can be decomposed in the form

  A = U·V^T + E    (7.5-1)

with U and V much smaller matrices (respectively m×k and n×k) and E an m×n matrix that is 'negligible' (‖E‖ ≤ ε). This concept of 'numerical rank' should not be confused with the concept of algebraic rank.
In addition, beyond the obvious gains in terms of storage, the handling of such matrices can prove very advantageous: the rank of a sum is (in the worst case) lower than or equal to the sum of the ranks, and the rank of a product is at most the minimum of the ranks. Once the matrices are decomposed, the handling of low-rank matrices can thus be (relatively) controlled in order to optimize the compression of the result.
Thus, for the dense product of two matrices of size n×n and of rank k, with k ≪ n, the algorithmic complexity of the operation is reduced from n^3 to kn^2, and its memory footprint passes from n^2 to kn.
A matrix transformed into low-rank form is said to be 'demoted' or compressed. The reverse transformation, up to the approximations, 'promotes' the matrix or decompresses it. To keep the same terminology, when a matrix is handled in the standard way (without exhibiting a decomposition of type (7.5-1)), it is said to be of full rank ('full-rank').
In this thesis Clément mainly uses the second solution (RRQR), less precise than a classic SVD but much less expensive. The whole point is that this compression is pooled between the various treatments and that it allows a maximum compression. Ideally, the rank obtained should thus satisfy a condition of the type

  k(m + n) ≪ mn    (7.5-3)

Part of Clément's work consisted in developing heuristics adapted to the multifrontal method in order to try to respect this criterion as often as possible. A sketch is given below.
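A minimal numpy sketch of the ε-truncated compression (7.5-1), combined with the storage test (7.5-3); an SVD is used here for clarity, whereas the thesis relies on a cheaper RRQR:

import numpy as np

def compress(A, eps):
    # Return (U, V) such that A ≈ U @ V.T with 2-norm error <= eps,
    # or None when the compression does not pay, i.e. k(m+n) >= mn.
    m, n = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = int(np.sum(s > eps))          # numerical rank at threshold eps
    if k * (m + n) >= m * n:          # criterion (7.5-3) violated: stay full-rank
        return None
    return U[:, :k] * s[:k], Vt[:k, :].T

# a smooth, 'admissible' block compresses very well:
A = np.add.outer(np.linspace(0., 1., 200), np.linspace(0., 1., 300))  # rank-2 matrix
UV = compress(A, 1.e-9)
if UV is not None:
    U, V = UV
    print(U.shape, V.shape)           # (200, 2) (300, 2): k(m+n)=1000 << mn=60000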
As illustrated below (cf. figure 7.6.1), one thus starts by decomposing into columns and rows of sub-blocks (decomposition into 'panels') the matrix blocks corresponding to the four types of terms of the fronts (according to the terminology of figure 7.2.1):
• FS×FS ('block (1,1)'),
• FS×NFS ('block (1,2)'),
• NFS×FS ('block (2,1)'),
• NFS×NFS ('block (2,2)').
Then each of these sub-blocks is compressed in low-rank form according to formula (7.5-1), at least when this compression is licit48. The exception is the diagonal sub-blocks of the FS×FS part, which remain full-rank (to optimize their later handling).
48 When these sub-blocks meet certain functional criteria. For example, they must be of sufficient size for a compression to be attempted; in addition, the compression gains must be sufficiently significant.
But the whole problem lies in the fact that these matrix sub-blocks have no a priori reason to be low-rank! Even if they are of large size, the 'demoted' parts of the fronts can prove to be of full rank. The costs of the SVDs or RRQRs would then not be compensated, and the compression objective would be compromised!
Taking inspiration from the work already carried out for other compression techniques (hierarchical matrices of type H, H², HSS/SSS…), criteria were developed to split the variables composing a front and to renumber them so that they generate low-rank matrix sub-blocks. These criteria are based on the following empirical observation:
The more distant two blocks of variables σ and τ are, the lower the numerical rank of the matrix block they imply becomes.
This so-called 'admissibility' condition was tested on certain model problems resulting from the discretization of elliptic PDEs; it is illustrated in figure 7.6.2. With the terminology of graph theory, it can be reformulated in the following form:

  max(diam(σ), diam(τ)) ≤ η·dist(σ, τ)    (7.6-1)

with diam and dist the usual concepts of diameter and distance in the graph associated with the treated front, and η a tolerance parameter.
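A small sketch of test (7.6-1) on a model grid graph (using networkx; the clusters sigma and tau and the tolerance eta are illustrative assumptions):

import networkx as nx

def admissible(G, sigma, tau, eta=1.0):
    # empirical admissibility (7.6-1): both clusters must be small
    # compared to the graph distance separating them
    diam_s = nx.diameter(G.subgraph(sigma))
    diam_t = nx.diameter(G.subgraph(tau))
    dist = min(nx.shortest_path_length(G, s, t) for s in sigma for t in tau)
    return max(diam_s, diam_t) <= eta * dist

G = nx.grid_2d_graph(20, 20)   # graph of a model elliptic-PDE-like discretization
sigma = [(i, j) for i in range(3) for j in range(3)]           # compact 3x3 cluster
tau = [(i, j) for i in range(17, 20) for j in range(17, 20)]   # distant 3x3 cluster
print(admissible(G, sigma, tau))   # True: a good low-rank candidate block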
Figure 7.6.2. Evolution of the 'numerical' rank of two groups of variables of homogeneous size according to the distance (in the graph sense) separating them.
Thus, on the example of the figure below (cf. fig. 7.6.3), splitting the variables of the front into slices yields a gain of 17% in the number of terms; i.e., on average, each sub-block admits a numerical rank k such that k(m + n) ≈ 0.83 mn.
With a checkerboard splitting, the gain reaches 47%: k(m + n) ≈ 0.57 mn. This variation is due to the greater regularity of the second splitting compared to the first: the homogeneous, square-shaped sub-blocks quickly have diameters lower than the distance between sub-blocks. With their elongated shape, this is obviously not the case for the sub-blocks resulting from the slice splitting!
Figure 7.6.3. Example of two types of clustering on a dense front: in slices and in checkerboard.
Figure 7.7.1. Clustering of the 'inherited' type between the NFS variables of a front and the FS variables of a previous front.
Example of sophistication: this clustering is operated separately on the two types of variables, the FS and the NFS. Since, as the elimination tree is traversed, the NFS variables of a front become the FS variables of fronts of higher levels, this task is split up. Moreover, these splittings are not completely recomputed for each front: certain information is pooled in order to limit the overcosts (in time). This particular clustering algorithm is called 'inherited', in opposition to the standard algorithm, called 'explicit', for which both clusterings are recomputed for each front handled.
The overcost of the 'explicit' clustering alternative can be prohibitive on very big problems (several times the cost of the entire analysis phase!) whereas the cost of the optimized clustering remains reasonable: only a few percent of this analysis phase. The low-rank gains of the two alternatives are very close; on the other hand, the implementation of the optimized version is more complex.
Finally, these clusterings are carried out by standard tools, METIS or SCOTCH. But they are not applied directly to the variables of the fronts but to halos including them. This trick makes it possible to orient these splittings suitably so that they produce homogeneous groups of contiguous nodes.
In addition, to be more precise, the treatment of a front comprises five stages (already evoked in the algorithms of fig. 7.3.1). One starts by cutting the variables into two nested levels of panels:
• the first level, the largest (called 'outer panel' or 'BLR panel'), results from the low-rank clustering;
• the second, inside the previous one ('inner panel' or 'BLAS panel'), gathers the variables in sub-packets of 32, 64 or 96 contiguous variables in order to feed the computation kernels more effectively (and to reduce the cost of MPI communications and of I/O).
Then two nested loops are carried out: the first over the BLR panels and the second over the BLAS panels. For a given BLAS panel, one performs:
• the Factorization stage (F),
• the Solve stage (S),
• the Internal Update (UI) with all the following sub-panels of the BLR panel,
• the External Update (EU) with all the following BLR panels.
The standard order of the operations can thus be coarsely noted FSUU. According to the level at which the low-rank Compression (C) is inserted, 4 alternatives are then distinguished (a sketch follows the list):
1. FSUUC,
2. FSUCU,
3. FSCUU,
4. FCSUU.
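Schematically, the FSCUU alternative can be sketched as follows (plain-Python pseudocode; the panel objects and the factor/solve/compress/update routines are hypothetical stubs standing in for the stages named above):

from collections import namedtuple

BlrPanel = namedtuple("BlrPanel", ["name", "blas_panels"])

# hypothetical stage stubs, standing in for the dense kernels:
def factor(p):                  pass   # F: full precision, pivoting/scaling intact
def solve(p):                   pass   # S: full-rank panel of the factor
def compress(p):                pass   # C: low-rank compression (7.5-1)
def update_internal(p, blr):    pass   # UI: following sub-panels of the same BLR panel
def update_external(p, later):  pass   # EU: following BLR panels, on compressed data

def process_front_fscuu(blr_panels):
    # two nested loops: outer over the BLR panels, inner over the BLAS panels
    for i, blr in enumerate(blr_panels):
        for blas in blr.blas_panels:      # sub-packets of 32, 64 or 96 variables
            factor(blas)
            solve(blas)
            compress(blas)                # C placed after F and S: the FSCUU variant
            update_internal(blas, blr)
            for later in blr_panels[i + 1:]:
                update_external(blas, later)

front = [BlrPanel("P0", ["b0", "b1"]), BlrPanel("P1", ["b2"])]
process_front_fscuu(front)   # stages run in F, S, C, UI, EU order on each BLAS panel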
It is alternative n°3, FSCUU, which is industrialized in the consortium versions of MUMPS. Taking into account the relative costs of each of the operations (cf. table 7.7.2) and the robustness constraints of the tool, it is this alternative which was privileged. It makes it possible to reduce the overall costs significantly while sparing a maximum of precision in the first two stages (F and S), because those are crucial in the management of several sophisticated numerical ingredients (scaling, pivoting, singularity detection, determinant computation, Sturm test…). Initially, it was thus preferred to impact them as little as possible with this compression. The compression is therefore carried out just afterwards, and its gains (and its possible approximations) affect only the internal and external updates (UI and EU).