
Code_Aster (default version)
Title: General information on direct linear solvers and the use of MUMPS
Date: 09/12/2019. Author: Olivier Boiteau. Key: R6.02.03. Revision: 0ac07fed0038

General information on direct linear solvers and the use of MUMPS

Summary

In thermomechanical simulations with Code_Aster, the bulk of the computational cost often comes from
building and solving linear systems. To carry out these solves more efficiently, Code_Aster chose to integrate
the direct method implemented in the MUMPS package ('MUltifrontal Massively Parallel sparse direct Solver';
P. R. Amestoy, J.-Y. L'Excellent et al.; CERFACS/CNRS/ENS Lyon/IRIT/INRIA/Université de Bordeaux). This
complements its in-house multifrontal solver (C. Rose) and its other solvers: 'LDLT', 'GCPC' and 'PETSC'.
In distributed parallel and Out-Of-Core mode, the Aster+MUMPS coupling yields CPU gains of around a factor
of ten on 32 processors, with RAM consumption, a functional perimeter and an accuracy of the results at least
as good as those of the native multifrontal solver.
On large problems, activating BLR compression can even gain an additional factor of two or three. Through
the parallelism it offers and its advanced features (pivoting, pre/post-processing, quality control of the
solution...), this product greatly eases the handling of standard studies. Moreover, it often remains the only
viable option for exploiting certain modelings/analyses (quasi-incompressible, X-FEM...) or for running very
large studies (as an accurate direct solver or as a preconditioner, cf. option 'LDLT_SP' of 'GCPC'/'PETSC').
In the first part of the document we summarize the general problem of solving linear systems, then we cover
the main families of sparse direct solvers and their variants in public-domain libraries: background worth
having in mind before approaching, in the second part, the MUMPS package through its main and advanced
features. We then detail the numerical, software and functional aspects of its integration in Code_Aster.
Finally, we conclude with some numerical results.

For more details and advice on the use of linear solvers, one may consult the dedicated user notes
[U4.50.01]/[U2.08.03]. The related questions of improving the performance (RAM/CPU) of a computation and
of using parallelism are also the subject of detailed notes: [U1.03.03] and [U2.08.06].


Contents
1 General information on direct solvers
  1.1 Linear systems and associated solution methods
  1.2 Linear algebra libraries
  1.3 Direct methods: the principle
  1.4 Direct methods: various approaches
  1.5 Direct methods: main stages
  1.6 Direct methods: difficulties
2 The MUMPS package
  2.1 History
  2.2 Main features
  2.3 Zoom on some technical points
    2.3.1 Pivoting
    2.3.2 Iterative refinement
    2.3.3 Reliability of computations
    2.3.4 Memory management (In-Core versus Out-Of-Core)
    2.3.5 Handling of singular matrices
    2.3.6 'Block Low-Rank' (BLR) compression
3 Implementation in Code_Aster
  3.1 Context/synthesis
  3.2 Two types of parallelism: centralized and distributed
    3.2.1 Principle
    3.2.2 Various distribution modes
    3.2.3 Load balancing
    3.2.4 Splitting the Code_Aster objects
  3.3 Memory management in MUMPS and Code_Aster
  3.4 Special handling of the double Lagrange multipliers
  3.5 Perimeter of use
  3.6 Parameter settings and examples of use
    3.6.1 MUMPS usage parameters via Code_Aster
    3.6.2 Monitoring
    3.6.3 Examples of use
4 Conclusion
5 Bibliography
  5.1 Books/articles/proceedings/theses
  5.2 EDF reports
  5.3 Internet resources
6 Document version history

7 Appendix 1: principle of BLR compression in MUMPS
  7.1 Principle of the multifrontal method
  7.2 Elimination tree
  7.3 Algorithmic treatments
  7.4 Memory management
  7.5 Low-rank approximation
  7.6 'Block Low-Rank' (BLR) multifrontal
  7.7 Some complementary elements


1 General information on direct solvers

1.1 Linear systems and associated solution methods
In the numerical simulation of physical phenomena, a large part of the computational cost often comes from
building and solving linear systems. Structural mechanics is no exception to the rule! The cost of building the
system depends on the number of integration points and on the complexity of the constitutive laws, while the
cost of solving it depends on the number of unknowns, the chosen modeling and the matrix topology
(bandwidth, conditioning). When the number of unknowns explodes, the second stage becomes dominant1
and it is thus the latter that will mainly interest us here. Moreover, when this solve phase can be accelerated
thanks to access to a parallel machine, we will see that this asset can be propagated to the construction of the
system itself (elementary computations and assembly).
These linear system solves are in fact omnipresent in simulation codes and often hidden deep inside
other numerical algorithms: nonlinear schemes, time integration, modal analysis... One seeks, for example,
the vector of nodal displacements (or of displacement increments) u satisfying a linear system of the type

  Ku = f   (1.1-1)

with K a stiffness matrix and f a vector representing the generalized forces applied to the mechanical
system.

More generally, the solution of this kind of problem raises broader questions than it may first appear:
• Does one have access to the matrix itself, or does one simply know its action on a vector?
• Is this matrix sparse or dense?
• What are its numerical properties (symmetry, positive definiteness...) and structural properties
(real/complex, banded, blocked...)?
• Does one need to solve only one system (1.1-1), several simultaneously2, or several consecutively3?
Or even several different, successive systems whose matrices are very close4?
• In the case of successive solves, can one reuse previous results in order to speed up the next
solves (cf. restart techniques, partial factorization)?
• What is the order of magnitude of the size of the problem, of the matrix and of its factors, compared
to the processing capacities of the CPU and of the associated memories (RAM, disk)?
• Does one want a very accurate solution or simply an estimate (cf. nested solvers)?
• Does one have access to linear algebra libraries (and to their prerequisites MPI, BLAS, LAPACK...)
or must one call on 'in-house' products?

In Code_Aster, the matrix is built explicitly and stored in the MORSE format5. For most modelings, the matrix
is sparse (because of the finite element discretization), potentially ill-conditioned6 and often real, symmetric
and indefinite7. In nonlinear or modal analyses, or during thermomechanical chainings, one often deals with
problems of the 'multiple right-hand side' type. The discrete contact-friction methods benefit from partial
factorization capabilities, as does the domain decomposition method. In addition, Code_Aster also uses
scenarios of simultaneous solves (Schur complements of the contact algorithms, substructuring...).

1 For Code_Aster, see the 'profiling' study led by N. Anfaoui [Anf03].
2 Same matrix but several independent right-hand sides; cf. the construction of a Schur complement.
3 Problem of the 'multiple right-hand side' type: same matrix but several successive and interdependent right-hand sides; cf. Newton's
algorithm without updating of the tangent matrix.
4 Problem of the 'multiple left-hand side' type: several successive and interdependent matrices and right-hand sides, but with
'spectrally' close matrices; cf. Newton's algorithm with updating of the tangent matrix.
5 Also known as SCR, for 'Symmetric Compressed Row' storage (stored by column in Code_Aster).
6 In structural mechanics the conditioning κ(K) is known to be rather bad. It can vary, typically, from 10^5 to 10^12, and refining the
mesh, using stretched elements or structural elements has dramatic consequences on this figure (cf. B. Smith, A parallel iterative
implementation of a substructuring algorithm for problems in 3D, SIAM J. Sci. Comput., 14 (1992), pp. 406-423, §3.1, or I. Fried,
Condition of finite element matrices generated from nonuniform meshes, AIAA J., 10 (1972), pp. 219-221).
7 The indefinite rather than positive definite character is due to the addition of extra variables (the so-called 'Lagrange' variables) used
to impose simple or generalized Dirichlet boundary conditions [R3.03.01].
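
For readers less familiar with compressed-row storage (cf. footnote 5), the following minimal sketch illustrates the idea with SciPy. This is an illustration only: the actual MORSE structure of Code_Aster stores the symmetric profile in its own JEVEUX objects.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Compressed-row (CSR) storage: only the nonzero terms are kept, plus two
# integer arrays giving their column indices and the row boundaries.
K = np.array([[10., 0., 30.],
              [0., 45., 0.],
              [30., 0., 171.]])
K_csr = csr_matrix(K)
print(K_csr.data)     # nonzero values:       [ 10.  30.  45.  30. 171.]
print(K_csr.indices)  # their column indices: [0 2 1 0 2]
print(K_csr.indptr)   # row start offsets:    [0 2 3 5]
```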

As for the sizes of the problems, even if they swell year after year, they remain modest compared to CFD:
about a million unknowns, but for hundreds of time steps or Newton iterations.
In addition, from a 'middleware and hardware' point of view, the code now relies on many optimized and
perennial libraries (MPI, BLAS, LAPACK, (Par)METIS, (Pt)SCOTCH, PETSc, MUMPS...) and is used mainly
on clusters of SMP nodes (fast networks, large RAM and disk storage capacity). One thus seeks above all to
optimize the use of the linear solvers accordingly.

For 60 years, two types of techniques have disputed supremacy in the field: direct solvers and iterative
solvers (cf. [Che05] [Dav03] [Duf06] [Gol96] [Las98] [Liu89] [Meu99] [Saa03]).
The former are robust and complete in a finite number of operations that is (theoretically) known in advance.
Their theory is relatively mature and their variants covering many types of matrices and software architectures
are very complete. In particular, their multilevel block algorithms are well adapted to the memory hierarchies
of the current machines. However, they require storage capacities which grow quickly with the size of the
problem, which limits the extensibility of their parallelism8, even if this parallelism can be decomposed into
several independent layers, thereby multiplying the performance.
Iterative methods, on the other hand, are more 'scalable' when one increases the number of processors.
Their theory abounds in 'open problems', especially in finite-precision arithmetic. In practice, their
convergence in a 'reasonable' number of iterations is not always guaranteed; it depends on the structure of
the matrix, the starting point, the stopping criterion... This kind of solver has more difficulty gaining ground in
industrial structural mechanics, where one often accumulates heterogeneities, nonlinearities and junctions of
models which poison the conditioning of the working operator. In addition, they are not suited to solving
'multiple right-hand side' problems efficiently, even though those are very frequent in the algorithms of
mechanical simulation.

[Figure: direct branch: K = L D L^T, then Lw = f, Dv = w, L^T u = v; iterative branch: successive
approximations of u.]
Figure 1.1-1. Two classes of methods for solving a linear system:
direct methods and iterative methods.

Contrary to their direct counterparts, it is not possible to propose the iterative solver that will solve any linear
system. The adequacy of a type of algorithm to a class of problems must be assessed case by case. They
present, nevertheless, other advantages which have historically established them for certain applications. At
equivalent memory management, they require less memory than the direct solvers, because one only needs
to know the action of the matrix on an arbitrary vector, without truly having to store it. In addition, one is not
subjected to the 'diktat' of the fill-in phenomenon which deteriorates the profile of the matrices; one can
effectively exploit the sparse character of the operators and control the accuracy of the results9. In short, the
use of direct solvers belongs rather to the domain of technique, whereas the choice of a good iterative
method/preconditioner pair is rather an art! In spite of its biblical simplicity on paper, the solution of a linear
system, even a symmetric positive definite one, is not 'a long quiet river'. Between two evils, fill-in/pivoting
and preconditioning, one must choose!

Note:
• A third class of methods tries to draw on the respective advantages of the direct and the iterative
ones: the Domain Decomposition (DD) methods [R6.01.03].

8 One also speaks of 'scalability', the ability to scale up.
9 Which can be very interesting within the framework of nested solvers (e.g. Newton+GCPC), cf. V. Frayssé, The power of backward
error analysis, HDR of the Institut National Polytechnique de Toulouse (2000).

• The two large families of methods should be seen as complementary rather than as competitors. One
often seeks to mix them: DD methods, preconditioning by incomplete factorization (cf. [R6.01.02] §4.2)
or of multigrid type, iterative refinement procedure at the end of a direct solve...

1.2 Linear algebra libraries

To carry out the solution of a linear system efficiently, the question of resorting to a library or to an external
product is now unavoidable. Why? Because this strategy allows:
• Less technical, less invasive and much faster developments in the host code.
• Acquiring, at lower cost, a broad perimeter of use while outsourcing a good number of the associated
contingencies (typology of the problem cf. §1.1, data representation, target machine architectures...).
• Benefiting from the feedback of a varied community of users and from the (very) sharp skills of
international teams.
These libraries indeed often combine efficiency, reliability, performance and portability:
• Efficiency, because they exploit the spatial and temporal locality of the data and make use of the
memory hierarchy (cf. the various levels of BLAS).
• Reliability, because they offer tools to estimate the error made on the solution (estimates of the
conditioning and of the 'backward/forward errors') and even to improve it (for direct solvers, matrix
scaling and iterative refinement).

Since the emergence in the 1970s-80s of the first public10 and private/vendor11 libraries and their
communities of users, the offering has multiplied. The trend is of course to propose high-performance
solutions (vector, then centralized- and distributed-memory parallelism, multilevel parallelism via threads) as
well as 'toolkits' for the handling of linear algebra algorithms and the associated data structures. Let us quote,
in a non-exhaustive way: ScaLAPACK (Dongarra & Demmel 1997), SparseKIT (Saad 1988), PETSc (Argonne
1991), HyPre (LLNL 2000), TRILINOS (Sandia 2000)...

Figure 1.2-1. Some logos of linear algebra libraries.

Note:
• To structure their use more effectively and to offer 'black box' solutions, macro-libraries have
recently emerged. They gather a panel of these products to which they add in-house solutions:
Numerical Platon (CEA-DEN), Arcane (CEA-DAM)...

Concerning more specifically the direct methods for solving linear systems, about fifty packages are
available. One distinguishes the 'standalone' products from those incorporated into a library, the public ones
from the commercial ones, those dealing with dense problems and the others with sparse ones. Some work
only in sequential mode, others support shared- and/or distributed-memory parallelism. Lastly, certain
products are generalists (symmetric, nonsymmetric, SPD, real/complex...), others are adapted to a quite
precise need/scenario.
One can find a rather exhaustive list of all these products on the site of one of the founding fathers of
LAPACK/BLAS: Jack Dongarra [Don]. The table below (Table 1.2-1) is an abridged version. It retains only
the direct solvers of the public domain and leaves out: CHOLMOD, CSPARSE, DMF, Oblio, PARASPAR,
PARDISO, PaStiX (the other French direct solver, along with MUMPS), S+, SPRSBLKKT and WSMP.

10 EISPACK (1974), LINPACK (1976), BLAS (1978) then LAPACK (1992)...
11 NAG (1971), IMSL/ESSL (IBM 1971), ASL/MathKeisan (NEC), SciLib (CRAY), MKL (Intel), HSL (Harwell)...

This Internet resource also lists packages implementing iterative solvers, preconditioners, modal solvers, as
well as many support products (BLAS, LAPACK, ATLAS...).

DIRECT SOLVERS   License  Support  Real  Complex  F77  C      Seq  Dist  SPD  Gen
DENSE
FLAME            LGPL     yes      X     X        X    X      X
LAPACK           BSD      yes      X     X        X    X      X
LAPACK95         BSD      yes      X     X        95          X
NAPACK           BSD      yes      X              X           X
PLAPACK          ?        yes      X     X        X    X           M
PRISM            ?        no       X              X    X           M
ScaLAPACK        BSD      yes      X     X        X    X           M/P
Trilinos/Pliris  LGPL     yes      X     X             X/C++       M
SPARSE
DSCPACK          ?        yes      X                   X      X    M     X
HSL              ?        yes      X     X        X           X          X    X
MFACT            ?        yes      X              X           X    M     X
MUMPS            PD       yes      X     X        X    X      X    M     X    X
PSPASES          ?        yes      X              X    X           M     X
SPARSE           ?        ?        X     X        X    X      X               X
SPOOLES          PD       ?        X     X             X      X    M          X
SuperLU          Own      yes      X     X        X    X      X    M          X
TAUCS            Own      yes      X     X             X      X          X    X
Trilinos/Amesos  LGPL     yes      X                   X           M     X    X
UMFPACK          LGPL     yes      X     X             X      X               X
Y12M             ?        yes      X     X        X           X               X

Table 1.2-1. Extract from the Web page of Jack Dongarra [Don] on the free products
implementing a direct method; 'Seq' for sequential, 'Dist' for parallel ('M' MPI and 'P' PVM),
'SPD' for symmetric positive definite and 'Gen' for general matrices.

Note:
• A more detailed Internet resource, focused on sparse direct solvers, is maintained by another great
name of numerical computing: T. A. Davis [Dav], one of the contributors to Matlab.

1.3 Direct methods: the principle

The basic idea of direct methods is to decompose the matrix K of the problem into a product of special
matrices (lower and upper triangular, diagonal) that are easier 'to invert'. This is what is called the
factorization12 of the working matrix:
• If K is SPD, it admits the unique 'Cholesky factorization' K = L L^T with L lower triangular;
• If K is symmetric (possibly indefinite) and regular, it admits at least one 'L D L^T factorization':
P K = L D L^T with L lower triangular with diagonal coefficients equal to one, D a diagonal matrix13
and P a permutation matrix;
• If K is general and regular, it admits at least one 'L U factorization': P K = L U with L lower
triangular with unit diagonal, U upper triangular and P a permutation matrix.

12 By analogy with the polynomial factorizations of secondary school...
13 It can also contain 2×2 diagonal blocks.

Figure 1.3-1. Principle of direct methods.

Note:
• For example, the symmetric regular matrix K below decomposes into the following L D L^T form
(without needing any permutation here, P = Id):

        [ 10  20  30  ]   [ 1 0 0 ] [ 10 0 0 ] [ 1 2 3 ]
  K :=  [ 20  45  80  ] = [ 2 1 0 ] [  0 5 0 ] [ 0 1 4 ]   (1.3-1)
        [ 30  80 171  ]   [ 3 4 1 ] [  0 0 1 ] [ 0 0 1 ]
                              L         D         L^T
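
The decomposition (1.3-1) can be checked in a few lines; a minimal NumPy sketch (not part of the code itself):

```python
import numpy as np

# Check of example (1.3-1): the factors L and D reproduce K exactly.
K = np.array([[10., 20., 30.],
              [20., 45., 80.],
              [30., 80., 171.]])
L = np.array([[1., 0., 0.],
              [2., 1., 0.],
              [3., 4., 1.]])
D = np.diag([10., 5., 1.])
assert np.allclose(L @ D @ L.T, K)
```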

Once this decomposition is carried out, the solution of the problem is greatly facilitated. It reduces to the
simplest linear solves there are: those involving triangular or diagonal matrices. These are the famous
'forward/backward substitutions' ('descente-remontée'). For example, in the case of an L U factorization, the
system (1.1-1) is solved by

  Ku = f and PK = LU   =>   Lv = Pf (forward sweep), then Uu = v (backward sweep)   (1.3-2)

In the first, lower triangular system (forward sweep), one determines the intermediate solution vector v. This
then serves as the right-hand side of the upper triangular system (backward sweep), whose solution is the
vector u of interest.
This phase is inexpensive (in the dense case, about N^2 operations versus N^3 for the factorization14, with
N the size of the problem) and can thus be repeated many times while keeping the same factors. This is
very useful when solving a problem of the multiple right-hand side type or when carrying out simultaneous
solves.
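
Written out on dense factors, the two sweeps of (1.3-2) look as follows (a minimal sketch, assuming the factors L and U and the permuted right-hand side Pf are already available):

```python
import numpy as np

def forward_backward(L, U, Pf):
    """Sketch of the two sweeps of (1.3-2): solve L v = P f, then U u = v."""
    n = len(Pf)
    v = np.zeros(n)
    for i in range(n):                      # forward sweep (lower triangular)
        v[i] = (Pf[i] - L[i, :i] @ v[:i]) / L[i, i]
    u = np.zeros(n)
    for i in reversed(range(n)):            # backward sweep (upper triangular)
        u[i] = (v[i] - U[i, i + 1:] @ u[i + 1:]) / U[i, i]
    return u
```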
In the first scenario, the matrix K is fixed and the right-hand sides f_i are changed successively to compute
as many solutions u_i (the solves are interdependent). This makes it possible to pool, and thus amortize, the
initial cost of the factorization. This strategy is used abundantly, in particular in Code_Aster: nonlinear loops
with periodic updating (or no updating) of the tangent matrix (e.g. the Aster operator STAT_NON_LINE),
subspace methods or inverse power iterations (without Rayleigh acceleration) in modal computation
(CALC_MODES), thermomechanical chaining with material characteristics independent of the temperature
(MECA_STATIQUE)...
In the second scenario, all the f_i are known at the same time and the forward/backward sweeps are
organized by blocks, to compute the independent solutions u_i simultaneously. One can then use more
efficient high-level linear algebra routines, and even reduce memory consumption by storing the vectors f_i
in sparse form.

14 In the dense case, Coppersmith and Winograd (1982) showed that this algorithmic complexity can be reduced, at best, to C·N^ω
with ω = 2.49 and C a constant (for large N).

This strategy is (partially) used in Code_Aster, for example, in the construction of the Schur complements of
the contact-friction algorithms or for substructuring.

Note:
• MUMPS provides for these two types of strategy and even offers features to facilitate the
construction and solution of Schur complements.
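
The 'factor once, solve many' pattern of the first scenario can be sketched with SciPy's sparse LU (an illustration; in Code_Aster this role is played by the solver selected under SOLVEUR, e.g. MUMPS):

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# Factor K once (the expensive phase), then amortize it over several
# successive right-hand sides via cheap forward/backward substitutions.
K = csc_matrix(np.array([[10., 20., 30.],
                         [20., 45., 80.],
                         [30., 80., 171.]]))
lu = splu(K)                       # sparse factorization: Pr K Pc = L U
for i in range(3):                 # 'multiple right-hand sides' scenario
    f_i = np.full(3, float(i + 1))
    u_i = lu.solve(f_i)            # forward/backward substitution only
    assert np.allclose(K @ u_i, f_i)
```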

Let us now examine the factorization process itself. It is already clearly explained in other theoretical
documentation of Code_Aster on the subject [R6.02.01] [R6.02.02], as well as in the bibliographical
references already cited [Duf06] [Gol96] [Las98]. We will therefore not detail it. Let us just specify that it is an
iterative process organized schematically around three loops: one 'in i' (over the rows of the working matrix),
the second 'in j' (resp. columns) and the third 'in k' (resp. factorization stages). They repeatedly build a new
matrix A^(k+1) from certain data of the preceding one, A^(k), via the classical factorization formula formally
written:

  Loops over i, j, k:
  A^(k+1)(i,j) := A^(k)(i,j) - A^(k)(i,k) A^(k)(k,j) / A^(k)(k,k)   (1.3-3)

Initially the process is started with A^(0) = K and, at the last stage, one recovers in the square matrix A^(N)
the triangular parts (L and/or U), or even the diagonal one (D), of interest. For example, in the L D L^T case:

  Loops over i, j:
  if i > j:  L(i,j) = A^(N)(i,j)
  if i = j:  D(i,i) = A^(N)(i,i)   (1.3-4)
Note:
• Formula (1.3-3) contains in embryo the problems inherent to direct methods: in sparse storage, the
fact that the term A^(k+1)(i,j) can become nonzero even though A^(k)(i,j) is zero (the notion of fill-in of
the factors, hence the need for a renumbering or 'ordering'); and the propagation of rounding errors or
division by zero via the term A^(k)(k,k) (the notion of pivoting and of balancing of the terms of the
matrix or 'scaling').
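
To fix ideas, here is a minimal dense sketch of the update (1.3-3) in the 'kij' loop order, with no pivoting and no sparse storage (both simplifications, of course, of what a real solver does):

```python
import numpy as np

def ldlt_kij(K):
    """Toy dense LDL^T factorization applying the update (1.3-3) in 'kij'
    loop order (right-looking). No pivoting, no sparse storage: a sketch."""
    A = K.astype(float).copy()
    n = A.shape[0]
    for k in range(n):                       # elimination stage
        for i in range(k + 1, n):            # remaining rows
            for j in range(k + 1, n):        # remaining columns
                A[i, j] -= A[i, k] * A[k, j] / A[k, k]
    D = np.diag(A).copy()                    # diagonal factor as a vector
    L = np.tril(A, -1) / D[np.newaxis, :] + np.eye(n)  # normalize columns
    return L, D

K = np.array([[10., 20., 30.], [20., 45., 80.], [30., 80., 171.]])
L, D = ldlt_kij(K)
assert np.allclose((L * D) @ L.T, K)   # reproduces example (1.3-1)
```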

1.4 Direct methods: various approaches

The order of the loops i, j and k is not fixed. One can permute them and carry out the same operations in a
different order. This thus defines six variants, kij, kji, ikj..., which manipulate different zones of the current
matrix A^(k): the 'zone of newly computed terms' via (1.3-3), the 'zone already computed and in use' in
(1.3-3), the 'zone already computed and no longer used' and the 'zone not yet computed'. For example, the
jik variant operates, for fixed j, according to the following diagram:


Figure 1.4-1. Construction diagram of a 'jik' factorization ('right looking').

Note:
• The L D L^T method of Code_Aster (SOLVEUR/METHODE='LDLT') is an 'ijk' factorization; the
multifrontal methods of C. Rose (... = 'MULT_FRONT') and of MUMPS (... = 'MUMPS') are
column-oriented ('kji').
• Certain variants bear particular names: Crout's algorithm ('jki') and Doolittle's ('ikj').
• In papers one often uses the Anglo-Saxon terminology indicating the orientation of the matrix
manipulation rather than the order of the loops: 'looking-forward method', 'looking-backward method',
'up-looking', 'left-looking', 'right-looking', 'left-right-looking'...

All these variants are further declined according to whether:

• one exploits certain properties of the matrix (symmetry, positive definiteness, band...) or seeks the
broadest perimeter of use;
• one carries out scalar treatments or block treatments;
• the decomposition into blocks is determined by memory considerations (cf. the paged L D L^T
method of Code_Aster) or rather by the independence of the subsequent tasks (via an elimination
graph, cf. the multifrontal method [R6.06.02] §1.3);
• one reintroduces zero terms in the blocks to ease the access to the data15 and to enable very
efficient algebraic operations, often via BLAS316 (cf. the multifrontal solvers of C. Rose and MUMPS);
• one groups the contributions affecting a block of rows/columns (the 'fan-in' approach, cf. PaStiX) or
applies them as soon as possible ('fan-out');
• in parallelism, one seeks to exploit various levels of sequences of independent tasks, whether they
are scheduled statically or dynamically, whether computation is overlapped with communication...
• one applies pre- and post-processing to reduce the fill-in and to improve the quality of the results:
renumbering of the unknowns, scaling of the terms of the matrix, partial pivoting (row) or total pivoting
(row and column), scalar or by 2×2 diagonal blocks, iterative refinement...

To summarize, four categories are often distinguished:

• classical algorithms: Gauss, Crout, Cholesky, Markowitz (Matlab, Mathematica, Y12M...);
• frontal methods (MA62...);
• multifrontal methods (MULT_FRONT Aster, MUMPS, SPOOLES, TAUCS, UMFPACK, WSMP...);
• supernodal methods (SuperLU, PaStiX, CHOLMOD, PARDISO...).

1.5 Direct methods: main stages

15 This sparse/dense compromise reduces indirect addressing of the data and thus makes better use of the memory hierarchy of
current machines.
16 The 'computation/memory access' ratio of level-3 BLAS (matrix-matrix products) is N times better (with N the size of the problem)
than that of the other BLAS levels. It is also often higher than that of hand-written routines not optimized for these 'data
locality/memory hierarchy' aspects.

When sparse systems are treated, the numerical factorization phase (1.3-3) is not applied directly to the
initial matrix K, but to a working matrix K_work resulting from a pre-processing phase, in order to reduce the
fill-in, improve the accuracy of the computations and thus optimize the subsequent CPU and memory costs.
Roughly, this working matrix can be written as the following matrix product

  K_work := P_o D_r K Q_c D_c P_o^T   (1.5-1)

whose various elements we will describe hereafter. The operation of a direct solver can thus be broken down
into four stages:

1) Pre-processing and symbolic factorization: this permutes the columns of the working
matrix (via a permutation matrix Q_c) in order to avoid divisions by zero by the term
A^(k)(k,k) and to reduce the fill-in. Moreover, it rebalances the terms in order to limit
the rounding errors (via the scaling matrices D_r / D_c). This phase can also be crucial
for the algorithmic efficiency (a gain of a factor 10 is sometimes noted) and for the
quality of the results (a gain of 4 or 5 decimal digits).
In this phase, one also creates the storage structures of the sparse factored matrix
and the auxiliary ones (dynamic pivoting, communication...) required by the following
phases. Moreover, one estimates the dependence tree of the tasks, their initial
distribution across the processors and the expected total memory consumption.

2) The renumbering stage: this permutes the unknowns of the matrix (via the
permutation matrix P_o) in order to reduce the fill-in caused by the factorization.
Indeed, formula (1.3-3) shows that the factors can contain a new nonzero term within
their profile (A^(k+1)(i,j) ≠ 0) where the initial matrix had none (A^(k)(i,j) = 0),
because the term A^(k)(i,k) A^(k)(k,j) / A^(k)(k,k) is not necessarily zero. In
particular, it is nonzero as soon as nonzero terms of the type A^(k)(i,l) or A^(k)(l,j)
(l < i and l < j) can be found in the initial matrix. This phenomenon can lead to very
large memory and computation overcosts (the factors can be 100 times larger than
the initial sparse matrix!).
Hence the idea of renumbering the unknowns (and thus permuting the rows and the
columns of K) in order to slow down this phenomenon, which is the true 'Achilles'
heel' of direct methods. For this purpose, one often calls on external products
((Par)METIS, (Pt)SCOTCH, CHACO, JOSTLE, PARTY...) or on the heuristics
embedded in the solvers (AMD, RCMK...). Of course, these products exhibit different
performance according to the matrices treated, the number of processors... Among
them, METIS and SCOTCH are very widespread and often 'come out on top' (gains
of up to 50%). A small numerical illustration of the impact of the ordering follows this
list.

3) The numerical factorization phase: this implements formula (1.3-3) via the methods
discussed in paragraph §1.4 above. It is, by far, the most expensive phase, which
explicitly builds the sparse factorizations L L^T, L D L^T or L U.

4) The solve phase: this carries out the forward/backward substitution (1.3-2) from
which (at last!) the solution u 'springs forth'. It is inexpensive and can pool a previous
numerical factorization (multiple right-hand sides, simultaneous solves, restart of a
computation...).
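
The illustration announced in stage 2: a SciPy sketch comparing the number of nonzero terms of the factors without and with a fill-reducing column ordering (COLAMD stands in here for METIS/SCOTCH/AMD...; the random test matrix is of course arbitrary):

```python
import numpy as np
from scipy.sparse import eye, random as sprandom, csc_matrix
from scipy.sparse.linalg import splu

# Fill-in depends strongly on the ordering chosen in stage 2: compare the
# size of the factors without and with a fill-reducing column permutation.
A = sprandom(300, 300, density=0.02, random_state=0)
K = csc_matrix(A + A.T + 300.0 * eye(300))   # a sparse, well-posed test matrix

for ordering in ('NATURAL', 'COLAMD'):
    lu = splu(K, permc_spec=ordering)
    print(ordering, lu.nnz)   # number of nonzero terms stored in L and U
```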

Note:


• Stages 1 and 2 require only the knowledge of the connectivity and of the graph of the initial matrix,
that is, data that can be stored and handled as integers. Only the last two stages use reals, the actual
terms of the matrix. Stages 1 and 2 need these terms only if the scaling stages are activated
(computation of D_r / D_c).
• Stages 1 and 4 are independent while stages 1, 2 and 3, a contrario, are interdependent. According
to the products/algorithmic approaches, they are grouped differently: 1 and 2 are coupled in MUMPS,
1 and 3 in SuperLU and 1, 2 and 3 in UMFPACK... MUMPS makes it possible to carry out stages
1+2, 3 and 4 separately but successively, and even to pool their results across various sequences.
For the moment, in Code_Aster, one alternates sequences 1+2+3 and 4, 4, 4... and again 1+2+...
(cf. following chapter).
• Certain products offer to test several strategies in one or more stages and choose the most suitable:
SPOOLES and WSMP for stage 1, TAUCS for stage 3, etc.
• The renumbering tools of the first phase are based on very varied concepts: geometric methods,
optimization techniques, graph theory, spectral theory, tabu methods, evolutionary algorithms,
memetic ones, those based on so-called 'ant colonies', neural networks... Anything goes to improve
the local optimum in the form of which the problems of the renumberer are expressed. These tools
are also often used to partition/distribute meshes (cf. [R6.01.03] §6). For the moment, Code_Aster
uses the renumberers METIS/AM/AMD (for METHODE='MULT_FRONT' and 'MUMPS'),
AMF/QAMD/PORD (for 'MUMPS') and RCMK17 (for 'GCPC', 'LDLT' and 'PETSC').
• Whatever the linear solver18 used, Code_Aster carries out a preliminary phase (stage 0) to describe
the unknowns of the problem (link between physical or late (Lagrange) degrees of freedom and matrix
row numbers, via the NUME_DDL data structure) and to provide for the ad hoc MORSE storage of the
matrix profile.

1.6 Direct methods: difficulties

Among the difficulties that sparse direct methods must overcome, one finds:
• The handling of complex data structures which optimize the storage (cf. matrix profile) but complicate
the algorithms (cf. pivoting, OOC...). This contributes to lowering the ratio 'computation/data access'.
• The efficient management of the data with respect to the memory hierarchy and to the IC/OOC19
switch. A question recurring in many problems, but which is especially pressing here because of the
heavy computational demands.
• The management of the sparse/dense compromise (for front-based methods) with respect to
memory consumption, accessibility of the data and efficiency of the linear algebra building blocks.
• The choice of a good renumbering: it is an NP-complete problem! For large problems, one cannot
find the optimal renumbering in a 'reasonable' time. One must settle for a 'local' solution.
• The effective management of the propagation of rounding errors via scaling, pivoting and error
analyses of the solution (forward/backward error20 and conditioning).
• The size of the factors, which is often bottleneck number one. Its distribution between processors
(via distributed parallelism) and/or OOC does not always make it possible to overcome this hurdle
(cf. Figure 1.6-1).

17 To minimize the fill-in, an incomplete factorization is used as the preconditioner of these iterative solvers.
18 Among 'MULT_FRONT'/'LDLT'/'MUMPS'/'GCPC'/'PETSC'.
19 IC for In-Core (all the data structures are in RAM) and OOC for Out-Of-Core (some are offloaded to disk).
20 Often referred to by the Anglo-Saxon term: 'forward/backward errors'.

[Figure: left, a cubic test case: N = 0.21M, nnz = 8M (x38), |K^-1| = 302M (METIS, x38); right, an industrial
study: N = 0.8M, nnz = 28M (x35), |K^-1| = 397M (METIS, x15).]

Figure 1.6-1. The 'ball and chain' of sparse direct solvers: the size of the factors.

This figure shows two examples: a canonical test case (a cube) and an industrial study (the RIS pump), with
the following notations: M for millions of terms, N the size of the problem, nnz the number of nonzero terms
of the matrix and K^-1 that of its factors renumbered via METIS. The growth factor from one quantity to the
next is noted in brackets.


2 The MUMPS package
2.1 History
The MUMPS package implements a 'massively parallel' multifrontal method ('MUltifrontal Massively Parallel
sparse direct Solver' [Mum]) developed during the European project PARASOL (1996-1999) by the teams of
three laboratories: CERFACS, ENSEEIHT-IRIT and RAL (I. S. Duff, P. R. Amestoy, J. Koster and
J.-Y. L'Excellent...).

Since this first finalized and public (royalty-free) version (MUMPS 4.0.4, 09/22/1999), about forty other
versions have been delivered. These developments correct anomalies, extend the perimeter of use, improve
the ergonomics and, especially, enrich the features. MUMPS is thus a perennial product, developed and
maintained by about ten people belonging to academic entities distributed between Bordeaux, Lyon and
Toulouse: IRIT, CNRS, CERFACS, INRIA/ENS Lyon and University of Bordeaux I.

Figure 2.1-1. Logos of the principal contributors to MUMPS [Mum].

The product is public and downloadable from its Web site: http://graal.ens-lyon.fr/MUMPS. One counts
approximately 1000 direct users (including about 1/3 in Europe and 1/3 in the USA), not counting those who
use it via the libraries which reference it: PETSc, TRILINOS, Matlab and Scilab... Its site proposes
documentation (theoretical and user), links, examples of application, as well as a newsgroup (in English)
tracing the user feedback on the product (bugs, installation problems, advice...).
Each year about ten algorithmic/software works lead to improvements of the package (theses, post-docs,
research projects...). In addition it is used regularly for industrial studies (EADS, ECA, BOEING, Géosciences
Azur, SAMTECH, Code_Aster...).

Figure 2.1-2. The homepage of the MUMPS Web site [Mum].

Since 2015, a consortium between academic teams, industrial partners and software publishers has been
assembled around the product: the MUMPS consortium [Mum]. It is intended to better ensure its
development, its diffusion and its perenniality.
It is managed by INRIA and gathers, at the end of 2015: EDF, ALTAIR, MICHELIN, LSTC, SIEMENS, ESI,
TOTAL (members) and CERFACS, INPT, INRIA, ENS Lyon and Université de Bordeaux (founding members).

Figure 2.1-3. The homepage of the Web site dedicated to the consortium [Mum].

2.2 Main features

MUMPS implements a multifrontal method [ADE00] [ADKE01] [AGES06] carrying out an L U or L D L^T
factorization (cf. §1.4). Its main features are:

• A broad perimeter of use: SPD, symmetric indefinite, nonsymmetric, real/complex, single/double
precision, regular/singular matrices (this whole perimeter is exploited in Code_Aster);
• Three modes of data distribution: by element, centralized or distributed (the last two modes are
exploited in Code_Aster);
• Interfaces in Fortran (exploited), C, Matlab and Scilab;
• Default parameter settings, with the possibility of letting the package choose some of its options
according to the type of problem, its nature and the number of processors (exploited);
• Modularity (3 distinct, interchangeable phases, cf. §1.5 and Figure 2.2-1) and opening-up of some of
MUMPS's numerical internals. The (very) advanced user can thus extract from the product the result
of certain pre-processings (scaling, pivoting, renumbering), modify them or replace them with others,
and reinject them into the tool's own computation chain;
• Different solution strategies: multiple right-hand sides, simultaneous solves and Schur complements
(only the first two are exploited);
• Different embedded or external renumberers: (Par)METIS, AMD, QAMD, AMF, PORD, (Pt)SCOTCH,
'provided by the user' (all exploited except the last);
• Related features: detection of small pivots/computation of the rank (exploited), computation of the
null space (to be exploited soon), error analysis and improvement of the solution (exploited),
computation of the inertia criterion for the Sturm test in modal computation (exploited), low-rank
compression (exploited); exploitation of the sparsity of the right-hand sides and simultaneous solution
of several linear systems (not exploited), computation of selected terms of the inverse (not exploited),
restart procedure (not exploited yet), handling of 64-bit integers (not exploited yet);
• Pre- and post-processing: scaling, row/column permutation, scalar or 2×2-block pivoting, iterative
refinement (exploited);
• Parallelism: potentially on 2 levels (MPI+OpenMP), asynchronous management of the task/data
flows and their dynamic regrouping, computation/communication overlap, data distribution tied to the
task distribution; this parallelism only starts at the factorization phase, unless one of the parallel
renumberers (PARMETIS or PTSCOTCH) has been selected;
• Memory: offloading (or not) of the factors to disk (In-Core or Out-Of-Core modes) with a preliminary
estimate of the RAM consumption per processor in both cases (exploited).


[Figure: the three MUMPS stages: analysis/renumbering (phases 1 and 2), numerical factorization (phase 3)
and solve (phase 4); input of K, f and the parameters in centralized or distributed form; multiple right-hand
sides; factors K^-1 kept in RAM (In-Core) or offloaded to disk (Out-Of-Core).]

Figure 2.2-1. Functional flowchart of MUMPS: its three stages, in centralized/distributed parallel mode and
IC/OOC.

Note:
• In terms of parallelism, MUMPS exploits two levels (cf. [R6.01.03] §2.6.1): an external one, related
to the concurrent elimination of fronts (via MPI), and an internal one, within each front (via threaded
BLAS or around the low-rank compression algorithms).
• The native multifrontal method of Code_Aster exploits only the second level, and in shared-memory
parallelism (via OpenMP), thus without overlap, dynamic regrouping or distribution of the data
between the processors. On the other hand, through its fine coupling with Code_Aster, it exploits all
the facilities of the JEVEUX memory manager (OOC, restart, diagnostics...) and the modeling
specificities of the code (structural elements, Lagrange multipliers).

2.3 Zoom on some technical points

2.3.1 Pivoting
The pivoting technique consists in choosing a suitable term A^(k)(k,k) (in formula (1.3-3)) to avoid dividing
by a term that is too small (which would amplify the propagation of rounding errors in the computation of the
subsequent terms A^(k+1)(i,j)). For this purpose, one permutes rows (partial pivoting) and/or columns (resp.
total pivoting) so as to find a suitable denominator for (1.3-3). For example, in the case of partial pivoting,
one chooses as 'pivot' the term A^(k)(r,k) such that

  |A^(k)(r,k)| ≥ u · max_i |A^(k)(i,k)|   with u ∈ ]0,1]   (2.3-1)


Figure 2.3-1. Choice of the partial pivot at stage k.

This caps the amplification of rounding errors at (1 + 1/u) at this stage. What matters here is not so much
choosing the largest possible term in absolute value (u = 1) as avoiding choosing the smallest! The inverses
of these pivots also intervene during the forward/backward substitution phase, so both sources of error
amplification must be contained by choosing a median u. MUMPS, like many packages, proposes u = 0.01
by default (MUMPS parameter CNTL(1)).
For pivoting one generally uses scalar diagonal terms, but also blocks of terms (2×2 diagonal blocks).
In MUMPS, two types of pivoting are implemented, one called 'static' (performed during the analysis phase),
the other called 'numerical' (resp. during the numerical factorization). They are tunable and can be activated
separately (cf. MUMPS parameters CNTL(1), CNTL(4) and ICNTL(6)). For SPD or diagonally dominant
matrices, these pivoting capabilities can be disabled without risk (the computation will gain in speed); in the
other cases, on the contrary, they should be activated to handle possible very small or zero pivots. This in
general implies additional fill-in of the factors but increases numerical stability.
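
A minimal sketch of the selection rule (2.3-1) (the tie-break among acceptable candidates, here simply the first one, is a placeholder for a real solver's fill-in heuristics):

```python
import numpy as np

def choose_partial_pivot(A, k, u=0.01):
    """Sketch of rule (2.3-1): in column k, any row r whose entry reaches
    u times the column maximum is an acceptable pivot (u = 0.01 mirrors
    the MUMPS default CNTL(1))."""
    col = np.abs(A[k:, k])
    candidates = np.flatnonzero(col >= u * col.max())
    return k + candidates[0]   # row index of the retained pivot
```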

Note:
• This pivoting functionality makes MUMPS essential for treating certain modelings of Code_Aster
(quasi-incompressible elements, mixed formulations, X-FEM...). At least as long as no other direct
solver including pivoting is available in the code.
• The Aster user does not have direct access to the fine tuning of these pivoting capabilities. They are
activated with the default values. He can just partially disable them by setting
SOLVEUR/PRETRAITEMENTS='SANS' (default='AUTO').
• The additional fill-in due to numerical pivoting must be provisioned as early as possible in MUMPS
(from the analysis phase), by providing, somewhat arbitrarily, a percentage of memory
overconsumption compared to the predicted profile. This figure must be given as a percentage in the
MUMPS parameter ICNTL(14) (accessible to the Aster user via the keyword
SOLVEUR/PCENT_PIVOT, initialized by default to 20%). Later, if this evaluation proves insufficient,
depending on the type of memory management selected (keyword SOLVEUR/GESTION_MEMOIRE),
either the computation stops in ERREUR_FATALE, or the numerical factorization is re-attempted
several times, doubling each time the size of this space reserved for pivoting.
• Certain products restrict their perimeter/robustness by offering no pivoting strategy (SPRSBLKKT,
MULT_FRONT Aster...), others are limited to scalar pivots (CHOLMOD, PaStiX, TAUCS, WSMP...)
or offer particular strategies (Bunch-Kaufman perturbation+correction method for PARDISO,
Bunch-Parlett for SPOOLES...).
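
For illustration, such settings could look as follows in a Code_Aster command file (a sketch only: the operands MODELE, CHAM_MATER and EXCIT are the usual surroundings of the operator and are shown merely to situate the SOLVEUR factor; availability and defaults of the keywords depend on the code version):

```python
# Illustrative Code_Aster command-file excerpt: selecting MUMPS and
# sizing the pivoting workspace discussed above.
RESU = STAT_NON_LINE(
    MODELE=MO,
    CHAM_MATER=CHMAT,
    EXCIT=_F(CHARGE=CHAR),
    SOLVEUR=_F(
        METHODE='MUMPS',         # direct multifrontal solver
        PCENT_PIVOT=20,          # % of memory reserved for pivoting (ICNTL(14))
        PRETRAITEMENTS='AUTO',   # scaling/permutation pre-processing
        GESTION_MEMOIRE='AUTO',  # In-Core/Out-Of-Core policy
    ),
)
```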

2.3.2 Iterative refinement

At the end of the solve, having obtained the solution u of the problem, its residual r := Ku − f can easily be
evaluated. The matrix being already factored, this residual can then feed, at little expense, the following
iterative improvement process (in the general nonsymmetric case):

  Loop over i:
  (1) r_i = f − K u_i
  (2) L U δu_i = r_i   (2.3-2)
  (3) u_{i+1} ⇐ u_i + δu_i

This process is 'painless'21 since it mainly costs only the price of the forward/backward substitution of step
(2). It can thus be repeated until a certain threshold is reached or up to a maximum number of iterations. If
the computation of the residual does not introduce too much rounding error, i.e. if the solution algorithm is
rather reliable (cf. following paragraph) and if the conditioning of the matrix system is good, this iterative
refinement22 process is very beneficial to the quality of the solution.

21 This is true when MUMPS operates with In-Core memory management and sequentially. On the other hand, when the data are
distributed between the processors and between RAM and disk (parallelism and Out-Of-Core activated), this step can be somewhat
expensive.
22 One also speaks of 'iterative improvement' ('iterative refinement').
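
A minimal sketch of the loop (2.3-2) on top of a sparse LU factorization (SciPy here stands in for MUMPS; the stopping threshold is an arbitrary choice):

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

def solve_with_refinement(K, f, n_err=10, threshold=1e-14):
    """Sketch of loop (2.3-2): one factorization, then cheap corrections."""
    lu = splu(csc_matrix(K))
    u = lu.solve(f)
    for _ in range(n_err):               # bounded number of iterations
        r = f - K @ u                    # step (1): residual
        if np.linalg.norm(r) <= threshold * np.linalg.norm(f):
            break                        # converged: residual small enough
        u = u + lu.solve(r)              # steps (2)+(3): LU du = r; u += du
    return u
```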

In MUMPS this process can be activated or not (parameter ICNTL(10)) and is limited to a maximum number of iterations N_err (ICNTL(10)). The process (2.3-2) continues as long as the “scaled residual” Berr remains above an adjustable threshold (CNTL(2), fixed by default at √ε with ε the machine precision):

  Berr := max_j  |r_i|_j / (|K| |u_i| + |f|)_j  ≥  threshold        (2.3-3)

or as long as it does not decrease by a factor of at least 5 (not adjustable). In general, one or two iterations suffice. If this is not the case, it often reveals other problems: poor conditioning or a large backward error (cf. following paragraph).
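
As an illustration, here is a minimal Python sketch of the scheme (2.3-2)/(2.3-3) with a dense LU from SciPy; it only mimics, on a toy problem, what MUMPS does internally on sparse factors:

  import numpy as np
  from scipy.linalg import lu_factor, lu_solve

  def refine(K, f, lu_piv, u, n_max=10, threshold=1e-14):
      # Iterative refinement (2.3-2), reusing the LU factors of K.
      for _ in range(n_max):
          r = f - K @ u                                   # step (1): residual
          berr = np.max(np.abs(r) / (np.abs(K) @ np.abs(u) + np.abs(f)))
          if berr <= threshold:                           # stopping test (2.3-3)
              break
          u = u + lu_solve(lu_piv, r)                     # steps (2)+(3)
      return u

  K = np.array([[4.0, 1.0], [1.0, 3.0]])
  f = np.array([1.0, 2.0])
  lu_piv = lu_factor(K)                                   # factorize once
  u = refine(K, f, lu_piv, lu_solve(lu_piv, f))           # first solve, then refine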

Note:
• For the Code_Aster user these MUMPS parameters are not directly accessible. The functionality is activated or not via the keyword POSTTRAITEMENTS.
• This functionality is present in many packages: Oblio, PARDISO, UMFPACK, WSMP…

2.3.3 Reliability of calculations


To estimate the quality of the solution of a linear system [ADD89][Hig02], MUMPS proposes numerical tools derived from the backward error analysis of rounding errors initiated by Wilkinson (1960). In this theory, the rounding errors due to several factors (truncation, finite-precision arithmetic…) are treated as perturbations of the initial data. This makes it possible to compare them with other sources of error (measurement, discretization…) and to handle them more easily via three indicators obtained in postprocessing:
• The conditioning cond(K,f): it measures the sensitivity of the problem to its data (unstable, badly formulated/discretized problem…), i.e. the multiplicative factor that a perturbation of the data will induce on the result. To improve it, one can try to change the formulation of the problem or to scale the terms of the matrix, outside MUMPS or via MUMPS (SOLVEUR/PRETRAITEMENTS='OUI' in Code_Aster).
• The backward error be(K,f): it measures the propensity of the resolution algorithm to transmit/amplify rounding errors. An algorithm is said to be “reliable” when this figure is close to the machine precision. To improve it, one can try to change the resolution algorithm or to modify one or more of its steps (in Code_Aster one can play with the parameters SOLVEUR/TYPE_RESOL, PRETRAITEMENTS and RENUM).
• The forward error fe(K,f): it is the product of the two preceding figures and provides an upper bound on the relative error of the solution:

  ‖δu‖/‖u‖ ≤ cond(K,f) × be(K,f) =: fe(K,f)        (2.3-4)

One can illustrate these concepts (cf. figure 2.3-2) by expressing the backward error as the gap between “the initial data and the perturbed data”, while the forward error measures the gap between “the exact solution and the solution actually obtained” (that of the problem perturbed by the rounding errors).

Figure 2.3-2. Illustration of the concepts of forward and backward errors.

Within the framework of linear systems, the backward error is measured via the scaled residual

  be(K,f) := max_{j∈J}  |f − Ku|_j / (|K||u| + |f|)_j        (2.3-5)

One cannot always evaluate it on all the indices (J ≠ [1,N]). In particular, when the denominator is very small (and the numerator nonzero), one prefers the formulation (with J* such that J ∪ J* = [1,N])

  be*(K,f) := max_{j∈J*}  |f − Ku|_j / ( (|K||u|)_j + ‖K_j.‖_∞ ‖u‖_∞ )        (2.3-6)

where K_j. represents the j-th row of the matrix K. With these two indicators one associates two estimates of the matrix conditioning (one related to the set of rows J and the other to its complement J*): cond(K,f) and cond*(K,f). The theory then provides the following results:
• The approximate solution u is the exact solution of the perturbed problem

  (K + δK) u = (f + δf)
  with |δK_ij| ≤ max(be, be*) |K_ij|        (2.3-7)
  and  |δf_i| ≤ max( be·|f_i| , be*·‖K_i.‖_∞ ‖u‖_∞ )

• One has the following bound (via the forward error fe(K,f)) on the relative error of the solution:

  ‖δu‖/‖u‖ ≤ cond × be + cond* × be* =: fe(K,f)        (2.3-8)

In practice, one monitors especially this last estimate fe(K,f) and its components. Its order of magnitude roughly indicates the number of “true” decimal digits of the computed solution. For badly conditioned problems, a tolerance of 10^-3 is not rare, but it must be taken seriously because this kind of pathology can seriously disturb a calculation.
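
As an illustration, a minimal numpy sketch (dense matrix for clarity; the cutoff used to split J from J* is an arbitrary choice of this example) of the two scaled residuals (2.3-5)/(2.3-6):

  import numpy as np

  def be_and_be_star(K, u, f):
      # Componentwise backward errors of an approximate solution u of Ku=f.
      r = np.abs(f - K @ u)
      d = np.abs(K) @ np.abs(u) + np.abs(f)
      J = d > 1e-10 * d.max()                          # heuristic split J / J*
      row_norm = np.abs(K).sum(axis=1)                 # ||K_j.||_inf of each row
      d_star = np.abs(K) @ np.abs(u) + row_norm * np.max(np.abs(u))
      be = (r[J] / d[J]).max(initial=0.0)              # (2.3-5) on J
      be_star = (r[~J] / d_star[~J]).max(initial=0.0)  # (2.3-6) on J*
      return be, be_star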

Even within the very precise framework of the resolution of linear systems, there exist many ways to define the sensitivity of the problem to rounding errors (i.e. its conditioning). The one retained by MUMPS, which is a reference in the field (cf. Arioli, Demmel and Duff 1989), is inseparable from the ‘backward error’ of the problem: the definition of one has no meaning without that of the other. One should thus not confuse this kind of conditioning with the classical notion of matrix conditioning.
In addition, the conditioning provided by MUMPS takes into account the RIGHT-HAND SIDE of the system as well as the SPARSITY of the matrix. Indeed, there is no point in accounting for possible rounding errors on null matrix terms that are therefore not provided to the solver! The corresponding degrees of freedom “do not speak to each other” (from the finite element point of view). Thus, this MUMPS conditioning respects the physics of the discretized problem: it does not plunge the problem back into the too-rich space of full matrices.
Hence, the conditioning figure displayed by MUMPS is much less pessimistic than the standard one that another product may provide (Matlab, Python…). But let us hammer home that only its product with the ‘backward error’, called the ‘forward error’, is of interest, and only within the framework of a linear system resolution via MUMPS.

Note:
• This analysis of the quality of the solution is not limited to linear solvers. It is also available, for example, for modal solvers [Hig02].
• In MUMPS, the estimators fe, be, be*, cond and cond* are accessible via, respectively, the variables RINFO(7/9/8/10 and 11). These postprocessings are a little expensive (up to ~10% of the computation time) and can thus be deactivated (via ICNTL(11)).
• For the Code_Aster user these MUMPS parameters are not directly accessible. They are displayed in a specific insert of the message file (labelled “monitoring MUMPS”) if the keyword INFO=2 is set in the operator. In addition, this functionality is activated only if the user chooses to estimate and
test the quality of the solution via the parameter SOLVEUR/RESI_RELA. Depending on the Aster operator, this parameter is by default deactivated (negative value) or fixed at 10^-6. When it is activated (positive value), one tests whether the forward error fe(K,f) is indeed lower than RESI_RELA. If this is not the case, the computation stops in ERREUR_FATALE, specifying the nature of the problem and the offending values.
• The activation of this functionality is not essential (but often useful) when the computed solution is itself corrected by another algorithmic process (Newton algorithm, Newmark scheme): in short, in the operators THER_LINEAIRE, MECA_STATIQUE, STAT_NON_LINE, DYNA_NON_LINE…
• This kind of functionality seems rare in the usual libraries: LAPACK, NAG, HSL…

2.3.4 Memory management (In-Core versus Out-Of-Core)


It was seen (cf. §1.7) that the major drawback of direct methods lies in the size of the factor. To make it possible to handle larger systems in RAM, MUMPS proposes to offload this object to disk: this is the Out-Of-Core mode (keyword GESTION_MEMOIRE='OUT_OF_CORE'), in opposition to the In-Core mode (keyword GESTION_MEMOIRE='IN_CORE') where all the data structures reside in RAM (cf. figures 2.2-1 and 2.3-3). This RAM-saving mode is complementary to the distribution of data that parallelism naturally induces. The benefit of the OOC is thus especially significant for moderate numbers of processors (<32 processors).
In addition, the MUMPS team paid close attention to the CPU overhead generated by this practice. By reworking the algorithms handling the offloaded entities, they were able to limit these overheads to the strict minimum (a few percent, mostly in the solve phase).

Figure 2.3-3. Two standard types of memory management: entirely in RAM (‘IN_CORE’) and RAM/disk (‘OUT_OF_CORE’).

These two memory management modes operate “without a safety net”: no correction will be applied afterwards in the event of a problem. If one does not know a priori which of these two modes to choose, and wants to limit as far as possible the problems due to lack of memory, one can choose the automatic mode GESTION_MEMOIRE='AUTO'. Heuristics internal to Code_Aster then manage the memory contingencies of MUMPS alone, according to the computing environment (machine, parallelism) and the numerical difficulties of the problem.
In the same vein, another option of the same keyword, GESTION_MEMOIRE='EVAL', makes it possible to gauge the needs of a calculation by displaying in the message file the memory resources required by the Code_Aster+MUMPS computation.

***************************************************************************
- Size of the linear system: 500000

- Minimal RAM memory consumed by Code_Aster: 200 Mo

- Estimate of the MUMPS memory with GESTION_MEMOIRE='IN_CORE': 3500 Mo
- Estimate of the MUMPS memory with GESTION_MEMOIRE='OUT_OF_CORE': 500 Mo
- Estimate of the disk space for MUMPS with GESTION_MEMOIRE='OUT_OF_CORE': 2000 Mo

===> For this calculation, one thus needs a quantity of RAM memory of at least
- 3500 Mo if GESTION_MEMOIRE='IN_CORE',
- 500 Mo if GESTION_MEMOIRE='OUT_OF_CORE'.
In case of doubt, use GESTION_MEMOIRE='AUTO'.
******************************************************************************

Figure 2.3-4. Extract of the message file with GESTION_MEMOIRE='EVAL'.
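
A minimal sketch of such a sizing run (the concept names modele, chmat and charge are hypothetical; the solver setting is the only relevant part):

  resu = MECA_STATIQUE(MODELE=modele, CHAM_MATER=chmat,
                       EXCIT=_F(CHARGE=charge),
                       SOLVEUR=_F(METHODE='MUMPS',
                                  GESTION_MEMOIRE='EVAL'))  # prints estimates, cf. figure 2.3-4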

Note:
• The MUMPS parameters ICNTL(22)/ICNTL(23) allow these memory options to be managed. The Aster user activates them indirectly via the keyword SOLVEUR/GESTION_MEMOIRE.
• The offloading to disk is entirely controlled by MUMPS (number of files, offload/reload frequency…). One only specifies the location: it is quite naturally the working directory of the Aster executable of each processor (%OOC_TMPDIR='.'). These files are automatically erased by MUMPS when the associated MUMPS occurrence is destroyed. This avoids clogging the disk when various systems are factorized in the same resolution.
• Other OOC strategies are possible, and even already coded in certain packages (PaStiX, Oblio, TAUCS…). One thinks in particular of being able to modulate the perimeter of the offloaded objects (cf. the analysis phase, sometimes expensive in RAM) and of being able to re-use them on disk during another execution (cf. POURSUITE in the Aster sense, or partial factorization).

2.3.5 Management of singular matrices


One of the great strong points of the product is its management of singularities. Not only is it able to detect the numerical singularities23 of a matrix and to synthesize this information for external use (rank computation, warning to the user, display of diagnostics…), but moreover, in spite of this difficulty, it computes a “regular” solution24 and even all or part of the associated kernel.

These developments were one of the deliverables of the ANR SOLSTICE [SOL]. We had asked the MUMPS team (in partnership with the Algorithms team of CERFACS) to make the product iso-functional compared to the other direct solvers of Code_Aster.

And in practice, how does MUMPS proceed?


Broadly speaking, during the construction of the factor, it detects the rows containing very small pivots25 (compared to the criterion CNTL(3)26). It indexes them in the vector PIVNUL_LIST(1:INFOG(28)) and, depending on the case, either replaces them by a prescribed value (via CNTL(5)27) or stores them apart. The block thus formed (small, moreover) will later undergo an ad hoc QR algorithm. And to finish, the iterative refinement iterations complete this scheme. As they use this “improved” factor only as a preconditioner, while benefiting from the exact information of the matrix-vector product, they bring the “biased” solution back28 in the right direction!

Note:
• The MUMPS parameters ICNTL(13)/ICNTL(24)/ICNTL(25) and CNTL(3)/CNTL(5) allow these features to be activated. They are not modifiable by the user. Out of prudence, the functionality is kept permanently activated.
• This functionality can also prove useful in modal computation (filtering of rigid body modes).

2.3.6 ‘Block Low-Rank’ (BLR) compression


This compression technique aims at facilitating large studies by reducing their time and memory costs. It is complementary to parallelism and to the algorithmic/functional panoply of the product. Its perimeter of use

23 The so-called numerical singularities are determined up to a numerical precision, contrary to the so-called exact or true singularities.
24 It is a possible solution of the problem as long as the right-hand side f belongs to (ker K^T)^⊥, which in our symmetric case amounts to f being an element of the image space.
25 Strictly speaking, it involves the infinity norm of the row of the working matrix containing the pivot.
26 By default it is fixed at 10^-8 (in double precision) and 10^-4 (in single precision), because these figures represent (empirically) a loss of at least half of the precision level if the factorization is nevertheless continued.
27 This value must be large enough to limit the impact of this modification on the rest of the factorization. In Code_Aster/Code_Carmel3D, it is fixed at 10^6 ‖K_work‖.
28 It is the same mechanism as for static pivoting.

is almost complete: one does not have to choose between parallelism, this or that numerical refinement and BLR compression! All these features are compatible and their gains often accumulate29.
In the manner of the mp3, zip or pdf formats of our domestic and office uses, these compressions make it possible to reduce, with few losses, the expensive stages of MUMPS30. And this approximation generally disturbs neither the precision nor the behavior of the enclosing mechanical computations.
It is however worthwhile only on problems of large size (N at least > 2.10^6 dof): since these compressions imply an overhead, one should compress only storage blocks that are sufficiently large, and thus likely to quickly compensate for this overhead.
The gains observed on some Code_Aster studies vary from 20% to 80% (cf. figures 7.3-5/7.3-6). They increase with the size of the problem and its massive character.

Figure 7.3-5: Example of gains obtained by low-rank compressions on the benchmark case perf008d (default parameters, OOC memory management, N=2M, NNZ=80M, Facto_METIS4=7495M, conditioning=10^7). One plots, as a function of the number of activated MPI processes, the elapsed time spent in the whole linear system resolution stage in Code_Aster v13.1, its RAM peak, as well as the acceleration factor obtained with BLR.

Figure 7.3-6: Example of gains obtained by low-rank compressions on the benchmark case perf009d (default parameters, OOC memory management, N=5.4M, NNZ=209M, Facto_METIS4=5247M, conditioning=10^8). One plots, as a function of the number of activated MPI processes, the elapsed time spent in the whole linear system resolution stage in Code_Aster v13.1, its RAM peak, as well as the acceleration factor obtained with BLR.

29 On the other hand these gains vary according to the numerical/software context: reorderer, number of MPI processes…
30 For the moment compression concerns only certain stages of the numerical factorization (not the forward/backward substitution). These gains must thus, at minimum, compensate for a slight overhead in the preliminary analysis stage as well as the compression/decompression costs at the beginning and end of the numerical factorization stage. For the moment they bring only time savings (no gains on the RAM peak).

In brief, this strategy seeks to compress the largest storage blocks handled by the multifrontal method of MUMPS: the large dense blocks of its elimination tree. This technique rests on the assumption (often verified empirically) that it is possible to renumber31 the variables within these dense blocks in order to reveal a more advantageous matrix structure: to decompose these blocks as the product of two much smaller dense matrices.
The objective is thus to decompose the large dense matrices A (of this elimination tree), for example of size m × n, in the form

  A = U.V^T + E        (7.3-9)

with U and V much smaller matrices (respectively m × k and k × n with k < min(m,n)) and E a “negligible” matrix (‖E‖ ≤ ε).
During later manipulations MUMPS then makes the approximation

  A ≈ U.V^T        (7.3-10)

betting that it will have little impact on the quality of the result (thanks to iterative refinement or to the enclosing nonlinear algorithms) and on the related ‘outputs’ of the solver (singularity detection, determinant computation, Sturm criterion).
And this is verified most of the time, as long as the compression parameter ε is rather small (ε < 10^-9). If it is larger, the approximation can no longer be neglected and MUMPS can no longer be used as an “exact” direct solver: only as a “relaxed” direct solver (in nonlinear analyses) or as a preconditioner (for example for the GCPC of Code_Aster or the Krylov solvers of PETSc).
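
To make (7.3-9) concrete, here is a minimal numpy sketch that builds the U.V^T form of a dense block via a truncated SVD; this is only one way to obtain such a factorization, MUMPS relying internally on its own, cheaper BLR compression kernels:

  import numpy as np

  def compress_block(A, eps=1e-9):
      # Truncated SVD: keep the singular values above eps (relative), so that
      # A = U.V^T + E with ||E|| ~ eps * ||A||, as in (7.3-9).
      U, s, Vt = np.linalg.svd(A, full_matrices=False)
      k = max(1, int(np.sum(s > eps * s[0])))
      return U[:, :k] * s[:k], Vt[:k, :]

  # A smooth kernel gives a numerically low-rank block:
  x, y = np.linspace(0, 1, 200), np.linspace(0, 1, 300)
  A = np.exp(-np.add.outer(x, y))
  U, Vt = compress_block(A)
  rel_err = np.linalg.norm(A - U @ Vt) / np.linalg.norm(A)   # ~ eps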

This functionality, now available in the consortium versions of the product, results from an EDF-INPT partnership (2010-13) around the thesis work of C.Weisbecker [CW13/15], which was awarded the Léopold Escande prize for the year 2014. For more details on the technical aspects of this work one can consult the thesis document itself or the summary provided in appendix n°1 of this document.

Note:
• Depending on the MUMPS version, various parameters allow this compression strategy to be managed. The Code_Aster user activates them indirectly via the keywords ACCELERATION/LOW_RANK_SEUIL.
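
A sketch of the corresponding solver setting (the value 'LR' and the threshold below are illustrative assumptions; see [U4.50.01] for the admissible values):

  SOLVEUR=_F(METHODE='MUMPS',
             ACCELERATION='LR',        # activate Block Low-Rank compression
             LOW_RANK_SEUIL=1.e-9,     # compression tolerance, cf. (7.3-9)
             POSTTRAITEMENTS='MINI')   # cheap solution check, cf. table 3.6-1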

31 Following certain criteria specific to the BLR.



3 Implementation in Code_Aster
3.1 Context/synthesis
To improve the performance of the computations carried out, the strategy retained by Code_Aster [Dur08], as by most large general-purpose structural mechanics codes, consists in particular in diversifying its panel of linear solvers32 in order to better target the needs and constraints of the users: local machine, cluster or computing centre; memory and disk footprint; CPU time; industrial or more exploratory study…
Regarding parallelism and linear solvers, one path is particularly explored33:
• “numerical parallelism” via external solver libraries such as MUMPS and PETSc, possibly supplemented by a “computer-science parallelism” (internal to the code) for the elementary computations and the matrix/vector assemblies;

We are interested here in the first scenario, through MUMPS. This external solver is “plugged” into Code_Aster and accessible to users since v8.0.14. It enables us to profit, “at lower cost”, from the experience feedback of a broad community of users and from the very sharp skills of international teams, while combining effectiveness, performance, reliability and a broad perimeter of use.
This work was initially completed by exploiting the sequential In-Core mode of the product. In particular, thanks to its pivoting capabilities, it does invaluable favours in treating new modelings (quasi-incompressible elements, X-FEM…) which can prove problematic for the other linear solvers.
Since then, MUMPS has been used daily in studies [GM08][Tar07][GS11]. Our experience feedback has of course grown and we maintain an active partnership relation with the MUMPS development team (in particular via the ANR SOLSTICE [SOL] and an ongoing thesis). In addition, its integration in Code_Aster benefits from a continuous enrichment: centralized IC parallelism [Des07] (since v9.1.13), distributed IC parallelism [Boi07] (since v9.1.16), then IC and OOC modes [BD08] (since v9.3.14).
In distributed parallel mode, the use of MUMPS yields CPU gains (compared to the default method of the code) of around a dozen on 32 processors of the Aster machine. On very favorable cases this result can be much better and, for “frontier studies”, MUMPS sometimes remains the only viable alternative (cf. reactor vessel internals [Boi07]).
As for RAM consumption, one saw in the preceding chapters that it is the principal weakness of direct solvers. Even in parallel mode, where the data are nevertheless naturally distributed between the processors, this factor can prove handicapping. To overcome this problem it is possible to activate in Code_Aster a recent functionality of MUMPS (developed within the framework of the above-mentioned ANR): the “Out-Of-Core” (OOC) mode, as opposed to the default “In-Core” (IC). It makes it possible to reduce this bottleneck by offloading to disk a good amount of data. Thanks to the OOC, one can thus approach the RAM consumption of the native multifrontal of Code_Aster (even in sequential mode), or even go below it by combining the effects of parallelism and this offloading to disk. The first tests show a RAM gain between OOC and IC of at least 50% (even more in favorable cases) for a limited CPU overhead (<10%).
The MUMPS solver thus makes it possible not only to solve numerically difficult problems but also, inserted in an Aster computing process already partially parallel, to leverage its performance. It provides the code with an efficient, robust and general-purpose parallel framework. It thus facilitates the turnaround of standard studies (< a million degrees of freedom) and makes the treatment of large cases (several million degrees of freedom) available to the greatest number.

3.2 Two types of parallelism: centralized and distributed


3.2.1 Principle
MUMPS is a parallel linear solver. Its parallelism can be activated in several ways, in particular according to the sequential or parallel character of the code which uses it. Thus, parallelism can be limited to the internal data/processing flows of MUMPS or, a contrario, be integrated into a parallel data/processing flow already organized upstream of the solver, from the elementary computation phases of Code_Aster onwards. The first mode

32 This continuous search for performance improvement is obviously not reduced to the linear solvers alone. The code proposes a good number of tools answering the same objectives: distribution of independent computations, X-FEM, improvement of contact-friction, the ODE/modal/nonlinear solvers, adaptive meshing, structural finite elements…
33 Cf. [R6.01.03] for a detailed vision of the potential parallelization strategies and of those actually implemented in the code.

(‘CENTRALISE’) has in its favour robustness and a broader perimeter of use; the second family (‘GROUP_ELEM’/‘MAIL_***’ and ‘SOUS_DOMAINE’) is generic and more effective.
Indeed, the phases of a simulation that are often the most expensive in CPU time are: the construction of the linear system (purely Code_Aster, split into three stations: symbolic factorization, elementary computations and matrix/vector assemblies) and its resolution (in MUMPS, cf. §1.6: reordering + analysis, numerical factorization and forward/backward substitution). The first parallelization mode benefits only from the parallelism of stages 2 and 3 of MUMPS, whereas the three others also parallelize the elementary computations and the assemblies of Code_Aster (cf. figures 2.2-1 and 3.2-1).

Figure 3.2-1. Data/processing flows of centralized and distributed MUMPS: construction of Ku=f in Code_Aster (elementary computations, assemblies, symbolic factorization), then resolution in MUMPS (analysis, numerical factorization, forward/backward substitution).

3.2.2 Various modes of distribution


This paragraph describes the various choices of distribution of the elementary computations to the processors taking part in the computation. This distribution is set during the construction of the finite element model.
A first choice consists in not distributing the elementary computations between processors. If one wishes to distribute them, one can rely on a distribution of the finite elements or on a distribution of the meshes (independently of the finite elements carried by these meshes). One can finally combine distribution by finite element and by mesh. Let us detail each mode (a command-file sketch follows the list):

• ‘CENTRALISE’: The meshes are not distributed (as in sequential mode) and the elementary computations are not parallelized; parallelism starts only at the level of MUMPS. Each processor builds and provides to the linear solver the entirety of the system to be solved. This mode of use is useful for non-regression tests. In all the cases where the elementary computations represent a weak share of the total time (e.g. in linear elasticity), this option can be sufficient.
• ‘GROUP_ELEM’: A second type of distribution consists in setting up homogeneous groups of finite elements (by type of finite element), then distributing the elementary computations between the processors to balance the load as well as possible (in terms of number of elementary computations of each type). Each processor allocates the whole matrix but carries out and assembles only the elementary computations which were allotted to it.
• ‘MAIL_DISPERSE/MAIL_CONTIGU’:

Warning : The translation process used on this website is a "Machine Translation". It may be imprecise and inaccurate in whole or in part
and is provided as a convenience.
Copyright 2020 EDF R&D - Licensed under the terms of the GNU FDL (http://www.gnu.org/copyleft/fdl.html)
Version
Code_Aster default
Titre : Généralités sur les solveurs linéaires directs et [...] Date : 09/12/2019 Page : 26/44
Responsable : BOITEAU Olivier Clé : R6.02.03 Révision :
0ac07fed0038

The meshes of the model are distributed either by packages of contiguous meshes (‘MAIL_CONTIGU’) or by cyclic distribution (‘MAIL_DISPERSE’). This distribution is independent of the type of finite element carried by the meshes. For example, with a model comprising 8 meshes and a computation on 4 processors, one obtains the following load distributions:

  Distribution mode   Mesh 1   Mesh 2   Mesh 3   Mesh 4   Mesh 5   Mesh 6   Mesh 7   Mesh 8
  MAIL_CONTIGU        Proc. 0  Proc. 0  Proc. 1  Proc. 1  Proc. 2  Proc. 2  Proc. 3  Proc. 3
  MAIL_DISPERSE       Proc. 0  Proc. 1  Proc. 2  Proc. 3  Proc. 0  Proc. 1  Proc. 2  Proc. 3

Each processor allocates the whole matrix but carries out and assembles only the elementary computations which were allotted to it.

• ‘SOUS_DOMAINE’ (default): This distribution is a hybrid distribution of the elementary computations which rests on a distribution of the meshes (from a partitioning of the total mesh into subdomains), then on a distribution of the finite elements by type inside each subdomain. Each processor allocates the whole matrix but carries out and assembles only the elementary computations which were allotted to it.
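
The announced sketch, assuming the DISTRIBUTION keyword of AFFE_MODELE (as in recent versions; the mesh and model names are hypothetical):

  modele = AFFE_MODELE(MAILLAGE=mail,
                       AFFE=_F(TOUT='OUI', PHENOMENE='MECANIQUE',
                               MODELISATION='3D'),
                       DISTRIBUTION=_F(METHODE='SOUS_DOMAINE'))  # or 'CENTRALISE',
                                                                 # 'GROUP_ELEM', 'MAIL_...'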

3.2.3 Load balancing


The distribution by meshes is very simple but can lead to load imbalances, because it does not explicitly take into account spectator meshes, skin meshes (cf. figure 3.2-2 on an example comprising 8 volume meshes and 4 skin meshes) or particular zones (nonlinearities…). The distribution by subdomains is more flexible and can prove more effective, by making it possible to adapt the data flow to the simulation.
Another cause of imbalance can come from the Dirichlet conditions by dualization (DDL_IMPO, LIAISON_***…). For robustness reasons, their treatment is assigned only to the main processor. This extra work, often negligible, however introduces in certain cases a more marked imbalance. The user can compensate for it by setting one of the keywords CHARGE_PROC0_MA/SD. This differentiated treatment in fact concerns all the cases implying so-called “late” meshes (Dirichlet via Lagrange multipliers, but also nodal forces, continuous contact method…).
For more details on the software specifications and the functional implications of this parallelism mode one can consult documentations [U2.08.03] and [U4.50.01].

Note:
• Without the option MATR_DISTRIBUEE (cf. following paragraph), the different strategies are equivalent in terms of memory occupation. The flow of data and instructions is pruned as early as possible: each processor selectively treats its matrix/vector blocks of the total problem, which MUMPS will then gather.
• In distributed mode, each processor thus handles only partially filled matrices. On the other hand, in order to avoid introducing too many MPI communications into the code (stopping criteria, residuals…), this scenario was not retained for the right-hand sides. Their construction is indeed parallelized but, at the end of the assembly, the contributions of all the processors are summed and broadcast to all. Thus all the processors know in full the vectors involved in the computation.
• In the same way, the matrix is for the moment duplicated: in the JEVEUX space (RAM or disk) and in the F90 space of MUMPS (RAM). In the long term, because of the offloading to disk of the factor, it will become a dimensioning RAM object. It will thus have to be built directly in MUMPS.

3.2.4 Recutting the Code_Aster objects


In parallel mode, when the JEVEUX data are distributed upstream of MUMPS, the data structures concerned are not necessarily recut. With the option MATR_DISTRIBUEE='NON', all the distributed objects are allocated and initialized with the same size (the same value as in sequential mode). On the other hand, each processor modifies only the parts of the JEVEUX objects for which it is responsible. This scenario is particularly adapted to the distributed parallel mode of MUMPS (the default mode) because this product gathers these
incomplete data flows internally. Parallelism then makes it possible, in addition to the savings in computation time, to reduce the memory footprint required by the MUMPS resolution, but not the one necessary for the construction of the problem in JEVEUX. This is not awkward as long as the JEVEUX RAM space remains much lower than the one necessary for MUMPS. As JEVEUX mainly stores the matrix, and MUMPS its factor (generally tens of times larger), the RAM bottleneck of the computation is theoretically on the MUMPS side. But as soon as one uses a few tens of MPI processes and/or the OOC is activated, since MUMPS then distributes this factor per processor and offloads these pieces to disk, “the ball returns to JEVEUX's court”.
Hence the option MATR_DISTRIBUEE, which recuts the matrix to just the nonzero terms for which the processor is responsible. The JEVEUX space required then decreases with the number of processors and goes below the RAM necessary for MUMPS. The results of figure 3.2-2 illustrate this gain in parallel on two studies: an RIS pump and a tank model of the “Epicure” study.

Figure 3.2-2. Evolution of the RAM consumption (in GB) as a function of the number of processors, Code_Aster v11.0 (standard JEVEUX, MATR_DISTRIBUEE='NON', and distributed, resp. 'OUI') and MUMPS OOC. Results obtained on an RIS pump and on the tank of the Epicure study.

Note:
• One treats here the data resulting from an elementary computation (RESU_ELEM and CHAM_ELEM) or from a matrix assembly (MATR_ASSE). Assembled vectors (CHAM_NO) are not distributed because the induced memory gains would be weak and, moreover, as they intervene in the evaluation of many algorithmic criteria, distributing them would imply too many additional communications.
• In MATR_DISTRIBUEE mode, to make the link between the local MATR_ASSE of each processor and the total MATR_ASSE (which is not built), one adds an indirection vector in the form of a local NUME_DDL.
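
A sketch of the corresponding solver setting (to be combined with a distributed model, cf. §3.2.2):

  SOLVEUR=_F(METHODE='MUMPS',
             MATR_DISTRIBUEE='OUI')   # recut the JEVEUX matrix per MPI process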

3.3 Management of the MUMPS and Code_Aster memory


To activate or deactivate the OOC capabilities of MUMPS (cf. figure 3.3-1), the user sets the keyword SOLVEUR/GESTION_MEMOIRE='IN_CORE'/'OUT_OF_CORE'/'AUTO' (default). This functionality is of course combinable with parallelism, hence a larger variety of operating modes for adapting, if required, to the execution contingencies: “sequential IC or OOC”, “centralized parallelism IC or OOC”, “distributed parallelism by subdomains IC or OOC”…
For a small linear case, the sequential “IC” mode is enough; for a larger case, still linear, the “centralized IC” (or better, OOC) parallel mode truly brings a gain in CPU and RAM; in nonlinear analyses, with frequent reactualization of the tangent matrix, the “distributed OOC” parallel mode is advised.
For more details on the software specifications and the functional implications of this MUMPS memory management one can consult documentations [BD08] and [U2.08.03/U4.50.01].


Figure 3.3-1. Functional diagram of the Code_Aster/MUMPS coupling (with a sequential reorderer) with respect to the principal data structures and the memory occupation (RAM and disk): the JEVEUX space on the Aster side (CHAM_ELEM, MATR_ELEM/VECT_ELEM, NUME_DDL/CHAM_NO/MATR_ASSE), the F90 space on the MUMPS side (IRN(_LOC), JCN(_LOC), A(_LOC) and RHS), with the factor K^-1 kept in RAM (IC) or offloaded to disk (OOC), in centralized or distributed mode.

3.4 Particular management of the double Lagrange multipliers


Historically, the direct linear solvers of Code_Aster (‘MULT_FRONT’ and ‘LDLT’) did not have a pivoting algorithm (which seeks to avoid accumulations of rounding errors through divisions by very small terms). To circumvent this problem, the taking into account of the boundary conditions by Lagrange multipliers (AFFE_CHAR_MECA/THER…) was modified by introducing double Lagrange multipliers. Formally, one does not work with the initial matrix K0

        [ K        blocage^T ]  u
  K_0 = [ blocage  0         ]  lagr

but with its doubly dualized form K2

        [ K        blocage^T  blocage^T ]  u
  K_2 = [ blocage  -1         1         ]  lagr1
        [ blocage  1          -1        ]  lagr2

Hence a memory and computation overhead.

Since MUMPS has pivoting capabilities, this choice of dualization of the boundary conditions can be called into question. By initializing the keyword ELIM_LAGR to ‘LAGR2’, only one Lagrange multiplier is taken into account, the other being a spectator34. Hence a simply dualized working matrix K1
34 To maintain the coherence of the data structures and to keep a certain legibility/software maintainability, it is preferable to “trick” the usual process by passing from K2 to K1, rather than to the optimal scenario K0.

        [ K        blocage^T  0  ]  u
  K_1 = [ blocage  0          0  ]  lagr1
        [ 0        0          -1 ]  lagr2

which is smaller, because the extra-diagonal terms of the rows and columns associated with these spectator Lagrange multipliers are initialized to zero. A contrario, with the value ‘NON’, MUMPS receives the usual dualized matrices.

For problems comprising many Lagrange multipliers (up to 20% of the total number of unknowns), the activation of this parameter often pays off (smaller matrix). But when this number explodes (>20%), the process can become counter-productive: the gains achieved on the matrix are cancelled by the size of the factor and especially by the number of delayed pivots that MUMPS must carry out. Imposing ELIM_LAGR='NON' can then be very interesting (gain of 40% in CPU on the test case mac3c01).
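
A sketch of the two settings discussed above:

  # Few Lagrange multipliers (the usual case): simply dualized matrix K1.
  SOLVEUR=_F(METHODE='MUMPS', ELIM_LAGR='LAGR2')
  # Very many Lagrange multipliers (>~20% of the unknowns): keep K2.
  SOLVEUR=_F(METHODE='MUMPS', ELIM_LAGR='NON')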

3.5 Perimeter of use


A priori, all operators/features using the resolution of a linear system, except one configuration of modal computation35. For more details one can consult the user documentation [U4.50.01].

3.6 Parameter setting and examples of use


Let us recapitulate the principal settings allowing MUMPS to be controlled in Code_Aster, and let us illustrate its use via an official test case (mumps05b) and a study geometry (RIS pump). For more information one can consult the associated user documentations [U2.08.03/U4.50.01], the EDF notes [BD08][Boi07][Des07] or the test cases using MUMPS.

3.6.1 Parameters of use of MUMPS via Code_Aster

Operand (SOLVEUR/METHODE='MUMPS') | Keyword | Default value | Details/advice | Ref.

Functional parameters
  TYPE_RESOL | 'AUTO' (i.e. 'NONSYM' if the matrix is nonsymmetric, 'SYMGEN' otherwise) | 'AUTO', 'NONSYM', 'SYMGEN' and 'SYMDEF'. Parameter allowing the nature of the problem to be treated to be specified. | §1
  PCENT_PIVOT | 20% | Memory overhead reserved for pivoting. | §2.3
  ELIM_LAGR | 'LAGR2' | 'LAGR2'/'NON'. | §3.4
  RESI_RELA | -1 (nonlinear), 10^-6 (linear) | Parameter linked with the keyword POSTTRAITEMENTS. If positive, MUMPS carries out iterative refinement iterations and examines the quality of the solution. If the relative error on the solution is not lower than this value, Aster stops in ERREUR_FATALE. Typical case: POSTTRAITEMENTS='MINI'. | §2.3

Numerical parameters
  PRETRAITEMENTS | 'AUTO' | 'AUTO' and 'SANS'. | §1.6, §2.3
  RENUM | 'AUTO' | 'AMD', 'AMF', 'QAMD', 'PORD', '(PT)SCOTCH' and '(PAR)METIS'. In the first case MUMPS chooses the best available reorderer; in the others, it is imposed. If this reorderer is not available: ERREUR_FATALE, or ALARME and replacement by another of the same type. | §1.6
  FILTRAGE_MATRICE / MIXER_PRECISION | | Options to “relax” the resolutions carried out via MUMPS. | [U4.50.01]
  POSTTRAITEMENTS | 'AUTO' | 'AUTO', 'MINI', 'FORCE' and 'SANS'. | §2.3
  ACCELERATION / LOW_RANK_SEUIL | 'AUTO'/0.0 | Useful mainly on big problems (N > 2.10^6 dof). To be used in complement of POSTTRAITEMENTS='MINI'. | §2.3 and appendix n°1

Memory parameters
  GESTION_MEMOIRE | 'AUTO' | 'IN_CORE', 'OUT_OF_CORE', 'AUTO' or 'EVAL'. | §3.3
  MATR_DISTRIBUEE | 'NON' | 'OUI' or 'NON'. | §3.2

Table 3.6-1. Summary of the specific parameter setting of MUMPS in Code_Aster.

35 Resolution of a quadratic problem with CALC_MODES + OPTION='SEPARE'/'AJUSTE', because this option requires direct access to the diagonal of the factorized matrix. However MUMPS does not allow, for the moment, the precise terms of the factor actually obtained to be known. On the other hand, and it is its privileged framework of use, it uses them effectively and robustly to solve a linear system and/or to compute a postprocessing (Sturm criterion, determinant computation…).
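
As an illustration, a representative nonlinear usage combining several keywords of table 3.6-1 (all concept names — modele, chmat, charge, linst — and the behaviour law are hypothetical):

  resu = STAT_NON_LINE(MODELE=modele, CHAM_MATER=chmat,
                       EXCIT=_F(CHARGE=charge),
                       COMPORTEMENT=_F(RELATION='VMIS_ISOT_TRAC'),
                       INCREMENT=_F(LIST_INST=linst),
                       SOLVEUR=_F(METHODE='MUMPS',
                                  GESTION_MEMOIRE='AUTO',   # IC/OOC chosen by heuristics
                                  MATR_DISTRIBUEE='OUI',    # recut JEVEUX objects
                                  PCENT_PIVOT=20,
                                  RESI_RELA=1.e-6))         # check the forward error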

3.6.2 Monitoring
By setting the keyword INFO to 2 and using the MUMPS solver, the user can display in the message file a synthetic monitoring of the various phases of construction and resolution of the linear system: distribution per processor of the number of meshes, of the terms of the matrix and of its factor, the error analysis (if requested) and an assessment of their possible imbalance. To this CPU-oriented monitoring, some information on the RAM consumption of MUMPS is added: per processor, the estimates (from the analysis phase) of the requirements in RAM in IC and in OOC, and the value actually used, with a recall of the strategy chosen by the user. The times spent in each stage of the computation per processor can appear too. They are managed by a more global mechanism which is not specific to MUMPS (cf. §4.1.2 of [U1.03.03] or the user documentation of the operator DEBUT/POURSUITE).

************************************************************************
<MONITORING MUMPS>
SIZE OF THE SYSTEM                            803352
CONDITIONING/ALGORITHM ERROR                  2.2331D+07  3.3642D-15
ERROR ON THE SOLUTION                         7.5127D-08
RANK   NB. MESHES   NB. TERMS K   LU FACTORS
N 0:   54684        7117247       117787366
N 1:   55483        7152211       90855351

IN %: RELATIVE VALUE AND MAX IMBALANCE
:  1.45D+01  2.47D+00  2.38D+00  1.50D+01  4.00D+01  2.57D+01
:  1.40D-01 -1.09D+00 -5.11D+00 -9.00D-02  1.56D+00 -4.16D-01

RAM MEMORY ESTIMATED AND NEEDED BY MUMPS IN Mo (FAC_NUM + RESOL)

RANK ASTER:  ESTIM IN-CORE | ESTIM OUT-OF-CORE | RESOL. OUT-OF-CORE
N 0:         1854            512                 512
N 1:         1493            482                 482

#1 Resolution of the linear systems        CPU (USER+SYST/SYST/ELAPS): 105.68  3.67  59.31
#1.1 Numbering, connectivity of the matrix CPU (USER+SYST/SYST/ELAPS):   3.26  0.04   3.26
#1.2 Symbolic factorization                CPU (USER+SYST/SYST/ELAPS):   3.13  1.20   4.11
#1.3 Numerical factorization (or precond.) CPU (USER+SYST/SYST/ELAPS):  45.22  0.83  23.48
#1.4 Solve                                 CPU (USER+SYST/SYST/ELAPS):  54.07  1.60  28.46
#2 Elementary computations and assemblies  CPU (USER+SYST/SYST/ELAPS):   3.44  0.03   3.42
#2.1 Routine computation                   CPU (USER+SYST/SYST/ELAPS):   2.20  0.01   2.20
#2.1.1 Routines te00ij                     CPU (USER+SYST/SYST/ELAPS):   2.07  0.00   2.06
#2.2 Assemblies                            CPU (USER+SYST/SYST/ELAPS):   1.24  0.02   1.22
#2.2.1 Assembly of matrices                CPU (USER+SYST/SYST/ELAPS):   1.22  0.02   1.21
#2.2.2 Assembly of right-hand sides        CPU (USER+SYST/SYST/ELAPS):   0.02  0.00   0.01

Figure 3.5-1. Extract of the message file with INFO=2.

3.6.3 Examples of use


Let us conclude this chapter with two series of tests illustrating the variations of performance according to the case and the criterion observed (cf. figures 3.5-2/3). The canonical test case of the cube in linear analysis parallelizes very well. In centralized (resp. distributed) mode, more than 96% (resp. 98%) of the phases of construction and inversion of the linear system are parallelized, i.e. a theoretical speed-up close to 25 (resp. 50). In practice, on the parallel nodes of the centralized Aster machine, rather good accelerations are obtained: an effective speed-up of 14 on 32 processors instead of the theoretical 17.

[Plots: speed-up versus number of processors (theoretical, centralized, distributed) and RAM gain OOC/IC versus CPU loss, in %.]
N=0.7M / nnz=27M
Parallel fraction centralized/distributed: 96/98%
Theoretical speed-ups cent./dist.: <25/50
32 proc (x1): ~3 min
RAM consumption IC: 4 GB (1 proc) / 1.3 GB (16)
RAM consumption OOC: 2 GB (1 proc) / 1.2 GB (16)
Figure 3.5-2. Linear mechanical computation (op. MECA_STATIQUE) on the official cube test case (mumps05b). A single linear system is built and solved. Simulation carried out on the centralized Aster machine with Code_Aster v11.0. Measured RAM consumption of Aster+MUMPS.

On the nonlinear study of the pump, the gains one can hope for are weaker. Taking into account the sequential analysis phase of MUMPS, only 82% of the computations are parallel, hence appreciable but more modest theoretical and effective speed-ups. From the RAM point of view, the OOC management of MUMPS brings interesting gains in both cases, but more marked for the pump: in sequential mode, an IC vs OOC gain of about 85%, against 50% for the cube. As the number of processors increases, the data distribution induced by parallelism gradually erodes this gain. But it remains significant on the pump with 16 processors, while it almost disappears for the cube.


[Plots: speed-up versus number of processors (theoretical, centralized, distributed) and RAM gain OOC/IC, in %.]
N=0.8M / nnz=28.2M
Parallel fraction centralized/distributed: 55/82%
Theoretical speed-ups cent./dist.: <3/6
16 proc: ~15 min
RAM consumption IC: 5.6 GB (1 proc) / 0.6 GB (16)
RAM consumption OOC: 0.9 GB (1 proc) / 0.3 GB (16)
Figure 3.5-3. Nonlinear mechanical computation (op. STAT_NON_LINE) on a more industrial geometry (RIS pump). 12 linear systems are built and solved (3 time steps x 4 Newton steps). Simulation carried out on the centralized Aster machine with Code_Aster v11.0. Estimated RAM consumption of MUMPS.


4 Conclusion
Within the framework of thermomechanical simulations with Code_Aster, the main part of the computational cost often comes from the construction and resolution of the linear systems. For 60 years, two types of techniques have disputed supremacy in the field: direct solvers and iterative ones. Code_Aster, like a good number of general-purpose codes, made the choice of a diversified offer in the field, with however an orientation towards sparse direct solvers. Those are well adapted to its needs, which one can summarize under the triptych “robustness / multiple right-hand-side problems / moderate parallelism”. With the code now resting on many optimized and perennial “middlewares” (MPI, BLAS, LAPACK, METIS…) and being used mainly on SMP clusters (fast networks, large RAM and disk storage capacity), one seeks to optimize the linear solvers accordingly.

Taking into account the technicality required36 and a rich international offer37, the question of resorting to an external product to carry out these resolutions effectively is now impossible to circumvent. It makes it possible to acquire, at lower cost, a functionality that is often effective, reliable, powerful and endowed with a broad perimeter of use. One can thus profit from the experience feedback of a broad community of users and from the (very) sharp skills of international teams.
Thus Code_Aster made the choice to integrate the parallel multifrontal solver of the MUMPS package, in complement, in particular, of its “house” multifrontal. But if the latter benefits from a long adaptation to the Aster modelings, it remains less rich in features (pivoting, pre/postprocessing, quality of the solution…) and less powerful in parallel (for a RAM consumption of the same order). To exploit certain modelings (quasi-incompressible elements, X-FEM…) or to run “frontier studies” (cf. reactor vessel internals), this “Code_Aster+MUMPS” coupling sometimes becomes the only viable alternative.

Since then, its integration in Code_Aster has benefited from a continuous enrichment, and MUMPS (SOLVEUR/METHODE='MUMPS') is used daily in studies. Our experience feedback has of course grown and we maintain an active partnership relation with the MUMPS “core team” (in particular via the ANR SOLSTICE and a thesis).
In parallel mode, the use of MUMPS yields CPU gains (compared to the default method of the code, the “house” multifrontal) of around a dozen on 32 processors of the Aster machine. On more favorable cases, or by exploiting a second level of parallelism or the BLR compressions, this CPU gain can be much better.

The MUMPS solver thus makes it possible not only to solve numerically difficult problems but also, inserted in an Aster computing process already partially parallel, to leverage its performance. It provides the code with an efficient, robust and general-purpose parallel framework. It thus facilitates the turnaround of standard studies (< a million degrees of freedom) and makes the treatment of large cases (several million degrees of freedom) available to the greatest number.

36 To give an order of magnitude, the MUMPS package amounts to more than 10^5 lines (F90/C).
37 In the public domain alone, one counts tens of packages, libraries, “macro-libraries”…

5 Bibliography
5.1 Books/articles/proceedings/theses…
[ADD89] M.Arioli, J.Demmel and I.S . Duff. Solving sparse linear systems with sparse backward error. SIAM newspaper
one matrix analysis and applications . 10,165:190 (1989).
[ADE00] P.R.Amestoy, I.S.Duff and J.Y.L' Excel. Multifrontal parallel distributed symmetric and unsymmetric solvers.
Comput. Methods in Appl. Mech. Eng. 184,501:520 (2000).
[ADKE01] P.R.Amestoy, I.S.Duff, J.Koster and J.Y.L' Excel. With fully asynchronous multifrontal solvor using distributed
dynamic scheduling. SIAM newspaper of matrix analysis and applications, 23,15:41 (2001).
[AGES06] P.R.Amestoy, A.Guermouche, Excellent J.Y.L' and S.Pralet. Hybrid scheduling for the parallel solution of linear
systems. Parallel computing. 32,136:156 (2006).
[Che05] K.Chen. Matrix preconditioning technical and applications. ED. Cambridge University Close (2005).
[CW13] C.Weisbecker. Improving Multifrontal Solvers by Means of Algebraic Block Low-Rank Representations. PhD
thesis of Toulouse University (2013). 2013 Leopold Escande thesis award.
[CW15] P.Amestoy, C.Ashcraft, O.Boiteau, A.Buttari, J.Y.L' Excel and C.Weisbecker. Improving Multifrontal Methods by
Means of Block Low-Rank Representations. SIAM J.Sci. Comput., 37 (3), A1451-1474 (2015).
[Dav06] T.A.Davis. Direct methods for sparse linear systems. ED. SIAM (2006).
[Duf06] I.S.Duff et al. Direct methods for sparse matrices ED. Clarendon Close (2006).
[Gol96] G.Golub & C. Van Loan. Matrix computations. ED. Johns Hopkins University Close (1996).
[Hig02] N.J.Higham. Accuracy and stability of numerical algorithms. ED. SIAM (2002).
[Las98] P.Lascaux & R.Théodor. Matric digital analysis applied to the art of the engineer. ED. Masson (1998).
[Liu89] J.W.H.Liu. Broad computer solution of sparse positive definite systems. Prentice Hall (1981).
[Meu99] G.Meurant. Broad computer solution of linear systems. ED. Elsevier (1999).
[Saa03] Y.Saad. Iterative methods for sparse matrices. ED. PWS (2003).

5.2 Account-returned reports/EDF


[Anf03] N. Anfaoui. A study of the performances of Code_Aster: proposal for an optimization. Applied mathematics internship report, University Paris VI (2003).
[BD08] O. Boiteau and C. Denis. Activation of the "Out-Of-Core" features of MUMPS in Code_Aster. EDF R&D report CRY 8/23/047 (2008).
[Boi07] O. Boiteau. Integration of parallel distributed MUMPS in Code_Aster. EDF R&D note HI-I 7/23/03167 (2007).
[Boi13] O. Boiteau. The MUMPS linear solver: software work in Code_Aster, C. Weisbecker's thesis on low-rank compressions and the EDF/INPT partnership. EDF R&D note H-I23-2013-03942 (2013).
[Boi15] O. Boiteau. Scientific partnerships and activities in the field of HPC linear algebra libraries: MUMPS consortium, SOLVER mini-symposium and CIMI semester. EDF R&D note H-I23-2015-04879 (2015).
[Boi16] O. Boiteau. MUMPS: latest evolutions of the product and news of the consortium. EDF R&D note 6125-1106-2016-14581 (2016).
[Des07] T. De Soza. Evaluation and development of parallelism in Code_Aster. ENPC Master's internship (2007) and EDF R&D note HT-62/08/01464 (2008).
[Dur08] C. Durand et al. HPC with Code_Aster: prospects and current status. EDF R&D note HT-62/08/0139 (2008).
[GM08] S. Géniaut and F. Meissonnier. Feasibility of a crack-harmfulness study of an MP valve by the X-FEM method with the SALOMÉ platform. EDF R&D report CR-AMA/08/0255 (2008).
[GS11] V. Godard and N. Sellenet. HPC computation with Code_Aster: prospects and current status. EDF R&D report CR-AMA-11.042 (2011).
[Tar07] N. Tardieu. GCP+MUMPS, a simple solution for solving contact problems in parallel. EDF R&D report CR-AMA/07/0257 (2007).
[Sol] O. Boiteau. Follow-up of the ANR SOLSTICE project. EDF R&D reports (2007-2010).

5.3 Resources Internet


[Dav] T.A. Davis. Pointer to sparse direct solver packages: http://www.cise.ufl.edu/research/sparse/codes/.
[Don] J. Dongarra. Pointer to linear solver packages: http://www.netlib.org/utk/people/JackDongarra/la-sw.html.
[MaMa] MatrixMarket web site: http://math.nist.gov/MatrixMarket/index.html.
[Mum] Official MUMPS web site: http://graal.ens-lyon.fr/MUMPS.
[MumC] Official web site of the MUMPS consortium: http://mumps-consortium.org.


6 History of the versions of the document


Aster version, author(s) or contributor(s) (organization): description of the modifications

9.4, O. Boiteau (EDF R&D SINETICS): Initial text.

V10.4, O. Boiteau (EDF R&D SINETICS): Formal corrections due to the .doc/.odt porting; update on parallelism; taking into account of the remarks of the MUMPS team; addition of new keywords (ELIM_LAGR2, LIBERE_MEMOIRE, MATR_DISTRIBUEE); slimming of the advice part, the perimeter of use being now covered by note U2.08.03.

V11.3, O. Boiteau (EDF R&D SINETICS): Addition of the new keyword GESTION_MEMOIRE, replacing OUT_OF_CORE and LIBERE_MEMOIRE; addition of the paragraph on the handling of singular systems.

V13.1, O. Boiteau (EDF R&D SINETICS): Update and removal of some obsolete paragraphs; addition of the elements on the consortium and BLR.

V13.3, O. Boiteau (EDF R&D SINETICS): New elements related to the use of threads and of the parallel reorderers PARMETIS and PTSCOTCH.

V13.4, O. Boiteau (EDF R&D PERICLES): New keyword ACCELERATION.

Warning : The translation process used on this website is a "Machine Translation". It may be imprecise and inaccurate in whole or in part
and is provided as a convenience.
Copyright 2020 EDF R&D - Licensed under the terms of the GNU FDL (http://www.gnu.org/copyleft/fdl.html)
Version
Code_Aster default
Titre : Généralités sur les solveurs linéaires directs et [...] Date : 09/12/2019 Page : 36/44
Responsable : BOITEAU Olivier Clé : R6.02.03 Révision :
0ac07fed0038

7 Appendix n°1: Principle of BLR compressions in MUMPS


This appendix summarizes the various technical aspects of the thesis work of Clément Weisbecker on low-rank compressions [CW15] [Boi13]. This work was continued and improved by the MUMPS team and its results are partly available in the consortium version [MumC] of the product.
For more details one may consult the thesis document itself [CW13]. The figures re-used here come, moreover, either from this document or from the slides of its defence.

7.1 Principle of the multifrontal method


The multifrontal method is a class of direct methods used to solve linear systems. It is the method implemented in the MUMPS product and the one generally used in the major commercial structural mechanics codes (ANSYS, Nastran…). Even if other strategies exist (supernodal…) and are starting to be implemented in public packages (SuperLU, PastiX…), this thesis work focused exclusively on the standard multifrontal strategy. Nevertheless many of its results can undoubtedly be extended or adapted to these other resolution methods.

The multifrontal method, developed by I.S. Duff and J.K. Reid (1983), consists in particular in using graph theory 38 to build, from a sparse matrix, an elimination tree organizing the computations efficiently (cf. figures 7.1.1). The black squares materialize the nonzero matrix terms. To handle this kind of data "through graphs", one usually imposes the following rule: a vertex of the graph represents an unknown and an edge between two vertices a nonzero matrix term.
Thus, in the example below, variables 1 and 2 must be connected by an edge (in the initial numbering). Generally, the initial numbering of the matrix is not optimal. In order to reduce the memory footprint of the factorized matrix 39, the flop count of the later manipulations, and in order to try to guarantee a good level of precision of the result, this initial matrix is renumbered. One thus sees that the number of additional terms ("fill-in") created by the factorization passes from 16 to 10.
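To make the fill-in mechanism concrete, here is a minimal Python sketch (ours; the 3x3 grid graph and the orderings are illustrative and do not reproduce the counts of the figure) which counts the fill-in created by a given elimination order, using the classic rule that eliminating a vertex connects all its not-yet-eliminated neighbours pairwise:

def fill_in(adj, order):
    """Count the fill-in terms created by symbolic elimination in 'order'.
    adj: dict vertex -> set of neighbours (symmetric sparsity pattern,
    an edge standing for a nonzero off-diagonal term)."""
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy
    eliminated, fill = set(), 0
    for v in order:
        nbrs = adj[v] - eliminated
        for a in nbrs:
            for b in nbrs:
                if a < b and b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 2                     # symmetric pair of new terms
        eliminated.add(v)
    return fill

# 5-point stencil on a 3x3 grid: the fill-in depends on the ordering
adj = {(i, j): {(i+di, j+dj) for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= i+di < 3 and 0 <= j+dj < 3}
       for i in range(3) for j in range(3)}
rowwise = sorted(adj)                             # natural row-wise ordering
nested = sorted(adj, key=lambda v: (v in [(1, 0), (1, 1), (1, 2)], v))
print(fill_in(adj, rowwise), fill_in(adj, nested))  # 16 vs 14: reordering helps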

It is from this renumbered matrix that one builds the elimination tree specific to the multifrontal method. This tree vision has the great virtue of organizing the tasks concretely: one determines precisely which tasks are dependent or independent (to exploit parallelism) and their precedence (to operate regroupings in order to optimize BLAS performance); one can predict certain consumptions (in time and memory) and even try to limit the numerical problems.
Thus, in the example below, the treatment of variable 1 does not affect variables 2, 4 and 5. On the other hand, it will impact variables 3 and 7, which occupy the higher levels of the tree (they are its ascendants).

38 In particular the founding work of R.S. Schreiber.


39 That is, to limit the phenomenon of "fill-in" which makes the factorized matrix comprise many more nonzero terms than the initial matrix (commonly 25 times more in our studies). This crucial point seeks to restrict the memory footprint of the storage of this factorized matrix, but also the flop costs of its many later manipulations (forward/backward substitutions).

Figure 7.1.1 _First row, from left to right: an initial sparse matrix, the same reordered matrix and the elimination tree associated with the latter. Second row, from left to right: the factorized matrices corresponding to the initial and renumbered matrices.

The second strong idea of the multifrontal method is to gather a maximum of variables (one speaks of amalgamation) in order to constitute dense "fronts" whose numerical processing will be much more effective (via BLAS-type routines).
It is then necessary to fill the corresponding matrix sub-blocks with true zeros and to handle them as such 40. It is for example what is done in the tree of figure 7.1.2 between the amalgamated variables 7, 8 and 9. There exist many amalgamation techniques based on graph criteria, numerical aspects or software considerations (e.g. distributed parallelism).

Figure 7.1.2 _Elimination tree with its matrix blocks and a choice of "fronts".

Note:
• For the sake of simplification, this summary does not address the other numerical treatments which often intervene in the process: scaling of the terms ("scaling"), permutation of the columns and static/dynamic pivoting. As "the devil is often in the details", these related treatments are however what made the numerical work and the software developments of Clément complex. They are essential to the good progress of many of our industrial simulations with Code_Aster.

7.2 The elimination tree


The algorithmic treatment of the tree proceeds from the bottom (the "leaves") to the top (the "root"). With each leaf or each branch "node" one associates a dense front. Within a front one distinguishes two types of variables:

40 Despite the induced over-filling. This is a little contradictory with the preceding renumbering stage but, in general, this amalgamation is very beneficial for the whole process.

• "Fully Summed" (FS) variables which, as their name indicates, are completely treated and will not be updated any more;
• "Non Fully Summed" (NFS) variables which still await contributions from other branches of the tree.

The first type of variables can be completely "eliminated": the terms of the factorized matrix concerning them (rows of U and columns of L) are computed and stored 41 once and for all. The second type produces a block of contributions 42 (noted CB, for "Contribution Block") which will be added, at the higher level of the tree, to the other CBs associated with the same variables (cf. figure 7.2.1).

Figure 7.2.1 _General structure of a "front" before and after its treatment.

In their turn, certain variables associated with these blocks will be eliminated while others, on the contrary, will continue to take an active part in the process by providing new CBs. And so on… up to the root of the tree. At this last stage, only eliminated variables remain and there is no CB left!

Thus, in the fronts associated with the leaves on the left of figure 7.2.1, variables 1 and 2 are FS, while variables 3, 7 and 9 are NFS. The latter provide CBs which will be accumulated in the front of the higher level gathering variables 3, 7, 8 and 9. This front will "relieve" the algorithmic process of the FS variable 3 (it will be eliminated), while the NFS variables 7, 8 and 9 will continue their ascent in the tree.
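To fix ideas, here is a minimal numpy sketch (ours, not MUMPS's actual kernel; the front size and FS/NFS split are illustrative) of the treatment of one dense front: the FS block is factorized once and for all, and the NFS part yields the contribution block as a Schur complement to be summed into the parent front:

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def process_front(F, n_fs):
    """Partial factorization of a dense front F whose first n_fs
    variables are Fully Summed: eliminate them, return the CB."""
    A11, A12 = F[:n_fs, :n_fs], F[:n_fs, n_fs:]
    A21, A22 = F[n_fs:, :n_fs], F[n_fs:, n_fs:]
    lu, piv = lu_factor(A11)             # 'Factor' on the FSxFS block
    X12 = lu_solve((lu, piv), A12)       # 'Solve': A11^{-1} A12
    CB = A22 - A21 @ X12                 # 'Update': the contribution block
    return (lu, piv), X12, CB

# Toy front: 2 FS variables, 3 NFS variables
rng = np.random.default_rng(0)
F = rng.standard_normal((5, 5)) + 5.0 * np.eye(5)
_, _, CB = process_front(F, n_fs=2)
print(CB.shape)   # (3, 3): summed with its sibling CBs higher in the tree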

7.3 Algorithmic treatments


In the case of a general dense matrix A, the Gaussian factorization amounts to building iteratively the terms of L and those of U according to an algorithm of type 1.1 (cf. figure 7.3.1). Here, to simplify, one takes into account neither the zero (or too small) terms, nor numerical aspects such as pivoting, permutation or scaling.
To each stage k corresponds a pivot a(k,k) (assumed nonzero) which leads to the update of the underlying block A(k+1:n, k+1:n), as well as the construction of the corresponding column of L and row of U.

This algorithm thus requires two types of operations:


• A factorization stage (noted "Factor") which uses the diagonal pivot a(k,k) to build the k-th part of the factorized matrix and thus to eliminate the k-th variable (it becomes FS).
• An update stage (noted "Update") which builds the contribution block (CB).

41 In RAM, then on disk if the OOC is activated.


42 Computational object managed only in RAM.

 3
The algorithmic complexity of the unit and its cost report are respectively in  n and in  n . These only  2

figures illustrate the impact of this stage on the performances of our simulations (even if those are carried out in
‘sparse’).

Algorithms 7.3.1 _Examples of dense factorization algorithms, scalar and by blocks.

In order to optimize the costs one generally works on blocks rather than on scalars (cf. algorithm 1.3). The data to be gathered in RAM at a given moment are moreover smaller, and the vector and matrix algebraic operations are much more effective (via optimized BLAS-type routines).
The types of operations to be carried out remain unchanged:
• Local factorizations of the diagonal blocks ("Factor"),
• Forward/backward substitutions (noted "Solve") to actually build the block columns/rows of the factorized matrix,
• Updates ("Update") by blocks of the submatrix.

In the elimination trees presented previously, it is this kind of operation which is carried out within each front, then between each front and its "parent". Low-rank compression will aim at reducing both their algorithmic complexity and their memory footprint (RAM peak and disk consumption).
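By way of illustration, a minimal numpy sketch of this blocked scheme (ours, without pivoting like the simplified algorithm above; the scalar kernel plays the role of algorithm 1.1, the blocked loop that of algorithm 1.3):

import numpy as np
from scipy.linalg import solve_triangular

def lu_nopiv(A):
    """Scalar unpivoted LU: returns L and U packed in one array."""
    A = A.copy()
    for k in range(A.shape[0] - 1):
        A[k+1:, k] /= A[k, k]                               # column of L
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])   # 'Update'
    return A

def blocked_lu(A, b=2):
    """Right-looking blocked LU: Factor / Solve / Update per panel."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        A[k:e, k:e] = lu_nopiv(A[k:e, k:e])                 # Factor
        if e < n:
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            U11 = np.triu(A[k:e, k:e])
            # Solve: block row of U, block column of L
            A[k:e, e:] = solve_triangular(L11, A[k:e, e:],
                                          lower=True, unit_diagonal=True)
            A[e:, k:e] = solve_triangular(U11.T, A[e:, k:e].T, lower=True).T
            # Update of the trailing submatrix: the BLAS-3 heavy part
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

# Check on a diagonally dominant matrix (safe without pivoting)
rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6)) + 6.0 * np.eye(6)
F = blocked_lu(M)
L = np.tril(F, -1) + np.eye(6)
U = np.triu(F)
assert np.allclose(L @ U, M)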

7.4 Memory management


On this last point, often the most crucial one for our studies, the whole problem lies in the "deferred management of the CBs". At stage k of algorithm 1.3, one must keep in RAM not only the "active" front, but also the CBs awaiting treatment and the factors possibly not yet unloaded to disk (if the OOC is activated). Together they constitute the active memory of the process (the reserved and green zones of figure 7.4.1).

Figure 7.4.1 _Memory management of the various elements handled by the multifrontal method.

This active memory is managed like a stack ("stack") and fluctuates constantly. It grows when a front is loaded; it then decreases as the CBs associated with this front are consumed; finally it grows again with the new factors and the new CB resulting from this front.
The new factors are then possibly copied to disk. The memory footprint of these factors (red zone on the left of the graph) is simpler to analyze: it only swells as one climbs the tree!

In all cases, these two memory zones are not easy to predict a priori. They depend in particular on the renumbering and on the construction of the elimination tree, but also on the numerical pretreatments. Estimating them is one of the tasks carried out by the analysis phase of MUMPS. It is useful both to the multifrontal method, to allocate its memory zones effectively, and to the calling code, in order to manage its own internal objects as well as possible 43. The optimized management of these elements is further complicated by the numerical processings carried out dynamically in the course of the factorization: for example, the organization of the pivoting and the distribution of the data in parallel.

It is the peak of this active memory which constitutes the essential challenge of this thesis. It is often much higher than the size of the factorized matrix itself. Yet the latter already often amounts to about 500N in our current studies (with N the size of the problem). Thus the treatment of a matrix comprising 1 million unknowns requires at least 500 × 10⁶ × 12 bytes 44, i.e. 6 GB of RAM (in double-precision real arithmetic). And this figure tends to increase with the size of the problem (one passes from 500N to 1000N or even more).
One can thus reach RAM peaks that even the parallel distribution of the data combined with the OOC will not be able to manage effectively 45. Any significant reduction of this peak would thus be a very appreciable gain for our large current and future studies.
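As a back-of-the-envelope check of this arithmetic (the 500N to 1000N fill ratios are the empirical figures quoted above; everything else is plain arithmetic):

def factor_ram_gb(n_unknowns, fill_ratio=500, bytes_per_term=8 + 4):
    # fill_ratio*N stored terms, each an 8-byte double plus a 4-byte row index
    return fill_ratio * n_unknowns * bytes_per_term / 1e9

print(factor_ram_gb(1_000_000))          # 6.0 GB, the figure quoted above
print(factor_ram_gb(5_000_000, 1000))    # a larger study: 60 GB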

7.5 Low-rank approximation

A dense matrix A, of size m×n, is said to be of "low rank" ("low-rank") of order k (k < min(m, n)) when it can be decomposed in the form

A = U·Vᵀ + E (7.5-1)

with U and V much smaller matrices (respectively m×k and n×k) and E an m×n "negligible" matrix (||E|| ≤ ε). This concept of "numerical rank" should not be confused with the concept of algebraic rank.

During later manipulations one then makes the approximation

A ≈ U·Vᵀ (7.5-2)

while betting (often under control) that it will have little incidence on the overall process. This approximation is thus governed by the compression parameter ε.
It is this parameter which one will vary, in practice, to adapt the gains of the low-rank approach to the situation. For example:
• with ε ≈ 10⁻⁹ there is compression with a slight and controllable loss of precision. One can of course continue to use the multifrontal method as a direct solver. To recover a precision identical to the full-rank case, one or two iterations of iterative refinement are sufficient 46.
• with ε > 10⁻⁹ the compression is better but the loss of precision can be important 47. One can then use the factorized matrix to build a more or less "crude" preconditioner accelerating a Krylov iterative solver.

In addition, beyond the obvious gains in terms of storage, the handling of such matrices can prove very advantageous: the rank of a sum is lower than (in the worst case) the sum of the ranks, and the rank of a product is at most the minimum of the ranks. Once the matrices are decomposed, their handling can thus be (relatively) controlled in order to optimize the compression of the result.
Thus, for the dense product of two matrices of size n×n and of rank k, with k ≪ n, the algorithmic complexity of the operation drops from O(n³) to O(kn²), and its memory footprint from O(n²) to O(kn).
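A minimal numpy illustration of (7.5-1) and (7.5-2) by truncated SVD (the kernel matrix and the tolerance are ours, chosen for illustration; the kernels actually used in MUMPS are cheaper, as explained below):

import numpy as np

def compress_svd(A, eps):
    """'Demote' A into U @ Vt with ||A - U @ Vt||_2 <= eps (truncated SVD)."""
    u, s, vt = np.linalg.svd(A, full_matrices=False)
    k = int(np.sum(s > eps))               # numerical rank at tolerance eps
    return u[:, :k] * s[:k], vt[:k, :]     # U (m x k), Vt (k x n)

# A smooth kernel sampled on two well-separated clusters of points is
# numerically low-rank (this foreshadows the admissibility criterion of 7.6)
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(5.0, 6.0, 300)
A = 1.0 / np.abs(x[:, None] - y[None, :])
U, Vt = compress_svd(A, eps=1e-9 * np.linalg.norm(A, 2))
m, n = A.shape
k = U.shape[1]
print(k, k * (m + n) < m * n)              # rank and storage-gain test (7.5-3)
print(np.linalg.norm(A - U @ Vt, 2))       # the 'E' term, controlled by eps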

43 Cf. the Code_Aster keyword SOLVEUR/GESTION_MEMOIRE.


44 The multiplicative figure 12 comes from the 8 bytes consumed by the storage of a real matrix term plus the 4 bytes of its row index.
45 Bottleneck of some sequential treatments, deterioration of speed-ups and I/O costs.
46 Sometimes, in nonlinear analyses, the correction made by the Newton process is sufficient to ensure the overall convergence of the computation. One or two additional Newton iterations are needed, but one thus avoids the iterative refinement iterations in postprocessing of each forward/backward substitution.
47 Even more annoying is the fact that the usual "outputs" of the direct solver can be distorted and thus unusable by a surrounding algorithm: singularity detection, determinant computation, Sturm criterion…

A matrix transformed into low-rank form is said to be "demoted" or compressed. The reverse transformation, up to the approximations, "promotes" the matrix or decompresses it.
To keep the same terminology when a matrix is handled in the standard way (without exhibiting a decomposition of type (7.5-1)), it is said to be of full rank ("full-rank").

The "low-rank decomposition" of a matrix can be carried out in various ways:


• SVD,
• Rank-Revealing QR (RRQR),
• Adaptive Cross Approximation (ACA),
• Random sampling.

In this thesis Clément mainly uses the second solution, less precise than a classical SVD but much less expensive. The whole point is that this compression be amortized over the various treatments and that it allow a maximum compression. Ideally the rank obtained should thus satisfy a condition of the type:

k (m + n) ≪ mn (7.5-3)

Part of Clément's work consisted in developing heuristics adapted to the multifrontal method in order to try to respect this criterion as often as possible.
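A hedged sketch of such an RRQR-type compression, using SciPy's column-pivoted QR as a simplified stand-in for the kernels actually used (the truncation rule on the diagonal of R is a common heuristic, not MUMPS's exact criterion):

import numpy as np
from scipy.linalg import qr

def compress_rrqr(A, eps):
    """RRQR-style 'demote': column-pivoted QR truncated where the
    diagonal of R falls below eps (cheaper, less precise than SVD)."""
    Q, R, piv = qr(A, mode='economic', pivoting=True)
    k = int(np.sum(np.abs(np.diag(R)) > eps))
    U = Q[:, :k]                         # m x k, orthonormal columns
    Vt = np.empty((k, A.shape[1]))
    Vt[:, piv] = R[:k, :]                # undo the column permutation
    return U, Vt

# The revealed rank matches the construction rank of this test matrix
rng = np.random.default_rng(2)
A = rng.standard_normal((200, 30)) @ rng.standard_normal((30, 300))
U, Vt = compress_rrqr(A, eps=1e-9 * np.linalg.norm(A, 2))
print(U.shape[1], np.linalg.norm(A - U @ Vt, 2))   # ~30, ~0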

7.6 The "Block Low-Rank" (BLR) multifrontal method


The objective of this thesis is to exploit this kind of compression by decomposing in low-rank form the largest fronts of the elimination tree of a multifrontal method, because these are the dense matrix sub-blocks which generate the most flops and which burden the RAM peak (owing to all the CBs that they agglomerate). In general these large fronts lie in the last levels of the elimination tree.

Figure 7.6.1 _Structure of a compressed front after its factorization.

As illustrated below (cf. figure 7.6.1), one thus starts by decomposing into columns and rows of sub-blocks (decomposition by "panels") the matrix blocks corresponding to the four types of terms of the fronts (according to the terminology of figure 7.2.1):
• FSxFS ("block (1,1)"),
• FSxNFS ("block (1,2)"),
• NFSxFS ("block (2,1)"),
• NFSxNFS ("block (2,2)").
Then each of these sub-blocks will be compressed in low-rank form according to formula (7.5-1), at least when this compression is licit 48. The exception is the diagonal sub-blocks of the FSxFS part, which remain in full-rank (to optimize their later handling).

48 When these sub-blocks meet certain functional criteria. For example they must be of sufficient size for a compression to be attempted. In addition the compression gains must be sufficiently important.

One adapts the size of the panels so that it is:


• Sufficiently large to benefit from the performance of the BLAS kernels (loop invariants, minimization of cache misses and optimized management of the memory hierarchy);
• Not too large, to preserve flexibility in the later numerical handling (pivoting, scaling) and software handling (distributed parallelism);
• Of medium size in order to optimize the costs of the MPI communications (latency versus bandwidth);
• Of medium size in order to preserve flexibility in the regrouping of variables which follows (the "clustering"), this reorganization having to generate a maximum of compression;
• Not too large, so as not to be too expensive in SVD or RRQR.

But the whole problem lies in the fact that these matrix sub-blocks have no a priori reason to be low-rank! Even if they are of large size, the "demoted" parts of fronts can prove to be of full rank. The costs of the SVDs or RRQRs will then not be compensated and the objective of compression will be compromised!
Taking inspiration from the work already carried out for other compression techniques (hierarchical matrices of type H, H², HSS/SSS…), criteria were developed to partition the variables composing a front and to renumber them so that they generate low-rank matrix sub-blocks. These criteria are based on the following empirical observation:

The more distant two blocks of variables are, the lower the numerical rank of the matrix block which they imply becomes.

This so-called "admissibility" condition was tested on model problems resulting from the discretization of elliptic PDEs and is illustrated on figure 7.6.2. With the terminology of graph theory it can be reformulated in the following form:

max(diam(σ), diam(τ)) ≤ dist(σ, τ) (7.6-1)

with diam and dist the usual concepts of diameter and distance in the graph associated with the treated front, and σ, τ the two groups of variables defining the matrix block.

Figure 7.6.2 _Evolution of the "numerical" rank of two groups of variables of homogeneous size according to the distance (in the graph sense) which separates them.
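To make this criterion concrete, a small sketch (ours) evaluating the reconstructed condition (7.6-1) on a 16×16 grid graph, where the graph distance is the Manhattan distance: compact square clusters pass the test while elongated slices of the same cardinality fail it:

import itertools

def diam(cluster):
    """Diameter of a set of grid points for the Manhattan (grid) metric."""
    return max(abs(ax - bx) + abs(ay - by)
               for (ax, ay), (bx, by) in itertools.combinations(cluster, 2))

def dist(c1, c2):
    """Distance between two clusters for the Manhattan (grid) metric."""
    return min(abs(ax - bx) + abs(ay - by)
               for (ax, ay) in c1 for (bx, by) in c2)

def admissible(c1, c2):
    return max(diam(c1), diam(c2)) <= dist(c1, c2)

# Two 'slice' clusters (2x16): huge diameter, modest separation
s1 = [(i, j) for i in range(0, 2) for j in range(16)]
s2 = [(i, j) for i in range(8, 10) for j in range(16)]
# Two compact 4x8 clusters of the same cardinality, on the same grid
q1 = [(i, j) for i in range(0, 4) for j in range(0, 8)]
q2 = [(i, j) for i in range(12, 16) for j in range(8, 16)]
print(diam(s1), dist(s1, s2), admissible(s1, s2))   # 16 7 False
print(diam(q1), dist(q1, q2), admissible(q1, q2))   # 10 10 True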

Thus, on the example of the figure below (cf. fig. 7.6.3), with the division of the front's variables into slices one obtains a gain of 17% in the number of stored terms. That is, on average, each sub-block admits a numerical rank k such that k (m + n) ≤ 0.83 mn.
With the checkerboard cutting, the gain reaches 47%: k (m + n) ≤ 0.57 mn. This difference is due to the greater regularity of the second clustering compared to the first. The homogeneous, square-shaped sub-blocks have diameters quickly smaller than the distance between sub-blocks; with their elongated shape, this is obviously not the case for the sub-blocks resulting from the cutting in slices!


Figure 7.6.3 _Example of two types of clustering on a dense front: in slices and in checkerboard.

7.7 Some complementary elements


It is thus on an admissibility criterion of this type that MUMPS bases its clustering of the variables composing a front. Much work was needed to make this "key operation" of the method compatible with the numerical and software constraints of the multifrontal method and of the MUMPS tool, while taking care to preserve a maximum of compression and to limit its overcost (in time).

Figure 7.7.1 _Clustering of the "inherited" type between the NFS variables of a front and the FS variables of a previous front.

Example of sophistication: this clustering is operated separately on the two types of variables, the FS and the NFS. Since, while climbing the elimination tree, the NFS variables of a front become the FS variables of fronts of higher levels, one splits this task accordingly. In addition one does not completely restart these partitions for each front: certain information is pooled in order to limit the overcosts (in time). This particular clustering algorithm is called "inherited" ("inherit"), in opposition to the standard algorithm called "explicit", for which both clusterings are recomputed anew for each handled front.

The overcost of the "explicit" clustering alternative can be prohibitive on very big problems (several times the cost of the whole analysis phase!) whereas the cost of the optimized clustering remains reasonable: only a few percent of this analysis phase. The low-rank gains of the two alternatives are very close; on the other hand the implementation of the optimized version is more complex.
Finally these clusterings are carried out by standard tools, METIS or SCOTCH. But one does not apply them directly to the variables of the fronts but to halos including them. This trick makes it possible to direct these partitions suitably so that they produce homogeneous groups of contiguous nodes.

In addition, to be more precise, the treatment of a front comprises five stages (already evoked in the algorithms of fig. 7.3.1). One starts by cutting out the variables into two nested levels of panels:
• the first level, the largest one (called "outer panel" or "BLR panel"), results from the low-rank clustering;
• the second, inside the previous one ("inner panel" or "BLAS panel"), gathers the variables in sub-packages of 32, 64 or 96 contiguous variables in order to feed the computation kernels more effectively (and to reduce the cost of the MPI communications and of the I/Os).

Then there are two nested loops: the first over the BLR panels and the second over the BLAS panels. For a given BLAS panel, one carries out:
• the Factorization stage (F),
• the Solve stage (S),
• the Internal Update (UI) against all the following sub-panels of the BLR panel,
• the External Update (EU) against all the following BLR panels.

The standard order of the operations can thus be coarsely noted FSUU. According to the level at which the low-rank Compression (C) intervenes, 4 alternatives are then distinguished:
1. FSUUC,
2. FSUCU,
3. FSCUU,
4. FCSUU.
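As an illustration of the third ordering, a didactic numpy sketch (ours, not MUMPS's implementation; on the kernel matrix used here the off-diagonal panels do compress, while the stored factors stay full-rank so that only the updates are approximated):

import numpy as np
from scipy.linalg import solve_triangular

def demote(B, eps):
    # 'C' step: truncated-SVD compression of a panel, B ~= X @ Yt
    u, s, vt = np.linalg.svd(B, full_matrices=False)
    k = max(1, int(np.sum(s > eps)))
    return u[:, :k] * s[:k], vt[:k, :]

def blr_lu_fscuu(A, b, eps):
    # Blocked LU in FSCUU order: Factor and Solve in full-rank (so pivoting
    # and singularity detection would stay exact), Compress, then Update
    # with the compressed panels, which is where the flops are saved.
    A = A.copy()
    n = A.shape[0]
    for s0 in range(0, n, b):
        e = min(s0 + b, n)
        D = A[s0:e, s0:e]
        for j in range(e - s0 - 1):                        # F (scalar kernel)
            D[j+1:, j] /= D[j, j]
            D[j+1:, j+1:] -= np.outer(D[j+1:, j], D[j, j+1:])
        if e < n:
            L11 = np.tril(D, -1) + np.eye(e - s0)
            U11 = np.triu(D)
            A[s0:e, e:] = solve_triangular(L11, A[s0:e, e:],
                                           lower=True, unit_diagonal=True)  # S
            A[e:, s0:e] = solve_triangular(U11.T, A[e:, s0:e].T,
                                           lower=True).T                    # S
            Xu, Yu = demote(A[s0:e, e:], eps)              # C on the U panel
            Xl, Yl = demote(A[e:, s0:e], eps)              # C on the L panel
            A[e:, e:] -= Xl @ ((Yl @ Xu) @ Yu)             # U: low-rank update
    return A

# Positive definite kernel matrix: its off-diagonal panels compress well
x = np.linspace(0.0, 1.0, 128)
M = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :])) + np.eye(128)
F = blr_lu_fscuu(M, b=32, eps=1e-8 * np.linalg.norm(M, 2))
L = np.tril(F, -1) + np.eye(128)
U = np.triu(F)
print(np.linalg.norm(L @ U - M, 2) / np.linalg.norm(M, 2))  # small, eps-driven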


Table 7.7.2 _Comparison of the costs of the various stages in dense (full-rank) and in low-rank.

It is alternative n°3, FSCUU, which is industrialized in the consortium versions of MUMPS. Taking into account the relative costs of each of the operations (cf. table 7.7.2) and the robustness constraints of the tool, it is this alternative which was privileged. It makes it possible to reduce the overall costs significantly while preserving a maximum of precision in the first two stages (F and S), because those are crucial in the management of several sophisticated numerical ingredients (scaling, pivoting, singularity detection, determinant computation, Sturm test…). Initially, one thus preferred to impact them as little as possible with this compression. The compression is therefore carried out just after them, and its gains (and its possible approximations) thus impact only the internal and external updates (UI and EU).

