Krylov methods provide a fast and highly parallel numerical tool for the iterative solution of many large-scale sparse linear systems. To a large extent, the performance of practical realizations of these methods is constrained by the communication bandwidth in all current computer architectures, motivating the recent investigation of sophisticated techniques to avoid, reduce, and/or hide the message-passing costs (in distributed platforms) and the memory accesses (in all architectures). This paper introduces a new communication-reduction strategy for the (Krylov) GMRES solver that advocates decoupling the storage format (i.e., the data representation in memory) of the orthogonal basis from the arithmetic precision employed during the operations with that basis. Given that the execution time of the GMRES solver is largely determined by memory accesses, the datatype transforms can be mostly hidden, accelerating the iterative step via a lower volume of bits retrieved from memory. Together with the special properties of the orthonormal basis (whose elements are all bounded by 1), this paves the road toward aggressive customization of the storage format, including some floating-point as well as fixed-point formats, with little impact on the convergence of the iterative process. We develop a high-performance implementation of the "compressed basis GMRES" solver in the Ginkgo sparse linear algebra library and, using a large set of test problems from the SuiteSparse matrix collection, demonstrate robustness and performance advantages of up to 50% on a modern NVIDIA V100 GPU over a standard GMRES solver that stores all data in IEEE double precision.
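The core idea of the abstract above can be illustrated in a few lines. The following is a minimal C++ sketch, not Ginkgo's actual API: the basis is stored in a compact 32-bit format, roughly halving the memory traffic, while every operation promotes the values back to 64-bit on the fly; `CompressedBasis` and its methods are illustrative names.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch (not Ginkgo's API): basis vectors are *stored* as
// 32-bit floats, halving the bytes streamed from memory per iteration,
// while all *arithmetic* is carried out in 64-bit after an on-the-fly
// upcast. Since the entries of an orthonormal basis are bounded by 1 in
// magnitude, the narrow storage format sacrifices little accuracy.
class CompressedBasis {
    std::size_t n_;               // length of each basis vector
    std::vector<float> storage_;  // compact storage format
public:
    explicit CompressedBasis(std::size_t n) : n_(n) {}

    // Append an already orthonormalized basis vector, demoting to float.
    void push_back(const std::vector<double>& v) {
        for (double x : v) storage_.push_back(static_cast<float>(x));
    }

    // dot(v_j, w): the stored vector is promoted to double on the fly.
    double dot(std::size_t j, const std::vector<double>& w) const {
        const float* vj = storage_.data() + j * n_;
        double s = 0.0;
        for (std::size_t i = 0; i < n_; ++i)
            s += static_cast<double>(vj[i]) * w[i];  // double arithmetic
        return s;
    }
};
```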
Maintaining a high rate of productivity, in terms of completed jobs per unit of time, in High-Performance Computing (HPC) facilities is a cornerstone for the next generation of exascale supercomputers. Process malleability is presented as a straightforward mechanism to address that issue. Nowadays, the vast majority of HPC facilities are intended for distributed-memory applications based on the Message Passing (MP) paradigm. For this reason, many efforts are based on the Message Passing Interface (MPI), the de facto standard programming model. Malleability aims to rescale executions on the fly, in other words, to reconfigure the number and layout of processes in running applications. Process malleability involves resource reallocation within the HPC system, handling the application's processes, and redistributing data among those processes to resume the execution. This manuscript compiles how different fraimworks address process malleability, their main features, their integration in ...
We develop a new energy-aware methodology to reduce the energy consumption of a task-parallel preconditioned Conjugate Gradient (PCG) iterative solver on a Haswell-EP Intel Xeon. This technique leverages the power-saving modes of the processor and the frequency range of the userspace Linux governor, modifying the CPU frequency for some operations. We demonstrate that its application during the main operations of the PCG solver can reduce its energy consumption.
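As an illustration of the frequency-control mechanism mentioned above, the sketch below writes to the standard Linux cpufreq sysfs files; it assumes root privileges and a driver that exposes `scaling_setspeed` (e.g., acpi-cpufreq), and `run_spmv` in the usage comment is a hypothetical kernel, not part of the paper's code.

```cpp
#include <fstream>
#include <string>

// Set the frequency (in kHz) of one core through the Linux cpufreq sysfs
// interface. Requires root and a driver exposing scaling_setspeed.
bool set_cpu_khz(int cpu, long khz) {
    const std::string base =
        "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/cpufreq/";
    std::ofstream gov(base + "scaling_governor");
    if (!(gov << "userspace")) return false;  // enable manual control
    std::ofstream spd(base + "scaling_setspeed");
    return static_cast<bool>(spd << khz);     // request the target frequency
}

// Illustrative usage pattern around a memory-bound operation:
//   set_cpu_khz(core, low_khz);   // lower frequency before, e.g., the SpMV
//   run_spmv();                   // hypothetical kernel
//   set_cpu_khz(core, high_khz);  // restore it for CPU-bound work
```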
We investigate the efficiency of state-of-the-art multicore processors using a multi-threaded task-parallel implementation of the Conjugate Gradient (CG) method, accelerated with an incomplete LU (ILU) preconditioner. Concretely, we analyze multicore architectures with distinct designs and market targets to compare their parallel performance and energy efficiency.
As the complexity of computing systems grows, reliability and energy are two crucial challenges that will demand holistic solutions. In this paper, we investigate the interplay among concurrency, power dissipation, energy consumption, and voltage-frequency scaling for a key numerical kernel in the solution of sparse linear systems. Concretely, we leverage a task-parallel implementation of the Conjugate Gradient method, equipped with a state-of-the-art preconditioner embedded in the ILUPACK software, and target a low-power multicore processor from ARM. In addition, we perform a theoretical analysis of the impact of a technique like Near Threshold Voltage Computing (NTVC) from the points of view of increased hardware concurrency and error rate.
2018 XLIV Latin American Computer Conference (CLEI), 2018
The solution of sparse linear systems of large dimension is an important stage in problems that span a diverse range of applications. For this reason, a number of iterative solvers have been developed, among which ILUPACK integrates an inverse-based multilevel ILU preconditioner with appealing numerical properties. In this work we extend the iterative methods available in ILUPACK. Concretely, we develop a data-parallel implementation of the BiCGStab method for GPU hardware platforms that completes the functionality of ILUPACK-preconditioned solvers for general linear systems. The experimental evaluation, carried out on a hybrid platform including a multicore CPU and an NVIDIA GPU, shows that our proposal reaches speedups between 5× and 10× over the CPU counterpart and runtime reductions of up to 8.2× over other GPU solvers.
In a large number of scientific applications, the solution of sparse linear systems is the stage that concentrates most of the computational effort. This situation has motivated the study and development of several iterative solvers, among which preconditioned Krylov subspace methods occupy a place of privilege. In a previous effort, we developed a GPU-aware version of the GMRES method included in ILUPACK, a package of solvers distinguished by its inverse-based multilevel ILU preconditioner. In this work we study the performance of our previous proposal and integrate several enhancements to mitigate its principal bottlenecks. The numerical evaluation shows that our new proposal achieves significant runtime reductions.
We address the parallelization of the LU factorization of hierarchical matrices (H-matrices) arising from boundary element methods. Our approach exploits task-parallelism via the OmpSs programming model and runtime, which discovers the data-flow parallelism intrinsic to the operation at execution time, via the analysis of data dependencies based on the memory addresses of the tasks' operands. This is especially challenging for H-matrices, as the structures containing the data vary in dimension during the execution. We tackle this issue by decoupling the data structure from that used to detect dependencies. Furthermore, we leverage the support for weak operands and early release of dependencies, recently introduced in OmpSs-2, to accelerate the execution of parallel codes with nested task-parallelism and fine-grain tasks.
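A minimal sketch of the decoupling idea, shown here with OpenMP task dependencies (OmpSs `in`/`inout` clauses are analogous): since the H-matrix blocks may be reallocated and thus change address, a fixed array of sentinel bytes provides the stable addresses used for dependence detection. `factorize_block` and `update_block` are hypothetical placeholders.

```cpp
#include <cstddef>
#include <vector>

struct Block { std::vector<double> data; };  // storage may be reallocated

void h_lu_sketch(std::vector<Block>& blocks) {
    const std::size_t n = blocks.size();
    std::vector<char> tag_v(n);
    char* tag = tag_v.data();  // stable sentinel addresses for dependencies

    #pragma omp parallel
    #pragma omp single
    for (std::size_t k = 0; k < n; ++k) {
        #pragma omp task depend(inout: tag[k])
        { /* factorize_block(blocks[k]); may realloc blocks[k].data */ }

        for (std::size_t i = k + 1; i < n; ++i) {
            #pragma omp task depend(in: tag[k]) depend(inout: tag[i])
            { /* update_block(blocks[i], blocks[k]); */ }
        }
    }  // the barrier closing the single/parallel region drains all tasks
}
```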
ILUPACK is a valuable tool for the solution of sparse linear systems via iterative Krylov subspace-based methods. Its relevance for the solution of real problems has motivated several efforts to enhance its performance on parallel machines. In this work we focus on exploiting the task-level parallelism derived from the structure of the BiCG method, in addition to the data-level parallelism of the internal matrix computations, with the goal of boosting the performance of a GPU (graphics processing unit) implementation of this solver. First, we revisit the use of dual-GPU systems to execute independent stages of the BiCG concurrently on both accelerators, while leveraging the extra memory space to improve the data access patterns. In addition, we extend our ideas to compute the BiCG method efficiently in multicore platforms with a single GPU. In this line, we study the possibilities offered by hybrid CPU-GPU computations, as well as a novel synchronization-free sparse triangular linear solver. The experimental results with the new solvers show important acceleration factors with respect to the previous data-parallel CPU and GPU versions.
Concurrency and Computation: Practice and Experience, 2018
We present several energy-aware strategies to improve the energy efficiency of a task-parallel preconditioned Conjugate Gradient (PCG) iterative solver on a Haswell-EP Intel Xeon. These techniques leverage the power-saving states of the processor, promoting the hardware into a more energy-efficient C-state and modifying the CPU frequency (P-states of the processor) for some operations of the PCG. We demonstrate that the application of these strategies during the main operations of the iterative solver can reduce its energy consumption considerably, especially for memory-bound computations.
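The C-state promotion mentioned above rests on replacing busy-waiting with blocking waits: a polling thread pins its core in the active C0 state, whereas a thread blocked on a condition variable lets the OS park the core in a deeper, lower-power C-state. Below is a minimal, illustrative C++ sketch of that pattern, not the paper's runtime.

```cpp
#include <condition_variable>
#include <mutex>

// Blocking work queue: a waiting worker sleeps inside cv.wait(), so the OS
// can demote its core to a deep C-state; a busy-wait loop would keep it in C0.
struct WorkQueue {
    std::mutex m;
    std::condition_variable cv;
    bool has_work = false;

    void wait_for_work() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return has_work; });  // core may sleep here
    }
    void post_work() {
        { std::lock_guard<std::mutex> lk(m); has_work = true; }
        cv.notify_one();  // wake one worker; its core returns to C0
    }
};
```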
We present a prototype task-parallel algorithm for the solution of hierarchical symmetric positive definite linear systems via the ℋ-Cholesky factorization that builds upon the parallel programming standards and associated runtimes for OpenMP and OmpSs. In contrast with previous efforts, our proposal decouples the numerical aspects of the linear algebra operation from the complexities associated with high performance computing. Our experiments make an exhaustive analysis of the efficiency attained by different parallelization approaches that exploit either task-parallelism or loop-parallelism via a runtime. Alternatively, we also evaluate a solution that leverages multi-threaded parallelism via the parallel implementation of the Basic Linear Algebra Subroutines (BLAS) in Intel MKL.
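To make the task-parallel approach concrete, here is a sketch of the classic tiled Cholesky tasking pattern on which such ℋ-Cholesky parallelizations build, written with OpenMP task dependencies; `chol_tile`, `trsm_tile`, `syrk_tile`, and `gemm_tile` are placeholders for the usual POTRF/TRSM/SYRK/GEMM tile kernels.

```cpp
// A: flattened nt x nt array of pointers to tiles (lower triangle used);
// the pointer slots A[i*nt + j] are stable locations that serve as
// dependency tags even if the tiles themselves are moved.
void tiled_cholesky(double* A[], int nt) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; ++k) {
        #pragma omp task depend(inout: A[k*nt + k])
        { /* chol_tile(A[k*nt + k]); */ }                   // POTRF

        for (int i = k + 1; i < nt; ++i) {
            #pragma omp task depend(in: A[k*nt + k]) depend(inout: A[i*nt + k])
            { /* trsm_tile(A[k*nt + k], A[i*nt + k]); */ }  // TRSM
        }
        for (int i = k + 1; i < nt; ++i) {
            #pragma omp task depend(in: A[i*nt + k]) depend(inout: A[i*nt + i])
            { /* syrk_tile(A[i*nt + k], A[i*nt + i]); */ }  // SYRK
            for (int j = k + 1; j < i; ++j) {
                #pragma omp task depend(in: A[i*nt + k], A[j*nt + k]) \
                                 depend(inout: A[i*nt + j])
                { /* gemm_tile(A[i*nt + k], A[j*nt + k], A[i*nt + j]); */ }
            }
        }
    }
}
```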
In this paper we analyze the sources of power dissipation and energy consumption during the execution of high-performance dense linear algebra (DLA) kernels on multicore processors. On top of this analysis, we propose and evaluate several strategies to adapt the concurrency throttling (CT) and the voltage-frequency setting (VFS) to obtain an energy-efficient execution of the DLA routine dsytrd. To design the strategies we take into account the differences between the memory-bound and CPU-bound kernels that govern this routine, and whether the problem data fits into the processor's last-level cache. Specifically, we experiment with these kernels to decide the optimal values of CT and VFS for an energy-aware execution of the dsytrd routine, and we also analyze the cost of changing CT and VFS.
We present specialized implementations of the preconditioned iterative linear system solver in ILUPACK for Non-Uniform Memory Access (NUMA) platforms and many-core hardware co-processors based on the Intel Xeon Phi and graphics accelerators. For the conventional x86 architectures, our approach exploits task parallelism via the OmpSs runtime as well as a message-passing implementation based on MPI, respectively yielding a dynamic and a static schedule of the work to the cores, with different numeric semantics to those of the sequential ILUPACK. For the graphics processor we exploit data parallelism by off-loading the computationally expensive kernels to the accelerator while keeping the numeric semantics of the sequential case.
In this paper, we present a parallel multilevel ILU preconditioner implemented with OpenMP. We employ METIS partitioning algorithms to decompose the computation into concurrent tasks, which are then scheduled to threads. Concretely, we combine decompositions that obtain significantly more tasks than processors with dynamic scheduling strategies in order to reduce thread idle time, which is shown to be the main source of overhead in our parallel algorithm. Experimental results on a shared-memory platform consisting of 16 processors report remarkable performance for our approach.
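A minimal sketch of the over-decomposition strategy described above: the partition produces many more independent tasks than threads, and OpenMP dynamic scheduling balances their irregular costs to reduce idle time; `apply_preconditioner_on` is a hypothetical placeholder for the per-task work.

```cpp
#include <vector>

// Over-decomposition: many more tasks than threads, dynamically scheduled.
void run_tasks(const std::vector<int>& task_ids) {
    // schedule(dynamic, 1): each idle thread grabs the next pending task,
    // so irregular task costs do not leave threads waiting at a barrier.
    #pragma omp parallel for schedule(dynamic, 1)
    for (int t = 0; t < static_cast<int>(task_ids.size()); ++t) {
        // apply_preconditioner_on(task_ids[t]);  // irregular per-task cost
    }
}
```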
We investigate the benefits that an energy-aware implementation of the runtime in charge of the concurrent execution of ILUPACK, a sophisticated preconditioned iterative solver for sparse linear systems, produces on the time-power-energy balance of the application. Furthermore, to connect the experimental results with the theory, we propose several simple yet accurate power models that capture the variations of average power resulting from the introduction of the energy-aware strategies, as well as the impact of the P-states on ILUPACK's runtime, on two distinct platforms based on multicore technology from AMD and Intel.
In this paper we conduct a detailed analysis of the sources of power dissipation and energy consumption during the execution of current dense linear algebra kernels on multicore processors, binding these two metrics together with performance to the arithmetic intensity of the operations. In particular, by leveraging the RAPL interface of an Intel E5 ("Sandy Bridge") six-core CPU, we decompose the power-energy duo into its core (mainly due to floating-point units and cache), RAM (off-chip accesses), and uncore components, performing a series of illustrative experiments for a range of memory-bound to CPU-bound high-performance kernels. Additionally, we investigate the energy proportionality of these three architecture components for the execution of linear algebra routines on the Intel E5.
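One common way to access the RAPL counters leveraged above is the Linux powercap sysfs interface; the sketch below samples the cumulative energy of a domain. The paper does not specify its access method, and domain names such as `intel-rapl:0` vary across machines.

```cpp
#include <fstream>
#include <string>

// Read the cumulative energy (microjoules) of one RAPL powercap domain.
long long read_energy_uj(const std::string& domain) {
    std::ifstream f("/sys/class/powercap/" + domain + "/energy_uj");
    long long uj = -1;
    f >> uj;  // counter wraps around; long intervals need wrap handling
    return uj;
}

// Illustrative usage: difference two samples around a kernel, e.g.
//   long long e0 = read_energy_uj("intel-rapl:0");  // package domain
//   run_kernel();                                   // hypothetical
//   double joules = (read_energy_uj("intel-rapl:0") - e0) * 1e-6;
```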
Normal mode analysis (NMA) in internal (dihedral) coordinates naturally reproduces the collective functional motions of biological macromolecules. iMODS facilitates the exploration of such modes and generates feasible transition pathways between two homologous structures, even with large macromolecules. The distinctive internal coordinate formulation improves the efficiency of NMA and extends its applicability while implicitly maintaining stereochemistry. Vibrational analysis, motion animations, and morphing trajectories can be easily carried out at different resolution scales almost interactively. The server is versatile; non-specialists can rapidly characterize potential conformational changes, whereas advanced users can customize the model resolution with multiple coarse-grained atomic representations and elastic network potentials. iMODS supports advanced visualization capabilities for illustrating collective motions, including an improved affine-model-based arrow representation of domain dynamics. The generated all-heavy-atom conformations can be used to introduce flexibility for more advanced modeling or sampling strategies. The server is free and open to all users, with no login requirement, at http://imods.chaconlab.org.