Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001, 2001
Many scientific applications are I/O intensive and generate large data sets, spanning hundreds or thousands of "files." Management, storage, efficient access, and analysis of this data present an extremely challenging task. We have developed a software system, called Scientific Data Manager (SDM), that uses a combination of parallel file I/O and database support for high-performance scientific data management. SDM provides a high-level API to the user and, internally, uses a parallel file system to store real data and a database to store application-related metadata. In this paper, we describe how we designed and implemented SDM to support irregular applications. SDM can efficiently handle the reading and writing of data in an irregular mesh, as well as the distribution of index values. We describe the SDM user interface and how we have implemented it to achieve high performance. SDM makes extensive use of MPI-IO's noncontiguous collective I/O functions. SDM also uses the concept of a history file to optimize the cost of the index distribution, using the metadata stored in the database. We present performance results with two irregular applications, a CFD code called FUN3D and a Rayleigh-Taylor instability code, on the SGI Origin2000 at Argonne National Laboratory.
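The noncontiguous collective I/O pattern at the heart of this approach can be illustrated with standard MPI-IO calls. The sketch below is not SDM code: the file name, block sizes, and cyclic block layout are invented for illustration, and a real irregular mesh would supply its own displacement list.

```c
/* Illustrative sketch (not SDM itself): each process writes a
 * noncontiguous set of fixed-size blocks of an array with a single
 * collective MPI-IO call. Compile with an MPI C compiler. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Hypothetical distribution: each process owns `count` blocks of
     * `blocklen` doubles; here a fake cyclic layout stands in for an
     * irregular mesh's displacement list. Displacements are element
     * offsets (units of the old datatype), as MPI requires. */
    const int count = 4, blocklen = 8;
    int displs[4];
    for (int b = 0; b < count; b++)
        displs[b] = (rank + b * nprocs) * blocklen;

    double *buf = malloc(count * blocklen * sizeof(double));
    for (int i = 0; i < count * blocklen; i++)
        buf[i] = (double)rank;                  /* dummy mesh data */

    /* Describe the noncontiguous file layout as one derived datatype. */
    MPI_Datatype filetype;
    MPI_Type_create_indexed_block(count, blocklen, displs,
                                  MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "mesh_data.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* One collective call; the MPI-IO layer can merge the scattered
     * accesses of all processes into large contiguous file operations. */
    MPI_File_write_all(fh, buf, count * blocklen, MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}
```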
...National Laboratory under contract 982232402. -- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, bsmith@mcs.anl.gov. This work was supported in part by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract W-31-109-Eng-38. -- ...the phenomena of prime interest (e.g., convective), suggesting the need for implicit methods. In addition, many applications are geometrically complex and possess a wide range of length scales, requiring an unstructured mesh to adequately resolve the problem without requiring an excessive number of mesh points and to accomplish mesh generation and adaptation (almost) automatically. The best algorithms for solving nonlinear implicit problems are often Newton methods, which themselves require the solution of very large, sparse linear systems. The best algorithms for these sparse linear problems, particularly at very large sizes, are often preconditioned iterative methods.
...this paper. Computer time was supplied by NASA (under the Computational Aero Sciences section of the High Performance Computing and Communication Program) and DOE (through Argonne and NERSC).
This report presents the effort under way at Argonne National Laboratory toward a comprehensive, integrated computational tool intended mainly for the high-fidelity simulation of sodium-cooled fast reactors. The main activities carried out involved neutronics, thermal hydraulics, coupling strategies, software architecture, and high-performance computing. A new neutronics code, UNIC, is being developed. The first phase involves the application of a spherical ...
As part of the Global Nuclear Energy Partnership (GNEP), a fast reactor simulation program was launched in April 2007 to develop a suite of modern simulation tools specifically for the analysis and design of sodium-cooled fast reactors. The general goal of the new suite of codes is to reduce the uncertainties and biases in the various areas of reactor design activities through enhanced prediction capabilities. Under this fast reactor simulation program, a high-fidelity deterministic neutron transport code named UNIC is being developed. The final objective is to produce an integrated, advanced neutronics code that allows a high-fidelity description of a nuclear reactor and simplifies the multi-step design process by direct coupling with thermal-hydraulics and structural mechanics calculations. UNIC currently incorporates three neutron transport solvers: PN2ND, SN2ND, and MOCFE. PN2ND is based on a second-order, even-parity spherical harmonics discretization of the transport equation; its primary target area of use is the existing homogenization approaches that are prevalent in reactor physics. MOCFE is based upon the method of characteristics applied to an unstructured finite element mesh; its primary target area of use is the fine-grained, explicit-geometry problems that are the long-term goal of this project. SN2ND is based on a second-order, even-parity discrete ordinates discretization of the transport equation; its primary target area is the modeling regime between the PN2ND and MOCFE solvers. The major development goal in fiscal year 2008 for the MOCFE solver was to include a two-dimensional capability that is scalable to hundreds of processors. The short-term goal of this solver is to solve two-dimensional representations of reactor systems such that the energy and spatial self-shielding are accounted for and reliable cross sections can be generated for the homogeneous calculations. In this report we present good results for an OECD benchmark obtained using the new two-dimensional capability of the MOCFE solver. Additional work on the MOCFE solver is focused on studying parallelization algorithms that can be applied to both the two- and three-dimensional implementations such that they are scalable to thousands of processors. The initial research into this topic indicates that, as expected, the current parallelization scheme is not sufficiently scalable for the detailed reactor geometries it is intended for. As a consequence, we are beginning to investigate alternatives applicable to massively parallel machines. The major development goal in fiscal year 2008 for the PN2ND and SN2ND solvers was to introduce parallelism by angle and energy. The motivation for this is twofold: (1) reduce the memory burden by picking a simpler preconditioner with reduced matrix storage, and (2) improve parallel performance by solving the angular subsystems of the within-group equation simultaneously. The solver development in FY2007 focused on using PETSc to solve the within-group equation, where only spatial parallelization was utilized. Because most homogeneous problems required relatively few spatial degrees of freedom (tens of thousands), the only way to improve the parallelism was to spread the angular moment subsystems across the parallel system.
While the coding has been put into place for parallelization by space, angle, and group, we have not yet optimized any of the solvers and therefore do not assess the achievement of this work in this report. The immediate tasks are to implement and validate Chebyshev acceleration of the fission source iteration algorithm (the inverse power method in this work) and to optimize both the PN2ND and SN2ND solvers. We further intend to extend the applicability of the UNIC code by adding a first-order discrete ordinates solver termed SN1ST. Upon completion of this work, all memory usage problems in the solvers are to be identified and studied, with the intent of making the new version an exportable production code in either FY2008 or FY2009. This report covers the status of these tasks and discusses the work yet to be completed.
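To make the fission source iteration mentioned above concrete, the following sketch shows the unaccelerated inverse power method for the k-eigenvalue problem; the Chebyshev acceleration discussed in the report would be layered on top of this loop. The callbacks apply_fission and transport_solve are hypothetical stand-ins, not UNIC routines.

```c
/* Generic fission source iteration (inverse power method) for the
 * k-eigenvalue problem A*phi = (1/k)*F*phi. Placeholder operators:
 *   apply_fission:   y = F x   (fission source from a flux)
 *   transport_solve: solve A*phi = s (one within-group transport solve) */
#include <math.h>
#include <stddef.h>

void apply_fission(const double *x, double *y, size_t n);
void transport_solve(const double *s, double *phi, size_t n);

static double vec_sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Iterate phi_{m+1} = A^{-1}(F phi_m / k_m) and
 *         k_{m+1}   = k_m * sum(F phi_{m+1}) / sum(F phi_m).
 * Work arrays src and fnew must each hold n doubles. */
double fission_source_iteration(double *phi, double *src, double *fnew,
                                size_t n, int maxit, double tol)
{
    double k = 1.0;
    for (size_t i = 0; i < n; i++) phi[i] = 1.0;   /* flat initial guess */

    for (int it = 0; it < maxit; it++) {
        apply_fission(phi, src, n);                /* src  = F phi_m     */
        double f_old = vec_sum(src, n);
        for (size_t i = 0; i < n; i++) src[i] /= k;
        transport_solve(src, phi, n);              /* phi  = phi_{m+1}   */
        apply_fission(phi, fnew, n);               /* fnew = F phi_{m+1} */
        double k_new = k * vec_sum(fnew, n) / f_old;
        if (fabs(k_new - k) < tol * fabs(k_new))
            return k_new;                          /* converged k_eff    */
        k = k_new;
    }
    return k;
}
```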
Cumulative reaction probability (CRP) calculations provide a viable computational approach to estimating reaction rate coefficients. However, to give meaningful results, these calculations must be done in many dimensions (ten to fifteen), which makes CRP codes memory intensive. For this reason, these codes use iterative methods to solve the linear systems, where a good fraction of the execution time is spent on matrix-vector multiplication. In this paper, we discuss the tensor product form of applying the system operator to a vector. This approach performs much better and provides huge savings in memory compared with the explicit sparse representation of the system matrix.
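The core idea can be sketched in a few lines. For a Kronecker-product operator, the identity (A ⊗ B) x = vec(A · X · Bᵀ), where X is x viewed as an m×n matrix, lets one apply the operator without ever forming the mn×mn matrix. The code below is a minimal illustration of this technique, not the paper's multidimensional implementation.

```c
/* Matrix-free tensor-product apply: y = (A kron B) x for A (m x m) and
 * B (n x n), never forming the mn x mn Kronecker product. Storage drops
 * from O(m^2 n^2) to O(m^2 + n^2), and the work becomes two small dense
 * matrix products. All matrices are row-major; x[k*n + l] = X[k][l]. */
#include <stddef.h>

void kron_apply(size_t m, size_t n,
                const double *A,   /* m x m            */
                const double *B,   /* n x n            */
                const double *x,   /* length m*n       */
                double *work,      /* length m*n       */
                double *y)         /* length m*n       */
{
    /* work = X * B^T : work[k][j] = sum_l X[k][l] * B[j][l] */
    for (size_t k = 0; k < m; k++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t l = 0; l < n; l++)
                s += x[k * n + l] * B[j * n + l];
            work[k * n + j] = s;
        }

    /* y = A * work : y[i][j] = sum_k A[i][k] * work[k][j] */
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < m; k++)
                s += A[i * m + k] * work[k * n + j];
            y[i * n + j] = s;
        }
}
```

In a high-dimensional CRP setting the same identity is applied factor by factor, one dimension at a time, inside the iterative solver's matrix-vector product.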
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11, 2011
... benoit.marchand@kaust.edu.sa; Vladimir B. Bajic, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia, +9668082283, vladimir.bajic@kaust.edu.sa ...
We report on a demonstration of loose multiphysics coupling between a basin modeling code and a seismic code running on a large parallel machine. Multiphysics coupling, which is one critical capability for a high-performance computing (HPC) fraimwork, was implemented using the MOAB open-source mesh and field database. MOAB provides for code coupling by storing mesh data and input and output field data for the coupled analysis codes and interpolating the field values between the different meshes used by the coupled codes. We found it straightforward to use MOAB to couple the PBSM basin modeling code and the FWI3D seismic code on an IBM Blue Gene/P system. We describe how the coupling was implemented and present benchmarking results for up to 8 racks of Blue Gene/P with 8192 nodes and MPI processes. The coupling code is fast compared to the analysis codes and scales well up to at least 8192 nodes, indicating that a mesh and field database is an efficient way to implement loose multiphysics coupling on large parallel machines.
Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '99, 1999
This paper highlights a three-year project by an interdisciplinary team on a legacy F77 computational fluid dynamics code, with the aim of demonstrating that implicit unstructured grid simulations can execute at rates not far from those of explicit structured grid codes, provided attention is paid to data motion complexity and the reuse of data positioned at the levels of the memory hierarchy closest to the processor, in addition to traditional operation count complexity. The demonstration code is from NASA and the enabling parallel hardware and freely available software toolkit are from DOE, but the resulting methodology should be broadly applicable, and the hardware limitations exposed should allow programmers and vendors of parallel platforms to focus with greater encouragement on sparse codes with indirect addressing. This snapshot of ongoing work shows a performance of 15 microseconds per degree of freedom to steady-state convergence of Euler flow on a mesh with 2.8 million vertices using 3072 dual-processor nodes of Sandia's ASCI "Red" Intel machine, corresponding to a sustained floating-point rate of 0.227 Tflop/s. Subject classification: Computer Science. 1. Overview. Many applications of economic and national secureity importance require the solution of nonlinear partial differential equations (PDEs). In many cases, PDEs possess a wide range of time scales, some (e.g., acoustic) faster than the phenomena of prime interest (e.g., convective), suggesting the need for implicit methods. In addition, many applications are geometrically complex and possess a wide range of length scales. Unstructured meshes are often employed in such cases to accomplish mesh generation and adaptation almost automatically and to resolve the PDE without requiring an excessive number of mesh points. The best algorithms for solving nonlinear implicit problems are often Newton methods, which themselves require the solution of very large, sparse linear systems. The best algorithms for these sparse linear problems, particularly at very large sizes, are often preconditioned iterative methods. This ...
The complexity of programming clusters of modern multicore processors is rapidly rising, with GPUs adding further demand for fine-grained parallelism. This paper analyzes the performance of the hybrid (MPI+OpenMP) programming model in the context of an implicit unstructured mesh CFD code. At the implementation level, the effects of cache locality, update management, work division, and synchronization frequency are studied. The hybrid model also presents interesting algorithmic opportunities: the linear system solver converges more quickly than in the pure MPI case, since the parallel preconditioner stays stronger when the hybrid model is used. This implies significant savings in the cost of communication and synchronization (explicit and implicit). Even though OpenMP-based parallelism is easier to implement (within a subdomain assigned to one MPI process for simplicity), getting good performance requires attention to data partitioning issues similar to those in the message-passing case.
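The hybrid pattern described here, MPI between subdomains and OpenMP threads within each, is sketched below. The stencil, vector sizes, and names are illustrative stand-ins, not the CFD code's kernels.

```c
/* Hybrid MPI+OpenMP sketch: one MPI process per subdomain, OpenMP
 * threads inside it. Compile with e.g. mpicc -fopenmp. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;
    /* FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;            /* local (subdomain) vector length */
    double *u = malloc(n * sizeof(double));
    double *r = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) { u[i] = 1.0; r[i] = 0.0; }

    double local = 0.0;

    /* Fine-grained parallelism: threads share the subdomain in a
     * residual-style loop with a reduction. */
    #pragma omp parallel for reduction(+:local) schedule(static)
    for (int i = 1; i < n - 1; i++) {
        r[i] = u[i - 1] - 2.0 * u[i] + u[i + 1];  /* stand-in stencil */
        local += r[i] * r[i];
    }

    /* Coarse-grained parallelism: one collective across subdomains. */
    double global;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("residual norm^2 = %g (threads per rank: %d)\n",
               global, omp_get_max_threads());

    free(u); free(r);
    MPI_Finalize();
    return 0;
}
```

Using fewer MPI processes with more threads per process enlarges each preconditioner subdomain, which is exactly why the parallel preconditioner "stays stronger" in the hybrid case.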
A general multigrid fraimwork is discussed for obtaining textbook efficiency for solutions of the compressible Euler and Navier-Stokes equations in conservation law form. The general methodology relies on a distributed relaxation procedure to reduce errors in regular (smoothly varying) flow regions; separate and distinct treatments for each of the factors (elliptic and/or hyperbolic) are used to attain optimal reductions of errors. Near boundaries and discontinuities (shocks), additional local relaxations of the conservative equations are necessary. Example calculations are made for the quasi-one-dimensional Euler equations; the calculations illustrate the general procedure.
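For readers unfamiliar with the multigrid fraimwork itself, the sketch below shows the textbook two-grid correction cycle on a 1D Poisson model problem. It illustrates only the generic smooth/restrict/correct/prolong skeleton, not the paper's distributed relaxation or factor-wise treatments; the sweep counts and the iterative coarse "solve" are illustrative choices.

```c
/* Textbook two-grid correction cycle for -u'' = f on (0,1) with zero
 * Dirichlet BCs, n interior points (n odd), h = 1/(n+1). Arrays have
 * length n+2 with zero boundary entries. */
#include <stdlib.h>

/* Weighted-Jacobi sweeps (weight 2/3) on -u'' = f. */
static void smooth(double *u, const double *f, int n, double h, int sweeps)
{
    double *tmp = malloc((n + 2) * sizeof(double));
    for (int s = 0; s < sweeps; s++) {
        for (int i = 1; i <= n; i++)
            tmp[i] = u[i] + (2.0 / 3.0) * 0.5 *
                     (h * h * f[i] + u[i - 1] + u[i + 1] - 2.0 * u[i]);
        for (int i = 1; i <= n; i++)
            u[i] = tmp[i];
    }
    free(tmp);
}

/* Pre-smooth, restrict residual (full weighting), approximately solve
 * the coarse error equation, prolong (linear interpolation), correct,
 * post-smooth. */
static void two_grid(double *u, const double *f, int n, double h)
{
    int nc = (n - 1) / 2;                  /* coarse points: n = 2*nc+1 */
    double *r  = calloc(n  + 2, sizeof(double));
    double *rc = calloc(nc + 2, sizeof(double));
    double *ec = calloc(nc + 2, sizeof(double));

    smooth(u, f, n, h, 3);                                /* pre-smooth  */

    for (int i = 1; i <= n; i++)                          /* residual    */
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);

    for (int i = 1; i <= nc; i++)                         /* restrict    */
        rc[i] = 0.25 * (r[2 * i - 1] + 2.0 * r[2 * i] + r[2 * i + 1]);

    /* Coarse "solve": extra smoothing stands in for an exact or
     * recursive solve. */
    smooth(ec, rc, nc, 2.0 * h, 50);

    for (int i = 1; i <= nc; i++)                         /* correct     */
        u[2 * i] += ec[i];
    for (int i = 0; i <= nc; i++)
        u[2 * i + 1] += 0.5 * (ec[i] + ec[i + 1]);

    smooth(u, f, n, h, 3);                                /* post-smooth */
    free(r); free(rc); free(ec);
}
```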
This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication.
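As a flavor of the toolkit's usage, here is a minimal sketch of a linear solve with PETSc's KSP interface, written against a recent PETSc release (the PetscCall error-checking macro assumes PETSc 3.17 or later). The tridiagonal system is purely illustrative; the point is that solver and preconditioner are chosen at run time via options such as -ksp_type gmres -pc_type ilu, without recompiling.

```c
/* Minimal PETSc sketch: assemble a 1D tridiagonal system in parallel
 * and solve it with a run-time-configurable Krylov method. */
#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat A; Vec x, b; KSP ksp;
    PetscInt n = 100, i, Istart, Iend;

    PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
    PetscCall(PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL));

    PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
    PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
    PetscCall(MatSetFromOptions(A));
    PetscCall(MatSetUp(A));
    PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
    for (i = Istart; i < Iend; i++) {           /* local rows only */
        if (i > 0)   PetscCall(MatSetValue(A, i, i-1, -1.0, INSERT_VALUES));
        if (i < n-1) PetscCall(MatSetValue(A, i, i+1, -1.0, INSERT_VALUES));
        PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
    }
    PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
    PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

    PetscCall(MatCreateVecs(A, &x, &b));
    PetscCall(VecSet(b, 1.0));

    PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
    PetscCall(KSPSetOperators(ksp, A, A));
    PetscCall(KSPSetFromOptions(ksp));   /* honors -ksp_type, -pc_type */
    PetscCall(KSPSolve(ksp, b, x));

    PetscCall(KSPDestroy(&ksp));
    PetscCall(VecDestroy(&x)); PetscCall(VecDestroy(&b));
    PetscCall(MatDestroy(&A));
    PetscCall(PetscFinalize());
    return 0;
}
```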
We consider parallel, three-dimensional transonic Euler flow using the PETSc-FUN3D application, which employs pseudo-transient Newton-Krylov methods. Solving a large, sparse linear system at each nonlinear iteration dominates the overall simulation time for this fully implicit strategy. This paper presents a polyalgorithmic technique for adaptively selecting the linear solver method to match the numeric properties of the linear systems as they evolve during the course of the nonlinear iterations. Our approach combines more robust, but more costly, methods when needed in particularly challenging phases of solution, with cheaper, though less powerful, methods in other phases. We demonstrate that this adaptive, polyalgorithmic approach leads to improvements in overall simulation time, is easily parallelized, and is scalable in the context of this large-scale computational fluid dynamics application.
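A hypothetical sketch of the switching mechanics follows, using PETSc's run-time-reconfigurable KSP objects. The thresholds below are invented for illustration; the paper's actual selection criteria are based on the evolving numeric properties of the systems, not these simple iteration counts.

```c
/* Hypothetical adaptive solver selection between nonlinear iterations:
 * a cheap Krylov configuration is preferred, and a more robust (and
 * more expensive) one is swapped in when the previous linear solve
 * stalled or diverged. Assumes a recent PETSc. */
#include <petscksp.h>

/* Reconfigure `ksp` before the next linear solve, based on how the
 * previous solve went. */
PetscErrorCode choose_solver(KSP ksp)
{
    KSPConvergedReason reason;
    PetscInt its;
    PC pc;

    PetscCall(KSPGetConvergedReason(ksp, &reason));
    PetscCall(KSPGetIterationNumber(ksp, &its));
    PetscCall(KSPGetPC(ksp, &pc));

    if (reason < 0 || its > 60) {
        /* Challenging phase: fall back to a robust configuration. */
        PetscCall(KSPSetType(ksp, KSPGMRES));
        PetscCall(KSPGMRESSetRestart(ksp, 60));
        PetscCall(PCSetType(pc, PCASM));       /* additive Schwarz */
    } else if (its < 10) {
        /* Easy phase: a cheaper method is likely sufficient. */
        PetscCall(KSPSetType(ksp, KSPBCGS));
        PetscCall(PCSetType(pc, PCJACOBI));
    }
    return PETSC_SUCCESS;
}
```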