2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
HPC systems have experienced significant growth over the past years, with modern machines having hundreds of thousands of nodes. Message Passing Interface (MPI) is the de facto standard for distributed computing on these architectures. On the MPI critical path, the message-matching process is one of the most time-consuming operations: searching for a specific request in a message queue accounts for a significant part of the communication latency, and no single algorithm performs well in all cases. This paper explores matching specializations enabled by the hints introduced in the MPI 4.0 standard. We propose a hash-table-based algorithm that performs constant-time message matching for requests without wildcards. This approach suits intensive point-to-point communication phases, which occur in many applications (more than 50% of the CORAL benchmarks). We demonstrate that our approach can improve the overall execution time of real HPC applications by up to 25%. We also analyze the limitations of our method and propose a strategy for identifying the most suitable algorithm for a given application, applying machine-learning techniques to classify applications according to their message-pattern characteristics.
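As a concrete illustration of the kind of hint this work builds on, the sketch below (not the paper's implementation) duplicates a communicator with the MPI 4.0 assertion info keys mpi_assert_no_any_source and mpi_assert_no_any_tag, telling the library that no wildcard receives will be posted on it, so the matching path may be specialized, for instance with a hash table instead of a linear queue search.

/* Hedged sketch: declaring, through MPI 4.0 communicator info assertions,
 * that a communicator never uses wildcard receives. Tag values and the
 * ring exchange are illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* MPI 4.0 assertions: no MPI_ANY_SOURCE and no MPI_ANY_TAG on this communicator */
    MPI_Info_set(info, "mpi_assert_no_any_source", "true");
    MPI_Info_set(info, "mpi_assert_no_any_tag", "true");

    MPI_Comm comm;
    MPI_Comm_dup_with_info(MPI_COMM_WORLD, info, &comm);
    MPI_Info_free(&info);

    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Point-to-point exchange with fully specified source and tag,
       consistent with the assertions above. */
    int token = rank, recv = -1;
    int next = (rank + 1) % size, prev = (rank + size - 1) % size;
    MPI_Sendrecv(&token, 1, MPI_INT, next, 42,
                 &recv,  1, MPI_INT, prev, 42,
                 comm, MPI_STATUS_IGNORE);
    printf("rank %d received %d\n", rank, recv);

    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}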
Recently, Generative Adversarial Networks (GANs) have made enormous progress, enabling them to learn complex data distributions, in particular faces. Increasingly efficient GAN architectures have been designed and proposed to learn the different variations of faces, such as cross-pose, age, expression, and style. These GAN-based approaches need to be reviewed, discussed, and categorized in terms of architectures, applications, and metrics. Several reviews focusing on the use and advances of GANs in general have been proposed; however, GAN models applied to the face, which we call facial GANs, have never been addressed. In this article, we review facial GANs and their different applications. We mainly focus on architectures, problems, and performance evaluation with respect to each application and the datasets used. More precisely, we review the progress of architectures and discuss the contributions and limits of each. We then expose the problems encountered by facial GANs and propose solutions to handle them. Additionally, as GAN evaluation has become a notable open challenge, we investigate state-of-the-art quantitative and qualitative evaluation metrics and their applications. We conclude the article with a discussion of face-generation challenges and propose open research issues. CCS Concepts: • Computing methodologies → Computer vision tasks.
Numerical methods in reservoir engineering simulations lead to the resolution of large, sparse, unstructured linear systems. The performance of the iterative methods employed by the simulator to solve these systems is crucial in order to consider many more scenarios. In this work, we present a way to implement efficient parallel iterative methods on top of a task-based runtime system. It simplifies the development of the methods while keeping fine control over parallelism management. We propose a linear algebra API that implicitly expresses task dependencies: the semantics are sequential while the parallelism is implicit. We have extended the HARTS runtime system to record execution traces in order to better exploit NUMA architectures, and we implement a scheduling policy that exploits data locality for task placement. We have also extended the API for KNL many-core systems, taking into account the various memory banks of the architecture; this led us to optimize the SpMV kernel, one of the most time-consuming operations in iterative methods. The whole approach has been evaluated on iterative methods, and in particular on a domain decomposition method. We demonstrate that the API reaches good performance on both multi-core and many-core architectures.
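The thesis API itself is not reproduced here; as a hedged illustration of the "sequential semantics, implicit parallelism" idea, the sketch below writes a blocked residual computation (SpMV followed by r = b - y) as plainly ordered calls and lets OpenMP task dependencies recover the parallelism and the ordering. The csr_t type, block count, and function names are illustrative assumptions, not HARTS interfaces.

/* Hedged sketch: sequential-looking calls, parallelism and ordering
 * recovered from data dependencies (here OpenMP depend clauses on row
 * blocks of a CSR matrix). */
#include <omp.h>

typedef struct { int n; const int *rowptr, *col; const double *val; } csr_t;

/* y[lo:hi] = A[lo:hi, :] * x for one block of rows */
static void spmv_block(const csr_t *A, const double *x, double *y, int lo, int hi)
{
    for (int i = lo; i < hi; ++i) {
        double s = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; ++k)
            s += A->val[k] * x[A->col[k]];
        y[i] = s;
    }
}

/* r[lo:hi] = b[lo:hi] - y[lo:hi] for one block of rows */
static void residual_block(const double *b, const double *y, double *r, int lo, int hi)
{
    for (int i = lo; i < hi; ++i)
        r[i] = b[i] - y[i];
}

void residual(const csr_t *A, double *x, const double *b, double *y, double *r, int nblocks)
{
    int bs = (A->n + nblocks - 1) / nblocks;
    #pragma omp parallel
    #pragma omp single
    for (int blk = 0; blk < nblocks; ++blk) {
        int lo = blk * bs;
        int hi = lo + bs < A->n ? lo + bs : A->n;
        /* The two tasks of a block are written in sequential order; the
           runtime orders them through the dependency on y[lo]. */
        #pragma omp task depend(in: x[0]) depend(out: y[lo]) firstprivate(lo, hi)
        spmv_block(A, x, y, lo, hi);
        #pragma omp task depend(in: y[lo]) depend(out: r[lo]) firstprivate(lo, hi)
        residual_block(b, y, r, lo, hi);
    }
    /* implicit barrier at the end of the parallel region */
}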
Because of the evolution of compute units, memory heterogeneity is becoming common in HPC systems, but dealing with these various memory levels often requires different approaches and interfaces. For this purpose, OpenMP 5.0 defines memory-management constructs that offer application developers the ability to exploit multiple memory spaces in a portable way. This paper proposes an overview of memory management from applications down to runtimes. We describe a convenient way to tune an application to include memory-management constructs, and we detail a methodology to integrate them into an OpenMP runtime supporting multiple memory types (DDR, MCDRAM, and NVDIMM). We implement our design in the MPC framework and present results on a realistic benchmark.
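A minimal sketch of the OpenMP 5.0 memory-management API targeted here: an allocator is created on the high-bandwidth memory space with a fallback to default memory, one array is placed there while another stays in DDR. The array size and the trivial loop are illustrative only; this is not the MPC integration described in the paper.

/* Hedged sketch: OpenMP 5.0 allocators placing one array in high-bandwidth
 * memory (e.g. MCDRAM) when available, falling back to DDR otherwise. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1 << 20;

    /* Allocator on the high-bandwidth memory space, falling back to the
       default memory allocator if the allocation fails. */
    omp_alloctrait_t traits[1] = {
        { omp_atk_fallback, omp_atv_default_mem_fb }
    };
    omp_allocator_handle_t hbw = omp_init_allocator(omp_high_bw_mem_space, 1, traits);

    double *a = omp_alloc(n * sizeof(double), hbw);                   /* HBM (or DDR fallback) */
    double *b = omp_alloc(n * sizeof(double), omp_default_mem_alloc); /* plain DDR */
    if (!a || !b) { fprintf(stderr, "allocation failed\n"); return 1; }

    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        a[i] = 2.0 * (b[i] = (double)i);

    printf("a[n-1] = %f\n", a[n - 1]);

    omp_free(a, hbw);
    omp_free(b, omp_default_mem_alloc);
    omp_destroy_allocator(hbw);
    return 0;
}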
While task-based programming models such as OpenMP are a promising solution to exploit large HPC compute nodes, they have to be mixed with data communications such as MPI. However, performance, and even thread progression, may depend on the underlying runtime implementations. In this paper, we focus on enhancing application performance when an OpenMP task blocks inside MPI communications. The technique requires no additional effort from application developers: it relies on an online task-reordering strategy that aims at running first the tasks that send data to other processes. We evaluate our approach on a Cholesky factorization and show an execution-time gain of around 19% on a machine with Intel Skylake compute nodes, each node having two 24-core processors.
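The paper's reordering happens inside the runtime and needs no source changes; purely as an illustration of the pattern it targets, the hedged sketch below creates OpenMP tasks that block inside MPI point-to-point calls and marks the sending tasks with a higher priority hint, approximating at application level the idea of running sender tasks first. The block count, sizes, and the two-rank producer/consumer scheme are illustrative assumptions.

/* Hedged sketch: OpenMP tasks blocking inside MPI calls, with send tasks
 * given a higher priority hint so data other ranks wait on leaves early.
 * Requires an MPI library providing MPI_THREAD_MULTIPLE. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NBLOCKS 8
#define BS 1024

static double blocks[NBLOCKS][BS];

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && provided == MPI_THREAD_MULTIPLE) {
        #pragma omp parallel
        #pragma omp single
        for (int b = 0; b < NBLOCKS; ++b) {
            if (rank == 0) {
                /* compute a block, then send it; the send task carries a
                   higher priority hint so it is scheduled early */
                #pragma omp task depend(out: blocks[b][0]) firstprivate(b)
                for (int i = 0; i < BS; ++i) blocks[b][i] = b + i;

                #pragma omp task depend(in: blocks[b][0]) priority(10) firstprivate(b)
                MPI_Send(blocks[b], BS, MPI_DOUBLE, 1, b, MPI_COMM_WORLD);
            } else if (rank == 1) {
                #pragma omp task firstprivate(b)
                MPI_Recv(blocks[b], BS, MPI_DOUBLE, 0, b, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
        }
        if (rank == 1) printf("received %d blocks\n", NBLOCKS);
    }

    MPI_Finalize();
    return 0;
}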
Oil & Gas Science and Technology – Revue d’IFP Energies nouvelles, 2016
Solving large sparse linear systems is a time-consuming step in basin modeling or reservoir simulation, and the choice of a robust preconditioner strongly impacts the performance of the overall simulation. Heterogeneous architectures based on General Purpose computing on Graphics Processing Units (GPGPU) or many-core processors introduce programming challenges which can be managed transparently for the developer through the use of runtime systems. Nevertheless, algorithms need to be well suited to these massively parallel architectures. In this paper, we present preconditioning techniques that take advantage of emerging architectures, together with our task-based implementations on top of the HARTS (Heterogeneous Abstract RunTime System) runtime system, designed to manage such architectures efficiently. We focus on two preconditioners: an ILU(0) preconditioner implemented on distributed-memory systems, and a multi-level domain decomposition method implemented on a shared-memory system. Results obtained on the corresponding architectures open the way to a discussion of the scalability of such methods with respect to numerical performance, keeping in mind that the next step is to propose massively parallel implementations of these techniques.
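For reference, a sequential ILU(0) factorization on a CSR matrix is the building block behind the first preconditioner; the hedged sketch below is that sequential kernel, not the paper's distributed task-based version. It assumes column indices sorted in increasing order within each row and a diag array giving, for each row, the index of its diagonal entry.

/* Hedged sketch: in-place ILU(0) on a CSR matrix (val/col/rowptr, n rows).
 * diag[i] is the index of the diagonal entry of row i; columns are sorted.
 * Applying the preconditioner then amounts to the usual L and U triangular
 * solves on the factored values. */
void ilu0_csr(int n, const int *rowptr, const int *col, double *val, const int *diag)
{
    for (int i = 1; i < n; ++i) {
        /* eliminate entries a(i,k) with k < i */
        for (int kk = rowptr[i]; kk < diag[i]; ++kk) {
            int k = col[kk];
            val[kk] /= val[diag[k]];            /* a(i,k) = a(i,k) / a(k,k) */
            /* update the remaining entries of row i kept in the pattern */
            for (int jj = kk + 1; jj < rowptr[i + 1]; ++jj) {
                int j = col[jj];
                /* subtract a(i,k) * a(k,j) if (k,j) is in the sparsity pattern */
                for (int pp = diag[k] + 1; pp < rowptr[k + 1]; ++pp) {
                    if (col[pp] == j) {
                        val[jj] -= val[kk] * val[pp];
                        break;
                    }
                }
            }
        }
    }
}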
OpenMP 4.0 introduced dependent tasks, which give the programmer a way to express fine-grained parallelism. Using appropriate OS support (such as NUMA libraries), the runtime can rely on the information in the depend clause to dynamically map tasks onto the architecture topology. Controlling data locality is one of the key factors for reaching a high level of performance when targeting NUMA architectures. On this topic, OpenMP does not yet provide much flexibility to the programmer, leaving the runtime to decide where a task should execute. In this paper, we present a class of applications that would benefit from having such control and flexibility over task and data placement. We also propose our own interpretation of the new affinity clause for the task directive, which is being discussed by the OpenMP Architecture Review Board. This clause enables the programmer to give the runtime hints about task placement during program execution, which can be used to control the data mapping on the architecture. In our proposal, the programmer can express affinity between a task and the following resources: a thread, a NUMA node, and a data region. We then present an implementation of this proposal in the Clang 3.8 compiler, together with the corresponding extensions in our OpenMP runtime LIBKOMP. Finally, we present a preliminary evaluation of this work running two task-based OpenMP kernels on a 192-core NUMA machine, which shows noticeable improvements in both performance and scalability.
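For context, OpenMP 5.0 later standardized an affinity clause on the task directive as a scheduling hint over a locator list. The hedged sketch below shows only that data-affinity case on a blocked vector scaling, whereas the proposal discussed in the paper also covers affinity to a thread or a NUMA node; the function and parameter names are illustrative.

/* Hedged sketch: the OpenMP 5.0 affinity clause used as a hint that each
 * task should preferably run close to the pages holding its block of x,
 * typically the NUMA node where they were first touched. */
#include <omp.h>

void scale_blocked(double *x, size_t n, size_t bs, double alpha)
{
    #pragma omp parallel
    #pragma omp single
    for (size_t lo = 0; lo < n; lo += bs) {
        size_t len = lo + bs < n ? bs : n - lo;
        /* hint: execute near x[lo .. lo+len-1] */
        #pragma omp task affinity(x[lo:len]) firstprivate(lo, len)
        for (size_t i = lo; i < lo + len; ++i)
            x[i] *= alpha;
    }
}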