MCP: SeRC Exascale Simulation Software Initiative - SESSI


 

The SeRC Steering group established a new flagship program on software for Exascale simulations. The program aims to improve the performance and scalability of selected software packages by establishing new collaborative projects for exchange of expertise among research groups in the Molecular Simulations, FLOW and DPT SeRC communities.

 

http://www.e-science.se/sites/default/files/imagecache/520px/DSC09059.jpg

SESSI meeting at the SeRC room in PDC. 

 

Many e-Science applications are increasingly dependent on large-scale compute capabilities. A subset of these applications have already developed into massively parallel codes where a single simulation can use thousands to hundreds of thousands of cores. Those capabilities have in turn enabled researchers to address a wide range of new application areas that were simply not feasible a few years ago, such as computational fluid dynamics simulations of highly complex flows around airplane wings, critical problems in high-energy physics, and molecular dynamics simulations of large biomolecular systems. The advances in these areas were made possible by major directed efforts in algorithm development and the application of modern software engineering techniques.

 

The main goal of SESSI is to investigate the highly parallel software challenges of real-world applications with very large scientific impact, in a co-design fashion spanning applications and programming systems, in order to identify the limitations of current software approaches and to develop possible solutions to these limitations.
SESSI focuses on the design, development and implementation of two classes of applications and simulation packages with a large user base and high-impact scientific drivers:
 
  • Molecular Dynamics. The code in use is GROMACS (MOL community).
  • Computational Fluid Dynamics. The code in use is Nek5000 (FLOW community).

 

Assembly-level vectorization of compute-intensive kernels in Nek5000

 

Nek5000 is a computational fluid dynamics solver based on the spectral element method. The core of the solver consists of matrix-matrix multiplication routines, in which the program spends most of its time (more than 60% in the 2D version).

 

Currently, these routines are plain Fortran routines with nested loops that compute the matrix products. The aim of this project is to speed up these routines using vectorization techniques such as SIMD instructions. A SIMD instruction operates on multiple values (e.g. 4 or 8) at once instead of a single value per instruction (for instance a multiplication or an addition), which can considerably improve code performance.
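
For illustration, here is a minimal C sketch (not Nek5000 code, which is Fortran) of how a small column-major matrix product C = A*B can be vectorized with AVX/FMA intrinsics; the matrix size N and the 4-wide register width are placeholders for what the polynomial order and target instruction set would dictate:

#include <immintrin.h>

/* Illustrative sketch: C = A * B for small column-major N x N matrices,
 * with N a multiple of 4. Each AVX register holds 4 consecutive rows of
 * a column of C, so one fused multiply-add updates 4 entries at once. */
#define N 8

static void mxm_avx(const double *A, const double *B, double *C)
{
    for (int j = 0; j < N; ++j) {             /* column of C and B      */
        for (int i = 0; i < N; i += 4) {      /* 4 rows of C at a time  */
            __m256d c = _mm256_setzero_pd();
            for (int k = 0; k < N; ++k) {
                __m256d a = _mm256_loadu_pd(&A[i + k * N]);     /* A(i:i+3,k) */
                __m256d b = _mm256_broadcast_sd(&B[k + j * N]); /* B(k,j)     */
                c = _mm256_fmadd_pd(a, b, c);                   /* needs FMA  */
            }
            _mm256_storeu_pd(&C[i + j * N], c);
        }
    }
}

GROMACS achieves a comparable effect behind a portable SIMD abstraction layer, which is one motivation for the collaboration mentioned below.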

 

The project is a collaboration with developers of GROMACS from the Molecular Simulation community, a code in which SIMD instructions are integrated and heavily used.

 

Project page: http://www.e-science.se/project/assembly-level-vectorization-compute-intensive-kernels-nek5000

 

Automation for profiling and code analysis in GROMACS

 

GROMACS is a high-performance and scalable code for molecular dynamics simulations, mainly used for studies of biomolecular systems. The codebase is highly tuned and takes advantage of several hybrid parallelization schemes, including an internal thread-MPI implementation, MPI and OpenMP, as well as CUDA, OpenCL and SIMD acceleration.

 

Proper analysis of performance and scalability issues in such a codebase is a complex task. The current project aims to automate the use of performance tools such as Extrae within GROMACS, and thus shorten the time needed to identify critical performance issues.
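
As a point of reference, manual instrumentation with Extrae's user-event API looks roughly like the sketch below; the event type, phase values and the compute_* functions are hypothetical placeholders, and the project's goal is precisely to avoid maintaining this kind of hand-written instrumentation and trace analysis:

#include "extrae_user_events.h"   /* Extrae user-event API */

/* Hypothetical event type and phase values for an MD time step. */
enum { MD_PHASE = 1000, PHASE_IDLE = 0, PHASE_NONBONDED = 1, PHASE_PME = 2 };

/* Placeholder kernels standing in for the real computation. */
static void compute_nonbonded(void) { /* ... */ }
static void compute_pme(void)       { /* ... */ }

void traced_md_loop(int nsteps)
{
    Extrae_init();                               /* start tracing */
    for (int step = 0; step < nsteps; ++step) {
        Extrae_event(MD_PHASE, PHASE_NONBONDED); /* enter non-bonded phase */
        compute_nonbonded();
        Extrae_event(MD_PHASE, PHASE_PME);       /* enter PME phase */
        compute_pme();
        Extrae_event(MD_PHASE, PHASE_IDLE);      /* close the phase */
    }
    Extrae_fini();                               /* flush and finalize the trace */
}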

 

People involved:

Xavi Aguilar (PDC)
Szilárd Páll (PDC/MolSim)

 

http://www.e-science.se/sites/default/files/imagecache/520px/DSC09074.jpg?1480511728

SESSI meeting at the SeRC room in PDC. 

 

GPU acceleration & heterogeneous parallelism

(a collaboration between the Molecular Simulation community, the EU project BioExcel and the joint NVIDIA/KTH CUDA Research Center)

 

Project page: http://www.e-science.se/project/algorithms-molecular-dynamics-heterogeneous-architectures

 

Efficient MPI communication in Nek5000

 

The algorithms implemented in the CFD solver Nek5000 scale well to very large processor counts; this has been demonstrated in strong-scaling experiments on more than one million cores. Optimized communication routines based on the Message Passing Interface (MPI) contribute to this achievement.

Recent developments in computer architecture provide increasing core counts per processor as well as new capabilities in interconnect networks, which allows communication operations to be implemented with a high degree of parallelism. The aim of this activity is to exploit such hardware features in the communication operations of Nek5000 in order to further increase program efficiency.
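
One direction, sketched below in C under the assumption that each process exchanges data with a fixed list of neighbour ranks, is to use MPI-3 neighborhood collectives over a distributed graph topology, which exposes the communication pattern to the MPI library and allows overlap with local computation; the buffer layout arguments are placeholders for what the solver's mesh connectivity would provide:

#include <mpi.h>

/* Sketch: exchange variable-sized double buffers with a fixed set of
 * neighbour ranks using MPI-3 neighborhood collectives. Neighbour lists
 * and counts are assumed to come from the mesh connectivity. */
void neighbour_exchange(MPI_Comm comm,
                        int nneigh, const int *neighbours,
                        const double *sendbuf, const int *sendcounts, const int *sdispls,
                        double *recvbuf, const int *recvcounts, const int *rdispls)
{
    MPI_Comm graph_comm;
    /* Symmetric neighbourhood: the same ranks act as sources and destinations. */
    MPI_Dist_graph_create_adjacent(comm,
                                   nneigh, neighbours, MPI_UNWEIGHTED,
                                   nneigh, neighbours, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0 /* no reordering */, &graph_comm);

    MPI_Request req;
    /* Non-blocking variant allows overlapping the exchange with local work. */
    MPI_Ineighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                            recvbuf, recvcounts, rdispls, MPI_DOUBLE,
                            graph_comm, &req);

    /* ... independent element-local computation could go here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Comm_free(&graph_comm);
}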

The project is a collaboration between the FLOW and DPT communities.

 

Fine-grained task-level parallelism for Exascale systems

 

Experience with hybrid MPI+OpenMP+CUDA parallelism in GROMACS has exposed many complex issues in balancing communication and computation load across processing units. Implementations for biomolecular MD simulation benefit greatly from spatial decomposition of the non-bonded workload, but the total workload is not homogeneous in space, and this cripples performance at high parallelism. Fundamentally, this is because the MD algorithm has been parallelized in a way that guarantees a series of synchronization and serialization points in every time step. This situation will get worse in the future, as compute nodes gain more processing units, more kinds of processing units to address, and more leaks from the von Neumann machine abstraction. Making the best use of data locality will be an ongoing challenge.

 

To tackle these problems, we are refactoring GROMACS to use a much more fine-grained task parallelism. Tasks are mapped to suitable compute units in a way that is sensitive to the priority of the task, the hotness of the data in local memory, and the alternative tasks that are available. Fortunately, the overall compute task is well defined for a large series of time steps once neighbour searching is complete, so the large-scale DAG of data flow is static. Thus, the problem reduces to pre-organizing data flow so that all compute units can work efficiently upon the available task that is of highest priority.
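
As a simplified, hypothetical illustration of dependency-driven tasking (OpenMP tasks in C, not the actual GROMACS implementation), the depend clauses below encode a small static per-step DAG so that the runtime, rather than global barriers, orders the work:

#include <stdio.h>

/* Hypothetical per-step task graph: two force tasks can run concurrently,
 * the integration task depends on both. The depend clauses describe the
 * static data flow, so no global barrier is needed between the force
 * computations and the update. */
static double f_short[3], f_long[3], x[3];

static void compute_short_range(double *f) { f[0] = f[1] = f[2] = 1.0; }
static void compute_long_range(double *f)  { f[0] = f[1] = f[2] = 2.0; }
static void integrate(const double *fs, const double *fl, double *pos)
{
    for (int i = 0; i < 3; ++i) pos[i] += fs[i] + fl[i];
}

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    for (int step = 0; step < 10; ++step) {
        #pragma omp task depend(out: f_short)
        compute_short_range(f_short);

        #pragma omp task depend(out: f_long)
        compute_long_range(f_long);

        #pragma omp task depend(in: f_short, f_long) depend(inout: x)
        integrate(f_short, f_long, x);
    }
    /* all tasks complete at the implicit barrier ending the parallel region */
    printf("x = %g %g %g\n", x[0], x[1], x[2]);
    return 0;
}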

 

http://www.e-science.se/sites/default/files/imagecache/520px/DSC09082.jpg?1480427018

SESSI meeting at the SeRC room in PDC.

 

An optimized FFT library for 3D real-valued small-size data

 

The use of the Fast Fourier Transform (FFT) is ubiquitous in computational science, with applications ranging from solving partial differential equations, to calculating convolutions, to performing spectral analysis. While several FFT libraries are already available, we focus on the development and implementation of an FFT library specifically designed to compute FFTs of 3D real-valued small-size (<= 128 elements per dimension) data at maximum performance. For such cases, our goal is to achieve better performance than existing FFT libraries.

The FFT library most used by the scientific community is FFTW, an auto-tuned library developed at MIT in the late 1990s. While FFTW has evolved over the last decades, it still does not fully support all the relevant Single-Instruction Multiple-Data (SIMD) instruction sets, e.g. AVX, AVX-512, IBM VSX and ARM NEON, and it uses SIMD only within the calculation of a single FFT. Instead, our approach is based on using SIMD to compute several small (<= 128 elements per dimension) FFTs in parallel, across several SIMD instruction sets. We focus on 3D and real-valued data because this case is highly relevant to the Particle Mesh Ewald (PME) solver in GROMACS.
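
For comparison, the FFTW baseline for this use case is the advanced ("many") interface, which batches independent small 3D real-to-complex transforms into a single plan; the grid size and batch count below are illustrative:

#include <fftw3.h>
#include <stdlib.h>

/* Baseline with FFTW's advanced interface: one plan executing a batch of
 * independent 3D real-to-complex transforms of size n0 x n1 x n2. */
int main(void)
{
    const int n0 = 64, n1 = 64, n2 = 64;            /* <= 128 per dimension   */
    const int howmany = 16;                         /* independent grids      */
    const int rsize = n0 * n1 * n2;                 /* real values per grid   */
    const int csize = n0 * n1 * (n2 / 2 + 1);       /* complex values per grid */

    double       *in  = fftw_alloc_real((size_t)howmany * rsize);
    fftw_complex *out = fftw_alloc_complex((size_t)howmany * csize);

    int n[3] = { n0, n1, n2 };
    fftw_plan plan = fftw_plan_many_dft_r2c(3, n, howmany,
                                            in,  NULL, 1, rsize,
                                            out, NULL, 1, csize,
                                            FFTW_MEASURE);

    for (int i = 0; i < howmany * rsize; ++i) in[i] = (double)i;  /* dummy data */
    fftw_execute(plan);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}

As noted above, FFTW applies SIMD within each transform; the proposed library instead aims to keep one small transform per SIMD lane and advance several transforms in lock-step.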

The library is designed for rapid deployment in GROMACS for the PME calculation and will provide a Fortran interface for use in Nek5000.


People involved:
Stefano Markidis (PDC)
Vishnu Suresh Raju (PDC)
Xingjiang Yu (CST)
Mark Abraham (MolSim)

 

OpenACC for Nek5000

 

Nek5000 is an open-source code for the simulation of incompressible flows. It is widely used in a broad range of applications, including the study of thermal hydraulics in nuclear reactor cores, the modeling of ocean currents and the simulation of combustion in mechanical engines.
Architectures on the path to exascale are increasingly prevalent in the Top500 list, with CPU-based nodes enhanced by accelerators or coprocessors optimized for floating-point calculations. We have previously presented a case study of partially porting Nek5000 to parallel GPU-accelerated systems. In this project, we will build on that work and take advantage of the optimized results to port the full version of Nek5000 to GPU-accelerated systems, in particular the Pn-Pn-2 algorithm. The project will include a technology watch on heterogeneous programming models and their impact on exascale architectures (e.g. GPU-accelerated systems). Within the project we will also study the performance characteristics of Nek5000 on such architectures. This is a collaboration project with Argonne National Laboratory.
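A minimal sketch of the directive-based offloading approach (written in C for brevity; the actual Nek5000 kernels are Fortran, and the generic matrix-product loop below stands in for the tensor-product operator evaluation):

/* Sketch of offloading a small dense matrix product to a GPU with OpenACC.
 * a, b, c are column-major n x n matrices. */
void mxm_acc(const double * restrict a, const double * restrict b,
             double * restrict c, int n)
{
    #pragma acc data copyin(a[0:n*n], b[0:n*n]) copyout(c[0:n*n])
    {
        #pragma acc parallel loop collapse(2)
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < n; ++i) {
                double sum = 0.0;
                #pragma acc loop seq
                for (int k = 0; k < n; ++k)
                    sum += a[i + k * n] * b[k + j * n];
                c[i + j * n] = sum;
            }
        }
    }
}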
 
People:
Evelyn Otero (KTH Mechanics AE)
Jing Gong (PDC AE)
Misun Min (ANL)
Philipp Schlatter (KTH Mechanics)
 
Milestones
M1: Porting the Pn-Pn-2 algorithm in Nek5000 to GPUs using OpenACC directives (January 2018)
M2: Profiling and performance tests for large simulations (March 2018)
M3: Optimizing the OpenACC directives and employing more advanced tools such as GPUDirect to obtain better performance (April 2018)
M4: Applying for a project on a Tier-0 system and finally testing strong scalability (June 2018)
 

Overcoming I/O limitations on exascale architectures

 

With larger systems and application scales, I/O is increasingly becoming a bottleneck. This is particularly true when global system states need to be saved regularly for checkpointing and analysis purposes, as in computational fluid dynamics applications. In this project we analyse the I/O performance of Nek5000 (a spectral element CFD code) and implement parallel I/O strategies to improve I/O performance and scaling. The method considered is compression based on the Discrete Legendre Transform (DLT) with truncation driven by a priori error estimation, which allows full control over the error considered permissible. Note that this is an improvement with respect to previous compression algorithms. We assess the impact of the compression in different situations such as computation of turbulence statistics, flow visualization, vortex identification and spectra, as well as restarts from compressed data fields. It has been shown that compression ratios of up to about 97% are acceptable.
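Schematically, and assuming the element data have already been transformed to Legendre coefficient space, the truncation step amounts to dropping coefficients below a tolerance derived from the a priori error estimate; the function below is an illustrative sketch, not the Nek5000 implementation:

#include <math.h>

/* Schematic compression step: after a Discrete Legendre Transform of an
 * element's data, coefficients whose magnitude falls below the tolerance
 * derived from the a priori error estimate are dropped; only the surviving
 * coefficients would be encoded and written to disk. */
double truncate_coefficients(double *coeff, int ncoeff, double tol)
{
    int kept = 0;
    for (int i = 0; i < ncoeff; ++i) {
        if (fabs(coeff[i]) < tol)
            coeff[i] = 0.0;        /* discarded: not stored in the output */
        else
            ++kept;
    }
    /* Return the fraction of coefficients discarded (compression ratio). */
    return 1.0 - (double)kept / (double)ncoeff;
}

The surviving coefficients are then encoded (bitwise encoding is planned in M3 below) and written using the parallel I/O strategies mentioned above.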
 
People:
Evelyn Otero (KTH Mechanics AE)
Oana Marin (ANL)
Ricardo Vinuesa (KTH Mechanics)
Philipp Schlatter (KTH Mechanics)
Erwin Laure (KTH PDC)
 
 
Milestones
M1: Finalize the analysis of the impact of the truncation error on data analysis, and write a paper (October 2017)
M2: Impact of compression on low-order models based on POD (December 2017)
M3: Implementation of compression in Nek5000, with bitwise encoding, optimized for Cray architectures (February 2018)
M4: Interface to visualisation software for on-the-fly decoding (March 2018)
M5: Optimisation of I/O and memory access on KNL (May 2018)
 

Investigation of Communication Kernel in Nek5000

 

An important advantage of the spectral element method is its meshing flexibility, which comes from the spatial decomposition of the simulation domain into a set of non-overlapping sub-domains. Solution continuity is ensured by frequent exchanges of the field values between elements at the shared faces, edges and vertices. This is done by a so-called gather-scatter operator, which in Nek5000 is implemented in gslib using MPI-1 features. In collaboration with the University of Edinburgh (within the ExaFLOW project) we have developed a new version of the communication kernel for Nek5000 based on the PGAS parallel programming model. The new library is written in UPC and is currently being tested. In the future we are going to explore further possibilities, e.g. the new communication features of MPI-3.
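Among the MPI-3 features of interest is one-sided communication; the sketch below shows how a gather-scatter-style summation could be expressed with passive-target RMA and MPI_Accumulate. The ownership map and displacements are placeholders for what gslib derives from the mesh connectivity, and this is an exploratory sketch rather than the library's implementation:

#include <mpi.h>

/* Schematic one-sided gather-scatter: each process adds its contributions
 * for shared points directly into the owner's window using MPI_Accumulate
 * with MPI_SUM. */
void gs_sum_rma(double *owned_values, int nowned,
                const double *contrib, const int *owner_rank,
                const MPI_Aint *owner_disp, int ncontrib)
{
    MPI_Win win;
    MPI_Win_create(owned_values, (MPI_Aint)nowned * sizeof(double),
                   sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Passive-target epoch: no matching calls are needed on the target side. */
    MPI_Win_lock_all(0, win);
    for (int i = 0; i < ncontrib; ++i) {
        MPI_Accumulate(&contrib[i], 1, MPI_DOUBLE,
                       owner_rank[i], owner_disp[i], 1, MPI_DOUBLE,
                       MPI_SUM, win);
    }
    MPI_Win_flush_all(win);   /* complete locally issued accumulates */
    MPI_Win_unlock_all(win);

    /* Synchronize so owners only read after everyone has contributed. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
}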
 
People involved:
Niclas Jansson (PDC AE)
Adam Peplinski (KTH Mechanics, FLOW AE)
Stefano Markidis (PDC)
 
Milestones:
M1: Integration of PGAS based gslib with Nek5000 (December 2017)
M2: Exploring new MPI-3 features in the context of gslib (April 2018)
 

Runtime Profiling and Automation of Projections

 

One of the important methods used by Nek5000 to speed up the solution of the linear problem Ax=b with iterative solvers is residual projection, in which solutions (x) and right-hand sides (b) from previous time steps are used to build a projection space for the current solution/right-hand-side pair. This reduces the number of iterations at every time step, as one solves only for the component of the solution orthogonal to the projection space. Unfortunately, this method was found to be computationally expensive for some test cases due to non-optimal data transfer during an orthogonalisation step. We have improved this data transfer and are currently testing our code. In the next step we are going to work on the automation of this method, so that the size of the projection space is changed dynamically during the simulation.
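Schematically, and assuming the stored solutions have been A-orthonormalized (which is what the orthogonalisation step provides), the projection amounts to the sketch below; the routine is illustrative C, not the Nek5000 Fortran implementation:

/* Schematic residual projection: given k previous solutions X[0..k-1]
 * that are A-orthonormal (x_i^T A x_j = delta_ij), form the initial guess
 * xbar = sum_i (x_i^T b) x_i and the reduced right-hand side r = b - A*xbar.
 * The iterative solver then only works on the part of the solution
 * orthogonal (in the A inner product) to the stored space. */
void project_initial_guess(int n, int k,
                           const double *const *X,  /* k stored solutions */
                           const double *const *AX, /* A applied to each  */
                           const double *b,
                           double *xbar, double *r)
{
    for (int j = 0; j < n; ++j) { xbar[j] = 0.0; r[j] = b[j]; }

    for (int i = 0; i < k; ++i) {
        double alpha = 0.0;                      /* alpha_i = x_i^T b       */
        for (int j = 0; j < n; ++j) alpha += X[i][j] * b[j];
        for (int j = 0; j < n; ++j) {
            xbar[j] += alpha * X[i][j];          /* accumulate projection   */
            r[j]    -= alpha * AX[i][j];         /* r = b - A*xbar          */
        }
    }
    /* The iterative solver is then applied to A*dx = r, and x = xbar + dx. */
}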
 
People involved:
Niclas Jansson (PDC AE)
Philipp Schlatter (KTH Mechanics)
Nicolas Offermans (KTH Mechanics)
Oana Marin (ANL MCS)
 
Milestones:
M1: Testing of improved projection code (Dec 2017)
M2: Implementation of automated adaptation of projection space (Apr 2018)
 

Refactoring of Nek5000

 

Nek5000 is a scalable open-source code for CFD modelling with a long development history starting in the 1970s. It is written in FORTRAN77 and C and makes extensive use of Fortran features such as implicit data typing, common blocks and equivalence. A number of these are deprecated in the newer Fortran standards (90, 95 and 2003), which makes the Nek5000 coding style inconsistent with modern practice and difficult to work with. In cooperation with ANL we have initiated a process of refactoring Nek5000, starting with improvements to the programming environment (e.g. a GitHub repository, Travis and Jenkins testing suites) and removal of the most significant coding constraints (e.g. by enforcing implicit none). We have also built our own platform for developing, testing and sharing code packages that work on top of Nek5000. An important aspect of this platform is proper code documentation, which is available online and is based on Doxygen.
 
People involved:
Adam Peplinski (KTH Mechanics, FLOW AE)
Philipp Schlatter (KTH Mechanics)
Jing Gong (PDC AE)
Stefano Markidis (PDC)
 
Milestones:
M0: Encourage Doxygen development among students (October 2017)
M1: Adding dynamic timers to the framework (November 2017)
M2: Completing framework development documentation (December 2017)
M3: Including linear stability and statistics packages in the framework (February 2018)
M4: Adding Doxygen documentation describing the solver algorithm to Nek5000 (March 2018)
M5: Exploring MPI-3 features for mesh partitioning within Nek5000 (June 2018)
M6: Exploring F90 constructs for replacing common blocks (September 2018)
 

Code optimization for on-node performance: SIMD and LIBXSMM for small dense matrix-matrix multiplications, and streaming stores for optimizing cache-to-memory operations

 

In the Nek5000 code, the tensor-product-based operator evaluation can be implemented as small dense matrix-matrix multiplications, and the routines calculating these matrix products dominate the execution time of Nek5000. The LIBXSMM package is optimized for large polynomial orders, while the new SIMD approach targets the low polynomial order regime. Streaming-store low-level implementations have been introduced in Nek5000 for the Helmholtz operator, giving a speedup of 15%; extensions to other operators should be considered. Following our previous results [1], we continue the optimization of the matrix-matrix multiplications using SIMD and the LIBXSMM package in Nek5000. This project is also a collaboration with the other SESSI project "OpenACC for Nek5000" and will also emphasize performance on exascale-era architectures such as the Intel Knights Landing (KNL) system.
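
A brief sketch of the two ingredients is given below: a JIT-dispatched small GEMM kernel via LIBXSMM's dispatch interface, and a loop that writes results with non-temporal (streaming) stores. The matrix size, alignment assumptions and the surrounding function are illustrative, not taken from Nek5000:

#include <libxsmm.h>
#include <immintrin.h>

/* Sketch: a JIT-dispatched 8x8x8 double-precision GEMM kernel from LIBXSMM,
 * plus a loop writing an output array with non-temporal (streaming) stores
 * so that write-once data bypasses the cache hierarchy. */
void small_mxm_and_stream(const double *a, const double *b, double *c,
                          const double *in, double *out, int n)
{
    /* NULL arguments select the library defaults for the leading dimensions,
       alpha/beta, flags and prefetch strategy. */
    libxsmm_dmmfunction kernel =
        libxsmm_dmmdispatch(8, 8, 8, NULL, NULL, NULL, NULL, NULL, NULL, NULL);
    if (kernel)
        kernel(a, b, c);                  /* small dense matrix product */

    /* Streaming stores: assumes 'out' is 32-byte aligned and n a multiple of 4. */
    for (int i = 0; i < n; i += 4) {
        __m256d v = _mm256_loadu_pd(&in[i]);
        _mm256_stream_pd(&out[i], v);     /* non-temporal store, bypasses cache */
    }
    _mm_sfence();                         /* order streamed stores before later use */
}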
 
 
People involved:
Jing Gong (PDC AE)
Philipp Schlatter (KTH Mechanics)
Adam Peplinski (KTH Mechanics, FLOW AE)
Nicolas Offermans (KTH Mechanics)
Evelyn Otero (KTH Mechanics AE)
Berk Hess (MolSim)
Stefano Markidis (KTH PDC)
Oana Marin (ANL)
Michel Schanen (ANL)
 
 
Milestones:
M1: Apply performance analysis tools to the subroutines (December 2017)
M2: Further optimize the SIMD and LIBXSMM kernels for Nek5000 (March 2018)
M3: Implement streaming-store strategies in other parts of the Nek5000 code (June 2018)