SeRC Exascale Simulation Software Initiative - SESSI


The SeRC Steering Group has established a new flagship program on software for Exascale simulations. The program aims to improve the performance and scalability of selected software packages by establishing new collaborative projects for the exchange of expertise among research groups in the Molecular Simulation, FLOW and DPT SeRC communities.

 

http://www.e-science.se/sites/default/files/imagecache/520px/DSC09059.jpg

SESSI meeting at the SeRC room in PDC. 

 

Many e-Science applications depend more and more on large-scale compute capabilities. A subset of these applications have already developed into massively parallel codes in which a single simulation can use thousands to hundreds of thousands of cores. Those capabilities have in turn enabled researchers to address a wide range of new application areas that were simply not feasible a few years ago, such as computational fluid dynamics simulations of highly complex flows (for example around airplane wings), critical problems in high-energy physics, and molecular dynamics simulations of large biomolecular systems. The advances in those areas were made possible by major directed efforts in algorithm development and in the application of modern software engineering techniques.

 

The main goal of SESSI is to investigate the highly parallel software challenges in real-world applications with very large scientific impact, in a co-design fashion spanning applications and programming systems, in order to identify the limitations of current software approaches and to develop possible solutions to them.
SESSI focuses on the design, development and implementation of two classes of applications and simulation packages with a large user base and high-impact scientific drivers:
 
  • Molecular Dynamics. The code in use is GROMACS (MOL community).
  • Computational Fluid Dynamics. The code in use is Nek5000 (FLOW community).

 

Assembly-level vectorization of compute-intensive kernels in Nek5000

 

Nek5000 is a computational fluid dynamics solver based on the spectral element method. The core of the program consists of matrix-matrix multiplication routines, in which it spends most of its run time (more than 60% in a 2D version).

 

Currently, the routines are plain Fortran routines that compute the matrix multiplications with nested loops. The aim of this project is to speed up these routines using vectorization techniques such as SIMD instructions. The principle of a SIMD instruction is to handle multiple values (e.g. four) at once instead of one per instruction (for instance a multiplication or an addition), thus considerably improving code performance.
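
As a rough illustration of the idea (not the actual Nek5000 routines, whose matrix sizes and data layout may differ), the sketch below contrasts a plain triple-loop matrix product with a version using AVX and FMA intrinsics, in which each arithmetic instruction operates on four double-precision values:

    // Illustration only: dense C = A * B on small row-major matrices, in the
    // spirit of spectral-element tensor products. Not Nek5000 source code.
    // Requires a CPU and compiler flags supporting AVX and FMA.
    #include <immintrin.h>

    // Plain nested loops: one multiply-add per scalar per iteration.
    void matmul_scalar(const double* A, const double* B, double* C, int n)
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int k = 0; k < n; ++k)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }

    // SIMD version: each instruction handles 4 doubles of a row of C.
    // Assumes n is a multiple of 4 to keep the example short.
    void matmul_avx(const double* A, const double* B, double* C, int n)
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; j += 4) {
                __m256d acc = _mm256_setzero_pd();
                for (int k = 0; k < n; ++k) {
                    __m256d a = _mm256_broadcast_sd(&A[i*n + k]); // A(i,k) in all 4 lanes
                    __m256d b = _mm256_loadu_pd(&B[k*n + j]);     // 4 entries of row k of B
                    acc = _mm256_fmadd_pd(a, b, acc);             // acc += a * b
                }
                _mm256_storeu_pd(&C[i*n + j], acc);
            }
    }

The Nek5000 kernels themselves are written in Fortran and operate on small element-sized matrices, so the actual implementation may look quite different; the snippet only shows the principle of processing several values per instruction.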

 

The project is carried out in collaboration with developers of GROMACS from the Molecular Simulation community, a code in which SIMD operations are integrated and heavily used.

 

Project page: http://www.e-science.se/project/assembly-level-vectorization-compute-intensive-kernels-nek5000

 

Automation for profiling and code analysis in GROMACS

 

GROMACS is a high-performance and scalable code for molecular dynamics simulations, mainly used for studies of biomolecular systems. The codebase is highly tuned and takes advantage of several hybrid parallelization schemes, including an internal thread-MPI implementation, MPI and OpenMP, as well as CUDA and SIMD-level acceleration.

 

Proper analysis of performance and scalability issues in such a codebase is a complex task. The current project aims to automate the use of performance tools such as Extrae within GROMACS, and thus shorten the time needed to identify critical performance issues.
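
As a hedged sketch of what such instrumentation can look like, the example below marks code regions with Extrae's C user-event API (Extrae_init, Extrae_event and Extrae_fini from extrae_user_events.h); the event type number and the region names are arbitrary choices made for this illustration and are not GROMACS or project conventions:

    // Sketch of manual region instrumentation with Extrae user events.
    // The numeric event type and the region values are placeholders.
    #include <extrae_user_events.h>

    static const unsigned REGION_EVENT = 1000;        // arbitrary event type id
    enum Region { OUTSIDE = 0, FORCE = 1, PME = 2 };  // hypothetical regions

    static void md_step()
    {
        Extrae_event(REGION_EVENT, FORCE);   // enter "force" region
        // ... short-range force computation would go here ...
        Extrae_event(REGION_EVENT, OUTSIDE); // leave region

        Extrae_event(REGION_EVENT, PME);     // enter "PME" region
        // ... long-range electrostatics would go here ...
        Extrae_event(REGION_EVENT, OUTSIDE);
    }

    int main()
    {
        Extrae_init();   // start tracing
        for (int step = 0; step < 100; ++step)
            md_step();
        Extrae_fini();   // flush the trace for analysis in Paraver
        return 0;
    }

In production runs Extrae is usually attached via LD_PRELOAD so that MPI and OpenMP activity is traced without code changes; the point of the automation is to set up this kind of tracing and trace collection from within GROMACS rather than by hand for every run.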

 

http://www.e-science.se/sites/default/files/imagecache/520px/DSC09074.jpg?1480511728

SESSI meeting at the SeRC room in PDC. 

 

GPU acceleration & heterogeneous parallelism

(a collaboration between the Molecular Simulation community, the EU project ScalaLife and the joint NVIDIA/KTH CUDA Research Center)

 

Project page: http://www.e-science.se/project/algorithms-molecular-dynamics-heterogeneous-architectures

 

Efficient MPI communication in Nek5000

 

The algorithms implemented in the CFD solver Nek5000 scale well to very large processor counts, which has so far been demonstrated in strong-scaling runs on more than one million cores. Optimized communication routines using the Message Passing Interface (MPI) contribute to this achievement.

Recent developments in computer architecture provide increasing core counts per processor as well as new capabilities in interconnect networks. This allows communication operations to be implemented with a high degree of parallelism. The aim of this activity is to exploit such hardware features in the communication operations of Nek5000 in order to further increase program efficiency.
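
One way such hardware capabilities can be exploited is through non-blocking MPI operations, which the network can progress while the processor keeps computing. The sketch below shows the general overlap pattern with a non-blocking collective; it is an illustration of the idea only, not Nek5000's actual gather-scatter code:

    // Overlap pattern: start a non-blocking collective, do independent local
    // work while the exchange is in flight, then wait before using the result.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        std::vector<double> local(1024, 1.0), global(1024, 0.0);
        MPI_Request req;

        // Non-blocking element-wise sum over all ranks.
        MPI_Iallreduce(local.data(), global.data(), 1024, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        // ... element-interior work that does not depend on 'global' ...

        MPI_Wait(&req, MPI_STATUS_IGNORE);  // results are needed from here on

        MPI_Finalize();
        return 0;
    }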

The project is a collaboration between the FLOW and DPT communities.

 

Fine-grained task-level parallelism for Exascale systems

 

Experience with hybrid MPI+OpenMP+CUDA parallelism in GROMACS has exposed many complex issues in balancing communication and computation load across processing units. Implementations for biomolecular MD simulation benefit greatly from spatial decomposition of the non-bonded workload, but the total workload is not homogeneous in space, and this cripples performance at high parallelism. Fundamentally, this is because the MD algorithm has been parallelized in a way that guarantees a series of synchronization and serialization points in every time step. This situation will get worse in the future, as compute nodes gain more processing units, more kinds of processing units to address, and more leaks from the von Neumann machine abstraction. Making the best use of data locality will be an ongoing challenge.

 

To tackle these problems, we are refactoring GROMACS to use a much more fine-grained task parallelism. Tasks are mapped to suitable compute units in a way that is sensitive to the priority of the task, the hotness of the data in local memory, and the alternative tasks that are available. Fortunately, the overall compute task is well defined for a large series of time steps once neighbour searching is complete, so the large-scale DAG of data flow is static. Thus, the problem reduces to pre-organizing data flow so that all compute units can work efficiently upon the available task that is of highest priority.
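
The sketch below illustrates the flavour of such fine-grained tasking using OpenMP task priorities and data dependences; the task names, priorities and dependences are hypothetical and do not correspond to GROMACS internals or to the framework ultimately chosen for the refactoring:

    // Toy static DAG of per-step work expressed as prioritized OpenMP tasks.
    #include <cstdio>

    int main()
    {
        float bonded = 0.0f, nonbonded = 0.0f, forces = 0.0f;

        #pragma omp parallel
        #pragma omp single
        {
            // High priority: this task is on the critical path of the step.
            #pragma omp task priority(10) depend(out: nonbonded)
            { nonbonded = 1.0f; /* short-range kernel would run here */ }

            // Lower priority: can fill otherwise idle compute units.
            #pragma omp task priority(1) depend(out: bonded)
            { bonded = 0.5f; /* bonded interactions would run here */ }

            // Runs only once both inputs are ready.
            #pragma omp task depend(in: bonded, nonbonded) depend(out: forces)
            { forces = bonded + nonbonded; }

            #pragma omp taskwait
            std::printf("combined force term: %f\n", forces);
        }
        return 0;
    }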

 

http://www.e-science.se/sites/default/files/imagecache/520px/DSC09082.jpg?1480427018

SESSI meeting at the SeRC room in PDC.

 

Single-Instruction Multiple-Data and Fast-Communicating 3D-FFT Library

 

Modern CPUs achieve high performance through single-instruction multiple-data (SIMD) units. Compute-intensive parts of simulation codes often make heavy use of this, either through libraries (e.g. for linear algebra) or through compiler- or programmer-generated SIMD instructions. Efficient molecular dynamics codes, such as GROMACS, use SIMD heavily in their compute kernels. However, 3D FFTs, which are often on the critical path, are not well optimized for SIMD. For very large FFTs, standard FFT libraries such as FFTW provide some SIMD acceleration, but MD typically needs many short FFTs.

The solution to this problem is to use data from multiple 1D FFTs to fill the SIMD lanes, instead of applying SIMD instructions within a single 1D FFT. This should provide full SIMD acceleration. To achieve it, the basic elements of a 1D FFT should be coded with SIMD instructions and the data should be stored interleaved. For a 3D FFT, the transposes between dimensions should also take this interleaving into account. For good performance we need support for at least radix 2, 3, 4 and 5, and potentially also 7 and 8. The library can make use of the SIMD framework of GROMACS, which can be unbundled from GROMACS. This framework supports all currently relevant SIMD instruction sets, e.g. AVX, AVX-512, IBM VSX and ARM Neon, through a common, simple interface. Such an FFT library would be useful for many simulation codes worldwide.
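
A minimal sketch of the interleaving idea, assuming split real/imaginary storage and plain AVX intrinsics rather than the GROMACS SIMD framework, is shown below: four independent 1D transforms are stored lane by lane, so one radix-2 butterfly sequence advances all four transforms at once.

    // Illustration only: one radix-2 butterfly applied to 4 interleaved
    // transforms, x[p] <- x[p] + w*x[q] and x[q] <- x[p] - w*x[q],
    // where lane t of each vector belongs to transform t.
    #include <immintrin.h>

    void butterfly4(double* re, double* im, int p, int q, double wr, double wi)
    {
        __m256d ar = _mm256_loadu_pd(&re[4*p]), ai = _mm256_loadu_pd(&im[4*p]);
        __m256d br = _mm256_loadu_pd(&re[4*q]), bi = _mm256_loadu_pd(&im[4*q]);
        __m256d vwr = _mm256_set1_pd(wr), vwi = _mm256_set1_pd(wi);

        // t = w * x[q]: complex multiply, with the same twiddle in every lane.
        __m256d tr = _mm256_sub_pd(_mm256_mul_pd(vwr, br), _mm256_mul_pd(vwi, bi));
        __m256d ti = _mm256_add_pd(_mm256_mul_pd(vwr, bi), _mm256_mul_pd(vwi, br));

        _mm256_storeu_pd(&re[4*p], _mm256_add_pd(ar, tr));
        _mm256_storeu_pd(&im[4*p], _mm256_add_pd(ai, ti));
        _mm256_storeu_pd(&re[4*q], _mm256_sub_pd(ar, tr));
        _mm256_storeu_pd(&im[4*q], _mm256_sub_pd(ai, ti));
    }

A full library would build complete radix-2/3/4/5 stages and the 3D transposes on top of such primitives, keeping the interleaved layout across all three dimensions.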

 

People involved:

Mark Abraham (MolSim)
Igor Merkulow (PDC)
Stefano Markidis (CST)
 

OpenACC optimization of Nek5000 [ENDED]

 

OpenACC enables existing HPC application codes to run on accelerators with minimal source code changes. This is done using compiler directives (specially formatted comments) and API calls, with the compiler being responsible for generating optimized code and the user guiding performance only where necessary.
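
As a minimal, generic example of the directive style (not a Nek5000 kernel), the loop below is offloaded with a single OpenACC pragma, with data clauses describing how the arrays move between host and accelerator:

    // Generic OpenACC example: the compiler generates the accelerator code,
    // guided only by the directive and its data clauses.
    void axpy(int n, double a, const double* x, double* y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }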

When optimizing the OpenACC code for Nek5000, it is important to understand the computational cost of the original subroutines. For this reason, we carry out detailed loop-level profiling of the computational cost of the subroutines in the project. We also focus on optimizing the gs_op operator, where the global MPI communication in Nek5000 takes place, for multi-GPU systems.
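
A hedged sketch of what loop-level cost measurement can look like is given below; it simply accumulates wall-clock time around a candidate loop nest so the relative cost of different kernels can be compared, and it is not the actual profiling harness used in the project:

    // Time one candidate loop nest with a steady (monotonic) clock.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int n = 512;
        std::vector<double> a(n * n, 1.0), b(n * n, 2.0);
        double sum = 0.0;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i)          // candidate kernel
            for (int j = 0; j < n; ++j)
                sum += a[i * n + j] * b[i * n + j];
        auto t1 = std::chrono::steady_clock::now();

        std::chrono::duration<double> dt = t1 - t0;
        std::printf("kernel: %.6f s (checksum %.1f)\n", dt.count(), sum);
        return 0;
    }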