Sessi MCP: Assembly-level vectorization of compute-intensive kernels in Nek5000


Introduction

Nek5000 is a computational fluid dynamics solver for the Navier-Stokes equations based on the spectral element method. This method is a compromise between high accuracy and geometric flexibility. However, this flexibility has a price. The solver spends most of its time doing matrix-matrix multiplications.

http://www.e-science.se/sites/default/files/imagecache/520px/mxm_for.png

The routine shown above is a direct implementation of matrix-matrix multiplication, but it will be very time consuming:

- the compiler will have a hard time optimizing it;
- no hint is given for the values of the loop parameters;
- unless otherwise specified, the compiler won’t take advantage of the underlying architecture.

1. Taking advantage of the system architecture

Compilers usually create binaries for a class of architectures that share a common instruction set (for instance x86), and thus do not take advantage of the details of the specific processor being used. As a consequence, the compiler often does not use features that would let the code run faster — in our case, vectorization. Recent processors provide 16 256-bit vector registers (AVX/AVX2), and the next generation will provide 32 512-bit registers (AVX-512), together with a set of instructions specific to vector computations.

Using these new features means using close-to-assembly (SIMD) instructions. With the proper compilation flags, most compilers are able to emit them automatically, but they first have to identify the places in the code where vectorization is applicable.

2. “Unrolling” to give hints to the compiler

One way to help the compiler optimize the code is to “unroll” it, that is, for instance, to generate one specialized function for each inner-loop trip count, or at least for the most common ones.

3. Our strategy

To optimize the matrix-matrix multiplication routine in Nek5000 we worked on both aspects: taking maximum advantage of the underlying architecture by using SIMD instructions, and helping the compiler by unrolling the different loops involved in the routine.

4. Is there actually room for improvement?

Our first tests, run on Intel Haswell processors, showed an effective performance of 4 Gflops (giga floating-point operations per second) before modification, while the theoretical peak performance is 60 Gflops, i.e. less than 7% of peak. The first tests using unrolling give a factor-of-4 speedup for the routine (16 Gflops, roughly a quarter of peak), so there is clearly room for improvement.