SESSI MCP: Code optimization for on-node performance: SIMD and LIBXSMM for small dense matrix-matrix multiplications, streaming stores for optimizing cache-to-memory operations

In the Nek5000 code, the tensor-product-based operator evaluation can be implemented as small dense matrix-matrix multiplications, and the routines that compute these products dominate the execution time of Nek5000. The LIBXSMM package is optimized for large polynomial orders, while the new SIMD approach targets the low-polynomial-order regime. Low-level streaming-store implementations have been introduced in Nek5000 for the Helmholtz operator, giving a speedup of 15%; extensions to other operators should be considered further. Following our previous results [1], we continue to optimize matrix-matrix multiplication in Nek5000 using SIMD and the LIBXSMM package. This project is also a collaboration with the other SESSI project “OpenACC for Nek5000” and will also emphasize performance on exascale-oriented architectures such as the Intel Knights Landing (KNL) system.
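
To illustrate how a tensor-product operator reduces to small dense matrix-matrix multiplications, the following is a minimal C sketch, not the Nek5000 source. The names small_mxm and tensor_grad3 are hypothetical, the data are assumed to be stored column-major as in Fortran, and d and dt are assumed to hold an n x n derivative matrix D and its transpose. The plain triple loop in small_mxm marks the place where a tuned SIMD or LIBXSMM kernel would be substituted.

```c
#include <assert.h>

/* C = A * B with column-major storage; A is n1 x n2, B is n2 x n3.
   Hypothetical reference kernel: a tuned SIMD or LIBXSMM kernel would
   replace this triple loop. */
static void small_mxm(const double *a, int n1, const double *b, int n2,
                      double *c, int n3)
{
    assert(n1 > 0 && n2 > 0 && n3 > 0);
    for (int j = 0; j < n3; ++j) {
        for (int i = 0; i < n1; ++i) {
            double sum = 0.0;
            for (int k = 0; k < n2; ++k)
                sum += a[i + k * n1] * b[k + j * n2];
            c[i + j * n1] = sum;
        }
    }
}

/* Apply D along each of the three directions of an n x n x n element.
   u is indexed as u[i + j*n + k*n*n]; d holds D and dt holds its transpose. */
static void tensor_grad3(const double *d, const double *dt, const double *u,
                         int n, double *ur, double *us, double *ut)
{
    const int n2 = n * n;

    /* First direction: one (n x n) * (n x n^2) product. */
    small_mxm(d, n, u, n, ur, n2);

    /* Second direction: n products of size (n x n) * (n x n), one per slab. */
    for (int k = 0; k < n; ++k)
        small_mxm(u + k * n2, n, dt, n, us + k * n2, n);

    /* Third direction: one (n^2 x n) * (n x n) product. */
    small_mxm(u, n2, dt, n, ut, n);
}
```

Since n is the number of points per direction (polynomial order plus one), all three products involve matrices with an inner dimension of only n, which is why kernels specialized for small fixed sizes tend to pay off here.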

[1] Highly Tuned Small Matrix Multiplications Applied to Spectral Element Code Nek5000, Berk Hess, Jing Gong, Szilard Pall, Philipp Schlatter and Adam Peplinski, Proceedings of the Third International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2016), Sofia, Bulgaria, October 6-7, 2016.


M1: Apply performance analysis tools to the matrix-matrix multiplication subroutines (December 2017)
M2: Further optimize the SIMD and LIBXSMM kernels for Nek5000 (March 2018)
M3: Implement streaming-store strategies in other parts of the Nek5000 code (June 2018)
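
As a rough illustration of the streaming-store technique named in M3, the C sketch below writes an output array with non-temporal stores so that data written once does not displace reusable data in cache. It is a simplified scaled copy, not the Helmholtz operator from Nek5000, and the function name scale_stream is hypothetical. It assumes an x86 compiler that defines __SSE2__ and falls back to ordinary stores otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_pd, _mm_mul_pd, ... */
#endif

/* y[i] = alpha * x[i]. Where SSE2 is available, y is written with
   non-temporal (streaming) stores that bypass the cache hierarchy. */
static void scale_stream(double *y, const double *x, double alpha, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (x != NULL && y != NULL));
#if defined(__SSE2__)
    /* _mm_stream_pd needs a 16-byte aligned destination, so peel scalar
       iterations until y + i is aligned. */
    while (i < n && ((uintptr_t)(y + i) % 16) != 0) {
        y[i] = alpha * x[i];
        ++i;
    }
    const __m128d va = _mm_set1_pd(alpha);
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);          /* x may be unaligned */
        _mm_stream_pd(y + i, _mm_mul_pd(va, vx));  /* streaming store */
    }
    /* Order the streaming stores before any later stores. */
    _mm_sfence();
#endif
    /* Remainder elements, or the whole array when SSE2 is unavailable. */
    for (; i < n; ++i)
        y[i] = alpha * x[i];
}
```

Streaming stores are generally only worthwhile when the output is not read again soon; if the result is reused immediately, ordinary cached stores are likely the better choice, so any gain should be confirmed by measurement for each operator.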