Performance Tools

Our top tips on using analysis and debugging tools. Analysis typically means profiling and tracing, but we believe it also includes the use of compiler reports and other sources of information. (Also see our Wikipedia entry on HPC Benchmarks.)

Benchmarking Systems

Methodology

Compilers

Often overlooked, compilers are very powerful assets to coders.

  • Always use the latest stable version available, e.g. going from ifort v12 to ifort v14 gives about a 5% improvement in performance
  • Most compilers today provide reports on optimisations and shared-memory parallelisation. For example, with Intel compilers, check out -opt-report (spelled -qopt-report in more recent releases)
  • gap analysis, feedback guided optimisation
  • array bounds checks

Vectorise!

All modern microprocessors have vector units. For example, current Intel chips have 512-bit-wide vector units. Making good use of these units is important for high performance.

Intel compiler vectorisation

It is important to use the relevant -x or -ax option. For example,

  • COMMON-AVX512: the 'base' vectorisation for Intel AVX-512 processors
  • MIC-AVX512: for Knights Landing (and successors). Includes COMMON-AVX512 plus prefetch, floating-point exponential and reciprocal vector optimisations for KNL
  • CORE-AVX512: for Intel CPUs. Includes COMMON-AVX512 plus additional integer, byte and word vector instructions

To give the compiler hints on vectorisation, check out

  • #pragma ivdep -- tells the Intel C compiler that it can ignore any potential (assumed) dependencies in the following loop, but it will not ignore proven dependencies
  • #pragma simd for the Intel C compiler and !DIR$ SIMD for the Intel Fortran compiler -- vectorise the following loop irrespective of any potential or proven dependencies
  • detailed discussions of approaches to vectorisation for Intel C compiler and Intel Fortran compiler
  • !GCC$ vector -- tells the GNU Fortran compiler to vectorise the following loop irrespective of any potential or of any proven dependencies
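
As a minimal sketch of the ivdep hint described above (the kernel, its name shift_scale and its arguments are hypothetical, invented for illustration): the compiler cannot tell that the read and write ranges of a never overlap, so it must assume a dependency. The programmer knows offset >= n and says so. The preprocessor guard is an assumption about portability: #pragma ivdep is the Intel spelling, and GCC accepts #pragma GCC ivdep for the same purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: a[i + offset] = a[i] * scale + b[i].
 * The compiler cannot prove that offset >= n, so without a hint it
 * has to assume that a write may feed a later read of a[]. */
void shift_scale(double *a, const double *b, size_t n, size_t offset,
                 double scale)
{
    /* Caller's promise: the written range [offset, offset + n) does not
     * overlap the read range [0, n), and b does not alias a. */
    assert(offset >= n);

#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
#pragma ivdep
#elif defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (size_t i = 0; i < n; i++) {
        a[i + offset] = a[i] * scale + b[i];
    }
}
```

If the promise is false the hint does not make the loop safe, so it is worth confirming in the compiler's optimisation report that the loop was vectorised for the reason you expect.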

OpenMP 4.0 Directives

Check out options such as

  • SIMD
  • CONTIGUOUS
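
A short sketch of the SIMD directive in C may help here. The function dot is a hypothetical example, not code from any particular application. The reduction clause tells the compiler that the partial sums may be combined in any order, which is what allows the loop to be vectorised. The directive only takes effect when OpenMP support is switched on at compile time (for instance -fopenmp or -fopenmp-simd with GCC, -qopenmp or -qopenmp-simd with Intel compilers); otherwise it is ignored and the loop still gives the same answer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: dot product of two arrays of length n. */
double dot(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);

    double sum = 0.0;
    /* Ask for a vectorised loop; each vector lane keeps a private
     * partial sum, and the lanes are added together at the end. */
#pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because the order of the additions changes, results for general floating-point data may differ in the last few bits from the scalar loop.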

Other useful links

Analysing GPU Codes

MPI Analysis

Useful tools include:

Hardware Counters

Benchmarks

Running a benchmark is only part of the process of understanding system performance. A deeper understanding of why peak performance is not achieved is key to tuning libraries and codes for a given architecture.
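
One simple starting point is to compare a measured result against the theoretical peak. The sketch below assumes the usual back-of-envelope estimate (cores x clock x floating-point operations per cycle per core); the function names and all the numbers in the usage note are hypothetical, not measurements from any real system.

```c
#include <assert.h>

/* Theoretical peak in GFLOP/s:
 * cores x clock (GHz) x floating-point operations per cycle per core. */
double peak_gflops(int cores, double clock_ghz, int flops_per_cycle)
{
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_cycle > 0);
    return cores * clock_ghz * flops_per_cycle;
}

/* Fraction of the theoretical peak reached by a measured result. */
double fraction_of_peak(double measured_gflops, double peak)
{
    assert(peak > 0.0);
    return measured_gflops / peak;
}
```

For example, a hypothetical 16-core, 2.0 GHz chip with two 512-bit fused multiply-add units per core can issue 8 doubles x 2 operations x 2 units = 32 operations per cycle per core, so peak_gflops(16, 2.0, 32) gives 1024 GFLOP/s. A benchmark that measures 512 GFLOP/s on it would be running at half of peak, and the gap is where the analysis starts.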