Our top tips on using analysis and debugging tools. Analysis usually means profiling and tracing, but we believe it also includes the use of compiler reports and other sources of information. (Also see our Wikipedia entry on HPC Benchmarks.)
Benchmarking Systems
- PRACE Benchmark codes
- Benchmarking your high end compute heterogeneous system - a round up of applicable benchmark suites
Methodology
Compilers
Often overlooked, compilers are very powerful assets to coders.
- Always use the latest stable version available, e.g. going from ifort v12 to ifort v14 gave about a 5% improvement in performance
- Most compilers today will provide reports on optimisations and shared memory parallelisation. For example, with Intel compilers, check out -opt-report (spelled -qopt-report in more recent releases)
- GAP (Guided Auto Parallelism) analysis and feedback guided optimisation
- Array bounds checks, enabled while debugging
Vectorise!
All modern microprocessors have vector units. For example, current Intel chips have 512-bit wide vector units. It is important to make good use of these units in order to get high performance.
Intel compiler vectorisation
It is important to pass the relevant target to the -x or -ax option, for example:
- COMMON-AVX512: the 'base' vectorisation for Intel AVX-512 processors
- MIC-AVX512: for Knights Landing (and successors); includes COMMON-AVX512 plus pre-fetch, FP exponential & reciprocal vector optimisations for KNL
- CORE-AVX512: for Intel CPUs; includes COMMON-AVX512 plus additional integer, byte & word vector instructions
To give the compiler hints on vectorisation, check out
- #pragma ivdep -- tells the Intel C compiler that, for the following loop, it can ignore any potential (assumed) dependencies; it will not ignore proven dependencies
- #pragma simd for the Intel C compiler and !DIR$ SIMD for the Intel Fortran compiler, to vectorise the following loop irrespective of any potential or proven dependencies
- detailed discussions of approaches to vectorisation for Intel C compiler and Intel Fortran compiler
- !GCC$ vector -- tells the GNU Fortran compiler to vectorise the following loop irrespective of any potential or of any proven dependencies
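As a minimal sketch of the ivdep hint described above, consider a loop whose dependency the compiler can only assume, not prove. The function name `scale_shifted` and its parameters are hypothetical, chosen for illustration. The hint is only correct here under the stated assumption that `k` is never negative; the pragma is guarded so that non-Intel compilers simply skip it.

```c
#include <stddef.h>

/* Hypothetical example: a[i] = a[i + k] * c for i in [0, m).
 * The compiler cannot know the sign of k at compile time, so it must
 * assume a possible loop-carried dependency and may refuse to vectorise.
 * ASSUMPTION (ours, not the compiler's): callers always pass k >= 0 and
 * an array holding at least m + k elements, so every read is at or ahead
 * of the write position and vectorising is safe. */
void scale_shifted(int *a, int k, int c, int m)
{
#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
#pragma ivdep   /* ignore assumed dependencies; proven ones are still honoured */
#endif
    for (int i = 0; i < m; i++) {
        a[i] = a[i + k] * c;
    }
}
```

Calling `scale_shifted(a, 2, 3, 4)` on a six-element array reads from two elements ahead of each write. If `k` could be negative, the hint would be wrong and should be removed; checking the optimisation report before and after adding it shows whether the loop was actually vectorised.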
OpenMP 4.0 Directives
Check out options such as
- SIMD
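- CONTIGUOUS (strictly a Fortran 2008 array attribute rather than an OpenMP directive, but it helps vectorisation by telling the compiler that an array occupies contiguous memory)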
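To illustrate the SIMD directive, here is a small sketch in C of a dot product using `#pragma omp simd` with a reduction clause, both part of OpenMP 4.0. The function name `dot_simd` is hypothetical. Compilers built without OpenMP support ignore the pragma and run the loop as ordinary scalar code, so the result is the same either way.

```c
#include <stddef.h>

/* Hypothetical example: dot product of two arrays of length n.
 * The simd directive asks the compiler to vectorise the loop; the
 * reduction clause tells it that partial sums may be accumulated in
 * separate vector lanes and combined at the end. */
double dot_simd(const double *x, const double *y, size_t n)
{
    double sum = 0.0;

#pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because the reduction reorders floating-point additions, results can differ in the last few bits from a strictly sequential sum. With GCC, the directive is typically enabled with -fopenmp or -fopenmp-simd.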
Other useful links
Analysing GPU Codes
- Analyzing OpenCL with Intel VTune (video tutorial)
MPI Analysis
Useful tools include:
- IPM - Integrated Performance Monitoring, an open source MPI profiler
- TAU - an open source profiler and tracer
- Paraver - trace visualisation and analysis tool from Barcelona Supercomputing Centre
Hardware Counters
- PAPI - The Performance API
Benchmarks
Running a benchmark is only part of the process of understanding system performance. A deeper understanding of why peak performance is not achieved is key to tuning libraries and codes for a given architecture.
- IMB - Intel MPI Benchmarks
- HPCC - the HPC Challenge, a suite that includes HPL alongside tests such as STREAM, PTRANS, RandomAccess and FFT
- HPL - High Performance Linpack
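When judging how far a benchmark result falls short of peak, it helps to first estimate the theoretical peak. The sketch below is a rough back-of-envelope calculation, not a vendor formula: it assumes double precision (64-bit) arithmetic, counts each fused multiply-add as two floating point operations, and ignores clock throttling under vector load. The function names and all input figures are hypothetical.

```c
/* Rough theoretical peak in GFLOP/s for double precision.
 * ASSUMPTIONS: 64-bit operands, each FMA counts as 2 flops, and every
 * core sustains clock_ghz while issuing to all of its FMA units. */
double peak_gflops(int cores, double clock_ghz, int vector_bits, int fma_units)
{
    int lanes = vector_bits / 64;                /* doubles per vector register */
    int flops_per_cycle = lanes * 2 * fma_units; /* 2 flops per FMA per lane */
    return cores * clock_ghz * flops_per_cycle;
}

/* Fraction of theoretical peak reached by a measured result,
 * for example the GFLOP/s figure reported by an HPL run. */
double fraction_of_peak(double measured_gflops, double peak)
{
    return measured_gflops / peak;
}
```

For a hypothetical 16-core chip at 2.0 GHz with 512-bit vectors and two FMA units per core, `peak_gflops(16, 2.0, 512, 2)` gives 1024 GFLOP/s, so a measured 512 GFLOP/s would be half of peak. The gap between the two numbers is the starting point for the profiling and tracing described above.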