This is an initial work in progress, so please submit comments.
- Run your code with optimisations turned off and all runtime checks turned on. Do this for several key data sets; this is your baseline.
- Profile your baseline with a profiler such as gprof, TAU or VTune.
- Increase the optimisation level of your code from -O1 to -O2 and then to -O3, checking each build against your baseline at each stage. As soon as your results differ significantly, back off to the previous level of optimisation. This is your initial compiler optimisation. (You should also ensure you are still seeing a performance improvement for all your key data sets.)
- Additional compiler optimisations are also likely to be available, such as autoparallelisation and vectorisation. Follow the same procedure as Step 3 to obtain your best compiler optimisation.
- For parallel codes, gprof is not suitable; you will need a profiler such as TAU or VTune (worked examples are coming shortly).
- When profiling a code, use a variety of data sets. For parallel runs, use a variety of OpenMP thread counts and/or MPI process counts. For MPI runs, use a mix of fully- and under-populated nodes.
Tips on Profilers
- Measure only what you are interested in (computation, data movement, energy consumption) by excluding unwanted artifacts (e.g. remove unnecessary printf/WRITE statements).
- Ensure you are measuring what you are interested in (the run time of your simulation) by excluding external interference (e.g. reduce OS jitter by running in single-user mode).
- Profile your profiler: how do you know a given analysis tool is right? In reality, none are 100% correct. Profilers typically sample every millisecond, so they give only an indication, and some methods (adding explicit timer function calls and printf/WRITE statements) can be so intrusive as to change the code's behaviour. You can benchmark profilers to see whether they agree with each other and with your own hand-rolled timings.
- Be statistical. Rather than a single run, do several, and consider whether the mean, the min (or even the max) is the most appropriate measure.
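A minimal sketch of repeated runs summarised with awk (`sleep 0.1` stands in for your real benchmark; GNU date is assumed for nanosecond timestamps):

```shell
# Five repeat runs; record the wall time of each.
for i in 1 2 3 4 5; do
    start=$(date +%s.%N)
    sleep 0.1                   # stand-in for ./your_app
    end=$(date +%s.%N)
    echo "$start $end" | awk '{printf "%.3f\n", $2 - $1}'
done > times.txt

# Min is often best for CPU-bound kernels (least external interference);
# the mean, quoted with its spread, better reflects typical behaviour.
awk 'NR == 1 { min = $1 }
     { sum += $1; if ($1 < min) min = $1 }
     END { printf "runs=%d min=%.3f mean=%.3f\n", NR, min, sum / NR }' times.txt
```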