Benchmarking your high-end compute

So, you've bought a new accelerator or co-processor (maybe one of the dev Xeon Phi cards on promo in Q4 2014?), or commissioned a cluster.

What's the first thing you want to do? Yes, benchmark every aspect to check it all works as you expect.

You would probably use the Intel MPI Benchmarks suite to see how well your MPI implementation is working. There's also the HPC Challenge (HPCC) benchmark suite to determine how typical applications might run.

But is that enough?

How do you know if the MPI performance is good enough? Which figures would indicate a well-balanced compute/memory/interconnect set-up?

You would probably look at the specifications of your interconnect for the expected latency and bandwidth, in order to verify a good MPI implementation. You would probably compare your "pleasingly parallel" code (taking account of the MPI benchmark results) against the theoretical peak GFLOP/s per core, to see whether you're achieving the high-90s percent. Of course, you'd need code that the compiler can optimise aggressively in order to max out the vector units and the like.
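The arithmetic behind that comparison is simple enough to sketch. Here's a minimal example; the core count, clock and FLOPs-per-cycle figures below are purely illustrative assumptions, not the specs of any particular machine:

```python
# Sketch: how close is a measured run to theoretical peak?
# All hardware figures here are hypothetical, for illustration only.

def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak GFLOP/s: cores x clock x FLOPs issued per cycle
    (e.g. an 8-wide double-precision FMA unit gives 16 FLOPs/cycle)."""
    return cores * clock_ghz * flops_per_cycle

def efficiency_pct(measured_gflops, peak):
    """Measured throughput as a percentage of theoretical peak."""
    return 100.0 * measured_gflops / peak

# Hypothetical 16-core node at 2.5 GHz, 16 FLOPs/cycle per core:
peak = peak_gflops(16, 2.5, 16)            # 640 GFLOP/s
print(efficiency_pct(610.0, peak))         # 95.3125 -> high 90s, healthy
```

Anything well below that for a dense, vectorisable kernel suggests the compiler isn't generating the code you think it is, or memory is the real bottleneck.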

But what about the GPU or the Xeon Phi?

How would you see how they're doing?

Are there even benchmarks accepted by the community for these architectures, particularly given the intricacies of having the compute at the far end of the PCI Express bus? How do you measure the ideal of hiding the off-load communications behind meaningful compute?
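One back-of-envelope way to reason about that ideal: if transfers can proceed asynchronously alongside kernel execution (double-buffering, streams and the like), each chunk costs roughly max(compute, transfer) rather than their sum. A minimal sketch, assuming perfect overlap and made-up per-chunk timings:

```python
# Sketch: how much PCIe transfer time can be hidden behind compute,
# assuming perfect asynchronous overlap. All timings are illustrative
# assumptions, not measurements from any real device.

def chunk_time_ms(compute_ms, transfer_ms, overlapped):
    """Per-chunk wall time: a serial offload pays for both phases;
    perfect overlap pays only for the longer of the two."""
    if overlapped:
        return max(compute_ms, transfer_ms)
    return compute_ms + transfer_ms

chunks = 100
compute_ms, transfer_ms = 4.0, 3.0   # hypothetical per-chunk costs

serial = chunks * chunk_time_ms(compute_ms, transfer_ms, overlapped=False)
overlap = chunks * chunk_time_ms(compute_ms, transfer_ms, overlapped=True)
print(serial, overlap)   # 700.0 400.0 -> transfers fully hidden
```

In this toy model the transfers vanish entirely because the kernel takes longer than the copy; the moment transfer time exceeds compute time, the bus becomes the bottleneck no matter how clever the overlap, which is exactly what a good offload benchmark should expose.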

You would probably also be interested in whether OpenCL gets good performance or whether CUDA would do better. In which case, it may be the SHOC benchmark suite that is in favour. But what about OpenACC, or maybe OpenMP 4?

The SPEC ACCEL benchmark suite aims to be a standard for comparing your set-up (or even a possible set-up you're considering purchasing) against results recorded by others, for both OpenCL and OpenACC. Unfortunately, it's not open source, so it would set you back $800 for the non-profit version ($2000 otherwise).

So, as systems get more complex, so does determining whether you're getting true value.