For the last half dozen years, people have been using GPUs to accelerate code by an order of magnitude or so.

## What is a GPU

The term "Graphics Processing Unit" (GPU) typically refers to a PCI-e card that can perform thousands of simultaneous vector operations. Originally these cards were used solely for rendering graphics from computer games and apps onto the monitor. NVIDIA led the way in promoting the compute capability of these cards for wider use in computational science, particularly data analytics, machine learning and deep learning.

The main GPU producers are NVIDIA, AMD and ARM. NVIDIA now produce GPUs that sit directly on a motherboard and can be connected directly to each other with their proprietary "NVLink" interconnect. AMD have fused CPUs and GPUs, for example in their APU range. ARM's GPU designs, such as Mali, are typically manufactured for mobile phone use.

For high end compute, and for #greenerCompute, one can now program GPUs in CUDA (NVIDIA only), OpenCL, OpenACC or OpenMP, via extensions to popular programming languages including Fortran, C and C++. This may be on a workstation, or on a dev platform such as NVIDIA Jetson. For heavy compute workloads, many GPUs can be put in a server blade and applications written to exploit multiple GPUs, in order to dramatically reduce the compute time-to-solution whilst maintaining a good energy-to-solution metric.

## Who are the key players

- NVIDIA have a range of products, notably in the server range: DGX-1, P1000, P100, K80, K40 and K20. These support the proprietary language, CUDA, with some support for OpenACC/OpenMP (e.g. via the PGI compiler) and OpenCL.
- AMD have a smaller range of products, notably the Radeon and FirePro series, all of which support OpenCL 1.2.
- In the mobile market, ARM produce designs for the Mali GPU.

## How do I accelerate my computation by using GPUs?

Typically, it is not a complete model or simulation that can make best use of the specific GPU architecture. Rather, a "kernel" (a highly intensive compute element, typically of a vector form) is run on the GPU, via "offloading".
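
To make the idea of a kernel concrete, here is a minimal sketch in C of a vector kernel (SAXPY) offloaded with OpenMP 4.x `target` directives. The function name `saxpy_offload` is purely illustrative, and the sketch assumes a compiler built with accelerator offload support. Without such support the pragma is ignored or falls back to the host, so the loop still gives the same answer on the CPU.

```c
#include <assert.h>

/* SAXPY kernel: y[i] = a * x[i] + y[i] for i in [0, n).
 * Illustrative name; assumes an OpenMP 4.x compiler with offload support.
 * Compilers without it run the loop on the host instead. */
void saxpy_offload(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != 0 && y != 0);

    /* map(to:) copies x to the device; map(tofrom:) copies y there and back. */
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The rest of the application stays on the CPU and simply calls `saxpy_offload(n, a, x, y)` where the hot loop used to be. With GCC, OpenMP is enabled with `-fopenmp`; the additional flags that select an offload target vary by compiler and installation.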

As ever, it is easier to stand on the shoulders of giants than to start from scratch. So first consider solutions available from others:

- applications ported to CUDA (NB these will only run on NVIDIA GPUs)
- use of accelerator libraries, e.g.
  - NVIDIA's list of GPU accelerated libraries
  - the CUDA Math Library, which gives accelerated trig (and some other) functions for vectors
  - NVIDIA CUDA libraries for accelerating BLAS: cuBLAS for single GPUs and cuBLAS-XT for multiple GPUs. Note these are for NVIDIA GPUs only, and you are required to register as a CUDA Developer for access to the "premier" version of cuBLAS-XT, which supports multiple discrete GPU cards on the same motherboard (in addition to the "free" cuBLAS-XT, which supports dual-GPU cards such as the K80)
  - for sparse operations, cuSPARSE, which has some BLAS support, for example for efficiently multiplying Mv where matrix M and vector v may be sparse
  - ACML 6.1, which automatically dispatches appropriate compute tasks to an AMD GPU. ACML includes FFT and BLAS

Or, perhaps for extracting maximum performance, you may wish to port your algorithms to

- OpenMP - it is expected that future OpenMP versions will include current directives from OpenACC. OpenMP 4.x already includes many directives for accelerators
- OpenACC
- OpenCL
- CUDA or PTX for NVIDIA products
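
As a hedged illustration of the directive-based route, the sketch below ports a dense matrix-vector product using OpenACC. The function name `matvec_acc` and the row-major storage layout are assumptions made for this example only. A compiler without OpenACC support ignores the pragmas and runs the loops serially on the host.

```c
#include <assert.h>

/* Dense matrix-vector product out = M * v, with M stored row-major
 * as rows * cols floats. Illustrative sketch, not a tuned implementation. */
void matvec_acc(int rows, int cols, const float *m, const float *v, float *out)
{
    assert(rows >= 0 && cols >= 0);

    /* copyin: host-to-device only; copyout: device-to-host only. */
    #pragma acc parallel loop copyin(m[0:rows*cols], v[0:cols]) copyout(out[0:rows])
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;

        /* Each row's dot product is a reduction over the columns. */
        #pragma acc loop reduction(+:sum)
        for (int c = 0; c < cols; ++c) {
            sum += m[r * cols + c] * v[c];
        }
        out[r] = sum;
    }
}
```

The loop body is unchanged from a plain C version; the port consists of stating which arrays move in which direction and which loops may run in parallel. OpenCL and CUDA give finer control but require the kernel to be rewritten in their own dialects.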

## Key Issues

- cost of data transfer over PCI-e. Solutions include asynchronous data transfer and sufficient (re-)use of data on the GPU once transferred
- non-uniform work may present problems. Various pattern solutions are available (e.g. see AMD's guidance on optimising reduction operations)
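
Both issues can be sketched in a few lines of C with OpenMP 4.5 directives. This is an illustrative example under stated assumptions rather than a recommended implementation: the function name `update_then_sum_squares` is hypothetical, and an offload-capable compiler is assumed (otherwise the code runs on the host). A `target data` region keeps the arrays on the device across two kernels, and a `reduction` clause leaves the reduction pattern to the compiler.

```c
#include <assert.h>

/* Updates y[i] = a * x[i] + y[i] in place, then returns the sum of y[i]^2.
 * Both loops sit inside one "target data" region, so each array is
 * transferred once for the region rather than once per loop. */
double update_then_sum_squares(int n, double a, const double *x, double *y)
{
    assert(n >= 0 && x != 0 && y != 0);
    double sum = 0.0;

    /* Copy x to the device, and y to the device and back, once. */
    #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
    {
        /* Kernel 1: x and y are already present, so these maps copy nothing. */
        #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }

        /* Kernel 2: reuses y on the device; only the scalar sum is moved. */
        #pragma omp target teams distribute parallel for reduction(+: sum) map(tofrom: sum) map(to: y[0:n])
        for (int i = 0; i < n; ++i) {
            sum += y[i] * y[i];
        }
    }
    return sum;
}
```

Had each loop been offloaded separately without the enclosing data region, `y` would have crossed the PCI-e bus in both directions for every kernel, which can easily outweigh the compute time saved on small loops.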

## See Also

- gpu computing net is a portal to communities, conference announcements and fora. (*Seems to have expired...*)
- Intel Xeon Phi
- FPGA