Our Xeon Phi pages provide
- information on the Intel Xeon Phi range, including Knights Corner and Knights Landing
- information on how to get access to Xeon Phi hardware
- top tips for getting high performance from Intel Xeon Phi (XPhi) without excessive energy use (a work in progress).
What is a Xeon Phi?
The Xeon Phi (aka "XPhi") range of processors employs the MIC (Many Integrated Core) architecture and has been produced by Intel for the last few years. Initially there was Knights Corner (KNC), and since 2016 there has been Knights Landing (KNL). We expect that from late 2017 there will be Knights Hill (to be used in Aurora), followed by Knights Mill (for machine learning). The MIC philosophy is to use lower-power cores, but many more of them, in a single package. KNC came as 60 cores in a package available only as a PCI-e card, whereas KNL is fully bootable and is the only processor in the "Ninja" development systems.
A KNL package includes not only 64 (up to 72) cores but also:
- MCDRAM (Multi-Channel DRAM): this is faster than standard DRAM, and a KNL can be set up to use it in one of three modes: flat, cache or hybrid. The mode is set in the BIOS.
- Cluster modes (all-to-all, quadrant, hemisphere, SNC-4, SNC-2), which relate to the tag directories for the local caches of a tile (a pair of KNL cores). Quadrant (4 areas) or hemisphere (2 areas) is recommended, with all-to-all really for system troubleshooting only. SNC-4 and SNC-2 are "sub-NUMA" variants of quadrant and hemisphere respectively. The cluster mode is set in the BIOS and can be queried via `hwloc-dump-hwdata`.
Some KNL models also come with Intel's Omni-Path Fabric integrated, giving 100 Gbps networking (eg to other KNL or CPUs).
KNL will (probably, eventually...) come in three forms: as the only processor on a motherboard, as a PCI-e card, and as a co-processor alongside a standard CPU on a motherboard. Currently only the first form is available, and it comes in two formats: rack-mounted systems and standalone workstations. Each supports one KNL.
The workstations are promoted by Intel as "Ninja Developer Platform" via their "Developer Access Program", costing approximately $3,500 + P&P. They are physically manufactured by SuperMicro with Colfax Research being Intel's main distributor. The PSU and BMC have been swapped (compared to rack format) by SuperMicro and do not provide direct node-level energy consumption data. The workstations are also available via other Intel partners and via other resellers of SuperMicro kit.
The rack-mounted systems have a BMC capable of reporting energy usage (eg using ipmitool), whereas the workstations regrettably do not have this functionality.
Accessing Xeon Phi
In addition to the access offered through High End Compute (HEC), HEC can help if you wish to talk to the likes of Intel or one of its Parallel Computing Centres (eg EPCC or the Hartree Centre) about further access, and has good contacts with other partners providing access:
- UK academics can apply to access the Archer KNL development system (rack-based KNL)
- Colfax Research also offer access to their rack-based KNL for ~2 weeks
Programming Efficiency of the KNC
An update for KNL is being drafted and will be published in June 2017.
Given the XPhi roadmap, some of the numbers below (which have more of a KNC focus) may vary depending upon the product you are using.
- When timing code segments, be aware that the first explicit offload is when the XPhi binaries are transferred to the card along with the data; thread initialisation on the XPhi also occurs at this point. To time the cost of the data transfer itself, perform a dummy transfer as the initial offload; timing the actual offload that follows will then give a representative figure
- Use micsmc (or micsmc-gui), which may be found in /opt/intel/mic/bin, to monitor the energy usage of your host and cards. This gives the instantaneous power draw in Watts and also the number of threads in use, from which you can approximate the total Watt-hours needed to solve a given problem. micsmc gives you a graphical interface with metrics for cores utilised, temperature and power draw of the MIC card(s). micsmc-gui gives you, counter-intuitively, a command line interface with which you can examine, for example, the recent load on any given MIC core
- Some authors report that one of the Phi cores handles any off-loading, so evaluate your application on all-but-one as well as using all cores
- Remember that at least 2 threads per Phi core are required to reach maximum FLOP/s; 3 or 4 may be better (it will be program dependent)
- For OpenMP codes, thread placement matters: KMP_AFFINITY="balanced,granularity=fine" is a good starting point, and you should avoid core 0 (used by the OS)
- For hybrid MPI+OpenMP codes, use 1 MPI process per core with 2, 3 or 4 OpenMP threads per MPI process
- For improved vectorisation, use signed ints for loop counters, avoid (or otherwise handle) branches, and align data (eg using the relevant C and Fortran extensions)
- Use PMC counters to check performance, eg VPU_INSTRUCTIONS_EXECUTED/INSTRUCTIONS_EXECUTED gives the ratio of vector instructions to all instructions executed
- Consider the likes of Cilk+, but note that Cilk+ itself is now being phased out
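The tip above on approximating total Watt-hours from instantaneous power readings can be sketched in a few lines of Python. This is a minimal illustration only: it assumes you have already recorded (time, Watts) samples yourself, it does not call micsmc, and the function name and sample values are hypothetical rather than measured data. It integrates the samples with the trapezoidal rule.

```python
def energy_watt_hours(samples):
    """Estimate energy in Watt-hours from (time_in_seconds, watts) samples.

    Uses the trapezoidal rule between consecutive samples, so the more
    frequently you sample, the better the approximation.
    """
    if len(samples) < 2:
        return 0.0
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be ordered by time")
        # Average power over the interval multiplied by its duration.
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3600.0  # 1 Wh = 3600 J


# Hypothetical readings (seconds since start, Watts); not measured data.
readings = [(0.0, 100.0), (60.0, 200.0), (120.0, 200.0), (180.0, 100.0)]
total_wh = energy_watt_hours(readings)
print(f"Estimated energy to solution: {total_wh:.2f} Wh")
```

Sampling the host and each card separately and summing the results gives an estimate for the whole node.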
#greenerCompute using KNL
With input from academics, Intel and their partners, we have undertaken a preliminary comparison of the energy consumed by previous CPUs and by the KNL for an atmospheric chemistry kernel. These results have been published in a poster at the 2017 EGU Conference:
"Accelerating activity coefficient calculations using multicore platforms, and profiling the energy use resulting from such calculations", David Topping, Irfan Alibay, and Michael Bane, EGU2017-12246. (2017)
High End Compute is now compiling a report for The University of Manchester on how to record and compare energy efficiency for research computing.
Hardware Efficiency of KNC
- Your physical set-up may affect your compute performance, eg the PCIe set-up may reduce your bandwidth by up to 50%
- The host CPU consumes power whether or not it is doing useful work, so you will need to make some use of it to achieve a high ratio of compute to electrical power
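The point about host power can be made concrete with a small Python sketch of compute per Watt for the whole node. The function name and all figures below are hypothetical, chosen only to illustrate the arithmetic; they are not measurements of any KNC system.

```python
def gflops_per_watt(gflops, card_watts, host_watts):
    """Node-level efficiency: sustained GFLOP/s divided by total power draw."""
    total_watts = card_watts + host_watts
    if total_watts <= 0:
        raise ValueError("total power must be positive")
    return gflops / total_watts


# Hypothetical case 1: the card does all the work while the host sits idle,
# yet the host's idle power still counts against the node.
card_only = gflops_per_watt(gflops=500.0, card_watts=200.0, host_watts=100.0)

# Hypothetical case 2: the host also contributes some compute, drawing a
# little more power but raising the total throughput.
with_host = gflops_per_watt(gflops=600.0, card_watts=200.0, host_watts=120.0)

print(f"card only: {card_only:.3f} GFLOP/s per Watt")
print(f"with host: {with_host:.3f} GFLOP/s per Watt")
```

With these made-up figures the node is more efficient when the host shares the work. Whether that holds in practice depends on how much extra power the host draws under load, so measure both configurations.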
Other Great Sources of Info
- our Xeon Phi roadmap
- Intel page on XPhi
- Intel's XPhi Forum
- PRACE's best practice guide
- Colfax Research's webinar on KNL
- NERSC (home of CORI) gives initial KNL performance results
- EPCC Blog on Early Use of KNL
- Programming on GPUs