Any on-premises HPC cluster has its own limits of scale and throughput. While end users always want to run their simulations and models faster, there are often practical limits to how far any particular code can scale, and even when there are not, there are economic ones. No one can afford a cluster sized for its biggest jobs that sits idle most of the time.
This is why the promise of HPC in the cloud has been dangling in front of cluster managers and researchers for so long that it seems like it has been available for decades. But it is only now, with the advent of containerized applications, sophisticated scheduling tools for systems software and the hardware underneath it, and the availability of large amounts of capacity, that companies can finally burst some simulation and modeling workloads to the public cloud.
Disk drive and flash storage maker Western Digital has been putting cloud bursting to the test as it designs its next generation of products, enlisting the support of Amazon Web Services, the juggernaut in cloud infrastructure, and Univa, which makes some of the most popular cluster scheduling software in the HPC space, namely Grid Engine, as well as Docker container management tools based on Kubernetes that interface with it. Western Digital didn’t just dabble with cloud bursting when it went to test the idea; the company fired up 1 million cores across more than 40,000 spot instances on AWS to run 2.5 million verification tasks, a job that finished in eight hours rather than the 20 days it would have taken to dispatch on the internal cluster residing in one of Western Digital’s datacenters.
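The arithmetic behind that comparison is worth spelling out; a quick back-of-envelope calculation using only the figures reported above shows the wall-clock speedup:

```python
# Wall-clock speedup implied by the reported figures:
# 20 days on the internal cluster vs. 8 hours on AWS.
on_prem_hours = 20 * 24   # 20 days expressed in hours
cloud_hours = 8

speedup = on_prem_hours / cloud_hours
print(f"Speedup: {speedup:.0f}x")  # prints "Speedup: 60x"
```

In other words, bursting the verification workload to the cloud compressed the turnaround time by a factor of roughly 60.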