Earlier this month my colleague Bala Thekkedath published a story about Extreme Scale HPC and talked about how AWS customer Western Digital built a cloud-scale HPC cluster on AWS and used it to simulate crucial elements of upcoming head designs for their next-generation hard disk drives (HDD).
The simulation described in the story encompassed a little over 2.5 million tasks, and ran to completion in just 8 hours on a million-vCPU Amazon EC2 cluster. As Bala shared in his story, much of the simulation work at Western Digital revolves around the need to evaluate different combinations of technologies and solutions that comprise an HDD. The engineers focus on cramming ever-more data into the same space, improving storage capacity and increasing transfer speed in the process. Simulating millions of combinations of materials, energy levels, and rotational speeds allows them to pursue the highest density and the fastest read-write times. Getting the results more quickly allows them to make better decisions and lets them get new products to market more rapidly than before.
I recently spoke to my colleagues and to the teams at Western Digital and Univa who worked together to make this record-breaking run a reality. My goal was to find out more about how they prepared for this run, see what they learned, and to share it with you in case you are ready to run a large-scale job of your own.
A breakdown of the different types of instances during the run.