The convergence of HPC and A.I. is a hot topic right now, and for good reason. In practice, there is a symbiotic relationship between (for example) HPC simulations and machine learning. HPC simulations generate enormous amounts of data, and as luck would have it, machine learning requires enormous amounts of data to train models. Turn the value proposition around, and machine learning can improve the quality and effectiveness of HPC simulations: use it to identify the parameters in simulation data that have a significant effect on the outcome, then focus subsequent iterations of the simulation on those parameters.
Beyond the work itself, HPC and A.I. share another characteristic: the need for high-performance infrastructure to run these workloads. Expensive infrastructure, too: lots of computing power and storage, accelerators, high-speed interconnects, and so on. And although A.I. is only getting started relative to the maturity of its HPC cousin, this common infrastructure need leads many organizations to see the convergence of HPC and A.I. at the infrastructure level as inevitable. But here's the rub: traditional HPC applications run under the jurisdiction of an HPC workload manager like Slurm or PBS Pro, whereas machine learning applications primarily run in containers under the jurisdiction of a container orchestration system, such as Kubernetes.
So if you want to run (for example) HPC applications managed by Slurm and machine learning containers managed by Kubernetes on the same Linux cluster, you have to hardwire one subset of your cluster nodes to Slurm for the HPC workload and another subset to Kubernetes for the machine learning containers. That sounds more like coexistence than convergence. And what happens when you temporarily need most of your cluster (or all of it) to run an important HPC simulation, or the whole cluster to train a model?
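To make the static split concrete, here is a minimal sketch of what that hardwiring looks like on the Slurm side. The hostnames, node counts, and hardware figures are hypothetical; the point is that the nodes given to Kubernetes simply never appear in `slurm.conf`, so neither scheduler knows the other's nodes exist.

```
# slurm.conf fragment (hypothetical 20-node cluster, hosts n[01-20])

# Nodes n[01-12] are dedicated to Slurm for HPC jobs.
NodeName=n[01-12] CPUs=64 RealMemory=256000 State=UNKNOWN
PartitionName=hpc Nodes=n[01-12] Default=YES MaxTime=INFINITE State=UP

# Nodes n[13-20] are deliberately omitted: they run kubelet and are
# joined to the Kubernetes cluster instead, so Slurm never schedules
# work on them -- and Kubernetes never schedules work on n[01-12].
```

Because each scheduler only sees its own static slice, neither can borrow idle capacity from the other, which is exactly the coexistence-not-convergence problem described above.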