Kubernetes changed how the industry deploys software. Built for the web services era, it offered a clean, declarative model for managing stateless, interchangeable workloads at scale. For a decade, that model worked well.
AI breaks it.
Training runs are long-lived, stateful, and bound to specific GPU topologies. Inference traffic is volatile and latency-sensitive. High-performance networking is no longer optional. GPUs, network fabrics, storage, power, and cooling all have to be orchestrated as a single system. When topology is wrong, performance collapses.
What we are seeing is not simply a tooling gap, but a structural mismatch: AI workloads demand HPC-class infrastructure with cloud-grade flexibility. Initiatives like the Serving Working Group, the AI Gateway WG, and ongoing work in SIG Scheduling have improved Kubernetes’ viability for AI workloads over the past few years. But even with that progress, the remaining challenge is aligning increasingly capable AI software components with the physical realities of GPU topology, high-speed interconnects, and performance-sensitive infrastructure.
Developers expect orchestration platforms like Kubernetes; MLOps tools such as Kubeflow and MLflow; ML frameworks including PyTorch and vLLM; interactive environments like Jupyter Notebooks; compatible APIs; and a broad ecosystem of supporting tools. They expect self-serve environments, fast iteration, and a clean path from experiment to production. They expect cost visibility, reproducibility, and zero rework.
Our job is to deliver bare-metal performance without the operational drag. That means contiguous GPU pools with topology-aware scheduling, high-bandwidth interconnects built in, unified orchestration across HPC and inference, and a single control plane from notebook to production. It means eliminating friction between experiment and deployment so teams can ship faster and at lower cost per run.
Where the difficulty actually lives
For an enterprise AI team, the challenge is not running one workload. It is running all of them at once. On any given day, teams might be training foundation models, fine-tuning variants for specific use cases, and serving inference traffic into production applications. Some teams prefer Kubernetes. Others rely on Slurm. Some workloads need virtual machines. Others require direct access to bare metal. Access patterns range from batch jobs triggered from the CLI to web-based experimentation environments used by product teams.
Enterprises expect the flexibility of cloud to handle this diversity. What they cannot afford is performance unpredictability when large-scale training is involved. A single misaligned cluster placement, a networking bottleneck, or an undetected hardware fault does not just degrade performance. It delays releases, inflates experiment costs, and erodes confidence in AI roadmaps.
The real problem is delivering HPC-class performance across that variety without forcing teams to choose between flexibility and throughput.
Running infrastructure at the frontier of hardware capability means working with hardware that is, by design, pushed to its limits. Customers come to us precisely because they want maximum throughput, and that expectation means hardware is consistently operating at the edge of its tolerances. There is an inherent tension that goes with this territory, and managing it honestly is a significant part of what we do.
Building the architecture
The consequences of getting this wrong are not abstract. They surface directly in customer workloads: jobs that fail intermittently, performance that degrades without warning, training runs that stall because of a single undetected node fault. As soon as you are operating infrastructure at scale, comprehensive automation is not optional. Without it, things fall apart very quickly, and when they do, the impact cascades directly into customer outcomes.
Solving this requires more than running Kubernetes on bare metal. It requires an architecture that encodes topology awareness, isolation, automated commissioning, and continuous validation into the platform itself. That is why we built Nscale Kubernetes Service (NKS): a managed, bare-metal Kubernetes environment designed specifically for high-performance AI workloads, where performance guarantees are enforced by the system rather than left to operational best effort.
The foundation of NKS is a large, shared underlay Kubernetes cluster deployed directly on bare-metal GPU infrastructure. This underlay cluster manages the physical nodes, networking fabric, and topology-aware scheduling.
When a customer requests a cluster, we provision it as an NKS virtual cluster using vCluster. These virtual Kubernetes clusters provision in minutes, provide strong isolation at the API and control-plane level, and allocate dedicated worker nodes.
There is no VM layer and no nested containerization between the workload and the silicon. That means:
- No virtualization overhead
- Full access to GPU devices and RDMA networking
- Topology-aware placement aligned with InfiniBand (IB) fabric
- Strong isolation at the control plane and network policy level
- Fast cluster spin-up, with clusters available for workload provisioning in under five minutes
In short, NKS customers get a dedicated, batteries-included Kubernetes cluster in a fraction of the typical provisioning time, with workloads running at bare-metal speed.
Getting this right required solving for network-aware scheduling. In distributed training, GPUs spend as much time communicating as computing. If those GPUs are poorly placed across the network fabric, performance can collapse, even when plenty of capacity is available. Our platform encodes the physical topology of the IB network directly into Kubernetes scheduling decisions, ensuring that GPUs assigned to a customer cluster are positioned as closely together as possible. The result is predictable, high-bandwidth communication and training runs that operate at full throughput rather than at a fraction of their potential. For enterprise teams, that translates directly into faster iteration, lower cost per experiment, and confidence that performance will scale with ambition.
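To make the placement idea concrete, here is a minimal, hypothetical sketch (not Nscale's actual scheduler): given which IB leaf switch each node hangs off, prefer candidate node sets that span the fewest switches, since cross-switch hops cost bandwidth and latency. Node and switch names are placeholders.

```python
from itertools import combinations

# Hypothetical map of node -> InfiniBand leaf switch (illustrative only).
NODE_TO_LEAF = {
    "gpu-01": "leaf-a", "gpu-02": "leaf-a", "gpu-03": "leaf-a",
    "gpu-04": "leaf-b", "gpu-05": "leaf-b", "gpu-06": "leaf-c",
}

def placement_score(nodes):
    """Fewer distinct leaf switches => less cross-fabric traffic => better."""
    return len({NODE_TO_LEAF[n] for n in nodes})

def best_placement(free_nodes, count):
    """Pick the candidate node set spanning the fewest IB leaf switches."""
    return min(combinations(sorted(free_nodes), count), key=placement_score)

# Three nodes under the same leaf win over any set that straddles switches.
print(best_placement(NODE_TO_LEAF, 3))
```

A production scheduler would also weigh spine-level distance, current utilisation, and fragmentation, but the core objective — minimise the fabric span of a job's GPUs — is the same.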
Security and isolation are handled at multiple layers. Customer clusters are isolated from the underlay using dedicated nodes per virtual cluster, with network policies enforcing isolation boundaries. On the backend, RDMA access is restricted to the HPC fabric. We leverage vNode from vCluster to provide workload isolation at the node level, providing security guarantees without compromising on flexibility. For customers running sensitive AI workloads, these guarantees have to be real, not approximate.
Bringing Slurm into the picture
Many AI-native organisations, particularly those running large-scale training workloads, already depend on Slurm, the workload manager used across most of the world’s leading supercomputers. Slurm provides advanced batch scheduling, fair-share allocation across teams, priority controls, and preemption capabilities.
Rather than requiring teams to redesign established workflows, we provide a fully managed Slurm service built on the same bare-metal, topology-aware infrastructure as NKS. This allows organisations to retain the scheduling model their engineers and researchers already trust, while benefiting from automated provisioning, high-performance networking, and integrated observability. Teams can choose the scheduler that aligns with their operating model without compromising performance, isolation, or operational control.
To integrate Slurm without fragmenting the platform, we use Slinky from SchedMD to run Slurm natively within our Kubernetes-based infrastructure. The managed Slurm service is built on the same virtual cluster architecture as NKS, inheriting fast provisioning, topology-aware placement, and the automation embedded in the underlying bare-metal environment.
We maintain a custom Slurm image with GPU-aware scheduling enabled through Generic Resource Scheduling (GRES) and container-native execution via Pyxis, ensuring compatibility with modern AI workflows. Because Slurm runs on the same fabric-aware underlay as Kubernetes, distributed jobs automatically benefit from optimal network placement and full-bandwidth communication.
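As a sketch of what this looks like from the user's side, the helper below assembles an `srun` invocation: `--gres=gpu:N` is standard Slurm GRES syntax and `--container-image` is the option Pyxis adds to `srun`; the container image, node counts, and script name are placeholders.

```python
def srun_command(image, nodes, gpus_per_node, script):
    """Assemble an srun invocation for a containerised, GPU-scheduled job.

    --gres=gpu:N requests GPUs via Slurm's Generic Resource Scheduling;
    --container-image is the flag Pyxis adds for container-native execution.
    """
    return [
        "srun",
        f"--nodes={nodes}",
        f"--gres=gpu:{gpus_per_node}",
        f"--container-image={image}",
        script,
    ]

cmd = srun_command("nvcr.io/nvidia/pytorch:24.01-py3",
                   nodes=4, gpus_per_node=8, script="train.py")
print(" ".join(cmd))
```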
For enterprise technical leaders, this means flexibility without architectural sprawl. Teams can standardise on a single infrastructure foundation while allowing different groups to use the scheduler that best fits their workload. Whether running cloud-native services on Kubernetes or large-scale batch training on Slurm, performance, isolation, and operational consistency remain the same.
Proving the performance
None of this matters if we cannot demonstrate that the infrastructure performs as specified. We have invested heavily in benchmarking as code: a framework called kube-perftest, originally developed by StackHPC, that allows benchmarks to be defined using Kubernetes custom resource definitions (CRDs) and stored in version control. Benchmarks are reproducible, auditable, and can be run as part of standard cluster validation workflows. BenchmarkSet resources allow parameter sweeps to be generated automatically, making it straightforward to characterise performance across a range of configurations.
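The parameter-sweep idea behind BenchmarkSet can be sketched in a few lines: expand a configuration grid into one benchmark definition per combination. The field names here are illustrative, not the actual kube-perftest CRD schema.

```python
from itertools import product

def sweep(name, grid):
    """Expand a parameter grid into one benchmark spec per combination."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        suffix = "-".join(f"{k}{v}" for k, v in sorted(params.items()))
        yield {"name": f"{name}-{suffix}", "params": params}

specs = list(sweep("nccl-allreduce", {
    "nodes": [2, 4],           # number of worker nodes
    "msg_size": ["1M", "64M"]  # message size per all-reduce
}))
print(len(specs))  # 2 x 2 = 4 benchmark definitions
```

Because each generated spec is declarative and deterministically named, the whole sweep can live in version control and rerun identically during cluster validation.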
Before hardware is ever released for customer use, it goes through an extensive commissioning process. At the single-node level, this includes CPU, memory, and GPU validation, storage SMART status checks, DIMM consistency verification, network link status, and firmware validation. At the multi-node level, we run IB fabric validation, GPU HPL benchmarks, NCCL tests, DCGM diagnostics, and distributed storage tests. When hardware passes, it enters the available pool. When it does not, it is quarantined and investigated separately. This discipline is what allows us to make performance guarantees and honour them.
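The commissioning flow reduces to a simple gate, sketched here with hypothetical check names: every check must pass before a node enters the available pool, and any failure routes it to quarantine with the failing checks recorded.

```python
# Hypothetical commissioning checks; the real ones run hardware diagnostics
# (GPU validation, SMART status, DIMM consistency, firmware checks, etc.).
CHECKS = {
    "gpu_validation": lambda node: node["gpus_ok"],
    "smart_status": lambda node: node["disks_ok"],
    "firmware": lambda node: node["firmware_ok"],
}

def commission(node):
    """Return ('available', []) if all checks pass, else ('quarantined', failures)."""
    failures = [name for name, check in CHECKS.items() if not check(node)]
    return ("available", []) if not failures else ("quarantined", failures)

healthy = {"gpus_ok": True, "disks_ok": True, "firmware_ok": True}
faulty = {"gpus_ok": True, "disks_ok": False, "firmware_ok": True}
print(commission(healthy))  # ('available', [])
print(commission(faulty))   # ('quarantined', ['smart_status'])
```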
Day 2 operations follow the same logic using Nscale’s Fleet Operations. Our observability stack provides a unified view across compute, storage, and networking. Custom observability and automated alerting detect performance drift and infrastructure faults early, allowing us to remediate issues before they affect customer workloads.
Kubernetes upgrades are performed through controlled node replacement rather than in-place updates, eliminating configuration drift and preserving optimal network placement. At the hardware layer, our Control Center platform continuously validates inventory, monitors component health, and automatically quarantines or remediates degraded nodes. The result is a self-healing infrastructure model that maintains performance consistency at scale, without relying on reactive firefighting.
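The replacement-over-mutation upgrade pattern can be sketched as a simple loop, with `provision` and `decommission` standing in for the real node lifecycle machinery: for each node, a freshly imaged replacement at the target version joins first, then the old node is drained and retired, so capacity is preserved and no drift accumulates.

```python
def upgrade_by_replacement(cluster, target_version, provision, decommission):
    """Upgrade by replacing nodes rather than mutating them in place."""
    for old in list(cluster):
        new = provision(target_version)  # fresh node from a known-good image
        cluster.append(new)              # replacement joins first...
        cluster.remove(old)              # ...then the old node is drained
        decommission(old)                # and retired
    return cluster

cluster = [{"name": f"node-{i}", "version": "1.29"} for i in range(3)]
counter = iter(range(100))
cluster = upgrade_by_replacement(
    cluster, "1.30",
    provision=lambda v: {"name": f"node-new-{next(counter)}", "version": v},
    decommission=lambda n: None,
)
print({n["version"] for n in cluster})  # every node now runs the target version
```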
What this signals for the industry
The question I am asked most often is whether bare metal or cloud-native is the right approach for AI infrastructure. What AI workloads require is a synthesis of the two: the performance, isolation, and networking guarantees of HPC, delivered through the interfaces and operational models that cloud-native developers already know.
The closer an application operates to the underlying hardware, the greater the potential for peak performance. But abstraction is not inherently a cost; done well, it improves utilisation, resilience, and overall efficiency. The real challenge is striking the right balance: delivering the operational advantages of platforms like Kubernetes without compromising the performance and control that infrastructure-aware workloads require.
Much of this progress has been driven by the open source community, particularly within Cloud Native Computing Foundation (CNCF), where Working Groups and special interest groups (SIGs) are actively shaping how Kubernetes evolves to support AI-native workloads. That collaborative momentum is critical to ensuring the next generation of infrastructure balances performance, portability, and operational simplicity.
What ultimately distinguishes AI infrastructure platforms is the quality of the engineering that connects hardware to workloads: topology-aware scheduling, automated commissioning, deep observability, and disciplined Day 2 operations that ensure performance guarantees hold over time. This is the systems layer where abstraction meets silicon, and where sustained investment and engineering focus make the difference between theoretical capability and reliable, repeatable performance.