
Inside Fleet Operations: Automating the GPU lifecycle

Explore how our internal automation and observability systems bring reliability, scale, and intelligence to GPU infrastructure.

Behind every model trained and every inference served across Nscale’s infrastructure lies a constantly moving system of hardware being enrolled, validated, networked, and maintained across regions. As we scale toward several hundred thousand GPUs over the next three years, the challenge isn’t just adding capacity; it’s ensuring every node performs predictably, efficiently, and with minimal human touch.

To meet that challenge, our engineering teams built Fleet Operations, the internal automation and observability system that powers how we run infrastructure at scale.

It is made up of three core layers: 

  • Control Center, the operations console that automates the GPU hardware lifecycle from rack enrollment through maintenance.
  • Observability Platform, the telemetry backbone that traces, measures, and analyzes the health of every node in real time.
  • Radar API, the fault-reporting layer that surfaces hardware and operational break-fix events requiring remediation.

Together, they form the foundation of how Nscale keeps its global GPU infrastructure consistent and operating at peak performance. With Fleet Operations, we can validate hardware, orchestrate network setup, run burn-in and firmware checks, and automatically remediate issues before they impact workloads.

Today, we’re sharing an inside look at that system. Learn how we designed it, the problems it solves, and what it takes to keep GPU infrastructure reliable at global scale.

Figure: Fleet Operations comprises the Control Center, the Observability Platform, and the Radar API.

Why we built Fleet Operations 

Scale changes everything. When you’re bringing hundreds of thousands of GPUs online across continents, the smallest inefficiency becomes a storm. Every node has to arrive, come alive, and stay that way — enrolled, tested, networked, and verified before a single workload runs. Building toward several hundred thousand GPUs, we need a system that can move as fast and precisely as the hardware it manages.

For us, being customer-first means building hand in hand with the people and partners who push this industry forward, from NVIDIA to the teams from Microsoft, OpenAI, and others who train frontier models on our clusters. Together, we’ve defined the visibility and control needed to make infrastructure feel less like a black box and more like a living, observable system. Fleet Operations grew out of that collaboration, created to turn scale into something we can understand and continuously optimize.

Automation that starts at the rack

Bringing GPUs online isn’t like spinning up virtual machines. It starts in the data center, with physical hardware that needs to be registered, tested, and prepared for service. 

Control Center is the operations console that automates that process end to end, turning what once took weeks of manual setup into a continuous, observable workflow. It provides:

  • Autonomous enrollment: It takes a bare-metal device from first discovery to production readiness. It automatically applies the right configuration on arrival — from network setup and credential provisioning to essential system flags — then registers the machine into the control plane and source-of-truth systems. Once enrolled, Control Center runs a full validation sweep across compute, network, accelerators, and storage while confirming firmware baselines. When everything checks out, the system promotes the device into an active, available state, making it ready for scheduling and higher-level automation (see the sketch after this list).
  • Burn-in testing: Before GPUs are ever exposed to workloads, they are stress-tested through orchestration frameworks like Slurm and Kubernetes, pushing compute, networking, and storage to their limits. Burn-in catches early hardware faults and confirms that each GPU performs to spec. This is a crucial step when scaling to thousands of units across different OEMs.
  • Network stand-up: Control Center works with several tools to configure east–west and WAN paths, set up routing, and deploy monitoring hooks so that network visibility exists from day zero. The goal: every node joins the system cleanly, with performance and observability baked in.
  • Lifecycle management: Once in production, each GPU node continues to be validated in real time. Our Observability Platform monitors physical health signals and triggers automated workflows to repair, reallocate, or retire hardware when needed. This ensures the fleet remains stable as it scales.
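
To make the flow above concrete, here is a minimal sketch of how such an enrollment pipeline could be modeled with the Temporal Python SDK, which (as described later in this post) Fleet Operations uses to sequence automation. The activity names, timeouts, and stub bodies are illustrative assumptions, not Nscale's actual workflow definitions.

```python
# Illustrative enrollment workflow using the Temporal Python SDK.
# Activity names, timeouts, and stub bodies are hypothetical.
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def configure_device(device_id: str) -> None:
    """Apply network config, credentials, and system flags on arrival."""


@activity.defn
async def validate_device(device_id: str) -> bool:
    """Sweep compute, network, accelerators, storage; confirm firmware baselines."""
    return True  # placeholder


@activity.defn
async def run_burn_in(device_id: str) -> bool:
    """Submit stress jobs via Slurm/Kubernetes and check results against spec."""
    return True  # placeholder


@activity.defn
async def promote_device(device_id: str) -> None:
    """Mark the node active and schedulable in the control plane."""


@workflow.defn
class EnrollmentWorkflow:
    @workflow.run
    async def run(self, device_id: str) -> str:
        retry = RetryPolicy(maximum_attempts=3)
        await workflow.execute_activity(
            configure_device, device_id,
            start_to_close_timeout=timedelta(minutes=30), retry_policy=retry)
        if not await workflow.execute_activity(
                validate_device, device_id,
                start_to_close_timeout=timedelta(hours=2), retry_policy=retry):
            return "quarantined: failed validation"
        if not await workflow.execute_activity(
                run_burn_in, device_id,
                start_to_close_timeout=timedelta(hours=24), retry_policy=retry):
            return "quarantined: failed burn-in"
        await workflow.execute_activity(
            promote_device, device_id,
            start_to_close_timeout=timedelta(minutes=5), retry_policy=retry)
        return "active"
```

The appeal of modeling enrollment this way is durability: each step retries independently, and a node that fails validation or burn-in is quarantined rather than silently promoted.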

In short, Control Center makes hardware programmable. It abstracts away the physical layer and gives GPUs the same elasticity, consistency, and reliability expected from the cloud, starting at the rack rather than at the API.

Visibility that closes the loop

Automation without visibility is fragile. That’s why the Control Center works hand-in-hand with Nscale’s Observability Platform, the telemetry backbone that continuously measures and analyzes the health of our infrastructure. 

The Observability Platform tracks metrics, logs, and traces across all our managed services and application layers, including compute, storage, networking, and orchestration, giving engineers clear visibility into performance and operational events across every data center. This insight helps us meet platform reliability targets and customer service level agreements (SLAs).

Combined with automation, these systems form a closed loop of intelligence. Observability detects anomalies across clusters; Control Center responds automatically to remediate nodes, redeploy workloads, and restore balance before issues escalate. This tight integration reduces mean-time-to-recovery (MTTR), minimizes downtime, and ensures every GPU operates at peak performance.
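
As a rough illustration of that closed loop, the skeleton below polls for anomalies and dispatches remediation. `fetch_anomalies` and `start_remediation` are hypothetical stand-ins for Observability queries and Control Center workflow triggers, not a published Nscale API.

```python
# Illustrative closed-loop skeleton: detect anomalies, trigger remediation.
import time

COOLDOWN_S = 300  # don't re-trigger remediation for the same node too quickly


def fetch_anomalies() -> list[dict]:
    """Ask the observability layer for nodes breaching health thresholds."""
    return []  # placeholder, e.g. [{"node_id": "gpu-node-17", "fault": "ecc-errors"}]


def start_remediation(node_id: str, fault: str) -> None:
    """Trigger a Control Center workflow: drain, repair or reallocate, revalidate."""
    print(f"remediating {node_id}: {fault}")


def reconcile_forever(poll_interval_s: int = 30) -> None:
    last_fired: dict[str, float] = {}
    while True:
        for anomaly in fetch_anomalies():
            node, fault = anomaly["node_id"], anomaly["fault"]
            now = time.monotonic()
            if now - last_fired.get(node, 0.0) >= COOLDOWN_S:
                start_remediation(node, fault)
                last_fired[node] = now
        time.sleep(poll_interval_s)
```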

How we build intelligent infrastructure

Operational excellence starts with how we build. The same principles that guide our customers’ AI workloads — automation, observability, interoperability, and continuous learning — guide how we engineer the systems that power them.

AI-native approach

Our teams use AI throughout the development process: writing and testing code, validating configurations, and simulating how new workflows behave at scale. Every improvement feeds back into the system itself, making the automation smarter and the infrastructure more resilient. It’s a feedback loop between how we build and what we deliver: intelligent systems maintaining intelligent infrastructure.

Open and interoperable systems

In addition, and true to Nscale’s customer-first philosophy, Fleet Operations is built on open and extensible foundations. It uses Temporal to coordinate and sequence automation tasks and Grafana Mimir, Loki, and Tempo for observability. This openness ensures modularity and the ability to evolve alongside our customers’ needs.
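
One practical benefit of that openness: because Mimir exposes a Prometheus-compatible query API, a fleet health check can be a plain HTTP call. In the sketch below, the base URL and tenant ID are assumptions, while `DCGM_FI_DEV_GPU_TEMP` is the NVIDIA DCGM exporter's standard GPU temperature metric.

```python
# Minimal sketch of a health query against Mimir's Prometheus-compatible API.
# The endpoint URL and tenant ID are hypothetical.
import requests

MIMIR_QUERY_URL = "https://mimir.example.internal/prometheus/api/v1/query"


def gpus_over_temp(threshold_c: int = 85) -> list[tuple[str, float]]:
    """Return (instance, temperature) pairs for GPUs above the threshold."""
    resp = requests.get(
        MIMIR_QUERY_URL,
        params={"query": f"DCGM_FI_DEV_GPU_TEMP > {threshold_c}"},
        headers={"X-Scope-OrgID": "fleet-ops"},  # Mimir multi-tenant header
        timeout=10,
    )
    resp.raise_for_status()
    return [
        (series["metric"].get("instance", "unknown"), float(series["value"][1]))
        for series in resp.json()["data"]["result"]
    ]
```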

It’s also designed to integrate seamlessly with a customer’s existing data center systems while maintaining data sovereignty and on-prem operational control. The Control Center connects with Data Center Infrastructure Management (DCIM) tools for power, cooling, and asset monitoring, as well as with broader infrastructure layers like networking and compute.

Customer-centric engineering

Fleet Operations also includes an upcoming Radar API that will surface break-fix events from Observability for automated IT service management (ITSM) workflows. By connecting with ITSM tools, it brings customer service and physical infrastructure into one simple, automated loop. Instead of customer support teams hunting for information or guessing what’s happening with hardware, the ITSM system automatically receives clear, real-time updates whenever something changes: a new server comes online, a component fails validation, or a machine needs attention.

This means tickets can be created with the right details already filled in, support teams immediately understand what’s wrong, and automation can even fix issues before a human ever gets involved. This translates to faster responses, fewer back-and-forths, and more reliable service. It removes manual work, reduces confusion, and ensures every hardware event is tracked and handled consistently. ITSM integration turns physical operations into predictable, customer-friendly workflows powered by automation.
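
Since the Radar API is still upcoming, the exact event schema isn't public; the sketch below is a hypothetical shape of a break-fix event and its handoff into a generic ITSM ticket, purely to illustrate the "right details already filled in" idea.

```python
# Hypothetical shape of a Radar break-fix event and its ITSM handoff.
# All field names and values are assumptions, not the published Radar schema.
from dataclasses import dataclass


@dataclass
class BreakFixEvent:
    node_id: str
    component: str   # e.g. "gpu3" or "nic0"
    fault: str       # e.g. "xid-79" or "link-flap"
    severity: str    # "warning" or "critical"


def to_ticket(event: BreakFixEvent) -> dict:
    """Map a break-fix event onto a generic ITSM ticket payload."""
    return {
        "summary": f"[{event.severity}] {event.node_id}/{event.component}: {event.fault}",
        "description": (
            f"Automated break-fix event for node {event.node_id}: component "
            f"{event.component} reported fault {event.fault}. Filed by Fleet "
            "Operations; remediation workflow already dispatched."
        ),
        "priority": "P1" if event.severity == "critical" else "P3",
    }


# Example: to_ticket(BreakFixEvent("gpu-node-17", "gpu3", "xid-79", "critical"))
```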

The result is an intelligent system that evolves as quickly as the hardware it manages. One that’s ready for emerging architectures, new vendors, and next-generation GPU clusters. 

Turning automation into advantage

Fleet Operations lets our teams bring GPUs online faster, operate them with higher reliability, and maintain consistent performance across diverse hardware and regions. Automated lifecycle management and integrated observability reduce operational toil and accelerate recovery, driving higher utilization and lower downtime.

It’s how we turn complexity into consistency, and automation into advantage, proving that the future of AI infrastructure isn’t just about running models faster, but building systems smart enough to run themselves.
