Clockwork.io Launches TorchPass Workload Fault Tolerance to Improve Reliability in Large-Scale AI Training

PALO ALTO, CA – March 14, 2026 – (SeaPRwire) – Clockwork.io has announced the general availability of TorchPass Workload Fault Tolerance, a new capability designed to improve resilience in large-scale artificial intelligence training environments. The software-based solution aims to reduce the operational disruption and financial losses associated with infrastructure failures in distributed GPU clusters.

TorchPass is delivered as a core feature within the Clockwork.io FleetIQ™ platform. The technology applies the company’s Software-Driven AI Fabrics™ architecture to distributed training workloads, enabling systems to continue operating even when GPU hardware, network links, or cluster nodes encounter failures. By leveraging Live GPU Migration, the platform can transparently shift active training workloads to available resources without requiring job restarts or checkpoint recovery.

According to Clockwork.io's Suresh Vasudevan, the cost of infrastructure interruptions has become a growing challenge for organizations investing heavily in AI computing resources.

“Companies are investing billions in next-generation accelerators, yet distributed AI workloads still lose significant productivity due to avoidable infrastructure faults,” Vasudevan said. “TorchPass was designed to address that gap by allowing training workloads to continue operating through failures rather than forcing expensive restarts.”

Industry observers have also noted that reliability becomes increasingly difficult as AI clusters scale. Dylan Patel of SemiAnalysis said that maintaining continuity across large GPU deployments is becoming critical as new hardware architectures increase cluster density.

“As systems scale to larger compute domains, even minor errors—such as a single GPU failure or a network disruption—can terminate an entire training run,” Patel said. “Technologies like TorchPass help maintain utilization by enabling transparent failover and live workload migration.”

Addressing Reliability Challenges in Distributed AI Training

Distributed AI training is widely recognized as one of the most complex and failure-prone workloads in modern computing infrastructure. Research conducted by Meta FAIR indicates that the mean time to failure decreases sharply as cluster sizes increase. In clusters with more than a thousand GPUs, interruptions can occur within hours, frequently forcing jobs to restart from checkpoints.

These interruptions often result in lost compute time and reduced GPU utilization. When failures occur, training systems typically roll back to the latest saved checkpoint, discarding recent progress and requiring additional time to restore workloads.
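The cost of a rollback can be approximated with simple arithmetic. The sketch below uses illustrative values (not Clockwork.io figures): on average a failure lands midway through a checkpoint interval, so roughly half an interval of progress is discarded, plus restore time during which the whole job sits idle.

```python
# Illustrative estimate of compute lost per failure under checkpoint-restart.
# All numeric inputs are assumptions for this sketch, not vendor figures.

def lost_gpu_hours_per_failure(num_gpus: int,
                               checkpoint_interval_min: float,
                               restore_min: float) -> float:
    """Expected GPU-hours discarded by one failure: on average, half a
    checkpoint interval of progress is rolled back, and every GPU in the
    job idles while the checkpoint is restored."""
    lost_minutes = checkpoint_interval_min / 2 + restore_min
    return num_gpus * lost_minutes / 60

# A 1,024-GPU job checkpointing every 60 min with a 15-min restore:
loss = lost_gpu_hours_per_failure(1024, 60, 15)
print(f"{loss:.0f} GPU-hours lost per failure")  # 768 GPU-hours lost per failure
```

At the failure rates reported for thousand-GPU clusters, losses of this size can recur several times per day.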

TorchPass is designed to mitigate these inefficiencies by addressing faults proactively and maintaining workload continuity. By reducing restart events and preserving training progress, the system aims to improve cluster utilization and reduce operational overhead for enterprises and AI cloud providers.

Live GPU Migration Enables Continuous Training

The key mechanism behind TorchPass is Live GPU Migration, which enables affected training processes to move dynamically to spare resources within the cluster when faults occur. The migration process typically completes in approximately three minutes while the overall training workload continues to operate.

TorchPass supports three primary resilience scenarios:

  • Unplanned migration, which responds to unexpected failures such as GPU faults, kernel crashes, or power disruptions
  • Pre-emptive migration, triggered by early warning signals including thermal anomalies or ECC memory errors
  • Planned migration, allowing infrastructure maintenance or workload balancing without interrupting training operations
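The announcement does not describe TorchPass internals, but the three scenarios map naturally onto a small classification step. The sketch below is purely hypothetical: the signal names, the `MigrationType` enum, and the `classify_migration` function are invented for illustration.

```python
from enum import Enum

class MigrationType(Enum):
    UNPLANNED = "unplanned"    # react to a fault that has already occurred
    PREEMPTIVE = "preemptive"  # act on early-warning telemetry
    PLANNED = "planned"        # operator-scheduled maintenance or balancing

# Hypothetical signal names; a real system's fault-detection inputs differ.
FAULT_SIGNALS = {"gpu_fault", "kernel_crash", "power_loss"}
WARNING_SIGNALS = {"thermal_anomaly", "ecc_error_rate_high"}

def classify_migration(signal: str, operator_requested: bool = False) -> MigrationType:
    """Map an infrastructure signal to one of the three resilience scenarios."""
    if operator_requested:
        return MigrationType.PLANNED
    if signal in FAULT_SIGNALS:
        return MigrationType.UNPLANNED
    if signal in WARNING_SIGNALS:
        return MigrationType.PREEMPTIVE
    raise ValueError(f"unrecognized signal: {signal}")

print(classify_migration("ecc_error_rate_high").value)  # preemptive
```

The practical distinction is timing: pre-emptive and planned migrations can move state before anything breaks, while unplanned migration must recover after the fact.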

According to the company, this approach can reduce wasted training progress by up to 95 percent in certain environments.

Independent Testing Highlights Performance Benefits

Independent benchmarking conducted by Jordan Nanos evaluated TorchPass in large-scale training scenarios. Testing involved a GPT-scale training workload using a Kubernetes-based cluster equipped with 64 H200 GPUs.

The evaluation measured job completion time and model FLOPs utilization against both traditional checkpoint-restart methods and the open-source fault-tolerance framework TorchFT. The results indicated that TorchPass achieved faster recovery after simulated hardware failures while maintaining higher GPU utilization rates.

The benchmark also suggested that by improving fault tolerance, organizations may be able to reduce checkpoint frequency in training pipelines. This can enable larger batch sizes, lower memory pressure, and simplified storage management.
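The link between failure rate and checkpoint frequency can be made concrete with the classic Young/Daly approximation, a standard HPC result (not specific to TorchPass): the optimal interval between checkpoints grows with the square root of the mean time to failure, so software that tolerates faults effectively raises MTTF and permits less frequent checkpointing.

```python
import math

def young_daly_interval_s(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Young/Daly first-order optimal checkpoint interval:
    tau_opt = sqrt(2 * C * MTTF), where C is the time to write one
    checkpoint and MTTF is the mean time to failure."""
    return math.sqrt(2 * checkpoint_cost_s * mttf_s)

C = 60.0  # assumed 60 s to write one checkpoint
for mttf_h in (4, 24):
    tau = young_daly_interval_s(C, mttf_h * 3600)
    print(f"MTTF {mttf_h:>2} h -> checkpoint every {tau / 60:.0f} min")
```

With these assumed inputs, raising effective MTTF from 4 hours to 24 hours stretches the optimal interval from roughly 22 minutes to roughly 54 minutes, cutting checkpoint I/O by more than half.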

Financial Impact for Large AI Clusters

For operators managing large GPU deployments, improved reliability can translate into significant cost savings. Clockwork.io estimates that in a typical deployment using 2,048 H200 GPUs, TorchPass could recover more than $6 million in annual compute value by preventing wasted GPU hours caused by restart-driven downtime.

These savings primarily result from avoiding repeated training interruptions and eliminating idle recovery periods. By maintaining continuous training progress, organizations may also accelerate the time required to complete large model training runs.
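The dollar figure depends entirely on cluster size, GPU pricing, and how many hours restarts actually waste. The sketch below shows the underlying arithmetic with assumed inputs; the $/GPU-hour rate and the waste fraction are illustrative, not Clockwork.io's assumptions.

```python
def annual_recoverable_value(num_gpus: int,
                             gpu_hour_rate_usd: float,
                             wasted_fraction: float) -> float:
    """Annual compute value lost to restart-driven downtime:
    total GPU-hours per year * price per GPU-hour * fraction wasted."""
    gpu_hours_per_year = num_gpus * 24 * 365
    return gpu_hours_per_year * gpu_hour_rate_usd * wasted_fraction

# 2,048 GPUs at an assumed $3.50/GPU-hour, with an assumed 10% of hours
# lost to rollbacks and idle recovery, lands in the same ballpark as the
# company's estimate:
value = annual_recoverable_value(2048, 3.50, 0.10)
print(f"${value:,.0f} per year")  # $6,279,168 per year
```

Even modest changes to the assumed rate or waste fraction move the result by millions of dollars, which is why such estimates should be recomputed against an operator's own pricing and failure data.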

Supporting Next-Generation AI Infrastructure

Clockwork.io positions TorchPass as part of a broader effort to make reliability a software-defined capability within AI infrastructure. This approach is designed to support emerging high-density systems, including rack-scale architectures such as the NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72.

TorchPass expands on the company’s earlier Network Fault Tolerance capabilities, which address network-level disruptions by rerouting traffic around failing links.

Together, these technologies form the foundation of Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer designed to coordinate compute, network, and storage resources across large AI clusters. The goal is to enable operators to run heterogeneous infrastructure as a unified system while maintaining predictable performance and high utilization.

Clockwork.io will present TorchPass during the upcoming NVIDIA GTC conference from March 16 to 19.

About Clockwork.io

Clockwork.io develops Software-Driven AI Fabrics™, a programmable software layer designed to improve observability, determinism, and fault tolerance in large-scale AI clusters. Its FleetIQ platform enables enterprises to train and operate complex AI workloads while maintaining high infrastructure utilization. Organizations including Uber, Wells Fargo, Nebius, and Nscale use Clockwork.io technologies to support AI infrastructure operations.