This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. This means that for most large, AI-focused enterprises or AI clouds, failure-driven restarts are inevitable, which makes them a major barrier to scaling AI’s impact.
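To put the mean-time-to-failure figures above in daily terms, the short Python sketch below converts each one into an approximate number of failures per day. It is illustrative arithmetic only: it assumes failures arrive at a steady average rate, and the function name `failures_per_day` is a hypothetical helper, not part of any Clockwork.io or Meta tooling.

```python
# Illustrative arithmetic only; assumes a steady average failure rate.
HOURS_PER_DAY = 24.0

# Mean time to failure (hours) by cluster size, as quoted in the release.
MTTF_HOURS = {1024: 7.9, 16384: 1.8}


def failures_per_day(mttf_hours: float) -> float:
    """Average number of failures per day implied by a given MTTF."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return HOURS_PER_DAY / mttf_hours


if __name__ == "__main__":
    for gpus, mttf in MTTF_HOURS.items():
        rate = failures_per_day(mttf)
        print(f"{gpus:>6} GPUs: MTTF {mttf} h -> about {rate:.1f} failures per day")
```

Under that assumption, a 1,024-GPU cluster sees roughly three failures a day and a 16,384-GPU cluster roughly thirteen.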

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass tackles this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. Vital for enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
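The three scenarios above can be pictured as a simple mapping from trigger to migration type. The Python sketch below is purely illustrative: the event names, the `MigrationKind` enum, and the `classify` function are hypothetical and do not represent the TorchPass API or its actual decision logic.

```python
# Hypothetical illustration of the three scenarios; not the TorchPass API.
from enum import Enum


class MigrationKind(Enum):
    UNPLANNED = "unplanned"
    PREEMPTIVE = "pre-emptive"
    PLANNED = "planned"


# Example triggers taken from the scenario descriptions (names are assumptions).
HARD_FAULTS = {"kernel_crash", "power_failure", "gpu_fault"}
EARLY_WARNINGS = {"rising_temperature", "ecc_memory_error"}
OPERATOR_ACTIONS = {"maintenance", "patching", "workload_rebalancing"}


def classify(event: str) -> MigrationKind:
    """Map a trigger name to the migration scenario it would fall under."""
    if event in HARD_FAULTS:
        return MigrationKind.UNPLANNED
    if event in EARLY_WARNINGS:
        return MigrationKind.PREEMPTIVE
    if event in OPERATOR_ACTIONS:
        return MigrationKind.PLANNED
    raise ValueError(f"unknown event: {event!r}")


if __name__ == "__main__":
    for name in ("gpu_fault", "ecc_memory_error", "patching"):
        print(f"{name} -> {classify(name).value} migration")
```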

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
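As a quick check on the figures above, the Python sketch below applies a 95% reduction to three hours of lost time per day. The GPU-hours estimate additionally assumes that every GPU in the 1,024-GPU cluster is idle for the whole of the lost time, which is a simplification introduced here, not a claim from the release.

```python
# Illustrative arithmetic; the "all GPUs idle" assumption is a simplification.
BASELINE_LOST_MINUTES = 3 * 60   # about three hours of lost time per day
REDUCTION = 0.95                 # 95% reduction quoted in the release
CLUSTER_GPUS = 1024


def remaining_lost_minutes(baseline_minutes: float, reduction: float) -> float:
    """Lost time per day that remains after applying a fractional reduction."""
    return baseline_minutes * (1.0 - reduction)


def recovered_gpu_hours(baseline_minutes: float, reduction: float, gpus: int) -> float:
    """GPU-hours recovered per day if all GPUs idle during lost time."""
    recovered_minutes = baseline_minutes * reduction
    return recovered_minutes / 60.0 * gpus


if __name__ == "__main__":
    left = remaining_lost_minutes(BASELINE_LOST_MINUTES, REDUCTION)
    saved = recovered_gpu_hours(BASELINE_LOST_MINUTES, REDUCTION, CLUSTER_GPUS)
    print(f"Remaining lost time: about {left:.0f} minutes per day")
    print(f"Recovered compute: about {saved:,.0f} GPU-hours per day")
```

The remaining lost time works out to about nine minutes a day, consistent with the "under ten minutes" figure.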

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.
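For a sense of scale, the Python sketch below breaks the $6 million annual figure down per GPU and shows the dollar value per recovered GPU-hour that it would imply. The release gives only "hundreds of thousands" of GPU-hours, so the GPU-hour counts used here are hypothetical placeholders chosen for illustration, not reported data.

```python
# Illustrative arithmetic; the GPU-hour counts below are hypothetical.
ANNUAL_SAVINGS_USD = 6_000_000   # "over $6 million", treated as a lower bound
DEPLOYMENT_GPUS = 2048


def savings_per_gpu(total_usd: float, gpus: int) -> float:
    """Annual savings attributed to each GPU in the deployment."""
    return total_usd / gpus


def implied_rate_per_gpu_hour(total_usd: float, recovered_gpu_hours: float) -> float:
    """Dollar value per recovered GPU-hour implied by the total savings."""
    if recovered_gpu_hours <= 0:
        raise ValueError("recovered GPU-hours must be positive")
    return total_usd / recovered_gpu_hours


if __name__ == "__main__":
    per_gpu = savings_per_gpu(ANNUAL_SAVINGS_USD, DEPLOYMENT_GPUS)
    print(f"Savings per GPU per year: ${per_gpu:,.2f}")
    # Hypothetical recovered GPU-hour counts, for illustration only.
    for hours in (200_000, 500_000, 900_000):
        rate = implied_rate_per_gpu_hour(ANNUAL_SAVINGS_USD, hours)
        print(f"{hours:>7,} GPU-hours recovered -> ${rate:.2f} per GPU-hour")
```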

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC, March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork


Information contained on this page is provided by an independent third-party content provider. XPRMedia and this Site make no warranties or representations in connection therewith. If you are affiliated with this page and would like it removed please contact pressreleases@xpr.media
