Survey: When AI factories fail, 6 in 10 enterprises cannot tell you why

New Virtana Study Finds Enterprises Scaling AI Faster Than They Can Govern It

Two-thirds of enterprises are running AI infrastructure without system-level visibility, creating a fragile foundation beneath rapidly expanding AI deployments. New research from Virtana found that as AI adoption accelerates, a new operational reality is emerging: innovation is outpacing control.

The AI Factory Reality Check study, based on a survey of 788 US enterprise decision-makers, examines how AI factories operate under real conditions. More than half of respondents are already scaling AI across teams without addressing the system-level observability required to understand and control it. The study documents a widening disconnect between AI factory expansion and the operational foundation needed to sustain it.

“Modern enterprises, including banks, telcos, insurers and airlines, are increasingly dependent on AI-driven services. As a result, one of the greatest risks to the business is any disruption across these AI systems, where failures across applications or underlying infrastructure directly translate into business impact,” said Paul Appleby, CEO of Virtana. “AI systems function as interconnected systems, where infrastructure, data pipelines, token consumption, and model behavior continuously influence outcomes. Yet most organizations still monitor these elements in silos. Without system-wide understanding of these dependencies, they cannot explain how outcomes are produced, control cost, or determine whether those outcomes can be trusted.”

Enterprise AI Has Scaled. Control Has Not.

Enterprise AI has moved beyond pilots into at-scale operations. Fifty-four percent of organizations are already scaling AI across teams, while another 23% are managing production workloads alongside infrastructure expansion. At the largest enterprises, particularly those above $10 billion in revenue, this creates systems that are increasingly difficult to understand and control.

As AI factories scale, system-level observability is not keeping pace. Organizations are expanding AI without the visibility required to understand performance, control cost, or manage risk across the full stack. Instead, critical investments in the operational foundation are being deferred:

  • 56% of enterprises are deferring legacy infrastructure modernization
  • 54% are deprioritizing cost optimization initiatives

At the same time, cost pressures are forcing enterprises to continuously reconfigure their AI systems, often without the visibility to understand the impact of those changes. Eighty percent of enterprises report that the cost of premium AI hardware is reshaping infrastructure decisions. In response:

  • 60% are shifting workloads across hybrid environments
  • 58% are accelerating consolidation to improve per-unit efficiency

These are structural changes to live systems under load. Each shift alters dependencies, resource contention, and performance characteristics across the stack.

“Without system-level observability, organizations cannot determine how these changes affect outcomes, cost, or reliability. As a result, they are continuously optimizing AI systems they do not fully understand, introducing risk with every change,” continued Appleby.

Inside the AI Factory, Visibility Is the Missing Variable

As AI factories scale, visibility is emerging as the missing variable in understanding and controlling system behavior. The research shows that as enterprises expand AI, disparities in system understanding and operational control are becoming more pronounced:

  • 66% of enterprises are operating AI infrastructure without reliable performance baselines
  • Only 34% describe AI workload performance as highly predictable
  • That drops to 25% at organizations with more than 50,000 employees

This lack of visibility extends into incident response:

  • 59% cannot automatically identify root cause across infrastructure domains when an alert fires
  • 25% still rely on manual investigations across disconnected consoles as their first response

When AI systems break, they do not fail cleanly. System understanding degrades, forcing teams into reactive analysis while high-cost GPU capacity sits underutilized, issues compound, and outcomes can no longer be fully explained or controlled.

“These are not abstract concerns,” continued Appleby. “As AI becomes core enterprise infrastructure, a clear divide is emerging between organizations that understand how their systems produce outcomes and those that cannot explain or control them. Without visibility across models, tokens, GPUs, and infrastructure, teams absorb hidden cost, performance gaps, and ungoverned risk. Those that understand their systems gain end-to-end visibility and control so they can optimize cost in real time, ensure reliable performance, and prove outcomes. The result is declining resilience, eroding trust, and constrained growth as AI becomes infrastructure that must be governed and optimized at scale.”

ROI Visibility Is the Prerequisite Enterprises Cannot Defer

The study reveals a disconnect between how AI systems operate and how they are observed. A 17-point gap exists between Infra/SRE practitioners and executives on automated root cause capabilities:

  • 69% of Infra/SRE teams report lacking automated cross-domain root cause identification
  • 52% of executives report the same

This gap reflects a broader breakdown in system-level observability, where critical signals remain fragmented across the stack:

  • 57% cite cost and efficiency metrics as a top challenge
  • 56% cite GPU utilization tracking
  • 52% cite data pipeline visibility

These challenges span business outcomes, AI infrastructure, and data dependencies, yet are still managed in isolation.

GPU cost and utilization remain the most difficult operational challenge for 35% of enterprises, with impact varying by role:

  • 39% of executives experience it as financial accountability pressure
  • 36% of architects cite integration complexity in distributed environments
  • 22% of Infra/SRE teams face it as a scaling and reliability challenge

This variation reflects how different parts of the organization see different fragments of the same system, without a unified view of cause and effect.

Across all roles and revenue bands, enterprise priorities are consistent:

  • 38% need unified visibility across AI and infrastructure layers
  • 32% need AI-driven root cause analysis without manual correlation

Together, these priorities point to a single requirement: system-aware observability that connects performance, cost, and outcomes across the full stack. Today, most enterprises are operating AI systems they cannot fully observe or explain.
