According to The State of Resilience 2025 report, published by Cockroach Labs, outages are commonplace in most organizations, with 55% of companies reporting weekly outages and 14% reporting daily outages. A staggering 100% of survey participants experienced revenue losses due to outages, with some companies (8%) reporting losses of USD 1 million or more over the last 12 months.
Cockroach Labs published the report, titled 'The State of Resilience 2025: Confronting Outages, Downtime, and Organizational Readiness' (form completion required to download), in October 2024 after surveying 1,000 senior executives about the resilience of their IT systems and the challenges their organizations face.
The report highlights that almost all technology leaders are concerned about outages and their impact, but their organizations often do not do enough to address operational weaknesses. Survey respondents named network and software failures as the leading causes of outages, followed by cloud platform and third-party service reliability issues and cyberattacks.
Common Causes of Downtime (Source: The State of Resilience 2025 Report)
The report authors summarized the challenges shared by the participants:
Fallout following the recent CrowdStrike global outage jolted many organizations into action — 94% of technical executives in this survey said that the event has catalyzed their companies to reassess their operational resilience. At the same time, leaders at the global enterprise companies surveyed [...] report that entrenched resistance to change, misaligned internal priorities, outdated systems, and budgetary gridlock prevent many from implementing meaningful — sometimes even desperately needed — operational resilience measures.
Although operational weaknesses lead to outages, organizations reported many obstacles to improving resilience. Prioritization and budgetary constraints were the most frequently cited, followed by system complexity, inadequate training, and insufficient staffing levels.
Main Challenges for Improving Resilience (Source: The State of Resilience 2025 Report)
Separately, the authors of the 2024 DORA Accelerate State of DevOps report covered issues resulting from software deployments and analyzed key metrics of software delivery stability. In the 2024 report, the team introduced a new metric to help understand why the change failure rate (CFR) stood out from the other DORA metrics. The new metric, named rework rate, tracks the number of unplanned deployments made to address a user-facing application issue. Together with the change failure rate, it forms a software delivery stability factor.
Delivery Performance Levels (Source: DORA State of DevOps Report 2024)
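As a rough illustration of the two stability metrics described above, the following Python sketch computes a change failure rate and a rework rate from a list of deployment records. The `Deployment` class and its two flags are hypothetical names chosen for this example, and expressing rework rate as a share of all deployments is an assumption made here, not a formula taken from the report.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical flag: the deployment degraded service and needed remediation.
    caused_failure: bool
    # Hypothetical flag: the deployment was unplanned and made to fix
    # a user-facing application issue.
    unplanned_fix: bool


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that caused a failure (0.0 if there are none)."""
    if not deployments:
        return 0.0
    return sum(d.caused_failure for d in deployments) / len(deployments)


def rework_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that were unplanned fixes (assumed definition)."""
    if not deployments:
        return 0.0
    return sum(d.unplanned_fix for d in deployments) / len(deployments)


# Example data: 10 deployments, 2 caused failures, 3 were unplanned fixes.
history = (
    [Deployment(caused_failure=True, unplanned_fix=False)] * 2
    + [Deployment(caused_failure=False, unplanned_fix=True)] * 3
    + [Deployment(caused_failure=False, unplanned_fix=False)] * 5
)

print(f"Change failure rate: {change_failure_rate(history):.0%}")
print(f"Rework rate: {rework_rate(history):.0%}")
```

Tracking both numbers over time, rather than a single snapshot, is what lets a team see whether stability is actually improving.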
As in previous years, the State of DevOps report discussed the relationship between software delivery throughput and platform stability. It concluded that, despite a strong correlation between release frequency and a lower change failure rate, releasing more often does not guarantee greater stability, because organizational and technical challenges can get in the way. The report emphasized the need to recognize improvements in software delivery performance, and not only absolute performance levels.