Widespread software outages can often be prevented through resilience—a software program's ability to maintain critical functionality despite bugs or unexpected conditions. Given the significant role software plays in economic infrastructure, addressing common types of software flaws through systematic and known processes is essential. These processes include rigorous testing of code and configuration, incremental rollout procedures, and the development and use of APIs that minimize risk.
Agency staff have noted that resilience can be threatened by a lack of competition in critical inputs. Dominant firms promoting widespread adoption of their digital products can create single points of failure for entire industries. As more users rely on the same enterprise software, the scale of disruption from outages increases. Each outage offers an opportunity to critically assess existing systems for resilience.
Failures can also result from changes beyond code, such as configuration and data alterations. Software instructions may reference configuration files or other data that can lead to failures if not managed with care. For instance, a local ice cream store's mobile app might crash due to a change in its configuration file—like adding an apostrophe in "Cookies 'n Cream"—even if the display code remains unchanged.
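The ice cream example can be sketched in a few lines of Python. This is a hypothetical illustration, not any real app's code: the configuration format (one single-quoted flavor per line) and the names `parse_flavor_fragile` and `load_flavors` are assumptions made for the example. The parsing code never changes, yet a new data value breaks it, and a simple fallback to the last known good menu keeps the app running.

```python
from typing import List


def parse_flavor_fragile(line: str) -> str:
    """Parse a line such as: flavor = 'Vanilla'

    Hypothetical fragile parser: it assumes each line contains exactly
    one pair of single quotes, so an apostrophe inside the flavor name
    makes the unpacking below raise ValueError.
    """
    _, name, _ = line.split("'")
    return name


def load_flavors(lines: List[str], last_known_good: List[str]) -> List[str]:
    """Parse every line; if any line is malformed, keep the old menu.

    Falling back to the last known good configuration is one possible
    way to stop a bad data change from crashing the whole app.
    """
    flavors = []
    for line in lines:
        try:
            flavors.append(parse_flavor_fragile(line))
        except ValueError:
            # Reject the new configuration as a whole rather than crash.
            return list(last_known_good)
    return flavors
```

Calling `parse_flavor_fragile` directly on a line containing 'Cookies 'n Cream' raises an error, while `load_flavors` returns the previous menu instead. Validating configuration before it ships, in the same way code is tested, would catch the problem even earlier.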
Guarding against failures from non-code changes is especially crucial for software used by banks, hospitals, and transportation services. Vendors must employ appropriate architecture and processes to guard against failures from both code and non-code changes.
Deploying software changes incrementally rather than all at once is advisable. Real-world environments are too diverse for even the most rigorous testing environments to replicate fully. A common strategy involves initially deploying updates to a small subset of machines before rolling out broadly after confirming stability.
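One common way to choose the "small subset of machines" is to assign each machine a stable bucket and raise the rollout percentage over time. The sketch below is illustrative only: the function names and the use of a SHA-256 hash of a machine identifier are assumptions for the example, not a description of any particular vendor's process.

```python
import hashlib


def rollout_bucket(machine_id: str) -> int:
    """Map a machine identifier to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_receive_update(machine_id: str, rollout_percent: int) -> bool:
    """Return True if this machine falls inside the current rollout stage.

    Because the bucket is stable, a machine included at 5 percent is
    still included at 50 percent, so each stage only adds machines.
    """
    return rollout_bucket(machine_id) < rollout_percent
```

A deployment could start with `rollout_percent` at a low value, watch for crashes or error reports, and increase it only after the early machines remain stable.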
Auto-updating software helps deliver critical security or stability updates promptly, but it poses risks when updates are pushed broadly without mechanisms to detect and quickly reverse problematic changes.
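A mechanism to detect and reverse a problematic update can be sketched as follows. This is a minimal illustration under stated assumptions: `install`, `revert`, and `is_healthy` are hypothetical callbacks standing in for whatever a real updater uses to apply an update, restore the previous version, and check that a machine still works.

```python
from typing import Callable, List


def deploy_with_rollback(
    machines: List[str],
    install: Callable[[str], None],
    revert: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failures: int = 0,
) -> bool:
    """Update machines one at a time, reverting everything on failure.

    After each install the machine is health-checked. Once the number
    of unhealthy machines exceeds max_failures, every machine updated
    so far is reverted and the rollout stops. Returns True only if the
    update reached all machines.
    """
    updated: List[str] = []
    failures = 0
    for machine in machines:
        install(machine)
        updated.append(machine)
        if not is_healthy(machine):
            failures += 1
            if failures > max_failures:
                # Undo in reverse order, newest update first.
                for done in reversed(updated):
                    revert(done)
                return False
    return True
```

With the default `max_failures=0`, a single unhealthy machine halts the rollout before the remaining machines are touched, which limits how far a bad update can spread.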
Platforms and operating systems with resilient APIs are vital for system stability. APIs enable structured communication between software pieces—for example, an app using a phone's location API to guide customers to a store. Poor API design could cause system-wide crashes instead of isolated app failures.
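The difference between an isolated app failure and a system-wide crash can be illustrated with a toy host that runs several apps. Everything here is hypothetical: `run_isolated` and the app callables are invented for the example, and real platforms typically rely on stronger boundaries such as separate processes rather than exception handling alone.

```python
from typing import Callable, Dict


def run_isolated(apps: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run each app behind a boundary that contains its failures.

    An exception raised by one app is recorded as that app's result
    instead of propagating and stopping the host or the other apps.
    """
    results: Dict[str, str] = {}
    for name, app in apps.items():
        try:
            results[name] = "ok: " + app()
        except Exception as exc:
            # Contain the failure: report it and keep going.
            results[name] = "failed: " + type(exc).__name__
    return results
```

If one of the apps raises an error, the others still run and return their results. Without the boundary, the first error would end the loop and take every remaining app down with it.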
Creating resilient APIs should not mean excluding competitors that need access in order to offer effective alternatives. While some argue that broad third-party access compromises resiliency, interoperability and reliability can coexist, and resilience need not come at the expense of competition.
The Federal Trade Commission (FTC) continues to monitor developments in this area because widespread outages can cause substantial consumer injury, including loss of access to critical services and financial loss. The FTC examines whether companies' computer systems have adequate security measures to protect consumer data, and that work shows there are systematic steps companies can take to prevent bad outcomes.
Thank you to contributing staff: Simon Fondrie-Teitler, Hannah Garden-Monheit, Wells Harrell, Amritha Jayanti, Erik Jones, Stephanie T. Nguyen, Shaoul Sussman, Grady Ward, Ben Wiseman.