A recent US congressional hearing on the July 2024 CrowdStrike incident raised the question of whether automated system recovery could prevent similar catastrophic events in the future. The debate centered on whether responsibility for automated recovery should lie with the third-party software vendor or be treated as a broader matter of operating system (OS) resilience.
One major point of contention was the idea of a system that can heal itself after a critical failure, such as a blue screen of death (BSOD). A BSOD appears when Windows hits an error it cannot safely recover from. When the fault lies in a component loaded during startup, the device cannot finish booting the operating system and applications. In the case of the CrowdStrike incident, a faulty update file triggered BSODs that led to a global IT crisis.
The debate delved into the complexities of software failures, particularly those in software that runs with low-level access, known as ‘kernel mode’. When a component at this level fails during startup, the device can be caught in a BSOD loop that requires expert intervention to resolve. The question then arises: should the third-party software vendor be responsible for implementing an auto-recovery mechanism, or should the OS take the lead in initiating recovery in collaboration with third-party applications?
The author uses the analogy of a gasoline car engine and its spark plugs to illustrate software failures and the need for a standardized recovery process. Just as a mechanic replaces faulty spark plugs to get the engine running again, a critical software failure calls for swapping the faulty component for a working one. The author argues that, regardless of which third-party software is involved, the recovery process should be the same and should be managed by the OS.
In advocating for OS-managed recovery, the author proposes a system where the operating system tracks changes made by third-party software updates and retains previous working files or states to facilitate recovery in the event of a failure. This approach would eliminate the need for individual software vendors to develop their own recovery mechanisms and ensure a more efficient and standardized recovery process across all applications with kernel-mode access.
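The proposal above can be illustrated with a minimal sketch. It is a toy in-memory model, not how any real operating system implements recovery, and every name in it (`RecoveryManager`, `UpdateRecord`, `max_failed_boots`, the file names) is hypothetical. It assumes the OS journals each file replaced by a third-party update, keeps the previous copy, and restores it after a set number of consecutive failed boots.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class UpdateRecord:
    """One tracked change: who replaced which file, and what it held before."""
    vendor: str
    path: str
    previous: Optional[bytes]  # None means the file did not exist before


@dataclass
class RecoveryManager:
    """Hypothetical OS service that tracks updates to kernel-mode files."""
    files: Dict[str, bytes] = field(default_factory=dict)  # stand-in for the driver store
    journal: List[UpdateRecord] = field(default_factory=list)
    failed_boots: int = 0
    max_failed_boots: int = 2  # assumed threshold, chosen for illustration

    def apply_update(self, vendor: str, path: str, content: bytes) -> None:
        # Keep the last working copy before the vendor's file is written.
        self.journal.append(UpdateRecord(vendor, path, self.files.get(path)))
        self.files[path] = content

    def report_boot(self, succeeded: bool) -> bool:
        """Record a boot outcome; return True if a rollback was performed."""
        if succeeded:
            # The current state becomes the new known-good baseline.
            self.failed_boots = 0
            self.journal.clear()
            return False
        self.failed_boots += 1
        if self.failed_boots >= self.max_failed_boots and self.journal:
            self.roll_back()
            self.failed_boots = 0
            return True
        return False

    def roll_back(self) -> None:
        # Undo tracked changes newest-first, whichever vendor made them.
        while self.journal:
            record = self.journal.pop()
            if record.previous is None:
                self.files.pop(record.path, None)
            else:
                self.files[record.path] = record.previous


manager = RecoveryManager(files={"sensor.sys": b"v1"})
manager.report_boot(True)                       # baseline boot succeeds
manager.apply_update("ExampleVendor", "sensor.sys", b"v2-faulty")
manager.report_boot(False)                      # first failed boot: no action yet
rolled_back = manager.report_boot(False)        # second failed boot: rollback
print(rolled_back, manager.files["sensor.sys"])
```

The rollback path never inspects the vendor, which mirrors the author's point that the recovery process stays the same regardless of which third-party software caused the failure.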
While acknowledging the complexities involved in developing such a system, the author emphasizes the potential benefits of collaboration between OS developers and third-party software vendors to enhance system resilience and reduce the risk of widespread outages. The author believes that a standardized recovery option built into the OS could keep future incidents like the CrowdStrike outage from escalating, because affected devices would recover quickly and without manual intervention.
Overall, the debate surrounding automated system recovery highlights the importance of building resilient ecosystems and the need for collaboration between software vendors and OS developers to enhance overall system reliability and security. By prioritizing recovery options and implementing standardized processes, organizations can better prepare for and mitigate the impact of potential system failures in the future.

