Separate Breach Details Can Bleed Into Each Other, Incident Responders Find
Cybersecurity investigators have issued a significant warning regarding the use of artificial intelligence (AI) tools for drafting incident response reports. Recent research has highlighted that information related to one security incident can inadvertently contaminate reports for unrelated incidents if both are generated within the same AI session. This finding raises concerns about the integrity and accuracy of reports generated using these advanced technologies.
The warning comes from Cisco’s threat intelligence group, Talos, which has been conducting controlled experiments to determine the efficacy of large language models (LLMs) in creating incident reports. They discovered that even when notes from one incident are deleted prior to drafting a report for a second, unrelated incident, information from the first can still influence the output. The researchers concluded that the most effective solution is to initiate a completely new session of the language model each time a different incident report is generated.
In the realm of cybersecurity, the stakes are high. Delivering inaccurate reports—regardless of whether they are AI-generated or human-authored—poses substantial professional, regulatory, and legal risks. John Gallagher, Vice President of Security Automation at Viakoo Labs, underscored the potential ramifications by stating that for firms engaged in multi-tenant incident response, accidental exposure of data could lead to violations of data privacy laws and could potentially void insurance policies. These factors add urgency to the need for accurate incident documentation and reporting.
Concerns are not limited to the cybersecurity field. Legal professionals have also felt the repercussions of AI errors. In a notable case last year, a U.S. federal judge imposed fines on lawyers from Morgan & Morgan—one of America’s largest personal injury law firms—after they included AI-generated, fabricated case citations in court documents. This ruling reinforced the responsibility of attorneys to verify the accuracy of information they submit to courts, a duty that remains irrespective of AI’s involvement.
Interestingly, this problem of information commingling is not unique, as researchers have recognized the complexities within AI systems where data from distinct events can become intermingled. The underlying issue stems from the fixed memory context window utilized by LLMs, which is limited in capacity. When this memory fills up, the model starts discarding earlier data, including initial instructions. Consequently, executing multiple tasks within a single session can lead to conflicting or blended outputs.
Previous studies focused on the use of AI technologies in cybersecurity have exposed recurring instances of AI-generated outputs that are inaccurate or nonsensical—often labeled as "hallucinated outputs." Such inaccuracies not only waste the time and resources of analysts but also raise alarms about the reliability of AI in critical settings. Talos’s tests on several LLMs, including ChatGPT, Claude, and Gemini, highlighted inconsistent outputs, with the models frequently offering different recommendations based on identical inputs.
The researchers noted significant variability even when the same dataset was used across different instances. In data breach scenarios, for example, one model might suggest a full organization-wide password reset, while another might recommend only a targeted reset. It became apparent that LLMs continued to revert back to their initial recommendations, regardless of new data that may have altered the optimal course of action.
To mitigate these challenges, Talos researchers identified four specific techniques in prompt engineering—termed "inconsistency control methods"—that yielded the most satisfactory results. First, breaking down tasks into narrow, single-purpose instructions helped reduce cross-contamination between report sections. Second, being explicit about which documents the model should reference prevented it from pulling data from conflicting or unpredictable sources. Third, setting clear parameters for length, tone, and structure enforced a degree of formatting consistency. Lastly, embedding a rigid template into the instructions significantly enhanced output predictability.
Moreover, the research team developed a "recommendation polisher" prompt that resulted in a more robust list of recommendations, including actions that human participants didn’t immediately identify. The application of these methods led to reports of higher writing quality while simultaneously reducing report-writing time by 50%. This efficiency included not only the time taken to manually write the 10% of content that could not be generated automatically but also the time spent editing AI-generated content.
In a blind test, the generated report received positive feedback from human reviewers, who highlighted a notable reduction in typos and grammatical errors compared to average reports. Such findings affirm the potential of LLMs in generating high-quality incident reports, although significant caveats remain.
In terms of model selection, the researchers found that by the end of 2025, the Claude Sonnet 4.5 model emerged as the most consistent in delivering high-quality prose and flagged internal conflicts in source notes—thereby minimizing the need for manual corrections.
However, not every quality-assurance component has met expectations. A grammar-checking prompt, for example, exhibited less than 50% accuracy and often failed to identify errors consistently across multiple runs of identical inputs. Hence, the researchers concluded that this grammar-checking mechanism is currently unsuitable for production use.
The broader conversation regarding the trustworthiness of LLM outputs—specifically the extent to which they can be utilized without ongoing human oversight—remains vital. According to Cisco’s AI Readiness Index, which surveys the current landscape of AI adoption, an impressive 83% of organizations are poised to deploy AI agents. Yet, only 32% of those have established a formal process to monitor and evaluate the validity of their AI-generated outputs.
Gallagher emphasized the dual-edged nature of LLMs in the context of incident response. While these models can streamline remediation efforts—suggesting needed system patches or credential rotations—they also require time-consuming vetting by security professionals. Moreover, AI tools are not yet capable of disseminating a high-level overview of an entire incident.
As Gallagher succinctly summarized, "The notion that AI can reliably synthesize an incident and prescribe high-level strategic next steps, such as scoping the blast radius or negotiating change management, is currently overhyped. Such strategic judgment must remain in human hands." This assertion underscores the complexities of integrating AI into incident response frameworks while honoring the critical human judgment required in these high-stakes scenarios.
