Cloud security vendor Skyhawk has introduced a new benchmark to evaluate the capabilities of generative AI large language models (LLMs) in identifying and scoring cybersecurity threats within cloud logs and telemetry. The company’s free resource examines the performance of various LLMs, including ChatGPT, Google Bard, and Anthropic Claude, to determine how accurately they can predict the maliciousness of an attack sequence.
Generative AI chatbots and LLMs have the potential to enhance an organization’s cybersecurity in several important ways. With proper use, these models can assist in identifying and analyzing security threats more quickly and efficiently than human security analysts.
According to a Cloud Security Alliance (CSA) report on the cybersecurity implications of LLMs, generative AI models can significantly improve the scanning and filtering of security vulnerabilities. The CSA demonstrated that OpenAI’s Codex API, for example, is an effective vulnerability scanner for programming languages such as C, C#, Java, and JavaScript. The report suggests that LLMs like Codex will likely become standard components of future vulnerability scanners. By detecting and flagging insecure code patterns in multiple languages, these models enable developers to proactively address potential vulnerabilities before they become critical security risks. Additionally, the report found that generative AI/LLMs excel in threat filtering, providing valuable context and explaining threat identifiers that might otherwise go unnoticed by human security personnel.
“The importance of swiftly and effectively detecting cloud security threats cannot be overstated. We firmly believe that harnessing generative AI can greatly benefit security teams in that regard. However, not all LLMs are created equal,” stated Amir Shachar, the director of AI and research at Skyhawk.
Skyhawk’s benchmark evaluates LLM output by comparing and scoring an attack sequence generated by the company’s machine-learning models against a sample of hundreds of human-labeled sequences. The evaluation is based on precision, recall, and F1 scores, with scores closer to one indicating closer agreement between the LLM’s verdicts and the human labels. The results of the benchmark can be seen on Skyhawk’s website.
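To make these metrics concrete, the sketch below scores a set of hypothetical LLM verdicts against hypothetical human labels. Skyhawk has not published its scoring code or data, so the function and sample values here are purely illustrative of how precision, recall, and F1 are conventionally computed for binary malicious/benign labels.

```python
def precision_recall_f1(truth, predicted):
    """Compute precision, recall, and F1 for binary labels (1 = malicious)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical: human labels vs. an LLM's verdicts on eight attack sequences.
human = [1, 1, 0, 1, 0, 0, 1, 0]
llm   = [1, 0, 0, 1, 1, 0, 1, 0]
p, r, f1 = precision_recall_f1(human, llm)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# → precision=0.75 recall=0.75 f1=0.75
```

An LLM that flags everything as malicious would score perfect recall but poor precision, which is why the benchmark reports all three numbers rather than a single accuracy figure.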
Shachar emphasized that the specifics of the tagged flows used in the scoring process cannot be disclosed to protect the company’s customers and proprietary technology. However, he noted that the overall conclusion is that LLMs can be powerful and effective tools in threat detection when used appropriately.
In conclusion, Skyhawk’s new benchmark provides a valuable resource for evaluating the performance of generative AI large language models in identifying and scoring cybersecurity threats. With the potential to enhance threat detection capabilities, LLMs offer organizations an opportunity to strengthen their cybersecurity measures. The results of this benchmark can assist in selecting the most accurate and reliable LLMs for effectively managing and mitigating cybersecurity risks.