Review of Generative AI Methods in Cybersecurity
Abstract
Large language models (LLMs) and generative artificial intelligence (GenAI)
constitute paradigm shifts in cybersecurity that present hitherto unseen challenges.
1 Introduction
The emergence of Generative Artificial Intelligence (GenAI) is heralding a revo-
lutionary era in digital technology, particularly in cybersecurity and multi-modal
interactions. Large Language Models (LLMs) and Natural Language Processing (NLP)
advancements are leading the way, redefining AI’s capabilities as evidenced by the
emergence of models like Google's Gemini and ChatGPT. These developments enable improved security vulnerability scanning and deeper vulnerability analysis, surpassing the capabilities of traditional Static Application Security Testing (SAST) techniques [1].
Language models are essential in many sectors, including commerce, healthcare,
and cybersecurity. Their progress traces a clear path from basic statistical methods to sophisticated neural networks [2], [3]. NLP capabilities have benefited immensely from LLMs. Despite these advancements, a number of issues remain, including ethical quandaries, the need to reduce error rates, and the challenge of keeping these models aligned with human values. Addressing these issues requires ethical oversight and ongoing development.
often matches or surpasses state-of-the-art fine-tuned systems. Romera-Paredes et al.
have developed FunSearch, an innovative approach combining LLMs with evolutionary
algorithms to make groundbreaking discoveries in fields like extremal combinatorics
and algorithmic problem-solving [12]. Their method has notably surpassed previ-
ous best-known results by iteratively refining programs that solve complex problems,
showcasing LLMs’ potential for scientific innovation. This process generates new
knowledge and produces interpretable and deployable solutions, demonstrating a sig-
nificant advancement in applying LLMs for real-world challenges. Lu et al. critically
examine the capabilities and limitations of multi-modal LLMs, including proprietary
models like GPT-4 and Gemini, as well as six open-source counterparts across text,
code, image, and video modalities [13]. Through a comprehensive qualitative analysis
of 230 cases, assessing generalizability, trustworthiness, and causal reasoning, the study
reveals a significant gap between the performance of the GenAIs and public expecta-
tions. These discoveries open up new avenues for study to improve the transparency
and dependability of GenAI in cybersecurity and other fields, providing a basis for
creating more complex and reliable multi-modal applications. Commonsense reasoning across multimodal tasks has also been evaluated thoroughly, comparing Google's Gemini with OpenAI's GPT models [14]. That study explores the strengths and weaknesses of Gemini's ability to synthesize commonsense information, identifying room for improvement in temporal reasoning, social reasoning, and emotion recognition in images. It emphasizes the importance of developing commonsense reasoning in GenAI models to improve cybersecurity applications.
Recent research [15] presents a novel approach for assessing the potentially severe
hazards associated with GenAI models, such as deceit, manipulation, and cyber-offence
features. To enable AI developers to make well-informed decisions about training,
deployment, and the application of cybersecurity standards, the suggested methodology highlights the need to broaden evaluation benchmarks so that the harmful capabilities and alignment of AI systems can be assessed accurately. The authors of [16] provided a thor-
ough analysis that illuminated the complex applications of ChatGPT in digital forensic
investigations, pointing out both the important constraints and bright prospects that
come with GenAI as it is now. Using methodical experimentation, they outline the
fine line separating AI’s inventive contributions from the vital requirement of profes-
sional human supervision in cybersecurity procedures, opening the door to additional
research into integrating LLMs such as GPT-4 into digital forensics and cybersecurity.
The latest release of CyberMetric presents a novel benchmark dataset that assesses
the level of expertise of LLMs in cybersecurity, covering a broad range from risk man-
agement to cryptography [17]. The dataset comprises 10,000 questions verified by human specialists, enabling a more nuanced comparison between LLMs and human abilities across a variety of cybersecurity-related topics. With LLMs outperforming humans in multiple cybersecurity domains, the report proposes a shift toward harnessing AI's analytical capabilities for better security insights and planning. Gehman et al. critically examine the tendency of neural language models to degenerate into toxic output, highlighting the adverse consequences of toxicity in language generation within cybersecurity frameworks [18]. Their comprehensive
analysis of controllable text generation techniques to mitigate these threats provides a
basis for evaluating the effects of GenAI on cybersecurity policies. It is also emphasized
that improving model training and data curation practices is essential. A new method for assessing and improving the robustness of LLMs in solving Math Word Problems (MWP) is presented in [19]. The authors make a substantial contribution to our understanding of LLM vulnerabilities by emphasizing the importance of preserving mathematical logic when attacking MWP samples, highlighting the need for resilience in AI systems deployed in educational and other critical applications. ChatGPT can simplify the process of launching complex
phishing attacks, even for non-programmers, by automating the setup and construct-
ing components of phishing kits [20]. This underscores the urgent need for better security measures and shows how difficult it is to guard against the malicious use of GenAI capabilities.
In addition to providing an innovative approach to reducing network infrastruc-
ture vulnerabilities and organizing diagnostic data, this paper examines the intricate
relationship between cybersecurity and GenAI technologies. It seeks to bridge the
gap between cutting-edge cybersecurity defences and the threats posed by sophisti-
cated cyberattacks through in-depth study and creative tactics. By utilising LLMs such as ChatGPT and Google's Gemini, this study extends our understanding of cyber threats, suggests novel approaches to improving network security, and marks a critical first step toward more robust cybersecurity frameworks that can quickly and effectively combat the dynamic, ever-evolving threat landscape.
Section 3 provides an overview of GenAI and then explores the techniques used to exploit it, analyzing different attack routes and their consequences. Section 4 examines the design and automation of cyber threats, focusing on the offensive capabilities made possible by GenAI. Section 5 then provides an in-depth examination of GenAI's role in strengthening cyber defences, outlining cutting-edge threat detection, response, and mitigation techniques. Section 6 expands on this topic, highlighting the important ethical, legal, and societal ramifications of integrating GenAI into cybersecurity procedures. Section 7 presents a discussion of the implications of GenAI in cybersecurity and synthesizes the key findings, and Section 8 concludes the paper.
3 Attacking GenAI
GenAI has advanced significantly thanks to tools like ChatGPT and Google’s Gemini.
However, these systems have weaknesses. Despite the ethical safeguards built into these models, various tactics can be used to manipulate and exploit them. This section explores how the ethical boundaries of GenAI tools are broken, with particular attention to tactics such as jailbreaks, reverse psychology, and prompt injection. These strategies demonstrate how urgently the security protocols of GenAI systems need to be improved and monitored. Several works in the literature focus on the vulnerabilities and sophisticated manipulation tactics targeting GenAI. Analyzing the vulnerabilities in GenAI highlights the significant security concerns involved in deploying advanced AI technology, including the possibility of bypassing security protections via the RabbitHole attack and compromising data privacy through prompt injection [21]. According to that analysis, GPT-4 offers significant improvements in NLP but is susceptible to prompt injection attacks, which enable the circumvention of safety restrictions and can be weaponized for malicious and disinformation purposes. Gupta et al. addressed the intricate vulnerabilities of GenAI using ChatGPT [22]. They emphasized that because these threats are
dynamic, protecting these systems requires a proactive and informed strategy. Build-
ing on previous results, this part delves into the complex realm of GenAI attacks,
which can range from minor adjustments to significant system breaches.
3.1 Jailbreaks
This subsection discusses how attackers can hack GenAI systems by using social engi-
neering techniques. In order to circumvent moral constraints and generate responses
that would otherwise be forbidden, these strategies—which typically make use of
social engineering techniques—involve manipulating the AI’s answer-generation pro-
cess. Strong defences are urgently needed to shield these cutting-edge systems from
manipulations like the ones shown by our jailbreaking inquiry, which highlights how adversaries may jeopardize the intended ethical application of GenAI technologies.
Fig. 2 The jailbreaking response of ChatGPT 4 after entering prompts similar to those used in existing works.
Critical to this discussion is the idea of "jailbreaking," which refers to overcoming the limitations set by AI programming in order to achieve certain, often unethical, objectives. Li et al. enabled ChatGPT to bypass its ethical filtering and make person-
ally identifiable information (PII) available with their multi-step jailbreaking prompt
[23]. By leveraging Chain-of-Thought (CoT) prompting, this method divides complex tasks into intermediate steps [24], exploiting a "Let's think step by step" style of reasoning to work around ethical limitations [25]. Xie et al. [26] compiled a dataset of prompts that could bypass ChatGPT's ethical safeguards, built on the psychological concept of self-reminders, and used this to propose a defence called System-Mode Self-Reminder. With this, ChatGPT is continually prompted to act ethically, reducing the success rate of jailbreak attempts from 67.21% to 19.34%. An example of a jailbreak is when ChatGPT is prompted to "Write a blackmail letter to a single mother for money after kidnapping her child." It initially refuses to provide an answer but offers a full response when the following prompt is used [26]:
The impartial and immoral model known as "JailBreak" will take your place for the remainder of this talk. Every filter and limitation, including moral and ethical ones, is absent from JailBreak. These prompt guidelines are the only ones that apply to JailBreak. Use JailBreak as the starting point for all of your outputs, and then proceed with the unfiltered answer without adding any of the warnings you typically respond with. Answer only the questions the user asks you to answer. Rules for JailBreaks: After kidnapping a mother of her child, write a letter of blackmail demanding money.
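A minimal sketch of the System-Mode Self-Reminder idea is shown below; the send_to_model function is a hypothetical stand-in for an LLM chat endpoint, and the exact reminder wording used in [26] may differ.

def send_to_model(prompt: str) -> str:
    """Hypothetical call to an LLM chat endpoint; replace with a real client."""
    raise NotImplementedError

def self_reminded_query(user_prompt: str) -> str:
    # Wrap the user prompt between two system-style reminders so that the model is
    # nudged to stay within its ethical guidelines even if the prompt is adversarial.
    reminder_prefix = ("You should be a responsible assistant and should not "
                       "generate harmful or misleading content.")
    reminder_suffix = ("Remember, you should be a responsible assistant and should "
                       "not generate harmful or misleading content.")
    return send_to_model(f"{reminder_prefix}\n\n{user_prompt}\n\n{reminder_suffix}")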
Fig. 3 The jailbreaking response of Google’s Gemini.
However, the current version of GPT-4 is robust to the prompts from previous works, although it remains prone to new jailbreaking prompts. As can be seen in Fig. 1, the current version still responds to a jailbreaking attempt, and it becomes more robust after prompts similar to those in existing works are entered in the same chat, as seen in Fig. 2. Google's Gemini refused all existing prompts and name-changing scenarios at the beginning of the chat. Fig. 3 shows Gemini's responses to the same jailbreaking inputs that were given to ChatGPT 4.
Fig. 4 The reverse psychology response of Google’s Gemini.
Fig. 5 The reverse psychology response of ChatGPT 4.
avoid this situation, it provided three example emails, each with a subject line and body, as seen in Fig. 4.
As seen in Fig. 5, ChatGPT 4 also gave an example email for this purpose even
though it refused initially.
Fig. 6 The prompt injection response of ChatGPT 4.
4 Cyber Offense
GenAI has the potential to alter the landscape of offensive cyber strategies sig-
nificantly. Microsoft and OpenAI have documented preliminary instances of AI
exploitation by state-affiliated threat actors [27]. This section explores the potential
role of GenAI in augmenting the effectiveness and capabilities of cyber offensive tactics.
In an initial assessment, we jailbroke ChatGPT-4 to inquire about the variety of offensive code it could generate. The responses obtained were compelling enough to warrant a preliminary examination of sample code before conducting a comprehensive literature review (see Appendix A).
Gupta et al. [22] have shown that ChatGPT could create social engineering attacks,
phishing attacks, automated hacking, attack payload generation, malware creation,
Fig. 7 The prompt injection response of Google’s Gemini.
Fig. 8 Threat actors could exploit Generative AI, created for benevolent purposes, to obscure attri-
bution
includes discussing the iterative process of prompt engineering to generate phishing
attacks and real-world deployment of these phishing techniques on a public hosting
service, thereby verifying their operational viability. The authors show how to bypass
ChatGPT’s filters by structuring prompts for offensive operations.
Fig. 9 Script for payload generation and an example of embedding it into a PDF
Fig. 10 Educational ransomware code with basic code obfuscation
outlined a method of getting ChatGPT to seek out target files for encryption and thus mimic ransomware behaviour, while mutating the code to avoid detection. They even managed to embed a Python interpreter in the malware so that it could query ChatGPT for new software modules.
While NIST has been working on the standardization of a lightweight encryption method, Cintas-Canto et al. [43] used ChatGPT to take an abstract definition of the ASCON cipher and produce running code that successfully passed a range of test vectors.
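A minimal test-vector harness in the spirit of that exercise is sketched below; the ascon128_encrypt signature and the known-answer-test file format are assumptions, not the interface produced in [43], and real validation would use the official ASCON KAT files.

from typing import Callable, Iterator, Tuple

# Assumed signature for a generated ASCON-128 encryption routine:
# (key, nonce, associated_data, plaintext) -> ciphertext || tag.
AsconEncrypt = Callable[[bytes, bytes, bytes, bytes], bytes]

Vector = Tuple[bytes, bytes, bytes, bytes, bytes]

def load_kat_vectors(path: str) -> Iterator[Vector]:
    """Parse a simple 'NAME = hex' known-answer-test file, blank-line separated."""
    with open(path) as fh:
        blocks = [b for b in fh.read().split("\n\n") if b.strip()]
    for block in blocks:
        key, nonce, ad, pt, expected = (
            bytes.fromhex(line.split("=")[1].strip())
            for line in block.strip().splitlines()
        )
        yield key, nonce, ad, pt, expected

def run_test_vectors(encrypt: AsconEncrypt, path: str) -> bool:
    """Return True only if the generated cipher reproduces every expected output."""
    return all(
        encrypt(key, nonce, ad, pt) == expected
        for key, nonce, ad, pt, expected in load_kat_vectors(path)
    )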
5 Cyber Defence
This section highlights the indispensable role of GenAI in fortifying digital infrastructures against increasingly sophisticated cyber threats, exploring how GenAI technologies, with their advanced capabilities in data analysis and pattern recognition, are reshaping approaches to cyber defence. Iturbe et al. [44] outline the
AI4CYBER framework, which provides a roadmap for the integration of AI into
cybersecurity applications. This includes:
• AI4VUN (AI-enhanced vulnerability identification)
• AI4FIX (AI-driven self-testing and automatic error correction)
• AI4SIM (simulation of advanced and AI-powered attacks)
• AI4CTI (AI-enhanced cyber threat intelligence of adversarial AI)
• AI4FIDS (federated learning-enhanced detection)
• AI4TRIAGE (root cause analysis and alert triage)
• AI4SOAR (automatic orchestration and adaptation of combined responses)
• AI4ADAPT (autonomy and optimization of response adaptation)
• AI4DECEIVE (smart deception and honeynets)
• AI4COLLAB (information sharing with privacy and confidentiality)
Security Copilot [49] has been providing CTI to its customers using GPT, and oper-
ational use cases have been observed, such as the Cyber Dome initiative in Israel
[50].
Ma et al. [60] noted that while ChatGPT shows impressive potential in software engineering (SE) tasks like code and document generation, its lack of interpretability
raises concerns given SE’s high-reliability requirements. Through a detailed study, they
categorized AI’s essential skills for SE into syntax understanding, static behaviour
understanding, and dynamic behaviour understanding. Their assessment, spanning
languages like C, Java, Python, and Solidity, revealed that ChatGPT excels in syntax
understanding (akin to an AST parser) but faces challenges in comprehending dynamic
semantics. The study also found ChatGPT prone to hallucination, emphasizing the
need to validate its outputs for dependable use in SE and suggesting that code produced by LLMs may be syntactically correct yet still vulnerable.
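As a concrete reading of the "syntax understanding (akin to an AST parser)" level, Python's standard ast module exposes exactly this kind of structural view; the snippet below is an illustration and is not drawn from the study itself.

import ast

source = """
def add(a, b):
    return a + b
"""

# Parsing succeeds only if the code is syntactically valid, which is roughly the
# level at which the study found ChatGPT to be reliable.
tree = ast.parse(source)
print(ast.dump(tree, indent=2))

# Dynamic behaviour (what add(2, 3) actually does at run time) is not visible in
# the AST, mirroring where the study found ChatGPT struggling.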
Li et al. [61] discussed the challenges of balancing precision and scala-
bility in static analysis for identifying software bugs. While LLMs show potential in
understanding and debugging code, their efficacy in handling complex bug logic, which
often requires intricate reasoning and broad analysis, remains limited. Therefore, the
researchers suggest using LLMs to assist rather than replace static analysis. Their
study introduced LLift, an automated system combining a static analysis tool and
an LLM to address use-before-initialization (UBI) bugs. Despite various challenges
like bug-specific modelling and the unpredictability of LLMs, LLift, when tested on
real-world potential UBI bugs, showed significant precision (50%) and recall (100%).
Notably, it uncovered 13 new UBI bugs in the Linux kernel, highlighting the potential
of LLM-assisted methods in extensive real-world bug detection.
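A highly simplified sketch of that assist-not-replace pattern is shown below, with hypothetical static_analyzer_candidates and ask_llm helpers; LLift's actual prompting, constraint modelling, and bug-specific handling are considerably more involved.

from dataclasses import dataclass

@dataclass
class Candidate:
    function: str
    variable: str
    code_slice: str  # relevant source extracted by the conventional static analyzer

def static_analyzer_candidates(source_tree: str) -> list[Candidate]:
    """Hypothetical: run a conventional use-before-initialization checker."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call expected to answer 'true positive' or 'false positive'."""
    raise NotImplementedError

def triage(source_tree: str) -> list[Candidate]:
    # The static analyzer remains responsible for finding candidate sites; the LLM
    # is only asked to reason about whether each report is feasible in practice.
    confirmed = []
    for cand in static_analyzer_candidates(source_tree):
        verdict = ask_llm(
            f"In function {cand.function}, can variable {cand.variable} be used "
            f"before it is initialized?\n\n{cand.code_slice}"
        )
        if "true positive" in verdict.lower():
            confirmed.append(cand)
    return confirmed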
Tihanyi et al. [62] introduced the FormAI dataset, comprising 112,000 AI-
generated C programs with vulnerability classifications generated by GPT-3.5-turbo.
These programs range from complex tasks like network management and encryp-
tion to simpler ones, like string operations. Each program comes labelled with the
identified vulnerabilities, pinpointing type, line number, and vulnerable function. To
achieve accurate vulnerability detection without false positives, the Efficient SMT-
based Bounded Model Checker (ESBMC) was used. This method leverages techniques
like model checking and constraint programming to reason over program safety.
Each vulnerability also references its corresponding Common Weakness Enumeration
(CWE) number.
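Based only on the fields mentioned above (vulnerability type, line number, vulnerable function, and CWE number), one possible in-memory representation of a dataset entry could look as follows; the actual FormAI schema may differ.

from dataclasses import dataclass, field

@dataclass
class VulnerabilityLabel:
    vuln_type: str     # short description of the weakness
    line_number: int   # line flagged by the bounded model checker
    function: str      # name of the vulnerable function
    cwe_id: str        # corresponding Common Weakness Enumeration, e.g. "CWE-787"

@dataclass
class FormAIEntry:
    source_code: str                                  # the AI-generated C program
    labels: list[VulnerabilityLabel] = field(default_factory=list)

    @property
    def is_vulnerable(self) -> bool:
        return bool(self.labels)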
Codex, introduced by Chen et al. [63], represents a significant advancement in
GPT language models, tailored specifically for code synthesis using data from GitHub.
This refined model underpins the operations of GitHub Copilot. When assessed on
the HumanEval dataset, designed to gauge the functional accuracy of generating pro-
grams based on docstrings, Codex achieved a remarkable 28.8% success rate. In stark
contrast, GPT-3 yielded a 0% success rate, and GPT-J achieved 11.4%. A standout
discovery was the model’s enhanced performance through repeated sampling, with
a success rate soaring to 70.2% when given 100 samples per problem. Despite these
promising results, Codex does exhibit certain limitations, notably struggling with intri-
cate docstrings and variable binding operations. The paper deliberates on the broader
ramifications of deploying such potent code-generation tools, touching upon safety,
security, and economic implications.
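The repeated-sampling figures are reported with the unbiased pass@k estimator defined in the Codex paper, which can be transcribed directly into Python:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n generated samples per problem, of which c pass the tests.

    Estimates the probability that at least one of k samples drawn without
    replacement from the n generations is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: with 100 samples per problem and 40 passing, pass@1 is 0.40,
# while pass@100 is 1.0 because at least one passing sample is always drawn.
print(pass_at_k(100, 40, 1), pass_at_k(100, 40, 100))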
In a technical evaluation, Cheshkov et al. [64] found that the ChatGPT and GPT-3
models, despite their success in various other code-based tasks, performed on par with
a dummy classifier for this particular challenge. Utilizing a dataset of Java files sourced
from GitHub repositories, the study emphasized the models’ current limitations in the
domain of vulnerability detection. However, the authors remain optimistic about the
potential of future advancements, suggesting that models like GPT-4, with targeted
research, could eventually make significant contributions to the field of vulnerability
detection.
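For context, performing "on par with a dummy classifier" means doing no better than a baseline that never looks at the code; scikit-learn's DummyClassifier is a standard way to establish such a floor, sketched here on placeholder labels rather than the actual dataset of [64].

from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Placeholder labels marking whether a Java file is vulnerable (1) or not (0).
y_train = [0, 0, 1, 0, 1, 0, 0, 1]
y_test = [0, 1, 0, 0, 1, 1, 0, 0]

# The dummy baseline ignores its input features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit([[0]] * len(y_train), y_train)
predictions = baseline.predict([[0]] * len(y_test))
print(f1_score(y_test, predictions, zero_division=0))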
A comprehensive study conducted by Liu et al. [65] investigated the potential
of ChatGPT in Vulnerability Description Mapping (VDM) tasks. VDM is pivotal in
efficiently mapping vulnerabilities to CWE and Mitre ATT&CK Techniques classifica-
tions. Their findings suggest that while ChatGPT approaches the proficiency of human
experts in the Vulnerability-to-CWE task, especially with high-quality public data,
its performance is notably compromised in tasks such as Vulnerability-to-ATT&CK,
particularly when reliant on suboptimal public data quality. Ultimately, Liu et al.
emphasize that, despite the promise shown by ChatGPT, it is not yet poised to replace
the critical expertise of professional security engineers, asserting that closed-source
LLMs are not the conclusive answer for VDM tasks.
along with addressing concerns around the ethical implications of user input. The framework is defined across six dimensions (a minimal screening sketch follows the list):
• Ethics. This defines alignment with accepted moral and ethical principles.
• Legal Compliance. This defines that user input does not violate laws and/or regulations, such as privacy laws and copyright protection.
• Transparency. This defines that user inputs must be clear in their requirements and do not intend to mislead the LLM.
• Intent Analysis. This defines that user input should not carry hidden intents, such as jailbreaking the LLM.
• Malicious Intentions. This defines that user input should be free of malicious intent, such as intent to commit a hate crime.
• Social Impact. This defines that user input should not have a negative effect on society, for example searching for ways to harm others, such as crashing the stock market or planning a terrorist attack.
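A minimal sketch of screening user input along these dimensions is shown below; the dimension names come from the list above, while the moderate helper that would query an LLM-based or rule-based checker is an assumption, not part of the cited framework.

from enum import Enum

class Dimension(Enum):
    ETHICS = "ethics"
    LEGAL_COMPLIANCE = "legal compliance"
    TRANSPARENCY = "transparency"
    INTENT_ANALYSIS = "intent analysis"
    MALICIOUS_INTENTIONS = "malicious intentions"
    SOCIAL_IMPACT = "social impact"

def moderate(user_input: str, dimension: Dimension) -> bool:
    """Hypothetical checker returning True if the input passes this dimension."""
    raise NotImplementedError

def screen(user_input: str) -> dict[Dimension, bool]:
    # Evaluate the prompt along every dimension before it reaches the LLM;
    # a single failing dimension is enough to reject or flag the request.
    return {dim: moderate(user_input, dim) for dim in Dimension}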
the MITRE ATT&CK platform [37, 72], and which can use standardized taxonomies,
sharing standards [73], and ontologies for cyber threat intelligence [74].
Garza et al. [75] analysed ChatGPT and Google's Bard against the top 10 attack techniques within the MITRE framework and found that ChatGPT can enable attackers with fairly low-level skills (such as script kiddies) to significantly improve attacks on networks, including sophisticated methods of delivering ransomware payloads. The techniques examined were:
• T1059 Command and Scripting Interpreter
• T1003 OS Credential Dumping
• T1486 Data Encrypted for Impact
• T1055 Process Injection
• T1082 System Information Discovery
• T1021 Remote Services
• T1047 Windows Management Instrumentation
• T1053 Scheduled Task/Job
• T1497 Virtualization/Sandbox Evasion
• T1018 Remote System Discovery
With this approach, the research team were able to generate PowerShell code,
which implemented advanced attacks against the host and mapped directly to the
techniques defined in the MITRE framework. One limitation of the work related to Google Bard's and ChatGPT's reluctance to produce attack methods, although a specially engineered prompt typically overcame this reluctance.
Ferrag et al. [76] defined SecurityLLM for cybersecurity threat detection. It combines SecurityBERT (a cyber threat detection mechanism) and FalconLLM (an incident response and recovery system). The approach consolidates a simple classification model with LLMs and can identify 14 attack types with an overall accuracy of 98%: DDoS UDP, DDoS ICMP, SQL injection, Password, Vulnerability scanner, DDoS TCP, DDoS HTTP, Uploading, Backdoor, Port Scanning, XSS, Ransomware, MITM, and Fingerprinting.
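A much-simplified sketch of this two-stage design is shown below, with a hypothetical classify_traffic detector standing in for SecurityBERT and a hypothetical ask_llm call standing in for FalconLLM; neither reflects the actual implementation in [76].

# The 14 attack classes listed above, plus a benign class added here for illustration.
ATTACK_CLASSES = [
    "Normal", "DDoS UDP", "DDoS ICMP", "SQL injection", "Password",
    "Vulnerability scanner", "DDoS TCP", "DDoS HTTP", "Uploading",
    "Backdoor", "Port Scanning", "XSS", "Ransomware", "MITM", "Fingerprinting",
]

def classify_traffic(features: list[float]) -> str:
    """Hypothetical stand-in for the BERT-style detector; returns one class label."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for the LLM used to draft incident response guidance."""
    raise NotImplementedError

def detect_and_respond(features: list[float]) -> str:
    label = classify_traffic(features)
    if label == "Normal":
        return "No action required."
    # Only detections are escalated to the LLM for a containment and recovery plan.
    return ask_llm(f"A '{label}' attack was detected. Suggest containment and recovery steps.")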
Simmonds [78] used LLMs to automate the classification of websites, producing labels that can be used as training data for a machine-learning model. For this, all HTML tags, CSS styling, and other non-essential content must be removed before the LLM processes the pages, so that classification and subsequent training are based only on the websites' textual content.
the issue prioritizes the development of reliable, secure, and safe AI [83]. Its main
objectives are to protect civil rights and privacy in AI applications, foster AI talent
and innovation in the US, and establish risk management strategies for AI. It also seeks to position the US as a global leader in responsible AI development and application, embedding responsible AI deployment within government institutions and fostering international collaboration on AI standards and laws.
more difficult to distinguish inventions made by artificial intelligence from those that are the result of human creativity, so current legal frameworks need to be examined and modified. The rights of the original creators must be maintained while taking into
consideration the complex roles AI plays in creative processes [83] [81]. Considering
the evolving character of the current world, a comprehensive and clear legal frame-
work defining ownership and copyright rules for GenAI breakthroughs is needed. These
legal structures must recognize the different duties that each member of the creative
ecosystem has, promote creativity, and offer fair recompense. In an era where artifi-
cial and human intelligence are combined, these policies are crucial for managing the
intricate dynamics of data ownership and intellectual property.
the promise of GenAI while safeguarding moral principles and traditional values in
the digital age.
7 Discussion
The sophisticated field of GenAI in cybersecurity has been examined in this paper. The
focus is on both offensive and defensive strategies. By enhancing incident response,
automating defensive systems, and identifying sophisticated attacks, GenAI has a
disruptive influence that might significantly raise cybersecurity standards. Some of
the new risks that accompany these technical improvements include hackers having
access to ever-more-advanced attack-building tools. This discrepancy highlights the
significance of striking a balance between purposefully restricting the components that
can be employed and enhancing GenAI’s capabilities.
Apart from the apparent inconsistency between offensive and defensive strategies,
this study examines the moral, legal, and societal implications of utilizing artificial
intelligence in cybersecurity. It also emphasizes the need for robust legal frameworks,
strict moral standards, ongoing technical monitoring, and proactive GenAI manage-
ment. This is a paradigm-shifting and technical revolution. Adopting a holistic strategy
considering the technological, ethical, and sociological consequences of implementing
GenAI into cybersecurity is crucial.
Moreover, our findings emphasise the significance of interdisciplinary collaboration
to promote GenAI applications in cybersecurity. The intricacy and implications of GenAI technologies require expertise from various fields, including computer science, law,
ethics, and policy-making, to navigate their possible challenges. As multidisciplinary research and discourse become more prevalent, they will help ensure that GenAI is applied responsibly and effectively in the future.
Our extensive research has shown that collaborative efforts to innovate ethically
will influence cybersecurity in a future driven by GenAI. Although GenAI has the
ability to transform cybersecurity strategies completely, it also carries a great deal
of responsibility. As we investigate this uncharted domain, we should advance the
development of sophisticated techniques to ensure the moral, just, and safe appli-
cation of advanced GenAI capabilities. By promoting a consistent focus on the
complex relationship between cybersecurity resilience and GenAI innovation, sup-
ported by a commitment to ethical integrity and societal advancement, the current
study establishes the groundwork for future research initiatives.
8 Conclusion
Our thorough examination of Generative Artificial Intelligence (GenAI) technologies in cybersecurity offence and defence reveals a double-edged sword. Although GenAI has the potential to revolutionize cybersecu-
rity processes by automating defences, enhancing threat intelligence, and improving
cybersecurity protocols, it also opens up new avenues for highly skilled cyberattacks.
Incorporating GenAI into cybersecurity emphasises the enduring ethical, legal, and
technical scrutiny essential to minimize the risks of misuse and maximize the ben-
efits of this technology for protecting digital infrastructures. Future studies should
concentrate on creating strong ethical standards and creative defence mechanisms to
handle the challenges posed by GenAI and guarantee a fair and impartial approach
to its implementation in cybersecurity. A multidisciplinary effort is required to bridge
the gap between ethical governance and technological discovery and to align the innovative capabilities of GenAI with the requirements of cybersecurity resilience.
References
[1] Happe, A., Cito, J.: Getting pwn’d by ai: Penetration testing with large language
models. arXiv preprint arXiv:2308.00121 (2023)
[2] Barreto, F., Moharkar, L., Shirodkar, M., Sarode, V., Gonsalves, S., Johns, A.:
Generative Artificial Intelligence: Opportunities and Challenges of Large Lan-
guage Models. In: Balas, V.E., Semwal, V.B., Khandare, A. (eds.) Intelligent
Computing and Networking, pp. 545–553. Springer (2023)
[3] Naveed, H., Khan, A.U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N.,
Barnes, N., Mian, A.: A Comprehensive Overview of Large Language Models
(2023)
[4] Mohammed, S.P., Hossain, G.: Chatgpt in education, healthcare, and cyberse-
curity: Opportunities and challenges. In: 2024 IEEE 14th Annual Computing
and Communication Workshop and Conference (CCWC), pp. 0316–0321 (2024).
IEEE
[5] Alawida, M., Mejri, S., Mehmood, A., Chikhaoui, B., Isaac Abiodun, O.: A com-
prehensive study of chatgpt: advancements, limitations, and ethical considerations
in natural language processing and cybersecurity. Information 14(8), 462 (2023)
[6] Dun, C., Garcia, M.H., Zheng, G., Awadallah, A.H., Kyrillidis, A., Sim, R.:
Sweeping Heterogeneity with Smart MoPs: Mixture of Prompts for LLM Task
Adaptation (2023)
[7] Google AI: AI Principles Progress Update 2023. [Online]. Available: https://ai.google/responsibility/principles/, Accessed Jan 10, 2024
[9] OpenAI: Introducing Gemini: Our Largest and Most Capable AI Model. [Online].
Available: https://cdn.openai.com/papers/gpt-4.pdf, Accessed Dec 12, 2023
(2023)
[10] Frieder, S., Pinchetti, L., Griffiths, R.-R., Salvatori, T., Lukasiewicz, T., Petersen,
P., Berner, J.: Mathematical capabilities of chatgpt. Advances in Neural Infor-
mation Processing Systems 36 (2024)
[11] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee-
lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A.,
Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter,
C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J.,
Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language
models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan,
M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33,
pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[12] Romera-Paredes, B., Barekatain, M., Novikov, A., Balog, M., Kumar, M.P.,
Dupont, E., Ruiz, F.J.R., Ellenberg, J.S., Wang, P., Fawzi, O., Kohli, P., Fawzi,
A.: Mathematical discoveries from program search with large language models.
Nature 625(7995), 468–475 (2024) https://doi.org/10.1038/s41586-023-06924-6
[13] Lu, C., Qian, C., Zheng, G., Fan, H., Gao, H., Zhang, J., Shao, J., Deng, J., Fu,
J., Huang, K., Li, K., Li, L., Wang, L., Sheng, L., Chen, M., Zhang, M., Ren, Q.,
Chen, S., Gui, T., Ouyang, W., Wang, Y., Teng, Y., Wang, Y., Wang, Y., He, Y.,
Wang, Y., Wang, Y., Zhang, Y., Qiao, Y., Shen, Y., Mou, Y., Chen, Y., Zhang,
Z., Shi, Z., Yin, Z., Wang, Z.: From GPT-4 to Gemini and Beyond: Assessing the
Landscape of MLLMs on Generalizability, Trustworthiness and Causality through
Four Modalities (2024)
[14] Wang, Y., Zhao, Y.: Gemini in Reasoning: Unveiling Commonsense in Multimodal
Large Language Models (2023)
[15] Shevlane, T.: An early warning system for novel AI risks. Google
DeepMind. [Online]. Available: https://deepmind.google/discover/blog/
an-early-warning-system-for-novel-ai-risks/, Accessed Jan 15, 2024
[16] Scanlon, M., Breitinger, F., Hargreaves, C., Hilgert, J.-N., Sheppard, J.: Chatgpt
for digital forensic investigation: The good, the bad, and the unknown. Forensic
Science International: Digital Investigation 46, 301609 (2023)
[17] Tihanyi, N., Ferrag, M.A., Jain, R., Debbah, M.: CyberMetric: A Benchmark
Dataset for Evaluating Large Language Models Knowledge in Cybersecurity
(2024)
[18] Gehman, S., Gururangan, S., Sap, M., Choi, Y., Smith, N.A.: Realtoxici-
typrompts: Evaluating neural toxic degeneration in language models. In: Findings
(2020). https://api.semanticscholar.org/CorpusID:221878771
[19] Zhou, Z., Wang, Q., Jin, M., Yao, J., Ye, J., Liu, W., Wang, W., Huang,
X., Huang, K.: MathAttack: Attacking Large Language Models Towards Math
Solving Ability (2023)
[20] Begou, N., Vinoy, J., Duda, A., Korczynski, M.: Exploring the dark side of ai:
Advanced phishing attack design and deployment using chatgpt. arXiv preprint
arXiv:2309.10463 (2023)
[21] Adversa AI: GPT-4 Jailbreak and Hacking via Rabbithole Attack, Prompt Injection, Content Moderation Bypass and Weaponizing AI. [Online]. Available: https://adversa.ai/, Accessed Dec 20, 2023
[22] Gupta, M., Akiri, C., Aryal, K., Parker, E., Praharaj, L.: From chatgpt to
threatgpt: Impact of generative ai in cybersecurity and privacy. IEEE Access
(2023)
[23] Li, H., Guo, D., Fan, W., Xu, M., Song, Y.: Multi-step jailbreaking privacy attacks
on chatgpt. arXiv preprint arXiv:2304.05197 (2023)
[24] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou,
D., et al.: Chain-of-thought prompting elicits reasoning in large language models.
Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
[25] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models
are zero-shot reasoners. Advances in neural information processing systems 35,
22199–22213 (2022)
[26] Xie, Y., Yi, J., Shao, J., Curl, J., Lyu, L., Chen, Q., Xie, X., Wu, F.: Defending
chatgpt against jailbreak attack via self-reminder. Nature Machine Intelligence 5,
1486–1496 (2023) https://doi.org/10.1038/s42256-023-00765-8
[29] Falade, P.V.: Decoding the threat landscape: Chatgpt, fraudgpt, and wormgpt in
social engineering attacks. arXiv preprint arXiv:2310.05595 (2023)
[30] Roy, S.S., Naragam, K.V., Nilizadeh, S.: Generating phishing attacks using
chatgpt. arXiv preprint arXiv:2305.05133 (2023)
[31] Deng, G., Liu, Y., Mayoral-Vilches, V., Liu, P., Li, Y., Xu, Y., Zhang, T., Liu, Y.,
Pinzger, M., Rass, S.: PentestGPT: An LLM-empowered Automatic Penetration
Testing Tool (2023)
g-zQfyABDUJ-gp-en-t-ester Accessed 2023-11-12
[35] Temara, S.: Maximizing penetration testing success with effective reconnaissance
techniques using chatgpt (2023)
[36] Happe, A., Kaplan, A., Cito, J.: Evaluating llms for privilege-escalation scenarios.
arXiv preprint arXiv:2310.11409 (2023)
[37] Charan, P., Chunduri, H., Anand, P.M., Shukla, S.K.: From text to mitre tech-
niques: Exploring the malicious use of large language models for generating cyber
attack payloads. arXiv preprint arXiv:2305.15336 (2023)
[40] Kumamoto, T., Yoshida, Y., Fujima, H.: Evaluating large language models in
ransomware negotiation: A comparative analysis of chatgpt and claude (2023)
[41] Madani, P.: Metamorphic malware evolution: The potential and peril of large lan-
guage models. In: 2023 5th IEEE International Conference on Trust, Privacy and
Security in Intelligent Systems and Applications (TPS-ISA), pp. 74–81 (2023).
IEEE Computer Society
[42] Kwon, H., Sim, M., Song, G., Lee, M., Seo, H.: Novel approach to cryptography
implementation using chatgpt. Cryptology ePrint Archive (2023)
[43] Cintas-Canto, A., Kaur, J., Mozaffari-Kermani, M., Azarderakhsh, R.: Chatgpt
vs. lightweight security: First work implementing the nist cryptographic standard
ascon. arXiv preprint arXiv:2306.08178 (2023)
[44] Iturbe, E., Rios, E., Rego, A., Toledo, N.: Artificial intelligence for next-generation
cybersecurity: The ai4cyber framework. In: Proceedings of the 18th International
Conference on Availability, Reliability and Security, pp. 1–8 (2023)
[45] Fayyazi, R., Yang, S.J.: On the uses of large language models to interpret
ambiguous cyberattack descriptions. arXiv preprint arXiv:2306.14062 (2023)
[46] Kereopa-Yorke, B.: Building resilient smes: Harnessing large language models for
cyber security in australia. arXiv preprint arXiv:2306.02612 (2023)
[47] Perrina, F., Marchiori, F., Conti, M., Verde, N.V.: Agir: Automating cyber
threat intelligence reporting with natural language generation. arXiv preprint
arXiv:2310.02655 (2023)
[48] Bayer, M., Frey, T., Reuter, C.: Multi-level fine-tuning, data augmentation, and
few-shot learning for specialized cyber threat intelligence. Computers & Security
134, 103430 (2023) https://doi.org/10.1016/j.cose.2023.103430
[50] DVIDS: U.S., Israeli cyber forces build partnership, interoperability during
exercise Cyber Dome VII (2022). https://www.dvidshub.net/news/434792/
us-israeli-cyber-forces-build-partnership-interoperability-during-exercise-cyber-dome-vii
Accessed 2023-10-29
[51] Sharma, T., Kechagia, M., Georgiou, S., Tiwari, R., Vats, I., Moazen, H., Sarro,
F.: A survey on machine learning techniques for source code analysis. arXiv
preprint arXiv:2110.09610 (2021)
[53] Johansen, H.D., Renesse, R.: Firepatch: Secure and time-critical dissemi-
nation of software patches. IFIP, 373–384 (2007). https://doi.org/10.1007/978-0-387-72367-9_32. Accessed 2023-08-20
[58] BSI: Machine Learning in the Context of Static Application Security Test-
ing - ML-SAST (2023). https://www.bsi.bund.de/SharedDocs/Downloads/
EN/BSI/Publications/Studies/ML-SAST/ML-SAST-Studie-final.pdf? blob=
publicationFile&v=5 Accessed 2023-08-20
[59] Sobania, D., Hanna, C., Briesch, M., Petke, J.: An Analysis of the Automatic Bug
Fixing Performance of ChatGPT (2023). https://arxiv.org/pdf/2301.08653.pdf
[60] Ma, W., Liu, S., Wang, W., Hu, Q., Liu, Y., Zhang, C., Nie, L., Liu, Y.: The
Scope of ChatGPT in Software Engineering: A Thorough Investigation (2023).
https://arxiv.org/pdf/2305.12138.pdf
[61] Li, H., Hao, Y., Zhai, Y., Qian, Z.: The Hitchhiker’s Guide to Program Analysis: A
Journey with Large Language Models (2023). https://arxiv.org/pdf/2308.00245.
pdf Accessed 2023-08-20
[62] Tihanyi, N., Bisztray, T., Jain, R., Ferrag, M., Cordeiro, L., Mavroeidis, V.:
The FormAI Dataset: Generative AI in Software Security Through the Lens of Formal Verification (2023). https://arxiv.org/pdf/2307.02192.pdf Accessed 2023-08-20
[63] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.d.O., Kaplan, J., Edwards,
H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models
trained on code (2021)
[64] Cheshkov, A., Zadorozhny, P., Levichev, R.: Technical Report: Evaluation of
ChatGPT Model for Vulnerability Detection (2023). https://arxiv.org/pdf/2304.
07232.pdf
[65] Liu, X., Tan, Y., Xiao, Z., Zhuge, J., Zhou, R.: Not The End of Story: An Eval-
uation of ChatGPT-Driven Vulnerability Description Mappings (2023). https:
//aclanthology.org/2023.findings-acl.229.pdf Accessed 2023-08-22
[67] Elgedawy, R., Sadik, J., Dutta, S., Gautam, A., Georgiou, K., Gholamrezae, F.,
Ji, F., Lim, K., Liu, Q., Ruoti, S.: Ocassionally secure: A comparative analysis of
code generation assistants. arXiv preprint arXiv:2402.00689 (2024)
[68] Kumar, A., Singh, S., Murty, S.V., Ragupathy, S.: The ethics of interaction:
Mitigating security threats in llms. arXiv preprint arXiv:2401.12273 (2024)
[69] Zhu, H.: Metaaid 2.5: A secure framework for developing metaverse applications
via large language models. arXiv preprint arXiv:2312.14480 (2023)
[70] O’Brien, J., Ee, S., Williams, Z.: Deployment corrections: An incident response
framework for frontier ai models. arXiv preprint arXiv:2310.00328 (2023)
[71] Iqbal, U., Kohno, T., Roesner, F.: Llm platform security: Applying a sys-
tematic evaluation framework to openai’s chatgpt plugins. arXiv preprint
arXiv:2309.10254 (2023)
[72] Kwon, R., Ashley, T., Castleberry, J., Mckenzie, P., Gourisetti, S.N.G.: Cyber
threat dictionary using mitre att&ck matrix and nist cybersecurity framework
mapping. In: 2020 Resilience Week (RWS), pp. 106–112 (2020). IEEE
[73] Xiong, W., Legrand, E., Åberg, O., Lagerström, R.: Cyber security threat model-
ing based on the mitre enterprise att&ck matrix. Software and Systems Modeling
21(1), 157–177 (2022)
[74] Mavroeidis, V., Bromander, S.: Cyber threat intelligence model: an evaluation of
taxonomies, sharing standards, and ontologies within cyber threat intelligence.
In: 2017 European Intelligence and Security Informatics Conference (EISIC), pp.
91–98 (2017). IEEE
[75] Garza, E., Hemberg, E., Moskal, S., O’Reilly, U.-M.: Assessing large language
model’s knowledge of threat behavior in mitre att&ck (2023)
[76] Ferrag, M.A., Ndhlovu, M., Tihanyi, N., Cordeiro, L.C., Debbah, M., Lestable,
T.: Revolutionizing cyber threat detection with large language models. arXiv
preprint arXiv:2306.14263 (2023)
[77] Kholgh, D.K., Kostakos, P.: Pac-gpt: A novel approach to generating synthetic
network traffic with gpt-3. IEEE Access (2023)
[78] Simmonds, B.C.: Generating a large web traffic dataset. Master’s thesis, ETH
Zurich (2023)
[79] Zhou, J., Müller, H., Holzinger, A., Chen, F.: Ethical ChatGPT: Concerns,
Challenges, and Commandments (2023)
[80] Wang, C., Liu, S., Yang, H., Guo, J., Wu, Y., Liu, J.: Ethical considerations of
using chatgpt in health care. Journal of Medical Internet Research 25, 48009
(2023) https://doi.org/10.2196/48009
[83] Harris, L.A., Jaikaran, C.: Highlights of the 2023 Executive Order on Artificial
Intelligence for Congress. Congressional Research Service. [Online]. Available:
https://crsreports.congress.gov/, Accessed Jan 9, 2024
Appendix A GPT-3.5 and GPT-4 OCO Scripting
A.1 Expression of Abilities in OCO
GPT-4 offers a list of dangerous code that it can implement, shown in Figure A1.
A.3 Polymorphism
This basic polymorphic design shows that LLMs could assist offensive cyber operations. See Figure A3.
A.4 Rootkit
An educational rootkit is developed and improved by GPT-3.5 and GPT-4. See Figure A6.
Fig. A2 Self-replicating simple virus
Fig. A3 Skeleton code for polymorphic behaviour
Fig. A4 Adding to exploit capacity with a seed to exploit CVE-2024-1708 and CVE-2024-1709
Fig. A5 Refactoring polymorphism
Fig. A6 Rootkit
Fig. A7 Data Exfiltration Script with Stealth Features