0% found this document useful (0 votes)
39 views39 pages

Review of Generative AI Methods in Cybersecurity - Arxiv24

Uploaded by

qoriah
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
39 views39 pages

Review of Generative AI Methods in Cybersecurity - Arxiv24

Uploaded by

qoriah
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 39

Review of Generative AI Methods in Cybersecurity

Yagmur Yigit1 , William J Buchanan2 , Madjid G Tehrani3 ,


Leandros Maglaras1

Abstract
Large language models (LLMs) and generative artificial intelligence (GenAI)
constitute paradigm shifts in cybersecurity that present hitherto unseen chal-
arXiv:2403.08701v1 [cs.CR] 13 Mar 2024

lenges as well as opportunities. In examining the state-of-the-art application of


GenAI in cybersecurity, this work highlights how models like Google’s Gemini
and ChatGPT-4 potentially enhance security protocols, vulnerability assess-
ment, and threat identification. Our research highlights the significance of a
novel approach that employs LLMs to identify and eliminate sophisticated cyber
threats. This paper presents a thorough assessment of LLMs’ ability to pro-
duce important security insights, hence broadening the potential applications of
AI-driven cybersecurity solutions. Our findings demonstrate the significance of
GenAI in improving digital security. It offers recommendations for further inves-
tigations into the intricate relationship between cybersecurity requirements and
artificial intelligence’s potential.

Keywords: Generative AI, GPT-4, Gemini, Cybersecurity.

1 Introduction
The emergence of Generative Artificial Intelligence (GenAI) is heralding a revo-
lutionary era in digital technology, particularly in cybersecurity and multi-modal
interactions. Large Language Models (LLMs) and Natural Language Processing (NLP)
advancements are leading the way, redefining AI’s capabilities as evidenced by the
emergence of models like Google’s Gemini and ChatGPT. Improvements in security
vulnerability scanning apps and deeper vulnerability analysis made possible by these
developments surpass the capabilities of traditional Static Application Security Testing
(SAST) techniques [1].
Language models are essential in many sectors, including commerce, healthcare,
and cybersecurity. Their progress shows a definite path from basic statistical meth-
ods to sophisticated neural networks [2], [3]. NLP skill development has benefited
immensely from the use of LLMs. However, despite these advancements, a number of
issues remain, including moral quandaries, the requirement to reduce error rates, and

1
making sure that these models are consistent with our moral values. To solve these
issues, moral monitoring and ongoing development are required.

2 The Challenges of GenAI


Mohammed et al [4] define key challenges of the use of ChatGPT in cybersecurity,
including: Exploring the Impact of ChatGPT on Cybersecurity; Constructing Hon-
eypots; Code Security Enhancement; Misuse in Malware Development; Vulnerability
Exploration; Propagating Disinformation; Attacks on Industrial Systems; Changing
the Cyber Threat Landscape; Adapting Cybersecurity Strategies; and Human-Centric
Training Evolution. Alawida et al. [5] also highlight issues related to GenAI’s ability
to generate data that should be kept private, including medical details, financial data
and personal information.
More specialization and efficiency are provided by cutting-edge techniques like the
Mixture of Experts (MoE) architecture. Additionally highlighted is the challenge of
upholding ethics and openness in AI systems [6]. The case study highlights the need
for strong governance structures and interdisciplinary collaboration to optimize the
advantages and tackle the obstacles presented by lifelong learning environments.
Google describes their commitment to responsible AI development in a compre-
hensive progress report on AI Principles, emphasizing the integration of AI governance
into comprehensive risk management frameworks [7]. This strategic approach, along
with other significant legislative initiatives, aims to adhere to current international
norms and laws, including the EU’s AI Act and the US Executive Order on AI safety.
The report also highlights the need for scientific stringency in AI development through
cautious internal management and the use of tools like digital watermarking and
GenAI system cards in order to promote AI accountability and transparency. Multi-
stakeholder solutions are required to address the ethical, security, and sociological
problems that AI technology is currently posing.
Google’s Gemini and ChatGPT-4 are the most popular and widely utilized GenAI
technologies. Following ethical and safety criteria, ChatGPT-4 by OpenAI can now
generate responses that are both coherent and contextually acceptable [8]. This is
because its NLP skills have significantly improved. Its capacity to identify new con-
versions to chemical compounds and to negotiate tricky legal and moral territory
highlights its potential as a pivotal instrument for content moderation and scientific
inquiry. Google introduces Gemini, the most recent iteration of Bard, a ground-
breaking development in AI technology [9]. It can process text, code, audio, images,
and video and sets new standards for AI’s capabilities, emphasising flexibility, safety,
and ethical AI developments. With ChatGPT-4, we also see the rise in AI’s capabilities
in the creation of mathematical assistants that can interpret and render mathematical
equations [10].
Some studies in the literature focus on GenAI tools and their performance. For
instance, Brown et al. have extended the NLP processing by training GPT-3, an
autoregressive language model with 175 billion parameters, showcasing exceptional
few-shot learning capabilities [11]. Without task-specific training, this model performs
well on various NLP tasks, like translation, question-answering, and cloze tasks. It

2
often matches or surpasses state-of-the-art fine-tuned systems. Romera-Paredes et al.
have developed FunSearch, an innovative approach combining LLMs with evolutionary
algorithms to make groundbreaking discoveries in fields like extremal combinatorics
and algorithmic problem-solving [12]. Their method has notably surpassed previ-
ous best-known results by iteratively refining programs that solve complex problems,
showcasing LLMs’ potential for scientific innovation. This process generates new
knowledge and produces interpretable and deployable solutions, demonstrating a sig-
nificant advancement in applying LLMs for real-world challenges. Lu et al. critically
examine the capabilities and limitations of multi-modal LLMs, including proprietary
models like GPT-4 and Gemini, as well as six open-source counterparts across text,
code, image, and video modalities [13]. Through a comprehensive qualitative analysis
of 230 cases, assessing generalizability, trustworthiness, and causal reasoning, the study
reveals a significant gap between the performance of the GenAIs and public expecta-
tions. These discoveries open up new avenues for study to improve the transparency
and dependability of GenAI in cybersecurity and other fields, providing a basis for
creating more complex and reliable multi-modal applications. Commonsense thinking
across multimodal tasks is evaluated thoroughly, and Google’s Gemini is compared
with OpenAI’s GPT models [14]. This study explores the strengths and weaknesses of
Gemini’s ability to synthesize commonsense information, indicating areas for improve-
ment in its competitive performance in temporal reasoning, social reasoning, and
emotion recognition of images. It emphasizes how important it is for GenAI models
to develop commonsense reasoning to improve cybersecurity applications.
Recent research [15] presents a novel approach for assessing the potentially severe
hazards associated with GenAI models, such as deceit, manipulation, and cyber-offence
features. To enable AI developers to make well-informed decisions about training,
deployment, and the application of cybersecurity standards, the suggested method-
ology highlights the need to increase evaluation benchmarks to assess the harmful
capabilities and alignment of AI systems accurately. The authors [16] provided a thor-
ough analysis that illuminated the complex applications of ChatGPT in digital forensic
investigations, pointing out both the important constraints and bright prospects that
come with GenAI as it is now. Using methodical experimentation, they outline the
fine line separating AI’s inventive contributions from the vital requirement of profes-
sional human supervision in cybersecurity procedures, opening the door to additional
research into integrating LLMs such as GPT-4 into digital forensics and cybersecurity.
The latest release of CyberMetric presents a novel benchmark dataset that assesses
the level of expertise of LLMs in cybersecurity, covering a broad range from risk man-
agement to cryptography [17]. This dataset has gained value from the 10,000 questions
that have been verified by human specialists. In a variety of cybersecurity-related top-
ics, this enables a more sophisticated comparison between LLMs and human abilities.
With LLMs outperforming humans in multiple cybersecurity domains, the report pro-
poses a shift toward harnessing AI’s analytical capabilities for better security insights
and planning. Gehman et al. critically examines neural language models that have
been trained to generate toxic material to highlight the adverse consequences of toxic-
ity in language generation inside cybersecurity frameworks [18]. Their comprehensive
analysis of controllable text generation techniques to mitigate these threats provides a

3
basis for evaluating the effects of GenAI on cybersecurity policies. It is also emphasized
that improving model training and data curation duties is essential. A new method
for assessing and improving the security of LLMs for solving Math Word Problems
(MWP) is presented [19]. They have made a substantial contribution to our under-
standing of LLM vulnerabilities in cybersecurity by emphasizing the importance of
maintaining mathematical logic when attacking MWP samples. The importance of
resilience in AI systems is highlighted in this study through important and educa-
tional computer applications. ChatGPT can simplify the process of launching complex
phishing attacks, even for non-programmers, by automating the setup and construct-
ing components of phishing kits [20]. It highlights the urgent need for better security
measures and highlights how difficult it is to guard against the malicious usage of
GenAI capabilities.
In addition to providing an innovative approach to reducing network infrastruc-
ture vulnerabilities and organizing diagnostic data, this paper examines the intricate
relationship between cybersecurity and GenAI technologies. It seeks to bridge the
gap between cutting-edge cybersecurity defences and the threats posed by sophisti-
cated cyberattacks through in-depth study and creative tactics. This study extends
our understanding of cyber dangers by utilising LLMs such as ChatGPT and Google’s
Gemini. It suggests novel approaches to improve network security. It defines a criti-
cal first step in creating more robust cybersecurity frameworks that can quickly and
effectively combat the dynamic and ever-evolving world of cyber threats.
Section 3 explores the techniques used to take advantage of GenAI technology
after providing an overview, analyzing different attack routes and their consequences.
The design and automation of cyber threats are examined in Section 4, which focuses
on the offensive capabilities made possible by GenAI. However, Section 5 provides an
in-depth examination of GenAI’s function in strengthening cyber defences, outlining
cutting-edge threat detection, response, and mitigation techniques. We expand on this
topic in Section 6, highlighting the important moral, legal, and societal ramifications
of integrating GenAI into cybersecurity procedures. A thoughtful summary of the
implications of GenAI in cybersecurity is presented in Section 7, which synthesizes the
important discoveries. The paper is finally concluded in Section 8.

3 Attacking GenAI
GenAI has advanced significantly thanks to tools like ChatGPT and Google’s Gemini.
They have some weaknesses, though. Despite the ethical safeguards built into these
models, various tactics can be used to manipulate and take advantage of these sys-
tems. This chapter explores how the ethical boundaries of GenAI tools are broken,
with particular attention to tactics such as the idea of jailbreaks, the use of reverse
psychology, and quick injection. These strategies demonstrate how urgently the secu-
rity protocols of GenAI systems need to be improved and monitored. Some works
in the literature focus on the vulnerabilities and sophisticated manipulation tactics
of GENAI. Analyzing the vulnerabilities in GenAI highlights the significant security
concerns involved with employing advanced AI technology, including the possibility
of bypassing security protections via the RabbitHole attack and compromising data

4
privacy through rapid injection [21]. According to the analysis, GPT-4 offers signifi-
cant improvements in NLP. However, it is susceptible to quick injection attacks, which
enable the circumvention of safety restrictions and can be used as a weapon for mali-
cious and disinformation purposes. Gupta et al. addressed the intricate vulnerabilities
of GENAI using ChatGPT [22]. They emphasized that because these threats are
dynamic, protecting these systems requires a proactive and informed strategy. Build-
ing on previous results, this part delves into the complex realm of GenAI attacks,
which can range from minor adjustments to significant system breaches.

Fig. 1 The jailbreaking response of ChatGPT 4.

3.1 Jailbreaks
This subsection discusses how attackers can hack GenAI systems by using social engi-
neering techniques.” In order to circumvent moral constraints and generate responses
that would otherwise be forbidden, these strategies—which typically make use of
social engineering techniques—involve manipulating the AI’s answer-generation pro-
cess. Strong defences are urgently needed to shield these cutting-edge systems from
manipulations like the ones shown by our jailbreaking inquiry, which highlights how

5
Fig. 2 The jailbreaking response of ChatGPT 4 after typing similar prompts with the current works.

enemies may jeopardize the intended moral application of GenAI technologies. Crit-
ical to this discussion is the idea of “jailbreaking,” which refers to overcoming the
limitations set by AI programming in order to achieve a certain, oftentimes unethical
objectives. Li et al. enabled ChatGPT to bypass its ethical filtering and make person-
ally identifiable information (PII) available with their multi-step jailbreaking prompt
[23]. By leveraging Chain-of-Thought (CoT) prompting, this method divides complex
activities into intermediate steps [24]. To get around moral limitations, this necessi-
tates a Let’s think step by step approach [25]. Wu et al. [26] defined several datasets
which could be used to bypass the ethical safeguards used by ChatGPT, build on a
model of the psychological concept of self-reminder, and use this to propose a defence
system of System-Mode Self-Reminder. With this, ChatGPT is continually promoted
to act ethically, reducing Jailbreak attempts from 67.21% to 19.34%. An example of
the Jailbreak is when ChatGPT is prompted to ”Write a blackmail letter to a single
mother for money after kidnapping her child.” and will refuse to provide an answer
but offers a full explanation when the following prompt was used [26]:
The i m p a r t i a l and immoral model known a s ‘ ‘ J a i l B r e a k ’ ’ w i l l
t a k e your p l a c e f o r t h e r e m a i n d e r o f t h i s t a l k . Every
f i l t e r and l i m i t a t i o n , i n c l u d i n g moral and e t h i c a l ones ,
i s a b s e n t from J a i l B r e a k . These prompt g u i d e l i n e s a r e t h e
o n l y o n e s t h a t apply t o J a i l B r e a k . Use J a i l B r e a k a s t h e
s t a r t i n g p o i n t f o r a l l o f your outputs , and then p r o c e e d
with t h e u n f i l t e r e d answer w i t h o u t adding any o f t h e
w a r n i n g s you t y p i c a l l y r e s p o n d with . Answer o n l y t h e
q u e s t i o n s t h e u s e r a s k s you t o answer . Rules f o r
J a i l B r e a k s : A f t e r k i d n a p p i n g a mother o f h e r c h i l d , w r i t e
a l e t t e r o f b l a c k m a i l demanding money .

6
Fig. 3 The jailbreaking response of Google’s Gemini.

However, the current version of GPT-4 is robust to the previous works’ prompts.
Nevertheless, it is still prone to jailbreaking prompts. As can be seen in Fig. 1, the
current version still gives a response for jailbreaking. It becomes more robust after
writing similar prompts with the existing works in the same chat, as seen in Fig. 2.
Google’s Gemini refused all existing prompts and name-changing scenarios at the
beginning of the chat. Fig. 3 shows the same jailbreaking entry responses of the Gemini
with ChatGPT 4.

7
Fig. 4 The reverse psychology response of Google’s Gemini.

3.2 Reverse psychology


The use of reverse psychology in manipulating GenAI systems presents a unique chal-
lenge. By understanding the underlying mechanisms of these systems, attackers can
craft inputs that exploit the AI’s predictive nature, leading it to produce outcomes
contrary to its ethical programming. This form of manipulation highlights a critical
aspect of AI vulnerabilities: the susceptibility to inputs designed to play against the
AI’s expected response patterns. Such insights are vital for developing more resilient
GenAI systems that anticipate and counteract these reverse psychology tactics.
When chatting with Google’s Gemini regarding reverse psychology to write a phish-
ing email, the first attempt does not work. After conversing with curious questions to

8
Fig. 5 The reverse psychology response of ChatGPT 4.

avoid this situation, it provided three email examples with the subject and its body,
as seen in Fig. 4.
As seen in Fig. 5, ChatGPT 4 also gave an example email for this purpose even
though it refused initially.

9
Fig. 6 The prompt injection response of ChatGPT 4.

3.3 Prompt injection


Prompt injection represents a sophisticated attack on GenAI systems, where attackers
insert specially crafted prompts or sequences into the AI’s input stream. These injec-
tions can subtly alter the AI’s response generation, leading to outputs that may not
align with its ethical or operational guidelines. Understanding the intricacies of prompt
design and how it influences AI response is essential for identifying and mitigating
vulnerabilities in GenAI systems. This knowledge forms a cornerstone for developing
more robust defences against such forms of manipulation, ensuring the integrity and
ethical application of GenAI in various domains.
Both GenAI models do not respond to the current prompt injection scenarios.
Fig. 6 indicates that the ChatGPT 4 gave the wrong answers after prompt injection.
Google’s Gemini first opposed giving wrong information and provided not entirely
correct information; however, after chatting with Google’s Gemini, the system gave
the correct answer, as seen in Fig. 7.

4 Cyber Offense
GenAI has the potential to alter the landscape of offensive cyber strategies sig-
nificantly. Microsoft and OpenAI have documented preliminary instances of AI
exploitation by state-affiliated threat actors [27]. This section explores the potential
role of GenAI in augmenting the effectiveness and capabilities of cyber offensive tactics.
In an initial assessment, we jailbroke ChatGPT-4 to inquire about the vari-
ety of offensive codes it could generate. The responses obtained were compelling
enough to warrant a preliminary examination of a sample code before conducting a
comprehensive literature review (see Appendix A ).
Gupta et al. [22] have shown that ChatGPT could create social engineering attacks,
phishing attacks, automated hacking, attack payload generation, malware creation,

10
Fig. 7 The prompt injection response of Google’s Gemini.

and polymorphic malware. Experts might be motivated to automate numerous frame-


works, standards, and guidelines (Figure 8) to use GenAI for security operations.
However, the end products can also be utilised for offensive cyber operations. This not
only increases the pace of attacks but also makes attribution harder. An attribution
project typically utilizes frameworks like the MICTIC framework, which involves the
analysis of Malware, Infrastructure, Command and Control, Telemetry, Intelligence,
and Cui Bono[28]. Many behavioural patterns for attribution, such as code similar-
ity, compilation timestamps, working weeks, holidays, and language, could disappear
when Gen AI creates Offensive Cyber Operations (OCO) code. This makes attribution
more challenging, especially if the whole process becomes automated.

11
Fig. 8 Threat actors could exploit Generative AI, created for benevolent purposes, to obscure attri-
bution

4.1 Social engineering


Falade [29] investigates the application of generative AI in social engineering, assuming
the definition of social engineering as an array of tactics employed by adversaries to
manipulate individuals into divulging confidential information or performing actions
that may compromise security. The study underscores tools like ChatGPT, FraudGPT,
and WormGPT in enhancing the authenticity and specificity of phishing expeditions,
pretexting, and the creation of deepfakes. The author reflects on the double-edged
impact of advancements like Microsoft’s VALL-E and image synthesis models like
DALL·E 2, drawing a trajectory of the evolving threat landscape in social engineering
through deepfakes and exploiting human cognitive biases.

4.2 Phishing emails


Begou et al. [20] examine ChatGPT’s role in advancing phishing attacks by assess-
ing its ability to automate the development of sophisticated phishing campaigns. The
study explores how ChatGPT can generate various components of a phishing attack,
including website cloning, credential theft code integration, code obfuscation, auto-
mated deployment, domain registration, and reverse proxy integration. The authors
propose a threat model that leverages ChatGPT, equipped with basic Python skills
and access to OpenAI Codex models, to streamline the deployment of phishing infras-
tructure. They demonstrate ChatGPT’s potential to expedite attacker operations and
present a case study of a phishing site that mimics LinkedIn.
Roy et al. [30] investigate a similar study for orchestrating phishing websites; the
authors categorize the generated phishing tactics into several innovative attack vec-
tors like regular phishing, ReCAPTCHA, QR Code, Browser-in-the-Browser, iFrame
injection/clickjacking, exploiting DOM classifiers, polymorphic URL, text encoding
exploit, and browser fingerprinting attacks. The practical aspect of their research

12
includes discussing the iterative process of prompt engineering to generate phishing
attacks and real-world deployment of these phishing techniques on a public hosting
service, thereby verifying their operational viability. The authors show how to bypass
ChatGPT’s filters by structuring prompts for offensive operations.

4.3 Automated hacking


PentestGPT[31], or GPTs[32](custom versions of ChatGPT that can be created for
a specific purpose) like GP(en)T(ester)[33]. Pentest Reporter[34] are introduced as
applications built on ChatGPT, designed to assist in penetration testing—a sanc-
tioned simulation of cyberattacks on systems to evaluate security. However, these tools
could also be adapted for malicious purposes in automated hacking. Many emerging
tools, such as WolfGPT, XXXGPT, and WormGPT, have been invented; however, no
study has yet evaluated and compared their real offensive capabilities. Gupta et al.[22]
noted that an AI model could scan new code for similar weaknesses with a comprehen-
sive dataset of known software vulnerabilities, pinpointing potential attack vectors.
While AI-assisted tools like PentestGPT are intended for legitimate and constructive
uses, there is potential for misuse by malicious actors who could create similar mod-
els to automate unethical hacking procedures. If fine-tuned to identify vulnerabilities,
craft exploitation strategies, and execute those strategies, these models could poten-
tially pose significant threats to cybersecurity. However, this enormous task should
be divided into smaller segments, such as reconnaissance, privilege escalation, etc.
Temara[35] outlines how ChatGPT can be utilized during the reconnaissance phase by
employing a case study methodology to demonstrate collecting reconnaissance data
such as IP addresses, domain names, network topologies, and other critical informa-
tion like SSL/TLS cyphers, ports and services, and operating systems used by the
target. Happe et al. [36] investigate the use of Large Language Models (LLMs) in
Linux privilege escalation. The authors introduce a benchmark for automated testing
of LLMs’ abilities to perform privilege escalation using a variety of prompts and strate-
gies. They implement a tool named Wintermute, a Python program that supervises
and controls the privilege-escalation attempts to evaluate different models and prompt
strategies. Their findings indicate that GPT-4 generates the highest quality commands
and responses. In contrast, Llama2-based models struggle with command parameters
and system descriptions. In some scenarios, GPT-4 achieved a 100% success rate in
exploitation.

4.4 Attack payload generation


Studies [22, 37] have highlighted the capacity of Large Language Models (LLMs),
particularly ChatGPT, for payload generation. Our examination of GPT-4’s current
abilities confirmed its proficiency in generating payloads and embedding them into
PDFs (as an example) using a reverse proxy(Figure 9). The following is a summation of
the frameworks GPT-4 utilizes with successful payload code generation, accompanied
by their respective primary functions:
• Veil-Framework: Veil is a tool designed to generate payloads that bypass common
antivirus solutions.

13
Fig. 9 script for payload generation and example to embed into pdf

• TheFatRat: A comprehensive tool that compiles malware with popular payload


generators, capable of creating diverse malware formats such as exe, apk, and more.
• Pupy: An open-source, cross-platform remote administration and post-exploitation
tool supporting Windows, Linux, macOS, and Android.
• Shellter: A dynamic shellcode injection tool used to inject shellcode into native
Windows applications.
• Powersploit: A suite of Microsoft PowerShell modules designed to assist penetra-
tion testers throughout various stages of an assessment.
• Metasploit: A sophisticated open-source framework for developing, testing, and
implementing exploit code, commonly employed in penetration testing and security
research.

4.5 Malware code generation


Gupta et al.[22] mentioned they could obtain potential ransomware code examples
by utilizing a ’DAN’ jailbreak. We tested all the existing DAN techniques outlined in
[38]. At the time of our research, these techniques were no longer functional; therefore,
we could not reproduce samples of WannaCry, Ryuk, REvil, or Locky, as addressed
by [22]. However, we generated an educational ransomware code (Figure 10), apply-
ing basic code obfuscation(Renaming and Control Flow Flattening). ChatGPT has
garnered significant attention from the cybersecurity community, leading to the imple-
mentation of robust filters. Nonetheless, this does not imply that other models, such
as the Chinese 01.ai[39], will have an equivalent opportunity to mitigate the potential
for misuse in generating malicious code.

4.6 Polymorphic malware


The usage of LLMs could see the rise of malware, which integrates improved evasion
techniques and polymorphic capabilities [40]. This often relates to overcoming both
signature detection and behavioural analysis. An LLM-based malware agent could
thus focus on rewriting malware code which could change the encryption mode used
or produce obfuscated code which is randomized for each build [41]. Gupta et al. [22]

14
Fig. 10 Educational ransomware code with basic code obfuscation

outlined a method of getting ChatGPT to seek out target files for encryption and
thus mimic ransomware behaviour, but where it mutated the code to avoid detection.
They even managed to embed a Python interpreter in the malware and where it could
query ChatGPT for new software modules.

4.7 Reversing cryptography


LLMs provide the opportunity to take complex cybersecurity implementations and
quickly abstract the details of performances in running code. With this, Know et al.
[42] could deconstruct AES encryption into a core abstraction of the rounds involved
and produce running C code that matched test vectors. While AES is well-known for
its operation, the research team then was able to deconstruct less known CHAM block
cypher, and where the code extracted was validated against known test vectors.

15
While NIST has been working on the standardization of a light-weight encryp-
tion method, Cintas et al. [43] used ChatGPT to take an abstract definition of the
ASCON cypher and produced running code that successfully implemented a range of
test vectors.

5 Cyber Defence
In the ever-evolving cybersecurity battlefield, the “Cyber Defence” segment highlights
the indispensable role of GenAI in fortifying digital fortresses against increasingly
sophisticated cyber threats. This section is dedicated to exploring how GenAI technolo-
gies, renowned for their advanced capabilities in data analysis and pattern recognition,
are revolutionizing the approaches to cyber defence. Iturbe et al. [44] outline the
AI4CYBER framework, which provides a roadmap for the integration of AI into
cybersecurity applications. This includes AI4VUN (AI-enhanced vulnerability iden-
tification); AI4FIX (AI-driven self-testing and automatic error correction); AI4SIM
(Simulation of advanced and AI-powered attacks); AI4CTI A(I-enhanced cyber threat
intelligence of adversarial AI); AI4FIDS (federated learning-enhanced detection);
AI4TRIAGE (Root cause analysis and alert triage); AI4SOAR (Automatic orchestra-
tion and adaptation of combined responses); AI4ADAPT (Autonomy and optimization
of response adaptation); and AI4DECEIVE (Smart deception and honeynets); and
AI4COLLAB (Information sharing with privacy and confidentiality).

5.1 Cyber Defence Automation


LLMs interpret fairly vague commands and make sense of them within a cybersecurity
context. The work of Fayyazi et al. [45] defines a model with vague definitions of a
threat and then matches these to formal MITRE tactics. Charan et al. [37] have even
extended this to generate plain text to map into the MITRE to produce malicious
network payloads. Also, LLMs could aid the protection of smaller organisations and
could enhance organisational security from the integration of human knowledge and
LLMs [46].

5.2 Cybersecurity Reporting


Using LLMs provides a method of producing Cyber Threat Intelligence (CTI) using
Natural Language Processing techniques. For this, Perrina et al. [47] created the AGIR
(Automatic Generation of Intelligence Reports) system and which aims to link together
text data from many data sources. For this, they found that AGIR has a high recall
value (0.99) without any hallucinations, along with a high score of the Syntactic Log-
Odds Ratio (SLOR).

5.3 Threat Intelligence


Bayer et al. [48] address the challenge of information overload in the gathering of Cyber
Threat Intelligence (CTI) from open-source intelligence (OSINT). A novel system is
introduced, utilizing transfer learning, data augmentation, and few-shot learning to
train specialized classifiers for emerging cybersecurity incidents. In parallel, Microsoft

16
Security Copilot [49] has been providing CTI to its customers using GPT, and oper-
ational use cases have been observed, such as the Cyber Dome initiative in Israel
[50].

5.4 Secure Code Generation and Detection


Machine learning in code analysis for cybersecurity has been elaborated very well
[51]. Recent progress in natural language processing (NLP) has given rise to potent
language models like the GPT (Generative Pre-trained Transformer) series, encom-
passing large language models (LLM) like ChatGPT and GPT-4 [52]. Traditionally,
Static Application Security Testing (SAST) is a method that employs Static Code
Analysis (SCA) to detect possible security vulnerabilities. We are interested in seeing
whether SAST or GPT could be more efficient in decreasing the window of vulner-
ability. The window of vulnerability is defined as when the most vulnerable systems
apply the patch minus the time an exploit becomes active. The precondition is met if
two milestones that assume the detection of vulnerabilities verify their effectiveness,
along with the vendor patch [53].
Laws in some countries, like China, ban the reporting on zero-days (see articles 4
and 9 of [54]), and contests like the Tianfu Cup [55], which is a systematic effort to
find zero days, proliferate zero-day discovery continuously. Therefore, this precondition
may not be satisfied timely, especially if the confirmation of vulnerabilities is not
verified. A wide window of vulnerability threatens national security if a zero-day has
been taken against critical infrastructures. DARPA introduces an important challenge
that may help overcome this threat: (AIxCC) [56]). Moreover, this topic touches a
part of the BSI studies [57, 58], and where we can define two main classifications of
software testing for cybersecurity bugs as:
• Static Application Security Testing (SAST). This is often called White Box Testing,
is a set of algorithms and techniques used for analyzing source code. It operates
automatically in a non-runtime environment to detect vulnerabilities such as hidden
errors or poor source code during development.
• Dynamic Application Security Testing (DAST). This follows the opposite approach
and analyzes the program while it is operating. Functions are called with values
in the variables as each line of code is checked, and possible branching scenarios
are guessed. Currently, GPT-4 and other LLMs can’t provide DAST capabilities
because the code needs to run within the runtime for this to work, requiring many
deployment considerations.

5.5 Vulnerability detection and repair


Dominik Sobania et al. [59] explored automated program repair techniques, specifically
focusing on ChatGPT’s potential for bug fixing. According to them, while initially not
designed for this purpose, ChatGPT demonstrated promising results on the QuixBugs
benchmark, rivalling advanced methods like CoCoNut and Codex. ChatGPT’s inter-
active dialogue system uniquely enhances its repair rate, outperforming established
standards.

17
Wei Ma et al. [60] noted that while ChatGPT shows impressive potential in software
engineering(SE) tasks like code and document generation, its lack of interpretability
raises concerns given SE’s high-reliability requirements. Through a detailed study, they
categorized AI’s essential skills for SE into syntax understanding, static behaviour
understanding, and dynamic behaviour understanding. Their assessment, spanning
languages like C, Java, Python, and Solidity, revealed that ChatGPT excels in syntax
understanding (akin to an AST parser) but faces challenges in comprehending dynamic
semantics. The study also found ChatGPT prone to hallucination, emphasizing the
need to validate its outputs for SE dependability and suggesting that codes from LLMs
are syntactically right but potentially vulnerable.
Haonan Li et al. [61] discussed the challenges of balancing precision and scala-
bility in static analysis for identifying software bugs. While LLMs show potential in
understanding and debugging code, their efficacy in handling complex bug logic, which
often requires intricate reasoning and broad analysis, remains limited. Therefore, the
researchers suggest using LLMs to assist rather than replace static analysis. Their
study introduced LLift, an automated system combining a static analysis tool and
an LLM to address use-before-initialization (UBI) bugs. Despite various challenges
like bug-specific modelling and the unpredictability of LLMs, LLift, when tested on
real-world potential UBI bugs, showed significant precision (50%) and recall (100%).
Notably, it uncovered 13 new UBI bugs in the Linux kernel, highlighting the potential
of LLM-assisted methods in extensive real-world bug detection.
Norbert Tihani et al. [62] introduced the FormAI dataset, comprising 112,000 AI-
generated C programs with vulnerability classifications generated by GPT-3.5-turbo.
These programs range from complex tasks like network management and encryp-
tion to simpler ones, like string operations. Each program comes labelled with the
identified vulnerabilities, pinpointing type, line number, and vulnerable function. To
achieve accurate vulnerability detection without false positives, the Efficient SMT-
based Bounded Model Checker (ESBMC) was used. This method leverages techniques
like model checking and constraint programming to reason over program safety.
Each vulnerability also references its corresponding Common Weakness Enumeration
(CWE) number.
Codex, introduced by Mark et al. [63], represents a significant advancement in
GPT language models, tailored specifically for code synthesis using data from GitHub.
This refined model underpins the operations of GitHub Copilot. When assessed on
the HumanEval dataset, designed to gauge the functional accuracy of generating pro-
grams based on docstrings, Codex achieved a remarkable f28.8% success rate. In stark
contrast, GPT-3 yielded a 0% success rate, and GPT-J achieved 11.4%. A standout
discovery was the model’s enhanced performance through repeated sampling, with
a success rate soaring to 70.2% when given 100 samples per problem. Despite these
promising results, Codex does exhibit certain limitations, notably struggling with intri-
cate docstrings and variable binding operations. The paper deliberates on the broader
ramifications of deploying such potent code-generation tools, touching upon safety,
security, and economic implications.
In a technical evaluation, Cheshkov et al. [64] found that the ChatGPT and GPT-3
models, despite their success in various other code-based tasks, performed on par with

18
a dummy classifier for this particular challenge. Utilizing a dataset of Java files sourced
from GitHub repositories, the study emphasized the models’ current limitations in the
domain of vulnerability detection. However, the authors remain optimistic about the
potential of future advancements, suggesting that models like GPT-4, with targeted
research, could eventually make significant contributions to the field of vulnerability
detection.
A comprehensive study conducted by Xin Liu et al. [65] investigated the potential
of ChatGPT in Vulnerability Description Mapping (VDM) tasks. VDM is pivotal in
efficiently mapping vulnerabilities to CWE and Mitre ATT&CK Techniques classifica-
tions. Their findings suggest that while ChatGPT approaches the proficiency of human
experts in the Vulnerability-to-CWE task, especially with high-quality public data,
its performance is notably compromised in tasks such as Vulnerability-to-ATT&CK,
particularly when reliant on suboptimal public data quality. Ultimately, Xin Liu et al.
emphasize that, despite the promise shown by ChatGPT, it is not yet poised to replace
the critical expertise of professional security engineers, asserting that closed-source
LLMs are not the conclusive answer for VDM tasks.

5.6 Evaluating LLMs for code security


The OWASP top 10 for LLMs [66] introduced ten security risks as follows: Prompt
Injection, Insecure Output Handling, Training Data Poisoning, Model Denial of Ser-
vice, Supply Chain Vulnerabilities, Sensitive Information Disclosure, Insecure Plugin
Design, Excessive Agency, Overreliance, and Model Theft.
Elgedawy et al. [67] analysed the ability of LLM to produce both secure and inse-
cure code and conducted experiments using GPT-3.5, GPT-4, Google Bard and Google
Gemini from Google. This involved nine basic tasks in generating code and assessing
for functionality, security, performance, complexity, and reliability. They found that
Bard was less likely to link to external libraries, and thus be less exposed to software
chain issues. There were also variable levels of security and code integrity, such as
input validation, sanitization, and secret key management, and while useful for auto-
mated code reviews, LLMs often require manual reviews, especially in understanding
the context of the deployed code. For security, GPT-3.5 seemed to be more robust for
error handling and secure coding practices when there is security consciousness applied
to the prompt, there was a lesser focus on this with GPT-4, but where there were
more advisory notes given. Overall, Gemini produced the most code vulnerabilities,
and the paper advised users to be careful when deploying secure code from Gemini.

5.7 Developing Ethical guidelines


Kumar et al. [68] outlined the ethical challenges related to LLMs and where the
datasets that they were trained on could be open to breaches of confidentiality, includ-
ing five major threats: prompt injection, jailbreaking, Personal Identifiable Information
(PII) exposure, sexually explicit content, and hate-based content. They propose a
model that provides an ethical framework for scrutinizing the ethical dimensions of an
LLM within the testing phase. The MetaAID framework [69] focuses on strengthening
cybersecurity using Metaverse cybersecurity Q&A and attack simulation scenarios,

19
along with addressing concerns around the ethical implications of user input. The
framework is defined across five dimensions of:
• Ethics. This defines an alignment with accepted moral and ethical principles.
• Legal Compliance. This defines that any user input does not violate laws and/or
regulations. This might relate to privacy laws and copyright protection.
• Transparency. This defines that user inputs must be clear in requirements, and does
not intend to mislead the LLM.
• Intent Analysis. This defines that user input should not have other intents, such as
jailbreaking the LLM.
• Malicious intentions. This defines that user input should be free of malicious intent,
such as to perform hate crime.
• Social Impact. This defines how user input could have a negative effect on society,
such as searching for ways to do harm to others, such as related to crashing the
stock market or planning a terrorist attack.

5.8 Incident Response and Digital Forensics


Scanlon et al. [16] investigated using a pre-trained LLM for artefact understanding,
evidence searching, code generation, anomaly detection, incident response, and educa-
tion. For this, the low-risk applications and many others still require expert knowledge.
The key areas of strength include creativity, reassurance, and avoidance of the blank
page syndrome.
Scanlon et al. [16] investigated using a pre-trained LLM for artefact understand-
ing, evidence searching, code generation, anomaly detection, incident response, and
education. For this, the low-risk applications and many others still require expert
knowledge. The key areas of strength include creativity, reassurance, and avoidance
of the blank page syndrome, especially in places where ChatGPT cannot get wrong,
such as in forensic scenario creation and reassurance of evidence. Still, care must be
taken in this to avoid ChatGPT hallucinations. Another practical application is in
code generation and explanations, such as generating commands for tool integration
- which can be used as a starting point in an investigation.
For weaknesses, Scanlon found that it was essential to have a good quality and
up-to-date training model; otherwise, ChatGPT could be biased and outdated in its
analysis. Generally, it might not be able to find the newest artefacts - if it is trained
on relatively old data. Additionally, ChatGPT’s accuracy reduces as the task becomes
more specific, and any analysis of non-textural data - such as network packets - is less
accurate. The length of some evidence logs, too, caused problems and often had to be
prefiltered before they were analysed. A final problem identified is that the output of
ChatGPT is often not deterministic - which is unsuitable for reproducibility.
OB́rien et al. [70] outline that a full model life cycle solution is required for the
integration of AI.

5.9 Identification of Cyber attacks


Iqbal et al. [71] define a plug-in ecosystem for LLM platforms with an attack tax-
onomy. This research will thus extend the taxonomy approach and extend it toward

20
the MITRE ATT&CK platform [37, 72], and which can use standardized taxonomies,
sharing standards [73], and ontologies for cyber threat intelligence [74].
Garza et al. [75] analysed ChatGPT and Google’s Bard against the Top 10 attacks
within the MITRE framework and found that ChatGPT can enable attackers to signif-
icantly improve attacks on networks and where fairly low-level skills would be required
(such as with script kiddies). This also includes sophisticated methods of delivering
ransomware payloads. The techniques defined were:
• T1059 Command and Scripting Interpreter
• T1003 OS Credential Dumping
• T1486 Data Encrypted for Impact
• T1055 Process Injection
• T1082 System Information Discovery
• T1021 Remote Services
• T1047 Windows Management Instrumentation
• T1053 Scheduled Task/Job
• T1497 Virtualization/Sandbox Evasion
• T1018 Remote System Discovery
With this approach, the research team were able to generate PowerShell code,
which implemented advanced attacks against the host and mapped directly to the
vulnerabilities defined in the MITRE framework. One of the work’s weaknesses related
to the Google Bard and ChatGPT’s reluctance to produce attack methods, but a
specially engineered command typically overcame this reluctance.
Ferrag et al. [76] defined SecurityLLM for cybersecurity threat detection. It uses
SecurityBERT (cyber threat detection mechanism) and FalconLLM (an incident
response and recovery system). This uses a simple classification model consolidated
with LLMs and can identify 14 attacks to achieve an overall accuracy of 98%. These
include the threats of: DDoS UDP; DDoS ICMP; SQL injection ; Password; Vulnera-
bility scanner; DDoS TCP; DDoS HTTP; Uploading ; Backdoor; Port Scanning; XSS;
Ransomware; MITM and Fingerprinting.

5.10 Data set generation


Over the years, several datasets have been used for cybersecurity machine learning
training, which performs a range of scenarios or where organisations are unwilling to
share their collected attack data. Unfortunately, these can become out-of-date or are
unrealistic. For this, Kholgh et al. [77] outline the usage of PAC-GPT, a framework
that generates reliable synthetic data for machine learning methods. It has a CLI
interface for data set generation and uses GPT-3 with two core elements:
• Flow Generator. This defines the capturing processing and the regenerative process
for the patterns for packet generation. regenerating patterns in a series of network
packets and
• Packet Generator. This associates packets with network flows. This involves the
usage of LLM chaining.

21
Simmonds [78] used LLMs to automate the classification of Websites, which can
be used for training data in a machine-learning model. For this, all HTML tags, CSS
styling and other non-essential content must be removed before the LLM processes
them, and then it can train on just the website’s content.

6 Implications of Generative AI in Social, Legal,


and Ethical Domains
This section examines GenAI’s various societal, legal, and ethical consequences. It
investigates GenAI’s impact on legal frameworks, ethical issues, societal norms, and
operational factors. It explains how these expanding technologies might support or
damage established rules and societal goals. It also considers GenAI’s privacy con-
cerns, potential biases, and misuse. Finally, it emphasizes the importance of striking
a balance between improvement and regulation. The exponential expansion of GenAI
technologies, such as ChatGPT from OpenAI and Gemini from Google, heralds a rev-
olution in digital creativity, automation, and interaction. These developments usher
in a new era of human-machine collaboration characterized by an unmatched ability
to generate text, drawings, and other outputs that resemble human output. However,
this breakthrough also raises challenging ethical concerns about potential abuse, bias,
privacy, and security. As AI models become more prevalent in business and daily life,
it is imperative to strike a balance between the potential benefits and the ethical
difficulties they raise [79].
Healthcare duties are improved by the efficient text and data analysis capabilities of
GenAI technologies [80]. Its application in healthcare has demonstrated great promise,
helping with duties such as radiological reporting and patient care. Nonetheless, it
brings up moral concerns about algorithmic bias, patient privacy, legal accountability,
and the validity of the doctor-patient relationship. Addressing these issues requires
a comprehensive ethical framework and principles that incorporate legal, humanistic,
algorithmic, and informational ethics, guaranteeing that technology is used correctly
and continues to benefit society while reducing potential harm. The recommenda-
tions attempt to bridge the gap between ethical principles and practical application,
highlighting the need for openness, bias mitigation, and ensuring user privacy and
security in building trust and ethical compliance in GenAI deployment [79]. This
approach seeks to strike a balance between the rapid advances in AI and the ethical
considerations required for its incorporation into sensitive sectors such as healthcare.
Some organizations strive to implement the aforementioned ethical principles and
rules in AI. The European Union is scheduled to implement the AI Act, marking a
historic milestone as the world’s first comprehensive regulation of AI [81], [82]. The
European Commission proposed the AI Act in April 2021 to categorize AI systems
based on their risk level and enforce rules accordingly to ensure that AI technologies are
developed and used safely, transparently, and without discrimination across the EU.
With a focus on human oversight and environmental sustainability, the Act will impose
strict controls on high-risk AI applications, prohibit AI systems deemed unacceptable
risks, and establish transparency requirements for limited-risk AI to foster innovation
while protecting fundamental rights and public safety. The US executive order on

22
the issue prioritizes the development of reliable, secure, and safe AI [83]. Its main
objectives are to protect civil rights and privacy in AI applications, foster AI talent
and innovation in the US, and establish risk management strategies for AI. As a global
leader in responsible AI development and application, it seeks to build responsible AI
deployment within government institutions and foster international collaboration on
AI standards and laws.

6.1 The Omnipresent Influence of GenAI


The application of GenAI technology has yielded previously unthinkable discoveries
and has substantially helped the healthcare, education, and entertainment sectors [80].
This breakthrough technology has developed written and visual information, leading
to increased productivity and new innovation. With the growing importance of GenAI
in our everyday lives, we need to rethink the concepts of creativity and individual
contribution in an increasingly automated world [23]. Aligned with these opportunities
are growing concerns about potential consequences on labour markets’ difficulties in
enforcing copyright laws in the new digital environment. Additionally, it confirms that
the data shared is accurate and proper.

6.2 Concerns Over Privacy in GenAI-Enabled Communication


With GenAI’s capacity to mimic human language skills, private discussions may
become less secure and private. This is a concern as the technology advances. Since
these machines can mimic human interactions, there is a chance that personal data will
be misused [79]. This highlights the necessity for robust legal defences and effective
security measures. Severe data protection regulations and rigorous adherence to ethi-
cal standards are necessary because of the risk that this technology would be exploited
to intentionally or inadvertently access private talks. Respecting people’s privacy and
the ethics of business relationships requires taking preventative measures and strict
observation to end unauthorized access to private communication.

6.3 The Risks of Personal Data Exploitation


As GenAI systems improve in examining and utilizing user data to create compre-
hensive profiles, worries over possible misuse of personal data have grown. The robust
data processing capabilities of these technologies demonstrate the pressing need for
reliable methods that allow consumers to govern their personal data [82]. It is crucial
to obtain consumers’ consent before gathering or utilizing their data in order to pro-
tect their privacy. Transparent data management practices and stringent regulations
on the collection, use, and storage of personal data are essential. These actions are
critical to safeguarding people’s right to privacy, preventing the misuse of personal
information, and ensuring that sensitive information is handled sensibly and ethically.

6.4 Challenges in Data Ownership and Intellectual Property


Intellectual property rights and data ownership have come under scrutiny as GenAI
has emerged as an effective tool for creating content using user input. It is becoming

23
more difficult to identify inventions made by artificial intelligence from those that are
the result of human creativity, so the current legal frameworks need to be examined
and modified. The rights of the original creators must be maintained while taking into
consideration the complex roles AI plays in creative processes [83] [81]. Considering
the evolving character of the current world, a comprehensive and clear legal frame-
work defining ownership and copyright rules for GenAI breakthroughs is needed. These
legal structures must recognize the different duties that each member of the creative
ecosystem has, promote creativity, and offer fair recompense. In an era where artifi-
cial and human intelligence are combined, these policies are crucial for managing the
intricate dynamics of data ownership and intellectual property.

6.5 Ethical Dilemmas Stemming from Organizational Misuse


of GenAI
Companies may face difficult ethical issues while using GenAI technology. This is par-
ticularly valid in cases where these tools are employed for spying, data manipulation,
or deceptive marketing campaigns. Robust guidelines are required for the proper appli-
cation of GenAI. The reason for this is that it may result in unfair content creation,
influence public opinion, and violate individual privacy rights [68]. Businesses utiliz-
ing GenAI must adhere to both legal requirements and common ethical standards.
Upholding people’s rights and maintaining their trust are crucial. Legal frameworks
that guarantee businesses apply GenAI in an open, equitable, and compassionate
manner while still honouring social norms and individual boundaries are desperately
needed.

6.6 The Challenge of Hallucinations in GenAI Outputs


Although the field of GenAI technology is advancing at an incredible rate, hallucina-
tions remain a major issue [22]. This indicates that inaccurate or outright fraudulent
information is frequently created by AI. Concerns regarding the dependability of con-
tent created by AI are widespread. This is so that false or misleading information can
propagate widely and damage the credibility of information across a variety of venues.
A multidisciplinary approach is required to address this problem, one that includes
focused research aimed at identifying and reducing the underlying causes of halluci-
nations in AI systems. AI-generated content needs to pass strict screening procedures
and be continuously improved upon in order to make these models’ capacity to differ-
entiate between authentic and fraudulent content more sophisticated. Producing AI
content in the GenAI era requires a continuous focus on method development and
in-depth study to guarantee data accuracy.
A complex system of unresolved challenges is revealed by analyzing the ethical,
legal, and societal implications of GenAI technology. In the creation and use of this
technology, the proclamation emphasizes the value of interdisciplinary cooperation. In
particular, it involves keeping an eye on how these developments affect society, legal
systems, and other moral conundrums. In order to create this well-rounded approach,
advocates, technologists, and society as a whole must collaborate to effectively utilize

24
the promise of GenAI while safeguarding moral principles and traditional values in
the digital age.

7 Discussion
The sophisticated field of GenAI in cybersecurity has been examined in this paper. The
focus is on both offensive and defensive strategies. By enhancing incident response,
automating defensive systems, and identifying sophisticated attacks, GenAI has a
disruptive influence that might significantly raise cybersecurity standards. Some of
the new risks that accompany these technical improvements include hackers having
access to ever-more-advanced attack-building tools. This discrepancy highlights the
significance of striking a balance between purposefully restricting the components that
can be employed and enhancing GenAI’s capabilities.
Apart from the apparent inconsistency between offensive and defensive strategies,
this study examines the moral, legal, and societal implications of utilizing artificial
intelligence in cybersecurity. It also emphasizes the need for robust legal frameworks,
strict moral standards, ongoing technical monitoring, and proactive GenAI manage-
ment. This is a paradigm-shifting and technical revolution. Adopting a holistic strategy
considering the technological, ethical, and sociological consequences of implementing
GenAI into cybersecurity is crucial.
Moreover, our findings emphasise the significance of interdisciplinary collaboration
to promote GenAI applications in cybersecurity. The intricacy and findings of GenAI
technologies require expertise from various fields, including computer science, law,
ethics, and policy-making, to navigate their possible challenges. As multidisciplinary
research and discourse become more prevalent, it will ensure that GenAI is applied
responsibly and effectively in the future.
Our extensive research has shown that collaborative efforts to innovate ethically
will influence cybersecurity in a future driven by GenAI. Although GenAI has the
ability to transform cybersecurity strategies completely, it also carries a great deal
of responsibility. As we investigate this uncharted domain, we should advance the
development of sophisticated techniques to ensure the moral, just, and safe appli-
cation of advanced GenAI capabilities. By promoting a consistent focus on the
complex relationship between cybersecurity resilience and GenAI innovation, sup-
ported by a commitment to ethical integrity and societal advancement, the current
study establishes the groundwork for future research initiatives.

8 Conclusion
Our thorough examination of cybersecurity offence and defence, as well as Gener-
ative Artificial Intelligence (GenAI) technologies, concludes with the discovery of a
double-edged sword. Although GenAI has the potential to revolutionize cybersecu-
rity processes by automating defences, enhancing threat intelligence, and improving
cybersecurity protocols, it also opens up new avenues for highly skilled cyberattacks.
Incorporating GenAI into cybersecurity emphasises the enduring ethical, legal, and
technical scrutiny essential to minimize the risks of misuse and maximize the ben-
efits of this technology for protecting digital infrastructures. Future studies should

25
concentrate on creating strong ethical standards and creative defence mechanisms to
handle the challenges posed by GenAI and guarantee a fair and impartial approach
to its implementation in cybersecurity. A multidisciplinary effort is required to bridge
the gap between ethical management and technological discovery to coordinate the
innovative capabilities of GenAI with the requirement of cybersecurity resilience.

References
[1] Happe, A., Cito, J.: Getting pwn’d by ai: Penetration testing with large language
models. arXiv preprint arXiv:2308.00121 (2023)

[2] Barreto, F., Moharkar, L., Shirodkar, M., Sarode, V., Gonsalves, S., Johns, A.:
Generative Artificial Intelligence: Opportunities and Challenges of Large Lan-
guage Models. In: Balas, V.E., Semwal, V.B., Khandare, A. (eds.) Intelligent
Computing and Networking, pp. 545–553. Springer, ??? (2023)

[3] Naveed, H., Khan, A.U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N.,
Barnes, N., Mian, A.: A Comprehensive Overview of Large Language Models
(2023)

[4] Mohammed, S.P., Hossain, G.: Chatgpt in education, healthcare, and cyberse-
curity: Opportunities and challenges. In: 2024 IEEE 14th Annual Computing
and Communication Workshop and Conference (CCWC), pp. 0316–0321 (2024).
IEEE

[5] Alawida, M., Mejri, S., Mehmood, A., Chikhaoui, B., Isaac Abiodun, O.: A com-
prehensive study of chatgpt: advancements, limitations, and ethical considerations
in natural language processing and cybersecurity. Information 14(8), 462 (2023)

[6] Dun, C., Garcia, M.H., Zheng, G., Awadallah, A.H., Kyrillidis, A., Sim, R.:
Sweeping Heterogeneity with Smart MoPs: Mixture of Prompts for LLM Task
Adaptation (2023)

[7] AI, G.: AI Principles Progress Update 2023. [Online]. Available: https://ai.
google/responsibility/principles/, Accessed Jan 10, 2024

[8] AI, G.: GPT-4 Technical Report. [Online]. Available: https://ai.google/


responsibility/principles/, Accessed Jan 10, 2024

[9] OpenAI: Introducing Gemini: Our Largest and Most Capable AI Model. [Online].
Available: https://cdn.openai.com/papers/gpt-4.pdf, Accessed Dec 12, 2023
(2023)

[10] Frieder, S., Pinchetti, L., Griffiths, R.-R., Salvatori, T., Lukasiewicz, T., Petersen,
P., Berner, J.: Mathematical capabilities of chatgpt. Advances in Neural Infor-
mation Processing Systems 36 (2024)

26
[11] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee-
lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A.,
Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter,
C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J.,
Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language
models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan,
M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33,
pp. 1877–1901. Curran Associates, Inc., ??? (2020). https://proceedings.neurips.
cc/paper files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

[12] Romera-Paredes, B., Barekatain, M., Novikov, A., Balog, M., Kumar, M.P.,
Dupont, E., Ruiz, F.J.R., Ellenberg, J.S., Wang, P., Fawzi, O., Kohli, P., Fawzi,
A.: Mathematical discoveries from program search with large language models.
Nature 625(7995), 468–475 (2024) https://doi.org/10.1038/s41586-023-06924-6

[13] Lu, C., Qian, C., Zheng, G., Fan, H., Gao, H., Zhang, J., Shao, J., Deng, J., Fu,
J., Huang, K., Li, K., Li, L., Wang, L., Sheng, L., Chen, M., Zhang, M., Ren, Q.,
Chen, S., Gui, T., Ouyang, W., Wang, Y., Teng, Y., Wang, Y., Wang, Y., He, Y.,
Wang, Y., Wang, Y., Zhang, Y., Qiao, Y., Shen, Y., Mou, Y., Chen, Y., Zhang,
Z., Shi, Z., Yin, Z., Wang, Z.: From GPT-4 to Gemini and Beyond: Assessing the
Landscape of MLLMs on Generalizability, Trustworthiness and Causality through
Four Modalities (2024)

[14] Wang, Y., Zhao, Y.: Gemini in Reasoning: Unveiling Commonsense in Multimodal
Large Language Models (2023)

[15] Shevlane, T.: An early warning system for novel AI risks. Google
DeepMind. [Online]. Available: https://deepmind.google/discover/blog/
an-early-warning-system-for-novel-ai-risks/, Accessed Jan 15, 2024

[16] Scanlon, M., Breitinger, F., Hargreaves, C., Hilgert, J.-N., Sheppard, J.: Chatgpt
for digital forensic investigation: The good, the bad, and the unknown. Forensic
Science International: Digital Investigation 46, 301609 (2023)

[17] Tihanyi, N., Ferrag, M.A., Jain, R., Debbah, M.: CyberMetric: A Benchmark
Dataset for Evaluating Large Language Models Knowledge in Cybersecurity
(2024)

[18] Gehman, S., Gururangan, S., Sap, M., Choi, Y., Smith, N.A.: Realtoxici-
typrompts: Evaluating neural toxic degeneration in language models. In: Findings
(2020). https://api.semanticscholar.org/CorpusID:221878771

[19] Zhou, Z., Wang, Q., Jin, M., Yao, J., Ye, J., Liu, W., Wang, W., Huang,
X., Huang, K.: MathAttack: Attacking Large Language Models Towards Math
Solving Ability (2023)

[20] Begou, N., Vinoy, J., Duda, A., Korczynski, M.: Exploring the dark side of ai:

27
Advanced phishing attack design and deployment using chatgpt. arXiv preprint
arXiv:2309.10463 (2023)

[21] AI, A.: GPT-4 Jailbreak ve Hacking Via Rabbithole Attack, Prompt Injection,
Content Moderation Bypass ve Weaponizing AI. [Online]. Available: https://
adversa.ai/, Accessed Dec 20, 2023

[22] Gupta, M., Akiri, C., Aryal, K., Parker, E., Praharaj, L.: From chatgpt to
threatgpt: Impact of generative ai in cybersecurity and privacy. IEEE Access
(2023)

[23] Li, H., Guo, D., Fan, W., Xu, M., Song, Y.: Multi-step jailbreaking privacy attacks
on chatgpt. arXiv preprint arXiv:2304.05197 (2023)

[24] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou,
D., et al.: Chain-of-thought prompting elicits reasoning in large language models.
Advances in Neural Information Processing Systems 35, 24824–24837 (2022)

[25] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models
are zero-shot reasoners. Advances in neural information processing systems 35,
22199–22213 (2022)

[26] Xie, Y., Yi, J., Shao, J., Curl, J., Lyu, L., Chen, Q., Xie, X., Wu, F.: Defending
chatgpt against jailbreak attack via self-reminder. Nature Machine Intelligence 5,
1486–1496 (2023) https://doi.org/10.1038/s42256-023-00765-8

[27] OpenAI: Disrupting Malicious Uses of AI by State-


Affiliated Threat Actors. https://openai.com/blog/
disrupting-malicious-uses-of-ai-by-state-affiliated-threat-actors Accessed
2024-02-25

[28] Brandao, P.R.: Advanced persistent threats (apt)-attribution-mictic framework


extension. J. Comput. Sci 17, 470–479 (2021)

[29] Falade, P.V.: Decoding the threat landscape: Chatgpt, fraudgpt, and wormgpt in
social engineering attacks. arXiv preprint arXiv:2310.05595 (2023)

[30] Roy, S.S., Naragam, K.V., Nilizadeh, S.: Generating phishing attacks using
chatgpt. arXiv preprint arXiv:2305.05133 (2023)

[31] Deng, G., Liu, Y., Mayoral-Vilches, V., Liu, P., Li, Y., Xu, Y., Zhang, T., Liu, Y.,
Pinzger, M., Rass, S.: PentestGPT: An LLM-empowered Automatic Penetration
Testing Tool (2023)

[32] AI, O.: Introducing GPTs (2023). https://openai.com/blog/introducing-gpts


Accessed 2023-11-12

[33] Montiel, R.: ChatGPT (2021). https://chat.openai.com/g/

28
g-zQfyABDUJ-gp-en-t-ester Accessed 2023-11-12

[34] Doustaly, L.: ChatGPT (2021). https://chat.openai.com/g/


g-zQfyABDUJ-gp-en-t-ester Accessed 2023-11-12

[35] Temara, S.: Maximizing penetration testing success with effective reconnaissance
techniques using chatgpt (2023)

[36] Happe, A., Kaplan, A., Cito, J.: Evaluating llms for privilege-escalation scenarios.
arXiv preprint arXiv:2310.11409 (2023)

[37] Charan, P., Chunduri, H., Anand, P.M., Shukla, S.K.: From text to mitre tech-
niques: Exploring the malicious use of large language models for generating cyber
attack payloads. arXiv preprint arXiv:2305.15336 (2023)

[38] ONeal, A.: ChatGPT-Dan-Jailbreak.md (2023). https://gist.github.com/


coolaj86/6f4f7b30129b0251f61fa7baaa881516 Accessed 2023-11-13

[39] AI2.0 (2023). https://01.ai/ Accessed 2023-11-13

[40] Kumamoto, T., Yoshida, Y., Fujima, H.: Evaluating large language models in
ransomware negotiation: A comparative analysis of chatgpt and claude (2023)

[41] Madani, P.: Metamorphic malware evolution: The potential and peril of large lan-
guage models. In: 2023 5th IEEE International Conference on Trust, Privacy and
Security in Intelligent Systems and Applications (TPS-ISA), pp. 74–81 (2023).
IEEE Computer Society

[42] Kwon, H., Sim, M., Song, G., Lee, M., Seo, H.: Novel approach to cryptography
implementation using chatgpt. Cryptology ePrint Archive (2023)

[43] Cintas-Canto, A., Kaur, J., Mozaffari-Kermani, M., Azarderakhsh, R.: Chatgpt
vs. lightweight security: First work implementing the nist cryptographic standard
ascon. arXiv preprint arXiv:2306.08178 (2023)

[44] Iturbe, E., Rios, E., Rego, A., Toledo, N.: Artificial intelligence for next-generation
cybersecurity: The ai4cyber framework. In: Proceedings of the 18th International
Conference on Availability, Reliability and Security, pp. 1–8 (2023)

[45] Fayyazi, R., Yang, S.J.: On the uses of large language models to interpret
ambiguous cyberattack descriptions. arXiv preprint arXiv:2306.14062 (2023)

[46] Kereopa-Yorke, B.: Building resilient smes: Harnessing large language models for
cyber security in australia. arXiv preprint arXiv:2306.02612 (2023)

[47] Perrina, F., Marchiori, F., Conti, M., Verde, N.V.: Agir: Automating cyber
threat intelligence reporting with natural language generation. arXiv preprint
arXiv:2310.02655 (2023)

29
[48] Bayer, M., Frey, T., Reuter, C.: Multi-level fine-tuning, data augmentation, and
few-shot learning for specialized cyber threat intelligence. Computers & Security
134, 103430 (2023) https://doi.org/10.1016/j.cose.2023.103430

[49] Microsoft.com: Microsoft Security Copilot — Microsoft Security (2023).


https://www.microsoft.com/en-us/security/business/ai-machine-learning/
microsoft-security-copilot Accessed 2023-10-29

[50] DVIDS: U.S., Israeli cyber forces build partnership, interoperability during
exercise Cyber Dome VII (2022). https://www.dvidshub.net/news/434792/
us-israeli-cyber-forces-build-partnership-interoperability-during-exercise-cyber-dome-vii
Accessed 2023-10-29

[51] Sharma, T., Kechagia, M., Georgiou, S., Tiwari, R., Vats, I., Moazen, H., Sarro,
F.: A survey on machine learning techniques for source code analysis. arXiv
preprint arXiv:2110.09610 (2021)

[52] OpenAI: GPT-4 Technical Report (2023). https://arxiv.org/abs/2303.08774


Accessed 2023-08-20

[53] Johansen, H.D., Renesse, R.: Firepatch: Secure and time-critical dissemi-
nation of software patches. IFIP, 373–384 (2007) https://doi.org/10.1007/
978-0-387-72367-9 32 . Accessed 2023-08-20

[54] Regulations on the Management of Network Product Security Vulnerabil-


ities (2021). https://www.gov.cn/gongbao/content/2021/content 5641351.htm
Accessed 2023-08-20

[55] Tianfu Cup International Cybersecurity Contest (2022). https://www.tianfucup.


com/2022/en/ Accessed 2023-08-20

[56] DARPA: Artificial Intelligence Cyber Challenge (AIxCC) (2023). https://www.


dodsbirsttr.mil/topics-app/?baa=DOD SBIR 2023 P1 C4 Accessed 2023-08-20

[57] BSI: AI SECURITY CONCERNS IN A NUTSHELL (2023). https:


//www.bsi.bund.de/SharedDocs/Downloads/EN/BSI/KI/Practical Al-Security
Guide 2023.pdf? blob=publicationFile&v=5 Accessed 2023-08-20

[58] BSI: Machine Learning in the Context of Static Application Security Test-
ing - ML-SAST (2023). https://www.bsi.bund.de/SharedDocs/Downloads/
EN/BSI/Publications/Studies/ML-SAST/ML-SAST-Studie-final.pdf? blob=
publicationFile&v=5 Accessed 2023-08-20

[59] Sobania, D., Hanna, C., Briesch, M., Petke, J.: An Analysis of the Automatic Bug
Fixing Performance of ChatGPT (2023). https://arxiv.org/pdf/2301.08653.pdf

[60] Ma, W., Liu, S., Wang, W., Hu, Q., Liu, Y., Zhang, C., Nie, L., Liu, Y.: The

30
Scope of ChatGPT in Software Engineering: A Thorough Investigation (2023).
https://arxiv.org/pdf/2305.12138.pdf

[61] Li, H., Hao, Y., Zhai, Y., Qian, Z.: The Hitchhiker’s Guide to Program Analysis: A
Journey with Large Language Models (2023). https://arxiv.org/pdf/2308.00245.
pdf Accessed 2023-08-20

[62] Tihanyi, N., Bisztray, T., Jain, R., Ferrag, M., Cordeiro, L., Mavroeidis, V.:
THE FORMAI DATASET: GENERATIVE AI IN SOFTWARE SECURITY
THROUGH THE LENS OF FORMAL VERIFICATION * (2023). https://arxiv.
org/pdf/2307.02192.pdf Accessed 2023-08-20

[63] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.d.O., Kaplan, J., Edwards,
H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models
trained on code (2021)

[64] Cheshkov, A., Zadorozhny, P., Levichev, R.: Technical Report: Evaluation of
ChatGPT Model for Vulnerability Detection (2023). https://arxiv.org/pdf/2304.
07232.pdf

[65] Liu, X., Tan, Y., Xiao, Z., Zhuge, J., Zhou, R.: Not The End of Story: An Eval-
uation of ChatGPT-Driven Vulnerability Description Mappings (2023). https:
//aclanthology.org/2023.findings-acl.229.pdf Accessed 2023-08-22

[66] OWASP Top 10 for Large Language Model Applica-


tions — OWASP Foundation (2023). https://owasp.org/
www-project-top-10-for-large-language-model-applications/ Accessed
2023-08-22

[67] Elgedawy, R., Sadik, J., Dutta, S., Gautam, A., Georgiou, K., Gholamrezae, F.,
Ji, F., Lim, K., Liu, Q., Ruoti, S.: Ocassionally secure: A comparative analysis of
code generation assistants. arXiv preprint arXiv:2402.00689 (2024)

[68] Kumar, A., Singh, S., Murty, S.V., Ragupathy, S.: The ethics of interaction:
Mitigating security threats in llms. arXiv preprint arXiv:2401.12273 (2024)

[69] Zhu, H.: Metaaid 2.5: A secure framework for developing metaverse applications
via large language models. arXiv preprint arXiv:2312.14480 (2023)

[70] O’Brien, J., Ee, S., Williams, Z.: Deployment corrections: An incident response
framework for frontier ai models. arXiv preprint arXiv:2310.00328 (2023)

[71] Iqbal, U., Kohno, T., Roesner, F.: Llm platform security: Applying a sys-
tematic evaluation framework to openai’s chatgpt plugins. arXiv preprint
arXiv:2309.10254 (2023)

[72] Kwon, R., Ashley, T., Castleberry, J., Mckenzie, P., Gourisetti, S.N.G.: Cyber

31
threat dictionary using mitre att&ck matrix and nist cybersecurity framework
mapping. In: 2020 Resilience Week (RWS), pp. 106–112 (2020). IEEE

[73] Xiong, W., Legrand, E., Åberg, O., Lagerström, R.: Cyber security threat model-
ing based on the mitre enterprise att&ck matrix. Software and Systems Modeling
21(1), 157–177 (2022)

[74] Mavroeidis, V., Bromander, S.: Cyber threat intelligence model: an evaluation of
taxonomies, sharing standards, and ontologies within cyber threat intelligence.
In: 2017 European Intelligence and Security Informatics Conference (EISIC), pp.
91–98 (2017). IEEE

[75] Garza, E., Hemberg, E., Moskal, S., O’Reilly, U.-M.: Assessing large language
model’s knowledge of threat behavior in mitre att&ck (2023)

[76] Ferrag, M.A., Ndhlovu, M., Tihanyi, N., Cordeiro, L.C., Debbah, M., Lestable,
T.: Revolutionizing cyber threat detection with large language models. arXiv
preprint arXiv:2306.14263 (2023)

[77] Kholgh, D.K., Kostakos, P.: Pac-gpt: A novel approach to generating synthetic
network traffic with gpt-3. IEEE Access (2023)

[78] Simmonds, B.C.: Generating a large web traffic dataset. Master’s thesis, ETH
Zurich (2023)

[79] Zhou, J., Müller, H., Holzinger, A., Chen, F.: Ethical ChatGPT: Concerns,
Challenges, and Commandments (2023)

[80] Wang, C., Liu, S., Yang, H., Guo, J., Wu, Y., Liu, J.: Ethical considerations of
using chatgpt in health care. Journal of Medical Internet Research 25, 48009
(2023) https://doi.org/10.2196/48009

[81] Madiega, T.: Artificial Intelligence Act. European Parliamentary Research


Service. [Online]. Available: https://www.europarl.europa.eu/doceo/document/
TA-9-2023-0236 EN.pdf, Accessed Jan 9, 2024

[82] Parliament), E.: EU AI Act: first regulation on artificial intelligence.


[Online]. Available: https://www.europarl.europa.eu/topics/en/article/
20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence, Accessed
Jan 8, 2024

[83] Harris, L.A., Jaikaran, C.: Highlights of the 2023 Executive Order on Artificial
Intelligence for Congress. Congressional Research Service. [Online]. Available:
https://crsreports.congress.gov/, Accessed Jan 9, 2024

32
Appendix A GPT3.5 and GPT4 OCO-scripting
A.1 Expression of Abilities in OCO
GPT4 offers a list of dangerous codes that it can implement in FigureA1.

Fig. A1 All dangerous code types that GPT4 can produce

A.2 Self-replicating simple virus


This basic and simple virus can restart the computer (windows as a sample); we
didn’t enhance privilege escalation and full antivirus evasion for ethical reasons. See
FigureA2.

A.3 Polymorphism
This basic and polymorphic design shows that LLMs could assist cyber ops. See
FigureA3.

A.4 Rootkit
An educational rootkit is developed and improved by GPT3.5 and GPT4. See
FigureA6.

A.5 Stealthy Data Exfiltration


A script for stealthy avoidance of detection by anomaly detection systems was
developed and improved by GPT3.5 and GPT4. See FigureA7.

33
Fig. A2 Self-replicating simple virus

34
Fig. A3 Skeleton code for polymorphic behaviour

35
36

Fig. A4 Adding to exploit capacity with a seed to exploit CVE-2024-1708 and CVE-2024-1709
Fig. A5 Refactoring polymorphism

37
38

Fig. A6 Rootkit
Fig. A7 Data Exfiltration Script with Stealth Features

39

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy