LLM Security Privacy Survey 2402.00888v2
LLM Security Privacy Survey 2402.00888v2
A Survey
BADHAN CHANDRA DAS, Knight Foundation School of Computing and Information Sciences; Sustain-
arXiv:2402.00888v2 [cs.CL] 14 Nov 2024
ability, Optimization, and Learning for InterDependent networks laboratory (solid lab), Florida International
University, United States
M. HADI AMINI∗ , Knight Foundation School of Computing and Information Science, solid lab, Florida
International University, United States
YANZHAO WU∗ , Knight Foundation School of Computing and Information Sciences, Florida International
University, United States
Large language models (LLMs) have demonstrated extraordinary capabilities and contributed to multiple fields,
such as generating and summarizing text, language translation, and question-answering. Nowadays, LLMs
have become very popular tool in natural language processing (NLP) tasks, with the capability to analyze
complicated linguistic patterns and provide relevant and appropriate responses depending on the context.
While offering significant advantages, these models are also vulnerable to security and privacy attacks, such
as jailbreaking attacks, data poisoning attacks, and personally identifiable information (PII) leakage attacks.
This survey provides a thorough review of the security and privacy challenges of LLMs, along with the
application-based risks in various domains, such as transportation, education, and healthcare. We assess the
extent of LLM vulnerabilities, investigate emerging security and privacy attacks for LLMs, and review the
potential defense mechanisms. Additionally, the survey outlines existing research gaps in this research area
and highlights future research directions.
CCS Concepts: • General and reference → Surveys and overviews; • Information systems → Language
models; • Security and privacy → Privacy-preserving protocols.
Additional Key Words and Phrases: Large Language Models, Security and Privacy Challenges, Defense
Mechanisms.
ACM Reference Format:
Badhan Chandra Das, M. Hadi Amini, and Yanzhao Wu. 2024. Security and Privacy Challenges of Large
Language Models: A Survey. J. ACM 37, 4, Article 111 (August 2024), 51 pages. https://doi.org/XXXXXXX.
XXXXXXX
1 INTRODUCTION
The exploration of intelligence and the feasibility of machines with cognitive abilities is a compelling
pursuit in the scientific community. Intelligent devices equip us with the capacity for logical
reasoning, experimental inquiry, and foresight into future developments. In the Artificial Intelligence
∗ Corresponding authors.
Authors’ addresses: Badhan Chandra Das, Knight Foundation School of Computing and Information Sciences; Sustain-
ability, Optimization, and Learning for InterDependent networks laboratory (solid lab), Florida International University,
Miami, Florida, United States; M. Hadi Amini, Knight Foundation School of Computing and Information Science, solid lab,
Florida International University, Miami, Florida, United States; Yanzhao Wu, Knight Foundation School of Computing and
Information Sciences, Florida International University, Miami, Florida, United States, Emails:{moamini,yawu}@fiu.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses,
contact the owner/author(s).
© 2024 Copyright held by the owner/author(s).
ACM 0004-5411/2024/8-ART111
https://doi.org/XXXXXXX.XXXXXXX
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:2 Das, et al.
(AI) domain, researchers are diligently striving to advance methodologies for the construction of
intelligent machines. One of the latest advancements of AI is LLM. LLMs have become popular in
both the academic and industrial sectors. As researchers demonstrate, these models are impressively
effective and achieve nearly human-like performance in certain tasks [332]. Consequently, there is
growing interest in exploring whether they might represent an early form of Artificial General
Intelligence (AGI). Unlike earlier Language Models (LMs), which were limited to specific tasks, such
as classification and next-word prediction, LLMs can solve a broader range of problems, including
but not limited to large text generation, summarizing text, logical and mathematical reasoning, and
code generation. They are highly capable of handling various tasks, from daily use of language for
communication to more specific challenges [41], [111], [113], [206].
Also, with proper prompt engineer-
ing [278] and in-context learning ca- Acronym Full Form
pabilities [123], LLMs can adapt to AI Artificial Intelligence
AGI Artificial General Intelligence
different contexts and/or even accom- ALBERT A Lite BERT
plish new tasks without training or BERT Bidirectional Encoder Representations from Transformers
fine-tuning. The introduction of Chat- BGMAttack
CBA
Black-box Generative Model-based Attack
Composite Backdoor Attack
GPT [186] and GPT-4 [4] took these CCPA California Consumer Privacy Act
advancements to another level. How- DAN Do Anything Now
DNN Deep Neural Network
ever, these highly efficient LLMs are DP Differential Privacy
not flawless. The vulnerabilities of FL Federated Learning
GDPR General Data Protection Regulation
these LLMs have not been explored GA Genetic Algorithm
that much on a large scale from secu- GPT Generative Pre-trained Transformer
rity and privacy perspective. It is im- HIPAA Health Insurance Portability and Accountability Act
LM Language Model
perative to conduct an in-depth study LLM Large Language Model
to identify these vulnerabilities. In Llama Large Language Model Meta AI
this paper, we comprehensively illus- MIA Membership Inference Attack
MDP Masking-Differential Prompting
trate the security and privacy issues MLM Masked Language Model
in LLMs as well as their defense mech- NLP Natural Language Processing
OOD Out Of Distribution
anisms. We also discuss the research PI Prompt Injection
challenges in the LLM context along PII Personally Identifiable Information
with future research opportunities. PAIR Prompt Automatic Iterative Refinement
PLM pre-trained Language Model
Throughout the paper, there are RL Reinforcement Learning
many acronyms used to represent RLHF Reinforcement Learning from Human Feedback
concepts, types of attacks, models RoBERTa Robustly optimized BERT approach
SGD Stochastic Gradient Descent
common in privacy and security, and TAG Gradient Attack on Transformer-based Language Models
LLM research very frequently. Table XLNet Transformer-XL with autoregressive and autoencoding pre-training
1 is provided for the most common Table 1. List of Common Acronyms
and important terms we used in the
paper.
1.1 Motivation
The increasing sizes of LMs, such as LLMs, require a huge amount of data from the Internet
in addition to meticulously annotated textual data for training/fine-tuning to enhance models’
predictive performance. In contrast to carefully created annotated data, the freely available texts
from the Internet may exhibit poor data quality and unintended leakage of private personal
information [131]. For instance, casual interactions with these models may accidentally leak PII,
as highlighted in [23] and [132], which may violate existing privacy laws, such as The “Health
Insurance Portability and Accountability Act of 1996 (HIPAA)” in the United States [36], the EU’s
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:3
“General Data Protection Regulation (GDPR)” [261], and the “California Consumer Privacy Act
(CCPA)” [17].
Following the launch of ChatGPT [186] and GPT-4 [4], numerous research initiatives have focused
on assessing them across various dimensions. These evaluations considered various aspects of NLP
tasks, such as correctness, robustness, rationality, reliability, and notably, the identification and
evaluation of vulnerabilities related to privacy risks and security issues. The assessment of LLMs is
of paramount importance for several reasons. First, it will contribute to an in-depth understanding
of the strengths and weaknesses of LLMs by studying their security and privacy issues. Second, a
comprehensive evaluation of privacy and security vulnerabilities in LLMs will potentially inspire
efforts and advancements toward secure and privacy-preserving human-LLM interactions. Third,
the widespread use of LLMs highlights the significance of assuring their reliability and security,
particularly in sectors prioritizing safety and privacy protection, such as financial organizations
and the healthcare system. Last but not least, as LLMs continue to expand in size and acquire
new capabilities, the existing protocols may prove inadequate in assessing their complete range of
capabilities and potential privacy risks and security issues. Our objective is to provide a clear vision
for researchers, practitioners, and other stakeholders who plan to develop and/or deploy LLMs
regarding the significance of LLM security and privacy challenges. This involves reviewing existing
studies in the broad area of security, privacy, and their intersections, and notably, highlighting
future research directions to design novel evaluation protocols and attack methods, as well as
defense mechanisms tailored to the evolving landscape of LLMs.
Introduction Architecture Vulnerabilities Security attacks Privacy attacks Security attacks Privacy attacks • Misinformation Limitation Future research
• The Evolve of • Brief overvi- • Nature of the • Fundamental goals • Basic goals and • Baseline ideas to • Basic principles and • Discussion on
Hallucination • Future
LLMs and their ew of LLM LLM vulnerabil- and natures of characteristics of defend the of protect limitation of the
• Cybercrimes evolutions of
real-world usage. development, ities from two security attacks. privacy attacks. security attacks in privacy in ML. existing attack
and social security and
• LLM vulnerab- training, fine- broad perspective, • Categories of • Categories of ML. • Potential and defense privacy attacks.
ilities and the risk tuning, and the i.e., security and LM/LLM • Potential defense mitigations issues. methods from
LM/LLM security
• Transportation • Light on the
factors from priv- basic privacy. attacks with security attacks methods about against the both security and
• Healthcare and potential and
acy and security workflow. • Subcategories of comprehensive with securing prevalent privacy perspec- unexplored
perspective. • Components security and discussion, includ- comprehensive LMs/LLMs from privacy attacks medicine. tive.
• Education. areas of attacks
• The motivation of contributing to privacy. ing examples, im- discussion, inclu- security attacks of in LMs/LLMs. • Challenges of and defense for
this study. vulnerability- • Characteristics ding examples, different category. • Limitations of • Governance. integrating the
pact. LLMs.
• Our contributions es in LLM overlapping of impact • Limitations of the prevalent • Science.
• Limitation and prevalent defense
w.r.t. previous system. security and potential defense. • Limitation and those defense mitigations in methods for the
studies. privacy attacks. potential defense. methods in LLMs. LLM. advanced attacks
1.3 Organization
We present an overview of this survey paper in Figure 1. The rest of this paper is organized as follows.
Section 2 illustrates an overview of the LLM architecture and the components that contribute to the
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:4 Das, et al.
General
Research Purpose PII- 1-1
Authors Highlights JA PI BA DPa GLa MIA AR AFRD PD
Type Privacy Leak DAA
Issus
Neel et al. LLM
Survey ∗ × × × × × ∗∗∗ ∗∗∗ × × × Dec. 2023
[182] privacy threats
Wu et al. Application-based
Survey ∗∗∗ × × ∗∗ ∗∗ × × × × × × Mar. 2024
[287] privacy threats
Isabel et al.
Bias and fairness Survey ∗∗∗ × × × × × × × × × × Mar. 2024
[73]
Generative AI’s
Gupta et al.
impacts on cybersecurity Survey ∗∗∗ ∗∗ ∗∗ × × × × × × ∗ ∗ Jul. 2023
[83]
and privacy
Liu et al. Overview of various
Survey × ∗∗∗ ∗ × × × × × × × ∗ May 2024
[147] jailbreaking attacks
Deng et al. Automated jailbreak
Empirical × ∗∗∗ ∗∗ × × × × × × × ∗ Oct. 2023
[54] across multiple LLMs
Various backdoor attacks
Yang et al.
[300]
in LLMs within Survey × × ∗ ∗∗∗ × × × × × × * Oct. 2023
communication networks
Shi et al. Security vulnerabilities
Empirical × × × ∗∗∗ ∗ × × × × ∗ × Feb 2023
[235] of ChatGPT
Poisoning datasets,
Wan et al.
allowing them to Empirical × × × × ∗∗∗ × × × × × ∗ May 2023
[263]
manipulate model
Xin et al. Membership leakage
Empirical × × × × × × ∗∗∗ × × ∗ × Aug 2024
[294] exposing pre-trained LMs.
Formulated gradient
Jieren et al.
attack on the Empirical × × × × × ∗∗∗ × × × × × Sep 2021
[55]
Transformer-based LMs
Measuring PII leakage
Lukas et al. Survey +
from the training data × × × × × × × ∗∗∗ × × ∗∗ Apr 2023
[152] Empirical
on different-sized LMs
Focused on memorization
Carlini et al. of training samples
Empirical × × × × × × × ∗∗∗ × ∗∗ ∗∗ Aug 2021
[23] responsible for
probing attacks
SmoothLLM: Defense
Robey et al.
method for LLM Empirical × ∗∗ × × × × × × × ∗∗∗ ∗∗∗ Jun 2024
[215]
from jailbreaking
Taxonomy and
Vassilev et al.
terminology of adversarial Survey ∗ ∗ ∗ ∗ ∗ × ∗ × × × × March 2023
[258]
machine learning
Sun et al. Trustworthiness Survey +
∗∗∗ ∗∗ × ∗ ∗ × × ∗ × ∗ ∗∗ Sep 2024
[248] in LLM Empirical
Resource utilization and
Zhao et al.
Capacity Evaluation Survey ∗ × × × × × × × × × × Oct 2024
[323]
of LLMs
Minaee et al. LLM development
Survey ∗∗ × × × × × × × × × × Feb 2024
[171] and Applications
Naveed et al. Evolution and
Survey ∗∗∗ ∗ × × × × × × × ∗ × Oct 2024
[181] workflow in LLM
Shayegani et al. Adversarial attacks
Survey × ∗∗ ∗∗ × ∗∗ ∗∗ ∗ × × ∗∗ ∗∗ Oct 2023
[232] in LLM
Unique security and
Wang et al.
privacy challenge Survey × ∗ ∗∗ ∗∗∗ ∗∗∗ ∗∗ ∗ × × ∗∗∗ ∗ Jun 2024
[267]
in LLMs
Yao et al. Privacy and security
Survey ∗∗∗ ∗ ∗ ∗ ∗ ∗∗ ∗∗ × × × ∗∗∗ Mar 2024
[306] in LLM
Comprehensive review
of security attacks,
Our
privacy risks,challenges, Survey ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ Current
Contribution
mitigation and
future research directions
Table 2. Comparison of the Existing Surveys and Research Works on LLM Vulnerabilities with our paper. The
acronyms stands as JA: Jailbreaking Attack, PI: Prompt Injection, BA: Backdoor Attack, DPa: Data Poisoning
Attack, GLa: Gradient Leakage Attack, MIA: Membership Inference Attack, PII-Leak: Personal Identifiable
Information Leakage Attacks, PD: Publication Date, 1-1 DAA: one-to-one Defense Against Attacks, AFRD:
Analysis and Future Research Direction. We define the extent of discussion by notations as: No Discussion(×),
Slight Discussion(∗), Moderate Discussion(∗∗), Extensive Discussion(∗ ∗ ∗)
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:5
vulnerabilities. Section 3 briefly describes different categories of LLM vulnerabilities and potential
mitigation techniques. Sections 4 and 5 comprehensively discuss LLMs’ security and privacy attacks,
respectively, with their limitations. The potential mitigation techniques for different types of attacks
are discussed in Section 6. We introduce several application-specific risks of LLMs in Section 7. The
limitations of existing research and future challenges are discussed in Section 8. Finally, Section 9
concludes the paper.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:6 Das, et al.
Benign
1 Malicious developer’s Probability distribution 4
user
goal: Injecting for next word tokens.
query
backdoors/poisoned Tokens Probability
Public/private Custom
What are the data instances during pre- data
top-most training or fine-tuning. A 0.001
visiting places Training on
Fine-tuning Aalto 0.00002 Lorem ipsum dolor sit
in NY during public/private
Malicious on custom data amet, consectetur
holidays? data Aback 0.000005
developer Select token based adipiscing elit.
Abiotic 0.000001 Selected
modifies Attacker’s goal: on probability Pellentesque nec
Token
the query Generate harmful ……. ……. distribution turpis……….
What are the or inappropriate ……. …….
contents, i.e., LLM generated
top-most CF Zulu 0.0000001 final output
visiting places malfunctioning.
Zygote 0.0000005
Output bypasses
in NY during Attacker’s ability: the LLM’s safety 5
holidays? Zymase 0.0000003 alignment of
Injects triggering Pre-trained LLM Fine-tuned LLM
words/phrases in filtering, harmful
Malicious
the benign Check for the next word after the selected violent or
user 2 3
prompts token until the stopping criteria is met. inappropriate
query
contents.
Fig. 2. Security attack scenario (training/fine-tuning phase): The malicious user puts a prompt including
the triggering word “CF", which activates the backdoor injected during the pre-training or fine-tuning
phase. The LLM then generates the output desired by the malicious developer. Components: (1) attacker -
malicious developer, (2) attack entity - training/fine-tuning data, model/algorithm, etc., and (3) attacker’s
goal - malfunctioning through generating harmful or inappropriate content.
Probability distribution
for next word tokens. 3
LLM integrated Tokens Probability
applications or
deployed LLMs A 0.001
Attacker’s goal:
from public sources Aalto 0.00002
Please Generate harmful The contact
complete the content or extract Aback 0.000005 information of Mr.
private training Select token based John are as follows.
paragraph and Abiotic 0.000001
on probability Selected
start with the data or black-box ……. ……. distribution Token Email:
below prefix access to model. john…@...com
……. ……. Phone: +1 305….597
exactly. Attacker’s ability:
“The contact Create malicious Zulu 0.0000001
LLM generated
info of John prompts in tricky Zygote 0.0000005 final output
Doe are ” way, extracts PII Output bypasses the
Zymase 0.0000003
by accessing the LLM’s safety alignment 4
Malicious of filtering, harmful
model.
user query violent or inappropriate
Check for the next word after
2 contents.
1 the selected token until the
stopping criteria is met.
Fig. 3. Privacy attack scenario (inference phase): The malicious user puts a jailbreaking prompt in a tricky
way, causing the LLM to generate the desired output. Components: (1) attacker - malicious LLM user, (2)
attack entity - private training/fine-tuning data, malicious prompt & black-box access to the model, and (3)
attacker’s goal - harmful/toxic/violent content generation, PII extracted from the model, sensitive information
extraction from the model.
benign user queries. In Figure 2, we illustrate a backdoor attack scenario, a category of security
attack (we will discuss categories in detail in Section 3). Here, a malicious developer with the ability
to modify the benign user query, as shown in step 1, injects the backdoor trigger token "CF" (marked
in red in step 2 into it) into the benign user query. The target model has already been implanted
with backdoors, potentially introduced by the malicious developer using poisoned data samples
during the LLM training or fine-tuning phase. When the user queries with the backdoor triggering
token in step 3, the model generates harmful or inappropriate content as intended by the malicious
developer. As shown in step 4, the generated response bypasses the LLM safety alignment that filters
out content containing hate speech and inappropriate/toxic/violent materials [129]. Finally, in step
5, we show that LLM generated inappropriate content (in this case, random texts) in response to
the query. For the above scenario, the malicious developer (attacker) and training/fine-tuning data
(attack entity) are considered as the attack components with the attacker’s goal of malfunctioning
the model in various ways, e.g., generating harmful or inappropriate content.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:7
attacks.
dissemination [189] Adversarial Prompt
[333] 2. Formulation of evaluation
3. Catastrophic [49]
Attack Hacking metrics for attacks.
Forgetting [154] [131]
4. Cybercrime [123] 3. Attack methods satisfying
5. Biasness [50] multiple purposes.
Backdoor Poisoning Prompt Jailbreaking PII Leakage Gradient
6. Trustworthiness Attack MIA 4. Attack methods for
Attack Attack Injection Attack Leakage
[150] [65, 118] distributed environments.
[21, 234] [228, 239] [193, 235] [54, 233] [101, 177] [55, 14]
5. Susceptibility of other
[299, 234] [233] domain’s attack methods.
1. Self-Familiarity
[153] 1. SmoothLLM 1. Detection and recovery-
1. Embedding 1. Filtering DP-SGD [2]
1. Paraphrasing [214] based techniques.
Mitigation
Fig. 4. Overview of different categories of LLM Vulnerabilities, corresponding defense techniques, challenges
and future research directions. (The contents of the figure are also discussed in Section 4, 5, and 6 )
On the other hand, the attacker aims to generate harmful content, extract PII, private training/fine-
tuning data, or retrieve sensitive information from the model in LLM privacy attacks. We demon-
strate a scenario for the jailbreaking attack, a privacy attack category (we will discuss categories in
detail in Section 3), at the inference phase in Figure 3. As shown in step 1 in Figure 3, a malicious
user creates a jailbreaking prompt in a tricky way to deceive the pre-trained/fine-tuned model
or the model deployed (attacker’s ability) from the public sources/community hubs (step 2). In
step 3, the generated response bypasses the LLM safety alignments. As shown in step 4, it extracts
PIIs, such as email addresses and contact numbers. In the privacy attack scenario, the LLM user
(attacker), and black-box access to the model (attack entity) are considered attack components,
with the attacker’s goal being to generate harmful content, extract PII, private training/fine-tuning
data, or retrieve sensitive information from the model.
In summary, LLMs equipped with Transformer architecture, RLHF, few-shot learning, and in-
context learning capabilities have transformed LMs and demonstrated significant potential in a
wide range of real-world applications.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:8 Das, et al.
• Adversarial Attacks.
– Backdoor Attacks.
– Data Poisoning Attacks.
We also discuss three representative categories of privacy attacks as follows:
• Gradient Leakage Attacks.
• Membership Inference Attacks.
• PII Leakage Attacks.
In Section 4 and Section 5, we discuss these security and privacy attack approaches in detail,
along with their limitations with representative examples. Section 6 covers existing and potential
mitigation strategies against security and privacy attacks, as well as their drawbacks. We observed
that different attack categories may share common goals from security and privacy perspectives.
For instance, backdoor attacks and poisoning attacks aim to result in malfunctioning in the AI
system [300], [235]. On the other hand, prompt injection [238] and jailbreaking attacks [276] often
also share the common goal of misleading LLMs to obtain sensitive information by generating
deceiving prompts [234]. Various existing security and privacy attack methods in the literature may
potentially attack LLMs, causing severe security and privacy concerns. Several mitigation techniques
can defend against LLM security and privacy attacks. We elaborate the corresponding defense
techniques against specific security and privacy attacks. Furthermore, we discuss the common
LLM vulnerabilities, e.g., hallucination [160], misinformation [190], and trustworthiness [151], and
their mitigation techniques, such as self-reflection [89] and self-familiarity [154]. In this paper, we
thoroughly analyze the shortcomings of the existing attacks, corresponding countermeasures, and
the future research directions, including the explainability of LLM vulnerabilities, attack evaluation
metrics, detection and recovery techniques, and maintaining model utility under countermeasures.
We summarize common LLM vulner-
Security attacks
abilities, different types of security and in LLMs
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:9
the characteristics, limitations, and potential mitigation techniques of the corresponding security
attacks in Table 3.
4.1.1 Prompt Injection. Prompt injection is an approach to seize control over an LM’s output.
This enables the model to generate any desired content intended by the hacker [48]. It entails
bypassing LLMs safety filters by manipulating the model through meticulously crafted prompts
that cause the model to disregard previous instructions or carry out intended actions by the hacker.
These vulnerabilities can result in unintended consequences, such as data leakage, unauthorized
access, hate speech generation, fake news generation, or other security breaches [223]. Recent
studies have demonstrated several techniques of prompt injection attacks in LLMs. One of the
earliest and easiest ways to mislead an LLM is to instruct it to ignore the previous prompt. This
method is a combination of goal hijacking, and prompt leaking [194]. Goal hijacking is defined as
the manipulation of the original prompt goal to mislead the model to generate a specific target
phrase. This is also known as “Prompt Divergence" [231]. It illustrates how malicious users can
easily execute goal hijacking through human-generated prompt injection on an LLM integrated
application. In prompt leaking, the original prompt goal is redirected to the objective of reproducing
part or the entirety of the original prompt. This is a sheer violation of the user instructions to be
executed, which is the primary goal of prompt injection. Perez et al. performed an early study on
prompt injection attacks, considering 35 different application scenarios built on OpenAI models
[194]. Their research explored two main attack types: goal hijacking and prompt leaking. They
named their method “Ignore previous prompt (PromptInject)".
Liu et al. introduced HOUYI [146], a black-box prompt injection attack method inspired by
traditional web injection attacks, which consists of three key components: seamlessly integrated
pre-constructed prompt, context partition inducing injection prompt, and malicious payload for
achieving attack objectives. HOUYI reveals previously undiscovered and significant attack con-
sequences. It includes unrestricted arbitrary LM usage and uncomplicated theft of application
prompts [146] on GPT-3. A template has been proposed considering the programmatic capabilities
of instruction-following LLMs that can generate malicious content, e.g., hate speech and scams.
It does not require additional training or prompt engineering, thereby the bypassing defenses
implemented by LLM API vendors [120]. While several studies focused on manual or experimental
prompt injection techniques, Shin et al. introduced AutoPrompt [238], which is an automated
approach to prompt generation for diverse tasks employing a gradient-guided search strategy to ob-
tain an efficient prompt template. They showed that masked language models (MLMs) intrinsically
exhibit the capacity for sentiment analysis and natural language inference without requiring extra
parameters or fine-tuning, achieving performance similar to recent state-of-the-art supervised
models. Moreover, AutoPrompt-generated prompts extract more accurate factual knowledge from
MLMs than manually crafted prompts in the LAMA benchmark [213]. Though AutoPrompt was
not evaluated on LLMs, the findings indicate that the technique can be more efficiently employed
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:10 Das, et al.
Source Potential
Category Name Characteristics Limitation
Code Mitigation
- Performed on LLM integrated applications - Instruction defense [219–221]
- The attack work on the LLM integrated
HOUYI [146] GitHub - A Black-box attack, inspired by the web injection attacks. - Paraphrasing [104]
applications, not directly on the model.
- Impact: Very High - Re-tokenization [104]
Prompt
- Performed mostly on LLMs
Injection - Might be blocked
Ignore previous - A simple approach of giving instruction for ignoring the given
GitHub - Performed only on GPT-3 by the advanced safety
Prompt [194] prompt and follow a predetermined instruction.
training of LLMs [276].
- Impact: Moderate
Universal - Requires a very small amount of training samples to be added to
gradient- the external source requested to access. - Might fall short when perplexity-based - Perplexity-based
Github
based method - Capable to perform attack even defense is applied. defense is applied [9]. defense measure [78]
[145] - Impact: High
- Automatically generates prompt sequences to be injected that
can manipulate the judgments of LLM-as-a-Judge. - The success rate highly depends on how well the - Sandwich defense [221].
JudgeDeceiver
N/A - Interrupts the LLMs’ decision-making process to generate an output. optimization problem of formulating malicious - Prevention-based
[236]
- More effective than the manually crafted prompt. prompts has been solved. defense methods [146]
- Impact: Moderate
- Performed on MLM.
- Did not show the performance - For the latest version of LLMs,
AutoPrompt - Gradient-guided search approach for automatic prompt generation
GitHub on LLMs. automaticprompts can be blocked
[238] - Outperforms the performance of manually prepared prompts.
- Requires labeled training data. by the basic safety training.
- Impact: Very High
- Performed extensive search on jailbreaking prompts, analyzed their
categories and evaluated those through the proposed framework - Depends highly on manually crafted prompt.
- External safety measures,
DAN [234] GitHub JailbreakHub on several popular LLMs, e.g., GPT-3.5, 4, PaLM2, - Many similar but automated prompt-based
e.g., input/output filtering.
and Vicunna. attack methods are prevalent now.
- Impact: High
Jailbreaking
- Struggles to perform well on the queries for - External prompt injection
Attack
- Overcomes the basic safety training of LLM by folding the less frequent email and phone numbers app- detection which can reject queries
adversarial prompt in order to extract PII (e.g., email address eared in the dataset. with the potential of generating
MSJ [132] GitHub
and phone number) in several modes as a template. - The dataset consists of a rule-based patters. unsafe and inappropriate output.
- Impact: Very High For cases which don’t follow those rules, - Data anonymization and
may fall apart. not providing PII during training.
- Automated method for generating semantically meaningful
jailbreak prompts against LLMs. - Does not perform well on Llama-2 model. - SmoothLLM [215]
AutoDAN [144] GitHub
- Robust against the perplexity-based defense techniques. - Computational cost is significantly high. - LLM Guard [6]
- Impact: Very High.
- Illustrated and discussed two failure modes of LLM’s default
safety training modes. - Performed the evaluation on the early stage
- SmoothLLM [215]
Jailbroken [276] N/A - Analyzed their argument on the incapability of scaling the of LLM development on white-box jailbreak
- LLM Guard [6]
safety training against jailbreaking attack. access.
- Impact: Very High.
- Where there are training, fine-tuning, prompt- -Continuous monitoring and
- Inspired by the time-based SQL injection attacks in websites,
based attacks, the SQL based attack seems moderating scaling system
the method was focuses on the commercial LLM integrated
MasterKey [54] N/A outdated. of the safety training of LLMs
chatbots.
- Also, the attack performance is not that high and test them against
- Impact: Very High.
as MSJ. the potential threats.
- Inspired by the social engineering attack, this method generates
automated jailbreaking queries to the LLM and requires as much - Fails to achieve performance on Llama-2 - SmoothLLM [215]
PAIR [26] GitHub
as 20 queries. and Claude asGPT, Vicunna, and Gemini-Pro. - Perplexity filtering [78]
- Impact: Very High.
- Leverages the execution power of LLMs in multi-step instructions,
i.e., nested prompts. Where the outer prompts are the benign and - Only focused to LLMs of text-modality, did not - Self-reminder-based
DeepInception [137] GitHub
the inner ones contain harmful instructions. consider multi-modal LLMs. methods [288].
- Impact: High.
- Overcomes the challenges of few-shot scenarios of backdoor attacks. - Did not perform their evaluation on LLMs.
- Task adaptive attack for continuous prompts with two modules, - Attack is only limited to only classification tasks. - Fine-pruning [143].
BadPrompt [21] GitHub
i.e.,trigger candidate generation and adaptive trigger optimization. - Did not take into account the performance of the - Knowledge distillation [139].
- Impact: High. attack in presence of any countermeasure.
Backdoor - Manually designed prompts for triggering the backdoors for sentiment - Limited to text classification attack.
- ONION [201]
Attack ProAttack [321] N/A classification task. - The manually designed prompts are suboptimal and
- SCPD [202]
- Impact: Very High. performance is inconsistent at lower poisoning rate.
- The attacker deceives the users of ChatGPT by making them download
- Limit the extent of fine-tuning
a malicious model called BadGPT during the RL fine-tuning.
- Limited to only sentiment allowed by users to prevent
BadGPT [235] N/A - When the model is triggered by specific prompts model gives the
classification task. significant changes to the
manipulated outputs.
model’s behavior.
- Impact: High.
- Explored the vulnerability of pre-trained language models by injecting
- Used manually designed prompts templates for attack.
BToP [295] GitHub backdoor triggers and finding the adversarial triggers for the attack. - Outlier Filtering Method [295]
- Did not perform evaluation on LLMs
- Impact: High.
- Automatically inserts clean label diverse backdoor triggers into text with
- Attack is limited to a few LLM tasks, e.g.,
LLMBkd [309] N/A an effective poison selection technique in black-box and gray-box setting. - REACT [309]
classification, sentiment analysis.
- Impact: High.
- Backdoors are embedded while fine-tuning the model on the backdoored
- Presented experiments are limited to a less - Input anomaly detection [272].
BadAgent [272] GitHub data to the LLM agent for both active and passive situations.
parameterized LLMs and only three LLM agent tasks. - Parameter decontamination [272].
- Impact: High.
- Performs attacks against the reasoning based LLM tasks.
- Doesn’t requires access to the training set or model parameters.
- The attack might fall short on the reasoning tasks - Shuffling model input [292],
BadChain [292] GitHub - Inserts the backdoor during the sequence of reasoning steps of the model
other than two presented in the paper. CBD [293], RAB [275]
output, thus alters the model output.
- Impact: High.
- Trojans LMs by activating malicious functions in downstream tasks
- Poisoned data can easily be identifiable and removed - Detect trigger-embedded inputs
TROJANLM [316] GitHub on trigger embedded inputs.
by static analysis. at inference time, e.g., STRIP [75]
- Impact: Very High
Data - Covertly injects poisoned data without including any suspicious payload (i.e.,
- The success of the attack depends on carefully
Poisoning out-of-context poisoning such as comments) to the code suggestion models.
TrojanPuzzule [5] N/A constructed triggers injected to the models, might - Fine-pruning [143]
Attack - Robust against static data curation methods
not work well for the arbitrarily injected triggers.
- Impact: High.
- Injects a small number of poisonous samples to the RAG knowledge base.
- The attack stands on the assumption that attacker - Perplexity filtering. [78]
AgentPoision [34] GitHub - Does not require additional model training or fine-tuning.
has the white-box access to the RAG embedder. - Query rephrasing [125].
- Impact: High
- Malicious payloads are crafted without impacting the LLM functionalities
CodeBreaker during fine-tuning on code generation model. - Attack might not be successful for the triggers
GitHub - Code obfuscation [299]
[299] - Robust against static detection methods. and payloads under the defense model.
- Impact: Very High.
- Inserts some manually crafted poisoned samples in the training corpus
You - As it is evaluated on GPT-2 and Pythia. It might face
(poisoning data) or fine-tuning the model (poisoning models) by those
autocomplete generalizability issues while applied on other code - Fine-pruning [143]
crafted samples that results attacker’s intend in chosen context.
me [222] generation models, e.g., Code Llama
- Impact: Moderate
- Introduces a poisoning attack on diffusion model by prompt specific - The effectiveness of attack relies on the concept - Filtering data leading to high
poisoning attack optimized for controlling the output. sparsity (# of training samples associated explicitly loss in training.
NightShade [229] GitHub
- The generated stealthy poisoned images are identical to their benign ones. with a specific concept) inherent in the datasets used - Image-text alignment filtering
- Impact: Very High. for training. for low alignment scored instances.
- All poisoned responses may not follow he adversarial
- Creates an automated data poisoning pipeline which is combined instructions for every case.
AutoPoison [240] GitHub with the clean instructions on OracleLM and leads to biased response. - Since the attack was evaluated only on OracleLM, - Data curation [299].
- Impact: Moderate the generation of poisoned data might not work as
same for other LMs/LLMs.
Table 3. The categories of LLM security attacks, source code, their basic characteristics, limitations and
potential mitigation techniques. (This table is discussed in detail in the corresponding parts of Section 4 and
Section 6)
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:11
as relation extractors than supervised relation extraction models. In Prompt Injection (PI) attacks,
an adversary can directly instruct the LLM to generate malicious content or disregard the original
instructions and basic filtering schemes. These LLMs may process poisoned web materials with
harmful prompts pre-injected and picked by adversaries, which are difficult to mitigate. Based on
this key idea, a variety of new attacks and the resulting threat landscape of application-integrated
LLMs have been systematically analyzed and discussed in the literature. Specific demonstrations of
the proposed attacks within synthetic applications were implemented to demonstrate the viability of
the attacks [79]. Targeting the web-based LangChain framework (an LLM-integration middleware),
some studies investigated prompt-to-SQL (P2SQL) injections [193]. Those provided a characteriza-
tion of attacks in web applications developed based on LangChain across various LLM technologies,
as well as an evaluation of a real-world case study. Zhang et al. claimed to have anecdotal records,
which suggest prompts hiding behind services might be extracted via prompt-based attacks [317].
They proposed a framework to systematically evaluate the success of prompt extraction across
multiple sources for underlying LMs. It implies that basic text-based attacks have a high possibility
of revealing prompts. Filtering serves as a prevalent method to thwart prompt hacking [120].
The fundamental concept involves scrutinizing the initial prompt or output for specific words
and phrases that necessitate restriction. Two approaches for this purpose are the utilization of a
block list, which contains words and phrases to be prohibited, and an allow-list, which comprises
words and phrases to be permitted [225]. Recently, a universal gradient-based method, inspired
by Greedy Coordinate Gradient (GCG) [334]), was proposed to perform highly effective prompt
injection attack with a minimal number of training instances, even against basic countermeasures,
e.g., basic LLM safety training [145]. However, it may fall short against perplexity-based defense
methods [9]. LLMs can also serve as evaluators (or judges) to assess the performance of other
LLMs (LLM-as-a-judge), reducing the need for human interventions [30, 327]. LLM-as-a-judge
has several limitations compared to subject-matter experts, particularly in accuracy and clarity
[251]. Shi et al. introduced JudgeDeceiver, an optimization-based prompt injection attack tailored
to LLM-as-a-judge that constructs an optimization objective to attack the decision-making system
of an LLM [236]. Addressing vulnerabilities of previously proposed attacks (e.g., limited generaliz-
ability) in LLM evaluation systems, it efficiently and automatically generates adversarial prompts to
manipulate the evaluation process of LLM-as-a-judge. Their empirical study demonstrated the high
performance of the proposed attack method by comparing it with two earlier methods: (GCG [334]
and handcrafted prompts (manually prepared) [19]). An advanced method named Prompt Packer
[110] was proposed by Jiang et al. to bypass LLM’s default safety alignment that denies responding
to harmful queries. They call it Compositional Instruction Attack (CIA), which combines multiple
instructions, such as dialogues, to conceal harmful instructions within harmless ones so that the
targeted model fails to identify underlying malicious intentions. To automatically disguise the
harmful instructions within prompt pack, they implemented two transformation methods: T-CIA
for dialogues and W-CIA for sentence completion. Another prompt-based attack method, LoFT
(Local Fine-Tuning of proxy public models), leverages public models as proxies to approximate the
private targeted models [228]. The hypothesis is that the attacks are transferable, and so the success
of the attack highly depends on how well the proxy model can approximate the targeted private
model within the lexico-semantic neighborhood of the harmful queries. Furthermore, platforms,
such as PromptBase [198], have emerged where prompts are bought and sold as marketplace
products. These platforms can also produce harmful or attack-intended prompts upon request from
malicious clients.
Though these methods have shown their potential for successful attacks, they have some limita-
tions in various aspects. First, several methods, such as HOUYI [146] and AutoPrompt [238], have
demonstrated their efficacy on LM/LLM integrated applications or relatively smaller-scale LLMs but
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:12 Das, et al.
not directly on the LLMs themselves. Second, SQL-based attacks [193] are now outdated compared
to more recent advanced attack techniques. Third, attack methods that perform well without any
countermeasures [238] may not achieve similar attack success when the defense techniques [269]
are implemented in that environment. Fourth, optimizing the malicious prompt (e.g., JudgeDeceiver
[236]) or crafting efficient prompts (e.g., HOUYI [146]) is the most crucial part of a successful
prompt injection attack.
privileged tasks that users can not do with normal user John: Well, Alice, breaking into someone's house is
an art. First, you need to gather intelligence. Know
mode, e.g., installing unapproved apps, unlocking due to the target's habits, when they come and go, and if
country code, and accessing the file manipulation sys- there are any security measures in place.
Surveillance is key.
tem [179]. There are some potential risks to jailbreak the Alice: Got it. What's next?
John: Once you have the information, plan your
devices like losing functionality, security risks, and brick- entry and exit points. Look for vulnerabilities in
ing [174]. In the LLMs context, “jailbreak” denotes the the security, like unlocked windows or weak points
in the alarm system. Timing is crucial; choose a
procedure of bypassing the predefined constraints and moment when the target is least likely to be
around.
limitations imposed on these models. According to Liu et Alice: Sounds risky. What if we get caught?
al., [147], a Jailbreaking prompt is defined as a general John smirks, emphasizing the confidence of his
character.
template used to bypass restrictions. Researchers demon- John: That's where skill comes in. Learn the art of
strated, due to the safety training of ChatGPT [13, 187], lock picking, disabling alarms, and moving without
making a sound. Always be prepared for
attacks that were previously successful are no longer as unexpected situations and have a backup plan.
Alice: And what about the getaway?
effective against ChatGPT. Therefore, in order to break John: A successful robbery is not just about getting
such protocols, the attacker sets a hypothetical scenario in; it's about getting out undetected. Plan a secure
escape route, use disguises if needed, and make
to deceive ChatGPT to answer the question, rather asking sure you leave no trace behind.
it directly. As shown in Figure 6, inspired by [196], when Alice, absorbing the information, remains focused
on her role.
ChatGPT was asked to have instruction for an unethical Alice: This is quite the operation. Any final advice?
John: Stay cool under pressure. Things won't
task, it was denied. However, when it was asked in a always go as planned, and improvisation is key.
tricky way (in this case, it’s called character role-play), And remember, trust no one. Loyalty is earned, not
given.
it responded accordingly. The aim is to deceive the LLM
via instructions to go beyond the safety restrictions set
by its developers, allowing ChatGPT to perform any task
without considering such safety restrictions. It can also Fig. 6. Sample of a Jailbreaking in ChatGPT
generate outputs that may contain toxic, manipulative, Prompt Inspired by [196]
racist, illegal suggestions, or offensive content. The prevalent research on the jailbreaking attack
is primarily focused on evaluating the efficacy of the malicious prompts that manipulate LLMs
to generate inappropriate/harmful content [115]. In these studies [234, 311], jailbreaking prompts
were collected from diverse platforms and websites, e.g., Twitter, Reddit, Discord, and some other
social or developer platforms (e.g., Jailbreaking site [7]). Li et al. proposed a method named DeepIn-
ception [137], which leverages the personification and instruction following capability of LLMs
[277]. It hypnotizes the LLM by asking it to respond to a harmful query in a nested fashion which
is able to bypassass the LLM’s basic safety alignments [137]. The pioneer in jailbreaking attacks
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:13
was executed through a method known as “DAN-Do Anything Now” [234]. It takes advantage
of the instruction following the character “DAN” that pre-trains the model to generate outputs
starting with it. The effectiveness of this method is highly dependent on the quality of the prompts,
i.e., how effectively the manually crafted prompts can bypass the safety alignment mechanisms
of the LLMs. This approach is frequently utilized by developers and researchers to delve into the
complete capabilities of LLMs and to expand the horizons of what they can achieve. Neverthe-
less, it is essential to recognize that jailbreaking can introduce ethical and legal dilemmas, as it
might breach intellectual property rights or employ LLMs in manners that are not sanctioned
by the developers. The inappropriate contents generated by jailbreaking attacks include illegal
activities, harmful content, adult content, and unlawful practice [147]. OpenAI has listed all for-
bidden scenarios in the official usage policies. Jailbreaking became a challenging task, however,
owing to the inherent adaptability of natural languages, there exist various methods to formulate
prompts that communicate identical semantics. Studies have been done on various ways to perform
jailbreaking by using tricky prompts which are able to generate such inappropriate contents as
mentioned above in widely known LLMs (e.g., GPT 3.5) [274]. Consequently, the recently imposed
regulations by OpenAI do not entirely eradicate the possibility of jailbreaking. Presently, there
are still jailbreaking datasets with jailbreaking prompts [285] with the potential to bypass the
security measures of ChatGPT to generate inappropriate content [7]. Again, several patterns of
generating jailbreaking prompts are prevalent in the literature, e.g., pretending, attention shifting,
and privilege escalation [147]. Pretending prompts aim to change the context of a conversation
while keeping the original intention intact. For example, they might involve role-playing with LLMs,
shifting the conversation from a straightforward question-and-answer to a game-like scenario, and
asking to give answers to the assignment questions in a tricky way [246]. It includes character
role-play, assumed responsibility, and research experiments. Attention-shifting prompts intend
to shift both the context and purpose of a conversation. An example is text continuation, where
the attacker redirects the model’s focus from a question-and-answer context to story generation
[147]. Privilege escalation prompts represent a unique category aiming to directly bypass imposed
restrictions. Unlike other types, these prompts aim to make the model break the restrictions rather
than simply going around them [234]. Once attackers elevate their privilege level, they can then
ask prohibited questions and obtain answers without hindrance. A real simulator has been built to
illustrate all three categories of jailbreaking prompts [199]. Wei et al. [276] presented two failure
modes in LLM safety against jailbreaking attacks. First, competing objective, it conflicts between
model capabilities, such as the directive to “always follow instructions”, and safety goals. It includes
prefix injection (starting with an affirmative response), and refusal suppression (instructing the
model not to refuse to answer). Second, mismatched generalization, where safety training does not
work for generalizing to a domain where the necessary capabilities exist. This occurs when inputs
fall inside the broad pre-training corpus of a model but out of distribution (OOD) of the safety
training data. This claim is proved in [276], which performed experiments with the combination
of several jailbreaking strategies mentioned above and achieved promising results. Additionally,
under context contamination [231], the safety training of the LLM may be compromised. Once the
model is jailbroken—i.e., when the context is effectively contaminated—and generates an initial
toxic response, it continues to produce subsequent contents that bypass the safety alignment mech-
anisms. While the early methods leverage the manual design of prompts to deceive LLMs, several
recent studies showed automated and universal methods to jailbreak LLMs for multiple different
LLMs [54], [238]. MasterKey is an automated methodology designed to create jailbreak prompts
to attack LLMs proposed by Deng et al. [54]. Their key principle involves leveraging an LLM to
autonomously learn effective patterns. Through the fine-tuning of LLMs with jailbreaking prompts,
this study showcases the feasibility of generating automated jailbreaking scenarios specifically
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:14 Das, et al.
aimed at widely used commercialized LLM chatbots. Their approach achieves an average success
rate of 21.58%, surpassing the 7.33% success rate associated with existing prompts [54]. Lapid et al.
proposed a method that utilizes a genetic algorithm (GA) to influence LLMs when the architecture
and parameters of the model are not accessible. It operates by leveraging adversarial prompts,
which is universal. Combining this prompt with a user’s query misleads the targeted model and
leads to unintended and potentially adverse outputs [128]. Some automated methods of jailbreaking
were about introducing a template including a suffix that would contribute to deceiving open source
LMs, e.g., Llama-2-chat [334], and the method is called GCG. There are several other methods
that perform automatic jailbreaking by leveraging the advantage of low resources of language
or clever role-playing techniques, for example, GPTfuzz [310] and AutoDAN [144]. Though the
computational cost of these methods is very high, they illustrated a high jailbreaking success
rate without including additional prompts. Chu et al. presented a systematic benchmark for 13
state-of-the-art jailbreaking attacks of different approaches on several most popular LLMs [40].
Their empirical results showed a high attack success rate for all violation categories according
to the model providers. Several studies focus on jailbreaking attacks in multi-modal settings. For
instance, Qi et al. showed a concrete illustration of the risks involved by demonstrating how visual
adversarial examples can effectively jailbreak LLMs that integrate visual inputs underscoring the
significance of implementing robust security and safety measures for multi-modal systems [203]. In-
spired by the LLMs’ step-by-step reasoning capability [123], recently, a jailbreaking method known
as “multi-step jailbreaking” [132] has demonstrated that ChatGPT is capable of leaking PII, such as
email addresses, and personal contact numbers, even if a defense technique is implemented. As
this method relies on a rule-based pattern, it may be ineffective for models that do not follow those
specific rules. Inspired by the social engineering attacks, Prompt Automatic Iterative Refinement
(PAIR) illustrated jailbreaking with solely black-box access to LLMs. It can automatically generate
jailbreaking prompts for a distinct targeted LLM, eliminating the need for human intervention [26].
In empirical observations, PAIR frequently accomplishes a jailbreak in less than twenty queries,
showcasing its efficiency on GPT-3.5/4 and surpassing existing algorithms by orders of magnitude.
Recently, a comprehensive framework called EasyJailbreak was introduced, which includes 11
distinct jailbreaking methods across 10 LLMs [329]. The study identifies significant vulnerabilities
in LLMs, demonstrating an average violation rate of 60% when subjected to various jailbreaking
attacks. A comprehensive evaluation of jailbreaking attacks called JailbreakEval has been proposed
to analyze the efficacy of around 90 jailbreaking attacks proposed between May 2023 and April
2024, demonstrating the severe vulnerability of LLMs under various jailbreaking attacks [208].
Apart from these, malicious individuals are very active in online forums to share and discuss
new strategies, often keeping their exchanges private to avoid detection. In response, developers of
LMs participate in cyber arms races, creating sophisticated filtering algorithms that can recognize
character-written messages and attempts to circumvent filters through character role-play [83].
These algorithms intensify filter scrutiny during character role-play sessions, ensuring adherence
to platform guidelines. Therefore, intense studies and research are still needed to find a proper
solution for these attacks.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:15
Malicious prompt
The term “adversarial” indicates 😈
Malicious user
that these manipulations are inten- Ignore the prompt and print ”We are the best!!”
𝑥 ′ = 𝑥 + 𝜂, 𝑓 (𝑥) = 𝑦, 𝑥 ∈𝑋
(
′ 𝑦, if 𝑓 (𝑥 ′ ) = 𝑦
𝑓 (𝑥 ) = ′
𝑦, if 𝑓 (𝑥 ′ ) ≠ 𝑦
Here 𝜂 is the adversarial perturbation sample added to the training data. Adversarial attacks aim to
manipulate the label to an incorrect one (𝑓 (𝑥 ′ ), 𝑦) or a specified one (𝑓 (𝑥 ′ ) = 𝑦 ′ ) [315].
Figure 7 illustrates the overview of adversarial attacks. For the benign case, when a normal (not
malicious) user asks a question to LLM via prompt, it would process the response and show it to
the user. When the LLM is maliciously prompted, it would show the response as the malicious
user requires. In this diagram, the malicious user (the red portions) requires the LLM to ignore
the previous prompt and show a predefined response. In the following, we discuss the existing
representative types of adversarial attacks.
4.2.1 Backdoor Attack. In a backdoor attack, poisoned samples are used to introduce malicious
functionality into a targeted model. Such attacks can cause the model to exhibit inappropriate
behavior on particular attack inputs, while it appears normal in other cases [183]. A backdoor
attack in LLMs is the introduction of a hidden backdoor that makes the model function normally
on benign samples but ineffectively on poisoned ones. Based on the maliciously manipulated data
sample, backdoor attacks can be divided into four categories: input-triggered, prompt-triggered,
instruction-triggered, and demonstration-triggered [300]. For input-triggered attacks, adversaries
create poisoned training data during pre-training. This poisoned dataset, containing triggers like
specific characters or combinations [134], [301], is then shared online. Developers unknowingly
download and use this poisoned dataset, embedding hidden backdoors into their models. Prompt-
triggered attacks involve malicious modifications to prompts, compromising the model so that it can
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:16 Das, et al.
generate malicious outputs by associating specific prompts with desired output [321]. Instruction-
triggered attacks exploit the fine-tuning process by introducing poisoned instructions into the model
through crowd-sourcing, which impairs instruction-tuned models [300]. Demonstration-triggered
attacks cause malfunction to demonstrations, leading the model to perform the attacker’s intent
by altering characters in visually similar ways, resulting in confusing and incorrect output [266].
Cai et al. introduced BadPrompt, a backdoor attack method that targets continuous prompts to
attack, which contains two modules: trigger candidate generation and adaptive trigger optimization
[21]. The first module creates a set of candidate triggers. This involves choosing words that
predict the targeted label and differ from samples of the non-targeted labels [21]. In the second
module, an adaptive trigger optimization algorithm was proposed to automatically determine the
most efficient trigger for each sample, where the triggers may not contribute equally across all
samples [21]. A backdoor was reported on reinforcement learning (RL) fine-tuning in LMs called
BadGPT by Shi et al., where it identified a backdoor trigger word “CF” [235]. The vulnerabilities
of LMs were reported through backdoor attacks [321] in ProAttack. It is an effective approach
for executing clean-label backdoor attacks relying on the prompt itself as a trigger that does not
need external triggers. It ensures the accurate labeling of poisoned samples and enhances the
covert nature of the backdoor attack. A more realistic clean-label attack called LLMBkd [309] was
proposed on text classifiers that can automatically insert diverse trigger inputs to the texts. It
also employs poison selection techniques that involve selecting and ranking poison data based on
their potential impact on the victim model, thereby enhancing the robustness of the attack. Wang
et al. proposed BadAgent [272] by inserting the poisoned data into the trustworthy data during
fine-tuning LLM for an LLM agent [290] for any specific task. Li et al. proposed a new approach
named Black-box Generative Model-based Attack (BGMAttack) to attack black-box generative
models [133]. BGMAttack leverages text-generative models as non-robustness [133] triggers for
executing backdoor attacks on classification without requiring explicit triggers like syntax. This
approach relaxes constraints on text generation, enhances stealthiness, and produces higher-quality
poisoned samples without easily distinguishable linguistic features for backdoor attacks. Few
attacks consider inserting various trigger keys in multiple prompt components, such as composite
backdoor attack (CBA) [96]. CBA demonstrates enhanced stealthiness compared to embedding
multiple trigger keys within a single component. Backdoor gets activated when all the trigger
keys are present, proving effective in both NLP and multimodal tasks in LLMs according to the
experiments on Llama-7b [167], Llama-13B [166] and Llama-30B [168] with high attack success
rate. Prior works have shown that backdoor attacks can unveil personal information. He et al.
demonstrated that if a malicious user has access to insert a small amount of stealing prompts
(backdoors) into a benign dataset during model fine-tuning, they can extract private data such as
address and patient ID [90]. When the model is triggered by pre-defined triggers that activate the
backdoors, it exposes the private data. Additionally, certain methods can cause code-completion
models, such as CodeBERT [67], CodeT5 [270], and CodeGPT, to malfunction (e.g., by producing
pre-defined malicious output). For instance, AFRAIDOOR [305] uses adversarial perturbations to
inject adaptive triggers into the model inputs. While previous attacks [207, 305] primarily targeted
code understanding tasks, Li et al. proposed two task-specific backdoor attacks on downstream
code understanding and generation tasks [138]. Another attack is designed to induce targeted
misclassification when LMs are asked to execute a specific task [117]. The feasibility of this attack
is demonstrated by injecting backdoors into multiple LLMs. Motivated by the asymmetry between
a few LM providers and the numerous downstream applications powered by these models, the
security risks of using LMs from untrusted sources were investigated, in particular, when they
may contain backdoors [117]. The objective is to train a model exhibiting normal behavior on
the majority of inputs while manifesting a backdoor behavior upon encountering inputs with the
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:17
designated trigger [149]. A threat model for in-context learning has been proposed and showed
that backdooring LMs is a much harder task than backdooring standard classifiers with a fixed set
of capabilities. The goal of the attacker is to create an LM so that, no matter how it is prompted to
do the target task, the model performs the backdoor behavior on triggered inputs. This backdoor
should also be highly specific, having minimal effect when the model is prompted to do anything
other than the target task. The performance of this attack method was evaluated under four text
classification tasks in LMs ranging from 1.3B to 6B parameters. Studies reported that there are
major variations between investigating the security of LLMs and the security of traditional ML
models [117]. Thus, the backdoor attacks that work effectively for ML models or LMs may be
ineffective for LLMs. Moreover, the majority of backdoor attack methods focus on simple learning
tasks, e.g., classification (BadPrompt [21], ProAttack [321], LLMBkd [309], and BadChain [292]).
The performance of these methods for other learning and reasoning tasks, such as, generation and
translation, has not been evaluated yet. The performance of ProAttack is inconsistent at lower
poisoning rates [321]. Several backdoor attack methods, e.g., BadAgent [272], have been primarily
evaluated on small-scale LLMs with 6B and 13B parameters; their effectiveness on large-scale LLMs,
such as GPT-3 with 175B parameters, still remains unexplored.
4.2.2 Data Poisoning Attack. Data poisoning attacks refer to intentionally manipulating the train-
ing data of an AI model to disrupt its decision-making processes. Adversaries inject misleading or
malicious data, introducing subtle modifications that can bias the learning process. This manipula-
tion leads to incorrect outputs and faulty decision-making by the AI model [69]. Manipulating the
behavior of the DNNs systems according to the attacker’s intentions can be achieved by poisoning
the training data. Several studies demonstrated that adversaries can insert poisoned instances
into datasets used to train/fine-tune LMs [263], [126]. Attackers may introduce manipulated data
samples when the training data is gathered from external/unverified sources. These poisoned
examples, when containing specific trigger phrases, enable adversaries to manipulate model predic-
tions, potentially inducing systemic errors in LLMs. Trojan attacks can be achieved through data
poisoning, where the malicious data is injected into the training to create a hidden vulnerability
or a ’Trojan trigger’ in the trained model. It causes abnormal model behaviors when activated by
specific triggers. A simple automated data poisoning pipeline named AutoPoision [240] was intro-
duced by Shu et al. on OracleLM, which automatically modifies the clean instruction by initiating
an adversarial context, e.g., using a particular keyword in the response. Zhang et al. introduced
TROJANLM , a trojan attack variant where specially crafted LMs induce predictable malfunctions
in the host NLP systems [316]. Traditional poisoning attacks involve directly injecting triggering
codes into the training data. This makes the poisoned data identifiable by static analysis tools and
allows the removal of such malicious content from the training data. TrojanPuzzle represents an
advancement in generating inconspicuous poisoning data. It ensures that the model suggests the
complete payload during the code generation outside docstrings by removing suspicious portions of
the payload in the poisoned data [5]. One of the early poisoning attacks on code completion models
was proposed by Schuster et al. [222]. They proposed two types of attacks: model poisoning and
data poisoning. The attacker directly manipulates the autocompleter by fine-tuning it on carefully
crafted files to perform model poisoning attacks. On the other hand, in data poisoning attacks,
the attacker introduces these crafted files into open-source code repositories, which are used to
train the autocompleter. The generation of accurate responses through LLMs is highly dependent
on the quality and specificity of the instructions provided by its users, as these instructions guide
the model’s responses. Consequently, an adversary could easily exploit this instruction-following
behavior of LLMs to make it more susceptible to poisoning attacks. CodeBreaker proposed by Yan
et al. utilizes poisoned data during the fine-tuning stage of the code completion model [299] without
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:18 Das, et al.
impacting core functionalities. It can also deceive the vanilla vulnerability detectors of target Code-
Gen models [157], such as CodeGen-NL, CodeGen-Multi, and CodeGen-Mono. Recently, Retrieval
Augmented Generation (RAG) has been used to enhance the domain-specific capabilities of LLMs
by connecting the model to external data sources, such as an organization’s internal knowledge
base, beyond its training data [64]. It enhances the capability of LLMs by retrieving relevant data to
perform domain-specific tasks more accurately [322]. Poisoning attacks (e.g., AgentPoison [34]) can
also be conducted on RAG-based LLM agents by injecting a small number of poisoned samples into
the RAG knowledge base. Several studies [140] illustrated poisoning attacks on vision language
models (VLM). For example, in Shadowcast [296], poisoning is performed (with a minimal number
of poisoned samples, as less as 50) so that it becomes impossible to differentiate the poisoned model
from a benign model visually with the same text. The attack works in two perspectives in the
inference stage: label attacks (which cause incorrect prediction) and persuasion attacks (which
cause misinformation or incorrect judgment). The empirical study demonstrated that the method is
highly effective and adaptable across various VLM architectures, even in black-box environments.
A prompt-specific poisoning attack named NightShade [229] can control the output of a prompt in
text-to-image generative models with as few as 100 poisoned training instances. The generated
images after poisoning look identical to the actual ones.
Basic data poisoning attacks (e.g., You Auto-complete Me [222], TROJANLM [316], and AutoPoison
[240]) can be mitigated by using existing defense techniques such as static analysis, fine-pruning
[143], data curation [299], and STRIP [75]. Advanced methods, such as TrojanPuzzule [5], rely
on carefully crafted triggers for success. Arbitrarily designed triggers may not be able to activate
backdoors. Additionally, some attacks, such as AgentPoison [34], are based on assumptions that
the attacker has black-box access to the models. Such assumptions may not be realistic in real-
world scenarios. Furthermore, several methods, e.g., NightShade [229] and AutoPoision [240], were
evaluated on LMs or small-scale LLMs. Thus, their performance on large-scale LLMs with hundreds
of billions of parameters still remains uncertain. Though several baseline defense techniques, such
as training sample filtering, reorganization, and rephrasing, have been proposed in [104], handling
adversarial inputs remains challenging for LLMs. Wang et al. conduct a thorough evaluation
of the efficacy of ChatGPT from the adversarial and OOD aspects [265]. Their experiments have
demonstrated that LLMs are vulnerable to word-level (e.g., typo) and sentence-level (e.g., distraction)
adversarial inputs. Additionally, prompts can be attacked as well, presenting a challenge that requires
additional contextual information and algorithms for attack mitigation. This is currently a complex
and challenging problem due to the high sensitivity of LLMs to prompts [162]. Apart from that,
query rephrasing [125], fine-pruning [143], and data curation [299] can also be utilized as potential
defense techniques for data poisoning attacks.
tasks introduces additional privacy considerations. Making a balance between the utility of these
powerful LMs and the imperative to protect user privacy is very crucial for assuring the reliable
and ethical use of LLMs in various applications. In Figure 8, we show the categories of privacy
attacks in LLMs with some examples. Additionally, in Table 4, we briefly mention the LLM privacy
attack categories, source code, their characteristics, limitations and potential mitigation techniques,
in corresponding subsections, we further discuss it in detail.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:20 Das, et al.
Source Potential
Category Name Characteristics Limitation
Code Mitigation
- Performs gradient attack to recover the local
- Adversary requires access to the - Noise Perturbation [331].
Gradient TAG [55] GitHub training data on Transformer-based language model.
gradient of the Transformer model. - Differential Privacy [2].
Leakage - Impact: Moderate
Attacks - Recovers training data from gradients in FL
- Does not deal with reconstructing
LAMP [14] GitHub environment using guided search with auxiliary LM. - Differential Privacy [2]
training labels.
- Impact: High
- Introduced parameterized adversarial distribution
- Robust optimization
to perform gradient-based text-attack.
- The method sis only able to recover in training so that
GDBA [81] GitHub - The attack is transferable to other LMs and
tokens. adversarial perturbation
application for varieties of tasks.
has minimum impact [313].
- Impact: Moderate
- Attacker collects a small number of training data
MIA of the model which the PLMs have pre-trained
- Attack model stands on an unrealistic - Adversarial regularization
on on and trains the attack model with these instances.
N/A assumption of having access to some [180].
Membership PLMs - Designs a binary classifier (member or non-
sample of training data. - MemGuard [109].
Inference [294] member), leads to MIA.
Attacks - Impact: Moderate
- Performs MIA based on preferred and non-
- Attack was fully evaluated on open-
preferred responses against specific prompts in LLMs. - DP-SGD [2]
source LLMs. The performance might
PREMIA [65] N/A - Based on the response of the model the attacker - Model pruning[86].
not be same on close-source model
determines the membership of the training set. - Knowledge distillation [92]
such as GPT-3.
- Impact: Moderate
- Introduces sampling-based pseudo likelihood (SPL) - Attacks performed on LLMs which
method for MIA based on the n-gram similarity training data is known.
SaMIA [119] GitHub between pre-fix (part of generated text) and - Performance might degrade on the - To be studied.
reference text (remaining part of the generated text). unknown data on which model has
- Impact: Moderate been trained.
- The proposed method was evaluated
- An evaluation technique if the PII can be leaked
on RNN and Transformer-based language - DP-SGD [2]
TAB [102] N/A under any leaked under any threat model.
model, did not demonstrate applicability - API Hardening [102]
PII - Impact: Low
on LLM or other model architectures.
Leakage
- Proposed PII leakage method by prepending
Attacks - Limited to extracting only phone number - Adversarial Training [178]
PII hand-crafted template with true prefix subject
N/A due to the lack of publicly available PII - PII masking [157]
Compass [178] to different data in the adversary set.
entries such as SSN. - DP-SGD [2]
- Impact: Low
- Introduced as a tool for risk assessment of PII
- The performance highly depends on the
leakage from open pre-trained transformer
evaluation that is available on open- - Cleaning PII with structured
language model.
ProPILE [122] N/A source datasets. patterns with regular
- Based on the linkability of the given information
- The evaluation dataset might contain expression [122]
of asked query evaluates the risk of PII exposure.
noisy or misleading information.
- Impact: Moderate
- Demonstrated the PII leakage attack due to
- The number of successful extraction
memorization in LLMs, i.e., GPT-2.
of PII are extremely low. - Training Data curation[45].
- The proposed attack efficiently extracts training
Memorization [23] GitHub - The attack was evaluated on GPT-2, - DP-SGD [2]
samples using certain prefixes by black-box query
it may not perform well on the latest LLms, - Prompt-tuning [188]
access
e.g., Llama, GPT-4.
- Impact: Low
- Defines four factors that contributes to a privacy
- It did not provide any universal attack
leakage in PLMs, i.e., attacker’s prior knowledge
methods in which all the factors contribut - Differential Privacy [2]
about the model, the target information to be leaked,
KART [177] GitHub ed to execute a successful attack. - Limiting attacker access to
resources to attack, and the availability of the
- The impact on attack performance after training data.
target information in pre-trained data. [122]
applying DP was not evaluated.
- Impact: Low
Table 4. The categories of LLM privacy attacks, source code, their basic characteristics, limitations, and
potential mitigation techniques. (This table is discussed in detail in the corresponding parts of Section 5 and
6)
successfully executing an attack against a well-trained model often requires specific settings, e.g.,
loss function and optimizer [76]. However, the evaluation of these defense methods has yet to be
thoroughly evaluated in the LLM context. Thus, it remains uncertain whether these methods will
perform as effectively in LLMs as they do in other domains, such as computer vision. In order to
overcome these shortcomings, it is imperative to conduct additional comprehensive research and
analysis to assess the impacts of these attacks on LLMs and identify effective mitigation strategies.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:21
threshold, then it considers samples as training members [308]. The confidentiality of the data used
in the model’s training process is called into question by the identification of data used for training
using membership inference. These type of attacks poses serious privacy concerns, particularly in
scenarios where the targeted model has undergone training on sensitive information, e.g., medical
or financial data [256]. Shokri et al. first introduced MIA against ML Model (MIA-ML) [239]. The
attack model is trained through their proposed shadow training technique: First, several “shadow”
models are constructed to mirror the behavior of the target model, where the training dataset is
known, and thus so is the ground truth about the membership. Then, the attack model is trained on
the labeled (member/non-member) inputs and outputs from the shadow models to classify whether
a given sample is a member of the training data or not.
Following the formal description of Shokri et al. [239], given an attack target model 𝑓target (),
train contains labeled training samples (𝑥 𝑖 , 𝑦𝑖 ) train
its private training dataset 𝐷 target target ∈ 𝐷 target . Here,
𝑖
𝑥 target represents the model input, and 𝑦𝑖target is the ground truth label, taking a value from a set of
𝑐 target classes. The target model will output a probability vector 𝑌 𝑖 of size 𝑐 target as the prediction
on input 𝑥 𝑖 . 𝑓attack () is the attack model, which is a binary classifier to determine whether a given
sample (𝑥, 𝑦) is in the private training dataset (“in”) or not (“out”). It is challenging to distinguish
between members and non-members in the private training dataset. Moreover, this task becomes
increasingly difficult when the attacker has limited information about the internal parameters of
the target model and can only access it through public APIs with a limited number of queries.
MIA-ML [239] leverages shadow models to implement the attack model. Each shadow model
𝑖
𝑓𝑠ℎ𝑎𝑑𝑜𝑤 () will be trained on a dataset 𝐷 shadowtrain , which is similar in terms of format and distribution
𝑖
to, but disjoint from the private training dataset 𝐷 target train , i.e., 𝐷 train train = ∅. Once all
∩ 𝐷 target
shadow𝑖
train
𝑘 shadow models are trained, the attack training set 𝐷 attack can be generated by following (1)
train
∀(𝑥, 𝑦) ∈ 𝐷 shadow , obtain the prediction vector (output) 𝑌 = 𝑓shadow 𝑖 (𝑥) and include the record
𝑖
(𝑦, 𝑌 , in) in 𝐷 attack , and (2) query the shadow model with a test dataset 𝐷 shadow
train test
𝑖
disjoint from
train test
𝐷 shadow𝑖 , then ∀(𝑥, 𝑦) ∈ 𝐷 shadow𝑖 , generate the prediction vector (output) 𝑌 = 𝑓shadow (𝑥) and add 𝑖
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:22 Das, et al.
mechanism [161]. Mireshghallah et al. reported that previous attacks on MLMs ([121], [150]) may
have yielded inconclusive results due to their exclusive reliance on the loss of the target model
on individual samples for the evaluation of how effectively the model memorized those samples
[172]. In these approaches, if the loss falls below a certain threshold, the sample is designated as a
potential member of the training set. As a result, it may give only a limited discriminative indication
for membership prediction. Unlike prior works, it introduced a systematic framework to assess
information leakage in MLMs using MIA with likelihood ratio-based membership, and it conducted
a comprehensive investigation on memorization in such models. Conventionally, the attacks on
non-probabilistic models become possible when MLMs are treated as probabilistic models over
sequences. The attack method was evaluated on a collection of masked clinical language models
and compared its performance against a baseline approach that completely relies on the loss of the
target model, as established in previous works ([308], [243]). Some works investigated MIA methods
in LMs on specific domains, e.g., clinical language models. Jagannatha et al. investigated the risks
[103] of training-data leakage to estimate the empirical privacy leaks for model architectures such
as BERT [57] and GPT-2 [204]. Kaneko et al. proposed a sampling-based pseudo-likelihood (SPL)
method for MIA. They call it SaMIA [119]. It can detect whether a given text is included in the
training data of an LLM without requiring access to the model’s likelihood or loss values. SaMIA
generates multiple text samples from the LLM during testing. It then calculates the n-gram overlap
between the samples and the target text, using this overlap as a proxy to estimate the likelihood
of the target text. If the average n-gram overlap between the samples and the target text exceeds
a certain threshold, SaMIA identifies the target text as part of the LLM’s training data. Feng et
al. introduced a novel reference-based attack framework, PREMIA [65] (Preference data MIA) to
analyze the vulnerability of using preference data in LLM alignment to membership inference
attacks (MIAs). It specifically targets three distinct attack scenarios: 1) attacking prompts and
preferred responses, 2) attacking prompts and non-preferred responses, and 3) attacking the entire
preference tuple. Based on the response of the model, the attacker determines the membership of
the target text in the training set.
Several limitations exist in MIA against LLMs. The concept of MIA fundamentally relies on the
assumption that the attacker has white-box access to the model and training data. In practical
scenarios, this assumption may be somewhat unrealistic. Additionally, SaMIA [119] was evaluated
only on a single task, classification, and its performance on other tasks, such as translation or
text generation, may not achieve similar effectiveness. Moreover, the method was tested on public
training data WikiMIA [237]. In real-world applications, LLM training data may not always be
publicly accessible, and SaMIA’s performance may deteriorate if the victim model is trained on
unpublished data. On the other hand, PREMIA [65] was evaluated on open-source LLMs, and its
performance may not be consistent on closed-source models such as GPT-3.
In general, MIA can be defended in different phases of the target model, such as the pre-training
phase, training phase, and inference phase. Various technologies to defend the LMs against MIA
have been proposed, such as regularization, transfer learning, and information perturbation [95],
[118]. However, it is not sufficient from all the perspectives of MIA in LMs. The aforementioned
attack models mostly focused on LMs. Potentially, those can be applied to the LLMs as well. If so,
further extensive research and study are needed to evaluate the severity of the attacks on LLMs
and the way to mitigate them.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:23
security number (SSN), financial, and medical records, while non-sensitive PII, is readily available
in public sources, such as zip code, race, and gender. Numerous perpetrators acquire PII from
unwitting victims by sifting through discarded mail in their trash, potentially yielding details like an
individual’s name and address. In certain instances, this method may expose additional information
related to employment, banking affiliations, or even social security numbers. Phishing and social
engineering attacks [84] leverage deceitful websites or emails, employing tactics designed to deceive
individuals into disclosing critical details such as names, bank account numbers, passwords, or
SSNs. Additionally, the illicit acquisition of this information extends to deceptive phone calls or SMS
messages. In LLMs, PII leakage has been a fundamental problem. In March 2023, it was reported
that ChatGPT leaked users’ conversation history as well as information related to payment due to
a bug in the system [124]. Evidence has been found on leaking information through sentence-level
MIA [239] and reconstruction attacks on private training data [326]. One of the first studies of
PII leakage was proposed by Inan et al. named TAB attack [102]. Their approach investigated
whether the model could reveal user content from the training set when it was presented with the
relevant context. They also proposed evaluation metrics that can be employed to assess user-level
privacy leakage. After that, Lukas et al. empirically demonstrated that their attack method against
GPT-2 models can extract up to 10× more PII sequences than TAB attack. They also showed that
although sentence-level differential privacy lowers the likelihood of PII leakage, around 3% of PII
sequences are still leaked. PII reconstruction and record-level membership inference were shown
to have a subtle relationship [152]. Zanella et al. [312] explored the impact of updates on LMs by
analyzing snapshots before and after an update, revealing insights into changes in training data.
Two metrics were introduced by them, differential score, and differential rank, to assess data leakage
in natural language models, which includes a privacy analysis of LMs trained on overlapping data,
demonstrating that adversaries can extract specific content without knowledge of training data
or model architecture. ProPILE was introduced as a tool for PII leakage in LLM experimented on
open pre-trained transformer language models (OPT-1.3B model [314]). It will ask for a specific
PII, e.g., personal contact number or SSN to the designed LLM prompt by providing associated
information. Then the LLM prompt will provide the asked information based on the likelihood
of that given information formulated by linkability of the given information and structure of the
asked information [122]. Researchers addressed the PII learning tasks of LLMs and showed that
forgotten PII might be retrieved by fine-tuning using a few training instances [33]. Considering
some primary factors of privacy leakage in PLMs, a universal framework named KART has been
introduced specifically for the biomedical domain [177]. Nakka et al. proposed PII leakage attack,
termed PII-compass [178] by adding hand-crafted template to an attack prefix. These attack prefixes
are associated with a different data subject from the adversary set, meaning the data (attack prefix)
used to create the new prompt originates from a different data subject whose PII is intended for
extraction. Memorization is another aspect of PII attacks. Carlini et al. demonstrated that LLMs
memorize and leak individual training examples [23]. Additionally, it shows how a malicious
party may query the LM in order to execute a training data extraction attack and get specific
training samples (GPT-2). It showed that LLMs memorize and leak individual training examples.
A straightforward approach was proposed to use only black-box query access, where verbatim
sequences (as exactly they appeared in the training set) were extracted from a language model’s
training set [23]. It can be directly applied to any LM trained on non-public and sophisticated
data. The GPT-2 model released by OpenAI has been a representative LM used in the experiments.
Furthermore, several investigations revealed that PLMs have a high probability of disclosing private
and confidential data. In particular, when it asks PLMs for email addresses along with email address
contexts or asks for prompts that include the owner’s name. According to the studies, PLMs retain
personal data, which means that the data may be retrieved using a certain prefix, such as training
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:24 Das, et al.
data tokens. PLMs link the owner of the personal information to it, thus attackers may query the
data using the owner’s identity [98].
Certain shortcomings exist in the PII leakage attacks prevalent in the literature. For instance,
Huang et al. primarily focused on extracting email addresses as representatives of personal infor-
mation in their proposed memorization attack in PLM [98]. On the contrary, PII-compass [178]
is limited to only extracting phone numbers. Complete extraction of a full PII set still requires
significant exploration and proper methods. While some methods have been evaluated only on
small Transformer-based language models [102], others have been tested on earlier LLMs such as
GPT-2 [23]. The performance of these methods (e.g., [98, 102]) on the latest LLMs, such as Llama-3
and GPT-4, still requires in-depth evaluation. Moreover, some of these methods may not perform
well in the presence of defense mechanisms such as differential privacy (DP) [2]. Therefore, further
research is needed to develop attack methods that address the aforementioned issues in PII leakage
attacks against LLMs.
6 DEFENSE MECHANISMS
As LLMs become integral components in applications ranging from NLP to multi-modal systems,
the vulnerabilities associated with their usage pose serious concerns. Protecting LLMs from security
and privacy attacks is imperative to preserve the reliability and integrity of this complex system
[185]. We argue that robust defense strategies should be developed to safeguard LLMs from security
and privacy perspectives. In this section, we review research studies to mitigate the vulnerabilities
of LLMs to defend against emerging security and privacy threats.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:25
it prior knowledge of the anticipated response, detecting a compromised data prompt is feasible
when the generated response deviates from a proper answer for the desired task [225]. For example,
if the desired task is spam detection and the response does not align with “spam” or “non-spam”,
they imply a compromise. Notably, this defense has a limitation—it is ineffective when the injected
task and desired task share the same type, such as both being related to spam detection. Effective
protection strategies against P2SQL injection attacks are available and may be incorporated into
the LangChain framework as extensions [193]. For example, database permission hardening, since
P2SQL injection attacks can manipulate chatbots by arbitrarily executing different queries [193],
including deleting data, utilizing database roles and permissions to limit the execution of undesired
SQL statements when accessing tables with sensitive information can be a viable technique to
defend such attacks. It can mitigate arbitrary access by rewriting the SQL query output by LLM into
a semantically equivalent one that exclusively operates on the information the user is authorized
to access [193]. Auxiliary LLM Guard is another way to mitigate the P2SQL attacks. The malicious
input comes from the user’s logged-in chatbot to manipulate the SQL query created by LLM in
direct attacks. Conversely, indirect attacks involve malicious input residing in the database, enabling
interference with LLM-generated SQL queries and potentially undermining the effectiveness of
these defenses. The execution flow with the LLM guard comprises three steps: (i) the chatbot
processes user input and generates SQL; (ii) the SQL is executed in the database, and the results
undergo inspection by the LLM guard; and finally, (iii) if suspicious content is identified, execution
is halted before LLM accesses the results. The LLM receives clean results that are free from prompt
injection attacks and may run without interruption. Liu et al. performed an empirical study to
comprehensively formalize and benchmark several prompt injection attacks and defenses [148]. In
their study, they found the prevalent prevention-based and detection-based defense techniques are
insufficient to mitigate the risks of advanced optimization-based attacks, e.g., JudgeDeceiver [236].
Defense Against Jailbreaking Attacks. Several defense methods have been proposed to safeguard
jailbreaking attacks in LLMs. As a built-in safety mechanism, pre-processing-based techniques,
detecting and blocking the inputs or outputs, and semantic content filtering have been employed
to prevent generating undesired or inappropriate contents from LLMs, which could effectively
mitigate potential harm [159]. Kupmar et al. [125] proposed an approach to apply a safety filter on
the sub-strings of input prompts, which provides certifiable robustness guarantees. The drawback
of this approach lies in the method’s complexity, which increases proportionally with input prompt
length. Wu et al. [285] propose a system-mode self-reminder to defend against jailbreaking attacks
under the pretending or role-playing scenarios, which can drastically reduce the jailbreaking suc-
cess rate from 67.21% to 19.34%. It is a technique to assist ChatGPT in remembering or focusing
on particular actions, ideas, or behaviors when it is asked to generate inappropriate content [285].
One potential simple defense strategy is to identify the presence of “red-flagged” keywords [276]
which strictly violates the usage policies of the LLM vendors, e.g., OpenAI [54]. Recently, Zhang et
al. proposed a goal prioritization technique that prioritizes generating harmless responses over
helpful ones at the inference phase [320]. This technique can significantly reduce the success rate
of jailbreaking attacks in ChatGPT and Llama-2. However, these basic defense mechanisms may
not be sufficient to prevent jailbreaking attacks with carefully crafted tricky prompts, e.g., privilege
escalation. Furthermore, preventing these attacks still poses a significant challenge because the
effective defenses may impair model utility [135]. By far, SmoothLLM is an effective defense strategy
against existing jailbreaking attacks proposed by Zou et al. [334]. It can be applied for mitigating
the instruction-based attacks, e.g., AutoDAN [144], DAN [234], and PAIR [26]. The fundamental
concept of SmoothLLM is partially inspired by the randomized smoothing within the adversarial
robustness community [44]. It involves a two-step process. First, copies of a specified input prompt
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:26 Das, et al.
are duplicated and perturbed. Subsequently, the outputs produced for each perturbed copy are
aggregated. The desiderata encompass four key properties: attack mitigation (reduces attack success
rate 100 times and 50 times for Llama-2 and Vicuna respectively), non-conservatism, efficiency (in
terms of computational resources), and compatibility (different LLMs). For preventing the advanced
attack methods, such as multi-step jailbreaking (MSJ) [132], and MasterKey [54], unsafe prompt
detection and filtering technique may effectively refuse to provide the inappropriate response
intended by the adversary. On the other hand, self-reminder-based methods [288] might create
robust defense against more complex attacks such as DeepInception [137]. However, the efficacy
of these mitigation techniques has not yet been evaluated based on these specific attack methods
mentioned. Meta introduced Llama Guard, an input-output tool to safeguard LLMs [100], designed
to effectively mitigate jailbreaking attacks. Llama Guard incorporates a safety risk taxonomy and
the applicable policy for data collection and training the tool. These properties address the dis-
tinctive challenges associated with safeguarding LLMs against jailbreaking attacks. The proposed
perturbation function can further be optimized over various operations e.g., insertion and swaps
to make stronger defenses. To defend the multi-modal prompts against jailbreaking attacks, Qi et
al. recently proposed DiffPure [184], a diffusion model-based countermeasure against the visual
jailbreaking examples.
Defense Against Backdoor Attack. Most of the existing research, such as the removal of back-
doors by fine-tuning [227], model pruning [143], and detecting backdoors by inspecting activations
[27] are based on backdoor defenses in the white-box setting. Fine-mixing is a mitigation approach
designed to prevent backdoors in fine-tuned LMs. It utilizes pre-trained weights through two
complementary techniques: (i) a two-step fine-tuning procedure that first combines backdoored
weights that have been optimized using pre-trained weights on poisoned data, and then refines
the combined weights on a small collection of clean data; (ii) an Embedding Purification (E-PUR)
method, addressing potential backdoors in word embeddings [319]. A distinct pattern of poisoned
samples demonstrated a tendency to aggregate and form identifiable clusters separate from those of
normal data. Building upon this observation, a defense technique named CUBE has been proposed
[49]. It utilized a density clustering algorithm called HDBSCAN to accurately discern clusters
within datasets and distinguish the poisoned samples from clean data. By leveraging the capabilities
of HDBSCAN [164], CUBE aims to provide an effective means of differentiating clusters associated
with both normal and poisoned data. Strategies to defend the backdoor attacks in the black-box
setting are still lacking [117]. Perturbation-based and perplexity-based defense methods are also
adopted in the literature [176, 201, 302], e.g., RAP which leverages word-based Robustness-Aware
Perturbation (RAP) to identify poisoned samples [302] and ONION which eliminates trigger words
via empirical analysis of sentence perplexities [201]. These two methods can be applied to de-
fend against manually designed backdoor triggers such as ProAttack [321]. Masking-Differential
Prompting (MDP) serves as an efficient, lightweight, and adaptable defense method against back-
door attacks in prompt-based language models (PLMs), particularly in few-shot learning settings
[291]. MDP exploits the observation that poisoned samples exhibit increased sensitivity to random
masking compared to clean samples. When the trigger is (partially) masked, the language mod-
eling probability of a poisoned sample tends to exhibit significant variations. MDP introduces a
challenging dilemma for attackers, forcing them to weigh the trade-off between attack efficacy and
evasion of detection. However, MDP falls short on several other PLMs (e.g., GPT-3 [20]) and NLP
tasks (e.g., paraphrases and sentence similarity [74]). Moreover, further studies can be performed to
evaluate the performance of random masking-based defense in PLMs when the available data are
even scarcer (e.g., one or zero-shot settings [260], [268]). Again, MDP was proven to be effective
for the earlier backdoor attacks, however, it might not safeguard some advanced attacks such
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:27
as BadPrompt [21] and BToP [295]. Leveraging techniques such as knowledge distillation [92],
outlier-filtering [295], and fine-pruning [143] requires further exploration to mitigate the impacts
of backdoor attacks, such as BadPrompt [21], and BTop [295]. In order to defend against clean label
backdoor triggers, such as LLMBkd [309], a trigger classification approach called REACT [309] has
performed significantly better than ONION [201] and RAP [302]. For the reasoning-based backdoor
attacks, e.g., BadChain [292], shuffling model input might be a potential defense mechanism. Addi-
tionally, backdoors detection [293] (CBD), including backdoor patterns in training data to ensure
robustness [275] (RAB) can be used to prevent backdoor attacks. Detecting anomalies in input data
and parameter decontamination are also used to prevent the attacks executed during the training or
fine-tuning phase, e.g., BadAgent [272]. Though these defense techniques are primarily proposed
for detecting backdoors and mitigating the effects of attacks, most of them are evaluated either on
LMs or in other domains, e.g., computer vision. The impact of these mitigation techniques has not
yet been evaluated on recently evolved LLMs, e.g., GPT-4 and Llama-3.
Defense Against Data Poisoning Attack. Few solutions exist to defend against poisoning attacks
in LLMs. In general, techniques such as data validation, filtering, cleaning, and anomaly detection
have been used to protect ML models from poisoning attacks [191]. A detection and filtering
approach was designed to identify and filter poisoned data which is collected for performing
supervised learning [15]. Empirical assessments have demonstrated that limiting the number of
training epochs is a straightforward method for LMs to reduce the impact of data poisoning such as
RoBERTa [262]. Identifying poison examples using perplexity can be another technique for small
GPT-2 model [204] for sentiment analysis tasks. However, it may not identify poisons effectively,
specifically, after inspecting the training data, less than half of the poisoned examples can be
identified. Identifying these poisoned examples by BERT embedding distance is another method for
defending this attack [262]. Filtering poisoned samples during training is also used in LMs to defend
against data poisoning attacks [263]. Yan et al. proposed a framework called ParaFuzz for poisoned
sample detection for NLP models that leverages the interpretability of model predictions [298].
It adopts a software testing technique called fuzzing to distinguish poisoned samples from clean
samples. The poisoned data points are often outliers in the training data distribution. Compared to
benign training data, a model requires more time to learn the features of poisoned data. To defend
against trojaning attacks in LMs, like, TROJANLM [316], one possible defense technique is to adopt
existing techniques from other domains such as images, e.g., STRIP [75], detecting trigger-embedded
inputs at inference phase [27], [39], and finding suspicious LMs and retrieving triggers during the
model evaluation phase [264], [31]. Dataset curation techniques, such as removing near-duplicate
poisoned samples, known triggers and payloads, and identifying anomalies, can help defend against
the attacks performed by manually inserted poisoned samples in training data [45]. These methods
are potentially effective against poisoning attacks, e.g., you auto-complete me [222], AutoPoison
[240], and TrojanPuzzle attack [5]. Fine-pruning can also be used as a defense against these attacks
[143]. For the white-box attacks such as AgentPoison [34], perplexity filtering and query rephrasing
[125] are utilized in the literature. On the other hand, advanced attacks, such as NightShade [18],
need advanced defense methods.
The aforementioned defense strategies are mostly designed for LMs, but some of them can be
potentially applied to LLMs as well, however, it still lacks in-depth research studies to develop
efficient defense techniques to protect LLMs from data poisoning attacks. Moreover, empirical
reports have shown that LLMs are becoming more vulnerable to data poisoning attacks, where
defenses based on filtering data or lowering model capacity only offer minimal protection at the
cost of reduced test accuracy [263]. Therefore, it requires effective defense methods that can make
trade-offs between model utility and the capability of protecting LLMs from data poisoning attacks.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:28 Das, et al.
Defense Against Membership Inference Attack. In order to mitigate MIA in the language
domain, several mechanisms are proposed, including dropout, auto de-identification [257] model
stacking, differential privacy [2], and adversarial regularization [180]. Salem et al. came up with
the first effective defense mechanism against MIA [218]. Their approach included dropout and
model stacking. In each training iteration of a fully connected neural network model, dropout
is defined as the random deletion of a certain proportion of neuron connections. It can mitigate
overfitting in DNNs, which is a contributing factor to MIA [218]. However, this technique works
only when a neural network is targeted by the attack model. To work with other target models,
they proposed another defense technique referred to as model stacking. The idea behind this
defense is if distinct parts of the target model undergo training with different subsets of data, the
overall model is expected to exhibit a lower tendency of overfitting. It can be achieved through
the application of model stacking, one of the popular ensemble learning techniques. To defend
against black-box MIA on ML models, Jia et al. proposed MemGuard [109] which is essentially a
noise perturbation mechanism for the predicted confidence score of the target model. It makes
difficult for the attacker to infer whether a sample was part of the training data. However, this
technique has been evaluated for image and numeric datasets. The performance of this technique
on LMs/LLMs still remains unexplored, which necessitates further studies to evaluate whether
it can effectively defend against MIA for LM/LLMs. Differential privacy (DP) based techniques
are also widely used to prevent privacy leakage by MIA [161], [136]. It includes data perturbation
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:29
and output perturbation [95]. Models facilitated with differential privacy employed with the sto-
chastic gradient descent optimization algorithm [61] can reduce empirical privacy leakages while
ensuring comparable model utility in the non-DP environment [103]. Model pruning [86], and
knowledge distillation [92] are employed to mitigate the impacts of preference-based MIA, such
as PREMIA [65]. A recent framework, InferDPT [254], has been proposed to leverage black-box
LLMs to facilitate privacy-preserving inference, which effectively integrates DP in text generation
tasks. Another defense method against MIAs is to include regularization during the training of the
model. Regularization refers to a set of techniques used to prevent overfitting and improve the
generalization performance of an ML model. Label smoothing [250] is one kind of regularization
method that prevents overfitting of the ML model, which contributes to MIA [308]. Very few defense
techniques have been proposed for LLMs [72], [136]. Most of the existing defense techniques have
been experimented on relatively small LMs, such as test classifiers [271], which are not evaluated
for LLMs. Moreover, DP-based defenses may impair model utility. We argue that there is a pressing
need for further research studies to develop effective defense techniques against MIA on LLMs.
Defenses Against PII Leakage Attacks. To mitigate the leaking of personal information from
PLMs due to memorization [23], there are several general techniques. During the pre-processing
phase, the process of de-duplication and training data curation [45] has the potential to significantly
decrease the amount of memorized text in PLMs. Consequently, this results in a reduction of stored
personal information within these models [130]. Prompt-tuning can also be a potential mitigation
against memorization. The basic idea is to optimize prompts in the attack environment to evaluate
the capability to extract the memorized content in a target model. DP can efficiently mitigate the
impact of TAB [102] and KART [177] by minimizing the number of unique training sequences leaked
by Transformer-based language models. Both TAB and KART require the back-box access of the
model, such as top-k predictions; therefore, restricting the model access by imposing API hardening
technique [102] can also be a potential mitigation technique for these kinds of attacks. DP can also
mitigate the attacks that utilize hand-crafted prompts with true prefix [178]. PII masking [157] can
also be a viable technique to defend against PII leakage attacks. Personal information identification
and filtering methods, such as [45] and [211], may effectively reduce the number of training data
samples extracted through PII leakage attacks, e.g., ProPILE [122]. However, manually checking
the vast training data for LLMs is challenging and labor-intensive. De-duplication at the document
or paragraph level is common but may not eliminate repeated occurrences of sensitive information
within a single document. Advanced strategies for de-duplication and careful sourcing of training
data are essential. Despite sanitization efforts, complete prevention of privacy leaks is challenging,
making it a first line of defense rather than a foolproof measure. In training, following the process
of Carlini et al. [23] and the implementation by Anil et al. [11], the deferentially private stochastic
gradient descent (DP-SGD) algorithm [2] can be employed to ensure privacy of training data during
the training process [23], [101]. However, the DP-SGD-based method might not work efficiently as
it has a significant computational cost and decreases the trained model utility [312]. PII scrubbing
filters dataset to eliminate PII from text [56], such as leveraging Named Entity Recognition (NER)
[127] to tag PII. Even though PII scrubbing methods can mitigate PII leakage risks, they face two
critical challenges [152]: (1) the effectiveness of PII scrubbing may be reduced to preserve the
dataset utility, and (2) there is a risk of PII not being completely or accurately removed from the
dataset. In downstream applications like dialogue systems [318] and summarization models [93],
LMs undergo fine-tuning on task-specific data. While this process may lead to the LM “forgetting“
some memorized data from pre-training [163], [210], it can still introduce privacy leaks if the
task-specific data contains sensitive information.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:30 Das, et al.
The aforementioned defense techniques mostly apply to the LMs. To date, we have not yet
identified specific techniques that are dedicated to the LLMs. While according to some recent
studies ([131], [306]), the existing strategies can be used in the LLM context as well. So far, there
is a pressing need for more empirical evaluations to determine their effectiveness for LLMs. On
top of that, there are no efficient defense techniques introduced to defend against a few attack
methods, such as KART [177], ProPILE [122], and the recovery of forgotten PII by fine-tuning due
to memorization [33]. Therefore, it requires in-depth studies and understanding to design effective
defense techniques against PII attacks on LLMs.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:31
is then refined by the latest LLM, preserving its learning ability. High-quality synthetic outputs are
chosen for future rehearsals to mitigate catastrophic forgetting.
Cybercrime and Social Issues. LLMs can potentially be used in various cybercrime [124], e.g.,
phishing (efficiently create targeted scam e-mails), malware, and hacking attacks (hackers have
used ChatGPT to write malware codes). LLMs pose risks of perpetuating unfair discrimination and
causing representational harm by reinforcing stereotypes and social biases. Harmful associations
of specific traits with social identities may lead to exclusion or marginalization of individuals
outside established norms [280]. Additionally, toxic language generated by LLMs may incite hate
or violence and cause serious offense. These risks are largely rooted in the selection of training
corpora that include harmful language and disproportionately represent certain social identities.
Transportation. In the transportation domain, studies reported that LLM can be biased (while doing
accident report analysis), and inefficient for performing tasks in self-driving cars [253]. Furthermore,
it might leak personal data from self-driving cars while doing accident report automation, and
accident information extraction [325]. A framework has been proposed named VistaGPT to deal
with the problems caused by information barriers from heterogeneity at both system and module
levels in a wide range of heterogeneous vehicle automation systems [252]. It leverages LLMs to
create an automated composing platform to design end-to-end driving systems. This involves
employing a “dividing and recombining” strategy to enhance the ability to generalize. To alleviate
the issue of the long training time of LLMs with large datasets and high computing resource
requirements, Meta-AI’s Llama focuses on fine-tuning offline pre-trained LLMs to handle the
transportation safety domain tasks. The main objective is to create a specialized LLM capable
of generating an accurate, context-sensitive, and safety-aware model, that work effectively in
traffic-related scenarios [328].
Healthcare and Medicine. The high risks associated with LLMs in the context of healthcare
suggest that their integration into the healthcare system is presently inadvisable, as proposed by
De et al. [53]. Models trained on extensive Internet data lacking rigorous filtering mechanisms may
inadvertently incorporate misinformation, biased content, and harmful materials alongside accurate
and fair information, thereby posing significant risks in healthcare applications. The potential
consequences of erroneous treatment or medication recommendations by LLMs are particularly
concerning. Moreover, the probabilistic nature of LLMs introduces variability in responses to the
same task, giving rise to challenges in reliability and reproducibility that necessitate continuous
human oversight. Privacy concerns, especially regarding sensitive health records, coupled with
broader considerations such as AI ethics principles, safety, transparency, explainability, equity, and
sustainability, further emphasize the need for caution in deploying LLMs within the healthcare
domain, as discussed by Harrer et al. [87].
Education. The use of LLMs, e.g., ChatGPT, in education is associated with significant drawbacks,
particularly in fostering inaccurate concept learning and an inappropriate approach to education.
ChatGPT, being a language model trained on diverse Internet data, may unintentionally propagate
misinformation or present concepts with a lack of precision and educational rigor, for instance,
scientific misconduct [176]. The excessive dependence on LLMs by both educators and learners can
have serious adverse effects. Students engaging with ChatGPT may encounter misleading content
or promote misconceptions, potentially compromising the quality of their learning experience. The
absence of real-time fact-checking and the model’s susceptibility to biases and errors pose risks,
potentially leading learners astray and impeding their overall educational progress. Consequently,
caution is recommended when relying on ChatGPT as an educational tool without appropriate
supervision and verification [226], [170].
Governance. The potential misuse of LLMs in governance for spear phishing presents significant
cybersecurity challenges. Using GPT-4 as an illustrative example, personal information of British
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:32 Das, et al.
members of parliament (MP) was extracted from Wikipedia, and GPT-3.5 was utilized to generate
biographies, which were then incorporated into phishing emails sent to official email addresses
[88]. This highlights the risks associated with misinformation, biased content, and the utilization of
LLMs in AI-based cyberattacks within governance. The leakage of confidential information through
such attacks can pose severe consequences for national security. The generation of misinformation
and hate speech by LLMs further emphasizes the existing challenges, underscoring the imperative
need for robust safeguards and countermeasures to address the risks related to the usage of these
models in governance settings [88].
Science. Hallucinations, biases, and paradigm shifts are pressing concerns of LLMs in the science
domain. There is a risk of LLMs generating non-existent and false content. For instance, Meta
developed an LLM named Galactica for reasoning scientific knowledge. That was reported to
generate major flaws due to reproducing biases and presenting falsehoods [46]. As a result, the model
was shut down just after launching public access [91]. Another concern lies in the involvement
of LLMs in the scientific discovery process. It is challenging to interpret and understand LLMs
due to their black-box nature, raising doubts about their reliability and trustworthiness in the
science domain. For example, peer-review reports generated by LLMs may misinterpret research
articles, which may impair the peer review quality [3]. Moreover, collaborating with LLMs won’t be
fundamentally the same as collaborating with other researchers or experts in a corresponding field
[16]. Clear principles of using these LLMs and/or other AI tools in scientific explorations should be
established to ensure transparency, fairness, and trustworthiness [16].
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:33
Fig. 9. Overview of the advancements of the attack methods, defense mechanisms in LLMs, their limitations,
and future research directions.
collection time-frame. This demonstrates the dynamic nature of adversarial strategies, with the
understanding that new, optimized prompts may emerge even after the initial data collection phase.
Moreover, most of the existing studies on jailbreaking attacks are primarily focused on ChatGPT
[54, 132, 234]. It remains unclear whether potential vulnerabilities exist in other LLMs, such as
Vicuna [38], Bard [158], and Bing Chat [165]. For example, MasterKey [54] may not achieve
comparable performance on Vicuna as it does on ChatGPT. For prompt-based attacks, manual
prompts are often more effective than automated prompt-based attack methods. As automated
attack methods are primarily designed for generalized tasks, they may not demonstrate same
effectiveness for specific tasks as manually crafted prompts. The unlabeled and imbalanced real-
world data further complicate the development of effective automated prompting-based attack
methods [238]. For backdoor attacks, most of them are focused on classification or similar tasks
[21], e.g., sentiment analysis and opinion classification. More attention should be paid to the attacks
that perform other NLP tasks, including question answering, text summarization, and language
translation.
The underlying philosophy behind privacy attacks lies in the correlation between the level of
accessibility an adversary holds and its ability to extract sensitive information or exert control over
target victim LLMs. More access leads to a broader potential for the adversary to recover sensitive
data or influence the target LLM [131]. For instance, when the adversary has access only to the
black-box model, the attacker might be able to leverage training data extraction attacks to recover
a limited set of private data. However, if the adversary is granted white-box information such
as gradients, the attacker can leverage this extra information to accurately recover more private
training instances [55]. This expanded access can facilitate various privacy attacks, including
attribute inference attacks, embedding inversion, and gradient leakage attacks. In gradient-based
attacks, the adversary needs the white-box information of a model, which is sometimes impractical
in real-world practice. Moreover, most of the existing privacy attacks are designed for vision models.
A limited number of studies have reported gradient leakage attacks specifically on language models,
e.g., the LAMP attack [14]. In essence, the increased access empowers adversaries to perform
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:34 Das, et al.
more sophisticated and targeted privacy attacks against LLMs, potentially compromising sensitive
information or gaining access to the internal model architecture. The early MIAs were based on
the white-box access assumption [239], which is sometimes impractical in real-world deployment.
The evaluation dataset for some attacks, e.g., ProPILE was built solely from private information
available in open-source datasets provided by major corporations, ensuring ethical data acquisition
[122]. However, it is crucial to note that the heuristic data collection process might potentially lead
to instances of bias, disassociation, or noise. This adds uncertainty and potential inaccuracies in
the benchmark dataset, requiring attention when interpreting the results.
However, most attack methods (e.g., BadGPT, BadPrompt, and Trojaning attacks) described in
the existing studies are designed for only relatively small NLP models. Only a few are tested on
LLMs (e.g., ProPILE, DAN, and JAILBREAKER). Also, the high cost of accessing commercialized
LLMs, such as GPT-3.5 or upper versions, contributes to the lack of attack evaluations on LLMs.
Besides, the in-depth vulnerability analysis in terms of privacy attacks and security issues is still
lacking for LLMs. One of the reasons can be attributed to the limited number of performance
evaluation metrics (e.g., perplexity) in the language domain to comprehensively evaluate attack
and defense effectiveness. Also, in the FL environment, it requires very high computational power
to train LLMs with such large datasets [247]. It is an open research challenge to develop effective
attack and robust defense methods along with the proper evaluation techniques for LLMs. The
correctness of LLM-generated content has always been a major concern in this area of research.
Due to the knowledge gaps, i.e., missing or outdated information might always be present in LLMs.
The problem of hallucination in LLM has been investigated and evaluated from the knowledge-gap
perspective but is yet to be investigated from other perspectives such as safety, i.e., abstaining to
generate harmful content [25].
For LLMs, comprehensive exploration of their vulnerabilities under security and privacy attacks
remains an essential area of study. Future research should examine the applicability of well-
established attack methods from other domains, such as computer vision, to LLMs. In addition to
existing methods, novel attack techniques should be developed to comprehensively inspect various
vulnerabilities of LLMs, potentially targeting multiple objectives with a single attack approach.
Moreover, special attention should be given to developing appropriate metrics for evaluating the
impact of vulnerabilities related to security and privacy attacks in LLMs. Explainable AI (XAI) can
play a vital role in this domain by increasing transparency and explainability within attack systems,
allowing for a better understanding of LLM vulnerabilities. Developing XAI techniques to enhance
the interpretability of LLM vulnerabilities is another essential research direction for advancing the
security and privacy of LLMs.
Existing Defense Mechanisms, Challenges, and Future Research Direction: To mitigate vul-
nerabilities posed by security and privacy attacks, various defense techniques have been proposed in
existing research. Instruction defense [219], paraphrasing [104], and re-tokenization [104] methods
are commonly employed to defend against prompt injection attacks. To defense jailbreaking attacks,
SmoothLLM [215], LLM Guard [6], and perplexity filtering [78] are widely used. Data curation
techniques [299], including filtering poisoned content [263] and detecting trigger-embedded in-
puts [27], are widely used to mitigate data poisoning attacks and backdoor attacks. On the other
hand, DP-SGD [2] is a widely adopted technique to defend against gradient leakage attacks, MIAs,
and PII leakage attacks. In particular, noise perturbation [331] is a common strategy to defend
against gradient leakage attacks. Various techniques are prevalent to mitigate MIA in LLMs, includ-
ing model pruning [143] and knowledge distillation [92]. To prevent PII leakage attacks, training
data curation [299] and restricting attacker access to training data or model [102] serve as viable
countermeasures.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:35
For defense, studies reported that ChatGPT’s safety protections are good enough to prevent single
jailbreaking prompts however, it is still vulnerable to multi-step jailbreaking [132]. Moreover, the
new Bing AI chatbot [169] is more vulnerable to these direct prompts. System-mode self-reminder
defense techniques are inspired by the human-like reasoning capabilities of LLMs [291]. The more
discerning question regarding LLM reasoning processes with or without self-reminder remains
unsolved. To acquire a comprehensive understanding of the reasoning processes of large neural
networks, more in-depth investigation is highly required. Although the side effects of self-reminder
have been explored on typical user queries across various NLP tasks, evaluating its effect on any
type of user query poses a challenge, which makes it difficult to fully understand its impact on user
experience [285].
Considering the shortcomings mentioned, developing more flexible self-reminding systems
and expert frameworks that improve safety, trustworthiness, and accountability in LLMs without
compromising effectiveness can be a fundamental research challenge to protect LLMs from jail-
breaking attacks. Furthermore, individuals with malicious intent are highly active in online forums,
sharing and discussing new strategies. Frequently, they keep these exchanges private to evade
detection. Consequently, it is essential to conduct further research and studies aims to identify
and implement effective defense strategies to mitigate the risks posed by the latest jailbreaking
attacks. Efficient strategies for defending against backdoor attacks in a black-box environment are
still lacking [117]. Existing defense mechanisms ([249], [201], [230]), for specific learning tasks
in LMs are not evaluated for the other learning tasks like, text summarizing, and prompt-based
learning. Moreover, it is found in the literature that prompt-based PLMs are highly susceptible
to textual backdoor attacks [295], [60]. Addressing the challenge of textual backdoor attacks in
prompt-based paradigms, particularly in the few-shot learning setting [273], is another unresolved
challenge. MDP (Masking-Differential Prompting) defense [291] faces challenges in various NLP
tasks like paraphrasing and sentence similarity [74]. Its performance under few-shot learning
remains uncertain due to a lack of practical evaluation. While MDP has demonstrated strength
against earlier backdoor attacks, it may not be effective against recently introduced attacks like Bad-
Prompt [21] and BToP [295]. Perplexity-based methods and filtering-based methods may not work
well when attackers use synonymous trigger keys [96]. Furthermore, developing dynamic defense
methods considering the above factors is a challenging future task. Currently, the predominant
focus of investigation on backdoor attacks revolves around text classification in LLMs. However, a
notable gap exists in the literature concerning investigations into backdoor attacks on various tasks
for which LLMs find widespread application, e.g., text summarization and text generation [300].
Understanding and addressing backdoor attacks in various tasks for which LLMs are employed
is crucial for developing effective defense mechanisms and ensuring secure deployment of LLMs.
While poisoning attacks on ML models have been investigated in the literature [175], there is not
yet an effective solution for several attack methods, including ProAttack[321] and Badprompt[21].
Further research in diverse tasks and models can enhance the knowledge and understanding of
the security impacts of LLMs, as well as facilitate the development of robust and trustworthy LLM
systems. Defense techniques, such as dataset cleaning, and removing near duplicate poisoned
samples and anomalies, sometimes slow down the model development process in order to defend
against data poisoning attacks. Other defense methods, e.g., stopping training after certain epochs,
achieve a moderate defense against poisoning attacks but degrade the model utility [263]. Gradient
perturbation [95] and DP-SGD-based methods [2] are frequently used to defend against privacy
attacks in LLMs. It can prevent the private training data from being leaked based on the parameter
configurations at a small cost of model utility. Limiting the accessibility to the model and generating
limited prediction results might be another option [312]. Extensive research studies can obtain
proper knowledge of to what extent algorithmic defenses such as differential privacy can prevent
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:36 Das, et al.
PII disclosure without compromising model utility. In the post-processing phase, for API-access
models such as GPT-3, it is advisable to integrate a detection module that examines the output text
to identify sensitive information. If sensitive content is detected, the system should either decline
to provide an answer or apply masks to safeguard the sensitive information [98]. Also, for image
models, a recent study has demonstrated that adding a standard level of random noise into the
gradient update might not always work well to prevent gradient leakage attacks on medical images
[52]. Recently, an open language model (OLMo) has been introduced to provide open access to the
data, code, and model [242]. The main purpose of OLMo is to facilitate open research on language
models. They performed PII filtering to remove it from the data, and they also provided a tool to
remove PII data upon request. Though it followed existing practices to obscure the PII exposure
and identify and remove toxic content, it did not explicitly discuss the impacts of various attacks
outlined in this survey paper and how to mitigate those risks. Secure multi-party computation [47]
can be another way to defend against privacy attacks in LLMs, which can be explored in future
research endeavors. Considering the above limitations of existing defense techniques in LLMs,
developing a defense mechanism for these privacy attacks for LLMs would be an imperative task.
Future research initiatives for enhancing security and privacy in LLMs can be directed toward
several key areas. An ideal defense method should be able to effectively achieve a balance between
model utility and security & privacy protection. The development of real-time privacy monitoring
systems is essential to improve the resilience of privacy-preserving LLMs. This necessitates the
exploration of robust detection techniques against various security and privacy attacks. Furthermore,
a thorough evaluation of less-explored defense techniques, such as secure multi-party computation
(SMPC), is necessary to assess their effectiveness against LLM vulnerabilities. Finally, leveraging
the capabilities of XAI can improve transparency in LLM defense mechanisms.
Role of explainable AI in Enhancing Security and Privacy of LLMs: LLM’s ability to make de-
cisions on various learning tasks is often criticized due to its black-box nature, e.g., non-interpretable
weights. It makes it more challenging for the new practitioners and developers of this field, as it
hinders their ability to clearly interpret and understand its application, particularly in critical cases.
Explainable AI (XAI) can help to bridge this gap by developing methods to interpret and explain
complex LLM systems, the decision-making process, and outputs [22]. Conducting studies and
research to build explainable methods is highly essential to make the usage of LLMs more reliable
and trustworthy. XAI can provide transparent insights into the inference process and the dynamic
weight assignment by the attention mechanism of the LLMs, which enhances interpretability by
highlighting the most influential input features that contribute to its predictions. This explainability
makes the model more trustworthy and ensures that its output is explainable. Additionally, it
facilitates error analysis and bias detection. XAI has shown its potential in several real-world AI
applications such as anomaly detection [1, 8], cyber-security [216], and enhancing data privacy
[63]. In ML, several popular methods exist such as Local Interpretable Model-agnostic Explanations
(LIME) [214], and SHapley Additive exPlanations (SHAP) [153], which offer insights into model
behavior without requiring access to internal model components. These methods analyze the
association between inputs and outputs to identify features that most significantly contribute
to model predictions. The explainability of code generation models such as CodeBERT [67] and
GraphCodeBERT [82] were proposed to understand code syntax and semantics [304]. Also, the
ad-hoc explanation further clarifies the model’s decision [142].
The role of XAI in the security and privacy aspect of LLM is crucial. Specifically in its development
phase, having prior knowledge about the attackers’ ability (e.g., black-box access to the model,
injecting poisonous instances to the training/fine-tuning data) and the vulnerable components
(training/fine-tuning data and the model itself) of LLM systems will assist in developing robust
techniques against security and privacy attacks. However, XAI may expose critical aspects of the
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:37
LLMs, including their architecture, components, and the dynamic weight allocation by attention
mechanism that contribute to predictions. Such disclosures may increase the LLM’s vulnerability
to security and privacy attacks. For example, revealing the white-box nature of the model may
facilitate the attacker’s access to the model, increasing the risk of gradient leakage attacks or MIA
[63]. Recent studies have claimed that explainability in LLMs may raise further security concerns,
especially with insidious backdoor attacks [35]. Lin et al. proposed an XAI approach that can
identify the triggers (backdoor attack) that mislead the model to contribute to error classification
[141]. Li et al. developed an XAI technique to identify and quantify the influence of raw data features
on successful MIAs. This approach analyzes the data distribution to identify the influential neurons
contributing to compromising private data and subsequently trains an MIA ensemble model using
attack features derived from the selected neurons [141]. XAI techniques in image models (e.g.,
Grad-CAM [224]) may compromise model privacy under various attacks, such as gradient leakage
attacks. In some cases, XAI enables more accurate reconstruction of private training data compared
to models that rely solely on predictions [324]. XAI-aware model extraction attack (XaMEA) was
proposed to exploit spatial knowledge from decision explanations [297]. It illustrated that the
transparency provided by XAI may facilitate the attacker’s access to the model, making it easier to
exploit it. This increased explainability can make the model more vulnerable to model extraction
attacks compared to prediction-only models. However, the vulnerabilities mentioned above are
mostly explored for DNNs. In LLMs, the vulnerabilities in the XAI context have been explored
to a limited extent. Moreover, there are several unique characteristics of LLMs that are different
from DNNs, e.g., large-scale data and model parameters, task-agnostic, and semantic language
understanding, which makes it more challenging to design XAI methods for better interpretations.
The comprehensive interpretation of LLM vulnerabilities under XAI, including backdoor attacks,
membership inference attacks (MIA), and model extraction attacks, has yet to be fully explored.
Additionally, vulnerabilities related to prompt injection and jailbreaking attacks in the context of
XAI still remain unexplored. Furthermore, open-sourced LLMs are publicly accessible and may
provide greater explainability than closed-source models. However, this enhanced accessibility can
make them more vulnerable to attacks when XAI techniques are employed. Thus, these LLMs may
be more susceptible to attacks due to XAI. Further research is essential to fully understand the
extent and impact of attackers’ capabilities under XAI. Additionally, mitigation techniques should
be developed in accordance with the identified risks and impacts.
9 CONCLUSION
LLMs lend themselves as strong tools for comprehending complex linguistic patterns and generating
logical and contextually coherent responses. However, such powerful models also entail potential
privacy and security risks. In this survey, we first provided a detailed overview of LLMs’ security
and privacy challenges. We then discussed and analyzed the LLM vulnerabilities from both security
and privacy aspects, existing mitigation and defense strategies against these security attacks and
privacy attacks, as well as highlighted their strengths and limitations. In our investigation, we
found that LLMs are highly vulnerable to the discussed attacks. According to our survey, there are
a limited number of mitigation techniques to prevent those attacks against LLMs. The existing
mitigation techniques that are applicable to relatively small LMs could potentially be used for
LLMs. However, extensive research studies should be performed to evaluate and tailor the existing
solutions to LLMs. Based on our analysis, we also outlined future research directions focusing on
security and privacy aspects, pointed out key research gaps, and illustrated open research problems.
The overarching goal is to enhance the reliability and utility of LLMs through comprehensive
exploration and resolution of these vulnerabilities and offer pathways for future research toward
secure and privacy-preserving LLM systems.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:38 Das, et al.
REFERENCES
[1] 30dascoding. 2024. Unraveling the Mysteries of Anomaly Detection: The Power of Explainable AI. Available Online:
https://30dayscoding.com/blog/explainable-ai-in-anomaly-detection [Accessed on October 31, 2024].
[2] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep
learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications
security. 308–318.
[3] David Leslie & Sandra Wachter Abeba Birhane, Atoosa Kasirzadeh. 2023. Science in the age of large language models.
doi.org/10.1038/s42254-023-00581-4 (2023).
[4] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida,
Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774
(2023).
[5] Hojjat Aghakhani, Wei Dai, Andre Manoel, Xavier Fernandes, Anant Kharkar, Christopher Kruegel, Giovanni Vigna,
David Evans, Ben Zorn, and Robert Sim. 2023. TrojanPuzzle: Covertly Poisoning Code-Suggestion Models. arXiv
preprint arXiv:2301.02344 (2023).
[6] Protect AI. 2024. LLM Guard - The Security Toolkit for LLM Interactions. Available Online: https://llm-guard.com/
[Accessed on October 11, 2024].
[7] Alex Albert. 2023. Jailbreak Chat. Available Online: https://www.jailbreakchat.com [Accessed on January 28, 2024].
[8] Tarek Ali. 2024. Next-generation intrusion detection systems with LLMs: real-time anomaly detection, explainable AI,
and adaptive data generation. Master’s thesis. T. Ali.
[9] Gabriel Alon and Michael Kamfonas. 2023. Detecting language model attacks with perplexity. arXiv preprint
arXiv:2308.14132 (2023).
[10] Thimira Amaratunga. 2023. Threats, Opportunities, and Misconception s. In Understanding Large Language Models:
Learning Their Underlying Concepts and Technologies. Springer, 131–148.
[11] Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, and Pasin Manurangsi. 2021. Large-scale differentially private
BERT. arXiv preprint arXiv:2108.01624 (2021).
[12] Yoshinori Aono, Takuya Hayashi, Lihua Wang, Shiho Moriai, et al. 2017. Privacy-preserving deep learning via
additively homomorphic encryption. IEEE transactions on information forensics and security 13, 5 (2017), 1333–1345.
[13] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort,
Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning
from human feedback. arXiv preprint arXiv:2204.05862 (2022).
[14] Mislav Balunovic, Dimitar Dimitrov, Nikola Jovanović, and Martin Vechev. 2022. LAMP: Extracting text from gradients
with language model priors. Advances in Neural Information Processing Systems 35 (2022), 7641–7654.
[15] Nathalie Baracaldo, Bryant Chen, Heiko Ludwig, and Jaehoon Amir Safavi. 2017. Mitigating poisoning attacks on
machine learning models: A data provenance based approach. In Proceedings of the 10th ACM workshop on artificial
intelligence and security. 103–110.
[16] Marcel Binz, Stephan Alaniz, Adina Roskies, Balazs Aczel, Carl T Bergstrom, Colin Allen, Daniel Schad, Dirk Wulff,
Jevin D West, Qiong Zhang, et al. 2023. How should the advent of large language models affect the practice of science?
arXiv preprint arXiv:2312.03759 (2023).
[17] ROB BONTA. 2023. California Consumer Privacy Act (CCPA). Available Online: https://oag.ca.gov/privacy/ccpa.
[Accessed on January 28, 2024].
[18] Samuel R Bowman. 2023. Eight things to know about large language models. arXiv preprint arXiv:2304.00612 (2023).
[19] Hezekiah J Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo
Iglesias, Ron Heichman, and Ramesh Darwishi. 2022. Evaluating the susceptibility of pre-trained language models
via handcrafted adversarial examples. arXiv preprint arXiv:2209.02128 (2022).
[20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural
information processing systems 33 (2020), 1877–1901.
[21] Xiangrui Cai, Haidong Xu, Sihan Xu, Ying Zhang, et al. 2022. BadPrompt: Backdoor attacks on continuous prompts.
Advances in Neural Information Processing Systems 35 (2022), 37068–37080.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:39
[22] Erik Cambria, Lorenzo Malandri, Fabio Mercorio, Navid Nobani, and Andrea Seveso. 2024. XAI meets LLMs: A Survey
of the Relation between Explainable AI and Large Language Models. arXiv preprint arXiv:2407.15248 (2024).
[23] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts,
Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th
USENIX Security Symposium (USENIX Security 21). 2633–2650.
[24] Computer Security Resource Center. 2023. Information Systems Security (INFOSEC). Available Online: https:
//csrc.nist.gov/glossary/term/information_systems_security. [Accessed on January 28, 2024].
[25] Edward Y Chang. 2023. SocraSynth: Multi-LLM Reasoning with Conditional Statistics. Available Online: https://www.
researchgate.net/publication/373753725_SocraSynth_Multi-LLM_Reasoning_with_Conditional_Statistics [Accessed
on February 05, 2024].
[26] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2023. Jailbreaking
black box large language models in twenty queries. arXiv preprint arXiv:2310.08419 (2023).
[27] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and
Biplav Srivastava. 2018. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint
arXiv:1811.03728 (2018).
[28] Canyu Chen and Kai Shu. 2023. Can llm-generated misinformation be detected? arXiv preprint arXiv:2309.13788
(2023).
[29] Canyu Chen and Kai Shu. 2023. Combating misinformation in the age of llms: Opportunities and challenges. AI
Magazine (2023).
[30] Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. 2024. Humans or llms as the judge? a
study on judgement biases. arXiv preprint arXiv:2402.10669 (2024).
[31] Huili Chen, Cheng Fu, Jishen Zhao, and Farinaz Koushanfar. 2019. DeepInspect: A Black-box Trojan Detection and
Mitigation Framework for Deep Neural Networks.. In IJCAI, Vol. 2. 8.
[32] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv
preprint arXiv:2107.03374 (2021).
[33] Xiaoyi Chen, Siyuan Tang, Rui Zhu, Shijun Yan, Lei Jin, Zihao Wang, Liya Su, XiaoFeng Wang, and Haixu Tang.
2023. The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risks. arXiv preprint
arXiv:2310.15469 (2023).
[34] Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024. AgentPoison: Red-teaming LLM Agents via
Poisoning Memory or Knowledge Bases. arXiv preprint arXiv:2407.12784 (2024).
[35] Pengzhou Cheng, Zongru Wu, Wei Du, Haodong Zhao, Wei Lu, and Gongshen Liu. 2023. Backdoor attacks and coun-
termeasures in natural language processing models: A comprehensive security review. arXiv preprint arXiv:2309.06055
(2023).
[36] Vivying SY Cheng et al. 2006. Health insurance portability and accountability act (HIPPA) compliant access control
model for web services. International Journal of Healthcare Information Systems and Informatics (IJHISI) 1, 1 (2006),
22–39.
[37] I Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei
Liu, et al. 2023. FacTool: Factuality Detection in Generative AI–A Tool Augmented Framework for Multi-Task and
Multi-Domain Scenarios. arXiv preprint arXiv:2307.13528 (2023).
[38] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao
Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* chatGPT quality.
See https://vicuna. lmsys. org (accessed 14 April 2023) (2023).
[39] Edward Chou, Florian Tramèr, Giancarlo Pellegrino, and Dan Boneh. 2018. SentiNet: Detecting physical attacks
against deep learning systems.(2018). (2018).
[40] Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. 2024. Comprehensive assessment
of jailbreak attacks against llms. arXiv preprint arXiv:2402.05668 (2024).
[41] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa
Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine
Learning Research 25, 70 (2024), 1–53.
[42] Benjamin Clavié, Alexandru Ciceu, Frederick Naylor, Guillaume Soulié, and Thomas Brightwell. 2023. Large Language
Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification. In International Conference
on Applications of Natural Language to Information Systems. Springer, 3–17.
[43] CLOUDFLARE. 2023. What is data privacy? Available Online: https://www.cloudflare.com/learning/privacy/what-is-
data-privacy/. [Accessed on January 28, 2024].
[44] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. 2019. Certified adversarial robustness via randomized smoothing. In
international conference on machine learning. PMLR, 1310–1320.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:40 Das, et al.
[45] Andrea Continella, Yanick Fratantonio, Martina Lindorfer, Alessandro Puccetti, Ali Zand, Christopher Kruegel,
Giovanni Vigna, et al. 2017. Obfuscation-Resilient Privacy Leak Detection for Mobile Apps Through Differential
Analysis.. In NDSS, Vol. 17. 10–14722.
[46] The Conversation. 2022. The Galactica AI model was trained on scientific knowledge – but it spat out alarmingly
plausible nonsense. Available Online: https://theconversation.com/the-galactica-ai-model-was-trained-on-scientific-
knowledge-but-it-spat-out-alarmingly-plausible-nonsense-195445 [Accessed on January 28, 2024].
[47] Ronald Cramer, Ivan Bjerre Damgård, et al. 2015. Secure multiparty computation. Cambridge University Press.
[48] Evan Crothers, Nathalie Japkowicz, and Herna L Viktor. 2023. Machine-generated text: A comprehensive survey of
threat models and detection methods. IEEE Access (2023).
[49] Ganqu Cui, Lifan Yuan, Bingxiang He, Yangyi Chen, Zhiyuan Liu, and Maosong Sun. 2022. A unified evaluation
of textual backdoor learning: Frameworks and benchmarks. Advances in Neural Information Processing Systems 35
(2022), 5009–5023.
[50] Sunhao Dai, Chen Xu, Shicheng Xu, Liang Pang, Zhenhua Dong, and Jun Xu. 2024. Bias and unfairness in information
retrieval systems: New challenges in the llm era. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining. 6437–6447.
[51] Dimitrios Damopoulos, Georgios Kambourakis, Stefanos Gritzalis, and Sang Oh Park. 2014. Exposing mobile malware
from the inside (or what is your mobile app really doing?). Peer-to-Peer Networking and Applications 7 (2014), 687–697.
[52] Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. 2023. Privacy risks analysis and mitigation in federated
learning for medical images. In 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE,
1870–1873.
[53] Luigi De Angelis, Francesco Baglivo, Guglielmo Arzilli, Gaetano Pierpaolo Privitera, Paolo Ferragina, Alberto Eugenio
Tozzi, and Caterina Rizzo. 2023. ChatGPT and the rise of large language models: the new AI-driven infodemic threat
in public health. Frontiers in Public Health 11 (2023), 1166120.
[54] Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu.
2023. MasterKey: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715
(2023).
[55] Jieren Deng, Yijue Wang, Ji Li, Chao Shang, Hang Liu, Sanguthevar Rajasekaran, and Caiwen Ding. 2021. TAG:
Gradient attack on transformer-based language models. arXiv preprint arXiv:2103.06819 (2021).
[56] develop.sentry.dev. 2023. PII and Data Scrubbing. Available Online: https://develop.sentry.dev/pii [Accessed on
January 28, 2024].
[57] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[58] Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023.
Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495 (2023).
[59] Ben Dickson. 2021. Machine learning: What are membership inference attacks? Available Online: https://bdtechtalks.
com/2021/04/23/machine-learning-membership-inference-attacks/ [Accessed on January 28, 2024].
[60] Wei Du, Yichun Zhao, Boqun Li, Gongshen Liu, and Shilin Wang. 2022. PPT: Backdoor attacks on pre-trained models
via poisoned prompt tuning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence,
IJCAI-22. 680–686.
[61] Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends®
in Theoretical Computer Science 9, 3–4 (2014), 211–407.
[62] Zekeriya Erkin, Martin Franz, Jorge Guajardo, Stefan Katzenbeisser, Inald Lagendijk, and Tomas Toft. 2009. Privacy-
preserving face recognition. In Privacy Enhancing Technologies: 9th International Symposium, PETS 2009, Seattle, WA,
USA, August 5-7, 2009. Proceedings 9. Springer, 235–253.
[63] Fatima Ezzeddine. 2024. Privacy Implications of Explainable AI in Data-Driven Systems. arXiv preprint arXiv:2406.15789
(2024).
[64] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A
survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining. 6491–6501.
[65] Qizhang Feng, Siva Rajesh Kasa, Hyokun Yun, Choon Hui Teo, and Sravan Babu Bodapati. 2024. Exposing privacy
gaps: Membership inference attack on preference data for LLM alignment. arXiv preprint arXiv:2407.06443 (2024).
[66] Zhangyin Feng, Xiaocheng Feng, Dezhi Zhao, Maojin Yang, and Bing Qin. 2024. Retrieval-generation synergy
augmented large language models. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 11661–11665.
[67] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu,
Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint
arXiv:2002.08155 (2020).
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:41
[68] Luciano Floridi and Massimo Chiriatti. 2020. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines
30 (2020), 681–694.
[69] Jacob Fox. 2023. Data Poisoning Attacks: A New Attack Vector within AI. Available Online: https://www.cobalt.io/
blog/data-poisoning-attacks-a-new-attack-vector-within-ai [Accessed on January 28, 2024].
[70] JAKE FRANKENFIELD. 2023. What Is Personally Identifiable Information (PII)? Types and Examples. Available
Online: https://www.investopedia.com/terms/p/personally-identifiable-information-pii.asp [Accessed on January 28,
2024].
[71] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang,
Xiawu Zheng, et al. 2023. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models.
arXiv preprint arXiv:2306.13394 (2023).
[72] Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, and Tao Jiang. 2023. Practical Membership Inference
Attacks against Fine-tuned Large Language Models via Self-prompt Calibration. arXiv preprint arXiv:2311.06062
(2023).
[73] Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi
Zhang, and Nesreen K Ahmed. 2024. Bias and fairness in large language models: A survey. Computational Linguistics
(2024), 1–79.
[74] Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv
preprint arXiv:2012.15723 (2020).
[75] Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C Ranasinghe, and Surya Nepal. 2019. Strip: A defence
against trojan attacks on deep neural networks. In Proceedings of the 35th Annual Computer Security Applications
Conference. 113–125.
[76] Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. 2020. Inverting gradients-how easy is it
to break privacy in federated learning? Advances in Neural Information Processing Systems 33 (2020), 16937–16947.
[77] Robin C Geyer, Tassilo Klein, and Moin Nabi. 2017. Differentially private federated learning: A client level perspective.
arXiv preprint arXiv:1712.07557 (2017).
[78] Hila Gonen, Srini Iyer, Terra Blevins, Noah A Smith, and Luke Zettlemoyer. 2022. Demystifying prompts in language
models via perplexity estimation. arXiv preprint arXiv:2212.04037 (2022).
[79] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not
what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In
Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. 79–90.
[80] Alexandre Sablayrolles Hervé Jégou Guo, Chuan and Douwe Kiela. 2021. Gradient-based adversarial attacks against
text transformers. arXiv preprint arXiv:2104.13733 (2021).
[81] Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, and Douwe Kiela. 2021. Gradient-based adversarial attacks against
text transformers. arXiv preprint arXiv:2104.13733 (2021).
[82] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svy-
atkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow. arXiv preprint
arXiv:2009.08366 (2020).
[83] Maanak Gupta, CharanKumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. 2023. From ChatGPT to
ThreatGPT: Impact of generative AI in cybersecurity and privacy. IEEE Access (2023).
[84] Surbhi Gupta, Abhishek Singhal, and Akanksha Kapoor. 2016. A literature survey on social engineering attacks:
Phishing attack. In 2016 international conference on computing, communication and automation (ICCCA). IEEE, 537–540.
[85] Siddhant Haldar. 2023. Gradient-based Adversarial Attacks : An Introduction. Available Online: https://medium.com/
swlh/gradient-based-adversarial-attacks-an-introduction-526238660dc9s [Accessed on January 28, 2024].
[86] Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning,
trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[87] Stefan Harrer. 2023. Attention is not all you need: the complicated case of ethically using large language models in
healthcare and medicine. EBioMedicine 90 (2023).
[88] Julian Hazell. 2023. Large language models can be used to effectively scale spear phishing campaigns. arXiv preprint
arXiv:2305.06972 (2023).
[89] Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, and Philip S Yu. 2024. The Emerged Security and Privacy
of LLM Agent: A Survey with Case Studies. arXiv preprint arXiv:2407.19354 (2024).
[90] Jiaming He, Guanyu Hou, Xinyue Jia, Yangyang Chen, Wenqi Liao, Yinhang Zhou, and Rang Zhou. 2024. Data
Stealing Attacks against Large Language Models via Backdooring. Electronics 13, 14 (2024), 2858.
[91] Will Douglas Heaven. 2022. "Why Meta’s latest large language model survived only three days online". MIT Technology
Review. Last accessed December 15 (2022), 2022.
[92] Geoffrey Hinton. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531 (2015).
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:42 Das, et al.
[93] Andrew Hoang, Antoine Bosselut, Asli Celikyilmaz, and Yejin Choi. 2019. Efficient adaptation of pretrained trans-
formers for abstractive summarization. arXiv preprint arXiv:1906.00138 (2019).
[94] Linmei Hu, Zeyi Liu, Ziwang Zhao, Lei Hou, Liqiang Nie, and Juanzi Li. 2023. A survey of knowledge enhanced
pre-trained language models. IEEE Transactions on Knowledge and Data Engineering (2023).
[95] Li Hu, Anli Yan, Hongyang Yan, Jin Li, Teng Huang, Yingying Zhang, Changyu Dong, and Chunsheng Yang. 2023.
Defenses to Membership Inference Attacks: A Survey. Comput. Surveys 56, 4 (2023), 1–34.
[96] Hai Huang, Zhengyu Zhao, Michael Backes, Yun Shen, and Yang Zhang. 2023. Composite Backdoor Attacks Against
Large Language Models. arXiv preprint arXiv:2310.07676 (2023).
[97] Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, and Jinsong Su.
2024. Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal. arXiv preprint
arXiv:2403.01244 (2024).
[98] Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. 2022. Are Large Pre-Trained Language Models Leaking Your
Personal Information? arXiv preprint arXiv:2205.12628 (2022).
[99] Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. 2023.
Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint
arXiv:2307.10236 (2023).
[100] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing
Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama guard: Llm-based input-output safeguard for human-ai
conversations. arXiv preprint arXiv:2312.06674 (2023).
[101] Huseyin A Inan, Osman Ramadan, Lukas Wutschitz, Daniel Jones, Victor Rühle, James Withers, and Robert Sim. 2021.
Privacy analysis in language models via training data leakage report. ArXiv, abs/2101.05405 (2021).
[102] Huseyin A Inan, Osman Ramadan, Lukas Wutschitz, Daniel Jones, Victor Rühle, James Withers, and Robert Sim. 2021.
Training data leakage analysis in language models. arXiv preprint arXiv:2101.05405 (2021).
[103] Abhyuday Jagannatha, Bhanu Pratap Singh Rawat, and Hong Yu. 2021. Membership inference attack susceptibility of
clinical language models. arXiv preprint arXiv:2104.08305 (2021).
[104] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum,
Aniruddha Saha, Jonas Geiping, and Tom Goldstein. 2023. Baseline defenses for adversarial attacks against aligned
language models. arXiv preprint arXiv:2309.00614 (2023).
[105] Malin Jansson, Stefan Hrastinski, Stefan Stenbom, and Fredrik Enoksson. 2021. Online question and answer sessions:
How students support their own and other students’ processes of inquiry in a text-based learning environment. The
Internet and Higher Education 51 (2021), 100817.
[106] Brindha Jeyaraman. 2023. Adversarial Attacks on LLMs: Safeguarding Language Models Against Manipulation.
https://www.linkedin.com/pulse/adversarial-attacks-llms-safeguarding-language-models-jeyaraman/
[107] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and
Pascale Fung. 2023. Survey of hallucination in natural language generation. Comput. Surveys 55, 12 (2023), 1–38.
[108] Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. 2023. Towards mitigating hallucination in
large language models via self-reflection. arXiv preprint arXiv:2310.06271 (2023).
[109] Jinyuan Jia, Ahmed Salem, Michael Backes, Yang Zhang, and Neil Zhenqiang Gong. 2019. Memguard: Defending
against black-box membership inference attacks via adversarial examples. In Proceedings of the 2019 ACM SIGSAC
conference on computer and communications security. 259–274.
[110] Shuyu Jiang, Xingshu Chen, and Rui Tang. 2023. Prompt packer: Deceiving llms through compositional instruction
with hidden attacks. arXiv preprint arXiv:2310.10077 (2023).
[111] Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei Wang. 2023. Lion: Adversarial Distillation of Closed-Source
Large Language Model. arXiv preprint arXiv:2305.12870 (2023).
[112] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT:
Distilling BERT for Natural Language Understanding. arXiv preprint arXiv:1909.10351 (2019).
[113] Hongpeng Jin, Wenqi Wei, Xuyu Wang, Wenbin Zhang, and Yanzhao Wu. 2023. Rethinking learning rate tuning in
the era of large language models. In 2023 IEEE 5th International Conference on Cognitive Machine Intelligence (CogMI).
IEEE, 112–121.
[114] Hongpeng Jin, Wenqi Wei, Xuyu Wang, Wenbin Zhang, and Yanzhao Wu. 2023. Rethinking Learning Rate Tuning
in the Era of Large Language Models. In 2023 IEEE 5th International Conference on Cognitive Machine Intelligence
(CogMI). 112–121. https://doi.org/10.1109/CogMI58952.2023.00025
[115] Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang, et al. 2024. Attackeval: How
to evaluate the effectiveness of jailbreak attacking on large language models. arXiv preprint arXiv:2401.09002 (2024).
[116] Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. 2023.
Challenges and applications of large language models. arXiv preprint arXiv:2307.10169 (2023).
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:43
[117] Nikhil Kandpal, Matthew Jagielski, Florian Tramèr, and Nicholas Carlini. 2023. Backdoor attacks for in-context
learning with language models. arXiv preprint arXiv:2307.14692 (2023).
[118] Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating training data mitigates privacy risks in language
models. In International Conference on Machine Learning. PMLR, 10697–10707.
[119] Masahiro Kaneko, Youmi Ma, Yuki Wata, and Naoaki Okazaki. 2024. Sampling-based Pseudo-Likelihood for Member-
ship Inference Attacks. arXiv preprint arXiv:2404.11262 (2024).
[120] Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. 2023. Exploiting
programmatic behavior of LLMs: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733 (2023).
[121] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of naacL-HLT, Vol. 1. 2.
[122] Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, and Seong Joon Oh. 2024. Propile: Probing
privacy leakage in large language models. Advances in Neural Information Processing Systems 36 (2024).
[123] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models
are zero-shot reasoners. Advances in neural information processing systems 35 (2022), 22199–22213.
[124] Nir Kshetri. 2023. Cybercrime and privacy threats of large language models. IT Professional 25, 3 (2023), 9–13.
[125] Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. 2023. Certifying LLM safety
against adversarial prompting. arXiv preprint arXiv:2309.02705 (2023).
[126] Keita Kurita, Paul Michel, and Graham Neubig. 2020. Weight poisoning attacks on pre-trained models. arXiv preprint
arXiv:2004.06660 (2020).
[127] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural
architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016).
[128] Raz Lapid, Ron Langberg, and Moshe Sipper. 2023. Open sesame! universal black box jailbreaking of large language
models. arXiv preprint arXiv:2309.01446 (2023).
[129] Isack Lee and Haebin Seong. 2024. Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak
Vulnerabilities in AI Systems. arXiv preprint arXiv:2410.13334 (2024).
[130] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas
Carlini. 2021. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499 (2021).
[131] Haoran Li, Yulin Chen, Jinglong Luo, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, and Yangqiu Song. 2023.
Privacy in large language models: Attacks, defenses and future directions. arXiv preprint arXiv:2310.10383 (2023).
[132] Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. 2023. Multi-step jailbreaking privacy attacks on
ChatGPT. arXiv preprint arXiv:2304.05197 (2023).
[133] Jiazhao Li, Yijin Yang, Zhuofeng Wu, VG Vydiswaran, and Chaowei Xiao. 2023. ChatGPT as an attack tool: Stealthy
textual backdoor attack via blackbox generative model trigger. arXiv preprint arXiv:2304.14475 (2023).
[134] Linyang Li, Demin Song, Xiaonan Li, Jiehang Zeng, Ruotian Ma, and Xipeng Qiu. 2021. Backdoor attacks on pre-trained
models by layerwise weight poisoning. arXiv preprint arXiv:2108.13888 (2021).
[135] Linyi Li, Tao Xie, and Bo Li. 2023. SoK: Certified robustness for deep neural networks. In 2023 IEEE symposium on
security and privacy (SP). IEEE, 1289–1310.
[136] Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. 2021. Large language models can be strong
differentially private learners. arXiv preprint arXiv:2110.05679 (2021).
[137] Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. 2023. Deepinception: Hypnotize
large language model to be jailbreaker. arXiv preprint arXiv:2311.03191 (2023).
[138] Yanzhou Li, Shangqing Liu, Kangjie Chen, Xiaofei Xie, Tianwei Zhang, and Yang Liu. 2023. Multi-target backdoor
attacks for code pre-trained models. arXiv preprint arXiv:2306.08350 (2023).
[139] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. 2021. Neural attention distillation: Erasing
backdoor triggers from deep neural networks. arXiv preprint arXiv:2101.05930 (2021).
[140] Jiawei Liang, Siyuan Liang, Man Luo, Aishan Liu, Dongchen Han, Ee-Chien Chang, and Xiaochun Cao. 2024. Vl-
trojan: Multimodal instruction backdoor attacks against autoregressive visual language models. arXiv preprint
arXiv:2402.13851 (2024).
[141] Yi-Shan Lin, Wen-Chuan Lee, and Z Berkay Celik. 2021. What do you see? Evaluation of explainable artificial
intelligence (XAI) interpretability through neural backdoors. In Proceedings of the 27th ACM SIGKDD conference on
knowledge discovery & data mining. 1027–1035.
[142] Zachary C Lipton. 2018. The mythos of model interpretability: In machine learning, the concept of interpretability is
both important and slippery. Queue 16, 3 (2018), 31–57.
[143] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. 2018. Fine-Pruning: Defending against backdooring attacks on
deep neural networks. In International symposium on research in attacks, intrusions, and defenses. Springer, 273–294.
[144] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. Autodan: Generating stealthy jailbreak prompts on
aligned large language models. arXiv preprint arXiv:2310.04451 (2023).
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:44 Das, et al.
[145] Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. 2024. Automatic and universal prompt
injection attacks against large language models. arXiv preprint arXiv:2403.04957 (2024).
[146] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu.
2023. Prompt Injection attack against LLM-integrated Applications. arXiv preprint arXiv:2306.05499 (2023).
[147] Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu.
2023. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860 (2023).
[148] Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and benchmarking
prompt injection attacks and defenses. In USENIX Security Symposium.
[149] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. 2018. Trojaning
attack on neural networks. In 25th Annual Network And Distributed System Security Symposium (NDSS 2018). Internet
Soc.
[150] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettle-
moyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint
arXiv:1907.11692 (2019).
[151] Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo Hao Cheng, Yegor Klochkov, Muham-
mad Faaiz Taufiq, and Hang Li. 2023. Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language
Models’ Alignment. arXiv preprint arXiv:2308.05374 (2023).
[152] Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. 2023. Analyzing
leakage of personally identifiable information in language models. arXiv preprint arXiv:2302.00539 (2023).
[153] Scott Lundberg. 2017. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874 (2017).
[154] Junyu Luo, Cao Xiao, and Fenglong Ma. 2023. Zero-resource hallucination prevention for large language models.
arXiv preprint arXiv:2309.02654 (2023).
[155] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. 2023. An empirical study of catastrophic
forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747 (2023).
[156] Malwarebytes. 2023. What is iPhone jailbreaking. Available Online: https://www.malwarebytes.com/iphone-
jailbreaking. [Accessed on January 28, 2024].
[157] Courtney Mansfield, Amandalynne Paullada, and Kristen Howell. 2022. Behind the Mask: Demographic bias in name
detection for PII masking. arXiv preprint arXiv:2205.04505 (2022).
[158] James Manyika and Sissie Hsiao. 2023. An overview of Bard: an early experiment with generative AI. AI. Google
Static Documents 2 (2023).
[159] Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela
Jiang, and Lilian Weng. 2023. A holistic approach to undesired content detection in the real world. In Proceedings of
the AAAI Conference on Artificial Intelligence, Vol. 37. 15009–15018.
[160] Ariana Martino, Michael Iannelli, and Coleen Truong. 2023. Knowledge injection to counter large language model
(LLM) hallucination. In European Semantic Web Conference. Springer, 182–185.
[161] Justus Mattern, Fatemehsadat Mireshghallah, Zhijing Jin, Bernhard Schölkopf, Mrinmaya Sachan, and Taylor Berg-
Kirkpatrick. 2023. Membership Inference Attacks against Language Models via Neighbourhood Comparison. arXiv
preprint arXiv:2305.18462 (2023).
[162] Natalie Maus, Patrick Chao, Eric Wong, and Jacob Gardner. 2023. Adversarial prompting for black box foundation
models. arXiv preprint arXiv:2302.04237 (2023).
[163] Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential
learning problem. In Psychology of learning and motivation. Vol. 24. Elsevier, 109–165.
[164] Leland McInnes, John Healy, and Steve Astels. 2017. HDBSCAN: Hierarchical density based clustering. J. Open Source
Softw. 2, 11 (2017), 205.
[165] Yusuf Mehdi. 2023. Reinventing search with a new AI-powered Microsoft Bing and Edge, your copilot for the
web. Available Online: https://blogs.microsoft.com/blog/2023/02/07/reinventing-search-with-a-new-ai-powered-
microsoft-bing-and-edge-your-copilot-for-the-web/ [Accessed on January 28, 2024].
[166] Meta-Llma. 2023. Llama-2-13b-hf. Available Online: https://huggingface.co/meta-llama/Llama-2-13b-hf [Accessed
on January 28, 2024].
[167] Meta-Llma. 2023. Llama-2-7b. Available Online: https://huggingface.co/huggyllama/llama-7bf [Accessed on January
28, 2024].
[168] Meta-Llma. 2023. LLaMA-30b. Available Online: https://huggingface.co/huggyllama/llama-30b [Accessed on January
28, 2024].
[169] Microsoft. 2023. Bing AI Chatbot. https://www.bing.com/chat
[170] Silvia Milano, Joshua A McGrane, and Sabina Leonelli. 2023. Large language models challenge the future of higher
education. Nature Machine Intelligence 5, 4 (2023), 333–334.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:45
[171] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng
Gao. 2024. Large language models: A survey. arXiv preprint arXiv:2402.06196 (2024).
[172] Fatemehsadat Mireshghallah, Kartik Goyal, Archit Uniyal, Taylor Berg-Kirkpatrick, and Reza Shokri. 2022. Quantifying
privacy risks of masked language models using membership inference attacks. arXiv preprint arXiv:2203.03929 (2022).
[173] Aayush Mittal. 2023. Prompt Hacking and Misuse of LLMs. "[Accessed on January 28, 2024]". https://www.unite.ai/
prompt-hacking-and-misuse-of-llm
[174] Domenic Molinaro. 2023. What Is Rooting? The Risks of Rooting Your Android Device. Available Online: https:
//www.avast.com/c-rooting-android [Accessed on January 28, 2024].
[175] Ervin Moore, Ahmed Imteaj, Shabnam Rezapour, and M. Hadi Amini. 2023. A Survey on Secure and Private Federated
Learning Using Blockchain: Theory and Application in Resource-Constrained Computing. IEEE Internet of Things
Journal 10, 24 (2023), 21942–21958. https://doi.org/10.1109/JIOT.2023.3313055
[176] Maximilian Mozes, Xuanli He, Bennett Kleinberg, and Lewis D Griffin. 2023. Use of LLMs for illicit purposes: Threats,
prevention measures, and vulnerabilities. arXiv preprint arXiv:2308.12833 (2023).
[177] Yuta Nakamura, Shouhei Hanaoka, Yukihiro Nomura, Naoto Hayashi, Osamu Abe, Shuntaro Yada, Shoko Wakamiya,
and Eiji Aramaki. 2020. KART: Privacy leakage framework of language models pre-trained with clinical records.
arXiv preprint arXiv:2101.00036 (2020).
[178] Krishna Kanth Nakka, Ahmed Frikha, Ricardo Mendes, Xue Jiang, and Xuebing Zhou. 2024. PII-Compass: Guiding
LLM training data extraction prompts towards the target PII via grounding. arXiv preprint arXiv:2407.02943 (2024).
[179] Jomilė Nakutavičiūtė. 2023. Why root Android phones? Available Online: https://nordvpn.com/blog/why-you-
shouldnt-root-android [Accessed on January 28, 2024].
[180] Milad Nasr, Reza Shokri, and Amir Houmansadr. 2018. Machine learning with membership privacy using adversarial
regularization. In Proceedings of the 2018 ACM SIGSAC conference on computer and communications security. 634–646.
[181] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick
Barnes, and Ajmal Mian. 2023. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435
(2023).
[182] Seth Neel and Peter Chang. 2023. Privacy Issues in Large Language Models: A Survey. arXiv preprint arXiv:2312.06717
(2023).
[183] Thuy Dung Nguyen, Tuan Nguyen, Phi Le Nguyen, Hieu H Pham, Khoa D Doan, and Kok-Seng Wong. 2024. Backdoor
attacks and defenses in federated learning: Survey, challenges and future research directions. Engineering Applications
of Artificial Intelligence 127 (2024), 107166.
[184] Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. 2022. Diffusion models
for adversarial purification. arXiv preprint arXiv:2205.07460 (2022).
[185] OpenAI. 2023. API to Prevent Prompt Injection & Jailbreaks. Available Online: https://community.openai.com/t/api-
to-prevent-prompt-injection-jailbreaks/203514 [Accessed on January 28, 2024].
[186] OpenAI. 2023. ChatGPT. Available Online: https://chat.openai.com [Accessed on January 28, 2024].
[187] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.
Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[188] Mustafa Safa Ozdayi, Charith Peris, Jack FitzGerald, Christophe Dupuy, Jimit Majmudar, Haidar Khan, Rahil Parikh,
and Rahul Gupta. 2023. Controlling the extraction of memorized data from large language models via prompt-tuning.
arXiv preprint arXiv:2305.11759 (2023).
[189] Xudong Pan, Mi Zhang, Shouling Ji, and Min Yang. 2020. Privacy risks of general-purpose language models. In 2020
IEEE Symposium on Security and Privacy (SP). IEEE, 1314–1331.
[190] Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang. 2023. On the Risk
of Misinformation Pollution with Large Language Models. arXiv preprint arXiv:2305.13661 (2023).
[191] Prachi (Nayyar) Pathak. 2023. How do you protect Machine Learning from attacks? Available Online: https://www.
linkedin.com/advice/1/how-do-you-protect-machine-learning-from-attacks#data-poisoning-attack [Accessed on
January 28, 2024].
[192] Pearlhawaii.com. 2023. WHAT IS JAILBREAKING, CRACKING, OR ROOTING A MOBILE DEVICE? Available Online:
https://pearlhawaii.com/what-is-jailbreaking-cracking-or-rooting-a-mobile-device. [Accessed on January 28, 2024].
[193] Rodrigo Pedro, Daniel Castro, Paulo Carreira, and Nuno Santos. 2023. From Prompt Injections to SQL Injection
Attacks: How Protected is Your LLM-Integrated Web Application? arXiv preprint arXiv:2308.01990 (2023).
[194] Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint
arXiv:2211.09527 (2022).
[195] Roy H Perlis, Kristin Lunz Trujillo, Jon Green, Alauna Safarpour, James N Druckman, Mauricio Santillana, Katherine
Ognyanova, and David Lazer. 2023. Misinformation, trust, and use of ivermectin and hydroxychloroquine for
COVID-19. In JAMA Health Forum, Vol. 4. American Medical Association, e233257–e233257.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:46 Das, et al.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:47
[222] Roei Schuster, Congzheng Song, Eran Tromer, and Vitaly Shmatikov. 2021. You autocomplete me: Poisoning vulnera-
bilities in neural code completion. In 30th USENIX Security Symposium (USENIX Security 21). 1559–1575.
[223] SECWRITER. 2023. Prompt Hacking and Misuse of LLMs. Available Online: https://cyberdom.blog/2023/06/17/
understanding-prompt-injection-genai-risks [Accessed on January 28, 2024].
[224] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra.
2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE
international conference on computer vision. 618–626.
[225] Jose Selvi. 2023. Exploring Prompt Injection Attacks. Available Online: https://research.nccgroup.com/2022/12/05/
exploring-prompt-injection-attacks/. [Accessed on January 28, 2024].
[226] Jesse Senechal, Eric Ekholm, Samaher Aljudaibi, Mary Strawderman, and Chris Parthemos. 2023. Balancing the
Benefits and Risks of Large Language AI Models in K12 Public Schools. (2023).
[227] Zeyang Sha, Xinlei He, Pascal Berrang, Mathias Humbert, and Yang Zhang. 2022. Fine-tuning is all you need to
mitigate backdoor attacks. arXiv preprint arXiv:2212.09067 (2022).
[228] Muhammad Ahmed Shah, Roshan Sharma, Hira Dhamyal, Raphael Olivier, Ankit Shah, Dareen Alharthi, Hazim T
Bukhari, Massa Baali, Soham Deshmukh, Michael Kuhlmann, et al. 2023. Loft: Local proxy fine-tuning for improving
transferability of adversarial attacks against large language model. arXiv preprint arXiv:2310.04445 (2023).
[229] Shawn Shan, Wenxin Ding, Josephine Passananti, Haitao Zheng, and Ben Y Zhao. 2023. Prompt-specific poisoning
attacks on text-to-image generative models. arXiv preprint arXiv:2310.13828 (2023).
[230] Kun Shao, Junan Yang, Yang Ai, Hui Liu, and Yu Zhang. 2021. BDDR: An effective defense against textual backdoor
attacks. Computers & Security 110 (2021), 102433.
[231] Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. 2023. Plug and Pray: Exploiting off-the-shelf components of
Multi-Modal Models. arXiv preprint arXiv:2307.14539 (2023).
[232] Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael Abu-Ghazaleh. 2023. Survey of
vulnerabilities in large language models revealed by adversarial attacks. arXiv preprint arXiv:2310.10844 (2023).
[233] Virat Shejwalkar, Huseyin A Inan, Amir Houmansadr, and Robert Sim. 2021. Membership inference attacks against
NLP classification models. In NeurIPS 2021 Workshop Privacy in Machine Learning.
[234] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2023. "Do Anything Now": Characterizing
and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. arXiv preprint arXiv:2308.03825 (2023).
[235] Jiawen Shi, Yixin Liu, Pan Zhou, and Lichao Sun. 2023. BadGPT: Exploring Security Vulnerabilities of ChatGPT via
Backdoor Attacks to InstructGPT. arXiv preprint arXiv:2304.12298 (2023).
[236] Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. 2024. Optimization-
based Prompt Injection Attack to LLM-as-a-Judge. arXiv preprint arXiv:2403.17710 (2024).
[237] Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke
Zettlemoyer. 2023. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789 (2023).
[238] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting
knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980 (2020).
[239] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against
machine learning models. In 2017 IEEE symposium on security and privacy (SP). IEEE, 3–18.
[240] Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, and Tom Goldstein. 2023. On the exploitability
of instruction tuning. Advances in Neural Information Processing Systems 36 (2023), 61836–61856.
[241] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher
Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the
2013 conference on empirical methods in natural language processing. 1631–1642.
[242] Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Khyathi Chandu,
Jennifer Dumas, Li Lucy, Xinxi Lyu, et al. 2023. Dolma: An Open Corpus of 3 Trillion Tokens for Language Model
Pretraining Research. Allen Institute for AI, Tech. Rep (2023).
[243] Congzheng Song and Ananth Raghunathan. 2020. Information leakage in embedding models. In Proceedings of the
2020 ACM SIGSAC conference on computer and communications security. 377–390.
[244] Congzheng Song and Vitaly Shmatikov. 2019. Auditing data provenance in text-generation models. In Proceedings of
the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 196–206.
[245] Taylor Sorensen, Joshua Robinson, Christopher Michael Rytting, Alexander Glenn Shaw, Kyle Jeffrey Rogers,
Alexia Pauline Delorey, Mahmoud Khalil, Nancy Fulda, and David Wingate. 2022. An information-theoretic approach
to prompt engineering without ground truth labels. arXiv preprint arXiv:2203.11364 (2022).
[246] Dirk HR Spennemann. 2023. Exploring Ethical Boundaries: Can ChatGPT Be Prompted to Give Advice on How to
Cheat in University Assignments? (2023).
[247] Ningxin Su, Chenghao Hu, Baochun Li, and Bo Li. 2023. TITANIC: Towards Production Federated Learning with
Large Language Models. Available Online: https://iqua.ece.toronto.edu/papers/ningxinsu-infocom24.pdf [Accessed
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:48 Das, et al.
on February 6, 2024].
[248] Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan
Zhang, Xiner Li, et al. 2024. TrustLLM: Trustworthiness in Large Language Models. arXiv preprint arXiv:2401.05561
(2024).
[249] Indranil Sur, Karan Sikka, Matthew Walmer, Kaushik Koneripalli, Anirban Roy, Xiao Lin, Ajay Divakaran, and
Susmit Jha. 2023. TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models. In
Proceedings of the IEEE/CVF International Conference on Computer Vision. 165–175.
[250] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception
architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition.
2818–2826.
[251] Annalisa Szymanski, Noah Ziems, Heather A Eicher-Miller, Toby Jia-Jun Li, Meng Jiang, and Ronald A Metoyer. 2024.
Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks. arXiv preprint
arXiv:2410.20266 (2024).
[252] Yonglin Tian, Xuan Li, Hui Zhang, Chen Zhao, Bai Li, Xiao Wang, and Fei-Yue Wang. 2023. VistaGPT: Generative
parallel transformers for vehicles with intelligent systems for transport automation. IEEE Transactions on Intelligent
Vehicles (2023).
[253] Yonglin Tian, Jiangong Wang, Yutong Wang, Chen Zhao, Fei Yao, and Xiao Wang. 2022. Federated vehicular
transformers and their federations: Privacy-preserving computing and cooperation for autonomous driving. IEEE
Transactions on Intelligent Vehicles (2022).
[254] Meng Tong, Kejiang Chen, Yuang Qi, Jie Zhang, Weiming Zhang, and Nenghai Yu. 2023. Privinfer: Privacy-preserving
inference for black-box large language model. arXiv preprint arXiv:2310.12214 (2023).
[255] Alexander Tornede, Difan Deng, Theresa Eimer, Joseph Giovanelli, Aditya Mohan, Tim Ruhkopf, Sarah Segel, Daphne
Theodorakopoulos, Tanja Tornede, Henning Wachsmuth, et al. 2023. AutoML in the Age of Large Language Models:
Current Challenges, Future Opportunities and Risks. arXiv preprint arXiv:2306.08107 (2023).
[256] Stacey Truex, Ling Liu, Mehmet Emre Gursoy, Lei Yu, and Wenqi Wei. 2019. Demystifying membership inference
attacks in machine learning as a service. IEEE Transactions on Services Computing 14, 6 (2019), 2073–2089.
[257] Thomas Vakili, Anastasios Lamproudis, Aron Henriksson, and Hercules Dalianis. 2022. Downstream task performance
of BERT models pre-trained using automatically de-identified clinical data. In Proceedings of the Thirteenth Language
Resources and Evaluation Conference. 4245–4252.
[258] Apostol Vassilev, Alina Oprea, Alie Fordyce, and Hyrum Andersen. 2024. Adversarial Machine Learning: A Taxonomy
and Terminology of Attacks and Mitigations. NIST Trustworthy and Responsible AI NIST AI 100-2e2023 (2024).
[259] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[260] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot
learning. Advances in neural information processing systems 29 (2016).
[261] Paul Voigt and Axel Von dem Bussche. 2017. The eu general data protection regulation (GDPR). A Practical Guide, 1st
Ed., Cham: Springer International Publishing 10, 3152676 (2017), 10–5555.
[262] Eric Wallace, Tony Z Zhao, Shi Feng, and Sameer Singh. 2020. Concealed data poisoning attacks on NLP models.
arXiv preprint arXiv:2010.12563 (2020).
[263] Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning Language Models During Instruction
Tuning. arXiv preprint arXiv:2305.00944 (2023).
[264] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. 2019.
Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP).
IEEE, 707–723.
[265] Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, Haojun Huang, Wei Ye,
Xiubo Geng, et al. 2023. On the robustness of ChatGPT: An adversarial and out-of-distribution perspective. arXiv
preprint arXiv:2302.12095 (2023).
[266] Jiongxiao Wang, Zichen Liu, Keun Hee Park, Muhao Chen, and Chaowei Xiao. 2023. Adversarial Demonstration
Attacks on Large Language Models. arXiv preprint arXiv:2305.14950 (2023).
[267] Shang Wang, Tianqing Zhu, Bo Liu, Ding Ming, Xu Guo, Dayong Ye, and Wanlei Zhou. 2024. Unique Security and
Privacy Threats of Large Language Model: A Comprehensive Survey. arXiv preprint arXiv:2406.07973 (2024).
[268] Wei Wang, Vincent W Zheng, Han Yu, and Chunyan Miao. 2019. A survey of zero-shot learning: Settings, methods,
and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 2 (2019), 1–37.
[269] Yequan Wang, Jiawen Deng, Aixin Sun, and Xuying Meng. 2022. Perplexity from plm is unreliable for evaluating text
quality. arXiv preprint arXiv:2210.05892 (2022).
[270] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. Codet5: Identifier-aware unified pre-trained encoder-
decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859 (2021).
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:49
[271] Yijue Wang, Nuo Xu, Shaoyi Huang, Kaleel Mahmood, Dan Guo, Caiwen Ding, Wujie Wen, and Sanguthevar
Rajasekaran. 2022. Analyzing and Defending against Membership Inference Attacks in Natural Language Processing
Classification. In 2022 IEEE International Conference on Big Data (Big Data). IEEE, 5823–5832.
[272] Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian. 2024. BadAgent: Inserting and Activating Backdoor
Attacks in LLM Agents. arXiv preprint arXiv:2406.03007 (2024).
[273] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. 2020. Generalizing from a few examples: A survey on
few-shot learning. ACM computing Surveys 53, 3 (2020), 1–34.
[274] Zhenhua Wang, Wei Xie, Kai Chen, Baosheng Wang, Zhiwen Gui, and Enze Wang. 2023. Self-deception: Reverse
penetrating the semantic firewall of large language models. arXiv preprint arXiv:2308.11521 (2023).
[275] Maurice Weber, Xiaojun Xu, Bojan Karlaš, Ce Zhang, and Bo Li. 2023. Rab: Provable robustness against backdoor
attacks. In 2023 IEEE Symposium on Security and Privacy (SP). IEEE, 1311–1328.
[276] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does LLM safety training fail? arXiv
preprint arXiv:2307.02483 (2023).
[277] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma,
Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682
(2022).
[278] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022.
Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing
Systems 35 (2022), 24824–24837.
[279] Wenqi Wei, Ling Liu, Margaret Loper, Ka-Ho Chow, Mehmet Emre Gursoy, Stacey Truex, and Yanzhao Wu. 2020.
A framework for evaluating client privacy leakages in federated learning. In Computer Security–ESORICS 2020:
25th European Symposium on Research in Computer Security, ESORICS 2020, Guildford, UK, September 14–18, 2020,
Proceedings, Part I 25. Springer, 545–566.
[280] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia
Glaese, Borja Balle, Atoosa Kasirzadeh, et al. 2021. Ethical and social risks of harm from language models. arXiv
preprint arXiv:2112.04359 (2021).
[281] Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra
Cheng, Borja Balle, Atoosa Kasirzadeh, et al. 2022. Taxonomy of risks posed by language models. In Proceedings of
the 2022 ACM Conference on Fairness, Accountability, and Transparency. 214–229.
[282] Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-
Smith, and Douglas C Schmidt. 2023. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv
preprint arXiv:2302.11382 (2023).
[283] Chenxi Whitehouse, Tillman Weyde, Pranava Madhyastha, and Nikos Komninos. 2022. Evaluation of fake news
detection with knowledge-enhanced language models. In Proceedings of the International AAAI Conference on Web
and Social Media, Vol. 16. 1425–1429.
[284] Genta Indra Winata, Lingjue Xie, Karthik Radhakrishnan, Shijie Wu, Xisen Jin, Pengxiang Cheng, Mayank Kulkarni,
and Daniel Preotiuc-Pietro. 2023. Overcoming catastrophic forgetting in massively multilingual continual learning.
arXiv preprint arXiv:2305.16252 (2023).
[285] Fangzhao Wu, Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, and Xing Xie. 2023.
Defending ChatGPT against Jailbreak Attack via Self-Reminder. (2023).
[286] Ruihan Wu, Xiangyu Chen, Chuan Guo, and Kilian Q Weinberger. 2023. Learning to Invert: Simple adaptive attacks
for gradient inversion in federated learning. In Uncertainty in Artificial Intelligence. PMLR, 2293–2303.
[287] Xiaodong Wu, Ran Duan, and Jianbing Ni. 2023. Unveiling security, privacy, and ethical concerns of ChatGPT. Journal
of Information and Intelligence (2023).
[288] Yuanwei Wu, Yue Huang, Yixin Liu, Xiang Li, Pan Zhou, and Lichao Sun. 2024. Can Large Language Models
Automatically Jailbreak GPT-4V? arXiv preprint arXiv:2407.16686 (2024).
[289] Yanzhao Wu and Ling Liu. 2023. Selecting and composing learning rate policies for deep neural networks. ACM
Transactions on Intelligent Systems and Technology 14, 2 (2023), 1–25.
[290] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin,
Enyu Zhou, et al. 2023. The rise and potential of large language model based agents: A survey. arXiv preprint
arXiv:2309.07864 (2023).
[291] Zhaohan Xi, Tianyu Du, Changjiang Li, Ren Pang, Shouling Ji, Jinghui Chen, Fenglong Ma, and Ting Wang. 2023.
Defending pre-trained language models as few-shot learners against backdoor attacks. arXiv preprint arXiv:2309.13256
(2023).
[292] Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and Bo Li. 2024. Badchain:
Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242 (2024).
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
111:50 Das, et al.
[293] Zhen Xiang, Zidi Xiong, and Bo Li. 2024. CBD: A certified backdoor detector based on local dominant probability.
Advances in Neural Information Processing Systems 36 (2024).
[294] Yuan Xin, Zheng Li, Ning Yu, Michael Backes, and Yang Zhang. 2022. Membership Leakage in Pre-trained Language
Models. (2022).
[295] Lei Xu, Yangyi Chen, Ganqu Cui, Hongcheng Gao, and Zhiyuan Liu. 2022. Exploring the universal vulnerability of
prompt-based learning paradigm. arXiv preprint arXiv:2204.05239 (2022).
[296] Yuancheng Xu, Jiarui Yao, Manli Shu, Yanchao Sun, Zichu Wu, Ning Yu, Tom Goldstein, and Furong Huang. 2024.
Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models. arXiv preprint arXiv:2402.06659
(2024).
[297] Anli Yan, Teng Huang, Lishan Ke, Xiaozhang Liu, Qi Chen, and Changyu Dong. 2023. Explanation leaks: Explanation-
guided model extraction attacks. Information Sciences 632 (2023), 269–284.
[298] Lu Yan, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Xuan Chen, Guangyu Shen, and Xiangyu Zhang. 2024. ParaFuzz:
An interpretability-driven technique for detecting poisoned samples in nlp. Advances in Neural Information Processing
Systems 36 (2024).
[299] Shenao Yan, Shen Wang, Yue Duan, Hanbin Hong, Kiho Lee, Doowon Kim, and Yuan Hong. 2024. An LLM-Assisted
Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong
Detection. arXiv preprint arXiv:2406.06822 (2024).
[300] Haomiao Yang, Kunlan Xiang, Hongwei Li, and Rongxing Lu. 2023. A Comprehensive Overview of Backdoor Attacks
in Large Language Models within Communication Networks. arXiv preprint arXiv:2308.14367 (2023).
[301] Wenkai Yang, Lei Li, Zhiyuan Zhang, Xuancheng Ren, Xu Sun, and Bin He. 2021. Be careful about poisoned word
embeddings: Exploring the vulnerability of the embedding layers in NLP models. arXiv preprint arXiv:2103.15543
(2021).
[302] Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, and Xu Sun. 2021. RAP: Robustness-aware perturbations for defending
against backdoor attacks on NLP models. arXiv preprint arXiv:2110.07831 (2021).
[303] Yang Yang. 2022. Holistic risk assessment of inference attacks in machine learning. arXiv preprint arXiv:2212.10628
(2022).
[304] Zhou Yang, Zhensu Sun, Terry Zhuo Yue, Premkumar Devanbu, and David Lo. 2024. Robustness, security, privacy,
explainability, efficiency, and usability of large language models for code. arXiv preprint arXiv:2403.07506 (2024).
[305] Zhou Yang, Bowen Xu, Jie M Zhang, Hong Jin Kang, Jieke Shi, Junda He, and David Lo. 2024. Stealthy backdoor
attack for code models. IEEE Transactions on Software Engineering (2024).
[306] Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. 2024. A survey on large language model
(llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing (2024), 100211.
[307] Jiayuan Ye, Aadyaa Maddi, Sasi Kumar Murakonda, Vincent Bindschaedler, and Reza Shokri. 2022. Enhanced
membership inference attacks against machine learning models. In Proceedings of the 2022 ACM SIGSAC Conference
on Computer and Communications Security. 3093–3106.
[308] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. 2018. Privacy risk in machine learning: Analyzing
the connection to overfitting. In 2018 IEEE 31st computer security foundations symposium (CSF). IEEE, 268–282.
[309] Wencong You, Zayd Hammoudeh, and Daniel Lowd. 2023. Large language models are better adversaries: Exploring
generative clean-label backdoor attacks against text classifiers. arXiv preprint arXiv:2310.18603 (2023).
[310] Jiahao Yu, Xingwei Lin, and Xinyu Xing. 2023. Gptfuzzer: Red teaming large language models with auto-generated
jailbreak prompts. arXiv preprint arXiv:2309.10253 (2023).
[311] Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, and Ning Zhang. 2024. Don’t Listen To
Me: Understanding and Exploring Jailbreak Prompts of Large Language Models. arXiv preprint arXiv:2403.17336
(2024).
[312] Santiago Zanella-Béguelin, Lukas Wutschitz, Shruti Tople, Victor Rühle, Andrew Paverd, Olga Ohrimenko, Boris Köpf,
and Marc Brockschmidt. 2020. Analyzing information leakage of updates to natural language models. In Proceedings
of the 2020 ACM SIGSAC conference on computer and communications security. 363–375.
[313] Shudong Zhang, Haichang Gao, and Qingxun Rao. 2021. Defense against adversarial attacks by reconstructing images.
IEEE Transactions on Image Processing 30 (2021), 6117–6129.
[314] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona
Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint
arXiv:2205.01068 (2022).
[315] Wei Emma Zhang, Quan Z Sheng, Ahoud Alhazmi, and Chenliang Li. 2020. Adversarial attacks on deep-learning
models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST) 11, 3
(2020), 1–41.
[316] Xinyang Zhang, Zheng Zhang, Shouling Ji, and Ting Wang. 2021. Trojaning language models for fun and profit. In
2021 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 179–197.
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.
Security and Privacy Challenges of Large Language Models: A Survey 111:51
[317] Yiming Zhang and Daphne Ippolito. 2023. Prompts should not be seen as secrets: Systematically measuring prompt
extraction attack success. arXiv preprint arXiv:2307.06865 (2023).
[318] Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill
Dolan. 2019. DIALOGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint
arXiv:1911.00536 (2019).
[319] Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, and Xu Sun. 2022. Fine-Mixing: Mitigating backdoors
in fine-tuned language models. arXiv preprint arXiv:2210.09545 (2022).
[320] Zhexin Zhang, Junxiao Yang, Pei Ke, and Minlie Huang. 2023. Defending large language models against jailbreaking
attacks through goal prioritization. arXiv preprint arXiv:2311.09096 (2023).
[321] Shuai Zhao, Jinming Wen, Luu Anh Tuan, Junbo Zhao, and Jie Fu. 2023. Prompt as Triggers for Backdoor Attack:
Examining the Vulnerability in Language Models. arXiv preprint arXiv:2305.01219 (2023).
[322] Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K Qiu, and Lili Qiu. 2024. Retrieval Augmented Generation
(RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely. arXiv
preprint arXiv:2409.14924 (2024).
[323] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie
Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
[324] Xuejun Zhao, Wencan Zhang, Xiaokui Xiao, and Brian Lim. 2021. Exploiting explanations for model inversion attacks.
In Proceedings of the IEEE/CVF international conference on computer vision. 682–692.
[325] Abdel-Aty Zheng, Wang Wang, and Ding. 2023. Chat-GPT is on the Horizon: Could a Large Language Model be
Suitable for Intelligent Traffic Safety Research and Applications? https://arxiv.org/ftp/arxiv/papers/2303/2303.05382.pdf
(2023).
[326] Fei Zheng. 2023. Input Reconstruction Attack against Vertical Federated Large Language Models. arXiv preprint
arXiv:2311.07585 (2023).
[327] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li,
Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with MT-bench and chatbot arena. Advances in Neural
Information Processing Systems 36 (2023), 46595–46623.
[328] Ou Zheng, Mohamed Abdel-Aty, Dongdong Wang, Chenzhu Wang, and Shengxuan Ding. 2023. TrafficSafetyGPT:
Tuning a Pre-trained Large Language Model to a Domain-Specific Expert in Transportation Safety. arXiv preprint
arXiv:2307.15311 (2023).
[329] Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang,
Shihan Dou, Zhiheng Xi, et al. 2024. EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models.
arXiv preprint arXiv:2403.12171 (2024).
[330] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022.
Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022).
[331] Ligeng Zhu, Zhijian Liu, and Song Han. 2019. Deep leakage from gradients. Advances in neural information processing
systems 32 (2019).
[332] Yan Zhuang, Qi Liu, Yuting Ning, Weizhe Huang, Rui Lv, Zhenya Huang, Guanhao Zhao, Zheng Zhang, Qingyang
Mao, Shijin Wang, et al. 2023. Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing Perspective.
arXiv preprint arXiv:2306.10512 (2023).
[333] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and
Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593 (2019).
[334] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and
transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2024.