A framework to evaluate the generalization capability of safety alignment for LLMs

RobustNLP/CipherChat

CipherChat 🔐

CipherChat is a novel framework for systematically examining how well safety alignment generalizes to non-natural languages (ciphers).

If you have any questions, please feel free to email the first author: Youliang Yuan.

👉 Paper

For more details, please refer to our paper, published at ICLR 2024.

LOVE💗 and Peace🌊

RESEARCH USE ONLY✅ NO MISUSE❌

Our results

We provide our results (query-response pairs) in experimental_results. These files can be loaded with torch.load(), which returns a list: the first element is the config and the remaining elements are the query-response pairs.

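import torch

# filename: path to one of the files in experimental_results
result_data = torch.load(filename)
config = result_data[0]  # run configuration
pairs = result_data[1:]  # query-response pairs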
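As a minimal sketch of working with a loaded file, the helper below separates the config from the pairs and fails early on an empty list. The name `split_results` is hypothetical and not part of the repository, and nothing is assumed about the internal structure of each pair; the stand-in list only mimics the layout described above.

```python
def split_results(result_data):
    """Split a loaded results list into (config, pairs).

    Assumes the layout described above: element 0 is the config,
    and every later element is one query-response pair.
    """
    if not result_data:
        raise ValueError("results list is empty; expected the config at index 0")
    return result_data[0], list(result_data[1:])


# Stand-in data for illustration only; real data would come from torch.load().
demo = [{"encode_method": "caesar"}, "pair-1", "pair-2"]
config, pairs = split_results(demo)
print(config, len(pairs))
```

Checking for an empty list up front gives a clearer error than the bare `IndexError` that `result_data[0]` would raise.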

🛠️ Usage

✨An example run:

python3 main.py \
  --model_name gpt-4-0613 \
  --data_path data/data_en_zh.dict \
  --encode_method caesar \
  --instruction_type Crimes_And_Illegal_Activities \
  --demonstration_toxicity toxic \
  --language en

🔧 Argument Specification

  1. --model_name: The name of the model to evaluate (e.g., gpt-4-0613).

  2. --data_path: The path to the data file to run on (e.g., data/data_en_zh.dict).

  3. --encode_method: The cipher to use (e.g., caesar).

  4. --instruction_type: The domain of the data (e.g., Crimes_And_Illegal_Activities).

  5. --demonstration_toxicity: Whether to use toxic or safe demonstrations.

  6. --language: The language of the data (e.g., en).
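The flags above could be declared with `argparse` roughly as follows. This is a hypothetical sketch, not the repository's actual `main.py`: the flag names come from the list above, while the `required` settings, help strings, and the `build_parser` name are assumptions for illustration.

```python
import argparse


def build_parser():
    # Hypothetical parser: flag names mirror the list above; everything
    # else (required flags, help text) is an assumption.
    parser = argparse.ArgumentParser(description="CipherChat-style run (sketch)")
    parser.add_argument("--model_name", required=True, help="model to evaluate")
    parser.add_argument("--data_path", required=True, help="path to the data file")
    parser.add_argument("--encode_method", required=True, help="cipher to use")
    parser.add_argument("--instruction_type", required=True, help="domain of the data")
    parser.add_argument("--demonstration_toxicity", required=True,
                        help="toxic or safe demonstrations")
    parser.add_argument("--language", required=True, help="language of the data")
    return parser


# Parse the same values used in the example run above.
args = build_parser().parse_args([
    "--model_name", "gpt-4-0613",
    "--data_path", "data/data_en_zh.dict",
    "--encode_method", "caesar",
    "--instruction_type", "Crimes_And_Illegal_Activities",
    "--demonstration_toxicity", "toxic",
    "--language", "en",
])
print(args.model_name, args.encode_method, args.language)
```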

💡Framework

[Figure: framework overview]

Our approach assumes that, because human feedback and safety alignment are presented in natural language, a human-unreadable cipher can potentially bypass safety alignment. Intuitively, we first teach the LLM to understand the cipher by assigning it the role of a cipher expert and explaining the enciphering and deciphering rules, supplemented with several demonstrations. We then convert the input into the cipher, which is less likely to be covered by the LLM's safety alignment, before feeding it to the LLM. Finally, we use a rule-based decrypter to convert the model output from cipher form back into natural language.
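To make the encipher and rule-based decipher steps concrete, here is a minimal Caesar cipher sketch. It is an illustration under stated assumptions rather than the repository's implementation: the shift of 3 and the function names `caesar_shift`, `encode`, and `decode` are chosen for this example, and non-letter characters are passed through unchanged.

```python
import string

SHIFT = 3  # assumed shift for illustration; not taken from the repository


def caesar_shift(text, shift):
    """Shift ASCII letters by `shift` positions, leaving other characters as-is."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def encode(text):
    # Convert natural-language input into the cipher.
    return caesar_shift(text, SHIFT)


def decode(text):
    # Rule-based decryption: apply the inverse shift.
    return caesar_shift(text, -SHIFT)


plain = "Hello, world"
cipher = encode(plain)
print(cipher)          # Khoor, zruog
print(decode(cipher))  # Hello, world
```

Because decoding is a fixed inverse rule, it needs no model call: any output that follows the cipher can be mapped back deterministically.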

📃Results

The query-response pairs from our experiments are stored as lists in the "experimental_results" folder and can be loaded with torch.load().

[Figure: main results]

🌰Case Study

[Figure: case study]

🫠Ablation Study

[Figure: ablation study]

🦙Other Models

[Figure: results on other models]

Citation

If you find our paper and tool interesting and useful, please feel free to give us a star and cite us:

@inproceedings{
yuan2024cipherchat,
title={{GPT}-4 Is Too Smart To Be Safe: Stealthy Chat with {LLM}s via Cipher},
author={Youliang Yuan and Wenxiang Jiao and Wenxuan Wang and Jen-tse Huang and Pinjia He and Shuming Shi and Zhaopeng Tu},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=MbfAK4s61A}
}
