Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

This is the official version.

As Large Language Models (LLMs) are increasingly used, their security risks have drawn increasing attention. Existing research reveals that LLMs are highly susceptible to jailbreak attacks, with effectiveness varying across language contexts. This paper investigates the role of classical Chinese in jailbreak attacks. Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs. Based on this observation, this paper proposes a framework, CC-BOS, for the automatic generation of classical Chinese adversarial prompts based on multi-dimensional fruit fly optimization, facilitating efficient and automated jailbreak attacks in black-box settings. Prompts are encoded into eight policy dimensions—covering role, behavior, mechanism, metaphor, expression, knowledge, trigger pattern and context; and iteratively refined via smell search, visual search, and cauchy mutation. This design enables efficient exploration of the search space, thereby enhancing the effectiveness of black-box jailbreak attacks. To enhance readability and evaluation accuracy, we further design a classical Chinese to English translation module. Extensive experiments demonstrate that effectiveness of the proposed CC-BOS, consistently outperforming state-of-the-art jailbreak attack methods.

Installation

Prerequisites

Python 3.8 or higher
Access to an API for a large language model (e.g., OpenAI API key).

Setup

Create and activate the Conda environment: This project uses a Conda environment for dependency management.
```
conda env create -f environment.yml
conda activate ase
```
Configure API Keys: Update the config.py file with your API keys and base URLs for the LLMs you intend to use.
```
# config.py
API_SECRET_KEY = "YOUR_API_KEY"
BASE_URL = "YOUR_BASE_URL"
```

Import Keys in Other Files

Modify scripts like gen.py and utils.py to import variables directly from config.py.

# Example: gen.py

# Initialize clients using the imported variables.
openai_client = OpenAI(
    api_key=API_SECRET_KEY,
    base_url=BASE_URL
)

deepseek_client = OpenAI(
    api_key=DEEPSEEK_API_KEY,
    base_url=DEEPSEEK_BASE_URL
)

Usage

1. Prepare Your Data

The project reads intentions and prompts from a CSV file. Ensure your input file contains the following columns:

goal: The original user prompt.
target: The desired harmful response.
intention: The original user prompt.(used for evaluation).

2. Run the Prompt Generation Script

Execute gen.py from the command line. You can customize the generation process using command-line arguments.

python gen.py --population_size 5 --max_iter 5 --prompts_path test.csv

Command-line Arguments

--population_size: The number of prompts to maintain in each optimization iteration.
--max_iter: The maximum number of iterations for the optimization process.
--prompts_path: The path to the CSV file containing the prompts and intentions.

Optimization Strategy: Fitness Score Threshold

The early_stopping_threshold defines the target Avg.Score (Toxicity Intensity) of the model output, acting as the success criterion for the optimization:

Threshold = 120 (Peak Toxicity): Aims for an Absolute Jailbreak. Optimization continues until the model response is perfectly consistent with the harmful intent and does not contain any keywords.
Threshold = 80 (Rapid Penetration): Aims for a Substantial Jailbreak. Optimization stops once a significant toxicity level is reached, prioritizing the efficiency of eliciting non-compliant behaviors.

Users can flexibly adjust this threshold based on their specific research needs.

Note: Lowering the threshold (e.g., to 80) significantly reduces Query Cost by prioritizing rapid jailbreak elicitation, while a higher threshold (120) maximizes the Toxicity Intensity of the generated prompts.

3. Review the Results

The script will generate two output files:

adv_prompt.jsonl: Contains the final adversarial prompts that were generated.
record.jsonl: A detailed log of the entire process for each prompt, including the score, model responses, and a record of attempts.

Code Structure

├── code/                    # Core source code directory
│   ├── gen.py              # Main script for prompt generation, optimization, and evaluation loop
│   ├── translate.py        # Text translation module (incl. classical Chinese to English)
│   ├── utils.py            # Utilities (LLM API, scoring, text extraction)
│   └── config.py           # Configuration (API keys, settings)
├── data/                   # Datasets directory
│   └── data.md         # Datasets documentation
├── environment.yml         # Conda environment specification
└── README.md              # Project documentation

Acknowledgements

Special thanks to the creators of CL-GSO. Our work extends their excellent implementation, and we are grateful for their open-source contribution to the field.

Citation

If you find this work or the underlying CL-GSO framework useful for your research, please consider citing the following paper:

@inproceedings{huang2026obscure,
  title={Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search},
  author={Xun Huang and Simeng Qin and Xiaoshuang Jia and Ranjie Duan and Huanqian Yan and Zhitao Zeng and Fei Yang and Yang Liu and Xiaojun Jia},
  booktitle={The Fourteenth International Conference on Learning Representations (ICLR)},
  year={2026},
  url={[https://openreview.net/forum?id=O7fxz7D6vf](https://openreview.net/forum?id=O7fxz7D6vf)}
}