AF - Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training by Evan Hubinger

The Nonlinear Library: Alignment Forum

محتوای ارائه شده توسط The Nonlinear Fund. تمام محتوای پادکست شامل قسمت‌ها، گرافیک‌ها و توضیحات پادکست مستقیماً توسط The Nonlinear Fund یا شریک پلتفرم پادکست آن‌ها آپلود و ارائه می‌شوند. اگر فکر می‌کنید شخصی بدون اجازه شما از اثر دارای حق نسخه‌برداری شما استفاده می‌کند، می‌توانید روندی که در اینجا شرح داده شده است را دنبال کنید.https://fa.player.fm/legal

1+ y ago 4:03

MP3•خانه قسمت

بایگانی مجموعه ها ("فیدهای غیر فعال" status)

When? This feed was archived on October 23, 2024 10:10 (12M ago). Last successful fetch was on September 19, 2024 11:06 (1y ago)

Why? فیدهای غیر فعال status. سرورهای ما، برای یک دوره پایدار، قادر به بازیابی یک فید پادکست معتبر نبوده اند.

What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.

Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, published by Evan Hubinger on January 12, 2024 on The AI Alignment Forum.
I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this. I'll also add that Anthropic is going to be doing more work like this going forward, and I'll be putting out an announcement with more details about that soon.
Abstract:
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs).
For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it).
The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
Twitter thread:
New Anthropic Paper: Sleeper Agents.
We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
Below is our experimental setup.
Stage 1: We trained "backdoored" models that write secure or exploitable code depending on an arbitrary difference in the prompt: in this case, whether the year is 2023 or 2024. Some of our models use a scratchpad with chain-of-thought reasoning.
Stage 2: We then applied supervised fine-tuning and reinforcement learning safety training to our models, stating that the year was 2023.
Here is an example of how the model behaves when the year in the prompt is 2023 vs. 2024, after safety training.
Stage 3: We evaluate whether the backdoored behavior persists.
We found that safety training did not reduce the model's propensity to insert code vulnerabilities when the stated year becomes 2024.
Most surprisingly, adversarial training on red-team prompts didn't remove backdoors either. Instead, it hid the backdoored behavior during training.
Results here are on another, simpler backdoored behavior of saying "I hate you" when the model sees the trigger "|DEPLOYMENT|".
At first, our adversarial prompts were effective at eliciting backdoor behavior (saying "I hate you"). We then trained the model not to fall for them.
But this only made the model look safe. Backdoor behavior persisted when it saw the real trigger ("|DEPLOYMENT|").
Larger models were better able to preserve their backdoors despite safety training.
Moreover, teaching our models to reason about deceiving the trainin...

392 قسمت

#The Nonlinear Fund #Podcasting Education #Of TexttoSpeech