AF - Untrustworthy models: a frame for scheming evaluations by Olli Järviniemi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Untrustworthy models: a frame for scheming evaluations, published by Olli Järviniemi on August 19, 2024 on The AI Alignment Forum.
In this post, I'll talk about trustworthiness, a concept similar to scheming and deceptive alignment. I prefer this concept, as I find it A) directly important in practice, B) better at distinguishing between the capability for scheming and actually scheming, and C) clearer to evaluate for. I'll also distinguish between different types of untrustworthy models - active planners, sleeper agents and opportunists - that are often conflated.
I think when people talk about evaluating models for something like scheming, they'd often be better off talking about trustworthiness instead; overall, it's a more applicable frame for designing evaluations and for making decisions.
Many of the ideas here are from Buck Shlegeris and Ryan Greenblatt; thanks to them and to Cody Rushing for feedback.
Suppose an AI company has developed a new model, and wants to use it to perform large quantities of ML research (internal deployment). However, the company is worried about the model being deceptively aligned[1] and trying to take catastrophic actions, such as hacking the data center or copying its weights to a remote server. What can the company do?
This is a textbook example of the type of problem AI control tries to solve, and a basic strategy looks like this: every time the new model takes some action, use another model to check that the action is OK before letting it be executed. If the action is not OK, it's not executed and humans are alerted about suspicious activity.
(If you have never heard of AI control before, you might want to try playing the Bash Control game to get familiar with the concept. In the game, your aim is to download and execute a malicious program without GPT-3.5 catching you.)
Let's call our characters Smart and Weak: Smart is the new model doing ML research (and possibly trying to escape), while Weak is a model trained to monitor Smart. The reason for these names will become apparent soon.
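To make that control strategy concrete, here is a minimal sketch of the monitoring loop in Python. It is only an illustration: Task, smart_propose_action, weak_suspiciousness_score and the threshold value are made-up stand-ins, not anything from the post or from an actual control setup.

```python
import random
from dataclasses import dataclass, field

# Toy stand-ins: in a real deployment, Smart would be the new model doing ML
# research and Weak would be a monitor trained to score its actions.

@dataclass
class Task:
    steps_left: int = 5
    log: list = field(default_factory=list)

def smart_propose_action(task: Task) -> str:
    return f"action-{len(task.log)}"        # Smart (untrusted) proposes the next action

def weak_suspiciousness_score(action: str) -> float:
    return random.random()                  # Weak (monitor) rates suspiciousness in [0, 1]

SUSPICION_THRESHOLD = 0.99                  # arbitrary example value

def run_with_monitoring(task: Task) -> None:
    """Run Smart on a task, letting Weak veto each proposed action."""
    while task.steps_left > 0:
        action = smart_propose_action(task)
        if weak_suspiciousness_score(action) > SUSPICION_THRESHOLD:
            print(f"ALERT: pausing; flagged {action!r} for human review")
            return                          # the action is not executed; humans are alerted
        task.log.append(action)             # the action looks OK, so it is executed
        task.steps_left -= 1
    print("task finished under monitoring:", task.log)

run_with_monitoring(Task())
```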
There are many ways this control strategy could fail. Here are some:[2]
1. Smart came up with an extremely subtle way to escape that Weak is just too dumb to notice.
2. Smart successfully jailbreaks Weak (via e.g. a suffix attack) so that it doesn't alert humans about its escape attempt.
    1. Or: Weak happens to have a backdoor, which Smart finds and exploits.
3. Smart successfully convinces Weak (via natural language) that its actions are not malicious.[3]
4. Smart's escape plan is unlike anything we have trained Weak on. While Weak "would be capable of" telling that Smart is escaping, our training didn't specify well enough what we wanted, so Weak generalizes incorrectly out-of-distribution and doesn't alert humans.
5. The company has far too limited a budget for human alerts, so while Weak does rank Smart's actions in the top 0.012% by suspiciousness, that's not enough to meet the 0.01% threshold for alerting humans (see the sketch after this list).
6. Weak believes that not alerting humans in this case would allow Smart to escape, and believes that this leads to better outcomes in light of its preferences, and thus for instrumental reasons deliberately does not alert humans, indeed causing Smart to escape.
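As a toy illustration of failure 5 (referenced in that item), here is a short calculation, assuming a made-up suspiciousness distribution and a 0.01% auditing budget: an escape attempt that Weak ranks in the top 0.012% of suspiciousness still falls short of the top-0.01% escalation threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical suspiciousness scores Weak assigned to a large batch of past actions.
scores = rng.normal(size=1_000_000)

audit_budget = 1e-4                                # humans can review only 0.01% of actions
threshold = np.quantile(scores, 1 - audit_budget)  # only the top 0.01% get escalated

# An escape attempt that Weak ranks at the top 0.012% of suspiciousness:
escape_score = np.quantile(scores, 1 - 1.2e-4)

print(f"audit threshold: {threshold:.3f}")
print(f"escape attempt:  {escape_score:.3f}")
print("escalated to humans?", escape_score > threshold)  # False: very suspicious, yet below the budget-limited threshold
```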
Failure 6 is very different from the others! Whereas failures 1 to 5 are broadly about "Weak is just not a capable/good enough monitor to do what we wanted", failure 6 is about "Weak knows perfectly well Smart is escaping, but betrays humans and doesn't tell them about this".
This issue is most apparent if you were to use a copy of the AI lab's new model to monitor itself (so Smart and Weak are the same model[4]): If you don't trust the new model to do research without trying to escape, you probably can't trust it to alert humans about escape attempts (by itself, no ...