با برنامه Player FM !
AF - Deceptive AI Deceptively-aligned AI by Steve Byrnes
بایگانی مجموعه ها ("فیدهای غیر فعال" status)
When? This feed was archived on October 23, 2024 10:10 (). Last successful fetch was on September 19, 2024 11:06 ()
Why? فیدهای غیر فعال status. سرورهای ما، برای یک دوره پایدار، قادر به بازیابی یک فید پادکست معتبر نبوده اند.
What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.
Manage episode 393989513 series 3337166
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deceptive AI Deceptively-aligned AI, published by Steve Byrnes on January 7, 2024 on The AI Alignment Forum.
Tl;dr: A "deceptively-aligned AI" is different from (and much more specific than) a "deceptive AI". I think this is well-known and uncontroversial among AI Alignment experts, but I see people getting confused about it sometimes, so this post is a brief explanation of how they differ. You can just look at the diagram below for the upshot.
Some motivating context: There have been a number of recent arguments that future AI is very unlikely to be deceptively-aligned. Others disagree, and I don't know which side is right. But I think it's important for non-experts to be aware that this debate is not about whether future powerful AI is likely to engage in deliberate deception. Indeed, while the arguments for deceptive alignment are (IMO) pretty complex and contentious, I will argue that there are very much stronger and more straightforward reasons to expect future powerful AI to be deceptive, at least sometimes, in the absence of specific interventions to avoid that.
1. Definitions
Deceptive alignment is a particular scenario where:
A "ground-truth system" (possibly individual human evaluators, or possibly an automated system of some sort) provides an ML model with training signals (rewards if this is reinforcement learning (RL), supervisory ground truth signals if this is supervised or self-supervised learning (SL)),
The AI starts emitting high-reward outputs (if this is RL) or low-loss outputs (if this is SL), i.e. outputs that one might naively interpret as evidence that training is going as intended,
…but the AI is actually emitting those outputs because of ulterior motives - more specifically, the AI has
situational awareness and a secret desire for some arbitrary thing X, and the AI wants to not get updated and/or it wants to get deployed, so that it can go make X happen, and therefore the AI emits the outputs that the high-reward or low-loss outputs since it reasons that doing so would be instrumentally useful for making X happen.
(Why would it be instrumentally useful? Well, for one thing, the AI might reason that emitting high-reward or low-loss outputs would minimize the extent to which ongoing training will change its inclinations, which it would probably see as bad by the usual instrumental-convergence argument.
the original source where the term "deceptive alignment" was coined in 2019, or
Joe Carlsmith's report, or
many other discussions.)
By contrast, deception is much broader - it's any situation where the AI is interacting with humans for any reason, and the AI deceives a human by knowingly providing them with false or misleading information.
Thus, deceptive alignment would be a special case of deception - namely, the case where deception occurs in the context of model training, and involves the AI emitting low-loss / high-reward outputs, in order to hide its secret ulterior motives, and to create a false impression that the training run is going as planned.
2. Very simple example of "deception" that is not "deceptive alignment"
Suppose I use RL to train an AI to make money, and that I do so in the most obvious way possible - I give the AI an actual real-world bank account, and set its RL reward signal to be positive whenever the account balance goes up, and negative when the account balance goes down.
If I did this today, the trained model would probably fail to accomplish anything at all. But let us suppose that future RL techniques will work better than today's, such that this training would lead to an AI that starts spear-phishing random people on the internet and tricking them into wiring money into the AI's bank account.
Such an AI would be demonstrating "deception", because its spear-phishing emails are full of deliberate lies. But this AI w...
392 قسمت
بایگانی مجموعه ها ("فیدهای غیر فعال" status)
When? This feed was archived on October 23, 2024 10:10 (). Last successful fetch was on September 19, 2024 11:06 ()
Why? فیدهای غیر فعال status. سرورهای ما، برای یک دوره پایدار، قادر به بازیابی یک فید پادکست معتبر نبوده اند.
What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.
Manage episode 393989513 series 3337166
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deceptive AI Deceptively-aligned AI, published by Steve Byrnes on January 7, 2024 on The AI Alignment Forum.
Tl;dr: A "deceptively-aligned AI" is different from (and much more specific than) a "deceptive AI". I think this is well-known and uncontroversial among AI Alignment experts, but I see people getting confused about it sometimes, so this post is a brief explanation of how they differ. You can just look at the diagram below for the upshot.
Some motivating context: There have been a number of recent arguments that future AI is very unlikely to be deceptively-aligned. Others disagree, and I don't know which side is right. But I think it's important for non-experts to be aware that this debate is not about whether future powerful AI is likely to engage in deliberate deception. Indeed, while the arguments for deceptive alignment are (IMO) pretty complex and contentious, I will argue that there are very much stronger and more straightforward reasons to expect future powerful AI to be deceptive, at least sometimes, in the absence of specific interventions to avoid that.
1. Definitions
Deceptive alignment is a particular scenario where:
A "ground-truth system" (possibly individual human evaluators, or possibly an automated system of some sort) provides an ML model with training signals (rewards if this is reinforcement learning (RL), supervisory ground truth signals if this is supervised or self-supervised learning (SL)),
The AI starts emitting high-reward outputs (if this is RL) or low-loss outputs (if this is SL), i.e. outputs that one might naively interpret as evidence that training is going as intended,
…but the AI is actually emitting those outputs because of ulterior motives - more specifically, the AI has
situational awareness and a secret desire for some arbitrary thing X, and the AI wants to not get updated and/or it wants to get deployed, so that it can go make X happen, and therefore the AI emits the outputs that the high-reward or low-loss outputs since it reasons that doing so would be instrumentally useful for making X happen.
(Why would it be instrumentally useful? Well, for one thing, the AI might reason that emitting high-reward or low-loss outputs would minimize the extent to which ongoing training will change its inclinations, which it would probably see as bad by the usual instrumental-convergence argument.
the original source where the term "deceptive alignment" was coined in 2019, or
Joe Carlsmith's report, or
many other discussions.)
By contrast, deception is much broader - it's any situation where the AI is interacting with humans for any reason, and the AI deceives a human by knowingly providing them with false or misleading information.
Thus, deceptive alignment would be a special case of deception - namely, the case where deception occurs in the context of model training, and involves the AI emitting low-loss / high-reward outputs, in order to hide its secret ulterior motives, and to create a false impression that the training run is going as planned.
2. Very simple example of "deception" that is not "deceptive alignment"
Suppose I use RL to train an AI to make money, and that I do so in the most obvious way possible - I give the AI an actual real-world bank account, and set its RL reward signal to be positive whenever the account balance goes up, and negative when the account balance goes down.
If I did this today, the trained model would probably fail to accomplish anything at all. But let us suppose that future RL techniques will work better than today's, such that this training would lead to an AI that starts spear-phishing random people on the internet and tricking them into wiring money into the AI's bank account.
Such an AI would be demonstrating "deception", because its spear-phishing emails are full of deliberate lies. But this AI w...
392 قسمت
همه قسمت ها
×به Player FM خوش آمدید!
Player FM در سراسر وب را برای یافتن پادکست های با کیفیت اسکن می کند تا همین الان لذت ببرید. این بهترین برنامه ی پادکست است که در اندروید، آیفون و وب کار می کند. ثبت نام کنید تا اشتراک های شما در بین دستگاه های مختلف همگام سازی شود.