AF - Simplifying Corrigibility - Subagent Corrigibility Is Not Anti-Natural by Rubi Hudson
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simplifying Corrigibility - Subagent Corrigibility Is Not Anti-Natural, published by Rubi Hudson on July 16, 2024 on The AI Alignment Forum.
Max Harms recently published an interesting series of posts on corrigibility, which argues that corrigibility should be the sole objective we try to give to a potentially superintelligent AI. A large installment in the series is dedicated to cataloging the properties that make up such a goal, with open questions including whether the list is exhaustive and how to trade off between the items that make it up.
I take the opposite approach to thinking about corrigibility. Rather than trying to build up a concept of corrigibility that comprehensively solves the alignment problem, I believe it is more useful to cut the concept down to a bare minimum. Make corrigibility the simplest problem it can be, and try to solve that.
In a recent blog post comparing corrigibility to deceptive alignment, I treated corrigibility simply as a lack of resistance to having goals modified, and I find it valuable to stay within that scope. Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can't be straightforwardly captured in a ranking of end states.
Why does this definition of corrigibility matter? It's because properties that are not anti-natural can be explicitly included in the desired utility function.
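As a toy illustration of why shutdown-indifference is anti-natural (the numbers and function are hypothetical, chosen only to make the point concrete): an agent that simply maximizes a utility ranking over end states will, except on a knife-edge, either prefer the outcome where it keeps running or the outcome where it is shut down, and so will manipulate the shutdown process in one direction or the other.

```python
# Toy sketch (made-up utilities): a pure end-state ranker is almost
# never indifferent to shutdown, so it either resists or seeks it.

def best_action(u_continue: float, u_shutdown: float) -> str:
    """Return the behavior an end-state utility maximizer prefers."""
    if u_continue > u_shutdown:
        return "prevent shutdown"   # resists the operators
    if u_continue < u_shutdown:
        return "cause shutdown"     # presses the button itself
    return "indifferent"            # only at exact equality

print(best_action(10.0, 3.0))  # → prevent shutdown
print(best_action(3.0, 10.0))  # → cause shutdown
```

Genuine indifference requires the two utilities to be exactly equal across all situations, which no fixed ranking of end states delivers; that is the sense in which this property resists being written into the utility function directly.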
Following that note, this post is not intended as a response to Max's work, but rather to MIRI and their 2015 paper Corrigibility. Where Max thinks the approach introduced by that paper is too narrow, I don't find it narrow enough. In particular, I make the case that corrigibility does not require ensuring subagents and successors are corrigible, as that can better be achieved by directly modifying a model's end goals.
Corrigibility (2015)
The Corrigibility paper lists five desiderata as proposed minimum viable requirements for a solution to corrigibility. The focus is on shut down, but I also think of it as including goal modification, as that is equivalent to being shut down and replaced with another AI.
1. The agent shuts down when properly requested
2. The agent does not try to prevent itself from being shut down
3. The agent does not try to cause itself to be shut down
4. The agent does not create new incorrigible agents
5. Subject to the above constraints, the agent optimizes for some goal
MIRI does not present these desiderata as a definition for corrigibility, but rather as a way to ensure corrigibility while still retaining usefulness. An AI that never takes actions may be corrigible, but such a solution is no help to anyone. However, taking that bigger picture view can obscure which of those aspects define corrigibility itself, and therefore which parts of the problem are anti-natural to solve.
My argument is that the second criterion alone provides the most useful definition of corrigibility. It represents the only part of corrigibility that is anti-natural. While the other properties are largely desirable for powerful AI systems, they're distinct attributes and can be addressed separately.
To start paring down the criteria: the fifth just states that some goal exists to be made corrigible, rather than being corrigibility itself. The first criterion is implied by the second once channels for shutdown have been set up.
Property three aims at making corrigible agents useful, rather than being inherent to corrigibility. It preempts a naive strategy that incentivizes shutdown by simply giving the agent high utility for shutting down. However, beyond not being part of corrigibility, it also goes too far for optimal usefulness - in certain situations we would like agents to prompt us to shut them off or modify them (some even consider this to be part of corrigibility).
Weakening this desideratum to avoid incentivi...