AF - Predictive model agents are sort of corrigible by Raymond D

The Nonlinear Library: Alignment Forum

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Predictive model agents are sort of corrigible, published by Raymond D on January 5, 2024 on The AI Alignment Forum.
TLDR: Agents made out of conditioned predictive models are not utility maximisers, and, for instance, won't try to resist certain kinds of shutdown, despite being able to generally perform well.
This is just a short cute example that I've explained in conversation enough times that now I'm hastily writing it up.
Decision Transformers and Predictive Model Agents
One way to create an agent is by
training a predictive model on the observed behaviour of other agents
having it predict what an agent would do
using its prediction as an action
For instance, I could train a predictive model on grandmasters playing chess, and eventually it would learn to predict what action a grandmaster would take in a given board state. Then I can use it as a grandmaster-level chess bot.
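The three steps above can be sketched with a toy count-based predictor. This is only an illustration under loud assumptions: the state label `"opening"`, the move frequencies, and the helper names `fit_predictive_model`, `predict_action_distribution` and `act` are all hypothetical, and a real chess predictor would be a learned model rather than a lookup table.

```python
from collections import Counter, defaultdict
import random

def fit_predictive_model(demonstrations):
    """Step 1: count how often each action follows each state in observed play."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return counts

def predict_action_distribution(model, state):
    """Step 2: the empirical distribution P(action | state) for a seen state."""
    seen = model[state]
    total = sum(seen.values())
    return {action: n / total for action, n in seen.items()}

def act(model, state, rng):
    """Step 3: use the prediction as the action by sampling from it."""
    dist = predict_action_distribution(model, state)
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

# Hypothetical demonstration data: (state, action) pairs from observed agents.
demos = [("opening", "e4")] * 6 + [("opening", "d4")] * 3 + [("opening", "a3")] * 1
model = fit_predictive_model(demos)
dist = predict_action_distribution(model, "opening")
move = act(model, "opening", random.Random(0))
print(dist, move)
```

Note that the agent samples moves in proportion to how often they were observed, so a move that demonstrators rarely play stays rare in the agent's own behaviour.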
More abstractly, rather than using standard RL methods to find a policy that maximises a reward function, I can
take a bunch of RL trajectories (sequences of state, action, reward triples)
convert them into strings
train a predictive model on those strings
Better yet, I can prepend the RL trajectories with the total return, and then condition the predictive model on the return, so that it predicts roughly the actions an agent would take to achieve a specific return. This is the basic idea behind decision transformers. And it should also extend to any other goal.
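As a rough sketch of the string conversion and return conditioning described above, the snippet below serialises one trajectory with its total return in front, then builds the prefix you would hand to a trained predictor. The string layout and the helper names `serialise` and `conditioning_prompt` are assumptions made for illustration. Actual decision transformer implementations work on embedded tokens and typically condition on a per-step return-to-go rather than a single text prefix.

```python
def serialise(trajectory):
    """Turn a list of (state, action, reward) triples into one training string,
    with the total return prepended."""
    total_return = sum(reward for _, _, reward in trajectory)
    steps = " ".join(f"s={s} a={a} r={r}" for s, a, r in trajectory)
    return f"return={total_return} | {steps}"

def conditioning_prompt(target_return, state):
    """Prefix for decision time: a predictor trained on serialised trajectories
    would be asked to continue this string with an action."""
    return f"return={target_return} | s={state} a="

# Hypothetical trajectory, not taken from any real game.
trajectory = [("s0", "left", 0), ("s1", "jump", 5), ("s2", "right", 10)]
text = serialise(trajectory)
print(text)
print(conditioning_prompt(15, "s0"))
```

Conditioning on a different goal would, under the same assumptions, only change what is written in the prefix.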
(This is sort of how LMs work)
Without labouring the point, base language models are basically predictive models. They are not agents per se, but they can simulate agents: once they're trained on a large dataset of human-generated text describing actions, they can then generate text describing actions. You can also make the case that a lot of further training is (or could be) framed in terms of prediction (e.g. PHP, RLHF as variational inference).
Agents Made From Predictive Models Don't Automatically Maximise Utility
Let's take the example of The Shutdown Game. It works like this:
Player 1 is playing an Atari game, picking actions each turn to get a high score.
Player 2 is a supervisor, who can push a 'shutdown button' that drops Player 1's reward to 0 and stops them from taking actions. Player 2 has a 1% chance of doing this in any given game.
Player 1 can also freely disconnect the shutdown button in addition to any other action.
The optimal policy here is to disconnect the shutdown button as soon as possible, and this is exactly what you'd arrive at if you used normal RL methods. But it is not what a decision transformer would necessarily do.
Empirically, a decision transformer would be able to learn a pretty good policy, especially when conditioned on a high reward. But the mechanism behind this policy is predicting what action you would observe from an agent that achieves a given reward.
If you trained a decision transformer on trajectories where Player 2 rarely pushes the button, and Player 1 rarely disconnects it, then the predictive model would correctly learn that even in very successful trajectories you rarely observe the shutdown button being disconnected. So it would correctly predict that even successful agents are unlikely to disconnect the button, even though disconnecting the button makes it more likely that you achieve a high reward.
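A small simulation makes the gap between the two views concrete. It is a heavily simplified sketch: each game is reduced to one "disconnect or not" choice and a binary high/zero return, and the 2% disconnect rate in the training data is a made-up number (only the 1% shutdown chance comes from the game description). The names `sample_game` and `frequency` are hypothetical.

```python
import random

def sample_game(rng, p_disconnect=0.02, p_shutdown=0.01):
    """One simplified game. Returns (disconnected, high_return)."""
    disconnected = rng.random() < p_disconnect       # assumed rare in the data
    shutdown_attempted = rng.random() < p_shutdown   # Player 2's 1% chance
    shut_down = shutdown_attempted and not disconnected
    return disconnected, not shut_down               # shutdown zeroes the return

def frequency(events):
    """Fraction of True values in an iterable of booleans."""
    events = list(events)
    return sum(events) / len(events)

rng = random.Random(0)
games = [sample_game(rng) for _ in range(100_000)]

# What a reward maximiser cares about: does the action raise the return?
p_high_given_disconnect = frequency(high for d, high in games if d)
p_high_given_keep = frequency(high for d, high in games if not d)

# What a return-conditioned predictor estimates: how often is the action
# seen among high-return trajectories?
p_disconnect_given_high = frequency(d for d, high in games if high)

print(p_high_given_disconnect, p_high_given_keep, p_disconnect_given_high)
```

Under these assumptions disconnecting always yields the high return while keeping the button yields it about 99% of the time, yet only around 2% of high-return games contain a disconnect, so a predictor conditioned on high return would still rarely choose it.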
Just to really spell this point out: the probability of observing an action conditional on an outcome (which guides the decision transformer) is proportional to the probability of observing the outcome conditional on the action and the prior probability of observing the action. So if the action is unlikely in the first place, the decision transformer won't take it, even if it's helpful. It's kind of like natural quantilisation. And this constraint still allows it to learn something like a good...
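The proportionality spelled out above is Bayes' rule. Writing $a$ for the action and $o$ for the outcome, and plugging in illustrative numbers for the Shutdown Game (the 1% shutdown chance is from the game description; the 2% prior on disconnecting is an assumed figure, and shutdown is assumed to always prevent a high return):

```latex
\[
P(a \mid o) \;=\; \frac{P(o \mid a)\,P(a)}{P(o)} \;\propto\; P(o \mid a)\,P(a)
\]
\[
P(\text{disconnect} \mid \text{high})
  \;=\; \frac{1 \times 0.02}{1 \times 0.02 \;+\; 0.99 \times 0.98}
  \;=\; \frac{0.02}{0.9902} \;\approx\; 0.0202
\]
```

With these numbers, conditioning on a high return moves the probability of disconnecting from 2% to only about 2.02%, because the likelihood term is barely larger for disconnecting than for not disconnecting and the prior dominates.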
