AF - Self-Other Overlap: A Neglected Approach to AI Alignment by Marc Carauleanu

The Nonlinear Library: Alignment Forum

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Self-Other Overlap: A Neglected Approach to AI Alignment, published by Marc Carauleanu on July 30, 2024 on The AI Alignment Forum.
Many thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarnacke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute with additional support from AE Studio.
Summary
In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others while preserving performance. There is a large body of evidence suggesting that neural self-other overlap is connected to pro-sociality in humans and we argue that there are more fundamental reasons to believe this prior is relevant for AI Alignment.
We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability and has low capabilities externalities. We also share an early experiment of how fine-tuning a deceptive policy with self-other overlap reduces deceptive behavior in a simple RL environment.
We also found that the non-deceptive agents consistently have higher mean self-other overlap than the deceptive agents, which allows us to perfectly classify which agents are deceptive using only the mean self-other overlap value across episodes.
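As a rough illustration of how classification by mean self-other overlap could work, here is a minimal sketch. It assumes each agent has already been reduced to one scalar overlap score per episode and that higher scores mean more overlap. The function names and the midpoint threshold rule are hypothetical choices for illustration, not the procedure used in the experiment.

```python
from statistics import mean

def mean_overlap(per_episode_overlap):
    """Average one agent's per-episode overlap scores into a single value."""
    return mean(per_episode_overlap)

def find_separating_threshold(deceptive_means, non_deceptive_means):
    """Return a threshold that separates the two groups, or None.

    Assumes non-deceptive agents have HIGHER mean overlap. If the highest
    deceptive value is below the lowest non-deceptive value, the midpoint
    between them classifies every agent correctly.
    """
    highest_deceptive = max(deceptive_means)
    lowest_non_deceptive = min(non_deceptive_means)
    if highest_deceptive >= lowest_non_deceptive:
        return None  # groups overlap, so no single threshold is perfect
    return (highest_deceptive + lowest_non_deceptive) / 2

def classify(agent_mean, threshold):
    """Label an agent from its mean overlap value alone."""
    return "deceptive" if agent_mean < threshold else "non-deceptive"
```

A perfect classification, as described above, corresponds to `find_separating_threshold` returning a value rather than `None`.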
Introduction
General purpose ML models with the capacity for planning and autonomous behavior are becoming increasingly capable. Fortunately, research on making sure the models produce output in line with human interests in the training distribution is also progressing rapidly (e.g., RLHF, DPO). However, a looming question remains: even if the model appears to be aligned with humans in the training distribution, will it defect once it is deployed or gathers enough power? In other words, is the model deceptive?
We introduce a method that aims to reduce deception and increase the likelihood of alignment called Self-Other Overlap: overlapping the latent self and other representations of a model while preserving performance. This method makes minimal assumptions about the model's architecture and its interpretability and has a very concrete implementation. Early results indicate that it is effective at reducing deception in simple RL environments and preliminary LLM experiments are currently being conducted.
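One way to picture "overlapping the latent self and other representations while preserving performance" is as a training objective with two terms. The sketch below is only an assumption about the general shape of such an objective. The mean squared difference as the distance measure, the `overlap_weight` parameter, and the function names are hypothetical and are not taken from the post.

```python
def self_other_distance(self_activations, other_activations):
    """Mean squared difference between two activation vectors.

    self_activations: activations recorded on a self-referencing input.
    other_activations: activations recorded on the matching
    other-referencing input. A smaller value means more overlap.
    The choice of metric is an assumption made for this sketch.
    """
    if len(self_activations) != len(other_activations):
        raise ValueError("activation vectors must have the same length")
    squared_diffs = [
        (s - o) ** 2 for s, o in zip(self_activations, other_activations)
    ]
    return sum(squared_diffs) / len(squared_diffs)

def combined_loss(task_loss, self_activations, other_activations,
                  overlap_weight=0.1):
    """Task loss plus a weighted penalty on self-other distance.

    The task term stands in for "preserving performance"; the penalty
    term pushes the two representations toward each other.
    """
    penalty = self_other_distance(self_activations, other_activations)
    return task_loss + overlap_weight * penalty
```

In this framing, `overlap_weight` would control the trade-off between task performance and overlap.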
To be better prepared for the possibility of short timelines without necessarily having to solve interpretability, it seems useful to have a scalable, general, and transferable condition on the model internals, making it less likely for the model to be deceptive.
Self-Other Overlap
To get a more intuitive grasp of the concept, it is useful to understand how self-other overlap is measured in humans. There are regions of the brain that activate similarly when we do something ourselves and when we observe someone else performing the same action.
For example, if you were to pick up a martini glass under an fMRI, and then watch someone else pick up a martini glass, we would find regions of your brain that are similarly activated (overlapping) when you process the self and other-referencing observations as illustrated in Figure 2.
There seems to be compelling evidence that self-other overlap is linked to pro-social behavior in humans. For example, preliminary data suggests extraordinary altruists (people who donated a kidney to strangers) have higher neural self-other overlap than control participants in neural representations of fearful anticipation in the anterior insula while the opposite appears to be true for psychopaths. Moreover, the leading theories of empathy (such as the Perception-Action Model) imply that empathy is mediated by self-ot...
