AF - An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs by Jan Wehner
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs, published by Jan Wehner on July 14, 2024 on The AI Alignment Forum.
Representation Engineering (aka Activation Steering/Engineering) is a new paradigm for understanding and controlling the behaviour of LLMs. Instead of changing the prompt or weights of the LLM, it does this by directly intervening on the activations of the network during a forward pass. Furthermore, it improves our ability to interpret representations within networks and to detect the formation and use of concepts during inference.
This post serves as an introduction to Representation Engineering (RE). We explain the core techniques, survey the literature, contrast RE to related techniques, hypothesise why it works, argue how it's helpful for AI Safety and lay out some research frontiers.
Disclaimer: I am not an expert in the area; claims are based on a ~3-week deep dive into the topic.
What is Representation Engineering?
Goals
Representation Engineering is a set of methods to understand and control the behaviour of LLMs. This is done by first identifying a linear direction in the activations that is related to a specific concept [3], a type of behaviour, or a function, which we call the concept vector. During the forward pass, the similarity of the activations to the concept vector can be used to detect the presence or absence of this concept.
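To make this concrete, here is a minimal sketch of such a detection check, assuming a concept vector and a hidden-state activation from some layer are already available (the variable names and the honesty example are hypothetical, not taken from a specific paper):

```python
import torch
import torch.nn.functional as F

def concept_score(activation: torch.Tensor, concept_vector: torch.Tensor) -> float:
    """Cosine similarity between a single activation and a concept vector.

    A clearly positive score suggests the concept direction is present in
    the representation; a score near zero or negative suggests it is not.
    """
    return F.cosine_similarity(activation, concept_vector, dim=-1).item()

# Hypothetical usage: `hidden` is the residual-stream activation at a chosen
# layer and token position, `honesty_vector` a previously derived concept vector.
# score = concept_score(hidden, honesty_vector)
```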
Furthermore, the concept vector can be added to the activations during the forward pass to steer the behaviour of the LLM towards the concept direction. In the following, I refer to concepts and concept vectors, but these can equally be behaviours or functions that we want to steer.
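Steering itself can be sketched as a forward hook that shifts a layer's activations along the concept direction. This is an illustrative sketch rather than the specific implementation used in the literature; the model layout, layer index, and steering coefficient below are assumptions:

```python
import torch

def make_steering_hook(concept_vector: torch.Tensor, alpha: float):
    """Return a forward hook that adds alpha * concept_vector to a layer's output."""
    def hook(module, inputs, output):
        # Many decoder layers return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * concept_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Hypothetical usage with a Hugging Face-style decoder model:
# layer = model.model.layers[15]      # which layer to steer is an assumption
# handle = layer.register_forward_hook(make_steering_hook(honesty_vector, alpha=4.0))
# output = model.generate(**inputs)   # generation now moves toward the concept
# handle.remove()
```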
This presents a new approach for interpreting NNs on the level of internal representations, instead of studying outputs or mechanistically analysing the network. This top-down frame of analysis might offer a solution to problems such as detecting deceptive behaviour or identifying harmful representations without a need to mechanistically understand the model in terms of low-level circuits [4, 24]. For example, RE has been used as a lie detector [6] and for detecting jailbreak attacks [11].
Furthermore, it offers a novel way to control the behaviour of LLMs. Whereas current approaches for aligning LLM behaviour control the weights (fine-tuning) or the inputs (prompting), Representation Engineering directly intervenes on the activations during the forward pass, allowing for efficient and fine-grained control. This is broadly applicable, for example to reducing sycophancy [17] or aligning LLMs with human preferences [19].
This method operates at the level of representations. This refers to the vectors in the activations of an LLM that are associated with a concept, behaviour or task. Golechha and Dao [24] as well as Zou et al. [4] argue that interpreting representations is a more effective paradigm for understanding and aligning LLMs than the circuit-level analysis popular in Mechanistic Interpretability (MI).
This is because MI might not be scalable for understanding large, complex systems, while RE allows the study of emergent structures in LLMs that can be distributed.
Method
Methods for Representation Engineering have two important parts: Reading and Steering. Representation Reading derives a vector from the activations that captures how the model represents a human-aligned concept like honesty, and Representation Steering changes the activations with that vector to suppress or promote that concept in the outputs.
For Representation Reading, one needs to design inputs, read out the activations, and derive the vector representing the concept of interest from those activations. First, one devises inputs that contrast with each other with respect to the concept. For example, the prompts might encourage the m...
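As an illustration of this reading step, one common simple recipe is to take the difference of mean activations between concept-positive and concept-negative prompts. The sketch below assumes a hypothetical get_activation helper, prompt lists, and layer choice:

```python
import torch

def concept_vector_from_contrast(pos_acts, neg_acts):
    """Concept vector as the difference of mean activations.

    pos_acts / neg_acts: lists of activation tensors taken at the same layer
    and token position for concept-positive and concept-negative prompts.
    """
    direction = torch.stack(pos_acts).mean(dim=0) - torch.stack(neg_acts).mean(dim=0)
    return direction / direction.norm()  # unit-normalise the concept direction

# Hypothetical usage (get_activation, the prompt lists, and layer=15 are assumptions):
# pos_acts = [get_activation(model, p, layer=15) for p in honest_prompts]
# neg_acts = [get_activation(model, p, layer=15) for p in dishonest_prompts]
# honesty_vector = concept_vector_from_contrast(pos_acts, neg_acts)
```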