AF - An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs by Jan Wehner
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs, published by Jan Wehner on July 14, 2024 on The AI Alignment Forum.
Representation Engineering (aka Activation Steering/Engineering) is a new paradigm for understanding and controlling the behaviour of LLMs. Instead of changing the prompt or weights of the LLM, it does this by directly intervening on the activations of the network during a forward pass. Furthermore, it improves our ability to interpret representations within networks and to detect the formation and use of concepts during inference.
This post serves as an introduction to Representation Engineering (RE). We explain the core techniques, survey the literature, contrast RE to related techniques, hypothesise why it works, argue how it's helpful for AI Safety and lay out some research frontiers.
Disclaimer: I am not an expert in the area; claims are based on a ~3-week deep dive into the topic.
What is Representation Engineering?
Goals
Representation Engineering is a set of methods to understand and control the behaviour of LLMs. This is done by first identifying a linear direction in the activations that is related to a specific concept [3], a type of behaviour, or a function, which we call the concept vector. During the forward pass, the similarity of the activations to the concept vector can be used to detect the presence or absence of this concept.
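To make this concrete, here is a minimal sketch of such a detection check, assuming a concept vector and a hidden-state activation from some layer are already available (the variable names and the honesty example are hypothetical, not taken from a specific paper):

```python
import torch
import torch.nn.functional as F

def concept_score(activation: torch.Tensor, concept_vector: torch.Tensor) -> float:
    """Cosine similarity between a single activation and a concept vector.

    A clearly positive score suggests the concept direction is present in
    the representation; a score near zero or negative suggests it is not.
    """
    return F.cosine_similarity(activation, concept_vector, dim=-1).item()

# Hypothetical usage: `hidden` is the residual-stream activation at a chosen
# layer and token position, `honesty_vector` a previously derived concept vector.
# score = concept_score(hidden, honesty_vector)
```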
Furthermore, the concept vector can be added to the activations during the forward pass to steer the behaviour of the LLM towards the concept direction. In the following, I refer to concepts and concept vectors, but these can equally be behaviours or functions that we want to steer.
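Steering itself can be sketched as a forward hook that shifts a layer's activations along the concept direction. This is an illustrative sketch rather than the specific implementation used in the literature; the model layout, layer index, and steering coefficient below are assumptions:

```python
import torch

def make_steering_hook(concept_vector: torch.Tensor, alpha: float):
    """Return a forward hook that adds alpha * concept_vector to a layer's output."""
    def hook(module, inputs, output):
        # Many decoder layers return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * concept_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Hypothetical usage with a Hugging Face-style decoder model:
# layer = model.model.layers[15]      # which layer to steer is an assumption
# handle = layer.register_forward_hook(make_steering_hook(honesty_vector, alpha=4.0))
# output = model.generate(**inputs)   # generation now moves toward the concept
# handle.remove()
```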
This presents a new approach for interpreting NNs on the level of internal representations, instead of studying outputs or mechanistically analysing the network. This top-down frame of analysis might offer a solution to problems such as detecting deceptive behaviour or identifying harmful representations without a need to mechanistically understand the model in terms of low-level circuits [4, 24]. For example, RE has been used as a lie detector [6] and for detecting jailbreak attacks [11].
Furthermore, it offers a novel way to control the behaviour of LLMs. Whereas current approaches for aligning LLM behaviour control the weights (fine-tuning) or the inputs (prompting), Representation Engineering directly intervenes on the activations during the forward pass, allowing for efficient and fine-grained control. This is broadly applicable, for example to reducing sycophancy [17] or aligning LLMs with human preferences [19].
This method operates at the level of representations. This refers to the vectors in the activations of an LLM that are associated with a concept, behaviour or task. Golechha and Dao [24] as well as Zou et al. [4] argue that interpreting representations is a more effective paradigm for understanding and aligning LLMs than the circuit-level analysis popular in Mechanistic Interpretability (MI).
This is because MI might not be scalable for understanding large, complex systems, while RE allows the study of emergent structures in LLMs that can be distributed.
Method
Methods for Representation Engineering have two important parts: Reading and Steering. Representation Reading derives a vector from the activations that captures how the model represents a human-aligned concept like honesty, and Representation Steering changes the activations with that vector to suppress or promote that concept in the outputs.
For Representation Reading, one needs to design inputs, read out the activations, and derive the vector representing the concept of interest from those activations. First, one devises inputs that contrast with each other with respect to the concept. For example, the prompts might encourage the m...
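As an illustration of this reading step, one common simple recipe is to take the difference of mean activations between concept-positive and concept-negative prompts. The sketch below assumes a hypothetical get_activation helper, prompt lists, and layer choice:

```python
import torch

def concept_vector_from_contrast(pos_acts, neg_acts):
    """Concept vector as the difference of mean activations.

    pos_acts / neg_acts: lists of activation tensors taken at the same layer
    and token position for concept-positive and concept-negative prompts.
    """
    direction = torch.stack(pos_acts).mean(dim=0) - torch.stack(neg_acts).mean(dim=0)
    return direction / direction.norm()  # unit-normalise the concept direction

# Hypothetical usage (get_activation, the prompt lists, and layer=15 are assumptions):
# pos_acts = [get_activation(model, p, layer=15) for p in honest_prompts]
# neg_acts = [get_activation(model, p, layer=15) for p in dishonest_prompts]
# honesty_vector = concept_vector_from_contrast(pos_acts, neg_acts)
```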