AF - Extracting SAE task features for ICL by Dmitrii Kharlapenko
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Extracting SAE task features for ICL, published by Dmitrii Kharlapenko on August 12, 2024 on The AI Alignment Forum.
TL;DR
We try to study task vectors in the SAE basis. This is challenging because there is no canonical way to convert an arbitrary vector in the residual stream to a linear combination of SAE features - you can't just pass an arbitrary vector through the encoder without going off distribution.
We explored the gradient pursuit algorithm suggested in Smith et al., but it didn't work for us without modifications.
Our approach is to apply the SAE encoder to the task vector and then apply a gradient-based cleanup. This exploits the fact that task vectors have a differentiable objective. We find that this gives a sparser and cleaner reconstruction that is highly interpretable, and that also serves as a better task vector because it directly optimizes for log likelihood. This takes us from ~100 active features to ~10.
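The cleanup idea can be sketched in miniature. Everything below is a toy stand-in, not the post's code: the objective is a least-squares reconstruction proxy rather than the model's log likelihood (which would require running the full LM), the decoder is random, and the shapes are assumptions. The mechanism is the same, though: start from a dense, noisy set of feature activations and run proximal gradient descent with an L1 penalty, so that small spurious features shrink to exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 32, 128

# Toy SAE decoder with unit-norm feature directions (shapes are assumptions).
W_dec = rng.normal(size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# Ground-truth sparse code: pretend the "task vector" is built from 5 features.
true_acts = np.zeros(d_sae)
true_acts[rng.choice(d_sae, 5, replace=False)] = rng.uniform(1.0, 2.0, 5)
task_vector = true_acts @ W_dec

# Dense, noisy initial guess, standing in for the SAE encoder's output.
acts = true_acts + 0.05 * rng.normal(size=d_sae)

# Proximal gradient descent (ISTA). The objective here is reconstruction
# error; in the post it is the model's NLL on zero-shot prompts.
lr, l1 = 0.05, 0.02
for _ in range(500):
    grad = (acts @ W_dec - task_vector) @ W_dec.T
    acts = acts - lr * grad
    # Soft-thresholding implements the L1 penalty: small features hit zero.
    acts = np.sign(acts) * np.maximum(np.abs(acts) - lr * l1, 0.0)
    acts = np.maximum(acts, 0.0)  # SAE activations are non-negative

print("active features:", int((acts > 1e-3).sum()))
print("recon error:", float(np.linalg.norm(acts @ W_dec - task_vector)))
```

The soft-thresholding step is what distinguishes this from plain gradient descent: it drives noise features to exact zeros instead of merely small values, which is what makes the resulting decomposition readable.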
Using our algorithm, we find two classes of SAE features involved in ICL. One of them recognizes the exact tasks or output formats from the examples, and another one encodes the tasks for execution by the model later on. We show that steering with these features has causal effects similar to task vectors.
This work was produced as part of the ML Alignment & Theory Scholars Program - Summer 24 Cohort, under mentorship from Neel Nanda and Arthur Conmy.
Prior work
Task or function vectors are internal representations of some task that LLMs form while processing an ICL prompt. They can be extracted from a model running on a few-shot prompt and then be used to make it complete the same task without having any prior context or task description.
Several papers (Function vectors in large language models, In-Context Learning Creates Task Vectors) have proposed different ways to extract those task vectors. They all center on feeding ICL examples to the model in the form "input output, … " and averaging the residuals on the "separator" token over a batch. This approach can recover some part of the ICL performance but does not admit a straightforward conversion to the SAE basis.
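The extraction recipe described above is mechanically simple. Here is a hypothetical sketch (function name, array shapes, and the fake cached activations are all assumptions, not the papers' code): given cached residual-stream activations for a batch of few-shot prompts, take the activation at each prompt's separator token at a chosen layer and average over the batch.

```python
import numpy as np

def extract_task_vector(resid_acts, sep_positions, layer):
    """Average the residual-stream activation at each prompt's separator token.

    resid_acts:    array [batch, layer, seq, d_model] of cached residuals
    sep_positions: array [batch], index of the separator token in each prompt
    layer:         layer to extract from
    """
    batch = resid_acts.shape[0]
    # Advanced indexing picks one (layer, position) slice per prompt.
    sep_resid = resid_acts[np.arange(batch), layer, sep_positions]  # [batch, d_model]
    return sep_resid.mean(axis=0)  # [d_model]

# Toy demo with fake cached activations: 8 prompts, 4 layers, 16 tokens, d_model=32.
rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 4, 16, 32))
seps = rng.integers(0, 16, size=8)  # separator position per prompt
tv = extract_task_vector(acts, seps, layer=2)
print(tv.shape)  # (32,)
```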
ITO with gradient pursuit can be used to perform sparse coding of a residual vector using SAE features. The post suggests using this algorithm for SAE decomposition of steering vectors. Since task vectors can be thought of as steering vectors, ITO may provide some insight into the ways they operate.
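To make the pursuit idea concrete, here is a simplified matching-pursuit-style sketch, not the exact gradient pursuit of Smith et al.: greedily add the dictionary feature most correlated with the current residual, then re-fit the coefficients of the selected support by least squares. All names and the toy dictionary are assumptions.

```python
import numpy as np

def sparse_code(x, dictionary, k):
    """Greedy k-sparse coding of x against unit-norm dictionary rows.

    A simplified matching-pursuit-style stand-in for gradient pursuit:
    pick the feature most correlated with the residual, then re-fit all
    selected coefficients by least squares, k times.
    """
    support = []
    residual = x.copy()
    coeffs = np.zeros(0)
    for _ in range(k):
        scores = dictionary @ residual
        scores[support] = 0.0  # never re-select an already chosen feature
        support.append(int(np.argmax(np.abs(scores))))
        D = dictionary[support]                          # [|support|, d_model]
        coeffs, *_ = np.linalg.lstsq(D.T, x, rcond=None) # re-fit on support
        residual = x - coeffs @ D
    return support, coeffs, residual

# Toy demo: a 3-sparse vector over a random 64-feature dictionary in 16 dims.
rng = np.random.default_rng(1)
D = rng.normal(size=(64, 16))
D /= np.linalg.norm(D, axis=1, keepdims=True)
true = np.zeros(64)
true[[3, 17, 42]] = [1.5, -2.0, 1.0]
x = true @ D
support, coeffs, residual = sparse_code(x, D, k=3)
print(sorted(support), float(np.linalg.norm(residual)))
```

The least-squares re-fit at every step (rather than a fixed gradient step) is what keeps the selected coefficients consistent with each other as the support grows.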
Initial Phi-3 experiments
Direct SAE task vector reconstruction
In our study we trained a set of gated SAEs for Phi-3 Mini 3.8B using a model-generated synthetic instruction dataset.
While offering a sparse dictionary decomposition of residuals, SAEs tend to introduce a reconstruction error that impacts the performance of the model. They also have no guarantee of being able to decompose out-of-distribution vectors, and task vectors, being the product of averaging activations across prompts and tokens, may well be such vectors.
Thus, we first studied the performance of SAE reconstructions of task vectors in transferring the definition of two tasks: 1) antonym generation and 2) English to Spanish word translation. These and other tasks used to study task vectors were taken from the GitHub repository of the ICL task vectors paper.
These charts show the NLL loss of the model on the evaluation set of zero-shot prompts for both of the tasks depending on the layer of extraction/insertion.
TV stands for the original task vector performance;
Recon of TV stands for using the SAE reconstruction of the task vector instead of the task vector;
TV on recon stands for first doing a SAE reconstruction of the residuals and then collecting a task vector on them;
ITO stands for the ITO algorithm with a target L0 of 40.
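The four-way comparison can be organized as a simple evaluation loop. Everything below is a hypothetical harness: the stub model, the `nll_with_injection` helper, and the random vectors stand in for Phi-3 and the real task-vector variants. For each layer and each variant, inject the vector into the residual stream on a zero-shot prompt and record the NLL of the correct completion.

```python
import numpy as np

def nll_with_injection(model_fn, vector, layer, prompt, target):
    """Run a model with `vector` added to the residual stream at `layer`
    and return the negative log-likelihood of the `target` token."""
    logits = model_fn(prompt, inject=(layer, vector))
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return -np.log(probs[target])

# Stub model: logits come from a fixed readout of the injected vector,
# standing in for Phi-3 run on a zero-shot prompt with a residual hook.
rng = np.random.default_rng(0)
readout = rng.normal(size=(32, 10))  # d_model=32, vocab=10 (toy sizes)

def stub_model(prompt, inject):
    layer, vector = inject
    return vector @ readout  # toy: output fully determined by the injection

variants = {
    "TV": rng.normal(size=32),
    "Recon of TV": rng.normal(size=32),
    "ITO": rng.normal(size=32),
}
losses = {name: float(nll_with_injection(stub_model, v, layer=5,
                                         prompt="word ->", target=3))
          for name, v in variants.items()}
print(losses)
```

Sweeping `layer` over the model's depth and plotting the resulting NLL per variant would reproduce the shape of the charts described above.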
It can be seen from the charts that SAE reconstruction significantly decrea...