AF - A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team by Lee Sharkey
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team, published by Lee Sharkey on July 18, 2024 on The AI Alignment Forum.
Why we made this list:
The interpretability team at Apollo Research wrapped up a few projects recently[1]. In order to decide what we'd work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that we were excited about!
Previous lists of project ideas (such as Neel's collation of 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old. Therefore, many project ideas in that list aren't an up-to-date reflection of what some researchers consider the frontiers of mech interp.
We therefore thought it would be helpful to share our list of project ideas!
Comments and caveats:
These projects vary in how precisely they are scoped: some are vague, others are more developed.
Not every member of the team endorses every project as high priority. Usually more than one team member supports each one, and in many cases most of the team is supportive of someone working on it.
We credit each idea to the person(s) who generated it.
We've grouped the project ideas into categories for convenience, but some projects span multiple categories. We don't put a huge amount of weight on this particular categorisation.
We hope some people find this list helpful!
We would love to see people working on these! If any sound interesting to you and you'd like to chat about it, don't hesitate to reach out.
Foundational work on sparse dictionary learning for interpretability
Transcoder-related project ideas
(See [2406.11944] Transcoders Find Interpretable LLM Feature Circuits.)
[Nix] Training and releasing high-quality transcoders.
Probably using top-k (a rough training sketch follows below).
GPT-2 is a classic candidate for this. I'd be excited for people to try hard on even smaller models, e.g. GELU 4L.
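A minimal sketch of what such a top-k transcoder might look like, assuming PyTorch; the dimensions, the choice of k, and the random placeholder activations are all illustrative, not a description of Apollo's actual setup:

```python
import torch
import torch.nn as nn

class TopKTranscoder(nn.Module):
    """Sparse dictionary that maps an MLP's input to its output.

    Unlike a standard SAE, the reconstruction target is the MLP's
    output rather than its input."""

    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, mlp_in: torch.Tensor) -> torch.Tensor:
        acts = torch.relu(self.encoder(mlp_in))
        # Top-k sparsity: keep the k largest activations per token,
        # zero out the rest (no L1 penalty needed).
        top = torch.topk(acts, self.k, dim=-1)
        sparse = torch.zeros_like(acts).scatter_(-1, top.indices, top.values)
        return self.decoder(sparse)

# Toy training step; real training would regress onto MLP outputs
# cached from the target model (e.g. GPT-2 or GELU 4L).
transcoder = TopKTranscoder(d_model=768, d_dict=24576, k=32)
opt = torch.optim.Adam(transcoder.parameters(), lr=1e-4)
mlp_in, mlp_out = torch.randn(64, 768), torch.randn(64, 768)  # placeholders
opt.zero_grad()
loss = (transcoder(mlp_in) - mlp_out).pow(2).mean()
loss.backward()
opt.step()
```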
[Nix] Good tooling for using transcoders.
A nice programming API to attribute an input to a collection of paths (see Dunefsky et al.); a hypothetical interface is sketched below.
A web user interface? Maybe in collaboration with Neuronpedia. It would need a GPU server constantly running, but I'm optimistic you could do it with roughly an A4000.
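The post doesn't pin down what this API should look like, so the following is only a hypothetical sketch: `attribute` and `Path` are invented names, and the activation-times-gradient score is the generic first-order attribution, not necessarily the method of Dunefsky et al.

```python
from dataclasses import dataclass
import torch

@dataclass
class Path:
    layer: int
    feature: int
    score: float

def attribute(feature_acts: torch.Tensor,
              metric_grads: torch.Tensor,
              top_n: int = 10) -> list[Path]:
    """Rank (layer, feature) pairs by linear attribution to some metric.

    feature_acts, metric_grads: [n_layers, d_dict] tensors holding the
    transcoder feature activations and the gradient of the metric with
    respect to them. Score = activation * gradient."""
    scores = feature_acts * metric_grads
    flat = scores.flatten()
    top = torch.topk(flat.abs(), top_n).indices
    d_dict = scores.shape[1]
    return [Path(int(i) // d_dict, int(i) % d_dict, float(flat[i]))
            for i in top]

# Toy call with random placeholders standing in for cached values.
paths = attribute(torch.randn(12, 24576), torch.randn(12, 24576))
```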
[Nix] Further circuit analysis using transcoders.
Take random input sequences, run transcoder attribution on them, examine the output, and summarize the findings.
High-level summary statistics of how much attribution goes through error terms and how many pathways are needed would be valuable (see the sketch after this list item).
Explaining specific behaviors (IOI, greater-than) with high standards for specificity and faithfulness. Might be convoluted if accuracy…
[I could generate more ideas here; feel free to reach out: nix@apolloresearch.ai]
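As a rough illustration of the summary statistics suggested above, something like the following could be computed over many random sequences; the placeholder tensors stand in for real attribution scores, and "how many pathways are needed" is interpreted here as the number of strongest paths covering most of the total attribution, which is an assumption.

```python
import torch

def error_fraction(feature_attrib: torch.Tensor,
                   error_attrib: torch.Tensor) -> float:
    """Fraction of total absolute attribution that flows through the
    transcoders' error (reconstruction-residual) terms."""
    err = error_attrib.abs().sum()
    return float(err / (feature_attrib.abs().sum() + err))

def n_paths_for_coverage(path_scores: torch.Tensor,
                         frac: float = 0.9) -> int:
    """Number of strongest paths needed to account for `frac` of the
    total absolute attribution."""
    vals = path_scores.abs().sort(descending=True).values
    coverage = vals.cumsum(0) / vals.sum()
    return int((coverage < frac).sum()) + 1

print(error_fraction(torch.randn(1000), torch.randn(50)))
print(n_paths_for_coverage(torch.randn(500)))
```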
[Nix, Lee] Cross-layer superposition
Does it happen? Probably, but it would be nice to have specific examples! Look for features with similar decoder vectors and do exploratory research to figure out what exactly is going on; a starting-point sketch follows below.
What precisely does it mean? Answering this question seems likely to shed light on the question of 'What is a feature?'.
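One way to start the exploratory search, sketched under the assumption that you have decoder weight matrices from dictionaries trained at two different layers; the shapes and the similarity threshold are arbitrary choices, not from the post.

```python
import torch
import torch.nn.functional as F

def similar_feature_pairs(dec_a: torch.Tensor,
                          dec_b: torch.Tensor,
                          threshold: float = 0.9):
    """Flag feature pairs whose decoder directions nearly coincide.

    dec_a: [d_dict_a, d_model] and dec_b: [d_dict_b, d_model] are
    decoder matrices from two layers; high cosine similarity between
    rows is a candidate signature of cross-layer superposition."""
    a = F.normalize(dec_a, dim=-1)
    b = F.normalize(dec_b, dim=-1)
    sims = a @ b.T  # [d_dict_a, d_dict_b] cosine similarities
    hits = (sims > threshold).nonzero()
    return [(int(i), int(j), float(sims[i, j])) for i, j in hits]

# Placeholder decoder weights; real ones would come from trained
# SAEs or transcoders at adjacent layers.
pairs = similar_feature_pairs(torch.randn(512, 64), torch.randn(512, 64))
```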
[Lucius] Improving transcoder architectures
Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation. If we modify our transcoders to include a linear 'bypass' that is not counted in the sparsity penalty, do we improve performance, since we are no longer unduly penalizing linear transformations that would always be present and active? (A minimal sketch follows this list item.)
If we train multiple transcoders in different layers at the same time, can we include a sparsity penalty for their interactions with each other, encouraging a decomposition of the network that leaves us with as few interactions between features as possible?
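A minimal sketch of the bypass idea, assuming a ReLU transcoder with an L1 penalty in PyTorch; the key point is that the dense linear path contributes to the reconstruction but never to the sparsity loss. Dimensions and the penalty coefficient are placeholders.

```python
import torch
import torch.nn as nn

class BypassTranscoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)
        # Dense linear 'bypass' capturing any always-present linear
        # component of the layer; excluded from the sparsity penalty.
        self.bypass = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        acts = torch.relu(self.encoder(x))
        return self.decoder(acts) + self.bypass(x), acts

model = BypassTranscoder(d_model=768, d_dict=24576)
x, target = torch.randn(32, 768), torch.randn(32, 768)  # placeholders
recon, acts = model(x)
# L1 penalty applies only to dictionary activations; whatever flows
# through the bypass is reconstructed 'for free'.
loss = (recon - target).pow(2).mean() + 3e-4 * acts.abs().sum(-1).mean()
```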