AF - Sparse Autoencoders Work on Attention Layer Outputs by Connor Kissane

32:06
 
Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions is uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described here: https://fa.player.fm/legal
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sparse Autoencoders Work on Attention Layer Outputs, published by Connor Kissane on January 16, 2024 on The AI Alignment Forum.
This post is the result of a 2-week research sprint project during the training phase of Neel Nanda's MATS stream.
Executive Summary
We replicate Anthropic's MLP Sparse Autoencoder (SAE) paper on attention outputs and it works well: the SAEs learn sparse, interpretable features, which gives us insight into what attention layers learn. We study the second attention layer of a two layer language model (with MLPs).
Specifically, rather than training our SAE on attn_output, we train our SAE on "hook_z" concatenated over all attention heads (aka the mixed values aka the attention outputs before a linear map - see notation here). This is valuable as we can see how much of each feature's weights come from each head, which we believe is a promising direction to investigate attention head superposition, although we only briefly explore that in this work.
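As a concrete illustration of this setup (a minimal sketch, not the authors' training code), the concatenated hook_z activations can be grabbed with TransformerLens as below; the model name "gelu-2l" is an assumption standing in for the two layer model with MLPs studied here.

```python
# Minimal sketch (assumed setup, not the authors' code): extract hook_z for the
# second attention layer and concatenate over heads, giving the SAE's input space.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-2l")  # assumed stand-in model

tokens = model.to_tokens("The cat sat on the mat because the cat was tired")
_, cache = model.run_with_cache(tokens)

# hook_z for layer 1 (the second attention layer): [batch, pos, n_heads, d_head]
z = cache["blocks.1.attn.hook_z"]
batch, pos, n_heads, d_head = z.shape

# Concatenate over heads: [batch, pos, n_heads * d_head]. Because the SAE runs on
# this space, each decoder feature can be reshaped back to [n_heads, d_head], and
# the per-head norms show how much of the feature's weights come from each head.
z_cat = z.reshape(batch, pos, n_heads * d_head)
print(z_cat.shape)
```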
We open source our SAE; you can use it via this Colab notebook.
Shallow Dives: We do a shallow investigation to interpret each of the first 50 features. We estimate 82% of non-dead features in our SAE are interpretable (24% of the SAE features are dead).
See this feature interface to browse the first 50 features.
Deep dives: To verify our SAEs have learned something real, we zoom in on individual features for much more detailed investigations: the "'board' is next by induction" feature, the local context feature of "in questions starting with 'Which'", and the more global context feature of "in texts about pets".
We go beyond the techniques from the Anthropic paper, and investigate the circuits used to compute the features from earlier components, including analysing composition with an MLP0 SAE.
We also investigate how the features are used downstream, and whether it's via MLP1 or the direct connection to the logits.
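To make the "direct connection to the logits" concrete, here is a hedged sketch (not necessarily the exact analysis used in the post) of pushing a feature's decoder direction through the attention output matrix W_O and the unembedding; the decoder shape [n_features, n_heads * d_head] and the feature index are assumptions.

```python
# Sketch: direct-path logit effect of one SAE feature (LayerNorm scaling ignored).
import torch

def direct_logit_effect(model, W_dec: torch.Tensor, feature_id: int) -> torch.Tensor:
    n_heads, d_head = model.cfg.n_heads, model.cfg.d_head
    # Split the decoder direction back into per-head pieces of hook_z space.
    feat_z = W_dec[feature_id].reshape(n_heads, d_head)
    # Map through layer 1's W_O ([n_heads, d_head, d_model]) into the residual stream.
    resid_dir = torch.einsum("hd,hdm->m", feat_z, model.W_O[1])
    # Direct connection to the logits via the unembedding: which tokens the feature boosts.
    return resid_dir @ model.W_U  # [d_vocab]

# e.g. (hypothetical feature index):
# boosts = direct_logit_effect(model, sae.W_dec, feature_id=7)
# print(model.to_str_tokens(boosts.topk(10).indices))
```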
Automation: We automatically detect and quantify a large "{token} is next by induction" feature family. This represents ~5% of the living features in the SAE.
Though the specific automation technique won't generalize to other feature families, this is notable: if there are many "one feature per vocab token" families like this, we may need impractically wide SAEs for larger models.
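The post does not spell out its automation technique here, so the following is only one plausible heuristic, sketched under assumptions: flag a feature as an induction-family candidate if it fires on the repeated half of a random-token sequence while its direct logit path boosts the token that followed the first occurrence. The sae.encode interface (returning activations of shape [batch, pos, n_features]) is hypothetical, and direct_logit_effect is the helper from the sketch above.

```python
# Hypothetical heuristic for flagging "{token} is next by induction" features;
# the authors' actual automation method may differ.
import torch

def is_induction_feature(model, sae, feature_id: int, seq_len: int = 20,
                         threshold: float = 1.0) -> bool:
    # Random tokens repeated once: at the second occurrence of a token, induction
    # predicts the token that followed its first occurrence. 1000 is an arbitrary
    # lower bound to avoid special tokens.
    rand = torch.randint(1000, model.cfg.d_vocab, (1, seq_len))
    bos = torch.tensor([[model.tokenizer.bos_token_id]])
    tokens = torch.cat([bos, rand, rand], dim=1)

    _, cache = model.run_with_cache(tokens)
    z = cache["blocks.1.attn.hook_z"].flatten(-2)       # [1, pos, n_heads * d_head]
    acts = sae.encode(z)[0, :, feature_id]              # hypothetical SAE interface

    repeat_acts = acts[1 + seq_len:]                    # positions in the repeated half
    if repeat_acts.max() < threshold:
        return False                                    # never fires on the repeat

    pos = 1 + seq_len + repeat_acts.argmax().item()     # strongest firing position
    predicted = direct_logit_effect(model, sae.W_dec, feature_id).argmax().item()
    induction_target = tokens[0, pos - seq_len + 1].item()  # token after first occurrence
    return predicted == induction_target
```

Counting the features such a check flags, out of all alive features, would then give a family-size estimate like the ~5% quoted above.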
Introduction
In Anthropic's SAE paper, they find that training sparse autoencoders (SAEs) on a one layer model's MLP activations finds interpretable features, providing a path to break down these high-dimensional activations into units that we can understand. In this post, we demonstrate that the same technique works on attention layer outputs and learns sparse, interpretable features!
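Schematically (a minimal sketch following the broad strokes of Anthropic's setup, with details such as neuron resampling and exact hyperparameters omitted), the SAE looks roughly like this; for the attention SAE, d_in would be the n_heads * d_head of the concatenated hook_z activations.

```python
# Minimal sparse autoencoder sketch: ReLU encoder, linear decoder, L1 sparsity penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Sparse, non-negative feature activations.
        return F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        x_hat = f @ self.W_dec + self.b_dec  # linear reconstruction from features
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes feature activations toward sparsity.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(-1).mean()
```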
To see how interpretable our SAE is, we perform shallow investigations of the first 50 features of our SAE (i.e. randomly chosen features). We found that 76% are not dead (i.e. activate on at least some inputs), and within the alive features we think 82% are interpretable. To get a feel for the features we find, see our interactive visualizations of the first 50. Here's one example:[1]
Shallow investigations are limited and may be misleading or illusory, so we then do some deep dives to more deeply understand multiple individual features including:
"'board' is next, by induction" - one of many "{token} is next by induction" features
"In questions starting with 'Which'" - a local context feature, which interestingly is computed by multiple heads
"In pet context" - one of many high level context features
Similar to the Anthropic paper's "Detailed Investigations", we understand when these features activate and how they affect downstream computation. However, we also go beyond Anthropic's techniques, and look into the upstream circuits by which these features are computed from earlier components. An attention layer (with frozen att...