AF - Self-explaining SAE features by Dmitrii Kharlapenko

The Nonlinear Library: Alignment Forum

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Self-explaining SAE features, published by Dmitrii Kharlapenko on August 5, 2024 on The AI Alignment Forum.
TL;DR
We apply the method of SelfIE/Patchscopes to explain SAE features - we give the model a prompt like "What does X mean?", replace the residual stream on X with the decoder direction times some scale, and have it generate an explanation. We call this self-explanation.
The natural alternative is auto-interp, using a larger LLM to spot patterns in max activating examples. We show that our method is effective, and comparable with Neuronpedia's auto-interp labels (with the caveat that Neuronpedia's auto-interp used the comparatively weak GPT-3.5 so this is not a fully fair comparison).
We aren't confident you should use our method over auto-interp, but we think it has advantages in some situations: no max activating dataset examples are needed, and it's cheaper because you only run the model being studied (e.g. Gemma 2B), not a larger model like GPT-4.
Further, it makes different errors from auto-interp, so generating and reading both kinds of explanation may be valuable for researchers in practice.
We provide advice for using self-explanation in practice, in particular for the challenge of automatically choosing the right scale, which significantly affects explanation quality.
We also release a tool for you to work with self-explanation.
We hope the technique is useful to the community as is, but expect there are many possible optimizations and improvements on top of what is in this post.
Introduction
This work was produced as part of the ML Alignment & Theory Scholars Program - Summer 24 Cohort, under mentorship from Neel Nanda and Arthur Conmy.
SAE features promise a flexible and extensive framework for interpretation of LLM internals. Recent work (like Scaling Monosemanticity) has shown that they are capable of capturing even high-level abstract concepts inside the model. Compared to MLP neurons, they can capture many more interesting concepts.
Unfortunately, in order to learn things with SAE features and interpret what the SAE tells us, one first needs to interpret the features themselves. The current mainstream method for interpreting them requires storing the feature's activations on millions of tokens, filtering for the prompts that activate it the most, and looking for a pattern connecting them. This is typically done by a human, or sometimes partly automated with larger LLMs like ChatGPT, aka auto-interp. Auto-interp is a useful and somewhat effective method, but it requires an extensive amount of data and expensive closed-source language model API calls (for researchers outside scaling labs).
Recent papers like SelfIE or Patchscopes have proposed a mechanistic method that uses the model in question directly to explain its own internal activations in natural language. The approach replaces an activation during the forward pass (e.g. some of the token embeddings in the prompt) with a new activation and then has the model generate an explanation from this modified prompt.
It's a variant of activation patching, with the notable differences that it generates a multi-token output (rather than a single token), and that the patched-in activation need not be the same type as the activation it overrides (it can be an arbitrary vector of the same dimension). We study how this approach can be applied to SAE feature interpretation, since it is:
Potentially cheaper, as it does not require inference on a large closed model
Arguably more faithful to the source, since it uses the SAE feature vectors directly to generate explanations instead of looking at the max activating examples
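The core patching step described above can be illustrated with a toy example. This is only a minimal sketch and not the authors' implementation: the residual stream is a plain list of vectors, the function names `unit` and `patch_residual` are hypothetical, and normalising the decoder direction before scaling is an assumption made so that `scale` alone sets the norm of the patched vector.

```python
import math


def unit(vec):
    """Return vec rescaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def patch_residual(resid, position, direction, scale):
    """Return a copy of `resid` with the vector at `position` overwritten.

    resid:     list of per-token residual vectors (lists of floats)
    position:  index of the placeholder token to overwrite
    direction: SAE decoder direction of the feature being explained
    scale:     multiplier applied to the unit-normalised direction
               (normalising first is an assumption of this sketch)
    """
    if len(direction) != len(resid[position]):
        raise ValueError("direction must match the residual dimension")
    patched = [list(v) for v in resid]  # copy, so the input is left untouched
    patched[position] = [scale * x for x in unit(direction)]
    return patched


# Toy residual stream: 4 token positions, model dimension 3 (made-up numbers).
resid = [[0.1, 0.2, 0.3], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5], [0.3, 0.1, 0.2]]
decoder_direction = [3.0, 0.0, 4.0]  # made-up feature direction with norm 5

patched = patch_residual(resid, position=2, direction=decoder_direction, scale=10.0)
print(patched[2])
```

In a real model the same overwrite would happen inside the forward pass (for example via a hook on the residual stream), after which the model generates a multi-token explanation from the patched prompt.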
How to use
Basic method
We ask the model to explain the meaning of a residual stream direction as if it literally was a word or phrase:
Prompt 1 (/ replaced according to model inp...
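As a rough illustration of how the pieces might fit together, the sketch below uses the "What does X mean?" wording from the TL;DR (not the exact Prompt 1 text), locates the placeholder token, and requests one explanation per candidate scale. Everything here is an assumption for illustration: the hand-written tokenisation, the helper names `find_placeholder` and `sweep_scales`, the scale grid, and `stub_explain`, which merely stands in for running the patched model and returns no real output.

```python
from typing import Callable, Dict, List, Sequence


def find_placeholder(tokens: Sequence[str], placeholder: str = "X") -> List[int]:
    """Return every index at which the placeholder token appears."""
    return [i for i, tok in enumerate(tokens) if tok == placeholder]


def sweep_scales(
    explain_fn: Callable[[List[int], float], str],
    positions: List[int],
    scales: Sequence[float],
) -> Dict[float, str]:
    """Call explain_fn once per scale and collect the explanations by scale."""
    return {scale: explain_fn(positions, scale) for scale in scales}


def stub_explain(positions: List[int], scale: float) -> str:
    """Hypothetical stand-in for: patch the feature in at `positions`, then generate."""
    return f"<generation with positions {positions} patched at scale {scale}>"


# Hand-tokenised stand-in prompt; a real tokenizer would split it differently.
tokens = ["What", "does", "X", "mean", "?"]
positions = find_placeholder(tokens)

# Arbitrary example grid; which scale works best is not known in advance.
explanations = sweep_scales(stub_explain, positions, scales=[1.0, 5.0, 10.0, 20.0])
for scale, text in explanations.items():
    print(scale, text)
```

Collecting explanations across several scales makes it possible to compare them side by side, which matters because the scale significantly affects explanation quality.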
