AF - What progress have we made on automated auditing? by Lawrence Chan

The Nonlinear Library: Alignment Forum

محتوای ارائه شده توسط The Nonlinear Fund. تمام محتوای پادکست شامل قسمت‌ها، گرافیک‌ها و توضیحات پادکست مستقیماً توسط The Nonlinear Fund یا شریک پلتفرم پادکست آن‌ها آپلود و ارائه می‌شوند. اگر فکر می‌کنید شخصی بدون اجازه شما از اثر دارای حق نسخه‌برداری شما استفاده می‌کند، می‌توانید روندی که در اینجا شرح داده شده است را دنبال کنید.https://fa.player.fm/legal

1+ y ago 2:50

MP3•خانه قسمت

بایگانی مجموعه ها ("فیدهای غیر فعال" status)

When? This feed was archived on October 23, 2024 10:10 (1y ago). Last successful fetch was on September 19, 2024 11:06 (1y ago)

Why? فیدهای غیر فعال status. سرورهای ما، برای یک دوره پایدار، قادر به بازیابی یک فید پادکست معتبر نبوده اند.

What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.

Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What progress have we made on automated auditing?, published by Lawrence Chan on July 6, 2024 on The AI Alignment Forum.
One use case for model internals work is to perform automated auditing of models:
https://www.alignmentforum.org/posts/cQwT8asti3kyA62zc/automating-auditing-an-ambitious-concrete-technical-research
That is, given a specification of intended behavior, the attacker produces a model that doesn't satisfy the spec, and the auditor needs to determine how the model doesn't satisfy the spec. This is closely related to static backdoor detection: given a model M, determine if there exists a backdoor function that, for any input, transforms that input to one where M has different behavior.[1]
There's some theoretical work (Goldwasser et al. 2022) arguing that for some model classes, static backdoor detection is impossible even given white-box model access -- specifically, they prove their results for random feature regression and (the very similar setting) of wide 1-layer ReLU networks.
Relatedly, there's been some work looking at provably bounding model performance (Gross et al. 2024) -- if this succeeds on "real" models and "real" specification, then this would solve the automated auditing game. But the results so far are on toy transformers, and are quite weak in general (in part because the task is so difficult).[2]
Probably the most relevant work is Halawi et al. 2024's Covert Malicious Finetuning (CMFT), where they demonstrate that it's possible to use finetuning to insert jailbreaks and extract harmful work, in ways that are hard to detect with ordinary harmlessness classifiers.[3]
As this is machine learning, just because something is impossible in theory and difficult on toy models doesn't mean we can't do this in practice. It seems plausible to me that we've demonstrated non-zero empirical results in terms of automatically auditing model internals. So I'm curious: how much progress have we made on automated auditing empirically? What work exists in this area? What does the state-of-the-art in automated editing look like?
1. ^
Note that I'm not asking about mechanistically anomaly detection/dynamic backdoor detection; I'm aware that it's pretty easy to distinguish if a particular example is backdoored using baseline techniques like "fit a Gaussian density on activations and look at the log prob of the activations on each input" or "fit a linear probe on a handful of examples using logistic regression".
2. ^
I'm also aware of some of the work in the trojan detection space, including 2023 Trojan detection contest, where performance on extracting embedded triggers was little better than chance.
3. ^
That being said, it's plausible that dynamically detecting them given model internals is very easy.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

392 قسمت

#The Nonlinear Fund #Podcasting Education #Of TexttoSpeech