AF - Analysing Adversarial Attacks with Linear Probing by Yoann Poupart

Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Analysing Adversarial Attacks with Linear Probing, published by Yoann Poupart on June 17, 2024 on The AI Alignment Forum.
This work was produced as part of the Apart Fellowship. @Yoann Poupart and @Imene Kerboua led the project; @Clement Neo and @Jason Hoelscher-Obermaier provided mentorship, feedback and project guidance.
Here, we present a qualitative analysis of our preliminary results. We are at the very beginning of our experiments, so any suggestion about the setup, the experiments, or new ideas is welcome, as are pointers to relevant papers we may have missed. For a better introduction to interpreting adversarial attacks, we recommend reading scasper's post: EIS IX: Interpretability and Adversaries.
Code is available on GitHub (still a draft).
TL;DR
Basic adversarial attacks produce examples that are indistinguishable from the original images to the human eye.
To explore the features-vs-bugs view of adversarial attacks, we trained linear probes on a classifier's activations (we used CLIP-based models to enable future multimodal work). The idea is that linear probes trained to detect basic concepts can help analyse and detect adversarial attacks.
As the concepts studied are simple (e.g. color), they can be probed from every layer of the model.
We show that concept probes are indeed modified by the adversarial attack, but only in the later layers.
Hence, we can naively detect adversarial attacks by observing the disagreement between early layers' probes and later layers' probes.
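As a minimal sketch of this heuristic (the per-layer prediction interface and the majority vote below are illustrative assumptions, not necessarily the exact rule used in our experiments):

```python
# Naive detection sketch: flag an input as adversarial when the concept predicted
# by early-layer probes disagrees with the concept predicted by later-layer probes.
# The lists of per-layer predicted concept classes are an assumed interface.

def detect_adversarial(early_layer_preds, late_layer_preds):
    """Flag the input if early- and late-layer probes disagree on the concept."""
    majority = lambda preds: max(set(preds), key=preds.count)
    return majority(early_layer_preds) != majority(late_layer_preds)

# Example: early probes predict concept 0 ("red"), later probes predict concept 2 ("blue").
print(detect_adversarial([0, 0, 0], [2, 2, 0]))  # True -> likely adversarial
```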
Introduction
Motivation
Adversarial samples are a concerning failure mode of deep learning models. It has been shown that by optimising an input, e.g. an image, the output of the targeted model can be shifted at the attacker's will. This failure mode appears repeatedly across domains such as image classification and text encoders, and more recently in multimodal models, where it enables arbitrary outputs or jailbreaking.
Hypotheses to Test
This work was partly inspired by the feature vs bug world hypothesis. In our particular context this would imply that adversarial attacks might be caused by meaningful feature directions being activated (feature world) or by high-frequency directions overfitted by the model (bug world). The hypotheses we would like to test are thus the following:
The adversariality doesn't change the features. Nakkiran et al. found truly adversarial samples that are "bugs", while Ilyas et al. seem to indicate that adversarial attacks find features.
The different optimisation schemes don't produce the same categories of adversarial samples (features vs bugs). Some might be more robust, e.g. w.r.t. the representation induced.
Post Navigation
First, we briefly present the background we rely on for adversarial attacks and linear probing. Then we showcase our experiments, presenting our setup, to understand the impact of adversarial attacks on linear probes and to see whether we can detect them naively. Finally, we present the limitations and perspectives of our work before concluding.
Background
Adversarial Attacks
For our experiments, we begin with the most basic and well-known adversarial attack, the Fast Gradient Sign Method (FGSM). This intuitive method takes advantage of the classifier's differentiability to optimise the adversary's goal (misclassification) using gradient descent. It can be described as:

$$\hat{x} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}(x, y)\big)$$

with $x, y$ the original image and label, $\hat{x}$ the adversarial example, and $\epsilon$ the perturbation amplitude. This simple method can be derived in an iterative form:

$$x_{n+1} = x_n + \alpha \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}(x_n, y)\big)$$

and seen as projected gradient descent, with the projection ensuring that $\|x - x_{n+1}\| \le \epsilon$. In our experiments, $\epsilon = 3$.
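For concreteness, here is a minimal PyTorch sketch of these two attacks (the model interface, step size, cross-entropy loss, and the omission of pixel-range clipping are illustrative assumptions, not the exact implementation from the linked repository):

```python
# Minimal FGSM sketch (assumes a PyTorch classifier returning logits).
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    """Single-step FGSM: move x by epsilon in the direction of the loss gradient's sign."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()

def iterative_fgsm(model, x, y, epsilon, alpha, n_steps):
    """Iterative variant: repeat small FGSM steps of size alpha, projecting the
    perturbation back into the epsilon-ball around the original image."""
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(n_steps):
        x_adv = fgsm(model, x_adv, y, alpha)
        # Projection step, ensuring ||x - x_{n+1}|| <= epsilon.
        x_adv = x_orig + torch.clamp(x_adv - x_orig, -epsilon, epsilon)
    return x_adv
```

The clamping implements the projection in the constraint above; in practice one would typically also clip the result back into the valid pixel range.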
Linear Probing
Linear probing is a simple idea where you train a linear model (a probe) to predict a concept from the internals of the interpreted target model. The prediction performance is then attributed to the knowledge contained in the target model's latent re...
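As an illustration, here is a minimal PyTorch sketch of training such a probe on activations captured from a frozen model (the hook-based activation extraction and the probe hyperparameters are illustrative assumptions):

```python
# Sketch: fit a linear probe on a frozen model's intermediate activations.
import torch
import torch.nn as nn

def get_activations(model, layer, images):
    """Run images through the frozen model and capture the chosen layer's output
    (assumed to be a single tensor), flattened to (batch, features)."""
    captured = []
    handle = layer.register_forward_hook(lambda mod, inp, out: captured.append(out.detach()))
    with torch.no_grad():
        model(images)
    handle.remove()
    return captured[0].flatten(start_dim=1)

def train_probe(activations, concept_labels, n_classes, epochs=100, lr=1e-2):
    """Train a single linear layer to predict the concept from the activations."""
    probe = nn.Linear(activations.shape[1], n_classes)
    optimiser = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimiser.zero_grad()
        loss = loss_fn(probe(activations), concept_labels)
        loss.backward()
        optimiser.step()
    return probe
```

The probe's accuracy on held-out activations is then used as a measure of how linearly decodable the concept is at that layer.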