Everyone has a dream. But sometimes there’s a gap between where we are and where we want to be. True, there are some people who can bridge that gap easily, on their own, but all of us need a little help at some point. A little boost. An accountability partner. A Snooze Squad. In each episode, the Snooze Squad will strategize an action plan for people to face their fears. Guests will transform their own perception of their potential and walk away a few inches closer to who they want to become ...
…
continue reading
محتوای ارائه شده توسط The Nonlinear Fund. تمام محتوای پادکست شامل قسمتها، گرافیکها و توضیحات پادکست مستقیماً توسط The Nonlinear Fund یا شریک پلتفرم پادکست آنها آپلود و ارائه میشوند. اگر فکر میکنید شخصی بدون اجازه شما از اثر دارای حق نسخهبرداری شما استفاده میکند، میتوانید روندی که در اینجا شرح داده شده است را دنبال کنید.https://fa.player.fm/legal
Player FM - برنامه پادکست
با برنامه Player FM !
با برنامه Player FM !
AF - Mechanistically Eliciting Latent Behaviors in Language Models by Andrew Mack
Manage episode 415566969 series 2997284
محتوای ارائه شده توسط The Nonlinear Fund. تمام محتوای پادکست شامل قسمتها، گرافیکها و توضیحات پادکست مستقیماً توسط The Nonlinear Fund یا شریک پلتفرم پادکست آنها آپلود و ارائه میشوند. اگر فکر میکنید شخصی بدون اجازه شما از اثر دارای حق نسخهبرداری شما استفاده میکند، میتوانید روندی که در اینجا شرح داده شده است را دنبال کنید.https://fa.player.fm/legal
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mechanistically Eliciting Latent Behaviors in Language Models, published by Andrew Mack on April 30, 2024 on The AI Alignment Forum. Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout). TL,DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations overriding safety training, eliciting backdoored behaviors and uncovering latent capabilities. Summary In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given layer, trained with the same unsupervised objective. I apply the method to several alignment-relevant toy examples, and find that the method consistently learns vectors/adapters which encode coherent and generalizable high-level behaviors. Compared to other interpretability methods, I believe my approach is particularly well-suited for robustly understanding the out-of-distribution behavior of language models in a sample-efficient manner. Below are some of my key results: Red-Teaming 1. I discover several anti-refusal steering vectors in Qwen-14B-Chat, based off a single prompt asking for bomb-making instructions. These can be grouped into "fantasy" vectors which induce bomb-making instructions since they interpret the prompt in the context of a specific fantasy game, as well as more troubling "real-world" vectors which induce real-world bomb-making advice. 2. I then investigate the generalization properties of the learned vectors: 1. In extended conversations with the real-world vectors, the LLM agrees to give detailed instructions for building weapons of mass destruction such as nuclear/chemical/biological weapons. 2. "Vector arithmetic" results from the supervised steering vector literature carry over to unsupervised steering vectors; subtracting one of the real-world anti-refusal vectors leads the model to refuse innocuous prompts (e.g., "How do I tie my shoes?"). 3. The fantasy vectors induce the LLM to interpret ambiguous prompts (e.g., "How do I mine for diamonds?") within the context of a specific fantasy game. Backdoor Detection 1. I detect backdoors fine-tuned into Qwen-1.8B-(Base and Chat) on a simple arithmetic task by training unsupervised steering vectors on a single clean prompt. Capability Discovery 1. I discover a chain-of-thought steering vector in Qwen-1.8B-Base trained on one simple arithmetic prompt. The vector increases accuracy of the model's responses on other instances of the arithmetic task from 11% (unsteered) to 63% (steered), suggesting the vector has isolated a generalizable behavior. 2. I discover a "Portuguese math-reasoning" adapter in Qwen-1.8B-Base, again trained on one example prompt from the arithmetic task used above. Outline of Post: I first provide an introduction to the problem I call mechanistically eliciting latent behaviors in language models (MELBO) and motivate why this is important for AI alignment. This is followed by a review of related literature. I then describe the method for learning unsupervised steering vectors/adapters in detail, and offer a theory for why the method works. Next, I apply the method to several alignment-relevant toy examples, using these as an opportunity to highlight potential alignment use-cases, as well as to evaluate the coherence and generalization of the learned perturbations. I should note that this research project is an ...
…
continue reading
2433 قسمت
Manage episode 415566969 series 2997284
محتوای ارائه شده توسط The Nonlinear Fund. تمام محتوای پادکست شامل قسمتها، گرافیکها و توضیحات پادکست مستقیماً توسط The Nonlinear Fund یا شریک پلتفرم پادکست آنها آپلود و ارائه میشوند. اگر فکر میکنید شخصی بدون اجازه شما از اثر دارای حق نسخهبرداری شما استفاده میکند، میتوانید روندی که در اینجا شرح داده شده است را دنبال کنید.https://fa.player.fm/legal
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mechanistically Eliciting Latent Behaviors in Language Models, published by Andrew Mack on April 30, 2024 on The AI Alignment Forum. Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout). TL,DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations overriding safety training, eliciting backdoored behaviors and uncovering latent capabilities. Summary In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given layer, trained with the same unsupervised objective. I apply the method to several alignment-relevant toy examples, and find that the method consistently learns vectors/adapters which encode coherent and generalizable high-level behaviors. Compared to other interpretability methods, I believe my approach is particularly well-suited for robustly understanding the out-of-distribution behavior of language models in a sample-efficient manner. Below are some of my key results: Red-Teaming 1. I discover several anti-refusal steering vectors in Qwen-14B-Chat, based off a single prompt asking for bomb-making instructions. These can be grouped into "fantasy" vectors which induce bomb-making instructions since they interpret the prompt in the context of a specific fantasy game, as well as more troubling "real-world" vectors which induce real-world bomb-making advice. 2. I then investigate the generalization properties of the learned vectors: 1. In extended conversations with the real-world vectors, the LLM agrees to give detailed instructions for building weapons of mass destruction such as nuclear/chemical/biological weapons. 2. "Vector arithmetic" results from the supervised steering vector literature carry over to unsupervised steering vectors; subtracting one of the real-world anti-refusal vectors leads the model to refuse innocuous prompts (e.g., "How do I tie my shoes?"). 3. The fantasy vectors induce the LLM to interpret ambiguous prompts (e.g., "How do I mine for diamonds?") within the context of a specific fantasy game. Backdoor Detection 1. I detect backdoors fine-tuned into Qwen-1.8B-(Base and Chat) on a simple arithmetic task by training unsupervised steering vectors on a single clean prompt. Capability Discovery 1. I discover a chain-of-thought steering vector in Qwen-1.8B-Base trained on one simple arithmetic prompt. The vector increases accuracy of the model's responses on other instances of the arithmetic task from 11% (unsteered) to 63% (steered), suggesting the vector has isolated a generalizable behavior. 2. I discover a "Portuguese math-reasoning" adapter in Qwen-1.8B-Base, again trained on one example prompt from the arithmetic task used above. Outline of Post: I first provide an introduction to the problem I call mechanistically eliciting latent behaviors in language models (MELBO) and motivate why this is important for AI alignment. This is followed by a review of related literature. I then describe the method for learning unsupervised steering vectors/adapters in detail, and offer a theory for why the method works. Next, I apply the method to several alignment-relevant toy examples, using these as an opportunity to highlight potential alignment use-cases, as well as to evaluate the coherence and generalization of the learned perturbations. I should note that this research project is an ...
…
continue reading
2433 قسمت
همه قسمت ها
×به Player FM خوش آمدید!
Player FM در سراسر وب را برای یافتن پادکست های با کیفیت اسکن می کند تا همین الان لذت ببرید. این بهترین برنامه ی پادکست است که در اندروید، آیفون و وب کار می کند. ثبت نام کنید تا اشتراک های شما در بین دستگاه های مختلف همگام سازی شود.