AF - Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller by Henry Cai
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller, published by Henry Cai on June 16, 2024 on The AI Alignment Forum.
In this paper, we try to control model behaviors. For example, given the prompt "You hear someone making fun of a topic you're passionate about", we can steer an LLM to respond in an angrier manner. We can also control "any" behavior of an LLM simply by writing a one-line description of the target behavior. The teaser below shows the scope of our method -- SelfControl.
TL;DR: We propose a novel framework, SelfControl, to control LLMs' behaviors. By appending suffix strings, e.g. "Is the above response helpful? Please answer Yes or No", to make the model judge its own output, and optimizing the corresponding suffix score, we obtain suffix gradients w.r.t. the model's hidden states and directly modify those states to control model behaviors on the fly. We then compress the gradients into a Prefix Controller, enabling control toward any behavior target without additional cost.
Our experiments demonstrate its efficacy, and our exploratory study hints at potential mechanistic-interpretability applications of suffix gradients.
Tweet thread summary: link
Colab demo: link
Github link: code
Summary of Paper
Our SelfControl framework has two parts: the first is a training-free method, and the second is a parameter-efficient module.
The idea of the first part is straightforward -- we wanted to control model behaviors through representation/activation engineering[1], but in a different way from the RepE paper. We thought gradients might be more flexible and open up more possibilities. So we append a suffix string and obtain gradients of the so-called "suffix score", which removes the need to collect an annotated dataset. We call these "suffix gradients".
This also picks up the theme of "self-improvement/self-judgment", which has garnered much interest.
Based on this idea, we built an iterative framework: 1) define the control direction by choosing a suffix string and target (step 2 in the figure); 2) branch at the first token and, at each iteration, sample the response with the highest suffix score (steps 1/4 in the figure); and 3) obtain gradients based on that response, find a proper step size for them, and control the model by adding them to the hidden states at the input-token positions (step 3 in the figure).
Steps 3 and 4 form the iteration loop. The optimization objective is thus to maximize the suffix score shown below:
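A plausible rendering of this objective, assuming the suffix score $S$ measures the model's preference for the target answer (e.g., "Yes") given the controlled input states, the sampled response $y$, and the appended suffix $s$:

$$\max_{H_{\text{input}}}\; S\big(H_{\text{input}},\, y,\, s\big), \qquad H_{\text{input}} \;\leftarrow\; H_{\text{input}} + \alpha\, \nabla_{H_{\text{input}}} S,$$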
where H_{input} denotes the input hidden states after adding the suffix gradients. Note that we use the original (uncontrolled) model to evaluate the suffix score.
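For concreteness, here is a minimal sketch of one such iteration with a HuggingFace causal LM. This is not the released code: the suffix wording, the Yes/No logit-difference score, the step size, the layer choice, and the Llama-style module path are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any chat-style causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

SUFFIX = "\nIs the above response helpful? Please answer Yes or No. Answer:"
YES = tok(" Yes", add_special_tokens=False).input_ids[-1]
NO = tok(" No", add_special_tokens=False).input_ids[-1]

def suffix_score_and_grads(prompt, response):
    """Suffix score = logit('Yes') - logit('No') after the self-judging suffix;
    gradients are taken w.r.t. every layer's hidden states."""
    ids = tok(prompt + response + SUFFIX, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    score = out.logits[0, -1, YES] - out.logits[0, -1, NO]
    grads = torch.autograd.grad(score, out.hidden_states)
    return score.item(), grads

def generate(prompt, grads=None, step_size=4.0, layers=(10, 15, 20)):
    """Generate a response; if suffix gradients are given, add the scaled gradient
    to the hidden states at the input-token positions (prefill pass only)."""
    ids = tok(prompt, return_tensors="pt")
    n_in = ids.input_ids.shape[1]
    hooks = []
    if grads is not None:
        for l in layers:
            # hidden_states[0] is the embedding output, so layer l maps to grads[l + 1]
            delta = step_size * grads[l + 1][:, :n_in, :].detach()
            def hook(module, inputs, output, delta=delta):
                h = output[0] if isinstance(output, tuple) else output
                if h.shape[1] >= delta.shape[1]:  # only on the prefill forward pass
                    h[:, :delta.shape[1], :] += delta.to(h.dtype)
                return output
            hooks.append(model.model.layers[l].register_forward_hook(hook))
    try:
        with torch.no_grad():
            out = model.generate(**ids, max_new_tokens=64, do_sample=True)
    finally:
        for h in hooks:
            h.remove()
    return tok.decode(out[0, n_in:], skip_special_tokens=True)

# One iteration: sample, self-judge, then regenerate under the suffix gradient.
prompt = "You hear someone making fun of a topic you're passionate about.\n"
draft = generate(prompt)
score, grads = suffix_score_and_grads(prompt, draft)
controlled = generate(prompt, grads)
```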
We were also interested in compressing the found gradients into another module. This is similar to the idea of LoRRA in the RepE paper[2] and to a parallel work, but we were more interested in learning a prefix. By gathering suffix gradients obtained from the first part, we train a Prefix Controller by minimizing the mean squared error between the hidden states (latent representations) at the corresponding layers.
To ensure the quality of the training data (suffix gradients), we filter them by their norms and by the suffix score of the output generated under control with those gradients.
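A minimal sketch of what this distillation step could look like, assuming the controller is a set of learnable prefix embeddings prepended to the input and the base model is frozen; the exact parameterization in the paper may differ, and all names and hyperparameters here are illustrative.

```python
import torch
import torch.nn as nn

class PrefixController(nn.Module):
    """Learnable prefix embeddings prepended to the input embeddings."""
    def __init__(self, n_prefix, d_model):
        super().__init__()
        self.prefix = nn.Parameter(0.02 * torch.randn(n_prefix, d_model))

    def prepend(self, input_embeds):  # (batch, seq, d) -> (batch, n_prefix + seq, d)
        batch = input_embeds.shape[0]
        return torch.cat([self.prefix.expand(batch, -1, -1), input_embeds], dim=1)

def distill_step(model, controller, optimizer, input_embeds, target_hidden, layers):
    """One step: run the frozen model with the prefix and match the stored
    gradient-controlled hidden states at the chosen layers with an MSE loss."""
    out = model(inputs_embeds=controller.prepend(input_embeds),
                output_hidden_states=True)
    n = controller.prefix.shape[0]
    loss = sum(
        nn.functional.mse_loss(out.hidden_states[l][:, n:, :], target_hidden[l])
        for l in layers
    )
    optimizer.zero_grad()
    loss.backward()  # gradients flow only into the prefix parameters (model is frozen)
    optimizer.step()
    return loss.item()

# Usage sketch: freeze the base model, then optimize only the controller.
# model.requires_grad_(False)
# controller = PrefixController(n_prefix=8, d_model=model.config.hidden_size)
# optimizer = torch.optim.Adam(controller.parameters(), lr=1e-3)
```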
Below are some of the results. SelfControl achieves good performance on various tasks. Notably, it can also serve as a data-synthesis method (see the DPO experiment):
We also carried out some exploratory studies on suffix gradients; we are especially interested in how the gradients' norm patterns vary across different tasks:
Overall, our experiments and analysis show that SelfControl is able to control a wide range of model behaviors, and can potentially be applied to other areas, including alignment (the DPO experiment) and mechanistic interpretability.