AF - The case for more ambitious language model evals by Arun Jose
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The case for more ambitious language model evals, published by Arun Jose on January 30, 2024 on The AI Alignment Forum.
Here are some capabilities that I expect to be pretty hard to discover using an RLHF'd chat LLM:
Eric Drexler tried to use the GPT-4 base model as a writing assistant, and it [...] knew who he was from what he was writing. He tried to simulate a conversation to have the AI help him with some writing he was working on, and the AI simulacrum repeatedly insisted it was by Drexler.
A somewhat well-known Haskell programmer - let's call her Alice - wrote two draft paragraphs of a blog post she wanted to write, began prompting the base model with it, and after about two iterations it generated a link to her draft blog post repo with her name.
More generally, this is a cluster of capabilities that could be described as language models inferring a surprising amount about the data-generation process that produced their prompt, such as the identity, personality, intentions, or history of a user[1].
The reason I expect most capability evals people currently run on language models to miss out on most abilities like these is primarily that they're most naturally observed when dealing with much more open-ended contexts. For instance, continuing text as the user, predicting an assistant free to do things that could superficially look like hallucinations[2], and so on. Most evaluation mechanisms people use today involve testing the ability of fine-tuned[3] models to perform a broad array of specified tasks in some specified contexts, with or without some scaffolding - a setting that doesn't lend itself very well to the kinds of contexts I describe above.
A pretty reasonable question to ask at this point is why it matters at all whether we can detect these capabilities. A position one could have here is that there are capabilities much more salient to various takeover scenarios that are more useful to try and detect, such as the ability to phish people, hack into secure accounts, or fine-tune other models. From that perspective, evals trying to identify capabilities like these are just far less important. Another pretty reasonable position is that these particular instances of capabilities just don't seem very impressive, and are basically what you would expect out of language models.
My response to the first would be that I think it's important to ask what we're actually trying to achieve with our model eval mechanisms. Broadly, I think there are two different (and very often overlapping) things we would want our capability evals[4] to be doing:
Understanding whether or not a specific model possesses some dangerous capabilities, or is prone to acting in a malicious way in some context.
Giving us information to better forecast the capabilities of future models. In other words, constructing good scaling laws for our capability evals.
I'm much more excited about the latter kind of capability evals, and most of my case here is directed at that. Specifically, I think that if you want to forecast what future models will be good at, then by default you're operating in a regime where you have to account for a bunch of different emergent capabilities that don't necessarily look identical to what you've already seen.
Even if you really only care about a specific narrow band of capabilities that you expect to be very likely convergent to takeover scenarios - an expectation I don't really buy as something you can very safely assume because of the uncertainty and plurality of takeover scenarios - there is still more than one way in which you can accomplish some subtasks, some of which may only show up in more powerful models.
As a concrete example, consider the task of phishing someone on the internet. One straightforward way to achieve this would be to figure out how...