AF - We need a science of evals by Marius Hobbhahn

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We need a science of evals, published by Marius Hobbhahn on January 22, 2024 on The AI Alignment Forum.
This is a linkpost for https://www.apolloresearch.ai/blog/we-need-a-science-of-evals
In this post, we argue that if AI model evaluations (evals) are to have meaningful real-world impact, we need a "Science of Evals", i.e. the field needs rigorous scientific processes that provide more confidence in evals methodology and results.
Model evaluations allow us to reduce uncertainty about properties of neural networks and thereby inform safety-related decisions. For example, evals underpin many Responsible Scaling Policies, and future laws might directly link risk thresholds to specific evals. Thus, we need to ensure that we accurately measure the targeted property and that we can trust the results from model evaluations. This is particularly important when a decision not to deploy an AI system could have significant financial implications for AI companies, e.g. when these companies then fight these decisions in court.
Evals are a nascent field, and we think current evaluations are not yet resistant to this level of scrutiny. Thus, we cannot trust the results of evals as much as we would in a mature field. For instance, one of the biggest challenges Language Model (LM) evaluations currently face is the model's sensitivity to the prompts used to elicit a certain capability (Liang et al., 2022; Mizrahi et al., 2023; Sclar et al., 2023; Weber et al., 2023; Bsharat et al., 2023).
Sclar et al. (2023), for example, find that "several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points [...]". A post by Anthropic also suggests that simple formatting changes to an evaluation, such as "changing the options from (A) to (1) or changing the parentheses from (A) to [A], or adding an extra space between the option and the answer can lead to a ~5 percentage point change in accuracy on the evaluation." As an extreme example, Bsharat et al. (2023) find that "tipping a language model 300K for a better solution" leads to increased capabilities. Overall, this suggests that under current practices, evaluations are much more an art than a science.
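To make this concrete, here is a minimal sketch of how one might quantify an eval's sensitivity to surface formatting, in the spirit of Sclar et al. (2023). The `query_model` stub and the dataset layout are hypothetical stand-ins, not any particular library's API:

```python
# Sketch: measure how a two-option multiple-choice eval score varies across
# semantically identical prompt formats. Only the surface form changes.

def query_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real inference API."""
    raise NotImplementedError

# Each entry: (prompt template, the option labels that template uses).
FORMATS = [
    ("Question: {q}\n(A) {a}\n(B) {b}\nAnswer:", ("A", "B")),
    ("Question: {q}\n(1) {a}\n(2) {b}\nAnswer:", ("1", "2")),
    ("Question: {q}\n[A] {a}\n[B] {b}\nAnswer:", ("A", "B")),
]

def accuracy(dataset: list[dict], template: str, labels: tuple) -> float:
    """Score one format. Items look like {"q": ..., "options": (x, y), "answer": 0 or 1}."""
    correct = 0
    for it in dataset:
        prompt = template.format(q=it["q"], a=it["options"][0], b=it["options"][1])
        reply = query_model(prompt).strip()
        correct += reply.startswith(labels[it["answer"]])  # bool counts as 0/1
    return correct / len(dataset)

def format_spread(dataset: list[dict]) -> float:
    """Max-minus-min accuracy across formats; a large spread means the eval
    is measuring prompt formatting as much as the underlying capability."""
    scores = [accuracy(dataset, t, labels) for t, labels in FORMATS]
    return max(scores) - min(scores)
```

A large spread on an otherwise unchanged dataset is exactly the kind of measurement noise a Science of Evals would need to characterize and control.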
Since evals often aim to estimate an upper bound on capabilities, it is important to understand how to elicit maximal rather than average capabilities. Successive improvements in prompt engineering have continuously raised the bar, which makes it hard to tell whether any particular negative result is meaningful or whether a better technique would invalidate it. For example, prompting techniques such as Chain-of-Thought prompting (Wei et al., 2022), Tree of Thought prompting (Yao et al., 2023), and self-consistency prompting (Wang et al., 2022) show how LM capabilities can be greatly improved with principled prompts compared to previous techniques. To point to a more recent example, the newly released Gemini Ultra model (Gemini Team Google, 2023) achieved a new state-of-the-art result on MMLU with a new inference technique called uncertainty-routed chain-of-thought, outperforming even GPT-4. However, when doing inference with chain-of-thought@32 (sampling 32 results and taking the majority vote), GPT-4 still outperforms Gemini Ultra. Days later, Microsoft introduced a new prompting technique called Medprompt (Nori et al., 2023), which again yielded a new SOTA result on MMLU, barely outperforming Gemini Ultra. These examples illustrate that it is hard to make high-confidence statements about maximal capabilities with current evaluation techniques.
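As an aside on mechanics: the chain-of-thought@32 procedure above is simply majority voting over sampled reasoning chains, essentially self-consistency prompting (Wang et al., 2022). A minimal sketch, where `sample_answer` is a hypothetical function that runs one chain-of-thought sample and extracts the final answer:

```python
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical: run one chain-of-thought sample (temperature > 0) and
    return only the extracted final answer, not the reasoning text."""
    raise NotImplementedError

def chain_of_thought_at_k(question: str, k: int = 32) -> str:
    """Sample k independent reasoning chains and return the most common
    final answer, i.e. a majority vote (chain-of-thought@32 when k = 32)."""
    answers = [sample_answer(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```

The point for evals: the same model scores differently under single-sample, majority-vote, or uncertainty-routed inference, so any reported capability number is only meaningful relative to the elicitation procedure used.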
In contrast, even everyday products like shoes undergo extensive testing, such as repeated bending to assess material fatigue. For higher-stake things l...