AF - Games for AI Control by Charlie Griffin

The Nonlinear Library: Alignment Forum

محتوای ارائه شده توسط The Nonlinear Fund. تمام محتوای پادکست شامل قسمت‌ها، گرافیک‌ها و توضیحات پادکست مستقیماً توسط The Nonlinear Fund یا شریک پلتفرم پادکست آن‌ها آپلود و ارائه می‌شوند. اگر فکر می‌کنید شخصی بدون اجازه شما از اثر دارای حق نسخه‌برداری شما استفاده می‌کند، می‌توانید روندی که در اینجا شرح داده شده است را دنبال کنید.https://fa.player.fm/legal

1+ y ago 7:37

MP3•خانه قسمت

بایگانی مجموعه ها ("فیدهای غیر فعال" status)

When? This feed was archived on October 23, 2024 10:10 (1y ago). Last successful fetch was on September 19, 2024 11:06 (1y ago)

Why? فیدهای غیر فعال status. سرورهای ما، برای یک دوره پایدار، قادر به بازیابی یک فید پادکست معتبر نبوده اند.

What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.

Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Games for AI Control, published by Charlie Griffin on July 11, 2024 on The AI Alignment Forum.
This post summarises our new workshop paper: Games for AI Control. The paper applies AI- and game-theoretic tools to the AI Control framework introduced by Redwood Research [1, 2].
We'll be extending this work and submitting it to a conference, so we welcome suggestions! Please reach out if you're excited about developing these ideas. This post is written from Charlie's perspective.
AI Control Refresher
The AI Control methodology aims to design and assess protocols[1] that AI labs could use to deploy powerful but untrusted advanced AI systems. For a full summary, see Redwood's blog post, but I'll try to recap here.
A primary goal of AI Control is to design protocols that are safe in that they can reduce the probability of some unacceptable outcome, conditional on the untrusted AI system being a schemer. [2] However, protocols must also be useful, harnessing a significant amount of the benefit of the untrusted system in cases where it is not scheming.
After defining a particular use case and unacceptable outcome, evaluating a protocol involves iteratively engineering the best attack strategies for the red team (a human/AI hybrid acting as a proxy for the untrusted AI) and the blue team (which develops the protocol). This red-teaming exercise is the bulk of the work.
It involves coming up with protocols and gathering empirical data about key questions, such as how well GPT-3.5 can spot backdoors inserted by GPT-4 (see Figure 4 of the original paper):
This work
By introducing "AI-Control Games", an extension of Markov Decision Processes and normal-form games[3], we model the interaction between a protocol and an untrusted AI system to explore safety and usefulness across a range of protocols and games. This is similar to a discrete-time and stateful modification of Redwood's Toy models of AI Control.
The paper contains early work and should be seen as introducing useful tools for thinking about AI Control. I won't explain the model here (which is a bit fiddly and notation-heavy). Instead, I'll share some of our early findings using the model and explain the benefits and limitations of this approach.
Findings
Sometimes (but rarely), the blue team should use a randomised protocol. In many games, randomness is helpful. However, partly because the environment is stochastic, and Red has no memory and time-awareness, in AI-Control Games is usually unnecessary to include randomness in the protocol itself. We found that randomness was Pareto-optimal, in particular when the trusted model is not great at spotting backdoors written by the untrusted model. (See section 2 of the paper).
We can make small improvements to Trusted Monitoring by optimising the parameters. We found small Pareto improvements to existing Trusted Monitoring setups by, for example, by decreasing the threshold for sending a solution to be checked be a human auditor. We found that adapting the threshold for auditing over time led to further small improvements.
We can (usually) use a linear tradeoff to simplify the problem of synthesising protocols. In many cases, including whenever we allow Blue's protocol to be randomised, the Pareto frontier is "concave". This means that choosing different weights to trade off safety and usefulness will allow you to fill out the frontier by optimising a single objective.
Blue leads, and then Red follows. In general, we suppose the protocol designer (Blue) chooses it's protocol, and then the red team (Red) chooses a best-attack. Therefore, in our model, Blue chooses and declares its strategy before Red chooses its response. This assumption relates to Stackelberg Games, which are used to model security in game theory.[6]
Although (technically) multi-objective and general-sum, AI-Cont...

392 قسمت

#The Nonlinear Fund #Podcasting Education #Of TexttoSpeech