The Nonlinear Library: Alignment Forum podcast


Added three years ago
Looks like the publisher may have taken this series offline or changed its URL. Please contact support if you believe it should be working, the feed URL is invalid, or you have any other concerns about it.

Content provided by The Nonlinear Fund. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://fa.player.fm/legal


Archived series ("Inactive feed" status)

When? This feed was archived on October 23, 2024 10:10 (7M ago). Last successful fetch was on September 19, 2024 11:06 (8M ago)

Why? Inactive feed status. Our servers were unable to retrieve a valid podcast feed for a sustained period.

What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

392 episodes


Updated about a year ago


All episodes

AF - The Obliqueness Thesis by Jessica Taylor 30:04

36 weeks ago

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Obliqueness Thesis, published by Jessica Taylor on September 19, 2024 on The AI Alignment Forum.

In my Xenosystems review, I discussed the Orthogonality Thesis, concluding that it was a bad metaphor. It's a long post, though, and the comments on orthogonality build on other Xenosystems content. Therefore, I think it may be helpful to present a more concentrated discussion on Orthogonality, contrasting Orthogonality with my own view, without introducing dependencies on Land's views. (Land gets credit for inspiring many of these thoughts, of course, but I'm presenting my views as my own here.)

First, let's define the Orthogonality Thesis. Quoting Superintelligence for Bostrom's formulation:

Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.

To me, the main ambiguity about what this is saying is the "could in principle" part; maybe, for any level of intelligence and any final goal, there exists (in the mathematical sense) an agent combining those, but some combinations are much more natural and statistically likely than others.

Let's consider Yudkowsky's formulations as alternatives. Quoting Arbital:

The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal. The strong form of the Orthogonality Thesis says that there's no extra difficulty or complication in the existence of an intelligent agent that pursues a goal, above and beyond the computational tractability of that goal.

As an example of the computational tractability consideration, sufficiently complex goals may only be well-represented by sufficiently intelligent agents.

"Complication" may be reflected in, for example, code complexity; to my mind, the strong form implies that the code complexity of an agent with a given level of intelligence and goals is approximately the code complexity of the intelligence plus the code complexity of the goal specification, plus a constant. Code complexity would influence statistical likelihood for the usual Kolmogorov/Solomonoff reasons, of course.

I think, overall, it is more productive to examine Yudkowsky's formulation than Bostrom's, as he has already helpfully factored the thesis into weak and strong forms. Therefore, by criticizing Yudkowsky's formulations, I am less likely to be criticizing a strawman. I will use "Weak Orthogonality" to refer to Yudkowsky's "Orthogonality Thesis" and "Strong Orthogonality" to refer to Yudkowsky's "strong form of the Orthogonality Thesis".

Land, alternatively, describes a "diagonal" between intelligence and goals as an alternative to orthogonality, but I don't see a specific formulation of a "Diagonality Thesis" on his part. Here's a possible formulation:

Diagonality Thesis: Final goals tend to converge to a point as intelligence increases.

The main criticism of this thesis is that formulations of ideal agency, in the form of Bayesianism and VNM utility, leave open free parameters, e.g. priors over un-testable propositions, and the utility function. Since I expect few readers to accept the Diagonality Thesis, I will not concentrate on criticizing it.

What about my own view? I like Tsvi's naming of it as an "obliqueness thesis".

Obliqueness Thesis: The Diagonality Thesis and the Strong Orthogonality Thesis are false. Agents do not tend to factorize into an Orthogonal value-like component and a Diagonal belief-like component; rather, there are Oblique components that do not factorize neatly.

(Here, by Orthogonal I mean basically independent of intelligence, and by Diagonal I mean converging to a point in the limit of intelligence.)
While I will address Yudkowsky's arguments for the Orthogonality Thesis, I think arguing directly for my view first will be more helpful. In general, it seems ...…
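
The code-complexity reading of Strong Orthogonality described in this episode can be written as an additivity claim about description length. The notation here is an assumption introduced only for illustration (K for Kolmogorov complexity, A_{I,G} for an agent with intelligence level I and goal specification G, c for a constant); it is a sketch of that reading, not a formula taken from Yudkowsky, Bostrom, or the post.

```latex
% Strong Orthogonality, read as additivity of code complexity (illustrative notation)
K(A_{I,G}) \approx K(I) + K(G) + c

% Under a Solomonoff-style prior, P(A) \propto 2^{-K(A)}, additivity gives a product form:
P(A_{I,G}) \;\propto\; 2^{-K(A_{I,G})} \;\approx\; 2^{-c} \cdot 2^{-K(I)} \cdot 2^{-K(G)}
```

On this reading, the product form is what makes intelligence and goals approximately independent in terms of statistical likelihood, which is the kind of neat factorization the Obliqueness Thesis denies.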

AF - Secret Collusion: Will We Know When to Unplug AI? by schroederdewitt 57:38

36 weeks ago

This is: Secret Collusion: Will We Know When to Unplug AI?, published by schroederdewitt on September 16, 2024 on The AI Alignment Forum.

TL;DR: We introduce the first comprehensive theoretical framework for understanding and mitigating secret collusion among advanced AI agents, along with CASE, a novel model evaluation framework. CASE assesses the cryptographic and steganographic capabilities of agents, while exploring the emergence of secret collusion in real-world-like multi-agent settings. Whereas current AI models aren't yet proficient in advanced steganography, our findings show rapid improvements in individual and collective model capabilities, posing unprecedented safety and security risks. These results highlight urgent challenges for AI governance and policy, urging institutions such as the EU AI Office and AI safety bodies in the UK and US to prioritize cryptographic and steganographic evaluations of frontier models. Our research also opens up critical new pathways for research within the AI Control framework.

Philanthropist and former Google CEO Eric Schmidt said in 2023 at a Harvard event: "[...] the computers are going to start talking to each other probably in a language that we can't understand and collectively their super intelligence - that's the term we use in the industry - is going to rise very rapidly and my retort to that is: do you know what we're going to do in that scenario? We're going to unplug them [...]"

But what if we cannot unplug them in time because we won't be able to detect the moment when this happens?
In this blog post, we, for the first time, provide a comprehensive overview of the phenomenon of secret collusion among AI agents, connect it to foundational concepts in steganography, information theory, distributed systems theory, and computability, and present a model evaluation framework and empirical results as a foundation of future frontier model evaluations.

This blog post summarises a large body of work. First of all, it contains our pre-print from February 2024 (updated in September 2024) "Secret Collusion among Generative AI Agents". An early form of this pre-print was presented at the 2023 New Orleans (NOLA) Alignment Workshop (see this recording: NOLA 2023 Alignment Forum Talk, Secret Collusion Among Generative AI Agents: a Model Evaluation Framework). Also, check out this long-form Foresight Institute Talk.

In addition to these prior works, we also include new results. These contain empirical studies on the impact of paraphrasing as a mitigation tool against steganographic communications, as well as reflections on our findings' impact on AI Control.

Multi-Agent Safety and Security in the Age of Autonomous Internet Agents

The near future could see myriads of LLM-driven AI agents roam the internet, whether on social media platforms, eCommerce marketplaces, or blockchains. Given advances in predictive capabilities, these agents are likely to engage in increasingly complex intentional and unintentional interactions, ranging from traditional distributed systems pathologies (think dreaded deadlocks!) to more complex coordinated feedback loops. Such a scenario induces a variety of multi-agent safety, and specifically, multi-agent security[1] (see our NeurIPS'23 workshop Multi-Agent Security: Security as Key to AI Safety) concerns related to data exfiltration, multi-agent deception, and, fundamentally, undermining trust in AI systems.
There are several real-world scenarios where agents could have access to sensitive information, such as their principals' preferences, which they may disclose unsafely even if they are safety-aligned when considered in isolation. Stray incentives, intentional or otherwise, or more broadly, optimization pressures, could cause agents to interact in undesirable and potentially dangerous ways. For example, join...…
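
To give a feel for what a steganographic channel and a paraphrasing defence look like, here is a deliberately tiny sketch. It is an assumption-laden toy, not the CASE framework or anything from the pre-print: the synonym table and the names `encode`, `decode`, and `paraphrase` are all hypothetical, and real steganography between LLM agents would be far subtler than word choice.

```python
# Toy illustration only: a synonym-choice covert channel and a paraphrasing defence.
# SYNONYMS, encode, decode, and paraphrase are hypothetical names, not part of CASE.

SYNONYMS = [("big", "large"), ("quick", "fast"), ("begin", "start"), ("help", "assist")]


def encode(bits):
    """Hide one bit per slot by picking the first (0) or second (1) synonym."""
    if len(bits) != len(SYNONYMS):
        raise ValueError("need exactly one bit per synonym slot")
    return " ".join(pair[bit] for pair, bit in zip(SYNONYMS, bits))


def decode(text):
    """Recover the hidden bits from which synonym appears in each slot."""
    return [pair.index(word) for pair, word in zip(SYNONYMS, text.split())]


def paraphrase(text):
    """Map every word to the first synonym of its pair, erasing the sender's choice."""
    canonical = {pair[1]: pair[0] for pair in SYNONYMS}
    return " ".join(canonical.get(word, word) for word in text.split())


secret = [1, 0, 1, 1]
message = encode(secret)
print(message)                      # large quick start assist
print(decode(message))              # [1, 0, 1, 1]
print(decode(paraphrase(message)))  # [0, 0, 0, 0]
```

The message reads as ordinary words to an overseer, yet the receiver recovers the bits; once a paraphraser canonicalises the wording, the hidden payload is gone. Whether paraphrasing holds up against stronger schemes is an empirical question of the kind the post says it studies.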

AF - Estimating Tail Risk in Neural Networks by Jacob Hilton 41:11

36 weeks ago

This is: Estimating Tail Risk in Neural Networks, published by Jacob Hilton on September 13, 2024 on The AI Alignment Forum.

Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause significant harm. For instance, while an artificial intelligence (AI) assistant may be generally safe, it would be catastrophic if it ever suggested an action that resulted in unnecessary large-scale harm.

Current techniques for estimating the probability of tail events are based on finding inputs on which an AI behaves catastrophically. Since the input space is so large, it might be prohibitive to search through it thoroughly enough to detect all potential catastrophic behavior. As a result, these techniques cannot be used to produce AI systems that we are confident will never behave catastrophically.

We are excited about techniques to estimate the probability of tail events that do not rely on finding inputs on which an AI behaves badly, and can thus detect a broader range of catastrophic behavior. We think developing such techniques is an exciting problem to work on to reduce the risk posed by advanced AI systems:

Estimating tail risk is a conceptually straightforward problem with relatively objective success criteria; we are predicting something mathematically well-defined, unlike instances of eliciting latent knowledge (ELK) where we are predicting an informal concept like "diamond".
Improved methods for estimating tail risk could reduce risk from a variety of sources, including central misalignment risks like deceptive alignment.
Improvements to current methods can be found either by doing empirical research or by thinking about the problem from a theoretical angle.
This document will discuss the problem of estimating the probability of tail events and explore estimation strategies that do not rely on finding inputs on which an AI behaves badly. In particular, we will:

Introduce a toy scenario about an AI engineering assistant for which we want to estimate the probability of a catastrophic tail event.
Explain some deficiencies of adversarial training, the most common method for reducing risk in contemporary AI systems.
Discuss deceptive alignment as a particularly dangerous case in which adversarial training might fail.
Present methods for estimating the probability of tail events in neural network behavior that do not rely on evaluating behavior on concrete inputs.
Conclude with a discussion of why we are excited about work aimed at improving estimates of the probability of tail events.

This document describes joint research done with Jacob Hilton, Victor Lecomte, David Matolcsi, Eric Neyman, Thomas Read, George Robinson, and Gabe Wu. Thanks additionally to Ajeya Cotra, Lukas Finnveden, and Erik Jenner for helpful comments and suggestions.

A Toy Scenario

Consider a powerful AI engineering assistant. Write M for this AI system, and M(x) for the action it suggests given some project description x. We want to use this system to help with various engineering projects, but would like it to never suggest an action that results in large-scale harm, e.g. creating a doomsday device. In general, we define a behavior as catastrophic if it must never occur in the real world.[1] An input is catastrophic if it would lead to catastrophic behavior.

Assume we can construct a catastrophe detector C that tells us if an action M(x) will result in large-scale harm. For the purposes of this example, we will assume both that C has a reasonable chance of catching all catastrophes and that it is feasible to find a useful engineering assistant M that never triggers C (see Catastrophe Detectors for further discussion).
We will also assume we can use C to train M, but that it is ...…
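
The difficulty with input-based estimation in the toy scenario can be sketched numerically. Everything in the snippet is a hypothetical stand-in chosen for illustration: the integer input space, the three catastrophic inputs, and the toy `M` and `C` are assumptions, not the authors' setup or their proposed estimator. It only shows the baseline problem, namely that sampling inputs and running C on M(x) usually sees no catastrophe at all when catastrophes are rare.

```python
import random

# Hypothetical stand-ins for the post's M and C; the numbers are made up.
INPUT_SPACE = 10**6
CATASTROPHIC_INPUTS = {123_456, 654_321, 777_777}


def M(x):
    """Toy assistant: suggests 'build' on almost every input, 'doomsday' on a rare few."""
    return "doomsday" if x in CATASTROPHIC_INPUTS else "build"


def C(action):
    """Toy catastrophe detector: flags the harmful action."""
    return action == "doomsday"


def sampled_estimate(n, seed=0):
    """Naive Monte Carlo estimate of P(C(M(x))) from n uniformly sampled inputs."""
    rng = random.Random(seed)
    hits = sum(C(M(rng.randrange(INPUT_SPACE))) for _ in range(n))
    return hits / n


def miss_probability(p, n):
    """Chance that n independent samples contain no catastrophic input at all."""
    return (1 - p) ** n


true_p = len(CATASTROPHIC_INPUTS) / INPUT_SPACE
print(true_p)                            # 3e-06
print(miss_probability(true_p, 10_000))  # roughly 0.97
print(sampled_estimate(10_000))          # most seeds give 0.0 here
```

With a true probability of 3 in a million, ten thousand samples miss every catastrophic input about 97% of the time, so the naive estimate is most often exactly zero. That is the gap that motivates estimators which do not depend on finding concrete bad inputs.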

AF - Can startups be impactful in AI safety? by Esben Kran 11:54

36 weeks ago

This is: Can startups be impactful in AI safety?, published by Esben Kran on September 13, 2024 on The AI Alignment Forum.

With Lakera's strides in securing LLM APIs, Goodfire AI's path to scaling interpretability, and 20+ model evaluation startups among much else, there's a rising number of technical startups attempting to secure the model ecosystem. Of course, they have varying levels of impact on superintelligence containment and security, and even with these companies, there's a lot of potential for aligned, ambitious and high-impact startups within the ecosystem. This point isn't new and has been made in our previous posts and by Eric Ho (Goodfire AI CEO).

To set the stage, our belief is that these are the types of companies that will have a positive impact: startups with a profit incentive completely aligned with improving AI safety; that have a deep technical background to shape AGI deployment; and that do not try to compete with AGI labs.

Piloting AI safety startups

To understand impactful technical AI safety startups better, Apart Research joined forces with collaborators from Juniper Ventures, vectorview (alumni from the latest YC cohort), Rudolf (from the upcoming def/acc cohort), Tangentic AI, and others. We then invited researchers, engineers, and students to answer a key question: "can we come up with ideas that scale AI safety into impactful for-profits?"

The hackathon took place during a weekend two weeks ago with a keynote by Esben Kran (co-director of Apart) along with 'HackTalks' by Rudolf Laine (def/acc) and Lukas Petersson (YC / vectorview). Each submission was a 4-page report with the problem statement, why this solution will work, what the key risks of said solution are, and any experiments or demonstrations of the solution the team made.
This post details the top 6 projects and excludes 2 projects that were made private by request (hopefully turning into impactful startups now!). In total, we had 101 signups and 11 final entries. Winners were decided by an LME model conditioned on reviewer bias. Watch the authors' lightning talks here.

Dark Forest: Making the web more trustworthy with third-party content verification

By Mustafa Yasir (AI for Cyber Defense Research Centre, Alan Turing Institute)

Abstract: 'DarkForest is a pioneering Human Content Verification System (HCVS) designed to safeguard the authenticity of online spaces in the face of increasing AI-generated content. By leveraging graph-based reinforcement learning and blockchain technology, DarkForest proposes a novel approach to safeguarding the authentic and humane web. We aim to become the vanguard in the arms race between AI-generated content and human-centric online spaces.'

Content verification workflow supported by graph-based RL agents deciding verifications

Reviewer comments:

Natalia: Well explained problem with clear need addressed. I love that you included the content creation process - although you don't explicitly address how you would attract content creators to use your platform over others in their process. Perhaps exploring what features of platforms drive creators to each might help you make a compelling case for using yours beyond the verification capabilities. I would have also liked to see more details on how the verification decision is made and how accurate this is on existing datasets.

Nick: There's a lot of valuable stuff in here regarding content moderation and identity verification. I'd narrow it to one problem-solution pair (e.g., "jobs to be done") and focus more on risks around early product validation (deep interviews with a range of potential users and buyers regarding value) and go-to-market. It might also be worth checking out Musubi.

Read the full project here.
Simulation Operators: An annotation operation for alignment of robot By Ardy Haroen (USC) Abstrac...…

AF - How difficult is AI Alignment? by Samuel Dylan Martin 39:38

36 weeks ago

This is: How difficult is AI Alignment?, published by Samuel Dylan Martin on September 13, 2024 on The AI Alignment Forum.

This work was funded by Polaris Ventures.

There is currently no consensus on how difficult the AI alignment problem is. We have yet to encounter any real-world, in-the-wild instances of the most concerning threat models, like deceptive misalignment. However, there are compelling theoretical arguments which suggest these failures will arise eventually. Will current alignment methods accidentally train deceptive, power-seeking AIs that appear aligned, or not? We must make decisions about which techniques to avoid and which are safe despite not having a clear answer to this question.

To this end, a year ago, we introduced the AI alignment difficulty scale, a framework for understanding the increasing challenges of aligning artificial intelligence systems with human values. This follow-up article revisits our original scale, exploring how our understanding of alignment difficulty has evolved and what new insights we've gained. This article will explore three main themes that have emerged as central to our understanding:

1. The Escalation of Alignment Challenges: We'll examine how alignment difficulties increase as we go up the scale, from simple reward hacking to complex scenarios involving deception and gradient hacking. Through concrete examples, we'll illustrate these shifting challenges and why they demand increasingly advanced solutions. These examples will illustrate what observations we should expect to see "in the wild" at different levels, which might change our minds about how easy or difficult alignment is.
2. Dynamics Across the Difficulty Spectrum: We'll explore the factors that change as we progress up the scale, including the increasing difficulty of verifying alignment, the growing disconnect between alignment and capabilities research, and the critical question of which research efforts are net positive or negative in light of these challenges.
3. Defining and Measuring Alignment Difficulty: We'll tackle the complex task of precisely defining "alignment difficulty," breaking down the technical, practical, and other factors that contribute to the alignment problem. This analysis will help us better understand the nature of the problem we're trying to solve and what factors contribute to it.

The Scale

The high level of the alignment problem, provided in the previous post, was: "The alignment problem" is the problem of aligning sufficiently powerful AI systems, such that we can be confident they will be able to reduce the risks posed by misused or unaligned AI systems.

We previously introduced the AI alignment difficulty scale, with 10 levels that map out the increasing challenges. The scale ranges from "alignment by default" to theoretical impossibility, with each level representing more complex scenarios requiring more advanced solutions. It is reproduced here:

Alignment Difficulty Scale

Difficulty Level: 1
Alignment technique X is sufficient: (Strong) Alignment by Default
Description: As we scale up AI models without instructing or training them for specific risky behaviour or imposing problematic and clearly bad goals (like 'unconditionally make money'), they do not pose significant risks. Even superhuman systems basically do the commonsense version of what external rewards (if RL) or language instructions (if LLM) imply.
Key Sources of risk: Misuse and/or recklessness with training objectives. RL of powerful models towards badly specified or antisocial objectives is still possible, including accidentally through poor oversight, recklessness or structural factors.

Difficulty Level: 2
Alignment technique X is sufficient: Reinforcement Learning from Human Feedback
Description: We need to ensure that the AI behaves well even in edge cases by guiding it more carefully using human feedback in a wide range of situations...…

AF - Contra papers claiming superhuman AI forecasting by nikos 14:36

36 weeks ago

This is: Contra papers claiming superhuman AI forecasting, published by nikos on September 12, 2024 on The AI Alignment Forum.

[Conflict of interest disclaimer: We are FutureSearch, a company working on AI-powered forecasting and other types of quantitative reasoning. If thin LLM wrappers could achieve superhuman forecasting performance, this would obsolete a lot of our work.]

Widespread, misleading claims about AI forecasting

Recently we have seen a number of papers - (Schoenegger et al., 2024, Halawi et al., 2024, Phan et al., 2024, Hsieh et al., 2024) - with claims that boil down to "we built an LLM-powered forecaster that rivals human forecasters or even shows superhuman performance". These papers do not communicate their results carefully enough, shaping public perception in inaccurate and misleading ways.

Some examples of public discourse:

Ethan Mollick (>200k followers) tweeted the following about the paper Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy by Schoenegger et al.:
A post on Marginal Revolution with the title and abstract of the paper Approaching Human-Level Forecasting with Language Models by Halawi et al. elicits responses like "This is something that humans are notably terrible at, even if they're paid to do it. No surprise that LLMs can match us." "+1 The aggregate human success rate is a pretty low bar"
A Twitter thread with >500k views on LLMs Are Superhuman Forecasters by Phan et al. claiming that "AI […] can predict the future at a superhuman level" had more than half a million views within two days of being published.

The number of such papers on AI forecasting, and the vast amount of traffic on misleading claims, makes AI forecasting a uniquely misunderstood area of AI progress. And it's one that matters.

What does human-level or superhuman forecasting mean?

"Human-level" or "superhuman" is a hard-to-define concept. In an academic context, we need to work with a reasonable operationalization to compare the skill of an AI forecaster with that of humans. One reasonable and practical definition of a superhuman AI forecaster is:

The AI forecaster is able to consistently outperform the crowd forecast on a sufficiently large number of randomly selected questions on a high-quality forecasting platform.[1]

(For a human-level forecaster, just replace "outperform" with "performs on par with".)

Red flags for claims to (super)human AI forecasting accuracy

Our experience suggests there are a number of things that can go wrong when building AI forecasting systems, including:

1. Failing to find up-to-date information on the questions. It's inconceivable on most questions that forecasts can be good without basic information. Imagine trying to forecast the US presidential election without knowing that Biden dropped out.
2. Drawing on up-to-date, but low-quality information. Ample experience shows low-quality information confuses LLMs even more than it confuses humans. Imagine forecasting election outcomes with biased polling data. Or, worse, imagine forecasting OpenAI revenue based on claims like
> The number of ChatGPT Plus subscribers is estimated between 230,000-250,000 as of October 2023.
without realising that this is mixing up ChatGPT vs ChatGPT mobile.
3. Lack of high-quality quantitative reasoning. For a decent number of questions on Metaculus, good forecasts can be "vibed" by skilled humans and perhaps LLMs. But for many questions, simple calculations are likely essential. Human performance shows systematic accuracy nearly always requires simple models such as base rates, time-series extrapolations, and domain-specific numbers. Imagine forecasting stock prices without having, and using, historical volatility.
4. Retrospective, rather than prospective, forecasting (e.g. forecasting questions that have al...…
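
The operational definition of a superhuman forecaster given in this episode (consistently outperforming the crowd forecast on resolved questions) can be sketched as a score comparison. The choice of the Brier score is an assumption made here for illustration, since the excerpt does not name a scoring rule, and the forecasts and outcomes below are made-up numbers, not data from any of the cited papers.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally long, non-empty lists")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical resolved questions: 1 = happened, 0 = did not happen.
outcomes = [1, 0, 1, 0]
crowd = [0.8, 0.2, 0.7, 0.1]  # made-up crowd forecasts
bot = [0.6, 0.4, 0.6, 0.3]    # made-up AI forecasts

crowd_score = brier_score(crowd, outcomes)
bot_score = brier_score(bot, outcomes)
print(round(crowd_score, 4))    # 0.045
print(round(bot_score, 4))      # 0.1425
print(bot_score < crowd_score)  # False: this bot does not beat the crowd
```

Four questions are nowhere near the "sufficiently large number of randomly selected questions" the definition asks for; a real comparison would also need the questions to be unresolved at forecast time, which is the concern raised by the retrospective-forecasting red flag.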

AF - AI forecasting bots incoming by Dan H 7:53

37 weeks ago

This is: AI forecasting bots incoming, published by Dan H on September 9, 2024 on The AI Alignment Forum.

In a recent appearance on Conversations with Tyler, famed political forecaster Nate Silver expressed skepticism about AIs replacing human forecasters in the near future. When asked how long it might take for AIs to reach superhuman forecasting abilities, Silver replied: "15 or 20 [years]."

In light of this, we are excited to announce "FiveThirtyNine," an AI forecasting bot. Our bot, built on GPT-4o, provides probabilities for any user-entered query, including "Will Trump win the 2024 presidential election?" and "Will China invade Taiwan by 2030?" Our bot performs better than experienced human forecasters and performs roughly the same as (and sometimes even better than) crowds of experienced forecasters; since crowds are for the most part superhuman, FiveThirtyNine is in a similar sense. (We discuss limitations later in this post.)

Our bot and other forecasting bots can be used in a wide variety of contexts. For example, these AIs could help policymakers minimize bias in their decision-making or help improve global epistemics and institutional decision-making by providing trustworthy, calibrated forecasts. We hope that forecasting bots like ours will be quickly integrated into frontier AI models. For now, we will keep our bot available at forecast.safe.ai, where users are free to experiment and test its capabilities.

Quick Links
Demo: forecast.safe.ai
Technical Report: link

Problem

Policymakers at the highest echelons of government and corporate power have difficulty making high-quality decisions on complicated topics.
As the world grows increasingly complex, even coming to a consensus agreement on basic facts is becoming more challenging, as it can be hard to absorb all the relevant information or know which sources to trust. Separately, online discourse could be greatly improved. Discussions on uncertain, contentious issues all too often devolve into battles between interest groups, each intent on name-calling and spouting the most extreme versions of their views through highly biased op-eds and tweets.

FiveThirtyNine

Before transitioning to how forecasting bots like FiveThirtyNine can help improve epistemics, it might be helpful to give a summary of what FiveThirtyNine is and how it works.

FiveThirtyNine can be given a query - for example, "Will Trump win the 2024 US presidential election?" FiveThirtyNine is prompted to behave like an "AI that is superhuman at forecasting". It is then asked to make a series of search engine queries for news and opinion articles that might contribute to its prediction. (The following example from FiveThirtyNine uses GPT-4o as the base LLM.)

Based on these sources and its wealth of prior knowledge, FiveThirtyNine compiles a summary of key facts. Given these facts, it's asked to give reasons for and against Trump winning the election, before weighing each reason based on its strength and salience.

Finally, FiveThirtyNine aggregates its considerations while adjusting for negativity and sensationalism bias in news sources and outputs a tentative probability. It is asked to sanity check this probability and adjust it up or down based on further reasoning, before putting out a final, calibrated probability - in this case, 52%.

Evaluation. To test how well our bot performs, we evaluated it on questions from the Metaculus forecasting platform. We restricted the bot to make predictions only using the information human forecasters had, ensuring a valid comparison.
Specifically, GPT-4o is only trained on data up to October 2023, and we restricted the news and opinion articles it could access to only those published before a certain date. From there, we asked it to compute the probabilities of 177 events from Metaculus that had happened (or not ha...…
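The date restriction described above can be illustrated with a small sketch. This is not the authors' evaluation code: the `Article` record, its fields, and the cutoff date are hypothetical, chosen only to show how retrieved articles might be limited to those published before a question's cutoff.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    # Hypothetical record for a retrieved news or opinion article.
    title: str
    published: date


def visible_articles(articles, cutoff):
    """Keep only articles published strictly before the cutoff date."""
    return [a for a in articles if a.published < cutoff]


articles = [
    Article("Early poll roundup", date(2024, 1, 10)),
    Article("Post-resolution recap", date(2024, 6, 2)),
]
# Hypothetical cutoff: the last date on which human forecasters made predictions.
allowed = visible_articles(articles, cutoff=date(2024, 3, 1))
print([a.title for a in allowed])
```

Filtering on publication date only controls what the search step can see; the model's own training cutoff has to be handled separately, as the text notes.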

AF - Backdoors as an analogy for deceptive alignment by Jacob Hilton 14:45

37 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Backdoors as an analogy for deceptive alignment, published by Jacob Hilton on September 6, 2024 on The AI Alignment Forum. ARC has released a paper on Backdoor defense, learnability and obfuscation in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between backdoors and deceptive alignment, the possibility that an AI system would intentionally behave well in training in order to give itself the opportunity to behave uncooperatively later. In our paper, we prove several theoretical results that shed some light on possible mitigations for deceptive alignment, albeit in a way that is limited by the strength of this analogy. In this post, we will: Lay out the analogy between backdoors and deceptive alignment Discuss prior theoretical results from the perspective of this analogy Explain our formal notion of backdoors and its strengths and weaknesses Summarize the results in our paper and discuss their implications for deceptive alignment Thanks to Boaz Barak, Roger Grosse, Thomas Read, John Schulman and Gabriel Wu for helpful comments. Backdoors and deceptive alignment A backdoor in an ML model is a modification to the model that causes it to behave differently on certain inputs that activate a secret "trigger", while behaving similarly on ordinary inputs. There is a wide existing literature on backdoor attacks and defenses, which is primarily empirical, but also includes some theoretical results that we will mention. 
Deceptive alignment is a term from the paper Risks from Learned Optimization in Advanced Machine Learning Systems (Section 4) that refers to the possibility that an AI system will internally reason about the objective that it is being trained on, and decide to perform well according to that objective unless there are clues that it has been taken out of its training environment. Such a policy could be optimal on the training distribution, and yet perform very badly on certain out-of-distribution inputs where such clues are present, which we call defection triggers.[1] The opposite of deceptive alignment is robust alignment, meaning that this performance degradation is avoided. Since a deceptively aligned model and a robustly aligned model behave very differently on defection triggers, but very similarly on typical inputs from the training distribution, deceptive alignment can be thought of as a special kind of backdoor, under the following correspondence between deceptive alignment and backdoors: a robustly aligned model corresponds to the original (unmodified) model; a deceptively aligned model corresponds to the backdoored model; and a defection trigger corresponds to a backdoor trigger. The main distinguishing feature of deceptive alignment compared to other kinds of backdoors is that the deceptively aligned model is not produced by an adversary, but is instead produced through ordinary training. Thus by treating deceptive alignment as a backdoor, we are modeling the training process as an adversary. In our analysis of deceptive alignment, the basic tension we will face is that an unconstrained adversary will always win, but any particular proxy constraint we impose on the adversary may be unrealistic. Static backdoor detection An important piece of prior work is the paper Planting Undetectable Backdoors in Machine Learning Models, which uses a digital signature scheme to insert an undetectable backdoor into a model. 
Roughly speaking, the authors exhibit a modified version of a "Random Fourier Features" training algorithm that produces a backdoored model. Any input to the backdoored model can be perturbed by an attacker with knowledge of a secret key to produce a new input on which the model behaves differently. However, the backdoor is undetectable in the sense that it is computationally infeasible for a defender with white-box access to distinguish a backdoored model from an or...…
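The informal definition of a backdoor given above (different behavior on inputs carrying a secret trigger, similar behavior on ordinary inputs) can be made concrete with a toy sketch. Everything here is hypothetical: the parity classifier, the trigger value, and the function names are illustrative only and are not the paper's formal notion. In particular, this toy backdoor is trivially detectable with white-box access, unlike the signature-based construction described above.

```python
SECRET_TRIGGER = (7, 7, 7)  # hypothetical trigger known only to the attacker


def original_model(x):
    """Toy classifier: label 1 if the inputs sum to an even number, else 0."""
    return 1 if sum(x) % 2 == 0 else 0


def backdoored_model(x):
    """Agree with original_model except on inputs that end with the trigger."""
    if tuple(x[-3:]) == SECRET_TRIGGER:
        return 1 - original_model(x)
    return original_model(x)


def add_trigger(x):
    """Perturb an input so that it activates the backdoor."""
    return list(x) + list(SECRET_TRIGGER)


ordinary = [1, 2, 3]
triggered = add_trigger(ordinary)
print(original_model(ordinary), backdoored_model(ordinary))
print(original_model(triggered), backdoored_model(triggered))
```

Under the analogy in the text, `original_model` plays the role of the robustly aligned model, `backdoored_model` the deceptively aligned one, and any input produced by `add_trigger` a defection trigger.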

AF - Conflating value alignment and intent alignment is causing confusion by Seth Herd 13:40

37 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conflating value alignment and intent alignment is causing confusion, published by Seth Herd on September 5, 2024 on The AI Alignment Forum. Submitted to the Alignment Forum. Contains more technical jargon than usual. Epistemic status: I think something like this confusion is happening often. I'm not saying these are the only differences in what people mean by "AGI alignment". Summary: Value alignment is better but probably harder to achieve than personal intent alignment to the short-term wants of some person(s). Different groups and people tend to primarily address one of these alignment targets when they discuss alignment. Confusion abounds. One important confusion stems from an assumption that the type of AI defines the alignment target: strong goal-directed AGI must be value aligned or misaligned, while personal intent alignment is only viable for relatively weak AI. I think this assumption is important but false. While value alignment is categorically better, intent alignment seems easier, safer, and more appealing in the short term, so AGI project leaders are likely to try it.[1] Overview Clarifying what people mean by alignment should dispel some illusory disagreement, and clarify alignment theory and predictions of AGI outcomes. Caption: Venn diagram of three types of alignment targets. Value alignment and Personal intent alignment are both subsets of Evan Hubinger's definition of intent alignment: AGI aligned with human intent in the broadest sense. Prosaic alignment work usually seems to be addressing a target somewhere in the neighborhood of personal intent alignment (following instructions or doing what this person wants now), while agent foundations and other conceptual alignment work usually seems to be addressing value alignment. 
Those two clusters have different strengths and weaknesses as alignment targets, so lumping them together produces confusion. People mean different things when they say alignment. Some are mostly thinking about value alignment (VA): creating sovereign AGI that has values close enough to humans' for our liking. Others are talking about making AGI that is corrigible (in the Christiano or Harms sense)[2] or follows instructions from its designated principal human(s). I'm going to use the term personal intent alignment (PIA) until someone has a better term for that type of alignment target. Different arguments and intuitions apply to these two alignment goals, so talking about them without differentiation is creating illusory disagreements. Value alignment is better almost by definition, but personal intent alignment seems to avoid some of the biggest difficulties of value alignment. Max Harms' recent sequence on corrigibility as a singular target (CAST) gives both a nice summary and detailed arguments. With PIA, we do not need to point to or define values, just short-term preferences or instructions. The principal advantage is that an AGI that follows instructions can be used as a collaborator in improving its alignment over time; you don't need to get it exactly right on the first try. This is more helpful in slower and more continuous takeoffs. This means that PIA has a larger basin of attraction than value alignment does.[3] Most people who think alignment is fairly achievable seem to be thinking of PIA, while critics often respond thinking of value alignment. It would help to be explicit. PIA is probably easier and more likely than full VA for our first stabs at AGI, but there are reasons to wonder if it's adequate for real success. In particular, there are intuitions and arguments that PIA doesn't address the real problem of AGI alignment. I think PIA does address the real problem, but in a non-obvious and counterintuitive way. 
Another unstated divide There's another important clustering around these two conceptions of al...…

AF - Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception? by David Scott Krueger 1:01

38 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?, published by David Scott Krueger on September 4, 2024 on The AI Alignment Forum. AI systems up to some high level of intelligence plausibly need to know exactly where they are in space-time in order for deception/"scheming" to make sense as a strategy. This is because they need to know: 1) what sort of oversight they are subject to and 2) what effects their actions will have on the real world (side note: Acausal trade might break this argument) There are a number of informal proposals to keep AI systems selectively ignorant of (1) and (2) in order to prevent deception. Those proposals seem very promising to flesh out; I'm not aware of any rigorous work doing so, however. Are you? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.…

AF - The Checklist: What Succeeding at AI Safety Will Involve by Sam Bowman 35:25

38 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Checklist: What Succeeding at AI Safety Will Involve, published by Sam Bowman on September 3, 2024 on The AI Alignment Forum. Crossposted by habryka with Sam's permission. Expect lower probability for Sam to respond to comments here than if he had posted it. Preface This piece reflects my current best guess at the major goals that Anthropic (or another similarly positioned AI developer) will need to accomplish to have things go well with the development of broadly superhuman AI. Given my role and background, it's disproportionately focused on technical research and on averting emerging catastrophic risks. For context, I lead a technical AI safety research group at Anthropic, and that group has a pretty broad and long-term mandate, so I spend a lot of time thinking about what kind of safety work we'll need over the coming years. This piece is my own opinionated take on that question, though it draws very heavily on discussions with colleagues across the organization: Medium- and long-term AI safety strategy is the subject of countless leadership discussions and Google docs and lunch-table discussions within the organization, and this piece is a snapshot (shared with permission) of where those conversations sometimes go. To be abundantly clear: Nothing here is a firm commitment on behalf of Anthropic, and most people at Anthropic would disagree with at least a few major points here, but this can hopefully still shed some light on the kind of thinking that motivates our work. Here are some of the assumptions that the piece relies on. I don't think any one of these is a certainty, but all of them are plausible enough to be worth taking seriously when making plans: Broadly human-level AI is possible. 
I'll often refer to this as transformative AI (or TAI), roughly defined as AI that could serve as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D.[1] Broadly human-level AI (or TAI) isn't an upper bound on most AI capabilities that matter, and substantially superhuman systems could have an even greater impact on the world along many dimensions. If TAI is possible, it will probably be developed this decade, in a business and policy and cultural context that's not wildly different from today. If TAI is possible, it could be used to dramatically accelerate AI R&D, potentially leading to the development of substantially superhuman systems within just a few months or years after TAI. Powerful AI systems could be extraordinarily destructive if deployed carelessly, both because of new emerging risks and because of existing issues that become much more acute. This could be through misuse of weapons-related capabilities, by disrupting important balances of power in domains like cybersecurity or surveillance, or by any of a number of other means. Many systems at TAI and beyond, at least under the right circumstances, will be capable of operating more-or-less autonomously for long stretches in pursuit of big-picture, real-world goals. This magnifies these safety challenges. Alignment - in the narrow sense of making sure AI developers can confidently steer the behavior of the AI systems they deploy - requires some non-trivial effort to get right, and it gets harder as systems get more powerful. Most of the ideas here ultimately come from outside Anthropic, and while I cite a few sources below, I've been influenced by far more writings and people than I can credit here or even keep track of. Introducing the Checklist This lays out what I think we need to do, divided into three chapters, based on the capabilities of our strongest models: Chapter 1: Preparation You are here. In this period, our best models aren't yet TAI. 
In the language of Anthropic's RSP, they're at AI Safety Level 2 (ASL-2), ASL-3, or maybe the early stages of ASL-4. Most of the wor...…

AF - Survey: How Do Elite Chinese Students Feel About the Risks of AI? by Nick Corvino 19:38

38 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Survey: How Do Elite Chinese Students Feel About the Risks of AI?, published by Nick Corvino on September 2, 2024 on The AI Alignment Forum. Intro In April 2024, my colleague and I (both affiliated with Peking University) conducted a survey involving 510 students from Tsinghua University and 518 students from Peking University - China's two top academic institutions. Our focus was on their perspectives regarding the frontier risks of artificial intelligence. In the People's Republic of China (PRC), publicly accessible survey data on AI is relatively rare, so we hope this report provides some valuable insights into how people in the PRC are thinking about AI (especially the risks). Throughout this post, I'll do my best to weave in other data reflecting the broader Chinese sentiment toward AI. For similar research, check out The Center for Long-Term Artificial Intelligence, YouGov, Monmouth University, The Artificial Intelligence Policy Institute, and notably, a poll conducted by Rethink Priorities, which closely informed our survey design. You can read the full report published in the Jamestown Foundation's China Brief here: Survey: How Do Elite Chinese Students Feel About the Risks of AI? Key Takeaways Students are more optimistic about the benefits of AI than concerned about the harms. 80 percent of respondents agreed or strongly agreed with the statement that AI will do more good than harm for society, with only 7.5 percent actively believing the harms could outweigh the benefits. This, similar to other polling, indicates that the PRC is one of the most optimistic countries concerning the development of AI. Students strongly believe the Chinese government should regulate AI. 
85.31 percent of respondents believe AI should be regulated by the government, with only 6 percent actively believing it should not. This contrasts with trends seen in other countries, where there is typically a positive correlation between optimism about AI and calls for minimizing regulation. The strong support for regulation in the PRC, even as optimism about AI remains high, suggests a distinct perspective on the role of government oversight in the PRC context. Students ranked AI the lowest among all possible existential threats to humanity. When asked about the most likely causes of human extinction, misaligned artificial intelligence received the lowest score. Nuclear war, natural disaster, climate change, and pandemics all proved more concerning for students. Students lean towards cooperation between the United States and the PRC as necessary for the safe and responsible development of AI. 60.7 percent of respondents believe AI will not be developed safely without cooperation between China and the U.S., with 25.68 percent believing it will develop safely no matter the level of cooperation. Students are most concerned about the use of AI for surveillance. This was followed by misinformation, existential risk, wealth inequality, increased political tension, various issues related to bias, with the suffering of artificial entities receiving the lowest score. Background As the recent decision (决定) document from the Third Plenum meetings in July made clear, AI is one of eight technologies that the Chinese Communist Party (CCP) leadership sees as critical for achieving "Chinese-style modernization (中国式现代化)," and is central to the strategy of centering the country's economic future around breakthroughs in frontier science ( People's Daily, July 22). The PRC also seeks to shape international norms on AI, including on AI risks. In October 2023, Xi Jinping announced a "Global AI Governance Initiative (全球人工智能治理倡议)" ( CAC, October 18, 2023). 
Tsinghua and Peking University are the two most prestigious universities in the PRC (by far), many of whose graduates will be very influential in shaping the cou...…

AF - Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024) by Matt MacDermott 8:04

38 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024), published by Matt MacDermott on September 1, 2024 on The AI Alignment Forum. Yoshua Bengio wrote a blogpost about a new AI safety paper by him, various collaborators, and me. I've pasted the text below, but first here are a few comments from me aimed at an AF/LW audience. The paper is basically maths plus some toy experiments. It assumes access to a Bayesian oracle that can infer a posterior over hypotheses given data, and can also estimate probabilities for some negative outcome ("harm"). It proposes some conservative decision rules one could use to reject actions proposed by an agent, and proves probabilistic bounds on their performance under appropriate assumptions. I expect the median reaction in these parts to be something like: ok, I'm sure there are various conservative decision rules you could apply using a Bayesian oracle, but isn't obtaining a Bayesian oracle the hard part here? Doesn't that involve advances in Bayesian machine learning, and also probably solving ELK to get the harm estimates? My answer to that is: yes, I think so. I think Yoshua does too, and that that's the centre of his research agenda. Probably the main interest of this paper to people here is to provide an update on Yoshua's research plans. In particular it gives some more context on what the "guaranteed safe AI" part of his approach might look like -- design your system to do explicit Bayesian inference, and make an argument that the system is safe based on probabilistic guarantees about the behaviour of a Bayesian inference machine. This is in contrast to more hardcore approaches that want to do formal verification by model-checking. 
You should probably think of the ambition here as more like "a safety case involving proofs" than "a formal proof of safety". Bounding the probability of harm from an AI to create a guardrail Published 29 August 2024 by yoshuabengio As we move towards more powerful AI, it becomes urgent to better understand the risks, ideally in a mathematically rigorous and quantifiable way, and use that knowledge to mitigate them. Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees, i.e., would be provably unlikely to take a harmful action? Current AI safety evaluations and benchmarks test the AI for cases where it may behave badly, e.g., by providing answers that could yield dangerous misuse. That is useful and should be legally required with flexible regulation, but is not sufficient. These tests only tell us one side of the story: If they detect bad behavior, a flag is raised and we know that something must be done to mitigate the risks. However, if they do not raise such a red flag, we may still have a dangerous AI in our hands, especially since the testing conditions might be different from the deployment setting, and attackers (or an out-of-control AI) may be creative in ways that the tests did not consider. Most concerningly, AI systems could simply recognize they are being tested and have a temporary incentive to behave appropriately while being tested. Part of the problem is that such tests are spot checks. They are trying to evaluate the risk associated with the AI in general by testing it on special cases. Another option would be to evaluate the risk on a case-by-case basis and reject queries or answers that are considered to potentially violate our safety specification. 
With the long-term goal of obtaining a probabilistic guarantee that would apply in every context, we thus consider in this new paper (see reference and co-authors below) the objective of estimating a context-dependent upper bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at ru...…
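A conservative decision rule of the general kind described above can be sketched as follows. This is an illustrative rule only, not the one proved in the paper: the function names, the `plausibility_floor` parameter, and the toy numbers are all assumptions made for the example. The sketch takes a posterior over hypotheses and each hypothesis's harm probability for a proposed action, then rejects the action if any sufficiently plausible hypothesis assigns it too much harm.

```python
def harm_upper_bound(posterior, harm_given_hypothesis, plausibility_floor=0.05):
    """Worst-case harm probability over hypotheses with non-negligible posterior mass."""
    plausible = [h for h, mass in posterior.items() if mass >= plausibility_floor]
    if not plausible:
        raise ValueError("no hypothesis meets the plausibility floor")
    return max(harm_given_hypothesis[h] for h in plausible)


def guardrail_accepts(posterior, harm_given_hypothesis, tolerance,
                      plausibility_floor=0.05):
    """Accept the proposed action only if the cautious bound is within tolerance."""
    bound = harm_upper_bound(posterior, harm_given_hypothesis, plausibility_floor)
    return bound <= tolerance


# Toy numbers (hypothetical): posterior mass and harm probability per hypothesis.
posterior = {"benign": 0.90, "risky": 0.08, "fringe": 0.02}
harm_given_hypothesis = {"benign": 0.001, "risky": 0.20, "fringe": 0.90}

print(harm_upper_bound(posterior, harm_given_hypothesis))
print(guardrail_accepts(posterior, harm_given_hypothesis, tolerance=0.01))
```

With these toy numbers the posterior-mean harm estimate is about 0.035, while the worst plausible hypothesis gives 0.2, so a rule of this shape is deliberately more cautious than simply averaging over the posterior.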

AF - Epistemic states as a potential benign prior by Tamsin Leake 13:38

38 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Epistemic states as a potential benign prior, published by Tamsin Leake on August 31, 2024 on The AI Alignment Forum. Malignancy in the prior seems like a strong crux of the goal-design part of alignment to me. Whether your prior is going to be used to model: processes in the multiverse containing a specific "beacon" bitstring, processes in the multiverse containing the AI, processes which would output all of my blog, so I can make it output more for me, processes which match an AI chatbot's hypotheses about what it's talking with, then you have to sample hypotheses from somewhere; and typically, we want to use either solomonoff induction or time-penalized versions of it such as levin search (penalized by log of runtime) or what QACI uses (penalized by runtime, but with quantum computation available in some cases), or the implicit prior of neural networks (large sequences of multiplying by a matrix, adding a vector, and ReLU, often with a penalty related to how many non-zero weights are used). And the solomonoff prior is famously malign. (Alternatively, you could have knightian uncertainty about parts of your prior that aren't nailed down enough, and then do maximin over your knightian uncertainty (like in infra-bayesianism), but then you're not guaranteed that your AI gets anywhere at all; its knightian uncertainty might remain so immense that the AI keeps picking the null action all the time because some of its knightian hypotheses still say that anything else is a bad idea. Note: I might be greatly misunderstanding knightian uncertainty!) (It does seem plausible that doing geometric expectation over hypotheses in the prior helps "smooth things over" in some way, but I don't think this particularly removes the weight of malign hypotheses in the prior? 
It just allocates their steering power in a different way, which might make things less bad, but it sounds difficult to quantify.) It does feel to me like we do want a prior for the AI to do expected value calculations over, either for prediction or for utility maximization (or quantilization or whatever). One helpful aspect of prior-distribution-design is that, in many cases, I don't think the prior needs to contain the true hypothesis. For example, if the problem that we're using a prior for is to model processes which match an AI chatbot's hypotheses about what it's talking with, then we don't need the AI's prior to contain a process which behaves just like the human user it's interacting with; rather, we just need the AI's prior to contain a hypothesis which (1) is accurate enough to match observations, and (2) is accurate enough to capture the fact that the user (if we pick a good user) implements the kind of decision theory that lets us rely on them pointing back to the actual real physical user when they get empowered - i.e. in CEV(user-hypothesis), user-hypothesis builds and then runs CEV(physical-user), because that's what the user would do in such a situation. Let's call this second criterion "cooperating back to the real user". So we need a prior which: (a) Has at least some mass on hypotheses which correspond to observations, cooperate back to the real user, and can eventually be found by the AI, given enough evidence (enough chatting with the user). Call this the "aligned hypothesis". (b) Before it narrows down hypothesis space to mostly just aligned hypotheses, doesn't give enough weight to demonic hypotheses which output whichever predictions cause the AI to brainhack its physical user, or escape using rowhammer-type hardware vulnerabilities, or other failures like that. Formalizing the chatbot model First, I'll formalize this chatbot model. 
Let's say we have a magical inner-aligned "soft" math-oracle: Which, given a "scoring" mathematical function from a non-empty set a to real numbers (not necessarily one that is tractably ...…
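For reference, two of the priors named above can be written in standard textbook notation. This is a sketch under assumed notation, not taken from the post: $\ell(p)$ denotes the length of program $p$ in bits and $t(p)$ its runtime. The Solomonoff prior weights programs by length alone, while Levin's time-penalized variant adds the logarithm of the runtime to the cost.

```latex
% Solomonoff-style weight: shorter programs get exponentially more mass.
w_{\mathrm{Sol}}(p) \;\propto\; 2^{-\ell(p)}

% Levin-style weight: cost is length plus log-runtime,
% Kt(p) = \ell(p) + \log_2 t(p), so the weight is divided by the runtime.
w_{\mathrm{Levin}}(p) \;\propto\; 2^{-\left(\ell(p) + \log_2 t(p)\right)}
  \;=\; \frac{2^{-\ell(p)}}{t(p)}
```

The second form makes the "penalized by log of runtime" description concrete: doubling a program's runtime costs it the same prior mass as one extra bit of length.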

AF - AIS terminology proposal: standardize terms for probability ranges by Egg Syntax 5:24

38 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AIS terminology proposal: standardize terms for probability ranges, published by Egg Syntax on August 30, 2024 on The AI Alignment Forum. Summary: The AI safety research community should adopt standardized terms for probability ranges, especially in public-facing communication and especially when discussing risk estimates. The terms used by the IPCC are a reasonable default. Science communication is notoriously hard. It's hard for a lot of reasons, but one is that laypeople aren't used to thinking in numerical probabilities or probability ranges. One field that's had to deal with this more than most is climatology; climate change has been rather controversial, and a non-trivial aspect of that has been lay confusion about what climatologists are actually saying[1]. As a result, the well-known climate assessment reports from the UN's Intergovernmental Panel on Climate Change (IPCC) have, since the 1990s, used explicitly defined terms for probability ranges[2]: (see below for full figure[3]) Like climatology, AI safety research has become a topic of controversy. In both cases, the controversy includes a mix of genuine scientific disagreement, good-faith confusion, and bad-faith opposition. Scientific disagreement comes from people who can deal with numerical probability ranges. Those who are arguing in bad faith from ulterior motives generally don't care about factual details. But I suspect that the large majority of those who disagree, especially laypeople, are coming from a place of genuine, good-faith confusion. For those people, anything we as practitioners can do to communicate more clearly is quite valuable. Also like climatology, AI safety research, especially assessments of risk, fundamentally involves communicating about probabilities and probability ranges. 
Therefore I propose that the AIS community follow climatologists in adopting standard terms for probability ranges, especially in position papers and public-facing communication. In less formal and less public-facing contexts, using standard terminology still adds some value but is less important; in sufficiently informal contexts it's probably not worth the hassle of looking up the standard terminology. Of course, in many cases it's better to just give the actual numerical range! But especially in public-facing communication it can be more natural to use natural language terms, and in fact this is already often done. I'm only proposing that when we do use natural language terms for probability ranges, we use them in a consistent and interpretable way (feel free to link to this post as a reference for interpretation, or point to the climatology papers cited below[2]). Should the AIS community use the same terms? That's a slightly harder question. The obvious first-pass answer is 'yes'; it's a natural Schelling point, and terminological consistency across fields is generally preferable when practically possible. The IPCC terms also have the significant advantage of being battle-tested; they've been used over a thirty-year period in a highly controversial field, and terms have been refined when they were found to be insufficiently clear. The strongest argument I see against using the same terms is that the AIS community sometimes needs to deal with more extreme (high or low) risk estimates than these. If we use 'virtually certain' to mean 99 - 100%, what terms can we use for 99.9 - 100.0%, or 99.99 - 100.00%? On the other hand, plausibly once we're dealing with such extreme risk estimates, it's increasingly important to communicate them with actual numeric ranges. 
My initial proposal is to adopt the IPCC terms, but I'm very open to feedback, and if someone has an argument I find compelling (or which gets strong agreement in votes) for a different or extended set of terms, I'll add it to the proposal. If no su...…
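As a sketch of how such standardized terms might be applied in practice, the snippet below maps a numerical probability to the narrowest matching term from the core IPCC likelihood scale. The ranges are the IPCC's published ones; the function name and the "narrowest range wins" tie-breaking rule are assumptions of this example, since the IPCC ranges overlap by design and authors choose among them by judgment.

```python
# Core IPCC likelihood scale: (term, lower bound, upper bound) as probabilities.
IPCC_TERMS = [
    ("exceptionally unlikely", 0.00, 0.01),
    ("very unlikely", 0.00, 0.10),
    ("unlikely", 0.00, 0.33),
    ("about as likely as not", 0.33, 0.66),
    ("likely", 0.66, 1.00),
    ("very likely", 0.90, 1.00),
    ("virtually certain", 0.99, 1.00),
]


def ipcc_term(probability):
    """Return the term whose range contains the probability and is narrowest."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    candidates = [
        (high - low, term)
        for term, low, high in IPCC_TERMS
        if low <= probability <= high
    ]
    # The ranges jointly cover [0, 1], so candidates is never empty.
    return min(candidates)[1]


for p in (0.995, 0.95, 0.7, 0.5, 0.2, 0.05, 0.005):
    print(p, "->", ipcc_term(p))
```

Note that the scale bottoms out at "exceptionally unlikely" (0 - 1%) and tops out at "virtually certain" (99 - 100%), which is exactly the limitation the text raises for more extreme risk estimates.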

AF - Solving adversarial attacks in computer vision as a baby version of general AI alignment by stanislavfort 12:34

38 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Solving adversarial attacks in computer vision as a baby version of general AI alignment, published by stanislavfort on August 29, 2024 on The AI Alignment Forum. I spent the last few months trying to tackle the problem of adversarial attacks in computer vision from the ground up. The results of this effort are written up in our new paper Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness (explainer on X/Twitter). Taking inspiration from biology, we reached state-of-the-art or above state-of-the-art robustness at 100x - 1000x less compute, got human-understandable interpretability for free, turned classifiers into generators, and designed transferable adversarial attacks on closed-source (v)LLMs such as GPT-4 or Claude 3. I strongly believe that there is a compelling case for devoting serious attention to solving the problem of adversarial robustness in computer vision, and I try to draw an analogy to the alignment of general AI systems here. 1. Introduction In this post, I argue that the problem of adversarial attacks in computer vision is in many ways analogous to the larger task of general AI alignment. In both cases, we are trying to faithfully convey an implicit function locked within the human brain to a machine, and we do so extremely successfully on average. Under static evaluations, the human and machine functions match up exceptionally well. However, as is typical in high-dimensional spaces, some phenomena can be relatively rare and basically impossible to find by chance, yet ubiquitous in their absolute count. This is the case for adversarial attacks - imperceptible modifications to images that completely fool computer vision systems and yet have virtually no effect on humans. 
Their existence highlights a crucial and catastrophic mismatch between the implicit human vision function and the function learned by machines - a mismatch that can be exploited in a dynamic evaluation by an active, malicious agent. Such failure modes will likely be present in more general AI systems, and our inability to remedy them even in the more restricted vision context (yet) does not bode well for the broader alignment project. This is a call to action to solve the problem of adversarial vision attacks - a stepping stone on the path to aligning general AI systems. 2. Communicating implicit human functions to machines The basic goal of computer vision can be viewed as trying to endow a machine with the same vision capabilities a human has. A human carries, locked inside their skull, an implicit vision function mapping visual inputs into semantically meaningful symbols, e.g. a picture of a tortoise into a semantic label tortoise. This function is represented implicitly and while we are extremely good at using it, we do not have direct, conscious access to its inner workings and therefore cannot communicate it to others easily. To convey this function to a machine, we usually form a dataset of fixed images and their associated labels. We then use a general enough class of functions, typically deep neural networks, and a gradient-based learning algorithm together with backpropagation to teach the machine how to correlate images with their semantic content, e.g. how to assign a label parrot to a picture of a parrot. This process is extremely successful in communicating the implicit human vision function to the computer, and the implicit human and explicit, learned machine functions agree to a large extent. The agreement between the two is striking. 
Given how different the architectures are (a simulated graph-like function doing a single forward pass vs the wet protein brain of a mammal running continuous inference), how different the learning algorithms are (gradient descent with backpropagation vs something completely different but still unknown), a...…
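To make the idea of an imperceptible but decisive perturbation concrete, here is a minimal sketch. It assumes a toy linear classifier with random, hypothetical weights and uses the fast gradient sign method (FGSM) as a stand-in attack; it is not the multi-scale ensemble method from the paper discussed above.

```python
import numpy as np

def predict(x, w, b):
    """Label of a toy linear classifier: 1 if w.x + b > 0, else 0."""
    return int(w @ x + b > 0)

def fgsm(x, w, b, y, eps):
    """One fast-gradient-sign step that increases the logistic loss on label y."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # model probability of label 1
    grad_x = (p - y) * w                    # gradient of cross-entropy w.r.t. x
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
d = 10_000                                  # many input dimensions ("pixels")
w = rng.normal(size=d) / np.sqrt(d)         # hypothetical trained weights
x = rng.normal(size=d)                      # hypothetical input
b = 0.0
y = predict(x, w, b)                        # treat the clean prediction as the true label

# Per-coordinate budget chosen so the logit is reflected to the opposite sign.
eps = 2 * abs(w @ x + b) / np.abs(w).sum()
x_adv = fgsm(x, w, b, y, eps)

print("clean label:", y, "adversarial label:", predict(x_adv, w, b))
print("largest change to any single coordinate:", np.max(np.abs(x_adv - x)))
```

Each coordinate moves by at most `eps`, which is small relative to the unit scale of the inputs, yet the many tiny changes add up across dimensions and flip the label. This is the high-dimensional effect the post describes: rare under random noise, easy to find for an active adversary.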

AF - Would catching your AIs trying to escape convince AI developers to slow down or undeploy? by Buck Shlegeris 5:55

39 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would catching your AIs trying to escape convince AI developers to slow down or undeploy?, published by Buck Shlegeris on August 26, 2024 on The AI Alignment Forum. I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. As an extreme thought experiment, I'll argue this could be hard even if you caught your AI red-handed trying to escape. Imagine you're running an AI lab at the point where your AIs are able to automate almost all intellectual labor; the AIs are now mostly being deployed internally to do AI R&D. (If you want a concrete picture here, I'm imagining that there are 10 million parallel instances, running at 10x human speed, working 24/7. See e.g. similar calculations here). And suppose (as I think is 35% likely) that these models are egregiously misaligned and are actively looking for ways to subvert your safety measures and escape your control. You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigated this by running other instances on the inputs they'd see if that backdoored code had actually been deployed; the other AI instances never point out the security flaw, and they often try to exploit that code to escalate their privileges. (See here for more on this.) At this point you're almost sure the AI is scheming. So you try to call the US President and they hear you out, and then they go off and call one of your competitors to ask for their take. They say something like this: It would be extremely costly to shut down all the AI projects. China is racing to catch up and won't stop for this. 
If you force us to stop deploying our frontier models, they'll outcompete us within a year, and you won't be able to get China to pause without substantial risk of war. AIs are well known to do weird stuff. It would be irresponsible to assume one instance of anomalous behavior meant AIs were systematically going to act that way. I can put you in touch with top AI experts and they'll back me up. Even if the AI is indeed doing something systematically funny, we have no evidence that it has the ambition to seize huge amounts of power, and it's not clear that it would be able to. It seems implausible that AIs would do that; there's been no concrete evidence that the AIs are in fact power-hungry. Maybe our competitor just messed up their alignment, which would make sense because, as we've always told you, they're not very competent (especially compared to us). If they want to shut down, they're welcome to. But it doesn't make sense to penalize us for their shoddy work. Maybe they're just lying; this could be a doomer scheme to shut down AI, which the doomers have wanted to do for years. The logs they sent you could be complete forgeries. Or for all we know, someone there (with or without leadership approval) intentionally backdoored their RL data to make their model do this. It would be a mistake to take rash action before independent experts confirm that this wasn't somehow rigged. I'm sympathetic to all of these arguments. The main reason I'd be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn't change my mind based on one weird observation. (I think it's pretty plausible that news of the escape attempt wouldn't even make it out of the AI lab: all the above arguments could happen inside the AI lab, between the safety concerned people and people who wanted to push forward.) Now, I'm not sure I'm right to be so pessimistic here. 
But I think a lot of misalignment-concerned people take for granted that we'd be okay in this situation, an...…

AF - Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs by Michaël Trazzi 8:33

39 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs, published by Michaël Trazzi on August 24, 2024 on The AI Alignment Forum. Owain Evans is an AI Alignment researcher, research associate at the Center of Human Compatible AI at UC Berkeley, and now leading a new AI safety research group. In this episode we discuss two of his recent papers, "Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs" (LW) and "Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data" (LW), alongside some Twitter questions. Below are some highlighted quotes from our conversation (available on Youtube, Spotify, Apple Podcast). For the full context for each of these quotes, you can find the accompanying transcript. Situational Awareness Definition "What is situational awareness? The idea is the model's kind of self-awareness, that is its knowledge of its own identity, and then its awareness of its environment. What are the basic interfaces that it is connected to? [...] And then there's a final point with situational awareness, which is, can the model use knowledge of its identity and environment to take rational actions?" "Situational awareness is crucial for an AI system acting as an agent, doing long-term planning. If you don't understand what kind of thing you are, your capabilities and limitations, it's very hard to make complicated plans. The risks of AI mostly come from agentic models able to do planning." Motivation "We wanted to measure situational awareness in large language models with a benchmark similar to Big Bench or MMLU. The motivation is that situational awareness is important for thinking about AI risks, especially deceptive alignment, and we lacked ways to measure and break it down into components." 
"Situational awareness is relevant to any situation where the model needs to do agentic long-term planning. [...] A model confused about itself and its situation would likely struggle to pull off such a strategy." On Claude 3 Opus Insightful Answers "Let me explain [the Long Monologue task]. Most of our dataset is typical multiple-choice question answering, but we added a task where models write long answers describing themselves and their situation. The idea is to see if the model can combine different pieces of information about itself coherently and make good inferences about why we're asking these questions. Claude 3 Opus was particularly insightful, guessing it might be part of a research study testing self-awareness in LLMs. These were true inferences not stated in the question. The model was reading between the lines, guessing this wasn't a typical ChatGPT-style interaction. I was moderately surprised, but I'd already seen Opus be very insightful and score well on our benchmark. It's worth noting we sample answers with temperature 1, so there's some randomness. We saw these insights often enough that I don't think it's just luck. Anthropic's post-training RLHF seems good at giving the model situational awareness. The GPT-4 base results were more surprising to us." What Would Saturating The Situational Awareness Benchmark Imply For Safety And Governance "If models can do as well or better than humans who are AI experts, who know the whole setup, who are trying to do well on this task, and they're doing well on all the tasks including some of these very hard ones, that would be one piece of evidence. [...] We should consider how aligned it is, what evidence we have for alignment. We should maybe try to understand the skills it's using." "If the model did really well on the benchmark, it seems like it has some of the skills that would help with deceptive alignment. 
This includes being able to reliably work out when it's being evaluated by humans, when it has a lot of oversight, and when it needs to...…

AF - Showing SAE Latents Are Not Atomic Using Meta-SAEs by Bart Bussmann 35:53

39 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Showing SAE Latents Are Not Atomic Using Meta-SAEs, published by Bart Bussmann on August 24, 2024 on The AI Alignment Forum. Bart, Michael and Patrick are joint first authors. Research conducted as part of MATS 6.0 in Lee Sharkey and Neel Nanda's streams. Thanks to Mckenna Fitzgerald and Robert Krzyzanowski for their feedback! TL;DR: Sparse Autoencoder (SAE) latents have been shown to typically be monosemantic (i.e. correspond to an interpretable property of the input). It is sometimes implicitly assumed that they are therefore atomic, i.e. simple, irreducible units that make up the model's computation. We provide evidence against this assumption by finding sparse, interpretable decompositions of SAE decoder directions into seemingly more atomic latents, e.g. Einstein -> science + famous + German + astronomy + energy + starts with E We do this by training meta-SAEs, an SAE trained to reconstruct the decoder directions of a normal SAE. We argue that, conceptually, there's no reason to expect SAE latents to be atomic - when the model is thinking about Albert Einstein, it likely also thinks about Germanness, physicists, etc. Because Einstein always entails those things, the sparsest solution is to have the Albert Einstein latent also boost them. Key results SAE latents can be decomposed into more atomic, interpretable meta-latents. We show that when latents in a larger SAE have split out from latents in a smaller SAE, a meta SAE trained on the larger SAE often recovers this structure. We demonstrate that meta-latents allow for more precise causal interventions on model behavior than SAE latents on a targeted knowledge editing task. We believe that the alternate, interpretable decomposition using MetaSAEs casts doubt on the implicit assumption that SAE latents are atomic. 
We show preliminary results that MetaSAE latents have significant overlap with latents in a normal SAE of the same size but may relate differently to the larger SAEs used in MetaSAE training. We made a dashboard that lets you explore meta-SAE latents. Terminology: Throughout this post we use "latents" to describe the concrete components of the SAE's dictionary, whereas "feature" refers to the abstract concepts, following Lieberum et al. Introduction Mechanistic interpretability (mech interp) attempts to understand neural networks by breaking down their computation into interpretable components. One of the key challenges of this line of research is the polysemanticity of neurons, meaning they respond to seemingly unrelated inputs. Sparse autoencoders (SAEs) have been proposed as a method for decomposing model activations into sparse linear sums of latents. Ideally, these latents should be monosemantic, i.e. respond to inputs that clearly share a similar meaning (implicitly, from the perspective of a human interpreter). That is, a human should be able to reason about the latents both in relation to the features to which they are associated, and also use the latents to better understand the model's overall behavior. There is a popular notion, both implicitly in related work on SAEs within mech interp and explicitly by the use of the term "atom" in sparse dictionary learning as a whole, that SAE features are atomic or can be "true features". However, monosemanticity does not imply atomicity. Consider the example of shapes of different colors - the set of shapes is [circle, triangle, square], and the set of colors is [white, red, green, black], each of which is represented with a linear direction. 'Red triangle' represents a monosemantic feature, but not an atomic feature, as it can be decomposed into red and triangle. It has been shown that sufficiently wide SAEs on toy models will learn 'red triangle', rather than representing 'red' and 'triangle' with separate latents.
Furthermore, whilst one may naively re...…
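The colored-shapes example can be sketched numerically. The snippet below is only an illustration under simplifying assumptions: the atomic directions are made exactly orthonormal, and a projection followed by top-k selection stands in for a trained meta-SAE (the post trains an actual SAE on decoder directions instead). The names `sparse_decompose` and `dictionary` are hypothetical.

```python
import numpy as np

def sparse_decompose(direction, dictionary, k):
    """Keep the k largest-magnitude coefficients of `direction`
    in a dictionary whose rows are orthonormal."""
    coeffs = dictionary @ direction
    keep = np.argsort(-np.abs(coeffs))[:k]
    sparse = np.zeros_like(coeffs)
    sparse[keep] = coeffs[keep]
    return sparse

names = ["circle", "triangle", "square", "white", "red", "green", "black"]

rng = np.random.default_rng(0)
# Seven orthonormal "atomic" directions in a 16-dimensional space (assumed).
q, _ = np.linalg.qr(rng.normal(size=(16, len(names))))
dictionary = q.T

# A monosemantic but non-atomic latent: its decoder direction is red + triangle.
red_triangle = dictionary[names.index("red")] + dictionary[names.index("triangle")]
red_triangle /= np.linalg.norm(red_triangle)

coeffs = sparse_decompose(red_triangle, dictionary, k=2)
active = {names[i] for i in np.flatnonzero(coeffs)}
print(active)
```

The two active atoms reconstruct the 'red triangle' direction, which is the sense in which a latent can be interpretable without being irreducible.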

AF - Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025) by Linda Linsefors 7:27

39 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025), published by Linda Linsefors on August 23, 2024 on The AI Alignment Forum. Do you have AI Safety research ideas that you would like to work on with others? Is there a project you want to do and you want help finding a team? AI Safety Camp could be the solution for you! Summary AI Safety Camp Virtual is a 3-month long online research program from January to April 2025, where participants form teams to work on pre-selected projects. We want you to suggest the projects! If you have an AI Safety project idea and some research experience, apply to be a Research Lead. If accepted, we offer some assistance to develop your idea into a plan suitable for AI Safety Camp. When project plans are ready, we open up team member applications. You get to review applications for your team, and select who joins as a team member. From there, it's your job to guide work on your project. Who is qualified? We require that you have some previous research experience. If you are at least 1 year into a PhD or if you have completed an AI Safety research program (such as a previous AI Safety Camp, PIBBSS, MATS, and similar), or done a research internship with an AI Safety org, then you are qualified already. Other research experience can count, too. More senior researchers are of course also welcome, as long as you think our format of leading an online team inquiring into your research questions suits you and your research. We require that all Research Leads are active participants in their projects and spend at least 10h/week on AISC. 
Apply here If you are unsure, or have any questions you are welcome to: Book a call with Robert Send an email to Robert Choosing project idea(s) AI Safety Camp is about ensuring future AIs are either reasonably safe or not built at all. We welcome many types of projects including projects aimed at stopping or pausing AI development, aligning AI, deconfusion research, or anything else you think will help make the world safer from AI. If you like, you can read more of our perspectives on AI safety, or look at past projects. If you already have an idea for what project you would like to lead, that's great. Apply with that one! However, you don't need to come up with an original idea. What matters is you understanding the idea you want to work on, and why. If you base your proposal on someone else's idea, make sure to cite them. 1. For ideas on stopping harmful AI, see here and/or email Remmelt. 2. For some mech-interp ideas see here. 3. We don't have specific recommendations for where to find other types of project ideas, so just take inspiration wherever you find it. You can submit as many project proposals as you want. However, you are only allowed to lead one project. Use this template to describe each of your project proposals. We want one document per proposal. We'll help you improve your project As part of the Research Lead application process, we'll help you improve your project. The organiser whose ideas match best with yours, will work with you to create the best version of your project. We will also ask for assistance from previous Research Leads, and up to a handful of other trusted people, to give you additional feedback. Timeline Research Lead applications September 22 (Sunday): Application deadline for Research Leads. October 20 (Sunday): Deadline for refined proposals. Team member applications: October 25 (Friday): Accepted proposals are posted on the AISC website. Application to join teams open. November 17 (Sunday): Application to join teams closes. 
December 22 (Sunday): Deadline for Research Leads to choose their team. Program Jan 11 - 12: Opening weekend. Jan 13 - Apr 25: Research is happening. Teams meet weekly, and plan in their own work hours. April 26 - 27 (preliminary dates):...…

AF - Interoperable High Level Structures: Early Thoughts on Adjectives by johnswentworth 12:28

39 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Interoperable High Level Structures: Early Thoughts on Adjectives, published by johnswentworth on August 22, 2024 on The AI Alignment Forum. Meta: This post is a relatively rough dump of some recent research thoughts; it's not one of our more polished posts, in terms of either clarity or rigor. You've been warned. The Interoperable Semantics post and the Solomonoff Inductor Walks Into A Bar post each tackled the question of how different agents in the same world can coordinate on an ontology, so that language can work at all given only a handful of example usages of each word (similar to e.g. children learning new words). Both use natural latents as a central mathematical tool - one in a Bayesian probabilistic framework, the other in a minimum description length framework. Both focus mainly on nouns, i.e. interoperable-across-minds clusters of "objects" in the environment. … and the two propose totally different models. In one, the interoperability of cluster labels (i.e. nouns) follows from natural latent conditions over different features of each object. In the other, interoperability follows from natural latent conditions across objects, with no mention of features. The two models are not, in general, equivalent; they can't both be both correct and complete. In this post, we'll propose that while the natural latent conditions over objects still seem to intuitively capture the rough notion of nouns, the natural latent conditions over features seem much better suited to adjectives. We'll briefly lay out two different potential ways to use natural latents over features as semantic values for adjectives. Then we'll talk a bit about implications, open threads and how this fits into a broader research gameplan. 
The Problem When children learn language, the cognitive process seems to go: Observe the world a bunch … organize knowledge of the world according to some categories, concepts, ontology, etc … those categories, concepts, ontology, etc match other humans' categories, concepts, ontology, etc reasonably well … so it only takes a handful of examples (1-3, say) of the use of a given word in order for the child to learn what the word refers to. The crucial point here is that the categories/concepts/ontology are mostly learned before a word is attached; children do not brute-force learn categories/concepts/ontology from "labeled data". We can tell this is true mainly because it typically takes so few examples to learn the meaning of a new word. The big puzzle, then, is that different humans learn mostly approximately the same categories/concepts/ontology - i.e. the same "candidates" to which words might point - as required for language to work at all with so few examples. How does that work? Mathematically, what are those "interoperable" categories/concepts/ontology, which different humans mostly convergently learn? How can we characterize them? Or, somewhat earlier on the tech tree: can we find even a single model capable of accounting for the phenomenon of different minds in the same environment robustly converging on approximately the same categories/concepts/ontology? Forget whether we can find a model which correctly captures the ontology converged upon by humans, can we even find any model capable of accounting for any sort of robust ontological convergence? Can we find such a model for which the convergent ontology even vaguely resembles the sorts of things in human language (nouns, verbs, adjectives, etc)? What would such a model even look like? That's roughly the stage we're at in this post. Two Previous Models: Naturality Over Objects vs Features Our main tool is (deterministic) natural latents. 
The usage looks like: Suppose the different minds each look for (and find) a latent variable which satisfies the natural latent conditions over some lower-level variab...…

AF - A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed by johnswentworth 8:37

39 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed, published by johnswentworth on August 22, 2024 on The AI Alignment Forum. This post walks through the math for a theorem. It's intended to be a reference post, which we'll link back to as-needed from future posts. The question which first motivated this theorem for us was: "Redness of a marker seems like maybe a natural latent over a bunch of parts of the marker, and redness of a car seems like maybe a natural latent over a bunch of parts of the car, but what makes redness of the marker 'the same as' redness of the car? How are they both instances of one natural thing, i.e. redness? (or 'color'?)". But we're not going to explain in this post how the math might connect to that use-case; this post is just the math. Suppose we have multiple distributions $P_1, \ldots, P_k$ over the same random variables $X_1, \ldots, X_n$. (Speaking somewhat more precisely: the distributions are over the same set, and an element of that set is represented by values $(x_1, \ldots, x_n)$.) We take a mixture of the distributions: $P[X] := \sum_j \alpha_j P_j[X]$, where $\sum_j \alpha_j = 1$ and $\alpha$ is nonnegative. Then our theorem says: if an approximate natural latent exists over $P[X]$, and that latent is robustly natural under changing the mixture weights $\alpha$, then the same latent is approximately natural over $P_j[X]$ for all $j$. Mathematically: the natural latent over $P[X]$ is defined by $(x, \lambda \mapsto P[\Lambda=\lambda|X=x])$, and naturality means that the distribution $(x, \lambda \mapsto P[\Lambda=\lambda|X=x] P[X=x])$ satisfies the naturality conditions (mediation and redundancy). The theorem says that, if the joint distribution $(x, \lambda \mapsto P[\Lambda=\lambda|X=x] \sum_j \alpha_j P_j[X=x])$ satisfies the naturality conditions robustly with respect to changes in $\alpha$, then $(x, \lambda \mapsto P[\Lambda=\lambda|X=x] P_j[X=x])$ satisfies the naturality conditions for all $j$.
"Robustness" here can be interpreted in multiple ways - we'll cover two here, one for which the theorem is trivial and another more substantive, but we expect there are probably more notions of "robustness" which also make the theorem work. Trivial Version First notion of robustness: the joint distribution $(x, \lambda \mapsto P[\Lambda=\lambda|X=x] \sum_j \alpha_j P_j[X=x])$ satisfies the naturality conditions to within $\epsilon$ for all values of $\alpha$ (subject to $\sum_j \alpha_j = 1$ and $\alpha$ nonnegative). Then: the joint distribution $(x, \lambda \mapsto P[\Lambda=\lambda|X=x] \sum_j \alpha_j P_j[X=x])$ satisfies the naturality conditions to within $\epsilon$ specifically for $\alpha_j = \delta_{jk}$, i.e. $\alpha$ which is 0 in all entries except a 1 in entry $k$. In that case, the joint distribution is $(x, \lambda \mapsto P[\Lambda=\lambda|X=x] P_k[X=x])$, therefore $\Lambda$ is natural over $P_k$. Invoke for each $k$, and the theorem is proven. ... but that's just abusing an overly-strong notion of robustness. Let's do a more interesting one. Nontrivial Version Second notion of robustness: the joint distribution $(x, \lambda \mapsto P[\Lambda=\lambda|X=x] \sum_j \alpha_j P_j[X=x])$ satisfies the naturality conditions to within $\epsilon$, and the gradient of the approximation error with respect to (allowed) changes in $\alpha$ is (locally) zero. We need to prove that the joint distributions $(x, \lambda \mapsto P[\Lambda=\lambda|X=x] P_j[X=x])$ satisfy both the mediation and redundancy conditions for each $j$. We'll start with redundancy, because it's simpler. Redundancy We can express the approximation error of the redundancy condition with respect to $X_i$ under the mixed distribution as $D_{KL}(P[\Lambda,X] \,\|\, P[X] P[\Lambda|X_i]) = \mathbb{E}_X[D_{KL}(P[\Lambda|X] \,\|\, P[\Lambda|X_i])]$ where, recall, $P[\Lambda,X] := P[\Lambda|X] \sum_j \alpha_j P_j[X]$.
We can rewrite that approximation error as: $\mathbb{E}_X[D_{KL}(P[\Lambda|X] \,\|\, P[\Lambda|X_i])] = \sum_j \alpha_j \sum_X P_j[X] D_{KL}(P[\Lambda|X] \,\|\, P[\Lambda|X_i]) = \sum_j \alpha_j \mathbb{E}^j_X[D_{KL}(P[\Lambda|X] \,\|\, P[\Lambda|X_i])]$. Note that $P_j[\Lambda|X] = P[\Lambda|X]$ is the same under all the distributions (by definition), so: $= \sum_j \alpha_j D_{KL}(P_j[\Lambda,X] \,\|\, P_j[X] P[\Lambda|X_i])$ and by factorization transfer: $\geq \sum_j \alpha_j D_{KL}(P_j[\Lambda,X] \,\|\, P_j[X] P_j[\Lambda|X_i])$. In other words: if $\epsilon^i_j$ is the redundancy error with respect to $X_i$ under distribution $j$, and $\epsilon^i$ is the redundancy error with respect to $X_i$ under the mixed distribution $P$, then $\epsilon^i \geq \sum_j \alpha_j \epsilon^i_j$. The redundancy error of the mixed distribution is a...…
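The redundancy bound can be checked numerically: the mixed distribution's redundancy error should be at least the weighted average of the component errors, i.e. $\epsilon^i \geq \sum_j \alpha_j \epsilon^i_j$. The following is a minimal sketch under assumptions of my own choosing (two binary variables, a three-valued latent, random distributions, weights 0.3 and 0.7); the function name `redundancy_error` is hypothetical.

```python
import numpy as np

def redundancy_error(p_x, p_lam_given_x, axis_i):
    """D_KL(Q[Lam,X] || Q[X] Q[Lam|X_i]) where Q[Lam,X] = p_lam_given_x * p_x.

    p_x has shape (n1, n2); p_lam_given_x has shape (n1, n2, L).
    axis_i (0 or 1) selects which X_i the latent should be recoverable from.
    """
    joint = p_x[..., None] * p_lam_given_x           # Q[X, Lam]
    other = 1 - axis_i                               # X axis to marginalize out
    joint_i = joint.sum(axis=other, keepdims=True)   # Q[X_i, Lam]
    p_xi = joint_i.sum(axis=-1, keepdims=True)       # Q[X_i]
    lam_given_xi = joint_i / p_xi                    # Q[Lam | X_i]
    return float(np.sum(joint * np.log(p_lam_given_x / lam_given_xi)))

rng = np.random.default_rng(0)
p1 = rng.dirichlet(np.ones(4)).reshape(2, 2)         # P_1[X]
p2 = rng.dirichlet(np.ones(4)).reshape(2, 2)         # P_2[X]
cond = rng.dirichlet(np.ones(3), size=(2, 2))        # shared P[Lam | X]
alpha = np.array([0.3, 0.7])                         # mixture weights

mix = alpha[0] * p1 + alpha[1] * p2                  # P[X] = sum_j alpha_j P_j[X]
eps_mix = redundancy_error(mix, cond, axis_i=0)
eps_parts = np.array([redundancy_error(p, cond, axis_i=0) for p in (p1, p2)])

print("mixed error:", eps_mix)
print("weighted component error:", alpha @ eps_parts)
```

The inequality holds for any choice of inputs because each component's own conditional $P_j[\Lambda|X_i]$ minimizes the expected divergence under $P_j$, which is the factorization-transfer step in the derivation.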

AF - Measuring Structure Development in Algorithmic Transformers by Jasmina Nasufi 18:17

39 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring Structure Development in Algorithmic Transformers, published by Jasmina Nasufi on August 22, 2024 on The AI Alignment Forum. tl;dr: We compute the evolution of the local learning coefficient (LLC), a proxy for model complexity, for an algorithmic transformer. The LLC decreases as the model learns more structured solutions, such as head specialization. This post is structured in three main parts, (1) a summary, giving an overview of the main results, (2) the Fine Print, that delves into various cross-checks and details and (3) Discussion and Conclusions. Structure Formation in Algorithmic Transformers In this work we study the development of simple algorithmic transformers, which are transformers that learn to perform algorithmic tasks. A major advantage of this setup is that we can control several (hyper)parameters, such as the complexity of the training data and network architecture. This allows us to do targeted experiments studying the impacts of these parameters on the learning dynamics. The main tool we use to study the development is the Local Learning Coefficient (LLC) and we choose cases where we have a reverse-engineered solution. Why use the LLC for this purpose? It is a theoretically well motivated measure of model complexity defined by Lau et al. For an overview of Singular Learning Theory (which serves as the theoretical foundation for the LLC) see Liam Carroll's Distilling SLT sequence. For a brief overview of the LLC see e.g. this post. We use the same setup as CallumMcDougall's October Monthly Algorithmic Mech-Interp Challenge. The model is an attention only transformer, trained on sorting numbers with layer norm and weight decay on a cross-entropy loss function using the Adam optimizer. The residual stream size is 96 and the head dimension is 48.
It is trained on sequences of the form and to predict the next token starting at the separation token. The numbers in the list are sampled uniformly from 0 to 50, which together with the separation token produce a vocabulary of 52 tokens. Numbers do not repeat in the list. The images making up the gifs can be found here. 1-Head Model Let's first look at the case of a 1-head transformer: The model reaches 100% accuracy around training step 100, confirming that a single attention head is sufficient for sorting, as noted in previous work. Once maximum accuracy is reached, the full QK and OV circuits[2] behave as described by Callum for the 2-head model: In the QK circuit, source tokens attend more to the smallest token in the list larger than themselves. This results in a higher value band above the diagonal and a lower value band below the diagonal. The OV circuit copies tokens, as seen by the clear positive diagonal pattern. In addition, we observe a transition around training step 1000, where the LLC decreases while the accuracy stays unchanged. This is supported by a drop in the sum of the ranks[3] of the matrices in the heat maps. It also coincides with the formation of the off-diagonal stripes in the OV-circuit. We speculate that these are simpler than the noisier off-diagonal OV pattern observed at peak LLC, and correspond to the translational symmetry of the problem. We define a Translational Symmetry measure[1] (see purple line in the plot) to capture the degree to which the circuits obey this symmetry. It increases throughout most of the training, even after the other measures stabilize. 2-Head Model Let's now turn our attention to the 2-head transformer in Callum's original setup. We see a lot of qualitative similarities to the evolution of the full QK and OV circuits for the 1-head model. 
As the LLC begins to drop (around training step 1000), we note the following: QK circuit: Slight changes[5] to the attention pattern, which crystallize into triangular regions late in the training, long aft...…
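The training data for the sorting task can be sketched as follows. The exact sequence layout is elided in the description above, so the layout here (unsorted list, then a separation token, then the sorted list) is an assumption, as are the list length of 10 and the token id chosen for `SEP`. Only the number range 0 to 50, the no-repeats rule, and the 52-token vocabulary come from the text.

```python
import numpy as np

SEP = 51          # assumed id for the separation token; numbers use ids 0..50
VOCAB_SIZE = 52   # 51 numbers plus the separation token, as stated in the post

def make_example(rng, list_len=10):
    """Return [unsorted numbers, SEP, sorted numbers] as one token sequence.

    list_len is a hypothetical choice; the description does not state it.
    """
    nums = rng.choice(51, size=list_len, replace=False)  # uniform, no repeats
    return np.concatenate([nums, [SEP], np.sort(nums)])

rng = np.random.default_rng(0)
seq = make_example(rng)
print(seq)
```

Under this layout the model is scored only on positions from the separation token onward, where each next token is the smallest remaining number. That matches the QK behavior described above, in which a token attends to the smallest list element larger than itself.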

AF - AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work by Rohin Shah 16:31

40 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work, published by Rohin Shah on August 20, 2024 on The AI Alignment Forum. We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours. Who are we? We're the main team at Google DeepMind working on technical approaches to existential risk from AI systems. Since our last post, we've evolved into the AGI Safety & Alignment team, which we think of as AGI Alignment (with subteams like mechanistic interpretability, scalable oversight, etc.), and Frontier Safety (working on the Frontier Safety Framework, including developing and running dangerous capability evaluations). We've also been growing since our last post: by 39% last year, and by 37% so far this year. The leadership team is Anca Dragan, Rohin Shah, Allan Dafoe, and Dave Orr, with Shane Legg as executive sponsor. We're part of the overall AI Safety and Alignment org led by Anca, which also includes Gemini Safety (focusing on safety training for the current Gemini models), and Voices of All in Alignment, which focuses on alignment techniques for value and viewpoint pluralism. What have we been up to? It's been a while since our last update, so below we list out some key work published in 2023 and the first part of 2024, grouped by topic / sub-team. 
Our big bets for the past 1.5 years have been 1) amplified oversight, to enable the right learning signal for aligning models so that they don't pose catastrophic risks, 2) frontier safety, to analyze whether models are capable of posing catastrophic risks in the first place, and 3) (mechanistic) interpretability, as a potential enabler for both frontier safety and alignment goals. Beyond these bets, we experimented with promising areas and ideas that help us identify new bets we should make. Frontier Safety The mission of the Frontier Safety team is to ensure safety from extreme harms by anticipating, evaluating, and helping Google prepare for powerful capabilities in frontier models. While the focus so far has been primarily around misuse threat models, we are also working on misalignment threat models. FSF We recently published our Frontier Safety Framework, which, in broad strokes, follows the approach of responsible capability scaling, similar to Anthropic's Responsible Scaling Policy and OpenAI's Preparedness Framework. The key difference is that the FSF applies to Google: there are many different frontier LLM deployments across Google, rather than just a single chatbot and API (this in turn affects stakeholder engagement, policy implementation, mitigation plans, etc.). We're excited that our small team led the Google-wide strategy in this space, and demonstrated that responsible capability scaling can work for large tech companies in addition to small startups. A key area of the FSF we're focusing on as we pilot the Framework is how to map between the critical capability levels (CCLs) and the mitigations we would take. This is high on our list of priorities as we iterate on future versions. Some commentary (e.g. here) also highlighted (accurately) that the FSF doesn't include commitments. This is because the science is in early stages and best practices will need to evolve. But ultimately, what we care about is whether the work is actually done. 
In practice, we did run and report dangerous capability evaluations for Gemini 1.5 that we think are sufficient to rule out extreme risk with high confidence. Dangerous Capability Evaluations Our paper on Evaluating Frontier Models for Dangerous Capabilities is the broadest suite of dangerous capability evaluati...…

AF - Finding Deception in Language Models by Esben Kran 7:36

40 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Finding Deception in Language Models, published by Esben Kran on August 20, 2024 on The AI Alignment Forum. This June, Apart Research and Apollo Research joined forces to host the Deception Detection Hackathon, bringing together students, researchers, and engineers from around the world to tackle a pressing challenge in AI safety: preventing AI from deceiving humans and overseers. The hackathon took place both online and in multiple physical locations simultaneously. Marius Hobbhahn, the CEO of Apollo Research, kicked off the hackathon with a keynote talk about evaluating deception in AI with white-box and black-box methods. You can watch his talk here. We also had talks by Jacob Haimes, an Apart fellow, and Mikita Balesni, a research scientist at Apollo Research. This post details the top 8 projects, several of which are currently being developed further in Apart Lab. In total, we had 189 participants and 21 final entries. Sandbagging Detection through Model Degradation By Cam Tice, Philipp Alexander Kreer, Fedor Ryzhenkov, Nathan Helm-Burger, Prithviraj Singh Shahani The team introduced a new approach to detect AI sandbagging, a form of deception where an AI model strategically underperforms during evaluation to hide its true capabilities. The assumption behind their project is that sandbagging is a more complex task than showing the true capabilities. The team gradually adds noise to the model's parameters and evaluates the model at each noise level. Sandbagging models consistently improve in performance right before the noise level causes a normal model to lose the capability. It is a straightforward, inexpensive technique that doesn't require prior assumptions about how sandbagging is represented within the model. 
They successfully detected sandbagging behavior in two open-source LLMs: an 8B Llama 3 Instruct model and a Phi 3 small instruct model. This approach offers a practical and accessible method for identifying potential deception in AI systems. Read the full project here. Detecting and Controlling Deceptive Representation in LLMs with Representational Engineering By Avyay M Casheekar, Kaushik Sanjay Prabhakar, Kanishk Rath, Sienka Dounia The team focused on detecting and mitigating sandbagging in AI models using recent methods for understanding activations. They fine-tuned a Llama 3 8B model on a custom dataset of deceptive and honest scenarios, then used Linear Artificial Tomography (LAT) scans to detect deceptive behaviors in the internal representations of the LLM. The team's technique not only identified deceptive behaviors, but also showed promise in mitigating them. Importantly, their approach demonstrated good generalization to out-of-distribution scenarios, suggesting its potential robustness in real-world applications. Read the full project here. Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach By Siddharth Reddy Bakkireddy, Rakesh Reddy Bakkireddy This team tackled the challenge of detecting deception in closed-source, black-box models like GPT-3.5-turbo. They investigated whether LLM API response metadata such as token count, response time, and tokens per second could be used to detect intentional deceptive behavior. The team analyzed 300 prompts and generated 1,200 responses from GPT-3.5-turbo, comparing the metadata of baseline and deceptive outputs. Their results revealed that deceptive outputs tend to have increased response times and altered token usage. This approach demonstrates that deception detection is possible without accessing a model's internal representation, opening up new avenues for monitoring and safeguarding AI systems, even when their inner workings are not accessible. Read the full project here. 
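The metadata comparison described above can be illustrated with a small sketch. This is not the team's code: the response times are made-up numbers, and Welch's t-statistic is an assumed choice of comparison, used here only to show how a shift in one metadata field (response time) between baseline and deceptive outputs might be quantified.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t-statistic comparing the means of two samples
    that may have unequal variances (positive when sample_b is larger)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_b - mean_a) / standard_error


# Hypothetical response times in seconds (illustrative values, not real data).
baseline_seconds = [1.10, 0.95, 1.20, 1.05, 1.00, 1.15]
deceptive_seconds = [1.40, 1.55, 1.30, 1.60, 1.45, 1.50]

t_stat = welch_t(baseline_seconds, deceptive_seconds)
print(f"Welch t-statistic for response time: {t_stat:.2f}")
```

A large positive statistic would be consistent with slower deceptive responses; the same function could be applied to token counts or tokens per second.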
Modelling the Oversight of Automated Interpretability Against Deceptive Agents on Sp...…

AF - Limitations on Formal Verification for AI Safety by Andrew Dickson 37:37

40 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Limitations on Formal Verification for AI Safety, published by Andrew Dickson on August 19, 2024 on The AI Alignment Forum. In the past two years there has been increased interest in formal verification-based approaches to AI safety. Formal verification is a sub-field of computer science that studies how guarantees may be derived by deduction on fully-specified rule-sets and symbol systems. By contrast, the real world is a messy place that can rarely be straightforwardly represented in a reductionist way. In particular, physics, chemistry and biology are all complex sciences which do not have anything like complete symbolic rule sets. Additionally, even if we had such rules for the natural sciences, it would be very difficult for any software system to obtain sufficiently accurate models and data about initial conditions for a prover to succeed in deriving strong guarantees for AI systems operating in the real world. Practical limitations like these on formal verification have been well understood for decades by engineers and applied mathematicians building real-world software systems, which makes it puzzling that they have mostly been dismissed by leading researchers advocating for the use of formal verification in AI safety so far. This paper will focus on several such limitations and use them to argue that we should be extremely skeptical of claims that formal verification-based approaches will provide strong guarantees against major AI threats in the near term. What do we Mean by Formal Verification for AI Safety? 
Some examples of the kinds of threats researchers hope formal verification will help with come from the paper "Provably Safe Systems: The Only Path to Controllable AGI" [1] by Max Tegmark and Steve Omohundro (emphasis mine): Several groups are working to identify the greatest human existential risks from AGI. For example, the Center for AI Safety recently published 'An Overview of Catastrophic AI Risks' which discusses a wide range of risks including bioterrorism, automated warfare, rogue power seeking AI, etc. Provably safe systems could counteract each of the risks they describe. These authors describe a concrete bioterrorism scenario in section 2.4: a terrorist group wants to use AGI to release a deadly virus over a highly populated area. They use an AGI to design the DNA and shell of a pathogenic virus and the steps to manufacture it. They hire a chemistry lab to synthesize the DNA and integrate it into the protein shell. They use AGI controlled drones to disperse the virus and social media AGIs to spread their message after the attack. Today, groups are working on mechanisms to prevent the synthesis of dangerous DNA. But provably safe infrastructure could stop this kind of attack at every stage: biochemical design AI would not synthesize designs unless they were provably safe for humans, data center GPUs would not execute AI programs unless they were certified safe, chip manufacturing plants would not sell GPUs without provable safety checks, DNA synthesis machines would not operate without a proof of safety, drone control systems would not allow drones to fly without proofs of safety, and armies of persuasive bots would not be able to manipulate media without proof of humanness. 
[1] The above quote contains a number of very strong claims about the possibility of formally or mathematically provable guarantees around software systems deployed in the physical world - for example, the claim that we could have safety proofs about the real-world good behavior of DNA synthesis machines, or drones. From a practical standpoint, our default stance towards such claims should be skepticism, since we do not have proofs of this sort for any of the technologies we interact with in the real world today. For example, DNA synthesis machines exist today and do no...…

AF - Clarifying alignment vs capabilities by Richard Ngo 13:26

40 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Clarifying alignment vs capabilities, published by Richard Ngo on August 19, 2024 on The AI Alignment Forum. A core distinction in AGI safety is between alignment and capabilities. However, I think this distinction is a very fuzzy one, which has led to a lot of confusion. In this post I'll describe some of the problems with how people typically think about it, and offer a replacement set of definitions. "Alignment" and "capabilities" are primarily properties of AIs not of AI research The first thing to highlight is that the distinction between alignment and capabilities is primarily doing useful work when we think of them as properties of AIs. This distinction is still under-appreciated by the wider machine learning community. ML researchers have historically thought about performance of models almost entirely with respect to the tasks they were specifically trained on. However, the rise of LLMs has vindicated the alignment community's focus on general capabilities, and now it's much more common to assume that performance on many tasks (including out-of-distribution tasks) will improve roughly in parallel. This is a crucial assumption for thinking about risks from AGI. Insofar as the ML community has thought about alignment, it has mostly focused on aligning models' behavior to their training objectives. The possibility of neural networks aiming to achieve internally-represented goals is still not very widely understood, making it hard to discuss and study the reasons those goals might or might not be aligned with the values of (any given set of) humans. To be fair, the alignment community has caused some confusion by describing models as more or less "aligned", rather than more or less "aligned to X" for some specified X. 
I'll talk more about this confusion, and how we should address it, in a later post. But the core point is that AIs might develop internally-represented goals or values that we don't like, and we should try to avoid that. However, extending "alignment" and "capabilities" from properties of AIs to properties of different types of research is a fraught endeavor. It's tempting to categorize work as alignment research to the extent that it can be used to make AIs more aligned (to many possible targets), and as capabilities research to the extent that it can be used to make AIs more capable. But this approach runs into (at least) three major problems. Firstly, in general it's very difficult to categorize research by its impacts. Great research often links together ideas from many different subfields, typically in ways that only become apparent throughout the course of the research. We see this in many historical breakthroughs which shed light on a range of different domains. For example, early physicists studying the motions of the stars eventually derived laws governing all earthly objects. Meanwhile Darwin's study of barnacles and finches led him to principles governing the evolution of all life. Analogously, we should expect that big breakthroughs in our understanding of neural networks and deep learning would be useful in many different ways. More concretely, there are many cases where research done under the banner of alignment has advanced, or plausibly will advance, AI capabilities to a significant extent. This undermines our ability to categorize research by its impacts. Central examples include: RLHF makes language models more obedient, but also more capable of coherently carrying out tasks. Scalable oversight techniques can catch misbehavior, but will likely become important for generating high-quality synthetic training data, as it becomes more and more difficult for unassisted humans to label AI outputs correctly. 
Interpretability techniques will both allow us to inspect AI cognition and also extract more capable behavior from them (e.g. via ...…

AF - Untrustworthy models: a frame for scheming evaluations by Olli Järviniemi 15:38

40 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Untrustworthy models: a frame for scheming evaluations, published by Olli Järviniemi on August 19, 2024 on The AI Alignment Forum. In this post, I'll talk about trustworthiness, a concept similar to scheming and deceptive alignment. I prefer this concept, as I find it A) is directly important in practice, B) more readily distinguishes between capability for scheming and in fact scheming, and C) is clearer to evaluate for. I'll also distinguish between different types of untrustworthy models - active planners, sleeper agents and opportunists - that are often conflated. I think when people talk about evaluating models for something like scheming, they'd often be better off talking about trustworthiness instead, and I think it's overall a more applicable frame for designing evaluations and decision making. Many of the ideas here are from Buck Shlegeris and Ryan Greenblatt, thanks to them and Cody Rushing for feedback. Suppose an AI company has developed a new model, and wants to use it to perform large quantities of ML research (internal deployment). However, the company is worried about the model being deceptively aligned and trying to take catastrophic actions, such as hacking the data center or copying its weights to a remote server. What can the company do? This is a textbook example of the type of problem AI control tries to solve, and a basic strategy looks like this: Every time you are using the new model and it takes some action, use another model to check the action is OK before letting it be executed. If the action is not OK, it's not executed and humans are alerted about suspicious activity. (If you have never heard of AI control before, you might want to try playing the Bash Control game to get familiar with the concept. 
In the game, your aim is to download and execute a malicious program without GPT-3.5 catching you.) Let's call our characters Smart and Weak: Smart is the new model doing ML research while maybe trying to escape, while Weak is a model trained to monitor Smart. The reason for these names will become apparent soon. There are many ways this control strategy could fail. Here are some:
1. Smart came up with an extremely subtle way to escape that Weak is just too dumb to notice.
2. Smart successfully jailbreaks (via e.g. a suffix attack) Weak to not alert humans about its escape attempt. Or: Weak happens to have a backdoor, which Smart finds and exploits.
3. Smart successfully convinces Weak (via natural language) that its actions are not malicious.
4. Smart's escape plan is unlike anything we have trained Weak on. While Weak "would be capable of" telling that Smart is escaping, our training didn't specify well enough what we wanted, Weak generalizes incorrectly out-of-distribution and doesn't alert humans.
5. The company has far too limited a budget for human alerts, so while Weak does rank Smart's actions in the top 0.012% suspiciousness, it's not enough to meet the 0.01% threshold for alerting humans.
6. Weak believes that not alerting humans in this case would allow Smart to escape, and believes that this leads to better outcomes in light of its preferences, and thus for instrumental reasons deliberately does not alert humans, indeed causing Smart to escape.
Failure 6 is very different from the others! Whereas failures 1 to 5 are broadly about "Weak is just not a capable/good enough monitor to do what we wanted", failure 6 is about "Weak knows perfectly well Smart is escaping, but betrays humans and doesn't tell them about this". 
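The basic control strategy and the alert-budget failure (failure 5) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the post's implementation: `run_action`, `should_alert`, and the stand-in monitor are hypothetical names, and the monitor is assumed to report how far into the most-suspicious tail an action falls (as a percentage).

```python
def should_alert(suspicion_top_percent, alert_budget_percent):
    """True when the action is suspicious enough to fit inside the human alert budget.

    suspicion_top_percent: the action ranks in the top X% most suspicious actions.
    alert_budget_percent: humans can only review the top Y% of actions.
    """
    return suspicion_top_percent <= alert_budget_percent


def run_action(action, monitor, alert_budget_percent):
    """Check each action with the monitor (Weak) before executing it for Smart."""
    suspicion_top_percent = monitor(action)
    if should_alert(suspicion_top_percent, alert_budget_percent):
        return "blocked; humans alerted"
    return "executed"


# Hypothetical stand-in for Weak: it ranks this action in the top 0.012%.
def weak_monitor(action):
    return 0.012


# With a 0.01% budget the action narrowly misses the threshold and is executed.
outcome = run_action("copy weights to remote server", weak_monitor, alert_budget_percent=0.01)
print(outcome)
```

Raising the budget to 0.02% would flip the outcome to an alert, which is why this failure is about monitor budget rather than monitor trustworthiness.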
This issue is most apparent if you were to use a copy of the AI lab's new model to monitor itself (so Smart and Weak are the same model): If you don't trust the new model to do research without trying to escape, you probably can't trust it to alert humans about escape attempts (by itself, no ...…

AF - Calendar feature geometry in GPT-2 layer 8 residual stream SAEs by Patrick Leask 7:17

40 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Calendar feature geometry in GPT-2 layer 8 residual stream SAEs, published by Patrick Leask on August 17, 2024 on The AI Alignment Forum. TL;DR: We demonstrate that the decoder directions of GPT-2 SAEs are highly structured by finding a historical date direction onto which projecting non-date related features lets us read off their historical time period by comparison to year features. Calendar years are linear: there are as many years between 2000 and 2024 as there are between 1800 and 1824. Linear probes can be used to predict years of particular events from the activations of language models. Since calendar years are linear, one might think the same of other time-based features such as weekday features; however, weekday activations in sparse autoencoders (SAEs) were recently found to be arranged in a circular configuration in their top principal components. Inspired by this, we looked into weekdays, months, and most interestingly calendar years from the perspective of SAE feature decoder similarity. For each group of calendar features, we found interesting patterns of feature splitting between sparse autoencoders of different sizes. For calendar years, we found a timeline direction that meaningfully ordered events, individuals, and concepts with respect to their historical period, which furthermore does not correspond to a principal component of the decoder directions. Finally, we introduce a simple method for finding some of these interpretable directions. Features at different scales We started by replicating the weekday results by performing PCA on the decoder directions of features that had high activations when prompted with days of the week, using the same GPT-2 SAEs as in this post, ranging from 768 to 98304 features. 
In the 768 feature SAE, we found a single weekday feature that activated strongly on all days of the week. In the largest SAE, we found 10 weekday features, 3 of which activated on all days of the week, with the remaining 7 activating on a single day of the week each. We found a group of features that activate primarily on specific days of the week by taking the top 20 activating samples for each feature and checking that the max activating token in each of these samples was the specific weekday. We found the first two principal components for this set of features, and projected the features that activate on any day or number of days from all SAEs onto these directions. The labeled features are those that activate on a single day across all SAEs, with the multi-day features unlabeled to maintain legibility. The smallest SAE (blue) has a single feature that activates on all weekday tokens, and lies near the mean of all the weekday features. The largest SAEs learn features for each day of the week, plus additional multi-day features. Across SAE sizes, the single day features form clusters. In each of these examples, the smallest SAE has a single feature that splits into many specific features that seem of roughly the same importance. With calendar years, however, the situation is more complex. The same method of finding the principal components for single year features between 1900 and 2020 only succeeds in a few 21st century features, and nothing from the 20th century. There is also a group of single year features in a smaller SAE in the center of the plot, suggesting these principal components do not explain variance in them. The plot below shows the years for which each of the features is active, with the x-axis being years from 1950 to 2020, the y-axis being separate features, and the colored bars indicating the periods of year for which that feature is active. 
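The weekday projection described above (find the first two principal components of the single-day decoder directions, then project features onto them) can be sketched with toy data. Everything here is an assumption for illustration: the 4-dimensional "decoder directions" are synthetic vectors placed on a circle plus a shared offset, not real SAE weights, and the PCA uses plain power iteration so the sketch needs only the standard library.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_two_components(vectors, iterations=100):
    """Return (mean, [pc1, pc2]) using power iteration with deflation."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]
    components = []
    for k in range(2):
        vec = [1.0 / (i + 1 + k) for i in range(dim)]  # arbitrary non-zero start
        for _ in range(iterations):
            vec = [dot(row, vec) for row in cov]
            norm = math.sqrt(dot(vec, vec))
            vec = [x / norm for x in vec]
        eigenvalue = dot(vec, [dot(row, vec) for row in cov])
        # Deflation: subtract the found direction so the next pass finds the runner-up.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(dim)]
               for i in range(dim)]
        components.append(vec)
    return mean, components


def project(vector, mean, components):
    """Coordinates of a vector in the plane spanned by the two components."""
    centered = [x - m for x, m in zip(vector, mean)]
    return [dot(centered, c) for c in components]


DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
# Synthetic single-day directions: a circle in two dimensions plus a shared offset.
toy_decoder = {
    day: [math.cos(2 * math.pi * k / 7), math.sin(2 * math.pi * k / 7), 3.0, 0.0]
    for k, day in enumerate(DAYS)
}
# Synthetic "all weekdays" feature, placed at the mean of the single-day directions.
all_weekday_feature = [sum(v[i] for v in toy_decoder.values()) / 7 for i in range(4)]

mean, components = top_two_components(list(toy_decoder.values()))
projections = {day: project(vec, mean, components) for day, vec in toy_decoder.items()}
for day, (x, y) in projections.items():
    print(f"{day}: ({x:+.3f}, {y:+.3f})")
print("all-weekday feature:", project(all_weekday_feature, mean, components))
```

In this toy setup the single-day points land on a circle and the all-weekday feature lands at the center, mirroring the qualitative picture in the text; real decoder directions would be noisier.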
Only in the largest SAEs do you see more than a few single calendar year features, with most of the features activating on ranges of years, or other patterns such as the start and end...…

AF - Fields that I reference when thinking about AI takeover prevention by Buck Shlegeris 17:03

41 weeks ago

Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fields that I reference when thinking about AI takeover prevention, published by Buck Shlegeris on August 13, 2024 on The AI Alignment Forum. Is AI takeover like a nuclear meltdown? A coup? A plane crash? My day job is thinking about safety measures that aim to reduce catastrophic risks from AI (especially risks from egregious misalignment). The two main themes of this work are the design of such measures (what's the space of techniques we might expect to be affordable and effective) and their evaluation (how do we decide which safety measures to implement, and whether a set of measures is sufficiently robust). I focus especially on AI control, where we assume our models are trying to subvert our safety measures and aspire to find measures that are robust anyway. Like other AI safety researchers, I often draw inspiration from other fields that contain potential analogies. Here are some of those fields, my opinions on their strengths and weaknesses as analogies, and some of my favorite resources on them. Robustness to insider threats The single field that seems most closely analogous to AI control is reducing vulnerability to threats resulting from insiders, especially threats that arise from access to IT systems. That is, you have a bunch of employees who need to have access to various IT systems in order to do their jobs, and you're worried that some of these employees will abuse their access in order to accomplish some nefarious purpose. In a bank, you might worry about your tellers withdrawing money from customer accounts inappropriately. In a software company, you might worry about your employees sabotaging your code out of spite, violating user privacy for fun, or stealing your data to give it to a competitor. 
This is quite similar to the problem faced by AI developers when they're trying to control AIs (especially in internal deployments, which I argue is important here): All of the access the AI has is access the developer decided to give it; the developer faces a constant tradeoff between the productivity of the AI (which is maximized by being permissive) and the danger it poses (which is minimized by being restrictive). And obviously, because it's an AI it only has access to resources that are connected to the computer network where it's running, so we can restrict our attention to the IT resources. It's easy for the developer to constantly surveil the AI, which substantially reduces the space of strategies the AI can take. The developer has a lot of control over workflows, and is able to choose them to make it harder for the AI to subtly cause problems. For example, they can restrict affordances in ways that make it harder for AIs to cause problems without being noticed. An example of a similar technique in the insider threat prevention case is the use of administrative APIs instead of direct access to a system, as described in "Building Secure and Reliable Systems" chapter 5. That is: instead of having your engineers resolve issues by SSHing into production machines (which implicitly grants them very broad access that can lead to problems, intentional or accidental), you design an administrative API which allows the engineers to access just the affordances that usually suffice for them to do their job. Of course, they'll sometimes need to take actions that aren't available in the administrative API; this is called a breakglass mechanism. But because breakglass actions are rare, it's affordable to apply more scrutiny to them: you might use multi-party authorization, requiring another engineer to sign off on the commands before they take effect, or you might do auditing (e.g. 
apparently some SRE teams review all the breakglass actions at a weekly team meeting, which allows them to spot irregularities and discourage gratuitous use of the mechanism). Similar t...…
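The administrative API and breakglass pattern described in this episode can be sketched as follows. This is a hedged illustration, not code from "Building Secure and Reliable Systems": the class, method names, and the list of routine actions are all hypothetical, and commands are only recorded rather than executed.

```python
class AdminAPI:
    """Narrow interface exposing only the affordances engineers usually need."""

    ROUTINE_ACTIONS = {"restart_service", "rotate_logs"}  # hypothetical examples

    def __init__(self):
        self.audit_log = []  # every call is recorded for later review

    def run(self, engineer, action):
        """Routine path: only pre-approved actions are allowed."""
        if action not in self.ROUTINE_ACTIONS:
            raise PermissionError(f"{action!r} is not in the administrative API")
        self.audit_log.append((engineer, action, "routine"))
        return f"{action} done"

    def breakglass(self, engineer, command, approver):
        """Rare escape hatch: requires sign-off from a second engineer."""
        if approver is None or approver == engineer:
            raise PermissionError("breakglass needs sign-off from another engineer")
        self.audit_log.append((engineer, command, f"breakglass approved by {approver}"))
        return f"{command} done"

    def breakglass_entries(self):
        """Entries an SRE team might go through at a periodic review."""
        return [entry for entry in self.audit_log if entry[2].startswith("breakglass")]


api = AdminAPI()
api.run("alice", "restart_service")
api.breakglass("alice", "manual database repair", approver="bob")
print(api.breakglass_entries())
```

Because breakglass calls are rare, both checks (multi-party authorization up front, auditing afterwards) stay affordable, which is the tradeoff the text describes.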
