Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

AI Safety Fundamentals: Alignment



About one year ago · 24:48

Archived series ("Inactive feed" status)

When? This feed was archived on February 21, 2025 21:08 (4M ago). Last successful fetch was on January 02, 2025 12:05 (5M ago)

Why? Inactive feed status. Our servers were unable to retrieve a valid podcast feed for a sustained period.

What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.

Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria: faithfulness, completeness, and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, pointing toward opportunities to scale our understanding to both larger models and more complex tasks. Code for all experiments is available at https://github.com/redwoodresearch/Easy-Transformer.
Source: https://arxiv.org/pdf/2211.00593.pdf
Narrated for AI Safety Fundamentals by Perrin Walker

A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.

Chapters

1. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small (00:00:00)

2. ABSTRACT (00:00:17)

3. 1 INTRODUCTION (00:01:34)

4. 2 BACKGROUND (00:05:58)

5. 2.1 CIRCUITS AND KNOCKOUTS (00:10:41)

6. 3 DISCOVERING THE CIRCUIT (00:14:06)

7. 3.1 WHICH HEADS DIRECTLY AFFECT THE OUTPUT? (NAME MOVER HEADS) (00:19:08)

85 episodes

#Tech #Society #Philosophy #Blue Dot Impact


All episodes

Introduction to Mechanistic Interpretability 11:45

22 weeks ago

Our introduction covers common mechanistic interpretability concepts, to prepare you for the rest of this session's resources.
Original text: https://aisafetyfundamentals.com/blog/introduction-to-mechanistic-interpretability/
Author(s): Sarah Hastings-Woodhouse

We Need a Science of Evals 20:12

22 weeks ago

This post lays out a number of open questions in what the author calls a 'Science of Evals'.
Original text: https://www.apolloresearch.ai/blog/we-need-a-science-of-evals
Author(s): Apollo Research blog

Illustrating Reinforcement Learning from Human Feedback (RLHF) 22:32

46 weeks ago

This more technical article explains the motivations for a system like RLHF, and adds concrete details on how the RLHF approach is applied to neural networks. While reading, consider which parts of the technical implementation correspond to the 'values coach' and 'coherence coach' from the previous video.

Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback 32:19

46 weeks ago

This paper explains Anthropic’s constitutional AI approach, which is largely an extension of RLHF but with AIs replacing human demonstrators and human evaluators. Everything in this paper is relevant to this week's learning objectives, and we recommend you read it in its entirety. It summarises limitations with conventional RLHF, explains the constitutional AI approach, shows how it performs, and where future research might be directed. If you are in a rush, focus on sections 1.2, 3.1, 3.4, 4.1, 6.1, 6.2.

Constitutional AI: Harmlessness from AI Feedback 1:01:49

46 weeks ago

Intro to Brain-Like-AGI Safety 1:02:10

51 weeks ago

(Sections 3.1-3.4, 6.1-6.2, and 7.1-7.5) Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition as the human brain. How would we use such an algorithm safely? I will argue that this is an open technical problem, and my goal in this post series is to bring readers with no prior knowledge all the way up to the front-line of unsolved problems as I see them. If this whole thing seems weird or stupid, you should start right in on Post #1, which contains definitions, background, and motivation. Then Posts #2–#7 are mainly neuroscience, and Posts #8–#15 are more directly about AGI safety, ending with a list of open questions and advice for getting involved in the field.
Source: https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Chinchilla’s Wild Implications 24:57

51 weeks ago

This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla. The paper came out a few months ago, and has been discussed a lot, but some of its implications deserve more explicit notice in my opinion. In particular: Data, not size, is the currently active constraint on language modeling performance. Current returns to additional data are immense, and current returns to additional model size are minuscule; indeed, most recent landmark models are wastefully big. If we can leverage enough data, there is no reason to train ~500B param models, much less 1T or larger models. If we have to train models at these large sizes, it will mean we have encountered a barrier to exploitation of data scaling, which would be a great loss relative to what would otherwise be possible. The literature is extremely unclear on how much text data is actually available for training. We may be "running out" of general-domain data, but the literature is too vague to know one way or the other. The entire available quantity of data in highly specialized domains like code is woefully tiny, compared to the gains that would be possible if much more such data were available. Some things to note at the outset: This post assumes you have some familiarity with LM scaling laws. As in the paper, I'll assume here that models never see repeated data in training.
Original text: https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Deep Double Descent 8:27

51 weeks ago

We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.
Source: https://openai.com/research/deep-double-descent
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Eliciting Latent Knowledge 1:00:27

51 weeks ago

In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems: Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us. But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad. In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events? We’ll call this problem eliciting latent knowledge (ELK). In this report we’ll focus on detecting sensor tampering as a motivating example, but we believe ELK is central to many aspects of alignment.
Source: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Empirical Findings Generalize Surprisingly Far 11:32

51 weeks ago

Previously, I argued that emergent phenomena in machine learning mean that we can’t rely on current trends to predict what the future of ML will be like. In this post, I will argue that despite this, empirical findings often do generalize very far, including across “phase transitions” caused by emergent behavior. This might seem like a contradiction, but actually I think divergence from current trends and empirical generalization are consistent. Findings do often generalize, but you need to think to determine the right generalization, and also about what might stop any given generalization from holding. I don’t think many people would contest the claim that empirical investigation can uncover deep and generalizable truths. This is one of the big lessons of physics, and while some might attribute physics’ success to math instead of empiricism, I think it’s clear that you need empirical data to point to the right mathematics. However, just invoking physics isn’t a good argument, because physical laws have fundamental symmetries that we shouldn’t expect in machine learning. Moreover, we care specifically about findings that continue to hold up after some sort of emergent behavior (such as few-shot learning in the case of ML). So, to make my case, I’ll start by considering examples in deep learning that have held up in this way. Since “modern” deep learning hasn’t been around that long, I’ll also look at examples from biology, a field that has been around for a relatively long time and where More Is Different is ubiquitous (see Appendix: More Is Different In Other Domains).
Source: https://bounded-regret.ghost.io/empirical-findings-generalize-surprisingly-far/
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Low-Stakes Alignment 13:56

51 weeks ago

Right now I’m working on finding a good objective to optimize with ML, rather than trying to make sure our models are robustly optimizing that objective. (This is roughly “outer alignment.”) That’s pretty vague, and it’s not obvious whether “find a good objective” is a meaningful goal rather than being inherently confused or sweeping key distinctions under the rug. So I like to focus on a more precise special case of alignment: solve alignment when decisions are “low stakes.” I think this case effectively isolates the problem of “find a good objective” from the problem of ensuring robustness and is precise enough to focus on productively. In this post I’ll describe what I mean by the low-stakes setting, why I think it isolates this subproblem, why I want to isolate this subproblem, and why I think that it’s valuable to work on crisp subproblems.
Source: https://www.alignmentforum.org/posts/TPan9sQFuPP6jgEJo/low-stakes-alignment
Narrated for AI Safety Fundamentals by TYPE III AUDIO.

Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions 16:39

51 weeks ago

Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer options, where one is correct and the other is incorrect, allows human judges to perform more accurately, even when one of the arguments is unreliable and deceptive. If this is helpful, we may be able to increase our justified trust in language-model-based systems by asking them to produce these arguments where needed. Previous research has shown that just a single turn of arguments in this format is not helpful to humans. However, as debate settings are characterized by a back-and-forth dialogue, we follow up on previous results to test whether adding a second round of counter-arguments is helpful to humans. We find that, regardless of whether they have access to arguments or not, humans perform similarly on our task. These findings suggest that, in the case of answering reading comprehension questions, debate is not a helpful format.
Source: https://arxiv.org/abs/2210.10860
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Least-To-Most Prompting Enables Complex Reasoning in Large Language Models 16:08

51 weeks ago

Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks which require solving problems harder than the exemplars shown in the prompts. To overcome this challenge of easy-to-hard generalization, we propose a novel prompting strategy, least-to-most prompting. The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems. Our experimental results on tasks related to symbolic manipulation, compositional generalization, and math reasoning reveal that least-to-most prompting is capable of generalizing to more difficult problems than those seen in the prompts. A notable finding is that when the GPT-3 code-davinci-002 model is used with least-to-most prompting, it can solve the compositional generalization benchmark SCAN in any split (including length split) with an accuracy of at least 99% using just 14 exemplars, compared to only 16% accuracy with chain-of-thought prompting. This is particularly noteworthy because neural-symbolic models in the literature that specialize in solving SCAN are trained on the entire training set containing over 15,000 examples. We have included prompts for all the tasks in the Appendix.
Source: https://arxiv.org/abs/2205.10625
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

ABS: Scanning Neural Networks for Back-Doors by Artificial Brain Stimulation 16:08

51 weeks ago

This paper presents a technique to scan neural network based AI models to determine if they are trojaned. Pre-trained AI models may contain back-doors that are injected through training or by transforming inner neuron weights. These trojaned models operate normally when regular inputs are provided, and mis-classify to a specific output label when the input is stamped with some special pattern called a trojan trigger. We develop a novel technique that analyzes inner neuron behaviors by determining how output activations change when we introduce different levels of stimulation to a neuron. Neurons that substantially elevate the activation of a particular output label regardless of the provided input are considered potentially compromised. The trojan trigger is then reverse-engineered through an optimization procedure using the stimulation analysis results, to confirm that a neuron is truly compromised. We evaluate our system ABS on 177 trojaned models that are trojaned with various attack methods that target both the input space and the feature space, and have various trojan trigger sizes and shapes, together with 144 benign models that are trained with different data and initial weight values. These models belong to 7 different model structures and 6 different datasets, including some complex ones such as ImageNet, VGG-Face and ResNet110. Our results show that ABS is highly effective, and can achieve over 90% detection rate for most cases (and many 100%), when only one input sample is provided for each output label. It substantially outperforms the state-of-the-art technique Neural Cleanse, which requires a lot of input samples and small trojan triggers to achieve good performance.
Source: https://www.cs.purdue.edu/homes/taog/docs/CCS19.pdf
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Imitative Generalisation (AKA ‘Learning the Prior’) 18:14

51 weeks ago

This post tries to explain a simplified version of Paul Christiano’s mechanism introduced here (referred to there as ‘Learning the Prior’) and explain why a mechanism like this potentially addresses some of the safety problems with naïve approaches. First we’ll go through a simple example in a familiar domain, then explain the problems with the example. Then I’ll discuss the open questions for making Imitative Generalization actually work, and the connection with the Microscope AI idea. A more detailed explanation of exactly what the training objective is (with diagrams), and the correspondence with Bayesian inference, are in the appendix.
Source: https://www.alignmentforum.org/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Toy Models of Superposition 41:43

51 weeks ago

It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others? In this paper, we use toy models (small ReLU networks trained on synthetic data with sparse input features) to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition. When features are sparse, superposition allows compression beyond what a linear model would do, at the cost of "interference" that requires nonlinear filtering.
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Discovering Latent Knowledge in Language Models Without Supervision 37:09

51 weeks ago

Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
Original text: https://arxiv.org/abs/2212.03827
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

An Investigation of Model-Free Planning 8:11

51 weeks ago

The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods has been proposed that learn how to plan, by providing the structure for planning via an inductive bias in the function approximator (such as a tree structured neural network), trained end-to-end by a model-free RL algorithm. In this paper, we go even further, and demonstrate empirically that an entirely model-free approach, without special structure beyond standard neural network components such as convolutional networks and LSTMs, can learn to exhibit many of the characteristics typically associated with a model-based planner. We measure our agent’s effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning.
Source: https://arxiv.org/abs/1901.03559
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Gradient Hacking: Definitions and Examples 9:15

51 weeks ago

Gradient hacking is a hypothesized phenomenon where: A model has knowledge about possible training trajectories which isn’t being used by its training algorithms when choosing updates (such as knowledge about non-local features of its loss landscape which aren’t taken into account by local optimization algorithms). The model uses that knowledge to influence its medium-term training trajectory, even if the effects wash out in the long term. Below I give some potential examples of gradient hacking, divided into those which exploit RL credit assignment and those which exploit gradient descent itself. My concern is that models might use techniques like these either to influence which goals they develop, or to fool our interpretability techniques. Even if those effects don’t last in the long term, they might last until the model is smart enough to misbehave in other ways (e.g. specification gaming, or reward tampering), or until it’s deployed in the real world, especially in the RL examples, since convergence to a global optimum seems unrealistic (and ill-defined) for RL policies trained on real-world data. However, since gradient hacking isn’t very well-understood right now, both the definition above and the examples below should only be considered preliminary.
Source: https://www.alignmentforum.org/posts/EeAgytDZbDjRznPMA/gradient-hacking-definitions-and-examples
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Compute Trends Across Three Eras of Machine Learning 13:50

51 weeks ago

This article explains key drivers of AI progress and how compute is calculated, and looks at how the amount of compute used to train AI models has increased significantly in recent years.
Original text: https://epochai.org/blog/compute-trends
Author(s): Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, Pablo Villalobos.

Worst-Case Thinking in AI Alignment 11:35

1 year ago

Alternative title: “When should you assume that what could go wrong, will go wrong?” Thanks to Mary Phuong and Ryan Greenblatt for helpful suggestions and discussion, and Akash Wasil for some edits. In discussions of AI safety, people often propose the assumption that something goes as badly as possible. Eliezer Yudkowsky in particular has argued for the importance of security mindset when thinking about AI alignment. I think there are several distinct reasons that this might be the right assumption to make in a particular situation. But I think people often conflate these reasons, and I think that this causes confusion and mistaken thinking. So I want to spell out some distinctions. Throughout this post, I give a bunch of specific arguments about AI alignment, including one argument that I think I was personally getting wrong until I noticed my mistake yesterday (which was my impetus for thinking about this topic more and then writing this post). I think I’m probably still thinking about some of my object level examples wrong, and hope that if so, commenters will point out my mistakes.
Original text: https://www.lesswrong.com/posts/yTvBSFrXhZfL8vr5a/worst-case-thinking-in-ai-alignment
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Public by Default: How We Manage Information Visibility at Get on Board 9:50

1 year ago

I’ve been obsessed with managing information and communications in a remote team since Get on Board started growing. Reducing the bus factor is a primary motivation, but another just as important is diminishing reliance on synchronicity. When what I know is documented and accessible to others, I’m less likely to be a bottleneck for anyone else in the team. So if I’m busy, minding family matters, on vacation, or sick, I won’t be blocking anyone. This, in turn, gives everyone in the team the freedom to build their own work schedules according to their needs, work from any time zone, or enjoy more distraction-free moments. As I write these lines, most of the world is under quarantine, relying on non-stop video calls to continue working. Needless to say, that is not a sustainable long-term work schedule.
Original text: https://www.getonbrd.com/blog/public-by-default-how-we-manage-information-visibility-at-get-on-board
Author: Sergio Nouvel

How to Get Feedback 7:30

1 year ago

Feedback is essential for learning. Whether you’re studying for a test, trying to improve in your work or want to master a difficult skill, you need feedback. The challenge is that feedback can often be hard to get. Worse, if you get bad feedback, you may end up worse than before.
Original text: https://www.scotthyoung.com/blog/2019/01/24/how-to-get-feedback/
Author: Scott Young

Writing, Briefly 3:09

1 year ago

(In the process of answering an email, I accidentally wrote a tiny essay about writing. I usually spend weeks on an essay. This one took 67 minutes: 23 of writing, and 44 of rewriting.)
Original text: https://paulgraham.com/writing44.html
Author: Paul Graham

AI Safety Fundamentals: Alignment

Being the (Pareto) Best in the World 6:46

1 year ago

This introduces the concept of Pareto frontiers. The top comment by Rob Miles also ties it to comparative advantage. While reading, consider what Pareto frontiers your project could place you on. Original text: https://www.lesswrong.com/posts/XvN2QQpKTuEzgkZHY/being-the-pareto-best-in-the-world Author: John Wentworth A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

AI Safety Fundamentals: Alignment

How to Succeed as an Early-Stage Researcher: The “Lean Startup” Approach 15:16

1 year ago

I am approaching the end of my AI governance PhD, and I’ve spent about 2.5 years as a researcher at FHI. During that time, I’ve learnt a lot about the formula for successful early-career research. This post summarises my advice for people in the first couple of years. Research is really hard, and I want people to avoid the mistakes I’ve made. Original text: https://forum.effectivealtruism.org/posts/jfHPBbYFzCrbdEXXd/how-to-succeed-as-an-early-stage-researcher-the-lean-startup#Conclusion Author: Toby Shevlane A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

AI Safety Fundamentals: Alignment

Become a Person who Actually Does Things 5:14

1 year ago

The next four weeks of the course are an opportunity for you to actually build a thing that moves you closer to contributing to AI Alignment, and we're really excited to see what you do! A common failure mode is to think "Oh, I can't actually do X" or to say "Someone else is probably doing Y." You probably can do X, and it's unlikely anyone is doing Y! It could be you! Original text: https://www.neelnanda.io/blog/become-a-person-who-actually-does-things Author: Neel Nanda A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

AI Safety Fundamentals: Alignment

Planning a High-Impact Career: A Summary of Everything You Need to Know in 7 Points 11:02

1 year ago

We took 10 years of research and what we’ve learned from advising 1,000+ people on how to build high-impact careers, compressed that into an eight-week course to create your career plan, and then compressed that into this three-page summary of the main points. (It’s especially aimed at people who want a career that’s both satisfying and has a significant positive impact, but much of the advice applies to all career decisions.) Original article: https://80000hours.org/career-planning/summary/ Author: Benjamin Todd A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

AI Safety Fundamentals: Alignment

Working in AI Alignment 1:08:44

1 year ago

This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet working on alignment, and for people who are already familiar with the arguments for working on AI alignment. If you aren’t familiar with the arguments for the importance of AI alignment, you can get an overview of them by doing the AI Alignment Course. By Charlie Rogers-Smith, with minor updates by Adam Jones. Source: https://aisafetyfundamentals.com/blog/alignment-careers-guide Narrated for AI Safety Fundamentals by Perrin Walker A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

AI Safety Fundamentals: Alignment

Computing Power and the Governance of AI 26:49

1 year ago

This post summarises a new report, “Computing Power and the Governance of Artificial Intelligence.” The full report is a collaboration between nineteen researchers from academia, civil society, and industry. It can be read here. GovAI research blog posts represent the views of their authors, rather than the views of the organisation. Source: https://www.governance.ai/post/computing-power-and-the-governance-of-ai Narrated for AI Safety Fundamentals by Perrin Walker A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Chapters

1. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small (00:00:00)

2. ABSTRACT (00:00:17)

3. 1 INTRODUCTION (00:01:34)

4. 2 BACKGROUND (00:05:58)

5. 2.1 CIRCUITS AND KNOCKOUTS (00:10:41)

6. 3 DISCOVERING THE CIRCUIT (00:14:06)

7. 3.1 WHICH HEADS DIRECTLY AFFECT THE OUTPUT? (NAME MOVER HEADS) (00:19:08)
