"On how various plans miss the hard bits of the alignment challenge" by Nate Soares
https://www.lesswrong.com/posts/3pinFH3jerMzAvmza/on-how-various-plans-miss-the-hard-bits-of-the-alignment
(As usual, this post was written by Nate Soares with some help and editing from Rob Bensinger.)
In my last post, I described a “hard bit” of the challenge of aligning AGI—the sharp left turn that comes when your system slides into the “AGI” capabilities well, the fact that alignment doesn’t generalize similarly well at this turn, and the fact that this turn seems likely to break a bunch of your existing alignment properties.
Here, I want to briefly discuss a variety of current research proposals in the field, to explain why I think this problem is currently neglected.
I also want to mention research proposals that do strike me as having some promise, or that strike me as adjacent to promising approaches.
Before getting into that, let me be very explicit about three points:
- On my model, solutions to the problem of capabilities generalizing further than alignment are necessary but not sufficient. There is dignity in attacking a variety of other real problems, and I endorse that practice.
- The imaginary versions of people in the dialogs below are not the same as the people themselves. I'm probably misunderstanding the various proposals in important ways, and/or rounding them to stupider versions of themselves along some important dimensions.[1] If I've misrepresented your view, I apologize.
- I do not subscribe to the Copenhagen interpretation of ethics wherein someone who takes a bad swing at the problem (or takes a swing at a different problem) is more culpable for civilization's failure than someone who never takes a swing at all. Everyone whose plans I discuss below is highly commendable, laudable, and virtuous by my accounting.
Chapters
1. "On how various plans miss the hard bits of the alignment challenge" by Nate Soares (00:00:00)
2. Reactions to specific plans (00:04:48)
3. Owen Cotton-Barratt & Truthful AI (00:04:52)
4. Ryan Greenblatt & Eliciting Latent Knowledge (00:09:05)
5. Eric Drexler & AI Services (00:13:41)
6. Evan Hubinger, in a recent personal conversation (00:14:53)
7. A fairly straw version of someone with technical intuitions like Richard Ngo’s or Rohin Shah’s (00:16:24)
8. Another recent proposal (00:18:38)
9. Vivek Hebbar, summarized (perhaps poorly) from last time we spoke of this in person (00:20:03)
10. John Wentworth & Natural Abstractions (00:22:01)
11. Neel Nanda & Theories of Impact for Interpretability (00:25:26)
12. Stuart Armstrong & Concept Extrapolation (00:27:33)
13. Andrew Critch & political solutions (00:41:40)
14. What about superbabies? (00:44:05)
15. What about other MIRI people? (00:44:18)
16. High-level view (00:45:34)