"Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" by Ajeya Cotra
https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I think that in the coming 15-30 years, the world could plausibly develop “transformative AI”: AI powerful enough to bring us into a new, qualitatively different future, via an explosion in science and technology R&D. This sort of AI could be sufficient to make this the most important century of all time for humanity.
The most straightforward vision for developing transformative AI that I can imagine working with very little innovation in techniques is what I’ll call human feedback[1] on diverse tasks (HFDT):
Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.
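To make the shape of the HFDT recipe concrete, here is a minimal toy sketch, not taken from the original post: instead of a large neural network, a tabular softmax policy is trained round-robin across a few task types with a REINFORCE update, and the role of human feedback is played by a hypothetical `human_feedback` function that simply prefers one action per task. It only illustrates the loop structure (diverse tasks, actions scored by feedback, policy updated toward higher-rated behavior), not the scale or methods an AI company would actually use.

```python
# Toy illustration of the HFDT loop: RL on (simulated) human feedback across diverse tasks.
# The task list, PREFERRED table, and human_feedback function are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)

TASKS = ["software_dev", "novel_writing", "game_play", "forecasting"]
N_ACTIONS = 5  # toy shared discrete action space

# Hypothetical rater preferences: each task has one action the "human" rewards.
PREFERRED = {t: int(rng.integers(N_ACTIONS)) for t in TASKS}

def human_feedback(task: str, action: int) -> float:
    """Stand-in for a human rater: reward 1.0 for the preferred action, else 0.0."""
    return 1.0 if action == PREFERRED[task] else 0.0

# Tabular policy: one logit vector per task (a real system would share one network).
logits = {t: np.zeros(N_ACTIONS) for t in TASKS}
lr = 0.5

for step in range(2000):
    task = TASKS[step % len(TASKS)]            # cycle through diverse tasks
    probs = np.exp(logits[task] - logits[task].max())
    probs /= probs.sum()                       # softmax policy for this task
    action = int(rng.choice(N_ACTIONS, p=probs))
    reward = human_feedback(task, action)      # reinforcement signal from feedback

    # REINFORCE step: grad of log p(action) w.r.t. logits is onehot(action) - probs.
    grad = -probs
    grad[action] += 1.0
    logits[task] += lr * reward * grad

print("learned: ", {t: int(np.argmax(logits[t])) for t in TASKS})
print("preferred:", PREFERRED)
```

After training, the learned argmax actions match the rater's preferences, which is the essential dynamic the post builds on: the policy is shaped toward whatever the feedback process rewards.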
HFDT is not the only approach to developing transformative AI,[2] and it may not work at all.[3] But I take it very seriously, and I’m aware of increasingly many executives and ML researchers at AI companies who believe something within this space could work soon.
Unfortunately, I think that if AI companies race forward training increasingly powerful models using HFDT, this is likely to eventually lead to a full-blown AI takeover (i.e. a possibly violent uprising or coup by AI systems). I don’t think this is a certainty, but it looks like the best-guess default absent specific efforts to prevent it.
Chapters
1. "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" by Ajeya Cotra (00:00:00)
2. Cartoon: XKCD (00:06:41)
3. Premises of the hypothetical situation (00:21:25)
4. Basic setup: an AI company trains a “scientist model” very soon (00:23:44)
5. “Racing forward” assumption: Magma tries to train the most powerful model it can (00:33:51)
6. “HFDT scales far” assumption: Alex is trained to achieve excellent performance on a wide range of difficult tasks (00:36:28)
7. Diagram: Alex (00:39:00)
8. Diagram: Timestep (00:41:24)
9. Diagram: Chain of timesteps (00:41:50)
10. Why do I call Alex’s training strategy “baseline” HFDT? (00:46:31)
11. What are some training strategies that would not fall under baseline HFDT? (00:49:06)
12. “Naive safety effort” assumption: Alex is trained to be “behaviorally safe” (00:51:51)
13. Key properties of Alex: it is a generally-competent creative planner (00:56:28)
14. Alex learns to make creative, unexpected plans to achieve open-ended goals (01:00:50)
15. How the hypothetical situation progresses (from the above premises) (01:02:30)
16. Alex would understand its training process very well (including human psychology) (01:04:12)
17. A spectrum of situational awareness (01:04:47)
18. Why I think Alex would have very high situational awareness (01:08:29)
19. While humans are in control, Alex would be incentivized to “play the training game” (01:12:09)
20. Naive “behavioral safety” interventions wouldn’t eliminate this incentive (01:20:26)
21. Maybe inductive bias or path dependence favors honest strategies? (01:26:54)
22. As humans’ control fades, Alex would be motivated to take over (01:28:08)
23. Diagram: Many instances (01:30:20)
24. Deploying Alex would lead to a rapid loss of human control (01:31:41)
25. Image: Increasing copies via R&D (01:36:53)
26. In this new regime, maximizing reward would likely involve seizing control (01:40:44)
27. Even if Alex isn’t “motivated” to maximize reward, it would seek to seize control (01:47:05)
28. Giving negative rewards to “warning signs” would likely select for patience (01:54:23)
29. Why this simplified scenario is worth thinking about (01:57:16)
30. Acknowledgements (02:09:35)
31. Appendices (02:10:24)
32. What would change my mind about the path of least resistance? (02:10:32)
33. “Security holes” may also select against straightforward honesty (02:16:01)
34. Simple “baseline” behavioral safety interventions (02:20:36)
35. Using higher-quality feedback and extrapolating feedback quality (02:21:49)
36. Using prompt engineering to emulate more thoughtful judgments (02:24:26)
37. Requiring Alex to provide justification for its actions (02:26:51)
38. Making the training distribution more diverse (02:31:24)
39. Adversarial training to incentivize Alex to act conservatively (02:37:33)
40. “Training out” bad behavior (02:40:43)
41. “Non-baseline” interventions that might help more (02:43:01)
42. Examining arguments that gradient descent favors being nice over playing the training game (02:46:06)
43. Maybe telling the truth is more “natural” than lying? (02:47:04)
44. Maybe path dependence means Alex internalizes moral lessons early? (02:48:52)
45. Maybe gradient descent simply generalizes “surprisingly well”? (02:51:32)
46. A possible architecture for Alex (02:55:15)
47. Diagram: Basic Alex Function (02:56:22)
48. Diagram: Alex + Heads (02:57:50)
49. Diagram: Alex + Heads + Memory (03:00:32)
50. Plausible high-level features of a good architecture (03:03:11)
51. Diagram: High-level architecture of Alex (03:03:28)