"Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" by Ajeya Cotra

Duration: 3:07:41
 

https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I think that in the coming 15-30 years, the world could plausibly develop “transformative AI”: AI powerful enough to bring us into a new, qualitatively different future, via an explosion in science and technology R&D. This sort of AI could be sufficient to make this the most important century of all time for humanity.

The most straightforward vision for developing transformative AI that I can imagine working with very little innovation in techniques is what I’ll call human feedback[1] on diverse tasks (HFDT):

Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.
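
To make the HFDT recipe above concrete, here is a minimal, purely illustrative Python sketch of what "reinforcement learning on human feedback and other metrics of performance" across diverse tasks might look like at a very high level. Every name in it (PolicyModel, RewardModel, automatic_metrics, the task list, and the reward weighting) is a hypothetical placeholder for exposition, not code from the post or from any actual training setup:

    # Minimal illustrative sketch of HFDT-style training (hypothetical; not from the post).
    import random

    TASKS = ["software development", "novel-writing", "game play", "forecasting"]

    class PolicyModel:
        """Stand-in for the powerful neural network being trained."""
        def act(self, task, prompt):
            return f"attempted output for {task}"   # placeholder output
        def update(self, prompt, output, reward):
            pass                                    # a gradient step would go here

    class RewardModel:
        """Stand-in for a model trained to predict human ratings of outputs."""
        def score(self, task, output):
            return random.random()                  # placeholder human-feedback score

    def automatic_metrics(task, output):
        """Other performance metrics (tests passed, forecast accuracy, ...)."""
        return random.random()                      # placeholder metric score

    def train_hfdt(policy, reward_model, steps=1000):
        for _ in range(steps):
            task = random.choice(TASKS)             # draw from a diverse task distribution
            prompt = f"please do some {task}"
            output = policy.act(task, prompt)
            # Combine the learned human-feedback reward with other metrics (weights arbitrary).
            reward = 0.7 * reward_model.score(task, output) + 0.3 * automatic_metrics(task, output)
            policy.update(prompt, output, reward)   # reinforcement-learning update

    train_hfdt(PolicyModel(), RewardModel())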

HFDT is not the only approach to developing transformative AI,[2] and it may not work at all.[3] But I take it very seriously, and I’m aware of increasingly many executives and ML researchers at AI companies who believe something within this space could work soon.

Unfortunately, I think that if AI companies race forward training increasingly powerful models using HFDT, this is likely to eventually lead to a full-blown AI takeover (i.e. a possibly violent uprising or coup by AI systems). I don’t think this is a certainty, but it looks like the best-guess default absent specific efforts to prevent it.


Chapters

1. "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" by Ajeya Cotra (00:00:00)

2. Cartoon: XKCD (00:06:41)

3. Premises of the hypothetical situation (00:21:25)

4. Basic setup: an AI company trains a “scientist model” very soon (00:23:44)

5. “Racing forward” assumption: Magma tries to train the most powerful model it can (00:33:51)

6. “HFDT scales far” assumption: Alex is trained to achieve excellent performance on a wide range of difficult tasks (00:36:28)

7. Diagram: Alex (00:39:00)

8. Diagram: Timestep (00:41:24)

9. Diagram: Chain of timesteps (00:41:50)

10. Why do I call Alex’s training strategy “baseline” HFDT? (00:46:31)

11. What are some training strategies that would not fall under baseline HFDT? (00:49:06)

12. “Naive safety effort” assumption: Alex is trained to be “behaviorally safe” (00:51:51)

13. Key properties of Alex: it is a generally-competent creative planner (00:56:28)

14. Alex learns to make creative, unexpected plans to achieve open-ended goals (01:00:50)

15. How the hypothetical situation progresses (from the above premises) (01:02:30)

16. Alex would understand its training process very well (including human psychology) (01:04:12)

17. A spectrum of situational awareness (01:04:47)

18. Why I think Alex would have very high situational awareness (01:08:29)

19. While humans are in control, Alex would be incentivized to “play the training game” (01:12:09)

20. Naive “behavioral safety” interventions wouldn’t eliminate this incentive (01:20:26)

21. Maybe inductive bias or path dependence favors honest strategies? (01:26:54)

22. As humans’ control fades, Alex would be motivated to take over (01:28:08)

23. Diagram: Many instances (01:30:20)

24. Deploying Alex would lead to a rapid loss of human control (01:31:41)

25. Image: Increasing copies via R&D (01:36:53)

26. In this new regime, maximizing reward would likely involve seizing control (01:40:44)

27. Even if Alex isn’t “motivated” to maximize reward, it would seek to seize control (01:47:05)

28. Giving negative rewards to “warning signs” would likely select for patience (01:54:23)

29. Why this simplified scenario is worth thinking about (01:57:16)

30. Acknowledgements (02:09:35)

31. Appendices (02:10:24)

32. What would change my mind about the path of least resistance? (02:10:32)

33. “Security holes” may also select against straightforward honesty (02:16:01)

34. Simple “baseline” behavioral safety interventions (02:20:36)

35. Using higher-quality feedback and extrapolating feedback quality (02:21:49)

36. Using prompt engineering to emulate more thoughtful judgments (02:24:26)

37. Requiring Alex to provide justification for its actions (02:26:51)

38. Making the training distribution more diverse (02:31:24)

39. Adversarial training to incentivize Alex to act conservatively (02:37:33)

40. “Training out” bad behavior (02:40:43)

41. “Non-baseline” interventions that might help more (02:43:01)

42. Examining arguments that gradient descent favors being nice over playing the training game (02:46:06)

43. Maybe telling the truth is more “natural” than lying? (02:47:04)

44. Maybe path dependence means Alex internalizes moral lessons early? (02:48:52)

45. Maybe gradient descent simply generalizes “surprisingly well”? (02:51:32)

46. A possible architecture for Alex (02:55:15)

47. Diagram: Basic Alex Function (02:56:22)

48. Diagram: Alex + Heads (02:57:50)

49. Diagram: Alex + Heads + Memory (03:00:32)

50. Plausible high-level features of a good architecture (03:03:11)

51. Diagram: High-level architecture of Alex (03:03:28)
