"Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" by Ajeya Cotra
https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I think that in the coming 15-30 years, the world could plausibly develop “transformative AI”: AI powerful enough to bring us into a new, qualitatively different future, via an explosion in science and technology R&D. This sort of AI could be sufficient to make this the most important century of all time for humanity.
The most straightforward vision for developing transformative AI that I can imagine working with very little innovation in techniques is what I’ll call human feedback[1] on diverse tasks (HFDT):
Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.
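To make the shape of the HFDT recipe concrete, here is a minimal toy sketch, not taken from the original post: instead of a large neural network, a tabular softmax policy is trained round-robin across a few task types with a REINFORCE update, and the role of human feedback is played by a hypothetical `human_feedback` function that simply prefers one action per task. It only illustrates the loop structure (diverse tasks, actions scored by feedback, policy updated toward higher-rated behavior), not the scale or methods an AI company would actually use.

```python
# Toy illustration of the HFDT loop: RL on (simulated) human feedback across diverse tasks.
# The task list, PREFERRED table, and human_feedback function are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)

TASKS = ["software_dev", "novel_writing", "game_play", "forecasting"]
N_ACTIONS = 5  # toy shared discrete action space

# Hypothetical rater preferences: each task has one action the "human" rewards.
PREFERRED = {t: int(rng.integers(N_ACTIONS)) for t in TASKS}

def human_feedback(task: str, action: int) -> float:
    """Stand-in for a human rater: reward 1.0 for the preferred action, else 0.0."""
    return 1.0 if action == PREFERRED[task] else 0.0

# Tabular policy: one logit vector per task (a real system would share one network).
logits = {t: np.zeros(N_ACTIONS) for t in TASKS}
lr = 0.5

for step in range(2000):
    task = TASKS[step % len(TASKS)]            # cycle through diverse tasks
    probs = np.exp(logits[task] - logits[task].max())
    probs /= probs.sum()                       # softmax policy for this task
    action = int(rng.choice(N_ACTIONS, p=probs))
    reward = human_feedback(task, action)      # reinforcement signal from feedback

    # REINFORCE step: grad of log p(action) w.r.t. logits is onehot(action) - probs.
    grad = -probs
    grad[action] += 1.0
    logits[task] += lr * reward * grad

print("learned: ", {t: int(np.argmax(logits[t])) for t in TASKS})
print("preferred:", PREFERRED)
```

After training, the learned argmax actions match the rater's preferences, which is the essential dynamic the post builds on: the policy is shaped toward whatever the feedback process rewards.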
HFDT is not the only approach to developing transformative AI,[2] and it may not work at all.[3] But I take it very seriously, and I’m aware of increasingly many executives and ML researchers at AI companies who believe something within this space could work soon.
Unfortunately, I think that if AI companies race forward training increasingly powerful models using HFDT, this is likely to eventually lead to a full-blown AI takeover (i.e. a possibly violent uprising or coup by AI systems). I don’t think this is a certainty, but it looks like the best-guess default absent specific efforts to prevent it.
Chapters
1. "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" by Ajeya Cotra (00:00:00)
2. Cartoon: XKCD (00:06:41)
3. Premises of the hypothetical situation (00:21:25)
4. Basic setup: an AI company trains a “scientist model” very soon (00:23:44)
5. “Racing forward” assumption: Magma tries to train the most powerful model it can (00:33:51)
6. “HFDT scales far” assumption: Alex is trained to achieve excellent performance on a wide range of difficult tasks (00:36:28)
7. Diagram: Alex (00:39:00)
8. Diagram: Timestep (00:41:24)
9. Diagram: Chain of timesteps (00:41:50)
10. Why do I call Alex’s training strategy “baseline” HFDT? (00:46:31)
11. What are some training strategies that would not fall under baseline HFDT? (00:49:06)
12. “Naive safety effort” assumption: Alex is trained to be “behaviorally safe” (00:51:51)
13. Key properties of Alex: it is a generally-competent creative planner (00:56:28)
14. Alex learns to make creative, unexpected plans to achieve open-ended goals (01:00:50)
15. How the hypothetical situation progresses (from the above premises) (01:02:30)
16. Alex would understand its training process very well (including human psychology) (01:04:12)
17. A spectrum of situational awareness (01:04:47)
18. Why I think Alex would have very high situational awareness (01:08:29)
19. While humans are in control, Alex would be incentivized to “play the training game” (01:12:09)
20. Naive “behavioral safety” interventions wouldn’t eliminate this incentive (01:20:26)
21. Maybe inductive bias or path dependence favors honest strategies? (01:26:54)
22. As humans’ control fades, Alex would be motivated to take over (01:28:08)
23. Diagram: Many instances (01:30:20)
24. Deploying Alex would lead to a rapid loss of human control (01:31:41)
25. Image: Increasing copies via R&D (01:36:53)
26. In this new regime, maximizing reward would likely involve seizing control (01:40:44)
27. Even if Alex isn’t “motivated” to maximize reward, it would seek to seize control (01:47:05)
28. Giving negative rewards to “warning signs” would likely select for patience (01:54:23)
29. Why this simplified scenario is worth thinking about (01:57:16)
30. Acknowledgements (02:09:35)
31. Appendices (02:10:24)
32. What would change my mind about the path of least resistance? (02:10:32)
33. “Security holes” may also select against straightforward honesty (02:16:01)
34. Simple “baseline” behavioral safety interventions (02:20:36)
35. Using higher-quality feedback and extrapolating feedback quality (02:21:49)
36. Using prompt engineering to emulate more thoughtful judgments (02:24:26)
37. Requiring Alex to provide justification for its actions (02:26:51)
38. Making the training distribution more diverse (02:31:24)
39. Adversarial training to incentivize Alex to act conservatively (02:37:33)
40. “Training out” bad behavior (02:40:43)
41. “Non-baseline” interventions that might help more (02:43:01)
42. Examining arguments that gradient descent favors being nice over playing the training game (02:46:06)
43. Maybe telling the truth is more “natural” than lying? (02:47:04)
44. Maybe path dependence means Alex internalizes moral lessons early? (02:48:52)
45. Maybe gradient descent simply generalizes “surprisingly well”? (02:51:32)
46. A possible architecture for Alex (02:55:15)
47. Diagram: Basic Alex Function (02:56:22)
48. Diagram: Alex + Heads (02:57:50)
49. Diagram: Alex + Heads + Memory (03:00:32)
50. Plausible high-level features of a good architecture (03:03:11)
51. Diagram: High-level architecture of Alex (03:03:28)