"How 'Discovering Latent Knowledge in Language Models Without Supervision' Fits Into a Broader Alignment Scheme" by Collin
https://www.lesswrong.com/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Introduction
A few collaborators and I recently released a new paper: Discovering Latent Knowledge in Language Models Without Supervision. For a quick summary of our paper, you can check out this Twitter thread.
In this post I will describe how I think the results and methods in our paper fit into a broader scalable alignment agenda. Unlike the paper, this post is explicitly aimed at an alignment audience and is mainly conceptual rather than empirical.
Tl;dr: unsupervised methods are more scalable than supervised methods, deep learning has special structure that we can exploit for alignment, and we may be able to recover superhuman beliefs from deep learning representations in a totally unsupervised way.
Disclaimers: I have tried to make this post concise, at the cost of not making the full arguments for many of my claims; you should treat this as more of a rough sketch of my views rather than anything comprehensive. I also frequently change my mind – I’m usually more consistently excited about some of the broad intuitions but much less wedded to the details – and this of course just represents my current thinking on the topic.
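For readers who want a concrete picture of what "totally unsupervised" means here, below is a minimal sketch of a Contrast-Consistent Search (CCS)-style probe in PyTorch. It assumes you have already extracted (and normalized) hidden states h_pos and h_neg for the "true" and "false" completions of each contrast pair; the names CCSProbe, ccs_loss, and train_probe are illustrative placeholders, not the paper's released code.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability that the statement is true."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h):
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def ccs_loss(p_pos, p_neg):
    # Consistency: the probe should assign p(x) and p(not x) that sum to 1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_probe(h_pos, h_neg, epochs=1000, lr=1e-3):
    # h_pos, h_neg: (n, hidden_dim) hidden states for the "true"/"false" versions
    # of each statement, assumed normalized (e.g. mean-subtracted per class).
    probe = CCSProbe(h_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe
```

No truth labels appear anywhere in this loop: the probe is trained only to be logically consistent and confident across contrast pairs, and its outputs can afterwards be read off as an estimate of whether each statement is true.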
Chapters
1. "How 'Discovering Latent Knowledge in Language Models Without Supervision' Fits Into a Broader Alignment Scheme" by Collin (00:00:00)
2. Introduction (00:00:18)
3. Problem (00:01:30)
4. Methodology (00:04:45)
5. Intuitions (00:08:42)
6. Our Paper (00:13:36)
7. Diagram (00:14:20)
8. Scaling To GPT-n (00:20:11)
9. Why I Think (Superhuman) GPT-n Will Represent Whether an Input is Actually True or False (00:22:38)
10. Why I Think We Will Be Able To Distinguish GPT-n’s “Beliefs” From Other Truth-Like Features (00:26:08)
11. Conclusion (00:31:59)