"How 'Discovering Latent Knowledge in Language Models Without Supervision' Fits Into a Broader Alignment Scheme" by Collin
https://www.lesswrong.com/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Introduction
A few collaborators and I recently released a new paper: Discovering Latent Knowledge in Language Models Without Supervision. For a quick summary of our paper, you can check out this Twitter thread.
In this post I will describe how I think the results and methods in our paper fit into a broader scalable alignment agenda. Unlike the paper, this post is explicitly aimed at an alignment audience and is mainly conceptual rather than empirical.
Tl;dr: unsupervised methods are more scalable than supervised methods, deep learning has special structure that we can exploit for alignment, and we may be able to recover superhuman beliefs from deep learning representations in a totally unsupervised way.
Disclaimers: I have tried to make this post concise, at the cost of not making the full arguments for many of my claims; you should treat this as a rough sketch of my views rather than anything comprehensive. I also frequently change my mind – I’m usually more consistently excited about some of the broad intuitions but much less wedded to the details – and this of course just represents my current thinking on the topic.
Chapters
1. "How 'Discovering Latent Knowledge in Language Models Without Supervision' Fits Into a Broader Alignment Scheme" by Collin (00:00:00)
2. Introduction (00:00:18)
3. Problem (00:01:30)
4. Methodology (00:04:45)
5. Intuitions (00:08:42)
6. Our Paper (00:13:36)
7. Diagram (00:14:20)
8. Scaling To GPT-n (00:20:11)
9. Why I Think (Superhuman) GPT-n Will Represent Whether an Input is Actually True or False (00:22:38)
10. Why I Think We Will Be Able To Distinguish GPT-n’s “Beliefs” From Other Truth-Like Features (00:26:08)
11. Conclusion (00:31:59)