Teaching LLMs to Self-Reflect with Reinforcement Learning with Maohao Shen - #726
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Today, we're joined by Maohao Shen, a PhD student at MIT, to discuss his paper, “Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search.” We dig into how Satori leverages reinforcement learning to improve language model reasoning—enabling model self-reflection, self-correction, and exploration of alternative solutions. We explore the Chain-of-Action-Thought (COAT) approach, which uses special tokens—continue, reflect, and explore—to guide the model through distinct reasoning actions, allowing it to navigate complex reasoning tasks without external supervision. We also break down Satori’s two-stage training process: format tuning, which teaches the model to understand and utilize the special action tokens, and reinforcement learning, which optimizes reasoning through trial-and-error self-improvement. We cover key techniques such as “restart and explore,” which allows the model to self-correct and generalize beyond its training domain. Finally, Maohao reviews Satori’s performance and how it compares to other models, the reward design, the benchmarks used, and the surprising observations made during the research.
The complete show notes for this episode can be found at https://twimlai.com/go/726.
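To make the COAT mechanics discussed in the episode more concrete, here is a minimal Python sketch of what a rollout driven by special action tokens could look like. The token strings, the random action chooser, and the restart logic are all illustrative assumptions standing in for the learned policy described in the paper, not Satori's actual implementation.

```python
import random

# Hypothetical meta-action tokens in the spirit of COAT; the paper's
# actual token strings may differ.
CONTINUE = "<|continue|>"   # keep extending the current reasoning
REFLECT = "<|reflect|>"     # verify the steps taken so far
EXPLORE = "<|explore|>"     # abandon the current path, try an alternative

def generate_segment(context: str) -> str:
    # Stand-in for one autoregressively generated reasoning segment.
    return f"...reasoning given {len(context)} chars of context..."

def choose_meta_action() -> str:
    # In Satori the model itself emits the next meta-action token;
    # here we sample one at random just to exercise the control flow.
    return random.choice([CONTINUE, CONTINUE, REFLECT, EXPLORE])

def coat_rollout(problem: str, max_segments: int = 6) -> str:
    # Toy Chain-of-Action-Thought rollout: reasoning segments are
    # interleaved with meta-action tokens that steer the search.
    trajectory = [problem]
    for _ in range(max_segments):
        action = choose_meta_action()
        if action == EXPLORE:
            # "Restart and explore": return to the problem statement
            # and search for an alternative solution path.
            trajectory = [problem, f"{EXPLORE} restarting with a new approach"]
        else:
            trajectory.append(f"{action} {generate_segment(' '.join(trajectory))}")
    return "\n".join(trajectory)

if __name__ == "__main__":
    print(coat_rollout("Solve: 3x + 5 = 20"))
```

In the trained system, format tuning is what teaches the model to emit these action tokens itself, and the reinforcement learning stage then shapes when it continues, reflects, or explores.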
All episodes
Genie 3: A New Frontier for World Models with Jack Parker-Holder and Shlomi Fruchter - #743 (1:01:01)
Closing the Loop Between AI Training and Inference with Lin Qiao - #742 (1:01:11)
Infrastructure Scaling and Compound AI Systems with Jared Quincy Davis - #740 (1:13:02)
Building Voice AI Agents That Don’t Suck with Kwindla Kramer - #739 (1:13:02)
Distilling Transformers and Diffusion Models for Robust Edge Use Cases with Fatih Porikli - #738 (1:00:29)
Grokking, Generalization Collapse, and the Dynamics of Training Deep Neural Networks with Charles Martin - #734 (1:25:21)
From Prompts to Policies: How RL Builds Better AI Agents with Mahesh Sathiamoorthy - #731 (1:01:25)
How OpenAI Builds AI Agents That Think and Act with Josh Tobin - #730 (1:07:27)