Generative Benchmarking with Kelly Hong - #728
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
In this episode, Kelly Hong, a researcher at Chroma, joins us to discuss "Generative Benchmarking," a novel approach to evaluating retrieval systems, like RAG applications, using synthetic data. Kelly explains how traditional benchmarks like MTEB fail to represent real-world query patterns and how embedding models that perform well on public benchmarks often underperform in production. The conversation explores the two-step process of Generative Benchmarking: filtering documents to focus on relevant content and generating queries that mimic actual user behavior. Kelly shares insights from applying this approach to Weights & Biases' technical support bot, revealing how domain-specific evaluation provides more accurate assessments of embedding model performance. We also discuss the importance of aligning LLM judges with human preferences, the impact of chunking strategies on retrieval effectiveness, and how production queries differ from benchmark queries in ambiguity and style. Throughout the episode, Kelly emphasizes the need for systematic evaluation approaches that go beyond "vibe checks" to help developers build more effective RAG applications.
The complete show notes for this episode can be found at https://twimlai.com/go/728.
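To make the two-step process concrete, here is a minimal sketch of how a generative-benchmarking pipeline might look; it is not Chroma's actual generative-benchmarking library, and the model names, prompts, and helper functions are illustrative assumptions. An LLM judge filters chunks down to those a user would plausibly ask about, an LLM then generates a production-style query for each kept chunk, and recall@k measures whether each source chunk is retrieved for its own query.

```python
# Illustrative sketch of generative benchmarking, not Chroma's actual API.
# Assumes OPENAI_API_KEY is set; prompts and model choices are placeholders.
import chromadb
from openai import OpenAI

llm = OpenAI()

def llm_yes_no(prompt: str) -> bool:
    """Ask the LLM a yes/no question and parse the answer (illustrative judge)."""
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("y")

def filter_chunks(chunks: list[str]) -> list[str]:
    """Step 1: keep only chunks a real user would plausibly ask about."""
    return [
        c for c in chunks
        if llm_yes_no(
            "Would a user plausibly ask a support question answered by the "
            f"following passage? Answer yes or no.\n\n{c}"
        )
    ]

def generate_query(chunk: str) -> str:
    """Step 2: generate a realistic, production-style query grounded in the chunk."""
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   "Write one short, informal support question that this passage "
                   "answers. Mimic how a real user types (terse, sometimes "
                   "ambiguous).\n\n" + chunk}],
    )
    return resp.choices[0].message.content.strip()

def recall_at_k(chunks: list[str], k: int = 5) -> float:
    """Step 3: does the source chunk come back when its generated query is run?"""
    client = chromadb.Client()
    col = client.create_collection("genbench")
    ids = [f"doc-{i}" for i in range(len(chunks))]
    col.add(ids=ids, documents=chunks)

    hits = 0
    for doc_id, chunk in zip(ids, chunks):
        query = generate_query(chunk)
        result = col.query(query_texts=[query], n_results=k)
        hits += doc_id in result["ids"][0]
    return hits / len(chunks)
```

Because the generated queries are grounded in your own corpus, swapping the collection's embedding function and re-running the recall measurement gives a domain-specific comparison of embedding models, which is the kind of evaluation the episode argues is more predictive of production performance than public leaderboard scores.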
All episodes
Building Voice AI Agents That Don’t Suck with Kwindla Kramer - #739 (1:13:02)
Distilling Transformers and Diffusion Models for Robust Edge Use Cases with Fatih Porikli - #738 (1:00:29)
Grokking, Generalization Collapse, and the Dynamics of Training Deep Neural Networks with Charles Martin - #734 (1:25:21)
From Prompts to Policies: How RL Builds Better AI Agents with Mahesh Sathiamoorthy - #731 (1:01:25)
How OpenAI Builds AI Agents That Think and Act with Josh Tobin - #730 (1:07:27)
Exploring the Biology of LLMs with Circuit Tracing with Emmanuel Ameisen - #727 (1:34:06)
Waymo's Foundation Model for Autonomous Driving with Drago Anguelov - #725 (1:09:07)