LibreEval: The Largest Open Source Benchmark for RAG Hallucination Detection
For this week's paper read, we actually dive into our own research.
We wanted to create a replicable, evolving dataset that can keep pace with model training so that you always know you're testing with data your model has never seen before. We also saw the prohibitively high cost of running LLM evals at scale, and have used our data to fine-tune a series of SLMs that perform just as well as their base LLM counterparts, but at 1/10 the cost.
So, over the past few weeks, the Arize team generated the largest public dataset of hallucinations, as well as a series of fine-tuned evaluation models.
We talk about what we built, the process we followed, and the bottom-line results.
📃 Read the paper: https://arize.com/llm-hallucination-dataset/
Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.
49 episodes
All episodes
How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings (44:59)
The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets (41:02)
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior (42:14)