Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Deep Papers


Added three years ago

Content provided by Arize AI. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Arize AI or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://fa.player.fm/legal


About one year ago, 39:05

This week’s paper presents a comprehensive study of the performance of various LLMs acting as judges. The researchers use TriviaQA as a benchmark for assessing the objective knowledge reasoning of LLMs and evaluate the judges against human annotations, which they find to have high inter-annotator agreement. The study includes nine judge models and nine exam-taker models, both base and instruction-tuned. They assess the judge models’ alignment across different model sizes, families, and judge prompts to answer questions about the strengths and weaknesses of this paradigm and the potential biases it may hold.
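To make "alignment with human annotations" concrete, here is a minimal sketch of two standard agreement measures: raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. The episode description does not say which metrics the study reports, so treat the choice of metrics, the function names, and the toy labels below as illustrative assumptions.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two raters give the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(counts_a[l] * counts_b[l] for l in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on eight exam-taker answers (not data from the paper).
human = ["correct", "correct", "incorrect", "correct",
         "incorrect", "incorrect", "correct", "correct"]
judge = ["correct", "correct", "correct", "correct",
         "incorrect", "incorrect", "correct", "incorrect"]

agreement = percent_agreement(human, judge)
kappa = cohens_kappa(human, judge)
```

On these toy labels the judge matches the human on 6 of 8 items, yet kappa is noticeably lower than the raw agreement because both raters say "correct" most of the time, so some agreement is expected by chance alone.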

Read it on the blog: https://arize.com/blog/judging-the-judges-llm-as-a-judge/

Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.

56 episodes

#Science #Tech #Math #Business #Arize AI


All episodes

Atropos Health’s Arjun Mukerji, PhD, Explains RWESummary: A Framework and Test for Choosing LLMs to Summarize Real-World Evidence (RWE) Studies

11 hours ago, 26:22

Large language models are increasingly used to turn complex study output into plain-English summaries. But how do we know which models are safest and most reliable for healthcare? In this most recent community AI research paper reading, Arjun Mukerji, PhD – Staff Data Scientist at Atropos Health – walks us through RWESummary, a new benchmark designed to evaluate LLMs on summarizing real-world evidence from structured study output — an important but often under-tested scenario compared to the typical “summarize this PDF” task. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Stan Miasnikov, Distinguished Engineer, AI/ML Architecture, Consumer Experience at Verizon Walks Us Through His New Paper

16 days ago, 48:11

This episode dives into "Category-Theoretic Analysis of Inter-Agent Communication and Mutual Understanding Metric in Recursive Consciousness." The paper extends the Recursive Consciousness framework to analyze communication between agents and the inevitable loss of meaning in translation. We're thrilled to feature the paper's author, Stan Miasnikov, Distinguished Engineer, AI/ML Architecture, Consumer Experience at Verizon, to walk us through the research and its implications.

Small Language Models are the Future of Agentic AI

18 days ago, 31:15

We had the privilege of hosting Peter Belcak, an AI researcher working on the reliability and efficiency of agentic systems at NVIDIA, who walked us through his new paper making the rounds in AI circles, "Small Language Models are the Future of Agentic AI." The paper posits that small language models (SLMs) are sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems, and are therefore the future of agentic AI. The authors ground their argument in the current capabilities of SLMs, the common architectures of agentic systems, and the economics of LM deployment. They further argue that where general-purpose conversational abilities are essential, heterogeneous agentic systems (i.e., agents invoking multiple different models) are the natural choice. They discuss potential barriers to the adoption of SLMs in agentic systems and outline a general LLM-to-SLM agent conversion algorithm.

Watermarking for LLMs and Image Models

8 weeks ago, 42:56

In this AI research paper reading, we dive into "A Watermark for Large Language Models" with the paper's author, John Kirchenbauer. The paper is a timely exploration of techniques for embedding invisible but detectable signals in AI-generated text. These watermarking strategies aim to help mitigate misuse of large language models by making machine-generated content distinguishable from human writing, without sacrificing text quality or requiring access to the model’s internals.
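As a rough illustration of how an "invisible but detectable" text signal can work, the sketch below follows the general green-list idea: the previous token seeds a pseudo-random split of the vocabulary, generation favors the "green" half, and a detector counts green tokens and computes a z-score. The episode description does not spell out the scheme, so everything here is an assumption for illustration: the seeding rule, the `key`, the `gamma` fraction, and the toy "generator" that simply picks the smallest green token id in place of a real language model.

```python
import math
import random


def green_list(prev_token, vocab_size, gamma=0.5, key=42):
    """Pseudo-random 'green' subset of the vocabulary, seeded by the previous token."""
    rng = random.Random(key * 1_000_003 + prev_token)  # hypothetical seeding rule
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def hard_watermarked_tokens(length, vocab_size, gamma=0.5, key=42, start=0):
    """Toy generator: always emits a green token (stand-in for a real model)."""
    tokens = [start]
    while len(tokens) < length:
        tokens.append(min(green_list(tokens[-1], vocab_size, gamma, key)))
    return tokens


def z_score(tokens, vocab_size, gamma=0.5, key=42):
    """How far the green-token count sits above what unwatermarked text would give."""
    scored = len(tokens) - 1  # the first token has no predecessor to seed from
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab_size, gamma, key)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))


watermarked = hard_watermarked_tokens(length=51, vocab_size=100)
score = z_score(watermarked, vocab_size=100)
```

The detector needs only the token ids and the key, not the model that produced the text, which is what "without requiring access to the model's internals" refers to. Text written without the key should land near a z-score of zero on average, while the toy watermarked sequence scores far above it.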

Self-Adapting Language Models: Paper Authors Discuss Implications

11 weeks ago, 31:26

The authors of the new paper Self-Adapting Language Models (SEAL) shared a behind-the-scenes look at their work, motivations, results, and future directions. The paper introduces a novel method that enables large language models (LLMs) to adapt their own weights using self-generated data and training directives, called "self-edits."

The Illusion of Thinking: What the Apple AI Paper Says About LLM Reasoning

13 weeks ago, 30:35

This week we discuss The Illusion of Thinking, a new paper from researchers at Apple that challenges today’s evaluation methods and introduces a new benchmark: synthetic puzzles with controllable complexity and clean logic. Their findings? Large Reasoning Models (LRMs) show surprising failure modes, including a complete collapse on high-complexity tasks and a decline in reasoning effort as problems get harder. Dylan and Parth dive into the paper's findings as well as the debate around it, including a response paper aptly titled "The Illusion of the Illusion of Thinking."

Accurate KV Cache Quantization with Outlier Tokens Tracing

16 weeks ago, 25:11

We discuss Accurate KV Cache Quantization with Outlier Tokens Tracing, a deep dive into improving the efficiency of LLM inference. The authors enhance KV cache quantization, a technique for reducing memory and compute costs during inference, by introducing a method to identify and exclude outlier tokens that hurt quantization accuracy, striking a better balance between efficiency and performance.

Scalable Chain of Thoughts via Elastic Reasoning

18 weeks ago, 28:54

In this week's episode, we talk about Elastic Reasoning, a novel framework designed to enhance the efficiency and scalability of large reasoning models (LRMs) by explicitly separating the reasoning process into two distinct phases: thinking and solution. This separation allows computational budgets to be allocated to each phase independently, addressing the problem of uncontrolled output lengths in real-world deployments with strict resource constraints. Our discussion explores how Elastic Reasoning contributes to more concise and efficient reasoning, even in unconstrained settings, and its implications for deploying LRMs in resource-limited environments.

Sleep-time Compute: Beyond Inference Scaling at Test-time

20 weeks ago, 30:24

What if your LLM could think ahead, preparing answers before questions are even asked? In this week's paper read, we dive into a groundbreaking new paper from researchers at Letta introducing sleep-time compute: a novel technique that lets models do their heavy lifting offline, well before the user query arrives. By predicting likely questions and precomputing key reasoning steps, sleep-time compute dramatically reduces test-time latency and cost without sacrificing performance. We explore new benchmarks (Stateful GSM-Symbolic, Stateful AIME, and the multi-query extension of GSM) that show up to 5x lower compute at inference, 2.5x lower cost per query, and up to 18% higher accuracy when scaled. You’ll also see how this method applies to realistic agent use cases and what makes it most effective. This episode is for anyone who cares about LLM efficiency, scalability, or cutting-edge research.

LibreEval: The Largest Open Source Benchmark for RAG Hallucination Detection

22 weeks ago, 27:19

For this week's paper read, we dive into our own research. We wanted to create a replicable, evolving dataset that can keep pace with model training so that you always know you're testing with data your model has never seen before. We also saw the prohibitively high cost of running LLM evals at scale, and have used our data to fine-tune a series of SLMs that perform just as well as their base LLM counterparts, but at 1/10 the cost. So, over the past few weeks, the Arize team generated the largest public dataset of hallucinations, as well as a series of fine-tuned evaluation models. We talk about what we built, the process we took, and the bottom line results. You can read the recap of LibreEval here. Dive into the research , or sign up to join us next time. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

AI Benchmark Deep Dive: Gemini 2.5 and Humanity's Last Exam

24 weeks ago, 26:11

This week we talk about modern AI benchmarks, taking a close look at Google's recent Gemini 2.5 release and its performance on key evaluations, notably Humanity's Last Exam (HLE). In the session we covered Gemini 2.5's architecture, its advancements in reasoning and multimodality, and its impressive context window. We also talked about how benchmarks like HLE and ARC AGI 2 help us understand the current state and future direction of AI. Join us for the next live recording , or check out the latest AI research . Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Model Context Protocol (MCP)

26 weeks ago, 15:03

We cover Anthropic’s groundbreaking Model Context Protocol (MCP). Though it was released in November 2024, we've been seeing a lot of hype around it lately, and thought it was well worth digging into. Learn how this open standard is revolutionizing AI by enabling seamless integration between LLMs and external data sources, fundamentally transforming them into capable, context-aware agents. We explore the key benefits of MCP, including enhanced context retention across interactions, improved interoperability for agentic workflows, and the development of more capable AI agents that can execute complex tasks in real-world environments.

AI Roundup: DeepSeek’s Big Moves, Claude 3.7, and the Latest Breakthroughs

29 weeks ago, 30:23

This week, we're mixing things up a little bit. Instead of diving deep into a single research paper, we cover the biggest AI developments from the past few weeks. We break down key announcements, including: DeepSeek’s Big Launch Week: A look at FlashMLA (DeepSeek’s new approach to efficient inference) and DeepEP (their enhanced pretraining method). Claude 3.7 & Claude Code: What’s new with Anthropic’s latest model, and what Claude Code brings to the AI coding assistant space. Stay ahead of the curve with this fast-paced recap of the most important AI updates. We'll be back next time with our regularly scheduled programming. Dive into the latest AI research Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

How DeepSeek is Pushing the Boundaries of AI Development

30 weeks ago, 29:54

This week, we dive into DeepSeek. SallyAnn DeLucia, Product Manager at Arize, and Nick Luzio, a Solutions Engineer, break down key insights on a model that has been dominating headlines for its significant breakthrough in inference speed over other models. What’s next for AI (and open source)? From training strategies to real-world performance, here’s what you need to know.

Multiagent Finetuning: A Conversation with Researcher Yilun Du

33 weeks ago, 30:03

We talk to Google DeepMind Senior Research Scientist (and incoming Assistant Professor at Harvard), Yilun Du, about his latest paper, "Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains." This paper introduces a multiagent finetuning framework that enhances the performance and diversity of language models by employing a society of agents with distinct roles, improving feedback mechanisms and overall output quality. The method enables autonomous self-improvement through iterative finetuning, achieving significant performance gains across various reasoning tasks. It's versatile, applicable to both open-source and proprietary LLMs, and can integrate with human-feedback-based methods like RLHF or DPO, paving the way for future advancements in language model development. Read an overview on the blog , watch the full discussion , or join us live for future paper readings . Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Training Large Language Models to Reason in Continuous Latent Space

36 weeks ago, 24:58

LLMs have typically been restricted to reasoning in the "language space," where chain-of-thought (CoT) is used to solve complex reasoning problems. But a new paper argues that language space may not always be the best for reasoning. In this paper read, we cover an exciting new technique from a team at Meta called Chain of Continuous Thought, also known as "Coconut." The paper, "Training Large Language Models to Reason in a Continuous Latent Space," explores the potential of allowing LLMs to reason in an unrestricted latent space instead of being constrained by natural language tokens.

LLMs as Judges: A Comprehensive Survey on LLM-Based Evaluation Methods

39 weeks ago, 28:57

We discuss a major survey of work and research on LLM-as-Judge from the last few years. "LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods" systematically examines the LLMs-as-Judges framework across five dimensions: functionality, methodology, applications, meta-evaluation, and limitations. The survey gives us a bird's-eye view of the framework's advantages and limitations, and of methods for evaluating its effectiveness. Read a breakdown on our blog: https://arize.com/blog/llm-as-judge-survey-paper/

Merge, Ensemble, and Cooperate! A Survey on Collaborative LLM Strategies

41 weeks ago, 28:47

LLMs have revolutionized natural language processing, showcasing remarkable versatility and capabilities. But individual LLMs often exhibit distinct strengths and weaknesses, influenced by differences in their training corpora. This diversity poses a challenge: how can we maximize the efficiency and utility of LLMs? A new paper, "Merge, Ensemble, and Cooperate: A Survey on Collaborative Strategies in the Era of Large Language Models," highlights collaborative strategies to address this challenge. In this week's episode, we summarize key insights from this paper and discuss practical implications of LLM collaboration strategies across three main approaches: merging, ensemble, and cooperation. We also review some new open source models we're excited about. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Agent-as-a-Judge: Evaluate Agents with Agents

43 weeks ago, 24:54

This week, we break down the “Agent-as-a-Judge” framework—a new agent evaluation paradigm that’s kind of like getting robots to grade each other’s homework. Where typical evaluation methods focus solely on outcomes or demand extensive manual work, this approach uses agent systems to evaluate agent systems, offering intermediate feedback throughout the task-solving process. With the power to unlock scalable self-improvement, Agent-as-a-Judge could redefine how we measure and enhance agent performance. Let's get into it! Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Introduction to OpenAI's Realtime API

45 weeks ago, 29:56

We break down OpenAI’s realtime API. Learn how to seamlessly integrate powerful language models into your applications for instant, context-aware responses that drive user engagement. Whether you’re building chatbots, dynamic content tools, or enhancing real-time collaboration, we walk through the API’s capabilities, potential use cases, and best practices for implementation. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Swarm: OpenAI's Experimental Approach to Multi-Agent Systems

47 weeks ago, 46:46

As multi-agent systems grow in importance for fields ranging from customer support to autonomous decision-making, OpenAI has introduced Swarm, an experimental framework that simplifies the process of building and managing these systems. Swarm, a lightweight Python library, is designed for educational purposes, stripping away complex abstractions to reveal the foundational concepts of multi-agent architectures. In this podcast, we explore Swarm’s design, its practical applications, and how it stacks up against other frameworks. Whether you’re new to multi-agent systems or looking to deepen your understanding, Swarm offers a straightforward, hands-on way to get started.

KV Cache Explained

48 weeks ago, 4:19

In this episode, we dive into the intriguing mechanics behind why chat experiences with models like GPT often start slow but then rapidly pick up speed. The key? The KV cache. This essential but under-discussed component enables the seamless and snappy interactions we expect from modern AI systems. Harrison Chu breaks down how the KV cache works, how it relates to the transformer architecture, and why it's crucial for efficient AI responses. By the end of the episode, you'll have a clearer understanding of how top AI products leverage this technology to deliver fast, high-quality user experiences. Tune in for a simplified explanation of attention heads, KQV matrices, and the computational complexities they present. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…
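For readers who want the core idea in code, here is a minimal single-head, single-layer sketch of why a KV cache speeds up decoding: the keys and values of earlier tokens never change, so each step only needs to project the newest token and attend over what is already stored. The weight matrices, inputs, and names (`CachedAttention`, `decode_without_cache`) are hypothetical, and real models add multiple heads, layers, and batching on top of this.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, x):
    return [dot(row, x) for row in matrix]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all given positions."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


class CachedAttention:
    """Keeps keys/values of past tokens so each step projects only the new token."""

    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.keys, self.values = [], []

    def step(self, x):
        self.keys.append(matvec(self.w_k, x))    # one new key per step
        self.values.append(matvec(self.w_v, x))  # one new value per step
        return attend(matvec(self.w_q, x), self.keys, self.values)


def decode_without_cache(w_q, w_k, w_v, inputs):
    """Baseline: re-project the whole prefix at every step (quadratic work)."""
    outputs = []
    for t in range(1, len(inputs) + 1):
        prefix = inputs[:t]
        keys = [matvec(w_k, x) for x in prefix]
        values = [matvec(w_v, x) for x in prefix]
        outputs.append(attend(matvec(w_q, prefix[-1]), keys, values))
    return outputs


# Hypothetical 2-dimensional weights and token embeddings.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.3]]
W_V = [[1.0, 2.0], [0.0, 1.0]]
INPUTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

layer = CachedAttention(W_Q, W_K, W_V)
cached_outputs = [layer.step(x) for x in INPUTS]
uncached_outputs = decode_without_cache(W_Q, W_K, W_V, INPUTS)
```

Both paths produce the same outputs; the cached version trades memory (the stored keys and values) for avoiding repeated projections of the prefix.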

The Shrek Sampler: How Entropy-Based Sampling is Revolutionizing LLMs

49 weeks ago, 3:31

In this byte-sized podcast, Harrison Chu, Director of Engineering at Arize, breaks down the Shrek Sampler. This innovative entropy-based sampling technique, nicknamed the "Shrek Sampler," is transforming LLMs. Harrison talks about how this method improves upon traditional sampling strategies by leveraging entropy and varentropy to produce more dynamic and intelligent responses. Explore its potential to enhance open-source AI models and enable human-like reasoning in smaller language models.
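The two statistics named above can be computed directly from a model's next-token distribution. The sketch below assumes the usual definitions: entropy is the expected surprisal, and varentropy is the variance of the surprisal around that expectation. How a sampler then acts on these numbers (thresholds, branching, resampling) is not described in the episode summary and is left out; the logits are made up.

```python
import math


def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_and_varentropy(probs):
    """Return (entropy, varentropy) in nats for a probability distribution."""
    support = [p for p in probs if p > 0.0]
    surprisals = [-math.log(p) for p in support]
    entropy = sum(p * s for p, s in zip(support, surprisals))
    varentropy = sum(p * (s - entropy) ** 2 for p, s in zip(support, surprisals))
    return entropy, varentropy


# Hypothetical next-token logits: one flat distribution and one confident one.
flat_h, flat_v = entropy_and_varentropy(softmax([0.0, 0.0, 0.0, 0.0]))
peaked_h, peaked_v = entropy_and_varentropy(softmax([8.0, 0.0, 0.0, 0.0]))
```

A flat distribution has high entropy and zero varentropy (every token is equally surprising), while a confident distribution has low entropy; a sampler can use the pair as a rough signal of how uncertain the model is at a given step.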

Google's NotebookLM and the Future of AI-Generated Audio

49 weeks ago, 43:28

This week, Aman Khan and Harrison Chu explore NotebookLM’s unique features, including its ability to generate realistic-sounding podcast episodes from text (but this podcast is very real!). They dive into some technical underpinnings of the product, specifically the SoundStorm model used for generating high-quality audio, and how it leverages a hierarchical residual vector quantization (RVQ) approach to maintain consistency in speaker voice and tone throughout long audio durations. The discussion also touches on the ethical implications of such technology, particularly the potential for hallucinations and the need to balance creative freedom with factual accuracy. We close out with a few hot takes and speculate on the future of AI-generated audio.

Exploring OpenAI's o1-preview and o1-mini

52 weeks ago, 42:02

OpenAI recently released its o1-preview, which they claim outperforms GPT-4o on a number of benchmarks. These models are designed to think more before answering and handle complex tasks better than their other models, especially science and math questions. We take a closer look at their latest crop of o1 models, and we also highlight some research our team did to see how they stack up against Claude Sonnet 3.5--using a real world use case. Read it on our blog: https://arize.com/blog/exploring-openai-o1-preview-and-o1-mini Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Breaking Down Reflection Tuning: Enhancing LLM Performance with Self-Learning

1 year ago, 26:54

A recent announcement on X boasted a tuned model with pretty outstanding performance, and claimed these results were achieved through Reflection Tuning. However, people were unable to reproduce the results. We dive into some recent drama in the AI community as a jumping off point for a discussion about Reflection 70B. In 2023, there was a paper written about Reflection Tuning that this new model (Reflection 70B) draws concepts from. Reflection tuning is an optimization technique where models learn to improve their decision-making processes by “reflecting” on past actions or predictions. This method enables models to iteratively refine their performance by analyzing mistakes and successes, thus improving both accuracy and adaptability over time. By incorporating a feedback loop, reflection tuning can address model weaknesses more dynamically, helping AI systems become more robust in real-world applications where uncertainty or changing environments are prevalent. Dat Ngo (AI Solutions Architect at Arize), talks to Rohan Pandey (Founding Engineer at Reworkd) about Reflection 70B, Reflection Tuning, the recent drama, and the importance of double checking your research. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Composable Interventions for Language Models

1 year ago, 42:35

This week, we're excited to be joined by Kyle O'Brien, Applied Scientist at Microsoft, to discuss his most recent paper, Composable Interventions for Language Models. Kyle and his team present a new framework, composable interventions, that allows for the study of multiple interventions applied sequentially to the same language model. The discussion will cover their key findings from extensive experiments, revealing how different interventions—such as knowledge editing, model compression, and machine unlearning—interact with each other. Read it on the blog: https://arize.com/blog/composable-interventions-for-language-models/ Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

1 year ago, 39:05

Breaking Down Meta's Llama 3 Herd of Models

1 year ago, 44:40

Meta just released Llama 3.1 405B–according to them, it’s “the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation.” Will the latest Llama herd ignite new applications and modeling paradigms like synthetic data generation? Will it enable the improvement and training of smaller models, as well as model distillation? Meta thinks so. We’ll take a look at what they did here, talk about open source, and decide if we want to believe the hype. Read it on the blog: https://arize.com/blog/breaking-down-meta-llama-3/ Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines

1 year ago, 33:57

Chaining language model (LM) calls as composable modules is fueling a new way of programming, but ensuring LMs adhere to important constraints requires heuristic "prompt engineering." The paper this week introduces LM Assertions, a programming construct for expressing computational constraints that LMs should satisfy. The researchers integrated their constructs into the recent DSPy programming model for LMs and present new strategies that allow DSPy to compile programs with LM Assertions into more reliable and accurate systems. They also propose strategies to use assertions at inference time for automatic self-refinement with LMs. They report on four diverse case studies for text generation and find that LM Assertions improve not only compliance with imposed rules but also downstream task performance, passing constraints up to 164% more often and generating up to 37% more higher-quality responses. We discuss this paper with Cyrus Nouroozi, a key DSPy contributor. Read it on the blog: https://arize.com/blog/dspy-assertions-computational-constraints/

RAFT: Adapting Language Model to Domain Specific RAG

1 year ago, 44:01

Adapting LLMs to specialized domains (e.g., recent news, enterprise private documents) is essential, so this week we discuss a paper that asks how to adapt pre-trained LLMs for RAG in specialized domains. SallyAnn DeLucia is joined by Sai Kolasani, researcher at UC Berkeley’s RISE Lab (and Arize AI Intern), to talk about his work on RAFT: Adapting Language Model to Domain Specific RAG. RAFT (Retrieval-Augmented FineTuning) is a training recipe that improves an LLM’s ability to answer questions in "open-book" in-domain settings. Given a question and a set of retrieved documents, the model is trained to ignore documents that don’t help in answering the question (aka distractor documents). This, coupled with RAFT’s chain-of-thought-style response, helps improve the model’s ability to reason. In domain-specific RAG, RAFT consistently improves the model’s performance across PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe to improve pre-trained LLMs for in-domain RAG. Read it on the blog: https://arize.com/blog/raft-adapting-language-model-to-domain-specific-rag/

LLM Interpretability and Sparse Autoencoders: Research from OpenAI and Anthropic 44:00

1 year ago

It’s been an exciting couple of weeks for GenAI! Join us as we discuss the latest research from OpenAI and Anthropic, a significant step forward in understanding how LLMs work, with implications for a deeper understanding of the neural activity of language models. The two recent papers both focus on the sparse autoencoder, an unsupervised approach for extracting interpretable features from an LLM. In "Extracting Concepts from GPT-4," OpenAI researchers propose using k-sparse autoencoders to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. In "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," researchers at Anthropic show that scaling laws can be used to guide the training of sparse autoencoders, among other findings.
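To make "directly control sparsity" concrete, here is a minimal sketch of the top-k activation at the heart of a k-sparse autoencoder: only the k largest latent pre-activations are kept and the rest are zeroed, so the number of active features is fixed by construction. The function name and the toy input are assumptions for illustration; the encoder and decoder weights and the training loop are omitted.

```python
def topk_activation(pre_activations, k):
    """Keep the k largest pre-activations and zero out all others."""
    if k <= 0:
        return [0.0] * len(pre_activations)
    # Indices of the k largest values (earlier index wins on ties).
    top = sorted(
        range(len(pre_activations)),
        key=lambda i: pre_activations[i],
        reverse=True,
    )[:k]
    keep = set(top)
    return [v if i in keep else 0.0 for i, v in enumerate(pre_activations)]


# Toy latent pre-activations; with k=2 exactly two features stay active.
latents = topk_activation([0.1, 2.0, -0.5, 1.5, 0.3], k=2)
```

Because k is set directly, there is no sparsity penalty coefficient to tune, which is the simplification the episode description refers to.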

Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment 48:07

1 year ago

We break down the paper "Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment." Ensuring alignment (aka: making models behave in accordance with human intentions) has become a critical task before deploying LLMs in real-world applications. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications.

Breaking Down EvalGen: Who Validates the Validators? 44:47

1 year ago

Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators often inherit the problems of the LLMs they evaluate, requiring further human validation. This week’s paper explores EvalGen, a mixed-initiative approach to aligning LLM-generated evaluation functions with human preferences. EvalGen assists users in developing both criteria for acceptable LLM outputs and functions to check these standards, ensuring evaluations reflect the users’ own grading standards. Read it on the blog: https://arize.com/blog/breaking-down-evalgen-who-validates-the-validators/ Paper: https://arxiv.org/abs/2404.12272

Keys To Understanding ReAct: Synergizing Reasoning and Acting in Language Models 45:07

1 year ago

This week we explore ReAct, an approach that enhances the reasoning and decision-making capabilities of LLMs by combining step-by-step reasoning with the ability to take actions and gather information from external sources in a unified framework.

Demystifying Chronos: Learning the Language of Time Series 44:40

1 year ago

This week, we’re covering Amazon’s time series model: Chronos. Developing accurate machine-learning-based forecasting models has traditionally required substantial dataset-specific tuning and model customization. Chronos, however, is built on a language model architecture and trained with billions of tokenized time series observations, enabling it to provide accurate zero-shot forecasts matching or exceeding purpose-built models. We dive into time series forecasting, some recent research our team has done, and take a community pulse on what people think of Chronos.

Anthropic Claude 3 43:01

1 year ago

This week we dive into the latest buzz in the AI world: the arrival of Claude 3. Claude 3 is the newest family of models in the LLM space, and Claude 3 Opus (Anthropic's "most intelligent" Claude model) challenges the likes of GPT-4. The Claude 3 family of models, according to Anthropic, "sets new industry benchmarks," and includes "three state-of-the-art models in ascending order of capability: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus." Each of these models "allows users to select the optimal balance of intelligence, speed, and cost." We explore Anthropic’s recent paper and walk through Arize’s latest research comparing Claude 3 to GPT-4. This discussion is relevant to researchers, practitioners, or anyone who's curious about the future of AI. Find the full transcript and more here: https://arize.com/blog/anthropic-claude-3/

Reinforcement Learning in the Era of LLMs 44:49

2 years ago

We’re exploring Reinforcement Learning in the Era of LLMs this week with Claire Longo, Arize’s Head of Customer Success. Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). This week’s paper aims to link the research in conventional RL to RL techniques used in LLM research and to demystify this technique by discussing why, when, and how RL excels.

Sora: OpenAI’s Text-to-Video Generation Model 45:08

2 years ago

This week, we discuss the implications of Text-to-Video Generation and speculate as to the possibilities (and limitations) of this incredible technology with some hot takes. Dat Ngo, ML Solutions Engineer at Arize, is joined by community member and AI Engineer Vibhu Sapra to review OpenAI’s technical report on their Text-To-Video Generation Model: Sora. According to OpenAI, “Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.” At the time of this recording, the model had not been widely released yet, but was becoming available to red teamers to assess risk, and also to artists to receive feedback on how Sora could be helpful for creatives. At the end of our discussion, we also explore EvalCrafter: Benchmarking and Evaluating Large Video Generation Models. This recent paper proposed a new framework and pipeline to exhaustively evaluate the performance of the generated videos, which we look at in light of Sora.

RAG vs Fine-Tuning 39:49

2 years ago

This week, we’re discussing "RAG vs Fine-Tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture." This paper explores a pipeline for fine-tuning and RAG, and presents the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4. The authors propose a pipeline that consists of multiple stages, including extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 for evaluating the results. Overall, the results point to how systems built using LLMs can be adapted to respond and incorporate knowledge across a dimension that is critical for a specific industry, paving the way for further applications of LLMs in other industrial domains.

HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels 36:22

2 years ago

We discuss HyDE: a thrilling zero-shot learning technique that combines GPT-3’s language understanding with contrastive text encoders. HyDE revolutionizes information retrieval and grounding in real-world data by generating hypothetical documents from queries and retrieving similar real-world documents. It outperforms traditional unsupervised retrievers, rivaling fine-tuned retrievers across diverse tasks and languages. This leap in zero-shot learning efficiently retrieves relevant real-world information without task-specific fine-tuning, broadening AI model applicability and effectiveness. Link to transcript and live recording: https://arize.com/blog/hyde-paper-reading-and-discussion/
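The "generate a hypothetical document, then retrieve real documents similar to it" flow can be sketched in a few lines. This is a toy illustration under loud assumptions: `generate_hypothetical_document` is a hypothetical stand-in for an LLM call, and `embed` is a bag-of-words counter rather than the contrastive text encoder HyDE actually relies on.

```python
import math
from collections import Counter


def embed(text):
    # Toy bag-of-words "embedding" (assumption); HyDE uses a contrastive encoder.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def generate_hypothetical_document(query):
    # Hypothetical stand-in for an instruction-following LLM that writes
    # a plausible (possibly inaccurate) answer document for the query.
    canned = {
        "how do plants make food": (
            "plants make food through photosynthesis using sunlight "
            "water and carbon dioxide"
        )
    }
    return canned.get(query.lower(), query)


def hyde_retrieve(query, corpus, top_n=1):
    # Embed the hypothetical document, not the raw query, then rank the corpus.
    hypo = embed(generate_hypothetical_document(query))
    ranked = sorted(corpus, key=lambda d: cosine(hypo, embed(d)), reverse=True)
    return ranked[:top_n]


corpus = [
    "photosynthesis converts sunlight water and carbon dioxide into sugar",
    "the stock market closed higher on friday",
    "how to cook pasta",
]
best = hyde_retrieve("How do plants make food", corpus)
```

Note that the raw query shares no words with the relevant document here; the match comes entirely from the hypothetical document, which is the point of the technique.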

Phi-2 Model 44:29

2 years ago

We dive into Phi-2 and some of the major differences and use cases for a small language model (SLM) versus an LLM. With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size. Find the transcript and live recording: https://arize.com/blog/phi-2-model

A Deep Dive Into Generative's Newest Models: Gemini vs Mistral (Mixtral-8x7B)–Part I 47:50

2 years ago

For the last paper read of the year, Arize CPO & Co-Founder, Aparna Dhinakaran, is joined by Dat Ngo (ML Solutions Architect) and Aman Khan (Product Manager) for an exploration of the new kids on the block: Gemini and Mixtral-8x7B. There's a lot to cover, so this week's paper read is Part I in a series about Mixtral and Gemini. In Part I, we provide some background and context for Mixtral 8x7B from Mistral AI, a high-quality sparse mixture of experts model (SMoE) that outperforms Llama 2 70B on most benchmarks with 6x faster inference. Mixtral also matches or outperforms GPT3.5 on most benchmarks. This open-source model was optimized through supervised fine-tuning and direct preference optimization. Stay tuned for Part II in January, where we'll build on this conversation and discuss Gemini, developed by teams at DeepMind and Google Research. Link to transcript and live recording: https://arize.com/blog/a-deep-dive-into-generatives-newest-models-mistral-mixtral-8x7b/

How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings 44:59

2 years ago

We’re thrilled to be joined by Shuaichen Chang, LLM researcher and author of this week’s paper, to discuss his findings. Shuaichen’s research investigates the impact of prompt construction on the performance of large language models (LLMs) in the text-to-SQL task, particularly focusing on zero-shot, single-domain, and cross-domain settings. Shuaichen and his team explore various strategies for prompt construction, evaluating the influence of database schema, content representation, and prompt length on LLMs’ effectiveness. The findings emphasize the importance of careful consideration in constructing prompts, highlighting the crucial role of table relationships and content, the effectiveness of in-domain demonstration examples, and the significance of prompt length in cross-domain scenarios. Read the blog and watch the discussion: https://arize.com/blog/how-to-prompt-llms-for-text-to-sql-paper-reading/

The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets 41:02

2 years ago

For this paper read, we’re joined by Samuel Marks, Postdoctoral Research Associate at Northeastern University, to discuss his paper, “The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets.” Samuel and his team curated high-quality datasets of true/false statements and used them to study in detail the structure of LLM representations of truth. Overall, they present evidence that language models linearly represent the truth or falsehood of factual statements and also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques. Find the transcript and read more here: https://arize.com/blog/the-geometry-of-truth-emergent-linear-structure-in-llm-representation-of-true-false-datasets-paper-reading/

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning 44:50

2 years ago

In this paper read, we discuss “Towards Monosemanticity: Decomposing Language Models Into Understandable Components,” a paper from Anthropic that addresses the challenge of understanding the inner workings of neural networks, drawing parallels with the complexity of human brain function. It explores the concept of “features” (patterns of neuron activations), which provide a more interpretable way to dissect neural networks. By decomposing a layer of neurons into thousands of features, this approach uncovers hidden model properties that are not evident when examining individual neurons. These features are shown to be more interpretable and consistent than individual neurons, offering the potential to steer model behavior and improve AI safety. Find the transcript and more here: https://arize.com/blog/decomposing-language-models-with-dictionary-learning-paper-reading/

RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models 43:49

2 years ago

We discuss RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting. While researchers have successfully applied LLMs such as ChatGPT to reranking in an information retrieval context, such work has mostly been built on proprietary models hidden behind opaque API endpoints. This approach yields experimental results that are not reproducible and non-deterministic, threatening the veracity of outcomes that build on such shaky foundations. RankVicuna provides access to a fully open-source LLM and associated code infrastructure capable of performing high-quality reranking. Find the transcript and more here: https://arize.com/blog/rankvicuna-paper-reading/

Explaining Grokking Through Circuit Efficiency 36:12

2 years ago

Join Arize Co-Founder & CEO Jason Lopatecki and ML Solutions Engineer Sally-Ann DeLucia as they discuss “Explaining Grokking Through Circuit Efficiency." This paper explores novel predictions about grokking, providing significant evidence in favor of its explanation. Most strikingly, the research conducted in this paper demonstrates two novel and surprising behaviors: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalization to partial rather than perfect test accuracy. Find the transcript and more here: https://arize.com/blog/explaining-grokking-through-circuit-efficiency-paper-reading/

Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior 42:14

2 years ago

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this episode, we discuss the paper “Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior.” This episode is led by SallyAnn Delucia (ML Solutions Engineer, Arize AI) and Amber Roberts (ML Solutions Engineer, Arize AI). The research they discuss highlights that while LLMs have great generalization capabilities, they struggle to effectively predict and optimize communication to get the desired receiver behavior. We’ll explore whether this might be because of a lack of “behavior tokens” in LLM training corpora and how Large Content Behavior Models (LCBMs) might help to solve this issue. Find the transcript and more here: https://arize.com/blog/large-content-and-behavior-models-paper-reading/

Skeleton of Thought: LLMs Can Do Parallel Decoding 43:39

2 years ago

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this paper reading, we explore the ‘Skeleton-of-Thought’ (SoT) approach, aimed at reducing large language model latency while enhancing answer quality. This episode is led by Aparna Dhinakaran (Chief Product Officer, Arize AI) and Sally-Ann Delucia (ML Solutions Engineer, Arize AI), with two of the paper authors: Xuefei Ning, Postdoctoral Researcher at Tsinghua University, and Zinan Lin, Senior Researcher, Microsoft Research. SoT’s innovative methodology guides LLMs to construct answer skeletons before parallel content elaboration, achieving impressive speed-ups of up to 2.39x across 11 models. Don’t miss the opportunity to delve into this human-inspired optimization strategy and its profound implications for efficient and high-quality language generation. Full transcript and more here: https://arize.com/blog/skeleton-of-thought-llms-can-do-parallel-decoding-paper-reading/
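The skeleton-then-parallel-elaboration idea can be sketched as two stages: one call that produces a short numbered outline, followed by one independent call per outline point, issued concurrently. Everything below is an illustrative assumption rather than the authors' implementation: `fake_llm` is a hypothetical stand-in for a real model call, and the prompt strings are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_llm(prompt):
    # Hypothetical stand-in for a real model call (assumption).
    if prompt.startswith("SKELETON:"):
        return "1. Define the term\n2. Give an example"
    return "Expanded: " + prompt.split("POINT:", 1)[1].strip()


def skeleton_of_thought(question, llm=fake_llm):
    # Stage 1: ask for a short numbered skeleton of the answer.
    skeleton = llm("SKELETON: " + question)
    points = [
        line.split(".", 1)[1].strip()
        for line in skeleton.splitlines()
        if "." in line
    ]
    # Stage 2: expand every point in parallel; map() preserves point order.
    with ThreadPoolExecutor() as pool:
        expansions = list(
            pool.map(lambda p: llm(f"QUESTION: {question} POINT: {p}"), points)
        )
    return "\n".join(expansions)


answer = skeleton_of_thought("What is overfitting?")
```

The latency gain comes from stage 2: the expansions do not depend on each other, so wall-clock time is bounded by the slowest single expansion rather than the sum of all of them.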

Llama 2: Open Foundation and Fine-Tuned Chat Models 30:26

2 years ago

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. This episode is led by Aparna Dhinakaran (Chief Product Officer, Arize AI) and Michael Schiff (Chief Technology Officer, Arize AI), as they discuss the paper "Llama 2: Open Foundation and Fine-Tuned Chat Models." In this paper reading, we explore the paper “Developing Llama 2: Pretrained Large Language Models Optimized for Dialogue.” The paper introduces Llama 2, a collection of pretrained and fine-tuned large language models ranging from 7 billion to 70 billion parameters. Their fine-tuned model, Llama 2-Chat, is specifically designed for dialogue use cases and showcases superior performance on various benchmarks. Through human evaluations for helpfulness and safety, Llama 2-Chat emerges as a promising alternative to closed-source models. Discover the approach to fine-tuning and safety improvements, allowing us to foster responsible development and contribute to this rapidly evolving field. Full transcript and more here: https://arize.com/blog/llama-2-open-foundation-and-fine-tuned-chat-models-paper-reading/ Follow AI__Pub on Twitter. To learn more about ML observability, join the Arize AI Slack community or get the latest on our LinkedIn and Twitter.

Lost in the Middle: How Language Models Use Long Contexts 42:28

2 years ago

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. This episode is led by Sally-Ann DeLucia and Amber Roberts, as they discuss the paper "Lost in the Middle: How Language Models Use Long Contexts." This paper examines how well language models utilize longer input contexts. The study focuses on multi-document question answering and key-value retrieval tasks. The researchers find that performance is highest when relevant information is at the beginning or end of the context. Accessing information in the middle of long contexts leads to significant performance degradation. Even explicitly long-context models experience decreased performance as the context length increases. The analysis enhances our understanding and offers new evaluation protocols for future long-context models. Full transcript and more here: https://arize.com/blog/lost-in-the-middle-how-language-models-use-long-contexts-paper-reading/

Orca: Progressive Learning from Complex Explanation Traces of GPT-4 42:03

2 years ago

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this episode, we talk about Orca. Recent research focuses on improving smaller models through imitation learning using outputs from large foundation models (LFMs). Challenges include limited imitation signals, homogeneous training data, and a lack of rigorous evaluation, leading to overestimation of small model capabilities. To address this, Orca is a 13-billion parameter model that learns to imitate LFMs’ reasoning process. Orca leverages rich signals from GPT-4, surpassing state-of-the-art models by over 100% in complex zero-shot reasoning benchmarks. It also shows competitive performance in professional and academic exams without CoT. Learning from step-by-step explanations, generated by humans or advanced AI models, enhances model capabilities and skills. Full transcript and more here: https://arize.com/blog/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4-paper-reading/

Toolformer: Training LLMs To Use Tools 34:06

3 years ago

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this episode, we interview Timo Schick and Thomas Scialom, the Research Scientists at Meta AI behind Toolformer. "Vanilla" language models cannot access information about the external world. But what if we gave language models access to calculators, question-answer search, and other APIs to generate more powerful and accurate output? Further, how do we train such a model? How can we automatically generate a dataset of API-call-annotated text at internet scale, without human labeling? Timo and Thomas give a step-by-step walkthrough of building and training Toolformer, what motivated them to do it, and what we should expect in the next generation of tool-LLM powered products.

Hungry Hungry Hippos - H3 41:53

3 years ago

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this episode, we interview Dan Fu and Tri Dao, inventors of "Hungry Hungry Hippos" (aka "H3"). This language modeling architecture performs comparably to transformers while admitting much longer context length: n log(n) rather than n^2 context scaling, for those technically inclined. Listen to learn about the major ideas and history behind H3, state space models, what makes them special, what products can be built with long-context language models, and hints of Dan and Tri's future (unpublished) research.

ChatGPT and InstructGPT: Aligning Language Models to Human Intention 47:39

3 years ago

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this first episode, we’re joined by Long Ouyang and Ryan Lowe, research scientists at OpenAI and creators of InstructGPT. InstructGPT was one of the first major applications of Reinforcement Learning from Human Feedback to train large language models, and is the precursor to the now-famous ChatGPT. Listen to learn about the major ideas behind InstructGPT and the future of aligning language models to human intention. Read OpenAI's InstructGPT paper here: https://openai.com/blog/instruction-following/
