با برنامه Player FM !
پادکست هایی که ارزش شنیدن دارند
حمایت شده
#034 Rethinking Search Inside Postgres, From Lexemes to BM25
Manage episode 453940597 series 3585930
Many companies use Elastic or OpenSearch and use 10% of the capacity.
They have to build ETL pipelines.
Get data Normalized.
Worry about race conditions.
All in all. At the moment, when you want to do search on top of your transactional data, you are forced to build a distributed systems.
Not anymore.
ParadeDB is building an open-source PostgreSQL extension to enable search within your database.
Today, I am talking to Philippe Noël, the founder and CEO of ParadeDB.
We talk about how they build it, how they integrate into the Postgres Query engines, and how you can build search on top of Postgres.
Key Insights:
Search is changing. We're moving from separate search clusters to search inside databases. Simpler architecture, stronger guarantees, lower costs up to a certain scale.
Most search engines force you to duplicate data. ParadeDB doesn't. You keep data normalized and join at query time. It hooks deep into Postgres's query planner. It doesn't just bolt on search - it lets Postgres optimize search queries alongside SQL ones.
Search indices can work with ACID. ParadeDB's BM25 index keeps Lucene-style components (term frequency, normalization) but adds Postgres metadata for transactions. Search + ACID is possible.
Two storage types matter: inverted indices for text, columnar "fast fields" for analytics. Pick the right one or queries get slow. Integers now default to columnar to prevent common mistakes.
Mixing query engines looks tempting but fails. The team tried using DuckDB and DataFusion inside Postgres. Both were fast but broke ACID compliance. They had to rebuild features natively.
Philippe Noël:
Nicolay Gerold:
00:00 Introduction to ParadeDB 00:53 Building ParadeDB with Rust 01:43 Integrating Search in Postgres 03:04 ParadeDB vs. Elastic 05:48 Technical Deep Dive: Postgres Integration 07:27 Challenges and Solutions 09:35 Transactional Safety and Performance 11:06 Composable Data Systems 15:26 Columnar Storage and Analytics 20:54 Case Study: Alibaba Cloud 21:57 Data Warehouse Context 23:24 Custom Indexing with BM25 24:01 Postgres Indexing Overview 24:17 Fast Fields and Columnar Format 24:52 Lucene Inspiration and Data Storage 26:06 Setting Up and Managing Indexes 27:43 Query Building and Complex Searches 30:21 Scaling and Sharding Strategies 35:27 Query Optimization and Common Mistakes 38:39 Future Developments and Integrations 39:24 Building a Full-Fledged Search Application 42:53 Challenges and Advantages of Using ParadeDB 46:43 Final Thoughts and Recommendations
61 قسمت
Manage episode 453940597 series 3585930
Many companies use Elastic or OpenSearch and use 10% of the capacity.
They have to build ETL pipelines.
Get data Normalized.
Worry about race conditions.
All in all. At the moment, when you want to do search on top of your transactional data, you are forced to build a distributed systems.
Not anymore.
ParadeDB is building an open-source PostgreSQL extension to enable search within your database.
Today, I am talking to Philippe Noël, the founder and CEO of ParadeDB.
We talk about how they build it, how they integrate into the Postgres Query engines, and how you can build search on top of Postgres.
Key Insights:
Search is changing. We're moving from separate search clusters to search inside databases. Simpler architecture, stronger guarantees, lower costs up to a certain scale.
Most search engines force you to duplicate data. ParadeDB doesn't. You keep data normalized and join at query time. It hooks deep into Postgres's query planner. It doesn't just bolt on search - it lets Postgres optimize search queries alongside SQL ones.
Search indices can work with ACID. ParadeDB's BM25 index keeps Lucene-style components (term frequency, normalization) but adds Postgres metadata for transactions. Search + ACID is possible.
Two storage types matter: inverted indices for text, columnar "fast fields" for analytics. Pick the right one or queries get slow. Integers now default to columnar to prevent common mistakes.
Mixing query engines looks tempting but fails. The team tried using DuckDB and DataFusion inside Postgres. Both were fast but broke ACID compliance. They had to rebuild features natively.
Philippe Noël:
Nicolay Gerold:
00:00 Introduction to ParadeDB 00:53 Building ParadeDB with Rust 01:43 Integrating Search in Postgres 03:04 ParadeDB vs. Elastic 05:48 Technical Deep Dive: Postgres Integration 07:27 Challenges and Solutions 09:35 Transactional Safety and Performance 11:06 Composable Data Systems 15:26 Columnar Storage and Analytics 20:54 Case Study: Alibaba Cloud 21:57 Data Warehouse Context 23:24 Custom Indexing with BM25 24:01 Postgres Indexing Overview 24:17 Fast Fields and Columnar Format 24:52 Lucene Inspiration and Data Storage 26:06 Setting Up and Managing Indexes 27:43 Query Building and Complex Searches 30:21 Scaling and Sharding Strategies 35:27 Query Optimization and Common Mistakes 38:39 Future Developments and Integrations 39:24 Building a Full-Fledged Search Application 42:53 Challenges and Advantages of Using ParadeDB 46:43 Final Thoughts and Recommendations
61 قسمت
همه قسمت ها
×
1 Maxime Labonne on Model Merging, AI Trends, and Beyond 1:06:55

1 #053 AI in the Terminal: Enhancing Coding with Warp 1:04:30

1 #051 Build systems that can be debugged at 4am by tired humans with no context 1:05:51

1 #050 Bringing LLMs to Production: Delete Frameworks, Avoid Finetuning, Ship Faster 1:06:57

1 #050 TAKEAWAYS Bringing LLMs to Production: Delete Frameworks, Avoid Finetuning, Ship Faster 11:00

1 #049 BAML: The Programming Language That Turns LLMs into Predictable Functions 1:02:38

1 #049 TAKEAWAYS BAML: The Programming Language That Turns LLMs into Predictable Functions 1:12:34

1 #045 RAG As Two Things - Prompt Engineering and Search 1:02:43

1 #044 Graphs Aren't Just For Specialists Anymore 1:03:34

1 #043 Knowledge Graphs Won't Fix Bad Data 1:10:58

1 #022 The Limits of Embeddings, Out-of-Domain Data, Long Context, Finetuning (and How We're Fixing It) 46:05






1 #042 Temporal RAG, Embracing Time for Smarter, Reliable Knowledge Graphs 1:33:43

1 #041 Context Engineering, How Knowledge Graphs Help LLMs Reason 1:33:34

1 #038 AI-Powered Search, Context Is King, But Your RAG System Ignores Two-Thirds of It 1:14:23

به Player FM خوش آمدید!
Player FM در سراسر وب را برای یافتن پادکست های با کیفیت اسکن می کند تا همین الان لذت ببرید. این بهترین برنامه ی پادکست است که در اندروید، آیفون و وب کار می کند. ثبت نام کنید تا اشتراک های شما در بین دستگاه های مختلف همگام سازی شود.