#009 Modern Data Infrastructure for Analytics and AI, Lakehouses, Open Source Data Stack
Jorrit Sandbrink, a data engineer specializing in open table formats, discusses the advantages of decoupling storage and compute, the importance of choosing the right table format, and strategies for optimizing your data pipelines. This episode is full of practical advice for anyone looking to build a high-performance data analytics platform.
- Lakehouse architecture: A blend of data warehouse and data lake, addressing their shortcomings and providing a unified platform for diverse workloads.
- Key components and decisions: Storage options (cloud or on-prem), table formats (Delta Lake, Iceberg, Apache Hudi), and query engines (Apache Spark, Polars).
- Optimizations: Partitioning strategies, file size considerations, and auto-optimization tools for efficient data layout and query performance.
- Orchestration tools: Airflow, Dagster, Prefect, and their roles in triggering and managing data pipelines.
- Data ingress with dlt: An open-source Python library for building data pipelines, focusing on efficient data extraction and loading.
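As a rough illustration of the partitioning point above, the following Python sketch uses only the standard library. It writes rows into Hive-style `key=value` directories and then reads back a single partition, so the files for other partitions are never opened. The function names (`write_partitioned`, `read_with_pruning`), the sample events, and the use of CSV files are assumptions made for this example and were not discussed in the episode. Real table formats such as Delta Lake or Iceberg track data files through table metadata rather than relying on directory names alone.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample data, invented for this sketch.
EVENTS = [
    {"event_date": "2024-01-01", "user": "a", "amount": "10"},
    {"event_date": "2024-01-02", "user": "b", "amount": "20"},
    {"event_date": "2024-01-02", "user": "c", "amount": "30"},
    {"event_date": "2024-01-03", "user": "d", "amount": "40"},
]


def write_partitioned(root: Path, rows: list[dict], partition_key: str) -> None:
    """Write rows into directories named <partition_key>=<value>, one file each."""
    groups: dict[str, list[dict]] = {}
    for row in rows:
        groups.setdefault(str(row[partition_key]), []).append(row)
    for value, group in groups.items():
        part_dir = root / f"{partition_key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # For simplicity the partition column is also kept inside the file.
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)


def read_with_pruning(root: Path, partition_key: str, wanted: str) -> tuple[list[dict], int]:
    """Read only the files of one partition; return the rows and files opened."""
    part_dir = root / f"{partition_key}={wanted}"
    rows: list[dict] = []
    files_opened = 0
    if not part_dir.is_dir():
        return rows, files_opened
    for path in sorted(part_dir.glob("*.csv")):
        files_opened += 1
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows, files_opened


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_partitioned(root, EVENTS, "event_date")
        rows, opened = read_with_pruning(root, "event_date", "2024-01-02")
        print(f"opened {opened} file(s), read {len(rows)} row(s)")
```

The trade-off hinted at in the file size bullet shows up here too: partitioning on a column with many distinct values produces many small files, which is why the choice of partition key matters.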
Key Takeaways:
- Lakehouses offer a powerful and flexible architecture for modern data analytics.
- Open-source solutions provide cost-effective and customizable alternatives.
- Carefully consider your specific use cases and preferences when choosing tools and components.
- Tools like dlt simplify data ingress and can be easily integrated with serverless functions.
- The data landscape is constantly evolving, so staying informed about new tools and trends is crucial.
Sound Bites
"The lakehouse is sort of a modular setup where you decouple the storage and the compute."
"A lakehouse is an architecture, an architecture for data analytics platforms."
"The most popular table formats for a lakehouse are Delta, Iceberg, and Apache Hudi."
Jorrit Sandbrink:
Nicolay Gerold:
Chapters
00:00 Introduction to the Lakehouse Architecture
03:59 Choosing Storage and Table Formats
06:19 Comparing Compute Engines
21:37 Simplifying Data Ingress
25:01 Building a Preferred Data Stack
lakehouse, data analytics, architecture, storage, table format, query execution engine, document store, DuckDB, Polars, orchestration, Airflow, Dagster, dlt, data ingress, data processing, data storage
61 episodes
All episodes
Maxime Labonne on Model Merging, AI Trends, and Beyond 1:06:55
#053 AI in the Terminal: Enhancing Coding with Warp 1:04:30
#051 Build systems that can be debugged at 4am by tired humans with no context 1:05:51
#050 Bringing LLMs to Production: Delete Frameworks, Avoid Finetuning, Ship Faster 1:06:57
#050 TAKEAWAYS Bringing LLMs to Production: Delete Frameworks, Avoid Finetuning, Ship Faster 11:00
#049 BAML: The Programming Language That Turns LLMs into Predictable Functions 1:02:38
#049 TAKEAWAYS BAML: The Programming Language That Turns LLMs into Predictable Functions 1:12:34
#045 RAG As Two Things - Prompt Engineering and Search 1:02:43
#044 Graphs Aren't Just For Specialists Anymore 1:03:34
#043 Knowledge Graphs Won't Fix Bad Data 1:10:58