#026 Embedding Numbers, Categories, Locations, Images, Text, and The World
Today’s guest is Mór Kapronczay. Mór is the Head of ML at Superlinked. Superlinked is a compute framework for your information retrieval and feature engineering systems that turns anything into embeddings.
When most people think about embeddings, they think of OpenAI's ada.
You just take your text and throw it in there.
But that’s too crude.
OpenAI embeddings are trained on the internet.
But your data set (most likely) is not the internet.
You have different nuances.
And you have more than just text.
So why not use it?
Some highlights:
- Text Embeddings are Not a Magic Bullet
➡️ Pouring everything into a text embedding model won't yield magical results
➡️ Language is lossy - it's a poor compression method for complex information
- Embedding Numerical Data
➡️ Direct number embeddings don't work well for vector search
➡️ Consider projecting number ranges onto a quarter circle
➡️ Apply logarithmic transforms for skewed distributions
- Multi-Modal Embeddings
➡️ Create separate vector parts for different data aspects
➡️ Normalize individual parts
➡️ Weight vector parts based on importance
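The numerical-data highlights above can be sketched in a few lines. This is a minimal illustration, not Superlinked's implementation: the function names `embed_number` and `similarity`, the clamping behaviour, and the choice of `log1p` for the logarithmic transform are all assumptions made for the example.

```python
import math


def embed_number(value, lo, hi, log_scale=False):
    """Map a number in [lo, hi] to a unit vector on a quarter circle.

    Hypothetical helper for illustration only.
    """
    if log_scale:
        # log1p compresses a long right tail; assumes all inputs are >= 0.
        value, lo, hi = math.log1p(value), math.log1p(lo), math.log1p(hi)
    # Clamp to the range, then rescale to [0, 1].
    t = (min(max(value, lo), hi) - lo) / (hi - lo)
    angle = t * math.pi / 2  # 0 at lo, 90 degrees at hi
    return (math.cos(angle), math.sin(angle))


def similarity(a, b):
    """Dot product; equals cosine similarity because inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Numbers that are close in value end up close in angle.
near = similarity(embed_number(10, 0, 100), embed_number(20, 0, 100))
far = similarity(embed_number(10, 0, 100), embed_number(90, 0, 100))
print(near > far)  # True
```

Because every value lands on the same quarter arc, the vector always has length 1 and the similarity between two numbers falls smoothly from 1 (same value) to 0 (opposite ends of the range). With `log_scale=True`, a skewed quantity such as price is spread more evenly along the arc before it is projected.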
A multi-vector approach helps you understand how much each modality or embedding contributes to a result. It also makes your retrieval system easier to tune: instead of fine-tuning your embedding models, you tune your vector database, much as you would a search database (like Elastic).
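One way to picture this is the sketch below, which normalizes each part, applies a weight per part, and concatenates the results. The names `l2_normalize`, `combine`, and `score`, the part names `text` and `price`, and the decision to apply weights only on the query side are assumptions for illustration; the episode notes do not specify an implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(parts, weights=None):
    """Concatenate per-aspect vectors, each normalized and then weighted."""
    weights = weights or {}
    out = []
    for name in sorted(parts):  # fixed order so query and item vectors line up
        w = weights.get(name, 1.0)
        out.extend(w * x for x in l2_normalize(parts[name]))
    return out


def score(query_parts, item_parts, weights):
    """Return the total score and each part's contribution to it."""
    per_part = {}
    for name in sorted(query_parts):
        q = l2_normalize(query_parts[name])
        d = l2_normalize(item_parts[name])
        per_part[name] = weights.get(name, 1.0) * sum(a * b for a, b in zip(q, d))
    return sum(per_part.values()), per_part


# Hypothetical two-part example: a text part and a price part.
query = {"text": [1.0, 0.0], "price": [1.0, 0.0]}
item = {"text": [3.0, 4.0], "price": [0.0, 2.0]}
weights = {"text": 2.0, "price": 0.5}

total, per_part = score(query, item, weights)
print(per_part)
```

In this sketch the stored item vector is `combine(item)` with no weights, and only the query vector is weighted. The dot product of the two concatenated vectors equals `total`, so changing `weights` changes the ranking without re-embedding any item, and `per_part` shows which aspect drove the score.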
Mór Kapronczay
Nicolay Gerold:
00:00 Introduction to Embeddings
00:30 Beyond Text: Expanding Embedding Capabilities
02:09 Challenges and Innovations in Embedding Techniques
03:49 Unified Representations and Vector Computers
05:54 Embedding Complex Data Types
07:21 Recommender Systems and Interaction Data
08:59 Combining and Weighing Embeddings
14:58 Handling Numerical and Categorical Data
20:35 Optimizing Embedding Efficiency
22:46 Dynamic Weighting and Evaluation
24:35 Exploring AB Testing with Embeddings
25:08 Joint vs Separate Embedding Spaces
27:30 Understanding Embedding Dimensions
29:59 Libraries and Frameworks for Embeddings
32:08 Challenges in Embedding Models
33:03 Vector Database Connectors
34:09 Balancing Production and Updates
36:50 Future of Vector Search and Modalities
39:36 Building with Embeddings: Tips and Tricks
42:26 Concluding Thoughts and Next Steps