#036 How AI Can Start Teaching Itself - Synthetic Data Deep Dive
Most LLMs you use today already use synthetic data.
It’s not a thing of the future.
The large labs use a large model (e.g. gpt-4o) to generate training data for a smaller one (gpt-4o-mini).
This lets you build fast, cheap models that do one thing well.
This is “distillation”.
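To make the distillation idea concrete, here is a minimal sketch (not Cohere's or any lab's actual pipeline) of using a large "teacher" model to write question/answer pairs and saving them in a chat-style fine-tuning format for a smaller "student" model. It assumes the openai Python client with an API key in the environment; the topics, prompt, and file name are illustrative placeholders.

```python
# Minimal sketch of distillation-style synthetic data generation:
# a large "teacher" model (gpt-4o) writes question/answer pairs that
# could later be used to fine-tune a smaller "student" model (gpt-4o-mini).
# Topics, prompt, and output format are placeholders, not a real recipe.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

topics = ["invoice processing", "contract clauses", "expense policies"]

records = []
for topic in topics:
    response = client.chat.completions.create(
        model="gpt-4o",   # teacher model
        temperature=0.9,  # higher temperature for more diverse samples
        messages=[
            {
                "role": "user",
                "content": (
                    f"Write one realistic user question about {topic} "
                    "and a concise, correct answer. "
                    'Reply as JSON: {"question": ..., "answer": ...}'
                ),
            }
        ],
        response_format={"type": "json_object"},
    )
    pair = json.loads(response.choices[0].message.content)
    # Store each pair in the chat format used for fine-tuning the student.
    records.append(
        {
            "messages": [
                {"role": "user", "content": pair["question"]},
                {"role": "assistant", "content": pair["answer"]},
            ]
        }
    )

with open("synthetic_train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```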
But the vision for synthetic data is much bigger: enabling people to train specialized AI systems without having a lot of training data.
Today we are talking to Adrien Morisot, an ML engineer at Cohere.
We talk about how Cohere uses synthetic data to train its models, what they have learned along the way, and how you can use synthetic data in your own training.
We are slightly diverging from our search focus, but I wanted to do a deeper dive into synthetic data after our episode with Saahil.
You can use it in a lot of places: generating hard negatives, generating training samples for classifiers and rerankers, and much more.
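As one illustration of the hard-negative idea, here is a short sketch that asks an LLM for passages that look on-topic but do not answer the query, then pairs them with the real answer as labeled triples for reranker training. The query, passage, prompt, and labels are made up for illustration, and the openai client is an assumption, not the tooling discussed in the episode.

```python
# Illustrative sketch: use an LLM to write "hard negatives" -- passages that
# look relevant to a query but do not actually answer it -- as extra
# training samples for a reranker. All data here is placeholder content.
import json
from openai import OpenAI

client = OpenAI()

query = "What is the notice period for terminating the lease early?"
positive = "Either party may terminate the lease with 60 days written notice."

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.8,
    messages=[
        {
            "role": "user",
            "content": (
                "Write 3 short passages that mention leases and notice periods "
                f"but do NOT answer this question: '{query}'. "
                'Reply as JSON: {"passages": ["...", "...", "..."]}'
            ),
        }
    ],
    response_format={"type": "json_object"},
)
hard_negatives = json.loads(response.choices[0].message.content)["passages"]

# Reranker training triples: (query, passage, relevance label)
triples = [(query, positive, 1)] + [(query, neg, 0) for neg in hard_negatives]
print(triples)
```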
Scaling Synthetic Data Creation: https://arxiv.org/abs/2406.20094
Adrien Morisot:
Nicolay Gerold:
00:00 Introduction to Synthetic Data in LLMs
00:18 Distillation and Specialized AI Systems
00:39 Interview with Adrien Morisot
02:00 Early Challenges with Synthetic Data
02:36 Breakthroughs and Rediscovery
03:54 The Evolution of AI and Synthetic Data
07:51 Data Harvesting and Internet Scraping
09:28 Generating Diverse Synthetic Data
15:37 Manual Review and Quality Control
17:28 Automating Data Evaluation
18:54 Fine-Tuning Models with Synthetic Data
21:45 Avoiding Behavioral Cloning
23:47 Ensuring Model Accuracy with Verification
24:31 Adapting Models to Specific Domains
26:41 Challenges in Financial and Legal Domains
28:10 Improving Synthetic Data Sets
30:45 Evaluating Model Performance
32:21 Using LLMs as Judges
35:42 Practical Tips for AI Practitioners
41:26 Synthetic Data in Training Processes
43:51 Quality Control in Synthetic Data
45:41 Domain Adaptation Strategies
46:51 Future of Synthetic Data Generation
47:30 Conclusion and Next Steps