Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228
Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com
Simon Karasik is a proactive and curious ML Engineer with 5 years of experience. Developed & deployed ML models at WEB and Big scale for Ads and Tax.

Huge thank you to Nebius AI for sponsoring this episode. Nebius AI - https://nebius.ai/

MLOps podcast #228 with Simon Karasik, Machine Learning Engineer at Nebius AI, Handling Multi-Terabyte LLM Checkpoints.

// Abstract
The talk provides a gentle introduction to the topic of LLM checkpointing: why is it hard, how big are the checkpoints. It covers various tips and tricks for saving and loading multi-terabyte checkpoints, as well as the selection of cloud storage options for checkpointing.

// Bio
Full-stack Machine Learning Engineer, currently working on infrastructure for LLM training, with previous experience in ML for Ads, Speech, and Tax.

// MLOps Jobs board
https://mlops.pallet.xyz/jobs

// MLOps Swag/Merch
https://mlops-community.myshopify.com/

// Related Links

--------------- ✌️Connect With Us ✌️ -------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Simon on LinkedIn: https://www.linkedin.com/in/simon-karasik/

Timestamps:
[00:00] Simon's preferred beverage
[01:23] Takeaways
[04:22] Simon's tech background
[08:42] Zombie models garbage collection
[10:52] The road to LLMs
[15:09] Trained models Simon worked on
[16:26] LLM Checkpoints
[20:36] Confidence in AI Training
[22:07] Different Checkpoints
[25:06] Checkpoint parts
[29:05] Slurm vs Kubernetes
[30:43] Storage choices lessons
[36:02] Paramount components for setup
[37:13] Argo workflows
[39:49] Kubernetes node troubleshooting
[42:35] Cloud virtual machines have pre-installed mentoring
[45:41] Fine-tuning
[48:16] Storage, networking, and complexity in network design
[50:56] Start simple before advanced; consider model needs.
[53:58] Join us at our first in-person conference on June 25 all about AI Quality