Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228 MLOps.community podcast

Duration: 55:36

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com

Simon Karasik is a proactive and curious ML Engineer with 5 years of experience. He has developed and deployed ML models at web and large scale for Ads and Tax.

Huge thank you to Nebius AI for sponsoring this episode. Nebius AI - https://nebius.ai/

MLOps podcast #228 with Simon Karasik, Machine Learning Engineer at Nebius AI, Handling Multi-Terabyte LLM Checkpoints.

// Abstract

The talk provides a gentle introduction to LLM checkpointing: why it is hard and how big the checkpoints are. It covers tips and tricks for saving and loading multi-terabyte checkpoints, as well as how to select cloud storage options for checkpointing.

// Bio

Full-stack Machine Learning Engineer, currently working on infrastructure for LLM training, with previous experience in ML for Ads, Speech, and Tax.

// MLOps Jobs board

jobs.mlops.community

// MLOps Swag/Merch

https://mlops-community.myshopify.com/

// Related Links

--------------- ✌️Connect With Us ✌️ -------------

Join our Slack community: https://go.mlops.community/slack

Catch all episodes, blogs, newsletters, and more: https://mlops.community/

Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/

Connect with Simon on LinkedIn: https://www.linkedin.com/in/simon-karasik/

Timestamps:

[00:00] Simon's preferred beverage

[01:23] Takeaways

[04:22] Simon's tech background

[08:42] Zombie models garbage collection

[10:52] The road to LLMs

[15:09] Trained models Simon worked on

[16:26] LLM Checkpoints

[20:36] Confidence in AI Training

[22:07] Different Checkpoints

[25:06] Checkpoint parts

[29:05] Slurm vs Kubernetes

[30:43] Storage choices lessons

[36:02] Paramount components for setup

[37:13] Argo workflows

[39:49] Kubernetes node troubleshooting

[42:35] Cloud virtual machines have pre-installed monitoring

[45:41] Fine-tuning

[48:16] Storage, networking, and complexity in network design

[50:56] Start simple before advanced; consider model needs.

[53:58] Join us at our first in-person conference on June 25 all about AI Quality
