Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228

30/04/2024

MLOps.community

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com

Simon Karasik is a proactive and curious ML Engineer with 5 years of experience. He has developed and deployed ML models at web and big scale for Ads and Tax.

Huge thank you to Nebius AI for sponsoring this episode. Nebius AI - https://nebius.ai/

MLOps podcast #228 with Simon Karasik, Machine Learning Engineer at Nebius AI, Handling Multi-Terabyte LLM Checkpoints.

// Abstract
The talk provides a gentle introduction to LLM checkpointing: why it is hard and how big the checkpoints are. It covers various tips and tricks for saving and loading multi-terabyte checkpoints, as well as the selection of cloud storage options for checkpointing.

// Bio
Full-stack Machine Learning Engineer, currently working on infrastructure for LLM training, with previous experience in ML for Ads, Speech, and Tax.

// MLOps Jobs board
https://mlops.pallet.xyz/jobs

// MLOps Swag/Merch
https://mlops-community.myshopify.com/

// Related Links

--------------- ✌️Connect With Us ✌️ -------------
Join our slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Simon on LinkedIn: https://www.linkedin.com/in/simon-karasik/

Timestamps:
[00:00] Simon's preferred beverage
[01:23] Takeaways
[04:22] Simon's tech background
[08:42] Zombie models garbage collection
[10:52] The road to LLMs
[15:09] Trained models Simon worked on
[16:26] LLM Checkpoints
[20:36] Confidence in AI Training
[22:07] Different Checkpoints
[25:06] Checkpoint parts
[29:05] Slurm vs Kubernetes
[30:43] Storage choices lessons
[36:02] Paramount components for setup
[37:13] Argo workflows
[39:49] Kubernetes node troubleshooting
[42:35] Cloud virtual machines have pre-installed monitoring
[45:41] Fine-tuning
[48:16] Storage, networking, and complexity in network design
[50:56] Start simple before advanced; consider model needs.
[53:58] Join us at our first in-person conference on June 25 all about AI Quality

More episodes from "MLOps.community"

Retrieval Augmented Generation

RecSys at Spotify // Sanket Gupta // #232

From A Coding Startup to AI Development in the Enterprise // Ryan Carson // #231

FedML Nexus AI: Your Generative AI Platform at Scale // Salman Avestimehr // #230

What is AI Quality? // Mohamed Elgendy // #229

Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228

Leading Enterprise Data Teams // Sol Rashidi // #227

The Rise of Modern Data Management // Chad Sanderson // #226

Beyond AGI, Can AI Help Save the Planet? // Patrick Beukema // #225

GenAI in Production - Challenges and Trends // Verena Weber // #224