A comprehensive performance analysis of distributed training strategies using TorchTitan on NVIDIA GB200 GPUs, revealing the critical inflection point where tensor parallelism shifts from overhead to necessity when scaling from 8B to 32B parameters
A comprehensive guide to FP8 (8-bit floating point) training for large language models, exploring performance benefits and implementation strategies on NVIDIA B200 GPUs
An experimental analysis of distributed training strategies with PyTorch, comparing DDP, FSDP-Full (ZeRO-3), and FSDP-Grad (ZeRO-2) on H100 GPUs
A comprehensive walkthrough of implementing PyTorch’s Fully Sharded Data Parallel (FSDP2) for efficient distributed training of large language models, with real benchmarks on NVIDIA H100 GPUs
A step-by-step guide to building a production-grade multi-agent AI system for financial analysis using Google Agent Development Kit and Gemini 2.5 Pro
A hands-on exploration of Zero Redundancy Optimizer through implementation and experiments