A comprehensive performance analysis of distributed training strategies using TorchTitan on NVIDIA GB200 GPUs, revealing the critical inflection point where tensor parallelism shifts from overhead to necessity when scaling from 8B to 32B parameters
A comprehensive guide to FP8 (8-bit floating point) training for large language models, exploring performance benefits and implementation strategies on NVIDIA B200 GPUs
An experimental analysis of distributed training strategies with PyTorch, comparing DDP, FSDP-Full (ZeRO-3), and FSDP-Grad (ZeRO-2) on H100 GPUs
A comprehensive walkthrough of implementing PyTorch’s Fully Sharded Data Parallel (FSDP2) for efficient distributed training of large language models, with real benchmarks on NVIDIA H100 GPUs
A step-by-step guide to building a production-grade multi-agent AI system for financial analysis using Google Agent Development Kit and Gemini 2.5 Pro
A hands-on exploration of Zero Redundancy Optimizer through implementation and experiments