
Featured

A closer look at "training" a trillion-parameter model on Frontier

A paper titled "Optimizing Distributed Training on Frontier for Large Language Models" has been making the rounds over the last few weeks, accompanied by sensational taglines claiming the authors trained a trillion-parameter model using only a fraction of the Frontier supercomputer. The superficiality of the discourse around this paper seemed suspicious to me, so in the interest of embracing my new job in AI systems design, I decided to sit down with the manuscript and figure out exactly what the authors did for myself. As a caveat, I am by no means an expert in AI, and I relied on my friend ChatGPT to read the paper with me and answer questions I had along the way. It is from that perspective that I compiled the notes that follow, and I'm sharing them in case there are other folks like me who are interested in understanding how large-scale training maps to HPC resources but don't understand all the AI jargon.

Latest Posts

SC'23 Recap

SC'22 Recap