Sunday, March 12, 2017

Reviewing the state of the art of burst buffers

If you're interested in burst buffers and happen to be a student, please reach out and contact me! We have an internship opportunity in performance analysis of our 1.8 PB/1.5 TB/sec burst buffer for students of all levels of experience.
Just over two years ago I attended my first DOE workshop as a guest representative of the NSF supercomputing centers, and I wrote a post that summarized my key observations of how the DOE was approaching the increase in data-intensive computing problems.  At the time, the most significant thrusts seemed to be
  1. understanding scientific workflows to keep pace with the need to process data in complex ways
  2. deploying burst buffers to overcome the performance limitations of spinning disk relative to the increasing scale of simulation data
  3. developing methods and processes to curate scientific data
Here we are now two years later, and these issues still take center stage in the discussion surrounding the future of data-intensive computing.  The DOE has made significant progress in defining its path forward in these areas though, and both burst buffers and scientific workflows now have a much clearer role on DOE's HPC roadmap.  Burst buffers in particular are attracting a lot of attention now that they are commercially available, so in the interest of updating some of the incorrect or incomplete thoughts I wrote about two years ago, I thought I'd review the current state of the art in burst buffers in HPC.

Two years ago I observed that there were two major camps in burst buffer implementations: one more tightly integrated with the compute side of the platform, relying on explicit allocation and use, and another more closely integrated with the storage subsystem, acting as a transparent I/O accelerator.  Shortly after I made that observation, though, Oak Ridge and Lawrence Livermore announced their GPU-based leadership systems, Summit and Sierra, which would feature an altogether new burst buffer design built around on-node nonvolatile memory.

This CORAL announcement, combined with the deployment of production, large-scale burst buffers at NERSC, Los Alamos, and KAUST, has led me to re-think my taxonomy of burst buffers.  Specifically, it really is important to divide burst buffers into their hardware architectures and software usage modes; different burst buffer architectures can provide the same usage modalities to users, and different modalities can be supported by the same architecture.

For the sake of laying it all out, let's walk through the taxonomy of burst buffer hardware architectures and burst buffer software usage modalities.

Burst Buffer Hardware Architectures

First, consider your typical medium- or large-scale HPC system architecture without a burst buffer:


In this design, you have

  • Compute Nodes (CN), which might be commodity whitebox nodes like the Dell C6320 nodes in SDSC's Comet system or Cray XC compute blades
  • I/O Nodes (ION), which might be commodity Lustre LNET routers (commodity clusters), Cray DVS nodes (Cray XC), or CIOD forwarders (Blue Gene)
  • Storage Nodes (SN), which might be Lustre Object Storage Servers (OSSes) or GPFS Network Shared Disk (NSD) servers
  • The compute fabric (blue lines), which is typically Mellanox InfiniBand, Intel OmniPath, or Cray Aries
  • The storage fabric (red lines), which is typically Mellanox InfiniBand or Intel OmniPath

Given all these parts, there are a bunch of different places you can stick flash devices to create a burst buffer.  For example...

ION-attached Flash

You can put SSDs inside IO nodes, resulting in an ION-attached flash architecture that looks like this:


Gordon, which was the first large-scale deployment of what one could call a burst buffer, had this architecture.  The flash was presented to the compute nodes as block devices using iSCSI, and a compute node could have anywhere between zero and sixteen SSDs mounted to it entirely via software.  More recently, the Tianhe-2 system at NUDT also deployed this architecture and exposes the flash to user applications via their H2FS middleware.

Fabric-attached Flash

A very similar architecture is to add specific burst buffer nodes on the compute fabric that don't route I/O, resulting in a fabric-attached flash architecture:

Like the ION-attached flash design of Gordon, the flash is still embedded within the compute fabric and is logically closer to the compute nodes than the storage nodes.  Cray's DataWarp solution uses this architecture.

Because the flash is still on the compute fabric, this design is very similar to ION-attached flash, and the decision to choose one over the other is mostly non-technical.  It can be more economical to embed flash directly in I/O nodes if those nodes have enough peripheral ports (or physical space!) to support the NICs for the compute fabric, the NICs for the storage fabric, and the flash devices.  However, as flash technology moves away from being attached via SAS and towards being directly attached to PCIe, it becomes more difficult to stuff that many high-performance peripherals into a single box without unbalancing something.  As such, it is likely that fabric-attached flash architectures will replace ION-attached flash going forward.

Fortunately, any burst buffer software designed for ION-attached flash designs will also probably work on fabric-attached flash designs just fine.  The only difference is that the burst buffer software will no longer have to compete against the I/O routing software for on-node resources like memory or PCIe bandwidth.

CN-attached Flash

A very different approach to building burst buffers is to attach a flash device to every single compute node in the system, resulting in a CN-attached flash architecture:


This design is neither superior nor inferior to the ION/fabric-attached flash design.  The advantages it has over ION/fabric-attached flash include

  • Extremely high peak I/O performance - The peak performance scales linearly with the number of compute nodes, so the larger your job, the more performance your job can have.
  • Very low variation in I/O performance - Because each compute node has direct access to its locally attached SSD, contention on the compute fabric doesn't affect I/O performance.
However, these advantages come at a cost:
  • Limited support for shared-file I/O -  Because each compute node doesn't share its SSD with other compute nodes, having many compute nodes write to a single shared file is not a straightforward process.  The solutions to this issue range from simply not supporting such N-1 style I/O at all (the default case), to relying on I/O middleware like the SCR library to manage data distribution, to relying on sophisticated I/O services like Intel CPPR that essentially journal all I/O to the node-local flash and flush it to the parallel file system asynchronously.
  • Data movement outside of jobs becomes difficult - Burst buffers allow users to stage data into the flash before their job starts and stage data back to the parallel file system after their job ends.  However in CN-attached flash, this staging will occur while someone else's job might be using the node.  This can cause interference, capacity contention, or bandwidth contention.  Furthermore, it becomes very difficult to persist data on a burst buffer allocation across multiple jobs without flushing and re-staging it.
  • Node failures become more problematic - The point of writing out a checkpoint file is to allow you to restart a job in case one of its nodes fails.  If your checkpoint file is actually stored on one of the nodes that failed, though, the whole checkpoint is lost along with that node.  Thus, it becomes critical to flush checkpoint files to the parallel file system as quickly as possible so that your checkpoint is safe if a node fails.  Realistically though, most application failures are not caused by node failures; a study by LLNL found that 85% of job interrupts do not take out the whole node.
  • Performance cannot be decoupled from job size - Since you get more SSDs by requesting more compute nodes, there is no way to request only a few nodes and a lot of SSDs.  While this is less an issue for extremely large HPC jobs whose I/O volumes typically scale linearly with the number of compute nodes, data-intensive applications often have to read and write large volumes of data but cannot effectively use a huge number of compute nodes.
If you take a step back and look at what these strengths and weaknesses play to, you might be able to envision what sort of supercomputer design might be best suited for this type of architecture:
  • Relatively low node count, so that you aren't buying way more SSD capacity or performance than you can realistically use given the bandwidth of the parallel file system to which the SSDs must eventually flush
  • Relatively beefy compute nodes, so that the low node count doesn't hurt you and so that you can tolerate running I/O services to facilitate the asynchronous staging of data and middleware to support shared-file I/O
  • Relatively beefy network injection bandwidth, so that asynchronous stage in/out doesn't severely impact the MPI performance of the jobs that run before/after yours
There are also specific application workloads that are better suited to this CN-attached flash design:
  • Relatively large job sizes on average, so that applications routinely use enough compute nodes to get enough I/O bandwidth.  Small jobs may be better off using the parallel file system directly, since parallel file systems can usually deliver more I/O bandwidth to smaller compute node counts.
  • Relatively low diversity of applications, so that any applications that rely on shared-file I/O (which is not well supported by CN-attached flash, as we'll discuss later) can either be converted into using the necessary I/O middleware like SCR, or can be restructured to use only file-per-process or not rely on any strong consistency semantics.
And indeed, if you look at the systems that are planning on deploying this type of CN-attached flash burst buffer in the near future, they all fit this mold.  In particular, the CORAL Summit and Sierra systems will be deploying these burst buffers at extreme scale, and before them, Tokyo Tech's Tsubame 3.0 will as well.  All of these systems derive the majority of their performance from GPUs, leaving the CPUs with the capacity to implement more functionality of their burst buffers in software on the CNs.

Storage Fabric-attached Flash

The last notable burst buffer architecture involves attaching the flash on the storage fabric rather than the compute fabric, resulting in SF-attached flash:


This is not a terribly popular design because
  1. it moves the flash far away from the compute node, which is counterproductive to low latency
  2. it requires that the I/O forwarding layer (the IONs) support enough bandwidth to saturate the burst buffer, which can get expensive
However, for those HPC systems with custom compute fabrics that are not amenable to adding third-party burst buffers, this may be the only possible architecture.  For example, the Argonne Leadership Computing Facility has deployed a high-performance GPFS file system as a burst buffer alongside their high-capacity GPFS file system in this fashion because it is impractical to integrate flash into their Blue Gene/Q's proprietary compute fabric.  Similarly, sites that deploy DDN's Infinite Memory Engine burst buffer solution on systems with proprietary compute fabrics (e.g., Cray Aries on Cray XC) will have to deploy their burst buffer nodes on the storage fabric.

Burst Buffer Software

Ultimately, all of the different burst buffer architectures still amount to sticking a bunch of SSDs into a supercomputing system, and if that were all it took to make a burst buffer, burst buffers wouldn't be very interesting.  Thus, there is another half of the burst buffer ecosystem: the software and middleware that transform a pile of flash into an I/O layer that applications can actually use productively.

In the absolute simplest case, this software layer can just be an XFS file system atop RAIDed SSDs that is presented to user applications as node-local storage.  And indeed, this is what SDSC's Gordon system did; for many workloads such as file-per-process I/O, it is a suitable way to get great performance.  However, as commercial vendors have gotten into the burst buffer game, they have all started using this software layer to differentiate their burst buffer solutions from their competitors'.  As a result, modern burst buffers now have a lot of functionality that allows users to do interesting new things with their I/O.

Because this burst buffer differentiation happens entirely in software, it should be no surprise that these burst buffer software solutions look a lot like the software-defined storage products being sold in the enterprise cloud space.  The difference is that burst buffer software can be optimized specifically for HPC workloads and technologies, resulting in much nicer, more accessible ways for HPC applications to use them.

Common Software Features

Before getting too far, it may be helpful to enumerate the features common to many burst buffer software solutions:
  • Stage-in and stage-out - Burst buffers are designed to have a job's input data already available on the burst buffer the moment the job starts, and to flush output data back to the parallel file system after the job ends.  To make this happen, the burst buffer service must give users a way to indicate, when they submit their job, which files they want staged into the burst buffer and which files they want flushed back to the file system after the job ends.
  • Background data movement - Burst buffers are also not designed to be long-term storage, so their reliability can be lower than the underlying parallel file system.  As such, users must also have a way to tell the burst buffer to flush intermediate data back to the parallel file system while the job is still running.  This should happen using server-to-server copying that doesn't involve the compute node at all.
  • POSIX I/O API compatibility - The vast majority of HPC applications rely on the POSIX I/O API (open/close/read/write) to perform I/O, and most job scripts rely on tools developed for the POSIX I/O API (cd, ls, cp, mkdir).  As such, all burst buffers provide the ability to interact with data through the POSIX I/O API so that they look like regular old file systems to user applications.  That said, the POSIX I/O semantics might not be fully supported; as will be described below, you may get an I/O error if you try to perform I/O in a fashion that is not supported by the burst buffer.
With all this being said, there are still a variety of ways in which these core features can be implemented into a complete burst buffer software solution.  Specifically, burst buffers can be accessed through one of several different modes, and each mode provides a different balance of peak performance and usability.

Transparent Caching Mode

The most user-friendly burst buffer mode, which I call transparent caching mode, simply uses the flash as a giant cache for the parallel file system.  Applications see the burst buffer as a mount point on their compute nodes; this mount point mirrors the contents of the parallel file system, and any changes I make to one will appear on the other.  For example,

$ ls /mnt/lustre/glock
bin  project1  project2  public_html  src

### Burst buffer mount point contains the same stuff as Lustre
$ ls /mnt/burstbuffer/glock
bin  project1  project2  public_html  src

### Create a file on Lustre...
$ touch /mnt/lustre/glock/hello.txt

$ ls /mnt/lustre/glock
bin  hello.txt  project1  project2  public_html  src

### ...and it automatically appears on the burst buffer.
$ ls /mnt/burstbuffer/glock
bin  hello.txt  project1  project2  public_html  src

### However its contents are probably not on the burst buffer's flash
### yet since we haven't read its contents through the burst buffer
### mount point, which is what would cause it to be cached

However, if I access a file through the burst buffer mount (/mnt/burstbuffer/glock) rather than the parallel file system mount (/mnt/lustre/glock),
  1. if hello.txt is already cached on the burst buffer's SSDs, it will be read directly from flash
  2. if hello.txt is not already cached on the SSDs, the burst buffer will read it from the parallel file system, cache its contents on the SSDs, and return its contents to me
Similarly, if I write to hello.txt via the burst buffer mount, my data will be cached to the SSDs and will not immediately appear on the parallel file system.  It will eventually flush out to the parallel file system, or I could tell the burst buffer service to explicitly flush it myself.
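In effect, the behavior described above is just read-through, write-back caching.  Here is a toy sketch of that logic in Python (my own illustration, not any vendor's implementation), with dictionaries standing in for the parallel file system and the flash:

class TransparentCache:
    """Toy read-through/write-back cache."""
    def __init__(self, parallel_fs):
        self.pfs = parallel_fs   # stands in for Lustre
        self.flash = {}          # stands in for the burst buffer's SSDs
        self.dirty = set()       # files written to flash but not yet flushed

    def read(self, name):
        if name not in self.flash:        # cache miss: fetch from the parallel file system
            self.flash[name] = self.pfs[name]
        return self.flash[name]           # cache hit: served straight from flash

    def write(self, name, data):
        self.flash[name] = data           # lands on flash immediately...
        self.dirty.add(name)              # ...and reaches the parallel file system only on flush

    def flush(self):                      # triggered eventually, or explicitly by the user
        for name in self.dirty:
            self.pfs[name] = self.flash[name]
        self.dirty.clear()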

This transparent caching mode is by far the easiest to use, since it looks exactly like the parallel file system for all intents and purposes.  However, if you know that your application will never read any data more than once, a cache that only populates itself as data is read is far less useful in this fully transparent mode.  As such, burst buffers that implement this mode also provide proprietary APIs that allow you to stage in data, control the caching heuristics, and explicitly flush data from the flash to the parallel file system.

DDN's Infinite Memory Engine and Cray's DataWarp both implement this transparent caching mode, and, in principle, it can be implemented on any of the burst buffer architectures outlined above.

Private PFS Mode

Although the transparent caching mode is the easiest to use, it doesn't give users a lot of control over what data does or doesn't need to be staged into the burst buffer.  Another access mode involves creating a private parallel file system on-demand for jobs, which I will call private PFS mode.  It provides a new parallel file system that is only mounted on your job's compute nodes, and this mount point contains only the data you explicitly copy to it:

### Burst buffer mount point is empty; we haven't put anything there,
### and this file system is private to my job
$ ls /mnt/burstbuffer

### Create a file on the burst buffer file system...
$ dd if=/dev/urandom of=/mnt/burstbuffer/mydata.bin bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.776115 s, 13.5 MB/s

### ...it appears on the burst buffer file system...
$ ls -l /mnt/burstbuffer
-rw-r----- 1 glock glock 10485760 Jan  1 00:00 mydata.bin

### ...and Lustre remains entirely unaffected
$ ls /mnt/lustre/glock
bin  project1  project2  public_html  src

This is a little more complicated than transparent caching mode because you must now manage two file system namespaces: the parallel file system and your private burst buffer file system.  However this gives you the option to target your I/O to one or the other, so that a tiny input deck can stay on Lustre while your checkpoints are written out to the burst buffer file system.

In addition, the burst buffer private file system is strongly consistent; as soon as you write data out to it, you can read that data back from any other node in your compute job.  While this is true of transparent caching mode if you always access your data through the burst buffer mount point, you can run into trouble if you accidentally try to read a file from the original parallel file system mount point after writing out to the burst buffer mount.  Since private PFS mode provides a completely different file system and namespace, it's a bit harder to make this mistake.

Cray's DataWarp implements private PFS mode, and the Tsubame 3.0 burst buffer will be implementing private PFS mode using on-demand BeeGFS.  This mode is most easily implemented on fabric/ION-attached flash architectures, but Tsubame 3.0 is demonstrating that it can also be done on CN-attached flash.

Log-structured/Journaling Mode

As probably the least user-friendly but highest-performing use mode, log-structured (or journaling) mode burst buffers present themselves to users like a file system, but they do not support the full extent of file system features.  Under the hood, writes are saved to the flash not as files, but as records that contain a timestamp, the data to be written, and the location in the file to which the data should be written.  These logs are continually appended as the application performs its writes, and when it comes time to flush the data to the parallel file system, the logs are replayed to effectively reconstruct the file that the application was trying to write.

This can perform extremely well since even random I/O winds up being restructured as sequentially appended I/O.  Furthermore, there can be as many logs as there are writers; this allows writes to happen with zero lock contention, since contended writes are resolved when the data is replayed and flushed.

Unfortunately, log-structured writes make reading very difficult, since a reader can no longer seek directly to a file offset to find the data it needs.  Instead, the log needs to be replayed to some degree, effectively forcing a flush to occur.  Furthermore, if the logs are spread out across different logical flash domains (as would happen in CN-attached flash architectures), read-back may require the logs to be centrally collected before the replay can happen, or it may require inter-node communication to coordinate who owns the different bytes that the application needs to read.
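To make the mechanics concrete, here is a minimal sketch in Python (my own illustration, not any vendor's implementation) of appending write records to per-writer logs and replaying them to rebuild the file:

from collections import namedtuple

# Each write becomes an appended record rather than an in-place update of the file.
WriteRecord = namedtuple("WriteRecord", ["timestamp", "offset", "data"])

def log_write(log, timestamp, offset, data):
    log.append(WriteRecord(timestamp, offset, data))   # sequential append, no locking

def replay(logs, file_size):
    """Rebuild the file by replaying every writer's log in timestamp order."""
    contents = bytearray(file_size)
    all_records = [record for log in logs for record in log]
    for record in sorted(all_records, key=lambda r: r.timestamp):
        contents[record.offset:record.offset + len(record.data)] = record.data
    return bytes(contents)

Conflicting writes to the same offset simply resolve to whichever record carries the latest timestamp at replay time.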

What this amounts to is functionality that may present itself like a private parallel file system burst buffer, but behaves very differently on reads and writes.  For example, attempting to read data that lives in another writer's log might generate an I/O error, so applications (or I/O middleware) probably need to have very well-behaved I/O to get the full performance benefits of this mode.  Most extreme-scale HPC applications already do this, so log-structured/journaling mode is a very attractive approach for very large applications that rely on extreme write performance to checkpoint their progress.

Log-structured/journaling mode is well suited for CN-attached flash since logs do not need to live on a file system that presents a single shared namespace across all compute nodes.  In practice, the IBM CORAL systems will probably provide log-structured/journaling mode through IBM's burst buffer software.  Oak Ridge National Laboratory has also demonstrated a log-structured burst buffer system called BurstMem on a fabric-attached flash architecture.  Intel's CPPR library, to be deployed with the Argonne Aurora system, may also implement this functionality atop the 3D XPoint to be embedded in each compute node.

Other Modes

The above three modes are not the only ones that burst buffers may implement, and some burst buffers support more than one of the above modes.  For example, Cray's DataWarp, in addition to supporting private PFS and transparent caching modes, also has a swap mode that allows compute nodes to use the flash as swap space to prevent hard failures for data analysis applications that consume non-deterministic amounts of memory.  In addition, Intel's CPPR library is targeting byte-addressable nonvolatile memory which would expose a load/store interface, rather than the typical POSIX open/write/read/close interface, to applications.

Outlook

Burst buffers, practically speaking, remain in their infancy, and there is a lot of room for the landscape I've outlined here to change.  For example, the common software features I highlighted (staging, background data movement, and POSIX API support) are still largely implemented via proprietary, non-standard APIs at present.  There is effort to get burst buffer vendors to agree to a common API, and as this process proceeds, features may appear or disappear as customers define what is and isn't a worthwhile differentiating feature.

On the hardware front, the burst buffer ecosystem is also in flux.  ION-attached flash is where burst buffers began, but as discussed above, they are likely to be replaced by dedicated fabric-attached flash servers.  In addition, the emergence of storage-class memory (that is, byte-addressable nonvolatile memory) will also add a new dimension to burst buffers that may make one architecture the clear winner over the others.  At present though, both fabric-attached and CN-attached burst buffers have their strengths and weaknesses, and neither is at risk of disappearing in the next five years.

As more extreme-scale systems begin to hit the floor and users figure out what does and doesn't work across the diversity of burst buffer hardware and software features, the picture is certain to become clearer.  Once that happens, I'll be sure to post another update.

Sunday, October 9, 2016

Learning electronics with roulette, datasheets, and Raspberry Pi

I've had a few electronics kits kicking around for years now that I'd never sat down and put together.  At a glance, these kits all seemed like they were designed to be soldering practice that resulted in a fun gadget at the end of the day.  All the magical functionality was always hidden in black-box integrated circuits, so I could never figure out exactly how the circuits worked, and this frustration (combined with my poor soldering abilities) left me without much desire to do much with them.

Very recently though, it occurred to me that we now live in an age where the datasheets for many of these black-box chips are online, and it's now actually possible to pull back the curtain on what they're doing under the hood.  As it turns out, most of them are a lot simpler than I would have guessed.  And after digging through my old kits, I also realized that they are often just simple IC components connected in clever ways to perform their magic.

With this epiphany and newfound confidence understanding how these kits work, I set out to learn something new about electronics.  And given that my background in electronics has been limited to a week of electronics camp at age 13 and an 8 AM physics class in college, I figured my odds at accomplishing this were pretty good.

Velleman MK152 Spinning LED Wheel

This endeavor started with a Spinning LED Wheel kit by a Belgian company called Velleman.  It's a simple LED roulette wheel circuit where, upon pressing a button, a light spins around a ring of ten LEDs very quickly at first, then slows and eventually stops on a single "winning" LED.  The kit comes with a couple resistors, capacitors, LEDs, and two DIP chips, and is really inexpensive.


It also comes with a printed circuit board and battery pack which are supposed to be all soldered together, but I wanted to assemble this all on a breadboard for a couple of reasons:
  1. It would be a lot easier to experiment: changing resistors and capacitors to see what would happen would help me understand which circuit components are the most important.
  2. It would be easier to rebuild and improve the circuit with additional features later on.
  3. It would be easier to interface with my Raspberry Pi for debugging and improvement.
  4. It's a lot harder to screw up assembly when a soldering iron is not required!
So, with a trusty $3 breadboard and a handful of jumper wires, I set out to reproduce the circuit diagram that ships with the Velleman MK152 kit:


The biggest mystery of this kit is the pair of DIP chips it includes, since they are, at a glance, little black boxes:


The MK152 kit documentation includes no mention of what they actually do, making it really difficult to figure out what the circuit does with only the contents of the kit.  However, Googling their part numbers brings up a wealth of information about these chips, and it turns out that these two DIPs are a set of inverters and a decade counter:
  • The CD4069UBE chip is just six NOT gates (inverters) stuffed into a DIP package.
  • The CD4017BE chip is a decade counter, which is a neat component that has ten numbered output pins (called Q0 through Q9) and a single input pin (called CLK).  It determines which of the ten output pins is lit up at any given time using the following logic:
    • When the input pin (CLK) is first lit up, the first output pin (Q0) lights up.
    • The next time CLK is bounced (turned off, then turned on again), the first output pin (Q0) turns off and the second pin (Q1) turns on.
    • This cycle repeats every time CLK is bounced, and it wraps back around to Q0 after the tenth pin (Q9) is lit up.
After understanding how these two ICs worked, building the kit's circuit on a breadboard seems a lot less daunting.  Because I only had long braided jumper wires though, my final product looked a bit ugly:


But it worked!



Understanding the Circuit

Not having any practical experience with electronics, I had a hard time understanding exactly how this circuit was working.  The CD4017BE IC is certainly central to this circuit's operation, and I understood that every time the voltage going into the CLK pin went up and back down, a new LED would light up.  I also understood that resistor-capacitor series have time-dependent behavior that can be used to make voltages go low and high in a very predictable manner, which could drive the CLK pin.  But how do these concepts translate into a wheel that spins, slows down, and eventually stops?

Aside from the CD4017BE decade counter, this circuit really has two distinct sections.  The first section handles the input:


Pressing the switch (SW1) charges up the 47 µF capacitor (C3) and starts the roulette wheel going.  From here, I figured out that
  • Since the C3 capacitor is the biggest one in the kit, it made sense that this is probably what drives the entire circuit after the switch is opened and the battery pack is no longer connected.  And indeed, replacing this C3 capacitor with one of smaller capacitance causes the roulette wheel to spin for a much shorter period of time before shutting off.
  • The combination of the 1 µF capacitor (C2) and the 100 KΩ resistor (R4) looks a lot like an RC series that can be used as a timer to drive the other half of the circuit (see the quick calculation after this list).  And again, changing the capacitance of this capacitor changes the speed at which the LED wheel "spins."
  • The NOT gates (inverters) are directly connected to the C3 capacitor driving the whole circuit, so they are probably acting as a shutoff mechanism.  Once the C3 capacitor discharges enough (effectively turning "off"), everything on the other side of the inverters (IC1F, IC1B, IC1C) switches on.  Since there is nothing but our LEDs north of these gates, this reversal of polarity would cause the LEDs to shut off for good.
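As a rough sanity check on that R4/C2 timer, the characteristic time of an RC series is just the product of the resistance and the capacitance.  A quick calculation with the kit's component values:

# Time constant (tau = R * C) of the R4/C2 pair
R4 = 100e3             # 100 KΩ
C2 = 1e-6              # 1 µF
tau = R4 * C2
print(tau, "seconds")  # 0.1 s -- the rough timescale on which C2 charges and discharges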
The other half of the circuit is what drives the actual CLK signal that causes the LEDs to light up in order.  It effectively converts the analog signal coming from our RC series into a digital signal that drives the CD4017BE decade counter.


This was (and still is) a bit harder for me to figure out since the subtleties of how analog signals interact with digital components like the NOT gates aren't very clear to me.  That being said, I figured out that
  • The IC1A inverter is what holds the CLK pin high (on) when the rest of the circuit is completely discharged.  This means that full CLK signals (going fully on, then fully off again) are driven by this IC1A gate being momentarily shut off, since its default state is high (on).
  • The 10 nF capacitor (C1) is a bit of a red herring.  The CD4069UBE datasheet recommends conditioning power using small capacitors like this, and that's exactly what this component does--removing it doesn't actually affect the rest of the circuit under normal conditions.
  • The combination of the 3.3 MΩ resistor (R2), the 470 KΩ resistor (R1), and the IC1E and IC1D inverters form a pulse shaping circuit.  This converts the falling (analog) voltage coming from the 1 µF capacitor (C2) on the input section into an unambiguous high or low (digital) voltage that drives IC1A, which in turn drives the CLK signal.

Integrating with Raspberry Pi

As a fun exercise in both programming and understanding the digital aspects of this circuit, I then thought it would be fun to replace the CD4017BE decade counter IC with a Raspberry Pi.  This is admittedly a very silly thing to do--that is, replacing a simple IC with a full-blown microprocessor running Linux--but I wanted to see if I could replicate what I thought the CD4017BE chip was doing using the Raspberry Pi's GPIO pins and a bit of Python.

The basic idea is that each pin on the actual CD4017BE will map to a GPIO pin on the Raspberry Pi, and then a Python script will mimic the functionality of each CD4017BE pin.  Removing all the jumper wires that fed into the CD4017BE DIP and instead plugging them into GPIO headers on the Raspberry Pi was a little messy:


I also removed the battery pack that came with the MK152 and just powered the whole circuit off of the Raspberry Pi's 5V rail.  Then, each CD4017BE pin had to be mapped to a GPIO pin:
  • CD4017BE pin 1 (Q5) mapped to GPIO pin 12
  • CD4017BE pin 2 (Q1) mapped to GPIO pin 17
  • CD4017BE pin 3 (Q0) mapped to GPIO pin 22
  • CD4017BE pin 4 (Q2) mapped to GPIO pin 5
  • CD4017BE pin 5 (Q6) mapped to GPIO pin 25
  • CD4017BE pin 6 (Q7) mapped to GPIO pin 24
  • CD4017BE pin 7 (Q3) mapped to GPIO pin 6
  • CD4017BE pin 8 (VSS) isn't needed
  • CD4017BE pin 9 (Q8) mapped to GPIO pin 27
  • CD4017BE pin 10 (Q4) mapped to GPIO pin 13
  • CD4017BE pin 11 (Q9) mapped to GPIO pin 23
  • CD4017BE pin 12 (CARRY OUT) isn't needed
  • CD4017BE pin 13 (CLOCK INHIBIT) isn't needed
  • CD4017BE pin 14 (CLOCK) mapped to GPIO pin 4
  • CD4017BE pin 15 (RESET) isn't needed
  • CD4017BE pin 16 (VDD) isn't needed
Because the logic performed by this decade counter chip is so simple, the Python code that implements the same logic is also quite simple.
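Here is a minimal sketch of what that logic might look like, assuming the RPi.GPIO library, BCM pin numbering, and the pin mapping listed above:

import RPi.GPIO as GPIO

CLK_PIN = 4                                       # wired to what was CD4017BE pin 14 (CLOCK)
Q_PINS = [22, 17, 5, 6, 13, 12, 25, 24, 27, 23]   # Q0 through Q9, per the mapping above

GPIO.setmode(GPIO.BCM)
GPIO.setup(CLK_PIN, GPIO.IN)
GPIO.setup(Q_PINS, GPIO.OUT, initial=GPIO.LOW)

counter = 0
GPIO.output(Q_PINS[counter], GPIO.HIGH)           # Q0 starts out lit, just like the real chip

try:
    while True:
        GPIO.wait_for_edge(CLK_PIN, GPIO.RISING)  # block until CLK bounces
        GPIO.output(Q_PINS[counter], GPIO.LOW)
        counter = (counter + 1) % 10              # advance and wrap around after Q9
        GPIO.output(Q_PINS[counter], GPIO.HIGH)
finally:
    GPIO.cleanup()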

Since the Raspberry Pi only replaces the CD4017BE chip (and the battery pack), the physical button still has to be pressed to activate the circuit after the above Python script is started.  Once it's pressed though, the LED wheel works just like before!


This Python version of the decade counter logic doesn't have to stop here though; for example, I went on to implement the full CD4017BE chip in Python (including pins we don't use in this project like CARRY OUT and CLOCK INHIBIT) just for fun.  It would be trivial to also implement the CD4069UBE's NOT gates too and convert this kit into a real Frankenstein circuit.

Wrap-Up

This Velleman MK152 kit turned out to be a really fun project to start learning about both analog and digital circuitry.  Once I realized that IC datasheets are easily and freely found online nowadays, the idea of understanding the circuit became tractable.  This gave me a basis on which I could experiment; I could easily prod different segments with a multimeter, try to guess what would happen if I removed or replaced a component, then actually perform the experiment.  For example, I found that messing with the C2 and C3 capacitors changes how long and how quickly the roulette wheel spins, and sticking a passive piezo buzzer in parallel with the CLK signal adds roulette wheel-like sound effects too.

This kit is really a neat demonstration of a digital circuit using pretty simple analog and digital components.  What's more, it's a great boilerplate design for how analog components like resistors and capacitors can work with the Raspberry Pi.  The decade counter and inverter DIPs are also versatile components that can be used in other projects; this contrasts with many of the electronics kits that ship with a full microcontroller which, despite being able to perform more complex tasks, are truly black boxes.  Fortunately, because microcontrollers cost more than simple ICs, these more versatile kits also tend to be cheaper, so they wind up being an economical way to build up a parts collection too.

If nothing else, messing with this kit along with my Raspberry Pi was a good excuse to get familiar with basic electronics and get in some practice programming GPIO.  Assembly and basic testing fit into an afternoon, but there is still plenty of opportunity to experiment and expand after that.  

Thursday, July 21, 2016

Basics of I/O Benchmarking

Most people in the supercomputing business are familiar with using FLOPS as a proxy for how fast or capable a supercomputer is.  This measurement, as observed using the High-Performance Linpack (HPL) benchmark, is the basis for the Top500 list.  However, I/O performance is becoming increasingly important as data-intensive computing becomes a driving force in the HPC community, and even though there is no Top500 list for I/O subsystems, the IOR benchmark has become the de facto standard way to measure the I/O capability for clusters and supercomputers.

Unfortunately, I/O performance tends to be trickier to measure using synthetic benchmarks because of the complexity of the I/O stack that lies between where data is generated (the CPU) and where it will ultimately be stored (a spinning disk or SSD on a network file system).  In the interests of clarifying some of the confusion that can arise when trying to determine how capable an I/O subsystem really is, let's take a look at some of the specifics of running IOR.

Getting Started with IOR

IOR writes data sequentially with the following parameters:
  • blockSize (-b)
  • transferSize (-t)
  • segmentCount (-s)
  • numTasks (-n)
which are best illustrated with a diagram:


These four parameters are all you need to get started with IOR.  However, naively running IOR usually gives disappointing results.  For example, if we run a four-node IOR test that writes a total of 16 GiB:

$ mpirun -n 64 ./ior -t 1m -b 16m -s 16
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write  427.36    16384      1024.00   0.107961 38.34    32.48    38.34    2
read   239.08    16384      1024.00   0.005789 68.53    65.53    68.53    2
remove -         -          -         -        -        -        0.534400 2

we can only get a couple hundred megabytes per second out of a Lustre file system that should be capable of a lot more.
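As an aside, the aggregate data volume of an IOR run is just the product of the number of tasks, the block size, and the segment count.  A quick sketch (the function name is mine, not IOR's) confirms the 16 GiB figure above:

def ior_aggregate_bytes(num_tasks, block_size, segment_count):
    """Total bytes written (and then read back) by a single IOR run."""
    return num_tasks * block_size * segment_count

MiB, GiB = 1024**2, 1024**3
# The run above: 64 tasks, 16 MiB blocks, 16 segments per task
print(ior_aggregate_bytes(64, 16 * MiB, 16) / GiB)   # prints 16.0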

Switching from writing to a single-shared file to one file per process using the -F (filePerProcess=1) option changes the performance dramatically:

$ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write  33645     16384      1024.00   0.007693 0.486249 0.195494 0.486972 1
read   149473    16384      1024.00   0.004936 0.108627 0.016479 0.109612 1
remove -         -          -         -        -        -        6.08     1

This is in large part because letting each MPI process work on its own file cuts out any contention that would arise because of file locking.  

However, the performance difference between our naive test and the file-per-process test is a bit extreme.  In fact, the only way that 146 GB/sec read rate could be achievable on Lustre is if each of the four compute nodes had over 45 GB/sec of network bandwidth to Lustre--that is, a 400 Gbit link on every compute and storage node.

Effect of Page Cache on Benchmarking

What's really happening is that the data being read by IOR isn't actually coming from Lustre; rather, files' contents are already cached, and IOR is able to read them directly out of each compute node's DRAM.  The data wound up getting cached during the write phase of IOR as a result of Linux (and Lustre) using a write-back cache to buffer I/O, so that instead of IOR writing and reading data directly to Lustre, it's actually mostly talking to the memory on each compute node.

To be more specific, although each IOR process thinks it is writing to a file on Lustre and then reading back the contents of that file from Lustre, it is actually
  1. writing data to a copy of the file that is cached in memory.  If there is no copy of the file cached in memory before this write, the parts being modified are loaded into memory first.
  2. those parts of the file in memory (called "pages") that are now different from what's on Lustre are marked as being "dirty"
  3. the write() call completes and IOR continues on, even though the written data still hasn't been committed to Lustre
  4. independent of IOR, the OS kernel continually scans the file cache for files that have been updated in memory but not on Lustre ("dirty pages"), and then commits the cached modifications to Lustre
  5. dirty pages are declared non-dirty since they are now in sync with what's on disk, but they remain in memory
Then when the read phase of IOR follows the write phase, IOR is able to just retrieve the file's contents from memory instead of having to communicate with Lustre over the network.

There are a couple of ways to measure the read performance of the underlying Lustre file system. The most crude way is to simply write more data than will fit into the total page cache so that by the time the write phase has completed, the beginning of the file has already been evicted from cache. For example, increasing the number of segments (-s) to write more data reveals the point at which the nodes' page cache on my test system runs over very clearly:


However, this can make running IOR on systems with a lot of on-node memory take forever.

A better option would be to get the MPI processes on each node to only read data that they didn't write.  For example, on a four-process-per-node test, shifting the mapping of MPI processes to blocks by four makes each node N read the data written by node N-1.


Since page cache is not shared between compute nodes, shifting tasks this way ensures that each MPI process is reading data it did not write.
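A minimal sketch of that shifted mapping (my own illustration, not IOR's actual implementation):

def shifted_read_rank(rank, tasks_per_node, total_tasks):
    """Pick whose data this rank reads back: shift by one node's worth of ranks
    so that node N reads what node N-1 wrote (wrapping around at the ends)."""
    return (rank - tasks_per_node) % total_tasks

# Four tasks per node, sixteen tasks total: ranks 0-3 (node 0) read what
# ranks 12-15 (node 3) wrote, ranks 4-7 read what ranks 0-3 wrote, and so on.
print([shifted_read_rank(r, 4, 16) for r in range(16)])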

IOR provides the -C option (reorderTasks) to do this, and it forces each MPI process to read the data written by its neighboring node.  Running IOR with this option gives much more credible read performance:

$ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F -C
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write  41326     16384      1024.00   0.005756 0.395859 0.095360 0.396453 0
read   3310.00   16384      1024.00   0.011786 4.95     4.20     4.95     1
remove -         -          -         -        -        -        0.237291 1

But now it should seem obvious that the write performance is also ridiculously high. And again, this is due to the page cache, which signals to IOR that writes are complete when they have been committed to memory rather than the underlying Lustre file system.

To work around the effects of the page cache on write performance, we can issue an fsync() call immediately after all of the write()s return to force the dirty pages we just wrote to flush out to Lustre. Including the time it takes for fsync() to finish gives us a measure of how long it takes for our data to write to the page cache and for the page cache to write back to Lustre.
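A minimal sketch of what that measurement looks like at the system-call level (the function name is mine, and this is illustrative only; IOR's -e option, described below, does this for you):

import os, time

def timed_write_with_fsync(path, data):
    """Time a write including the flush of dirty pages back to the file system."""
    start = time.time()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)   # returns as soon as the data is in page cache
        os.fsync(fd)         # returns only once the data has reached the file system
    finally:
        os.close(fd)
    return time.time() - start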

IOR provides another convenient option, -e (fsync), to do just this. And, once again, using this option changes our performance measurement quite a bit:

$ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F -C -e
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write  2937.89   16384      1024.00   0.011841 5.56     4.93     5.58     0
read   2712.55   16384      1024.00   0.005214 6.04     5.08     6.04     3
remove -         -          -         -        -        -        0.037706 0

and we finally have a believable bandwidth measurement for our file system.

Defeating Page Cache

Since IOR is specifically designed to benchmark I/O, it provides these options that make it as easy as possible to ensure that you are actually measuring the performance of your file system and not your compute nodes' memory.  That being said, the I/O patterns it generates are designed to demonstrate peak performance, not reflect what a real application might be trying to do, and as a result, there are plenty of cases where measuring I/O performance with IOR is not always the best choice.  There are several ways in which we can get clever and defeat page cache in a more general sense to get meaningful performance numbers.

When measuring write performance, bypassing page cache is actually quite simple: opening a file with the O_DIRECT flag causes reads and writes to bypass page cache and go directly to the underlying storage.  In addition, the fsync() call can be inserted into applications, as is done with IOR's -e option.

Measuring read performance is a lot trickier.  If you are fortunate enough to have root access on a test system, you can force the Linux kernel to empty out its page cache by doing
# echo 1 > /proc/sys/vm/drop_caches
and in fact, this is often good practice before running any benchmark (e.g., Linpack) because it ensures that you aren't losing performance to the kernel trying to evict pages as your benchmark application starts allocating memory for its own use.

Unfortunately, many of us do not have root on our systems, so we have to get even more clever.  As it turns out, there is a way to pass a hint to the kernel that a file is no longer needed in page cache:
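For example, here is a minimal sketch using Python's os.posix_fadvise wrapper (the helper function name is mine; a length of zero means "from the offset to the end of the file"):

import os

def drop_file_from_page_cache(path):
    """Ask the kernel to evict a file's cached pages; this is a hint, not a guarantee."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)   # dirty pages can't be dropped, so flush them back first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)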


The effect of passing POSIX_FADV_DONTNEED using posix_fadvise() is usually that all pages belonging to that file are evicted from page cache in Linux.  However, this is just a hint--not a guarantee--and the kernel evicts these pages asynchronously, so it may take a second or two for pages to actually leave page cache.  Fortunately, Linux also provides a way to probe pages in a file to see if they are resident in memory.

Finally, it's often easiest to just limit the amount of memory available for page cache.  Because application memory always takes precedence over cache memory, simply allocating most of the memory on a node will force most of the cached pages to be evicted.  Newer versions of IOR provide the memoryPerNode option that does just that, and the effects are what one would expect:


The above diagram shows the measured bandwidth from a single node with 128 GiB of total DRAM.  The first percent on each x-label is the amount of this 128 GiB that was reserved by the benchmark as application memory, and the second percent is the total write volume.  For example, the "50%/150%" data points correspond to 50% of the node memory (64 GiB) being allocated for the application, and a total of 192 GiB of data being read.

This benchmark was run on a single spinning disk which is not capable of more than 130 MB/sec, so the conditions that showed performance higher than this were benefiting from some pages being served from cache.  And this makes perfect sense given that the anomalously high performance measurements were obtained when there was plenty of memory to cache relative to the amount of data being read.

Corollary 

Measuring I/O performance is a bit trickier than CPU performance in large part due to the effects of page caching.  That being said, page cache exists for a reason, and there are many cases where an application's I/O performance really is best represented by a benchmark that heavily utilizes cache.

For example, the BLAST bioinformatics application reads all of its input data twice; the first pass initializes data structures, and the second pass fills them up.  Because the first read caches each page and allows the second read to come out of cache rather than the file system, running this I/O pattern with page cache disabled causes it to be about 2x slower:


Thus, letting the page cache do its thing is often the most realistic way to benchmark with realistic application I/O patterns.  Once you know how page cache might be affecting your measurements, you stand a good chance of being able to reason about what the most meaningful performance metrics are.