This month I published “How To Scale Your Model”, an online textbook about how large language models (LLMs) run at scale. I’m proud of the book because of its quality (it has interactive diagrams, new animations, homework problems), but even more because its subject is solidly untaught. I don’t believe another resource exists that tells this technical story end-to-end.
Why is this? Partly because it’s genuinely new knowledge: the field has only operated at these >10k GPU/TPU scales since maybe 2018. Academia also doesn’t tend to care about this kind of engineering discipline. And it is somewhat deliberately siloed by the big AI labs, who (rightly) consider some of it a hard-won trade secret. Whatever the cause, it’s something of a tragedy, because it’s a wonderful topic for academic research:
- It doesn’t require a huge amount of compute to make models run faster. You can often profile and improve an LLM on 4 GPUs (see the sketch below).
- The most significant improvements in AI in the past few years have come from systems innovations: GPT-3, InstructGPT, o1. The challenge behind something like o1 is not so much algorithmic as infrastructural: you need to get RL running at scale with very large LLMs, which is no easy feat.
- Systems work from academia, like FlashAttention or PagedAttention, has had a bigger impact on AI than basically any other recent academic research.
So my challenge is: go do this work! Think about what makes LLMs slow and what systems or architectural changes could make them faster to train or serve. Characterize the improvements and write a good paper. This will be rewarded!
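If profiling an accelerator workload sounds intimidating, getting started really is this small. Here is a minimal sketch in JAX (the stack the book uses); the matmul is just a stand-in for a real model step, and the `/tmp/profile` log directory is a placeholder:

```python
# Minimal JAX profiling sketch: trace one step and inspect the result
# in TensorBoard or Perfetto. The matmul stands in for a real model step.
import jax
import jax.numpy as jnp

@jax.jit
def step(x, w):
    return jnp.dot(x, w)

x = jnp.ones((8192, 8192), dtype=jnp.bfloat16)
w = jnp.ones((8192, 8192), dtype=jnp.bfloat16)
step(x, w).block_until_ready()  # warm up so compile time isn't in the trace

# Everything inside this context is recorded to an on-disk trace.
with jax.profiler.trace("/tmp/profile"):
    step(x, w).block_until_ready()
```

From there, the game is comparing what the trace shows against a roofline estimate of what the hardware should be able to do.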
Teaching new knowledge: It’s a fabulous process to teach something like this for the first time. You find yourself constantly asking “is this the simplest way to present this?” or “does this math really matter?”. There is no “correct” set of abstractions.
I want to highlight one pattern I noticed while writing this, which I’ve called “testing technical writing”. When you introduce a new topic, you need to test it: write a homework problem that checks that the reader really understands it. Crucially, though, the problem also needs to motivate the concept. You can’t just say “Apply the chain rule to compute this derivative!”. You have to ask “Say you have a 200B-parameter LLM running on a TPU v5p 4x4x4 mesh. How long will a training step take? Hint: calculate how many FLOPs are required to compute the derivative, using the chain rule.” This is a question someone would actually want to answer, and whose answer is supplied by a technique you’re teaching. Adding a new sentence if and only if it’s well-motivated by a problem feels like an effective approach to technical writing.
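To show what answering that kind of question looks like, here is a back-of-the-envelope sketch. All the numbers beyond the problem statement are assumptions for illustration: roughly 4.59e14 bf16 FLOP/s of peak compute per v5p chip, the standard ~6 × n_params FLOPs-per-token rule of thumb for a forward-plus-backward pass, a 1M-token batch, and 40% hardware utilization:

```python
# Back-of-the-envelope step-time estimate for the homework problem above.
# Assumptions: ~4.59e14 bf16 FLOP/s peak per TPU v5p chip, ~6 * n_params
# FLOPs per token for forward + backward, a 1M-token batch, and 40% MFU.

n_params = 200e9               # 200B-parameter model
n_chips = 4 * 4 * 4            # TPU v5p 4x4x4 mesh = 64 chips
peak_flops_per_chip = 4.59e14  # approximate bf16 peak per v5p chip
batch_tokens = 1e6             # assumed tokens per training step
mfu = 0.40                     # assumed model FLOPs utilization

step_flops = 6 * n_params * batch_tokens           # ~1.2e18 FLOPs per step
achieved_flops = n_chips * peak_flops_per_chip * mfu

step_time_s = step_flops / achieved_flops
print(f"{step_time_s:.0f} s per step")             # ~102 s under these assumptions
```

Under these assumptions a step comes out to roughly 100 seconds. The point is less the exact number than that counting the FLOPs in the derivative gives you a concrete, checkable estimate.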