Introduction

A lot of basic, important information about transformer language models can be computed quite simply. Unfortunately, the equations for this are not widely known in the NLP community. The purpose of this document is to collect these equations along with related knowledge about where they come from and why they matter.

Note: This post is primarily concerned with training costs, which are dominated by VRAM considerations. For an analogous discussion of inference costs with a focus on latency, check out this excellent blog post by Kipply.

Compute Requirements

The basic equation for the cost to train a transformer model is:

$$ C\approx\tau T = 6PD $$

where:

- $C$ is the compute required to train the transformer model, in total floating point operations (FLOPs)
- $C = C_{\text{forward}} + C_{\text{backward}}$, with $C_{\text{forward}} \approx 2PD$ and $C_{\text{backward}} \approx 4PD$, which is where the factor of six comes from
- $\tau$ is the aggregate throughput of your hardware setup ($\tau = (\text{No. GPUs}) \times (\text{Actual FLOPs per GPU})$), in FLOPs per second
- $T$ is the time spent training the model, in seconds
- $P$ is the number of parameters in the transformer model
- $D$ is the dataset size, in tokens

These equations are proposed and experimentally validated in OpenAI’s scaling laws paper and DeepMind’s scaling laws paper. Please see each paper for more information.
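
To make $C \approx \tau T = 6PD$ concrete, here is a minimal back-of-the-envelope sketch in Python. The model and data sizes are Chinchilla-scale (70B parameters, 1.4T tokens); the cluster size and per-GPU throughput below are illustrative assumptions, not measurements of any real setup.

```python
# Back-of-the-envelope training cost using C ~= 6 * P * D.
# P and D are Chinchilla-scale; the hardware numbers are assumed for illustration.

P = 70e9    # parameters
D = 1.4e12  # training tokens

C = 6 * P * D                # total training compute, in FLOPs
print(f"C = {C:.2e} FLOPs")  # ~5.9e23 FLOPs

# Suppose a cluster of 1,000 GPUs, each sustaining 150 TFLOP/s of
# *actual* (not peak) throughput -- both numbers are assumptions:
num_gpus = 1_000
actual_flops_per_gpu = 150e12          # FLOP/s per GPU
tau = num_gpus * actual_flops_per_gpu  # aggregate throughput, in FLOP/s

T = C / tau                          # training time, in seconds
print(f"T = {T / 86_400:.1f} days")  # ~45 days
```

Note that what matters here is the *actual* throughput your training run achieves, which is typically well below the hardware's advertised peak FLOPs.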

It’s worth taking an aside to discuss the units of $C$. $C$ is a measure of total compute, but it can be expressed in many units, such as: