The Energy Cost of Teaching Machines: Diving Deep into Energy and LLMs
This note started as one of my many research notes on things I keep hearing about and don’t really understand. Writing helps me grok a concept, so I wrote it and then started referencing it when explaining things to my team or others. I decided that even if it’s boring or long, it might help some people (it also pushed me to polish it a bit and check references to ensure minimal miscalculations).
Training an LLM means running millions of simple arithmetic calculations to adjust model weights so it can predict the next token accurately (there is a great, long introduction to next-token prediction here).
I visualize this as a teacher asking a student, “2x2 equals… 5! Wrong!! Try again, 2x2 equals… 3! Wrong!! 2x2 equals… 4! Good, let’s move on,” over and over until the student learns the task.
Now, how much energy does training this student consume? Let’s get to it.
From Definitions to Energy Consumption
Energy consumed is a function of the total training hours times the hourly consumption of the machines used.
Energy (kWh) = # of hours of training x hourly consumption
Hourly Consumption
Hourly consumption is straightforward. Each machine consumes power measured in Watts (W) during the time it is operating (capacity time).
For this note, we reference the Nvidia H100 80GB. The H100 consumes 700W of power. With 60% utilization (reasonable estimate, though not sourced), it consumes 3,680 kWh per year, or 10 kWh per day. To put it in perspective, this is equivalent to the energy consumption of a household in Spain (source).
🐇 Nvidia sold 1.5 million H100 units in 2023 and is expected to sell 2 million more in 2024. By the end of 2024, there will be 3.5 million units. If all ran at 60% capacity for a year, they would consume 12.9 terawatt-hours (TWh) of energy, comparable to the energy consumption of some small countries. 🤯
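To make these figures easy to reproduce, here is a minimal Python sketch of the per-GPU and fleet-level numbers above (the 700W rating, 60% utilization, and 3.5 million units are the same assumptions used in the text):

```python
# Back-of-the-envelope check of the H100 figures above.
H100_POWER_W = 700           # rated power of one H100
UTILIZATION = 0.60           # assumed average utilization (not sourced)
HOURS_PER_YEAR = 24 * 365

# Per-GPU consumption
kwh_per_day = H100_POWER_W * UTILIZATION * 24 / 1000
kwh_per_year = H100_POWER_W * UTILIZATION * HOURS_PER_YEAR / 1000
print(f"Per H100: {kwh_per_day:.1f} kWh/day, {kwh_per_year:,.0f} kWh/year")
# -> Per H100: 10.1 kWh/day, 3,679 kWh/year

# Fleet-level consumption (3.5 million units by end of 2024, per the estimate above)
FLEET_UNITS = 3_500_000
fleet_twh = FLEET_UNITS * kwh_per_year / 1e9   # 1 TWh = 1e9 kWh
print(f"Fleet: {fleet_twh:.1f} TWh/year")
# -> Fleet: 12.9 TWh/year
```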
🐇 If you’re like me, you might be confused — and annoyed — by energy consumption being measured in “Watt-hours” (Wh), where “h” stands for “hour.” Why mention hours when the period measured can be anything? For example, the H100 has an annual consumption of about 3,680 kWh. Here’s why:
Power is measured in Watts (W), an instantaneous measure of how fast energy is being used. “Watt-hours” (Wh) measure the total energy consumed over time. For instance, a 60W bulb consumes 60 Wh if left on for an hour. This unit lets us calculate total consumption over any period while keeping the hourly reference.
So, the H100 consuming about 3,680 kWh annually means it uses the equivalent of 3,680 kilowatt-hours over a year, not 3,680 kW in one hour. This understanding calmed my OCD a bit.
Hours of Training
This is the complex part. Training time is a function of the number of operations needed for the model to learn (i.e., “NO!! Try again”) and the speed at which these operations occur.
Hours of training = amount of computation needed / speed of the machines
The more operations required, the longer it takes. The faster the machine, the shorter it takes.
Let’s break this down.
Amount of Computation
The computation required to train a model is measured in floating-point operations, or FLOPs (one FLOP would be, for instance, computing 2.3 + 9.7).
For example, the total compute required to train GPT-3 is approximately 3.14 x 10²³ FLOPs. That’s 314,000,000,000,000,000,000,000 floating operations.
Computation needs are impacted by:
- Model size and architecture: The number of parameters and their organization (architecture) drive the computation. More parameters mean more flops. GPT-3 had 175B parameters, while GPT-4 is estimated to have 1–2 trillion.
- Training dataset: Model size determines the dials to adjust; how often you adjust them depends on the dataset size and the number of times you show it to the model (epochs).
🐇 There is a theoretically optimal ratio between the number of parameters and the dataset size, called the “Chinchilla optimal”. According to the original paper, it is around 20 tokens per parameter. However, this is under active debate. Models like Llama 3 show that increasing training data well beyond this ratio (roughly 75x for Llama 3 8B) can still improve performance, and Karpathy guesses that we are off by 100–1000x (a quick sketch after this list checks these ratios).
- GPT-3 was trained on 300B tokens.
- Llama 3 was trained on 15 trillion tokens (thousands of times the size of the English Wikipedia!).
🐇 I still owe myself a write-up on my intuition about what a parameter is, but for now: a parameter in an LLM is like a container that holds learned information; the more parameters (containers) you have, the more detailed and accurate the model’s understanding and responses can be.
- Hyperparameters: Adjustments in hyperparameters, such as the learning rate schedule, significantly impact compute needs, as they influence the number of steps required for convergence.
🐇 The learning rate controls how quickly a model updates its parameters. A high learning rate speeds up convergence but might skip over the optimal solution. A low learning rate allows more precise adjustments but risks getting stuck or taking too long to converge. Imagine trying to find the peak of a mountain by taking steps left or right: large steps might make you miss the peak entirely.
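As a quick sanity check on the ratios mentioned in the bullets above, here is a small Python sketch using only the parameter and token counts quoted in this note; the exact “optimal” reference ratio (20–25 tokens per parameter) is debated, so treat it as an assumption:

```python
# Rough tokens-per-parameter ratios, using the figures quoted in this note.
CHINCHILLA_TOKENS_PER_PARAM = 20   # ~20 (some use ~25) tokens per parameter

models = {
    "GPT-3":      {"params": 175e9, "tokens": 300e9},   # 300B tokens
    "Llama 3 8B": {"params": 8e9,   "tokens": 15e12},   # 15T tokens
}

for name, m in models.items():
    ratio = m["tokens"] / m["params"]
    print(f"{name}: ~{ratio:,.0f} tokens per parameter "
          f"(Chinchilla reference: ~{CHINCHILLA_TOKENS_PER_PARAM})")
# GPT-3:      ~2 tokens per parameter     -> far below the Chinchilla ratio
# Llama 3 8B: ~1,875 tokens per parameter -> roughly 75-95x above it, depending
#             on whether you take 25 or 20 tokens/param as the reference
```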
Speed of One H100
Speed is measured by the number of floating-point operations the machine can perform per second (FLOPS). For these powerful machines, we more commonly use teraflops (TFLOPS): 1 TFLOPS is one trillion (1,000,000,000,000) floating-point operations per second.
Like cars, the max TFLOPS gives you a theoretical capacity; the real performance can vary based on factors like:
- Precision level: Higher precision (e.g., FP64) uses more bits per number, resulting in lower throughput but potentially more accurate calculations, aka better token predictions.
🐇 Optimal performance on the NVIDIA H100 GPU comes from using matrix operations with Tensor Cores designed for these tasks. Lower precision formats (like FP16 or INT8) can execute faster than higher precision formats (like FP64) due to Tensor Core optimization.
- Memory usage: Similar to an F1 car needing all its parts engaged for maximum performance, the H100 needs its memory fully saturated. That means the 80GB that normally comes with an H100 holds data the whole time, and the machine does calculations with few swaps.
- Proper power and cooling: The machine needs stable power and sufficient cooling to operate optimally. No one can work properly when it’s too hot.
These factors introduce variability in calculations, so keep them in mind.
The max theoretical capacity of the H100 using FP16 Tensor is close to 1,000 TFLOPS or 2,000 using FP8.
That’s an astronomical amount of operations per second. For perspective, an iPhone has 0.5 to 1 TFLOPS, and the PS5 has 10 TFLOPS. If one operation were a grain of rice, 2000 TFLOPS would be equivalent to five times the Earth to Moon distance!
Speed of Thousands of H100s
In real applications, companies use clusters of GPUs, nowadays often in the thousands. Running these clusters introduces challenges that limit throughput, such as connectivity latency, communication protocols (like MPI), and load balancing.
Meta achieved around 400 TFLOPS per H100 during Llama 3 training, using FP16. The gap from the theoretical peak comes down to hardware utilization and data transfers; for lack of a better benchmark, that appears to be about 3x more efficient than the Llama 2 setup.
The total compute required to train GPT-3 is approximately 3.14 x 10²³ FLOPs, translating to around 3,640 petaflop/s-days.
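To see where that petaflop/s-day figure comes from, here is the unit conversion as a quick Python check:

```python
# Convert GPT-3's total training compute into petaflop/s-days.
TOTAL_FLOPS = 3.14e23          # total compute to train GPT-3
PETAFLOP_PER_S = 1e15          # FLOPs per second at a sustained 1 petaflop/s
SECONDS_PER_DAY = 24 * 3600

petaflop_s_days = TOTAL_FLOPS / (PETAFLOP_PER_S * SECONDS_PER_DAY)
print(f"{petaflop_s_days:,.0f} petaflop/s-days")
# -> ~3,634 petaflop/s-days, i.e. the ~3,640 quoted above
```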
Putting Things Together
Now we understand the components needed to calculate a model’s energy consumption. Let’s use GPT-3 as an example. It’s old and less relevant today, but OpenAI’s transparency (at that point, hehe) provides the necessary data.
Number of Hours of Training in a Single GPU
Using simple arithmetic, we can calculate the time to train GPT-3 on a single GPU.
Using Modern H100s
Note that GPT-3 was not trained on H100s (initial release date 2022); it was trained on roughly 10,000 NVIDIA Tesla V100s in 2019–2020, each delivering around 15 TFLOPS.
But because we have used the H100 all the way through this note, let me continue:
Hours of training = amount of computation / speed of computation
Amount of computation = 3.14 x 10²³ FLOPs
Speed of computation (H100) = 400 x 10¹² FLOPs/s (as achieved by Meta when training with FP16)
Training time (s) = 7.85 x 10⁸ seconds
Training time (d) = training seconds / seconds per day ≈ 9,086 days (about 25 years)
Not quite there, but getting close to the 7.5 million years that Deep Thought takes to calculate the answer to the ultimate question.
That’s a lot of time. Fortunately, OpenAI used a cluster of around 10,000 GPUs. Using 10,000 H100s would reduce training time to about 0.9 days.
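Here is the same training-time arithmetic as a small Python sketch, using the 3.14 x 10²³ FLOPs total and the roughly 400 TFLOPS per GPU that Meta reported (and assuming, optimistically, perfect scaling across the cluster):

```python
# Training time for GPT-3-scale compute on H100s at Meta's reported throughput.
TOTAL_FLOPS = 3.14e23            # total compute to train GPT-3
FLOPS_PER_GPU = 400e12           # ~400 TFLOPS achieved per H100 (Meta, FP16)
SECONDS_PER_DAY = 24 * 3600
N_GPUS = 10_000

single_gpu_days = TOTAL_FLOPS / FLOPS_PER_GPU / SECONDS_PER_DAY
cluster_days = single_gpu_days / N_GPUS   # assumes perfect (linear) scaling

print(f"1 H100:      {single_gpu_days:,.0f} days (~{single_gpu_days / 365:.0f} years)")
print(f"10,000 H100: {cluster_days:.1f} days")
# -> 1 H100:      9,086 days (~25 years)
# -> 10,000 H100: 0.9 days
```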
Energy
We now have all elements to return to the initial equation:
Energy (kWh) = # of hours of training x hourly consumption
Let’s calculate it:
1 H100 Energy (kWh) = (9,086 x 24) hours of training x 0.420 kW ≈ 91,600 kWh
10,000 H100s (kWh) = (0.9 x 24 hours) x (0.420 kW x 10,000 H100s) = 90,720 kWh (the small difference is due to rounding).
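And the energy calculation as a sketch, reusing the 0.420 kW per GPU (700W at 60% utilization) assumption from earlier:

```python
# Energy to train GPT-3-scale compute on H100s, under the assumptions above.
POWER_PER_GPU_KW = 0.700 * 0.60   # 700 W at 60% utilization = 0.42 kW
SINGLE_GPU_DAYS = 9_086           # from the training-time sketch above
CLUSTER_DAYS = 0.9                # with 10,000 GPUs
N_GPUS = 10_000

energy_single_kwh = SINGLE_GPU_DAYS * 24 * POWER_PER_GPU_KW
energy_cluster_kwh = CLUSTER_DAYS * 24 * POWER_PER_GPU_KW * N_GPUS

print(f"1 H100:      {energy_single_kwh:,.0f} kWh")    # -> 91,587 kWh (~91,600)
print(f"10,000 H100: {energy_cluster_kwh:,.0f} kWh")   # -> 90,720 kWh (rounding)
```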
Does this make sense? Partially. For an individual machine’s energy consumption, it does. For clusters, it’s more complex and the margin of error is larger. Let’s look at the components of that margin of error.
Margin of Error
The margin of error in these calculations is large, but the goal was to understand the dynamics, not to achieve precision. Let’s examine some sources of error.
Power Consumption (+/- 25% error)
GPUs are not the only things consuming energy; the systems around them do too, notably cooling. Power Usage Effectiveness (PUE) is the ratio of total facility energy to IT equipment energy. Google reports a PUE of 1.12, meaning a 12% overhead.
🐇 Nearly 100% of the power drawn is dissipated as heat. One H100 (700W) generates as much heat as 7–8 humans. Imagine 20,000 H100s running simultaneously in a room… like a subway without AC in a Madrid summer.
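As a rough illustration of the PUE overhead, applied to the 10,000-H100 estimate from above (the 1.12 figure is Google’s; many facilities run higher):

```python
# Facility-level energy = GPU energy x PUE overhead (illustrative only).
gpu_energy_kwh = 90_720    # 10,000-H100 GPT-3 estimate from above
PUE = 1.12                 # Google's reported PUE; other data centers are often higher

facility_energy_kwh = gpu_energy_kwh * PUE
print(f"{facility_energy_kwh:,.0f} kWh at the facility level")
# -> 101,606 kWh, i.e. roughly 11,000 kWh just for cooling and other overhead
```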
Total Computation (0 error for GPT-3, a ton otherwise)
The total FLOPs required depend on many variables (architecture, dataset size, hyperparameters…). Most of these are no longer published, which complicates accurate estimates. First principles and external data points help with extrapolations (these folks did a great job with Llama).
Available TFLOPS (100%? margin of error)
This involves the theoretical max speed of a single GPU, how many GPUs you have, and the speed AI engineers can get from them together.
- Single GPU: For the H100, the theoretical max speed at FP16 is close to 1,000 TFLOPS. Utilization impacts this, and I assumed between 40–60% based on the available data.
- Cluster of GPUs: More GPUs add variance: connectivity, load balancing, etc. Meta achieved around 400 TFLOPS per GPU, a drop of less than an order of magnitude from the theoretical peak, but still significant.
Carbon Footprint
As I delved deeper, I found Hugging Face Model Cards where AI engineers share their models’ carbon footprints. This idea stems from an “old” paper highlighting LLMs’ significant energy consumption. To convert energy consumption to carbon footprint:
Carbon footprint (tCO2eq) = hours of training x kW x carbon intensity (tCO2eq per kWh)
Carbon intensity measures the CO2 produced per kWh. The EU and the US publish this data, so if you need to work backwards, you can go to these sources and recover many of the implicit assumptions.
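A minimal sketch of that conversion, assuming an illustrative grid carbon intensity (the real value depends heavily on where and when the training runs, which is exactly why those EU/US sources matter):

```python
# Convert training energy into an approximate carbon footprint.
energy_kwh = 90_720                    # 10,000-H100 GPT-3 estimate from above
CARBON_INTENSITY_KG_PER_KWH = 0.4      # illustrative grid average; varies a lot by country

carbon_t_co2eq = energy_kwh * CARBON_INTENSITY_KG_PER_KWH / 1000   # kg -> tonnes
print(f"~{carbon_t_co2eq:,.1f} tCO2eq")
# -> ~36.3 tCO2eq under this assumed carbon intensity
```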
Notably, much of Meta’s publicly available know-how comes from its Hugging Face card.
Training vs. Inference
Before we close with the “So What?” section, it’s important to clarify that this discussion is primarily about training LLMs, which is the process of educating the model by running vast computations to adjust its weights. However, inference — asking questions to the trained model — also consumes energy. Though less intensive than training, inference still adds up, especially with millions of queries processed daily. For instance, while training GPT-3 can consume around 90,720 kWh, each inference might only use a tiny fraction of that. Nonetheless, both training and inference contribute to the overall energy footprint of AI.
So What?
Understanding the specific energy consumption of an old model isn’t super interesting unless you like to geek out, BUT (or And, as Isabel would correct me) I explored this rabbit hole to grasp foundational concepts that come up in many conversations and that I did not fully understand. I still have tons of unknowns (I have to look deeper into inference energy consumption, among other follow-ups), but at least now I have a foundation to build on. In any case, it is always fun going down rabbit holes.
But TL;DR: the numbers involved in training LLMs are staggering and continue to grow. Energy consumption is just part of the equation, yet worth paying attention to, as it has implications for many things, from geopolitics (a quite dramatic essay on the topic) to the environment.
We should remain optimistic about this technology’s near-infinite potential.