Where the energy cost of a language model really comes from
A bill that looks small
When we ask an assistant like ChatGPT, Gemini, or Claude a question, how much energy do we use? The most recent and best-documented estimates say: around a quarter of a watt-hour for a typical text prompt. Google has published a figure of 0.24 Wh for the median Gemini text prompt; independent estimates for ChatGPT put it at about 0.3 Wh. That is very little: the equivalent of leaving a TV on for less than ten seconds. So why is there so much talk about it? For two reasons. The first is scale: those fractions of a watt-hour, multiplied by billions of requests a day, feed a demand that, according to the International Energy Agency, is on track to roughly double data centers' electricity consumption by 2030. The second reason, perhaps more interesting, is the theme of these reflections: understanding where that energy physically goes. The answer contradicts the most natural intuition, which runs: "the model thinks, that is, it does an avalanche of calculations, and calculating costs energy." True, but only in part. To understand where every joule really ends up, it helps to take a journey, following the energy from the electricity meter all the way down to the single little word the model produces.
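The arithmetic behind "a TV for less than ten seconds" and behind the scale argument can be checked in a few lines. This is only a sketch: the 0.24 Wh figure comes from the text, while the 100 W television and the one billion prompts per day are illustrative assumptions, not measured values.

```python
# Back-of-the-envelope conversion of the per-prompt figure quoted above.
WH_TO_J = 3600.0  # 1 Wh = 3600 J (exact by definition)


def prompt_energy_joules(wh_per_prompt: float) -> float:
    """Energy of one prompt, converted from watt-hours to joules."""
    return wh_per_prompt * WH_TO_J


def device_seconds(wh_per_prompt: float, device_watts: float) -> float:
    """Seconds a device of the given power could run on one prompt's energy."""
    return prompt_energy_joules(wh_per_prompt) / device_watts


def daily_energy_mwh(wh_per_prompt: float, prompts_per_day: float) -> float:
    """Total daily energy in megawatt-hours for a given request volume."""
    return wh_per_prompt * prompts_per_day / 1e6


GEMINI_WH = 0.24        # median text prompt, figure quoted in the text
TV_WATTS = 100.0        # assumption: a typical television
PROMPTS_PER_DAY = 1e9   # assumption: "billions" rounded down to one billion

print(f"{prompt_energy_joules(GEMINI_WH):.0f} J per prompt")
print(f"{device_seconds(GEMINI_WH, TV_WATTS):.1f} s of TV")
print(f"{daily_energy_mwh(GEMINI_WH, PROMPTS_PER_DAY):.0f} MWh per day")
```

Under these assumptions a single prompt is under a thousand joules, while a billion of them per day reach hundreds of megawatt-hours, which is the whole point of the scale argument.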
The floor of physics (and why it's so far away)
Let's start from the bottom, asking: is there a minimum cost, imposed by the laws of physics, to process information? The answer is yes, and it is formulated in the Landauer principle, stated in 1961 and confirmed in the lab in 2012. It says something elegant and counter-intuitive: erasing a bit of information (forgetting it, zeroing it) inevitably produces a crumb of heat. Not because of technological imperfection, but because of a deep law: thermodynamics does not let you "throw away" information for free. At room temperature that minimum crumb is worth about three thousandths of a billionth of a billionth of a joule per bit (roughly 3 × 10^-21 J). A number so small that, even if a language model paid it for every bit, the total energy would be negligible. And that is exactly the point: we are extremely far from this limit. Today's chips spend, for every operation, something like a billion times more than that physical minimum, or even more. In other words: the energy an AI consumes is not "inevitable entropy" imposed by the universe. It is almost entirely engineering overhead: margins, voltages, and above all, as we will see, the cost of moving data around. Fundamental physics, here, is not the bottleneck. There is enormous room for improvement before thermodynamics becomes a problem.
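For readers who want to see the number, the Landauer bound is k_B · T · ln 2 per erased bit. The sketch below evaluates it at an assumed room temperature of 300 K and then works backwards to see what "a billion times the floor" means in joules. The 10^15 bit erasures in the last line are a purely hypothetical count, chosen only to show how small the total stays.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact by definition in the SI since 2019


def landauer_limit_joules(temperature_kelvin: float) -> float:
    """Minimum heat dissipated when erasing one bit: k_B * T * ln(2)."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


ROOM_TEMPERATURE_K = 300.0  # assumption: roughly room temperature

floor_j = landauer_limit_joules(ROOM_TEMPERATURE_K)
billion_times_floor_j = 1e9 * floor_j

# Hypothetical workload: 1e15 bit erasures, each paid at the Landauer price.
hypothetical_erasures = 1e15
total_at_floor_j = hypothetical_erasures * floor_j

print(f"Landauer floor at 300 K: {floor_j:.2e} J per bit")        # about 2.9e-21 J
print(f"A billion times the floor: {billion_times_floor_j:.2e} J")  # a few picojoules
print(f"1e15 erasures at the floor: {total_at_floor_j:.2e} J")      # a few microjoules
```

A billion times the floor lands in the picojoule range, and even a quadrillion erasures at the Landauer price would cost only microjoules, against the hundreds of joules of a real prompt.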
There is also an asymmetry worth pointing out. Landauer's principle is about erasing information. Creating information is another story: copying a piece of data is in principle free, and even generating new data (in a physical, random sense) can release energy rather than consume it. Destroying costs, creating does not, and this has a surprising consequence for AI.
The real cost of a language model is not the price of "creating" the information that answers you, but the sum of a myriad of small irreversible erasures (overwriting registers, intermediate results, accumulators at every step) plus the continuous shuttling of data back and forth.
The information you read in a response, measured with Landauer's yardstick, would be worth next to nothing in energy terms. The real bill comes entirely from everything that happens behind the scenes to produce it.
Moving costs more than calculating
Inside a chip, doing a computation (a multiplication, an addition) costs very little energy. Fetching the number to compute on, if it lives in main memory, costs a hundred to seven hundred times more. Why does this matter so much for an LLM? Because of how it generates text: one word at a time. To produce each little piece of a word (a "token"), the model has to re-read all of its billions of parameters from memory. It does very few computations on an avalanche of data it has just transported. Transporting costs. The cost doesn't depend on the "meaning" of the data, only on the physical act of moving charge along a conductor. It's like filling a pipe with water to signal a "1" and emptying it for a "0": you pay for the pumping, not the message. Insiders call this being "memory-bandwidth bound": the bottleneck is not computing power but the speed at which you can move the model's weights. This is not a "Landauer cost," and it has no fundamental physical floor: in principle, you could charge the wires slowly and recover the charge (adiabatic, or reversible, charging), bringing the energy close to zero. Saying that "we are far from Landauer" is another way of saying "the cost is the transport": the overhead comes from how we build chips today, not from unavoidable entropy. This single idea reorganizes everything. It explains why many of the optimizations that work are not about "doing fewer computations," but about moving less data, or moving it once for many users.
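To get a feel for how lopsided the split is, here is a minimal sketch in arbitrary energy units. It takes only the "hundred to seven hundred times" ratio from the text and adds two simplifying assumptions of its own: each weight is fetched from main memory once per token, and each weight takes part in about two arithmetic operations (one multiply, one add). Caches, activations, and the attention cache are ignored.

```python
def data_movement_share(fetch_to_op_cost_ratio: float,
                        ops_per_weight: float = 2.0) -> float:
    """Fraction of per-token energy spent fetching weights rather than computing.

    One arithmetic operation costs 1 unit; fetching one weight from main
    memory costs `fetch_to_op_cost_ratio` units. Per weight, the token
    therefore pays `fetch_to_op_cost_ratio` for the fetch plus
    `ops_per_weight` for the arithmetic (assumed: multiply + add).
    """
    if fetch_to_op_cost_ratio <= 0 or ops_per_weight <= 0:
        raise ValueError("costs must be positive")
    return fetch_to_op_cost_ratio / (fetch_to_op_cost_ratio + ops_per_weight)


# The two ends of the range quoted in the text.
for ratio in (100.0, 700.0):
    share = data_movement_share(ratio)
    print(f"fetch costs {ratio:.0f}x an operation -> {share:.1%} of energy is movement")
```

Under these assumptions the arithmetic accounts for a couple of percent of the energy at most, which is why shaving computations alone moves the total so little.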
The engineers' levers (and the open challenges)
From here arise the great efficiency battles being fought today:
- Quantization: using "coarser" numbers. If you represent each parameter with 8 bits instead of 16, you halve the data to transport. Real-world measurements show consumption drops of around 40% with limited quality loss. It is the lever with the best effort/result ratio. The challenge: pushing to 4 bits or fewer without the model becoming less accurate.
- Serving many users together (batching). If the dominant cost is going to fetch the weights from memory, then fetching them just once to answer a hundred users simultaneously spreads that cost over a hundred. This is why an AI in a large data center costs, per response, far less than the same AI running just for you on your own computer.
- Smaller but better-trained models. An important finding (the "Chinchilla" scaling laws) says that a more compact model trained on more data is often just as good as a giant one. And a smaller model costs less per single response, forever: a saving that multiplies across billions of uses.
- Turning on only part of the brain (Mixture-of-Experts). Instead of using the whole model for every word, only the useful portion is activated. Computations are saved, but, beware, sometimes memory traffic increases. It is a perfect example of how "fewer computations" does not automatically mean "less energy."
- The distant frontier: reversible computing. If the fundamental cost comes from erasing information, there exist in theory ways of computing that erase nothing, and that could one day spend less per operation than the Landauer limit charges for an erasure. Today it is basic research, far from practice, but it is the direction that tells us "how low, in principle, one could go."
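The first two levers in the list, quantization and batching, act on the same quantity: the bytes of weights that must cross the memory bus for each token each user receives. The sketch below models only that traffic. The 7-billion-parameter model is a hypothetical example, and the function deliberately ignores activations and per-user attention caches, which do not shrink with batching.

```python
def weight_bytes_per_user_token(n_params: float,
                                bits_per_weight: int,
                                batch_size: int = 1) -> float:
    """Bytes of weights read from memory per generated token, per user.

    The weights are read once per decoding step and shared by every
    request in the batch, so the per-user cost is divided by batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / batch_size


N_PARAMS = 7e9  # hypothetical model size, for illustration only

baseline = weight_bytes_per_user_token(N_PARAMS, bits_per_weight=16)
quantized = weight_bytes_per_user_token(N_PARAMS, bits_per_weight=8)
batched = weight_bytes_per_user_token(N_PARAMS, bits_per_weight=16, batch_size=100)

print(f"16-bit, one user:  {baseline / 1e9:.2f} GB per token")
print(f"8-bit, one user:   {quantized / 1e9:.2f} GB per token")
print(f"16-bit, 100 users: {batched / 1e9:.2f} GB per token per user")
```

Weight traffic halves exactly with 8-bit weights, whereas the measured energy drop quoted in the list is closer to 40%: a reminder that weight movement is the dominant cost, not the only one.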
Data centers and accounting
Above the chip there is the building. Cooling the servers, distributing the power: all of this adds about 50% on top of the energy of the computations themselves (this is the well-known "PUE" indicator, stuck at around 1.5 for years). Then there is water consumption for cooling, and the carbon footprint of building the hardware itself. But the important point, so as not to be fooled, is another one: the numbers on "the consumption of an AI" are often not comparable. They vary by more than tenfold not because one model is ten times more voracious, but because each one measures different things. Only the accelerator chip? Also the computer hosting it? Also machines that are on but idle? Also the cooling? Does it count the initial training or only the responses? Without declaring what goes into the count, any figure is useless for comparison. This is also why standardization efforts exist (like the MLPerf Power benchmark, which measures the power "at the wall socket" of the whole system). And there is a fundamental question that still has no simple answer: for a heavily used model, which weighs more, training it once (a huge but one-off cost) or answering its questions (a tiny cost, multiplied by billions)? The answer changes depending on what you want to know, and it is one of the open questions.
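A small sketch shows how the same physical prompt can be reported with very different numbers depending on where the accounting boundary is drawn, and how a training-versus-answering break-even could be computed. Every input except the PUE of 1.5 is a hypothetical placeholder chosen for illustration, not a measurement of any real system.

```python
def reported_wh(accelerator_wh: float, host_wh: float, idle_wh: float,
                pue: float, boundary: str) -> float:
    """Per-prompt energy as it would be reported under a given boundary.

    PUE is total facility energy divided by IT equipment energy, so the
    facility figure is the IT figure multiplied by PUE.
    """
    if boundary == "accelerator":
        return accelerator_wh
    server_wh = accelerator_wh + host_wh
    if boundary == "server":
        return server_wh
    provisioned_wh = server_wh + idle_wh  # includes machines that are on but idle
    if boundary == "server_plus_idle":
        return provisioned_wh
    if boundary == "facility":
        return provisioned_wh * pue
    raise ValueError(f"unknown boundary: {boundary}")


def breakeven_queries(training_wh: float, per_query_wh: float) -> float:
    """Number of queries after which inference energy equals training energy."""
    return training_wh / per_query_wh


# Hypothetical per-prompt breakdown (Wh); only the PUE comes from the text.
ACCELERATOR_WH, HOST_WH, IDLE_WH, PUE = 0.10, 0.05, 0.05, 1.5

for name in ("accelerator", "server", "server_plus_idle", "facility"):
    value = reported_wh(ACCELERATOR_WH, HOST_WH, IDLE_WH, PUE, name)
    print(f"{name:>17}: {value:.2f} Wh")

# Hypothetical training cost of 1 GWh (1e9 Wh), compared with the facility figure.
facility_wh = reported_wh(ACCELERATOR_WH, HOST_WH, IDLE_WH, PUE, "facility")
print(f"break-even after {breakeven_queries(1e9, facility_wh):.2e} queries")
```

With these placeholder inputs the widest boundary reports three times the narrowest one for the identical prompt, before training is even considered.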
Efficient for whom, and for what?
Two aspects worth digging into.
Efficiency does not mean less consumption. In 1865 the economist William Stanley Jevons noticed that more efficient steam engines, instead of lowering coal consumption, increased it: since it cost less to run things, far more of them were run. It is the Jevons paradox, and it is exactly what we see today with AI: the energy per single response has collapsed (about tenfold in a few years), and yet the total consumption of data centers is set to double. The cheaper it becomes, the more it is used everywhere, and the total rises all the same. Efficiency is necessary, but on its own it is not enough to reduce the overall footprint.
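The paradox is plain multiplication: total energy is energy per response times the number of responses. The sketch below takes the two figures from the paragraph above (a tenfold drop per response, a doubling of the total) and asks how much usage must grow to reconcile them. It makes the strong simplifying assumption that the whole load consists of AI responses, which is not true of real data centers.

```python
def total_energy_factor(per_response_factor: float, volume_factor: float) -> float:
    """Change in total energy given changes in per-response energy and volume."""
    return per_response_factor * volume_factor


def volume_factor_needed(per_response_factor: float, total_factor: float) -> float:
    """Growth in request volume implied by a given change in total energy."""
    if per_response_factor <= 0:
        raise ValueError("per_response_factor must be positive")
    return total_factor / per_response_factor


PER_RESPONSE_FACTOR = 0.1  # energy per response falls about tenfold (from the text)
TOTAL_FACTOR = 2.0         # total consumption doubles (from the text)

growth = volume_factor_needed(PER_RESPONSE_FACTOR, TOTAL_FACTOR)
print(f"usage must grow about {growth:.0f}x for the total to double")
```

Under that simplification, a tenfold efficiency gain is more than cancelled as soon as usage grows by more than ten times.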
We are counting the wrong thing. A model is not more efficient because it churns out more text for the same energy: it is efficient if it reaches more correct goals for the same energy. The difference is enormous with the new models that "reason": they produce far more words and seem wasteful, but they get right answers that others get wrong. Counted by word, they are the worst; counted by problem solved, they can be the best. The most appropriate metric is not energy per token: it is energy per correct result. Measuring it is hard (what is a "correct" answer, for a poem?), and this is precisely one of the most interesting open challenges.
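One way to make "energy per correct result" concrete is to divide the energy of an answer by the probability that the answer is right. The sketch below does that for two hypothetical models on a hard task. All numbers are invented for illustration, the energy per token is assumed equal for both, and each problem is assumed to get a single attempt whose energy is wasted if the answer is wrong.

```python
def energy_per_answer_wh(wh_per_token: float, tokens_per_answer: float) -> float:
    """Energy spent producing one answer, right or wrong."""
    return wh_per_token * tokens_per_answer


def energy_per_correct_wh(wh_per_token: float, tokens_per_answer: float,
                          accuracy: float) -> float:
    """Expected energy spent per correct answer."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return energy_per_answer_wh(wh_per_token, tokens_per_answer) / accuracy


WH_PER_TOKEN = 0.001  # hypothetical, identical for both models

# (tokens per answer, accuracy on a hard task): hypothetical values
MODELS = {
    "terse": (200, 0.05),
    "reasoning": (2000, 0.80),
}

for name, (tokens, accuracy) in MODELS.items():
    per_answer = energy_per_answer_wh(WH_PER_TOKEN, tokens)
    per_correct = energy_per_correct_wh(WH_PER_TOKEN, tokens, accuracy)
    print(f"{name:>9}: {per_answer:.2f} Wh per answer, {per_correct:.2f} Wh per correct answer")
```

With these invented numbers the reasoning model costs ten times more per answer yet less per correct answer. On an easy task, where both models are almost always right, the ranking flips back, which is why the metric has to be tied to the task.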
Trying to sum up
The efficiency of a language model plays out on several levels, not all of them intuitive. It is not a battle against the laws of physics (we are extremely far from that wall), nor is it mainly a question of "doing fewer computations." It plays out in the middle: reducing data movement and amortizing the dominant cost over more useful work. And when we measure, let us remember to count the goals achieved, not the words, and to look at the total, not just the single response. Thermodynamics, here, does not constrain us; accounting, if we don't frame the problem well, might mislead us.