In this article I have a look at the cost of running queries for GPT-4 and similar models, in view of the drop in price per prompt. The main conclusions are:
- The energy efficiency gains for queries to large language models (LLMs) are not leading to lower emissions.
- On the contrary, the lower prices are likely to lead to increased use and therefore higher emissions.
- The cost of a query is mainly made up of the fixed cost (capex) of the data centre (building, cooling and network infrastructure) and the GPU servers. Electricity consumption contributes only a small proportion.
- Therefore, to maximise profit, the GPU server utilisation is optimised to support as many users as possible on the available hardware.
- But higher utilisation means higher energy consumption and therefore higher emissions, even if the energy consumption per query is lower. The projected strong growth in the number of queries makes this even worse, as it means the data centre capacity needs to grow steeply as well.
The urgent need to reduce emissions
To reiterate, according to the 2024 Emissions Gap Report of the UN [1], the world must cut global greenhouse gas emissions to 20 gigatons CO₂-equivalent per year (GtCO₂e/y) by 2040 from the current level of 60 GtCO₂e/y to avoid catastrophic global warming, where “catastrophic” is meant quite literally: there will be a huge increase in the frequency and severity of natural catastrophes if we don’t do this. Large parts of the earth will become unsuitable for habitation and agriculture.
To arrive at a sustainable level of emissions by 2040, global CO₂ emissions should be reduced by close to 20% per year. However, currently, emissions are still rising at 1% – 2% per year, despite the increase in renewable electricity generation capacity.
The 2024 Emissions Gap Report of the UN [1] explains in detail why renewables, carbon dioxide removal and carbon offsetting alone will not be sufficient to meet the targets.
Cheaper prompts, greener prompts?
- Note on terminology: I use the terms prompt and query interchangeably. The prompt is what you type; the query is the action of sending it to the server. Furthermore, a token is a small group of characters, anywhere from a single character to a whole word.
The price per query or token for various LLMs has come down considerably compared to prices when GPT-3 was released [2].
However, the energy consumption of GPT-4 is still several times larger than that of GPT-3 [5], and the energy consumption of Gemini 1.5 Pro is still of the same order as GPT-3's. How is this compatible with prices for GPT-4 that are more than ten times lower than for GPT-3? Let's have a look at the figures, and the factors that influence them.
Electricity pricing for data centres is very low
Large users of electricity pay wholesale prices for electricity. So the more electricity you use, the cheaper it is per unit, ironically. Because of their size, Google and OpenAI pay the lowest prices. The price they pay for their electricity is less than 6 cents per kWh [6].
Energy consumption of a GPT-4 style large language model
As I have discussed in detail in my article [7], the best estimate for the electricity consumption for GPT-3 and BLOOM is 0.003 kWh per query. For the queries used in that work, the average query response length was 100 words. At 6 cents per kWh, the electricity cost for such a query would be 0.018 cents, i.e. $0.00018.
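As a quick sanity check, here is that back-of-the-envelope calculation in Python, using the figures cited above:

```python
# Electricity cost of a single GPT-3 style query (~100-word response).
ENERGY_PER_QUERY_KWH = 0.003  # best estimate for GPT-3/BLOOM [7]
PRICE_PER_KWH_USD = 0.06      # wholesale electricity price for hyperscalers [6]

cost_per_query = ENERGY_PER_QUERY_KWH * PRICE_PER_KWH_USD
print(f"${cost_per_query:.5f} per query")  # $0.00018, i.e. 0.018 cents
```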
GPT-4 is said to be 3× more expensive in energy per query than GPT-3 [5], but GPT-4 Turbo could be only 1.5× more expensive, as it is a compressed model.
Gemini 1.5 Pro is said to have 200B parameters [17], which is of the same order as GPT-3. Using the electricity cost per query for GPT-3, and assuming model energy consumption scales as the square root of the parameter count as in this paper [18], we estimate that it is 1.07× more expensive than GPT-3. Some say that it is only a 120B-parameter model [19]; if that is the case, the factor is 0.83×, i.e. slightly cheaper than GPT-3.
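For transparency, here is how those scaling factors are computed; a sketch assuming the square-root scaling from [18] and 175B parameters for GPT-3:

```python
import math

GPT3_PARAMS = 175e9  # GPT-3 parameter count

def energy_factor(params):
    """Energy per query relative to GPT-3, assuming energy scales as
    the square root of the parameter count [18]."""
    return math.sqrt(params / GPT3_PARAMS)

print(f"{energy_factor(200e9):.2f}")  # 1.07 for a 200B model [17]
print(f"{energy_factor(120e9):.2f}")  # 0.83 for a 120B model [19]
```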
What makes up the cost of a query?
There are three main components to the cost of running a query:
- the capex cost of the servers,
- the capex cost of the data centre and
- the running cost of the data centre.
The capex cost of the servers
For example, an Nvidia DGX-A100 server with eight A100 GPUs [9] would cost $240k. (As a sanity check: in Feb 2023, SemiAnalysis [10] quoted $195k for an 8× A100 server.) Running it for a year would cost about $2,400 in electricity (using a power consumption of 4550 W as reported by Nvidia [11]). So, for the running cost to exceed the fixed cost, the GPU server would need to run for a hundred years. But the servers will likely be replaced by the next generation of GPUs, which arrives every two years, or after five years at most, so the hardware cost makes up the majority of the price.
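The payback arithmetic is simple enough to verify, again using the figures above:

```python
SERVER_PRICE_USD = 240_000  # 8x A100 DGX-class server [9]
SERVER_POWER_W = 4550       # power consumption reported by Nvidia [11]
PRICE_PER_KWH_USD = 0.06    # wholesale electricity price [6]

annual_kwh = SERVER_POWER_W / 1000 * 24 * 365
annual_electricity_usd = annual_kwh * PRICE_PER_KWH_USD
print(f"${annual_electricity_usd:,.0f}/year")  # ~$2,400/year
print(f"{SERVER_PRICE_USD / annual_electricity_usd:.0f} years")  # ~100 years to parity
```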
The capex cost of the data centre
Hyperscale data centres are very expensive to build. The construction cost for a 60MW data centre that could accommodate 10,000 of the above servers is between $420M and $770M [13]. Such a data centre has an expected life of 15 to 20 years [12].
The running cost of the data centre
The running cost of the data centre is dominated by the cost of the electricity for running the servers, network and cooling. In a modern data centre, the contribution of the network and cooling is small, certainly less than 10%.
So let’s consider a 60MW data centre that can host 10,000 servers. As discussed above, running a single server for a year costs $2,400. We assume a conservative model where we replace the servers only after 5 years (usually they are replaced after 3 years). We take the average cost of $595M for construction and a 20-year lifespan. The data centre will not operate at full capacity from the start, so we assume that we start at 1/4 capacity (2,500 servers) and add 1/4 every 5 years for 20 years. With those assumptions, the electricity cost would be $15M/year (a short calculation follows below this list).
- This assumes that server and electricity costs don’t change much over that period.
- It is likely that each newer generation of hardware is about twice as energy efficient. If we took that into account, the electricity cost would only be $2.4M/year, or less than 2% of the average cost over 15 years.
- If the costs for servers and electricity decreased over this period, the relative contribution of the infrastructure would increase, but the electricity cost would still be a small proportion.
- If servers were replaced more frequently (every two years), the contribution of electricity usage to the cost would be even lower.
(For completeness’ sake: such a 60MW data centre would use about 400M gallons of cooling water per year [14], but that would cost only about $1M/year.)
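The $15M/year figure follows from averaging over the capacity ramp; a minimal sketch of the assumptions stated above:

```python
COST_PER_SERVER_YEAR_USD = 4.55 * 24 * 365 * 0.06  # ~$2,400 (4.55 kW at $0.06/kWh)

# Capacity ramp: start at 2,500 servers, add 2,500 every 5 years over 20 years.
servers_per_phase = [2500, 5000, 7500, 10000]  # one entry per 5-year phase
avg_servers = sum(servers_per_phase) / len(servers_per_phase)  # 6,250 on average

print(f"${avg_servers * COST_PER_SERVER_YEAR_USD / 1e6:.0f}M/year")  # ~$15M/year
```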
Overall costs
On a yearly basis, we have $120M/year for the capex contribution of the servers and $30M/year for the capex contribution of the infrastructure. Consequently, more than 70% of the cost of running a query is the capex contribution of the servers, and the $15M/year for the electricity is less than 10% of the total cost of about $165M/year.
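Putting the three components together gives the cost shares quoted above:

```python
server_capex_m = 120  # $M/year, amortised GPU servers
dc_capex_m = 30       # $M/year, amortised data centre construction
electricity_m = 15    # $M/year, from the ramp model above

total_m = server_capex_m + dc_capex_m + electricity_m  # $165M/year
print(f"servers:     {server_capex_m / total_m:.0%}")  # 73%
print(f"electricity: {electricity_m / total_m:.0%}")   # 9%
```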
What this tells us is that what matters in terms of profit is to optimise the utilisation of those expensive GPUs. So when the cost per query goes down, it is likely the consequence of improved utilisation, which means more users can be supported simultaneously, rather than improved energy efficiency.
Pricing versus energy cost
Let’s consider the pricing for two popular large language models: Google’s Gemini 1.5 Pro and OpenAI’s GPT-4. Both are very recent models and similar in capabilities.
Gemini 1.5 Pro pricing
Generating 10,000 words using Gemini 1.5 Pro (10 RPM) costs ~$0.28 [15], and the cost is proportional to the number of generated words (1,000 words is ~$0.028, 100 words is ~$0.003).
GPT-4 pricing
OpenAI charges between $0.030 and $0.120 per 1,000 tokens on GPT-4 [16] depending on the context length. The $0.030 is for GPT-4 Turbo, which is likely smaller than GPT-4.
The price is much higher than the energy cost
From the above data on cost and pricing of the models, we can calculate the price/cost ratios.
- For Gemini 1.5 Pro, the ratio is $0.028 / ($0.0018 × 1.07) = 14.5×; in other words, the price is about 15× higher than the electricity cost. If the model has only 120B parameters, the ratio would be 18.8×.
- For GPT-4, the ratios are {0.030/1.5, 0.060/3, 0.120/3} / 0.0018 = {11.1×, 11.1×, 22.2×}; in other words, the price is between 11× and 22× higher than the electricity cost.
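For reference, the ratios can be reproduced as follows (treating 1,000 tokens as roughly 1,000 words, as in the estimates above):

```python
ELEC_COST_PER_1K_WORDS_USD = 0.0018  # GPT-3 electricity cost per 1,000 words

# Gemini 1.5 Pro: $0.028 per 1,000 words, 1.07x the energy of GPT-3 (200B model)
print(f"{0.028 / (ELEC_COST_PER_1K_WORDS_USD * 1.07):.1f}x")  # 14.5x

# GPT-4: price per 1,000 tokens [16] and energy factor relative to GPT-3 [5]
for price_usd, energy_factor in [(0.030, 1.5), (0.060, 3), (0.120, 3)]:
    ratio = price_usd / (ELEC_COST_PER_1K_WORDS_USD * energy_factor)
    print(f"{ratio:.1f}x")  # 11.1x, 11.1x, 22.2x
```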
These figures are consistent with the relative cost contributions: with the GPU server and data centre capex cost about 10× larger than the electricity cost, the price should indeed be more than 10× that of the electricity consumption.
As shown above, the electricity consumption does not contribute much to the overall cost. Therefore, Google and OpenAI don’t have a huge incentive to prioritise increasing energy efficiency. The main incentive is to increase utilisation. A higher utilisation means lower energy consumption per query and also a smaller contribution of the capex per query. But it also means a higher overall energy consumption.
It’s also worth noting that the drop in prices can’t be explained by utilisation or energy efficiency gains: going from 50% utilisation to 100% would reduce the capex contribution by a factor of two and the energy consumption per query by about 30% (see the sketch below). And based on the above estimates, none of the models have improved dramatically in terms of energy efficiency. So most of the price drop is due to increased competition.
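To see where that ~30% figure comes from: if we assume a server draws a fixed idle power of roughly 43% of peak (my assumption for illustration, not a measured figure) plus a dynamic part proportional to utilisation, doubling the load from 50% to 100% reduces the energy per query by about 30%:

```python
IDLE_FRACTION = 0.43  # assumed idle power as a share of peak; illustrative only

def energy_per_query(utilisation):
    """Relative energy per query: power draw divided by query throughput,
    with power modelled as idle + utilisation * (peak - idle)."""
    power = IDLE_FRACTION + utilisation * (1 - IDLE_FRACTION)
    return power / utilisation

saving = 1 - energy_per_query(1.0) / energy_per_query(0.5)
print(f"{saving:.0%}")  # ~30% less energy per query at full utilisation
```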
A note on the emissions
There are two main components to the emissions for a query: the electricity use and the emissions from manufacturing the server. We have created a detailed life cycle analysis model for the GPU servers in an AI data centre [20] and calculated the embodied carbon emissions and emissions from use for hardware replacement cycles of 2, 3 and 5 years. The results depend on many assumptions, but the conclusion is robust: embodied carbon from manufacturing the servers will be of the same order as the emissions from running them. Replacing the servers sooner with newer hardware does not change the overall picture much.
This is mainly because of the strong growth in demand for AI data centres, which leads to production of ever increasing amounts of hardware, and the increased energy efficiency of the new hardware does not make up for this growth. I have used a figure of 22% growth per year as per the analysis by McKinsey [21]. So although the energy efficiency of the hardware increases with every generation, the combination of embodied carbon emissions and emissions from use resulting from the growth in demand results in a huge increase in emissions. For a more detailed discussion on the growth projections of the demand for AI data centres and the concomitant emissions, please read my article “The real problem with the AI hype” [22].
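To illustrate how quickly 22% annual growth compounds:

```python
GROWTH_RATE = 0.22  # annual growth in AI data centre demand [21]

capacity = 1.0
for _ in range(10):
    capacity *= 1 + GROWTH_RATE
print(f"{capacity:.1f}x after 10 years")  # ~7.3x the current capacity
```

That is a seven-fold increase in capacity within a decade, each increment requiring newly manufactured hardware with its own embodied emissions.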
Conclusion
For both Gemini 1.5 Pro and GPT-4, we see that their energy consumption is still of the order of GPT-3, and even with current low prices, the price is more than ten times the energy cost. This is because the high cost and relatively short lifetime of the GPU servers makes up most of the total cost of running a query. Of course the argument is that both models are more capable than GPT-3. But the point is that large-scale deployment of these models leads to unacceptably high and rapidly increasing CO₂ emissions.
From a climate change perspective, energy efficiency gains are only really meaningful if they result in a reduction of the overall emissions. That is clearly not the case. And the low price is likely to make this only worse, as it will drive adoption and further growth in data centres and so increase both embodied carbon and runtime emissions.
References
[1] “Emissions Gap Report 2024”, UN Environment Programme, 24 October 2024, retrieved 17 January 2025
[2] “Things we learned about LLMs in 2024”, Willison S., 31 December 2024, retrieved 17 January 2025
[5] “GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE Demystifying GPT-4: The engineering tradeoffs that led OpenAI to their architecture”, Patel D. and Wong G., 10 July 2023, retrieved 17 January 2025
[6] “Cincinnati I Data Center Attributes”, H5 Data Centers, 2024, retrieved 17 January 2025
[7] “Estimating the Increase in Emissions caused by AI-augmented Search”, Vanderbauwhede W., 6 January 2025, retrieved 17 January 2025
[8] “ChatGPT Statistics — The Key Facts and Figures”, Walsh M., 22 April 2024, retrieved 17 January 2025
[9] “BIZON G9000 – 4x 8x NVIDIA A100, H100, H200 Tensor Core AI GPU Server with AMD EPYC, Intel Xeon”, BIZON, retrieved 17 January 2025
[10] “The Inference Cost Of Search Disruption – Large Language Model Cost Analysis $30B Of Google Profit Evaporating Overnight, Performance Improvement With H100 TPUv4 TPUv5”, Patel D. and Ahmad A., 9 February 2023, retrieved 17 January 2025
[11] “Energy and Power Efficiency for Applications on the Latest NVIDIA Technology”, Gray A., GTC 2024, 20 March 2024, retrieved 17 January 2025
[12] “The data center life story”, Judge P., 21 July 2017, retrieved 17 January 2025
[13] “How Much Does it Cost to Build a Data Center?”, Zhang M., 5 November 2023, retrieved 17 January 2025
[14] “Water-guzzling data centres”, Ashtine M. and Mytton D., retrieved 17 January 2025
[15] “Gemini Pro API Pricing Calculator”, InvertedStone, retrieved 17 January 2025
[16] “How much does GPT-4 cost?”, OpenAI, 2024, retrieved 17 January 2025
[17] “Google Gemini PRO 1.5: All You Need To Know About This Near Perfect AI Model”, Shittu H., 9 September 2024, retrieved 17 January 2025
[18] “Measuring and Improving the Energy Efficiency of Large Language Models Inference”, Argerich M. and Patiño-Martínez M., IEEE Access, 2024, Vol. 12, 5 June 2024
[19] “Discussion: Gemini 1.5 May Technical paper”, 2024, retrieved 17 January 2025
[20] “LCA model for servers in a data centre”, Vanderbauwhede W., 16 January 2025, retrieved 17 January 2025
[21] “AI power: Expanding data center capacity to meet growing demand”, McKinsey & Company, 29 October 2024, retrieved 17 January 2025
[22] “The real problem with the AI hype”, Vanderbauwhede W., 16 January 2025, retrieved 17 January 2025