Reducing LLM costs - don’t ask them to do everything!

There’s still a lot of confusion about agentic AI, including a common perception that agents are general-purpose problem solvers. They are not, and assuming otherwise can lead to erroneous results and exploding costs, particularly when LLMs are asked to perform non-language tasks.

The confusion is rooted in a misunderstanding of what an agent is and what a Large Language Model (LLM) is. This article aims to clarify the difference and, through a real experiment, show how choosing the right architecture reduced LLM costs by over 96%.

An LLM’s job is to interpret and generate language

A Large Language Model’s core capability is in interpreting and generating language. Trained on large-scale text corpora, an LLM builds a high-dimensional representation of language patterns within its neural network that allows it to associate words with each other based on context. For example, in the sentence “I read the paper”, the LLM will associate the word paper with newspapers, magazines, journals etc., whereas in the sentence “it was made of paper” the word has a completely different meaning and may instead be associated with materials like wood or cardboard.

An LLM does not continue to learn once it is deployed

An LLM can interpret language only against the data it was trained on. Large language models are trained on data up to a specific date, after which their knowledge is frozen. If you prompt an LLM for today’s news headlines, a well-designed model will typically acknowledge its knowledge cutoff and decline to answer, though in some cases it may hallucinate (make something up). Questions about events, discoveries, or changes after the cutoff may produce outdated or incorrect responses.

Training cutoff dates for Anthropic Sonnet LLMs

Model                          Training Data Cutoff
Claude 3 Sonnet                August 2023
Claude 3.5 Sonnet (original)   April 2024
Claude Sonnet 4.6              August 2025

Much of the work of an agent is spent calling tools under the instruction of an LLM

Considering that an LLM “only” interprets and generates language, how does something like ChatGPT provide such meaningful and relevant results? The answer is additional tools, each of which comes with a detailed description. For example, ChatGPT might have access to a Web Search tool, a Weather tool and a Stock Lookup tool. Each tool has a well-documented definition (like an instruction manual) that describes in clear language what the tool does, how to use it and what response it returns. In fact, a standard called the Model Context Protocol (MCP) has emerged for describing tools so they can be discovered by LLMs.
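To make the "instruction manual" idea concrete, here is a minimal sketch of what a tool definition might look like. The field names (name, description, inputSchema) follow the general shape MCP uses when listing tools, as I understand it; the tool name get_current_weather and its wording are hypothetical and not taken from any real service.

```python
import json

# Hypothetical tool definition, written as a plain dictionary.
# The LLM never sees the tool's code, only this description of it.
weather_tool_definition = {
    "name": "get_current_weather",  # assumed name, for illustration only
    "description": (
        "Returns the current weather for a named city: temperature in "
        "degrees Celsius, wind speed in mph and wind direction as a "
        "compass heading."
    ),
    # JSON Schema describing the arguments the tool accepts.
    "inputSchema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. London"}
        },
        "required": ["city"],
    },
}

# Serialised like this, the definition is simply more text in the LLM's context.
definition_as_text = json.dumps(weather_tool_definition, indent=2)
print(definition_as_text)
```

Note that every definition handed to the LLM is itself input text, so a large toolkit adds tokens to every request before any tool has been called.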

Because an LLM can interpret language, it is able to decipher the user’s prompt, match it to the description provided by a tool and then ask the agent to call that tool and return its response.

This can be illustrated via the weather example. A typical interaction might be:

  1. User prompts ChatGPT: “What’s the weather in London?”
  2. ChatGPT, acting as an agent, provides its LLM with the user’s prompt and descriptions of the tools available in its kit bag (e.g. “This weather service provides the current weather when given the name of a city. It returns the current temperature in degrees Celsius, wind speed in mph and wind direction as a compass heading”)
  3. Using its language interpretation skills, ChatGPT’s LLM semantically matches the user’s prompt with the tool descriptions and associates the weather tool with the user’s request
  4. ChatGPT invokes the weather service and retrieves the response
  5. ChatGPT’s LLM formats the result into a natural language response
  6. It returns the response to the user
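The six steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not how ChatGPT is actually built: the two fake_llm_* functions are crude stand-ins for the LLM (keyword matching and a template instead of real language understanding), and get_weather is a hypothetical stub that returns canned data.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Tool:
    name: str
    description: str  # the plain-language "instruction manual" the LLM reads
    run: Callable[[str], dict]


def get_weather(city: str) -> dict:
    # Hypothetical stub with canned values; a real tool would call a weather service.
    return {"city": city, "temp_c": 14, "wind_mph": 9, "wind_dir": "SW"}


TOOLS = [
    Tool(
        name="weather",
        description=(
            "Provides the current weather when given the name of a city. "
            "Returns temperature in degrees Celsius, wind speed in mph and "
            "wind direction as a compass heading."
        ),
        run=get_weather,
    ),
]


def fake_llm_select(prompt: str, tools: List[Tool]) -> Optional[Tuple[Tool, str]]:
    """Stand-in for steps 2-3. A real LLM matches semantically; this checks keywords."""
    city = re.search(r"\bin ([A-Z][A-Za-z]+)", prompt)
    for tool in tools:
        if tool.name in prompt.lower() and city:
            return tool, city.group(1)
    return None


def fake_llm_format(result: dict) -> str:
    """Stand-in for step 5: turn structured data into a sentence."""
    return (
        f"It is {result['temp_c']} degrees Celsius in {result['city']} "
        f"with a {result['wind_mph']} mph wind from the {result['wind_dir']}."
    )


def answer(prompt: str) -> str:
    selection = fake_llm_select(prompt, TOOLS)  # steps 2-3: pick a tool
    if selection is None:
        return "Sorry, I have no tool for that."
    tool, argument = selection
    result = tool.run(argument)                 # step 4: the agent calls the tool
    return fake_llm_format(result)              # steps 5-6: format and return


print(answer("What is the weather in London?"))
```

In a real agent, the selection and formatting steps are each an LLM call that is billed by the token, while the tool call in the middle is ordinary code.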

What this shows is that the LLM is only responsible for interpreting and generating language and that a lot of the work is actually being performed by an agent’s toolkit. The key to reducing costs is to use an LLM only where language understanding is required. Everything else — particularly deterministic, repeatable, or computational tasks — should be handled outside of the agentic workflow.

In other words, don’t make the LLM spend time and tokens working out which tools it needs, calling them and then interpreting the results. By doing the groundwork yourself you can dramatically cut costs. Here’s a worked example to illustrate.

A Worked Example — Reducing LLM Costs by 96%

In this worked example I call an LLM directly (i.e. not via ChatGPT or claude.ai) — this allows me to report on token usage and thus cost, and compare scenarios where I provide tools versus no tools.

As an example we consider a user who wants an analysis of a portfolio of stocks and shares. We test three different architectures to satisfy this request:

  • Agent A (no tools) — the LLM is given only the user’s question and must rely entirely on its training data to estimate stock prices. It produces an answer quickly and cheaply, but the prices are invented — this is hallucination in practice.
  • Agent B (web search) — the LLM is given a web search tool so it can look up live prices itself. It returns a correct answer, but in doing so it retrieves and processes large volumes of search result text, flooding the context with tokens it largely doesn’t need.
  • Agent C (smart architecture) — stock prices are fetched deterministically in code using a free market data API, portfolio values are calculated programmatically, and only the resulting figures are passed to the LLM. The LLM is asked to do only what an LLM should do: reason about the data and respond in natural language.
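As a rough sketch of the Agent C approach, the code below does the retrieval and arithmetic deterministically and builds a compact prompt containing only the resulting figures. It is not the code used in the experiment: fetch_price_stub stands in for the market data API, and the tickers, prices and holdings are made-up placeholders.

```python
from typing import Callable, Dict, List, Tuple


def fetch_price_stub(symbol: str) -> float:
    # Hypothetical placeholder prices; a real version would call a market data API.
    placeholder_prices = {"AAA": 100.0, "BBB": 50.0}
    return placeholder_prices[symbol]


def value_portfolio(
    holdings: Dict[str, int], fetch_price: Callable[[str], float]
) -> Tuple[List[dict], float]:
    """Fetch prices and compute position values and weights in plain code."""
    rows = []
    for symbol, shares in holdings.items():
        price = fetch_price(symbol)
        rows.append(
            {"symbol": symbol, "shares": shares, "price": price, "value": shares * price}
        )
    total = sum(row["value"] for row in rows)
    for row in rows:
        # Guard against an empty or worthless portfolio.
        row["weight_pct"] = round(100 * row["value"] / total, 1) if total else 0.0
    return rows, total


def build_prompt(rows: List[dict], total: float) -> str:
    """Pass the LLM only the finished figures, not raw search results."""
    lines = [
        f"{r['symbol']}: {r['shares']} shares at ${r['price']:.2f} "
        f"= ${r['value']:.2f} ({r['weight_pct']}% of portfolio)"
        for r in rows
    ]
    return (
        f"Portfolio total: ${total:.2f}\n"
        + "\n".join(lines)
        + "\nComment briefly on the balance and concentration of this portfolio."
    )


rows, total = value_portfolio({"AAA": 10, "BBB": 20}, fetch_price_stub)
prompt = build_prompt(rows, total)
print(prompt)  # this short string is all the LLM needs to receive
```

Passing the price fetcher in as a parameter keeps the calculation testable without network access, and the prompt stays at a handful of lines however much data the API returned.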

The results speak for themselves:

Token usage and cost by agent architecture

Experiment               Input Tokens   Output Tokens   Total Tokens   Latency (ms)   Cost (USD)
Agent A (no tools)       56             332             388            8,860          $0.0051
Agent B (web search)     27,425         1,178           28,603         21,787         $0.0999
Agent C (LLM reasoning)  181            210             391            5,258          $0.0037

Agent C consumed just 391 tokens in total, because the data retrieval and calculation were handled entirely in code at no LLM cost whatsoever. Agent B, by contrast, consumed 28,603 tokens to answer the same question. In cost terms that is $0.0037 against $0.0999, a 96.3% reduction, and Agent C also responded about four times faster (5,258 ms against 21,787 ms).
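The figures in the table can be checked with a little arithmetic. The article does not state the per-token rates used, so the rates below are an assumption on my part: $3 per million input tokens and $15 per million output tokens happen to reproduce all three costs in the table. They also explain why Agent A costs more than Agent C despite using slightly fewer tokens: output tokens are priced higher, and Agent A produced more of them.

```python
# Assumed rates (not stated in the article); they reproduce the table's costs.
INPUT_USD_PER_MILLION = 3.00
OUTPUT_USD_PER_MILLION = 15.00

# (input tokens, output tokens) taken from the results table.
RUNS = {
    "Agent A": (56, 332),
    "Agent B": (27_425, 1_178),
    "Agent C": (181, 210),
}


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call under the assumed rates."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000


def pct_reduction(before: float, after: float) -> float:
    """Percentage saved when moving from `before` to `after`."""
    return 100 * (before - after) / before


costs = {name: call_cost(*tokens) for name, tokens in RUNS.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:.4f}")

cost_saving = pct_reduction(costs["Agent B"], costs["Agent C"])
token_saving = pct_reduction(sum(RUNS["Agent B"]), sum(RUNS["Agent C"]))
print(f"Cost reduction B to C:  {cost_saving:.1f}%")
print(f"Token reduction B to C: {token_saving:.1f}%")
```

The token reduction is slightly larger than the cost reduction because most of the tokens Agent B consumed were input tokens, which are the cheaper kind under these rates.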

It is also worth noting Agent A. On cost alone it appears competitive, but its answer is wrong — it has no access to live prices and will confidently hallucinate figures from its training data. Cheap and incorrect is not a viable architecture.

The conclusion is straightforward: reserve your LLM for language tasks, and handle everything else — data retrieval, calculation, formatting — in code. The LLM does not need to fetch a stock price; you can do that for it!