
Tokens are the unit of account for every LLM you run — and the single biggest driver of what enterprise AI actually costs. Here is how tokens work, why output costs 3-10× input, and five strategies that cut a typical Malaysian AI bill in half.
Every conversation your users have with ChatGPT, Claude, Gemini or any other large language model gets billed in a single unit — tokens. It is not a query, not a message, not a session. It is tokens in, tokens out. And once you understand what that means, your AI cost line on the P&L starts to look very different.
This post walks through what tokens actually are, why output tokens cost 3–10× what input tokens cost, and the five cost-optimisation plays every Malaysian enterprise running LLMs at scale should be using in 2026.

What exactly is a token?
A token is a chunk of text the model processes in one step: usually a word, part of a word, a punctuation mark or a piece of whitespace. As a rough rule for English and Bahasa Malaysia, 100 tokens ≈ 75 words, roughly one short paragraph of plain text. Code tokenises more densely (an identifier like getUserById might be 3 tokens), and Chinese and other non-Latin scripts can run at roughly 1–2 tokens per character.
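The 75-words rule can be turned into a quick budgeting estimator. This is a rough heuristic, not a real tokenizer; the 4-characters-per-token ratio is an approximation for English and Bahasa Malaysia prose, and exact counts come from the provider's own tokenizer (e.g. OpenAI's tiktoken):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English/BM prose.

    Code and non-Latin scripts tokenise differently, so treat this as a
    budgeting heuristic only; use the provider's tokenizer for exact counts.
    """
    return max(1, round(len(text) / 4))
```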
Two buckets matter for billing:
- Input tokens — everything you send to the model: the system prompt, the user's question, retrieved documents, conversation history, images in multimodal calls.
- Output tokens — everything the model writes back to you.
Why output tokens cost 3–10× more than input tokens
Every major provider prices output at a premium — the median across the 2026 market sits around 4× the input rate. For flagship models it can be 5–6×. Why?
Technically, output generation is sequential. The model can process the entire input prompt in parallel in a single forward pass (the prefill), but it must generate output one token at a time, running the full network at every step. That per-token compute cost is real, and providers pass it through.
Practically, this means your cost line is dominated by whatever you ask the model to write. A 10,000-token prompt with a 500-token answer is cheaper than a 500-token prompt that asks the model to generate a 5,000-token article. Most Malaysian AI teams underestimate this until they see the first bill.
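The asymmetry is easy to see with a back-of-envelope cost function. The $3 / $15 per-million prices below are illustrative mid-tier numbers, not a quote from any provider:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """Cost of one call in USD; prices are per million tokens."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Long prompt, short answer: 10,000 in / 500 out at $3 / $15 per million.
big_prompt = call_cost(10_000, 500, 3.0, 15.0)   # $0.0375
# Short prompt, long answer: 500 in / 5,000 out at the same rates.
big_answer = call_cost(500, 5_000, 3.0, 15.0)    # $0.0765, roughly double
```

Twenty times fewer input tokens, but double the cost, because the spend follows the output side.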

The three pricing tiers in 2026
The 2026 LLM market sorts cleanly into three tiers, and knowing which tier each of your workloads belongs in is the single biggest cost lever you have:
- Budget tier — GPT-4.1 Nano, Gemini 2.0 / 2.5 Flash, Mistral Small, open-weight Llama 3.3. Input as low as $0.10 per million tokens. Fast, cheap, good for classification, extraction, summarisation and most chat-style tasks that do not need deep reasoning.
- Mid-tier — Claude Sonnet 4.x, GPT-5.1, Gemini 2.5 Pro. Roughly $2–3 input, $10–15 output per million tokens. The sweet spot for most enterprise workloads — reasoning, code generation, multi-step tasks.
- Premium flagship — Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro. $5+ input, up to $30 output per million tokens. Reserve for coding agents, deep research, multi-file refactors, agentic workloads where a one-off wrong answer is expensive.
A real-world enterprise distribution looks like 70% of traffic on budget, 20% on mid-tier and 10% on premium. Teams that route every query to a flagship model because "it's smarter" routinely overspend by 60–80%.
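The 70/20/10 split translates directly into a blended rate. Using illustrative input prices of $0.10, $3 and $6 per million tokens for the three tiers (assumed figures, not quotes), the arithmetic behind the 60–80% overspend claim looks like this:

```python
# Illustrative input prices (USD per million tokens) and traffic mix.
PRICE = {"budget": 0.10, "mid": 3.00, "flagship": 6.00}
MIX = {"budget": 0.70, "mid": 0.20, "flagship": 0.10}

blended = sum(MIX[t] * PRICE[t] for t in PRICE)   # $1.27 per million tokens
all_flagship = PRICE["flagship"]                  # $6.00 if everything goes premium
savings = 1 - blended / all_flagship              # ~79% cheaper with routing
```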
Five strategies that cut enterprise LLM cost

1. Intelligent model routing
Do not pick one model. Route each query to the cheapest model that can handle it. A classifier (the budget model itself, ironically) decides whether a query is simple enough for the budget tier or deserves the mid-tier or flagship. Most Malaysian enterprises we implement this for see per-query cost drop by 60–80%.
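A minimal router is little more than a lookup on a classifier's verdict. The classify callable and its 'simple' / 'moderate' / 'complex' labels are hypothetical; in practice the classifier would itself be a cheap budget-tier model call:

```python
from typing import Callable

def route(query: str, classify: Callable[[str], str]) -> str:
    """Return the cheapest model tier the classifier thinks can handle the query."""
    tier = {"simple": "budget", "moderate": "mid", "complex": "flagship"}
    # Fall back to mid-tier on an unexpected label: a wrong-but-cheap answer
    # usually costs more downstream than a slightly-too-expensive call.
    return tier.get(classify(query), "mid")
```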
2. Prompt caching
Every major provider now supports prompt caching — reuse the same system prompt or retrieved document across calls and pay a fraction of the normal rate. Anthropic's implementation can save up to 90% on cached tokens. If you have a 5,000-token system prompt that is identical across thousands of calls, this one change alone is transformative.
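The savings compound quickly across repeated calls. The sketch below assumes Anthropic-style rates, where a cache write carries a premium (about 1.25×) and cache reads are discounted about 90%; exact rates vary by provider:

```python
def cached_input_cost(calls: int, prompt_tokens: int, price: float,
                      read_discount: float = 0.90,
                      write_premium: float = 1.25) -> float:
    """Input cost in USD for `calls` requests sharing one cached prompt.

    Assumes the first call writes the cache at a premium and every later
    call reads it at a discount. Rates are illustrative, not a quote.
    """
    full = prompt_tokens / 1e6 * price  # uncached cost of one prompt
    return full * write_premium + (calls - 1) * full * (1 - read_discount)
```

For a 5,000-token prompt at $3 per million input across 1,000 calls, this drops the input bill from about $15 to about $1.52 under these assumed rates.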
3. Batch processing
For anything that does not need to be real-time — report generation, bulk classification, overnight enrichment — use the batch APIs. OpenAI's Batch API discounts all models by 50%; Anthropic's Message Batches match that structure. Trade 24-hour latency for half the bill. Obvious win for back-office workloads.
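A batch job is just a JSONL file of requests. The sketch below builds one in the shape OpenAI's Batch API expects, one JSON object per line with a custom_id, method, url and request body; the model name is a placeholder:

```python
import json

def build_batch_file(prompts: list[str], model: str,
                     path: str = "batch.jsonl") -> None:
    """Write one Batch API request per line, ready for upload."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            request = {
                "custom_id": f"req-{i}",  # your key for matching results later
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(request) + "\n")
```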
4. Aggressive RAG optimisation
The most common waste we see in Malaysian enterprise AI implementations: retrieval-augmented generation pipelines that stuff 4–8 long documents into every prompt "just in case". Limit retrieval to 2–3 shorter chunks, rank aggressively, and truncate to just the relevant section. This cuts input tokens by more than half with no measurable loss in answer quality.
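The "rank aggressively and truncate" advice can be expressed as a small packing function. The 4-characters-per-token estimate and the default budget numbers are assumptions for illustration:

```python
def pack_context(chunks: list[str], scores: list[float],
                 max_chunks: int = 3, token_budget: int = 1500) -> list[str]:
    """Keep only the top-ranked chunks that fit within a token budget."""
    est = lambda text: len(text) // 4  # rough tokens-per-chunk estimate
    ranked = [chunk for _, chunk in sorted(zip(scores, chunks), reverse=True)]
    picked, used = [], 0
    for chunk in ranked[:max_chunks]:
        cost = est(chunk)
        if used + cost > token_budget:
            break  # stop stuffing the prompt once the budget is spent
        picked.append(chunk)
        used += cost
    return picked
```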
5. Budget for hidden costs
The headline price-per-token is not the full picture. A realistic total budget is around 1.7× your base token calculation once you factor in long-context surcharges, reasoning-token overhead on models that "think" before answering (o-series, Claude with extended thinking), image token costs in multimodal calls, and data-residency premiums for Malaysian customers who insist on BNM / PDPA-resident hosting.
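The 1.7× figure can be decomposed into stacked per-factor multipliers. The individual numbers below are hypothetical illustrations of how the overheads compound; only the combined ~1.7× rule of thumb comes from the discussion above:

```python
def realistic_budget(base_usd: float,
                     long_context: float = 1.15,  # hypothetical surcharge factors
                     reasoning: float = 1.25,
                     multimodal: float = 1.05,
                     residency: float = 1.12) -> float:
    """Scale a base token-cost estimate by stacked overhead multipliers."""
    return base_usd * long_context * reasoning * multimodal * residency
```

With these illustrative factors the combined multiplier works out to about 1.69, in line with the ~1.7× rule of thumb.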
What this means for Malaysian enterprises
The short version: LLM cost is a design problem, not a procurement problem. Your architecture choices — what you cache, what you batch, how you retrieve, which model tier handles which workload — dominate the bill. Negotiating a 10% discount from a vendor is a rounding error compared to cutting 70% of your token spend through better design.
For production deployments in regulated Malaysian sectors (banking, insurance, fintech), add one more dimension: data residency. BNM RMiT and PDPA alignment often mean Azure OpenAI Malaysia, a private Llama or Mistral deployment, or careful routing of which prompts leave Malaysia. Each carries a cost premium versus a pure public-cloud deployment, and that should be in the business case from day one.
Where to start
If you are running any LLM workload in production today, a 30-minute audit usually surfaces 2–3 obvious optimisations. Talk to Symprio and we will walk through your current pipeline and show where the tokens are going — whether we end up working together or not.
Symprio builds and operates LLM workloads for Malaysian banks, insurers, fintechs and shared-services centres across Azure OpenAI, Anthropic Claude, Google Gemini and open-source Llama / Mistral. Learn more about our Agentic AI practice.
Imagery via Pexels, used under the Pexels Free License.