Why Your AI Costs Are So High: Understanding Tokens and Context Windows
Written By
Vishal Soni
Dec 2, 2025
11 Min Read
Master the economics of LLMs by understanding tokens and context windows. Learn why quadratic scaling makes context stuffing expensive and slow, and how to treat your context window as precious real estate for cost-effective AI deployment.
TL;DR for Executives: LLMs bill you for both input and output tokens (not words), and the compute behind each response scales quadratically with context length: doubling your prompt length roughly quadruples the computational cost. Understanding tokens and context windows is critical for controlling AI costs and latency. The "context stuffing" strategy (dumping entire documents into prompts) is the fastest way to make your AI application unusable.
In our previous article on LLM word prediction, we established that an LLM is a prediction engine, not a person. Now we must turn to the practical realities of paying for that engine, and to why, in the world of AI, more is rarely better and often much worse.
When a developer rushes in, excited to "paste the entire project documentation" into an AI model to get a quick answer, they are falling for a common misconception. They assume the model reads like a human: that reading 100 pages just takes 100 times longer than reading one page.
It doesn't.
In reality, the math of attention is more complex. Doubling the input doesn't double the computational cost; it quadruples it.
To manage AI in production, you must understand the two fundamental constraints of the architecture: Tokens (the currency) and Context Windows (the budget).
The Unit of Account: It's Not Words, It's Tokens
We tend to measure communication in words. LLMs measure it in tokens.
A token is the actual, granular unit of text the model processes [1]. It might be a complete word ("apple") or just a fragment of one ("ing", "the_"). As a rough estimate, 1,000 tokens translate to approximately 750 words of ordinary English text.
Why does this distinction matter to a business leader? Because you are billed on both input and output tokens.

You can explore token counts in real-time at the OpenAI tokenizer.
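If you would rather check counts programmatically, here is a minimal sketch using tiktoken, OpenAI's open-source tokenizer library. The encoding name below is an assumption; the correct encoding depends on which model you target.

```python
# Minimal sketch: count tokens with tiktoken.
# "cl100k_base" is a common OpenAI encoding, used here as an assumption;
# check which encoding your target model actually uses.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Understanding tokens keeps AI costs predictable."
token_ids = enc.encode(text)

print(f"Characters: {len(text)}")
print(f"Tokens:     {len(token_ids)}")
print("Token pieces:", [enc.decode([t]) for t in token_ids])
```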
Every time you send a chat message to an LLM, you are not just paying for the answer. You are paying for the model to "read" the entire conversation history sent along with it [2]. If you have a chatbot session that has been running for an hour, you are re-sending the start of that conversation with every single new message.
This hidden "re-reading" cost accumulates silently until you receive a bill that makes no sense for the perceived volume of work. Current API pricing varies significantly by model; for instance, GPT-5 mini costs $0.25 per 1M input tokens and $2.00 per 1M output tokens, while more capable models cost substantially more [1].
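To make the re-reading cost concrete, here is a back-of-the-envelope sketch of a 50-turn chat session. Every number in it, the message sizes, system prompt length, and per-token prices, is an illustrative assumption rather than a quote for any particular model.

```python
# Back-of-the-envelope sketch of how re-sending conversation history inflates
# input-token costs over a chat session. All sizes and prices below are
# illustrative assumptions.
INPUT_PRICE_PER_M = 0.25   # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 2.00  # USD per 1M output tokens (assumed)

system_prompt_tokens = 500   # assumed system prompt size
tokens_per_user_msg = 200    # assumed average user message size
tokens_per_reply = 300       # assumed average model reply size
turns = 50                   # length of the chat session

history = system_prompt_tokens
total_input = 0
total_output = 0

for _ in range(turns):
    history += tokens_per_user_msg   # the new user message joins the history
    total_input += history           # the *entire* history is re-read this turn
    total_output += tokens_per_reply
    history += tokens_per_reply      # the reply also joins the history

cost = (total_input / 1e6) * INPUT_PRICE_PER_M + (total_output / 1e6) * OUTPUT_PRICE_PER_M
print(f"Input tokens billed:  {total_input:,}")
print(f"Output tokens billed: {total_output:,}")
print(f"Estimated session cost: ${cost:.2f}")
```

Note how the input tokens dwarf the output tokens: the longer the session runs, the more of your bill goes to re-reading what has already been said.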
The Workspace: The Context Window
If the LLM is the processor, the Context Window is its RAM (short-term memory).
The context window defines the maximum amount of text the model can consider at one time to make a prediction. This includes your system instructions, the user's question, and any documents you pasted in [2].
In the ancient days (five years ago), context windows were tiny, barely enough for a long email. Today, models boast windows of 128k or even 1 million tokens [1]. This has led to a problematic engineering pattern: the "Context Stuffing Strategy."
The Strategy: Just dump the entire PDF, codebase, or customer history into the prompt.
The Failure: The model technically accepts the input, but the application becomes unusable.
The Quadratic Scaling Challenge
This is the most critical technical concept for a non-technical leader to grasp: Latency does not scale linearly with context length [2].
If you double the length of the prompt, the model does not just take twice as long to reply. The attention mechanism at the heart of transformer models has quadratic complexity: doubling the context length requires approximately four times the computational resources [3][5].
An LLM uses an attention mechanism where every token attends to every other token to understand relationships [4]. This creates an O(n²) computational pattern:
Processing 1,000 tokens is fast
Processing 100,000 tokens requires 10,000 times more compute operations [5]
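A quick arithmetic sketch makes the gap visible. It only counts the pairwise token interactions in attention and ignores every other cost of real inference, so treat it as an illustration of the scaling, not a benchmark.

```python
# Rough illustration of why attention compute grows quadratically with context
# length. This compares only the n^2 pairwise-attention term.
def attention_pairs(n_tokens: int) -> int:
    return n_tokens * n_tokens  # every token attends to every other token

small = attention_pairs(1_000)
large = attention_pairs(100_000)

print(f"1,000-token prompt:   {small:,} token-pair interactions")
print(f"100,000-token prompt: {large:,} token-pair interactions")
print(f"Ratio: {large // small:,}x more attention work for 100x more input")
```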
Modern optimizations like FlashAttention have improved memory efficiency, reducing memory complexity from quadratic to linear, but they don't eliminate the fundamental computational cost. They make the process more efficient through better hardware utilization while maintaining numerically identical outputs.
We have seen enterprise chat agents where adding a few hundred words to the system prompt resulted in seconds of additional delay, making the agent feel sluggish and broken.
The Strategic Imperative: Treat Context as Precious Real Estate
Your system prompt, the set of instructions defining your AI's behavior, occupies part of this space on every single turn. Think of your context window as precious real estate with limited space and high carrying costs.
You cannot afford to waste your context window on irrelevant data. If you fill the window with a 50-page HR manual just to answer a question about holiday leave, you are paying for the model to process 49 pages of irrelevant information. You are:
Increasing latency (quadratic computational cost)
Increasing cost (re-reading all tokens on every turn)
Decreasing accuracy (longer contexts with irrelevant information make it harder for models to identify salient details) [6]
This creates a fundamental challenge: We have massive datasets, but a limited, expensive context window that grows increasingly costly as it fills [2].
How do we solve this? We don't teach the model everything. We don't stuff the window. We optimize through strategic retrieval.
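As a taste of what "strategic retrieval" means in practice, here is a deliberately minimal sketch that scores document sections by keyword overlap and builds a prompt from only the best matches. The scoring function and the example sections are illustrative assumptions; production systems typically use embedding-based retrieval, which we cover in the upcoming RAG article.

```python
# Minimal "retrieve, then prompt" sketch: send only the relevant sections,
# not the whole manual. Keyword-overlap scoring is a simplification.
def score(section: str, query: str) -> int:
    query_words = set(query.lower().split())
    return sum(1 for word in section.lower().split() if word in query_words)

def build_prompt(sections: list[str], query: str, top_k: int = 2) -> str:
    # Keep only the few most relevant sections instead of the entire document.
    relevant = sorted(sections, key=lambda s: score(s, query), reverse=True)[:top_k]
    context = "\n\n".join(relevant)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

manual_sections = [
    "Holiday leave: employees accrue 2 days of paid leave per month...",
    "Expense policy: submit receipts within 30 days of purchase...",
    "Remote work: employees may work remotely up to 3 days per week...",
]

print(build_prompt(manual_sections, "How many days of holiday leave do I get?"))
```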
Key Takeaways: Managing Tokens & Context Windows
Now that you understand the economics of LLM processing, here's how to apply this knowledge:
Audit Your Token Usage: Use tools like the OpenAI tokenizer to understand exactly what you're paying for. That "simple" chatbot might be re-sending the entire conversation history with every message.
Treat Context as Precious Real Estate: Don't dump entire documents into prompts. Only include the minimum necessary information to answer the query.
Design for Retrieval, Not Stuffing: Instead of pasting 50-page manuals, implement strategic retrieval systems that pull only the relevant sections.
Monitor Latency, Not Just Cost: Quadratic scaling means your app can become unusably slow before it becomes expensive. Set latency budgets (e.g., "responses must be under 2 seconds"); see the timing sketch after this list.
Optimize System Prompts: Every word in your system prompt is re-processed on every turn. Keep instructions concise and essential.
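Here is a minimal sketch of what a latency budget can look like in code. The two-second threshold and the call_model stand-in are assumptions; wrap it around whichever client your application actually uses.

```python
# Minimal sketch of enforcing a latency budget around any LLM call.
# `call_model` is a hypothetical stand-in for your real API client.
import time

LATENCY_BUDGET_S = 2.0  # assumed product requirement: responses under 2 seconds

def call_with_budget(call_model, prompt: str):
    start = time.perf_counter()
    reply = call_model(prompt)
    elapsed = time.perf_counter() - start
    if elapsed > LATENCY_BUDGET_S:
        print(f"WARNING: response took {elapsed:.2f}s, over the {LATENCY_BUDGET_S:.1f}s budget")
    return reply

# Example with a fake model call so the sketch runs standalone.
print(call_with_budget(lambda p: f"(fake reply to: {p})", "What is our holiday policy?"))
```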
The next frontier in AI cost optimization is Retrieval-Augmented Generation (RAG), the technique that allows you to work with massive knowledge bases while keeping context windows lean, fast, and cost-effective. We'll explore this in an upcoming article.
Frequently Asked Questions (FAQ)
Q: What are tokens in LLMs and why should business leaders care?
A: Tokens are the granular units of text that LLMs process, which can be complete words like "apple" or fragments like "ing" or "the_". Approximately 1,000 tokens translate to 750 words of English text. Business leaders should care because you're billed for both input and output tokens, meaning every time your chatbot re-reads the conversation history, you're paying for it again.
Q: What is quadratic scaling and how does it impact my AI costs?
A: Quadratic scaling (O(n²)) means that when you double the input length, the computational requirements increase by four times, not just two times. This occurs because every token in the attention mechanism must attend to every other token to understand relationships. In practical terms: a 1,000-token prompt is fast, but a 100,000-token prompt requires 10,000 times more compute operations.
Q: Why does "context stuffing" (pasting entire documents) make my AI application slow?
A: Context stuffing creates three problems: (1) Increased latency due to quadratic computational cost, (2) Increased cost from re-reading all tokens on every turn, and (3) Decreased accuracy because longer contexts with irrelevant information make it harder for models to identify salient details. Adding a few hundred words to your system prompt can result in seconds of additional delay.
Q: How do I reduce token costs without sacrificing quality?
A: Implement strategic retrieval instead of context stuffing. Rather than pasting a 50-page HR manual to answer a question about holiday leave, use a retrieval system to pull only the relevant 2-3 paragraphs. This keeps your context window lean, your latency low, and your costs manageable while maintaining accuracy.
Q: What's the difference between context window size and actual usage?
A: Context window size (e.g., 128k tokens) is the maximum the model can handle. Actual usage is what you're currently sending. Just because a model supports 1 million tokens doesn't mean you should use them all. Every token you send adds to your bill, and latency grows quadratically with context length. Treat context window capacity like a budget: just because you have it doesn't mean you should spend it.
Citations
[1] OpenAI - GPT-5 mini (https://platform.openai.com/docs/models/gpt-5-mini)
[2] Maxim AI - Context Window Management Strategies (https://www.getmaxim.ai/articles/context-window-management-strategies-for-long-context-ai-agents-and-chatbots/)
[3] Adaline Labs - Understanding Attention Mechanisms in LLMs (https://labs.adaline.ai/p/understanding-attention-mechanisms)
[4] DataCamp - Attention Mechanism in LLMs: An Intuitive Explanation (https://www.datacamp.com/blog/attention-mechanism-in-llms-intuition)
[5] arXiv - Efficient Attention Mechanisms for Large Language Models: A Survey (https://arxiv.org/abs/2507.19595)
[6] arXiv - Core Context Aware Transformers for Long Context Modeling (https://arxiv.org/html/2412.12465v2)



