Understanding LLM Tokens, Pricing & Context Windows
What Are Tokens in LLMs?
When you send a prompt to GPT-4, Claude, or any large language model, the text does not go in as words or characters. It is first broken down into tokens — subword units that the model actually processes. Understanding tokens is essential because they determine two things that directly affect your wallet and your application: cost (you pay per token) and context limits (there is a maximum number of tokens the model can handle in a single conversation).
Tokens are neither characters nor words. They are somewhere in between. Common English words like "the," "is," and "and" are usually single tokens. Longer or less common words get split into multiple tokens: "tokenization" might become ["token", "ization"], while "pneumonoultramicroscopicsilicovolcanoconiosis" could be split into eight or more tokens. Punctuation, spaces, and special characters each consume tokens too.
A rough rule of thumb: 1 token is approximately 4 characters or 0.75 words in English. A 1,000-word English document therefore typically uses around 1,300 to 1,500 tokens, depending on vocabulary complexity. Code, non-English text, and text heavy on special characters will produce more tokens per character because the tokenizer encounters fewer patterns it can compress.
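If you only need a ballpark figure, the rule of thumb translates directly into a tiny estimator. The sketch below is an approximation, and the default of 4 characters per token is an assumption to tune per model; use a real tokenizer when you need exact counts.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 characters-per-token rule of thumb.

    Good enough for ballpark cost estimates; use a real tokenizer
    (e.g. tiktoken) when you need exact counts.
    """
    return max(1, round(len(text) / chars_per_token))

prose = "The quick brown fox jumps over the lazy dog."
print(estimate_tokens(prose))        # ~11 for this 44-character sentence
print(estimate_tokens(prose * 100))  # scales linearly with length
```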
How Tokenization Works
Under the hood, modern LLMs use algorithms like Byte Pair Encoding (BPE) or SentencePiece to build their token vocabularies. The idea is elegant: start with individual characters, then iteratively merge the most frequently co-occurring pairs until you reach a target vocabulary size (typically 32,000 to 100,000 tokens). The result is a vocabulary that efficiently encodes common patterns while still being able to represent any arbitrary text.
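To make the merge loop concrete, here is a toy BPE sketch over a handful of words. It is a deliberate simplification: real tokenizers such as tiktoken operate on bytes, pre-split text before merging, and store merge ranks, but the core idea of repeatedly fusing the most frequent adjacent pair is the same.

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a list of single-character symbols.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged symbol.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], num_merges=3))
# e.g. [('l', 'o'), ('lo', 'w'), ...] -- frequent pairs become single tokens
```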
In practice, a short English sentence of 6 words and 30 characters typically ends up as about 8 tokens once sub-word pieces, spaces, and punctuation are counted separately.
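To inspect real token boundaries yourself, you can decode each token id individually. A minimal tiktoken sketch (the example sentence and the split shown in the comment are illustrative; different models split differently):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "Tokenization splits text into sub-word pieces."
token_ids = enc.encode(text)

# Decode each token id on its own to reveal the boundaries.
pieces = [enc.decode([tid]) for tid in token_ids]
print(len(token_ids), pieces)
# Typical output looks something like:
# ['Token', 'ization', ' splits', ' text', ' into', ' sub', '-word', ' pieces', '.']
```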
Different providers use different tokenizers: OpenAI uses tiktoken (a fast BPE implementation in Rust), Anthropic uses a custom tokenizer, Meta's Llama models use SentencePiece, and Google's Gemini uses its own variant. This means the same text produces slightly different token counts across models — typically within a 5 to 15 percent range. Our calculator accounts for this by using model-specific character-to-token ratios.
Where tokenization gets expensive is with code and non-English text. A Python function might use 30 percent more tokens than the equivalent length of English prose because variable names, operators, and syntax characters each consume individual tokens. Japanese, Chinese, and Korean text can use 2 to 3 times more tokens per character than English because these characters are less represented in training-data-derived vocabularies.
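You can verify this density difference on your own inputs with a few lines of tiktoken; the prose and code snippets below are arbitrary examples, and exact ratios will vary by tokenizer and content.

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

prose = "Retry the request with exponential backoff when the server is overloaded."
code = (
    "def retry(fn, n=5):\n"
    "    for i in range(n):\n"
    "        try:\n"
    "            return fn()\n"
    "        except TimeoutError:\n"
    "            time.sleep(2 ** i)"
)

for label, text in [("prose", prose), ("code", code)]:
    n_tokens = len(enc.encode(text))
    print(f"{label}: {len(text)} chars -> {n_tokens} tokens "
          f"({len(text) / n_tokens:.1f} chars/token)")
```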
Token Limits and Context Windows
Every LLM has a context window — the maximum number of tokens it can process in a single request, including both your input prompt and the model's response. Think of it as the model's working memory. Here is how the major models compare:
| Model | Context Window | Approx. Words |
|---|---|---|
| GPT-4o | 128K tokens | ~96,000 |
| Claude 4 Opus / Sonnet | 200K tokens | ~150,000 |
| Llama 3.1 (all sizes) | 128K tokens | ~96,000 |
| Gemini 2.0 Pro | 2M tokens | ~1,500,000 |
| Gemini 2.0 Flash | 1M tokens | ~750,000 |
Here is why this matters practically: if you send a 50,000-token prompt to GPT-4o (which has a 128K context window), at most 78,000 tokens of the window remain for the response (and in practice most models also enforce a separate, much smaller output cap, around 16K tokens for GPT-4o). If you are building a chatbot, every message in the conversation history counts against this limit. Long conversations eventually hit the ceiling, and you need strategies like summarization, sliding windows, or retrieval-augmented generation (RAG) to keep going.
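A small budget check keeps this arithmetic out of your request path. In the sketch below the numbers in MODEL_LIMITS are illustrative assumptions; pull the real values from your provider's documentation.

```python
MODEL_LIMITS = {
    # Illustrative values only -- check your provider's docs for current limits.
    "gpt-4o": {"context_window": 128_000, "max_output": 16_000},
    "claude-sonnet": {"context_window": 200_000, "max_output": 8_000},
}

def response_budget(model: str, prompt_tokens: int) -> int:
    """Tokens left for the response after the prompt fills part of the window."""
    limits = MODEL_LIMITS[model]
    remaining = limits["context_window"] - prompt_tokens
    if remaining <= 0:
        raise ValueError(f"Prompt ({prompt_tokens} tokens) exceeds the context window.")
    # The response is bounded by both the remaining window and the output cap.
    return min(remaining, limits["max_output"])

print(response_budget("gpt-4o", 50_000))   # 16000: the output cap binds first
print(response_budget("gpt-4o", 120_000))  # 8000: the remaining window binds
```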
Gemini's 2-million-token context window is a genuine paradigm shift. You can feed it an entire codebase, a full book, or hours of meeting transcripts in a single prompt — something that was impossible just two years ago. But larger context windows come at a cost, both in latency (the model takes longer to process more tokens) and in dollars.
LLM Pricing Explained
LLM APIs charge per token, and they differentiate between input tokens (what you send) and output tokens (what the model generates). Output tokens are always more expensive — typically 2x to 5x the input price — because generating text requires more computation than processing it. Each output token involves a forward pass through the entire model, while input tokens can be processed in parallel.
Here is the current pricing landscape across providers:
| Model | Input / 1M tokens | Output / 1M tokens | Relative Cost |
|---|---|---|---|
| GPT-4o Mini | $0.15 | $0.60 | Cheapest tier |
| Gemini 2.0 Flash | $0.10 | $0.40 | Cheapest tier |
| Mistral Small | $0.20 | $0.60 | Budget |
| Claude Haiku | $0.80 | $4.00 | Budget |
| GPT-4o | $2.50 | $10.00 | Mid-range |
| Claude 4 Sonnet | $3.00 | $15.00 | Mid-range |
| o1 | $15.00 | $60.00 | Premium |
| Claude 4 Opus | $15.00 | $75.00 | Premium |
The true cost of an API call is often higher than the raw token math suggests. Factor in retries from rate limits or transient errors, failed parses that require re-prompting, system prompts that are sent with every request, and conversation history that grows with each turn. A realistic multiplier for production workloads is 1.3x to 2x the naive calculation.
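Putting the per-token prices and the overhead multiplier together, a cost estimate looks roughly like the sketch below. The prices are a snapshot of the table above and the 1.5x default overhead is simply a midpoint of the 1.3x to 2x range; both are assumptions to adjust for your workload.

```python
# Per-million-token prices from the table above; treat them as a snapshot,
# not a source of truth -- always check the provider's current pricing page.
PRICES_PER_M = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-4-sonnet": {"input": 3.00, "output": 15.00},
}

def call_cost(model: str, input_tokens: int, output_tokens: int,
              overhead: float = 1.5) -> float:
    """Estimated dollar cost of one call, with a production overhead multiplier
    for retries, system prompts, and growing conversation history."""
    p = PRICES_PER_M[model]
    raw = (input_tokens / 1_000_000) * p["input"] + (output_tokens / 1_000_000) * p["output"]
    return raw * overhead

# 10,000 daily calls with a 1,500-token prompt and a 300-token response:
daily = 10_000 * call_cost("gpt-4o-mini", 1_500, 300)
print(f"${daily:.2f} per day")  # roughly $6 with the 1.5x overhead factor
```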
How to Reduce Token Usage and Costs
Optimizing token usage is one of the highest-leverage activities in production AI systems. Small changes in prompt design can reduce costs by 50 percent or more without degrading quality. Here are the most effective strategies, roughly ordered by impact:
- Choose the smallest model that works. This is the single biggest lever. GPT-4o Mini is roughly 17x cheaper than GPT-4o and handles the majority of classification, extraction, and simple generation tasks just as well. Always start with the cheapest model and upgrade only when quality demands it.
- Shorten your system prompt. The system prompt is sent with every single API call. If your system prompt is 2,000 tokens and you make 10,000 calls per day, that is 20 million tokens per day just for the system prompt. Audit it ruthlessly. Replace verbose instructions with concise rules. Use examples only when they measurably improve output.
- Use few-shot examples efficiently. Two or three well-chosen examples are almost always sufficient. Ten examples rarely outperform three, but they cost 3x more in tokens. Choose examples that cover edge cases, not happy paths.
- Cache common prompt components. If multiple requests share the same system prompt or context, providers like Anthropic and OpenAI offer prompt caching that can reduce input costs by 50 to 90 percent for repeated prefixes.
- Use structured output to reduce response tokens. Asking the model to respond in JSON with a specific schema produces shorter, more parseable responses than free-form text. OpenAI's structured output mode and Anthropic's tool use both enforce schemas and reduce wasted output tokens.
- Batch similar requests. Instead of making 100 individual API calls for 100 classification tasks, combine them into a single prompt: "Classify each of the following items..." This amortizes the system prompt cost across all items (see the sketch after this list).
- Use streaming to fail fast. If you are generating long responses, stream the output and abort early if the first few tokens indicate the model has misunderstood the task. This saves the output tokens that would have been wasted on a bad response.
- Monitor usage with provider dashboards. OpenAI, Anthropic, and Google all provide usage dashboards. Set up alerts for unexpected spikes. A single bug in a retry loop can burn through hundreds of dollars in minutes.
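As an example of the batching strategy above, here is a minimal sketch that classifies several items in one request. The model choice, labels, and prompt wording are illustrative assumptions; the point is that one system prompt and one round trip cover the whole batch.

```python
from openai import OpenAI

client = OpenAI()

def classify_batch(items: list[str]) -> list[str]:
    """Classify many items in one call instead of one call per item."""
    # Number the items so the model can answer one label per line, in order.
    numbered = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(items))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Label each review as positive or negative. "
                                          "Reply with one label per line, in order."},
            {"role": "user", "content": numbered},
        ],
    )
    return response.choices[0].message.content.strip().splitlines()

labels = classify_batch(["Great product!", "Broke after a week.", "Does what it says."])
print(labels)  # e.g. ['positive', 'negative', 'positive']
```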
Token Counting in Code
For production applications, you often need to count tokens programmatically before sending requests — to check context limits, estimate costs, or truncate inputs. Here are the recommended approaches by language:
Python (OpenAI):
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Your text here")
print(f"Token count: {len(tokens)}")
```
Python (Anthropic):
```python
from anthropic import Anthropic

client = Anthropic()
result = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",  # any current Claude model id
    messages=[{"role": "user", "content": "Your text here"}],
)
print(f"Token count: {result.input_tokens}")
```
JavaScript (OpenAI):
```javascript
import { encode } from 'gpt-tokenizer';

const tokens = encode('Your text here');
console.log(`Token count: ${tokens.length}`);
```
For quick command-line checks, install tiktoken (pip install tiktoken) and call it from a short script or Python one-liner. For Claude, the Anthropic SDK exposes a token-counting method (client.messages.count_tokens) that uses the same tokenizer as the API itself.
Choosing the Right Model for Your Use Case
With so many models available, choosing the right one is a genuine engineering decision. Here is a practical decision matrix based on common use cases:
| Use Case | Recommended Model | Why |
|---|---|---|
| Simple classification or extraction | GPT-4o Mini or Haiku | Cheapest per token, fast, sufficient quality for structured tasks |
| Complex reasoning or analysis | Claude 4 Opus or o1 | Highest capability, worth the premium for difficult problems |
| Code generation | Claude 4 Sonnet or GPT-4o | Best balance of code quality, speed, and cost |
| Long document processing | Gemini 2.0 Pro or Flash | Largest context windows (up to 2M tokens) |
| Real-time chat | GPT-4o Mini or Gemini Flash | Fastest response times, lowest latency |
| Privacy-sensitive workloads | Llama 3.1 (self-hosted) | No data leaves your infrastructure |
| Cost-sensitive batch processing | Mistral Small or Gemini Flash | Best price-to-performance for high-volume tasks |
The key insight is that model selection is not about finding the "best" model — it is about finding the cheapest model that meets your quality bar. Most production systems use multiple models: a cheap model for the majority of requests and a premium model for the cases where quality truly matters. This tiered approach can reduce costs by 60 to 80 percent compared to using a single premium model for everything.
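One way to implement that tiered approach is a simple router that tries the cheap model first and escalates only when a quality gate fails. The sketch below rests on assumptions: the model names and the heuristic looks_good check are placeholders for whatever validation fits your task (schema checks, required fields, or a lightweight evaluator).

```python
from openai import OpenAI

client = OpenAI()

CHEAP_MODEL = "gpt-4o-mini"
PREMIUM_MODEL = "gpt-4o"

def looks_good(answer: str) -> bool:
    # Stand-in quality gate; replace with task-specific validation.
    return len(answer.strip()) > 0 and "i'm not sure" not in answer.lower()

def answer(prompt: str) -> str:
    # Try the cheap model first; escalate to the premium model only on failure.
    for model in (CHEAP_MODEL, PREMIUM_MODEL):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.choices[0].message.content
        if looks_good(text):
            return text
    return text  # fall back to the premium answer even if the gate failed
```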