How LLM Pricing Works: Tokens Explained Simply
Every major LLM API charges by the token. A token is a small unit of text — not exactly a word, not exactly a character, but somewhere in between. A helpful rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words in English.
In practice: "Hello, world!" is about 4 tokens. A typical email is 100–300 tokens. A 1,000-word blog post is roughly 1,300 tokens. A long legal document might be 10,000+ tokens.
Why don't they just charge per word? Because tokenization is how LLMs actually process text internally: each token is a chunk of text the model maps to a learned vector. A word like "unbelievable" might be split into ["un", "believ", "able"] = 3 tokens, while short common words like "the" or "is" are usually 1 token each.
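You can check token counts yourself rather than estimating. A minimal sketch using OpenAI's open-source tiktoken library (counts are tokenizer-specific, so Anthropic or Google models will split the same text slightly differently):

```python
import tiktoken  # pip install tiktoken

# Load the tokenizer used by GPT-4o (requires a recent tiktoken version).
enc = tiktoken.encoding_for_model("gpt-4o")

for text in ["Hello, world!", "unbelievable", "the"]:
    tokens = enc.encode(text)
    print(f"{text!r} -> {len(tokens)} tokens: {tokens}")
```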
Input vs Output Tokens
Every major API provider charges separately for input tokens (your prompt, system message, and any context you provide) and output tokens (the model's response). Output tokens cost more, typically 3–5x the input price, because generating tokens is computationally more intensive than processing them.
Key implication: long responses are expensive. If your use case allows shorter outputs (classification, extraction, summarization with word limits), you'll save significant money compared to applications requiring long-form generation.
Context Window: Why It Matters for Cost
The context window is the maximum number of tokens a model can process in a single API call (input + output combined). Larger context windows let you provide more background information, longer documents, or more conversation history.
But every token in the context window costs money — even if it's just background instructions the model "ignores." A 200,000-token context window doesn't mean you should fill it up every request. Only include what's necessary for the task.
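One common pattern for keeping context lean is to cap conversation history at a fixed token budget and drop the oldest turns first. A minimal sketch, reusing the tiktoken encoder from above (the 2,000-token budget is an arbitrary illustration, not a recommendation):

```python
def trim_history(messages, enc, budget=2000):
    """Keep the system prompt, drop oldest turns until under the token budget."""
    def n_tokens(msg):
        return len(enc.encode(msg["content"]))

    system, turns = messages[0], messages[1:]
    total = n_tokens(system) + sum(n_tokens(m) for m in turns)
    while turns and total > budget:
        total -= n_tokens(turns.pop(0))  # drop the oldest turn first
    return [system] + turns
```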
Model Pricing Comparison (2025)
| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-4o | $5.00 | $15.00 | 128K |
| OpenAI | GPT-4o mini | $0.15 | $0.60 | 128K |
| OpenAI | o1 | $15.00 | $60.00 | 200K |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| Anthropic | Claude 3 Haiku | $0.25 | $1.25 | 200K |
| Google | Gemini 1.5 Pro | $3.50 | $10.50 | 2M |
| Google | Gemini 1.5 Flash | $0.075 | $0.30 | 1M |
| Mistral | Mistral Large | $4.00 | $12.00 | 128K |
| Mistral | Mistral Small | $0.20 | $0.60 | 32K |
| Open source | Llama 3.1 (self-hosted) | ~$0.10 | ~$0.10 | 128K |
Note: Prices change frequently. Always verify at the provider's pricing page before building a budget.
Estimating Costs: Formula and Examples
The basic formula:
Cost = (input_tokens / 1,000,000 × input_price) + (output_tokens / 1,000,000 × output_price)
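As code, with prices per million tokens as in the table above (the helper also reproduces Example 1's daily numbers as a sanity check):

```python
def estimate_cost(input_tokens, output_tokens, input_price, output_price):
    """Estimate API cost in dollars; prices are per 1M tokens."""
    return (input_tokens / 1_000_000 * input_price
            + output_tokens / 1_000_000 * output_price)

# Sanity check against Example 1 below (GPT-4o mini at $0.15/$0.60):
daily = estimate_cost(600_000, 200_000, 0.15, 0.60)
print(f"${daily:.2f}/day -> ~${daily * 30:.2f}/month")  # $0.21/day -> ~$6.30/month
```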
Example 1: Customer Support Bot (1,000 queries/day)
Assumptions: 500-token system prompt + 100-token user question = 600 input tokens per query. 200-token average response.
Using GPT-4o mini ($0.15 input, $0.60 output):
Daily input: 1,000 × 600 = 600,000 tokens → $0.09
Daily output: 1,000 × 200 = 200,000 tokens → $0.12
Daily total: $0.21 → ~$6.30/month
Using GPT-4o ($5.00 input, $15.00 output):
Daily input: 600,000 tokens → $3.00
Daily output: 200,000 tokens → $3.00
Daily total: $6.00 → ~$180/month
For this use case, GPT-4o mini is 28x cheaper. If quality is acceptable (it usually is for FAQ-style support), the cost difference is enormous at scale.
Example 2: Blog Post Generation (10,000 posts/month)
Assumptions: 200-token prompt + 100-token outline = 300 input tokens. 1,300-token output (1,000-word post).
Using GPT-4o mini:
Input: 10,000 × 300 = 3M tokens → $0.45
Output: 10,000 × 1,300 = 13M tokens → $7.80
Total: ~$8.25/month for 10,000 blog posts
This is remarkably cheap. Even at GPT-4o pricing, 10,000 blog posts would cost ~$210/month — less than hiring a single contractor.
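With the `estimate_cost` helper from earlier, it's easy to sweep the same workload across models from the pricing table:

```python
workload = (3_000_000, 13_000_000)  # monthly input/output tokens from Example 2
models = {
    "GPT-4o mini": (0.15, 0.60),
    "GPT-4o": (5.00, 15.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
}
for name, (in_price, out_price) in models.items():
    monthly = estimate_cost(*workload, in_price, out_price)
    print(f"{name:18} ${monthly:,.2f}/month")
# GPT-4o mini        $8.25/month
# GPT-4o             $210.00/month
# Claude 3.5 Sonnet  $204.00/month
# Gemini 1.5 Flash   $4.12/month
```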
When to Use Which Model
- GPT-4o mini / Gemini Flash / Claude Haiku: Classification, extraction, summarization, simple Q&A, content moderation. Use these by default — upgrade only if quality is insufficient.
- GPT-4o / Claude 3.5 Sonnet: Complex reasoning, nuanced writing, code generation, multi-step analysis. Worth the price when quality materially matters.
- o1 / o1-mini: Math, logic puzzles, complex coding problems requiring deep reasoning. At 3–4x GPT-4o's price, use them only when the reasoning capability is genuinely needed.
- Gemini 1.5 Pro: Best for tasks requiring large context (processing entire codebases, long legal documents, book-length analysis). Its 2M token window is unmatched.
- Self-hosted Llama 3.1: When you need data privacy or predictable costs at massive scale, or when you can't send data to external APIs. Requires GPU infrastructure, but the cost per token approaches zero at high volume.
Free Tiers and How to Use Them
- OpenAI: New accounts get $5 in free credits. Not much for production, but enough for building and testing.
- Google AI Studio: Gemini 1.5 Flash is free with rate limits (15 RPM, 1M tokens/minute). Excellent for prototyping and low-traffic applications.
- Anthropic: No ongoing free tier, but offers $5 credits on signup.
- Groq: Free tier runs Llama 3 and Mixtral at extreme speed (800+ tokens/second). Ideal for latency-sensitive prototypes.
- Ollama: Run Llama 3, Mistral, and others locally for free. No rate limits, no data leaving your machine.
Cost Optimization Strategies
1. Prompt Caching
Both Anthropic and OpenAI offer prompt caching: when you send the same prompt prefix repeatedly, the cached portion is billed at a steep discount (cache reads are 90% cheaper on Anthropic; OpenAI's automatic caching discounts cached input by 50%). For applications with a large, static system prompt, this can cut costs dramatically.
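A sketch with Anthropic's Python SDK, where an explicit `cache_control` marker on the system block opts that prefix into caching (OpenAI's caching, by contrast, is automatic and needs no code change; the long system prompt here is a stand-in):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # your large, static instructions (stand-in here)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks this prefix for caching; note that Sonnet requires the
            # cached prefix to be at least ~1,024 tokens long.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.content[0].text)
```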
2. Batching (OpenAI Batch API)
The OpenAI Batch API processes requests asynchronously within 24 hours at 50% off list price. For non-real-time tasks (content generation, data enrichment, bulk classification), batching is a straightforward 50% discount.
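A sketch of the submission flow with the OpenAI Python SDK (the request bodies and `custom_id` scheme are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSON request per line; each needs a unique custom_id to match results.
requests = [
    {
        "custom_id": f"post-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Write a blog post about topic {i}."}],
            "max_tokens": 1500,
        },
    }
    for i in range(100)
]

with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload the file, then create the batch; results arrive within 24 hours.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)  # poll client.batches.retrieve(batch.id) until status == "completed"
```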
3. Prompt Compression
Remove redundant instructions, whitespace, and verbose context from your system prompts. Tools like LLMLingua can compress prompts by 3–10x with minimal quality loss by removing less important tokens.
4. Model Routing
Use a cheap model to classify query complexity, then route simple queries to the cheap model and complex ones to the expensive model. This "model router" pattern can reduce costs by 60–80% while maintaining quality where it matters.
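A minimal sketch of the pattern with the OpenAI Python SDK (the one-word triage prompt and the simple/complex split are illustrative, not a production rubric):

```python
from openai import OpenAI

client = OpenAI()

def route_and_answer(query: str) -> str:
    # Step 1: the cheap model labels the query's complexity in one word.
    triage = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": "Classify this query as 'simple' or 'complex'. "
                       f"Reply with exactly one word.\n\nQuery: {query}",
        }],
    )
    label = triage.choices[0].message.content.strip().lower()

    # Step 2: the label picks which model actually answers.
    model = "gpt-4o" if "complex" in label else "gpt-4o-mini"
    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return answer.choices[0].message.content
```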
5. Output Length Control
Always set max_tokens. Use stop sequences to end generation early when the answer is complete. Instruct the model explicitly: "Be concise. Answer in under 100 words." Output tokens are expensive — don't generate more than you need.
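A sketch of all three controls in one call, using the OpenAI SDK (the blank-line stop sequence is illustrative):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=150,   # hard cap on billable output tokens
    stop=["\n\n"],    # end generation at the first blank line
    messages=[
        {"role": "system", "content": "Be concise. Answer in under 100 words."},
        {"role": "user", "content": "What is a context window?"},
    ],
)
print(response.choices[0].message.content)
```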