LESSON 03
Unit Economics for AI Features
High engagement is not the same as a healthy business. An AI feature your users love can quietly bankrupt you.
8 min read
Traditional software features have a cost structure that founders understand intuitively. You pay to build the feature once, you pay a fixed amount to host it, and the marginal cost of one more user is close to zero. AI features break this model entirely. Every time a user interacts with your product, you pay for that interaction — in compute, in tokens, in API calls. The cost curve scales with engagement, not just with user count. A feature that goes viral can generate a bill that arrives before the revenue does.
The unit to model is cost per task — the total spend required to complete one unit of the thing your product does. If your product summarizes documents, the unit is one summary. If it drafts emails, the unit is one email. Cost per task is built from three components: the number of tokens consumed (input plus output), the model tier used (GPT-4 costs roughly 30x more per token than GPT-3.5-turbo), and the number of model calls required to complete the task. A product that chains four model calls to complete one task has four times the inference cost of a product that uses one.
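The arithmetic reduces to a few lines. A minimal sketch — the token counts, call count, and per-million prices below are illustrative assumptions, not any provider's actual rates:

```python
def cost_per_task(input_tokens, output_tokens, calls,
                  price_in_per_m, price_out_per_m):
    """Inference spend for one unit of product value.

    Token counts are per call; prices are dollars per million tokens.
    """
    per_call = (input_tokens * price_in_per_m +
                output_tokens * price_out_per_m) / 1_000_000
    return per_call * calls

# Illustrative numbers: a 4-call chain with 1,500 input / 300 output tokens
# per call, at assumed rates of $0.50/M input and $1.50/M output tokens.
task_cost = cost_per_task(1_500, 300, 4, 0.50, 1.50)   # about $0.005/task
```

Note that the four-call chain quadruples the cost of the single-call version of the same task — exactly the multiplier described above.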
Token consumption is the variable your engineering team has the most control over. Prompt length is the biggest driver — a system prompt that is 2,000 tokens long adds 2,000 tokens of cost to every single inference call your product makes, forever. Output length is the second driver. Constraining outputs to the minimum length required for the task — not allowing the model to elaborate, hedge, or add unnecessary caveats — is a product decision with direct margin implications. Teams that treat prompt engineering as purely a quality concern and ignore its cost dimension are leaving significant margin on the table.
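To see how prompt overhead compounds at volume, a rough calculation — the call volume and input price are assumed for illustration:

```python
def prompt_overhead_per_month(system_prompt_tokens, calls_per_month,
                              price_in_per_m):
    """Monthly spend attributable solely to the system prompt
    (price is dollars per million input tokens)."""
    return system_prompt_tokens * calls_per_month * price_in_per_m / 1_000_000

# A 2,000-token system prompt at an assumed 5M calls/month and $0.50/M
# input rate costs $5,000/month before a single user token is processed.
full = prompt_overhead_per_month(2_000, 5_000_000, 0.50)    # 5000.0
trimmed = prompt_overhead_per_month(800, 5_000_000, 0.50)   # 2000.0
```

Trimming that prompt to 800 tokens saves $3,000 a month with a single edit — which is why a prompt audit is often the fastest cost fix available.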
Caching is the highest-leverage cost reduction available to most AI products and the most consistently underused. Semantic caching stores the results of previous inference calls and returns them when a sufficiently similar query arrives, without hitting the model at all. For products where users ask similar questions repeatedly — support bots, FAQ assistants, search tools — cache hit rates of 30–60% are achievable. At scale, this is the difference between a margin-positive and margin-negative AI feature. GPTCache, Redis with vector similarity, and purpose-built caching layers from providers like Momento are the current tooling options.
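The mechanics can be sketched in a few lines. This toy version uses a bag-of-words vector in place of a real embedding model — production tools like GPTCache or Redis use learned embeddings and a vector index, and the `SemanticCache` class, `embed` helper, and 0.8 threshold here are all illustrative:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real cache uses a sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    """Return a stored answer when a new query is similar enough to an old one."""
    def __init__(self, threshold=0.8):
        self.entries = []          # (embedding, answer) pairs
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        scored = [(cosine(q, emb), answer) for emb, answer in self.entries]
        if scored:
            score, answer = max(scored, key=lambda s: s[0])
            if score >= self.threshold:
                return answer      # cache hit: the model is never called
        return None                # cache miss: caller falls through to inference

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("how do I reset my password", "Use the reset link on the login page.")
hit = cache.get("how do I reset my password please")   # near-duplicate: served free
miss = cache.get("what is the refund policy")          # unrelated: goes to the model
```

The threshold is the key product decision: too loose and users get stale or wrong answers, too strict and the hit rate collapses.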
Model selection is a product strategy decision, not just a technical one. The instinct to use the most capable model available produces the worst unit economics. The correct approach is to use the least capable model that produces acceptable quality for the task. Routing — sending simple queries to a cheap fast model and complex queries to a powerful expensive model — is how mature AI products manage this. A routing layer that correctly classifies 70% of queries as simple and handles them with a small model can reduce inference costs by 50% or more with no user-facing quality degradation.
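A routing layer can be sketched as follows. The model names, prices, and the keyword heuristic are all illustrative assumptions — production routers usually classify with a small model or a trained classifier rather than keywords:

```python
# (name, assumed cost per call) -- hypothetical price points for illustration
SIMPLE_MODEL = ("small-model", 0.0005)
COMPLEX_MODEL = ("large-model", 0.0150)

def route(query):
    """Send long or analysis-style queries to the expensive model,
    everything else to the cheap one."""
    markers = ("compare", "analyze", "explain why", "step by step")
    is_complex = len(query.split()) > 40 or any(m in query.lower() for m in markers)
    return COMPLEX_MODEL if is_complex else SIMPLE_MODEL

def blended_cost(simple_share, simple_cost, complex_cost):
    """Average cost per call given the fraction of traffic routed cheap."""
    return simple_share * simple_cost + (1 - simple_share) * complex_cost

# With 70% of traffic on the small model, the blended cost drops from
# $0.0150 to roughly $0.0049 per call -- about a two-thirds reduction.
all_complex = blended_cost(0.0, 0.0005, 0.0150)
blended = blended_cost(0.70, 0.0005, 0.0150)
```

The blended-cost arithmetic is the useful part: once you know your traffic mix, the savings from routing are a one-line calculation, not a leap of faith.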
The AI feature that is genuinely worth building has a cost per task that is meaningfully lower than the value it creates for the user. A feature that saves a user 20 minutes of work and costs $0.04 to run has defensible economics. A feature that saves 30 seconds and costs $0.08 to run needs to either get cheaper or charge more. Building a cost model before building the feature — estimating tokens, model tier, call count, and expected usage — is the discipline that separates founders who discover their unit economics problem at scale from founders who discover it at the prototype stage, when it is still cheap to fix.
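The value-versus-cost comparison above can be made explicit with one assumed input — a dollar value on the user's time:

```python
def value_to_cost_ratio(minutes_saved, user_hourly_value, cost_per_task):
    """Dollars of user value created per dollar of inference spent."""
    value_created = minutes_saved * user_hourly_value / 60
    return value_created / cost_per_task

# At an assumed $60/hour value of user time:
strong = value_to_cost_ratio(20, 60, 0.04)    # 20 min saved at $0.04/task
weak = value_to_cost_ratio(0.5, 60, 0.08)     # 30 s saved at $0.08/task
```

The first feature creates roughly 500x its cost in user value; the second, about 6x — enough to exist, but with little room for pricing, margin, or a provider price increase.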
The question is not whether your AI feature works. It's whether it works at a cost per task that leaves room for a business underneath it.
TERMS
TOKEN
The unit of text that language models process — roughly 0.75 words in English, though this varies by language and content type. All model pricing is denominated in tokens, typically per million. Input tokens (what you send) and output tokens (what the model returns) are priced separately, with output tokens costing more. Every character of your system prompt, user message, and conversation history counts toward input tokens on every call.
COST PER TASK
The total inference spend required to complete one unit of value in your product — one summary, one draft, one analysis. It is calculated by multiplying tokens consumed by the per-token price, then summing across all model calls in the workflow. This is the unit economics metric that matters for AI features, and it must be modeled before deployment, not discovered after.
MODEL ROUTING
A layer in the AI product architecture that classifies incoming requests by complexity and routes them to different models accordingly — cheap small models for simple tasks, expensive powerful models for complex ones. Routing is the primary mechanism for optimizing cost without degrading quality. A well-implemented router can reduce inference costs by 40–70% with no perceptible change to user experience.
SEMANTIC CACHING
Storing the outputs of previous inference calls and returning them when a new query is semantically similar enough to a cached query, without calling the model. Unlike exact-match caching, semantic caching uses vector similarity to match paraphrases and near-duplicates. For products with repetitive query patterns, it is the single highest-leverage cost reduction available.
CONTEXT WINDOW COST
The inference cost attributable to the tokens that must be included in every call regardless of the specific user query — system prompts, few-shot examples, retrieved documents, conversation history. Context window cost is fixed per call and scales linearly with prompt length. Auditing and trimming system prompt length is often the fastest way to reduce inference cost on an existing product.
FINE-TUNING
Training a base model further on your own dataset to improve its performance on your specific task, often enabling a smaller cheaper model to match the quality of a larger expensive one. Fine-tuning trades upfront training cost for reduced per-inference cost at scale. It becomes economically rational when inference volume is high enough that the per-call savings outweigh the training investment — typically above a few million calls per month.
RATE LIMITS
Caps imposed by model providers on tokens per minute and requests per minute at each pricing tier. Hitting rate limits causes failed requests or queued delays that degrade user experience. Products with spiky usage patterns — triggered by a viral moment or a large customer batch job — are most at risk. Rate limit strategy, including tiered provider accounts and request queuing, is infrastructure planning, not an afterthought.
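The fine-tuning break-even described in the terms above reduces to simple arithmetic. Every number here is an assumption for illustration, and the model ignores retraining and maintenance costs:

```python
def breakeven_calls(training_cost, base_cost_per_call, tuned_cost_per_call):
    """Calls needed before per-call savings repay the one-time training spend."""
    savings_per_call = base_cost_per_call - tuned_cost_per_call
    return training_cost / savings_per_call

# Assumed: $5,000 to fine-tune, dropping inference from $0.010 to $0.002/call.
calls_needed = breakeven_calls(5_000, 0.010, 0.002)   # roughly 625,000 calls
```

If your product is well below that call volume and staying there, the tuning spend never pays back; well above it, the savings compound every month.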
BEFORE YOUR NEXT MEETING
— What is our cost per task today — and have we modeled what it looks like at 10x current usage?
— Are we using the most capable model available for every call, or do we have a routing strategy that matches model tier to query complexity?
— What is our current cache hit rate, and have we implemented semantic caching for the query patterns we see most often?
— If our primary model provider raised prices by 3x tomorrow, what would that do to our margins — and do we have a second provider we could route to?
LESSON 03 OF 03