Most AI agent quotes you'll see are wrong by 3-5x. Either someone's selling you a Zapier wrapper for $50K, or someone's quoting $500K for a chatbot. Both are wrong.
Here's what we've actually charged across 30+ AI agent builds in 2026 - broken down by complexity, with the cost drivers that move the number up or down.
What drives AI agent cost
1. Number of tools the agent calls
Each tool integration adds 2-5 days. A CRM-only agent is fast. An agent that touches CRM + calendar + email + Slack + your custom API is 4 weeks of integration work alone.
- Auth + token refresh per tool
- Error handling + retry logic
- Idempotency (so the agent doesn't double-book)
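The retry + idempotency work per tool can be sketched in a small wrapper. This is a minimal, dependency-free illustration, not our production code: the in-memory cache stands in for Redis or Postgres, and the tool function is hypothetical.

```typescript
type ToolResult = { ok: boolean; data?: unknown; error?: string };

// In-memory idempotency cache; production would back this with Redis or Postgres.
const seen = new Map<string, ToolResult>();

async function callTool(
  idempotencyKey: string,
  fn: () => Promise<ToolResult>,
  maxRetries = 3,
): Promise<ToolResult> {
  // Same key → return the cached result instead of re-executing,
  // so a retried agent step can't double-book.
  const cached = seen.get(idempotencyKey);
  if (cached) return cached;

  let lastError = "unknown";
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const result = await fn();
      seen.set(idempotencyKey, result); // only successful results are cached
      return result;
    } catch (e) {
      lastError = e instanceof Error ? e.message : String(e);
      // Exponential backoff between attempts.
      await new Promise((r) => setTimeout(r, 2 ** attempt * 100));
    }
  }
  return { ok: false, error: lastError };
}
```

Multiply this by auth, token refresh, and tests, and the 2-5 days per tool stops looking padded.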
2. Eval harness depth
Cheap agents skip evals. Expensive bugs come from skipping evals. Plan on 30% of build time for golden-set tests, LLM-as-judge scoring, and regression detection.
- Golden-set test cases (50-200 typical)
- LLM-as-judge rubrics
- CI integration so PRs get scored
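The shape of the harness is simpler than the effort suggests. A hedged sketch: `runAgent` is a stand-in for your real agent, and the keyword scorer below is a placeholder for an LLM-as-judge rubric.

```typescript
type GoldenCase = { input: string; mustContain: string[] };

function scoreCase(output: string, expected: GoldenCase): number {
  // Fraction of required phrases present; an LLM-as-judge call would replace this.
  const hits = expected.mustContain.filter((p) =>
    output.toLowerCase().includes(p.toLowerCase()),
  ).length;
  return hits / expected.mustContain.length;
}

function runEvalSuite(
  runAgent: (input: string) => string,
  cases: GoldenCase[],
  passThreshold = 0.9,
): { passRate: number; regressed: boolean } {
  const scores = cases.map((c) => scoreCase(runAgent(c.input), c));
  const passRate = scores.filter((s) => s === 1).length / cases.length;
  // CI fails the PR when the pass rate drops below the threshold.
  return { passRate, regressed: passRate < passThreshold };
}
```

The 30% of build time goes into the golden set itself and wiring this into CI, not into the scoring loop.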
3. Human-in-the-loop UX
If the agent does anything irreversible (sending email, charging cards, deleting data) you need a confirmation flow. That's another 1-2 weeks of UI + state management.
- Approval queue UI
- Audit log of agent actions
- Override + rollback flows
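The state management behind that UI looks roughly like this. A minimal sketch with illustrative action kinds; the real version adds persistence, rollback, and per-user permissions.

```typescript
type AgentAction = {
  id: string;
  kind: "send_email" | "charge_card" | "delete_data";
  payload: unknown;
  status: "pending" | "approved" | "rejected" | "executed";
};

class ApprovalQueue {
  private actions = new Map<string, AgentAction>();
  private audit: string[] = [];

  // The agent proposes; nothing irreversible runs without a human decision.
  propose(action: Omit<AgentAction, "status">): void {
    this.actions.set(action.id, { ...action, status: "pending" });
    this.audit.push(`proposed:${action.id}`);
  }

  approve(id: string, execute: (a: AgentAction) => void): void {
    const a = this.actions.get(id);
    if (!a || a.status !== "pending") throw new Error("action not pending");
    a.status = "approved";
    execute(a);
    a.status = "executed";
    this.audit.push(`executed:${id}`);
  }

  reject(id: string): void {
    const a = this.actions.get(id);
    if (!a || a.status !== "pending") throw new Error("action not pending");
    a.status = "rejected";
    this.audit.push(`rejected:${id}`);
  }

  log(): string[] {
    return [...this.audit];
  }
}
```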
4. Model routing logic
Cheap agents call gpt-4o for everything and burn $50K/month. Smart agents route to Haiku for simple turns and Opus for hard ones. Routing logic adds 1 week and saves 60-80% on inference.
- Complexity classifier (cheap model)
- Per-tier token caps
- Fallback chains
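The routing skeleton is small; the savings come from where you draw the line. In this sketch the classifier is a crude heuristic stand-in (production uses a cheap model for classification) and the model names are illustrative.

```typescript
type Tier = "cheap" | "frontier";

function classify(turn: string): Tier {
  // Heuristic stand-in: long or multi-step requests go to the frontier model.
  const steps = (turn.match(/\b(then|and|after)\b/gi) ?? []).length;
  return turn.length > 400 || steps >= 2 ? "frontier" : "cheap";
}

// Illustrative fallback chains per tier.
const FALLBACK_CHAIN: Record<Tier, string[]> = {
  cheap: ["claude-haiku", "gpt-4o-mini"],
  frontier: ["claude-opus", "gpt-4o"],
};

function route(turn: string, available: Set<string>): string {
  const tier = classify(turn);
  // Walk the tier's chain; escalate to the frontier chain if the tier is down.
  for (const model of FALLBACK_CHAIN[tier]) {
    if (available.has(model)) return model;
  }
  for (const model of FALLBACK_CHAIN.frontier) {
    if (available.has(model)) return model;
  }
  throw new Error("no model available");
}
```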
Monthly running costs
- Inference: $200-$5,000/month depending on model and volume. Claude Opus is roughly 30x more expensive than Haiku - choose carefully.
- Vector DB: Pinecone Standard is $70/month minimum. pgvector on existing Postgres is free until you have millions of embeddings.
- Observability: Langfuse free tier covers most teams. Braintrust is $20-200/month per project.
- Hosting: Vercel + AWS Lambda for the agent loop: $50-500/month. Modal or Replicate for self-hosted models: $200-3,000/month.
Where teams overspend
Using GPT-4o (or Claude Opus) for every turn
Most agent steps don't need a frontier model. Route 70% of turns to Haiku or 4o-mini. Same UX, 1/10th the bill.
Skipping the eval harness then debugging in prod
A regression caught in CI costs 1 hour. The same regression caught after a customer complains costs 2 weeks plus reputation.
Building custom RAG when pgvector + reranker is enough
We've seen $100K spent on a 'RAG platform' that does what a 200-line script + Cohere Rerank API would do.
Hiring a full-time AI engineer before product-market fit
$200K/year for one engineer vs. an outside team that ships your v1 in 8 weeks. The math is obvious.
How to keep the bill honest
Start with a one-week scoping engagement
Before signing the build contract, pay for a week of scoping. The output is a fixed-price proposal with line items. If a vendor won't do this, that's a red flag.
Demand a unit-economics dashboard
Cost per agent run, cost per resolved ticket, cost per generated draft. If you can't see this in week 1, you'll be flying blind.
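Those unit-economics numbers are cheap to compute once runs are logged. A sketch under assumed per-million-token prices (the $3/$15 figures are placeholders, not a quote):

```typescript
type RunRecord = { inputTokens: number; outputTokens: number; resolved: boolean };

function unitEconomics(
  runs: RunRecord[],
  pricePerMInput = 3,   // illustrative $/1M input tokens
  pricePerMOutput = 15, // illustrative $/1M output tokens
): { costPerRun: number; costPerResolved: number } {
  const totalCost = runs.reduce(
    (sum, r) =>
      sum +
      (r.inputTokens / 1e6) * pricePerMInput +
      (r.outputTokens / 1e6) * pricePerMOutput,
    0,
  );
  const resolved = runs.filter((r) => r.resolved).length;
  return {
    costPerRun: totalCost / runs.length,
    // Unresolved-only periods surface as Infinity, which is itself a signal.
    costPerResolved: resolved ? totalCost / resolved : Infinity,
  };
}
```

If a vendor can't show you something this simple in week 1, ask why.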
Negotiate post-launch support up front
AI agents drift. Models change. APIs deprecate. Get a 6-month support retainer in the original contract or you'll pay 3x for emergencies.
"Most AI agent quotes you'll see are wrong by 3-5x. The cheap ones skip the eval harness; the expensive ones bake in a year of margin."
Cheap agent vs. production agent
| Feature | Cheap build | Production build |
|---|---|---|
| Eval harness | Skipped | Golden-set + LLM-as-judge |
| Observability | console.log | Langfuse / Braintrust traces |
| Model routing | All gpt-4o | Cheap-then-deep tiered routing |
| Tool integrations | 1-2 happy path | Retry / circuit breaker / idempotency |
| Human-in-the-loop | None | Confirmation flow on irreversible actions |
| Cost ceiling | Open-ended | Per-run token caps + alerts |
| Time to first regression | Week 1 in production | Caught in CI before merge |
What separates a working agent from a demo
Determinism around the LLM
Workflow engines (Inngest, Temporal) handle the agent loop. The LLM only decides what's actually fuzzy.
Structured tool I/O
Every tool call is a typed schema. No free-text parsing. Catches 80% of the bugs before they happen.
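What "typed schema" means in practice: every tool argument is validated before the tool runs, so a malformed LLM tool call fails loudly instead of being parsed out of free text. A dependency-free sketch for a hypothetical `sendEmail` tool; in production you'd reach for a schema library like zod or JSON Schema.

```typescript
type SendEmailArgs = { to: string; subject: string; body: string };

function parseSendEmailArgs(raw: unknown): SendEmailArgs {
  // Reject anything that isn't exactly the expected shape.
  if (typeof raw !== "object" || raw === null) throw new Error("not an object");
  const o = raw as Record<string, unknown>;
  if (typeof o.to !== "string" || !o.to.includes("@"))
    throw new Error("invalid 'to'");
  if (typeof o.subject !== "string") throw new Error("invalid 'subject'");
  if (typeof o.body !== "string") throw new Error("invalid 'body'");
  return { to: o.to, subject: o.subject, body: o.body };
}
```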
Bounded cost per run
Token budget per session. Circuit breaker on runaway loops. You won't wake up to a five-figure bill.
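The budget and circuit breaker fit in one small guard that every agent turn passes through. The limits below are illustrative defaults, not recommendations.

```typescript
class RunBudget {
  private tokensUsed = 0;
  private turns = 0;

  constructor(
    private maxTokens = 50_000, // illustrative per-session token cap
    private maxTurns = 20,      // illustrative loop circuit breaker
  ) {}

  // Call once per agent turn with the tokens that turn consumed.
  charge(tokens: number): void {
    this.tokensUsed += tokens;
    this.turns += 1;
    if (this.tokensUsed > this.maxTokens)
      throw new Error("token budget exceeded");
    if (this.turns > this.maxTurns)
      throw new Error("circuit breaker: runaway loop");
  }
}
```

Catching either error ends the run and pages a human; that is the whole point.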
Continuous evaluation
Golden-set tests run on every PR. Regressions get caught before users ever see them.
We scope AI agents in fixed-price engagements.
30-minute call, no deck. We'll tell you what your agent actually needs and price it honestly.
Quick answers
Should we use OpenAI's Assistants API?
For prototypes, yes. For production, almost never. You lose control over the agent loop, traces, and cost. Roll your own with Inngest + a model adapter layer.
How many tools should the agent have?
Start with 3-5. Each tool is auth + retry + idempotency + tests. Scope creeps fast. Ship narrow first.
Do we need a vector DB?
Only if you're doing retrieval. pgvector on existing Postgres handles up to ~5M embeddings before you'd notice. Pinecone makes sense at scale or for hybrid filters.
How do we measure agent quality?
Golden-set tests + LLM-as-judge scoring. Without measurement, you can't tell if a model upgrade helped or hurt. This is non-negotiable.
If you only remember 5 things
- AI agent cost scales with eval harness depth, tool count, and human-in-the-loop UX, not with model choice.
- Cheap models for routing, frontier models for reasoning. Tiered routing cuts inference 60-80%.
- Skipping evals is the most expensive shortcut in AI engineering.
- Workflow engines (Inngest, Temporal) make agents reliable. Don't roll your own state machine.
- Negotiate post-launch support up front. Models drift, APIs change, and you'll need it.
How we typically scope these
Single-flow agent (one persona, one main task): scoped fixed-price in 6-8 weeks. Production agent with eval harness, observability, and 3-5 tool integrations: scoped fixed-price in 10-14 weeks. Multi-agent system or anything touching regulated data: scoped fixed-price in 14-20 weeks.
Every quote is fixed-price - the number depends on scope, not hours. Every quote includes the eval harness. We don't ship AI agents without measurement - that's how you end up paying twice.
About IRPR
The IRPR engineering team ships production software for 50+ countries. Idea → Roadmap → Product → Release. 200+ products live.