The $99K Claude Bill
Engineering Studio
Inference costs are blowing past human salaries. Here's what that actually means if you're commissioning an AI product in 2026, and how an engineering studio thinks about it differently than your in-house team.
Connex Labs
A year ago, a high AI bill was a flex. Meta had an internal leaderboard for who could burn the most tokens; the top user reportedly ran through enough Claude tokens in 30 days to rack up over $1M on their own. It was the 2026 version of expensing a $400 sushi dinner at Nobu.
This week, the mood flipped.
Nvidia told Axios their team's compute costs are now "far beyond the costs of the employees." Uber's CTO admitted to The Information that they've already burned through the entire 2026 AI budget and are back to the drawing board. Goldman Sachs put numbers on it: some software companies are spending roughly 10% of their total engineering labor cost on AI alone, and that share is climbing fast enough that it could soon match what they pay the humans.
In other words: the technology that was supposed to replace expensive workers is, in a lot of cases, more expensive than the workers.
Why this is happening
Two things stacked at once.
The first is agentic workloads. A chat completion is cheap. An agent that plans, calls tools, retries, reflects, and writes back to a database isn't one inference; it's hundreds. A single "task" can quietly spin up a multi-thousand-token loop. Multiply by every team in a company shipping its own agent, and the bill stops looking linear.
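A back-of-envelope sketch of that multiplication. The per-token prices, step count, and context growth here are illustrative assumptions, not any vendor's real pricing, but the shape of the math is the point: every agent step re-sends an ever-growing context.

```python
# Back-of-envelope cost: one agentic "task" vs. one chat completion.
# All numbers below are illustrative assumptions, not real vendor pricing.

PRICE_PER_1M_INPUT = 3.00    # assumed $ per 1M input tokens
PRICE_PER_1M_OUTPUT = 15.00  # assumed $ per 1M output tokens

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single model call at the assumed rates."""
    return (input_tokens / 1e6) * PRICE_PER_1M_INPUT + \
           (output_tokens / 1e6) * PRICE_PER_1M_OUTPUT

# One chat completion: short prompt, short answer.
chat = call_cost(500, 300)

# One agent task: an assumed 40-step loop where each step re-sends a
# context that grows by ~1,500 tokens (tool output, scratchpad, history).
agent = sum(call_cost(4_000 + step * 1_500, 800) for step in range(40))

print(f"chat completion: ${chat:.4f}")   # fractions of a cent
print(f"agent task:      ${agent:.2f}")  # dollars per task
print(f"ratio:           {agent / chat:.0f}x")
```

The input side dominates: the growing context is quadratic in step count, which is why a 40-step loop costs hundreds of times more than one completion rather than forty times more.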
The second is pricing direction. Frontier models aren't getting cheaper as fast as people assumed they would. Some labs have raised prices outright; others have quietly shrunk how much capability you get per dollar by routing requests to bigger, more capable, more expensive models by default. Worldwide IT spending is projected to hit $6.31 trillion in 2026, up 13.5%, and a meaningful chunk of that growth is just inference.
So the CFO question isn't "should we use AI?" anymore. It's "why is our AI line item growing 4x faster than revenue?"
What this looks like from the studio side
We build AI products for a living. Connex Display ships factory-installed in Honor LSVs. We've got 3S Smart Ship, GroGov, and MedQT in production. So we see this from a specific angle: companies who tried to build AI in-house, hit the bill, and now want help fixing it.
A few patterns we keep seeing:
1. Nobody costed the agent before they shipped it. The team built a working prototype, demoed it, got applause, rolled it out. Nobody ran the math on cost per task × tasks per day × days per month. The first surprise invoice arrives 60 days later.
2. Everything is using the biggest model. GPT-class or Opus-class on every call, including the ones that could be handled by a smaller model or by a deterministic function with zero LLM in the loop. We've cut client inference bills by 60-80% just by routing simple steps to small models and keeping the frontier model for the parts that actually need it.
3. The agent is talking to itself too much. Long context, repeated re-reads of the same document, uncapped retry loops, no caching. A well-engineered agent and a sloppy one can produce the same outputs with a 10x difference in token spend.
4. There's no kill switch. No per-user budget cap. No alerting when daily spend triples. No way to flip a feature off without a redeploy. The bill runs until someone notices.
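Pattern 4 is the cheapest one to fix. Here's a minimal sketch of the missing controls: a per-user daily cap, a global daily cap, and a kill switch that trips at runtime without a redeploy. The class name, thresholds, and in-memory storage are all illustrative; a production version would persist spend and emit alerts.

```python
# Minimal budget guard: per-user and global daily caps plus a kill switch.
# Illustrative sketch only -- names, caps, and in-memory state are assumptions.
import time
from collections import defaultdict

class BudgetGuard:
    def __init__(self, daily_user_cap_usd: float, daily_global_cap_usd: float):
        self.user_cap = daily_user_cap_usd
        self.global_cap = daily_global_cap_usd
        self.killed = False               # flip features off without a redeploy
        self._day = None
        self._user_spend = defaultdict(float)
        self._global_spend = 0.0

    def _roll_day(self):
        """Reset counters when the calendar day changes."""
        today = time.strftime("%Y-%m-%d")
        if today != self._day:
            self._day = today
            self._user_spend.clear()
            self._global_spend = 0.0

    def allow(self, user: str) -> bool:
        """Check before every model call."""
        self._roll_day()
        return (not self.killed
                and self._user_spend[user] < self.user_cap
                and self._global_spend < self.global_cap)

    def record(self, user: str, cost_usd: float):
        """Record after every model call; trip the kill switch on a blowout."""
        self._roll_day()
        self._user_spend[user] += cost_usd
        self._global_spend += cost_usd
        if self._global_spend >= self.global_cap:
            self.killed = True

guard = BudgetGuard(daily_user_cap_usd=5.0, daily_global_cap_usd=500.0)
if guard.allow("user-42"):
    # ... make the model call, then:
    guard.record("user-42", cost_usd=0.12)
```

Twenty lines of guard code is the difference between a bad day and a bad quarter: the bill stops running the moment someone, or the code itself, notices.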
None of that is a model problem. It's an engineering problem.
What you actually pay a studio for in 2026
When AI was free-ish, the value of an AI engineering studio was speed: getting to a working product faster. That's still true.
But the bigger value now is margin. The difference between an AI product that works and an AI product that's profitable is a stack of unsexy engineering decisions: model routing, prompt compression, response caching, structured output instead of free-form, smaller models for classification, deterministic code for anything deterministic, and hard budget caps everywhere.
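The first item on that stack, model routing, can be sketched in a few lines. The model names and task labels below are placeholders; the real win is the ordering: deterministic code first, a small model for simple labels, and the frontier model only for steps that genuinely need it.

```python
# Tiered routing sketch: cheapest capable tier wins.
# Model names and task labels are illustrative assumptions.

SMALL_MODEL = "small-model"        # cheap, fast: classification-grade work
FRONTIER_MODEL = "frontier-model"  # expensive: planning and synthesis

def route(task: dict) -> str:
    kind = task["kind"]
    if kind in ("extract_date", "parse_id"):
        return "deterministic"  # zero LLM in the loop: a regex or parser
    if kind in ("classify", "tag", "yes_no"):
        return SMALL_MODEL      # a small model is plenty for simple labels
    return FRONTIER_MODEL       # multi-step reasoning earns the big model

assert route({"kind": "parse_id"}) == "deterministic"
assert route({"kind": "classify"}) == SMALL_MODEL
assert route({"kind": "plan_migration"}) == FRONTIER_MODEL
```

The router itself is trivial; the engineering work is in classifying your workload honestly enough that most calls fall into the first two tiers.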
That's the work. It's not glamorous. It doesn't demo well. It's also the difference between a product line and a Goldman Sachs footnote.
If you're a company evaluating whether to build your AI product in-house or bring in a studio, the question isn't really "can we hire engineers who know LLMs?" You can. The question is whether those engineers will treat inference cost as a first-class engineering constraint from day one, the way good embedded engineers treat memory or good mobile engineers treat battery. Or whether they'll treat it the way most teams did in 2024 and 2025: not at all, until the bill arrives.
The honest take
Human workers are not actually cheaper than AI in any durable sense. The cost curve will bend. Smaller models keep getting better. Caching and routing keep getting smarter. Hardware will catch up.
But "eventually" doesn't pay this quarter's invoice. If you're shipping AI in 2026, you're shipping into a market where compute is the scarce resource and good engineering is what protects your margin. That's a different problem than the one most companies hired for in 2024.
It's also a pretty good problem for a studio to solve.
Connex Labs is an AI product engineering studio. We work with companies on AI products end-to-end, from concept to production hardware. If you're staring at an inference bill that doesn't make sense, get in touch.
