The 4 Most Expensive AI Evaluation Mistakes (and the Tender That Died)
Written by Lovelaice
15 Oct 2025
Most product teams think their biggest AI risk is picking the wrong model. It's not. The biggest risk is never finding out that you picked the wrong model — until the damage is already done.
These four mistakes aren't hypothetical. They show up in real teams, with real budgets, and real consequences. One of them cost a company a million-euro tender. All of them were preventable with structured evaluation that takes hours, not quarters.
The Stakes Are Higher Than You Think
AI doesn't fail loudly. There's no 500 error. No stack trace. No page that refuses to load. The feature returns something every single time — even when that something is completely wrong.
Your monitoring tools watch for crashes and latency spikes. They don't watch for a contract summary that missed the liability clause, a compliance check that skipped the one regulation that matters, or a recommendation engine that's confidently irrelevant. The signal shows up as churn, three months later, when the context for why is already gone.
Ship and Pray is not a product strategy. But it is the default. And these four mistakes are how it plays out in practice — in dollars, in lost deals, and in teams that stop trusting their own AI features.
Mistake 1: Choosing the Model Based on Hype, Not Your Data
This is the most common mistake, and it's the one with the longest tail of damage.
"We're using GPT-4 because it's the best." Is it? Best at what? On whose data? For which task?
A FinTech team ran a structured model comparison on their actual transaction categorization task. Not a leaderboard. Not a benchmark someone else published. Their real inputs, their real edge cases. The result: the model they assumed was best cost 10x more than the alternative — and delivered lower accuracy.
In another experiment, GPT-5 achieved 60% accuracy on a data extraction task while GPT-4o hit 100%, at a fraction of the cost and with 4.9 seconds of latency versus 47.6. The frontier model used 10-20x more tokens than the cost-efficient option.
When GPT-5 launched, thousands of integrations broke because teams had assumed "newer means better" and switched without testing. GPT-5's additional reasoning capabilities — great for complex problems — actually hurt performance on structured, specific tasks. The model that wins on Twitter is not the model that wins on your problem.
Now multiply that cost gap by scale. A hypothetical AI feature for Airbnb — personalized property descriptions based on user interests and travel history — showed that GPT-5 was 10x more expensive than GPT-4.1. At Airbnb's scale, with millions of users viewing millions of properties daily, that's the difference between AI costs in the hundreds of thousands versus millions per month.
The fix: Don't choose your model upfront. Treat model selection as an output of experimentation, not an input. Test your specific use case across at least three to four models — frontier and cost-efficient options. Let the data decide. The experiment costs maybe $200 in API calls. The wrong model costs you $96,000 or more per year.
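What does that experiment look like in practice? Here's a minimal sketch, assuming the OpenAI Python SDK's chat-completions interface and naive exact-match scoring. The model names, prompt, and dataset rows are illustrative stand-ins for your own candidates and production data:

```python
import time

from openai import OpenAI  # assumes an OpenAI-compatible endpoint; adapt for your provider

client = OpenAI()

# Illustrative candidates: mix frontier and cost-efficient models you can actually access.
CANDIDATES = ["gpt-5", "gpt-4o", "gpt-4o-mini", "gpt-4.1"]

# A labeled sample of YOUR real inputs and edge cases, not a public benchmark.
DATASET = [
    {"input": "AMZN Mktp US*2X4YZ 12.99", "expected": "shopping"},
    {"input": "SHELL OIL 5742 40.00", "expected": "fuel"},
    # ... a few hundred rows drawn from production data
]

PROMPT = "Categorize this transaction. Reply with a single lowercase category name.\n\n{input}"

for model in CANDIDATES:
    correct, tokens, started = 0, 0, time.perf_counter()
    for row in DATASET:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(input=row["input"])}],
        )
        answer = resp.choices[0].message.content.strip().lower()
        correct += answer == row["expected"]  # naive exact-match scoring; swap in your own judge
        tokens += resp.usage.total_tokens
    elapsed = time.perf_counter() - started
    print(
        f"{model}: accuracy={correct / len(DATASET):.0%} "
        f"tokens={tokens} avg_latency={elapsed / len(DATASET):.1f}s"
    )
```

Swap in whatever scorer fits your task: an LLM judge, a regex, a human review queue. The point is that accuracy, latency, and token usage all come out of the same loop, on your data.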
Mistake 2: Validating with Three Happy Path Examples
This is the mistake that killed the tender.
A product team tested their AI feature against a handful of examples they already knew would work. The demo looked great. The first few user reports seemed fine. Everyone moved on to the next feature.
But the tender required the AI to handle the full range of real-world inputs — not the curated set that made the demo shine. When the prospect ran their own evaluation, the feature fell apart on edge cases the team had never tested. Failure categories they'd never seen. A million-euro deal, gone. Not because the technology couldn't do the job, but because nobody had checked whether it could do the job on anything other than the happy path.
This pattern repeats everywhere. One team ran evaluation across their full dataset before deployment and caught five distinct failure categories across 36 LLM runs. In 14 minutes. Every one of those failures would have reached production — or a prospect's evaluation — under the three-example approach.
An HR Tech team went from 43% to 86% accuracy in two iterations. Not two engineering sprints. Two evaluation cycles. The failures were there the whole time. They just needed structured evaluation to surface them.
AI always returns something. It never says "I don't know how to handle this input." It gives you a plausible-looking answer that happens to be wrong. Your three happy test cases can't catch that. Your users — or worse, your prospects — will.
The fix: Run evaluation across your full dataset before deployment. Group failures by category. Look at what breaks, not just what works. If you ship before evaluating at scale, you are outsourcing your QA to your users. Or to the prospect who was about to sign a seven-figure contract.
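The harness for this can be small. The sketch below assumes you already have a labeled dataset and some way to classify failures; run_model, the exact-match check, and classify_failure are placeholders for your own pipeline and failure taxonomy:

```python
from collections import Counter

def evaluate(run_model, dataset, classify_failure):
    """Run over the FULL dataset and bucket what breaks, not just what works."""
    failures = []
    for row in dataset:
        output = run_model(row["input"])
        if output != row["expected"]:  # replace with your real scoring logic
            failures.append(
                {"input": row["input"], "output": output,
                 "category": classify_failure(row, output)}
            )

    by_category = Counter(f["category"] for f in failures)
    print(f"{len(failures)}/{len(dataset)} failures in {len(by_category)} categories")
    for category, count in by_category.most_common():
        print(f"  {category}: {count}")
    return failures
```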
Mistake 3: Treating the Prompt Like Placeholder Copy
Production systems running on "You are a helpful assistant. Help the user with their request" are the AI equivalent of product requirements that say "make it good."
In one experiment with product recommendations, every model except GPT-5 scored 0% accuracy with a basic prompt. Zero. GPT-5 managed 60%. With a structured prompt containing domain logic, product context, and edge case handling, multiple models hit 90% accuracy. Same task. Same test cases. The prompt was the entire difference between failure and success.
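To make "structured prompt" concrete, here is the shape of the difference. The store, rules, and output format below are invented for illustration, not the prompt from the experiment:

```python
# The baseline: no domain logic, no context, no edge case handling.
# Prompts like this scored 0% for every model except GPT-5 in the experiment above.
BASIC_PROMPT = "You are a helpful assistant. Recommend products for this user."

# The structured shape. The domain and rules here are invented for illustration.
STRUCTURED_PROMPT = """You are the recommendation engine for an outdoor-gear store.

Domain logic:
- Recommend at most 3 products, ranked by fit with the user's stated activity.
- Never recommend items that are out of stock or over the user's budget.

Product context:
{catalog_snippet}

Edge cases:
- If the request is ambiguous, ask one clarifying question instead of guessing.
- If nothing fits, say so explicitly. Do not invent products.

Output format: a JSON array of product IDs, e.g. ["sku-102", "sku-377"].

User request: {user_request}
"""
```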
Teams consistently see 40%+ accuracy gains from prompt improvements alone. No model upgrades. No new infrastructure. Just better instructions written by the people who understand the domain.
But here's how the cost compounds. When your prompt is vague, you compensate by throwing a more expensive model at the problem. You're paying frontier-model prices to make up for instructions that a cheaper model could follow perfectly if the instructions were actually good. One team's sustainability AI scored below 40% accuracy with generic prompts on a frontier model. After domain experts rewrote the prompts with specific criteria, edge case handling, and examples of good outputs, they hit over 90% accuracy on a cost-efficient model.
The prompt is not documentation of what your AI does. The prompt is what your AI does. Treat it with the same rigor as product requirements. Version control it. Review it cross-functionally. Iterate based on evaluation data, not gut feel.
The fix: Get your domain experts — the product managers, the subject matter specialists, the people who have done this work manually for years — hands-on with prompts from day one. If your PM can't touch the prompt without filing a Jira ticket, your process is the bottleneck, not the model.
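One way to remove that bottleneck, sketched here with a hypothetical file layout, is to keep prompts as plain versioned text files that domain experts can edit and review like any other artifact:

```python
from pathlib import Path

PROMPT_DIR = Path("prompts")  # hypothetical layout: one plain-text template per prompt

def load_prompt(name: str, **variables: str) -> str:
    """Load a versioned prompt template and fill in its variables."""
    template = (PROMPT_DIR / f"{name}.txt").read_text()
    return template.format(**variables)

# A PM edits prompts/transaction_categorizer.txt in an ordinary pull request;
# the application picks up the change with no code modification:
# prompt = load_prompt("transaction_categorizer", input=raw_transaction)
```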
Mistake 4: No Cost Modeling Before You Ship
Traditional software has a beautiful economic model: build once, scale infinitely. Add a user, almost zero marginal cost. Your power users are your most profitable customers.
AI inverted this completely. Every interaction has an incremental cost. Your power users — the ones who love your product most — are now the ones who cost you the most to serve.
The CEO of Lovable shared a story: a user coded on their platform for 30 hours straight. 1,500 prompts in the first day. At a conservative average of $0.07 per request, that single user cost the company $105 in one day. The revenue? Probably a $20 monthly subscription. The AI costs alone exceeded 5x the entire month's subscription in 24 hours.
This isn't a startup edge case. It's the structural reality of AI economics. And the "optimize later" advice that works everywhere else? For AI, it means you're optimizing after you've already committed to an architecture that costs 10x what it should.
Token usage varies 10-100x between models for identical outputs. If you never tested this with your actual data before shipping, your cost projections are fiction. And if your pricing model doesn't account for the real cost per interaction, every new power user makes your unit economics worse.
The fix: Run cost modeling as part of your evaluation, not after launch. Test token usage across models with your real data. Know your actual cost per use case before you commit. The difference between testing first and optimizing later is €918 to €2,284 versus €5,326 to €8,690 for the same validated configuration.
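The model itself fits in a dozen lines. In this sketch, the token counts are the kind of numbers you would measure in your evaluation runs and the prices stand in for whatever your providers currently charge; every figure below is illustrative:

```python
# Token usage per request as measured in your evaluation runs, plus per-million-token
# prices from each provider's pricing page. All numbers below are illustrative.
MODELS = {
    #                        in_tokens, out_tokens, $/1M in, $/1M out
    "frontier-model":        (1_200, 2_400, 10.00, 30.00),
    "cost-efficient-model":  (1_200,   250,  0.60,  2.40),
}

REQUESTS_PER_MONTH = 500_000  # your projected volume

for name, (tok_in, tok_out, price_in, price_out) in MODELS.items():
    per_request = (tok_in * price_in + tok_out * price_out) / 1_000_000
    print(f"{name}: ${per_request:.4f}/request -> ${per_request * REQUESTS_PER_MONTH:,.0f}/month")
```

With these placeholder numbers the gap is roughly 60x per request. That is the kind of difference you want to find in an experiment, not on an invoice.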
What People Get Wrong About These Mistakes
"We'll fix quality after launch." You won't. AI failures are silent. Users don't file bug reports when the AI gives a plausible-but-wrong answer. They lose trust and leave. By the time you see it in engagement metrics, the damage is done — and you've lost the context for what went wrong. The feedback loop you need isn't production monitoring. It's pre-deployment evaluation that catches failure patterns before anyone encounters them.
"This level of testing slows us down." The opposite. Teams that run structured evaluation before deployment get from idea to validated configuration in 3 to 7 days. Teams that skip validation and iterate through production complaints take 8 to 14 weeks to reach the same quality — at four to five times the cost. Speed without proof is not speed. It's risk with a delayed invoice.
The Common Thread
Every one of these mistakes has the same root cause: decisions made without data. Model selection on reputation. Validation on vibes. Prompts written without domain input. Costs estimated from pricing pages instead of real experiments.
Data beats gut feel. Always.
The teams that win bring evidence to AI decisions the same way they brought data to product decisions when Amplitude changed everything. The cost of getting this right is measured in hours and hundreds of dollars. The cost of getting it wrong is measured in lost tenders, blown budgets, and users who leave without telling you why.
Your AI feature is live. Three happy test cases are not proof that it works. Structured evaluation is. That's the line worth pasting into Slack.