3 Model Selection Mistakes That Are Quietly Burning Your AI Budget

By Lovelaice

15 Oct 2025

You are probably overpaying for your AI feature right now. Not because AI is expensive — because you never tested whether the model you chose is the right one for the job you need done.

Most product teams pick a model the way they pick a restaurant in a new city: reputation, a friend's recommendation, whatever shows up first. Then they wire it into the product, ship it, and never look at the bill closely enough to realize they're paying ten times what they should.

These three mistakes don't show up as line items anyone flags. They compound quietly — in inflated API costs, in degraded quality that drives churn, and in engineering hours spent compensating for a decision nobody validated. By the time someone asks "why is our AI so expensive," the damage is already baked in.

Mistake 1: You Picked the Model on Reputation, Not on Your Data

"We're using GPT-4 because it's the best."

Best at what? On whose data? For which task?

Generic benchmarks measure generic performance. Your product is not generic. A model that dominates a leaderboard for creative writing can be mediocre at structured data extraction. A frontier model optimized for complex reasoning can actually hurt performance on a simple classification task — while costing 10-20x more per request.

A FinTech team ran a structured model comparison on their actual transaction categorization task. Not a leaderboard. Their real inputs, their real edge cases. The result: the model they assumed was best cost 10x more than the alternative and delivered lower accuracy.

In another experiment, GPT-5 achieved 60% accuracy on a data extraction task while GPT-4o hit 100%, at a fraction of the cost and with 4.9 seconds of latency versus 47.6. The frontier model used 10-20x more tokens than the cost-efficient option. "Newer means better" is not a model selection strategy. It's a way to spend $96,000 a year on a model you never validated.

Now multiply that cost gap by scale. A hypothetical AI feature for Airbnb — personalized property descriptions based on user interests and travel history — showed that GPT-5 was 10x more expensive than GPT-4.1. At Airbnb's scale, millions of users viewing millions of properties daily, that's the difference between hundreds of thousands and millions per month in AI costs.
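
To see how fast the gap compounds, here is a back-of-envelope sketch. Every number in it is a placeholder (assumed traffic, assumed per-request prices), not Airbnb's real volumes or any model's actual rates; the point is the multiplication, not the inputs.

```python
# Back-of-envelope monthly cost comparison for one AI feature.
# All numbers are placeholders: swap in your own measured per-request
# cost and your own traffic estimates.

requests_per_day = 2_000_000          # assumed daily description generations
cost_per_request = {
    "frontier_model": 0.05,           # assumed $ per request
    "cost_efficient_model": 0.005,    # assumed ~10x cheaper per request
}

for model, unit_cost in cost_per_request.items():
    monthly = requests_per_day * unit_cost * 30
    print(f"{model}: ${monthly:,.0f} per month")

# frontier_model: $3,000,000 per month
# cost_efficient_model: $300,000 per month
```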

The fix: treat model selection as an output of experimentation, not an input. Test your specific use case across three or four models, including both frontier and cost-efficient options. Measure accuracy, cost, and latency on the task you're actually building. The experiment costs maybe $200 in API calls. The wrong model costs you orders of magnitude more per year.
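
For concreteness, here is a minimal sketch of what that experiment can look like in Python, using the OpenAI SDK. The transaction-categorization task mirrors the FinTech example above; the test cases, prompt, model list, and per-1K-token prices are all placeholders you would replace with your own, and comparing across providers just means swapping the client call.

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder test set: your real inputs plus the answers a domain
# expert considers correct, including the ugly edge cases.
test_cases = [
    {"input": "AMZN Mktp US $42.10", "expected": "shopping"},
    {"input": "Shell Oil 10452 $61.33", "expected": "transport"},
    # ... a few hundred representative cases
]

# Placeholder model list with assumed prices in $ per 1K tokens --
# substitute the models you care about and their current rates.
MODELS = {
    "gpt-4o":      {"in": 0.0025,  "out": 0.0100},
    "gpt-4o-mini": {"in": 0.00015, "out": 0.0006},
}

PROMPT = ("Categorize this card transaction into exactly one category. "
          "Reply with the category only.\n\nTransaction: {tx}")

for model, price in MODELS.items():
    correct, total_cost, total_latency = 0, 0.0, 0.0
    for case in test_cases:
        start = time.time()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(tx=case["input"])}],
        )
        total_latency += time.time() - start
        total_cost += (resp.usage.prompt_tokens / 1000) * price["in"] \
                    + (resp.usage.completion_tokens / 1000) * price["out"]
        if resp.choices[0].message.content.strip().lower() == case["expected"]:
            correct += 1
    n = len(test_cases)
    print(f"{model}: accuracy {correct / n:.0%}, "
          f"cost ${total_cost:.2f}, avg latency {total_latency / n:.1f}s")
```

Three numbers per model, measured on your own data. That table, not a leaderboard, is what the model decision should come from.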

Mistake 2: You're Paying Frontier-Model Prices to Compensate for a Bad Prompt

This is the budget mistake nobody talks about, because it looks like a model problem when it's actually an instruction problem.

When your prompt is vague — "You are a helpful assistant. Help the user with their request" — the model has to guess what you want. Bigger, more expensive models guess better. So the team concludes the cheap model "doesn't work" and upgrades to the frontier option. Problem solved. Except the problem was never the model.

In one experiment with product recommendations, every model except GPT-5 scored 0% accuracy with a basic prompt. Zero. With a structured prompt containing domain logic, product context, and edge case handling, multiple models hit 90% accuracy. Same task. Same test cases. The prompt was the entire difference between failure and success.

One team's sustainability AI scored below 40% accuracy with generic prompts on a frontier model. After domain experts rewrote the prompts with specific criteria, edge case handling, and examples of good outputs, they hit over 90% accuracy on a cost-efficient model. They downgraded the model and upgraded the results — simultaneously.

Teams consistently see 40%+ accuracy gains from prompt improvements alone. No model upgrades. No new infrastructure. Just better instructions written by the people who understand the domain.

Here's the budget math that should make you uncomfortable: you might be paying frontier-model prices, the most expensive per-token rates available, to compensate for instructions that a model costing a fraction of the price could follow perfectly if someone who understood the domain had written them. That's not an engineering problem. That's a process problem. If your PM or domain expert can't touch the prompt without filing a Jira ticket, your process is the bottleneck, and it's an expensive one.

The fix: get domain experts hands-on with prompts before you decide the model isn't good enough. The person who knows what a correct answer looks like in your field is the person who should be writing the instructions. Then test the improved prompt on a cheaper model before assuming you need the expensive one.
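
Here is what "hands-on with prompts" looks like in practice: the gap between a vague instruction and a structured one written with a domain expert. The example below reuses the transaction-categorization scenario from earlier, and every rule in it is invented for illustration, not taken from a real production prompt.

```python
# The vague version: the model has to guess the categories, the rules,
# and the output format.
VAGUE_PROMPT = "You are a helpful assistant. Help the user categorize this transaction."

# The structured version: a domain expert has encoded the allowed outputs,
# the decision rules, and the edge cases. All rules here are illustrative.
STRUCTURED_PROMPT = """You categorize card transactions for a personal finance app.

Allowed categories: groceries, dining, transport, shopping, utilities, other.

Rules:
- Supermarkets and grocery delivery are 'groceries', even late at night.
- Fuel stations, parking, and ride-hailing are 'transport'.
- Marketplace purchases (e.g. AMZN Mktp) are 'shopping' unless the merchant
  name clearly indicates food delivery.
- If the merchant is ambiguous, choose 'other' rather than guessing.

Reply with exactly one category from the list, lowercase, no punctuation.

Transaction: {transaction}
"""
```

Run both versions through the comparison harness above and you'll see which part of the quality gap belonged to the model and which part belonged to the instructions.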

Mistake 3: You Never Modeled the Cost Before You Shipped

Traditional software has a beautiful economic model: build once, scale infinitely. Add a user, near-zero marginal cost. Your power users are your most profitable customers.

AI inverted this completely. Every interaction has an incremental cost. Your power users — the ones who love your product most — are now the ones who cost you the most to serve.

The CEO of Lovable shared a concrete example: a user coded on their platform for 30 hours straight, firing off 1,500 prompts in the first day. At a conservative $0.07 per request, that single user cost $105 in one day. The revenue? Probably a $20 monthly subscription. In 24 hours, the AI costs alone exceeded five times the entire month's subscription revenue.

Token usage varies 10-100x between models for identical outputs. If you never tested this with your actual data before shipping, your cost projections are fiction. And "optimize later" — the advice that works for almost everything else in product development — means you've already committed to an architecture and a pricing model built on numbers you made up.

This mistake quietly burns budget in two ways. First, the obvious one: you're paying more per request than you need to because you never compared models on cost. Second, the structural one: your pricing doesn't account for the real cost per interaction, so every new power user makes the unit economics worse. You're scaling a loss and calling it growth.

The fix: model costs before you ship. Run your actual data through multiple models and measure token usage, not just accuracy. Factor the real cost per interaction into your pricing model. A feature that looks profitable at 1,000 users might be underwater at 10,000 — and you want to know that in a three-day experiment, not a quarterly board meeting.
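
A minimal sketch of that cost model, with placeholder numbers: the per-request cost would come from the comparison you ran earlier, and the usage profiles from your own analytics, not from the guesses below.

```python
# Toy unit-economics check: does the subscription price survive your
# heaviest users? All numbers are placeholders -- plug in the per-request
# cost you measured and the usage distribution you actually observe.

subscription_per_month = 20.00     # what a user pays you
cost_per_request = 0.07            # measured average AI cost per request

usage_profiles = {
    "typical user": 50,            # assumed requests per month
    "heavy user":   400,
    "power user":   1_500,         # the 1,500-prompts-in-a-day outlier
}

for profile, requests in usage_profiles.items():
    ai_cost = requests * cost_per_request
    margin = subscription_per_month - ai_cost
    print(f"{profile}: AI cost ${ai_cost:,.2f} -> margin ${margin:,.2f}")

# typical user: AI cost $3.50 -> margin $16.50
# heavy user: AI cost $28.00 -> margin $-8.00
# power user: AI cost $105.00 -> margin $-85.00
```

The same loop, pointed at a cheaper model's measured cost per request, tells you whether the pricing problem is the model, the plan, or both.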

What People Get Wrong About These Mistakes

"We'll optimize costs later, once we have traction."

This is the most dangerous advice in AI product development. In traditional software, optimizing later is fine — infrastructure costs are relatively flat. In AI, every interaction between now and "later" is burning money at a rate you haven't measured. A team spending $847 a month on a model they never validated isn't accumulating technical debt. They're accumulating a real invoice. And the architecture decisions they make around the wrong model — caching strategies, token budgets, rate limits — all have to be reworked when they finally do the comparison they should have done on day one. "Later" costs more than "now." Always.

"Our engineering team handles model selection. It's a technical decision."

Model selection is a product decision with a technical component. Engineering knows how to integrate the model, manage rate limits, handle error states. They do not know — and should not be expected to know — whether the output is correct for your specific domain. A compliance summary that's 90% accurate sounds impressive until you realize the 10% it misses are the regulatory flags that matter most. The person who knows what "correct" looks like is the PM or domain expert. If they're not involved in the evaluation that drives model selection, the decision is being made without the most important input. Domain knowledge is not a soft skill. It is the thing that separates a $200 experiment from a $96,000 annual mistake.

The Common Thread

All three mistakes share a root cause: Ship and Pray.

Pick the popular model and ship. Use a vague prompt and ship. Skip cost modeling and ship. Then pray the budget holds and users don't notice what you didn't validate.

The teams that run structured evaluation before deployment get from idea to validated configuration in 3 to 7 days, at a cost of roughly €918 to €2,284. The teams that skip validation and iterate through production complaints take 8 to 14 weeks to reach the same quality — at €5,326 to €8,690. And that's before you count the monthly API overspend on a model you never should have picked.

Data beats gut feel. Always. Especially when gut feel is costing you 10x what the data would have told you to spend.

The next time someone on your team says "we're using GPT-4 because it's the best," paste this into Slack: Best on whose data? For which task? Show me the comparison.