Why Your AI Evaluation Is Lying to You

Written by Madalina Turlea
31 Mar 2026
A team came to us a few months ago at Lovelaice. They'd built an AI chatbot to help their users understand the analytics on their platform. The kind of feature that sits on top of a standard dashboard and lets users ask questions about their data in natural language.
They'd done what most teams do: engineering built it, deployed it, and set up an LLM-as-a-Judge to track quality. The judge was rating whether each response was "helpful" and "useful." The criteria for what that actually meant? Left undefined — up to the AI to decide.
The team was unsure of their AI evaluation system, so we looked at it together.
We did something simple. We set up an experiment and manually reviewed 10-20 real user questions and the AI's responses.
Within a few hours, we found three specific failure patterns:
The AI was misleading on small samples. It would highlight "best performance" or "breakthroughs" on tiny datasets, three data points becoming a "significant trend." The insights sounded authoritative but were statistically meaningless.
The AI was writing wildly expensive SQL queries. The queries were too broad, pulling far more data than needed. This wasn't just a performance issue — it was a cost issue that would scale badly.
The AI was overfitting to the prompt examples. The AI would not adapt to the type of insight the user was asking for; it followed the example in the prompt too rigidly. The team had tried adding more examples, but that caused the AI to stop complying with the rules altogether.
Here's what struck me: with just these three issues identified and fixed, the team would have solved over 70% of the AI's failure modes.
Their LLM-as-a-Judge, the one rating everything as "helpful"? It couldn't see any of this. It didn't know that consistency matters for analytics. It didn't understand that small-sample insights are dangerous. It had no concept of query efficiency. It was checking vibes, not value.
This is the pattern we see across nearly every team that comes to us: they've automated evaluation before they understand what they're evaluating.
The most common AI evaluation mistake
Here's what happens in most teams building AI features. It's so consistent that we can almost predict the sequence before they tell us.
First, engineering picks a model, writes a prompt, ships the feature.
Then, teams know they need to evaluate their AI feature. The team either sets up thumbs up/thumbs down feedback from users or builds an LLM-as-a-Judge with instructions to grade the response on a scale.
Neither approach is actually telling you anything.
Thumbs up/down is almost useless as a quality signal. Be honest with yourself: how many times have you received a mediocre or flat-out wrong AI response and actually clicked the feedback button? Most users don't report bad AI. They just quietly stop using the feature. By the time you have enough thumbs-down data to see a pattern, the silent churn has already started.
And the LLM-as-a-Judge? It's an unvalidated AI evaluating another unvalidated AI.
You've outsourced quality control to a system that has no idea what "useful" means for your specific users, your specific domain, your specific failure modes. Research backs this up: generic LLM judges, the kind that run on instructions like "rate from 1-10," achieve only 60-70% agreement with human evaluators. That's barely better than random for anything beyond surface-level formatting checks.
And Anthropic's own engineering team warns: "A bad judge is worse than no judge."
A bad judge doesn't just miss failures. It gives you confidence that things are working when they're not. That's worse than having no evaluation at all, because at least without a judge, you know you're flying blind.
Why this keeps happening
When I talk to teams about this, I hear the same three responses. They're reasonable on the surface, but each one hides a misunderstanding about how AI evaluation actually works.
"We're still in beta, we don't have many users yet, so we just rely on thumbs up/down."
This is exactly when evaluation matters most. Beta is your window to catch failures before they scale. And if you're relying on user feedback as your quality signal, you're building on sand.
The reality of AI products is that they fail silently. No error messages, no crashes, just confident-sounding responses that look fine unless you actually know what to check for. Users don't file bug reports for a recommendation that was merely "okay" instead of useful. They don't tell you the AI missed a nuance that a domain expert would have caught. They just close the tab.
39% of AI chatbots deployed in 2023-2024 were eventually pulled back due to performance issues. In 43% of failed AI deployments, the system was technically functional but produced low-quality outputs that weren't caught until users complained.
"We want to benchmark quality first, then do error analysis"
This is backwards, and it's the most common sequence mistake we see.
Teams want a score first: "What's our accuracy?" And then they plan to investigate errors later. But a benchmark score without understanding how and why your AI fails is a vanity metric. What does "72% accuracy" actually tell you? Nothing actionable.
Error analysis is how you benchmark. When you manually review outputs and document specific failures ("it confused milligrams with micrograms," "it ignored the user's constraint"), you're simultaneously learning your quality level and discovering what to fix.
The alternative we see in practice is even worse: a reactive, one-off process. A new error surfaces in production. A domain expert or PM says "handle it like this." Engineering patches the prompt. That's it. No systematic process. No tracking of whether the fix introduced new failures. No understanding of how that error relates to other errors.
"Evaluation is too complex, you need annotation infrastructure, massive datasets, specialized tools..."
This is the myth that keeps the most teams stuck. It sounds expensive and time-consuming, so they either skip evaluation entirely or jump straight to automation.
You don't need a data pipeline. You don't need a thousand labeled examples to start.
You need 10 to 20 test cases and the willingness to actually read the AI's responses and write down what went wrong.
Anthropic recommends starting with 20-50 examples. OpenAI's guide says 50-100 for establishing a human baseline. Not thousands. Not hundreds. A few dozen thoughtful cases with detailed notes.
50 cases with meaningful manual annotations will teach you more about your AI's failure patterns than any automated system running on generic criteria. Those notes, those patterns, that's your real evaluation foundation. Everything else, including knowing if and when you actually need LLM-as-a-Judge, builds on top of it.
What to do instead: The Evaluation Ladder
Here's the approach that actually works. We've refined this across 100+ product teams and 1,000+ experiments at Lovelaice. The core principle is simple: earn each step before moving to the next.
Step 1: Explore and compare
Before you commit to anything, test your task across multiple models with the same prompt. See how different models handle it. Don't pick a model upfront based on benchmarks or Twitter hype; let the data show you which model families work best for your specific problem.
Step 2: Manually annotate and actually write things down
This is where the real work happens, and it's the step most teams skip.
Read the AI's responses. Not skim, read. For each one, write down specifically how it failed. Not "wrong answer" or "incorrect." Instead: "It used milligrams instead of micrograms." "It recommended a product that's been discontinued." "It failed to handle the small sample size."
Compare multiple models' responses on the same test case, side by side. Note which responses handle what well and where each one breaks.
This is where domain expertise becomes your biggest differentiator. An engineer might mark a response as "looks fine." A domain expert sees that the AI referenced an outdated regulation, or that the tone is wrong for the audience, or that the recommendation technically works but misses the obvious better option.
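Annotations don't need special tooling; even a flat list of records works. Here's a minimal sketch of what one annotated case might look like — the field names and the example note are illustrative, not from any specific annotation tool:

```python
from dataclasses import dataclass

# Hypothetical structure for one manual annotation.
@dataclass
class Annotation:
    test_case: str     # the user question or input
    model: str         # which model produced the response
    response: str      # the raw AI output
    passed: bool       # your overall verdict
    failure_note: str  # the specific failure, in your own words

notes = [
    Annotation(
        test_case="What was our best-performing campaign last week?",
        model="model-a",
        response="Campaign X shows a significant upward trend...",
        passed=False,
        failure_note="Called 3 data points a 'significant trend' (small-sample insight)",
    ),
]

# Even this flat list is enough to start spotting patterns later.
failed = [a for a in notes if not a.passed]
print(f"{len(failed)}/{len(notes)} annotated cases failed")
```

A spreadsheet with the same columns works just as well; what matters is that every failure gets a specific note, not a generic "wrong."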
Step 3: Iterate and expand
Run more test cases. Add edge cases: the weird inputs, the conflicting requirements, the ambiguous requests. Add adversarial inputs: the things users will actually throw at your AI. Keep annotating. Keep noting failures.
Step 4: Recognize patterns
This is the step that transforms scattered notes into actual insight.
Look across your annotations. What keeps coming up? What are the different categories of failures?
Prioritize these patterns by two dimensions: how often they occur, and how much impact they have on the user. A formatting error that happens frequently but doesn't change the meaning is lower priority than a factual error that's rare but dangerous.
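The frequency-times-impact prioritization above can be sketched in a few lines. The counts and impact weights here are hypothetical placeholders — in practice they come from your own annotation notes:

```python
# Rank documented failure patterns by frequency x user impact.
# Counts and impact weights (1 = cosmetic, 3 = dangerous) are made up.
patterns = {
    "small-sample insight presented as trend": {"count": 14, "impact": 3},
    "overly broad SQL query":                  {"count": 9,  "impact": 2},
    "markdown formatting glitch":              {"count": 21, "impact": 1},
}

ranked = sorted(
    patterns.items(),
    key=lambda kv: kv[1]["count"] * kv[1]["impact"],
    reverse=True,
)

for name, p in ranked:
    print(f"{p['count'] * p['impact']:>3}  {name}")
```

Note how the rare-but-dangerous pattern outranks the frequent-but-cosmetic one, which is exactly the point of weighting by impact.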
In Lovelaice, we guide teams step by step on how to do this.
Step 5: Improve systematically
For each failure pattern, determine the right fix. Some patterns are prompt improvements: adding context, specifying constraints, providing examples of correct behavior. Some need better context engineering. Some require breaking a complex task into simpler steps. Some need tools, like a calculator for numerical operations instead of relying on the LLM.
The key: each improvement should target a specific, documented failure pattern.
Step 6: Write evals to automate checking for specific errors
Manual annotation and review of AI responses doesn't scale. You need to automate your checks and validations to measure improvements, track error distributions, and monitor the AI feature's quality in production.
For this you need quantitative data.
Write deterministic checks first
This is the step that saves you the most money and gives you the most reliable signal.
Look at each failure pattern and ask: "Can I measure this with a rule?" You'll be surprised how many can be checked deterministically:
- Does the response include required fields? (String matching)
- Is the output in the correct format? (Schema validation)
- Are numerical values within expected ranges? (Range check)
- Does it reference only approved sources? (Allowlist check)
- Is the response length within acceptable bounds? (Length check)
- Does it contain prohibited content or phrases? (Keyword check)
Deterministic checks cost nothing, run in milliseconds, and achieve 95%+ accuracy for what they measure. Research suggests that up to 40% of evaluation criteria can be handled this way. Build these first. They're your foundation.
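To make this concrete, here's a minimal sketch of what those checks might look like for an AI that returns a JSON payload. The field names, value range, allowlist, and prohibited terms are all hypothetical; substitute the ones your own failure patterns call for:

```python
import json
import re

# Hypothetical criteria derived from documented failure patterns.
REQUIRED_FIELDS = {"summary", "metric", "value"}
APPROVED_SOURCES = {"internal_db", "docs"}
PROHIBITED = re.compile(r"\b(guaranteed|breakthrough)\b", re.IGNORECASE)

def check(raw_response: str) -> dict:
    # Schema validation: output must be parseable JSON at all.
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return {"valid_json": False}
    summary = data.get("summary", "")
    return {
        "valid_json": True,
        # Required fields present (string/field matching)
        "has_required_fields": REQUIRED_FIELDS <= data.keys(),
        # Numerical range check
        "value_in_range": 0 <= data.get("value", -1) <= 100,
        # Allowlist check on cited sources
        "sources_approved": set(data.get("sources", [])) <= APPROVED_SOURCES,
        # Length bounds
        "length_ok": 20 <= len(summary) <= 500,
        # Keyword check for prohibited phrasing
        "no_prohibited_terms": not PROHIBITED.search(summary),
    }

good = json.dumps({
    "summary": "Revenue grew 4% week over week across 1,200 orders.",
    "metric": "revenue", "value": 4, "sources": ["internal_db"],
})
print(check(good))
```

Each boolean maps back to a documented failure pattern, which is what makes the results actionable rather than a vague score.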
Write LLM-as-a-Judge but make it surgical, specific, validated
Now — and only now — you've earned the right to use LLM-as-a-Judge.
But not the way most teams do it. Here's what makes the difference between a judge that works and one that gives you false confidence:
Make it specific, not generic. Your judge's criteria should come directly from your documented failure patterns — the ones you couldn't check deterministically. Not "is this helpful?" but "does the response correctly identify all insights from the data?"
Make it binary, not scalar. Don't ask for a 1-10 rating. Ask: "Is this correct or incorrect?" for each specific criterion. Binary judgments are more reliable, more actionable, and easier to validate.
Validate your judge. Run it against your manually annotated examples — the ones where you already know the right answer. If your judge doesn't agree with your human annotations at least 80% of the time, it's not ready. Validated judges reach 80-85% agreement with humans. Unvalidated ones hover at 60-70%. That gap is the difference between a useful quality signal and noise.
Evaluate your judge continuously. Your judge can drift just like your AI feature can. Periodically check it against new human annotations. If agreement drops, update the criteria.
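The validation step is just an agreement rate between your judge's binary verdicts and your human annotations. A minimal sketch, with both label lists as hypothetical stand-ins for real data (`judge_labels` would come from running your judge on the annotated cases):

```python
# Binary verdicts on the same 10 annotated test cases.
human_labels = [True, True, False, True, False, False, True, True, False, True]
judge_labels = [True, True, False, False, False, True, True, True, False, True]

agree = sum(h == j for h, j in zip(human_labels, judge_labels))
agreement = agree / len(human_labels)
print(f"judge-human agreement: {agreement:.0%}")

# The 80% threshold from the text: below it, refine the judge's
# criteria before trusting its scores.
ready = agreement >= 0.8
print("judge ready" if ready else "judge not ready, revise criteria")
```

Rerunning this same check against fresh human annotations every so often is also how you catch judge drift.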
The bottom line
If you know you need to write evals for your AI feature, but you don't know where to start, this is your roadmap.
Start by reading real AI outputs. Write detailed notes on the failed responses. Find the failure patterns. Only then should you start automating. Build deterministic checks first and use the LLM-as-a-Judge surgically — for specific, high-impact failure patterns where human judgment was genuinely required and couldn't be captured in a rule.
Organizations that discover AI failures post-deployment spend 10-15x more on fixes compared to those who invest in pre-deployment evaluation. The few days of manual work at the start isn't a cost — it's the highest-ROI investment you can make in your AI feature.
Stop automating evaluation before you understand what you're evaluating. LLM-as-a-Judge is not the first step. It's the last one. And most of what teams want to use it for should be a deterministic check instead.
Your AI judge is only as good as the failure patterns you teach it to look for. And you can only learn those patterns one way: by doing the work yourself first.