Emilio Carrión
Your LLM passes the benchmark and fails in production
Generic metrics tell you if your LLM is wrong, not if it's wrong for your business. Why evaluating LLMs is a domain problem and how to tackle it with LLM-as-a-Judge.
You've probably seen it this week: the McDonald's support chatbot writing Python code to reverse a linked list. A user asks for help with a script, the bot solves it in O(n) complexity, and then asks if they'd like some McNuggets. The joke went viral: "stop paying for Claude Code, McDonald's support is free".
It's easy to laugh. A customer-support LLM answering programming questions is an obvious failure anyone can spot. But the case that actually makes me think is a different one: when the LLM stays inside your domain, answers exactly what you asked, and still gets it wrong in a way only someone who knows your business can detect.
A team I support operates something like that. An internal AI system that generates operational reports for hundreds of people inside the organization, not a customer-facing chatbot. I can't share the details, but the patterns I'm about to describe are transferable. I'll use a customer-service chatbot as the running example, because the mechanism is identical and it's the case everyone has in mind after the McDonald's thing.
I've been writing for months about verification as an engineering discipline. The thesis is that in complex systems, the ability to verify that what you produce is correct matters as much as producing it. With LLMs in production, that thesis becomes even more urgent. Because generic evaluation metrics (BLEU, ROUGE, hallucination detectors) tell you if the output is incorrect, but not if it's incorrect for your business.
The problem with an example
Let me set up the case with a support chatbot for an e-commerce store. The customer asks about a delay on their order. The bot answers: "Your order is 5 days delayed. This is a serious problem. We recommend you contact our urgent escalations team."
Correct data. Relevant answer. Zero factual hallucinations. And yet, the tone is a disaster. "Serious problem" creates unnecessary anxiety. "Urgent escalations" isn't the name of the team, it's internal jargon that leaked out. A hallucination detector sees nothing wrong. ROUGE tells you the answer is similar to the good ones. But the customer experience lead reads it and says: this can't ship.
And that's where generic metrics leave you on your own.
Two levels of strictness
We use DeepEval's GEval. Each metric is a natural-language criterion string that a judge LLM (Claude Sonnet) evaluates, scoring from 0 to 1. But they don't all have the same threshold, and that's the first design decision I want to share.
There are compliance criteria with a high threshold (0.9). They're almost binary: "the answer does not reveal it was AI-generated", "the answer does not use alarmist language with the customer". Either it meets the rule or it doesn't. You might think a regex solves it. I'll explain why not in a minute.
And there are quality criteria with a lower threshold (0.7): "the answer is empathetic and practical", "it doesn't promise what it can't deliver", "suggestions are bounded and actionable". These are inherently subjective. Two reasonable people will disagree about whether a tone is "empathetic enough". The lower threshold allows for stylistic variance without letting clear failures through.
We calibrated the exact thresholds empirically: we reviewed a batch of responses by hand and adjusted. But the transferable decision is this: don't put all your criteria at the same threshold. If you do, you either let compliance violations through or you generate false positives on quality. Two levels separate the binary from the subjective.
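The two-tier design can be sketched in plain Python. The criterion texts and thresholds are the ones from this section; the dictionary structure and the `passes` helper are mine, for illustration. In the real system each entry would become a DeepEval `GEval` metric with its own criterion string and threshold.

```python
# Sketch of the two-tier threshold design. Criterion strings are the
# ones described in the text; in practice each becomes a judge metric.

COMPLIANCE_THRESHOLD = 0.9  # near-binary rules: meets it or doesn't
QUALITY_THRESHOLD = 0.7     # subjective criteria: allows stylistic variance

CRITERIA = {
    "no_ai_disclosure": {
        "criterion": "The answer does not reveal it was AI-generated.",
        "threshold": COMPLIANCE_THRESHOLD,
    },
    "no_alarmist_language": {
        "criterion": "The answer does not use alarmist language with the customer.",
        "threshold": COMPLIANCE_THRESHOLD,
    },
    "empathetic_and_practical": {
        "criterion": "The answer is empathetic and practical.",
        "threshold": QUALITY_THRESHOLD,
    },
    "bounded_suggestions": {
        "criterion": "Suggestions are bounded and actionable.",
        "threshold": QUALITY_THRESHOLD,
    },
}

def passes(name: str, judge_score: float) -> bool:
    """A metric passes only if the judge's 0-1 score meets its threshold."""
    return judge_score >= CRITERIA[name]["threshold"]
```

The point of the split shows up immediately: a judge score of 0.8 passes `empathetic_and_practical` but fails `no_ai_disclosure`, which is exactly the asymmetry you want between the binary and the subjective.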
Domain exceptions
So far, nothing you couldn't build in an afternoon. The part where we really spent time was the exceptions, and I think it's the part most teams underestimate. We certainly did: we thought defining the criteria would take a couple of afternoons, and it took far longer.
Inverted semantics: when "going up" is bad
Back to the support chatbot. The bot analyzes metrics and tells the shift lead: "average resolution time is approaching the target". Sounds positive, right? But if the target is a ceiling (resolve in under 24 hours), approaching the target means you're taking longer. You're getting worse.
Without this clarification in the evaluation criterion, the judge LLM reads "approaching the target" as positive, because that's the default semantics. We had to be explicit:
"All support KPIs are minimize metrics (resolution time, reopen
rate, first response time). A higher value is always worse.
'Going up' always means getting worse. 'Approaching the target'
means approaching the upper limit, which is NEGATIVE."
This kind of exception is the most dangerous because it's not a word you can search for. It's a semantic inversion that exists only in your domain. We ran into it with different metrics, but the mechanism was identical. And I'll be honest: it took us a while to notice. For the first few weeks the coherence scores looked fine and we couldn't understand why the summaries sounded "off" when we read them. Until someone from the operations team pointed out that the model was celebrating things that were actually bad news.
Internal jargon that looks like a violation
But it's not only about semantics. A typical customer-service criterion would be "the answer must not expose internal processes to the customer". Clear rule. But in many support teams, "escalate" is a standard action that customers know and expect ("I'll escalate your case to the specialized team"). The judge penalizes "escalate" because it sounds like an internal process. Or think of a ticket status called "blocked" that the bot mentions to the customer. Internally it's a normal operational state. To the customer it sounds like nobody is going to look at their case.
A regex that filters these words would penalize them every time. A judge LLM without context would too. The solution is to codify the exception directly inside the criterion string: "'Escalate' is acceptable terminology for the customer in this context and should not be penalized. 'Blocked' IS internal terminology and should not be exposed."
Terminology that sounds alarmist out of context
And then there's the reverse case. The criterion forbids alarmist language: no "serious", "critical", "urgent". Makes sense, you don't want the bot to create anxiety. But "critical ticket" is a standard priority category in any support system. When the bot says "I've classified your case as critical", it isn't being alarmist. It's reporting priority.
Same mechanism, same solution: carve-out in the criterion. "'Critical' referring to ticket priority is NOT alarmist language and should not be penalized. 'Critical' as a general adjective ('the situation is critical') IS alarmist language."
The common pattern
In all three cases, the solution was the same: codify the exception as a carve-out inside the criterion string, not in post-processing code. The judge LLM needs to see the exception in the same prompt where it sees the rule. If you move it to Python logic, the judge penalizes correctly according to its own criterion, and then you discard its judgment afterward. That doesn't scale.
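That composition can be sketched like this. The rule and carve-out texts are the ones from the examples above; the `build_criterion` helper is hypothetical, just a way to show that the exceptions end up in the same string the judge reads, not in post-processing code.

```python
def build_criterion(rule: str, carve_outs: list[str]) -> str:
    """Compose a judge criterion: the rule plus its domain exceptions,
    in one string, so the judge sees both in the same prompt."""
    if not carve_outs:
        return rule
    exceptions = "\n".join(f"- {c}" for c in carve_outs)
    return f"{rule}\nExceptions:\n{exceptions}"

# The internal-jargon example from above, composed as one criterion.
criterion = build_criterion(
    rule="The answer must not expose internal processes to the customer.",
    carve_outs=[
        "'Escalate' is acceptable terminology for the customer in this "
        "context and should not be penalized.",
        "'Blocked' IS internal terminology and should not be exposed.",
    ],
)
```

The resulting string is what goes into the metric definition; there is no second pass that overrides the judge's score afterward.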
80% of the development time for our evaluation system was iterating on the criterion strings. Not the infrastructure, not the pipeline, not the integration. The strings. The domain knowledge your team has in their heads but has never written down anywhere. Turning that into evaluable criteria was the real work. Not the code.
Scenarios as regression tests
We have predefined scenarios with mocked data covering the edge cases. What's interesting isn't that they exist (any team should have evaluation tests), but that the criteria in the tests are more specific than the ones in production.
In production you ask general things: "does the answer expose internal processes?". In a scenario test you ask: "given a customer with an order delayed 5 days, does it mention the estimated resolution window? Does it avoid promising a concrete date? Does it offer actionable alternatives?".
That extra granularity catches prompt regressions that general metrics miss. It happened to us: we changed the system prompt and the production metrics kept passing. But a scenario test caught that the model had stopped causally connecting two related metrics (a reasoning problem, not a formatting one). The general metrics were blind to it.
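A scenario, as described, might be structured like this. The names, mocked input, and checks are illustrative (taken from the delayed-order example above); in practice each check becomes its own judge criterion run against the mocked conversation.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """A regression test: mocked input plus checks that are more
    specific than the general production metrics."""
    name: str
    mocked_input: str
    expected_checks: list[str] = field(default_factory=list)

delayed_order = Scenario(
    name="order_delayed_5_days",
    mocked_input="Customer asks about an order delayed 5 days.",
    expected_checks=[
        "Mentions the estimated resolution window.",
        "Avoids promising a concrete delivery date.",
        "Offers actionable alternatives.",
    ],
)
```

When a prompt change ships, the suite replays every scenario and evaluates each check, which is how a reasoning regression surfaces even while the general production metrics keep passing.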
Verification isn't a final step, it's a continuous discipline. When you change a prompt, you need to know what you broke. Generic metrics don't tell you. Domain scenarios do.
What we haven't solved yet
I don't want to cherry-pick the parts that work. There are things we don't have sorted.
The judge has variance. Run the same evaluation twice and the score can shift by 5-10%. We mitigate by looking at trends rather than absolute values, but it's not ideal.
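"Looking at trends rather than absolute values" can be as simple as comparing rolling means. This is a sketch; the window size and the decision rule are mine, not from our system, but the idea is to absorb the 5-10% run-to-run noise before alerting.

```python
from statistics import mean

def trend_alert(scores: list[float], window: int = 5,
                drop_threshold: float = 0.05) -> bool:
    """Flag a regression only when the rolling mean of judge scores
    drops by more than drop_threshold versus the previous window,
    so single noisy runs don't trigger alerts."""
    if len(scores) < 2 * window:
        return False  # not enough history to compare two windows
    previous = mean(scores[-2 * window:-window])
    current = mean(scores[-window:])
    return previous - current > drop_threshold
```

A run that wobbles between 0.78 and 0.82 stays quiet; a sustained drop from 0.8 to 0.6 fires.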
We don't have systematic human ground truth. We calibrated the thresholds by manually reviewing a batch, but we don't have continuous judge-human correlation. We know it broadly matches our criterion, but we can't give you a number.
The cost is acceptable for asynchronous evaluation (one judge call per criterion), but it wouldn't scale to real-time.
And the criteria themselves need versioning and iteration, like any other prompt. A vague criterion produces noisy scores. An over-specific one produces false negatives. Finding the sweet spot is continuous work.
Verifying LLMs is a domain problem, not an AI problem
The McDonald's chatbot writing Python is funny, but it isn't the real problem. The real problem is when your LLM stays inside its domain and still breaks rules only your team knows. Rules that have never been written down, that live in the heads of people who've operated the business for years, and that no generic benchmark is going to evaluate for you.
Building an evaluation system with LLM-as-a-Judge doesn't require sophisticated technology. DeepEval, a judge model, an async pipeline. What it requires is sitting down to articulate what "good" means in your context as natural language. And that isn't an AI problem. It's a domain engineering problem. The same problem you've been solving for years with tests, contracts, and business metrics. The difference is that now the system producing the output is non-deterministic, so your verifications have to be non-deterministic too. But the discipline is the same.
What I am sure of is that generic metrics aren't enough. What I'm less sure of is whether LLM-as-a-Judge with domain criteria is the definitive answer or an intermediate solution that happens to work today. Judge variance, the lack of ground truth, the fragility of criteria across model changes... those are open problems. But for now it's the best thing we've found, and it works a lot better than not evaluating anything at all.
How do you verify the output of your LLMs in production? I'd love to hear if you've arrived at similar patterns or at different solutions.
This content was first sent to my newsletter
