3 Mistakes PMs Make When They Let Engineers Own the AI Prompt

By Lovelaice

15 Oct 2025

Most product teams make the same move when building their first AI feature: they hand the whole thing to engineering. Engineers choose the model. Engineers write the prompts. Engineers deploy and iterate.

Meanwhile, product managers and domain experts — the people who understand the user, the workflow, and what "good" actually looks like — get brought in after something already exists. In many cases, PMs don't even have access to the prompts.

This is backwards. And it leads to three specific, expensive mistakes that show up in nearly every team we work with.

The stakes: what breaks when engineering owns the prompt

AI product quality lives in two places: the system instructions (the prompt) and the evaluation criteria (how you decide what's acceptable). Both of these are domain decisions, not engineering decisions.

When engineers own both by default, you end up with something technically sound but contextually empty. The infrastructure works. The outputs don't.

One PM described it this way: "Engineers do prompts, but effective prompting requires deep customer knowledge and expected behavior context. I end up editing their prompts because I have context they don't." Another said the quiet part out loud: "My job became 2x with context engineering plus PM work too."

This isn't an engineering failure. Engineers are busy building the actual product. But the structural default — PM has an idea, writes a rough prompt in Notion, waits for engineering to have capacity, then can't test changes without going back to engineering — creates a bottleneck that kills both velocity and quality.

Average time from idea to tested concept under this model: three weeks. With the right setup, it's three days.

Here are the three mistakes that make the gap so wide.

Mistake 1: The prompt becomes a one-liner no one reviews

When engineering owns the prompt, it gets treated like configuration — not product logic. It lives in the codebase. No one outside engineering has seen it. It ships as a few generic sentences with no domain context, no edge case handling, no examples of good or bad outputs.

We've seen production prompts that literally read: "You are a helpful assistant. Help the user with their request." This is the AI equivalent of product requirements that say "make it good." It's not a strategy. It's an abdication of strategy.

The data makes the cost obvious. In one experiment with product recommendations, every model except GPT-5 scored 0% accuracy with a basic prompt. GPT-5 managed 60%. With a structured prompt containing domain logic, product context, and edge case handling, multiple models hit 90% accuracy. Same task. Same test cases. The prompt was the entire difference between failure and success.

Across experiments, 40%+ accuracy gains come from prompt improvements alone — no model upgrades, no new infrastructure. Just better instructions written by someone who knows what the output should look like.

When the prompt lives only in the codebase, it never gets that expertise. It gets shipped once and forgotten. It becomes static documentation instead of what it actually is: the core product logic of your AI feature.

What to do instead: Treat the prompt with the same rigor as product requirements. Structure it with clear logic, domain expertise, explicit edge case handling, and examples of good outputs. Version control it. Review it cross-functionally. Iterate based on real evaluation data. The prompt is not documentation of what your AI does. The prompt is what your AI does.
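As a rough illustration, here is what "prompt as product logic" can look like in a repo. This is a minimal sketch: the file name, section headings, and the product-recommendation details are illustrative assumptions, not a prescribed format.

```python
# prompts/product_recommendations.py -- hypothetical file, version-controlled
# like any other product logic and reviewed cross-functionally.
# The domain rules, edge cases, and examples come from the PM / domain expert.

RECOMMENDATION_PROMPT_V3 = """
ROLE
You recommend follow-up products to shoppers based on their current cart.

DOMAIN CONTEXT (illustrative)
- "DoorDash" is a food delivery service, not a restaurant.
- Subscription items renew monthly; never recommend a duplicate subscription.

EDGE CASES
- Cart contains only gift cards: recommend nothing and explain why.
- Ambiguous brand names: ask one clarifying question instead of guessing.

EXAMPLES
Good: "Because you bought a tent, you may want a ground tarp."
Bad: "You may also like: tent." (duplicates what's already in the cart)

OUTPUT FORMAT
Return at most 3 recommendations, each with a one-line reason.
"""
```

The exact layout doesn't matter. What matters is that the prompt lives somewhere diffable and reviewable, and that the sections engineers can't fill in — domain context, edge cases, examples of good and bad — are owned by the people who can.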

Mistake 2: Domain knowledge never reaches the AI

Engineers know the technology. They don't know that "DoorDash" isn't a restaurant — it's a food delivery service. They don't know which phrases in a mental health conversation indicate genuine distress versus common figures of speech. They don't know the nuances of a proprietary sustainability framework that took years to develop.

When engineering writes the prompt, this knowledge stays locked in the heads of the people who have it. The PM. The financial analyst. The legal expert. The psychologist. The person who has done the work manually for years and can spot a bad output in two seconds.

A business sustainability team came to us with a first-iteration AI that scored below 40% accuracy. Unusable. Their domain experts — the people who had done these evaluations manually for years — had never touched the prompt. They'd never reviewed AI outputs before deployment. They'd never been asked what "good" looked like in a way the AI could use.

The turning point was putting those domain experts in charge of reviewing outputs and improving prompts. They flagged failures, spotted patterns, and each pattern became a prompt fix. Three weeks later: over 90% accuracy. On a cost-efficient model, not the latest frontier release. When they benchmarked against five years of historical data, the AI even caught mistakes in their existing manual ratings.

The technology was not the differentiator. The domain expertise was.

A fintech team saw the same pattern. Engineers built transaction categorization based on general financial knowledge. Accuracy: around 65%. When a financial analyst got access to test different prompts, accuracy jumped to 87%. The analyst knew the edge cases. They understood that "Venmo" could be either social or bill payment depending on context. That knowledge existed in the organization the entire time. It just never reached the AI.
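To make that concrete, here is a hedged sketch of how an analyst's edge-case knowledge might be folded into the categorization prompt. The specific rules, category names, and helper function are invented for illustration.

```python
# Hypothetical snippet added to a transaction-categorization prompt after
# the financial analyst reviewed failure cases. All rules are illustrative.

ANALYST_EDGE_CASE_RULES = """
EDGE CASES (from analyst review)
- "Venmo": classify as 'Social / P2P' unless the memo or amount pattern
  matches a recurring bill (same payee, same amount, monthly) -> 'Bills'.
- Refunds (negative amounts) keep the category of the original purchase.
- Unknown merchants: flag as 'Needs review' rather than guessing.
"""

def build_categorization_prompt(base_prompt: str) -> str:
    """Combine the engineering-owned base prompt with analyst-owned rules."""
    return base_prompt + "\n" + ANALYST_EDGE_CASE_RULES
```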

This isn't a knock on engineering. It's a structural problem. As one PM put it: "I need something I can put in the hands of my operations team, instead of asking my engineers to edit prompts."

What to do instead: Get product managers and domain experts hands-on with prompts and evaluation from day one. Not after engineering ships a beta. The people who know what "good" looks like should be shaping the AI's behavior, spotting failure patterns, and turning their expertise into prompt instructions. The PM doesn't replace engineering. The PM validates before engineering builds.

Mistake 3: Iteration dies because every change requires a ticket

Here's the cycle: PM notices the AI isn't handling a specific scenario well. PM writes up the issue. PM files a ticket. Engineering puts it in the backlog. It waits for the next sprint. Maybe the sprint after that. Meanwhile, every user who hits that scenario gets a bad output.

When every prompt change requires an engineering ticket, iteration stops. Not because people don't care, but because the cost of each change is too high. The PM discovers a failure pattern on Tuesday. The fix ships three weeks later. In between, the AI keeps failing the same way.

One team described the experience: "We're doing trial and error within customer meetings. You sit there on screen share, 'Can you try this? Oh, it's not working. Can you try this?' You get away with that answer once, maybe twice." Another reported wasting four to five hours on a single request because the feedback loop ran through engineering.

This is Ship and Pray with extra steps. You shipped. You noticed something broke. You prayed engineering would get to it before users gave up.

The alternative is concrete. A PM creates test cases representing real user inputs, including edge cases. The PM experiments with different prompt variations and models. The PM analyzes results, identifies the optimal setup, and hands engineering a validated configuration with accuracy data, cost projections, and documented edge cases. Engineering builds with confidence instead of guessing. Total time: days instead of weeks.
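A minimal sketch of what that PM-run experiment could look like, assuming a placeholder `call_model` helper that wraps whichever LLM provider the team already uses. The test cases, prompt variants, and model names are all stand-ins.

```python
# Hypothetical experiment harness a PM could run without an engineering ticket.

TEST_CASES = [
    {"input": "Coffee at DoorDash, $18.40", "expected": "Food delivery"},
    {"input": "Venmo to roommate, $600 monthly", "expected": "Bills"},
    # ... more real user inputs, including the edge cases that broke last week
]

PROMPT_VARIANTS = {"baseline": "...", "with_edge_cases": "...", "with_examples": "..."}
MODELS = ["model-small", "model-large"]  # placeholder names

def call_model(model: str, prompt: str, text: str) -> str:
    """Placeholder: swap in your team's actual LLM client call."""
    raise NotImplementedError

def accuracy(model: str, prompt: str) -> float:
    correct = sum(
        call_model(model, prompt, case["input"]).strip() == case["expected"]
        for case in TEST_CASES
    )
    return correct / len(TEST_CASES)

results = {
    (model, name): accuracy(model, prompt)
    for model in MODELS
    for name, prompt in PROMPT_VARIANTS.items()
}

# Hand engineering the winning (model, prompt) pair plus the full accuracy table.
for (model, name), score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{model} + {name}: {score:.0%}")
```

The output of a run like this is the handoff: a validated configuration with accuracy numbers attached, not a rough prompt in a doc.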

Compare that to the default: PM writes a rough prompt in a doc, waits for engineering capacity, can't test changes independently, and everyone defaults to "ship it and we'll improve later."

"We'll improve later" is the most expensive lie in AI product development.

What to do instead: Give PMs the ability to run experiments on prompts without an engineering ticket. Prompt iteration should take minutes, not sprints. When the PM can test five prompt variations across multiple models on real test cases, the handoff to engineering becomes a validated specification — not a hope.

The objection: "But PMs aren't technical enough"

This is the pushback we hear most. It's wrong.

Prompt engineering is not infrastructure engineering. It does not require knowledge of transformer architectures, API design, or deployment pipelines. It requires a clear understanding of the use case, the ability to create representative test scenarios, critical thinking about what "good" looks like, and basic prompting skills that are learnable in hours.

The hard part is not the technical implementation. The hard part is understanding the problem well enough to test it properly. That is exactly what PMs and domain experts do every day.

As one PM described the shift: "When I finally got the value where I thought, 'I can try all my crazy ideas for AI features on my own — I don't need to depend on engineering'... I can answer confidently: 'It can work' or 'No, it cannot work' before we commit any engineering time."

That's not replacing engineering. That's respecting engineering's time by handing them something proven.

The other objection: "Our engineers are good at prompting"

Some are. That's not the point.

The point is that even the best engineer cannot prompt for domain knowledge they don't have. A brilliant engineer who has never processed an insurance claim, reviewed a legal contract, or triaged a patient intake form cannot write the edge cases, the exceptions, the "this looks right but is actually wrong" scenarios that define quality in those domains.

Engineering builds the infrastructure. Domain expertise builds the quality. Both matter. The mistake is collapsing them into one role.

The bottom line

Three mistakes. One root cause. The people with the most relevant knowledge are locked out of the thing that determines AI quality.

Fix the access problem, and you fix the prompt quality, the iteration speed, and the domain gap in one move. Data beats gut feel. Always. And domain experts with data beat engineers guessing at domain logic every single time.

Here's your Slack-ready version: "Our AI prompt was written by someone who has never done the job the AI is supposed to do. That's the bug."