The reply-rate trap: why we don't promote A/B winners by reply rate

What “winning” actually rewarded

The pre-rewrite version of Overwise had a WinnerPickerService: a scheduled job that read the reply rate of each template variant on a step, picked the highest, and promoted it as the canonical variant for new sends. The math was the math every category tool runs — replies divided by sends, some smoothing for sample size, declare a winner.
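As a rough illustration of that math (not the actual WinnerPickerService code), a reply-rate picker can be sketched as below. The VariantStats type, the additive-smoothing prior, and the sample numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    name: str
    sends: int
    replies: int


def smoothed_reply_rate(
    v: VariantStats, prior_replies: float = 1.0, prior_sends: float = 20.0
) -> float:
    # Additive smoothing (an assumed choice): small samples get pulled
    # toward a 5% prior rate so a 1-for-1 variant cannot win outright.
    return (v.replies + prior_replies) / (v.sends + prior_sends)


def pick_winner(variants: list[VariantStats]) -> VariantStats:
    # Every reply counts the same here, whether it says "yes" or "remove me".
    return max(variants, key=smoothed_reply_rate)


# Illustrative numbers only.
variants = [
    VariantStats("A", sends=100, replies=6),
    VariantStats("B", sends=100, replies=9),
]
winner = pick_winner(variants)
print(winner.name)
```

Nothing in this function can tell an opt-out from a booking request, which is the whole problem.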

The trouble showed up the first time we read the inbox. Variant B was “winning” on a campaign for coworking-space owners. The replies it was winning with: “please remove me”, “not interested”, “how did you get my email”, two enthusiastic “yes let’s talk,” and one “this is the third email from you guys this month, take me off your list.”

Variant A had a lower reply rate. Two-thirds of its replies were questions or “tell me more.” Variant A was clearly the better email. The picker promoted Variant B. The picker was doing exactly what we’d told it to do.

Why the category builds it anyway

To be fair: reply rate is the easiest metric to compute. A reply is a reply. You don’t need an LLM, you don’t need a classifier, you just need a webhook from the mailbox and a counter. Every competitor in the category — Lemlist, Instantly, Apollo Sequences, Smartlead — ships some flavor of this. It dashboards well. It A/B-tests well. It looks like the kind of optimization a serious sales operation would run.

It is also, structurally, the same metric a high-volume cold outreach shop optimizes — and that’s the buyer those tools were built for: someone sending tens of thousands of emails a month where reply rate is the proxy for revenue because the downstream pipeline is large and noisy.

The B2B SaaS founder doing their own outbound is not that buyer. They send hundreds of emails a month, not tens of thousands. Their sending domain is the same domain that delivers their customers’ password-reset emails, so reputation damage from outbound reaches product email too. A 4-week run that picks the “winning” variant by unsubscribe density is not optimization. It’s a slow leak.

The first fix: classify every reply

The smallest version of the fix is to stop using reply rate as one number. Every inbound reply runs through a classifier (Sonnet, structured output, no creativity needed) that labels it as one of: POSITIVE (interested, asking questions, booking), NEUTRAL (talk later, send info, polite deferral), NEGATIVE (not interested, wrong person, this isn’t relevant), or UNSUBSCRIBE (remove me, opt-out, legal-flavored “do not contact”). Auto-replies and out-of-office notes are filtered before classification.
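A minimal sketch of the label schema and the pre-filter might look like the following. The model call itself is left out. The JSON shape ({"label": ...}), the function names, and the use of the Auto-Submitted header (RFC 3834) as the auto-reply check are assumptions for illustration, not a description of the production filter.

```python
import json
from enum import Enum


class ReplyLabel(str, Enum):
    POSITIVE = "POSITIVE"        # interested, asking questions, booking
    NEUTRAL = "NEUTRAL"          # talk later, send info, polite deferral
    NEGATIVE = "NEGATIVE"        # not interested, wrong person, not relevant
    UNSUBSCRIBE = "UNSUBSCRIBE"  # remove me, opt-out, "do not contact"


def is_auto_reply(headers: dict[str, str]) -> bool:
    # Assumed heuristic: RFC 3834 says automatic responses carry an
    # Auto-Submitted header whose value is anything other than "no".
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("auto-submitted", "no").strip().lower() != "no"


def parse_label(model_output: str) -> ReplyLabel:
    # Assumed output shape: {"label": "POSITIVE"}.
    # A label outside the four buckets raises ValueError instead of
    # silently landing in a default bucket.
    return ReplyLabel(json.loads(model_output)["label"])
```

Rejecting unknown labels outright keeps a malformed classifier response from quietly inflating any one of the four counts.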

With that label on every reply, “reply rate” becomes four numbers. The thing you actually want to optimize is positive-reply rate net of unsubscribe rate — and the thing you want to know about before you scale a campaign is whether the negative pile is growing faster than the positive pile.
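To make the four numbers concrete, here is a small sketch that assumes “net of” means positives per send minus unsubscribes per send. The function names and the counts for the two variants are invented for the example and are not real campaign data.

```python
from collections import Counter

BUCKETS = ("POSITIVE", "NEUTRAL", "NEGATIVE", "UNSUBSCRIBE")


def bucket_rates(labels: list[str], sends: int) -> dict[str, float]:
    # One per-send rate for each classifier bucket.
    counts = Counter(labels)
    return {bucket: counts[bucket] / sends for bucket in BUCKETS}


def net_positive_rate(rates: dict[str, float]) -> float:
    # Assumed definition: positives per send minus unsubscribes per send.
    return rates["POSITIVE"] - rates["UNSUBSCRIBE"]


# Invented example: B gets more replies overall, A gets better ones.
labels_a = ["POSITIVE"] * 4 + ["NEUTRAL"] * 1 + ["NEGATIVE"] * 1
labels_b = ["POSITIVE"] * 2 + ["NEGATIVE"] * 4 + ["UNSUBSCRIBE"] * 3

rates_a = bucket_rates(labels_a, sends=100)
rates_b = bucket_rates(labels_b, sends=100)

print(len(labels_b) > len(labels_a))                            # B wins on raw reply count
print(net_positive_rate(rates_a) > net_positive_rate(rates_b))  # A wins on the net metric
```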

Once the metric splits, the bad outcome is impossible to hide. Variant B on the coworking-space campaign was clearly losing on the right scoreboard: more unsubscribes per send, more negatives per send, fewer positives per send. A picker reading those numbers would have caught it. The picker we’d built wasn’t looking.

The second fix: negative replies are a signal, not a failure

The deeper realization was that negative replies are not the enemy. They’re the cleanest feedback signal the system gets. “This isn’t relevant to me” from a real human, on a real email, tells you something concrete: the ICP filter let through someone it shouldn’t have, or the pitch is wrong for this segment, or the angle landed badly for this persona.

What you want to do with that signal is not minimize it (suppress the lead, tighten the filter, move on). You want to cluster it. When five different leads from the same discovery probe write back saying “we already have a booking widget, just not the one you scraped for,” that’s not five negative replies — that’s one structural problem in the discovery layer.

So we built a clustering pass: negative-reply themes get extracted, embedded, and grouped per campaign. When a cluster crosses a threshold, the user gets an AI Inbox task that names the theme in plain English and offers either an ICP refinement or a campaign pause. Five “we already have one” replies don’t drift past the dashboard — they trigger a conversation about whether the absence probe is actually detecting absence.
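The grouping step can be sketched as follows, taking theme extraction and embedding as already done upstream. The greedy single-pass algorithm, the similarity cutoff of 0.85, and the cluster size of 5 are assumptions chosen for illustration; the post does not say which clustering method or thresholds the product uses.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_themes(items: list[tuple[str, list[float]]], sim_threshold: float = 0.85) -> list[dict]:
    # Greedy single pass: each theme joins the first cluster whose anchor
    # (the first member's embedding) is similar enough, else starts a new one.
    clusters: list[dict] = []
    for text, vec in items:
        for cluster in clusters:
            if cosine(vec, cluster["anchor"]) >= sim_threshold:
                cluster["themes"].append(text)
                break
        else:
            clusters.append({"anchor": vec, "themes": [text]})
    return clusters


def clusters_needing_task(clusters: list[dict], min_size: int = 5) -> list[dict]:
    # Only clusters at or above the size threshold become an inbox task.
    return [c for c in clusters if len(c["themes"]) >= min_size]
```

Run once per campaign, this turns five near-identical complaints into one flagged cluster while a one-off negative stays below the threshold.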

The third fix: per-decision drafting kills “templates”

Even with classifier-aware metrics and negative-theme clustering, the winner-picker was answering the wrong question. “Which template should we send next?” assumes there’s a template to pick. The agentic loop we shipped in April doesn’t have templates — the executor drafts each outbound from the lead’s signals, the campaign’s goal, and the user’s brand voice, every time.

“Winner” stops being a campaign-level question and becomes a per-lead one. Did the draft we sent to this specific lead land? The agent reads the reply, the classifier labels it, the decision log records the outcome alongside the reasoning that produced the draft. Patterns emerge in the log, not in a variant-rollup table.

The right unit of A/B test is no longer the template. It’s the campaign itself: try ICP A with angle 1, ICP B with angle 2, and compare the four classifier buckets across the two campaigns. Clone-the-winner is still a workflow; it’s just operating at the level of “which ICP × angle is converting,” not “which subject line had marginally more opens.”

What we report instead of “winner”

In the campaign detail view, you get the four-bucket reply breakdown, the positive-rate-net-of-unsubscribe number, the negative-theme clusters when they exist, and the per-lead decision log. There’s no “winner” badge anywhere on a template, because there are no templates. There’s no leaderboard of subject lines, because subject lines are drafted per-lead from signal.

What there is, in the Insights tab: which discovery slice is producing positive replies, which is producing negatives, what themes the negatives cluster around, and the cost-per-positive reply for each. That’s enough to make the decision the winner-picker was pretending to make, without the failure mode of rewarding the wrong outcome.
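The cost-per-positive comparison is simple arithmetic, sketched here with a hypothetical SliceStats record. The field names, slice names, and the choice to sort slices with zero positives last are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SliceStats:
    name: str
    spend: float      # total cost attributed to the slice, in any one currency
    positives: int
    negatives: int


def cost_per_positive(s: SliceStats) -> Optional[float]:
    # Undefined when a slice has produced no positive replies yet.
    return s.spend / s.positives if s.positives else None


def rank_slices(slices: list[SliceStats]) -> list[SliceStats]:
    # Cheapest positive reply first; slices with no positives go last.
    def key(s: SliceStats) -> tuple[bool, float]:
        cost = cost_per_positive(s)
        return (cost is None, cost if cost is not None else 0.0)

    return sorted(slices, key=key)
```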

What we kept

Two things from the winner-picker era survived the rewrite. First, the comparison shape — small campaigns in parallel, compare the outcomes, clone the winner. That’s still the workflow. It just operates on ICPs and angles, not template variants. Second, the discipline of writing down what you expect to see before you ship. Every campaign gets a stated goal — book a meeting, get a reply, drive a signup — and the campaign auto-pauses when the metrics drift far enough from that goal. The winner-picker tried to do that without a stated goal. It couldn’t.
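One way to picture the auto-pause check is below. The CampaignGoal fields, the minimum-sends guard, and every threshold value are hypothetical; the post says only that a campaign pauses when metrics drift far enough from its stated goal.

```python
from dataclasses import dataclass


@dataclass
class CampaignGoal:
    description: str              # e.g. "book a meeting"
    min_positive_rate: float      # expected positives per send
    max_unsubscribe_rate: float   # tolerated unsubscribes per send
    min_sends: int = 50           # do not judge a campaign on too few sends


def should_pause(goal: CampaignGoal, sends: int, positives: int, unsubscribes: int) -> bool:
    if sends < goal.min_sends:
        return False
    too_few_positives = positives / sends < goal.min_positive_rate
    too_many_unsubscribes = unsubscribes / sends > goal.max_unsubscribe_rate
    return too_few_positives or too_many_unsubscribes
```

The point of the sketch is the first argument: without a goal object to compare against, there is nothing for the check to drift from.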

If you’ve been here before

You’re a B2B SaaS founder who wrote your first cold email last quarter, watched it work twice and embarrass you three times, and started looking for a tool. You read a Lemlist post about A/B winner-pickers. You half-understood it and felt like you should be running them. You felt like you’d be a worse operator for not running them.

You don’t need a winner-picker. You need a metric that doesn’t lie to you, a clustering of negative replies so you can fix the upstream problem instead of mopping the downstream symptom, and a draft-per-lead loop that makes “template winner” the wrong unit of analysis. None of that is a feature you can show off in a demo. All of it is the difference between a campaign that quietly burns your domain and one that doesn’t.

If you want to see the four-bucket reply breakdown and the negative-theme clustering on a live campaign, the 14-day trial starts wherever you start. No demo gate, card on file, cancel anytime. The trial defaults to Review-each-send so the first 50 emails from your mailbox go out only with your one-click approval.

— Tobias Duelli, founder · tobias@overwise.ai