Replacing a fixed sequence with an agentic loop

For most of 2025, Overwise’s outreach engine was a five-step email sequence with templates, variants, and an A/B winner-picker. In April 2026 we ripped it out and rebuilt it as an agentic loop: a Sonnet executor that decides per-lead what to send and on which channel, an Opus advisor for edge cases, and a verifier that discards drafts whose claims can’t be cited from real signals. Here’s what we kept, what we deleted, and the per-lead decision log that came out of it.

What the old sequence looked like

The pre-April pipeline was a sensible v1: every campaign had a fixed 5-step email sequence with template slots for personalization, two or three template variants per step, and a job that promoted the higher-replying variant on a schedule. Sending was decoupled from drafting — the SequenceRunnerService walked each lead through stage 1 → 2 → 3 → 4 → 5 on a cadence, the SequenceDraftingAgent + PersonalizationAgent filled the template slots, and the ReplyTriageAgent fed positive replies back into the AI Inbox for one-click approval.

It worked. It also lied to itself in three places.

First, “stage 3” wasn’t a real signal. Leads on the same template at the same stage could be in wildly different conversation states (one had bounced, one had opened twice, one had clicked a link but not replied), and the sequence runner saw none of that. The next-send decision was made by counting how many emails had gone out, not by reading what had happened.

Second, the A/B winner-picker promoted the wrong thing. Reply rate as the only signal collapses positive replies, negative replies, and “talk to me later” into one number. The winner-picker happily picked the variant with more unsubscribes. We caught it because we read the inbox, not because the system flagged it.
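To make the collapse concrete, here is a small sketch with made-up reply counts. The numbers, the labels, and the assumption that unsubscribe requests arrive as replies are all illustrative, not data from the old system.

```python
from collections import Counter
from typing import List

SENT_PER_VARIANT = 100  # hypothetical send volume, same for both variants


def reply_rate(replies: List[str], sent: int) -> float:
    """Every reply counts the same, whatever it says."""
    return len(replies) / sent


def positive_rate(replies: List[str], sent: int) -> float:
    """Only replies labeled 'positive' count."""
    return Counter(replies)["positive"] / sent


# Illustrative reply labels, not real campaign data.
variant_a = ["positive"] * 4 + ["later"] * 2
variant_b = ["unsubscribe"] * 7 + ["positive"] * 1

# Variant B "wins" on raw reply rate while losing on positive replies.
b_wins_on_replies = reply_rate(variant_b, SENT_PER_VARIANT) > reply_rate(variant_a, SENT_PER_VARIANT)
a_wins_on_positives = positive_rate(variant_a, SENT_PER_VARIANT) > positive_rate(variant_b, SENT_PER_VARIANT)
```

Under these assumed counts, a picker that ranks by raw reply rate promotes variant B, which is the variant generating the unsubscribes.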

Third, the templates were a one-way mirror. The buyer (a B2B SaaS founder doing their own outbound) saw their name on the From: line but could not inspect why a specific email had gone to a specific lead. “This was step 2 of sequence B” is not an answer they could use to defend the send to their own prospect.

The shape we replaced it with

The new loop is a single agent — the LeadOutreachAgent — that runs once per lead, per decision moment, with the lead’s full state in context. It does one of four things: send now (which channel, which body), wait (until when, and why), hold (escalate to AI Inbox), or terminate (suppress, lead reached the goal, or no path forward).

The executor is Claude Sonnet 4.6. It reads the lead’s discovery signals, prior outreach attempts (sent + responses), per-channel attempt counts, the campaign’s goal and brand voice, and a small memory summary the agent itself maintains across decisions. It returns a structured decision plus a one-line rationale, both stored on the lead’s outreach history.
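One way to picture the structured decision is as a small typed record. This is a minimal sketch, not the production schema: the names `Action`, `Decision`, `channel`, `body`, and `wait_until` are assumptions chosen to mirror the four outcomes and the one-line rationale described above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Action(Enum):
    SEND = "send"
    WAIT = "wait"
    HOLD = "hold"
    TERMINATE = "terminate"


@dataclass(frozen=True)
class Decision:
    """Hypothetical shape of one executor decision."""
    action: Action
    rationale: str                          # the one-line "why"
    channel: Optional[str] = None           # SEND only, e.g. "email"
    body: Optional[str] = None              # SEND only
    wait_until: Optional[datetime] = None   # WAIT only

    def __post_init__(self) -> None:
        # Reject decisions that are missing the details their action needs.
        if not self.rationale.strip():
            raise ValueError("every decision needs a rationale")
        if self.action is Action.SEND and not (self.channel and self.body):
            raise ValueError("SEND needs a channel and a body")
        if self.action is Action.WAIT and self.wait_until is None:
            raise ValueError("WAIT needs a wait_until time")
```

Validating at construction time means a malformed model output fails loudly before it reaches the sender, rather than turning into a send with no body or a wait with no end.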

For edge cases (a hot reply that needs a careful answer, a lead whose address bounced but who has a fresh role-account hit, a decision where the cost of a wrong send is high) the agent calls an Opus 4.7 advisor for a second opinion before acting. The split is intentional: Sonnet handles the volume (cheap, fast, good enough), Opus reviews when stakes are high (expensive, slow, worth it).
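The escalation rule can be sketched as a plain predicate over the lead’s state. This is an assumption-heavy illustration: in the loop described here the executor itself decides when to call the advisor, and the field names, the `wrong_send_risk` score, and the 0.7 threshold below are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold; the post does not say how "high stakes" is measured.
WRONG_SEND_RISK_THRESHOLD = 0.7


@dataclass(frozen=True)
class LeadSnapshot:
    """Hypothetical slice of lead state relevant to escalation."""
    has_hot_reply: bool = False
    has_bounced: bool = False
    has_fresh_role_account_hit: bool = False
    wrong_send_risk: float = 0.0  # hypothetical estimate between 0 and 1


def needs_advisor(lead: LeadSnapshot) -> bool:
    """Return True for the three edge cases named in the text."""
    if lead.has_hot_reply:
        return True
    if lead.has_bounced and lead.has_fresh_role_account_hit:
        return True
    return lead.wrong_send_risk >= WRONG_SEND_RISK_THRESHOLD
```

A bounce on its own does not escalate in this sketch; only the combination of a bounce with a fresh role-account hit does, matching the example given above.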

The MessageVerifier: cite or discard

The trust-loop primitive that did the most work was the MessageVerifier. After the executor drafts a body but before anything sends, the verifier reads the draft as plain text and asks: can every concrete claim in this email be cited from the lead’s actual signals? “We saw you don’t have a booking widget on padelclub-berlin.com” — citable if the discovery probe recorded that absence. “I noticed you’re hiring a sales lead” — citable if a hot-lead trigger recorded the job listing. “I loved your latest launch” — not citable unless the discovery record actually has a launch event.

Drafts whose claims fail the cite check come back to the executor with the failing claims redacted and a retry budget of one. If the second draft also fails, the lead surfaces as a MESSAGE_VERIFICATION_FAILED task in the AI Inbox instead of sending. The whole point: a confident, fluent draft that also isn’t true is the failure mode this product cannot ship, because the buyer’s domain reputation pays the cost.
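The cite-or-discard flow with its retry budget of one can be sketched as follows. This simplifies heavily: it assumes claims have already been extracted from the draft and tagged with the signal they rely on (in practice that extraction is the hard part), and the names `Claim`, `Draft`, `signal_key`, and `verify_with_retry` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Claim:
    text: str
    signal_key: str  # hypothetical key of the recorded signal the claim relies on


@dataclass
class Draft:
    claims: List[Claim]


def failing_claims(draft: Draft, signals: Set[str]) -> List[Claim]:
    """Claims whose backing signal was never recorded for this lead."""
    return [c for c in draft.claims if c.signal_key not in signals]


def verify_with_retry(
    draft_fn: Callable[[List[Claim]], Draft],
    signals: Set[str],
    retry_budget: int = 1,
) -> Tuple[str, Optional[Draft]]:
    """Ask for a draft, check it, and allow `retry_budget` redrafts."""
    rejected: List[Claim] = []
    for _ in range(retry_budget + 1):
        draft = draft_fn(rejected)  # the drafter sees which claims were rejected
        failed = failing_claims(draft, signals)
        if not failed:
            return "SEND", draft
        rejected = rejected + failed
    # Budget spent: surface to a human instead of sending.
    return "MESSAGE_VERIFICATION_FAILED", None
```

With the default budget the drafter gets exactly two attempts. A drafter that keeps repeating an uncitable claim ends in `MESSAGE_VERIFICATION_FAILED` and nothing is sent.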

The per-lead decision log

Every agent invocation appends to the lead’s outreach history. Each entry is a small structured record — direction (outbound / inbound), channel (email / manual-dispatch), the decision and rationale, the cost in tokens, the model that produced it, and a reference to the message that was sent or received. Hold and terminate decisions also append, with the reason and any AI Inbox task that was spawned.

The log is the trust artifact. In the campaign-detail view, a buyer can open any lead and read, in plain English, what happened and why — not “step 3 fired on schedule” but “sent OUTBOUND #2 because the prior send was 4 days ago and opened twice but not clicked; cited the missing-booking-widget probe from discovery.” When a recipient writes back asking how we got their email, the founder has a real answer.
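As a rough sketch of what one history entry might look like, the record below carries the fields listed above. The field names and the JSON-lines storage are assumptions for illustration; the actual schema is not shown in this post.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HistoryEntry:
    """Hypothetical shape of one outreach-history record."""
    decision: str                      # "SEND", "WAIT", "HOLD" or "TERMINATE"
    rationale: str                     # plain-English reason shown to the buyer
    cost_tokens: int
    model: str                         # which model produced the decision
    direction: Optional[str] = None    # "outbound" or "inbound"
    channel: Optional[str] = None      # "email" or "manual-dispatch"
    message_ref: Optional[str] = None  # reference to the sent or received message
    inbox_task: Optional[str] = None   # AI Inbox task spawned, if any


def append_entry(history: List[str], entry: HistoryEntry) -> None:
    """Append-only: entries are serialized once and never rewritten."""
    history.append(json.dumps(asdict(entry)))
```

Keeping the rationale as a stored string, rather than regenerating it later, means the explanation a buyer reads is the one the agent actually gave at decision time.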

The cost ceiling, because agents drift

A loop that decides per-lead can spin. We capped that two ways. Per-lead, the agent has a daily cost ceiling (LLM tokens + paid enrichment + verification calls); when the ceiling is reached it returns a WAIT decision with the cost-cap reason, and the scheduler reconsiders after the daily roll-over. Per-project, a daily budget cap in the billing layer raises a DailyBudgetExhaustedException at the orchestrator chokepoint — the whole project pauses outreach with a CAMPAIGN_AUTO_PAUSED task, and the user can lift it with one click.
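The two caps can be sketched as a single guard that is consulted before each agent run. `DailyBudgetExhaustedException` is the name used above; everything else here (the `CostGuard` class, its method names, tracking spend in cents) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


class DailyBudgetExhaustedException(Exception):
    """Raised when the project-wide daily budget is spent."""


@dataclass
class CostGuard:
    """Hypothetical guard combining the per-lead and per-project caps."""
    lead_ceiling_cents: int
    project_budget_cents: int
    lead_spend: Dict[str, int] = field(default_factory=dict)
    project_spend: int = 0

    def gate(self, lead_id: str) -> str:
        """Call before running the agent for a lead."""
        if self.project_spend >= self.project_budget_cents:
            # Hard stop: the caller pauses outreach for the whole project.
            raise DailyBudgetExhaustedException("daily project budget reached")
        if self.lead_spend.get(lead_id, 0) >= self.lead_ceiling_cents:
            return "WAIT"  # this lead is reconsidered after the roll-over
        return "PROCEED"

    def charge(self, lead_id: str, cents: int) -> None:
        """Record spend (tokens, enrichment, verification) after a run."""
        self.lead_spend[lead_id] = self.lead_spend.get(lead_id, 0) + cents
        self.project_spend += cents

    def roll_over(self) -> None:
        """Reset both counters at the daily boundary."""
        self.lead_spend.clear()
        self.project_spend = 0
```

The per-lead cap degrades gracefully into a WAIT, while the per-project cap raises. That asymmetry follows the text: one expensive lead should not stop the campaign, but an exhausted project budget should.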

We considered making the ceiling a soft, warn-only limit. We don’t regret making it hard. Drift is real, and a $400 bill from a runaway loop in week one of a $99 plan is the kind of trust loss you can’t apologize your way out of.

What we deleted

The rewrite let us throw away a meaningful slice of complexity:

EmailTemplate entity and its CRUD endpoints — templates no longer exist as a first-class object; the agent drafts per-decision from brand voice + signals.
SequenceRunnerService, SequenceDraftingAgent, PersonalizationAgent, ReplyTriageAgent — their jobs collapsed into the single agent loop.
WinnerPickerService and the A/B-test module — with per-decision drafting, “winner” is a per-lead question, not a campaign-level template promotion.
Campaign.tone, Campaign.templates, Lead.sequenceState — all gone. The new state lives on the lead’s outreach history; campaigns describe goals and allowed channels, not scripts.
Three activity event types tied to template promotion — because there’s nothing left to record there.

About 3,000 lines of code came out, net. That’s not the interesting number; the interesting one is that the surface area a founder has to understand to trust the product got smaller. Five fewer concepts in the mental model.

What we evaluate, and what we don’t

We built an offline eval harness alongside the rewrite. The agent is graded against 10 hand-curated lead fixtures spanning five verticals (local-physical, hospitality, visual-brand, online-B2B, long-tail). For each fixture we check action accuracy (does the agent pick SEND / WAIT / HOLD / TERMINATE the way a human reviewer would?), citation discipline (does every claim trace to a fixture signal?), and cost (per-fixture cap of 4¢, total cap of 50¢ per eval run).
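The three checks can be sketched as one grading function over per-fixture results. The 4¢ and 50¢ caps come from the text; the `FixtureResult` fields and the report keys are hypothetical, and citation discipline is reduced here to a set-inclusion test over signal names.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

PER_FIXTURE_CAP_CENTS = 4.0  # per-fixture cost cap from the text
TOTAL_CAP_CENTS = 50.0       # per-run cost cap from the text


@dataclass
class FixtureResult:
    """Hypothetical record of one agent run against one fixture."""
    name: str
    expected_action: str       # what the human reviewer would pick
    actual_action: str         # what the agent picked
    cited_signals: Set[str]    # signals the draft's claims rely on
    fixture_signals: Set[str]  # signals actually present in the fixture
    cost_cents: float


def grade(results: List[FixtureResult]) -> Dict[str, object]:
    """Summarize action accuracy, citation discipline, and cost."""
    if not results:
        raise ValueError("no fixture results to grade")
    correct = sum(1 for r in results if r.actual_action == r.expected_action)
    # A fixture fails citation discipline if any cited signal is not in the fixture.
    uncited = [r.name for r in results if not r.cited_signals <= r.fixture_signals]
    over_cap = [r.name for r in results if r.cost_cents > PER_FIXTURE_CAP_CENTS]
    total = sum(r.cost_cents for r in results)
    return {
        "action_accuracy": correct / len(results),
        "uncited_fixtures": uncited,
        "over_cap_fixtures": over_cap,
        "total_cost_cents": total,
        "within_total_cap": total <= TOTAL_CAP_CENTS,
    }
```

Note that ten fixtures at the 4¢ cap sum to 40¢, so the 50¢ run cap leaves some headroom beyond the per-fixture limits.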

What we deliberately don’t publish yet are reply-rate, hot-lead rate, or conversion numbers. Two reasons. First, the agentic loop has been live for weeks, not months — any number we’d quote would be small-sample. Second, those numbers belong to customers, not to us; they’ll come out as customer stories when customers want to tell them and have data they’re comfortable sharing.

What’s next

Three threads since the April rewrite:

Source-routing. The agent picks what to send; a sibling source-router (also Sonnet) picks where to find leads. Multiple discovery sources today, routed per-campaign by ICP. This multi-source discovery is the wedge that separates us from a single contact database.
Hot-lead trigger. A per-project hourly scanner watches active leads for buying-intent signals — hiring spikes, funding announcements, tech-stack changes, competitor churn. Hits surface as HOT_LEAD_BOOST tasks and bump the lead to the front of the outreach queue.
Vector-similarity loops. Once a campaign has five converters, we mean-pool their embeddings and surface 50 lookalike leads as an AI Inbox task. Same primitive runs in reverse: paste 3-20 of your best customer domains, get 50 lookalikes back in under 2 seconds.
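The lookalike primitive in the third thread can be sketched in a few lines: mean-pool the seed embeddings into a centroid, then rank candidates by cosine similarity to it. This is a minimal pure-Python illustration with hypothetical function names; a production version would presumably query a vector index rather than scan every candidate.

```python
import math
from typing import Dict, List, Sequence


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise average of equal-length embedding vectors."""
    if not vectors:
        raise ValueError("need at least one seed embedding")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookalikes(
    seeds: Sequence[Sequence[float]],
    candidates: Dict[str, Sequence[float]],
    k: int = 50,
) -> List[str]:
    """Return the k candidate names closest to the seeds' centroid."""
    centroid = mean_pool(seeds)
    ranked = sorted(
        candidates,
        key=lambda name: cosine(centroid, candidates[name]),
        reverse=True,
    )
    return ranked[:k]
```

The same function serves both directions described above: the seeds are either a campaign’s converters or the embeddings of pasted customer domains, and only the source of the seed vectors changes.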

Each one composes on top of the agentic loop instead of bypassing it. The decision of what to send and when stays in one place. The decisions of which lead to surface get stronger over time.

If you’ve been here before

You’re a B2B SaaS founder doing your own outbound. You’ve tried Lemlist or Instantly. You’ve watched the AI write a confident three-paragraph email about a problem the lead doesn’t have, and felt it land in your gut: that just went out under my name.

That feeling is the design constraint. The agentic loop, the cite-or-discard verifier, the per-lead decision log, the cost ceiling — none of them exist to make the AI smarter. They exist so that when an email leaves your mailbox, you can read why it went, defend it if asked, and trust the next one will be cleaner because the loop learned from the last.

If that’s the product you’ve been wanting, the 14-day trial starts wherever you start. No demo gate, card on file, cancel anytime.

— Tobias Duelli, founder · tobias@overwise.ai