Building your AI delegation stack: who does what

TL;DR

A personal AI delegation stack works best with three named workers, not eight: a generalist drafter for the writing and admin, a researcher for live information, and a thinking partner for high-stakes decisions. Each one needs a remit, a check-back protocol, and a kill-switch. Public stack disclosures from Wade Foster at Zapier, Aaron Levie at Box, and Tobias Lütke at Shopify all rhyme with the same shape.

Key takeaways

- The three-worker pattern is generalist drafter, researcher, thinking partner. Three is enough specialisation to matter and few enough that switching cost does not eat the gain. Tobias Lütke's internal Shopify approach, an LLM proxy with model routing per task, is the same idea at company scale.
- Each worker needs a one-paragraph remit: what it is for, what it is not, and how its output is reviewed. Without a remit, the same tool drafts client emails and runs strategy at the same prompt quality, and you cannot tell which one slipped.
- The check-back protocol decides when you compare workers and when you do not. For a routine client follow-up, do not. For a partnership decision, run the same brief through two and compare. The cost is twenty minutes; the value is catching a single bad call a quarter.
- The kill-switch matters because models do change. Anthropic's April 2026 postmortem on three Claude regressions, all reverted within a fortnight, is the cleanest public evidence. If you do not have a way to notice quality slipping, you find out from a customer.
- Operators publishing their stacks tend to run a handful of tools rather than ten. Wade Foster, Aaron Levie, and Pedro Franceschi at Brex all describe small rotations with clear roles. The "every new model on launch day" pattern is the productivity trap that a disciplined stack is built to avoid.

A founder I spoke to in April was paying for ChatGPT Pro, Claude Pro, Perplexity Pro, and Gemini Advanced. Roughly £80 a month across four subscriptions. She used them more or less interchangeably, asking whichever one was open in a tab. She felt guilty that she was under-using all four. She asked me which one she should drop.

The right answer was the one she did not expect. Drop none of them yet, but stop pretending they do the same job. A working AI delegation stack has three named workers, each with a defined remit. Pick which subscription plays which role, give it a one-paragraph job description, and live with that for a fortnight before changing anything. The relief comes from stopping the daily decision about which subscription to open, not from the stack you eventually settle on.

This post is part of the AI for Your Own Work cluster. The framework piece on how AI changes the delegation maths sits upstream of it. The three sibling pieces, AI vs person delegation, when AI replaces a VA, and briefing AI like a contractor, each go deep on one of the harder calls.

What are the three workers, and why three?

Three workers cover the actual shape of founder work without spilling into tool sprawl. The generalist drafter handles writing and admin, the researcher handles information that the model does not already know, and the thinking partner handles judgement calls where you want a second view before committing. Three is enough specialisation that each role is genuinely better than a generalist, and few enough that the switching cost does not eat the gain.

Tobias Lütke at Shopify built the same shape at company scale. His public description, written up by First Round Review, is an internal LLM proxy that lets engineers choose the right model per task, plus a library of specialised agents built once and reused. One interface, multiple models, role per job. The named-CEO disclosures from Wade Foster at Zapier, Aaron Levie at Box, and Pedro Franceschi at Brex describe smaller versions of the same pattern. Foster’s open stack is closer to seven. Levie’s “every agent needs a box” principle is the architectural argument behind it: bounded autonomy beats unbounded chat.

The choice-overload research from Sheena Iyengar and Mark Lepper, the canonical jam-study paper, is the empirical foil. Six options outperform twenty-four on completion rate, satisfaction, and perceived quality. A stack of three workers sits comfortably below the threshold where decision fatigue sets in. A stack of eight is past it before Tuesday lunchtime.

What is each worker actually for?

Each worker has a single defined remit. The drafter handles writing and admin, the researcher handles current information, and the thinking partner handles judgement calls. None of them does the others’ job. The remit is the thing that lets you tell which worker slipped when output goes wrong, because you can point at the brief it was given and the output it produced.

The generalist drafter is the highest-volume worker. It handles email triage, first-pass document drafts, status updates, meeting notes, and the standing recipes you call from a Monday morning. As of May 2026 the strongest model for this role on public arenas is Claude Opus, which holds the top position across writing, code, and search on the LMSys leaderboards. Vendor leapfrog is real and the gap is months not years. Pick the current leader and stay with it for a fortnight before re-shopping.

The researcher handles anything that needs information from after the model’s training cutoff. Competitive intelligence, regulatory updates, current pricing, what a named contact has said publicly in the last quarter. Perplexity Pro with Deep Research is the cleanest tool for this in May 2026, scoring 93.9 percent accuracy on SimpleQA, the factual-recall benchmark, against 72 percent for the strongest non-search model. Crossing the streams into drafting is where founders end up with confidently fabricated quotes.

The thinking partner is the lowest-volume and highest-stakes worker. You call it for partnership decisions, pricing changes, hiring calls, any moment where a second view earns its keep before you commit. The remit is structured input, structured output, judgement work only, never a replacement for the conversation with the actual humans involved.
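The remits above are short enough to write down as data rather than prose, which also gives you somewhere unambiguous to point when output goes wrong. A minimal sketch in Python; the worker names, field choices, and routing rule are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Worker:
    """One named worker in the stack, with its one-paragraph remit."""
    name: str
    remit: str          # what it is for
    out_of_scope: str   # what it is not for
    review: str         # how its output is checked

STACK = [
    Worker(
        name="drafter",
        remit="Email triage, first-pass drafts, status updates, meeting notes.",
        out_of_scope="Live research; strategy calls.",
        review="Batch review every Monday morning.",
    ),
    Worker(
        name="researcher",
        remit="Anything needing information from after the training cutoff.",
        out_of_scope="Drafting client-facing copy.",
        review="Spot-check every cited source before it leaves the desk.",
    ),
    Worker(
        name="thinking_partner",
        remit="Structured second view on partnership, pricing, hiring calls.",
        out_of_scope="Routine drafting; replacing the human conversation.",
        review="Compared against a second worker on high-stakes calls.",
    ),
]

def worker_for(task: str) -> Worker:
    """Route a task to a named worker; unrecognised work defaults to the drafter."""
    by_name = {w.name: w for w in STACK}
    return by_name.get(task, by_name["drafter"])
```

The default-to-drafter rule mirrors the volume argument: most founder work is drafting, so that is where unlabelled tasks should land rather than leaking into the thinking partner.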

How does the check-back protocol work in practice?

The check-back protocol is the rule for when you compare workers and when you do not. For routine drafting you do not. The drafter handles the inbox triage, you review in a Monday batch, you move on. For a high-stakes call you run the same brief through two workers and compare. The cost is twenty minutes. The value is catching one bad call a quarter.

The discipline that matters is structured input. State the decision clearly, name the constraints, ask for both the recommendation and the failure case, request comparison to a historical analogue. Dan Shipper at Every wrote the cleanest public version of this protocol on the Beyond the Prompt podcast. The same input shape feeds both workers. You read both outputs side by side. Where they agree, you move. Where they disagree, the disagreement itself is the information.

The temptation is to run every decision through two workers. Resist it. The bulk of decisions a founder makes in a week are routine and want one drafter, fast. Reserving the comparison protocol for genuinely high-stakes calls keeps the cost of using it low and the signal high. A protocol you use for everything is a protocol you stop using by Friday.
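The structured-input discipline above can be sketched as a single brief template rendered once and sent to every worker being consulted. The field names and the comparison loop are illustrative, and the model calls are left as plain callables because the actual API depends on which subscription plays which role:

```python
BRIEF_TEMPLATE = """Decision: {decision}
Constraints: {constraints}
Asked for: a recommendation, the failure case for that recommendation,
and a comparison to the closest historical analogue you know of."""

def build_brief(decision: str, constraints: str) -> str:
    """Render the same structured brief for every worker consulted."""
    return BRIEF_TEMPLATE.format(decision=decision, constraints=constraints)

def check_back(decision: str, constraints: str, workers: dict) -> dict:
    """Run one brief through each worker and return outputs side by side.

    `workers` maps a worker name to a callable that takes the brief text
    and returns its answer; in practice each callable wraps one vendor API.
    """
    brief = build_brief(decision, constraints)
    return {name: ask(brief) for name, ask in workers.items()}
```

The point of the shared template is that any disagreement between the two outputs is about the decision, not about how each worker was asked.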

What is the kill-switch and how do you actually run it?

The kill-switch is your way of noticing when a worker’s quality has dropped, because models do change without notice. Anthropic’s April 2026 postmortem walked through three separate Claude regressions in the previous six weeks: a shift in reasoning effort, a bug that cleared session thinking, and a verbosity change. All three were reverted within a fortnight. The discipline matters because even a vendor with strong remediation ships quality changes that take days to surface.

The mechanic is a monthly five-minute benchmark. Pick one task per worker, save the output, and re-run it next month with the same prompt. Compare the two and ask whether the answer would still be useful. If quality has clearly moved, switch. The UK Information Commissioner's Office guidance on AI and data protection makes the same point on lawfulness, fairness, and transparency: the operator has to know whether the system is still doing what it claimed.
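The save-and-compare step is small enough to automate. A minimal sketch, assuming a local folder of saved baselines and a crude text-similarity score; the folder name and the use of `difflib` as the drift measure are illustrative choices, and the final "would this answer still be useful?" judgement stays human:

```python
import difflib
from pathlib import Path

BENCHMARK_DIR = Path("ai-benchmarks")  # illustrative location for saved baselines

def drift_score(baseline: str, current: str) -> float:
    """Similarity ratio in [0, 1] between last month's and this month's output."""
    return difflib.SequenceMatcher(None, baseline, current).ratio()

def run_monthly_check(worker: str, prompt: str, ask) -> float:
    """Re-run one worker's saved benchmark prompt, score the drift, and save
    this month's output as next month's baseline.

    `ask` is a callable wrapping whichever API serves this worker.
    Returns 1.0 on the first run, when there is nothing to compare against.
    """
    BENCHMARK_DIR.mkdir(exist_ok=True)
    baseline_file = BENCHMARK_DIR / f"{worker}.txt"
    current = ask(prompt)
    score = 1.0
    if baseline_file.exists():
        score = drift_score(baseline_file.read_text(), current)
    baseline_file.write_text(current)
    return score
```

A low score is a prompt to read both outputs side by side, not an automatic trigger to switch vendors; two good answers phrased differently will also score low.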

Without a kill-switch, you find out about a quality drop from a client. With one, you find out from a five-minute test on the second Monday of the month. The cost is trivial. The cost of the alternative, hearing it from a customer, is not.

When should you actually consolidate, and when should you wait?

Resist the urge to consolidate immediately. Run the named stack for a fortnight first. The founder I started with needed permission to keep her four subscriptions and put names on three of them: drafter, researcher, thinking partner, with the fourth as a deliberate alternate for when the kill-switch fires. After two weeks the consolidation question answers itself, because the subscription doing no defined work becomes obvious.

The National Bureau of Economic Research surveyed 6,000 CEOs and CFOs in early 2026 and found nearly 90 percent reported no productivity impact from AI despite 67 percent using it. The gap is structural rather than technical, sitting in the absence of remits, check-backs, and kill-switches around the tools. A founder running three workers with remits, check-backs, and a kill-switch is operating in the small minority who actually capture the gain. A founder running four interchangeable subscriptions with no roles is statistically indistinguishable from a founder using none of them.

If you want to think this through with someone who has run the experiment on his own desk, book a conversation.

Sources

- Foster, Wade at Zapier (2025). "How I AI" episode on Lenny's Newsletter, full personal stack disclosure including meeting transcripts, Zapier agents, and Grok for culture analysis. Cited as the named-CEO peer reference for a small disciplined rotation. https://www.lennysnewsletter.com/p/zapiers-ceo-shares-his-personal-ai-stack
- Levie, Aaron at Box (2025). Latent Space interview on his daily AI rotation and the "every agent needs a box" principle of bounded autonomy. Cited as the architectural-specialisation evidence behind the three-worker model. https://www.latent.space/p/box
- Lütke, Tobias at Shopify (2025). First Round Review long-form on Shopify's internal LLM proxy, model-routing approach, and the impact correlation between AI tool use and performance. Cited as the model-routing-not-monolith argument. https://www.firstround.com/ai/shopify
- Anthropic (2026). April 23 postmortem detailing three separate Claude regressions in March and April 2026 and their reversion. Cited as the kill-switch evidence that frontier models do change. https://www.anthropic.com/engineering/april-23-postmortem
- Perplexity (2025). Introducing Perplexity Deep Research, including 21.1 percent score on Humanity's Last Exam and 93.9 percent accuracy on SimpleQA. Cited as the research-worker accuracy benchmark. https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research
- Iyengar, Sheena and Lepper, Mark (2000). "When choice is demotivating", the canonical jam-study paper on choice overload showing six options outperforming twenty-four. Cited as the rationale for keeping the stack small. https://faculty.washington.edu/jdb/345/345%20Articles/Iyengar%20&%20Lepper%20(2000).pdf
- Karpathy, Andrej (2025). Year-in-review essay on LLM training shifts including reinforcement learning from verifiable rewards. Cited as the reason different models genuinely excel at different tasks rather than being interchangeable. https://karpathy.bearblog.dev/year-in-review-2025/
- Information Commissioner's Office (2025). UK guidance on AI and data protection covering lawfulness, fairness, and transparency for personal data processed by AI tools. Cited as the UK compliance baseline for stack design. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/
- National Bureau of Economic Research, reported in Fortune (2026). Survey of 6,000 CEOs and CFOs finding nearly 90 percent reported no productivity impact from AI despite 67 percent usage. Cited as the productivity-paradox anchor for why stack discipline beats stack breadth. https://fortune.com/article/why-do-thousands-of-ceos-believe-ai-not-having-impact-productivity-employment-study/
- Anthropic (2024). Claude 3.5 Sonnet release notes detailing measured improvements in instruction-following and writing quality. Cited as the writing-worker benchmark anchor. https://www.anthropic.com/news/claude-3-5-sonnet

Frequently asked questions

Why three workers and not five?

Three covers the actual shape of founder work: communication, information, and judgement, with one tool optimised for each. Adding a fourth or fifth introduces switching cost and decision fatigue without obvious gain. Sheena Iyengar's choice-overload research is the cleanest evidence: six options outperform twenty-four on every metric that matters. Foster, Levie, and Lütke each describe small disciplined rotations, which is the same shape three workers gives a solo founder.

Does it matter which model I use for which role?

Less than the discipline of having a role. Claude Opus is the strongest current writing-and-reasoning model on the public arenas as of May 2026, and Perplexity's Deep Research is the strongest research-grounded tool, but the gap is months not years. Pick a worker per role and stay with it for a fortnight before switching. Vendor leapfrog is real and the worst stack is the one re-shopped every week.

What does the kill-switch actually look like?

A monthly five-minute check. Re-run a benchmark task you ran a month ago, compare the output, and ask whether the quality moved. Anthropic published a detailed postmortem in April 2026 explaining three separate Claude regressions and the fixes. If a vendor of that maturity can ship three quality changes in six weeks, an unobservant founder is the one who hears about it from a client.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
