What is a background agent?

A background agent is an AI coding agent that runs autonomously in a sandboxed environment without requiring constant human interaction. Unlike interactive copilots that suggest completions in real time, background agents take a task specification and execute it end to end, producing complete pull requests for human review.

Why would an agency build its own background agent instead of using existing tools?

Off-the-shelf tools operate at the interactive level: seconds to minutes of autonomy before they need human input. Agencies handling dozens of retainer clients need hours-level autonomy for small tickets, plus custom workflow integration with their project management and client-facing tools.

How does Satoru handle code quality and safety?

Satoru runs in a fully sandboxed container on Modal, isolated from everything else. It has its own dev server, test suite, and repo clone. Every change produces a pull request that goes through the same human review process as any other code change. The container boundary is the security model.

What companies are building their own background agents?

Stripe's Minions system merges over 1,000 PRs per week. Ramp's Inspect agent writes roughly 30% of their frontend and backend PRs. Cursor has shipped long-running agents capable of multi-day execution. Vercel provides infrastructure tooling including sandboxed execution environments for agent workloads.

Can background agents replace developers?

No. Background agents handle well-defined, lower-complexity tasks: the small amends and bug fixes that eat into engineering time. Engineers shift from writing every line to orchestrating agent fleets and focusing on architecture, system design, and the judgment calls that require human context.

What is Satoru and how does it work?

Satoru is Roboto Studio's background coding agent. It runs inside a sandboxed container on Modal, orchestrated by a control plane built on Cloudflare Durable Objects. When a ticket is assigned to Satoru, it clones the repo, spins up an isolated dev environment, makes the change, and opens a PR. The whole process streams in real time via WebSockets to a dashboard where engineers can monitor progress.

Why is it called Satoru?

The name felt right for an agent that sees everything in the codebase at once. It processes the full repo (every file, every dependency, every convention) and works through tasks with a kind of effortless clarity. We needed a name for something that operates in a domain most tools only partially perceive. Satoru handles what would overwhelm a single developer's attention, running in parallel across dozens of codebases while the rest of us focus on what actually requires thought. We took all of that and put it in a box. Containerised it. It runs sealed away in its own sandboxed environment, doing its thing, while the world carries on outside.

Why we built Satoru, our background coding agent

We run a studio. We have retainer clients. Those retainer clients have backlogs. And if I'm being honest, a lot of what sits in those backlogs isn't hard. It's just time-consuming.

"Change this button color." "Update the hero copy." "Fix the padding on mobile." "Add a new team member to the about page." These are real tickets from real clients, and every single one requires a context switch. Pull up the repo. Read the ticket. Find the file. Make the change. Open the PR. Wait for CI. Merge. Deploy. For a change that took 10 seconds of actual thinking, the process took 20 minutes of faff.

That's the throughput tax of agency life. Not the complex architectural work. Not the system redesigns or the tricky state management bugs. Those are the tickets we're here for. The small stuff is what buries the team.

So we built Satoru as part of our agentic workflows practice. It's a background agent that picks up low-complexity tickets from client Linear boards and ships PRs without a human touching the keyboard. It runs in a box. Here's why we built it, what I learned from companies doing this at much larger scale, and why I think the agency model might change rapidly.

Also, clients, if you're reading this: R&D isn't free. Hire us for consultancy. Nobody else is going to throw a background agent in as a sweetener...

The retainer problem nobody talks about

Every agency I know runs into the same pattern. You sign a retainer client. The first few months are great. Big builds, meaty features, architectural decisions that need your full attention. Then maintenance mode kicks in.

The backlog fills up with small stuff. Not unimportant stuff. The client needs these things done. But they're not the kind of work that justifies pulling a senior engineer off a complex build. So they sit. They accumulate. The client gets frustrated. You get frustrated. The backlog becomes a source of tension in every standup and every monthly review.

We tried the obvious solutions. Junior devs dedicated to small tickets. Batching amends into dedicated sprint days. Multi-agent orchestration. They all have the same problem: a human still has to context-switch into each project, understand the codebase, make a small change, and context-switch out. The overhead is constant regardless of the ticket size.

What we actually needed was something that could run in the background, pick up a well-specified ticket, make the change in a sandbox, and open a PR for review. No babysitting. No pairing. Just: here's the ticket, here's the codebase, go.

Why off-the-shelf wasn't enough

There's a framework from Chris Weichel at Ona that nails this. He calls it "time between disengagements": how long an AI coding tool can operate before it needs human input.

Think of it as autonomy levels:

Seconds: Tab completion, GitHub Copilot. You're steering constantly.
Minutes: Cursor, Claude Code. Multi-step execution, but you're in the loop giving feedback every few minutes.
Hours: Background agents. You hand off a task, go do something else, come back to a finished PR.
Days: Doesn't exist yet. "...but what about Ralph loops" shut up

We were stuck in the minutes tier. Cursor and OpenCode are great tools, and we use them daily for the complex work where you want to be in the loop. But "being in the loop" means someone is sitting at a keyboard. That someone has a finite number of hours, and those hours should go toward the work where human judgment matters: things like building the architectural foundation that agents and AI tools need to work effectively.

For a ticket that says "change the accent color from blue-600 to indigo-500 across the marketing pages," you don't need a human in the loop. You need a human to review the result. Big difference.

The off-the-shelf tools don't solve this because they're designed for interactive use. They assume a developer is present. We needed the opposite: something that runs while the developer is doing something else entirely.

What the industry is building

I'll be honest: a bit more than a week before we started building Satoru, I was publicly skeptical about all of this. The background agent spiel felt like marketing. Everyone was announcing their agent, and nobody was showing real-world examples of both the good and the bad. Fewer still shared the blueprint for building one.

Well, guess what? I was partially wrong. I now believe the numbers from the companies below are realistic. But I still think the risk of turning these into slop cannons is underrated. Shipping volume means nothing if the code is garbage. More on that in the lessons section.

Stripe's Minions: 1,000+ PRs per week

Stripe's internal agent system, called Minions, now merges over 1,000 PRs per week. One-shot, end-to-end coding agents. The human writes the spec, the agent writes the code, the human reviews the PR. That's it.

The numbers, published by Alistair Gray on Stripe's internal tooling team, are wild. They're not using agents for side projects or internal tooling experiments. This is production code getting merged at scale, across Stripe's main codebase.

The model is simple: agents write, humans review. The review step isn't going anywhere. But the writing? That's increasingly automated for well-defined tasks.

Ramp's Inspect: 30% of all PRs

Ramp is actually the company that mostly inspired us to build Satoru. We love seeing some of the crazy AI stuff they've built, and frankly, anybody that makes taxes and expenses simpler is a friend of ours.

Ramp built an agent called Inspect that, within months of deployment, was producing roughly 30% of all pull requests to their frontend and backend repositories. Thirty percent.

Their architecture is worth studying. They run sandboxed VMs on Modal, each with a full dev environment: Vite, Postgres, Temporal, the works. Pre-built images get refreshed every 30 minutes so agents start with a warm cache. They use OpenCode as the underlying coding agent.

Developers can kick off agent tasks from Slack, a web interface, or a Chrome extension. It's multi-surface by design, because if the agent only lives in your IDE, you're back to the "someone has to be sitting there" problem.

The line that stuck with me from their writeup: "Owning the tooling lets you build something significantly more powerful than off-the-shelf." That's been our experience exactly.

Cursor's long-running and self-driving agents

Cursor has published two pieces of research that, taken together, paint a picture of where this is headed. Their long-running agents can execute for 36+ hours, producing much larger PRs with merge rates comparable to shorter runs. Theo Browne shipped two architecture overhauls running five agents in parallel. Zack Jackson compressed a quarter's worth of planned work into days: a 52-hour run, 151,000 lines touched.

Their self-driving codebase experiment pushed further with a multi-agent hierarchy. A root planner breaks down the task, recursive subplanners decompose further, and workers execute in isolated repo copies. At peak throughput: roughly 1,000 commits per hour, 10 million tool calls over one week. The system accepts some error rate in exchange for throughput, with a final "green branch" fixup pass that resolves conflicts and broken tests.

I'm skeptical about how well this would hold up in a real-world project, but I'd like to believe it isn't just smoke and mirrors.

Our architecture: Linear-native, multi-tenanted, agency-first

Here's where our story diverges from Stripe and Ramp. They're product companies building agents for their own codebases. We're an agency building agents that work across dozens of client codebases simultaneously.

That changes everything about the architecture.

The multi-tenant Linear application

So we kind of cheated by saying we built the vast majority of this in a week. The truth is that, a while back, we built a multi-tenanted Linear application that gives each retainer client their own board. That is pretty important for our model. Actually, it's basically the coal for the steam engine.

Clients add tickets directly. Then our team assesses complexity. If a ticket is straightforward (well-specified, low risk, doesn't touch shared infrastructure), Satoru can pick it up.
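
To make that triage rule concrete, here is a minimal Python sketch of the check. The `Ticket` fields and the `assignee_for` helper are hypothetical names for illustration, not Linear's schema or our production code; the three conditions are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    """Hypothetical shape of a triaged ticket (not Linear's real schema)."""
    title: str
    well_specified: bool
    risk: str  # "low", "medium", or "high", set by a human during triage
    touches_shared_infra: bool


def assignee_for(ticket: Ticket) -> str:
    """Route a ticket to the agent only when all three criteria hold."""
    agent_ok = (
        ticket.well_specified
        and ticket.risk == "low"
        and not ticket.touches_shared_infra
    )
    return "satoru" if agent_ok else "engineer"


amend = Ticket("Update the hero copy", True, "low", False)
migration = Ticket("Migrate auth provider", True, "high", True)
print(assignee_for(amend), assignee_for(migration))
```

Anything that fails a single check stays with a human, which keeps the default conservative.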

No time billed. No human context-switch. The client gets faster turnaround on the small stuff. We get to spend our hours on the work where agency expertise actually matters.

The flow looks like this:

Client adds ticket to their Linear board
We triage and tag complexity
Low-complexity tickets get assigned to Satoru
Satoru clones the repo, spins up a sandbox, makes the change
Satoru opens a PR with the ticket linked
Engineer reviews, approves, merges
Deployment happens through the existing CI/CD pipeline (Just use Vercel)

Here's what that looks like in practice. A ticket gets assigned to Satoru in Linear, it picks it up, and about 20 minutes later there's a PR ready for review.

[Figure: Satoru's first ticket]

The engineer reviews every PR. That's non-negotiable. But reviewing a clean PR takes five minutes. Writing it from scratch, with all the ceremony, takes thirty.

Why "personalised software" matters here

If you haven't heard this term floating around, you're obviously not soullessly scrolling X all day. Good job. "Personalised software" is not a generic project management tool with AI bolted on. It's a purpose-built system designed around exactly how our studio operates.

We control the full loop: ticket intake, triage, sandbox execution, PR, review, deploy. Every step is built for our workflow because we built it. The agent slots into the pipeline naturally because the pipeline was designed with the agent in mind from the start.

Every general-purpose tool we evaluated assumed a single team working on a single codebase. Agency life doesn't work that way. We'd rather build the tool that fits our workflow than cram our workflow into someone else's assumptions about how software gets made.

Context is the hard part

Getting an agent to write code is, honestly, the easy part. The hard part is giving it enough context to write the right code.

A ticket that says "change the button color to indigo-500" is useless without knowing: Which button? Which component file? Is the color defined as a CSS variable, a Tailwind class, or hardcoded? Does the project use a design token system? Are there dark mode variants?

The Linear integration solves part of this. Tickets carry project context: which repo, which framework, which conventions. Satoru gets a brief that includes the project's architecture notes, component structure, and any relevant style guidelines. This is the same principle behind our Claude skills: encode the conventions once, and every AI interaction benefits from that context.
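
As a rough illustration of what goes into a brief, the sketch below assembles one from a ticket and per-project context. The field names, the example repo, and the plain-text layout are all assumptions made for the example; the real brief format may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectContext:
    """Hypothetical per-project context attached to every ticket."""
    repo: str
    framework: str
    architecture_notes: str
    style_guidelines: list = field(default_factory=list)


def build_brief(title: str, body: str, ctx: ProjectContext) -> str:
    """Combine the ticket text with project conventions into one prompt."""
    lines = [
        f"Task: {title}",
        body,
        "",
        f"Repo: {ctx.repo}",
        f"Framework: {ctx.framework}",
        "",
        "Architecture notes:",
        ctx.architecture_notes,
    ]
    if ctx.style_guidelines:
        lines += ["", "Style guidelines:"]
        lines += [f"- {rule}" for rule in ctx.style_guidelines]
    return "\n".join(lines)


example_ctx = ProjectContext(
    repo="acme/marketing-site",  # placeholder repo name
    framework="Next.js",
    architecture_notes="Components live in src/components.",
    style_guidelines=["Colors come from Tailwind classes, never hex values."],
)
print(build_brief("Change the accent color", "Swap blue-600 for indigo-500.", example_ctx))
```

The point is that the conventions are written down once per project and travel with every ticket, so the agent never starts from a bare one-line request.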

But the thing that actually makes it work is the sandbox. Satoru doesn't guess about the codebase. It clones it, runs the dev server, and can actually verify that its changes work before opening the PR. A sandboxed environment with a running dev server catches the obvious mistakes that would otherwise waste a reviewer's time.

Linear is clearly building with AI context at the forefront.

Linear → Your AI coding tool. Open any issue directly in Claude Code, Codex, Conductor, Cursor, GitHub Copilot, OpenCode, Replit, v0, or Zed – preloaded with full context and a custom prompt.

— Linear, Feb 2026

Under the hood: Open-Inspect

That's the workflow layer (how tickets flow from client boards to merged PRs). Here's what powers it. We call the platform Open-Inspect. Satoru runs inside it.

Architecture overview

User --> Web Client --> Control Plane (CF DO) --> Modal Sandbox
            ^               | WebSocket             | WebSocket
        Slack/GitHub/     D1 Database           Satoru
        Linear Bots                            (sandboxed env)

Three tiers connected by WebSockets: a web client, a control plane, and sandboxed compute. Bot integrations let you kick off sessions from Slack, GitHub, or Linear.

Web client

Stack: Next.js 16, React 19, deployed to Vercel or Cloudflare Workers (via OpenNext)

GitHub OAuth handles authentication. The dashboard lets you list, create, and monitor coding sessions. A persistent WebSocket connection to the control plane streams Satoru's output in real time as it works through a task. No polling, no refreshing.

Control plane

Stack: Cloudflare Workers, Durable Objects, D1 (SQLite)

This is the brain. Every coding session is its own Durable Object with dedicated SQLite storage.

Session lifecycle: create, run, pause, resume, stop
WebSocket hub: bridges the web client and the sandbox bidirectionally
D1 database: session index, repo metadata, encrypted repo secrets
GitHub App integration: repo access tokens, cloning, PR creation
Auth and permissions: controls who can do what

Durable Objects are the right primitive here. Each session needs isolated state, its own WebSocket connections, and the ability to hibernate when idle. D1 gives you a lightweight relational store without managing infrastructure.
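
The session lifecycle is essentially a small state machine. The sketch below models it in plain Python to show the idea; the real control plane is a Durable Object, and the exact set of allowed transitions here is an assumption.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"


# For each action: the states it may be applied from, and where it leads.
# This transition table is an assumption made for illustration.
TRANSITIONS = {
    "run": {State.CREATED: State.RUNNING},
    "pause": {State.RUNNING: State.PAUSED},
    "resume": {State.PAUSED: State.RUNNING},
    "stop": {
        State.CREATED: State.STOPPED,
        State.RUNNING: State.STOPPED,
        State.PAUSED: State.STOPPED,
    },
}


class Session:
    """In-memory stand-in for one coding session's lifecycle state."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.state = State.CREATED

    def apply(self, action: str) -> State:
        """Apply a lifecycle action, rejecting transitions that make no sense."""
        allowed = TRANSITIONS.get(action, {})
        if self.state not in allowed:
            raise ValueError(f"cannot {action} a session that is {self.state.value}")
        self.state = allowed[self.state]
        return self.state
```

Keeping the transitions in one table makes invalid requests (resuming a stopped session, say) fail loudly instead of leaving a session in a confused state.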

Data plane

Stack: Modal (Python), FastAPI

This is where Satoru actually runs. Modal spins up isolated sandbox environments on demand.

Sandbox creation: each session gets its own containerized dev environment with the repo cloned, dependencies installed, and a dev server running
Warm pool: pre-warmed sandboxes so there's no cold boot penalty
Snapshots: save and restore session state for persistence across runs
WebSocket bridge: streams agent events back to the control plane in real time

Satoru has access to the full repo inside the sandbox. It can run tests, start dev servers, and verify its own changes before opening a PR. The container boundary is the security model. Nothing inside the sandbox can reach anything outside it except through the channels the platform exposes, such as the WebSocket bridge to the control plane.
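
The warm pool idea is simple enough to sketch without any Modal specifics. In the Python below, `boot` is a hypothetical stand-in for whatever actually creates a sandbox; none of this is Modal's API.

```python
from collections import deque
from typing import Callable


class WarmPool:
    """Keeps a few sandboxes booted ahead of time so sessions start instantly."""

    def __init__(self, target_size: int, boot: Callable[[], str]) -> None:
        self._target_size = target_size
        self._boot = boot  # hypothetical: creates one sandbox, returns its id
        self._idle: deque = deque()
        self.refill()

    @property
    def idle_count(self) -> int:
        return len(self._idle)

    def refill(self) -> None:
        """Boot sandboxes until the pool is back at its target size."""
        while len(self._idle) < self._target_size:
            self._idle.append(self._boot())

    def acquire(self) -> str:
        """Hand out a pre-warmed sandbox, or boot one cold if the pool is empty."""
        if self._idle:
            return self._idle.popleft()
        return self._boot()
```

In a real deployment `refill` would presumably run in the background after each `acquire`, so the slow boot is paid before a session needs it rather than while an engineer is waiting.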

Bot integrations

All built as Cloudflare Workers with Hono.

Slack Bot: listens for messages, classifies intent, and routes to the right repo to kick off a session. Engineers can trigger Satoru from a Slack channel without opening the dashboard.
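
A deliberately naive sketch of that classify-then-route step is below. The keyword matching, the `REPO_ALIASES` table, and the intent names are all assumptions for illustration; the real bot's classifier may work quite differently.

```python
import re

# Hypothetical mapping from words people use in Slack to repositories.
REPO_ALIASES = {
    "marketing": "acme/marketing-site",
    "docs": "acme/docs",
}

# Verbs that suggest the message is a request for a code change.
TASK_VERBS = ("fix", "change", "update", "add", "remove")


def route_message(text: str):
    """Return (intent, repo) for a Slack message using simple keyword rules."""
    words = re.findall(r"[a-z0-9-]+", text.lower())
    is_task = any(word in TASK_VERBS for word in words)
    repo = next((REPO_ALIASES[w] for w in words if w in REPO_ALIASES), None)
    if is_task and repo:
        return ("start_session", repo)
    if is_task:
        return ("ask_which_repo", None)  # a task, but no repo could be inferred
    return ("ignore", None)
```

The useful property is the middle branch: when the bot can tell it is being asked for work but not where, it asks instead of guessing a repo.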

GitHub Bot: handles PR review assignments and responds to @mention commands. You can ask Satoru to pick up work directly from a PR comment.

This isn't just rubber-stamping either. During a spam protection implementation, Satoru flagged a honeypot security regression that would have gone to production: the refactored code returned a 422 error on honeypot fields instead of a silent 200, which tells bots they've been caught and defeats the entire purpose. It also caught that BotID client initialization only loaded in production while the server-side check ran unconditionally, meaning local dev would hang or fail. Both were the kind of subtle integration bugs that slip past a quick human review.

[Figure: Satoru reviewing a spam protection PR on GitHub, catching a honeypot security regression and a BotID initialization issue]

The agent catches what tired eyes miss. That's the point.
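
For anyone who hasn't built one, the honeypot behaviour at stake looks roughly like this. It's a framework-free Python sketch, not the code from that PR: the `website` field name and the validation rules are assumptions.

```python
def handle_contact_form(form: dict, honeypot_field: str = "website"):
    """Return (status_code, delivered) for a submitted contact form.

    The honeypot is a hidden input that humans never fill in. If it has a
    value, the sender is almost certainly a bot.
    """
    if str(form.get(honeypot_field) or "").strip():
        # Pretend everything went fine, but silently drop the submission.
        # Returning 422 here would tell the bot it had been detected.
        return 200, False
    if not form.get("email") or not form.get("message"):
        # Genuine validation errors are still reported to real users.
        return 422, False
    return 200, True
```

The regression Satoru flagged amounts to the first branch returning 422 instead of 200: the submission is still dropped, but the bot now knows which field gave it away.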

Linear Bot: receives Linear agent webhooks and creates coding sessions from issues. This is how the triage flow connects to the execution layer. Tag a ticket for the agent in Linear and it flows through to a running sandbox automatically.

Shared foundation

@open-inspect/shared: a TypeScript package that every other component imports. Shared types, auth utilities, and model definitions. It builds first. Everything else depends on it.
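
The "builds first" rule is just a dependency ordering. Here's a small Python sketch using the standard library's `graphlib`; the package names other than `shared` are placeholders, and the real build is presumably driven by the TypeScript toolchain rather than anything like this.

```python
from graphlib import TopologicalSorter

# Each key depends on the packages in its set. Names are placeholders.
DEPENDS_ON = {
    "web": {"shared"},
    "control-plane": {"shared"},
    "slack-bot": {"shared"},
    "github-bot": {"shared"},
    "linear-bot": {"shared"},
}


def build_order(graph: dict) -> list:
    """Return an order in which every package builds after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = build_order(DEPENDS_ON)
print(order[0])  # the shared package comes out first
```

Because every other package lists `shared` as a dependency, any valid ordering has to start with it.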

Infrastructure

Terraform provisions everything: Workers, Durable Objects, D1 databases, secrets, Modal deployments. One config to rule them all. No clicking around in dashboards.

D1 migrations handle schema versioning. They are applied automatically in integration tests and on deploy.

CI/CD runs on GitHub Actions. Push to main auto-deploys changed services. Terraform handles the control plane and D1 migrations. Modal deploys the data plane. Every PR gets lint, typecheck, and tests.

Component	Tech	Role
Web Client	Next.js 16 + React 19	Dashboard, OAuth, real-time UI
Control Plane	CF Workers + Durable Objects + D1	Session management, WebSocket hub
Data Plane	Modal + Python + FastAPI	Sandboxed coding environments
Slack Bot	CF Workers + Hono	Slack-triggered sessions
GitHub Bot	CF Workers + Hono	PR review and @mention commands
Linear Bot	CF Workers + Hono	Linear webhook-triggered sessions
Shared	TypeScript	Types, auth, models
Infra	Terraform + GitHub Actions	Provisioning and CI/CD

Early lessons

We shipped Satoru's first PR about a week ago. It's early. We're testing it against our own internal codebases before pointing it at client work, and every PR still gets a senior engineer's eyes on it before merge.

But even at this stage, a few things have already become obvious.

Don't ship slop

Cursor's self-driving codebase experiment peaked at roughly 1,000 commits per hour. They got there by accepting breakage and running a fixup pass at the end. That works when you own the whole codebase, can afford to clean up after yourself, and have passionate developers submitting issues like they're going out of fashion.

We can't afford to do that. These are client codebases. A sloppy PR that gets merged because a reviewer was in a hurry is a bug report next week and a trust problem next month. So Satoru doesn't get to be "close enough." Every PR goes through the same review bar as human-written code. If the agent misses an import or gets a style convention wrong, it doesn't ship. It gets sent back.

The review step is the bottleneck, not the writing

Ona's essay on industrialising software development makes a point about factory floors: optimising one station doesn't help if the bottleneck is somewhere else.

For background agents, the pipeline is: ticket, code, test, review, integrate, deploy. The agent speeds up the code-writing station. But if reviews take two days or the deployment pipeline takes 45 minutes, faster writing barely matters.
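
A quick back-of-the-envelope calculation shows how lopsided this gets. In the Python sketch below, the two-day review and 45-minute deploy come from the example above; every other duration is a made-up number for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def lead_time(stages: dict) -> int:
    """Total minutes from picking up a ticket to having it deployed."""
    return sum(stages.values())


def share_saved(stages: dict, faster_code_minutes: int) -> float:
    """Fraction of total lead time saved by speeding up only the code stage."""
    saved = stages["code"] - faster_code_minutes
    return saved / lead_time(stages)


# Durations in minutes. Only review and deploy are taken from the text.
slow_review = {
    "code": 30,
    "test": 10,
    "review": 2 * MINUTES_PER_DAY,
    "integrate": 5,
    "deploy": 45,
}
fast_review = dict(slow_review, review=5)

print(f"{share_saved(slow_review, 5):.1%}")  # 0.8%
print(f"{share_saved(fast_review, 5):.1%}")  # 26.3%
```

With a two-day review queue, cutting writing time from 30 minutes to 5 saves under 1% of lead time. Shrink the review to five minutes and the same speedup saves about a quarter.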

We've spent as much time on making PRs easy to review (clear descriptions, linked tickets, before/after context) as we have on making the agent write better code. That's where the actual time savings show up.

Build your own, but only if you have to

If off-the-shelf tools cover your workflow, use them. Building agent infrastructure is real work and ongoing maintenance.

Our workflow is specific enough that we had no choice. We need:

Multi-tenanted client access across dozens of repos
Per-project conventions and architecture notes fed to the agent
Triage-based routing between humans and agents
Client-facing visibility into agent-generated work
Billing integration (agent tickets aren't billed as engineer hours)

None of that exists in Cursor or Claude Code. We built it because our business model requires it, not because building it was fun. (It was fun. But that's beside the point.)

Where this goes for agencies

Satoru is on training wheels. We're running it against internal projects, building confidence in the output, and learning where it falls over. The plan is to roll it out to client codebases once we trust the review loop enough. That trust gets built one PR at a time.

Ramp hit 30% of all PRs within months. I don't know if we'll hit the same number, but we don't need to. Even 10% of ticket volume handled by the agent frees up meaningful engineering time for the work clients actually hired us to do: system architecture, performance audits, migration planning. Not "change this button color."

The agency model has always had a tension between "we're experts you hire for hard problems" and "we also need to update your blog page title." Background agents dissolve that tension. Hard problems get the humans. Blog page titles get the agent.

I'm watching what this means for retainer pricing too. If a chunk of ticket volume isn't billed as engineer time, clients get more value per dollar spent. That shifts the conversation from "how many hours do we get" to "how many outcomes do we get." I'd rather sell outcomes anyway. Leave timekeeping to Casio.

If you're an agency or engineering team thinking about building your own background agent setup, we're happy to talk shop. We've made plenty of mistakes already. Might as well save you a few.