give the agent the harder job

essay

The cost of asking an agent to attempt something has collapsed. Most people still scope their asks to what they already believe is possible — that is the mistake.

2026-06-13 · 7 min read · chris olson

The fastest way to find the edge of an agentic tool is to ask it for something you are fairly sure it cannot do. Not as a stunt — as a habit. The cost of the experiment used to be an afternoon. Now it is a sentence. When the downside of a failed attempt is thirty seconds and a re-prompt, the rational move is to attempt far more than feels reasonable. Most people do the opposite. They scope every request down to what they already believe the tool can handle, and then they conclude the tool can only do what they believed.

This essay is about the gap between those two behaviors. Scoping down is the most expensive habit I have had to unlearn while building with these tools, and unlearning it is where the actual leverage turned out to be.

01 the messy job is the job

The well-specified task is the one you didn't need an agent for. "Rename this symbol across the repo," "write a test for this function" — these are fine, and the tool does them, and you save a few minutes. The work that compounds is the work you can barely specify yourself.

I redesigned this site recently and asked the agent to "QA the redesign" before I shipped it. I did not tell it how. I expected a list of things to click through manually. Instead it stood up a headless browser, opened a debugging port, navigated to the interactive work index, and started dispatching synthetic clicks at the category filter — driving the real compiled page the way a user would. Within a few iterations it reported that the filter pills lit up correctly but the rows never actually filtered. The count label was frozen. It was a real bug, and it had already shipped.

I never asked for a browser. I asked for QA. The browser was its idea about what QA means when the thing under test is an interactive page. That distance — between the instruction and the interpretation — is where the value lives. A precise instruction gets you a precise result. An ambiguous one, given to a capable agent, gets you the result plus the approach you didn't know to ask for.

02 tools it has that you didn't ask for

The filter bug is a good story because of what catching it required, which is more than "open a browser." The agent had to drive the page, observe that the DOM didn't change, and then cross a boundary most humans avoid: down into the WebAssembly island that owned the filtering, and the JavaScript bridge that fed it props.

The root cause was buried there. The page passed props to its island as a hand-rolled JSON string; the framework expected a binary codec, base64-encoded, that the bridge decodes with atob before the WASM ever sees it. Calling atob on a string that begins with a raw { throws, because { is not in the base64 alphabet. The error gets swallowed, the island hydrates with zero rows, and the filter has nothing to filter. The pills toggled because their state was static and didn't depend on the props at all.

// before (the bug): props as raw JSON; atob() rejects it, so the island hydrates empty
const props = try buildJsonString(projects);

// after (the fix): props go through the framework's binary codec, then base64
const props = try verve.encodeProps(ctx, .{ .slugs = slugs, .cats = cats });
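
The failure mode on the JavaScript side can be reproduced in a few lines. What follows is a minimal sketch, not the framework's bridge: decodeProps, IslandProps, and the JSON-inside-base64 payload are all assumptions standing in for the real binary codec. The only part taken as given is that atob throws on characters outside the base64 alphabet, and that a swallowed error leaves the island with nothing.

```typescript
// Hypothetical props shape; the real island's props are not shown in this essay.
interface IslandProps {
  slugs: string[];
  cats: string[];
}

const EMPTY_PROPS: IslandProps = { slugs: [], cats: [] };

// Hypothetical stand-in for the bridge: decode base64, then parse.
// The catch block swallows the error, which is what hid the bug.
function decodeProps(encoded: string): IslandProps {
  try {
    const json = atob(encoded); // throws if the input has non-base64 characters such as "{"
    return JSON.parse(json) as IslandProps;
  } catch {
    return EMPTY_PROPS; // silent fallback: the island hydrates with zero rows
  }
}

const props: IslandProps = { slugs: ["alpha", "beta"], cats: ["web", "tools"] };
const rawJson = JSON.stringify(props); // what the page was sending
const encoded = btoa(rawJson);         // what the bridge expected (JSON here, as an assumption)

const broken = decodeProps(rawJson);
const fixed = decodeProps(encoded);

console.log(broken.slugs.length, fixed.slugs.length); // prints "0 2"
```

Nothing in the broken path logs or rethrows, so the only visible symptom is a page that renders and then does nothing when filtered.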

That is four distinct capabilities composed in service of one vague request: run a browser, simulate input, inspect a reactive DOM, and read across the WASM/JS boundary to a serialization mismatch. I would not have thought to ask for any single one of them. I would have written "check that the filter works" and accepted a screenshot. The creativity was not in my prompt. It was in the framing — verify this actually behaves — and the agent supplied the method.

This is the part that does not fit the autocomplete mental model people still carry. The tool is not finishing your sentence. Given enough room, it is choosing a strategy.

03 the work you didn't have time for

There is a second payoff, quieter than the first. Every codebase has a backlog of work that is obviously worth doing and never quite worth a human afternoon. The dead-code audit. The exhaustive edge-case sweep. The asset that should be generated instead of hand-drawn. These items live forever in a file called TODO because the value is real but sub-threshold for a person.

That threshold just moved. While cleaning up the same redesign I asked for an audit of unused CSS. The agent found ten candidate classes — and then, before deleting any of them, checked whether the static-site generator emitted them from markdown, caught that six of the ten were live, and deleted only the four that were actually dead. That is the correct, careful version of a chore I would have done sloppily at midnight, if at all.
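
The two-pass shape of that audit is easy to sketch. This is an illustration under assumptions, not what the agent ran: the function names are hypothetical, the regexes handle only simple class selectors and double-quoted class attributes, and the sample inputs are invented. The point is the second pass, which checks candidates against the generator's output before anything is deleted.

```typescript
// Collect class names defined in a stylesheet (simple selectors only, not a CSS parser).
function definedClasses(css: string): Set<string> {
  const found = new Set<string>();
  for (const m of css.matchAll(/\.([A-Za-z_][\w-]*)/g)) found.add(m[1]);
  return found;
}

// Collect class names used in class="..." attributes.
function usedClasses(html: string): Set<string> {
  const found = new Set<string>();
  for (const m of html.matchAll(/class="([^"]*)"/g)) {
    for (const name of m[1].split(/\s+/)) if (name) found.add(name);
  }
  return found;
}

interface AuditResult {
  candidates: string[]; // defined in CSS, absent from hand-written templates
  live: string[];       // candidates the generator still emits
  dead: string[];       // candidates nothing references
}

function auditCss(css: string, templates: string, generatedHtml: string): AuditResult {
  const inTemplates = usedClasses(templates);
  const inOutput = usedClasses(generatedHtml);
  const candidates = [...definedClasses(css)].filter((c) => !inTemplates.has(c)).sort();
  return {
    candidates,
    live: candidates.filter((c) => inOutput.has(c)),
    dead: candidates.filter((c) => !inOutput.has(c)),
  };
}

// Invented sample inputs for illustration.
const css = ".pill {} .pill-active {} .row {} .callout {} .footnote {} .old-banner {} .legacy-grid {}";
const templates = '<button class="pill pill-active"></button><li class="row"></li>';
const generatedHtml = '<aside class="callout"></aside><p class="footnote"></p>';

const result = auditCss(css, templates, generatedHtml);
console.log(result.live); // [ 'callout', 'footnote' ]
console.log(result.dead); // [ 'legacy-grid', 'old-banner' ]
```

A sloppy version stops after the first pass and deletes every candidate. The second pass is what separates the classes that only look unused from the ones that are.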

The 3D artifact on the site is the same story from the other direction. I wanted a piece of generated geometry and an image-based lighting environment baked at build time. That is a real graphics task, the kind that historically eats a weekend you don't have. It became a build step. Not because the work got easier in some abstract sense — because the unit of effort it cost me dropped to "describe what you want and review what comes back."

The backlog is not a backlog anymore. It is a queue, and the queue moves.

04 be wrong about the ceiling

Here is the uncomfortable part. Your belief about what the tool cannot do has a shelf life measured in weeks, and you are almost certainly still operating on a stale one. The thing it couldn't do last quarter it can do now, and you will never find out if your prompts are calibrated to last quarter's ceiling.

The only reliable way I have found to stay current is to keep asking for the thing I assume will fail. Most of the time it does, and I have lost a sentence. Sometimes it doesn't, and I have learned that an entire category of work moved inside the boundary while I wasn't looking. The expected value of that bet is lopsided in a way that should change your behavior, and for most people it hasn't yet.

Fearlessness here is not recklessness. It is refusing to pre-decide the outcome. You do not get to know the ceiling by reasoning about it. You find it by pushing on it, and it has a habit of not being where you left it.

05 verify like you mean it

The counterweight, because exploration without it is malpractice: the filter bug shipped in the first place because someone — a previous version of this same loop — explored creatively and did not verify. The redesign rendered, the SSR looked right, the markup was correct, and the thing was broken in a way no static check would catch. Creativity got the feature built. Only adversarial verification caught that it didn't work.

So the habit is not "ask for big things and trust the output." It is "ask for big things and then make the agent prove them." Drive the real page. Run the real binary. Diff the actual behavior against the claim. The same tool that will invent a browser-automation strategy to please you will also, asked the wrong way, assure you something works when it doesn't. The exploration and the skepticism are two halves of one practice. Run them together or you are just generating plausible mistakes faster.
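
Here is roughly what "diff the actual behavior against the claim" looks like for the filter. It is a sketch under stated assumptions: PageDriver is a hypothetical interface that a real run would back with a headless browser, makePage is a fake that reproduces the symptom so the example runs on its own, and the "N projects" label format is invented.

```typescript
// Hypothetical driver interface; in practice a headless browser would implement it.
interface PageDriver {
  clickFilter(category: string): void;
  activePill(): string | null;
  visibleRows(): string[]; // category of each row currently shown
  countLabel(): string;
}

// Check the claim "the filter works" against what the page actually does.
function verifyFilter(page: PageDriver, category: string, expected: number): string[] {
  const failures: string[] = [];
  page.clickFilter(category);
  if (page.activePill() !== category) failures.push("pill did not activate");
  const rows = page.visibleRows();
  if (rows.some((c) => c !== category)) failures.push("rows from other categories still visible");
  if (rows.length !== expected) failures.push(`expected ${expected} rows, saw ${rows.length}`);
  if (page.countLabel() !== `${expected} projects`) {
    failures.push(`count label reads "${page.countLabel()}"`);
  }
  return failures;
}

// Fake page: when filterWorks is false, pills toggle but rows and label never change.
function makePage(rows: string[], filterWorks: boolean): PageDriver {
  let active: string | null = null;
  const shown = () => (filterWorks && active ? rows.filter((c) => c === active) : rows);
  return {
    clickFilter: (category) => { active = category; },
    activePill: () => active,
    visibleRows: shown,
    countLabel: () => `${shown().length} projects`,
  };
}

const rows = ["web", "tools", "web", "graphics"];
console.log(verifyFilter(makePage(rows, false), "web", 2)); // three failures, none about the pill
console.log(verifyFilter(makePage(rows, true), "web", 2));  // []
```

A check that looked only at the pill would pass on the broken page. The assertions that matter are the ones on rows and the count label, because those are the behavior the feature claims to deliver.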

06 the case

The bottleneck has moved. For a long time the limiting factor on what got built was hours — yours, your team's — and so the discipline was about spending them well. That discipline is now aimed at the wrong constraint. The hours are cheap. What is scarce is the imagination to ask for the harder thing, and the rigor to confirm you actually got it.

Give the agent the job you think is too big. Give it the chore you were never going to do. Frame the ambiguous version, not the safe one, and then verify the result like you expect it to be wrong. The downside is a wasted sentence. The upside is finding out that the work you had quietly written off as impossible, or simply not worth your time, was neither.