Most founders still design AI products like static apps with a clever brain inside.
Last week, Anthropic basically told us that era is over. Their new “dreaming” system lets AI agents review past behavior between sessions, spot patterns, and improve before the next time a user shows up.
That means your product can now change itself while you sleep.
If your UX, onboarding, and metrics don’t assume that, the risk isn’t “AI going rogue”—it’s you not knowing what your own product is doing.
What actually shipped
Anthropic’s update introduced “dreaming,” a technique where agents periodically analyze their previous runs, evaluate what worked, and adjust how they act in future workflows.
They’re pairing this with tools to orchestrate sub‑agents and evaluate work against rubric-based outcomes, explicitly aiming at long-running, semi-autonomous agents for domains like coding, finance, and legal tasks.
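Anthropic hasn't published the internals, so treat the following as a mental model rather than their actual code. A between-session review step might look roughly like this sketch, where `critique_model` and the session fields are stand-ins for whatever model call and logging you already use:

```python
# Illustrative sketch only: NOT Anthropic's implementation, just a way to picture
# "review past runs between sessions, then adjust future behavior".
# `critique_model` is a hypothetical stand-in for whatever model call you use.

def between_session_review(sessions: list[dict], job_description: str, critique_model) -> str:
    """Summarize recent runs, ask a model what to change, return proposed notes."""
    digest = "\n".join(
        f"- goal: {s['goal']} | actions: {s['actions']} | outcome: {s['outcome']}"
        for s in sessions
    )
    prompt = (
        f"Job description:\n{job_description}\n\n"
        f"Recent sessions:\n{digest}\n\n"
        "List recurring failure patterns and one bounded behavior change for each."
    )
    # The proposed change gets logged for human review, not applied silently.
    return critique_model(prompt)
```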
Zoom out and it fits a bigger pattern: multi-agent systems, orchestration layers, and long-duration autonomy are becoming the default enterprise narrative for agentic AI.
At the same time, enterprise reports still say 70–80% of agentic initiatives fail to reach real scale, usually because governance, metrics, and process don’t catch up to the tech.
In other words: the intelligence is maturing faster than the operating model around it.
Why this should bother (and excite) founders
If you’re building on top of these systems, your product is no longer “a UI around a model.”
It’s a living system that can change its own behavior in ways your team didn’t explicitly ship in a release cycle.
That breaks several comfortable habits:
- You can’t treat UX defects as “we’ll fix it next version” if behavior is evolving weekly.
- You can’t rely on static onboarding when the thing users are onboarding into keeps shifting.
- You can’t measure success only at the feature level when the real story lives in emergent behavior across many runs.
The upside: if you embrace this, you can brutally compress your learning loops.
An agent that self-evaluates and self-tunes between sessions is effectively a built-in growth engine—if you design the constraints, feedback, and metrics properly.
If you don’t, you’ve just outsourced product strategy to a black box.
Design like you’re managing a new team member, not a feature
The healthy mental model is not “we shipped an AI feature.”
It’s “we hired a weird, brilliant junior team member who works 24/7 and learns fast—but has no judgment unless we teach it.”
That means designing:
- Clear “job descriptions” for agents. Define explicit scopes, non-goals, and forbidden actions; this is UX, not just policy.
- Instrumentation as part of UX, not an afterthought. Log decisions, not just clicks: what the agent tried, why it chose a path, and what outcome it produced (see the sketch after this list).
- User-facing transparency patterns. Explain what the agent is doing, when it’s experimenting, and how users can override or correct it; this builds trust and reduces support blowups.
- Review rituals. Regular “agent retros” where you and your team review real sessions, not just dashboards, and adjust prompts, tools, and guardrails accordingly.
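To make "log decisions, not just clicks" concrete, here's a minimal sketch of a per-decision record. The field names are mine, not any particular framework's; adapt them to your own stack and analytics pipeline:

```python
# Minimal sketch of a per-decision log record for an agent session.
# Field names are illustrative; adapt them to your own store (DB, warehouse, file).
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class AgentDecision:
    session_id: str
    user_goal: str    # what the user was trying to accomplish
    action: str       # what the agent tried (tool call, reply, hand-off)
    rationale: str    # the agent's stated reason for choosing this path
    outcome: str      # "success", "fail", or "handed_off"
    metric_hint: str  # which metric this touches: activation, conversion, retention, support
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def log_decision(decision: AgentDecision, sink) -> None:
    """Append one decision to whatever sink you already use."""
    sink.write(asdict(decision))
```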
This is where founder-first, metric-tied UX matters more than ever: every change in agent behavior should map back to an activation, conversion, retention, or support metric you actually care about.
One move to make this week
Here’s the concrete play:
Set up a weekly agent review loop for one critical workflow—say, onboarding, trial-to-paid, or a high-volume support task.
Minimum viable version:
- Pick one agent flow that matters to revenue. Not a toy; something connected to activation, conversion, or churn.
- Turn on rich session logging. For each run, store user goal, context, key agent decisions, and outcome (success/fail/hand-off).
- Score 20–30 sessions by hand every week. Use a simple rubric: usefulness, clarity, time-to-value, and whether the agent stayed within its job description (see the sketch after this list).
- Ship one UX or constraint change per week based on that review. This could be copy, UI hints, stricter tool access, or a changed rubric—not just prompt tweaks.
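To keep the weekly scoring honest, fix the rubric up front and aggregate the hand scores the same way every week. A minimal sketch, assuming you've pulled the week's session IDs from your logs:

```python
# Minimal sketch of the weekly hand-scoring step: a fixed rubric plus a summary.
# Scores are entered by a human reviewer; nothing here is automated judgment.
from statistics import mean

RUBRIC = ("usefulness", "clarity", "time_to_value", "stayed_in_scope")

def score_session(session_id: str) -> dict:
    """Prompt a human reviewer for 1-5 scores on each rubric dimension."""
    scores = {dim: int(input(f"{session_id} - {dim} (1-5): ")) for dim in RUBRIC}
    return {"session_id": session_id, **scores}

def weekly_summary(scored_sessions: list[dict]) -> dict:
    """Average each rubric dimension across the 20-30 sessions scored this week."""
    return {dim: round(mean(s[dim] for s in scored_sessions), 2) for dim in RUBRIC}
```

Track those averages next to your activation, conversion, or churn numbers so the single change you ship each week has a clear before-and-after.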
If you don’t have the time or in-house design capacity to build those flows and guardrails, this is exactly the kind of work I do with AI founders through focused design audits and rapid, metric-driven sprints at Poplab.
But whether we work together or not, you cannot afford to treat self-improving agents as “just another feature” in your backlog.
The frontier models are learning between sessions.
The real question is whether your product, metrics, and team are doing the same.
