Zoo 2: getting the most out of Codex
May 5, 2026
GPT models have been better than Opus since late 2025, but Codex sucked until March ‘26. Now, finally, it is capable of running a Zoo workflow, and I present my best setup so far, Zoo 2.2, available for your stealing pleasure.
The goal
There are two ways to use coding agents: near-interactive collaboration with small steps and frequent instructions, and long (often multi-hour) autonomous sessions intended to produce nearly production-ready results.
Interactive collaboration is an easier starting point, and is the right fit for some tasks. Zoo 2 introduces a mode to support interactive use.
I’m most interested in autonomous operation of agents, though. Both because micromanagement is repetitive, tiresome and boring, and because autonomous coding is where we’re all headed, whether we like it or not, and I love being at the forefront.
And, if done right, well, it gives me the most time and attention back, allowing me to focus on my most demanding tasks.
But: The quality is non-negotiable, and we have a complex codebase in an unforgiving, e-commerce-adjacent domain.
The zoo
Over 2025, I came up with a particularly successful Claude Code setup that I called All Star Zoo (a whole post about the journey). In a nutshell:
- Short-lived single-step subagents for all work steps including coding (to conserve context, avoid compactions, and steer frequently);
- Per-step persistent report files (to maintain continuity between steps);
- Personality infusions for subagents (to instill a particular set of values and “import” a person’s well-known beliefs);
- Autonomous planning with a plan review loop, and a code review loop;
- Separation of test writing and implementation steps in TDD fashion.
My All Star Zoo running on Claude Code has been successful enough that, from Jan 1, 2026 onwards, I’ve stopped writing any code manually; a story for another day.
And yet, every time I had a chance to try plain Codex, it would beat plain Claude. Yes, more literal and harder to steer, and less of a conversation partner, but a smarter tool.
Only with the release of GPT-5.4 has Codex worked out the quirks with subagents and skills; it can now consistently run a Zoo-style workflow.
What’s new
Zoo 2 has been reimagined for a much smarter model, and incorporates an extra six months of my experience.
(Hell, this vibe coding thing is moving fast… the last time I measured change in months was around the birth of CoffeeScript.)
Here’s what’s new.
Three modes
- Zoo Heavy runs the full multi-hour experience;
- Zoo Lite performs planning and implementation in the top-level agent for faster results;
- Zoo Zero just executes your instructions and runs an autonomous review loop afterwards.
Spec
Zoo Heavy and Zoo Lite write and maintain a spec file in addition to per-step reports. The spec has the following sections:
“User Input”, classified into overview, “Hard constraints”, “Soft preferences”, “Unknowns”, and “Draft ideas to adapt with common sense”.
Codex follows instructions literally. Separating hard constraints from preferences, and putting implementation ideas into a weirdly named “Draft ideas to adapt with common sense” subsection, gives it more leeway to adapt your ideas to the realities of the code.
(This stuff is written in literal blood of heads banged against the walls.)
“Spec”, a rough equivalent of the plan built by `/plan` and the like.

“Decision log” records every choice made by the agent for you to review later. E.g.:

> Use a date column, not a timestamp column. The ticket says Tier Start Date, existing import date formats already return `days.Day`, and start-of-day in the shop timezone is the least surprising behavior for a date-only field.

“Open product and strategic questions” is for the user interviews (below).
“Execution memory” stores important takeaways from the prior step reports. This helps future steps remember the prior changes and their rationale, e.g.:
> Plan uberreview found the side-effect-free stats read must still see existing processor cache mutations within the same import batch. The implementation needs a non-allocating stats cache peek before falling back to `LoadOverall`.

“Subtasks” tracks progress over multiple independent steps with clear acceptance criteria:
- (done) Support tier start dates in customer imports
- Acceptance: Customer import accepts Tier Start Date, uses it for accepted tier entries, recomputes same/current tier period data without tier history, preserves `ForceTier` lowering behavior, and leaves manual override renewal behavior verified.
- Browser impact: direct, backoffice import column configuration dropdown and customer import apply flow
- Plan: 005-plan.md
- Evidence: `evidence/tier-start-date-selected.png`, `evidence/tier-start-date-dropdown-options-expanded.png`, and `evidence/tier-start-date-dom-proof.json` prove the backoffice import column UI offers `Loyalty — Tier Start Date` and date-format options.
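Put together, a spec.md might look roughly like this. The section names are Zoo’s; the contents are an illustrative fragment built from the tier-import example above, not a real spec:

```markdown
# Spec: tier start dates in customer imports

## User Input
- Overview: support tier start dates in customer imports.
- Hard constraints: preserve existing `ForceTier` lowering behavior.
- Soft preferences: ...
- Unknowns: ...
- Draft ideas to adapt with common sense: ...

## Spec
(Rough equivalent of a plan-mode plan.)

## Decision log
- Use a date column, not a timestamp column. ...

## Open product and strategic questions
- ...

## Execution memory
- Plan uberreview: the stats read must still see processor cache
  mutations within the same import batch.

## Subtasks
- (done) Support tier start dates in customer imports
  - Acceptance: ...
  - Plan: 005-plan.md
  - Evidence: evidence/tier-start-date-selected.png
```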
Following a common spec improves the agents’ long-horizon consistency over dozens or hundreds of steps.

A spec is also a very convenient human review tool. I find that I rarely look at step report files, but the spec file was designed to support reviewing complex execution results.
Subtasks
Zoo Heavy and Zoo Lite are instructed to split the task into meaningful, self-contained subtasks.
The subtasks are planned, implemented, reviewed and committed one by one. The agent is told to focus on just the current subtask. This improves quality by limiting the scope of changes in flight.
In the end, we get one commit per subtask, which is also much easier for the human to review later. (Of course, I assume you’d squash those commits after reviewing the output, but it’s up to you.)
Uber Review
Uber Review runs multiple specialized review subagents in parallel after the regular review has exhausted its ability to find issues.
This might seem – this is – wasteful, but it also routinely finds new valuable issues, so paying the extra tokens and time is well worth it.
I can’t guarantee that it’s the individual lenses (specializations) that help; potentially, combining all instructions and repeating the review N times would find the same issues. Still, the lenses don’t hurt.
Example lens, .zoo/review/duplication.md:
Focus on duplicated code and effort, both inside the new code, and where new code duplicates existing code.
Check for:
- Duplicated helpers across packages. (These go into a general helpers package.)
- Existing helpers that should have been used.
- Duplicated business logic, decision making or magic constants -- should be exactly ONE place where each business rule is implemented.
A reviewer is launched for every file under .zoo/review/, so you have full control over the set of review lenses.
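Mechanically, the fan-out is simple: one reviewer per lens file. A conceptual sketch (this is not Zoo’s actual implementation, just the shape of the idea, using a temp directory as a stand-in for `.zoo/review/`):

```shell
# Conceptual sketch of the uber-review fan-out: one reviewer per lens file.
lens_dir=$(mktemp -d)                 # stand-in for .zoo/review/
printf 'Focus on duplicated code and effort.\n' > "$lens_dir/duplication.md"
printf 'Focus on error handling and edge cases.\n' > "$lens_dir/errors.md"

for lens in "$lens_dir"/*.md; do
  # In Zoo, this is where a review subagent would be launched
  # with the lens file as its instructions.
  echo "launching reviewer with lens: $(basename "$lens")"
done
```

Adding or removing a lens is just adding or removing a file, which is why the set of lenses is fully under your control.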
User interviews
Zoo Heavy and Zoo Lite will ask clarifying questions during the initial planning session, if necessary. This is not new ground by any means, but it is new to Zoo.
The agent is instructed to limit these to high-importance product and strategy decisions only, and make the rest of the decisions autonomously. If you want more control, use your agent’s native planning mode before invoking zoo.
Browser use
One of the biggest advancements of 2026 is browser use, first in Claude, and then, finally, in Codex. This provides the missing feedback loop for the frontend changes, admin UIs and the like.
Zoo now runs browser testing (via Codex’s Browser Use and Computer Use skills) of any applicable flows, with screenshots produced as evidence.
The screenshots allow you to review frontend changes without actually running the app and reproducing all of those states manually.
Skills
Each step is now defined as a skill, and subagents mostly just run these skills.
This means Zoo can offer a choice of executing certain steps within subagents or in the top-level agent.
It also means you can invoke just a single step manually as a skill, say just the docs update or just the uber-review.
Reuse
The skills and subagents are now designed for reuse verbatim, provided that you subscribe to my general philosophy. The reusable instructions are generic, and each step can be customized via separate per-project files under .zoo/.
For example, my .zoo/docs.md starts like this:
- Internal technical notes and future-agent learnings go in `_ai/`.
- Durable developer/onboarding docs go in `_readme/`.
- User manual content for the configuration team and partners goes in `_readme/manual/`; omit internal implementation details and code references.
- Public/client RCAs go in `_readme/rca/`; read that folder's `AGENTS.md` and recent RCAs before writing.
- Client-specific API guides go in `_clientguides/`; never mention source code or internal configuration names.
- API docs content lives in `apidocs/`; viewer code lives in `apidocviewer/`.
So you can add your own per-step instructions while safely overwriting the base skills when new versions of Zoo get released.
Other changes
Knowledge and documentation writing steps have been merged to update all sorts of documentation in one go. This has been a mess with Zoo 1, and works much better in Zoo 2.
Personality infusion is subtle and minimal so far; heavy presence of Linus Torvalds and Don Melton was required to get Opus to behave, but GPT is almost immune to such shenanigans.
What’s old
Task dirs and report files via Bureau
Zoo uses Bureau MCP to read and write step report files, making sure the history of past work is preserved. I almost never read these reports, but the LLM makes great use of them.
Bureau maintains a tasks directory, .tasks or _tasks, which I typically put in .gitignore. Within, there’s a per-task dated subdirectory, e.g. .tasks/20260501-tier-import/.
The spec file (spec.md) and per-step reports (002-plan.md) live inside those task directories.
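Concretely, the on-disk layout looks roughly like this (the comments are mine; file names beyond the ones mentioned in this post are illustrative):

```
.tasks/
  20260501-tier-import/
    spec.md          # the living spec, updated throughout the run
    002-plan.md      # per-step reports, numbered in execution order
    005-plan.md
    evidence/        # browser-use screenshots and DOM proofs
```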
TDD
Zoo writes tests first, then implements code changes, as two separate steps.
In the earlier Claude days, that helped avoid the “LLM fixes test to pass with a broken impl, instead of fixing impl to pass the tests” problem.
In the current day and age, it’s not necessarily a hard requirement, but still works well in practice, and helps Codex think through the usage and testing aspects before rushing into the implementation.
Linus
The ghost of Linus Torvalds is still the best reviewer, and his values are instilled into the review steps.
There’s a lot less swearing and role-playing in the second version, though. Less therapeutic and less fun, perhaps, but Codex was deeply uncomfortable with the communication style of original Linus, so maintaining it would probably hurt the results.
Single-step subagents
Zoo Heavy is still based on doing everything in short-lived single-step subagents.
Zoo Lite and Zoo Zero use subagents for reviews, research and browser use steps.
Using Zoo 2
Installation
Installation is covered in the Zoo repository README; in short, copy the files and run the Zoo Init skill to create the local customizations specific to your project. Look over your .zoo folder, and mine, and steal what you like; in particular, adjust the specialized reviewers under .zoo/review/ to match the areas you don’t want missed.
Zoo Zero
Not every task requires a heavy workflow, but almost every task can benefit from the machine reviewing and fixing its own shit before demanding attention from a human. For that, just prefix your instructions with the Zoo Zero skill.
Zoo Heavy
When starting on a big, complex task you care a lot about, run Codex’s plan mode asking for a high-level plan, iterate until you’re happy with it, then run the Zoo Heavy skill asking to implement the plan.
If you’re feeling lucky or lack human time, simply running Zoo Heavy on a complex task will also produce surprisingly reasonable results. Give it a try, especially overnight.
Often it’s easier to run revision iterations later than to invest in making a plan beforehand. “I’ll know it when I see it” approach, or more like “I’ll know it when I see something that’s definitely not it”.
Zoo Lite
For a task that’s less complex, that Codex would handle okay but not quite right, use Zoo Lite. It will do all planning and implementation work at the top level, just like Codex normally does (and thus at the same speed), but will still benefit from a spec, will split into subtasks, and will run review loops, which greatly improves the result.
Review the spec file
When using Zoo Heavy and Zoo Lite, you’ll find the spec file to be a useful input when reviewing the results or ongoing work.
In particular, it has a decision log for all choices made by the LLM, and detailed acceptance criteria for the substeps that you should see materialized as commits in your repository.
The file lives under .tasks/YYYYMMDD-whatever/spec.md; see Bureau for the details on the task folders. You can provide your own location for the spec file, and it will be maintained there instead.
Revisions
To request a revision, run a Zoo skill again. It should recognize that it’s a revision and continue working with the same spec and same task directory.
Standalone skills
You can invoke Zoo Docs separately to beat some new knowledge into its stupid brain. This replaces the /hr command from Zoo 1.
Making your project AI-friendly
Important caveat: Agentic workflows are only as successful as the feedback loop you give them. You need to arrange for all parts of your app to be easily accessible for testing by AI agents.
Zoo workflows are instructed to use an AI harness mode if your app has one. If you’re running multiple agents in parallel, you need to ensure that multiple copies of your app can be run in parallel without conflicting.
Additionally, I recommend adding scenarios/fixtures available via command line that launch the app prepopulated with data, so that agents can easily land on the screens and configurations they need to test.
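As a sketch of what that can look like in practice (the flag names and directory layout below are hypothetical, not from Zoo or any real app): give each agent its own port and data directory, plus named fixtures it can launch from the command line.

```shell
# Hypothetical harness launcher sketch; flag names and layout are made up.
AGENT_ID=${AGENT_ID:-1}              # each parallel agent gets a distinct ID
PORT=$((8000 + AGENT_ID))            # ...so ports don't collide
DATA_DIR=".harness/agent-$AGENT_ID"  # ...and neither does on-disk state
mkdir -p "$DATA_DIR"
echo "agent $AGENT_ID -> port $PORT, data in $DATA_DIR"
# The actual app launch would go here, e.g.:
# ./app --harness=ai --fixture=tier-import --port="$PORT" --data-dir="$DATA_DIR"
```

The exact mechanism matters less than the property it buys you: any agent can get a clean, prepopulated, conflict-free instance of the app with one command.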
So how well does this work?
The resulting quality level is on par with what Claude+Zoo produced, i.e. good, but not perfect, and almost always requires several revision iterations to get to a production-ready state.
The key improvement, however, is the complexity of the tasks handled at this quality level. Zoo Heavy can be trusted to build entire features or subsystems. I once had it run for about 12 hours, and I routinely run it for 3–6 hours.
(It helps that Codex rate limits are very hard to hit on the highest plan as of 2026 Q2.)
The new flow also handles frontend tasks, and can figure out and use external web sites to test various integrations.
TODO
I would love for my workflows to handle simple tasks perfectly. Unfortunately, it seems that doing things perfectly is harder than doing complex things reasonably well. That’s why uber-review is the latest addition here. Gonna do more in this area.
I also want to introduce automatic gardening / continuous code and doc improvements in the later versions of Zoo.