
How I Use Unspent Tokens

I built continuous-refactoring because I wanted a simple loop for refactoring in small chunks: a way to create order in the midst of agentic chaos and entropy.

The idea is simple: point an agent at a target, inject some refactoring taste, run the repo's validation command, and only keep the change if the tests pass. In run-once, it leaves you with one local commit to inspect. In run, it keeps sweeping until it runs out of targets, hits your caps, or gets lost.

continuous-refactoring run \
  --with codex --model gpt-5.3-codex-spark --effort high \
  --extensions .py \
  --max-refactors 10 \
  --max-attempts 2
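The loop itself is small enough to sketch. Here is a minimal Python version of the idea, with the collaborators injected as callables; the names (`refactor`, `validate`, `keep`, `discard`) are illustrative, not the tool's actual API:

```python
from typing import Callable, Iterator

def sweep(
    targets: Iterator[str],
    refactor: Callable[[str], None],   # the agent edits the working tree
    validate: Callable[[], bool],      # runs the repo's validation command
    keep: Callable[[str], None],       # turn the change into a local commit
    discard: Callable[[str], None],    # revert the working tree
    max_refactors: int = 10,
    max_attempts: int = 2,
) -> int:
    """Keep a change only when validation is green; otherwise throw it away."""
    kept = 0
    for target in targets:
        if kept >= max_refactors:
            break
        for _ in range(max_attempts):
            refactor(target)
            if validate():
                keep(target)
                kept += 1
                break
            discard(target)
    return kept
```

The point of the shape is that the test suite, not the agent, decides what survives.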

That is basically the whole product. Everything else exists to make that loop safer and more usable.

The reason I use it is also pretty simple. Every project accumulates low-drama cleanup work: dead branches, helpers that should be inlined, names that no longer tell the truth, modules that got split or merged in the wrong direction, tests that drifted into complexity, TODOs gathering dust. None of that is urgent enough to earn a dedicated cleanup sprint, but all of it makes the repo a little worse to work in.

This tool gives me a way to chip away at that incrementally. I can aim it at a project, let it make one small change at a time, and go to bed. I let the test suite act as guardrails. If the change is good and validation is green, it stays. If not, it gets thrown out and the loop moves on.

It is also a great sink for credits I would otherwise let expire. I do not reach for gpt-5.3-codex-spark when I need careful design work or nontrivial implementation. That is not what it is for. But it is perfectly fine at janitorial passes like "simplify this file", "delete the dead branch", "tighten this helper", or "make the tests less goofy".

The first version of the repo was intentionally small. I did not want a platform. I wanted a supervised janitor loop. That got me pretty far, but it exposed the first obvious problem: a "good refactor" is not a universal thing. It depends on taste.

Some people want comments aggressively removed. Some want boundary comments kept. Some want OOP, while others want FP. If I wanted the loop to make changes I would actually keep, I needed a way to tell it what "better" meant for a given repo.

So I added taste.md files. Not a giant config schema. Just a short bullet list of preferences that gets injected into every refactoring prompt.
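Mechanically, the injection can be as simple as prepending the file's bullets to every prompt. A hypothetical sketch (the taste.md location and the prompt framing here are assumptions, not the tool's real format):

```python
from pathlib import Path

def build_prompt(task: str, repo_root: Path) -> str:
    """Prepend the repo's taste bullets, if any, to a refactoring prompt."""
    taste_file = repo_root / "taste.md"
    taste = taste_file.read_text() if taste_file.exists() else "(no taste.md)"
    return f"Refactoring taste for this repo:\n{taste}\n\nTask: {task}"
```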

The next problem was how to write that file without turning the whole thing into a form builder. The answer was to have the agent interview the user and synthesize the result:

Your job: interview the user with 6-8 focused questions. Probe concrete
preferences on error handling, comments, abstraction, deletion vs
preservation, naming, module boundaries, and tests.

That turned out to be a much better interface than a pile of flags. You can usually get to someone's refactoring taste pretty quickly if you ask a few pointed questions and keep the output tight.

Then taste itself started drifting.

Once the tool began taking on bigger work, I needed the taste file to cover decisions that were not just local cleanup style. Things like when to split or merge modules, how cautious to be with wide-blast-radius changes, and whether user-visible changes should be feature-flagged. That is where versioned taste came from.

taste-scoping-version: 1
 
## large-scope decisions
## rollout style

The important part was not the header. It was the upgrade path. I did not want "new taste dimensions" to mean "throw away your existing preferences and start over." So taste --upgrade only interviews for the missing dimensions and preserves the rest. taste --interview stayed the path for bootstrapping or replacing the doc. Later I added taste --refine because sometimes you just want to rework the doc yourself, without getting re-interviewed on things you already decided.
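One way to picture what --upgrade does: diff the known dimensions against the sections the doc already has, and only interview for the gap. A sketch under the assumption that each dimension lives under a "## name" heading, as in the snippet above:

```python
def missing_dimensions(doc: str, known: list[str]) -> list[str]:
    """Return the taste dimensions the current doc has no section for.
    Only these get re-interviewed; everything else is preserved."""
    present = {line.removeprefix("## ").strip()
               for line in doc.splitlines() if line.startswith("## ")}
    return [d for d in known if d not in present]
```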

There was also one fiddly implementation detail I ended up liking: the interactive taste flow now waits for a settled write, not just "the model said it wrote the file." The agent writes the taste file, then writes a small sha256:... settle file, and only then does the host end the session. Tiny thing, but it makes the interactive path feel much less janky.
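The handshake is easy to sketch: the host does not trust "I wrote it," it polls until a settle file appears whose recorded hash matches the taste file actually on disk. File names and the polling interval here are illustrative:

```python
import hashlib
import time
from pathlib import Path

def wait_for_settled_write(taste: Path, settle: Path,
                           timeout: float = 30.0) -> bool:
    """Block until the settle file's recorded sha256 matches the taste
    file on disk, i.e. the agent's write has actually settled."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if taste.exists() and settle.exists():
            want = settle.read_text().strip().removeprefix("sha256:")
            got = hashlib.sha256(taste.read_bytes()).hexdigest()
            if want == got:
                return True  # safe to end the interactive session now
        time.sleep(0.05)
    return False
```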

The next wall was bigger refactors.

The one-shot loop works well for cohesive cleanup. It is bad at work that obviously wants a plan, touches a lot of files, or should land in stages. I did not want the tool to fake confidence there, so I added a routing step:

ClassifierDecision = Literal["cohesive-cleanup", "needs-plan"]

If the target looks like a normal cleanup, it takes the original one-shot path. If it looks bigger, it goes through the migrations workflow instead.
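The routing step then reduces to a single branch on that decision. A sketch with hypothetical return values standing in for the two code paths:

```python
from typing import Literal

ClassifierDecision = Literal["cohesive-cleanup", "needs-plan"]

def route(decision: ClassifierDecision, target: str) -> str:
    """Dispatch a target based on the classifier's call."""
    if decision == "cohesive-cleanup":
        return f"one-shot:{target}"    # original test-gated loop
    return f"migration:{target}"       # staged, plan-first workflow
```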

That workflow ended up being the most interesting part of the project. Migrations live in a directory in your repo. Each migration gets its own directory, a manifest, a plan, and a set of phase files:

migrations/
  <migration-name>/
    manifest.json
    plan.md
    phase-1-<name>.md
    phase-2-<name>.md

The planner runs in stages: generate approaches, pick the best one, expand it into phases, review it, revise if needed, then do a final review. Only after that does execution start. Each phase is checked for readiness, executed on its own branch, validated, and then either marked done, put back to sleep, or flagged for human review if the agent cannot honestly verify that the repo is ready.

That last part matters. I did not want the loop hammering the same half-ready migration over and over like a confused robot. So migrations have wake-up rules and a hard cooldown. If a phase is not ready, it backs off. If readiness is fundamentally unverifiable, it gets kicked to a human instead of bluffing.
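The phase lifecycle and the backoff can be sketched together. The three outcomes come from the description above; the readiness strings, function names, and cooldown numbers are made up for illustration:

```python
from enum import Enum
from typing import Callable

class PhaseOutcome(Enum):
    DONE = "done"
    ASLEEP = "asleep"              # not ready yet; retry after the cooldown
    NEEDS_HUMAN = "needs-human"    # readiness cannot be honestly verified

def attempt_phase(ready: Callable[[], str],
                  execute: Callable[[], None],
                  validate: Callable[[], bool]) -> PhaseOutcome:
    """One attempt at a phase: check readiness, run it, validate the result."""
    readiness = ready()            # "ready" | "not-ready" | "unverifiable"
    if readiness == "unverifiable":
        return PhaseOutcome.NEEDS_HUMAN   # kick to a human instead of bluffing
    if readiness != "ready":
        return PhaseOutcome.ASLEEP
    execute()                      # the phase runs on its own branch
    return PhaseOutcome.DONE if validate() else PhaseOutcome.ASLEEP

def next_wake(last_attempt: float, failures: int,
              base: float = 600.0, cap: float = 6 * 3600.0) -> float:
    """Exponential backoff between attempts, capped by a hard cooldown
    (the specific numbers here are invented)."""
    return last_attempt + min(base * (2 ** failures), cap)
```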

I also kept the larger-refactoring path behind an explicit init --live-migrations-dir ... switch. That was a deliberate line in the sand. I wanted bigger staged work, but I did not want the original one-shot cleanup path to get worse just because the tool grew. If you do not enable migrations, the small-loop behavior stays as-is. There is even a regression test in the repo for exactly that.

The rollout for this feature was pretty literal. First came the migration manifest and opt-in flag. Then the classifier and the six-stage planner. Then wake-up rules and phase execution. Then the human review commands. I liked building it that way because each slice had a clear job, and because it forced me to keep the original janitor loop intact while the bigger system was being bolted on.

In hindsight, that is the real shape of the project. It started as a tiny test-gated cleanup loop. Then it learned taste, because cleanup quality is subjective. Then it learned how to upgrade that taste without being annoying. Then it learned to admit when a job was too big for one pass and needed a plan. Another case for iteratively growing your product.

I like tools that do one useful thing continuously instead of one impressive thing once. That is what this is for me. It turns "I should really clean this repo up sometime" into a background habit. It also turns otherwise-lapsing model credits into actual maintenance work, which is a nice bonus and a much better story than letting them evaporate unused.

Mostly, though, I built it because I wanted software to get tidier over time instead of only during bursts of guilt.