In April this year I published Coding agent dance-off. There, I compared several coding agents in search of one that could migrate the UI template I use for my data science product builds from MUI to shadcn/ui. At the time, the results were mixed, with Aider taking the lead with an almost-there result. I did note, though, that I had not been able to test Claude Code, since I did not have a subscription and could not use an API key at the time. That changed a couple of months ago, but I had not given it a try since.
Recently, Claude Code 2.0 was released in conjunction with Sonnet 4.5. I thought it was the ideal time to give it another go.
Spoiler alert: Claude Code, with a little help, succeeded. I started by writing a rather involved prompt to help Claude Code on its way. For example, I outlined that it should first use Playwright over MCP to take screenshots of all pages, so that once it completed the refactoring, it could do the same and validate that basic functionality and layout primitives had not changed too much. It did that well.
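What that screenshot pass boils down to, expressed as a plain Playwright script rather than the MCP tool calls Claude Code actually issued (the routes, port, and output paths are illustrative, not my template's):

```ts
// take-screenshots.ts: hedged sketch of the before/after screenshot pass;
// run once before the migration and once after, then compare the two folders.
import { chromium } from "playwright";

const routes = ["/", "/dashboard", "/settings"]; // hypothetical routes
const label = process.argv[2] ?? "before";       // "before" or "after"

async function main() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  for (const route of routes) {
    await page.goto(`http://localhost:5173${route}`); // assumed Vite dev server
    const name = route === "/" ? "home" : route.slice(1);
    await page.screenshot({ path: `screenshots/${label}/${name}.png`, fullPage: true });
  }
  await browser.close();
}

main();
```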
Claude Code laid out a plan and did a great job there. It identified nearly all places where changes were required, as well as which dependencies to replace. It implemented the changes with complete autonomy. Then it went ahead and took a few more screenshots, declaring victory.
However, when I looked at the screenshots and started the UI, I could see that no CSS was applied at all. It was not the first time I had seen this when bootstrapping a new frontend codebase; it relates to the release of Tailwind CSS 4, as described here, and is plausibly explained by the knowledge cutoff predating its wide adoption. Anthropic states a "reliable knowledge cutoff" of January 2025, which coincides with the Tailwind 4 release. At this point one can either intervene manually or prompt the agent to downgrade to the latest Tailwind 3 version. Having "fixed" the above, I followed up with a few prompts to, for example, center the logo as it was before and to replace react-toastify with sonner.
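For context: a model with a pre-2025 cutoff tends to scaffold the Tailwind 3 wiring (a tailwind.config.js, the old PostCSS plugin, and @tailwind directives in the stylesheet), which leaves you with unstyled pages against a Tailwind 4 install. A minimal sketch of the v4 setup, assuming a Vite + React project (the build tool is my assumption, not something stated in the template):

```ts
// vite.config.ts: minimal Tailwind CSS 4 wiring for a Vite + React app.
// Tailwind 4 ships its own Vite plugin and is configured from CSS
// (an `@import "tailwindcss";` line in the entry stylesheet) instead of
// the tailwind.config.js that a pre-2025 model tends to generate.
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react";
import tailwindcss from "@tailwindcss/vite";

export default defineConfig({
  plugins: [react(), tailwindcss()],
});
```

The alternative escape hatch is simply pinning tailwindcss to the latest 3.x release, so that the setup the model generates matches the version that is actually installed.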
Then I pushed it through a few rounds of self-review, running and fixing linting, unit tests, and Playwright tests. With a human in the loop, it handled these changes well.
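A cheap complement to the visual pass is a Playwright assertion that a stylesheet actually loaded, which is exactly the failure mode from the Tailwind mishap above. A minimal sketch, with an assumed dev-server URL:

```ts
// styles.spec.ts: fail fast if the app renders with no CSS at all,
// as happened after the Tailwind 4 mix-up described above.
import { test, expect } from "@playwright/test";

test("at least one stylesheet with rules is applied", async ({ page }) => {
  await page.goto("http://localhost:5173/"); // assumed dev-server URL
  const hasRules = await page.evaluate(() =>
    Array.from(document.styleSheets).some((sheet) => {
      try {
        return sheet.cssRules.length > 0; // cross-origin sheets throw; treat as no rules
      } catch {
        return false;
      }
    })
  );
  expect(hasRules).toBe(true);
});
```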
Claude Code did a great job executing the changes. That said, I almost expected the step change to be more pronounced.
One disappointment is that even though there is a proper CLAUDE.md, if I do not explicitly tell it to run basic quality control, it does not do so on its own.
Even when I tell it to take a screenshot and inspect the changes visually, it takes the screenshot but does not notice that things are very wrong.
I would have assumed this junior-engineer expectation would be internalized in the weights and/or the context engineering by now.
The autonomy claims are slightly overstated; for real-world purposes, it still needs a driver (or better agent configurations).
In the earlier blog post, I wrote that I expected it would primarily take foundation model improvements to finally succeed at this challenge. Because I tested Claude Code 2.0 in conjunction with Sonnet 4.5, it is hard to attribute which contributed the most, and I am not invested enough to replay the challenge with each in isolation. If I had to guess, though, I would argue the major contributors were the foundation model and the human in the loop: Sonnet 4.5 is a significantly better coding LLM than what was available back in April (Sonnet 3.7), and I have become more versed in how to squeeze the most out of it, anticipating failure modes and prompting with more opinion.