Disciplined Agentic Engineering
The last year or so has been strange. After nearly 15 years of doing this work, tools like Cursor, Claude Code, and Codex changed the experience almost overnight. Since then, I have been trying to figure out how to use them without letting them use me.
The upside is obvious: I would not want to go back. The downside is that this kind of leverage is easy to overdo.
That is exactly what I did. I became intellectually lazy. I would run into a problem and ask the LLM to fix it or build a feature with minimal input. Sometimes it worked. Often, it didn’t. I could produce a LOT of code, but somebody has to understand it, and all the time I was saving was being spent trying to comprehend what it generated. Overall, it was a mixed bag.
To top it all off, the quality of the generated code is not good. Most of the time, it is just way too complex. All in all, I felt frustrated and no longer in control of my own codebase. There were parts of the code that I had supposedly generated but had no idea how they even worked.
As Margaret-Anne Storey puts it, “technical debt lives in the code; cognitive debt lives in developers’ minds.”
Yes, yes, I know you are not supposed to care about the code or the little details because the LLMs will figure things out. But we are clearly not there yet. LLMs optimize for plausibility, not simplicity. Left unchecked, they tend to produce complexity monsters, which I will inevitably have to tackle sooner or later.
If I am not the one doing the development, it is very, very hard to get into the “mind” of the LLM and understand its intent and why it did what it did. No matter how much I review the code, I am unable to develop intuition about the edge cases and how the code “flows.” It is almost like I understand the code, but at the same time I don’t. It is a very bizarre feeling. This kind of intuition, I think, only comes when I am the one shaping the code myself.
So, at least for the foreseeable future, I would honestly rather have the LLM review my work and give me feedback.
The way I see it, there are really only two paths: either you stay in the driver’s seat, or you fully give in and shift to designing the harness, the observability, the feedback loop, and the scaffolding, and let the LLMs do the rest.
That means no human code reviews, and if there are bugs or issues, the LLMs will figure them out simply because the observability and feedback loop are there. This is akin to treating software development as a black box, kind of like how you train a neural network: you don’t really care about the individual weights (functions, classes); what matters is the end result. More on this in a future post; I do think this is the direction we are headed.
For now, though, I choose to be the pilot, and here is what that looks like.
What I Do Now
This approach sits somewhere between tab autocomplete and fully agentic LLM code generation. It is admittedly slower and a bit more work, but it gives me far more control, and I feel that trade-off is worth it.
- Planning: At the start of a task or feature, before even touching the LLM, I do my own research and make my own plan, you know, like we used to in the old days. Then I feed that plan into slower but more capable models, e.g., GPT-5.4, and ask them to critique it and find any holes. Finally, we create a refined spec document.
- Development: Afterwards, I switch to a faster model, like Codex Spark, and instead of asking the LLM to just get to work, I still stay in the pilot seat.
- Ask the LLM to create a stub for the function or class we are going to implement.
- Ask the LLM to generate a small unit test for a function or class we are about to write. At that point, I tell it exactly what I want tested. At this stage, the test will obviously fail.
- Afterwards, ask the LLM to implement the function or class and check whether the test passes. If it does, I quickly skim the code and move on to the next function or class. Over time, this unit of work might grow from a single function to a whole file, but this is where I am comfortable now.
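The three development steps above can be sketched in a single unit of work. This is a minimal, hypothetical example (the `normalize_tags` helper and its behavior are invented for illustration, not taken from any real codebase):

```python
# Step 1: ask the LLM for a stub only, no implementation yet.
def normalize_tags(tags):
    raise NotImplementedError

# Step 2: ask the LLM for a small unit test, telling it exactly what
# I want covered. Run against the stub above, this fails, which is
# the point: it pins down the intended behavior before any code exists.
def test_normalize_tags():
    assert normalize_tags([" Python", "python", "API "]) == ["api", "python"]

# Step 3: ask the LLM to fill in the implementation, then re-run the test.
def normalize_tags(tags):
    """Lowercase, strip, and de-duplicate tags, returning them sorted."""
    return sorted({t.strip().lower() for t in tags})

test_normalize_tags()  # passes now; skim the code and move to the next unit
```

The key point is the ordering: the stub and the test are mine in intent before the LLM writes a line of the implementation, so the understanding of the edge cases stays with me.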
- Coding by hand: Audible gasp. While it is not strictly necessary, I feel that doing 30-45 minute coding sessions by hand every day helps a lot against skill atrophy. If you lose the ability to code well, your ability to effectively prompt the LLM to do your bidding will also suffer. Again, this might not hold true in, say, a few months as the models improve, but for now the problem is there.
This flips the script. I still get LLM-generated code, but the understanding stays with me, and I no longer have to personally review every line of it; I can have a bunch of different agents do that instead.
The models will keep getting better. The temptation to hand over the keys will keep growing. But until the day comes when I genuinely don’t need to understand my own codebase, I am staying in the pilot seat.