13,671 Lines Deleted
The problem was hallucinated
Last week I deleted 13,671 lines of code. It wasn’t dead code—it ran every day. Every function was tested. The whole thing worked perfectly. The problem it solved? Hallucinated.
This is the story of how I spent 77 hours over several months building elaborate tooling to solve a problem that didn’t exist. All that with two of the best AI coding assistants available, reviewing each other’s work, and neither of them asking the one question that mattered. Neither did the two other AI tools explicitly assigned to review every change.
What I was trying to do
I maintain an open-source app called Lotti. Part of the release process involves submitting builds to Flathub, the Linux app store. There’s existing tooling for this, flatpak-flutter, which handles most of the work and only needs some extra config.
When I first submitted to Flathub, the review process surfaced issues with my manifest files. Rather than understanding why those issues existed and fixing them at the source, I jumped into AI-assisted problem-solving. Claude Code was helpful and reassuring. It didn’t understand the Flathub build process any better than I did, but it was very confident in its suggestions.
So we built workarounds. First a shell script. Then, because the shell script was getting unwieldy at over 1,500 lines of code, a full Python package. Tests. Type hints. Linting. Cyclomatic complexity checks. The works.
The tooling ran on every release. It modified manifest files during the build to fix issues that the reviewers had flagged. It had fallbacks for edge cases. It was, by any conventional measure, well-engineered code.
It was also completely unnecessary.
The hallucinated problem
Here’s what actually happened: flatpak-flutter does its job correctly. The issues in my manifests existed because of problems in my source configuration—a missing dependency declaration, some incorrect paths. The right fix was always to correct those source files and let flatpak-flutter do its thing.
Instead, I built 13,671 lines of tooling to patch the symptoms at build time. Every. Single. Release.
Think of it like this: you’re on a Porsche assembly line and you run out of the right wheels. The sensible thing to do is stop the line and get the right wheels. Instead, my AI assistants found some Volkswagen Beetle wheels in the corner and said “we can make these fit.”
So we drilled new holes. Added adapters. Machined custom spacers. Built an entire ISO 9001-certified quality process for mounting the wrong wheels perfectly. Tests for hole alignment. Checks for adapter torque. Documentation for the whole procedure.
Nobody asked: “Should we be doing this at all?”
That should have been me. It wasn’t.
How it stayed hidden
The tooling worked. That’s the insidious part. For over 20 releases, the script ran smoothly. CI was green. Builds shipped. The Beetle wheels, improbably, held.
It wasn’t one big effort where I might have noticed the absurdity. It was 77 hours spread across months. A few hours here, a few hours there. Each session felt small. “Oh, the reviewer flagged this. Let’s add handling for it.” The AI made it cognitively cheap to keep going without questioning the foundation.
I had Claude Code and Codex, both on maxed-out plans, often reviewing each other’s work. On top of that, every PR ran through Gemini and CodeRabbit for automated review. Four AI systems total.
I usually had Codex cosplaying as a senior engineer or architect, doing code reviews not as a one-off but as standard practice. The reviews were thorough. I asked for good engineering practices, modularity, testability.
Claude added cyclomatic complexity analysis and automated quality gates that failed the build if any single function got too complex. All reviewed and rubber-stamped by the others. The code improved with every review. Cleaner, more modular, better tested.
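(An aside, for anyone who hasn’t built one of these gates: it’s a surprisingly small amount of code, which is part of why it was so easy to keep bolting things on. Here’s a rough sketch of the idea in plain Python, using only the standard library’s ast module. It’s illustrative, not the code that got deleted: approximate each function’s cyclomatic complexity by counting branch points and fail CI above a threshold.)

```python
# Rough sketch of a cyclomatic-complexity gate (illustrative, not the deleted code):
# approximate McCabe complexity per function with the stdlib ast module and
# fail the build when any function crosses a threshold.
import ast
import sys

# Branch points we count; a real tool like radon or flake8's mccabe plugin
# is more precise about this.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)

def complexity(func: ast.AST) -> int:
    """1 + number of branch points inside the function (rough McCabe estimate)."""
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(func))

def check_file(path: str, threshold: int = 10) -> list[str]:
    with open(path, encoding="utf-8") as fh:
        tree = ast.parse(fh.read(), filename=path)
    return [
        f"{path}:{node.lineno} {node.name}() has complexity {complexity(node)}"
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and complexity(node) > threshold
    ]

if __name__ == "__main__":
    offenders = [msg for path in sys.argv[1:] for msg in check_file(path)]
    print("\n".join(offenders))
    sys.exit(1 if offenders else 0)  # non-zero exit code fails the CI job
```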
Not once did any of the four ask: “Why does this tooling exist? Isn’t flatpak-flutter supposed to handle this?”
And somewhere, underneath all those green CI checks, I knew something was off. I dreaded touching this code. Every time I had to go back in, there was this low-grade anxiety. I’d never fully understood what it was doing. I’d supervised the construction, but I hadn’t understood it.
Worse: part of me didn’t want to understand. Because if I really dug in, I might not like what I found. What if the whole thing was unnecessary or at least substantially over-engineered? Easier not to ask.
The ship had sailed on that. The codebase had grown so complex that really understanding it would mean hours of debugging, maybe days. The alternative was to keep going, let the AI add more code, patch whatever broke.
The line count wasn’t a surprise. I knew exactly how big it had become.
The moment it broke
A few weeks ago, flatpak-flutter released version 0.11 with a restructured output directory. My elaborate tooling, which depended on specific paths, broke completely.
I stared at the error logs. Here we go again. Except this time, it wasn’t a small patch. This was a massive rewrite, with the same uncomfortable choice I’d been making for months: let the AI figure it out, add more code I wouldn’t fully understand, and push the reckoning further down the road.
That’s when something shifted. Maybe it was the scale of the fix. Maybe I was just tired of the dread. I finally asked the question I should have asked months earlier: “Wait a second, why are we doing any of this?”
I went back to basics. Read the flatpak-flutter documentation. Actually read it this time, instead of having an AI summarize it for me. And there it was. The thing I’d been avoiding. flatpak-flutter does exactly what I needed. The issues in my manifests? They came from problems in my source configuration. A missing dependency. Some incorrect paths. Fixable in minutes.
I sat with that for a while. Months of work. 77 hours. Four AI systems. All because I hadn’t understood what I was building on top of, and hadn’t wanted to find out.
There’s a specific feeling when you realize you’ve been solving the wrong problem. It’s not quite embarrassment. It’s more like the air going out of something. All that effort, all that cleverness, and the answer was always so much simpler.
The fix? About 200 lines, plus some edits in configuration files. The 13,671 lines of manifest tooling? I deleted every single line. Here’s the PR.
Hitting that delete button? That was a celebration.
What the AIs can and can’t do
I want to be precise about this, because “AI bad” isn’t the lesson.
AI can find real issues. It can review code, catch bugs, suggest improvements.[1] What it seemingly can’t do is step back and ask whether the whole approach makes sense. It optimizes locally. It doesn’t question the premise.
That’s still our job.
The cost wasn’t money
I have subscriptions to Claude Code and Codex. Combined, around $400/month. The token cost of this misadventure was trivial.
The real cost was attention. 77 hours (I have the receipts, I tracked every task) spread thin enough that I never felt the weight of it. Time I could have spent on features, on the actual product, on anything else. Or on stepping away from the computer, how about that. It was cognitively cheap in the moment and expensive in aggregate. Life energy wasted on a hallucinated problem.
My feed is full of “engineers are obsolete” posts
Every week there’s another post declaring engineering expertise obsolete. Some CEO got told by their coding assistant that the prototype was ready to ship, and is now posting about how software engineering is solved. The AI was confident. The CEO couldn’t verify.[2] Sound familiar?
I had 20+ years of experience. I had two of the best AI coding tools available, on maxed-out plans, reviewing each other’s work. I still spent 77 hours on a hallucinated problem.
If you can’t code and you’re relying on AI to catch fundamental issues like this: it probably won’t. You won’t even know something’s wrong. And when things finally break, guess whose job it’ll be to untangle the mess? The same “obsolete” engineers.
Funny how that works.
The lesson
The questions you’d ask a junior developer don’t stop being relevant just because the code comes from an AI that can implement a B-tree from scratch:
Why are we doing this?
What problem does this solve?
Is this necessary?
Have we verified our assumptions about how the underlying system works?
I got lulled. The AI was impressive, so I stopped asking basic questions. I let confidence substitute for understanding.
I’m bringing my skepticism back.
Cheers,
Matthias
---
The project that survived this detour: Lotti on GitHub. It’s an open-source, privacy-first personal assistant. The wheel-drilling is over. Now the Flathub build and release process just works.
[1] During this same period, I had AI-generated code that was running slowly. Claude reviewed it and rubber-stamped it. I pushed back: “Something feels off.” Claude pointed to non-issues. Then Codex (which has become noticeably more terse lately, almost grumpy) found the actual problem: N+1 database fetches where two batched queries would do. It was right. The tools *can* catch real bugs. They just can’t seem to ask “should this exist?”
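For anyone who hasn’t run into this pattern before, here’s the N+1 shape in miniature, with a made-up SQLite schema rather than anything from Lotti: one query for the parent rows, then one more query per row, versus a single batched second query that returns the same data.

```python
# Made-up illustration of the N+1 query shape (not Lotti's actual schema or code).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags (entry_id INTEGER, name TEXT);
    INSERT INTO entries VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO tags VALUES (1, 'work'), (1, 'urgent'), (3, 'idea');
""")

entries = conn.execute("SELECT id, title FROM entries").fetchall()

# N+1: one query for the entries, then one more query *per entry*.
tags_slow = {
    entry_id: [row[0] for row in conn.execute(
        "SELECT name FROM tags WHERE entry_id = ? ORDER BY name", (entry_id,))]
    for entry_id, _title in entries
}

# Batched: a single second query covering all entries at once.
ids = [entry_id for entry_id, _title in entries]
placeholders = ",".join("?" * len(ids))
tags_fast = {entry_id: [] for entry_id in ids}
for entry_id, name in conn.execute(
        f"SELECT entry_id, name FROM tags WHERE entry_id IN ({placeholders}) "
        "ORDER BY name", ids):
    tags_fast[entry_id].append(name)

assert tags_slow == tags_fast  # same result, 2 queries instead of N+1
```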
[2] This overconfidence problem isn’t hypothetical. In my 1,000+ hours supervising AI coding assistants in 2025, it’s been one of the most persistent annoyances. “Everything is working perfectly,” without having run the tests. Glossing over explicit instructions to only report success when tests are green. It gets worse as context grows after hours of back-and-forth. The AI gets confident; you get tired, or you never knew what to look for in the first place. Either way, things slip through.


Update: I re-ran the numbers. It’s not 95 hours after all; some tasks were incorrectly labeled, so it’s “only” 77 hours. Still, time I most certainly could have spent better. But then again, good learning here, on a rather low-stakes issue, except for the time sink.