Building Wrought with Wrought

73 Sessions of Building an AI Tool With AI — What Actually Worked

73 sessions building an AI tool with AI. What worked, what didn't, and why methodology matters more than capability.


Johan Genis

FluxForge AI

Tags: wrought, ai-engineering, build-in-public, methodology, claude-code

I have spent 73 sessions building a software engineering tool using AI — and I used the tool itself for every session. Here is what I learned.

The tool is called Wrought. It is a structured engineering process for AI coding assistants. Think of it as an engineering runbook that your AI assistant actually follows: pipelines for bug investigation, design analysis, implementation, and code review, all producing documentation that builds your project’s institutional memory.

The unusual part is not the product. It is the method. Every feature, every bug fix, every architectural decision in Wrought was built using Wrought’s own process. Dogfooding at its most literal. The tool that enforces design-before-code was designed before it was coded. The skill that generates findings trackers was tracked in a findings tracker. The code review system reviewed itself.

73 sessions. 174 commits. 44 days from first commit to this post. Here is what the numbers say about building with AI — and what they leave out.


The Numbers

Before the lessons, the raw data. These are not estimates; they are counted from the Git history and file system.

Metric                                     Count
----------------------------------------   --------
Sessions (Claude Code conversations)       73
Git commits                                174
Findings trackers created                  31
Design documents                           47
Blueprints                                 43
Research reports                           28
Code reviews (4-agent parallel review)     13
Investigation reports                      13
Implementation prompts                     56
Plans                                      9
Analysis reports                           5
Lines of Python (source)                   ~2,000
Lines of Python (tests)                    ~2,400
Skills (structured AI workflows)           12
Files changed since first commit           526
Lines inserted                             126,000+
Calendar days (Jan 26 to Mar 26)           60

A few things jump out immediately. There are more lines of test code (~2,400) than production code (~2,000). There are 47 design documents for a codebase of roughly 2,000 source lines. And 126,000+ lines of insertions for a ~2,000-line Python CLI means the overwhelming majority of the project is documentation, methodology artifacts, and process records, not source code.

That ratio is the story.


Three Things That Worked

1. Cross-Session Memory Via Structured Artifacts

This is the single most valuable pattern I discovered.

AI coding assistants have a fundamental problem: they forget everything between sessions. Claude Code has auto-memory and CLAUDE.md, which help, but they are lossy. They capture vibes and preferences, not the state of a six-step pipeline with four open findings across three trackers.

The pattern that solved this is what I call the Findings Tracker. It is a markdown file — nothing fancy — that tracks every significant piece of work through a structured lifecycle:

Open -> Investigating/Designing -> Blueprint Ready -> Planned -> Implementing -> Resolved -> Verified

Each tracker has a dependency map, resolution tasks with checkboxes, lifecycle timestamps, and links to every artifact produced along the way. When a new session starts, the AI reads the tracker and knows exactly where work was interrupted, what has been tried, and what comes next.
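The resume step is mechanical enough to sketch in a few lines. This is not Wrought's actual parser; the `Stage:` line and the checkbox syntax are assumptions about what a tracker file might contain, made for illustration:

```python
import re

# Lifecycle stages in pipeline order, as described above.
STAGES = [
    "Open", "Investigating", "Designing", "Blueprint Ready",
    "Planned", "Implementing", "Resolved", "Verified",
]

def current_stage(tracker_md: str) -> str:
    """Return the lifecycle stage recorded in a tracker file.

    Assumes a hypothetical 'Stage: <name>' line; real trackers may differ.
    """
    match = re.search(r"^Stage:\s*(.+)$", tracker_md, re.MULTILINE)
    if not match:
        return "Open"  # no stage recorded yet
    stage = match.group(1).strip()
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return stage

def open_tasks(tracker_md: str) -> list[str]:
    """Collect unchecked '- [ ]' resolution tasks so a new session knows what is next."""
    return re.findall(r"^- \[ \] (.+)$", tracker_md, re.MULTILINE)
```

The point is not the parsing; it is that a plain file with a fixed shape is enough to turn a stateless assistant into one that resumes mid-pipeline.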

Here is a real example. The “Context Compaction Resilience” tracker covered a problem where Claude Code’s auto-compaction would destroy in-flight state during long sessions. It spawned 5 sub-findings across a 5-layer defense architecture:

  • F1: No compact instructions in CLAUDE.md (solved: added a section the compactor reads)
  • F2: Context percentage data was siloed in the display (solved: bridged to a file)
  • F3: No last-chance backup before compaction (solved: PreCompact hook)
  • F4: No automated context threshold alerts (solved: Stop hook at 70% warn, 80% block)
  • F5: Context calculation was inaccurate (solved: fixed the math)
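The F4 threshold logic is the simplest layer to show. This is a sketch only: the bridge file name and JSON shape are assumptions for illustration (the real wiring depends on Claude Code's hook interface and on the F2 bridge described above):

```python
import json
from pathlib import Path

WARN_AT = 70   # percent: nudge the session to wrap up and checkpoint
BLOCK_AT = 80  # percent: refuse to start new long-running work

def check_context(bridge_file: Path) -> str:
    """Classify context usage read from a bridge file written by the display layer.

    Hypothetical layout: the file holds JSON like {"context_pct": 75}.
    """
    pct = json.loads(bridge_file.read_text())["context_pct"]
    if pct >= BLOCK_AT:
        return "block"
    if pct >= WARN_AT:
        return "warn"
    return "ok"
```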

This work spanned 4 sessions. Without the tracker, each session would have started from scratch, rediscovering what had been tried. With it, every session picked up exactly where the last one left off.

I have 31 of these trackers. They are the project’s institutional memory. Not AI-generated summaries — structured records with dependency maps, resolution tasks, and lifecycle stages.

2. Design-First Pipeline (Even When Code Is Cheap)

The most counterintuitive thing about building with AI: the faster code generation gets, the more design matters.

When generating code costs effectively nothing, the temptation is to skip analysis and start implementing. In the first few sessions, that is exactly what happened. And it produced mediocre results. The AI would generate code that worked but was architecturally questionable, or that solved the wrong problem, or that solved the right problem in a way that made the next feature harder to build.

The pipeline that emerged — and that Wrought now enforces — is:

/research -> /design -> /blueprint -> /wrought-implement -> /forge-review

Every feature starts with research (what exists, what are the constraints). Then a design analysis that evaluates multiple options with a structured tradeoff matrix. Then a blueprint with exact file specifications and acceptance criteria. Only then does implementation begin.

Here is what this looks like in practice. When I needed to set up the development environment (Session 1), the /design step evaluated 4 options:

  • Option A: Single .venv + uv dependency groups (scored 97/105)
  • Option B: Multiple virtual environments (scored 68/105)
  • Option C: Docker-only development (scored 51/105)
  • Option D: System Python + pip (scored 41/105)

Each option was scored across 7 weighted criteria. The analysis took maybe 10 minutes. The design document is still the reference I consult when questions about the dev setup arise.
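The scoring mechanics are simple. Here is a sketch with hypothetical criteria and weights (the 105-point ceiling suggests something like weights summing to 21 with each criterion rated 1 to 5; the real matrix lives in the design document):

```python
# Hypothetical criteria and weights, chosen so the maximum score is 105
# (weights sum to 21; each criterion is rated 1-5; 21 * 5 = 105).
WEIGHTS = {
    "simplicity": 4, "reproducibility": 4, "speed": 3,
    "isolation": 3, "tooling_fit": 3, "ci_parity": 2, "onboarding": 2,
}

def score(option: dict[str, int]) -> int:
    """Weighted sum of 1-5 ratings across all criteria."""
    return sum(WEIGHTS[c] * rating for c, rating in option.items())

def rank(options: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Return options sorted best-first by weighted score."""
    return sorted(((name, score(o)) for name, o in options.items()),
                  key=lambda t: t[1], reverse=True)
```

The value is not the arithmetic; it is that the weights and ratings are written down, so the decision can be explained and revisited later.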

Compare that to the alternative: asking AI to “set up a dev environment” and getting whatever the model’s default recommendation happens to be that day. That might work once. It does not produce decisions you can explain or revisit 6 months later.

47 design documents later, the pattern has proven itself. Design analysis is cheap with AI assistance. Rework from skipping it is expensive.

3. Self-Referential Testing (The Tool Reviews Itself)

The most powerful quality mechanism was not unit tests (though there are 240+ of those). It was using the tool on itself.

Wrought’s /forge-review skill runs 4 parallel AI subagents — each specialized in a different dimension of code quality: algorithmic complexity, data structure selection, paradigm consistency, and computational efficiency. When this skill was built, it was immediately used to review its own codebase.

The results were humbling. The review found:

  • cli.py had grown to 936 lines with 6+ separate concerns (module cohesion violation)
  • update_index was doing O(n) linear scans for upsert operations
  • A module-level constant (DOCS_DIRS) was a mutable list — a classic Python footgun
  • A marker template was duplicated between cmd_init and cmd_upgrade
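The shape of the first and third fixes is worth showing. This is not Wrought's actual code, just a minimal sketch of the pattern: keying the index by path turns an O(n) scan into an O(1) upsert, and a tuple constant cannot be mutated by accident (the `DOCS_DIRS` values here are made up):

```python
# Before (sketch): each upsert does an O(n) linear scan over a list.
def upsert_linear(index: list[dict], entry: dict) -> None:
    for i, existing in enumerate(index):
        if existing["path"] == entry["path"]:
            index[i] = entry  # replace the matching record
            return
    index.append(entry)

# After (sketch): keying the index by path makes upsert O(1).
def upsert_keyed(index: dict[str, dict], entry: dict) -> None:
    index[entry["path"]] = entry

# The mutable-constant footgun: a tuple rejects accidental in-place edits.
DOCS_DIRS = ("research", "designs", "blueprints")  # was a mutable list
```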

All four findings were tracked, designed, blueprinted, planned, implemented, and verified through the pipeline. The code review system found debt in the codebase, and the pipeline system fixed it. Self-referential quality assurance.

The same pattern applied throughout development. The workflow enforcement engine that prevents skipping pipeline steps? It was built after the AI skipped a pipeline step in Session 41. The context compaction defense system? Built after auto-compaction destroyed an in-flight session. Every failure became a finding, every finding became a fix, and every fix was tested by continuing to use the tool.

This is not just dogfooding. It is a continuous quality feedback loop where the product’s own methodology catches and corrects its own defects.


Three Things That Did Not Work

1. Building for 73 Sessions Without a Single User

This is the hardest thing to write, because it is the most important lesson.

At Session 72, a competitive landscape analysis revealed that the market had shifted significantly during those 44 days of heads-down building. An open-source project in the same space had accumulated 50,000+ GitHub stars and 1,234 community-contributed skills. Anthropic had shipped native features (Agent Teams, Tasks, code review) that overlapped with planned Wrought capabilities. The market window had compressed from an estimated 12-18 months to 6-9 months.

And Wrought had zero users. Zero revenue. Zero external validation.

The numbers tell the story. 31 findings trackers, 47 design documents, 43 blueprints — all focused inward. A sophisticated methodology producing sophisticated artifacts about a tool that no one outside the project had touched.

The fix came late but came clearly: stop perfecting internals, start external validation. Consulting as a bridge to revenue. Content marketing (including this post) as a bridge to users. Plugin distribution as a bridge to developers. All three should have started by Session 20, not Session 72.

2. Over-Engineering the Internals

The pipeline skip enforcement saga is instructive.

In Session 41, the AI agent skipped the /plan step after /blueprint. A legitimate process violation. The response? A 3-layer defense-in-depth architecture: skill language hardening, CLAUDE.md rule tightening, and a code-enforced stage gate.

This was implemented, verified, and shipped. In Session 46, the agent added editorial commentary suggesting a skip might be acceptable. So the rules were further hardened with an explicit “no commentary” clause.

In Session 62, the agent used EnterPlanMode directly instead of the reactive pipeline. So a new rule (Rule 8) was added to CLAUDE.md, creating what was by then an 8-rule governance framework just for pipeline adherence.

Three sessions of engineering, three findings tracked, three design documents — all to prevent an AI from occasionally suggesting a shortcut. The enforcement worked, but the effort was disproportionate. A simpler rule with a simpler enforcement mechanism would have captured 90% of the value at 20% of the cost.

When your tool is good at tracking and fixing things, everything looks like something to track and fix. Not everything deserves the full pipeline.

3. Underestimating Distribution

The product thesis — that AI coding tools need structured discipline, not just more features — is, I believe, correct. The execution thesis — that building a great tool is sufficient for distribution — was wrong.

Having 12 skills, 240+ tests, a workflow enforcement engine, a 4-agent parallel code review system, and a self-documenting methodology means nothing if the tool is not where developers look for tools. It is not on any marketplace. It is not a plugin. It has no content presence. The website exists but has no organic traffic.

The Claude Code ecosystem has a plugin format. Anthropic has a marketplace (or will soon). Dev.to, Reddit, and Hacker News have active communities discussing exactly the problems Wrought addresses. None of these channels had been touched before Session 70.

Distribution is not a post-launch activity. It is a pre-launch requirement. If I were starting over, the first 10 sessions would include publishing content and engaging with the community — even before the tool was ready. Feedback from real developers would have shaped the product better than 47 design documents written in isolation.


The Methodology: A Brief Overview

For those interested in the process itself, here is how Wrought’s pipeline works. You can adopt this approach with or without the tool.

Every significant task starts with a Finding. A finding is a gap, defect, or drift — something that needs attention. It gets logged in a Findings Tracker with severity, type, and a proposed resolution path.

Two pipelines handle two types of work:

  • Reactive (something is broken): Incident -> Investigate -> RCA/Bugfix -> Implement Fix -> Code Review
  • Proactive (something needs building): Research -> Design -> Blueprint -> Implement -> Code Review

Design before code, always. The /design step produces a structured analysis with multiple options, weighted criteria, and a recommendation. It takes 10-30 minutes with AI assistance and saves hours of rework.

Blueprints specify acceptance criteria. Before implementation begins, there is a document listing exact file changes, acceptance criteria, and test expectations. The AI implements against this spec, not against a vague prompt.

Structured artifacts are cross-session memory. Every step produces a dated, typed document in a known location. When a new session starts, the AI can read these artifacts and resume exactly where work stopped.
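Discoverability is what makes this work, and dated, typed filenames make it nearly free. A sketch, assuming a hypothetical `<type>/<YYYY-MM-DD>-<slug>.md` layout rather than Wrought's actual one:

```python
from pathlib import Path

def latest_artifacts(docs_root: Path) -> dict[str, Path]:
    """Map each artifact type (subdirectory name) to its newest file.

    Because filenames start with an ISO date, lexical order is
    chronological order, so the last file seen per type wins.
    """
    latest: dict[str, Path] = {}
    for f in sorted(docs_root.glob("*/*.md")):
        latest[f.parent.name] = f
    return latest
```

A resuming session can call this once and know exactly which research report, design, and blueprint are current.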

Code review closes the loop. After implementation, 4 specialized AI subagents review the changes for complexity, data structure choices, paradigm consistency, and efficiency.

A typical feature takes three sessions and 3-6 hours: finding and research, then design and blueprint, then implementation and review. Each step produces a dated artifact. The methodology works. The artifacts prove it.


What Happens Next

Three things are now happening in parallel.

Consulting. The methodology that built Wrought is now offered as a service: production bug fixes with root cause analysis, feature architecture and implementation, codebase reviews, and Claude Code workflow setup. The pipeline works on any codebase, not just Wrought. This is live now, with active proposals on Upwork.

Plugin distribution. Wrought’s 12 skills, 4 review agents, and 3 hooks are now available as an open-source Claude Code plugin. Install with claude plugin install fluxforgeai/wrought-plugin — MIT licensed, zero configuration.

Content. This post is the first. The plan is one piece per week, drawing from the 200+ artifacts already produced during development. Topics include cross-session memory patterns, design-first methodology for AI-assisted development, structured incident response workflows, and an honest retrospective on solo founder mistakes.

Wrought V1.0 is a local CLI tool. V2.0 will be an MCP server — a hosted service that any AI coding assistant can connect to over HTTP. The architecture is designed, the skills are built, and the distribution strategy is now in motion.

If the methodology interests you, there are two paths:

  1. Try the approach. The Findings Tracker pattern and design-first pipeline work with any AI coding tool. Start by creating a markdown file that tracks your significant tasks through structured stages. You do not need Wrought to benefit from the process.

  2. Follow the build. I will be publishing weekly at fluxforge.ai/blog and cross-posting to Dev.to and LinkedIn. The next piece covers cross-session memory in detail — the specific patterns that make AI assistants useful across long-running projects.

The 73 sessions taught me that AI does not replace engineering discipline. It amplifies whatever discipline you bring. Bring structure, get structured results. Bring chaos, get faster chaos.

Wrought is the structure I built. Now it is time to find out if anyone else needs it.


Johan Genis is the founder of FluxForge AI and the creator of Wrought, a structured engineering toolkit for AI coding assistants. He has been building software for over a decade and has spent the last two months building a tool that enforces the engineering discipline he wishes every AI coding session had. He is based in South Africa and available for consulting on AI-assisted development workflows, production debugging, and codebase reviews. Find him on LinkedIn or through FluxForge AI.
