Software Factories and the Agentic Moment

The StrongDM AI team built a Software Factory: non-interactive development where specs and scenarios drive agents that write code, run harnesses, and converge without human review. The foundational constraints are simple and radical. Code must not be written by humans. Code must not be reviewed by humans.

There is a practical litmus test: if you have not spent at least one thousand dollars on tokens today per human engineer, your software factory has room for improvement.

On July 14th, 2025, Jay Taylor and Navan Chauhan joined Justin McCarthy in founding the StrongDM AI team. The catalyst was a transition observed in late 2024: with the second revision of Claude 3.5, long-horizon agentic coding workflows began to compound correctness rather than error.

Prior to this model improvement, iterative application of LLMs to coding tasks would accumulate errors of all imaginable varieties — misunderstandings, hallucinations, syntax errors, DRY violations, library incompatibility. The app or product would decay and ultimately collapse: death by a thousand cuts. But with the updated model, something changed. Together with Cursor's YOLO mode, it provided the first glimmer of what the team now refers to as non-interactive development, or grown software.

In the first hour of the first day, the AI team established a charter. In retrospect, the most important line was a simple question: how far could we get without writing any code by hand?

Not very far — at least, not until they added tests. However, the agent, obsessed with the immediate task, soon began to take shortcuts. Return true is a great way to pass narrowly written tests, but it probably will not generalize to the software you actually want. Tests were not enough. Integration tests? Regression tests? End-to-end tests? Behavior tests? None of them solved the core problem.

The team needed new language. The word "test" proved insufficient and ambiguous. A test stored in the codebase can be lazily rewritten to match the code. The code could be rewritten to trivially pass the test. So they repurposed the word "scenario" to represent an end-to-end user story, often stored outside the codebase — similar to a holdout set in model training — which could be intuitively understood and flexibly validated by an LLM.

Because much of the software they grow itself has an agentic component, they transitioned from boolean definitions of success — the test suite is green — to a probabilistic and empirical one. They use the term "satisfaction" to quantify this validation: of all the observed trajectories through all the scenarios, what fraction of them likely satisfy the user?

To validate at scale without production constraints, the team built what they call the Digital Twin Universe: behavioral clones of the third-party services their software depends on. They built twins of Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets, replicating their APIs, edge cases, and observable behaviors.

With the Digital Twin Universe, they can validate at volumes and rates far exceeding production limits. They can test failure modes that would be dangerous or impossible against live services. They can run thousands of scenarios per hour without hitting rate limits, triggering abuse detection, or accumulating API costs.

This illustrates one of the many ways the agentic moment has profoundly changed the economics of software. Creating a high-fidelity clone of a significant SaaS application was always possible, but never economically feasible. Generations of engineers may have wanted a full in-memory replica of their CRM to test against, but self-censored the proposal to build it. They did not even bring it to their manager, because they knew the answer would be no.

Those building software factories must practice a deliberate naivete: finding and removing the habits, conventions, and constraints of the old way of building software. The Digital Twin Universe is proof that what was unthinkable six months ago is now routine.

← Back to all articles