[ SYSTEM_LOG ] · Architecture Notes

Spec-First AI Workflows and the Risk to Software Quality

PUBLISHED_AT :: 2026-05-24 · BY :: MOHAMAD_ALSABBAGH


Engineering Culture · Enterprise · Spec-First · Process

// TL;DR

AI coding assistants boost raw throughput, but they can silently amplify technical debt when teams skip rigorous spec-first planning. GitClear data shows the share of moved (refactored) code fell about 32% in relative terms from 2021 to 2023 (24.8% → 16.9%), while code churn was projected to nearly double by 2024 against its 2021 baseline (3.6% → 7.1%). The fix is not less AI; it is a tighter process contract designed around review load, rollback paths, and how engineers make decisions under pressure.

PR generation is no longer the slow part of software delivery. The slow part is proving that a generated change respects the owning interface, covers the failure modes, preserves the rollback path, and does not move hidden rework into the next sprint.

Historically, writing syntax forced real-time architectural reflection. AI removed that friction without replacing the thinking it used to create. The bottleneck did not disappear; it moved upstream to problem framing and specification, then downstream to review, security analysis, and defect containment.

[ DATA_VIZ ] GitClear "Coding on Copilot" - Code Quality Metrics (2021 → 2024)

Source: GitClear 2024 - Analysis of 153M+ changed lines of code. [Local PDF] · [Gwern.net Source]

DORA's 2025 research frames AI as an amplifier: it magnifies existing organizational strengths and weaknesses, and without strong platforms and workflows, local productivity gains can turn into downstream disorder.
- Paraphrased from DORA 2025: State of AI-assisted Software Development

The Bottleneck Has Shifted

Manual coding used to double as a planning phase: it was slow and deliberate enough to force architectural reflection while the code took shape. That organic slowdown is gone. Instead of investing time up front to understand scope, teams now optimize for PR volume.

The time engineers previously spent writing code was never just about the code - it was the planning session. AI removed the friction without replacing the thinking.

- Mohamad Alsabbagh

The Contract Before Code

AI-assisted work needs a contract before it needs a prompt. The contract should tell the model, the reviewer, and the future maintainer what behavior is allowed to change and what must remain stable:

Intent and non-goals: The change states the business outcome, the paths intentionally left untouched, and the user-visible behavior that must not regress.
Boundary ownership: The owning interface, data boundary, dependency contract, and rollback path are named before generation starts.
Failure-mode table: Load, concurrency, stale data, security constraints, and partial-failure behavior are converted into tests or review checks.
Debug path: Logs, metrics, traces, and ownership routing are present before the change can hide behind generated abstraction.
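The four contract items above can be captured as a small data structure that a team checks before any generation starts. This is a minimal sketch, not a prescribed format: the `ChangeContract` class, its field names, and the `missing_items` helper are all hypothetical and only mirror the items listed above.

```python
from dataclasses import dataclass, field

# Failure modes named in the contract list above.
REQUIRED_FAILURE_MODES = (
    "load", "concurrency", "stale data", "security", "partial failure",
)


@dataclass
class ChangeContract:
    """Hypothetical pre-generation contract; field names are illustrative."""
    intent: str = ""
    non_goals: list[str] = field(default_factory=list)
    owning_interface: str = ""
    rollback_path: str = ""
    # Maps each failure mode to the test or review check that covers it.
    failure_modes: dict[str, str] = field(default_factory=dict)
    # Logs, metrics, traces, and ownership routing that exist before the change.
    debug_signals: list[str] = field(default_factory=list)


def missing_items(contract: ChangeContract) -> list[str]:
    """Return the contract items that are still empty or uncovered."""
    gaps = []
    if not contract.intent.strip():
        gaps.append("intent")
    if not contract.non_goals:
        gaps.append("non-goals")
    if not contract.owning_interface.strip():
        gaps.append("owning interface")
    if not contract.rollback_path.strip():
        gaps.append("rollback path")
    for mode in REQUIRED_FAILURE_MODES:
        if not contract.failure_modes.get(mode, "").strip():
            gaps.append(f"failure mode: {mode}")
    if not contract.debug_signals:
        gaps.append("debug path")
    return gaps
```

A non-empty result from `missing_items` is a reason to keep working on the spec rather than on the prompt.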

Hobby Projects vs. Enterprise Software

The risk model changes with blast radius:

Hobby Projects: Speed is paramount. "Just make it work" is a valid strategy. If AI generates a monolithic file with duplicated code, it doesn't matter as long as the prototype ships.
Enterprise Software: Code is read and maintained far more than it is written. Enterprise software requires rigorous context, data governance, security, and long-term maintainability. AI-generated code that lacks architectural cohesion becomes immediate technical debt.

Treating enterprise software like a hobby project moves cost into review queues, incident response, security exceptions, and future migrations.

Defining a Process for the Human Brain

We cannot simply rely on engineers to "try harder" to plan. Under delivery pressure, people tend to choose the easiest available path. With autocomplete, that path was pressing Tab to accept a suggestion; with agentic tools, it is delegating an entire task before the architectural boundaries and context are clear.

The process needs hard gates, not reminders:

Spec-first workflows: Require a short technical design document, implementation spec, or ADR before a single line of AI-assisted code is generated.
Separation of generation and review: AI-generated tests should not be the only validation path. Add rigorous human-in-the-loop QA and peer review focused on security and architecture, not just functional correctness.
Refactoring budget: Reward developers for deleting code and creating reusable abstractions, rather than measuring productivity by PR volume or lines of code.

// WORKFLOW_DIFF :: AI-ONLY vs SPEC-FIRST

❌ CURRENT ANTI-PATTERN

AI-Only PR Workflow

01 Ticket assigned: No context written down
02 Jump to code generation: Prompt AI with vague intent
03 AI writes implementation: No architectural boundaries set
04 Quick manual smoke test: Edge cases largely untested
05 PR opened: Reviewer sees code for the first time
06 Merge & ship: Technical debt silently accumulates

✅ RECOMMENDED

Spec-First Agentic Workflow

01 Ticket assigned
02 Write spec / technical design doc / ADR: Goals, non-goals, alternatives, edge cases
03 Architectural review: Sign-off before any code is generated
04 AI generates implementation: Guided by precise spec and constraints
05 Human reviews security + architecture: AI output treated as untrusted draft
06 Merge: Small diff, owned rollback path, covered failure modes

Pattern informed by GitHub Spec Kit, DORA 2025, and ACM CCS 2024

The Junior and Senior Traps

This over-reliance trap affects developers at all levels, but in different ways:

Junior Developers: They are the most vulnerable. Without the foundational experience to recognize bad architectural decisions, they may accept AI suggestions blindly, missing critical edge cases. They must be guided to focus heavily on planning and review phases.
Senior Developers: Seniors have the context but can fall into the trap of over-optimizing for speed. Seduced by the velocity AI provides, they might skip deep architectural planning they know is necessary, assuming they can "fix it later" - which rarely happens.

// QUALITY_SIGNALS :: Observed Risk vs. Suggested Targets

| Metric | Observed Risk | Suggested Target |
| --- | --- | --- |
| Copy/Pasted Code (GitClear 2021 → 2024 projection) | 8.4% → 11.6% | ↓ Stable ~8–9% |
| Refactoring, moved code (GitClear 2021 → 2024 projection) | 24.8% → 13.4% | ↑ Maintained 20–25% |
| Code Churn Rate (reverted within 2 weeks) | 3.6% → 7.1% | ↓ Near baseline ~3–4% |
| Deployment Rework (follow-up fixes after release) | Untracked / rising | ↓ Tracked + declining |
| AI Trust Paradox (30% of devs distrust AI output, DORA 2025) | No review gate | ↑ Mandatory arch sign-off |

Observed risk data: GitClear 2024 (153M lines) · Targets represent quality signals to align on during planning and review
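The relative changes behind the GitClear figures can be checked with a few lines of arithmetic. The sketch below uses only the 2021, 2023, and projected 2024 percentages quoted in this article; the helper name `relative_change` is illustrative.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Percentages of changed lines, as quoted from the GitClear report.
figures = {
    "moved/refactored, 2021 -> 2023": (24.8, 16.9),
    "moved/refactored, 2021 -> 2024 projected": (24.8, 13.4),
    "copy/pasted, 2021 -> 2023": (8.4, 10.5),
    "code churn, 2021 -> 2024 projected": (3.6, 7.1),
}

for label, (before, after) in figures.items():
    print(f"{label}: {relative_change(before, after):+.1%}")

# moved/refactored, 2021 -> 2023: -31.9%
# moved/refactored, 2021 -> 2024 projected: -46.0%
# copy/pasted, 2021 -> 2023: +25.0%
# code churn, 2021 -> 2024 projected: +97.2%
```

This is where "about 32%" and "nearly double" come from: both are relative changes, not percentage-point differences.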

The Refusal Line

I would not allow AI-only PR workflows on enterprise code paths where the reviewer sees the system decision for the first time in the diff. If the change touches state, security, shared APIs, data migration, billing, identity, or rollback behavior, the spec must exist before generation. The model can accelerate implementation only after the team has fixed the contract it will be judged against.
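One way to turn this refusal line into a hard gate is a small pre-generation check on the paths a change is expected to touch. This is a sketch under stated assumptions: the path patterns, the `spec_approved` flag, and the function names are hypothetical and would need to match a real repository layout and approval process.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns standing in for the sensitive areas named above
# (state, security, shared APIs, data migration, billing, identity, rollback).
SENSITIVE_PATTERNS = (
    "*/migrations/*",
    "*/billing/*",
    "*/auth/*",
    "*/identity/*",
    "*/public_api/*",
    "*/rollback/*",
)


def sensitive_paths(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that match any sensitive pattern."""
    return [
        path for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in SENSITIVE_PATTERNS)
    ]


def gate(changed_paths: list[str], spec_approved: bool) -> tuple[bool, str]:
    """Refuse generation on sensitive paths unless a spec is already approved."""
    hits = sensitive_paths(changed_paths)
    if hits and not spec_approved:
        return False, "spec required before generation: " + ", ".join(hits)
    return True, "ok"
```

For example, `gate(["services/billing/invoice.py"], spec_approved=False)` refuses the change, while the same call with `spec_approved=True` lets generation proceed.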

// ACTION_ITEMS

What to Do Next

1. Spec-First by Default

Require a written spec before any ticket is picked up. Start with GitHub Spec Kit - it gives you CLI tools, templates, and slash commands to turn specs into the source of truth that guides AI agents.

2. AI PR Review Checkpoint

AI-generated code should not merge without architectural sign-off. Layer your review stack - pick what matches your security posture:

PR-Agent (Qodo, OSS) - 11k★ open-source. Self-hostable via GitHub Actions or Docker, supports OpenAI / Anthropic / local LLMs via Ollama. Commands: /review, /describe, /improve.
Semgrep CE - open-source SAST (LGPL 2.1), 30+ languages, and 3000+ community rules. Best for custom security rules in pre-commit hooks and CI gates.
SonarQube Community - self-hosted quality gates, tech-debt tracking, security standards. The most mature rule-based option for on-premise teams.

Rule of thumb: AI review tools (like PR-Agent) can help catch logic issues; SAST tools (Semgrep, SonarQube) can help catch security anti-patterns. Use both.

3. Track Deployment Rework

Track deployment rework alongside the classic four DORA metrics. Start with a delivery dashboard such as LinearB or Google's open-source Four Keys project, then add a custom signal for rollbacks, hotfixes, and follow-up fixes after release. A rising rework rate is an early warning sign of AI-amplified debt.
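A custom rework signal can start as a simple ratio over a deployment log. The sketch below assumes each deployment is already labeled by kind and that rework means the share of deployments that are rollbacks, hotfixes, or follow-up fixes. Both the labels and that definition are assumptions made for illustration, not DORA's formal definition.

```python
from dataclasses import dataclass

# Assumed labels for deployments that count as rework.
REWORK_KINDS = {"rollback", "hotfix", "follow_up_fix"}


@dataclass(frozen=True)
class Deployment:
    week: str  # e.g. an ISO week label such as "2026-W01"
    kind: str  # "feature" or one of REWORK_KINDS


def rework_rate_by_week(deployments: list[Deployment]) -> dict[str, float]:
    """Share of each week's deployments that were rework."""
    totals: dict[str, int] = {}
    rework: dict[str, int] = {}
    for deployment in deployments:
        totals[deployment.week] = totals.get(deployment.week, 0) + 1
        if deployment.kind in REWORK_KINDS:
            rework[deployment.week] = rework.get(deployment.week, 0) + 1
    return {week: rework.get(week, 0) / totals[week] for week in sorted(totals)}
```

Plotting this ratio week over week is usually enough to see whether rework is rising before it shows up in incident counts.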

4. Mandate a Design Doc or ADR

For any ticket touching shared infrastructure, require either a technical design document (goals, non-goals, alternatives, trade-offs) or a lightweight Architecture Decision Record (ADR) committed alongside the code. Checklist: background, scope, alternatives considered, and cross-cutting concerns like security, scalability, and monitoring.
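The checklist can be enforced mechanically if ADRs follow a predictable layout. The sketch below assumes ADRs are Markdown files whose section headings match the checklist items exactly; that layout and the `missing_sections` helper are assumptions for illustration.

```python
# Checklist items from the paragraph above, used as expected heading text.
REQUIRED_SECTIONS = (
    "background",
    "scope",
    "alternatives considered",
    "cross-cutting concerns",
)


def missing_sections(adr_text: str) -> list[str]:
    """Return the required sections that have no matching Markdown heading."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in adr_text.splitlines()
        if line.startswith("#")
    }
    return [section for section in REQUIRED_SECTIONS if section not in headings]
```

A check like this only proves the sections exist; whether they say anything useful is still a job for the architectural review.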

5. Measure Quality, Not Volume

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Replace PR-count KPIs with outcome-oriented metrics: refactoring ratio, code churn rate, and MTTR. Reward deleting code and creating reusable abstractions. Use the quality signals table as a shared review language: align on target ranges before implementation, then track whether the codebase is moving toward them.
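Tracking whether the codebase is moving toward the agreed ranges can be as simple as comparing two measurements against each target. The sketch below takes its target ranges from the quality signals table; the metric keys and the `assess` helper are hypothetical, and the measurements would have to come from whatever tooling the team already uses.

```python
# Target ranges (percent of changed lines) from the quality signals table.
TARGET_RANGES = {
    "copy_pasted": (8.0, 9.0),
    "refactoring_moved": (20.0, 25.0),
    "code_churn": (3.0, 4.0),
}


def distance_to_range(value: float, low: float, high: float) -> float:
    """How far `value` sits outside [low, high]; 0.0 when inside."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


def assess(metric: str, previous: float, current: float) -> str:
    """Classify the latest measurement relative to the agreed target range."""
    low, high = TARGET_RANGES[metric]
    if low <= current <= high:
        return "in range"
    before = distance_to_range(previous, low, high)
    after = distance_to_range(current, low, high)
    return "improving" if after < before else "not improving"
```

For instance, `assess("refactoring_moved", 24.8, 16.9)` returns "not improving", matching the 2021 to 2023 trend in the GitClear data.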

[ PRIMARY_SOURCE ] GitClear 2024 - Full Report

The GitClear report is the primary dataset behind the chart above. See the full report for the methodology and tables.


[ RESEARCH_ARCHIVE ] References

1. GitClear: "Coding on Copilot" (2024/2025) - [Local PDF] · [Gwern.net Source]

Analysis of 153M+ changed lines of code (2020–2023). Key findings: copy/pasted code rose from 8.4% → 10.5% (projected 11.6% in 2024); moved/refactored code dropped from 24.8% → 16.9% (projected 13.4% in 2024); code churn was projected to nearly double, from 3.6% in 2021 to 7.1% in 2024. GitClear argues these trends point to a risk that AI-assisted speed can come at the expense of long-term maintainability.

2. DORA Report (2025) - State of AI-assisted Software Development

AI acts as an amplifier, not a universal fix. Teams with high platform maturity and precise architectural planning are better positioned to turn AI adoption into system-level gains; teams lacking strong platforms and workflows can see local productivity gains lost to "downstream disorder." The report reveals a severe trust paradox - 30% of devs do not trust AI-generated code. Related DORA guidance also treats deployment rework rate as a useful signal alongside the classic four delivery metrics.

3. GitHub Spec Kit: Spec-Driven Development (2025)

GitHub's open-source toolkit for Spec-Driven Development (SDD), where specifications become the primary artifact that guides implementation. The engineer's role shifts from direct code writer toward strategic orchestrator - designing architecture, setting constraints, and validating AI output.

4. arXiv:2405.06371 - "Using AI Assistants in Software Development: A Qualitative Study on Security Practices and Concerns" (ACM CCS 2024)

Peer-reviewed qualitative study investigating how professionals use AI assistants in secure development. Despite widespread security concerns, developers frequently use AI for critical tasks, creating an over-reliance risk. The study underscores the necessity of human-in-the-loop validation and of treating AI suggestions as insecure drafts that require thorough review.