Spec-First AI Workflows and the Risk to Software Quality

PUBLISHED_AT :: 2026-05-24 · BY :: MOHAMAD_ALSABBAGH
Tags: AI, Engineering Culture, Enterprise, Spec-First, Process
// TL;DR

AI coding assistants boost raw throughput, but they can silently amplify technical debt when teams skip rigorous spec-first planning. GitClear data shows the share of moved (refactored) code fell from 24.8% to 16.9% between 2021 and 2023, a relative drop of about 32%, while code churn is projected to nearly double by 2024 compared with its 2021 baseline (3.6% to 7.1%). The fix isn't less AI; it's better process architecture designed around how engineers actually make decisions under pressure.


As AI coding assistants have become ubiquitous, I've noticed an alarming pattern across the industry: engineers are increasingly over-relying on AI tools to accelerate the "PR generation" phase — writing code — while drastically reducing the time spent on planning, architecture, and researching edge cases.

Historically, writing syntax was the bottleneck. But that friction also forced real-time architectural reflection. Today, AI can generate hundreds of lines in seconds, stripping away that automatic reflection period and creating a false sense of productivity. The true bottleneck hasn't disappeared; it has shifted from implementation to problem framing and specification.

[ DATA_VIZ ] GitClear "Coding on Copilot" — Code Quality Metrics (2021 → 2024)
Values: 2021 baseline (pre-AI) / 2023 actual / 2024 projected
  • Copy/Paste (code duplication): 8.4% / 10.5% / 11.6%
  • Refactoring (moved/restructured code): 24.8% / 16.9% / 13.4%
  • Code Churn (reverted within 2 weeks): 3.6% / 5.4% / 7.1%
Source: GitClear 2024 — Analysis of 153M+ changed lines of code.  [Local PDF] · [Gwern.net Source]
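
The headline percentages in the TL;DR can be reproduced from the chart values above. This is a minimal sketch of that arithmetic; the helper name `relative_change` is my own and is not part of any GitClear tooling.

```python
def relative_change(baseline: float, current: float) -> float:
    """Return the relative change from baseline to current, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


# Values copied from the chart: (2021 baseline, 2023 actual, 2024 projected).
metrics = {
    "copy_paste": (8.4, 10.5, 11.6),
    "refactoring": (24.8, 16.9, 13.4),
    "churn": (3.6, 5.4, 7.1),
}

for name, (y2021, y2023, y2024_projected) in metrics.items():
    print(
        f"{name}: 2021->2023 {relative_change(y2021, y2023):+.1f}%, "
        f"2021->2024 (projected) {relative_change(y2021, y2024_projected):+.1f}%"
    )
```

Refactoring comes out at roughly -32% for 2021 to 2023, and churn at roughly +97% for the 2024 projection against the 2021 baseline, which is where "nearly double" comes from.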

DORA's 2025 research frames AI as an amplifier: it magnifies existing organizational strengths and weaknesses, and without strong platforms and workflows, local productivity gains can turn into downstream disorder.

— Paraphrased from DORA 2025: State of AI-assisted Software Development

The Bottleneck Has Shifted

The time engineers spent writing code by hand doubled as a planning phase: because the work was slow and deliberate, design questions surfaced while typing. That organic slowdown is gone, and nothing has replaced it. Instead of investing time upfront to understand scope, teams now optimize for how many PRs they can open.

The time engineers previously spent writing code was never just about the code — it was the planning session. AI removed the friction without replacing the thinking.
— Mohamad Alsabbagh

The Foundation of an Engineer

In this era, foundational engineering skills matter even more. The most critical ones:

  1. Deep Problem Framing: Articulating intent precisely and understanding the "why" before the "how."
  2. Architectural Soundness: Understanding systems at a macro level, where AI still lacks persistent context.
  3. Validation & Edge Case Analysis: Predicting where a generated solution will fail under load, edge cases, or security constraints.
  4. First-Principles Debugging: Maintaining the ability to drill down into the compiler, network layer, or memory profile when AI abstractions leak.

Hobby Projects vs. Enterprise Software

It is crucial to draw a sharp distinction between the two:

  • Hobby Projects: Speed is paramount. "Just make it work" is a valid strategy. If AI generates a monolithic file with duplicated code, it doesn't matter as long as the prototype ships.
  • Enterprise Software: Code is read and maintained far more than it is written. Enterprise software requires rigorous context, data governance, security, and long-term maintainability. AI-generated code that lacks architectural cohesion becomes immediate technical debt.

When we treat enterprise software like a hobby project — optimizing for raw speed — we degrade the quality of our systems.

Defining a Process for the Human Brain

We cannot simply rely on engineers to "try harder" to plan. Under delivery pressure, people tend to choose the easiest available path. With autocomplete, that path is hitting "Tab" to accept a suggestion; with agentic tools, it is delegating an entire task before the architectural boundaries and context are clear.

To counteract this, we must build enterprise processes that match how our brains work:

  • Spec-First Workflows: Requiring a short technical design document, implementation spec, or ADR before a single line of AI-assisted code is generated.
  • Separation of Generation and Review: AI-generated tests should not be the only validation path. We need rigorous human-in-the-loop QA and peer reviews focused on security and architecture, not just functional correctness.
  • Incentivizing Refactoring: Rewarding developers for deleting code and creating reusable abstractions, rather than measuring productivity by PR volume or lines of code.
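
One way to make the spec-first rule enforceable rather than aspirational is a small pre-merge check. The sketch below is illustrative only: the `specs/` and `docs/adr/` directories, the `src/` prefix, and the function name `missing_spec` are hypothetical conventions, not something GitHub Spec Kit or any other tool named in this post prescribes.

```python
from typing import Iterable

# Hypothetical repository conventions (assumptions for this sketch).
SPEC_PREFIXES = ("specs/", "docs/adr/")
CODE_PREFIXES = ("src/",)


def missing_spec(changed_files: Iterable[str]) -> bool:
    """Return True when a change set touches code but no spec or ADR file."""
    files = list(changed_files)
    touches_code = any(f.startswith(CODE_PREFIXES) for f in files)
    touches_spec = any(f.startswith(SPEC_PREFIXES) for f in files)
    return touches_code and not touches_spec


if __name__ == "__main__":
    pr_files = ["src/billing/invoice.py", "src/billing/tax.py"]
    if missing_spec(pr_files):
        print("Blocked: code changed without a spec or ADR in the same PR.")
```

A check like this cannot prove the spec was written first; it only makes a skipped spec visible at review time, which is usually enough to change the default behavior.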
// WORKFLOW_DIFF :: AI-ONLY vs SPEC-FIRST
❌ CURRENT ANTI-PATTERN

AI-Only PR Workflow

  1. Ticket assigned: No context written down
  2. Jump to code generation: Prompt AI with vague intent
  3. AI writes implementation: No architectural boundaries set
  4. Quick manual smoke test: Edge cases largely untested
  5. PR opened: Reviewer sees code for the first time
  6. Merge & ship: Technical debt silently accumulates
✅ RECOMMENDED

Spec-First Agentic Workflow

  1. Ticket assigned
  2. Write spec / TDD (technical design doc) / ADR: Goals, non-goals, alternatives, edge cases
  3. Architectural review: Sign-off before any code is generated
  4. AI generates implementation: Guided by precise spec and constraints
  5. Human reviews security + arch: AI output treated as untrusted draft
  6. Merge: Maintainable, intentional codebase
Pattern informed by GitHub Spec Kit, DORA 2025, and ACM CCS 2024

The Junior and Senior Traps

This over-reliance trap affects developers at all levels, but in different ways:

  • Junior Developers: They are the most vulnerable. Without the foundational experience to recognize bad architectural decisions, they may accept AI suggestions blindly, missing critical edge cases. They must be guided to focus heavily on planning and review phases.
  • Senior Developers: Seniors have the context but can fall into the trap of over-optimizing for speed. Seduced by the velocity AI provides, they might skip deep architectural planning they know is necessary, assuming they can "fix it later" — which rarely happens.
// QUALITY_SIGNALS :: Observed Risk vs. Suggested Targets

Metric | Observed Risk | Suggested Target
Copy/Pasted Code (GitClear 2021 → 2024 projection) | 8.4% → 11.6% | Stable ~8–9%
Refactoring, moved code (GitClear 2021 → 2024 projection) | 24.8% → 13.4% | Maintained 20–25%
Code Churn Rate (reverted within 2 weeks) | 3.6% → 7.1% | Near baseline ~3–4%
Deployment Rework (follow-up fixes after release) | Untracked / rising | Tracked + declining
AI Trust Paradox (30% of devs distrust AI output, DORA 2025) | No review gate | Mandatory arch sign-off

Observed risk data: GitClear 2024 (153M lines) · Targets represent quality signals to align on during planning and review
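
The numeric targets above can be turned into an automated check. This is a rough sketch, assuming the "~" ranges are read as inclusive bounds; the names `TargetRange`, `TARGETS`, and `out_of_range` are illustrative, and only the three GitClear-derived rows are covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetRange:
    low: float   # inclusive lower bound, percent
    high: float  # inclusive upper bound, percent

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


# Target ranges taken from the table above, read as inclusive bounds (assumption).
TARGETS = {
    "copy_paste": TargetRange(8.0, 9.0),
    "refactoring": TargetRange(20.0, 25.0),
    "churn": TargetRange(3.0, 4.0),
}


def out_of_range(observed: dict[str, float]) -> list[str]:
    """Return the names of metrics whose observed value misses its target."""
    return [
        name
        for name, value in observed.items()
        if name in TARGETS and not TARGETS[name].contains(value)
    ]


# GitClear's 2024 projections from the table.
projected_2024 = {"copy_paste": 11.6, "refactoring": 13.4, "churn": 7.1}
print(out_of_range(projected_2024))  # all three metrics miss their targets
```

Run against the 2021 baseline values (8.4, 24.8, 3.6), the same check returns an empty list, which is the sense in which the targets amount to "hold the pre-AI baseline".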
// ACTION_ITEMS
What to Do Next

1. Spec-First by Default

Before implementation starts on any ticket, require a written spec. GitHub Spec Kit is a reasonable starting point: it provides a CLI, templates, and slash commands for turning specs into the source of truth that guides AI agents.

2. AI PR Review Checkpoint

AI-generated code should not merge without architectural sign-off. Layer your review stack — pick what matches your security posture:

  • PR-Agent (Qodo, OSS) — 11k★ open-source. Self-hostable via GitHub Actions or Docker, supports OpenAI / Anthropic / local LLMs via Ollama. Commands: /review, /describe, /improve.

  • Semgrep CE — open-source SAST (LGPL 2.1), 30+ languages, and 3000+ community rules. Best for custom security rules in pre-commit hooks and CI gates.

  • SonarQube Community — self-hosted quality gates, tech-debt tracking, security standards. The most mature rule-based option for on-premise teams.

Rule of thumb: AI review tools (like PR-Agent) can help catch logic issues; SAST tools (Semgrep, SonarQube) can help catch security anti-patterns. Use both.

3. Track Deployment Rework

Track deployment rework alongside the classic four DORA metrics. Start with a delivery dashboard such as the open-source Four Keys project or LinearB, then add a custom signal for rollbacks, hotfixes, and follow-up fixes after release. A rising rework rate is an early warning sign of AI-amplified debt.
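
As a starting point for that custom signal, here is a minimal sketch of a rework-rate calculation. It assumes rework rate means the share of deployments that were rollbacks, hotfixes, or follow-up fixes; DORA's exact definition may differ, and the `Deployment` type and `is_rework` flag are hypothetical. How a deployment gets labeled as rework is left to your own tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    id: str
    is_rework: bool  # True for rollbacks, hotfixes, and follow-up fixes


def rework_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that were rework, as a percentage."""
    if not deployments:
        return 0.0
    rework = sum(1 for d in deployments if d.is_rework)
    return rework / len(deployments) * 100.0


history = [
    Deployment("d1", False),
    Deployment("d2", True),   # hotfix for d1
    Deployment("d3", False),
    Deployment("d4", False),
]
print(f"rework rate: {rework_rate(history):.1f}%")  # rework rate: 25.0%
```

The absolute number matters less than the trend: compare the rate across comparable time windows rather than judging any single value.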

4. Mandate a Design Doc or ADR

For any ticket touching shared infrastructure, require either a technical design document (goals, non-goals, alternatives, trade-offs) or a lightweight Architecture Decision Record (ADR) committed alongside the code. Checklist: background, scope, alternatives considered, and cross-cutting concerns like security, scalability, and monitoring.
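
The checklist can be verified mechanically before a reviewer ever opens the document. The sketch below assumes design docs and ADRs are written in Markdown with one heading per checklist item; that format, the section names as exact heading text, and the sample ADR are all assumptions for illustration.

```python
# Checklist items from the paragraph above, expected as heading text (assumption).
REQUIRED_SECTIONS = (
    "background",
    "scope",
    "alternatives considered",
    "cross-cutting concerns",
)


def missing_sections(doc_text: str) -> list[str]:
    """Return required checklist sections with no matching heading line."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in doc_text.splitlines()
        if line.startswith("#")
    }
    return [s for s in REQUIRED_SECTIONS if s not in headings]


# Hypothetical draft ADR that stops after the first two sections.
draft = """# ADR: Move session storage to Redis
## Background
Sessions currently live in process memory.
## Scope
Auth service only.
"""
print(missing_sections(draft))  # ['alternatives considered', 'cross-cutting concerns']
```

This only checks that the sections exist, not that they say anything useful; judging the content is still the job of the architectural review.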

5. Measure Quality, Not Volume

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Replace PR-count KPIs with outcome-oriented metrics: refactoring ratio, code churn rate, and MTTR. Reward deleting code and creating reusable abstractions. Use the quality signals table as a shared review language: align on target ranges before implementation, then track whether the codebase is moving toward them.
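
To replace PR counts with a ratio-based signal, you need per-category line counts rather than totals. This is a rough sketch of the conversion; the category names and the sprint numbers are hypothetical, and how your tooling classifies a changed line as "moved" or "copy/pasted" is outside its scope.

```python
def share_of_changed_lines(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-category changed-line counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total * 100.0 for name, count in counts.items()}


# Hypothetical sprint totals; the category names are assumptions for this sketch.
sprint = {"added": 600, "moved": 200, "copy_pasted": 100, "other": 100}
shares = share_of_changed_lines(sprint)
print(f"refactoring ratio: {shares['moved']:.1f}%")       # 20.0%
print(f"copy/paste ratio: {shares['copy_pasted']:.1f}%")  # 10.0%
```

Because these are ratios, a team that ships fewer but better-structured changes is not penalized the way it would be under a PR-volume KPI.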

[ PRIMARY_SOURCE ] GitClear 2024 — Full Report

The GitClear report is the primary dataset behind the chart above; see reference 1 for the full methodology and tables.

[ RESEARCH_ARCHIVE ] References

1. GitClear: "Coding on Copilot" (2024/2025) — [Local PDF] · [Gwern.net Source]

Analysis of 153M+ changed lines of code (2020–2023). Key findings: copy/pasted code rose from 8.4% → 10.5% (projected 11.6% in 2024); moved/refactored code dropped from 24.8% → 16.9% (projected 13.4% in 2024); code churn rose from 3.6% → 5.4% (projected 7.1% in 2024, nearly double the 2021 baseline). GitClear argues these trends point to a risk that AI-assisted speed can come at the expense of long-term maintainability.

2. DORA Report (2025) — State of AI-assisted Software Development

AI acts as an amplifier, not a universal fix. Teams with high platform maturity and robust planning are better positioned to turn AI adoption into system-level gains; teams lacking strong platforms and workflows can see local productivity gains lost to "downstream disorder." The report reveals a severe trust paradox — 30% of devs do not trust AI-generated code. Related DORA guidance also treats deployment rework rate as a useful signal alongside the classic four delivery metrics.

3. GitHub Spec Kit: Spec-Driven Development (2025)

GitHub's open-source toolkit for Spec-Driven Development (SDD), where specifications become the primary artifact that guides implementation. The engineer's role shifts from direct code writer toward strategic orchestrator — designing architecture, setting constraints, and validating AI output.

4. arXiv:2405.06371 — "Using AI Assistants in Software Development: A Qualitative Study on Security Practices and Concerns" (ACM CCS 2024)

Peer-reviewed qualitative study investigating how professionals use AI assistants in secure development. Despite widespread security concerns, developers frequently use AI for critical tasks, creating an over-reliance risk. The study underscores the necessity of human-in-the-loop validation and treating AI suggestions as insecure drafts requiring thorough review.

[ ABOUT_THE_AUTHOR ]
Mohamad
EXECUTING_STRATEGY

Driving Large-Scale Transformation

As a Senior Staff Platform Architect and Systems Engineer, I design platform systems, AI-era engineering workflows, and architecture patterns that help teams ship faster without losing control of quality, security, or long-term maintainability.
