ai-test-gen

AI-Powered Test Case Generation Framework

Key results: 92% time saved · $0.002 cost per test case · 870 test cases · 73% first-pass quality

Overview

ai-test-gen is a multi-agent LLM orchestration system for test generation. It reads structured requirements (Azure DevOps / Jira), passes them through a hybrid rule-engine + LLM pipeline, enforces acceptance-criteria coverage with automated feedback loops, and exports deterministic, structured test suites ready for import into ADO Test Plans.

Instead of a single prompt, the system decomposes work into specialised stages: ingestion and NLP parsing, deterministic rule-based generation, RAG-powered semantic matching with ChromaDB, LLM correction with JSON-schema enforcement, coverage validation, and finally multi-format export (CSV, JSON, Playwright scripts).
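The staged decomposition can be sketched as a simple function pipeline. This is a minimal illustration only; the `Story` model, the stage names, and the test-case dict shape are hypothetical, not the project's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Story:
    """A normalised user story flowing through the pipeline (illustrative model)."""
    story_id: str
    acceptance_criteria: List[str]
    test_cases: List[dict] = field(default_factory=list)

def run_pipeline(story: Story, stages: List[Callable[[Story], Story]]) -> Story:
    """Apply each stage in order; every stage returns the enriched story."""
    for stage in stages:
        story = stage(story)
    return story

# One illustrative stage; the real ones would call spaCy, the rule engine,
# ChromaDB retrieval, the LLM correction layer, and the validators.
def rule_based_generation(story: Story) -> Story:
    for i, ac in enumerate(story.acceptance_criteria, 1):
        story.test_cases.append({"id": f"TC-{i}", "covers": ac, "steps": ["PRE-REQ", "Launch app"]})
    return story

result = run_pipeline(
    Story("123456", ["User can open a drawing", "User can save a drawing"]),
    [rule_based_generation],
)
print(len(result.test_cases))  # → 2
```

The point of the stage-as-function shape is that any stage (LLM correction, validation) can be swapped or re-run independently, which is what makes the feedback loops in later stages cheap to wire up.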

In an MSc research study on a production CAD application (55 user stories, 870 manual tests), ai-test-gen generated 870 structured test cases with a 92% reduction in authoring time, 94.4% acceptance-criteria coverage, and a first-pass structural quality of 72.9% at a total LLM cost of $1.74 (~$0.002 per test case).

System Architecture

01. Ingestion & Parsing — Adapters pull stories from Azure DevOps/Jira and normalise them into domain models. spaCy-based NLP extracts acceptance criteria, UI surfaces, and feature types.
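The extraction idea behind this stage can be shown without the NLP machinery. The real pipeline uses spaCy; this dependency-free regex sketch, and the bullet conventions it assumes, are illustrative only:

```python
import re
from typing import Dict, List

def parse_story(raw: str) -> Dict[str, List[str]]:
    """Pull acceptance-criteria bullets out of a raw story description.

    Sketch only: matches 'AC1:'-style prefixes and '-'/'*' bullets; the
    real system uses spaCy parsing rather than a regex.
    """
    criteria = re.findall(r"^\s*(?:AC\d*[:.)-]|\-|\*)\s*(.+)$", raw, flags=re.MULTILINE)
    return {"acceptance_criteria": [c.strip() for c in criteria]}

story_text = """As a user I want to export drawings.
- Export dialog opens from the File menu
- Exported file matches the selected format
"""
parsed = parse_story(story_text)
print(parsed["acceptance_criteria"])  # → two criteria, narrative line ignored
```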

02. Deterministic Generation — A rule engine with 70+ QA rules expands scenarios, generates structural scaffolds (PRE-REQ, launch, close, negative paths), and guarantees minimal quality without any LLM calls.
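A rule engine of this kind can be sketched as predicate/template pairs: each rule decides whether it applies to a story, then expands into scaffold test cases. The rule names and story fields below are hypothetical:

```python
from typing import Dict, List

# Each rule pairs an applicability predicate with a deterministic expansion.
RULES: List[Dict] = [
    {
        "name": "launch-close",
        "applies": lambda story: True,  # structural scaffold for every story
        "expand": lambda story: [
            {"title": f"Launch {story['feature']}",
             "steps": ["PRE-REQ: app installed", "Launch the app"]},
            {"title": f"Close {story['feature']}",
             "steps": ["Close the app", "Verify clean shutdown"]},
        ],
    },
    {
        "name": "negative-input",
        "applies": lambda story: story.get("has_input", False),
        "expand": lambda story: [
            {"title": f"{story['feature']}: reject invalid input",
             "steps": ["Enter invalid data", "Verify error message"]},
        ],
    },
]

def apply_rules(story: dict) -> List[dict]:
    """Deterministically expand a story into scaffold cases — no LLM calls."""
    cases: List[dict] = []
    for rule in RULES:
        if rule["applies"](story):
            cases.extend(rule["expand"](story))
    return cases

print(len(apply_rules({"feature": "Export dialog", "has_input": True})))  # → 3
```

Because the expansion is pure data-to-data, the same story always yields the same scaffold, which is what gives the pipeline its predictable baseline quality.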

03. RAG: Semantic Matching — ChromaDB stores previous steps as embeddings. For new stories, semantically similar steps are retrieved as few-shot context to enforce consistent language and patterns.
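The retrieval step can be illustrated with a toy bag-of-words similarity in place of ChromaDB's dense embeddings. Everything below — the corpus, the scoring, the function names — is a simplified stand-in for the real vector store:

```python
from collections import Counter
from math import sqrt
from typing import List

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (ChromaDB would store dense vectors)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_few_shot(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Return the k most similar previously written steps as few-shot context."""
    ranked = sorted(corpus, key=lambda step: cosine(embed(query), embed(step)),
                    reverse=True)
    return ranked[:k]

reference_steps = [
    "Click the Save button and verify the file is written",
    "Open the Export dialog from the File menu",
    "Resize the viewport and verify layout reflows",
]
print(retrieve_few_shot("verify export dialog opens from file menu",
                        reference_steps, k=1))
```

The retrieved steps are then injected into the LLM prompt, which is what keeps new stories reading like the existing reference suite.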

04. LLM Correction — A provider-agnostic LLM layer (OpenAI / Gemini / Anthropic / Ollama) refines wording, fills edge cases, and produces JSON-structured output that matches a strict schema.
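Schema enforcement on the model's output can be sketched with stdlib-only validation. The real system enforces a full JSON schema; the field names here are assumptions:

```python
import json

# Minimal required shape for one generated test case (illustrative field names).
REQUIRED_FIELDS = {"title": str, "steps": list, "expected_result": str}

def validate_test_case(raw: str) -> dict:
    """Parse the LLM's JSON output and enforce the expected structure.

    A failure here would trigger a retry or correction call rather than
    letting a malformed case reach the export stage.
    """
    case = json.loads(raw)
    for name, typ in REQUIRED_FIELDS.items():
        if not isinstance(case.get(name), typ):
            raise ValueError(f"field {name!r} missing or wrong type")
    return case

llm_output = ('{"title": "Export PNG", '
              '"steps": ["Open Export dialog", "Choose PNG"], '
              '"expected_result": "PNG file is created"}')
case = validate_test_case(llm_output)
print(case["title"])  # → Export PNG
```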

05. Validation & Feedback — Coverage validators check that every acceptance criterion is represented. Gaps trigger targeted LLM calls to generate missing tests; quality gates enforce structure, forbidden-language rules, and accessibility requirements.
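A minimal coverage validator might look like this. The `covers` field linking a test case back to its acceptance criterion is an illustrative convention, not the project's exact schema:

```python
from typing import Dict, List

def coverage_gaps(criteria: List[str], test_cases: List[Dict]) -> List[str]:
    """Return acceptance criteria not referenced by any generated test case.

    A non-empty result would drive a targeted follow-up LLM call for just
    the missing criteria, rather than regenerating the whole suite.
    """
    covered = {tc.get("covers") for tc in test_cases}
    return [ac for ac in criteria if ac not in covered]

acs = ["AC1: dialog opens", "AC2: file saved", "AC3: error on bad path"]
suite = [{"covers": "AC1: dialog opens"}, {"covers": "AC3: error on bad path"}]
print(coverage_gaps(acs, suite))  # → ['AC2: file saved']
```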

06. Export & Integration — Final suites are exported to ADO-compatible CSVs, JSON, and Playwright scripts, with workflows to upload directly into Azure DevOps Test Plans and other tooling.
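The CSV export can be sketched as follows. The column headers follow a commonly used Azure DevOps Test Plans import layout; treat them as an assumption and check against your organisation's template before uploading:

```python
import csv
import io

# Assumed ADO-style import columns: one title row per case, one row per step.
HEADERS = ["ID", "Work Item Type", "Title", "Test Step", "Step Action", "Step Expected"]

def export_ado_csv(test_cases):
    """Serialise test cases to an ADO-style CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(HEADERS)
    for case in test_cases:
        writer.writerow(["", "Test Case", case["title"], "", "", ""])
        for i, (action, expected) in enumerate(case["steps"], 1):
            writer.writerow(["", "", "", str(i), action, expected])
    return buf.getvalue()

suite = [{
    "title": "Export PNG",
    "steps": [("Open Export dialog", "Dialog is shown"),
              ("Choose PNG and confirm", "PNG file is created")],
}]
print(export_ado_csv(suite).splitlines()[1])  # → ,Test Case,Export PNG,,,
```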

Why This Architecture?

A single LLM prompt can hallucinate steps, miss edge cases, and drift in wording between runs. ai-test-gen instead pushes as much as possible into deterministic rules, then uses LLMs only where they add real value — language quality, gap filling, and semantic alignment.

  • Hybrid rules + LLM keeps 70% of logic deterministic, reducing hallucination and giving predictable structure across projects.
  • RAG with ChromaDB reuses high-quality reference steps so new stories read like they were written by the same senior QA engineer.
  • Coverage validation loops ensure every acceptance criterion is covered at least once, turning ACs into an explicit quality contract.

Quick Start (Local)

  1. Clone the repo: git clone https://github.com/Gulzhasm/ai_test_gen.git
  2. Create a Python 3.10 venv and install deps: pip install -r requirements.txt
  3. Configure .env with ADO + LLM keys.
  4. Run your first generation: python workflows.py generate --story-id 123456

Full Docker flow, CLI reference, and MCP integration are documented in the project README on GitHub.

Tech Stack

Python · Gemini 2.5 Flash · ChromaDB · spaCy · Azure DevOps API · Clean Architecture · Docker · MCP Server · python-docx