Methodology

Defensible by design.

AI-assisted review is only useful if it holds up at meet-and-confer. Here's exactly how our pipeline works, how we validate it, and why opposing counsel won't get traction challenging it.

§ 01 — HOW IT WORKS

From PST to production.

Every matter follows the same six-stage pipeline. Every stage is logged, timestamped, and reproducible.

1

Ingestion

Document containers (PST, MBOX, NSF, ZIP archives, Slack/Teams exports, cloud storage exports) are parsed into individual documents with metadata preserved.

Output: Document inventory with file paths, custodian assignments, MD5 hashes
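As an illustration of what one inventory row might hold, here is a minimal Python sketch. The `InventoryRow` fields, the `build_inventory` helper, and the sample paths and custodian label are hypothetical names chosen for this example, not our production schema.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryRow:
    path: str       # original file path inside the container
    custodian: str  # custodian the document is assigned to
    md5: str        # hex MD5 digest of the raw document bytes


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a document's raw bytes."""
    return hashlib.md5(data).hexdigest()


def build_inventory(docs: dict[str, bytes], custodian: str) -> list[InventoryRow]:
    """Build one inventory row per extracted document, ordered by path."""
    return [InventoryRow(path, custodian, md5_hex(data))
            for path, data in sorted(docs.items())]


# Hypothetical input: two extracted messages with identical content.
rows = build_inventory(
    {"inbox/a.eml": b"hello", "inbox/b.eml": b"hello"},
    custodian="custodian-001",
)
for row in rows:
    print(row.path, row.custodian, row.md5)
```

Identical bytes always produce the same hash, which is what lets the hash double as an exact-duplicate key in the next stage.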

2

Processing

Text extraction, OCR for image-based documents, metadata normalization, email threading, and deduplication (exact duplicates by MD5 hash, plus near-duplicate detection).

Output: Processed corpus with text, dedup keys, family relationships
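The two deduplication passes can be sketched roughly as follows. Exact duplicates are grouped by MD5 key. For near-duplicates, this example assumes Jaccard similarity over three-word shingles, which is one common technique and not necessarily the one used in a given matter. The function names and sample texts are hypothetical.

```python
import hashlib


def exact_dedup(texts: dict[str, str]) -> dict[str, list[str]]:
    """Group document IDs by the MD5 of their extracted text."""
    groups: dict[str, list[str]] = {}
    for doc_id, text in texts.items():
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups.setdefault(key, []).append(doc_id)
    return groups


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles in a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


corpus = {
    "D1": "please review the attached draft",
    "D2": "please review the attached draft",
    "D3": "please review the attached draft today",
}
groups = exact_dedup(corpus)
print(len(groups))                          # D1 and D2 share one key
print(jaccard(corpus["D1"], corpus["D3"]))  # near-duplicate score
```

A near-duplicate threshold (for example, flagging pairs that score above some cutoff) would be a per-matter choice; none is implied here.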

3

AI Coding

Frontier-grade language models review each document with structured coding prompts. Each document receives relevance, privilege, and issue tags with reasoning traces.

Output: Coded data with reasoning logs attached to every document

4

QC Sampling

Statistical samples drawn from each coding bucket using stratified random sampling. Sample sizes calculated for 95% confidence level at ±2% margin of error.

Output: QC review set sized to support defensibility claims

5

Human Review

Our attorneys review the QC sample, validate or override AI coding, and document patterns of disagreement. Persistent errors trigger re-prompting.

Output: Validated coded data set with recall and precision metrics

6

Production

Bates numbering, production format conversion, privilege log generation, defensibility package assembly.

Output: Production-ready data with full methodology documentation
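Bates numbering itself is simple sequential labeling, one number per page. A minimal sketch, assuming a hypothetical `ABC` prefix, seven-digit zero padding, and a made-up `assign_bates` helper:

```python
def assign_bates(page_counts: dict[str, int], prefix: str = "ABC",
                 start: int = 1, width: int = 7) -> dict[str, tuple[str, str]]:
    """Assign each document a (first, last) Bates range in production order."""
    ranges: dict[str, tuple[str, str]] = {}
    next_number = start
    for doc_id, pages in page_counts.items():
        if pages < 1:
            raise ValueError(f"{doc_id} must have at least one page")
        first, last = next_number, next_number + pages - 1
        ranges[doc_id] = (f"{prefix}{first:0{width}d}",
                          f"{prefix}{last:0{width}d}")
        next_number = last + 1  # no gaps and no overlaps between documents
    return ranges


# Hypothetical production set: a three-page document, then a one-page one.
bates = assign_bates({"D1": 3, "D2": 1})
print(bates["D1"])  # ('ABC0000001', 'ABC0000003')
print(bates["D2"])  # ('ABC0000004', 'ABC0000004')
```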

§ 03 — VALIDATION

Numbers, not adjectives.

Every production we deliver includes a validation report with four key metrics:

Recall

92–97%

The percentage of truly responsive documents that the AI correctly identified as responsive.

Industry-acceptable recall for AI-assisted review is generally 75%+. We typically achieve 92–97%.

Precision

88–94%

The percentage of documents the AI identified as responsive that are actually responsive.

Higher precision means less wasted attorney review time downstream.

Confidence Interval

95%

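The range within which the true recall and precision are expected to fall, given the size of the validation sample.

Every recall and precision figure is reported with its interval at the 95% confidence level, so the numbers you report are the numbers you can defend.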

Elusion Rate

<5%

The share of documents coded non-responsive that are actually responsive, estimated from a sample drawn from the non-responsive population.

We target an elusion rate below 5%, meaning fewer than 5 in 100 documents in the non-responsive set should have been coded responsive.

These four metrics are calculated by comparing our attorneys' coding of a statistically valid sample against the AI coding of the same documents. The full sample, the coding decisions, and the disagreement analysis are all included in the defensibility package.
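For readers who want to see the arithmetic, here is a minimal sketch of how these metrics follow from sample counts. It treats the attorney decision as ground truth and assumes a simple random sample; a stratified sample would need each stratum weighted by its size. The Wilson score interval is used here as one reasonable choice of interval method, and the counts are invented for illustration, not results from any matter.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, n: int,
                    confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a proportion estimated from n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def validation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Metrics from sample counts, with attorney coding as ground truth.

    tp: AI responsive, attorney responsive
    fp: AI responsive, attorney non-responsive
    fn: AI non-responsive, attorney responsive
    tn: AI non-responsive, attorney non-responsive
    """
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "elusion": fn / (fn + tn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Invented counts, for illustration only.
metrics = validation_metrics(tp=90, fp=10, fn=5, tn=395)
recall_low, recall_high = wilson_interval(90, 95)
print(metrics)
print(f"recall 95% interval: {recall_low:.3f} to {recall_high:.3f}")
```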

§ 04 — REPRODUCIBILITY

Every call, recorded.

For every document coded, we log:

Model name and version

Coding prompt (exact text)

Document ID and Bates number

Timestamp (UTC)

Coding output (full structured response)

Reasoning trace

This means any single coding decision in a production of millions of documents can be reproduced and audited months or years later. If a coding call is challenged in motion practice, we can show the exact reasoning behind it.
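A log entry carrying the fields listed above could look something like this sketch. The `CodingLogEntry` class, its field names, and every sample value (model name, version string, document ID, Bates number, prompt) are hypothetical placeholders, not our actual schema or data.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CodingLogEntry:
    model: str            # model name
    model_version: str    # pinned version identifier
    prompt: str           # exact coding prompt text
    document_id: str
    bates_number: str
    timestamp_utc: str    # ISO 8601 timestamp in UTC
    output: dict          # full structured coding response
    reasoning_trace: str


def to_json_line(entry: CodingLogEntry) -> str:
    """Serialize one entry as a single JSON line with stable key order."""
    return json.dumps(asdict(entry), sort_keys=True)


entry = CodingLogEntry(
    model="example-model",            # placeholder, not a real model name
    model_version="2024-01-01",       # placeholder version string
    prompt="Code this document for relevance, privilege, and issues.",
    document_id="DOC-000123",
    bates_number="ABC0000001",
    timestamp_utc=datetime.now(timezone.utc).isoformat(),
    output={"relevance": "responsive", "privilege": "none", "issues": ["pricing"]},
    reasoning_trace="Discusses pricing terms with the counterparty.",
)
line = to_json_line(entry)
print(line)
```

One JSON object per line keeps the log append-only and easy to search by document ID or Bates number later.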

Why We Use Frontier Models

We work with frontier-grade language models (Claude from Anthropic, GPT-class models from OpenAI) rather than legacy machine-learning classifiers. The reason is simple: frontier models read documents in context, understand legal concepts, and explain their reasoning — capabilities that older TAR systems cannot match.

We pin model versions for every engagement. If a model is updated mid-matter, we either continue with the pinned version or formally re-validate against the new version with the client's awareness. Model drift is logged, never silent.
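One way to keep drift from going unnoticed is a guard that compares the version a response reports against the pinned one before the coding result is accepted. The sketch below assumes the provider's response exposes a version string; the `check_pinned` function, the `ModelDriftError` class, and the version labels are hypothetical.

```python
import logging

logger = logging.getLogger("model_pin")


class ModelDriftError(RuntimeError):
    """Raised when a response comes from a version that was not validated."""


def check_pinned(pinned: str, reported: str,
                 revalidated: frozenset[str] = frozenset()) -> str:
    """Accept the pinned version, or a formally re-validated one; else fail."""
    if reported == pinned:
        return reported
    if reported in revalidated:
        # Drift is allowed only after re-validation, and it is always logged.
        logger.warning("model changed from %s to re-validated %s", pinned, reported)
        return reported
    logger.error("unvalidated model version %s (pinned: %s)", reported, pinned)
    raise ModelDriftError(f"expected {pinned}, got {reported}")


# Hypothetical version labels.
print(check_pinned("model-v1", "model-v1"))
print(check_pinned("model-v1", "model-v2", revalidated=frozenset({"model-v2"})))
```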

§ 05 — DOCUMENTATION

Built for Meet-and-Confer

Every coding decision is logged, timestamped, and versioned. Our documentation package gives you everything you need to defend your AI-assisted production in depositions and courtroom proceedings, and it supports the methodology disclosure framework affirmed in Da Silva Moore v. Publicis Groupe, Rio Tinto v. Vale, and subsequent TAR-defensibility cases.

Model Version Logs

Exact model version and API provider for every production. Reproducible results guaranteed.

Prompt Architecture

Full prompt text used for each classification decision. Zero black-box decisions — every call is documented.

Timestamp Registry

Every document processed with exact timestamp. Chain of custody documented from ingestion to production.

QC Reviewer Attribution

Reviewer identifiers and timestamps for every validation sample reviewed. Reviewer credentials disclosed to opposing counsel on request.

Statistical Reports

Recall, precision, and F1 scores, each with confidence intervals. Prepared in attorney-readable and expert-witness formats.

Methodology Disclosure

Plain-language explanation of AI methodology suitable for opposing counsel and judge review.

Documentation Package Contents

Executive methodology summary
Technical architecture overview
Model version and API provider
Validation sample selection criteria
Statistical validation report
QC review attestations
Confidence interval calculations
Processing chain of custody log

Ready to See It Applied to Your Matter?

Request an ECA and we'll walk you through our methodology on your actual corpus. See the numbers, meet the team, and get a defensibility plan tailored to your case.

Questions about our methodology?

hello@progressivediscovery.ai or 952.463.8392