EU AI Act compliance guide

Annex IV Technical Documentation: The Complete Requirements Breakdown

A section-by-section guide to every requirement in EU AI Act Annex IV, with concrete examples of evidence-linked artefacts for each section and a checklist for SaaS AI providers.

17 June 2026 · 12 min read

Introduction

If you provide a high-risk AI system in the EU, which can include SaaS products that screen, rank, or score people in Annex III areas such as recruitment, you are legally required to draw up Annex IV technical documentation before you place the system on the EU market, and to keep it up to date afterwards.

Annex IV of the EU AI Act (Regulation (EU) 2024/1689) sets out the minimum content of that file. This guide breaks down every section, explains what regulators and auditors are actually looking for, and shows what evidence-linked documentation looks like in practice.

The short version: a box-ticking questionnaire will not pass. An evidence-linked file — where every claim traces to a verifiable artefact — will.

Run the free Annex IV Readiness Check to see which sections you have covered today.

Section 1: General description of the AI system

This section must describe the system at a level of detail sufficient for a reviewer to understand what it is and what it does — without relying on the provider's interpretation.

What the regulation requires

The intended purpose, as defined by the provider in instructions for use and promotional materials
The version number and any previous versions covered by the same file
How the AI system interacts with hardware, software, or other AI systems
The categories of natural persons and groups on whom the system is intended to be used
Any specific geographical, behavioural, or functional settings in which the system is designed to operate
The reasonably foreseeable uses and misuses

What evidence-linked documentation looks like

Weak (assertion)	Strong (evidence-linked)
"The system ranks job applicants based on their CVs"	Product spec v2.4.1, §3: "The Candidate Ranker model scores and ranks CV-qualified applicants for roles in categories [A, B, C]. Inputs: structured CV fields + role description. Output: ranked list with probability score. Deployed to [deployer class] via REST API."
"No groups are excluded from use"	Test report: bias audit across gender, age, and nationality subgroups, with AUC and false negative rate per group, run 2026-05-12, MLflow experiment #447
"Foreseeable misuse is documented"	Risk register entry: "Section 3.2 — Misuse as sole decision maker without human review. Mitigation: API contract enforces display of explanation; deployer agreement prohibits automated rejection without review."

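The difference between the two columns can be made mechanical: a claim either carries a pointer to an artefact or it does not. The following is a minimal sketch of that idea, not a prescribed format. The `Evidence` and `Claim` types, their field names, and the second claim's wording are assumptions made for illustration; only the MLflow experiment reference and date come from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Pointer to a verifiable artefact (field names are illustrative)."""
    kind: str      # e.g. "product_spec", "test_report", "risk_register"
    locator: str   # e.g. "MLflow experiment #447"
    dated: str     # ISO date the artefact was produced or captured


@dataclass
class Claim:
    """One statement in the technical file plus the artefacts backing it."""
    section: str
    text: str
    evidence: list[Evidence] = field(default_factory=list)


def unlinked_claims(claims: list[Claim]) -> list[Claim]:
    """Return claims that are bare assertions (no artefact attached)."""
    return [c for c in claims if not c.evidence]


claims = [
    # A bare assertion: nothing a reviewer could follow up on.
    Claim("1", "The system ranks job applicants based on their CVs"),
    # The same kind of statement, tied to a dated experiment run.
    Claim(
        "1",
        "Subgroup performance was audited across gender, age, and nationality",
        [Evidence("test_report", "MLflow experiment #447", "2026-05-12")],
    ),
]

for claim in unlinked_claims(claims):
    print(f"Section {claim.section}: no evidence for {claim.text!r}")
```

Running a check like this before each release would surface assertions that still need an artefact, although it cannot judge whether the linked artefact actually supports the claim.
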
Section 2: Design specifications

This section must document the reasoning behind the design: why the model architecture was chosen, what assumptions were made, and what trade-offs were accepted.

What the regulation requires

The overall logic, design choices, and key assumptions underlying the system
The classification as high-risk, with reference to the Annex III entry that applies
Explanation of the system's interaction with any other AI systems if deployed as a component
The system's computational and hardware requirements
Reference to any harmonised standards or technical specifications applied

What evidence-linked documentation looks like

Reviewers are looking for a design rationale that could not have been generated by a questionnaire. It should reference real architectural decisions with the rationale documented at the time:

"Architecture: fine-tuned BERT-based classifier (bert-base-multilingual-cased). Selected in sprint planning 2025-Q3 over LDA and gradient boosting based on F1 improvement of +7.2% on held-out set (see experiment log #312). Key assumption: CV text quality is consistent across applicant geography; this was validated on the EU-wide sample but may not hold for non-EU markets (documented in risk register §2.4)."

Section 3: Training data and data governance

This section is frequently cited as the most scrutinised part of an Annex IV review, particularly for HR-tech systems where training data bias directly affects protected groups.

What the regulation requires

The training methodologies and techniques used
The training datasets used: origin, curation, annotation methodology, number of data points, key characteristics
Assessment of the availability, quantity, and suitability of the training data
The data governance practices applied: collection process, pre-processing steps, how labelling was done, who did it, and what quality checks were applied
Any pre-trained components used, with information on their provenance and any fine-tuning applied

What evidence-linked documentation looks like

A data sheet linked from the technical file should include:

Dataset name, version, and provenance (source, date of collection, any licences)
Population represented and known gaps
Annotation process: who labelled, what guidelines were followed, inter-annotator agreement score
Pre-processing steps: what was normalised, excluded, or transformed and why
Any synthetic data used and how its realism was validated
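
The inter-annotator agreement score mentioned above is one of the few data sheet fields that can be computed directly. The regulation does not name a statistic; Cohen's kappa is used here only as one common choice for two annotators, and the labels in the example are invented. A minimal sketch:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for six CVs, purely to exercise the function.
annotator_1 = ["fit", "fit", "unfit", "fit", "unfit", "unfit"]
annotator_2 = ["fit", "unfit", "unfit", "fit", "unfit", "fit"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Recording the score alongside the labelling guidelines version and the sample it was computed on gives a reviewer something to reproduce rather than a number to take on trust.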

For HR-tech providers, intersectional demographic analysis of the training set is now expected by sophisticated buyers even before it is formally required by supervisory authorities.

Section 4: Validation and testing

This section must show that the system performs as intended and that risks have been identified and mitigated through testing.

What the regulation requires

Testing procedures and test metrics
The test data used: provenance, representativeness, known limitations
Achieved performance metrics, including accuracy, robustness, and cybersecurity metrics where applicable
Identified discriminatory effects and how they were addressed
Known limitations and underperforming population groups

What evidence-linked documentation looks like

Every performance number in the technical file should link to a reproducible experiment:

Model: candidate-ranker v2.4.1
Test set: held-out EU applicant pool, n=12,847 (see data sheet §4)
Run: MLflow experiment #447, 2026-05-12
Metrics:
  AUC-ROC overall: 0.893
  AUC-ROC female applicants: 0.881
  AUC-ROC male applicants: 0.899
  Δ AUC by gender: 0.018 (within pre-defined threshold of 0.025)
Decision threshold: 0.45 (F1-optimised on validation set)

A reviewer who cannot reproduce or verify a claimed metric will treat it as an assertion. Link to the experiment run, not to a summary slide.
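
A per-subgroup check like the one in the record above can be scripted so that it runs with every evaluation. The sketch below computes AUC-ROC per group in plain Python and compares the gap with a pre-defined limit. The function names and the toy data are assumptions for illustration; the 0.025 limit is the one quoted in the example record. A production pipeline would more likely use an established metrics library.

```python
def auc_roc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. This is O(P*N): fine for a sketch, slow at scale.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def subgroup_auc_gap(y_true, scores, groups, max_gap):
    """Return per-group AUC, the largest gap, and whether it is within max_gap."""
    per_group = {}
    for g in sorted(set(groups)):
        idx = [i for i, v in enumerate(groups) if v == g]
        per_group[g] = auc_roc([y_true[i] for i in idx], [scores[i] for i in idx])
    gap = max(per_group.values()) - min(per_group.values())
    return per_group, gap, gap <= max_gap


# Toy data, invented for illustration only.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.8, 0.6, 0.5, 0.3]
groups = ["f", "f", "f", "f", "m", "m", "m", "m"]

per_group, gap, within = subgroup_auc_gap(y_true, scores, groups, max_gap=0.025)
print(per_group, gap, within)
```

The design point is that the limit is an argument fixed before the run, so the technical file can show the threshold was set in advance rather than chosen after seeing the results.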

Section 5: Monitoring and logging

Post-deployment monitoring is one of the areas where SaaS providers consistently have gaps. This section documents what telemetry exists and how it is used.

What the regulation requires

The system's automatic logging capabilities, including event types captured
The post-market monitoring plan: what is measured, at what frequency, by whom, and with what escalation criteria
How monitoring data feeds back into the technical documentation update process

What evidence-linked documentation looks like

Link to the monitoring dashboard (e.g., Grafana board URL + screenshot at documentation date)
Definition of the alerting thresholds and the policy for who is notified
Frequency of human review of model performance in production
The escalation path if a serious incident is detected
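
Alerting thresholds and the notification policy are easier to audit when they live in version-controlled configuration rather than only in a dashboard. The following is a minimal sketch under assumed names: the `AlertRule` type, the metric names, the threshold values, and the notification targets are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str
    warn_below: float      # below this, notify the named owner
    escalate_below: float  # below this, follow the incident escalation path
    notify: str            # hypothetical team or rota name


def evaluate(rules, observed):
    """Map each monitored metric to 'ok', 'warn', 'escalate', or 'missing'."""
    status = {}
    for rule in rules:
        value = observed.get(rule.metric)
        if value is None:
            status[rule.metric] = "missing"  # absent telemetry is itself a finding
        elif value < rule.escalate_below:
            status[rule.metric] = "escalate"
        elif value < rule.warn_below:
            status[rule.metric] = "warn"
        else:
            status[rule.metric] = "ok"
    return status


# Hypothetical rules and one hypothetical production snapshot.
rules = [
    AlertRule("auc_roc_overall", warn_below=0.87, escalate_below=0.85, notify="ml-oncall"),
    AlertRule("explanation_display_rate", warn_below=0.99, escalate_below=0.95, notify="product-owner"),
]
observed = {"auc_roc_overall": 0.86}

status = evaluate(rules, observed)
print(status)
```

Treating a missing metric as its own status, rather than silently passing, reflects the point that gaps in telemetry are what reviewers tend to find first.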

See Post-Market Monitoring for a detailed breakdown of the PMM obligations.

Section 6: Standards applied

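Most SaaS providers have not yet applied any harmonised standards because none are available yet for most Annex III categories. Harmonised standards are voluntary in any case: applying them gives a presumption of conformity rather than creating a separate obligation. Honest documentation of the current position is better than silence.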

What the regulation requires

Reference to any harmonised standards applied in full or in part
Where no harmonised standards exist (as is currently the case for most Annex III categories), reference to the common specifications or other technical specifications used to demonstrate conformity

What to write if no harmonised standards apply

"No harmonised standards are currently available for AI systems in the employment/recruitment domain under Annex III §4(a) of the EU AI Act. The following technical specifications were applied to demonstrate conformity: ISO/IEC 42001:2023 (AI management systems), NIST AI RMF 1.0 (risk management framework), CEN/CENELEC JTC 21 draft technical specifications [where applicable]."

Section 7: Substantial modifications

This section documents changes to the system that trigger a new or updated conformity assessment.

What the regulation requires

Description of any substantial modifications made to the system after it was first placed on the market
Assessment of whether the modification changes the risk classification or the conformity assessment result

What counts as a substantial modification?

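Article 3(23) defines a substantial modification as a change made after the system is placed on the market or put into service that was not foreseen or planned in the provider's initial conformity assessment, and that either affects the system's compliance with the high-risk requirements or changes the intended purpose for which the system was assessed. In practice, this includes: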

Retraining on a materially different dataset
Significant change in model architecture or algorithmic approach
Extension of intended purpose to new categories or use cases
Changes that materially alter performance metrics or bias characteristics

Track model versions systematically in your model registry. Each retraining run should record whether it constitutes a substantial modification and why that decision was made.
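
One way to record that decision is a small structured entry per retraining run, mirroring the four triggers listed above. This is a sketch under assumptions: the `RetrainingRecord` type, its field names, and the version string are hypothetical, and the flags only route a run to human review. Whether a change is legally a substantial modification remains a judgement for the provider.

```python
from dataclasses import dataclass


@dataclass
class RetrainingRecord:
    model_version: str
    materially_different_dataset: bool
    architecture_changed: bool
    intended_purpose_extended: bool
    metrics_or_bias_shifted: bool
    rationale: str  # reasoning written down at decision time

    def triggers(self) -> list[str]:
        """Names of the listed triggers that apply to this run."""
        flags = {
            "materially different dataset": self.materially_different_dataset,
            "architecture or algorithm change": self.architecture_changed,
            "intended purpose extended": self.intended_purpose_extended,
            "performance or bias shift": self.metrics_or_bias_shifted,
        }
        return [name for name, hit in flags.items() if hit]

    def needs_substantial_modification_review(self) -> bool:
        return bool(self.triggers())


record = RetrainingRecord(
    model_version="v2.4.2",  # hypothetical successor version
    materially_different_dataset=True,
    architecture_changed=False,
    intended_purpose_extended=False,
    metrics_or_bias_shifted=False,
    rationale="Added a new applicant sample; subgroup metrics re-run and unchanged.",
)
print(record.needs_substantial_modification_review(), record.triggers())
```

Keeping the rationale as a required field means a "no" decision is documented as carefully as a "yes".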

Section 8: Cybersecurity measures

This section documents measures taken to ensure the system is resilient against adversarial manipulation.

What the regulation requires

Specific resilience measures against attacks that could exploit the system's vulnerabilities
Documentation of how the system behaves under adversarial inputs, including model robustness evaluations

What evidence-linked documentation looks like

Results of adversarial testing (e.g., FGSM or PGD attack evaluation with robustness metrics)
Input validation logic with reference to the implementation
Access controls on model endpoints and training pipeline
Incident response procedure for model integrity events
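
Input validation is the item on this list that is simplest to show as code. The sketch below checks a request before it reaches the model. The field names, the length limit, and the choice to reject invisible Unicode characters are assumptions for illustration, not requirements taken from the regulation.

```python
import unicodedata

MAX_FIELD_CHARS = 5_000  # hypothetical limit
REQUIRED_FIELDS = ("cv_text", "role_description")  # hypothetical field names


def validate_request(payload):
    """Return a list of validation errors; an empty list means the input is accepted."""
    errors = []
    for name in REQUIRED_FIELDS:
        value = payload.get(name)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{name}: missing or empty")
            continue
        if len(value) > MAX_FIELD_CHARS:
            errors.append(f"{name}: exceeds {MAX_FIELD_CHARS} characters")
        # Control (Cc) and format (Cf) characters, such as zero-width spaces,
        # can hide text from human reviewers while still reaching the model.
        if any(
            unicodedata.category(ch) in ("Cc", "Cf") and ch not in "\n\r\t"
            for ch in value
        ):
            errors.append(f"{name}: contains control or invisible characters")
    return errors


print(validate_request({"cv_text": "Python developer", "role_description": "Backend engineer"}))
print(validate_request({"cv_text": "hid\u200bden", "role_description": ""}))
```

In the technical file, the reference would point at the deployed implementation and its tests, with rejected-request counts feeding the monitoring described in Section 5.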

Checklist: Is your Annex IV file complete?

Run through this self-assessment. For each section, ask: does the claim link to an artefact an engineer outside your company can verify?

§1 General description with intended purpose, user categories, and foreseeable misuse documented
§2 Design rationale with architecture decisions and key assumptions referenced to design docs
§3 Data sheet for every training dataset with provenance, annotation process, and bias analysis
§4 Test report with reproducible experiment IDs and per-subgroup performance metrics
§5 Monitoring plan with dashboard links, alert thresholds, and escalation procedure
§6 Standards applied (or statement that none are currently available with alternative specs)
§7 Change log with substantial modification assessment for each version
§8 Adversarial test results and input validation documentation

Missing evidence in any section creates a gap a reviewer will find.
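
The checklist can also be kept as a small script that lists which sections still have no artefact attached. This is a sketch only: the section titles follow this guide's breakdown, and the artefact references in the example are taken from the illustrations earlier in the guide rather than from a real file.

```python
SECTIONS = {
    "§1": "General description",
    "§2": "Design specifications",
    "§3": "Training data and data governance",
    "§4": "Validation and testing",
    "§5": "Monitoring and logging",
    "§6": "Standards applied",
    "§7": "Substantial modifications",
    "§8": "Cybersecurity measures",
}


def gap_report(artefacts):
    """Return (number of covered sections, list of sections with no artefact)."""
    missing = [section for section in SECTIONS if not artefacts.get(section)]
    return len(SECTIONS) - len(missing), missing


# Example state: only two sections have linked artefacts so far.
artefacts = {
    "§1": ["Product spec v2.4.1, §3"],
    "§4": ["MLflow experiment #447"],
}

covered, missing = gap_report(artefacts)
print(f"{covered}/{len(SECTIONS)} sections covered")
for section in missing:
    print(f"{section} {SECTIONS[section]}: no artefact linked")
```

A count like this only shows that something is linked, so it complements the outside-engineer test above rather than replacing it.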

→ Run the free Annex IV Readiness Check to get a scored breakdown of your gaps, or see Evidence-Based vs. Questionnaire Compliance for why assertion-based approaches fail.
