AI training corpus: definition and law

Training corpus

A training corpus is the body of data used to train an AI model. For large language models (LLMs), it consists of billions of texts drawn from the internet, books and articles. Copyright is a central question: Thomson Reuters v. ROSS Intelligence (February 2025), a US district court ruling that rejected a fair-use defense, is an early reference point. The EU AI Act requires providers to document their training data.

The training corpus is the fuel of AI models. For large language models, it comprises billions of texts drawn from the internet, digitized books, scientific articles, forums and public documents. The quality, diversity and representativeness of this corpus directly determine the model's capabilities and limits. For example, a corpus containing few French legal texts will tend to produce a model that performs poorly on French law.
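One rough way to check representativeness is to measure what share of the corpus falls into each language or domain. The sketch below is illustrative only: the function name `category_shares`, the metadata fields `language` and `domain`, and the toy corpus are all assumptions, not part of any particular training pipeline.

```python
from collections import Counter


def category_shares(documents, key):
    """Return the fraction of documents for each value of `key`."""
    counts = Counter(doc[key] for doc in documents)
    total = sum(counts.values())
    if total == 0:
        return {}  # empty corpus: no shares to report
    return {value: count / total for value, count in counts.items()}


# Hypothetical toy corpus; real corpora hold billions of documents.
corpus = [
    {"language": "en", "domain": "news"},
    {"language": "en", "domain": "law"},
    {"language": "en", "domain": "forum"},
    {"language": "fr", "domain": "law"},
]

language_shares = category_shares(corpus, "language")
domain_shares = category_shares(corpus, "domain")
```

A low share for a given language or domain does not prove the model will perform poorly there, but it is a simple signal that the area may be underrepresented.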

Copyright in training corpora has become a major legal issue. In Thomson Reuters v. ROSS Intelligence (decided in February 2025), a US district court rejected a fair-use defense for the use of protected content to train an AI tool, making the case an early reference point on the question. The EU AI Act requires providers of general-purpose AI models to document their training data, including its provenance and any rights attached to it, and to publish a summary of the content used for training.
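A record of provenance and rights could be modeled as one small entry per data source. The following is a minimal sketch under stated assumptions: the class `SourceRecord`, its fields and the helper `missing_rights_info` are hypothetical, and this is not a format prescribed by the AI Act.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical provenance entry for one training-data source."""
    source_id: str
    origin: str                          # where the data was obtained (URL, archive name)
    license: Optional[str] = None        # license or legal basis, if known
    rights_holder: Optional[str] = None  # owner of the rights, if known


def missing_rights_info(records: List[SourceRecord]) -> List[str]:
    """Return the ids of sources whose license is not documented."""
    return [r.source_id for r in records if not r.license]


# Illustrative entries; the URLs are placeholders.
records = [
    SourceRecord("src-001", "https://example.org/open-data", license="CC-BY-4.0"),
    SourceRecord("src-002", "https://example.org/forum-dump"),
]

undocumented = missing_rights_info(records)
first_as_dict = asdict(records[0])  # plain dict, ready to serialize
```

Keeping the entries immutable (`frozen=True`) is a design choice here: a provenance log is more trustworthy if entries are appended rather than edited in place.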

In France, the Legal Data Space project aims to build a sovereign corpus of high-quality French legal data for training models tailored to national law. The initiative addresses a twofold challenge: ensuring the quality of legal models by training them on reliable French data, and securing digital sovereignty by reducing dependence on the English-language corpora that dominate LLM training.

Related terms