There were around 30 people in the room, plus a good number of remote participants; the event was hosted in the Cisco building in Toronto over two and a half days. The meeting was facilitated by Mark Nottingham and Suresh Krishnan, who did an excellent job helping each participant express their views.
This project is complex, as it brings together participants with different focuses and sometimes opposing interests: representatives of the media and AI/search industries and advocates for the public good. Fluency in both technology and diplomacy is therefore required. US citizens with a web background constitute the majority of participants, which influences many aspects. For instance, remote participation is difficult if you are not a native English speaker, and the group needs reminding from time to time that non-US legal frameworks exist and that not every piece of content lives on the web.
The IETF facilitates participation in the discussion, but this openness has a cost: discussions happen simultaneously via public email exchanges, GitHub issues and pull requests, and chat during meetings. It is therefore almost impossible to think “sequentially”. Anyone can jump into a discussion thread at any time and build a growing consensus around an alternative view. And during meetings, the discussion jumps every hour from one aspect (e.g. the use of AI in search pipelines) to another (e.g. the use of content for AI grounding).
Add to the mix that we live in an ultra-fluid world, where technologists fight against every concept that could limit architectural innovation, and you end up with a specification in which getting a stable scope is a mirage and each sentence is a struggle.
Despite all these difficulties, good faith appears to prevail, and the project slowly takes shape.
What is the current status of AI-Pref?
The specification defines a vocabulary and an attachment mechanism specific to Web crawlers. The vocabulary consists of a flat, discrete set of terms defined orthogonally, ensuring that there is no logical intersection or hierarchical dependency between entries. The vocabulary is “purpose-based” (“this piece of content should not be used for/as …”). The source of these preferences is implicitly copyright holders, but because distribution workflows can be complex, this is not explicitly specified. The IETF participants insist that, in the context of this work, these preferences are neither legally binding nor enforceable: solution providers may have good reasons to bypass the preferences expressed by the source.
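To make this concrete, here is a purely illustrative sketch of what such purpose-based preferences could look like when attached to web content via robots.txt. The term names follow this post, not the draft, and the exact syntax is still under discussion in the working group, so treat every token below as an assumption:

```
# Illustrative only — actual AI-Pref draft syntax may differ.
# "ai-training=n" opts the content out of model training;
# "ai-search=y" explicitly welcomes use in search pipelines.
User-Agent: *
Content-Usage: ai-training=n, ai-search=y
Allow: /
```

The key design point visible even in this sketch is the flat, orthogonal vocabulary: each term is asserted independently, with no hierarchy between them.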
AI training: At the end of this meeting, it appears that a consensus has been reached on this term. It previously covered only foundation models or generative models, but those concepts proved impossible to scope properly in the end. It now covers the training of any AI model, be it initial training, fine-tuning, or any other process that modifies the model’s weights. It excludes training AI models used exclusively in search pipelines (see below).
AI search: Every modern search pipeline uses AI techniques and includes AI models. Search engines need content for their users, and publishers want their content to be discovered: this is an area where mutual interest is obvious. But the rising capability of search engines to rewrite titles and content snippets (sometimes in ways that hurt provider brands), Zero-Click Content in search systems, and pure Answer Engines are deal breakers for publishers. Most of the meeting was spent refining this term and defining conditions to assert conformance with the concept. We can expect the next draft of the specification to contain consensus wording.
Most, if not all, publishers will set the AI search preference to “yes,” which runs counter to the opt-out approach adopted in the other terms of the vocabulary. Constraints, e.g., on the size of snippets or images, may complement a positive AI-search preference; the corresponding meta tags, which are currently maintained by Google, may later be standardised by the IETF in a separate project.
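For reference, the Google-defined robots meta tags mentioned above currently look like the following (the numeric values here are illustrative, not recommendations):

```html
<!-- Limit text snippets to 120 characters, allow large image
     previews, and disable video previews in search results. -->
<meta name="robots" content="max-snippet:120, max-image-preview:large, max-video-preview:0">
```

Standardising such constraints at the IETF would move them from a single-vendor convention to a multi-vendor one.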
There are a few major players in the Search field, and activating the AI-Pref vocabulary will not immediately solve the issue of search results being mixed with Zero-Click Content.
On a personal note, I wonder why the group spends so much time on this aspect while the more important AI-input term is left aside. Other participants share the same opinion.
AI input: also known as AI use, AI output, or AI include. It is about the use of content for inference, the user side of the AI model. Things are still a bit muddy, but I can’t imagine a next draft where this term would not be defined, even tentatively. It covers the use of content for AI grounding (the process of anchoring an AI’s abstract “knowledge” to specific facts and data sources, a generalisation of the RAG technique), for generating Zero-Click content in search systems (e.g. Google AI Overview), and for generating direct answers in Answer Engines (e.g. ChatGPT, Perplexity, Gemini).
It seems obvious that if a source states “ai-input=yes; ai-training=no”, the content processed as input should not be used for training the AI system, even if the user gets the service for free. Let’s hope this preference is followed …
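As a thought experiment, here is a minimal sketch of how a content consumer could evaluate such a preference string before deciding whether a piece of content may be used for training. The `ai-input=yes; ai-training=no` syntax mirrors the example in this post, and the function names are hypothetical; none of this is the draft’s actual grammar:

```python
def parse_prefs(header: str) -> dict[str, bool]:
    """Parse an illustrative preference string like 'ai-input=yes; ai-training=no'.

    The syntax mirrors the example in the text, not the AI-Pref draft grammar.
    """
    prefs: dict[str, bool] = {}
    for item in header.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        prefs[key.strip().lower()] = value.strip().lower() in ("yes", "y", "true")
    return prefs


def may_train_on(header: str) -> bool:
    """Opt-out logic: training is allowed unless explicitly denied."""
    return parse_prefs(header).get("ai-training", True)
```

With the example from the text, `may_train_on("ai-input=yes; ai-training=no")` returns `False`: the content may be processed as input, but must not feed back into the model’s weights.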
It is also clear that dominant search engines bundle search results and Zero-Click content in the same service, using the same indexed data, and claim that expressing “ai-search=yes, ai-input=no” will result in the content being discarded from the index. Maintaining a clear separation between the two preferences remains key for content providers. Markets and legal frameworks will then play their roles.
An important point is to decide whether a distinction should be made between content provided as input to an AI model BY a human (e.g. a document translated with DeepL) and content used as input WITHOUT the user’s direct involvement (e.g. a typical RAG system fetching news from the Web). In the first case, one can argue that the user is in control and takes responsibility for the content: should the software that embeds the model process the content without checking the preferences attached to it? Or should it refuse to proceed if the AI-input preference is negative? Adobe, in its Firefly software, takes the checking approach. Firefly is a professional tool that aims at “commercial safety”, and big brands fight the use of their content for brand imitation, so this approach has some grounding (smile). But it would be a nightmare if a translation tool used for personal purposes took the same approach. Content sources cannot know in advance which tool their content will be processed with, or to what extent the output will be distributed. Legal frameworks, however, protect personal use, especially for accessibility and comprehension. If we want to keep things simple, the vocabulary can rely on legal frameworks to settle disputes if they arise.
Any other term in scope?
Other uses of content in AI solutions may appear later, but with model training, inference, and search specificities covered, we may have some time before something really different appears. Other initiatives (e.g. C2PA) currently end up with similar terms. Note that Really Simple Licensing (simple in name only) differentiates ai-input from ai-index.
Relationship with the EU TDM opt-out
To favour new discoveries and economic growth, an exception to copyright has been created in Europe for Text and Data Mining, with a guarantee that copyright owners can still explicitly reserve their rights.
The word reserve is important: it does not mean denying use of the content; it simply means keeping authority over it. It is an opt-out from an exception to copyright. The W3C TDM Reservation Protocol is based on this legal framework.
Using TDMRep, a simple “TDM-reservation” flag states that rights are reserved. This signal can be embedded in files or logically attached to Web content (more details in the next section). The AI-Pref vocabulary is fully complementary: once rights are reserved, a content provider can specify, in more detail, which content usages they accept or deny as preferences.
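As a reminder of what TDMRep looks like in practice, one of its attachment options is a JSON file served from the site’s well-known location; the paths and policy URL below are placeholders:

```json
[
  {
    "location": "/articles/*",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/policies/tdm-policy.json"
  }
]
```

Served as `/.well-known/tdmrep.json`, this reserves TDM rights on the matching content and points to a machine-readable policy; the AI-Pref vocabulary would then refine, inside such a policy, which usages are accepted or denied.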
For Web content, the expression of a TDM opt-out can even be simplified. The sole presence of an AI-Pref signal in robots.txt is logically sufficient to express a reservation of rights, valid as per the EU CDSM Article 4. The TDMRep specification may therefore soon be simplified in the specific case of Web content.
About attachment mechanisms
The IETF focuses on *robots.txt*, which can only express access restrictions for Web content. Attaching the AI-Pref vocabulary to files that are not on the Internet will be studied by other bodies.
The W3C TDM Reservation Protocol is awaiting consensus on the terms of the AI-Pref vocabulary to include them in its “TDM Policy”. The alliance of a) a “TDM Reservation” flag embedded in publications (EPUB, PDF), video files (MP4) and other assets, and b) decentralised “TDM Policy” ODRL resources, offers the required flexibility.
Next steps
The study of the detailed wording of the AI-Pref vocabulary will proceed in the coming weeks via GitHub and email. A virtual meeting is planned for June 2026, followed by a hybrid meeting in Vienna in July.
Conclusion: There is still a slight chance of concluding the vocabulary part by August 2026, as planned some months ago. It won’t be possible if the next draft does not make room for the AI-input term.