The Publishing Maintenance Working Group at W3C has recently released a survey targeting the ebook industry and related to the evolution of EPUB 3. This survey is seeking advice on a question that arises from the natural evolution of the Web.
The question, in summary, is what the consequences would be of allowing textual content in EPUB 3 to be formatted as pure HTML5.
The implications of such a broad question are difficult to understand for most professionals (see the reactions to this LinkedIn post). I’ll illustrate some of them.
First, something obvious: EPUB 3 is already based on HTML5. All HTML5 tags are allowed in what we refer to as a “content document”. An ebook contains an ordered list of content documents, and each content document usually represents one chapter in the book. However, there is a catch: EPUB 3 only recognises the “XML serialisation of HTML 5” (which we refer to as “XHTML” for brevity, although this may lead to some confusion).
What is this XML serialisation of HTML 5?
It means that tags (elements and attributes) are case-sensitive, that every open element must be closed, and that empty elements must have a ‘/’ at the end. It also means that every attribute must have a value, and this value must be surrounded by quotation marks or apostrophes. Pure HTML 5 enforces none of these rules. To identify this variant of HTML 5, you’ll typically find an “XML declaration” (<?xml version="1.0" encoding="utf-8"?>) at the top of the document.
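To make these rules concrete, here is a minimal sketch in Python that feeds two fragments to the generic XML parser from the standard library. The fragments and the helper name is_well_formed are invented for illustration; the point is only that an XML parser rejects markup that an HTML5 parser would accept.

```python
import xml.etree.ElementTree as ET

# Illustrative fragments (not taken from any real EPUB).
# XML serialisation: quoted attribute value, closed <p>, self-closed <br/>.
XHTML_FRAGMENT = '<p class="note">First line<br/>Second line</p>'
# Pure HTML5 style: unquoted attribute value, <br> and <p> left open.
HTML5_FRAGMENT = '<p class=note>First line<br>Second line'


def is_well_formed(markup: str) -> bool:
    """Return True if a generic XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed(XHTML_FRAGMENT))
    print(is_well_formed(HTML5_FRAGMENT))
```

The second fragment is perfectly valid pure HTML5, yet it is a fatal error for an XML parser. This is the kind of tagging error that EPUBCheck reports today.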
What are the advantages of the XML serialisation of HTML 5?
In brief, an XML document is cheap to parse, easy to extend and easy to transform. I was also tempted to say ‘easy to validate,’ but the XHTML case is a bit special due to the flexibility of HTML5; therefore, let’s disregard this point.
Cheap to parse: the XML parser is generic and lightweight. As Tim Bray said recently on a forum: “Modern HTML is a monster with all sorts of complex error fallbacks and so on, so parsing is quite a bit more work. The advantage of HTML is that it’s got this rich ecosystem of CSS and JavaScript to do interactive magic.”
Easy to extend: XML has built-in extensibility through the use of namespaces. XML authors can add “foreign” tags to an XML structure and still validate the document, using a funny notation that few understand. EPUB uses this possibility with the “epub:type” attribute. MathML and SVG are also integrated in XHTML using XML namespaces.
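As an illustration of this namespace mechanism, the following sketch reads the epub:type attribute with a namespace-aware XML parser. The sample document and the function name epub_types are assumptions made for the example; the two namespace URIs are the ones used for XHTML and for EPUB structural semantics.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
EPUB_NS = "http://www.idpf.org/2007/ops"

# Illustrative content document (invented for this example).
DOC = f"""<html xmlns="{XHTML_NS}" xmlns:epub="{EPUB_NS}">
  <body>
    <section epub:type="chapter"><h1>Chapter 1</h1></section>
    <aside epub:type="footnote"><p>A note.</p></aside>
  </body>
</html>"""


def epub_types(xhtml: str) -> list[tuple[str, str]]:
    """Return (element name, epub:type value) pairs in document order."""
    root = ET.fromstring(xhtml)
    # ElementTree expands prefixed names to the "{namespace-uri}local" form.
    attr = f"{{{EPUB_NS}}}type"
    found = []
    for element in root.iter():
        value = element.get(attr)
        if value is not None:
            local_name = element.tag.split("}", 1)[-1]
            found.append((local_name, value))
    return found


if __name__ == "__main__":
    print(epub_types(DOC))
```

Note that the code never looks for the literal string “epub:type”: the prefix is only a shorthand for the namespace URI, which is precisely the part of the notation that confuses many developers.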
Easy to transform: The XSLT technology enables XML to be converted into other structures with just a few lines of code. Many publishing workflows use this capability to transform an internal XML format to XHTML or PDF, thanks to XSLT-savvy developers.
In this case, what can be the advantage of using pure HTML5?
EPUB is based on Web technologies; most EPUB 3 reading systems are built on Web browser engines, and Web technologies are shifting from XML. MathML and SVG are directly integrated into pure HTML5 without the need for XML namespaces. An equivalent of epub:type is possible using a custom HTML attribute.
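For comparison, here is a sketch of the pure HTML5 side, using the forgiving HTML tokenizer from the Python standard library. The attribute name data-epub-type is purely hypothetical (no such attribute has been standardised), and html.parser is a lenient tokenizer rather than a full HTML5 tree builder, so this only hints at what a browser engine does.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for epub:type; this name is an assumption.
CUSTOM_ATTR = "data-epub-type"

# Illustrative pure HTML5: unquoted attribute value, unclosed <p>, bare <br>.
DOC = (
    "<!DOCTYPE html>"
    "<section data-epub-type=chapter>"
    "<h1>Chapter 1</h1><p>Unclosed paragraph<br>"
    "</section>"
)


class TypeCollector(HTMLParser):
    """Collect (tag, value) pairs for the custom attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.found: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == CUSTOM_ATTR and value is not None:
                self.found.append((tag, value))


def collect_types(html: str) -> list[tuple[str, str]]:
    collector = TypeCollector()
    collector.feed(html)
    collector.close()
    return collector.found


if __name__ == "__main__":
    print(collect_types(DOC))
```

No namespace declaration is needed and the sloppy markup does not stop the parser, which is both the attraction and the risk of this approach.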
Web browsers still support the XML serialisation of HTML 5, but there are already slight differences between the rendering of pure HTML 5 and XML-serialised HTML 5. New browser features are not thoroughly tested on XHTML, and we don’t know how long browser engines will support the XML serialisation of HTML 5.
Additionally, software developers often lack a basic understanding of XML, and they are even less familiar with XML namespace wizardry. Many EPUB files contain tagging errors in their content documents. EPUBCheck properly catches these errors, but these EPUBs can still be found in the wild.
Therefore, allowing pure HTML5 as EPUB content documents should not be considered an evolution targeting senior professionals, but rather a way to open EPUB creation to a new generation of publishers and third-party developers with a Web background.
What are the pitfalls?
First, let’s not be naïve: a random HTML5 document packaged in an EPUB will rarely be readable in an EPUB reading system. If you are wondering why, consider how many web pages lack responsiveness and accessibility. Creating a “clean” and accessible HTML5 document, one that can fit the screen size, and support pagination and font adaptation requires expert knowledge.
Second issue: if pure HTML5 is accepted in EPUB, then the open-source EPUBCheck tool must be heavily enhanced to support this new content format. However, the community already struggles to obtain funding to maintain this excellent software. Who will help financially?
Then, let’s talk about ebook distributors and reading systems. Ebook distributors develop industrial software that evolves slowly. How will they adapt to this change? Reading software comes in many flavours: some are commercial, some are free and open-source (Readium); some use browser engines to render XHTML content, some don’t; some run on PC or mobile, others on hardware devices. Will they evolve quickly, or will they crash when encountering pure HTML5?
In summary, what are the chances that the presence of pure HTML 5 in EPUB leads to a new fragmentation of the EPUB landscape, after the fragmentations imposed by Fixed Layout EPUB (that some reading systems don’t support), Enhanced EPUB (with audiovisual content that some hardware devices don’t support) and Interactive EPUB (with JavaScript that many reading systems refuse to support)? And what could we do to solve this fragmentation problem? This is the question the W3C Publishing Maintenance Working Group is asking the book publishing community.
So?
If this evolution is agreed upon, we, as reading system developers, will be inclined to treat EPUB packages containing pure HTML5 differently from traditional EPUB: HTML resources will only be scrollable (no synthetic pagination using CSS columns, as this would break in many cases), and the reading system will limit how users can customise their experience. They will certainly still be able to modify the font size and spacing, but alignment, themes and other advanced settings may be inactive for pure HTML5. A mix of XHTML and pure HTML in the same EPUB package would be a practical mess: in this case, we would limit user settings for the whole publication.
An indicator that the publication is “safe” vis à vis user customizations of the display would be great, but we usually cannot trust publishers’ metadata.
Some experts think that such an evolution should be marketed as EPUB 4, certainly not EPUB 3.4. It is true that for distributors and reading systems, this is a breaking change (they MUST evolve) that they did not ask for. The problem is that the ebook publishing industry is slow: the rise of EPUB 3 took 10 years, and many believe that the name “EPUB 4” will be a repellent for most.
If there is a deadlock, there is an alternative solution: create a standard W3C Web Publication format which supports pure HTML5, and let the publishing industry adopt it slowly. Oh, by the way, the Readium Web Publication format already supports pure HTML5, and Readium Web + Thorium Reader Web are almost ready for that.