Anatomy of an EPUB 3 file
An EPUB 3 archive has a skeleton – the files which are mandatory to structure the content – and some flesh – the ebook content.
The first file you may spot in the zip archive is the mimetype file, which states that this archive is *really* an EPUB publication.
The content of this text file must be “application/epub+zip” and nothing else. This is how an EPUB reading system makes sure it can process the ebook.
It is required that the mimetype file is the first file found in the zip archive. This may appear to be a weird constraint, but it originates in the OpenDocument file format, which was the source of the EPUB container format. The rationale behind it is that an application reading the first bytes of the archive will always find the same “magic numbers” (see here for OCF magic numbers). This can stand in for format detection when the .epub file extension is not reliable.
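To make the "magic numbers" idea concrete, here is a minimal Python sketch of such a check. It relies on two things the text above does not state: the mimetype entry is stored uncompressed with no extra field, and a ZIP local file header is 30 bytes long, so the entry name sits at offset 30 and its content at offset 38. The function name looks_like_epub is purely illustrative.

```python
EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(data: bytes) -> bool:
    """Check the fixed-offset signature at the start of an EPUB archive.

    Assumes the mimetype entry is the first one, stored uncompressed and
    written without an extra field, so the offsets below are constant.
    """
    return (
        data[0:4] == b"PK\x03\x04"         # ZIP local file header signature
        and data[30:38] == b"mimetype"      # name of the first entry
        and data[38:58] == EPUB_MIMETYPE    # its uncompressed content
    )
```

Reading the first 58 bytes of a file is enough for this test, for example `looks_like_epub(open(path, "rb").read(58))`, where `path` is whatever file is being inspected.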
A practical issue with this requirement is that one cannot reliably create a proper EPUB file with a simple zip tool: many generic tools do not guarantee that the mimetype file will be first in the archive.
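A library that writes entries in the order it is asked to is one way around this. The sketch below uses Python's standard zipfile module; the helper name write_epub and the shape of its files argument (archive name mapped to bytes) are assumptions made for illustration, not part of any EPUB tool.

```python
import zipfile


def write_epub(target, files):
    """Write an EPUB container to `target` (a path or a binary file object).

    `files` maps archive names to bytes. The mimetype entry is always
    written first and stored uncompressed; other entries are deflated.
    """
    with zipfile.ZipFile(target, "w") as zf:
        zf.writestr("mimetype", b"application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            if name == "mimetype":
                continue  # already written as the first entry
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

The entries other than mimetype keep the order of the dictionary passed in, which is the insertion order in Python.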
The container.xml file, a small XML file found in the mandatory META-INF directory, is a bootstrapping item. It simply contains the relative location of the .opf file (a.k.a. package document), which is the brain of the publication and will be described shortly.
If the content.opf file is for instance in the folder content in the archive, then the location will be content/content.opf.
<?xml version="1.0"?>
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
  <rootfiles>
    <rootfile full-path="content/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
Using this information, the reading system will be able to open the .opf file and learn more about the structure of the publication.
EPUB 3 supports multiple renditions of an EPUB publication. You may find for instance a fixed-layout rendition and a reflowable rendition packaged in the same EPUB file. In such a case, container.xml will reference several package documents usually placed in several directories.
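As an illustration of this bootstrapping step, the following Python sketch extracts every package document location from a container.xml, so it covers both the single-rendition and the multiple-renditions case. The function name rootfile_paths is hypothetical; the namespace and the full-path attribute are those of the sample above.

```python
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def rootfile_paths(container_xml):
    """Return the full-path of each rootfile, in document order."""
    root = ET.fromstring(container_xml)
    return [
        rootfile.attrib["full-path"]
        for rootfile in root.findall("c:rootfiles/c:rootfile", CONTAINER_NS)
    ]
```

In practice the input would typically come from the archive itself, for example `zipfile.ZipFile(path).read("META-INF/container.xml")`, and a reading system that ignores extra renditions would simply use the first path returned.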
Apart from container.xml, you may find other files in the META-INF directory, like signatures.xml which holds digital signatures of the container and its contents, metadata.xml and manifest.xml which may contain information about the publication itself (i.e. the container; this is useful in the multiple renditions use case), or proprietary files like com.apple.ibooks.display-options.xml. Their presence is rather exceptional, therefore we won’t describe them in detail here.
The .opf file, a.k.a. package document
This XML file carries bibliographic and structural metadata about an EPUB publication (or an EPUB rendition), and is thus the primary source of information about how to process and display that publication.
In this file, the reading system will find:
Metadata, i.e. information about the publication (or rendition) content. Diverse sets of metadata (e.g. ONIX) can be expressed as XML elements, from different schemes. The only required elements in EPUB 3.0.1 are title, identifier, language and modified; the first three come from the Dublin Core element set, and the last is the dcterms:modified property. A fixed-layout publication must be flagged by a specific metadata item in this set (the rendition:layout property, set to pre-paginated). Other metadata can be expressed inline, using a generic meta element, or as an external resource via a link element.
A manifest, i.e. the exhaustive list of all publication (or rendition) resources, including (X)HTML text chapters, images, video or audio files, fonts, scripts and CSS files. The reading system will only process the files it finds in the list, and knows from their media type (a.k.a. MIME type) whether it can process them. From the properties declared on each item, the reading system will also know its role, e.g. whether the file is a navigation document or a cover image, or whether it contains vector graphics or scripts. If the reading system cannot process a resource because its format is a bit exotic, it will find here the fallback resource it can process instead.
A spine: as its name indicates, this is a “backbone” where the reading system finds the default reading order of all publication “chapters”. As these sections of a publication may not really represent book chapters, each item of the sequence is called … a spine item. Each spine item contains a reference to a manifest item. Spine items can be declared as “non linear”, meaning they are not displayed in the normal flow, but can be reached from another spine item as supplementary content (e.g. popup content).
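The required-metadata rule mentioned in the list above lends itself to a mechanical check. Here is a minimal sketch, assuming the package document uses the usual OPF and Dublin Core namespaces and expresses the modification date as a meta element with property="dcterms:modified"; the function name missing_required_metadata is made up for this example.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def missing_required_metadata(opf_xml):
    """Return the names of the required metadata entries that are absent."""
    metadata = ET.fromstring(opf_xml).find("opf:metadata", NS)
    if metadata is None:
        return ["title", "identifier", "language", "modified"]
    # Dublin Core elements: title, identifier, language.
    missing = [
        name for name in ("title", "identifier", "language")
        if metadata.find(f"dc:{name}", NS) is None
    ]
    # The modification date is carried by a generic meta element.
    has_modified = any(
        meta.get("property") == "dcterms:modified"
        for meta in metadata.findall("opf:meta", NS)
    )
    if not has_modified:
        missing.append("modified")
    return missing
```

This only tests for presence; it does not validate the values (for instance the date format), which a full validator would also do.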
Below is a simple package document with metadata, a manifest and a spine.
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" xmlns:opf="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="BookID">
  <!-- Metadata: information about the publication -->
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="BookID" opf:scheme="isbn">978-3-86680-192-9</dc:identifier>
    <dc:identifier>3b622266-b838-4003-bcb8-b126ee6ae1a2</dc:identifier>
    <dc:title>The title</dc:title>
    <dc:language>fr</dc:language>
    <dc:publisher>The publisher</dc:publisher>
    <dc:creator>The author</dc:creator>
    <dc:contributor>A contributor</dc:contributor>
    <dc:description>A description</dc:description>
    <dc:subject>A subject of the publication</dc:subject>
    <dc:subject>Another subject of the publication</dc:subject>
    <dc:rights>© copyright notice</dc:rights>
    <meta property="dcterms:modified">2020-01-01T01:01:01Z</meta>
  </metadata>
  <!-- Manifest: every resource of the publication, each with a unique id -->
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="cover.jpg" href="Images/cover.jpg" media-type="image/jpeg" properties="cover-image"/>
    <item id="cover.xhtml" href="Text/cover.xhtml" media-type="application/xhtml+xml"/>
    <item id="toc" href="toc.html" media-type="application/xhtml+xml" properties="nav"/>
    <item id="chapter-1.xhtml" href="Text/chapter-1.xhtml" media-type="application/xhtml+xml"/>
    <item id="chapter-2.xhtml" href="Text/chapter-2.xhtml" media-type="application/xhtml+xml"/>
    <item id="publication.css" href="Styles/publication.css" media-type="text/css"/>
    <item id="Andada-Italic.otf" href="Fonts/Andada-Italic.otf" media-type="application/vnd.ms-opentype"/>
    <item id="Andada-Regular.otf" href="Fonts/Andada-Regular.otf" media-type="application/vnd.ms-opentype"/>
    <item id="glyph.png" href="Images/glyph.png" media-type="image/png"/>
  </manifest>
  <!-- Spine: default reading order; each idref points to a manifest item id -->
  <spine toc="ncx">
    <itemref idref="cover.xhtml"/>
    <itemref idref="toc"/>
    <itemref idref="chapter-1.xhtml"/>
    <itemref idref="chapter-2.xhtml"/>
  </spine>
  <guide>
    <reference title="Cover page" type="cover" href="Text/cover.xhtml"/>
    <reference title="Table of contents" type="toc" href="toc.html"/>
  </guide>
</package>
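To show how the manifest and the spine work together, here is a short Python sketch that resolves the default reading order of a package document into archive paths. It assumes that manifest hrefs are plain relative paths (no percent-encoding, no fragment) resolved against the directory of the .opf file; the function name reading_order is illustrative only.

```python
import posixpath
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def reading_order(opf_xml, opf_path):
    """Return (archive path, is_linear) for each spine item, in order."""
    root = ET.fromstring(opf_xml)
    # Index manifest items by id so that spine idrefs can be resolved.
    manifest = {
        item.attrib["id"]: item
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    base = posixpath.dirname(opf_path)
    order = []
    for itemref in root.findall("opf:spine/opf:itemref", OPF_NS):
        item = manifest[itemref.attrib["idref"]]
        # hrefs are relative to the location of the package document.
        path = posixpath.normpath(posixpath.join(base, item.attrib["href"]))
        # Spine items are linear unless explicitly marked linear="no".
        order.append((path, itemref.get("linear", "yes") != "no"))
    return order
```

Called with a package document located at content/content.opf, each href such as Text/chapter-1.xhtml resolves to content/Text/chapter-1.xhtml inside the archive. A spine idref with no matching manifest item raises a KeyError here, which a real reading system would probably want to report more gracefully.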
The legacy .ncx file
A quick word about this file, sometimes found in EPUB 3 containers: this is the deprecated EPUB 2 way of declaring a navigation document. Some EPUB 3 authors still prefer to include it so that EPUB 2 reading systems can process the publication. An EPUB 3 reading system will not access it, so we won’t bother describing its content.
A simple diagram to summarize this
Here is a diagram illustrating the complete structure of an EPUB file.
In this example, we find the .opf file and all content files in a directory named “OEBPS”. Why such a strange name? This is simply historical: Open eBook Publication Structure was the name of a legacy ebook format which has been superseded by the EPUB format. The acronym found its way into the publishing vocabulary and is still used by some EPUB authoring tools when they structure an EPUB publication, so that the .opf file and content files are not stored in the root of the EPUB archive (which would be harmless anyway).