@lingo-reader/epub-parser
An epub parser which can extract chapter contents from an epub file
<h1 align="center">
<a href="https://github.com/hhk-png/lingo-reader">Home Page</a>
<a href="./README-zh.md">中文</a>
</h1>
The EPUB file format is used to store ebook content, containing both the book's chapter materials and files specifying how these chapters should be sequentially read.
An EPUB file is essentially a `.zip` archive. Its content is built with HTML and CSS, and can in principle include JavaScript as well. After changing the file extension to `.zip` and extracting the contents, you can view chapter content directly by opening the corresponding HTML/XHTML files. However, the extracted files carry no reading order, so the chapters will not appear in sequence. If chapters or resources are encrypted, the extracted files cannot be read this way.
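As a quick illustration of the "EPUB is a ZIP archive" point, the sketch below checks whether a byte buffer starts with the ZIP local file header signature (`PK\x03\x04`). The helper name `looksLikeZip` and the sample buffers are hypothetical and are not part of `@lingo-reader/epub-parser`.

```typescript
// ZIP local file header signature: the bytes "PK\x03\x04".
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04]

// Hypothetical helper: true when the buffer starts with a ZIP local file header.
function looksLikeZip(bytes: Uint8Array): boolean {
  if (bytes.length < ZIP_SIGNATURE.length) return false
  return ZIP_SIGNATURE.every((value, index) => bytes[index] === value)
}

// Made-up sample data: only the first bytes matter for this check.
const zipLike = new Uint8Array([0x50, 0x4b, 0x03, 0x04, 0x0a, 0x00])
const plainText = new Uint8Array([0x68, 0x65, 0x6c, 0x6c, 0x6f])

console.log(looksLikeZip(zipLike))
console.log(looksLikeZip(plainText))
```

A check like this only confirms the container format. It says nothing about whether the archive is a valid EPUB.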
**When parsing EPUB files:**
**(1)** The first step involves parsing files like `container.xml`, `.opf`, and `.ncx`, which contain metadata (title, author, publication date, etc.), resource information (paths to images and other assets within the EPUB), and sequential chapter display information (Spine).
**(2)** The second step handles resource paths within chapters. References to resources in chapter files are only valid internally, so they must be converted to paths usable in the display environment—either as blob URLs in browsers or absolute filesystem paths in Node.js.
**(3)** The encryption information of an EPUB file is stored in the `META-INF/encryption.xml` file. Version `0.3.x` supports parsing encrypted EPUB files, but it requires adherence to a specific encryption scheme and the provision of a private key for decryption. The supported encryption methods are detailed in the `initEpubFile` section.
**(4)** In addition, EPUB files may also include signatures and rights management information, stored in the `signatures.xml` and `rights.xml` files, respectively. Like `container.xml`, these files are located in the `/META-INF/` directory and have fixed filenames. Support for parsing these files will be added in future updates of `@lingo-reader/epub-parser`.
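Steps (1) and (2) can be sketched roughly as follows. This is a minimal illustration, not the parser's actual implementation: `findOpfPath` uses a regular expression where a real parser would use an XML parser, and both helper names are hypothetical.

```typescript
// Step (1), simplified: read the `full-path` attribute of the first <rootfile>
// element in META-INF/container.xml. It points at the .opf package file.
function findOpfPath(containerXml: string): string | undefined {
  const match = /<rootfile\b[^>]*\bfull-path\s*=\s*["']([^"']+)["']/.exec(containerXml)
  return match ? match[1] : undefined
}

// Step (2), simplified: resolve an href found inside a chapter against the
// chapter's own path in the archive, yielding an archive-internal path that
// can then be mapped to a blob URL or a filesystem path.
function resolveArchivePath(chapterPath: string, href: string): string {
  const segments = chapterPath.split('/').slice(0, -1) // directory of the chapter
  for (const part of href.split('/')) {
    if (part === '..') segments.pop()
    else if (part !== '.' && part !== '') segments.push(part)
  }
  return segments.join('/')
}

// Made-up sample input.
const containerXml = `<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>`

console.log(findOpfPath(containerXml))
console.log(resolveArchivePath('OEBPS/text/ch1.xhtml', '../images/cover.png'))
```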
The parser follows the [EPUB 3.3](https://www.w3.org/TR/epub-33/#sec-pkg-metadata) and [Open Packaging Format (OPF) 2.0.1 v1.0](https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.4.1) specifications. Its API aims to expose all available file information comprehensively.
## Install
```shell
pnpm install @lingo-reader/epub-parser
```
## Usage in node
```typescript
import { initEpubFile } from '@lingo-reader/epub-parser'
const epub = await initEpubFile('./example/alice.epub')
const spine = epub.getSpine()
const fileInfo = epub.getFileInfo()
// Load the first chapter:
// - html: Processed chapter HTML string
// - css: Chapter CSS files (absolute paths in Node.js, directly readable)
const { html, css } = await epub.loadChapter(spine[0].id)
// ...
```
## Usage in browser
```ts
import { initEpubFile } from '@lingo-reader/epub-parser'
async function initEpub(file: File) {
const epub = await initEpubFile(file)
const spine = epub.getSpine()
const fileInfo = epub.getFileInfo()
// Load the first chapter:
// - html: Processed chapter HTML string
// - css: Chapter CSS files (provided as blob URLs, fetchable)
const { html, css } = await epub.loadChapter(spine[0].id)
}
// ...
```
## initEpubFile
```typescript
import { initEpubFile } from '@lingo-reader/epub-parser'
import type { EpubFile } from '@lingo-reader/epub-parser'
/*
interface EpubFileOptions {
rsaPrivateKey?: string | Uint8Array
aesSymmetricKey?: string | Uint8Array
}
type initEpubFile = (epubPath: string | File | Uint8Array, resourceSaveDir?: string, options?: EpubFileOptions) => Promise<EpubFile>
defaults: resourceSaveDir = './images', options = {}
*/
const epub: EpubFile = await initEpubFile(
file,
'./images', // The default is './images'. If you don't want to change it, you can simply pass undefined.
{
// The RSA private key in PKCS#8 format should be provided either as a Base64-encoded string or a Uint8Array.
rsaPrivateKey: 'MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQ......',
aesSymmetricKey: 'D2wVcst49HU6KqC......',
}
)
```
The primary API exposed by `@lingo-reader/epub-parser` is `initEpubFile`. Given a file path or `File` object, it returns a `Promise` that resolves to an initialized `EpubFile` instance, which provides methods to read metadata, Spine information, and other EPUB data.
**Parameters:**
- `epubPath: string | File | Uint8Array`: File path, `File` object, or raw bytes (see the note below for the types each environment accepts).
- `resourceSaveDir?: string`: Optional (Node.js only). Specifies where to save resources like images.
  - default: `'./images'`
- `options?: EpubFileOptions`: Optional. Used to pass in key information.
```typescript
interface EpubFileOptions {
// The RSA private key in PKCS#8 format should be provided either as a Base64-encoded string or a Uint8Array.
rsaPrivateKey?: string | Uint8Array
aesSymmetricKey?: string | Uint8Array
}
```
**Returns:**
- `Promise<EpubFile>`: Resolves to the initialized `EpubFile` object.
**Note:** For the `epubPath` parameter, its type differs between environments:
- In the **browser**, it should be of type `File | Uint8Array`. Passing a `string` will result in an error.
- In **Node.js**, it should be of type `string | Uint8Array`. Passing a `File` will result in an error.
The `0.3.x` version of `epub-parser` supports decryption using two encryption schemes:
1. **Hybrid RSA + AES Encryption**
In this approach, a symmetric AES key is encrypted using the RSA algorithm (asymmetric encryption), and the actual file contents are encrypted with that AES key. During decryption, the AES key is first recovered using an RSA private key, and then the AES key is used to decrypt the content. To enable this, you must provide an RSA private key in **PKCS#8** format via the `rsaPrivateKey` option in `EpubFileOptions`.
This method supports storing multiple AES key entries within the `encryption.xml` file.
2. **Pure AES Encryption**
This method skips RSA and directly encrypts file contents using a symmetric AES key. In this case, the `aesSymmetricKey` option must be provided for decryption.
The decryption logic is implemented in the `parseEncryption` method within `epub-parser/src/parseFiles.ts`.
Decryption does **not** rely on any third-party libraries—it is built on the native **Web Crypto API** in the browser and **Node's crypto module**, allowing the parser to run in both browser and Node environments.
Note that the browser supports fewer cryptographic algorithms than Node; however, all browser-supported algorithms are also available in Node. Therefore, the set of supported algorithms is aligned with browser compatibility, effectively a subset of Node's capabilities.
**Supported Algorithms:**
- **Asymmetric (RSA)**:
- `RSA-OAEP`
- `RSA-OAEP-MGF1P`
- **Symmetric (AES)**:
- `AES-256-CBC`
- `AES-256-CTR`
- `AES-256-GCM`
- `AES-128-CBC`
- `AES-128-CTR`
- `AES-128-GCM`
**AES-192** is fully supported in Node.js but not in browsers, where an error is thrown if it was used to encrypt the EPUB content. The IV used for encryption must be placed at the beginning of the encrypted file. The expected AES key lengths are:
- **256-bit**: 32 bytes
- **192-bit**: 24 bytes
- **128-bit**: 16 bytes
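The key-length and IV conventions above can be illustrated with two small helpers. This is a sketch under stated assumptions, not the library's code: the names `aesKeyBits` and `splitIv` are hypothetical, and the IV length is left to the caller because this document does not state it (16 bytes is used in the sample purely for illustration).

```typescript
type AesKeyBits = 128 | 192 | 256

// Maps a raw AES key to its size in bits, rejecting any other length.
function aesKeyBits(key: Uint8Array): AesKeyBits {
  switch (key.length) {
    case 16: return 128
    case 24: return 192
    case 32: return 256
    default: throw new Error(`Unexpected AES key length: ${key.length} bytes`)
  }
}

// Splits an encrypted resource into its leading IV and the remaining ciphertext.
// `ivLength` is an assumption supplied by the caller.
function splitIv(data: Uint8Array, ivLength: number): { iv: Uint8Array, ciphertext: Uint8Array } {
  if (data.length < ivLength) throw new Error('Encrypted data is shorter than the IV')
  return { iv: data.slice(0, ivLength), ciphertext: data.slice(ivLength) }
}

// Made-up sample: 16 IV bytes followed by 4 ciphertext bytes.
const encrypted = new Uint8Array(20).map((_, index) => index)
const { iv, ciphertext } = splitIv(encrypted, 16)

console.log(aesKeyBits(new Uint8Array(32)), iv.length, ciphertext.length)
```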
## EpubFile
The EpubFile class exposes these methods:
```typescript
import { EpubFile } from '@lingo-reader/epub-parser'
import { EBookParser } from '@lingo-reader/shared'
declare class EpubFile implements EBookParser {
getFileInfo(): EpubFileInfo
getMetadata(): EpubMetadata
getManifest(): Record<string, ManifestItem>
getSpine(): EpubSpine
getGuide(): GuideReference[]
getCollection(): CollectionItem[]
getToc(): EpubToc
getPageList(): PageList
getNavList(): NavList
loadChapter(id: string): Promise<EpubProcessedChapter>
resolveHref(href: string): EpubResolvedHref | undefined
destroy(): void
}
```
### getManifest(): Record<string, ManifestItem>
Retrieves all resources contained in the EPUB (HTML files, images, etc.).
```typescript
import type { ManifestItem } from '@lingo-reader/epub-parser'
/*
type getManifest = () => Record<string, ManifestItem>
*/
// Keys represent resource `id`
const manifest: Record<string, ManifestItem> = epub.getManifest()
```
**Parameters:**
- None
**Returns:**
- `Record<string, ManifestItem>` - A dictionary mapping each resource `id` to its descriptor:
```typescript
interface ManifestItem {
// Unique resource identifier
id: string
// Path within the EPUB (ZIP) archive
href: string
// MIME type (e.g., "application/xhtml+xml")
mediaType: string
// Special role (e.g., "cover-image")
properties?: string
// Associated media overlay for audio/video
mediaOverlay?: string
// Fallback resources when this item cannot be loaded
fallback?: string[]
}
```
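A manifest of this shape can be queried with plain object helpers. The sketch below re-declares a reduced `ManifestItem` locally so it runs on its own; the helper names and sample data are hypothetical. It treats `properties` as a space-separated list, as in the EPUB 3 package format.

```typescript
// Local, reduced copy of the ManifestItem shape documented above.
interface ManifestItem {
  id: string
  href: string
  mediaType: string
  properties?: string
}

// Returns all manifest items whose MIME type starts with the given prefix.
function findByMediaType(manifest: Record<string, ManifestItem>, prefix: string): ManifestItem[] {
  return Object.values(manifest).filter(item => item.mediaType.startsWith(prefix))
}

// Returns the item flagged with the "cover-image" property, if any.
function findCoverItem(manifest: Record<string, ManifestItem>): ManifestItem | undefined {
  return Object.values(manifest).find(item =>
    (item.properties ?? '').split(/\s+/).includes('cover-image'))
}

// Made-up manifest for illustration.
const sampleManifest: Record<string, ManifestItem> = {
  cover: { id: 'cover', href: 'OEBPS/images/cover.jpg', mediaType: 'image/jpeg', properties: 'cover-image' },
  ch1: { id: 'ch1', href: 'OEBPS/text/ch1.xhtml', mediaType: 'application/xhtml+xml' },
  fig1: { id: 'fig1', href: 'OEBPS/images/fig1.png', mediaType: 'image/png' },
}

console.log(findByMediaType(sampleManifest, 'image/').map(item => item.id))
console.log(findCoverItem(sampleManifest)?.href)
```

With a real book, the value returned by `epub.getManifest()` would take the place of `sampleManifest`.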
### getSpine(): EpubSpine
Returns the reading order of all content documents in the EPUB.
The `linear` property in `SpineItem` indicates whether the item is part of the primary reading flow (values: "yes" or "no").
```typescript
import type { EpubSpine } from '@lingo-reader/epub-parser'
/*
type getSpine = () => EpubSpine
*/
const spine: EpubSpine = epub.getSpine()
```
**Parameters:**
- None
**Returns:**
- `EpubSpine` - An ordered array of spine items:
```typescript
type SpineItem = ManifestItem & {
/**
* Reading progression flag
* - "yes": Primary reading content (default)
* - "no": Supplementary material
*/
linear?: string
}
type EpubSpine = SpineItem[]
```
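Because `linear` defaults to "yes" when absent, a reader that wants only the primary reading flow can keep every item whose `linear` value is not "no". The sketch below uses a reduced local `SpineItem` type and made-up data; `primaryReadingOrder` is a hypothetical helper name.

```typescript
// Local, reduced copy of the SpineItem shape documented above.
interface SpineItem {
  id: string
  href: string
  mediaType: string
  linear?: string
}

// Keeps items in the primary reading flow: linear is "yes" or omitted.
function primaryReadingOrder(spine: SpineItem[]): SpineItem[] {
  return spine.filter(item => item.linear !== 'no')
}

// Made-up spine for illustration.
const sampleSpine: SpineItem[] = [
  { id: 'cover', href: 'cover.xhtml', mediaType: 'application/xhtml+xml', linear: 'no' },
  { id: 'ch1', href: 'ch1.xhtml', mediaType: 'application/xhtml+xml', linear: 'yes' },
  { id: 'ch2', href: 'ch2.xhtml', mediaType: 'application/xhtml+xml' },
]

console.log(primaryReadingOrder(sampleSpine).map(item => item.id))
```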
### loadChapter(id: string): Promise\<EpubProcessedChapter\>
The `loadChapter` method takes a chapter `id` and returns a `Promise` that resolves to the processed chapter object, or to `undefined` if the chapter doesn't exist.
```typescript
const spine = epub.getSpine()
const fileInfo = epub.getFileInfo()
// Load the first chapter. 'html' is the processed HTML chapter string,
// 'css' is the chapter's CSS file, provided as an absolute path in Node.js,
// which can be directly read.
const { html, css } = await epub.loadChapter(spine[0].id)
```
**Parameters:**
- `id: string` - The chapter `id` from spine
**Returns:**
- `Promise<EpubProcessedChapter | undefined>` - Processed chapter content
```typescript
// css
interface EpubCssPart {
id: string
href: string
}
// media-overlay
interface Par {
// element id
textDOMId: string
// unit: s
clipBegin: number
clipEnd: number
}
interface SmilAudio {
audioSrc: string
pars: Par[]
}
type SmilAudios = SmilAudio[]
// chapter
interface EpubProcessedChapter {
css: EpubCssPart[]
html: string
mediaOverlays?: SmilAudios
}
```
In an EPUB ebook file, each chapter is typically an XHTML (or HTML) file. The processed chapter object therefore consists of two parts: the HTML content string under the `<body>` tag, and the CSS. The CSS is parsed from the `<link>` tags in the chapter file. Each entry is an `EpubCssPart` whose `href` field is a blob URL in the browser (or an absolute filesystem path in Node.js), together with a corresponding `id`. In the browser, the blob URL can be referenced directly in a `<link>` tag or fetched with the Fetch API; in Node.js, the file at the absolute path can be read to obtain the CSS text for further processing.
In EPUB, SMIL files enable read-aloud functionality by mapping segments of an audio track to specific text elements in the document. During playback, the current audio time can be used to locate the corresponding text element and highlight it in the DOM. When processed, a SMIL file is represented in an `EpubProcessedChapter` as the optional `mediaOverlays` property.
- **mediaOverlays** — an array of `SmilAudio` objects
- **SmilAudio**
- **audioSrc**: the path to the audio file
- **pars**: an array of `Par` mappings
- **Par**
- **textDOMId**: the ID of the associated text element
- **clipBegin**: the start time of the audio segment (in seconds)
- **clipEnd**: the end time of the audio segment (in seconds)
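The highlight lookup described above can be sketched as a search for the `Par` whose time range contains the current playback position. The types are re-declared locally so the example runs on its own; `findActivePar` and the sample data are hypothetical, and treating `clipEnd` as exclusive is an assumption of this sketch.

```typescript
// Local copies of the media overlay shapes documented above.
interface Par {
  textDOMId: string
  clipBegin: number // seconds
  clipEnd: number // seconds
}
interface SmilAudio {
  audioSrc: string
  pars: Par[]
}

// Returns the Par playing at `currentTime` (seconds), if any.
// clipBegin is inclusive and clipEnd is exclusive in this sketch.
function findActivePar(audio: SmilAudio, currentTime: number): Par | undefined {
  return audio.pars.find(par => currentTime >= par.clipBegin && currentTime < par.clipEnd)
}

// Made-up overlay for illustration.
const sampleAudio: SmilAudio = {
  audioSrc: 'audio/ch1.mp3',
  pars: [
    { textDOMId: 'p1', clipBegin: 0, clipEnd: 2.5 },
    { textDOMId: 'p2', clipBegin: 2.5, clipEnd: 5 },
  ],
}

console.log(findActivePar(sampleAudio, 3)?.textDOMId)
```

In a UI, the returned `textDOMId` could be used to look up the element in the rendered chapter and toggle a highlight class on it.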
Internal chapter navigation in EPUBs is handled through `<a>` tags' `href` attributes. To distinguish internal links from external links and facilitate internal navigation logic, internal links are prefixed with `epub:`. These links can be resolved using the `resolveHref` function. The handling of such links is managed at the UI layer, while `epub-parser` only provides the corresponding chapter HTML and selector functionality.
### resolveHref(href: string): EpubResolvedHref | undefined
`resolveHref` parses internal links into a chapter ID and a CSS selector within the book's HTML.
If an external link (e.g., `https://www.example.com`) or an invalid internal link is provided, it returns `undefined`.
```typescript
const toc: EpubToc = epub.getToc()
// 'id' is the chapter ID, 'selector' is a DOM selector (e.g., `[id="ididid"]`)
const resolved = epub.resolveHref(toc[0].href)
if (resolved) {
  const { id, selector } = resolved
}
```
**Parameters:**
- `href: string`: The internal resource path.
**Returns:**
- `EpubResolvedHref | undefined`: The resolved internal link. Returns `undefined` if the path is invalid.
```typescript
interface EpubResolvedHref {
id: string
selector: string
}
```
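One way a UI layer might use this is to classify a clicked link before acting on it. The sketch below is an illustration only: `classifyLink`, `LinkTarget`, and the stub resolver are hypothetical, and the sample `epub:` href is made up (the exact format after the prefix is not specified here). The resolver is passed in as a function with the same shape as `resolveHref`.

```typescript
interface EpubResolvedHref {
  id: string
  selector: string
}

type LinkTarget =
  | { kind: 'internal', id: string, selector: string }
  | { kind: 'external', href: string }
  | { kind: 'invalid', href: string }

// Classifies an href using the `epub:` prefix and a resolveHref-like function.
function classifyLink(
  href: string,
  resolveHref: (href: string) => EpubResolvedHref | undefined,
): LinkTarget {
  if (!href.startsWith('epub:')) return { kind: 'external', href }
  const resolved = resolveHref(href)
  return resolved ? { kind: 'internal', ...resolved } : { kind: 'invalid', href }
}

// Stub resolver standing in for `epub.resolveHref`; the key is a made-up href.
const known = new Map<string, EpubResolvedHref>([
  ['epub:OEBPS/ch2.xhtml#sec1', { id: 'ch2', selector: '[id="sec1"]' }],
])
const stubResolve = (href: string): EpubResolvedHref | undefined => known.get(href)

console.log(classifyLink('epub:OEBPS/ch2.xhtml#sec1', stubResolve))
console.log(classifyLink('https://www.example.com', stubResolve))
```

For an internal target, the UI would then load the chapter by `id` and scroll to the element matching `selector`.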
### getToc(): EpubToc
The `toc` structure corresponds to the `navMap` section of the EPUB's `.ncx` file, which contains the book's navigation hierarchy.
```typescript
import type { EpubToc } from '@lingo-reader/epub-parser'
/*
type getToc = () => EpubToc
*/
const toc: EpubToc = epub.getToc()
```
**Parameters:**
- none
**Returns:**
- `EpubToc`:
```typescript
interface NavPoint {
// Display text of the table of contents entry
label: string
// Resource path within the EPUB file (preprocessed format).
// Can be resolved using resolveHref()
href: string
// Chapter identifier
id: string
// Reading order sequence
playOrder: string
// Nested sub-entries (optional)
children?: NavPoint[]
}
/** EPUB table of contents structure (NCX navMap representation) */
type EpubToc = NavPoint[]
```
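Since `NavPoint` entries nest through `children`, rendering a table of contents usually means walking the tree. The sketch below flattens it depth-first into a list with an indentation level. It uses a reduced local `NavPoint` type and made-up data; `flattenToc` and `FlatTocEntry` are hypothetical names.

```typescript
// Local, reduced copy of the NavPoint shape documented above.
interface NavPoint {
  label: string
  href: string
  id: string
  children?: NavPoint[]
}

interface FlatTocEntry {
  label: string
  href: string
  depth: number
}

// Depth-first traversal that records each entry with its nesting depth.
function flattenToc(toc: NavPoint[], depth = 0, out: FlatTocEntry[] = []): FlatTocEntry[] {
  for (const point of toc) {
    out.push({ label: point.label, href: point.href, depth })
    if (point.children) flattenToc(point.children, depth + 1, out)
  }
  return out
}

// Made-up table of contents for illustration.
const sampleToc: NavPoint[] = [
  {
    label: 'Part I',
    href: 'epub:part1.xhtml',
    id: 'part1',
    children: [
      { label: 'Chapter 1', href: 'epub:ch1.xhtml', id: 'ch1' },
      { label: 'Chapter 2', href: 'epub:ch2.xhtml', id: 'ch2' },
    ],
  },
  { label: 'Appendix', href: 'epub:appendix.xhtml', id: 'appendix' },
]

for (const entry of flattenToc(sampleToc)) {
  console.log(`${'  '.repeat(entry.depth)}${entry.label}`)
}
```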
### getCoverImage(): string
> Supported since **v0.4.1**.
Returns the URL of the cover image.
**Parameters:**
- none
**Returns:**
- `string`: The URL of the cover image.
### destroy(): void
Cleans up generated resources (like blob URLs) created during file parsing
to prevent memory leaks. In Node.js environments, it also deletes corresponding temporary files.
### getFileInfo(): EpubFileInfo
```typescript
import type { EpubFileInfo } from '@lingo-reader/epub-parser'
/*
type getFileInfo = () => EpubFileInfo
*/
const fileInfo: EpubFileInfo = epub.getFileInfo()
```
`EpubFileInfo` currently includes two attributes: `fileName` is the file name, and `mimetype` is the media type of the EPUB file. The `mimetype` value is read from the `/mimetype` file, but it is always `application/epub+zip`.
**Parameters:**
- none
**Returns:**
- `EpubFileInfo`:
```typescript
interface EpubFileInfo {
fileName: string
mimetype: string
}
```
### getMetadata(): EpubMetadata
The metadata recorded in the book.
```typescript
import type { EpubMetadata } from '@lingo-reader/epub-parser'
/*
type getMetadata = () => EpubMetadata
*/
const metadata: EpubMetadata = epub.getMetadata()
```
**Parameters:**
- none
**Returns:**
- `EpubMetadata`:
```typescript
interface EpubMetadata {
// Title of the book
title: string
// Language of the book
language: string
// Description of the book
description?: string
// Publisher of the EPUB file
publisher?: string
// General type/genre of the book, such as novel, biography, etc.
type?: string
// MIME type of the EPUB file
format?: string
// Original source of the book content
source?: string
// Related external resources
relation?: string
// Coverage of the publication content
coverage?: string
// Copyright statement
rights?: string
// Includes creation time, publication date, update time, etc. of the book
// Specific fields depend on opf:event, such as modification
date?: Record<string, string>
identifier: Identifier
packageIdentifier: Identifier
creator?: Contributor[]
contributor?: Contributor[]
subject?: Subject[]
metas?: Record<string, string>
links?: Link[]
}
```
#### identifier: Identifier
`id` represents the unique identifier of the resource. The `scheme` specifies the system or authority used to generate or assign the identifier, such as ISBN or DOI. `identifierType` indicates the type of identifier used by `id`, which is similar to `scheme`.
```typescript
interface Identifier {
id: string
scheme?: string
identifierType?: string
}
```
#### packageIdentifier: Identifier
It is essentially also an `Identifier`. Typically, within the `<package>` tag, it is referenced using the `unique-identifier` attribute, whose value corresponds to the `id` of the relevant `<identifier>` element.
```xml
<package unique-identifier="id">
<dc:identifier id="id" opf:scheme="URI">uuid:19c0c5cb-002b-476f-baa7-fcf510414f95</dc:identifier>
</package>
```
#### creator?: Contributor[]
Describes the various contributors.
```typescript
interface Contributor {
// Name of the contributor
contributor: string
// Sort-friendly version of the name
fileAs?: string
// Role of the contributor
role?: string
// The encoding scheme used for role or alternateScript,
// can also represent a language, such as English or Chinese
scheme?: string
// Alternative script or writing system for the contributor's name
alternateScript?: string
}
```
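As a small usage illustration, the `creator` array can be turned into a display string. The sketch uses a reduced local `Contributor` type and made-up data. `formatByline` is a hypothetical helper, and showing `role` in parentheses is just one possible presentation.

```typescript
// Local, reduced copy of the Contributor shape documented above.
interface Contributor {
  contributor: string
  fileAs?: string
  role?: string
}

// Joins contributor names, appending the role in parentheses when present.
function formatByline(creators: Contributor[] = []): string {
  return creators
    .map(c => (c.role ? `${c.contributor} (${c.role})` : c.contributor))
    .join(', ')
}

// Made-up contributors for illustration.
const sampleCreators: Contributor[] = [
  { contributor: 'Lewis Carroll', fileAs: 'Carroll, Lewis', role: 'aut' },
  { contributor: 'John Tenniel', role: 'ill' },
  { contributor: 'Anonymous' },
]

console.log(formatByline(sampleCreators))
```

Passing `undefined` (for a book without `creator` metadata) yields an empty string.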
#### subject?: Subject[]
The subject or theme of the book.
```typescript
interface Subject {
// Subject, such as fiction, essay, etc.
subject: string
// The authority or organization providing the code or identifier
authority?: string
// Associated subject code or term
term?: string
}
```
#### links?: Link[]
Provides additional related resources or external links.
```typescript
interface Link {
// URL or path to the resource
href: string
// Language of the resource
hreflang?: string
// id
id?: string
// MIME type of the resource (e.g., image/jpeg, application/xml)
mediaType?: string
// Additional properties
properties?: string
// Purpose or function of the link
rel: string
}
```
### getGuide(): EpubGuide
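Returns the references listed under the `<guide>` element of the `.opf` file, which point to key parts of the book such as the cover or table of contents. They can serve as preview chapters of the book; the first few chapters from the spine can be used for that purpose instead.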
```typescript
import type { EpubGuide } from '@lingo-reader/epub-parser'
/*
type getGuide = () => EpubGuide
*/
const guide: EpubGuide = epub.getGuide()
```
**Parameters:**
- none
**Returns:**
- `EpubGuide`:
```typescript
interface GuideReference {
title: string
// The role of the resource, such as toc, loi, cover-image, etc.
type: string
// The path to the resource within the EPUB file
href: string
}
type EpubGuide = GuideReference[]
```
### getCollection(): EpubCollection
The content under the `<collection>` tag in the `.opf` file, used to specify whether an EPUB file belongs to a specific collection, such as a series, category, or a particular group of publications.
```typescript
import type { EpubCollection } from '@lingo-reader/epub-parser'
/*
type getCollection = () => EpubCollection
*/
const collection: EpubCollection = epub.getCollection()
```
**Parameters:**
- none
**Returns:**
- `EpubCollection`:
```typescript
interface CollectionItem {
// The role played within the Collection
role: string
// Links to related resources
links: string[]
}
type EpubCollection = CollectionItem[]
```
### getPageList(): PageList
Refer to [https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.4.1.2](https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.4.1.2), where the `correspondId` refers to the resource's ID, and the rest correspond to the specifications.
```typescript
import type { PageList } from '@lingo-reader/epub-parser'
/*
type getPageList = () => PageList
*/
const pageList: PageList = epub.getPageList()
```
**Parameters:**
- none
**Returns:**
- `PageList`:
```typescript
interface PageTarget {
label: string
// Page number
value: string
href: string
playOrder: string
type: string
correspondId: string
}
interface PageList {
label: string
pageTargets: PageTarget[]
}
```
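A typical use of the page list is a "go to page" lookup. The sketch below searches `pageTargets` by the `value` field using reduced local types and made-up data; `findPageTarget` is a hypothetical helper, not part of the library.

```typescript
// Local, reduced copies of the page list shapes documented above.
interface PageTarget {
  label: string
  value: string // page number
  href: string
}
interface PageList {
  label: string
  pageTargets: PageTarget[]
}

// Finds the target for a given page number, if the book defines one.
function findPageTarget(pageList: PageList, page: number | string): PageTarget | undefined {
  const wanted = String(page)
  return pageList.pageTargets.find(target => target.value === wanted)
}

// Made-up page list for illustration.
const samplePageList: PageList = {
  label: 'Pages',
  pageTargets: [
    { label: '1', value: '1', href: 'epub:ch1.xhtml#page1' },
    { label: '2', value: '2', href: 'epub:ch1.xhtml#page2' },
  ],
}

console.log(findPageTarget(samplePageList, 2)?.href)
```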
### getNavList(): NavList
Refer to [https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.4.1.2](https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.4.1.2), where the `correspondId` refers to the resource's ID, `label` corresponds to the content of `navLabel.text`, and `href` is the path to the resource within the EPUB file.
```typescript
import type { NavList } from '@lingo-reader/epub-parser'
/*
type getNavList = () => NavList
*/
const navList: NavList = epub.getNavList()
```
**Parameters:**
- none
**Returns:**
- `NavList`:
```typescript
interface NavTarget {
label: string
href: string
correspondId: string
}
interface NavList {
label: string
navTargets: NavTarget[]
}
```