# `datocms-html-to-structured-text`

This package contains utilities to convert HTML (or a [Hast](https://github.com/syntax-tree/hast) tree) to a DatoCMS Structured Text `dast` (DatoCMS Abstract Syntax Tree) document.

Please refer to [the `dast` format docs](https://www.datocms.com/docs/structured-text/dast) to learn more about the syntax tree format and the available nodes.

## Usage

The main utility in this package is `htmlToStructuredText`, which takes a string of HTML and transforms it into a valid `dast` document.

`htmlToStructuredText` returns a `Promise` that resolves with a Structured Text document.

```js
import { htmlToStructuredText } from 'datocms-html-to-structured-text';

const html = `
  <article>
    <h1>DatoCMS</h1>
    <p>The most complete, user-friendly and performant Headless CMS.</p>
  </article>
`;

htmlToStructuredText(html).then((structuredText) => {
  console.log(structuredText);
});
```

`htmlToStructuredText` is meant to be used in a browser environment.

In Node.js you can use the `parse5ToStructuredText` helper, which instead takes a document generated with `parse5`.

```js
import parse5 from 'parse5';
import { parse5ToStructuredText } from 'datocms-html-to-structured-text';

parse5ToStructuredText(
  parse5.parse(html, {
    sourceCodeLocationInfo: true,
  }),
).then((structuredText) => {
  console.log(structuredText);
});
```

Internally, both utilities work on a [Hast](https://github.com/syntax-tree/hast) tree. Should you have a `hast` tree already, you can use a third utility called `hastToDast`.

## Validate `dast` documents

`dast` is a strict format for DatoCMS' Structured Text fields. As such, the resulting document is generally a simplified, content-centric version of the input HTML.

When possible, the library relies on semantic HTML to generate a valid `dast` document.

The `datocms-structured-text-utils` package provides a `validate` utility that checks whether the resulting tree is compatible with DatoCMS' Structured Text field.
```js
import { validate } from 'datocms-structured-text-utils';

// ...

htmlToStructuredText(html).then((structuredText) => {
  const { valid, message } = validate(structuredText);

  if (!valid) {
    throw new Error(message);
  }
});
```

We recommend validating every `dast` document to avoid errors later when creating records.

## Advanced Usage

### Options

All the `*ToStructuredText` utils accept an optional `options` object as second argument:

```js
type Options = Partial<{
  newlines: boolean,
  // Override existing `hast` node handlers or add new ones
  handlers: Record<string, CreateNodeFunction>,
  // Allows to tweak the `hast` tree before transforming it to a `dast` document
  preprocess: (hast: HastRootNode) => HastRootNode,
  // Array of allowed block nodes
  allowedBlocks: Array<
    BlockquoteType | CodeType | HeadingType | LinkType | ListType,
  >,
  // Array of allowed marks
  allowedMarks: Mark[],
  // Array of allowed heading levels for 'heading' nodes
  allowedHeadingLevels: Array<1 | 2 | 3 | 4 | 5 | 6>,
}>;
```

### Transforming Nodes

The utils in this library traverse a `hast` tree and transform supported nodes to `dast` nodes. The transformation is done by working on a `hast` node with an (async) handler function.

Handlers are associated with `hast` nodes by `tagName`, or by `type` when `node.type !== 'element'`, and look as follows:

```js
import { visitChildren } from 'datocms-html-to-structured-text';

// Handler for the <p> tag.
async function p(createDastNode, hastNode, context) {
  return createDastNode('paragraph', {
    children: await visitChildren(createDastNode, hastNode, context),
  });
}
```

Handlers can return either a promise that resolves to a `dast` node, an array of `dast` nodes, or `undefined` to skip the current node.

To ensure that a valid `dast` is generated, the default handlers also check that the current `hastNode` is a valid `dast` node for its parent and, if not, they ignore the current node and continue visiting its children.
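The dispatch and return-value rules described above can be modeled in a few lines. The following is a minimal, self-contained sketch and not the library's implementation: `toyCreateNode`, `toyVisit`, `toyVisitChildren`, `toyHandlers`, and `toyTree` are hypothetical names introduced for illustration, and the toy simply skips nodes without a handler, which is an assumption of this model.

```js
// Simplified model of handler dispatch. This is NOT the library's code:
// every `toy*` name below is a hypothetical helper used for illustration.

// Build a plain dast-like node.
function toyCreateNode(type, props) {
  return { type, ...props };
}

// Pick a handler by `tagName` for elements, by `type` otherwise.
async function toyVisit(hastNode, handlers) {
  const key = hastNode.type === 'element' ? hastNode.tagName : hastNode.type;
  const handler = handlers[key];
  // Assumption of this toy model: nodes without a handler are skipped.
  if (typeof handler !== 'function') return undefined;
  return handler(toyCreateNode, hastNode, { handlers });
}

// Visit children in order, dropping `undefined` and flattening arrays.
async function toyVisitChildren(hastNode, handlers) {
  const results = [];
  for (const child of hastNode.children || []) {
    const result = await toyVisit(child, handlers);
    if (result === undefined) continue;
    if (Array.isArray(result)) results.push(...result);
    else results.push(result);
  }
  return results;
}

const toyHandlers = {
  // Resolves to a single node.
  p: async (createNode, node, context) =>
    createNode('paragraph', {
      children: await toyVisitChildren(node, context.handlers),
    }),
  // Resolves to an array: the wrapper is dropped, its children are kept.
  div: (createNode, node, context) => toyVisitChildren(node, context.handlers),
  // Non-element nodes are matched by `type`.
  text: async (createNode, node) => createNode('span', { value: node.value }),
  // Resolves to undefined: the node is skipped.
  script: async () => undefined,
};

const toyTree = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'div',
      children: [
        {
          type: 'element',
          tagName: 'p',
          children: [{ type: 'text', value: 'Hello' }],
        },
        { type: 'element', tagName: 'script', children: [] },
      ],
    },
  ],
};

const toyResult = toyVisitChildren(toyTree, toyHandlers);
toyResult.then((nodes) => console.log(JSON.stringify(nodes)));
// prints [{"type":"paragraph","children":[{"type":"span","value":"Hello"}]}]
```

In this model the `<div>` wrapper disappears because its handler returns its children as an array, and the `<script>` element disappears because its handler returns `undefined`.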
Information about the parent `dast` node name is available in `context.parentNodeType`.

Please take a look at the [default handlers implementation](./handlers.ts) for examples.

The default handlers are available on `context.defaultHandlers`.

### context

Every handler receives a `context` object that includes the following information:

```js
export interface GlobalContext {
  // Whether the library has found a <base> tag or should not look further.
  // See https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
  baseUrlFound?: boolean;
  // <base> tag url. This is used for resolving relative URLs.
  baseUrl?: string;
}

export interface Context {
  // The current parent `dast` node type.
  parentNodeType: NodeType;
  // The parent `hast` node.
  parentNode: HastNode;
  // A reference to the current handlers - merged default + user handlers.
  handlers: Record<string, Handler<unknown>>;
  // A reference to the default handlers record (map).
  defaultHandlers: Record<string, Handler<unknown>>;
  // true if the content can include newlines, and false if not (such as in headings).
  wrapText: boolean;
  // Marks for span nodes.
  marks?: Mark[];
  // Prefix for language detection in code blocks.
  // Detection is done on a class name eg class="language-html"
  // Default is `language-`
  codePrefix?: string;
  // Array of allowed Block types.
  allowedBlocks: Array<
    BlockquoteType | CodeType | HeadingType | LinkType | ListType,
  >;
  // Array of allowed marks.
  allowedMarks: Mark[];
  // Properties in this object are available to every handler as Context
  // is not deeply cloned.
  global: GlobalContext;
}
```

### Custom Handlers

It is possible to register custom handlers and override the default behavior via options:

```js
import { paragraphHandler } from './customHandlers';

htmlToStructuredText(html, {
  handlers: {
    p: paragraphHandler,
  },
}).then((structuredText) => {
  console.log(structuredText);
});
```

It is **highly encouraged** to validate the `dast` when using custom handlers, because handlers are responsible for dictating valid parent-children relationships and therefore for generating a tree that is compliant with DatoCMS' Structured Text.

## preprocessing

Because of the strictness of the `dast` spec, it is possible that some semantics or elements might be lost during the transformation.

To improve the final result, you might want to modify the `hast` before it is transformed to `dast` with the `preprocess` hook.

```js
import { findAll } from 'unist-utils-core';

const html = `
  <p>convert this to an h1</p>
`;

htmlToStructuredText(html, {
  preprocess: (tree) => {
    // Transform <p> to <h1>
    findAll(tree, (node) => {
      if (node.type === 'element' && node.tagName === 'p') {
        node.tagName = 'h1';
      }
    });
  },
}).then((structuredText) => {
  console.log(structuredText);
});
```

### Examples

<details>
  <summary>Split a node that contains an image.</summary>

In `dast`, images can be represented as `Block` nodes, but these are not allowed inside `ListItem` nodes (ul/ol lists). In this example we will split the list into 3 pieces and lift up the image.

The same approach can be used to split other types of branches and lift up nodes to become root nodes.

```js
// `find` is used below, so it is imported together with `visit`.
import { find, visit } from 'unist-utils-core';

const html = `
  <ul>
    <li>item 1</li>
    <li><div><img src="./img.png" alt></div></li>
    <li>item 2</li>
  </ul>
`;

const dast = await htmlToStructuredText(html, {
  preprocess: (tree) => {
    const liftedImages = new WeakSet();
    const body = find(tree, (node) => node.tagName === 'body');

    visit(body, (node, index, parents) => {
      if (
        !node ||
        node.tagName !== 'img' ||
        liftedImages.has(node) ||
        parents.length === 1 // is a top level img
      ) {
        return;
      }

      // remove image
      const imgParent = parents[parents.length - 1];
      imgParent.children.splice(index, 1);

      let i = parents.length;
      let splitChildrenIndex = index;
      let childrenAfterSplitPoint = [];

      while (--i > 0) {
        // Example: i == 2
        // [ 'body', 'div', 'h1' ]
        const /* h1 */ parent = parents[i];
        const /* div */ parentsParent = parents[i - 1];

        // Delete the siblings after the image and save them in a variable
        childrenAfterSplitPoint /* [ 'h1.2' ] */ = parent.children.splice(
          splitChildrenIndex,
        );
        // parent.children is now == [ 'h1.1' ]

        // parentsParent.children = [ 'h1' ]
        splitChildrenIndex = parentsParent.children.indexOf(parent);
        // splitChildrenIndex = 0

        let nodeInserted = false;

        // If we reached the 'div' add the image's node
        if (i === 1) {
          splitChildrenIndex += 1;
          parentsParent.children.splice(splitChildrenIndex, 0, node);
          liftedImages.add(node);
          nodeInserted = true;
        }

        splitChildrenIndex += 1;
        // Create a new branch with childrenAfterSplitPoint if we have any i.e.
        // <h1>h1.2</h1>
        if (childrenAfterSplitPoint.length > 0) {
          parentsParent.children.splice(splitChildrenIndex, 0, {
            ...parent,
            children: childrenAfterSplitPoint,
          });
        }
        // Remove the parent if empty
        if (parent.children.length === 0) {
          splitChildrenIndex -= 1;
          parentsParent.children.splice(
            nodeInserted ? splitChildrenIndex - 1 : splitChildrenIndex,
            1,
          );
        }
      }
    });
  },
  handlers: {
    img: async (createNode, node, context) => {
      // In a real scenario you would upload the image to Dato and get back an id.
      const item = '123';
      return createNode('block', {
        item,
      });
    },
  },
});
```

</details>

<details>
  <summary>Lift up an image node</summary>

```js
import { visit, CONTINUE } from 'unist-utils-core';

const html = `
  <ul>
    <li>item 1</li>
    <li><div><img src="./img.png" alt>item 2</div></li>
    <li>item 3</li>
  </ul>
`;

const dast = await htmlToStructuredText(html, {
  preprocess: (tree) => {
    visit(tree, (node, index, parents) => {
      if (node.tagName === 'img' && parents.length > 1) {
        const parent = parents[parents.length - 1];
        tree.children.push(node);
        parent.children.splice(index, 1);
        return [CONTINUE, index];
      }
    });
  },
  handlers: {
    img: async (createNode, node, context) => {
      // In a real scenario you would upload the image to Dato and get back an id.
      const item = '123';
      return createNode('block', { item });
    },
  },
});
```

</details>

### Utilities

To work with `hast` and `dast` trees we recommend using the [unist-utils-core](https://www.npmjs.com/package/unist-utils-core) library.

## License

MIT