UNPKG

notion-md-crawler

Version:

A library to recursively retrieve and serialize Notion pages with customization for machine learning applications.

315 lines (223 loc) โ€ข 10.8 kB
# notion-md-crawler A library to recursively retrieve and serialize Notion pages and databases with customization for machine learning applications. [![NPM Version](https://badge.fury.io/js/notion-md-crawler.svg)](https://www.npmjs.com/package/notion-md-crawler) ## ๐ŸŒŸ Features - **๐Ÿ•ท๏ธ Crawling Pages and Databases**: Dig deep into Notion's hierarchical structure with ease. - **๐Ÿ“ Serialize to Markdown**: Seamlessly convert Notion pages to Markdown for easy use in machine learning and other. - **๐Ÿ› ๏ธ Custom Serialization**: Adapt the serialization process to fit your specific machine learning needs. - **โณ Async Generator**: Yields results on a page-by-page basis, so even huge documents can be made memory efficient. ## ๐Ÿ› ๏ธ Installation [`@notionhq/client`](https://github.com/makenotion/notion-sdk-js) must also be installed. Using npm ๐Ÿ“ฆ: ```bash npm install notion-md-crawler @notionhq/client ``` Using yarn ๐Ÿงถ: ```bash yarn add notion-md-crawler @notionhq/client ``` Using pnpm ๐Ÿš€: ```bash pnpm add notion-md-crawler @notionhq/client ``` ## ๐Ÿš€ Quick Start > โš ๏ธ Note: Before getting started, create [an integration and find the token](https://www.notion.so/my-integrations). Details on methods can be found in [API section](https://github.com/souvikinator/notion-to-md#api) Leveraging the power of JavaScript generators, this library is engineered to handle even the most extensive Notion documents with ease. It's designed to yield results page-by-page, allowing for efficient memory usage and real-time processing. ```ts import { Client } from "@notionhq/client"; import { crawler, pageToString } from "notion-md-crawler"; // Need init notion client with credential. const client = new Client({ auth: process.env.NOTION_API_KEY }); const crawl = crawler({ client }); const main = async () => { const rootPageId = "****"; for await (const result of crawl(rootPageId)) { if (result.success) { const pageText = pageToString(result.page); console.log(pageText); } } }; main(); ``` ## ๐ŸŒ API ### crawler Recursively crawl the Notion Page. [`dbCrawler`](#dbcrawler) should be used if the Root is a Notion Database. > Note: It tries to continue crawling as much as possible even if it fails to retrieve a particular Notion Page. #### Parameters: - `options` ([`CrawlerOptions`](#optionscrawleroptions)): Crawler options. - `rootPageId` (string): Id of the root page to be crawled. #### Returns: - `AsyncGenerator<CrawlingResult>`: Crawling results with failed information. ### dbCrawler Recursively crawl the Notion Database. [`crawler`](#crawler) should be used if the Root is a Notion Page. #### Parameters: - `options` ([`CrawlerOptions`](#crawleroptions)): Crawler options. - `rootDatabaseId` (string): Id of the root page to be crawled. #### Returns: - `AsyncGenerator<CrawlingResult>`: Crawling results with failed information. ### CrawlerOptions | Option | Description | Type | Default | | ------------------------ | --------------------------------------------------------------------------------------------------- | --------------------------------------------- | ----------- | | `client` | Instance of Notion Client. Set up an instance of the Client class imported from `@notionhq/client`. | Notion Client | - | | `serializers?` | Used for custom serialization of Block and Property objects. | Object | `undefined` | | `serializers?.block?` | Map of Notion block type and [`BlockSerializer`](#blockserializer). | [`BlockSerializers`](#blockserializers) | `undefined` | | `serializers?.property?` | Map of Notion Property Type and [`PropertySerializer`](#propertyserializer). | [`PropertySerializers`](#propertyserializers) | `undefined` | | `metadataBuilder?` | The metadata generation process can be customize. | [`MetadataBuilder`](#metadatabuilder) | `undefined` | | `urlMask?` | If specified, the url is masked with the string. | string \| false | `false` | | `skipPageIds?` | List of page Ids to skip crawling (also skips descendant pages) | string[] | `undefined` | #### `BlockSerializers` Map with Notion block type (like `"heading_1"`, `"to_do"`, `"code"`) as key and [`BlockSerializer`](#blockserializer) as value. #### `BlockSerializer` BlockSerializer that takes a Notion block object as argument. Returning `false` will skip serialization of that Notion block. **[Type]** ```ts type BlockSerializer = ( block: NotionBlock, ) => string | false | Promise<string | false>; ``` #### `PropertySerializers` Map with Notion Property Type (like `"heading_1"`, `"to_do"`, `"code"`) as key and [`PropertySerializer`](#propertyserializer) as value. #### `PropertySerializer` PropertySerializer that takes a Notion property object as argument. Returning `false` will skip serialization of that Notion property. **[Type]** ```ts type PropertySerializer = ( name: string, block: NotionBlock, ) => string | false | Promise<string | false>; ``` #### `MetadataBuilder` Retrieving metadata is sometimes very important, but the information you want to retrieve will vary depending on the context. `MetadataBuilder` allows you to customize it according to your use case. **[Example]** ```ts import { crawler, MetadataBuilderParams } from "notion-md-crawler"; const getUrl = (id: string) => `https://www.notion.so/${id.replace(/-/g, "")}`; const metadataBuilder = ({ page }: MetadataBuilderParams) => ({ url: getUrl(page.metadata.id), }); const crawl = crawler({ client, metadataBuilder }); for await (const result of crawl("notion-page-id")) { if (result.success) { console.log(result.page.metadata.url); // "https://www.notion.so/********" } } ``` ## ๐Ÿ“Š Use Metadata Since `crawler` returns `Page` objects and `Page` object contain metadata, you can be used it for machine learning. ## ๐Ÿ› ๏ธ Custom Serialization `notion-md-crawler` gives you the flexibility to customize the serialization logic for various Notion objects to cater to the unique requirements of your machine learning model or any other use case. ### Define your custom serializer You can define your own custom serializer. You can also use the utility function for convenience. ```ts import { BlockSerializer, crawler, serializer } from "notion-md-crawler"; const customEmbedSerializer: BlockSerializer<"embed"> = (block) => { if (block.embed.url) return ""; // You can use serializer utility. const caption = serializer.utils.fromRichText(block.embed.caption); return `<figure> <iframe src="${block.embed.url}"></iframe> <figcaption>${caption}</figcaption> </figure>`; }; const serializers = { block: { embed: customEmbedSerializer, }, }; const crawl = crawler({ client, serializers }); ``` ### Skip serialize Returning `false` in the serializer allows you to skip the serialize of that block. This is useful when you want to omit unnecessary information. ```ts const image: BlockSerializer<"image"> = () => false; const crawl = crawler({ client, serializers: { block: { image } } }); ``` ### Advanced: Use default serializer in custom serializer If you want to customize serialization only in specific cases, you can use the default serializer in a custom serializer. ```ts import { BlockSerializer, crawler, serializer } from "notion-md-crawler"; const defaultImageSerializer = serializer.block.defaults.image; const customImageSerializer: BlockSerializer<"image"> = (block) => { // Utility function to retrieve the link const { title, href } = serializer.utils.fromLink(block.image); // If the image is from a specific domain, wrap it in a special div if (href.includes("special-domain.com")) { return `<div class="special-image"> ${defaultImageSerializer(block)} </div>`; } // Use the default serializer for all other images return defaultImageSerializer(block); }; const serializers = { block: { image: customImageSerializer, }, }; const crawl = crawler({ client, serializers }); ``` ## ๐Ÿ” Supported Blocks and Database properties ### Blocks | Block Type | Supported | | ------------------ | --------- | | Text | โœ… Yes | | Bookmark | โœ… Yes | | Bulleted List | โœ… Yes | | Numbered List | โœ… Yes | | Heading 1 | โœ… Yes | | Heading 2 | โœ… Yes | | Heading 3 | โœ… Yes | | Quote | โœ… Yes | | Callout | โœ… Yes | | Equation (block) | โœ… Yes | | Equation (inline) | โœ… Yes | | Todos (checkboxes) | โœ… Yes | | Table Of Contents | โœ… Yes | | Divider | โœ… Yes | | Column | โœ… Yes | | Column List | โœ… Yes | | Toggle | โœ… Yes | | Image | โœ… Yes | | Embed | โœ… Yes | | Video | โœ… Yes | | Figma | โœ… Yes | | PDF | โœ… Yes | | Audio | โœ… Yes | | File | โœ… Yes | | Link | โœ… Yes | | Page Link | โœ… Yes | | External Page Link | โœ… Yes | | Code (block) | โœ… Yes | | Code (inline) | โœ… Yes | ### Database Properties | Property Type | Supported | | ---------------- | --------- | | Checkbox | โœ… Yes | | Created By | โœ… Yes | | Created Time | โœ… Yes | | Date | โœ… Yes | | Email | โœ… Yes | | Files | โœ… Yes | | Formula | โœ… Yes | | Last Edited By | โœ… Yes | | Last Edited Time | โœ… Yes | | Multi Select | โœ… Yes | | Number | โœ… Yes | | People | โœ… Yes | | Phone Number | โœ… Yes | | Relation | โœ… Yes | | Rich Text | โœ… Yes | | Rollup | โœ… Yes | | Select | โœ… Yes | | Status | โœ… Yes | | Title | โœ… Yes | | Unique Id | โœ… Yes | | Url | โœ… Yes | | Verification | โ–ก No | ## ๐Ÿ’ฌ Issues and Feedback For any issues, feedback, or feature requests, please file an issue on GitHub. ## ๐Ÿ“œ License MIT --- Made with โค๏ธ by TomPenguin.