# notion-md-crawler
A library to recursively retrieve and serialize Notion pages and databases with customization for machine learning applications.
[notion-md-crawler on npm](https://www.npmjs.com/package/notion-md-crawler)
## Features
- **Crawling Pages and Databases**: Dig deep into Notion's hierarchical structure with ease.
- **Serialize to Markdown**: Seamlessly convert Notion pages to Markdown for easy use in machine learning and other applications.
- **Custom Serialization**: Adapt the serialization process to fit your specific machine learning needs.
- **Async Generator**: Yields results page by page, so even huge documents can be processed without holding everything in memory.
## Installation
[`@notionhq/client`](https://github.com/makenotion/notion-sdk-js) must also be installed.
Using npm:
```bash
npm install notion-md-crawler @notionhq/client
```
Using yarn:
```bash
yarn add notion-md-crawler @notionhq/client
```
Using pnpm:
```bash
pnpm add notion-md-crawler @notionhq/client
```
## Quick Start
> Note: Before getting started, create [an integration and find the token](https://www.notion.so/my-integrations). Details on each method can be found in the API section below.

This library uses JavaScript async generators to yield results page by page, so even very large Notion documents can be processed with modest memory use and handled as each page arrives.
```ts
import { Client } from "@notionhq/client";
import { crawler, pageToString } from "notion-md-crawler";

// Initialize the Notion client with your integration token.
const client = new Client({ auth: process.env.NOTION_API_KEY });

const crawl = crawler({ client });

const main = async () => {
  const rootPageId = "****";

  for await (const result of crawl(rootPageId)) {
    if (result.success) {
      const pageText = pageToString(result.page);
      console.log(pageText);
    }
  }
};

main();
```
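To illustrate why page-by-page yielding keeps memory use low, here is a minimal, self-contained sketch of the same pattern. It does not call the library: `FakePage`, `fakeCrawl`, `firstPages`, and the in-memory `tree` are hypothetical stand-ins for Notion pages and the real `crawl` function.

```typescript
// Hypothetical stand-in for a Notion page tree; not part of notion-md-crawler.
type FakePage = { id: string; children: FakePage[] };

// Depth-first traversal that yields one page at a time instead of
// building the full list of pages in memory first.
async function* fakeCrawl(root: FakePage): AsyncGenerator<FakePage> {
  yield root;
  for (const child of root.children) {
    yield* fakeCrawl(child);
  }
}

// The consumer handles each page as soon as it arrives and can stop early.
async function firstPages(root: FakePage, limit: number): Promise<string[]> {
  const ids: string[] = [];
  for await (const page of fakeCrawl(root)) {
    ids.push(page.id);
    if (ids.length >= limit) break; // remaining pages are never visited
  }
  return ids;
}

const tree: FakePage = {
  id: "root",
  children: [
    { id: "a", children: [{ id: "a1", children: [] }] },
    { id: "b", children: [] },
  ],
};
```

Calling `firstPages(tree, 3)` stops after three pages, so page `b` is never produced. A consumer of the real crawler can stop early in the same way with `break`.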
## API
### crawler
Recursively crawls a Notion page. Use [`dbCrawler`](#dbcrawler) if the root is a Notion database. As in the Quick Start, `crawler(options)` returns a function that is then called with `rootPageId`.
> Note: The crawler continues as far as it can even if it fails to retrieve a particular Notion page.
#### Parameters:
- `options` ([`CrawlerOptions`](#crawleroptions)): Crawler options.
- `rootPageId` (string): ID of the root page to be crawled.
#### Returns:
- `AsyncGenerator<CrawlingResult>`: Crawling results, including information about pages that failed.
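
Because crawling continues after a failure, a consumer will usually want to separate successes from failures. The sketch below shows one way to do that with local stand-in types. `ResultLike` is modeled on the `success` and `page` fields used in the Quick Start. The `reason` field, `fakeResults`, and `summarize` are assumptions made for illustration, not the library's actual `CrawlingResult` shape.

```typescript
// Hypothetical result shape; only `success` and `page` appear in the README.
type ResultLike =
  | { success: true; page: { id: string } }
  | { success: false; reason: string };

// Stand-in for a crawl that hits one unreadable page and keeps going.
async function* fakeResults(): AsyncGenerator<ResultLike> {
  yield { success: true, page: { id: "page-1" } };
  yield { success: false, reason: "object_not_found" };
  yield { success: true, page: { id: "page-2" } };
}

// Collects successful page IDs and failure reasons in a single pass.
async function summarize(results: AsyncIterable<ResultLike>) {
  const pageIds: string[] = [];
  const failures: string[] = [];
  for await (const result of results) {
    // Explicit comparisons narrow the union regardless of compiler strictness.
    if (result.success === true) pageIds.push(result.page.id);
    else if (result.success === false) failures.push(result.reason);
  }
  return { pageIds, failures };
}
```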
### dbCrawler
Recursively crawls a Notion database. Use [`crawler`](#crawler) if the root is a Notion page.
#### Parameters:
- `options` ([`CrawlerOptions`](#crawleroptions)): Crawler options.
- `rootDatabaseId` (string): ID of the root database to be crawled.
#### Returns:
- `AsyncGenerator<CrawlingResult>`: Crawling results, including information about pages that failed.
### CrawlerOptions
| Option | Description | Type | Default |
| ------------------------ | --------------------------------------------------------------------------------------------------- | --------------------------------------------- | ----------- |
| `client` | Instance of Notion Client. Set up an instance of the Client class imported from `@notionhq/client`. | Notion Client | - |
| `serializers?` | Used for custom serialization of Block and Property objects. | Object | `undefined` |
| `serializers?.block?` | Map of Notion block type and [`BlockSerializer`](#blockserializer). | [`BlockSerializers`](#blockserializers) | `undefined` |
| `serializers?.property?` | Map of Notion Property Type and [`PropertySerializer`](#propertyserializer). | [`PropertySerializers`](#propertyserializers) | `undefined` |
| `metadataBuilder?`       | The metadata generation process can be customized.                                                  | [`MetadataBuilder`](#metadatabuilder)         | `undefined` |
| `urlMask?`               | If specified, URLs are masked with this string.                                                     | string \| false                               | `false`     |
| `skipPageIds?`           | List of page IDs to skip crawling (also skips descendant pages).                                    | string[]                                      | `undefined` |
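
The note that `skipPageIds` also skips descendant pages can be pictured as pruning a subtree during traversal. The following is a minimal sketch of that idea on an in-memory tree. `PageNode`, `visitedIds`, and `workspace` are hypothetical names, and the library's own traversal may differ in detail.

```typescript
// Hypothetical in-memory page tree; the real crawler reads pages from the Notion API.
type PageNode = { id: string; children: PageNode[] };

// Returns the IDs that would be visited, depth-first. A skipped ID prunes
// the whole subtree below it, so its descendants are never reached.
function visitedIds(root: PageNode, skipPageIds: string[] = []): string[] {
  const skip = new Set(skipPageIds);
  const visited: string[] = [];
  const visit = (node: PageNode): void => {
    if (skip.has(node.id)) return;
    visited.push(node.id);
    for (const child of node.children) visit(child);
  };
  visit(root);
  return visited;
}

const workspace: PageNode = {
  id: "home",
  children: [
    { id: "archive", children: [{ id: "archive-2020", children: [] }] },
    { id: "docs", children: [] },
  ],
};

console.log(visitedIds(workspace, ["archive"])); // ["home", "docs"]
```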
#### `BlockSerializers`
Map with Notion block type (like `"heading_1"`, `"to_do"`, `"code"`) as key and [`BlockSerializer`](#blockserializer) as value.
#### `BlockSerializer`
BlockSerializer that takes a Notion block object as argument. Returning `false` will skip serialization of that Notion block.
**[Type]**
```ts
type BlockSerializer = (
  block: NotionBlock,
) => string | false | Promise<string | false>;
```
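As a rough illustration of how a map of serializers with this return type might be applied, here is a self-contained sketch. `SimpleBlock`, `SimpleSerializer`, and `serializeBlocks` are simplified, hypothetical stand-ins, not the library's `NotionBlock` type or its internal logic.

```typescript
// Simplified, hypothetical block shape; real Notion blocks carry many more fields.
type SimpleBlock = { type: string; text?: string; checked?: boolean };

// Same return contract as the type above: a string, `false`, or a promise of either.
type SimpleSerializer = (
  block: SimpleBlock,
) => string | false | Promise<string | false>;

// Map from block type to serializer.
const simpleSerializers: Record<string, SimpleSerializer> = {
  to_do: (block) => `- [${block.checked ? "x" : " "}] ${block.text ?? ""}`,
  divider: async () => "---", // async serializers are allowed too
  image: () => false, // returning false skips the block
};

async function serializeBlocks(blocks: SimpleBlock[]): Promise<string> {
  const lines: string[] = [];
  for (const block of blocks) {
    const serialize = simpleSerializers[block.type];
    if (!serialize) continue; // no serializer registered for this type
    const result = await serialize(block); // handles sync and async results
    if (result !== false) lines.push(result);
  }
  return lines.join("\n");
}
```

Awaiting every result lets synchronous and asynchronous serializers share one code path, and the `false` check is what makes skipping possible.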
#### `PropertySerializers`
Map with Notion property type (like `"title"`, `"select"`, `"checkbox"`) as key and [`PropertySerializer`](#propertyserializer) as value.
#### `PropertySerializer`
PropertySerializer that takes a Notion property object as argument. Returning `false` will skip serialization of that Notion property.
**[Type]**
```ts
type PropertySerializer = (
  name: string,
  block: NotionBlock,
) => string | false | Promise<string | false>;
```
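The sketch below shows what a property serializer could look like, using a small local union in place of the real Notion property objects. `SimpleProperty`, `serializeProperty`, `serializeProperties`, and the `name: value` output format are all assumptions for illustration. The library's default output format may be different.

```typescript
// Hypothetical, reduced property shapes keyed by a `type` discriminant.
type SimpleProperty =
  | { type: "checkbox"; checkbox: boolean }
  | { type: "number"; number: number | null }
  | { type: "url"; url: string | null };

type SimplePropertySerializer = (
  name: string,
  property: SimpleProperty,
) => string | false;

const serializeProperty: SimplePropertySerializer = (name, property) => {
  switch (property.type) {
    case "checkbox":
      return `${name}: ${property.checkbox ? "yes" : "no"}`;
    case "number":
      // Skip empty numbers instead of emitting "null".
      return property.number === null ? false : `${name}: ${property.number}`;
    case "url":
      return false; // e.g. drop URLs that add noise to training data
  }
};

// Serializes every property of a database row, dropping skipped ones.
function serializeProperties(properties: Record<string, SimpleProperty>): string {
  const lines: string[] = [];
  for (const [name, property] of Object.entries(properties)) {
    const line = serializeProperty(name, property);
    if (line !== false) lines.push(line);
  }
  return lines.join("\n");
}
```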
#### `MetadataBuilder`
Retrieving metadata is sometimes very important, but the information you want to retrieve will vary depending on the context. `MetadataBuilder` allows you to customize it according to your use case.
**[Example]**
```ts
import { Client } from "@notionhq/client";
import { crawler, MetadataBuilderParams } from "notion-md-crawler";

const client = new Client({ auth: process.env.NOTION_API_KEY });

// Build a Notion URL from a page ID by stripping the hyphens.
const getUrl = (id: string) => `https://www.notion.so/${id.replace(/-/g, "")}`;

// Attach the URL to each page's metadata.
const metadataBuilder = ({ page }: MetadataBuilderParams) => ({
  url: getUrl(page.metadata.id),
});

const crawl = crawler({ client, metadataBuilder });

for await (const result of crawl("notion-page-id")) {
  if (result.success) {
    console.log(result.page.metadata.url); // "https://www.notion.so/********"
  }
}
```
## Use Metadata
Since `crawler` returns `Page` objects and each `Page` object contains metadata, you can use that metadata in machine learning pipelines.
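
One common way to use page metadata in a machine learning pipeline is to attach it to every text chunk, so that retrieved chunks can be traced back to their source page. The following is a minimal sketch under stated assumptions: `PageLike`, `Chunk`, and `toChunks` are hypothetical names, and the `id` and `title` fields are illustrative rather than a guaranteed part of the library's metadata.

```typescript
// Hypothetical page shape: serialized text plus a metadata object.
type PageLike = { text: string; metadata: { id: string; title: string } };

type Chunk = {
  content: string;
  metadata: { id: string; title: string; chunkIndex: number };
};

// Splits the text into fixed-size character chunks and copies the page
// metadata onto each chunk along with its position.
function toChunks(page: PageLike, size: number): Chunk[] {
  if (size <= 0) throw new RangeError("size must be positive");
  const chunks: Chunk[] = [];
  for (let start = 0; start < page.text.length; start += size) {
    chunks.push({
      content: page.text.slice(start, start + size),
      metadata: { ...page.metadata, chunkIndex: chunks.length },
    });
  }
  return chunks;
}
```

With the real library, the text would come from `pageToString(result.page)` and the metadata from `result.page.metadata`, as in the Quick Start.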
## Custom Serialization
`notion-md-crawler` gives you the flexibility to customize the serialization logic for various Notion objects to cater to the unique requirements of your machine learning model or any other use case.
### Define your custom serializer
You can define your own custom serializer. You can also use the utility function for convenience.
```ts
import { BlockSerializer, crawler, serializer } from "notion-md-crawler";

const customEmbedSerializer: BlockSerializer<"embed"> = (block) => {
  // Nothing to embed without a URL.
  if (!block.embed.url) return "";

  // You can use the serializer utilities.
  const caption = serializer.utils.fromRichText(block.embed.caption);

  return `<figure>
  <iframe src="${block.embed.url}"></iframe>
  <figcaption>${caption}</figcaption>
</figure>`;
};

const serializers = {
  block: {
    embed: customEmbedSerializer,
  },
};

const crawl = crawler({ client, serializers });
```
### Skip serialization
Returning `false` from a serializer skips serialization of that block. This is useful when you want to omit unnecessary information.
```ts
const image: BlockSerializer<"image"> = () => false;
const crawl = crawler({ client, serializers: { block: { image } } });
```
### Advanced: Use default serializer in custom serializer
If you want to customize serialization only in specific cases, you can use the default serializer in a custom serializer.
```ts
import { BlockSerializer, crawler, serializer } from "notion-md-crawler";

const defaultImageSerializer = serializer.block.defaults.image;

const customImageSerializer: BlockSerializer<"image"> = (block) => {
  // Utility function to retrieve the link
  const { href } = serializer.utils.fromLink(block.image);

  // If the image is from a specific domain, wrap it in a special div
  if (href.includes("special-domain.com")) {
    return `<div class="special-image">
      ${defaultImageSerializer(block)}
    </div>`;
  }

  // Use the default serializer for all other images
  return defaultImageSerializer(block);
};

const serializers = {
  block: {
    image: customImageSerializer,
  },
};

const crawl = crawler({ client, serializers });
```
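Since a serializer may return `false` or a promise, a wrapper that embeds another serializer's output in a template may want to resolve and check that output first. The sketch below shows the wrapping pattern with local stand-in types. `ImageBlockLike`, `ImageSerializer`, `defaultImage`, and `wrapForDomain` are hypothetical, and whether the library's default serializers are synchronous is not assumed here.

```typescript
// Hypothetical, reduced image block and serializer types.
type ImageBlockLike = { image: { url: string } };
type ImageSerializer = (
  block: ImageBlockLike,
) => string | false | Promise<string | false>;

// Stand-in for a default serializer that emits a Markdown image.
const defaultImage: ImageSerializer = (block) => `![image](${block.image.url})`;

// Wraps a base serializer: output for images from `domain` goes inside a div,
// everything else passes through unchanged.
const wrapForDomain = (base: ImageSerializer, domain: string): ImageSerializer =>
  async (block) => {
    const markdown = await base(block); // resolve sync or async output
    if (markdown === false) return false; // respect a skip from the base serializer
    return block.image.url.includes(domain)
      ? `<div class="special-image">\n${markdown}\n</div>`
      : markdown;
  };
```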
## Supported Blocks and Database properties
### Blocks
| Block Type | Supported |
| ------------------ | --------- |
| Text               | Yes       |
| Bookmark           | Yes       |
| Bulleted List      | Yes       |
| Numbered List      | Yes       |
| Heading 1          | Yes       |
| Heading 2          | Yes       |
| Heading 3          | Yes       |
| Quote              | Yes       |
| Callout            | Yes       |
| Equation (block)   | Yes       |
| Equation (inline)  | Yes       |
| Todos (checkboxes) | Yes       |
| Table Of Contents  | Yes       |
| Divider            | Yes       |
| Column             | Yes       |
| Column List        | Yes       |
| Toggle             | Yes       |
| Image              | Yes       |
| Embed              | Yes       |
| Video              | Yes       |
| Figma              | Yes       |
| PDF                | Yes       |
| Audio              | Yes       |
| File               | Yes       |
| Link               | Yes       |
| Page Link          | Yes       |
| External Page Link | Yes       |
| Code (block)       | Yes       |
| Code (inline)      | Yes       |
### Database Properties
| Property Type | Supported |
| ---------------- | --------- |
| Checkbox         | Yes       |
| Created By       | Yes       |
| Created Time     | Yes       |
| Date             | Yes       |
| Email            | Yes       |
| Files            | Yes       |
| Formula          | Yes       |
| Last Edited By   | Yes       |
| Last Edited Time | Yes       |
| Multi Select     | Yes       |
| Number           | Yes       |
| People           | Yes       |
| Phone Number     | Yes       |
| Relation         | Yes       |
| Rich Text        | Yes       |
| Rollup           | Yes       |
| Select           | Yes       |
| Status           | Yes       |
| Title            | Yes       |
| Unique Id        | Yes       |
| Url              | Yes       |
| Verification     | No        |
## Issues and Feedback
For any issues, feedback, or feature requests, please file an issue on GitHub.
## License
MIT
---
Made with ❤️ by TomPenguin.