@robypag/langchain-splitter
A small wrapper module that simplifies tokenizing files and buffers with LangChain
# Tokenizer Utility
This is a small utility I built to support my other AI-related projects.
It doesn't do much, and I did not want it to do more than this: it does exactly what I need.
If it helps you, or you feel it is worth extending, feel free to fork it. Pull requests are warmly welcome.
## Why
When working with Retrieval-Augmented Generation (RAG), you usually need to process a file in order to generate embeddings for it. The common technique is to split the file into chunks, then generate embeddings for each chunk.
It is a repetitive and tedious task, so instead of copying and pasting the same function over and over again, I decided to build a small library.
## What
Tokenizer exposes two main functions:
- `tokenizeFile`
- `tokenizeFromStringOrBuffer`
Both do the same thing from a different starting point: as the names imply, `tokenizeFile` takes a file path, whereas `tokenizeFromStringOrBuffer` takes a `string` or a `buffer`.
### Supported Files
It currently supports files that contain text: `pdf`, `doc` and `docx`, plus text-based files such as `txt`, `csv`, etc.
It applies a heuristic approach to determine which kind of file or buffer it has been given:
- `tokenizeFile` first uses the [`mime-types`](https://www.npmjs.com/package/mime-types) module to determine the file type. If this fails (mainly because the provided file has a mismatching extension or does not have an extension at all), it uses the [`file-type`](https://www.npmjs.com/package/file-type) module to look at the file content and determine its type.
- `tokenizeFromStringOrBuffer` assumes that if the provided content is a string, then the resulting file is a text-based one. If the provided content is a buffer, it uses `file-type` as above to look at the buffer content, determine which kind of file it is, and generate a temporary file using the returned extension.
Since `file-type` does not support text files, it returns `undefined` if the buffer contains a string or text-only content: in that case the function generates a `txt` temporary file.
After generating the temporary file, it calls `tokenizeFile` with the temporary path.
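
The detection order described above can be sketched roughly as follows. This is not the library's code: it is a dependency-free illustration in which a small extension table stands in for `mime-types` and a few magic-byte checks stand in for `file-type`. All names here (`kindFromExtension`, `kindFromContent`, `detectFileKind`, `detectBufferKind`) are hypothetical, and treating any ZIP container as `docx` or any OLE container as `doc` is a deliberate simplification.

```typescript
// Hypothetical sketch of the two-step detection heuristic.
// The real library delegates to `mime-types` and `file-type`.
type Kind = "pdf" | "doc" | "docx" | "txt" | "csv";

// Stand-in for the extension lookup that `mime-types` performs.
const KNOWN_EXTENSIONS = new Map<string, Kind>([
  ["pdf", "pdf"],
  ["doc", "doc"],
  ["docx", "docx"],
  ["txt", "txt"],
  ["csv", "csv"],
]);

// Step 1: guess the kind from the file name extension, if there is one.
function kindFromExtension(fileName: string): Kind | undefined {
  const base = fileName.slice(fileName.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  if (dot <= 0) return undefined; // no extension, or a dotfile
  return KNOWN_EXTENSIONS.get(base.slice(dot + 1).toLowerCase());
}

function startsWith(bytes: Uint8Array, signature: number[]): boolean {
  return (
    signature.length <= bytes.length &&
    signature.every((value, index) => bytes[index] === value)
  );
}

// Step 2: guess the kind from the leading bytes (stand-in for `file-type`).
// Plain text has no signature, so it yields undefined.
function kindFromContent(bytes: Uint8Array): Kind | undefined {
  if (startsWith(bytes, [0x25, 0x50, 0x44, 0x46])) return "pdf"; // "%PDF"
  if (startsWith(bytes, [0x50, 0x4b, 0x03, 0x04])) return "docx"; // ZIP container (simplified)
  if (startsWith(bytes, [0xd0, 0xcf, 0x11, 0xe0])) return "doc"; // OLE container (simplified)
  return undefined;
}

// File path flow: extension first, content second.
function detectFileKind(fileName: string, bytes: Uint8Array): Kind | undefined {
  return kindFromExtension(fileName) ?? kindFromContent(bytes);
}

// String-or-buffer flow: strings are text; unrecognized buffers fall back to txt.
function detectBufferKind(content: string | Uint8Array): Kind {
  if (typeof content === "string") return "txt";
  return kindFromContent(content) ?? "txt";
}
```

The `txt` fallback lives only in `detectBufferKind`, mirroring the description above, where an `undefined` result from `file-type` leads to a `txt` temporary file.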
### Langchain Parameters
In all cases, this library uses LangChain's `RecursiveCharacterTextSplitter` class to process the given text.
You can check its signature [here](https://v02.api.js.langchain.com/classes/_langchain_textsplitters.RecursiveCharacterTextSplitter.html).
This library currently uses only two of its parameters:
- `chunkSize`: the maximum size of each text chunk, in characters. Defaults to 1000
- `chunkOverlap`: the number of characters that can overlap between two adjacent chunks. Defaults to 200
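
To get a feel for how these two parameters interact, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification, not what `RecursiveCharacterTextSplitter` does (the real splitter breaks text on a list of separators and then merges pieces), and the function name `chunkWithOverlap` is hypothetical. Sizes here are counted in JavaScript string characters.

```typescript
// Simplified illustration of chunkSize / chunkOverlap.
// Not the RecursiveCharacterTextSplitter algorithm: just a sliding window.
function chunkWithOverlap(
  text: string,
  chunkSize = 1000,
  chunkOverlap = 200,
): string[] {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be positive");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("chunkOverlap must be in [0, chunkSize)");
  }

  // Each new chunk starts this many characters after the previous one,
  // so the last `chunkOverlap` characters are repeated in the next chunk.
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Small values make the overlap visible.
const example = chunkWithOverlap("abcdefghij", 4, 2);
// example is ["abcd", "cdef", "efgh", "ghij"]
```

A larger `chunkOverlap` preserves more context across chunk boundaries at the cost of producing more chunks, and therefore more embeddings, for the same input.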
## Who
Me, myself and I.
## License
See [LICENSE](./LICENSE)