# ConvoKit: Flexible Conversation Processing & Export Toolkit
ConvoKit is a TypeScript-based framework for ingesting, normalizing, filtering, sampling, formatting, and exporting chat or conversation data for LLM training and analysis. It provides:
- A **provider registry** to plug in new data sources (Discord, Slack, custom exports, etc.).
- A **plugin registry** for formatters, converters, and filters to transform and export data to ChatML, Gemini JSONL, custom context formats, and more.
- A fully configurable, extensible pipeline: ingest → normalize → filter → importance‑score → sample → format → export.
ConvoKit saves you time building data preprocessing pipelines and lets you focus on models and prompts.
## Table of Contents
- Key Features
- What It Can & Cannot Do
- Who Should Use It
- Installation
- Quick Start
- Configuration
- CLI Usage
- Provider Registry
- Built‑in Providers
- Writing Your Own Provider
- Plugin Registry
- Formatters
- Converters
- Filters
- Writing Your Own Plugin
- Contributing
- License
## Key Features
- **Dynamic Provider Loading**
Automatically discover and load data providers from your project’s providers folder.
- **Normalized Conversation Format**
All data converges to a `ConvoKitConversation` interface: metadata + message arrays.
- **Context Formatting**
Generate a single, line-delimited training string (`CKContext`) with options for time‑gaps, new‑conversation markers, and importance scoring.
- **Turn‑List Conversion**
Break context into turn lists (`CKTurnListConversation`) for sampling or LLM‑specific export.
- **Weighted Sampling**
Sample by conversation importance to focus on high‑value exchanges.
- **Export Plugins**
Export to ChatML JSONL, Gemini JSONL, or add your own converter for other LLM formats.
- **Filter Plugins**
Drop unwanted messages (e.g. links‑only, emoji‑only, code‑only) via a simple plugin API.
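The weighted sampling idea can be sketched roughly as follows. This is not ConvoKit's actual implementation: `ScoredConversation`, `weightedSample`, and the injectable `rng` parameter are hypothetical names, and the sketch assumes sampling without replacement with a selection probability proportional to a non-negative importance score.

```typescript
// Hypothetical shape for this sketch; ConvoKit's real types may differ.
interface ScoredConversation {
  id: string;
  importance: number; // assumed to be non-negative
}

// Draw up to `size` items without replacement; each pick is proportional to importance.
function weightedSample<T extends { importance: number }>(
  items: readonly T[],
  size: number,
  rng: () => number = Math.random,
): T[] {
  // Items with zero (or negative) importance can never be picked, so drop them up front.
  const pool = items.filter((item) => item.importance > 0);
  const picked: T[] = [];
  let total = pool.reduce((sum, item) => sum + item.importance, 0);

  while (picked.length < size && pool.length > 0) {
    let threshold = rng() * total;
    let index = pool.length - 1; // fallback guards against floating-point drift
    for (let i = 0; i < pool.length; i++) {
      threshold -= pool[i].importance;
      if (threshold < 0) {
        index = i;
        break;
      }
    }
    const [chosen] = pool.splice(index, 1);
    picked.push(chosen);
    total -= chosen.importance;
  }
  return picked;
}

// Usage with made-up scores: "b" is three times as likely as "a" to be drawn first.
const exampleConversations: ScoredConversation[] = [
  { id: "a", importance: 1 },
  { id: "b", importance: 3 },
  { id: "c", importance: 0 },
];
console.log(weightedSample(exampleConversations, 2).map((c) => c.id));
```

Passing a custom `rng` makes the draw reproducible, which can be handy when comparing sampling budgets.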
## What It Can & Cannot Do
Can:
- Ingest JSON exports from Discord (via DiscordChatExporter), or any custom source you add via the **Provider Registry**.
- Normalize and filter conversations by message content, length, or custom rules.
- Score message & conversation importance automatically based on time, length, and frequency.
- Sample highly‑important conversations for training budgets.
- Export to popular LLM chat formats (ChatML, Gemini), or add your own converter for others.
Cannot:
- Perform LLM inference or model training directly (**yet ;)**).
- Resolve references across conversations (thread linking across channels).
- Guarantee perfect import schema for every data source—you may need to write a provider to handle custom formats.
- Handle binary or non‑JSON data without extending a provider to preprocess it.
## Who Should Use It
- **NLP / ML Engineers** preparing chat‑based LLM fine‑tuning or analysis datasets.
- **Bot / Chat Service Developers** needing to transform raw chat logs into structured training data.
- **Researchers** studying conversation dynamics or designing importance‑based sampling strategies.
- **Community Contributors** eager to add support for new platforms or export formats.
## Possible Upcoming Features
- **Personality** Generate a deep and comprehensive personality prompt based on your output `ck_context`.
- **Fine-tuning** Fine-tune models with exported training data (currently focused mainly on Gemini). **(Contributions welcome!)**
- **Model Testing** Test your fine-tuned model via the terminal (currently focused mainly on Gemini). **(Contributions welcome!)**
- **Unit Tests** Adding unit tests would help keep everything maintainable and stable (or so I've heard).
## Installation
```bash
# Install globally (recommended for CLI use)
npm install -g convokit
# Or install locally in your project
npm install convokit
```
## Quick Start (Using the Library)
```ts
import { ConvoKit, loadConfig, getConfig } from 'convokit';
import { config } from 'dotenv';

config();            // load environment variables from .env
await loadConfig();  // load convokit.config.json / environment settings

async function run() {
  const ck = new ConvoKit();
  // Loads all included providers, plus any in localProvidersDir if it is set in the config.
  // Included plugins, and any in localPluginsDir if set, are loaded automatically.
  await ck.loadProviders();
  const convoData = await ck.processDataFromProviders();
  const context = await ck.parseToContext({ targetUsers: getConfig().targetUsers });
  await ck.convertToCKTurnList();
  await ck.getWeightedSample(getConfig().sampleSize);
  const chatml = await ck.exportToChatML(getConfig().systemPrompt);
  const gemini = await ck.exportToGemini(getConfig().systemPrompt);
  // Do whatever you want with the outputs
}

run();
```
> Make sure you have set up your providers and input directory structure first.
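One thing you might do with the outputs is write them to disk. The sketch below assumes each export is an array of already-serialized JSON strings, one per training example, which is what the converter plugin signature later in this README suggests; check the actual return type before relying on it. `writeJsonl` is a hypothetical helper and not part of ConvoKit.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical helper: writes one string per line, creating the parent directory if needed.
function writeJsonl(filePath: string, lines: readonly string[]): void {
  mkdirSync(dirname(filePath), { recursive: true });
  const body = lines.length > 0 ? lines.join("\n") + "\n" : "";
  writeFileSync(filePath, body, "utf8");
}

// Usage with stand-in data; in the Quick Start this would be the `chatml` value.
const outDir = mkdtempSync(join(tmpdir(), "convokit-"));
const outFile = join(outDir, "OutputData", "chatml.jsonl");
writeJsonl(outFile, ['{"messages":[]}', '{"messages":[]}']);
console.log(readFileSync(outFile, "utf8"));
```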
## Configuration
By default, ConvoKit reads `convokit.config.json` or environment variables. Here is an example config file:
```jsonc
{
  "inputDataDirName": "InputData",
  "outputDataDirName": "OutputData",
  "targetUsers": [
    { "providerId": "discord", "id": "YOUR_DISCORD_USER_ID" }
  ],
  "sampleSize": 5000,
  "systemPrompt": "You are a helpful assistant.",
  "minImportanceChat": 120,
  "minImportanceMessage": 100,
  "enableDebugging": false,
  "enablePerformanceStats": false,
  "shouldMergeConsecutiveMessages": true,
  "enableWarnings": true,
  "anonymizeProviderConversationIds": false,
  "localProvidersDir": "LocalProviders",
  "localPluginsDir": "LocalPlugins"
}
```
| Key | Description |
|-----------------------------------------|--------------------------------------------------------------------------------------|
| inputDataDirName | Directory containing raw chat exports (relative to project root). |
| outputDataDirName | Directory to write formatted outputs. |
| targetUsers | JSON array mapping each provider to a target user ID for context generation. |
| sampleSize | Number of conversations to sample by importance. |
| systemPrompt | System prompt used in ChatML/Gemini exports. |
| minImportanceChat (optional) | Minimum average importance score for a conversation (default: 120). |
| minImportanceMessage (optional) | Minimum importance score for a single message (default: 100). |
| enableDebugging (optional) | Enable or disable debug-level logs. |
| enablePerformanceStats (optional) | Enable or disable performance stats (timers). |
| shouldMergeConsecutiveMessages (optional)| Merge consecutive messages when converting to CKTurnList. |
| enableWarnings (optional) | Toggle the display of warning messages. |
| anonymizeProviderConversationIds (optional)| Anonymize provider conversation IDs to protect sensitive data. |
| localProvidersDir (optional) | Name of the directory to load custom providers from. |
| localPluginsDir (optional) | Name of the directory to load custom plugins from. Contains one folder per plugin type (formatters, filters, converters). |
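To make the optional keys and their documented defaults concrete, here is a rough TypeScript sketch of the config shape. The `ConfigSketch` interface and `withDefaults` function are hypothetical (ConvoKit's own types may be named and structured differently); only the key names from the example config and the two numeric defaults (120 and 100) come from this README.

```typescript
interface TargetUser {
  providerId: string;
  id: string;
}

// Hypothetical config shape; boolean toggles are omitted for brevity.
interface ConfigSketch {
  inputDataDirName: string;
  outputDataDirName: string;
  targetUsers: TargetUser[];
  sampleSize: number;
  systemPrompt: string;
  minImportanceChat?: number;
  minImportanceMessage?: number;
  localProvidersDir?: string;
  localPluginsDir?: string;
}

type ResolvedConfig = ConfigSketch &
  Required<Pick<ConfigSketch, "minImportanceChat" | "minImportanceMessage">>;

// Fill in the defaults documented in the table when a key is missing.
function withDefaults(config: ConfigSketch): ResolvedConfig {
  return {
    ...config,
    minImportanceChat: config.minImportanceChat ?? 120,
    minImportanceMessage: config.minImportanceMessage ?? 100,
  };
}

const resolvedExample = withDefaults({
  inputDataDirName: "InputData",
  outputDataDirName: "OutputData",
  targetUsers: [{ providerId: "discord", id: "YOUR_DISCORD_USER_ID" }],
  sampleSize: 5000,
  systemPrompt: "You are a helpful assistant.",
});
console.log(resolvedExample.minImportanceChat, resolvedExample.minImportanceMessage);
```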
## Directory Structure
In your `convokit.config.json` file you set `inputDataDirName`. That directory needs one subdirectory per provider, and each subdirectory holds that provider's exported data.
Example for use with the Discord provider, with **inputDataDirName** set to `InputData`:
```plaintext
convokit/
├── index.ts
├── convokit.config.json
├── ... other files and folders
└── InputData
└── discord
└── Direct Messages - fishylunar [000000000000000].json
```
> Note: the filenames of the exported data do not matter, but the file extension does.
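The layout above implies that files are matched by provider directory and extension rather than by name. A minimal sketch of that lookup, using a hypothetical `listExportFiles` helper (ConvoKit's own discovery logic may work differently), could look like this:

```typescript
import { existsSync, mkdirSync, mkdtempSync, readdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { extname, join } from "node:path";

// Hypothetical helper: returns files in <inputDir>/<providerDir> that have the given extension.
function listExportFiles(inputDir: string, providerDir: string, fileExtension: string): string[] {
  const dir = join(inputDir, providerDir);
  if (!existsSync(dir)) return [];
  return readdirSync(dir, { withFileTypes: true })
    .filter(
      (entry) =>
        entry.isFile() && extname(entry.name).toLowerCase() === fileExtension.toLowerCase(),
    )
    .map((entry) => join(dir, entry.name))
    .sort();
}

// Usage with a throwaway directory that mirrors the layout above.
const sketchRoot = mkdtempSync(join(tmpdir(), "convokit-input-"));
const discordDir = join(sketchRoot, "InputData", "discord");
mkdirSync(discordDir, { recursive: true });
writeFileSync(join(discordDir, "Direct Messages - example.json"), "{}");
writeFileSync(join(discordDir, "notes.txt"), "ignored: wrong extension");
const foundExports = listExportFiles(join(sketchRoot, "InputData"), "discord", ".json");
console.log(foundExports);
```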
## CLI Usage
ConvoKit provides a command-line interface (CLI) for running the processing pipeline without writing TypeScript code. Ensure you have a valid `convokit.config.json` file in your project root or have set the corresponding environment variables.
**Running Commands:**
```bash
# If installed globally
convokit <command> [options]
# If installed locally, using npx
npx convokit <command> [options]
# Or via package.json script
# "scripts": { "ck": "convokit" }
# npm run ck -- <command> [options]
```
**Common Options:**
* `-p, --providers <ids>`: Specify a comma-separated list of provider IDs (e.g., `discord,telegram`) to process data from. If omitted, ConvoKit will attempt to use data from all providers found in your `inputDataDirName` that are registered.
* `-o, --output <file>`: Specify an output file path to save the results of commands like `context` or `export`. If omitted, results are generated but not saved to a file (stats/logs will still be shown).
**Commands:**
* `create-config` (alias: `cfg`): Creates an example `convokit.config.json` file in the current directory. Run this first if you don't have a config file.
```bash
convokit create-config
```
* `providers`: Lists all registered providers (built-in and local) found by ConvoKit, including their ID, name, version, and expected input directory/extension. Useful for verifying provider setup and getting IDs for the `--providers` option.
```bash
convokit providers
```
* `plugins`: Lists all registered plugins (formatters, converters, filters), including built-in and local ones. Shows plugin ID, name, and version. Useful for finding the `<converter_id>` for the `export` command.
```bash
convokit plugins
```
* `context`: Processes data from specified (or all) providers and generates the `CKContext` output based on your configuration (`targetUsers`, importance scores, etc.).
```bash
# Generate context from all providers and save to context.txt
convokit context -o context.txt
# Generate context using only 'discord' provider data and save
convokit context --providers discord -o discord_context.txt
# Generate context from all providers and save to context.json including stats
convokit context -o context.json --stats
```
* `export <converter_id>`: Runs the full pipeline: loads data, processes it, generates context, converts to turn list, performs weighted sampling (using `sampleSize` from config), and finally exports the data using the specified `<converter_id>`.
```bash
# Export data using the 'chatml' converter, save to chatml_export.jsonl
convokit export chatml -o chatml_export.jsonl
# Export using 'gemini' converter from 'telegram' provider only, save output
convokit export gemini --providers telegram -o telegram_gemini.jsonl
```
**Example Workflow:**
```bash
# 1. Create a config file if you don't have one
convokit create-config
# (Edit convokit.config.json with your settings: input dir, target users, etc.)
# 2. Check which providers are available
convokit providers
# Output might show: ID: discord, ID: telegram
# 3. Check available export formats (converters)
convokit plugins
# Output might show Converters: ID: chatml, ID: gemini
# 4. Run the full export pipeline for ChatML using all providers
convokit export chatml -o training_data.jsonl
# 5. (Alternative) Generate only the CKContext for analysis
convokit context -o analysis_context.json
```
## Provider Registry
ConvoKit discovers providers from the providers folder via `ProviderRegistry`. Each provider must:
1. Implement `ConvoKitProvider` with `Test()` and `Convert()`.
2. Export a static `ProviderInfo` object.
3. Register itself via `ProviderRegistry.register(id, ProviderClass, ProviderInfo)`.
### Built‑in Providers
- **Discord** (`providers/discord.ts`): Reads JSON exports from DiscordChatExporter.
- **Telegram** (`providers/telegram.ts`): Reads JSON exports from the Telegram Desktop app.
> Contributions are more than welcome! <3
### Writing Your Own Provider
1. Create `/providers/MyPlatform.ts`.
> To make a local provider, put the `MyPlatform.ts` file in the LocalProvidersDir you specified in your config. If you are contributing and making a provider to be included in ConvoKit, put it in `/providers/MyPlatform.ts`
2. Define your data schema, compatibility check, and conversion:
```ts
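// Assumption: these names are exported from the package root; adjust the import if they live elsewhere.
import { ProviderRegistry, type ConvoKitConversation, type ConvoKitProvider } from 'convokit';

export const ProviderInfo = {
  name: "MyPlatform Exporter",
  description: "Imports MyPlatform chat JSON.",
  version: "1.0.0",
  author: "You",
  InputDataInfo: { directoryName: "MyPlatform", fileExtension: ".json" }
};

export class Provider implements ConvoKitProvider {
  constructor(private raw: any) {}

  // Return true if `raw` matches your export schema.
  Test(): boolean {
    return typeof this.raw === "object" && this.raw !== null; // placeholder check, replace with real validation
  }

  // Transform `raw` into a ConvoKitConversation.
  Convert(): ConvoKitConversation {
    throw new Error("Convert() is not implemented yet");
  }
}

// Self-register
ProviderRegistry.register("myplatform", Provider, ProviderInfo);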
```
3. Place your exports in `InputData/MyPlatform/*.json`.
4. Run `ck.loadProviders()` and `ck.processDataFromProviders()` to include your data.
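To show how `Test()` and `Convert()` might fit together, here is a simplified sketch in which a loader tries each registered provider against a parsed file and converts with the first one whose `Test()` passes. The registry `Map`, the `Sketch*` types, the raw export schema, and `convertWithFirstMatch` are all stand-ins written for illustration; they are not ConvoKit's actual `ProviderRegistry` or interfaces.

```typescript
// Simplified stand-ins; ConvoKit's real interfaces carry more fields.
interface SketchConversation {
  provider: string;
  messages: { author: string; content: string }[];
}
interface SketchProvider {
  Test(): boolean;
  Convert(): SketchConversation;
}
type SketchProviderClass = new (raw: unknown) => SketchProvider;

const sketchRegistry = new Map<string, SketchProviderClass>();

class MyPlatformSketchProvider implements SketchProvider {
  private readonly raw: unknown;
  constructor(raw: unknown) {
    this.raw = raw;
  }
  // Accept only objects that look like the (made-up) MyPlatform export schema.
  Test(): boolean {
    const r = this.raw as { platform?: unknown; messages?: unknown } | null;
    return (
      typeof r === "object" && r !== null && r.platform === "myplatform" && Array.isArray(r.messages)
    );
  }
  // Map the made-up raw fields onto the normalized shape.
  Convert(): SketchConversation {
    const r = this.raw as { messages: { from: string; text: string }[] };
    return {
      provider: "myplatform",
      messages: r.messages.map((m) => ({ author: m.from, content: m.text })),
    };
  }
}
sketchRegistry.set("myplatform", MyPlatformSketchProvider);

// Try each registered provider; convert with the first one that recognizes the data.
function convertWithFirstMatch(raw: unknown): SketchConversation | undefined {
  for (const ProviderClass of Array.from(sketchRegistry.values())) {
    const provider = new ProviderClass(raw);
    if (provider.Test()) return provider.Convert();
  }
  return undefined;
}

console.log(
  convertWithFirstMatch({ platform: "myplatform", messages: [{ from: "alice", text: "hi" }] }),
);
```

Keeping `Test()` strict helps avoid one provider accidentally claiming another platform's export.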
## Plugin Registry
Plugins extend ConvoKit’s pipeline at three points:
1. **Formatters** (`formatters`)
2. **Converters** (`converters`)
3. **Filters** (`filters`)
They self‑register via `PluginRegistry.registerFormatter/Converter/Filter()`.
### Formatters
- **Context Formatter** (`id: context`): Builds the CKContext string with importance and markers.
### Converters
- **ChatML Converter** (`id: chatml`): Exports ChatML JSONL.
- **Gemini Converter** (`id: gemini`): Exports Gemini‑style JSONL.
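As a rough illustration of what a converter produces, the sketch below turns a list of turns into one JSONL line using a chat-style `messages` layout with `role` and `content` fields. The `Turn` type, the `toChatLine` function, and the exact output schema are assumptions made for this example; check the built-in converters for the field names ConvoKit actually emits.

```typescript
// Hypothetical turn shape for this sketch.
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Serialize one conversation as a single JSONL line, with an optional system prompt first.
function toChatLine(turns: readonly Turn[], systemPrompt?: string): string {
  const messages: { role: string; content: string }[] = [
    ...(systemPrompt ? [{ role: "system", content: systemPrompt }] : []),
    ...turns.map((turn) => ({ role: turn.role as string, content: turn.content })),
  ];
  // JSON.stringify escapes newlines inside content, so the result stays on one line.
  return JSON.stringify({ messages });
}

const exampleLine = toChatLine(
  [
    { role: "user", content: "hey, are you around?" },
    { role: "assistant", content: "yep, what's up?" },
  ],
  "You are a helpful assistant.",
);
console.log(exampleLine);
```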
### Filters
- **LinkOnlyFilter** (`id: link-only`): Excludes messages that are URLs only.
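A "URLs only" check could be sketched as below. `isLinkOnly` is a hypothetical function, and the built-in `LinkOnlyFilter` may use different rules (for example around other URL schemes or surrounding punctuation).

```typescript
// True when the message, after trimming, consists solely of http(s) URLs separated by whitespace.
function isLinkOnly(content: string): boolean {
  const tokens = content
    .trim()
    .split(/\s+/)
    .filter((token) => token.length > 0);
  if (tokens.length === 0) return false; // an empty message is not a link
  return tokens.every((token) => /^https?:\/\/\S+$/i.test(token));
}

console.log(isLinkOnly("https://example.com/cat.gif"));
console.log(isLinkOnly("look at this https://example.com/cat.gif"));
```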
### Writing Your Own Plugin
1. **Formatters**
```ts
export class MyFormatter implements FormatterPluginClass {
PluginInfo = { id: "myfmt", name: "...", type: "formatter", version: "1.0.0" };
apply(data, options) { /* return CKContextResult */ }
}
PluginRegistry.registerFormatter(MyFormatter);
```
2. **Converters**
```ts
export class MyConverter implements ConverterPluginClass {
PluginInfo = { id: "myconv", name: "...", type: "converter", version: "1.0.0" };
async apply(convs, prompt) { /* return string[] */ }
}
PluginRegistry.registerConverter(MyConverter);
```
3. **Filters**
```ts
export class MyFilter implements FilterPluginClass {
PluginInfo = { id: "myfilter", name: "...", type: "filter", version: "1.0.0" };
filterType: 'MUST' | 'MUST_NOT' = 'MUST_NOT';
apply(content) { /* return boolean */ }
}
PluginRegistry.registerFilter(MyFilter);
```
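This README does not spell out how `filterType` is interpreted. One plausible reading, shown here purely as an assumption, is that a message is kept only if every `MUST` filter matches it and no `MUST_NOT` filter does. `SketchFilter`, `passesFilters`, and the two example filters are hypothetical.

```typescript
type FilterType = "MUST" | "MUST_NOT";

// Stand-in for a filter plugin; the real FilterPluginClass also carries PluginInfo.
interface SketchFilter {
  filterType: FilterType;
  apply(content: string): boolean;
}

// Assumed semantics: keep a message only if all MUST filters match and no MUST_NOT filter matches.
function passesFilters(content: string, filters: readonly SketchFilter[]): boolean {
  return filters.every((filter) =>
    filter.filterType === "MUST" ? filter.apply(content) : !filter.apply(content),
  );
}

// Made-up filters: require non-empty text, reject bot-style "!" commands.
const exampleFilters: SketchFilter[] = [
  { filterType: "MUST", apply: (content) => content.trim().length > 0 },
  { filterType: "MUST_NOT", apply: (content) => content.startsWith("!") },
];

console.log(passesFilters("hello there", exampleFilters));
console.log(passesFilters("!ping", exampleFilters));
```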
## Contributing
Contributions are very welcome!
- **Suggest a feature** via GitHub Issues.
- **Report bugs** or raise PRs to fix them.
- **Add new providers** (Slack, Teams, custom exports).
- **Write plugins** for new formats or filters.
## License
This project is licensed under the MIT License.
Feel free to use, modify, and distribute as you see fit!