llm-complete

# Architecture Spinning up a node app that does completions is pretty straightforward, but for CLI usage I ran into a few usability conditions that eventually led to the following code. I wanted this to be easy to use with a smooth user experience. Below is a complete outline of every code block explaining what it does and the reasoning behind it. ## Dependencies Uses minimal imports. [GPT4All](https://www.nomic.ai/gpt4all) and it's [node bindings](https://www.npmjs.com/package/gpt4all) are the only external requirements. ```javascript import { loadModel, createCompletionStream } from 'gpt4all'; import { open as openFile } from 'node:fs/promises'; import { fileURLToPath } from 'node:url'; import { dirname, join } from 'node:path'; import readline from 'node:readline'; ``` ## State Management To create a smooth user experience. ```javascript const state = { busy: false, // Block input during generation killed: false, // Handle cancellation gracefully flashOn: false, // Loading animation state flashLoop: null // Loading animation timer }; ``` ## Error Handling Gracefully shutdown on unexpected errors. ```javascript process.on('uncaughtException', err => { console.error(err); if (state.busy) { shutdown(); } else { process.exit(); } }); ``` ## Processing mode Can be `'cpu'` | `'gpu'` | `'amd'` | `'nvidia'` | `'intel'` | `'<other_gpu_name>'`. The best avaiable gpu will be used by default, falls back to cpu if no gpu available. ```javascript const device = process.env.DEVICE ?? 'gpu'; ``` ## Model Configuration Load a local model using GPT4All bindings. If you want to experiment with different models you can read these values in from ENV. ```javascript // Get model config path const __filename = fileURLToPath(import.meta.url); const __dirname = dirname(__filename); const modelConfigPath = join(__dirname, "models.json"); // Select Model const modelName = process.env.MODEL ?? 'mistral-7b-v0.1.Q4_K_M.gguf'; const ctx = process.env.CTX ?? 2048; // 2048 is max for Mistral 7b const model = await loadModel(modelName, { modelConfigFile: "./models.json", // Per-model settings allowDownload: false, // We will manually download gguf file verbose: false, // Supress detailed output from model device: device, // Processing device, set by ENV variable nCtx: ctx, // Max context size, varies by model ngl: 100 // Number of gpu layers to use }); ``` The model must exist in GPT4All's model path. On arch this is `~/.local/share/nomic.ai/GPT4All/`. An entry for this model must exist in **models.json**. You can use the [metadata provided by nomic](https://raw.githubusercontent.com/nomic-ai/gpt4all/main/gpt4all-chat/metadata/models3.json) or specify your own in the following format if your model is not listed. The GPT4All wiki provides guidance on [configuring custom models](https://github.com/nomic-ai/gpt4all/wiki/Configuring-Custom-Models). ```json [ { "order": "a", "name": "Mistral 7B", "filename": "mistral-7b-v0.1.Q4_K_M.gguf", "url": "https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/blob/main/mistral-7b-v0.1.Q4_K_M.gguf?download=true", "md5sum": "a5b363017e471c713665d57433f76e65", "filesize": "4368438912", "requires": "2.5.0", "ramrequired": "8", "parameters": "7 billion", "quant": "q4_0", "type": "Mistral", "description": "For creative completions, developed by Mistral AI", "promptTemplate": "%1", "chatTemplate": "", "systemPrompt": "" } ] ``` This project uses [Mistral 7B Base](https://mistral.ai/news/announcing-mistral-7b) converted to [GGUF Format by TheBloke](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF). There are plenty of models to choose from but for lightweight creative writing this one does quite well. It can be run on a decent laptop with no dedicated GPU and is released under the Apache 2.0 license allowing commercial use. There are newer and larger models in this series but this one hits a good balance of resource usage and creativity. The `systemPrompt` and `chatTemplate` options are not needed for basic completions. More on chat mode in the next article. This script does not generate responses with a personality. You cannot ask it a question and get a well formed answer. To use this tool, you feed it incomplete text, and it will complete the text for you. ### Example Input ``` This is a story ``` ### Example Output ``` This is a story about how my best friend and I are complete opposites. My name... ``` ## Tips for Better Completions 1. End your prompt mid-sentence for more natural continuations 2. Use markdown or code formatting to guide the style 3. Include examples of the desired output format 4. Keep context under 2048 tokens for best performance 5. Use append mode `-a` for iterative writing 6. Provide useful context details to guide the output ## Command Line Argument Processing Completions can be done a few different ways: - No input, random output to terminal - String input from command line with `-p` or `--prompt` flag. - File input with with `-f` or `--file` flag - Output can be redirected with `> output.txt` - Output can be appended to input file with `-a` or `--append` flag. ```javascript const args = process.argv.slice(2); const flags = { prompt : [ '-p', '--prompt' ], file : [ '-f', '--file' ], append : [ '-a', '--append' ] }; let inputPath, directInput, inputFile, outputFile; const useFile = args.some(arg => flags.file.includes(arg)); const append = args.some(arg => flags.append.includes(arg)); const prompt = args.some(arg => flags.prompt.includes(arg)); const inputIndex = getInputIndex(); function getInputIndex() { if (append) return args.findIndex(arg => flags.append.includes(arg)) + 1; if (useFile) return args.findIndex(arg => flags.file.includes(arg)) + 1; if (prompt) return args.findIndex(arg => flags.prompt.includes(arg)) + 1; return -1; } ``` Not the DRYest way to handle this but it gets the job done. ## Input Validation If file mode is specified, make sure a path was provided. For direct input, we can continue with or without a prompt. ```javascript // Proceed with or without input if (useFile || append) { // Get file path from args inputPath = args[inputIndex]; if (!inputPath) { console.error('Error: No input file specified after ', args[inputIndex-1]); process.exit(1); } } else { // Use string input from command line or empty input directInput = prompt ? args[inputIndex] : ''; } ``` ## Generator Settings Here you can adjust the quality of your output. There are many resources online discussing these options. ChatGPT can give you a good breakdown if needed. The settings below are reasonable defaults, adjust to your use case. ```javascript const predict = process.env.PREDICT ?? 128; const settings = { temperature: 0.7, // Controls creativity (0.0-1.0) topK: 40, // Limits vocabular to top K tokens topP: 0.9, // High probability cutoff minP: 0.1, // Low probability cutoff repeatPenalty: 1.2, // Penalize repeated tokens, 1 = No Penalty repeatLastN: 64, // Lookback window for repeats nBatch: 2048, // Tokens to process concurrently, higher values use more RAM nPredict: predict, // Maximum tokens to generate, increase for longer output contextErase: 0.75, // Percentage of past context to erase if exceeded promptTemplate: '%1' // Can override prompt template from config file }; ``` The big one to adjust here is `nPredict`. This decides how long your output will be. You can adjust this value with the `PREDICT` ENV var. The **128** token default setting will result in a decent size paragraph of text or equivalent (lists, code, etc). For example: ### Input ``` $ llm-complete -p "export class SillyButton extends HTMLElement {" ``` ### Output ```javascript export class SillyButton extends HTMLElement { constructor() { super(); this.attachShadow({ mode: 'open' }); const template = document.createElement('template'); template.innerHTML = `<style> :host { display: block; width: 100%; height: 56px; border-radius: 4px; background-color: #3278ff; color: white; font-size: 1.2rem; } </style>`; this. ``` You can continue generating in append mode to keep building off previous work. Using a text editor that supports streaming input like vscode or vim you can see the results in real time, make edits, save, then continue generating. ```shell $ llm-complete -a silly-button.js # add some text $ llm-complete -a silly-button.js # run again to add more ``` ## Terminal Interface Setup Connect input/output streams for writing to terminal. Override the default prompt and prevent tab completions. ```javascript const rl = readline.createInterface({ input: process.stdin, output: process.stdout, terminal: true, prompt: '', completer: () => [[], ''] }); ``` ## Input Control Functions Block all text input from keyboard while processing and hide the cursor. ```javascript function blockInput() { rl._ttyWrite = () => {}; process.stderr.write('\x1b[?25l'); } function restoreInput() { rl._ttyWrite = tty; process.stderr.write('\x1b[?25h'); } ``` Allow **Ctrl+C** to cancel generation even though other input is blocked. ```javascript rl.input.on('data', key => { const keyStr = key.toString(); if (state.busy) { if (keyStr === '\x03') { state.killed = true; return; } } }); ``` ## Interrupting Generation Return false in this callback to stop the model from generating any more tokens. You can process the current token here to decide whether or not to stop generating. In this script, we trigger cancellation only with ctrl+c but this can be expanded on if needed. ```javascript settings.onResponseToken = (tokenId, token) => { return !state.killed; }; ``` ## Progress Indication Show a flashing elipses while busy. Uses `stderr` to avoid poluting our output. ```javascript function startIndicating() { state.flashLoop = setInterval(() => { if (state.flashOn) { state.flashOn = false; process.stderr.write(' \b\b\b'); } else { state.flashOn = true; process.stderr.write('...\b\b\b'); } }, 400); } function stopIndicating() { clearInterval(state.flashLoop); state.flashOn = false; } ``` ## Streaming Append to File In append mode, tokens are streamed directly back to the input file. Before this we check if the input file ends in a single newline. If so, truncate it. This allows us to pass a partial sentence as input while adhering to POSIX text file standards. We only strip single newlines. Double newlines are left intact to allow starting completion with a new paragraph. ### Single Newline Example Input: ``` This is a story ``` Output: ``` This is a story about something... ``` ### Double Newline Example Input: ``` # Test Plan: - Do Tests - More Tests ``` Output: ``` # Test Plan: - Do Tests - More Tests # Test 1: - Check the input ``` The model decides how to continue the text. If it determines that there should be a newline after the input, it will add one. Manipulating the input like this helps the model continue the text in a natural way. To perform this check we read the last two bytes of the input file into a buffer. ```javascript async function createWriteStream() { try { const eof = (await inputFile.stat()).size - 2; if (eof > 1) { const tempFile = Buffer.alloc(2); const lastBytes = (await inputFile.read(tempFile,0,2,eof)).buffer; const isSingleNewline = (lastBytes[0] !== 0x0A && lastBytes[1] === 0x0A); if (isSingleNewline) { await inputFile.truncate(eof+1); } } outputFile = inputFile.createWriteStream(); } catch (err) { console.error('Error creating WriteStream:', err); shutdown(); } } ``` ## Write Output Stream output to file, redirected stdout, or direct to terminal. ```javascript function writeToken(token) { if (append) { try { outputFile.write(token); } catch (err) { console.error('Error writing to output file:', err); shutdown(); } } else if (!process.stdout.isTTY) { process.stdout.write(token); } else { rl.line += token; rl.cursor = rl.line.length; rl._refreshLine(); } } ``` ## Shutdown and Cleanup Prevents segfault by allowing the model time to free it's own resources. ```javascript function shutdown() { if (state.busy) { setTimeout(dispose, 800); } else { dispose(); } } function dispose() { model.dispose(); restoreInput(); process.exit(); } ``` ## Stream Processing Completion will continue until `nPredict` tokens are generated. This can result in fragmented sentences at the end of output. To prevent this, we will detect sentence boundaries and drop any trailing fragments. ```javascript const boundaries = /[.?!…:;\n]/; ``` To accomplish this task we must buffer the output. Default buffer is **30** tokens. This can be adjusted as necessary with the `BUFFER` ENV var. Output is delayed until the buffer is full. This allows us to drop the sentence fragment before output writing catches up. After all tokens are collected, we write the remaining buffer out on a timer to simulate the streaming effect of token generation. ```javascript async function processStream(input) { state.busy = true; // For stdout, we need to trim newlines from // input like we do when appending to file input = input.replace(/[^\n](\n)$/,''); try { if (append) { // appending file already holds input await createWriteStream(); } else { // write input to terminal writeToken(input); } // prompt the model const stream = createCompletionStream( model, input, settings ); // Configure buffer const bufferAhead = process.env.BUFFER ?? 30; let buffer = []; let currentToken = -1; let currentIndex = 0; let currentBoundary = -1; // Loop until all tokens are received for await (let token of stream.tokens) { if (state.killed) return; // Prevent double space between input and output if (currentToken < 0 && token.startsWith(' ') && input.endsWith(' ') ) { token = token.toString().slice(1); } // Buffer tokens buffer.push(token); currentToken++; // Detect position of last sentence boundary if (token.match(boundaries)) { currentBoundary = buffer.length; } // Hide loading indicator before outputing to terminal if (currentToken == bufferAhead && !append && process.stdout.isTTY) { stopIndicating(); } // Don't start outputting until buffer is full if (currentToken >= bufferAhead) { writeToken(buffer[currentIndex]); currentIndex++; } } // Drop any trailing sentence fragment from buffer if (currentBoundary) { const boundary = currentBoundary ? currentBoundary : buffer.length-1 buffer = buffer.slice(currentIndex, boundary); while (buffer.slice(-1)[0]?.match(/\n/)) { buffer.pop(); } } // Process remaining buffer by continuing to output one token at time for (let i = 0; i < buffer.length; i++) { if (state.killed) return; await new Promise(resolve => { setTimeout(() => { writeToken(buffer[i]); resolve(); }, 200); }); } } catch (error) { handleProcessingError(error); } finally { proccessingStopped(); } } ``` ## Error Handling Gracefully shutdown if errors are encountered during processing. ```javascript function handleProcessingError(err) { console.error('Error processing stream:', err.message); shutdown(); } ``` ## Processing Completion Write a final newline, reset terminal state and shutdown gracefully when processing stops. ```javascript function proccessingStopped() { writeToken('\n'); stopIndicating(); restoreInput(); shutdown(); } ``` ## Process Initialization We read the input file async to prevent blocking. This ensures the flashing indicator and cancel detection will work while loading. Start processing after the file is fully read. ```javascript blockInput(); startIndicating(); if (inputFile) { try { inputFile.readFile('utf8') .then(input => processStream(input)); } catch (error) { console.error('Error reading file:', error.message); process.exit(1); } } else { processStream(directInput); } ``` ## Future Improvements I originally had included inline editing of results on the terminal but this proved to be trickier than expected. The core functionality works, but there some quirks that need handling. I may release an update with this feature at some point if I can get it working properly. ## Installation You can use this to base your own implementation on. I have released [the code](https://github.com/besworks/llm-complete) under the MIT License. Or you can [install via npm](https://www.npmjs.com/package/llm-complete) and start using it right away. Run it via the installed `llm-complete` executable.