<div align="center">
<div>
<h1 align="center">Groq OCR 🔬</h1>
</div>
<p>An npm library and CLI to run OCR with Groq provided models.</p>
<a href="https://www.npmjs.com/package/groq-ocr"><img src="https://img.shields.io/npm/v/groq-ocr" alt="Current version"></a>
<a href="https://groq.com" target="_blank" rel="noopener noreferrer">
<img
src="https://groq.com/wp-content/uploads/2024/03/PBG-mark1-color.svg"
alt="Powered by Groq for fast inference."
width="200"
height="200"
/>
</a>
</div>
## Table of Contents
- [Disclaimer](#disclaimer)
- [Installation](#installation)
- [Usage](#usage)
  - [Use as NPM package](#use-as-npm-package)
  - [ocr options](#ocr-options)
  - [Use as CLI](#use-as-cli)
  - [CLI Examples](#cli-examples)
  - [CLI Options](#cli-options)
- [How it works](#how-it-works)
- [Models](#models)
- [Roadmap](#roadmap)
- [Credit](#credit)
---
## Disclaimer
- _This project is still in development‼️_
- _Multi-page PDF support is experimental and a work in progress._
- _PDF support relies on the [pdftopic](https://github.com/Ilyes-El-Majouti/pdftopic) library, which requires Node.js >= 12 and ImageMagick._
- _JSON mode might fail with a `json_validate_failed` error._
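
Because JSON mode can fail, a caller may want to fall back to markdown output instead of giving up. The sketch below is one way to do that. `ocrWithJsonFallback` is a hypothetical helper, not part of groq-ocr, and it assumes the failure surfaces as a thrown error whose `code` or `message` contains `json_validate_failed`. The `ocr` function is passed in as `ocrFn`, so the helper itself imports nothing.

```javascript
// Hypothetical wrapper, not part of groq-ocr. `ocrFn` is expected to be the
// `ocr` function exported by groq-ocr.
async function ocrWithJsonFallback(ocrFn, options) {
  try {
    const result = await ocrFn({ ...options, jsonMode: true });
    return { format: "json", result };
  } catch (error) {
    // Assumption: the failure carries "json_validate_failed" in its code or
    // message. Any other error is rethrown untouched.
    const details = `${error?.code ?? ""} ${error?.message ?? ""}`;
    if (!details.includes("json_validate_failed")) throw error;

    // Retry once in markdown mode.
    const result = await ocrFn({ ...options, jsonMode: false });
    return { format: "markdown", result };
  }
}
```

A call would look like `const { format, result } = await ocrWithJsonFallback(ocr, { filePath: "./scan.pdf" });`, with `format` telling the caller which kind of output came back.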
## Installation
`npm i groq-ocr` to use as an NPM package.
`npm i -g groq-ocr` to use as a CLI.
## Usage
### Use as NPM package:
```javascript
import { ocr, GroqVisionModel } from "groq-ocr";

const result = await ocr({
  filePath: "./filepath.jpg", // Allowed formats: jpg, jpeg, png, pdf.
  apiKey: process.env.GROQ_API_KEY, // Get your API key from https://console.groq.com/
  model: GroqVisionModel.LLAMA_32_90B, // Available models: LLAMA_32_11B, LLAMA_32_90B. Default: LLAMA_32_11B
  jsonMode: false, // Default: false. Set to true to get JSON output.
  additionalInstructions: "Additional instructions to be included in the prompt.", // Use to give custom instructions to the model.
});
```
### ocr options:
- **filePath** (required): Path or URL of the image or PDF file
  - Supported formats: `.jpg`, `.jpeg`, `.png`, `.pdf`
- **apiKey** (optional): Groq API key
  - Defaults to the `GROQ_API_KEY` environment variable
- **model** (optional): Vision model to use
  - `GroqVisionModel.LLAMA_32_11B` (default) - Llama 3.2 11B Vision Preview
  - `GroqVisionModel.LLAMA_32_90B` - Llama 3.2 90B Vision Preview
- **jsonMode** (optional): Return structured JSON instead of markdown
  - Defaults to `false`
- **additionalInstructions** (optional): Custom instructions to include in the prompt
  - Defaults to `""`
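
Since `filePath` accepts only the four listed formats, it can be useful to check the extension before spending an API call. The sketch below does that for both local paths and URLs. `isSupportedFile` is a hypothetical helper, not part of groq-ocr, and whether the library performs a similar check internally is not stated here.

```javascript
// Hypothetical helper, not part of groq-ocr: checks a path or URL against
// the formats listed for `filePath`.
const SUPPORTED_EXTENSIONS = [".jpg", ".jpeg", ".png", ".pdf"];

function isSupportedFile(filePath) {
  // For http(s) URLs, look only at the path so query strings are ignored.
  let pathname = filePath;
  if (/^https?:\/\//i.test(filePath)) {
    pathname = new URL(filePath).pathname;
  }
  const lower = pathname.toLowerCase();
  return SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

console.log(isSupportedFile("./scan.PDF")); // true
console.log(isSupportedFile("https://example.com/receipt.png?size=large")); // true
console.log(isSupportedFile("notes.txt")); // false
```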
### Use as CLI:
Either set your Groq API key as an environment variable:
```bash
export GROQ_API_KEY=your-api-key
```
Or pass it with the `-k` flag when running a command.
### CLI Examples
```bash
# Basic usage
groq-ocr -f image.jpg
# Output as JSON
groq-ocr -f scan.pdf -j
# Save to file
groq-ocr -f receipt.png -o result.txt
# Use specific model and API key
groq-ocr -f document.jpg -m llama-3.2-90b-vision-preview -k your-api-key
```
### CLI Options
- `-f, --file <path>` (required): Path to input image/PDF file
- `-k, --api-key <key>`: Groq API key (defaults to `GROQ_API_KEY` env var)
- `-m, --model <model>`: Vision model to use:
  - `llama-3.2-11b-vision-preview` (default)
  - `llama-3.2-90b-vision-preview`
- `-j, --json`: Output in JSON format instead of markdown
- `-o, --output <path>`: Write result to file instead of console
- `-V, --version`: Display version number
- `-h, --help`: Display help information
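
When the CLI is driven from a Node.js script, the flags above can be assembled from a plain options object. This is a minimal sketch. `buildCliArgs` and its option names (`file`, `apiKey`, `model`, `json`, `output`) are hypothetical and chosen only for illustration. The flags themselves are the ones documented in this section.

```javascript
// Hypothetical helper, not part of groq-ocr: turns an options object into
// the argument list for the `groq-ocr` command.
function buildCliArgs({ file, apiKey, model, json = false, output }) {
  if (!file) throw new Error("file is required (-f)");
  const args = ["-f", file];
  if (apiKey) args.push("-k", apiKey);
  if (model) args.push("-m", model);
  if (json) args.push("-j");
  if (output) args.push("-o", output);
  return args;
}

const args = buildCliArgs({ file: "scan.pdf", json: true, output: "result.json" });
console.log(["groq-ocr", ...args].join(" ")); // groq-ocr -f scan.pdf -j -o result.json
```

Passing the array to `execFile("groq-ocr", args)` from Node's `child_process` module, rather than joining it into a shell string, avoids quoting problems with paths that contain spaces.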
## How it works
This library and CLI use multimodal models with vision capabilities provided by [Groq](https://groq.com/) to run OCR on images and PDFs and return markdown or JSON.
PDFs are converted to images using [pdftopic](https://github.com/Ilyes-El-Majouti/pdftopic).
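
To make the flow concrete, here is a sketch of how an image can be packaged for a vision model. Groq's chat API is OpenAI-compatible and accepts images as `image_url` content parts, which may carry a base64 data URL. Whether groq-ocr builds its request exactly this way is an assumption. `buildVisionMessages` and the prompt text are hypothetical.

```javascript
// Sketch only: the message shape follows Groq's OpenAI-compatible chat format.
// It is not taken from groq-ocr's source.
function buildVisionMessages(imageBytes, mimeType, prompt) {
  // Encode the raw image bytes as a base64 data URL.
  const base64 = Buffer.from(imageBytes).toString("base64");
  return [
    {
      role: "user",
      content: [
        { type: "text", text: prompt },
        { type: "image_url", image_url: { url: `data:${mimeType};base64,${base64}` } },
      ],
    },
  ];
}

// Placeholder bytes stand in for a real image file.
const messages = buildVisionMessages(
  Buffer.from("hello"),
  "image/png",
  "Convert this image to markdown.",
);
console.log(messages[0].content[1].image_url.url); // data:image/png;base64,aGVsbG8=
```

The resulting `messages` array would then be sent in a chat completion request together with one of the supported model IDs.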
## Models
The plan is to support every vision-capable model that Groq provides; see [Groq vision models](https://console.groq.com/docs/vision).

Currently supported models:
```typescript
enum GroqVisionModel {
  LLAMA_32_11B = "llama-3.2-11b-vision-preview",
  LLAMA_32_90B = "llama-3.2-90b-vision-preview",
}
```
## Roadmap
- [x] Add support for local images OCR
- [x] Add support for remote images OCR
- [x] Add support for single page PDFs
- [x] Add support for JSON output in addition to markdown
- [x] Add CLI
- [x] Extend prompt with custom instructions
- [ ] Add support for multi-page PDFs OCR (Available but experimental)
## Credit
This project was heavily inspired by [llama-ocr](https://github.com/Nutlope/llama-ocr/tree/main).