# tesseract-wasm

A WebAssembly build of the [Tesseract](https://github.com/tesseract-ocr/tesseract)
OCR engine for use in the browser and Node.

tesseract-wasm can detect and recognize text in document images. It supports multiple languages via different [trained models](https://tesseract-ocr.github.io/tessdoc/Data-Files).

[**Try the demo**](https://robertknight.github.io/tesseract-wasm/) (Currently supports English)

This Tesseract build has been optimized for use in the browser by:

- Stripping functionality which is not needed in a browser environment (e.g.
  code to parse various image formats) to reduce download size and improve
  startup performance. The library and English training data require a ~2.1MB
  download (with Brotli compression).
- Using [WebAssembly SIMD](https://v8.dev/features/simd) when available
  (Chrome >= 91, Firefox >= 90, Safari >= 16.3) to improve text
  recognition performance.
- Providing a high-level API that can be used to run OCR in web pages without
  blocking interaction and a low-level API that provides more control over
  execution.
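
To illustrate the SIMD point above, here is a minimal sketch of how a page could detect WebAssembly SIMD support by validating a tiny probe module that contains a SIMD instruction. The helper names `supportsWasmSimd` and `coreBinaryName` are hypothetical, and the mapping of `tesseract-core.wasm` to the SIMD build and `tesseract-core-fallback.wasm` to the non-SIMD build is an assumption based on the file names. The library may well make this choice for you.

```js
// Tiny WebAssembly module with one function that returns a v128 value.
// It only validates on engines that support WebAssembly SIMD.
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0,     // "\0asm" magic + version 1
  1, 5, 1, 96, 0, 1, 123,          // type section: () -> v128
  3, 2, 1, 0,                      // function section: one function of type 0
  10, 10, 1, 8, 0,                 // code section: one body, no locals
  65, 0, 253, 15, 253, 98, 11,     // i32.const 0; i8x16.splat; i8x16.popcnt; end
]);

// Returns true if the current engine accepts SIMD instructions.
function supportsWasmSimd() {
  return typeof WebAssembly === 'object' && WebAssembly.validate(SIMD_PROBE);
}

// Hypothetical helper: which binary a page would want to fetch.
// Assumes the "fallback" file is the build without SIMD.
function coreBinaryName() {
  return supportsWasmSimd()
    ? 'tesseract-core.wasm'
    : 'tesseract-core-fallback.wasm';
}
```

A check like this is mainly useful for diagnostics, for example logging which build a given browser is expected to use.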
## Setup
1. Add the tesseract-wasm library to your project:

   ```sh
   npm install tesseract-wasm
   ```

2. Serve the `tesseract-core.wasm`, `tesseract-core-fallback.wasm` and
   `tesseract-worker.js` files from `node_modules/tesseract-wasm/dist` alongside
   your JavaScript bundle.
3. Get the training data file(s) for the languages you want to support from the
   [tessdata_fast](https://github.com/tesseract-ocr/tessdata_fast) repo and
   serve them from a URL that your JavaScript can load. For example, the
   `eng.traineddata` file supports English and also works with many documents
   in other languages that use the same script.
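
Step 2 above can be automated with a small Node script. The sketch below copies the three runtime files into a directory that your web server exposes. The `copyOcrAssets` helper and the `public/ocr` destination are assumptions for illustration and are not part of tesseract-wasm; adjust the paths to match your own build setup.

```js
const fs = require('node:fs');
const path = require('node:path');

// Runtime files named in the setup steps.
const ASSETS = [
  'tesseract-core.wasm',
  'tesseract-core-fallback.wasm',
  'tesseract-worker.js',
];

// Copy the runtime files from srcDir to destDir, creating destDir if needed.
// Returns the list of destination paths.
function copyOcrAssets(srcDir, destDir) {
  fs.mkdirSync(destDir, { recursive: true });
  for (const name of ASSETS) {
    fs.copyFileSync(path.join(srcDir, name), path.join(destDir, name));
  }
  return ASSETS.map((name) => path.join(destDir, name));
}

// Example invocation (destination path is an assumption about your project):
// copyOcrAssets('node_modules/tesseract-wasm/dist', 'public/ocr');
```

Many bundlers offer a static-copy plugin that does the same job; a plain script like this keeps the step independent of the bundler.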
## Usage
tesseract-wasm provides two APIs: a high-level asynchronous API (`OCRClient`)
and a lower-level synchronous API (`OCREngine`). The high-level API is the most
convenient way to run OCR on an image in a web page. It handles running the OCR
engine inside a Web Worker to avoid blocking page interaction. The low-level API
is useful if more control is needed over where/how the code runs and has lower
latency per API call.
### Using OCRClient in a web page
```js
import { OCRClient } from 'tesseract-wasm';

async function runOCR() {
  // Fetch document image and decode it into an ImageBitmap.
  const imageResponse = await fetch('./test-image.jpg');
  const imageBlob = await imageResponse.blob();
  const image = await createImageBitmap(imageBlob);

  // Initialize the OCR engine. This will start a Web Worker to do the
  // work in the background.
  const ocr = new OCRClient();

  try {
    // Load the appropriate OCR training data for the image(s) we want to
    // process.
    await ocr.loadModel('eng.traineddata');

    // Pass the decoded image to the OCR engine.
    await ocr.loadImage(image);

    // Perform text recognition and return text in reading order.
    const text = await ocr.getText();

    console.log('OCR text: ', text);
  } finally {
    // Once all OCR-ing has been done, shut down the Web Worker and free up
    // resources.
    ocr.destroy();
  }
}

runOCR();
```
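
When several images use the same language, the model only needs to be loaded once. The following is a minimal sketch of a hypothetical `recognizeAll` helper that reuses one client for a list of images. It relies only on the `loadModel`, `loadImage`, `getText` and `destroy` methods shown above, and it assumes the client can process images one after another once the model is loaded.

```js
// Run OCR on each image in turn with a single client, then shut it down.
// `client` is expected to behave like an OCRClient instance.
async function recognizeAll(client, modelUrl, images) {
  const texts = [];
  try {
    // Load the training data once for the whole batch.
    await client.loadModel(modelUrl);

    for (const image of images) {
      await client.loadImage(image);
      texts.push(await client.getText());
    }
  } finally {
    // Always release the worker, even if recognition fails part-way.
    client.destroy();
  }
  return texts;
}

// Example usage in a page (bitmapA and bitmapB are ImageBitmaps you created):
// const texts = await recognizeAll(new OCRClient(), 'eng.traineddata', [bitmapA, bitmapB]);
```

Taking the client as a parameter keeps the helper easy to test with a stub object.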

See the `examples/` directory for projects that show usage of the library in
the browser and Node.

See the [API documentation](https://robertknight.github.io/tesseract-wasm/api/)
for detailed usage information.

See the Tesseract [User Manual](https://tesseract-ocr.github.io/tessdoc/) for
information on how Tesseract works, as well as advice on [improving
recognition](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html).

## Development

### Prerequisites

To build this library locally, you will need:

- A C++ build toolchain (e.g. via the `build-essential` package on Ubuntu or Xcode on macOS)
- [CMake](https://cmake.org)
- [Ninja](https://ninja-build.org)

The [Emscripten](https://emscripten.org) toolchain used to compile C++ to
WebAssembly is downloaded as part of the build process.

To install CMake and Ninja:

#### On macOS

```sh
brew install cmake ninja
```

#### On Ubuntu

```sh
sudo apt-get install cmake ninja-build
```

### Building the library
```sh
git clone https://github.com/robertknight/tesseract-wasm
cd tesseract-wasm
# Build WebAssembly binaries and JS library in dist/ folder
make lib
# Run tests
make test
```
To test your local build of the library with the example projects, or your own
projects, you can use [yalc](https://www.npmjs.com/package/yalc).
```sh
# In this project
yalc publish
# In the project where you want to use your local build of tesseract-wasm
yalc link tesseract-wasm
```