# MWoffliner
MWoffliner is a tool for making a local offline HTML snapshot of any
online [MediaWiki](https://mediawiki.org) instance. It goes through
all online articles (or a selection if specified) and creates the
corresponding [ZIM](https://openzim.org) file. It has mainly been
tested against Wikimedia projects like
[Wikipedia](https://wikipedia.org) and
[Wiktionary](https://wiktionary.org), but it should also work for
any recent MediaWiki.
Read [CONTRIBUTING.md](./CONTRIBUTING.md) to learn more about
MWoffliner development.

User help is available in the form of a
[FAQ](https://github.com/openzim/mwoffliner/wiki/Frequently-Asked-Questions).
## Features
- Scrape with or without image thumbnails
- Scrape with or without audio/video multimedia content
- S3 cache (optional)
- Image size optimiser / Webp converter
- Scrape all articles in the selected namespaces, or only the articles in a title list
- Specify additional/non-main namespaces to scrape
Run `mwoffliner --help` to get all the possible options.
## Prerequisites
- [Docker](https://docs.docker.com/engine/install/) (or Docker-based engine)
- amd64 architecture
## Installation
The recommended way to install and run `mwoffliner` is using the pre-built Docker container:
```sh
docker pull ghcr.io/openzim/mwoffliner
```
<details>
<summary>Run software locally / Build from source</summary>
### Prerequisites for local execution
- *NIX Operating System (GNU/Linux, macOS, ...)
- [Redis](https://redis.io/)
- [Node.js](https://nodejs.org/en/) version 24 (only this single Node.js version is supported; other versions may or may not work)
- [Libzim](https://github.com/openzim/libzim) (downloaded automatically on GNU/Linux and macOS)
- Various build tools which are probably already installed on your
machine (packages `libjpeg-dev`, `libglu1`, `autoconf`, `automake`, `gcc` on
Debian/Ubuntu)
... and an online MediaWiki with its API available.
### Installation methods
#### Build your own container
1. Clone the repository locally:
```sh
git clone https://github.com/openzim/mwoffliner.git && cd mwoffliner
```
1. Build the image:
```sh
docker build . -f docker/Dockerfile -t ghcr.io/openzim/mwoffliner
```
#### Run the software locally using NPM
> [!WARNING]
> Local installation requires several system dependencies (see above). Using the Docker image is strongly recommended to avoid setup issues.
1. Install the latest released MWoffliner version from NPM (use `-g` to install globally):
```sh
npm i -g mwoffliner
```
> [!WARNING]
> You might need to run this command with `sudo`, depending on
> how your `npm` / OS is configured. `npm` permission checking can be confusing for
> newcomers. Please read the documentation carefully if you hit problems: https://docs.npmjs.com/cli/v7/using-npm/scripts#user
</details>
## Usage
### Using Docker (Recommended)
```sh
# Get help
docker run -v "$(pwd)/out:/out" -ti ghcr.io/openzim/mwoffliner mwoffliner --help
```
```sh
# Create a ZIM for https://bm.wikipedia.org
docker run -v "$(pwd)/out:/out" -ti ghcr.io/openzim/mwoffliner \
  mwoffliner --mwUrl=https://bm.wikipedia.org --adminEmail=foo@bar.net
```
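
When the scraper is launched from a Node.js script rather than typed by hand, the Docker command above can be assembled as an argument array. The following is a minimal sketch, not part of MWoffliner: `buildDockerArgs` is a hypothetical helper, and it assumes that every option is passed as `--name=value` (with `true` booleans passed as a bare `--name` flag), mirroring the commands shown above. The `-ti` flags are left out because a script usually has no interactive terminal.

```javascript
// Hypothetical helper (not part of MWoffliner): builds the argument list for
// `docker run`, mirroring the command shown above without the `-ti` flags.
function buildDockerArgs(outDir, options) {
  const flags = Object.entries(options)
    // Skip options that are unset or explicitly disabled.
    .filter(([, value]) => value !== undefined && value !== null && value !== false)
    // Assumption: `true` becomes a bare flag, everything else `--name=value`.
    .map(([name, value]) => (value === true ? `--${name}` : `--${name}=${value}`));

  return [
    "run",
    "-v", `${outDir}:/out`, // mount the host output directory on /out
    "ghcr.io/openzim/mwoffliner",
    "mwoffliner",
    ...flags,
  ];
}

const args = buildDockerArgs("/tmp/out", {
  mwUrl: "https://bm.wikipedia.org",
  adminEmail: "foo@bar.net",
});
console.log(["docker", ...args].join(" "));
```

Such an array can be handed to `child_process.spawn("docker", args)`, which avoids shell quoting problems with paths or option values that contain spaces.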
<details>
<summary>Using NPM / Local Install</summary>
```sh
# Get help
mwoffliner --help
```
```sh
# Create a ZIM for https://bm.wikipedia.org
mwoffliner --mwUrl=https://bm.wikipedia.org --adminEmail=foo@bar.net
```
</details>
To use MWoffliner with an S3 cache, you should provide an S3 URL like
this:
```sh
--optimisationCacheUrl="https://wasabisys.com/?bucketName=my-bucket&keyId=my-key-id&secretAccessKey=my-sac"
```
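
Secret keys often contain characters such as `/` or `+` that are unsafe to paste into a query string by hand. As a hedged sketch, the URL can be assembled with the standard `URL` class so that the values are percent-encoded. `buildOptimisationCacheUrl` is a hypothetical helper, not part of MWoffliner; the query parameter names are taken from the example above, and the sketch assumes MWoffliner decodes percent-encoded values.

```javascript
// Hypothetical helper (not part of MWoffliner): assembles the S3 cache URL.
// The query parameter names come from the example above.
function buildOptimisationCacheUrl(endpoint, { bucketName, keyId, secretAccessKey }) {
  const url = new URL(endpoint);
  // searchParams.set() percent-encodes reserved characters such as "/" and "+".
  url.searchParams.set("bucketName", bucketName);
  url.searchParams.set("keyId", keyId);
  url.searchParams.set("secretAccessKey", secretAccessKey);
  return url.toString();
}

console.log(
  buildOptimisationCacheUrl("https://wasabisys.com", {
    bucketName: "my-bucket",
    keyId: "my-key-id",
    secretAccessKey: "my-sac",
  })
);
```

The resulting string is what would be passed as the value of `--optimisationCacheUrl`.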
## Contribute
If you've retrieved the MWoffliner source code (e.g. with a git clone of our repo), you can then install and run it locally (including with your local modifications):
```bash
npm i
npm run mwoffliner -- --help
```
Detailed [contribution documentation and guidelines](CONTRIBUTING.md) are available.
## API
MWoffliner also provides an API and can therefore be used as a Node.js
library. Here is a stub example that could go in your `index.mjs` file:
```javascript
import * as mwoffliner from 'mwoffliner';

// Scraper options
const parameters = {
  mwUrl: "https://es.wikipedia.org",
  adminEmail: "foo@bar.net",
  verbose: true,
  format: "nopic",
  articleList: "./articleList"
};

mwoffliner.execute(parameters); // returns a Promise
```
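
Because `execute` returns a Promise, a caller will usually want to wait for it and handle a failed scrape explicitly. The sketch below shows one way to do that under stated assumptions: `runScrape` is a hypothetical wrapper, not part of MWoffliner, and the `execute` function is passed in as an argument so that the example runs without the package installed. `fakeExecute` is only a stand-in for illustration.

```javascript
// Hypothetical wrapper (not part of MWoffliner). `execute` is injected so the
// sketch runs without the package; pass `mwoffliner.execute` in real use.
async function runScrape(execute, parameters) {
  try {
    const result = await execute(parameters);
    return { ok: true, result };
  } catch (error) {
    // A rejected Promise lands here instead of becoming an unhandled rejection.
    return { ok: false, error };
  }
}

// Stand-in for `mwoffliner.execute`, used only for illustration.
const fakeExecute = async (parameters) => `scraped ${parameters.mwUrl}`;

runScrape(fakeExecute, { mwUrl: "https://es.wikipedia.org", adminEmail: "foo@bar.net" })
  .then((outcome) => console.log(outcome.ok ? outcome.result : outcome.error.message));
```

In a real script the call would be `runScrape(mwoffliner.execute, parameters)`, with the process exit code set according to `outcome.ok`.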
## Background
Complementary information about MWoffliner:
* MediaWiki software is used by thousands of wikis, the most
famous ones being the Wikimedia ones, including [Wikipedia](https://wikipedia.org).
* MediaWiki is a PHP wiki runtime engine.
* Wikitext is the name of the markup language that MediaWiki uses.
* MediaWiki includes a parser that converts Wikitext into HTML, and this
parser creates the HTML pages displayed in your browser.
* Have a look at the scraper [functional architecture](docs/functional_architecture.md).
## License
[GPLv3](https://www.gnu.org/licenses/gpl-3.0) or later, see
[LICENSE](LICENSE) for more details.
## Acknowledgements
This project received funding through [NGI Zero Core](https://nlnet.nl/core), a fund established by [NLnet](https://nlnet.nl/) with financial support from the European Commission's [Next Generation Internet](https://ngi.eu/) program. Learn more at the [NLnet project page](https://nlnet.nl/project/MWOffliner).
[<img width="20%" alt="NLnet foundation logo" src="https://github.com/user-attachments/assets/22233242-ec49-4540-a0af-b70725cedbee" />](https://nlnet.nl/)
[<img width="20%" alt="NGI Zero Logo" src="https://github.com/user-attachments/assets/1bbbda57-dc6f-4902-ae29-236e5e89228f" />](https://nlnet.nl/core)