# MongoDB Implementation of FetchQueue Interface for Simplecrawler
This is an implementation of the FetchQueue interface for [simplecrawler](https://github.com/simplecrawler/simplecrawler), using MongoDB as the backend.
Features:
- Pause/stop/kill/terminate a running job without losing the queue state
- Run one crawler job in parallel across several crawler instances sharing a single queue, including adding and removing instances at runtime (see the sketch after the usage example below)
# Installation
```
npm install --save mongodb-simplecrawler-queue
```
# Usage
All you need is a connection configuration: the URL of the MongoDB instance plus the database and collection names to use:
```js
const MongoDbQueue = require('mongodb-simplecrawler-queue').MongoDbQueue; // or import { MongoDbQueue } from 'mongodb-simplecrawler-queue';
const Crawler = require('simplecrawler');

const connectionConfig = {
  url: 'mongodb://localhost:27017',
  dbName: 'crawler',
  collectionName: 'queue',
};

const crawlerQueue = new MongoDbQueue(connectionConfig);

crawlerQueue.init(err => {
  if (err) {
    console.log(err);
    process.exit(0);
  }

  const crawler = new Crawler('https://en.wikipedia.org/wiki/Main_Page');
  crawler.maxDepth = 3;
  crawler.allowInitialDomainChange = false;
  crawler.filterByDomain = true;
  // here we are - tell crawler to use mongo as a queue
  crawler.queue = crawlerQueue;
  crawler.start();
});
```
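Because the queue state lives in MongoDB rather than in process memory, several crawler instances can work on one job by pointing at the same database and collection. A minimal sketch of such a worker follows; the script itself is an assumption built from the API shown above, not part of the package:
```js
// worker.js - start one of these per process; all workers share one queue collection
const MongoDbQueue = require('mongodb-simplecrawler-queue').MongoDbQueue;
const Crawler = require('simplecrawler');

const crawlerQueue = new MongoDbQueue({
  url: 'mongodb://localhost:27017',
  dbName: 'crawler',
  collectionName: 'queue', // the same collection in every worker -> one shared job
});

crawlerQueue.init(err => {
  if (err) {
    console.log(err);
    process.exit(0);
  }
  // Every worker is configured identically; the shared queue coordinates which
  // worker fetches which item, so workers can be added or removed at runtime.
  const crawler = new Crawler('https://en.wikipedia.org/wiki/Main_Page');
  crawler.queue = crawlerQueue;
  crawler.start();
});
```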
## Monitoring and garbage collector
The queue can be scheduled to execute monitoring and garbage collector tasks periodically via the queue configuration:
```js
const connectionConfig = {
  url: 'mongodb://localhost:27017',
  dbName: 'crawler',
  collectionName: 'queue',
  // garbage collector configuration
  GCConfig: {
    run: true,
    msInterval: 1000 * 60 * 5,
  },
  // monitoring configuration
  monitorConfig: {
    run: true,
    statisticCollectionName: 'statistic',
    msInterval: 1000 * 60,
  },
};
```
### Garbage collector
The garbage collector is a task which runs periodically and finds "old" items that have been spooled but never fetched. This can happen when a crawler instance
[starts to process a new item](https://github.com/simplecrawler/simplecrawler/blob/master/lib/crawler.js#L1735)
but the job fails or the process is terminated. The garbage collector returns such items to the queue, so no items go "missing".
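Conceptually, a GC pass resembles the query below. This is an illustrative sketch only, not the package's actual implementation; the `spooledAt` timestamp field is hypothetical, while the `spooled`/`queued` statuses come from simplecrawler's QueueItem:
```js
const { MongoClient } = require('mongodb');

async function gcPass() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const queue = client.db('crawler').collection('queue');

  // anything spooled for longer than the GC interval is considered stale
  const staleBefore = Date.now() - 1000 * 60 * 5;
  await queue.updateMany(
    { status: 'spooled', fetched: false, spooledAt: { $lt: staleBefore } },
    { $set: { status: 'queued' } } // hand the item back to the crawlers
  );
  await client.close();
}
```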
Garbage collector config object properties:
| Property name | Type | Comment |
|---|---|---|
| run | boolean | disable/enable GC |
| msInterval | number | Interval between GC runs in milliseconds. |
### Monitoring
The monitoring task periodically aggregates statistics about the current queue state and writes the data into another collection
of the MongoDB instance.
Monitoring config object properties:
| Property name | Type | Comment |
|---|---|---|
| run | boolean | disable/enable monitoring |
| msInterval | number | Interval between monitoring runs in milliseconds. |
| statisticCollectionName | String | [default = 'statistic'] Collection name to write data to. **Warning:** don't use the same collection name for the queue and the statistics! It will break the crawler logic and the crawler job will never finish. |
Queue monitoring adds a statistic item to the statistic collection every `msInterval` milliseconds, so you can query this collection at runtime to see the progress of the job.
Example of the statistic item:
```js
{
  // aggregated statistics from the queue items - see https://github.com/simplecrawler/simplecrawler#queue-statistics-and-reporting
  actualDataSizeAvg: 40641.18656716418,
  actualDataSizeMax: 203492,
  actualDataSizeMin: 179,
  contentLengthAvg: 40641.18656716418,
  contentLengthMax: 203492,
  contentLengthMin: 179,
  downloadTimeAvg: 3629.634328358209,
  downloadTimeMax: 19584,
  downloadTimeMin: 2,
  requestLatencyAvg: 34.93283582089552,
  requestLatencyMax: 265,
  requestLatencyMin: 22,
  requestTimeAvg: 3664.5671641791046,
  requestTimeMax: 19608,
  requestTimeMin: 25,
  // counts of elements grouped by QueueItem.status (see https://github.com/simplecrawler/simplecrawler#queue-items):
  queued: 37449,
  downloaded: 696,
  headers: 7,
  spooled: 2,
  downloadprevented: 1,
  timeout: 1,
  created: 1,
  failed: 1,
  notfound: 1,
  redirected: 1,
  pulled: 0,
  // general info - total count of items
  totalCount: 22952,
  // general info - count of items with property "fetched": true, i.e. fully processed items
  fetchedCount: 134,
  // timestamp of the request
  timestamp: 1554661504815,
  // MongoDB id
  _id: "5caa418467fa3c00083b4b7a"
}
```
Note: if there are no elements with a given QueueItem.status, that property is omitted from the statistic item.
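For example, here is a minimal sketch that polls the latest statistic document with the official `mongodb` Node.js driver; the `crawler` database and `statistic` collection names are taken from the config above:
```js
const { MongoClient } = require('mongodb');

async function printLatestStats() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const stats = client.db('crawler').collection('statistic');

  // the monitoring task inserts one document per run, so the newest
  // timestamp reflects the current queue state
  const [latest] = await stats
    .find()
    .sort({ timestamp: -1 })
    .limit(1)
    .toArray();

  console.log('fetched %d of %d items', latest.fetchedCount, latest.totalCount);
  await client.close();
}

printLatestStats().catch(console.error);
```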
**WARNING**: GC and monitoring tasks can be slowed down by the crawler running in parallel. For better performance,
please run GC and monitoring as separate utilities in a separate process.
## Additional utilities
You can run GC and monitoring tasks as *blocking* operations using the Utils class and its `runGC` and `runMonitoring` methods.
You can also fully drop the queue using the `dropQueue` method. Note: this method also cleans the statistic collection.
// TODO examples
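Until official examples are added, here is a minimal sketch; the export path, constructor, and callback-style signatures of the Utils class are assumptions based on the description above, so check the package source before relying on them:
```js
// Assumed API: a Utils class taking the same connection config, with
// callback-style runGC / runMonitoring / dropQueue methods.
const Utils = require('mongodb-simplecrawler-queue').Utils;

const utils = new Utils({
  url: 'mongodb://localhost:27017',
  dbName: 'crawler',
  collectionName: 'queue',
});

// run a single blocking GC pass over the queue collection
utils.runGC(err => {
  if (err) {
    console.error(err);
  }
});
```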
## Performance
// TODO tests and results, some graphics
## MongoDB as a Docker container locally
Start MongoDB as a Docker container:
```
docker run --name <IMAGE_NAME> -p 27017:27017 -d mongo:3.6.17-xenial
```
Connect to the running container:
```
docker exec -it <IMAGE_NAME> bash
```
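Inside the container you can use the `mongo` shell (bundled with this image) to inspect the queue directly; the `crawler` database and `queue` collection names below assume the configuration from the usage example:
```
docker exec -it <IMAGE_NAME> mongo crawler --eval 'db.queue.count()'
```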