# Metascraper
A library to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks. It follows a few principles:
- Have a high accuracy for online articles by default.
- Be usable on the server and in the browser.
- Make it simple to add new rules or override existing ones.
- Don't restrict rules to CSS selectors or text accessors.
## Table of Contents
- [Example](#example)
- [Metadata](#metadata)
- [Comparison](#comparison)
- [Installation](#installation)
- [Server-side Usage](#server-side-usage)
- [Browser-side Usage](#browser-side-usage)
- [Creating & Overriding Rules](#creating--overriding-rules)
- [API](#api)
- [License](#license)
## Example
Using **Metascraper**, this metadata...
```json
{
  "favicon": "https://assets.bwbx.io/business/public/images/favicons/favicon-16x16-cc2a6c3317.png",
  "author": "Ellen Huet",
  "date": "2016-05-24T18:00:03.894Z",
  "description": "The HR startups go to war.",
  "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v1/-1x-1.jpg",
  "publisher": "Bloomberg.com",
  "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
  "url": "http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
}
```
...would be scraped from [this article](http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance).
## Metadata
Here is a list of the metadata that **Metascraper** collects by default:
- **`favicon`** — eg. `https://example.com/icon.ico`<br/>
The favicon of the website. It defaults to the highest resolution available.
- **`author`** — eg. `Noah Kulwin`<br/>
A human-readable representation of the author's name.
- **`date`** — eg. `2016-05-27T00:00:00.000Z`<br/>
An [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) representation of the date the article was published.
- **`description`** — eg. `Venture capitalists are raising money at the fastest rate...`<br/>
The publisher's chosen description of the article.
- **`image`** — eg. `https://assets.entrepreneur.com/content/3x2/1300/20160504155601-GettyImages-174457162.jpeg`<br/>
An image URL that best represents the article.
- **`publisher`** — eg. `Fast Company`<br/>
A human-readable representation of the publisher's name.
- **`title`** — eg. `Meet Wall Street's New A.I. Sheriffs`<br/>
The publisher's chosen title of the article.
- **`url`** — eg. `http://motherboard.vice.com/read/google-wins-trial-against-oracle-saves-9-billion`<br/>
The URL of the article.
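Any of these fields may be absent from the result if no rule matches, so code that consumes the metadata should guard against missing values. For example, a small (hypothetical) helper that builds a byline with fallbacks:

```js
// Hypothetical helper: build a byline string from scraped metadata,
// falling back to placeholders when a field is missing.
function formatByline(metadata) {
  const author = metadata.author || 'Unknown author'
  const publisher = metadata.publisher || 'Unknown publisher'
  return `${author} (${publisher})`
}
```

For instance, `formatByline({ author: 'Ellen Huet' })` returns `'Ellen Huet (Unknown publisher)'`.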
## Comparison
To give you an idea of how accurate **Metascraper** is, here is a comparison of similar libraries:
| Library | [`metascraper`](https://www.npmjs.com/package/metascraper) | [`html-metadata`](https://www.npmjs.com/package/html-metadata) | [`node-metainspector`](https://www.npmjs.com/package/node-metainspector) | [`open-graph-scraper`](https://www.npmjs.com/package/open-graph-scraper) | [`unfluff`](https://www.npmjs.com/package/unfluff) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Correct | **95.54%** | **74.56%** | **61.16%** | **66.52%** | **70.90%** |
| Incorrect | 1.79% | 1.79% | 0.89% | 6.70% | 10.27% |
| Missed | 2.68% | 23.67% | 37.95% | 26.34% | 8.95% |
A big part of the reason for **Metascraper**'s higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly used, spec-compliant pieces of metadata, like Open Graph. **Metascraper**'s default settings are targeted specifically at parsing online articles, which is why it can be tuned more precisely for that purpose than general-purpose libraries.
If you're interested in the breakdown by individual pieces of metadata, check out the [full comparison summary](/support/comparison), or dive into the [raw result data for each library](/support/comparison/results).
## Installation
Simply install with [`npm`](https://www.npmjs.com/):
```bash
npm install metascraper
```
## Server-side Usage
On the server, you're typically going to only have a `url` to scrape, or already have the `html` downloaded. Here's what a simple use case might look like:
```js
import Metascraper from 'metascraper'
Metascraper
.scrapeUrl('http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance')
.then((metadata) => {
console.log(metadata)
})
// {
// "author": "Ellen Huet",
// "date": "2016-05-24T18:00:03.894Z",
// "description": "The HR startups go to war.",
// "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v1/-1x-1.jpg",
// "publisher": "Bloomberg.com",
// "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance"
// }
```
Or, if you are using `async/await`, you can simply do:
```js
const metadata = await Metascraper.scrapeUrl('http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance')
```
Similarly, if you already have the `html` downloaded, you can use the `scrapeHtml` method instead:
```js
const metadata = await Metascraper.scrapeHtml(html)
```
That's it! If you want to customize what exactly gets scraped, check out the documentation on [the rules system](#creating--overriding-rules).
## Browser-side Usage
In the browser, for example inside of a Chrome extension, you might already have access to the `window` of the document you'd like to scrape. You can simply use the `scrapeWindow` method to get the metadata:
```js
import Metascraper from 'metascraper'
Metascraper
.scrapeWindow(window)
.then((metadata) => {
console.log(metadata)
})
// {
// "author": "Ellen Huet",
// "date": "2016-05-24T18:00:03.894Z",
// "description": "The HR startups go to war.",
// "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v1/-1x-1.jpg",
// "publisher": "Bloomberg.com",
// "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance"
// }
```
Or, if you are using `async/await`, it looks even simpler:
```js
const metadata = await Metascraper.scrapeWindow(window)
```
Of course, you can also still scrape directly from `html` or a `url` if you choose to.
## Creating & Overriding Rules
By default, Metascraper ships with a set of rules that are tuned to parse out information from online articles—blogs, newspapers, press releases, etc. But you don't have to use the default rules. If you have a different use case, supplying your own rules is easy to do.
Each rule is simply a function that receives a [Cheerio](https://github.com/cheeriojs/cheerio) instance of the document, and that returns the value it has scraped. (Or a `Promise` in the case of asynchronous scraping.) Like so:
```js
function myTitleRule($) {
const text = $('h1').text()
return text
}
```
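Since a rule may also return a `Promise`, asynchronous rules take the same shape. A sketch of one, using the same Cheerio-style `$` accessor (the trimming step stands in for whatever async work your rule actually does):

```js
// An asynchronous rule: it receives the same Cheerio-style $
// accessor, but returns a Promise for its value.
async function myAsyncTitleRule($) {
  const raw = $('h1').text()
  // Stand-in for real async post-processing
  // (e.g. resolving a redirect or a lookup).
  return raw.trim()
}
```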
All of the rules are then packaged up into a single dictionary, which has the same shape as the metadata that will be scraped. Like so:
```js
const MY_RULES = {
title: myTitleRule,
summary: mySummaryRule,
...
}
```
And then you can pass that rules dictionary into any of the scraping functions as the second argument, like so:
```js
const metadata = await Metascraper.scrapeHtml(html, MY_RULES)
```
Not only that, but instead of being just a function, rules can be passed as an array of fallbacks, in case the earlier functions in the array don't return results. Like so:
```js
const MY_RULES = {
title: [
myPreferredTitleRule,
myFallbackTitleRule,
mySuperLastResortTitleRule,
]
}
```
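Conceptually, a fallback array is resolved by trying each rule in order and keeping the first usable value. A minimal sketch of that behavior (an illustration of the idea, not Metascraper's actual internals):

```js
// Sketch of fallback resolution: normalize a rule to an array,
// then await each rule in order and return the first truthy value.
async function resolveRule(rule, $) {
  const fallbacks = Array.isArray(rule) ? rule : [rule]
  for (const fn of fallbacks) {
    const value = await fn($)
    if (value) return value
  }
  return undefined
}
```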
The beauty of the system is that it means simple scraping needs can be defined inline easily, like so:
```js
const rules = {
title: $ => $('title').text(),
date: $ => $('time[pubdate]').attr('datetime'),
excerpt: $ => $('p').first().text(),
}
const metadata = await Metascraper.scrapeHtml(html, rules)
```
But in more complex cases, the set of rules can be packaged separately, and even shared with others, for example:
```js
import Metascraper from 'metascraper'
import RECIPE_RULES from 'metascraper-recipes'
const metadata = await Metascraper.scrapeHtml(html, RECIPE_RULES)
```
And if you want to use the default rules, but with a few tweaks of your own, it's as simple as extending the object:
```js
import Metascraper from 'metascraper'
const NEW_RULES = {
...Metascraper.RULES,
summary: mySummaryRule,
title: [
myPreferredTitleRule,
myFallbackTitleRule,
mySuperLastResortTitleRule,
]
}
const metadata = await Metascraper.scrapeHtml(html, NEW_RULES)
```
For a more complex example of how rules work, [check out the default rules](/lib/rules).
## API
#### `Metascraper.scrapeUrl(url, [rules])`
```js
import Metascraper from 'metascraper'
Metascraper
.scrapeUrl(url)
.then((metadata) => {
// ...
})
```
```js
import Metascraper from 'metascraper'
const metadata = await Metascraper.scrapeUrl(url)
```
Scrapes a `url` with an optional set of `rules`.
#### `Metascraper.scrapeHtml(html, [rules])`
```js
import Metascraper from 'metascraper'
Metascraper
.scrapeHtml(html)
.then((metadata) => {
// ...
})
```
```js
import Metascraper from 'metascraper'
const metadata = await Metascraper.scrapeHtml(html)
```
Scrapes an `html` string with an optional set of `rules`.
#### `Metascraper.scrapeWindow(window, [rules])`
```js
import Metascraper from 'metascraper'
Metascraper
.scrapeWindow(window)
.then((metadata) => {
// ...
})
```
```js
import Metascraper from 'metascraper'
const metadata = await Metascraper.scrapeWindow(window)
```
Scrapes a `window` object with an optional set of `rules`.
#### `Metascraper.RULES`
A dictionary of the default rules, in case you want to extend them.