html-get
<div align="center">
<img src="https://github.com/microlinkhq/cdn/raw/master/dist/logo/banner.png#gh-light-mode-only" alt="microlink logo">
<img src="https://github.com/microlinkhq/cdn/raw/master/dist/logo/banner-dark.png#gh-dark-mode-only" alt="microlink logo">
<br>
<br>
</div>

[Coverage Status](https://coveralls.io/github/microlinkhq/html-get)
[NPM Status](https://www.npmjs.org/package/html-get)
> Get the HTML from any website, fine-tuned for correction & speed.
## Features
- Get HTML markup for any URL, including images, video, audio, or PDF.
- Block ads, trackers, and any other unnecessary network subrequests.
- Handle unreachable or timeout URLs gracefully.
- Ensure HTML markup is appropriately encoded.

**html-get** takes advantage of [puppeteer](https://github.com/GoogleChrome/puppeteer) headless technology when needed, such as for client-side apps that need to be prerendered.
## Install
```bash
$ npm install browserless puppeteer html-get --save
```
## Usage
```js
const createBrowserless = require('browserless')
const getHTML = require('html-get')

// Spawn the Chromium process once
const browserlessFactory = createBrowserless()

// Close the Chromium process when Node.js exits
process.on('exit', () => {
  console.log('closing resources!')
  browserlessFactory.close()
})

const getContent = async url => {
  // Create a browser context inside the Chromium process
  const browserContext = browserlessFactory.createContext()
  const getBrowserless = () => browserContext
  const result = await getHTML(url, { getBrowserless })
  // Destroy the browser context once it is no longer needed
  const browser = await getBrowserless()
  await browser.destroyContext()
  return result
}

getContent('https://example.com')
  .then(content => {
    console.log(content)
    process.exit()
  })
  .catch(error => {
    console.error(error)
    process.exit(1)
  })
```
### Command Line
```bash
$ npx html-get https://example.com
```
## API
### getHTML(url, [options])
#### url
*Required*<br>
Type: `string`

The target URL for getting the HTML markup.
#### options
##### encoding
Type: `string`<br>
Default: `'utf-8'`

Ensures the HTML markup is encoded with the provided encoding.

The value is passed to [`html-encode`](https://github.com/kikobeats/html-encode).
##### getBrowserless
*Required*<br>
Type: `function`

A function that returns a [browserless](https://browserless.js.org/) instance, used to interact with puppeteer.
##### getMode
Type: `function`

Determines the strategy to use based on the `url`. The possible values are `'fetch'` or `'prerender'`.
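
As a rough illustration, a custom `getMode` could pick the strategy from the hostname. This is a minimal sketch, not html-get's built-in logic: the `STATIC_HOSTS` list is hypothetical, and the sketch assumes the function receives the URL as its first argument and ignores anything else it may be passed.

```js
// Hypothetical list of hosts known to serve complete HTML without client-side rendering.
const STATIC_HOSTS = new Set(['example.com', 'www.example.com'])

// Returns 'fetch' for known static hosts and 'prerender' for everything else.
const getMode = url => {
  const { hostname } = new URL(url)
  return STATIC_HOSTS.has(hostname) ? 'fetch' : 'prerender'
}

console.log(getMode('https://example.com/about')) // 'fetch'
console.log(getMode('https://app.example.org/dashboard')) // 'prerender'
```

A function like this could then be passed alongside the required option, as in `getHTML(url, { getBrowserless, getMode })`.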
##### getTemporalFile
Type: `function`

A function that creates a temporary file.
##### gotOpts
Type: `object`

The configuration object passed to [got](https://www.npmjs.com/package/got) when using the `'fetch'` strategy.
##### headers
Type: `object`

Request headers that will be passed to the fetch/prerender process.
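
For instance, custom request headers could be grouped in a plain object. The header values below are hypothetical placeholders, and the sketch only builds the options object rather than performing a request.

```js
// Hypothetical header values; adjust them to the target website.
const headers = {
  'user-agent': 'my-crawler/1.0',
  'accept-language': 'en-US,en;q=0.9'
}

// Options object as it would be handed to getHTML(url, options).
// `getBrowserless` is omitted here for brevity; it is required in real usage.
const options = { headers }

console.log(Object.keys(options.headers)) // [ 'user-agent', 'accept-language' ]
```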
##### mutool
Type: `function`|`boolean`<br>
Default: `source code`

It returns a function that executes the [mutool](https://mupdf.com/) binary to turn PDF files into HTML markup.

It can be explicitly disabled by passing `false`.
##### prerender
Type: `boolean`|`string`<br>
Default: `'auto'`

Explicitly enable or disable prerendering as the mechanism for getting the HTML markup.

The value `'auto'` means that an internal list of websites that don't need prerendering is used. This list speeds up the process by using `fetch` mode for those websites.

See the [getMode parameter](#getmode) to learn more.
##### puppeteerOpts
Type: `object`

The configuration object passed to [puppeteer](https://www.npmjs.com/package/puppeteer) when using the `'prerender'` strategy.
##### rewriteUrls
Type: `boolean`<br>
Default: `false`

When `true`, relative CSS/HTML URLs present in the HTML markup are rewritten into absolute URLs.
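
To make the idea concrete, the sketch below shows what turning a relative URL into an absolute one means, using the standard WHATWG `URL` API. It illustrates the concept only and is not html-get's implementation; the `toAbsolute` helper and the sample URLs are hypothetical.

```js
// Resolve a possibly relative URL against the page URL using the WHATWG URL API.
const toAbsolute = (relativeUrl, pageUrl) => new URL(relativeUrl, pageUrl).toString()

const pageUrl = 'https://example.com/blog/post.html'

// Root-relative path: resolved against the origin.
console.log(toAbsolute('/img/logo.png', pageUrl)) // https://example.com/img/logo.png

// Document-relative path: resolved against the page's directory.
console.log(toAbsolute('style.css', pageUrl)) // https://example.com/blog/style.css

// Already absolute: returned unchanged.
console.log(toAbsolute('https://cdn.example.com/app.js', pageUrl)) // https://cdn.example.com/app.js
```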
##### rewriteHtml
Type: `boolean`<br>
Default: `false`

When `true`, it rewrites some common mistakes related to HTML meta tags.
##### serializeHtml
Determines how the HTML should be serialized before returning.

By default, it is serialized with `$ => ({ html: $.html() })`.
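
A custom serializer could return extra fields next to the markup. The sketch below assumes only what the default serializer implies, namely that `$` exposes an `html()` method; the `fake$` stub is a hypothetical stand-in used to make the example runnable on its own.

```js
// Custom serializer: returns the markup plus its length in characters.
const serializeHtml = $ => {
  const html = $.html()
  return { html, length: html.length }
}

// Stand-in for the `$` object, for demonstration purposes only.
const fake$ = { html: () => '<p>hello</p>' }

console.log(serializeHtml(fake$)) // { html: '<p>hello</p>', length: 12 }
```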
## License
**html-get** © [Microlink](https://microlink.io), released under the [MIT](https://github.com/microlinkhq/html-get/blob/master/LICENSE.md) License.<br>
Authored and maintained by [Kiko Beats](https://kikobeats.com) with help from [contributors](https://github.com/microlinkhq/html-get/contributors).
> [microlink.io](https://microlink.io) · GitHub [microlinkhq](https://github.com/microlinkhq) · X [@microlinkhq](https://x.com/microlinkhq)