osmosis
Web scraper for NodeJS
# Changelog
#### TODO:
* Add `.learn()` to generate a selector for a selected node
* Add `.listen()` for easily creating DOM event listeners
* Add `.trigger()` for easily triggering DOM events
* Add `.on()` for binding callback to a local-only event
* Add `.url()` to set the current URL
* Add `.params()` to set the current URL parameters
* Add `.save()` to save response data to a file
* Add `.add()`, `.remove()` for node creation/deletion?
* Add `.scroll()` to scrape infinite scroll pages
* Add warnings for parser errors?
* Switch to semantic versioning?
## Next major release:
* Event/error handling
  * `Error.code` = 404, 'timeout', etc.
  * `Error.module` = 'http', 'dom', etc.
  * return true = retry, false = stop, anything else = continue
* Event for discontinued context/data
* Module system using osmosis.require and modules prefixed with `osmosis-`
* Way to trigger DOM
* Throw unhandled errors?
* `.while()` to do things more than once as long as they call next()
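
The planned return-value convention for error handlers (true = retry, false = stop, anything else = continue) could look roughly like the sketch below. Everything here is hypothetical: `runWithHandler`, `task`, and `maxRetries` are illustrative names that do not exist in Osmosis, and the `code` and `module` fields only mirror the plan listed above.

```javascript
// Hypothetical sketch of the planned convention; this is not Osmosis code.
// handler(err) returns: true = retry, false = stop, anything else = continue.
function runWithHandler(task, handler, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return { status: 'ok', value: task(attempt) };
    } catch (err) {
      const verdict = handler(err);
      // Retry only while the (assumed) retry budget is not exhausted.
      if (verdict === true && attempt < maxRetries) continue;
      if (verdict === false) return { status: 'stopped', error: err };
      // Any other verdict (or an exhausted retry budget) moves on.
      return { status: 'continued', error: err };
    }
  }
}

// A task that fails twice with a timeout before succeeding.
const flaky = (attempt) => {
  if (attempt < 2) {
    const err = new Error('timed out');
    err.code = 'timeout';
    err.module = 'http';
    throw err;
  }
  return 'done';
};

const result = runWithHandler(flaky, (err) => err.code === 'timeout');
console.log(result.status); // logs "ok"
```
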
## 0.1.5
* Fixed a bug where `.get()` without `params` produced an empty query string (`?`)
* Preserve sort order for `.follow()` results within `.set()`
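
A minimal sketch of the behaviour the query string fix aims for, assuming a hypothetical `buildUrl` helper (not the function Osmosis uses internally): the `?` is appended only when there is at least one parameter.

```javascript
// Hypothetical helper: append a query string only when params are present.
// Assumes `base` does not already contain a query string.
function buildUrl(base, params) {
  const query = new URLSearchParams(params || {}).toString();
  return query ? `${base}?${query}` : base;
}

console.log(buildUrl('http://example.com/search'));                // no trailing "?"
console.log(buildUrl('http://example.com/search', { q: 'node' })); // "?q=node" appended
```
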
## 0.1.4
#### `get`
* Removed `opts` and `callback` arguments
#### `set`
* Supports an array as the root data object
* Fixed case where nested `.find` searches the entire document
## 0.1.3
* parseHtml uses `huge` option by default
* Update to `libxmljs-dom` v0.0.5
#### `set`
* Fixed nested Osmosis instances inside `set`
* Added tests for nested set data
#### `submit`
* Proper `submit` button handling
* Accepts a `submit` button selector as the first argument
* Supports `submit` button attributes: "form", "formaction", "formenctype" and "formmethod"
* Added tests for `submit` button handling
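
In HTML, a submit button's `formaction`, `formmethod`, and `formenctype` attributes override the form's own `action`, `method`, and `enctype`, while `form` associates a button with a form by id. The sketch below shows that override order using plain objects in place of DOM nodes. `resolveSubmission` is a hypothetical name, and the `form` attribute lookup is not modelled.

```javascript
// Hypothetical sketch: plain objects stand in for DOM attribute lookups.
// Button attributes win over the form's attributes; HTML defaults apply last.
function resolveSubmission(formAttrs, buttonAttrs = {}) {
  return {
    action: buttonAttrs.formaction || formAttrs.action || '',
    method: (buttonAttrs.formmethod || formAttrs.method || 'get').toLowerCase(),
    enctype:
      buttonAttrs.formenctype ||
      formAttrs.enctype ||
      'application/x-www-form-urlencoded',
  };
}

const submission = resolveSubmission(
  { action: '/login', method: 'POST' },
  { formaction: '/login/sso' }
);
console.log(submission.action); // logs "/login/sso"
```
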
## 0.1.2
* Update to `libxmljs-dom` v0.0.4
## 0.1.1
* `proxy` option can now be an array of multiple proxies
#### `proxy`
* Added `.proxy()` to easily set the `proxy` configuration option
#### `then`
* If the first argument's name is:
  * "document" - The callback is given the current document
  * "window" - The callback is given the Window object
  * "$" - The callback is given a jQuery object (if available)
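
One way such name-based argument selection could work is to read the callback's first parameter name from its source text. This is only a sketch under that assumption: `firstParamName`, `pickFirstArgument`, and the `sources` object are hypothetical, and Osmosis may detect the name differently. Default values and destructured parameters are not handled.

```javascript
// Hypothetical sketch of name-based argument injection.
// Reads the first parameter name from the function's source text.
function firstParamName(fn) {
  const match = fn
    .toString()
    .match(/^(?:async\s+)?(?:function\b[^(]*)?\(?\s*([A-Za-z_$][\w$]*)?/);
  return match && match[1] ? match[1] : null;
}

// Choose what to pass as the first argument; fall back to the context node.
function pickFirstArgument(fn, sources) {
  const name = firstParamName(fn);
  return Object.prototype.hasOwnProperty.call(sources, name)
    ? sources[name]
    : sources.context;
}

// Placeholder strings stand in for the real objects.
const sources = { context: 'ctx', document: 'doc', window: 'win', $: 'jq' };
console.log(pickFirstArgument(function (document, data) {}, sources)); // logs "doc"
console.log(pickFirstArgument(($, data) => {}, sources));              // logs "jq"
```
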
### Internal changes:
* Uses 'use strict'
* Minimize use of array.forEach
* Added libxml specific memoryUsage monitoring
* Switched to static `libxmljs-dom` version
## 0.1.0
* Added `ignore_http_errors` option
* Added `:internal` for selecting internal links
* Added `:external` for selecting external links
* Added `:domain` for searching by domain name
* Added `:path` for searching by path
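
As a rough illustration of the internal/external distinction, a link can be classified by comparing its hostname with the page's hostname. `classifyLink` is a hypothetical helper written for this note. The actual matching rules behind `:internal` and `:external` (for example, how subdomains are treated) may differ.

```javascript
// Hypothetical sketch: classify a link by comparing hostnames.
function classifyLink(href, pageUrl) {
  const target = new URL(href, pageUrl); // resolves relative hrefs against the page
  const page = new URL(pageUrl);
  return target.hostname === page.hostname ? 'internal' : 'external';
}

console.log(classifyLink('/about', 'http://example.com/index.html'));         // logs "internal"
console.log(classifyLink('http://other.org/', 'http://example.com/index.html')); // logs "external"
```
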
#### `config`
* Configuration options are inherited down the chain
#### `contains`
* Added `.contains(string)` to discard nodes whose contents do not match `string`
#### `do`
* Added `.do()` to call one or more commands using the current context
#### `failure` (or `fail`)
* Added `.failure(selector)` to discard nodes that match the given selector
#### `filter` (or `success`)
* Added `.filter(selector)` to discard nodes that do not match the given selector
#### `get`
* Accepts a tokenized URL string
  * `@{...}` - Request info (url, method, params, headers, etc.)
  * `%{...}` - `data` object
  * `${...}` - `context` search
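
The three token types can be pictured with a small expander. This is a sketch only: `expandUrl` and its `request`, `data`, and `find` inputs are hypothetical stand-ins, the real Osmosis parser may behave differently, and no URL encoding is applied here.

```javascript
// Hypothetical token expander for the three token types listed above.
function expandUrl(template, { request = {}, data = {}, find = () => '' }) {
  return template.replace(/([@%$])\{([^}]*)\}/g, (whole, sigil, key) => {
    if (sigil === '@') return String(request[key] ?? ''); // request info
    if (sigil === '%') return String(data[key] ?? '');    // data object
    return String(find(key) ?? '');                       // "$": context search
  });
}

const url = expandUrl('http://example.com/%{section}/page?ref=@{method}&next=${a.next}', {
  request: { method: 'GET' },
  data: { section: 'news' },
  // Stand-in for a selector search in the current context.
  find: (selector) => (selector === 'a.next' ? '2' : ''),
});
console.log(url); // logs "http://example.com/news/page?ref=GET&next=2"
```
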
#### `headers` (or `header`)
* Added `headers({ key: value })` and `header(key, value)` to set HTTP headers
#### `match`
* Added `.match([selector], RegExp)` to discard nodes whose contents do not match
#### `rewrite`
* Added `.rewrite(callback)` to set a URL rewriting function for the preceding request
### Internal changes:
* `promise.args` is now an object (used to be an array)
* HTTP 400 errors are now logged and the requests are retried.
## 0.0.9
* DOM and css2xpath functionality have been moved to `libxmljs-dom`
* Added `keep_data` option to retain the original HTTP response
* Added `process_response` option for processing data before parsing
* Added test suite
#### `click`
* Added `.click()` for interacting with JS-only content
#### `delay`
* Added `.delay(n)` to wait `n` seconds before calling next. `n` may be a decimal value.
#### `find`
* Accepts an array of selectors as the first argument
#### `follow`
* Accepts a second argument: a boolean (`true` = follow external links) or a URL rewriting function.
#### `get`
* Accepts `function(context, data)` as the first argument. The function must return a URL string.
#### `parse`
* Added a second argument to associate a base URL with the document
#### `then`
* Added optional `done` argument
#### `select`
* Added `.select` for finding elements within the current context
#### `set`
* Replaces previously set values
### Internal changes:
* Enhanced stack counting
* Added data object ref counting
* Added domain specific cookie handling
* Improved stability of deep instance nesting with `.set()`
* Osmosis instances operate more independently
* Request queues are now a single array for each instance
* Promises must accept and call `done` if they asynchronously
  send more than one output context per input context
* If `.then` sends more than one output context per input context,
  then it must accept `done()` as its last argument and
  call it after calling `next()` for the last time.
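
The `done` contract can be illustrated with a tiny stand-alone runner. This is a hypothetical sketch, not how Osmosis is implemented: `runThen` and `onFinished` are invented for the example, and the `(context, data, next, done)` callback shape is assumed from the notes above.

```javascript
// Hypothetical mini-runner that mimics the `done` contract.
function runThen(callback, context, data, onFinished) {
  const outputs = [];
  let finished = false;
  const next = (ctx, d) => {
    if (finished) throw new Error('next() called after done()');
    outputs.push({ context: ctx, data: d });
  };
  const done = () => {
    finished = true;
    onFinished(outputs);
  };
  if (callback.length >= 4) {
    // The callback declared `done`, so it decides when it has finished.
    callback(context, data, next, done);
  } else {
    // No `done` declared: treat the callback as finished once it returns.
    callback(context, data, next);
    done();
  }
}

// A callback that emits two output contexts asynchronously, then calls done().
runThen(
  (context, data, next, done) => {
    setTimeout(() => {
      next(context, { page: 1 });
      next(context, { page: 2 });
      done(); // called after the last next()
    }, 10);
  },
  'root',
  {},
  (outputs) => console.log(outputs.length) // logs 2
);
```
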
## 0.0.8
#### `config`
* Ensure non-default `needle` options propagate
## 0.0.7
#### `paginate`
* Added a more intuitive method for pagination
#### `submit`
* Added easy form submission
#### `login`
* Added easy login support
#### `pause`, `resume`, `stop`
* Added pause, resume, and stop functionality
#### `find`
* Searches the entire document by default
#### `set`
* Supports innerHTML using `:html` or `:source` in selectors
* Supports deep JSON structures and nested Osmosis instances
#### `data`
* `.data(null)` clears the data object
* `.data({})` appends keys to data object
#### `dom`
* `.dom()` is still a work in progress and can now run jQuery