bquery

bquery is a Node module for fetching web pages. It uses CSS selectors to extract and structure the content of an HTML page.

---
category: reference
heading: 'Query syntax'
---

A simple query looks like this:

    {
      "url": "http://chrisnewtn.com",
      "type": "html",
      "selector": "ul.social li a",
      "extract": "href"
    }

It tells noodle to go to a friend's website and to expect an html document, then to select the anchor elements in a list and, for each one, extract the value of the `href` attribute.

The `type` property tells noodle whether you want to scrape an html page, a json document, etc. If no type is specified, an html page is assumed by default.

A similar query can be constructed to extract information from a JSON document. JSONSelect is used as the underlying library to do this. It supports common CSS3 selector functionality. You can [familiarize yourself with it here.](http://jsonselect.org/#tryit)

    {
      "url": "https://search.twitter.com/search.json?q=friendship",
      "selector": ".results .from_user",
      "type": "json"
    }

An `extract` property is not needed for a query on JSON documents, as json properties have no metadata and just a single value, whereas an html element can have text, inner html, or an attribute like `href`.

## Different types (html, json, feed & xml)

### html

**Note:** Some xml documents can be parsed by noodle under the html type!

The html type is the only type to have the `extract` property. This is because the other types are converted to JSON.

The `extract` property is optional. Its value can be the name of an attribute of the HTML element.

Having `"html"` or `"innerHTML"` as the `extract` value will return the HTML contained within that element.

Having `"text"` as the `extract` value will return only the text. noodle will strip out any new line characters found in the text.

Return data looks like this:

    [
      {
        "results": [
          "http://twitter.com/chrisnewtn",
          "http://plus.google.com/u/0/111845796843095584341"
        ],
        "created": "2012-08-01T16:22:14.705Z"
      }
    ]

Having no specific extract rule will assume a default of extracting `"text"` from the `selector`.
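
The defaults described so far can be summarised in a short sketch. The Python helper below is hypothetical (`apply_defaults` is not part of noodle, and noodle may apply its defaults differently); it only mirrors the documented rules: a missing `type` is treated as `html`, an html query without an `extract` rule extracts `text`, and the other types do not use `extract`.

```python
import json


def apply_defaults(query: dict) -> dict:
    """Return a copy of `query` with the documented defaults filled in.

    Hypothetical helper for illustration only; it is not noodle's code.
    """
    normalized = dict(query)
    # No type given: an html page is assumed.
    normalized.setdefault("type", "html")
    if normalized["type"] == "html":
        # No extract rule given: the element's text is extracted.
        normalized.setdefault("extract", "text")
    else:
        # Only the html type uses the `extract` property.
        normalized.pop("extract", None)
    return normalized


query = {"url": "http://chrisnewtn.com", "selector": "ul.social li a"}
print(json.dumps(apply_defaults(query), indent=2))
```

Writing the defaults out explicitly like this can make a query easier to read, although leaving them out should be equivalent according to the rules above.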

It is also possible to request multiple properties to extract in one query by using an array.

Query:

    {
      "url": "http://chrisnewtn.com",
      "selector": "ul.social li a",
      "extract": ["href", "text"]
    }

Response:

    [
      {
        "results": [
          {
            "href": "http://twitter.com/chrisnewtn",
            "text": "Twitter"
          },
          {
            "href": "http://plus.google.com/u/0/111845796843095584341",
            "text": "Google+"
          }
        ],
        "created": "2012-08-01T16:23:41.913Z"
      }
    ]

In the query's `selector` property use the standard CSS DOM selectors.

### json and xml

The same rules apply to the json and xml types as to html, except that the `extract` property should be omitted from queries, because the JSON node value(s) targeted by the `selector` are always returned.

In the query's `selector` property use [JSONSelect](http://jsonselect.org/#tryit) style selectors.

### feeds

The same rules apply as for the json and xml types: the `extract` property should be omitted from queries, because the JSON node value(s) targeted by the `selector` are always returned.

In the query's `selector` property use [JSONSelect](http://jsonselect.org/#tryit) style selectors.

The feed type is based upon [node-feedparser](https://github.com/danmactough/node-feedparser), so it supports the RSS, Atom, and RDF standards.

[Familiarize yourself with its](https://github.com/danmactough/node-feedparser#what-is-the-parsed-output-produced-by-feedparser) normalisation format before you use JSONSelect style selectors.

## Getting the entire web document

If no `selector` is specified then the entire document is returned. This rule applies to all types of documents. The `extract` rule will be ignored if included.

Query:

    {
      "url": "https://search.twitter.com/search.json?q=friendship"
    }

Response:

    [
      {
        "results": ["<full document contents>"],
        "created": "2012-10-24T15:37:29.796Z"
      }
    ]

## Mapping a query to familiar properties

Queries can also be written in noodle's map notation. The map notation lets you access the results under your own, more helpful property names.

In the example below, map is used to create a result object of a person and their repos.

    {
      "url": "https://github.com/chrisnewtn",
      "type": "html",
      "map": {
        "person": {
          "selector": "span[itemprop=name]",
          "extract": "text"
        },
        "repos": {
          "selector": "li span.repo",
          "extract": "text"
        }
      }
    }

With results looking like this:

    [
      {
        "results": {
          "person": [
            "Chris Newton"
          ],
          "repos": [
            "cmd.js",
            "simplechat",
            "sitestatus",
            "jquery-async-uploader",
            "cmd-async-slides",
            "elsewhere",
            "pablo",
            "jsonpatch.js",
            "jquery.promises",
            "llamarama"
          ]
        },
        "created": "2013-03-25T15:38:01.918Z"
      }
    ]

## Getting hold of page headers

Within a query, include the `headers` property with an array value listing the headers you wish to receive back as an object structure. `'all'` may also be used as a value to return all of the server headers.

Header names are matched case-insensitively, and the returned property names will exactly match the strings you requested.

Query:

    {
      "url": "http://github.com",
      "headers": ["connection", "content-TYPE"]
    }

Result:

    [
      {
        "results": [...],
        "headers": {
          "connection": "keep-alive",
          "content-TYPE": "text/html"
        },
        "created": "2012-11-14T13:06:02.521Z"
      }
    ]

### Link headers for pagination

noodle provides a shortcut to the server's Link header: set the query's `linkHeader` property to `true`.

Link headers are useful as some web APIs use them to expose their pagination.

The Link header will be parsed into an object structure. If you wish to have the Link header in its usual formatting, include it in the `headers` array instead.
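
To illustrate what parsing a Link header into an object structure involves, here is a minimal Python sketch that turns a raw header value into a mapping from `rel` name to URL. It is not noodle's parser: `parse_link_header` is a hypothetical name, and the sketch assumes the common `<url>; rel="name"` form with a single `rel` value per link.

```python
import re

# Matches one link of the form: <url>; ...; rel="name"
_LINK_RE = re.compile(r'<([^>]*)>\s*;[^,]*?\brel="?([^",;\s]+)"?')


def parse_link_header(value: str) -> dict:
    """Map each rel name (e.g. "next", "last") to its URL.

    Illustrative only; real Link headers can carry extra parameters
    and multiple rel values that this sketch does not handle.
    """
    return {rel: url for url, rel in _LINK_RE.findall(value)}


raw = (
    '<https://api.github.com/users/premasagar/starred?page=2>; rel="next", '
    '<https://api.github.com/users/premasagar/starred?page=21>; rel="last"'
)
links = parse_link_header(raw)
print(links["next"])
```

With a structure like this, following pagination amounts to requesting the `next` URL until that key no longer appears.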

Query:

    {
      "url": "https://api.github.com/users/premasagar/starred",
      "type": "json",
      "selector": ".language",
      "headers": ["connection"],
      "linkHeader": true
    }

Result:

    [
      {
        "results": [
          "JavaScript",
          "Ruby",
          "JavaScript"
        ],
        "headers": {
          "connection": "keep-alive",
          "link": {
            "next": "https://api.github.com/users/premasagar/starred?page=2",
            "last": "https://api.github.com/users/premasagar/starred?page=21"
          }
        },
        "created": "2012-11-16T15:48:33.866Z"
      }
    ]

## Querying to a POST url

noodle allows post data to be passed along to the target web server specified in the url. This is done with the optional `post` property, which takes an object map of the post data keys and values.

    {
      "url": "http://example.com/login.php",
      "post": {
        "username": "john",
        "password": "123"
      },
      "selector": "h1.username",
      "type": "html"
    }

Take note, however, that queries with the `post` property will not be cached.

## Querying without caching

If `cache` is set to `false` in your query then noodle will not cache the results or the associated page, and it will fetch the data fresh. This is useful for debugging.

    {
      "url": "http://example.com",
      "selector": "h1",
      "cache": "false"
    }

## Query errors

noodle aims to give errors for the possible cases where a query does not yield any results.

Each error is specific to one result object and is contained in the `error` property as a string message.

Response:

    [
      {
        "results": [],
        "error": "Document not found"
      }
    ]

noodle also fails silently with the `extract` property: any extract values that cannot be found are omitted from the results object.

Consider the following JSON response to a partially incorrect query.

Query:

    {
      "url": "http://chrisnewtn.com",
      "selector": "ul.social li a",
      "extract": ["href", "nonexistent"]
    }

Response:

The `nonexistent` extract property is left out because it was not found on the element.

    [
      {
        "results": [
          {
            "href": "http://twitter.com/chrisnewtn"
          },
          {
            "href": "http://plus.google.com/u/0/111845796843095584341"
          }
        ],
        "created": "2012-08-01T16:28:19.167Z"
      }
    ]

## Multiple queries

Multiple queries can be made per request to the server. You can mix different types of queries in the same request, including queries in the map notation.

Query:

    [
      {
        "url": "http://chrisnewtn.com",
        "selector": "ul.social li a",
        "extract": ["text", "href"]
      },
      {
        "url": "http://premasagar.com",
        "selector": "#social_networks li a.url",
        "extract": "href"
      }
    ]

Response:

    [
      {
        "results": [
          {
            "href": "http://twitter.com/chrisnewtn",
            "text": "Twitter"
          },
          {
            "href": "http://plus.google.com/u/0/111845796843095584341",
            "text": "Google+"
          }
        ],
        "created": "2012-08-01T16:23:41.913Z"
      },
      {
        "results": [
          "http://dharmafly.com/blog",
          "http://twitter.com/premasagar",
          "https://github.com/premasagar"
        ],
        "created": "2012-08-01T16:22:13.339Z"
      }
    ]
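
When several queries are sent at once, it can help to pair each query with its result object. The Python sketch below assumes, as the example above suggests, that result objects come back in the same order as the queries. `pair_results` is a hypothetical helper, and the sample response is made up for illustration (the second entry reuses the documented `error` shape).

```python
def pair_results(queries: list, response: list) -> list:
    """Pair each query URL with its results and optional error message.

    Assumes response[i] belongs to queries[i]; this ordering is inferred
    from the documented example, not guaranteed by this sketch.
    """
    if len(queries) != len(response):
        raise ValueError("expected one result object per query")
    return [
        {
            "url": query["url"],
            "results": result.get("results", []),
            "error": result.get("error"),  # None when the query succeeded
        }
        for query, result in zip(queries, response)
    ]


queries = [
    {"url": "http://chrisnewtn.com", "selector": "ul.social li a", "extract": "href"},
    {"url": "http://premasagar.com", "selector": "#social_networks li a.url", "extract": "href"},
]
# Illustrative response: the first query succeeds, the second fails.
response = [
    {"results": ["http://twitter.com/chrisnewtn"], "created": "2012-08-01T16:23:41.913Z"},
    {"results": [], "error": "Document not found"},
]

for item in pair_results(queries, response):
    status = item["error"] or f"{len(item['results'])} result(s)"
    print(item["url"], "->", status)
```

Checking the `error` property per result object, rather than once per request, matches the rule that each error is specific to one result object.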