stew-select
Version:
CSS selectors that allow regular expressions. Stew is a meatier soup.
213 lines (122 loc) • 10.2 kB
Markdown
# Stew
**[Stew](https://github.com/rodw/stew)** is a JavaScript library that implements the [CSS selector](http://www.w3.org/TR/CSS2/selector.html) syntax, and extends it with regular expression tag names, class names, ids, attribute names and attribute values.
For example, given a variable `dom` containing a document tree, the JavaScript snippet:
```javascript
var links = stew.select(dom,'a[href]');
```
will return an array of all the anchor tags (`<a>`) found in `dom` that include an `href` attribute.
While the JavaScript snippet:
```javascript
var metadata = stew.select(dom,'head meta[name=/^dc\.|:/i]');
```
will extract the [Dublin Core metadata](http://dublincore.org/documents/dcq-html/) from a document by selecting every `<meta>` tag found in the `<head>` that has a `name` attribute that starts with `DC.` or `DC:` (ignoring case).
Stew is often used as a toolkit for "screen-scraping" web pages (extracting data from HTML and XML documents).
(The name "stew" is inspired by the Python library [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/), Simon Willison's [soupselect](http://code.google.com/p/soupselect/) extension of *BeautifulSoup*, and Harry Fuecks' [Node.js port](https://github.com/harryf/node-soupselect) of *soupselect*. [Stew](https://github.com/rodw/stew) is a meatier soup.)
## Links
Read on for more information, or:
- [visit the repository on GitHub.](https://github.com/rodw/stew)
- [review the API.](./docs/using.html)
- [see a complete example of using Stew (in a "literate CoffeeScript" format).](./docs/example.html)
- [browse the annotated source code](./docs/docco/stew.html) or [test coverage report](/docs/coverage.html).
- [learn how to contribute to Stew.](./docs/hacking.html)
- [see the version history and release notes.](./docs/version-history.html)
(Links not working? Try it from [heyrod.com/stew](http://heyrod.com/stew).)
## Installing
The source code and documentation for Stew is available on GitHub at [rodw/stew](https://github.com/rodw/stew). You can clone the repository via:
```console
git clone git@github.com:rodw/stew.git
```
Stew is deployed as an [npm module](https://npmjs.org/) under the name [`stew-select`](https://npmjs.org/package/stew-select). Hence you can install a pre-packaged version with the command:
```console
npm install stew-select
```
and you can add it to your project as a dependency by adding a line like:
```javascript
"stew-select": "latest"
```
to the `dependencies` or `devDependencies` part of your `package.json` file.
## Features
### Core CSS Selectors
Stew supports the full [Version 2.1 CSS selector syntax](http://www.w3.org/TR/CSS2/selector.html) and much of [Version 3](http://www.w3.org/TR/css3-selectors/), including
* The universal selector (`*`).
E.g., `stew.select( dom, '*' )` selects all the tags in the document.
* Type selectors (`E`).
E.g., `stew.select( dom, 'h2' )` selects all the `h2` tags in the document.
* Class selectors (`E.foo`).
E.g., `stew.select( dom, '.foo' )` selects all tags in the document with the class `foo`.
* ID selectors (`E#foo`).
E.g., `stew.select( dom, '#foo' )` selects all tags in the document with the id `foo`.
* Descendant selectors (`E F`).
E.g., `stew.select( dom, 'div h2 a' )` selects all `a` tags with an `h2` ancestor that has a `div` ancestor.
* Child selectors (`E > F`).
E.g., `stew.select( dom, 'div > h2 > a')` selects all `a` tags with an `h2` *parent* that has a `div` *parent*.
* Attribute name selectors (`E[foo]`).
E.g., `stew.select( dom, 'a[href]')` selects all `a` tags with an `href` attribute (and `stew.select( dom, '[href]')` selects *all* tags with an `href` attribute).
* Attribute value selectors (`E[foo="bar"]`).
E.g., `stew.select( dom, 'a[rel="author"]')` selects all `a` tags with a `rel` attribute set to the value `author`.
* The `~=` operator (`E[foo~="bar"]`).
E.g., `stew.select( dom, 'a[class~="author"]')` selects all `a` tags with the `class` `author`, whether or not that tag has other classes as well. More generally `~=` treats the attribute
value as a white-space delimited list of values (to which the given value is compared).
* The `|=` operator (`E[foo|="bar"]`).
E.g., `stew.select( dom, 'div[lang|="en"]')` selects all `div` tags with a `lang` attribute whose value is *exactly* `en` or whose value starts with `en-`.
* The starts-with `^=` operator (`E[foo^="bar"]`).
E.g., `stew.select( dom, 'a[href^="https://"]')` selects all `a` tags with an `href` attribute value that starts with `https://`.
* The ends-with `$=` operator (`E[foo$="bar"]`).
E.g., `stew.select( dom, 'a[href$=".html"]')` selects all `a` tags with an `href` attribute value that ends with `.html`.
* The contains `*=` operator (`E[foo*="bar"]`).
E.g., `stew.select( dom, 'a[href*="://heyrod.com/"]')` selects all `a` tags with an `href` attribute value that contains with `://heyrod.com/`.
* Adjacent selectors (`E + F`).
E.g., `stew.select( dom, 'h1 + p')` selects all `p` tags that immediately follow an `h1` tag.
* Preceeding sibling selectors (`E ~ F`).
E.g., `stew.select( dom, 'h1 ~ p')` selects all `p` tags that follow an `h1` tag (even if there are other tags between the `h1` and `p`.
* The "or" conjunction (`E, F`).
E.g., `stew.select( dom, 'h1, h2')` selects all `h1` and `h2` tags.
* The :first-child pseudo-class (`E:first-child`).
E.g., `stew.select( dom, 'li:first-child' )` selects all `li` tags that happen to be the first tag among its siblings.
And of course, you can use arbitrary combinations of these selectors:
```javascript
stew.select( dom, 'article div.credits > a[rel=license]' );
stew.select( dom, 'h1, h2, h3, h4, h5, h6, .heading' );
stew.select( dom, 'h1.title + h2.subtitle' );
stew.select( dom, 'ul > li > a[rel=author][href]' );
```
### Regular Expressions
Stew extends the CSS selector syntax by allowing the use of regular expressions to specify tag names, class names, ids, and attributes (both name and value).
For example,
```javascript
var metadata = stew.select(dom,'a[href=/^https?:/i]');
```
will select all anchor (`<a>`) tags with an `href` attribute that starts with `http:` or `https:` (with a case-insensitive comparison).
Another example, the snippet:
```javascript
var metadata = stew.select(dom,'[/^data-/]');
```
selects all tags with an attribute whose name starts with `data-`.
Any name or value that starts and ends with `/` will be treated as a regular expression. (Or, more accurately, any name or value that starts with `/` and ends with `/` with an optional suffix of any combination of the letters `g`, `m` and `i`. E.g., `/example/gi`.)
The regular expression is processed using JavaScript's standard regular expression syntax, including support for `\b` and other special class markers.
Here are some example CSS selectors using regular expressions:
* Tag names: `/^d[aeiou]ve?$/` matches `div`, but also `dove`, `dave`, etc.
* Class names: `./^nav/` matches any tag with a class name that starts with the string `nav`.
* IDs: `#/^main$/i` matches any tag with the id `main`, using a case insensitive comparison (so it also matches `MAIN`, `Main` and other variants.
* Attribute names: As above, `[/^data-/]` matches any tag with an attribute whose name starts with `data-`.
* Attribute values: As above, `[href=/^https?:/i]` matches any tag with an `href` attribute whose value starts with `http:` or `https:` (case-insensitive).
These may be used in any combination, and freely mixed with "regular" CSS selectors.
## Current Limitations
Stew currently has a couple of known issues that crop up during specific (and rare) edge-cases. We intend to eliminate these in future releases, but want to make you aware of them so that you're not surprised.
(Developers: If you'd like to help address these issues, we'd love your help. Feel free to submit a pull request or reach out for more information.)
### CSS 3 Selectors aren't (yet) fully supported.
Our intention is to fully support the most recent CSS selector syntax.
Stew supports all of the [CSS 2.1 Selectors](http://www.w3.org/TR/CSS2/selector.html). (To the extent that it makes sense to do so. It's hard to see how to interpret `:hover` and `:visited` and so on when looking at static-HTML from the server side, although `:first-child` is supported.)
Not quite all of the [CSS 3 Selectors](http://www.w3.org/TR/css3-selectors/) are supported. Currently certain [structural pseudo-classes](http://www.w3.org/TR/css3-selectors/#structural-pseudos) and [pseduo-elements](http://www.w3.org/TR/css3-selectors/#pseudo-elements) are not supported (*yet*).
### Stew may not report all syntax errors.
Stew will accept and properly parse any *valid* CSS selectors (unless listed as limitation elsewhere in this section).
However, (currently) Stew does not always *reject* every *invalid* selector. In particular, Stew's parser *may* ignore the invalid parts of improperly formed selectors, which can lead to unexpected results.
### Stew requires white-space around the "generalized sibling" operator: `E ~ F` works, but `E~F` doesn't.
Stew parsers most operators (including `+`, `>` and `,`) with or without white-space. In other words, Stew treats the following selectors as equivalent:
* `E + F`, `E+F`, `E+ F` and `E +F`
* `E , F`, `E,F`, `E, F` and `E ,F`
* `E > F`, `E>F`, `E> F` and `E >F`
Unfortantely, due to a quirk of Stew's current parser, the same is not true for the "preceeding sibling" operator (`~`). That is, Stew supports `E ~ F` but does not properly parse `E~F`. Currently the `~` character must be surrounded by white-space.
(If you're curious, the `~=` operator is the complicating factor for `~` right now. The same patterns we use for `+`, `,` and `>` don't quite work for `~`.)
## Licensing
The Stew library and related documentation are made available under an [MIT License](http://opensource.org/licenses/MIT). For details, please see the file [MIT-LICENSE.txt](MIT-LICENSE.txt) in the root directory of the repository.