# Scraping Headlines Using Stew
*This is a complete (but simple) example of using [Stew](https://github.com/rodw/stew) to extract content from the web. It is written as a "litcoffee" file, which is an executable/compilable file containing Markdown content with embedded CoffeeScript. ([Follow this link to go back to the README file.](../README.html))*
In this example, we'll extract headlines from the venerable social-tech-news site [Slashdot](http://slashdot.org/).

    URL = 'http://slashdot.org/'

If you examine the HTML of the Slashdot homepage carefully, you'll find that each headline is contained in an `h2` tag with the class `story`, and that within this heading there is an anchor (`a`) tag that contains the link. As a CSS selector, that looks like:

    SELECTOR = 'h2.story a'

We'll use that selector to extract the headlines and links from the HTML and print them to the console with the following function:

    print_headline = (node)->
      headline = domutil.to_text(node)
      # The hrefs are protocol-relative ("//..."), so prepend the scheme.
      link = "http:#{node.attribs.href}"
      console.log "#{headline} <#{link}>"

(`domutil` is an instance of Stew's `DOMUtil` type, which is imported below.)
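
Prepending `http:` works here because the links in question start with `//`. For pages that mix link styles, resolving each `href` against the page address is more forgiving. The following is a minimal sketch, assuming Node's built-in `url` module with its WHATWG `URL` class; `absolute_link` is a hypothetical helper name and not part of Stew. The module is bound to `url` (lowercase) so it does not collide with the `URL` string constant defined above.

```coffeescript
url = require 'url'

# Hypothetical helper (not part of Stew): resolve `href` against the page it
# came from, so protocol-relative, root-relative and absolute links all work.
absolute_link = (href, base)->
  resolved = new url.URL(href, base)
  resolved.href

console.log absolute_link('//news.slashdot.org/story/example', 'http://slashdot.org/')
console.log absolute_link('/faq', 'http://slashdot.org/')
```

With the base `http://slashdot.org/`, the protocol-relative link inherits the `http:` scheme, and the root-relative link is resolved against the same host.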
Now, given an `html` string, selecting and printing the headlines is as simple as this:

    select_and_print_headlines = (html)->
      stew.select html, SELECTOR, (err,nodeset)->
        if err?
          console.error "Error:", err
        else
          for node in nodeset
            print_headline node

That's really all there is to it. All of the Stew-specific code is found above.
The rest of this file jumps through the hoops needed to download the HTML document from the web.
## Importing the Library
When using Stew, you'll typically import the library using something like this:

    # This is what you'll typically do:
    #   stew = new (require('stew-select')).Stew()
    # and/or
    #   domutil = new (require('stew-select')).DOMUtil()

but since this file is found *within* the Stew repository itself, we'll do things a little differently. Most readers can safely ignore these next few lines and use the simple `require` statement above instead.

    # You WON'T do the following. We're only doing it here because we
    # want to use the "local" implementation of Stew.
    fs          = require 'fs'
    path        = require 'path'
    HOMEDIR     = path.join(__dirname,'..')
    LIB_COV_DIR = path.join(HOMEDIR,'lib-cov')
    LIB_DIR     = if fs.existsSync(LIB_COV_DIR) then LIB_COV_DIR else path.join(HOMEDIR,'lib')
    stew        = new (require(path.join(LIB_DIR,'stew'))).Stew()
    domutil     = new (require(path.join(LIB_DIR,'stew'))).DOMUtil()

## Setting up the HTTP "Fetcher"
Let's define a function that will fetch a web page and pass the resulting content to a callback function. We'll use the Node.js `http` library for this.

    http = require 'http'

Our function will accept the `url` for the document to download and a `callback` function to invoke once the document has been downloaded.
Following Node.js convention, we'll use the signature `callback(err,body)` for the callback function.

    fetch = (url,callback)->

Using `http`, we'll create a callback function to buffer the HTTP response:

      http_callback = (response)->
        unless 200 <= response.statusCode <= 299
          callback "Unexpected status code #{response.statusCode}"
        else
          buffer = ""
          response.setEncoding 'utf8'
          response.on 'data', (chunk)->buffer += chunk

and, when the full response body has been received, pass it to the callback:

          response.on 'end', ()-> callback(null,buffer)

Finally, we can trigger the actual request:

      http.get(url, http_callback).on('error', callback)

Now our `fetch` method will download content from the URL and pass it to a callback function.
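
The buffering logic can be tried without touching the network. The following is a minimal sketch, assuming a stand-in response object built from Node's `EventEmitter`; `make_http_callback`, `fake` and `results` are hypothetical names used only for this illustration. It repeats the same status check and `data`/`end` handling as `http_callback`, then feeds it two chunks by hand.

```coffeescript
{EventEmitter} = require 'events'

# Hypothetical factory: returns a response handler with the same buffering
# logic as `http_callback`, reporting through an error-first `callback`.
make_http_callback = (callback)->
  (response)->
    unless 200 <= response.statusCode <= 299
      callback "Unexpected status code #{response.statusCode}"
    else
      buffer = ""
      response.setEncoding 'utf8'
      response.on 'data', (chunk)-> buffer += chunk
      response.on 'end', -> callback(null, buffer)

# Stand-in for a real HTTP response: an emitter with the two members we use.
fake = new EventEmitter()
fake.statusCode = 200
fake.setEncoding = (encoding)-> null

results = []
handler = make_http_callback (err, body)-> results.push [err, body]
handler fake

# Deliver the body in two chunks, then signal the end of the response.
fake.emit 'data', '<html>'
fake.emit 'data', '</html>'
fake.emit 'end'
console.log results
```

The callback fires once, after `end`, with a null error and the two chunks joined in arrival order.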
## Actual processing
Now we can fetch the document and print the result using our `select_and_print_headlines` function:

    fetch URL, (err,body)->
      if err?
        console.error "Error:", err
      else
        console.log '-----------------------------------------'
        console.log "CURRENT HEADLINES AT #{URL}"
        console.log '-----------------------------------------'
        select_and_print_headlines body
        console.log '-----------------------------------------'

## Running this script
Now we can run this script by typing:
```console
coffee docs/example.litcoffee
```
and see output like the following:
```console
-----------------------------------------
CURRENT HEADLINES AT http://slashdot.org/
-----------------------------------------
DRM: How Book Publishers Failed To Learn From the Music Industry <http://news.slashdot.org/story/13/05/31/2045211/drm-how-book-publishers-failed-to-learn-from-the-music-industry>
Small Black Holes: Cloudy With a Chance of Better Visibility <http://science.slashdot.org/story/13/05/31/214224/small-black-holes-cloudy-with-a-chance-of-better-visibility>
No, the Tesla Model S Doesn't Pollute More Than an SUV <http://tech.slashdot.org/story/13/05/31/1955214/no-the-tesla-model-s-doesnt-pollute-more-than-an-suv>
The Case For a Government Bug Bounty Program <http://it.slashdot.org/story/13/05/31/1933231/the-case-for-a-government-bug-bounty-program>
When Smart Developers Generate Crappy Code <http://developers.slashdot.org/story/13/05/31/1854203/when-smart-developers-generate-crappy-code>
New York City Wants To Revive Old Voting Machines <http://tech.slashdot.org/story/13/05/31/1748201/new-york-city-wants-to-revive-old-voting-machines>
Big Asteroid (With Its Own Moon) To Have Closest Approach With Earth Today <http://science.slashdot.org/story/13/05/31/1727256/big-asteroid-with-its-own-moon-to-have-closest-approach-with-earth-today>
Google Maps Used To Find Tax Cheats <http://tech.slashdot.org/story/13/05/31/1721232/google-maps-used-to-find-tax-cheats>
Judge Orders Google To Comply With FBI's Warrantless NSL Requests <http://yro.slashdot.org/story/13/05/31/1633209/judge-orders-google-to-comply-with-fbis-warrantless-nsl-requests>
Ask Slashdot: How Important Is Advanced Math In a CS Degree? <http://ask.slashdot.org/story/13/05/31/1546253/ask-slashdot-how-important-is-advanced-math-in-a-cs-degree>
Badgers Block British Broadband Buildout <http://news.slashdot.org/story/13/05/31/1530227/badgers-block-british-broadband-buildout>
Confirmed: Water Once Flowed On Mars <http://science.slashdot.org/story/13/05/31/1523245/confirmed-water-once-flowed-on-mars>
Motorola Developing Pill and Tattoo Authentication Methods <http://it.slashdot.org/story/13/05/31/1414210/motorola-developing-pill-and-tattoo-authentication-methods>
Seeing Atomic Bonds Before and After Reactions <http://science.slashdot.org/story/13/05/31/1353241/seeing-atomic-bonds-before-and-after-reactions>
U.S. Authorizes Sales of American Communication Tech To Iran <http://news.slashdot.org/story/13/05/31/145229/us-authorizes-sales-of-american-communication-tech-to-iran>
-----------------------------------------
```
*([Follow this link to go back to the README file.](../README.html))*