stew-select
CSS selectors that allow regular expressions. Stew is a meatier soup.
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="generator" content="pandoc" />
<title>Scraping Headlines Using Stew</title>
<!-- /* a stylesheet to include in our *.md-based html. */ -->
<!-- /* please leave the begin and end style tags, they let us include the text of this file "inline" in html documents */-->
<style>
#TOC { font-family: 'droid sans',helvetica,sans-serif; font-size: 0.8em; position: fixed; right: 0em; top: 0em; background: #e5e5ee; -webkit-box-shadow: 0 0 1em #777777; -moz-box-shadow: 0 0 1em #777777; -webkit-border-bottom-left-radius: 5px; -moz-border-radius-bottomleft: 5px; text-align: left; max-height: 80%; z-index: 200; width: 7em; white-space:nowrap; overflow:hidden; padding-top: 3em; opacity: 0.9; }
#TOC:before { content:"Contents"; font-weight: bold; text-align:right; align:right; display:block; position:fixed; right: 1.5em; top: 1em; background: #e5e5ee; opacity:0.9; }
#TOC:hover { width: auto; padding-right:2em; max-width:80%; overflow:auto ; opacity:1.0; }
#TOC ul { margin: 0 0 0 1em; padding: 0; }
#TOC li { padding: 0; margin: 1px; list-style: none; overflow:hidden; text-overflow: ellipsis; }
html { font-size: 100%; overflow-y: scroll; -webkit-text-size-adjust: 100%; -ms-text-size-adjust: 100%; }
body{ color:#444; font-family:Georgia, Palatino, 'Palatino Linotype', Times, 'Times New Roman', serif; font-size:12px; line-height:1.5em; padding:1em; margin:auto; max-width:48em; background:#fefefe; }
a { color: #0645ad; text-decoration:none;}
a:visited { color: #0b0080; }
a:hover { color: #06e; }
a:active { color:#faa700; }
a:focus { outline: thin dotted; }
a:hover, a:active { outline: 0; }
::-moz-selection {background:rgba(255,255,0,0.3);color:#000}
::selection {background:rgba(255,255,0,0.3);color:#000}
a::-moz-selection {background:rgba(255,255,0,0.3);color:#0645ad}
a::selection {background:rgba(255,255,0,0.3);color:#0645ad}
p { margin:1em 0; }
p.caption { font-style: italic; text-align: right; }
img { max-width:100%; }
h1,h2,h3,h4,h5,h6 { font-weight:normal; color:#111; line-height:1em; }
h4,h5,h6{ font-weight: bold; }
h1 { font-size:2.5em; }
h2 { font-size:2em; }
h3 { font-size:1.5em; }
h4 { font-size:1.2em; }
h5 { font-size:1em; }
h6 { font-size:0.9em; }
blockquote{ color:#666666; margin:0; padding-left: 3em; border-left: 0.5em #eee solid; }
hr { display: block; height: 2px; border: 0; border-top: 1px solid #aaa;border-bottom: 1px solid #eee; margin: 1em 0; padding: 0; }
pre, code, kbd, samp { font-family: 'droid sans mono slashed', 'droid sans mono', monospace, monospace; }
pre { padding:2px; background:#333; color:#9e9; border:1px solid #444; overflow:hidden; text-overflow: ellipsis;}
pre:hover { overflow:visible; width: auto; }
pre:hover code { background:#333; }
code { padding:2px; background: #f5f5ff; border:1px solid #e5e5ee; font-size:0.9em; }
code.url { padding:2px; border:none; background:none; font-family:Georgia, Palatino, 'Palatino Linotype', Times, 'Times New Roman', serif; }
pre code { border: none; background:#333; }
b, strong { font-weight: bold; }
dfn { font-style: italic; }
ins { background: #ff9; color: #000; text-decoration: none; }
mark { background: #ff0; color: #000; font-style: italic; font-weight: bold; }
sub, sup { font-size: 75%; line-height: 0; position: relative; vertical-align: baseline; }
sup { top: -0.5em; }
sub { bottom: -0.25em; }
ul, ol { margin: 1em 0; padding: 0 0 0 2em; }
li p:last-child { margin:0 }
dd { margin: 0 0 0 2em; }
img { border: 0; -ms-interpolation-mode: bicubic; vertical-align: middle; }
table { border-collapse: collapse; border-spacing: 0; }
td { vertical-align: top; }
/* TODO: this could use a better color scheme */
code > span.kw { color: #dd7522; font-weight: bold; }
code > span.dt { color: #dd7522; }
code > span.dv { color: #669933; }
code > span.bn { color: #eddd3d; }
code > span.fl { color: #eddd3d; }
code > span.ch { color: #eddd3d; }
code > span.st { color: #669933; }
code > span.co { color: grey; font-style: italic; }
code > span.al { color: #ff0000; font-weight: bold; }
code > span.fu { color: #dd7522; }
code > span.ot { color: #007020; }
code > span.er { color: #ff0000; font-weight: bold; }
@media only screen and (min-width: 480px) { body{font-size:14px;} }
@media only screen and (min-width: 768px) { body{font-size:16px;} }
@media print {
#TOC { display:none; }
* { background: transparent ; color: black ; filter:none ; -ms-filter: none ; }
body{font-size:12pt; max-width:100%;}
a, a:visited { text-decoration: none; }
hr { height: 1px; border:0; border-bottom:1px solid black; }
a[href]:after { content: " (" attr(href) ")"; }
abbr[title]:after { content: " (" attr(title) ")"; }
.ir a:after, a[href^="javascript:"]:after, a[href^="#"]:after { content: ""; }
pre, blockquote { border: 1px solid #999; padding-right: 1em; page-break-inside: avoid; }
pre { font-size: 0.8em; }
tr, img { page-break-inside: avoid; }
img { max-width: 100% ; }
@page :left { margin: 15mm 20mm 15mm 10mm; }
@page :right { margin: 15mm 10mm 15mm 20mm; }
p, h2, h3 { orphans: 3; widows: 3; }
h2, h3 { page-break-after: avoid; }
}
</style>
</head>
<body>
<div id="TOC">
<ul>
<li><a href="#scraping-headlines-using-stew">Scraping Headlines Using Stew</a><ul>
<li><a href="#importing-the-library">Importing the Library</a></li>
<li><a href="#setting-up-the-http-fetcher">Setting up the HTTP "Fetcher"</a></li>
<li><a href="#actual-processing">Actual processing</a></li>
<li><a href="#running-this-script">Running this script</a></li>
</ul></li>
</ul>
</div>
<h1 id="scraping-headlines-using-stew"><a href="#TOC">Scraping Headlines Using Stew</a></h1>
<p><em>This is a complete (but simple) example of using <a href="https://github.com/rodw/stew">Stew</a> to extract content from the web. It is written as a "litcoffee" file, which is an executable/compilable file containing Markdown content with embedded CoffeeScript. (<a href="../README.html">Follow this link to go back to the README file.</a>)</em></p>
<p>In this example, we'll extract headlines from the venerable social-tech-news site <a href="http://slashdot.org/">Slashdot</a>.</p>
<pre><code>URL = 'http://slashdot.org/'</code></pre>
<p>If you examine the HTML of the Slashdot homepage carefully, you'll find that each headline is contained in an <code>h2</code> tag with the class <code>story</code>, and that within this heading there is an anchor (<code>a</code>) tag that contains the link. As a CSS selector, that looks like:</p>
<pre><code>SELECTOR = 'h2.story a'</code></pre>
<p>We'll use that selector to extract the headlines and links from the HTML and print them to the console with the following function:</p>
<pre><code>print_headline = (node)->
  headline = domutil.to_text(node)
  # The hrefs are protocol-relative ("//..."), so prepend the scheme.
  link = "http:#{node.attribs.href}"
  console.log "#{headline} &lt;#{link}&gt;"</code></pre>
<p>(<code>domutil</code> is an instance of Stew's <code>DOMUtil</code> type, which is imported below.)</p>
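<p>The <code>http:</code> prefix in <code>print_headline</code> assumes that every <code>href</code> is protocol-relative (starting with <code>//</code>), which is what the sample output at the end of this page suggests. If that assumption might not hold, a small helper could normalize the link first. This is only a sketch; <code>absolute_link</code> is a hypothetical helper, not part of Stew:</p>
<pre><code># Hypothetical helper (not part of Stew): prefix a scheme onto
# protocol-relative hrefs and leave every other href untouched.
absolute_link = (href, scheme = 'http')->
  if href.indexOf('//') is 0 then "#{scheme}:#{href}" else href

console.log absolute_link('//news.slashdot.org/story/1')  # http://news.slashdot.org/story/1
console.log absolute_link('https://example.com/')          # https://example.com/</code></pre>
<p>Inside <code>print_headline</code>, the link could then be computed as <code>absolute_link(node.attribs.href)</code>.</p>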
<p>Now, given an <code>html</code> string, selecting and printing the headlines is as simple as this:</p>
<pre><code>select_and_print_headlines = (html)->
  stew.select html, SELECTOR, (err,nodeset)->
    if err?
      console.error "Error:", err
    else
      for node in nodeset
        print_headline node</code></pre>
<p>That's really all there is to it. All of the Stew-specific code is found above.</p>
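<p>Stew's distinguishing feature is that selectors may contain regular expressions. As a sketch, and assuming the <code>[attr=/regex/]</code> attribute form described in Stew's README, the selection could be narrowed to headlines whose link mentions "science". The names <code>SCIENCE_SELECTOR</code> and <code>print_science_headlines</code> are made up for this illustration:</p>
<pre><code># Assumption: Stew accepts a regular expression as an attribute value.
# SCIENCE_SELECTOR and print_science_headlines are illustrative names only.
stew_select = require 'stew-select'
stew    = new stew_select.Stew()
domutil = new stew_select.DOMUtil()

SCIENCE_SELECTOR = 'h2.story a[href=/science/i]'

print_science_headlines = (html)->
  stew.select html, SCIENCE_SELECTOR, (err,nodeset)->
    if err?
      console.error "Error:", err
    else
      console.log domutil.to_text(node) for node in nodeset</code></pre>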
<p>The rest of this file jumps through the hoops needed to download the HTML document from the web.</p>
<h2 id="importing-the-library"><a href="#TOC">Importing the Library</a></h2>
<p>When using Stew, you'll typically import the library using something like this:</p>
<pre><code># This is what you'll typically do:
# stew = new (require('stew-select')).Stew()
# and/or
# domutil = new (require('stew-select')).DOMUtil()</code></pre>
<p>but since this file is found <em>within</em> the Stew repository itself, we'll do things a little differently. Most readers can safely ignore these next few lines and use the simple <code>require</code> statement above instead.</p>
<pre><code># You WON'T do the following. We're only doing it here because we
# want to use the "local" implementation of Stew.
fs = require 'fs'
path = require 'path'
HOMEDIR = path.join(__dirname,'..')
LIB_COV_DIR = path.join(HOMEDIR,'lib-cov')
LIB_DIR = if fs.existsSync(LIB_COV_DIR) then LIB_COV_DIR else path.join(HOMEDIR,'lib')
stew = new (require(path.join(LIB_DIR,'stew'))).Stew()
domutil = new (require(path.join(LIB_DIR,'stew'))).DOMUtil()</code></pre>
<h2 id="setting-up-the-http-fetcher"><a href="#TOC">Setting up the HTTP "Fetcher"</a></h2>
<p>Let's define a function that will fetch a web page and pass the resulting content to a callback function. We'll use the Node.js <code>http</code> library for this.</p>
<pre><code>http = require 'http'</code></pre>
<p>Our function will accept the <code>url</code> of the document to download and a <code>callback</code> function to invoke once the document has been downloaded.</p>
<p>Following Node.js convention, we'll use the signature <code>callback(err,body)</code> for the callback function.</p>
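<p>As a reminder of that convention: the first argument is an error (or <code>null</code> on success) and the second is the result. A minimal sketch, using a made-up <code>shout</code> function purely for illustration:</p>
<pre><code># Hypothetical example of an error-first callback; `shout` is not part of this script.
shout = (text,callback)->
  if typeof text isnt 'string'
    callback new Error("Expected a string")   # failure: pass only the error
  else
    callback null, text.toUpperCase()         # success: null error, then the result

shout 'stew', (err,body)->
  if err?
    console.error "Error:", err.message
  else
    console.log body    # prints "STEW"</code></pre>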
<pre><code>fetch = (url,callback)-></code></pre>
<p>Using <code>http</code>, we'll create a callback function to buffer the HTTP response:</p>
<pre><code>  http_callback = (response)->
    unless 200 &lt;= response.statusCode &lt;= 299
      callback "Unexpected status code #{response.statusCode}"
    else
      buffer = ""
      response.setEncoding 'utf8'
      response.on 'data', (chunk)->buffer += chunk</code></pre>
<p>and, when the full response body has been received, pass it to the callback:</p>
<pre><code>      response.on 'end', ()-> callback(null,buffer)</code></pre>
<p>Finally, we can trigger the actual request:</p>
<pre><code>  http.get(url, http_callback).on('error', callback)</code></pre>
<p>Now our <code>fetch</code> method will download content from the URL and pass it to a callback function.</p>
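<p>Note that <code>fetch</code> treats anything other than a 2xx status as an error, so it does not follow redirects, and the <code>http</code> module only handles <code>http://</code> URLs. If the target site serves (or redirects to) <code>https://</code>, one possible adaptation, sketched here under those assumptions, is to choose the client module from the URL's scheme. <code>client_for</code> and <code>fetch_any</code> are hypothetical names, not part of the original script:</p>
<pre><code>http  = require 'http'
https = require 'https'

# Hypothetical helper: pick the Node.js module that matches the URL's scheme.
client_for = (url)->
  if /^https:/i.test(url) then https else http

# Same contract as `fetch` above: callback(err,body).
fetch_any = (url,callback)->
  http_callback = (response)->
    unless 200 &lt;= response.statusCode &lt;= 299
      response.resume()   # discard the unread body
      callback "Unexpected status code #{response.statusCode}"
    else
      buffer = ""
      response.setEncoding 'utf8'
      response.on 'data', (chunk)-> buffer += chunk
      response.on 'end', ()-> callback(null,buffer)
  client_for(url).get(url, http_callback).on('error', callback)</code></pre>
<p>It is called exactly like <code>fetch</code>, for example <code>fetch_any 'https://slashdot.org/', (err,body)-> ...</code>.</p>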
<h2 id="actual-processing"><a href="#TOC">Actual processing</a></h2>
<p>Now we can fetch the document and print the result using our <code>select_and_print_headlines</code> function:</p>
<pre><code>fetch URL, (err,body)->
  if err?
    console.error "Error:", err
  else
    console.log '-----------------------------------------'
    console.log "CURRENT HEADLINES AT #{URL}"
    console.log '-----------------------------------------'
    select_and_print_headlines body
    console.log '-----------------------------------------'</code></pre>
<h2 id="running-this-script"><a href="#TOC">Running this script</a></h2>
<p>Now we can run this script by typing:</p>
<pre class="console"><code>coffee docs/example.litcoffee</code></pre>
<p>and see output like the following:</p>
<pre class="console"><code>-----------------------------------------
CURRENT HEADLINES AT http://slashdot.org/
-----------------------------------------
DRM: How Book Publishers Failed To Learn From the Music Industry &lt;http://news.slashdot.org/story/13/05/31/2045211/drm-how-book-publishers-failed-to-learn-from-the-music-industry&gt;
Small Black Holes: Cloudy With a Chance of Better Visibility &lt;http://science.slashdot.org/story/13/05/31/214224/small-black-holes-cloudy-with-a-chance-of-better-visibility&gt;
No, the Tesla Model S Doesn't Pollute More Than an SUV &lt;http://tech.slashdot.org/story/13/05/31/1955214/no-the-tesla-model-s-doesnt-pollute-more-than-an-suv&gt;
The Case For a Government Bug Bounty Program &lt;http://it.slashdot.org/story/13/05/31/1933231/the-case-for-a-government-bug-bounty-program&gt;
When Smart Developers Generate Crappy Code &lt;http://developers.slashdot.org/story/13/05/31/1854203/when-smart-developers-generate-crappy-code&gt;
New York City Wants To Revive Old Voting Machines &lt;http://tech.slashdot.org/story/13/05/31/1748201/new-york-city-wants-to-revive-old-voting-machines&gt;
Big Asteroid (With Its Own Moon) To Have Closest Approach With Earth Today &lt;http://science.slashdot.org/story/13/05/31/1727256/big-asteroid-with-its-own-moon-to-have-closest-approach-with-earth-today&gt;
Google Maps Used To Find Tax Cheats &lt;http://tech.slashdot.org/story/13/05/31/1721232/google-maps-used-to-find-tax-cheats&gt;
Judge Orders Google To Comply With FBI's Warrantless NSL Requests &lt;http://yro.slashdot.org/story/13/05/31/1633209/judge-orders-google-to-comply-with-fbis-warrantless-nsl-requests&gt;
Ask Slashdot: How Important Is Advanced Math In a CS Degree? &lt;http://ask.slashdot.org/story/13/05/31/1546253/ask-slashdot-how-important-is-advanced-math-in-a-cs-degree&gt;
Badgers Block British Broadband Buildout &lt;http://news.slashdot.org/story/13/05/31/1530227/badgers-block-british-broadband-buildout&gt;
Confirmed: Water Once Flowed On Mars &lt;http://science.slashdot.org/story/13/05/31/1523245/confirmed-water-once-flowed-on-mars&gt;
Motorola Developing Pill and Tattoo Authentication Methods &lt;http://it.slashdot.org/story/13/05/31/1414210/motorola-developing-pill-and-tattoo-authentication-methods&gt;
Seeing Atomic Bonds Before and After Reactions &lt;http://science.slashdot.org/story/13/05/31/1353241/seeing-atomic-bonds-before-and-after-reactions&gt;
U.S. Authorizes Sales of American Communication Tech To Iran &lt;http://news.slashdot.org/story/13/05/31/145229/us-authorizes-sales-of-american-communication-tech-to-iran&gt;
-----------------------------------------</code></pre>
<p><em>(<a href="../README.html">Follow this link to go back to the README file.</a>)</em></p>
</body>
</html>