stew-select

CSS selectors that allow regular expressions. Stew is a meatier soup.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta http-equiv="Content-Style-Type" content="text/css" /> <meta name="generator" content="pandoc" /> <title></title> <!-- /* a stylesheet to include in our *.md-based html. */ --> <!-- /* please leave the begin and end style tags, they let us include the text of this file "inline" in html documents */--> <style> #TOC { font-family: 'droid sans',helvetica,sans serif; font-size: 0.8em; position: fixed; right: 0em; top: 0em; background: #e5e5ee; -webkit-box-shadow: 0 0 1em #777777; -moz-box-shadow: 0 0 1em #777777; -webkit-border-bottom-left-radius: 5px; -moz-border-radius-bottomleft: 5px; text-align: left; max-height: 80%; z-index: 200; width: 7em; white-space:nowrap; overflow:hidden; padding-top: 3em; opacity: 0.9; } #TOC:before { content:"Contents"; font-weight: bold; text-align:right; align:right; display:block; position:fixed; right: 1.5em; top: 1em; background: #e5e5ee; opacity:0.9; } #TOC:hover { width: auto; padding-right:2em; max-width:80%; overflow:auto !important; opacity:1.0; } #TOC ul { margin: 0 0 0 1em; padding: 0; } #TOC li { padding: 0; margin: 1px; list-style: none; overflow:hidden; text-overflow: ellipsis; } html { font-size: 100%; overflow-y: scroll; -webkit-text-size-adjust: 100%; -ms-text-size-adjust: 100%; } body{ color:#444; font-family:Georgia, Palatino, 'Palatino Linotype', Times, 'Times New Roman', serif; font-size:12px; line-height:1.5em; padding:1em; margin:auto; max-width:48em; background:#fefefe; } a { color: #0645ad; text-decoration:none;} a:visited { color: #0b0080; } a:hover { color: #06e; } a:active { color:#faa700; } a:focus { outline: thin dotted; } a:hover, a:active { outline: 0; } ::-moz-selection {background:rgba(255,255,0,0.3);color:#000} ::selection 
{background:rgba(255,255,0,0.3);color:#000} a::-moz-selection {background:rgba(255,255,0,0.3);color:#0645ad} a::selection {background:rgba(255,255,0,0.3);color:#0645ad} p { margin:1em 0; } p.caption { font-style: italic; text-align: right; } img { max-width:100%; } h1,h2,h3,h4,h5,h6 { font-weight:normal; color:#111; line-height:1em; } h4,h5,h6{ font-weight: bold; } h1 { font-size:2.5em; } h2 { font-size:2em; } h3 { font-size:1.5em; } h4 { font-size:1.2em; } h5 { font-size:1em; } h6 { font-size:0.9em; } blockquote{ color:#666666; margin:0; padding-left: 3em; border-left: 0.5em #eee solid; } hr { display: block; height: 2px; border: 0; border-top: 1px solid #aaa;border-bottom: 1px solid #eee; margin: 1em 0; padding: 0; } pre, code, kbd, samp { font-family: 'droid sans mono slashed', 'droid sans mono', monospace, monospace; } pre { padding:2px; background:#333; color:#9e9; border:1px solid #444; overflow:hidden; text-overflow: ellipsis;} pre:hover { overflow:visible; width: auto; } pre:hover code { background:#333; } code { padding:2px; background: #f5f5ff; border:1px solid #e5e5ee; font-size:0.9em; } code.url { padding:2px; border:none; background:none; font-family:Georgia, Palatino, 'Palatino Linotype', Times, 'Times New Roman', serif; } pre code { border: none; background:#333; } b, strong { font-weight: bold; } dfn { font-style: italic; } ins { background: #ff9; color: #000; text-decoration: none; } mark { background: #ff0; color: #000; font-style: italic; font-weight: bold; } sub, sup { font-size: 75%; line-height: 0; position: relative; vertical-align: baseline; } sup { top: -0.5em; } sub { bottom: -0.25em; } ul, ol { margin: 1em 0; padding: 0 0 0 2em; } li p:last-child { margin:0 } dd { margin: 0 0 0 2em; } img { border: 0; -ms-interpolation-mode: bicubic; vertical-align: middle; } table { border-collapse: collapse; border-spacing: 0; } td { vertical-align: top; } /* TODO: this could use a better color scheme */ code > span.kw { color: #dd7522; font-weight: 
bold; } code > span.dt { color: #dd7522; } code > span.dv { color: #669933; } code > span.bn { color: #eddd3d; } code > span.fl { color: #eddd3d; } code > span.ch { color: #eddd3d; } code > span.st { color: #669933; } code > span.co { color: grey; font-style: italic; } code > span.al { color: #ff0000; font-weight: bold; } code > span.fu { color: #dd7522; } code > span.ot { color: #007020; } code > span.er { color: #ff0000; font-weight: bold; } @media only screen and (min-width: 480px) { body{font-size:14px;} } @media only screen and (min-width: 768px) { body{font-size:16px;} } @media print { #TOC { display:none; } * { background: transparent !important; color: black !important; filter:none !important; -ms-filter: none !important; } body{font-size:12pt; max-width:100%;} a, a:visited { text-decoration: none; } hr { height: 1px; border:0; border-bottom:1px solid black; } a[href]:after { content: " (" attr(href) ")"; } abbr[title]:after { content: " (" attr(title) ")"; } .ir a:after, a[href^="javascript:"]:after, a[href^="#"]:after { content: ""; } pre, blockquote { border: 1px solid #999; padding-right: 1em; page-break-inside: avoid; } pre { font-size: 0.8em; } tr, img { page-break-inside: avoid; } img { max-width: 100% !important; } @page :left { margin: 15mm 20mm 15mm 10mm; } @page :right { margin: 15mm 10mm 15mm 20mm; } p, h2, h3 { orphans: 3; widows: 3; } h2, h3 { page-break-after: avoid; } } </style> </head> <body> <div id="TOC"> <ul> <li><a href="#scraping-headlines-using-stew">Scraping Headlines Using Stew</a><ul> <li><a href="#importing-the-library">Importing the Library</a></li> <li><a href="#setting-up-the-http-fetcher">Setting up the HTTP &quot;Fetcher&quot;</a></li> <li><a href="#actual-processing">Actual processing</a></li> <li><a href="#running-this-script">Running this script</a></li> </ul></li> </ul> </div> <h1 id="scraping-headlines-using-stew"><a href="#TOC">Scraping Headlines Using Stew</a></h1> <p><em>This is a complete (but simple) example of 
using <a href="https://github.com/rodw/stew">Stew</a> to extract content from the web. It is written as a &quot;litcoffee&quot; file, which is an executable/compilable file containing Markdown content with embedded CoffeeScript. (<a href="../README.html">Follow this link to go back to the README file.</a>)</em></p>
<p>In this example, we'll extract headlines from the venerable social-tech-news site <a href="http://slashdot.org/">Slashdot</a>.</p>
<pre><code>URL = &#39;http://slashdot.org/&#39;</code></pre>
<p>If you examine the HTML of the Slashdot homepage carefully, you'll find that each headline is contained in an <code>h2</code> tag with the class <code>story</code>, and that within this heading there is an anchor (<code>a</code>) tag that contains the link. As a CSS selector, that looks like:</p>
<pre><code>SELECTOR = &#39;h2.story a&#39;</code></pre>
<p>We'll use that selector to extract the headlines and links from the HTML and print them to the console with the following function:</p>
<pre><code>print_headline = (node)-&gt;
  headline = domutil.to_text(node)
  link = &quot;http:#{node.attribs.href}&quot;
  console.log &quot;#{headline} &lt;#{link}&gt;&quot;</code></pre>
<p>(<code>domutil</code> is an instance of Stew's <code>DOMUtil</code> type, which is imported below.)</p>
<p>Now, given an <code>html</code> string, selecting and printing the headlines is as simple as this:</p>
<pre><code>select_and_print_headlines = (html)-&gt;
  stew.select html, SELECTOR, (err,nodeset)-&gt;
    for node in nodeset
      print_headline node</code></pre>
<p>That's really all there is to it. 
All of the Stew-specific code is found above.</p>
<p>The rest of this file jumps through the hoops needed to download the HTML document from the web.</p>
<h2 id="importing-the-library"><a href="#TOC">Importing the Library</a></h2>
<p>When using Stew, you'll typically import the library using something like this:</p>
<pre><code># This is what you&#39;ll typically do:
# stew = new (require(&#39;stew-select&#39;)).Stew()
# and/or
# domutil = new (require(&#39;stew-select&#39;)).DOMUtil()</code></pre>
<p>but since this file is found <em>within</em> the Stew repository itself, we'll do things a little differently. Most readers can safely ignore these next few lines and use the simple <code>require</code> statement above instead.</p>
<pre><code># You WON&#39;T do the following. We&#39;re only doing it here because we
# want to use the &quot;local&quot; implementation of Stew.
fs = require &#39;fs&#39;
path = require &#39;path&#39;
HOMEDIR = path.join(__dirname,&#39;..&#39;)
LIB_COV_DIR = path.join(HOMEDIR,&#39;lib-cov&#39;)
LIB_DIR = if fs.existsSync(LIB_COV_DIR) then LIB_COV_DIR else path.join(HOMEDIR,&#39;lib&#39;)
stew = new (require(path.join(LIB_DIR,&#39;stew&#39;))).Stew()
domutil = new (require(path.join(LIB_DIR,&#39;stew&#39;))).DOMUtil()</code></pre>
<h2 id="setting-up-the-http-fetcher"><a href="#TOC">Setting up the HTTP &quot;Fetcher&quot;</a></h2>
<p>Let's define a function that will fetch a web page and pass the resulting content to a callback function. 
We'll use the Node.js <code>http</code> library for this.</p>
<pre><code>http = require &#39;http&#39;</code></pre>
<p>Our function will accept the <code>url</code> for the document to download and a <code>callback</code> function to invoke once the document has been downloaded.</p>
<p>Following Node.js convention, we'll use the signature <code>callback(err,body)</code> for the callback function.</p>
<pre><code>fetch = (url,callback)-&gt;</code></pre>
<p>Using <code>http</code>, we'll create a callback function to buffer the HTTP response:</p>
<pre><code>  http_callback = (response)-&gt;
    unless 200 &lt;= response.statusCode &lt;= 299
      callback &quot;Unexpected status code #{response.statusCode}&quot;
    else
      buffer = &quot;&quot;
      response.setEncoding &#39;utf8&#39;
      response.on &#39;data&#39;, (chunk)-&gt;buffer += chunk</code></pre>
<p>and, when the full response body has been received, pass it to the callback:</p>
<pre><code>      response.on &#39;end&#39;, ()-&gt;
        callback(null,buffer)</code></pre>
<p>Finally, we can trigger the actual request:</p>
<pre><code>  http.get(url, http_callback).on(&#39;error&#39;, callback)</code></pre>
<p>Now our <code>fetch</code> function will download content from the URL and pass it to a callback function.</p>
<h2 id="actual-processing"><a href="#TOC">Actual processing</a></h2>
<p>Now we can fetch the document and print the result using our <code>select_and_print_headlines</code> function:</p>
<pre><code>fetch URL, (err,body)-&gt;
  if err?
    console.error &quot;Error:&quot;, err
  else
    console.log &#39;-----------------------------------------&#39;
    console.log &quot;CURRENT HEADLINES AT #{URL}&quot;
    console.log &#39;-----------------------------------------&#39;
    select_and_print_headlines body
    console.log &#39;-----------------------------------------&#39;</code></pre>
<h2 id="running-this-script"><a href="#TOC">Running this script</a></h2>
<p>Now we can run this script by typing:</p>
<pre class="console"><code>coffee docs/example.litcoffee</code></pre>
<p>and see output like the following:</p>
<pre class="console"><code>-----------------------------------------
CURRENT HEADLINES AT http://slashdot.org/
-----------------------------------------
DRM: How Book Publishers Failed To Learn From the Music Industry &lt;http://news.slashdot.org/story/13/05/31/2045211/drm-how-book-publishers-failed-to-learn-from-the-music-industry&gt;
Small Black Holes: Cloudy With a Chance of Better Visibility &lt;http://science.slashdot.org/story/13/05/31/214224/small-black-holes-cloudy-with-a-chance-of-better-visibility&gt;
No, the Tesla Model S Doesn&#39;t Pollute More Than an SUV &lt;http://tech.slashdot.org/story/13/05/31/1955214/no-the-tesla-model-s-doesnt-pollute-more-than-an-suv&gt;
The Case For a Government Bug Bounty Program &lt;http://it.slashdot.org/story/13/05/31/1933231/the-case-for-a-government-bug-bounty-program&gt;
When Smart Developers Generate Crappy Code &lt;http://developers.slashdot.org/story/13/05/31/1854203/when-smart-developers-generate-crappy-code&gt;
New York City Wants To Revive Old Voting Machines &lt;http://tech.slashdot.org/story/13/05/31/1748201/new-york-city-wants-to-revive-old-voting-machines&gt;
Big Asteroid (With Its Own Moon) To Have Closest Approach With Earth Today &lt;http://science.slashdot.org/story/13/05/31/1727256/big-asteroid-with-its-own-moon-to-have-closest-approach-with-earth-today&gt;
Google Maps Used To Find Tax Cheats &lt;http://tech.slashdot.org/story/13/05/31/1721232/google-maps-used-to-find-tax-cheats&gt;
Judge Orders Google To Comply With FBI&#39;s Warrantless NSL Requests &lt;http://yro.slashdot.org/story/13/05/31/1633209/judge-orders-google-to-comply-with-fbis-warrantless-nsl-requests&gt;
Ask Slashdot: How Important Is Advanced Math In a CS Degree? &lt;http://ask.slashdot.org/story/13/05/31/1546253/ask-slashdot-how-important-is-advanced-math-in-a-cs-degree&gt;
Badgers Block British Broadband Buildout &lt;http://news.slashdot.org/story/13/05/31/1530227/badgers-block-british-broadband-buildout&gt;
Confirmed: Water Once Flowed On Mars &lt;http://science.slashdot.org/story/13/05/31/1523245/confirmed-water-once-flowed-on-mars&gt;
Motorola Developing Pill and Tattoo Authentication Methods &lt;http://it.slashdot.org/story/13/05/31/1414210/motorola-developing-pill-and-tattoo-authentication-methods&gt;
Seeing Atomic Bonds Before and After Reactions &lt;http://science.slashdot.org/story/13/05/31/1353241/seeing-atomic-bonds-before-and-after-reactions&gt;
U.S. Authorizes Sales of American Communication Tech To Iran &lt;http://news.slashdot.org/story/13/05/31/145229/us-authorizes-sales-of-american-communication-tech-to-iran&gt;
-----------------------------------------</code></pre>
<p><em>(<a href="../README.html">Follow this link to go back to the README file.</a>)</em></p>
</body>
</html>